Build a Smarter Search Engine

Create a search engine, using Java and case-based reasoning, that produces the best match rather than the exact match

by Baylor Wetzel
A man walks into a car dealership and
tells the salesperson about his perfect car. It would be red, have
four doors, a large trunk, 300 horsepower, side airbags, four-wheel
drive, and would cost $20,000. Knowing his perfect car probably
doesn't exist, he asks the dealer to show him the car that is
closest to what he described. It doesn't have to be an exact match;
it just has to be close.
If the customer had asked to see all cars that had more than 275
horsepower and cost less than $22,000, we could have used a simple
SQL statement to search through the car database. Had we found no
cars, we'd have asked him to change his search criteria. Had we
found a dozen cars, we would have handed him the list, perhaps
sorted by price or horsepower, and asked him to find the one he
liked best. But, being a typical human, he didn't ask for a set of
acceptable items; he asked for the "best" item, the closest match.
How we'd do that in SQL isn't exactly obvious.
Case-based reasoning (CBR) is a field of artificial intelligence
(AI) that is concerned with the best match rather than the exact
match. In this field, expert systems reason by memory rather than
rules. As an example, suppose we want to examine a patient's blood
for disease. In a typical expert system, the data is run through a
set of rules and a diagnosis is made. In a CBR expert system, the
system searches its memory (the case base) to see if it has seen
this situation before. If it finds a similar situation, it uses that
situation's diagnosis.
Many argue that human thought, from diagnosing blood to playing
chess to driving a car, is based more on recognizing patterns than
on memorizing rules. If true, CBR seems more human-like than a
rule-based system. But that (and other CBR issues such as adaptation
rules) is for the AI people to worry about. This article isn't about
expert systems or cognitive modeling. It's about using Java to write
a search engine that uses CBR to return best matches rather than
exact matches.
Using Similarity-Based Matching

Suppose we
want to find all items that match at least seven of ten criteria.
With similarity-based matching, we can find items that match most of
our search criteria, but not all of them. Suppose that some of our
data is incomplete—some of the fields we use in our search criteria
were left blank. With similarity-based matching, we can find items
even when data is missing. What if we want to see all items that are
around a certain price? With similarity-based matching, we can find
items that are close to what we want without having to specify a
hard cutoff. What if we want to sort our items by a combination of
attributes rather than on a column-by-column basis? Similarity-based
matching makes it easy to rank items. And suppose certain attributes
are more important than others. With similarity-based matching, we
can assign weights to each attribute, allowing some data attributes
to be more important than others.
I'm sure you can think of plenty of examples of where this would
and would not be useful. You obviously wouldn't use similarity-based
matching if you knew exactly what you were looking for and had that
item's "key"—a part number, a social security number, an order
number, and so on. Similarity-based matching is meant to be used
when you don't know what options you have, or when you want to have
your items ranked.
So when would you use similarity-based searching? Using
similarity-based searching, if you were searching for a plane
flight, you could ask to see all flights that left around 10:00 on
Friday, cost around $500 (the cheaper the better) and, if possible,
had no layovers. If you wanted a restaurant recommendation, you
could ask for a medium-priced restaurant that got good
reviews—French and Italian are good (you're not in the mood for
Mexican), but Chinese or Thai would be great. When it's time for a
movie, you could say that you're in the mood for a suspense or
action movie, maybe something directed by John Woo or Robert
Rodriguez, movies with the CIA or some other spy agency would be
nice, and perhaps starring Bruce Willis, Chow Yun Fat, or Jet
Li.
Real-world systems have used case-based reasoning to help lawyers
find cases, TV viewers find interesting TV shows, customers find the
perfect computer, help desk workers find answers (by far the most
common application), technicians configure x-ray machines, credit
card companies find good credit risks, auditors find fraud,
mechanics maintain aircraft, and music lovers find songs.
Which leads us to a short aside: As with SQL, you can search on
only those fields that you define. Often, the trick to a good search
or expert system is in the feature extraction, not the search
algorithm. A good example is the University of California at
Berkeley's Ninja Jukebox (www.cs.berkeley.edu/~mdw/proj/jukebox).
Several graduate students built a search engine that allows you to
find songs that sound similar to a specified song. The search part
of the problem (a brute force k-nearest neighbor algorithm, which
we'll talk about in a minute) was simple. The tricky part was in
figuring out how to describe a song so that you could search on
it.
The group wrote a series of automated feature extractors that
take an MP3 and convert it to 1,248 feature dimensions such as
frequency, amplitude, and tempo. That's important because,
obviously, the search is only as good as the database description of
each song, along with the proper weighting of each attribute. CBR
doesn't solve the feature extraction problem nor does it by itself
solve the weighting issue—you're still responsible for those parts.
If you want to create a great search engine, start with good data.
As they say, garbage in, garbage out.
Nearest Neighbor Algorithm

The class of
selection algorithms used by most CBR products is called nearest
neighbor. Consider an item that has two attributes, price and
performance. You could plot the items on graph paper. On the bottom
(x-axis) might be price, say from 1-10 to make it easy, while the
side (y-axis) would have performance numbers (again, for simplicity,
let's make the range 1-10). You want to find the item that has the
highest performance (10) and lowest cost (1). On the graph, what
you're searching for would be plotted at (1,10).
So far so good. Unfortunately, no item has a price of 1 and a
performance of 10. What you do have are items like item A with a
price of 4 and performance of 4. On a graph, we'd plot that,
obviously, at (4,4). We also have items B at (7,7), C at (10,9) and
D at (1,6). If we plotted them on the chart, it might look like Figure 1.
We want to find the item that is closest to our ideal item at
(1,10). So, which item is closest to what we're looking for? If you
look at the graph, D probably looks the closest. Which is easy for
us to say—we have eyes. But how does the computer know which is
closest?
It's time to recall high school geometry and the infamous
Pythagorean Theorem. Remember that one? It says that
a² + b² = c². It's used to calculate the length of
the longest side of a right triangle. So let's take item A. Draw an
imaginary triangle with the hypotenuse (the diagonal side) from
Target to A. We draw a straight line down from Target and left from
A. Where they intersect, (1,4), is the right angle portion of our
triangle. So now we have a right triangle with the points (1,10),
(4,4) and (1,4). Using some simple math, we know that the length of
side A (the vertical side) of our triangle is 6 (10 – 4), while side
B (the horizontal side) is 3 (4 – 1). Side C is the diagonal line
between Target and A. Using the Pythagorean Theorem, the length of
side C is:
a = 10 – 4 = 6
b = 4 – 1 = 3
a² + b² = c²
6² + 3² = c²
36 + 9 = c²
c² = 45
c = 6.71
We get the same answer if we skip the phantom third point (1,4)
and just subtract the two remaining points, which means (1,10) –
(4,4). The calculations look like:
a = 10 – 4 = 6
b = 1 – 4 = –3
a² + b² = c²
6² + (–3)² = c²
36 + 9 = c²
c² = 45
c = 6.71
So now we know that item A has a distance of 6.71. If we run the
other items through the same process, B's distance is 6.71, C's
distance is 9.06, and D's distance is 4. Because D has the shortest
distance, it is the most similar and thus our top choice.
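The distance arithmetic above takes only a few lines of Java. This is just a sketch of the brute-force step (the class and method names are mine, not from the Selection Engine); the data points are the article's target and items A through D:

```java
public class NearestNeighborDemo {
    // Euclidean distance between points (x1,y1) and (x2,y2),
    // straight from the Pythagorean Theorem
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        double[][] items = { {4, 4}, {7, 7}, {10, 9}, {1, 6} }; // A, B, C, D
        char name = 'A';
        for (double[] item : items) {
            // distance from the ideal item at (1,10)
            System.out.printf("%c: %.2f%n", name++,
                distance(1, 10, item[0], item[1]));
        }
    }
}
```

Running it reproduces the numbers above: A and B at 6.71, C at 9.06, and D closest at 4.00.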
What does a distance of 4 mean? Nothing—the range of numbers (the
maximum distance) changes with each problem. So how do we convert it
to a percentage? First, figure out the maximum distance. With ranges
of 1–10 for performance and 1–10 for price, the maximum distance is
from (10,10) to (1,1). Again using the Pythagorean Theorem, we come
up with a distance of 12.73. So how similar is item D and our target
item? The amount of difference is (4 / 12.73) which is 0.31 and
because similarity is just one minus that (1 – 0.31), the degree of
similarity is 0.69, or 69%. A's similarity is 47%, B is also 47%,
and C is 29% similar.
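The distance-to-percentage conversion is equally small. A minimal sketch (names mine), using the 1-10 ranges from the example:

```java
public class SimilarityPercent {
    // 1.0 means identical, 0.0 means maximally far apart
    static double similarity(double distance, double maxDistance) {
        return 1.0 - (distance / maxDistance);
    }

    public static void main(String[] args) {
        // maximum possible distance runs from (1,1) to (10,10)
        double maxDistance = Math.sqrt(9 * 9 + 9 * 9); // about 12.73
        System.out.printf("D: %.0f%%%n", similarity(4.0, maxDistance) * 100);
        System.out.printf("C: %.0f%%%n", similarity(9.06, maxDistance) * 100);
    }
}
```

This prints 69% for D and 29% for C, matching the figures above.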
That was an easy example. With only two attributes, we could use
the Pythagorean Theorem just as we remember it from our early school
days. But what happens when an item has numerous attributes?
Short answer: Rather than use a triangle with two lengths, A and
B, use some multisided object with however many sides you need. Or,
more simply, we can skip the whole diagram and just
subtract all our points from one another. Say, for example, that our
items have price, performance, reliability, and size, again using
the simple 1–10 ranges. We want to find an item of price 1,
performance 10, reliability 10, and size 1, which as a graph point
would be (1,10,10,1). Item A might be (4,4,4,4), B might be
(7,7,2,8), C might be (10,9,6,10), and D might be (1,6,8,3). I won't
show a graph here because I'm not very good at drawing
four-dimensional objects.
What is the distance of A from what we're looking for? Running
through the calculations, we get:
(1,10,10,1) – (4,4,4,4) = (–3, 6, 6, –3)
a² + b² + c² + d² = e²
(–3)² + 6² + 6² + (–3)² = e²
9 + 36 + 36 + 9 = e²
e² = 90
e = 9.49
Since a quick calculation shows that the maximum distance is now
18, we know that item A is 47% similar to what we're looking
for.
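In code, the generalization is just a loop: subtract coordinate by coordinate, square, sum, and take the square root. A sketch (names mine) using the four-attribute example:

```java
public class MultiDistance {
    // Euclidean distance in any number of dimensions
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double delta = a[i] - b[i];
            sum += delta * delta;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] target = {1, 10, 10, 1}; // price, performance, reliability, size
        double[] itemA  = {4, 4, 4, 4};
        double d = distance(target, itemA); // sqrt(90), about 9.49
        double max = distance(new double[]{10, 10, 10, 10},
                              new double[]{1, 1, 1, 1});   // 18
        System.out.printf("%.2f, %.0f%% similar%n", d, (1 - d / max) * 100);
    }
}
```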
Now let's talk about our last little algorithm bit—weighting.
Suppose that, in our example, performance is more important than
price. We can invent a scale of one to five, where five means really
important. Let's say we weight performance highest (weight = 5),
price high (weight = 4), reliability low (weight = 2), and size
lowest (weight = 1). To use these values, multiply the initial
numbers, which for our perfect item was (1,10,10,1), by their
weights, which here are (5,4,2,1). That's it. Using simple
multiplication, we end up with a new, weighted scale where our
perfect item is located at (5,40,20,1) and item A moves from its
previous (4,4,4,4) to the new weighted position of (20,16,8,4). To
see how near a neighbor item A is now, let's compute the
distance:
Target position = (1,10,10,1) * (5,4,2,1) = (5,40,20,1)
Item A position = (4,4,4,4) * (5,4,2,1) = (20,16,8,4)
(5,40,20,1) – (20,16,8,4) = (–15, 24, 12, –3)
a² + b² + c² + d² = e²
(–15)² + 24² + 12² + (–3)² = e²
225 + 576 + 144 + 9 = e²
e² = 954
e = 30.89
We calculate the maximum distance as before, this time using
weights:
Max value = (10,10,10,10) * (5,4,2,1) = (50,40,20,10)
Min value = (1,1,1,1) * (5,4,2,1) = (5,4,2,1)
(50,40,20,10) – (5,4,2,1) = (45, 36, 18, 9)
a² + b² + c² + d² = e²
45² + 36² + 18² + 9² = e²
2025 + 1296 + 324 + 81 = e²
e² = 3726
e = 61.04
That means that the similarity of item A to our target item is (1
– (30.89 / 61.04)) = 49%. As a note, things haven't changed much—A
was a 47% match in our two-attribute example, a 47% match in our
four-attribute example, and now a 49% match in our weighted
four-attribute example.
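In code, weighting is just an element-wise multiply applied before the distance calculation. A sketch (names mine) reproducing the weighted numbers above:

```java
public class WeightedDistance {
    // multiply each coordinate by its attribute's weight
    static double[] weight(double[] point, double[] weights) {
        double[] out = new double[point.length];
        for (int i = 0; i < point.length; i++)
            out[i] = point[i] * weights[i];
        return out;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] w      = {5, 4, 2, 1};
        double[] target = weight(new double[]{1, 10, 10, 1}, w);  // (5,40,20,1)
        double[] itemA  = weight(new double[]{4, 4, 4, 4}, w);    // (20,16,8,4)
        double d   = distance(target, itemA);                     // about 30.89
        double max = distance(weight(new double[]{10, 10, 10, 10}, w),
                              weight(new double[]{1, 1, 1, 1}, w)); // about 61.04
        System.out.printf("%.0f%% similar%n", (1 - d / max) * 100);
    }
}
```

This prints 49%, the same figure as the worked arithmetic.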
Enough with all the addition and multiplication. Let's take a
slight detour and talk about nearest neighbor terminology.
If, on every request, you manually compute the distances and
return the best match, that's a brute force nearest neighbor search.
A k-nearest neighbor search returns the k closest items, meaning
that if you ask for the top five matches, you get the top five
matches (k = 5). The whole brute force thing is pretty simple to
understand and implement, but you are doing a fair number of
calculations on each call. I believe that many, if not most, CBR
researchers are spending their time on performance tuning. The most
common, as well as obvious, approach is to precalculate all of the
distances and cache those in memory. This approach trades
flexibility for performance.
Why does this limit our flexibility? The precalculation routine
normally assumes that all attributes should be used in computing
distance and that all attributes carry the same weight. You could
tell the precalculation routine to use a predefined subset of
attributes and define weights for each field, but those decisions
are frozen at the time the similarity relationships are
precalculated. In other words, you can't change the attributes and
weightings on the fly. That's perfectly okay for some applications
and a problem for others. Let's build a user-driven search engine,
and because the user is in control of the attributes and weightings,
we're going to skip building a precalculation routine and do
everything dynamically.
Implementing CBR

Before we start looking at
code, I want to talk a little bit about an open source project named
the Selection Engine.
Once upon a time, I needed a CBR tool to help with a couple of
applications. I couldn't find a freeware or shareware tool I liked
and couldn't easily get my hands on a demo of a commercial product.
After a little research, I realized that CBR wasn't all that complex
and that I could write a tool to do what I needed fairly quickly.
Two weeks later, I had everything I needed. For me, that meant an
engine that could do standard data filtering (à la SQL), compute
similarity, compute percent similarity, handle weightings, sort data
by percent similarity, handle searches on subsets of attributes, be
generic enough to handle any arbitrary data sent to it, easily
integrate into larger applications, work with Java, and be
stateless. And because I knew from the beginning that I wanted to
share this code with others, I wanted the engine to work with
whatever environment other people had, which led me to create
generic data loading and display managers. By default, the Selection
Engine reads the data, metadata, and search query from text files
and sends the output to stdout, which should make it usable by
anyone. Most likely the person using the engine will replace those
pieces, perhaps replacing the text file loader with something that
reads from an Oracle database or VSAM file and perhaps replacing the
display manager with something that generates HTML or creates a GUI
using Swing. I leave the customization of the I/O to the reader.
The implementation of a CBR system that I'm talking about here
comes from the Selection Engine. But since I'm talking about CBR, I
won't spend a lot of time discussing the parts of the Selection
Engine that aren't nearest-neighbor related (the filter engine, the
file parsing routines, the stdout-oriented display manager, and the
engine's object model). If you want to know more about those pieces,
or if you want to help improve the tool, feel free to visit the
engine's Web page at http://selectionengine.sourceforge.net/.
Let's talk about the basic pieces of our CBR engine. The most
obvious part, of course, is the Java class named SimilarityEngine,
which, as you probably guessed, computes the similarity of a
collection of items to a specified target.
SimilarityEngine relies on a handful of objects (see Figure 2). The call to
SimilarityEngine's main method, computeSimilarity(), takes the
object's Items, SimilarityCriteria, and SimilarityWeights. Each Item
has a collection of Traits, and the Items collection has a link to
the metadata collection TraitDescriptors.
Items is a collection of Item objects. It's basically a standard
collection. An Item is generic and represents anything you want to
be able to search for—a product, a song, an image, a decision, a
customer, and so on. Because Item is supposed to be able to
represent any type of item, its attributes are defined at run time
rather than compile time. Specifically, Item contains a Traits
collection, which, obviously, contains several Trait objects. A
Trait object is basically just a name/value pair. A separate object,
TraitDescriptors, plays the metadata role. It is a collection of
TraitDescriptor objects. There is one TraitDescriptor for each trait
an Item object will have. A TraitDescriptor, like the Trait class,
is basically just a name/value pair. In this case, value is the data
type of the attribute. The Selection Engine recognizes integers,
floating-point numbers, strings, and booleans.
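To make the name/value idea concrete, here is a bare-bones sketch of the Trait model just described. The class names follow the article, but the implementation is mine; the real Selection Engine classes carry more machinery than this:

```java
import java.util.ArrayList;
import java.util.List;

public class Item {
    // a Trait is just a name/value pair
    record Trait(String name, String value) {}

    private final List<Trait> traits = new ArrayList<>();

    // attributes are attached at run time, not compile time
    void add(String name, String value) {
        traits.add(new Trait(name, value));
    }

    // look a trait up by name, case-insensitively
    String get(String name) {
        for (Trait t : traits)
            if (t.name().equalsIgnoreCase(name)) return t.value();
        return null;
    }

    public static void main(String[] args) {
        Item laptop = new Item();
        laptop.add("Vendor", "Alienware");
        laptop.add("Price", "999");
        System.out.println(laptop.get("vendor"));
    }
}
```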
It's worth noting here that all of the items I just mentioned
(Items, Item, Traits, Trait, TraitDescriptors, and TraitDescriptor)
exist in the Selection Engine to help make the engine generic and
reusable. In your particular application, where you know at compile
time what classes, attributes, and attribute data types you'll have,
you'll probably replace Item with the specific class you want to
use. Doing so should have minimal impact on how the engine
functions, so the rest of this article should still be helpful.
So that covers the Items, the first of three arguments we pass to
SimilarityEngine's computeSimilarity() method. The collection should
contain every item that we want to compute similarity on. In the
Selection Engine, the items are first run through an object called
FilterEngine, which filters out items we don't care about. As an
example, if the user doesn't want to see any items that cost more
than $100, the FilterEngine can get rid of those choices before they
are passed to the SimilarityEngine. In practice, most people will
probably use SQL to filter out obviously unsuitable items before
they are scored by the SimilarityEngine.
The other two arguments passed to computeSimilarity() are
SimilarityCriteria and SimilarityWeights. SimilarityCriteria is a
collection of SimilarityCriterion objects. A SimilarityCriterion
object describes a simple relationship made up of an attribute, an
operator, and a value. The Selection Engine recognizes three
operators. ~ means "around," % means "prefer," and !% means "try to
avoid." The first is used with numbers, the latter two with strings
and booleans. The distinction is that numbers have a degree of
similarity while strings and booleans in the Selection Engine either
match or don't. Let's do a few quick examples. For the similarity
criterion "spiciness ~ 10", the value 7 might be a 70% match. For
the similarity criterion "taste % 'spicy'", a value of sweet would
be a 0% match. Although the Selection Engine is case insensitive, it
does not do degree of similarity calculations on strings or use
similarity-oriented matching algorithms such as Soundex, Metaphone,
or Levenshtein.
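A sketch of how the string operators might score (the actual Selection Engine code may differ): % is all-or-nothing and case-insensitive, and !% simply inverts it:

```java
public class StringMatch {
    // "%" operator: returns 1.0 for a match, 0.0 otherwise
    static double prefer(String actual, String wanted) {
        return actual.equalsIgnoreCase(wanted) ? 1.0 : 0.0;
    }

    // "!%" operator: the inverse of prefer
    static double avoid(String actual, String unwanted) {
        return 1.0 - prefer(actual, unwanted);
    }

    public static void main(String[] args) {
        System.out.println(prefer("sweet", "spicy")); // a 0% match
        System.out.println(prefer("SPICY", "spicy")); // case-insensitive match
        System.out.println(avoid("HP", "HP"));        // penalized vendor
    }
}
```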
At this point, some of you might be asking what the difference is
between "taste % 'spicy'" and "taste = 'spicy'". In the Selection
Engine, % is used by the SimilarityEngine while = is used by the
FilterEngine. If an item fails the "taste = 'spicy'" criterion, it
is rejected and will not be passed to the SimilarityEngine. If an
item fails the "taste % 'spicy'" criterion, the degree of similarity
is lowered, but the item still receives a similarity score. This
applies to both strings and booleans.
For numeric fields, the SimilarityEngine recognizes two special
values, [MAX_VAL] and [MIN_VAL] (again, case is unimportant). Both
are, obviously, relative values rather than absolute values. You can
do CBR without using relative values, but relative values makes CBR
maintenance easier. The SimilarityEngine translates relative numbers
into absolutes by first looping through every item passed to it so
it can determine the max and min values for each of the item's
attributes.
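Resolving the relative values amounts to one pass over the data per numeric attribute. A minimal sketch (the method name and shape are mine, not the Selection Engine's):

```java
import java.util.List;

public class RangeScan {
    // translate [MAX_VAL]/[MIN_VAL] into absolutes for one attribute
    static double resolve(String token, List<Double> values) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : values) {          // one pass finds both bounds
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (token.equalsIgnoreCase("[MAX_VAL]")) return max;
        if (token.equalsIgnoreCase("[MIN_VAL]")) return min;
        return Double.parseDouble(token);  // already an absolute value
    }

    public static void main(String[] args) {
        List<Double> prices = List.of(799.0, 1200.0, 950.0);
        System.out.println(resolve("[min_val]", prices)); // case-insensitive
    }
}
```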
The third object passed to computeSimilarity() is
SimilarityWeights, which is a collection of SimilarityWeight
objects. SimilarityWeight is yet another name/value pair, where name
is the name of an attribute and value is its weight. Weight can be
any integer value. The weights are not percentages and do not need
to add up to 1 or 100. By default, all attributes have a weight of
1. Remember that weights affect only similarity operators (~, %,
!%), not filter operators (=, !=, <, >, <=, >=).
To review, let's look at a sample query. We visit an Internet
retailer to buy a computer. We like Alienware computers, but would
rather not buy an HP and absolutely will not buy a Dell. The price
should be low, and we absolutely cannot spend more than $1,000. We
really, really want a fast computer and a big hard drive. And a DVD
player would be nice. How would we write that as selection criteria?
Using the Selection Engine's query format, which is pipe delimited
and uses c and w to signify constraints and weights, the query might
look something like this:

c | Vendor | % | Alienware
c | Vendor | !% | HP
c | Vendor | != | Dell
w | Vendor | 1
c | Price | ~ | [MIN_VAL]
c | Price | <= | 1000
w | Price | 1
c | HD | ~ | [MAX_VAL]
w | HD | 4
c | DVD | % | TRUE
w | DVD | 1
c | cpu_benchmark | ~ | [MAX_VAL]
w | cpu_benchmark | 5
Note that, to search for fast computers, we search against a
benchmark score (cpu_benchmark) rather than a CPU name (for example,
Intel PIII 800) or megahertz rating (for example, 800). That's
because a benchmark might accurately represent system performance,
but the name of the CPU or MHz rating most likely will not. Although
this has nothing to do with the technical aspects of the nearest
neighbor algorithm, I mention it here because, as always, garbage
in, garbage out. In my experience, poor design (database and object
modeling problems) is responsible for far more problems than
technological issues.
So we know we pass Items, SimilarityCriteria, and
SimilarityWeights into the computeSimilarity() method. Let's talk
about how the computeSimilarity() method works. A pseudo code
representation is shown in Listing 1. The source code for
computeSimilarity() is shown in Listing 2.
First, we need some information on our data set. Read through
each item and determine the max and min values for each numeric
attribute. We also need the range for each attribute, but we can get
that by subtracting the smallest value from the largest (Listing 3).
Next we need to turn our similarity query into something (a
series of points) that we could, in theory, plot on a graph. We do
that by normalizing each numeric value in the search criteria.
Normalizing means that we want to express the value as a percentage
(0 to 1) of the possible values. We do that by subtracting the
minimum value possible (which we found when we calculated our data
set statistics back in step one) and then dividing it by the range
of possible values (the maximum value minus the minimum value). As
an example, if our target price was $2,000 and the prices ranged
from $1,000 to $3,000, the normalized value of our target price
would be (2,000 – 1,000) / (3,000 – 1,000) = 0.5. After we get the
normalized value, we convert it to a weighted value by multiplying
each normalized value by the weight for that attribute (see Listing 4).
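The normalize-then-weight step works out to two lines of arithmetic. A sketch of the price example (Listing 4 does the equivalent work in the Selection Engine itself):

```java
public class Normalize {
    // express a value as a 0..1 fraction of the data set's range
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        double n = normalize(2000, 1000, 3000); // the $2,000 target -> 0.5
        double weighted = n * 4;                // then apply a price weight of 4
        System.out.println(n + " " + weighted);
    }
}
```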
Next we calculate the maximum distance possible for any item.
We'll need this number to calculate percent similarity later. We
know that the normalized value of the maximum value is always 1, so
the weighted values are just the weights. We convert these points to
a distance by squaring each point, summing them and then taking the
square root, just like the good old Pythagorean Theorem (see Listing 5).
That completes our prep work. Now it's time to rate each item. As
with our search criteria, we start by normalizing and weighting the
numeric values (see Listing
6). We then compute the distance by subtracting each of the
item's attributes from the target values (which gives us a delta),
squaring the deltas, summing them, and then taking the square root
of the sum. Dividing the distance by the maximum distance gives us
the percent difference, and subtracting that from 1 gives us the
percent similarity (see Listing
7).
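Pulling those steps together, the whole rating pass can be sketched in one method: normalize, weight, compute the distance, and divide by the maximum distance. This is a self-contained summary of the algorithm the listings describe, not the Selection Engine's actual source:

```java
public class SimilarityDemo {
    // percent similarity of one item to the target, given the data
    // set's per-attribute min/max values and the attribute weights
    static double similarity(double[] item, double[] target,
                             double[] min, double[] max, double[] weights) {
        double sumSq = 0, maxSq = 0;
        for (int i = 0; i < item.length; i++) {
            double range = max[i] - min[i];
            double nItem   = (item[i]   - min[i]) / range; // normalize to 0..1
            double nTarget = (target[i] - min[i]) / range;
            double delta = (nItem - nTarget) * weights[i]; // weighted delta
            sumSq += delta * delta;
            maxSq += weights[i] * weights[i]; // max normalized delta is 1 * weight
        }
        return 1.0 - Math.sqrt(sumSq) / Math.sqrt(maxSq);
    }

    public static void main(String[] args) {
        // the price/performance example: target (1,10), item D at (1,6)
        double[] min = {1, 1}, max = {10, 10}, w = {1, 1};
        double s = similarity(new double[]{1, 6}, new double[]{1, 10},
                              min, max, w);
        System.out.printf("D: %.0f%%%n", s * 100);
    }
}
```

Run against item D from the two-attribute example, it prints 69%, matching the hand calculation earlier in the article.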
Once upon a time I was the AI specialist for a large consumer
electronics retailer. I spent a lot of time at AI conferences, which
are typically filled with academics and military researchers. I was
a popular curiosity at those conferences, with the most common
question asked of me being why an IT person from a retail company
would care about AI. My response was that AI is more valuable to the
business world than to the academics and engineers. I've also known
a lot of business programmers who've asked why I waste my time
studying AI when everyone knows that only scientists can and should
use that stuff. My reply is that the old classics are just as
valuable, if not more so, than memorizing the latest APIs and,
besides, this stuff is easy!
I know many people are intimidated by computer science books and
their attendant algorithms and data structures. I'm no big fan of
computer science books either, with their examples that aren't
relevant to my work life and their formulas filled with Greek
letters and lots of funny looking math symbols. But if you can get
past the bad writing, this computer science stuff turns out to be
really useful and really easy! CBR, as one example, can be used in
all sorts of personal and business applications—managing recipes,
selling cars, searching for real estate, making video games, picking
home loans, recommending music, and more. Other fields of computer
science, equally unfamiliar to the average IT developer, are just as
useful. Despite the silly names, if you think about it hard enough
you'll probably find some uses in your workplace for decision trees
(and C4.5, production rule language [PRL], and Rete), information
retrieval (and Porter stemming, resolving power, and inverted file
systems), image processing (edge tracing algorithms can be used to
process any type of data), parallelization (and similar topics such
as agents, emergent behavior, blackboards, and voting systems), run
length encoding, directed acyclic graphs, hidden Markov models, and
all sorts of other "academic" tasks. And have I mentioned that this
stuff is easy? I'm math illiterate and went to journalism school and
even I get this. Yes, okay, I needed someone to translate
all the math gobbledy-gook into liberal arts speak for me, but it
did eventually make sense. We've already seen how to build a simple
(and valuable) search engine with a little AI, and that wasn't too
hard was it?
So here's what I ask. Think back on your computer science
schooling or go pick up a book on algorithms (no, not a
language-specific one but an honest-to-goodness theory book) and see
if you can't find something useful in there. And if you do, talk
some teammates into trying it out. Hey, while you're at it, why not
think of a simple way to explain it and then write an article? Many
of us would be happy to read it.
About the Author

Baylor Wetzel
is a novelist, short story writer, and movie maker specializing in
satire, parody, and ethnic literature. In his spare time, he works a
day job consulting in architecture, artificial intelligence, object
modeling, and all sorts of process stuff. His attempts at schooling
include undergraduate stints at the University of Texas schools of
journalism and business and he continues to do grad work at the
University of St Thomas' AI program. Baylor has written for several
computer magazines and is certified in Java and Microsoft, but he
would gladly trade it all for a chance to meet Buffy the Vampire
Slayer's Willow Rosenberg. You can reach Baylor at baylor@ihatebaylor.com.