Build a Smarter Search Engine

Create a search engine, using Java and case-based reasoning, that produces the best match rather than the exact match

by Baylor Wetzel
A man walks into a car dealership and
tells the salesperson about his perfect car. It would be red, have
four doors, a large trunk, 300 horsepower, side airbags, four-wheel
drive, and would cost $20,000. Knowing his perfect car probably
doesn't exist, he asks the dealer to show him the car that is
closest to what he described. It doesn't have to be an exact match;
it just has to be close.
If the customer had asked to see all cars that had more than 275
horsepower and cost less than $22,000, we could have used a simple
SQL statement to search through the car database. Had we found no
cars, we'd have asked him to change his search criteria. Had we
found a dozen cars, we would have handed him the list, perhaps
sorted by price or horsepower, and asked him to find the one he
liked best. But, being a typical human, he didn't ask for a set of
acceptable items; he asked for the "best" item, the closest match.
How we'd do that in SQL isn't exactly obvious.
Case-based reasoning (CBR) is a field of artificial intelligence
(AI) that is concerned with the best match rather than the exact
match. In this field, expert systems reason by memory rather than
rules. As an example, suppose we want to examine a patient's blood
for disease. In a typical expert system, the data is run through a
set of rules and a diagnosis is made. In a CBR expert system, the
system searches its memory (the case base) to see if it has seen
this situation before. If it finds a similar situation, it uses that
situation's diagnosis.
Many argue that human thought, from diagnosing blood to playing
chess to driving a car, is based more on recognizing patterns than
on memorizing rules. If true, CBR seems more human-like than a
rule-based system. But that (and other CBR issues such as adaptation
rules) is for the AI people to worry about. This article isn't about
expert systems or cognitive modeling. It's about using Java to write
a search engine that uses CBR to return best matches rather than
exact matches.
Using Similarity-Based Matching

Suppose we
want to find all items that match at least seven of ten criteria.
With similarity-based matching, we can find items that match most of
our search criteria, but not all of them. Suppose that some of our
data is incomplete—some of the fields we use in our search criteria
were left blank. With similarity-based matching, we can find items
even when data is missing. What if we want to see all items that are
around a certain price? With similarity-based matching, we can find
items that are close to what we want without having to specify a
hard cutoff. What if we want to sort our items by a combination of
attributes rather than on a column-by-column basis? Similarity-based
matching makes it easy to rank items. And suppose certain attributes
are more important than others. With similarity-based matching, we
can assign weights to each attribute, allowing some data attributes
to be more important than others.
I'm sure you can think of plenty of examples of where this would
and would not be useful. You obviously wouldn't use similarity-based
matching if you knew exactly what you were looking for and had that
item's "key"—a part number, a social security number, an order
number, and so on. Similarity-based matching is meant to be used
when you don't know what options you have, or when you want to have
your items ranked.
So when would you use similarity-based searching? Using
similarity-based searching, if you were searching for a plane
flight, you could ask to see all flights that left around 10:00 on
Friday, cost around $500 (the cheaper the better) and, if possible,
had no layovers. If you wanted a restaurant recommendation, you
could ask for a medium-priced restaurant that got good
reviews—French and Italian are good (you're not in the mood for
Mexican), but Chinese or Thai would be great. When it's time for a
movie, you could say that you're in the mood for a suspense or
action movie, maybe something directed by John Woo or Robert
Rodriguez, movies with the CIA or some other spy agency would be
nice, and perhaps starring Bruce Willis, Chow Yun Fat, or Jet
Li.
Real-world systems have used case-based reasoning to help lawyers
find cases, TV viewers find interesting TV shows, customers find the
perfect computer, help desk workers find answers (by far the most
common application), technicians configure x-ray machines, credit
card companies find good credit risks, auditors find fraud,
mechanics maintain aircraft, and music lovers find songs.
Which leads us to a short aside: As with SQL, you can search on
only those fields that you define. Often, the trick to a good search
or expert system is in the feature extraction, not the search
algorithm. A good example is the University of California at
Berkeley's Ninja Jukebox (www.cs.berkeley.edu/~mdw/proj/jukebox).
Several graduate students built a search engine that allows you to
find songs that sound similar to a specified song. The search part
of the problem (a brute force k-nearest neighbor algorithm, which
we'll talk about in a minute) was simple. The tricky part was in
figuring out how to describe a song so that you could search on
it.
The group wrote a series of automated feature extractors that
take an MP3 and convert it to 1,248 feature dimensions such as
frequency, amplitude, and tempo. That's important because,
obviously, the search is only as good as the database description of
each song, along with the proper weighting of each attribute. CBR
doesn't solve the feature extraction problem nor does it by itself
solve the weighting issue—you're still responsible for those parts.
If you want to create a great search engine, start with good data.
As they say, garbage in, garbage out.
Nearest Neighbor Algorithm

The class of
selection algorithms used by most CBR products is called nearest
neighbor. Consider an item that has two attributes, price and
performance. You could plot the items on graph paper. On the bottom
(x-axis) might be price, say from 1-10 to make it easy, while the
side (y-axis) would have performance numbers (again, for simplicity,
let's make the range 1-10). You want to find the item that has the
highest performance (10) and lowest cost (1). On the graph, what
you're searching for would be plotted at (1,10).
So far so good. Unfortunately, no item has a price of 1 and a
performance of 10. What you do have are items like item A with a
price of 4 and performance of 4. On a graph, we'd plot that,
obviously, at (4,4). We also have items B at (7,7), C at (10,9) and
D at (1,6). If we plotted them on the chart, it might look like Figure 1.
We want to find the item that is closest to our ideal item at
(1,10). So, which item is closest to what we're looking for? If you
look at the graph, D probably looks the closest. Which is easy for
us to say—we have eyes. But how does the computer know which is
closest?
It's time to recall high school geometry and the infamous
Pythagorean Theorem. Remember that one? It says that
a² + b² = c². It's used to calculate the length of
the longest side of a right triangle. So let's take item A. Draw an
imaginary triangle with the hypotenuse (the diagonal side) from
Target to A. We draw a straight line down from Target and left from
A. Where they intersect, (1,4), is the right angle portion of our
triangle. So now we have a right triangle with the points (1,10),
(4,4) and (1,4). Using some simple math, we know that the length of
side A (the vertical side) of our triangle is 6 (10 – 4), while side
B (the horizontal side) is 3 (4 – 1). Side C is the diagonal line
between Target and A. Using the Pythagorean Theorem, the length of
side C is:
a = 10 – 4 = 6
b = 4 – 1 = 3
a² + b² = c²
6² + 3² = c²
36 + 9 = c²
c² = 45
c = 6.71
We get the same answer if we skip the phantom third point (1,4)
and just subtract the two remaining points, which means (1,10) –
(4,4). The calculations look like:
a = 10 – 4 = 6
b = 1 – 4 = –3
a² + b² = c²
6² + (–3)² = c²
36 + 9 = c²
c² = 45
c = 6.71
So now we know that item A has a distance of 6.71. If we run the
other items through the same process, B's distance is 6.71, C's
distance is 9.06, and D's distance is 4. Because D has the shortest
distance, it is the most similar and thus our top choice.
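The distance arithmetic above takes only a few lines of Java. This is just a sketch of the brute-force step (the class and method names are mine, not from the Selection Engine); the data points are the article's target and items A through D:

```java
public class NearestNeighborDemo {
    // Euclidean distance between points (x1,y1) and (x2,y2),
    // straight from the Pythagorean Theorem
    static double distance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        double[][] items = { {4, 4}, {7, 7}, {10, 9}, {1, 6} }; // A, B, C, D
        char name = 'A';
        for (double[] item : items) {
            // distance from the ideal item at (1,10)
            System.out.printf("%c: %.2f%n", name++,
                distance(1, 10, item[0], item[1]));
        }
    }
}
```

Running it reproduces the numbers above: A and B at 6.71, C at 9.06, and D closest at 4.00.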
What does a distance of 4 mean? Nothing—the range of numbers (the
maximum distance) changes with each problem. So how do we convert it
to a percentage? First, figure out the maximum distance. With ranges
of 1–10 for performance and 1–10 for price, the maximum distance is
from (10,10) to (1,1). Again using the Pythagorean Theorem, we come
up with a distance of 12.73. So how similar is item D and our target
item? The amount of difference is (4 / 12.73) which is 0.31 and
because similarity is just one minus that (1 – 0.31), the degree of
similarity is 0.69, or 69%. A's similarity is 47%, B is also 47%,
and C is 29% similar.
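The distance-to-percentage conversion is equally small. A minimal sketch (names mine), using the 1-10 ranges from the example:

```java
public class SimilarityPercent {
    // 1.0 means identical, 0.0 means maximally far apart
    static double similarity(double distance, double maxDistance) {
        return 1.0 - (distance / maxDistance);
    }

    public static void main(String[] args) {
        // maximum possible distance runs from (1,1) to (10,10)
        double maxDistance = Math.sqrt(9 * 9 + 9 * 9); // about 12.73
        System.out.printf("D: %.0f%%%n", similarity(4.0, maxDistance) * 100);
        System.out.printf("C: %.0f%%%n", similarity(9.06, maxDistance) * 100);
    }
}
```

This prints 69% for D and 29% for C, matching the figures above.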
That was an easy example. With only two attributes, we could use
the Pythagorean Theorem just as we remember it from our early school
days. But what happens when an item has numerous attributes?
Short answer: Rather than use a triangle with two lengths, A and
B, use some multisided object with however many sides you need. Or,
more simply, we can skip the whole diagram and just
subtract all our points from one another. Say, for example, that our
items have price, performance, reliability, and size, again using
the simple 1–10 ranges. We want to find an item of price 1,
performance 10, reliability 10, and size 1, which as a graph point
would be (1,10,10,1). Item A might be (4,4,4,4), B might be
(7,7,2,8), C might be (10,9,6,10), and D might be (1,6,8,3). I won't
show a graph here because I'm not very good at drawing
four-dimensional objects.
What is the distance of A from what we're looking for? Running
through the calculations, we get:
(1,10,10,1) – (4,4,4,4) = (–3, 6, 6, –3)
a² + b² + c² + d² = e²
(–3)² + 6² + 6² + (–3)² = e²
9 + 36 + 36 + 9 = e²
e² = 90
e = 9.49
Since a quick calculation shows that the maximum distance is now
18, we know that item A is 47% similar to what we're looking
for.
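In code, the generalization is just a loop: subtract coordinate by coordinate, square, sum, and take the square root. A sketch (names mine) using the four-attribute example:

```java
public class MultiDistance {
    // Euclidean distance in any number of dimensions
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double delta = a[i] - b[i];
            sum += delta * delta;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] target = {1, 10, 10, 1}; // price, performance, reliability, size
        double[] itemA  = {4, 4, 4, 4};
        double d = distance(target, itemA); // sqrt(90), about 9.49
        double max = distance(new double[]{10, 10, 10, 10},
                              new double[]{1, 1, 1, 1});   // 18
        System.out.printf("%.2f, %.0f%% similar%n", d, (1 - d / max) * 100);
    }
}
```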
Now let's talk about our last little algorithm bit—weighting.
Suppose that, in our example, performance is more important than
price. We can invent a scale of one to five, where five means really
important. Let's say we weight performance highest (weight = 5),
price high (weight = 4), reliability low (weight = 2), and size
lowest (weight = 1). To use these values, multiply the initial
numbers, which for our perfect item was (1,10,10,1), by their
weights, which here are (5,4,2,1). That's it. Using simple
multiplication, we end up with a new, weighted scale where our
perfect item is located at (5,40,20,1) and item A moves from its
previous (4,4,4,4) to the new weighted position of (20,16,8,4). To
see how near a neighbor item A is now, let's compute the
distance:
Target position = (1,10,10,1) * (5,4,2,1) = (5,40,20,1)
Item A position = (4,4,4,4) * (5,4,2,1) = (20,16,8,4)
(5,40,20,1) – (20,16,8,4) = (–15, 24, 12, –3)
a² + b² + c² + d² = e²
(–15)² + 24² + 12² + (–3)² = e²
225 + 576 + 144 + 9 = e²
e² = 954
e = 30.89
We calculate the maximum distance as before, this time using
weights:
Max value = (10,10,10,10) * (5,4,2,1) = (50,40,20,10)
Min value = (1,1,1,1) * (5,4,2,1) = (5,4,2,1)
(50,40,20,10) – (5,4,2,1) = (45, 36, 18, 9)
a² + b² + c² + d² = e²
45² + 36² + 18² + 9² = e²
2025 + 1296 + 324 + 81 = e²
e² = 3726
e = 61.04
That means that the similarity of item A to our target item is (1
– (30.89 / 61.04)) = 49%. As a note, things haven't changed much—A
was a 47% match in our two-attribute example, a 47% match in our
four-attribute example, and now a 49% match in our weighted
four-attribute example.
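In code, weighting is just an element-wise multiply applied before the distance calculation. A sketch (names mine) reproducing the weighted numbers above:

```java
public class WeightedDistance {
    // multiply each coordinate by its attribute's weight
    static double[] weight(double[] point, double[] weights) {
        double[] out = new double[point.length];
        for (int i = 0; i < point.length; i++)
            out[i] = point[i] * weights[i];
        return out;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] w      = {5, 4, 2, 1};
        double[] target = weight(new double[]{1, 10, 10, 1}, w);  // (5,40,20,1)
        double[] itemA  = weight(new double[]{4, 4, 4, 4}, w);    // (20,16,8,4)
        double d   = distance(target, itemA);                     // about 30.89
        double max = distance(weight(new double[]{10, 10, 10, 10}, w),
                              weight(new double[]{1, 1, 1, 1}, w)); // about 61.04
        System.out.printf("%.0f%% similar%n", (1 - d / max) * 100);
    }
}
```

This prints 49%, the same figure as the worked arithmetic.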
Enough with all the addition and multiplication. Let's take a
slight detour and talk about nearest neighbor terminology.
If, on every request, you manually compute the distances and
return the best match, that's a brute force nearest neighbor search.
A k-nearest neighbor search returns the k closest items, meaning
that if you ask for the top five matches, you get the top five
matches (k = 5). The whole brute force thing is pretty simple to
understand and implement, but you are doing a fair number of
calculations on each call. I believe that many, if not most, CBR
researchers are spending their time on performance tuning. The most
common, as well as obvious, approach is to precalculate all of the
distances and cache those in memory. This approach trades
flexibility for performance.
Why does this limit our flexibility? The precalculation routine
normally assumes that all attributes should be used in computing
distance and that all attributes carry the same weight. You could
tell the precalculation routine to use a predefined subset of
attributes and define weights for each field, but those decisions
are frozen at the time the similarity relationships are
precalculated. In other words, you can't change the attributes and
weightings on the fly. That's perfectly okay for some applications
and a problem for others. Let's build a user-driven search engine,
and because the user is in control of the attributes and weightings,
we're going to skip building a precalculation routine and do
everything dynamically.
Implementing CBR

Before we start looking at
code, I want to talk a little bit about an open source project named
the Selection Engine.
Once upon a time, I needed a CBR tool to help with a couple of
applications. I couldn't find a freeware or shareware tool I liked
and couldn't easily get my hands on a demo of a commercial product.
After a little research, I realized that CBR wasn't all that complex
and that I could write a tool to do what I needed fairly quickly.
Two weeks later, I had everything I needed. For me, that meant an
engine that could do standard data filtering (à la SQL), compute
similarity, compute percent similarity, handle weightings, sort data
by percent similarity, handle searches on subsets of attributes, be
generic enough to handle any arbitrary data sent to it, easily
integrate into larger applications, work with Java, and be
stateless. And because I knew from the beginning that I wanted to
share this code with others, I wanted the engine to work with
whatever environment other people had, which led me to create
generic data loading and display managers. By default, the Selection
Engine reads the data, metadata, and search query from text files
and sends the output to stdout, which should make it usable by
anyone. Most likely the person using the engine will replace those
pieces, perhaps replacing the text file loader with something that
reads from an Oracle database or VSAM file and perhaps replacing the
display manager with something that generates HTML or creates a GUI
using Swing. I leave the customization of the I/O to the reader.
The implementation of a CBR system that I'm talking about here
comes from the Selection Engine. But since I'm talking about CBR, I
won't spend a lot of time discussing the parts of the Selection
Engine that aren't nearest-neighbor related (the filter engine, the
file parsing routines, the stdout-oriented display manager, and the
engine's object model). If you want to know more about those pieces,
or if you want to help improve the tool, feel free to visit the
engine's Web page at http://selectionengine.sourceforge.net/.
Let's talk about the basic pieces of our CBR engine. The most
obvious part, of course, is the Java class named SimilarityEngine,
which, as you probably guessed, computes the similarity of a
collection of items to a specified target.
SimilarityEngine relies on a handful of objects (see Figure 2). The call to
SimilarityEngine's main method, computeSimilarity(), takes the
object's Items, SimilarityCriteria, and SimilarityWeights. Each Item
has a collection of Traits, and the Items collection has a link to
the metadata collection TraitDescriptors.
Items is a collection of Item objects. It's basically a standard
collection. An Item is generic and represents anything you want to
be able to search for—a product, a song, an image, a decision, a
customer, and so on. Because Item is supposed to be able to
represent any type of item, its attributes are defined at run time
rather than compile time. Specifically, Item contains a Traits
collection, which, obviously, contains several Trait objects. A
Trait object is basically just a name/value pair. A separate object,
TraitDescriptors, plays the metadata role. It is a collection of
TraitDescriptor objects. There is one TraitDescriptor for each trait
an Item object will have. A TraitDescriptor, like the Trait class,
is basically just a name/value pair. In this case, value is the data
type of the attribute. The Selection Engine recognizes integers,
floating-point numbers, strings, and booleans.
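To make the name/value idea concrete, here is a bare-bones sketch of the Trait model just described. The class names follow the article, but the implementation is mine; the real Selection Engine classes carry more machinery than this:

```java
import java.util.ArrayList;
import java.util.List;

public class Item {
    // a Trait is just a name/value pair
    record Trait(String name, String value) {}

    private final List<Trait> traits = new ArrayList<>();

    // attributes are attached at run time, not compile time
    void add(String name, String value) {
        traits.add(new Trait(name, value));
    }

    // look a trait up by name, case-insensitively
    String get(String name) {
        for (Trait t : traits)
            if (t.name().equalsIgnoreCase(name)) return t.value();
        return null;
    }

    public static void main(String[] args) {
        Item laptop = new Item();
        laptop.add("Vendor", "Alienware");
        laptop.add("Price", "999");
        System.out.println(laptop.get("vendor"));
    }
}
```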
It's worth noting here that all of the items I just mentioned
(Items, Item, Traits, Trait, TraitDescriptors, and TraitDescriptor)
exist in the Selection Engine to help make the engine generic and
reusable. In your particular application, where you know at compile
time what classes, attributes, and attribute data types you'll have,
you'll probably replace Item with the specific class you want to
use. Doing so should have minimal impact on how the engine
functions, so the rest of this article should still be helpful.
So that covers the Items, the first of three arguments we pass to
SimilarityEngine's computeSimilarity() method. The collection should
contain every item that we want to compute similarity on. In the
Selection Engine, the items are first run through an object called
FilterEngine, which filters out items we don't care about. As an
example, if the user doesn't want to see any items that cost more
than $100, the FilterEngine can get rid of those choices before they
are passed to the SimilarityEngine. In practice, most people will
probably use SQL to filter out obviously unsuitable items before
they are scored by the SimilarityEngine.
The other two arguments passed to computeSimilarity() are
SimilarityCriteria and SimilarityWeights. SimilarityCriteria is a
collection of SimilarityCriterion objects. A SimilarityCriterion
object describes a simple relationship made up of an attribute, an
operator, and a value. The Selection Engine recognizes three
operators. ~ means "around," % means "prefer," and !% means "try to
avoid." The first is used with numbers, the latter two with strings
and booleans. The distinction is that numbers have a degree of
similarity while strings and booleans in the Selection Engine either
match or don't. Let's do a few quick examples. For the similarity
criterion "spiciness ~ 10", the value 7 might be a 70% match. For
the similarity criterion "taste % 'spicy'", a value of sweet would
be a 0% match. Although the Selection Engine is case insensitive, it
does not do degree of similarity calculations on strings or use
similarity-oriented matching algorithms such as Soundex, Metaphone,
or Levenshtein.
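A sketch of how the string operators might score (the actual Selection Engine code may differ): % is all-or-nothing and case-insensitive, and !% simply inverts it:

```java
public class StringMatch {
    // "%" operator: returns 1.0 for a match, 0.0 otherwise
    static double prefer(String actual, String wanted) {
        return actual.equalsIgnoreCase(wanted) ? 1.0 : 0.0;
    }

    // "!%" operator: the inverse of prefer
    static double avoid(String actual, String unwanted) {
        return 1.0 - prefer(actual, unwanted);
    }

    public static void main(String[] args) {
        System.out.println(prefer("sweet", "spicy")); // a 0% match
        System.out.println(prefer("SPICY", "spicy")); // case-insensitive match
        System.out.println(avoid("HP", "HP"));        // penalized vendor
    }
}
```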
At this point, some of you might be asking what the difference is
between "taste % 'spicy'" and "taste = 'spicy'". In the Selection
Engine, % is used by the SimilarityEngine while = is used by the
FilterEngine. If an item fails the "taste = 'spicy'" criterion, it
is rejected and will not be passed to the SimilarityEngine. If an
item fails the "taste % 'spicy'" criterion, the degree of similarity
is lowered, but the item still receives a similarity score. This
applies to both strings and booleans.
For numeric fields, the SimilarityEngine recognizes two special
values, [MAX_VAL] and [MIN_VAL] (again, case is unimportant). Both
are, obviously, relative values rather than absolute values. You can
do CBR without using relative values, but relative values makes CBR
maintenance easier. The SimilarityEngine translates relative numbers
into absolutes by first looping through every item passed to it so
it can determine the max and min values for each of the item's
attributes.
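Resolving the relative values amounts to one pass over the data per numeric attribute. A minimal sketch (the method name and shape are mine, not the Selection Engine's):

```java
import java.util.List;

public class RangeScan {
    // translate [MAX_VAL]/[MIN_VAL] into absolutes for one attribute
    static double resolve(String token, List<Double> values) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : values) {          // one pass finds both bounds
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (token.equalsIgnoreCase("[MAX_VAL]")) return max;
        if (token.equalsIgnoreCase("[MIN_VAL]")) return min;
        return Double.parseDouble(token);  // already an absolute value
    }

    public static void main(String[] args) {
        List<Double> prices = List.of(799.0, 1200.0, 950.0);
        System.out.println(resolve("[min_val]", prices)); // case-insensitive
    }
}
```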
The third object passed to computeSimilarity() is
SimilarityWeights, which is a collection of SimilarityWeight
objects. SimilarityWeight is yet another name/value pair, where name
is the name of an attribute and value is its weight. Weight can be
any integer value. The weights are not percentages and do not need
to add up to 1 or 100. By default, all attributes have a weight of
1. Remember that weights affect only similarity operators (~, %,
!%), not filter operators (=, !=, <, >, <=, >=).
To review, let's look at a sample query. We visit an Internet
retailer to buy a computer. We like Alienware computers, but would
rather not buy an HP and absolutely will not buy a Dell. The price
should be low, and we absolutely cannot spend more than $1,000. We
really, really want a fast computer and a big hard drive. And a DVD
player would be nice. How would we write that as selection criteria?
Using the Selection Engine's query format, which is pipe delimited
and uses c and w to signify constraints and weights, the query might
look something like this:

c | Vendor | % | Alienware
c | Vendor | !% | HP
c | Vendor | != | Dell
w | Vendor | 1
c | Price | ~ | [MIN_VAL]
c | Price | <= | 1000
w | Price | 1
c | HD | ~ | [MAX_VAL]
w | HD | 4
c | DVD | % | TRUE
w | DVD | 1
c | cpu_benchmark | ~ | [MAX_VAL]
w | cpu_benchmark | 5
Note that, to search for fast computers, we search against a
benchmark score (cpu_benchmark) rather than a CPU name (for example,
Intel PIII 800) or megahertz rating (for example, 800). That's
because a benchmark might accurately represent system performance,
but the name of the CPU or MHz rating most likely will not. Although
this has nothing to do with the technical aspects of the nearest
neighbor algorithm, I mention it here because, as always, garbage
in, garbage out. In my experience, poor design (database and object
modeling problems) is responsible for far more problems than
technological issues.
So we know we pass Items, SimilarityCriteria, and
SimilarityWeights into the computeSimilarity() method. Let's talk
about how the computeSimilarity() method works. A pseudo code
representation is shown in Listing 1. The source code for
computeSimilarity() is shown in Listing 2.
First, we need some information on our data set. Read through
each item and determine the max and min values for each numeric
attribute. We also need the range for each attribute, but we can get
that by subtracting the smallest value from the largest (Listing 3).
Next we need to turn our similarity query into something (a
series of points) that we could, in theory, plot on a graph. We do
that by normalizing each numeric value in the search criteria.
Normalizing means that we want to express the value as a percentage
(0 to 1) of the possible values. We do that by subtracting the
minimum value possible (which we found when we calculated our data
set statistics back in step one) and then dividing it by the range
of possible values (the maximum value minus the minimum value). As
an example, if our target price was $2,000 and the prices ranged
from $1,000 to $3,000, the normalized value of our target price
would be (2,000 – 1,000) / (3,000 – 1,000) = 0.5. After we get the
normalized value, we convert it to a weighted value by multiplying
each normalized value by the weight for that attribute (see Listing 4).
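The normalize-then-weight step works out to two lines of arithmetic. A sketch of the price example (Listing 4 does the equivalent work in the Selection Engine itself):

```java
public class Normalize {
    // express a value as a 0..1 fraction of the data set's range
    static double normalize(double value, double min, double max) {
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        double n = normalize(2000, 1000, 3000); // the $2,000 target -> 0.5
        double weighted = n * 4;                // then apply a price weight of 4
        System.out.println(n + " " + weighted);
    }
}
```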
Next we calculate the maximum distance possible for any item.
We'll need this number to calculate percent similarity later. We
know that the normalized value of the maximum value is always 1, so
the weighted values are just the weights. We convert these points to
a distance by squaring each point, summing them and then taking the
square root, just like the good old Pythagorean Theorem (see Listing 5).
That completes our prep work. Now it's time to rate each item. As
with our search criteria, we start by normalizing and weighting the
numeric values (see Listing
6). We then compute the distance by subtracting each of the
item's attributes from the target values (which gives us a delta),
squaring the deltas, summing them, and then taking the square root
of the sum. Dividing the distance by the maximum distance gives us
the percent difference, and subtracting that from 1 gives us the
percent similarity (see Listing
7).
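Pulling those steps together, the whole rating pass can be sketched in one method: normalize, weight, compute the distance, and divide by the maximum distance. This is a self-contained summary of the algorithm the listings describe, not the Selection Engine's actual source:

```java
public class SimilarityDemo {
    // percent similarity of one item to the target, given the data
    // set's per-attribute min/max values and the attribute weights
    static double similarity(double[] item, double[] target,
                             double[] min, double[] max, double[] weights) {
        double sumSq = 0, maxSq = 0;
        for (int i = 0; i < item.length; i++) {
            double range = max[i] - min[i];
            double nItem   = (item[i]   - min[i]) / range; // normalize to 0..1
            double nTarget = (target[i] - min[i]) / range;
            double delta = (nItem - nTarget) * weights[i]; // weighted delta
            sumSq += delta * delta;
            maxSq += weights[i] * weights[i]; // max normalized delta is 1 * weight
        }
        return 1.0 - Math.sqrt(sumSq) / Math.sqrt(maxSq);
    }

    public static void main(String[] args) {
        // the price/performance example: target (1,10), item D at (1,6)
        double[] min = {1, 1}, max = {10, 10}, w = {1, 1};
        double s = similarity(new double[]{1, 6}, new double[]{1, 10},
                              min, max, w);
        System.out.printf("D: %.0f%%%n", s * 100);
    }
}
```

Run against item D from the two-attribute example, it prints 69%, matching the hand calculation earlier in the article.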
Once upon a time I was the AI specialist for a large consumer
electronics retailer. I spent a lot of time at AI conferences, which
are typically filled with academics and military researchers. I was
a popular curiosity at those conferences, with the most common
question asked of me being why an IT person from a retail company
would care about AI. My response was that AI is more valuable to the
business world than to the academics and engineers. I've also known
a lot of business programmers who've asked why I waste my time
studying AI when everyone knows that only scientists can and should
use that stuff. My reply is that the old classics are just as
valuable, if not more so, than memorizing the latest APIs and,
besides, this stuff is easy!
I know many people are intimidated by computer science books and
their attendant algorithms and data structures. I'm no big fan of
computer science books either, with their examples that aren't
relevant to my work life and their formulas filled with Greek
letters and lots of funny looking math symbols. But if you can get
past the bad writing, this computer science stuff turns out to be
really useful and really easy! CBR, as one example, can be used in
all sorts of personal and business applications—managing recipes,
selling cars, searching for real estate, making video games, picking
home loans, recommending music, and more. Other fields of computer
science, equally unfamiliar to the average IT developer, are just as
useful. Despite the silly names, if you think about it hard enough
you'll probably find some uses in your workplace for decision trees
(and C4.5, production rule language [PRL], and Rete), information
retrieval (and Porter stemming, resolving power, and inverted file
systems), image processing (edge tracing algorithms can be used to
process any type of data), parallelization (and similar topics such
as agents, emergent behavior, blackboards, and voting systems), run
length encoding, directed acyclic graphs, hidden Markov models, and
all sorts of other "academic" tasks. And have I mentioned that this
stuff is easy? I'm math illiterate and went to journalism school and
even I get this. Yes, okay, I needed someone to translate
all the math gobbledy-gook into liberal arts speak for me, but it
did eventually make sense. We've already seen how to build a simple
(and valuable) search engine with a little AI, and that wasn't too
hard was it?
So here's what I ask. Think back on your computer science
schooling or go pick up a book on algorithms (no, not a
language-specific one but an honest-to-goodness theory book) and see
if you can't find something useful in there. And if you do, talk
some teammates into trying it out. Hey, while you're at it, why not
think of a simple way to explain it and then write an article? Many
of us would be happy to read it.
About the Author

Baylor Wetzel
is a novelist, short story writer, and movie maker specializing in
satire, parody, and ethnic literature. In his spare time, he works a
day job consulting in architecture, artificial intelligence, object
modeling, and all sorts of process stuff. His attempts at schooling
include undergraduate stints at the University of Texas schools of
journalism and business and he continues to do grad work at the
University of St Thomas' AI program. Baylor has written for several
computer magazines and is certified in Java and Microsoft, but he
would gladly trade it all for a chance to meet Buffy the Vampire
Slayer's Willow Rosenberg. You can reach Baylor at baylor@ihatebaylor.com.