Simon Willison’s Weblog

Vector search engines

Building a Vector Space Search Engine in Perl:

Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document’s position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.

To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don’t worry—it will become clearer when we write the code.

Having done a course on Linear Algebra last term, it’s interesting to see how it can be applied to the search problem. The technique described lends itself well to finding “similar documents” as well, as documents with similar word content will end up “near” to each other when projected on to the vector space.

The article is also yet another demonstration of how Perl’s modules make it such a powerful tool. Lingua::Stem is used to find word “stems”, providing a free algorithm for eliminating related words like cat and cats. The performance overhead of using Perl arrays to represent large vectors is avoided with the PDL module, which implements a whole set of matrix algebra functions in compiled C for high performance. Without these two modules the technique described would be a great deal less powerful. Of course, neither of them are available for PHP or Python, my scripting languages of choice.

This is Vector search engines by Simon Willison, posted on 1st March 2003.

Next: An interview with Cory

Previous: Problems in Nirvana

Previously hosted at http://simon.incutio.com/archive/2003/03/01/vectorSearchEngines