Feed Sign in with OpenID OpenID

Simon Willison’s Weblog

Vector search engines

Building a Vector Space Search Engine in Perl:

Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document’s position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.

To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don’t worry—it will become clearer when we write the code.

Having done a course on Linear Algebra last term, it’s interesting to see how it can be applied to the search problem. The technique described lends itself well to finding “similar documents” as well, as documents with similar word content will end up “near” to each other when projected on to the vector space.

The article is also yet another demonstration of how Perl’s modules make it such a powerful tool. Lingua::Stem is used to find word “stems”, providing a free algorithm for eliminating related words like cat and cats. The performance overhead of using Perl arrays to represent large vectors is avoided with the PDL module, which implements a whole set of matrix algebra functions in compiled C for high performance. Without these two modules the technique described would be a great deal less powerful. Of course, neither of them are available for PHP or Python, my scripting languages of choice.

This is Vector search engines by Simon Willison, posted on 1st March 2003.

View blog reactions

Next: An interview with Cory

Previous: Problems in Nirvana

6 comments

  1. The Python Numeric stuff is very good at mathematical things, which includes matrix operations. Lingua::Stem would surely not be all that hard to reimplement?

    Stuart Langridge - 1st March 2003 14:37 - #

  2. Probably not, at the drop of a hat. I hadn't looked at Numeric Python but if it can do matrix stuff efficiently that's a great start.

    PHP's not likely to get this stuff any time soon though :/

    Simon Willison - 1st March 2003 14:45 - #

  3. Hey, if the Perl module is compiled C, I don't see that it would take that long to either turn it into a PHP module, or just convert the code into PHP... Also, recall a conversation we had about a plajurism (sp) detecting program? At the time I don't think I mentioned, but I'd scanned this article before (not read into much detail) but this sort of thing was what I was thinking about as the first pass for the program, then use sentance matching on the first 1000 documents... you could get quite a long way.

    Swannie - 4th March 2003 00:16 - #

  4. There is a python version of the Porter Stemming Algorithm at Martin Porter's website: http://www.tartarus.org/~martin/PorterStemmer I haven't used it, but Porter says that it works.

    Joshua - 22nd September 2003 17:17 - #

  5. I've just finished creating a small GPL term vector space search engine (ala Google) in PHP/Mysql. Mysql queries do most of the matrix work. Also, note the dimensions are FK's using GUIDs. You may download it @ http://www.bigattichouse.com/thoughtbrew.php?BREWI D=PROGRAMMING#bf28e6e51231941226369cb6689c1a47

    Mike Johnson - 15th August 2004 15:51 - #

  6. General-purpose linear algebra libraries weren't so much use when I looked into this - you need something that can deal with large, sparse matrices in an efficient way. I found a sparse matrix library for Python which I use for this, PySparse, although its functionality is pretty basic it'll do the M * M_transpose that I need pretty fast.

    Matthew - 27th April 2006 17:45 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/03/01/vectorSearchEngines

A django site