Vector search engines
Building a Vector Space Search Engine in Perl:
Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document’s position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.
To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don’t worry—it will become clearer when we write the code.
Having done a course on Linear Algebra last term, it’s interesting to see how it can be applied to the search problem. The technique described lends itself well to finding “similar documents” as well, as documents with similar word content will end up “near” to each other when projected on to the vector space.
The article is also yet another demonstration of how Perl’s modules make it such a powerful tool. Lingua::Stem is used to find word “stems”, providing a free algorithm for eliminating related words like cat and cats. The performance overhead of using Perl arrays to represent large vectors is avoided with the PDL module, which implements a whole set of matrix algebra functions in compiled C for high performance. Without these two modules the technique described would be a great deal less powerful. Of course, neither of them are available for PHP or Python, my scripting languages of choice.
More recent articles
- Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking - 4th June 2023
- It's infuriatingly hard to understand how closed models train on their input - 4th June 2023
- ChatGPT should include inline tips - 30th May 2023
- Lawyer cites fake cases invented by ChatGPT, judge is not amused - 27th May 2023
- llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs - 18th May 2023
- Delimiters won't save you from prompt injection - 11th May 2023
- Weeknotes: sqlite-utils 3.31, download-esm, Python in a sandbox - 10th May 2023
- Leaked Google document: "We Have No Moat, And Neither Does OpenAI" - 4th May 2023
- Midjourney 5.1 - 4th May 2023
- Prompt injection explained, with video, slides, and a transcript - 2nd May 2023