Simon Willison on full-text-search

21 posts tagged “full-text-search”

2024

Hybrid full-text search and vector search with SQLite. As part of Alex’s work on his sqlite-vec SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.

The most promising approach looks to be Reciprocal Rank Fusion, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:

-- the sqlite-vec KNN vector search results
with vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
),
-- the FTS5 search results
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
-- combine FTS5 + vector search results with RRF
final as (
  select
    articles.id,
    articles.headline,
    vec_matches.rank_number as vec_rank,
    fts_matches.rank_number as fts_rank,
    -- RRF algorithm
    (
      coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts +
      coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
    ) as combined_rank,
    vec_matches.distance as vec_distance,
    fts_matches.score as fts_score
  from fts_matches
  full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
  join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
  order by combined_rank desc
)
select * from final;

I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them - it uses them purely for row_number() ranking within each set and combines the results based on that.

# 4th October 2024, 4:22 pm / full-text-search, search, sql, sqlite, alex-garcia, vector-search, embeddings, rag

tantivy-cli (via) I tried out this Rust based search engine today and I was very impressed.

Tantivy is the core project - it's an open source (MIT) Rust library that implements Lucene-style full text search, with a very full set of features: BM25 ranking, faceted search, range queries, incremental indexing etc.

tantivy-cli offers a CLI wrapper around the Rust library. It's not actually as full-featured as I hoped: it's intended as more of a demo than a full exposure of the library's features. The JSON API server it runs can only be used to run simple keyword or phrase searches for example, no faceting or filtering.

Tantivy's performance is fantastic. I was able to index the entire contents of my link blog in a fraction of a second.

I found this post from 2017 where Tantivy creator Paul Masurel described the initial architecture of his new search side-project that he created to help him learn Rust. Paul went on to found Quickwit, an impressive looking analytics platform that uses Tantivy as one of its core components.

The Python bindings for Tantivy look well maintained, wrapping the Rust library using maturin. Those are probably the best way for a developer like myself to really start exploring what it can do.

Also notable: the Hacker News thread has dozens of posts from happy Tantivy users reporting successful use on their projects.

# 13th June 2024, 6:03 am / cli, full-text-search, search, rust

2020

Weeknotes: datasette-dump, sqlite-backup, talks

I spent some time this week digging into Python’s sqlite3 internals. I also gave two talks and recorded a third, due to air at PyGotham in October.

[... 928 words]

4:47 am / 11th September 2020 / full-text-search, speaking, sqlite, datasette, weeknotes, sqlite-utils

sqlite-utils 2.14 (via) I finally figured out porter stemming with SQLite full-text search today—it turns out it’s as easy as adding tokenize=’porter’ to the CREATE VIRTUAL TABLE statement. So I just shipped sqlite-utils 2.14 with a tokenize= option (plus the ability to insert binary file data from stdin).

# 1st August 2020, 9:19 pm / full-text-search, projects, search, sqlite, sqlite-utils

datasette-search-all: a new plugin for searching multiple Datasette tables at once

I just released a new plugin for Datasette, and it’s pretty fun. datasette-search-all is a plugin written mostly in JavaScript that executes the same search query against every searchable table in every database connected to your Datasette instance.

[... 819 words]

12:59 am / 9th March 2020 / full-text-search, javascript, plugins, projects, search, datasette

2019

sqlite-utils 1.11. Amjith Ramanujam contributed an excellent new feature to sqlite-utils, which I’ve now released as part of version 1.11. Previously you could enable SQLite full-text-search on a table using the .enable_fts() method (or the “sqlite-utils enable-fts” CLI command) but it wouldn’t reflect future changes to the table—you had to use populate_fts() any time you inserted new records. Thanks to Amjith you can now pass create_triggers=True (or --create-triggers) to cause sqlite-utils to automatically add triggers that keeps the FTS index up-to-date any time a row is inserted, updated or deleted from the table.

# 3rd September 2019, 1:05 am / cli, full-text-search, projects, sqlite, sqlite-utils

Exploring search relevance algorithms with SQLite

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

[... 1,390 words]

3:29 am / 7th January 2019 / full-text-search, projects, search, sqlite, datasette

2018

Datasette: Full-text search. I wrote some documentation for Datasette’s full-text search feature, which detects tables which have been configured to use the SQLite FTS module and adds a search input box and support for a _search= querystring parameter.

# 12th May 2018, 12:09 pm / full-text-search, search, sqlite, datasette

2017

Implementing faceted search with Django and PostgreSQL

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL“like” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:

[... 3,103 words]

2:12 pm / 5th October 2017 / django, full-text-search, orm, postgresql, projects, search, facetedsearch

2010

A fast, fuzzy, full-text index using Redis. Interesting twist on building a reverse-index using Redis sets: this one indexes only the metaphones of the words, resulting in a phonetic fuzzy search.

# 5th May 2010, 5:51 pm / full-text-search, redis, search, recovered, fuzzy, metaphone

2009

Digg Search: Now With 99.987% Less Suck. Really nice implementation of faceted search, still using Lucene and Solr under the hood.

# 10th April 2009, 10:17 pm / digg, facets, full-text-search, lucene, search, solr

Sphinx 0.9.9-rc2 is out. Interesting new feature: the Sphinx search server now supports the MySQL binary protocol, so you can talk to it using a regular MySQL client library and fire off search queries using SELECT syntax and the new SphinxQL query language.

# 8th April 2009, 1:59 pm / full-text-search, mysql, search, sphinx-search, sql

Guardian + Lucene = Similar Articles + Categorisation. Alf Eaton loaded 13,000 Guardian articles tagged Science in to Solr and Lucene and is using Solr’s MoreLikeThisHandler to find related articles and automatically apply Guardian tags to Nature News articles.

# 11th March 2009, 12:53 pm / alf-eaton, full-text-search, guardian, lucene, naturenews, openplatform, search, solr, tagging

Xapian performance comparision with Whoosh. Whoosh appears to be around four times slower than Xapian for indexing and empty cache searches, but Xapian with a full cache blows Whoosh out of the water (5408 searches/second compared to 26.3). Considering how fast Xapian is, that’s still a pretty impressive result for the pure-Python Whoosh.

# 14th February 2009, 1:15 pm / full-text-search, python, richard-boulton, search, whoosh, xapian

Tokyo Cabinet and Tokyo Tyrant Presentation. By Tokyo Cabinet author Mikio Hirabayashi. The third leg of the Tokyo tripod is Tokyo Dystopia, a full-text search engine which is presumably a modern replacement for Mikio’s older hyperestraier engine.

# 14th February 2009, 11:34 am / full-text-search, hyperestraier, mikiohirabayashi, tokyocabinet, tokyodystopia, tokyotyrant

Whoosh. A brand new, pure-python full text indexing engine (think Lucene). Claims to offer performance in the same league as wrappers to C or Java libraries. If this works as well as it claims it will be an excellent tool for adding search to projects that wish to avoid a dependency on an external engine.

# 12th February 2009, 12:49 pm / full-text-search, lucene, open-source, python, search, whoosh

2008

How-to: Full-text search in Google App Engine. Use search.SearchableModel instead of db.Model—it’s pretty rough at the moment which is probably why it’s still undocumented.

# 27th June 2008, 8:25 am / appengine, full-text-search, googleappengine, python, search

2007

PostgreSQL 8.3 beta 4 release notes. In addition to the huge speed improvements, 8.3 adds support for XML, UUID and ENUM data types and brings full text (tsearch2) in to the core database engine.

# 12th December 2007, 12:43 am / beta, databases, enum, full-text-search, postgresql, tsearch2, uuid, xml

Opera 9.5 alpha, Kestrel, released. “With history search, Opera creates a full-text index of each and every page you visit, and when you go to the address bar, you can simply start entering words you know have been on pages you’ve visited before, and items matching your search show up.” I just tried this; it’s magic. I’m switching back to Opera from Camino.

# 16th September 2007, 8:34 pm / browsers, camino, full-text-search, history, kestrel, opera, search

django-sphinx (via) More code from Curse Gaming; this time a really nice API for adding Sphinx full-text search to a Django model.

# 9th September 2007, 12:35 am / cursegaming, david-cramer, django, full-text-search, orm, python, search, sphinx-search

2006

SQLite Keynote. SQLite 3.3.8 has full text indexing!

# 21st October 2006, 11:44 pm / full-text-search, sqlite, d-richard-hipp

Simon Willison’s Weblog