Simon Willison’s Weblog


Blogmarks tagged search



More than an OpenAI Wrapper: Perplexity Pivots to Open Source. I’m increasingly impressed with Perplexity.ai—I’m using it on a daily basis now. It’s by far the best implementation I’ve seen of LLM-assisted search—beating Microsoft Bing and Google Bard at their own game.

A year ago it was implemented as a GPT 3.5 powered wrapper around Microsoft Bing. To my surprise they’ve now evolved way beyond that: Perplexity has their own search index now and is running their own crawlers, and they’re using variants of Mistral 7B and Llama 70B as their models rather than continuing to depend on OpenAI. # 13th January 2024, 6:12 am

ast-grep (via) There are a lot of interesting things about this year-old project.

sg (an alias for ast-grep) is a CLI tool for running AST-based searches against code, built in Rust on top of the Tree-sitter parsing library. You can run commands like this:

sg -p 'await await_me_maybe($ARG)' datasette --lang python

This searches the datasette directory for code that matches the search pattern, in a syntax-aware way.

It works across 19 different languages, and can handle search-and-replace too, so it can work as a powerful syntax-aware refactoring tool.
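To illustrate what "syntax-aware" means here, the following is a rough sketch that uses Python's standard library `ast` module rather than ast-grep itself. It looks for the same shape as the pattern above, an `await` wrapping a call to `await_me_maybe` with a single argument, and ignores text that only looks like a match, such as a comment. The `SOURCE` snippet and the `find_awaited_calls` helper are made up for this example.

```python
import ast

# Hypothetical source to search; the comment on line 5 looks like a match to
# a plain text search but is not part of the syntax tree.
SOURCE = '''
async def view(request):
    result = await await_me_maybe(callback(request))
    other = await something_else()
    # await await_me_maybe(in_a_comment)
    return result
'''


def find_awaited_calls(source, func_name):
    """Return (line number, argument source) for each `await func_name(<arg>)`."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Await)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == func_name
            and len(node.value.args) == 1
            and not node.value.keywords
        ):
            # ast.unparse (Python 3.9+) turns the argument node back into code
            matches.append((node.lineno, ast.unparse(node.value.args[0])))
    return matches


matches = find_awaited_calls(SOURCE, "await_me_maybe")
for line, argument in matches:
    print(f"line {line}: $ARG = {argument}")
```

A grep for the string `await await_me_maybe(` would also report the comment; the tree-based search only reports the real expression.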

My favourite detail is how it’s packaged. You can install the CLI utility using Homebrew, Cargo, npm or pip/pipx—each of which will give you a CLI tool you can start running. On top of that it provides API bindings for Rust, JavaScript and Python! # 10th December 2023, 7:56 pm

Wikipedia search-by-vibes through millions of pages offline (via) Really cool demo by Lee Butterman, who built embeddings of 2 million Wikipedia pages and figured out how to serve them directly to the browser, where they are used to implement “vibes based” similarity search returning results in 250ms. Lots of interesting details about how he pulled this off, using Arrow as the file format and ONNX to run the model in the browser. # 4th September 2023, 9:13 pm

Building Search DSLs with Django (via) Neat tutorial by Dan Lamanna: how to build a GitHub-style search feature—supporting modifiers like “is:open author:danlamanna”—using PyParsing and the Django ORM. # 19th June 2023, 8:30 am
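As a rough illustration of the parsing half of that idea, here is a minimal sketch that splits a query into `key:value` modifiers and free-text terms. It uses the standard library's `shlex` instead of PyParsing, and the `KNOWN_MODIFIERS` set and `parse_query` function are assumptions for this example, not code from the tutorial.

```python
import shlex

# Hypothetical set of modifiers the search feature understands
KNOWN_MODIFIERS = {"is", "author"}


def parse_query(query):
    """Split a query into ({modifier: [values]}, [free-text terms])."""
    modifiers = {}
    terms = []
    # shlex.split keeps "quoted phrases" together as a single token
    for token in shlex.split(query):
        key, sep, value = token.partition(":")
        if sep and value and key in KNOWN_MODIFIERS:
            modifiers.setdefault(key, []).append(value)
        else:
            # Unknown modifiers fall through and are treated as plain text
            terms.append(token)
    return modifiers, terms


modifiers, terms = parse_query('is:open author:danlamanna "login bug"')
print(modifiers)
print(terms)
```

In a Django project the resulting modifiers would then be translated into ORM filters, with the free-text terms handled by whatever text search the project uses.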

GitHub code search is generally available. I’ve been a beta user of GitHub’s new code search for a year and a half now and I wouldn’t want to be without it. It’s spectacularly useful: it provides fast, regular-expression-capable search across every public line of code hosted by GitHub—plus code in private repos you have access to.

I mainly use it to compensate for libraries with poor documentation—I can usually find an example of exactly what I want to do somewhere on GitHub.

It’s also great for researching how people are using libraries that I’ve released myself—to figure out how much pain deprecating a method would cause, for example. # 8th May 2023, 6:52 pm

Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries (via) Computational journalism professor Nick Diakopoulos takes a deeper dive into the quality of the summarizations provided by AI-assisted Bing. His findings are troubling: for news queries, which are a great test for AI summarization since they include recent information that may have sparse or conflicting stories, Bing confidently produces answers with important errors, for example claiming the Ohio train derailment happened on February 9th when it actually happened on February 3rd. # 18th February 2023, 6:09 pm

The technology behind GitHub’s new code search (via) I’ve been a beta user of the new GitHub code search for a while and I absolutely love it: you really can run a regular expression search across the entirety of GitHub, which is absurdly useful both for finding code examples of under-documented APIs and for seeing how people are using open source code that you have released yourself. It turns out GitHub built their own search engine for this from scratch, called Blackbird. It’s implemented in Rust and makes clever use of sharded ngram indexes (not just trigrams, because it turns out those aren’t quite selective enough for a corpus that includes a lot of three letter keywords like “for”).

I also really appreciated the insight into how they handle visibility permissions: they compile those into additional internal search clauses, resulting in things like “RepoIDs(...) or PublicRepo()” # 6th February 2023, 6:38 pm
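The general ngram index idea can be sketched in a few lines of Python. This is a toy, single-process illustration and not how Blackbird is implemented: it maps every trigram to the documents containing it, intersects those sets to get candidates, then confirms each candidate with a real substring check. The `NgramIndex` class and the sample documents are invented for the example.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # ngram -> ids of documents containing it
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in ngrams(text, self.n):
            self.postings[gram].add(doc_id)

    def candidates(self, query):
        """Documents containing every ngram of the query (may be false positives)."""
        grams = ngrams(query, self.n)
        if not grams:
            return set(self.docs)  # query too short to narrow anything down
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, query):
        """Confirm candidates with an actual substring check."""
        return sorted(d for d in self.candidates(query) if query in self.docs[d])


index = NgramIndex()
index.add(1, "print(format(x))")
index.add(2, "for norm in rma: mat")
index.add(3, "return total")

print(index.candidates("format"))  # document 2 shares every trigram but is not a match
print(index.search("format"))
```

Document 2 shows why common trigrams such as "for" are not very selective: it survives the index lookup and is only rejected by the final check.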

Semantic text search using embeddings. Example Python notebook from OpenAI demonstrating how to build a search engine using embeddings rather than straight up token matching. This is a fascinating way of implementing search, providing results that match the intent of the search (“delicious beans” for example) even if none of the keywords are actually present in the text. # 9th November 2022, 7:57 pm
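The ranking step behind this kind of search can be sketched with cosine similarity over vectors. The three-dimensional vectors below are invented by hand purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. The `DOCS` data and the `search` helper are assumptions for this example, not code from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": document text -> vector
DOCS = {
    "Tasty baked legumes in tomato sauce": [0.9, 0.1, 0.0],
    "Quarterly tax filing deadlines": [0.0, 0.2, 0.9],
    "A hearty chili recipe": [0.7, 0.3, 0.1],
}


def search(query_vector, docs, limit=2):
    """Return the texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in docs.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:limit]]


# Pretend this is the embedding of the query "delicious beans"
QUERY_VECTOR = [0.8, 0.2, 0.0]
results = search(QUERY_VECTOR, DOCS)
print(results)
```

Neither "delicious" nor "beans" appears in the top result; the match comes entirely from the vectors pointing in a similar direction.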

Simple, Fast, and Scalable Reverse Image Search Using Perceptual Hashes and DynamoDB. Christopher Bong provides a clear explanation of how perceptual hashes can be used to create a string representing the visual content of an image, such that similar images can be identified by calculating a hamming distance between those hashes. He then explains how they built a large-scale system for this at Canva on top of DynamoDB, by splitting those strings into smaller hash windows and using those for efficient bulk lookups of similar candidates. # 19th October 2022, 3:04 pm
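A minimal sketch of the two ideas, Hamming distance and window-based candidate lookup, might look like the following. It assumes 64-bit hashes stored as Python integers, four 16-bit windows and an in-memory dict standing in for DynamoDB; the `HashIndex` class and all the values are hypothetical, not details from the Canva write-up.

```python
def hamming_distance(a, b):
    """Number of bit positions where two integer hashes differ."""
    return bin(a ^ b).count("1")


def windows(hash_value, bits=64, window_bits=16):
    """Split a hash into (position, window value) pairs."""
    mask = (1 << window_bits) - 1
    return [
        (i, (hash_value >> (i * window_bits)) & mask)
        for i in range(bits // window_bits)
    ]


class HashIndex:
    def __init__(self):
        self.by_window = {}  # (position, window value) -> set of image ids
        self.hashes = {}     # image id -> full hash

    def add(self, image_id, hash_value):
        self.hashes[image_id] = hash_value
        for key in windows(hash_value):
            self.by_window.setdefault(key, set()).add(image_id)

    def similar(self, hash_value, max_distance=3):
        # Any hash within 3 bits of the query must match at least one of the
        # 4 windows exactly, so exact window lookups find every candidate.
        candidates = set()
        for key in windows(hash_value):
            candidates |= self.by_window.get(key, set())
        return sorted(
            image_id
            for image_id in candidates
            if hamming_distance(self.hashes[image_id], hash_value) <= max_distance
        )


original = 0xF0F0AAAA12345678
near = original ^ 0b101                  # same hash with two bits flipped
unrelated = original ^ 0xFFFFFFFFFFFFFFFF  # every bit flipped

index = HashIndex()
index.add("original", original)
index.add("unrelated", unrelated)
print(index.similar(near))
```

The exact-match window lookups are what make this a good fit for a key-value store: each lookup is a simple key read, and the more expensive distance calculation only runs on the small candidate set.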

Abusing AWS Lambda to make an Aussie Search Engine (via) Ben Boyter built a search engine that only indexes .au Australian websites, with the novel approach of directly compiling the search index into 250 different ~40MB large lambda functions written in Go, then running searches across 12 million pages by farming them out to all of the lambdas and combining the results. His write-up includes all sorts of details about how he built this, including how he ran the indexer and how he solved the surprisingly hard problem of returning good-enough text snippets for the results. # 16th January 2022, 8:52 pm
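The fan-out and combine pattern can be sketched as follows. This is a toy stand-in written in Python with threads, not the Go lambdas described in the write-up: each shard is a small dict, the scoring is a naive term count, and `SHARDS`, `search_shard` and `search` are all invented for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each maps a page name to its text
SHARDS = [
    {"site-a": "surf report and surf forecast", "site-b": "tax advice"},
    {"site-c": "surf shop opening hours"},
    {"site-d": "bushwalking trail maps"},
]


def search_shard(shard, query):
    """Score every page in one shard by counting query term occurrences."""
    terms = query.lower().split()
    hits = []
    for page, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            hits.append((score, page))
    return hits


def search(query, shards, limit=3):
    """Send the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda shard: search_shard(shard, query), shards)
        all_hits = [hit for hits in per_shard for hit in hits]
    # Keep only the highest scoring hits across all shards
    return heapq.nlargest(limit, all_hits)


results = search("surf", SHARDS)
print(results)
```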

sqlite-utils 2.14 (via) I finally figured out porter stemming with SQLite full-text search today: it turns out it’s as easy as adding tokenize='porter' to the CREATE VIRTUAL TABLE statement. So I just shipped sqlite-utils 2.14 with a tokenize= option (plus the ability to insert binary file data from stdin). # 1st August 2020, 9:19 pm
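Here is a small sketch of that option in action using Python's built-in `sqlite3` module directly rather than sqlite-utils. It assumes the underlying SQLite library was compiled with the FTS5 extension (the choice of FTS5 over FTS4 is an assumption for this example); the `docs` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# tokenize='porter' applies the Porter stemmer to indexed text and to queries
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ("Running shoes for trail runners",),
        ("A guide to baking bread",),
    ],
)

# "run" matches "Running" because both stem to the same token
rows = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH ?", ("run",)
).fetchall()
print(rows)
```

Without the porter tokenizer the same query would return nothing, since "run" does not appear as a whole word in either row.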

PostgreSQL full-text search in the Django Admin. Today I figured out how to use PostgreSQL full-text search in the Django admin for my blog, using the get_search_results method on a subclass of ModelAdmin. # 25th July 2020, 11:05 pm

Reducing search indexing latency to one second. Really detailed dive into the nuts and bolts of Twitter’s latest iteration of search indexing technology, including a great explanation of skip lists. # 26th June 2020, 5:06 pm

Guide To Using Reverse Image Search For Investigations (via) Detailed guide from Bellingcat’s Aric Toler on using reverse image search for investigative reporting. Surprisingly, Google Image Search isn’t the state of the art: Russian search engine Yandex offers a much more powerful solution, mainly because it’s the largest public-facing image search engine to integrate scary levels of face recognition. # 30th December 2019, 10:23 pm

Falsehoods Programmers Believe About Search (via) These are great. “When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon”. # 29th May 2019, 8:09 pm

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista, provides some insider thoughts on how they lost against Google. # 16th February 2019, 6:57 pm

Fast Autocomplete Search for Your Website (via) I wrote a tutorial for the 24 ways advent calendar on building fast autocomplete search for a website on top of Datasette and SQLite. I built the demo against 24 ways itself—I used wget to recursively fetch all 330 articles as HTML, then wrote code in a Jupyter notebook to extract the raw data from them (with BeautifulSoup) and load them into SQLite using my sqlite-utils Python library. I deployed the resulting database using Datasette, then wrote some vanilla JavaScript to implement autocomplete using fast SQL queries against the Datasette JSON API. # 19th December 2018, 12:26 am

Datasette: Full-text search. I wrote some documentation for Datasette’s full-text search feature, which detects tables which have been configured to use the SQLite FTS module and adds a search input box and support for a _search= querystring parameter. # 12th May 2018, 12:09 pm

Typesense (via) A new (to me) open source search engine, with a focus on being “typo-tolerant” and offering great, fast autocomplete, which is incredibly important now that most searches take place using a mobile phone keyboard. Similar to Elasticsearch or Solr in that it runs as an HTTP server that you talk to by sending JSON via POST and GET, and it offers read-only replicas for scaling and high availability. And since it’s 2018, if you have Docker running (I use Docker for Mac) you can start up a test instance with a one-line shell command. # 6th April 2018, 5:07 pm

elasticsearch: Percolator. Another fascinating elasticsearch feature: Percolator lets you register searches with your elasticsearch cluster, then pass in a document and have the matching query IDs returned. It’s an upside down search engine. I’m sure there are some very neat things you could build with this; I haven’t figured out what they are just yet. # 8th February 2011, 11:16 pm
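The "upside down" idea can be sketched in plain Python. This is a toy illustration of the concept and not the elasticsearch API: queries are stored as sets of required words, and a document matches a query when it contains all of them. The `Percolator` class and the query IDs are invented for the example.

```python
class Percolator:
    """Store queries up front, then ask which of them match a new document."""

    def __init__(self):
        self.queries = {}  # query id -> set of lowercase terms that must all appear

    def register(self, query_id, terms):
        self.queries[query_id] = {term.lower() for term in terms}

    def percolate(self, document):
        """Return the ids of every registered query the document satisfies."""
        words = set(document.lower().split())
        return sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms <= words  # every query term is present in the document
        )


percolator = Percolator()
percolator.register("rust-jobs", ["rust", "hiring"])
percolator.register("python-news", ["python"])
percolator.register("go-jobs", ["go", "hiring"])

matched = percolator.percolate("We are hiring Rust and Python engineers")
print(matched)
```

One plausible use is alerting: each saved search belongs to a user, and every incoming document is percolated to find out who should be notified about it.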

Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML. # 10th December 2010, 9:46 am

[UPDATE] Spatial Search in Apache Lucene and Solr. Spatial search is finally coming (back) to Solr: trunk now supports sorting and boosting by distance. # 20th July 2010, 6:28 pm

A fast, fuzzy, full-text index using Redis. Interesting twist on building a reverse-index using Redis sets: this one indexes only the metaphones of the words, resulting in a phonetic fuzzy search. # 5th May 2010, 5:51 pm
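A rough sketch of the approach, with two substitutions that should be stated plainly: a simplified Soundex-style key stands in for metaphone (which is not in the Python standard library), and Python sets in a dict stand in for Redis sets. The `phonetic_key` function and `PhoneticIndex` class are invented for this example.

```python
from collections import defaultdict

# Soundex-style consonant groups: letters that sound alike share a digit
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def phonetic_key(word):
    """Simplified Soundex-style key, used here as a stand-in for metaphone."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            key += digit
        if ch not in "hw":  # h and w do not separate repeated sounds
            previous = digit
    return (key + "000")[:4]


class PhoneticIndex:
    def __init__(self):
        self.sets = defaultdict(set)  # phonetic key -> document ids (like Redis SADD)

    def add(self, doc_id, text):
        for word in text.split():
            key = phonetic_key(word)
            if key:
                self.sets[key].add(doc_id)

    def search(self, query):
        keys = [key for key in (phonetic_key(w) for w in query.split()) if key]
        if not keys:
            return set()
        # Intersect the sets for every query word (like Redis SINTER)
        return set.intersection(*(self.sets.get(key, set()) for key in keys))


index = PhoneticIndex()
index.add(1, "Notes from John Smith")
index.add(2, "Robert wrote about Redis")

print(index.search("jon smyth"))
print(index.search("rupert"))
```

Because only the phonetic keys are indexed, misspellings that sound the same land in the same set, which is what makes the search fuzzy.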

Search Engine Time Machine. Detailed explanation of how ElasticSearch provides high availability, through clever sharding and replication strategies and configurable gateways for long-term persistent storage. # 17th February 2010, 10:32 pm

ElasticSearch: Your Data, Your Search. A neat example of how ElasticSearch’s schemaless indexes and native JSON support make it ridiculously easy to index different types of data and run queries across them. # 12th February 2010, 3:22 pm

Elastic Search (via) Solr has competition! Like Solr, Elastic Search provides a RESTful JSON HTTP interface to Lucene. The focus here is on distribution, auto-sharding and high availability. It’s even easier to get started with than Solr, partly due to the focus on providing a schema-less document store, but it’s currently missing out on a bunch of useful Solr features (a web interface and faceting are the two that stand out). The high availability features look particularly interesting. UPDATE: I was incorrect, basic faceted queries are already supported. # 11th February 2010, 6:33 pm

The Seven Deadly Sins of Solr. Useful advice on managing and deploying Solr. # 24th January 2010, 1:30 pm

Haystack 1.0 Final Released. I’ve used Haystack on a number of projects recently, and it has proved itself as a completely painless way of adding full-text search (using Solr or Whoosh—I haven’t tried the Xapian backend yet) to a Django ORM powered project in just a few minutes. Congratulations, Daniel + contributors. # 30th November 2009, 8:07 am

Large Problems in Django, Mostly Solved: Search. Eric Holscher shows how Haystack uses a number of common Django patterns (object registration, pluggable backends, QuerySet-style chaining and class-based views) to great effect in creating a powerful search application for Django. Makes me wonder if more of those patterns should be promoted to first class concepts within Django. # 3rd November 2009, 10:42 am

So’s your facet: Faceted global search for Mozilla Thunderbird. Yes! This is the kind of innovation I’ve been hoping would show up in e-mail clients for years. Faceting is a really natural fit for e-mail. # 4th September 2009, 10:29 am