Simon Willison’s Weblog

Subscribe

Blogmarks tagged search in 2023

Filters: Type: blogmark × Year: 2023 × search × Sorted by date


ast-grep (via) There are a lot of interesting things about this year-old project.

sg (an alias for ast-grep) is a CLI tool for running AST-based searches against code, built in Rust on top of the Tree-sitter parsing library. You can run commands like this:

sg -p ’await await_me_maybe($ARG)’ datasette --lang python

To search the datasette directory for code that matches the search pattern, in a syntax-aware way.

It works across 19 different languages, and can handle search-and-replace too, so it can work as a powerful syntax-aware refactoring tool.

My favourite detail is how it’s packaged. You can install the CLI utility using Homebrew, Cargo, npm or pip/pipx—each of which will give you a CLI tool you can start running. On top of that it provides API bindings for Rust, JavaScript and Python! # 10th December 2023, 7:56 pm

Wikipedia search-by-vibes through millions of pages offline (via) Really cool demo by Lee Butterman, who built embeddings of 2 million Wikipedia pages and figured out how to serve them directly to the browser, where they are used to implement “vibes based” similarity search returning results in 250ms. Lots of interesting details about how he pulled this off, using Arrow as the file format and ONNX to run the model in the browser. # 4th September 2023, 9:13 pm

Building Search DSLs with Django (via) Neat tutorial by Dan Lamanna: how to build a GitHub-style search feature—supporting modifiers like “is:open author:danlamanna”—using PyParsing and the Django ORM. # 19th June 2023, 8:30 am

GitHub code search is generally available. I’ve been a beta user of GitHub’s new code search for a year and a half now and I wouldn’t want to be without it. It’s spectacularly useful: it provides fast, regular-expression-capable search across every public line of code hosted by GitHub—plus code in private repos you have access to.

I mainly use it to compensate for libraries with poor documentation—I can usually find an example of exactly what I want to do somewhere on GitHub.

It’s also great for researching how people are using libraries that I’ve released myself—to figure out how much pain deprecating a method would cause, for example. # 8th May 2023, 6:52 pm

Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries (via) Computational journalism professor Nick Diakopoulos takes a deeper dive into the quality of the summarizations provided by AI-assisted Bing. His findings are troubling: for news queries, which are a great test for AI summarization since they include recent information that may have sparse or conflicting stories, Bing confidently produces answers with important errors: claiming the Ohio train derailment happened on February 9th when it actually happened on February 3rd for example. # 18th February 2023, 6:09 pm

The technology behind GitHub’s new code search (via) I’ve been a beta user of the new GitHub code search for a while and I absolutely love it: you really can run a regular expression search across the entire of GitHub, which is absurdly useful for both finding code examples of under-documented APIs and for seeing how people are using open source code that you have released yourself. It turns out GitHub built their own search engine for this from scratch, called Blackbird. It’s implemented in Rust and makes clever use of sharded ngram indexes—not just trigrams, because it turns out those aren’t quite selective enough for a corpus that includes a lot of three letter keywords like “for”.

I also really appreciated the insight into how they handle visibility permissions: they compile those into additional internal search clauses, resulting in things like “RepoIDs(...) or PublicRepo()” # 6th February 2023, 6:38 pm