Simon Willison on search

91 posts tagged “search”

2022

Abusing AWS Lambda to make an Aussie Search Engine (via) Ben Boyter built a search engine that only indexes .au Australian websites, with the novel approach of directly compiling the search index into 250 different ~40MB large lambda functions written in Go, then running searches across 12 million pages by farming them out to all of the lambdas and combining the results. His write-up includes all sorts of details about how he built this, including how he ran the indexer and how he solved the surprisingly hard problem of returning good-enough text snippets for the results.

# 16th January 2022, 8:52 pm / aws, go, lambda, search

2020

Building a search engine for datasette.io

This week I added a search engine to datasette.io, using the search indexing tool I’ve been building for Dogsheep.

[... 1,391 words]

6:12 pm / 19th December 2020 / projects, search, sqlite, datasette, dogsheep, weeknotes, baked-data

sqlite-utils 2.14 (via) I finally figured out porter stemming with SQLite full-text search today—it turns out it’s as easy as adding tokenize=’porter’ to the CREATE VIRTUAL TABLE statement. So I just shipped sqlite-utils 2.14 with a tokenize= option (plus the ability to insert binary file data from stdin).

# 1st August 2020, 9:19 pm / full-text-search, projects, search, sqlite, sqlite-utils

PostgreSQL full-text search in the Django Admin. Today I figured out how to use PostgreSQL full-text search in the Django admin for my blog, using the get_search_results method on a subclass of ModelAdmin.

# 25th July 2020, 11:05 pm / django, django-admin, postgresql, search

Reducing search indexing latency to one second. Really detailed dive into the nuts and bolts of Twitter’s latest iteration of search indexing technology, including a great explanation of skip lists.

# 26th June 2020, 5:06 pm / data-structures, lucene, scaling, search, twitter

datasette-search-all: a new plugin for searching multiple Datasette tables at once

I just released a new plugin for Datasette, and it’s pretty fun. datasette-search-all is a plugin written mostly in JavaScript that executes the same search query against every searchable table in every database connected to your Datasette instance.

[... 819 words]

12:59 am / 9th March 2020 / full-text-search, javascript, plugins, projects, search, datasette

Weeknotes: datasette-ics, datasette-upload-csvs, datasette-configure-fts, asgi-csrf

I’ve been preparing for the NICAR 2020 Data Journalism conference this week which has lead me into a flurry of activity across a plethora of different projects and plugins.

[... 834 words]

2:27 am / 4th March 2020 / csrf, data-journalism, icalendar, plugins, projects, search, security, datasette, asgi, weeknotes, datasette-cloud

2019

Guide To Using Reverse Image Search For Investigations (via) Detailed guide from Bellingcat’s Aric Toler on using reverse image search for investigative reporting. Surprisingly Google Image Search isn’t the state of the art: Russian search engine Yandex offers a much more powerful solution, mainly because it’s the largest public-facing image search engine to integrate scary levels of face recognition.

# 30th December 2019, 10:23 pm / journalism, search, bellingcat

Falsehoods Programmers Believe About Search (via) These are great. “When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon”.

# 29th May 2019, 8:09 pm / search

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista provides some insider thoughts on how they lost against Google.

# 16th February 2019, 6:57 pm / computer-history, google, internet-history, search

Exploring search relevance algorithms with SQLite

SQLite isn’t just a fast, high quality embedded database: it also incorporates a powerful full-text search engine in the form of the FTS4 and FTS5 extensions. You’ve probably used these a bunch of times already: many iOS, Android and desktop applications use SQLite under-the-hood and use it to implement their built-in search.

[... 1,390 words]

3:29 am / 7th January 2019 / full-text-search, projects, search, sqlite, datasette

2018

Fast Autocomplete Search for Your Website (via) I wrote a tutorial for the 24 ways advent calendar on building fast autocomplete search for a website on top of Datasette and SQLite. I built the demo against 24 ways itself—I used wget to recursively fetch all 330 articles as HTML, then wrote code in a Jupyter notebook to extract the raw data from them (with BeautifulSoup) and load them into SQLite using my sqlite-utils Python library. I deployed the resulting database using Datasette, then wrote some vanilla JavaScript to implement autocomplete using fast SQL queries against the Datasette JSON API.

# 19th December 2018, 12:26 am / 24-ways, autocomplete, beautifulsoup, search, sqlite, jupyter, datasette

Datasette: Full-text search. I wrote some documentation for Datasette’s full-text search feature, which detects tables which have been configured to use the SQLite FTS module and adds a search input box and support for a _search= querystring parameter.

# 12th May 2018, 12:09 pm / full-text-search, search, sqlite, datasette

Typesense (via) A new (to me) open source search engine, with a focus on being “typo-tolerant” and offering great, fast autocomplete—incredibly important now that most searches take place using a mobile phone keyboard. Similar to Elasticsearch or Solr in that it runs as an HTTP server that you serve JSON via POST and GET—and it offers read-only replicas for scaling and high availability. And since it’s 2018, if you have Docker running (I use Docker for Mac) you can start up a test instance with a one-line shell command.

# 6th April 2018, 5:07 pm / autocomplete, open-source, search

2017

New in Datasette: filters, foreign keys and search

I’ve released Datasette 0.13 with a number of exciting new features (Datasette previously).

[... 1,143 words]

9:17 pm / 25th November 2017 / csv, projects, search, sqlite, datasette

Implementing faceted search with Django and PostgreSQL

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL“like” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:

[... 3,103 words]

2:12 pm / 5th October 2017 / django, full-text-search, orm, postgresql, projects, search, facetedsearch

2013

Why is site search so bad on most websites?

It’s not so much that site search is bad, it’s that your expectations have been raised enormously high by the incredible quality of search provided by search engines like Google.

[... 125 words]

5:28 pm / 3rd May 2013 / search, search-engines, quora

2012

Is there a place or portal where I can search for hashtags which track possible upcoming events or topics?

Our site http://lanyrd.com/ includes hashtags for thousands of upcoming conferences and professional events.

[... 39 words]

1:58 pm / 3rd July 2012 / search, quora

What are the best events search engines?

Since I co-founded one I’m certainly not qualified to express an opinion on which ones are best, but here are a few of my favourites:

[... 233 words]

11:15 am / 13th January 2012 / events, marketing, search, search-engines, quora

What kind of publicly available search software is able to be purchased or used freely as part of a website, and how good is it?

There are plenty of good open source options—Solr is currently my favourite. It’s extremely powerful but you do need to do some programming on top of it—I use Django and Haystack to build the search UI on most of my projects.

[... 115 words]

1:45 pm / 6th January 2012 / search, search-engines, web-development, quora

2011

Why does Wolfram|Alpha present all search results as pictures rather than text?

Wolfram Alpha is essentially a web interface to Mathematica (plus a huge corpus of structured data). Mathematica has been around for decades, and has an extremely sophisticated visualisation engine (try typing “sin(x)/cos(x)” in to Wolfram Alpha and see what happens). It’s also very good at rendering mathematical formulae that would be very hard to represent in plain HTML (without using MathML, which isn’t supported by IE).

[... 137 words]

4:50 pm / 25th December 2011 / search, usability, quora

Twitter API: What is the best data storage mechanism and client library for analysing tweets using Python?

It depends on how much data you intend to collect, and how you intend to then share that data.

[... 182 words]

12:48 pm / 21st December 2011 / python, search, twitter, quora

elasticsearch: Percolator. Another fascinating elasticsearch feature: Percolator lets you register searches with your elasticsearch cluster, then pass in a document and have the matching query IDs returned. It’s an upside down search engine. I’m sure there are some very neat things you could build with this, I just haven’t figured out what they are just yet.

# 8th February 2011, 11:16 pm / elasticsearch, search, recovered

2010

Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML.

# 10th December 2010, 9:46 am / json, search, solr, xml, recovered

Who are major competitors to Solr?

ElasticSearch is a really interesting one—it’s the same underlying search library (Lucene) and the same integration model (an HTTP interface) but takes quite a different approach. It hasn’t been around for a long time but it looks very impressive: http://www.elasticsearch.com/

[... 95 words]

6:01 pm / 2nd September 2010 / apache, lucene, search, search-engines, solr, sphinx-search, xapian, quora

How do Solr, Lucene, Sphinx and Searchify compare?

Lucene is a Java library for creating and searching through a full text index. If you want to make use of it, you’ll need to write your own Java code that integrates with it.

[... 109 words]

2:14 pm / 26th August 2010 / databases, lucene, search, search-engines, solr, sphinx-search, web-development, quora

Which major companies are using Solr for search?

The Guardian newspaper uses Solr for its Open Platform Content API. http://www.guardian.co.uk/open-p...

[... 27 words]

11:43 am / 25th August 2010 / lucene, open-source, search, solr, quora

[UPDATE] Spatial Search in Apache Lucene and Solr. Spacial search is finally coming (back) to Solr—trunk now supports sorting and boosting by distance.

# 20th July 2010, 6:28 pm / lucene, search, solr, recovered, spatialsearch

A fast, fuzzy, full-text index using Redis. Interesting twist on building a reverse-index using Redis sets: this one indexes only the metaphones of the words, resulting in a phonetic fuzzy search.

# 5th May 2010, 5:51 pm / full-text-search, redis, search, recovered, fuzzy, metaphone

Search Engine Time Machine. Detailed explanation of how ElasticSearch provides high availability, through clever sharding and replication strategies and configurable gateways for long-term persistent storage.

# 17th February 2010, 10:32 pm / elasticsearch, highavailability, scaling, search

«« first « previous page 2 / 4 next » last »»