Simon Willison’s Weblog

45 items tagged “search”

Implementing faceted search with Django and PostgreSQL

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL“like” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:

[... 3048 words]

Why is site search so bad on most websites?

It’s not so much that site search is bad, it’s that your expectations have been raised enormously high by the incredible quality of search provided by search engines like Google.

[... 125 words]

Is there a place or portal where I can search for hashtags which track possible upcoming events or topics?

Our site includes hashtags for thousands of upcoming conferences and professional events.

[... 39 words]

What are the best events search engines?

Since I co-founded one I’m certainly not qualified to express an opinion on which ones are best, but here are a few of my favourites:

[... 233 words]

What kind of publicly available search software is able to be purchased or used freely as part of a website, and how good is it?

There are plenty of good open source options—Solr is currently my favourite. It’s extremely powerful but you do need to do some programming on top of it—I use Django and Haystack to build the search UI on most of my projects.

[... 115 words]

Why does Wolfram|Alpha present all search results as pictures rather than text?

Wolfram Alpha is essentially a web interface to Mathematica (plus a huge corpus of structured data). Mathematica has been around for decades, and has an extremely sophisticated visualisation engine (try typing “sin(x)/cos(x)” in to Wolfram Alpha and see what happens). It’s also very good at rendering mathematical formulae that would be very hard to represent in plain HTML (without using MathML, which isn’t supported by IE).

[... 137 words]

Twitter API: What is the best data storage mechanism and client library for analysing tweets using Python?

It depends on how much data you intend to collect, and how you intend to then share that data.

[... 182 words]

elasticsearch: Percolator. Another fascinating elasticsearch feature: Percolator lets you register searches with your elasticsearch cluster, then pass in a document and have the matching query IDs returned. It’s an upside down search engine. I’m sure there are some very neat things you could build with this, I just haven’t figured out what they are just yet. # 8th February 2011, 11:16 pm

Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML. # 10th December 2010, 9:46 am

Who are major competitors to Solr?

ElasticSearch is a really interesting one—it’s the same underlying search library (Lucene) and the same integration model (an HTTP interface) but takes quite a different approach. It hasn’t been around for a long time but it looks very impressive:

[... 95 words]

How do Solr, Lucene, Sphinx and Searchify compare?

Lucene is a Java library for creating and searching through a full text index. If you want to make use of it, you’ll need to write your own Java code that integrates with it.

[... 109 words]

Which major companies are using Solr for search?

The Guardian newspaper uses Solr for its Open Platform Content API.

[... 27 words]

[UPDATE] Spatial Search in Apache Lucene and Solr. Spacial search is finally coming (back) to Solr—trunk now supports sorting and boosting by distance. # 20th July 2010, 6:28 pm

A fast, fuzzy, full-text index using Redis. Interesting twist on building a reverse-index using Redis sets: this one indexes only the metaphones of the words, resulting in a phonetic fuzzy search. # 5th May 2010, 5:51 pm

Search Engine Time Machine. Detailed explanation of how ElasticSearch provides high availability, through clever sharding and replication strategies and configurable gateways for long-term persistent storage. # 17th February 2010, 10:32 pm

ElasticSearch: Your Data, Your Search. A neat example of how ElasticSearch’s schemaless indexes and native JSON support make it ridiculously easy to index different types of data and run queries across them. # 12th February 2010, 3:22 pm

Elastic Search (via) Solr has competition! Like Solr, Elastic Search provides a RESTful JSON HTTP interface to Lucene. The focus here is on distribution, auto-sharding and high availability. It’s even easier to get started with than Solr, partly due to the focus on providing a schema-less document store, but it’s currently missing out on a bunch of useful Solr features (a web interface and faceting are the two that stand out). The high availability features look particularly interesting. UPDATE: I was incorrect, basic faceted queries are already supported. # 11th February 2010, 6:33 pm

The Seven Deadly Sins of Solr. Useful advice on managing and deploying Solr. # 24th January 2010, 1:30 pm

Haystack 1.0 Final Released. I’ve used Haystack on a number of projects recently, and it has proved itself as a completely painless way of adding full-text search (using Solr or Whoosh—I haven’t tried the Xapian backend yet) to a Django ORM powered project in just a few minutes. Congratulations, Daniel + contributors. # 30th November 2009, 8:07 am

Large Problems in Django, Mostly Solved: Search. Eric Holscher shows how Haystack uses a number of common Django patterns (object registration, pluggable backends, QuerySet-style chaining and class-based views) to great effect in creating a powerful search application for Django. Makes me wonder if more of those patterns should be promoted to first class concepts within Django. # 3rd November 2009, 10:42 am

So’s your facet: Faceted global search for Mozilla Thunderbird. Yes! This is the kind of innovation I’ve been hoping would show up in e-mail clients for years. Faceting is a really natural fit for e-mail. # 4th September 2009, 10:29 am

Collection: Search Patterns. Peter Morville’s enormous collection of screenshots of search engine interfaces. # 30th July 2009, 12:35 pm

MongoDB—Capped Collections. Collections with a size limit that automatically expire older entries are interesting—useful for things like a “recent searches on this site” feature. # 7th June 2009, 12:50 pm

Haystack (via) A brand new modular search plugin for Django, by Daniel Lindsley. The interface is modelled after the Django ORM (complete with declarative classes for defining your search schema) and it ships with backends for both Solr and pure-python Whoosh, with more on the way. Excellent documentation. # 17th April 2009, 9:53 pm

Digg Search: Now With 99.987% Less Suck. Really nice implementation of faceted search, still using Lucene and Solr under the hood. # 10th April 2009, 10:17 pm

Sphinx 0.9.9-rc2 is out. Interesting new feature: the Sphinx search server now supports the MySQL binary protocol, so you can talk to it using a regular MySQL client library and fire off search queries using SELECT syntax and the new SphinxQL query language. # 8th April 2009, 1:59 pm

Guardian + Lucene = Similar Articles + Categorisation. Alf Eaton loaded 13,000 Guardian articles tagged Science in to Solr and Lucene and is using Solr’s MoreLikeThisHandler to find related articles and automatically apply Guardian tags to Nature News articles. # 11th March 2009, 12:53 pm

How uses Varnish. Includes examples of the configuration options they use. # 2nd March 2009, 5:08 pm

django-springsteen and Distributed Search. Will Larson’s Django search library currently just talks to Yahoo! BOSS, but is designed to be extensible for other external search services. Interestingly, it uses threads to fire off several HTTP requests in parallel from within the Django view. # 25th February 2009, 10:28 pm

Xapian performance comparision with Whoosh. Whoosh appears to be around four times slower than Xapian for indexing and empty cache searches, but Xapian with a full cache blows Whoosh out of the water (5408 searches/second compared to 26.3). Considering how fast Xapian is, that’s still a pretty impressive result for the pure-Python Whoosh. # 14th February 2009, 1:15 pm