Whoosh. A brand new, pure-Python full-text indexing engine (think Lucene). Claims to offer performance in the same league as wrappers to C or Java libraries. If this works as well as it claims, it will be an excellent tool for adding search to projects that wish to avoid a dependency on an external engine.
Check out Hypy, which has all the same features except the spellchecker and is based on the well-proven Hyper Estraier engine.
http://goonmill.org/hypy/
This blog's search runs on Hyper Estraier at the moment - or rather doesn't, because I've corrupted the search index somehow and haven't had the time to figure out how to rebuild it. I'm actually running a Hyper Estraier "peer to peer" node and talking to it via its HTTP interface using code written in Python.
I'm keen on switching to something else, though, to reduce my external dependencies - so Whoosh looks ideal.
The main features of Whoosh are (1) faster than other pure Python libraries, (2) no dependency on compiled libraries, (3) highly customizable by Python programmers.
It's not going to be as fast as a compiled library, obviously. The reason it's competitive at indexing with the Python bindings for Xapian and PyLucene is that those bindings waste (IMHO) a lot of time instantiating and then throwing away temporary Python objects.
The compiled libraries have their own shortcomings, though. If you're a Python programmer/web developer interested in customizing search behavior, or just in information retrieval for its own sake, it's possible you might want to check Whoosh out even if you've gotten PyLucene/Xappy/Hypy/other to work and you're happy with it.
Matt Chaput (the author of Whoosh)
Matt Chaput - 13th February 2009 04:07 - #
I'm using PyLucene with some pretty big indexes (several gigabytes in some cases), and I'm definitely in the market for an indexing solution which not only offers optimised disk I/O (which Lucene can manage quite well), but also has sensible policies on document identifiers (which Lucene doesn't have), and maybe permits optimised storage of token position information (which Lucene likes to relegate to records whose use isn't conducive to that optimised I/O). I don't care about incremental updates to the index.
Would Whoosh be something for me, Matt?
Paul Boddie - 14th February 2009 00:49 - #
Oh, and before I ask for my pony, I'd like to point out that stemming and tokenising are things I can do myself and don't really want done magically for me - the inversion of control, or whatever people call it, can be inconvenient.
In fact, with Lucene, I've written code to search for token combinations instead of using the built-in stuff, so access to the index as like, well, an index, is highly desirable.
Just to finish off, I think that the Lucene API is alright for something from the Java world, whereas the Xapian bindings look and feel fragile, and I've never managed to motivate myself to get into Hyper Estraier. Sometimes, though, I just think that I'm an awkward case in the full-text search world.
Paul Boddie - 14th February 2009 00:54 - #
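To make Paul's point about using the index "as an index" concrete, here is a minimal sketch of a positional inverted index in plain Python, with a lookup for one token immediately following another. It is not Whoosh's or Lucene's API; the names `build_positional_index` and `find_adjacent` and the sample documents are invented for illustration, and the documents are assumed to be tokenised by the caller already.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]}; docs holds pre-tokenised lists."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return index


def find_adjacent(index, first, second):
    """Return sorted doc ids where `second` immediately follows `first`."""
    hits = []
    # .get() avoids creating empty entries for unknown tokens.
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both tokens can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        following = set(second_postings[doc_id])
        if any(pos + 1 in following for pos in first_postings[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample data: the caller has done its own tokenising.
docs = {
    1: ["pure", "python", "full", "text", "search"],
    2: ["full", "python", "text", "index"],
    3: ["fast", "full", "text", "indexing"],
}
index = build_positional_index(docs)
matches = find_adjacent(index, "full", "text")
```

Because the postings are ordinary dictionaries and lists, other token combinations (for example, tokens within some distance of each other) could be written as further functions over the same structure, which is roughly the kind of direct access the comment asks for.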