Simon Willison’s Weblog

Experimental feature: Related entries

I’m experimenting with using MySQL full text indexing to generate a list of “related entries” for each entry (click on an item’s permalink to see it in action). It works by concatenating the item’s title and entry body and running a full text search for the combined text, which sounds horrendously inefficient but seems to work surprisingly quickly. If I decide to keep the feature I’ll probably cache the results somewhere to reduce the overhead, but for the moment it’s fast enough.
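
A minimal sketch of the approach described above, not the code actually running on this site. It assumes a hypothetical `entries` table with `id`, `title` and `body` columns and a FULLTEXT index on `(title, body)`; the function name and the `%s` placeholder style (as used by common Python MySQL drivers) are also assumptions. It only builds the query and its parameters, so it can be read without a database to hand.

```python
def build_related_query(title, body, entry_id, limit=5):
    """Return (sql, params) for a MySQL natural-language full text search.

    The search string is simply the entry's title and body joined together.
    Table and column names here are hypothetical.
    """
    search_text = f"{title} {body}"
    sql = (
        "SELECT id, title, "
        "MATCH (title, body) AGAINST (%s) AS score "
        "FROM entries "
        "WHERE MATCH (title, body) AGAINST (%s) "
        "AND id <> %s "  # do not list the entry as related to itself
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    return sql, (search_text, search_text, entry_id, limit)


sql, params = build_related_query("Related entries", "MySQL full text search", 42)
```

The resulting `sql` and `params` would be handed to a driver's `cursor.execute(sql, params)`. Every word in the entry becomes a search term, which is probably why a common word can pull in an unrelated entry.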

Update: I’ve turned it off again—it was resulting in ugly database timeouts all over the place. I’ll switch it back on once I’ve added caching.

Update 2: It’s back on again now, with caching. I’m still getting quite a few database timeouts for some reason—if they don’t resolve themselves I’ll have to do some more tweaking.
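
The post does not say how the caching works, so the following is only one possible shape for it: an in-memory cache keyed by entry ID that re-runs the expensive search once a stored result is older than a time limit. The class name, the `ttl_seconds` parameter and the `compute` callback are all hypothetical.

```python
import time


class RelatedCache:
    """Cache related-entry lists per entry ID, expiring them after a TTL."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without waiting
        self._store = {}     # entry_id -> (time stored, related entries)

    def get_or_compute(self, entry_id, compute):
        now = self.clock()
        hit = self._store.get(entry_id)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]    # still fresh: skip the full text search
        result = compute(entry_id)  # the expensive search, supplied by the caller
        self._store[entry_id] = (now, result)
        return result
```

A cache like this only helps on repeat views of the same entry. The first view of each entry still pays the full cost of the search, so it would not by itself rule out every slow query.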

This is Experimental feature: Related entries by Simon Willison, posted on 25th April 2003.

12 comments

  1. I'm doing something similar but using a simple Perl implementation of LSI, and then running it out of cron as it uses obscene amounts of memory (to store the word list). It works pretty well. It would be interesting if one could figure out a way to quantitatively measure these strategies' effectiveness.

    kellan - 25th April 2003 19:33 - #

  2. LSI looks really cool. Unfortunately it isn't really practical in PHP as the Perl implementation relies on an external matrix library written in C for performance, and as far as I know PHP doesn't have anything comparable.

    Simon Willison - 25th April 2003 19:41 - #

  3. "PHP doesn't have anything comparable." Not so fast ;) There's Namazu, which has a PHP extension here. Not the beaten path, I know.

    Harry Fuecks - 25th April 2003 20:06 - #

  4. Why did "Gone to Glastonbury" show up under related articles?

    Andrew - 25th April 2003 22:43 - #

  5. Hi Simon! I use a very similar approach to Kellan's (using some of his code) and will be working on speeding things up very soon... see here as a starting point.

    Martin - 25th April 2003 23:37 - #

  6. re: Namazu... that's a keyword search engine. What Kellan and I are doing deals with 'VectorSpaces', which needs C-based matrix algebra but is better suited to finding 'similars'. I'm writing an essay explaining the difference, see my blog for details... (enough advertisement already ;) )

    Martin - 25th April 2003 23:40 - #

  7. Andrew: I have absolutely no idea :)

    Simon Willison - 26th April 2003 00:33 - #

  8. Ah - got it. I think it's because of the word 'update', which appears twice in this entry and once in the Glastonbury one ("no updates for a while").

    Simon Willison - 26th April 2003 01:00 - #

  9. I added a similar facility to b2++ last night. The above link has an example, and code for the function is linked off it. I decided to drop words that were less than 3 characters in length, and add a space before words. That stopped "ping" matching "shopping" in one of my postings! Donncha.

    Donncha - 29th April 2003 15:13 - #

  10. I am doing much the same thing at my site, but I am searching a different subset of the entry fields and I use an MT plugin to make the list static, eliminating the need for caching.

    Adam Kalsey - 1st May 2003 18:24 - #

  11. You're running a full text search, but with which search words? I'm just wondering...

    Pepino - 13th May 2003 20:54 - #

  12. I added this to my site for some reason which I can't remember, but on my site it uses the post_title to search the rest of the database (post_titles + post_bodies).

    owen - 16th December 2003 19:12 - #

Comments are closed.
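
Comment 9 above describes two filters: dropping words shorter than 3 characters, and adding a space before each word so that "ping" no longer matches "shopping". The sketch below is one reading of that description, assuming the words end up in SQL `LIKE` patterns; the function names are hypothetical and `like_matches` only emulates `LIKE '%...%'` for demonstration. It is not the b2++ code.

```python
import re


def like_patterns(text, min_length=3):
    """Build LIKE patterns from text, skipping short and repeated words."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    kept = []
    for word in words:
        if len(word) >= min_length and word not in kept:
            kept.append(word)
    # The leading space stops a word matching the tail of a longer word.
    return [f"% {word}%" for word in kept]


def like_matches(pattern, haystack):
    """Emulate a case-insensitive SQL LIKE '%needle%' test."""
    needle = pattern.strip("%")
    return needle in haystack.lower()


patterns = like_patterns("A ping to it")
```

With this input only "ping" survives the length filter, and its pattern matches "sent a ping" but not "went shopping today". One limitation of the leading space is that a word at the very start of a post has no space before it, so it would be missed unless the stored text is padded.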

Previously hosted at http://simon.incutio.com/archive/2003/04/25/relatedArticles