
Simon Willison’s Weblog

Giving away the index

My final year project is due in two weeks, and I’m going to be running on silent for most of them. I have, however, upgraded to Tiger, and playing with Spotlight has given me plenty to think about.

Giving away the index

The great benefit of having an electronic version of a book you own in dead-tree format to hand is that you can search it. Publishers generally don’t hand out free digital copies because, well, they want you to buy the books, not freely distribute electronic copies.

The thing is, you don’t need a digital copy of a book to be able to search it; you just need a full-text index of it (if you don’t understand what this means, go and read Tim Bray’s series On Search). An index isn’t enough to reconstruct the book, but it is enough to answer questions like "on what pages of Eric Meyer on CSS are float layouts discussed?"
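To make the idea concrete, here is a minimal Python sketch of a page-level full-text index. It is an illustration only: the `pages` dictionary, the sample sentences, and the function names `build_index` and `pages_matching` are all made up for this example, and a real index format would need stemming, phrase handling, and a compact binary encoding. The sketch stores only which pages each word appears on, with no word order or positions, so it can answer "which pages?" without containing the text itself.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    `pages` is a hypothetical {page_number: page_text} dictionary.
    Only page numbers are kept: no word order, no positions, no counts.
    """
    index = defaultdict(set)
    for number, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def pages_matching(index, query):
    """Return the pages that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Made-up sample text standing in for three pages of a book.
pages = {
    12: "Floats were designed for wrapping text around images.",
    47: "A two-column float layout needs a clearing element.",
    48: "Positioned layout avoids the float clearing problem.",
}

index = build_index(pages)
print(pages_matching(index, "float layout"))  # [47, 48]
```

Page 12 is not returned because this sketch does no stemming, so "floats" and "float" are different words; that is exactly the kind of detail a shared format would have to pin down.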

Imagine if technical publishers made binary full-text index files of their titles available for download, for free, in some kind of open standard format. Readers could query them using Spotlight or similar technologies, and gain the ability to search the titles they own, all without needing to rely on centralised, artificially limited services such as Amazon’s Search Inside the Book.

O’Reilly, I’m looking at you.

Full-text phishing

On a darker note, one thing about Spotlight that has given me pause is the immense ease with which it can uncover passwords saved amongst my email. Lost password reminders, new account details, invitations to sign up for services—they’re all hidden away in my mail archive. Spotlight makes it trivial to dig them back up again, and offers the APIs for applications to do so as well. Combine this with a piece of spyware / some trojan horse and you’ve got the ultimate vector for phishing attacks.

This problem isn’t limited to Macs either; Google and MSN’s Desktop Search engines could be used for much the same purpose, and full-text search is bound to end up built in to Windows sooner or later. For the moment, the safest thing to do is either delete those pesky emails or move them to a folder that is excluded from Spotlight’s index. Somehow I doubt many people will think to take such precautions.
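For anyone who does want to take that precaution, the hunt for offending messages can be partly automated. What follows is a rough Python sketch, not a vetted tool: the keyword list, the function name `looks_sensitive`, and the two sample messages are assumptions made up for illustration, and it only inspects the subject line and plain-text parts of a message. It relies on nothing beyond the standard library's `email` and `re` modules.

```python
import email
import re

# Hypothetical keyword list; tune it to the services you actually use.
SENSITIVE = re.compile(r"\b(password|passphrase|login details)\b", re.IGNORECASE)


def looks_sensitive(raw_message):
    """Return True if the subject or a text/plain part mentions a keyword."""
    message = email.message_from_string(raw_message)
    if SENSITIVE.search(message.get("Subject", "")):
        return True
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the message headers; fall back.
            text = payload.decode("utf-8", errors="replace")
        if SENSITIVE.search(text):
            return True
    return False


# Made-up sample messages, not real mail.
samples = [
    "Subject: Welcome aboard\n\nYour password is: example-only\n",
    "Subject: Lunch on Friday?\n\nSee you at noon.\n",
]

print([looks_sensitive(raw) for raw in samples])  # [True, False]
```

To run this over a real archive you would need to feed it raw messages from whatever format your mail client uses; Python's standard `mailbox` module can read several common ones. Flagged messages are candidates for deletion or for moving to an excluded folder, and a keyword match is only a hint, so expect both false positives and misses.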

And with that off my chest, it’s time to get back to my dissertation.

This is Giving away the index by Simon Willison, posted on 4th May 2005.


9 comments

  1. That's impressive - I went from excited and hopeful to scared and dejected about the same thing all in the space of a few paragraphs. Now that's (geek) drama. At least I don't have my email or online banking passwords saved in my email archive *checks frantically*. Good luck on your dissertation.

    Wilson Miner - 4th May 2005 02:16 - #

  2. Is encryption an option?

    Sunn - 4th May 2005 04:36 - #

  3. Great idea, Simon. Reference book indexes are often woefully inadequate when you want to find something quickly. Magazine archives would be the same - National Geographic springs to mind as a publication that could well benefit from this (many of their readers have well over ten years' worth archived). As for the password thing - I don't save any, any more. I delete emails containing passwords when I receive them (they can almost always be resent later if needed), and use a modified version of the password generator bookmarklet you pointed to some time ago (meaning I only really even need to remember my master password anyway).

    Dave Child - 4th May 2005 10:06 - #

  4. Very thoughtful post. I really like the idea of searchable book indexes; it would seem to benefit both the consumer and the producer.

    What you said about information in emails also struck home. I run Windows and will certainly keep your point in mind when I upgrade to Longhorn. Thanks for an important reminder.

    nitr0z - 4th May 2005 11:54 - #

  5. Those passwords have always been there - a program could run grep and get them. Now it's just a little bit easier to see them. :)

    Daniel Von Fange - 4th May 2005 16:59 - #

  6. The problem with passwords being saved within your email is one of education of web developers (who primarily use this form of password recovery). It would be simply negated as a vector of attack if one-time passwords or URLs were used to allow the user to reset their passwords. There are several good resources discussing various aspects of security. I'd recommend reading "Password Recovery" by Charles Miller, and also browsing around OWASP.

    Ben Meadowcroft - 5th May 2005 12:50 - #

  7. I like the idea, Simon. We value our indexes and have one of the best technical indexers in the business (Bill Johncocks) working on our books, and enabling our readers to make better use of his work would be a great idea. Our books are in DocBook XML format, so in theory, I guess this wouldn't be that hard to do. We already make our indexes available online (see for example our CSS Anthology index) so we would certainly be happy to release them in an open format.

    Simon Mackie - 10th May 2005 07:51 - #

  8. On re-reading, I see that isn't really what you meant, d'oh (the clue being "full-text indexes".) It's still a good idea, and we would certainly be very happy to do it, although we'd probably need a company like O'Reilly to come up with the format in the first place.

    Simon Mackie - 10th May 2005 08:28 - #

  9. myhot maoill inbox not open

    samir - 28th August 2005 16:03 - #


Previously hosted at http://simon.incutio.com/archive/2005/05/04/spotlight
