
Simon Willison’s Weblog

HTML entities for email addresses: don’t bother

I’ve suspected this for a long time, and now here’s the empirical evidence: Popular Spam Protection Technique Doesn’t Work. If you’re relying on HTML entities to protect your email address from spam harvesters—for example username@example.com—your email address may as well be in plain text. Chip Rosenthal downloaded a tool called “Web Data Extractor v4.0” and tried it on some test data to prove once and for all that the technique doesn’t work.

My advice is to use your common sense when analysing a potential spam protection technique. If you were a spammer, would you be able to outwit the method? Spammers aren’t always very smart, but the people who write spamming tools (and get paid big bucks for them) are. Also remember to think about the payoff—unencoding a bunch of entities is a cheap operation. Embedding a Javascript interpreter to decipher email addresses that are glued together using Javascript at the last possible moment is a lot harder and could slow down a tool, so it may not be worth the effort.
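To make the "cheap operation" point concrete, here is a minimal sketch of how little work a harvester has to do. It is not the code of any real tool: the page fragment, the `harvest` helper, and the deliberately simple address pattern are all assumptions made for illustration. It uses only Python's standard `html.unescape` and `re`.

```python
import html
import re

# Hypothetical page fragment: every character of the address is written as a
# decimal numeric character reference (&#117;&#115;... and so on).
address = "username@example.com"
encoded = "".join(f"&#{ord(ch)};" for ch in address)
page = f'<a href="mailto:{encoded}">{encoded}</a>'

# A deliberately simple pattern for illustration; real addresses are messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(source):
    """Decode character references, then collect anything address-shaped."""
    decoded = html.unescape(source)
    return sorted(set(EMAIL_RE.findall(decoded)))


print(harvest(page))
```

Running the same pattern over the raw, still-encoded source finds nothing, which is all the protection the entities provide. One extra standard-library call removes it.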

I’m still pretty confident in my own anti-spam harvester technique of hiding my email address behind a POST form, but even that could eventually be outsmarted by a really dedicated harvesting tool.

This is HTML entities for email addresses: don’t bother by Simon Willison, posted on 2nd December 2003.



13 comments

  1. I generally use the Hiveware Enkoder for non-scripting sites, but for my own, I just leave the PHP $recipient field in my script so that people can't find it. Why do people need to know anyone's email address? Isn't the point just to send a message? And if it's that they want a copy of their message, you can always add a checkbox to CC the person.

    Graham - 2nd December 2003 04:09 - #

  2. The first link should end in "html" instead of "htm".

    Luke - 2nd December 2003 04:45 - #

  3. I've always liked the idea of using XML+XSLT on the client side to fool spammers, but that really isn't as practical as I'd like it to be. Next time I have a site of my own, I'm thinking of doing what some sites I've signed up at do...display an image with a few random letters & numbers in it, and make the user type that string into an input box to prove they're not a robot. That should work better than most anything else, I think.

    Devon - 2nd December 2003 05:46 - #

  4. Simon -- are you familiar with Dan Benjamin's obfuscator? http://hiveware.com/enkoder_form.php

    tim - 2nd December 2003 10:17 - #

  5. My strategy of having my email address in plain view on every webpage I author and various people's comments pages too seems to be working. It's now nine years since I adopted this strategy and I only get two or three pieces of spam a day. On second thoughts, this is probably evidence that not just normal people but degenerate mutant spammers hate me...

    Rich - 2nd December 2003 12:23 - #

  6. I always assumed that spambot writers are at least as savvy as I am to these techniques and that trying to keep ahead of them or employing trivial obfuscation like entity encoding was a waste of time. Then I read a study that demonstrated otherwise. Much may have changed in the year or so since that study was run, but on the face of it simple techniques like entity encoding seem surprisingly effective.

    Not that I use it.

    Sam - 2nd December 2003 16:34 - #

  7. Devon: That's extremely bad for accessibility. The problem with excluding bots that way is that you exclude blind or visually impaired people as well. Also, if you have the facilities for generating an image and checking the response like that, you should have the facilities to use a friendlier approach like Simon's email contact form.

    Graham: Some people complain that a CC isn't filed in their e-mail system the way a sent message would be. It's not too hard to address this complaint either (assuming you've got a half decent host). Set up an e-mail address that is delivered to a script rather than a person, have people e-mail that, and have the script reply with an e-mail containing the relevant e-mail addresses. The spammers won't benefit from this e-mail address, since they generally fake their own address and so never receive the reply.

    Lach - 3rd December 2003 02:03 - #

  8. What is a good net citizen to do?!

    All these solutions are great, but they will fail on the next generation of harvester robots. Current methodologies rely on the fact that robots trawl websites for HTML source pages, which allows systems such as the Hiveware Enkoder to successfully prevent the harvesting of addresses because the page has to be rendered and the JavaScript processed before the address is available.

    New robots are now being built, however, that (in just one example) use browsers' rendering engines and JavaScript processors to construct the page as the end user would see it and then analyse the resulting DOM data structure. Systems like the Hiveware Enkoder therefore become useless, because the robots can now see the email address that the JavaScript produces.

    It is a shame that there are people out there willing to write software for this purpose. They must have no morals whatsoever!

    I still think we need to fight this war at the protocol level. Just my 2 cents.

    Paul - 3rd December 2003 16:46 - #

  9. Bummer, but that's what I've suspected for some time, and that's why a) I mix hex, ascii and text and b) added a javascript option to my Mean Dean Anti-Spam Email Obfuscation Tool.

    Still, I suspect it is only a matter of time before someone figures out how to crack that ... which is why I too use form-based email contacts on most of the sites I now develop.

    Mean Dean - 4th December 2003 10:21 - #

  10. I just delete spam. Works every time.

    Eric - 5th December 2003 02:21 - #

  11. Thanks

    abhisheksi2005 - 29th October 2004 15:02 - #

  12. I don't have one

    Alejandro - 25th September 2005 07:09 - #

  13. Here it is 2006, and my spamtest email, which appears only HTML encoded on a set of directory pages I run that are full of email addresses, still has only received 3 spams since 2002. So depending on the circumstances, it CAN work. I should set up another address that I use encoded in a wider context (don't want to test on my old spamtest because it's my "first warning" for this particular directory). The test you point to only looks at one or two HTML entities; I encode every single character in the email address, both in the anchor and in the visible text. Perhaps that's what protects it?

    Hilary Caws-Elwitt - 8th January 2006 16:48 - #


Previously hosted at http://simon.incutio.com/archive/2003/12/02/entitiesDontWork
