Simon Willison’s Weblog

Battling comment spam

It’s a sad state of affairs when you come back to your blog after a week elsewhere and have to add another 56 domains to your blacklist. I’m actually getting more comment spam than legitimate comments now—this is becoming more than just a minor nuisance. I’m considering a number of improvements, including adding a moderation queue to comments on entries posted more than a month ago, disabling the comment form if the referral is a search engine (as per Russell Beattie’s suggestion) and adding some kind of wildcard support to the blacklist file.

I’d really rather not do any of this, but the problem looks like it’s going to escalate.
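
As a rough illustration of the wildcard idea, here is a minimal Python sketch. It assumes the blacklist is a list of shell-style patterns checked against the hostname of every link in a comment; the patterns, the URL regex and the function names are made up for the example and are not what this site runs.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical blacklist entries; "*" matches any run of characters.
BLACKLIST = [
    "cheap-pills.example",
    "*.cheap-pills.example",
    "*casino*.example",
]

# Deliberately simple URL matcher, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def linked_domains(comment_text):
    """Return the lowercased hostname of every URL found in the comment."""
    hosts = []
    for url in URL_RE.findall(comment_text):
        host = urlparse(url).hostname
        if host:
            hosts.append(host.lower())
    return hosts


def is_blacklisted(comment_text, patterns=BLACKLIST):
    """True if any linked hostname matches any blacklist pattern."""
    return any(
        fnmatchcase(host, pattern)
        for host in linked_domains(comment_text)
        for pattern in patterns
    )
```

One pattern such as `*.cheap-pills.example` would then cover every subdomain a spammer registers, instead of one blacklist line per domain.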

This is Battling comment spam by Simon Willison, posted on 30th September 2003.

27 comments

  1. I'd say it's a happy state of affairs when you come back to your blog after a week elsewhere :)

    Lars - 30th September 2003 14:16 - #

  2. Jeremy Zawodny recently had a pretty long thread on this very subject.
    http://jeremy.zawodny.com/blog/archives/000984.html

    Danny Shepherd - 30th September 2003 14:31 - #

  3. Yep, the gits have been hitting me quite a bit recently, which given the traffic my blog gets (i.e. me and people looking for research on bird anatomy getting directed here by mistake) has been a surprise. I mooted the idea of requiring a login but realised that many of the occasional but useful comments I do receive would probably vanish. Who's going to go through the hassle of signing up with a blog they may never read again (several of my readers come via the Java Blogs aggregator on the basis of specific posts)? One solution I have is a central blog commenter service. If you are a member of this central service, you can post to a number of blogs associated with it. The API for such a service could be based on something like the Liberty Alliance. Comments?

    Sam - 30th September 2003 14:40 - #

  4. I personally tired of having to register / log in to sites a long time ago - I'll do pretty much anything to avoid people having to log in to post comments on this site. Even if there's some kind of magical centralised service where people only have to log in once and a cookie keeps them logged in forever they would still have to remember their username/password when they tried to post from another machine (a task I find increasingly difficult now that so many different sites require a login). I also don't want to put off people without an account from commenting - if only for the amusement value.

    Simon Willison - 30th September 2003 14:50 - #

  5. How about using those random numbers/characters in auto-generated images (I don't remember what they're called) as an authorization code? Drawbacks would be a little extra coding, using GD/IM, and users would have to fill in an extra text field with, say, five characters.

    Travholt - 30th September 2003 14:50 - #

  6. Those image things are out for accessibility reasons - they effectively ban Lynx users and visually disabled people from participating in discussions. They also won't help protect against manual comment spam. The Google boost a spammer gets from a link on a high traffic blog is easily worth the effort of spamming it by hand.

    Simon Willison - 30th September 2003 14:56 - #

  7. Well then what's next? Once you've managed locking down your comments, they could just start spamming your referrals section. Come to think of it: wouldn't that be an even easier way of spamming blogs? Free links on high-profile websites through a standard interface, without the hassle of having to do it manually or having to build a custom spam bot for each site... Maybe I missed something, otherwise feel free to delete this comment if you don't want it to "inspire" spammers.

    Thorn Vandevelde - 30th September 2003 16:15 - #

  8. Spamming my referrals section isn't an issue - I don't have one :-) Even if I did it would be an easy matter to format it so Google doesn't pick it up. If you want open commenting (which I do really) then you have to remove the benefit that spammers gain from posting the spam in the first place. If they are posting to publicise their site, format the comment so that the URL gets garbled. Likewise, Google rank-boosting monkeys will be foiled as the URL to their site doesn't appear. A less aggressive solution could be to replace all direct URLs in commenters' posts with a click-through type server-side solution where the URL itself is never displayed in the static page - this would defeat the Google PageRank boosters out there but would still leave you vulnerable to spam. You would also still get hit by purely automatic spam bots, as they couldn't tell that their spam wasn't going to get them anywhere. There is also of course the idea of having a realtime update for blacklistings, although I'm unsure how much work this would be to get into my MT install...

    Sam - 30th September 2003 16:39 - #
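
The click-through idea in comment 8 could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `/go/` prefix, the in-memory `LinkTable` (a real site would use a database table) and the simple URL regex.

```python
import html
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


class LinkTable:
    """In-memory stand-in for a table mapping numeric ids to target URLs."""

    def __init__(self):
        self._urls = []

    def add(self, url):
        # Reuse the existing id if this URL has been stored before.
        if url in self._urls:
            return self._urls.index(url)
        self._urls.append(url)
        return len(self._urls) - 1

    def lookup(self, link_id):
        return self._urls[link_id]


def render_comment(text, table, prefix="/go/"):
    """Escape the comment and swap every URL for a local redirect link."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        link_id = table.add(match.group(0))
        parts.append('<a href="%s%d">[link]</a>' % (prefix, link_id))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The handler behind the hypothetical `/go/` path would call `lookup()` and issue a redirect. For the spammer to gain nothing, that path would presumably also need to be kept away from crawlers, for example with a `Disallow: /go/` rule in robots.txt.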

  9. I turned comments and trackback links off six weeks ago, changed my email link to send all to the account I use for spam collection. I was getting practically no actual comments or emails from real people, nearly 100% of everything was spam, adverts for porn sites, etc. I don't actually need the comments and people who want to get in touch with me seem to be able to figure it out.

    Bill Brandon - 30th September 2003 17:39 - #

  10. Regarding CAPTCHAs: I think that most blog templates are different enough that you can fool blog-roving robots by just putting the word in plain text on the page, with a text box for the user to type the word. The robot would have to understand the text on the page to know that the word was needed, and it's more friendly to text-only users. Here's an idea for a combo moderation/registration system: there's a password box on the comment form. The first time users post, they're asked to type in a password for future use. That post goes into a moderation queue. After the post is accepted, that email/password combo will let the user straight through. It's not completely painless, but it's a hell of a lot quicker than most registration systems.

    Yoz - 30th September 2003 18:28 - #
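
A minimal Python sketch of the combined moderation/registration scheme in comment 10, assuming comments are keyed by email address and kept in memory. The `CommentGate` class and its method names are invented for the example.

```python
import hashlib
import hmac
import os


class CommentGate:
    """First post from an email is queued; approved email/password pairs skip the queue."""

    def __init__(self):
        self.trusted = {}   # email -> (salt, password_hash) for approved posters
        self.pending = []   # submissions awaiting moderation
        self.live = []      # bodies of published comments

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def submit(self, email, password, body):
        record = self.trusted.get(email)
        if record is not None:
            salt, expected = record
            if hmac.compare_digest(self._hash(password, salt), expected):
                self.live.append(body)
                return "published"
        # Unknown email or wrong password: hold the comment for review.
        salt = os.urandom(16)
        self.pending.append(
            {"email": email, "salt": salt, "hash": self._hash(password, salt), "body": body}
        )
        return "queued"

    def approve(self, index):
        """Publish a queued comment and remember its email/password pair."""
        item = self.pending.pop(index)
        # setdefault keeps an existing password if the email is already trusted.
        self.trusted.setdefault(item["email"], (item["salt"], item["hash"]))
        self.live.append(item["body"])
```

A wrong password does not reject the comment outright; it simply falls back to the moderation queue, which keeps the form forgiving for people who have forgotten what they typed last time.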

  11. It is really sad that most of our time these days is used for deleting comment spam and now, comment spam. The various solutions presented above are quite useful, but it would have been really nice if we didn't have to do anything at all. Yoz's suggestion is nice, but does not really eradicate the problem. Simon, I guess this is one of the disadvantages of having a popular site.

    markku - 30th September 2003 19:32 - #

  12. What about a transparent Turing Test, one the author doesn't even realize they are completing? A lot of this is inspired directly from Yoz's comment...

    First, dynamically switch the email and URI positions. I would bet that no-one would have any problems with this, and it would diversify your individual comment pages a bit leading to greater security. If a URI is entered for the email, flag them (don't display and mark possibly-blacklist until review).

    Another idea would be to replace the input name/id attributes with codewords instead of the standard ones. You could even make them randomly encoded name/id attributes and have an algorithm decode them. If someone continues to submit the old form, blacklist them.

    And finally, create a set of hidden fields whose name/id attributes match the system you have now, but with jargon values. If these values change, blacklist them.

    My whole point with this was that diversity is key. Using this system over a large number of sites would raise the bar for spambots, as they would be unable to figure out what was going on. But I have no idea what to do with people physically entering these values. :\

    Stephen - 30th September 2003 21:49 - #
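
Two of the ideas in comment 12 (encoded field names and decoy fields with fixed values) might look something like this in Python. It is only a sketch: the secret, the decoy names and values, and the per-form token are all assumptions, and it does nothing against a robot that actually parses the form.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical per-site secret

REAL_FIELDS = ("name", "email", "url", "body")

# Hidden decoy fields that keep the old, predictable names. A legitimate
# browser sends these values back unchanged.
DECOYS = {"author": "x7f-decoy", "comment": "q2k-decoy"}


def field_name(real_name, form_token):
    """Derive the obscured name for one field of one rendered form."""
    message = (form_token + ":" + real_name).encode("utf-8")
    digest = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return "f_" + digest[:12]


def form_fields(form_token):
    """Map each real field to the obscured name used in the HTML form."""
    return {real: field_name(real, form_token) for real in REAL_FIELDS}


def check_submission(posted, form_token):
    """Return ("ok", decoded_fields) or ("flag", {}) for a posted form."""
    for decoy, expected in DECOYS.items():
        if posted.get(decoy) != expected:
            return "flag", {}   # decoy missing or overwritten
    decoded = {}
    for real, obscured in form_fields(form_token).items():
        if obscured not in posted:
            return "flag", {}   # probably a replay of the old form
        decoded[real] = posted[obscured]
    return "ok", decoded
```

A submission that fills in the decoy `comment` field, or that posts the old field names directly, comes back as `"flag"` and can be held for review rather than published.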

  13. (Bah - that's the second time I've managed to completely ignore Simon's comment-posting instructions and have my separate paragraphs gracelessly smooshed into one. Just to explain - my previous comment explains two separate schemes for blocking spammers, one targeted at robots and the other at humans)

    Stephen nails it when he says that diversity is key in fooling robots. That said, I disagree with the methods he suggests, as most robots are already good enough to parse HTML forms and work out the required fields and values. I say that based on Shelley's experiences. But I still think it's far easier to outwit a spambot than to write one that can spam any blog.

    But how many of these spammers are actually bots? I didn't think that manual spamming would be worth it but Simon disagrees, and I can see his reasoning. To fight those, you need heavier artillery like moderation queues, blacklists, content analysis etc. I'm still biased against heavily complex solutions - I think we're better off being agile in our solutions rather than investing large amounts of time and code into big (and usually centralised) single projects. I like moderation queues because they can be quick to deal with if done right, there are ways of minimising interference with genuine posters (as I previously suggested) but more importantly they should dissuade the persistent manual spammer after his first couple of posts never make it live.

    But since we're dealing with a human spammer, how about this idea: Ask him not to do it. No, really. If I were browsing around blogs and having to manually spam each one, I'd want to minimise wasted time as much as possible.

    I'm thinking mainly of this Slashdot post. The comment form equivalent would be default text in the comment body TEXTAREA saying something like

    PLEASE DO NOT TRY AND ADVERTISE THROUGH THIS FORM. COMMENTS THAT ARE CONSIDERED TO BE SPAM WILL BE DELETED IMMEDIATELY.

    You can go one better by removing the Google incentive and stopping Google from indexing comments on blog entries, also including notification of this in the warning text.

    Yoz - 1st October 2003 02:00 - #

  14. Addendum: I should make it clear that I don't think that merely asking spammers to stop is necessarily going to work. However, visibly removing the incentive to spend time spamming your blog may help, plus (as you can probably tell) I'm all in favour of scattergun approaches to stopping spam if those approaches are quick and easy to try.

    Yoz - 1st October 2003 02:16 - #

  15. I used to use a "shoutbox" as well as allow comments on posts. I considered the shoutbox to be more adhoc/freeform and it would regularly get spammed. My comments on the other hand never (I think) got spammed. The spammers seemed to like the shoutbox for some reason and at least they're all in one place.

    Sometimes I'd alter the spam url slightly for fun. Hell, if someone wants cracked software they're going to have to work for it!

    I decided to experiment when I last redesigned my site and decided to turn off all comments and got rid of the shoutbox thinking people will just email me. Hehe was that ever a mistake. Site/visitor interaction is probably in minus numbers.

    pete - 1st October 2003 08:41 - #

  16. This is only a thought, and probably would not work, but it just flashed before my eyes before rapidly vanishing again. If you had some sort of "peer support group", where people who you trust could have certain rights, like deleting comments for you, or at least removing them for your review. You must have enough people regularly coming through here to make this work, in theory.

    Having said that, it's completely unviable for sites with even moderate traffic, as the number of times "trusted visitors" came through would most likely not be very high.

    Without thinking about it, I can't see an easy way of implementing it. Having thought about it, I realise how silly the idea sounds. But I'll post it anyway, if not purely to waste some of my time and yours.

    Andrew - 1st October 2003 09:49 - #

  17. Actually, I really like the idea Andrew - but I think the most practical way to do such a thing is to have an RSS feed for blog comments and subscribe to that in your favorite aggregator. Then you would have it refresh every half hour, perhaps every hour, and you would quickly see when someone had spammed the comments, and you could go in and just click on a link that said 'remove for moderation'.

    If say.. 10-15 people who were blogging did this for each other, they would quickly be able to get rid of the comment spam.

    Eivind - 1st October 2003 10:46 - #
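
A comments feed of the kind described in comment 17 could be generated with Python's standard library along these lines. The dictionary keys used for each comment (`author`, `entry_title`, `permalink`, `body`) are assumptions for the example; only the basic RSS 2.0 elements are emitted.

```python
import xml.etree.ElementTree as ET


def comments_feed(site_title, site_url, comments):
    """Build a small RSS 2.0 document listing recent comments."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title + ": recent comments"
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Latest comments on " + site_title
    for comment in comments:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "%s on %s" % (
            comment["author"],
            comment["entry_title"],
        )
        ET.SubElement(item, "link").text = comment["permalink"]
        # ElementTree escapes markup in the body when serialising.
        ET.SubElement(item, "description").text = comment["body"]
    return ET.tostring(rss, encoding="unicode")
```

Each item's link could point at a moderation page instead of the public permalink, so a trusted subscriber is one click away from 'remove for moderation'.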

  18. I love that idea as well. To completely eliminate comment spam you need to be sitting at your computer 24 hours a day deleting spam comments as soon as they come in. This is obviously impractical as even the most hard core of geeks need to sleep, eat and occasionally interact with the outside world. However, given a group of a dozen or so bloggers from around the world you can be pretty sure at least one of them will be at their computer at any one time. By helping each other out, that group of bloggers could achieve 24 hour surveillance against comment spam with very little individual effort.

    The idea could probably be adapted to help fight other forms of abuse as well. In fact, it's already used on sites such as Wikipedia where a large community monitors the "recent changes" page and combats any negative activity almost as soon as it appears.

    Simon Willison - 1st October 2003 10:56 - #

  19. What about some method of voting? I mean generally most of the people who visit your site, or at least most of the people who would find your site are going to be 'more mature', educated. Less likely to (and I really would love to have another word to use) spaz out and cause trouble with the system.

    Some sort of voting system where if a post is out of place simply mark it down. Once enough people have voted it down, it goes to review. This may start to lean towards the solutions that are just too big a project for the amount of effort it takes to watch out for rogue spam-comments.

    Do we even have a techie name for spam comments? Spamments!

    Andrew Donaldson - 1st October 2003 12:15 - #
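
The voting idea in comment 19 boils down to a counter with a threshold. A minimal Python sketch, assuming each voter can be identified somehow (an IP address or a cookie, say) and that three distinct votes is enough to trigger review; both are arbitrary choices for the example.

```python
from collections import defaultdict


class VoteTracker:
    """Send a comment to review once enough distinct visitors vote it down."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # comment_id -> set of ids of visitors who voted the comment down
        self._voters = defaultdict(set)

    def vote_down(self, comment_id, voter_id):
        """Record at most one vote per voter; return True once review is due."""
        self._voters[comment_id].add(voter_id)
        return self.needs_review(comment_id)

    def needs_review(self, comment_id):
        return len(self._voters.get(comment_id, ())) >= self.threshold
```

Storing voters in a set means repeat votes from the same visitor do not count twice, which is the least a scheme like this needs to stop one person burying a comment alone.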

  20. I just found this article on killing comment spam via Jeremy Zawodny's blog - I give you Killing Comment Spam for Dummies. It does seem to be aimed squarely at MT users though.

    sam - 1st October 2003 12:40 - #

  21. What about using bayesian filtering?

    You could probably hook up an open-source email spam filter pretty easily to examine good and bad comment buckets (files?).

    You'd have to train it, but as much spam as you're currently dealing with, that should take little time.

    If a comment falls in the spam bucket, then... there's room for other people to make suggestions. ;)

    I'd say, respond with a message that apologetically explains that the link text "click here" looks spammy to you, and could the commenter please choose something else?

    Or, instead of asking them to change text, you could ask them to enter a nonce at that point. The vast majority of legit comments would never see this secondary screen.

    Jeremy Dunck - 1st October 2003 22:49 - #
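
To make the Bayesian suggestion in comment 21 concrete, here is a small naive Bayes classifier in Python with add-one smoothing, trained on "spam" and "ham" comment buckets. It is a toy sketch rather than a wrapper around any existing email filter; the tokeniser and class name are assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokens(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


class CommentFilter:
    """Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.words[label].update(tokens(text))
        self.docs[label] += 1

    def _log_score(self, text, label):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = sum(self.words[label].values())
        all_docs = self.docs["spam"] + self.docs["ham"]
        # Smoothed prior, so an untrained bucket never produces log(0).
        score = math.log((self.docs[label] + 1) / (all_docs + 2))
        for word in tokens(text):
            # The extra +1 in the denominator leaves room for unseen words.
            score += math.log((self.words[label][word] + 1) / (total + vocab + 1))
        return score

    def is_spam(self, text):
        return self._log_score(text, "spam") > self._log_score(text, "ham")
```

A comment that lands in the spam bucket would then trigger the secondary screen described above, rather than being silently dropped.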

  22. I like the comment-monitoring RSS feed idea. Sort of a spin on "Neighborhood Watch"? This could be done, but a purely professional trust network must obviously be in place beforehand. Easy to implement, set up, and develop! I really like this idea now.

    Stephen - 2nd October 2003 11:02 - #

  23. But the peer-review filtering doesn't scale.

    If every blogger had a similar system, then most readers would be asked to filter multiple blogs-- and since we're an incestuous crew, most readers would also have their own blogs to be filtered.

    It's a club solution, not a lojack solution. ;-)

    Jeremy Dunck - 2nd October 2003 14:50 - #

  24. I don't see why it wouldn't scale. To protect a dozen blogs against comment spam, you need those 12 bloggers to get together with a mailing list and agree to watch each other's blogs. To protect 12 million blogs you need a million groups of a dozen bloggers each to get together and agree to watch each other's blogs. It's decentralised as each "protection racket" essentially forms and runs itself. Of course, it isn't ideal for every blog (someone who only blogs once a month and only goes online at the weekends is unlikely to be a useful contributor to the scheme) but for serious bloggers it could work just fine.

    Simon Willison - 2nd October 2003 18:23 - #

  25. Simon,

    OK, you're talking about a cooperative. I was thinking along the lines of random visits from admin-esque samaritans. Yes, I think that would work.

    Hmm. What don't you like about the bayesian approach? Less work, after implementing. ;)

    Jeremy Dunck - 3rd October 2003 14:52 - #

  26. Come on everyone! The only reason relevant and long-standing websites have been bumped from the high spots is because Google wants to raise the share value prior to floating the company. If your living relies on Google searches, and you disappear from the search, your only option is to pay for AdWords (pay per click). This is a self-generating fraudulent money maker. The more people who are forced to do this, the more they will have to bid against each other to ever be seen on AdWords. They deny this? Then let them show that there has not been a (dramatic) increase in profits from this PPC scam. Everybody is in panic. If there were any solidarity people would switch search engines. That would change their way of thinking and give you back control of your own business. Google can say whatever they want but the point is, if your site has all the relevant info for a given subject, has no faults in its construction and, above all, still appears in the same position in the minor engines, what other reason could there be? Can someone tell me?

    Murphy - 16th February 2004 15:50 - #

  27. I don't see why it wouldn't scale. To protect a dozen blogs against comment spam, you need those 12 bloggers to get together with a mailing list and agree to watch each other's blogs. To protect 12 million blogs you need a million groups of a dozen bloggers each to get together and agree to watch each other's blogs. It's decentralised as each "protection racket" essentially forms and runs itself. Of course, it isn't ideal for every blog (someone who only blogs once a month and only goes online at the weekends is unlikely to be a useful contributor to the scheme) but for serious bloggers it could work just fine.

    windflash - 14th June 2004 14:31 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/09/30/moreCommentSpam