Simon Willison’s Weblog

Solving comment spam

There are two main schools of thought concerning comment spam: the optimists and the defeatists. Optimists believe that comment spam can be beaten with technology; defeatists (maybe I should call them pessimists) believe that comments are as doomed as email and we’re all going to hell in a handbasket.

The story so far

I fall squarely into the techno-optimist category. Back in September I started blacklisting domains linked to from spam comments, defending against return visits from spammers and allowing others to syndicate my block list to run on their own sites. Then in October I tweaked my comment system to eliminate PageRank from links in comments, making spamming for search engine optimisation a futile exercise. Of course, this measure only works if spammers realise it’s there (I know at least one has), which is why I’m personally very happy to see that the latest release of Movable Type has adopted the technique, albeit to mixed reviews from the MT community.
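
As a rough illustration of the domain blacklisting idea, here is a minimal Python sketch. It is not the code running on this site: the example domains, the `BLOCKED_DOMAINS` set and both function names are hypothetical, and it assumes comments arrive as HTML with double- or single-quoted `href` attributes.

```python
import re
from urllib.parse import urlparse

# Hypothetical block list; a syndicated list would be loaded from a shared file.
BLOCKED_DOMAINS = {"spam-pills.example", "cheap-casino.example"}

# Pull the target out of every href attribute in the comment.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def linked_domains(comment_html):
    """Return the set of host names linked to from a comment."""
    domains = set()
    for url in HREF_RE.findall(comment_html):
        host = urlparse(url).hostname  # None for relative links
        if host:
            domains.add(host.lower())
    return domains


def is_blocked(comment_html, blocked=BLOCKED_DOMAINS):
    """True if any linked host is a blocked domain or a subdomain of one."""
    for host in linked_domains(comment_html):
        for domain in blocked:
            if host == domain or host.endswith("." + domain):
                return True
    return False
```

Matching on the domain rather than the full URL is what defends against return visits: a spammer who comes back with a different page on the same site is still caught.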

There have been a whole bunch of other technological innovations over the past few months. Sam Ruby has implemented throttling to ban people who post three consecutive comments, and has some great ideas about guarding against strangers. Jay Allen’s MT-Blacklist makes the blacklisting concept available to a wide audience. Meanwhile, James Seng’s MT-Bayesian introduces trainable spam filters adapted from the fight against email spam.
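
The throttling idea can be sketched in a few lines of Python. This is only a guess at the general shape, not Sam Ruby's implementation: the class name, the use of a poster identifier such as an IP address, and the decision to accept the third comment before banning are all assumptions.

```python
from collections import deque


class ConsecutiveCommentThrottle:
    """Ban a poster once they have made `limit` comments in a row."""

    def __init__(self, limit=3):
        self.limit = limit
        self.recent = deque(maxlen=limit)  # ids of the most recent posters
        self.banned = set()

    def allow(self, poster_id):
        """Record a comment attempt; return False if the poster is banned."""
        if poster_id in self.banned:
            return False
        self.recent.append(poster_id)
        # A full window holding a single poster means `limit` in a row.
        if len(self.recent) == self.limit and len(set(self.recent)) == 1:
            self.banned.add(poster_id)
        return True
```

A legitimate conversation alternates between posters, so the window rarely fills with one identifier; an automated script posting in bursts trips it almost immediately.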

The challenges ahead

So those are the solutions so far; the critical question is whether they work. The amount of spam I’ve been getting has definitely decreased, but because I run a completely custom blogging system I’m safe from the automated scripts that target more widespread systems, and other sites make easier targets. Now that the less ethical search engine optimisers have started to catch on to the potential of comment spam to improve their PageRank, the amount of spam can only increase. Some bloggers have already started to disable comments entirely (thankfully Dan turned them back on again shortly afterwards), setting a worrying precedent for the elimination of the two-way interactions that comments allow between bloggers and non-bloggers.

I’ll put it in writing now: I will never disable comments on this blog. In the past few months the comments here have proved far more interesting and valuable than my actual posts, and I really appreciate the quality of the discussions that have arisen here. I will take whatever steps are necessary to keep this a useful environment for discussion.

Many people have hailed user registration as the ultimate solution to spam. It isn’t, because the value of PageRank is just too high—and writing a script to automatically create accounts (even with email confirmation required) is child’s play to anyone who is competent in an internet-aware scripting language. Even accessibility-impeding captchas are no defence against spammers who can afford to employ cheap labour to defeat them—and with search engine rankings as critical as they are there’s no shortage of spam dollars.

With those ruled out, let’s look at the remaining solutions:

The killer

Without links, comment spam has no purpose. To eliminate spam, eliminate links. Redirecting them through a PageRank killer already achieves this, but proves too subtle for spammers intent on spreading their links as widely as they can. To truly eliminate spam, strip out links and anything that even looks like a URL, and force the spammer to preview their carefully crafted advertisement before hitting submit. Seeing as hyperlinks are the single most important feature of the web this may seem draconian, and indeed it is. But on a site that serves more as a discussion forum than a farm, and where the alternative to killing links is killing comments entirely, this could be the saving factor.
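
A minimal Python sketch of the link-stripping approach follows. The regular expressions, the short list of top-level domains and the `[link removed]` placeholder are illustrative assumptions; a real filter would need a far more thorough idea of what "looks like a URL".

```python
import re

# Opening and closing anchor tags; other markup such as <p> is left alone.
A_TAG_RE = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

# Anything with a scheme or a www. prefix, plus bare domains on a few
# common top-level domains (an illustrative, far from complete, list).
URL_RE = re.compile(
    r"(?:https?://|www\.)\S+"
    r"|\b[\w-]+(?:\.[\w-]+)*\.(?:com|net|org|info|biz)\b\S*",
    re.IGNORECASE,
)


def strip_links(comment):
    """Remove anchor tags, then replace anything that resembles a URL."""
    text = A_TAG_RE.sub("", comment)
    return URL_RE.sub("[link removed]", text)
```

Running this before the preview step means the spammer sees the stripped version of their own advertisement, which is the point of forcing a preview in the first place.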

For most blogs, however, links are an essential part of the discourse, and I certainly wouldn’t want to disable them here. Not only do they add huge value to the discussions, but more importantly they act as a “signature” for many commenters: knowing a comment is by “Dan” is far less useful than knowing that it’s by Dan from www.simplebits.com.

Finding a compromise

Draconian measures such as the above wouldn’t be necessary if spammers would wise up to the fact that their carefully crafted missives were having no effect on their precious PageRank. The real challenge, then, is to make anti-PageRank measures obvious to even the most brain-addled viagra peddlers. I’ve taken the first step towards this by turning on compulsory previewing for comments, which should have the added benefit of reminding legitimate commenters to use paragraph tags. I’ll be working on ways of making the anti-PageRank measures more obvious over the next few days, as and when work permits.

I’ve seen people argue that depriving legitimate commenters of PageRank is a poor compromise. I disagree: if the only cost of eliminating the incentive to spam is the loss of some Google ego then I see it as a price well worth paying. Of course, I say that as someone who’s already built up their Google ego but at the end of the day it’s my blog, my rules. One solution I’ve considered is creating a whitelist of sites that frequent commenters use in their signatures, causing them to be displayed without a redirect.
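
The whitelist idea could look something like the Python sketch below. The `/redirect/?` prefix follows the URL form quoted in the comments on this entry; the `TRUSTED_HOSTS` set and the function name are hypothetical, and the sketch assumes the redirect script itself is already hidden from search engine crawlers by some other means.

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelist of hosts that frequent commenters use as signatures.
TRUSTED_HOSTS = {"www.simplebits.com"}
REDIRECT_PREFIX = "/redirect/?"

HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def rewrite_links(comment_html, trusted=TRUSTED_HOSTS):
    """Route every link through the redirect unless its host is trusted."""

    def replace(match):
        url = match.group(1)
        host = (urlparse(url).hostname or "").lower()
        if host in trusted:
            return match.group(0)  # leave whitelisted links untouched
        return 'href="%s%s"' % (REDIRECT_PREFIX, url)

    return HREF_RE.sub(replace, comment_html)
```

Because the original URL stays visible after the prefix, redirected links keep their value as a signature while passing on no PageRank.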

Comment spam is a solvable problem. Furthermore, blogging about comment spamming is almost as dull as blogging about blogging. Let’s hurry up and solve it so we can go back to blogging about cats.

This is Solving comment spam by Simon Willison, posted on 28th January 2004.

35 comments

  1. For the record, I'm a techno-optimist when it comes to spam, and a pessimist when it comes to crapflooding.

    Sam Ruby - 28th January 2004 03:37 - #

  2. Crap flooding is definitely a completely different issue. Spamming can be beaten because it has an easily understood motive: creating links that increase PageRank. Crap flooding is vandalism for the sake of vandalism and much harder to combat as there's nothing to use as leverage against the incentive for the attack.

    Simon Willison - 28th January 2004 03:41 - #

  3. Nicely said, Simon. I have a bit of a knee-jerk reaction to comments -- when I'm flooded I curse them, but like you, I feel the comments add far more value than my posts.

    So I feel more optimistic after reading your thoughts on the state of things. Something that didn't register until now is that MT's variation on your excellent redirect solution completely hides the URL, killing the "signature" aspect of a comment. I'm thinking your /redirect/?http://www.site.com is a nicer solution -- it's certainly a good way of verifying a frequent poster.

    Dan Cederholm - 28th January 2004 03:52 - #

  4. I'm certainly not a fan of MT's complete obfuscation of the URL - not only does it kill the signature aspect but it also opens your site up to people using it to trick your visitors into visiting unsuitable sites - the infamous goatse.cx, now offline, provides a classic example.

    Simon Willison - 28th January 2004 04:53 - #

  5. Hey, what about checking for known spidering User-Agents and filtering URLs or all HTML only then? This would obviously need to be advertised so spammers realised it.

    Yuri - 28th January 2004 05:53 - #

  6. Yuri: Sorry, but you have a good bit of catching up to do. UAs can easily change their UA-string and a blacklist will quickly grow unmanageable in any large-scale endeavor. Cutting out HTML is a legitimate option, but who really wants to copy and paste URIs into an address bar?

    And something I wish to comment about... :)

    It's pretty obvious that MT either took inspiration from Simon's redirects or took notice prior to their release, so why do they use their URL obfuscation method? I've been trying to figure it out for days now, and can't come up with a single beneficial factor that overrules the benefits you stated.

    They had to have some great reason for something that complicated, so anyone have any ideas?

    Stephen - 28th January 2004 08:01 - #

  7. I believe Yuri meant that you could filter out HTML and URIs for the search engine bots. You could kill PageRank without a redirect method by not showing the links at all to the GoogleBot UAs. This could possibly be a great solution - if you can make the spammer realize it before they post.

    As for MT's obfuscation, that is a mystery to me as well. It doesn't seem to help any more than Simon's solution, but it can only hurt (goatse.cx was a great example). If they had a good reason, I hope they will let everyone know what it is soon. :) If not, they should definitely fix that in the next release.

    Simon, I don't suppose you would be willing to give up the code to this thing anytime soon. :)

    Travis Watkins - 28th January 2004 08:14 - #

  8. I know this won't be liked by many, because too many people have a google ego as you put it, but I found a simple solution to comment spam. I'm not listed in Google. I removed myself (using robots.txt and telling google specifically because they seemed to ignore robots.txt for some reason), since then not a single spam. I removed myself for entirely different reasons, but this has been a welcome bonus. Do you really need to be so high up in a web search engine? If not, and spam is really annoying you, this might be a temporary solution.

    Paul Freeman - 28th January 2004 09:28 - #

  9. Yep, that's what I meant.
    No hassle for normal users. The only participant you have to trust is google or other popular search engines to supply the right user-agent.

    Yuri - 28th January 2004 09:30 - #

  10. To address the "Google-Juice" issue a little:

    If someone makes a really fantastic, pertinent and Google-juice worthy comment, then re-edit the entry to assign them a direct link. This way people with real contributions to the actual entry will get a link (much more likely to be followed in fact).

    I think that some blog-authors already do this. I realise that it isn't a great solution -- for instance if the discussion as a whole adds a lot to the entry, then that's a difficult one; also it is more work for the blog authors ... and their time might be better spent allowing direct links and killing comment spam. But it's a sort of manual whitelist and it might be an interim solution.

    My other question would be around the Google Juice factor itself. Are all spammers merely spamming comments for the Google factor? Or is it actually worth it for them to continue, even if only a fraction of a percent follow the link and actually arrive at their sites? This seems ludicrous, but apparently that is the aim with e-mail spam as well .. to get people to the sites, however few it is. The volume of the spam guarantees a return. Now maybe the sort of person who follows links to spam-sites and follows through with purchase are really unlikely to be in the blogosphere (although a certain critter named dorkSpotter has proved on a number of blogs that reading blogs really doesn't mean you have even basic intelligence ... google the name!) ... but do the spammers know that?

    Meri - 28th January 2004 12:00 - #

  11. The idea I've been toying with (and which I still need to put into code) is to use Bayesian analysis on comments, then based on the score, either pass links through unmolested, or apply redirection.

    Also, letting users 'register' by some means, then only allowing unaltered links for registered users is a possible technique.

    Dougal Campbell - 28th January 2004 16:23 - #

  12. By far the most useful tool we have is the preview. By forcing commenters to preview, you are creating a big automation headache for the spammer (and a convenient chance to change things for the commenter).

    A more sophisticated measure would be to make commenters type in a password displayed in an image. You could create the image programmatically, or just hold a bunch of image/password pairings in a database or something.

    Simon Jessey - 28th January 2004 17:53 - #

  13. One alternative not mentioned is trust networks. This doesn't require registration but does require a client-based bookmarklet or applet.

    Ken MacLeod - 28th January 2004 18:24 - #

  14. The image thing is called a "captcha", and has significant disadvantages with respect to accessibility. Spammers also have a number of ways of getting around them, from hiring cheap labour to offering free porn.

    Simon Willison - 28th January 2004 18:51 - #

  15. Simon, for all my writing about spam in the past (well, two posts then) I've only had about 10 spam messages that have made it through the blacklist filters. My next step (because I'd rather be ahead of the problem than playing catch-up) is to do exactly what you've suggested - the redirection thingy combined with whitelists. It would be very simple for me to do that for my blog because it's the same people commenting time and again. Seems ideal.

    Dunstan - 28th January 2004 19:19 - #

  16. From the MT announcement:

    along with using comment IDs instead of URLs to fix an open redirect problem

    I'm not quite sure why having an open redirecter is a bad thing, other than the fact that I could drop a link to simon.incutio.com/redirect/?http://tubgirl.com in a /. comment, and it would be labelled as going to [simon.incutio.com].

    But, since spammers by and large don't read, using redirects is one of those things that only works if everyone does it. Are alarm bells going off in your head? Good. Any time you think of a solution that would be great, if only everyone would do it, your very next step should be to discard it. As long as some randomly and automatically posted comments will wind up in Yole's abandoned devblog for Syndirella, where they aren't redirected, it's still worth trying to spam every comment form you can find. It doesn't matter that I refuse to automatically redirect everything on principle, because you're never going to get every single comment form on the whole internet upgraded. Next solution?

    Phil Ringnalda - 28th January 2004 19:19 - #

  17. I'm a big fan of the whitelisted URL idea. I've got the excellent Optional-Redirect plugin for MovableType installed on my blog so that I'm not forced to redirect my URLs. I'm thinking that a plugin similar to MT-Blacklist would be nice for a whitelist of URLs. Maybe we'll get something like this soon.

    Scott Johnson - 28th January 2004 20:17 - #

  18. Conventional ads don't use hyperlinks, and they're still worth doing.

    And, for that matter, PageRank is dead. And moreover, Google won't be the dominant search engine forever. You're shocked! Shocked! If top results get clogged with PageRank-whoring comment spammers, it'll drive mainstream users away, too.

    So I really don't think even the draconian approach of not allowing URLs at all is sufficient.

    As long as spammers can get their message in front of eyeballs, it'll be profitable to do. So I really, really think that Bayesian, or some other site-specific trainable filter is the way forward. And even then you're not stopping comment spam. You're just stopping it on your site.

    Has everybody read Graham's Plan for Spam and the follow-up So far, so good? I think people wishing to become familiar with the options would do well to read those. He also has a pretty exhaustive list of Ways to stop spam.

    Oh, and I think that bayesian would handle crap-flooding just fine, too.

    Jeremy Dunck - 28th January 2004 21:13 - #

  19. Sorry, but no one has ever managed to convince me that bayesian analysis is worthwhile for comment spam, or even for crap flooding. How would a bayesian filter stop a (made up) message like:

    Comment spam is a real problem. I've wondered if bayesian analysis is the best approach

    Which links to a spammer page instead of my homepage and maybe uses a different redirector so it's not quite obvious where the link goes to?

    You state that "Pagerank is dead", implying that spamming for search engine ranking is no longer worthwhile, but there doesn't seem to be much evidence for that. Certainly I haven't heard of anyone who has a better system for determining relevance and, even if individual blogs have their relevance artificially reduced, spamming over multiple blogs could still produce noticeable gains above whatever other tactics the people who flood search engines with crap use.

    Email spam and web spam are qualitatively different things. Email spam can only ever be designed to sell products directly to the person reading the spam and so the email itself must have a degree of relevance to the product being sold. Web spam has the slightly different goal of making it more likely that people in general will visit a site, not necessarily the people who read the actual spam. This means that the form of the content can be different and so the effective techniques for blocking the spam will be different.

    jgraham - 28th January 2004 22:50 - #

  20. Sorry, that wasn't quite clear. Obviously it is possible to set up a server side redirect that kills PageRank (although it's hard to see why this should remain true in the future). Therefore, my example comment wouldn't necessarily increase the pagerank of a site. However, it would bypass a bayesian filter. On the other hand, one can embed links directly - i.e. without redirection, which would increase pagerank, but still have the text appear to be a genuine comment on the article.

    jgraham - 28th January 2004 23:11 - #

  21. I re-recommend reading Graham's So Far, So Good, as it has some suggestions about how to handle your example.

    Another thing about the bayesian approach is that seemingly innocuous words can actually be quite good predictors of spam.

    Maybe you should try Bayesian and see how it works?

    Assuming you've read it, what are the weak points you see in the idea of retrieving web content and judging based on that content?

    Jeremy Dunck - 28th January 2004 23:39 - #

  22. Sorry, I also wasn't clear.

    I think web spam is a long-range issue, and figuring out something that remedies Google-juice ambition as opposed to straight-eyeballs (whether through popular search engines or just readers of popular-blog comments) is a bit short-sighted.

    And I didn't say "only" that PageRank is dead... I linked to an entry that talks about why that's thought and why that's happening. I will not argue that point here. I will only say that Heisenberg definitely applies here.

    Jeremy Dunck - 28th January 2004 23:43 - #

  23. Just wanted to see how the comment preview alerts me to the PageRank killing.... Hmm

    JP - 29th January 2004 02:02 - #

  24. I didn't say "only" that PageRank is dead... I linked to an entry that talks about why that's thought and why that's happening

    Which I read (the article at least, not the comments). It appeared to be anecdotal "evidence" based on the rumor that Google is working to prevent blogs dominating its search and the observation that the search order for the term "Jeremy" changed to one he feels is less relevant.

    Google ranking has never been only about links. It's pretty well known that Google adjust their rankings at regular intervals in order to prevent people trying to massage the results.

    As long as anecdotal evidence is admissible, let's try a few blog related searches:

    In the last case, notice that my page is almost entirely linked to just from my name in blog comments. I have very few outbound links and, as far as I know, no one has ever linked to me in the body of an article, either on a weblog or otherwise.

    I re-recommend reading Graham's So Far, So Good

    I read it. Unless I've missed something, all his suggestions rely on the fact that email spam has features that distinguish it from legitimate email - legitimate email is unlikely to consist of nothing more than a single html embedded image, for example. It also has to contain a payload which encourages the reader to visit the site - so where text is used, it is easy to detect words which make spam likely. There is no reason that has to be true for web spam.

    Good web spam is hard to distinguish from a real message. For example, if I was from a porn site, I might slip a link in at the end of an otherwise innocent looking message. Can you even call it spam? Is using the URL field to create a link to my homepage spamming? Bayesian filters might weed out messages like "Get hot porn now!" (although this message would also be filtered, presumably), but it's difficult to see why they should work for more sophisticated spamming.

    jgraham - 29th January 2004 09:31 - #

  25. Zawodny's an employee of Yahoo, and his examples are indeed anecdotal. So he's got motive and not much else. So as he says, draw your own conclusions.

    But the point of my statement is that as long as there is strong motivation and means to game PageRank, PageRank is at risk. If you design a spam solution just to address Google-motivated spammers, you might save Google, and you'll have a short-term win, but spammers will find other games.

    Web spam, as well as email spam, can take two forms, as far as I can see-- linked and unlinked. If the spam is unlinked, then the spam's point has to be in the text. And bayesian addresses this, I think.

    If the spam is linked, then you could have a quite innocuous message with a link to some spam content. I don't see why you make the distinction between web and email spam, except for the search-ranking motivation. In any case, a Bayesian filter which retrieves and interprets the linked content, using the retrieved content for filtering, would still be effective.

    Of course, interpreting linked content is not trivial-- I can imagine an HTML doc which has a script executed onload, and whose script is a bootstrap for content from some other URL. You might end up embedding a UA just to completely retrieve the content before analyzing. This might be the failing point of the Bayesian approach.

    Finally, it is not true that just any message with text that is a high predictor would be excluded. Bayesian pays attention to outliers both positive and negative. It is not a sort of dictionary blacklist. If my message has enough positive tokens, I can say "porn" all I want, and it will not be filtered. Of course, here's a lovely opportunity to game Bayesian. :)

    In the end, I am not optimistic about a single silver bullet. Eventually, statistics might be available on "generally neutral" Bayesian words which still get the spammer's intent across.

    Either we're all headed towards a strong-identity system (digital ID), or legislation will come along making spam the purview of criminals (like all other outlawed things).

    Jeremy Dunck - 29th January 2004 15:52 - #

  26. Simon: my apologies. I shouldn't have made it so obvious in public how an open redirect could be a bad thing, and where it could be, given who I know reads me. It will be used that way, if it hasn't already, so unless I'm naive, and /. already has some fix, if you don't want to provide redirect services to trolls, you're going to have to figure some way to restrict it. Sorry.

    Phil Ringnalda - 29th January 2004 17:22 - #

  27. Phil: no apology necessary - I've never been a believer in security through obscurity ;) I've actually worked out a way of killing the open redirect without removing the URL itself from the link, I just haven't found time to implement it. I'm going to pass both the URL and the ID of the comment in which it was posted, then check that the URL does indeed exist in the comment before redirecting the user.

    Simon Willison - 29th January 2004 21:08 - #

  28. I've solved this problem proactively:

    http://68.84.49.90/mt/archives/000076.html

    :)

    MJH - 30th January 2004 03:26 - #

  29. Re: open-redirects vulnerability. You could just check the http-referrer header. If the referrer is sent and it's legit, automatically redirect the browser like you currently are. If there is no referrer or it's from a false address you could disable the auto-redirect in favor of a plain-text hyperlink. If you're feeling malicious you could even bounce them back to the referrer (as I've seen Mark do). :)

    I've actually considered using others' open-redirects, not to hide a link but to strip my own http-referrer for security (although you can never be sure who's logging an open-redirect, so it's always best to create your own (and log that!)). So I highly recommend at least some security.

    And if someone doesn't develop a whitelist system soon, I'm going to make one! Slackers...

    Stephen - 30th January 2004 20:18 - #

  30. I found myself on a little anti-spam campaign recently and one of the topics I tried to address was comment spam.

    I employed a Captcha system on my website and, so far, it's worked out pretty well.

    Next stop: Referrer spam!

    DarkBlue - 16th May 2004 01:39 - #

  31. One option I tried (as I use my own code for my blog) was confirming email addresses. The first time an email address is used for a comment, an email is sent with a clickable link. After that you need never confirm again. Simple and 100% effective with no moderation from me. Any comment left 24 hours without confirming the email is automatically deleted using a cron job. I just sit back and watch. I have never had to interfere. But trackback spam is killing me - if I found an address I would pay someone to lob grenades through the letterbox - it is time some people were removed from the gene pool.

    Phil - 7th May 2005 10:30 - #

  32. We are doing a project on filtering comment spam on a distributed network. So if anybody has information about comment spam, or a few samples of comment spam, then please reply.

    rinku - 1st September 2005 12:24 - #

  33. How about asking Google to blacklist blog comments pages? Or is that too simple?

    Lorraine - 13th September 2005 10:04 - #

  34. I've got a serious comment spam problem on my weblog. I've implemented the Captcha system, installed MT3.2 and all the spam-prevention that comes with it and renamed the comments script, and this has reduced spam to zero, but my problem remains.

    I am getting so many hits to the comments script (when I rename it, the flow only stops for about 30 minutes) that my web hosting provider has chosen to interpret the hits as a denial of service attack. The spammers are either too stupid or too bull-headed to realise that not a single one of their comments makes it through to the live site, but when the comments start coming in at a rate of several per second, you kind of have to see my provider's point of view.

    Every hit causes the captcha code to run, executing a perl script on the host. My provider recently went into my online directories and CHMODded the comments script to 000 - then made it impossible for me to switch it back (I had to get another version of mt-comments.cgi and upload it under a different name, then change the templates, to re-enable comments).

    When spamming reaches DoS proportions, and you've installed enough to prevent any of the spam coming through, but it continues to do so anyway, what's the next step? I'm at a bit of a loss now. Do I have to resort to removing comments entirely and thereby give up?

    Nicolas - 4th October 2005 21:50 - #

  35. Sorry - just realised I wasn't very clear. What I mean above is that none of the spam makes it to the live site because it's blocked by the spam filters / captcha code etc. When I say it continues nevertheless, I mean that I get hundreds of denied comments every minute - they keep trying my comments script, keep getting repelled by the defenses, but keep trying anyway, thus pissing my provider off because of the load on their servers.

    Nicolas - 4th October 2005 21:55 - #
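
The fix for the open redirect described in comment 27 (pass both the URL and the comment ID, then confirm the URL really appears in that comment) might be sketched in Python as below. The function name and the dictionary standing in for a comment store are hypothetical, and the sketch assumes links are stored with double-quoted `href` attributes.

```python
import re

HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def safe_redirect_target(comment_id, url, comments):
    """Return `url` only if that comment really links to it, else None.

    `comments` maps comment IDs to stored comment HTML (a stand-in for
    whatever database the blog actually uses).
    """
    body = comments.get(comment_id)
    if body is None:
        return None  # unknown comment: refuse to redirect
    linked = set(HREF_RE.findall(body))
    return url if url in linked else None
```

A redirect script built this way can only send visitors to URLs that already appear in a published comment, so it is no use for disguising an arbitrary link posted elsewhere.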

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2004/01/28/solvingCommentSpam
