Solving comment spam
There are two main schools of thought concerning comment spam: the optimists and the defeatists. Optimists believe that comment spam can be beaten with technology; defeatists (maybe I should call them pessimists) believe that comments are as doomed as email and we’re all going to hell in a hand basket.
The story so far
I fall squarely into the techno-optimist category. Back in September I started blacklisting domains linked to from spam comments, defending against return visits from spammers and allowing others to syndicate my block list to run on their own sites. Then in October I tweaked my comment system to eliminate PageRank from links in comments, making spamming for search engine optimisation a futile exercise. Of course, this measure only works if spammers realise it’s there (I know at least one has), which is why I’m personally very happy to see that the latest release of Movable Type has adopted the technique, to mixed reviews from the MT community.
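To make those two measures concrete, here is a minimal Python sketch. It is not the code that runs this site. It assumes comments are stored as HTML with double-quoted href attributes, that links are sent through a local /redirect/?URL script, and that the /redirect/ path is kept away from crawlers (for example with robots.txt). The function names and the sample blacklist entry are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed prefix of the local redirect script.
REDIRECT_PREFIX = "/redirect/?"

# Matches absolute http(s) links in double-quoted href attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def route_links_through_redirect(comment_html):
    """Rewrite every absolute link so it points at the redirect script."""
    return HREF_RE.sub(
        lambda m: 'href="' + REDIRECT_PREFIX + m.group(1) + '"',
        comment_html,
    )


def is_blacklisted(comment_html, blacklist):
    """Return True if any linked host name appears in the blacklist set."""
    hosts = {urlparse(m.group(1)).hostname for m in HREF_RE.finditer(comment_html)}
    return any(host in blacklist for host in hosts)


if __name__ == "__main__":
    comment = '<a href="http://www.site.com">Dan</a>'
    print(route_links_through_redirect(comment))
    print(is_blacklisted(comment, {"spam.example"}))
```

Links that already start with /redirect/? are left alone, so running the rewrite twice does no harm.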
There have been a whole bunch of other technological innovations over the past few months. Sam Ruby has implemented throttling to ban people who post three consecutive comments, and has some great ideas about guarding against strangers. Jay Allen’s MT-Blacklist makes the blacklisting concept available to a wide audience. Meanwhile, James Seng’s MT-Bayesian introduces trainable spam filters adapted from the fight against email spam.
The challenges ahead
So those are the solutions so far; the critical question is whether they work. The amount of spam I’ve been getting has definitely decreased, but as I run a completely custom blogging system I’m safe from the automated scripts that target more widespread systems; other sites make easier targets. Now that the less ethical search engine optimisers have started to catch on to the potential of comment spam to improve their PageRank, the amount of spam can only increase. Some bloggers have already started to disable comments entirely (thankfully Dan turned them back on again shortly afterwards), setting a worrying precedent for the elimination of the two-way interactions that comments allow between bloggers and non-bloggers.
I’ll put it in writing now: I will never disable comments on this blog. In the past few months the comments here have proved far more interesting and valuable than my actual posts, and I really appreciate the quality of the discussions that have arisen here. I will take whatever steps are necessary to keep this a useful environment for discussion.
Many people have hailed user registration as the ultimate solution to spam. It isn’t, because the value of PageRank is just too high—and writing a script to automatically create accounts (even with email confirmation required) is child’s play to anyone who is competent in an internet-aware scripting language. Even accessibility-impeding captchas are no defence against spammers who can afford to employ cheap labour to defeat them—and with search engine rankings as critical as they are there’s no shortage of spam dollars.
With those ruled out, let’s look at the remaining solutions:
The killer
Without links, comment spam has no purpose. To eliminate spam, eliminate links. Redirecting them through a PageRank killer already achieves this, but proves too subtle for spammers intent on spreading their links as widely as they can. To truly eliminate spam, strip out links and anything that even looks like a URL and force the spammer to preview their carefully crafted advertisement before hitting submit. Seeing as hyperlinks are the single most important feature of the web this may seem draconian, and indeed it is. But on a site that serves more as a discussion forum than a farm, and where the alternative to killing links is killing comments entirely, this could be the saving factor.
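A rough Python sketch of what "strip out links and anything that even looks like a URL" could mean in practice. The patterns, the short list of top-level domains and the "[link removed]" placeholder are all assumptions for illustration; a real filter would need a far longer list and would still miss creative obfuscation.

```python
import re

# Opening and closing anchor tags; the href goes away with the tag.
ANCHOR_RE = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

# Things that look like a URL: explicit schemes, "www." prefixes,
# or bare host names ending in a few common top-level domains.
URLISH_RE = re.compile(
    r"(?:https?://|www\.)\S+|\b[\w.-]+\.(?:com|net|org|info|biz)\b(?:/\S*)?",
    re.IGNORECASE,
)


def strip_links(comment):
    """Remove anchor tags, then blank out anything URL-like in the text."""
    without_anchors = ANCHOR_RE.sub("", comment)
    return URLISH_RE.sub("[link removed]", without_anchors)


if __name__ == "__main__":
    print(strip_links('Buy <a href="http://spam.example/x">pills</a> at www.spam.example now'))
```

Showing the stripped result on the compulsory preview page is what lets the spammer see that the advertisement did not survive.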
For most blogs, however, links are an essential part of the discourse; I certainly wouldn’t want to disable them here. Not only do they add huge value to the discussions, but more importantly they act as a “signature” for many commenters: knowing a comment is by “Dan” is far less useful than knowing that it’s by Dan from www.simplebits.com.
Finding a compromise
Draconian measures such as the above wouldn’t be necessary if spammers would wise up to the fact that their carefully crafted missives were having no effect on their precious PageRank. The real challenge then is to make anti-PageRank measures obvious to even the most brain-addled viagra peddlers. I’ve taken the first step towards this by turning on compulsory previewing for comments, which should have the added benefit of reminding legitimate commenters to use paragraph tags. I’ll be working on ways of making the anti-PageRank measures more obvious over the next few days, as and when work permits.
I’ve seen people argue that depriving legitimate commenters of PageRank is a poor compromise. I disagree: if the only cost of eliminating the incentive to spam is the loss of some Google ego then I see it as a price well worth paying. Of course, I say that as someone who’s already built up their Google ego but at the end of the day it’s my blog, my rules. One solution I’ve considered is creating a whitelist of sites that frequent commenters use in their signatures, causing them to be displayed without a redirect.
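The whitelist idea could sit in front of the redirect roughly like this. This is a hypothetical Python sketch: the function name, the whitelist contents and the /redirect/? prefix are assumptions for illustration, not a description of how this site is built.

```python
from urllib.parse import urlparse

# Hypothetical whitelist of hosts used by frequent, trusted commenters.
TRUSTED_HOSTS = {"www.simplebits.com"}

# Assumed prefix of the local redirect script.
REDIRECT_PREFIX = "/redirect/?"


def signature_href(url, trusted_hosts=TRUSTED_HOSTS):
    """Return the href to publish for a commenter's signature link."""
    host = urlparse(url).hostname
    if host in trusted_hosts:
        return url  # trusted: link directly, PageRank intact
    return REDIRECT_PREFIX + url  # everyone else goes via the redirect


if __name__ == "__main__":
    print(signature_href("http://www.simplebits.com"))
    print(signature_href("http://spam.example/"))
```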
Comment spam is a solvable problem. Furthermore, blogging about comment spamming is almost as dull as blogging about blogging. Let’s hurry up and solve it so we can go back to blogging about cats.
Sam Ruby - 28th January 2004 03:37 - #
Simon Willison - 28th January 2004 03:41 - #
Nicely said, Simon. I have a bit of a knee-jerk reaction to comments -- when I'm flooded I curse them, but like you, I feel the comments add far more value than my posts.
So I feel more optimistic after reading your thoughts on the state of things. Something that didn't register until now is that MT's variation on your excellent redirect solution completely hides the URL, killing the "signature" aspect of a comment. I'm thinking your
/redirect/?http://www.site.com is a nicer solution -- it's certainly a good way of verifying a frequent poster.
Dan Cederholm - 28th January 2004 03:52 - #
Simon Willison - 28th January 2004 04:53 - #
Yuri - 28th January 2004 05:53 - #
Yuri: Sorry, but you have a good bit of catching up to do. UAs can easily change their UA-string and a blacklist will quickly grow unmanageable in any large-scale endeavor. Cutting out HTML is a legitimate option, but who really wants to copy and paste URIs into an address bar?
And something I wish to comment about... :)
It's pretty obvious that MT either took inspiration from Simon's redirects or took notice of them prior to their release, so why do they use their URL obfuscation method? I've been trying to figure it out for days now, and can't come up with a single benefit that outweighs the benefits you stated.
They had to have some great reason for something that complicated, so anyone have any ideas?
Stephen - 28th January 2004 08:01 - #
I believe Yuri meant that you could filter out HTML and URIs for the search engine bots. You could kill PageRank without a redirect method by not showing the links at all to the GoogleBot UAs. This could possibly be a great solution - if you can make the spammer realize it before they post.
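A rough Python sketch of that user-agent check. The crawler token list and the function name are assumptions, and since anyone can send a spoofed user-agent string this only works as long as the search engines identify themselves honestly.

```python
import re

# Matches a whole anchor element and captures its visible text.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

# Assumed list of substrings that identify search engine crawlers.
CRAWLER_TOKENS = ("googlebot",)


def render_comment(comment_html, user_agent):
    """Serve comments without links to crawlers, untouched to everyone else."""
    if any(token in user_agent.lower() for token in CRAWLER_TOKENS):
        return ANCHOR_RE.sub(r"\1", comment_html)  # keep only the link text
    return comment_html


if __name__ == "__main__":
    comment = 'See <a href="http://www.site.com">my site</a>'
    print(render_comment(comment, "Googlebot/2.1 (+http://www.google.com/bot.html)"))
    print(render_comment(comment, "Mozilla/5.0"))
```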
As for MT's obfuscation, that is a mystery to me as well. It doesn't seem to help any more than Simon's solution, but it can only hurt (goatse.cx was a great example). If they had a good reason, I hope they will let everyone know what it is soon. :) If not, they should definitely fix that in the next release.
Simon, I don't suppose you would be willing to give up the code to this thing anytime soon? :)
Travis Watkins - 28th January 2004 08:14 - #
Paul Freeman - 28th January 2004 09:28 - #
No hassle for normal users. The only participant you have to trust is google or other popular search engines to supply the right user-agent.
Yuri - 28th January 2004 09:30 - #
To address the "Google-Juice" issue a little:
If someone makes a really fantastic, pertinent and Google-juice worthy comment, then re-edit the entry to assign them a direct link. This way people with real contributions to the actual entry will get a link (much more likely to be followed in fact).
I think that some blog-authors already do this. I realise that it isn't a great solution -- for instance if the discussion as a whole adds a lot to the entry, then that's a difficult one; also it is more work for the blog authors ... and their time might be better spent allowing direct links and killing comment spam. But it's a sort of manual whitelist and it might be an interim solution.
My other question would be around the Google Juice factor itself. Are all spammers merely spamming comments for the Google factor? Or is it actually worth it for them to continue, even if only a fraction of a percent follow the link and actually arrive at their sites? This seems ludicrous, but apparently that is the aim with e-mail spam as well .. to get people to the sites, however few it is. The volume of the spam guarantees a return. Now maybe the sort of person who follows links to spam-sites and follows through with purchase are really unlikely to be in the blogosphere (although a certain critter named dorkSpotter has proved on a number of blogs that reading blogs really doesn't mean you have even basic intelligence ... google the name!) ... but do the spammers know that?
Meri - 28th January 2004 12:00 - #
The idea I've been toying with (and which I still need to put into code) is to use Bayesian analysis on comments, then based on the score, either pass links through unmolested, or apply redirection.
Also, letting users 'register' by some means, then only allowing unaltered links for registered users, is a possible technique.
Dougal Campbell - 28th January 2004 16:23 - #
By far the most useful tool we have is the preview. By forcing commenters to preview, you are creating a big automation headache for the spammer (and a convenient chance to change things for the commenter).
A more sophisticated measure would be to make commenters type in a password displayed in an image. You could create the image programmatically, or just hold a bunch of image/password pairings in a database or something.
Simon Jessey - 28th January 2004 17:53 - #
One alternative not mentioned is trust networks. This doesn't require registration but does require a client-based bookmarklet or applet.
Ken MacLeod - 28th January 2004 18:24 - #
Simon Willison - 28th January 2004 18:51 - #
Dunstan - 28th January 2004 19:19 - #
From the MT announcement:
I'm not quite sure why having an open redirecter is a bad thing, other than the fact that I could drop a link to simon.incutio.com/redirect/?http://tubgirl.com in a /. comment, and it would be labelled as going to [simon.incutio.com].
But, since spammers by and large don't read, using redirects is one of those things that only works if everyone does it. Are alarm bells going off in your head? Good. Any time you think of a solution that would be great, if only everyone would do it, your very next step should be to discard it. As long as some randomly and automatically posted comments will wind up in Yole's abandoned devblog for Syndirella, where they aren't redirected, it's still worth trying to spam every comment form you can find. It doesn't matter that I refuse to automatically redirect everything on principle, because you're never going to get every single comment form on the whole internet upgraded. Next solution?
Phil Ringnalda - 28th January 2004 19:19 - #
Scott Johnson - 28th January 2004 20:17 - #
Conventional ads don't use hyperlinks, and they're still worth doing.
And, for that matter, PageRank is dead. And moreover, Google won't be the dominant search engine forever. You're shocked! Shocked! If top results get clogged with PageRank-whoring comment spammers, it'll drive mainstream users away, too.
So I really don't think even the draconian approach of not allowing URLs at all is sufficient.
As long as spammers can get their message in front of eyeballs, it'll be profitable to do. So I really, really think that Bayesian, or some other site-specific trainable filter is the way forward. And even then you're not stopping comment spam. You're just stopping it on your site.
Has everybody read Graham's Plan for Spam and the follow-up So far, so good? I think people wishing to become familiar with the options would do well to read those. He also has a pretty exhaustive list of Ways to stop spam.
Oh, and I think that bayesian would handle crap-flooding just fine, too.
Jeremy Dunck - 28th January 2004 21:13 - #
Sorry, but no one has ever managed to convince me that bayesian analysis is worthwhile for comment spam, or even for crap flooding. How would a bayesian filter stop a (made up) message like:
Which links to a spammer page instead of my homepage and maybe uses a different redirector so it's not quite obvious where the link goes to?
You state that "Pagerank is dead", implying that spamming for search engine ranking is no longer worthwhile, but there doesn't seem to be much evidence for that. Certainly I haven't heard of anyone who has a better system for determining relevance and, even if individual blogs have their relevance artificially reduced, spamming over multiple blogs could still produce noticeable gains above whatever other tactics the people who flood search engines with crap use.
Email spam and web spam are qualitatively different things. Email spam can only ever be designed to sell products directly to the person reading the spam and so the email itself must have a degree of relevance to the product being sold. Web spam has the slightly different goal of making it more likely that people in general will visit a site, not necessarily the people who read the actual spam. This means that the form of the content can be different and so the effective techniques for blocking the spam will be different.
jgraham - 28th January 2004 22:50 - #
Sorry, that wasn't quite clear. Obviously it is possible to set up a server side redirect that kills PageRank (although it's hard to see why this should remain true in the future). Therefore, my example comment wouldn't necessarily increase the pagerank of a site. However, it would bypass a bayesian filter. On the other hand, one can embed links directly - i.e. without redirection, which would increase pagerank, but still have the text appear to be a genuine comment on the article.
jgraham - 28th January 2004 23:11 - #
I re-recommend reading Graham's So Far, So Good, as it has some suggestions about how to handle your example.
Another thing about the bayesian approach is that seemingly innocuous words can actually be quite good predictors of spam.
Maybe you should try Bayesian and see how it works?
Assuming you've read it, what are the weak points you see in the idea of retrieving web content and judging based on that content?
Jeremy Dunck - 28th January 2004 23:39 - #
Sorry, I also wasn't clear.
I think web spam is a long-range issue, and figuring out something that remedies Google-juice ambition as opposed to straight-eyeballs (whether through popular search engines or just readers of popular-blog comments) is a bit short-sighted.
And I didn't say "only" that PageRank is dead... I linked to an entry that talks about why that's thought and why that's happening. I will not argue that point here. I will only say that Heisenberg definitely applies here.
Jeremy Dunck - 28th January 2004 23:43 - #
JP - 29th January 2004 02:02 - #
Which I read (the article at least, not the comments). It appeared to be anecdotal "evidence" based on the rumor that Google is working to prevent blogs dominating its search and the observation that the search order for the term "Jeremy" changed to one he feels is less relevant.
Google ranking has never been only about links. It's pretty well known that Google adjust their rankings at regular intervals in order to prevent people trying to massage the results.
As long as anecdotal evidence is admissible, let's try a few blog related searches:
In the last case, notice that my page is almost entirely linked to just from my name in blog comments. I have very few outbound links and, as far as I know, no one has ever linked to me in the body of an article, either on a weblog or otherwise.
I read it. Unless I've missed something, all his suggestions rely on the fact that email spam has features that distinguish it from legitimate email - legitimate email is unlikely to consist of nothing more than a single html embedded image, for example. It also has to contain a payload which encourages the reader to visit the site - so where text is used, it is easy to detect words which make spam likely. There is no reason that has to be true for web spam.
Good web spam is hard to distinguish from a real message. For example, if I was from a porn site, I might slip a link in at the end of an otherwise innocent looking message. Can you even call it spam? Is using the URL field to create a link to my homepage spamming? Bayesian filters might weed out messages like "Get hot porn now!" (although this message would also be filtered, presumably), but it's difficult to see why they should work for more sophisticated spamming.
jgraham - 29th January 2004 09:31 - #
Zawodny's an employee of Yahoo, and his examples are indeed anecdotal. So he's got motive and not much else. So as he says, draw your own conclusions.
But the point of my statement is that as long as there is strong motivation and means to game PageRank, PageRank is at risk. If you design a spam solution just to address Google-motivated spammers, you might save Google, and you'll have a short-term win, but spammers will find other games.
Web spam, as well as email spam, can take two forms, as far as I can see-- linked and unlinked. If the spam is unlinked, then the spam's point has to be in the text. And bayesian addresses this, I think.
If the spam is linked, then you could have a quite innocuous message with a link to some spam content. I don't see why you make the distinction between web and email spam, except for the search-ranking motivation. In any case, a Bayesian filter which retrieves and interprets the linked content, using the retrieved content for filtering, would still be effective.
Of course, interpreting linked content is not trivial-- I can imagine an HTML doc which has a script executed onload, and whose script is a bootstrap for content from some other URL. You might end up embedding a UA just to completely retrieve the content before analyzing. This might be the failing point of the Bayesian approach.
Finally, it is not true that just any message with text that is a high predictor would be excluded. Bayesian pays attention to outliers both positive and negative. It is not a sort of dictionary blacklist. If my message has enough positive tokens, I can say "porn" all I want, and it will not be filtered. Of course, here's a lovely opportunity to game Bayesian. :)
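For reference, the combining rule described in Graham's A Plan for Spam can be sketched in Python as below. The per-token spam probabilities are assumed to come from an already trained filter, and the function name is made up for illustration.

```python
from math import prod


def combined_spam_probability(token_probs, most_interesting=15):
    """Combine per-token spam probabilities in the style of A Plan for Spam."""
    # Clamp to [0.01, 0.99] so no single token can decide the outcome alone.
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    # Keep only the tokens farthest from the neutral value 0.5,
    # whether they point towards spam or towards legitimate text.
    interesting = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:most_interesting]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


if __name__ == "__main__":
    # One strongly spammy token outweighed by three strongly innocent ones.
    print(combined_spam_probability([0.99, 0.01, 0.01, 0.01]))
```

A single high-scoring word is outvoted when enough strongly innocent tokens surround it, which is the behaviour described above.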
In the end, I am not optimistic about a single silver bullet. Eventually, statistics might be available on "generally neutral" Bayesian words which still get the spammer's intent across.
Either we're all headed towards a strong-identity system (digital ID), or legislation will come along making spam the purview of criminals (like all other outlawed things).
Jeremy Dunck - 29th January 2004 15:52 - #
Simon: my apologies. I shouldn't have made it so obvious in public how an open redirect could be a bad thing, and where it could be, given who I know reads me. It will be used that way, if it hasn't already, so unless I'm naive, and /. already has some fix, if you don't want to provide redirect services to trolls, you're going to have to figure some way to restrict it. Sorry.
Phil Ringnalda - 29th January 2004 17:22 - #
Simon Willison - 29th January 2004 21:08 - #
http://68.84.49.90/mt/archives/000076.html
:)
MJH - 30th January 2004 03:26 - #
Re: open-redirects vulnerability. You could just check the http-referrer header. If the referrer is sent and it's legit, automatically redirect the browser like you currently are. If there is no referrer or it's from a false address you could disable the auto-redirect in favor of a plain-text hyperlink. If you're feeling malicious you could even bounce them back to the referrer (as I've seen Mark do). :)
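A minimal Python sketch of that referrer check. The host name and the returned action labels are placeholders, and because the header is trivially forged this only raises the bar a little.

```python
from urllib.parse import urlparse

# Placeholder for the site that owns the redirect script.
OWN_HOST = "simon.incutio.com"


def redirect_action(referrer, own_host=OWN_HOST):
    """Decide how the redirect script should answer a request.

    `referrer` is the value of the HTTP Referer header, or None if absent.
    """
    if referrer and urlparse(referrer).hostname == own_host:
        return "redirect"  # request came from our own pages
    return "show_link"  # unknown origin: show a plain-text link instead


if __name__ == "__main__":
    print(redirect_action("http://simon.incutio.com/archive/"))
    print(redirect_action(None))
    print(redirect_action("http://slashdot.org/"))
```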
I've actually considered using others' open-redirects, not to hide a link but to strip my own http-referrer for security (although you can never be sure who's logging an open-redirect, so it's always best to create your own (and log that!)). So I highly recommend at least some security.
And if someone doesn't develop a whitelist system soon, I'm going to make one! Slackers...
Stephen - 30th January 2004 20:18 - #
I found myself on a little anti-spam campaign recently and one of the topics I tried to address was comment spam.
I employed a Captcha system on my website and, so far, it's worked out pretty well.
Next stop: Referrer spam!
DarkBlue - 16th May 2004 01:39 - #
Phil - 7th May 2005 10:30 - #
rinku - 1st September 2005 12:24 - #
Lorraine - 13th September 2005 10:04 - #
Nicolas - 4th October 2005 21:50 - #
Nicolas - 4th October 2005 21:55 - #