Simon Willison’s Weblog

New anti-comment-spam measure

I’ve added a new anti-comment-spam measure to this site. The majority of comment spam exists for one reason and one reason only: to increase the Google PageRank of the site linked from the spam, and specifically to increase its ranking for the term used in the link. This is why so many comment spams include links like this: Cheap Viagra.

Cut off the PageRank boost and you cut off the advantage of spamming, simple as that. I’ve altered my comments system to redirect ALL outgoing links through a simple redirect script, and added that script to my robots.txt file. Links still work fine (even the referral information persists across the redirect) but Google will ignore them completely when calculating PageRank.
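
The link rewriting described above could be sketched roughly as follows. This is a minimal illustration in Python, not the code this site actually runs: the `/redirect/?` prefix follows the redirect URLs quoted in the comments, while `SITE_HOST`, `redirect_url` and `rewrite_links` are hypothetical names, and the regular expression only handles simple double-quoted `href` attributes.

```python
import re
from urllib.parse import quote, urlsplit

SITE_HOST = "simon.incutio.com"   # assumption: the blog's own host name
REDIRECT_PREFIX = "/redirect/?"   # assumption: where the redirect script lives

# The matching robots.txt entry would be along the lines of:
#   User-agent: *
#   Disallow: /redirect

HREF_RE = re.compile(r'href="([^"]+)"')


def redirect_url(target):
    """Return the local redirect URL for an outgoing link."""
    return REDIRECT_PREFIX + quote(target, safe=":/")


def rewrite_links(comment_html):
    """Route every off-site http(s) link in a comment through the redirect script."""
    def replace(match):
        url = match.group(1)
        parts = urlsplit(url)
        if parts.scheme in ("http", "https") and parts.netloc != SITE_HOST:
            return 'href="%s"' % redirect_url(url)
        return match.group(0)  # leave local and relative links untouched
    return HREF_RE.sub(replace, comment_html)
```

Only the `href` changes, so the visible link text stays exactly as the commenter wrote it.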

Will this reduce the floods of comment spam my site receives? Probably not; I’ve added a note about the restriction to my ’add comment’ form but I doubt many spammers bother to read much about the sites they are targeting. What’s really needed is for this technique to become widespread by being integrated into existing blogging tools. Are you listening, Movable Type hackers?

Update: Sencer has pointed out in the comments that PageRank persists over redirects, and Google appears to ignore robots.txt when used to hide a redirecting page. I’ve updated my redirection script to use javascript to power the redirect (with a link for people with javascript disabled) and an extra meta tag to remind Google not to follow the link. This has the unfortunate side effect that referral information no longer persists across the redirect.
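
A rough sketch of what such a JavaScript-powered redirect page might look like, generated from Python. The exact markup the script emits is not given here, so everything in this example is an assumption: the function name `redirect_page` is hypothetical, the `robots` meta tag with `noindex,nofollow` is one plausible reading of "an extra meta tag", and `window.location.replace` is borrowed from Már's suggestion in the comments rather than taken from the real script.

```python
import html
import json


def redirect_page(target):
    """Build an interstitial page that forwards the browser with JavaScript."""
    # Escape the URL for use inside an HTML attribute (the fallback link).
    href = html.escape(target, quote=True)
    # json.dumps yields a valid JavaScript string literal; escaping "<" keeps
    # a hostile URL from closing the <script> element early.
    js_target = json.dumps(target).replace("<", "\\u003c")
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta name="robots" content="noindex,nofollow">\n'
        "<title>Redirecting</title>\n"
        "<script>window.location.replace(%s);</script>\n"
        "</head><body>\n"
        '<p><a href="%s">Click here if you are not forwarded.</a></p>\n'
        "</body></html>\n"
    ) % (js_target, href)
```

Because the browser, not the server, performs the jump, the destination site sees the redirect page rather than the original entry as the referrer, which is the side effect noted above.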

This is New anti-comment-spam measure by Simon Willison, posted on 13th October 2003.

49 comments

  1. Damn, why didn't I think of that. I had a similar idea, but it wasn't quite this good. I just may try this...

    Jeremy Zawodny - 13th October 2003 08:46 - #

  2. Your robots.txt is incorrect. You have:

    User-agent: Googlebot
    Disallow: /categories

    User-agent: *
    Disallow: /redirect

    If you have a GoogleBot section, GoogleBot ignores the * section. Your robots.txt reads "Googlebot may not spider /categories. Other robots may not spider /redirect." This is clearly not what you intend, since the purpose of blocking /redirect is to prevent spammers from gaining Google PageRank.

    References:

    http://www.robotstxt.org/wc/norobots.html: "If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records."

    http://www.google.com/webmasters/faq.html#11

    Jesse Ruderman - 13th October 2003 10:20 - #

  3. Thanks, I've fixed that now.

    Simon Willison - 13th October 2003 10:39 - #

  4. Hi, since I didn't get past the XHTML validator, I posted my comment here: Sencer.de :: Google Comment Spammers, Redirects and PR. I am afraid it won't work as outlined. I had a similar (or rather identical) idea a while back; see what I concluded...

    Sencer - 13th October 2003 11:13 - #

  5. That's a bit annoying - I've done a bit of further reading and it seems Google follows redirects (which is a good thing as it lets sites change their URL scheme without losing their PageRank) but from what Sencer says it also ignores robots.txt if the page in question has a redirect. I've altered my redirect script to use a JavaScript-powered refresh with a "click here if you are not forwarded" message and a meta tag to tell Google not to follow the link. This has the unfortunate side effect that the referral information is lost - I suppose I could hack that into the URL of the redirect script.

    Simon Willison - 13th October 2003 11:49 - #

  6. Couldn't you just do some mod_rewrite-fu to serve Googlebot a blank page on the redirect script? What it can't see, it can't spider.

    Ben Milleare - 13th October 2003 13:47 - #

  7. That's an interesting idea, but it wouldn't cover other search engine crawlers. I'm also very wary of serving up special content to GoogleBot because I know Google penalise people for that kind of thing (so-called "Cloaking") - presumably they occasionally check your site using a fake user agent string to see if they can catch you out.

    Simon Willison - 13th October 2003 14:10 - #

  8. I'll be interested to see if this approach works. I think it assumes certain things about the spammers that are most likely false. I would be very surprised if they stopped posting just because you say it will not help their PageRank. I've been very happy with Jay Allen's solution, and will be even happier once I can benefit from an updated spammer URL list, because right now updating the banned URLs is taking as much time as just deleting the comments.

    john - 13th October 2003 14:15 - #

  9. I now have a working captcha for Movable Type. If you're interested or fancy a play, full details are here.

    Chris - 13th October 2003 14:52 - #

  10. I updated my entry. Just found out that the method I outlined doesn't work either. Proof is in the entry. Hopefully Simon's method with javascript is more effective.

    Captchas probably help against blind, automated submissions; however, there are easier ways to (temporarily) throw spammers off. A lot of the spam I saw on blogs looked more like manual spam. However, that may only be because it gets deleted fast enough. ;) (I am not using Movable Type, where the problem seems to be bigger because of wider distribution.)

    No matter whether stopping the transfer of PR helps or not, you at least rob spammers of their anchor-text advantage...

    Sencer - 13th October 2003 15:06 - #

  11. You should probably have a look at what Jay Allen is doing with his new MT-Blacklist plugin, due for first release today.

    Jim - 13th October 2003 15:28 - #

  12. You could change your redirecting script so it takes both the target link and the referring page as arguments (or just the referring path). If someone accesses that script without being referred from your own host, redirect them to the source page instead. This way referral information will work when people follow the links.

    Another technique might be to use Javascript to swap in the correct URL when someone clicks on the link. (This also works well for hiding mailto: links from bots without actually mangling the email address)

    Ian Bicking - 13th October 2003 15:47 - #

  13. Are you, in fact, saying that you believe a real person submits the comment spam that you receive? It's not a robot posting to your comment form? That seems amazing to me. I guess it might be worthwhile, if the people you spam have serious google juice, but I'd never considered it before.

    Still, I think we can agree that in general, comments and links from comments are useful and important.

    I don't much like the social aspect of proxying comments. That is, you're making your comments less webby. I suppose that it's OK to not lend your site's google juice to just any outgoing link, but it's placing comment links in a sort of web-ghetto.

    I recognize that it's a double-edged sword, but I wouldn't choose the same bargain on my site... if I ever updated it. ;)

    Again I ask what's wrong with hooking up a bayesian filter to your comment input?

    Even if you didn't do the "Please reword this comment" bit (in the belief that it will aid human spammers in crafting a better spam), you could dump questionable comments into a queue for review. This might fit nicely into your peer review idea, if it gets overwhelming to maintain.

    Jeremy Dunck - 13th October 2003 16:09 - #

  14. Hello Simon,

    You could encode the URLs which you add to your redirect script, since spammers often look for "open redirect proxies".

    This is why they often use Yahoo's redirect URLs to hide the websites hosting their SPAMvertised products. This also protects them from automated SPAM reporting systems such as Spamcop.net.

    So instead of http://simon.incutio.com/redirect/?http://www.yahoo.com

    I would do something like http://simon.incutio.com/redirect/?siteid=1

    The first time a link is created in your comments, it is added to your links table and given an id. That way your redirect service cannot easily be used for random links.

    Another option is something like: http://simon.incutio.com/redirect/?siteid=c5f7ac7a3ec5f014723738c15891a896

    That is an MD5 of the original URL, and since MD5 offers only one-way encryption, you would have to store the MD5 of a URL in the links table, and look it up when it is referenced.

    Or the best option would be to combine the id of the link entry with the URL of the referenced site and then encode it using MD5. So it would be done with something like md5("siteid25_http://www.yahoo.com").

    That way no one would be able to write an automated script to encode links in MD5 and illegally use your redirect script. (Because they would need to know the linkid of the site in your links table.)

    I hope that all makes sense ...

    Jay Sheth - 13th October 2003 16:13 - #

  15. Jeremy: Bayesian filters don't work as the spambots leave what seem to be valid comments, but the URL points to their favourite porn site. MT-Blacklist over at Jayallen.org seems to be the best solution. No need for all this redirecting stuff.

    KO - 13th October 2003 17:19 - #

  16. My site got hit with two different methods: an apparent straight robot (the IP address's first visit was a comment on entry 1 of my Movable Type blog), and a human-assisted robot (a person browsed through entries on my blog, then another UA from the same IP address added comments). Since I have an obscure site and don't believe in IP blacklists, I've been thinking of a few ways to shut out robots and limit people. 'Type in this word' is probably enough, actually.

    mb - 13th October 2003 18:05 - #

  17. I'm trying out something much more basic and less intrusive - a good old .htaccess/.htpasswd solution (the poor man's equivalent of "Type these 6 very hard to read numbers and letters into this box").

    Full write-up on my site.

    Peter Bowyer - 13th October 2003 18:16 - #

  18. As I don't think sites are being hand selected I think this solution will take a long time before it achieves its goals. Which is not to say taking the long view is wrong.

    In terms of usability for all us humans who read your blog, perhaps setting the real URL as the title of the links, so that we can see where the redirect will take us now that the status bar has been obscured?

    kellan - 13th October 2003 18:46 - #

  19. I'm just using a simple .htaccess directive:

    <Limit GET>
    order allow,deny
    deny from spammers.ip.address.here
    allow from all
    </Limit>

    Gavin Laking - 13th October 2003 20:02 - #

  20. KO: Then run the Bayesian filter on the linked pages, not the comment itself. Hypertext filters for hypertext comments...

    Ian Bicking - 13th October 2003 22:33 - #

  21. Hi Simon, I just thought of something very simple after looking at a page which employs the ROT13 method of obscuring text.

    The sample JavaScript ROT13 form at geht.net [ http://tools.geht.net/rot13.html ] demonstrated something to me: you can use the same (ROT 13) function to obscure and unobscure text.

    What if by default, just after the Submit button on a comment form were pressed, all the form fields were obscured using browser-side (JavaScript-driven) ROT13 (or a variant on ROT13).

    After the comment form is submitted, the server-side script (PHP / PERL -powered) runs the ROT13 function on it. Then the comment text and associated data (email / links) are stored in the database.

    If a SPAM robot tries to submit a comment, the ROT13 function will be run on it only on the server side, resulting in it coming up obscured. That pretty much guarantees that the inserted spam links will be non-functional :-)

    Just a thought ...

    Jay Sheth - 14th October 2003 03:35 - #

  22. I was hopeful about the redirects, and was disappointed that they don't reduce the Google juice. But it got me thinking....

    In WordPress, we automagically obfuscate email addresses by randomly encoding characters as numeric entities. Yes, that can be decoded for email scraping, but the comment spammer is trying to accomplish something different. So by encoding URLs in a similar fashion, would we kill the Google juice, and reduce the incentive for the spam?

    The idea is that we take a URL like 'http://example.com/', and transform it into something like 'http&#58;&#47;&#47;&#101;xa&#109;&#112;le.c&#111;m&#47;'

    This assumes, of course, that Google doesn't normalize the entities before attempting to index the links. Does anybody have any insight on that?

    Dougal Campbell - 14th October 2003 16:38 - #

  23. I don't see this as being an effective deterrent against spammers who may have this significantly automated or just don't read the warning below your comment box, and furthermore it penalizes legitimate commenters who (in theory) deserve an unadulterated link from your site as a result of their participation. I think any effective spam measure must first not penalize legitimate commenters.

    Matt - 14th October 2003 16:45 - #

  24. Dougal,

    Since such encoding is standard HTML, I'd be very much surprised if they didn't.

    I think that any systematic approach that allows the client to adapt is an arms race that will fail in the long run.

    And Ian's got the right idea. Likely, we're both reading the same thing.

    -Jeremy

    Jeremy Dunck - 14th October 2003 21:53 - #

  25. True, it will also reduce the link matching for non-spam commenters. This will mainly affect services that track blog interconnectedness (Technorati, Daypop, et al). But really, most linkage of that nature occurs at the main article listing level, not at the comment level, I'd think.

    And the link redirection method would have the same side-effect, yes?

    But many site admins might consider that a small price to pay. The link still looks normal in a browser, and still behaves normally. But it should break the link relevance in Google, which could go a long way towards deterring a spammer.

    I realize that it's not a complete solution in itself, but it could very well be a useful part of a suite of techniques.

    Dougal Campbell - 14th October 2003 21:57 - #

  26. Serendipity (via a comment on Sam Ruby's):

    I don't think his follows links for its decision, though.

    Jeremy Dunck - 16th October 2003 07:57 - #

  27. Hi Simon,

    I'm not sure if this has been suggested before, but what about using <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"> described at Google Information for Webmasters.

    If it works as advertised, this should allow the comments to be indexed, but stop outgoing links from being followed.

    Cheers, Marty.

    Martin Kenny - 16th October 2003 09:21 - #

  28. Pretty good discussion - but as has been pointed out: spammers most likely won't notice that they don't get any PageRank through the redirection; the spambot will simply sign anything it finds and not check whether the link will earn any PageRank... At www.junkeater.com we have worked on an alternative to prevent spam from even entering a weblog; each entry has to be manually signed to prevent spambots from making automatic submissions. In the future we will likely improve our service by keeping a central blacklist of IP addresses and destination URLs known to belong to spammers... An example of how Movable Type weblogs can be protected using junkeater can be found at www.junkeater.de/blog

    Junkeater - 18th October 2003 14:47 - #

  29. Just another observation: The javascript redirect renders the BACK button inoperative (back to the javascript that sends me back to the new URL again). Not too user friendly...

    Junkeater - 18th October 2003 14:49 - #

  30. (Assuming the spam is from a bot) Why not add a simple question, such as "Post after entering into the form: the value of two and 2 added together," as a verification step to posting?

    Jimmy Cerra - 31st October 2003 06:44 - #

  31. Simon, posting a "Note to spammers" is useless, because they use automated tools to post spam, and these programs do not read notes ;-) Two methods may be useful. First, show the visitor a sequence of 3 to 5 digits rendered as machine-unreadable graphics, just like here: http://www.beeonline.ru/portal/comm/send_sms/simple_send_sms.sms - the visitor will have to type them into the form. Second, change the names of the fields in the comment form randomly (sorry for my English ;-)). For example: {form action="/addcomment/blah-blah-blah/"} {input type=hidden name=key value=uniquekeyforthisuser} {input type=text name=q2w3e4r5t6 etc. Let the key be valid for decrypting the field names for, umm, 15 minutes: long enough to compose even a long comment. At the same time, it will block automated spamming tools. Sanja http://www.bougakov.com/blog/

    Sanja - 11th November 2003 07:19 - #

  32. Sanja - actually I'm pretty sure a lot of the spam I was receiving was posted manually. Scripts have been written that target Movable Type, but I don't run MT (this is a custom blogging system). I know that if I was a spammer I would be quite happy to manually spam comment systems simply because the payback in terms of Google PageRank is so high - it's easily worth spamming by hand.

    Image-based spam prevention systems are a nice idea, but are completely inaccessible and also do nothing to deter manual spam. I've actually seen a marked decrease in spam since I implemented my warning notice combined with the PageRank-killing redirect - I don't know if this means my approach is working or not but it's certainly a good sign.

    Simon Willison - 11th November 2003 19:38 - #

  33. Actually you could hide the URL totally from the blog. See http://james.seng.cc/wiki/wiki.cgi?MT_Redirect

    James Seng - 16th January 2004 04:50 - #

  34. MT 2.66 was just announced, and incorporates a similar system. They even sent me here to read about it.

    Jason Shao - 19th January 2004 08:45 - #

  35. Another method would be to remove the URL link completely from your template. Genuine posters could always put their URLs in their post, but a spammer's script currently won't do that. Hopefully this would upset the spammers. :D

    Simon Cox - 29th January 2004 17:04 - #

  36. And the referrer spamming? *points downwards* :)

    Alden Bates - 30th January 2004 07:43 - #

  37. I just turned off referral tracking. The value it provides is minimal, and the spam level is just ridiculous. Thanks for reminding me!

    Simon Willison - 30th January 2004 08:42 - #

  38. Yeah, I don't even publish referral info on my web site, and I *still* get what looks suspiciously like referrer spamming. I wouldn't mind so much, but the bots they use are usually badly designed and end up spamming my 404 logs (which aren't published either, but I use them as a debugging tool).

    Alden Bates - 4th February 2004 21:53 - #

  39. Would it really be a big deal to not allow folks to specify any URLs in comments (incl. as part of their name signature)? Wouldn't this eliminate the problem entirely?

    I guess this is where the whole MT "comment registration system" is going? (I haven't read up on exactly what they're doing, yet.) But, if I were to build something, I'd probably do it this way: if you want to tag your comments with an URL, you have to be a registered (and authenticated via challenge-response using something like email) user. Otherwise, you just get to leave a name with your comment.

    If spammers want to register with you first, well, oh well.

    Dossy - 10th February 2004 23:18 - #

  40. below me

    I AM SPAM - 15th March 2004 00:11 - #

  41. Seriously though, blacklisting IPs doesn't solve much. When your spammer is on a dialup pool or a PPPoE DSL line, all they have to do is disconnect and reconnect to their ISP and they have a new IP. Blocking IPs would thus be pointless, because in the long run it would only keep valid users out. Blog spam has many of the same issues as email spam, and the issue above is a serious one: it is the reason why many ISPs will not use blacklists in their anti-spam measures, even though doing so would cut down on the load hitting their mail servers and use less bandwidth. A false positive would be bad for business, just as a false block here would be bad for image and/or community. I think SpamAssassin-like comment filtering would work best. Check out http://www.nuclearelephant.com/projects/dspam/ - their solution would be great for blogs. Then give registered users spam moderation power to take the load off the blogger... You could build up the wordlists and filters pretty easily, especially if you get the amount of spam I was getting when I decided to turn off my WordPress-based blog because its signal-to-noise ratio was favoring noise by a landslide... -Ryan

    Ryan - 15th March 2004 00:31 - #

  42. It was pointed out to me that without a (setTimeout();) delay, using document.location = "http://www.theurl.com/"; makes it difficult for most IE users to use the Back button to get back to the page with the comments. This is an accessibility problem.

    In my own redirection script I've tried using document.location.replace(url); instead, and it seems to completely remove the redirection step from the Browser's history (at least in IE/win, Opera7 and Mozilla), allowing for unimpeded use of the Back button.

    Example: http://www.anomy.net/indirector/?http://mar.anomy.net/

    Now my question is whether the document.location.replace() method is in any way likely to cause us grief? Is Google more likely to index it, etc?

    Már - 8th April 2004 16:42 - #

  43. Has anyone thought of speaking to Google about the problem?

    turbofisk - 24th August 2004 17:59 - #

  44. Regarding your update... I'm the tech admin for Cre8asite Forums, and we have had a similar redirection system to this one in place for some time (we get a lot of people linking to penalised sites, and spam etc). We noticed that the redirect links (of the form http://www.cre8asiteforums.com/redirect/jump.php?url=http://simon.incutio.com/ ) were showing up in backlink searches for some domains, as Sencer pointed out. However, these links are not followed by Googlebot and do not actually count for any PageRank.

    I looked into this in some depth. Google never accessed the redirect script itself (therefore cannot know where the script is sending you onwards to - Googlebot would have to actually visit the redirect script, prohibited through robots.txt, in order to be sent the redirection headers). I also set up some tracking on sites I run, so many of the pages linked to from the forum would let me know if Googlebot had followed a link from there. No actual Googlebot followed a link (though a few people faking the Googlebot UA did - I verified these were not actually Googlebot through their IP).

    I also, finally, reversed the URL the user is forwarded to (so http://www.cre8asiteforums.com/redirect/jump.php?url=http://simon.incutio.com/ became http://www.cre8asiteforums.com/redirect/jump.php?url=/moc.oitucni.nomis//:ptth ), and the links stopped showing up in searches.

    So we can say, with a very high degree of certainty, that your original method is absolutely fine. It appears the reason that sometimes the redirect page shows as a backlink is that the actual link: search itself at Google is at fault. Rather than showing links (that Google registers and that count for PageRank), it shows links that contain the text you are looking for. For example, a search for "link:yahoo.com" will return all pages that have a link on them containing "yahoo.com" - doesn't mean they go to yahoo though.

    Dave Child - 30th September 2004 13:46 - #

  45. Google are now taking on board the rel="nofollow" attribute and `when Google sees the attribute (rel="nofollow") on hyperlinks, those links won't get any credit when we rank websites in our search results`. Further information at: http://www.google.com/googleblog/2005/01/preventing-comment-spam.html

    Kev Swindells - 19th January 2005 09:34 - #

  46. Readers of this blog entry who not only want to do something about comment spam but also know their way around Python should read this: http://www.peterbe.com/plog/control-comment-spam

    Peter Bengtsson - 27th March 2005 17:43 - #

  47. Some "human-image-validator" would be nice....

    Jo - 12th August 2005 12:29 - #

  48. Hi, I have tried everything by now. I started by removing obvious comment spam by hand (and it still remains the best solution). I used to fight comment spam with JavaScript tricks, then I stopped, as neither I, my visitors nor search engines really like JavaScript. Then I tried a server-side script: outgoing URLs were written backwards, and then converted and redirected from a fixed page, i.e. link.php?red_url=ti.elgoog.www. This was a good choice; I called it "method Tripiciano" :) Your solution is nice as well; I will try it somewhere.

    Aldo Tripiciano - 22nd August 2005 22:59 - #

  49. I think it's a wonderful thing that you're trying to preserve the idea of anonymous (unregistered/unvalidated) message posting on websites. I'm sad to see this concept dying as it's abused by automated scripts and spammers. It's a waste of time to have to register with a website just so you can give your feedback on an article, especially when it's likely that you'll never visit the site again to gain a higher return on investment of time.

    Kramer - 7th December 2005 23:51 - #
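
Jesse Ruderman's robots.txt point in comment 2 can be checked with Python's standard-library parser. This is only a sketch: `urllib.robotparser` is not Googlebot, so it shows how the quoted robots.txt rules are meant to be read, not how Google actually behaved, and the user agent name `SomeOtherBot` and the example target URL are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted in comment 2, before it was fixed.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /categories

User-agent: *
Disallow: /redirect
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own record, so the "*" record does not apply to it.
googlebot_redirect = parser.can_fetch("Googlebot", "/redirect/?http://example.com/")
googlebot_categories = parser.can_fetch("Googlebot", "/categories")
# A robot without its own record falls back to the "*" record.
other_redirect = parser.can_fetch("SomeOtherBot", "/redirect/?http://example.com/")

print(googlebot_redirect, googlebot_categories, other_redirect)
```

With this parser the three results are `True`, `False` and `False`: Googlebot is allowed to fetch `/redirect`, which is the opposite of what was intended. Repeating the `Disallow: /redirect` line inside the Googlebot record closes the gap.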

Comments are closed.
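
Jay Sheth's suggestion in comment 14, hashing the link's table id together with its URL so the redirect script cannot be used as an open redirect, could be sketched like this. It is an illustration under assumptions rather than anyone's actual implementation: the dictionary `links_by_token` stands in for the "links table", and `register_link` and `resolve` are hypothetical names. Note that MD5 is a hash rather than encryption; the protection comes from only honouring tokens that were stored when a comment was saved.

```python
import hashlib

# Hypothetical in-memory stand-in for the "links table" from comment 14.
links_by_token = {}
_next_id = 1


def register_link(url):
    """Store a URL and return the opaque token to use in the redirect link."""
    global _next_id
    link_id = _next_id
    _next_id += 1
    # Hash the row id together with the URL, as in md5("siteid25_http://...").
    seed = "siteid%d_%s" % (link_id, url)
    token = hashlib.md5(seed.encode("utf-8")).hexdigest()
    links_by_token[token] = url
    return token


def resolve(token):
    """Return the stored URL for a token, or None if it was never registered."""
    return links_by_token.get(token)
```

A redirect script built this way would call `resolve` on the `siteid` parameter and refuse to redirect when it gets `None`, so a spammer cannot mint working redirect links just by hashing a URL of their own.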

Previously hosted at http://simon.incutio.com/archive/2003/10/13/linkRedirects
