New anti-comment-spam measure
I’ve added a new anti-comment-spam measure to this site. The majority of comment spam exists for one reason and one reason only: to increase the Google PageRank of the site linked from the spam, and specifically to increase its ranking for the term used in the link. This is why so much comment spam includes links like this: Cheap Viagra.
Cut off the PageRank boost and you cut off the advantage of spamming, simple as that. I’ve altered my comments system to redirect ALL outgoing links through a simple redirect script, and added that script to my robots.txt file. Links still work fine (even the referral information persists across the redirect) but Google will ignore them completely when calculating PageRank.
Will this reduce the floods of comment spam my site receives? Probably not; I’ve added a note about the restriction to my ’add comment’ form, but I doubt many spammers bother to read much about the sites they are targeting. What’s really needed is for this technique to become widespread by being integrated into existing blogging tools. Are you listening, Movable Type hackers?
Update: Sencer has pointed out in the comments that PageRank persists over redirects, and Google appears to ignore robots.txt when used to hide a redirecting page. I’ve updated my redirection script to use JavaScript to power the redirect (with a link for people with JavaScript disabled) and an extra meta tag to remind Google not to follow the link. This has the unfortunate side effect that referral information no longer persists across the redirect.
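A minimal sketch of what such a JavaScript-powered redirect page could look like, written here in Python as a page generator. This is an illustration, not the script used on this site: the function name `build_redirect_page`, the choice of `window.location.replace`, and the exact `robots` meta tag value are all assumptions.

```python
import html
import json
from urllib.parse import urlsplit


def build_redirect_page(target_url: str) -> str:
    """Return an HTML page that sends the browser to target_url via JavaScript.

    Hypothetical helper: the real redirect script is not shown in the post.
    """
    # Only pass through ordinary web links; refuse javascript:, data:, etc.
    if urlsplit(target_url).scheme not in ("http", "https"):
        raise ValueError("only http and https targets are allowed")

    # Escape the URL for use in an HTML attribute and in HTML text.
    safe_html = html.escape(target_url, quote=True)
    # json.dumps produces a valid JavaScript string literal; "<" is escaped too
    # so the URL cannot terminate the <script> element early.
    js_literal = json.dumps(target_url).replace("<", "\\u003c")

    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        # Assumed meta tag: asks crawlers not to index the page or follow its links.
        '<meta name="robots" content="noindex,nofollow">\n'
        "<title>Redirecting</title>\n"
        f"<script>window.location.replace({js_literal});</script>\n"
        "</head><body>\n"
        # Plain link for visitors with JavaScript disabled.
        f'<p>Continue to <a href="{safe_html}">{safe_html}</a></p>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(build_redirect_page("http://example.com/?a=1&b=2"))
```

Because the browser, not the server, performs the jump, no HTTP redirect is issued, which is consistent with the side effect noted above that referral information is lost.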
Jeremy Zawodny - 13th October 2003 08:46 - #
Your robots.txt is incorrect. You have:
If you have a GoogleBot section, GoogleBot ignores the * section. Your robots.txt reads "Googlebot may not spider /categories. Other robots may not spider /redirect." This is clearly not what you intend, since the purpose of blocking /redirect is to prevent spammers from gaining Google PageRank.
References:
http://www.robotstxt.org/wc/norobots.html: "If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records."
http://www.google.com/webmasters/faq.html#11
Jesse Ruderman - 13th October 2003 10:20 - #
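The precedence rule described in the comment above can be checked with Python's standard `urllib.robotparser`. The robots.txt text below is a reconstruction from the comment's description ("Googlebot may not spider /categories. Other robots may not spider /redirect."), not the actual file, and `SomeOtherBot` is a made-up user agent.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description in the comment; the real file is not shown.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /categories

User-agent: *
Disallow: /redirect
"""


def allowed(user_agent: str, path: str) -> bool:
    """Report whether user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/categories/", "/redirect/"):
            print(agent, path, allowed(agent, path))
```

Under this parser, a robot that matches a named record never falls through to the `*` record, so Googlebot is still permitted to fetch `/redirect/`. One fix consistent with that reading is to repeat the `Disallow: /redirect` rule inside the Googlebot record.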
Simon Willison - 13th October 2003 10:39 - #
Hi, since I didn't get past the XHTML validator, I posted my comment here: Sencer.de :: Google Comment Spammers, Redirects and PR. I am afraid it won't work as outlined. I had a similar (or rather identical) idea a while back, see what I concluded...
Sencer - 13th October 2003 11:13 - #
Simon Willison - 13th October 2003 11:49 - #
Ben Milleare - 13th October 2003 13:47 - #
Simon Willison - 13th October 2003 14:10 - #
john - 13th October 2003 14:15 - #
Chris - 13th October 2003 14:52 - #
I updated my entry. Just found out that the method I outlined doesn't work either. Proof is in the entry. Hopefully Simon's method with javascript is more effective.
Captchas probably help against blind, automated submissions; however, there are easier ways to (temporarily) throw spammers off. A lot of the spam I saw on blogs looked more like manual spam. However, that may only be because it gets deleted fast enough. ;) (I am not using MovableType, where the problem seems to be bigger because of wider distribution.)
No matter whether stopping the PR transferal helps or not, you at least rob spammers of their anchor-text advantage...
Sencer - 13th October 2003 15:06 - #
You should probably have a look at what Jay Allen is doing with his new MT Blacklist plugin, due for first release today.
Jim - 13th October 2003 15:28 - #
You could change your redirecting script so it takes both the target link and the referring page as arguments (or just the referring path). If someone accesses that script without being referred from your own host, then redirect the person to the source page. This way referral information will work when people follow the links.
Another technique might be to use Javascript to swap in the correct URL when someone clicks on the link. (This also works well for hiding mailto: links from bots without actually mangling the email address)
Ian Bicking - 13th October 2003 15:47 - #
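Ian Bicking's first suggestion above can be sketched as a small decision function. This is an illustration under assumptions, not code from the thread: the function name `choose_destination` and its parameters are hypothetical, and the host name is taken from the redirect URLs quoted elsewhere in these comments.

```python
from typing import Optional
from urllib.parse import urlsplit

# Host taken from the redirect URLs quoted in the thread.
OWN_HOST = "simon.incutio.com"


def choose_destination(target_url: str, source_path: str, referer: Optional[str]) -> str:
    """Decide where the redirect script should send the visitor.

    target_url  -- the outgoing link from the comment
    source_path -- path of the page the comment appears on
    referer     -- value of the HTTP Referer header, or None if absent
    """
    referer_host = urlsplit(referer or "").hostname
    if referer_host == OWN_HOST:
        # The visitor clicked the link on our own page: follow it.
        return target_url

    # Anyone arriving from elsewhere is sent back to the page that holds the
    # comment. Only plain local paths are accepted, so the fallback cannot be
    # abused to redirect to another host.
    if not source_path.startswith("/") or source_path.startswith("//"):
        source_path = "/"
    return f"http://{OWN_HOST}{source_path}"
```

A visitor who follows a redirect URL found on the target site's referrer logs would then land on the original comment page rather than bouncing straight back out.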
Are you, in fact, saying that you believe a real person submits the comment spam that you receive? It's not a robot posting to your comment form? That seems amazing to me. I guess it might be worthwhile, if the people you spam have serious google juice, but I'd never considered it before.
Still, I think we can agree that in general, comments and links from comments are useful and important.
I don't much like the social aspect of proxying comments. That is, you're making your comments less webby. I suppose that it's OK to not lend your site's google juice to just any outgoing link, but it's placing comment links in a sort of web-ghetto.
I recognize that it's a double-edged sword, but I wouldn't choose the same bargain on my site... if I ever updated it. ;)
Again I ask what's wrong with hooking up a bayesian filter to your comment input?
Even if you didn't do the "Please reword this comment" bit (in the belief that it will aid human spammers in crafting a better spam), you could dump questionable comments into a queue for review. This might fit nicely into your peer review idea, if it gets overwhelming to maintain.
Jeremy Dunck - 13th October 2003 16:09 - #
Hello Simon,
You could encode the URLs which you add to your redirect script, since spammers often look for "open redirect proxies".
This is why they often use Yahoo's redirect URLs to hide the websites hosting their SPAMvertised products. This also protects them from automated SPAM reporting systems such as Spamcop.net.
So instead of http://simon.incutio.com/redirect/?http://www.yahoo.com
I would do something like http://simon.incutio.com/redirect/?siteid=1
The first time a link is created in your comments, it is added to your links table and given an id. That way your redirect service cannot easily be used for random links.
Another option is something like: http://simon.incutio.com/redirect/?siteid=c5f7ac7a3ec5f014723738c15891a896
That is an MD5 hash of the original URL. Since MD5 is one-way (the URL cannot be recovered from the hash), you would have to store the MD5 of each URL in the links table and look it up when it is referenced.
Or the best option would be to combine the id of the link entry with the URL of the referenced site and then encode it using MD5. So it would be done with something like md5("siteid25_http://www.yahoo.com").
That way no one would be able to write an automated script to encode links in MD5 and illegally use your redirect script. (Because they would need to know the linkid of the site in your links table.)
I hope that all makes sense ...
Jay Sheth - 13th October 2003 16:13 - #
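Jay Sheth's last option above (hashing the link id together with the URL) might look roughly like this. It is a sketch only: `LinkTable` is a hypothetical in-memory stand-in for the links table, and the `siteid<id>_<url>` string format follows the example in the comment. MD5 serves here purely as an opaque lookup key, as in the comment; it is not a recommendation for security-sensitive hashing.

```python
import hashlib
from typing import Dict, Optional


class LinkTable:
    """Hypothetical in-memory stand-in for the links table in the comment."""

    def __init__(self) -> None:
        self._next_id = 1
        self._url_by_token: Dict[str, str] = {}
        self._token_by_url: Dict[str, str] = {}

    def register(self, url: str) -> str:
        """Store url (once) and return the token to use in the redirect link."""
        if url in self._token_by_url:
            return self._token_by_url[url]
        link_id = self._next_id
        self._next_id += 1
        # Same shape as the comment's example: md5("siteid25_http://www.yahoo.com")
        token = hashlib.md5(f"siteid{link_id}_{url}".encode("utf-8")).hexdigest()
        self._url_by_token[token] = url
        self._token_by_url[url] = token
        return token

    def resolve(self, token: str) -> Optional[str]:
        """Return the stored URL for token, or None if it was never registered."""
        return self._url_by_token.get(token)


if __name__ == "__main__":
    table = LinkTable()
    token = table.register("http://www.yahoo.com")
    print(f"/redirect/?siteid={token}")
    print(table.resolve(token))
```

Since the redirect script only honours tokens already present in the table, an outsider cannot mint a working redirect for an arbitrary URL by hashing it.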
KO - 13th October 2003 17:19 - #
mb - 13th October 2003 18:05 - #
I'm trying out something much more basic and less intrusive - a good old .htaccess/.htpasswd solution (the poor man's equivalent of "Type these 6 very hard to read numbers and letters into this box").
Full write-up on my site.
Peter Bowyer - 13th October 2003 18:16 - #
As I don't think sites are being hand selected I think this solution will take a long time before it achieves its goals. Which is not to say taking the long view is wrong.
In terms of usability for all us humans who read your blog, perhaps setting the real URL as the title of the links, so that we can see where the redirect will take us now that the status bar has been obscured?
kellan - 13th October 2003 18:46 - #
I'm just using a simple .htaccess directive:
<Limit GET>
order allow,deny
deny from spammers.ip.address.here
allow from all
</Limit>
Gavin Laking - 13th October 2003 20:02 - #
Ian Bicking - 13th October 2003 22:33 - #
Hi Simon, I just thought of something very simple after looking at a page which employs the ROT13 method of obscuring text.
The sample JavaScript ROT13 form at geht.net [ http://tools.geht.net/rot13.html ] demonstrated something to me: you can use the same (ROT 13) function to obscure and unobscure text.
What if, by default, just after the Submit button on a comment form is pressed, all the form fields were obscured using browser-side (JavaScript-driven) ROT13 (or a variant on ROT13)?
After the comment form is submitted, the server-side script (PHP or Perl) runs the ROT13 function on it. Then the comment text and associated data (email / links) are stored in the database.
If a SPAM robot tries to submit a comment, the ROT13 function will be run on it only on the server side, resulting in it coming up obscured. That pretty much guarantees that the inserted spam links will be non-functional :-)
Just a thought ...
Jay Sheth - 14th October 2003 03:35 - #
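The ROT13 round trip Jay Sheth describes above can be sketched with Python's built-in `rot_13` codec. The function names are hypothetical, and the browser side is simulated here rather than written in JavaScript.

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13; applying it twice returns the original text."""
    return codecs.encode(text, "rot_13")


def server_store(submitted: str) -> str:
    """Hypothetical server-side step: always run ROT13 on whatever arrives."""
    return rot13(submitted)


if __name__ == "__main__":
    comment = "Nice post, see http://example.com/"

    # A real browser obscures the field on submit; the server then restores it.
    from_browser = rot13(comment)
    print(server_store(from_browser))

    # A bot posts the raw text straight to the form handler, so the server's
    # ROT13 pass scrambles it, including the link.
    print(server_store(comment))
```

The scheme only holds as long as spam bots do not apply the same ROT13 step before posting.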
I was hopeful about the redirects, and was disappointed that they don't reduce the Google juice. But it got me thinking....
In WordPress, we automagically obfuscate email addresses by randomly encoding characters as numeric entities. Yes, that can be decoded for email scraping, but the comment spammer is trying to accomplish something different. So by encoding URLs in a similar fashion, would we kill the Google juice, and reduce the incentive for the spam?
The idea is that we take a URL like 'http://example.com/', and transform it into something like 'http://ex&#97;mple.co&#109;/'
This assumes, of course, that Google doesn't normalize the entities before attempting to index the links. Does anybody have any insight on that?
Dougal Campbell - 14th October 2003 16:38 - #
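To make Dougal Campbell's idea above concrete, here is a sketch of randomly encoding characters as numeric entities, together with the decoding step any standard HTML parser performs. It is not WordPress's actual implementation: `obfuscate` and its `probability` parameter are hypothetical, and the example says nothing about what Google's crawler really does.

```python
import html
import random


def obfuscate(url: str, rng: random.Random, probability: float = 0.5) -> str:
    """Replace each character with a numeric entity with the given probability."""
    return "".join(
        f"&#{ord(ch)};" if rng.random() < probability else ch
        for ch in url
    )


if __name__ == "__main__":
    rng = random.Random()
    encoded = obfuscate("http://example.com/", rng)
    print(encoded)
    # Decoding character references is part of ordinary HTML parsing,
    # so the original URL comes straight back.
    print(html.unescape(encoded))
```

The round trip through `html.unescape` is lossless, which is why the question raised above about whether a crawler normalizes entities matters so much for this approach.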
Matt - 14th October 2003 16:45 - #
Dougal,
Since such encoding is standard HTML, I'd be very much surprised if they didn't.
I think that any systematic approach that allows the client to adapt is an arms race that will fail in the long run.
And Ian's got the right idea. Likely, we're both reading the same thing.
-Jeremy
Jeremy Dunck - 14th October 2003 21:53 - #
True, it will also reduce the link matching for non-spam commenters. This will mainly affect services that track blog interconnectedness (Technorati, Daypop, et al). But really, most linkage of that nature occurs at the main article listing level, not at the comment level, I'd think.
And the link redirection method would have the same side-effect, yes?
But many site admins might consider that a small price to pay. The link still looks normal in a browser, and still behaves normally. But it should break the link relevance in Google, which could go a long way towards deterring a spammer.
I realize that it's not a complete solution in itself, but it could very well be a useful part of a suite of techniques.
Dougal Campbell - 14th October 2003 21:57 - #
Serendipity (via a comment on Sam Ruby's):
I don't think his follows links for its decision, though.
Jeremy Dunck - 16th October 2003 07:57 - #
Hi Simon,
I'm not sure if this has been suggested before, but what about using <META NAME="GOOGLEBOT" CONTENT="NOFOLLOW"> described at Google Information for Webmasters.
If it works as advertised, this should allow the comments to be indexed, but stop outgoing links from being followed.
Cheers, Marty.
Martin Kenny - 16th October 2003 09:21 - #
Junkeater - 18th October 2003 14:47 - #
Junkeater - 18th October 2003 14:49 - #
Jimmy Cerra - 31st October 2003 06:44 - #
Sanja - 11th November 2003 07:19 - #
Sanja - actually I'm pretty sure a lot of the spam I was receiving was posted manually. Scripts have been written that target Movable Type, but I don't run MT (this is a custom blogging system). I know that if I was a spammer I would be quite happy to manually spam comment systems simply because the payback in terms of Google PageRank is so high - it's easily worth spamming by hand.
Image based spam prevention systems are a nice idea, but are completely inaccessible and also do nothing to deter manual spam. I've actually seen a marked decrease in spam since I implemented my warning notice combined with the PageRank killing redirect - I don't know if this means my approach is working or not but it's certainly a good sign.
Simon Willison - 11th November 2003 19:38 - #
James Seng - 16th January 2004 04:50 - #
Jason Shao - 19th January 2004 08:45 - #
Simon Cox - 29th January 2004 17:04 - #
Alden Bates - 30th January 2004 07:43 - #
Simon Willison - 30th January 2004 08:42 - #
Alden Bates - 4th February 2004 21:53 - #
Would it really be a big deal to not allow folks to specify any URLs in comments (incl. as part of their name signature)? Wouldn't this eliminate the problem entirely?
I guess this is where the whole MT "comment registration system" is going? (I haven't read up on exactly what they're doing, yet.) But, if I were to build something, I'd probably do it this way: if you want to tag your comments with a URL, you have to be a registered user (authenticated via challenge-response using something like email). Otherwise, you just get to leave a name with your comment.
If spammers want to register with you first, well, oh well.
Dossy - 10th February 2004 23:18 - #
I AM SPAM - 15th March 2004 00:11 - #
Ryan - 15th March 2004 00:31 - #
It was pointed out to me that without a (setTimeout();) delay, using
document.location = "http://www.theurl.com/";
makes it difficult for most IE users to use the Back button to get back to the page with the comments. This is an accessibility problem.
In my own redirection script I've tried using
document.location.replace(url);
instead, and it seems to completely remove the redirection step from the browser's history (at least in IE/win, Opera7 and Mozilla), allowing for unimpeded use of the Back button.
Example: http://www.anomy.net/indirector/?http://mar.anomy.net/
Now my question is whether the
document.location.replace()
method is in any way likely to cause us grief? Is Google more likely to index it, etc?
Már - 8th April 2004 16:42 - #
turbofisk - 24th August 2004 17:59 - #
Dave Child - 30th September 2004 13:46 - #
Kev Swindells - 19th January 2005 09:34 - #
Peter Bengtsson - 27th March 2005 17:43 - #
Jo - 12th August 2005 12:29 - #
Aldo Tripiciano - 22nd August 2005 22:59 - #
Kramer - 7th December 2005 23:51 - #