Social whitelisting with OpenID
A key feature of OpenID is that it provides a globally unique identifier for every user, no matter what site or service they are using on the Web.
This gives us a powerful tool to fight comment spam. If someone has logged in with an OpenID and we are confident that they are not a spammer (remember, spammers can create OpenIDs too) we can add them to a whitelist, allowing their comments to skip any moderation step or spam guard that we might have in place.
This weblog has a comment spam detection system based on simple heuristics. Comments are assigned a score; if the score exceeds a certain level the comment is placed in a queue for moderation. As of today, one of the heuristics is “does the comment author have an OpenID that is on the whitelist”. I’ve populated my whitelist with the OpenIDs of people who have posted two or more useful comments and do not appear to be using an anonymous provider. I’ll be adding to it regularly in the future.
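A minimal sketch of how a whitelist check might sit alongside other scoring heuristics. The threshold, the weights, the link-counting rule, and every name here (`Comment`, `spam_score`, `needs_moderation`) are assumptions made for illustration; this is not the code this weblog actually runs.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical threshold: comments scoring above it are queued for moderation.
MODERATION_THRESHOLD = 5


@dataclass
class Comment:
    body: str
    openid: Optional[str] = None  # None means the author did not sign in


def spam_score(comment: Comment, whitelist: Set[str]) -> int:
    """Add up simple heuristics; a higher score means more spam-like."""
    score = 0
    # Illustrative heuristic: each link in the body adds two points.
    score += 2 * comment.body.lower().count("http://")
    if comment.openid is None:
        # Illustrative heuristic: anonymous comments look a little riskier.
        score += 3
    elif comment.openid in whitelist:
        # The whitelist heuristic: a known-good OpenID lowers the score a lot.
        score -= 10
    return score


def needs_moderation(comment: Comment, whitelist: Set[str]) -> bool:
    """True if the comment should be held in the moderation queue."""
    return spam_score(comment, whitelist) > MODERATION_THRESHOLD
```

With these made-up weights, a whitelisted author can post a comment containing three links without being held, while the same comment posted anonymously goes to the queue.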
Here comes the social part: I’m sharing my whitelist. If you run your own OpenID-enabled weblog you are welcome to include my whitelist in your comment spam heuristics. If you publish your own whitelist, I will happily do the same.
Social whitelisting benefits from being de-centralised, just like OpenID. If I find that you have whitelisted a spammer, I can unsubscribe from your whitelist. There’s no central authority or point of failure.
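A rough sketch of what subscribing to other people's whitelists could look like. It assumes each whitelist is published as plain text with one OpenID URL per line, which is an assumption and not a format described here; the function names and the `fetch` parameter are also hypothetical.

```python
from typing import Callable, Iterable, Set
from urllib.request import urlopen


def parse_whitelist(text: str) -> Set[str]:
    """Parse one OpenID per line, skipping blank lines and '#' comments."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def fetch_whitelist(url: str, timeout: float = 10) -> Set[str]:
    """Download a published whitelist and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_whitelist(response.read().decode("utf-8"))


def combined_whitelist(
    own: Iterable[str],
    subscriptions: Iterable[str],
    fetch: Callable[[str], Set[str]] = fetch_whitelist,
) -> Set[str]:
    """Merge my own whitelist with every whitelist I subscribe to."""
    combined = set(own)
    for url in subscriptions:
        try:
            combined |= fetch(url)
        except OSError:
            # A subscription that cannot be reached is skipped, not fatal:
            # no single publisher is a point of failure.
            continue
    return combined
```

Unsubscribing from someone who has whitelisted a spammer is then just a matter of removing their URL from `subscriptions`; nobody else has to be involved.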
Long-time readers may be feeling a strong sense of déjà vu. Way back in September 2003, I proposed shared comment blacklists as a solution to weblog comment spam. The idea was simple: every time you delete a spam comment, you add the link it was advertising to a public blacklist. Other blogs could then subscribe to your blacklist and block any new comments advertising the same site.
The blacklisting idea was flawed from the very start. It was a classic example of Marcus J. Ranum’s number one dumbest idea in computer security: Default Permit. Spam blacklists assume that if we don’t know a link is bad, it’s good. Spammers can create new bad links far faster than we can blacklist them.
Here’s Ranum’s suggested alternative:
The opposite of “Default Permit” is “Default Deny” and it is a really good idea. It takes dedication, thought, and understanding to implement a “Default Deny” policy, which is why it is so seldom done. It’s not that much harder to do than “Default Permit” but you’ll sleep much better at night.
Social whitelisting uses Default Deny. As such, I believe it has a much higher chance of making a useful impact on the comment spam problem.
Update: I should have mentioned that this idea developed over a number of discussions with Tom Coates, which totally slipped my mind when I was writing it up at 3am.
We've been kicking around something like this as well. This seems like the perfect thing to put on your personal identity page (or directly link off of that).
If there was a way to publish lists of groups in a "secure" fashion (I may not want everyone to see who is on my 'work' or 'friends' lists) then this would be amazing. You could essentially create one social network, once and for all and consume it everywhere.
Is this OpenID stuff cool or what?! :-)
The use of the word "Social" brings to my mind that these whitelists COULD be shared through social bookmarking services. If you used ma.gnolia to keep up with your whitelist by using openid:whitelist/openid:spammer, etc as tags for openid urls, then if I trusted you enough, I could use your live whitelist for my openid-enabled app. Just a crazy idea. :)
Excellent idea. I'd love to see it turn into some sort of huge decentralized OpenID scoring / reputation system.
While I'm dreaming, maybe weblog platforms could automatically expand their whitelist by following XFN-tagged links in existing whitelist URLs...
Oh, and I want a pony.
Beware of strangers.
I'm wondering if it would be feasible to hide the whitelist behind an OpenID login itself, such that only people you have whitelisted can see the whole whitelist.
It could be done, surely, I just wonder if it has benefits to keep the whitelist a little more secret.
This is a useful comment. I want to be on your whitelist! *kidding*
QUOTE: "If you run your own OpenID-enabled weblog you are welcome to include my whitelist in your comment spam heuristics. If you publish your own whitelist, I will happily do the same."
Good idea! In order to be feasible, tho, I believe a semi-automated update mechanism would be needed. I am not sure that sharing the whitelists over magnolia or del.icio.us would be a good idea, mostly because many people (including myself) are using these services as actual bookmarking services and not as data exchanges. ;)
But then again, having a central spot/site to share whitelists would mean it's not decentralized anymore... Hmm...
I need to ponder this for a while.
So, everyone's guilty until proven innocent, eh? :)
Mislav - 22nd January 2007 09:30 - #
'If there was a way to publish lists of groups in a "secure" fashion'
LOAF does this -- steal the ideas!
http://loaf.cantbedone.org/about.htm
We tried social-network-driven IP whitelisting a few years back in email:
http://www.web-o-trust.org/
It's not as simple as it first appears. Once someone in the web -- a friend of a friend -- trusts a marginally-spammy identity, everyone gets the spam, and tracking down the culprit can be hard unless you've designed for that in the first place (this happened in our case, and killed the experiment). I think you need to use a more complex advogato-style trust algorithm, instead of the simplistic one web-o-trust used, to avoid this danger.
Basically, my gut feeling is that a web of trust for anti-spam is an attractive concept, _possible_, but a lot harder than it first appears. It's been suggested repeatedly ever since I started writing SpamAssassin, but nobody's yet come up with a working one... that's got to indicate something ;)
In the meantime, the concept of a trusted party who publishes their concept of an identity's reputation -- like Dun and Bradstreet, or Spamhaus.org, works very nicely indeed.
PS: Simon, thanks for the recent link-blogging -- you keep finding great links!
I thought I posted a couple of useful comments here... :-(
Michael's comment is definitely a step in the right direction. This should be a P2P network requiring no user interaction.
I agree that this is the way to go and I've been using and promoting "greenlisting" (whitelisting) for email spam filtering for years. For a cool idea for social greenlisting (social whitelisting), check out Joshua Schachter and Maciej Ceglowski's LOAF, which "is a simple extension to email that lets you append your entire address book to outgoing mail message without compromising your privacy." LOAF could probably be adapted for blogs and other non email purposes. Details about LOAF are at <http://loaf.cantbedone.org/>. Details about my reverse spam filtering system are at <http://www.ii.com/internet/messaging/spam/>.
Nancy McGough - 22nd January 2007 11:30 - #
Simon, I agree that white-listing is the way forward. At the moment, I moderate all comments on my blog but will implement OpenID to whitelist 'known' bloggers. i.e. those who have built up reputations over many months/years.
This is really layering a trust system on top of OpenID.
I think that 2 good comments and you're in is perhaps a tad simplistic. I already get spam which posts some innocuous comment (e.g. nice site!) which is then followed up by the real spam a few days later (assuming they've been white-listed).
Spammers will simply make a 'long sting'. However, humans are perhaps better able to tell who to trust and can make good decisions on very few posts.
Trusting who somebody else trusts may be a step too far (reminds me of PGP web of trust). If the owner of that list is a spammer who has seeded lots of spam IdPs in his white list (together with legitimate ones), how do you revoke all those under him (and perhaps those under theirs) once the spammer flips the switch?
Charles Darke - 22nd January 2007 11:49 - #
Very interesting idea. Some comments:
1. Wow, it's all about trust. Whom do you see trusting, though? A sibling? The guy you went to college with (but didn't really like)? Your favorite bloggers? Writers from your favorite web sites? At what level of trust are you asking me to place these people? Driving my car? Holding my wallet? Dating my sister? If it's heavy trust, it's my brother; if it's medium trust, I'd give Andy Baio a nod; if it's carefree and low threat, I'd be willing to let Jake Applebaum and his hooligan friends be trusted. If it's just trust over posting comments, fine. But plug-ins for Wordpress, or editorial capabilities -- that's more.
2. Smaller is better, for once. Lists, that is. If you've got a list from hell, miles long, I may be more cautious about truly giving my trust. And it's compounded by #1 above, too.
3. What about the differences in our cultures? With talk in some countries, of holding hosts liable for the comments their readers make, this might be an issue. Is it OK to talk about circumventing HDMI encryption in Sweden, when it's not in America? Can I comment about Serb elections and the final status for Kosovo, without getting the host in trouble?
I like the idea -- it's about convenience. I just worry about the implications.
Art - 22nd January 2007 12:58 - #
I think shared *ip address* blacklists could work. Spammers can create new nodes quickly (by taking over more computers), but not as quickly as they want. With a shared blacklist, each compromised computer would be much more limited in how much damage it can do.
Jesse Ruderman - 22nd January 2007 13:44 - #
Sharing whitelists is the start of the process. The next step will be stealing Advogato's trust-metric algorithm to propagate the 'probably not a spammer' scores more reliably.
This is a great idea! The next step would be to formalize it somehow, so it can be automated and thus include all non-technical people as well as us geeky bozos.
Having the whitelist behind an OpenID login makes a lot of sense, so perhaps formalizing this in OpenID is the right thing to do? I dunno, but I'm very confident this will work out great. At least once we can say "No OpenID? Then begone, you smelly rodent!".
Interesting.
Are some OpenID URLs more "trustworthy" than others?
It is rumoured that Google prefers domain names that are registered for a longer period of time (as it costs more to register for longer).
And of course I would imagine that the root directory of a domain name would be more valuable than some subdirectory.
Perhaps then, a heuristic should just take into account the Google pagerank of an OpenID URL :-)
@Jason Davies
I think so. For example:
mydomain.com/openid.html
is more 'valuable' than:
user1.mydomain.com/subdir/openid.html
It would be good if a protocol could be established to agree a name of a file e.g. openid.pointer which would be put in the root directory and that this file would point to the location of the IdP.
What do we gain from this? It doesn't quite assert ownership of the domain, but it is close. And using an existing HTTP server means that there's no need to build DNS queries into it (the cleanest approach would be to have an entry in DNS pointing to the IdP).
Charles Darke - 22nd January 2007 17:22 - #
Very interesting idea indeed. The problem of course is managing the whitelist over time.
Not everyone will be interested in maintaining a white list of their own, but we don't want a situation of a fully centralised whitelist maintainer. How about a central location which allows you to login and update your personal whitelist, while also selecting the others that you want to add to your global whitelist?
This way there is a central store, but no central control, which would also enable a 'most trusted' list etc.
David Bell - 22nd January 2007 18:51 - #
I got a "502 Bad Gateway" from your whitelist, but ten minutes later it works. Are you working on keeping your whitelist more secret? :)
zyegfryed - 22nd January 2007 21:04 - #
For those wanting to keep the whitelist slightly more secret why not simply MD5 each entry - it's still easy to do the comparison but at least it's harder for someone to determine exactly who is on your whitelist.
At the moment I can't think of a reason as to why publishing the list in the clear would be a problem, but it certainly doesn't feel quite right.
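A small sketch of the hashing suggestion above, using MD5 as proposed. The function names are hypothetical, and the only normalisation assumed is stripping surrounding whitespace. Note that hashing only hides the list from casual reading: anyone can still test whether a specific OpenID they already know is on it.

```python
import hashlib
from typing import Iterable, List


def hash_openid(openid: str) -> str:
    """Return the MD5 hex digest of an OpenID URL."""
    return hashlib.md5(openid.strip().encode("utf-8")).hexdigest()


def publish_hashed(whitelist: Iterable[str]) -> List[str]:
    """Build the list to publish: digests only, no readable URLs."""
    return sorted(hash_openid(openid) for openid in whitelist)


def is_whitelisted(openid: str, hashed_entries: Iterable[str]) -> bool:
    """Check a commenter by hashing their OpenID and comparing digests."""
    return hash_openid(openid) in set(hashed_entries)
```

Both sides would have to agree on exactly how an OpenID is written before hashing (trailing slashes, letter case), otherwise the digests will not match.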
Incidentally Simon - I notice that after logging in via OpenID on your site I'm redirected back to your homepage. I was wondering whether or not OpenID provides the mechanism for redirecting back to the URL you came from or if you have to specify a fixed endpoint.
Just to clarify: the "two decent comments" metric was just the simplest thing I could think of to get things started. I'll be developing a smarter way to manage my own whitelist as time goes on.
Dmitry: You're not on there because I haven't figured out what to do with i-names yet. It seems that I should be whitelisting the i-number rather than the i-name, but I haven't yet worked out how to resolve those using the JanRain library.
Ed: you can redirect back from OpenID to a specific URL (by passing some extra information along when you auth) - I've half written the code for the site, but really need to get around to finishing it!
zyegfryed: Nothing smart going on there; that's an intermittent error with my nginx/Apache setup which I haven't found a solution for yet.
This is a brilliant idea, but it needs some mechanism for (at least semi-automatic) sharing of the whitelists. Maybe an autodiscovery-meta, like for RSS-feeds?
Definitely. Whitelists are the way to go, but some sort of system by which to automate list subscriptions certainly wouldn't hurt. :P
Maybe a certain site which would let you log in with your OpenID, then add and subtract whitelists (not actually as part of your OpenID; just like registering on the site).
It might be useful. (Same goes for this comment!)
I have a simple and elegant solution for the automated exchange that's been proposed here. We need a site that allows page owners to sign up and save a URL to some serialized version of their local whitelists, then subscribe to other members' whitelists. The data could be exchanged in XML or some sort of simple delimited format since we're dealing with a relatively plain dataset. Members could then periodically download aggregate versions of the various whitelists they had subscribed to in order to update their local active whitelists, and the site could periodically scrape users' whitelists for updates. The site could also publish a feed of "most trusted" OpenID providers or "least trusted", etc... Feeds could be commented on, rated and critiqued by the users. Just an idea, wouldn't take much to get a simple beta running.
There would be certain benefits to an infrastructure facilitating the sharing of whitelists amongst trusted peers. Not least of which would be the reduction in network traffic due to centralised de-duping of addresses. A group of peers might share 80% of the same trusted IDs, for example.
Of course, the beauty of OpenID is that if there were to be a security breach enabling a spammer to get their ID onto your whitelist, we can identify their comments and remove them pretty quickly.
Just wanted to note that I don't think blacklists are entirely a dead idea. There's some value in having a shared blacklist as well as a shared whitelist, in that if you're manually approving commentors into a whitelist, you also want to be able to blacklist spammers so if they've posted 200 spam messages under the same id you only have to deal with saying no once.
Lach - 23rd January 2007 22:14 - #
I think that's right, Lach. As soon as you start sharing a whitelist, you also need to share a list of deliberate omissions to that whitelist.
Otherwise, if a previously trusted ID that you have shared with your peers becomes untrusted (say if the ID gets sold to a spammer) it's hard to distinguish such an omission from the innocuous removal of an ID for housekeeping reasons.
Good point on the blacklists, since it's likely that certain IDs will appear on more than one whitelist, there would need to be an over-riding blacklist system.
If, as Paul suggested above, an XML standard was used for lists, it could be created to contain both your whitelist and your blacklist.
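One possible reading of the over-riding blacklist idea, sketched under assumptions: each peer publishes both lists together, and an OpenID counts as trusted only if some peer vouches for it and no peer has explicitly rejected it. The `SharedList` type and `is_trusted` function are hypothetical names, and "any blacklist wins" is just one policy among several that could be chosen.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class SharedList:
    """What one peer publishes: who they vouch for and who they reject."""
    owner: str
    whitelist: FrozenSet[str] = frozenset()
    blacklist: FrozenSet[str] = frozenset()


def is_trusted(openid: str, lists: Iterable[SharedList]) -> bool:
    """Trusted if at least one peer whitelists it and nobody blacklists it."""
    lists = list(lists)
    whitelisted = any(openid in peer.whitelist for peer in lists)
    blacklisted = any(openid in peer.blacklist for peer in lists)
    return whitelisted and not blacklisted
```

Publishing the blacklist next to the whitelist is what lets subscribers tell a deliberate rejection apart from an ID that was merely dropped during housekeeping.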
how about a way to link to a whitelist from the LINK REL section of the header? in this way you could automatically discover who is vouching for people...
joshua schachter - 24th January 2007 01:24 - #
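A sketch of how that kind of autodiscovery might work, in the spirit of RSS feed autodiscovery. The `rel` value `openid-whitelist` is invented for this example; no such convention has been agreed. The class and function names are likewise hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Hypothetical rel value; nothing like this has been standardised.
WHITELIST_REL = "openid-whitelist"


class WhitelistLinkFinder(HTMLParser):
    """Collect href values from <link> tags carrying the whitelist rel."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if WHITELIST_REL in rels and href:
            self.hrefs.append(href)


def discover_whitelists(html: str, base_url: str) -> List[str]:
    """Return absolute URLs of any whitelists advertised by a page."""
    finder = WhitelistLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.hrefs]
```

A weblog could run this against a commenter's own home page to find the whitelist they publish, then decide separately whether to subscribe to it.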
Scott Reynen - 25th January 2007 06:22 - #
re 'a hash of the identities' -- as I noted here, two good algorithms to publish this are Google's enchash format and LOAF.
If you just create yourself a "social" network with OpenId and FOAF you can use it as a foundation to then build trust. Then from the trust level you have with your friends you can use the whitelists accordingly. For an idea of social network based on OpenId and FOAF, you can take a look at that:
http://xhtml.net/breves/308-The-no-network-social-network
The concept is just a derivative of the work already done with OpenID and FOAF.
The funny effect of the buzz around OpenID is that now a lot of my projects have a new requirement: OpenID enabled. :)
Loïc d'Anterroches - 25th January 2007 19:55 - #
Hi, what do you think about the new Google Malware Warnings? They are a kind of public blacklist. Indeed they can really damage a brand (or a person) if the site is improperly flagged. As an example search "tom dyson" in Google... Tom Dyson is a Torchbox co-founder and his personal website Throwing Beans was (improperly, I suppose) flagged by Google as "potentially harmful".
Andrea - 26th January 2007 16:26 - #
"If I find that you have whitelisted a spammer, I can unsubscribe from your whitelist."
This seems overly fragile. Spammers regularly invest time on message boards to become 'trusted' and then leverage this trust.
Yes, this is human-intensive, but that is what 3rd-world spam sweatshops are for.
Eventually, everyone will have whitelisted *some* spammer or other.
You need a mechanism for flowing feedback back to the whitelist maintainer so they can remove the spammer, before you escalate to unsubscribing from their whitelist, or the whole thing is too brittle.
Michael Bernstein - 27th January 2007 17:03 - #
nice
Great job with your PHP XML-RPC library. The most easy, usable library I've seen. Some of the logic is a bit verbose (more than typical for PHP at least). But, very clear. On the site for it though, I would show that the language does support associative array->struct. I dug all through to make sure it would interact correctly. But, enlightening nonetheless.
Rob Colburn - 9th February 2007 10:04 - #
Seems like it would be good to create some kind of standard format for these whitelists. Also, it seems like there could eventually be an easy mechanism for "flagging" URLs that were whitelisted, but are suspect. Then, it seems like there could be some way to alert people about that flagging (this URL has been flagged 'x' times).
I must be very lucky. I've had the one LJ/blog for coming up on four years now, with most (80-90%?) of all posts public, anonymous comments allowed (IP tracking enabled) and in that time I've only had two spammers leaving dreck in the comments.
Mind you, both of those were during January of this year, so maybe it's just gearing up for a sudden onslaught? It's definitely something I'll be keeping an eye on though, so the development and propagation of OpenID is the obvious solution. Although it does, as ever, still depend on the security of the disparate OpenID servers and the end users not doing silly things with their authentication information.
Clever :) Social Whitelisting of OpenIDs could be the trick that rescues distributed conversation from death by spam.
I think this sort of concept is critical to the success of OpenID. Right now, OpenID is great for the user, because it's really easy to get an OpenID url and you can use it all over.
This is exactly why it's less than helpful for site owners. The mere existence of an OpenID url doesn't tell me anything at all about the "quality" of the id.
There are always people there in the market pointing at the negative aspect of things. I would say pointing at negativity is not bad at all because it helps to become efficient but on the other hand one should appreciate the positive features as well for appreciation
http://www.jeffpaul.tv
Tracy Esau - 11th April 2008 07:42 - #