Pingback redux
I think I’ve worked out a way of implementing Pingback (or a Pingback-like system) without any need for XML-RPC, <link> elements or custom HTTP headers.
There are three principal reasons for using Pingback to “detect” a link to a page rather than relying on referrals:
- A referral from a blog is likely to come from that blog’s front page, whereas any link back to that content should target the permalink of the specific entry from which the incoming link was made.
- Pingbacks are deliberate—they show that the source of the Pingback is a deliberate response to the linked item and wishes to be listed as such.
- It is possible (although unlikely) for a link to remain undetected simply because that link has never been “clicked” by anyone.
Pingback solves these problems through “alert me if you link to me” information embedded in the HTTP headers or metadata of a page, combined with a simple XML-RPC server for accepting alerts. While this solves the problems outlined above, the overhead of carrying out a Pingback is quite large and implementing the client and server is quite challenging. The system is also of no use at all unless both parties have Pingback installed.
My solution is an extension of my own Pingback implementation. Whenever I link to a site from my blog, a script running on my server requests each of the pages I have linked to and checks for information on a related Pingback server (this is standard behaviour for any conformant Pingback client). As a nod towards those users who do not have Pingback enabled, the script sends the permalink of the linking item as the Referer header, to ensure that their logs have at least one hit from the entry in question.
It dawned upon me that if this single “hit” was identifiable as a Pingback probe, the process could stop there: the target server would have the required information that “Page X linked to Page Y at Time T” and would be able to process the Pingback straight away.
How to identify the hit? Two methods come to mind: the request could include an additional header (X-Pingback-Probe: yes), or the User-Agent string could include some standard string. Since some scripting languages (such as PHP) do not provide access to non-standard headers in the HTTP request, the second option seems immediately favourable.
Here is some outline PHP code for spotting and responding to my proposed “pingback-probe” requests:
if (isset($_SERVER['HTTP_REFERER'], $_SERVER['HTTP_USER_AGENT']) &&
    strpos($_SERVER['HTTP_USER_AGENT'], 'pingback-probe') !== false) {
    // User-Agent contains 'pingback-probe' and referer information is present
    $linkFrom = $_SERVER['HTTP_REFERER'];
    $linkTo = 'http://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
    if ($info = checkPingback($linkFrom, $linkTo)) {
        addPingback($linkFrom, $linkTo, $info['title'], $info['extract']);
    }
}
function checkPingback($linkFrom, $linkTo) {
    /* This function loads the $linkFrom page and checks that it really does
       contain a link to $linkTo. If not, it returns false. If the link exists,
       it grabs the title of the page and an extract of text found surrounding
       the $linkTo link and returns them in a small associative array. */
}
function addPingback($linkFrom, $linkTo, $title, $extract) {
    /* This function saves the pingback information, presumably by logging it
       to a file or saving it to a database. It would almost certainly save
       the time the Pingback was received as well. */
}
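As a rough illustration of the checking step described in the checkPingback comment, here is a minimal sketch of the part that inspects an already downloaded page. The helper name findLinkInHtml is hypothetical and not part of the outline above; it assumes PHP's DOM extension is available, matches the href only by exact string comparison, and uses the text of the link's parent element as a crude extract.

```php
<?php
// Hypothetical helper (not from the original outline): checks HTML that has
// already been fetched for a link to $linkTo. Returns false if no such link
// exists, otherwise an array with the page title and a short extract.
function findLinkInHtml(string $html, string $linkTo) {
    if ($html === '') {
        return false;
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed, so parser warnings are suppressed.
    if (!@$doc->loadHTML($html)) {
        return false;
    }
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Assumption: exact match only; relative URLs are not resolved.
        if ($anchor->getAttribute('href') === $linkTo) {
            $titles = $doc->getElementsByTagName('title');
            $title = $titles->length > 0 ? trim($titles->item(0)->textContent) : '';
            // Crude extract: whitespace-normalised text of the link's parent
            // element, cut to 200 bytes.
            $context = $anchor->parentNode->textContent;
            $extract = substr(trim(preg_replace('/\s+/', ' ', $context)), 0, 200);
            return ['title' => $title, 'extract' => $extract];
        }
    }
    return false;
}
```

A real checkPingback would first have to download $linkFrom and would probably want to normalise URLs (trailing slashes, relative links) before comparing them.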
Pingback client implementations would stay similar to the way they work now, except that instead of having to retrieve the target page, check for Pingback server information and send an XML-RPC ping, they would just have to send a single request with the specified referral and user agent information. Implementation is thus simpler for both the client and server sides of the system, while keeping the required functionality.
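The client side of the proposal could look something like the following sketch. The function names buildProbeOptions and sendProbe and the "ExampleBlog/1.0" part of the User-Agent are invented for illustration; only the 'pingback-probe' token and the use of the Referer header come from the proposal itself. It assumes PHP's HTTP stream wrapper is enabled.

```php
<?php
// Hypothetical helper: builds HTTP stream-context options for a probe request.
function buildProbeOptions(string $permalink): array {
    // Strip CR/LF so the permalink cannot inject extra headers.
    $permalink = str_replace(["\r", "\n"], '', $permalink);
    return [
        'http' => [
            'method' => 'GET',
            // 'pingback-probe' is the marker proposed in the post; the rest
            // of the User-Agent string is made up for this example.
            'user_agent' => 'ExampleBlog/1.0 pingback-probe',
            'header' => 'Referer: ' . $permalink . "\r\n",
            'timeout' => 10,
        ],
    ];
}

// Hypothetical helper: requests $targetUrl once with the probe headers set.
// Returns true if the request produced a response body, false on failure.
function sendProbe(string $targetUrl, string $permalink): bool {
    $context = stream_context_create(buildProbeOptions($permalink));
    return @file_get_contents($targetUrl, false, $context) !== false;
}
```

Calling sendProbe once for each outgoing link in a new entry, with the entry's permalink as the second argument, would be the whole client.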
Ian Hickson - 24th February 2003 17:02 - #
Mark - 24th February 2003 17:58 - #
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} pingback-probe
RewriteRule (.*) http://ln.hixie.ch/cgi-bin/deal-with-pingback.cgi?uri=$1 [T=application/x-httpd-cgi,L]
Mark - 24th February 2003 17:58 - #
Sam - 24th February 2003 20:12 - #
I've been thinking about that. Firstly, once a pingback has been accepted from a page, any further pingbacks for that relationship can be ignored. Since the server checks the page being linked from to make sure the link actually exists, casual spamming won't work.
Of course, a really crafty spammer could set up a script of their own that generated pages linking to the page they are spamming, purely to fool the pingback script into linking back to them. This wouldn't be too hard to combat (if necessary) by implementing domain bans: if a spammer from www.freepr0n.com starts spamming you, you can ban all pingbacks from that domain.
If spamming becomes a really big problem, a solution would be to implement a moderation system where pingbacks are only displayed once you have approved them. "Trusted" domains could be implemented which would be trusted implicitly, so in fact moderation would quickly become a small-scale job as most of the sites that regularly linked to you ended up on your safe list.
This leaves the only realistic spam attack a crude denial of service, where dozens of fake pingbacks are sent to cause your server to waste resources downloading lots of fake pages. Even this can be combatted by implementing a throttle of some sort, so only one pingback request is allowed per minute (per IP address).
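For what it's worth, the domain ban and the per-IP throttle could be combined in one small check along these lines. This is only a sketch: the function name shouldProcessProbe, its parameters and the in-memory $lastSeen array are all made up for illustration, and a real version would need to keep its state in a file or database between requests.

```php
<?php
// Hypothetical helper: decides whether a probe should be processed at all.
// $lastSeen maps IP address => Unix timestamp of the last accepted probe.
function shouldProcessProbe(string $linkFrom, string $ip, int $now,
                            array &$lastSeen, array $bannedDomains,
                            int $minInterval = 60): bool {
    $host = parse_url($linkFrom, PHP_URL_HOST);
    if (!is_string($host) || in_array(strtolower($host), $bannedDomains, true)) {
        return false; // unparseable referer or banned domain
    }
    if (isset($lastSeen[$ip]) && $now - $lastSeen[$ip] < $minInterval) {
        return false; // throttled: too soon after the last probe from this IP
    }
    $lastSeen[$ip] = $now;
    return true;
}
```

The check would run before checkPingback, so that banned or throttled requests never cause the server to download anything.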
Simon Willison - 24th February 2003 23:56 - #
Stuart Langridge - 25th February 2003 05:17 - #
Simon Willison - 25th February 2003 11:48 - #
Daniel Nolan - 25th February 2003 19:26 - #
Stuart Langridge - 25th February 2003 22:07 - #
Dave - 25th February 2003 22:23 - #
Fred Cooper - 26th February 2003 01:33 - #