rev=canonical bookmarklet and designing shorter URLs
I’ve watched the proliferation of URL shortening services over the past year with a certain amount of dismay. I care about the health of the web and try to ensure that URLs I am responsible for will last for as long as possible, and I think it’s very unlikely that all of these new services will still be around in twenty years’ time. Last month I suggested that the Internet Archive start mirroring redirect databases, and last week I was pleased to hear that Archiveteam, a different organisation, had already started crawling.
The most recent discussion was kicked off by Joshua Schachter and Dave Winer, and a solution has emerged driven by some lightning fast hacking by Kellan Elliott-McCrea. The idea is simple: sites get to choose their preferred source of shortened URLs (including self-hosted solutions) and specify it from individual pages using <link rev="canonical" href="... shorter URL here ...">.
By hosting their own shorteners, the reliability should match that of the host site—and the amount of damage caused by a major shortener going missing can be dramatically reduced.
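To make the discovery step concrete, here is a minimal Python sketch of how a client might look for a rev="canonical" link in a page. It is not Kellan's service or the bookmarklet code; the names RevCanonicalFinder and find_short_url are made up for illustration, and checking A elements as well as LINK elements follows the update noted later in this post.

```python
from html.parser import HTMLParser


class RevCanonicalFinder(HTMLParser):
    """Collect href values from <link> and <a> elements with rev="canonical"."""

    def __init__(self):
        super().__init__()
        self.short_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr_map = dict(attrs)
        # rev holds a space-separated list of link types; compare case-insensitively.
        rev_values = (attr_map.get("rev") or "").lower().split()
        if "canonical" in rev_values and attr_map.get("href"):
            self.short_urls.append(attr_map["href"])


def find_short_url(html):
    """Return the first rev="canonical" href found in the page, or None."""
    finder = RevCanonicalFinder()
    finder.feed(html)
    return finder.short_urls[0] if finder.short_urls else None
```

A caller would fetch the page, pass the HTML to find_short_url, and fall back to a third-party shortener only when it returns None.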
I’ve been experimenting with this new pattern today. Here are a few small contributions to the wider discussion.
A URL shortening bookmarklet
Kellan’s rev=canonical service exposes rev=canonical links using a server-side script running on App Engine. An obvious next step is to distil that logic into a bookmarklet. I decided to combine the rev=canonical logic with my json-tinyurl web service (also on App Engine), which allows browsers to look up or create TinyURLs using a cross-domain JSONP request. The resulting bookmarklet will display the site’s rev=canonical link if it exists, or create and display a TinyURL link otherwise:
Bookmarklet: Shorten (drag to your browser toolbar)
You can also grab the uncompressed source code.
Designing short URLs
I’ve also implemented rev=canonical on this site. I bought a new domain for this, since simonwillison.net is both difficult to spell and 17 characters long. I went with swtiny.eu: 9 characters, and keeping tiny in the domain helps people guess the nature of the site from just the URLs it generates. Be warned: the DNS doesn’t appear to have finished propagating yet.
For the path component, I turned to a variant of base 62 encoding. Decimal integers are represented using 10 digits (0-9), but base 62 uses those digits plus the letters of the alphabet in both lower and upper case. A 13-digit integer such as 7250397214971 compresses down to just 8 characters (CDeIPpOD) in base 62. My baseconv.py module implements base 62, among others. I considered using base 57 by excluding o, O, 0, 1 and l as being too easily confused but decided against it.
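As a rough illustration of the encoding, here is a small Python sketch. It is not the baseconv.py module itself. The alphabet order (upper case, then digits, then lower case) is an assumption inferred from the CDeIPpOD example above, and the function names encode and decode are hypothetical.

```python
import string

# Assumed alphabet order, inferred from the examples in this post:
# A-Z, then 0-9, then a-z.
ALPHABET = string.ascii_uppercase + string.digits + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def encode(number):
    """Convert a non-negative integer to its base 62 string."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(chars))


def decode(text):
    """Convert a base 62 string back to an integer."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number
```

With this alphabet, encode(7250397214971) gives CDeIPpOD, and decode reverses it.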
This site has three key types of content: entries, blogmarks and quotations. Each one is a separate Django model, and hence each has its own underlying database table and individual ID sequence. Since the IDs overlap, I need a way of separating out the shortened URLs for each content type.
I decided to spend a byte on namespacing my shortened URLs. A prefix of E means an entry, Q means a quotation and B means a blogmark. For example:
- http://swtiny.eu/EZ8: Entry with ID 1584
- http://swtiny.eu/BBEQ: Blogmark with ID 4108
- http://swtiny.eu/QE5: Quotation with ID 279
By using upper case letters for the prefixes, I can later define custom paths starting with a lower case letter. I also have another 23 upper case prefix letters reserved in case I need them.
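The prefix scheme can be sketched in a few lines of Python. This is an illustration rather than the code running on this site: the alphabet order is an assumption inferred from the three example URLs above, and the names PREFIXES, short_path and parse_short_path are invented for the sketch.

```python
import string

# Assumed alphabet order (A-Z, 0-9, a-z), inferred from the example URLs.
ALPHABET = string.ascii_uppercase + string.digits + string.ascii_lowercase

# One upper case letter per content type.
PREFIXES = {"E": "entry", "B": "blogmark", "Q": "quotation"}
PREFIX_FOR = {kind: prefix for prefix, kind in PREFIXES.items()}


def to_base62(number):
    chars = []
    while True:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
        if number == 0:
            break
    return "".join(reversed(chars))


def from_base62(text):
    number = 0
    for char in text:
        number = number * 62 + ALPHABET.index(char)
    return number


def short_path(kind, object_id):
    """Build the path component, e.g. ("entry", 1584) -> "EZ8"."""
    return PREFIX_FOR[kind] + to_base62(object_id)


def parse_short_path(path):
    """Split a path such as "BBEQ" into (content type, ID)."""
    prefix, encoded = path[:1], path[1:]
    if prefix not in PREFIXES or not encoded:
        raise ValueError("unrecognised short path: %r" % path)
    return PREFIXES[prefix], from_base62(encoded)
```

Because parse_short_path only accepts the three known upper case prefixes, paths starting with a lower case letter stay free for custom use.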
I asked on Twitter and consensus opinion was that a 301 permanent redirect was the right thing to do (as opposed to a 302), both for SEO reasons and because the content will never exist at the shorter URL.
Implementation using Django and nginx
I run all of my Django sites using Apache and mod_wsgi, proxied behind nginx. Each site gets an Apache running on a high port, and nginx deals with virtual host configuration (proxying each domain to a different Apache backend) and static file serving. I didn’t want to set up a full Django site just to run swtiny.eu, especially since my existing blog engine was required in order to resolve the shortened URLs.
Instead, I implemented the shortened URL redirection as just another view within my existing site: http://simonwillison.net/shorter/EZ8. I then configured nginx to invisibly proxy requests to swtiny.eu through to that URL. The correct incantation took a while to figure out, so here’s the relevant section of my nginx.conf:
server {
    listen 80;
    server_name www.swtiny.eu swtiny.eu;
    location / {
        rewrite (.*) /shorter$1 break;
        proxy_pass http://simonwillison.net;
        proxy_redirect off;
    }
}
proxy_redirect off is needed to prevent nginx from replacing simonwillison.net in the resulting location header with swtiny.eu. My Django view code is relatively shonky, but if you’re interested you can find it here.
The nice thing about this approach is that it makes it trivial to add custom URL shortening domains to other projects—a quick view function and a few lines of nginx configuration are all that is needed.
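For readers who want a feel for the view side, here is a minimal sketch written as a plain WSGI application using only the Python standard library. It is not my Django view. The make_redirect_app function and the resolve callable are hypothetical names, and the example.com target in the usage note is a placeholder.

```python
def make_redirect_app(resolve):
    """Build a WSGI app that 301-redirects short paths.

    resolve is any callable that maps a short path such as "EZ8" to a
    full URL, or returns None when the path is unknown.
    """

    def app(environ, start_response):
        # Take the last path segment, so /shorter/EZ8 becomes EZ8.
        short_path = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]
        target = resolve(short_path)
        if target is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not found\n"]
        # Permanent redirect: the content never lives at the short URL.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]

    return app
```

For a quick experiment, a dictionary's get method works as the resolver: make_redirect_app({"EZ8": "http://example.com/entry/"}.get).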
Update: The bookmarklet now supports the rev attribute on A elements as well—thanks for the suggestion, Jeremy.
Neat work Simon. Good to see someone implementing solutions here. I know what I'll be doing on Tuesday's train journey now.
I know it would be a bit of a hassle, but why didn't you try to incorporate your URL slug (e.g., revcanonical for this post) into your short URLs, instead of just relying on the base 62 string?
To be perfectly honest, it just didn't cross my mind to do that. The fact that my prefixes are upper case means I can add that kind of thing in the future though. I'll give it some thought.
Fair enough. :-)
By the way, I've got a post that goes into a bit more detail about rev="canonical" and has some useful info in the comments:
http://shiflett.org/blog/2009/apr/save-the-internet-with-rev-canonical
Simon, could you update the code so that it also looks for rev="canonical" claims on A elements as well as LINK elements? Muchas gracias.
Awesome idea. I too have a short variant of my unwieldy long domain name. http://u-e-h.net/ which I use in twitter messages and the likes.
Yesterday I clicked a tinyurl in someone else's tweet and landed on my own site :-S. It would be really cool if the other url shortening services could digest the page and offer the preferred short url instead of their own.
Jaap - 11th April 2009 19:47 - #
Hmm, I share your concern of short URLs. Stupid Twitter fad. Maybe I should try something similar, my domain is short enough :). I’d use IRIs for ultimate shortness though! Hmm~~ grauw.nl/水 nice!
(Let’s hope it’s just your preview that doesn’t understand Unicode btw, here goes).
Forgive the off-topic comment, but Laurens reminds me that I've heard of this problem before. (Browsers fumbling the Unicode support when using XHR.) It seems like a potential XSS exploit that has yet to be discovered, similar to the UTF-7 one used on Google:
http://shiflett.org/blog/2005/dec/google-xss-example
Any idea what the problem is?
The question remains, what is the point of short urls? What problem do they try to solve? A good url is a good url, and there is no way to make it better by making it shorter; it should already be short and to the point.
Philippe Jadin - 11th April 2009 22:16 - #
@Chris: The problem illustrated on that link is that functions like htmlentities only work if the encoding is correctly specified, which can be ambiguous if no output encoding is indicated to the user agent.
In this case, it looked like the text in the preview window was escaped using JavaScript’s ‘escape()’ function. This function does not escape non-US-ASCII characters correctly and is deprecated (of sorts). You should use encodeURI or encodeURIComponent.
@Philippe: they seem to solve the problem imposed on us by services like Twitter which have an artificial length limit on messages. Previously, URL shortening services were only used to link to those hideous kind of URLs that are riddled with parameters, as far as I’m aware.
So... You've got an article, saying that its canonical URL is a URL that redirects 301 to the article... Right.
Either I miss something which is way probable, or this is the dumbest stuff I've ever read from an SEO point of view. I understand this is not about SEO, but I'm afraid this will heavily confuse search engines.
Ozh - 11th April 2009 23:39 - #
I've forked this bookmarklet for those who might prefer it default to bit.ly :)
http://gist.github.com/93761
Stephen Paul Weber - 12th April 2009 00:05 - #
Philippe Jadin asks "what is the point of short urls?".
They have a number of use cases, of which micro-blogging (including Twitter, but also other similar services) is one. Others include SMS text messages and the provision of URLs in print sources such as newspapers or magazines, where space is at a premium and manual retyping of long URLs introduces risk of errors.
@Ozh: Just what I was thinking. What the heck? Why are you using canonical? That has a totally different meaning.
Julian - 12th April 2009 06:55 - #
@Ozh & Julian:
I think you're confusing rel and rev. Simon's link element isn't saying that the swtiny.eu link is the canonical URL for this article; it's the other way around - this article is the canonical link for the shortened URL.
Matthew Pennell - 12th April 2009 08:12 - #
Laurens Holst - thank you! I've been meaning to find a solution to that bug for years. Switching from escape() to encodeURIComponent() seems to have fixed it perfectly.
Looks like rel="alternate shorter" is better than rev="canonical" for short URL disclosure, according to Eli White: http://is.gd/s4Dd
Alan Hogan - 12th April 2009 23:24 - #
I've taken Simon's code, added some of my own, and packaged it into a proper Django app:
easy_install django-shorturls, and patches welcome.
This is a great idea.
Now I can get rid of these long urls :-)
rates - 13th April 2009 05:28 - #
Oh, looks like I've done same thing few hours earlier... :\ Though mine requires less configuration. :P
Hey, yet another approach – I've modified django-ittybitty to use Simon's baseconv: http://github.com/jezdez/django-ittybitty/.
Thanks for this, I've put it on http://www.dubizzle.com
Ben Walton - 16th April 2009 14:10 - #
Hi Simon,
I wrote a snippet that did something similar a couple of months ago, which used the existing django.contrib.redirects app.
http://www.djangosnippets.org/snippets/1323/
My approach was quite quick-and-dirty (and I regret using a mixin) so it's great to see something that has had a lot more thought applied to it!
mattgeeknz - 17th April 2009 00:54 - #
Simon,
Fancy supporting rel="shorturl" as per this RFC:
http://sites.google.com/a/snaplog.com/wiki/short_url ?
Regards,
Rob...
G'day,
rev=canonical is broken for more reasons than I care to list here (see Counting the ways that rev="canonical" hurts the Web) so if you want an alternative without the confusion of rel="short[_- ]?ur[il]" then rel=shortlink is the answer.
Sam
This is a very intriguing idea. I'm not comfortable with using URL shortening services, but I know a lot of people are, so something really must be done for it.
Hi Simon,
I've used your bookmarklet as a basis to make one for Tweeting URLs on an iPhone. http://ts0.com/tweet/
Thom Shannon - 8th May 2009 14:51 - #