Simon Willison’s Weblog

rev=canonical bookmarklet and designing shorter URLs

I’ve watched the proliferation of URL shortening services over the past year with a certain amount of dismay. I care about the health of the web and try to ensure that URLs I am responsible for will last for as long as possible, and I think it’s very unlikely that all of these new services will still be around in twenty years’ time. Last month I suggested that the Internet Archive start mirroring redirect databases, and last week I was pleased to hear that Archiveteam, a different organisation, had already started crawling.

The most recent discussion was kicked off by Joshua Schachter and Dave Winer, and a solution has emerged driven by some lightning fast hacking by Kellan Elliott-McCrea. The idea is simple: sites get to choose their preferred source of shortened URLs (including self-hosted solutions) and specify it from individual pages using <link rev="canonical" href="... shorter URL here ...">.

When sites host their own shorteners, the reliability of the short URLs should match that of the host site, and the amount of damage caused by a major shortener going missing can be dramatically reduced.
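As a rough illustration of how a client might discover such a link, here is a minimal Python sketch using only the standard library. It is not Kellan’s service or the bookmarklet described in this post; the names RevCanonicalFinder and find_short_url are made up for this example. It accepts the rev attribute on both link and a elements.

```python
from html.parser import HTMLParser


class RevCanonicalFinder(HTMLParser):
    """Remember the href of the first <link> or <a> whose rev includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.short_url = None

    def handle_starttag(self, tag, attrs):
        # Stop looking once a match has been found; ignore unrelated tags.
        if self.short_url is not None or tag not in ("link", "a"):
            return
        attr_map = dict(attrs)
        # rev holds a space-separated list of link types; compare case-insensitively.
        rev_values = (attr_map.get("rev") or "").lower().split()
        if "canonical" in rev_values and attr_map.get("href"):
            self.short_url = attr_map["href"]


def find_short_url(html):
    """Return the page's advertised short URL, or None if it has none."""
    finder = RevCanonicalFinder()
    finder.feed(html)
    return finder.short_url
```

A caller would fall back to a third-party shortener only when find_short_url returns None.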

I’ve been experimenting with this new pattern today. Here are a few small contributions to the wider discussion.

A URL shortening bookmarklet

Kellan’s rev=canonical service exposes rev=canonical links using a server-side script running on App Engine. An obvious next step is to distil that logic into a bookmarklet. I decided to combine the rev=canonical logic with my json-tinyurl web service (also on App Engine), which allows browsers to look up or create TinyURLs using a cross-domain JSONP request. The resulting bookmarklet will display the site’s rev=canonical link if it exists, or create and display a TinyURL link otherwise:

Bookmarklet: Shorten (drag to your browser toolbar)

You can also grab the uncompressed source code.

Designing short URLs

I’ve also implemented rev=canonical on this site. I bought a new domain for this, since simonwillison.net is both difficult to spell and 17 characters long. I went with swtiny.eu: it is 9 characters long, and keeping tiny in the domain helps people guess the nature of the site from just the URLs it generates. Be warned: the DNS doesn’t appear to have finished propagating yet.

For the path component, I turned to a variant of base 62 encoding. Decimal integers are represented using 10 digits (0-9), but base 62 uses those digits plus the letters of the alphabet in both lower and upper case. A 13-digit integer such as 7250397214971 compresses down to just 8 characters (CDeIPpOD) using base 62. My baseconv.py module implements base 62, among others. I considered using base 57 by excluding o, O, 0, 1 and l as being too easily confused but decided against it.
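To make the encoding concrete, here is a minimal sketch of base 62 conversion in Python. It is not baseconv.py itself. The alphabet order (upper case, then digits, then lower case) is an assumption inferred from the examples in this post, and the function names are made up for the illustration.

```python
# Assumed alphabet order: upper case, digits, lower case.
# It was chosen because it reproduces the examples in this post; it is
# not copied from baseconv.py.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 62


def base62_encode(number):
    """Convert a non-negative integer to its base 62 string."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        # Peel off the least significant base 62 digit each time round.
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(text):
    """Convert a base 62 string back to an integer."""
    number = 0
    for character in text:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * BASE + ALPHABET.index(character)
    return number
```

With this ordering, base62_encode(7250397214971) gives CDeIPpOD, and base62_decode turns it back into the original integer.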

This site has three key types of content: entries, blogmarks and quotations. Each one is a separate Django model, and hence each has its own underlying database table and individual ID sequence. Since the IDs overlap, I need a way of separating out the shortened URLs for each content type.

I decided to spend a byte on namespacing my shortened URLs. A prefix of E means an entry, Q means a quotation and B means a blogmark. For example:

  • http://swtiny.eu/EZ8: Entry with ID 1584
  • http://swtiny.eu/BBEQ: Blogmark with ID 4108
  • http://swtiny.eu/QE5: Quotation with ID 279

By using upper case letters for the prefixes, I can later define custom paths starting with a lower case letter. I also have another 23 upper case prefix letters reserved in case I need them.
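The prefix scheme can be sketched as a pair of helper functions. This is an illustrative sketch only, not the code behind swtiny.eu: the function names, the PREFIXES table and the alphabet order (inferred from the three example URLs) are all assumptions.

```python
# Assumed alphabet order, inferred from the example URLs in this post.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz"

# One upper case letter per content type, as described in the post.
PREFIXES = {"E": "entry", "B": "blogmark", "Q": "quotation"}
LETTERS = {name: letter for letter, name in PREFIXES.items()}


def base62_decode(text):
    number = 0
    for character in text:
        number = number * 62 + ALPHABET.index(character)
    return number


def base62_encode(number):
    digits = []
    while True:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
        if number == 0:
            return "".join(reversed(digits))


def resolve(short_path):
    """Split a short path such as 'EZ8' into (content type, integer ID)."""
    prefix, encoded_id = short_path[:1], short_path[1:]
    if prefix not in PREFIXES or not encoded_id:
        raise ValueError("not a recognised short path: %r" % short_path)
    return PREFIXES[prefix], base62_decode(encoded_id)


def shorten(content_type, object_id):
    """Build the short path for an object, e.g. ('entry', 1584) -> 'EZ8'."""
    return LETTERS[content_type] + base62_encode(object_id)
```

Because the prefix identifies the model, the same numeric ID can safely appear under E, B and Q without any collision.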

I asked on Twitter and the consensus was that a 301 permanent redirect was the right thing to do (as opposed to a 302), both for SEO reasons and because the content will never exist at the shorter URL.
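For readers who want to see what a 301 response looks like at the HTTP level, here is a framework-free WSGI sketch. It is not the Django view used on this site. The LONG_URLS table and the example.com target are placeholders standing in for a real database lookup.

```python
# Placeholder lookup table; a real site would query its database instead.
LONG_URLS = {
    "EZ8": "http://example.com/entries/1584/",
}


def shortener_app(environ, start_response):
    """Minimal WSGI app: answer known short paths with a permanent redirect."""
    short_path = environ.get("PATH_INFO", "").lstrip("/")
    target = LONG_URLS.get(short_path)
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown short URL\n"]
    # 301 tells clients and crawlers that the long URL is the permanent home.
    start_response(
        "301 Moved Permanently",
        [("Location", target), ("Content-Length", "0")],
    )
    return [b""]
```

The app can be served locally for experimentation with wsgiref.simple_server.make_server from the standard library.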

Implementation using Django and nginx

I run all of my Django sites using Apache and mod_wsgi, proxied behind nginx. Each site gets an Apache running on a high port, and nginx deals with virtual host configuration (proxying each domain to a different Apache backend) and static file serving. I didn’t want to set up a full Django site just to run swtiny.eu, especially since my existing blog engine was required in order to resolve the shortened URLs.

Instead, I implemented the shortened URL redirection as just another view within my existing site: http://simonwillison.net/shorter/EZ8. I then configured nginx to invisibly proxy requests to swtiny.eu through to that URL. The correct incantation took a while to figure out, so here’s the relevant section of my nginx.conf:

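server {
    listen 80;
    server_name www.swtiny.eu swtiny.eu;
    location / {
        # Prefix the requested path with /shorter, e.g. /EZ8 becomes /shorter/EZ8
        rewrite (.*) /shorter$1 break;
        # Forward the rewritten request to the main site
        proxy_pass http://simonwillison.net;
        # Pass the backend's Location header through unchanged
        proxy_redirect off;
    }
}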

proxy_redirect off is needed to prevent nginx from replacing simonwillison.net in the resulting Location header with swtiny.eu. My Django view code is relatively shonky, but if you’re interested you can find it here.
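The effect of the rewrite rule on a request path can be mimicked in a few lines of Python. This is only an illustration of the pattern and its replacement, not of how nginx works internally; rewrite_path is a made-up name.

```python
import re


def rewrite_path(path):
    """Mimic `rewrite (.*) /shorter$1`: prefix the whole path with /shorter."""
    match = re.match(r"(.*)", path)
    # \1 plays the role of nginx's $1, the text captured by (.*).
    return match.expand(r"/shorter\1")
```

A request for /EZ8 on swtiny.eu is therefore passed to the backend as /shorter/EZ8.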

The nice thing about this approach is that it makes it trivial to add custom URL shortening domains to other projects—a quick view function and a few lines of nginx configuration are all that is needed.

Update: The bookmarklet now supports the rev attribute on A elements as well—thanks for the suggestion, Jeremy.

This is rev=canonical bookmarklet and designing shorter URLs by Simon Willison, posted on 11th April 2009.
