Why you should be using disambiguated URLs
Good URLs are important. The best URLs are readable, reliable and hackable.
One of the nice things about Rails, Django and other modern Web frameworks is that most of them encourage smart URL design. Rails has relatively smart defaults and a powerful routing system for custom URLs; Django forces you to think about URL design up front by defining them as regular expressions. Many of the definitive “Web 2.0” sites such as Flickr and del.icio.us also use well designed URLs. This is a positive trend, and long may it continue.
There’s one aspect of URL design that is often ignored. Good URLs should be unambiguous. By that, I mean that any logical piece of content should have one and only one definitive URL, with any alternatives acting as a permanent redirect.
This rule is frequently broken. Here are some examples:
- My Flickr photo stream lives at www.flickr.com/photos/simon/ and flickr.com/photos/simon/
- My del.icio.us account is at del.icio.us/simonw and del.icio.us/simonw/
- The YDN Python Developer center lives at both http://developer.yahoo.com/python/ and http://developer.yahoo.com/python/index.html
- The Google AdSense sign-in page is at https://www.google.com/adsense/ and https://google.com/adsense/. If you visit the latter you get a scary certificate warning (they really need to fix that).
In each of the above cases, it’s obvious to regular people that the URLs are the same. Unfortunately, from a technical point of view they are different and could quite feasibly serve up different content. This causes all kinds of problems:
- Caches (both browser and intermediate proxies) can’t improve performance if you request the same content from a different URL.
- Browsers can’t show users which links they have already visited.
- Social link sharing sites such as del.icio.us can’t accurately aggregate links to the same resource.
That last one in particular should catch your attention if you care about effectively promoting your site. Here’s a random example, plucked from today’s del.icio.us popular. convinceme.net is a new online debating site (tag clouds, gradient fills, rounded corners). It’s listed in del.icio.us a total of four times!
- http://www.convinceme.net/ has 36 saves
- http://www.convinceme.net/index.php has 148 saves
- http://convinceme.net/ has 211 saves
- http://convinceme.net/index.php has 38 saves
Combined, that’s 433 saves; much more impressive, and more likely to end up at the top of a social sharing site.
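To make the arithmetic concrete, here is a minimal Python sketch of how an aggregator could fold those four listings into one. The `canonicalize` function, the `INDEX_NAMES` list and the choice of the no-www form as the canonical one are all assumptions made for illustration; nothing here reflects how del.icio.us actually works.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default directory index filenames; adjust for your server.
INDEX_NAMES = ("index.php", "index.html", "default.aspx")


def canonicalize(url):
    """Collapse www/no-www and index-file variants into a single URL."""
    scheme, host, path, query, _fragment = urlsplit(url)
    host = host.lower()
    if host.startswith("www."):
        host = host[len("www."):]  # arbitrary choice: no-www is canonical here
    directory, _, filename = path.rpartition("/")
    if filename in INDEX_NAMES:
        path = directory + "/"  # link to the containing directory instead
    return urlunsplit((scheme, host, path or "/", query, ""))


# Save counts copied from the list above.
saves = {
    "http://www.convinceme.net/": 36,
    "http://www.convinceme.net/index.php": 148,
    "http://convinceme.net/": 211,
    "http://convinceme.net/index.php": 38,
}

totals = Counter()
for url, count in saves.items():
    totals[canonicalize(url)] += count

print(dict(totals))  # prints {'http://convinceme.net/': 433}
```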
Here’s a useful rule of thumb:
Links to (and within) your site should never, ever end with index.php / index.html / default.aspx / any default directory index filename.
The whole point of those defaults is that you can link to the containing directory to see their content, resulting in a shorter and prettier URL. If you’re linking to them directly you’re missing out on a golden opportunity to disambiguate your URLs.
Disambiguating your URLs isn’t particularly difficult. The no-www site offers tips on having one domain name redirect to the other, and there are various mod_rewrite techniques for achieving the desired effect as well. If mod_rewrite makes your hair stand on end, remember that if you are using a server-side scripting language such as PHP you can implement rewriting logic in your application code by examining the $_SERVER['PATH_INFO'] variable or your platform’s equivalent.
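As a rough illustration of the application-code approach, here is a sketch in Python, where the platform equivalent is the `PATH_INFO` key of the WSGI environ. The class name `CanonicalURLMiddleware`, the host `www.example.com` and the `INDEX_NAMES` list are hypothetical, and the sketch assumes the application is mounted at the root of the site.

```python
# Hypothetical canonical host and index filenames; change to suit your site.
CANONICAL_HOST = "www.example.com"
INDEX_NAMES = ("index.php", "index.html", "default.aspx")


class CanonicalURLMiddleware:
    """WSGI middleware that 301-redirects alternative URLs to the canonical one."""

    def __init__(self, app, canonical_host=CANONICAL_HOST):
        self.app = app
        self.canonical_host = canonical_host

    def __call__(self, environ, start_response):
        host = environ.get("HTTP_HOST", self.canonical_host).lower()
        path = environ.get("PATH_INFO") or "/"

        # Strip a trailing default index filename, keeping the directory.
        directory, _, filename = path.rpartition("/")
        canonical_path = directory + "/" if filename in INDEX_NAMES else path

        if host != self.canonical_host or canonical_path != path:
            scheme = environ.get("wsgi.url_scheme", "http")
            location = f"{scheme}://{self.canonical_host}{canonical_path}"
            query = environ.get("QUERY_STRING")
            if query:
                location += "?" + query
            headers = [("Location", location), ("Content-Length", "0")]
            start_response("301 Moved Permanently", headers)
            return [b""]

        # Already canonical: hand the request to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is then a one-liner such as `application = CanonicalURLMiddleware(application)`.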
Django sites get this behaviour for free, thanks to some default settings and Django’s CommonMiddleware. You can see that in action on this weblog: try here, here and here.
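For reference, the two Django settings that CommonMiddleware reads for this are `APPEND_SLASH` and `PREPEND_WWW`. The fragment below is an illustrative sketch of a settings file, not a copy of this weblog's configuration.

```python
# settings.py (fragment)

# When a URL without a trailing slash matches no URL pattern, redirect to
# the same URL with a slash appended. This is Django's default.
APPEND_SLASH = True

# Redirect requests for example.com to www.example.com. Off by default;
# enable it only if the www form is the one you have chosen as canonical.
PREPEND_WWW = True
```

Both settings only take effect while `django.middleware.common.CommonMiddleware` is enabled in the project's middleware list.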
"Links to (and within) your site should never, ever end with index.php / index.html / default.aspx / any default directory index filename."
I was guilty 5x over, thanks for the push. I knew better, guess it was laziness or being in a hurry. :-)
Diego Pires Plentz - 4th February 2007 15:45 - #
Very good points, ta.
an article like this is dime a dozen but rarely i come across one that sheds light on how using logical url's could not only help the users but also the site owners. thanks.
soxiam - 4th February 2007 17:00 - #
Why does mod_rewrite cause anyone stress?
Peter Michaux - 4th February 2007 17:44 - #
Very good point, Simon.
Good timing, too, as I just moved the canonical homepage for the Google APIs to http://code.google.com/apis/ for almost exactly this reason.
YouTube and Google Video are even worse. I'm always seeing YouTube URLs with &NR or &eurl= tacked on the end, or Google Video URLs that show what somebody searched for in order to find the video.
Jesse Ruderman - 4th February 2007 21:08 - #
I agree with this 100%. Most people don't understand why I am so particular about URLs - but take a look at your statistics/server logs. Wouldn't it be much easier to see realistic stats related to a URL versus having to look up the same url minus the www, the same url with the index file, the same url without the index file, etc. Some like no-www, I prefer to use it. However, no matter which you choose - make sure you stick with one or the other.
I always make sure the root has a 301 redirect to www.domain.com. All requests make sure this happens. Then, for friendly urls, I make sure they have a trailing slash. So, www.domain.com/page would be 301 redirected to www.domain.com/page/ - again, keeping the stats neat and tidy as well.
Now, we also have promotional URL's that have to go to a specific page. Again, I 301 redirect these to the appropriate inner page - making sure there is only ONE representation of the page.
These are just a few reasons - but I think that it is very important to keep a neat and tidy URL structure.
Nate Klaiber - 4th February 2007 23:18 - #
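The three rules in the comment above (always www, always a trailing slash on friendly URLs, promotional URLs pointing at one inner page) can be sketched as a single Python function that decides where a 301 should go. The `PROMO_REDIRECTS` mapping and its paths are invented for illustration, and treating any last path segment without a dot as a "friendly URL" is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.domain.com"

# Hypothetical promotional paths and the inner pages they should land on.
PROMO_REDIRECTS = {"/spring-sale": "/offers/spring/"}


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if url is already canonical."""
    scheme, _host, path, query, _fragment = urlsplit(url)
    path = path or "/"

    # Rule 3: promotional URLs map to exactly one inner page.
    path = PROMO_REDIRECTS.get(path.rstrip("/"), path)

    # Rule 2: friendly URLs (last segment has no file extension) end in a slash.
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"

    # Rule 1: every request ends up on the www host.
    target = urlunsplit((scheme, CANONICAL_HOST, path, query, ""))
    return None if target == url else target


print(redirect_target("http://domain.com/page"))
# prints http://www.domain.com/page/
```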
I'm always a bit scared of the Accept header, mainly because I'm worried that poorly written proxies will screw them up. Has this ever been a problem with the blinksale API?
If you want your API to be accessible to Flash developers you pretty much have to put it on a separate domain thanks to the CSRF holes opened up by the crossdomain.xml file.
Simon, this isn't ambiguity (as in each URL should mean only one thing) it is canonicalization, as in each thing should only have one name/location.
Every URL should be canonical.
Michael Bernstein - 5th February 2007 01:59 - #
You could've entitled this post "Cool URI's Don't Change - 2007". I don't remember anyone talking about this aspect before. Love it. I was thinking about URI design/architecture earlier today myself. Must be something in the air.
Devon - 5th February 2007 04:36 - #
I have had one case where a misconfigured browser sent a nonsense Accept header, which screwed up the user's ability to use Blinksale, but I haven't had any problems yet with intermediaries. In fairness, I should confess that we allow media-type disambiguation via file extensions (e.g. /invoices.xml), although that feature isn't documented. Good point about Flash; I hadn't thought of that.
For reasons that escape me, many seem to equate canonicalizing your URLs with eliminating everything after the final slash. Thus
http://your.site.com/feeds/atom/ is believed to be superior to (pick one) http://your.site.com/feed.atom or http://your.site.com/atom.xml or even http://your.site.com/feeds/atom.xml. In the case of default directory indices, there's a clear argument for omitting them from the URL. Extending this to all resources is a stretch, which results in some rather bizarre URL conventions.
By the way. Your comment system cannot seem to tell the difference between <ul> and <ol>.
It also seems to demand XHTML, even though your blog is HTML4.
Oh, yeah, as to readable URLs, my test is: can you read the URL to someone over the phone, and will they get it right? Most CMS/blogging systems, which use machine-generated, weird camel-cased slugs as part of the URL, fail this test rather miserably.
One last point:
One notes that you said "Social link sharing sites", rather than "Search engines". That's because Search engines --- which actually care about the quality of the results they provide to users --- go to great pains to canonicalize the URLs in the search results that they return. The fact that Social link sharing sites can't be bothered to do so tells you a lot more about them than it does about the URL-practices of the sites they index.
P.S.: I notice that, though your page is UTF-8, your comment form only accepts US-ASCII, hence the ugly ASCII approximation to an em-dash, above.
There's an unfortunate side-effect to altogether eliminating the sub-domain name from your site URLs (i.e.
http://domain.com/ instead of http://www.domain.com/): Every cookie you may want to set for that site will automatically "bleed" down to *all* sub-domain-based websites you might want to add later.
So, Simon, I'd like to add to your excellent point of disambiguating URLs, a suggestion that people give virtually *all* their sites a sub-domain name to avoid unnecessary cookie inconveniences.
Már - 5th February 2007 23:11 - #
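The cookie "bleed" described above follows from how a cookie's Domain attribute is matched against the request host. Here is a simplified Python sketch of that matching rule, loosely following RFC 6265; the function name `domain_match` and the `api.domain.com` host are illustrative. It only covers cookies that carry an explicit Domain attribute and ignores IP addresses; browser handling of cookies set without a Domain attribute has varied.

```python
def domain_match(request_host, cookie_domain):
    """Return True if a cookie scoped to cookie_domain is sent to request_host."""
    request_host = request_host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    return (
        request_host == cookie_domain
        or request_host.endswith("." + cookie_domain)
    )


hosts = ["domain.com", "www.domain.com", "api.domain.com"]

# A cookie scoped to the bare domain reaches every sub-domain.
print([domain_match(h, "domain.com") for h in hosts])
# prints [True, True, True]

# A cookie scoped to www stays on www.
print([domain_match(h, "www.domain.com") for h in hosts])
# prints [False, True, False]
```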
Már: outstanding point.
Jacques: Iñtërnâtiônàlizætiøn seems to work for me. It breaks in the Ajax preview pane though.
Simon, I did this on my main domain using the RewriteEngine On in .htaccess method. Everything was fine until someone tried to leave a comment on my Movable Type powered blog which is throwing the error "No entry_id." I tried fixing that with the mt.cfg file to no avail. So, unfortunately, pending figuring this out, I've switched back to www.
I do realize you're not the MT tech guy, but I just can't help myself. And, thank you for mentioning this as I do have some other domains which will be less problematic.
In the "enter your own valid XHTML" mode,
<p>—</p>
produces the error:
'ascii' codec can't encode character u'\u2014' in position 10: ordinal not in range(128)
<p>—</p>
produces an even more ominous:
500 Internal Server Error
Just thought you'd like to know.
Ouch :/ I'll take a look at that as soon as I can - thanks for the bug report.
All I can say is... yup! More people should catch on to this.
Joshua - 6th February 2007 19:42 - #
One of the features of Google's Webmaster tools is choosing your "preferred", or canonical, domain. This way at least Googlebot will know that your www- and non-www domains are to be treated as a single site, and which should take precedence in the results pages. Other web crawlers will still need to be shown with a 301 redirect. http://www.google.com/webmasters/tools/
Ross Shannon - 6th February 2007 20:13 - #
Scott, there is a good reason to use a separate domain for a public API:
http://shiflett.org/archive/263
(This post prompted changes to the Flickr and YouTube APIs, among others.)
I discussed URLs myself recently:
http://shiflett.org/archive/289
Simon, have you ever posted about why you chose this particular URL structure? I'm particularly curious about your blog URLs, since those are the focal point of your content. For example, why Feb instead of feb or 02? Why be granular to the date instead of just the month? Why end your URLs in a trailing slash? Why use a keyword instead of the post title?
I'm sure you've given these topics some thought, and you can read my thoughts in the above post. (I'm particularly curious, because I'm adopting a new URL structure soon.)
You've touched on a couple of the many issues involved in real user-centered URL design.
See: welldesignedurls.org for more awesomeness on well-designed URLs.
Brad Fults - 12th February 2007 08:27 - #
Excellent article. Thanks for the OpenID talk at BarCamp - I'm using it now!
<a href="index.html">Home</a> they can actually write <a href=".">Home</a>. (Believe it or not, it doesn't link to a file called .) That would avoid duplicate content. My feeling is that index files like index.html and index.php were originally intended (back before WWW became mainstream) to be hidden from the frontend -- as a form of internal configuration, much like a .htaccess file.
I'd better get cracking on this! Thanks for the reminder
Steve Roberts - 4th June 2007 21:49 - #