Simon Willison’s Weblog

70 items tagged “urls”

Removing MediaWiki from SPA: Cool URIs don’t change (via) Detailed write-up from Anna Shipman describing how she archived an old MediaWiki as static content using recursive wget and some cunning application of mod_rewrite. # 8th October 2017, 7:54 pm

Recovering missing content from the Internet Archive

When I restored my blog last weekend I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my 7 year hiatus, but in watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content.

[... 636 words]

What do Twitter and Gawker think of hash-bangs URLs?

As of December 2013 (and potentially much earlier, I don’t have the exact dates) both Twitter and a Gawker have moved away from hash bang URLs, so my guess is they turned out not to be a good idea.

[... 82 words]

How to find the URL of a page in an iframe?

You can’t, as this would be a security and privacy violation. Imagine an evil website which loads up Google in a full page iframe and then tracks what the unsuspecting user searches for and clicks on.

[... 125 words]

What is the most efficient way to lookup an object (e.g. a user) by only a string?

Yes—an index on a varchar column is exactly how you would implement this.

[... 38 words]

Is there an API that returns metadata for a given URL?

I suggest taking a look at—it can take a huge range of URLs and turn them in to JSON metadata. Here’s what it can do with a Wikipedia page:—and here’s Google Maps URL (not as useful, but still some interesting metadata extracted)

[... 69 words]

How did get a “.sy” url?

Here’s a generally useful tip: if you’re interested in learning more about ANY top level domain, visit the Wikipedia page for it—which will be in this case (just add the domain, complete with its dot prefix, directly after ).

[... 105 words]

Which sites have the best URL design?

GitHub’s URL design is fantastic—it’s a virtually flawless mapping of Git semantics to URL space. Their basic URL structure is excellent, but they also have a bunch of neat URL hacks going on. Here are a few of my favourites:

[... 97 words]

When referring to our web site in publications (or Twitter or Facebook), when is it important to provide the full URL— and when should you provide just the

You have no control over how other publications refer to your site—if you’re lucky, they might spell it correctly and check the link works before publishing (but I wouldn’t bet on it). What you DO have control over is making sure you compensate for any mistakes they make.

[... 166 words]

How did slashes become the standard path separators for URLs?

I’m going to take an educated guess and say it’s because of unix file system conventions. Early web servers mapped the URL to a path on disk inside the document root—this is still how most static sites work today.

[... 57 words]

How do you find the new URL of a Tumblr that has moved?

One trick that might work is to look up the old tumble in the Google cache or on, then copy and paste a unique search phrase from that page and run a Google search for:

[... 72 words]

URLs are supposed to represent resources. A web app can be a resource, and there are techniques for managing state within those. Hashbangs might be one of these. But when large web properties are converting all their links to _articles_ and other _bits of text_ (tweets/twits/whatever) into these monstrosities, it’s not innovation. It’s a huge mistake that ought to be regretted now and will certainly be regretted in the future.

Reed Underwood # 10th February 2011, 4:56 pm

Before events took this bad turn, the contract represented by a link was simple: “Here’s a string, send it off to a server and the server will figure out what it identifies and send you back a representation.” Now it’s along the lines of: “Here’s a string, save the hashbang, send the rest to the server, and rely on being able to run the code the server sends you to use the hashbang to generate the representation.” Do I need to explain why this is less robust and flexible? This is what we call “tight coupling” and I thought that anyone with a Computer Science degree ought to have been taught to avoid it.

Tim Bray # 10th February 2011, 6 am

Going Postel. Jeremy points out that one of the many disadvantages of publishing JavaScript dependent content on the Web is that a single typo can render your entire site unusable. # 9th February 2011, 2:18 am

Breaking the Web with hash-bangs. Mike Davies explains why Gawker’s new Ajax fragment-tastic redesign is a web architecture error of colossal proportions. # 9th February 2011, 2:17 am

Is there a way of tracking shortened URLs with Twitter streaming API?

Think about it like this: the whole point of the Twitter streaming API is to get you the tweets as soon after they are posted as possible. If the API were to provide access to the lengthened URLs, it would have to delay emitting a Tweet on to the stream until a resolver had gone through each shortened URL in the tweet and checked to find what it redirects to. This would mean that the speed with which the streaming API could deal out tweets would be dependent on the speed of the third party servers that serve up the redirects. I doubt Twitter would ever want to implement this.

[... 159 words]

Getting Started—Google URL Shortener API. The API for the URL shortener is really nice—no API key required, easy to create a short URL and you can retrieve detailed stats breakdowns (similar to as JSON for any URL. # 13th January 2011, 3:49 am

Could browsers be made to scroll down (e.g. by 67%) if you add #67% to a URL?

I’d say no.

[... 89 words]

URL Design. Thoughtful tips on modern URL design, from GitHub designer Kyle Neath. GitHub has the best designed URLs of any application I can think of. # 31st December 2010, 10:03 am

Is it a good idea to allocate URLs such as to users?

There’s an interesting discussion about this issue on this question: How do sites prevent vanity URLs from colliding with future features ?

[... 42 words]

Spacelog: space exploration stories from the original transcripts. The product of the most recent /dev/fort outing—a beautiful, web-native interface for browsing the NASA transcripts from the Apollo 13 and Mercury 6 missions (more to come). Every key moment has a URL. # 10th December 2010, 10:07 am

Porting Flickr to YUI 3: Lessons in Performance (at YUIConf 2010). Some very interesting tips here. The new Flickr photo pages suffered from what I’ve been calling “Flash of Un-Behavioured Content”, where slow loading JavaScript results in poor behaviour from some UI controls. They started using “Action Queueing”, where a small JS stub ensures a loading indicator is shown for clicks on features that have not yet fully loaded. Also, it turns out some corporate firewalls (Sonicwall in particular) dislike URLs over 1600 characters, and filter out any URL with xxx in it. # 10th November 2010, 6:33 pm

Is there any consensus yet on link rel=shorturl vs rev=canonical?

It’s pretty clear from the answers that rev=canonical v.s. rel=canonical is way too confusing—so it’s down to rel=shortlink v.s. rel=shorturl.

[... 38 words]

Why doesn’t Facebook use nicer URLs?

Just noticed this link:—so it looks like things are beginning to improve.

[... 28 words]

Why don’t more websites use alternative domains?

Because regular human beings don’t understand them, and expect everything to be a .com. Here’s an interesting post from 2007 on why spent $1,000,000 buying the .com domain:

[... 45 words]

How do sites prevent vanity URLs from colliding with future features ?

For and I used the same trick as described by others in this list—an enormous blacklist of everything I could possibly want to use for a future feature.

[... 157 words]

The Web for me is still URLs and HTML. I don’t want a Web which can only be understood by running a JavaScript interpreter against it.

Me, on Twitter # 27th September 2010, 4:37 pm incident report for 04/09/2010. An issue was posted to the Apache JIRA containing an XSS attack (disguised using TinyURL), which stole the user’s session cookie. Several admin users clicked the link, so JIRA admin credentials were compromised. The attackers then changed the JIRA attachment upload path setting to point to an executable directory, and uploaded JSPs that gave them backdoor access to the file system. They modified JIRA to collect entered passwords, then sent password reset e-mails to team members and captured the new passwords that they set through the online form. One of those passwords happened to be the same as the user’s shell account with sudo access, leading to a full root compromise of the machine. # 14th April 2010, 9:08 am

RFC5785: Defining Well-Known Uniform Resource Identifiers (via) Sounds like a very good idea to me: defining a common prefix of /.well-known/ for well-known URLs (common metadata like robots.txt) and establishing a registry for all such files. OAuth, OpenID and other decentralised identity systems can all benefit from this. # 11th April 2010, 7:32 pm

Introduction to Surlex. A neat drop-in alternative for Django’s regular expression based URL parsing, providing simpler syntax for common path patterns. # 11th April 2010, 7:23 pm