Simon Willison’s Weblog


XHTML is just fine

6th January 2003

In “Who dropped the dead cat into the well?” (via Mark Pilgrim), Brian Donovan argues that keeping web site content in (X)HTML is a fundamentally bad idea. I thoroughly disagree. When I started this weblog, I realised I needed a format for storing my entries that would keep my content “free” to be reused in multiple different ways. I thought about a simple UBB-style markup language, with [url=""]links like this[/url], automatic line breaks and a few other simple structures such as lists and headings. I also considered Wiki markup of some sort, again looking for a reasonably expressive but controlled markup vocabulary for storing my blog entries in a reusable way.

Both UBB code and WikiText have the disadvantage that they require extensive work with regular expressions to extract meaning from them. Regular expression support is excellent in the languages I normally work with (Python and PHP) but is not guaranteed across other technologies, especially when differences in regular expression syntax start to become a problem.
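To give a rough idea of the regular expression work involved, here is a minimal Python sketch that pulls the links out of UBB-style markup. The pattern and the `ubb_links` name are illustrative assumptions rather than any real forum's parser, and the sketch only handles the `[url="..."]text[/url]` form shown above.

```python
import re

# Hypothetical pattern for [url="..."]text[/url]; the quotes around the URL
# are treated as optional. Real UBB dialects vary, which is part of the problem.
UBB_URL = re.compile(r'\[url="?([^"\]]*)"?\](.*?)\[/url\]', re.DOTALL)


def ubb_links(text):
    """Return a list of (url, link_text) pairs found in UBB-style markup."""
    return [(m.group(1), m.group(2)) for m in UBB_URL.finditer(text)]


sample = 'See [url="http://example.com/"]links like this[/url] for details.'
print(ubb_links(sample))
```

Every further structure (lists, headings, nested tags) needs another pattern of this kind, and each pattern may need adjusting for a language with a different regular expression syntax.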

Since regular expressions were a bit risky, I decided to look at XML—after all, it boasts excellent support across multiple languages and platforms and is designed for storing content in the neutral manner I desired. I quickly realised I needed an XML tag set with support for the various content that I would be including in my blog—paragraphs, links, quotations, the occasional list and maybe a few other simple document components. Then I realised that XHTML offered exactly that, provided I stuck to the strict version and forgot about the presentation elements.

By carefully using semantic XHTML to store content, I gain the ability to easily extract and process information I have created using tried and tested XML tools. I can extract the links from an entry with a few lines of code, a technique used by my Pingback client implementation for this weblog. Furthermore, should I ever decide to serve my content in a different format I can do so using simple transformation tools that have already been created and extensively tested by other developers.
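As an illustration of how little code link extraction can take, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It is not the code behind the Pingback client mentioned above; the `extract_links` name, the wrapping `div` and the sample entry are assumptions made for the example. It also assumes the entry is well-formed XML and uses no HTML-only named entities such as `&nbsp;`, which a plain XML parser would reject.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def extract_links(xhtml_fragment):
    """Return the href of every <a> element in a well-formed XHTML fragment."""
    # Wrap the fragment in a single namespaced root element so that an entry
    # made of several sibling paragraphs still parses as one XML document.
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, xhtml_fragment)
    root = ET.fromstring(wrapped)
    # Child elements inherit the default namespace declared on the wrapper.
    return [a.get("href") for a in root.iter("{%s}a" % XHTML_NS) if a.get("href")]


entry = (
    '<p>See <a href="http://example.com/one">one</a> and '
    '<a href="http://example.com/two">two</a>.</p>'
    "<blockquote><p>No link here.</p></blockquote>"
)
print(extract_links(entry))
```

Because the parser does the work of understanding the markup, the same approach extends to quotations, lists or headings by changing the element name, with no new pattern matching to write.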

I agree with Brian that storing content as HTML (especially presentational HTML) could turn out to be a great mistake, but semantic XHTML provides a powerful and well-defined format for storing content in a way that is both future-proof and instantly accessible.

This is XHTML is just fine by Simon Willison, posted on 6th January 2003.
