XHTML is just fine
6th January 2003
In Who dropped the dead cat into the well? (via Mark Pilgrim), Brian Donovan argues that keeping web site content in (X)HTML is a fundamentally bad idea. I thoroughly disagree. When I started this weblog, I realised I needed a format for storing my entries that would keep my content “free” to be reused in multiple different ways. I thought about a simple UBB-style markup language, with [url="http://www.example.com/"]links like this[/url], automatic line breaks and a few other simple structures such as lists and headings. I also considered Wiki markup of some sort, again looking for a reasonably expressive but controlled markup vocabulary for storing my blog entries in a reusable way.
Both UBB code and WikiText have the disadvantage that they require extensive work with regular expressions to extract meaning from them. Regular expression support is excellent in the languages I normally work with (Python and PHP) but is not guaranteed across other technologies, especially when differences in regular expression syntax start to become a problem.
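To illustrate the kind of regular expression work involved, here is a minimal Python sketch that pulls links out of the [url="..."] form shown earlier. The pattern and the function name `extract_ubb_links` are my own assumptions for illustration, not part of any UBB specification, and a real dialect would need further patterns for lists, headings, nesting and escaping.

```python
import re

# Assumed pattern for the [url="..."]label[/url] form only; other UBB
# constructs (lists, headings, nested tags) would each need their own.
UBB_LINK = re.compile(r'\[url="([^"]+)"\](.*?)\[/url\]', re.DOTALL)


def extract_ubb_links(text):
    """Return a list of (href, label) tuples for each UBB link in text."""
    return UBB_LINK.findall(text)


sample = 'See [url="http://www.example.com/"]links like this[/url] for details.'
print(extract_ubb_links(sample))
```

Even this small pattern relies on non-greedy matching and a dot-matches-newline flag, features whose syntax differs between regular expression engines.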
Since regular expressions were a bit risky, I decided to look at XML—after all, it boasts excellent support across multiple languages and platforms and is designed for storing content in the neutral manner I desired. I quickly realised I needed an XML tag set with support for the various content that I would be including in my blog—paragraphs, links, quotations, the occasional list and maybe a few other simple document components. Then I realised that XHTML offered exactly that, provided I stuck to the strict version and forgot about the presentation elements.
By carefully using semantic XHTML to store content, I gain the ability to easily extract and process information I have created using tried and tested XML tools. I can extract the links from an entry with a few lines of code, a technique used by my Pingback client implementation for this weblog. Furthermore, should I ever decide to serve my content in a different format I can do so using simple transformation tools that have already been created and extensively tested by other developers.
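As a rough sketch of what "a few lines of code" can look like, the following Python uses the standard library's `xml.etree.ElementTree` to list the links in an entry. This is not the code from my Pingback client; the function name `extract_links`, the sample entry and the wrapping `<div>` are assumptions made for the example, and it presumes the stored fragment is well-formed and uses no HTML-only named entities such as `&nbsp;`.

```python
import xml.etree.ElementTree as ET


def extract_links(xhtml_fragment):
    """Return (href, text) pairs for each <a href> in a well-formed XHTML fragment."""
    # Wrap the fragment in one root element so it parses as a single XML document.
    root = ET.fromstring("<div>" + xhtml_fragment + "</div>")
    links = []
    for a in root.iter("a"):
        href = a.get("href")
        if href is not None:
            # itertext() gathers the link text, including text inside child elements.
            links.append((href, "".join(a.itertext())))
    return links


# Hypothetical entry content, stored as semantic XHTML.
entry = (
    '<p>See <a href="http://www.example.com/">this <em>example</em></a>.</p>'
    '<blockquote><p>Quoted with <a href="http://example.org/x">a link</a>.</p></blockquote>'
)

for href, text in extract_links(entry):
    print(href, text)
```

The parser handles nesting and attribute quoting, so no pattern matching against the markup is needed.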
I agree with Brian that storing content as HTML (especially presentational HTML) could turn out to be a great mistake, but semantic XHTML provides a powerful and well-defined format for storing content in a way that is both future-proof and instantly accessible.