XHTML is just fine
6th January 2003
In Who dropped the dead cat into the well? (via Mark Pilgrim), Brian Donovan argues that keeping web site content in (X)HTML is a fundamentally bad idea. I thoroughly disagree. When I started this weblog, I realised I needed a format for storing my entries that would keep my content “free” to be reused in multiple different ways. I thought about a simple UBB-style markup language, with [url="http://www.example.com/"]links like this[/url], automatic line breaks and a few other simple structures such as lists and headings. I also considered Wiki markup of some sort, again looking for a reasonably expressive but controlled markup vocabulary for storing my blog entries in a reusable way.
Both UBB code and WikiText have the disadvantage that they require extensive work with regular expressions to extract meaning from them. Regular expression support is excellent in the languages I normally work with (Python and PHP) but is not guaranteed across other technologies, especially when differences in regular expression syntax start to become a problem.
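To give a feel for the regular expression work involved, here is a minimal Python sketch that pulls links out of UBB-style markup. The pattern and the function name extract_ubb_links are assumptions made up for this illustration, not part of any real UBB implementation, and the pattern only handles the double-quoted [url="..."] form.

```python
import re

# Hypothetical pattern for the double-quoted [url="..."]...[/url] form only.
# Other UBB variants (unquoted URLs, nested tags) would each need more cases.
URL_TAG = re.compile(r'\[url="([^"]+)"\](.*?)\[/url\]', re.DOTALL)


def extract_ubb_links(text):
    """Return a list of (href, label) pairs, one per [url] tag found in text."""
    return URL_TAG.findall(text)


sample = 'See [url="http://www.example.com/"]links like this[/url] for details.'
print(extract_ubb_links(sample))
```

Every additional structure (lists, headings, automatic line breaks) needs its own pattern along these lines, and each pattern may need adjusting when moved to a language with a different regular expression dialect.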
Since regular expressions were a bit risky, I decided to look at XML—after all, it boasts excellent support across multiple languages and platforms and is designed for storing content in the neutral manner I desired. I quickly realised I needed an XML tag set with support for the various content that I would be including in my blog—paragraphs, links, quotations, the occasional list and maybe a few other simple document components. Then I realised that XHTML offered exactly that, provided I stuck to the strict version and forgot about the presentation elements.
By carefully using semantic XHTML to store content, I gain the ability to easily extract and process information I have created using tried and tested XML tools. I can extract the links from an entry with a few lines of code, a technique used by my Pingback client implementation for this weblog. Furthermore, should I ever decide to serve my content in a different format I can do so using simple transformation tools that have already been created and extensively tested by other developers.
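As a rough illustration of what "a few lines of code" can look like, the sketch below uses Python's standard xml.etree.ElementTree module to pull the links out of an entry and to flatten the entry to plain text. It is not the code behind this weblog's Pingback client; the function names and the sample entry are assumptions, and it assumes the entry is stored as a well-formed XHTML fragment with no named entities beyond the XML built-ins.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_entry(xhtml_fragment):
    """Parse an XHTML fragment by wrapping it in a single namespaced root."""
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, xhtml_fragment)
    return ET.fromstring(wrapped)


def extract_links(xhtml_fragment):
    """Return the href of every <a> element in the fragment, in document order."""
    root = parse_entry(xhtml_fragment)
    return [a.get("href") for a in root.iter("{%s}a" % XHTML_NS) if a.get("href")]


def to_plain_text(xhtml_fragment):
    """Drop all markup and return only the text content of the fragment."""
    return "".join(parse_entry(xhtml_fragment).itertext())


# Hypothetical stored entry, used only for this example.
entry = (
    '<p>Read <a href="http://www.example.com/">this</a> and '
    '<a href="http://www.example.org/post">that</a>.</p>'
)
print(extract_links(entry))
print(to_plain_text(entry))
```

Because the parser does the work of understanding the markup, the same stored entry can feed a link extractor, a plain-text export or any other transformation without new parsing code.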
I agree with Brian that storing content as HTML (especially presentational HTML) could turn out to be a great mistake, but semantic XHTML provides a powerful and well-defined format for storing content in a way that is both future-proof and instantly accessible.