XHTML is just fine
In Who dropped the dead cat into the well? (via Mark Pilgrim), Brian Donovan argues that keeping web site content in (X)HTML is a fundamentally bad idea. I thoroughly disagree. When I started this weblog, I realised I needed a format for storing my entries that would keep my content “free” to be reused in multiple different ways. I thought about a simple UBB style markup language, with [url="http://www.example.com/"]links like this[/url], automatic line breaks and a few other simple structures such as lists and headings. I also considered Wiki markup of some sort, again looking for a reasonably expressive but controlled markup vocabulary for storing my blog entries in a reusable way.
Both UBB code and WikiText have the disadvantage that they require extensive work with regular expressions to extract meaning from them. Regular expression support is excellent in the languages I normally work with (Python and PHP) but is not guaranteed across other technologies, especially when differences in regular expression syntax start to become a problem.
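To give a feel for the regular expression work involved, here is a rough sketch of pulling links out of UBB style markup in Python. The pattern and the `extract_ubb_links` name are illustrative assumptions, not code from this weblog, and the pattern only handles the double-quoted `[url="..."]` form shown above.

```python
import re

# Assumed pattern: matches [url="ADDRESS"]LABEL[/url] with a double-quoted address.
# Real UBB dialects vary (unquoted URLs, nested tags), which is part of the problem.
URL_TAG = re.compile(r'\[url="([^"]+)"\](.*?)\[/url\]', re.DOTALL)


def extract_ubb_links(text):
    """Return a list of (address, label) tuples for each [url] tag in text."""
    return URL_TAG.findall(text)


sample = 'Some text with [url="http://www.example.com/"]links like this[/url] in it.'
ubb_links = extract_ubb_links(sample)
```

Every additional structure (lists, headings, nested tags) needs its own pattern, and each pattern may need adjusting when it is ported to a language with a different regular expression syntax.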
Since regular expressions were a bit risky, I decided to look at XML—after all, it boasts excellent support across multiple languages and platforms and is designed for storing content in the neutral manner I desired. I quickly realised I needed an XML tag set with support for the various content that I would be including in my blog—paragraphs, links, quotations, the occasional list and maybe a few other simple document components. Then I realised that XHTML offered exactly that, provided I stuck to the strict version and forgot about the presentation elements.
By carefully using semantic XHTML to store content, I gain the ability to easily extract and process information I have created using tried and tested XML tools. I can extract the links from an entry with a few lines of code, a technique used by my Pingback client implementation for this weblog. Furthermore, should I ever decide to serve my content in a different format I can do so using simple transformation tools that have already been created and extensively tested by other developers.
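As an illustration of the "few lines of code" claim, here is a minimal sketch that extracts links from an entry using Python's standard `xml.etree.ElementTree` module. It is not the Pingback client's actual code; the `extract_links` name and the wrapper `div` are assumptions made for the example, and it assumes the entry is a well-formed XHTML fragment with no named entities such as `&nbsp;`.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def extract_links(xhtml_fragment):
    """Return the href of every <a> element in a well-formed XHTML fragment."""
    # Wrap the fragment in a single root element (an assumption for this sketch)
    # so that entries with several top-level paragraphs still parse, and declare
    # the XHTML namespace as the default for everything inside it.
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, xhtml_fragment)
    root = ET.fromstring(wrapped)
    # Namespaced elements are addressed as {namespace}tag in ElementTree.
    return [a.get("href") for a in root.iter("{%s}a" % XHTML_NS) if a.get("href")]


entry = (
    '<p>See <a href="http://www.example.com/">this page</a>.</p>'
    '<p>Also <a href="http://www.example.org/other">that one</a>.</p>'
)
links = extract_links(entry)
```

No pattern matching is needed here: the parser does the work, and the same approach carries over to any language with an XML library.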
I agree with Brian that storing content as HTML (especially presentational HTML) could turn out to be a great mistake, but semantic XHTML provides a powerful and well-defined format for storing content in a way that is both future-proof and instantly accessible.