Simon Willison’s Weblog

Living on a knife edge

In The XHTML 100, Evan Goer describes an experiment in which he checked 119 sites declaring an XHTML doctype for full compliance with the W3C standards. His test consisted of three parts: a validation check on the front page, a check on another “inside” page, and a check to see if the correct Content-Type header (application/xhtml+xml) was served to supporting User Agents (in his case Mozilla 1.3).

The results are depressing, but not necessarily surprising. Only one site passed all three tests—beandizzy. Of the others, most fell at the first hurdle with only 13 getting as far as the third test.

I don’t know if my site was included in the experiment, but if it was, it failed at the third test as well. I have now implemented Mark Pilgrim’s trivial PHP fix (which serves the correct Content-Type to user agents that include application/xhtml+xml in their HTTP Accept header). This is no small step to take: serving XHTML with the correct Content-Type causes Gecko-based browsers to parse it using a real XML parser, and should it turn out not to be well formed they will refuse to render the site and die with an error message. Since I use Phoenix myself and almost certainly visit this site more than anyone else, I’m hoping I’ll spot and fix any errors before anyone else runs into them. Talk about living on a knife edge!
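
As a rough illustration of that kind of negotiation, here is a minimal sketch in JavaScript rather than PHP. It is not Mark Pilgrim's code; the function name chooseContentType is made up for this example, and it assumes the raw Accept header is available as a string.

```javascript
// Hypothetical helper: pick a Content-Type based on the request's Accept header.
// Sketch only; this is not the PHP fix referred to above.
function chooseContentType(acceptHeader) {
  const XHTML = "application/xhtml+xml";
  const HTML = "text/html";

  // No Accept header at all: fall back to text/html.
  if (typeof acceptHeader !== "string") return HTML;

  for (const part of acceptHeader.split(",")) {
    const [type, ...params] = part.trim().split(";").map((s) => s.trim());
    if (type.toLowerCase() !== XHTML) continue;

    // An explicit q=0 means "not acceptable", so honour it.
    const q = params.find((p) => p.toLowerCase().startsWith("q="));
    const quality = q === undefined ? 1 : parseFloat(q.slice(2));
    if (quality > 0) return XHTML;
  }
  return HTML;
}

// A browser that lists application/xhtml+xml gets it; one that does not gets text/html.
console.log(chooseContentType("text/xml,application/xhtml+xml,text/html;q=0.9")); // application/xhtml+xml
console.log(chooseContentType("image/gif, image/jpeg, */*")); // text/html
```

The sketch deliberately ignores the */* wildcard, so a user agent only receives application/xhtml+xml if it asks for it by name.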

I’ve been cautious about recommending XHTML for several months now, and this turn of events has made me even more wary of it as a technology that is ready for mainstream use. Creating valid XHTML documents is extremely difficult: virtually impossible by hand without regular checks with the validator, and hard to achieve using home-grown tools as well. I plan to revise my Validator Web Service code shortly to help run automated validation checks whenever I update, but it’s going to take quite a lot of effort to keep things working as they should.

So why bother when HTML 4.01 Strict gives all of the benefits of structural, valid markup with none of the additional hassles imposed by XHTML? Six months ago I would have said that XHTML is vital to support new lightweight devices that can only handle an XML parser, but with mobile phones carrying full tag-soup capable web browsers that’s looking more and more unlikely. The greatest benefit provided by valid XHTML is the increased ability to automate the extraction and processing of content at a later date (see Mark Pilgrim’s acclaimed acronym and citation support for a concrete demonstration of this idea). I’ve been storing my blog entries as XHTML since I started blogging, and I maintain a firm belief that XHTML is an excellent format for storing items of content. Sadly, it just doesn’t seem practical or worthwhile to serve it to browsers.

I’m going to keep serving this blog as XHTML as an open experiment in the practicalities and challenges involved in doing so, but from now on my other web projects will target HTML 4.01 Strict.

This is Living on a knife edge by Simon Willison, posted on 6th May 2003.

30 comments

  1. Interesting side effect number one: blockquotes.js has stopped working. I'll have to take a further look at Mark's coverage of the changes application/xhtml+xml makes to the DOM.

    Simon Willison - 6th May 2003 14:13 - #

  2. I hate to break it to you, but

    1. I use HTML, not XHTML
    2. My quotation and citation extraction scripts don't scrape my HTML, they go straight to my backend database
    3. The scripts don't use an XML or even an SGML parser; they use regular expressions
    4. The quotation extraction script couldn't possibly scrape my HTML anyway, because I use MT-Macros at publish time to convert my Q tags to universally-supported HTML curly-quote entities

    I can attest to one concrete benefit of XHTML: I've earned over $1000 writing about it for O'Reilly.

    Mark - 6th May 2003 15:41 - #

  3. By far the most difficult test to pass is his fourth one: the Cluefulness test. Since Python has what looks to me to be a lovely SGML parser, so that any self-scraping you do could easily be turned into anyone-scraping if you do it in Python, you're pretty much stuck with embedding some other XML. You can do MathML or SVG with the W3C's doctypes, and it sure sounds like you should be able to do your own to embed any random thing, but when I tried with RDF last summer I wasn't able to validate the result, or to find anyone who could tell me how to make it work. As you say, the whole XHTML Basic for phones thing doesn't seem to be catching on (always a good idea to be wary of anything out of the W3C that will only work if everyone decides to do things their way), so it really seems to me that for today, once you strip away the benefits of doing valid anything, the benefits of doing valid XHTML come down to embedding MathML and SVG, and what's rumored to be faster parsing in Mozilla.

    Phil Ringnalda - 6th May 2003 16:12 - #

  4. [Posted here, because I started typing here, I didn't mean to turn it into a rant. I left it here because you allow proper code in your comments, Simon :)]

    I choose not to serve application/xhtml+xml to supporting UAs. Why? Well, to do content negotiation in this way, you need to send a "Vary: Accept" header. This means that every UA sending a different Accept header is referencing a different object - cachability is reduced. For what? Rendering speed? Even more differences between browsers wrt. the DOM etc?

    In fact, when 90% of your visitors are using IE, you may speed up the rendering of the document by a small amount in other browsers, but slow down the retrieval of the document by a large amount. Remember that it's rarely the client processor that is the bottleneck, but usually the network connection.

    HTTP does not allow servers to say "I'm serving based on whether application/xhtml+xml is supported" - it's a straight string comparison of a header. Caches may optimise this, but it's definitely not trivial.

    Also, there is a glaring error in the test. I'm not willfully ignoring the spec, I'm following the spec. He quotes RFC 3236, but ignores the relevant RFC, RFC 2854, the defining RFC for text/html, which states:

    The text/html media type is now defined by W3C Recommendations; the latest published version is [HTML401]. In addition, [XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01 and which may also be labeled as text/html

    Given a choice between the two, I'll take the RFC as the definitive source, not Evan or Hixie, thank you very much. I'm sure they'll understand.

    First, as the results demonstrate, XHTML is hard enough that even advanced authors get it wrong most of the time.

    Well, if he includes people doing perfectly reasonable things and following the relevant specifications in the "getting it wrong" group, then there's little wonder, is there?

    Second, configuring your server to do some minimal MIME-type negotiation really isn't that tough. If you're advanced enough to know what XHTML is, you're advanced enough to add a few lines to your .htaccess file.

    Sure, adding content negotiation is trivial. But he's implying it's without consequences, which it certainly isn't.

    He claims that observing the "Alpha geeks" is more likely to bias the results in favour of valid XHTML, purely on the basis of them being tech-savvy. Being tech-savvy in one respect does not mean you are tech-savvy in XHTML, or the web in general. It also means you are more likely to fiddle with your blog or develop your own, rather than use an off-the-shelf package, which could well be compliant by default. I'm not willing to accept this premise at face value; it's something that will have to be measured.

    Sorry, I have a hard time taking this survey seriously, when he judges people as being nonconformant because they do something he doesn't like.

    Jim - 6th May 2003 17:43 - #
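
The caching point can be made concrete with a small sketch. This assumes a simplified cache that keys stored responses on the URL plus the raw value of every request header named in Vary; the function cacheKey and the lowercase-keyed header object are hypothetical, and real caches are more elaborate.

```javascript
// Hypothetical cache-key builder: URL plus the raw value of each header
// listed in the response's Vary header. Assumes requestHeaders uses
// lowercase header names.
function cacheKey(url, requestHeaders, varyHeader) {
  const names = (varyHeader || "")
    .split(",")
    .map((n) => n.trim().toLowerCase())
    .filter((n) => n !== "");
  const parts = names.map((n) => n + "=" + (requestHeaders[n] || ""));
  return [url, ...parts].join("|");
}

// Two user agents that would both be served text/html still produce different
// keys under "Vary: Accept", because the header strings differ.
const browserA = { accept: "image/gif, image/jpeg, */*" };
const browserB = { accept: "text/html, */*;q=0.1" };
console.log(cacheKey("/blog/", browserA, "Accept") === cacheKey("/blog/", browserB, "Accept")); // false
console.log(cacheKey("/blog/", browserA, "") === cacheKey("/blog/", browserB, "")); // true
```

Under this simplified model, each distinct Accept string gets its own cache entry even when the response body would be identical.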

  5. PS: Simon, you need to check for a cite attribute before creating the "Source" link :)

    Jim - 6th May 2003 17:45 - #

  6. One last thing...

    Creating valid XHTML documents is extremely difficult - virtually impossible by hand without regular checks with the validator

    Am I the only one who doesn't see the big deal about lowercase, proper nesting, quotes around attributes, and the different empty element syntax? It's not rocket science, people!

    Jim - 6th May 2003 17:48 - #

    1. You can't hide Javascript in XML comments when you are serving application/xhtml+xml. Remove the comments.
    2. Gecko's DOM bug (HTMLDocument methods being unavailable when served as application/xhtml+xml) was only recently fixed. You may need to update your browser.
    3. In XHTML 1.1, there are some other tricky things to do with the loss of the "name" attribute. I had to rewrite my rememberMe Javascript to compensate. But you don't have to worry about that ... yet.

    Jacques Distler - 6th May 2003 17:49 - #

  7. Jim:

    (I note you did not leave a URL). RFC 3236 is consistent with RFC 2854. If you look at the former, you will see that XHTML 1.0 "MAY" be served as text/html, but "SHOULD" be served as application/xhtml+xml.

    "SHOULD" is not the same thing as "MUST" (wait for XHTML 2.0 for that), but it is not entirely without meaning.

    Jacques Distler - 6th May 2003 18:04 - #

  8. ... And that's, of course, providing you stay within the bounds of the "HTML-compatible" profile for XHTML 1.0. If not, or if you are using XHTML 1.1, the RFC switches to "MUST NOT" use text/html.

    Jacques Distler - 6th May 2003 18:09 - #

  9. Ack! s/"MUST NOT"/"SHOULD NOT"/ .

    Jacques Distler - 6th May 2003 18:10 - #

  10. Jacques,

    I am well aware that it is consistent with the other RFC. What I take issue with is that he points out the SHOULDs in one RFC, claims that it's a failure of an author to do anything else, and conveniently forgets to mention that the alternative is explicitly allowed by the definition of text/html.

    Jim - 6th May 2003 18:40 - #

  11. Maybe

    Martijn - 6th May 2003 19:01 - #

  12. Maybe that is also the reason why the Babel Fish Translation service of Alta Vista doesn't work anymore for your site? (sorry for the above)

    Martijn - 6th May 2003 19:03 - #

  13. The babelfish service appears to pass on http headers from the client, including the Accept header. Sheer idiocy when they need to manipulate the document themselves.

    Other headers passed on include User-Agent, Accept-Charset, and, funnily enough, Accept-Language. Now why didn't they provide an accurate Accept-Language, so that they could return a human-translated document to the end-user if it's available?

    Jim - 6th May 2003 20:19 - #

  14. XHTML as application/xhtml+xml is near-suicidal on most dynamic sites that are hand-coded at the moment imo.

    The only things it is suitable for are:

    • Small static sites (for instance, red.uk.com and 19th are both xhtml+xml in Moz)
    • Sites generated using a server-side DOM-alike - where none of the XHTML is actually written by hand, but generated via objects (like new Input() or something)

    ...and even then, I'd want to have a validation spider running often to check the validation of every single page.

    I still want to have a nice "is XHTML displaying errors for end-users actually a good thing for the user?" discussion :)

    Tom Gilder - 6th May 2003 20:20 - #

  15. Am I the only one who doesn't see the big deal about lowercase, proper nesting, quotes around attributes, and the different empty element syntax?

    I've never actually used HTML 4, or any previous version. I only started learning how to mark up web documents a little over a year ago, so I went straight to XHTML 1.0 Transitional. After a couple of months, I ditched it in favor of XHTML 1.1 and I've been doing it that way ever since.

    I hand code everything, and I find it very easy indeed. I don't see what the big deal is either. I am aware of the famous article by Ian Hickson on the evils of serving up XHTML as text/html, but up until now I have had no knowledge of how to correct that - particularly because my website isn't hosted on an Apache web server.

    I thought about adopting Mark's PHP solution, but Jim's comments above have given me a reason not to. Hmmm.

    Simon Jessey - 6th May 2003 22:14 - #

  16. I have no problem at all writing XHTML - lower case attributes, proper nesting, empty tags with a slash before the end definitely aren't rocket science. The problem is that it's very easy to make a mistake, and with XHTML such a mistake will now result in my blog failing to render in Mozilla browsers. Perfection is a pretty hard standard to live up to when you are generating and adding new content several times a day.

    It's true though that XHTML served with the correct mime type is perfectly viable for non-dynamic sites. If the site content isn't being constantly changed (either by hand or via a content management system) you can write it once, validate it and leave it at that.

    Mark: I was aware that you used regular expressions rather than an XML parser (I think you mentioned it in a previous entry) but my intention was to use your widely known semantic markup based features as a simple demonstration of the kind of benefits extracting meaning from old content can bring. Original content stored in an XML format opens up the entire spectrum of XHTML tools for future processing of that content. Of course, Pythonistas get this with HTML as well thanks to sgmllib but in my experience most languages do not have that kind of thing as an easily available feature.

    Simon Willison - 6th May 2003 22:36 - #

  17. Your blockquotes.js script is broken because you are using createElement when you should be using createElementNS - ie. something like the following:

    newlink = document.createElementNS("http://www.w3.org/1999/xhtml","a");

    Tom - 7th May 2003 00:06 - #

  18. ...or try adding this at the top of your script:

    if (document.createElementNS) document.createElement = function(elName) { return document.createElementNS("http://www.w3.org/1999/xhtml", elName); }

    Tom Gilder - 7th May 2003 11:18 - #
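
An alternative to overriding document.createElement is a small wrapper that picks the right call at the point of use. This is only a sketch: createXhtmlElement is a made-up name, not part of blockquotes.js, and the document object is passed in so the helper can be exercised outside a browser.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: create an element in the XHTML namespace when the
// document supports createElementNS, otherwise fall back to createElement.
function createXhtmlElement(doc, tagName) {
  if (typeof doc.createElementNS === "function") {
    return doc.createElementNS(XHTML_NS, tagName);
  }
  return doc.createElement(tagName);
}
```

In a browser this would be called as createXhtmlElement(document, "a"), leaving the built-in createElement untouched for any other scripts on the page.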

  19. Yes, I've been using XHTML and notepad since 1999 mainly due to not liking the elasticity of HTML 4.01 grammar; the rocket science is not writing XHTML but accommodating for browsers that fail to follow the guidelines.

    Personally, my websites would fail test 'Level 3', but that is my deliberate choice not to adopt the PHP fix or adjust the server configuration.

    However, I actually understand the meaning of the term SHOULD NOT [RFC2119]. With the test you could class me as 'The author is willfully ignoring the spec.' That is incorrect, since I am adhering to the W3C XHTML Media Types Technical Recommendation but not specifically targeting browsers that can handle application/xhtml+xml, for 'valid reasons'.

    The aspect of concern, however, is that the test failed to address the required XML Declaration and the added implications for XML parsing when using application/xhtml+xml, which is a core consideration when serving that media type.

    Robert Wellock - 7th May 2003 12:40 - #

  20. According to the specs, an XML declaration is not a requirement for conformance:

    An XML declaration like the one above is not required in all XML documents. XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16.

    XHTML 1.0: http://www.w3.org/TR/xhtml1/#strict
    XHTML 1.1: http://www.w3.org/TR/xhtml11/conformance.html#strict

    insin - 7th May 2003 13:37 - #

  21. I never said the XML Declaration was a must for actual conformance; although there are certain instances when it is necessary. Hence why I did not say it was essential, or mandatory, though it is a detail, which is commonly overlooked by many authors - I should have worded the above more clearly.

    It is one of those highly recommended procedures to add the XML declaration since we are aware XHTML is an application of XML. Thus I was referring to XHTML in the context of a 'valid XML document' rather than XHTML as just a 'well-formed XML document'. Don't worry, in essence one was just following an obscured tangent.

    Robert Wellock - 7th May 2003 17:05 - #

  22. The XML declaration is not necessary in the cases where you are using utf-8 or utf-16. Including the declaration would be a good idea if you weren't serving it as text/html - some obscure user-agents, such as older versions of IE and PocketIE render the declaration as content.

    Jim - 7th May 2003 19:12 - #

  23. I'm avoiding the XML declaration because it kicks IE6 into "quirks" mode, which could break my layout (possibly - I haven't tried it though). I like to keep IE6 in standards mode out of sheer bloody-mindedness.

    Simon Willison - 8th May 2003 02:00 - #

  24. You probably didn't mean to serve xhtml+xml with a <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag in it, did you?

    Phil Ringnalda - 8th May 2003 09:27 - #

  25. Phil: I certainly didn't - fixed now, thanks.

    Simon Willison - 8th May 2003 11:59 - #

  26. This is probably a daft question: Could I not save an XHTML file with the extension .xml locally and see if it displayed - even in IE6?

    The URL I use above is a translation by John Milton of Horace's Ode 1. 5, the one addressed to Pyrrha. I did it partly because I thought it would make an interesting exercise in using CSS. I took the layout from Poole and Maule (The Oxford Book of Classical Verse In Translation). This presumably reflects what Milton originally did, which was to present the poem in a continuous block with the last two lines in each verse indented. CSS can cope with this, and the leading, quite easily and neatly.

    It displays as xml rather than throwing up an error message—does that mean it would render OK, if served as application/xhtml+xml?

    Michael - 8th May 2003 20:13 - #

  27. Simon: Your Content-Type meta tag indicates a charset of UTF-8, but according to Firebird, this page is actually being sent as ISO-8859-1. And if you actually mean to use ISO-8859-1 instead of UTF-8, then, based on one of the comments above, you must specify an XML declaration. All in the spirit of pickiness. :)

    jacob - 14th May 2003 18:48 - #

  28. So does www.fosod.com pass all three tests?

    Walter Stevenson - 22nd June 2004 21:36 - #

Previously hosted at http://simon.incutio.com/archive/2003/05/06/knifeEdge