Simon Willison’s Weblog

Elliotte Rusty Harold: Why XHTML. “XHTML makes life harder for document authors in exchange for making life easier for document consumers.”—since there are a lot more document authors than there are tools for consuming, this seems like an argument AGAINST XHTML to me.

34 comments

  1. "more document authors than there are tools for consuming"

    That's possibly a function of there being only a few reliable documents to consume, but I take your point

    Jonathan Barrett - 5th June 2008 21:49 - #

  2. The (really relatively small) number of interactions I've had with XHTML as a consumer have been anything but satisfying (compared to HTML). What good does a parse error do me? So as a consumer, I really don't see how it gives me any benefits at all.

    Ian Bicking - 5th June 2008 23:35 - #

  3. I see your point, but if you consider every person browsing the web as a consumer, then it makes no sense whatsoever.

    Eric Larson - 5th June 2008 23:37 - #

  4. As an author I prefer XHTML - I simply like my document structure to actually be structured...

    (I usually have *most* of the XHTML generated - and I find structured documents easier to understand... HTML 4 is ugly...)

    Michael Foord - 6th June 2008 00:12 - #

  5. If you consider every person browsing the web as a consumer you're missing the point that they're using a browser authored by a tiny subset of that population.

    Koz - 6th June 2008 00:38 - #

  6. Michael Foord: What is so ugly with valid Html vs. valid Xhtml?

    There is no reason to use Xhtml in today's world. In most cases Xhtml is just html with an xhtml doctype. Unless you send your document with the http media type header "application/xhtml+xml" (which doesn't work in IE), your pages will be rendered just as normal html, so you will not be able to take advantage of the opportunities xhtml gives you.

    JonT - 6th June 2008 00:59 - #

  7. Eric - that's why I specified "tools for consuming" as opposed to consumers. There are millions of document authors out there - anyone who creates web pages. There are far, far fewer people writing browsers and HTML parsing libraries. I have no problem at all with making life a bit harder for those consumer tool authors in return for making life easier for the millions of authors out there.

    Not to mention that while HTML is harder to parse, the tools have for the most part been written now. Parsing HTML is mostly a solved problem - and HTML 5 makes it easier by specifying an error recovery model so people won't have to reverse engineer common browser behaviour any more.

    Simon Willison - 6th June 2008 07:21 - #

  8. HTML doesn't have to be any uglier than XHTML. You can still nest correctly, you can still validate it, you can still use lists and alt attributes and just about everything except for self-closing tags.

    The fact that HTML has historically been ugly (from bad editors) is no reflection on the language. It can't compete with XHTML for fashion though.

    Mike H - 6th June 2008 07:37 - #

  9. I've never found a better argument against strict XHTML parsing (in fact, strict document parsing of any kind) than Mark Pilgrim's piece from four years ago:
    http://diveintomark.org/archives/2004/01/14/thought_experiment

    My first attempt at HTML, created in a text editor 15 years ago (oh god), was completely invalid, but NCSA Mosaic did a good job of rendering it anyway. That's why I stuck with it, and made a second page.

    The web exists because the barriers to content creation are so low - it's nothing to do with the creation tools, and everything to do with simplicity. Sure, it makes parsing harder, but until we have more people writing parsers than creating content, that's where the balance should lie.

    Yoz - 6th June 2008 08:36 - #

  10. I've just switched back to HTML 4.01 Transitional for the latest new build I'm working on.

    I decided I'd better put my money where my mouth is, as I'd been arguing that using XHTML is fairly pointless now.

    The need to header-sniff if you're doing it properly and the subsequent need to fork your DOM scripts (createElement in text/html vs createElementNS in application/xhtml+xml) is just too much hassle. HTML 4.01 works, and HTML 5 is the next step in web markup evolution.

    I agree, by the way, that parsing tag soup HTML is a solved problem, so why shift the burden of responsibility to the page author?

    Tim Beadle - 6th June 2008 09:43 - #
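
The "header-sniffing" mentioned in comments 6 and 10 can be sketched roughly as follows. This is only a minimal illustration: the function names `parse_accept` and `choose_content_type` are made up for this example, and it assumes a deliberately simple policy of serving application/xhtml+xml only when the client's Accept header lists it explicitly with a non-zero q value. A real server would probably want to compare q values between media types as well.

```python
def parse_accept(header):
    """Return {media_type: q} parsed from an HTTP Accept header value."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_content_type(accept_header):
    """Serve XHTML as XML only to clients that explicitly ask for it."""
    prefs = parse_accept(accept_header or "")
    if prefs.get("application/xhtml+xml", 0.0) > 0.0:
        return "application/xhtml+xml"
    # Wildcards such as */* are not enough: fall back to plain HTML.
    return "text/html"
```

A client that only sends `*/*` gets text/html under this policy, which is the safe default for a browser that cannot handle the XML media type.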

  11. Ah geez, not this again.

    Producing is more important than consuming.

    An unescaped ampersand shouldn't prevent successful publication.

    No, the tools will not save us.

    James Bennett - 6th June 2008 09:43 - #
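
The unescaped-ampersand complaint is easy to reproduce. The sketch below uses Python's standard library as a stand-in for "a strict XML consumer" and "a lenient HTML consumer"; browsers use their own parsers, so treat it as an illustration of the two error-handling styles rather than of any particular browser. The helper names are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<p>Fish & chips</p>"


class TextCollector(HTMLParser):
    """Collects the character data a lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_xml(markup):
    """Return the element text, or None if the XML parser rejects the input."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


def parse_as_html(markup):
    """Return the text seen by Python's forgiving html.parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

Here `parse_as_xml(SNIPPET)` gives up on the bare `&`, while `parse_as_html(SNIPPET)` still recovers the text.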

  12. Your use of capitalisation amused me, but probably for the wrong reasons.

    Noah Slater - 6th June 2008 10:49 - #

  13. As far as I'm concerned, the only advantage XHTML has is the technology composition features that XML namespaces give it. (You simply can't do this in HTML without a new DTD for each combination of compositions.)

    Unfortunately, Mozilla got this wrong: you have to use a special DOCTYPE to (for instance) compose SVG with XHTML. Or MathML with XHTML. Or SVG and MathML with XHTML. Three new DOCTYPEs right there; this is no better than HTML, which has the advantages of implicit minimisation, and a history of permissive parsing.

    html5 is trying to "fix" this by pre-composing useful technologies, except that they're getting opposition from some (SVG?) that aren't yet certain what the right approach is, and in any case this doesn't actually solve the actual problem, merely one or two of its symptoms today.

    We're still going to need a language with composable foreign markup at some point. In some ways it might be better not to try to build it up as a successor to HTML at all, perhaps.

    On a slightly different note, and responding more directly to the original post: who are these crazy document authors writing in HTML? Having a manual process of a markup specialist creating HTML out of the final copy does not mean that the document author is writing in HTML. The markup specialist is a tool, just not an automated one. Document authors don't care about syntax: they are writers. And honestly, when I'm writing, I do it on paper. All the rest is the application of more or less inefficient, more or less human tools.

    (Note I'm using the auto-HTML option. I wonder how many people don't? Can we get stats on that?)

    James Aylett - 6th June 2008 11:15 - #

  14. "What is so ugly with valid Html vs. valid Xhtml?"

    IMHO allowing optional start and ending tags is one thing that allows for very ugly coding practices in HTML - and something that was fixed with the reformulation of HTML4 into XHTML1.

    <title>foo</title>
    <p>bar</html>

    is a perfectly valid HTML4 document regardless of whether you choose to slap the Strict or the Transitional doctype on it.

    ... and as far as I can tell it will unfortunately also be a conformant HTML5 document - so HTML5 authors will not have the benefit of a validator that hints to them that balancing tags is a very good practice, something that the W3C XHTML1 validator has been doing since the turn of the century or so!

    You simply get better automated support for good coding practices from the validator if you choose XHTML1 today IMHO. ;)

    Jarvklo - 6th June 2008 11:15 - #
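
To make the two-line example from comment 14 concrete, the sketch below records which tags literally appear in the source and checks whether an XML parser accepts the same text. It assumes Python's `html.parser`, which is only a tokenizer: it does not build a tree or insert the implied html, head and body elements the way a browser would. `TagLogger`, `literal_tags` and `is_well_formed_xml` are names made up for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

DOCUMENT = "<title>foo</title>\n<p>bar</html>"


class TagLogger(HTMLParser):
    """Records the start and end tags that literally appear in the source."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def literal_tags(markup):
    """Return (start tags, end tags) in source order."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.opened, logger.closed


def is_well_formed_xml(markup):
    """True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

The tag lists show a `</html>` with no `<html>` and a `<p>` with no `</p>`, which is exactly the imbalance the comment objects to, and the same text is not well-formed XML.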

  15. Whenever this argument surfaces, there seems to be the assumption that loose syntax is easier for beginners. This baffles me. In my experience simple, strict rules are *much* easier to learn and code to than loose rules with multiple shortcuts.

    I like XHTML because attributes must always be quoted. Tags must always be closed. These are simple rules that require no thought, and result in uniform, predictable markup.

    As soon as something is optional (be that the need to quote an attribute or close a tag or whatever) the author has to learn a set of conditions and evaluate when a shortcut may or may not be used. Similarly, reading markup back becomes more difficult, as those conditions have to be taken into account again.

    Strict syntax is simpler and easier.

    Drew McLellan - 6th June 2008 11:42 - #

  16. Yeah, and that doesn't take into account any possible UA bugs... like I found out yesterday. Ugh. I'm an XML guy at heart, but... well...

    Devon - 6th June 2008 17:33 - #

  17. Drew, I don't buy it. Yes, HTML4 has a lot of optional stuff you can leave out, but how many people *actually* know that? A while back, I ran my blog as (valid) HTML4 Strict with everything omitted that could legally be omitted -- "html", "head" and "body" tags, closing tags, quotes around attributes where permissible (I actually wrote a Markdown variant that knew how to do this). And I got an endless stream of complaints from people who told me to "learn how to write HTML", because they didn't know that was OK.

    HTML tutorials don't cover that stuff; they teach you to close your tags and quote your attributes. Which is as it should be, because that's generally the best practice.

    So if the overwhelming majority of folks never know about it, what's the harm? Also, it's not like people have trouble with optional HTML features that don't involve syntax, so why should syntactic options be such a problem?

    James Bennett - 6th June 2008 19:20 - #

  18. James, from experience dealing with a lot of other people's markup over the years, lots of people know of the 'shortcuts', but don't appear to really know them well enough.

    My argument for strict syntax is that people make mistakes, therefore the rules should be easy and black-and-white. Options encourage mistakes.

    I'm not a fan of draconian error handling in browsers, but I don't believe that one necessarily leads to the other. I'm more than happy to write the strictest markup I can and have the browser be as lenient as it dare be, as per Postel's law, basically.

    That, and HTML is so tremendously more difficult to parse than XHTML that it really should only be reserved for anarchists. :)

    Drew McLellan - 6th June 2008 21:40 - #

  19. Drew, HTML parsing is by and large a solved problem.
    The argument that XHTML is easier on the consumer is moot.

    Noah Slater - 7th June 2008 00:22 - #

  20. Noah, as the author of a parser, I disagree.

    Drew McLellan - 7th June 2008 09:17 - #

  21. Drew, having written both an HTML5 parser and a liberal XML parser (one that follows the XML specification but does not go the extra lengths required to ensure that the input characters are actually within required ranges et cetera) I respectfully disagree. I would say they are about as complicated (if XML is not more complicated). (Especially the XML internal subset and the HTML adoption agency algorithm are just dreadful.)

    It is true however that the HTML consumer side was harder for some time because nobody had taken the time to write down what was actually required yet. With HTML5 this is by and large a solved problem. So I agree with Noah.

    Anne van Kesteren - 7th June 2008 11:11 - #

  22. Anne, so we disagree. Next! :)

    Drew McLellan - 7th June 2008 11:50 - #

  23. I'm unconvinced by the education argument as well. Why not just teach people "best practice" HTML 4 with matched closing tags, quotes around attributes and so on, along with why those things are a good idea?

    I find teaching people XHTML much more troubling as it requires bundling a whole bunch of extra rules and information (namespaces, well formedness rules, potentially even content type stuff) which have absolutely nothing to do with making a web page. Not to mention that once they've learnt it they'll find many expert web developers who will tell them that they didn't need to know that stuff after all!

    Simon Willison - 7th June 2008 15:32 - #

  24. Drew: you see, my problem with the whole "strictness makes it easier" is this: either you do something about enforcing that strictness, or you don't. If you're going to enforce it then that means draconian error handling; pages just do not render if they're invalid. We can all agree that that's a bad idea, I think. So if you have rules and don't enforce them, then parsers and renderers and user-agents and browsers have to be able to render invalid (X)HTML *anyway*. At that point, what's the benefit of XHTML?

    As a secondary point, in my understanding of it, XHTML is not just "HTML with compulsory closing tags and in lower-case". If that existed, I might like it. As it is, because XHTML is XML, you have to buy into and understand all the annoying XML crap like namespaces. There's no middle ground -- either you're an XML parser and you have to understand esoteric annoying rubbish, or you're an HTML parser and don't care about closing tags. If this "HTML but with closing tags" existed, I'd probably use it -- to be honest, I pretty much use it now. People who care about the quality of their HTML are using 4.01 Strict anyway; pretty much the only empty tag I use is img. Is writing <img .../> rather than <img ...> and doing nothing else really going to make XHTML people happier?

    sil - 7th June 2008 15:40 - #

  25. What Drew said!

    Andy Budd - 7th June 2008 15:43 - #

  26. Simon, I don't believe it's necessary for someone learning XHTML for day-to-day use on the web to learn about namespaces. The well-formedness rules in practice are the same for "best practice" HTML as they are for XHTML. However, someone learning HTML will inevitably come across the shortcuts, and "that's a bad idea" is a much weaker argument than "that's incorrect", imho.

    Stuart (sil), I don't think I've ever used namespaces in an XHTML document, nor come across anyone else using them in any practical context. Namespaces are pretty much a theoretical issue in XHTML - and there's already too many real issues on the web to be worrying about the theoretical ones as well.

    The strictness doesn't need to be aggressively enforced in order for the (to my mind) simpler rules to be useful to the author. The fact that there are fewer options makes the author less likely to make mistakes and therefore increases the quality of their markup, regardless of whether those rules are being enforced or not.

    There is a middle ground, in that XHTML as used on the web today (very successfully) is basically just HTML with a stricter set of authoring rules. It's not a crime to not use parts of a specification you don't have an application for, and by and large the web doesn't appear to have much of a use for namespaces.

    My concern with HTML5 is that the supporting documentation encourages omission of anything optional, and so any "best practice HTML" arguments go completely out of the window. The HTML examples in the authoring guide make my skin crawl.

    http://dev.w3.org/html5/html-author/

    Drew McLellan - 8th June 2008 11:59 - #

  27. Drew: ah, I agree that authors don't have to use namespaces, and indeed 99.9% of XHTML authors don't. However, the underlying subject in this post is tools for consuming XHTML -- parsers and the like -- and if you're writing one of those then you cannot just "not use parts of [the] specification [that] you don't have an application for", which is the problem. HTML-plus-closing-tags is definitely easier to parse than HTML 4.01 (whether it's easier to *write* or not is open to discussion (you think so, I don't, but that's fine)), but XHTML as-it-is-specced isn't because you do have to understand all the stuff that authors just leave out.

    sil - 8th June 2008 16:05 - #
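
The point that consumers cannot ignore namespaces even when authors do can be seen with a namespace-aware XML parser. The sketch below uses Python's `xml.etree.ElementTree` as an example consumer; the page content is invented, and the only namespace involved is the default one from the standard XHTML boilerplate.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Hello</title></head>"
    "<body><p>Hi</p></body>"
    "</html>"
)

root = ET.fromstring(PAGE)

# The author typed <html>, but the consumer sees a namespace-qualified name.
root_tag = root.tag

# A lookup by the bare element name finds nothing...
naive_hits = root.findall(".//p")

# ...while a namespace-aware lookup finds the paragraph.
aware_hits = root.findall(".//x:p", {"x": XHTML_NS})
```

The author never wrote a prefix anywhere, yet every element name the tool sees is qualified, so code written against bare names silently matches nothing.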

  28. Namespaces are part of the boilerplate though, so if you're teaching someone XHTML you have to tell them "don't worry about that bit, you don't have to know what it does". I find that really frustrating, for the same reason I dislike Java as a teaching language (where you have to ignore "public static void main(String args[])" just to get to Hello World).

    Simon Willison - 8th June 2008 16:44 - #

  29. Stuart: if 99.9% of authors aren't using XHTML namespaces, then that's a pretty strong argument to leave namespace support out of a parser. No point implementing something that isn't going to get used.

    Simon: I agree it's not ideal to have to tell people not to worry about the default XHTML namespace. But in the same way, I'm not sure learning the inner workings of the DOCTYPE declaration is that great either. I think the best domain for teaching namespaces is probably something like XSLT, where they're actually useful.

    Drew McLellan - 8th June 2008 18:55 - #

  30. Drew: OK, but then you're not writing XHTML. You're writing a language without a specification, and without any way of verifying that you're doing it right...

    sil - 8th June 2008 20:58 - #

  31. Stuart: yes, I am writing XHTML. Just as much as I'd still be writing XHTML if I weren't using any IMG elements.

    And I can still have an XHTML parser that doesn't have namespace support, just as I have web browsers without full CSS2 support.

    Drew McLellan - 8th June 2008 21:53 - #

  32. Drew: hm. That's quite a compelling argument you've got there.

    The worry is that other people can be writing XHTML which is part of the W3C's spec but not part of your (more useful) subset, and then the parsers written for your subset won't work (because they're using namespaces, just as a parser that didn't understand IMG wouldn't properly understand XHTML with img tags in). It feels to me like you're defining a useful subset and that that ought to be dignified with more than just a description -- there ought to be a spec and validators and so on for "Useful XHTML", and there is no such spec. I'd be interested in your thoughts on that; is "Useful XHTML" merely HTML 4.01 where you must close tags? To pick a silly example, in XHTML (as I understand it), <script src="wherever" /> is a legal tag. Safari understands that, but Firefox does not. So realistically it can't be used on the web, but an XHTML validator will tell you that it's valid. What would be the differences between "Useful XHTML" and HTML 4.01? I can feel a sort of campaign here to convince people to use it, if there's a decent description of what "it" is.

    sil - 8th June 2008 22:27 - #

  33. Drew, if you're not actually implementing an XML parser then I agree it could be easier than implementing either an XML or HTML parser.

    You can't really compare your solution to parsing HTML or parsing XML since it does so much less.

    Anne van Kesteren - 9th June 2008 13:47 - #

  34. Way late to the discussion, but it's a good one that caught my attention.

    There is an entire perspective missing from this debate: data portability and automation.

    Disagree, Noah S. - Adherence to simple, strict syntax does make XHTML easier on the consumer than HTML. Well-formed data is always easier on consumers, human and computer, because well-formedness dramatically reduces the rule set required to interpret the markup. It is even easier to read. Given well-formed markup it takes exactly 3 functions to deserialize any markup into any Object.property model: Parse_Primitive_Attribute, Parse_Complex_Attribute, and Parse_Object. I have found this to be immensely useful regardless of support for namespaces or validation against a schema. Either you recognize an Object for its properties or you do not. But given well-formed markup at least you can easily use a schema to validate that the markup you are parsing is within the expected language and does not contain illegal elements, attributes, or values. Alas, as sil pointed out, even 3 simple rules are subject to variances in interpretation, but I would argue much less so than the many other rules required to handle potentially malformed HTML.

    I think somewhere the discussion commingled matters of syntax and semantics. Drew M., I took your point as an argument for adhering to strict syntax and I noted your examples of closing quotes and tags. The omission of optional attributes and elements is an entirely separate matter of semantics. Whether or not XHTML is a bloated language specification is irrelevant to the fact that ad-hoc syntax makes for useless complexity, avoidable rules processing, and poor data portability. To be fair though, I think, Drew M., you may have injected a syntax perspective where one was not implied. XHTML may be difficult for authors for reasons of semantics.

    I think Jonathan B. conceded too much - I think, yes, we are morbidly afraid of there being any more browsers because we know that the likelihood of 3-5 different browsers interpreting lax rules uniformly is practically zilch. But it is not just a matter of XHTML and browsers; it's a matter of digital documents being exchanged over standard protocols between connected devices for the purpose of human communication and interaction. Well-formed data makes that a much easier proposition. Now, if someone wants to take up the issue of extensible markup perhaps not being the easiest way to express well-formedness then we'd have a whole other, interesting discussion...

    Kevin Curry - 22nd July 2008 00:20 - #
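
One way to read the "3 functions" claim in comment 34 is sketched below. The comment only names the functions, so the division of labour here is a guess: XML attributes and text-only children are treated as primitive attributes, children with structure of their own as complex attributes, and each element as an object represented by a plain dict. It leans on Python's `xml.etree.ElementTree` to do the actual well-formedness checking, and `deserialize` is a convenience wrapper added for this example.

```python
import xml.etree.ElementTree as ET


def parse_primitive_attribute(value):
    """Turn a text value into an int or float when possible, else a string."""
    text = (value or "").strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


def parse_complex_attribute(element):
    """A child with attributes or children of its own becomes a nested object."""
    return parse_object(element)


def parse_object(element):
    """Deserialize one element into a dict of property name -> value."""
    obj = {name: parse_primitive_attribute(v) for name, v in element.attrib.items()}
    for child in element:
        if len(child) or child.attrib:
            value = parse_complex_attribute(child)
        else:
            value = parse_primitive_attribute(child.text)
        # Repeated child names are collected into a list.
        if child.tag in obj:
            existing = obj[child.tag]
            if not isinstance(existing, list):
                obj[child.tag] = existing = [existing]
            existing.append(value)
        else:
            obj[child.tag] = value
    return obj


def deserialize(markup):
    """Parse well-formed markup and return the root object as a dict."""
    return parse_object(ET.fromstring(markup))
```

The three functions stay this small only because the XML parser has already rejected anything that is not well-formed; tag-soup input would need an error recovery layer in front of them.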
