
Simon Willison’s Weblog

Staying valid

Andrei Herasimchuk:

There seems no automatic way to keep a site valid with web standards unless you close it off to the rest of the world to contribute to it. I will not do that anytime soon.

There is: I’m doing it. Next Thursday will mark the one year anniversary of my switching to application/xhtml+xml as the content-type header for this site, for user agents that support it. Using that content-type forces Gecko engine browsers to refuse to render pages if they are not well-formed XML, so if a page is invalid I hear about it pretty quickly.
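The post doesn't say how the "for user agents that support it" part is decided, so the following is only a minimal sketch of one common approach: inspect the request's Accept header and fall back to text/html when application/xhtml+xml isn't advertised. The function name `choose_content_type` is made up for illustration, and Python is used purely as an example language; the site itself ran on PHP at the time.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick a Content-Type based on the client's Accept header.

    Hypothetical helper: returns application/xhtml+xml only when the
    client lists it explicitly with a non-zero quality value.
    """
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if media_type != "application/xhtml+xml":
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparseable q as "not accepted"
        if quality > 0:
            return "application/xhtml+xml"
    # Wildcards such as */* are deliberately ignored: a browser that only
    # sends */* gives no evidence it can render XHTML served as XML.
    return "text/html"
```

A request handler would pass in the raw Accept header and use the result as the response's Content-Type. Ignoring wildcards is a design choice in this sketch, not something taken from the post.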

So how do you keep a frequently updated site with data from external sources and user comments valid? There are really only two things you need to do. Firstly, ensure that everything going IN to the system (entries and comments) is valid XHTML. I do that using a simple validation system for comments and a bookmarklet for my own entries. Secondly, any and all data from external sources (my blogroll from blo.gs, blogmark URLs added using a bookmarklet) needs to be entity-escaped before being displayed on the site. In my case, a call to PHP’s htmlspecialchars() function is all that’s needed.
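Here is a rough sketch of those two steps, written in Python rather than PHP. `html.escape` plays roughly the role of `htmlspecialchars()`, and the comment check below is a simplified stand-in, not the actual validator used on this site. The names `escape_external`, `check_comment` and the `ALLOWED_TAGS` list are assumptions made up for the example.

```python
import html
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical whitelist; the real validator's list of elements is not given here.
ALLOWED_TAGS = {"p", "blockquote", "a", "em", "strong", "code", "ul", "ol", "li"}


def escape_external(text: str) -> str:
    """Entity-escape data from an external source before displaying it."""
    return html.escape(text, quote=True)


def check_comment(markup: str) -> List[str]:
    """Return a list of problems found in a comment; an empty list means it passed."""
    try:
        # Wrap the fragment in a single root element so it can be parsed as XML.
        root = ET.fromstring("<div>" + markup + "</div>")
    except ET.ParseError as exc:
        return ["not well-formed: %s" % exc]
    problems = []
    for element in root.iter():
        if element is root:
            continue  # skip the wrapper added above
        if element.tag not in ALLOWED_TAGS:
            problems.append("element not allowed: <%s>" % element.tag)
    return problems
```

For example, `escape_external("Tom & Jerry")` yields `Tom &amp; Jerry`, and `check_comment("<p>unclosed")` reports a well-formedness problem instead of letting the markup through. A fuller validator would also check nesting rules and attributes, which this sketch leaves out.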

I’m not saying my system is ideal: the need for well-formed markup in comments is a major usability issue, even on a site with an audience consisting mostly of web developers. But it’s certainly possible to operate a site in XHTML with frequent updates and user comments while staying valid at the same time.

This is Staying valid by Simon Willison, posted on 2nd May 2004.


19 comments

  1. You can also employ CDATA quoting. XML is an SGML application, after all.

    B. K. Oxley (binkley) - 2nd May 2004 21:59 - #

  2. I don't know how well older browsers would deal with CDATA quoting - in fact, it wouldn't surprise me if modern versions of IE tripped up on it (I don't have them handy to test at the moment).

    Simon Willison - 2nd May 2004 22:06 - #

  3. You also have two other things going for you: you use utf-8, which is apparently the only possible encoding that will let you avoid getting characters from a textarea that are not in the proper charset, and you don't support Trackback, which will deliver characters in an unspecified charset.

    Phil Ringnalda - 2nd May 2004 22:13 - #

  4. Changing to any XML MIME type would indeed quickly help stop malformed markup issues, and was one of the reasons I've been using the correct MIME type of application/xhtml+xml on my blog for just over a year.

    I still mark everything up by hand and can usually write valid markup right off, but do occasionally forget the odd double quote or forget to entitise something.

    However, it's kind of spooky that this issue has popped up again, as I've just spent a couple of hours playing about with the CVS code for phpBB2.1 and getting it to work when the XHTML MIME type is used... not much effort at all, and much quicker at spotting well-formedness errors than using a validator.

    Jonathan Stanley - 2nd May 2004 23:03 - #

  5. Another option for receiving foreign input (like comments) is running the text through HTML Tidy. Or using a less formal markup language (like Textile, Restructured Text, etc) that can be converted into strict XHTML. Just two sides of the same coin really -- with Tidy the informal markup is human-written HTML.

    Ian Bicking - 2nd May 2004 23:16 - #

  6. Funny, that post you linked has the following (emphasis mine):

    ... an XML parser, which checks that each element is in my list of allowed elements, is nested correctly (you can't put a blockquote inside a p for example)

    To quote this rant by Andrei Herasimchuk,

    <blockquote> requires a <p> tag to wrap the quoted text inside the <blockquote> element. This was news to me. Was it mentioned specifically in the W3C HTML 4.0.1 spec for the <blockquote> element? Nope. Oh wait... There is an example and it uses the <p> tag, but there is nary a mention in the definition of the element that this is now the case. Sure, maybe I missed this documented elsewhere, but it was a surprise to me. I discovered the rule from a web site other than the W3C's. I should also note that I see a lot of other blogs out there that also do not seem to do it correctly all of the time.

    Just thought it was funny that you responded to Andrei's rant about maintaining valid XHTML pages while not catching a caveat of XHTML, possibly causing your pages to not validate.

    Roman - 2nd May 2004 23:48 - #

  7. Sure the spec says it's required:

    <!ELEMENT BLOCKQUOTE - - (%block;|SCRIPT)+ -- long quotation -->

    Plain as day, content has to be %block; or SCRIPT. Oh, wait, that's not exactly plain as day, is it? :)

    Phil Ringnalda - 3rd May 2004 00:05 - #

  8. Ian: I'm slightly wary of HTML Tidy, because I've run it on a few sites that have generated "fatal" errors that have to be fixed by hand. To be useful in this kind of scenario it needs to be able to take absolutely any old junk and convert it to XHTML without risk of fatal errors. It's certainly a valuable tool for this kind of thing though.

    Roman: you've mis-read my post. You can't put a blockquote inside a paragraph, but you can (and indeed should) put a paragraph inside a blockquote. If you experiment with my comment validator you'll find that it complains when a blockquote tag is used with raw character data inside it and no paragraph tag. This is consistent with the HTML spec.

    Simon Willison - 3rd May 2004 00:07 - #

  9. Next Thursday will mark the one year anniversary of my switching to application/xhtml+xml as the content-type header for this site, for user agents that support it.

    Wow! I had not even noticed it! But sure enough, when I hit Ctrl-J in Firefox, there it was: application/xhtml+xml. Very nice.

    Scott Johnson - 3rd May 2004 05:36 - #

  10. Thanks to your checker and a couple of hacks in Wordpress I have validated comments as well. Before, I also used the correct content-type though I needed to check often to see if everything was still valid...

    Anne - 3rd May 2004 07:51 - #

  11. Like Anne, I also use your checker for comment posts, and have been for a while now. It seems to do well at keeping our site/my blog happily application/xhtml+xml.

    Anyways, congrats, and thanks for sharing that. I plan on giving words of thanks, just as soon as the DNS propagates over to our new server...

    Mike P. - 3rd May 2004 09:31 - #

  12. I just looked at the original post you linked to from a year ago. I noticed that one of the comments was talking about character sets. Unfortunately, you're still serving the page as ISO-8859-1 in the HTTP header, but declaring it as UTF-8 in the meta tag. The easiest fix for this is probably to add an AddDefaultCharset directive to your Apache config file.

    I think that you've picked up on the key point in your post though. You've got to know what format / charset your input data is. This is something that crops up time and time again at work and it seems to be a difficult lesson to learn...

    Dominic Mitchell - 3rd May 2004 09:42 - #

  13. I've been messing around with a project involving a general purpose XHTML template...this is certainly an interesting call to action.

    CSS Layouts - 4th May 2004 07:50 - #

  14. Well... I added a MTStripControl plug-in from Jacques at Musings. That fixed a large number of copy and paste issues.

    Now it seems I need an elegant solution to enforce XHTML mark-up when entering comments. That is the last thing killing my 100% XHTML 1.0 Strict validation on all my pages.

    I would prefer to do that with a system that just added the correct mark-up if missing, and did not force an error and require the user to re-enter the comment themselves. (Like I have to do with your site.) That is really what I'm looking for, and that is where the tools need to go in the next versions to make this XHTML 1.0 Strict thing a reality through and through. IMHO.

    Andrei Herasimchuk - 5th May 2004 05:58 - #

  15. Have you thought about installing the MT-validate plugin and hooking it into the comment system, as described here? The W3C Validator does not provide the most user-friendly error messages, but it's a darn sight better than forcing the commenter to re-enter the comment.

    Try it out on my blog. I think you'll find it's about as user-friendly as this is likely to get in the near-term.

    Jacques Distler - 6th May 2004 16:08 - #

  16. Another approach to ensuring that rendered comments are valid markup is to not allow much control to the commenters.

    For non-geek purposes I expect it is quite enough to convert URIs and email addresses to the proper markup and line breaks to paragraph breaks for people. Even for geeks it is a reasonable default. That way we're just looking for text from contributors, and they need not understand the markup just to contribute an idea. Even for folk who understand it (the geeks), how often do we really want to monkey up compliant XML by hand?

    Ash - 26th May 2004 00:45 - #

  17. I have created my own class for this, as my content management system, Absolut Engine, has a WYSIWYG editor built in and its output needs to be cleaned up/validated. As far as I know, it works 100% for XHTML 1.0 Strict. I have even tested it against MS Word copy&paste code, and it cleaned it up well.

    dusoft - 7th June 2004 00:27 - #

  18. I have created my own class for this, as my Content management system Absolut Engine has WYSIWYG built-in and the output of WYSIWYG editor needs to be cleaned up/validated. As far as I know, it works 100% for XHTML 1.0 Strict. I have even tested it against MS Word copy&paste code. It had cleaned it up well.

    dusoft - 7th June 2004 00:28 - #

  19. You might want to try my TagSoup parser, which when used as a command-line application turns arbitrary HTML into XML. The result is not guaranteed to be valid XHTML, but it is guaranteed to be well-formed (modulo character encoding issues). It doesn't do as much as Tidy, but it never loops or crashes even on arbitrarily dirty input. It is free and Open Source.

    John Cowan - 8th August 2005 19:01 - #


Previously hosted at http://simon.incutio.com/archive/2004/05/02/stayingValid
