Simon Willison’s Weblog

45 items tagged “xml”

Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?

It’s replaced XML as the default format for most APIs. XML is still necessary for Atom/RSS feeds and other existing standards built on top of XML. It’s also a better choice than JSON for markup-style data—stuff like XHTML where tags are applied to sequences of characters within larger chunks of text.

[... 81 words]
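To illustrate the markup-style case mentioned above, here is a minimal sketch using the standard library's ElementTree. The sample fragment and the `flatten` helper are made up for illustration. It shows how tags wrap runs of characters inside a larger chunk of text, something a JSON encoding would have to spell out as an explicit list of runs.

```python
import xml.etree.ElementTree as ET

# Markup-style data: tags are applied to spans of characters inside running text.
# The fragment is an invented example, not taken from the original answer.
fragment = "<p>JSON is fine for records, but <em>markup</em> wants <code>XML</code>.</p>"
p = ET.fromstring(fragment)


def flatten(elem):
    """Return (tag, text) runs in document order; tag is None for bare text."""
    runs = []
    if elem.text:
        runs.append((None, elem.text))
    for child in elem:
        # Text inside the child element, tagged with the child's name.
        runs.append((child.tag, "".join(child.itertext())))
        # Text that follows the child but still belongs to the parent.
        if child.tail:
            runs.append((None, child.tail))
    return runs


runs = flatten(p)
for tag, text in runs:
    print(tag, repr(text))
```

The list of runs is roughly what a JSON representation of the same paragraph would need to carry, which is why XML tends to read more naturally for this kind of content.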

What are XML feed best practices?

It sounds like you’re pretty much screwed already, if you’re dealing with companies that still think FTPing XML around is a sensible thing to do.

[... 364 words]

What is the difference between XHTML 1.0 strict and transitional?

Not a lot. XHTML transitional lets you use a few presentational attributes and elements that aren’t available in XHTML strict. Here’s a more detailed overview from back in 2005:

[... 59 words]

Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML. # 10th December 2010, 9:46 am

I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

James Clark # 2nd December 2010, 6:48 pm

Introducing BERT and BERT-RPC. Justification for inventing a brand new serialisation protocol: Thrift and Protocol Buffers both use IDLs and code generation, XML “is not convertible to a simple unambiguous data structure in any language I’ve ever used” and JSON lacks support for unencoded binary data. The result is BERT—Binary ERlang Term—which extracts a format from Erlang in much the same way that JSON extracted one from JavaScript. # 21st October 2009, 10:11 pm

minixsv (via) As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing! # 12th August 2009, 4:59 pm

xmlwitch. An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement. # 24th July 2009, 12:33 am
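As a rough idea of what building XML with the with statement can look like, here is a small sketch. `TinyBuilder` and its methods are hypothetical names invented for this example and are not xmlwitch's actual API. Each `with` block opens an element on entry and closes it on exit, so the nesting of the Python code mirrors the nesting of the output.

```python
from contextlib import contextmanager
from xml.sax.saxutils import escape, quoteattr


class TinyBuilder:
    """Hypothetical with-statement XML builder (not xmlwitch's real API)."""

    def __init__(self):
        self.parts = []

    @contextmanager
    def element(self, name, **attrs):
        # Emit the opening tag, with attribute values quoted and escaped.
        attr_text = "".join(f" {k}={quoteattr(str(v))}" for k, v in attrs.items())
        self.parts.append(f"<{name}{attr_text}>")
        try:
            yield self
        finally:
            # The closing tag is emitted when the with block exits.
            self.parts.append(f"</{name}>")

    def text(self, value):
        # Escape &, < and > in character data.
        self.parts.append(escape(value))

    def __str__(self):
        return "".join(self.parts)


xml = TinyBuilder()
with xml.element("feed", lang="en"):
    with xml.element("title"):
        xml.text("Fish & chips")

result = str(xml)
print(result)
```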

With YQL Execute, the Internet becomes your database. This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL. # 29th April 2009, 10:50 pm

A few notes on the Guardian Open Platform

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

[... 839 words]

JsonML (JSON Markup Language). An almost non-lossy serialization format for sending XML as JSON (plain text in between elements is ignored). Uses the (element-name, attribute-dictionary, list-of-children) tuple format, which sadly means many common cases end up taking more bytes than the original XML. Still an improvement on serializations that behave differently when a list of children has only one item in it. # 10th February 2009, 3:03 pm
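A minimal sketch of the (element-name, attribute-dictionary, list-of-children) shape described above, built with the standard library. It follows that description rather than the JsonML specification itself, and the `to_triple` helper is a made-up name. Unlike the behaviour noted above, this sketch keeps non-whitespace text that sits between elements.

```python
import json
import xml.etree.ElementTree as ET


def to_triple(elem):
    """Convert an element to [name, attributes, children]; children is always a list."""
    children = []
    if elem.text and elem.text.strip():
        children.append(elem.text)
    for child in elem:
        children.append(to_triple(child))
        # Assumption of this sketch: text between elements is kept, not dropped.
        if child.tail and child.tail.strip():
            children.append(child.tail)
    return [elem.tag, dict(elem.attrib), children]


source = "<ul><li>one</li><li>two</li></ul>"
encoded = json.dumps(to_triple(ET.fromstring(source)), separators=(",", ":"))

print(encoded)
print(len(source), "bytes of XML versus", len(encoded), "bytes of JSON")
```

Because the children slot is always a list, an element with one child has the same shape as an element with several. The cost is the extra brackets and empty attribute dictionaries, which is where the size overhead in this small example comes from.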

Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML. # 24th January 2009, 11:52 pm

How to install lxml python module on mac os 10.5 (leopard). Instructions that work! Finally, I can find out what all the fuss is about. # 15th December 2008, 12:05 am

pyquery. “A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document. # 6th December 2008, 9:53 am

Magnificent Seven—the value of Atom. The seven core things that Atom solves so that you don’t have to. # 19th October 2008, 10:24 pm

cascadenik: cascading sheets of style for mapnik. Great idea. Mapnik (the open source tile rendering system used by OpenStreetMap and others) has a complex style configuration based on XML. Michal Migurski has built a CSS-style equivalent which compiles down to XML, hopefully making it much quicker and easier to get started with Mapnik customisation. # 30th August 2008, 10:04 am

Tip: Configure SAX parsers for secure processing. Explains the billion laughs attack, among others. # 23rd August 2008, 11:12 am

DoS vulnerability in REXML. Ruby’s REXML library is susceptible to the “billion laughs” denial of service attack where recursively nested entities expand a single entity reference to a billion characters (kind of like the exploding zip file attack). Rails applications that process user-supplied XML should apply the monkey-patch ASAP; a proper gem update is forthcoming. # 23rd August 2008, 11:11 am
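To make the arithmetic behind the attack concrete, here is a sketch that builds the text of a nested-entity document and computes how large the final entity would become, without ever handing it to a parser. The entity names, the fan-out of 10 and the nine levels are assumptions chosen to match the commonly cited form of the attack, and both helper functions are invented for this example.

```python
def expanded_length(base, fanout, levels):
    """Length of the deepest entity after full expansion, computed arithmetically."""
    length = len(base)
    for _ in range(levels):
        # Each level references the previous entity `fanout` times.
        length *= fanout
    return length


def build_payload(base="lol", fanout=10, levels=9):
    """Build the document text only; nothing here parses or expands it."""
    lines = [
        '<?xml version="1.0"?>',
        "<!DOCTYPE lolz [",
        f'  <!ENTITY lol0 "{base}">',
    ]
    for i in range(1, levels + 1):
        refs = f"&lol{i - 1};" * fanout
        lines.append(f'  <!ENTITY lol{i} "{refs}">')
    lines.append("]>")
    lines.append(f"<lolz>&lol{levels};</lolz>")
    return "\n".join(lines)


payload = build_payload()
size = expanded_length("lol", 10, 9)

print(len(payload), "characters of input")
print(size, "characters after expansion")
```

The input stays under a kilobyte while the single reference in the document body would expand to a billion copies of the base string, which is why parsers that accept untrusted XML need a limit on entity expansion.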

My Universal Feed Parser was conceived as a weapon against what I considered the gravest error of XML: draconian error handling. Recently, someone asked me to implement a switch that makes it not fall back on lax parsing in the case of an XML wellformedness error. I said no, not because it would be difficult to implement, but because that defeats its entire reason for being.

Mark Pilgrim # 5th August 2008, 10:52 pm

PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document. # 3rd August 2008, 3:29 pm

Protocol Buffers: Google’s Data Interchange Format. Open sourced today. Highly efficient binary protocol for storing and transmitting structured data between C++, Java and Python. Uses a .proto file describing the data structure which is compiled to classes in those languages for serializing and deserializing. 3-10 times smaller and 20-100 times faster than XML. # 8th July 2008, 8:20 am

XML is better if you have more text and fewer tags. And JSON is better if you have more tags and less text. Argh! I mean, come on, it’s that easy. But you know, there’s a big debate about it.

Steve Yegge # 15th June 2008, 6:09 pm

Draconian failure on error is not the answer to the problems of Postel’s law. Draconian error handling creates an unstable equilibrium in Game Theory terms—it only lasts until one player breaks the rule. One non-Draconian XML5 implementation in a key client product and the Draconian XML ranks would break. Well-specified error recovery is the right way to implement the liberal part of Postel’s law.

Henri Sivonen # 20th March 2008, 2:43 pm

CouchDB, XML, and E4X. Brilliant—CouchDB now enables SpiderMonkey’s E4X support, meaning CouchDB views can easily query XML documents stored inside JSON objects using E4X syntax. # 5th March 2008, 12:31 am

PrinceXML is extremely impressive. I had a poke at Prince (a commercial package for generating high quality PDFs from HTML, XML, CSS and SVG) a few weeks ago and was similarly impressed. # 8th February 2008, 12:02 pm

Cross-Site XMLHttpRequest (via) “Firefox 3 implements the W3C Access Control working draft, which gives you the ability to do XMLHttpRequests to other web sites”—you can mark a document as available for cross-domain requests using either an Access-Control HTTP header or an XML processing instruction. # 9th January 2008, 11:57 pm

PostgreSQL 8.3 beta 4 release notes. In addition to the huge speed improvements, 8.3 adds support for XML, UUID and ENUM data types and brings full text (tsearch2) into the core database engine. # 12th December 2007, 12:43 am

[Release] CouchDB 0.7.0. This is a huge milestone for the project—it’s the first official release to include the JSON REST API instead of XML, and it’s also the first release that is “intended for widespread use”. # 17th November 2007, 12:25 am

The larger question is why on earth, in 2007 and ten years after XML came out, we are still using text files that don’t label their encoding?

Rick Jelliffe # 8th October 2007, 12:27 pm

Atom Models. Building Python classes that act as utility wrappers around data stored in an lxml DOM object. # 7th August 2007, 4:02 pm
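A rough sketch of the wrapper-class idea, using the standard library's ElementTree as a stand-in for lxml. The `Entry` class and its `title` property are hypothetical and are not taken from the linked article. The point is that the Python object holds no data of its own: reads and writes go straight through to the underlying tree.

```python
import xml.etree.ElementTree as ET

# Atom namespace in ElementTree's {uri}tag notation.
ATOM = "{http://www.w3.org/2005/Atom}"


class Entry:
    """Hypothetical wrapper: attribute access reads and writes the underlying tree."""

    def __init__(self, element):
        self.element = element

    @property
    def title(self):
        # Returns None if the entry has no title element.
        return self.element.findtext(f"{ATOM}title")

    @title.setter
    def title(self, value):
        node = self.element.find(f"{ATOM}title")
        if node is None:
            # Create the element on first assignment.
            node = ET.SubElement(self.element, f"{ATOM}title")
        node.text = value


doc = ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom"><title>Old title</title></entry>'
)
entry = Entry(doc)
entry.title = "New title"
print(entry.title)
```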