Simon Willison’s Weblog


56 items tagged “xml”


SIARD: Software Independent Archiving of Relational Databases (via) I hadn’t heard of this before but it looks really interesting: the Federal Archives of Switzerland developed a standard for archiving any relational database as a zip file full of XML which “is used in over 50 countries around the globe”.

# 4th May 2022, 10:40 pm / databases, xml, archives


Building an Evernote to SQLite exporter

I’ve been using Evernote for over a decade, and I’ve long wanted to export my data from it so I can do interesting things with it.

[... 1879 words]

xml-analyser. In building evernote-to-sqlite I dusted off an ancient (2009) project I built that scans through an XML file and provides a summary of what elements are present in the document and how they relate to each other. I’ve now packaged it up as a CLI app and published it on PyPI.

# 12th October 2020, 12:41 am / projects, xml


Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.

# 24th July 2019, 8:25 am / memory, xml, profiling, python, elementtree
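
The general idea behind keeping ElementTree memory flat can be sketched as follows. This is a minimal illustration using `iterparse`, not the actual change described in the GitHub issue; the `iter_records` helper, the `Record` tag and the sample data are assumptions made up for the example, and it assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield a dict of attributes for each <tag> element, freeing memory as we go."""
    # Asking for "start" events lets us grab the root element before its children arrive.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)
            # Drop the children collected so far from the root, so the in-memory
            # tree does not keep growing with the size of the file.
            root.clear()


# Hypothetical sample data, loosely modelled on a health export.
SAMPLE = b"""<HealthData>
  <Record type="StepCount" value="120"/>
  <Record type="StepCount" value="98"/>
  <Record type="HeartRate" value="61"/>
</HealthData>"""

records = list(iter_records(io.BytesIO(SAMPLE), "Record"))
print(len(records), records[0])
```

Without the `root.clear()` call every parsed element stays attached to the root, which is the usual reason an incremental parse still ends up holding the whole document in memory.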

Convert Locations.kml (pulled from an iPhone backup) to SQLite. I’ve been playing around with data from my iPhone using the iPhone Backup Extractor app and one of the things it exports for you is a Locations.kml file full of location history data. I wrote a tiny script using Python’s ElementTree XMLPullParser to efficiently iterate through the Placemarks and yield them as dictionaries, which I then batch-inserted into sqlite-utils to create a SQLite database.

# 14th June 2019, 12:45 am / kml, projects, sqlite, sqlite-utils, xml
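
A rough sketch of that pull-parser pattern, assuming a simple KML layout: the `iter_placemarks` helper, the `TimeStamp`/`Point` structure and the sample document are illustrative guesses, not the contents of a real Locations.kml or the original script.

```python
import xml.etree.ElementTree as ET

NS = {"kml": "http://www.opengis.net/kml/2.2"}
PLACEMARK = "{http://www.opengis.net/kml/2.2}Placemark"


def iter_placemarks(chunks):
    """Feed chunks of KML to a pull parser and yield each Placemark as a dict."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _, elem in parser.read_events():
            if elem.tag != PLACEMARK:
                continue
            coords = elem.findtext("kml:Point/kml:coordinates", namespaces=NS)
            lon, lat = coords.split(",")[:2]
            yield {
                "when": elem.findtext("kml:TimeStamp/kml:when", namespaces=NS),
                "longitude": float(lon),
                "latitude": float(lat),
            }
            elem.clear()  # release the Placemark's children once it has been read
    parser.close()


# Hypothetical sample; a real export may be structured differently.
SAMPLE = b"""<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><TimeStamp><when>2019-06-01T10:00:00Z</when></TimeStamp>
<Point><coordinates>-122.41,37.77</coordinates></Point></Placemark>
<Placemark><TimeStamp><when>2019-06-01T10:05:00Z</when></TimeStamp>
<Point><coordinates>-122.42,37.78</coordinates></Point></Placemark>
</Document></kml>"""

# Feed the document in small pieces to show that parsing is incremental.
pieces = (SAMPLE[i:i + 64] for i in range(0, len(SAMPLE), 64))
placemarks = list(iter_placemarks(pieces))
print(placemarks)
```

Because the function is a generator, the dictionaries can be consumed in batches by whatever does the inserting, without the whole file ever being loaded at once.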


Exploring the UK Register of Members Interests with SQL and Datasette

Ever wondered which UK Members of Parliament get gifted the most helicopter rides? How about which MPs have been given Christmas hampers by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.

[... 1167 words]


Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?

It’s replaced XML as the default format for most APIs. XML is still necessary for Atom/RSS feeds and other existing standards built on top of XML. It’s also a better choice than JSON for markup-style data—stuff like XHTML where tags are applied to sequences of characters within larger chunks of text.

[... 81 words]

What are XML feed best practices?

It sounds like you’re pretty much screwed already, if you’re dealing with companies that still think FTPing XML around is a sensible thing to do.

[... 364 words]

What is the difference between XHTML 1.0 strict and transitional?

Not a lot. XHTML transitional lets you use a few presentational attributes and elements that aren’t available in XHTML strict. Here’s a more detailed overview from back in 2005:

[... 59 words]


Indexing JSON in Solr 3.1. The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML.

# 10th December 2010, 9:46 am / json, search, solr, xml, recovered

I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

James Clark

# 2nd December 2010, 6:48 pm / html5, json, xml, recovered


Introducing BERT and BERT-RPC. Justification for inventing a brand new serialisation protocol: Thrift and Protocol Buffers both use IDLs and code generation, XML “is not convertible to a simple unambiguous data structure in any language I’ve ever used” and JSON lacks support for unencoded binary data. The result is BERT—Binary ERlang Term—which extracts a format from Erlang in much the same way that JSON extracted one from JavaScript.

# 21st October 2009, 10:11 pm / protocolbuffers, json, erlang, javascript, bert, serialisation, thrift, xml, github

minixsv (via) As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing!

# 12th August 2009, 4:59 pm / relaxng, elementtree, minixsv, python, validation, xml, xmlschema

xmlwitch. An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement.

# 24th July 2009, 12:33 am / withstatement, python, xml, xmlwitch
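
To show why the with statement suits XML generation, here is a tiny hypothetical builder. `TinyBuilder`, `element` and `leaf` are invented for this sketch and are not xmlwitch's actual API; the point is only that entering a block writes the opening tag and leaving it writes the closing one.

```python
from contextlib import contextmanager
from xml.sax.saxutils import escape, quoteattr


class TinyBuilder:
    """Hypothetical minimal builder; not xmlwitch's actual API."""

    def __init__(self):
        self.parts = []
        self.depth = 0

    def _indent(self):
        return "  " * self.depth

    @contextmanager
    def element(self, name, **attrs):
        attr_text = "".join(f" {k}={quoteattr(str(v))}" for k, v in attrs.items())
        self.parts.append(f"{self._indent()}<{name}{attr_text}>")
        self.depth += 1
        try:
            yield self
        finally:
            # The closing tag is written when the with block exits.
            self.depth -= 1
            self.parts.append(f"{self._indent()}</{name}>")

    def leaf(self, name, text):
        self.parts.append(f"{self._indent()}<{name}>{escape(text)}</{name}>")

    def __str__(self):
        return "\n".join(self.parts)


doc = TinyBuilder()
with doc.element("feed", lang="en"):
    doc.leaf("title", "Q&A about XML")
    with doc.element("entry"):
        doc.leaf("id", "1")
print(doc)
```

The nesting of the Python code mirrors the nesting of the output, and a closing tag cannot be forgotten.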

With YQL Execute, the Internet becomes your database. This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.

# 29th April 2009, 10:50 pm / yql, yahoo, apis, sql, javascript, xml, jsonp, json, e4x

A few notes on the Guardian Open Platform

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

[... 839 words]

JsonML (JSON Markup Language). An almost non-lossy serialization format for sending XML as JSON (plain text in between elements is ignored). Uses the (element-name, attribute-dictionary, list-of-children) tuple format, which sadly means many common cases end up taking more bytes than the original XML. Still an improvement on serializations that behave differently when a list of children has only one item in it.

# 10th February 2009, 3:03 pm / json, jsonml, xml, serialization
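
The (element-name, attribute-dictionary, list-of-children) shape described above can be sketched like this. It follows that description rather than the JsonML specification itself, and `to_triples` is a made-up helper; unlike the format as described, this sketch keeps non-whitespace text between elements.

```python
import json
import xml.etree.ElementTree as ET


def to_triples(elem):
    """Convert an Element to [name, attributes, children], recursively."""
    children = []
    if elem.text and elem.text.strip():
        children.append(elem.text)
    for child in elem:
        children.append(to_triples(child))
        if child.tail and child.tail.strip():
            children.append(child.tail)
    # Children are always a list, even when there is only one of them.
    return [elem.tag, dict(elem.attrib), children]


XML = '<ul class="nav"><li>Home</li><li>About <b>us</b></li></ul>'
encoded = json.dumps(to_triples(ET.fromstring(XML)), separators=(",", ":"))
print(encoded)
print(len(XML), "characters of XML versus", len(encoded), "characters of JSON")
```

Even with compact separators the JSON form comes out longer than the XML for this small input, mostly because every element carries an attribute dictionary and a child list whether it needs them or not.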

Crowbar. Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.

# 24th January 2009, 11:52 pm / rdf, xml, screenscraping, gecko, xulrunner, mozilla, dom, greasemonkey, webservice, crowbar


How to install lxml python module on mac os 10.5 (leopard). Instructions that work! Finally, I can find out what all the fuss is about.

# 15th December 2008, 12:05 am / lxml, python, osx, leopard, xml, libxml2

pyquery. “A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document.

# 6th December 2008, 9:53 am / jquery, pyquery, python, lxml, xml

Magnificent Seven—the value of Atom. The seven core things that Atom solves so that you don’t have to.

# 19th October 2008, 10:24 pm / atom, xml, rest, bill-de-hora

cascadenik: cascading sheets of style for mapnik. Great idea. Mapnik (the open source tile rendering system used by OpenStreetMap and others) has a complex style configuration based on XML. Michal Migurski has built a CSS-style equivalent which compiles down to XML, hopefully making it much quicker and easier to get started with Mapnik customisation.

# 30th August 2008, 10:04 am / css, xml, mapnik, michalmigurski, mapping, openstreetmap, cascadenik

Tip: Configure SAX parsers for secure processing. Explains the billion laughs attack, among others.

# 23rd August 2008, 11:12 am / billionlaughs, xml, security, sax, elliotterustyharold

DoS vulnerability in REXML. Ruby’s REXML library is susceptible to the “billion laughs” denial of service attack where recursively nested entities expand a single entity reference to a billion characters (kind of like the exploding zip file attack). Rails applications that process user-supplied XML should apply the monkey-patch ASAP; a proper gem update is forthcoming.

# 23rd August 2008, 11:11 am / rails, ruby, rexml, xml, security, dos, billionlaughs
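
A sketch of both halves of the billion laughs problem, in Python rather than Ruby or Java: how a small payload is built from nested entities, and one possible defence, which is to refuse any document that declares entities at all. `laughs`, `parse_without_entities` and `EntitiesForbidden` are names invented for this example; this is not the REXML monkey-patch or the SAX configuration from the linked articles.

```python
import xml.parsers.expat


class EntitiesForbidden(ValueError):
    """Raised when a document declares its own entities."""


def parse_without_entities(data):
    """Parse XML with expat, refusing any document that declares entities."""
    parser = xml.parsers.expat.ParserCreate()

    def reject(name, is_parameter_entity, value, base, system_id, public_id, notation_name):
        raise EntitiesForbidden(f"entity declaration not allowed: {name}")

    # expat calls this handler for each <!ENTITY ...> declaration it meets.
    parser.EntityDeclHandler = reject
    parser.Parse(data, True)


def laughs(levels):
    """Build a nested-entity payload; each level repeats the previous one 10 times."""
    decls = ['<!ENTITY lol0 "lol">']
    for i in range(1, levels + 1):
        ref = f"&lol{i - 1};" * 10
        decls.append(f'<!ENTITY lol{i} "{ref}">')
    doctype = "".join(decls)
    return f'<?xml version="1.0"?><!DOCTYPE lolz [{doctype}]><lolz>&lol{levels};</lolz>'


payload = laughs(9)
# Each level multiplies the text by 10, so full expansion would be 3 * 10**9 characters.
print(len(payload), "characters on the wire,", 3 * 10 ** 9, "if fully expanded")
try:
    parse_without_entities(payload)
except EntitiesForbidden as exc:
    print("rejected:", exc)
```

Rejecting entity declarations outright is blunt but simple; it stops the payload before any expansion is attempted, at the cost of also refusing legitimate documents that define their own entities.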

My Universal Feed Parser was conceived as a weapon against what I considered the gravest error of XML: draconian error handling. Recently, someone asked me to implement a switch that makes it not fall back on lax parsing in the case of an XML wellformedness error. I said no, not because it would be difficult to implement, but because that defeats its entire reason for being.

Mark Pilgrim

# 5th August 2008, 10:52 pm / draconian, feeds, mark-pilgrim, python, universalfeedparser, wellformedness, xml

PDFMiner. Useful looking PDF parsing library in Python—can produce an XML representation of the text and style information in a PDF document.

# 3rd August 2008, 3:29 pm / pdf, python, xml, screenscraping, pdfminer

Protocol Buffers: Google’s Data Interchange Format. Open sourced today. Highly efficient binary protocol for storing and transmitting structured data between C++, Java and Python. Uses a .proto file describing the data structure which is compiled to classes in those languages for serializing and deserializing. 3-10 times smaller and 20-100 times faster than XML.

# 8th July 2008, 8:20 am / cplusplus, google, idf, java, opensource, protocolbuffers, python, xml

XML is better if you have more text and fewer tags. And JSON is better if you have more tags and less text. Argh! I mean, come on, it's that easy. But you know, there's a big debate about it.

Steve Yegge

# 15th June 2008, 6:09 pm / json, steve-yegge, xml

Draconian failure on error is not the answer to the problems of Postel's law. Draconian error handling creates an unstable equilibrium in Game Theory terms - it only lasts until one player breaks the rule. One non-Draconian XML5 implementation in a key client product and the Draconian XML ranks would break. Well-specified error recovery is the right way to implement the liberal part of Postel's law.

Henri Sivonen

# 20th March 2008, 2:43 pm / draconian, henrisivonen, html5, postelslaw, xml

CouchDB, XML, and E4X. Brilliant—CouchDB now enables SpiderMonkey’s E4X support, meaning CouchDB views can easily query XML documents stored inside JSON objects using E4X syntax.

# 5th March 2008, 12:31 am / couchdb, javascript, xml, e4x, json, spidermonkey, christopher-lenz