Simon Willison’s Weblog

Subscribe

What are XML feed best practices?

31st January 2012

My answer to What are XML feed best practices? on Quora

It sounds like you’re pretty much screwed already, if you’re dealing with companies that still think FTPing XML around is a sensible thing to do.

I would suggest focusing on what you can control. Assume that you will be passed bad data—weird formats, not-well-formed XML, duplicate entries etc. Your job is to handle all of this without going mad, and without your codebase turning in to an unmanageable ball of mud.

So, start by figuring out your own core data model / abstraction. It will need to be VERY loose—as few required fields as possible, since you can be sure some if the feeds you are consuming will come in with stuff missing at some point or another.

Separate your feed consumers from the rest of your code. Having your own good internal Web API (which could consume JSON rather than XML since you control it) might be smart, since that will provide a solid separation and you can then write all of your feed consumers as separate pieces of code that just POST new items to the API.

Learn to love, respect and cherish unique identifiers... but be very wary of supposedly unique identifiers from external sources unless you can be absolutely sure they won’t change on you. Create your own unique IDs at the first available opportunity, treat them properly within your own system and map external identifiers to them whenever you can.

Write your consumers in a dynamic language with a solid interactive prompt, like Python or Ruby. This will make them much easier to write and debug. Use whatever you like for your core data storage / API.

Since your incoming data will come in all shapes and sizes, consider a document store such as MongoDB or Riak over a SQL database. Avoiding SQL migrations will help you out a lot.

Log and store absolutely everything. Ideally you should be able to re-execute every import that the system has ever executed, in order, to make debugging and fixing errors non terrifying. That will almost certainly prove impossible, but it’s a nice thought.

Good luck!