What are XML feed best practices?
31st January 2012
My answer to What are XML feed best practices? on Quora
It sounds like you’re pretty much screwed already, if you’re dealing with companies that still think FTPing XML around is a sensible thing to do.
I would suggest focusing on what you can control. Assume that you will be passed bad data: weird formats, not-well-formed XML, duplicate entries and so on. Your job is to handle all of this without going mad, and without your codebase turning into an unmanageable ball of mud.
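As a rough illustration of handling bad input without blowing up, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `item` element name and the `parse_feed` function are my own assumptions for the example; real feeds will use whatever names their authors dreamed up.

```python
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    """Return (items, errors). Bad input is reported, never raised."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Not-well-formed XML: record the problem and carry on.
        return [], [f"not well-formed: {exc}"]

    items, errors, seen = [], [], set()
    # "item" is a hypothetical element name used for this sketch.
    for node in root.iter("item"):
        # Take whatever child elements are present; require none of them.
        record = {child.tag: (child.text or "").strip() for child in node}
        key = tuple(sorted(record.items()))
        if key in seen:
            errors.append(f"duplicate entry skipped: {record}")
            continue
        seen.add(key)
        items.append(record)
    return items, errors
```

The caller always gets back a list of usable records plus a list of complaints, so one broken feed cannot take down the rest of an import run.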
So, start by figuring out your own core data model / abstraction. It will need to be VERY loose, with as few required fields as possible, since you can be sure some of the feeds you are consuming will come in with stuff missing at some point or another.
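A loose model along these lines might look something like the sketch below. The field names (`title`, `url`, `price`) are purely illustrative assumptions, not a recommendation for any particular domain; the point is that only `source` is required and anything unrecognised is kept rather than rejected.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    source: str                       # which feed this came from (the only required field)
    title: Optional[str] = None
    url: Optional[str] = None
    price: Optional[str] = None       # kept as text, since feeds disagree on formats
    extra: dict = field(default_factory=dict)  # anything we did not recognise


# Hypothetical set of fields this sketch knows about.
KNOWN = {"title", "url", "price"}


def item_from_record(source, record):
    """Build an Item from a dict, tolerating missing and unknown keys."""
    known = {k: v for k, v in record.items() if k in KNOWN and v}
    extra = {k: v for k, v in record.items() if k not in KNOWN}
    return Item(source=source, extra=extra, **known)
```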
Separate your feed consumers from the rest of your code. Having your own good internal Web API (which could consume JSON rather than XML since you control it) might be smart, since that will provide a solid separation and you can then write all of your feed consumers as separate pieces of code that just POST new items to the API.
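To make that separation concrete, a feed consumer can end with nothing more than a JSON POST. This is a sketch using only the standard library; the `API_URL` endpoint and the function names are hypothetical, and a real API would also want authentication and retries.

```python
import json
import urllib.request

# Hypothetical internal endpoint, assumed for this sketch.
API_URL = "http://localhost:8000/items"


def build_post(item, api_url=API_URL):
    """Build (but do not send) a JSON POST request for one item."""
    body = json.dumps(item).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_item(item, api_url=API_URL):
    """Send the item to the internal API and return the HTTP status."""
    with urllib.request.urlopen(build_post(item, api_url), timeout=10) as resp:
        return resp.status
```

Each consumer then only needs to know how to turn its own feed into plain dicts and call `post_item`; nothing about XML leaks past that boundary.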
Learn to love, respect and cherish unique identifiers... but be very wary of supposedly unique identifiers from external sources unless you can be absolutely sure they won’t change on you. Create your own unique IDs at the first available opportunity, treat them properly within your own system and map external identifiers to them whenever you can.
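Here is one way the mapping could work, sketched with an in-memory dict; in practice it would live in your data store. The `IdMap` class is a made-up name for illustration. Keying on the (source, external ID) pair means two feeds that both use "42" never collide.

```python
import uuid


class IdMap:
    """Map (source, external_id) pairs to IDs that we control."""

    def __init__(self):
        self._by_external = {}

    def internal_id(self, source, external_id):
        key = (source, external_id)
        if key not in self._by_external:
            # First sighting: mint our own ID and remember the mapping.
            self._by_external[key] = uuid.uuid4().hex
        return self._by_external[key]
```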
Write your consumers in a dynamic language with a solid interactive prompt, like Python or Ruby. This will make them much easier to write and debug. Use whatever you like for your core data storage / API.
Since your incoming data will come in all shapes and sizes, consider a document store such as MongoDB or Riak over a SQL database. Avoiding SQL migrations will help you out a lot.
Log and store absolutely everything. Ideally you should be able to re-execute every import that the system has ever executed, in order, to make debugging and fixing errors non-terrifying. That will almost certainly prove impossible, but it’s a nice thought.
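A cheap approximation is to write every raw payload to disk before touching it, under a name that sorts in arrival order. The sketch below assumes a local directory and that `source` is a simple slug with no path separators; the function names and the `.raw` suffix are my own inventions.

```python
import hashlib
import time
from pathlib import Path


def archive_import(archive_dir, source, payload):
    """Store the raw bytes of one import, untouched, and return the path."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()[:12]
    # Zero-padded nanosecond timestamp first, so names sort chronologically.
    path = archive_dir / f"{time.time_ns():020d}-{source}-{digest}.raw"
    path.write_bytes(payload)
    return path


def replay(archive_dir, handler):
    """Feed every archived payload back through handler, oldest first."""
    for path in sorted(Path(archive_dir).glob("*.raw")):
        _, rest = path.stem.split("-", 1)
        source = rest.rsplit("-", 1)[0]
        handler(source, path.read_bytes())
```

Calling `replay` with the current version of a consumer re-runs history against fixed code, which is most of what you want when a parsing bug turns up weeks later.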