Simon Willison’s Weblog


A few notes on the Guardian Open Platform

10th March 2009

This morning we launched the Guardian Open Platform at a well-attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

There are two components to the launch today: the Content API and the Data Store. I’ll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.

The Data Store

Simon Rogers is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a positive gold mine.

As of today, a sizeable portion of the data he collects for the newspaper will also be published online. As a starting point, we’re publishing over 80 data sets, all using Google Spreadsheets, which means it’s all accessible through the Spreadsheets Data API.

Here’s Simon’s take on it, from Welcome to the Datablog:

Everyday we work with datasets from around the world. We have had to check this data and make sure it’s the best we can get, from the most credible sources. But then it lives for the moment of the paper’s publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.

So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we’ll post it up here and let you know what we’re planning to do with it.

It’s worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced on the Datablog, which is cleverly subtitled “Facts are sacred”.

The Content API

The Content API provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are a little bit older. Various types of content are available—article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API’s rate limit).
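To make the 50-at-a-time paging concrete, here is a minimal sketch of walking a full result set. It assumes a hypothetical `fetch_page(start, count)` callable that you supply yourself (wrapping whichever client library or HTTP call you use) and that returns a list of items; neither that name nor its arguments come from the API documentation.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 50  # the most items a single request returns, per the post


def iter_all_items(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every item by requesting consecutive pages.

    fetch_page is a hypothetical callable: given a start offset and a
    page size it returns that slice of results as a list. Iteration
    stops at the first page that comes back shorter than page_size.
    """
    start = 0
    while True:
        items = fetch_page(start, page_size)
        yield from items
        if len(items) < page_size:
            break  # a short (or empty) page means nothing is left
        start += page_size
```

A real `fetch_page` would also want to pause between requests so that a long crawl stays below the rate limit.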

Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It’s a good idea to review our terms and conditions, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.

We serve 15-minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don’t want to store content for longer than that, as in addition to violating our T&Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.
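One way to respect the 24-hour limit is to timestamp everything you store and refuse to hand back anything older. The sketch below is an illustration only: `ExpiringStore` and its methods are names invented for this example, it keeps everything in memory, and a real application would more likely put the same check in front of a database or cache server.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

MAX_STORAGE_SECONDS = 24 * 60 * 60  # the 24-hour storage limit from the post


class ExpiringStore:
    """In-memory store that drops entries once they pass max_age seconds."""

    def __init__(
        self,
        max_age: float = MAX_STORAGE_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._max_age = max_age
        self._clock = clock  # injectable so the expiry logic can be tested
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        """Store value under key, recording when it was stored."""
        self._items[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        """Return the stored value, or None if it is missing or too old."""
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._max_age:
            del self._items[key]  # expired: discard rather than serve it
            return None
        return value
```

Getting `None` back is the cue to fetch the article again, which is also how a retraction would reach you.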

In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian’s R2 platform, being used for keywords, contributors, “series” (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.

We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API request to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.
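As a rough illustration of the ?count=0 tip, here is a sketch of building a search URL that asks for filters only. Only the `count` parameter is taken from the text above; the endpoint URL is a placeholder, and the `q` and `filter` parameter names and the example tag are assumptions made for this sketch, so check the API documentation for the real ones.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

# Placeholder endpoint, not the real API address.
SEARCH_ENDPOINT = "https://api.example.com/search"


def build_search_url(
    query: str,
    tags: Iterable[str] = (),
    count: Optional[int] = None,
) -> str:
    """Build a search URL, optionally narrowed by tags.

    "q" and "filter" are assumed parameter names; "count" comes from
    the post, where count=0 suppresses results and leaves the filters.
    """
    params = [("q", query)]
    for tag in tags:
        params.append(("filter", tag))  # one filter parameter per tag
    if count is not None:
        params.append(("count", str(count)))
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Ask only for the filters section of a tag-restricted search.
filters_only_url = build_search_url("recycling", tags=["environment/waste"], count=0)
```

Repeating that request for each tag in the returned filters is one way to start mapping how tags relate to one another.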

API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we’d welcome suggestions for improving it from the community.

I released a Python client library for the API this morning, and we also have libraries for Ruby, Java and PHP.

We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you’ll need an API key to try it out.

The bad news

The response to the API release has been terrific (check out what Tom Watson had to say), but as a result it’s likely that API key provisions will be significantly lower than the overall demand for them. Please bear with us while we work towards a more widely accessible release.