A few notes on the Guardian Open Platform
This morning we launched the Guardian Open Platform at a well-attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.
There are two components to the launch today: the Content API and the Data Store. I’ll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.
The Data Store
Simon Rogers is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a positive gold mine.
As of today, a sizeable portion of the data he collects for the newspaper will also be published online. As a starting point, we’re publishing over 80 data sets, all using Google Spreadsheets, which means it’s all accessible through the Spreadsheets Data API.
Here’s Simon’s take on it, from Welcome to the Datablog:
Everyday we work with datasets from around the world. We have had to check this data and make sure it’s the best we can get, from the most credible sources. But then it lives for the moment of the paper’s publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.
So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we’ll post it up here and let you know what we’re planning to do with it.
It’s worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced on the Datablog, which is cleverly subtitled “Facts are sacred”.
The Content API
The Content API provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are a little bit older. Various types of content are available—article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API’s rate limit).
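A minimal sketch of how a client might walk through results 50 at a time. The fetch_page callable stands in for a real HTTP request, and its (start, count) signature is an assumption made for illustration; it is not the API's documented parameter list. A real client would also need to pause between requests to stay under the rate limit.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 50  # the post says 50 items can be retrieved at a time


def iter_items(fetch_page: Callable[[int, int], List[dict]],
               page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every item by requesting successive pages until one comes back short."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        start += page_size


# Hypothetical stand-in for the HTTP call: pretend the API holds 120 items.
def fake_fetch(start: int, count: int) -> List[dict]:
    data = [{"id": i} for i in range(120)]
    return data[start:start + count]


items = list(iter_items(fake_fetch))
```

Passing the fetch function in as an argument keeps the paging logic separate from the transport, so the same loop works with whichever client library or HTTP module you prefer.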
Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It’s a good idea to review our terms and conditions, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.
We serve 15 minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don’t want to store content for longer than that, as in addition to violating our T&Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.
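One way to respect the 24 hour limit is to timestamp everything you store and refuse to serve anything older. The sketch below is an in-memory illustration only; the class name and its methods are hypothetical, and a real deployment would more likely set an expiry in its database or cache server.

```python
import time
from typing import Callable, Dict, Optional, Tuple

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24 hour storage limit described above


class ExpiringStore:
    """Keeps article bodies in memory and forgets them once they are too old."""

    def __init__(self, max_age: float = MAX_AGE_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._max_age = max_age
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._items: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, body: str) -> None:
        # Record when the body was stored alongside the body itself.
        self._items[key] = (self._clock(), body)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, body = entry
        if self._clock() - stored_at > self._max_age:
            del self._items[key]  # too old: drop it so it cannot be republished
            return None
        return body
```

When get returns None, the caller should fetch the item again from the API, which is also the point at which a retracted article would stop appearing.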
In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian’s R2 platform, being used for keywords, contributors, “series” (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.
We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API request to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.
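As a rough illustration of the ?count=0 tip, here is how such a request URL could be assembled. Only the count parameter comes from this post; the endpoint is a placeholder, and the q and api_key parameter names are assumptions, so check the API documentation for the real ones.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not the real API address.
SEARCH_URL = "http://api.example.com/content/search"


def filters_only_url(query: str, api_key: str) -> str:
    """Build a search URL that asks for zero results, leaving only the filters."""
    params = {
        "q": query,          # assumed name for the search term parameter
        "count": 0,          # from the post: turns off results entirely
        "api_key": api_key,  # assumed name for the key parameter
    }
    return SEARCH_URL + "?" + urlencode(params)


url = filters_only_url("climate change", "KEY")
```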
API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we’d welcome suggestions for improving it from the community.
I released a Python client library for the API this morning, and we also have libraries for Ruby, Java and PHP.
We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you’ll need an API key to try it out.
The bad news
The response to the API release has been terrific (check out what Tom Watson had to say), but as a result it’s likely that API key provisions will be significantly lower than the overall demand for them. Please bear with us while we work towards a more widely accessible release.
Do you need to do anything at all with regard to Atom? People can get to Atom representations of the spreadsheet data via the Google API.
We've done a little mashup using the API at Zemanta that you can check out at http://labs.zemanta.com/guardian/. It's also written in JavaScript with jQuery and jQuery UI and uses callbacks and window.name for transport.
Sean: the Data Store uses Google Spreadsheets, but the Content API was built entirely in-house (it's a Java/Spring application running against our internal search index). The Content API Atom output is the one that we're looking for feedback on.
This is great. What are the chances of marking up the articles on the site with hAtom or including a <link> on the article page pointing to the api Atom representation?
Dan W - 10th March 2009 16:51 - #
Nice work. It will be interesting to see if additional types of meta data will be added to the mix - filtering on geotags springs to mind. What kind of search engine is running under the hood here? An open source solution like Lucene/Solr, or some commercial search provider?
Helge Valvik - 10th March 2009 23:26 - #
The search engine under the hood is Endeca, a commercial engine which the Guardian has been using for regular site search for quite a while.
Good stuff. Any plans to expose the data as RDF and make it part of the linked data web?
http://johngoodwin225.wordpress.com/2009/03/10/the-guardian-open-platform-and-data-store/
John Goodwin - 11th March 2009 11:21 - #
Lovely. Now if only I could use OpenID to sign up for a key...