Simon Willison’s Weblog

17 items tagged “data”

The Seven Secrets of Successful Data Scientists. Some sensible advice, including pick the right sized tool, compress everything, split up your data, use open source and run the analysis where the data is. # 3rd September 2010, 12:36 am

Using Freebase Gridworks to Create Linked Data. A very handy tutorial from data.gov.uk’s Jeni Tennison. # 23rd August 2010, 8:11 pm

With Flickr you can get out, via the API, every single piece of information you put into the system. [...] Asking people to accept anything else is sharecropping. It’s a bad deal. Flickr helped pioneer “Web 2.0″, and personal data ownership is a key piece of that vision. Just because the wider public hasn’t caught on yet to all the nuances around data access, data privacy, data ownership, and data fidelity, doesn’t mean you shouldn’t be embarrassed to be failing to deliver a quality product.

Kellan Elliott-McCrea # 18th May 2010, 6:21 pm

Preview: Freebase Gridworks (via) If my experience with government datasets has taught me anything, it’s that most datasets are collected by human beings (probably using Excel) and human beings are inconsistent. The first step in any data related project inevitably involves cleaning up the data. The Freebase team must run up against this all the time, and it looks like they’re tackling the problem head-on. Freebase Gridworks is just a screencast preview at the moment but an open source release is promised “within a month”—and the tool looks absolutely fantastic. DabbleDB-style data refactoring of spreadsheet data, running on your desktop but with the UI served in a browser. Full undo, a JavaScript-based expression language, powerful faceting and the ability to “reconcile” data against Freebase types (matching up country names, for example). I can’t wait to get my hands on this. # 27th March 2010, 6:43 pm

The Case For An Older Woman. OK Cupid’s fascinating statistics blog uses cleverly plotted aggregate data from the dating site to illustrate the difference in age tastes between the genders (men try to date younger women) and show why that might not be the best strategy. An infographics tour-de-force. # 17th February 2010, 10:20 pm

World Government Data. Launched last week, this is the Guardian’s meta-search engine for searching and browsing through data from four different government data sites (with more sites planned). Under the hood it’s Django, Solr, Haystack and the Scrapy crawling library. The application was built by Ben Firshman during an internship over Christmas. # 27th January 2010, 12:27 pm

Toiling in the data-mines: what data exploration feels like. Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not. # 26th October 2009, 9:34 am

Yahoo! Geo: Announcing GeoPlanet Data. The Yahoo! WhereOnEarth geographic data set is fantastic, but I’ve always felt slightly uncomfortable about building applications against it in case the API went away. That’s not an issue any more—the entire dataset is now available to download and use under a Creative Commons Attribution license. It’s not entirely clear what the attribution requirements are—do you have to put “data from GeoPlanet” on every page or can you get away with just tucking the attribution away in an “about this site” page? UPDATE: The data doesn’t include latitude/longitude or bounding boxes, which severely reduces its utility. # 20th May 2009, 9:12 pm

Drug seizures: how pure is street cocaine? Neat story on the Guardian Datablog using graphs from Timetric to show that while the purity of cocaine seized by customs over the past five years has stayed constant, the purity of drugs seized by the police has been trending downwards. # 13th May 2009, 12:34 pm

Drop ACID and think about data. I’ve been very impressed with the quality and speed with which the PyCon 2009 videos have been published. Here’s Bob Ippolito on distributed databases and key/value stores. # 17th April 2009, 5:13 pm

A few notes on the Guardian Open Platform

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

[... 839 words]

US economic data spreadsheets from the Guardian. At the Guardian we’ve just released a bunch of economic data about the US painstakingly collected by Simon Rogers, our top data journalist, as Google Docs spreadsheets. Get your data here. # 16th January 2009, 6:17 pm

ficlets memorial. Here’s a great argument for Creative Commons—AOL shut down Ficlets without providing an archive or export tool, but the license meant Ficlets co-creator Kevin Lawver could scrape and preserve all of the content anyway. # 14th January 2009, 10:02 pm

Magic/Replace. More inspirational magic from the team at Dabble DB. Be sure to watch the (short) demo video. # 1st December 2008, 12:23 am

Code your own election mashup with Google’s JSON data. The data that powered Google’s US election results map is available to download as a bunch of JSON files. # 6th November 2008, 8:24 pm

CKAN—Comprehensive Knowledge Archive Network. Aims to be the “Debian of data”, with apt-get style tools for installing datasets. Presented at Open Tech 2008 by Rufus Pollock. # 5th July 2008, 3:24 pm

The Data Bill of Rights (via) John Battelle’s inherently sensible “draft of what rights we, as consumers, might demand from companies making hay off the data we create as we trip across the web”. # 27th May 2007, 7:28 pm