Simon Willison’s Weblog


21 items tagged “data”

2023 (via) “The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too. # 2nd February 2023, 10:05 pm


To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets.

Andrej Karpathy # 24th August 2022, 9:28 pm


The data team: a short story (via) Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me. # 8th July 2021, 11:12 pm

What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job. # 22nd June 2021, 5:09 pm


The Seven Secrets of Successful Data Scientists. Some sensible advice: pick the right-sized tool, compress everything, split up your data, use open source and run the analysis where the data is. # 3rd September 2010, 12:36 am
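
Two of those tips, "compress everything" and "split up your data", can be illustrated with a minimal sketch using only the Python standard library. The helper names (`split_rows`, `compress_chunk`, `read_chunk`) and the newline-delimited row format are assumptions made for illustration; they do not come from the linked article.

```python
import gzip


def split_rows(rows, chunk_size):
    """Yield successive slices of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def compress_chunk(rows):
    """Join rows with newlines and gzip the UTF-8 encoded result."""
    payload = "\n".join(rows).encode("utf-8")
    return gzip.compress(payload)


def read_chunk(blob):
    """Reverse compress_chunk (assumes the chunk was non-empty)."""
    return gzip.decompress(blob).decode("utf-8").split("\n")


if __name__ == "__main__":
    rows = [f"row-{i}" for i in range(10)]
    blobs = [compress_chunk(chunk) for chunk in split_rows(rows, 4)]
    # Each blob can be stored, shipped and decompressed independently.
    print(len(blobs), read_chunk(blobs[-1]))
```

Keeping each chunk independently compressed means a single piece can be read or reprocessed without touching the rest of the dataset.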

Using Freebase Gridworks to Create Linked Data. A very handy tutorial from Jeni Tennison. # 23rd August 2010, 8:11 pm

With Flickr you can get out, via the API, every single piece of information you put into the system. [...] Asking people to accept anything else is sharecropping. It’s a bad deal. Flickr helped pioneer “Web 2.0”, and personal data ownership is a key piece of that vision. Just because the wider public hasn’t caught on yet to all the nuances around data access, data privacy, data ownership, and data fidelity, doesn’t mean you shouldn’t be embarrassed to be failing to deliver a quality product.

Kellan Elliott-McCrea # 18th May 2010, 6:21 pm

Preview: Freebase Gridworks (via) If my experience with government datasets has taught me anything, it’s that most datasets are collected by human beings (probably using Excel) and human beings are inconsistent. The first step in any data related project inevitably involves cleaning up the data. The Freebase team must run up against this all the time, and it looks like they’re tackling the problem head-on. Freebase Gridworks is just a screencast preview at the moment but an open source release is promised “within a month”—and the tool looks absolutely fantastic. DabbleDB-style data refactoring of spreadsheet data, running on your desktop but with the UI served in a browser. Full undo, a JavaScript-based expression language, powerful faceting and the ability to “reconcile” data against Freebase types (matching up country names, for example). I can’t wait to get my hands on this. # 27th March 2010, 6:43 pm
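
To make the "human beings are inconsistent" point concrete, here is a rough sketch of the simplest kind of clean-up and reconciliation step for country names. It is not how Gridworks works (that tool reconciles against Freebase types); the `CANONICAL` lookup table and the `normalize` and `reconcile` helpers are hypothetical names invented for this illustration.

```python
import unicodedata

# Hypothetical lookup table: normalized spelling -> canonical name.
CANONICAL = {
    "united kingdom": "United Kingdom",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
    "united states": "United States",
    "usa": "United States",
    "us": "United States",
}


def normalize(value):
    """Strip accents, dots, other punctuation and extra whitespace; lower-case."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace(".", "")  # so "U.K." collapses to "UK"
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    return " ".join(text.lower().split())


def reconcile(value):
    """Return the canonical name, or None when there is no known match."""
    return CANONICAL.get(normalize(value))


if __name__ == "__main__":
    for raw in ["U.K.", "  united   KINGDOM ", "U.S.A.", "Atlantis"]:
        print(repr(raw), "->", reconcile(raw))
```

Values that come back as `None` are the ones a person still needs to review by hand.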

The Case For An Older Woman. OkCupid’s fascinating statistics blog uses cleverly plotted aggregate data from the dating site to illustrate the difference in age tastes between the genders (men try to date younger women) and show why that might not be the best strategy. An infographics tour de force. # 17th February 2010, 10:20 pm

World Government Data. Launched last week, this is the Guardian’s meta-search engine for searching and browsing through data from four different government data sites (with more sites planned). Under the hood it’s Django, Solr, Haystack and the Scrapy crawling library. The application was built by Ben Firshman during an internship over Christmas. # 27th January 2010, 12:27 pm


Toiling in the data-mines: what data exploration feels like. Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not. # 26th October 2009, 9:34 am

Yahoo! Geo: Announcing GeoPlanet Data. The Yahoo! WhereOnEarth geographic data set is fantastic, but I’ve always felt slightly uncomfortable about building applications against it in case the API went away. That’s not an issue any more—the entire dataset is now available to download and use under a Creative Commons Attribution license. It’s not entirely clear what the attribution requirements are—do you have to put “data from GeoPlanet” on every page or can you get away with just tucking the attribution away in an “about this site” page? UPDATE: The data doesn’t include latitude/longitude or bounding boxes, which severely reduces its utility. # 20th May 2009, 9:12 pm

Drug seizures: how pure is street cocaine? Neat story on the Guardian Datablog using graphs from Timetric to show that while the purity of cocaine seized by customs over the past five years has stayed constant, the purity of drugs seized by the police has been trending downwards. # 13th May 2009, 12:34 pm

Drop ACID and think about data. I’ve been very impressed with the quality and speed with which the PyCon 2009 videos have been published. Here’s Bob Ippolito on distributed databases and key/value stores. # 17th April 2009, 5:13 pm

A few notes on the Guardian Open Platform

This morning we launched the Guardian Open Platform at a well-attended event in our new offices in Kings Place. This is one of the main projects I’ve been helping out with since joining the Guardian last year, and it’s fantastic to finally have it out in the open.

[... 839 words]

US economic data spreadsheets from the Guardian. At the Guardian we’ve just released a bunch of economic data about the US painstakingly collected by Simon Rogers, our top data journalist, as Google Docs spreadsheets. Get your data here. # 16th January 2009, 6:17 pm

ficlets memorial. Here’s a great argument for Creative Commons—AOL shut down Ficlets without providing an archive or export tool, but the license meant Ficlets co-creator Kevin Lawver could scrape and preserve all of the content anyway. # 14th January 2009, 10:02 pm


Magic/Replace. More inspirational magic from the team at Dabble DB. Be sure to watch the (short) demo video. # 1st December 2008, 12:23 am

Code your own election mashup with Google’s JSON data. The data that powered Google’s US election results map is available to download as a bunch of JSON files. # 6th November 2008, 8:24 pm

CKAN—Comprehensive Knowledge Archive Network. Aims to be the “Debian of data”, with apt-get style tools for installing datasets. Presented at Open Tech 2008 by Rufus Pollock. # 5th July 2008, 3:24 pm


The Data Bill of Rights (via) John Battelle’s inherently sensible “draft of what rights we, as consumers, might demand from companies making hay off the data we create as we trip across the web”. # 27th May 2007, 7:28 pm