Simon Willison's Weblog: data

GROUNDHOG-DAY.com

2023-02-02T22:05:28+00:00

“The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too.

Via Show HN: Groundhog-day.com – structured groundhog data

Tags: data

Quoting Andrej Karpathy

2022-08-24T21:28:00+00:00

To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets.

— Andrej Karpathy

Tags: data, machine-learning, ai, andrej-karpathy

The data team: a short story

2021-07-08T23:12:59+00:00

The data team: a short story

Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me.

Via Hacker News

Tags: data-science, data, leadership

What I've learned about data recently

2021-06-22T17:09:07+00:00

What I've learned about data recently

Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

Via @seldo

Tags: big-data, data, data-science, laurie-voss

The Seven Secrets of Successful Data Scientists

2010-09-03T00:36:00+00:00

The Seven Secrets of Successful Data Scientists

Some sensible advice, including pick the right sized tool, compress everything, split up your data, use open source and run the analysis where the data is.

Tags: big-data, data, recovered

Using Freebase Gridworks to Create Linked Data

2010-08-23T20:11:00+00:00

Using Freebase Gridworks to Create Linked Data

A very handy tutorial from data.gov.uk’s Jeni Tennison.

Tags: data, datagovuk, freebase, gridworks, jenitennison, recovered

Quoting Kellan Elliott-McCrea

2010-05-18T18:21:00+00:00

With Flickr you can get out, via the API, every single piece of information you put into the system. [...] Asking people to accept anything else is sharecropping. It’s a bad deal. Flickr helped pioneer “Web 2.0″, and personal data ownership is a key piece of that vision. Just because the wider public hasn’t caught on yet to all the nuances around data access, data privacy, data ownership, and data fidelity, doesn’t mean you shouldn’t be embarrassed to be failing to deliver a quality product.

— Kellan Elliott-McCrea

Tags: data, flickr, kellan-elliott-mccrea, sharecropping, web20, recovered

Preview: Freebase Gridworks

2010-03-27T18:43:42+00:00

Preview: Freebase Gridworks

If my experience with government datasets has taught me anything, it’s that most datasets are collected by human beings (probably using Excel) and human beings are inconsistent. The first step in any data related project inevitably involves cleaning up the data. The Freebase team must run up against this all the time, and it looks like they’re tackling the problem head-on. Freebase Gridworks is just a screencast preview at the moment but an open source release is promised “within a month”—and the tool looks absolutely fantastic. DabbleDB-style data refactoring of spreadsheet data, running on your desktop but with the UI served in a browser. Full undo, a JavaScript-based expression language, powerful faceting and the ability to “reconcile” data against Freebase types (matching up country names, for example). I can’t wait to get my hands on this.

Via Jon Udell

Tags: freebase, gridworks, cleanup, data, open-data, dabbledb, javascript

The Case For An Older Woman

2010-02-17T22:20:03+00:00

The Case For An Older Woman

OK Cupid’s fascinating statistics blog uses cleverly plotted aggregate data from the dating site to illustrate the difference in age tastes between the genders (men try to date younger women) and show why that might not be the best strategy. An infographics tour-de-force.

Tags: dating, graphs, data, infographics, okcupid

World Government Data

2010-01-27T12:27:03+00:00

World Government Data

Launched last week, this is the Guardian’s meta-search engine for searching and browsing through data from four different government data sites (with more sites planned). Under the hood it’s Django, Solr, Haystack and the Scrapy crawling library. The application was built by Ben Firshman during an internship over Christmas.

Tags: django, solr, haystack, scrapy, ben-firshman, guardian, projects, python, data, datagovuk

Toiling in the data-mines: what data exploration feels like

2009-10-26T09:34:34+00:00

Toiling in the data-mines: what data exploration feels like

Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not.

Tags: data, tom-armitage, repl, exploratoryprogramming, programming, graphael, graphing, berg

Yahoo! Geo: Announcing GeoPlanet Data

2009-05-20T21:12:24+00:00

Yahoo! Geo: Announcing GeoPlanet Data

The Yahoo! WhereOnEarth geographic data set is fantastic, but I’ve always felt slightly uncomfortable about building applications against it in case the API went away. That’s not an issue any more—the entire dataset is now available to download and use under a Creative Commons Attribution license. It’s not entirely clear what the attribution requirements are—do you have to put “data from GeoPlanet” on every page or can you get away with just tucking the attribution away in an “about this site” page? UPDATE: The data doesn’t include latitude/longitude or bounding boxes, which severely reduces its utility.

Tags: attribution, creativecommons, data, geoplanet, geospatial, gis, whereonearth, yahoo

Drug seizures: how pure is street cocaine?

2009-05-13T12:34:03+00:00

Drug seizures: how pure is street cocaine?

Neat story on the Guardian Datablog using graphs from Timetric to show that while the purity of cocaine seized by customs over the past five years has stayed constant, the purity of drugs seized by the police has been trending downwards.

Tags: guardian, timetric, stats, drugs, data, cocaine

Drop ACID and think about data

2009-04-17T17:13:57+00:00

Drop ACID and think about data

I’ve been very impressed with the quality and speed with which the PyCon 2009 videos have been published. Here’s Bob Ippolito on distributed databases and key/value stores.

Tags: bobippolito, python, acid, databases, data, pycon, pycon2009

A few notes on the Guardian Open Platform

2009-03-10T14:28:39+00:00

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I've been helping out with since joining the Guardian last year, and it's fantastic to finally have it out in the open.

There are two components to the launch today: the Content API and the Data Store. I'll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.

The Data Store

Simon Rogers is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a positive gold mine.

As of today, a sizeable portion the data he collects for the newspaper will also be published online. As a starting point, we're publishing over 80 data sets, all using Google Spreadsheets which means it's all accessible through the Spreadsheets Data API.

Here's Simon's take on it, from Welcome to the Datablog:

Everyday we work with datasets from around the world. We have had to check this data and make sure it's the best we can get, from the most credible sources. But then it lives for the moment of the paper's publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.

So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we'll post it up here and let you know what we're planning to do with it.

It's worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced on the Datablog, which is cleverly subtitled "Facts are sacred".

The Content API

The Content API provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are a little bit older. Various types of content are available - article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API's rate limit).

Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It's a good idea to review our terms and conditions, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.

We serve 15 minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don't want to store content for longer than that, as in addition to violating our T&Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.

In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian's R2 platform, being used for keywords, contributors, "series" (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.

We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API key to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.

API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we'd welcome suggestions for improving it from the community.

I released a Python client library for the API this morning, and we also have libraries for Ruby, Java and PHP.

We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you'll need an API key to try it out.

The bad news

The response to the API release has been terrific (check out what Tom Watson had to say), but as a result it's likely that API key provisions will be significantly lower than the overall demand for them. Please bear with us while we work towards a more widely accessible release.

Tags: apis, atom, contentapi, data, datastore, guardian, javascript, journalism, jquery, json, openplatform, simon-rogers, tom-watson, xml, python

US economic data spreadsheets from the Guardian

2009-01-16T18:17:34+00:00

US economic data spreadsheets from the Guardian

At the Guardian we’ve just released a bunch of economic data about the US painstakingly collected by Simon Rogers, our top data journalist, as Google Docs spreadsheets. Get your data here.

Tags: the-guardian, simon-rogers, data, spreadsheets, google-docs, economics, usa

ficlets memorial

2009-01-14T22:02:42+00:00

ficlets memorial

Here’s a great argument for Creative Commons—AOL shut down Ficlets without providing an archive or export tool, but the license meant Ficlets co-creator Kevin Lawver could scrape and preserve all of the content anyway.

Tags: ficlets, aol, archive, preservation, data, creativecommons, kevinlawver

Magic/Replace

2008-12-01T00:23:10+00:00

Magic/Replace

More inspirational magic from the team at Dabble DB. Be sure to watch the (short) demo video.

Tags: dabbledb, magicreplace, cleanupdata, data, avibryant

Code your own election mashup with Google's JSON data

2008-11-06T20:24:59+00:00

Code your own election mashup with Google's JSON data

The data that powered Google’s US election results map is available to download as a bunch of JSON files.

Tags: google, json, data, uselection

CKAN - Comprehensive Knowledge Archive Network

2008-07-05T15:24:37+00:00

CKAN - Comprehensive Knowledge Archive Network

Aims to be the “Debian of data”, with apt-get style tools for installing datasets. Presented at Open Tech 2008 by Rufus Pollock.

Tags: opentech, opentech2008, ckan, data, rufuspollock

The Data Bill of Rights

2007-05-27T19:28:21+00:00

The Data Bill of Rights

John Battelle’s inherently sensible “draft of what rights we, as consumers, might demand from companies making hay off the data we create as we trip across the web”.

Via Mike Migurski

Tags: data, john-battelle