<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-26T00:29:11+00:00</updated><author><name>Simon Willison</name></author><entry><title>Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson</title><link href="https://simonwillison.net/2025/Nov/26/data-renegades-podcast/#atom-tag" rel="alternate"/><published>2025-11-26T00:29:11+00:00</published><updated>2025-11-26T00:29:11+00:00</updated><id>https://simonwillison.net/2025/Nov/26/data-renegades-podcast/#atom-tag</id><summary type="html">
    &lt;p&gt;I talked with CL Kao and Dori Wilson for an episode of their new &lt;a href="https://www.heavybit.com/library/podcasts/data-renegades"&gt;Data Renegades podcast&lt;/a&gt; titled &lt;a href="https://www.heavybit.com/library/podcasts/data-renegades/ep-2-data-journalism-unleashed-with-simon-willison"&gt;Data Journalism Unleashed with Simon Willison&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;What is data journalism and why it's the most interesting application of data analytics [02:03]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The origin story of Django at a small Kansas newspaper [02:31]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"We had a year's paid internship from university where we went to work &lt;a href="https://simonwillison.net/2025/Jul/13/django-birthday/"&gt;for this local newspaper&lt;/a&gt; in Kansas with this chap &lt;a href="https://holovaty.com/"&gt;Adrian Holovaty&lt;/a&gt;. And at the time we thought we were building a content management system."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Building the "Downloads Page" - a dynamic radio player of local bands [03:24]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Adrian built a feature of the site called &lt;a href="https://web.archive.org/web/20070320083540/https://www.lawrence.com/downloads/"&gt;the Downloads Page&lt;/a&gt;. And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Working at The Guardian on data-driven reporting projects [04:44]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Washington Post's opioid crisis data project and sharing with local newspapers [05:22]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Something the Washington Post did that I thought was extremely forward thinking is that they shared [&lt;a href="https://www.washingtonpost.com/national/2019/08/12/post-released-deas-data-pain-pills-heres-what-local-journalists-are-using-it/?utm_source=chatgpt.com"&gt;the opioid files&lt;/a&gt;] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;NICAR conference and the collaborative, non-competitive nature of data journalism [07:00]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.ire.org/training/conferences/nicar-2026/"&gt;NICAR 2026&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"The &lt;a href="https://www.thebanner.com/"&gt;Baltimore Banner&lt;/a&gt; are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, &lt;a href="https://localnewsinitiative.northwestern.edu/posts/2025/11/10/baltimore-local-media-resurgence/"&gt;not yet&lt;/a&gt;], which is astonishing."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Datasette's plugin ecosystem and the vision of solving data publishing [12:36]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.bellingcat.com/news/rest-of-world/2022/04/01/food-delivery-leak-unmasks-russian-security-agents/"&gt;Bellingcat: Food Delivery Leak Unmasks Russian Security Agents&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The frustration of open source: no feedback on how people use your software [16:14]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Open office hours on Fridays to learn how people use Datasette [16:49]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I have an &lt;a href="https://calendly.com/swillison/datasette-office-hours"&gt;open office hours Calendly&lt;/a&gt;, where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data cleaning as the universal complaint - 95% of time spent cleaning [17:34]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Version control problems in data teams - Python scripts on laptops without Git [17:43]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Carpentries organization teaching scientists Git and software fundamentals [18:12]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"There's an organization called &lt;a href="https://carpentries.org/"&gt;The Carpentries&lt;/a&gt;. Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data documentation as an API contract problem [21:11]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The importance of "view source" on business reports [23:21]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fact-checking process for data reporting [24:16]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Queries as first-class citizens with version history and comments [27:16]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Two types of documentation: official docs vs. temporal/timestamped notes [29:46]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Starting an internal blog without permission - instant credibility [30:24]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Building a search engine across seven documentation systems [31:35]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The TIL (Today I Learned) blog approach - celebrating learning basics [33:05]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I've done &lt;a href="https://til.simonwillison.net/"&gt;TILs&lt;/a&gt; about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coding agents like Claude Code and their unexpected general-purpose power [34:53]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;Claude Skills are awesome, maybe a bigger deal than MCP&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Cursor for data? Generic agent loops vs. data-specific IDEs [38:18]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Future of BI tools: prompt-driven, instant dashboard creation [39:54]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff."&lt;/p&gt;
&lt;/blockquote&gt;
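&lt;p&gt;As a rough illustration of that pattern - schema plus per-column metadata plus a couple of example queries - here is a minimal sketch of assembling such a prompt. The function name, wording and the tree-table example are my own invention, not anything from the episode:&lt;/p&gt;

```python
def text_to_sql_prompt(question, schema, column_notes, example_queries):
    """Build a text-to-SQL prompt: schema, per-column notes, example queries."""
    lines = [
        "Translate the question into a single SQLite SELECT query.",
        "",
        "Schema:",
        schema,
        "",
        "What the columns mean:",
    ]
    lines += [f"- {column}: {note}" for column, note in column_notes.items()]
    lines += ["", "Example queries:"]
    lines += example_queries
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# The resulting string would be sent to whatever model you are using.
prompt = text_to_sql_prompt(
    question="How many street trees were planted after 2020?",
    schema="CREATE TABLE trees (id INTEGER, species TEXT, planted_year INTEGER);",
    column_notes={"planted_year": "four-digit year the tree was planted"},
    example_queries=["SELECT species, COUNT(*) FROM trees GROUP BY species;"],
)
```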
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it."&lt;/p&gt;
&lt;/blockquote&gt;
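&lt;p&gt;The "give me back JSON" step still needs a defensive parsing pass, since models sometimes wrap the object in extra prose. A minimal sketch - the field names here are hypothetical, matching the police-report example in the quote:&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = {"arresting_officer", "incident_date", "description"}


def parse_report_json(raw):
    """Pull the first JSON object out of a model reply and check its fields."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model reply")
    record = json.loads(raw[start:end + 1])
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


# Example: a reply with chatter around the JSON object still parses cleanly.
reply = ('Here you go: {"arresting_officer": "J. Smith", '
         '"incident_date": "2024-03-01", "description": "Theft report"}')
record = parse_report_json(reply)
```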
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data enrichment: running cheap models in loops against thousands of records [44:36]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://enrichments.datasette.io/"&gt;datasette-enrichments&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Multimodal LLMs for images, audio transcription, and video processing [45:42]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Correction: with Gemini 1.5 Flash 8B &lt;a href="https://simonwillison.net/2025/May/15/building-on-llms/#llm-tutorial-intro.009.jpeg"&gt;it would cost 173.25 cents&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2009/Dec/20/crowdsourcing/"&gt;Crowdsourced document analysis and MP expenses&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Favorite test dataset: San Francisco's tree list, updated several times a week [48:44]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"There's &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;195,000 trees in this CSV file&lt;/a&gt; and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Showrunning TV shows as a management model - transferring vision to lieutenants [50:07]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://okbjgm.weebly.com/uploads/3/1/5/0/31506003/11_laws_of_showrunning_nice_version.pdf"&gt;The Eleven Laws of Showrunning&lt;/a&gt; by Javier Grillo-Marxuach&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hot take: all executable code with business value must be in version control [52:21]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I think it's inexcusable to have executable code that has business value that is not in version control somewhere."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hacker News automation: GitHub Actions scraping for notifications [52:45]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I've got &lt;a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/"&gt;a GitHub actions thing&lt;/a&gt; that runs a piece of software I wrote called &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dream project: whale detection camera with Gemini AI [53:47]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.bbc.co.uk/programmes/b00rtbk8/episodes/player"&gt;Mark Steel's in Town&lt;/a&gt; available episodes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Favorite fiction genre: British wizards caught up in bureaucracy [55:06]&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://www.antipope.org/charlie/blog-static/2020/10/the-laundry-files-an-updated-c.html"&gt;The Laundry Files&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Rivers_of_London_(book_series)"&gt;Rivers of London&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/The_Rook_(novel)"&gt;The Rook&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="podcast-colophon"&gt;Colophon&lt;/h4&gt;

&lt;p&gt;I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included &lt;code&gt;&amp;lt;span data-timestamp="425"&amp;gt;&lt;/code&gt; elements. The project uses the following custom instructions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then added a follow-up prompt saying:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end&lt;/p&gt;
&lt;p&gt;Then suggest a very comprehensive list of supporting links I could find&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then one more follow-up:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Add an illustrative quote to every one of those key topics you identified&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/b2b83b99-c506-4865-8d40-dee290723ac9"&gt;the full Claude transcript&lt;/a&gt; of the analysis.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/podcast-appearances"&gt;podcast-appearances&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data"/><category term="data-journalism"/><category term="django"/><category term="ai"/><category term="datasette"/><category term="podcast-appearances"/></entry><entry><title>GROUNDHOG-DAY.com</title><link href="https://simonwillison.net/2023/Feb/2/groundhogday/#atom-tag" rel="alternate"/><published>2023-02-02T22:05:28+00:00</published><updated>2023-02-02T22:05:28+00:00</updated><id>https://simonwillison.net/2023/Feb/2/groundhogday/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://groundhog-day.com/"&gt;GROUNDHOG-DAY.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=34630409"&gt;Show HN: Groundhog-day.com – structured groundhog data&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/></entry><entry><title>Quoting Andrej Karpathy</title><link href="https://simonwillison.net/2022/Aug/24/andrej-karpathy/#atom-tag" rel="alternate"/><published>2022-08-24T21:28:00+00:00</published><updated>2022-08-24T21:28:00+00:00</updated><id>https://simonwillison.net/2022/Aug/24/andrej-karpathy/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://karpathy.medium.com/software-2-0-a64152b37c35"&gt;&lt;p&gt;To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://karpathy.medium.com/software-2-0-a64152b37c35"&gt;Andrej Karpathy&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="machine-learning"/><category term="ai"/><category term="andrej-karpathy"/></entry><entry><title>The data team: a short story</title><link href="https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag" rel="alternate"/><published>2021-07-08T23:12:59+00:00</published><updated>2021-07-08T23:12:59+00:00</updated><id>https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://erikbern.com/2021/07/07/the-data-team-a-short-story.html"&gt;The data team: a short story&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27777594"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leadership"&gt;leadership&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="data-science"/><category term="leadership"/></entry><entry><title>What I've learned about data recently</title><link href="https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag" rel="alternate"/><published>2021-06-22T17:09:07+00:00</published><updated>2021-06-22T17:09:07+00:00</updated><id>https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://seldo.com/posts/what-i-ve-learned-about-data-recently"&gt;What I&amp;#x27;ve learned about data recently&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/seldo/status/1407370508576780290"&gt;@seldo&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laurie-voss"&gt;laurie-voss&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="data-science"/><category term="laurie-voss"/></entry><entry><title>The Seven Secrets of Successful Data Scientists</title><link href="https://simonwillison.net/2010/Sep/3/seven/#atom-tag" rel="alternate"/><published>2010-09-03T00:36:00+00:00</published><updated>2010-09-03T00:36:00+00:00</updated><id>https://simonwillison.net/2010/Sep/3/seven/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://dataspora.com/blog/the-seven-secrets-of-successful-data-scientists/"&gt;The Seven Secrets of Successful Data Scientists&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Some sensible advice, including pick the right sized tool, compress everything, split up your data, use open source and run the analysis where the data is.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="recovered"/></entry><entry><title>Using Freebase Gridworks to Create Linked Data</title><link href="https://simonwillison.net/2010/Aug/23/gridworks/#atom-tag" rel="alternate"/><published>2010-08-23T20:11:00+00:00</published><updated>2010-08-23T20:11:00+00:00</updated><id>https://simonwillison.net/2010/Aug/23/gridworks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.jenitennison.com/blog/node/145"&gt;Using Freebase Gridworks to Create Linked Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A very handy tutorial from data.gov.uk’s Jeni Tennison.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datagovuk"&gt;datagovuk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/freebase"&gt;freebase&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gridworks"&gt;gridworks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jenitennison"&gt;jenitennison&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="datagovuk"/><category term="freebase"/><category term="gridworks"/><category term="recovered"/><category term="jenitennison"/></entry><entry><title>Quoting Kellan Elliott-McCrea</title><link href="https://simonwillison.net/2010/May/18/sharecropping/#atom-tag" rel="alternate"/><published>2010-05-18T18:21:00+00:00</published><updated>2010-05-18T18:21:00+00:00</updated><id>https://simonwillison.net/2010/May/18/sharecropping/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://laughingmeme.org/2010/05/18/minimal-competence-data-access-data-ownership-and-sharecropping/"&gt;&lt;p&gt;With Flickr you can get out, via the API, every single piece of information you put into the system. [...] Asking people to accept anything else is sharecropping. It’s a bad deal. Flickr helped pioneer “Web 2.0″, and personal data ownership is a key piece of that vision. Just because the wider public hasn’t caught on yet to all the nuances around data access, data privacy, data ownership, and data fidelity, doesn’t mean you shouldn’t be embarrassed to be failing to deliver a quality product.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://laughingmeme.org/2010/05/18/minimal-competence-data-access-data-ownership-and-sharecropping/"&gt;Kellan Elliott-McCrea&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/flickr"&gt;flickr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kellan-elliott-mccrea"&gt;kellan-elliott-mccrea&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sharecropping"&gt;sharecropping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web20"&gt;web20&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="flickr"/><category term="kellan-elliott-mccrea"/><category term="sharecropping"/><category term="web20"/><category term="recovered"/></entry><entry><title>Preview: Freebase Gridworks</title><link href="https://simonwillison.net/2010/Mar/27/gridworks/#atom-tag" rel="alternate"/><published>2010-03-27T18:43:42+00:00</published><updated>2010-03-27T18:43:42+00:00</updated><id>https://simonwillison.net/2010/Mar/27/gridworks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.freebase.com/2010/03/26/preview-freebase-gridworks/"&gt;Preview: Freebase Gridworks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
If my experience with government datasets has taught me anything, it’s that most datasets are collected by human beings (probably using Excel) and human beings are inconsistent. The first step in any data-related project inevitably involves cleaning up the data. The Freebase team must run up against this all the time, and it looks like they’re tackling the problem head-on. Freebase Gridworks is just a screencast preview at the moment but an open source release is promised “within a month”—and the tool looks absolutely fantastic. DabbleDB-style data refactoring of spreadsheet data, running on your desktop but with the UI served in a browser. Full undo, a JavaScript-based expression language, powerful faceting and the ability to “reconcile” data against Freebase types (matching up country names, for example). I can’t wait to get my hands on this.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://blog.jonudell.net/2010/03/26/freebase-gridworks-a-power-tool-for-data-scrubbers/"&gt;Jon Udell&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cleanup"&gt;cleanup&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dabbledb"&gt;dabbledb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/freebase"&gt;freebase&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gridworks"&gt;gridworks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="cleanup"/><category term="dabbledb"/><category term="data"/><category term="freebase"/><category term="gridworks"/><category term="javascript"/><category term="open-data"/></entry><entry><title>The Case For An Older Woman</title><link href="https://simonwillison.net/2010/Feb/17/case/#atom-tag" rel="alternate"/><published>2010-02-17T22:20:03+00:00</published><updated>2010-02-17T22:20:03+00:00</updated><id>https://simonwillison.net/2010/Feb/17/case/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.okcupid.com/index.php/2010/02/16/the-case-for-an-older-woman/"&gt;The Case For An Older Woman&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OK Cupid’s fascinating statistics blog uses cleverly plotted aggregate data from the dating site to illustrate the difference in age tastes between the genders (men try to date younger women) and show why that might not be the best strategy. An infographics tour-de-force.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dating"&gt;dating&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphs"&gt;graphs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/infographics"&gt;infographics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/okcupid"&gt;okcupid&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="dating"/><category term="graphs"/><category term="infographics"/><category term="okcupid"/></entry><entry><title>World Government Data</title><link href="https://simonwillison.net/2010/Jan/27/world/#atom-tag" rel="alternate"/><published>2010-01-27T12:27:03+00:00</published><updated>2010-01-27T12:27:03+00:00</updated><id>https://simonwillison.net/2010/Jan/27/world/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.guardian.co.uk/world-government-data"&gt;World Government Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Launched last week, this is the Guardian’s meta-search engine for searching and browsing through data from four different government data sites (with more sites planned). Under the hood it’s Django, Solr, Haystack and the Scrapy crawling library. The application was built by Ben Firshman during an internship over Christmas.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ben-firshman"&gt;ben-firshman&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datagovuk"&gt;datagovuk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/guardian"&gt;guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/haystack"&gt;haystack&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scrapy"&gt;scrapy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/solr"&gt;solr&lt;/a&gt;&lt;/p&gt;



</summary><category term="ben-firshman"/><category term="data"/><category term="datagovuk"/><category term="django"/><category term="guardian"/><category term="haystack"/><category term="projects"/><category term="python"/><category term="scrapy"/><category term="solr"/></entry><entry><title>Toiling in the data-mines: what data exploration feels like</title><link href="https://simonwillison.net/2009/Oct/26/toiling/#atom-tag" rel="alternate"/><published>2009-10-26T09:34:34+00:00</published><updated>2009-10-26T09:34:34+00:00</updated><id>https://simonwillison.net/2009/Oct/26/toiling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://berglondon.com/blog/2009/10/23/toiling-in-the-data-mines-what-data-exploration-feels-like/"&gt;Toiling in the data-mines: what data exploration feels like&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful advice from Tom Armitage on the exploratory development approach required when starting to build a project against a large, complex dataset. Tips include making sure you have a REPL to hand and using tools like gRaphael to generate graphs against pretty much everything, since until you’ve seen their shape you won’t know if they are interesting or not.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/berg"&gt;berg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/exploratoryprogramming"&gt;exploratoryprogramming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphael"&gt;graphael&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphing"&gt;graphing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/programming"&gt;programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/repl"&gt;repl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-armitage"&gt;tom-armitage&lt;/a&gt;&lt;/p&gt;



</summary><category term="berg"/><category term="data"/><category term="exploratoryprogramming"/><category term="graphael"/><category term="graphing"/><category term="programming"/><category term="repl"/><category term="tom-armitage"/></entry><entry><title>Yahoo! Geo: Announcing GeoPlanet Data</title><link href="https://simonwillison.net/2009/May/20/geoplanet/#atom-tag" rel="alternate"/><published>2009-05-20T21:12:24+00:00</published><updated>2009-05-20T21:12:24+00:00</updated><id>https://simonwillison.net/2009/May/20/geoplanet/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.ygeoblog.com/2009/05/announcing-geoplanet-data/"&gt;Yahoo! Geo: Announcing GeoPlanet Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Yahoo! WhereOnEarth geographic data set is fantastic, but I’ve always felt slightly uncomfortable about building applications against it in case the API went away. That’s not an issue any more—the entire dataset is now available to download and use under a Creative Commons Attribution license. It’s not entirely clear what the attribution requirements are—do you have to put “data from GeoPlanet” on every page or can you get away with just tucking the attribution away in an “about this site” page? UPDATE: The data doesn’t include latitude/longitude or bounding boxes, which severely reduces its utility.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/attribution"&gt;attribution&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/creativecommons"&gt;creativecommons&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geoplanet"&gt;geoplanet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gis"&gt;gis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whereonearth"&gt;whereonearth&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yahoo"&gt;yahoo&lt;/a&gt;&lt;/p&gt;



</summary><category term="attribution"/><category term="creativecommons"/><category term="data"/><category term="geoplanet"/><category term="geospatial"/><category term="gis"/><category term="whereonearth"/><category term="yahoo"/></entry><entry><title>Drug seizures: how pure is street cocaine?</title><link href="https://simonwillison.net/2009/May/13/drug/#atom-tag" rel="alternate"/><published>2009-05-13T12:34:03+00:00</published><updated>2009-05-13T12:34:03+00:00</updated><id>https://simonwillison.net/2009/May/13/drug/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.guardian.co.uk/news/datablog/2009/may/08/drugs-drugs-trade"&gt;Drug seizures: how pure is street cocaine?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat story on the Guardian Datablog using graphs from Timetric to show that while the purity of cocaine seized by customs over the past five years has stayed constant, the purity of drugs seized by the police has been trending downwards.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cocaine"&gt;cocaine&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drugs"&gt;drugs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/guardian"&gt;guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stats"&gt;stats&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/timetric"&gt;timetric&lt;/a&gt;&lt;/p&gt;



</summary><category term="cocaine"/><category term="data"/><category term="drugs"/><category term="guardian"/><category term="stats"/><category term="timetric"/></entry><entry><title>Drop ACID and think about data</title><link href="https://simonwillison.net/2009/Apr/17/drop/#atom-tag" rel="alternate"/><published>2009-04-17T17:13:57+00:00</published><updated>2009-04-17T17:13:57+00:00</updated><id>https://simonwillison.net/2009/Apr/17/drop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blip.tv/file/1949416/"&gt;Drop ACID and think about data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been very impressed with the quality and speed with which the PyCon 2009 videos have been published. Here’s Bob Ippolito on distributed databases and key/value stores.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/acid"&gt;acid&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bob-ippolito"&gt;bob-ippolito&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pycon"&gt;pycon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pycon2009"&gt;pycon2009&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="acid"/><category term="bob-ippolito"/><category term="data"/><category term="databases"/><category term="pycon"/><category term="pycon2009"/><category term="python"/></entry><entry><title>A few notes on the Guardian Open Platform</title><link href="https://simonwillison.net/2009/Mar/10/openplatform/#atom-tag" rel="alternate"/><published>2009-03-10T14:28:39+00:00</published><updated>2009-03-10T14:28:39+00:00</updated><id>https://simonwillison.net/2009/Mar/10/openplatform/#atom-tag</id><summary type="html">
    &lt;p&gt;This morning we launched the &lt;a href="http://www.guardian.co.uk/open-platform"&gt;Guardian Open Platform&lt;/a&gt; at a well attended event in our new offices in &lt;a href="http://www.kingsplace.co.uk/"&gt;Kings Place&lt;/a&gt;. This is one of the main projects I've been helping out with since joining the Guardian last year, and it's fantastic to finally have it out in the open.&lt;/p&gt;

&lt;p&gt;There are two components to the launch today: the Content API and the Data Store. I'll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.&lt;/p&gt;

&lt;h4&gt;The Data Store&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://www.guardian.co.uk/profile/simonrogers"&gt;Simon Rogers&lt;/a&gt; is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a &lt;a href="http://delicious.com/smfrogers"&gt;positive gold mine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As of today, a sizeable portion of the data he collects for the newspaper will also be published online. As a starting point, we're publishing over &lt;a href="http://www.guardian.co.uk/data-store"&gt;80 data sets&lt;/a&gt;, all using Google Spreadsheets which means it's all accessible through the &lt;a href="http://code.google.com/apis/spreadsheets/overview.html"&gt;Spreadsheets Data API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's Simon's take on it, from &lt;a href="http://www.guardian.co.uk/news/datablog/2009/mar/10/blogpost1"&gt;Welcome to the Datablog&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite="http://www.guardian.co.uk/news/datablog/2009/mar/10/blogpost1"&gt;&lt;p&gt;Everyday we work with datasets from around the world. We have had to check this data and make sure it's the best we can get, from the most credible sources. But then it lives for the moment of the paper's publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.&lt;/p&gt;

&lt;p&gt;So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we'll post it up here and let you know what we're planning to do with it.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It's worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced &lt;a href="http://www.guardian.co.uk/news/datablog"&gt;on the Datablog&lt;/a&gt;, which is cleverly subtitled "Facts are sacred".&lt;/p&gt;

&lt;h4&gt;The Content API&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://api.guardianapis.com/docs/"&gt;The Content API&lt;/a&gt; provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are &lt;a href="http://www.guardian.co.uk/world/1944/aug/26/france.secondworldwar"&gt;a little bit older&lt;/a&gt;. Various types of content are available - article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API's rate limit).&lt;/p&gt;

&lt;p&gt;Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It's a good idea to review &lt;a href="http://www.guardian.co.uk/open-platform/terms-and-conditions"&gt;our terms and conditions&lt;/a&gt;, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.&lt;/p&gt;

&lt;p&gt;We serve 15 minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don't want to store content for longer than that, as in addition to violating our T&amp;amp;Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.&lt;/p&gt;

&lt;p&gt;In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian's &lt;a href="http://www.guardian.co.uk/help/insideguardian+series/an-abc-of-r2"&gt;R2 platform&lt;/a&gt;, being used for keywords, contributors, "series" (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.&lt;/p&gt;

&lt;p&gt;We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API call to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.&lt;/p&gt;
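&lt;p&gt;As a rough sketch of that tip in Python (the endpoint path and the q/api-key parameter names here are illustrative assumptions, not the documented API; check the API docs for the real values):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical base URL for illustration only.
SEARCH_ENDPOINT = "http://api.guardianapis.com/content/search"

def filters_only_url(query, api_key):
    """Build a search URL with count=0, which skips the content items
    and returns just the tag filters (the faceted navigation data)."""
    params = {"q": query, "count": 0, "api-key": api_key}
    return SEARCH_ENDPOINT + "?" + urlencode(params)

url = filters_only_url("data journalism", "YOUR-KEY")
```

&lt;p&gt;Fetching that URL would return only the filters block, which is all a tag relationship browser needs.&lt;/p&gt;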

&lt;p&gt;API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we'd welcome suggestions for improving it from the community.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="http://code.google.com/p/openplatform-python/"&gt;a Python client library&lt;/a&gt; for the API this morning, and we also have libraries for &lt;a href="http://code.google.com/p/openplatform-ruby/"&gt;Ruby&lt;/a&gt;, &lt;a href="http://code.google.com/p/openplatform-java/"&gt;Java&lt;/a&gt; and &lt;a href="http://code.google.com/p/openplatform-php/"&gt;PHP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you'll need an API key to try it out.&lt;/p&gt;

&lt;h4&gt;The bad news&lt;/h4&gt;

&lt;p&gt;The response to the API release has been terrific (check out what &lt;a href="http://www.tom-watson.co.uk/2009/03/guardian-open-platform/"&gt;Tom Watson&lt;/a&gt; had to say), but as a result it's likely that the number of API keys we can provision will fall significantly short of demand. Please bear with us while we work towards a more widely accessible release.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/contentapi"&gt;contentapi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datastore"&gt;datastore&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/guardian"&gt;guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/journalism"&gt;journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jquery"&gt;jquery&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openplatform"&gt;openplatform&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/simon-rogers"&gt;simon-rogers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-watson"&gt;tom-watson&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apis"/><category term="atom"/><category term="contentapi"/><category term="data"/><category term="data-journalism"/><category term="datastore"/><category term="guardian"/><category term="javascript"/><category term="journalism"/><category term="jquery"/><category term="json"/><category term="openplatform"/><category term="python"/><category term="simon-rogers"/><category term="tom-watson"/><category term="xml"/></entry><entry><title>US economic data spreadsheets from the Guardian</title><link href="https://simonwillison.net/2009/Jan/16/datawonks/#atom-tag" rel="alternate"/><published>2009-01-16T18:17:34+00:00</published><updated>2009-01-16T18:17:34+00:00</updated><id>https://simonwillison.net/2009/Jan/16/datawonks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.guardian.co.uk/help/insideguardian/2009/jan/15/unitedstates-data-journalism-google-spreadsheets"&gt;US economic data spreadsheets from the Guardian&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
At the Guardian we’ve just released a bunch of economic data about the US painstakingly collected by Simon Rogers, our top data journalist, as Google Docs spreadsheets. Get your data here.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/economics"&gt;economics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-docs"&gt;google-docs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/simon-rogers"&gt;simon-rogers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/spreadsheets"&gt;spreadsheets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/the-guardian"&gt;the-guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/usa"&gt;usa&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="economics"/><category term="google-docs"/><category term="simon-rogers"/><category term="spreadsheets"/><category term="the-guardian"/><category term="usa"/></entry><entry><title>ficlets memorial</title><link href="https://simonwillison.net/2009/Jan/14/ficlets/#atom-tag" rel="alternate"/><published>2009-01-14T22:02:42+00:00</published><updated>2009-01-14T22:02:42+00:00</updated><id>https://simonwillison.net/2009/Jan/14/ficlets/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://ficlets.ficly.com/"&gt;ficlets memorial&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Here’s a great argument for Creative Commons—AOL shut down Ficlets without providing an archive or export tool, but the license meant Ficlets co-creator Kevin Lawver could scrape and preserve all of the content anyway.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aol"&gt;aol&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/archive"&gt;archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/creativecommons"&gt;creativecommons&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ficlets"&gt;ficlets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kevin-lawver"&gt;kevin-lawver&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/preservation"&gt;preservation&lt;/a&gt;&lt;/p&gt;



</summary><category term="aol"/><category term="archive"/><category term="creativecommons"/><category term="data"/><category term="ficlets"/><category term="kevin-lawver"/><category term="preservation"/></entry><entry><title>Magic/Replace</title><link href="https://simonwillison.net/2008/Dec/1/magicreplace/#atom-tag" rel="alternate"/><published>2008-12-01T00:23:10+00:00</published><updated>2008-12-01T00:23:10+00:00</updated><id>https://simonwillison.net/2008/Dec/1/magicreplace/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://cleanupdata.com/"&gt;Magic/Replace&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
More inspirational magic from the team at Dabble DB. Be sure to watch the (short) demo video.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/avi-bryant"&gt;avi-bryant&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cleanupdata"&gt;cleanupdata&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dabbledb"&gt;dabbledb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/magicreplace"&gt;magicreplace&lt;/a&gt;&lt;/p&gt;



</summary><category term="avi-bryant"/><category term="cleanupdata"/><category term="dabbledb"/><category term="data"/><category term="magicreplace"/></entry><entry><title>Code your own election mashup with Google's JSON data</title><link href="https://simonwillison.net/2008/Nov/6/json/#atom-tag" rel="alternate"/><published>2008-11-06T20:24:59+00:00</published><updated>2008-11-06T20:24:59+00:00</updated><id>https://simonwillison.net/2008/Nov/6/json/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://arstechnica.com/journals/linux.ars/2008/11/04/code-your-own-election-mashup-with-googles-json-data"&gt;Code your own election mashup with Google&amp;#x27;s JSON data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The data that powered Google’s US election results map is available to download as a bunch of JSON files.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uselection"&gt;uselection&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="google"/><category term="json"/><category term="uselection"/></entry><entry><title>CKAN - Comprehensive Knowledge Archive Network</title><link href="https://simonwillison.net/2008/Jul/5/ckan/#atom-tag" rel="alternate"/><published>2008-07-05T15:24:37+00:00</published><updated>2008-07-05T15:24:37+00:00</updated><id>https://simonwillison.net/2008/Jul/5/ckan/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.ckan.net/"&gt;CKAN - Comprehensive Knowledge Archive Network&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Aims to be the “Debian of data”, with apt-get style tools for installing datasets. Presented at Open Tech 2008 by Rufus Pollock.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ckan"&gt;ckan&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opentech"&gt;opentech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opentech2008"&gt;opentech2008&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rufus-pollock"&gt;rufus-pollock&lt;/a&gt;&lt;/p&gt;



</summary><category term="ckan"/><category term="data"/><category term="opentech"/><category term="opentech2008"/><category term="rufus-pollock"/></entry><entry><title>The Data Bill of Rights</title><link href="https://simonwillison.net/2007/May/27/john/#atom-tag" rel="alternate"/><published>2007-05-27T19:28:21+00:00</published><updated>2007-05-27T19:28:21+00:00</updated><id>https://simonwillison.net/2007/May/27/john/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://battellemedia.com/archives/003575.php"&gt;The Data Bill of Rights&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
John Battelle’s inherently sensible “draft of what rights we, as consumers, might demand from companies making hay off the data we create as we trip across the web”.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://mike.teczno.com/notes/data-bill-of-rights.html"&gt;Mike Migurski&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/john-battelle"&gt;john-battelle&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="john-battelle"/></entry></feed>