Simon Willison’s Weblog

Subscribe
Atom feed for data-journalism

48 items tagged “data-journalism”

2024

Project: Civic Band—scraping and searching PDF meeting minutes from hundreds of municipalities

Visit Project: Civic Band - scraping and searching PDF meeting minutes from hundreds of municipalities

I interviewed Philip James about Civic Band, his “slowly growing collection of databases of the minutes from civic governments”. Philip demonstrated the site and talked through his pipeline for scraping and indexing meeting minutes from many different local government authorities around the USA.

[... 762 words]

Project: VERDAD—tracking misinformation in radio broadcasts using Gemini 1.5

Visit Project: VERDAD - tracking misinformation in radio broadcasts using Gemini 1.5

I’m starting a new interview series called Project. The idea is to interview people who are building interesting data projects and talk about what they’ve built, how they built it, and what they learned along the way.

[... 1,025 words]

New in NotebookLM: Customizing your Audio Overviews. The most requested feature for Google's NotebookLM "audio overviews" (aka automatically generated podcast conversations) has been the ability to provide direction to those artificial podcast hosts - setting their expertise level or asking them to focus on specific topics.

Today's update adds exactly that:

Now you can provide instructions before you generate a "Deep Dive" Audio Overview. For example, you can focus on specific topics or adjust the expertise level to suit your audience. Think of it like slipping the AI hosts a quick note right before they go on the air, which will change how they cover your material.

I pasted in a link to my post about video scraping and prompted it like this:

You are both pelicans who work as data journalist at a pelican news service. Discuss this from the perspective of pelican data journalists, being sure to inject as many pelican related anecdotes as possible

Here's the resulting 7m40s MP3, and the transcript.

It starts off strong!

You ever find yourself wading through mountains of data trying to pluck out the juicy bits? It's like hunting for a single shrimp in a whole kelp forest, am I right?

Then later:

Think of those facial recognition systems they have for humans. We could have something similar for our finned friends. Although, gotta say, the ethical implications of that kind of tech are a whole other kettle of fish. We pelicans gotta use these tools responsibly and be transparent about it.

And when brainstorming some potential use-cases:

Imagine a pelican citizen journalist being able to analyze footage of a local council meeting, you know, really hold those pelicans in power accountable, or a pelican historian using video scraping to analyze old film reels, uncovering lost details about our pelican ancestors.

Plus this delightful conclusion:

The future of data journalism is looking brighter than a school of silversides reflecting the morning sun. Until next time, keep those wings spread, those eyes sharp, and those minds open. There's a whole ocean of data out there just waiting to be explored.

And yes, people on Reddit have got them to swear.

# 17th October 2024, 5:27 pm / notebooklm, data-journalism, google, llms, ai, generative-ai, gemini

Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent

Visit Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent

The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.

[... 1,220 words]

Follow the Crypto (via) Very smart new site from Molly White tracking the huge increase in activity from Cryptocurrency-focused PACs this year. These PACs have already raised $203 million and spent $38 million influencing US elections in 2024.

Right now Molly's rankings show that the "Fairshake" cryptocurrency PAC is second only to the Trump-supporting "Make America Great Again Inc" in money raised by Super PACs this year - though it's 9th in the list that includes other types of PAC.

Molly's data comes from the FEC, and the code behind the site is all open source.

There's lots more about the project in the latest edition of Molly's newsletter:

Did you know that the cryptocurrency industry has spent more on 2024 elections in the United States than the oil industry? More than the pharmaceutical industry?

In fact, the cryptocurrency industry has spent more on 2024 elections than the entire energy sector and the entire health sector. Those industries, both worth hundreds of billions or trillions of dollars, are being outspent by an industry that, even by generous estimates, is worth less than $20 billion.

# 15th July 2024, 10:06 pm / blockchain, politics, elections, data-journalism, molly-white

interactive-feed (via) Sam Morris maintains this project which gathers interactive, graphic and data visualization stories from various newsrooms around the world and publishes them on Twitter, Mastodon and Bluesky.

It runs automatically using GitHub Actions, and gathers data using a number of different techniques - XML feeds, custom API integrations (for the NYT, Guardian and Washington Post) and in some cases by scraping index pages on news websites using CSS selectors and cheerio.

The data it collects is archived as JSON in the data/ directory of the repository.

# 5th July 2024, 11:39 pm / git-scraping, data-journalism, bluesky, mastodon

Civic Band. Exciting new civic tech project from Philip James: 30 (and counting) Datasette instances serving full-text search enabled collections of OCRd meeting minutes for different civic governments. Includes 20,000 pages for Alameda, 17,000 for Pittsburgh, 3,567 for Baltimore and an enormous 117,000 for Maui County.

Philip includes some notes on how they're doing it. They gather PDF minute notes from anywhere that provides API access to them, then run local Tesseract for OCR (the cost of cloud-based OCR proving prohibitive given the volume of data). The collection is then deployed to a single VPS running multiple instances of Datasette via Caddy, one instance for each of the covered regions.

# 19th June 2024, 9:30 pm / ocr, tesseract, data-journalism, datasette

Food Delivery Leak Unmasks Russian Security Agents. This story is from April 2022 but I realize now I never linked to it.

Yandex Food, a popular food delivery service in Russia, suffered a major data leak.

The data included an order history with names, addresses and phone numbers of people who had placed food orders through that service.

Bellingcat were able to cross-reference this leak with addresses of Russian security service buildings—including those linked to the GRU and FSB.This allowed them to identify the names and phone numbers of people working for those organizations, and then combine that information with further leaked data as part of their other investigations.

If you look closely at the screenshots in this story they may look familiar: Bellingcat were using Datasette internally as a tool for exploring this data!

# 26th April 2024, 1:59 am / bellingcat, data-journalism, datasette

AI for Data Journalism: demonstrating what we can do with this stuff right now

Visit AI for Data Journalism: demonstrating what we can do with this stuff right now

I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.

[... 6,081 words]

Running OCR against PDFs and images directly in your browser

Visit Running OCR against PDFs and images directly in your browser

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?

[... 2,263 words]

NICAR 2024 Tipsheets & Audio. The NICAR data journalism conference was outstanding this year: ~1100 attendees, and every slot on the schedule had at least 2 sessions that I wanted to attend (and usually a lot more).

If you’re interested in the intersection of data analysis and journalism it really should be a permanent fixture on your calendar, it’s fantastic.

Here’s the official collection of handouts (NICAR calls them tipsheets) and audio recordings from this year’s event.

# 11th March 2024, 1:14 am / conferences, data-journalism, nicar

American Community Survey Data via FTP. I got talking to some people from the US Census at NICAR today and asked them if there was a way to download their data in bulk (in addition to their various APIs)... and there was!

I had heard of the American Community Survey but I hadn’t realized that it’s gathered on a yearly basis, as a 5% sample compared to the full every-ten-years census. It’s only been running for ten years, and there’s around a year long lead time on the survey becoming available.

# 8th March 2024, 12:25 am / data-journalism, census, nicar, surveys

Weeknotes: Getting ready for NICAR

Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.

[... 1,390 words]

2023

I’m on the Newsroom Robots podcast, with thoughts on the OpenAI board

Visit I'm on the Newsroom Robots podcast, with thoughts on the OpenAI board

Newsroom Robots is a weekly podcast exploring the intersection of AI and journalism, hosted by Nikita Roy.

[... 1,032 words]

Example of OpenAI function calling API to extract data from LAPD newsroom articles (via) Fascinating code example from Kyle McDonald. The OpenAI functions mechanism is intended to drive custom function calls, but I hadn’t quite appreciated how useful it can be ignoring the function calls entirely. Kyle instead uses it to define a schema for data he wants to extract from a news article, then uses the gpt-3.5-turbo-0613 to get back that exact set of extracted data as JSON.

# 14th June 2023, 8:57 pm / generative-ai, openai, ai, data-journalism, llms

Teaching News Apps with Codespaces (via) Derek Willis used GitHub Codespaces for the latest data journalism class he taught, and it eliminated the painful process of trying to get students on an assortment of Mac, Windows and Chromebook laptops all to a point where they could start working and learning together.

# 23rd March 2023, 12:39 am / teaching, data-journalism, derek-willis, github, github-codespaces

Weeknotes: NICAR, and an appearance on KQED Forum

I spent most of this week at NICAR 2023, the data journalism conference hosted this year in Nashville, Tennessee.

[... 1,941 words]

Datasette is my data hammer (via) Jeremia Kimelman—a data journalist at CalMatters in Sacramento—enthuses about how he uses Datasette as his default hammer for all kinds of data projects—in particular how much he appreciates Datasette’s focus on URLs. So nice to see this!

# 17th January 2023, 5:23 pm / data-journalism, datasette

2022

Measuring traffic during the Half Moon Bay Pumpkin Festival

Visit Measuring traffic during the Half Moon Bay Pumpkin Festival

This weekend was the 50th annual Half Moon Bay Pumpkin Festival.

[... 2,693 words]

Getting tabular data from unstructured text with GPT-3: an ongoing experiment (via) Roberto Rocha shows how to use a carefully designed prompt (with plenty of examples) to get GPT-3 to convert unstructured textual data into a structured table.

# 5th October 2022, 3:03 am / ai, gpt-3, data-journalism, openai, prompt-engineering, generative-ai, llms

2021

git-history: a tool for analyzing scraped data collected using Git and SQLite

Visit git-history: a tool for analyzing scraped data collected using Git and SQLite

I described Git scraping last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.

[... 2,002 words]

Weeknotes: sqlite-transform 1.1, Datasette 0.58.1, datasette-graphql 1.5

Work on Project Pelican inspires new features and improvements across a number of different projects.

[... 1,419 words]

The Accountability Project Datasettes. The Accountability Project “curates, standardizes and indexes public data to give journalists, researchers and others a simple way to search across otherwise siloed records”—they have a wide range of useful data, and they’ve started experimenting with Datasette to provide SQL access to a subset of the information that they have collected.

# 22nd March 2021, 12:07 am / data-journalism, datasette

Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

[... 773 words]

Git scraping, the five minute lightning talk

Visit Git scraping, the five minute lightning talk

I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.

[... 289 words]

2020

nyt-2020-election-scraper. Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot if the world right now” data source and turn it into a git revision history over time.

# 6th November 2020, 2:24 pm / elections, alex-gaynor, git-scraping, data-journalism, git, new-york-times

selenium-wire. Really useful scraping tool: enhances the Python Selenium bindings to run against a proxy which then allows Python scraping code to look at captured requests—great for if a site you are working with triggers Ajax requests and you want to extract data from the raw JSON that came back.

# 2nd November 2020, 6:58 pm / scraping, data-journalism, python, selenium

sba-loans-covid-19-datasette (via) The treasury department released a bunch of data on the Covid-19 SBA Paycheck Protection Program Loan recipients today—I’ve loaded the most interesting data (the $150,000+ loans) into a Datasette instance.

# 7th July 2020, 2:42 am / projects, covid19, data-journalism, datasette

Data Journalism Academy (via) MaryJo Webster is the data editor for the Star Tribune in Minneapolis, and a 2019 Pulitzer nominee. She’s has a huge amount of experience teaching data journalism and has just released her accumulated teaching materials in the form of the Data Journalism Academy.

# 11th May 2020, 4:45 am / data-journalism

Weeknotes: Covid-19, First Python Notebook, more Dogsheep, Tailscale

My covid-19.datasettes.com project publishes information on COVID-19 cases around the world. The project started out using data from Johns Hopkins CSSE, but last week the New York Times started publishing high quality USA county- and state-level daily numbers to their own repository. Here’s the change that added the NY Times data.

[... 993 words]