Simon Willison’s Weblog

Subscribe
Atom feed for nicar

11 posts tagged “nicar”

2025

What’s new in the world of LLMs, for NICAR 2025

Visit What's new in the world of LLMs, for NICAR 2025

I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that’s happened in 2025 so far. The second was a workshop on Cutting-edge web scraping techniques, which I’ve written up separately.

[... 2,797 words]

Cutting-edge web scraping techniques at NICAR. Here's the handout for a workshop I presented this morning at NICAR 2025 on web scraping, focusing on lesser know tips and tricks that became possible only with recent developments in LLMs.

For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn't get everything done.

The workshop consisted of four parts:

  1. Building a Git scraper - an automated scraper in GitHub Actions that records changes to a resource over time
  2. Using in-browser JavaScript and then shot-scraper to extract useful information
  3. Using LLM with both OpenAI and Google Gemini to extract structured data from unstructured websites
  4. Video scraping using Google AI Studio

I released several new tools in preparation for this workshop (I call this "NICAR Driven Development"):

I also came up with a fun way to distribute API keys for workshop participants: I had Claude build me a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at tools.simonwillison.net/encrypt - or use this link and enter the passphrase "demo":

Screenshot of a message encryption/decryption web interface showing the title "Encrypt / decrypt message" with two tab options: "Encrypt a message" and "Decrypt a message" (highlighted). Below shows a decryption form with text "This page contains an encrypted message", a passphrase input field with dots, a blue "Decrypt message" button, and a revealed message saying "This is a secret message".

# 8th March 2025, 7:25 pm / shot-scraper, gemini, nicar, openai, git-scraping, ai, speaking, llms, scraping, generative-ai, claude-artifacts, ai-assisted-programming, claude

simonw/git-scraper-template. I built this new GitHub template repository in preparation for a workshop I'm giving at NICAR (the data journalism conference) next week on Cutting-edge web scraping techniques.

One of the topics I'll be covering is Git scraping - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.

This template repository is designed to be the fastest possible way to get started with a new Git scraper: simple create a new repository from the template and paste the URL you want to scrape into the description field and the repository will be initialized with a custom script that scrapes and stores that URL.

It's modeled after my earlier shot-scraper-template tool which I described in detail in Instantly create a GitHub repository to take screenshots of a web page.

The new git-scraper-template repo took some help from Claude to figure out. It uses a custom script to download the provided URL and derive a filename to use based on the URL and the content type, detected using file --mime-type -b "$file_path" against the downloaded file.

It also detects if the downloaded content is JSON and, if it is, pretty-prints it using jq - I find this is a quick way to generate much more useful diffs when the content changes.

# 26th February 2025, 5:34 am / github-actions, nicar, projects, git-scraping, data-journalism, git, github, scraping

2024

Weeknotes: the aftermath of NICAR

Visit Weeknotes: the aftermath of NICAR

NICAR was fantastic this year. Alex and I ran a successful workshop on Datasette and Datasette Cloud, and I gave a lightning talk demonstrating two new GPT-4 powered Datasette plugins—datasette-enrichments-gpt and datasette-extract. I need to write more about the latter one: it enables populating tables from unstructured content (using a variant of this technique) and it’s really effective. I got it working just in time for the conference.

[... 1,430 words]

NICAR 2024 Tipsheets & Audio. The NICAR data journalism conference was outstanding this year: ~1100 attendees, and every slot on the schedule had at least 2 sessions that I wanted to attend (and usually a lot more).

If you’re interested in the intersection of data analysis and journalism it really should be a permanent fixture on your calendar, it’s fantastic.

Here’s the official collection of handouts (NICAR calls them tipsheets) and audio recordings from this year’s event.

# 11th March 2024, 1:14 am / conferences, data-journalism, nicar

American Community Survey Data via FTP. I got talking to some people from the US Census at NICAR today and asked them if there was a way to download their data in bulk (in addition to their various APIs)... and there was!

I had heard of the American Community Survey but I hadn’t realized that it’s gathered on a yearly basis, as a 5% sample compared to the full every-ten-years census. It’s only been running for ten years, and there’s around a year long lead time on the survey becoming available.

# 8th March 2024, 12:25 am / data-journalism, census, nicar, surveys

Weeknotes: Getting ready for NICAR

Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.

[... 1,390 words]

2023

Weeknotes: NICAR, and an appearance on KQED Forum

I spent most of this week at NICAR 2023, the data journalism conference hosted this year in Nashville, Tennessee.

[... 1,941 words]

2021

Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

[... 773 words]

Git scraping, the five minute lightning talk

Visit Git scraping, the five minute lightning talk

I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.

[... 289 words]

2019

Publish the data behind your stories with SQLite and Datasette. I presented a workshop on Datasette at the IRE and NICAR CAR 2019 data journalism conference yesterday. Here’s the worksheet I prepared for the tutorial.

# 9th March 2019, 6:27 pm / talks, data-journalism, datasette, nicar