Simon Willison’s Weblog

Subscribe

32 items tagged “csv”

2024

Papa Parse (via) I’ve been trying out this JavaScript library for parsing CSV and TSV data today and I’m very impressed. It’s extremely fast, has all of the advanced features I want (streaming support, optional web workers, automatically detecting delimiters and column types), has zero dependencies and weighs just 19KB minified—6.8KB gzipped.

The project is 11 years old now. It was created by Matt Holt, who later went on to create the Caddy web server. Today it’s maintained by Sergi Almacellas Abellana. # 20th March 2024, 12:53 am

2023

Introducing sqlite-xsv: The Fastest CSV Parser for SQLite. Alex Garcia continues to push the boundaries of SQLite extensions. This new extension in Rust wraps the lightning fast Rust csv crate and provides a new csv_reader() virtual table that can handle regular, gzipped and zstd compressed files. # 14th January 2023, 9:54 pm

2022

Joining CSV files in your browser using Datasette Lite

I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.

[... 546 words]

2021

Joining CSV and JSON data with an in-memory SQLite database

The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.

[... 1507 words]

Weeknotes: Vaccinate The States, and how I learned that returning dozens of MB of JSON works just fine these days

On Friday VaccinateCA grew in scope, a lot: we launched a new website called Vaccinate The States. Patrick McKenzie wrote more about the project here—the short version is that we’re building the most comprehensive possible dataset of vaccine availability in the USA, using a combination of data collation, online research and continuing to make a huge number of phone calls.

[... 1109 words]

Weeknotes: Mostly messing around with map tiles

Most of what I worked on this week was covered in Serving map tiles from SQLite with MBTiles and datasette-tiles. I built two new plugins: datasette-tiles for serving map tiles, and datasette-basemap which bundles map tiles for zoom levels 0-6 of OpenStreetMap. I also released download-tiles for downloading tiles and bundling them into an MBTiles database.

[... 350 words]

2020

CSVs: The good, the bad, and the ugly (via) Useful, thoughtful summary of the pros and cons of the most common format for interchanging data. # 5th November 2020, 5:19 pm

Weeknotes: Datasette Writes

As discussed previously, the biggest hole in Datasette’s feature set at the moment involves writing to the database.

[... 604 words]

Weeknotes: Datasette Cloud and zero downtime deployments

Yesterday’s piece on Tracking FARA by deploying a data API using GitHub Actions and Cloud Run was originally intended to be my weeknotes, but ended up getting a bit too involved.

[... 1431 words]

2019

Dockerfile for creating a Datasette of NHS dentist information (via) Really neat Dockerfile example by Alf Eaton that uses multi-stage builds to pull dentist information from the NHS, compile to SQLite using csvs-to-sqlite and serve the results with Datasette. TIL the NHS like to use ¬ as their CSV separator! # 26th April 2019, 2:09 pm

tsv-utils (via) Powerful collection of CLI tools for processing TSV files, written in D for performance and released by eBay. Includes a csv2tsv conversion tool. You can download an archive of pre-built binaries for Linux and OS X from their releases page: worked fine on my Mac. # 7th April 2019, 8:29 pm

csv-diff 0.3.1 (via) I released a minor update to my csv-diff CLI tool today which does a better job of displaying a human-readable representation of rows that have been added or removed from a file—previously they were represented as an ugly JSON dump. My script monitoring changes to the official list of trees in San Francisco has been running for a month now and has captured 23 commits! # 7th April 2019, 8:03 pm

VisiData (via) Intriguing tool by Saul Pwanson: VisiData is a command-line “textpunk utility” for browsing and manipulating tabular data. “pip3 install visidata” and then “vd myfile.csv” (or .json or .xls or SQLite orothers) and get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips for if you start playing with it: hit “gq” to exit, and hit “Ctrl+H” to view the help screen. # 18th March 2019, 3:45 am

Generating a commit log for San Francisco’s official list of trees

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

[... 1051 words]

2018

Datasette 0.23: CSV, SpatiaLite and more (via) The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses. # 18th June 2018, 3:34 pm

sqlitebiter. Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there. # 17th May 2018, 10:40 pm

csvs-to-sqlite 0.8. I released a new version of my csvs-to-sqlite tool this morning with a bunch of handy new features. It can now rename columns and define their types, add the CSV filenames as an additional column, add create indexes on columns and parse dates and datetimes into SQLite-friendly ISO formatted values. # 24th April 2018, 4:11 pm

Parsing CSV using ANTLR and Python 3. I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me. # 6th April 2018, 2:33 pm

django-postgres-copy (via) Really neat Django queryset add-on which exposes the PostgreSQL COPY statement for importing (and exporting) CSV data. MyModel.objects.from_csv(“filename.csv”). Built by the team of data journalists at the California Civic Data Coalition. # 26th January 2018, 12:43 am

Datasette Publish: a web app for publishing CSV files as an online database

I’ve just released Datasette Publish, a web tool for turning one or more CSV files into an online database with a JSON API.

[... 863 words]

csvkit. “A suite of command-line tools for converting to and working with CSV”—includes a huge range of utilities for things like converting Excel and JSON to CSV, grepping, sorting and extracting a subset of columns, combining multiple CSV files together and exporting CSV to a relational database. Worth reading through the tutorial which shows how the different commands can be piped together. # 8th January 2018, 9:03 pm

Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette (via) The Himalayan Database is a compilation of records for all expeditions that have climbed in the Nepalese Himalaya, originally compiled by journalist Elizabeth Hawley over several decades. The database is published as a Visual FoxPro database—here Raffaele Messuti‏ provides step-by-step instructions for extracting the data from the published archive, converting them to CSV using dbfcsv and then converting the CSVs to SQLite using csvs-to-sqlite so you can browse them using Datasette. # 8th January 2018, 7:26 pm

2017

Big Data Workflow with Pandas and SQLite (via) Handy tutorial on dealing with larger data (in this case a 3.9GB CSV file) by incrementally loading it into pandas and writing it out to SQLite. # 28th November 2017, 11:02 pm

Added TSV example to the README · simonw/csvs-to-sqlite@957d4f5. Thanks to a pull request from Jani Monoses, csvs-to-sqlite can now handle TSV (or any other separator) as well as regular CSVs. # 26th November 2017, 7:02 am

New in Datasette: filters, foreign keys and search

I’ve released Datasette 0.13 with a number of exciting new features (Datasette previously).

[... 1143 words]

harelba/q (via) q is a neat command-line utility that lets you run SQL queries directly against CSV and TSV files. Internally it works by firing up an in-memory SQLite database, and as of the latest release (1.7.1) you can use the new --save-db-to-disk option to save that in-memory database to disk. # 25th November 2017, 5:49 pm

csvs-to-sqlite: Refactoring columns into separate lookup tables. I just shipped a new version of csvs-to-sqlite with the ability to extract specified columns into a separate SQLite lookup table by passing additional command-line arguments. # 17th November 2017, 6:41 am

simonw/csvs-to-sqlite. I built a simple tool for bulk converting multiple CSV files into a SQLite database. # 13th November 2017, 6:49 am

The Absurdly Underestimated Dangers of CSV Injection. This is horrifying. A plain old CSV file intended for import into Excel can embed formulas (a value prefixed with an equals symbol) which can execute system commands—with a big honking security prompt that most people will likely ignore. Even worse: they can embed IMPORTXML() functions that can silently leak data from the rest of the sheet to an external URL—and those will work against Google Sheets as well as Excel. # 10th October 2017, 4:13 am

2009

No PDFs! The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it in to something more useful. Even CSV is significantly more useful for many types of information. # 1st November 2009, 12:04 pm