Simon Willison’s Weblog

Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils

This week I built evernote-to-sqlite (see Building an Evernote to SQLite exporter), launched the Datasette Weekly newsletter, worked on some scrapers and pushed out some small improvements to several other projects.

The Datasette Weekly newsletter

After procrastinating on it for several months I finally launched the new Datasette Weekly newsletter!

My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.

You can read the first edition here, which covers Datasette 0.50, git scraping, sqlite-utils extract and features datasette-graphql as the plugin of the week.

I’m using Substack because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.

Writing scrapers with a Python+JavaScript hybrid

I’ve been writing some scraper code to help out with a student journalism project at Stanford. I ended up using Selenium Python running in a Jupyter Notebook.

Historically I’ve avoided Selenium due to how weird and complex it has been to use in the past. I’ve now completely changed my mind: these days it’s a really solid option for browser automation driven by Python thanks to chromedriver and geckodriver, which I recently learned can be installed using Homebrow.

My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of querySelector(), querySelectorAll(), fetch() and the new-to-me DOMParser class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.

My big break-through this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium driver.execute_script() and driver.execute_async_script() (TIL) methods make it trivial to feed execute chunks of JavaScript from Python and get back the results.

This meant I could scrape pages one at time using JavaScript and save the results directly to SQLite via sqlite-utils. I could even run database queries on the Python side to skip items that had already been scraped.

csv-diff 1.0

I’m trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won’t be breaking backwards compatibility without bumping them to 2.0.

I built csv-diff for my San Francisco Trees project last year. It produces human-readable diffs for CSV files.

The version 1.0 release notes are as follows:

  • New --show-unchanged option for outputting the unchanged values of rows that had at least one change. #9
  • Fix for bug with column names that contained a . character. #7
  • Fix for error when no --key provided—thanks, @MainHanzo. #3
  • CSV delimiter sniffer now ; delimited files. #6

sqlite-utils 2.22

sqlite-utils 2.22 adds some minor features—an --encoding option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extensions modules.

Full release notes:

  • New --encoding option for processing CSV and TSV files that use a non-utf-8 encoding, for both the insert and update commands. (#182)
  • The --load-extension option is now available to many more commands. (#137)
  • --load-extension=spatialite can be used to load SpatiaLite from common installation locations, if it is available. (#136)
  • Tests now also run against Python 3.9. (#184)
  • Passing pk=["id"] now has the same effect as passing pk="id". (#181)

Datasette

No new release yet, but I’ve landed some small new features to the main branch.

Inspired by the GitHub and WordPress APIs, Datasette’s JSON API now supports Link: HTTP header pagination (#1014).

This is part of my ongoing effort to redesign the default JSON format ready for Datasette 1.0. I started a new plugin called datasette-json-preview to let me iterate on that format independent of Datasette itself.

Jacob Fenton suggested an “Edit SQL” button on canned queries. That’s a great idea, so I built it—this issue comment links to some demos, e.g. this one here.

I added an “x” button for clearing filters to the table page (#1016) demonstrated by this GIF:

Animation demonstrating the new x button next to filters

TIL this week

Releases this week

This is Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils by Simon Willison, posted on 16th October 2020.

Tagged , , , ,

Next: Weeknotes: incremental improvements

Previous: Building an Evernote to SQLite exporter