Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils
16th October 2020
This week I built evernote-to-sqlite (see Building an Evernote to SQLite exporter), launched the Datasette Weekly newsletter, worked on some scrapers and pushed out some small improvements to several other projects.
The Datasette Weekly newsletter
After procrastinating on it for several months I finally launched the new Datasette Weekly newsletter!
My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.
You can read the first edition here, which covers Datasette 0.50, git scraping, sqlite-utils extract and features datasette-graphql as the plugin of the week.
I’m using Substack because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.
Writing scrapers with a Python+JavaScript hybrid
I’ve been writing some scraper code to help out with a student journalism project at Stanford. I ended up using Selenium Python running in a Jupyter Notebook.
Historically I’ve avoided Selenium due to how weird and complex it has been to use in the past. I’ve now completely changed my mind: these days it’s a really solid option for browser automation driven by Python, thanks to chromedriver and geckodriver, which I recently learned can be installed using Homebrew.
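Getting a driver running takes only a few lines of Python. Here's a minimal sketch, assuming chromedriver is already installed (e.g. via Homebrew) and available on your PATH:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/")
print(driver.title)

driver.quit()
```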
My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of querySelector(), querySelectorAll(), fetch() and the new-to-me DOMParser class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.
My big breakthrough this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium driver.execute_script() and driver.execute_async_script() (TIL) methods make it trivial to execute chunks of JavaScript from Python and get back the results.
This meant I could scrape pages one at a time using JavaScript and save the results directly to SQLite via sqlite-utils. I could even run database queries on the Python side to skip items that had already been scraped.
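Here's a rough sketch of that hybrid pattern (the URLs, CSS selectors and table names are all made up for illustration): the JavaScript uses fetch() and DOMParser to load and parse a listing page, hands its results back to Python through the execute_async_script() callback, and Python writes them to SQLite with sqlite-utils, skipping anything that has already been saved.

```python
import sqlite_utils
from selenium import webdriver

db = sqlite_utils.Database("scraped.db")
driver = webdriver.Chrome()
driver.get("https://example.com/listings")  # gives fetch() a page to run against

# execute_async_script() appends a callback as the final argument -
# call it with whatever data you want handed back to Python
js = """
const done = arguments[arguments.length - 1];
fetch(arguments[0])
    .then(r => r.text())
    .then(html => {
        const doc = new DOMParser().parseFromString(html, "text/html");
        done(Array.from(doc.querySelectorAll("a.listing")).map(a => ({
            url: a.href,
            title: a.textContent.trim(),
        })));
    });
"""

# Ask the database which items we already have, so we only save new ones
already_seen = set()
if db["listings"].exists():
    already_seen = {row["url"] for row in db["listings"].rows}

for page in range(1, 4):
    items = driver.execute_async_script(js, f"/listings?page={page}")
    new_items = [item for item in items if item["url"] not in already_seen]
    if new_items:
        db["listings"].insert_all(new_items, pk="url")
        already_seen.update(item["url"] for item in new_items)
```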
csv-diff 1.0
I’m trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won’t be breaking backwards compatibility without bumping them to 2.0.
I built csv-diff for my San Francisco Trees project last year. It produces human-readable diffs for CSV files.
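If you haven't used it, the Python API looks something like this (the file names and key column here are placeholders; there's also a csv-diff command-line tool):

```python
from csv_diff import load_csv, compare, human_text

diff = compare(
    load_csv(open("trees-yesterday.csv"), key="TreeID"),
    load_csv(open("trees-today.csv"), key="TreeID"),
)
# diff is a dictionary of added, removed and changed rows;
# human_text() renders it as a readable summary
print(human_text(diff))
```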
The version 1.0 release notes are as follows:
- New --show-unchanged option for outputting the unchanged values of rows that had at least one change. #9
- Fix for bug with column names that contained a . character. #7
- Fix for error when no --key provided—thanks, @MainHanzo. #3
- CSV delimiter sniffer now detects ; delimited files. #6
sqlite-utils 2.22
sqlite-utils 2.22 adds some minor features: an --encoding option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extension modules.
Full release notes:
- New --encoding option for processing CSV and TSV files that use a non-utf-8 encoding, for both the insert and update commands. (#182)
- The --load-extension option is now available to many more commands. (#137)
- --load-extension=spatialite can be used to load SpatiaLite from common installation locations, if it is available. (#136)
- Tests now also run against Python 3.9. (#184)
- Passing pk=["id"] now has the same effect as passing pk="id" (see the sketch below). (#181)
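The pk change is easiest to show with a tiny sketch (the table and column names here are invented for the example):

```python
import sqlite_utils

db = sqlite_utils.Database(memory=True)

# These two inserts now produce identical schemas: a single-column primary key
db["dogs_a"].insert({"id": 1, "name": "Cleo"}, pk="id")
db["dogs_b"].insert({"id": 1, "name": "Cleo"}, pk=["id"])

print(db["dogs_a"].schema)
print(db["dogs_b"].schema)
```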
Datasette
No new release yet, but I’ve landed some small new features to the main branch.
Inspired by the GitHub and WordPress APIs, Datasette’s JSON API now supports Link: HTTP header pagination (#1014).
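As with GitHub's API, the header takes the form <url>; rel="next", which means HTTP clients that understand Link headers can follow the pagination for you. Here's a rough sketch using requests against one of the Datasette demo tables (treat the URL as an example):

```python
import requests

url = "https://latest.datasette.io/fixtures/sortable.json"
all_rows = []
while url:
    response = requests.get(url)
    all_rows.extend(response.json()["rows"])
    # requests parses the Link header into response.links;
    # there is no "next" entry on the final page
    next_link = response.links.get("next")
    url = next_link["url"] if next_link else None

print(len(all_rows))
```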
This is part of my ongoing effort to redesign the default JSON format ready for Datasette 1.0. I started a new plugin called datasette-json-preview to let me iterate on that format independent of Datasette itself.
Jacob Fenton suggested an “Edit SQL” button on canned queries. That’s a great idea, so I built it. This issue comment links to some demos, including this one.
I added an “x” button for clearing filters to the table page (#1016), demonstrated by this GIF:
TIL this week
Releases this week
- sqlite-utils 2.22—2020-10-16
- csv-diff 1.0—2020-10-16
- swarm-to-sqlite 0.3.2—2020-10-12
- evernote-to-sqlite 0.2—2020-10-12
- evernote-to-sqlite 0.1—2020-10-11
- xml-analyser 1.0—2020-10-11
- datasette-json-preview 0.1—2020-10-11