7th October 2019
Having figured out my Stanford schedule, this week I started getting back into the habit of writing some code.
Dogsheep is the collective name I’ve given to a suite of tools I’m building around the concept of personal analytics.
I generate a lot of data, and while much of it ends up in the silos of the internet giants, thanks to the GDPR most of those silos now feature an “export a copy of your data” button.
Wouldn’t it be cool if you could convert that data into a SQLite database and then use Datasette to run queries against it?
So that’s what I’m doing! The tools I’ve built so far include:
- healthkit-to-sqlite for my Apple HealthKit data (mostly collected by my Apple Watch).
- twitter-to-sqlite, by far the most developed tool. For Dogsheep purposes it lets me import my tweets and the tweets I have favourited, but it’s growing all kinds of other useful features for retrieving and analyzing data from Twitter. More on this in my previous weeknotes.
- swarm-to-sqlite for my Foursquare Swarm checkins.
- inaturalist-to-sqlite for my iNaturalist observations.
- google-takeout-to-sqlite for the wealth of data available from Google Takeout. I’ve barely scratched the surface with this one but it does have the ability to export my location history from Google Maps—215,000 latitude/longitude/timestamp records dating back to
select min(timestamp) from location_historyJuly 2015!
- github-to-sqlite for importing my GitHub repositories and the repositories I have starred.
- pocket-to-sqlite (built over this weekend) for articles I have saved to Pocket.
- genome-to-sqlite for my 23andMe genome, because being able to run SQL queries against my genetic code is really funny.
My biggest achievement this week is that my own personal Dogsheep is no longer vaporware! I got a Datasette instance running on a AWS Lightsail VPS (using the systemd recipe from this ticket) and started loading in copies of my data.
Securing this instance well is important. In addition to running it behind datasette-auth-github I’ve followed up on a long-term ambition of deploying something protected by client certificates. If your browser doesn’t have the required certificate installed it gets rejected by nginx before it’s even proxied through to Datasette.
I have the certificate successfully installed on my iPhone and my laptop browsers. Setting this all up wasn’t particularly straight-forward—I ended up following this excellent tutorial by Nathan Wittstock for the nginx part of it, then randomly Googling for tips on installing the certificate in Firefox and on my iPhone.
I think there’s a great opportunity for tooling that makes this much easier. Adding that to my maybe-someday pile.
datasette-auth-github asset caching
The culprit there turned out to be my datasette-auth-github plugin.
As detailed in issue #6, the ASGI middleware adds a
cache-control: private header to every HTTP response—to make sure authentication-required content isn’t accidentally cached by any intermediary proxies (I originally designed Datasette to run effectively behind Varnish and Cloudflare).
To fix this, I shipped a new release of
datasette-auth-github with a
cacheable_prefixes setting, which defaults to
/-/static-plugins/ when run as a Datasette plugin. See issue 47 for details.
Areas of research
My most significant ongoing project for Datasette right now concerns database connection pooling.
This means I need to get smarter about database connections, and last week I finally made some solid progress on figuring out more advanced connection pooling. Still more work to go but I’m finally feeling unblocked on this after sweating over it for a couple of months without writing any code.
I also added a
/-/threads page to Datasette (demo) to make it easier to understand the threads being run by the application.
Recursive CTEs in SQLite
twitter-to-sqlite generates a
tweets table with both
in_reply_to_status_id (nullable) columns.
SQLite supports recursive CTEs, documented here. In theory, this means it should be possible to write SQL queries that analyze Twitter threads—showing me the tweets that are at the bottom of the longest reply chain, for example.
I particularly enjoyed learning how Piers uses recursive SQL queries to help run his bakery.
I finally made some progress on getting my head around the Python standard library logging module, a long-time nemesis.
The breakthrough was to sit down and actually read the source code—which helped me understand that the entire point of the library is to be a globally configured logging system that makes careful use of threading primitives to ensure you can log without worrying about concurrency.
I’ve been worrying about that aspect of logging for years, so it’s reassuring to see that Vinay Sajip has that comprehensively covered already.
Easier installation for Datasette
I want journalists to be able to install Datasette. I don’t want to have to talk them through installing xcode to install homebrew to install Python 3 first!
Raffaele Messuti did some great work investigating pyinstaller for this a while ago. I started a conversation on Twitter a few days ago after investigating pex (neat but not quite what I want, since users still have to install Python). From that thread I got some interesting further leads:
- Freezing your code in the The Hitchhiker’s Guide to Python describes the problem space in general.
- Briefcase is the part of the BeeWare suite that deals with for converting a Python project into a standalone native application (for a bunch of different platforms).
- PyOxidizer is a relatively new tool for building standalone executables that bundle a Python interpreter, on top of the Rust and Cargo build ecosystem.
- Conda Constructor helps you construct an installer for a group of Conda packages (I need to figure out Conda).
Next step: actively try some of these out and see what I can do with them.
My longer term goal is to package Datasette up with Electron and make it available as a regular Windows and OS X application, for people who aren’t comfortable using the command-line directly.
I’ve been mostly working through required reading for my Stanford courses. Strategic Communication in particular has some excellent text books:
- Speaking Up without Freaking Out: 50 Techniques for Confident and Compelling Presenting by Matthew Abrahams—despite the clickbait listicle title, this book is really good. I’m not a nervous public speaker, but I’m finding all kinds of actionable advice here on giving higher quality talks. And it’s nice and short.
- Resonate: Present Visual Stories that Transform Audiences by Nancy Duarte. This digs deep into the Hero’s journey as a methodology for presenting, but with an interesting twist: it encourages you to think about your audience’s journey and how you can take them from unaware, to skeptical, to tentatively committed, to excited and convinced.
More recent articles
- Interesting ideas in Observable Framework - 3rd March 2024
- Weeknotes: Getting ready for NICAR - 27th February 2024
- The killer app of Gemini Pro 1.5 is video - 21st February 2024
- Weeknotes: a Datasette release, an LLM release and a bunch of new plugins - 9th February 2024
- LLM 0.13: The annotated release notes - 26th January 2024
- Weeknotes: datasette-test, datasette-build, PSF board retreat - 21st January 2024
- Talking about Open Source LLMs on Oxide and Friends - 17th January 2024
- Publish Python packages to PyPI with a python-lib cookiecutter template and GitHub Actions - 16th January 2024
- What I should have said about the term Artificial Intelligence - 9th January 2024