Simon Willison’s Weblog

On stackoverflow, dogsheep, glitch, sqlite, weeknotes, ...

 

Recent entries

Weeknotes: The Squirrel Census, Genome SQL query two days ago

This week was mostly about incremental improvements. And squirrels.

The Squirrel Census

Last October a team of 323 volunteer Squirrel Sighters got together to attempt the first ever comprehensive census of the squirrels in New York’s Central Park.

The result was the Central Park Squirrel Census 2019 Report, an absolutely delightful package you can order from their site that includes two beautiful five foot long maps, a written supplemental report and (not kidding) a 45-RPM vinyl audio report.

I bought a copy and everyone I have shown it to has fallen in love with it.

Last week they released the underlying data through the NYC open data portal! Naturally I loaded it into a Datasette instance (running on Glitch, which is also made in New York). You can facet by fur color.

Analyzing my genome with SQL

Natalie and I attended the first half of the last San Francisco Science Hack Day. I made a 44 second pitch asking for help using SQL to analyze my genome, and Dr. Laura Cantino answered my call and helped me figure out how to run some fun queries.

Here’s my favourite example so far:

select rsid, genotype, case genotype
  when 'AA' then 'brown eye color, 80% of the time'
  when 'AG' then 'brown eye color'
  when 'GG' then 'blue eye color, 99% of the time'
end as interpretation from genome where rsid = 'rs12913832'

It works! If I run it against my genome it gets GG (blue eyes), and if I run it against Natalie’s it gets AG (brown).

[Screenshot: running SQL against my genome and Natalie's genome]

twitter-to-sqlite crons for Dogsheep

I upgraded my personal Dogsheep instance to start fetching fresh data via cron. This meant improving twitter-to-sqlite to be better at incremental data fetching. Here’s the relevant section of the resulting crontab:

# Fetch new tweets I have tweeted every 10 minutes
1,11,21,31,41,51 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite user-timeline /home/ubuntu/twitter.db -a /home/ubuntu/auth.json --since

# Fetch most recent 50 tweets I have favourited every 10 minutes
6,16,26,36,46,56 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite favorites /home/ubuntu/twitter.db -a /home/ubuntu/auth.json --stop_after=50

# Fetch new tweets from my home timeline every 5 minutes
2,7,12,17,22,27,32,37,42,47,52,57 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite home-timeline /home/ubuntu/timeline.db -a /home/ubuntu/auth.json --since

That last line is quite fun: I’m now hitting the Twitter home-timeline API every five minutes and storing the resulting tweets in a separate timeline.db database. This equates to creating a searchable permanent database of anything tweeted by any of the 3,600+ accounts I follow.

The main thing I’ve learned from this exercise is that I’m really grateful for Twitter’s algorithmic timeline! Accounts I follow tweeted between 13,000 and 22,000 times a day over the past week (/timeline/tweets?_where=id+in+(select+tweet+from+timeline_tweets)&_facet_date=created_at). Trusting the algorithm to show me the highlights frees me to be liberal in following new accounts.

Playing with lit-html

I have a couple of projects brewing where I’ll be building out a JavaScript UI. I used React for both datasette-vega and owlsnearme and it works OK, but it’s pretty weighty (~40KB gzipped) and requires a complex build step.

For my current projects, I’m willing to sacrifice compatibility with older browsers in exchange for less code and no build step.

I considered going completely library free, but having some kind of templating mechanism for constructing complex DOM elements is useful.

I love JavaScript tagged template literals—they’re like Python’s f-strings but even better, because you can provide your own function that executes against every interpolated value—ideal for implementing Django-style autoescaping.

I’ve used them for my own tiny autoescaping implementation in the past, but I’m getting excited about exploring lit-html, which extends the same technique to cover a lot more functionality while still weighing in at less than 4KB gzipped.

Thanks to Laurence Rowe on Twitter I now have a tidy one-liner for running lit-html in the Firefox developer console, loaded from a module:

let {html, render} = await import('https://unpkg.com/lit-html?module');

// Try it out like so:
render(html`<h1>Hello world</h1>`, document.body);

And here’s a full proof-of-concept script showing data loaded from Datasette via fetch() which is then rendered using lit-html:

<div></div>
<script type="module">
import {html, render} from 'https://unpkg.com/lit-html?module';
const myTemplate = (data) => html`<p>Hello ${JSON.stringify(data)}</p>`;
async function go() {
    let div = document.getElementsByTagName("div")[0];
    var data = await (await fetch("https://latest.datasette.io/fixtures/complex_foreign_keys.json?_labels=on&_shape=array")).json();
    render(myTemplate(data), div);
}
go();
</script>

Everything else

I shipped a new release of Datasette (version 0.30) incorporating some bug fixes and the external contributions I talked about last week.

I started work on a new project which has been brewing for a while. Not ready to talk about it yet but I did buy a domain name (through Google Domains this time, but I got some great advice from Twitter on domain registrars with non-terrible UIs in 2019).

Weeknotes: PG&E outages, and Open Source works! nine days ago

My big focus this week was the PG&E outages project. I’m really pleased with how this turned out: the San Francisco Chronicle used data from it for their excellent PG&E outage interactive (mixing in data on wind conditions) and it earned a bunch of interest on Twitter and some discussion on Hacker News.

I gave a talk about the project on Thursday for the Online News Association Silicon Valley meetup and made connections with a bunch of interesting journalists from around the Bay Area.

Open Source works!

A highlight of last week was that a bunch of my projects gained new functionality through open source contributions!

Dogsheep

I started a very basic website for my Dogsheep personal analytics project.

I also started running various Dogsheep tools via cron on my personal Dogsheep server, to keep that set of databases automatically updated with my latest activity on various services.

Most excitingly, Tobias Kunze built the first independent Dogsheep-style tool: goodreads-to-sqlite!

As the name suggests, it imports your data from Goodreads into a SQLite database. It inspired me to create a Goodreads account which I will be using to track my book reading activity from now on.

Tobias wrote a fantastic blog post introducing the tool which includes some neat example queries and graphs.

In other Dogsheep news, I added an issue-comments command to github-to-sqlite for fetching all issue comments in a repo. My goal is to evolve that tool to the point where it can import all relevant data from all of my repositories and give me a single Datasette-powered dashboard for keeping track of everything in one place.

Tracking PG&E outages by scraping to a git repo 13 days ago

PG&E have cut off power to several million people in northern California, supposedly as a precaution against wildfires.

As it happens, I’ve been scraping and recording PG&E’s outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!

The original data lives in a GitHub repo (more importantly in the commit history of that repo).

Reading JSON in a Git repo isn’t particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with Datasette.

The result is pge-outages.simonwillison.net

The data model: outages and snapshots

The three key tables to understand are outages, snapshots and outage_snapshots.

PG&E assign an outage ID to every outage—where an outage is usually something that affects a few dozen customers. I store these in the outages table.

Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the snapshots table.

The most interesting table is outage_snapshots. Every time I see an outage in the JSON feed, I record a new copy of its data as an outage_snapshot row. This allows me to reconstruct the full history of any outage, in 10 minute increments.
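
The relationships are easiest to see as a schema sketch. This is not the real schema: aside from estCustAffected and the snapshot column, the column names here are assumptions.

import sqlite3

conn = sqlite3.connect("pge-outages.db")
conn.executescript(
    """
    -- One row per PG&E outage ID
    create table outages (
        id integer primary key
    );

    -- One row per scrape of the JSON feed
    create table snapshots (
        id integer primary key,
        captured_at text
    );

    -- One row per outage per snapshot it appeared in
    create table outage_snapshots (
        id integer primary key,
        snapshot integer references snapshots(id),
        outage integer references outages(id),
        estCustAffected integer
        -- ...plus the other columns from the JSON feed
    );
    """
)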

Here are all of the outages that were represented in snapshot 1269—captured at 4:10pm Pacific Time today.

I can run select sum(estCustAffected) from outage_snapshots where snapshot = 1269 (try it here) to count up the total PG&E estimate of the number of affected customers—it’s 545,706!

I’ve installed datasette-vega which means I can render graphs. Here’s my first attempt at a graph showing the number of estimated customers affected over time.

(I don’t know why there’s a dip towards the end of the graph).

I also defined a SQL view which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&E website hasn’t gone down) and renders them using datasette-cluster-map.
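
The view boils down to something like this (a sketch; the real view name and SQL may differ):

import sqlite3

conn = sqlite3.connect("pge-outages.db")
conn.execute(
    """
    create view if not exists most_recent_outages as
    select * from outage_snapshots
    where snapshot = (select max(id) from snapshots)
    """
)
conn.commit()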

Things to be aware of

There are a huge number of unanswered questions about this data. I’ve just been looking at PG&E’s JSON and making guesses about what fields like estCustAffected mean. Without official documentation we can only guess as to how accurate this data is, or how it should be interpreted.

Some things to question:

  • What’s the quality of this data? Does it accurately reflect what’s actually going on out there?
  • What’s the exact meaning of the different columns—estCustAffected, currentEtor, autoEtor, hazardFlag etc?
  • Various columns (lastUpdateTime, currentEtor, autoEtor) appear to be integer unix timestamps. What timezone were they recorded in? Do they include DST etc?

How it works

I originally wrote the scraper back in October 2017 during the North Bay fires, and moved it to run on Circle CI based on my work building a commit history of San Francisco’s trees.

It’s pretty simple: every 10 minutes a Circle CI job runs which scrapes the JSON feed that powers the PG&E website’s outage map.

The JSON is then committed to my pge-outages GitHub repository, overwriting the existing pge-outages.json file. There’s some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.
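
A minimal sketch of that loop (illustrative only: the real feed URL, and the code that writes a more descriptive commit message, are not shown here):

import datetime
import subprocess

import requests

FEED_URL = "https://example.com/pge-outages.json"  # placeholder, not the real feed

def scrape():
    data = requests.get(FEED_URL).text
    with open("pge-outages.json", "w") as fp:
        fp.write(data)
    # git will simply refuse to commit if nothing has changed
    subprocess.run(["git", "add", "pge-outages.json"])
    message = "Scraped at " + datetime.datetime.utcnow().isoformat()
    subprocess.run(["git", "commit", "-m", message])

if __name__ == "__main__":
    scrape()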

Building the Datasette

The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.

After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.

The code that generates the database can be found in build_database.py. I used GitPython to read data from the git repository and my sqlite-utils library to create and update the database.
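
The core trick looks something like this (a heavily simplified sketch, not the actual build_database.py; it assumes the feed is a flat JSON list of outage objects):

import json

import git  # GitPython
import sqlite_utils

repo = git.Repo("pge-outages")
db = sqlite_utils.Database("pge-outages.db")

# Walk the commit history of the single JSON file, oldest first
commits = reversed(list(repo.iter_commits(paths="pge-outages.json")))

for i, commit in enumerate(commits, start=1):
    raw = (commit.tree / "pge-outages.json").data_stream.read()
    outages = json.loads(raw)
    db["snapshots"].insert(
        {"id": i, "captured_at": commit.committed_datetime.isoformat()}, pk="id"
    )
    # One outage_snapshots row per outage seen in this snapshot
    # (the real script also maintains the outages table)
    db["outage_snapshots"].insert_all(
        (dict(outage, snapshot=i) for outage in outages), alter=True
    )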

Deployment

Since this is a large database that changes every ten minutes, I couldn’t use the usual datasette publish trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.

Instead, I’m running it on a VPS instance. I ended up trying out Digital Ocean for this, after an enjoyable Twitter conversation about good options for stateful (as opposed to stateless) hosting.

Next steps

I’m putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology—or take what I’ve done and spin up much more interesting forks of it.

If you build something interesting with this please let me know, via email (swillison is my Gmail) or on Twitter.

Weeknotes: Dogsheep 16 days ago

Having figured out my Stanford schedule, this week I started getting back into the habit of writing some code.

Dogsheep

Dogsheep is the collective name I’ve given to a suite of tools I’m building around the concept of personal analytics.

I generate a lot of data, and while much of it ends up in the silos of the internet giants, thanks to the GDPR most of those silos now feature an “export a copy of your data” button.

Wouldn’t it be cool if you could convert that data into a SQLite database and then use Datasette to run queries against it?

So that’s what I’m doing! The tools I’ve built so far include:

  • healthkit-to-sqlite for my Apple HealthKit data (mostly collected by my Apple Watch).
  • twitter-to-sqlite, by far the most developed tool. For Dogsheep purposes it lets me import my tweets and the tweets I have favourited, but it’s growing all kinds of other useful features for retrieving and analyzing data from Twitter. More on this in my previous weeknotes.
  • swarm-to-sqlite for my Foursquare Swarm checkins.
  • inaturalist-to-sqlite for my iNaturalist observations.
  • google-takeout-to-sqlite for the wealth of data available from Google Takeout. I’ve barely scratched the surface with this one but it does have the ability to export my location history from Google Maps—215,000 latitude/longitude/timestamp records dating back to July 2015 (select min(timestamp) from location_history)!
  • github-to-sqlite for importing my GitHub repositories and the repositories I have starred.
  • pocket-to-sqlite (built over this weekend) for articles I have saved to Pocket.
  • genome-to-sqlite for my 23andMe genome, because being able to run SQL queries against my genetic code is really funny.

My biggest achievement this week is that my own personal Dogsheep is no longer vaporware! I got a Datasette instance running on an AWS Lightsail VPS (using the systemd recipe from this ticket) and started loading in copies of my data.

Securing this instance well is important. In addition to running it behind datasette-auth-github I’ve followed up on a long-term ambition of deploying something protected by client certificates. If your browser doesn’t have the required certificate installed it gets rejected by nginx before it’s even proxied through to Datasette.

I have the certificate successfully installed on my iPhone and my laptop browsers. Setting this all up wasn’t particularly straightforward—I ended up following this excellent tutorial by Nathan Wittstock for the nginx part of it, then randomly Googling for tips on installing the certificate in Firefox and on my iPhone.

I think there’s a great opportunity for tooling that makes this much easier. Adding that to my maybe-someday pile.

datasette-auth-github asset caching

Dogsheep is a great opportunity for eating my own dogfood. I’ve been having a lot of fun using it to explore my data... but I noticed that there was a suspicious delay every time I clicked a link. Datasette is mostly JavaScript-free but usually it responds so fast (~100ms or less) that clicking around feels “app-like”—but here I was really feeling the delay, often of more than a second.

I fired up the Firefox Network inspector and quickly spotted what was happening. I have the datasette-vega plugin installed so I can create charts, and that weighs in at nearly 1MB of JavaScript. According to Firefox my browser was loading that entire 1MB file from scratch for every page load!

The culprit there turned out to be my datasette-auth-github plugin.

As detailed in issue #6, the ASGI middleware adds a cache-control: private header to every HTTP response—to make sure authentication-required content isn’t accidentally cached by any intermediary proxies (I originally designed Datasette to run effectively behind Varnish and Cloudflare).

This was applied to everything... including the static JavaScript and CSS served up by the plugin!

To fix this, I shipped a new release of datasette-auth-github with a cacheable_prefixes setting, which defaults to /-/static/ and /-/static-plugins/ when run as a Datasette plugin. See issue 47 for details.
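
The underlying pattern looks roughly like this (an illustrative ASGI middleware, not the plugin’s actual code): wrap the app, then mark every response as private unless its path falls under a cacheable prefix.

CACHEABLE_PREFIXES = ("/-/static/", "/-/static-plugins/")

def private_cache_middleware(app):
    async def wrapped(scope, receive, send):
        async def wrapped_send(message):
            is_response_start = message["type"] == "http.response.start"
            if is_response_start and not scope["path"].startswith(CACHEABLE_PREFIXES):
                # Tell intermediary caches not to store this response
                message.setdefault("headers", []).append(
                    (b"cache-control", b"private")
                )
            await send(message)
        await app(scope, receive, wrapped_send)
    return wrapped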

Areas of research

Connection pooling

My most significant ongoing project for Datasette right now concerns database connection pooling.

For Datasette Library I want to be able to support potentially hundreds of attached databases, and Datasette Edit calls for the ability to use a writable (as opposed to read-only) connection.

This means I need to get smarter about database connections, and last week I finally made some solid progress on figuring out more advanced connection pooling. Still more work to go but I’m finally feeling unblocked on this after sweating over it for a couple of months without writing any code.

I also added a /-/threads page to Datasette (demo) to make it easier to understand the threads being run by the application.

Recursive CTEs in SQLite

twitter-to-sqlite generates a tweets table with both id and in_reply_to_status_id (nullable) columns.

SQLite supports recursive CTEs, documented here. In theory, this means it should be possible to write SQL queries that analyze Twitter threads—showing me the tweets that are at the bottom of the longest reply chain, for example.

I was feeling lazy and asked if anyone knew how to do that on Twitter. Robin Houston, Matthew Somerville and Piers Cawley all chipped in with useful tips, which I’ve collected in this gist.
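
The basic shape (my own sketch, not necessarily what ended up in the gist) walks up the reply chain from a given tweet. The tweets table and its id and in_reply_to_status_id columns come from twitter-to-sqlite; the starting tweet ID is a placeholder.

import sqlite3

conn = sqlite3.connect("twitter.db")

# Follow in_reply_to_status_id links upwards from a starting tweet
sql = """
with recursive thread(id, in_reply_to_status_id, depth) as (
    select id, in_reply_to_status_id, 0
    from tweets where id = :tweet_id
    union all
    select tweets.id, tweets.in_reply_to_status_id, thread.depth + 1
    from tweets join thread on tweets.id = thread.in_reply_to_status_id
)
select id, depth from thread order by depth
"""

for tweet_id, depth in conn.execute(sql, {"tweet_id": 1234567890}):
    print("  " * depth, tweet_id)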

I particularly enjoyed learning how Piers uses recursive SQL queries to help run his bakery.

logging

I finally made some progress on getting my head around the Python standard library logging module, a long-time nemesis.

The breakthrough was to sit down and actually read the source code—which helped me understand that the entire point of the library is to be a globally configured logging system that makes careful use of threading primitives to ensure you can log without worrying about concurrency.

I’ve been worrying about that aspect of logging for years, so it’s reassuring to see that Vinay Sajip has that comprehensively covered already.
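
The basic shape is: configure handlers and formatting once, globally, at startup, then log from any module or thread without further coordination. A minimal example:

import logging
import threading

# Configure once, globally, at application startup...
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s",
)

# ...then any module can grab a named logger and log from any thread
logger = logging.getLogger(__name__)

def worker(n):
    logger.info("processing item %s", n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()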

Easier installation for Datasette

I want journalists to be able to install Datasette. I don’t want to have to talk them through installing Xcode to install Homebrew to install Python 3 first!

Raffaele Messuti did some great work investigating pyinstaller for this a while ago. I started a conversation on Twitter a few days ago after investigating pex (neat but not quite what I want, since users still have to install Python). From that thread I got some interesting further leads:

  • Freezing your code in The Hitchhiker’s Guide to Python describes the problem space in general.
  • Briefcase is the part of the BeeWare suite that deals with converting a Python project into a standalone native application (for a bunch of different platforms).
  • PyOxidizer is a relatively new tool for building standalone executables that bundle a Python interpreter, on top of the Rust and Cargo build ecosystem.
  • Conda Constructor helps you construct an installer for a group of Conda packages (I need to figure out Conda).

Next step: actively try some of these out and see what I can do with them.

My longer term goal is to package Datasette up with Electron and make it available as a regular Windows and OS X application, for people who aren’t comfortable using the command-line directly.

Books

I’ve been mostly working through required reading for my Stanford courses. Strategic Communication in particular has some excellent textbooks:

Weeknotes: first week of Stanford classes 23 days ago

One of the benefits of the JSK fellowship is that I can take classes and lectures at Stanford, on a somewhat ad-hoc basis (I don’t take exams or earn credits).

With thousands of courses to choose from, figuring out how best to take advantage of this isn’t at all easy—especially since I want to spend a big portion of my time focusing on my fellowship project.

This week was the first week of classes, which Stanford calls “shopping week”—because students are encouraged to try out lots of different things and literally walk out halfway through a lecture if they decide it’s not for them! Feels really rude to me, but apparently that’s how it works here.

For this term I’ve settled on four classes:

  • Strategic Communications, at the Stanford Graduate School of Business. This is an extremely highly regarded course on public speaking and effective written communication. As you might expect from a class on public speaking the lectures themselves have been case studies in how to communicate well. I’ve given dozens of conference talks and I’m already learning a huge amount from this that will help me perform better in the future.
  • Classical Guitar. I’m taking this with three other fellows. It turns out my cheap acoustic guitar (bought on an impulse a couple of years ago from Amazon Prime Now) isn’t the correct instrument for this class (classical guitars are nylon-stringed and a different shape) but the instructor thinks it will be fine for the moment. Great opportunity to do something musical!
  • Biostatistics. I want to firm up my fundamental knowledge of statistics, and I figured learning it from the biology department would be much more interesting than the corresponding maths or computer science classes.
  • Media Innovation. This is a lunchtime series of guest lectures from professionals in different parts of the media industry. As such it doesn’t have much homework (wow, Stanford courses have a lot of homework), which makes it a good fit for my schedule, and the variety of speakers looks to be really informative.

Combined with the JSK afternoon sessions on Monday, Wednesday and Friday I’ll be on campus every weekday, which will hopefully help me build a schedule that incorporates plenty of useful conversations with people about my project, plus actual time to get some code written.

… what with all the shopping for classes, I wrote almost no code at all this week!

I did some experimentation with structlog—I have an unfinished module which can write structlog entries to a SQLite database using sqlite-utils (here’s a Gist)—and I’ve been messing around with Python threads in a Jupyter notebook as part of ongoing research into smarter connection pooling for Datasette. Aside from that I’ve been concentrating on figuring out Stanford.
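
The general shape of that structlog module is easy to sketch: a processor that writes each log event to SQLite via sqlite-utils. This is not the code in the Gist; the database filename, table and columns here are my own invention.

import datetime
import json

import sqlite_utils
import structlog

db = sqlite_utils.Database("logs.db")

def write_to_sqlite(logger, method_name, event_dict):
    # Persist every log event as a row; alter=True adds new columns as needed
    db["log"].insert(
        {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "level": method_name,
            "event": event_dict.get("event"),
            "data": json.dumps(event_dict, default=str),
        },
        alter=True,
    )
    return event_dict

structlog.configure(
    processors=[write_to_sqlite, structlog.processors.JSONRenderer()]
)

log = structlog.get_logger()
log.info("page_view", path="/")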

Books

Stanford classes come with all sorts of required reading, but I’ve also made some progress on Just Enough Research by Erika Hall (mentioned last week). I’m about halfway through and it’s fantastic—really fun to read and packed with useful tips on getting the most out of user interviews and associated techniques. Hopefully I’ll get to start putting it into practice next week!

Weeknotes: Design thinking for journalists, genome-to-sqlite, datasette-atom one month ago

I haven’t had much time for code this week: we’ve had a full five-day workshop at JSK with Tran Ha (a JSK alum) learning how to apply Design Thinking to our fellowship projects and generally to challenges facing journalism.

I’ve used aspects of design thinking in building software products, but I’d never really thought about how it could be applied outside of digital product design. It’s been really interesting—especially seeing the other fellows (who, unlike me, are generally not planning to build software during their fellowship) start to apply it to a much wider and more interesting range of problems.

I’ve been commuting in to Stanford on the Caltrain, which did give me a bit of time to work on some code.

genome-to-sqlite

I’m continuing to build out a family of tools for personal analytics, where my principal goal is to reclaim the data that various internet companies have collected about me and pull it into a local SQLite database so I can analyze, visualize and generally have fun with it.

A few years ago I shared my DNA with 23andMe. I don’t think I’d make the decision to do that today: it’s incredibly personal data, and the horror stories about people making unpleasant discoveries about their family trees keep on building. But since I’ve done it, I decided to see if I could extract out some data…

… and it turns out they let you download your entire genome! You can export it as a zipped up TSV file—mine decompresses to 15MB of data (which feels a little small—I know little about genetics, but I’m presuming that’s because the genome they record and share is just the interesting known genetic markers, not the entire DNA sequence—UPDATE: confirmed, thanks @laurencerowe).

So I wrote a quick utility, genome-to-sqlite, which loads the TSV file (directly from the zip or a file you’ve already extracted) and writes it to a simple SQLite table. Load it into Datasette and you can even facet by chromosome, which is exciting!
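
The core of a tool like that is pretty small. Here is a rough sketch of the approach (genome-to-sqlite itself may differ in the details, and the export filename is a placeholder):

import io
import zipfile

import sqlite_utils

def load_genome(zip_path, db_path="genome.db"):
    # 23andMe exports are a zip wrapping one tab-separated text file:
    # comment lines start with "#", data rows are rsid, chromosome, position, genotype
    db = sqlite_utils.Database(db_path)
    with zipfile.ZipFile(zip_path) as zf:
        lines = io.TextIOWrapper(zf.open(zf.namelist()[0]), encoding="utf-8")
        def records():
            for line in lines:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                rsid, chromosome, position, genotype = line.split("\t")
                yield {
                    "rsid": rsid,
                    "chromosome": chromosome,
                    "position": int(position),
                    "genotype": genotype,
                }
        db["genome"].insert_all(records())

load_genome("23andme-export.zip")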

This is where my knowledge runs out. I’m confident someone with more insight than me could construct some interesting SQL queries against this—maybe one that determines if you are likely to have red hair?—so I’m hoping someone will step in and provide a few examples.

I filed a help wanted issue on GitHub. I also put a request out on Twitter for an UPDATE statement that could turn me into a dinosaur.

datasette-atom

This is very much a work-in-progress right now: datasette-atom will be a Datasette plugin that adds .atom as an output format (using the register_output_renderer plugin hook contributed by Russ Garrett a few months ago).

The aim is to allow people to subscribe to the output of a query in their feed reader (and potentially through that via email and other mechanisms)—particularly important for databases which are being updated over time.

It’s a slightly tricky plugin to design because valid Atom feed entries require a globally unique ID, a title and an “updated” date—and not all SQL queries produce obvious candidates for these values. As such, I’m going to have the plugin prompt the user for those fields and then persist them in the feed URL that you subscribe to.

This also means you won’t be able to generate an Atom feed for a query that doesn’t return at least one datetime column. I think I’m OK with that.

github-to-sqlite

I released one new feature for github-to-sqlite this week: the github-to-sqlite repos github.db command, which populates a database table of all of the repositories available to the authenticated user. Or use github-to-sqlite repos github.db dogsheep to pull the repos owned by a specific user or organization.

The command configures a SQLite full-text search index against the repo titles and descriptions, so if you have a lot of GitHub repos (I somehow have nearly 300!) you can search through them and use Datasette to facet them against different properties.
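
With sqlite-utils, that kind of search setup is only a couple of calls. A sketch (it assumes a github.db already populated by the repos command; the exact column names may differ):

import sqlite_utils

db = sqlite_utils.Database("github.db")

# Configure a SQLite full-text search index across a couple of text columns...
db["repos"].enable_fts(["name", "description"])

# ...then search against it
for row in db["repos"].search("datasette"):
    print(row)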

github-to-sqlite currently has two other useful subcommands: starred fetches details of every repository a user has starred, and issues pulls details of the issues (but sadly not yet their comment threads) attached to a repository.

Books

I’m trying to spend more time reading books—so I’m going to start including book stuff in my weeknotes in the hope of keeping myself on track.

I acquired two new books this week:

  • Just Enough Research by Erika Hall (recommended by Tom Coates and Tran Ha), because I need to spend the next few months interviewing as many journalists (and other project stakeholders) as possible to ensure I am solving the right problems for them.
  • Producing Open Source Software by Karl Fogel, because my fellowship goal is to build a thriving open source ecosystem around tooling for data journalism and this book looks like it covers a lot of the topics I need to really do a good job of that.

Next step: actually read them! Hopefully I’ll have some notes to share in next week’s update.

Elsewhere

21st October 2019

  • Thematic map - GIS Wiki. This is a really useful wiki full of GIS information, and the coverage of different types of thematic maps is particularly thorough. #
  • Setting up Datasette, step by step (via) Tobias describes how he runs Datasette on his own server/VPS, using nginx and systemd. I’m doing something similar for some projects and systemd really does feel like the solution to the “ensure a Python process keeps running” problem I’ve been fighting for over a decade. I really like how Tobias creates a dedicated Linux user for each of his deployed Python projects. #

16th October 2019

  • 2018 Central Park Squirrel Census in Datasette (via) The Squirrel Census project released their data! 3,000 squirrel observations in Central Park, each with fur color and latitude and longitude and behavioral observations. I love this data so much. I’ve loaded it into a Datasette running on Glitch. #

14th October 2019

  • μPlot (via) “An exceptionally fast, tiny time series chart. [...] from a cold start it can create an interactive chart containing 150,000 data points in 40ms. [...] at < 10 KB, it’s likely the smallest and fastest time series plotter that doesn’t make use of WebGL shaders or WASM” #
  • goodreads-to-sqlite (via) This is so cool! Tobias Kunze built a Python CLI tool to import your Goodreads data into a SQLite database, inspired by github-to-sqlite and my various other Dogsheep tools. It’s the first Dogsheep style tool I’ve seen that wasn’t built by me—and Tobias’ write-up includes some neat examples of queries you can run against your Goodreads data. I’ve now started using Goodreads and I’m importing my books into my own private Dogsheep Datasette instance. #

7th October 2019

  • SQL Murder Mystery in Datasette (via) “A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan. 15, 2018 and that it took place in SQL City. Start by retrieving the corresponding crime scene report from the police department’s database.”—Really fun game to help exercise your skills with SQL by the NU Knight Lab. I loaded their SQLite database into Datasette so you can play in your browser. #

6th October 2019

  • twitter-to-sqlite 0.6, with track and follow. I shipped a new release of my twitter-to-sqlite command-line tool this evening. It now includes experimental features for subscribing to the Twitter streaming API: you can track keywords or follow users and matching Tweets will be written to a SQLite database in real-time as they come in through the API. Since Datasette supports mutable databases now you can run Datasette against the database and run queries against the tweets as they are inserted into the tables. #
  • Streamlit: Turn Python Scripts into Beautiful ML Tools (via) A really interesting new tool / application development framework. Streamlit is designed to help machine learning engineers build usable web frontends for their work. It does this by providing a simple, productive Python environment which lets you declaratively build up a sort-of Notebook style interface for your code. It includes the ability to insert a DataFrame, geospatial map rendering, chart or image into the application with a single Python function call. It’s hard to describe how it works, but the tutorial and demo worked really well for me: “pip install streamlit” and then “streamlit hello” to get a full-featured demo in a browser, then you can run through the tutorial to start building a real interactive application in a few dozen lines of code. #

5th October 2019

  • Get your own Pocket OAuth token (via) I hate it when APIs make you jump through extensive hoops just to get an access token for pulling data directly from your own personal account. I’ve been playing with the Pocket API today and it has a pretty complex OAuth flow, so I built a tiny Flask app on Glitch which helps go through the steps to get an API token for your own personal Pocket account. #
  • Client-Side Certificate Authentication with nginx. I’m intrigued by client-side browser certificates, which allow you to lock down a website such that only browsers with a specific certificate installed can access them. They work on both laptops and mobile phones. I followed the steps in this tutorial and managed to get an nginx instance running which only allows connections from my personal laptop and iPhone. #

4th October 2019

  • NGINX: Authentication Based on Subrequest Result (via) TIL about this neat feature of NGINX: you can use the auth_request directive to cause NGINX to make an HTTP subrequest to a separate authentication server for each incoming HTTP request. The authentication server can see the cookies on the incoming request and tell NGINX if it should fulfill the parent request (via a 2xx status code) or if it should be denied (by returning a 401 or 403). This means you can run NGINX as an authenticating proxy in front of any HTTP application and roll your own custom authentication code as a simple webhook-receiving endpoint. #

3rd October 2019

  • SQL queries don't start with SELECT. This is really useful. Understanding that SELECT (and associated window functions) happen after the WHERE, GROUP BY and HAVING helps explain why you can’t filter a query based on the results of a window function for example. #

28th September 2019

  • Microservices are about scaling teams, not scaling tech

    Petrus Theron #

26th September 2019

  • If you’re a little shy at conferences, speaking is The Best way to break the ice. Nobody talks to you before the talk. Everybody wants to talk to you afterwards, largely because they have a way in. As such, public speaking is bizarrely good for introverts.

    Andy Budd #