Simon Willison’s Weblog

On sqlite, productmanagement, dogsheep, scraping, sqliteutils, ...

 

Recent entries

Implementing IndieAuth for Datasette one day ago

IndieAuth is a spiritual successor to OpenID, developed and maintained by the IndieWeb community and based on OAuth 2. This weekend I attended IndieWebCamp East Coast and was inspired to try my hand at an implementation. datasette-indieauth is the result, a new plugin which enables IndieAuth logins to a Datasette instance.

Surprisingly this was my first IndieWebCamp—I’ve been adjacent to that community for over a decade, but I’d never made it to one of their in-person events before. Now that everything’s virtual I didn’t even have to travel anywhere, so I finally got to break my streak of non-attendance.

Understanding IndieAuth

The key idea behind IndieAuth is to provide federated login based on URLs. Users enter a URL that they own (e.g. simonwillison.net), and the protocol then derives their identity provider, redirects the user there, waits for them to sign in and get redirected back and then uses tokens passed in the redirect to prove the user’s ownership of the URL and sign them in.

Here’s what that authentication flow looks like, using this demo of the plugin:

Animated demo: starts at an IndieAuth login screen, enters simonwillison.net, gets redirected to another site where clicking the verify button completes the sign-in and redirects back to the original page.

IndieAuth works by scanning the linked page for a <link rel="authorization_endpoint" href="https://indieauth.com/auth"> HTML element which indicates a service that should be redirected to in order to authenticate the user.

I’m using IndieAuth.com for my own site’s authorization endpoint, an identity provider run by IndieAuth spec author Aaron Parecki. IndieAuth.com implements RelMeAuth.

RelMeAuth is a neat hack where the authentication provider can scan the user’s URL for a <link href="https://github.com/simonw" rel="me"> element, confirm that the GitHub profile in question links back to the same page, and then delegate to GitHub authentication for the actual sign-in.

Why implement this for Datasette?

A key goal of Datasette is to reduce the friction involved in publishing data online as much as possible.

The datasette publish command addresses this by providing a single CLI command for publishing a SQLite database to the internet and assigning it a new URL.

datasette publish cloudrun ca-fires.db \
    --service ca-fires \
    --title "Latest fires in California"

This command will create a new Google Cloud Run service, package up the ca-fires.db (created in this talk) along with the Datasette web application, and deploy the resulting site using Google Cloud Run.

It will output a URL that looks like this: https://ca-fires-j7hipcg4aq-uc.a.run.app

Datasette is unauthenticated by default—anyone can view the published data. If you want to add authentication you can do so using a plugin, for example datasette-auth-passwords.

Authentication without passwords is better. The datasette-auth-github plugin implements single-sign-on against the GitHub API, but comes with a slight disadvantage: you need to register and configure your application with GitHub in order to configure things like the redirect URL needed for authentication.

For most applications this isn’t a problem, but when you’re deploying dozens or potentially hundreds of applications with Datasette—each with initially unpredictable URLs—this can add quite a bit of friction.

The joy of IndieAuth (and OpenID before it) is that there’s no centralized authority to register with. You can deploy an application to any URL, install the datasette-indieauth plugin and users can start authenticating with your site.

Even better... IndieAuth means you can grant people permission to access a site without them needing to create an account, provided they have their own domain with IndieAuth setup.

I took advantage of that in the design of datasette-indieauth. Say you want to publish a Datasette that only I can access—you can do that using the restrict_access plugin configuration setting like so:

datasette publish cloudrun simon-only.db \
  --service simon-only \
  --title "For Simon's eye only" \
  --install datasette-indieauth \
  --plugin-secret datasette-indieauth \
    restrict_access https://simonwillison.net/

The resulting Datasette instance will require the user to authenticate in order to view it—and will only allow access to the user who can use IndieAuth to prove that they are the owner of simonwillison.net.

Next steps

There are two sides to the IndieAuth specification: client sites that allow sign-in with IndieAuth, and authorization providers that handle that authentication.

datasette-indieauth currently acts as a client, allowing sign-in with IndieAuth.

I’m considering extending the plugin to act as an authorization provider as well. This is a bit more challenging as authentication providers need to maintain some small aspects of session state, but it would be good for the IndieAuth ecosystem for there to be more providers. The most widely used provider at the moment is the excellent IndieAuth WordPress plugin, which I used while testing my Datasette plugin and really was just a one-click install from the WordPress plugin directory.

datasette-indieauth has 100% test coverage, and I wrote the bulk of the logic in a standalone utils.py module which could potentially be extracted out of the plugin and used to implement IndieAuth in Python against other frameworks. A Django IndieAuth provider is another potential project, which could integrate directly with my Django blog.

Addendum: what about OpenID?

Fom 2006 to 2010 I was a passionate advocate for OpenID. It was clear to me that passwords were an increasingly unpleasant barrier to secure usage of the web, and that some form of federated sign-in was inevitable. I was terrified that Microsoft Passport would take over all authentication on the web!

With hindsight that’s not quite what happened: for a while it looked like Facebook would win instead, but today it seems to be a fairly even balance between Facebook, Google, community-specific authentication providers like GitHub and Apple’s iPhone-monopoly-enforced Sign in with Apple.

OpenID as an open standard didn’t really make it. The specification grew in complicated new directions (Yadis, XRDS, i-names, OpenID Connect, OpenID 2.0) and it never quite overcame the usability hurdle of users having to understand URLs as identifiers.

IndieAuth is a much simpler specification, based on lessons learned from OAuth. I’m still worried about URLs as identifiers, but helping people reclaim their online presence and understand those concepts is core to what the IndieWeb movement is all about.

IndieAuth also has some clever additional tricks up its sleeve. My favourite is that IndieAuth can return an identifier for the user that’s different from the one they typed in the box. This means that if a top-level domain with many users supports IndieAuth, each user can learn to just type example.com in (or click a branded button) to start the authentication flow—they’ll be signed in as example.com/users/simonw based on who they authenticated as. This feels like an enormous usability improvement to me, and one that could really help avoid users having to remember their own profile URLs.

OpenID was trying to solve authentication for every user of the internet. IndieAuth is less ambitious—if it only takes off with the subset of people who embrace the IndieWeb movement I think that’s OK.

The datasette-indieauth project is yet another example of the benefit of having a plugin ecosystem around Datasette: I can add support for technologies like IndieAuth without baking them into Datasette’s core, which almost eliminates the risk to the integrity of the larger project of trying out something new.

Personal Data Warehouses: Reclaiming Your Data six days ago

I gave a talk yesterday about personal data warehouses for GitHub’s OCTO Speaker Series, focusing on my Datasette and Dogsheep projects. The video of the talk is now available, and I’m presenting that here along with an annotated summary of the talk, including links to demos and further information.

There’s a short technical glitch with the screen sharing in the first couple of minutes of the talk—I’ve added screenshots to the notes which show what you would have seen if my screen had been correctly shared.

Weeknotes: sqlite-utils 3.0 alpha, Git scraping in the zeitgeist 13 days ago

Natalie and I decided to escape San Francisco for election week, and have been holed up in Fort Bragg on the Northern California coast. I’ve mostly been on vacation, but I did find time to make some significant changes to sqlite-utils. Plus notes on an exciting Git scraping project.

Better search in the sqlite-utils 3.0 alpha

I practice semantic versioning with sqlite-utils, which means it only gets a major version bump if I break backwards compatibility in some way.

My goal is to avoid breaking backwards compatibility as much as possible, and I was proud to have made it all the way to version 2.23 representing 23 new feature releases since the 2.0 release without breaking any documented features!

Sadly this run has come to an end: I realized that the table.search() method was poorly designed, and I also needed to grab back the -c command-line option (a shortcut for --csv output) to be used for another purpose.

The chances that either of these changes will break anyone are pretty small, but semantic versioning dictates a major version bump so here we are.

I shipped a 3.0 alpha today, which should hopefully become a stable release very shortly (milestone here).

The big new feature is sqlite-utils search—a command-line tool for executing searches against a full-text search enabled table:

$ sqlite-utils search 24ways-fts4.db articles maps -c title
[{"rowid": 163, "title": "Get To Grips with Slippy Maps", "rank": -10.028754920576421},
 {"rowid": 220, "title": "Finding Your Way with Static Maps", "rank": -9.952534352591737},
 {"rowid": 27, "title": "Putting Design on the Map", "rank": -5.667327088267961},
 {"rowid": 168, "title": "Unobtrusively Mapping Microformats with jQuery", "rank": -4.662224207228984},

Here’s full documentation for the new command.

Notably, this command works against both FTS4 and FTS5 tables in SQLite—despite FTS4 not shipping with a built-in ranking function. I’m using my sqlite-fts4 package for this, which I described back in January 2019 in Exploring search relevance algorithms with SQLite.

Git scraping to predict the election

It’s not quite over yet but the end is in sight, and one of the best tools to track the late arriving vote counts is this Election 2020 results site built by Alex Gaynor and a growing cohort of contributors.

The site is a beautiful example of Git scraping in action, and I’m thrilled that it links to my article in the README!

Take a look at the repo to see how it works. Short version: this GitHub Action workflow grabs the latest snapshot of this undocumented New York Times JSON API once every five minutes and commits it to the repository. It then runs this Python script which iterates through the Git history and generates an HTML summary showing the different batches of new votes that were reported and their impact on the overall race.

The resulting report is published to GitHub pages—resulting in a site that can handle a great deal of traffic and is updated entirely by code running in scheduled actions.

Screenshot of the generated report

This is a perfect use-case for Git scraping: it takes a JSON endpoint that represents the current state of the world and turns it into a sequence of historic snapshots, then uses those snapshots to build a unique and useful new source of information to help people understand what’s going on.

Releases this week

Datasette 0.51 (plus weeknotes) 19 days ago

I shipped Datasette 0.51 today, with a new visual design, plugin hooks for adding navigation options, better handling of binary data, URL building utility methods and better support for running Datasette behind a proxy. It’s a lot of stuff! Here are the annotated release notes.

New visual design

Datasette is no longer white and grey with blue and purple links! Natalie Downe has been working on a visual refresh, the first iteration of which is included in this release. (#1056)

It’s about time Datasette grew beyond its clearly-designed-by-a-mostly-backend-engineer roots. Natalie has been helping me start adding some visual polish: we’ve started with an update to the colour scheme and will be continuing to iterate on the visual design as the project evolves towards the 1.0 release.

The new design makes the navigation bar much more obvious, which is important for this release since the new navigation menu (tucked away behind a three-bar icon) is a key new feature.

A screenshot of the new Datasette visual design, also showcasing two new menu icons

A number of existing Datasette plugins add new pages to the Datasette interface, providig tools for things like uploading CSVs, editing table schemas or configuring full-text search.

Plugins like this can now link to themselves from other parts of Datasette interface. The menu_links(datasette, actor) hook (#1064) lets plugins add links to Datasette’s new top-right application menu, and the table_actions(datasette, actor, database, table) hook (#1066) adds links to a new “table actions” menu on the table page.

This feature has been a long time coming. I’ve been writing an increasing number of plugins that add new pages to Datasette, and so far the main way of using them has been to memorise and type in their URLs!

The new navigation menu (which only displays if it has something in it) provides a global location to add new links. I’ve already released several plugin updates that take advantage of this.

The new “table actions” menu imitates Datasette’s existing column header menu icon—it’s a cog. Clicking it opens a menu of actions relating to the current table.

Want to see a demo?

The demo at latest.datasette.io now includes some example plugins. To see the new table actions menu first sign into that demo as root and then visit the facetable table to see the new cog icon menu at the top of the page.

Here’s an animated GIF demo showing the new menus in action.

Using the nav menu to upload CSVs, then the table actions menu to edit the table schema and rename latitude and longitude columns so that the Central Park Squirrel Census is displayed on a map!

Binary data

SQLite tables can contain binary data in BLOB columns. Datasette now provides links for users to download this data directly from Datasette, and uses those links to make binary data available from CSV exports. See Binary data for more details. (#1036 and #1034).

I spent a ton of time on this over the past few weeks. The initial impetus was a realization that Datasette CSV exports included ugly Python b'\x15\x1c\x02\xc7\xad\x05\xfe' strings, which felt like the worst possible way to display binary in a CSV file, out of universally bad options.

Datasette’s main interface punted on binary entirely—it would show a <Binary data: 7 bytes> label which didn’t help much either.

The only way to get at binary data stored in a Datasette instance was to request the JSON version and then manually decode the Base-64 value within it!

This is now fixed: binary columns can be downloaded directly to your computer, using a new .blob output renderer. The approach is described on this new page in the documentation.

Security was a major consideration when building this feature. Allowing the download of arbitrary byte payloads from a web server is dangerous business: it can easily result in XSS holes where HTML with dangerous <script> content can end up hosted on the primary domain.

After some research, I decided to serve up binary content for download using the following headings:

content-type: application/binary
x-content-type-options: nosniff
content-disposition: attachment; filename="data-f30889.blob"

application/binary is a safer Content-Type option than the more common application/octet-stream, according to Michal Zalewski’s renowned web application security book The Tangled Web (quoted here)

x-content-type-options: nosniff disables the XSS-tastic content sniffing feature in older versions of Internet Explorer, where IE would helpfully guess that you intended to serve HTML based on the first few bytes of the response.

The content-disposition: attachment header causes the browser to show a “download this file” dialog, using the suggested filename.

If you know of a reason that this isn’t secure enough, please let me know!

URL building

The new datasette.urls family of methods can be used to generate URLs to key pages within the Datasette interface, both within custom templates and Datasette plugins. See Building URLs within plugins for more details. (#904)

Datasette’s base_url configuration setting was the forcing factor around this piece of work.

It allows you to configure Datasette to serve content starting at a path other than /—for example:

datasette --config base_url:/path-to-datasette/

This will serve all Datasette pages at locations starting with /path-to-datasette/.

Why would you want to do this? It’s useful if you are proxying traffic to Datasette from within the URL hierarchy of an existing website.

The feature didn’t work properly, and enough people care about it that I had a steady stream of bug reports. For 0.51 I gathered them all into a single giant tracking issue and worked through them all one by one.

It quickly became apparent that the key challenge was building URLs within Datasette—not just within HTML template pages, but also for things like HTTP redirects.

Datasette itself needed to generate URLs that took the base_url setting into account, but so do Datasette plugins. So I built a new datasette.urls collection of helper methods and made them part of the documented internals API for plugins. The Building URLs within plugins documentation shows how these should be used.

I also added documentation on Running Datasette behind a proxy with example configs (tested on my laptop) for both nginx and Apache.

The datasette.client mechanism from Datasette 0.50 allows plugins to make calls to Datasette’s internal JSON API without the overhead of an HTTP request. This is another place where plugins need to be able to construct valid URLs to internal Datasette pages.

I added this example to the documentation showing how the two features can work together:

table_json = (
    await datasette.client.get(
        datasette.urls.table("fixtures", "facetable", format="json")
    )
).json()

One final weird detail on this: Datasette now has various methods that automatically add the base_url prefix to a URL. I got worried about what would happen if these were applied more than once (as above, where datasette.urls.table() applies the prefix so does datasette.client.get()).

I fixed this using the same trick that Django and Jinja use to avoid appliying auto-escaping twice to content that will be displayed in HTML: the datasette.urls methods actually return a PrefixedUrlString object which is a subclass of str that knows that the prefix has been applied! Code for that lives here.

Smaller changes

A few highlights from the “smaller changes” in Datasette 0.51:

  • Wide tables shown within Datasette now scroll horizontally (#998). This is achieved using a new <div class="table-wrapper"> element which may impact the implementation of some plugins (for example this change to datasette-cluster-map).

I think this is a big improvement: if your database table is too wide, it now scrolls horizontally on the page (rather than blowing the entire page out to a wider width). You can see that in action on the global-power-plants demo.

If you are signed in as root the new navigation menu links to a whole plethora of previously-undiscoverable Datasette debugging tools. This new permission controls the display of those items.

  • Link: HTTP header pagination. (#1014)

Inspired by GitHub and WordPress, which both use the HTTP Link header in this way. It’s an optional extra though: Datasette will always offer in-JSON pagination information.

  • Edit SQL button on canned queries, (#1019)

Suggested by Jacob Fenton in this issue. The implementation had quite a few edge cases since there are certain categories of canned query that can’t be executed as custom SQL by the user. See the issue comments for details and a demo.

  • --load-extension=spatialite shortcut. (#1028)

Inspired by a similar feature in sqlite-utils.

  • datasette -o option now opens the most relevant page. (#976)

This is a fun little feature. If your Datasette only loads a single database, and that database only has a single table (common if you’ve just run a single CSV import) then running this will open your browser directly to that table page:

datasette data.db -o
  • datasette --cors option now enables access to /database.db downloads. (#1057)

This was inspired by Mike Bostock’s Observable Notebook that uses the Emscripten-compiled JavaScript version of SQLite to run queries against SQLite database files.

It turned out you couldn’t use that notebook against SQLite files hosted in Datasette because they weren’t covered by Datasette’s CORS option. Now they are!

Recommendations for plugin authors, inspired by a question from David Kane on Twitter. David has been building datasette-reconcile, a Datasette plugin that offers a reconciliation API endpoint that can be used with OpenRefine. What a brilliant idea!

datasette-edit-templates (almost)

Inspired by a conversation with Jesse Vincent, I also spent some time experimenting with the idea of a plugin that can load and edit templates from the database—which would turn a personal Datasette into a really fun interface hacking environment. I nearly got this working, and even shipped a preview of a load_template() plugin hook in the Datasette 0.51a2 alpha... before crashing into a road block when I realized that it also needed to work with Jinja’s {% extends %} and {% include %} template tags and loaders for those don’t currenty support async functions.

In exploring this I also realized that my load_template() plugin hook wasn’t actually necessary—if I’m going to solve this problem with Jinja loaders I can do so using the existing prepare_jinja2_environment(env) hook.

My not-yet-functional prototype for this is caled datasette-edit-templates. I’m pretty confident I can get it working against the old plugin hook with a little more work.

Other weeknotes

Most of my time this week was spent on Datasette 0.51—but I did find a little bit of time for other projects.

I finished recording my talk for PyCon Argentina. It will air on November 20th.

sqlite-utils 2.23 is out, with a .m2m() bug fix from Adam Wolf and the new ability to display progress bars when importing TSV and CSV files.

Releases this week

Several of these are updates to take advantage of the new navigation plugin hooks introduced in Datasette 0.51.

TIL this week

Weeknotes: incremental improvements 27 days ago

I’ve been writing my talk for PyCon Argentina this week, which has proved surprisingly time consuming. I hope to have that wrapped up soon—I’m pre-recording it, which it turns out is much more work than preparing a talk to stream live.

I’ve made bits and pieces of progress on a whole bunch of different projects. Here are my notes on Datasette, plus an annotated version of my other releases-this-week.

Datasette 0.51a0

Datasette’s base_url configuration option is designed to help run Datasette behind a proxy—so you can configure Apache or nginx to proxy /my-datasette/ to a Datasette instance and have every internal link work correctly.

It doesn’t completely work. I gathered all of the bugs with it in a tracking issue, addressed as many of them as I could and released Datasette 0.51a0 as a testing alpha.

if you run Datasette behind a proxy please try out this new alpha and tell me if it works for you! Testing help is requested here.

Also in the alpha:

  • New datasette.urls URL builder for plugins, see Building URLs within plugins. (#904)
  • Removed --debug option, which didn’t do anything. (#814)
  • Link: HTTP header pagination. (#1014)
  • x button for clearing filters. (#1016)
  • Edit SQL button on canned queries, (#1019)
  • --load-extension=spatialite shortcut. (#1028)

Other releases this week

sphinx-to-sqlite 0.1a1 and 0.1a

One of the features I’m planning for the official Datasette website is combined search across issues, commits, releases, plugins and documentation—powered by my Dogsheep Beta search engine.

This means I nead to load Datasette’s documentation into a SQLite database. sphinx-to-sqlite is my new tool for doing that: it uses the optional XML output from Sphinx to create a SQLite table populated with sections from the documentation, since these seem like the right unit for executing search against.

I’m now using this to build a Datasette instance at latest-docs.datasette.io with the latest documentation on every commit.

datasette-cluster-map 0.14 and 0.13

The default marker popup for datasette-cluster-map is finally a human readable window, not a blob of JSON! You can see that in action on the global-power-plants demo.

inaturalist-to-sqlite 0.2.1, pocket-to-sqlite 0.2.1

I tried out the new PyPI resolver and found that it is a lot less tolerant of ~= v.s. >= dependencies, so I pushed out new releases of these two packages.

datasette-json-preview 0.2

I’m using this plugin to preview the new default JSON representation I’m planning for Datasette 1.0. Carl Johnson provided some useful feedback leading to this new iteration, which now looks like this.

github-to-sqlite 2.7

Quoting the release notes:

github-to-sqlite repos command now takes options --readme and --readme-html, which write the README or rendered HTML README into the readme or readme_html columns, respectively. #52

Another feature I need for the Datasette website search engine, described above.

dogsheep-beta 0.9

My personal search engine, described in the latest Datasette Weekly newsletter. This release added facet by date, part of ongoing work on a timeline view.

I also updated it to take advantage of the datasette.client internal API mechanism introduced in Datasette 0.50.

healthkit-to-sqlite 1.0

Another project bumped to 1.0. Only a small bug fix here: this can now import Apple HealthKit data from devices that use languages other than English.

TIL this week

I finally upgraded this blog to show recently added “Elsewhere” content (bookmarks and quotations) interspersed with my main entries in mobile view. I worte up this TIL to explain what I did.

Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils one month ago

This week I built evernote-to-sqlite (see Building an Evernote to SQLite exporter), launched the Datasette Weekly newsletter, worked on some scrapers and pushed out some small improvements to several other projects.

The Datasette Weekly newsletter

After procrastinating on it for several months I finally launched the new Datasette Weekly newsletter!

My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.

You can read the first edition here, which covers Datasette 0.50, git scraping, sqlite-utils extract and features datasette-graphql as the plugin of the week.

I’m using Substack because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.

Writing scrapers with a Python+JavaScript hybrid

I’ve been writing some scraper code to help out with a student journalism project at Stanford. I ended up using Selenium Python running in a Jupyter Notebook.

Historically I’ve avoided Selenium due to how weird and complex it has been to use in the past. I’ve now completely changed my mind: these days it’s a really solid option for browser automation driven by Python thanks to chromedriver and geckodriver, which I recently learned can be installed using Homebrow.

My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of querySelector(), querySelectorAll(), fetch() and the new-to-me DOMParser class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.

My big break-through this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium driver.execute_script() and driver.execute_async_script() (TIL) methods make it trivial to feed execute chunks of JavaScript from Python and get back the results.

This meant I could scrape pages one at time using JavaScript and save the results directly to SQLite via sqlite-utils. I could even run database queries on the Python side to skip items that had already been scraped.

csv-diff 1.0

I’m trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won’t be breaking backwards compatibility without bumping them to 2.0.

I built csv-diff for my San Francisco Trees project last year. It produces human-readable diffs for CSV files.

The version 1.0 release notes are as follows:

  • New --show-unchanged option for outputting the unchanged values of rows that had at least one change. #9
  • Fix for bug with column names that contained a . character. #7
  • Fix for error when no --key provided—thanks, @MainHanzo. #3
  • CSV delimiter sniffer now ; delimited files. #6

sqlite-utils 2.22

sqlite-utils 2.22 adds some minor features—an --encoding option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extensions modules.

Full release notes:

  • New --encoding option for processing CSV and TSV files that use a non-utf-8 encoding, for both the insert and update commands. (#182)
  • The --load-extension option is now available to many more commands. (#137)
  • --load-extension=spatialite can be used to load SpatiaLite from common installation locations, if it is available. (#136)
  • Tests now also run against Python 3.9. (#184)
  • Passing pk=["id"] now has the same effect as passing pk="id". (#181)

Datasette

No new release yet, but I’ve landed some small new features to the main branch.

Inspired by the GitHub and WordPress APIs, Datasette’s JSON API now supports Link: HTTP header pagination (#1014).

This is part of my ongoing effort to redesign the default JSON format ready for Datasette 1.0. I started a new plugin called datasette-json-preview to let me iterate on that format independent of Datasette itself.

Jacob Fenton suggested an “Edit SQL” button on canned queries. That’s a great idea, so I built it—this issue comment links to some demos, e.g. this one here.

I added an “x” button for clearing filters to the table page (#1016) demonstrated by this GIF:

Animation demonstrating the new x button next to filters

TIL this week

Releases this week

Elsewhere

Yesterday

17th November 2020

  • Amstelvar (via) A real showcase of what variable fonts can do: this open source font by David Berlow has 17 different variables controlling many different aspects of the font. #17th November 2020, 3:24 pm

16th November 2020

  • Ok Google: please publish your DKIM secret keys (via) The DKIM standard allows email providers such as Gmail to include cryptographic headers that protect against spoofing, proving that an email was sent by a specific host and has not been tampered with. But it has an unintended side effect: if someone’s email is leaked (as happened to John Podesta in 2016) DKIM headers can be used to prove the validity of the leaked emails. This makes DKIM an enabling factor for blackmail and other security breach related crimes. Matthew Green proposes a neat solution: providers like Gmail should rotate their DKIM keys frequently and publish the PRIVATE key after rotation. By enabling spoofing of past email headers they would provide deniability for victims of leaks, fixing this unintended consequence of the DKIM standard. #16th November 2020, 10:02 pm

15th November 2020

  • CoronaFaceImpact (via) Variable fonts are fonts that can be customized by passing in additional parameters, which is done in CSS using the font-variation-settings property. Here’s a ​variable font that shows multiple effects of Covid-19 lockdown on a bearded face, created by Friedrich Althausen. #15th November 2020, 10:41 pm

14th November 2020

  • The Cleanest Trick for Autogrowing Textareas (via) This is a very clever trick. Textarea content is mirrored into a data attribute using a JavaScript one-liner, then a visibility: hidden ::after element clones that content using content: attr(data-replicated-value). The hidden element exists in a CSS grid with the textarea which allows the textarea to resize within the grid when the hidden element increases its height. #14th November 2020, 5:24 am
  • Hunting for Malicious Packages on PyPI (via) Jordan Wright installed all 268,000 Python packages from PyPI in containers, and ran Sysdig to capture syscalls made during installation to see if any of them were making extra network calls or reading or writing from the filesystem. Absolutely brilliant piece of security engineering and research. #14th November 2020, 4:48 am

12th November 2020

  • Intent to Remove: HTTP/2 and gQUIC server push (via) The Chrome / Blink team announce their intent to remove HTTP/2 server push support, where servers can start pushing an asset to a client before it has been requested. It’s been in browsers for over five years now and adoption is terrible. “Over the past 28 days [...] 99.97% of connections never received a pushed stream that got matched with a request [...] These numbers are exactly the same as in June 2019”. Datasette serves redirects with Link: preload headers that cause smart proxies (like Cloudflare) to push the redirected page to the client along with the redirect, but I don’t exepect to miss that optimization if it quietly stops working. #12th November 2020, 1:44 am

6th November 2020

  • nyt-2020-election-scraper. Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot if the world right now” data source and turn it into a git revision history over time. #6th November 2020, 2:24 pm

5th November 2020

  • Learning from Mini Apps (via) WeChat, Baidu, Alipay and Douyin in China are all examples of “Super apps” that can host “Mini apps” written in HTML and JavaScript by other developers and installed via in-app search or through scanning a QR code. Mini apps are granted (permission-gated) access to further system APIs via a JavaScript bridge. It’s a fascinating developer ecosystem, explored in detail here by Thomas Steiner. #5th November 2020, 5:21 pm
  • CSVs: The good, the bad, and the ugly (via) Useful, thoughtful summary of the pros and cons of the most common format for interchanging data. #5th November 2020, 5:19 pm

2nd November 2020

  • selenium-wire. Really useful scraping tool: enhances the Python Selenium bindings to run against a proxy which then allows Python scraping code to look at captured requests—great for if a site you are working with triggers Ajax requests and you want to extract data from the raw JSON that came back. #2nd November 2020, 6:58 pm

29th October 2020

  • Defining Data Intuition. Ryan T. Harter, Principal Data Scientist at Mozilla defines data intuition as “a resilience to misleading data and analyses”. He also introduces the term “data-stink” as a similar term to “code smell”, where your intuition should lead you to distrust analysis that exhibits certain characteristics without first digging in further. I strongly believe that data reports should include a link the raw methodology and numbers to ensure they can be more easily vetted—so that data-stink can be investigated with the least amount of resistance. #29th October 2020, 3:14 pm
  • Seniors generally report having more trust in the people around them, a characteristic that may make them more credulous of information that comes from friends and family. There is also the issue of context: Misinformation appears in a stream that also includes baby pictures, recipes and career updates. Users may not expect to toggle between light socializing and heavy truth-assessing when they’re looking at their phone for a few minutes in line at the grocery store.

    Michael Hobbes # 29th October 2020, 3:06 pm