Simon Willison’s Weblog

On weeknotes, dogsheep, datasette, management, alexgaynor, ...

 

Recent entries

datasette-ripgrep: deploy a regular expression search engine for your source code seven days ago

This week I built datasette-ripgrep—a web application for running regular expression searches against source code, built on top of the amazing ripgrep command-line tool.

datasette-ripgrep demo

I’ve deployed a demo version of the application here:

ripgrep.datasette.io/-/ripgrep?pattern=pytest

The demo runs searches against the source code of every one of my GitHub repositories that start with datasette (61 repos right now), so it should include all of my Datasette plugins plus the core Datasette repository itself.

Since it’s running on top of ripgrep, it supports regular expressions. This is absurdly useful. Some examples:
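For instance, a hypothetical query for the pattern def\s+insert would find function definitions whose names start with insert:

ripgrep.datasette.io/-/ripgrep?pattern=def%5Cs%2Binsert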

I usually run ripgrep as rg on the command-line, or use it within Visual Studio Code (fun fact: the reason VS Code’s “Find in Files” is so good is that it’s running ripgrep under the hood).

So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.

A screenshot of datasette-ripgrep in action

Why build this?

There are plenty of great existing code search tools out there already: I’ve heard great things about livegrep, and a quick Google search shows a bunch of other options.

Aside from being a fun project, datasette-ripgrep has one key advantage: it gets to benefit from Datasette’s publishing mechanism, which means it’s really easy to deploy.

That ripgrep.datasette.io demo is deployed by checking out the source code to be searched into a directory called all and then using the following command:

datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep

all is a folder containing the source code to be searched. metadata.json contains this:

{
    "plugins": {
        "datasette-ripgrep": {
            "path": "/app/all",
            "time_limit": 3.0
        }
    }
}

That’s all there is to it! The result is a deployed code search engine, running on Google Cloud Run.

(If you want to try this yourself you’ll need to be using the just-released Datasette 0.52.)

The GitHub Action workflow that deploys the demo also uses my github-to-sqlite tool to fetch my repos and then shallow-clones the ones that begin with datasette.

If you have your own Google Cloud Run credentials, you can run your own copy of that workflow against your own repositories.

A different kind of Datasette plugin

Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.

datasette-ripgrep is different: it makes no use of SQLite at all, but instead takes advantage of Datasette’s URL routing, datasette publish deployments and permissions system.

The plugin implementation is currently 134 lines of code, excluding tests and templates.

While the plugin doesn’t use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.

As such, it’s extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.

This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I’ll probably set my demo to do this on a daily basis.

Controlling processes from asyncio

The trickiest part of the implementation was figuring out how to use Python’s asyncio.create_subprocess_exec() function to safely run the rg process in response to incoming requests.

I don’t want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches have a second to run, after which the rg process will be terminated and only results received so far will be returned. This is achieved using the asyncio.wait_for() function.

I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.

Both of these limits can be customized using plugin settings (documented in the README). You can see how they are implemented in the async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000) function.
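Stripped of the plugin plumbing, the shape of that function is roughly this (a simplified sketch for illustration, not the plugin’s actual code):

import asyncio


async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000):
    # Launch rg as a subprocess and capture its output
    proc = await asyncio.create_subprocess_exec(
        "rg", "--line-number", pattern, path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    results = []

    async def read_output():
        # Collect matching lines until rg finishes or we hit max_lines
        while len(results) < max_lines:
            line = await proc.stdout.readline()
            if not line:
                break
            results.append(line.decode("utf-8", "replace"))

    try:
        # Give the search at most time_limit seconds to produce output
        await asyncio.wait_for(read_output(), timeout=time_limit)
    except asyncio.TimeoutError:
        pass

    # If rg is still running (timed out or hit the line limit), stop it
    if proc.returncode is None:
        try:
            proc.terminate()
        except ProcessLookupError:
            pass
    return results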

Highlighted linkable line numbers

The other fun implementation detail is the way the source code listings are displayed. I’m using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by this article by Sylvain Durand).

code:before {
    content: attr(data-line);
    display: inline-block;
    width: 3.5ch;
    -webkit-user-select: none;
    color: #666;
}

The HTML looks like this:

<pre><code id="L1" data-line="1">from setuptools import setup</code>
<code id="L2" data-line="2">import os</code>
<code id="L3" data-line="3">&nbsp;</code>
<code id="L4" data-line="4">VERSION = &#34;0.1&#34;</code>
...

I wanted to imitate GitHub’s handling of line links, where adding #L23 to the URL both jumps to that line and causes the line to be highlighted. Here’s a demo of that—I use the following JavaScript to update the contents of a <style id="highlightStyle"></style> element in the document head any time the URL fragment changes:

<script>
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() {
    if (/^#L\d+$/.exec(location.hash)) {
        highlightStyle.innerText = `${location.hash} { background-color: yellow; }`;
    }
}
highlightLineFromFragment();
window.addEventListener("hashchange", highlightLineFromFragment);
</script>

It’s the simplest way I could think of to achieve this effect.

Update 28th November 2020: Louis Lévêque on Twitter suggested using the CSS :target selector instead, which is indeed MUCH simpler—I deleted the above JavaScript and replaced it with this CSS:

:target {
    background-color: #FFFF99;
}

Next steps for this project

I’m pleased to have got datasette-ripgrep to a workable state, and I’m looking forward to using it to answer questions about the growing Datasette ecosystem. I don’t know how much more time I’ll invest in this—if it proves useful then I may well expand it.

I do think there’s something really interesting about being able to spin up this kind of code search engine on demand using datasette publish. It feels like a very useful trick to have access to.

Better URLs for my TILs

My other project this week was an upgrade to til.simonwillison.net: I finally spent the time to design nicer URLs for the site.

Before:

til.simonwillison.net/til/til/javascript_manipulating-query-params.md

After:

til.simonwillison.net/javascript/manipulating-query-params

The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: Path parameters for custom page templates. I can create a template file called pages/{topic}/{slug}.html and Datasette will use that template to handle 404 errors that match that pattern.

Here’s the new pages/{topic}/{slug}.html template for my TIL site. It uses the sql() template function from the datasette-template-sql plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.

I also needed to set up redirects from the old pages to the new ones. I wrote a TIL on redirects for Datasette explaining how I did that.

TIL this week

Releases this week

Weeknotes: datasette-indieauth, datasette-graphql, PyCon Argentina 13 days ago

Last week’s weeknotes took the form of my Personal Data Warehouses: Reclaiming Your Data talk write-up, which represented most of what I got done that week. This week I mainly worked on datasette-indieauth, but I also gave a keynote at PyCon Argentina and released a version of datasette-graphql with a small security fix.

datasette-indieauth

I wrote about this project in detail in Implementing IndieAuth for Datasette—it was inspired by last weekend’s IndieWebCamp East and provides Datasette with a password-less sign in option with the least possible amount of configuration.

Shortly after releasing version 1.0 of the plugin I realized it had a critical security vulnerability, where a malicious authorization server could fake a sign-in as any user! I fixed this in version 1.1 and released that along with a GitHub security advisory: Implementation trusts the “me” field returned by the authorization server without verifying it.

The IndieAuth community has an active #dev chat channel, available in Slack and through IRC and their web chat interface. I’ve had some very productive conversations there about parts of the specification that I found confusing.

datasette-graphql

This week I also issued a security advisory for my datasette-graphql plugin. This one was thankfully much less severe: I realized that the plugin was leaking details of the schema of otherwise private databases, if they were protected by Datasette’s permission system.

Here’s the advisory: datasette-graphql leaks details of the schema of private database files. It’s important to note that the actual content of the tables was not exposed—just the schema details such as the names of the tables and columns.

To my knowledge no-one has installed that plugin on an internet-exposed Datasette instance that includes private databases, so I don’t think anyone was affected by the vulnerability. The fix is available in datasette-graphql 1.2.

Also in that release: I’ve added table action items that link to an example GraphQL query for each table. This is a pretty neat usability enhancement, since the example includes all of the non-foreign-key columns making it a useful starting point for iterating on a query. You can try that out starting on this page.

Animated demo showing the cog menu linking to an example query in the GraphiQL API explorer

Keynoting PyCon Argentina

On Friday I presented a keynote at PyCon Argentina. I actually recorded this several weeks ago, but the keynote was broadcast live on YouTube so I got to watch the talk and post real-time notes and links to an accompanying Google Doc, which I also used for Q&A after the talk.

The conference was really well organized, with top-notch production values. They made a pixel-art version of me for the poster!

My PyCon Argentina poster

The video isn’t available yet, but I’ll link to it when they share it. I’m particularly excited about the professionally translated subtitles en Español.

Miscellaneous

Since Datasette depends on Python 3.6 these days, I decided to try out f-strings. I used flynt to automatically convert all of my usage of .format() to use f-strings instead. Flynt is built on top of astor, a really neat looking library for more productively manipulating Python source code using Python’s AST.
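The kind of rewrite flynt performs looks like this (an illustrative example, not one taken from the Datasette codebase):

# Before: old-style str.format() call
message = "Imported {} rows into {}".format(count, table_name)

# After: flynt rewrites it as an f-string
message = f"Imported {count} rows into {table_name}"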

I’ve long been envious of the JavaScript community’s aggressive use of codemods for automated refactoring, so I’m excited to see that kind of thing become more common in the Python community.

datasette-search-all is my plugin that returns search results from ALL attached searchable database tables, using a barrage of fetch() calls. I bumped it to a 1.0 release adding loading indicators, more reliable URL construction (with the new datasette.urls utilities) and a menu item in Datasette’s new navigation menu.

Releases in the past two weeks

Implementing IndieAuth for Datasette 16 days ago

IndieAuth is a spiritual successor to OpenID, developed and maintained by the IndieWeb community and based on OAuth 2. This weekend I attended IndieWebCamp East Coast and was inspired to try my hand at an implementation. datasette-indieauth is the result, a new plugin which enables IndieAuth logins to a Datasette instance.

Surprisingly this was my first IndieWebCamp—I’ve been adjacent to that community for over a decade, but I’d never made it to one of their in-person events before. Now that everything’s virtual I didn’t even have to travel anywhere, so I finally got to break my streak of non-attendance.

Understanding IndieAuth

The key idea behind IndieAuth is to provide federated login based on URLs. Users enter a URL that they own (e.g. simonwillison.net), and the protocol then derives their identity provider, redirects the user there, waits for them to sign in and get redirected back and then uses tokens passed in the redirect to prove the user’s ownership of the URL and sign them in.

Here’s what that authentication flow looks like, using this demo of the plugin:

Animated demo: starts at an IndieAuth login screen, enters simonwillison.net, gets redirected to another site where clicking the verify button completes the sign-in and redirects back to the original page.

IndieAuth works by scanning the linked page for a <link rel="authorization_endpoint" href="https://indieauth.com/auth"> HTML element which indicates a service that should be redirected to in order to authenticate the user.
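As a rough sketch of that discovery step (a real client should use a proper HTML parser and handle more cases; this regex-based version is purely illustrative):

import re
import urllib.request


def discover_authorization_endpoint(profile_url):
    # Fetch the user's profile page and look for a
    # rel="authorization_endpoint" link element
    html = urllib.request.urlopen(profile_url).read().decode("utf-8", "replace")
    match = re.search(
        r'<link\s+rel="authorization_endpoint"\s+href="([^"]+)"', html
    )
    return match.group(1) if match else None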

I’m using IndieAuth.com for my own site’s authorization endpoint, an identity provider run by IndieAuth spec author Aaron Parecki. IndieAuth.com implements RelMeAuth.

RelMeAuth is a neat hack where the authentication provider can scan the user’s URL for a <link href="https://github.com/simonw" rel="me"> element, confirm that the GitHub profile in question links back to the same page, and then delegate to GitHub authentication for the actual sign-in.

Why implement this for Datasette?

A key goal of Datasette is to reduce the friction involved in publishing data online as much as possible.

The datasette publish command addresses this by providing a single CLI command for publishing a SQLite database to the internet and assigning it a new URL.

datasette publish cloudrun ca-fires.db \
    --service ca-fires \
    --title "Latest fires in California"

This command will create a new Google Cloud Run service, package up the ca-fires.db (created in this talk) along with the Datasette web application, and deploy the resulting site using Google Cloud Run.

It will output a URL that looks like this: https://ca-fires-j7hipcg4aq-uc.a.run.app

Datasette is unauthenticated by default—anyone can view the published data. If you want to add authentication you can do so using a plugin, for example datasette-auth-passwords.

Authentication without passwords is better. The datasette-auth-github plugin implements single-sign-on against the GitHub API, but comes with a slight disadvantage: you need to register and configure your application with GitHub in order to configure things like the redirect URL needed for authentication.

For most applications this isn’t a problem, but when you’re deploying dozens or potentially hundreds of applications with Datasette—each with initially unpredictable URLs—this can add quite a bit of friction.

The joy of IndieAuth (and OpenID before it) is that there’s no centralized authority to register with. You can deploy an application to any URL, install the datasette-indieauth plugin and users can start authenticating with your site.

Even better... IndieAuth means you can grant people permission to access a site without them needing to create an account, provided they have their own domain with IndieAuth setup.

I took advantage of that in the design of datasette-indieauth. Say you want to publish a Datasette that only I can access—you can do that using the restrict_access plugin configuration setting like so:

datasette publish cloudrun simon-only.db \
  --service simon-only \
  --title "For Simon's eye only" \
  --install datasette-indieauth \
  --plugin-secret datasette-indieauth \
    restrict_access https://simonwillison.net/

The resulting Datasette instance will require the user to authenticate in order to view it—and will only allow access to the user who can use IndieAuth to prove that they are the owner of simonwillison.net.

Next steps

There are two sides to the IndieAuth specification: client sites that allow sign-in with IndieAuth, and authorization providers that handle that authentication.

datasette-indieauth currently acts as a client, allowing sign-in with IndieAuth.

I’m considering extending the plugin to act as an authorization provider as well. This is a bit more challenging as authentication providers need to maintain some small aspects of session state, but it would be good for the IndieAuth ecosystem for there to be more providers. The most widely used provider at the moment is the excellent IndieAuth WordPress plugin, which I used while testing my Datasette plugin and really was just a one-click install from the WordPress plugin directory.

datasette-indieauth has 100% test coverage, and I wrote the bulk of the logic in a standalone utils.py module which could potentially be extracted out of the plugin and used to implement IndieAuth in Python against other frameworks. A Django IndieAuth provider is another potential project, which could integrate directly with my Django blog.

Addendum: what about OpenID?

From 2006 to 2010 I was a passionate advocate for OpenID. It was clear to me that passwords were an increasingly unpleasant barrier to secure usage of the web, and that some form of federated sign-in was inevitable. I was terrified that Microsoft Passport would take over all authentication on the web!

With hindsight that’s not quite what happened: for a while it looked like Facebook would win instead, but today it seems to be a fairly even balance between Facebook, Google, community-specific authentication providers like GitHub and Apple’s iPhone-monopoly-enforced Sign in with Apple.

OpenID as an open standard didn’t really make it. The specification grew in complicated new directions (Yadis, XRDS, i-names, OpenID Connect, OpenID 2.0) and it never quite overcame the usability hurdle of users having to understand URLs as identifiers.

IndieAuth is a much simpler specification, based on lessons learned from OAuth. I’m still worried about URLs as identifiers, but helping people reclaim their online presence and understand those concepts is core to what the IndieWeb movement is all about.

IndieAuth also has some clever additional tricks up its sleeve. My favourite is that IndieAuth can return an identifier for the user that’s different from the one they typed in the box. This means that if a top-level domain with many users supports IndieAuth, each user can learn to just type example.com in (or click a branded button) to start the authentication flow—they’ll be signed in as example.com/users/simonw based on who they authenticated as. This feels like an enormous usability improvement to me, and one that could really help avoid users having to remember their own profile URLs.

OpenID was trying to solve authentication for every user of the internet. IndieAuth is less ambitious—if it only takes off with the subset of people who embrace the IndieWeb movement I think that’s OK.

The datasette-indieauth project is yet another example of the benefit of having a plugin ecosystem around Datasette: I can add support for technologies like IndieAuth without baking them into Datasette’s core, which almost eliminates the risk to the integrity of the larger project of trying out something new.

Personal Data Warehouses: Reclaiming Your Data 21 days ago

I gave a talk yesterday about personal data warehouses for GitHub’s OCTO Speaker Series, focusing on my Datasette and Dogsheep projects. The video of the talk is now available, and I’m presenting that here along with an annotated summary of the talk, including links to demos and further information.

There’s a short technical glitch with the screen sharing in the first couple of minutes of the talk—I’ve added screenshots to the notes which show what you would have seen if my screen had been correctly shared.

Weeknotes: sqlite-utils 3.0 alpha, Git scraping in the zeitgeist 28 days ago

Natalie and I decided to escape San Francisco for election week, and have been holed up in Fort Bragg on the Northern California coast. I’ve mostly been on vacation, but I did find time to make some significant changes to sqlite-utils. Plus notes on an exciting Git scraping project.

Better search in the sqlite-utils 3.0 alpha

I practice semantic versioning with sqlite-utils, which means it only gets a major version bump if I break backwards compatibility in some way.

My goal is to avoid breaking backwards compatibility as much as possible, and I was proud to have made it all the way to version 2.23 representing 23 new feature releases since the 2.0 release without breaking any documented features!

Sadly this run has come to an end: I realized that the table.search() method was poorly designed, and I also needed to grab back the -c command-line option (a shortcut for --csv output) to be used for another purpose.

The chances that either of these changes will break anyone are pretty small, but semantic versioning dictates a major version bump so here we are.

I shipped a 3.0 alpha today, which should hopefully become a stable release very shortly (milestone here).

The big new feature is sqlite-utils search—a command-line tool for executing searches against a full-text search enabled table:

$ sqlite-utils search 24ways-fts4.db articles maps -c title
[{"rowid": 163, "title": "Get To Grips with Slippy Maps", "rank": -10.028754920576421},
 {"rowid": 220, "title": "Finding Your Way with Static Maps", "rank": -9.952534352591737},
 {"rowid": 27, "title": "Putting Design on the Map", "rank": -5.667327088267961},
 {"rowid": 168, "title": "Unobtrusively Mapping Microformats with jQuery", "rank": -4.662224207228984},

Here’s full documentation for the new command.

Notably, this command works against both FTS4 and FTS5 tables in SQLite—despite FTS4 not shipping with a built-in ranking function. I’m using my sqlite-fts4 package for this, which I described back in January 2019 in Exploring search relevance algorithms with SQLite.

Git scraping to predict the election

It’s not quite over yet but the end is in sight, and one of the best tools to track the late arriving vote counts is this Election 2020 results site built by Alex Gaynor and a growing cohort of contributors.

The site is a beautiful example of Git scraping in action, and I’m thrilled that it links to my article in the README!

Take a look at the repo to see how it works. Short version: this GitHub Action workflow grabs the latest snapshot of this undocumented New York Times JSON API once every five minutes and commits it to the repository. It then runs this Python script which iterates through the Git history and generates an HTML summary showing the different batches of new votes that were reported and their impact on the overall race.

The resulting report is published to GitHub pages—resulting in a site that can handle a great deal of traffic and is updated entirely by code running in scheduled actions.

Screenshot of the generated report

This is a perfect use-case for Git scraping: it takes a JSON endpoint that represents the current state of the world and turns it into a sequence of historic snapshots, then uses those snapshots to build a unique and useful new source of information to help people understand what’s going on.

Releases this week

Datasette 0.51 (plus weeknotes) one month ago

I shipped Datasette 0.51 today, with a new visual design, plugin hooks for adding navigation options, better handling of binary data, URL building utility methods and better support for running Datasette behind a proxy. It’s a lot of stuff! Here are the annotated release notes.

New visual design

Datasette is no longer white and grey with blue and purple links! Natalie Downe has been working on a visual refresh, the first iteration of which is included in this release. (#1056)

It’s about time Datasette grew beyond its clearly-designed-by-a-mostly-backend-engineer roots. Natalie has been helping me start adding some visual polish: we’ve started with an update to the colour scheme and will be continuing to iterate on the visual design as the project evolves towards the 1.0 release.

The new design makes the navigation bar much more obvious, which is important for this release since the new navigation menu (tucked away behind a three-bar icon) is a key new feature.

A screenshot of the new Datasette visual design, also showcasing two new menu icons

A number of existing Datasette plugins add new pages to the Datasette interface, providing tools for things like uploading CSVs, editing table schemas or configuring full-text search.

Plugins like this can now link to themselves from other parts of Datasette interface. The menu_links(datasette, actor) hook (#1064) lets plugins add links to Datasette’s new top-right application menu, and the table_actions(datasette, actor, database, table) hook (#1066) adds links to a new “table actions” menu on the table page.

This feature has been a long time coming. I’ve been writing an increasing number of plugins that add new pages to Datasette, and so far the main way of using them has been to memorise and type in their URLs!

The new navigation menu (which only displays if it has something in it) provides a global location to add new links. I’ve already released several plugin updates that take advantage of this.

The new “table actions” menu imitates Datasette’s existing column header menu icon—it’s a cog. Clicking it opens a menu of actions relating to the current table.

Want to see a demo?

The demo at latest.datasette.io now includes some example plugins. To see the new table actions menu first sign into that demo as root and then visit the facetable table to see the new cog icon menu at the top of the page.

Here’s an animated GIF demo showing the new menus in action.

Using the nav menu to upload CSVs, then the table actions menu to edit the table schema and rename latitude and longitude columns so that the Central Park Squirrel Census is displayed on a map!

Binary data

SQLite tables can contain binary data in BLOB columns. Datasette now provides links for users to download this data directly from Datasette, and uses those links to make binary data available from CSV exports. See Binary data for more details. (#1036 and #1034).

I spent a ton of time on this over the past few weeks. The initial impetus was a realization that Datasette CSV exports included ugly Python b'\x15\x1c\x02\xc7\xad\x05\xfe' strings, which felt like the worst possible way to display binary in a CSV file, out of universally bad options.

Datasette’s main interface punted on binary entirely—it would show a <Binary data: 7 bytes> label which didn’t help much either.

The only way to get at binary data stored in a Datasette instance was to request the JSON version and then manually decode the Base-64 value within it!
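As a quick illustration, decoding the seven example bytes shown above by hand:

import base64

# The seven example bytes from above, base64-encoded as the JSON API returned them
encoded = "FRwCx60F/g=="
assert base64.b64decode(encoded) == b"\x15\x1c\x02\xc7\xad\x05\xfe"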

This is now fixed: binary columns can be downloaded directly to your computer, using a new .blob output renderer. The approach is described on this new page in the documentation.

Security was a major consideration when building this feature. Allowing the download of arbitrary byte payloads from a web server is dangerous business: it can easily result in XSS holes where HTML with dangerous <script> content can end up hosted on the primary domain.

After some research, I decided to serve up binary content for download using the following headers:

content-type: application/binary
x-content-type-options: nosniff
content-disposition: attachment; filename="data-f30889.blob"

application/binary is a safer Content-Type option than the more common application/octet-stream, according to Michal Zalewski’s renowned web application security book The Tangled Web (quoted here)

x-content-type-options: nosniff disables the XSS-tastic content sniffing feature in older versions of Internet Explorer, where IE would helpfully guess that you intended to serve HTML based on the first few bytes of the response.

The content-disposition: attachment header causes the browser to show a “download this file” dialog, using the suggested filename.

If you know of a reason that this isn’t secure enough, please let me know!

URL building

The new datasette.urls family of methods can be used to generate URLs to key pages within the Datasette interface, both within custom templates and Datasette plugins. See Building URLs within plugins for more details. (#904)

Datasette’s base_url configuration setting was the forcing factor around this piece of work.

It allows you to configure Datasette to serve content starting at a path other than /—for example:

datasette --config base_url:/path-to-datasette/

This will serve all Datasette pages at locations starting with /path-to-datasette/.

Why would you want to do this? It’s useful if you are proxying traffic to Datasette from within the URL hierarchy of an existing website.

The feature didn’t work properly, and enough people care about it that I had a steady stream of bug reports. For 0.51 I gathered them all into a single giant tracking issue and worked through them all one by one.

It quickly became apparent that the key challenge was building URLs within Datasette—not just within HTML template pages, but also for things like HTTP redirects.

Datasette itself needed to generate URLs that took the base_url setting into account, but so do Datasette plugins. So I built a new datasette.urls collection of helper methods and made them part of the documented internals API for plugins. The Building URLs within plugins documentation shows how these should be used.

I also added documentation on Running Datasette behind a proxy with example configs (tested on my laptop) for both nginx and Apache.

The datasette.client mechanism from Datasette 0.50 allows plugins to make calls to Datasette’s internal JSON API without the overhead of an HTTP request. This is another place where plugins need to be able to construct valid URLs to internal Datasette pages.

I added this example to the documentation showing how the two features can work together:

table_json = (
    await datasette.client.get(
        datasette.urls.table("fixtures", "facetable", format="json")
    )
).json()

One final weird detail on this: Datasette now has various methods that automatically add the base_url prefix to a URL. I got worried about what would happen if these were applied more than once (as above, where datasette.urls.table() applies the prefix and so does datasette.client.get()).

I fixed this using the same trick that Django and Jinja use to avoid applying auto-escaping twice to content that will be displayed in HTML: the datasette.urls methods actually return a PrefixedUrlString object which is a subclass of str that knows that the prefix has been applied! Code for that lives here.
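The idea boils down to something like this sketch (PrefixedUrlString is the real class name; the helper function here is a hypothetical stand-in for the logic in Datasette’s URL methods):

class PrefixedUrlString(str):
    # Marker subclass: a string that already has the base_url prefix applied
    pass


def apply_base_url_prefix(base_url, path):
    # Hypothetical helper: only add the prefix if it has not already been added
    if isinstance(path, PrefixedUrlString):
        return path
    return PrefixedUrlString(base_url.rstrip("/") + "/" + path.lstrip("/"))

Calling apply_base_url_prefix() a second time on a value it has already processed returns it unchanged, which is what makes it safe to combine URL-building helpers like these with other code that also applies the prefix.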

Smaller changes

A few highlights from the “smaller changes” in Datasette 0.51:

  • Wide tables shown within Datasette now scroll horizontally (#998). This is achieved using a new <div class="table-wrapper"> element which may impact the implementation of some plugins (for example this change to datasette-cluster-map).

I think this is a big improvement: if your database table is too wide, it now scrolls horizontally on the page (rather than blowing the entire page out to a wider width). You can see that in action on the global-power-plants demo.

If you are signed in as root, the new navigation menu links to a whole plethora of previously-undiscoverable Datasette debugging tools. A new permission controls the display of those items.

  • Link: HTTP header pagination. (#1014)

Inspired by GitHub and WordPress, which both use the HTTP Link header in this way. It’s an optional extra though: Datasette will always offer in-JSON pagination information.

  • Edit SQL button on canned queries. (#1019)

Suggested by Jacob Fenton in this issue. The implementation had quite a few edge cases since there are certain categories of canned query that can’t be executed as custom SQL by the user. See the issue comments for details and a demo.

  • --load-extension=spatialite shortcut. (#1028)

Inspired by a similar feature in sqlite-utils.

  • datasette -o option now opens the most relevant page. (#976)

This is a fun little feature. If your Datasette only loads a single database, and that database only has a single table (common if you’ve just run a single CSV import) then running this will open your browser directly to that table page:

datasette data.db -o

  • datasette --cors option now enables access to /database.db downloads. (#1057)

This was inspired by Mike Bostock’s Observable Notebook that uses the Emscripten-compiled JavaScript version of SQLite to run queries against SQLite database files.

It turned out you couldn’t use that notebook against SQLite files hosted in Datasette because they weren’t covered by Datasette’s CORS option. Now they are!

Recommendations for plugin authors, inspired by a question from David Kane on Twitter. David has been building datasette-reconcile, a Datasette plugin that offers a reconciliation API endpoint that can be used with OpenRefine. What a brilliant idea!

datasette-edit-templates (almost)

Inspired by a conversation with Jesse Vincent, I also spent some time experimenting with the idea of a plugin that can load and edit templates from the database—which would turn a personal Datasette into a really fun interface hacking environment. I nearly got this working, and even shipped a preview of a load_template() plugin hook in the Datasette 0.51a2 alpha... before crashing into a road block when I realized that it also needed to work with Jinja’s {% extends %} and {% include %} template tags, and loaders for those don’t currently support async functions.

In exploring this I also realized that my load_template() plugin hook wasn’t actually necessary—if I’m going to solve this problem with Jinja loaders I can do so using the existing prepare_jinja2_environment(env) hook.

My not-yet-functional prototype for this is called datasette-edit-templates. I’m pretty confident I can get it working against the old plugin hook with a little more work.

Other weeknotes

Most of my time this week was spent on Datasette 0.51—but I did find a little bit of time for other projects.

I finished recording my talk for PyCon Argentina. It will air on November 20th.

sqlite-utils 2.23 is out, with a .m2m() bug fix from Adam Wolf and the new ability to display progress bars when importing TSV and CSV files.

Releases this week

Several of these are updates to take advantage of the new navigation plugin hooks introduced in Datasette 0.51.

TIL this week

Elsewhere

Today

  • The secrets of Monkey Island's source code (via) To celebrate the thirty-year anniversary of The Secret of Monkey Island the Video Game History Foundation interviewed developer Ron Gilbert and produced this comprehensive collection of cut content and material showing how the game was originally constructed. #5th December 2020, 4:32 pm

Yesterday

  • Discoverable CLIs have comprehensive help texts, provide lots of examples, suggest what command to run next, suggest what to do when there is an error. There are lots of ideas that can be stolen from GUIs to make CLIs easier to learn and use, even for power users.

    Command Line Interface Guidelines # 4th December 2020, 8:48 pm

  • Command Line Interface Guidelines (via) Aanand Prasad, Ben Firshman, Carl Tashian and Eva Parish provide the missing manual for designing CLI tools in 2020. Deeply researched and clearly presented—I picked up a bunch of useful tips and ideas from reading this, and I’m looking forward to applying them to my own CLI projects. #4th December 2020, 8:44 pm

3rd December 2020

  • The value of a product is the number of problems it can solve divided by the amount of complexity the user needs to keep in their head to use it. Consider an iPhone vs a standard TV remote: an iPhone touchscreen can be used for countless different functions, but there’s very little to remember about how it works (tap, drag, swipe, pinch). With a TV remote you have to remember what every button does; the more things you can use the remote for, the more buttons it has. We want to create iPhones, not TV remotes.

    Adam Wiggins: Heroku Values # 3rd December 2020, 9:25 pm

1st December 2020

  • Scaling Datastores at Slack with Vitess (via) Slack spent three years migrating 99% of their MySQL query load to run against Vitess, the open source MySQL sharding system originally built by YouTube. “Today, we serve 2.3 million QPS at peak. 2M of those queries are reads and 300K are writes. Our median query latency is 2 ms, and our p99 query latency is 11 ms.” #1st December 2020, 9:30 pm
  • New for AWS Lambda – Container Image Support. “You can now package and deploy Lambda functions as container images of up to 10 GB in size”—can’t wait to try this out with Datasette. #1st December 2020, 5:34 pm

29th November 2020

  • Datasette 0.52. A relatively small release—it has a new plugin hook (database_actions(), for adding links to a new database actions menu), renames the --config option to --setting and adds a new “datasette publish cloudrun --apt-get-install” option. #29th November 2020, 12:56 am

27th November 2020

  • Unravelling `not` in Python (via) Part of a series where Brett Cannon looks at how fundamental Python syntactic sugar works, including a clearly explained dive into the underlying op codes and C implementation. #27th November 2020, 5:59 pm

24th November 2020

  • Datasette Client for Observable (via) Really elegant piece of code design from Alex Garcia: DatasetteClient is a client library he built designed to work in Observable notebooks, which uses JavaScript tagged template literals to allow SQL query results to be executed against a Datasette instance and displayed as inline tables in a notebook, or used to return JSON data for further processing. His example notebook includes a neat d3 stacked area chart example built against a Datasette of congresspeople, plus examples using interactive widgets to update the Notebook. #24th November 2020, 6:53 pm

20th November 2020

  • The open secret Jennings filled me in on is that OpenStreetMap (OSM) is now at the center of an unholy alliance of the world’s largest and wealthiest technology companies. The most valuable companies in the world are treating OSM as critical infrastructure for some of the most-used software ever written. The four companies in the inner circle— Facebook, Apple, Amazon, and Microsoft— have a combined market capitalization of over six trillion dollars.

    Joe Morrison # 20th November 2020, 9:11 pm

  • The trouble with transaction.atomic (via) David Seddon provides a detailed explanation of Django’s nestable transaction.atomic() context manager and describes a gotcha that can occur if you lose track of whether your code is already running in a transaction block, since you may be working with savepoints instead—along with some smart workarounds. #20th November 2020, 3:57 pm