Simon Willison’s Weblog

On security, projects, javascript, github, sqlite, ...

 

Recent entries

Weeknotes: Datasette 0.43 five days ago

My main achievement this week was shipping Datasette 0.43, with a collection of smaller improvements and one big one: a redesign of the register_output_renderer plugin hook.

Generating Atom and iCal feeds with Datasette

The register_output_renderer hook was contributed by Russ Garrett last year. It added a mechanism for plugins to create new output formats for Datasette.

Output formats are controlled by their file extension. Out of the box Datasette supports .json and .csv formats. The hook allows plugins to register new ones—.atom for Atom feeds and .ics for iCal feeds, for example.

I built those exact two plugins using the hook: datasette-atom and datasette-ics. In building them I ran into a few limitations. Atom feeds need to reflect the full URL of the feed for example, but that information wasn’t made available through the plugin hook.

The pattern I settled on for both of my plugins was to require SQL queries that produced data in a specific shape. datasette-atom for example requires a SQL query that returns columns atom_id, atom_title and atom_updated—plus optional columns atom_content, atom_link and a few others.

If those columns are present in the query, the plugin can render an Atom feed! If the columns are not present, it returns an error.

The problem was that EVERY page on the site that could return .json and .csv now got an .atom link as well—even if that link wouldn’t actually work.

So I’ve added a new can_render callback to the plugin registry, which indicates whether that link should be displayed. I shipped a new release, datasette-atom 0.6, that takes advantage of this new feature.
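
Roughly speaking, a renderer plugin now registers a render callback plus the new can_render callback, and Datasette only shows the extension link when can_render says yes. Here is a minimal sketch of that shape, assuming the same required-column check datasette-atom performs (the function names and bodies are illustrative, not the plugin’s actual code):

from datasette import hookimpl

REQUIRED_COLUMNS = {"atom_id", "atom_updated", "atom_title"}


def can_render_atom(columns):
    # Only advertise the .atom link when the query returns the columns
    # the Atom format needs.
    return REQUIRED_COLUMNS.issubset(columns)


def render_atom(request, rows, columns):
    ...  # build and return the Atom XML response here


@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "atom",
        "render": render_atom,
        "can_render": can_render_atom,
    }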

You can see it in action in the Atom feed for my Niche Museums site, which finally passes the Feed Validator!

Also in Datasette 0.43

Copied from the release notes:

  • Visually distinguish float and integer columns—useful for figuring out why order-by-column might be returning unexpected results. (#729)
  • The Request object, which is passed to several plugin hooks, is now documented. (#706)
  • New metadata.json option for setting a custom default page size for specific tables and views, see Setting a custom page size. (#751)
  • Canned queries can now be configured with a default URL fragment hash, useful when working with plugins such as datasette-vega, see Setting a default fragment. (#706)
  • Fixed a bug in datasette publish when running on operating systems where the /tmp directory lives in a different volume, using a backport of the Python 3.8 shutil.copytree() function. (#744)
  • Every plugin hook is now covered by the unit tests, and a new unit test checks that each plugin hook has at least one corresponding test. (#771, #773)

TIL this week

Using SQL to find my best photo of a pelican according to Apple Photos 12 days ago

According to the Apple Photos internal SQLite database, this is the most aesthetically pleasing photograph I have ever taken of a pelican:

A pelican

Here’s the SQL query that found me my best ten pelican photos:

select
  sha256,
  ext,
  uuid,
  date,
  ZOVERALLAESTHETICSCORE
from
  photos_with_apple_metadata
where
  uuid in (
    select
      uuid
    from
      labels
    where
      normalized_string = 'pelican'
  )
order by
  ZOVERALLAESTHETICSCORE desc
limit
  10

You can try it out here (with some extra datasette-json-html magic to display the actual photos). Or try lemur or seal.

I actually think this is my best pelican photo, but Apple Photos rated it fifth:

A pelican

How this works

Apple Photos keeps photo metadata in a SQLite database. It runs machine learning models to identify the contents of every photo, and separate machine learning models to calculate quality scores for those photographs. All of this data lives in SQLite files on my laptop. The trick is knowing where to look.

I’m not running queries directly against the Apple Photos SQLite file—it’s a little hard to work with, and the label metadata is stored in a separate database file. Instead, this query runs against a combined database created by my new dogsheep-photos tool.

An aside: Why I love Apple Photos

The Apple Photos app—on both macOS and iOS—is in my opinion Apple’s most underappreciated piece of software. In my experience most people who use it are missing some of the most valuable features. A few highlights:

  • It can show you ALL of your photos on a map. On iOS go to the “Albums” tab, scroll halfway down and then click on “Places” (no wonder people miss this feature!)—on macOS Photos it’s the “Library -> Places” sidebar item. It still baffles me that Google Photos doesn’t do this (I have conspiracy theories about it). This is my most common way of finding a photo I’ve taken—I remember where it was, then zoom in on that area of the map.
  • It runs machine learning models on your phone (or laptop) to identify the subject of your photos, and makes them searchable. Try searching for “dog” and you’ll see all of the photos you’ve taken of dogs! I love that this runs on-device: it’s much less creepy than uploading your photos to the cloud in order to do this.
  • It has a really great faceted search implementation—particularly in the phone app. Try searching for “dog”, then add “selfie” and the name of a city to see all of the selfies you’ve taken with dogs in that place!
  • It has facial recognition, again running on device, which you can use to teach it who your friends are (autocompleting against your contacts). A little bit of effort spent training this and you can see photos you’ve taken of specific friends in specific places and with specific animals!

As with most Apple software, Photos uses SQLite under the hood. The underlying database is undocumented and clearly not intended as a public API, but it exists. And I’ve wanted to gain access to what’s in it for years.

Querying the Apple Photos SQLite database

If you run Apple Photos on a Mac (which will synchronize with your phone via iCloud) then most of your photo metadata can be found in a database file that lives here:

~/Pictures/Photos\ Library.photoslibrary/database/Photos.sqlite

Mine is 752MB, for around 40,000 photos. There’s a lot of detailed metadata in there!

Querying the database isn’t straightforward. Firstly, it’s almost always locked by some other process—the workaround for that is to create a copy of the file. Secondly, it uses some custom undocumented Apple SQLite extensions. I’ve not figured out a way to load these, and without them a lot of my queries ended up throwing errors.
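
The copy-then-query workaround looks something like this (a minimal sketch, assuming the default Photos library location; queries that touch columns backed by those custom extensions will still fail):

import shutil
import sqlite3
import tempfile
from pathlib import Path

src = Path.home() / "Pictures/Photos Library.photoslibrary/database/Photos.sqlite"

# Copy the database first - the original is almost always locked by Photos.
dst = Path(tempfile.mkdtemp()) / "Photos.sqlite"
shutil.copyfile(src, dst)

conn = sqlite3.connect(str(dst))
print(conn.execute("select count(*) from ZGENERICASSET").fetchone())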

osxphotos to the rescue! I ran a GitHub code search for one of the tables in that database (searching for RKPerson in Python code) and was delighted to stumble across the osxphotos project by Rhet Turnbull. It’s a well designed and extremely actively maintained Python tool for accessing the Apple Photos database, including code to handle several iterations of the underlying database structure.

Thanks to osxphotos, the first iteration of my own code for accessing the Apple Photos metadata was less than 100 lines. This gave me locations, people, albums and places (human names of geographical areas) almost for free!

Quality scores

Apple Photos has a fascinating database table called ZCOMPUTEDASSETATTRIBUTES, with a bewildering collection of columns. Each one is a floating point number calculated presumably by some kind of machine learning model. Here’s a full list, each one linking to my public photos sorted by that score:

I’m not enormously impressed with the results I get from these. They’re clearly not intended for end-user visibility, and sorting by them might not even be something that makes sense.

The ZGENERICASSET table provides four more scores, which seem to provide much more useful results:

My guess is that these overall scores are derived from the ZCOMPUTEDASSETATTRIBUTES ones. I’ve seen the best results from ZOVERALLAESTHETICSCORE, so that’s the one I used in my “show me my best photo of a pelican” query.

A note about the demo

The demo I’m running at dogsheep-photos.dogsheep.net currently only contains 496 photos. My private instance of this has over 40,000, but I decided to just publish a subset of that in the demo so I wouldn’t have to carefully filter out private screenshots and photos with sensitive locations and suchlike. Details of how the demo works (using the dogsheep-photos create-subset command to create a subset database containing just photos in my Public album) can be found in this issue.

Automatic labeling of photo contents

Even more impressive than the quality scores are the machine learning labels.

Automatically labeling the content of a photo is surprisingly easy these days, thanks to convolutional neural networks. I wrote a bit about these in Automatically playing science communication games with transfer learning and fastai.

Apple download a machine learning model to your device and do the label classification there. After quite a bit of hunting (I ended up using Activity Monitor’s Inspect -> Open Files and Ports option against the photoanalysisd process) I finally figured out where the results go: the ~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite database file.

(Inspecting photoanalysisd also led me to the /System/Library/Frameworks/Vision.framework/Versions/A/Resources/ folder, which solved another mystery: where do Apple keep the models? There are some fascinating files in there.)

It took some work to figure out how to match those labels with their corresponding photos, mainly because the psi.sqlite database stores photo UUIDs as a pair of signed integers whereas the Photos.sqlite database stores a UUID string.
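
For illustration, here is one way to derive a pair of signed 64-bit integers from a UUID string for that join; treating the halves as big-endian is an assumption on my part, not necessarily exactly what psi.sqlite does:

import uuid


def uuid_to_int_pair(uuid_string):
    # Interpret the 16 UUID bytes as two big-endian signed 64-bit integers
    # (an assumption about how psi.sqlite stores them).
    raw = uuid.UUID(uuid_string).bytes
    return (
        int.from_bytes(raw[:8], "big", signed=True),
        int.from_bytes(raw[8:], "big", signed=True),
    )


print(uuid_to_int_pair("8b0e92eb-4a26-4115-a25a-ba54b0ee9b26"))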

I’m now pulling the labels out into a separate labels table. You can browse that in the demo to see how it is structured. Labels belong to numeric categories—here are some of my guesses as to what those mean:

  • Category 2024 appears to be actual content labels—Seal, Water Body, Pelican etc.
  • Category 2027 is more contextual: Entertainment, Trip, Travel, Museum, Beach Activity etc.
  • Category 1014 is simply the month the photo was taken. 1015 is the year, and 2030 is the season.
  • Category 2056 is the original filename.
  • Category 12 is the country the photo was taken in.
Here’s a query that shows the labels (from every category) next to each photo.
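
As a rough sketch of how you might pull out just the content labels, assuming the combined database exposes the category number as a category column alongside the uuid and normalized_string columns used above:

import sqlite3

# photos.db is the combined database built by dogsheep-photos.
conn = sqlite3.connect("photos.db")
rows = conn.execute(
    """
    select uuid, normalized_string
    from labels
    where category = 2024  -- content labels: Seal, Water Body, Pelican ...
    limit 10
    """
).fetchall()
print(rows)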

Geography

Photos taken on an iPhone have embedded latitudes and longitudes... which means I can display them on a map!

My photos on a map

Apple also perform reverse-geocoding on those photos, resolving them to cities, regions and countries. This is great for faceted browse: here are my photos faceted by country, city and state/province.

Hosting and serving the images

My least favourite thing about Apple Photos is how hard it is to get images from it onto the internet. If you enable iCloud sharing your images are accessible through icloud.com—but they aren’t given publicly accessible URLs, so you can’t embed them in blog entries or do other webby things with them.

I also really want to “own” my images. I want them in a place that I control.

Amazon S3 is ideal for image storage. It’s incredibly inexpensive and essentially infinite.

The dogsheep-photos upload command takes ANY directory as input, scans through that directory for image files and then uploads them to the configured S3 bucket.

I designed this to work independently of Apple Photos, mainly to preserve my ability to switch to alternative image solutions in the future.

I’m using the content addressable storage pattern to store the images. Their filename is the sha256 hash of the file contents. The idea is that since sensible photo management software leaves the original files unmodified I should be able to de-duplicate my photo files no matter where they are from and store everything in the one bucket.
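
Deriving the key is just a hash of the file bytes plus the original extension (the sha256.ext layout matches the image URLs shown below; the helper itself is a sketch, not the tool’s actual code):

import hashlib
from pathlib import Path


def content_key(path):
    # Content-addressable key: identical files from any source collapse
    # to the same "<sha256>.<ext>" name.
    path = Path(path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return "{}.{}".format(digest, path.suffix.lstrip(".").lower())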

Original image files come with privacy concerns: they embed accurate latitude and longitude data in the EXIF data, so they can be used to reconstruct your exact location history and even figure out your address. This is why systems like Google Photos make it difficult to export images with location data intact.

I’ve addressed this by making the content in my S3 bucket private. Access to the images takes place through s3-image-proxy—a proxy server I wrote and deployed on Vercel (previously Zeit Now). The proxy strips EXIF data and can optionally resize images based on querystring parameters. It also serves them with far-future cache expire headers, which means they sit in Vercel’s CDN cache rather than being resized every time they are accessed.
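
The EXIF-stripping and resizing step can be as simple as re-encoding the image through Pillow; this is a sketch of the idea rather than the proxy’s actual code:

from PIL import Image


def strip_exif_and_resize(src_path, dst_path, width=None):
    image = Image.open(src_path).convert("RGB")
    if width:
        height = round(image.height * width / image.width)
        image = image.resize((width, height))
    # Saving through Pillow without passing exif= drops the original
    # EXIF block, including the GPS coordinates.
    image.save(dst_path, "JPEG")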

iPhones default to saving photos in HEIC format, which fails to display with the <img src=""> tag in the browsers I tested. The proxy uses pyheif to convert those into JPEGs.
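
The conversion follows the usual pyheif-plus-Pillow pattern; again, this is a sketch of the approach rather than the proxy’s actual code:

import pyheif
from PIL import Image


def heic_to_jpeg(heic_path, jpeg_path):
    # Decode the HEIC image, then re-encode it as JPEG so browsers
    # without HEIC support can display it.
    heif = pyheif.read(heic_path)
    image = Image.frombytes(
        heif.mode, heif.size, heif.data, "raw", heif.mode, heif.stride
    )
    image.save(jpeg_path, "JPEG")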

Here’s an example HEIC image, resized by the proxy and converted to JPEG: https://photos.simonwillison.net/i/59854a70f125154cdf8dad89a4c730e6afde06466d4a6de24689439539c2d863.heic?w=600

Next steps

This project is a little daunting in that there are so many possibilities for where to take it next!

In the short term:

  • Import EXIF data from the images into a table. The Apple Photos tables give me some of this already (particularly GPS data) but I want things like ISO, aperture, what lens I used.
  • Load the labels into SQLite full-text search.
  • I’d like other people to be able to play with this easily. Getting it all up and running right now is a fair amount of work—I think I can improve this with usability improvements and better documentation.
  • The system only handles static images at the moment. I’d like to get my movies and more importantly my live photos in there as well.

And in the longer term:

  • Only iPhone photos have location data at the moment—I’d like to derive approximate latitude/longitude points for my DSLR images by matching against images from my phone based on date.
  • Running my photos through other computer vision systems like Google’s Cloud Vision APIs could be really interesting.
  • For better spotting of duplicate images I’m interested in exploring image content hashing.
  • The UI for all of this right now is just regular Datasette. Building a custom UI (running against the Datasette JSON API) could be a lot of fun.

Weeknotes: Working on my screenplay 20 days ago

I’m taking an Introduction to Screenwriting course with Adam Tobin at Stanford, and my partial screenplay is due this week. I’m pulling together some scenes that tell the story of the Russian 1917 February Revolution and the fall of the Tsar through the lens of the craftsmen working on the Tsar’s last Fabergé egg. So I’ve not been spending much time on anything else.

Some brief bullet points for this week’s software projects:

  • Released version 0.1 of datasette-media, a new plugin that allows Datasette to serve files from disk based on executing a SQL query to find the file to return. I’m building it to help make photos-to-sqlite more immediately useful.
  • Released Datasette 0.42 with improved (and now documented) internal methods to allow plugins to execute read-only SQL queries. I needed these for datasette-media.
  • Released sqlite-utils 2.9 with new CLI commands sqlite-utils drop-table and sqlite-utils drop-view.
  • Released sqlite-utils 2.9.1 with a tiny cosmetic improvement: the PyPI project page now shows project links! See this TIL for details.

I’ve also started adding changelog badges to various projects, showing the latest release version according to GitHub and linking to that project’s changelog. Datasette, photos-to-sqlite and sqlite-utils all have these now.

TIL this week

Weeknotes: Datasette 0.41, photos breakthroughs 27 days ago

Shorter weeknotes this week, because my main project for the week warrants a detailed write-up on its own (coming soon... update 21st May here it is).

Datasette 0.41

I released Datasette 0.41 this morning. Highlights include:

  • Custom pages and configuration directory mode, both described here last week
  • A new NOT LIKE table filter expression, ?colname__notlike=X (#750)
  • A CSS pattern portfolio! I’ve been meaning to ship one of these for over two years now—you can see it at latest.datasette.io/-/patterns (#151)
  • Various minor fixes and documentation improvements
  • PRAGMA function support (#761)

A late addition to the release, thought of and implemented just this morning as the result of this TIL, is the ability to use various SQLite PRAGMA functions.

Datasette blocks user-provided SQL queries containing the term PRAGMA because it can be used to dynamically update various database configuration settings, and Datasette likes to keep queries read-only.

But... SQLite provides access to a number of useful introspection capabilities through a collection of PRAGMA functions. Things like this:

select * from pragma_foreign_key_list('complex_foreign_keys')

These are harmless—the SQLite documentation specifically states that “Table-valued functions exist only for PRAGMAs that return results and that have no side-effects”.

So I enabled them using a negative lookahead assertion in the regular expression that prevents PRAGMA from being used:

allowed_pragmas = (
    "database_list",
    "foreign_key_list",
    "function_list",
    "index_info",
    "index_list",
    "index_xinfo",
    "page_count",
    "max_page_count",
    "page_size",
    "schema_version",
    "table_info",
    "table_xinfo",
)
disallowed_sql_res = [
    (
        re.compile("pragma(?!_({}))".format("|".join(allowed_pragmas))),
        "Statement may not contain PRAGMA",
    )
]

The regular expression here is dynamically composed from the list of allowed PRAGMA functions—it looks something like this:

re.compile("pragma(?!_(database_list|foreign_key_list|...))")

In Python regular expressions pragma(?!_foo) means "pragma but only if not followed by _foo".
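
A quick way to see the lookahead in action (using a shortened allow-list and lowercased SQL for simplicity):

import re

pattern = re.compile("pragma(?!_(foreign_key_list|table_info))")

# A bare PRAGMA statement still matches the pattern, so it stays blocked:
print(bool(pattern.search("pragma case_sensitive_like = 1")))  # True

# The allow-listed table-valued functions no longer match:
print(bool(pattern.search("select * from pragma_table_info('x')")))  # False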

Plotting foreign key relationships with Observable and D3

With PRAGMA functions support shipped in Datasette 0.41 it’s now possible to make use of them in queries. Here’s an example query which finds all foreign key relationships in a database:

select
    sqlite_master.name as table_from,
    fk_info.[from] as column_from,
    fk_info.[table] as table_to,
    fk_info.[to] as column_to
from
    sqlite_master
join
    pragma_foreign_key_list(sqlite_master.name) as fk_info
order by
    sqlite_master.name

Here’s that query run against my github-to-sqlite demo database. Exported as JSON, the results look like this:

[
    {
        "table_from": "commits",
        "column_from": "committer",
        "table_to": "users",
        "column_to": "id"
    },
    {
        "table_from": "commits",
        "column_from": "author",
        "table_to": "users",
        "column_to": "id"
    }
    ...
]

I used this to put together an Observable Notebook which takes this data (from any public Datasette instance if you provide it with a URL) and renders those relationships as a force-directed graph using D3.

Force-directed D3 graph of foreign keys

Apple Photos breakthrough

I’ll write about this more in a separate post, but I had a huge breakthrough with my photos-to-sqlite project this week.

The goal of the project is to give me SQL access to the photos I’ve taken. I use Apple Photos for this, and my holy grail has always been the metadata that Apple generate around my photos using machine learning.

Rhet Turnbull’s fantastic osxphotos project gave me my first breakthrough, allowing me to easily start importing albums, people (from on-device facial recognition) and places.

Apple run sophisticated label classification against my photos on device—so if I search for “dog” they show me all of the photos I’ve taken of dogs. I was certain that this information was in a SQLite database somewhere, and this week I finally figured out where: the ~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite file.

After a whole lot of digging around I worked out how to join it against my photos database, and as a result I can now run SQL queries that select all of my photos of dogs! Or seals! Or pelicans!

Combine that with some intriguing automated photo quality scores I found (order by ZPLEASANTCOMPOSITIONSCORE is a thing) and my Dogsheep photos solution is beginning to take shape. I’ll write it up in more detail over the next few days.

Other projects

  • conditional-get 0.2a now streams large downloads to disk and shows a progress bar in -v mode.
  • datasette-atom 0.5 adds support for the Atom author element, so you can add author information to feed entries.
  • sqlite-utils went through versions 2.7, 2.7.1, 2.7.2 and 2.8—see release notes.
  • github-to-sqlite 2.1 and 2.2 introduced two new commands:
    • github-to-sqlite scrape-dependents scrapes the GitHub dependents page (which isn’t yet in an API) to gather details of repos that depend on your repos
    • github-to-sqlite stargazers fetches the users who have starred a specific list of repositories
    • I also added several new SQL views for better exploring the retrieved data

My github-to-sqlite demo is updated daily via a scheduled GitHub Action and retrieves all kinds of data about my various Dogsheep projects plus sqlite-utils and Datasette. Being able to see things like the most recent dependent repos across my projects in one place is extremely useful.

TIL this week

Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite one month ago

I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.

Archiving coronavirus.data.gov.uk

The UK government have a new portal website sharing detailed Coronavirus data for regions around the country, at coronavirus.data.gov.uk.

As with everything else built in 2020, it’s a big single-page JavaScript app. Matthew Somerville investigated what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.

One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew’s article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from https://c19downloads.azureedge.net/downloads/data/data_latest.json.

Any time I see a file like this my first questions are how often does it change—and what kind of changes are being made to it?

I’ve written about scraping to a git repository (see my new gitscraping tag) a bunch in the past:

Now that I’ve figured out a really clean way to Commit a file if it changed in a GitHub Action, knocking out new versions of this pattern is really quick.

simonw/coronavirus-data-gov-archive is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a commit history of changes made to the underlying data.

The first implementation was extremely simple—here’s the entire action:

name: Fetch latest data

on:
  push:
  repository_dispatch:
  schedule:
    - cron:  '25 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . > data_latest.json
        curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . > utlas.geojson
        curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . > countries.geojson
        curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . > regions.geojson
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

It uses a combination of curl and jq (both available in the default worker environment) to pull down the data and pretty-print it (better for readable diffs), then commits the result.

Matthew Somerville pointed out that inefficient polling sets a bad precedent. Here I’m hitting azureedge.net, the Azure CDN, so that didn’t particularly worry me—but since I want this pattern to be used widely it’s good to provide a best-practice example.

Figuring out the best way to make conditional get requests in a GitHub Action led me down something of a rabbit hole. I wanted to use curl’s new ETag support but I ran into a curl bug, so I ended up rolling a simple Python CLI tool called conditional-get to solve my problem. In the time it took me to release that tool (just a few hours) a new curl release came out with a fix for that bug!
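
Conceptually a conditional GET just replays the ETag from the previous fetch and skips the download on a 304. Here is a sketch of that idea in Python (the workflow itself uses conditional-get rather than this code, and the .etag state file is hypothetical):

import json
import pathlib

import requests

URL = "https://c19downloads.azureedge.net/downloads/data/data_latest.json"
ETAG_FILE = pathlib.Path("data_latest.json.etag")

headers = {}
if ETAG_FILE.exists():
    # Replay the ETag from the last successful fetch.
    headers["If-None-Match"] = ETAG_FILE.read_text()

response = requests.get(URL, headers=headers)
if response.status_code == 304:
    print("Not modified - nothing to commit")
else:
    # Pretty-print for readable diffs, like the jq step in the workflow.
    pathlib.Path("data_latest.json").write_text(json.dumps(response.json(), indent=2) + "\n")
    if "ETag" in response.headers:
        ETAG_FILE.write_text(response.headers["ETag"])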

Here’s the workflow using my conditional-get tool. See the issue thread for all of the other potential solutions, including a really neat Action shell-script solution by Alf Eaton.

To my absolute delight, the project has already been forked once by Daniel Langer to capture Canadian Covid-19 cases!

New Datasette features

I pushed two new features to Datasette master, ready for release in 0.41.

Configuration directory mode

This is an idea I had while building datasette-publish-now. Datasette instances can be run with custom metadata, custom plugins and custom templates. I’m increasingly finding myself working on projects that run using something like this:

$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json \
    --template-dir=templates \
    --plugins-dir=plugins

Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead be handled by creating the following layout:

my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py

Then run Datasette directly targeting that directory:

$ datasette my-project/

See issue #731 for more details. Directory configuration mode is documented here.

Define custom pages using templates/pages

In niche-museums.com, powered by Datasette, I described how I built the www.niche-museums.com website as a heavily customized Datasette instance.

That site has /about and /map pages which are served by custom templates—but I had to do some gnarly hacks with empty about.db and map.db files to get them to work.

Issue #648 introduces a new mechanism for creating this kind of page: create a templates/pages/map.html template file and custom 404 handling code will ensure that any hits to /map serve the rendered contents of that template.

This could work really well with the datasette-template-sql plugin, which allows templates to execute arbitrary SQL queries (à la PHP or ColdFusion).

Here’s the new documentation on custom pages, including details of how to use the new custom_status(), custom_header() and custom_redirect() template functions to go beyond just returning HTML.

photos-to-sqlite

My Dogsheep personal analytics project brings my tweets, GitHub activity, Swarm checkins and more together in one place. But the big missing feature is my photos.

As-of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new photos-to-sqlite tool. I can run the following SQL query and get back ten random photos!

select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10

photos.simonwillison.net is running a modified version of my heic-to-jpeg image converting and resizing proxy, which I’ll release at some point soon.

There’s still plenty of work to do—I still need to import EXIF data (including locations) into SQLite, and I plan to use osxphotos to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.

TIL this week


Weeknotes: Datasette 0.40, various projects, Dogsheep photos one month ago

A new release of Datasette, two new projects and progress towards a Dogsheep photos solution.

Datasette 0.40

I released Datasette 0.40 last night. Full release notes are here, but the key feature in this release is the ability to provide metadata in a metadata.yaml file as an alternative to metadata.json. This is particularly useful for embedded multi-line SQL queries: I’ve upgraded simonw/museums and simonw/til to take advantage of this, since they both use their metadata to define SQL queries that power their search pages and Atom feeds.

A JSK fellows directory and twitter-to-sqlite 0.21

My JSK Fellowship at Stanford ends in a few months. JSK has extremely talented and influential alumni, and one of the benefits of the fellowship is becoming part of that network afterwards.

The @JSKStanford Twitter account maintains lists of fellows on Twitter—journalists love Twitter!—so I decided to use my twitter-to-sqlite tool to build a Datasette-powered search engine of them.

That search engine is now running at jsk-fellows.datasettes.com. It’s updated daily by a GitHub Action to capture any bio changes or new list entrants.

It’s a neat example of taking advantage of SQLite views to build faceted search across a subset of data. A script constructs the jsk_fellows view at build time, then metadata.json configures that view to run full-text search and facet by the derived fellowship column.

I shipped twitter-to-sqlite 0.21 with a new twitter-to-sqlite lists username command as part of this project.

TILs and datasette-template-sql 1.0

I described my new TILs project on Monday. I’ve published 15 so far—the format is working really well for me.

Hacking on simonw/tils reminded me of a feature gap in my datasette-template-sql plugin: it didn’t have a solution for safely escaping parameters in SQL queries, leading to nasty string concatenated SQL queries.

datasette-template-sql 1.0 fixes that issue, at the cost of backwards compatibility with previous releases. I’m using it for both til and museums now.

github-to-sqlite 2.0

I released github-to-sqlite 2.0 with a small backwards incompatible change to the database schema (hence the major version increment). It builds on 1.1 from a few days ago which added a new github-to-sqlite contributors command for fetching statistics on contributors to repositories.

More importantly, I improved the live demo running at github-to-sqlite.dogsheep.net.

The demo now updates once a day using GitHub Actions and pulls in releases, commits, issues, issue comments and contributors for all of my Dogsheep projects plus datasette and sqlite-utils.

This means I can browse and execute SQL queries across 929 issues, 1,505 commits and 132 releases across 14 repositories!

Want to see which of my projects have had the most releases? Facet releases by repo.

I’ve also installed the datasette-search-all plugin there, so you can search across all commits, releases, issues etc for “zeit now” for example.

Bringing all of my different project data together in one place like this is really powerful.

I think it’s a great illustration of the Datasette/Dogsheep philosophy of pulling down a complete SQLite-powered copy of data from external services so you can query and join across your data without being limited to the functionality that those services provide through their own interfaces or APIs.

photos-to-sqlite alpha

Dogsheep is about bringing all of my interesting personal and social data into a single, private place.

The biggest thing missing at the moment is photos. I want to be able to query my photos with SQL, and eventually combine them with tweets, checkins etc in a unified timeline.

Last week I took a step towards this goal with heic-to-jpeg, a proxy to let me display my iPhone’s HEIC photos online.

This week I started work on photos-to-sqlite—the set of tools which I’ll use to turn my photos into something I can run queries against.

So far I’ve mainly been figuring out how to get them into an S3 bucket that I control. Once configured, running photos-to-sqlite upload photos.db ~/Pictures/Photos\ Library.photoslibrary/originals will start uploading every photo it can find in that directory to the S3 bucket.

The filename it uses is the sha256 hash of the photo file contents, which I’m hoping will let me de-dupe photos from multiple sources in the future. It also writes basic metadata on the photos to that photos.db SQLite database.

This is going to be a big project. I’m investigating osxphotos to liberate the metadata from Apple Photos, and various Python libraries for extracting EXIF data from the files themselves.

Once I’ve got that working, I can experiment with things like piping photos through Google Cloud Vision to label them based on their contents.

This is all a very, very early alpha at the moment, but I’m cautiously optimistic about progress so far.

Elsewhere

Yesterday

  • Working Backwards: A New Version Of Amazon’s “Press Release” Approach To Plan Customer-Centric Projects (via) I’ve long wanted to give the Amazon “future press release” trick a go—start a project by writing the imaginary press release that would announce that project to the world, in order to focus on understanding what the project is for and how it will deliver value. Jeff Gothelf has put a lot of thought into this and constructed a thorough looking template for writing one of these that covers a number of different important project aspects. #

1st June 2020

  • Get Started - Materialize. Materialize is a really interesting new database—“a streaming SQL materialized view engine”. It builds materialized views on top of streaming data sources (such as Kafka)—you define the view using a SQL query, then it figures out how to keep that view up-to-date automatically as new data streams in. It speaks the PostgreSQL protocol so you can talk to it using the psql tool or any PostgreSQL client library. The “get started” guide is particularly impressive: it uses a curl stream of the Wikipedia recent changes API, parsed using a regular expression. And it’s written in Rust, so installing it is as easy as downloading and executing a single binary (though I used Homebrew). #

29th May 2020

  • Practical Python Programming (via) David Beazley has been developing and presenting this three day Python course (aimed at people with some prior programming experience) for over thirteen years, and he’s just released the course materials under a Creative Commons license for the first time. #
  • Deno is a Browser for Code (via) One of the most interesting ideas in Deno is that code imports are loaded directly from URLs—which can themselves depend on other URL-based packages. On first encounter it feels wrong—obviously insecure. Deno contributor Kitson Kelly provides a deeper exploration of the idea, and explains how the combination of caching and lock files makes it no less secure than code installed from npm or PyPI. #

28th May 2020

  • Advice on specifying more granular permissions with Google Cloud IAM (via) My single biggest frustration working with both Google Cloud and AWS is permissions: more specifically, figuring out what the smallest set of permissions are that I need to assign in order to achieve different goals. Katie McLaughlin’s new series aims to address exactly that problem. I learned a ton from this that I’ve previously missed, and there’s plenty of actionable advice on tooling that can be used to help figure this stuff out. #
  • Any time you can think of something that is possible this year and wasn’t possible last year, you should pay attention. You may have the seed of a great startup idea. This is especially true if next year will be too late.

    Sam Altman #

27th May 2020

  • Why we use homework to recruit engineers. Ad Hoc run a remote-first team, and use detailed homework assignments as part of their interview process in place of in-person technical interviews. The homework assignments are really interesting to browse through—“Containerize” for example involves building a Docker container to run a Python app with nginx and a modern cipher suite. I’m nervous about the extra burden this places on candidates, but Ad Hoc address that: “We recognize that we’re asking folks to invest time into our process, but we feel like our homework compares favorably to extensive on-site interviews or other evaluation techniques, especially for candidates who have responsibilities outside of their work life.” #

26th May 2020

  • AWS services explained in one line each (via) Impressive effort to summarize all 163(!) AWS services—this helped clarify a whole bunch that I hadn’t figured out yet. Only a few defeated the author, with a single question mark for the description. I enjoyed Amazon Braket: “Some quantum thing. It’s in preview so I have no idea what it is.” #
  • Serving photos locally with datasette-media. datasette-media is a new Datasette plugin which can serve static files from disk in response to a configured SQL query that maps incoming URL parameters to a path to a file. I built it so I could run dogsheep-photos locally on my laptop and serve up thumbnails of images that match particular queries. I’ve added documentation to the dogsheep-photos README explaining how to use datasette-media, datasette-json-html and datasette-template-sql to create custom interfaces onto Apple Photos data on your machine. #
  • Waiting in asyncio. Handy cheatsheet explaining the differences between asyncio.gather(), asyncio.wait_for(), asyncio.as_completed() and asyncio.wait() by Hynek Schlawack. #

22nd May 2020

  • Using SQL to Look Through All of Your iMessage Text Messages (via) Dan Kelch shows how to access the iMessage SQLite database at ~/Library/Messages/chat.db—it’s protected under macOS Catalina so you have to enable Full Disk Access in the privacy settings first. I usually use the macOS terminal app but I installed iTerm for this because I’d rather enable full disk access to a separate terminal program than let anything I’m running in my regular terminal take advantage of it. It worked! Now I can run “datasette ~/Library/Messages/chat.db” to browse my messages. #
  • Food consumption really only grows at the rate of population growth, so if you want to grow faster than that, you have to take market share from someone else. Ideally, you take it from someone weaker, who has less information. In this industry, the delivery platforms have found unsuspecting victims in restaurants and drivers.

    Collin Wallace #

20th May 2020

  • Company culture is the shared way everyone acts when you aren’t around to see it

    Adam Kalsey #

18th May 2020

  • Doordash and Pizza Arbitrage (via) In which a Pizza restaurant owner notices that Doordash, uninvited, have started offering their $24 pizzas for $16 and starts ordering their own pizzas and keeping the difference. #

14th May 2020

  • Web apps are typically continuously delivered, not rolled back, and you don’t have to support multiple versions of the software running in the wild. This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest to adopt a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.

    Vincent Driessen #