Simon Willison’s Weblog

On django, covid19, github, security, dogsheep, ...

 

Recent entries

Weeknotes: incremental improvements four days ago

I’ve been writing my talk for PyCon Argentina this week, which has proved surprisingly time consuming. I hope to have that wrapped up soon—I’m pre-recording it, which it turns out is much more work than preparing a talk to stream live.

I’ve made bits and pieces of progress on a whole bunch of different projects. Here are my notes on Datasette, plus an annotated version of my other releases-this-week.

Datasette 0.51a0

Datasette’s base_url configuration option is designed to help run Datasette behind a proxy—so you can configure Apache or nginx to proxy /my-datasette/ to a Datasette instance and have every internal link work correctly.

It doesn’t completely work. I gathered all of the bugs with it in a tracking issue, addressed as many of them as I could and released Datasette 0.51a0 as a testing alpha.

If you run Datasette behind a proxy, please try out this new alpha and tell me if it works for you! Testing help is requested here.

Also in the alpha:

  • New datasette.urls URL builder for plugins, see Building URLs within plugins and the sketch after this list. (#904)
  • Removed --debug option, which didn’t do anything. (#814)
  • Link: HTTP header pagination. (#1014)
  • x button for clearing filters. (#1016)
  • Edit SQL button on canned queries. (#1019)
  • --load-extension=spatialite shortcut. (#1028)
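
Here’s a rough sketch of how a plugin might use that URL builder so its links keep working behind a proxy. The extra_template_vars hook and the datasette.urls methods are documented APIs; the database, table and custom path are invented for illustration:

from datasette import hookimpl


@hookimpl
def extra_template_vars(datasette):
    # These helpers prepend the configured base_url, so the links stay
    # correct when Datasette is proxied under /my-datasette/ or similar
    return {
        "home_url": datasette.urls.instance(),
        "notes_url": datasette.urls.table("evernote", "notes"),
        "report_url": datasette.urls.path("/-/my-plugin/report"),
    }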

Other releases this week

sphinx-to-sqlite 0.1a1 and 0.1a

One of the features I’m planning for the official Datasette website is combined search across issues, commits, releases, plugins and documentation—powered by my Dogsheep Beta search engine.

This means I need to load Datasette’s documentation into a SQLite database. sphinx-to-sqlite is my new tool for doing that: it uses the optional XML output from Sphinx to create a SQLite table populated with sections from the documentation, since these seem like the right unit to run search against.
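
The core idea is simple enough to sketch. This is not the actual sphinx-to-sqlite implementation, and the table and column names here are assumptions, but it shows the shape of the approach: parse the Sphinx XML output with ElementTree, pull out each section and write it to SQLite with sqlite-utils.

import sqlite_utils
from xml.etree import ElementTree as ET


def load_sections(xml_path, db_path="docs.db"):
    # Simplified sketch: real Sphinx/docutils XML has more structure than this
    db = sqlite_utils.Database(db_path)
    tree = ET.parse(xml_path)
    rows = []
    for section in tree.iter("section"):
        title = section.findtext("title") or ""
        rows.append({
            "id": section.get("ids") or title,
            "title": title,
            "content": " ".join(section.itertext()).strip(),
        })
    db["sections"].insert_all(rows, pk="id", replace=True)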

I’m now using this to build a Datasette instance at latest-docs.datasette.io with the latest documentation on every commit.

datasette-cluster-map 0.14 and 0.13

The default marker popup for datasette-cluster-map is finally a human-readable window, not a blob of JSON! You can see that in action on the global-power-plants demo.

inaturalist-to-sqlite 0.2.1, pocket-to-sqlite 0.2.1

I tried out the new PyPI resolver and found that it is a lot less tolerant of ~= vs. >= dependencies, so I pushed out new releases of these two packages.
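
If the distinction is unfamiliar: ~= is PEP 440’s “compatible release” operator, which constrains versions much more tightly than >=. A hypothetical setup.py fragment showing the two spellings:

from setuptools import setup

setup(
    name="example-to-sqlite",  # hypothetical package
    version="0.2.1",
    install_requires=[
        # "sqlite-utils~=2.4.3" is a compatible release pin:
        # equivalent to >=2.4.3, ==2.4.* and much harder to resolve
        # alongside other packages that pin differently
        "sqlite-utils>=2.4.3",
    ],
)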

datasette-json-preview 0.2

I’m using this plugin to preview the new default JSON representation I’m planning for Datasette 1.0. Carl Johnson provided some useful feedback leading to this new iteration, which now looks like this.

github-to-sqlite 2.7

Quoting the release notes:

github-to-sqlite repos command now takes options --readme and --readme-html, which write the README or rendered HTML README into the readme or readme_html columns, respectively. #52

Another feature I need for the Datasette website search engine, described above.

dogsheep-beta 0.9

My personal search engine, described in the latest Datasette Weekly newsletter. This release added the ability to facet by date, part of ongoing work on a timeline view.

I also updated it to take advantage of the datasette.client internal API mechanism introduced in Datasette 0.50.

healthkit-to-sqlite 1.0

Another project bumped to 1.0. Only a small bug fix here: this can now import Apple HealthKit data from devices that use languages other than English.

TIL this week

I finally upgraded this blog to show recently added “Elsewhere” content (bookmarks and quotations) interspersed with my main entries in mobile view. I wrote up this TIL to explain what I did.

Weeknotes: evernote-to-sqlite, Datasette Weekly, scrapers, csv-diff, sqlite-utils 12 days ago

This week I built evernote-to-sqlite (see Building an Evernote to SQLite exporter), launched the Datasette Weekly newsletter, worked on some scrapers and pushed out some small improvements to several other projects.

The Datasette Weekly newsletter

After procrastinating on it for several months I finally launched the new Datasette Weekly newsletter!

My plan is to put this out once a week with a combination of news from the Datasette/Dogsheep/sqlite-utils ecosystem of tools, plus tips and tricks for using them to solve data problems.

You can read the first edition here, which covers Datasette 0.50, git scraping, sqlite-utils extract and features datasette-graphql as the plugin of the week.

I’m using Substack because people I trust use it for their newsletters and I decided that picking an option and launching was more important than spending even more time procrastinating on picking the best possible newsletter platform. So far it seems fit for purpose, and it provides an export option should I decide to move to something else.

Writing scrapers with a Python+JavaScript hybrid

I’ve been writing some scraper code to help out with a student journalism project at Stanford. I ended up using Selenium Python running in a Jupyter Notebook.

Historically I’ve avoided Selenium because of how weird and complex it used to be. I’ve now completely changed my mind: these days it’s a really solid option for browser automation driven by Python, thanks to chromedriver and geckodriver, which I recently learned can be installed using Homebrew.

My preferred way of writing scrapers is to do most of the work in JavaScript. The combination of querySelector(), querySelectorAll(), fetch() and the new-to-me DOMParser class makes light work of extracting data from any shape of HTML, and browser DevTools mean that I can interactively build up scrapers by pasting code directly into the console.

My big breakthrough this week was figuring out how to write scrapers as a Python-JavaScript hybrid. The Selenium driver.execute_script() and driver.execute_async_script() (TIL) methods make it trivial to execute chunks of JavaScript from Python and get back the results.

This meant I could scrape pages one at a time using JavaScript and save the results directly to SQLite via sqlite-utils. I could even run database queries on the Python side to skip items that had already been scraped.
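
Here’s a rough sketch of that hybrid pattern. The target site, its markup and the table names are invented, but the Selenium and sqlite-utils calls are real APIs:

import sqlite_utils
from selenium import webdriver

db = sqlite_utils.Database("scraped.db")
driver = webdriver.Chrome()  # assumes chromedriver is on the PATH

# The extraction logic lives in JavaScript and runs inside the page
EXTRACT_JS = """
return Array.from(document.querySelectorAll('article')).map(el => ({
    id: el.id,
    title: el.querySelector('h2').textContent.trim(),
    url: el.querySelector('a').href
}));
"""

# Query the database on the Python side to skip already-scraped pages
already_done = set()
if "pages" in db.table_names():
    already_done = {row[0] for row in db.conn.execute("select url from pages")}

for page in range(1, 6):
    url = f"https://example.com/articles?page={page}"  # hypothetical site
    if url in already_done:
        continue
    driver.get(url)
    items = driver.execute_script(EXTRACT_JS)  # comes back as a list of dicts
    db["items"].insert_all(items, pk="id", replace=True)
    db["pages"].insert({"url": url}, pk="url", replace=True)

Anything JSON-serializable returned by the JavaScript arrives in Python as plain lists and dicts, which is exactly the shape sqlite-utils wants to insert.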

csv-diff 1.0

I’m trying to get more of my tools past the 1.0 mark, mainly to indicate to potential users that I won’t be breaking backwards compatibility without bumping them to 2.0.

I built csv-diff for my San Francisco Trees project last year. It produces human-readable diffs for CSV files.

The version 1.0 release notes are as follows:

  • New --show-unchanged option for outputting the unchanged values of rows that had at least one change. #9
  • Fix for bug with column names that contained a . character. #7
  • Fix for error when no --key provided—thanks, @MainHanzo. #3
  • CSV delimiter sniffer now detects ; delimited files. #6

sqlite-utils 2.22

sqlite-utils 2.22 adds some minor features—an --encoding option for processing TSV and CSV files in encodings other than UTF-8, and more support for loading SQLite extension modules.

Full release notes:

  • New --encoding option for processing CSV and TSV files that use a non-utf-8 encoding, for both the insert and update commands. (#182)
  • The --load-extension option is now available to many more commands. (#137)
  • --load-extension=spatialite can be used to load SpatiaLite from common installation locations, if it is available. (#136)
  • Tests now also run against Python 3.9. (#184)
  • Passing pk=["id"] now has the same effect as passing pk="id". (#181)

Datasette

No new release yet, but I’ve landed some small new features to the main branch.

Inspired by the GitHub and WordPress APIs, Datasette’s JSON API now supports Link: HTTP header pagination (#1014).

This is part of my ongoing effort to redesign the default JSON format ready for Datasette 1.0. I started a new plugin called datasette-json-preview to let me iterate on that format independent of Datasette itself.
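
To give an idea of what the new Link: header support enables, here’s a hedged sketch of a client that keeps following rel="next" links until it has every row. The instance URL is made up, it assumes the endpoint actually sends the new headers, and the header parsing is deliberately naive:

import httpx


def fetch_all_rows(url):
    # Keep following Link: <...>; rel="next" headers until they stop
    rows = []
    while url:
        response = httpx.get(url)
        rows.extend(response.json())
        url = None
        for part in response.headers.get("link", "").split(","):
            if 'rel="next"' in part:
                url = part.split(";")[0].strip().strip("<>")
    return rows


rows = fetch_all_rows("https://datasette.example.com/mydb/mytable.json?_shape=array")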

Jacob Fenton suggested an “Edit SQL” button on canned queries. That’s a great idea, so I built it—this issue comment links to some demos, e.g. this one here.

I added an “x” button for clearing filters to the table page (#1016) demonstrated by this GIF:

Animation demonstrating the new x button next to filters

Building an Evernote to SQLite exporter 12 days ago

I’ve been using Evernote for over a decade, and I’ve long wanted to export my data from it so I can do interesting things with it.

Ideally I’d use their API for that, so I could continually export new notes via a cron. Unfortunately the API is implemented using Thrift (the only public API I’ve seen to use that) and my attempts to speak to it from Python have so far failed to get anywhere.

Last weekend I decided to try using their “Export notes” feature instead, and its ENEX XML format.

ENEX—the Evernote XML export format

Select a note—or multiple notes—in the Evernote macOS desktop app, use the File -> Export Notes menu item and Evernote will create a Notes.enex exported XML file.

These files can get BIG, because any images attached to your notes will be embedded as base64 encoded binary data within the XML. My export is 2.8GB!

When faced with unknown XML like this it’s useful to get a high-level overview of what tags are present. In 2009 I wrote a Python script for this—as part of this project I updated it for Python 3 and pushed a release to PyPI.

Here’s a subset of the output when run against my Notes.enex file (full output is here).

{
    "note": {
        "attr_counts": {},
        "child_counts": {
            "content": 2126,
            "created": 2126,
            "note-attributes": 2126,
            "resource": 2605,
            "tag": 45,
            "title": 2126,
            "updated": 2126
        },
        "count": 2126,
        "parent_counts": {
            "en-export": 2126
        }
    },
    "note-attributes": {
        "attr_counts": {},
        "child_counts": {
            "altitude": 1466,
            "application-data": 449,
            "author": 998,
            "classifications": 51,
            "content-class": 387,
            "latitude": 1480,
            "longitude": 1480,
            "reminder-done-time": 2,
            "reminder-order": 2126,
            "reminder-time": 1,
            "source": 1664,
            "source-application": 423,
            "source-url": 85,
            "subject-date": 10
        },
        "count": 2126,
        "parent_counts": {
            "note": 2126
        }
    }
}

This shows me that every note is represented as a <note> element, and crucial metadata lives in children of a <note-attributes> child element.

(I thought I’d need to tweak the script for performance since it works by loading the entire file into memory, but my laptop has 32GB of RAM so it didn’t even blink.)

ENEX limitations

I ran into some significant limitations while working with my ENEX export.

  • Notebooks are not represented in the file at all—you just get the notes. So I can’t tell which recipes were in my “cooking” notebook. It’s possible to work around this by manually exporting the notes from each notebook one at a time and storing them in separate export files. I didn’t bother.
  • Some of the data in the file—the note content itself for example—consists of further blocks of XML embedded in CDATA. This means you have to run a nested XML parser for every note you process.
  • The notes XML <en-note> format is mostly XHTML, but includes custom <en-media hash="..."> tags where inline images should be displayed.
  • Those inline images are CDATA-encoded base64 strings. They don’t include a hash, but I figured out that decoding the base64 string and then running it through MD5 generates the hash that is used in the corresponding <en-media> tag (see the sketch after this list).
  • Notes in the export don’t have any form of unique ID!
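
That hash relationship is easy to express in code. A minimal sketch, assuming base64_data holds the text of a resource’s <data> element:

import base64
import hashlib


def resource_hash(base64_data):
    # The MD5 of the decoded bytes matches the hash="..." attribute
    # on the corresponding <en-media> tag in the note content
    return hashlib.md5(base64.b64decode(base64_data)).hexdigest()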

That last limitation—the lack of unique IDs—is a huge pain. It means that any export is by necessity a one-time operation—if you edit a note and attempt a re-export you will get a brand new record that can’t be automatically used to update the previous one.

This is particularly frustrating because I know that Evernote assigns a GUID to each note—they just don’t include them in the export file.

Writing an exporter

My Dogsheep family of tools aims to liberate personal data from all kinds of different sources and convert it into SQLite, which means I can explore, query and visualize it using Datasette.

dogsheep/evernote-to-sqlite is my latest entry in that series.

I based it on healthkit-to-sqlite because that, too, has to deal with a multi-GB XML file. Both tools use the Python standard library’s XMLPullParser, which makes XML elements available as a stream without needing to load the entire file into memory at once.

The conversion code is here. It ended up being relatively straightforward, using ElementTree to extract data from the XML and sqlite-utils to write it to a database.

It implements a progress bar by tracking the number of bytes that have been read from the underlying XML file.
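
Here’s a stripped-down sketch of that streaming approach. It is not the actual evernote-to-sqlite code, which also handles resources, note attributes and OCR data, and the table and column choices are illustrative:

import sqlite_utils
from xml.etree.ElementTree import XMLPullParser


def convert(enex_path, db_path):
    db = sqlite_utils.Database(db_path)
    parser = XMLPullParser(events=("start", "end"))
    with open(enex_path, "r", encoding="utf-8") as fp:
        while True:
            chunk = fp.read(1024 * 1024)
            if not chunk:
                break
            parser.feed(chunk)
            for event, el in parser.read_events():
                if event == "end" and el.tag == "note":
                    db["notes"].insert({
                        "title": el.findtext("title"),
                        "created": el.findtext("created"),
                        "updated": el.findtext("updated"),
                        "content": el.findtext("content"),
                    })
                    el.clear()  # free the processed note's children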

One optimization: originally I created a single resources table with the resource metadata and a BLOB column containing the binary image contents.

This table was huge—over a GB—and had very poor performance for operations such as querying and faceting across unindexed columns—because any table scan had to work through MBs of binary data.

My personal Evernote notebook has around 2,000 resources. 2,000 rows of metadata should usually be very fast to query.

So... I split the binary data out into a two column resources_data table—md5 primary key and a BLOB for data. This massively sped up queries against that resources table.
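
A minimal sketch of that two-table layout using sqlite-utils, with illustrative column names:

import sqlite_utils

db = sqlite_utils.Database("evernote.db")


def save_resource(md5_hash, mime_type, filename, blob):
    # Small metadata row: cheap to scan, facet and query
    db["resources"].insert({
        "md5": md5_hash,
        "mime": mime_type,
        "filename": filename,
    }, pk="md5", replace=True)
    # The binary payload lives in its own two-column table, so table
    # scans against the metadata never have to wade through the blobs
    db["resources_data"].insert({
        "md5": md5_hash,
        "data": blob,
    }, pk="md5", replace=True)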

And it works! Running the following command produces a SQLite database file containing all of my notes, note metadata and embedded images:

evernote-to-sqlite enex evernote.db MyNotes.enex

Serving up formatted notes

Figuring out the best way to serve up the note content in Datasette is still a work in progress—but I’ve hacked together a delightfully terrible way of doing this using a one-off plugin.

Datasette plugins are usually packaged and installed via PyPI, but there’s an alternative option for plugins that don’t warrant distribution: create a plugins/ directory, drop in one or more Python files and start Datasette pointing --plugins-dir at that directory.

This is great for hacking on terrible ideas. Here’s the evernote.py plugin I wrote in full:

from datasette import hookimpl
import jinja2

START = "<en-note"
END = "</en-note>"
TEMPLATE = """
<div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;">{}</div>
""".strip()

EN_MEDIA_SCRIPT = """
Array.from(document.querySelectorAll('en-media')).forEach(el => {
    let hash = el.getAttribute('hash');
    let type = el.getAttribute('type');
    let path = `/evernote/resources_data/${hash}.json?_shape=array`;
    fetch(path).then(r => r.json()).then(rows => {
        let b64 = rows[0].data.encoded;
        let data = `data:${type};base64,${b64}`;
        el.innerHTML = `<img style="max-width: 300px" src="${data}">`;
    });
});
"""

@hookimpl
def render_cell(value, table):
    if not table:
        # Don't render content from arbitrary SQL queries, could be XSS hole
        return
    if not value or not isinstance(value, str):
        return
    value = value.strip()
    if value.startswith(START) and value.endswith(END):
        trimmed = value[len(START) : -len(END)]
        trimmed = trimmed.split(">", 1)[1]
        # Replace those horrible double newlines
        trimmed = trimmed.replace("<div><br /></div>", "<br>")
        return jinja2.Markup(TEMPLATE.format(trimmed))

@hookimpl
def extra_body_script():
    return EN_MEDIA_SCRIPT

This uses two Datasette plugin hooks.

render_cell() is called every time Datasette’s interface renders the value of a column. Here I’m looking for the <en-note> and </en-note> tags and, if they are present, stripping them off and marking their inner content as safe to display using jinja2.Markup() (without this they would be automatically HTML escaped).

extra_body_script() can be used to inject additional JavaScript at the bottom of the page. I’m injecting some particularly devious JavaScript which scans the page for Evernote’s <en-media> tags and, if it finds any, runs a fetch() to get the base64-encoded data from Datasette and then injects that into the page as a 300px wide image element using a data: URI.

Like I said, this is a pretty terrifying hack! But it works just fine, and my notes are now visible inside my personal Datasette instance:

One of my notes displayed in Datasette, with an embedded image

Bonus feature: search by OCR

An interesting feature of Evernote is that it runs cloud-based OCR against images in your notes, enabling you to search across the text contained within them.

It turns out the output of this OCR process is included in the export XML. It looks like this:

<recoIndex
  docType="unknown"
  objType="image" objID="76dd28b07797cc9f3f129c4871c5293c" 
  engineVersion="7.0.24.1" 
  recoType="service" 
  lang="en" 
  objWidth="670" 
  objHeight="128">
  <item x="26" y="52" w="81" h="29">
    <t w="76">This</t>
  </item>
  <item x="134" y="52" w="35" h="29">
    <t w="79">is</t>
  </item>
  <item x="196" y="60" w="37" h="21">
    <t w="73">so</t>
  </item>
  <item x="300" y="60" w="57" h="21">
    <t w="71">can</t>
  </item>
  <item x="382" y="54" w="79" h="27">
    <t w="77">test</t>
  </item>
  <item x="486" y="52" w="59" h="29">
    <t w="82">the</t>
  </item>
  <item x="570" y="54" w="59" h="25">
    <t w="74">OCR</t>
    <t w="33">DeR</t>
    <t w="15">OCR!</t>
    <t w="14">OCR]</t>
  </item>
</recoIndex>

As far as I can tell an <item> gets multiple <t> elements only if the OCR wasn’t 100% sure.

evernote-to-sqlite creates a searchable ocr text field using values from that XML, so I can search my images in Datasette.
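
The extraction can be sketched in a few lines: parse the recoIndex XML, keep the highest-confidence <t> alternative from each <item>, and join the words into a single searchable string. This is a simplified illustration rather than the exact evernote-to-sqlite logic:

from xml.etree import ElementTree as ET


def ocr_text(reco_index_xml):
    # Each <item> is a word position; the w attribute on <t> is a
    # confidence score, so keep the most confident alternative
    root = ET.fromstring(reco_index_xml)
    words = []
    for item in root.iter("item"):
        alternatives = item.findall("t")
        if alternatives:
            best = max(alternatives, key=lambda t: int(t.get("w", "0")))
            words.append(best.text or "")
    return " ".join(words)

Run against the example above, this produces “This is so can test the OCR”.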

Weeknotes: Mainly Datasette 0.50 18 days ago

Most of what I’ve been up to this week is covered in Datasette 0.50: The annotated release notes and Git scraping: track changes over time by scraping to a Git repository.

I just pushed out Datasette 0.50.1 to fix a nasty broken link bug in 0.50.

I recorded a 12 minute video talk for JupyterCon: Using Datasette with Jupyter to publish your data. That will be streamed on Monday and I’ll be answering questions and hanging out during the online conference.

I started Git scraping CA fires data to simonw/ca-fires-history and NHS risk venue alerts to simonw/nhs-risky-venues.

I had a pull request accepted to httpx adding raw_path support to the ASGI request emulation layer. I’ll be using that in Datasette as soon as the next httpx version is released.

Datasette 0.50: The annotated release notes 19 days ago

I released Datasette 0.50 this morning, with a new user-facing column actions menu feature and a way for plugins to make internal HTTP requests to consume the JSON API of their parent Datasette instance.

The column actions menu

The key new feature in this release is the column actions menu on the table page (#891). This can be used to sort a column in ascending or descending order, facet data by that column or filter the table to just rows that have a value for that column.

The table page is the most important page within Datasette: it’s where users interact with database tables.

Prior to 0.50 users could sort those tables by clicking on the column header. If they wanted to sort in descending order they had to click it, wait for the table to reload and then click it a second time.

In 0.50 I’ve introduced a new UI element which I’m calling the column actions menu. Here’s an animation showing it in action on the facetable demo table:

Animated demo of the columns action menu, showing it used to sort a column and select two other columns for faceting

Right now the menu can be used to sort ascending, sort descending or add the column to the current set of selected facets. If a column has any blank values on the current page a menu option to “Show not-blank rows” appears too—you can try that out on the sortable table.

I plan to extend this with more options in the future. I’d also like to make it a documented plugin extension point, so plugins can add their own column-specific actions. I need to figure out a JavaScript equivalent of the Python pluggy plugins mechanism first though, see issue 983.

datasette.client

Plugin authors can use the new datasette.client object to make internal HTTP requests from their plugins, allowing them to make use of Datasette’s JSON API. (#943)

In building the datasette-graphql plugin I ran into an interesting requirement. I wanted to provide efficient keyset pagination within the GraphQL schema, which is actually quite a complex thing to implement.

Datasette already has a robust implementation of keyset pagination, but it’s tangled up in the implementation of the internal TableView class.

It’s not available as a documented, stable Python API... but it IS available via the Datasette JSON API.

Wouldn’t it be great if Datasette plugins could make direct calls to the same externally facing, documented HTTP JSON API that Datasette itself exposes to end users?

That’s what the new datasette.client object does. It’s a thin wrapper around HTTPX AsyncClient (the excellent new Python HTTP library which takes the Requests API and makes it fully asyncio compliant) which dispatches requests internally to Datasette’s ASGI application, without any of the network overhead of an external HTTP request.
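
As a rough illustration, here’s how a plugin might use it. The register_routes() hook and datasette.client.get() are real Datasette APIs; the route, database and table names are invented:

from datasette import hookimpl
from datasette.utils.asgi import Response


async def first_hundred(datasette):
    # Dispatched straight to Datasette's own ASGI app: no network
    # round trip, but the same JSON API that external clients see
    response = await datasette.client.get(
        "/mydb/mytable.json?_shape=array&_size=100"
    )
    return Response.json(response.json())


@hookimpl
def register_routes():
    return [(r"^/-/first-hundred$", first_hundred)]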

One of my goals for Datasette 1.0 is to bring the externally facing JSON API to full, documented, stable status.

The idea of Python plugins being able to efficiently use that same API feels really elegant to me. I’m looking forward to taking advantage of this in my own plugins.

Deploying Datasette documentation

New Deploying Datasette documentation with guides for deploying Datasette on a Linux server using systemd or to hosting providers that support buildpacks. (#514, #997)

The buildpack documentation was inspired by my experiments with the new DigitalOcean App Platform this week. App Platform is a Heroku-style PaaS that implements the Cloud Native Buildpacks standard, which emerged from Heroku’s architecture a few years ago.

I hadn’t realized quite how easy it is to run a custom Python application (such as Datasette) using buildpacks—it’s literally just a GitHub repository with two single-line files in it, requirements.txt and Procfile—the buildpacks mechanism detects the requirements.txt and configures a Python environment automatically.

I deployed my new simonw/buildpack-datasette-demo repo on DigitalOcean, Heroku and Scalingo to try this out. It worked on all three providers with no changes—and all three offer continuous deployment against GitHub where any changes to that repository automatically trigger a deployment (optionally guarded by a CI test suite).

Since I was creating a deployment documentation page I decided to finally address issue 514 and document how I’ve used systemd to deploy Datasette on some of my own projects. I’m very keen to hear from people who try out this recipe so I can continue to improve it over time.

This is part of a series: see also the annotated release notes for Datasette 0.44, 0.45 and 0.49.

Git scraping: track changes over time by scraping to a Git repository 19 days ago

Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really effective, and more people should use it.

The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data—The @nyt_diff Twitter account tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication’s editorial process.

We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.

Here’s a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.

Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size (largest first) reveals this endpoint:

https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

That’s a 241KB JSON endpoint with full details of the various fires around the state.

So... I started running a git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.

Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using jq and commits it back to the repo if it has changed.

This means I now have a commit log of changes to that information about fires in California. Here’s an example commit showing that last night the Zogg Fire’s percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.

Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping 82 to 59, water tenders dropping 31 to 27 and percent contained increasing from 90 to 92.

The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It’s in a file called .github/workflows/scrape.yml which looks like this:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron:  '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

That’s not a lot of code!

It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it and saving the result to incidents.json.

The “commit and push if it changed” block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in this TIL a few months ago.

I have a whole bunch of repositories running git scrapers now. I’ve been labeling them with the git-scraping topic so they show up in one place on GitHub (other people have started using that topic as well).

I’ve written about some of these in the past.

I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It’s an extremely effective way of turning all sorts of interesting data sources into a changelog over time.

Comment thread on this post over on Hacker News.

Elsewhere

26th October 2020

  • Apple now receives an estimated $8 billion to $12 billion in annual payments — up from $1 billion a year in 2014 — in exchange for building Google’s search engine into its products. It is probably the single biggest payment that Google makes to anyone and accounts for 14 to 21 percent of Apple’s annual profits.

    Apple, Google and a Deal That Controls the Internet # 26th October 2020, 3:10 am

22nd October 2020

  • CG-SQL (via) This is the toolkit the Facebook Messenger team wrote to bring stored procedures to SQLite. It implements a custom version of the T-SQL language which it uses to generate C code that can then be compiled into a SQLite module. #22nd October 2020, 6:25 pm
  • Project LightSpeed: Rewriting the Messenger codebase for a faster, smaller, and simpler messaging app (via) Facebook rewrote their iOS messaging app earlier this year, dropping it from 1.7m lines of code to 360,000 and reducing the binary size to a quarter of what it was. A key part of the new app’s architecture is much heavier reliance on SQLite to coordinate data between views, and to dynamically configure how different views are displayed. They even built their own custom system to add stored procedures to SQLite so they could execute portable business logic inside the database. #22nd October 2020, 6:22 pm

21st October 2020

  • Writing the code to sign data with a private key and verify it with a public key would have been easier to get correct than correctly invoking the JWT library. In fact, the iOS app (which gets this right) doesn’t use a JWT library at all, but manages to verify using a public key in fewer lines of code than the Android app takes to incorrectly use a JWT library!

    James 'zofrex' Sanderson # 21st October 2020, 9:34 pm

  • Proof of concept: sqlite_utils magic for Jupyter (via) Tony Hirst has been experimenting with building a Jupyter “magic” that adds special syntax for using sqlite-utils to insert data and run queries. Query results come back as a Pandas DataFrame, which Jupyter then displays as a table. #21st October 2020, 5:26 pm
  • Pikchr. Interesting new project from SQLite creator D. Richard Hipp. Pikchr is a new mini language for describing visual diagrams, designed to be embedded in Markdown documentation. It’s already enabled for the SQLite forum. Implementation is a no-dependencies C library and output is SVG. #21st October 2020, 4:02 pm

19th October 2020

  • This page is a truly naked, brutalist html quine. (via) I love it. #19th October 2020, 6:46 pm
  • Dogsheep: Personal analytics with Datasette. The second edition of my new Datasette Weekly newsletter, talks about Dogsheep, Dogsheep Beta, Datasette 1.0 and features datasette-cluster-map as the plugin of the week. #19th October 2020, 4:38 pm
  • The stampede of the affluent into grim-faced, highly competitive sports has been a tragicomedy of perverse incentives and social evolution in unequal times: a Darwinian parable of the mayhem that can ensue following the discovery of even a minor advantage. Like a peacock rendered nearly flightless by gaudy tail feathers, the overserved athlete is the product of a process that has become maladaptive, and is now harming the very blue-chip demographic it was supposed to help.

    Ruth S. Barrett # 19th October 2020, 3:09 pm

12th October 2020

  • xml-analyser. In building evernote-to-sqlite I dusted off an ancient (2009) project I built that scans through an XML file and provides a summary of what elements are present in the document and how they relate to each other. I’ve now packaged it up as a CLI app and published it on PyPI. #12th October 2020, 12:41 am
  • evernote-to-sqlite (via) The latest tool in my Dogsheep series of utilities for personal analytics: evernote-to-sqlite takes Evernote note exports in their ENEX XML format and loads them into a SQLite database. Embedded images are loaded into a BLOB column and the output of their cloud-based OCR system is added to a full-text search index. Notes have a latitude and longitude which means you can visualize your notes on a map using Datasette and datasette-cluster-map. #12th October 2020, 12:38 am

10th October 2020

  • It’s probably a bad idea to risk paying your ransom, though — the US Treasury Dept has issued clarifying guidance that companies paying off ransomware, and all companies facilitating the payment, can be charged with sanctions violations if the bitcoins end up at North Korea or sanctioned cybercrime groups.

    David Gerard # 10th October 2020, 9:17 pm

  • Datasette Weekly: Datasette 0.50, git scraping, extracting columns (via) The first edition of the new Datasette Weekly newsletter—covering Datasette 0.50, Git scraping, extracting columns with sqlite-utils and featuring datasette-graphql as the first “plugin of the week” #10th October 2020, 9 pm
  • Datasette Weekly (via) I’m trying something new: I’ve decided to start an email newsletter called the Datasette Weekly (I’m already worried I’ll regret that weekly promise) which will share news about Datasette and the Datasette ecosystem, plus tips and tricks for getting the most out of Datasette and SQLite. #10th October 2020, 7:05 pm

9th October 2020

  • Animating a commit based Sudoku game using Puppeteer (via) This is really clever. There’s a GitHub repo that tracks progress in a game of Sudoku: Anish Karandikar wrote code which iterates through the game board state commit by commit, uses that state to generate an HTML table, passes that table to Puppeteer using a data: URI, renders a PNG of each stage and then concatenates those PNGs together into an animated GIF using the gifencoder Node.js library. #9th October 2020, 10:28 pm

7th October 2020

  • Bedrock: The SQLitening (via) Back in March 2018 www.mozilla.org switched over to running on Django using SQLite! They’re using the same pattern I’ve been exploring with Datasette: their SQLite database is treated as a read-only cache by their frontend servers, and a new SQLite database is built by a separate process and fetched onto the frontend machines every five minutes by a scheduled task. They have a healthcheck page which shows the latest version of the database and when it was fetched, and even lets you download the 25MB SQLite database directly (I’ve been exploring it using Datasette). #7th October 2020, 11:47 pm
  • Running Datasette on DigitalOcean App Platform (via) I spent some time with DigitalOcean’s new App Platform today, which is a Heroku-style PaaS that starts at $5/month. It looks like it could be a really good fit for Datasette. Disk is ephemeral, but if you’re publishing read-only data that doesn’t matter since you can build the SQLite database as part of the deployment and bundle it up in the Docker/Kubernetes container. #7th October 2020, 2:52 am

5th October 2020

  • I’ve often joked with other internet culture reporters about what I call the “normie tipping point.” In every emerging internet trend, there is a point at which “normies” — people who don’t spend all day online, and whose brains aren’t rotted by internet garbage — start calling, texting and emailing us to ask what’s going on. Why are kids eating Tide Pods? What is the Momo Challenge? Who is Logan Paul, and why did he film himself with a dead body? The normie tipping point is a joke, but it speaks to one of the thorniest questions in modern journalism, specifically on this beat: When does the benefit of informing people about an emerging piece of misinformation outweigh the possible harms?

    Kevin Roose # 5th October 2020, 3:40 pm