Simon Willison’s Weblog


Recent entries

Weeknotes: Joining the board of the Python Software Foundation 11 days ago

A few weeks ago I was elected to the board of directors for the Python Software Foundation.

I put myself up for election partly because I’ve found myself saying “I wish the PSF would help with ...” a few times over the years, and realized that joining the board could be a more useful way to actively participate, rather than shouting from the sidelines.

I was quite surprised to win. I wrote up a short manifesto—you can see that here—but the voting system lets voters select as many candidates as they like, so it’s possible I got in more on broad name recognition among the voters than on the strength of what I wrote. I don’t think there’s a way to tell one way or the other.

I had my first board meeting on Wednesday, where I formally joined the board and got to vote on my first resolutions. This is my first time as a board member for a non-profit and I have learned a bunch already, with a lot more to go!

Board terms last three years. I expect it will take me at least a few months to get fully up to speed on how everything works.

As a board member, my primary responsibilities are to show up to the meetings, vote on resolutions, act as an ambassador for the PSF to the Python community and beyond, and help both set the direction for the PSF and ensure that it meets its goals and holds true to its values.

I’m embarrassed to admit that I wrote my election manifesto without a deep understanding of how the PSF operates and how much is possible for it to get done. Here’s the section I wrote about my goals should I be elected:

I believe there are problems facing the Python community that require dedicated resources beyond volunteer labour. I’d like the PSF to invest funding in the following areas in particular:

  • Improve Python onboarding. In coaching new developers I’ve found that the initial steps to getting started with a Python development environment can be a difficult hurdle to cross. I’d like to help direct PSF resources to tackling this problem, with a goal of making the experience of starting to learn Python as smooth as possible, no matter what platform the learner is using.
  • Make Python a great platform for distributing software. In building my own application, Datasette, in Python I’ve seen how difficult it can be to package up a Python application so that it can be installed by end-users, who aren’t ready to install Python and learn pip in order to try out a new piece of software. I’ve researched solutions for this for my own software using Homebrew, Docker, an Electron app and WASM/Pyodide. I’d like the PSF to invest in initiatives and documentation to make this as easy as possible, so that one of the reasons to build with Python is that distributing an application to end-users is already a solved problem.

I still think these are good ideas, and I hope to make progress on them during my term as a director—but I’m not going to start arguing for new initiatives until I’ve learned the ropes and fully understood the PSF’s abilities, current challenges and existing goals.

In figuring out how the board works, one of the most useful pages I stumbled across was this list of resolutions voted on by the board, dating back to 2001. There are over 1,600 of them! Browsing through them gave me a much better idea of the kind of things the board has the authority to do.

Scraping data into Datasette Lite

Because everything looks like a nail when you have a good hammer, I explored the board resolutions by loading them into Datasette. I tried a new trick this time: I scraped data from that page into a CSV file, then loaded up that CSV file in Datasette Lite via a GitHub Gist.

My scraper isn’t perfect—it misses about 150 resolutions because they don’t exactly fit the format it expects, but it was good enough for a proof of concept. I wrote that in a Jupyter Notebook which you can see here.
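
For anyone curious what that kind of scraper looks like, here’s a minimal sketch (not the notebook itself). The URL, the element selector and the regular expression are all illustrative assumptions about the page structure rather than what my notebook actually does:

import csv
import re

import httpx
from bs4 import BeautifulSoup

# Illustrative URL and selector - the real notebook handles more edge cases
URL = "https://www.python.org/psf/records/board/resolutions/"

soup = BeautifulSoup(httpx.get(URL).text, "html.parser")
rows = []
for p in soup.select("p"):
    text = p.get_text(" ", strip=True)
    if not text.startswith("RESOLVED"):
        continue
    # Pull out a dollar amount if one is mentioned, e.g. "$4,000"
    match = re.search(r"\$([\d,]+)", text)
    rows.append({
        "text": text,
        "dollars": int(match.group(1).replace(",", "")) if match else None,
    })

with open("psf-resolutions.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["text", "dollars"])
    writer.writeheader()
    writer.writerows(rows)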

Here’s the CSV in a Gist. The great thing about Gists is that GitHub serve those files with the access-control-allow-origin: * HTTP header, which means you can load them cross-domain.
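
You can check that header for yourself. Here’s a quick sketch using Python’s standard library (the Gist URL is a placeholder, substitute your own raw file URL):

import urllib.request

# Placeholder: use the raw URL of your own Gist file here
url = "https://gist.githubusercontent.com/simonw/example/raw/psf-resolutions.csv"

with urllib.request.urlopen(url) as response:
    print(response.headers.get("access-control-allow-origin"))  # expect: *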

Here’s what you get if you paste the URL to that CSV into Datasette Lite (using this new feature I added last month):

A screenshot of the psf-resolutions table in Datasette, showing 1,654 rows

And here’s a SQL query that shows the sum total dollar amount from every resolution that mentions “Nigeria”:

with filtered as (
  select * from
    [psf-resolutions]
  where
    "dollars" is not null
    and "text" like '%' || :search || '%'
)
select
  'Total: $' || printf('%,d', sum(dollars)) as text,
  null as date
from filtered
union all
select
  text, date
from filtered;

A screenshot of the results of that query, returning 132 rows the top of which says Total: $163,849

I’m using a new-to-me trick here: I use a CTE to filter down to just the rows I am interested in, then create a first row that sums the dollar amounts as the text column and leaves the date column null, and then union that against the rows from the filtered query.

Important note: These numbers aren’t actually particularly useful. Just because the PSF board voted on a resolution does not mean that the money made it to the grantee—there are apparently situations where the approved grant may not be properly claimed and transferred. Also, my scraper logic isn’t perfect. Plus the PSF spends a whole lot of money in ways that don’t show up in these resolutions.

So this is a fun hack, and a neat way to start getting a general idea of how the PSF works, but any numbers it produces should not be taken as the absolute truth.

As a general pattern though, I really like this workflow of generating CSV files, saving them to a Gist and then opening them directly in Datasette Lite. It provides a way to use Datasette to share and explore data without needing to involve any server-side systems (other than GitHub Gists) at all!

Big-endian bugs in sqlite-fts4

sqlite-fts4 is a small Python library I wrote that adds SQLite functions for calculating relevance scoring for full-text search using the FTS4 module that comes bundled with SQLite. I described that project in detail in Exploring search relevance algorithms with SQLite.

It’s a dependency of sqlite-utils so it has a pretty big install base, despite being relatively obscure.

This week I had a fascinating bug report from Sarah Julia Kriesch: Test test_underlying_decode_matchinfo fails on PPC64 and s390x on openSUSE.

The s390x is an IBM mainframe architecture and it uses a big-endian byte order, unlike all of the machines I use, which are little-endian.

This is the first time I’ve encountered a big-endian vs. little-endian bug in my entire career! I was excited to dig in.

Here’s the relevant code:

import struct

def decode_matchinfo(buf):
    # buf is a bytestring of unsigned integers, each 4 bytes long
    return struct.unpack("I" * (len(buf) // 4), buf)

SQLite FTS4 provides a matchinfo binary string which you need to decode in order to calculate the relevance score. This code uses the struct standard library module to unpack that binary string into a list of integers.
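
To make the byte-order issue concrete, here’s what the same bytes look like under each interpretation (just an illustration, not code from the library):

import struct

buf = b"\x01\x00\x00\x00\x02\x00\x00\x00"

# Native byte order ("I" with no prefix) - this is what decode_matchinfo does,
# so the result depends on the machine running it
print(struct.unpack("I" * (len(buf) // 4), buf))

# Explicit little-endian and big-endian readings of the same bytes
print(struct.unpack("<II", buf))  # (1, 2)
print(struct.unpack(">II", buf))  # (16777216, 33554432)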

My initial attempt at fixing this turned out to be entirely incorrect.

I didn’t have a big-endian machine available for testing, and I assumed that the problem was caused by Python interpreting the bytes as the current architecture’s byte order. So I applied this fix:

    return struct.unpack("<" + ("I" * (len(buf) // 4)), buf)

The < prefix there ensures that struct will always interpret the bytes as little-endian. I wrote up a TIL and shipped 1.0.2 with the fix.

Sarah promptly got back to me and reported some new failing tests.

It turns out my fix was entirely incorrect—in fact, I’d broken something that previously was working just fine.

The clue is in the SQLite documentation for matchinfo (which I really should have checked):

The matchinfo function returns a blob value. If it is used within a query that does not use the full-text index (a “query by rowid” or “linear scan”), then the blob is zero bytes in size. Otherwise, the blob consists of zero or more 32-bit unsigned integers in machine byte-order (emphasis mine).

Looking more closely at the original bug report, the test that failed was this one:

@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)

That test hard-codes a little-endian binary string and checks the output of my decode_matchinfo function. This is obviously going to fail on a big-endian system.

So my original behaviour was actually correct: I was parsing the string using the byte order of the architecture, and SQLite was providing the string in the byte order of the architecture. The only bug was in my test.

I reverted my previous fix and fixed the test instead:

@pytest.mark.parametrize(
    "buf,expected",
    [
        (
            b"\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00"
            if sys.byteorder == "little"
            else b"\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02",
            (1, 2, 2, 2),
        )
    ],
)
def test_underlying_decode_matchinfo(buf, expected):
    assert expected == decode_matchinfo(buf)

sys.byteorder reports the byte order of the host system, so this test now passes on both little-endian and big-endian systems.

There was one remaining challenge: how to test this? I wasn’t going to repeat the mistake of shipping a fix that hadn’t actually been exercised on the target architecture.

After quite a bit of research (mainly throwing the terms docker and s390x into the GitHub code search engine and seeing what I could find) I figured out a solution. It turns out you can use Docker and QEMU to run an emulated s390x system—both on a Mac laptop and in GitHub Actions.

Short version:

docker run --rm --privileged multiarch/qemu-user-static:register --reset
docker run -it multiarch/ubuntu-core:s390x-focal /bin/bash

For the longer version, check my TIL: Emulating a big-endian s390x with QEMU.

Releases this week

TIL this week

Weeknotes: Datasette, sqlite-utils, Datasette Desktop 20 days ago

A flurry of releases this week, including a new Datasette alpha and a fixed Datasette Desktop.

datasette 0.62a1

Work on Datasette Cloud continues—the last 10% of the work needed for the beta launch is trending towards taking 90% of the time. It’s been driving all sorts of fixes to the wider Datasette ecosystem, which is nice.

I ran into a bug which would have been easier to investigate using Sentry. The datasette-sentry plugin wasn’t working right, and it turned out I needed a new handle_exception() plugin hook to fix it. This was the impetus I needed to push out a new Datasette alpha—I like to get new hooks into an alpha as quickly as possible so I can depend on that Datasette version from alpha releases of plugins.

Here are some other highlights from the alpha’s release notes:

  • The render_cell() plugin hook is now also passed a row argument, representing the sqlite3.Row object that is being rendered. (#1300)

A neat thing about Pluggy is that you can add new arguments to existing plugin hooks without breaking plugins that already use them.
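
Here’s a toy illustration of that Pluggy behaviour. It is a stand-alone sketch, not Datasette’s actual hook machinery: implementations only receive the arguments they declare, so an older plugin that never asks for row keeps working.

import pluggy

hookspec = pluggy.HookspecMarker("demo")
hookimpl = pluggy.HookimplMarker("demo")

class Spec:
    @hookspec
    def render_cell(self, value, row):
        "row is the newly added argument"

class OldPlugin:
    # Declares only "value" - Pluggy passes each implementation
    # just the arguments it asks for, so this still works
    @hookimpl
    def render_cell(self, value):
        return f"old plugin saw {value!r}"

pm = pluggy.PluginManager("demo")
pm.add_hookspecs(Spec)
pm.register(OldPlugin())
print(pm.hook.render_cell(value="x", row={"id": 1}))
# ["old plugin saw 'x'"]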

  • New --nolock option for ignoring file locks when opening read-only databases. (#1744)

Since the very start of the project Datasette has suggested trying the following command to start exploring your Google Chrome history, which is stored using SQLite:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

I’m not sure when this changed, but I tried running the command recently and got the following error:

sqlite3.OperationalError: database is locked

Since Datasette opens databases in read-only mode I didn’t see why a lock like this should be respected. It turns out SQLite can be told to ignore locks like so:

sqlite3.connect(
    "file:places.sqlite?mode=ro&nolock=1",
    uri=True
)

So I added a --nolock option to Datasette which does exactly that:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History --nolock

Inspired by 6 More Things I Learned Building Snowpack to 20,000 Stars (Part 2) by Fred K. Schott, I finally set up a chat community for Datasette, using Discord.

It’s attracted 88 members already! You can join it here. I wrote detailed notes on how I configured it in this issue.

  • Database file downloads now implement conditional GET using ETags. (#1739)

This is a change I made to support Datasette Lite—I noticed that the WASM version of Datasette was downloading a fresh database every time, so I added ETag support to encourage browsers to avoid a duplicate download and use a cached copy of the database file instead, provided it hasn’t changed.
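
Conditional GET is easy to exercise by hand. Here’s a rough sketch of the client side of that exchange (the URL is just an example of a Datasette database download):

import urllib.error
import urllib.request

url = "https://latest.datasette.io/fixtures.db"  # example database download URL

first = urllib.request.urlopen(url)
etag = first.headers.get("etag")
print("ETag:", etag)

# Repeat the request, sending the ETag back. If the file hasn't changed the
# server responds with 304 Not Modified and an empty body, so the browser
# can reuse its cached copy.
second = urllib.request.Request(url, headers={"If-None-Match": etag or ""})
try:
    urllib.request.urlopen(second)
    print("200: the file changed, full download")
except urllib.error.HTTPError as err:
    print(err.code)  # 304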

Datasette Desktop

Datasette Desktop was hanging on launch. Paul Everitt figured out a fix, which it took me way too long to get around to applying.

I finally shipped that in Datasette Desktop 0.2.2, but I wanted to reduce the chances of this happening again as much as possible. Datasette Desktop’s Electron tests used the spectron test harness, but that’s marked as deprecated.

I’m a big fan of Playwright and I was pleased to see that it has support for testing Electron apps. I figured out how to use that with Datasette Desktop and run the tests in GitHub Actions: I wrote up what I learned in a TIL, Testing Electron apps with Playwright and GitHub Actions.

sqlite-utils 3.28

Annotated release notes:

davidleejy suggested the table.duplicate() method and contributed an implementation. This was the impetus for pushing out a fresh release.

I added the CLI equivalent, sqlite-utils duplicate.

  • sqlite_utils.utils.rows_from_file() is now a documented API. It can be used to read a sequence of dictionaries from a file-like object containing CSV, TSV, JSON or newline-delimited JSON. It can be passed an explicit format or can attempt to detect the format automatically. (#443)
  • sqlite_utils.utils.TypeTracker is now a documented API for detecting the likely column types for a sequence of string rows, see Detecting column types using TypeTracker. (#445)
  • sqlite_utils.utils.chunks() is now a documented API for splitting an iterator into chunks. (#451)

I have a policy that any time I need to use an undocumented method from sqlite-utils in some other project I file an issue to add that to the documented API surface.

I had used rows_from_file() and TypeTracker in datasette-socrata.
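
Here’s roughly how those two APIs fit together, in a minimal sketch based on the documented interfaces rather than the code from datasette-socrata:

import io

from sqlite_utils.utils import TypeTracker, rows_from_file

# rows_from_file() reads from a binary file-like object and detects the format
rows, format_used = rows_from_file(io.BytesIO(b"id,name\n1,Cleo\n2,Pancakes\n"))

# TypeTracker watches string rows pass through it and guesses column types
tracker = TypeTracker()
rows = list(tracker.wrap(rows))
print(format_used)    # the detected format (CSV here)
print(tracker.types)  # e.g. {"id": "integer", "name": "text"}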

  • sqlite-utils enable-fts now has a --replace option for replacing the existing FTS configuration for a table. (#450)
  • The create-index, add-column and duplicate commands all now take a --ignore option for ignoring errors should the database not be in the right state for them to operate. (#450)

This was inspired by my TIL Ignoring errors in a section of a Bash script—a trick I had to figure out because one of my scripts needed to add columns and enable FTS but only if those changes had not been previously applied.

In looking into that I spotted inconsistencies in the design of the sqlite-utils commands, so I fixed those as much as I could while still maintaining backwards compatibility with the 3.x releases.

Releases this week

TIL this week

sqlite-comprehend: run AWS entity extraction against content in a SQLite database 29 days ago

I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.

I created this as a complement to my s3-ocr tool, which uses the AWS Textract service to perform OCR against every PDF file in an S3 bucket.

Short version: given a database table full of text, run the following:

% pip install sqlite-comprehend
% sqlite-comprehend entities myblog.db blog_entry body --strip-tags
  [###---------------------------------]    9%  00:01:02

This will churn through every piece of text in the body column of the blog_entry table in the myblog.db SQLite database, strip any HTML tags (the --strip-tags option), submit it to AWS Comprehend, and store the extracted entities in the following tables:

The above table names link to a live demo produced by running the tool against all of the content in my blog.

Here are 225 mentions that Comprehend classified as the organization called “Mozilla”.

The tool tracks which rows have been processed already (in the blog_entry_comprehend_entities_done table), so you can run it multiple times and it will only process newly added rows.

AWS Comprehend pricing starts at $0.0001 per hundred characters. sqlite-comprehend only submits the first 5,000 characters of each row.
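
For budgeting purposes the back-of-the-envelope arithmetic looks like this (the row count is just an example):

rows = 3_000                        # hypothetical number of rows to process
characters = rows * 5_000           # worst case: every row hits the 5,000 character cap
cost = (characters / 100) * 0.0001  # $0.0001 per hundred characters
print(f"${cost:.2f}")               # $15.00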

How the demo works

My live demo for this tool uses a new Datasette instance at datasette.simonwillison.net. It hosts a complete copy of the data from my blog—data that lives in a Django/PostgreSQL database on Heroku, but is now mirrored to a SQLite database hosted by Datasette.

The demo runs out of my simonwillisonblog-backup GitHub repository.

A couple of years ago I realized that I’m no longer happy having any content I care about not stored in a Git repository. I want to track my changes! I also want really robust backups: GitHub mirror their repos to three different regions around the world, and having data in a Git repository makes mirroring it somewhere else as easy as running a git pull.

So I created simonwillisonblog-backup using a couple of my other tools: db-to-sqlite, which converts a PostgreSQL database to a SQLite database, and sqlite-diffable, which dumps out a SQLite database as a “diffable” directory of newline-delimited JSON files.

Here’s the simplest version of that pattern:

db-to-sqlite \
    'postgresql+psycopg2://user:pass@hostname:5432/dbname' \
    simonwillisonblog.db --all

This connects to PostgreSQL, loops through all of the database tables and converts them all to SQLite tables stored in simonwillisonblog.db.

sqlite-diffable dump simonwillisonblog.db simonwillisonblog --all

This converts that SQLite database into a directory of JSON files. Each table gets two files: table.metadata.json, containing the table’s name, columns and schema, and table.ndjson, containing a newline-separated list of JSON arrays representing every row in that table.

You can see these files for my blog’s database in the simonwillisonblog folder.

(My actual script is a little more complex, because I backup only selected tables and then run extra code to redact some of the fields.)

Since I set this up it’s captured over 600 changes I’ve applied to my blog’s database, all made through the regular Django admin interface.

This morning I extended the script to run sqlite-comprehend against my blog entries and deploy the resulting data using Datasette.

The concise version of the new script looks like this:

wget -q https://datasette.simonwillison.net/simonwillisonblog.db

This retrieves the previous version of the database. I do this to avoid being charged by AWS Comprehend for running entity extraction against rows I have already processed.

sqlite-diffable load simonwillisonblog.db simonwillisonblog --replace

This creates the simonwillisonblog.db database by loading in the JSON from the simonwillisonblog/ folder. I do it this way mainly to exercise the new sqlite-diffable load command I just added to that tool.

The --replace option ensures that any tables that already exist are replaced by a fresh copy (while leaving my existing comprehend entity extraction data intact).

sqlite-comprehend entities simonwillisonblog.db blog_entry title body --strip-tags

This runs sqlite-comprehend against the blog entries that have not yet been processed.

set +e
sqlite-utils enable-fts simonwillisonblog.db blog_series title summary --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_tag tag --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_quotation quotation source --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_entry title body --create-triggers --tokenize porter 2>/dev/null
sqlite-utils enable-fts simonwillisonblog.db blog_blogmark link_title via_title commentary --create-triggers --tokenize porter 2>/dev/null
set -e

This configures SQLite full-text search against each of those tables, using this pattern to suppress any errors that occur if the FTS tables already exist.

Setting up FTS in this way means I can use the datasette-search-all plugin to run searches like this one for aws across all of those tables at once.

datasette publish cloudrun simonwillisonblog.db \
-m metadata.yml \
--service simonwillisonblog \
--install datasette-block-robots \
--install datasette-graphql \
--install datasette-search-all

This uses the datasette publish command to deploy the datasette.simonwillison.net site to Google Cloud Run.

I’m adding two more plugins here: datasette-block-robots to avoid search engine crawlers indexing a duplicate copy of my blog’s content, and datasette-graphql to enable GraphQL queries against my data.

Here’s an example GraphQL query that returns my most recent blog entries that are tagged with datasette.

Releases this week

TIL this week

Using GPT-3 to explain how code works one month ago

One of my favourite uses for the GPT-3 AI language model is generating explanations of how code works. It’s shockingly effective at this: its training set clearly includes a vast amount of source code.

(I initially thought this was related to GitHub Copilot being built on GPT-3, but actually that’s built on a GPT-3 descendant called OpenAI Codex.)

Here are a few recent examples.

Explaining some Python

Ned Batchelder shared this brilliant Python code snippet on Twitter this morning:

import re

TOKEN_RX = r"""(?xm)
    (?P<string> ".*?"|'.*?'             )| 
    (?P<float>  \d*(\d\.|\.\d)\d*       )|
    (?P<int>    \d+                     )|
    (?P<id>     [_a-zA-Z][_a-zA-Z0-9]*  )|
    (?P<punct>  [(){}:\[\]=.,+*/-]      )|
    (           \#.*$                   )|
    (           \s+                     )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
    x = 123 + "hello #99"  # ignore me!
    print(hello.bye[0] + 3.14, 'single')
    """

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")

I pasted it into GPT-3 (here’s how to use the playground interface) and started asking questions.

This uses the feature of GPT-3 where you can get a response, then add more text and run it again to get further details.
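
If you’d rather script this than use the playground, the same append-and-continue pattern works against the completions API. Here’s a rough sketch: the model name, parameters and file name are illustrative assumptions, not a recommendation.

import openai  # pip install openai; expects OPENAI_API_KEY in the environment

# tokens.py is a placeholder for whatever snippet you want explained
transcript = "Explain this code:\n\n" + open("tokens.py").read()

for follow_up in ["", "\n\nExplain the regular expression:"]:
    transcript += follow_up
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=transcript,
        max_tokens=256,
    )
    answer = response["choices"][0]["text"]
    transcript += answer  # feed the answer back in before the next question
    print(answer)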

Anything in bold is text that GPT-3 responded with. The rest are my prompts.

Explain this code:

import re

TOKEN_RX = r"""(?xm)
    (?P<string> ".*?"|'.*?'             )| 
    (?P<float>  \d*(\d\.|\.\d)\d*       )|
    (?P<int>    \d+                     )|
    (?P<id>     [_a-zA-Z][_a-zA-Z0-9]*  )|
    (?P<punct>  [(){}:\[\]=.,+*/-]      )|
    (           \#.*$                   )|
    (           \s+                     )
    """

def tokens(text):
    for match in re.finditer(TOKEN_RX, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

TEXT = """
    x = 123 + "hello #99"  # ignore me!
    print(hello.bye[0] + 3.14, 'single')
    """

for kind, text in tokens(TEXT):
    print(f"{kind:7}: {text=}")

The code is defining a regular expression that will match different types of tokens in a string of text. The different groups defined in the regular expression will match different types of tokens, such as strings, floats, ints, ids, punctuation, and comments. The code will then iterate over the matches and yield the type of token and the text that was matched.

Explain the regular expression:

The regular expression is made up of a number of different groups, each of which match a different type of token. The first group, “string”, matches any text enclosed in double or single quotes. The second group, “float”, matches any text that looks like a floating point number. The third group, “int”, matches any text that looks like an integer. The fourth group, “id”, matches any text that looks like a variable name. The fifth group, “punct”, matches any text that looks like punctuation. The sixth group, “#.*$”, matches any text that looks like a comment. The seventh group, “\s+”, matches any text that looks like whitespace.

What does (?xm) do?

The (?xm) notation is a modifier that allows the regular expression to span multiple lines and to use comments.

Why does it use match.lastgroup?

The match.lastgroup variable contains the name of the last group that was matched by the regular expression. This is used to determine the type of token that was matched.

This is a really good, really clear explanation of what’s going on.

I already knew how that code works though. Let’s try something that I genuinely didn’t understand.

Explaining some JavaScript

Chris Haynes tweeted an ingenious demo showing how the new Web Component Declarative Shadow DOM <slot> mechanism (described here) can be used to create a page that streams content from a server, filling in different regions of the page as the content arrives—without using any JavaScript!

His demo is at enamel.pages.dev but only works for me in Chrome. For other browsers, he provides enamel.pages.dev/?js which loads a tiny polyfill.

The polyfill code lives in enamel-lite.js and I could not figure out how it works. So I asked GPT-3.

Explain this code:

class Enamel extends HTMLElement {
  attemptPolyfillDSD() {
    const dsd = this.querySelector('template[shadowroot]');

    if (dsd?.content) {
      const mode = dsd.getAttribute('shadowroot');
      this.attachShadow({ mode });
      this.shadowRoot.appendChild(dsd.content);

      dsd.remove();

      return true;
    }

    return false;
  }

  connectedCallback() {
    if (
      !HTMLTemplateElement.prototype.hasOwnProperty('shadowRoot') &&
      !this.attemptPolyfillDSD()
    ) {
      const _observer = new MutationObserver(() => {
        if (this.attemptPolyfillDSD()) {
          _observer.disconnect();
        }
      });

      _observer.observe(this, {
        childList: true,
      });
    }
  }
}

export default Enamel;

This code polyfills the Shadow DOM API for older browsers that do not support it.

OK, that sounds convincing, but it’s not enough information for me. So I added another prompt:

How does it do that?

The code uses a MutationObserver to watch for changes to the element’s children. When a new child is added, the code checks to see if that child is a <template> element with a ’shadowroot’ attribute. If so, the code creates a Shadow Root for the element and appends the template’s content to it. Finally, the code removes the template element from the DOM.

That is a really good explanation. After reading that, I went back to look at the code again and I now completely understand how it works.

Building some SQL

I wanted to build a SQL query that would return all of the rows in one table that were not represented in another table... but where the rows used a compound primary key covering two columns. I didn’t know the right syntax for this off the top of my head.

First, I needed to create a table with a compound primary key. Rather than look that up I asked GPT-3 directly:

SQLite create a pages table that has a compound primary key of type, id and a content column.

CREATE TABLE pages ( type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id) );

It seems to know about SQLite SQL syntax, which is handy. Turns out you create a compound primary key using PRIMARY KEY (col1, col2) at the end of the table definition.

I added a second line with my own CREATE TABLE, then asked it to build me a query:

CREATE TABLE pages_done ( type TEXT, id TEXT, PRIMARY KEY (type, id) );

SQL query returning rows in the pages table that do not have corresponding records in the pages_done table:

SELECT type, id, content FROM pages WHERE (type, id) NOT IN (SELECT type, id FROM pages_done)

This is useful! I thought that a WHERE ... NOT IN (SELECT pattern would be good here, but I wasn’t sure of the syntax for looking up two columns at once.

I’m pretty sure this could instead be solved with a join... so I asked that too:

That query but using a join:

SELECT p.type, p.id, p.content FROM pages p LEFT JOIN pages_done pd ON p.type = pd.type AND p.id = pd.id WHERE pd.type IS NULL;

But which is better? I posed two more questions:

Which of those two queries is more efficient and why?

The first query is more efficient because it uses a WHERE clause with a subquery. The second query is less efficient because it uses a LEFT JOIN.

Why is a subquery more efficient?

A subquery is more efficient because it only returns the rows that meet the conditions specified in the WHERE clause. A LEFT JOIN returns all rows from the left table, even if there is no match in the right table.

As with everything GPT-3, this answer sounds very convincing! But is this actually true? I would expect the SQL query optimizer to be smart enough to optimize the LEFT JOIN to the point where it would run efficiently. I think GPT-3’s answer to that question is actually very misleading.
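
Rather than take GPT-3’s word for it, you can ask SQLite itself how it plans to execute each query. Here’s a quick sketch; the plans you see will vary with your SQLite version.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pages (type TEXT, id TEXT, content TEXT, PRIMARY KEY (type, id));
CREATE TABLE pages_done (type TEXT, id TEXT, PRIMARY KEY (type, id));
""")

queries = {
    "not in": """
        SELECT type, id, content FROM pages
        WHERE (type, id) NOT IN (SELECT type, id FROM pages_done)
    """,
    "left join": """
        SELECT p.type, p.id, p.content FROM pages p
        LEFT JOIN pages_done pd ON p.type = pd.type AND p.id = pd.id
        WHERE pd.type IS NULL
    """,
}
for name, sql in queries.items():
    print(name)
    for row in db.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", row)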

And that’s not surprising, because GPT-3 doesn’t actually know anything about anything at all. It’s a huge pattern generator. You can’t trust anything it says, because all it does is group words together into convincing looking shapes based on text that it’s seen before.

Once again, I’m reminded that tools like GPT-3 should be classified in the “bicycles for the mind” category. You still have to know how to pedal!

They’re fantastic tools for thinking, but to actually use their output effectively requires VERY deep knowledge—both of the subject matter in question, and of the way that the AI tools themselves work.

Leading questions

Got this great tip from Riley Goodside:

When prompting like this, it’s important (and difficult) to avoid leading questions. If you ask which of two queries are more efficient and why, it’s extremely unlikely to say “It doesn’t matter,” no matter how true that is.

I hadn’t considered this at all. I’ll be keeping that in mind in my future experiments.

Bonus: how about mathematics?

I decided to try one more thing: could it explain mathematical equations?

GitHub recently added LaTeX rendering to their Markdown implementation. I grabbed a formula at random from this page of documentation:

Explain this:

$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is the Cauchy-Schwarz inequality.

OK, so you know what it’s called. But how does it work?

I added “Explain the formula syntax step by step, using $$ wrapped around the formula steps:” (the $$ bit is explained in a moment):

$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is the Cauchy-Schwarz inequality.

$$\left( \sum_{k=1}^n a_k b_k \right)^2$$

This is the sum of all products of corresponding elements in the vectors a and b, squared.

$$\leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

This is saying that this sum is less than or equal to the product of the sums of the squares of each vector.

Here’s why I added that bit about $$ to the prompt: doing so caused it to produce its answer in valid GitHub Markdown syntax. I pasted that into a Gist and rendered it, and it came out like this:

That same thing but with rendered mathematical equations for the different step explanations

Explaining configuration files

Here’s another prompt pattern I stumbled across that seems to work really well: “Explain every line of this config file:”. I tried it against the Lit example from the Vite 3.0 starter templates:

Explain every line of this config file:

import { defineConfig } from 'vite'

// https://vitejs.dev/config/
export default defineConfig({
  build: {
    lib: {
      entry: 'src/my-element.js',
      formats: ['es']
    },
    rollupOptions: {
      external: /^lit/
    }
  }
})

This config file is written in JavaScript, using the ViteJS library. The first line imports the “defineConfig” function from the ViteJS library. This function allows the user to define a configuration object for the build process.

The next line exports the “defineConfig” function as the default export. This means that when this file is imported, the “defineConfig” function will be the only thing that is imported.

The “defineConfig” function takes one argument, which is an object. This object contains the configuration for the build process. The “build” property contains the configuration for the build tool. The “lib” property contains the configuration for the library that will be built. The “entry” property contains the path to the entry file for the library. The “formats” property contains the formats that the library will be built in. In this case, the library will only be built in the “es” format.

The “rollupOptions” property contains the configuration for the Rollup tool. The “external” property contains a regular expression that matches the names of the dependencies that should be excluded from the build. In this case, all dependencies whose names start with “lit” will be excluded from the build.

s3-ocr: Extract text from PDF files stored in an S3 bucket one month ago

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

You can search through a demo of 697 pages of OCRd text at s3-ocr-demo.datasette.io/pages/pages.

Textract works extremely well: it handles dodgy scanned PDFs full of typewritten text and reads handwritten text better than I can! It charges $1.50 per thousand pages processed.

Why I built this

My initial need for this is a collaboration I have running with the San Francisco Microscopy Society. They’ve been digitizing their archives—which stretch back to 1870!—and were looking for help turning the digital scans into something more useful.

The archives are full of hand-written and type-written notes, scanned and stored as PDFs.

I decided to wrap my work up as a tool because I’m sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.

Running Textract directly against large numbers of files is somewhat inconvenient (here’s my earlier TIL about it). s3-ocr is my attempt to make it easier.

Tutorial: How I built that demo

The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection on the Internet Archive:

I started by downloading PDFs of those three files.

Then I installed the two tools I needed:

pip install s3-ocr s3-credentials

I used my s3-credentials tool to create a new S3 bucket and credentials with the ability to write files to it, with the new --statement option (which I released today) to add textract permissions to the generated credentials:

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket > ocr.json

(Note that you don’t need to use s3-credentials at all if you have AWS credentials configured on your machine with root access to your account—just leave off the -a ocr.json options in the following examples.)

s3-ocr-demo is now a bucket I can use for the demo. ocr.json contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.

I uploaded my three PDFs to the bucket:

s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json

(I often use Transmit as a GUI for this kind of operation.)

Then I kicked off OCR jobs against every PDF file in the bucket:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9

The --all option scans for any file with a .pdf extension. You can pass explicit file names instead if you just want to process one or two files at a time.

This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.

The job IDs can be used to inspect the progress of each task like so:

% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}

Once the job completed, I could preview the text extracted from the PDF like so:

% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON & CHAMBERLAIN, 120 LIBERTY ST.
...

To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34

I then published the resulting pages.db SQLite database using Datasette—you can explore it here.

How s3-ocr works

s3-ocr works by calling Amazon’s S3 and Textract APIs.

Textract only works against PDF files in asynchronous mode: you call an API endpoint to tell it “start running OCR against this PDF file in this S3 bucket”, then wait for it to finish—which can take several minutes.

It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket—I use that option in s3-ocr.
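
Under the hood that’s a pair of boto3 calls. Here’s a stripped-down sketch of the pattern; the bucket and key names are placeholders, and s3-ocr does considerably more than this.

import time

import boto3

textract = boto3.client("textract")

# Start an asynchronous OCR job for a PDF already in S3, asking Textract to
# write its results into the same bucket under textract-output/
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "s3-ocr-demo", "Name": "example.pdf"}},
    OutputConfig={"S3Bucket": "s3-ocr-demo", "S3Prefix": "textract-output"},
)
job_id = job["JobId"]

# Poll until the job finishes - this can take several minutes for a large PDF
while True:
    status = textract.get_document_text_detection(JobId=job_id)["JobStatus"]
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)
print(status)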

A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.

I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called textract-output/JOB_ID/1-?—but there’s no indication as to which PDF file the results correspond to.

My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.

If you have a file called latestmagicbeing00hoff.pdf the start command will create a new file called latestmagicbeing00hoff.pdf.s3-ocr.json with the following content:

{
  "job_id": "f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402",
  "etag": "\"d79af487579dcbbef26c9b3be763eb5e-2\""
}

This associates the job ID with the PDF file. It also records the original ETag of the PDF file—this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.

The existence of these files lets me do two things:

  • If you run s3-ocr start s3-ocr-demo --all it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the .s3-ocr.json file.
  • When you later ask for the results of the OCR it can use these files to associate the PDF with the results.
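
Here’s a rough sketch of how that skip-already-submitted check can work (not the actual s3-ocr code):

import boto3

s3 = boto3.client("s3")
bucket = "s3-ocr-demo"  # placeholder bucket name

# Collect every key in the bucket, then work out which PDFs still need OCR
keys = set()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        keys.add(obj["Key"])

pending = [
    key for key in keys
    if key.endswith(".pdf") and f"{key}.s3-ocr.json" not in keys
]
print("PDFs still needing OCR:", pending)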

Scattering .s3-ocr.json files all over the place feels a little messy, so I have an open issue considering moving them all to an s3-ocr/ prefix in the bucket instead.

Try it and let me know what you think

This is a brand new project, but I think it’s ready for other people to start trying it out.

I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopy Society archive and it seemed to work well!

If you try this out and it works (or it doesn’t work) please let me know via Twitter or GitHub.

A challenging example page

Here’s one of the more challenging pages I processed using Textract:

A very old page of difficult to read handwriting

Here’s the result:

In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining & mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining & mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, & havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by

Releases this week

TIL this week

First impressions of DALL-E, generating images from text one month ago

I made it off the DALL-E waiting list a few days ago and I’ve been having an enormous amount of fun experimenting with it. Here are some notes on what I’ve learned so far (and a bunch of example images too).

(For those not familiar with it, DALL-E is OpenAI’s advanced text-to-image generator: you feed it a prompt, it generates images. It’s extraordinarily good at it.)

First, a warning: DALL-E only allows you to generate up to 50 images a day. I found this out only when I tried to generate image number 51. So there’s a budget to watch out for.

I’ve usually run out by lunch time!

How to use DALL-E

DALL-E is even simpler to use than GPT-3: you get a text box to type in, and that’s it. There are no advanced settings to tweak.

A label says "Start with a detailed description" - there is also a "Surprise me" button. The text box has the grayed out suggestion text "An impressionist oil painting of sunflowers in a purple vase." There is also a Generate button, and the text "Or upload an image to edit"

It does have one other mode: you can upload your own photo, crop it to a square and then erase portions of it and ask DALL-E to fill them in with a prompt. This feature is clearly still in the early stages—I’ve not had great results with it yet.

DALL-E always returns six resulting images, which I believe it has selected as the “best” from hundreds of potential results.

Tips on prompts

DALL-E’s initial label suggests to “Start with a detailed description”. This is very good advice!

The more detail you provide, the more interesting DALL-E gets.

If you type “Pelican”, you’ll get an image that is indistinguishable from what you might get from something like Google Image search. But the more details you ask for, the more interesting and fun the result.

Fun with pelicans

Here’s “A ceramic pelican in a Mexican folk art style with a big cactus growing out of it”:

A ceramic pelican in a Mexican folk art style with a big cactus growing out of it - the image looks exactly like that, it's very impressive

Some of the most fun results you can have come from providing hints as to a medium or art style you would like. Here’s “A heavy metal album cover where the band members are all pelicans... made of lightning”:

A heavy metal album cover where the band members are all pelicans... made of lightning - except none of the pelicans are made of lightning. The text at the top reads PLENY HLAN

This illustrates a few interesting points. Firstly, DALL-E is hilariously bad at any images involving text. It can make things that look like letters and words but it has no concept of actual writing.

My initial prompt was for “A death metal album cover...”—but DALL-E refused to generate that. It has a filter to prevent people from generating images that go outside its content policy, and the word “death” triggered it.

(I’m confident that the filter can be easily avoided, but I don’t want to have my access revoked so I haven’t spent any time pushing its limits.)

It’s also not a great result—those pelicans are not made of lightning! I tried a tweaked prompt:

“A heavy metal album cover where the band members are all pelicans that are made of lightning”:

A heavy metal album cover where the band members are all pelicans that are made of lightning - six images, all very heavy metal but none of them where the birds are made of lightning, though two have lightning in the background now

Still not made of lightning. One more try:

“pelican made of lightning”:

Six images of pelicans - they are all made of lightning this time, but they don't look great.

Let’s try the universal DALL-E cheat code, adding “digital art” to the prompt.

“a pelican made of lightning, digital art”

Six images of pelicans - they are all made of lightning this time, and they look pretty cool

OK, those look a lot better!

One last try—the earlier prompt but with “digital art” added.

“A heavy metal album cover where the band members are all pelicans that are made of lightning, digital art”:

These are really cool images of pelicans with lightning - though again, they aren't really made of lightning. Also there's no album text any more.

OK, these are cool. The text is gone—maybe the “digital art” influence over-rode the “album cover” a tiny bit there.

This process is a good example of “prompt engineering”—feeding in altered prompts to try to iterate towards a better result. This is a very deep topic, and I’m confident I’ve only just scratched the surface of it.

Breaking away from album art, here’s “A squadron of pelicans having a tea party in a forest with a raccoon, digital art”. Often when you specify “digital art” it picks some other additional medium:

A beautiful painting. A racoon sits in the foreground at a little table in the forest. He is surrounded by pelicans, one of which is pouring a drink from a half-bucket-half-teapot.

Recreating things you see

A fun game I started to play with DALL-E was to see if I could get it to recreate things I saw in real life.

My dog, Cleo, was woofing at me for breakfast. I took this photo of her:

A medium sized black pitbull mix sitting on a hardwood floor

Then I tried this prompt: “A medium sized black dog who is a pit bull mix sitting on the ground wagging her tail and woofing at me on a hardwood floor”

A medium sized black pitbull mix sitting on a hardwood floor

OK, wow.

Later, I caught her napping on the bed:

A medium sized black pitbull mix curled up asleep on a green duvet

Here’s DALL-E for “A medium sized black pit bull mix curled up asleep on a dark green duvet cover”:

A medium sized black pit bull mix curled up asleep on a dark green duvet cover - a very good image

One more go at that. Our chicken Cardi snuck into the house and snuggled up on the sofa. Before I evicted her back into the garden I took this photo:

a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket

“a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket”:

Six images that fit the brief, though the cushions don't have the pattern and the camera zoomed in much closer on the chicken than in the original

Clearly I didn’t provide a detailed enough prompt here! I would need to iterate on this one a lot.

Stained glass

DALL-E is great at stained glass windows.

“Pelican in a waistcoat as a stained glass window”:

A really cool stained glass window design of a pelican, though it is not wearing a waistcoat

"A stained glass window depicting 5 different nudibranchs"

5 different nudibranchs in stained glass - really good

People

DALL-E is (understandably) quite careful about depictions of people. It won’t let you upload images with recognisable faces in them, and when you ask for a prompt with a famous person it will sometimes pull off tricks like showing them from behind.

Here’s “The pope on a bicycle leading a bicycle race through Paris”:

A photo of the pope on a bicycle, taken from behind, with a blurred out Paris street in the background

Though maybe it was the “leading a bicycle race” part that inspired it to draw the image from this direction? I’m not sure.

It will sometimes generate made-up people with visible faces, but they ask users not to share those images.

Assorted images

Here are a bunch of images that I liked, with their prompts.

Inspired by one of our chickens:

“A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree”

A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree

I asked it for the same thing, painted by Salvador Dali:

“A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, painted by Salvador Dali”:

Three paintings of a blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, in the style of Salvador Dali

“Bats having a quinceañera, digital art”:

Three bats with pink ears, one is wearing a pink dress

“The scene in an Agatha Christie mystery where the detective reveals who did it, but everyone is a raccoon. Digital art.”:

This one is in more of a cartoon style. The raccoon stands in front, and four people in period clothes stand in the background, one of them with a magnifying glass.

(It didn’t make everyone a raccoon. It also refused my initial prompt where I asked for an Agatha Christie murder mystery, presumably because of the word “murder”.)

“An acoustic guitar decorated with capybaras in Mexican folk art style, sigma 85mm”:

A close-up shot of an acoustic guitar with some capybaras painted on it.

Adding “sigma 85mm” (and various other mm lengths) is a trick I picked up which gives you realistic images that tend to be cropped well.

“A raccoon wearing glasses and reading a poem at a poetry evening, sigma 35mm”:

A very convincing photograph of a racoon wearing glasses reading from a book, with a blurry background

“Pencil sketch of a Squirrel reading a book”:

A just gorgeous pencil sketch of a squirrel reading a book

Pencil sketches come out fantastically well.

“The royal pavilion in brighton covered in snow”

The royal pavilion in brighton covered in snow - the windows look a bit weird

I experienced this once, many years ago when I lived in Brighton—but forgot to take a photo of it. It looked exactly like this.

And a game: fantasy breakfast tacos

It’s difficult to overstate how much fun playing with this stuff is. Here’s a game I came up with: fantasy breakfast tacos. See how tasty a taco you can invent!

Mine was “breakfast tacos with lobster, steak, salmon, sausages and three different sauces”:

A really delicious assortment of tacos

Natalie is a vegetarian, which I think puts her at a disadvantage in this game. “breakfast taco containing cauliflower, cheesecake, tomatoes, eggs, flowers”:

A really delicious assortment of tacos

Closing thoughts

As you can see, I have been enjoying playing with this a LOT. I could easily share twice as much—the above are just the highlights from my experiments so far.

The obvious question raised by this is how it will affect people who generate art and design for a living. I don’t have anything useful to say about that, other than recommending that they make themselves familiar with the capabilities of these kinds of tools—which have taken an astonishing leap forward in the past few years.

My current mental model of DALL-E is that it’s a fascinating tool for enhancing my imagination. Being able to imagine something and see it visualized a few seconds later is an extraordinary new ability.

I haven’t yet figured out how to apply this to real world problems that I face—my attempts at getting DALL-E to generate website wireframes or explanatory illustrations have been unusable so far—but I’ll keep on experimenting with it. Especially since feeding it prompts is just so much fun.

Elsewhere

Today

  • How SQLite Helps You Do ACID (via) Ben Johnson’s series of posts explaining the internals of SQLite continues with a deep look at how the rollback journal works. I’m learning SO much from this series. #10th August 2022, 3:39 pm
  • curl-impersonate (via) “A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”

    I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL. #10th August 2022, 3:34 pm

Yesterday

  • sqlite-zstd: Transparent dictionary-based row-level compression for SQLite. Interesting SQLite extension from phiresky, the author of that amazing SQLite WASM hack from a while ago which could fetch subsets of a large SQLite database using the HTTP range header. This extension, written in Rust, implements row-level compression for a SQLite table by creating compression dictionaries for larger chunks of the table, providing better results than just running compression against each row value individually. #9th August 2022, 9:23 pm

6th August 2022

  • Microsoft® Open Source Software (OSS) Secure Supply Chain (SSC) Framework Simplified Requirements. This is really good: don’t get distracted by the acronyms, skip past the intro and head straight to the framework practices section, which talks about things like keeping copies of the packages you depend on, running scanners, tracking package updates and most importantly keeping an inventory of the open source packages you work with so you can quickly respond to things like log4j.

    I feel like I say this a lot these days, but if you had told teenage-me that Microsoft would be publishing genuinely useful non-FUD guides to open source supply chain security by 2022 I don’t think I would have believed you. #6th August 2022, 4:49 pm

4th August 2022

  • Your documentation is complete when someone can use your module without ever having to look at its code. This is very important. This makes it possible for you to separate your module’s documented interface from its internal implementation (guts). This is good because it means that you are free to change the module’s internals as long as the interface remains the same.

    Remember: the documentation, not the code, defines what a module does.

    Ken Williams # 4th August 2022, 3:50 pm

3rd August 2022

  • Introducing sqlite-html: query, parse, and generate HTML in SQLite (via) Another brilliant SQLite extension module from Alex Garcia, this time written in Go. sqlite-html adds a whole family of functions to SQLite for parsing and constructing HTML strings, built on the Go goquery and cascadia libraries. Once again, Alex uses an Observable notebook to describe the new features, with embedded interactive examples that are backed by a Datasette instance running in Fly. #3rd August 2022, 5:31 pm

2nd August 2022

  • How I Used DALL·E 2 to Generate The Logo for OctoSQL (via) Jacob Martin gives a blow-by-blow account of his attempts at creating a logo for his OctoSQL project using DALL-E, spending $30 of credits and making extensive use of both the “variations” feature and the tool that lets you request modifications to existing images by painting over parts you want to regenerate. Really interesting to read as an example of a “real world” DALL-E project. #2nd August 2022, 9:12 pm

1st August 2022

  • storysniffer (via) Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it. #1st August 2022, 11:40 pm

31st July 2022

  • Cleaning data with sqlite-utils and Datasette (via) I wrote a new tutorial for the Datasette website, showing how to use sqlite-utils to import a CSV file, clean up the resulting schema, fix date formats and extract some of the columns into a separate table. It’s accompanied by a ten minute video originally recorded for the HYTRADBOI conference. #31st July 2022, 7:57 pm

30th July 2022

  • GAS-ICS-Sync (via) Google Calendar can subscribe to ICS calendar feeds... but polls for updates less than once every 24 hours (as far as I can tell), greatly limiting their usefulness. Derek Antrican wrote a script using Google Apps Script which fixes this by polling calendar URLs more often and writing them to your calendar via the write API. #30th July 2022, 11:47 pm
  • GPSJam (via) John Wiseman’s “Daily maps of GPS interference” —a beautiful interactive globe (powered by Mapbox GL) which you can use to see points of heaviest GPS interference over a 24 hour period, using data collected from commercial airline radios by ADS-B Exchange. “From what I can tell the most common reason for aircraft GPS systems to have degraded accuracy is jamming by military systems. At least, the vast majority of aircraft that I see with bad GPS accuracy are flying near conflict zones where GPS jamming is known to occur.” #30th July 2022, 7:51 pm
  • Introducing sqlite-lines - a SQLite extension for reading files line-by-line (via) Alex Garcia wrote a brilliant C module for SQLIte which adds functions (and a table-valued function) for efficiently reading newline-delimited text into SQLite. When combined with SQLite’s built-in JSON features this means you can read a huge newline-delimited JSON file into SQLite in a streaming fashion so it doesn’t exhaust memory for a large file. Alex also compiled the extension to WebAssembly, and his post here is an Observable notebook post that lets you exercise the code directly. #30th July 2022, 7:18 pm

29th July 2022