Simon Willison’s Weblog

Subscribe

Blogmarks

Filters: Sorted by date

Discord History Tracker. Very interestingly shaped piece of software. You install and run a localhost web application on your own machine, then paste some JavaScript into the Discord Electron app’s DevTools console (ignoring the prominent messages there warning you not to paste anything into it). The JavaScript scrapes messages you can see in Discord and submits them back to that localhost application, which writes them to a SQLite database for you. It’s written in C# with ASP.NET Core, but complied executables are provided for Windows, macOS and Linux. I had to allow execution of four different unsigned binaries to get this working on my Mac.

# 2nd September 2022, 9:37 pm / security, sqlite, discord

Open every CSV file in a GitHub repository in Datasette Lite (via) I built an Observable notebook that accepts a GitHub repository as input, scans it for CSV files and generates a link to open all of those CSV files in Datasette Lite.

# 1st September 2022, 7:24 pm / github, projects, observable, datasette-lite

Building Layoffs on a Healthy Foundation (via) Kellan provides some valuable guidance for running layoffs in as humane a way as possible.

# 1st September 2022, 6:11 pm / kellan-elliott-mccrea, management

Run Stable Diffusion on your M1 Mac’s GPU. Ben Firshman provides detailed instructions for getting Stable Diffusion running on an M1 Mac.

# 1st September 2022, 5:41 pm / ben-firshman, machine-learning, macos, ai, stable-diffusion, generative-ai, text-to-image

Farmbound, or how I built an app in 2022. Stuart Langridge describes the architecture and decision process behind his new mobile web game, Farmbound.

# 31st August 2022, 11:23 pm / stuart-langridge, web

Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator. Andy Baio and I collaborated on an investigation into the training set used for Stable Diffusion. I built a Datasette instance with 12m image records sourced from the LAION-Aesthetics v2 6+ aesthetic score data used as part of the training process, and built a tool so people could run searches and explore the data. Andy did some extensive analysis of things like the domains scraped for the images and names of celebrities and artists represented in the data. His write-up here explains our project in detail and some of the patterns we’ve uncovered so far.

# 31st August 2022, 2:10 am / machine-learning, ai, stable-diffusion, generative-ai, laion, training-data

How SQLite Scales Read Concurrency (via) Ben Johnson’s series on SQLite internals continues—this time with a detailed explanation of how the SQLite WAL (Write-Ahead Log) is implemented.

# 24th August 2022, 4:16 pm / databases, sqlite, ben-johnson

Stable Diffusion Public Release (via) New AI just dropped. Stable Diffusion is similar to DALL-E, but completely open source and with a CC0 license applied to everything it generates. I have a Twitter thread (the via) link of comparisons I’ve made between its output and my previous DALL-E experiments. The announcement buries the lede somewhat: to try it out, visit beta.dreamstudio.ai—which you can use for free at the moment, but it’s unclear to me how billing is supposed to work.

# 22nd August 2022, 7:12 pm / machine-learning, dalle, stable-diffusion, generative-ai, text-to-image

Digitizing 55,000 pages of civic meetings (via) Philip James has been building public, searchable archives of city council meetings for various cities—Oakland and Alamedia so far—using my s3-ocr script to run Textract OCR against the PDFs of the minutes, and deploying them to Fly using Datasette. This is a really cool project, and very much the kind of thing I’ve been hoping to support with the tools I’ve been building.

# 22nd August 2022, 4:26 pm / archiving, ocr, political-hacking, datasette, fly

Turning SQLite into a distributed database (via) Heyang Zhou introduces mvSQLite, his brand new open source “SQLite-compatible distributed database” built in Rust on top of Apple’s FoundationDB. This is a very promising looking new entry into the distributed/replicated SQLite space: FoundationDB was designed to provide low-level primitives that tools like this could build on top of.

# 21st August 2022, 5:40 pm / databases, sqlite, rust

Show HN: A new way to use GPT-3 to generate code (and everything else). Riley Goodside is my favourite Twitter follow for GPT-3 tips. Here he describes a powerful prompt pattern he's designed which lets you generate extremely complex code output by asking GPT-3 to fill in $$areas like this$$ with different patterns, then stitch them together into full HTML or other source code files. It's really clever.

# 20th August 2022, 9:33 pm / gpt-3, prompt-engineering, generative-ai, riley-goodside, llms

Shoelace (via) Saw this for the first time today: it’s a relatively new library of framework-agnostic Web Components, built on lit-html and covering a huge array of common functionality: buttons and sliders and dialogs and drawer interfaces and dropdown menus and so on. The design is very clean, the documentation is superb—and it looks like you can cherry pick just the components you are using for a pretty lean addition to your page weight. So refreshing to see libraries like this that really take advantage of modern web standards.

# 20th August 2022, 8:57 pm / css, javascript, web-standards, web-components, lit-html

The Datasette Newsletter: Datasette Lite, Datasette Tutorials, Datasette Cloud. It’s been quite a while since I’ve sent one of these out now—hoping to get this on to a more regular schedule.

# 19th August 2022, 1:20 am / datasette

Building games and apps entirely through natural language using OpenAI’s code-davinci model. A deeply sophisticated example of using prompts to generate entire working JavaScript programs and games using the new code-davinci OpenAI model.

# 17th August 2022, 7:06 pm / game-design, ai, gpt-3, openai, prompt-engineering, generative-ai, llms, ai-assisted-programming

Crunchy Data: Learn Postgres at the Playground (via) Crunchy Data have a new PostgreSQL tutorial series, with a very cool twist: they have a build of PostgreSQL compiled to WebAssembly which runs in the browser, so each tutorial is accompanied by a working psql terminal that lets you try out the tutorial contents interactively. It even has support for PostGIS, though that particular tutorial has to load 80MB of assets in order to get it to work!

# 17th August 2022, 6:30 pm / postgresql, webassembly

Efficient Pagination Using Deferred Joins (via) Surprisingly simple trick for speeding up deep OFFSET x LIMIT y pagination queries, which get progressively slower as you paginate deeper into the data. Instead of applying them directly, apply them to a “select id from ...” query to fetch just the IDs, then either use a join or run a separate “select * from table where id in (...)” query to fetch the full records for that page.

# 16th August 2022, 5:35 pm / performance, sql

Bypassing macOS notarization (via) Useful tip from the geckodriver docs: if you’ve downloaded an executable file through your browser and now cannot open it because of the macOS quarantine feature, you can run “xattr -r -d com.apple.quarantine path-to-binary” to clear that flag so you can execute the file.

# 13th August 2022, 12 am / macos, security

datasette on Open Source Insights (via) Open Source Insights is "an experimental service developed and hosted by Google to help developers better understand the structure, security, and construction of open source software packages". It calculates scores for packages using various automated heuristics. A JSON version of the resulting score card can be accessed using https://deps.dev/_/s/pypi/p/{package_name}/v/

# 11th August 2022, 1:06 am / open-source, security, datasette

sethmlarson/pypi-data (via) Seth Michael Larson uses GitHub releases to publish a ~325MB (gzipped to ~95MB) SQLite database on a roughly monthly basis that contains records of 370,000+ PyPI packages plus their OpenSSF score card metrics. It’s a really interesting dataset, but also a neat way of packaging and distributing data—the scripts Seth uses to generate the database file are included in the repository.

# 11th August 2022, 1:02 am / github, pypi, sqlite, seth-michael-larson

Let websites framebust out of native apps (via) Adrian Holovaty makes a compelling case that it is Not OK that we allow native mobile apps to embed our websites in their own browsers, including the ability for them to modify and intercept those pages (it turned out today that Instagram injects extra JavaScript into pages loaded within the Instagram in-app browser). He compares this to frame-busting on the regular web, and proposes that the X-Frame-Options: DENY header which browsers support to prevent a page from being framed should be upgraded to apply to native embedded browsers as well.

I’m not convinced that reusing X-Frame-Options: DENY would be the best approach—I think it would break too many existing legitimate uses—but a similar option (or a similar header) specifically for native apps which causes pages to load in the native OS browser instead sounds like a fantastic idea to me.

# 10th August 2022, 10:29 pm / adrian-holovaty, browsers, privacy, security

Introducing sqlite-http: A SQLite extension for making HTTP requests (via) Characteristically thoughtful SQLite extension from Alex, following his sqlite-html extension from a few days ago. sqlite-http lets you make HTTP requests from SQLite—both as a SQL function that returns a string, and as a table-valued SQL function that lets you independently access the body, headers and even the timing data for the request.

This write-up is excellent: it provides interactive demos but also shows how additional SQLite extensions such as the new-to-me “define” extension can be combined with sqlite-http to create custom functions for parsing and processing HTML.

# 10th August 2022, 10:22 pm / http, sqlite, alex-garcia

How SQLite Helps You Do ACID (via) Ben Johnson’s series of posts explaining the internals of SQLite continues with a deep look at how the rollback journal works. I’m learning SO much from this series.

# 10th August 2022, 3:39 pm / sqlite, fly, ben-johnson

curl-impersonate (via) “A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”

I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL.

# 10th August 2022, 3:34 pm / crawling, curl, scraping

sqlite-zstd: Transparent dictionary-based row-level compression for SQLite. Interesting SQLite extension from phiresky, the author of that amazing SQLite WASM hack from a while ago which could fetch subsets of a large SQLite database using the HTTP range header. This extension, written in Rust, implements row-level compression for a SQLite table by creating compression dictionaries for larger chunks of the table, providing better results than just running compression against each row value individually.

# 9th August 2022, 9:23 pm / sqlite, rust, zstd

Microsoft® Open Source Software (OSS) Secure Supply Chain (SSC) Framework Simplified Requirements. This is really good: don’t get distracted by the acronyms, skip past the intro and head straight to the framework practices section, which talks about things like keeping copies of the packages you depend on, running scanners, tracking package updates and most importantly keeping an inventory of the open source packages you work so you can quickly respond to things like log4j.

I feel like I say this a lot these days, but if you had told teenage-me that Microsoft would be publishing genuinely useful non-FUD guides to open source supply chain security by 2022 I don’t think I would have believed you.

# 6th August 2022, 4:49 pm / microsoft, open-source, security, supply-chain

Introducing sqlite-html: query, parse, and generate HTML in SQLite (via) Another brilliant SQLite extension module from Alex Garcia, this time written in Go. sqlite-html adds a whole family of functions to SQLite for parsing and constructing HTML strings, built on the Go goquery and cascadia libraries. Once again, Alex uses an Observable notebook to describe the new features, with embedded interactive examples that are backed by a Datasette instance running in Fly.

# 3rd August 2022, 5:31 pm / go, html, sqlite, datasette, alex-garcia

How I Used DALL·E 2 to Generate The Logo for OctoSQL (via) Jacob Martin gives a blow-by-blow account of his attempts at creating a logo for his OctoSQL project using DALL-E, spending $30 of credits and making extensive use of both the “variations” feature and the tool that lets you request modifications to existing images by painting over parts you want to regenerate. Really interesting to read as an example of a “real world” DALL-E project.

# 2nd August 2022, 9:12 pm / openai, dalle, generative-ai

storysniffer (via) Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it.

# 1st August 2022, 11:40 pm / machine-learning, python, jupyter, ben-welsh

Cleaning data with sqlite-utils and Datasette (via) I wrote a new tutorial for the Datasette website, showing how to use sqlite-utils to import a CSV file, clean up the resulting schema, fix date formats and extract some of the columns into a separate table. It’s accompanied by a ten minute video originally recorded for the HYTRADBOI conference.

# 31st July 2022, 7:57 pm / documentation, tutorials, datasette, sqlite-utils

GAS-ICS-Sync (via) Google Calendar can subscribe to ICS calendar feeds... but polls for updates less than once every 24 hours (as far as I can tell) greatly limiting their usefulness. Derek Antrican wrote a script using Google App Script which fixes this by polling calendar URLs more often and writing them to your calendar via the write API.

# 30th July 2022, 11:47 pm / google-calendar, icalendar

Years

Tags