Simon Willison’s Weblog

Subscribe
Atom feed

Blogmarks

Filters: Sorted by date

How ChatGPT Kicked Off an A.I. Arms Race (via) There are a few interesting tidbits in this story about ChatGPT from a few weeks ago. ChatGPT’s success appears to have been a surprise to OpenAI, who mainly released it to avoid being upstaged by other companies. Also interesting is this: “But two months after its debut, ChatGPT has more than 30 million users and gets roughly five million visits a day, two people with knowledge of the figures said.”—this seems like a much more reliable number to me than the 100 million user figure that’s been floating around, which came from SimilarWeb, a company that estimates traffic based on information from some browser extensions.

# 19th February 2023, 8:31 pm / ai, openai, generative-ai, chatgpt, llms

TabFS (via) “TabFS is a browser extension that mounts your browser tabs as a filesystem on your computer.” What a fascinating idea! Each browser tab gets a virtual directory (via FUSE) with “files” representing the tab title, contents and any resources that have been loaded by that page. You can edit files in those folders to live-update the content that’s loaded in your browser!

# 19th February 2023, 4:08 pm / browsers

I’ve been thinking how Sydney can be so different from ChatGPT. Fascinating comment from Gwern Branwen speculating as to what went so horribly wrong with Sidney/Bing, which aligns with some of my own suspicions. Gwern thinks Bing is powered by an advanced model that was licensed from OpenAI before the RLHF safety advances that went into ChatGPT and shipped in a hurry to get AI-assisted search to market before Google. “What if Sydney wasn’t trained on OA RLHF at all, because OA wouldn’t share the crown jewels of years of user feedback and its very expensive hired freelance programmers & whatnot generating data to train on?”

# 19th February 2023, 3:48 pm / bing, ai, gpt-3, openai, generative-ai, chatgpt, llms

Docker can copy in files directly from another image. I did not know you could do this in a Dockerfile:

COPY --from=lubien/tired-proxy:2 /tired-proxy /tired-proxy

# 19th February 2023, 5:35 am / docker

Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries (via) Computational journalism professor Nick Diakopoulos takes a deeper dive into the quality of the summarizations provided by AI-assisted Bing. His findings are troubling: for news queries, which are a great test for AI summarization since they include recent information that may have sparse or conflicting stories, Bing confidently produces answers with important errors: claiming the Ohio train derailment happened on February 9th when it actually happened on February 3rd for example.

# 18th February 2023, 6:09 pm / bing, search, trust, generative-ai, llms, ai-assisted-search, digital-literacy

Writing Javascript without a build system (via) Julia Evans perfectly captures why I prefer not to use build systems in the majority of my projects that use JavaScript: “... my experience with build systems (not just Javascript build systems!), is that if you have a 5-year-old site, often it’s a huge pain to get the site built again. And because most of my websites are pretty small, the advantage of using a build system is pretty small.”

# 18th February 2023, 5:25 am / javascript, julia-evans

How The Post is replacing Mapbox with open source solutions (via) Kevin Schaul describes the Washington Post’s emerging open source GIS stack: OpenMapTiles, Maputnik, PMTiles and Maplibre-gl-js.

# 17th February 2023, 6:45 pm / maps, opensearch, openstreetmap, washington-post

Web Push for Web Apps on iOS and iPadOS. iOS and iPadOS 16.4 beta 1 finally brings web push notifications to iOS. User’s need to add an app to their home screen and then approve notification access to get this functionality, which also includes the ability for apps to update a badge on their icon. Thankfully you don’t need paid membership of the Apple Developer Program ($99/year) in order to send notifications.

# 17th February 2023, 12:28 am / safari, ios

Browse the BBC In Our Time archive by Dewey decimal code. Matt Webb built Braggoscope, an alternative interface for browsing the 1,000 episodes of the BBC's In Our Time dating back to 1998, organized by Dewey decimal system and with related episodes calculated using OpenAI embeddings and guests and reading lists extracted using GPT-3.

Using GitHub Copilot to write code and calling out to GPT-3 programmatically to dodge days of graft actually brought tears to my eyes.

# 13th February 2023, 4:03 pm / matt-webb, gpt-3, openai, generative-ai, llms, embeddings

The anatomy of visually-hidden (via) James Edwards provides a detailed breakdown of the current recommended CSS for hiding content while keeping it available for assistive technologies in the browser accessibility and render trees. Lots of accumulated tricks and screen reader special cases in this.

# 11th February 2023, 12:37 am / accessibility, css, screen-readers

Introducing sqlite-vss: A SQLite Extension for Vector Search (via) This latest SQLite extension from Alex Garcia is possibly his best yet: it adds FAISS-powered vector similarity search directly to SQLite, enabling fast KNN similarity lookups against a virtual table that feels a lot like SQLite’s own built-in full text search feature. This write-up includes interactive demos using Datasette called from an Observable notebook, running similarity searches against an index of 200,000 news headlines and summaries in less than 50ms.

# 10th February 2023, 10:53 pm / sqlite, datasette, observable, alex-garcia, vector-search

ChatGPT Is a Blurry JPEG of the Web. Science fiction author Ted Chiang offers a brilliant analogy for ChatGPT in this New Yorker article: it's a highly lossy compression algorithm for a vast amount of information which works like a JPEG, and uses grammatically correct interpolation to fill back in the missing gaps.

ChatGPT is so good at this form of interpolation that people find it entertaining: they’ve discovered a “blur” tool for paragraphs instead of photos, and are having a blast playing with it.

# 9th February 2023, 9:28 pm / new-yorker, ai, gpt-3, generative-ai, chatgpt, llms, ted-chiang

OpenAI’s Whisper is another case study in Colonisation (via) Really interesting perspective on Whisper from the Papa Reo project - a group working to nurture and proliferate the Māori language.

The main questions we ask when we see papers like FLEURS and Whisper are: where did they get their indigenous data from, who gave them access to it, and who gave them the right to create a derived work from that data and then open source the derivation?

# 8th February 2023, 5:22 pm / openai, generative-ai, whisper, speech-to-text

PocketPy. PocketPy is “a lightweight(~5000 LOC) Python interpreter for game engines”. It’s implemented as a single C++ header which provides an impressive subset of the Python language: functions, dictionaries, lists, strings and basic classes too. There’s also a browser demo that loads a 766.66 KB pypocket.wasm file (240.72 KB compressed) and uses it to power a basic terminal interface. I tried and failed to get that pypocket.wasm file working from wasmer/wasmtime/wasm3—it should make a really neat lightweight language to run in a WebAssembly sandbox.

# 8th February 2023, 5:13 am / python, webassembly

Big Data is Dead (via) Don’t be distracted by the headline, this is very worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept.

# 7th February 2023, 7:25 pm / big-data, small-data

Making SQLite extensions pip install-able (via) Alex Garcia figured out how to bundle a compiled SQLite extension in a Python wheel (building different wheels for different platforms) and publish them to PyPI. This is a huge leap forward in terms of the usability of SQLite extensions, which have previously been pretty difficult to actually install and run. Alex also created Datasette plugins that depend on his packages, so you can now “datasette install datasette-sqlite-regex” (or datasette-sqlite-ulid, datasette-sqlite-fastrand, datasette-sqlite-jsonschema) to gain access to his custom SQLite extensions in your Datasette instance. It even works with “datasette publish --install” to deploy to Vercel, Fly.io and Cloud Run.

# 6th February 2023, 7:44 pm / pip, plugins, python, sqlite, datasette, alex-garcia

The technology behind GitHub’s new code search (via) I’ve been a beta user of the new GitHub code search for a while and I absolutely love it: you really can run a regular expression search across the entire of GitHub, which is absurdly useful for both finding code examples of under-documented APIs and for seeing how people are using open source code that you have released yourself. It turns out GitHub built their own search engine for this from scratch, called Blackbird. It’s implemented in Rust and makes clever use of sharded ngram indexes—not just trigrams, because it turns out those aren’t quite selective enough for a corpus that includes a lot of three letter keywords like “for”.

I also really appreciated the insight into how they handle visibility permissions: they compile those into additional internal search clauses, resulting in things like “RepoIDs(...) or PublicRepo()”

# 6th February 2023, 6:38 pm / github, search, rust

I’m Now a Full-Time Professional Open Source Maintainer. Filippo Valsorda, previously a member of the Go team at Google, is now independent and making a full-time living as a maintainer of various open source projects relating to Go. He’s managing to pull in an amount “equivalent to my Google total compensation package”, which is a huge achievement: the greatest cost involved in independent open source is usually the opportunity cost of turning down a big tech salary. He’s doing this through a high touch retainer model, where six client companies pay him to keep working on his projects and also provide them with varying amounts of expert consulting.

# 3rd February 2023, 1:12 am / consulting, go, open-source, careers, filippo-valsorda

GROUNDHOG-DAY.com (via) “The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too.

# 2nd February 2023, 10:05 pm / data

Carving the Scheduler Out of Our Orchestrator (via) Thomas Ptacek describes Fly’s new custom-built alternative to Nomad and Kubernetes in detail, including why they eventually needed to build something custom to best serve their platform. In doing so he provides the best explanation I’ve ever seen of what an orchestration system actually does.

# 2nd February 2023, 9:46 pm / thomas-ptacek, kubernetes, fly

Python’s “Disappointing” Superpowers. Luke Plant provides a fascinating detailed list of Python libraries that use dynamic meta-programming tricks in interesting ways—including SQLAlchemy, Django, Werkzeug, pytest and more.

# 1st February 2023, 10:41 pm / luke-plant, python

pyfakefs usage (via) New to me pytest fixture library that provides a really easy way to mock Python’s filesystem functions—open(), os.path.listdir() and so on—so a test can run against a fake set of files. This looks incredibly useful.

# 1st February 2023, 10:37 pm / luke-plant, python, testing, pytest

datasette-scraper walkthrough on YouTube (via) datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table!

# 29th January 2023, 5:23 am / plugins, scraping, datasette, colin-dellow

Examples of sites built using Datasette (via) I gave the examples page on the Datasette website a significant upgrade today: it now includes screenshots (taken using shot-scraper) of six projects chosen to illustrate the variety of problems Datasette can be used to tackle.

# 29th January 2023, 3:40 am / projects, datasette, shot-scraper

Cyber (via) “Cyber is a new language for fast, efficient, and concurrent scripting.” Lots of interesting ideas in here, but the one that really caught my eye is that its designed to be easily embedded into other languages and “will allow the host to insert gas mileage checks in user scripts. This allows the host to control how long a script can run”—my dream feature for implementing a safe, sandboxed extension mechanism! Cyber is implemented using Zig and LLVM.

# 28th January 2023, 4:25 am / llvm, programming-languages, sandboxing, zig

sqlite-jsonschema. “A SQLite extension for validating JSON objects with JSON Schema”, building on the jsonschema Rust crate. SQLite and JSON are already a great combination—Alex suggests using this extension to implement check constraints to validate JSON columns before inserting into a table, or just to run queries finding existing data that doesn’t match a given schema.

# 28th January 2023, 3:50 am / json, jsonschema, sqlite, rust, alex-garcia

sqlite-ulid. Alex Garcia’s sqlite-ulid adds lightning-fast SQL functions for generating ULIDs—Universally Unique Lexicographically Sortable Identifiers. These work like UUIDs but are smaller and faster to generate, and can be canonically encoded as a URL-safe 26 character string (UUIDs are 36 characters). Again, this builds on a Rust crate—ulid-rs—and can generate 1 million byte-represented ULIDs with the ulid_bytes() function in just 88.4ms.

# 28th January 2023, 3:45 am / sqlite, uuid, rust, alex-garcia

sqlite-fastrand. Alex Garcia just dropped three new SQLite extensions, and I’m going to link to all of them. The first is sqlite-fastrand, which adds new functions for generating random numbers (and alphanumeric characters too). Impressively, these out-perform the default SQLite random() and randomblob() functions by about 1.6-2.6x, thanks to being built on the Rust fastrand crate which builds on wyhash, an extremely fast (though not cryptographically secure) hashing function.

# 28th January 2023, 3:41 am / sqlite, rust, alex-garcia

graphql-voyager. Neat tool for producing an interactive graph visualization of any GraphQL API. Click “Change schema” and then “Introspection” and it will give you a GraphQL query you can run against your own API—copy and paste back the JSON results and the visualizer will show you how your API fits together. I tested this against a datasette-graphql instance and it worked exactly as described.

# 27th January 2023, 11:58 pm / graphql

babelmark3 (via) I found this tool today while investigating an bug in Datasette’s datasette-render-markdown plugin: it lets you run a fragment of Markdown through dozens of different Markdown libraries across multiple different languages and compare the results. Under the hood it works with a registry of API URL endpoints for different implementations, most of which are encrypted in the configuration file on GitHub because they are only intended to be used by this comparison tool.

# 27th January 2023, 11:34 pm / apis, markdown

Years

Tags