Blogmarks
Filters: Sorted by date
Big Data is Dead (via) Don’t be distracted by the headline, this is very worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept.
Making SQLite extensions pip install-able (via) Alex Garcia figured out how to bundle a compiled SQLite extension in a Python wheel (building different wheels for different platforms) and publish them to PyPI. This is a huge leap forward in terms of the usability of SQLite extensions, which have previously been pretty difficult to actually install and run. Alex also created Datasette plugins that depend on his packages, so you can now “datasette install datasette-sqlite-regex” (or datasette-sqlite-ulid, datasette-sqlite-fastrand, datasette-sqlite-jsonschema) to gain access to his custom SQLite extensions in your Datasette instance. It even works with “datasette publish --install” to deploy to Vercel, Fly.io and Cloud Run.
The technology behind GitHub’s new code search (via) I’ve been a beta user of the new GitHub code search for a while and I absolutely love it: you really can run a regular expression search across the entire of GitHub, which is absurdly useful for both finding code examples of under-documented APIs and for seeing how people are using open source code that you have released yourself. It turns out GitHub built their own search engine for this from scratch, called Blackbird. It’s implemented in Rust and makes clever use of sharded ngram indexes—not just trigrams, because it turns out those aren’t quite selective enough for a corpus that includes a lot of three letter keywords like “for”.
I also really appreciated the insight into how they handle visibility permissions: they compile those into additional internal search clauses, resulting in things like “RepoIDs(...) or PublicRepo()”
I’m Now a Full-Time Professional Open Source Maintainer. Filippo Valsorda, previously a member of the Go team at Google, is now independent and making a full-time living as a maintainer of various open source projects relating to Go. He’s managing to pull in an amount “equivalent to my Google total compensation package”, which is a huge achievement: the greatest cost involved in independent open source is usually the opportunity cost of turning down a big tech salary. He’s doing this through a high touch retainer model, where six client companies pay him to keep working on his projects and also provide them with varying amounts of expert consulting.
GROUNDHOG-DAY.com (via) “The leading Groundhog Day data source”. I love this so much: it’s a collection of predictions from all 59 groundhogs active in towns scattered across North America (I had no idea there were that many). The data is available via a JSON API too.
Carving the Scheduler Out of Our Orchestrator (via) Thomas Ptacek describes Fly’s new custom-built alternative to Nomad and Kubernetes in detail, including why they eventually needed to build something custom to best serve their platform. In doing so he provides the best explanation I’ve ever seen of what an orchestration system actually does.
Python’s “Disappointing” Superpowers. Luke Plant provides a fascinating detailed list of Python libraries that use dynamic meta-programming tricks in interesting ways—including SQLAlchemy, Django, Werkzeug, pytest and more.
pyfakefs usage (via) New to me pytest fixture library that provides a really easy way to mock Python’s filesystem functions—open(), os.path.listdir() and so on—so a test can run against a fake set of files. This looks incredibly useful.
datasette-scraper walkthrough on YouTube (via) datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table!
Examples of sites built using Datasette (via) I gave the examples page on the Datasette website a significant upgrade today: it now includes screenshots (taken using shot-scraper) of six projects chosen to illustrate the variety of problems Datasette can be used to tackle.
Cyber (via) “Cyber is a new language for fast, efficient, and concurrent scripting.” Lots of interesting ideas in here, but the one that really caught my eye is that its designed to be easily embedded into other languages and “will allow the host to insert gas mileage checks in user scripts. This allows the host to control how long a script can run”—my dream feature for implementing a safe, sandboxed extension mechanism! Cyber is implemented using Zig and LLVM.
sqlite-jsonschema. “A SQLite extension for validating JSON objects with JSON Schema”, building on the jsonschema Rust crate. SQLite and JSON are already a great combination—Alex suggests using this extension to implement check constraints to validate JSON columns before inserting into a table, or just to run queries finding existing data that doesn’t match a given schema.
sqlite-ulid. Alex Garcia’s sqlite-ulid adds lightning-fast SQL functions for generating ULIDs—Universally Unique Lexicographically Sortable Identifiers. These work like UUIDs but are smaller and faster to generate, and can be canonically encoded as a URL-safe 26 character string (UUIDs are 36 characters). Again, this builds on a Rust crate—ulid-rs—and can generate 1 million byte-represented ULIDs with the ulid_bytes() function in just 88.4ms.
sqlite-fastrand. Alex Garcia just dropped three new SQLite extensions, and I’m going to link to all of them. The first is sqlite-fastrand, which adds new functions for generating random numbers (and alphanumeric characters too). Impressively, these out-perform the default SQLite random() and randomblob() functions by about 1.6-2.6x, thanks to being built on the Rust fastrand crate which builds on wyhash, an extremely fast (though not cryptographically secure) hashing function.
graphql-voyager. Neat tool for producing an interactive graph visualization of any GraphQL API. Click “Change schema” and then “Introspection” and it will give you a GraphQL query you can run against your own API—copy and paste back the JSON results and the visualizer will show you how your API fits together. I tested this against a datasette-graphql instance and it worked exactly as described.
babelmark3 (via) I found this tool today while investigating an bug in Datasette’s datasette-render-markdown plugin: it lets you run a fragment of Markdown through dozens of different Markdown libraries across multiple different languages and compare the results. Under the hood it works with a registry of API URL endpoints for different implementations, most of which are encrypted in the configuration file on GitHub because they are only intended to be used by this comparison tool.
Guppe Groups. This is a really neat mechanism for helping build topic-oriented communities on Mastodon: follow @any-group-name@a.gup.pe to join (or create) a group, then that account will re-broadcast any messages from people in that group who mention the group in their message.
I found it via the histodons group. I was pondering how something like this might work this recently, so it’s great to see someone has built it already.
Python Sandbox in Web Assembly (via) Jim Kring responded to my questions on Mastodon about running Python in a WASM sandbox by building this repo, which demonstrates using wasmer-python to run a build of Python 3.6 compiled to WebAssembly, complete with protected access to a sandbox directory.
Jortage Communal Cloud. An interesting pattern that’s emerging in the Mastodon / Fediverse community: Jortage is “a communal project providing object storage and hosting”. Each Mastodon server needs to host copies of files—not just for their users, but files that have been imported into the instance because they were posted by other people followed by that instance’s users. Jortage lets multiple instances share the same objects, reducing costs and making things more efficient. I like the idea that multiple projects like this can co-exist, improving the efficiency of the overall network without introducing single centralized services.
Wildebeest (via) New project from Cloudflare, first quietly unveiled three weeks ago: “Wildebeest is an ActivityPub and Mastodon-compatible server”. It’s built using a flurry of Cloudflare-specific technology, including Workers, Pages and their SQLite-based D1 database.
Inside the Globus INK: a mechanical navigation computer for Soviet spaceflight (via) Absolutely beautiful piece of Soviet spacecraft engineering, explained in detail by Ken Shirriff.
The Page With No Code
(via)
A fun demo by Dan Q, who created a web page with no HTML at all - but in Firefox it still renders content, thanks to a data URI base64 encoded stylesheet served in a link: header that uses html::before, html::after, body::before and body::after with content: properties to serve the content. It even has a background image, encoded as a base64 SVG nested inside another data URI.
OpenAI Cookbook: Techniques to improve reliability (via) “Let’s think step by step” is a notoriously successful way of getting large language models to solve problems, but it turns out that’s just the tip of the iceberg: this article includes a wealth of additional examples and techniques that can be used to trick GPT-3 into being a whole lot more effective.
datasette-granian (via) Granian is a new Python web server—similar to Gunicorn—written in Rust. I built a small plugin that adds a “datasette granian” command starting a Granian server that serves Datasette’s ASGI application, using the same pattern as my existing datasette-gunicorn plugin.
Hctree Design Documentation. More detailed information on the design of the new Hctree SQLite branch.
Hctree: an experimental high-concurrency database backend for SQLite (via) Really interesting new research branch from the core SQLite team. “Hctree uses optimistic row-level locking and is designed to support dozens of concurrent writers running at full-speed”—with very impressive benchmarks supporting that claim. Also two bonuses: it has a replication mechanism based on the existing SQLite sessions extension, and it bumps up the maximum size of a SQLite database from 16TiB to 1EiB (roughly one million TiB).
Datasette is my data hammer (via) Jeremia Kimelman—a data journalist at CalMatters in Sacramento—enthuses about how he uses Datasette as his default hammer for all kinds of data projects—in particular how much he appreciates Datasette’s focus on URLs. So nice to see this!
Igalia: the Open Source Powerhouse You’ve Never Heard of (via) An in-depth article about Igalia from July 2022. I had no idea how much stuff they had worked on: arrow functions, generators, async/await, MathML, CSS Grid and a whole bunch more.
Servo to Advance in 2023 (via) This is excellent news: Servo, the browser-in-Rust project started by Mozilla in 2012 that produced the Rust programming language, is getting re-activated with four new full-time developers provided by Igalia.
Igalia are a fascinating organization - I hadn't realized quite how influential they've been until I read their Wikipedia page just now
They've been around since 2001, and "in 2019 they were the #2 committers to both the WebKit and Chromium codebases and in the top 10 contributors to Gecko/Servo" - including implementing and maintaining CSS Grid Layout!
How to simulate a broken database connection for testing in Django (via) Neil Kakkar explores the options using unittest.patch() and then settles on a neater pattern using “with connection.execute_wrapper(QueryTimeoutWrapper()):” to simulate the exact exception he needs to test against.