Simon Willison’s Weblog

Subscribe

Blogmarks

Filters: Sorted by date

Rye: Added support for marking virtualenvs ignored for cloud sync (via) A neat feature in the new Rye 0.22.0 release. It works by using an xattr Rust crate to set the attributes “com.dropbox.ignored” and “com.apple.fileprovider.ignore#P” on the folder.

# 10th February 2024, 6:50 am / python, dropbox, rust, rye

(Almost) Every infrastructure decision I endorse or regret after 4 years running infrastructure at a startup (via) Absolutely fascinating post by Jack Lindamood talking about services, tools and processes used by his startup and which ones turned out to work well v.s. which ones are now regretted.

I’d love to see more companies produce lists like this.

# 10th February 2024, 5:51 am / architecture, infrastructure, startups

How I write HTTP services in Go after 13 years (via) Useful set of current best practices for deploying HTTP servers written in Go. I guess Go counts as boring technology these days, which is high praise in my book.

# 9th February 2024, 8:40 pm / go

Figure out who’s leaving the company: dump, diff, repeat (via) Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.

I suggest using Git for this - a form of Git scraping - as then you get a detailed commit log of changes over time effectively for free.

I really enjoyed Rachel's closing thought:

Incidentally, if someone gets mad about you running this sort of thing, you probably don't want to work there anyway. On the other hand, if you're able to build such tools without IT or similar getting "threatened" by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don't tend to last.

# 9th February 2024, 5:44 am / git, git-scraping, rachel-kroll

“Wherever you get your podcasts” is a radical statement. Anil Dash points out that podcasts are one of the few cases where the dream really did work out:

“[...] what it represents is the triumph of exactly the kind of technology that’s supposed to be impossible: open, empowering tech that’s not owned by any one company, that can’t be controlled by any one company, and that allows people to have ownership over their work and their relationship with their audience.”

# 9th February 2024, 5:18 am / anil-dash, podcasts, rss, web-standards

The first four Val Town runtimes (via) Val Town solves one of my favourite technical problems: how to run untrusted code in a safe sandbox. They're on their fourth iteration of this now, currently using a Node.js application that launches Deno sub-processes using the node-deno-vm npm package and runs code in those, taking advantage of the Deno sandboxing mechanism and terminating processes that take too long in order to protect against while(true) style attacks.

# 8th February 2024, 6:38 pm / javascript, nodejs, sandboxing, deno, tom-macwright, val-town

Google’s Gemini Advanced: Tasting Notes and Implications. Ethan Mollick reviews the new Google Gemini Advanced—a rebranded Bard, released today, that runs on the GPT-4 competitive Gemini Ultra model.

“GPT-4 [...] has been the dominant AI for well over a year, and no other model has come particularly close. Prior to Gemini, we only had one advanced AI model to look at, and it is hard drawing conclusions with a dataset of one. Now there are two, and we can learn a few things.”

I like Ethan’s use of the term “tasting notes” here. Reminds me of how Matt Webb talks about being a language model sommelier.

# 8th February 2024, 3:10 pm / google, ai, generative-ai, gpt-4, bard, llms, ethan-mollick, gemini

SQL for Data Scientists in 100 Queries. New comprehensive SQLite SQL tutorial from Greg Wilson, author of Teaching Tech Together and founder of The Carpentries.

# 6th February 2024, 11:08 pm / greg-wilson, sql, sqlite

The power of two random choices, visualized. Grant Slatton shares a visualization illustrating “a favorite load balancing technique at AWS”: pick two nodes at random and then send the task to whichever of those two has the lowest current load score.

Why just two nodes? “The function grows logarithmically, so it’s a big jump from 1 to 2 and then tapers off *real* quick.”

# 6th February 2024, 10:21 pm / aws, load-balancing, scaling

scriptisto (via) This is really clever. “scriptisto is tool to enable writing one file scripts in languages that require compilation, dependencies fetching or preprocessing.”

You start your file with a “#!/usr/bin/env scriptisto” shebang line, then drop in a specially formatted block that tells it which compiler (if any) to use and how to build the tool. The rest of the file can then be written in any of the dozen-plus included languages... or you can create your own template to support something else.

The end result is you can now write a one-off tool in pretty much anything and have it execute as if it was a single built executable.

# 6th February 2024, 4:49 am / programming

shot-scraper 1.4. I decided to add HTTP Basic authentication support to shot-scraper today and found several excellent pull requests waiting to be merged, by Niel Thiart and mhalle.

1.4 adds support for HTTP Basic auth, custom --scale-factor shots, additional --browser-arg arguments and a fix for --interactive mode.

# 5th February 2024, 11:11 pm / projects, shot-scraper

How does Sidekiq really work? (via) I really like this category of blog post: Dan Svetlov took the time to explore the Sidekiq message queue’s implementation and then wrote it up in depth.

# 5th February 2024, 5:20 pm / queues, ruby, sidekiq

llm-sentence-transformers 0.2. I added a new --trust-remote-code option when registering an embedding model, which means LLM can now run embeddings through the new Nomic AI nomic-embed-text-v1 model.

# 4th February 2024, 7:39 pm / plugins, projects, transformers, ai, embeddings, llm, nomic

Introducing Nomic Embed: A Truly Open Embedding Model. A new text embedding model from Nomic AI which supports 8192 length sequences, claims better scores than many other models (including OpenAI’s new text-embedding-3-small) and is available as both a hosted API and a run-yourself model. The model is Apache 2 licensed and Nomic have released the full set of training data and code.

From the accompanying paper: “Full training of nomic-embed-text-v1 can be conducted in a single week on one 8xH100 node.”

# 3rd February 2024, 11:13 pm / ai, embeddings, nomic

The Engineering behind Figma’s Vector Networks (via) Fascinating post by Alex Harri (in 2019) describing FIgma’s unique approach to providing an alternative to the classic Bézier curve pen tool. It includes a really clear explanation of Bézier curves, then dives into the alternative, recent field of vector networks which support lines and curves between any two points rather than enforcing a single path.

# 3rd February 2024, 11:08 pm / graphics, bezier

Open Language Models (OLMos) and the LLM landscape (via) OLMo is a newly released LLM from the Allen Institute for AI (AI2) currently available in 7b and 1b parameters (OLMo-65b is on the way) and trained on a fully openly published dataset called Dolma.

The model and code are Apache 2, while the data is under the “AI2 ImpACT license”.

From the benchmark scores shared here by Nathan Lambert it looks like this may be the highest performing model currently available that was built using a fully documented training set.

What’s in Dolma? It’s mainly Common Crawl, Wikipedia, Project Gutenberg and the Stack.

# 2nd February 2024, 4:11 am / ai, generative-ai, llms, training-data, ai2, llm-release, nathan-lambert, olmo

Samattical (via) Automattic (the company behind WordPress) have a benefit that’s provided to all 1,900+ of their employees: a paid three month sabbatical every five years.

CEO Matt Mullenweg is taking advantage of this for the first time, and here shares an Ignite talk in which he talks about the way the benefit encourages the company to plan for 5% of the company to be unavailable at any one time, helping avoid any single employee becoming a bottleneck.

# 2nd February 2024, 3:42 am / automattic, matt-mullenweg

unstructured. Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.

I got some good initial results against a PDF by running “pip install ’unstructured[pdf]’” and then using the “unstructured.partition.pdf.partition_pdf(filename)” function.

There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.

# 2nd February 2024, 2:47 am / ocr, pdf, python

ChunkViz (via) Handy tool by Greg Kamradt to help understand how different text chunking mechanisms work by visualizing them. Chunking is an important part of preparing text to be embedded for semantic search, and thanks to this tool I’ve finally got a solid mental model of what recursive character text splitting does.

# 2nd February 2024, 2:23 am / ai, embeddings

teknium/OpenHermes-2.5 (via) The Nous-Hermes and Open Hermes series of LLMs, fine-tuned on top of base models like Llama 2 and Mistral, have an excellent reputation and frequently rank highly on various leaderboards.

The developer behind them, Teknium, just released the full set of fine-tuning data that they curated to build these models. It’s a 2GB JSON file with over a million examples of high quality prompts, responses and some multi-prompt conversations, gathered from a number of different sources and described in the data card.

# 1st February 2024, 4:18 am / ai, generative-ai, llms, fine-tuning, nous-research, llm-release

stanchion (via) Dan Gallagher’s new (under-development) SQLite extension that adds column-oriented tables to SQLite, using a virtual table implemented in Zig that stores records in row groups, where each row group has multiple segments (one for each column) and those segments are stored as SQLite BLOBs.

I’m surprised that this is possible using the virtual table mechanism. It has the potential to bring some of the analytical querying performance we’ve seen in engines like DuckDB to SQLite itself.

# 31st January 2024, 10:32 pm / sqlite

snoop. Neat Python debugging utility by Alex Hall: snoop lets you “import snoop” and then add “@snoop” as a decorator to any function, which causes that function’s source code to be output directly to the console with details of any variable state changes that occur while it’s running.

I didn’t know you could make a Python module callable like that—turns out it’s running “sys.modules[’snoop’] = snoop” in the __init__.py module!

# 31st January 2024, 6:19 pm / debugging, python

Macaroons Escalated Quickly (via) Thomas Ptacek’s follow-up on Macaroon tokens, based on a two year project to implement them at Fly.io. The way they let end users calculate new signed tokens with additional limitations applied to them (“caveats” in Macaroon terminology) is fascinating, and allows for some very creative solutions.

# 31st January 2024, 4:57 pm / apis, security, thomas-ptacek, fly

GitHub Actions: Introducing the new M1 macOS runner available to open source! Set “runs-on: macos-14” to run a GitHub Actions workflow on a 7GB of RAM ARM M1 runner. I have been looking forward to this for ages: it should make it much easier to build releases of both Electron apps and Python binary wheels for Apple Silicon.

# 31st January 2024, 2:04 am / macos, github-actions

Beej’s Guide to Networking Concepts (via) Beej's Guide to Network Programming is a legendary tutorial on network programming in C, continually authored and updated by Brian "Beej" Hall since 1995.

This is NOT that. Beej's Guide to Networking Concepts is brand new - started in March 2023 - and illustrates a whole bunch of networking concepts using Python instead of C.

From the forward:

Is it Beej’s Guide to Network Programming in Python? Well, kinda, actually. The C book is more about how C’s (well, Unix’s) network API works. And this book is more about the concepts underlying it, using Python as a vehicle.

# 30th January 2024, 10:08 pm / c, networking, python

pgroll (via) “Zero-downtime, reversible, schema migrations for Postgres”

I love this kind of thing. This one is has a really interesting design: you define your schema modifications (adding/dropping columns, creating tables etc) using a JSON DSL, then apply them using a Go binary.

When you apply a migration the tool first creates a brand new PostgreSQL schema (effectively a whole new database) which imitates your new schema design using PostgreSQL views. You can then point your applications that have been upgraded to the new schema at it, using the PostgreSQL search_path setting.

Old applications can continue talking to the previous schema design, giving you an opportunity to roll out a zero-downtime deployment of the new code.

Once your application has upgraded and the physical rows in the database have been transformed to the new schema you can run a --continue command to make the final destructive changes and drop the mechanism that simulates both schema designs at once.

# 30th January 2024, 9:27 pm / migrations, postgresql, zero-downtime

urllib3 2.2.0. Highlighted feature: “urllib3 now works in the browser”—the core urllib3 library now includes code that can integrate with Pyodide, using the browser’s fetch() or XMLHttpRequest APIs to make HTTP requests (to CORS-enabled endpoints).

# 30th January 2024, 4:31 pm / python, webassembly, pyodide, cors

Getting Started With CUDA for Python Programmers (via) if, like me, you’ve avoided CUDA programming (writing efficient code that runs on NVIGIA GPUs) in the past, Jeremy Howard has a new 1hr17m video tutorial that demystifies the basics. The code is all run using PyTorch in notebooks running on Google Colab, and it starts with a very clear demonstration of how to convert a RGB image to black and white.

# 29th January 2024, 9:23 pm / python, ai, pytorch, jeremy-howard, gpus

Observable notebook: URL to download a GitHub repository as a zip file (via) GitHub broke the “right click -> copy URL” feature on their Download ZIP button a few weeks ago. I’m still hoping they fix that, but in the meantime I built this Observable Notebook to generate ZIP URLs for any GitHub repo and any branch or commit hash.

Update 30th January 2024: GitHub have fixed the bug now, so right click -> Copy URL works again on that button.

# 29th January 2024, 9:17 pm / github, observable

llm-embed-onnx. I wrote a new plugin for LLM that acts as a thin wrapper around onnx_embedding_models by Benjamin Anderson, providing access to seven embedding models that can run on the ONNX model framework.

The actual plugin is around 50 lines of code, which makes for a nice example of how thin a plugin wrapper can be that adds new models to my LLM tool.

# 28th January 2024, 10:28 pm / embedding, plugins, projects, ai, llm

Years

Tags