Simon Willison’s Weblog

Blogmarks in Feb

Filters: Type: blogmark × Month: Feb ×


unasync (via) Today I started wondering out loud if one could write code that takes an asyncio Python library and transforms it into the synchronous equivalent by using some regular expressions to strip out the “await ...” keywords and suchlike. Turns out that can indeed work, and Ratan Kulshreshtha built it! unasync uses the standard library tokenize module to run some transformations against an async library and spit out the sync version automatically. I’m now considering using this for sqlite-utils. # 27th February 2021, 10:20 pm

cosmopolitan libc (via) “Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn’t require interpreters or virtual machines be installed beforehand. [...] Instead, it reconfigures stock GCC to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + BIOS with the best possible performance and the tiniest footprint imaginable.” This is a spectacular piece of engineering. # 27th February 2021, 6:02 am

How to secure an Ubuntu server using Tailscale and UFW. This is the Tailscale tutorial I’ve always wanted: it explains in detail how you can run an Ubuntu server (from any cloud provider) such that only devices on your personal Tailscale network can access it. # 26th February 2021, 8:31 pm

Fuzzy Name Matching in Postgres. Paul Ramsey describes how to implement fuzzy name matching in PostgreSQL using the fuzzystrmatch extension and its levenshtein() and soundex() functions, plus functional indexes to query against indexed soundex first and then apply slower Levenshtein. The same tricks should also work against SQLite using the datasette-jellyfish plugin. # 22nd February 2021, 9:16 pm

Blazing fast CI with pytest-split and GitHub Actions (via) pytest-split is a neat looking variant on the pattern of splitting up a test suite to run different parts of it in parallel on different machines. It involves maintaining a periodically updated JSON file in the repo recording the average runtime of different tests, to enable them to be more fairly divided among test runners. Includes a recipe for running as a matrix in GitHub Actions. # 22nd February 2021, 7:06 pm

People, processes, priorities. Twitter thread from Adrienne Porter Felt outlining her model for thinking about engineering management. I like this trifecta of “people, processes, priorities” a lot. # 22nd February 2021, 5:21 pm

trustme (via) This looks incredibly useful. Run “python -m trustme” and it will create three files for you: server.pem, server.key and a client.pem client certificate, providing a certificate for “localhost” (or another host you spefict) using a fake certificate authority. Looks like it should be the easiest way to test TLS locally. # 11th February 2021, 8 pm

Why I Built Litestream. Litestream is a really exciting new piece of technology by Ben Johnson, who previously built BoltDB, the key-value store written in Go that is used by etcd. It adds replication to SQLite by running a process that converts the SQLite WAL log into a stream that can be saved to another folder or pushed to S3. The S3 option is particularly exciting—Ben estimates that keeping a full point-in-time recovery log of a high write SQLite database should cost in the order of a few dollars a month. I think this could greatly expand the set of use-cases for which SQLite is sensible choice. # 11th February 2021, 7:25 pm

Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies (via) Alex Birsan describes a new category of security vulnerability he discovered in the npm, pip and gem packaging ecosystems: if a company uses a private repository with internal package names, uploading a package with the same name to the public repository can often result in an attacker being able to execute their own code inside the networks of their target. Alex scored over $130,000 in bug bounties from this one, from a number of name-brand companies. Of particular note for Python developers: the --extra-index-url argument to pip will consult both public and private registries and install the package with the highest version number! # 10th February 2021, 8:42 pm

Cleaning Up Your Postgres Database (via) Craig Kerstiens provides some invaluable tips on running an initial check of the health of a PostgreSQL database, by using queries against the pg_statio_user_indexes table to find the memory cache hit ratio and the pg_stat_user_tables table to see what percentage of queries to your tables are using an index. # 3rd February 2021, 7:32 am

JMeter Result Analysis using Datasette (via) NaveenKumar Namachivayam wrote a detailed tutorial on using Datasette (on Windows) and csvs-to-sqlite to analyze the results of JMeter performance test runs and then publish them online using Vercel. # 1st February 2021, 4:42 am

Wildcard: Spreadsheet-Driven Customization of Web Applications (via) What a fascinating collection of ideas. Wildcard is a browser extension (currently using Tampermonkey and sadly not yet available to try out) which lets you add “spreadsheet-driven customization” to any web application. Watching the animated screenshots in the videos helps explain what this mean—essentially it’s a two-way scraping trick, where content on the page (e.g. Airbnb listings) are extracted into a spreadsheet-like table interface using JavaScript—but then interactions you make in that spreadsheet like filtering and sorting are reflected back on the original page. It even has the ability to serve editable cells by mapping them to form inputs on the page. Lots to think about here. # 28th February 2020, 7:39 pm

Why Google invested in providing Google Fonts for free. Fascinating comment from former Google Fonts team member Raph Levien. In short: text rendered as PNGs hurt Google Search, fonts were a delay in the transition from Flash, Google Docs needed them to better compete with Office and anything that helps create better ads is easy to find funding for. # 23rd February 2020, 2:13 pm

pup. This is a great idea: a command-line tool for parsing HTML on stdin using CSS selectors. It’s like jq but for HTML. Supports a sensible collection of selectors and has a number of output options for the selected nodes, including plain text and JSON. It also works as a simple pretty-printer for HTML. # 14th February 2020, 4:25 pm

Deep learning isn’t hard anymore. This article does a great job of explaining how transfer learning is unlocking a new wave of innovation around deep learning. Previously if you wanted to train a model you needed vast amounts if data and thousands of dollars of compute time. Thanks to transfer learning you can now take an existing model (such as GPT2) and train something useful on top of it that’s specific to a new domain in just minutes it hours, with only a few hundred or a few thousand new labeled samples. # 7th February 2020, 8:47 am

Experiments, growth engineering, and exposing company secrets through your API (via) This is fun: Jon Luca observes that many companies that run A/B tests have private JSON APIs that list all of their ongoing experiments, and uses them to explore tests from Lyft, Airbnb, Pinterest, Amazon and more. Facebook and Instagram use SSL Stapling which makes it harder to spy on their mobile app traffic. # 26th February 2019, 4:49 am

huey. Charles Leifer’s “little task queue for Python”. Similar to Celery, but it’s designed to work with Redis, SQLite or in the parent process using background greenlets. Worth checking out for the really neat design. The project is new to me, but it’s been under active development since 2011 and has a very healthy looking rate of releases. # 25th February 2019, 7:49 pm

My Twitter thread collecting behind the scenes content about Spider-Man: Into the Spider-Verse. I absolutely loved Spider-Verse, and I’ve been delighted to discover that many of the artists who created the movie are active on Twitter and have been posting all kinds of fascinating material about their creative process. I’ve been collecting examples in this Twitter thread for a couple of months now. They definitely deserved that Oscar. # 25th February 2019, 2:57 pm

Seeking the Productive Life: Some Details of My Personal Infrastructure (via) Stephen Wolfram’s 15,000 word epic about his personal approach to productivity, developed over the past thirty years. This is a fascinating document—I found myself thinking “surely there can’t be more information than this” and then spotting that the scrollbar wasn’t even a third done yet. Very hard to summarize: it turns out if you’re the work-from-home CEO of your own privately held 800 person company you can construct some very opinionated habits. # 22nd February 2019, 9:46 pm

String length—Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here’s a fascinating illustration of how that problem can be attached dozens of different programming languages. From that page: the string “J̲o̲s̲é̲” (“J\x{332}o\x{332}s\x{332}e\x{301}\x{332}”) has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8. # 22nd February 2019, 3:27 pm

Lessons from 6 software rewrite stories (via) Herb Caudill takes on the classic idea that rewriting from scratch is “the single worst strategic mistake that any software company can make” and investigates it through the lens of six well-chosen examples: Netscape 6, Basecamp Classic/2/3, Visual Studio/VS Code, Gmail/Inbox, FogBugz/Wasabi/Trello, and finally FreshBooks/BillSpring. Each story has details I had never heard before, and the lessons and conclusions are deeply insightful. # 19th February 2019, 9:55 pm

parameterized. I love the @parametrize decorator in pytest, which lets you run the same test multiple times against multiple parameters. The only catch is that the decorator in pytest doesn’t work for old-style unittest TestCase tests, which means you can’t easily add it to test suites that were built using the older model. I just found out about parameterized which works with unittest tests whether or not you are running them using the pytest test runner. # 19th February 2019, 9:05 pm

The Eleven Laws of Showrunning (via) Fascinating essay on how to run a modern TV show by Javier Grillo-Marxuach. Being a showrunner basically involves running a 100+ person startup with a 7 digit budget, almost immovable deadlines, high maintenance activist investors and you’re still expected to write some of the scripts! So many useful lessons here about management, creativity and delegation: almost everything in here is relevant to product management, startup founding and engineering management as well. # 19th February 2019, 7:27 pm

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista provides some insider thoughts on how they lost against Google. # 16th February 2019, 6:57 pm

Data science is different now (via) Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work. # 15th February 2019, 3:36 pm

Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look. # 14th February 2019, 5:35 am

django-zombodb (via) The hardest part of working with an external search engine like Elasticsearch is always keeping that index synchronized with your relational database. ZomboDB is a PostgreSQL extension which lets you create a new type of index backed by an external Elasticsearch cluster. Updated rows will be pushed to the index automatically, and custom SQL syntax can then be used to execute searches. django-zombodb is a brand new library by Flávio Juvenal which integrates ZomboDB directly into the Django ORM, letting you add Elasticsearch-backed functionality with just a few lines of extra configuration. It even includes custom Django migrations for enabling the extension in PostgreSQL! # 13th February 2019, 10:14 pm

socrata2sql (via) Phenomenal new open source tool released by Andrew Chavez at the Dallas Morning News. Socrata is the open data portal software used by huge numbers of local governments worldwide. socrata2sql is a tool that interacts with the standard Socrata API and can use it to suck down a dataset and save it as a SQLite, PostgreSQL, MySQL or other SQLAlchemy-supported database. I just tried this and it took a single command to create a SQLite database of every police arrest in Dallas in the past five years. # 8th February 2019, 3:27 pm

db-to-sqlite (via) I just released version 0.2 of a tiny CLI utility I’ve been working on. It builds on top of SQLAlchemy and lets you connect to any SQLAlchemy-supported database and convert the data from it to a local SQLite database file. The new --all option will mirror all available tables (including foreign key relationships), or you can use --sql to save the results of custom SQL queries. # 8th February 2019, 6:08 am

Questions for a new technology. Kellan poses 8 questions which should be asked of any technology that is being proposed for inclusion in an existing tech stack. I’m particularly fond of “Will this solution kill and eat the solution that it replaces?”. My rule of thumb these days is that new technology either needs to make something possible that isn’t possible at all with the existing stack, or it needs to represent at least a 3X productivity improvement in order to compensate for the switching and retraining costs across a large team. # 6th February 2019, 4:10 am