Simon Willison’s Weblog

93 items tagged “github”


How I build a feature

I’m maintaining a lot of different projects at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new sqlite-utils create-database command as an example.

[... 2767 words]


Introducing stack graphs (via) GitHub launched “precise code navigation” for Python today—the first language to get support for this feature. Click on any Python symbol in GitHub’s code browsing views and a box will show you exactly where that symbol was defined—all based on static analysis by a custom parser written in Rust as opposed to executing any Python code directly. The underlying computer science uses a technique called stack graphs, based on scope graphs research from Eelco Visser’s research group at TU Delft. # 9th December 2021, 11:07 pm

How to build, test and publish an open source Python library

At PyGotham this year I presented a ten minute workshop on how to package up a new open source Python library and publish it to the Python Package Index. Here is the video and accompanying notes, which should make sense even without watching the talk.

[... 2017 words]

Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces

I spent the last week mostly surrounded by boxes: we’re completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my CDC vaccination history Datasette instance though.

[... 514 words]

Datasette on Codespaces, sqlite-utils API reference documentation and other weeknotes

This week I broke my streak of not sending out the Datasette newsletter, figured out how to use Sphinx for Python class documentation, worked out how to run Datasette on GitHub Codespaces, implemented Datasette column metadata and got tantalizingly close to a solution for an elusive Datasette feature.

[... 2163 words]

GitHub’s Engineering Team has moved to Codespaces. My absolute dream development environment is one where I can spin up a new, working development environment in seconds—to try something new on a branch, or because I broke something and don’t want to spend time figuring out how to fix it. This article from GitHub explains how they got there: from a half-day setup to a 45 minute bootstrap in a codespace, then to five minutes through shallow cloning and a nightly pre-built Docker image and finally to 10 seconds be setting up “pools of codespaces, fully cloned and bootstrapped, waiting to be connected with a developer who wants to get to work”. # 11th August 2021, 4:53 pm

Running GitHub on Rails 6.0. Back in 2019 Eileen M. Uchitelle explained how GitHub upgraded everything in production to Rails 6.0 within 1.5 weeks of the stable release. There’s a trick in here I really like: they have an automated weekly job which fetches the latest Rails main branch and runs the full GitHub test suite against it, giving them super-early warnings about anything that might break and letting them provide feedback to upstream about unintended regressions. # 6th August 2021, 4:30 pm

A framework for building Open Graph images. GitHub’s new social preview images are generated by a Node.js script that fetches data from their GraphQL API, generates an HTML version of the card and then grabs a PNG snapshot of it using Puppeteer. It takes an average of 280ms to serve an image and generates around 2 million unique images a day. Interestingly, they found that bumping the available RAM from 512MB up to 513MB had a big effect on performance, because Chromium detects devices on 512MB or less and switches some processes from parallel to sequential. # 22nd June 2021, 9:25 pm

Flat Data. New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub. # 19th May 2021, 1:05 am

Behind GitHub’s new authentication token formats (via) This is a really smart design. GitHub’s new tokens use a type prefix of “ghp_” or “gho_” or a few others depending on the type of token, to help support mechanisms that scan for accidental token publication. A further twist is that the last six characters of the tokens are a checksum, which means token scanners can reliably distinguish a real token from a coincidental string without needing to check back with the GitHub database. “One other neat thing about _ is it will reliably select the whole token when you double click on it”—what a useful detail! # 5th April 2021, 9:28 pm

GitHub, by default, writes five replicas of each repository across our three data centers to protect against failures at the server, rack, network, and data center levels. When we need to update Git references, we briefly take a lock across all of the replicas in all of our data centers, and release the lock when our three-phase-commit (3PC) protocol reports success.

Scott Arbeit # 21st March 2021, 12:57 am

How we found and fixed a rare race condition in our session handling. GitHub had a terrifying bug this month where a user reported suddenly being signed in as another user. This is a particularly great example of a security incident report, explaining how GitHub identified the underlying bug, what caused it and the steps they are taking to ensure bugs like that never happen in the future. The root cause was a convoluted sequence of events which could cause a Ruby Hash to be accidentally shared between two requests, caused as a result of a new background thread that was introduced as a performance optimization. # 18th March 2021, 11:06 pm

Making GitHub’s new homepage fast and performant. A couple of really clever tricks in this article by Tobias Ahlin. The first is using IntersectionObserver in conjunction with the video preload=“none” attribute to lazily load a video when it scrolls into view. The second is an ingenious trick to create an efficiently encoded transparent JPEG image: embed the image in a SVG file twice, once as the image and once as a transparency mask. # 29th January 2021, 7:05 pm

Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) (via) ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They’ve loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here’s a query I just ran that shows number of watch events per year, for example: SELECT toYear(created_at) as yyyy, count() FROM github_events WHERE event_type = ’WatchEvent’ group by yyyy # 5th January 2021, 1:02 am


Commits are snapshots, not diffs (via) Useful, clearly explained revision of some Git fundamentals. # 17th December 2020, 10:01 pm

At GitHub, we want to protect developer privacy, and we find cookie banners quite irritating, so we decided to look for a solution. After a brief search, we found one: just don’t use any non-essential cookies. Pretty simple, really. 🤔 So, we have removed all non-essential cookies from GitHub, and visiting our website does not send any information to third-party analytics services.

Nat Friedman # 17th December 2020, 7:44 pm

Personal Data Warehouses: Reclaiming Your Data

I gave a talk yesterday about personal data warehouses for GitHub’s OCTO Speaker Series, focusing on my Datasette and Dogsheep projects. The video of the talk is now available, and I’m presenting that here along with an annotated summary of the talk, including links to demos and further information.

[... 5166 words]

OCTO Speaker Series: Simon Willison—Personal Data Warehouses: Reclaiming Your Data. I’m giving a talk in the GitHub OCTO (Office of the CTO) speaker series about Datasette and my Dogsheep personal analytics project. You can register for free here—the stream will be on Thursday November 12, 2020 at 8:30am PST (4:30pm GMT). # 23rd October 2020, 3 am

Git scraping: track changes over time by scraping to a Git repository

Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really effective, and more people should use it.

[... 963 words]

Render Markdown tool (via) I wrote a quick JavaScript tool for rendering Markdown via the GitHub Markdown API—which includes all of their clever extensions like tables and syntax highlighting—and then stripping out some extraneous HTML to give me back the format I like using for my blog posts. # 3rd September 2020, 12:08 am

Weeknotes: Rocky Beaches, Datasette 0.48, a commit history of my database

This week I helped Natalie launch Rocky Beaches, shipped Datasette 0.48 and several releases of datasette-graphql, upgraded the CSRF protection for datasette-upload-csvs and figured out how to get a commit log of changes to my blog by backing up its database to a GitHub repository.

[... 1294 words]

Doing Stupid Stuff with GitHub Actions (via) I love the idea here of running a scheduled action once a year that deliberately fails, causing GitHub to send you a “Happy New Year” failure email! # 25th July 2020, 9:19 pm

zhiiiyang/zhiiiyang profile README (via) This is a brilliant hack: a GitHub profile README that uses an action to retrieve the author’s latest tweet (using R), render it as a PNG screenshot in headless Chrome via rstudio/webshot2 and embed that image in their profile. # 11th July 2020, 5:47 pm

Building a self-updating profile README for GitHub

GitHub quietly released a new feature at some point in the past few days: profile READMEs. Create a repository with the same name as your GitHub account (in my case that’s, add a to it and GitHub will render the contents at the top of your personal profile page—for me that’s

[... 599 words]

A cookiecutter template for writing Datasette plugins

Datasette’s plugin system is one of the most interesting parts of the entire project. As I explained to Matt Asay in this interview, the great thing about plugins is that Datasette can gain new functionality overnight without me even having to review a pull request. I just need to get more people to write them!

[... 914 words]

github-to-sqlite 2.2 highlights thread. I released github-to-sqlite 2.2 today with a new “stargazers” command for importing users who have starred one or more specific repositories. This Twitter thread lists highlights of recent releases and links to a live Datasette demo that shows what the tool can do. # 2nd May 2020, 10:16 pm

Weeknotes: Datasette 0.40, various projects, Dogsheep photos

A new release of Datasette, two new projects and progress towards a Dogsheep photos solution.

[... 826 words]

Using a self-rewriting README powered by GitHub Actions to track TILs

I’ve started tracking TILs—Today I Learneds—inspired by this five-year-and-counting collection by Josh Branchaud on GitHub (found via Hacker News). I’m keeping mine in GitHub too, and using GitHub Actions to automatically generate an index page README in the repository and a SQLite-backed search engine.

[... 1100 words]


I released a fun little Datasette utility today: datasette-clone.

[... 379 words]

Goodbye Zeit Now v1, hello datasette-publish-now—and talking to myself in GitHub issues

This week I’ve been mostly dealing with the finally announced shutdown of Zeit Now v1. And having long-winded conversations with myself in GitHub issues.

[... 2049 words]