Simon Willison's Weblog: github-issues

cityofaustin/atd-data-tech issues

2025-05-20T18:18:39+00:00

I stumbled across this today while looking for interesting frequently updated data sources from local governments. It turns out the City of Austin's Transportation Data & Technology Services department run everything out of a public GitHub issues instance, which currently has 20,225 closed and 2,002 open issues. They also publish an exported copy of the issues data through the data.austintexas.gov open data portal.

Tags: open-data, github, github-issues

Jules

2025-05-19T21:40:11+00:00

It seems like everyone is rolling out AI coding assistants that attach to your GitHub account and submit PRs for you right now. We had OpenAI Codex last week, today Microsoft announced GitHub Copilot coding agent (confusingly not the same thing as Copilot Workspace) and I found out just now that Google's Jules, announced in December, is now in a beta preview.

I'm flying home from PyCon but I managed to try out Jules from my phone. I took this GitHub issue thread, converted it to copy-pasteable Markdown with this tool and pasted it into Jules, with no further instructions.

Here's the resulting PR created from its branch. I haven't fully reviewed it yet and the tests aren't passing, so it's hard to evaluate from my phone how well it did. In a cursory first glance it looks like it's covered most of the requirements from the issue thread.

My habit of creating long issue threads where I talk to myself about the features I'm planning is proving to be a good fit for outsourcing implementation work to this new generation of coding assistants.

Tags: gemini, ai-assisted-programming, google, llms, ai, generative-ai, github, github-issues

Evolving GitHub Issues (public preview)

2025-01-16T17:41:32+00:00

GitHub just shipped the largest set of changes to GitHub Issues I can remember in a few years. As an Issues power-user this is directly relevant to me.

The big new features are sub-issues, issue types and boolean operators in search.

Sub-issues look to be a more robust formalization of the existing feature where you could create a - [ ] #123 Markdown list of issues in the issue description to relate issues together and track a 3/5 progress bar. There are now explicit buttons for creating a sub-issue and managing its parent relationship, and clicking a sub-issue opens it in a side panel on top of the parent.
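
To make the 3/5 progress idea concrete, here is a minimal sketch in Rust that counts checked and unchecked items in a Markdown task list. The function name `task_progress` is hypothetical and the matching rules are an assumption for illustration; GitHub's own parsing is likely more involved.

```rust
/// Counts (done, total) task-list items in a Markdown issue body.
/// Assumption: an item is any line that, after leading whitespace,
/// starts with "- [ ] " (open) or "- [x] " / "- [X] " (done).
fn task_progress(body: &str) -> (usize, usize) {
    let mut done = 0;
    let mut total = 0;
    for line in body.lines() {
        let line = line.trim_start();
        if line.starts_with("- [ ] ") {
            total += 1;
        } else if line.starts_with("- [x] ") || line.starts_with("- [X] ") {
            done += 1;
            total += 1;
        }
    }
    (done, total)
}

fn main() {
    // Made-up issue body with three done items and two open ones.
    let body = "Tracking issue\n\n- [x] #101\n- [x] #102\n- [x] #103\n- [ ] #104\n- [ ] #105\n";
    let (done, total) = task_progress(body);
    println!("{}/{}", done, total);
}
```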

Issue types took me a moment to track down: it turns out they are an organization level feature, so they won't show up on repos that belong to a specific user.

Organizations can define issue types that will be available across all of their repos. I created a "Research" one to classify research tasks, joining the default task, bug and feature types.

Unlike labels, an issue can have just one issue type. You can then search for all issues of a specific type across an entire organization using org:datasette type:"Research" in GitHub search.

The new boolean logic in GitHub search looks like it could be really useful - it includes AND, OR and parentheses for grouping.

(type:"Bug" AND assignee:octocat) OR (type:"Enhancement" AND assignee:hubot)

I'm not sure if these are available via the GitHub APIs yet.

Tags: github, github-issues

AI assisted learning: Learning Rust with ChatGPT, Copilot and Advent of Code

2022-12-05T21:11:08+00:00

I'm using this year's Advent of Code to learn Rust - with the assistance of GitHub Copilot and OpenAI's new ChatGPT.

I think one of the most exciting applications of large language models is to support self-guided learning. Used the right way, a language model such as GPT-3 can act as a sort of super-smart-and-super-dumb teaching assistant: you can ask it questions and follow-up questions, and if you get your questions right it can genuinely help you build a good mental model of the topic at hand.

And it could also hallucinate and teach you things that are entirely divorced from reality, but in a very convincing way!

I've started thinking of them as an excellent teacher for some topics who is also a conspiracy theorist around others: you can have a great conversation with them, but you need to take everything they say with a very generous grain of salt.

I've been tinkering with this idea for a few months now, mostly via the GPT-3 Playground. ChatGPT provides a much better interface for this, and I'm ready to try it out for a larger, more structured project.

Learning Rust

I've been looking for an excuse to explore Rust for a few years now. As primarily a Python programmer the single biggest tool missing from my toolbox is something more low-level - I want to be able to confidently switch to more of a systems language for performance-critical tasks, while still being able to use that optimized code in my Python projects.

Rust feels like the best available option for this. It has really great Python integration support, is already used extensively in the Python ecosystem (e.g. by the cryptography package) and lots of people who I respect have used it without loudly expressing their distaste for it later on!

The problem was finding the right project. I learn by building things, but none of the projects I could imagine building in Rust (a high performance web proxy for example) would be useful to me if I built terrible versions of them while still learning the basics.

Advent of Code turns out to be perfect for this.

Each day you get a new exercise, designed to be solved in a short amount of time (at least so far). Exercises are automatically graded using an input file that is unique to you, so you can't cheat by copying other people's answers (though you can cheat by copying and running their code).

The exercise design is so good! Eric Wastl has been running it for seven years now and I couldn't be more impressed with how it works or the quality of the exercises so far (I just finished day 5).

It's absolutely perfect for my goal of learning a new programming language.

AI assisted learning tools

I've seen a bunch of people this year attempt to solve Advent of Code by feeding the questions to an AI model. That's a fun exercise, but what I'm doing here is a little bit different.

My goal here is to get comfortable enough with basic Rust that I can attempt a larger project without feeling like I'm wasting my time writing unusably poor code.

I also want to see if AI assisted learning actually works as well as I think it might.

I'm using two tools to help me here:

GitHub Copilot runs in my VS Code editor. I've used it for the past few months mainly as a typing assistant (and for writing things like repetitive tests). For this project I'm going to lean a lot more heavily on it - I'm taking advantage of comment-driven prompting, where you can add a code comment and Copilot will suggest code that matches the comment.
ChatGPT. I'm using this as a professor/teaching-assistant/study partner. I ask it questions about how to do things with Rust, it replies with answers (and usually a code sample too). I've also been using it to help understand error messages, which it turns out to be incredibly effective at.

And copious notes

I'm doing all of my work on this in the open, in my simonw/advent-of-code-2022-in-rust repository on GitHub. Each day gets an issue, and I'm making notes on the help I get from the AI tools in detailed issue comments.

Here are my issue threads so far:

I recommend checking out each issue in full if you want to follow how this has all been going.

Some examples from ChatGPT

Here are a few highlights from my interactions with ChatGPT so far:

Using "add comments explaining every single line" to cause it to output a replacement code example with extremely verbose comments.
Pasting in both the code and the resulting Rust compiler error - ChatGPT clearly explained the error to me and spat out a version of the code that incorporated a fix!
Another example of a back-and-forth conversation - I started with "Given two strings in Rust, identify the character that appears in both strings", then said "Can you do this without a HashSet?", then finally asked for it "without any modules" at all. Then I said "Turn that into a function I can call" and it wrote that code for me too.
I got an error from a VS Code extension I was using: "rust-analyzer failed to discover workspace" - so I pasted that into ChatGPT and it gave me step-by-step instructions for fixing that.
I pasted in some gnarly nested code generated by Copilot, and ChatGPT suggested an improvement that removed the nesting without me even asking it how that code could be improved. Though amusingly it used deprecated Rust syntax for it, which took me quite a while to figure out.
Amos has a series of posts about this year's Advent of Code where he uses them to illustrate advanced features of Rust. I occasionally paste his code into ChatGPT and ask it to explain each concept - in this particular case, I used Copilot to generate a list of questions and then pasted them into ChatGPT to get back the answers.
Given two examples of how to do something, I asked it "Which of these two options is the more idiomatic Rust code?" and it answered along with justification as to why!
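
As an illustration of the "character that appears in both strings" question from the list above, here is one way it can be done with no imports at all. This is a sketch written for this note, not the code ChatGPT produced, and the function name `common_char` is hypothetical.

```rust
/// Returns the first character of `a` that also appears in `b`, if any.
/// Uses only built-in string methods: no HashSet and no `use` lines.
fn common_char(a: &str, b: &str) -> Option<char> {
    // `str::contains` accepts a single char as its pattern.
    a.chars().find(|&c| b.contains(c))
}

fn main() {
    match common_char("hello", "world") {
        Some(c) => println!("shared character: {}", c),
        None => println!("no shared character"),
    }
}
```

This scans `b` once per character of `a`, which is fine for short puzzle inputs; a set-based version would suit longer strings better.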

Handling CLI arguments in Rust

I'll include my most recent highlight in full:

I wanted to add command-line arguments to my Rust program, so I asked ChatGPT:

How do I write a Rust program that accepts command line options - a required position argument and an optional --multistack flag?

It replied:

To write a Rust program that accepts command line options, you can use the clap crate. First, add clap to your Cargo.toml file like this:

[dependencies]
clap = "2.33.0"

Next, import the crate and use its macros to define the command line arguments that your program accepts. For example:

use clap::{App, Arg};

fn main() {
    let matches = App::new("My Program")
        .arg(
            Arg::with_name("position")
                .help("The position argument")
                .required(true)
                .index(1),
        )
        .arg(
            Arg::with_name("multistack")
                .help("A flag that enables the use of multiple stacks")
                .long("multistack")
                .short("m"),
        )
        .get_matches();

    // Get the value of the position argument
    let position = matches.value_of("position").unwrap();

    // Check if the multistack flag was passed
    let multistack = matches.is_present("multistack");
}

In this example, the position argument is required and must be specified as the first argument to the program. The multistack flag is optional and can be specified using the --multistack or -m flag.

Here's the thing though: I'm still working on Rust basics, so I didn't particularly want to jump straight to using an external dependency for this. So I asked a follow-up question:

Is there a way to do that without using any external crates at all?

And it replied:

Yes, you can parse command line arguments without using any external crates in Rust. To do this, you can use the std::env::args function, which returns an iterator over the arguments that were passed to the program.

For example, you could parse the required position argument and the optional multistack flag like this:

use std::env;

fn main() {
    // Get the arguments as an iterator
    let args: Vec<String> = env::args().collect();

    // Ensure that at least one argument was passed
    if args.len() < 2 {
        println!("error: missing required argument 'position'");
        return;
    }

    // Get the value of the position argument
    let position = &args[1];

    // Check if the multistack flag was passed
    let multistack = args.iter().any(|arg| arg == "--multistack" || arg == "-m");
}

In this example, the position argument is required and must be specified as the first argument to the program. The multistack flag is optional and can be specified using the --multistack or -m flag.

You can then use the values of these arguments in your program as shown. Note that this approach does not provide as much functionality or convenience as using a dedicated crate like clap, but it can be useful in some situations.

I really like this example.

I deliberately crafted my original question to include details of exactly what I wanted to do - not just add CLI options, but add a specific positional argument and a specific flag.

This generated the exact code I needed, plus a very high quality prose explanation of it.

Then I used a follow-up question to get a completely different approach that better matched my personal taste. Unprompted, it even included a note about the disadvantages of that approach at the end!
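
A variation on the standard-library approach, sketched here under my own assumptions rather than taken from the ChatGPT reply: moving the parsing into a function that accepts any iterator of strings makes it possible to exercise it without launching the program. The `Options` struct and `parse_args` function are hypothetical names.

```rust
/// Parsed command-line options (hypothetical struct for illustration).
#[derive(Debug, PartialEq)]
struct Options {
    position: String,
    multistack: bool,
}

/// Parses arguments, not including the program name.
/// The first argument that is not the flag is taken as `position`;
/// any further non-flag argument is treated as an error.
fn parse_args<I>(args: I) -> Result<Options, String>
where
    I: IntoIterator<Item = String>,
{
    let mut position: Option<String> = None;
    let mut multistack = false;
    for arg in args {
        if arg == "--multistack" || arg == "-m" {
            multistack = true;
        } else if position.is_none() {
            position = Some(arg);
        } else {
            return Err(format!("unexpected argument '{}'", arg));
        }
    }
    match position {
        Some(position) => Ok(Options { position, multistack }),
        None => Err("missing required argument 'position'".to_string()),
    }
}

fn main() {
    // skip(1) drops the first item, which is conventionally the program name.
    match parse_args(std::env::args().skip(1)) {
        Ok(options) => println!("{:?}", options),
        Err(message) => {
            eprintln!("error: {}", message);
            std::process::exit(1);
        }
    }
}
```

One behavioral difference from the quoted version: because flags are filtered out before the positional argument is chosen, running the program with only --multistack reports a missing position instead of treating the flag as the position.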

Copilot as well

I haven't been keeping as detailed notes on my Copilot interactions, since those take the form of autocompletions in my editor which are harder to write down.

I did capture a few illustrative screenshots though. Some examples:

I didn't like that suggestion at all - way too convoluted. So I changed my comment prompt and got something much better:

[Screenshot: Copilot's improved suggestion, a match expression ending in 1, 'Y' => 2, 'Z' => 3, _ => 0, };]

This comment-driven approach to prompting Copilot has proven to be amazingly effective. I'm learning Rust without having to spend any time looking things up - I'm using Copilot to show me examples, then if I don't understand them I paste them into ChatGPT and ask for a detailed explanation.

Where it goes wrong

An interesting part of this exercise is spotting where things go wrong.

Rust is not an easy language to learn. There are concepts like the borrow checker that I've not even started touching on yet, and I'm still getting the hang of basic concepts like Options and Results.

Mostly Copilot and ChatGPT have been able to act as confident guides - but every now and then I've run up against the sharp edges of their fake confidence combined with the fact that they're actually just language models with no genuine understanding of what they are doing.

I had one instance where I lost about an hour to an increasingly frustrating back-and-forth over an integer overflow error - I ended up having to actually think hard about the problem after failing to debug it with ChatGPT!
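
The post does not record what the overflow actually was, so the following is only a general illustration. In debug builds a plain `+` that overflows will panic, while the checked arithmetic methods return an Option so the overflow can be handled explicitly. The function name `checked_sum` and the choice of `u8` are assumptions made to keep the example small.

```rust
/// Sums the values, returning None if the running total
/// ever exceeds what a u8 can hold (255).
fn checked_sum(values: &[u8]) -> Option<u8> {
    values
        .iter()
        // try_fold stops at the first None produced by checked_add.
        .try_fold(0u8, |total, &value| total.checked_add(value))
}

fn main() {
    println!("{:?}", checked_sum(&[100, 100])); // 200 fits in a u8
    println!("{:?}", checked_sum(&[100, 100, 100])); // 300 does not
}
```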

I wanted to figure out if the first character of a line was a "1". ChatGPT led me down an infuriatingly complicated warren of options - at one point I asked it "Why is this so hard!?" - until I finally independently stumbled across if line.starts_with("1") which was exactly what I needed. Turns out I should have asked "how do I check if a string starts with another string" - using the word "character" had thrown it completely off.
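
For reference, a small sketch of that check. `starts_with` accepts a char as well as a string slice, so the "character" framing works directly too; the second function shows the more roundabout route of inspecting the first character by hand. Both function names are hypothetical.

```rust
/// True if the first character of `line` is '1'.
fn starts_with_one(line: &str) -> bool {
    // A char is a valid pattern here, just like "1" would be.
    line.starts_with('1')
}

/// Equivalent check that looks at the first character directly.
fn first_char_is_one(line: &str) -> bool {
    line.chars().next() == Some('1')
}

fn main() {
    println!("{}", starts_with_one("1 2 3"));
    println!("{}", starts_with_one(" 1 2 3"));
    println!("{}", first_char_is_one(""));
}
```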

I also had an incident where I installed a package using cargo add itertools and decided I wanted to remove it. I asked ChatGPT about it and it confidently gave me instructions on using cargo remove itertools... which turns out to be a command that does not exist! It hallucinated that, then hallucinated some more options until I gave up and figured it out by myself.

So is it working?

So far I think this is working really well.

I feel like I'm beginning to get a good mental model of how Rust works, and a lot of the basic syntax is beginning to embed itself into my muscle memory.

The real test is going to be if I can first make it to day 25 (with no prior Advent of Code experience I don't know how much the increasing difficulty level will interfere with my learning) and then if I can actually write a useful Rust program after that without any assistance from these AI models.

And honestly, the other big benefit here is that this is simply a lot of fun. I'm finding interacting with AIs in this way - as an actual exercise, not just to try them out - is deeply satisfying and intellectually stimulating.

And is this ethical?

The ethical issues around generative AI - both large language models like GPT-3 and image generation models such as Stable Diffusion - continue to be the most complex I've encountered in my career to date.

I'm confident that one thing that is ethical is learning as much as possible about these tools, and helping other people to understand them too.

Using them for personal learning exercises like this feels to me like one of the best ways to do that.

I like that this is a space where I can write code that's not going to be included in products, or used to make money. I don't feel bad about bootstrapping my Rust education off a model that was trained on a vast corpus of data collected without the permission of the people who created it.

(Advent of Code does have a competitive leaderboard to see who can solve the exercises fastest. I have no interest at all in competing on that front, and I'm avoiding trying to leap on the exercises as soon as they are released.)

My current ethical position around these models is best summarized as acknowledging that the technology exists now, and it can't be put back in its bottle.

Our job is to figure out ways to maximize its benefit to society while minimizing the harm it causes.

Tags: rust, gpt-3, projects, openai, education, github, chatgpt, github-copilot, ai, generative-ai, llms, ai-assisted-programming, github-issues

Coping strategies for the serial project hoarder

2022-11-26T15:47:02+00:00

I gave a talk at DjangoCon US 2022 in San Diego last month about productivity on personal projects, titled "Massively increase your productivity on personal projects with comprehensive documentation and automated tests".

The alternative title for the talk was Coping strategies for the serial project hoarder.

I'm maintaining a lot of different projects at the moment. Somewhat unintuitively, the way I'm handling this is by scaling down techniques that I've seen working for large engineering teams spread out across multiple continents.

The key trick is to ensure that every project has comprehensive documentation and automated tests. This scales my productivity horizontally, by freeing me up from needing to remember all of the details of all of the different projects I'm working on at the same time.

You can watch the talk on YouTube (25 minutes). Alternatively, I've included a detailed annotated version of the slides and notes below.

This was the title I originally submitted to the conference. But I realized a better title was probably...

Coping strategies for the serial project hoarder

This video is a neat representation of my approach to personal projects: I always have a few on the go, but I can never resist the temptation to add even more.

My PyPI profile (which is only five years old) lists 185 Python packages that I've released. Technically I'm actively maintaining all of them, in that if someone reports a bug I'll push out a fix. Many of them receive new releases at least once a year.

Aside: I took this screenshot using shot-scraper with a little bit of extra JavaScript to hide a notification bar at the top of the page:

shot-scraper 'https://pypi.org/user/simonw/' \
--javascript "
    document.body.style.paddingTop = 0;
    document.querySelector(
        '#sticky-notifications'
    ).style.display = 'none';
  " --height 1000

How can one individual maintain 185 projects?

Surprisingly, I'm using techniques that I've scaled down from working at a company with hundreds of engineers.

I spent seven years at Eventbrite, during which time the engineering team grew to span three different continents. We had major engineering centers in San Francisco, Nashville, Mendoza in Argentina and Madrid in Spain.

Consider timezones: engineers in Madrid and engineers in San Francisco had almost no overlap in their working hours. Good asynchronous communication was essential.

Over time, I noticed that the teams that were most effective at this scale were the teams that had a strong culture of documentation and automated testing.

As I started to work on my own array of smaller personal projects, I found that the same discipline that worked for large teams somehow sped me up, when intuitively I would have expected it to slow me down.

I wrote an extended description of this in The Perfect Commit.

I've started structuring the majority of my work in terms of what I think of as "the perfect commit" - a commit that combines implementation, tests, documentation and a link to an issue thread.

As software engineers, it's important to note that our job generally isn't to write new software: it's to make changes to existing software.

As such, the commit is our unit of work. It's worth us paying attention to how we can make our commits as useful as possible.

Here's a recent example from one of my projects, Datasette.

It's a single commit which bundles together the implementation, some related documentation improvements and the tests that show it works. And it links back to an issue thread from the commit message.

Let's talk about each component in turn.

There's not much to be said about the implementation: your commit should change something!

It should only change one thing, but what that actually means varies on a case by case basis.

It should be a single change that can be documented, tested and explained independently of other changes.

(Being able to cleanly revert it is a useful property too.)

The goals of the tests that accompany a commit are to prove that the new implementation works.

If you apply the implementation the new tests should pass. If you revert it the tests should fail.

I often use git stash to try this out.

If you tell people they need to write tests for every single change they'll often push back that this is too much of a burden, and will harm their productivity.

But I find that the incremental cost of adding a test to an existing test suite keeps getting lower over time.

The hard bit of testing is getting a testing framework set up in the first place - with a test runner, and fixtures, and objects under test and suchlike.

Once that's in place, adding new tests becomes really easy.

So my personal rule is that every new project starts with a test. It doesn't really matter what that test does - what matters is that you can run pytest to run the tests, and you have an obvious place to start building more of them.

I maintain three cookiecutter templates to help with this, for the three kinds of projects I most frequently create:

simonw/python-lib for Python libraries
simonw/click-app for command line tools
simonw/datasette-plugin for Datasette plugins

Each of these templates creates a project with a setup.py file, a README, a test suite and GitHub Actions workflows to run those tests and ship tagged releases to PyPI.

I have a trick for running cookiecutter as part of creating a brand new repository on GitHub. I described that in Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions.

This is a hill that I will die on: your documentation must live in the same repository as your code!

You often see projects keep their documentation somewhere else, like in a wiki.

Inevitably it goes out of date. And my experience is that if your documentation is out of date people will lose trust in it, which means they'll stop reading it and stop contributing to it.

The gold standard of documentation has to be that it's reliably up to date with the code.

The only way you can do that is if the documentation and code are in the same repository.

This gives you versioned snapshots of the documentation that exactly match the code at that time.

More importantly, it means you can enforce it through code review. You can say in a PR "this is great, but don't forget to update this paragraph on this page of the documentation to reflect the change you're making".

If you do this you can finally get documentation that people learn to trust over time.

Another trick I like to use is something I call documentation unit tests.

The idea here is to use unit tests to enforce that concepts introspected from your code are at least mentioned in your documentation.

I wrote more about that in Documentation unit tests.

Here's an example. Datasette has a test that scans through each of the Datasette plugin hooks and checks that there is a heading for each one in the documentation.

The test itself is pretty simple: it uses pytest parametrization to look through every introspected plugin hook name, and for each one checks that it has a matching heading in the documentation.
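
Datasette's real test is written in Python with pytest. To keep the examples on this page in one language, here is the same idea sketched in Rust: given a list of introspected names and the documentation text, report any name without a matching heading. The function name, the hook names and the rule for what counts as a heading (a line starting with the name followed by an opening parenthesis) are all assumptions for illustration.

```rust
/// Returns the hook names that have no matching heading in `docs`.
/// Assumption: a heading is a line that starts with "<name>(".
fn undocumented_hooks<'a>(hook_names: &[&'a str], docs: &str) -> Vec<&'a str> {
    hook_names
        .iter()
        .copied()
        .filter(|name| {
            let prefix = format!("{}(", name);
            !docs
                .lines()
                .any(|line| line.trim_start().starts_with(prefix.as_str()))
        })
        .collect()
}

fn main() {
    // Made-up hook names and documentation text.
    let hooks = ["first_hook", "second_hook"];
    let docs = "first_hook(conn)\n----------------\n\nDescribes the first hook.\n";
    let missing = undocumented_hooks(&hooks, docs);
    // A documentation unit test would assert that this list is empty.
    println!("{:?}", missing);
}
```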

The final component of my perfect commit is this: every commit must link to an issue thread.

I'll usually have these open in advance but sometimes I'll open an issue thread just so I can close it with a commit a few seconds later!

Here's the issue for the commit I showed earlier. It has 11 comments, and every single one of those comments is by me.

I have literally thousands of issues on GitHub that look like this: issue threads that are effectively me talking to myself about the changes that I'm making.

It turns out this is a fantastic form of additional documentation.

What goes in an issue?

Background: the reasons for the change. In six months time you'll want to know why you did this.
State of play before-hand: embed existing code, link to existing docs. I like to start my issues with "I'm going to change this code right here" - that way if I come back the next day I don't have to repeat that little piece of research.
Links to things! Documentation, inspiration, clues found on StackOverflow. The idea is to capture all of the loose information floating around that topic.
Code snippets illustrating potential designs and false-starts.
Decisions. What did you consider? What did you decide? As programmers we make decisions constantly, all day, about everything. That work doesn't have to be invisible. Writing them down also avoids having to re-litigate them several months later when you've forgotten your original reasoning.
Screenshots - of everything! Animated screenshots even better. I even take screenshots of things like the AWS console to remind me what I did there.
When you close it: a link to the updated documentation and demo

The reason I love issues is that they're a form of documentation that I think of as temporal documentation.

Regular documentation comes with a big commitment: you have to keep it up to date in the future.

Issue comments skip that commitment entirely. They're displayed with a timestamp, in the context of the work you were doing at the time.

No-one will be upset or confused if you fail to keep them updated to match future changes.

So it's a commitment-free form of documentation, which I for one find incredibly liberating.

I think of this approach as issue driven development.

Everything you are doing is issue-first, and from that you drive the rest of the development process.

This is how it relates back to maintaining 185 projects at the same time.

With issue driven development you don't have to remember anything about any of these projects at all.

I've had issues where I did a bunch of design work in issue comments, then dropped it, then came back 12 months later and implemented that design - without having to rethink it.

I've had projects where I forgot that the project existed entirely! But I've found it again, and there's been an open issue, and I've been able to pick up work again.

It's a way of working where you treat it like every project is going to be maintained by someone else, and it's the classic cliche here that the somebody else is you in the future.

It horizontally scales you and lets you tackle way more interesting problems.

Programmers always complain when you interrupt them - there's this idea of "flow state" and that interrupting a programmer for a moment costs them half an hour in getting back up to speed.

This fixes that! It's much easier to get back to what you are doing if you have an issue thread that records where you've got to.

Issue driven development is my key productivity hack for taking on much more ambitious projects in much larger quantities.

Another way to think about this is to compare it to laboratory notebooks.

Here's a page from one by Leonardo da Vinci.

Great scientists and great engineers have always kept detailed notes.

We can use GitHub issues as a really quick and easy way to do the same thing!

Another thing I like to use these for is deep research tasks.

Here's an example, from when I was trying to figure out how to run my Python web application in an AWS Lambda function:

Figure out how to deploy Datasette to AWS Lambda using function URLs and Mangum

This took me 65 comments over the course of a few days... but by the end of that thread I'd figured out how to do it!

Here's the follow-up, with another 77 comments, in which I figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain.

I will never have to figure this out ever again! That's a huge win.

https://github.com/simonw/public-notes is a public repository where I keep some of these issue threads, transferred from my private notes repos using this trick.

The last thing I want to encourage you to do is this: if you do a project, tell people what it is you did!

This counts for both personal and work projects. It's so easy to skip this step.

Once you've shipped a feature or built a project, it's so tempting to skip the step of spending half an hour or more writing about the work you have done.

But you are missing out on so much of the value of your work if you don't give other people a chance to understand what you did.

I wrote more about this here: What to blog about.

For projects with releases, release notes are a really good way to do this.

I like using GitHub releases for this - they're quick and easy to write, and I have automation set up for my projects such that creating release notes in GitHub triggers a build and release to PyPI.

I've done over 1,000 releases in this way. Having them automated is crucial, and having automation makes it really easy to ship releases more often.

Please make sure your release notes have dates on them. I need to know when your change went out, because if it's only a week old it's unlikely people will have upgraded to it yet, whereas a change from five years ago is probably safe to depend on.

I wrote more about writing better release notes here.

This is a mental trick which works really well for me. "No project of mine is finished until I've told people about it in some way" is a really useful habit to form.

Twitter threads are (or were) a great low-effort way to write about a project. Build a quick thread with some links and images, and maybe even a video.

Get a little unit about your project out into the world, and then you can stop thinking about it.

(I'm trying to do this on Mastodon now instead.)

Even better: get a blog! Having your own corner of the internet to write about the work that you are doing is a small investment that will pay off many times over.

("Nobody blogs anymore" I said in the talk... Phil Gyford disagrees with that meme so much that he launched a new blog directory to show how wrong it is.)

The enemy of projects, especially personal projects, is guilt.

The more projects you have, the more guilty you feel about working on any one of them - because you're not working on the others, and those projects haven't yet achieved their goals.

You have to overcome guilt if you're going to work on 185 projects at once!

This is the most important tip: avoid side projects with user accounts.

If you build something that people can sign into, that's not a side-project, it's an unpaid job. It's a very big responsibility, avoid at all costs!

Almost all of my projects right now are open source things that people can run on their own machines, because that's about as far away from user accounts as I can get.

I still have a responsibility for shipping security updates and things like that, but at least I'm not holding onto other people's data for them.

I feel like if your project is tested and documented, you have nothing to feel guilty about.

You have put a thing out into the world, and it has tests to show that it works, and it has documentation that explains what it is.

This means I can step back and say that it's OK for me to work on other things. That thing there is a unit that makes sense to people.

That's what I tell myself anyway! It's OK to have 185 projects provided they all have documentation and they all have tests.

Do that and the guilt just disappears. You can live guilt free!

You can follow me on Mastodon at @simon@simonwillison.net or on GitHub at github.com/simonw. Or subscribe to my blog at simonwillison.net!

From the Q&A:

You've tweeted about using GitHub Projects. Could you talk about that?
- GitHub Projects V2 is the perfect TODO list for me, because it lets me bring together issues from different repositories. I use a project called "Everything" on a daily basis (it's my browser default window) - I add issues to it that I plan to work on, including personal TODO list items as well as issues from my various public and private repositories. It's kind of like a cross between Trello and Airtable and I absolutely love it.
How did you move notes from the private to the public repo?
- GitHub doesn't let you do this. But there's a trick I use involving a temp repo which I switch between public and private to help transfer notes. More in this TIL.
Question about the perfect commit: do you commit your failing tests?
- I don't: I try to keep the commits that land on my main branch always passing. I'll sometimes write the failing test before the implementation and then commit them together. For larger projects I'll work in a branch and then squash-merge the final result into a perfect commit to main later on.

Tags: productivity, talks, testing, documentation, annotated-talks, djangocon, github-issues

The Perfect Commit

2022-10-29T20:41:01+00:00

For the last few years I've been trying to center my work around creating what I consider to be the Perfect Commit. This is a single commit that contains all of the following:

The implementation: a single, focused change
Tests that demonstrate the implementation works
Updated documentation reflecting the change
A link to an issue thread providing further context

Our job as software engineers generally isn't to write new software from scratch: we spend the majority of our time adding features and fixing bugs in existing software.

The commit is our principal unit of work. It deserves to be treated thoughtfully and with care.

Update 26th November 2022: My 25 minute talk Massively increase your productivity on personal projects with comprehensive documentation and automated tests describes this approach to software development in detail.

Implementation

Each commit should change a single thing.

The definition of "thing" here is left deliberately vague!

The goal is to have something that can be easily reviewed, and that can be clearly understood in the future when revisited using tools like git blame or git bisect.

I like to keep my commit history linear, as I find that makes it much easier to comprehend later. This further reinforces the value of each commit being a single, focused change.

Atomic commits are also much easier to cleanly revert if something goes wrong - or to cherry-pick into other branches.

For things like web applications that can be deployed to production, a commit should be a unit that can be deployed. Aiming to keep the main branch in a deployable state is a good rule of thumb for deciding if a commit is a sensible atomic change or not.

Tests

The ultimate goal of tests is to increase your productivity. If your testing practices are slowing you down, you should consider ways to improve them.

In the longer term, this productivity improvement comes from gaining the freedom to make changes and stay confident that your change hasn't broken something else.

But tests can help increase productivity in the immediate short term as well.

How do you know when the change you have made is finished and ready to commit? It's ready when the new tests pass.

I find this reduces the time I spend second-guessing myself and questioning whether I've done enough and thought through all of the edge cases.

Without tests, there's a very strong possibility that your change will have broken some other, potentially unrelated feature. Your commit could be held up by hours of tedious manual testing. Or you could YOLO it and learn that you broke something important later!

Writing tests becomes far less time consuming if you already have good testing practices in place.

Adding a new test to a project with a lot of existing tests is easy: you can often find an existing test that has 90% of the pattern you need already worked out for you.

If your project has no tests at all, adding a test for your change will be a lot more work.

This is why I start every single one of my projects with a passing test. It doesn't matter what this test is - assert 1 + 1 == 2 is fine! The key thing is to get a testing framework in place, such that you can run a command (for me that's usually pytest) to execute the test suite - and you have an obvious place to add new tests in the future.
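As a minimal sketch of that starting point (the file and function names here are hypothetical, not taken from any particular template), a first test can be as small as this:

```python
# tests/test_placeholder.py (hypothetical file name)


def test_placeholder():
    # Deliberately trivial: its only job is to prove that the test runner
    # is installed, discovers this file and reports a pass
    assert 1 + 1 == 2
```

Running pytest in the project root should collect and pass this one test, which confirms the framework is wired up before any real tests exist.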

I use these cookiecutter templates for almost all of my new projects. They configure a testing framework with a single passing test and GitHub Actions workflows to exercise it all from the very start.

I'm not a huge advocate of test-first development, where tests are written before the code itself. What I care about is tests-included development, where the final commit bundles the tests and the implementation together. I wrote more about my approach to testing in How to cheat at unit tests with pytest and Black.

Documentation

If your project defines APIs that are meant to be used outside of your project, they need to be documented. In my work these projects are usually one of the following:

Python APIs (modules, functions and classes) that provide code designed to be imported into other projects.
Web APIs - usually JSON over HTTP these days - that provide functionality to be consumed by other applications.
Command line interface tools, such as those implemented using Click or Typer or argparse.

It is critical that this documentation lives in the same repository as the code itself.

This is important for a number of reasons.

Documentation is only valuable if people trust it. People will only trust it if they know that it is kept up to date.

If your docs live in a separate wiki somewhere it's easy for them to get out of date - but more importantly it's hard for anyone to quickly confirm if the documentation is being updated in sync with the code or not.

Documentation should be versioned. People need to be able to find the docs for the specific version of your software that they are using. Keeping it in the same repository as the code gives you synchronized versioning for free.

Documentation changes should be reviewed in the same way as your code. If they live in the same repository you can catch changes that need to be reflected in the documentation as part of your code review process.

And ideally, documentation should be tested. I wrote about my approach to doing this using Documentation unit tests. Executing example code in the documentation using a testing framework is a great idea too.
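One way such a documentation test could look, as a rough sketch: check that every public name appears somewhere in the docs. The undocumented() helper, the COMMANDS list and the inline DOCS string are all illustrative assumptions; a real project would introspect the names from the code and read the text from its docs directory.

```python
import re


def undocumented(names, docs_text):
    """Return the names that never appear as a whole word in docs_text."""
    missing = []
    for name in names:
        # Lookarounds stop "insert" from matching inside "insert-all"
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if not re.search(pattern, docs_text):
            missing.append(name)
    return missing


# Hypothetical inputs standing in for introspected commands and real docs
COMMANDS = ["insert", "create-database"]
DOCS = """
insert
    Insert records into a table.

create-database
    Create a new, empty database file.
"""


def test_every_command_is_documented():
    missing = undocumented(COMMANDS, DOCS)
    assert missing == [], f"Undocumented commands: {missing}"
```

A test like this fails as soon as a new command is added without a matching docs entry, which nudges the documentation into the same commit as the code.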

As with tests, writing documentation from scratch is much more work than incrementally modifying existing documentation.

Many of my commits include documentation that is just a sentence or two. This doesn't take very long to write, but it adds up to something very comprehensive over time.

How about end-user facing documentation? I'm still figuring that out myself. I created my shot-scraper tool to help automate the process of keeping screenshots up-to-date, but I've not yet found personal habits and styles for end-user documentation that I'm confident in.

A link to an issue

Every perfect commit should include a link to an issue thread that accompanies that change.

Sometimes I'll even open an issue seconds before writing the commit message, just to give myself something I can link to from the commit itself!

The reason I like issue threads is that they provide effectively unlimited space for commentary and background for the change that is being made.

Most of my issue threads are me talking to myself - sometimes with dozens of issue comments, all written by me.

Things that can go in an issue thread include:

Background: the reason for the change. I try to include this in the opening comment.
State of play before the change. I'll often link to the current version of the code and documentation. This is great if I return to an open issue a few days later, as it saves me from having to repeat that initial research.
Links to things. So many links! Inspiration for the change, relevant documentation, conversations on Slack or Discord, clues found on StackOverflow.
Code snippets illustrating potential designs and false-starts. Use ```python ... ``` blocks to get syntax highlighting in your issue comments.
Decisions. What did you consider? What did you decide? As programmers we make hundreds of tiny decisions a day. Write them down! Then you'll never find yourself relitigating them in the future having forgotten your original reasoning.
Screenshots. What it looked like before, what it looked like after. Animated screenshots are even better! I use LICEcap to generate quick GIF screen captures or QuickTime to capture videos - both of which can be dropped straight into a GitHub issue comment.
Prototypes. I'll often paste a few lines of code copied from a Python console session. Sometimes I'll even paste in a block of HTML and CSS, or add a screenshot of a UI prototype.

After I've closed my issues I like to add one last comment that links to the updated documentation and ideally a live demo of the new feature.

An issue is more valuable than a commit message

I went through a several year phase of writing essays in my commit messages, trying to capture as much of the background context and thinking as possible.

My commit messages grew a lot shorter when I started bundling the updated documentation in the commit - since often much of the material I'd previously included in the commit message was now in that documentation instead.

As I extended my practice of writing issue threads, I found that they were a better place for most of this context than the commit messages themselves. They supported embedded media, were more discoverable and I could continue to extend them even after the commit had landed.

Today many of my commit messages are a single line summary and a link to an issue!

The biggest benefit of lengthy commit messages is that they are guaranteed to survive for as long as the repository itself. If you're going to use issue threads in the way I describe here it is critical that you consider their long term archival value.

I expect this to be controversial! I'm advocating for abandoning one of the core ideas of Git here - that each repository should incorporate a full, decentralized record of its history that is copied in its entirety when someone clones a repo.

I understand that philosophy. All I'll say here is that my own experience has been that dropping that requirement has resulted in a net increase in my overall productivity. Other people may reach a different conclusion.

If this offends you too much, you're welcome to construct an even more perfect commit that incorporates background information and additional context in an extended commit message as well.

One of the reasons I like GitHub Issues is that it includes a comprehensive API, which can be used to extract all of that data. I use my github-to-sqlite tool to maintain an ongoing archive of my issues and issue comments as a SQLite database file.

Not every commit needs to be "perfect"

I find that the vast majority of my work fits into this pattern, but there are exceptions.

Typo fix for some documentation or a comment? Just ship it, it's fine.

Bug fix that doesn't deserve documentation? Still bundle the implementation and the test plus a link to an issue, but no need to update the docs - especially if they already describe the expected bug-free behaviour.

Generally though, I find that aiming for implementation, tests, documentation and an issue link covers almost all of my work. It's a really good default model.

Write scrappy commits in a branch

If I'm writing more exploratory or experimental code it often doesn't make sense to work in this strict way. For those instances I'll usually work in a branch, where I can ship "WIP" commit messages and failing tests with abandon. I'll then squash-merge them into a single perfect commit (sometimes via a self-closed GitHub pull request) to keep my main branch as tidy as possible.

Some examples

Here are some examples of my commits that follow this pattern:

Upgrade Docker images to Python 3.11 for datasette #1853 - a pretty tiny change, but still includes tests, docs and an issue link.
sqlite-utils schema now takes optional tables for sqlite-utils #299
shot-scraper html command for shot-scraper #96
s3-credentials put-objects command for s3-credentials #68
Initial implementation for datasette-gunicorn #1 - this was the first commit to this repository, but I still bundled the tests, docs, implementation and a link to an issue.

Tags: software-engineering, git, github, github-issues

A tool to run caption extraction against online videos using Whisper and GitHub Issues/Actions

2022-09-30T00:56:28+00:00

I released a new project this weekend, built during the Bellingcat Hackathon (I came second!) It's called Action Transcription and it's a tool for capturing captions and transcripts from online videos.

Here's my video introducing the new tool:

Bellingcat

Bellingcat describe themselves as an "independent international collective of researchers, investigators and citizen journalists using open source and social media investigation to probe a variety of subjects".

They specialize in open source intelligence - which, confusingly, does NOT mean "open source software" - this is a much older usage of the term that describes the use of publicly available information to gather intelligence.

They have broken a LOT of impressive stories over their eight year lifespan. Wikipedia has a good list - highlights include identifying the suspects behind the Skripal poisoning case.

The theme of the hackathon was "General Digital Investigation Tools". The goal was to build prototypes of tools that could be used by their community of investigators - most of whom are volunteers working from home with little-to-no budget, and often with limited technical skills (they can use tools very effectively but they might not be comfortable writing code or using the command-line).

Inspired by the recent release of OpenAI's Whisper, I decided to build a tool that would make it easier to extract captions and transcripts from videos on social media sites.

Why GitHub Actions and GitHub Issues?

My goals for the project were:

Help people achieve something useful
Make it as inexpensive to run as possible - ideally free
Make it easy for people to install and run their own copies

I decided to build the entire thing using GitHub Actions and GitHub Issues.

GitHub Actions is a powerful service for running CI jobs and other automation, but its best feature for this particular project is that it's free.

I'm fine with spending money myself, but if I'm building tools for other people having a way for them to run the tool without paying for anything is a huge win.

My tool needed a UI. To keep things as simple as possible, I didn't want to host anything outside of GitHub itself. So I turned to GitHub Issues to provide the interface layer.

It's easy to create Actions scripts that trigger when a new issue is created. And those scripts can then interact with that issue - attaching comments, or even closing it as completed.

I decided that my flow would be:

The user opens an issue and pastes in a link to an online video.
GitHub Actions is triggered by that issue, extracts the URL and fetches the video using youtube-dl (which, despite the name, can actually download videos from over 1,200 sites including many of the social media services popular in Russia).
The script extracts just the audio from the video.
The audio is then passed through OpenAI's Whisper, which can create a high quality transcript in the original language AND create a shockingly good English translation.
The caption is then both written back to the GitHub repository and attached to the original issue as a comment.

GitHub Actions doesn't (yet) provide GPUs, and Whisper works a whole lot faster with GPU access. So I decided to run Whisper using this hosted copy of the model on Replicate.

Extracting YouTube's captions directly

I had a check-in meeting with Tristan from Bellingcat just to make sure my hack wasn't a duplicate effort, and to get feedback on the plan.

Tristan liked the plan, but pointed out that extracting captions directly from YouTube would be a useful additional feature.

In addition to supporting manual captions, it turns out YouTube already creates machine-generated captions in over 100 languages! The quality of these isn't nearly as good as OpenAI Whisper, but they're still useful. And they're free (running Whisper currently costs me money).

So I adapted the plan, to provide the user with two options. The default option would extract captions directly from the video provider - which would definitely work for YouTube and might work for other sites too.

The second option would use Whisper to create a transcript and a translation, taking longer but providing results even for sites that didn't offer their own captions.

I decided to use issue tags to trigger these two workflows: tag with "captions" to extract captions directly, tag with "whisper" to use Whisper.
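The first decisions in that flow (find the video URL in the issue, then pick a workflow from its tags) can be sketched in a few lines. This is an illustration in Python rather than the JavaScript the project actually uses, and the function names, the URL pattern and the choice to let "whisper" win when both tags are present are all assumptions.

```python
import re

# Assumption: the first http(s) URL in the issue body is the video link
URL_RE = re.compile(r"https?://[^\s<>()\"']+")


def extract_url(issue_body):
    """Return the first http(s) URL found in the issue body, or None."""
    match = URL_RE.search(issue_body or "")
    if match is None:
        return None
    # Trim punctuation that commonly trails a URL pasted into a sentence
    return match.group(0).rstrip(".,;:!?")


def choose_mode(labels):
    """Map issue tags to a workflow: "whisper", "captions" or None."""
    names = {label.lower() for label in labels}
    if "whisper" in names:
        # Assumption: whisper takes precedence if both tags are applied
        return "whisper"
    if "captions" in names:
        return "captions"
    return None
```

An issue with no recognised tag returns None, so the automation can leave it alone rather than guessing which workflow was intended.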

The implementation

The implementation ended up being 218 lines of JavaScript-embedded-in-YAML in a GitHub Actions issue_created.yml workflow.

I used actions/github-script for it - a convenient reusable Action that provides a pre-configured set of JavaScript objects for interacting with the GitHub API.

The code isn't hugely elegant: I'm not hugely familiar with the Node.js ecosystem so I ended up hacking around with Copilot quite a bit to figure out the patterns that would work.

It turns out captions can come back in a variety of different formats. The two most common appeared to be TTML - which uses XML, and WebVTT, a text-based format.

I decided to archive the original caption files in the GitHub repository itself, but I wanted to extract just the text and post that as the issue comment.

So I ended up building two tiny new tools: webvtt-to-json and ttml-to-json - which converted the different formats into a standard JSON format of my own invention, normalizing the captions so I could then extract the text and include it in a comment.
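To give a feel for what that normalization involves, here is a simplified Python sketch of parsing WebVTT into a list of cues and pulling out just the text. It is not the webvtt-to-json implementation, and the output keys ("start", "end", "lines") are assumptions made for illustration, not necessarily the format those tools emit.

```python
import re

# Matches cue timing lines such as "00:00:01.000 --> 00:00:04.000"
# (the hours component is optional in WebVTT)
TIMING_RE = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def parse_webvtt(text):
    """Parse WebVTT text into a list of {"start", "end", "lines"} dicts."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    # Blocks (header, cues, notes) are separated by blank lines
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING_RE.match(line.strip())
            if not match:
                continue
            # Everything after the timing line is the cue text; strip
            # inline tags such as <c> or <00:00:01.500>
            payload = [
                re.sub(r"<[^>]+>", "", item).strip()
                for item in lines[i + 1:]
                if item.strip()
            ]
            cues.append(
                {"start": match.group(1), "end": match.group(2), "lines": payload}
            )
            break
    return cues


def plain_text(cues):
    """Join the cue text into one block suitable for an issue comment."""
    return "\n".join(line for cue in cues for line in cue["lines"])
```

Blocks without a timing line, such as the WEBVTT header, are skipped, so only actual captions end up in the output.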

Hackathons tend to encourage some pretty scrappy solutions!

The results

These two issues demonstrate the final result of the tool:

That first one in particular shows quite how good the Whisper model is at handling Russian text, and translating it to English.

Adding issue templates

I added one last enhancement to the project after recording the demo video for the judges embedded above.

Issue templates are a new GitHub feature that lets you define a form that users must fill out when they create a new issue.

Frustratingly, these only work with public repositories. I had built my hack in a private repo at first, so I was only able to explore using issue templates once I had made it public.

I created two issue templates - one for caption tasks and one for whisper tasks.

Now when a user goes to open a new issue they get to choose one of the two templates and fill in the URL as part of a form! Here's a GIF demo showing that flow in action:

Template repositories

One last trick. I want users to be able to run this system themselves, on their own GitHub account.

I made simonw/action-transcription a template repository.

This means that any user can click a green button to get their own copy of the repository - and when they do, they'll get their own fully configured copy of the GitHub Actions workflows too.

If they want to use Whisper they'll need to get an API key from Replicate.com and add it to their repository's secrets - but regular caption extraction will work fine without that.

I've used this technique before - I wrote about it here:

GitHub Actions as a platform

I'm pleased with how this project turned out. But I'm mainly excited about the underlying pattern. I think building tools using GitHub Actions that people can clone to their own accounts is a really promising way of developing sophisticated automated software that people can then run independently, entirely through the GitHub web interface.

I'm excited to see more tools adopt a similar pattern.

Tags: hackathons, projects, bellingcat, github-actions, openai, whisper, replicate, github-issues

upptime

2022-05-26T03:53:35+00:00

upptime

“Open-source uptime monitor and status page, powered entirely by GitHub Actions, Issues, and Pages.” This is a very creative (ab)use of GitHub Actions: it runs a scheduled action to check the availability of sites that you specify, records the results in a YAML file (with the commit history tracking them over time) and can automatically open a GitHub issue for you if it detects a new incident.

Via Ray Voelker

Tags: github-actions, github-issues

Automatically opening issues when tracked file content changes

2022-04-28T17:18:14+00:00

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

Extracting GZipMiddleware from Starlette

Here's why I needed to solve this problem.

I want to add gzip support to my Datasette open source project. Datasette builds on the Python ASGI standard, and Starlette provides an extremely well tested, robust GZipMiddleware class that adds gzip support to any ASGI application. As with everything else in Starlette, it's really good code.

The problem is, I don't want to add the whole of Starlette as a dependency. I'm trying to keep Datasette's core as small as possible, so I'm very careful about new dependencies. Starlette itself is actually very light (and only has a tiny number of dependencies of its own) but I still don't want the whole thing just for that one class.

So I decided to extract the GZipMiddleware class into a separate Python package, under the same BSD license as Starlette itself.

The result is my new asgi-gzip package, now available on PyPI.

What if Starlette fixes a bug?

The problem with extracting code like this is that Starlette is a very effectively maintained package. What if they make improvements or fix bugs in the GZipMiddleware class? How can I make sure to apply those same fixes to my extracted copy?

As I thought about this challenge, I realized I had most of the solution already.

Git scraping is the name I've given to the trick of running a periodic scraper that writes to a git repository in order to track changes to data over time.

It may seem redundant to do this against a file that already lives in version control elsewhere - but in addition to tracking changes, Git scraping can offer a cheap and easy way to add automation that triggers when a change is detected.

I need an actionable alert any time the Starlette code changes so I can review the change and apply a fix to my own library, if necessary.

Since I already run all of my projects out of GitHub issues, automatically opening an issue against the asgi-gzip repository would be ideal.

My track.yml workflow does exactly that: it implements the Git scraping pattern against the gzip.py module in Starlette, and files an issue any time it detects changes to that file.

Starlette haven't made any changes to that file since I started tracking it, so I created a test repo to try this out.

Here's one of the example issues. I decided to include the visual diff in the issue description and have a link to it from the underlying commit as well.

How it works

The implementation is contained entirely in this track.yml workflow. I designed this to be contained as a single file to make it easy to copy and paste it to adapt it for other projects.

It uses actions/github-script, which makes it easy to do things like file new issues using JavaScript.

Here's a heavily annotated copy:

name: Track the Starlette version of this

# Run on repo pushes, and if a user clicks the "run this action" button,
# and on a schedule at 5:21am UTC every day
on:
  push:
  workflow_dispatch:
  schedule:
  - cron:  '21 5 * * *'

# Without this block I got this error when the action ran:
# HttpError: Resource not accessible by integration
permissions:
  # Allow the action to create issues
  issues: write
  # Allow the action to commit back to the repository
  contents: write

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/github-script@v6
      # Using env: here to demonstrate how an action like this can
      # be adjusted to take dynamic inputs
      env:
        URL: https://raw.githubusercontent.com/encode/starlette/master/starlette/middleware/gzip.py
        FILE_NAME: tracking/gzip.py
      with:
        script: |
          const { URL, FILE_NAME } = process.env;
          // promisify pattern for getting an await version of child_process.exec
          const util = require("util");
          // Used exec_ here because 'exec' variable name is already used:
          const exec_ = util.promisify(require("child_process").exec);
          // Use curl to download the file
          await exec_(`curl -o ${FILE_NAME} ${URL}`);
          // Use 'git diff' to detect if the file has changed since last time
          const { stdout } = await exec_(`git diff ${FILE_NAME}`);
          if (stdout) {
            // There was a diff to that file
            const title = `${FILE_NAME} was updated`;
            const body =
              `${URL} changed:` +
              "\n\n```diff\n" +
              stdout +
              "\n```\n\n" +
              "Close this issue once those changes have been integrated here";
            const issue = await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: title,
              body: body,
            });
            const issueNumber = issue.data.number;
            // Now commit and reference that issue number, so the commit shows up
            // listed at the bottom of the issue page
            const commitMessage = `${FILE_NAME} updated, refs #${issueNumber}`;
            // https://til.simonwillison.net/github-actions/commit-if-file-changed
            await exec_(`git config user.name "Automated"`);
            await exec_(`git config user.email "actions@users.noreply.github.com"`);
            await exec_(`git add -A`);
            await exec_(`git commit -m "${commitMessage}" || exit 0`);
            await exec_(`git pull --rebase`);
            await exec_(`git push`);
          }

In the asgi-gzip repository I keep the fetched gzip.py file in a tracking/ directory. This directory isn't included in the Python package that gets uploaded to PyPI - it's there only so that my code can track changes to it over time.
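The core of the workflow above is "compare the fresh copy with the stored one, and build an issue if they differ". As a rough sketch of that logic outside GitHub Actions, the same idea can be expressed in Python, swapping git diff for the standard library's difflib. The function names are hypothetical; the title and body strings mirror the ones in the workflow.

```python
import difflib


def diff_if_changed(old_text, new_text, file_name):
    """Return a unified diff string if the content changed, else ""."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{file_name}",
        tofile=f"b/{file_name}",
    )
    # unified_diff yields nothing at all when the inputs are identical
    return "".join(diff)


def build_issue(url, file_name, diff_text):
    """Build the issue title and body, mirroring the workflow's strings."""
    fence = "`" * 3  # Markdown code fence, built here to keep this block simple
    title = f"{file_name} was updated"
    body = (
        f"{url} changed:"
        f"\n\n{fence}diff\n{diff_text}\n{fence}\n\n"
        "Close this issue once those changes have been integrated here"
    )
    return title, body
```

An empty string from diff_if_changed() plays the same role as the empty stdout check in the workflow: no diff means no issue and no commit.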

More interesting applications

I built this to solve my "tell me when Starlette update their gzip.py file" problem, but clearly this pattern has much more interesting uses.

You could point this at any web page to get a new GitHub issue opened when that page content changes. Subscribe to notifications for that repository and you get a robust, shared mechanism for alerts - plus an issue system where you can post additional comments and close the issue once someone has reviewed the change.

There's a lot of potential here for solving all kinds of interesting problems. And it doesn't cost anything either: GitHub Actions (somehow) remains completely free for public repositories!

Update: October 13th 2022

Almost six months after writing about this... it triggered for the first time!

Here's the issue that the script opened: #4: tracking/gzip.py was updated.

I applied the improvement (Marcelo Trylesinski and Kai Klingenberg updated Starlette's code to avoid gzipping if the response already had a Content-Encoding header) and released version 0.2 of the package.

Tags: gzip, github-actions, git-scraping, asgi, datasette, python, projects, github, github-issues

How I build a feature

2022-01-12T18:10:17+00:00

I'm maintaining a lot of different projects at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new sqlite-utils create-database command as an example.

I like each feature to be represented by what I consider to be the perfect commit - one that bundles together the implementation, the tests, the documentation and a link to an external issue thread.

Update 29th October 2022: I wrote more about the perfect commit.

The sqlite-utils create-database command is very simple: it creates a new, empty SQLite database file. You use it like this:

% sqlite-utils create-database empty.db
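For a sense of what "a new, empty SQLite database file" means in practice, here is a rough equivalent using only Python's standard library. This is a sketch, not the sqlite-utils implementation, and running VACUUM rests on the assumption that it prompts SQLite to write out a proper file rather than leaving a zero-byte placeholder.

```python
import sqlite3


def create_database(path):
    """Create a new, empty SQLite database file at path."""
    conn = sqlite3.connect(path)
    try:
        # Assumption: VACUUM makes SQLite write a real database header,
        # so the file is more than an empty placeholder
        conn.execute("VACUUM")
    finally:
        conn.close()
```

The resulting file contains no tables, so it can be opened by any SQLite client and populated later.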

Everything starts with an issue

Every piece of work I do has an associated issue. This acts as ongoing work-in-progress notes and lets me record decisions, reference any research, drop in code snippets and sometimes even add screenshots and video - stuff that is really helpful but doesn't necessarily fit in code comments or commit messages.

Even if it's a tiny improvement that's only a few lines of code, I'll still open an issue for it - sometimes just a few minutes before closing it again as complete.

Any commits that I create that relate to an issue reference the issue number in their commit message. GitHub does a great job of automatically linking these together, bidirectionally so I can navigate from the commit to the issue or from the issue to the commit.

Having an issue also gives me something I can link to from my release notes.

In the case of the create-database command, I opened this issue in November when I had the idea for the feature.

I didn't do the work until over a month later - but because I had designed the feature in the issue comments I could get started on the implementation really quickly.

Development environment

Being able to quickly spin up a development environment for a project is crucial. All of my projects have a section in the README or the documentation describing how to do this - here's that section for sqlite-utils.

On my own laptop each project gets a directory, and I use pipenv shell in that directory to activate a directory-specific virtual environment, then pip install -e '.[test]' to install the dependencies and test dependencies.

Automated tests

All of my features are accompanied by automated tests. This gives me the confidence to boldly make changes to the software in the future without fear of breaking any existing features.

This means that writing tests needs to be as quick and easy as possible - the less friction here the better.

The best way to make writing tests easy is to have a great testing framework in place from the very beginning of the project. My cookiecutter templates (python-lib, datasette-plugin and click-app) all configure pytest and add a tests/ folder with a single passing test, to give me something to start adding tests to.

I can't say enough good things about pytest. Before I adopted it, writing tests was a chore. Now it's an activity I genuinely look forward to!

I'm not a religious adherent to writing the tests first - see How to cheat at unit tests with pytest and Black for more thoughts on that - but I'll write the test first if it's pragmatic to do so.

In the case of create-database, writing the test first felt like the right thing to do. Here's the test I started with:

def test_create_database(tmpdir):
    db_path = tmpdir / "test.db"
    assert not db_path.exists()
    result = CliRunner().invoke(
        cli.cli, ["create-database", str(db_path)]
    )
    assert result.exit_code == 0
    assert db_path.exists()

This test uses the tmpdir pytest fixture to provide a temporary directory that will be automatically cleaned up by pytest after the test run finishes.

It checks that the test.db file doesn't exist yet, then uses the Click framework's CliRunner utility to execute the create-database command. Then it checks that the command didn't throw an error and that the file has been created.

Then I run the test, and watch it fail - because I haven't built the feature yet!

% pytest -k test_create_database

============ test session starts ============
platform darwin -- Python 3.8.2, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/simon/Dropbox/Development/sqlite-utils
plugins: cov-2.12.1, hypothesis-6.14.5
collected 808 items / 807 deselected / 1 selected                           

tests/test_cli.py F                                                   [100%]

================= FAILURES ==================
___________ test_create_database ____________

tmpdir = local('/private/var/folders/wr/hn3206rs1yzgq3r49bz8nvnh0000gn/T/pytest-of-simon/pytest-659/test_create_database0')

    def test_create_database(tmpdir):
        db_path = tmpdir / "test.db"
        assert not db_path.exists()
        result = CliRunner().invoke(
            cli.cli, ["create-database", str(db_path)]
        )
>       assert result.exit_code == 0
E       assert 1 == 0
E        +  where 1 = <Result SystemExit(1)>.exit_code

tests/test_cli.py:2097: AssertionError
========== short test summary info ==========
FAILED tests/test_cli.py::test_create_database - assert 1 == 0
===== 1 failed, 807 deselected in 0.99s ====

The -k option lets me run any tests that match the search string, rather than running the full test suite. I use this all the time.

Other pytest features I often use:

pytest -x: runs the entire test suite but quits at the first test that fails
pytest --lf: re-runs any tests that failed during the last test run
pytest --pdb -x: open the Python debugger at the first failed test (omit the -x to open it at every failed test). This is the main way I interact with the Python debugger. I often use this to help write the tests, since I can add assert False and get a shell inside the test to interact with various objects and figure out how to best run assertions against them.

Implementing the feature

Test in place, it's time to implement the command. I added this code to my existing cli.py module:

@cli.command(name="create-database")
@click.argument(
    "path",
    type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
    required=True,
)
def create_database(path):
    "Create a new empty database file."
    db = sqlite_utils.Database(path)
    db.vacuum()

(I happen to know that the quickest way to create an empty SQLite database file is to run VACUUM against it.)

The test now passes!

I iterated on this implementation a little bit more, to add the --enable-wal option I had designed in the issue comments - and updated the test to match. You can see the final implementation in this commit: 1d64cd2e5b402ff957f9be2d9bb490d313c73989.
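To illustrate the VACUUM trick and the --enable-wal idea without depending on sqlite-utils, here is a minimal sketch using only the standard library sqlite3 module. The create_empty_database() helper is a hypothetical name for illustration, not the code from the commit above.

```python
import os
import sqlite3
import tempfile


def create_empty_database(path, enable_wal=False):
    """Create an empty SQLite database file at path (hypothetical helper)."""
    # isolation_level=None puts the connection in autocommit mode,
    # which matters because VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        if enable_wal:
            # Write-ahead logging is a persistent setting stored in the file
            conn.execute("PRAGMA journal_mode=wal")
        # VACUUM forces SQLite to write out a valid, empty database file
        conn.execute("VACUUM")
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "test.db")
create_empty_database(db_path, enable_wal=True)
with open(db_path, "rb") as f:
    header = f.read(16)
print(header)
```

Every SQLite database file starts with the same 16 byte header string, so reading it back is a cheap way to confirm that a real database was written.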

If I add a new test and it passes the first time, I’m always suspicious of it. I’ll deliberately break the test (change a 1 to a 2 for example) and run it again to make sure it fails, then change it back again.

Code formatting with Black

Black has increased my productivity as a Python developer by a material amount. I used to spend a whole bunch of brain cycles agonizing over how to indent my code, where to break up long function calls and suchlike. Thanks to Black I never think about this at all - I instinctively run black . in the root of my project and accept whatever style decisions it applies for me.

Linting

I have a few linters set up to run on every commit. I can run these locally too - how to do that is documented here - but I'm often a bit lazy and leave them to run in CI.

In this case one of my linters failed! I accidentally called the new command function create_table() when it should have been called create_database(). The code worked fine due to how the cli.command(name=...) decorator works but mypy complained about the redefined function name. I fixed that in a separate commit.

Documentation

My policy these days is that if a feature isn't documented it doesn't exist. Updating existing documentation isn't much work at all if the documentation already exists, and over time these incremental improvements add up to something really comprehensive.

For smaller projects I use a single README.md which gets displayed on both GitHub and PyPI (and the Datasette website too, for example on datasette.io/tools/git-history).

My larger projects, such as Datasette and sqlite-utils, use Read the Docs and reStructuredText with Sphinx instead.

I like reStructuredText mainly because it has really good support for internal reference links - something that is missing from Markdown, though it can be enabled using MyST.

sqlite-utils uses Sphinx. I have the sphinx-autobuild extension configured, which means I can run a live reloading server with the documentation like so:

cd docs
make livehtml

Any time I'm working on the documentation I have that server running, so I can hit "save" in VS Code and see a preview in my browser a few seconds later.

For Markdown documentation I use the VS Code preview pane directly.

The moment the documentation is live online, I like to add a link to it in a comment on the issue thread.

Committing the change

I run git diff a LOT while hacking on code, to make sure I haven’t accidentally changed something unrelated. This also helps spot things like rogue print() debug statements I may have added.

Before my final commit, I sometimes even run git diff | grep print to check for those.
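One limitation of piping to grep is that it also matches removed lines and unchanged context lines. As a rough sketch of a stricter check, this hypothetical added_print_lines() helper only looks at lines a unified diff adds. The sample diff is made up for illustration.

```python
def added_print_lines(diff_text):
    """Return lines that a unified diff adds and that contain a print( call."""
    matches = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content
        if line.startswith("+") and not line.startswith("+++"):
            if "print(" in line:
                matches.append(line[1:].strip())
    return matches


sample_diff = """\
--- a/cli.py
+++ b/cli.py
@@ -1,3 +1,5 @@
 def create_database(path):
+    print("debug", path)
     db = sqlite_utils.Database(path)
-    print("old debug line")
+    db.vacuum()
"""
print(added_print_lines(sample_diff))
```

To run it against a real working directory, the diff text could come from subprocess.run(["git", "diff"], capture_output=True, text=True).stdout.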

My goal with the commit is to bundle the test, documentation and implementation. If those are the only files I've changed I do this:

git commit -a -m "sqlite-utils create-database command, closes #348"

If this completes the work on the issue I use "closes #N", which causes GitHub to close the issue for me. If it's not yet ready to close I use "refs #N" instead.

Sometimes there will be unrelated changes in my working directory. If so, I use git add <files> and then commit just with git commit -m message.

Branches and pull requests

create-database is a good example of a feature that can be implemented in a single commit, with no need to work in a branch.

For larger features, I'll work in a feature branch:

git checkout -b my-feature

I'll make a commit (often just labelled "WIP prototype, refs #N") and then push that to GitHub and open a pull request for it:

git push -u origin my-feature

I ensure the new pull request links back to the issue in its description, then switch my ongoing commentary to comments on the pull request itself.

I'll sometimes add a task checklist to the opening comment on the pull request, since tasks there get reflected in the GitHub UI anywhere that links to the PR. Then I'll check those off as I complete them.

An example of a PR I used like this is #361: --lines and --text and --convert and --import.

I don't like merge commits - I much prefer to keep my main branch history as linear as possible. I usually merge my PRs through the GitHub web interface using the squash feature, which results in a single, clean commit to main with the combined tests, documentation and implementation. Occasionally I will see value in keeping the individual commits, in which case I will rebase merge them.

Another goal here is to keep the main branch releasable at all times. Incomplete work should stay in a branch. This makes turning around and releasing quick bug fixes a lot less stressful!

Release notes, and a release

A feature isn't truly finished until it's been released to PyPI.

All of my projects are configured the same way: they use GitHub releases to trigger a GitHub Actions workflow which publishes the new release to PyPI. The sqlite-utils workflow for that is here in publish.yml.

My cookiecutter templates for new projects set up this workflow for me. I just need to create a PyPI token for the project and assign it as a repository secret. See the python-lib cookiecutter README for details.

To push out a new release, I need to increment the version number in setup.py and write the release notes.

I use semantic versioning - a new feature is a minor version bump, a breaking change is a major version bump (I try very hard to avoid these) and a bug fix or documentation-only update is a patch increment.

Since create-database was a new feature, it went out in release 3.21.
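The versioning rule can be sketched as a small function. This is an illustration of semantic versioning in general, not a tool I use: bump_version() is a hypothetical name, and it always returns three components whereas my release numbers drop a trailing ".0".

```python
def bump_version(version, change):
    """Return the next version for a 'major', 'minor' or 'patch' change."""
    parts = [int(p) for p in version.split(".")]
    # Pad short versions such as "3.20" out to three components
    while len(parts) < 3:
        parts.append(0)
    major, minor, patch = parts[:3]
    if change == "major":
        # Breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":
        # New feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        # Bug fix or documentation-only update
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change}")


print(bump_version("3.20", "minor"))
```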

My projects that use Sphinx for documentation have changelog.rst files in their repositories. I add the release notes there, linking to the relevant issues and cross-referencing the new documentation. Then I ship a commit that bundles the release notes with the bumped version number, with a commit message that looks like this:

git commit -m "Release 3.21

Refs #348, #364, #366, #368, #371, #372, #374, #375, #376, #379"

Here's the commit for release 3.21.

Referencing the issue numbers in the release automatically adds a note to their issue threads indicating the release that they went out in.

I generate that list of issue numbers by pasting the release notes into an Observable notebook I built for the purpose: Extract issue numbers from pasted text. Observable is really great for building this kind of tiny interactive utility.
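The same idea is easy to sketch in Python. This is not the notebook's implementation, just an approximation that assumes issues are written as #N in the pasted text. The function names and the sample release notes are made up for illustration.

```python
import re


def extract_issue_numbers(text):
    """Return the sorted, de-duplicated issue numbers written as #N in text."""
    numbers = {int(n) for n in re.findall(r"#(\d+)", text)}
    return sorted(numbers)


def refs_line(text):
    """Build a 'Refs #1, #2' line suitable for a commit message."""
    return "Refs " + ", ".join(f"#{n}" for n in extract_issue_numbers(text))


notes = "Added create-database command (#348). Fixed a bug (#364), see also #348."
print(refs_line(notes))
```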

For projects that just have a README I write the release notes in Markdown and paste them directly into the GitHub "new release" form.

I like to duplicate the release notes to GitHub releases for my Sphinx changelog projects too. This is mainly so the datasette.io website will display the release notes on its homepage, which is populated at build time using the GitHub GraphQL API.

To convert my reStructuredText to Markdown I copy and paste the rendered HTML into this brilliant Paste to Markdown tool by Euan Goddard.

A live demo

When possible, I like to have a live demo that I can link to.

This is easiest for features in Datasette core. Datasette’s main branch gets deployed automatically to latest.datasette.io so I can often link to a demo there.

For Datasette plugins, I’ll deploy a fresh instance with the plugin (e.g. this one for datasette-graphql) or (more commonly) add it to my big latest-with-plugins.datasette.io instance - which tries to demonstrate what happens to Datasette if you install dozens of plugins at once (so far it works OK).

Here’s a demo of the datasette-copyable plugin running there: https://latest-with-plugins.datasette.io/github/commits.copyable

Tell the world about it

The last step is to tell the world (beyond the people who meticulously read the release notes) about the new feature.

Depending on the size of the feature, I might do this with a tweet like this one - usually with a screenshot and a link to the documentation. I often extend this into a short Twitter thread, which gives me a chance to link to related concepts and demos or add more screenshots.

For larger or more interesting features I'll blog about them. I may save this for my weekly weeknotes, but sometimes for particularly exciting features I'll write up a dedicated blog entry. Some examples include:

I may even assemble a full set of annotated release notes on my blog, where I quote each item from the release in turn and provide some fleshed out examples plus background information on why I built it.

If it’s a new Datasette (or Datasette-adjacent) feature, I’ll try to remember to write about it in the next edition of the Datasette Newsletter.

Finally, if I learned a new trick while building a feature I might extract that into a TIL. If I do that I'll link to the new TIL from the issue thread.

More examples of this pattern

Here are a bunch of examples of commits that implement this pattern, combining the tests, implementation and documentation into a single unit:

sqlite-utils: adding --limit and --offset to sqlite-utils rows
sqlite-utils: --where and -p options for sqlite-utils convert
s3-credentials: s3-credentials policy command
datasette: db.execute_write_script() and db.execute_write_many()
datasette: ?_nosuggest=1 parameter for table views
datasette-graphql: GraphQL execution limits: time_limit_ms and num_queries_limit

Tags: pytest, software-engineering, git, github, unittests, black, read-the-docs, github-issues

Goodbye Zeit Now v1, hello datasette-publish-now - and talking to myself in GitHub issues

2020-04-08T03:32:24+00:00

This week I’ve been mostly dealing with the finally announced shutdown of Zeit Now v1. And having long-winded conversations with myself in GitHub issues.

How Zeit Now inspired Datasette

I first started experimenting with Zeit’s serverless Now hosting platform back in October 2017, when I used it to deploy json-head.now.sh - an updated version of an API tool I originally built for Google App Engine in July 2008.

I liked Zeit Now, a lot. Instant, inexpensive deploys of any stateless project that could be defined using a Dockerfile? Just type now to deploy the project in your current directory? Every deployment gets its own permanent URL? Amazing!

There was just one catch: since Now deployments are ephemeral, applications running on them need to be stateless. If you want a database, you need to involve another (potentially costly) service. It's a limitation shared by other scalable hosting solutions - Heroku, App Engine and so on. How much interesting stuff can you build without a database?

I was musing about this in the shower one day (that old cliche really happened for me) when I had a thought: sure, you can't write to a database... but if your data is read-only, why not bundle the database alongside the application code as part of the Docker image?

Ever since I helped launch the Datablog at the Guardian back in 2009 I had been interested in finding better ways to publish data journalism datasets than CSV files or Google spreadsheets - so building something that could package and bundle read-only data was of extreme interest to me.

In November 2017 I released the first version of Datasette. The original idea was very much inspired by Zeit Now.

I gave a talk about Datasette at the Zeit Day conference in San Francisco in April 2018. Suffice to say I was a huge fan!

Goodbye, Zeit Now v1

In November 2018, Zeit announced Now v2. And it was, different.

v2 is an entirely different architecture from v1. Where v1 built on Docker containers, v2 is built on top of serverless functions - AWS Lambda in particular.

I can see why Zeit did this. Lambda functions can launch from cold way faster - v1's Docker infrastructure had tough cold-start times. They are much cheaper to run as well - crucial for Zeit given their extremely generous pricing plans.

But it was bad news for my projects. Lambdas are tightly size constrained, which is tough when you're bundling potentially large SQLite database files with your deployments.

More importantly, in 2018 Amazon were deliberately excluding the Python sqlite3 standard library module from the Python Lambda environment! I guess they hadn't considered people who might want to work with read-only database files.

So Datasette on Now v2 just wasn't going to work. Zeit kept v1 supported for the time being, but the writing was clearly on the wall.

In April 2019 Google announced Cloud Run, a serverless, scale-to-zero hosting environment based around Docker containers. In many ways it's Google's version of Zeit Now v1 - it has many of the characteristics I loved about v1, albeit with a clunkier developer experience and much more friction in assigning nice URLs to projects. Romain Primet contributed Cloud Run support to Datasette and it has since become my preferred hosting target for my new projects (see Deploying a data API using GitHub Actions and Cloud Run).

Last week, Zeit finally announced the sunset date for v1. From 1st of May new deploys won't be allowed, and on the 7th of August they'll be turning off the old v1 infrastructure and deleting all existing Now v1 deployments.

I engaged in an extensive Twitter conversation about this, where I praised Zeit's handling of the shutdown while bemoaning the loss of the v1 product I had loved so much.

Migrating my projects

My newer projects have been on Cloud Run for quite some time, but I still have a bunch of old projects that I care about and want to keep running past the v1 shutdown.

The first project I ported was latest.datasette.io, a live demo of Datasette which updates with the latest code any time I push to the Datasette master branch on GitHub.

Any time I do some kind of ops task like this I've gotten into the habit of meticulously documenting every single step in comments on a GitHub issue. Here's the issue for porting latest.datasette.io to Cloud Run (and switching from Circle CI to GitHub Actions at the same time).

My next project was global-power-plants-datasette, a small project which takes a database of global power plants published by the World Resources Institute and publishes it using Datasette. It checks for new updates to their repo once a day. I originally built it as a demo for datasette-cluster-map, since it's fun seeing 33,000 power plants on a single map. Here's that issue.

Having warmed up with these two, my next target was the most significant: porting my Niche Museums website.

Niche Museums is the most heavily customized Datasette instance I've run anywhere - it incorporates custom templates, CSS and plugins.

Here's the tracking issue for porting it to Cloud Run. I ran into a few hurdles with DNS and TLS certificates, and I had to do some additional work to ensure niche-museums.com redirects to www.niche-museums.com, but it's now fully migrated.

Hello, Zeit Now v2

In complaining about the lack of that essential sqlite3 module I figured it would be responsible to double-check and make sure that was still true.

It was not! Today Now's Python environment includes sqlite3 after all.

Datasette's publish_subcommand() plugin hook lets plugins add new publishing targets to the datasette publish command (I used it to build datasette-publish-fly last month). How hard would it be to build a plugin for Zeit Now v2?

I fired up a new lengthy talking-to-myself GitHub issue and started prototyping.

Now v2 may not support Docker, but it does support the ASGI Python standard (the asynchronous alternative to WSGI, shepherded by Andrew Godwin).

Zeit are keen proponents of the Jamstack approach, where websites are built using static pre-rendered HTML and JavaScript that calls out to APIs for dynamic data. v2 deployments are expected to consist of static HTML with "serverless functions" - standalone server-side scripts that live in an api/ directory by convention and are compiled into separate lambdas.

Datasette works just fine without JavaScript, which means it needs to handle all of the URL routes for a site. Essentially I need to build a single function that runs the whole of Datasette, then route all incoming traffic to it.

It took me a while to figure it out, but it turns out the Now v2 recipe for that is a now.json file that looks like this:

{
    "version": 2,
    "builds": [
        {
            "src": "index.py",
            "use": "@now/python"
        }
    ],
    "routes": [
        {
            "src": "(.*)",
            "dest": "index.py"
        }
    ]
}

Thanks Aaron Boodman for the tip.

Given the above configuration, Zeit will install any Python dependencies in a requirements.txt file, then treat an app variable in the index.py file as an ASGI application it should route all incoming traffic to. Exactly what I need to deploy Datasette!
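For readers who haven't seen ASGI before, here is a minimal sketch of what an app variable of that kind looks like. This is a toy application, not Datasette and not the code the plugin generates. The call_app() helper is a hypothetical test harness that drives the app once without any server.

```python
import asyncio


async def app(scope, receive, send):
    """A minimal ASGI application: answer every HTTP request with plain text."""
    if scope["type"] != "http":
        return
    body = f"You requested {scope['path']}".encode("utf-8")
    # An ASGI HTTP response is sent as a start message followed by a body message
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_app(path):
    """Call the app directly and collect the messages it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(call_app("/polar-bears"))
print(messages[0]["status"], messages[1]["body"])
```

Because the app receives the requested path in its scope, a single function like this can answer every URL on the site, which is what the catch-all route in now.json relies on.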

This was everything I needed to build the new plugin. datasette-publish-now is the result.

Here's the generated source code for a project deployed using the plugin, showing how the underlying ASGI application is configured.

It's currently an alpha - not every feature is supported (see this milestone) and it relies on a minor deprecated feature (which I've implored Zeit to reconsider) but it's already full-featured enough that I can start using it to upgrade some of my smaller existing Now projects.

The first I upgraded is one of my favourites: polar-bears.now.sh, which visualizes tracking data from polar bear ear tags (using datasette-cluster-map) that was published by the USGS Alaska Science Center, Polar Bear Research Program.

Here's the command I used to deploy the site:

$ pip install datasette-publish-now
$ datasette publish now2 polar-bears.db \
    --title "Polar Bear Ear Tags, 2009-2011" \
    --source "USGS Alaska Science Center, Polar Bear Research Program" \
    --source_url "https://alaska.usgs.gov/products/data.php?dataid=130" \
    --install datasette-cluster-map \
    --project=polar-bears

I exported a full list of my Now v1 projects from their handy active v1 instances page.

The rest of my projects

I scraped the page using the following JavaScript, constructed with the help of the instant evaluation console feature in Firefox 75:

console.log(
  JSON.stringify(
    // Take every row of the second table on the page
    Array.from(
      document.getElementsByTagName("table")[1]
        .getElementsByTagName("tr")
    )
      // Skip the header row
      .slice(1)
      // Collect the text of each cell in the row
      .map((tr) =>
        Array.from(tr.getElementsByTagName("td")).map(
          (td) => td.innerText
        )
      )
  )
);

Then I loaded them into Datasette for analysis.

After filtering out the datasette-latest-commithash.now.sh projects I had deployed for every push to GitHub it turns out I have 34 distinct projects running there.

I won't port all of them, but given datasette-publish-now I should be able to port the ones that I care about without too much trouble.

Debugging Datasette with git bisect run

I fixed two bugs in Datasette this week using git bisect run - a tool I've been meaning to figure out for years, which lets you run an automated binary search against a commit log to find the source of a bug.

Since I was figuring out a new tool, I fired up another GitHub issue self-conversation: in issue #716 I document my process of both learning to use git bisect run and using it to find a solution to that particular bug.

It worked great, so I used the same trick on issue 689 as well.

Watching git bisect run churn through 32 revisions in a few seconds and pinpoint the exact moment a bug was introduced is pretty delightful:

$ git bisect start master 0.34
Bisecting: 32 revisions left to test after this (roughly 5 steps)
[dc80e779a2e708b2685fc641df99e6aae9ad6f97] Handle scope path if it is a string
$ git bisect run python check_templates_considered.py
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[7c6a9c35299f251f9abfb03fd8e85143e4361709] Better tests for prepare_connection() plugin hook, refs #678
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[0091dfe3e5a3db94af8881038d3f1b8312bb857d] More reliable tie-break ordering for facet results
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[ce12244037b60ba0202c814871218c1dab38d729] Release notes for 0.35
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 1 revision left to test after this (roughly 1 step)
[70b915fb4bc214f9d064179f87671f8a378aa127] Datasette.render_template() method, closes #577
running python check_templates_considered.py
Traceback (most recent call last):
...
AssertionError
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[286ed286b68793532c2a38436a08343b45cfbc91] geojson-to-sqlite
running python check_templates_considered.py
70b915fb4bc214f9d064179f87671f8a378aa127 is the first bad commit
commit 70b915fb4bc214f9d064179f87671f8a378aa127
Author: Simon Willison
Date:   Tue Feb 4 12:26:17 2020 -0800

    Datasette.render_template() method, closes #577

    Pull request #664.

:040000 040000 def9e31252e056845609de36c66d4320dd0c47f8 da19b7f8c26d50a4c05e5a7f05220b968429725c M	datasette
bisect run success
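git bisect run decides whether each commit is good or bad from the exit code of the script: 0 means good, 1 to 127 (apart from 125) means bad, and 125 means skip this commit. An uncaught AssertionError makes Python exit with code 1, which is why the tracebacks above count as "bad". Here is a sketch of a check script that makes that mapping explicit. The function names and the check itself are hypothetical placeholders, not the contents of check_templates_considered.py.

```python
def bisect_exit_code(check):
    """Run check() and translate the outcome into a git bisect run exit code."""
    try:
        check()
    except AssertionError:
        return 1    # any code from 1 to 127 except 125 marks the commit as bad
    except Exception:
        return 125  # 125 asks git bisect to skip a commit that cannot be tested
    return 0        # 0 marks the commit as good


def check_example():
    # Hypothetical stand-in for the real assertion about the bug
    assert sorted([3, 1, 2]) == [1, 2, 3]


print(bisect_exit_code(check_example))
```

A real script would finish with sys.exit(bisect_exit_code(check_example)). Treating unexpected exceptions as "skip" is a design choice: it stops an import error in an old commit from being mistaken for the bug being hunted.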

Supporting metadata.yaml

The other Datasette project I completed this week is a relatively small feature with hopefully a big impact: you can now use YAML for Datasette's metadata configuration as an alternative to JSON.

I'm not crazy about YAML: I still don't feel like I've mastered it, and I've been tracking it for 18 years! But it has one big advantage over JSON for configuration files: robust support for multi-line strings.

Datasette's metadata file can include lengthy SQL statements and strings of HTML, both of which benefit from multi-line strings.
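To see why this matters, here is a small sketch of what happens to a multi-line SQL query when it is stored in JSON. The database name, query name and SQL are made up for illustration.

```python
import json

sql = """select name, count(*) as n
from ads
group by name
order by n desc"""

# Canned queries live under databases -> (database) -> queries -> (name) -> sql
metadata = {"databases": {"russian-ads": {"queries": {"top_names": {"sql": sql}}}}}

encoded = json.dumps(metadata, indent=2)
print(encoded)
```

In the JSON output the whole query is squashed onto a single line with \n escape sequences, which makes it hard to read and edit by hand. A YAML block scalar can keep the same query across several lines as written.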

I first used YAML for metadata for my Analyzing US Election Russian Facebook Ads project. The metadata file for that demonstrates both embedded HTML and embedded SQL - and an accompanying build_metadata.py script converted it to JSON at build time. I've since used the same trick for a number of other projects.

The next release of Datasette (hopefully within a week) will ship the new feature, at which point those conversion scripts won't be necessary.

This should work particularly well with the forthcoming ability for a canned query to write to a database. Getting that wrapped up and shipped will be my focus for the next few days.

Tags: datasette, yaml, zeit-now, projects, github, git, weeknotes, github-issues

github-trending-repos

2018-02-23T17:36:41+00:00

github-trending-repos

This is a really clever hack: Vitaliy Potapov built a system for subscribing to a weekly digest of trending GitHub repos in your favourite languages entirely on top of the existing GitHub issues notification system. Find the issue for your particular language and hit “subscribe” and you’ll get an email (or push notification depending on how you get your issue notifications) once a week with the latest trends. The implementation is a 220 line Node.js script which runs on a daily and weekly schedule using Circle CI, so Vitaliy doesn’t even have to host or pay for any of the underlying infrastructure. It’s brilliant.

Via Show HN

Tags: nodejs, github, github-issues