Simon Willison’s Weblog

How I build a feature

I’m maintaining a lot of different projects at the moment. I thought it would be useful to describe the process I use for adding a new feature to one of them, using the new sqlite-utils create-database command as an example.

I like each feature to be represented by what I consider to be the perfect commit—one that bundles together the implementation, the tests, the documentation and a link to an external issue thread.

The sqlite-utils create-database command is very simple: it creates a new, empty SQLite database file. You use it like this:

% sqlite-utils create-database empty.db

Everything starts with an issue

Every piece of work I do has an associated issue. This acts as ongoing work-in-progress notes and lets me record decisions, reference any research, drop in code snippets and sometimes even add screenshots and video—stuff that is really helpful but doesn’t necessarily fit in code comments or commit messages.

Even if it’s a tiny improvement that’s only a few lines of code, I’ll still open an issue for it—sometimes just a few minutes before closing it again as complete.

Any commits that I create that relate to an issue reference the issue number in their commit message. GitHub does a great job of automatically linking these together bidirectionally, so I can navigate from the commit to the issue or from the issue to the commit.

Having an issue also gives me something I can link to from my release notes.

In the case of the create-database command, I opened this issue in November when I had the idea for the feature.

I didn’t do the work until over a month later—but because I had designed the feature in the issue comments I could get started on the implementation really quickly.

Development environment

Being able to quickly spin up a development environment for a project is crucial. All of my projects have a section in the README or the documentation describing how to do this—here’s that section for sqlite-utils.

On my own laptop each project gets a directory, and I use pipenv shell in that directory to activate a directory-specific virtual environment, then pip install -e '.[test]' to install the dependencies and test dependencies.

Automated tests

All of my features are accompanied by automated tests. This gives me the confidence to boldly make changes to the software in the future without fear of breaking any existing features.

This means that writing tests needs to be as quick and easy as possible—the less friction here the better.

The best way to make writing tests easy is to have a great testing framework in place from the very beginning of the project. My cookiecutter templates (python-lib, datasette-plugin and click-app) all configure pytest and add a tests/ folder with a single passing test, to give me something to start adding tests to.

I can’t say enough good things about pytest. Before I adopted it, writing tests was a chore. Now it’s an activity I genuinely look forward to!

I’m not a religious adherent to writing the tests first—see How to cheat at unit tests with pytest and Black for more thoughts on that—but I’ll write the test first if it’s pragmatic to do so.

In the case of create-database, writing the test first felt like the right thing to do. Here’s the test I started with:

def test_create_database(tmpdir):
    db_path = tmpdir / "test.db"
    assert not db_path.exists()
    result = CliRunner().invoke(
        cli.cli, ["create-database", str(db_path)]
    )
    assert result.exit_code == 0
    assert db_path.exists()

This test uses the tmpdir pytest fixture to provide a temporary directory that will be automatically cleaned up by pytest after the test run finishes.

It checks that the test.db file doesn’t exist yet, then uses the Click framework’s CliRunner utility to execute the create-database command. Then it checks that the command didn’t throw an error and that the file has been created.

Then I run the test and watch it fail, because I haven’t built the feature yet!

% pytest -k test_create_database

============ test session starts ============
platform darwin -- Python 3.8.2, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/simon/Dropbox/Development/sqlite-utils
plugins: cov-2.12.1, hypothesis-6.14.5
collected 808 items / 807 deselected / 1 selected                           

tests/test_cli.py F                                                   [100%]

================= FAILURES ==================
___________ test_create_database ____________

tmpdir = local('/private/var/folders/wr/hn3206rs1yzgq3r49bz8nvnh0000gn/T/pytest-of-simon/pytest-659/test_create_database0')

    def test_create_database(tmpdir):
        db_path = tmpdir / "test.db"
        assert not db_path.exists()
        result = CliRunner().invoke(
            cli.cli, ["create-database", str(db_path)]
        )
>       assert result.exit_code == 0
E       assert 1 == 0
E        +  where 1 = <Result SystemExit(1)>.exit_code

tests/test_cli.py:2097: AssertionError
========== short test summary info ==========
FAILED tests/test_cli.py::test_create_database - assert 1 == 0
===== 1 failed, 807 deselected in 0.99s ====

The -k option lets me run only the tests that match the search string, rather than running the full test suite. I use this all the time.

Other pytest features I often use:

  • pytest -x: runs the entire test suite but quits at the first test that fails
  • pytest --lf: re-runs any tests that failed during the last test run
  • pytest --pdb -x: opens the Python debugger at the first failed test (omit the -x to open it at every failed test). This is the main way I interact with the Python debugger. I often use this to help write the tests, since I can add assert False and get a shell inside the test to interact with various objects and figure out how to best run assertions against them.

Implementing the feature

Test in place, it’s time to implement the command. I added this code to my existing cli.py module:

@cli.command(name="create-database")
@click.argument(
    "path",
    type=click.Path(file_okay=True, dir_okay=False, allow_dash=False),
    required=True,
)
def create_database(path):
    "Create a new empty database file."
    db = sqlite_utils.Database(path)
    db.vacuum()

(I happen to know that the quickest way to create an empty SQLite database file is to run VACUUM against it.)
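The VACUUM trick can be tried outside of sqlite-utils using only Python's standard library sqlite3 module. This is a minimal sketch, not the sqlite-utils implementation: the create_empty_database() helper and the empty.db file name are made up for illustration.

```python
import os
import sqlite3
import tempfile


def create_empty_database(path):
    """Create an empty SQLite database file at path by running VACUUM."""
    # isolation_level=None puts the connection in autocommit mode,
    # because VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "empty.db")  # hypothetical file name
    create_empty_database(db_path)

    # A valid SQLite file starts with a fixed 16-byte header string.
    with open(db_path, "rb") as f:
        header = f.read(16)

    # The new database should contain no tables at all.
    conn = sqlite3.connect(db_path)
    tables = conn.execute("select name from sqlite_master").fetchall()
    conn.close()

print(header, tables)
```

Checking the header is a slightly stronger assertion than checking that the file exists, since it confirms the file is a real SQLite database rather than a zero-byte placeholder.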

The test now passes!

I iterated on this implementation a little bit more, to add the --enable-wal option I had designed in the issue comments—and updated the test to match. You can see the final implementation in this commit: 1d64cd2e5b402ff957f9be2d9bb490d313c73989.

If I add a new test and it passes the first time, I’m always suspicious of it. I’ll deliberately break the test (change a 1 to a 2 for example) and run it again to make sure it fails, then change it back again.

Code formatting with Black

Black has increased my productivity as a Python developer by a material amount. I used to spend a whole bunch of brain cycles agonizing over how to indent my code, where to break up long function calls and suchlike. Thanks to Black I never think about this at all—I instinctively run black . in the root of my project and accept whatever style decisions it applies for me.

Linting

I have a few linters set up to run on every commit. I can run these locally too—how to do that is documented here—but I’m often a bit lazy and leave them to run in CI.

In this case one of my linters failed! I accidentally called the new command function create_table() when it should have been called create_database(). The code worked fine due to how the cli.command(name=...) decorator works but mypy complained about the redefined function name. I fixed that in a separate commit.
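Why the duplicate function name still worked can be illustrated with a toy registry. This is only a sketch: the command() decorator and commands dictionary below are hypothetical stand-ins, not Click's actual internals. The point is that each function is registered under the explicit name passed to the decorator, so reusing the Python function name does not overwrite the earlier registration.

```python
# Hypothetical stand-in for a Click-style command registry.
commands = {}


def command(name):
    """Register the decorated function under an explicit command name."""
    def decorator(fn):
        commands[name] = fn
        return fn
    return decorator


@command(name="create-table")
def create_table():
    return "created a table"


# The mistake: the Python name repeats, but the command name is different.
@command(name="create-database")
def create_table():  # noqa: F811 (this redefinition is what a linter flags)
    return "created a database"


# Both commands are still reachable through the registry...
results = {name: fn() for name, fn in commands.items()}

# ...although the module-level name now refers only to the second function.
module_level = create_table()

print(results, module_level)
```

Nothing fails at runtime here, which is why this kind of slip is easy to miss without a linter or type checker.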

Documentation

My policy these days is that if a feature isn’t documented it doesn’t exist. Adding to the documentation isn’t much work at all if the documentation already exists, and over time these incremental improvements add up to something really comprehensive.

For smaller projects I use a single README.md which gets displayed on both GitHub and PyPI (and the Datasette website too, for example on datasette.io/tools/git-history).

My larger projects, such as Datasette and sqlite-utils, use Read the Docs and reStructuredText with Sphinx instead.

I like reStructuredText mainly because it has really good support for internal reference links—something that is missing from Markdown, though it can be enabled using MyST.

sqlite-utils uses Sphinx. I have the sphinx-autobuild extension configured, which means I can run a live reloading server with the documentation like so:

cd docs
make livehtml

Any time I’m working on the documentation I have that server running, so I can hit “save” in VS Code and see a preview in my browser a few seconds later.

For Markdown documentation I use the VS Code preview pane directly.

The moment the documentation is live online, I like to add a link to it in a comment on the issue thread.

Committing the change

I run git diff a LOT while hacking on code, to make sure I haven’t accidentally changed something unrelated. This also helps spot things like rogue print() debug statements I may have added.

Before my final commit, I sometimes even run git diff | grep print to check for those.
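A plain grep for print also matches removed lines and unrelated words. A slightly stricter variation on the same idea, sketched here with a hypothetical added_print_lines() helper and a made-up sample diff, only reports lines the diff adds that look like a print() call:

```python
import re


def added_print_lines(diff_text):
    """Return added lines in a unified diff that appear to call print()."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" marks the file header instead.
        if line.startswith("+") and not line.startswith("+++"):
            if re.search(r"\bprint\(", line):
                hits.append(line[1:].strip())
    return hits


# Made-up diff for illustration only.
sample_diff = """\
--- a/cli.py
+++ b/cli.py
@@ -1,3 +1,5 @@
 def create_database(path):
+    print("debug:", path)
     db = sqlite_utils.Database(path)
-    print("old debug line")
+    db.vacuum()
"""

found = added_print_lines(sample_diff)
print(found)
```

To run it against a real working directory, the diff text could come from subprocess.run(["git", "diff"], capture_output=True, text=True).stdout.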

My goal with the commit is to bundle the test, documentation and implementation. If those are the only files I’ve changed I do this:

git commit -a -m "sqlite-utils create-database command, closes #348"

If this completes the work on the issue I use "closes #N", which causes GitHub to close the issue for me. If it’s not yet ready to close I use "refs #N" instead.

Sometimes there will be unrelated changes in my working directory. If so, I use git add <files> to stage only the relevant files and then commit with git commit -m message.

Branches and pull requests

create-database is a good example of a feature that can be implemented in a single commit, with no need to work in a branch.

For larger features, I’ll work in a feature branch:

git checkout -b my-feature

I’ll make a commit (often just labelled “WIP prototype, refs #N”) and then push that to GitHub and open a pull request for it:

git push -u origin my-feature 

I ensure the new pull request links back to the issue in its description, then switch my ongoing commentary to comments on the pull request itself.

I’ll sometimes add a task checklist to the opening comment on the pull request, since tasks there get reflected in the GitHub UI anywhere that links to the PR. Then I’ll check those off as I complete them.

An example of a PR I used like this is #361: --lines and --text and --convert and --import.

I don’t like merge commits—I much prefer to keep my main branch history as linear as possible. I usually merge my PRs through the GitHub web interface using the squash feature, which results in a single, clean commit to main with the combined tests, documentation and implementation. Occasionally I will see value in keeping the individual commits, in which case I will rebase merge them.

Another goal here is to keep the main branch releasable at all times. Incomplete work should stay in a branch. This makes turning around and releasing quick bug fixes a lot less stressful!

Release notes, and a release

A feature isn’t truly finished until it’s been released to PyPI.

All of my projects are configured the same way: they use GitHub releases to trigger a GitHub Actions workflow which publishes the new release to PyPI. The sqlite-utils workflow for that is here in publish.yml.

My cookiecutter templates for new projects set up this workflow for me. I just need to create a PyPI token for the project and assign it as a repository secret. See the python-lib cookiecutter README for details.

To push out a new release, I need to increment the version number in setup.py and write the release notes.

I use semantic versioning—a new feature is a minor version bump, a breaking change is a major version bump (I try very hard to avoid these) and a bug fix or documentation-only update is a patch increment.
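Those bump rules are mechanical enough to sketch in a few lines. The bump_version() helper below is hypothetical and not part of any of these projects. It assumes versions are dot-separated integers and that a trailing ".0" patch component is left off, matching the "3.21" style of version number.

```python
def bump_version(version, change):
    """Return the next version string for a given kind of change.

    change is "major" (breaking change), "minor" (new feature)
    or "patch" (bug fix or documentation-only update).
    """
    parts = [int(p) for p in version.split(".")]
    # Pad short versions such as "3.20" out to major, minor, patch.
    major, minor, patch = (parts + [0, 0, 0])[:3]
    if change == "major":
        return f"{major + 1}.0"
    if change == "minor":
        return f"{major}.{minor + 1}"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


# Illustrative inputs only.
after_feature = bump_version("3.20", "minor")
after_bugfix = bump_version("3.21", "patch")
after_breaking = bump_version("3.21.1", "major")
print(after_feature, after_bugfix, after_breaking)
```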

Since create-database was a new feature, it went out in release 3.21.

My projects that use Sphinx for documentation have changelog.rst files in their repositories. I add the release notes there, linking to the relevant issues and cross-referencing the new documentation. Then I ship a commit that bundles the release notes with the bumped version number, with a commit message that looks like this:

git commit -m "Release 3.21

Refs #348, #364, #366, #368, #371, #372, #374, #375, #376, #379"

Here’s the commit for release 3.21.

Referencing the issue numbers in the release automatically adds a note to their issue threads indicating the release that they went out in.

I generate that list of issue numbers by pasting the release notes into an Observable notebook I built for the purpose: Extract issue numbers from pasted text. Observable is really great for building this kind of tiny interactive utility.
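The same extraction is easy to approximate in Python. This is a rough sketch of the idea, not the code from the Observable notebook: the extract_issue_refs() helper and the sample release notes are invented for illustration, and it assumes issues appear in the text as a # followed by digits.

```python
import re


def extract_issue_refs(text):
    """Build a 'Refs #1, #2' line from the unique issue numbers in text."""
    # A set removes duplicates; sorting gives a stable, readable order.
    numbers = sorted({int(n) for n in re.findall(r"#(\d+)", text)})
    if not numbers:
        return ""
    return "Refs " + ", ".join(f"#{n}" for n in numbers)


# Invented release notes for illustration only.
release_notes = """
- New create-database command. (#348)
- A made-up second item that mentions #364 and then #348 again.
"""

refs_line = extract_issue_refs(release_notes)
print(refs_line)
```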

For projects that just have a README I write the release notes in Markdown and paste them directly into the GitHub “new release” form.

I like to duplicate the release notes to GitHub releases for my Sphinx changelog projects too. This is mainly so the datasette.io website will display the release notes on its homepage, which is populated at build time using the GitHub GraphQL API.

To convert my reStructuredText to Markdown I copy and paste the rendered HTML into this brilliant Paste to Markdown tool by Euan Goddard.

A live demo

When possible, I like to have a live demo that I can link to.

This is easiest for features in Datasette core. Datasette’s main branch gets deployed automatically to latest.datasette.io so I can often link to a demo there.

For Datasette plugins, I’ll deploy a fresh instance with the plugin (e.g. this one for datasette-graphql) or (more commonly) add it to my big latest-with-plugins.datasette.io instance—which tries to demonstrate what happens to Datasette if you install dozens of plugins at once (so far it works OK).

Here’s a demo of the datasette-copyable plugin running there: https://latest-with-plugins.datasette.io/github/commits.copyable

Tell the world about it

The last step is to tell the world (beyond the people who meticulously read the release notes) about the new feature.

Depending on the size of the feature, I might do this with a tweet like this one—usually with a screenshot and a link to the documentation. I often extend this into a short Twitter thread, which gives me a chance to link to related concepts and demos or add more screenshots.

For larger or more interesting features I’ll blog about them. I may save this for my weekly weeknotes, but sometimes for particularly exciting features I’ll write up a dedicated blog entry. Some examples include:

I may even assemble a full set of annotated release notes on my blog, where I quote each item from the release in turn and provide some fleshed out examples plus background information on why I built it.

If it’s a new Datasette (or Datasette-adjacent) feature, I’ll try to remember to write about it in the next edition of the Datasette Newsletter.

Finally, if I learned a new trick while building a feature I might extract that into a TIL. If I do that I’ll link to the new TIL from the issue thread.

More examples of this pattern

Here are a bunch of examples of commits that implement this pattern, combining the tests, implementation and documentation into a single unit: