Simon Willison's Weblog: documentation

simonw/docs cookiecutter template

2024-09-23T21:45:15+00:00

Over the last few years I’ve settled on the combination of Sphinx, the Furo theme and the myst-parser extension (enabling Markdown in place of reStructuredText) as my documentation toolkit of choice, maintained in GitHub and hosted using ReadTheDocs.

My LLM and shot-scraper projects are two examples of that stack in action.

Today I wanted to spin up a new documentation site so I finally took the time to construct a cookiecutter template for my preferred configuration. You can use it like this:

pipx install cookiecutter
cookiecutter gh:simonw/docs

Or with uv:

uv tool run cookiecutter gh:simonw/docs

Answer a few questions:

[1/3] project (): shot-scraper
[2/3] author (): Simon Willison
[3/3] docs_directory (docs):

And it creates a docs/ directory ready for you to start editing docs:

cd docs
pip install -r requirements.txt
make livehtml

Tags: uv, markdown, sphinx-docs, cookiecutter, read-the-docs, python, projects, documentation

Quoting Nicolas Bouliane

2024-01-05T22:32:16+00:00

If you learn something the hard way, share your findings with others. You have blazed a new trail; now you must mark it for your fellow travellers. Sharing knowledge is an unreasonably effective way of helping others.

— Nicolas Bouliane

Tags: documentation

PostgreSQL Lock Conflicts

2023-08-23T03:08:54+00:00

I absolutely love how extremely specific and niche this documentation site is. It details every single lock that PostgreSQL implements, and shows exactly which commands acquire that lock. That’s everything. I can imagine this becoming absurdly useful at extremely infrequent intervals for advanced PostgreSQL work.

Via Hacker News

Tags: postgresql, documentation

Coping strategies for the serial project hoarder

2022-11-26T15:47:02+00:00

I gave a talk at DjangoCon US 2022 in San Diego last month about productivity on personal projects, titled "Massively increase your productivity on personal projects with comprehensive documentation and automated tests".

The alternative title for the talk was Coping strategies for the serial project hoarder.

I'm maintaining a lot of different projects at the moment. Somewhat unintuitively, the way I'm handling this is by scaling down techniques that I've seen working for large engineering teams spread out across multiple continents.

The key trick is to ensure that every project has comprehensive documentation and automated tests. This scales my productivity horizontally, by freeing me up from needing to remember all of the details of all of the different projects I'm working on at the same time.

You can watch the talk on YouTube (25 minutes). Alternatively, I've included a detailed annotated version of the slides and notes below.

This was the title I originally submitted to the conference. But I realized a better title was probably...

Coping strategies for the serial project hoarder

This video is a neat representation of my approach to personal projects: I always have a few on the go, but I can never resist the temptation to add even more.

My PyPI profile (which is only five years old) lists 185 Python packages that I've released. Technically I'm actively maintaining all of them, in that if someone reports a bug I'll push out a fix. Many of them receive new releases at least once a year.

Aside: I took this screenshot using shot-scraper with a little bit of extra JavaScript to hide a notification bar at the top of the page:

shot-scraper 'https://pypi.org/user/simonw/' \
--javascript "
    document.body.style.paddingTop = 0;
    document.querySelector(
        '#sticky-notifications'
    ).style.display = 'none';
  " --height 1000

How can one individual maintain 185 projects?

Surprisingly, I'm using techniques that I've scaled down from working at a company with hundreds of engineers.

I spent seven years at Eventbrite, during which time the engineering team grew to span three different continents. We had major engineering centers in San Francisco, Nashville, Mendoza in Argentina and Madrid in Spain.

Consider timezones: engineers in Madrid and engineers in San Francisco had almost no overlap in their working hours. Good asynchronous communication was essential.

Over time, I noticed that the teams that were most effective at this scale were the teams that had a strong culture of documentation and automated testing.

As I started to work on my own array of smaller personal projects, I found that the same discipline that worked for large teams somehow sped me up, when intuitively I would have expected it to slow me down.

I wrote an extended description of this in The Perfect Commit.

I've started structuring the majority of my work in terms of what I think of as "the perfect commit" - a commit that combines implementation, tests, documentation and a link to an issue thread.

As software engineers, it's important to note that our job generally isn't to write new software: it's to make changes to existing software.

As such, the commit is our unit of work. It's worth us paying attention to how we can make our commits as useful as possible.

Here's a recent example from one of my projects, Datasette.

It's a single commit which bundles together the implementation, some related documentation improvements and the tests that show it works. And it links back to an issue thread from the commit message.

Let's talk about each component in turn.

There's not much to be said about the implementation: your commit should change something!

It should only change one thing, but what that actually means varies on a case by case basis.

It should be a single change that can be documented, tested and explained independently of other changes.

(Being able to cleanly revert it is a useful property too.)

The goals of the tests that accompany a commit are to prove that the new implementation works.

If you apply the implementation the new tests should pass. If you revert it the tests should fail.

I often use git stash to try this out.

If you tell people they need to write tests for every single change they'll often push back that this is too much of a burden, and will harm their productivity.

But I find that the incremental cost of adding a test to an existing test suite keeps getting lower over time.

The hard bit of testing is getting a testing framework set up in the first place - with a test runner, and fixtures, and objects under test and suchlike.

Once that's in place, adding new tests becomes really easy.

So my personal rule is that every new project starts with a test. It doesn't really matter what that test does - what matters is that you can run pytest to run the tests, and you have an obvious place to start building more of them.
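
As a minimal sketch of what that first test might look like: the greet function here is a hypothetical stand-in for whatever the new project exposes, defined inline so the example is self-contained.

```python
# test_greet.py - a deliberately tiny first test.
# `greet` is a hypothetical placeholder, not part of any real project.


def greet(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}"


def test_greet():
    # pytest discovers this function because its name starts with test_.
    assert greet("world") == "Hello, world"
```

Running pytest in the directory containing this file should collect and run the one test, which is enough to give later tests an obvious home.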

I maintain three cookiecutter templates to help with this, for the three kinds of projects I most frequently create:

simonw/python-lib for Python libraries
simonw/click-app for command line tools
simonw/datasette-plugin for Datasette plugins

Each of these templates creates a project with a setup.py file, a README, a test suite and GitHub Actions workflows to run those tests and ship tagged releases to PyPI.

I have a trick for running cookiecutter as part of creating a brand new repository on GitHub. I described that in Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions.

This is a hill that I will die on: your documentation must live in the same repository as your code!

You often see projects keep their documentation somewhere else, like in a wiki.

Inevitably it goes out of date. And my experience is that if your documentation is out of date people will lose trust in it, which means they'll stop reading it and stop contributing to it.

The gold standard of documentation has to be that it's reliably up to date with the code.

The only way you can do that is if the documentation and code are in the same repository.

This gives you versioned snapshots of the documentation that exactly match the code at that time.

More importantly, it means you can enforce it through code review. You can say in a PR "this is great, but don't forget to update this paragraph on this page of the documentation to reflect the change you're making".

If you do this you can finally get documentation that people learn to trust over time.

Another trick I like to use is something I call documentation unit tests.

The idea here is to use unit tests to enforce that concepts introspected from your code are at least mentioned in your documentation.

I wrote more about that in Documentation unit tests.

Here's an example. Datasette has a test that scans through each of the Datasette plugin hooks and checks that there is a heading for each one in the documentation.

The test itself is pretty simple: it uses pytest parametrization to look through every introspected plugin hook name, and for each one checks that it has a matching heading in the documentation.
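
Here's a rough sketch of that pattern, not Datasette's actual test. The hook names and the documentation text are hypothetical and hard-coded to keep the example self-contained; a real project would introspect the names from its code and read the headings from its documentation files.

```python
import re

import pytest

# Hypothetical hook names - a real test would introspect these from the code.
HOOK_NAMES = ["on_startup", "render_value", "extra_routes"]

# Hypothetical documentation - a real test would read this from the docs files.
DOCS = """
on_startup(app)
---------------

render_value(value)
-------------------

extra_routes()
--------------
"""


def documented_hooks(docs: str) -> set[str]:
    """Return names that appear as headings: 'name(args)' over a line of dashes."""
    pattern = re.compile(r"^(\w+)\(.*\)\n-+$", re.MULTILINE)
    return set(pattern.findall(docs))


@pytest.mark.parametrize("hook_name", HOOK_NAMES)
def test_hook_is_documented(hook_name):
    # One test case per hook, so a failure names the undocumented hook.
    assert hook_name in documented_hooks(DOCS)
```

Parametrizing over the names means adding a new hook without a matching heading produces a single, clearly labelled failing test.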

The final component of my perfect commit is this: every commit must link to an issue thread.

I'll usually have these open in advance but sometimes I'll open an issue thread just so I can close it with a commit a few seconds later!

Here's the issue for the commit I showed earlier. It has 11 comments, and every single one of those comments is by me.

I have literally thousands of issues on GitHub that look like this: issue threads that are effectively me talking to myself about the changes that I'm making.

It turns out this is a fantastic form of additional documentation.

What goes in an issue?

Background: the reasons for the change. In six months' time you'll want to know why you did this.
State of play beforehand: embed existing code, link to existing docs. I like to start my issues with "I'm going to change this code right here" - that way if I come back the next day I don't have to repeat that little piece of research.
Links to things! Documentation, inspiration, clues found on StackOverflow. The idea is to capture all of the loose information floating around that topic.
Code snippets illustrating potential designs and false-starts.
Decisions. What did you consider? What did you decide? As programmers we make decisions constantly, all day, about everything. That work doesn't have to be invisible. Writing them down also avoids having to re-litigate them several months later when you've forgotten your original reasoning.
Screenshots - of everything! Animated screenshots even better. I even take screenshots of things like the AWS console to remind me what I did there.
When you close it: a link to the updated documentation and demo

The reason I love issues is that they're a form of documentation that I think of as temporal documentation.

Regular documentation comes with a big commitment: you have to keep it up to date in the future.

Issue comments skip that commitment entirely. They're displayed with a timestamp, in the context of the work you were doing at the time.

No-one will be upset or confused if you fail to keep them updated to match future changes.

So it's a commitment-free form of documentation, which I for one find incredibly liberating.

I think of this approach as issue driven development.

Everything you are doing is issue-first, and from that you drive the rest of the development process.

This is how it relates back to maintaining 185 projects at the same time.

With issue driven development you don't have to remember anything about any of these projects at all.

I've had issues where I did a bunch of design work in issue comments, then dropped it, then came back 12 months later and implemented that design - without having to rethink it.

I've had projects where I forgot that the project existed entirely! But I've found it again, and there's been an open issue, and I've been able to pick up work again.

It's a way of working where you treat every project as if it's going to be maintained by someone else - and the classic cliche applies: that somebody else is you in the future.

It horizontally scales you and lets you tackle way more interesting problems.

Programmers always complain when you interrupt them - there's this idea of "flow state" and that interrupting a programmer for a moment costs them half an hour in getting back up to speed.

This fixes that! It's much easier to get back to what you are doing if you have an issue thread that records where you've got to.

Issue driven development is my key productivity hack for taking on much more ambitious projects in much larger quantities.

Another way to think about this is to compare it to laboratory notebooks.

Here's a page from one by Leonardo da Vinci.

Great scientists and great engineers have always kept detailed notes.

We can use GitHub issues as a really quick and easy way to do the same thing!

Another thing I like to use these for is deep research tasks.

Here's an example, from when I was trying to figure out how to run my Python web application in an AWS Lambda function:

Figure out how to deploy Datasette to AWS Lambda using function URLs and Mangum

This took me 65 comments over the course of a few days... but by the end of that thread I'd figured out how to do it!

Here's the follow-up, with another 77 comments, in which I figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain.

I will never have to figure this out ever again! That's a huge win.

https://github.com/simonw/public-notes is a public repository where I keep some of these issue threads, transferred from my private notes repos using this trick.

The last thing I want to encourage you to do is this: if you do a project, tell people what it is you did!

This counts for both personal and work projects. It's so easy to skip this step.

Once you've shipped a feature or built a project, it's so tempting to skip the step of spending half an hour or more writing about the work you have done.

But you are missing out on so much of the value of your work if you don't give other people a chance to understand what you did.

I wrote more about this here: What to blog about.

For projects with releases, release notes are a really good way to do this.

I like using GitHub releases for this - they're quick and easy to write, and I have automation set up for my projects such that creating release notes in GitHub triggers a build and release to PyPI.

I've done over 1,000 releases in this way. Having them automated is crucial, and having automation makes it really easy to ship releases more often.

Please make sure your release notes have dates on them. I need to know when your change went out, because if it's only a week old it's unlikely people will have upgraded to it yet, whereas a change from five years ago is probably safe to depend on.

I wrote more about writing better release notes here.

This is a mental trick which works really well for me. "No project of mine is finished until I've told people about it in some way" is a really useful habit to form.

Twitter threads are (or were) a great low-effort way to write about a project. Build a quick thread with some links and images, and maybe even a video.

Get a little unit about your project out into the world, and then you can stop thinking about it.

(I'm trying to do this on Mastodon now instead.)

Even better: get a blog! Having your own corner of the internet to write about the work that you are doing is a small investment that will pay off many times over.

("Nobody blogs anymore" I said in the talk... Phil Gyford disagrees with that meme so much that he launched a new blog directory to show how wrong it is.)

The enemy of projects, especially personal projects, is guilt.

The more projects you have, the more guilty you feel about working on any one of them - because you're not working on the others, and those projects haven't yet achieved their goals.

You have to overcome guilt if you're going to work on 185 projects at once!

This is the most important tip: avoid side projects with user accounts.

If you build something that people can sign into, that's not a side-project, it's an unpaid job. It's a very big responsibility: avoid it at all costs!

Almost all of my projects right now are open source things that people can run on their own machines, because that's about as far away from user accounts as I can get.

I still have a responsibility for shipping security updates and things like that, but at least I'm not holding onto other people's data for them.

I feel like if your project is tested and documented, you have nothing to feel guilty about.

You have put a thing out into the world, and it has tests to show that it works, and it has documentation that explains what it is.

This means I can step back and say that it's OK for me to work on other things. That thing there is a unit that makes sense to people.

That's what I tell myself anyway! It's OK to have 185 projects provided they all have documentation and they all have tests.

Do that and the guilt just disappears. You can live guilt free!

You can follow me on Mastodon at @simon@simonwillison.net or on GitHub at github.com/simonw. Or subscribe to my blog at simonwillison.net!

From the Q&A:

You've tweeted about using GitHub Projects. Could you talk about that?
- GitHub Projects V2 is the perfect TODO list for me, because it lets me bring together issues from different repositories. I use a project called "Everything" on a daily basis (it's my browser default window) - I add issues to it that I plan to work on, including personal TODO list items as well as issues from my various public and private repositories. It's kind of like a cross between Trello and Airtable and I absolutely love it.
How did you move notes from the private to the public repo?
- GitHub doesn't let you do this. But there's a trick I use involving a temp repo which I switch between public and private to help transfer notes. More in this TIL.
Question about the perfect commit: do you commit your failing tests?
- I don't: I try to keep the commits that land on my main branch always passing. I'll sometimes write the failing test before the implementation and then commit them together. For larger projects I'll work in a branch and then squash-merge the final result into a perfect commit to main later on.

Tags: djangocon, documentation, productivity, talks, testing, annotated-talks

Automating screenshots for the Datasette documentation using shot-scraper

2022-10-14T23:44:03+00:00

I released shot-scraper back in March as a tool for keeping screenshots in documentation up-to-date.

It's very easy for feature screenshots in documentation for a web application to drift out-of-date with the latest design of the software itself.

shot-scraper is a command-line tool that aims to solve this.

You can use it to take one-off screenshots like this:

shot-scraper https://latest.datasette.io/ --height 800

Or you can define multiple screenshots in a single YAML file - let's call this shots.yml:

- url: https://latest.datasette.io/
  height: 800
  output: index.png
- url: https://latest.datasette.io/fixtures
  height: 800
  output: database.png

And run them all at once like this:

shot-scraper multi shots.yml

This morning I used shot-scraper to replace all of the existing screenshots in the Datasette documentation with up-to-date, automated equivalents.

I decided to use this as an opportunity to create a more detailed tutorial for how to use shot-scraper for this kind of screenshot automation project.

Four screenshots to replace

Datasette's documentation included four screenshots that I wanted to replace with automated equivalents.

full_text_search.png illustrates the full-text search feature:

advanced_export.png displays Datasette's "advanced export" dialog:

binary_data.png displays just a small fragment of a table with binary download links:

facets.png demonstrates faceting against a table:

I'll walk through each screenshot in turn.

full_text_search.png

I decided to use a different example for the new screenshot, because I don't currently have a live instance for that table running against the most recent Datasette release.

I went with https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date - a search against the UK register of members interests for "hamper" (see Exploring the UK Register of Members Interests with SQL and Datasette).

The existing image in the documentation was 960 pixels wide, so I stuck with that and tried a few iterations until I found a height that I liked.

I installed shot-scraper and ran the following, in my /tmp directory:

shot-scraper 'https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date' \
  -h 585 \
  -w 960

This produced a register-of-members-interests-datasettes-com-regmem-items.png file which looked good when I opened it in Preview.

I turned that into the following YAML in my shots.yml file:

- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date
  height: 585
  width: 960
  output: regmem-search.png

Running shot-scraper multi shots.yml against that file produced this regmem-search.png image:

advanced_export.png

This next image isn't a full page screenshot - it's just a small fragment of the page.

shot-scraper can take partial screenshots based on one or more CSS selectors. Given a CSS selector the tool draws a box around just that element and uses that to take the screenshot - adding optional padding.

Here's the recipe for the advanced export box - I used the same register-of-members-interests.datasettes.com example for it as this had enough rows to trigger all of the advanced options to be displayed:

shot-scraper 'https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper' \
  -s '#export' \
  -p 10

The -p 10 here specifies 10px of padding, needed to capture the drop shadow on the box.

Here's the equivalent YAML:

- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper
  selector: "#export"
  output: advanced-export.png
  padding: 10

And the result:

binary_data.png

This screenshot required a different trick.

I wanted to take a screenshot of the table on this page.

The full table looks like this, with three rows:

I only wanted the first two of these to be shown in the screenshot though.

shot-scraper has the ability to execute JavaScript on the page before the screenshot is taken. This can be used to remove elements first.

Here's the JavaScript I came up with to remove all but the first two rows (actually the first three, because the table header counts as a row too):

Array.from(
  document.querySelectorAll('tr:nth-child(n+3)'),
  el => el.parentNode.removeChild(el)
);

I did it this way so that if I add any more rows to that test table in the future the code will still remove everything but the first two.

The CSS selector tr:nth-child(n+3) matches every row that is the third or later child of its parent. With the header row in its own thead, that leaves three rows in total: one header plus two content rows.

Here's how to run that from the command-line, and then take a 10 pixel padded screenshot of just the table on the page after it has been modified by the JavaScript:

shot-scraper 'https://latest.datasette.io/fixtures/binary_data' \
  -j 'Array.from(document.querySelectorAll("tr:nth-child(n+3)"), el => el.parentNode.removeChild(el));' \
  -s table -p 10

The YAML I added to shots.yml:

- url: https://latest.datasette.io/fixtures/binary_data
  selector: table
  javascript: |-
    Array.from(
      document.querySelectorAll('tr:nth-child(n+3)'),
      el => el.parentNode.removeChild(el)
    );
  padding: 10
  output: binary-data.png

And the resulting image:

facets.png

I left the most complex screenshot to last.

For the faceting screenshot, I wanted to include the "suggested facet" links at the top of the page, a set of active facets and then the first three rows of the following table.

But... the table has quite a lot of columns. For a neater screenshot I only wanted to include a subset of columns in the final shot.

Here's the screenshot I ended up taking:

And the YAML recipe:

- url: https://congress-legislators.datasettes.com/legislators/legislator_terms?_facet=type&_facet=party&_facet=state&_facet_size=10
  selectors_all:
  - .suggested-facets a
  - tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
  padding: 10
  output: faceting-details.png

The key trick I'm using here is that selectors_all list.

The usual shot-scraper selector option finds the first element on the page matching the specified CSS selector and takes a screenshot of that.

--selector-all - or the YAML equivalent selectors_all - instead finds EVERY element that matches any of the specified selectors and draws a bounding box containing all of them.
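
To illustrate the idea of a bounding box containing every matched element, here's a small sketch. It is not shot-scraper's implementation: the Box class, the enclosing_box function and the coordinates are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle on the page, in pixels (hypothetical helper type)."""

    x: float
    y: float
    width: float
    height: float


def enclosing_box(boxes: list[Box], padding: float = 0) -> Box:
    """Return the smallest box containing every input box, grown by padding."""
    if not boxes:
        raise ValueError("at least one box is required")
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    return Box(
        left - padding,
        top - padding,
        right - left + 2 * padding,
        bottom - top + 2 * padding,
    )


# Made-up positions for one facet link and one table cell.
facet_link = Box(x=20, y=100, width=80, height=20)
table_cell = Box(x=20, y=300, width=400, height=30)
print(enclosing_box([facet_link, table_cell], padding=10))
```

The padding is applied once to the combined box rather than to each element, which matches the way a single padding value is given per screenshot in the YAML.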

I wanted that bounding box to surround a subset of the table cells on the page. I used this CSS selector to indicate that subset:

tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))

Here's what GPT-3 says if you ask it to explain the selector:

Explain this CSS selector:

tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))

This selector is selecting all table cells in rows that are not the fourth row or greater, and are not in columns that are the 11th column or greater.

(See also this TIL.)
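
The same logic can be sketched in Python, treating rows and cells as 1-based positions among their siblings, which is how :nth-child() counts. The selected_cells function and the table dimensions are hypothetical, chosen only to show which positions survive the two :not() filters.

```python
def selected_cells(n_rows: int, n_cols: int) -> list[tuple[int, int]]:
    """Return the 1-based (row, column) positions the selector would match."""
    cells = []
    for row in range(1, n_rows + 1):
        if row >= 4:  # excluded by tr:not(tr:nth-child(n+4))
            continue
        for col in range(1, n_cols + 1):
            if col >= 11:  # excluded by td:not(:nth-child(n+11))
                continue
            cells.append((row, col))
    return cells


# A made-up table body with 6 rows of 15 cells each.
cells = selected_cells(n_rows=6, n_cols=15)
print(len(cells), cells[0], cells[-1])
```

For that made-up table the result is the first ten cells of each of the first three rows.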

Automating everything using GitHub Actions

Here's the full shots.yml YAML needed to generate all four of these screenshots:

- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper&_sort_desc=date
  height: 585
  width: 960
  output: regmem-search.png
- url: https://register-of-members-interests.datasettes.com/regmem/items?_search=hamper
  selector: "#export"
  output: advanced-export.png
  padding: 10
- url: https://congress-legislators.datasettes.com/legislators/legislator_terms?_facet=type&_facet=party&_facet=state&_facet_size=10
  selectors_all:
  - .suggested-facets a
  - tr:not(tr:nth-child(n+4)) td:not(:nth-child(n+11))
  padding: 10
  output: faceting-details.png
- url: https://latest.datasette.io/fixtures/binary_data
  selector: table
  javascript: |-
    Array.from(
      document.querySelectorAll('tr:nth-child(n+3)'),
      el => el.parentNode.removeChild(el)
    );
  padding: 10
  output: binary-data.png

Running shot-scraper multi shots.yml against this file takes all four screenshots.

But I want this to be fully automated! So I turned to GitHub Actions.

A while ago I created a template repository for setting up GitHub Actions to take screenshots using shot-scraper and write them back to the same repo. I wrote about that in Instantly create a GitHub repository to take screenshots of a web page.

I had previously used that recipe to create my datasette-screenshots repository - with its own shots.yml file.

So I added the new YAML to that existing file, committed the change, waited a minute and the result was all four images stored in that repository!

My datasette-screenshots workflow actually has two key changes from my default template. First, it takes every screenshot twice - once as a retina image and once as a regular image:

    - name: Take retina shots
      run: |
        shot-scraper multi shots.yml --retina
    - name: Take non-retina shots
      run: |
        mkdir -p non-retina
        cd non-retina
        shot-scraper multi ../shots.yml
        cd ..

This provides me with both a high quality image and a smaller, faster-loading image for each screenshot.

Secondly, it runs oxipng to optimize the PNGs before committing them to the repo:

    - name: Optimize PNGs
      run: |-
        oxipng -o 4 -i 0 --strip safe *.png
        oxipng -o 4 -i 0 --strip safe non-retina/*.png

The shot-scraper documentation describes this pattern in more detail.

With all of that in place, simply committing a change to the shots.yml file is enough to generate and store the new screenshots.

Linking to the images

One last problem to solve: I want to include these images in my documentation, which means I need a way to link to them.

I decided to use GitHub to host these directly, via the raw.githubusercontent.com domain - which is fronted by the Fastly CDN.

I care about up-to-date images, but I also want different versions of the Datasette documentation to reflect the corresponding design in their screenshots - so I needed a way to snapshot those screenshots to a known version.

Repository tags are one way to do this.

I tagged the datasette-screenshots repository with 0.62, since that's the version of Datasette that the screenshots were taken for.

This gave me the following URLs for the images:

To save on page loading time I decided to use the non-retina URLs for the two larger images.

Here's the commit that updated the Datasette documentation to link to these new images (and deleted the old images from the repo).

You can see the new images in the documentation on these pages:

Tags: documentation, datasette, github-actions, shot-scraper

Software engineering practices

2022-10-01T15:56:02+00:00

Gergely Orosz started a Twitter conversation asking about recommended "software engineering practices" for development teams.

(I really like his rejection of the term "best practices" here: I always feel it's prescriptive and misguiding to announce something as "best".)

I decided to flesh some of my replies out into a longer post.

Documentation in the same repo as the code

The most important characteristic of internal documentation is trust: do people trust that documentation both exists and is up-to-date?

If they don't, they won't read it or contribute to it.

The best trick I know of for improving the trustworthiness of documentation is to put it in the same repository as the code it documents, for a few reasons:

You can enforce documentation updates as part of your code review process. If a PR changes code in a way that requires documentation updates, the reviewer can ask for those updates to be included.
You get versioned documentation. If you're using an older version of a library you can consult the documentation for that version. If you're using the current main branch you can see documentation for that, without confusion over what corresponds to the most recent "stable" release.
You can integrate your documentation with your automated tests! I wrote about this in Documentation unit tests, which describes a pattern for introspecting code and then ensuring that the documentation at least has a section header that matches specific concepts, such as plugin hooks or configuration options.

Mechanisms for creating test data

When you work on large products, your customers will inevitably find surprising ways to stress or break your system. They might create an event with over a hundred different types of ticket for example, or an issue thread with a thousand comments.

These can expose performance issues that don't affect the majority of your users, but can still lead to service outages or other problems.

Your engineers need a way to replicate these situations in their own development environments.

One way to handle this is to provide tooling to import production data into local environments. This has privacy and security implications - what if a developer laptop gets stolen that happens to have a copy of your largest customer's data?

A better approach is to have a robust system in place for generating test data, that covers a variety of different scenarios.

You might have a button somewhere that creates an issue thread with a thousand fake comments, with a note referencing the bug that this helps emulate.

Any time a new edge case shows up, you can add a new recipe to that system. That way engineers can replicate problems locally without needing copies of production data.
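
A minimal sketch of what such a recipe system could look like, assuming simple in-memory objects rather than a real database. The Comment and IssueThread classes, the make_huge_thread recipe and the RECIPES registry are all hypothetical names invented for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    body: str


@dataclass
class IssueThread:
    title: str
    comments: list[Comment] = field(default_factory=list)


def make_huge_thread(num_comments: int = 1000, seed: int = 0) -> IssueThread:
    """Recipe: an issue thread with many fake comments.

    Emulates a (hypothetical) bug where very long threads are slow to load.
    """
    rng = random.Random(seed)  # seeded so every engineer gets the same data
    authors = ["alice", "bob", "carol"]
    thread = IssueThread(title=f"Thread with {num_comments} comments")
    for i in range(num_comments):
        thread.comments.append(
            Comment(author=rng.choice(authors), body=f"Fake comment #{i + 1}")
        )
    return thread


# Registry of named recipes: add an entry each time a new edge case shows up.
RECIPES = {"huge_thread": make_huge_thread}

thread = RECIPES["huge_thread"]()
print(len(thread.comments))
```

Seeding the random generator keeps the generated data reproducible, so two engineers looking at the same recipe see the same thread.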

Rock solid database migrations

The hardest part of large-scale software maintenance is inevitably the bit where you need to change your database schema.

(I'm confident that one of the biggest reasons NoSQL databases became popular over the last decade was the pain people had associated with relational databases due to schema changes. Of course, NoSQL database schema modifications are still necessary, and often they're even more painful!)

So you need to invest in a really good, version-controlled mechanism for managing schema changes. And a way to run them in production without downtime.

If you do not have this your engineers will respond by being fearful of schema changes. Which means they'll come up with increasingly complex hacks to avoid them, which piles on technical debt.

This is a deep topic. I mostly use Django for large database-backed applications, and Django has the best migration system I've ever personally experienced. If I'm working without Django I try to replicate its approach as closely as possible:

The database knows which migrations have already been applied. This means when you run the "migrate" command it can run just the ones that are still needed - important for managing multiple databases, e.g. production, staging, test and development environments.
A single command that applies pending migrations, and updates the database rows that record which migrations have been run.
Optional: rollbacks. Django migrations can be rolled back, which is great for iterating in a development environment, but using that in production is actually quite rare: I'll often ship a new migration that reverses the change rather than using a rollback, partly to keep the record of the mistake in version control.
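
To make the first two points concrete, here's a minimal sketch using Python's sqlite3 module. This is not how Django implements it; the table name applied_migrations and the two example migrations are assumptions made up for this illustration.

```python
import sqlite3

# Hypothetical migrations: (name, SQL) pairs applied in order.
MIGRATIONS = [
    ("0001_create_tickets", "CREATE TABLE tickets (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_price", "ALTER TABLE tickets ADD COLUMN price INTEGER"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply migrations not yet recorded in the database; return their names."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    newly_applied = []
    for name, sql in migrations:
        if name in applied:
            continue
        conn.execute(sql)
        # Record the migration so later runs skip it
        conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies both migrations
second_run = migrate(conn)  # nothing left to apply
```

Because the record of applied migrations lives in the database itself, the same command is safe to run against production, staging, test and development databases that are each at a different point in the sequence.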

Even harder is the challenge of making schema changes without any downtime. I'm always interested in reading about new approaches for this - GitHub's gh-ost is a neat solution for MySQL.

An interesting consideration here is that it's rarely possible to have application code and database schema changes go out at the exact same instant in time. As a result, to avoid downtime you need to design every schema change with this in mind. The process needs to be:

Design a new schema change that can be applied without changing the application code that uses it.
Ship that change to production, upgrading your database while keeping the old code working.
Now ship new application code that uses the new schema.
Ship a new schema change that cleans up any remaining work - dropping columns that are no longer used, for example.
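
The first three steps can be sketched with an in-memory SQLite database. The users table, the email column and the two insert functions are hypothetical stand-ins for old and new application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def old_code_insert(conn, name):
    # Old application code: knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn, name, email):
    # New application code: written against the new schema.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


old_code_insert(conn, "before")

# Steps 1 and 2: an additive change. The new column is nullable, so the
# old INSERT statement keeps working after the migration has run.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
old_code_insert(conn, "during")

# Step 3: deploy code that uses the new column.
new_code_insert(conn, "after", "after@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

The key property is that the old insert still succeeds after the schema change, so the migration and the code deploy don't have to land at the same moment. The cleanup in step 4 only becomes safe once no running code depends on the old shape.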

This process is a pain. It's difficult to get right. The only way to get good at it is to practice it a lot over time.

My rule is this: schema changes should be boring and common, as opposed to being exciting and rare.

Templates for new projects and components

If you're working with microservices, your team will inevitably need to build new ones.

If you're working in a monorepo, you'll still have elements of your codebase with similar structures - components and feature implementations of some sort.

Be sure to have really good templates in place for creating these "the right way" - with the right directory structure, a README and a test suite with a single, dumb passing test.

I like to use the Python cookiecutter tool for this. I've also used GitHub template repositories, and I even have a neat trick for combining the two.

These templates need to be maintained and kept up-to-date. The best way to do that is to make sure they are being used - every time a new project is created is a chance to revise the template and make sure it still reflects the recommended way to do things.

Automated code formatting

This one's easy. Pick a code formatting tool for your language - like Black for Python or Prettier for JavaScript (I'm so jealous of how Go has gofmt built in) - and run its "check" mode in your CI flow.

Don't argue with its defaults, just commit to them.

This saves an incredible amount of time in two places:

As an individual, you get back all of that mental energy you used to spend thinking about the best way to format your code and can spend it on something more interesting.
As a team, your code reviews can entirely skip the pedantic arguments about code formatting. Huge productivity win!

Tested, automated process for new development environments

The most painful part of any software project is inevitably setting up the initial development environment.

The moment your team grows beyond a couple of people, you should invest in making this work better.

At the very least, you need a documented process for creating a new environment - and it has to be known-to-work, so any time someone is onboarded using it they should be encouraged to fix any problems in the documentation or accompanying scripts as they encounter them.

Much better is an automated process: a single script that gets everything up and running. Tools like Docker have made this a LOT easier over the past decade.

I'm increasingly convinced that the best-in-class solution here is cloud-based development environments. The ability to click a button on a web page and have a fresh, working development environment running a few seconds later is a game-changer for large development teams.

Gitpod and Codespaces are two of the most promising tools I've tried in this space.

I've seen developers lose hours a week to issues with their development environment. Eliminating that across a large team is the equivalent of hiring several new full-time engineers!

Automated preview environments

Reviewing a pull request is a lot easier if you can actually try out the changes.

The best way to do this is with automated preview environments, directly linked to from the PR itself.

These are getting increasingly easy to offer. Vercel, Netlify, Render and Heroku all have features that can do this. Building a custom system on top of something like Google Cloud Run or Fly Machines is also possible with a bit of work.

This is another one of those things which requires some up-front investment but will pay for itself many times over through increased productivity and quality of reviews.

Tags: documentation, software-engineering, versioncontrol, zero-downtime, github-actions, gergely-orosz

How I’m a Productive Programmer With a Memory of a Fruit Fly

2022-09-19T16:19:02+00:00

How I’m a Productive Programmer With a Memory of a Fruit Fly

Hynek Schlawack describes the value he gets from searchable offline developer documentation, and advocates for the Documentation Sets format which bundles docs, metadata and a SQLite search index. Hynek’s doc2dash command can convert documentation generated by tools like Sphinx into a docset that’s compatible with several offline documentation browser applications.

Via @hynek

Tags: sqlite, hynek-schlawack, documentation, sphinx-docs

Quoting Ken Williams

2022-08-04T15:50:56+00:00

Your documentation is complete when someone can use your module without ever having to look at its code. This is very important. This makes it possible for you to separate your module's documented interface from its internal implementation (guts). This is good because it means that you are free to change the module's internals as long as the interface remains the same.

Remember: the documentation, not the code, defines what a module does.

— Ken Williams

Tags: documentation

Cleaning data with sqlite-utils and Datasette

2022-07-31T19:57:51+00:00

Cleaning data with sqlite-utils and Datasette

I wrote a new tutorial for the Datasette website, showing how to use sqlite-utils to import a CSV file, clean up the resulting schema, fix date formats and extract some of the columns into a separate table. It’s accompanied by a ten minute video originally recorded for the HYTRADBOI conference.

Via @simonw

Tags: tutorials, sqlite-utils, datasette, documentation

Weeknotes: Building Datasette Cloud on Fly Machines, Furo for documentation

2022-05-26T04:35:11+00:00

Hosting provider Fly released Fly Machines this week. I got an early preview and I've been working with it for a few days - it's a fascinating new piece of technology. I'm using it to get my hosting service for Datasette ready for wider release.

Datasette Cloud

Datasette Cloud is the name I've given my forthcoming hosted SaaS version of Datasette. I'm building it for two reasons:

This is an obvious step towards building a sustainable business model for my open source project. It's a reasonably well-trodden path at this point: plenty of projects have demonstrated that offering paid hosting for an open source project can build a valuable business. GitLab are an especially good example of this model.
There are plenty of people who could benefit from Datasette, but the friction involved in hosting it prevents them from taking advantage of the software. I've tried to make it as easy to host as possible, but without a SaaS hosted version I'm failing to deliver value to the people that I most want the software to help.

My previous alpha was built directly on Docker, running everything on a single large VPS. Obviously it needed to scale beyond one machine, and I started experimenting with Kubernetes to make this happen.

I also want to allow users to run their own plugins, without risk of malicious code causing problems for other accounts. Docker and Kubernetes containers don't offer the isolation that I need to feel comfortable doing this, so I started researching Firecracker - constructed by AWS to power Lambda and Fargate, so very much designed with potentially malicious code in mind.

Spinning up Firecracker on a Kubernetes cluster is no small lift!

And then I heard about Fly Machines. And it looks like it's exactly what I need to get this project to the next milestone.

Fly Machines

Fly's core offering allows you to run Docker containers in regions around the world, compiled (automatically by Fly) to Firecracker containers with geo-load-balancing so users automatically get routed to an instance running near them.

Their new Fly Machines product gives you a new way to run containers there: you get full control over when containers are created, updated, started, stopped and destroyed. It's the exact level of control I need to build Datasette Cloud.

It also implements scale-to-zero: you can stop a container, and Fly will automatically start it back up again for you (generally in less than a second) when fresh traffic comes in.

(I had built my own version of this for my Datasette Cloud alpha, but the spin up time took more like 10s and involved showing the user a custom progress bar to help them see what was going on.)

Being able to programmatically start and stop Firecracker containers was exactly what I'd been trying to piece together using Kubernetes - and the ability to control which global region they go in (with the potential for Litestream replication between regions in the future) is a feature I hadn't expected to be able to offer for years.

So I spent most of this week on a proof of concept. I've successfully demonstrated that the Fly Machines product has almost exactly the features that I need to ship Datasette Cloud on Fly Machines - and I've confirmed that the gaps I need to fill are on Fly's near-term roadmap.

I don't have anything to demonstrate publicly just yet, but I do have several new TILs.

If this sounds interesting to you or your organization and you'd like to try it out, drop me an email at swillison @ Google's email service.

The Furo theme for Sphinx

My shot-scraper automated screenshot tool's README had got a little too long, so I decided to upgrade it to a full documentation website.

I chose to use MyST and Sphinx for this, hosted on Read The Docs.

MyST adds Markdown syntax to Sphinx, which is easier to remember (and for people to contribute to) than reStructuredText.

After putting the site live, Adam Johnson suggested I take a look at the Furo theme. I'd previously found Sphinx themes hard to navigate because they had so much differing functionality, but a personal recommendation turned out to be exactly what I needed.

Furo is really nice - it fixed a slight rendering complaint I had about nested lists in the theme I was using, and since it doesn't use web fonts it dropped the bytes transferred for a page of documentation by more than half!

I switched shot-scraper over to Furo, and liked it so much that I switched over Datasette and sqlite-utils too.
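
For reference, a Sphinx conf.py combining MyST and Furo can be very small. This is a minimal sketch rather than the configuration from any of those projects: the project name is a placeholder, and it assumes the sphinx, furo and myst-parser packages are installed.

```python
# conf.py - a minimal sketch, not the configuration of any real project
project = "example-project"  # placeholder project name

# myst_parser lets Sphinx read Markdown source files
extensions = ["myst_parser"]

# Accept both Markdown and reStructuredText sources
source_suffix = {
    ".md": "markdown",
    ".rst": "restructuredtext",
}

# Use the Furo theme for HTML output
html_theme = "furo"
```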

Here's what the shot-scraper documentation looks like now:

Screenshot taken using shot-scraper itself, like this:

shot-scraper \
  https://shot-scraper.datasette.io/en/latest/ \
  --retina --height 1200

Full details of those theme migrations (including more comparative screenshots) can be found in these issues:

Releases this week

datasette-unsafe-expose-env: 0.1 - 2022-05-25
Datasette plugin to expose some environment variables at /-/env for debugging
shot-scraper: 0.14.1 - (16 releases total) - 2022-05-22
A command-line utility for taking automated screenshots of websites
google-calendar-to-sqlite: 0.1a0 - 2022-05-21
Create a SQLite database containing your data from Google Calendar
datasette-upload-dbs: 0.1.1 - (2 releases total) - 2022-05-17
Upload SQLite database files to Datasette
datasette-insert: 0.7 - (7 releases total) - 2022-05-16
Datasette plugin for inserting and updating data

TIL this week

Tags: documentation, projects, datasette, weeknotes, datasette-cloud, fly, firecracker

GOV.UK Guidance: Documenting APIs

2022-05-21T23:31:20+00:00

GOV.UK Guidance: Documenting APIs

Characteristically excellent guide from GOV.UK on writing great API documentation. “Task-based guidance helps users complete the most common integration tasks, based on the user needs from your research.”

Via @jamietanna

Tags: documentation, gov-uk

jq language description

2022-04-26T19:04:09+00:00

jq language description

I love jq but I’ve always found it difficult to remember how to use it, and the manual hasn’t helped me as much as I would hope. It turns out the jq wiki on GitHub offers an alternative, more detailed description of the language which fits the way my brain works a lot better.

Via psacawa on Hacker News

Tags: jq, documentation, programming-languages

Deno by example

2022-03-17T01:02:00+00:00

Deno by example

Interesting approach to documentation: a big list of annotated examples illustrating the Deno way of solving a bunch of common problems.

Via Jim Nielsen: Deno is Webby (pt. 2)

Tags: deno, documentation

Instantly create a GitHub repository to take screenshots of a web page

2022-03-14T16:52:27+00:00

I just released shot-scraper-template, a GitHub repository template that helps you start taking automated screenshots of a web page by filling out a form.

shot-scraper is my command line tool for taking screenshots of web pages and scraping data from them using JavaScript.

One of its uses is to help create and maintain screenshots for documentation, making it easy to update them to include changes to the design of the underlying pages.

To make this as easy as possible, I've created a GitHub repository template that automates the process of setting up shot-scraper to run against a URL.

To try it out, start here:

https://github.com/simonw/shot-scraper-template/generate

Pick a name for your new repository and paste the URL of the page you want to screenshot into the description field.

Then click "Create repository from template".

That's it! Your new repository will be created, a GitHub Actions automation script will run for a few seconds and your new screenshot will be added to the repository as a file called shot.png.

Here's an example repository I created using the template: simonw/simonwillison-net-shot - and here's the shot.png file from that repo:

You can re-take the screenshot any time you want by clicking the "Run workflow" button in the Actions tab:

Your repository will have a file in it called shots.yml that initially looks like this:

- url: https://simonwillison.net/
  output: shot.png
  height: 800

You can edit that file to change the settings that apply to your screenshot, or to add further URLs to take shots of like this:

- url: https://simonwillison.net/
  output: shot.png
  height: 800
- url: https://www.example.com/
  output: example.png
  height: 800

Further options are available here, as described in the shot-scraper README.

How this works

This entire system is based around a single GitHub Actions workflow, in .github/workflows/shots.yml.

Here's an annotated copy of that workflow showing how it all works.

name: Take screenshots

on:
  push:
  workflow_dispatch:

The workflow triggers when a change is made to the repository (including edits to the shots.yml file) or when the user manually clicks "Run workflow".

jobs:
  shot-scraper:
    runs-on: ubuntu-latest
    if: ${{ github.repository != 'simonw/shot-scraper-template' }}

This is the trick that makes everything else work, which I picked up from Bruno Rocha last year. It ensures that this workflow job only runs on copies of the template, not on the initial template repository itself.

This is necessary because a later step creates a file in the repository if it doesn't yet exist based on the description URL provided by the user.

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.10
      uses: actions/setup-python@v2
      with:
        python-version: "3.10"
    - uses: actions/cache@v2
      name: Configure pip caching
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
        restore-keys: |
          ${{ runner.os }}-pip-

This is boilerplate that I use in most of my GitHub Actions workflows: it sets up Python 3.10, and also configures a cache such that Python requirements in a requirements.txt file persist from one invocation to another without having to be re-downloaded from PyPI.

    - name: Cache Playwright browsers
      uses: actions/cache@v2
      with:
        path: ~/.cache/ms-playwright/
        key: ${{ runner.os }}-browsers

shot-scraper uses Microsoft's open source Playwright browser automation tool. Playwright works by installing its own full Chromium browser. This line configures a cache for that browser, such that future invocations of the Action don't need to download another copy.

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
    - name: Install Playwright dependencies
      run: |
        shot-scraper install

The pip install line here installs the shot-scraper CLI tool, which is written in Python.

That shot-scraper install line then triggers the Playwright mechanism to download and install the browser. This will do nothing if the browser has already been cached.

    - uses: actions/github-script@v6
      name: Create shots.yml if missing on first run
      with:
        script: |
          const fs = require('fs');
          if (!fs.existsSync('shots.yml')) {
              const desc = context.payload.repository.description;
              let line = '';
              if (desc && (desc.startsWith('http://') || desc.startsWith('https://'))) {
                  line = `- url: ${desc}` + '\n  output: shot.png\n  height: 800';
              } else {
                  line = '# - url: https://www.example.com/\n#   output: shot.png\n#   height: 800';
              }
              fs.writeFileSync('shots.yml', line + '\n');
          }

This is the other key piece of magic. This uses GitHub's github-script action, which provides a Node.js environment with a context object containing details about the actions run.

It starts by reading the repository description from context.payload.repository.description.

Then it creates a shots.yml file based on that description - but only if the file does not exist already.

If there's no repository description, or the description doesn't start with http:// or https://, it creates the file with a commented-out configuration instead, which looks like this:

# - url: https://www.example.com/
#   output: shot.png
#   height: 800
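
The same decision logic is easy to read as a plain function. Here's a Python port of the JavaScript above, written purely as an illustration; the workflow itself runs the JavaScript version, and the function name is my own invention.

```python
def initial_shots_yml(description):
    """Return the starting shots.yml content for a repository description."""
    if description and description.startswith(("http://", "https://")):
        # The description is a URL: use it as the page to screenshot
        body = f"- url: {description}\n  output: shot.png\n  height: 800"
    else:
        # No usable URL: write a commented-out example instead
        body = (
            "# - url: https://www.example.com/\n"
            "#   output: shot.png\n"
            "#   height: 800"
        )
    return body + "\n"
```

Both a missing description and a description that isn't a URL end up in the commented-out branch.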

The next step is to take the screenshots:

    - name: Take shots
      run: |
        shot-scraper multi shots.yml

shot-scraper multi is documented here - it runs through the YAML file and takes each of the screenshots configured there in turn.

The final step is to commit and push the new shots.yml and shot.png files to the repository:

    - name: Commit and push
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "${timestamp}" || exit 0
        git pull --rebase
        git push

This uses a pattern I describe in this TIL.

GitHub Actions as a platform

I tweeted this the other day, shortly before I came up with the idea for the shot-scraper-template repository.

Genuinely think GitHub Actions might be my favourite serverless platform right now
- Simon Willison (@simonw) March 13, 2022

This project demonstrates why. The amount of complex moving parts involved in shot-scraper-template is pretty bewildering, but the end result is a free tool that anyone can use to start taking automated screenshots.

And it doesn't cost me anything to provide the tool either!

Tags: documentation, projects, github-actions, shot-scraper

Weeknotes: Distracted by Playwright

2022-03-12T00:30:26+00:00

My goal for this week was to unblock progress on Datasette by finally finishing the dash encoding implementation I described last week. I was getting close, and then I got very distracted by Playwright.

Dash encoding v2

In Why I invented “dash encoding”, a new encoding scheme for URL paths I described a new mechanism I had invented for handling the gnarly problem of including table names with / characters in the URL path on Datasette. The very short version: you can't use URL encoding in a path, because common proxies (including Apache and Nginx) will decode them before they get to your application.

Thanks to feedback on that post I actually changed my design: I'm now using a variant of percent encoding that uses the - instead of the %. More details in the issue - and I'll write this up fully once I've finished landing the change.
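
To give a feel for the idea, here's a rough sketch of percent-style encoding with - as the escape character. This is not the Datasette implementation: the set of characters left unescaped and the use of uppercase hex are assumptions I've made for the illustration.

```python
import re

# Assumption: only ASCII letters, digits and underscore pass through unescaped
SAFE = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_")


def dash_encode(text):
    """Escape every byte outside SAFE as -XX, including the dash itself."""
    out = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        if char in SAFE:
            out.append(char)
        else:
            out.append(f"-{byte:02X}")
    return "".join(out)


def dash_decode(text):
    """Reverse dash_encode by turning each -XX escape back into a byte."""
    raw = re.sub(
        rb"-([0-9A-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("ascii"),
    )
    return raw.decode("utf-8")
```

Since the output never contains a literal / or %, a proxy that decodes URL-encoded paths has nothing to rewrite before the request reaches the application.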

shot-scraper and Playwright

I thoroughly nerd-sniped myself with this one. I started investigating possibilities for automatically generating screenshots for documentation, and realized that Playwright made this substantially easier than it has been in the past.

The result was shot-scraper - a new command-line utility for taking screenshots of web pages, or portions of web pages - and for running through a set of screenshots defined in a YAML file.

I still can't quite believe how quickly this came together.

Every now and then a tool comes along which adds a fundamental new set of capabilities to your toolbox, and can be multiplied against other tools to open up a huge range of possibilities.

Playwright feels like one of those tools.

A quick pip install playwright is all it takes to start writing robust browser automation tools, using dedicated standalone headless instances of multiple browsers that are installed for you using playwright install.

It's easy to run in CI - getting it working in GitHub Actions was trivial.

shot-scraper is my first project built on Playwright, but there will definitely be more.

shot-scraper accessibility

I started a Twitter conversation asking for ways to write automated tests that exercise screen readers - not just running audit rules, but actually simulating what happens when a screen reader user attempts to navigate through a specific flow within an application.

The most interesting answer I had was from Ben Mustill-Rose, who built a system for automating tests against an Android screen reader while working on BBC iPlayer - demo here.

@fardarter pointed me back to Playwright again, which turns out to have an Accessibility snapshot mechanism that can dump out the current state of the Chromium accessibility tree.

I couldn't resist adding that to shot-scraper - so now you can run the following to see the accessibility tree for a web page:

~ % shot-scraper accessibility https://datasette.io
{
    "role": "WebArea",
    "name": "Datasette: An open source multi-tool for exploring and publishing data",
    "children": [
        {
            "role": "link",
            "name": "Uses"
        },
        {
            "role": "link",
            "name": "Documentation"
        },

Full output here.

As a really fun bonus trick: since the output is JSON, you can pipe it into sqlite-utils insert to get a SQLite database:

shot-scraper accessibility https://datasette.io \
    | jq .children | sqlite-utils insert \
    /tmp/accessibility.db nodes - --alter

And then open it in Datasette Desktop and start faceting by role and heading level!
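
Faceting by role boils down to counting nodes per role. Here's a small sketch of that in plain Python, run against a trimmed copy of the output shown above rather than the full tree.

```python
import json
from collections import Counter


def count_roles(node, counts=None):
    """Walk the accessibility tree and count how often each role appears."""
    counts = Counter() if counts is None else counts
    counts[node.get("role", "unknown")] += 1
    for child in node.get("children", []):
        count_roles(child, counts)
    return counts


# Trimmed sample based on the first few lines of output shown above
sample = json.loads("""
{
    "role": "WebArea",
    "name": "Datasette: An open source multi-tool for exploring and publishing data",
    "children": [
        {"role": "link", "name": "Uses"},
        {"role": "link", "name": "Documentation"}
    ]
}
""")

role_counts = count_roles(sample)
```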

sqlite-utils documentation improvements

I complained on Twitter that the way type information was displayed in the Sphinx sqlite-utils API reference documentation was ugly:

Adam Johnson pointed me to the autodoc_typehints = "description" option which fixes this. I spent a while tidying up the documentation to work better with this, mainly by adding a whole bunch of :param name: description tags that I had previously omitted. That work happened in this issue. I think it looks much better now:

Releases this week

image-diff: 0.2.1 - (3 releases total) - 2022-03-11
CLI tool for comparing images
sqlite-utils: 3.25.1 - (98 releases total) - 2022-03-11
Python CLI utility and library for manipulating SQLite databases
shot-scraper: 0.4 - (5 releases total) - 2022-03-10
Automated website screenshots using GitHub Actions
django-sql-dashboard: 1.0.2 - (34 releases total) - 2022-03-08
Django app for building dashboards using raw SQL queries
geojson-to-sqlite: 1.0 - (8 releases total) - 2022-03-04
CLI tool for converting GeoJSON files to SQLite (with SpatiaLite)
xml-analyser: 1.3 - (4 releases total) - 2022-03-01
Simple command line tool for quickly analysing the structure of an arbitrary XML file
datasette-dateutil: 0.3 - (4 releases total) - 2022-03-01
dateutil functions for Datasette

TIL this week

Tags: accessibility, documentation, datasette, weeknotes, sphinx-docs, playwright, shot-scraper

shot-scraper: automated screenshots for documentation, built on Playwright

2022-03-10T00:13:30+00:00

shot-scraper is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool - hence the name - which I picked as a complement to my git scraping and help scraping techniques.

Update 13th March 2022: The new shot-scraper javascript command can now be used to scrape web pages from the command line.

Update 14th October 2022: Automating screenshots for the Datasette documentation using shot-scraper offers a tutorial introduction to using the tool.

The problem

I like to include screenshots in documentation. I recently started writing end-user tutorials for Datasette, which are particularly image heavy (for example).

As software changes over time, screenshots get out-of-date. I don't like the idea of stale screenshots, but I also don't want to have to manually recreate them every time I make the tiniest tweak to the visual appearance of my software.

Introducing shot-scraper

shot-scraper is a tool for automating this process. You can install it using pip like this:

pip install shot-scraper
shot-scraper install

That second shot-scraper install line will install the browser it needs to do its job - more on that later.

You can use it in two ways. To take a one-off screenshot, you can run it like this:

shot-scraper https://simonwillison.net/ -o simonwillison.png

Or if you want to take a set of screenshots in a repeatable way, you can define them in a YAML file that looks like this:

- url: https://simonwillison.net/
  output: simonwillison.png
- url: https://www.example.com/
  width: 400
  height: 400
  quality: 80
  output: example.jpg

And then use shot-scraper multi to execute every screenshot in one go:

% shot-scraper multi shots.yml 
Screenshot of 'https://simonwillison.net/' written to 'simonwillison.png'
Screenshot of 'https://www.example.com/' written to 'example.jpg'

The documentation describes all of the available options you can use when taking a screenshot.

Each option can be provided to the shot-scraper one-off tool, or can be embedded in the YAML file for use with shot-scraper multi.

JavaScript and CSS selectors

The default behaviour for shot-scraper is to take a full page screenshot, using a browser width of 1280px.

For documentation screenshots you probably don't want the whole page though - you likely want to create an image of one specific part of the interface.

The --selector option allows you to specify an area of the page by CSS selector. The resulting image will consist just of that part of the page.

What if you want to modify the page in addition to selecting a specific area?

The --javascript option lets you pass in a block of JavaScript code which will be injected into the page and executed after the page has loaded, but before the screenshot is taken.

The combination of these two options - also available as javascript: and selector: keys in the YAML file - should be flexible enough to cover the custom screenshot case for documentation.
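
If you're driving the one-off command from a script, the two options slot in like any other argument. This is a small sketch of building that command line in Python; the helper function, the header selector and the output filename are hypothetical, while -o, --selector and --javascript are the options described above.

```python
import shlex


def shot_scraper_command(url, output, selector=None, javascript=None):
    """Build the argument list for a one-off shot-scraper screenshot."""
    args = ["shot-scraper", url, "-o", output]
    if selector:
        args += ["--selector", selector]
    if javascript:
        args += ["--javascript", javascript]
    return args


command = shot_scraper_command(
    "https://simonwillison.net/",
    "header.png",
    selector="header",  # hypothetical selector
    javascript="document.body.style.marginTop = '10px';",
)

# Shell-quoted version, handy for logging or pasting into a terminal
printable = shlex.join(command)
```

Passing the list to subprocess.run(command, check=True) would execute it, assuming shot-scraper is installed.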

A complex example

To prove to myself that the tool works, I decided to try replicating this screenshot from my tutorial.

I made the original using CleanShot X, manually adding the two pink arrows:

This is pretty tricky!

It's not this whole page, just a subset of the page
The cog menu for one of the columns is open, which means the cog icon needs to be clicked before taking the screenshot
There are two pink arrows superimposed on the image

I decided to use just one arrow for the moment, which should hopefully result in a clearer image.

I started by creating my own pink arrow SVG using Figma:

I then fiddled around in the Firefox developer console for quite a while, working out the JavaScript needed to trim the page down to the bit I wanted, open the menu and position the arrow.

With the JavaScript figured out, I pasted it into a YAML file called shot.yml:

- url: https://congress-legislators.datasettes.com/legislators/executive_terms?start__startswith=18&type=prez
  javascript: |
    new Promise(resolve => {
      // Run in a promise so we can sleep 1s at the end
      function remove(el) { el.parentNode.removeChild(el);}
      // Remove header and footer
      remove(document.querySelector('header'));
      remove(document.querySelector('footer'));
      // Remove most of the children of .content
      Array.from(document.querySelectorAll('.content > *:not(.table-wrapper,.suggested-facets)')).map(remove)
      // Bit of breathing room for the screenshot
      document.body.style.marginTop = '10px';
      // Add a bit of padding to .content
      var content = document.querySelector('.content');
      content.style.width = '820px';
      content.style.padding = '10px';
      // Open the menu - it's an SVG so we need to use dispatchEvent here
      document.querySelector('th.col-executive_id svg').dispatchEvent(new Event('click'));
      // Remove all but table header and first 11 rows
      Array.from(document.querySelectorAll('tr')).slice(12).map(remove);
      // Add a pink SVG arrow
      let div = document.createElement('div');
      div.innerHTML = `<svg width="104" height="60" fill="none" xmlns="http://www.w3.org/2000/svg">
        <g filter="url(#a)">
          <path fill-rule="evenodd" clip-rule="evenodd" d="m76.7 1 2 2 .2-.1.1.4 20 20a3.5 3.5 0 0 1 0 5l-20 20-.1.4-.3-.1-1.9 2a3.5 3.5 0 0 1-5.4-4.4l3.2-14.4H4v-12h70.6L71.3 5.4A3.5 3.5 0 0 1 76.7 1Z" fill="#FF31A0"/>
        </g>
        <defs>
          <filter id="a" x="0" y="0" width="104" height="59.5" filterUnits="userSpaceOnUse" color-interpolation-filters="sRGB">
              <feFlood flood-opacity="0" result="BackgroundImageFix"/>
              <feColorMatrix in="SourceAlpha" values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 0" result="hardAlpha"/>
              <feOffset dy="4"/>
              <feGaussianBlur stdDeviation="2"/>
              <feComposite in2="hardAlpha" operator="out"/>
              <feColorMatrix values="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0"/>
              <feBlend in2="BackgroundImageFix" result="effect1_dropShadow_2_26"/>
              <feBlend in="SourceGraphic" in2="effect1_dropShadow_2_26" result="shape"/>
          </filter>
        </defs>
      </svg>`;
      let svg = div.firstChild;
      content.appendChild(svg);
      content.style.position = 'relative';
      svg.style.position = 'absolute';
      // Give the menu time to finish fading in
      setTimeout(() => {
        // Position arrow pointing to the 'facet by this' menu item
        var pos = document.querySelector('.dropdown-facet').getBoundingClientRect();
        svg.style.left = (pos.left - pos.width) + 'px';
        svg.style.top = (pos.top - 20) + 'px';
        resolve();
      }, 1000);
    });
  output: annotated-screenshot.png
  selector: .content

And ran this command to generate the screenshot:

shot-scraper multi shot.yml

The generated annotated-screenshot.png image looks like this:

I'm pretty happy with this! I think it works very well as a proof of concept for the process.

How it works: Playwright

I built the first prototype of shot-scraper using Puppeteer, because I had used that before.

Then I noticed that the puppeteer-cli package I was using hadn't had an update in two years, which reminded me to check out Playwright.

I've been looking for an excuse to learn Playwright for a while now, and this project turned out to be ideal.

Playwright is Microsoft's open source browser automation framework. They promote it as a testing tool, but it has plenty of applications outside of testing - screenshot automation and screen scraping being two of the most obvious.

Playwright is comprehensive: it downloads its own custom browser builds, and can run tests across multiple different rendering engines.

The second prototype used the Playwright CLI utility instead, executed via npx:

import subprocess

# url is the page to capture; output is the path of the PNG to write
subprocess.run(
    [
        "npx",
        "playwright",
        "screenshot",
        "--full-page",
        url,
        output,
    ],
    capture_output=True,
)

This could take a full page screenshot, but that CLI tool wasn't flexible enough to take screenshots of specific elements. So I needed to switch to the Playwright programmatic API.

I started out trying to get Python to generate and pass JavaScript to the Node.js library... and then I spotted the official Playwright for Python package.

pip install playwright

It's amazing! It has the exact same functionality as the JavaScript library - the same classes, the same methods. Everything just works, in both languages.

I was curious how they pulled this off, so I dug inside the playwright Python package in my site-packages folder... and found it bundles a full Node.js binary executable and uses it to bridge the two worlds! What a wild hack.

Thanks to Playwright, the entire implementation of shot-scraper is currently just 181 lines of Python code - it's all glue code tying together a Click CLI interface with some code that calls Playwright to do the actual work.

I couldn't be more impressed with Playwright. I'll definitely be using it for other projects - for one thing, I think I'll finally be able to add automated tests to my Datasette Desktop Electron application.

Hooking shot-scraper up to GitHub Actions

I built shot-scraper very much with GitHub Actions in mind.

My shot-scraper-demo repository is my first live demo of the tool.

Once a day, it runs this shots.yml file, generates two screenshots and commits them back to the repository.

One of them is the tutorial screenshot described above.

The other is a screenshot of the list of "recently spotted owls" from this page on owlsnearme.com. I wanted a page that would change on an occasional basis, to demonstrate GitHub's neat image diffing interface.

I may need to change that demo though! That page includes "spotted 5 hours ago" text, which means that there's almost always a tiny pixel difference, like this one (use the "swipe" comparison tool to watch 6 hours ago change to 7 hours ago under the top left photo).

Storing image files that change frequently in a free repository on GitHub feels rude to me, so please use this tool cautiously there!

What's next?

I had ambitious plans to add utilities to the tool that would help with annotations, such as adding pink arrows and drawing circles around different elements on the page.

I've shelved those plans for the moment: as the demo above shows, the JavaScript hook is good enough. I may revisit this later once common patterns have started to emerge.

So really, my next step is to start using this tool for my own projects - to generate screenshots for my documentation.

I'm also very interested to see what kinds of things other people use this for.

Tags: documentation, projects, scraping, github-actions, git-scraping, puppeteer, playwright, shot-scraper

Weeknotes: Datasette Tutorials

2022-02-27T17:35:14+00:00

I published two new tutorials for Datasette this week, both aimed at end-users of the web application.

Exploring a database with Datasette shows how to use Datasette as an exploratory data analysis tool, using facets and filters to get a good feeling for a new database.

Learn SQL with Datasette introduces Datasette's SQL query interface and uses it to teach basic SQL as well as a few more advanced tricks too.

Datasette already has a lot of documentation, but so far it's all been written to serve people who are administering or customizing Datasette instances. The user interface itself has been mostly undocumented.

Daniele Procida's Diátaxis documentation framework describes four categories of documentation: tutorials, how-to guides, technical reference and explanation. Datasette is heavy on the last two but light on the first.

These new tutorials are my initial attempt at redressing the balance. I adapted them from a workshop I presented on Friday at the FOIA Fest data journalism conference in (virtual) Chicago.

Writing documentation for end-users has been an interesting experience! I chose to lean heavily into screenshots, live examples and exercises. I'm eager for feedback from people to help me understand if what I've done is working, and I'm keen for suggestions on how to improve them and what to write next.

The example database I used for the tutorial is pretty fun: https://congress-legislators.datasettes.com - a database of USA senators, congresspeople, presidents and vice presidents built using CC0 data from the absolutely brilliant unitedstates/congress-legislators repository (which just accepted my first PR!)

This was my first attempt at writing end-user facing documentation for a personal project, and it turned out to have the same effect as writing developer documentation: the moment you try to describe how to use a feature the flaws in how that feature works from a usability perspective become strikingly evident!

I'm convinced that writing comprehensive documentation is a massively underrated technique for better software design.

Releases this week

datasette-render-markdown: 2.1 - (9 releases total) - 2022-02-26
Datasette plugin for rendering Markdown
datasette-redirect-forbidden: 0.1 - 2022-02-23
Redirect forbidden requests to a login page
sqlite-diffable: 0.2.1 - (3 releases total) - 2022-02-21
Tools for dumping/loading a SQLite database to diffable directory structure
google-drive-to-sqlite: 0.4 - (6 releases total) - 2022-02-20
Create a SQLite database containing metadata from Google Drive
sqlite-utils: 3.24 - (96 releases total) - 2022-02-16
Python CLI utility and library for manipulating SQLite databases

Tags: documentation, datasette, weeknotes

Making world-class docs takes effort

2021-09-06T18:58:27+00:00

Making world-class docs takes effort

Curl maintainer Daniel Stenberg writes about his principles for good documentation. I agree with all of these: he emphasizes keeping docs in the repo, avoiding the temptation to exclusively generate them from code, featuring examples and ensuring every API you provide has documentation. Daniel describes an approach similar to the documentation unit tests I’ve been using for my own projects: he has scripts which scan the curl documentation to ensure not only that everything is documented but that each documentation area contains the same sections in the same order.

Via Hacker News

Tags: curl, documentation, daniel-stenberg

The Diátaxis documentation framework

2021-08-21T22:59:53+00:00

The Diátaxis documentation framework

Daniele Procida’s model of four types of technical documentation—tutorials, how-to guides, technical reference and explanation—now has a name: Diátaxis.

Tags: diataxis, documentation

Datasette on Codespaces, sqlite-utils API reference documentation and other weeknotes

2021-08-14T04:57:12+00:00

This week I broke my streak of not sending out the Datasette newsletter, figured out how to use Sphinx for Python class documentation, worked out how to run Datasette on GitHub Codespaces, implemented Datasette column metadata and got tantalizingly close to a solution for an elusive Datasette feature.

API reference documentation for sqlite-utils using Sphinx

I've never been a big fan of Javadoc-style API documentation: I usually find that documentation structured around classes and methods fails to show me how to actually use those classes to solve real-world problems. I've tended to avoid it for my own projects.

My sqlite-utils Python library has a ton of functionality, but it mainly boils down to two classes: Database and Table. Since it already has pretty comprehensive narrative documentation explaining the different problems it can solve, I decided to try experimenting with the Sphinx autodoc module to produce some classic API reference documentation for it:

Since autodoc works from docstrings, this was also a great excuse to add more comprehensive docstrings and type hints to the library. This helps tools like Jupyter notebooks and VS Code display more useful inline help.

This proved to be time well spent! Here's what sqlite-utils looks like in VS Code now:

Running mypy against the type hints also helped me identify and fix a couple of obscure edge-case bugs in the existing methods, detailed in the 3.15.1 release notes. It's taken me a few years but I'm finally starting to come round to Python's optional typing as being worth the additional effort!

Figuring out how to use autodoc in Sphinx, and then how to get the documentation to build correctly on Read The Docs took some effort. I wrote up what I learned in this TIL.

Datasette on GitHub Codespaces

GitHub released their new Codespaces online development environments to general availability this week and I'm really excited about it. I ran a team at Eventbrite for a while responsible for development environment tooling and it really was shocking how much time and money was lost to broken local development environments, even with a significant amount of engineering effort applied to the problem.

Codespaces promises a fresh, working development environment on-demand any time you need it. That's a very exciting premise! Their detailed write-up of how they convinced GitHub's own internal engineers to move to it is full of intriguing details - getting an existing application working with it is no small feat, but the pay-off looks very promising indeed.

So... I decided to try and get Datasette running on it. It works really well!

You can run Datasette in any Codespace environment using the following steps:

Open the terminal. Three-bar-menu-icon, View, Terminal does the trick.
In the terminal run pip install datasette datasette-x-forwarded-host (more on this in a moment).
Run datasette - Codespaces will automatically set up port forwarding and give you a link to "Open in Browser" - click the link and you're done!

You can pip install sqlite-utils and then use sqlite-utils insert to create SQLite databases to use with Datasette.

There was one catch: the first time I ran Datasette, clicking on any of the internal links within the web application took me to http://localhost/ pages that broke with a 404.

It turns out the Codespaces proxy sends a host: localhost header - which Datasette then uses to incorrectly construct internal URLs.

So I wrote a tiny ASGI plugin, datasette-x-forwarded-host, which takes the incoming X-Forwarded-Host provided by Codespaces and uses that as the Host header within Datasette itself. After that everything worked fine.
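The general shape of that trick, as a minimal sketch rather than the plugin's actual source: plain ASGI middleware that swaps the Host header for the forwarded one. The XForwardedHostMiddleware and host_seen_by_app names are made up for this illustration, and it assumes the proxy sends a single X-Forwarded-Host header.

```python
import asyncio


class XForwardedHostMiddleware:
    """Hypothetical ASGI middleware: copy X-Forwarded-Host over Host."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # ASGI headers are a list of (name, value) byte pairs, names lowercased
            headers = list(scope.get("headers") or [])
            forwarded = dict(headers).get(b"x-forwarded-host")
            if forwarded:
                # Drop the proxy's Host header and substitute the forwarded one
                headers = [(k, v) for k, v in headers if k != b"host"]
                headers.append((b"host", forwarded))
                scope = dict(scope, headers=headers)
        await self.app(scope, receive, send)


def host_seen_by_app(request_headers):
    """Run a fake request through the middleware and report the Host it sees."""
    seen = {}

    async def demo_app(scope, receive, send):
        seen["host"] = dict(scope["headers"]).get(b"host")

    scope = {"type": "http", "headers": request_headers}
    asyncio.run(XForwardedHostMiddleware(demo_app)(scope, None, None))
    return seen["host"]
```

Calling host_seen_by_app with both a localhost Host header and an X-Forwarded-Host header should report the forwarded value, which is what lets the wrapped application build correct internal URLs.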

sqlite-utils insert --flatten

Early this week I finally figured out Cloud Run logging. It's actually really good! In doing so, I worked out a convoluted recipe for tailing the JSON logs locally and piping them into a SQLite database so that I could analyze them with Datasette.

Part of the reason it was convoluted is that Cloud Run logs feature nested JSON, but sqlite-utils insert only works against an array of flat JSON objects. I had to use this jq monstrosity to flatten the nested JSON into key/value pairs.

Since I've had to solve this problem a few times now I decided to improve sqlite-utils to have it do the work instead. You can now use the new --flatten option like so:

sqlite-utils insert logs.db logs log.json --flatten

This creates a schema in which nested objects are flattened into topkey_nextkey columns:

CREATE TABLE [logs] (
   [httpRequest_latency] TEXT,
   [httpRequest_requestMethod] TEXT,
   [httpRequest_requestSize] TEXT,
   [httpRequest_status] INTEGER,
   [insertId] TEXT,
   [labels_service] TEXT
);

Full documentation for --flatten.
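To make the topkey_nextkey idea concrete, here is a small Python sketch of that kind of flattening. It is an illustration of the technique, not the sqlite-utils implementation, and the sample log entry values are invented.

```python
def flatten(row, prefix=""):
    """Flatten nested dicts into a single dict with topkey_nextkey keys."""
    flat = {}
    for key, value in row.items():
        full_key = prefix + key
        if isinstance(value, dict):
            # Recurse, joining parent and child keys with an underscore
            flat.update(flatten(value, full_key + "_"))
        else:
            flat[full_key] = value
    return flat


# Invented sample data shaped like a nested log record
log_entry = {
    "httpRequest": {"latency": "0.112s", "requestMethod": "GET", "status": 200},
    "insertId": "abc123",
    "labels": {"service": "datasette"},
}
flattened = flatten(log_entry)
```

Each key in flattened then corresponds to one column in a schema like the one shown above.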

Datasette column metadata

I've been wanting to add this for a while: Datasette's main branch now includes an implementation of column descriptions metadata for Datasette tables. This is best illustrated by a screenshot (of this live demo):

You can add the following to metadata.yml (or .json) to specify descriptions for the columns of a given table:

databases:
  fixtures:
    tables:
      roadside_attractions:
        columns:
          name: The name of the attraction
          address: The street address for the attraction

Column descriptions will be shown in a <dl> at the top of the page, and will also be added to the menu that appears when you click on the cog icon at the top of a column.

Getting closer to query column metadata, too

Datasette lets you execute arbitrary SQL queries, like this one:

select
  roadside_attractions.name,
  roadside_attractions.address,
  attraction_characteristic.name
from
  roadside_attraction_characteristics
  join roadside_attractions on roadside_attractions.pk = roadside_attraction_characteristics.attraction_id
  join attraction_characteristic on attraction_characteristic.pk = roadside_attraction_characteristics.characteristic_id

You can try that here. It returns the following:

name	address	name
The Mystery Spot	465 Mystery Spot Road, Santa Cruz, CA 95065	Paranormal
Winchester Mystery House	525 South Winchester Boulevard, San Jose, CA 95128	Paranormal
Bigfoot Discovery Museum	5497 Highway 9, Felton, CA 95018	Paranormal
Burlingame Museum of PEZ Memorabilia	214 California Drive, Burlingame, CA 94010	Museum
Bigfoot Discovery Museum	5497 Highway 9, Felton, CA 95018	Museum

The columns it returns have names... but I've long wanted to do more with these results. If I could derive which source columns each of those output columns were, there are a bunch of interesting things I could do, most notably:

If the output column is a known foreign key relationship, I could turn it into a hyperlink (as seen on this table page)
If the original table column has the new column metadata, I could display that as additional documentation

The challenge is: given an arbitrary SQL query, how can I figure out what the resulting columns are going to be and how to tie those back to the original tables?

Thanks to a hint from the SQLite forum I'm getting tantalizingly close to a solution.

The trick is to horribly abuse SQLite's explain output. Here's what it looks like for the example query above:

addr	opcode	p1	p2	p3	p4	p5
0	Init	0	15	0		0
1	OpenRead	0	47	0	2	0
2	OpenRead	1	45	0	3	0
3	OpenRead	2	46	0	2	0
4	Rewind	0	14	0		0
5	Column	0	0	1		0
6	SeekRowid	1	13	1		0
7	Column	0	1	2		0
8	SeekRowid	2	13	2		0
9	Column	1	1	3		0
10	Column	1	2	4		0
11	Column	2	1	5		0
12	ResultRow	3	3	0		0
13	Next	0	5	0		1
14	Halt	0	0	0		0
15	Transaction	0	0	35	0	1
16	Goto	0	1	0		0

The magic is on line 12: ResultRow 3 3 means "return a result that spans three columns, starting at register 3" - so that's registers 3, 4 and 5. Those three registers are populated by the Column operations on lines 9, 10 and 11 (the register they write into is in the p3 column). Each Column operation specifies the table (as p1) and the column index within that table (p2). And those table references map back to the OpenRead lines at the start, where p1 is that table register (referred to by Column) and p2 is the root page of the table within the schema.

Running select rootpage, name from sqlite_master where rootpage in (45, 46, 47) produces the following:

rootpage	name
45	roadside_attractions
46	attraction_characteristic
47	roadside_attraction_characteristics

Tie all of this together, and it may be possible to use explain to derive the original tables and columns for each of the outputs of an arbitrary query!
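Here is a rough Python sketch of that idea using the standard library sqlite3 module. The output_column_sources function name and the two-column demo table are assumptions made for this example. It only understands the OpenRead, Column and ResultRow opcodes, and it depends on bytecode details that can differ between SQLite releases, so treat it as an experiment rather than a working solution.

```python
import sqlite3


def output_column_sources(conn, sql):
    """Guess the (table, column) behind each output column of a simple query."""
    cursor_to_rootpage = {}  # OpenRead: p1 = cursor number, p2 = root page
    register_to_source = {}  # Column: p3 = register, (p1, p2) = (cursor, column index)
    result_registers = []
    for row in conn.execute("explain " + sql):
        opcode, p1, p2, p3 = row[1], row[2], row[3], row[4]
        if opcode == "OpenRead":
            cursor_to_rootpage[p1] = p2
        elif opcode == "Column":
            register_to_source[p3] = (p1, p2)
        elif opcode == "ResultRow":
            # p1 = first register, p2 = number of registers in the result
            result_registers = list(range(p1, p1 + p2))

    rootpage_to_table = dict(
        conn.execute("select rootpage, name from sqlite_master where type = 'table'")
    )
    sources = []
    for register in result_registers:
        source = register_to_source.get(register)
        if source is None:
            # Populated by an opcode this sketch does not decode
            sources.append(None)
            continue
        cursor, column_index = source
        table = rootpage_to_table[cursor_to_rootpage[cursor]]
        columns = [r[1] for r in conn.execute(f"pragma table_info([{table}])")]
        sources.append((table, columns[column_index]))
    return sources


conn = sqlite3.connect(":memory:")
conn.execute("create table roadside_attractions (name text, address text)")
sources = output_column_sources(
    conn, "select address, name from roadside_attractions"
)
```

Any output register that was not filled by a Column opcode comes back as None, which is the sketch's way of admitting it could not trace that column.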

I was almost ready to declare victory, until I tried running it against a query with an order by column at the end... and the results no longer matched up.

You can follow my ongoing investigation here - the short version is that I think I'm going to have to learn to decode a whole bunch more opcodes before I can get this to work.

This is also a very risky way of attacking this problem. The SQLite documentation for the bytecode engine includes the following warning:

This document describes SQLite internals. The information provided here is not needed for routine application development using SQLite. This document is intended for people who want to delve more deeply into the internal operation of SQLite.

The bytecode engine is not an API of SQLite. Details about the bytecode engine change from one release of SQLite to the next. Applications that use SQLite should not depend on any of the details found in this document.

So it's pretty clear that this is a highly unsupported way of working with SQLite!

I'm still tempted to try it though. This feature is very much a nice-to-have: if it breaks and the additional column context stops displaying it's not a critical bug - and hopefully I'll be able to ship a Datasette update that takes into account those breaking SQLite changes relatively shortly afterwards.

If I can find another, more supported way to solve this I'll jump on it!

In the meantime, I did use this technique to solve a simpler problem. Datasette extracts :named parameters from arbitrary SQL queries and turns them into form fields - but since it uses a simple regular expression for this it could be confused by things like a literal 00:04:05 time string contained in a SQL query.

The explain output for that query includes the following:

addr	opcode	p1	p2	p3	p4	p5	comment
...	...	...	...	...	...	...	...
27	Variable	1	12	0	:text	0

So I wrote some code which uses explain to extract just the p4 operands from Variable opcodes and treats those as the extracted parameters! This feels a lot safer than the more complex ResultRow/Column logic - and it also falls back to the regular expression if it runs into any SQL errors. More in the issue.
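A minimal sketch of that approach, using the standard library sqlite3 module, might look like the following. This is not Datasette's actual code: the derive_named_parameters name and the regular expression are assumptions for illustration. Because the bytecode output is not a stable API, the sketch adds one more fallback of its own: if a Variable row has no name in p4, it returns the regular expression matches instead.

```python
import re
import sqlite3

# Deliberately naive pattern, standing in for the "simple regular expression"
named_parameter_re = re.compile(r":([a-zA-Z0-9_]+)")


def derive_named_parameters(conn, sql):
    """Return the :named parameters in sql, preferring explain over the regex."""
    candidates = named_parameter_re.findall(sql)
    try:
        # Bind None to every candidate so the explain statement can run
        rows = conn.execute(
            "explain " + sql.strip().rstrip(";"),
            {name: None for name in candidates},
        ).fetchall()
    except sqlite3.DatabaseError:
        # Any SQL error: fall back to the regular expression matches
        return candidates
    # Columns are addr, opcode, p1, p2, p3, p4, p5, comment
    names = [row[5] for row in rows if row[1] == "Variable"]
    if any(not name for name in names):
        # This SQLite build does not expose parameter names in p4
        return candidates
    return [name.lstrip(":") for name in names]
```

A query that contains only a time literal such as '00:04:05' produces no Variable rows at all, so the function returns an empty list instead of treating 04 and 05 as parameters.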

Releases this week

datasette-x-forwarded-host: 0.1 - 2021-08-12
Treat the X-Forwarded-Host header as the Host header
sqlite-utils: 3.15.1 - (84 releases total) - 2021-08-10
Python CLI utility and library for manipulating SQLite databases
datasette-query-links: 0.1.2 - (3 releases total) - 2021-08-09
Turn SELECT queries returned by a query into links to execute them
datasette: 0.59a1 - (96 releases total) - 2021-08-09
An open source multi-tool for exploring and publishing data
datasette-pyinstrument: 0.1 - 2021-08-08
Use pyinstrument to analyze Datasette page performance

Tags: documentation, github, sql, sqlite, datasette, weeknotes, sqlite-utils, mypy, github-codespaces

Adding Sphinx autodoc to a project, and configuring Read The Docs to build it

2021-08-11T01:21:28+00:00

Adding Sphinx autodoc to a project, and configuring Read The Docs to build it

My TIL notes from figuring out how to use sphinx-autodoc for the sqlite-utils reference documentation today.

Tags: sqlite-utils, documentation, sphinx-docs, read-the-docs

sqlite-utils API reference

2021-08-11T01:03:33+00:00

sqlite-utils API reference

I released sqlite-utils 3.15.1 today with just one change, but it’s a big one: I’ve added docstrings and type annotations to nearly every method in the library, and I’ve started using sphinx-autodoc to generate an API reference page in the documentation directly from those docstrings. I’ve deliberately avoided building this kind of documentation in the past because I so often see projects where the class reference is the ONLY documentation, which I find makes it really hard to figure out how to actually use it. sqlite-utils already has extensive narrative prose documentation so in this case I think it’s a useful enhancement—especially since the docstrings and type hints can help improve the usability of the library in IDEs and Jupyter notebooks.

Via sqlite-utils 3.15.1 release notes

Tags: sqlite-utils, documentation, python, sphinx-docs

Design Docs at Google

2020-08-07T16:31:14+00:00

Design Docs at Google

Useful description of the format used for software design docs at Google—informal documents of between 3 and 20 pages that outline the proposed design of a new project, discuss trade-offs that were considered and solicit feedback before the code starts to be written.

Tags: google, documentation

The unofficial Google Cloud Run FAQ

2020-07-22T17:20:20+00:00

The unofficial Google Cloud Run FAQ

This is really useful: a no-fluff, content rich explanation of Google Cloud Run hosted as a GitHub repo that actively accepts pull requests from the community. It’s maintained by Ahmet Alp Balkan, a Cloud Run engineer who states “Googlers: If you find this repo useful, you should recognize the work internally, as I actively fight for alternative forms of content like this”. One of the hardest parts of working with AWS and GCP is digging through the marketing materials to figure out what the product actually does, so the more alternative forms of documentation like this the better.

Tags: cloudrun, google, documentation

How to find what you want in the Django documentation

2020-07-03T15:04:33+00:00

How to find what you want in the Django documentation

Useful guide by Matthew Segal to navigating the Django documentation, and tips for reading documentation in general. The Django docs have a great reputation so it’s easy to forget how intimidating they can be for newcomers: Matthew emphasizes that docs are rarely meant to be read in full: the trick is learning how to quickly search them for the things you need to understand right now.

Via Django News issue 30

Tags: documentation, django

Quoting Daniele Procida

2019-08-03T08:29:08+00:00

Documentation needs to include and be structured around its four different functions: tutorials, how-to guides, explanation and technical reference. Each of them requires a distinct mode of writing. People working with software need these four different kinds of documentation at different times, in different circumstances - so software usually needs them all.

— Daniele Procida

Tags: documentation, diataxis

The subset of reStructuredText worth committing to memory

2018-08-25T18:44:29+00:00

reStructuredText is the standard for documentation in the Python world.

It’s a bit weird. It’s like Markdown but older, more feature-filled and in my experience significantly harder to remember.

There are plenty of guides and cheatsheets out there, but when writing simple documentation for software projects I think there’s a subset that is worth committing to memory. I’ll describe that subset here.

First though: when writing reStructuredText having a live preview render is extremely useful. I use rst.ninjs.org for this. If you don’t trust that hosted version (it round-trips your documentation through the server in order to render it) you can run a local copy instead using the underlying source code.

Paragraphs

Paragraphs work the same way as Markdown and plain text. They are nice and easy.

This is the first paragraph. No need to wrap the text (though you can wrap at e.g. 80 characters without affecting rendering).

This is the second paragraph.

Headings

reStructuredText section headings are a little surprising.

Markdown has multiple levels of heading, each with a different number of prefix hashes:

# Markdown heading level 1
## Markdown heading level 2
..
###### Markdown heading level 6

In reStructuredText there is no single format for these different levels. Instead, the format you use first will be treated as an H1, the next format as an H2 and so on. Here’s the description from the official documentation:

Sections are identified through their titles, which are marked up with adornment: “underlines” below the title text, or underlines and matching “overlines” above the title. An underline/overline is a single repeated punctuation character that begins in column 1 and forms a line extending at least as far as the right edge of the title text. Specifically, an underline/overline character may be any non-alphanumeric printable 7-bit ASCII character. […] There may be any number of levels of section titles, although some output formats may have limits (HTML has 6 levels).

This is deeply confusing. I suggest instead standardizing on the following:

=====================
 This is a heading 1
=====================

This heading has = signs both above and below, and they extend past the text by a single character in each direction.

This is a heading 2
===================

This is a heading 3
-------------------

This is a heading 4
~~~~~~~~~~~~~~~~~~~

If you need more levels, you can invent them using whatever character you like - but try to stay consistent within your project.
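If you generate documentation from a script, the convention above is easy to encode. This is a small Python sketch with a hypothetical rst_heading helper (not part of any library) that produces adornment lines of the right length for levels 1 to 4.

```python
# Underline characters for levels 2-4, following the convention above
UNDERLINES = {2: "=", 3: "-", 4: "~"}


def rst_heading(title, level):
    """Return a reStructuredText heading for the given level (1-4)."""
    if level == 1:
        # Level 1: = above and below, one character wider on each side
        bar = "=" * (len(title) + 2)
        return f"{bar}\n {title}\n{bar}"
    # Other levels: a single underline exactly as long as the title
    return f"{title}\n{UNDERLINES[level] * len(title)}"
```

Deriving the adornment from len(title) means the underline always extends at least as far as the title text, which is the one rule the specification insists on.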

Bulleted lists

As with headings, you can use a variety of characters for these. I suggest sticking with asterisks.

A blank line is required before starting a bulleted list.

* A bullet point
* Another bullet point

If you decide to wrap your text (I tend not to) you must maintain the indentation on the wrapped lines:

* A bulleted list item. Since the text is wrapped each subsequent
  line of text must be indented by two spaces.
* Second list item.

Nested lists are supported, but you MUST leave a blank line above the first inner list bullet point or they won't work:

* This is the first bullet list item. Here comes a sub-list:

  * Hello sublist
  * Sublist two

* Back to the parent list.

Inline markup

I only use three inline markup features: bold, italic and code.

**Bold text** is surrounded by two asterisks.

*Italic text* is one asterisk.

``inline code`` uses two backticks at either side of the code.

Links

Links are my least favorite feature of reStructuredText. There are several different ways of including them, but the one I use most often (and hence have committed to memory) is this one:

`a link, note the trailing underscores <http://example.com>`__

So that’s a backtick at the start, then the link text, then the URL contained in greater than / less than symbols, then another backtick and then TWO underscores to finish it off.

Why two underscores? Because if you only use one, the text part of the link is remembered and can be used to duplicate your link later on - see example below. In my experience this is more trouble than it’s worth.

A more complex link syntax example (documented here) looks like this:

See the `Python home page`_ for info.

This link_ is an alias to the link above.

.. _Python home page: http://www.python.org
.. _link: `Python home page`_

I can’t remember this at all, so I stick with the anonymous hyperlink syntax instead.

Code blocks

The easiest way to embed a block of code is like this:

::

    # This is a code example
    print("It needs to be indented")

The :: indicates that a code block is coming up. The blank line after the :: before the indentation starts is required.

Most renderers have the ability to apply syntax highlighting. To specify that a block should have syntax highlighting for a specific language, replace the :: in the above example with one of the following:

.. code-block:: sql

.. code-block:: javascript

.. code-block:: python

Images

There are plenty of options for embedding images, but the most basic syntax (worth remembering) looks like this:

.. image:: full_text_search.png
   :alt: alternate text

This will embed an image of that filename that sits in the same directory as the document itself.

Internal references

In my opinion this is the key feature that makes reStructuredText more powerful than Markdown for larger documentation projects.

Again, there is a vast and complex array of options around this, but the key thing to remember is how to add a reference name to a specific section and how to link to that section later on.

Names are applied to section headings, by adding some magic text before the heading itself. For example:

.. _full_text_search:

Full-text search
================

Note the format: two periods, then a space, then an underscore, then the label, then a colon at the end.

The label full_text_search is now associated with that heading. I can link to it from any page in my documentation project like so:

:ref:`full_text_search`

Note that the leading underscore isn’t included in this reference.

The link text displayed will be the text of the heading, in this case “Full-text search”. If I want to replace that link text with something custom, I can do so like this:

Learn about the :ref:`search feature <full_text_search>`.

This syntax is similar to the inline hyperlink syntax described above.

Learning more

I extracted the patterns I describe in this post from the Datasette documentation - I encourage you to dig around in the source code to see how it all works.

The definitive guide to reStructuredText is the reStructuredText Markup Specification. My favourite of the various quick references is the Restructured Text (reST) and Sphinx CheatSheet by Thomas Cokelaer.

I'm a huge fan of Read the Docs for hosting documentation - it's the key reason I use reStructuredText in my projects. Unsurprisingly, they offer extensive documentation to help you make the most of their platform.

Tags: documentation, python, restructuredtext, sphinx-docs, read-the-docs

Honeycomb changelog

2018-08-25T03:12:04+00:00

Honeycomb changelog

Too few hosted services have detailed user-facing changelogs. This one from Honeycomb (a metrics, tracing and observability platform) is a particularly great example. I especially like the use of animated screenshots, something I’ve been evangelizing pretty heavily recently for internal communication at work.

Via @michaelwilde

Tags: documentation

Documentation unit tests

2018-07-28T15:59:55+00:00

Or: Test-driven documentation.

Keeping documentation synchronized with an evolving codebase is difficult. Without extreme discipline, it’s easy for documentation to get out-of-date as new features are added.

One thing that can help is keeping the documentation for a project in the same repository as the code itself. This allows you to construct the ideal commit: one that includes the code change, the updated unit tests AND the accompanying documentation all in the same unit of work.

When combined with a code review system (like Phabricator or GitHub pull requests) this pattern lets you enforce documentation updates as part of the review process: if a change doesn’t update the relevant documentation, point that out in your review!

Good code review systems also execute unit tests automatically and attach the results to the review. This provides an opportunity to have the tests enforce other aspects of the codebase: for example, running a linter so that no-one has to waste their time arguing over coding style.

I’ve been experimenting with using unit tests to ensure that aspects of a project are covered by the documentation. I think it’s a very promising technique.

Introspect the code, introspect the docs

The key to this trick is introspection: interrogating the code to figure out what needs to be documented, then parsing the documentation to see if each item has been covered.

I’ll use my Datasette project as an example. Datasette’s test_docs.py module contains three relevant tests:

test_config_options_are_documented checks that every one of Datasette’s configuration options is documented.
test_plugin_hooks_are_documented ensures all of the plugin hooks (powered by pluggy) are covered in the plugin documentation.
test_view_classes_are_documented iterates through all of the *View classes (corresponding to pages in the Datasette user interface) and makes sure they are covered.

In each case, the test uses introspection against the relevant code areas to figure out what needs to be documented, then runs a regular expression against the documentation to make sure it is mentioned in the correct place.

Obviously the tests can’t confirm the quality of the documentation, so they are easy to cheat: but they do at least protect against adding a new option but forgetting to document it.

Testing that Datasette’s view classes are covered

Datasette’s view classes use a naming convention: they all end in View. The current list of view classes is DatabaseView, TableView, RowView, IndexView and JsonDataView.

Since these classes are all imported into the datasette.app module (in order to be hooked up to URL routes) the easiest way to introspect them is to import that module, then run dir(app) and grab any class names that end in View. We can do that with a Python list comprehension:

from datasette import app
views = [v for v in dir(app) if v.endswith("View")]
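
One caveat with a name-only check: dir() returns strings for every attribute of the module, so a function or constant whose name happens to end in View would be picked up too. Here is a minimal sketch of a stricter variant using inspect.isclass. The app module below is a hypothetical in-memory stand-in for datasette.app, and render_View is an invented name that exists only to show the difference.

```python
import inspect
import types

# Hypothetical stand-in for datasette.app, built in memory so the
# example runs without Datasette installed.
app = types.ModuleType("app")


class TableView:
    pass


class RowView:
    pass


app.TableView = TableView
app.RowView = RowView
# Invented for illustration: a function whose name also ends in "View".
app.render_View = lambda: None

# Name-only check: matches the function as well as the classes.
by_name = [v for v in dir(app) if v.endswith("View")]

# Stricter check: keep only names that are bound to actual classes.
classes_only = [
    name
    for name, obj in inspect.getmembers(app, inspect.isclass)
    if name.endswith("View")
]

print(by_name)       # ['RowView', 'TableView', 'render_View']
print(classes_only)  # ['RowView', 'TableView']
```

For a module that follows the naming convention strictly, the two approaches give the same answer and the shorter one is fine.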

I’m using reStructuredText labels to mark the place in the documentation that addresses each of these classes. This also ensures that each documentation section can be linked to, for example:

http://datasette.readthedocs.io/en/latest/pages.html#tableview

The reStructuredText syntax for that label looks like this:

.. _TableView:

Table
=====

The table page is the heart of Datasette...

We can extract these labels using a regular expression:

from pathlib import Path
import re

docs_path = Path(__file__).parent.parent / 'docs'
label_re = re.compile(r'\.\. _([^\s:]+):')

def get_labels(filename):
    contents = (docs_path / filename).read_text()  # read_text() also closes the file
    return set(label_re.findall(contents))
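
To see what that regular expression captures, here is a small self-contained sketch that runs it against an in-memory string instead of a file. The sample text is a made-up fragment, not a copy of Datasette's documentation.

```python
import re

label_re = re.compile(r'\.\. _([^\s:]+):')

# A hypothetical fragment of a .rst file containing two labels.
sample = """\
.. _TableView:

Table
=====

The table page is the heart of Datasette...

.. _RowView:

Row
===
"""

labels = set(label_re.findall(sample))
print(sorted(labels))  # ['RowView', 'TableView']

# The same pattern also matches named hyperlink targets, which share
# the ".. _name:" prefix, so those show up in the results as well.
targets = label_re.findall(".. _Python: https://www.python.org/\n")
print(targets)  # ['Python']
```

Picking up hyperlink targets is harmless here, because a later step only keeps names that end in View.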

Since Datasette’s documentation is spread across multiple *.rst files, and I want the freedom to document a view class in any one of them, I iterate through every file to find the labels and pull out the ones ending in View:

def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
                view_labels.add(first_word)
    return view_labels

We now have a set of class names and a set of labels across all of our documentation. Writing a basic unit test comparing the two sets is trivial:

def test_view_documentation():
    view_labels = documented_views()
    view_classes = set(v for v in dir(app) if v.endswith("View"))
    assert view_labels == view_classes
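
A single equality assertion says that the two sets differ but leaves you to work out how. As a sketch of one possible refinement, set differences can name the offending items in each direction. The compare helper and the sample data are hypothetical, not part of Datasette's test suite.

```python
def compare(view_classes, view_labels):
    """Return (undocumented, stale) as two sorted lists of names."""
    undocumented = sorted(view_classes - view_labels)  # in the code, not the docs
    stale = sorted(view_labels - view_classes)         # in the docs, not the code
    return undocumented, stale


# Hypothetical data: TableView has lost its label, and the docs
# contain a Table2View label that matches no class.
view_classes = {"DatabaseView", "IndexView", "RowView", "TableView"}
view_labels = {"DatabaseView", "IndexView", "RowView", "Table2View"}

undocumented, stale = compare(view_classes, view_labels)
print(undocumented)  # ['TableView']
print(stale)         # ['Table2View']
```

A test could then assert that both lists are empty and include them in the assertion message.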

Taking advantage of pytest

Datasette uses pytest for its unit tests, and documentation unit tests are a great opportunity to take advantage of some advanced pytest features.

Parametrization

The first of these is parametrization: pytest provides a decorator which can be used to execute a single test function multiple times, each time with different arguments.

This example from the pytest documentation shows how parametrization works:

import pytest
@pytest.mark.parametrize("test_input,expected", [
    ("3+5", 8),
    ("2+4", 6),
    ("6*9", 42),
])
def test_eval(test_input, expected):
    assert eval(test_input) == expected

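pytest treats this as three separate unit tests, even though they share a single function definition. The third case fails on purpose (6*9 is 54, not 42), which shows that one failing case is reported without affecting the other two.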

We can combine this pattern with our introspection to execute an independent unit test for each of our view classes. Here’s what that looks like:

@pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
def test_view_classes_are_documented(view):
    assert view in documented_views()

Here’s the output from pytest if we execute just this unit test (and one of our classes is undocumented):

$ pytest -k test_view_classes_are_documented -v
=== test session starts ===
collected 249 items / 244 deselected

tests/test_docs.py::test_view_classes_are_documented[DatabaseView] PASSED [ 20%]
tests/test_docs.py::test_view_classes_are_documented[IndexView] PASSED [ 40%]
tests/test_docs.py::test_view_classes_are_documented[JsonDataView] PASSED [ 60%]
tests/test_docs.py::test_view_classes_are_documented[RowView] PASSED [ 80%]
tests/test_docs.py::test_view_classes_are_documented[TableView] FAILED [100%]

=== FAILURES ===

view = 'TableView'

    @pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
    def test_view_classes_are_documented(view):
>       assert view in documented_views()
E       AssertionError: assert 'TableView' in {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'}
E        +  where {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'} = documented_views()

tests/test_docs.py:77: AssertionError
=== 1 failed, 4 passed, 244 deselected in 1.13 seconds ===

Fixtures

There’s a subtle inefficiency in the above test: for every view class, it calls the documented_views() function - and that function then iterates through every *.rst file in the docs/ directory and uses a regular expression to extract the labels. With 5 view classes and 17 documentation files that’s 85 executions of get_labels(), and that number will only increase as Datasette’s code and documentation grow larger.
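
To make the arithmetic concrete, here is a rough sketch that counts the calls. The get_labels() and documented_views() functions below are simplified stand-ins that record invocations instead of reading files, and the filenames are invented.

```python
calls = []  # records every get_labels() invocation


def get_labels(filename):
    # Stand-in: note the call instead of parsing a real file.
    calls.append(filename)
    return {"TableView"}


def documented_views(filenames):
    labels = set()
    for filename in filenames:
        labels |= get_labels(filename)
    return labels


filenames = [f"page{i}.rst" for i in range(17)]
view_classes = ["DatabaseView", "IndexView", "JsonDataView", "RowView", "TableView"]

# One documented_views() call per view class, as in the test above.
for view in view_classes:
    documented_views(filenames)
calls_per_test = len(calls)

# Compute the result once and share it across all five checks.
calls.clear()
shared = documented_views(filenames)
for view in view_classes:
    found = view in shared  # membership checks no longer touch the files
calls_shared = len(calls)

print(calls_per_test, calls_shared)  # 85 17
```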

We can use pytest’s neat fixtures to reduce this to a single call to documented_views() that is shared across all of the tests. Here’s what that looks like:

@pytest.fixture(scope="session")
def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
                view_labels.add(first_word)
    return view_labels

@pytest.mark.parametrize("view_class", [
    v for v in dir(app) if v.endswith("View")
])
def test_view_classes_are_documented(documented_views, view_class):
    assert view_class in documented_views

Fixtures in pytest are an example of dependency injection: pytest introspects every test_* function and checks if it has a function argument with a name matching something that has been annotated with the @pytest.fixture decorator. If it finds any matching arguments, it executes the matching fixture function and passes its return value in to the test function.

By default, pytest will execute the fixture function once for every test execution. In the above code we use the scope="session" argument to tell pytest that this particular fixture should be executed only once for every pytest command-line execution of the tests, and that single return value should be passed to every matching test.

What if you haven’t documented everything yet?

Adding unit tests to your documentation in this way faces an obvious problem: when you first add the tests, you may have to write a whole lot of documentation before they can all pass.

Having tests that protect against future code being added without documentation is only useful once you’ve added them to the codebase - but blocking that on documenting your existing features could prevent that benefit from ever manifesting itself.

Once again, pytest to the rescue. The @pytest.mark.xfail decorator allows you to mark a test as “expected to fail” - if it fails, pytest will take note but will not fail the entire test suite.

This means you can add deliberately failing tests to your codebase without breaking the build for everyone - perfect for tests that look for documentation that hasn’t yet been written!
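
As a minimal sketch of how that might look for a parametrized documentation test, the decorator can be stacked on top of parametrize. VIEW_CLASSES and DOCUMENTED are hypothetical stand-ins for the introspected class names and the labels parsed from the docs.

```python
import pytest

# Hypothetical stand-ins for the introspected classes and parsed labels.
VIEW_CLASSES = ["DatabaseView", "RowView", "TableView"]
DOCUMENTED = {"DatabaseView", "RowView"}


# xfail applies to every parametrized case: undocumented views are
# reported as xfailed, already-documented ones as xpassed.
@pytest.mark.xfail(reason="view documentation is still being written")
@pytest.mark.parametrize("view", VIEW_CLASSES)
def test_view_classes_are_documented(view):
    assert view in DOCUMENTED
```

With pytest's default non-strict xfail behaviour, neither outcome fails the test run, so the marker can stay in place until every case passes.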

I used xfail when I first added view documentation tests to Datasette, then removed it once the documentation was all in place. Any future code in pull requests without documentation will cause a hard test failure.

Here’s what the test output looks like when some of those tests are marked as “expected to fail”:

$ pytest tests/test_docs.py
collected 31 items

tests/test_docs.py ..........................XXXxx.                [100%]

============ 26 passed, 2 xfailed, 3 xpassed in 1.06 seconds ============

Since this reports both the xfailed and the xpassed counts, it shows how much work is still left to be done before the xfail decorator can be safely removed.

Structuring code for testable documentation

A benefit of comprehensive unit testing is that it encourages you to design your code in a way that is easy to test. In my experience this leads to much higher code quality in general: it encourages separation of concerns and cleanly decoupled components.

My hope is that documentation unit tests will have a similar effect. I’m already starting to think about ways of restructuring my code such that I can cleanly introspect it for the areas that need to be documented. I’m looking forward to discovering code design patterns that help support this goal.

Tags: documentation, restructuredtext, testing, datasette, pytest

SpatiaLite — Datasette documentation

2018-05-30T04:34:06+00:00

SpatiaLite — Datasette documentation

Datasette’s documentation now includes extensive coverage of the SpatiaLite extension for SQLite: how to install it, how to import latitude/longitude points, shapefiles and GeoJSON data into SpatiaLite tables, and how to run SQL queries against it that take advantage of spatial indexes. I’m learning SpatiaLite at the moment and filling out the documentation with each new trick I learn as I go—as Mark Pilgrim once taught me, the best way to learn a new technology is to write about it.

Tags: sqlite, spatialite, datasette, mark-pilgrim, documentation