<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: s3-credentials</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/s3-credentials.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-16T23:40:31+00:00</updated><author><name>Simon Willison</name></author><entry><title>s3-credentials 0.17</title><link href="https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag" rel="alternate"/><published>2025-12-16T23:40:31+00:00</published><updated>2025-12-16T23:40:31+00:00</updated><id>https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.17"&gt;s3-credentials 0.17&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; CLI tool for managing credentials needed to access just one S3 bucket. Here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New commands &lt;code&gt;get-bucket-policy&lt;/code&gt; and &lt;code&gt;set-bucket-policy&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/91"&gt;#91&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New commands &lt;code&gt;get-public-access-block&lt;/code&gt; and &lt;code&gt;set-public-access-block&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/92"&gt;#92&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;localserver&lt;/code&gt; command for starting a web server that makes time-limited credentials accessible via a JSON API. &lt;a href="https://github.com/simonw/s3-credentials/pull/93"&gt;#93&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;code&gt;s3-credentials localserver&lt;/code&gt; command (&lt;a href="https://s3-credentials.readthedocs.io/en/stable/localserver.html"&gt;documented here&lt;/a&gt;) is a little obscure, but I found myself wanting something like that to help me test out a new feature I'm building to help create temporary Litestream credentials using Amazon STS.&lt;/p&gt;
&lt;p&gt;Most of that new feature was &lt;a href="https://gistpreview.github.io/?500add71f397874ebadb8e04e8a33b53"&gt;built by Claude Code&lt;/a&gt; from the following starting prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add a feature s3-credentials localserver which starts a localhost weberver running (using the Python standard library stuff) on port 8094 by default but -p/--port can set a different port and otherwise takes an option that names a bucket and then takes the same options for read--write/read-only etc as other commands. It also takes a required --refresh-interval option which can be set as 5m or 10h or 30s. All this thing does is reply on / to a GET request with the IAM expiring credentials that allow access to that bucket with that policy for that specified amount of time. It caches internally the credentials it generates and will return the exact same data up until they expire (it also tracks expected expiry time) after which it will generate new credentials (avoiding dog pile effects if multiple requests ask at the same time) and return and cache those instead.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
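The caching behaviour described in that prompt (return the same credentials until they expire, then regenerate exactly once even under concurrent requests) can be sketched in stdlib Python. This is a hypothetical illustration, not the code Claude Code actually produced; `fetch` stands in for the real STS call and `CredentialCache` is a name invented here:

```python
import threading
import time


class CredentialCache:
    """Serve cached credentials until shortly before they expire.

    Holding a lock across the regeneration step means only one
    caller triggers a new fetch when the cache is stale, which is
    what avoids the dog-pile effect the prompt describes.
    """

    def __init__(self, fetch, refresh_interval_s):
        self._fetch = fetch  # stand-in for the real STS call
        self._interval = refresh_interval_s
        self._lock = threading.Lock()
        self._cached = None
        self._expires_at = 0.0

    def get(self):
        with self._lock:
            now = time.monotonic()
            if self._cached is None or now >= self._expires_at:
                self._cached = self._fetch()
                self._expires_at = now + self._interval
            return self._cached
```

Every HTTP request handler would then just return `cache.get()` as JSON; the lock makes the cache safe to share across the threaded stdlib server.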


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="ai"/><category term="annotated-release-notes"/><category term="s3-credentials"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>Poe the Poet</title><link href="https://simonwillison.net/2025/Dec/16/poe-the-poet/#atom-tag" rel="alternate"/><published>2025-12-16T22:57:02+00:00</published><updated>2025-12-16T22:57:02+00:00</updated><id>https://simonwillison.net/2025/Dec/16/poe-the-poet/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://poethepoet.natn.io/"&gt;Poe the Poet&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was looking for a way to specify additional commands in my &lt;code&gt;pyproject.toml&lt;/code&gt; file to execute using &lt;code&gt;uv&lt;/code&gt;. There's an &lt;a href="https://github.com/astral-sh/uv/issues/5903"&gt;enormous issue thread&lt;/a&gt; on this in the &lt;code&gt;uv&lt;/code&gt; issue tracker (300+ comments dating back to August 2024) and from there I learned of several options including this one, Poe the Poet.&lt;/p&gt;
&lt;p&gt;It's neat. I added it to my &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; project just now and the following now works for running the live preview server for the documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run poe livehtml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the snippet of TOML I added to my &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;[&lt;span class="pl-en"&gt;dependency-groups&lt;/span&gt;]
&lt;span class="pl-smi"&gt;test&lt;/span&gt; = [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pytest&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pytest-mock&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cogapp&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;moto&amp;gt;=5.0.4&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]
&lt;span class="pl-smi"&gt;docs&lt;/span&gt; = [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;furo&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-autobuild&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;myst-parser&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cogapp&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]
&lt;span class="pl-smi"&gt;dev&lt;/span&gt; = [
    {&lt;span class="pl-smi"&gt;include-group&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;test&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    {&lt;span class="pl-smi"&gt;include-group&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;docs&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;poethepoet&amp;gt;=0.38.0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]

[&lt;span class="pl-en"&gt;tool&lt;/span&gt;.&lt;span class="pl-en"&gt;poe&lt;/span&gt;.&lt;span class="pl-en"&gt;tasks&lt;/span&gt;]
&lt;span class="pl-smi"&gt;docs&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-build -M html docs docs/_build&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-smi"&gt;livehtml&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-autobuild -b html docs docs/_build&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-smi"&gt;cog&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cog -r docs/*.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Since &lt;code&gt;poethepoet&lt;/code&gt; is in the &lt;code&gt;dev&lt;/code&gt; dependency group, any time I run &lt;code&gt;uv run ...&lt;/code&gt; it will be available in the environment.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/packaging"&gt;packaging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="packaging"/><category term="python"/><category term="s3-credentials"/><category term="uv"/></entry><entry><title>s3-credentials 0.16</title><link href="https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag" rel="alternate"/><published>2024-04-05T05:35:57+00:00</published><updated>2024-04-05T05:35:57+00:00</updated><id>https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.16"&gt;s3-credentials 0.16&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a &lt;code&gt;PublicAccessBlockConfiguration&lt;/code&gt; of &lt;code&gt;{"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;s3-credentials --create-bucket --public&lt;/code&gt; option now does that for you. I also added an &lt;code&gt;s3-credentials debug-bucket name-of-bucket&lt;/code&gt; command to help figure out why a bucket isn't working as expected.&lt;/p&gt;
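For reference, applying that configuration with boto3 looks roughly like the sketch below. `put_public_access_block` is the real boto3 S3 client method, but `apply_public_access` and `PUBLIC_ACCESS_CONFIG` are names invented here for illustration; the function takes any client object exposing that method, so you can exercise it with a stub instead of a live AWS client:

```python
# The four flags that all need to be False before public bucket
# policies and ACLs take effect on a new bucket.
PUBLIC_ACCESS_CONFIG = {
    "BlockPublicAcls": False,
    "IgnorePublicAcls": False,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": False,
}


def apply_public_access(s3_client, bucket):
    # With a real client this would be: boto3.client("s3")
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_CONFIG,
    )
```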


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
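The `--convert` body is ordinary Python, so the same reduction can be pulled out into a standalone function. This sketch loads the snapshot once rather than twice as the inline version does, but produces the same single-row result:

```python
import json


def convert(content):
    # Collapse a whole instances.json snapshot into one row with
    # id "all", so git-history tracks a single object over time.
    instances = json.loads(content)
    return [
        {
            "id": "all",
            "users": sum(d["users"] or 0 for d in instances),
            "statuses": sum(int(d["statuses"] or 0) for d in instances),
        }
    ]
```

Note the `or 0` guards: some instances report `null` for `users` or `statuses`, and `statuses` arrives as a string, so each value is coerced before summing.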
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
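That join can be tried out directly with Python's built-in sqlite3 module. This sketch uses a simplified version of the schema above and hand-built rows rather than the real counts.db:

```python
import sqlite3

# In-memory stand-in for counts.db with one commit and one snapshot row.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    hash TEXT,
    commit_at TEXT
);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY,
    _commit INTEGER REFERENCES commits(id),
    users INTEGER,
    statuses INTEGER
);
INSERT INTO commits VALUES (1, 'abc123', '2022-11-20 01:00:00');
INSERT INTO item_version VALUES (1, 1, 4694000, 374000000);
""")

# Join each version row to its commit to recover the date of the count.
rows = db.execute("""
    SELECT commits.commit_at, item_version.users, item_version.statuses
    FROM item_version
    JOIN commits ON item_version._commit = commits.id
""").fetchall()
```

git-history's `item_version_detail` view performs essentially this join for you, which is why it is the most useful page to open in Datasette.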
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created  user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to make one change to that data - we need to convert the &lt;code&gt;date&lt;/code&gt; column from a string to a JavaScript date object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more</title><link href="https://simonwillison.net/2022/Sep/16/weeknotes/#atom-tag" rel="alternate"/><published>2022-09-16T02:55:03+00:00</published><updated>2022-09-16T02:55:03+00:00</updated><id>https://simonwillison.net/2022/Sep/16/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Despite &lt;a href="https://simonwillison.net/2022/Sep/12/prompt-injection/"&gt;distractions from AI&lt;/a&gt; I managed to make progress on a bunch of different projects this week, including new releases of &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; and &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt;, a new &lt;a href="https://datasette.io/plugins/datasette-edit-templates"&gt;datasette-edit-templates&lt;/a&gt; plugin and a small but neat improvement to &lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Better GitHub support for Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt; is &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette running in WebAssembly&lt;/a&gt;. Originally intended as a cool tech demo it's quickly becoming a key component of the wider Datasette ecosystem - just this week I saw that mySociety are using it to help people explore their &lt;a href="https://mysociety.github.io/wdtk_authorities_list/datasets/whatdotheyknow_authorities_dataset/latest"&gt;WhatDoTheyKnow Authorities Dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the neat things about Datasette Lite is that you can feed it URLs to CSV files, SQLite database files and even SQL initialization scripts and it will fetch them into your browser and serve them up inside Datasette. I wrote more about this capability in &lt;a href="https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/"&gt;Joining CSV files in your browser using Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's just one catch: because those URLs are fetched by JavaScript running in your browser, they need to be served from a host that sets the &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; header (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS"&gt;see MDN&lt;/a&gt;). This is not an easy thing to explain to people!&lt;/p&gt;
&lt;p&gt;The good news here is that GitHub makes every public file (and every Gist) hosted on GitHub available as static hosting with that magic header.&lt;/p&gt;
&lt;p&gt;The bad news is that you have to know how to construct that URL! GitHub's "raw" links redirect to that URL, but JavaScript &lt;code&gt;fetch()&lt;/code&gt; calls can't follow a redirect unless the redirect response itself carries that header - and GitHub's redirects do not.&lt;/p&gt;
&lt;p&gt;So you need to know that if you want to load the SQLite database file from this page on GitHub:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You first need to rewrite that URL to the following, which is served with the correct CORS header:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Asking humans to do that by hand isn't reasonable. So I added some code!&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;githubUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-cce"&gt;^&lt;/span&gt;https:&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;github.com&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;blob&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-cce"&gt;\?&lt;/span&gt;raw=true&lt;span class="pl-kos"&gt;)&lt;/span&gt;?&lt;span class="pl-cce"&gt;$&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;fixUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;matches&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;githubUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`https://raw.githubusercontent.com/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Fun aside: GitHub Copilot auto-completed that &lt;code&gt;return&lt;/code&gt; statement for me, correctly guessing the URL string I needed based on the regular expression I had defined several lines earlier.&lt;/p&gt;
&lt;p&gt;Now any time you feed Datasette Lite a URL, if it's a GitHub page it will automatically rewrite it to the CORS-enabled equivalent on the &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; domain.&lt;/p&gt;
&lt;p&gt;Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lite.datasette.io/?url=https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://lite.datasette.io/?url=https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt; - that Chinook SQLite database example (from &lt;a href="https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://lite.datasette.io/?csv=https://github.com/simonw/covid-19-datasette/blob/6294ade30843bfd76f2d82641a8df76d8885effa/us_census_state_populations_2019.csv"&gt;https://lite.datasette.io/?csv=https://github.com/simonw/covid-19-datasette/blob/6294ade30843bfd76f2d82641a8df76d8885effa/us_census_state_populations_2019.csv&lt;/a&gt; - US censes populations by state, from my &lt;a href="https://github.com/simonw/covid-19-datasette"&gt;simonw/covid-19-datasette&lt;/a&gt; repo&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;datasette-edit-templates&lt;/h4&gt;
&lt;p&gt;I started working on this plugin a couple of years ago but didn't get it working. This week I finally &lt;a href="https://github.com/simonw/datasette-edit-templates/issues/1"&gt;closed the initial issue&lt;/a&gt; and shipped a &lt;a href="https://datasette.io/plugins/datasette-edit-templates"&gt;first alpha release&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's pretty fun. On first launch it creates a &lt;code&gt;_templates_&lt;/code&gt; table in your database. Then it allows the &lt;code&gt;root&lt;/code&gt; user (run &lt;code&gt;datasette data.db --root&lt;/code&gt; and click the link to sign in as root) to edit Datasette's default set of Jinja templates, writing their changes to that new table.&lt;/p&gt;
&lt;p&gt;Datasette uses those templates straight away. It turns the whole of Datasette into an interface for editing itself.&lt;/p&gt;
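&lt;p&gt;Illustratively, the lookup works something like this: a template body is fetched from the &lt;code&gt;_templates_&lt;/code&gt; table if present, otherwise Datasette falls back to the copy on disk. (The schema and function here are a simplified sketch for illustration, not the plugin's actual implementation.)&lt;/p&gt;

```python
import sqlite3

# Simplified sketch: edited templates live in a _templates_ table.
# This two-column schema is an assumption for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE _templates_ (template TEXT PRIMARY KEY, body TEXT)")
db.execute(
    "INSERT INTO _templates_ VALUES (?, ?)",
    ("_footer.html", "Powered by Datasette!"),
)

def load_template(name, default="(template loaded from disk)"):
    # Prefer the edited copy in the database, fall back to disk
    row = db.execute(
        "SELECT body FROM _templates_ WHERE template = ?", (name,)
    ).fetchone()
    return row[0] if row else default

print(load_template("_footer.html"))  # the edited template wins
print(load_template("table.html"))    # unedited templates come from disk
```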
&lt;p&gt;Here's an animated demo showing the plugin in action:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-edit-templates.gif" alt="Animated screenshot. The Datasette app menu now has a Edit templates item, which goes to a page listing all of the templates. If you edit the _footer.html template to add an exclamation mark on the next page the Datasette footer shows that change." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation is currently a bit gnarly, but I've filed &lt;a href="https://github.com/simonw/datasette/issues/1809"&gt;an issue&lt;/a&gt; in Datasette core to help clear some of it up.&lt;/p&gt;
&lt;h4&gt;s3-credentials get-objects and put-objects&lt;/h4&gt;
&lt;p&gt;I built &lt;a href="https://s3-credentials.readthedocs.org/"&gt;s3-credentials&lt;/a&gt; to solve my number one frustration with AWS S3: the surprising level of complexity involved in issuing IAM credentials that could only access a specific S3 bucket. I introduced it in &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials: a tool for creating credentials for S3 buckets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once you've created credentials, you need to be able to do stuff with them. I find the default AWS CLI tools relatively unintuitive, so &lt;code&gt;s3-credentials&lt;/code&gt; has continued to grow &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html"&gt;other commands&lt;/a&gt; as and when I feel the need for them.&lt;/p&gt;
&lt;p&gt;The latest version, &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.14"&gt;0.14&lt;/a&gt;, adds two more: &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html#get-objects"&gt;get-objects&lt;/a&gt; and &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html#put-objects"&gt;put-objects&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These let you do things like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials get-objects my-bucket -p "*.txt" -p "static/*.css"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads every key in &lt;code&gt;my-bucket&lt;/code&gt; with a name that matches either of those patterns.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-objects my-bucket one.txt ../other-directory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uploads &lt;code&gt;one.txt&lt;/code&gt; and the whole &lt;code&gt;other-directory&lt;/code&gt; folder with all of its contents.&lt;/p&gt;
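&lt;p&gt;The &lt;code&gt;-p&lt;/code&gt; patterns behave like shell-style globs matched against the full key. A rough sketch of that matching, assuming &lt;code&gt;fnmatch&lt;/code&gt;-style semantics (the tool's actual matching may differ in detail):&lt;/p&gt;

```python
from fnmatch import fnmatch

def matching_keys(keys, patterns):
    # Keep any key that matches at least one glob-style pattern
    return [key for key in keys if any(fnmatch(key, p) for p in patterns)]

keys = ["readme.txt", "static/site.css", "static/app.js", "data/log.txt"]
print(matching_keys(keys, ["*.txt", "static/*.css"]))
# ['readme.txt', 'static/site.css', 'data/log.txt']
```

&lt;p&gt;Note that &lt;code&gt;fnmatch&lt;/code&gt;'s &lt;code&gt;*&lt;/code&gt; also matches &lt;code&gt;/&lt;/code&gt;, so &lt;code&gt;*.txt&lt;/code&gt; picks up keys in nested "directories" too.&lt;/p&gt;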
&lt;p&gt;As with most of my projects, the GitHub issues threads for each of these include a blow-by-blow account of how I finalized their design - &lt;a href="https://github.com/simonw/s3-credentials/issues/68"&gt;#68&lt;/a&gt; for &lt;code&gt;put-objects&lt;/code&gt; and &lt;a href="https://github.com/simonw/s3-credentials/issues/78"&gt;#78&lt;/a&gt; for &lt;code&gt;get-objects&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;shot-scraper --log-requests&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; is my tool for automating screenshots, &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;built on top of Playwright&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Its latest feature was inspired by Datasette Lite.&lt;/p&gt;
&lt;p&gt;I have an ongoing ambition to get Datasette Lite to work &lt;a href="https://github.com/simonw/datasette-lite/issues/26"&gt;entirely offline&lt;/a&gt;, using Service Workers.&lt;/p&gt;
&lt;p&gt;The first step is to get it to work &lt;a href="https://github.com/simonw/datasette-lite/issues/40"&gt;without loading external resources&lt;/a&gt; - it currently hits PyPI and a separate CDN multiple times to download wheels every time you load the application.&lt;/p&gt;
&lt;p&gt;To do that, I need a reliable list of all of the assets that it's fetching.&lt;/p&gt;
&lt;p&gt;Wouldn't it be handy if I could run a command and get a list of those resources?&lt;/p&gt;
&lt;p&gt;The following command now does exactly that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests requests.log
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the &lt;code&gt;--wait-for&lt;/code&gt; option is needed to ensure &lt;code&gt;shot-scraper&lt;/code&gt; doesn't terminate until the application has fully loaded - detected by waiting for an &lt;code&gt;&amp;lt;h2&amp;gt;&lt;/code&gt; element to be added to the page.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--log-requests&lt;/code&gt; bit is a &lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html#logging-all-requests"&gt;new feature&lt;/a&gt; in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.15"&gt;shot-scraper 0.15&lt;/a&gt;: it logs out a newline-delimited JSON file with details of all of the resources fetched during the run. That file starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"method": "GET", "url": "https://lite.datasette.io/", "size": 10516, "timing": {...}}
{"method": "GET", "url": "https://plausible.io/js/script.manual.js", "size": 1005, "timing": {...}}
{"method": "GET", "url": "https://latest.datasette.io/-/static/app.css?cead5a", "size": 16230, "timing": {...}}
{"method": "GET", "url": "https://lite.datasette.io/webworker.js", "size": 4875, "timing": {...}}
{"method": "GET", "url": "https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js", "size": null, "timing": {...}}
&lt;/code&gt;&lt;/pre&gt;
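&lt;p&gt;Because the log is newline-delimited JSON it's easy to post-process. For example, here's a quick sketch listing the distinct hosts the page fetched from, with a few sample lines standing in for the real &lt;code&gt;requests.log&lt;/code&gt;:&lt;/p&gt;

```python
import json
from urllib.parse import urlparse

# Sample lines standing in for a real --log-requests file
log = [
    '{"method": "GET", "url": "https://lite.datasette.io/", "size": 10516}',
    '{"method": "GET", "url": "https://plausible.io/js/script.manual.js", "size": 1005}',
    '{"method": "GET", "url": "https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js", "size": null}',
]
# Parse each line and collect the unique hostnames
hosts = sorted({urlparse(json.loads(line)["url"]).netloc for line in log})
print(hosts)
# ['cdn.jsdelivr.net', 'lite.datasette.io', 'plausible.io']
```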
&lt;p&gt;This is already pretty useful... but wouldn't it be more useful if I could explore that data in Datasette?&lt;/p&gt;
&lt;p&gt;That's what this recipe does:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils insert /tmp/datasette-lite.db log - --flatten --nl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's piping the newline-delimited JSON to &lt;code&gt;sqlite-utils insert&lt;/code&gt; which then inserts it, using the &lt;code&gt;--flatten&lt;/code&gt; option to turn that nested &lt;code&gt;timing&lt;/code&gt; object into a flat set of columns.&lt;/p&gt;
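&lt;p&gt;The effect of &lt;code&gt;--flatten&lt;/code&gt; can be sketched in a few lines of Python - nested objects become columns whose names join the parent and child keys with an underscore. (This is a sketch of the idea, not &lt;code&gt;sqlite-utils&lt;/code&gt;' exact implementation.)&lt;/p&gt;

```python
def flatten(row, prefix=""):
    # Recursively turn nested dicts into flat key/value pairs:
    # {"timing": {"startTime": 0.0}} becomes {"timing_startTime": 0.0}
    flat = {}
    for key, value in row.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat

record = {
    "method": "GET",
    "url": "https://lite.datasette.io/",
    "size": 10516,
    "timing": {"startTime": 0.0, "responseEnd": 82.5},
}
print(flatten(record))
```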
&lt;p&gt;I decided to share it by turning it into a SQL dump and publishing that to &lt;a href=""&gt;this Gist&lt;/a&gt;. I did that using the &lt;code&gt;sqlite-utils memory&lt;/code&gt; command to convert it to a SQL dump like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils memory stdin:nl --flatten --dump &amp;gt; dump.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;stdin:nl&lt;/code&gt; means "read from standard input and treat that as newline-delimited JSON". Then I run a &lt;code&gt;select *&lt;/code&gt; command and use &lt;code&gt;--dump&lt;/code&gt; to output that to &lt;code&gt;dump.sql&lt;/code&gt;, which I pasted into a new Gist.&lt;/p&gt;
&lt;p&gt;So now I can &lt;a href="https://lite.datasette.io/?sql=https://gist.githubusercontent.com/simonw/7f41a43ba0f177238ed7bdd95078a0d4/raw/4fc0f80decce4e1ea1e925cdc2bf3f05d73034ed/datasette-lite.sql#/data/stdin"&gt;open the result in Datasette Lite&lt;/a&gt;!&lt;/p&gt;
&lt;h4&gt;Datasette on Sandstorm&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://sandstorm.io/"&gt;Sandstorm&lt;/a&gt; is "an open source
platform for self-hosting web apps". You can think of it as an easy to use UI over a Docker-like container platform - once you've installed it on a server you can use it to manage and install applications that have been bundled for it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ocdtrekkie"&gt;Jacob Weisz&lt;/a&gt; has been doing exactly that for Datasette. The result is &lt;a href="https://apps.sandstorm.io/app/uawacvvx9f9ncex1sqj8njwpujf8s9fkmg7wmp55hg6xetrd45w0"&gt;Datasette in the Sandstorm App Market&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-sandstorm.jpg" alt="The listing for Datasette on the Sandstorm App Market, with a prominent DEMO button" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;You can see how it works in the &lt;a href="https://github.com/ocdtrekkie/datasette-sandstorm"&gt;ocdtrekkie/datasette-sandstorm&lt;/a&gt; repo. I helped out by building a small &lt;a href="https://github.com/simonw/datasette-sandstorm-support"&gt;datasette-sandstorm-support&lt;/a&gt; plugin to show how permissions and authentication can work against Sandstorm's &lt;a href="https://docs.sandstorm.io/en/latest/developing/auth/"&gt;custom HTTP headers&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.14"&gt;0.14&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;15 releases total&lt;/a&gt;) - 2022-09-15&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.16"&gt;0.16&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;21 releases total&lt;/a&gt;) - 2022-09-15&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-templates"&gt;datasette-edit-templates&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-templates/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2022-09-14&lt;br /&gt;Plugin allowing Datasette templates to be edited within Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sandstorm-support"&gt;datasette-sandstorm-support&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sandstorm-support/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-09-14&lt;br /&gt;Authentication and permissions for Datasette on Sandstorm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-dbs"&gt;datasette-upload-dbs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-dbs/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-dbs/releases"&gt;3 releases total&lt;/a&gt;) - 2022-09-09&lt;br /&gt;Upload SQLite database files to Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.8.2"&gt;0.8.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases"&gt;13 releases total&lt;/a&gt;) - 2022-09-08&lt;br /&gt;Datasette plugin for uploading CSV files and converting them to database tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/docker/pytest-docker"&gt;Run pytest against a specific Python version using Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/clone-and-push-gist"&gt;Clone, edit and push files that live in a Gist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/external-display-laptop"&gt;Driving an external display from a Mac laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/ifuse-iphone"&gt;Browse files (including SQLite databases) on your iPhone with ifuse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/pypy-macos"&gt;Running PyPy on macOS using Homebrew&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-copilot"&gt;github-copilot&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="plugins"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="s3-credentials"/><category term="shot-scraper"/><category term="datasette-lite"/><category term="github-copilot"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopy Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in &lt;code&gt;s3-ocr&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF, so that in the future I can implement a system that re-runs OCR if the PDF has been updated.&lt;/p&gt;
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
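That skip logic is simple enough to sketch as a pure function over the bucket's key listing (a hypothetical helper, not s3-ocr's actual code):

```python
def pdfs_needing_ocr(keys):
    """Given a flat listing of S3 keys, return the PDFs that have no
    accompanying .s3-ocr.json manifest and therefore still need to be
    submitted to Textract."""
    key_set = set(keys)
    return sorted(
        key
        for key in key_set
        if key.lower().endswith(".pdf")
        and key + ".s3-ocr.json" not in key_set
    )
```

A full listing of the bucket (via paginated list_objects_v2 calls) is all the state the command needs to resume.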
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to an &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: s3-credentials prefix and Datasette 0.60</title><link href="https://simonwillison.net/2022/Jan/18/weeknotes/#atom-tag" rel="alternate"/><published>2022-01-18T04:37:39+00:00</published><updated>2022-01-18T04:37:39+00:00</updated><id>https://simonwillison.net/2022/Jan/18/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;A &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;new release&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; with support for restricting access to keys that start with a prefix, &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-60"&gt;Datasette 0.60&lt;/a&gt; and a write-up of &lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/"&gt;my process for shipping a feature&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --prefix&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; is my tool for creating limited scope AWS credentials that can only read and write from a specific S3 bucket. I introduced it &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;in this blog entry&lt;/a&gt; in November, and I've continued to iterate on it since then.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;s3-credentials 0.9&lt;/a&gt; today with a feature I've been planning since I first built the tool: the ability to &lt;a href="https://github.com/simonw/s3-credentials/issues/12"&gt;specify a --prefix&lt;/a&gt; and get credentials that are only allowed to operate on keys within a specific folder within the S3 bucket.&lt;/p&gt;
&lt;p&gt;This is particularly useful if you are building multi-tenant SaaS applications on top of AWS. You might decide to create a bucket per customer... but S3 limits you to 100 buckets per AWS account by default, with a maximum of 1,000 buckets if you request an increase.&lt;/p&gt;
&lt;p&gt;So a bucket per customer won't scale above 1,000 customers.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.assume_role"&gt;sts.assume_role()&lt;/a&gt; API lets you retrieve temporary credentials for S3 that can have limits attached to them - including restricting them to keys within a specific bucket and under a specific prefix. That means you can create limited duration credentials that can only read and write from a specific prefix within a bucket.&lt;/p&gt;
&lt;p&gt;Which solves the problem! Each of your customers can have a dedicated prefix within the bucket, and your application can issue restricted tokens that greatly reduce the risk of one customer accidentally seeing files that belong to another.&lt;/p&gt;
&lt;p&gt;Here's how to use it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create name-of-bucket --prefix user1410/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will return a JSON set of credentials - an access key and secret key - that can only be used to read and write keys in that bucket that start with &lt;code&gt;user1410/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Add &lt;code&gt;--read-only&lt;/code&gt; to make those credentials read-only, and &lt;code&gt;--write-only&lt;/code&gt; for credentials that can be used to write but not read records.&lt;/p&gt;
&lt;p&gt;If you add &lt;code&gt;--duration 15m&lt;/code&gt; the returned credentials will only be valid for 15 minutes, using &lt;code&gt;sts.assume_role()&lt;/code&gt;. The README includes &lt;a href="https://github.com/simonw/s3-credentials#changes-that-will-be-made-to-your-aws-account"&gt;a detailed description&lt;/a&gt; of the changes that will be made to your AWS account by the tool.&lt;/p&gt;
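That 15m shorthand ends up as the DurationSeconds parameter to sts.assume_role(), which is measured in seconds - 15 minutes is the 900 seconds shown in the dry-run output. A sketch of the conversion (the function name and supported units here are assumptions for illustration, not s3-credentials' exact parsing code):

```python
def duration_to_seconds(duration):
    """Convert a --duration style value like '15m', '2h' or '900s' into
    the integer seconds expected by sts.assume_role()'s DurationSeconds
    parameter. Bare numbers are treated as seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    if duration[-1] in units:
        return int(duration[:-1]) * units[duration[-1]]
    return int(duration)
```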
&lt;p&gt;You can also add &lt;code&gt;--dry-run&lt;/code&gt; to see a text summary of changes without applying them to your account. Here's an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials create name-of-bucket --prefix user1410/ --read-only --dry-run --duration 15m
Would create bucket: 'name-of-bucket'
Would ensure role: 's3-credentials.AmazonS3FullAccess'
Would assume role using following policy for 900 seconds:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "user1410/*"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket/user1410/*"
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with all things AWS, the magic is in the details of the JSON policy document. The README includes details of exactly &lt;a href="https://github.com/simonw/s3-credentials/blob/0.9/README.md#--prefix-my-prefix"&gt;what those policies look like&lt;/a&gt;. Getting them right was by far the hardest part of building this tool!&lt;/p&gt;
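The read-only prefix policy in the dry-run output above can be generated mechanically. Here's a sketch that rebuilds that document from a bucket name and prefix (the helper name is hypothetical; the statement contents mirror the dry-run output):

```python
def prefix_read_only_policy(bucket, prefix):
    """Build the read-only policy shown in the dry-run output:
    GetBucketLocation on the bucket, ListBucket restricted to the
    prefix, and object-level Get* actions restricted to keys under
    the prefix."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation"],
                "Resource": [bucket_arn],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
                # s3:prefix is evaluated against the list request, so the
                # condition is what stops listing outside the prefix
                "Condition": {"StringLike": {"s3:prefix": [prefix + "*"]}},
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectAcl",
                    "s3:GetObjectLegalHold",
                    "s3:GetObjectRetention",
                    "s3:GetObjectTagging",
                ],
                "Resource": [f"{bucket_arn}/{prefix}*"],
            },
        ],
    }
```

Note how the prefix appears in two different places: as an s3:prefix condition on the ListBucket statement, and embedded directly in the Resource ARN for the object-level actions.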
&lt;h4&gt;s3-credentials integration tests&lt;/h4&gt;
&lt;p&gt;When writing automated tests, I generally avoid calling any external APIs or making any outbound network traffic. I want the tests to run in an isolated environment, with no risk that some other system that's having a bad day could cause random test failures.&lt;/p&gt;
&lt;p&gt;Since the hardest part of building this tool is having confidence that it does the right thing, I decided to also include a suite of integration tests that actively exercise Amazon S3.&lt;/p&gt;
&lt;p&gt;By default, running &lt;code&gt;pytest&lt;/code&gt; will skip these:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pytest
================ test session starts ================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/simon/Dropbox/Development/s3-credentials
plugins: recording-0.12.0, mock-3.6.1
collected 61 items                                  

tests/test_dry_run.py ....                    [  6%]
tests/test_integration.py ssssssss            [ 19%]
tests/test_s3_credentials.py ................ [ 45%]
.................................             [100%]

=========== 53 passed, 8 skipped in 1.21s ===========
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running &lt;code&gt;pytest --integration&lt;/code&gt; runs the test suite with those tests enabled. It expects the computer they are running on to have AWS credentials with the ability to create buckets and users - I'm too nervous to add these secrets to GitHub Actions, so I currently only run the integration suite on my own laptop.&lt;/p&gt;
&lt;p&gt;These were invaluable for getting confident that the new &lt;code&gt;--prefix&lt;/code&gt; option behaved as expected, especially when combined with &lt;code&gt;--read-only&lt;/code&gt; and &lt;code&gt;--write-only&lt;/code&gt;. Here's the &lt;a href="https://github.com/simonw/s3-credentials/blob/0.9/tests/test_integration.py#L219-L279"&gt;test_prefix_read_only()&lt;/a&gt; test which exercises the &lt;code&gt;--prefix --read-only&lt;/code&gt; combination.&lt;/p&gt;
&lt;h4&gt;s3-credentials list-bucket&lt;/h4&gt;
&lt;p&gt;One more new feature: the &lt;code&gt;s3-credentials list-bucket name-of-bucket&lt;/code&gt; command lists all of the keys in a specific bucket.&lt;/p&gt;
&lt;p&gt;By default it returns a JSON array, but you can add &lt;code&gt;--nl&lt;/code&gt; to get back &lt;a href="http://ndjson.org/"&gt;newline delimited JSON&lt;/a&gt; or &lt;code&gt;--csv&lt;/code&gt; or &lt;code&gt;--tsv&lt;/code&gt; to get back CSV or TSV.&lt;/p&gt;
&lt;p&gt;So... a fun thing you can do with the command is pipe the output into &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json"&gt;sqlite-utils insert&lt;/a&gt; to create a SQLite database file of your bucket contents... and then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to browse it!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials list-bucket static.niche-museums.com --nl \
  | sqlite-utils insert s3.db keys - --nl
% datasette s3.db -o
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create an &lt;code&gt;s3.db&lt;/code&gt; SQLite database with a &lt;code&gt;keys&lt;/code&gt; table containing your bucket contents, then open Datasette to let you interact with the table.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/s3-keys.png" alt="A screenshot of the keys table running in Datasette" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Datasette 0.60&lt;/h4&gt;
&lt;p&gt;I shipped several months of work on Datasette a few days ago as &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-60"&gt;Datasette 0.60&lt;/a&gt;. I published &lt;a href="https://simonwillison.net/2022/Jan/14/datasette-060/"&gt;annotated release notes&lt;/a&gt; for that release which describe the background of those changes in detail.&lt;/p&gt;
&lt;p&gt;I also released new versions of &lt;a href="https://github.com/simonw/datasette-pretty-traces/releases/tag/0.4"&gt;datasette-pretty-traces&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases/tag/0.3"&gt;datasette-leaflet-freedraw&lt;/a&gt; to take advantage of new features added to Datasette.&lt;/p&gt;
&lt;h4&gt;How I build a feature&lt;/h4&gt;
&lt;p&gt;My other big project this week was a blog post: &lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/"&gt;How I build a feature&lt;/a&gt;, which goes into detail about the process I use for adding new features to my various projects. I've had some great feedback about this, so I'm tempted to write more about general software engineering process stuff here in the future.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;0.9&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;9 releases total&lt;/a&gt;) - 2022-01-18
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-pretty-traces"&gt;datasette-pretty-traces&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-pretty-traces/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-pretty-traces/releases"&gt;6 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;Prettier formatting for ?_trace=1 traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-leaflet-freedraw"&gt;datasette-leaflet-freedraw&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases/tag/0.3"&gt;0.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases"&gt;8 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;Draw polygons on maps in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.60"&gt;0.60&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;105 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.0.1"&gt;2.0.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;33 releases total&lt;/a&gt;) - 2022-01-12
&lt;br /&gt;Datasette plugin providing an automatic GraphQL API for your SQLite databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/dependabot-python-setup"&gt;Configuring Dependabot for a Python project with dependencies in setup.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/javascript-date-objects"&gt;JavaScript date objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/output-json-array-streaming"&gt;Streaming indented output of a JSON array&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: git-history, bug magnets and s3-credentials --public</title><link href="https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag" rel="alternate"/><published>2021-12-08T21:34:12+00:00</published><updated>2021-12-08T21:34:12+00:00</updated><id>https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've stopped considering my projects "shipped" until I've written a proper blog entry about them, so yesterday I finally &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;shipped git-history&lt;/a&gt;, coinciding with the release of &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6"&gt;version 0.6&lt;/a&gt; - a full 27 days after the first &lt;a href="https://github.com/simonw/git-history/releases/tag/0.1"&gt;0.1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It took way more work than I was expecting to get to this point!&lt;/p&gt;
&lt;p&gt;I wrote the first version of &lt;code&gt;git-history&lt;/code&gt; in an afternoon, as a tool &lt;a href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/"&gt;for a workshop I was presenting&lt;/a&gt; on Git scraping and Datasette.&lt;/p&gt;
&lt;p&gt;Before promoting it more widely, I wanted to make some improvements to the schema. In particular, I wanted to record only the updated values in the &lt;code&gt;item_version&lt;/code&gt; table - which otherwise could end up duplicating a full copy of each item in the database hundreds or even thousands of times.&lt;/p&gt;
&lt;p&gt;Getting this right took a lot of work, and I kept on getting stumped by weird bugs and edge-cases. &lt;a href="https://github.com/simonw/git-history/issues/33"&gt;This bug&lt;/a&gt; in particular added a couple of days to the project.&lt;/p&gt;
&lt;p&gt;The whole project turned out to be something of a bug magnet, partly because of a design decision I made concerning column names.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; creates tables with columns that correspond to the underlying data. Since it also needs its own columns for tracking things like commits and incremental versions, I decided to use underscore prefixes for reserved columns such as &lt;code&gt;_item&lt;/code&gt; and &lt;code&gt;_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Datasette uses underscore prefixes for its own purposes - special table arguments such as &lt;code&gt;?_facet=column-name&lt;/code&gt;. It's supposed to work with existing columns that use underscores by converting query string arguments like &lt;code&gt;?_item=3&lt;/code&gt; into &lt;code&gt;?_item__exact=3&lt;/code&gt; - but &lt;code&gt;git-history&lt;/code&gt; was the first of my projects to really exercise this, and I kept on finding bugs. Datasette &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-2"&gt;0.59.2&lt;/a&gt; and &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-4"&gt;0.59.4&lt;/a&gt; both have related bug fixes, and there's &lt;a href="https://github.com/simonw/datasette/issues/1527"&gt;a re-opened bug&lt;/a&gt; that I have yet to resolve.&lt;/p&gt;
&lt;p&gt;Building the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;ca-fires demo&lt;/a&gt; also revealed a &lt;a href="https://github.com/simonw/datasette-cluster-map/issues/38"&gt;bug in datasette-cluster-map&lt;/a&gt; which I fixed in &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;version 0.17.2&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --public&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;git-history&lt;/code&gt; live demos are built and deployed by &lt;a href="https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml"&gt;this GitHub Actions workflow&lt;/a&gt;. The workflow works by checking out three separate repos and running &lt;code&gt;git-history&lt;/code&gt; against them. It takes advantage of that tool's ability to add just new commits to an existing database to run faster, so it needs to persist database files in between runs.&lt;/p&gt;
&lt;p&gt;Since these files can be several hundred MBs, I decided to persist them in an S3 bucket.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials tool&lt;/a&gt; provides the ability to create a new S3 bucket along with restricted read-write credentials just for that bucket, ideal for use in a GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;I decided to make the bucket public such that anyone can download files from it, since there was no reason to keep it private. I've been wanting to add this ability to &lt;code&gt;s3-credentials&lt;/code&gt; for a while now, so this was the impetus I needed to finally ship that feature.&lt;/p&gt;
&lt;p&gt;It's surprisingly hard to figure out how to make an S3 bucket public these days! It turned out the magic recipe was adding a JSON bucket policy document to the bucket granting &lt;code&gt;s3:GetObject&lt;/code&gt; permission to principal &lt;code&gt;*&lt;/code&gt; - here's &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;that policy in full&lt;/a&gt;.&lt;/p&gt;
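Based on that description the policy itself is tiny - a sketch of its shape (the exact statement s3-credentials attaches is linked above; this hypothetical helper is for illustration):

```python
def public_bucket_policy(bucket):
    """A bucket policy granting s3:GetObject on every key to principal
    '*' - anyone who knows a filename can download that file. Sketched
    from the description above; see the s3-credentials README for the
    exact policy document it uses."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                # /* scopes the grant to objects, not the bucket itself
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }
```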
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt; with a new &lt;code&gt;--public&lt;/code&gt; option for creating public buckets - here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials create my-bucket --public&lt;/code&gt; option for creating public buckets, which allow anyone with knowledge of a filename to download that file. This works by attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;this public bucket policy&lt;/a&gt; to the bucket after it is created. &lt;a href="https://github.com/simonw/s3-credentials/issues/42"&gt;#42&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials put-object&lt;/code&gt; now sets the &lt;code&gt;Content-Type&lt;/code&gt; header on the uploaded object. The type is detected based on the filename, or can be specified using the new &lt;code&gt;--content-type&lt;/code&gt; option. &lt;a href="https://github.com/simonw/s3-credentials/issues/43"&gt;#43&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials policy my-bucket --public-bucket&lt;/code&gt; outputs the public bucket policy that would be attached to a bucket of that name. &lt;a href="https://github.com/simonw/s3-credentials/issues/44"&gt;#44&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote up this TIL which doubles as a mini-tutorial on using &lt;code&gt;s3-credentials&lt;/code&gt;: &lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-hovercards&lt;/h4&gt;
&lt;p&gt;This was a quick experiment which turned into a prototype Datasette plugin. I really like how GitHub show hover card previews of links to issues in their interface:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/github-hovercard.gif" alt="Animation showing a hover card displayed when the mouse cursor touches a link to a GitHub Issue" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I decided to see if I could build something similar for links within Datasette, specifically the links that show up when a column is a foreign key to another record.&lt;/p&gt;
&lt;p&gt;Here's what I've got so far:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-hovercards.gif" alt="Animation showing a hover card displayed in Datasette for a link to another record" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There's an interactive demo running on &lt;a href="https://latest-with-plugins.datasette.io/github/issues"&gt;this table page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It still needs a bunch of work - in particular I need to think harder about when the card is shown, where it displays relative to the mouse pointer, what causes it to be hidden again and how it should handle different page widths. Ideally I'd like to figure out a useful mobile / touch-screen variant, but I'm not sure how that could work.&lt;/p&gt;
&lt;p&gt;The prototype plugin is called &lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt; - I'd like to eventually merge this back into Datasette core once I'm happy with how it works.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6.1"&gt;0.6.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;9 releases total&lt;/a&gt;) - 2021-12-08
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;0.17.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-cluster-map/releases"&gt;20 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;0.8&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;8 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.2a1"&gt;0.2a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;3 releases total&lt;/a&gt;) - 2021-12-03
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hovercards/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-12-02
&lt;br /&gt;Add preview hovercards to links in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.8.3"&gt;2.8.3&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases"&gt;22 releases total&lt;/a&gt;) - 2021-12-01
&lt;br /&gt;Save data from GitHub to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/init-subclass"&gt;__init_subclass__&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>s3-credentials 0.8</title><link href="https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag" rel="alternate"/><published>2021-12-07T07:04:35+00:00</published><updated>2021-12-07T07:04:35+00:00</updated><id>https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest release of my s3-credentials CLI tool for creating S3 buckets with credentials to access them (with read-write, read-only or write-only policies) adds a new --public option for creating buckets that allow public access, such that anyone who knows a filename can download a file. The s3-credentials put-object command also now sets the appropriate Content-Type header on the uploaded object.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
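&lt;p&gt;To illustrate the idea (this is a simplified sketch, not git-history's actual code): given successive snapshots of that JSON file - one per commit - you record a new version row whenever an item identified by its ID key changes:&lt;/p&gt;

```python
import json
import sqlite3

# Simplified sketch of the "item_versions" concept: each element of
# snapshots is the parsed JSON array as it looked at one commit.
def build_item_versions(snapshots, id_key):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE item_versions (item_id TEXT, version INTEGER, data TEXT)"
    )
    latest = {}  # item_id -> (version, serialized data)
    for snapshot in snapshots:
        for item in snapshot:
            item_id = str(item[id_key])
            data = json.dumps(item, sort_keys=True)
            prev = latest.get(item_id)
            # Only record a row when the item is new or has changed
            if prev is None or prev[1] != data:
                version = 1 if prev is None else prev[0] + 1
                db.execute(
                    "INSERT INTO item_versions VALUES (?, ?, ?)",
                    (item_id, version, data),
                )
                latest[item_id] = (version, data)
    return db

db = build_item_versions(
    [
        [{"IncidentID": 1, "status": "active"}],
        [
            {"IncidentID": 1, "status": "contained"},
            {"IncidentID": 2, "status": "active"},
        ],
    ],
    "IncidentID",
)
rows = db.execute(
    "SELECT item_id, version FROM item_versions ORDER BY item_id, version"
).fetchall()
print(rows)  # [('1', 1), ('1', 2), ('2', 1)]
```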
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use an &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns that were reserved for use by that tool in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials, that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is the &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
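&lt;p&gt;Here's a rough sketch of how that scoping-down works with boto3 - the role ARN below is a placeholder, and the role needs to already exist and trust the calling user:&lt;/p&gt;

```python
import json

def build_session_policy(bucket):
    # Inline session policy limiting the temporary credentials to one
    # bucket: the effective permissions are the intersection of this
    # policy and the assumed role's own permissions.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }

def temporary_bucket_credentials(role_arn, bucket, duration=900):
    # Requires real AWS credentials; role_arn must be a role the
    # caller is permitted to assume.
    import boto3

    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"s3-access-{bucket}",
        Policy=json.dumps(build_session_policy(bucket)),
        DurationSeconds=duration,  # 900 seconds is the 15 minute minimum
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration
    return response["Credentials"]
```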
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>s3-credentials: a tool for creating credentials for S3 buckets</title><link href="https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag" rel="alternate"/><published>2021-11-03T04:02:04+00:00</published><updated>2021-11-03T04:02:04+00:00</updated><id>https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag</id><summary type="html">
    &lt;p&gt;I've built a command-line tool called &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; to solve a problem that's been frustrating me for ages: how to quickly and easily create AWS credentials (an access key and secret key) that have permission to read or write from just a single S3 bucket.&lt;/p&gt;
&lt;h4&gt;The TLDR version&lt;/h4&gt;
&lt;p&gt;To create a new S3 bucket and generate credentials for reading and writing to it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials
% s3-credentials create demo-bucket-for-simonwillison-blog-post --create-bucket
Created bucket: demo-bucket-for-simonwillison-blog-post
Created user: 's3.read-write.demo-bucket-for-simonwillison-blog-post' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.demo-bucket-for-simonwillison-blog-post to user s3.read-write.demo-bucket-for-simonwillison-blog-post
Created access key for user: s3.read-write.demo-bucket-for-simonwillison-blog-post
{
    "UserName": "s3.read-write.demo-bucket-for-simonwillison-blog-post",
    "AccessKeyId": "AKIAWXFXAIOZHY6WAJSF",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-12-06 23:54:08+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now use that &lt;code&gt;AccessKeyId&lt;/code&gt; and &lt;code&gt;SecretAccessKey&lt;/code&gt; to read and write files in that bucket.&lt;/p&gt;
&lt;h4&gt;The need for bucket credentials for S3&lt;/h4&gt;
&lt;p&gt;I'm an enormous fan of &lt;a href="https://aws.amazon.com/s3/"&gt;Amazon S3&lt;/a&gt;: I've been using it &lt;a href="https://simonwillison.net/tags/s3/?page=last"&gt;for fifteen years&lt;/a&gt; now (since the launch in 2006) and it's my all-time favourite cloud service: it's cheap, reliable and basically indestructible.&lt;/p&gt;
&lt;p&gt;You need two credentials to make API calls to S3: an &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and a &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since I often end up adding these credentials to projects hosted in different environments, I'm not at all keen on using my root-level credentials here: usually a project works against just one dedicated S3 bucket, so ideally I would like to create dedicated credentials that are limited to just that bucket.&lt;/p&gt;
&lt;p&gt;Creating those credentials is surprisingly difficult!&lt;/p&gt;
&lt;h4&gt;Dogsheep Photos&lt;/h4&gt;
&lt;p&gt;The last time I solved this problem was for my &lt;a href="https://datasette.io/tools/dogsheep-photos"&gt;Dogsheep Photos&lt;/a&gt; project. I built a tool that uploads all of my photos from Apple Photos to my own dedicated S3 bucket, and extracts the photo metadata into a SQLite database. This means I can do some really cool tricks using SQL to analyze my photos, as described in &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;Using SQL to find my best photo of a pelican according to Apple Photos&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The photos are stored in a S3 private bucket, with &lt;a href="https://github.com/simonw/s3-image-proxy"&gt;a custom proxy&lt;/a&gt; in front of them that I can use to grant access to specific photographs via a signed URL.&lt;/p&gt;
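&lt;p&gt;The core trick the proxy relies on can be sketched like this (the bucket and key names are placeholders, and the optional &lt;code&gt;s3&lt;/code&gt; parameter just makes the sketch easy to exercise without real credentials):&lt;/p&gt;

```python
def signed_photo_url(bucket, key, expires_in=600, s3=None):
    # Generate a time-limited signed URL granting read access to a
    # single object in an otherwise private bucket.
    if s3 is None:
        import boto3  # requires configured AWS credentials
        s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds until the URL stops working
    )
```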
&lt;p&gt;For the proxy, I decided to create dedicated credentials that were allowed to make read-only requests to my private S3 bucket.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/4"&gt;detailed notes&lt;/a&gt; along the way as I figured out how to do that. It was really hard! There's one step where you literally have to hand-edit a JSON policy document that looks like this (replace &lt;code&gt;dogsheep-photos-simon&lt;/code&gt; with your own bucket name) and paste that into the AWS web console:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"Version"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2012-10-17&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"Statement"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"Effect"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Allow&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Action"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;s3:*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Resource"&lt;/span&gt;: [
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;arn:aws:s3:::dogsheep-photos-simon/*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      ]
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I set myself an ambition to try and fix this at some point in the future (that was in April 2020).&lt;/p&gt;
&lt;p&gt;Today I found myself wanting new bucket credentials, so I could play with &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt;. I decided to solve this problem once and for all.&lt;/p&gt;
&lt;p&gt;I've also been meaning to really get my head around Amazon's IAM permission model for years, and this felt like a great excuse to figure it out through writing code.&lt;/p&gt;
&lt;h4&gt;The process in full&lt;/h4&gt;
&lt;p&gt;Here are the steps you need to take in order to get long-lasting credentials for accessing a specific S3 bucket.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create an S3 bucket&lt;/li&gt;
&lt;li&gt;Create a new, dedicated user. You need a user and not a role because long-lasting AWS credentials cannot be created for roles - and we want credentials we can use in a project without constantly needing to update them.&lt;/li&gt;
&lt;li&gt;Assign an "inline policy" to that user granting them read-only or read-write access to the specific S3 bucket - this is the JSON format shown above.&lt;/li&gt;
&lt;li&gt;Create AWS credentials for that user.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are plenty of other ways you can achieve this: you can add permissions to a group and assign the user to that group, or you can create a named "managed policy" and attach that to the user. But using an inline policy seems to be the simplest of the available options.&lt;/p&gt;
&lt;p&gt;Using the &lt;a href="https://aws.amazon.com/sdk-for-python/"&gt;boto3&lt;/a&gt; Python client library for AWS this sequence converts to the following API calls:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;json&lt;/span&gt;

&lt;span class="pl-s1"&gt;s3&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"s3"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;iam&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"iam"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;username&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-user"&lt;/span&gt;
&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-bucket"&lt;/span&gt;
&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"user-can-access-bucket"&lt;/span&gt;

&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {
    &lt;span class="pl-s"&gt;"... that big JSON document ..."&lt;/span&gt;: &lt;span class="pl-s"&gt;""&lt;/span&gt;
}

&lt;span class="pl-c"&gt;# Create the bucket&lt;/span&gt;
&lt;span class="pl-s1"&gt;s3&lt;/span&gt;.&lt;span class="pl-en"&gt;create_bucket&lt;/span&gt;(&lt;span class="pl-v"&gt;Bucket&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Create the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_user&lt;/span&gt;(&lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Assign the policy to the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;put_user_policy&lt;/span&gt;(
    &lt;span class="pl-v"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt;),
    &lt;span class="pl-v"&gt;PolicyName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt;,
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)

&lt;span class="pl-c"&gt;# Retrieve and print the credentials&lt;/span&gt;
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_access_key&lt;/span&gt;(
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;[&lt;span class="pl-s"&gt;"AccessKey"&lt;/span&gt;])&lt;/pre&gt;
&lt;h4&gt;Turning it into a CLI tool&lt;/h4&gt;
&lt;p&gt;I never want to have to figure out how to do this again, so I decided to build a tool around it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; is a Python CLI utility built on top of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; using my &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookicutter template.&lt;/p&gt;
&lt;p&gt;It's available through PyPI, so you can install it using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The main command is &lt;code&gt;s3-credentials create&lt;/code&gt;, which runs through the above sequence of steps.&lt;/p&gt;
&lt;p&gt;To create read-only credentials for my existing &lt;code&gt;static.niche-museums.com&lt;/code&gt; bucket I can run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials create static.niche-museums.com --read-only

Created user: s3.read-only.static.niche-museums.com with permissions boundary: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
Attached policy s3.read-only.static.niche-museums.com to user s3.read-only.static.niche-museums.com
Created access key for user: s3.read-only.static.niche-museums.com
{
    "UserName": "s3.read-only.static.niche-museums.com",
    "AccessKeyId": "AKIAWXFXAIOZJ26NEGBN",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-11-03 03:21:12+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command shows each step as it executes, and at the end it outputs the newly created access key and secret key.&lt;/p&gt;
&lt;p&gt;It defaults to creating a user with a username that reflects what it will be able to do: &lt;code&gt;s3.read-only.static.niche-museums.com&lt;/code&gt;. You can pass &lt;code&gt;--username something&lt;/code&gt; to specify a custom username instead.&lt;/p&gt;
&lt;p&gt;If you omit the &lt;code&gt;--read-only&lt;/code&gt; flag it will create a user with read and write access to the bucket. There's also a &lt;code&gt;--write-only&lt;/code&gt; flag which creates a user that can write to but not read from the bucket - useful for use-cases like logging or backup scripts.&lt;/p&gt;
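&lt;p&gt;To give a flavour of what write-only means here, this is the rough shape of such a policy - an illustrative sketch, since the actual policies used by the tool live in its &lt;code&gt;policies.py&lt;/code&gt; file and may differ in detail:&lt;/p&gt;

```python
# Illustrative shape of a write-only bucket policy, not the exact
# document s3-credentials generates.
def write_only_policy(bucket):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # PutObject but no GetObject: a backup or logging
                # script can upload files without being able to read
                # anything back if its credentials leak.
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }

policy = write_only_policy("my-backup-bucket")
```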
&lt;p&gt;The &lt;a href="https://github.com/simonw/s3-credentials/blob/main/README.md"&gt;README&lt;/a&gt; has full documentation on the various other options, plus details of the other &lt;code&gt;s3-credentials&lt;/code&gt; utility commands &lt;code&gt;list-users&lt;/code&gt;, &lt;code&gt;list-buckets&lt;/code&gt;, &lt;code&gt;list-user-policies&lt;/code&gt; and &lt;code&gt;whoami&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Learned along the way&lt;/h4&gt;
&lt;p&gt;This really was a fantastic project for deepening my understanding of S3, IAM and how it all fits together. A few extra points I picked up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;AWS users can be created with something called a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html"&gt;permissions boundary&lt;/a&gt;. This is an advanced security feature which lets a user be restricted to a set of maximum permissions - for example, only allowed to interact with S3, not any other AWS service.&lt;/p&gt;
&lt;p&gt;Permissions boundaries do not themselves grant permissions - a user will not be able to do anything until extra policies are added to their account. A boundary instead acts as defense in depth, setting an upper limit on what a user can do no matter what other policies are applied to them.&lt;/p&gt;
&lt;p&gt;There's one big catch: the value you set for a permissions boundary is a very weakly documented ARN string - the &lt;code&gt;boto3&lt;/code&gt; documentation simply calls it "The ARN of the policy that is used to set the permissions boundary for the user". I used &lt;a href="https://github.com/search?l=Python&amp;amp;q=iam+PermissionsBoundary&amp;amp;type=Code"&gt;GitHub code search&lt;/a&gt; to dig up some examples, and found &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess&lt;/code&gt; and &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3FullAccess&lt;/code&gt; to be the ones most relevant to my project. &lt;a href="https://github.com/daviddawha/ArchivesSpaceDevUNR/blob/488b5b83f9ac66a6013e9a0a02d25734886dee02/gems/gems/fog-aws-2.0.0/lib/fog/aws/iam/default_policy_versions.json"&gt;This random file&lt;/a&gt; appears to contain more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Those JSON policy documents really are the dark secret magic that holds AWS together. Finding trustworthy examples of read-only, read-write and write-only policies for specific S3 buckets was not at all easy. I made &lt;a href="https://github.com/simonw/s3-credentials/issues/3#issuecomment-958401364"&gt;detailed notes in this comment thread&lt;/a&gt; - the policies I went with are baked into the &lt;a href="https://github.com/simonw/s3-credentials/blob/0.2/s3_credentials/policies.py"&gt;policies.py&lt;/a&gt; file in the &lt;code&gt;s3-credentials&lt;/code&gt; repository. If you know your way around IAM I would love to hear your feedback on the policies I ended up using!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Writing automated tests for code that makes extensive use of &lt;code&gt;boto3&lt;/code&gt; - such that those tests don't make any real HTTP requests to the API - is a bit fiddly. I &lt;a href="https://github.com/simonw/s3-credentials/pull/6#issuecomment-958522457"&gt;explored a few options&lt;/a&gt; for this - potential candidates included the &lt;a href="https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html"&gt;botocore.stub.Stubber&lt;/a&gt; class and the &lt;a href="https://vcrpy.readthedocs.io/"&gt;VCR.py&lt;/a&gt; library for saving and replaying HTTP traffic (see &lt;a href="https://til.simonwillison.net/pytest/pytest-recording-vcr"&gt;this TIL&lt;/a&gt;). I ended up going with Python's &lt;code&gt;Mock&lt;/code&gt; class, via &lt;a href="https://github.com/pytest-dev/pytest-mock"&gt;pytest-mock&lt;/a&gt; - here's &lt;a href="https://til.simonwillison.net/pytest/pytest-mock-calls"&gt;another TIL&lt;/a&gt; on the pattern I used for that. (Update: Jeff Triplett &lt;a href="https://twitter.com/webology/status/1455749203595087872"&gt;pointed me&lt;/a&gt; to &lt;a href="https://github.com/spulec/moto"&gt;moto&lt;/a&gt; which looks like a really great solution for this.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
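&lt;p&gt;The permissions boundary mentioned above is set at user creation time. Here's a minimal boto3 sketch - the username is a placeholder, and the optional &lt;code&gt;iam&lt;/code&gt; parameter is just there so the sketch can be exercised without real credentials:&lt;/p&gt;

```python
def create_bounded_user(
    username,
    boundary_arn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
    iam=None,
):
    # The boundary caps what this user can ever do: policies attached
    # later cannot grant anything outside it.
    if iam is None:
        import boto3  # requires configured AWS credentials
        iam = boto3.client("iam")
    return iam.create_user(
        UserName=username,
        PermissionsBoundary=boundary_arn,
    )
```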
&lt;h4&gt;Feedback from AWS experts wanted&lt;/h4&gt;
&lt;p&gt;The tool I've built solves my specific problem pretty well. I'm nervous about it though: I am by no means an IAM expert, and I'm somewhat paranoid that I may have made a dumb mistake and baked it into the tooling.&lt;/p&gt;
&lt;p&gt;As such, the README currently &lt;a href="https://github.com/simonw/s3-credentials/tree/0.2#%EF%B8%8F-warning"&gt;carries a warning&lt;/a&gt; that you should review what the tool is doing carefully before trusting it against your own AWS account!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 20 February 2022:&lt;/strong&gt; I removed that warning, since I've now spent long enough working on this tool that I'm comfortable with how it works.&lt;/p&gt;
&lt;p&gt;If you are an AWS expert, you can help: I have &lt;a href="https://github.com/simonw/s3-credentials/issues/7"&gt;an open issue&lt;/a&gt; requesting expert feedback, and I'd love to hear from people with deep experience who can either validate that my approach is sound or help explain what I'm doing wrong and how the process can be fixed.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="python"/><category term="s3"/><category term="security"/><category term="s3-credentials"/></entry></feed>