<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: s3-credentials</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/s3-credentials.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-16T23:40:31+00:00</updated><author><name>Simon Willison</name></author><entry><title>s3-credentials 0.17</title><link href="https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag" rel="alternate"/><published>2025-12-16T23:40:31+00:00</published><updated>2025-12-16T23:40:31+00:00</updated><id>https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.17"&gt;s3-credentials 0.17&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; CLI tool for managing credentials needed to access just one S3 bucket. Here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New commands &lt;code&gt;get-bucket-policy&lt;/code&gt; and &lt;code&gt;set-bucket-policy&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/91"&gt;#91&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New commands &lt;code&gt;get-public-access-block&lt;/code&gt; and &lt;code&gt;set-public-access-block&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/92"&gt;#92&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;localserver&lt;/code&gt; command for starting a web server that makes time-limited credentials accessible via a JSON API. &lt;a href="https://github.com/simonw/s3-credentials/pull/93"&gt;#93&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;code&gt;s3-credentials localserver&lt;/code&gt; command (&lt;a href="https://s3-credentials.readthedocs.io/en/stable/localserver.html"&gt;documented here&lt;/a&gt;) is a little obscure, but I found myself wanting something like that to help me test out a new feature I'm building to help create temporary Litestream credentials using Amazon STS.&lt;/p&gt;
&lt;p&gt;Most of that new feature was &lt;a href="https://gistpreview.github.io/?500add71f397874ebadb8e04e8a33b53"&gt;built by Claude Code&lt;/a&gt; from the following starting prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add a feature s3-credentials localserver which starts a localhost weberver running (using the Python standard library stuff) on port 8094 by default but -p/--port can set a different port and otherwise takes an option that names a bucket and then takes the same options for read--write/read-only etc as other commands. It also takes a required --refresh-interval option which can be set as 5m or 10h or 30s. All this thing does is reply on / to a GET request with the IAM expiring credentials that allow access to that bucket with that policy for that specified amount of time. It caches internally the credentials it generates and will return the exact same data up until they expire (it also tracks expected expiry time) after which it will generate new credentials (avoiding dog pile effects if multiple requests ask at the same time) and return and cache those instead.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
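The caching behaviour described in that prompt (return the same credentials until they expire, then regenerate exactly once even under concurrent requests) can be sketched in stdlib Python. This is a hypothetical illustration, not the code Claude Code actually produced; `fetch` stands in for the real STS call and `CredentialCache` is a name invented here:

```python
import threading
import time


class CredentialCache:
    """Serve cached credentials until shortly before they expire.

    Holding a lock across the regeneration step means only one
    caller triggers a new fetch when the cache is stale, which is
    what avoids the dog-pile effect the prompt describes.
    """

    def __init__(self, fetch, refresh_interval_s):
        self._fetch = fetch  # stand-in for the real STS call
        self._interval = refresh_interval_s
        self._lock = threading.Lock()
        self._cached = None
        self._expires_at = 0.0

    def get(self):
        with self._lock:
            now = time.monotonic()
            if self._cached is None or now >= self._expires_at:
                self._cached = self._fetch()
                self._expires_at = now + self._interval
            return self._cached
```

Every HTTP request handler would then just return `cache.get()` as JSON; the lock makes the cache safe to share across the threaded stdlib server.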


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="ai"/><category term="annotated-release-notes"/><category term="s3-credentials"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>Poe the Poet</title><link href="https://simonwillison.net/2025/Dec/16/poe-the-poet/#atom-tag" rel="alternate"/><published>2025-12-16T22:57:02+00:00</published><updated>2025-12-16T22:57:02+00:00</updated><id>https://simonwillison.net/2025/Dec/16/poe-the-poet/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://poethepoet.natn.io/"&gt;Poe the Poet&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was looking for a way to specify additional commands in my &lt;code&gt;pyproject.toml&lt;/code&gt; file to execute using &lt;code&gt;uv&lt;/code&gt;. There's an &lt;a href="https://github.com/astral-sh/uv/issues/5903"&gt;enormous issue thread&lt;/a&gt; on this in the &lt;code&gt;uv&lt;/code&gt; issue tracker (300+ comments dating back to August 2024) and from there I learned of several options including this one, Poe the Poet.&lt;/p&gt;
&lt;p&gt;It's neat. I added it to my &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; project just now and the following now works for running the live preview server for the documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run poe livehtml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the snippet of TOML I added to my &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;[&lt;span class="pl-en"&gt;dependency-groups&lt;/span&gt;]
&lt;span class="pl-smi"&gt;test&lt;/span&gt; = [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pytest&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pytest-mock&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cogapp&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;moto&amp;gt;=5.0.4&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]
&lt;span class="pl-smi"&gt;docs&lt;/span&gt; = [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;furo&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-autobuild&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;myst-parser&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cogapp&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]
&lt;span class="pl-smi"&gt;dev&lt;/span&gt; = [
    {&lt;span class="pl-smi"&gt;include-group&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;test&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    {&lt;span class="pl-smi"&gt;include-group&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;docs&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;poethepoet&amp;gt;=0.38.0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
]

[&lt;span class="pl-en"&gt;tool&lt;/span&gt;.&lt;span class="pl-en"&gt;poe&lt;/span&gt;.&lt;span class="pl-en"&gt;tasks&lt;/span&gt;]
&lt;span class="pl-smi"&gt;docs&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-build -M html docs docs/_build&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-smi"&gt;livehtml&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sphinx-autobuild -b html docs docs/_build&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-smi"&gt;cog&lt;/span&gt; = &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cog -r docs/*.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Since &lt;code&gt;poethepoet&lt;/code&gt; is in the &lt;code&gt;dev&lt;/code&gt; dependency group, any time I run &lt;code&gt;uv run ...&lt;/code&gt; it will be available in the environment.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/packaging"&gt;packaging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="packaging"/><category term="python"/><category term="s3-credentials"/><category term="uv"/></entry><entry><title>s3-credentials 0.16</title><link href="https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag" rel="alternate"/><published>2024-04-05T05:35:57+00:00</published><updated>2024-04-05T05:35:57+00:00</updated><id>https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.16"&gt;s3-credentials 0.16&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a &lt;code&gt;PublicAccessBlockConfiguration&lt;/code&gt; of &lt;code&gt;{"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;s3-credentials --create-bucket --public&lt;/code&gt; option now does that for you. I also added an &lt;code&gt;s3-credentials debug-bucket name-of-bucket&lt;/code&gt; command to help figure out why a bucket isn't working as expected.&lt;/p&gt;
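For reference, applying that configuration with boto3 looks roughly like the sketch below. `put_public_access_block` is the real boto3 S3 client method, but `apply_public_access` and `PUBLIC_ACCESS_CONFIG` are names invented here for illustration; the function takes any client object exposing that method, so you can exercise it with a stub instead of a live AWS client:

```python
# The four flags that all need to be False before public bucket
# policies and ACLs take effect on a new bucket.
PUBLIC_ACCESS_CONFIG = {
    "BlockPublicAcls": False,
    "IgnorePublicAcls": False,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": False,
}


def apply_public_access(s3_client, bucket):
    # With a real client this would be: boto3.client("s3")
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_CONFIG,
    )
```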


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
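The `--convert` body is ordinary Python, so the same reduction can be pulled out into a standalone function. This sketch loads the snapshot once rather than twice as the inline version does, but produces the same single-row result:

```python
import json


def convert(content):
    # Collapse a whole instances.json snapshot into one row with
    # id "all", so git-history tracks a single object over time.
    instances = json.loads(content)
    return [
        {
            "id": "all",
            "users": sum(d["users"] or 0 for d in instances),
            "statuses": sum(int(d["statuses"] or 0) for d in instances),
        }
    ]
```

Note the `or 0` guards: some instances report `null` for `users` or `statuses`, and `statuses` arrives as a string, so each value is coerced before summing.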
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
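That join can be tried out directly with Python's built-in sqlite3 module. This sketch uses a simplified version of the schema above and hand-built rows rather than the real counts.db:

```python
import sqlite3

# In-memory stand-in for counts.db with one commit and one snapshot row.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    hash TEXT,
    commit_at TEXT
);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY,
    _commit INTEGER REFERENCES commits(id),
    users INTEGER,
    statuses INTEGER
);
INSERT INTO commits VALUES (1, 'abc123', '2022-11-20 01:00:00');
INSERT INTO item_version VALUES (1, 1, 4694000, 374000000);
""")

# Join each version row to its commit to recover the date of the count.
rows = db.execute("""
    SELECT commits.commit_at, item_version.users, item_version.statuses
    FROM item_version
    JOIN commits ON item_version._commit = commits.id
""").fetchall()
```

git-history's `item_version_detail` view performs essentially this join for you, which is why it is the most useful page to open in Datasette.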
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created  user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We need to make one change to that data - we need to convert the &lt;code&gt;date&lt;/code&gt; column from a string to a JavaScript date object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Weeknotes: Datasette Lite, s3-credentials, shot-scraper, datasette-edit-templates and more</title><link href="https://simonwillison.net/2022/Sep/16/weeknotes/#atom-tag" rel="alternate"/><published>2022-09-16T02:55:03+00:00</published><updated>2022-09-16T02:55:03+00:00</updated><id>https://simonwillison.net/2022/Sep/16/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Despite &lt;a href="https://simonwillison.net/2022/Sep/12/prompt-injection/"&gt;distractions from AI&lt;/a&gt; I managed to make progress on a bunch of different projects this week, including new releases of &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; and &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt;, a new &lt;a href="https://datasette.io/plugins/datasette-edit-templates"&gt;datasette-edit-templates&lt;/a&gt; plugin and a small but neat improvement to &lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Better GitHub support for Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/"&gt;Datasette Lite&lt;/a&gt; is &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette running in WebAssembly&lt;/a&gt;. Originally intended as a cool tech demo it's quickly becoming a key component of the wider Datasette ecosystem - just this week I saw that mySociety are using it to help people explore their &lt;a href="https://mysociety.github.io/wdtk_authorities_list/datasets/whatdotheyknow_authorities_dataset/latest"&gt;WhatDoTheyKnow Authorities Dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the neat things about Datasette Lite is that you can feed it URLs to CSV files, SQLite database files and even SQL initialization scripts and it will fetch them into your browser and serve them up inside Datasette. I wrote more about this capability in &lt;a href="https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/"&gt;Joining CSV files in your browser using Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's just one catch: because those URLs are fetched by JavaScript running in your browser, they need to be served from a host that sets the &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; header (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS"&gt;see MDN&lt;/a&gt;). This is not an easy thing to explain to people!&lt;/p&gt;
&lt;p&gt;The good news here is that GitHub makes every public file (and every Gist) hosted on GitHub available as static hosting with that magic header.&lt;/p&gt;
&lt;p&gt;The bad news is that you have to know how to construct that URL! GitHub's "raw" links redirect to that URL, but JavaScript &lt;code&gt;fetch()&lt;/code&gt; calls can't follow a redirect unless the redirect response itself carries that header - and GitHub's redirects do not.&lt;/p&gt;
&lt;p&gt;So you need to know that if you want to load the SQLite database file from this page on GitHub:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You first need to rewrite that URL to the following, which is served with the correct CORS header:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Asking humans to do that by hand isn't reasonable. So I added some code!&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;githubUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-cce"&gt;^&lt;/span&gt;https:&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;github.com&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;blob&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;.&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-cce"&gt;\?&lt;/span&gt;raw=true&lt;span class="pl-kos"&gt;)&lt;/span&gt;?&lt;span class="pl-cce"&gt;$&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;fixUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;matches&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;githubUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`https://raw.githubusercontent.com/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;matches&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Fun aside: GitHub Copilot auto-completed that &lt;code&gt;return&lt;/code&gt; statement for me, correctly guessing the URL string I needed based on the regular expression I had defined several lines earlier.&lt;/p&gt;
&lt;p&gt;Now any time you feed Datasette Lite a URL, if it's a GitHub page it will automatically rewrite it to the CORS-enabled equivalent on the &lt;code&gt;raw.githubusercontent.com&lt;/code&gt; domain.&lt;/p&gt;
&lt;p&gt;Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lite.datasette.io/?url=https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;https://lite.datasette.io/?url=https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite&lt;/a&gt; - that Chinook SQLite database example (from &lt;a href="https://github.com/lerocha/chinook-database/blob/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://lite.datasette.io/?csv=https://github.com/simonw/covid-19-datasette/blob/6294ade30843bfd76f2d82641a8df76d8885effa/us_census_state_populations_2019.csv"&gt;https://lite.datasette.io/?csv=https://github.com/simonw/covid-19-datasette/blob/6294ade30843bfd76f2d82641a8df76d8885effa/us_census_state_populations_2019.csv&lt;/a&gt; - US censes populations by state, from my &lt;a href="https://github.com/simonw/covid-19-datasette"&gt;simonw/covid-19-datasette&lt;/a&gt; repo&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;datasette-edit-templates&lt;/h4&gt;
&lt;p&gt;I started working on this plugin a couple of years ago but didn't get it working. This week I finally &lt;a href="https://github.com/simonw/datasette-edit-templates/issues/1"&gt;closed the initial issue&lt;/a&gt; and shipped a &lt;a href="https://datasette.io/plugins/datasette-edit-templates"&gt;first alpha release&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's pretty fun. On first launch it creates a &lt;code&gt;_templates_&lt;/code&gt; table in your database. Then it allows the &lt;code&gt;root&lt;/code&gt; user (run &lt;code&gt;datasette data.db --root&lt;/code&gt; and click the link to sign in as root) to edit Datasette's default set of Jinja templates, writing their changes to that new table.&lt;/p&gt;
&lt;p&gt;Datasette uses those templates straight away. It turns the whole of Datasette into an interface for editing itself.&lt;/p&gt;
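&lt;p&gt;Illustratively, the lookup works something like this: a template body is fetched from the &lt;code&gt;_templates_&lt;/code&gt; table if present, otherwise Datasette falls back to the copy on disk. (The schema and function here are a simplified sketch for illustration, not the plugin's actual implementation.)&lt;/p&gt;

```python
import sqlite3

# Simplified sketch: edited templates live in a _templates_ table.
# This two-column schema is an assumption for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE _templates_ (template TEXT PRIMARY KEY, body TEXT)")
db.execute(
    "INSERT INTO _templates_ VALUES (?, ?)",
    ("_footer.html", "Powered by Datasette!"),
)

def load_template(name, default="(template loaded from disk)"):
    # Prefer the edited copy in the database, fall back to disk
    row = db.execute(
        "SELECT body FROM _templates_ WHERE template = ?", (name,)
    ).fetchone()
    return row[0] if row else default

print(load_template("_footer.html"))  # the edited template wins
print(load_template("table.html"))    # unedited templates come from disk
```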
&lt;p&gt;Here's an animated demo showing the plugin in action:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-edit-templates.gif" alt="Animated screenshot. The Datasette app menu now has a Edit templates item, which goes to a page listing all of the templates. If you edit the _footer.html template to add an exclamation mark on the next page the Datasette footer shows that change." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The implementation is currently a bit gnarly, but I've filed &lt;a href="https://github.com/simonw/datasette/issues/1809"&gt;an issue&lt;/a&gt; in Datasette core to help clear some of it up.&lt;/p&gt;
&lt;h4&gt;s3-credentials get-objects and put-objects&lt;/h4&gt;
&lt;p&gt;I built &lt;a href="https://s3-credentials.readthedocs.org/"&gt;s3-credentials&lt;/a&gt; to solve my number one frustration with AWS S3: the surprising level of complexity involved in issuing IAM credentials that could only access a specific S3 bucket. I introduced it in &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials: a tool for creating credentials for S3 buckets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once you've created credentials, you need to be able to do stuff with them. I find the default AWS CLI tools relatively unintuitive, so &lt;code&gt;s3-credentials&lt;/code&gt; has continued to grow &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html"&gt;other commands&lt;/a&gt; as and when I feel the need for them.&lt;/p&gt;
&lt;p&gt;The latest version, &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.14"&gt;0.14&lt;/a&gt;, adds two more: &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html#get-objects"&gt;get-objects&lt;/a&gt; and &lt;a href="https://s3-credentials.readthedocs.io/en/stable/other-commands.html#put-objects"&gt;put-objects&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These let you do things like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials get-objects my-bucket -p "*.txt" -p "static/*.css"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads every key in &lt;code&gt;my-bucket&lt;/code&gt; with a name that matches either of those patterns.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-objects my-bucket one.txt ../other-directory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uploads &lt;code&gt;one.txt&lt;/code&gt; and the whole &lt;code&gt;other-directory&lt;/code&gt; folder with all of its contents.&lt;/p&gt;
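&lt;p&gt;The &lt;code&gt;-p&lt;/code&gt; patterns behave like shell-style globs matched against the full key. A rough sketch of that matching, assuming &lt;code&gt;fnmatch&lt;/code&gt;-style semantics (the tool's actual matching may differ in detail):&lt;/p&gt;

```python
from fnmatch import fnmatch

def matching_keys(keys, patterns):
    # Keep any key that matches at least one glob-style pattern
    return [key for key in keys if any(fnmatch(key, p) for p in patterns)]

keys = ["readme.txt", "static/site.css", "static/app.js", "data/log.txt"]
print(matching_keys(keys, ["*.txt", "static/*.css"]))
# ['readme.txt', 'static/site.css', 'data/log.txt']
```

&lt;p&gt;Note that &lt;code&gt;fnmatch&lt;/code&gt;'s &lt;code&gt;*&lt;/code&gt; also matches &lt;code&gt;/&lt;/code&gt;, so &lt;code&gt;*.txt&lt;/code&gt; picks up keys in nested "directories" too.&lt;/p&gt;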
&lt;p&gt;As with most of my projects, the GitHub issues threads for each of these include a blow-by-blow account of how I finalized their design - &lt;a href="https://github.com/simonw/s3-credentials/issues/68"&gt;#68&lt;/a&gt; for &lt;code&gt;put-objects&lt;/code&gt; and &lt;a href="https://github.com/simonw/s3-credentials/issues/78"&gt;#78&lt;/a&gt; for &lt;code&gt;get-objects&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;shot-scraper --log-requests&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; is my tool for automating screenshots, &lt;a href="https://simonwillison.net/2022/Mar/10/shot-scraper/"&gt;built on top of Playwright&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Its latest feature was inspired by Datasette Lite.&lt;/p&gt;
&lt;p&gt;I have an ongoing ambition to get Datasette Lite to work &lt;a href="https://github.com/simonw/datasette-lite/issues/26"&gt;entirely offline&lt;/a&gt;, using Service Workers.&lt;/p&gt;
&lt;p&gt;The first step is to get it to work &lt;a href="https://github.com/simonw/datasette-lite/issues/40"&gt;without loading external resources&lt;/a&gt; - it currently hits PyPI and a separate CDN multiple times to download wheels every time you load the application.&lt;/p&gt;
&lt;p&gt;To do that, I need a reliable list of all of the assets that it's fetching.&lt;/p&gt;
&lt;p&gt;Wouldn't it be handy if I could run a command and get a list of those resources?&lt;/p&gt;
&lt;p&gt;The following command now does exactly that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests requests.log
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the &lt;code&gt;--wait-for&lt;/code&gt; option is needed to ensure &lt;code&gt;shot-scraper&lt;/code&gt; doesn't terminate until the application has fully loaded - detected by waiting for an &lt;code&gt;&amp;lt;h2&amp;gt;&lt;/code&gt; element to be added to the page.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--log-requests&lt;/code&gt; bit is a &lt;a href="https://shot-scraper.datasette.io/en/stable/screenshots.html#logging-all-requests"&gt;new feature&lt;/a&gt; in &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.15"&gt;shot-scraper 0.15&lt;/a&gt;: it logs out a newline-delimited JSON file with details of all of the resources fetched during the run. That file starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"method": "GET", "url": "https://lite.datasette.io/", "size": 10516, "timing": {...}}
{"method": "GET", "url": "https://plausible.io/js/script.manual.js", "size": 1005, "timing": {...}}
{"method": "GET", "url": "https://latest.datasette.io/-/static/app.css?cead5a", "size": 16230, "timing": {...}}
{"method": "GET", "url": "https://lite.datasette.io/webworker.js", "size": 4875, "timing": {...}}
{"method": "GET", "url": "https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js", "size": null, "timing": {...}}
&lt;/code&gt;&lt;/pre&gt;
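&lt;p&gt;Because the log is newline-delimited JSON it's easy to post-process. For example, here's a quick sketch listing the distinct hosts the page fetched from, with a few sample lines standing in for the real &lt;code&gt;requests.log&lt;/code&gt;:&lt;/p&gt;

```python
import json
from urllib.parse import urlparse

# Sample lines standing in for a real --log-requests file
log = [
    '{"method": "GET", "url": "https://lite.datasette.io/", "size": 10516}',
    '{"method": "GET", "url": "https://plausible.io/js/script.manual.js", "size": 1005}',
    '{"method": "GET", "url": "https://cdn.jsdelivr.net/pyodide/v0.20.0/full/pyodide.js", "size": null}',
]
# Parse each line and collect the unique hostnames
hosts = sorted({urlparse(json.loads(line)["url"]).netloc for line in log})
print(hosts)
# ['cdn.jsdelivr.net', 'lite.datasette.io', 'plausible.io']
```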
&lt;p&gt;This is already pretty useful... but wouldn't it be more useful if I could explore that data in Datasette?&lt;/p&gt;
&lt;p&gt;That's what this recipe does:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils insert /tmp/datasette-lite.db log - --flatten --nl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's piping the newline-delimited JSON to &lt;code&gt;sqlite-utils insert&lt;/code&gt; which then inserts it, using the &lt;code&gt;--flatten&lt;/code&gt; option to turn that nested &lt;code&gt;timing&lt;/code&gt; object into a flat set of columns.&lt;/p&gt;
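&lt;p&gt;The effect of &lt;code&gt;--flatten&lt;/code&gt; can be sketched in a few lines of Python - nested objects become columns whose names join the parent and child keys with an underscore. (This is a sketch of the idea, not &lt;code&gt;sqlite-utils&lt;/code&gt;' exact implementation.)&lt;/p&gt;

```python
def flatten(row, prefix=""):
    # Recursively turn nested dicts into flat key/value pairs:
    # {"timing": {"startTime": 0.0}} becomes {"timing_startTime": 0.0}
    flat = {}
    for key, value in row.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat

record = {
    "method": "GET",
    "url": "https://lite.datasette.io/",
    "size": 10516,
    "timing": {"startTime": 0.0, "responseEnd": 82.5},
}
print(flatten(record))
```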
&lt;p&gt;I decided to share it by turning it into a SQL dump and publishing that to &lt;a href=""&gt;this Gist&lt;/a&gt;. I did that using the &lt;code&gt;sqlite-utils memory&lt;/code&gt; command to convert it to a SQL dump like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://lite.datasette.io/ \
  --wait-for 'document.querySelector("h2")' \
  --log-requests - | \
  sqlite-utils memory stdin:nl --flatten --dump &amp;gt; dump.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;stdin:nl&lt;/code&gt; means "read from standard input and treat that as newline-delimited JSON". Then I run a &lt;code&gt;select *&lt;/code&gt; command and use &lt;code&gt;--dump&lt;/code&gt; to output that to &lt;code&gt;dump.sql&lt;/code&gt;, which I pasted into a new Gist.&lt;/p&gt;
&lt;p&gt;So now I can &lt;a href="https://lite.datasette.io/?sql=https://gist.githubusercontent.com/simonw/7f41a43ba0f177238ed7bdd95078a0d4/raw/4fc0f80decce4e1ea1e925cdc2bf3f05d73034ed/datasette-lite.sql#/data/stdin"&gt;open the result in Datasette Lite&lt;/a&gt;!&lt;/p&gt;
&lt;h4&gt;Datasette on Sandstorm&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://sandstorm.io/"&gt;Sandstorm&lt;/a&gt; is "an open source
platform for self-hosting web apps". You can think of it as an easy to use UI over a Docker-like container platform - once you've installed it on a server you can use it to manage and install applications that have been bundled for it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ocdtrekkie"&gt;Jacob Weisz&lt;/a&gt; has been doing exactly that for Datasette. The result is &lt;a href="https://apps.sandstorm.io/app/uawacvvx9f9ncex1sqj8njwpujf8s9fkmg7wmp55hg6xetrd45w0"&gt;Datasette in the Sandstorm App Market&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-sandstorm.jpg" alt="The listing for Datasette on the Sandstorm App Market, with a prominent DEMO button" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;You can see how it works in the &lt;a href="https://github.com/ocdtrekkie/datasette-sandstorm"&gt;ocdtrekkie/datasette-sandstorm&lt;/a&gt; repo. I helped out by building a small &lt;a href="https://github.com/simonw/datasette-sandstorm-support"&gt;datasette-sandstorm-support&lt;/a&gt; plugin to show how permissions and authentication can work against Sandstorm's &lt;a href="https://docs.sandstorm.io/en/latest/developing/auth/"&gt;custom HTTP headers&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.14"&gt;0.14&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;15 releases total&lt;/a&gt;) - 2022-09-15&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.16"&gt;0.16&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;21 releases total&lt;/a&gt;) - 2022-09-15&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-templates"&gt;datasette-edit-templates&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-templates/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2022-09-14&lt;br /&gt;Plugin allowing Datasette templates to be edited within Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sandstorm-support"&gt;datasette-sandstorm-support&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sandstorm-support/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-09-14&lt;br /&gt;Authentication and permissions for Datasette on Sandstorm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-dbs"&gt;datasette-upload-dbs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-dbs/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-dbs/releases"&gt;3 releases total&lt;/a&gt;) - 2022-09-09&lt;br /&gt;Upload SQLite database files to Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.8.2"&gt;0.8.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases"&gt;13 releases total&lt;/a&gt;) - 2022-09-08&lt;br /&gt;Datasette plugin for uploading CSV files and converting them to database tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/docker/pytest-docker"&gt;Run pytest against a specific Python version using Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/clone-and-push-gist"&gt;Clone, edit and push files that live in a Gist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/external-display-laptop"&gt;Driving an external display from a Mac laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/macos/ifuse-iphone"&gt;Browse files (including SQLite databases) on your iPhone with ifuse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/pypy-macos"&gt;Running PyPy on macOS using Homebrew&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-copilot"&gt;github-copilot&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="plugins"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="s3-credentials"/><category term="shot-scraper"/><category term="datasette-lite"/><category term="github-copilot"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopy Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in &lt;code&gt;s3-ocr&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF, so that in the future I can implement a system that re-runs OCR if the PDF has been updated.&lt;/p&gt;
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
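That skip logic is simple enough to sketch as a pure function over the bucket's key listing (a hypothetical helper, not s3-ocr's actual code):

```python
def pdfs_needing_ocr(keys):
    """Given a flat listing of S3 keys, return the PDFs that have no
    accompanying .s3-ocr.json manifest and therefore still need to be
    submitted to Textract."""
    key_set = set(keys)
    return sorted(
        key
        for key in key_set
        if key.lower().endswith(".pdf")
        and key + ".s3-ocr.json" not in key_set
    )
```

A full listing of the bucket (via paginated list_objects_v2 calls) is all the state the command needs to resume.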
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to an &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: s3-credentials prefix and Datasette 0.60</title><link href="https://simonwillison.net/2022/Jan/18/weeknotes/#atom-tag" rel="alternate"/><published>2022-01-18T04:37:39+00:00</published><updated>2022-01-18T04:37:39+00:00</updated><id>https://simonwillison.net/2022/Jan/18/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;A &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;new release&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; with support for restricting access to keys that start with a prefix, &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-60"&gt;Datasette 0.60&lt;/a&gt; and a write-up of &lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/"&gt;my process for shipping a feature&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --prefix&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; is my tool for creating limited scope AWS credentials that can only read and write from a specific S3 bucket. I introduced it &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;in this blog entry&lt;/a&gt; in November, and I've continued to iterate on it since then.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;s3-credentials 0.9&lt;/a&gt; today with a feature I've been planning since I first built the tool: the ability to &lt;a href="https://github.com/simonw/s3-credentials/issues/12"&gt;specify a --prefix&lt;/a&gt; and get credentials that are only allowed to operate on keys within a specific folder within the S3 bucket.&lt;/p&gt;
&lt;p&gt;This is particularly useful if you are building multi-tenant SaaS applications on top of AWS. You might decide to create a bucket per customer... but S3 limits you to 100 buckets per AWS account by default, with a maximum of 1,000 buckets if you request an increase.&lt;/p&gt;
&lt;p&gt;So a bucket per customer won't scale above 1,000 customers.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.assume_role"&gt;sts.assume_role()&lt;/a&gt; API lets you retrieve temporary credentials for S3 that can have limits attached to them - including restricting them to keys within a specific bucket and under a specific prefix. That means you can create limited duration credentials that can only read and write from a specific prefix within a bucket.&lt;/p&gt;
&lt;p&gt;Which solves the problem! Each of your customers can have a dedicated prefix within the bucket, and your application can issue restricted tokens that greatly reduce the risk of one customer accidentally seeing files that belong to another.&lt;/p&gt;
&lt;p&gt;Here's how to use it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create name-of-bucket --prefix user1410/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will return a JSON set of credentials - an access key and secret key - that can only be used to read and write keys in that bucket that start with &lt;code&gt;user1410/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Add &lt;code&gt;--read-only&lt;/code&gt; to make those credentials read-only, and &lt;code&gt;--write-only&lt;/code&gt; for credentials that can be used to write but not read records.&lt;/p&gt;
&lt;p&gt;If you add &lt;code&gt;--duration 15m&lt;/code&gt; the returned credentials will only be valid for 15 minutes, using &lt;code&gt;sts.assume_role()&lt;/code&gt;. The README includes &lt;a href="https://github.com/simonw/s3-credentials#changes-that-will-be-made-to-your-aws-account"&gt;a detailed description&lt;/a&gt; of the changes that will be made to your AWS account by the tool.&lt;/p&gt;
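That 15m shorthand ends up as the DurationSeconds parameter to sts.assume_role(), which is measured in seconds - 15 minutes is the 900 seconds shown in the dry-run output. A sketch of the conversion (the function name and supported units here are assumptions for illustration, not s3-credentials' exact parsing code):

```python
def duration_to_seconds(duration):
    """Convert a --duration style value like '15m', '2h' or '900s' into
    the integer seconds expected by sts.assume_role()'s DurationSeconds
    parameter. Bare numbers are treated as seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    if duration[-1] in units:
        return int(duration[:-1]) * units[duration[-1]]
    return int(duration)
```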
&lt;p&gt;You can also add &lt;code&gt;--dry-run&lt;/code&gt; to see a text summary of changes without applying them to your account. Here's an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials create name-of-bucket --prefix user1410/ --read-only --dry-run --duration 15m
Would create bucket: 'name-of-bucket'
Would ensure role: 's3-credentials.AmazonS3FullAccess'
Would assume role using following policy for 900 seconds:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "user1410/*"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:GetObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::name-of-bucket/user1410/*"
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As with all things AWS, the magic is in the details of the JSON policy document. The README includes details of exactly &lt;a href="https://github.com/simonw/s3-credentials/blob/0.9/README.md#--prefix-my-prefix"&gt;what those policies look like&lt;/a&gt;. Getting them right was by far the hardest part of building this tool!&lt;/p&gt;
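The read-only prefix policy in the dry-run output above can be generated mechanically. Here's a sketch that rebuilds that document from a bucket name and prefix (the helper name is hypothetical; the statement contents mirror the dry-run output):

```python
def prefix_read_only_policy(bucket, prefix):
    """Build the read-only policy shown in the dry-run output:
    GetBucketLocation on the bucket, ListBucket restricted to the
    prefix, and object-level Get* actions restricted to keys under
    the prefix."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation"],
                "Resource": [bucket_arn],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
                # s3:prefix is evaluated against the list request, so the
                # condition is what stops listing outside the prefix
                "Condition": {"StringLike": {"s3:prefix": [prefix + "*"]}},
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectAcl",
                    "s3:GetObjectLegalHold",
                    "s3:GetObjectRetention",
                    "s3:GetObjectTagging",
                ],
                "Resource": [f"{bucket_arn}/{prefix}*"],
            },
        ],
    }
```

Note how the prefix appears in two different places: as an s3:prefix condition on the ListBucket statement, and embedded directly in the Resource ARN for the object-level actions.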
&lt;h4&gt;s3-credentials integration tests&lt;/h4&gt;
&lt;p&gt;When writing automated tests, I generally avoid calling any external APIs or making any outbound network traffic. I want the tests to run in an isolated environment, with no risk that some other system that's having a bad day could cause random test failures.&lt;/p&gt;
&lt;p&gt;Since the hardest part of building this tool is having confidence that it does the right thing, I decided to also include a suite of integration tests that actively exercise Amazon S3.&lt;/p&gt;
&lt;p&gt;By default, running &lt;code&gt;pytest&lt;/code&gt; will skip these:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pytest
================ test session starts ================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/simon/Dropbox/Development/s3-credentials
plugins: recording-0.12.0, mock-3.6.1
collected 61 items                                  

tests/test_dry_run.py ....                    [  6%]
tests/test_integration.py ssssssss            [ 19%]
tests/test_s3_credentials.py ................ [ 45%]
.................................             [100%]

=========== 53 passed, 8 skipped in 1.21s ===========
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running &lt;code&gt;pytest --integration&lt;/code&gt; runs the test suite with those tests enabled. It expects the computer they are running on to have AWS credentials with the ability to create buckets and users - I'm too nervous to add these secrets to GitHub Actions, so I currently only run the integration suite on my own laptop.&lt;/p&gt;
&lt;p&gt;These were invaluable for getting confident that the new &lt;code&gt;--prefix&lt;/code&gt; option behaved as expected, especially when combined with &lt;code&gt;--read-only&lt;/code&gt; and &lt;code&gt;--write-only&lt;/code&gt;. Here's the &lt;a href="https://github.com/simonw/s3-credentials/blob/0.9/tests/test_integration.py#L219-L279"&gt;test_prefix_read_only()&lt;/a&gt; test which exercises the &lt;code&gt;--prefix --read-only&lt;/code&gt; combination.&lt;/p&gt;
&lt;h4&gt;s3-credentials list-bucket&lt;/h4&gt;
&lt;p&gt;One more new feature: the &lt;code&gt;s3-credentials list-bucket name-of-bucket&lt;/code&gt; command lists all of the keys in a specific bucket.&lt;/p&gt;
&lt;p&gt;By default it returns a JSON array, but you can add &lt;code&gt;--nl&lt;/code&gt; to get back &lt;a href="http://ndjson.org/"&gt;newline delimited JSON&lt;/a&gt; or &lt;code&gt;--csv&lt;/code&gt; or &lt;code&gt;--tsv&lt;/code&gt; to get back CSV or TSV.&lt;/p&gt;
&lt;p&gt;So... a fun thing you can do with the command is pipe the output into &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json"&gt;sqlite-utils insert&lt;/a&gt; to create a SQLite database file of your bucket contents... and then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to browse it!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials list-bucket static.niche-museums.com --nl \
  | sqlite-utils insert s3.db keys - --nl
% datasette s3.db -o
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create an &lt;code&gt;s3.db&lt;/code&gt; SQLite database with a &lt;code&gt;keys&lt;/code&gt; table containing your bucket contents, then open Datasette to let you interact with the table.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/s3-keys.png" alt="A screenshot of the keys table running in Datasette" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Datasette 0.60&lt;/h4&gt;
&lt;p&gt;I shipped several months of work on Datasette a few days ago as &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-60"&gt;Datasette 0.60&lt;/a&gt;. I published &lt;a href="https://simonwillison.net/2022/Jan/14/datasette-060/"&gt;annotated release notes&lt;/a&gt; for that release which describe the background of those changes in detail.&lt;/p&gt;
&lt;p&gt;I also released new versions of &lt;a href="https://github.com/simonw/datasette-pretty-traces/releases/tag/0.4"&gt;datasette-pretty-traces&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases/tag/0.3"&gt;datasette-leaflet-freedraw&lt;/a&gt; to take advantage of new features added to Datasette.&lt;/p&gt;
&lt;h4&gt;How I build a feature&lt;/h4&gt;
&lt;p&gt;My other big project this week was a blog post: &lt;a href="https://simonwillison.net/2022/Jan/12/how-i-build-a-feature/"&gt;How I build a feature&lt;/a&gt;, which goes into detail about the process I use for adding new features to my various projects. I've had some great feedback about this, so I'm tempted to write more about general software engineering process stuff here in the future.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.9"&gt;0.9&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;9 releases total&lt;/a&gt;) - 2022-01-18
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-pretty-traces"&gt;datasette-pretty-traces&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-pretty-traces/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-pretty-traces/releases"&gt;6 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;Prettier formatting for ?_trace=1 traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-leaflet-freedraw"&gt;datasette-leaflet-freedraw&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases/tag/0.3"&gt;0.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-leaflet-freedraw/releases"&gt;8 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;Draw polygons on maps in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.60"&gt;0.60&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;105 releases total&lt;/a&gt;) - 2022-01-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/2.0.1"&gt;2.0.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;33 releases total&lt;/a&gt;) - 2022-01-12
&lt;br /&gt;Datasette plugin providing an automatic GraphQL API for your SQLite databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github/dependabot-python-setup"&gt;Configuring Dependabot for a Python project with dependencies in setup.py&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/javascript-date-objects"&gt;JavaScript date objects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/output-json-array-streaming"&gt;Streaming indented output of a JSON array&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: git-history, bug magnets and s3-credentials --public</title><link href="https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag" rel="alternate"/><published>2021-12-08T21:34:12+00:00</published><updated>2021-12-08T21:34:12+00:00</updated><id>https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've stopped considering my projects "shipped" until I've written a proper blog entry about them, so yesterday I finally &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;shipped git-history&lt;/a&gt;, coinciding with the release of &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6"&gt;version 0.6&lt;/a&gt; - a full 27 days after the first &lt;a href="https://github.com/simonw/git-history/releases/tag/0.1"&gt;0.1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It took way more work than I was expecting to get to this point!&lt;/p&gt;
&lt;p&gt;I wrote the first version of &lt;code&gt;git-history&lt;/code&gt; in an afternoon, as a tool &lt;a href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/"&gt;for a workshop I was presenting&lt;/a&gt; on Git scraping and Datasette.&lt;/p&gt;
&lt;p&gt;Before promoting it more widely, I wanted to make some improvements to the schema. In particular, I wanted to record only the updated values in the &lt;code&gt;item_version&lt;/code&gt; table - which otherwise could end up duplicating a full copy of each item in the database hundreds or even thousands of times.&lt;/p&gt;
&lt;p&gt;Getting this right took a lot of work, and I kept on getting stumped by weird bugs and edge-cases. &lt;a href="https://github.com/simonw/git-history/issues/33"&gt;This bug&lt;/a&gt; in particular added a couple of days to the project.&lt;/p&gt;
&lt;p&gt;The whole project turned out to be something of a bug magnet, partly because of a design decision I made concerning column names.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; creates tables with columns that correspond to the underlying data. Since it also needs its own columns for tracking things like commits and incremental versions, I decided to use underscore prefixes for reserved columns such as &lt;code&gt;_item&lt;/code&gt; and &lt;code&gt;_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Datasette uses underscore prefixes for its own purposes - special table arguments such as &lt;code&gt;?_facet=column-name&lt;/code&gt;. It's supposed to work with existing columns that use underscores by converting query string arguments like &lt;code&gt;?_item=3&lt;/code&gt; into &lt;code&gt;?_item__exact=3&lt;/code&gt; - but &lt;code&gt;git-history&lt;/code&gt; was the first of my projects to really exercise this, and I kept on finding bugs. Datasette &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-2"&gt;0.59.2&lt;/a&gt; and &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-4"&gt;0.59.4&lt;/a&gt; both have related bug fixes, and there's &lt;a href="https://github.com/simonw/datasette/issues/1527"&gt;a re-opened bug&lt;/a&gt; that I have yet to resolve.&lt;/p&gt;
&lt;p&gt;Building the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;ca-fires demo&lt;/a&gt; also revealed a &lt;a href="https://github.com/simonw/datasette-cluster-map/issues/38"&gt;bug in datasette-cluster-map&lt;/a&gt; which I fixed in &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;version 0.17.2&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --public&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;git-history&lt;/code&gt; live demos are built and deployed by &lt;a href="https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml"&gt;this GitHub Actions workflow&lt;/a&gt;. The workflow works by checking out three separate repos and running &lt;code&gt;git-history&lt;/code&gt; against them. It takes advantage of that tool's ability to add just new commits to an existing database to run faster, so it needs to persist database files in between runs.&lt;/p&gt;
&lt;p&gt;Since these files can be several hundred MBs, I decided to persist them in an S3 bucket.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials tool&lt;/a&gt; provides the ability to create a new S3 bucket along with restricted read-write credentials just for that bucket, ideal for use in a GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;I decided to make the bucket public such that anyone can download files from it, since there was no reason to keep it private. I've been wanting to add this ability to &lt;code&gt;s3-credentials&lt;/code&gt; for a while now, so this was the impetus I needed to finally ship that feature.&lt;/p&gt;
&lt;p&gt;It's surprisingly hard to figure out how to make an S3 bucket public these days! It turned out the magic recipe was adding a JSON bucket policy document to the bucket granting &lt;code&gt;s3:GetObject&lt;/code&gt; permission to principal &lt;code&gt;*&lt;/code&gt; - here's &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;that policy in full&lt;/a&gt;.&lt;/p&gt;
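Based on that description the policy itself is tiny - a sketch of its shape (the exact statement s3-credentials attaches is linked above; this hypothetical helper is for illustration):

```python
def public_bucket_policy(bucket):
    """A bucket policy granting s3:GetObject on every key to principal
    '*' - anyone who knows a filename can download that file. Sketched
    from the description above; see the s3-credentials README for the
    exact policy document it uses."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                # /* scopes the grant to objects, not the bucket itself
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }
```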
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt; with a new &lt;code&gt;--public&lt;/code&gt; option for creating public buckets - here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials create my-bucket --public&lt;/code&gt; option for creating public buckets, which allow anyone with knowledge of a filename to download that file. This works by attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;this public bucket policy&lt;/a&gt; to the bucket after it is created. &lt;a href="https://github.com/simonw/s3-credentials/issues/42"&gt;#42&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials put-object&lt;/code&gt; now sets the &lt;code&gt;Content-Type&lt;/code&gt; header on the uploaded object. The type is detected based on the filename, or can be specified using the new &lt;code&gt;--content-type&lt;/code&gt; option. &lt;a href="https://github.com/simonw/s3-credentials/issues/43"&gt;#43&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials policy my-bucket --public-bucket&lt;/code&gt; outputs the public bucket policy that would be attached to a bucket of that name. &lt;a href="https://github.com/simonw/s3-credentials/issues/44"&gt;#44&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote up this TIL which doubles as a mini-tutorial on using &lt;code&gt;s3-credentials&lt;/code&gt;: &lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-hovercards&lt;/h4&gt;
&lt;p&gt;This was a quick experiment which turned into a prototype Datasette plugin. I really like how GitHub show hover card previews of links to issues in their interface:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/github-hovercard.gif" alt="Animation showing a hover card displayed when the mouse cursor touches a link to a GitHub Issue" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I decided to see if I could build something similar for links within Datasette, specifically the links that show up when a column is a foreign key to another record.&lt;/p&gt;
&lt;p&gt;Here's what I've got so far:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-hovercards.gif" alt="Animation showing a hover card displayed in Datasette for a link to another record" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There's an interactive demo running on &lt;a href="https://latest-with-plugins.datasette.io/github/issues"&gt;this table page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It still needs a bunch of work - in particular I need to think harder about when the card is shown, where it displays relative to the mouse pointer, what causes it to be hidden again and how it should handle different page widths. Ideally I'd like to figure out a useful mobile / touch-screen variant, but I'm not sure how that could work.&lt;/p&gt;
&lt;p&gt;The prototype plugin is called &lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt; - I'd like to eventually merge this back into Datasette core once I'm happy with how it works.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6.1"&gt;0.6.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;9 releases total&lt;/a&gt;) - 2021-12-08
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;0.17.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-cluster-map/releases"&gt;20 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;0.8&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;8 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.2a1"&gt;0.2a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;3 releases total&lt;/a&gt;) - 2021-12-03
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hovercards/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-12-02
&lt;br /&gt;Add preview hovercards to links in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.8.3"&gt;2.8.3&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases"&gt;22 releases total&lt;/a&gt;) - 2021-12-01
&lt;br /&gt;Save data from GitHub to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/init-subclass"&gt;__init_subclass__&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>s3-credentials 0.8</title><link href="https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag" rel="alternate"/><published>2021-12-07T07:04:35+00:00</published><updated>2021-12-07T07:04:35+00:00</updated><id>https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest release of my s3-credentials CLI tool for creating S3 buckets with credentials to access them (with read-write, read-only or write-only policies) adds a new --public option for creating buckets that allow public access, such that anyone who knows a filename can download a file. The s3-credentials put-object command also now sets the appropriate Content-Type header on the uploaded object.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
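&lt;p&gt;To illustrate the idea (this is a simplified sketch, not git-history's actual code): given successive snapshots of that JSON file - one per commit - you record a new version row whenever an item identified by its ID key changes:&lt;/p&gt;

```python
import json
import sqlite3

# Simplified sketch of the "item_versions" concept: each element of
# snapshots is the parsed JSON array as it looked at one commit.
def build_item_versions(snapshots, id_key):
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE item_versions (item_id TEXT, version INTEGER, data TEXT)"
    )
    latest = {}  # item_id -> (version, serialized data)
    for snapshot in snapshots:
        for item in snapshot:
            item_id = str(item[id_key])
            data = json.dumps(item, sort_keys=True)
            prev = latest.get(item_id)
            # Only record a row when the item is new or has changed
            if prev is None or prev[1] != data:
                version = 1 if prev is None else prev[0] + 1
                db.execute(
                    "INSERT INTO item_versions VALUES (?, ?, ?)",
                    (item_id, version, data),
                )
                latest[item_id] = (version, data)
    return db

db = build_item_versions(
    [
        [{"IncidentID": 1, "status": "active"}],
        [
            {"IncidentID": 1, "status": "contained"},
            {"IncidentID": 2, "status": "active"},
        ],
    ],
    "IncidentID",
)
rows = db.execute(
    "SELECT item_id, version FROM item_versions ORDER BY item_id, version"
).fetchall()
print(rows)  # [('1', 1), ('1', 2), ('2', 1)]
```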
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use an &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns that were reserved for use by that tool in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials, that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is the &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
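&lt;p&gt;Here's a rough sketch of how that scoping-down works with boto3 - the role ARN below is a placeholder, and the role needs to already exist and trust the calling user:&lt;/p&gt;

```python
import json

def build_session_policy(bucket):
    # Inline session policy limiting the temporary credentials to one
    # bucket: the effective permissions are the intersection of this
    # policy and the assumed role's own permissions.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }

def temporary_bucket_credentials(role_arn, bucket, duration=900):
    # Requires real AWS credentials; role_arn must be a role the
    # caller is permitted to assume.
    import boto3

    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"s3-access-{bucket}",
        Policy=json.dumps(build_session_policy(bucket)),
        DurationSeconds=duration,  # 900 seconds is the 15 minute minimum
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration
    return response["Credentials"]
```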
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>s3-credentials: a tool for creating credentials for S3 buckets</title><link href="https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag" rel="alternate"/><published>2021-11-03T04:02:04+00:00</published><updated>2021-11-03T04:02:04+00:00</updated><id>https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag</id><summary type="html">
    &lt;p&gt;I've built a command-line tool called &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; to solve a problem that's been frustrating me for ages: how to quickly and easily create AWS credentials (an access key and secret key) that have permission to read or write from just a single S3 bucket.&lt;/p&gt;
&lt;h4&gt;The TLDR version&lt;/h4&gt;
&lt;p&gt;To create a new S3 bucket and generate credentials for reading and writing to it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials
% s3-credentials create demo-bucket-for-simonwillison-blog-post --create-bucket
Created bucket: demo-bucket-for-simonwillison-blog-post
Created user: 's3.read-write.demo-bucket-for-simonwillison-blog-post' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.demo-bucket-for-simonwillison-blog-post to user s3.read-write.demo-bucket-for-simonwillison-blog-post
Created access key for user: s3.read-write.demo-bucket-for-simonwillison-blog-post
{
    "UserName": "s3.read-write.demo-bucket-for-simonwillison-blog-post",
    "AccessKeyId": "AKIAWXFXAIOZHY6WAJSF",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-12-06 23:54:08+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now use that &lt;code&gt;AccessKeyId&lt;/code&gt; and &lt;code&gt;SecretAccessKey&lt;/code&gt; to read and write files in that bucket.&lt;/p&gt;
&lt;h4&gt;The need for bucket credentials for S3&lt;/h4&gt;
&lt;p&gt;I'm an enormous fan of &lt;a href="https://aws.amazon.com/s3/"&gt;Amazon S3&lt;/a&gt;: I've been using it &lt;a href="https://simonwillison.net/tags/s3/?page=last"&gt;for fifteen years&lt;/a&gt; now (since the launch in 2006) and it's my all-time favourite cloud service: it's cheap, reliable and basically indestructible.&lt;/p&gt;
&lt;p&gt;You need two credentials to make API calls to S3: an &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and a &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since I often end up adding these credentials to projects hosted in different environments, I'm not at all keen on using my root-level credentials here: usually a project works against just one dedicated S3 bucket, so ideally I would like to create dedicated credentials that are limited to just that bucket.&lt;/p&gt;
&lt;p&gt;Creating those credentials is surprisingly difficult!&lt;/p&gt;
&lt;h4&gt;Dogsheep Photos&lt;/h4&gt;
&lt;p&gt;The last time I solved this problem was for my &lt;a href="https://datasette.io/tools/dogsheep-photos"&gt;Dogsheep Photos&lt;/a&gt; project. I built a tool that uploads all of my photos from Apple Photos to my own dedicated S3 bucket, and extracts the photo metadata into a SQLite database. This means I can do some really cool tricks using SQL to analyze my photos, as described in &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;Using SQL to find my best photo of a pelican according to Apple Photos&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The photos are stored in a S3 private bucket, with &lt;a href="https://github.com/simonw/s3-image-proxy"&gt;a custom proxy&lt;/a&gt; in front of them that I can use to grant access to specific photographs via a signed URL.&lt;/p&gt;
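&lt;p&gt;The core trick the proxy relies on can be sketched like this (the bucket and key names are placeholders, and the optional &lt;code&gt;s3&lt;/code&gt; parameter just makes the sketch easy to exercise without real credentials):&lt;/p&gt;

```python
def signed_photo_url(bucket, key, expires_in=600, s3=None):
    # Generate a time-limited signed URL granting read access to a
    # single object in an otherwise private bucket.
    if s3 is None:
        import boto3  # requires configured AWS credentials
        s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds until the URL stops working
    )
```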
&lt;p&gt;For the proxy, I decided to create dedicated credentials that were allowed to make read-only requests to my private S3 bucket.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/4"&gt;detailed notes&lt;/a&gt; along the way as I figured out how to do that. It was really hard! There's one step where you literally have to hand-edit a JSON policy document that looks like this (replace &lt;code&gt;dogsheep-photos-simon&lt;/code&gt; with your own bucket name) and paste that into the AWS web console:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"Version"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2012-10-17&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"Statement"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"Effect"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Allow&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Action"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;s3:*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Resource"&lt;/span&gt;: [
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;arn:aws:s3:::dogsheep-photos-simon/*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      ]
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I set myself an ambition to try and fix this at some point in the future (that was in April 2020).&lt;/p&gt;
&lt;p&gt;Today I found myself wanting new bucket credentials, so I could play with &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt;. I decided to solve this problem once and for all.&lt;/p&gt;
&lt;p&gt;I've also been meaning to really get my head around Amazon's IAM permission model for years, and this felt like a great excuse to figure it out through writing code.&lt;/p&gt;
&lt;h4&gt;The process in full&lt;/h4&gt;
&lt;p&gt;Here are the steps you need to take in order to get long-lasting credentials for accessing a specific S3 bucket.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create an S3 bucket&lt;/li&gt;
&lt;li&gt;Create a new, dedicated user. You need a user and not a role because long-lasting AWS credentials cannot be created for roles - and we want credentials we can use in a project without constantly needing to update them.&lt;/li&gt;
&lt;li&gt;Assign an "inline policy" to that user granting them read-only or read-write access to the specific S3 bucket - this is the JSON format shown above.&lt;/li&gt;
&lt;li&gt;Create AWS credentials for that user.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are plenty of other ways you can achieve this: you can add permissions to a group and assign the user to that group, or you can create a named "managed policy" and attach that to the user. But using an inline policy seems to be the simplest of the available options.&lt;/p&gt;
&lt;p&gt;Using the &lt;a href="https://aws.amazon.com/sdk-for-python/"&gt;boto3&lt;/a&gt; Python client library for AWS this sequence converts to the following API calls:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;json&lt;/span&gt;

&lt;span class="pl-s1"&gt;s3&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"s3"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;iam&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"iam"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;username&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-user"&lt;/span&gt;
&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-bucket"&lt;/span&gt;
&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"user-can-access-bucket"&lt;/span&gt;

&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {
    &lt;span class="pl-s"&gt;"... that big JSON document ..."&lt;/span&gt;: &lt;span class="pl-s"&gt;""&lt;/span&gt;
}

&lt;span class="pl-c"&gt;# Create the bucket&lt;/span&gt;
&lt;span class="pl-s1"&gt;s3&lt;/span&gt;.&lt;span class="pl-en"&gt;create_bucket&lt;/span&gt;(&lt;span class="pl-v"&gt;Bucket&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Create the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_user&lt;/span&gt;(&lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Assign the policy to the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;put_user_policy&lt;/span&gt;(
    &lt;span class="pl-v"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt;),
    &lt;span class="pl-v"&gt;PolicyName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt;,
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)

&lt;span class="pl-c"&gt;# Retrieve and print the credentials&lt;/span&gt;
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_access_key&lt;/span&gt;(
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;[&lt;span class="pl-s"&gt;"AccessKey"&lt;/span&gt;])&lt;/pre&gt;
&lt;h4&gt;Turning it into a CLI tool&lt;/h4&gt;
&lt;p&gt;I never want to have to figure out how to do this again, so I decided to build a tool around it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; is a Python CLI utility built on top of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; using my &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookicutter template.&lt;/p&gt;
&lt;p&gt;It's available through PyPI, so you can install it using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The main command is &lt;code&gt;s3-credentials create&lt;/code&gt;, which runs through the above sequence of steps.&lt;/p&gt;
&lt;p&gt;To create read-only credentials for my existing &lt;code&gt;static.niche-museums.com&lt;/code&gt; bucket I can run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials create static.niche-museums.com --read-only

Created user: s3.read-only.static.niche-museums.com with permissions boundary: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
Attached policy s3.read-only.static.niche-museums.com to user s3.read-only.static.niche-museums.com
Created access key for user: s3.read-only.static.niche-museums.com
{
    "UserName": "s3.read-only.static.niche-museums.com",
    "AccessKeyId": "AKIAWXFXAIOZJ26NEGBN",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-11-03 03:21:12+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command shows each step as it executes, and at the end it outputs the newly created access key and secret key.&lt;/p&gt;
&lt;p&gt;It defaults to creating a user with a username that reflects what it will be able to do: &lt;code&gt;s3.read-only.static.niche-museums.com&lt;/code&gt;. You can pass &lt;code&gt;--username something&lt;/code&gt; to specify a custom username instead.&lt;/p&gt;
&lt;p&gt;If you omit the &lt;code&gt;--read-only&lt;/code&gt; flag it will create a user with read and write access to the bucket. There's also a &lt;code&gt;--write-only&lt;/code&gt; flag which creates a user that can write to but not read from the bucket - useful for use-cases like logging or backup scripts.&lt;/p&gt;
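&lt;p&gt;To give a flavour of what write-only means here, this is the rough shape of such a policy - an illustrative sketch, since the actual policies used by the tool live in its &lt;code&gt;policies.py&lt;/code&gt; file and may differ in detail:&lt;/p&gt;

```python
# Illustrative shape of a write-only bucket policy, not the exact
# document s3-credentials generates.
def write_only_policy(bucket):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # PutObject but no GetObject: a backup or logging
                # script can upload files without being able to read
                # anything back if its credentials leak.
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }

policy = write_only_policy("my-backup-bucket")
```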
&lt;p&gt;The &lt;a href="https://github.com/simonw/s3-credentials/blob/main/README.md"&gt;README&lt;/a&gt; has full documentation on the various other options, plus details of the other &lt;code&gt;s3-credentials&lt;/code&gt; utility commands &lt;code&gt;list-users&lt;/code&gt;, &lt;code&gt;list-buckets&lt;/code&gt;, &lt;code&gt;list-user-policies&lt;/code&gt; and &lt;code&gt;whoami&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Learned along the way&lt;/h4&gt;
&lt;p&gt;This really was a fantastic project for deepening my understanding of S3, IAM and how it all fits together. A few extra points I picked up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;AWS users can be created with something called a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html"&gt;permissions boundary&lt;/a&gt;. This is an advanced security feature which lets a user be restricted to a set of maximum permissions - for example, only allowed to interact with S3, not any other AWS service.&lt;/p&gt;
&lt;p&gt;Permissions boundaries do not themselves grant permissions - a user will not be able to do anything until extra policies are added to their account. A boundary instead acts as defense in depth, setting an upper limit on what a user can do no matter what other policies are applied to them.&lt;/p&gt;
&lt;p&gt;There's one big catch: the value you set for a permissions boundary is a very weakly documented ARN string - the &lt;code&gt;boto3&lt;/code&gt; documentation simply calls it "The ARN of the policy that is used to set the permissions boundary for the user". I used &lt;a href="https://github.com/search?l=Python&amp;amp;q=iam+PermissionsBoundary&amp;amp;type=Code"&gt;GitHub code search&lt;/a&gt; to dig up some examples, and found &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess&lt;/code&gt; and &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3FullAccess&lt;/code&gt; to be the ones most relevant to my project. &lt;a href="https://github.com/daviddawha/ArchivesSpaceDevUNR/blob/488b5b83f9ac66a6013e9a0a02d25734886dee02/gems/gems/fog-aws-2.0.0/lib/fog/aws/iam/default_policy_versions.json"&gt;This random file&lt;/a&gt; appears to contain more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Those JSON policy documents really are the dark secret magic that holds AWS together. Finding trustworthy examples of read-only, read-write and write-only policies for specific S3 buckets was not at all easy. I made &lt;a href="https://github.com/simonw/s3-credentials/issues/3#issuecomment-958401364"&gt;detailed notes in this comment thread&lt;/a&gt; - the policies I went with are baked into the &lt;a href="https://github.com/simonw/s3-credentials/blob/0.2/s3_credentials/policies.py"&gt;policies.py&lt;/a&gt; file in the &lt;code&gt;s3-credentials&lt;/code&gt; repository. If you know your way around IAM I would love to hear your feedback on the policies I ended up using!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Writing automated tests for code that makes extensive use of &lt;code&gt;boto3&lt;/code&gt; - such that those tests don't make any real HTTP requests to the API - is a bit fiddly. I &lt;a href="https://github.com/simonw/s3-credentials/pull/6#issuecomment-958522457"&gt;explored a few options&lt;/a&gt; for this - potential candidates included the &lt;a href="https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html"&gt;botocore.stub.Stubber&lt;/a&gt; class and the &lt;a href="https://vcrpy.readthedocs.io/"&gt;VCR.py&lt;/a&gt; library for saving and replaying HTTP traffic (see &lt;a href="https://til.simonwillison.net/pytest/pytest-recording-vcr"&gt;this TIL&lt;/a&gt;). I ended up going with Python's &lt;code&gt;Mock&lt;/code&gt; class, via &lt;a href="https://github.com/pytest-dev/pytest-mock"&gt;pytest-mock&lt;/a&gt; - here's &lt;a href="https://til.simonwillison.net/pytest/pytest-mock-calls"&gt;another TIL&lt;/a&gt; on the pattern I used for that. (Update: Jeff Triplett &lt;a href="https://twitter.com/webology/status/1455749203595087872"&gt;pointed me&lt;/a&gt; to &lt;a href="https://github.com/spulec/moto"&gt;moto&lt;/a&gt; which looks like a really great solution for this.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
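&lt;p&gt;The permissions boundary mentioned above is set at user creation time. Here's a minimal boto3 sketch - the username is a placeholder, and the optional &lt;code&gt;iam&lt;/code&gt; parameter is just there so the sketch can be exercised without real credentials:&lt;/p&gt;

```python
def create_bounded_user(
    username,
    boundary_arn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
    iam=None,
):
    # The boundary caps what this user can ever do: policies attached
    # later cannot grant anything outside it.
    if iam is None:
        import boto3  # requires configured AWS credentials
        iam = boto3.client("iam")
    return iam.create_user(
        UserName=username,
        PermissionsBoundary=boundary_arn,
    )
```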
&lt;h4&gt;Feedback from AWS experts wanted&lt;/h4&gt;
&lt;p&gt;The tool I've built solves my specific problem pretty well. I'm nervous about it though: I am by no means an IAM expert, and I'm somewhat paranoid that I may have made a dumb mistake and baked it into the tooling.&lt;/p&gt;
&lt;p&gt;As such, the README currently &lt;a href="https://github.com/simonw/s3-credentials/tree/0.2#%EF%B8%8F-warning"&gt;carries a warning&lt;/a&gt; that you should review what the tool is doing carefully before trusting it against your own AWS account!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 20 February 2022:&lt;/strong&gt; I removed that warning, since I've now spent long enough working on this tool that I'm comfortable with how it works.&lt;/p&gt;
&lt;p&gt;If you are an AWS expert, you can help: I have &lt;a href="https://github.com/simonw/s3-credentials/issues/7"&gt;an open issue&lt;/a&gt; requesting expert feedback, and I'd love to hear from people with deep experience who can either validate that my approach is sound or help explain what I'm doing wrong and how the process can be fixed.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="python"/><category term="s3"/><category term="security"/><category term="s3-credentials"/></entry></feed>