Simon Willison’s Weblog

Subscribe

45 items tagged “aws”

2024

How an empty S3 bucket can make your AWS bill explode (via) Maciej Pocwierz accidentally created an S3 bucket with a name that was already used as a placeholder value in a widely used piece of software. They saw 100 million PUT requests to their new bucket in a single day, racking up a big bill since AWS charges $5/million PUTs.

It turns out AWS charge that same amount for PUTs that result in a 403 authentication error, a policy that extends even to "requester pays" buckets!

So, if you know someone's S3 bucket name you can DDoS their AWS bill just by flooding them with meaningless unauthenticated PUT requests.

AWS support refunded Maciej's bill as an exception here, but I'd like to see them reconsider this broken policy entirely.

Update from Jeff Barr:

We agree that customers should not have to pay for unauthorized requests that they did not initiate. We’ll have more to share on exactly how we’ll help prevent these charges shortly.

# 30th April 2024, 11:19 am

s3-credentials 0.16. I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a PublicAccessBlockConfiguration of {"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}.

The s3-credentials --create-bucket --public option now does that for you. I also added a s3-credentials debug-bucket name-of-bucket command to help figure out why a bucket isn't working as expected. # 5th April 2024, 5:35 am

textract-cli. This is my other OCR project from yesterday: I built the thinnest possible CLI wrapper around Amazon Textract, out of frustration at how hard that tool is to use on an ad-hoc basis.

It only works with JPEGs and PNGs (not PDFs) up to 5MB in size, reflecting limitations in Textract’s synchronous API: it can handle PDFs amazingly well but you have to upload them to an S3 bucket yet and I decided to keep the scope tight for the first version of this tool.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt # 30th March 2024, 7:01 pm

S3 is files, but not a filesystem (via) Cal Paterson helps some concepts click into place for me: S3 imitates a file system but has a number of critical missing features, the most important of which is the lack of partial updates. Any time you want to modify even a few bytes in a file you have to upload and overwrite the entire thing. Almost every database system is dependent on partial updates to function, which is why there are so few databases that can use S3 directly as a backend storage mechanism. # 10th March 2024, 11:47 am

The power of two random choices, visualized. Grant Slatton shares a visualization illustrating “a favorite load balancing technique at AWS”: pick two nodes at random and then send the task to whichever of those two has the lowest current load score.

Why just two nodes? “The function grows logarithmically, so it’s a big jump from 1 to 2 and then tapers off *real* quick.” # 6th February 2024, 10:21 pm

AWS Fixes Data Exfiltration Attack Angle in Amazon Q for Business. An indirect prompt injection (where the AWS Q bot consumes malicious instructions) could result in Q outputting a markdown link to a malicious site that exfiltrated the previous chat history in a query string.

Amazon fixed it by preventing links from being output at all—apparently Microsoft 365 Chat uses the same mitigation. # 19th January 2024, 12:02 pm

Slashing Data Transfer Costs in AWS by 99% (via) Brilliant trick by Daniel Kleinstein. If you have data in two availability zones in the same AWS region, transferring a TB will cost you $10 in ingress and $10 in egress at the inter-zone rates charged by AWS.

But... transferring data to an S3 bucket in that same region is free (aside from S3 storage costs). And buckets are available with free transfer to all availability zones in their region, which means that TB of data can be transferred between availability zones for mere cents of S3 storage costs provided you delete the data as soon as it’s transferred. # 15th January 2024, 10:22 pm

2023

How ima.ge.cx works (via) ima.ge.cx is Aidan Steele’s web tool for browsing the contents of Docker images hosted on Docker Hub. The architecture is really interesting: it’s a set of AWS Lambda functions, written in Go, that fetch metadata about the images using Step Functions and then cache it in DynamoDB and S3. It uses S3 Select to serve directory listings from newline-delimited JSON in S3 without retrieving the whole file. # 31st December 2023, 4:32 am

2022

You should have lots of AWS accounts (via) Richard Crowley makes the case for maintaining multiple AWS accounts within a single company, because “AWS accounts are the most complete form of isolation on offer”. # 3rd October 2022, 6:36 pm

Figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain (via) This took me five hours and 77 issue comments to figure out, but I finally managed to serve an AWS Lambda function running Datasette on a custom subdomain with an HTTPS certificate. I was going to write this up as a TIL but I’m exhausted so I decided to share my private notes thread instead. # 3rd October 2022, 12:29 am

Deploying Python web apps as AWS Lambda functions. After literally years of failed half-hearted attempts, I finally managed to deploy an ASGI Python web application (Datasette) to an AWS Lambda function! Here are my extensive notes. # 19th September 2022, 4:05 am

Over the years, across multiple deployments, DynamoDB has learned that it’s not just the end state and the start state that matter; there could be times when the newly deployed software doesn’t work and needs a rollback. The rolled-back state might be different from the initial state of the software. The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of upgrade and downgrade tests at a component level before every deployment. Then, the software is rolled back on purpose and tested by running functional tests. DynamoDB has found this process valuable for catching issues that otherwise would make it hard to rollback if needed.

Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service # 5th September 2022, 6:49 pm

The Amazon Builders’ Library (via) “How Amazon builds and operates software”—an extraordinarily valuable collection of detailed articles about how AWS works and operates under the hood. # 5th September 2022, 5:50 pm

sqlite-comprehend: run AWS entity extraction against content in a SQLite database

I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.

[... 1146 words]

s3-ocr: Extract text from PDF files stored in an S3 bucket

Visit s3-ocr: Extract text from PDF files stored in an S3 bucket

I’ve released s3-ocr, a new tool that runs Amazon’s Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

[... 1493 words]

Abusing AWS Lambda to make an Aussie Search Engine (via) Ben Boyter built a search engine that only indexes .au Australian websites, with the novel approach of directly compiling the search index into 250 different ~40MB large lambda functions written in Go, then running searches across 12 million pages by farming them out to all of the lambdas and combining the results. His write-up includes all sorts of details about how he built this, including how he ran the indexer and how he solved the surprisingly hard problem of returning good-enough text snippets for the results. # 16th January 2022, 8:52 pm

2021

Weeknotes: git-history, created for a Git scraping workshop

Visit Weeknotes: git-history, created for a Git scraping workshop

My main project this week was a 90 minute workshop I delivered about Git scraping at Coda.Br 2021, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, git-history, plus smaller improvements to a range of other projects.

[... 1239 words]

AWS IAM definitions in Datasette (via) As part of my ongoing quest to conquer IAM permissions, I built myself a Datasette instance that lets me run queries against all 10,441 permissions across 280 AWS services. It’s deployed by a build script running in GitHub Actions which downloads a 8.9MB JSON file from the Salesforce policy_sentry repository—policy_sentry itself creates that JSON file by running an HTML scraper against the official AWS documentation! # 6th November 2021, 3:47 am

aws-lambda-adapter. AWS Lambda added support for Docker containers last year, but with a very weird shape: you can run anything on Lambda that fits in a Docker container, but unlike Google Cloud Run your application doesn’t get to speak HTTP: it needs to run code that listens for proprietary AWS lambda events instead. The obvious way to fix this is to run some kind of custom proxy inside the container which turns AWS runtime events into HTTP calls to a regular web application. Serverlessish and re:Web are two open source projects that implemented this, and now AWS have their own implementation of that pattern, written in Rust. # 28th October 2021, 5:04 am

Behind the scenes, AWS Lambda (via) Bruno Schaatsbergen pulled together details about how AWS Lambda works under the hood from a detailed review of the AWS documentation, the Firecracker paper and various talks at AWS re:Invent. # 10th July 2021, 7:40 pm

This teaches us that—when it’s a big enough deal—Amazon will lie to us. And coming from the company that runs the production infrastructure for our companies, stores our data, and has been granted an outsized position of trust based upon having earned it over 15 years, this is a nightmare.

Corey Quinn # 31st March 2021, 4:47 pm

Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history

Visit Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history

This week I got SpatiaLite 5 working in the Datasette Docker image, improved the CDC vaccination history git scraper, figured out Datasette on Azure and we closed on a new home!

[... 986 words]

2020

New for AWS Lambda – Container Image Support. “You can now package and deploy Lambda functions as container images of up to 10 GB in size”—can’t wait to try this out with Datasette. # 1st December 2020, 5:34 pm

AWS services explained in one line each (via) Impressive effort to summarize all 163(!) AWS services—this helped clarify a whole bunch that I haven’t figured yet. Only a few defeated the author, with a single question mark for the description. I enjoyed Amazon Braket: “Some quantum thing. It’s in preview so I have no idea what it is.” # 26th May 2020, 4:41 pm

Millions of tiny databases. Fascinating, detailed review of a paper that describes Amazon’s Physalia, a distributed configuration store designed to provide extremely high availability coordination for Elastic Block Store replication. My eyebrows raised at “Physalia is designed to offer consistency and high-availability, even under network partitions.” since that’s such a blatant violation of CAP theorem, but it later justifies it like so: “One desirable property therefore, is that in the event of a partition, a client’s Physalia database will be on the same side of the partition as the client. Clever placement of cells across nodes can maximise the chances of this.” # 5th March 2020, 4:37 am

2019

athena-sqlite (via) Amazon Athena is the AWS tool for querying data stored in S3—as CSV, JSON or Apache Parquet files—using SQL. It’s an interesting way of buliding a very cheap data warehouse on top of S3 without having to run any additional services. Athena recently added a query federation SDK which lets you define additional custom data sources using Lambda functions. Damon Cortesi used this to write a custom connector for SQLite, which lets you run queries against data stored in SQLite files that you have uploaded to S3. You can then run joins between that data and other Athena sources. # 18th December 2019, 9:05 am

Serverless Microservice Patterns for AWS (via) A handy collection of 19 architectural patterns for AWS Lambda collected by Jeremy Daly. # 12th June 2019, 12:13 am

Using 6 Page and 2 Page Documents To Make Organizational Decisions (via) I’ve been thinking a lot recently about the challenges of efficiently getting to consensus within a larger organization spread across multiple locations and time zones. This model described by Ian Nowland based on his experience at AWS seems very promising. The goal is to achieve a decision or “disagree and commit” consensus using a max 6 page document and a one hour meeting. The first fifteen minutes of the meeting are dedicated to silently reading the document—if you’ve read it already you are given the option of arriving fifteen minutes late. # 11th April 2019, 3:46 am

Colm MacCárthaigh tells the inside story of how AWS responded to Heartbleed. The Heartbleed SSL vulnerability came out five years ago. In this Twitter thread Colm, who was Amazon’s principal engineer for Elastic Load Balancer at the time, describes how the AWS team responded to something that “was scarier than any bug I’d ever seen”. It’s a cracking story. # 7th April 2019, 8:32 pm

The Cloud and Open Source Powder Keg (via) Stephen O’Grady’s analysis of the Elastic v.s. AWS situation, where Elastic started mixing their open source and non-open source code together and Amazon responded by releasing their own forked “open distribution for Elasticsearch”. World War One analogies included! # 17th March 2019, 7:08 pm