Simon Willison's Weblog: http

Some Go web dev notes

2024-09-27T23:43:31+00:00

Julia Evans on writing small, self-contained web applications in Go:

In general everything about it feels like it makes projects easy to work on for 5 days, abandon for 2 years, and then get back into writing code without a lot of problems.

Go 1.22 introduced HTTP routing in February of this year, making it even more practical to build a web application using just the Go standard library.

Tags: web-development, go, julia-evans, http

How streaming LLM APIs work

2024-09-22T03:48:12+00:00

How streaming LLM APIs work

New TIL. I used curl to explore the streaming APIs provided by OpenAI, Anthropic and Google Gemini and wrote up detailed notes on what I learned.

Also includes example code for receiving streaming events in Python with HTTPX and receiving streaming events in client-side JavaScript using fetch().

Tags: apis, http, llms, json

SQL Injection Isn't Dead: Smuggling Queries at the Protocol Level

2024-08-12T15:36:47+00:00

SQL Injection Isn't Dead: Smuggling Queries at the Protocol Level

PDF slides from a presentation by Paul Gerste at DEF CON 32. It turns out some databases have vulnerabilities in their binary protocols that can be exploited by carefully crafted SQL queries.

Paul demonstrates an attack against PostgreSQL (which works in some but not all of the PostgreSQL client libraries) which uses a message size overflow, by embedding a string longer than 4GB (2**32 bytes) which overflows the maximum length of a string in the underlying protocol and writes data to the subsequent value. He then shows a similar attack against MongoDB.

The current way to protect against these attacks is to ensure a size limit on incoming requests. This can be more difficult than you may expect - Paul points out that alternative paths such as WebSockets might bypass limits that are in place for regular HTTP requests, plus some servers may apply limits before decompression, allowing an attacker to send a compressed payload that is larger than the configured limit.

Via lobste.rs

Tags: postgresql, sql-injection, security, mongodb, websockets, http

Cloudflare does not consider vary values in caching decisions

2023-11-20T05:08:52+00:00

Cloudflare does not consider vary values in caching decisions

Here’s the spot in Cloudflare’s documentation where they hide a crucially important detail:

“Cloudflare does not consider vary values in caching decisions. Nevertheless, vary values are respected when Vary for images is configured and when the vary header is vary: accept-encoding.”

This means you can’t deploy an application that uses content negotiation via the Accept header behind the Cloudflare CDN—for example serving JSON or HTML for the same URL depending on the incoming Accept header. If you do, Cloudflare may serve cached JSON to an HTML client or vice-versa.

There’s an exception for image files, which Cloudflare added support for in September 2021 (for Pro accounts only) in order to support formats such as WebP which may not have full support across all browsers.

Tags: http, caching, cloudflare

See this page fetch itself, byte by byte, over TLS

2023-05-10T13:58:36+00:00

See this page fetch itself, byte by byte, over TLS

George MacKerron built a TLS 1.3 library in TypeScript and used it to construct this amazing educational demo, which performs a full HTTPS request for its own source code over a WebSocket and displays an annotated byte-by-byte representation of the entire exchange. This is the most useful illustration of how HTTPS actually works that I’ve ever seen.

Via Julia Evans

Tags: tls, http, encryption, explorables, websockets, https

urllib3 v2.0.0 is now generally available

2023-04-26T22:00:16+00:00

urllib3 v2.0.0 is now generally available

urllib3 is 12 years old now, and is a common low-level dependency for packages like requests and httpx. The biggest new feature in v2 is a higher-level API: resp = urllib3.request(“GET”, “https://example.com”)—a very welcome addition to the library.

Tags: http, python

RFC 7807: Problem Details for HTTP APIs

2022-11-01T03:15:05+00:00

RFC 7807: Problem Details for HTTP APIs

This RFC has been brewing for quite a while, and is currently in last call (ends 2022-11-03). I’m designing the JSON error messages for Datasette at the moment so this could not be more relevant for me.

Via Nicolas Fränkel

Tags: standards, http, rfc, json, errors, mark-nottingham

Introducing sqlite-http: A SQLite extension for making HTTP requests

2022-08-10T22:22:42+00:00

Introducing sqlite-http: A SQLite extension for making HTTP requests

Characteristically thoughtful SQLite extension from Alex, following his sqlite-html extension from a few days ago. sqlite-http lets you make HTTP requests from SQLite—both as a SQL function that returns a string, and as a table-valued SQL function that lets you independently access the body, headers and even the timing data for the request.

This write-up is excellent: it provides interactive demos but also shows how additional SQLite extensions such as the new-to-me “define” extension can be combined with sqlite-http to create custom functions for parsing and processing HTML.

Via @agarcia_me

Tags: http, sqlite, alex-garcia

curlconverter.com

2022-03-10T20:12:44+00:00

curlconverter.com

This is pretty magic: paste in a “curl” command (including the ones you get from browser devtools using copy-as-curl) and this will convert that into code for making the same HTTP request... using Python, JavaScript, PHP, R, Go, Rust, Elixir, Java, MATLAB, Ansible URI, Strest, Dart or JSON.

Via Julia Evans

Tags: http, curl

Hurl

2021-11-22T03:32:33+00:00

Hurl

Hurl is “a command line tool that runs HTTP requests defined in a simple plain text format”—written in Rust on top of curl, it lets you run HTTP requests and then execute assertions against the response, defined using JSONPath or XPath for HTML. It can even assert that responses were returned within a specified duration.

Via @humphd

Tags: http, curl, rust

New HTTP standards for caching on the modern web

2021-10-21T22:40:50+00:00

New HTTP standards for caching on the modern web

Cache-Status is a new HTTP header (RFC from August 2021) designed to provide better debugging information about which caches were involved in serving a request—“Cache-Status: Nginx; hit, Cloudflare; fwd=stale; fwd-status=304; collapsed; ttl=300” for example indicates that Nginx served a cache hit, then Cloudflare had a stale cached version so it revalidated from Nginx, got a 304 not modified, collapsed multiple requests (dogpile prevention) and plans to serve the new cached value for the next five minutes. Also described is $Target-Cache-Control: which allows different CDNs to respond to different headers and is already supported by Cloudflare and Akamai (Cloudflare-CDN-Cache-Control: and Akamai-Cache-Control:).

Via Hacker News

Tags: http, caching, cloudflare, dogpile

Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite

2020-04-29T19:41:11+00:00

I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.

Archiving coronavirus.data.gov.uk

The UK goverment have a new portal website sharing detailed Coronavirus data for regions around the country, at coronavirus.data.gov.uk.

As with everything else built in 2020, it's a big single-page JavaScript app. Matthew Somerville investigated what it would take to build a much lighter (and faster loading) site displaying the same information by moving much of the rendering to the server.

One of the best things about the SPA craze is that it strongly encourages structured data to be published as JSON files. Matthew's article inspired me to take a look, and sure enough the government figures are available in an extremely comprehensive (and 3.3MB in size) JSON file, available from https://c19downloads.azureedge.net/downloads/data/data_latest.json.

Any time I see a file like this my first questions are how often does it change - and what kind of changes are being made to it?

I've written about scraping to a git repository (see my new gitscraping tag) a bunch in the past:

Scraping hurricane Irma - September 2017
Changelogs to help understand the fires in the North Bay - October 2017
Generating a commit log for San Francisco’s official list of trees - March 2019
Tracking PG&E outages by scraping to a git repo - October 2019
Deploying a data API using GitHub Actions and Cloud Run - January 2020

Now that I've figured out a really clean way to Commit a file if it changed in a GitHub Action knocking out new versions of this pattern is really quick.

simonw/coronavirus-data-gov-archive is my new repo that does exactly that: it periodically fetches the latest versions of the JSON data files powering that site and commits them if they have changed. The aim is to build a commit history of changes made to the underlying data.

The first implementation was extremely simple - here's the entire action:

name: Fetch latest data

on:
push:
repository_dispatch:
schedule:
    - cron:  '25 * * * *'

jobs:
scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
    uses: actions/checkout@v2
    - name: Fetch latest data
    run: |-
        curl https://c19downloads.azureedge.net/downloads/data/data_latest.json | jq . > data_latest.json
        curl https://c19pub.azureedge.net/utlas.geojson | gunzip | jq . > utlas.geojson
        curl https://c19pub.azureedge.net/countries.geojson | gunzip | jq . > countries.geojson
        curl https://c19pub.azureedge.net/regions.geojson | gunzip | jq . > regions.geojson
    - name: Commit and push if it changed
    run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

It uses a combination of curl and jq (both available in the default worker environment) to pull down the data and pretty-print it (better for readable diffs), then commits the result.

Matthew Somerville pointed out that inefficient polling sets a bad precedent. Here I'm hitting azureedge.net, the Azure CDN, so that didn't particularly worry me - but since I want this pattern to be used widely it's good to provide a best-practice example.

Figuring out the best way to make conditional get requests in a GitHub Action lead me down something of a rabbit hole. I wanted to use curl's new ETag support but I ran into a curl bug, so I ended up rolling a simple Python CLI tool called conditional-get to solve my problem. In the time it took me to release that tool (just a few hours) a new curl release came out with a fix for that bug!

Here's the workflow using my conditional-get tool. See the issue thread for all of the other potential solutions, including a really neat Action shell-script solution by Alf Eaton.

To my absolute delight, the project has already been forked once by Daniel Langer to capture Canadian Covid-19 cases!

New Datasette features

I pushed two new features to Datasette master, ready for release in 0.41.

Configuration directory mode

This is an idea I had while building datasette-publish-now. Datasette instances can be run with custom metadata, custom plugins and custom templates. I'm increasingly finding myself working on projects that run using something like this:

$ datasette data1.db data2.db data3.db \
    --metadata=metadata.json
    --template-dir=templates \
    --plugins-dir=plugins

Directory configuration mode introduces the idea that Datasette can configure itself based on a directory layout. The above example can instead by handled by creating the following layout:

my-project/data1.db
my-project/data2.db
my-project/data3.db
my-project/metadatata.json
my-project/templates/index.html
my-project/plugins/custom_plugin.py

Then run Datasette directly targetting that directory:

$ datasette my-project/

See issue #731 for more details. Directory configuration mode is documented here.

Define custom pages using templates/pages

In niche-museums.com, powered by Datasette I described how I built the www.niche-museums.com website as a heavily customized Datasette instance.

That site has /about and /map pages which are served by custom templates - but I had to do some gnarly hacks with empty about.db and map.db files to get them to work.

Issue #648 introduces a new mechanism for creating this kind of page: create a templates/pages/map.html template file and custom 404 handling code will ensure that any hits to /map serve the rendered contents of that template.

This could work really well with the datasette-template-sql plugin, which allows templates to execute abritrary SQL queries (ala PHP or ColdFusion).

Here's the new documentation on custom pages, including details of how to use the new custom_status(), custom_header() and custom_redirect() template functions to go beyond just returning HTML.

photos-to-sqlite

My Dogsheep personal analytics project brings my tweets, GitHub activity, Swarm checkins and more together in one place. But the big missing feature is my photos.

As-of yesterday, I have 39,000 photos from Apple Photos uploaded to an S3 bucket using my new photos-to-sqlite tool. I can run the following SQL query and get back ten random photos!

select
  json_object(
    'img_src',
    'https://photos.simonwillison.net/i/' || 
    sha256 || '.' || ext || '?w=400'
  ),
  filepath,
  ext
from
  photos
where
  ext in ('jpeg', 'jpg', 'heic')
order by
  random()
limit
  10

photos.simonwillison.net is running a modified version of my heic-to-jpeg image converting and resizing proxy, which I'll release at some point soon.

There's still plenty of work to do - I still need to import EXIF data (including locations) into SQLite, and I plan to use osxphotos to export additional metadata from my Apple Photos library. But this week it went from a pure research project to something I can actually start using, which is exciting.

TIL this week

Generated using this query.

Tags: git, http, matthew-somerville, photos, projects, datasette, weeknotes, covid19, git-scraping

Async Support - HTTPX

2020-01-10T04:49:59+00:00

Async Support - HTTPX

HTTPX is the new async-friendly HTTP library for Python spearheaded by Tom Christie. It works in both async and non-async mode with an API very similar to requests. The async support is particularly interesting—it’s a really clean API, and now that Jupyter supports top-level await you can run ’(await httpx.AsyncClient().get(url)).text’ directly in a cell and get back the response. Most excitingly the library lets you pass an ASGI app directly to the client and then perform requests against it—ideal for unit tests.

Via @_tomchristie

Tags: asgi, tom-christie, async, http, python, httpx

Usage of ARIA attributes via HTTP Archive

2018-07-12T03:16:26+00:00

Usage of ARIA attributes via HTTP Archive

A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.

Tags: big-data, internet-archive, http, aria

How Balanced does Database Migrations with Zero-Downtime

2017-11-07T11:36:25+00:00

How Balanced does Database Migrations with Zero-Downtime

I’m fascinated by the idea of “pausing” traffic during a blocking site maintenance activity (like a database migration) and then un-pausing when the operation is complete—so end clients just see some of their requests taking a few seconds longer than expected. I first saw this trick described by Braintree. Balanced wrote about a neat way of doing this just using HAproxy, which lets you live reconfigure the maxconns to your backend down to zero (causing traffic to be queued up) and then bring the setting back up again a few seconds later to un-pause those requests.

Tags: highavailability, migrations, http, scaling, haproxy, zero-downtime

Whether 404 custom error page necessary for a website?

2014-01-03T13:14:00+00:00

My answer to Whether 404 custom error page necessary for a website? on Quora

They aren't required, but if you don't have a custom 404 page you're missing out on a very easy way of improving the user experience of your site, and protecting against expired or incorrect links from elsewhere on the web.

Even just a search box and a link to your homepage is enough to ensure visitors who arrive on a 404 can still visit the rest of your site, and hopefully find what they were looking for when they clicked on the link.

Tags: http, seo, quora

What will HTTP be superseded by?

2012-12-26T12:28:00+00:00

My answer to What will HTTP be superseded by? on Quora

HTTP 1.x will likely never be completely replaced, but there is ongoing work at the moment to define HTTP 2.0. The first draft of this was released in November and is based on Google's SPDY protocol, which is already widely deployed in Google Chrome and Google's web properties (other browsers have experimented with support for SPDY as well): http://en.m.wikipedia.org/wiki/H...

One thing that looks pretty likely is that any replacement will only work over SSL - not just to improve privacy and security on the web, but also because this is the most reliable way to avoid breaking all of the legacy proxy servers already deployed around the net.

Tags: http, internet, web-development, quora

How can I download a web server's directory and all subdirectories with one command?

2012-01-15T18:55:00+00:00

My answer to How can I download a web server's directory and all subdirectories with one command? on Quora

Use wget (you can install it with apt-get install wget)

$ wget --recursive http://example.com

That will create a directory called example.com and put the mirrored downloaded files in the right sub-directories inside it.

If you just want to download a subdirectory, do this:

$ wget --recursive http://example.com/subdirectory --no-parent

The --no-parent option ensures wget won't follow links up to parent directories of the one you want to download.

Tags: http, linux, ubuntu, quora

What are the best practices in Node.js to communicate with an existing Java backend?

2011-12-08T12:53:00+00:00

My answer to What are the best practices in Node.js to communicate with an existing Java backend? on Quora

Node speaks HTTP extremely well, and using HTTP means you can do things like put an HTTP load balancer or cache (such as varnish) between Node and your Java application server at a later date.

Tags: http, nodejs, quora

Quoting Dan Manges

2011-06-30T21:27:00+00:00

We can deploy new versions of our software, make database schema changes, or even rotate our primary database server, all without failing to respond to a single request. We can accomplish this because we gave ourselves the ability suspend our traffic, which gives us a window of a few seconds to make some changes before letting the requests through. To make this happen, we built a custom HTTP server and application dispatching infrastructure around Python’s Tornado and Redis.

— Dan Manges, Braintree

Tags: deployment, http, redis, tornado, recovered

On HTTP Load Testing

2011-05-18T10:17:00+00:00

On HTTP Load Testing

Mark Nottingham explains that running good HTTP benchmarks means understanding available network bandwidth, using dedicated physical hardware, testing at progressively higher loads and a whole lot more.

Tags: http, load-testing, mark-nottingham, recovered

The Inside Story of How Facebook Responded to Tunisian Hacks

2011-01-24T18:06:00+00:00

The Inside Story of How Facebook Responded to Tunisian Hacks

“By January 5, it was clear that an entire country’s worth of passwords were in the process of being stolen right in the midst of the greatest political upheaval in two decades.”—which is why you shouldn’t serve your login form over HTTP even though it POSTs over HTTPS.

Via O'Reilly Radar

Tags: facebook, http, https, security, tunisia, recovered

gzip support for Amazon Web Services CloudFront

2010-11-12T05:33:00+00:00

gzip support for Amazon Web Services CloudFront

This would have saved me a bunch of work a few weeks ago. CloudFront can now be pointed at your own web server rather than S3, and you can ask it to forward on the Accept-Encoding header and cache multiple content versions based on the result.

Tags: cloudfront, gzip, http, recovered

LWPx::ParanoidAgent

2010-08-31T02:30:00+00:00

LWPx::ParanoidAgent

Every programming language needs an equivalent of this library—a robust, secure way to make HTTP requests against URLs from untrusted sources without risk of tarpits, internal network access, socket starvation, weird server errors, or other nastiness.

Tags: http, perl, recovered

nodejitsu's node-http-proxy

2010-07-28T23:34:00+00:00

nodejitsu's node-http-proxy

Exactly what I’ve been waiting for—a robust HTTP proxy library for Node that makes it trivial to proxy requests to a backend with custom proxy behaviour added in JavaScript. The example app adds an artificial delay to every request to simulate a slow connection, but other exciting potential use cases could include rate limiting, API key restriction, logging, load balancing, lint testing and more besides.

Via The Changelog

Tags: http, javascript, node, nodejs, proxy, recovered

python/trunk/Lib/httplib.py in 1994

2010-07-04T23:25:00+00:00

python/trunk/Lib/httplib.py in 1994

Python’s original httplib implementation, checked in by Guido 16 years and 4 months ago. Not much younger than the Web itself.

Via Hacker News

Tags: guido-van-rossum, http, httplib, python, recovered

Mongrel2 is "Self-Hosting"

2010-06-17T20:11:00+00:00

Mongrel2 is "Self-Hosting"

Zed Shaw’s Mongrel2 is shaping up to be a really interesting project. “A web server simply written in C that loves all languages equally”, the two most interesting new ideas are the ability to handle HTTP, Flash Sockets and WebSockets all on the same port (thanks to an extension to the Mongrel HTTP parser that can identify all three protocols) and the ability to hook Mongrel2 up to the backend servers using either TCP/IP or ZeroMQ. I’m guessing this means Mongrel2 could hold an HTTP request open, fire off some messages and wait for various backends to send messages back to construct the response, making async processing just as easy as a regular blocking request/response cycle.

Tags: async, c, http, mongrel2, webserver, zed-shaw, zeromq, recovered, websockets

ElasticSearch memcached module

2010-05-15T10:17:00+00:00

ElasticSearch memcached module

Fascinating idea: the ElasticSearch search server provides an optional memcached protocol plugin for added performance which maps simple HTTP to memcached. GET is mapped to memcached get commands, POST is mapped to set commands. This means you can use any memcached client to communicate with the search server.

Tags: elasticsearch, http, memcached, protocol, recovered

A HTTP Proxy Server in 20 Lines of node.js

2010-04-28T13:24:58+00:00

A HTTP Proxy Server in 20 Lines of node.js

Proxying is definitely a sweet spot for Node.js. Peteris Krummins takes it a step further, adding host blacklists and an IP whitelist as configuration files and using Node’s watchFile method to automatically reload changes to them.

Tags: nodejs, proxy, http, node, javascript, peteris-krummins

Introduction to nginx.conf scripting

2010-04-21T23:40:46+00:00

Introduction to nginx.conf scripting

Slideshow—hit left arrow to navigate through the slides. The nginx community is officially nuts. Starts out with a simple “Hello world” using the echo module, then rapidly descends down the rabbit hole in to array operations, sub-requests, memcached connection pooling and eventually non-blocking Drizzle SQL execution against a sharded cluster—all implemented in the nginx.conf configuration file.

Tags: nginx, drizzle, memcached, http