Simon Willison’s Weblog


Recent entries

First impressions of DALL-E, generating images from text one day ago

I made it off the DALL-E waiting list a few days ago and I’ve been having an enormous amount of fun experimenting with it. Here are some notes on what I’ve learned so far (and a bunch of example images too).

(For those not familiar with it, DALL-E is OpenAI’s advanced text-to-image generator: you feed it a prompt, it generates images. It’s extraordinarily good at it.)

First, a warning: DALL-E only allows you to generate up to 50 images a day. I found this out only when I tried to generate image number 51. So there’s a budget to watch out for.

I’ve usually run out by lunch time!

How to use DALL-E

DALL-E is even simpler to use than GPT-3: you get a text box to type in, and that’s it. There are no advanced settings to tweak.

A label says "Start with a detailed description" - there is also a "Surprise me" button. The text box has the grayed out suggestion text "An impressionist oil painting of sunflowers in a purple vase." There is also a Generate button, and the text "Or upload an image to edit"

It does have one other mode: you can upload your own photo, crop it to a square and then erase portions of it and ask DALL-E to fill them in with a prompt. This feature is clearly still in the early stages—I’ve not had great results with it yet.

DALL-E always returns six resulting images, which I believe it has selected as the “best” from hundreds of potential results.

Tips on prompts

DALL-E’s initial label suggests to “Start with a detailed description”. This is very good advice!

The more detail you provide, the more interesting DALL-E gets.

If you type “Pelican”, you’ll get an image that is indistinguishable from what you might get from something like Google Image search. But the more details you ask for, the more interesting and fun the result.

Fun with pelicans

Here’s “A ceramic pelican in a Mexican folk art style with a big cactus growing out of it”:

A ceramic pelican in a Mexican folk art style with a big cactus growing out of it - the image looks exactly like that, it's very impressive

Some of the most fun results you can have come from providing hints as to a medium or art style you would like. Here’s “A heavy metal album cover where the band members are all pelicans... made of lightning”:

A heavy metal album cover where the band members are all pelicans... made of lightning - except none of the pelicans are made of lightning. The text at the top reads PLENY HLAN

This illustrates a few interesting points. Firstly, DALL-E is hilariously bad at any images involving text. It can make things that look like letters and words but it has no concept of actual writing.

My initial prompt was for “A death metal album cover...”—but DALL-E refused to generate that. It has a filter to prevent people from generating images that go outside its content policy, and the word “death” triggered it.

(I’m confident that the filter can be easily avoided, but I don’t want to have my access revoked so I haven’t spent any time pushing its limits.)

It’s also not a great result—those pelicans are not made of lightning! I tried a tweaked prompt:

“A heavy metal album cover where the band members are all pelicans that are made of lightning”:

A heavy metal album cover where the band members are all pelicans that are made of lightning - six images, all very heavy metal but none of them where the birds are made of lightning, though two have lightning in the background now

Still not made of lightning. One more try:

“pelican made of lightning”:

Six images of pelicans - they are all made of lightning this time, but they don't look great.

Let’s try the universal DALL-E cheat code, adding “digital art” to the prompt.

“a pelican made of lightning, digital art”

Six images of pelicans - they are all made of lightning this time, and they look pretty cool

OK, those look a lot better!

One last try—the earlier prompt but with “digital art” added.

“A heavy metal album cover where the band members are all pelicans that are made of lightning, digital art”:

These are really cool images of pelicans with lightning - though again, they aren't really made of lightning. Also there's no album text any more.

OK, these are cool. The text is gone—maybe the “digital art” influence overrode the “album cover” a tiny bit there.

This process is a good example of “prompt engineering”—feeding in altered prompts to try to iterate towards a better result. This is a very deep topic, and I’m confident I’ve only just scratched the surface of it.

Breaking away from album art, here’s “A squadron of pelicans having a tea party in a forest with a raccoon, digital art”. Often when you specify “digital art” it picks some other additional medium:

A beautiful painting. A raccoon sits in the foreground at a little table in the forest. He is surrounded by pelicans, one of which is pouring a drink from a half-bucket-half-teapot.

Recreating things you see

A fun game I started to play with DALL-E was to see if I could get it to recreate things I saw in real life.

My dog, Cleo, was woofing at me for breakfast. I took this photo of her:

A medium sized black pitbull mix sitting on a hardwood floor

Then I tried this prompt: “A medium sized black dog who is a pit bull mix sitting on the ground wagging her tail and woofing at me on a hardwood floor”

A medium sized black pitbull mix sitting on a hardwood floor

OK, wow.

Later, I caught her napping on the bed:

A medium sized black pitbull mix curled up asleep on a green duvet

Here’s DALL-E for “A medium sized black pit bull mix curled up asleep on a dark green duvet cover”:

A medium sized black pit bull mix curled up asleep on a dark green duvet cover - a very good image

One more go at that. Our chicken Cardi snuck into the house and snuggled up on the sofa. Before I evicted her back into the garden I took this photo:

a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket

“a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket”:

Six images that fit the brief, though the cushions don't have the pattern and the camera zoomed in much closer on the chicken than in the original

Clearly I didn’t provide a detailed enough prompt here! I would need to iterate on this one a lot.

Stained glass

DALL-E is great at stained glass windows.

“Pelican in a waistcoat as a stained glass window”:

A really cool stained glass window design of a pelican, though it is not wearing a waistcoat

"A stained glass window depicting 5 different nudibranchs"

5 different nudibranchs in stained glass - really good

People

DALL-E is (understandably) quite careful about depictions of people. It won’t let you upload images with recognisable faces in them, and when you ask for a prompt with a famous person it will sometimes pull off tricks like showing them from behind.

Here’s “The pope on a bicycle leading a bicycle race through Paris”:

A photo of the pope on a bicycle, taken from behind, with a blurred out Paris street in the background

Though maybe it was the “leading a bicycle race” part that inspired it to draw the image from this direction? I’m not sure.

It will sometimes generate made-up people with visible faces, but OpenAI asks users not to share those images.

Assorted images

Here are a bunch of images that I liked, with their prompts.

Inspired by one of our chickens:

“A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree”

A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree

I asked it for the same thing, painted by Salvador Dali:

“A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, painted by Salvador Dali”:

Three paintings of a blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, in the style of Salvador Dali

“Bats having a quinceañera, digital art”:

Three bats with pink ears, one is wearing a pink dress

“The scene in an Agatha Christie mystery where the detective reveals who did it, but everyone is a raccoon. Digital art.”:

This one is in more of a cartoon style. The raccoon stands in front, and four people in period clothes stand in the background, one of them with a magnifying glass.

(It didn’t make everyone a raccoon. It also refused my initial prompt where I asked for an Agatha Christie murder mystery, presumably because of the word “murder”.)

“An acoustic guitar decorated with capybaras in Mexican folk art style, sigma 85mm”:

A close-up shot of an acoustic guitar with some capybaras painted on it.

Adding “sigma 85mm” (and various other mm lengths) is a trick I picked up which gives you realistic images that tend to be cropped well.

“A raccoon wearing glasses and reading a poem at a poetry evening, sigma 35mm”:

A very convincing photograph of a raccoon wearing glasses reading from a book, with a blurry background

“Pencil sketch of a Squirrel reading a book”:

A just gorgeous pencil sketch of a squirrel reading a book

Pencil sketches come out fantastically well.

“The royal pavilion in brighton covered in snow”

The royal pavilion in brighton covered in snow - the windows look a bit weird

I experienced this once, many years ago when I lived in Brighton—but forgot to take a photo of it. It looked exactly like this.

And a game: fantasy breakfast tacos

It’s difficult to overstate how much fun playing with this stuff is. Here’s a game I came up with: fantasy breakfast tacos. See how tasty a taco you can invent!

Mine was “breakfast tacos with lobster, steak, salmon, sausages and three different sauces”:

A really delicious assortment of tacos

Natalie is a vegetarian, which I think puts her at a disadvantage in this game. “breakfast taco containing cauliflower, cheesecake, tomatoes, eggs, flowers”:

A really delicious assortment of tacos

Closing thoughts

As you can see, I have been enjoying playing with this a LOT. I could easily share twice as much—the above are just the highlights from my experiments so far.

The obvious question raised by this is how it will affect people who generate art and design for a living. I don’t have anything useful to say about that, other than recommending that they make themselves familiar with the capabilities of these kinds of tools—which have taken an astonishing leap forward in the past few years.

My current mental model of DALL-E is that it’s a fascinating tool for enhancing my imagination. Being able to imagine something and see it visualized a few seconds later is an extraordinary new ability.

I haven’t yet figured out how to apply this to real world problems that I face—my attempts at getting DALL-E to generate website wireframes or explanatory illustrations have been unusable so far—but I’ll keep on experimenting with it. Especially since feeding it prompts is just so much fun.

Joining CSV files in your browser using Datasette Lite four days ago

I added a new feature to Datasette Lite—my version of Datasette that runs entirely in your browser using WebAssembly (previously): you can now use it to load one or more CSV files by URL, and then run SQL queries against them—including joins across data from multiple files.

Your CSV file needs to be hosted somewhere with access-control-allow-origin: * CORS headers. Any CSV file hosted on GitHub provides these, if you use the link you get by clicking on the “Raw” version.
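
If you want to verify that header for a particular file, here’s a quick check using just the Python standard library (shown with the FiveThirtyEight CSV that appears in the next section):

import urllib.request

url = (
    "https://raw.githubusercontent.com/fivethirtyeight/data"
    "/master/fight-songs/fight-songs.csv"
)
with urllib.request.urlopen(url) as response:
    # GitHub's raw URLs serve this header with a value of "*"
    print(response.headers.get("access-control-allow-origin"))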

Loading CSV data from a URL

Here’s the URL to a CSV file of college fight songs collected by FiveThirtyEight in their data repo as part of the reporting for this story a few years ago:

https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv

You can pass this to Datasette Lite in two ways; the examples below use the ?csv= query string parameter.

Once Datasette has loaded, a data database will be available with a single table called fight-songs.

As you navigate around in Datasette the URL bar will update to reflect current state—which means you can deep-link to table views with applied filters and facets:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data/fight-songs?_facet=conference&_facet=student_writer&_facet=official_song

Or even link to the result of a custom SQL query:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv#/data?sql=select+school%2C+conference%2C+song_name%2C+writers%2C+year%2C+student_writer+spotify_id+from+%5Bfight-songs%5D+order+by+rowid+limit+101

Loading multiple files and joining data

You can pass the ?csv= parameter more than once to load data from multiple CSV files into the same virtual data database. Each CSV file will result in a separate table.

For this demo I’ll use two CSV files.

The first is us-counties-recent.csv from the NY Times covid-19-data repository, which lists the most recent numbers for Covid cases for every US county.

The second is us_census_county_populations_2019.csv, a CSV file listing the population of each county according to the 2019 US Census which I extracted from this page on the US Census website.

Both of those tables include a column called fips, representing the FIPS county code for each county. These 4-5 digit codes are ideal for joining the two tables.

Here’s a SQL query which joins the two tables, filters for the data for the most recent date represented (using where date = (select max(date) from [us-counties-recent])) and calculates cases_per_million using the cases and the population:

select
  [us-counties-recent].*,
  us_census_county_populations_2019.population,
  1.0 * [us-counties-recent].cases / us_census_county_populations_2019.population * 1000000 as cases_per_million
from
  [us-counties-recent]
  join us_census_county_populations_2019 on us_census_county_populations_2019.fips = [us-counties-recent].fips
where
  date = (select max(date) from [us-counties-recent])
order by
  cases_per_million desc

A screenshot of that query running in Datasette. Loving county Texas has the worst result - 1,289,940 cases per million - but that's because they have a population of just 169 people and 218 recorded cases.

And since everything in Datasette Lite can be bookmarked, here’s the super long URL (clickable version here) that executes that query against those two CSV files:

https://lite.datasette.io/?csv=https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv&csv=https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv#/data?sql=select%0A++%5Bus-counties-recent%5D.*%2C%0A++us_census_county_populations_2019.population%2C%0A++1.0+*+%5Bus-counties-recent%5D.cases+%2F+us_census_county_populations_2019.population+*+1000000+as+cases_per_million%0Afrom%0A++%5Bus-counties-recent%5D%0A++join+us_census_county_populations_2019+on+us_census_county_populations_2019.fips+%3D+%5Bus-counties-recent%5D.fips%0Awhere%0A++date+%3D+%28select+max%28date%29+from+%5Bus-counties-recent%5D%29%0Aorder+by%0A++cases_per_million+desc
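
As an aside, if you wanted to assemble one of these multi-csv URLs programmatically rather than by hand, here’s a minimal sketch in Python (note that urlencode() will percent-encode the nested URLs, which should decode back to the same parameters):

from urllib.parse import urlencode

csv_urls = [
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-recent.csv",
    "https://raw.githubusercontent.com/simonw/covid-19-datasette/main/us_census_county_populations_2019.csv",
]
# One ("csv", url) pair per file produces the repeated ?csv= parameters
params = urlencode([("csv", url) for url in csv_urls])
# The #/data?sql=... fragment is appended separately
print("https://lite.datasette.io/?" + params)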

Weeknotes: datasette-socrata, and the last 10%... six days ago

... takes 90% of the work. I continue to work towards a preview of the new Datasette Cloud, and keep finding new “just one more things” to delay inviting in users.

Aside from continuing to work on that, my big project in the last week was a blog entry: Twenty years of my blog, in which I celebrated twenty years since starting this site by pulling together a selection of highlights from over the years.

I’ve actually updated that entry a few times over the past few days as I remembered new highlights I forgot to include—the Twitter thread that accompanies the entry has those updates, starting here.

datasette-socrata

I’ve been thinking a lot about the Datasette Cloud onboarding experience: how can I help new users understand what Datasette can be used for as quickly as possible?

I want to get them to a point where they are interacting with a freshly created table of data. I can provide some examples, but I’ve always thought that one of the biggest opportunities for Datasette lies in working with the kind of data released by governments through their Open Data portals. This is especially true for its usage in the field of data journalism.

Many open data portals—including the one for San Francisco—are powered by a piece of software called Socrata. And it offers a pretty comprehensive API.

datasette-socrata is a new Datasette plugin which can import data from Socrata instances. Give it the URL to a Socrata dataset (like this one, my perennial favourite, listing all 195,000+ trees managed by the city of San Francisco) and it will import that data and its associated metadata into a brand new table.

It’s pretty neat! It even shows you a progress bar, since some of these datasets can get pretty large:

Animated demo of a progress bar, starting at 137,000/179,605 and continuing until the entire set has been imported

As part of building this I ran into an interesting question: what should a plugin like this do if the system it is running on runs out of disk space?

I’m still working through that, but I’m experimenting with a new type of Datasette plugin for it: datasette-low-disk-space-hook, which introduces a new plugin hook (low_disk_space(datasette)) which other plugins can use to report a situation where disk space is running out.
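
Here’s a hypothetical sketch of what a plugin implementing that hook might look like. The exact return contract is defined by datasette-low-disk-space-hook, so treat the return value here as an assumption:

import shutil

from datasette import hookimpl


@hookimpl
def low_disk_space(datasette):
    # Hypothetical threshold: report a problem if less than 1GB is free
    free_bytes = shutil.disk_usage("/").free
    if free_bytes < 1_000_000_000:
        return "Less than 1GB of disk space remaining"
    return None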

I wrote a TIL about that here: Registering new Datasette plugin hooks by defining them in other plugins.

I may use this same trick for a future upgrade to datasette-graphql, to allow additional plugins to register custom GraphQL mutations.

sqlite-utils 3.27

In working on datasette-socrata I was inspired to push out a new release of sqlite-utils. Here are the annotated release notes:

  • Documentation now uses the Furo Sphinx theme. (#435)

I wrote about this a few weeks ago—the new documentation theme is now live for the stable documentation.

  • Code examples in documentation now have a “copy to clipboard” button. (#436)

I made this change to Datasette first—the sphinx-copybutton plugin adds a neat “copy” button next to every code example.

I also like how this encourages ensuring that every example will work if people directly copy and paste it.

Francesco Frassinelli filed an issue about the rows_from_file() utility function, which wasn’t actually part of the documented stable API, but I saw no reason not to promote it.

The function incorporates the logic that the sqlite-utils CLI tool uses to automatically detect if a provided file is CSV, TSV or JSON and detect the CSV delimiter and other settings.

  • rows_from_file() has two new parameters to help handle CSV files with rows that contain more values than are listed in that CSV file’s headings: ignore_extras=True and extras_key="name-of-key". (#440)

It turns out csv.DictReader in the Python standard library has a mechanism for handling CSV rows that contain too many commas.
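
Here’s that mechanism in action. The restkey argument names a key (the _extras name here is arbitrary) that collects any surplus values into a list:

import csv
import io

# The data row has two more values than there are headings
data = io.StringIO("id,name\n1,Cleo,extra-1,extra-2\n")
reader = csv.DictReader(data, restkey="_extras")
print(next(reader))
# {'id': '1', 'name': 'Cleo', '_extras': ['extra-1', 'extra-2']}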

In working on this I found a bug in mypy which I reported here, but it turned out to be a dupe of an already fixed issue.

This is a workaround for the following Python error:

_csv.Error: field larger than field limit (131072)

It’s an error that occurs when a field in a CSV file is longer than the default limit (131,072 characters).

Saying “yeah, I want to be able to handle the maximum length possible” is surprisingly hard—Python doesn’t let you set a maximum, and can throw errors depending on the platform if you set a number too high. Here’s the idiom that works, which is encapsulated by the new utility function:

import csv as csv_std
import sys

# Start from the largest representable value, then divide by 10 until
# the platform accepts it
field_size_limit = sys.maxsize

while True:
    try:
        csv_std.field_size_limit(field_size_limit)
        break
    except OverflowError:
        field_size_limit = int(field_size_limit / 10)

  • table.search(where=, where_args=) parameters for adding additional WHERE clauses to a search query. The where= parameter is available on table.search_sql(...) as well. See Searching with table.search(). (#441)

This was a feature suggestion from Tim Head.

  • Fixed bug where table.detect_fts() and other search-related functions could fail if two FTS-enabled tables had names that were prefixes of each other. (#434)

This was quite a gnarly bug. sqlite-utils attempts to detect if a table has an associated full-text search table by looking through the schema for another table that has a definition like this one:

CREATE VIRTUAL TABLE "searchable_fts"
USING FTS4 (
    text1,
    text2,
    [name with . and spaces],
    content="searchable"
)

I was checking for content="searchable" using a LIKE query:

SELECT name FROM sqlite_master
WHERE rootpage = 0
AND
sql LIKE '%VIRTUAL TABLE%USING FTS%content=%searchable%'

But this would incorrectly match strings such as content="searchable2" as well!
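
You can demonstrate the false positive directly against SQLite’s LIKE operator with a quick sketch in Python:

import sqlite3

conn = sqlite3.connect(":memory:")
schema = 'CREATE VIRTUAL TABLE "searchable2_fts" USING FTS4 (text1, content="searchable2")'
pattern = "%VIRTUAL TABLE%USING FTS%content=%searchable%"
# Returns (1,): the schema for "searchable2" matches the pattern that was
# intended to find only tables with content="searchable"
print(conn.execute("select ? like ?", [schema, pattern]).fetchone())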

Releases this week

TIL this week

Twenty years of my blog 12 days ago

I started this blog on June 12th 2002—twenty years ago today! To celebrate two decades of blogging, I decided to pull together some highlights and dive down a self-indulgent nostalgia hole.

Some highlights

Some of my more influential posts, in chronological order.

  • A new XML-RPC library for PHP—2nd September 2002

    I was really excited about XML-RPC, one of the earliest technologies for building Web APIs. IXR, the Incutio library for XML-RPC, was one of my earliest ever open source library releases. Here’s a capture of the old site.

Website: The Incutio XML-RPC Library for PHP. Version 1.6, published May 25th 2003.

    I’ve not touched anything relating to this project in over 15 years now, but it has lived on in both WordPress and Drupal (now only in Drupal 6 LTS).

    It’s also been responsible for at least one CVE vulnerability in those platforms!

  • getElementsBySelector()—25th March 2003

    Andrew Hayward had posted a delightful snippet of JavaScript called document.getElementsByClassName()—like document.getElementsByTagName() but for classes instead.

    Inspired by this, I built document.getElementsBySelector()—a function that could take a CSS selector and return all of the matching elements.

    This ended up being very influential indeed! Paul Irish offers a timeline of JavaScript CSS selector engines which tracks some of what happens next. Most notably, getElementsBySelector() was part of John Resig’s inspiration in creating the first version of jQuery. To this day, the jQuery source includes this testing fixture which is derived from my original demo page.

    I guess you could call document.getElementsBySelector() the original polyfill for document.querySelectorAll().

  • I’m in Kansas—27th August 2003

In May 2003 Adrian Holovaty posted about a job opportunity for a web developer at the Lawrence Journal-World newspaper in Lawrence, Kansas.

    This coincided with my UK university offering a “year in industry” placement, which meant I could work for a year anywhere in the world with a student visa program. I’d been reading Adrian’s blog for a while and really liked the way he thought about building for the web—we were big fans of Web Standards and CSS and cleanly-designed URLs, all of which were very hot new things at the time!

So I talked to Adrian about whether this could work as a year-long opportunity, and we figured out how to make it work.

At the Lawrence Journal-World, Adrian and I decided to start using Python instead of PHP, in order to build a CMS for that local newspaper...

  • Introducing Django—17th July 2005

    ... and this was the eventual outcome! Adrian and I didn’t even know we were building a web framework at first—we called it “the CMS”. But we kept having to solve new foundational problems: how should database routing work? What about templating? What’s the best way to represent the incoming HTTP request?

I had left the Lawrence Journal-World in 2004, but by 2005 the team there had grown what’s now known as Django far beyond where it was when I had left, and they got the go-ahead from the company to release it as open source (partly thanks to the example set by Ruby on Rails, which was first released in August 2004).

    In 2010 I wrote up a more detailed history of Django in a Quora answer, now mirrored to my blog.

  • Finally powered by Django—15th December 2006

    In which I replaced my duct-tape-and-mud PHP blogging engine with a new Django app. I sadly don’t have the version history for this anymore (this was pre-git, I think I probably had it in Subversion somewhere) but today’s implementation is still based on the same code, upgraded to Django 1.8 in 2015.

    That 2006 version did include a very pleasing Flickr integration to import my photos (example on the Internet Archive):

    Screenshot of my blog's archive page for 6th January 2005 with my old design - it included photosets from Flickr mixed in among the links, as well as a set of photo thumbnails in the right hand navigation underneath the calendar widget

  • How to turn your blog in to an OpenID—19th December 2006

    In late 2006 I got very, very excited about OpenID. I was convinced that Microsoft Passport was going to take over SSO on the internet, and that the only way to stop that was to promote an open, decentralized solution. I wrote posts about it, made screencasts (that one got 840 diggs! Annoyingly I was serving it from the Internet Archive who appear to have deleted it) and gave a whole bunch of conference talks about it too.

I spent the next few years advocating for OpenID—in particular the URL-based OpenID mechanism where any website can be turned into an identifier. It didn’t end up taking off, and with hindsight I think that’s likely for the best: expecting people to take control of their own security by choosing their preferred authentication provider sounded great to me in 2006, but I can understand why companies chose to instead integrate with a smaller, tightly controlled set of SSO partners over time.

  • A few notes on the Guardian Open Platform—10th March 2009

    In 2009 I was working at the Guardian newspaper in London in my first proper data journalism role—my work at the Lawrence Journal-World had hinted towards that a little, but I spent the vast majority of my time there building out a CMS.

    In March we launched two major initiatives: the Datablog (also known as the Data Store) and the Guardian’s Open Platform (an API that is still offered to this day).

    The goal of the Datablog was to share the data behind the stories. Simon Rogers, the Guardian’s data editor, had been collecting meticulous datasets about the world to help power infographics in the paper for years. The new plan was to share that raw data with the world.

    We started out using Google Sheets for this. I desperately wanted to come up with something less proprietary than that—I spent quite some time experimenting with CouchDB—but Google Sheets was more than enough to get the project started.

    Many years later my continued mulling of this problem formed part of the inspiration for my creation of Datasette, a story I told in my 2018 PyBay talk How to Instantly Publish Data to the Internet with Datasette.

  • Why I like Redis— 22nd October 2009

    I got interested in NoSQL for a few years starting around 2009. I still think Redis was the most interesting new piece of technology to come out of that whole movement—an in-memory data structure server exposed over the network turns out to be a fantastic complement for other data stores, and even though I now default to PostgreSQL or SQLite for almost everything else I can still find problems for which Redis is a great solution.

    In April 2010 I gave a three hour Redis tutorial at NoSQL Europe which I wrote up in Comprehensive notes from my three hour Redis tutorial.

  • Node.js is genuinely exciting— 23rd November 2009

In November 2009 I found out about Node.js. As a Python web developer I had been following the evolution of Twisted with great interest, but I’d also run into the classic challenge that once you start using event-driven programming, almost every library you might want to use likely doesn’t work for you any more.

    Node.js had server-side event-driven programming baked into its very core. You couldn’t accidentally make a blocking call and break your event loop because it didn’t ever give you the option to do so!

    I liked it so much I switched out my talk for Full Frontal 2009 at the last minute for one about Node.js instead.

    I think this was an influential decision. I won’t say who they are (for fear of mis-representing or mis-quoting them), but I’ve talked to entrepreneurs who built significant products on top of server-side JavaScript who told me that they heard about Node.js from me first.

  • Crowdsourced document analysis and MP expenses—20th December 2009


    The UK government had finally got around to releasing our Member of Parliament expense reports, and there was a giant scandal brewing about the expenses that had been claimed. We recruited our audience to help dig through 10,000s of pages of PDFs to help us find more stories.

    The first round of the MP’s expenses crowdsourcing project launched in June, but I was too busy working on it to properly write about it! Charles Arthur wrote about it for the Guardian in The breakneck race to build an application to crowdsource MPs’ expenses.

    In December we launched round two, and I took the time to write about it properly.

    Here’s a Google Scholar search for guardian mps expenses—I think it was pretty influential. It’s definitely one of the projects I’m most proud of in my career so far.

  • WildlifeNearYou: It began on a fort...—12th January 2010

    In October 2008 I participated in the first /dev/fort—a bunch of nerds rent a fortress (or similar historic building) for a week and hack on a project together.

    Following that week of work it took 14 months to add the “final touches” before putting the site we had built live (partly because I insisted on implementing OpenID for it) but in January 2010 we finally went live with WildlifeNearYou.com (sadly no longer available). It was a fabulous website, which crowdsourced places that people had seen animals in order to answer the crucial question “where is my nearest Llama?”.

    Here’s what it looked like:

    Find and share places to see wildlife: WildlifeNearYou is a site for sharing your passion for wildlife. Search for animals or places near you, or register to add your own trips and photos.

    Although it shipped after the Guardian MP’s expenses project most of the work on WildlifeNearYou had come before that—building WildlifeNearYou (in Django) was the reason I was confident that the MP’s expenses project was feasible.

  • Getting married and going travelling—21st June 2010

On June 5th 2010 I married Natalie Downe, and we both quit our jobs to set off travelling around the world and see how far we could get.

    Natalie is wearing a bridal gown. I am in a suit. I have a terrifying Golden Eagle perched on my arm.

    We got as far as Casablanca, Morocco before we accidentally launched a startup together: Lanyrd, launched in August 2010. “Sign in with Twitter to see conferences that your friends are speaking at, attending or tracking, then add your own events.”

    We ended up spending the next three years on this: we went through Y Combinator, raised a sizable seed round, moved to London, hired a team and shipped a LOT of features. We even managed to ship some features that made the company money!

    This also coincided with me putting the blog on the back-burner for a few years.

    Here’s an early snapshot:

    Welcome to Lanyrd. The social conference directory. Get more out of conferences. Find great conferences to attend: See what your friends are going to or speaking at, find conferences near you or browse conferences by topic. Discover what's hot while it's on: Track what's going on during the conference, even if you aren't there. Who is tweeting what, what links are doing the rounds. Use our useful mobile version to decide what to go to next. Catch up on anything you missed: Easily discover slides, video and podcasts from conferences you attended or tracked. If you spoke at an event you can build up your speaker portfolio of talks you gave.

    In 2013 we sold Lanyrd to Eventbrite, and moved our entire team (and their families) from London to San Francisco. It had been a very wild ride.

    Sadly the site itself is no longer available: as Eventbrite grew it became impossible to justify the work needed to keep Lanyrd maintained, safe and secure. Especially as it started to attract overwhelming volumes of spam.

    Natalie told the full story of Lanyrd on her blog in September 2013: Lanyrd: from idea to exit—the story of our startup.

  • Scraping hurricane Irma—10th September 2017

    In 2017 hurricane Irma devastated large areas of the Caribbean and the southern USA.

    I got involved with the Irma Response project, helping crowdsource and publish critical information for people affected by the storm.

    I came up with a trick to help with scraping: I ran scrapers against important information sources and recorded the results to a git repository, in order to cheaply track changes to those sources over time.

    I later coined the term “Git scraping” for this technique, see my series of posts about Git scraping over time.

  • Getting the blog back together—1st October 2017

    Running a startup, and then working at Eventbrite afterwards, had resulted in an almost 7 year gap in blogging for me. In October 2017 I decided to finally get my blog going again. I also back-filled content for the intervening years by scraping my content from Quora and from Ask Metafilter.

    If you’ve been meaning to start a new blog or revive an old one this is a trick that I can thoroughly recommend: just because you initially wrote something elsewhere doesn’t mean you shouldn’t repost it on a site you own.

  • Recovering missing content from the Internet Archive—8th October 2017

    The other step in recovering my old blog’s content was picking up some content that was missing from my old database backup. Here’s how I pulled in that content by scraping the Internet Archive.

  • Implementing faceted search with Django and PostgreSQL— 5th October 2017

    I absolutely love building faceted search engines. I realized a while ago that most of my career has been spent applying the exact same trick—faceted search—to different problem spaces. WildlifeNearYou offered faceted search over animal sightings. MP’s expenses had faceted search across crowdsourced expense analysis. Lanyrd was faceted search for conferences.

    I implemented faceted search for this blog on top of PostgreSQL, and wrote about how I did it.

  • Datasette: instantly create and publish an API for your SQLite databases—13th November 2017

I shipped the first release of simonw/datasette in November 2017. Nearly five years later it’s now my number-one focus, and I don’t see myself losing interest in it for many decades to come.

    Datasette was inspired by the Guardian Datablog, combined with my realization that Zeit Now (today called Vercel) meant you could bundle data up in a SQLite database and deploy it as part of an exploratory application almost for free.

    My blog has 284 items tagged datasette at this point.

  • Datasette Facets—20th May 2018

    Given how much I love faceted search, it’s surprising it took me until May 2018 to realize that I could bake them into Datasette itself—turning it into a tool for building faceted search engines against any data. It turns out to be my ideal solution to my favourite problem!

  • Documentation unit tests—28th July 2018

    I figured out a pattern for using unit tests to ensure that features of my projects were covered by the documentation. Four years later I can confirm that this technique works really well—though I wish I’d called it Test-driven documentation instead!

  • Letterboxing on Lundy—18th September 2018

    A brief foray into travel writing: Natalie and I spent a few days staying in a small castle on the delightful island of Lundy off the coast of North Devon, and I used it as an opportunity to enthuse about letterboxing and the Landmark Trust.

    A small, battered looking castle on a beautiful, remote looking moor

  • sqlite-utils: a Python library and CLI tool for building SQLite databases—25th February 2019

    Datasette helps you explore and publish data stored in SQLite, but how do you get data into SQLite in the first place?

sqlite-utils is my answer to that question—a combined CLI tool and Python library with all sorts of utilities for working with and creating SQLite databases.

    It recently had its 100th release!

  • I commissioned an oil painting of Barbra Streisand’s cloned dogs—7th March 2019

    Not much I can add that’s not covered by the title. It’s a really good painting!

    A framed oil painting showing two small fluffy white dogs in a stroller, gazing at the tombstone of the dog from which they were cloned.

  • My JSK Fellowship: Building an open source ecosystem of tools for data journalism—10th September 2019

    In late 2019 I left Eventbrite to join the JSK fellowship program at Stanford. It was an opportunity to devote myself full-time to working on my growing collection of open source tools for data journalism, centered around Datasette.

    I jumped on that opportunity with both hands, and I’ve been mostly working full-time on Datasette and associated projects (without being paid for it since the fellowship ended) ever since.

  • Weeknotes: ONA19, twitter-to-sqlite, datasette-rure—13th September 2019

    At the start of my fellowship I decided to publish weeknotes, to keep myself accountable for what I was working on now that I didn’t have the structure of a full-time job.

    I’ve managed to post them roughly once a week ever since—128 posts and counting.

    I absolutely love weeknotes as a format. Even if no-one else ever reads them, I find them really useful as a way to keep track of my progress and ensure that I have motivation to get projects to a point where I can write about them at the end of the week!

  • Using a self-rewriting README powered by GitHub Actions to track TILs—20th April 2020

    In April 2020 I started publishing TILs—Today I Learneds—at til.simonwillison.net.

    The idea behind TILs is to dramatically reduce the friction involved in writing a blog post. If I learned something that was useful to me, I’ll write it up as a TIL. These often take less than ten minutes to throw together and I find myself referring back to them all the time.

    My main blog is a Django application, but my TILs run entirely using Datasette. You can see how that all works in the simonw/til GitHub repository.

  • Using SQL to find my best photo of a pelican according to Apple Photos—21st May 2020

    Dogsheep is my ongoing side project in which I explore ways to analyze my own personal data using SQLite and Datasette.

    dogsheep-photos is my tool for extracting metadata about my photos from the undocumented Apple Photos SQLite database (building on osxphotos by Rhet Turnbull). I had been wanting to solve the photo problem for years and was delighted when osxphotos provided the capability I had been missing. And I really like pelicans, so I celebrated by using my photos of them for the demo.

A glorious pelican, wings outstretched

  • Git scraping: track changes over time by scraping to a Git repository—9th October 2020

    If you really want people to engage with a technique, it’s helpful to give it a name. I defined Git scraping in this post, and I’ve been promoting it heavily ever since.

    There are now 275 public repositories on GitHub with the git-scraping topic, and if you sort them by recently updated you can see the scrapers on there that most recently captured some new data.

  • Personal Data Warehouses: Reclaiming Your Data—14th November 2020

    I gave this talk for GitHub’s OCTO (previously Office of the CTO, since rebranded to GitHub Next) speaker series.

    It’s the Dogsheep talk, with a better title (thanks, Idan!) It includes a full video demo of my personal Dogsheep instance, including my dog’s Foursquare checkins, my Twitter data, Apple Watch GPS trails and more.

    I also explain why I called it Dogsheep: it’s a devastatingly terrible pun on Wolfram.

    I’m frustrated when information like this is only available in video format, so when I give particularly information-dense talks I like to turn them into full write-ups as well, providing extra notes and resources alongside screen captures from the talk.

    For this one I added a custom template mechanism to my blog, to allow me to break out of my usual entry page design.

  • Trying to end the pandemic a little earlier with VaccinateCA—28th February 2021

    In February 2021 I joined the VaccinateCA effort to try and help end the pandemic a little bit earlier by crowdsourcing information about the best places to get vaccinated. It was a classic match-up for my skills and interests: a huge crowdsourcing effort that needed to be spun up as a fresh Django application as quickly as possible.

    Django SQL Dashboard was one project that spun directly out of that effort.

  • The Baked Data architectural pattern—28th July 2021

    My second attempt at coining a new term, after Git scraping: Baked Data is the name I’m using for the architectural pattern embodied by Datasette where you bundle a read-only copy of your data alongside the code for your application, as part of the same deployment. I think it’s a really good idea, and more people should be doing it.

  • How I build a feature—12th January 2022

Over the years I’ve evolved a process for feature development that works really well for me, and scales down to small personal projects as well as scaling up to much larger pieces of work. I described that in detail in this post.

Picking out these highlights wasn’t easy. I ended up setting myself a time limit (to ensure I could put this post live within a minute of midnight UTC time on my blog’s 20th birthday) so there’s plenty more that I would have liked to dig up.

My tags index page includes a 2010s-style word cloud that you can visit if you want to explore the rest of my content. Or use the faceted search!

A few more project release highlights:

Evolution over time

I started my blog in my first year as a student studying computer science at the University of Bath.

You can tell that Twitter wasn’t a thing yet, because I wrote 107 posts in that first month. Lots of links to other people’s blog posts (we did a lot of that back then) with extra commentary. Lots of blogging about blogging.

That first version of the site was hosted at http://www.bath.ac.uk/~cs1spw/blog/—on my university’s student hosting. Sadly the Internet Archive doesn’t have a capture of it there, since I moved it to http://simon.incutio.com/ (my part-time employer at the time) in September 2002. Here’s my note from then about rewriting it to use MySQL instead of flat file storage.

This is the earliest capture I could find on the Internet Archive, from June 2003:

My blog in June 2003. The header and highlight colours were orange, the rest was black on white text. The tagline reads: PHP, Python, CSS, XML and general web development. The sidebar includes a "Blogs I read" section with notes as to when each one was last updated. My top post that day talks about Using bookmarklets to experiment with CSS.

Full entry on Using bookmarklets to experiment with CSS.

By November 2006 I had redesigned from orange to green, and started writing Blogmarks—the name I used for small, bookmark-style link posts. I’ve collected 6,304 of them over the years!

Screenshot of my blog in November 2006, after the redesign from orange to green.

By 2010 I’d reached more-or-less my current purple on white design, albeit with the ability to sign in with OpenID to post a comment. I dropped comments entirely when I relaunched in 2017—constantly fighting against spam comments makes blogging much less fun.

My blog in July 2010. It's the same visual design as today, but with an option to sign in with OpenID and a little bubble next to each item showing the number of comments.

The source code for the current iteration of my blog is available on GitHub.

Taking screenshots of the Internet Archive with shot-scraper

Here’s how I generated the screenshots in this post, using shot-scraper against the Internet Archive, with a line of JavaScript to hide the banner that the Archive displays at the top of every archived page:

shot-scraper 'https://web.archive.org/web/20030610004652/http://simon.incutio.com/' \
  --javascript 'document.querySelector("#wm-ipp-base").style.display="none"' \
   --width 800 --height 600 --retina

mgdlbp on Hacker News pointed out that you can instead add if_ to the date part of the archive URLs to hide the banner, like this:

shot-scraper 'https://web.archive.org/web/20030610004652if_/http://simon.incutio.com/' \
   --width 800 --height 600 --retina

A tiny web app to create images from OpenStreetMap maps 13 days ago

Earlier today I found myself wanting to programmatically generate some images of maps.

I wanted to create a map centered around a location, at a specific zoom level, and with a marker in a specific place.

Some cursory searches failed to turn up exactly what I wanted, so I decided to build a tiny project to solve the problem, taking advantage of my shot-scraper tool for automating screenshots of web pages.

The result is map.simonwillison.net—hosted on GitHub Pages from my simonw/url-map repository.

Here’s how to generate a map image of Washington DC:

shot-scraper 'https://map.simonwillison.net/?q=washington+dc' \
  --retina --width 600 --height 400 --wait 3000

That command generates a 1200x800 PNG image that’s a retina screenshot of the map displayed at https://map.simonwillison.net/?q=washington+dc—after waiting three seconds to ensure all of the tiles have fully loaded.

A map of Washington DC, with a Leaflet / OpenStreetMap attribution in the bottom right

The website itself is documented here. It displays a map with no visible controls, though you can use gestures to zoom in and pan around—and the URL bar will update to reflect your navigation, so you can bookmark or share the URL once you’ve got it to the right spot.

You can also use query string parameters to specify the map that should be initially displayed: ?center=latitude,longitude and ?zoom= set the starting position, ?q= geocodes a search term using Nominatim, and one or more ?marker=latitude,longitude parameters add markers.

Annotated source code

The entire mapping application is contained in a single 68 line index.html file that mixes HTML and JavaScript. It’s built using the fantastic Leaflet open source mapping library.

Since the code is so short, I’ll include the entire thing here with some additional explanatory comments.

It started out as a copy of the first example in the Leaflet quick start guide.

<!DOCTYPE html>
<!-- Regular HTML boilerplate -->
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>map.simonwillison.net</title>
<!--
  Leaflet's CSS and JS are loaded from the unpkg.com CDN, with the
  Subresource Integrity (SRI) integrity="sha512..." attribute to ensure
  that the exact expected code is served by the CDN.
-->
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.8.0/dist/leaflet.css" integrity="sha512-hoalWLoI8r4UszCkZ5kL8vayOGVae1oxXe/2A4AO6J9+580uKHDO3JdHb7NzwwzK5xr/Fs0W40kiNHxM9vyTtQ==" crossorigin=""/>
<script src="https://unpkg.com/leaflet@1.8.0/dist/leaflet.js" integrity="sha512-BB3hKbKWOc9Ez/TAwyWxNXeoV9c1v6FIeYiBieIWkpLjauysF18NzgR1MBNBXf8/KABdlkX68nAhlwcDFLGPCQ==" crossorigin=""></script>
<!-- I want the map to occupy the entire browser window with no margins -->
<style>
html, body {
  height: 100%;
  margin: 0;
}
</style>
</head>
<body>
<!-- The Leaflet map renders in this 100% high/wide div -->
<div id="map" style="width: 100%; height: 100%;"></div>
<script>
function toPoint(s) {
  // Convert "51.5,2.1" into [51.5, 2.1]
  return s.split(",").map(parseFloat);
}
// An async function so we can 'await fetch(...)' later on
async function load() {
  // URLSearchParams is a fantastic browser API - it makes it easy to both read
  // query string parameters from the URL and later to generate new ones
  let params = new URLSearchParams(location.search);
  // If the starting URL is /?center=51,32&zoom=3 this will pull those values out
  let center = params.get('center') || '0,0';
  let initialZoom = params.get('zoom');
  let zoom = parseInt(initialZoom || '2', 10);
  let q = params.get('q');
  // .getAll() turns &marker=51.49,0&marker=51.3,0.2 into ['51.49,0', '51.3,0.2']
  let markers = params.getAll('marker');
  // zoomControl: false turns off the visible +/- zoom buttons in Leaflet
  let map = L.map('map', { zoomControl: false }).setView(toPoint(center), zoom);
  L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    maxZoom: 19,
    attribution: '&copy; <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>',
    // This option means retina-capable devices will get double-resolution tiles:
    detectRetina: true
  }).addTo(map);
  // We only pay attention to ?q= if ?center= was not provided:
  if (q && !params.get('center')) {
    // We use fetch to pass ?q= to the Nominatim API and get back JSON
    let response = await fetch(
      `https://nominatim.openstreetmap.org/search.php?q=${encodeURIComponent(q)}&format=jsonv2`
    )
    let data = await response.json();
    // data[0] is the first result - it has a boundingbox array of four floats
    // which we can convert into a Leaflet-compatible bounding box like this:
    let bounds = [
      [data[0].boundingbox[0],data[0].boundingbox[2]],
      [data[0].boundingbox[1],data[0].boundingbox[3]]
    ];
    // This sets both the map center and zooms to the correct level for the bbox:
    map.fitBounds(bounds);
    // User-provided zoom over-rides this
    if (initialZoom) {
      map.setZoom(parseInt(initialZoom));
    }
  }
  // This is the code that updates the URL as the user pans or zooms around.
  // You can subscribe to both the moveend and zoomend Leaflet events in one go:
  map.on('moveend zoomend', () => {
    // Update URL bar with current location
    let newZoom = map.getZoom();
    let center = map.getCenter();
    // This time we use URLSearchParams to construct a center...=&zoom=... URL
    let u = new URLSearchParams();
    // Copy across ?marker=x&marker=y from existing URL, if they were set:
    markers.forEach(s => u.append('marker', s));
    u.append('center', `${center.lat},${center.lng}`);
    u.append('zoom', newZoom);
    // replaceState() is a weird API - the third argument is the one we care about:
    history.replaceState(null, null, '?' + u.toString());
  });
  // This bit adds Leaflet markers to the map for ?marker= query string arguments:
  markers.forEach(s => {
    L.marker(toPoint(s)).addTo(map);
  });
}
load();
</script>
</body>
</html>
<!-- See https://github.com/simonw/url-map for documentation -->

Weeknotes: Datasette Cloud ready to preview 18 days ago

I made an absolute ton of progress building Datasette Cloud on Fly this week, and also had a bunch of fun playing with GPT-3.

Datasette Cloud

Datasette Cloud is my upcoming hosted SaaS version of Datasette. I’ve been re-building my initial alpha on top of Fly because I want to be able to provide each team account with their own Datasette instance running in a dedicated Firecracker container, and the recently announced Fly Machines lets me do exactly that.

As of this weekend I have all of the different pieces in place, and I’m starting to preview it to potential customers.

Interested in trying it out? You can request access to the preview here.

GPT-3 explorations

Most of my GPT-3 explorations over the past week are covered by these two blog posts:

  • A Datasette tutorial written by GPT-3 is the point at which I really started taking GPT-3 seriously, after convincing myself that I could use it to help with real work, not just as a form of entertainment.
  • How to play with the GPT-3 language model is a very quick getting started tutorial, because I polled people on Twitter and found that more than half didn’t know you could try GPT-3 out now for free.

Searching my tweets for GPT captures a bunch of other, smaller experiments. A few highlights:

Releases this week

TIL this week

Elsewhere

23rd June 2022

  • How Imagen Actually Works. Imagen is Google’s new text-to-image model, similar to (but possibly even more effective than) DALL-E. This article is the clearest explanation I’ve seen of how Imagen works: it uses Google’s existing T5 text encoder to convert the input sentence into an encoding that captures the semantic meaning of the sentence (including things like items being described as being on top of other items), then uses a trained diffusion model to generate a 64x64 image. That image is passed through two super-res models to increase the resolution to the final 1024x1024 output. #23rd June 2022, 6:05 pm

20th June 2022

  • The State of WebAssembly 2022. Colin Eberhardt talks through the results of the State of WebAssembly 2022 survey. Rust continues to dominate as the most popular language for working with WebAssembly, but Python has a notable increase in interest. #20th June 2022, 6:07 pm

19th June 2022

  • WarcDB (via) Florents Tselai built this tool for loading web crawl data stored in WARC (Web ARChive) format into a SQLite database for smaller-scale analysis with SQL, on top of my sqlite-utils Python library. #19th June 2022, 6:08 pm

18th June 2022

  • Becoming a good engineer is about collecting experience. Each project, even small ones, is a chance to add new techniques and tools to your toolbox. Where this delivers even more value is when you can solve problems by pairing techniques learned on one project with tools learned working on another. It all adds up.

    Addy Osmani # 18th June 2022, 9:21 pm

8th June 2022

  • The End of Localhost. swyx makes the argument for cloud-based development environments, and points out that many large companies—including Google, Facebook, Shopify and GitHub—have made the move already. I was responsible for the team maintaining the local development environment experience at Eventbrite for a while, and my conclusion is that with a large enough engineering team someone will ALWAYS find a new way to break their local environment: the idea of being able to bootstrap a fresh, guaranteed-to-work environment in the cloud at the click of a button could save SO much time and money. #8th June 2022, 6:09 pm
  • Announcing Pyston-lite: our Python JIT as an extension module (via) The Pyston JIT can now be installed in any Python 3.8 virtual environment by running “pip install pyston_lite_autoload”—which includes a hook to automatically inject the JIT. I just tried a very rough benchmark against Datasette (ab -n 1000 -c 10) and got 391.20 requests/second without the JIT compared to 404.10 requests/second with it. #8th June 2022, 5:58 pm

31st May 2022

  • Compiling Black with mypyc (via) Richard Si is a Black contributor who recently obtained a 2x performance boost by compiling Black using the mypyc tool from the mypy project, which uses Python type annotations to generate a compiled C version of the Python logic. He wrote up this fantastic three-part series describing in detail how he achieved this, including plenty of tips on Python profiling and clever optimization tricks. #31st May 2022, 11:24 pm
  • Lesser Known Features of ClickHouse (via) I keep hearing positive noises about ClickHouse. I learned about a whole bunch of capabilities from this article—including that ClickHouse can directly query tables that are stored in SQLite or PostgreSQL. #31st May 2022, 7:48 pm

30th May 2022

  • Dragonfly: A modern replacement for Redis and Memcached (via) I was initially pretty skeptical of the tagline: does Redis really need a “modern” replacement? But the Background section of the README makes this look like a genuinely interesting project. It re-imagines Redis to have its keyspace partitioned across multiple threads, and uses the VLL lock manager described in a 2014 paper to “compose atomic multi-key operations without using mutexes or spinlocks”. The initial benchmarks show up to a 25x increase in throughput compared to Redis. It’s written in C++. #30th May 2022, 10:02 pm

27th May 2022

  • Architecture Notes: Datasette (via) I was interviewed for the first edition of Architecture Notes—a new publication (website and newsletter) about software architecture created by Mahdi Yusuf. We covered a bunch of topics in detail: ASGI, SQLite and asyncio, Baked Data, plugin hook design, Python in WebAssembly, Python in an Electron app and more. Mahdi also turned my scrappy diagrams into beautiful illustrations for the piece. #27th May 2022, 3:20 pm

26th May 2022

  • upptime (via) “Open-source uptime monitor and status page, powered entirely by GitHub Actions, Issues, and Pages.” This is a very creative (ab)use of GitHub Actions: it runs a scheduled action to check the availability of sites that you specify, records the results in a YAML file (with the commit history tracking them over time) and can automatically open a GitHub issue for you if it detects a new incident. #26th May 2022, 3:53 am
  • Benjamin "Zags" Zagorsky: Handling Timezones in Python. The talks from PyCon US have started appearing on YouTube. I found this one really useful for shoring up my Python timezone knowledge: It reminds that if your code calls datetime.now(), datetime.utcnow() or date.today(), you have timezone bugs—you’ve been working with ambiguous representations of instances in time that could span a 26 hour interval from UTC-12 to UTC+14. date.today() represents a 24 hour period and hence is prone to timezone surprises as well. My code has a lot of timezone bugs! #26th May 2022, 3:40 am