Simon Willison’s Weblog

Blogmarks in Mar

Filters: Type: blogmark × Month: Mar ×


gifcap (via) This is really neat: a purely client-side implementation of animated gif screen capture, using navigator.mediaDevices.getDisplayMedia for the screen capturing, mithril for the UI and the gif.js pure JavaScript GIF encoding library to render the output. # 30th March 2020, 9:29 pm

PostGraphile: Production Considerations. PostGraphile is a tool for building a GraphQL API on top of an existing PostgreSQL schema. Their “production considerations” documentation is particularly interesting because it directly addresses some of my biggest worries about GraphQL: the potential for someone to craft an expensive query that ties up server resources. PostGraphile suggests a number of techniques for avoiding this, including a statement timeout, a query allowlist, pagination caps and (in their “pro” version) a cost limit that uses a calculated cost score for the query. # 27th March 2020, 1:22 am

Making Datasets Fly with Datasette and Fly (via) It’s always exciting to see a Datasette tutorial that wasn’t written by me! This one is great—it shows how to load Central Park Squirrel Census data into a SQLite database, explore it with Datasette and then publish it to the Fly hosting platform using datasette-publish-fly and datasette-cluster-map. # 26th March 2020, 11:56 pm

hacker-news-to-sqlite (via) The latest in my Dogsheep series of tools: hacker-news-to-sqlite uses the Hacker News API to fetch your comments and submissions from Hacker News and save them to a SQLite database. # 21st March 2020, 4:27 am

Django: Added support for asynchronous views and middleware (via) An enormously consequential feature just landed in Django, and is set to ship as part of Django 3.1 in August. Asynchronous views will allow Django applications to define views using “async def myview(request)”—taking full advantage of Python’s growing asyncio ecosystem and providing enormous performance improvements for Django sites that do things like hitting APIs over HTTP. Andrew has been puzzling over this for ages and it’s really exciting to see it land in a form that should be usable in a stable Django release in just a few months. # 19th March 2020, 3:43 am

datasette-publish-fly (via) Fly is a neat new Docker hosting provider with a very tempting pricing model: Just $2.67/month for their smallest always-on instance, and they give each user $10/month in free credit. datasette-publish-fly is the first plugin I’ve written using the publish_subcommand plugin hook, which allows extra hosting providers to be added as publish targets. Install the plugin and you can run “datasette publish fly data.db” to deploy SQLite databases to your Fly account. # 19th March 2020, 3:40 am

New governance model for the Django project. This has been under discussion for a long time: I’m really excited to see it put into action. It’s difficult to summarize, but they key effect should be a much more vibrant, active set of people involved in making decisions about the framework. # 12th March 2020, 5:27 pm

Announcing Daylight Map Distribution. Mike Migurski announces a new distribution of OpenStreetMap: a 42GB dump of the version of the data used by Facebook, carefully moderated to minimize the chance of incorrect or maliciously offensive edits. Lots of constructive conversation in the comments about the best way for Facebook to make their moderation decisions more available to the OSM community. # 12th March 2020, 11:44 am

The unexpected Google wide domain check bypass (via) Fantastic story of discovering a devious security vulnerability in a bunch of Google products stemming from a single exploitable regular expression in the Google closure JavaScript library. # 9th March 2020, 11:27 pm

Millions of tiny databases. Fascinating, detailed review of a paper that describes Amazon’s Physalia, a distributed configuration store designed to provide extremely high availability coordination for Elastic Block Store replication. My eyebrows raised at “Physalia is designed to offer consistency and high-availability, even under network partitions.” since that’s such a blatant violation of CAP theorem, but it later justifies it like so: “One desirable property therefore, is that in the event of a partition, a client’s Physalia database will be on the same side of the partition as the client. Clever placement of cells across nodes can maximise the chances of this.” # 5th March 2020, 4:37 am

The Next CEO of Stack Overflow. “Including the Stack Exchange network of 174 sites, we have over 100 million monthly visitors. Every month, over 125,000 wonderful people write answers”—this fits the rule of thumb for user-generated content that only a tiny portion of your audience will actively create content: in this case it’s just 0.125% (one eighth of one percent). I’d love to know how many people are upvoting or performing other more lightweight interactions. # 28th March 2019, 3:12 pm

Programmer migration patterns. Avery Pennarun explores the history of modern programming languages and how developers have migrated from one to another over time. Lots of fun insights in this. # 28th March 2019, 4:59 am

VisiData (via) Intriguing tool by Saul Pwanson: VisiData is a command-line “textpunk utility” for browsing and manipulating tabular data. “pip3 install visidata” and then “vd myfile.csv” (or .json or .xls or SQLite orothers) and get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips for if you start playing with it: hit “gq” to exit, and hit “Ctrl+H” to view the help screen. # 18th March 2019, 3:45 am

The Cloud and Open Source Powder Keg (via) Stephen O’Grady’s analysis of the Elastic v.s. AWS situation, where Elastic started mixing their open source and non-open source code together and Amazon responded by releasing their own forked “open distribution for Elasticsearch”. World War One analogies included! # 17th March 2019, 7:08 pm

What the Hell is Going On? (via) David Perell discusses how the shift from information scarcity to information abundance is reshaping commerce, education, and politics. Long but worthwhile. # 17th March 2019, 4:50 pm

Client-side instrumentation for under $1 per month. No servers necessary. (via) Rolling your own analytics used to be too complex and expensive to be worth the effort. Thanks to cloud technologies like Cloudfront, Athena, S3 and Lambda you can now inexpensively implement client-side analytics (via requests to a tracking pixel) that stores detailed logs on S3, then use Amazon Athena to run queries against those logs ($5/TB scanned) to get detailed reporting. This post also introduced me to Snowplow, an open source JavaScript analytics script (released by a commercial analytics platform) which looks very neat—it’s based on piwik.js, the tracker from the open-source Piwik analytics tool. # 15th March 2019, 4:03 pm

D3 Projection Comparison (via) Fun Observable notebook that lets you compare any two out of D3’s 96 (!) geographical projections of the world. # 10th March 2019, 10:58 pm

datasette-jellyfish. I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings—soundex, metaphone, porter stemming, levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function. # 9th March 2019, 6:29 pm

Publish the data behind your stories with SQLite and Datasette. I presented a workshop on Datasette at the IRE and NICAR CAR 2019 data journalism conference yesterday. Here’s the worksheet I prepared for the tutorial. # 9th March 2019, 6:27 pm

MySQL: How to get the top N rows for each group. MySQL doesn’t support the row_number() window function that’s available in PostgreSQL (and recent SQLite), which means it can’t easily answer questions like “for each of these authors, give me the most recent three blog entries they have written” in a single query. Only it turns out it can, if you abuse MySQL session variables in a devious way. This isn’t a new feature: MySQL has had this for over a decade, and in my rough testing it works quickly even on tables with millions of rows. # 4th March 2019, 11:38 pm

List of Physical Visualizations (via) “A chronological list of physical visualizations and related artifacts, maintained by Pierre Dragicevic and Yvonne Jansen”—327 and counting! # 4th March 2019, 2:45 am

import-pypi. A devious Python 3 hack which abuses importlib.machinery to add a hook such that any time you type “import modulename” it checks to see if the module is installed and runs “pip install modulename” first if it isn’t. Intended as a joke, but if you habitually fire up temporary virtual environments for exploratory programming like I do this could actually be a neat little time-saver. # 29th March 2018, 10:16 pm

The original Reddit source code, written in Lisp in 2005 (via) “If anyone’s interested, I found a hard drive in my garage with the original Reddit Lisp code from 2005. Been looking for it for years. Enjoy.”—spez # 29th March 2018, 10:13 pm

Use The Index, Luke! Paging Through Results (via) The best explanation of keyset pagination I’ve seen. Keyset pagination is where instead of using OFFSET/LIMIT to return the next page of results you instead track the last seen value in the column you sort by and then return the next X results that follow it. This allows you to paginate to arbitrarily deep offsets within a table, whereas OFFSET/LIMIT requires first iterating across all preceding rows and tends to stop working well after the first few thousand results. # 29th March 2018, 5:30 pm

Vega-Lite. A “high-level grammar of interactive graphics”. Part of the Vega project, which provides a mechanism for creating declarative visualizations by defining them using JSON. Vega-Lite is particularly interesting to me because it makes extremely tasteful decisions about how data should be visualized—give it some records, tell it which properties to plot on an axis and it will default to a display that makes sense for that data. The more I play with this the more impressed I am at the quality of its default settings. # 28th March 2018, 5:22 pm

Baltimore Sun Public Salary Records (via) The Baltimore Sun have published an interactive search engine for public salaries of Maryland state employees, and it’s powered by Datasette! Since data journalism is one of my key use-cases for Datasette I’m incredibly excited to see this in the wild. They’ve also published the underlying source code (see the via link) which is a really nice example of how to use Datasette’s custom templates and canned query functionality. # 28th March 2018, 5:12 pm

Charles Proxy now available on iOS (via) I didn’t think this was possible, but the Charles debugging proxy is now available for iOS. It works by setting itself up as a VPN such that all app traffic runs through it. You can also optionally turn on SSL decryption for specific hosts by installing a special certificate (which involves jumping through several hoops). It won’t work for apps that implement SSL certificate pinning but from playing with it for a few minutes it looks like most apps haven’t done that, even apps from Google. Well worth $8.99. # 28th March 2018, 3:57 pm

Cloud-first: Rapid webapp deployment using containers (via) The Research Software Engineering group at ICL have written a tutorial on deploying web apps as Docker containers using Azure and they use Datasette as the example application. # 28th March 2018, 3:50 pm

Touring a Fast, Safe, and Complete(ish) Web Service in Rust. Brandur’s notes from building a high performance web service in Rust, using PostgreSQL via the Diesel ORM and the Rust actix-web framework which peovides Erlang-style actors and promise-based async concurrency. # 28th March 2018, 3:47 pm

Describing events in code (via) Phil Gyford built an online directory of every play, movie, gig and exhibition he has been to in the past 38 years using a combination of digital archaeology and saved ticket stubs. He built it using Django and published this piece extensively describing the process he went through to design the data model. # 28th March 2018, 3:41 pm