Simon Willison’s Weblog

Blogmarks

Using 6 Page and 2 Page Documents To Make Organizational Decisions (via) I’ve been thinking a lot recently about the challenges of efficiently getting to consensus within a larger organization spread across multiple locations and time zones. This model described by Ian Nowland based on his experience at AWS seems very promising. The goal is to achieve a decision or “disagree and commit” consensus using a max 6 page document and a one hour meeting. The first fifteen minutes of the meeting are dedicated to silently reading the document—if you’ve read it already you are given the option of arriving fifteen minutes late.

# 11th April 2019, 3:46 am / aws, process, management

Ministry of Silly Runtimes: Vintage Python on Cloud Run (via) Cloud Run is an exciting new hosting service from Google that lets you define a container using a Dockerfile and then run that container in a “scale to zero” environment, so you only pay for time spent serving traffic. It’s similar to the now-deprecated Zeit Now 1.0 which inspired me to create Datasette. Here Dustin Ingram demonstrates how powerful Docker can be as the underlying abstraction by deploying a web app using a 25 year old version of Python 1.x.

# 9th April 2019, 5:33 pm / cloud, python, zeit-now, docker, datasette, cloudrun, dustin-ingram

Generator Tricks for Systems Programmers (via) David Beazley’s definitive generators tutorial from 2008, updated for Python 3.7 in October 2018.
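As a rough illustration of the pipeline style the tutorial is known for (this is not code taken from it), here is a minimal sketch that chains generators to filter log lines and sum a byte-count field. The function names and the sample log format are assumptions made up for this example.

```python
def grep(pattern, lines):
    """Yield only the lines that contain pattern."""
    for line in lines:
        if pattern in line:
            yield line


def last_field_as_int(lines):
    """Yield the final whitespace-separated field of each line as an int.

    Lines whose final field is "-" (no byte count) are skipped.
    """
    for line in lines:
        field = line.rsplit(None, 1)[-1]
        if field != "-":
            yield int(field)


def total_bytes(lines, pattern):
    # Each stage pulls one item at a time from the previous stage,
    # so the full input never has to be held in memory.
    return sum(last_field_as_int(grep(pattern, lines)))


# Hypothetical log lines: method, path, status, bytes sent.
sample_log = [
    "GET /index.html 200 1024",
    "GET /missing 404 -",
    "GET /about.html 200 2048",
    "POST /form 200 512",
]

get_total = total_bytes(sample_log, "GET")
```

Because every stage is lazy, the same functions work unchanged if sample_log is replaced with an open file object.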

# 9th April 2019, 5:13 pm / david-beazley, generators, python

What is a Self-XSS scam? Facebook links to this page from a console.log message that they display in the browser devtools console, specifically warning that “If someone told you to copy-paste something here to enable a Facebook feature or hack someone’s account, it is a scam and will give them access to your Facebook account.”

# 8th April 2019, 6:01 pm / facebook, security, xss

Colm MacCárthaigh tells the inside story of how AWS responded to Heartbleed. The Heartbleed SSL vulnerability came out five years ago. In this Twitter thread Colm, who was Amazon’s principal engineer for Elastic Load Balancer at the time, describes how the AWS team responded to something that “was scarier than any bug I’d ever seen”. It’s a cracking story.

# 7th April 2019, 8:32 pm / aws, security

tsv-utils (via) Powerful collection of CLI tools for processing TSV files, written in D for performance and released by eBay. Includes a csv2tsv conversion tool. You can download an archive of pre-built binaries for Linux and OS X from their releases page: worked fine on my Mac.

# 7th April 2019, 8:29 pm / cli, csv

csv-diff 0.3.1 (via) I released a minor update to my csv-diff CLI tool today which does a better job of displaying a human-readable representation of rows that have been added or removed from a file—previously they were represented as an ugly JSON dump. My script monitoring changes to the official list of trees in San Francisco has been running for a month now and has captured 23 commits!
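To show the general idea of diffing two CSV files by a key column, here is a minimal sketch using only the standard library. It is not csv-diff's actual implementation: the helper names, the "id" key column and the sample tree data are all assumptions for illustration.

```python
import csv
import io


def load_rows(csv_text, key):
    """Parse CSV text into a dict mapping each row's key column to the row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_rows(previous, current):
    """Compare two keyed row dicts and report added, removed and changed keys."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        k for k in previous.keys() & current.keys() if previous[k] != current[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical before/after snapshots of a tree list.
old = load_rows("id,species\n1,Oak\n2,Elm\n3,Pine\n", key="id")
new = load_rows("id,species\n1,Oak\n3,Cedar\n4,Maple\n", key="id")
result = diff_rows(old, new)
```

Keying on a stable identifier is what lets a row that moved position in the file be reported as unchanged rather than as a removal plus an addition.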

# 7th April 2019, 8:03 pm / cli, csv, diff, projects

The problem with laziness: minimising performance issues caused by Django’s implicit database queries (via) The ability to accidentally execute further database queries by traversing objects from a Django template is a common source of unexpected performance regressions. django-zen-queries is a neat new library which provides a context manager for disabling database queries during a render (or elsewhere), forcing queries to be explicitly executed in view functions.

# 3rd April 2019, 3:49 pm / django

zson (via) “ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents [...] In some cases ZSON can save half of your disk space and give you about 10% more TPS.”

# 2nd April 2019, 9:26 pm / json, postgresql

The Next CEO of Stack Overflow. “Including the Stack Exchange network of 174 sites, we have over 100 million monthly visitors. Every month, over 125,000 wonderful people write answers”—this fits the rule of thumb for user-generated content that only a tiny portion of your audience will actively create content: in this case it’s just 0.125% (one eighth of one percent). I’d love to know how many people are upvoting or performing other more lightweight interactions.

# 28th March 2019, 3:12 pm / social-software, stackoverflow

Programmer migration patterns. Avery Pennarun explores the history of modern programming languages and how developers have migrated from one to another over time. Lots of fun insights in this.

# 28th March 2019, 4:59 am / programming-languages

VisiData (via) Intriguing tool by Saul Pwanson: VisiData is a command-line "textpunk utility" for browsing and manipulating tabular data. Run pip3 install visidata and then vd myfile.csv (or .json or .xls or SQLite or others) to get an interactive terminal UI for quickly searching through the data, conducting frequency analysis of columns, manipulating it and much more besides. Two tips if you start playing with it: hit gq to exit, and hit Ctrl+H to view the help screen.

# 18th March 2019, 3:45 am / csv, data-journalism, python, sqlite

The Cloud and Open Source Powder Keg (via) Stephen O’Grady’s analysis of the Elastic vs. AWS situation, where Elastic started mixing their open source and non-open source code together and Amazon responded by releasing their own forked “open distribution for Elasticsearch”. World War One analogies included!

# 17th March 2019, 7:08 pm / aws, elasticsearch, open-source

What the Hell is Going On? (via) David Perell discusses how the shift from information scarcity to information abundance is reshaping commerce, education, and politics. Long but worthwhile.

# 17th March 2019, 4:50 pm / education, internet, politics

Client-side instrumentation for under $1 per month. No servers necessary. (via) Rolling your own analytics used to be too complex and expensive to be worth the effort. Thanks to cloud technologies like Cloudfront, Athena, S3 and Lambda you can now inexpensively implement client-side analytics (via requests to a tracking pixel) that stores detailed logs on S3, then use Amazon Athena to run queries against those logs ($5/TB scanned) to get detailed reporting. This post also introduced me to Snowplow, an open source JavaScript analytics script (released by a commercial analytics platform) which looks very neat—it’s based on piwik.js, the tracker from the open-source Piwik analytics tool.

# 15th March 2019, 4:03 pm / analytics, athena, cloudfront, lambda, s3

D3 Projection Comparison (via) Fun Observable notebook that lets you compare any two out of D3’s 96 (!) geographical projections of the world.

# 10th March 2019, 10:58 pm / geo, d3, observable, mike-bostock

datasette-jellyfish. I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings—soundex, metaphone, porter stemming, levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function.
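Here is a minimal sketch of the underlying mechanism: registering a Python function so it can be called from SQL. It uses the standard library sqlite3 module and a pure-Python Levenshtein implementation standing in for Jellyfish. The SQL function name levenshtein_distance is an assumption for this example, and this is not the plugin's own code.

```python
import sqlite3


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute
                )
            )
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL as a two-argument function.
conn.create_function("levenshtein_distance", 2, levenshtein)

distance = conn.execute(
    "select levenshtein_distance(?, ?)", ("kitten", "sitting")
).fetchone()[0]
```

Once registered, the function can be used anywhere in a query on that connection, including in where and order by clauses.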

# 9th March 2019, 6:29 pm / strings, datasette

Publish the data behind your stories with SQLite and Datasette. I presented a workshop on Datasette at the IRE and NICAR CAR 2019 data journalism conference yesterday. Here’s the worksheet I prepared for the tutorial.

# 9th March 2019, 6:27 pm / data-journalism, my-talks, datasette, nicar

MySQL: How to get the top N rows for each group. MySQL versions before 8.0 don’t support the row_number() window function that’s available in PostgreSQL (and recent SQLite), which means they can’t easily answer questions like “for each of these authors, give me the most recent three blog entries they have written” in a single query. Only it turns out they can, if you abuse MySQL session variables in a devious way. This isn’t a new feature: MySQL has had this for over a decade, and in my rough testing it works quickly even on tables with millions of rows.
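For comparison, here is a minimal sketch of the window-function approach to the same "top N per group" question, run against SQLite through Python's sqlite3 module. It assumes a SQLite build of 3.25 or later (when window functions were added), and the entries table and its columns are hypothetical. It does not show the session variable trick itself, which needs a MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table entries (author text, title text, created text)")
conn.executemany(
    "insert into entries values (?, ?, ?)",
    [
        ("ada", "a1", "2019-01-01"),
        ("ada", "a2", "2019-01-02"),
        ("ada", "a3", "2019-01-03"),
        ("ada", "a4", "2019-01-04"),
        ("bob", "b1", "2019-02-01"),
        ("bob", "b2", "2019-02-02"),
    ],
)

# Number each author's rows from newest to oldest, then keep the first N.
TOP_N_SQL = """
select author, title from (
    select author, title,
           row_number() over (
               partition by author order by created desc
           ) as rank_in_group
    from entries
)
where rank_in_group <= ?
order by author, rank_in_group
"""

top_three = conn.execute(TOP_N_SQL, (3,)).fetchall()
```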

# 4th March 2019, 11:38 pm / mysql

List of Physical Visualizations (via) “A chronological list of physical visualizations and related artifacts, maintained by Pierre Dragicevic and Yvonne Jansen”—327 and counting!

# 4th March 2019, 2:45 am / visualization

Experiments, growth engineering, and exposing company secrets through your API (via) This is fun: Jon Luca observes that many companies that run A/B tests have private JSON APIs that list all of their ongoing experiments, and uses them to explore tests from Lyft, Airbnb, Pinterest, Amazon and more. Facebook and Instagram use certificate pinning, which makes it harder to spy on their mobile app traffic.

# 26th February 2019, 4:49 am / ab-testing, security

huey. Charles Leifer’s “little task queue for Python”. Similar to Celery, but it’s designed to work with Redis, SQLite or in the parent process using background greenlets. Worth checking out for the really neat design. The project is new to me, but it’s been under active development since 2011 and has a very healthy looking rate of releases.

# 25th February 2019, 7:49 pm / python, queues, redis, sqlite, charles-leifer

My Twitter thread collecting behind the scenes content about Spider-Man: Into the Spider-Verse. I absolutely loved Spider-Verse, and I’ve been delighted to discover that many of the artists who created the movie are active on Twitter and have been posting all kinds of fascinating material about their creative process. I’ve been collecting examples in this Twitter thread for a couple of months now. They definitely deserved that Oscar.

# 25th February 2019, 2:57 pm / twitter, movies, spiderverse

Seeking the Productive Life: Some Details of My Personal Infrastructure (via) Stephen Wolfram’s 15,000 word epic about his personal approach to productivity, developed over the past thirty years. This is a fascinating document—I found myself thinking “surely there can’t be more information than this” and then spotting that the scrollbar wasn’t even a third done yet. Very hard to summarize: it turns out if you’re the work-from-home CEO of your own privately held 800 person company you can construct some very opinionated habits.

# 22nd February 2019, 9:46 pm / productivity

String length—Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here's a fascinating illustration of how that problem can be attacked in dozens of different programming languages. From that page: the string "J̲o̲s̲é̲" ("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}") has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
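Those three numbers can be reproduced in Python with a minimal sketch like the one below. The grapheme count is an approximation that assumes every grapheme here is one base character followed by combining marks. That holds for this string but is not a general grapheme segmentation algorithm.

```python
import unicodedata

# The same string as above, written with explicit escapes:
# U+0332 is COMBINING LOW LINE, U+0301 is COMBINING ACUTE ACCENT.
text = "J\u0332o\u0332s\u0332e\u0301\u0332"

# len() on a Python 3 str counts code points.
code_points = len(text)

# Encoding to UTF-8 shows the size in bytes.
utf8_bytes = len(text.encode("utf-8"))

# Approximate grapheme count: count only the non-combining code points.
graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)
```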

# 22nd February 2019, 3:27 pm / programming-languages, strings, unicode

Lessons from 6 software rewrite stories (via) Herb Caudill takes on the classic idea that rewriting from scratch is “the single worst strategic mistake that any software company can make” and investigates it through the lens of six well-chosen examples: Netscape 6, Basecamp Classic/2/3, Visual Studio/VS Code, Gmail/Inbox, FogBugz/Wasabi/Trello, and finally FreshBooks/BillSpring. Each story has details I had never heard before, and the lessons and conclusions are deeply insightful.

# 19th February 2019, 9:55 pm / rewrites, product-management

parameterized. I love the @pytest.mark.parametrize decorator in pytest, which lets you run the same test multiple times against multiple parameters. The only catch is that the decorator in pytest doesn’t work for old-style unittest TestCase tests, which means you can’t easily add it to test suites that were built using the older model. I just found out about parameterized which works with unittest tests whether or not you are running them using the pytest test runner.

# 19th February 2019, 9:05 pm / python, testing, pytest

The Eleven Laws of Showrunning (via) Fascinating essay on how to run a modern TV show by Javier Grillo-Marxuach. Being a showrunner basically involves running a 100+ person startup with a 7 digit budget, almost immovable deadlines and high-maintenance activist investors, and you're still expected to write some of the scripts!

So many useful lessons here about management, creativity and delegation: almost everything in here is relevant to product management, startup founding and engineering management as well.

Are you strong and secure enough in your talent and accomplishment to accept the possibility that other people - properly empowered by you - can actually enhance your genius... or will you cling to the idea that only you can be the source of that genius?

How you answer that question determines the leader you will be.

This one is the "nice" version - a not so nice version is available as well.

# 19th February 2019, 7:27 pm / management, showrunning

Discussion about Altavista on Hacker News. Fascinating thread on Hacker News where Bryant Durrell, a former Director from Altavista, provides some insider thoughts on how they lost against Google.

# 16th February 2019, 6:57 pm / computer-history, google, internet-history, search, search-engines

Data science is different now (via) Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work.

# 15th February 2019, 3:36 pm / data-science
