Simon Willison’s Weblog

Items in Aug


In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano.  It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”. This is insane.  Deploy software is the most important software you have.  Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

Charity Majors # 27th August 2018, 9 pm

Serverless for data scientists (via) Slides and accompanying notes from a talk by Mike Lee Williams at PyBay, providing an overview of Zappa and diving a bit more deeply into pywren, which makes it trivial to parallelize a function across a set of AWS Lambda instances (essentially serverless execution of Python's map()). I really like this format for sharing presentations; I used something similar for my own PyBay talk. # 25th August 2018, 11:01 pm

Computational and Inferential Thinking: The Foundations of Data Science. Free online textbook written for the UC Berkeley Foundations of Data Science class. The examples are all provided as Jupyter notebooks, using the mybinder web application to allow students to launch interactive notebooks for any of the examples without having to install any software on their own machines. # 25th August 2018, 10:13 pm

The Future of Notebooks: Lessons from JupyterCon (via) It sounds like reactive notebooks (where cells keep track of their dependencies on other cells and re-evaluate when those update) were a hot topic at JupyterCon this year. # 25th August 2018, 9:55 pm
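
A minimal sketch of the reactive idea described above, assuming a toy model in which each cell declares the names of the cells it reads. The `ReactiveNotebook` class and its methods are hypothetical and are not how Jupyter or any real reactive notebook is implemented; the sketch also ignores cycles, which would recurse forever.

```python
class ReactiveNotebook:
    """Toy model of reactive cells: each cell names the cells it depends on."""

    def __init__(self):
        self._cells = {}   # name -> (function, dependency names)
        self._values = {}  # name -> last computed value

    def define(self, name, func, deps=()):
        """Create or redefine a cell, then recompute it and everything downstream."""
        self._cells[name] = (func, tuple(deps))
        self._recompute(name)

    def value(self, name):
        return self._values[name]

    def _recompute(self, name):
        func, deps = self._cells[name]
        if any(dep not in self._values for dep in deps):
            return  # a dependency has not produced a value yet
        self._values[name] = func(*(self._values[dep] for dep in deps))
        # Re-evaluate every cell that lists this one as a dependency.
        for other, (_, other_deps) in self._cells.items():
            if name in other_deps:
                self._recompute(other)


nb = ReactiveNotebook()
nb.define("price", lambda: 10)
nb.define("quantity", lambda: 3)
nb.define("total", lambda price, quantity: price * quantity,
          deps=("price", "quantity"))
nb.define("price", lambda: 12)  # redefining "price" re-evaluates "total"
print(nb.value("total"))  # 36
```

Because "total" is recomputed whenever one of its inputs changes, the order in which cells were originally run stops mattering.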

Most administrators will force users to change their password at regular intervals, typically every 30, 60 or 90 days. This imposes burdens on the user (who is likely to choose new passwords that are only minor variations of the old) and carries no real benefits as stolen passwords are generally exploited immediately. [...] Regular password changing harms rather than improves security, so avoid placing this burden on users. However, users must change their passwords on indication or suspicion of compromise.

UK National Cyber Security Centre # 25th August 2018, 7:57 pm

The subset of reStructuredText worth committing to memory

reStructuredText is the standard for documentation in the Python world.

[... 1183 words]

In case you missed it: @GoogleColab can open any @ProjectJupyter notebook directly from @github! To run the notebook, just replace “github.com” with “colab.research.google.com/github/” in the notebook URL, and it will be loaded into Colab.

Jake VanderPlas # 25th August 2018, 3:16 am
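
The URL rewrite described in that tweet can be sketched in a few lines of Python. The `colab_url` helper and the example repository path are hypothetical; the sketch only assumes the Colab address pattern quoted above, and it joins the pieces so that the result does not contain a doubled slash.

```python
from urllib.parse import urlsplit


def colab_url(github_url: str) -> str:
    """Turn a github.com notebook URL into the matching Colab URL."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    # parts.path already starts with "/", so no extra slash is needed.
    return f"https://colab.research.google.com/github{parts.path}"


# Hypothetical notebook location, used only for illustration.
example = "https://github.com/example-user/example-repo/blob/master/demo.ipynb"
print(colab_url(example))
```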

Honeycomb changelog (via) Too few hosted services have detailed user-facing changelogs. This one from Honeycomb (a metrics, tracing and observability platform) is a particularly great example. I especially like the use of animated screenshots, something I’ve been evangelizing pretty heavily recently for internal communication at work. # 25th August 2018, 3:12 am

I don’t like Jupyter Notebooks—a presentation by Joel Grus (via) Fascinating talk by Joel Grus at the Jupyter conference in New York. He highlights some of the drawbacks of the Jupyter way of working, including the huge confusion that can come from the ability to execute cells out of order (something Observable notebooks solve brilliantly using spreadsheet-style reactive cell associations). He also makes strong arguments that notebooks encourage a way of working that discourages people from producing stable, repeatable and well tested code. # 25th August 2018, 3:04 am

jq recipes. Remy Sharp’s handy collection of jq recipes, each one linking to an interactive demo on jqterm.com. I thought jq was just for extracting values from a JSON document—I hadn’t realized how powerful it was for modifying and extending those documents as well. # 22nd August 2018, 3:23 pm

6 Great Uses of the Spread Operator. As I’ve been getting more comfortable with 2018-era JavaScript, the spread operator and object destructuring are two of the features I have found most interesting. # 22nd August 2018, 3:17 pm

Slides, notes and links from my Datasette talk at PyBay (via) I presented a session about Datasette at the PyBay conference in San Francisco this morning. I talked about the project itself and demonstrated ways of creating and publishing databases using csvs-to-sqlite, Datasette Publish and my new sqlite-utils library. # 19th August 2018, 11:23 pm

How to Instantly Publish Data to the Internet with Datasette

I presented a session about Datasette at the PyBay 2018 conference in San Francisco. I talked about the project itself and demonstrated ways of creating and publishing databases using csvs-to-sqlite, Datasette Publish and my new sqlite-utils library.

[... 2043 words]

How about if, instead of ditching Twitter for Mastodon, we all start blogging and subscribing to each other’s Atom feeds again instead? The original distributed social network could still work pretty well if we actually start using it

@simonw # 18th August 2018, 8:59 pm

Observable Tutorial 2: Dog pictures (via) Observable have a neat new set of tutorials on how to get started with their reactive notebooks. You don’t even need to sign up for the service: they have a “Scratchpad” link in their navigation bar now which lets you spin up a test notebook with one click. # 18th August 2018, 7:55 pm

Redux vs. The React Context API. Nice explanation of the new Context API in React 16.3, which provides an easy way for passing props down through a tree of components without needing to explicitly pass the prop at every level of the tree. The comparison with Redux doubles as a useful explanation of the value that Redux provides. # 18th August 2018, 6:51 pm

Beyond Interactive: Notebook Innovation at Netflix. Netflix have been investing heavily in their internal Jupyter notebooks infrastructure: it’s now the most popular tool for working with data at Netflix. They also use parameterized notebooks to make it easy to create templates for reusable operations, and scheduled notebooks for recurring tasks. “When a Spark or Presto job executes from the scheduler, the source code is injected into a newly-created notebook and executed. That notebook then becomes an immutable historical record, containing all related artifacts — including source code, parameters, runtime config, execution logs, error messages, and so on.” # 18th August 2018, 5:55 pm

Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms.

Netflix Technology Blog # 18th August 2018, 5:35 pm

Text to Image (via) Ridiculously entertaining demo by Cris Valenzuela that feeds any text you type to a neural network that then attempts to generate an image for your text. # 18th August 2018, 5:33 pm

Compiling SQLite for use with Python Applications (via) Charles Leifer’s recent tutorial on how to compile and build the latest SQLite (with window function support) for use from Python via his pysqlite3 library. # 15th August 2018, 3:51 pm

coleifer/pysqlite3. Now that the pysqlite package is bundled as part of the Python standard library, the original open source project is no longer actively maintained and has not been upgraded for Python 3. Charles Leifer has been working on pysqlite3, a stand-alone package of the module. Crucially, this should enable compiling the latest version of SQLite (via the amalgamation package) without needing to upgrade the version that ships with the operating system. # 15th August 2018, 3:15 pm

Window Functions in SQLite 3.25.0. The next release of SQLite (apparently due for release in September) will add window functions, as specified in various SQL standards and already available in PostgreSQL. This is going to dramatically improve SQLite as an engine for performing analytical queries, especially across time series data. It’s also going to further emphasize the need for people to be able to upgrade their SQLite versions beyond those provided by the operating system—the default Ubuntu run by Travis CI still only ships with SQLite 3.8 for example. # 15th August 2018, 3:12 pm
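
As a rough illustration of why this matters for time series data, here is a small sketch that computes a running total with a window function from Python's standard `sqlite3` module. The `readings` table, the `running_totals` helper and the sample rows are made up for this example, and the query only works when the SQLite library that Python is linked against is 3.25.0 or newer, so the code checks the version first.

```python
import sqlite3


def running_totals(rows):
    """Return (day, value, running_total) tuples using a window function."""
    if sqlite3.sqlite_version_info < (3, 25, 0):
        raise RuntimeError("window functions need SQLite 3.25.0 or newer")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (day TEXT, value INTEGER)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT day, value, SUM(value) OVER (ORDER BY day) AS running_total "
        "FROM readings ORDER BY day"
    ).fetchall()
    conn.close()
    return result


# Made-up sample data.
sample = [("2018-08-01", 5), ("2018-08-02", 7), ("2018-08-03", 3)]

print("Linked SQLite version:", sqlite3.sqlite_version)
if sqlite3.sqlite_version_info >= (3, 25, 0):
    for row in running_totals(sample):
        print(row)
else:
    print("This SQLite build is too old for window functions")
```

Printing `sqlite3.sqlite_version` is also a quick way to find out which SQLite release a given Python installation is actually using.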

Experiences with running PostgreSQL on Kubernetes (via) Fascinating interview that makes a solid argument that running stateful data stores like PostgreSQL or Cassandra becomes harder, not easier, when you add an orchestration tool like Kubernetes into the mix. # 13th August 2018, 2:30 pm

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Hyrum's Law # 11th August 2018, 12:33 am

Using achievement stats to estimate sales on steam (via) Really interesting data leak exploit here: Valve’s Steam API was showing the percentage of users that gained a specific achievement up to 16 decimal places—which inadvertently leaked their exact usage statistics, since if 0.012782207690179348 percent of players get an achievement the only possible input is 8 players out of 62,587. # 9th August 2018, 9:03 am
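
The arithmetic behind that leak can be sketched with Python's `fractions` module. This is one way to recover the counts, not necessarily the method used in the linked article: `recover_counts` is a hypothetical helper, and the 100,000 cap on the player count is an assumption chosen for this example. `limit_denominator` returns the closest fraction whose denominator stays under the cap, which also absorbs small floating-point rounding in the published figure.

```python
from fractions import Fraction


def recover_counts(percentage: float, max_players: int = 100_000) -> Fraction:
    """Find the simplest players/total fraction matching a leaked percentage."""
    exact = Fraction(percentage) / 100  # exact rational value of the float
    return exact.limit_denominator(max_players)


leak = recover_counts(0.012782207690179348)
print(leak.numerator, "of", leak.denominator)  # 8 of 62587
```

The result is always in lowest terms, so 16 players out of 125,174 would produce the same percentage; the recovered pair is the smallest one consistent with the published number.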

Easy explainer: a “blockchain” is a linked list with an append-only restriction, and appending is made incredibly expensive but super parallelizable, so when things work well a big group of people can work together and it’s too expensive for a small evil group to compete. [...] Does your problem benefit from storing information in an append-only list, and relying on a central authority to manage it is so bad that it’s worth paying the enormous append costs to have a bunch of Chinese servers manage it for you? Then *maybe* look at a blockchain.

Tab Atkins # 9th August 2018, 1:27 am
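
To make the "linked list with an append-only restriction" description concrete, here is a toy sketch, not a real blockchain. Each entry stores the hash of the previous one, and appending requires a brute-force search for a nonce. The function names, the block layout and the three-zero difficulty prefix are all assumptions made for this illustration.

```python
import hashlib

DIFFICULTY_PREFIX = "000"  # assumed difficulty: hash must start with three zero hex digits
GENESIS_HASH = "0" * 64    # placeholder "previous hash" for the first block


def block_hash(prev_hash: str, data: str, nonce: int) -> str:
    return hashlib.sha256(f"{prev_hash}|{data}|{nonce}".encode()).hexdigest()


def append(chain: list, data: str) -> None:
    """Append a block; the nonce search is the deliberately expensive part."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    nonce = 0
    while not block_hash(prev_hash, data, nonce).startswith(DIFFICULTY_PREFIX):
        nonce += 1
    chain.append({"prev": prev_hash, "data": data, "nonce": nonce,
                  "hash": block_hash(prev_hash, data, nonce)})


def is_valid(chain: list) -> bool:
    """Check every link and every proof of work; this part is cheap."""
    prev_hash = GENESIS_HASH
    for block in chain:
        expected = block_hash(block["prev"], block["data"], block["nonce"])
        if (block["prev"] != prev_hash or block["hash"] != expected
                or not expected.startswith(DIFFICULTY_PREFIX)):
            return False
        prev_hash = block["hash"]
    return True


chain = []
for entry in ["first", "second", "third"]:
    append(chain, entry)
print(is_valid(chain))       # True
chain[1]["data"] = "edited"  # rewriting history breaks the stored hash
print(is_valid(chain))       # False
```

Editing an old entry means redoing the nonce search for that block and every block after it, which is where the append cost turns into protection against a small group rewriting the list.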

Securing Web Sites Made Them Less Accessible (via) This is fascinating: the move to HTTPS everywhere breaks local HTTP caching servers (like Squid), which are still used in remote areas that get their internet over a high-latency satellite connection. # 7th August 2018, 5:52 pm

Faust: Python Stream Processing (via) A new open source stream processing system released by Robinhood, created by Vineet Goel and Celery creator Ask Solem. The API looks delightful, making very smart use of Python decorators and async/await. The initial release requires Kafka but they plan to support multiple backends, hopefully including Redis Streams. # 6th August 2018, 10:51 pm

How to Read an RFC. An extremely useful guide to reading RFCs by Mark Nottingham. I didn’t know most of the stuff in here. # 6th August 2018, 10:38 pm

OWASP Top 10 2007-2017: The Fall of CSRF. I was surprised to learn recently that CSRF didn’t make it into the 2017 OWASP Top 10 security vulnerabilities (after featuring almost every year since the list started). The credited reason is that web frameworks do a good enough job protecting against CSRF by default that it’s no longer a top-ten problem. Defaults really do matter. # 6th August 2018, 10:02 pm