<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: data-science</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/data-science.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2021-12-24T23:41:55+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting danah boyd</title><link href="https://simonwillison.net/2021/Dec/24/danah-boyd/#atom-tag" rel="alternate"/><published>2021-12-24T23:41:55+00:00</published><updated>2021-12-24T23:41:55+00:00</updated><id>https://simonwillison.net/2021/Dec/24/danah-boyd/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://zephoria.substack.com/p/statistical-imaginaries"&gt;&lt;p&gt;Many of you here today are toolbuilders who help people work with data. Rather than presuming that those using your tools are clear-eyed about their data, how can you build features and methods that ensure people know the limits of their data and work with them responsibly? Your tools are not neutral. Neither is the data that your tools help analyze. How can you build tools that invite responsible data use and make visible when data is being manipulated? How can you help build tools for responsible governance?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://zephoria.substack.com/p/statistical-imaginaries"&gt;danah boyd&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="data-science"/></entry><entry><title>Cookiecutter Data Science</title><link href="https://simonwillison.net/2021/Nov/18/cookiecutter-data-science/#atom-tag" rel="alternate"/><published>2021-11-18T15:21:59+00:00</published><updated>2021-11-18T15:21:59+00:00</updated><id>https://simonwillison.net/2021/Nov/18/cookiecutter-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://drivendata.github.io/cookiecutter-data-science/"&gt;Cookiecutter Data Science&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Some really solid thinking in this documentation for the DrivenData cookiecutter template. They emphasize designing data science projects for repeatability, such that just the src/ and data/ folders can be used to recreate all of the other analysis from scratch. I like the suggestion to give each project a dedicated S3 bucket for keeping immutable copies of the original raw data that might be too large for GitHub.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/dynamicwebpaige/status/1461272760542396421"&gt;Paige Bailey&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cookiecutter"&gt;cookiecutter&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-science"/><category term="cookiecutter"/></entry><entry><title>Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool</title><link href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/#atom-tag" rel="alternate"/><published>2021-08-06T06:05:15+00:00</published><updated>2021-08-06T06:05:15+00:00</updated><id>https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/#atom-tag</id><summary type="html">
    &lt;p&gt;Earlier this week I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-14"&gt;sqlite-utils 3.14&lt;/a&gt; with a powerful new command-line tool: &lt;code&gt;sqlite-utils convert&lt;/code&gt;, which applies a conversion function to data stored in a SQLite column.&lt;/p&gt;
&lt;p&gt;Anyone who works with data will tell you that 90% of the work is cleaning it up. Running command-line conversions against data in a SQLite file turns out to be a really productive way to do that.&lt;/p&gt;
&lt;h4&gt;Transforming a column&lt;/h4&gt;
&lt;p&gt;Here's a simple example. Say someone gave you data with numbers that are formatted with commas - like &lt;code&gt;3,044,502&lt;/code&gt; - in a &lt;code&gt;count&lt;/code&gt; column in a &lt;code&gt;states&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;You can strip those commas out like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert states.db states count \
    'value.replace(",", "")'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;convert&lt;/code&gt; command takes four arguments: the database file, the name of the table, the name of the column and a string containing a fragment of Python code that defines the conversion to be applied.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animated demo using sqlite-utils convert to strip out commas" src="https://static.simonwillison.net/static/2021/sqlite-convert-demo.gif" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The conversion function can be anything you can express with Python. If you want to import extra modules you can do so using &lt;code&gt;--import module&lt;/code&gt; - here's an example that wraps text using the &lt;a href="https://docs.python.org/3/library/textwrap.html"&gt;textwrap&lt;/a&gt; module from the Python standard library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert content.db articles content \
    '"\n".join(textwrap.wrap(value, 100))' \
    --import=textwrap
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can consider this analogous to using &lt;code&gt;Array.map()&lt;/code&gt; in JavaScript, or running a transformation using a list comprehension in Python.&lt;/p&gt;
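&lt;p&gt;To make that analogy concrete, here's a plain-Python sketch (illustrative only, not sqlite-utils internals) showing the comma-stripping conversion from above applied with &lt;code&gt;map()&lt;/code&gt; and with a list comprehension:&lt;/p&gt;

```python
# The same comma-stripping conversion, applied to a plain Python list
# instead of a SQLite column. Sample values are illustrative.
values = ["3,044,502", "39,538,223", "29,145,505"]

# Using map(), as you might in JavaScript with Array.map():
stripped = list(map(lambda value: value.replace(",", ""), values))

# The equivalent list comprehension:
stripped_lc = [value.replace(",", "") for value in values]

print(stripped)  # ['3044502', '39538223', '29145505']
```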
&lt;h4&gt;Custom functions in SQLite&lt;/h4&gt;
&lt;p&gt;Under the hood, the tool takes advantage of a powerful SQLite feature: the ability to &lt;a href="https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function"&gt;register custom functions&lt;/a&gt; written in Python (or other languages) and call them from SQL.&lt;/p&gt;
&lt;p&gt;The text wrapping example above works by executing the following SQL:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;update&lt;/span&gt; articles &lt;span class="pl-k"&gt;set&lt;/span&gt; content &lt;span class="pl-k"&gt;=&lt;/span&gt; convert_value(content)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;convert_value(value)&lt;/code&gt; is a custom SQL function, compiled as Python code and then made available to the database connection.&lt;/p&gt;
&lt;p&gt;The equivalent code using just the Python standard library would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;textwrap&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;convert_value&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;"&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;textwrap&lt;/span&gt;.&lt;span class="pl-en"&gt;wrap&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-c1"&gt;100&lt;/span&gt;))

&lt;span class="pl-s1"&gt;conn&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite3&lt;/span&gt;.&lt;span class="pl-en"&gt;connect&lt;/span&gt;(&lt;span class="pl-s"&gt;"content.db"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;conn&lt;/span&gt;.&lt;span class="pl-en"&gt;create_function&lt;/span&gt;(&lt;span class="pl-s"&gt;"convert_value"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-s1"&gt;convert_value&lt;/span&gt;)
&lt;span class="pl-s1"&gt;conn&lt;/span&gt;.&lt;span class="pl-en"&gt;execute&lt;/span&gt;(&lt;span class="pl-s"&gt;"update articles set content = convert_value(content)"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils convert&lt;/code&gt; works by &lt;a href="https://github.com/simonw/sqlite-utils/blob/cc90745f4e8bb1ac57d8ee973863cfe00c2e4fe5/sqlite_utils/cli.py#L2019-L2028"&gt;compiling the code argument&lt;/a&gt; to a Python function, registering it with the connection and executing the above SQL query.&lt;/p&gt;
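&lt;p&gt;A minimal sketch of that pattern, using only the standard library - note this is a simplified illustration, not the actual sqlite-utils implementation, which also handles multi-line code bodies and extra imports:&lt;/p&gt;

```python
import sqlite3
import textwrap


def compile_fragment(code):
    """Turn a single-expression code fragment into a one-argument function.

    Simplified sketch: the real tool also supports multi-line bodies
    with a return statement, and arbitrary --import modules.
    """
    return eval("lambda value: " + code, {"textwrap": textwrap})


# The same fragment you would pass on the command line:
convert_value = compile_fragment('"\\n".join(textwrap.wrap(value, 100))')

conn = sqlite3.connect(":memory:")
conn.execute("create table articles (content text)")
conn.execute("insert into articles values (?)", ("word " * 50,))

# Register the compiled function and run the UPDATE, exactly as above.
conn.create_function("convert_value", 1, convert_value)
conn.execute("update articles set content = convert_value(content)")

wrapped = conn.execute("select content from articles").fetchone()[0]
```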
&lt;h4&gt;Splitting columns into multiple other columns&lt;/h4&gt;
&lt;p&gt;Sometimes when I'm working with a table I find myself wanting to split a column into multiple other columns.&lt;/p&gt;
&lt;p&gt;A classic example is locations - if a &lt;code&gt;location&lt;/code&gt; column contains &lt;code&gt;latitude,longitude&lt;/code&gt; values I'll often want to split that into separate &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt; columns, so I can visualize the data with &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--multi&lt;/code&gt; option lets you do that using &lt;code&gt;sqlite-utils convert&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert data.db places location '
latitude, longitude = value.split(",")
return {
    "latitude": float(latitude),
    "longitude": float(longitude),
}' --multi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;--multi&lt;/code&gt; tells the command to expect the Python code to return dictionaries. It will then create new columns in the database corresponding to the keys in those dictionaries and populate them using the results of the transformation.&lt;/p&gt;
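&lt;p&gt;The code fragment passed with &lt;code&gt;--multi&lt;/code&gt; behaves like the body of an ordinary Python function that returns a dictionary. Written out standalone (the sample coordinates are illustrative), it looks like this:&lt;/p&gt;

```python
def convert_location(value):
    # Same logic as the --multi code fragment above, as a named function.
    latitude, longitude = value.split(",")
    return {
        "latitude": float(latitude),
        "longitude": float(longitude),
    }


# Each returned key becomes a new column; each value populates that
# column for the row being converted.
result = convert_location("37.7749,-122.4194")
print(result)  # {'latitude': 37.7749, 'longitude': -122.4194}
```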
&lt;p&gt;If the &lt;code&gt;places&lt;/code&gt; table started with just a &lt;code&gt;location&lt;/code&gt; column, after running the above command the new table schema will look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [places] (
    [location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
    [latitude] FLOAT,
    [longitude] FLOAT
);&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Common recipes&lt;/h4&gt;
&lt;p&gt;This new feature in &lt;code&gt;sqlite-utils&lt;/code&gt; actually started life as a separate tool entirely, called &lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Part of the rationale for adding it to &lt;code&gt;sqlite-utils&lt;/code&gt; was to avoid confusion between what that tool did and the &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/"&gt;sqlite-utils transform&lt;/a&gt; tool, which does something completely different (applies table transformations that aren't possible using SQLite's default &lt;code&gt;ALTER TABLE&lt;/code&gt; statement). Somewhere along the line I messed up with the naming of the two tools!&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-transform&lt;/code&gt; bundles a number of useful &lt;a href="https://github.com/simonw/sqlite-transform/blob/main/README.md#parsedate-and-parsedatetime"&gt;default transformation recipes&lt;/a&gt;, in addition to allowing arbitrary Python code. I ended up making these available in &lt;code&gt;sqlite-utils convert&lt;/code&gt; by exposing them as functions that can be called from the command-line code argument like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert my.db articles created_at \
    'r.parsedate(value)'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Implementing them as Python functions in this way meant I didn't need to invent a new command-line mechanism for passing in additional options to the individual recipes - instead, parameters are passed like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert my.db articles created_at \
    'r.parsedate(value, dayfirst=True)'
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Also available in the sqlite_utils Python library&lt;/h4&gt;
&lt;p&gt;Almost every feature that is exposed by the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html"&gt;sqlite-utils command-line tool&lt;/a&gt; has a matching API in the &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html"&gt;sqlite_utils Python library&lt;/a&gt;. &lt;code&gt;convert&lt;/code&gt; is no exception.&lt;/p&gt;
&lt;p&gt;The Python API lets you perform operations like the following:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;.&lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s"&gt;"dogs.db"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;db&lt;/span&gt;[&lt;span class="pl-s"&gt;"dogs"&lt;/span&gt;].&lt;span class="pl-en"&gt;convert&lt;/span&gt;(&lt;span class="pl-s"&gt;"name"&lt;/span&gt;, &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Any Python callable can be passed to &lt;code&gt;convert&lt;/code&gt;, and it will be applied to every value in the specified column - again, like using &lt;code&gt;map()&lt;/code&gt; to apply a transformation to every item in an array.&lt;/p&gt;
&lt;p&gt;You can also use the Python API to perform more complex operations like the following two examples:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# Convert title to upper case only for rows with id &amp;gt; 20&lt;/span&gt;
&lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"title"&lt;/span&gt;,
    &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;v&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;(),
    &lt;span class="pl-s1"&gt;where&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"id &amp;gt; :id"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;where_args&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"id"&lt;/span&gt;: &lt;span class="pl-c1"&gt;20&lt;/span&gt;}
)

&lt;span class="pl-c"&gt;# Create two new columns, "upper" and "lower",&lt;/span&gt;
&lt;span class="pl-c"&gt;# and populate them from the converted title&lt;/span&gt;
&lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"title"&lt;/span&gt;,
    &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;v&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;"upper"&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;(),
        &lt;span class="pl-s"&gt;"lower"&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;lower&lt;/span&gt;()
    }, &lt;span class="pl-s1"&gt;multi&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)&lt;/pre&gt;
&lt;p&gt;See the &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#converting-data-in-columns"&gt;full documentation for table.convert()&lt;/a&gt; for more options.&lt;/p&gt;
&lt;h4 id="blog-performance"&gt;A more sophisticated example: analyzing log files&lt;/h4&gt;
&lt;p&gt;I used the new &lt;code&gt;sqlite-utils convert&lt;/code&gt; command earlier today, to debug a performance issue with my blog.&lt;/p&gt;
&lt;p&gt;Most of my blog traffic is served via Cloudflare with a 15 minute cache timeout - but occasionally I'll hit an uncached page, and they had started to feel not quite as snappy as I would expect.&lt;/p&gt;
&lt;p&gt;So I dipped into the Heroku dashboard, and saw this pretty sad looking graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Performance graph showing 95th percentile of 17s and max of 23s" src="https://static.simonwillison.net/static/2021/sad-performance.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Somehow my 50th percentile was nearly 10 seconds, and my maximum page response time was 23 seconds! Something was clearly very wrong.&lt;/p&gt;
&lt;p&gt;I use NGINX as part of my Heroku setup to buffer responses (see &lt;a href="https://simonwillison.net/2017/Oct/2/nginx-heroku/"&gt;Running gunicorn behind nginx on Heroku for buffering and logging&lt;/a&gt;), and I have custom NGINX configuration to write to the Heroku logs - mainly to work around a limitation in Heroku's default logging where it fails to record full user-agents or referrer headers.&lt;/p&gt;
&lt;p&gt;I extended that configuration to record the NGINX &lt;code&gt;request_time&lt;/code&gt;, &lt;code&gt;upstream_response_time&lt;/code&gt;, &lt;code&gt;upstream_connect_time&lt;/code&gt; and &lt;code&gt;upstream_header_time&lt;/code&gt; variables, which I hoped would help me figure out what was going on.&lt;/p&gt;
&lt;p&gt;After &lt;a href="https://github.com/simonw/simonwillisonblog/commit/dd0faaa64c0e361ae1d760894e201cac7b0224a4"&gt;applying that change&lt;/a&gt; I started seeing Heroku log lines that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&amp;amp;page=2&amp;amp;tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Next step: analyze those log lines.&lt;/p&gt;
&lt;p&gt;I ran this command for a few minutes to gather some logs:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;heroku logs -a simonwillisonblog --tail | grep 'measure#nginx.service' &amp;gt; /tmp/log.txt&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Having collected 488 log lines, the next step was to load them into SQLite.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils insert&lt;/code&gt; command likes to work with JSON, but I just had raw log lines. I used &lt;code&gt;jq&lt;/code&gt; to convert each line into a &lt;code&gt;{"line": "raw log line"}&lt;/code&gt; JSON object, then piped that as newline-delimited JSON into &lt;code&gt;sqlite-utils insert&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /tmp/log.txt | \
    jq --raw-input '{line: .}' --compact-output | \
    sqlite-utils insert /tmp/logs.db log - --nl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;jq --raw-input&lt;/code&gt; accepts input that is just raw lines of text, not yet valid JSON. &lt;code&gt;'{line: .}'&lt;/code&gt; is a tiny &lt;code&gt;jq&lt;/code&gt; program that builds &lt;code&gt;{"line": "raw input"}&lt;/code&gt; objects. &lt;code&gt;--compact-output&lt;/code&gt; causes &lt;code&gt;jq&lt;/code&gt; to output newline-delimited JSON.&lt;/p&gt;
&lt;p&gt;Then &lt;code&gt;sqlite-utils insert /tmp/logs.db log - --nl&lt;/code&gt; reads that newline-delimited JSON into a new SQLite &lt;code&gt;log&lt;/code&gt; table in a &lt;code&gt;logs.db&lt;/code&gt; database file (&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json"&gt;full documentation here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 6th January 2022:&lt;/strong&gt; &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-20"&gt;sqlite-utils 3.20&lt;/a&gt; introduced a new &lt;code&gt;sqlite-utils insert ... --lines&lt;/code&gt; option for importing raw lines, so you can now achieve this without using &lt;code&gt;jq&lt;/code&gt; at all. See 
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-unstructured-data-with-lines-and-text"&gt;Inserting unstructured data with --lines and --text&lt;/a&gt; for details.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now I had a SQLite table with a single column, &lt;code&gt;line&lt;/code&gt;. Next step: parse that nasty log format.&lt;/p&gt;
&lt;p&gt;To my surprise I couldn't find an existing Python library for parsing &lt;code&gt;key=value key2="quoted value"&lt;/code&gt; log lines. Instead I had to figure out a regular expression:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;([^\s=]+)=(?:"(.*?)"|(\S+))
&lt;/code&gt;&lt;/pre&gt;
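&lt;p&gt;To show what that expression captures, here's a short demo applying it to a simplified version of one of the log lines above - each match produces a key plus two alternative value groups, exactly one of which is non-empty:&lt;/p&gt;

```python
import re

# Matches key=value pairs where the value is either double-quoted
# (group 2) or a bare run of non-whitespace (group 3).
pattern = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')

# A trimmed-down version of the NGINX-formatted Heroku log line.
line = 'status_code=404 body_bytes_sent=179 via="1.1 vegur" request_time="4.212"'

pairs = {}
for key, quoted, bare in pattern.findall(line):
    pairs[key] = quoted or bare

print(pairs)
# {'status_code': '404', 'body_bytes_sent': '179',
#  'via': '1.1 vegur', 'request_time': '4.212'}
```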
&lt;p&gt;Here's that expression visualized using &lt;a href="https://www.debuggex.com/"&gt;Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the regex visualized with debuggex" src="https://static.simonwillison.net/static/2021/debuggex-log-parser-regex.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I used that regular expression as part of a custom function passed in to the &lt;code&gt;sqlite-utils convert&lt;/code&gt; tool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert /tmp/logs.db log line --import re --multi "$(cat &amp;lt;&amp;lt;EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(value):
        pairs[key] = value1 or value2
    return pairs
EOD
)"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(This uses a &lt;code&gt;cat &amp;lt;&amp;lt;EOD&lt;/code&gt; trick to avoid having to figure out how to escape the single and double quotes in the Python code for usage in a zsh shell command.)&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;--multi&lt;/code&gt; here created new columns for each of the key/value pairs seen in that log file.&lt;/p&gt;
&lt;p&gt;One last step: convert the types. The new columns are all of type &lt;code&gt;text&lt;/code&gt; but I want to do sorting and arithmetic on them so I need to convert them to integers and floats. I used &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#transforming-tables"&gt;sqlite-utils transform&lt;/a&gt; for that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils transform /tmp/logs.db log \
    --type 'measure#nginx.service' float \
    --type 'status_code' integer \
    --type 'body_bytes_sent' integer \
    --type 'request_time' float \
    --type 'upstream_response_time' float \
    --type 'upstream_connect_time' float \
    --type 'upstream_header_time' float
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://lite.datasette.io/?url=https://gist.githubusercontent.com/simonw/3454951e23cab709da42d25520dd78cf/raw/3383a16cd1f423d39c9c6923b6b37a3e74c4f148/logs.db#/logs/log"&gt;resulting log table&lt;/a&gt; (in Datasette Lite).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Datasette showing the log table" src="https://static.simonwillison.net/static/2021/performance-logs.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Once the logs were in Datasette, the problem quickly became apparent when I &lt;a href="https://lite.datasette.io/?url=https://gist.githubusercontent.com/simonw/3454951e23cab709da42d25520dd78cf/raw/3383a16cd1f423d39c9c6923b6b37a3e74c4f148/logs.db#/logs/log?_sort_desc=request_time"&gt;sorted by request_time&lt;/a&gt;: an army of search engine crawlers were hitting deep linked filters in &lt;a href="https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/"&gt;my faceted search engine&lt;/a&gt;, like &lt;code&gt;/search/?tag=geolocation&amp;amp;tag=offlineresources&amp;amp;tag=canvas&amp;amp;tag=javascript&amp;amp;tag=performance&amp;amp;tag=dragndrop&amp;amp;tag=crossdomain&amp;amp;tag=mozilla&amp;amp;tag=video&amp;amp;tag=tracemonkey&amp;amp;year=2009&amp;amp;type=blogmark&lt;/code&gt;. These are expensive pages to generate! They're also very unlikely to be in my Cloudflare cache.&lt;/p&gt;
&lt;p&gt;Could the answer be as simple as a &lt;code&gt;robots.txt&lt;/code&gt; rule blocking access to &lt;code&gt;/search/&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/simonwillisonblog/commit/4c0de5b9f01bb16fc89c587128a276055b0033bb"&gt;shipped that change&lt;/a&gt; and waited a few hours to see what the impact would be:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heroku metrics showing a dramatic improvement after the deploy, and especially about 8 hours later" src="https://static.simonwillison.net/static/2021/robots-txt-effect.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It took a while for the crawlers to notice that my &lt;code&gt;robots.txt&lt;/code&gt; had changed, but by 8 hours later my site performance was dramatically improved - I'm now seeing 99th percentile of around 450ms, compared to 25 seconds before I shipped the &lt;code&gt;robots.txt&lt;/code&gt; change!&lt;/p&gt;
&lt;p&gt;With this latest addition, &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; has evolved into a powerful tool for importing, cleaning and re-shaping data - especially when coupled with Datasette in order to explore, analyze and publish the results.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/vscode/vs-code-regular-expressions"&gt;Search and replace with regular expressions in VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/codespell"&gt;Check spelling using codespell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/imagemagick/set-a-gif-to-loop"&gt;Set a GIF to loop using ImageMagick&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sqlite-aggregate-filter-clauses"&gt;SQLite aggregate filter clauses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/imagemagick/compress-animated-gif"&gt;Compressing an animated GIF with ImageMagick mogrify&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-transform/releases/tag/1.2.1"&gt;1.2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-transform/releases"&gt;10 releases total&lt;/a&gt;) - 2021-08-02
&lt;br /&gt;Tool for running transformations on columns in a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.14"&gt;3.14&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;82 releases total&lt;/a&gt;) - 2021-08-02
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-json-html"&gt;datasette-json-html&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-json-html/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-json-html/releases"&gt;6 releases total&lt;/a&gt;) - 2021-07-31
&lt;br /&gt;Datasette plugin for rendering HTML based on JSON values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0.2"&gt;1.0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-fly/releases"&gt;5 releases total&lt;/a&gt;) - 2021-07-30
&lt;br /&gt;Datasette plugin for publishing data using Fly&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="performance"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="data-science"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>The data team: a short story</title><link href="https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag" rel="alternate"/><published>2021-07-08T23:12:59+00:00</published><updated>2021-07-08T23:12:59+00:00</updated><id>https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://erikbern.com/2021/07/07/the-data-team-a-short-story.html"&gt;The data team: a short story&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27777594"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leadership"&gt;leadership&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="data-science"/><category term="leadership"/></entry><entry><title>Group thousands of similar spreadsheet text cells in seconds</title><link href="https://simonwillison.net/2021/Jun/27/similar-text-cells/#atom-tag" rel="alternate"/><published>2021-06-27T16:24:38+00:00</published><updated>2021-06-27T16:24:38+00:00</updated><id>https://simonwillison.net/2021/Jun/27/similar-text-cells/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d"&gt;Group thousands of similar spreadsheet text cells in seconds&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Luke Whyte explains how to efficiently group similar text values in a table column (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://github.com/lukewhyte/textpack"&gt;lukewhyte/textpack&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="data-science"/></entry><entry><title>What I've learned about data recently</title><link href="https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag" rel="alternate"/><published>2021-06-22T17:09:07+00:00</published><updated>2021-06-22T17:09:07+00:00</updated><id>https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://seldo.com/posts/what-i-ve-learned-about-data-recently"&gt;What I&amp;#x27;ve learned about data recently&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/seldo/status/1407370508576780290"&gt;@seldo&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laurie-voss"&gt;laurie-voss&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="data-science"/><category term="laurie-voss"/></entry><entry><title>Defining Data Intuition</title><link href="https://simonwillison.net/2020/Oct/29/defining-data-intuition/#atom-tag" rel="alternate"/><published>2020-10-29T15:14:28+00:00</published><updated>2020-10-29T15:14:28+00:00</updated><id>https://simonwillison.net/2020/Oct/29/defining-data-intuition/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.harterrt.com/data_intuition.html"&gt;Defining Data Intuition&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ryan T. Harter, Principal Data Scientist at Mozilla, defines data intuition as “a resilience to misleading data and analyses”. He also introduces the term “data-stink”, by analogy with “code smell”: your intuition should lead you to distrust analysis that exhibits certain characteristics without first digging in further. I strongly believe that data reports should include a link to the raw methodology and numbers to ensure they can be more easily vetted, so that data-stink can be investigated with the least amount of resistance.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="mozilla"/><category term="data-science"/></entry><entry><title>Announcing the Consortium for Python Data API Standards</title><link href="https://simonwillison.net/2020/Aug/19/announcing-consortium-python-data-api-standards/#atom-tag" rel="alternate"/><published>2020-08-19T05:48:11+00:00</published><updated>2020-08-19T05:48:11+00:00</updated><id>https://simonwillison.net/2020/Aug/19/announcing-consortium-python-data-api-standards/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://data-apis.org/blog/announcing_the_consortium/"&gt;Announcing the Consortium for Python Data API Standards&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting effort to unify the fragmented DataFrame API ecosystem, where increasing numbers of libraries offer APIs inspired by Pandas that imitate each other but aren’t 100% compatible. The announcement includes some very clever code to support the effort: custom tooling to compare the existing APIs, and an ingenious GitHub Actions setup to run traces (via sys.settrace), derive type signatures and commit those generated signatures back to a repository.
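The sys.settrace part of that is fun to demonstrate in miniature. This is a hedged sketch of the general technique (record the argument types actually seen at runtime), not the Consortium’s tooling:

```python
import sys
from collections import defaultdict

# Map of function name -> set of argument-type tuples observed at call time
traced = defaultdict(set)

def tracer(frame, event, arg):
    if event == "call":
        code = frame.f_code
        # Positional argument names live at the front of co_varnames
        argtypes = tuple(
            type(frame.f_locals[name]).__name__
            for name in code.co_varnames[:code.co_argcount]
        )
        traced[code.co_name].add(argtypes)
    return None  # no line-level tracing needed

def add(a, b):
    return a + b

sys.settrace(tracer)
add(1, 2)
add("x", "y")
sys.settrace(None)

print(sorted(traced["add"]))  # [('int', 'int'), ('str', 'str')]
```

Run this over a library’s test suite instead of two toy calls and you get exactly the kind of observed type signatures the Consortium is committing back to a repository.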

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ralfgommers/status/1295296141387599879"&gt;@ralfgommers&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="standards"/><category term="data-science"/><category term="github-actions"/></entry><entry><title>Quoting GPT-3</title><link href="https://simonwillison.net/2020/Jun/29/gpt-3-shepherded-max-woolf/#atom-tag" rel="alternate"/><published>2020-06-29T04:45:49+00:00</published><updated>2020-06-29T04:45:49+00:00</updated><id>https://simonwillison.net/2020/Jun/29/gpt-3-shepherded-max-woolf/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/minimaxir/status/1277436629368668160"&gt;&lt;p&gt;Data Science is a lot like Harry Potter, except there's no magic, it's just math, and instead of a sorting hat you just sort the data with a Python script.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/minimaxir/status/1277436629368668160"&gt;GPT-3&lt;/a&gt;, shepherded by Max Woolf&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="data-science"/><category term="max-woolf"/></entry><entry><title>Data science is different now</title><link href="https://simonwillison.net/2019/Feb/15/data-science-different-now/#atom-tag" rel="alternate"/><published>2019-02-15T15:36:11+00:00</published><updated>2019-02-15T15:36:11+00:00</updated><id>https://simonwillison.net/2019/Feb/15/data-science-different-now/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/"&gt;Data science is different now&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/vboykis/status/1096032872153133056"&gt;@vboykis&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-science"/></entry><entry><title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title><link href="https://simonwillison.net/2018/Dec/11/real-world-data-science/#atom-tag" rel="alternate"/><published>2018-12-11T20:51:19+00:00</published><updated>2018-12-11T20:51:19+00:00</updated><id>https://simonwillison.net/2018/Dec/11/real-world-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://minimaxir.com/2018/10/data-science-protips/"&gt;Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really good article, pointing out that carefully optimizing machine learning models is only a small part of the day-to-day work of a data scientist: cleaning up data, building dashboards, shipping models to production, deciding on trade-offs between model performance and production constraints, and considering the product design and ethical implications of what you are doing make up a much larger portion of the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=18651463"&gt;minimaxir&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="data-science"/><category term="max-woolf"/></entry><entry><title>Serverless for data scientists</title><link href="https://simonwillison.net/2018/Aug/25/serverless-data-scientists/#atom-tag" rel="alternate"/><published>2018-08-25T23:01:09+00:00</published><updated>2018-08-25T23:01:09+00:00</updated><id>https://simonwillison.net/2018/Aug/25/serverless-data-scientists/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mike.place/talks/serverless/"&gt;Serverless for data scientists&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Slides and accompanying notes from a talk by Mike Lee Williams at PyBay, providing an overview of Zappa and diving a bit more deeply into pywren, which makes it trivial to parallelize a function across a set of AWS Lambda instances (essentially serverless Python map() execution). I really like this format for sharing presentations; I used something similar for my own PyBay talk.
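pywren’s documented shape is roughly an executor object with a map() method. Here’s a sketch of that pattern, with the standard library’s ThreadPoolExecutor standing in for the Lambda fleet since pywren itself needs AWS credentials to run:

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(x):
    # Stand-in for an expensive per-item computation
    return x * x

# With pywren this would be roughly:
#   pwex = pywren.default_executor()
#   futures = pwex.map(featurize, range(8))
#   results = pywren.get_all_results(futures)
# Locally, the same map() shape with threads:
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(featurize, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The appeal for data science work is that the map() abstraction stays the same while the executor swaps out: prototype against threads, then point the same function at a few thousand Lambda instances.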

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mikepqr/status/1031231759793319936"&gt;@mikepqr&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/amazonaws"&gt;amazonaws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/serverless"&gt;serverless&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="amazonaws"/><category term="serverless"/><category term="data-science"/></entry><entry><title>Computational and Inferential Thinking: The Foundations of Data Science</title><link href="https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag" rel="alternate"/><published>2018-08-25T22:13:51+00:00</published><updated>2018-08-25T22:13:51+00:00</updated><id>https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.inferentialthinking.com/"&gt;Computational and Inferential Thinking: The Foundations of Data Science&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Free online textbook written for the UC Berkeley Foundations of Data Science class. The examples are all provided as Jupyter notebooks, using the mybinder web application to allow students to launch interactive notebooks for any of the examples without having to install any software on their own machines.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/education"&gt;education&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="education"/><category term="jupyter"/><category term="data-science"/></entry><entry><title>Beginner's Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!)</title><link href="https://simonwillison.net/2018/May/24/jupyter-notebooks/#atom-tag" rel="alternate"/><published>2018-05-24T13:58:15+00:00</published><updated>2018-05-24T13:58:15+00:00</updated><id>https://simonwillison.net/2018/May/24/jupyter-notebooks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2018/05/starters-guide-jupyter-notebook/"&gt;Beginner&amp;#x27;s Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
If you haven’t yet got on the Jupyter notebooks bandwagon this should help. It’s the single biggest productivity improvement I’ve made to my workflow in a very long time.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mikeloukides/status/999612877681102848"&gt;Mike Loukides&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="data-science"/></entry></feed>