<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: data-science</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/data-science.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2021-12-24T23:41:55+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting danah boyd</title><link href="https://simonwillison.net/2021/Dec/24/danah-boyd/#atom-tag" rel="alternate"/><published>2021-12-24T23:41:55+00:00</published><updated>2021-12-24T23:41:55+00:00</updated><id>https://simonwillison.net/2021/Dec/24/danah-boyd/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://zephoria.substack.com/p/statistical-imaginaries"&gt;&lt;p&gt;Many of you here today are toolbuilders who help people work with data. Rather than presuming that those using your tools are clear-eyed about their data, how can you build features and methods that ensure people know the limits of their data and work with them responsibly? Your tools are not neutral. Neither is the data that your tools help analyze. How can you build tools that invite responsible data use and make visible when data is being manipulated? How can you help build tools for responsible governance?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://zephoria.substack.com/p/statistical-imaginaries"&gt;danah boyd&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="data-science"/></entry><entry><title>Cookiecutter Data Science</title><link href="https://simonwillison.net/2021/Nov/18/cookiecutter-data-science/#atom-tag" rel="alternate"/><published>2021-11-18T15:21:59+00:00</published><updated>2021-11-18T15:21:59+00:00</updated><id>https://simonwillison.net/2021/Nov/18/cookiecutter-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://drivendata.github.io/cookiecutter-data-science/"&gt;Cookiecutter Data Science&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Some really solid thinking in this documentation for the DrivenData cookiecutter template. They emphasize designing data science projects for repeatability, such that just the src/ and data/ folders can be used to recreate all of the other analysis from scratch. I like the suggestion to give each project a dedicated S3 bucket for keeping immutable copies of the original raw data that might be too large for GitHub.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/dynamicwebpaige/status/1461272760542396421"&gt;Paige Bailey&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cookiecutter"&gt;cookiecutter&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-science"/><category term="cookiecutter"/></entry><entry><title>Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool</title><link href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/#atom-tag" rel="alternate"/><published>2021-08-06T06:05:15+00:00</published><updated>2021-08-06T06:05:15+00:00</updated><id>https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/#atom-tag</id><summary type="html">
    &lt;p&gt;Earlier this week I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-14"&gt;sqlite-utils 3.14&lt;/a&gt; with a powerful new command-line tool: &lt;code&gt;sqlite-utils convert&lt;/code&gt;, which applies a conversion function to data stored in a SQLite column.&lt;/p&gt;
&lt;p&gt;Anyone who works with data will tell you that 90% of the work is cleaning it up. Running command-line conversions against data in a SQLite file turns out to be a really productive way to do that.&lt;/p&gt;
&lt;h4&gt;Transforming a column&lt;/h4&gt;
&lt;p&gt;Here's a simple example. Say someone gave you data with numbers that are formatted with commas - like &lt;code&gt;3,044,502&lt;/code&gt; - in a &lt;code&gt;count&lt;/code&gt; column in a &lt;code&gt;states&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;You can strip those commas out like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert states.db states count \
    'value.replace(",", "")'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;convert&lt;/code&gt; command takes four arguments: the database file, the name of the table, the name of the column and a string containing a fragment of Python code that defines the conversion to be applied.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animated demo using sqlite-utils convert to strip out commas" src="https://static.simonwillison.net/static/2021/sqlite-convert-demo.gif" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The conversion function can be anything you can express with Python. If you want to import extra modules you can do so using &lt;code&gt;--import module&lt;/code&gt; - here's an example that wraps text using the &lt;a href="https://docs.python.org/3/library/textwrap.html"&gt;textwrap&lt;/a&gt; module from the Python standard library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert content.db articles content \
    '"\n".join(textwrap.wrap(value, 100))' \
    --import=textwrap
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can consider this analogous to using &lt;code&gt;Array.map()&lt;/code&gt; in JavaScript, or running a transformation using a list comprehension in Python.&lt;/p&gt;
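&lt;p&gt;To make that analogy concrete, here's a plain-Python sketch (illustrative only, not sqlite-utils internals) showing the comma-stripping conversion from above applied with &lt;code&gt;map()&lt;/code&gt; and with a list comprehension:&lt;/p&gt;

```python
# The same comma-stripping conversion, applied to a plain Python list
# instead of a SQLite column. Sample values are illustrative.
values = ["3,044,502", "39,538,223", "29,145,505"]

# Using map(), as you might in JavaScript with Array.map():
stripped = list(map(lambda value: value.replace(",", ""), values))

# The equivalent list comprehension:
stripped_lc = [value.replace(",", "") for value in values]

print(stripped)  # ['3044502', '39538223', '29145505']
```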
&lt;h4&gt;Custom functions in SQLite&lt;/h4&gt;
&lt;p&gt;Under the hood, the tool takes advantage of a powerful SQLite feature: the ability to &lt;a href="https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function"&gt;register custom functions&lt;/a&gt; written in Python (or other languages) and call them from SQL.&lt;/p&gt;
&lt;p&gt;The text wrapping example above works by executing the following SQL:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;update&lt;/span&gt; articles &lt;span class="pl-k"&gt;set&lt;/span&gt; content &lt;span class="pl-k"&gt;=&lt;/span&gt; convert_value(content)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;convert_value(value)&lt;/code&gt; is a custom SQL function, compiled as Python code and then made available to the database connection.&lt;/p&gt;
&lt;p&gt;The equivalent code using just the Python standard library would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;textwrap&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;convert_value&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;"&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;textwrap&lt;/span&gt;.&lt;span class="pl-en"&gt;wrap&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-c1"&gt;100&lt;/span&gt;))

&lt;span class="pl-s1"&gt;conn&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite3&lt;/span&gt;.&lt;span class="pl-en"&gt;connect&lt;/span&gt;(&lt;span class="pl-s"&gt;"content.db"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;conn&lt;/span&gt;.&lt;span class="pl-en"&gt;create_function&lt;/span&gt;(&lt;span class="pl-s"&gt;"convert_value"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-s1"&gt;convert_value&lt;/span&gt;)
&lt;span class="pl-s1"&gt;conn&lt;/span&gt;.&lt;span class="pl-en"&gt;execute&lt;/span&gt;(&lt;span class="pl-s"&gt;"update articles set content = convert_value(content)"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils convert&lt;/code&gt; works by &lt;a href="https://github.com/simonw/sqlite-utils/blob/cc90745f4e8bb1ac57d8ee973863cfe00c2e4fe5/sqlite_utils/cli.py#L2019-L2028"&gt;compiling the code argument&lt;/a&gt; to a Python function, registering it with the connection and executing the above SQL query.&lt;/p&gt;
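&lt;p&gt;A minimal sketch of that pattern, using only the standard library - note this is a simplified illustration, not the actual sqlite-utils implementation, which also handles multi-line code bodies and extra imports:&lt;/p&gt;

```python
import sqlite3
import textwrap


def compile_fragment(code):
    """Turn a single-expression code fragment into a one-argument function.

    Simplified sketch: the real tool also supports multi-line bodies
    with a return statement, and arbitrary --import modules.
    """
    return eval("lambda value: " + code, {"textwrap": textwrap})


# The same fragment you would pass on the command line:
convert_value = compile_fragment('"\\n".join(textwrap.wrap(value, 100))')

conn = sqlite3.connect(":memory:")
conn.execute("create table articles (content text)")
conn.execute("insert into articles values (?)", ("word " * 50,))

# Register the compiled function and run the UPDATE, exactly as above.
conn.create_function("convert_value", 1, convert_value)
conn.execute("update articles set content = convert_value(content)")

wrapped = conn.execute("select content from articles").fetchone()[0]
```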
&lt;h4&gt;Splitting columns into multiple other columns&lt;/h4&gt;
&lt;p&gt;Sometimes when I'm working with a table I find myself wanting to split a column into multiple other columns.&lt;/p&gt;
&lt;p&gt;A classic example is locations - if a &lt;code&gt;location&lt;/code&gt; column contains &lt;code&gt;latitude,longitude&lt;/code&gt; values I'll often want to split that into separate &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt; columns, so I can visualize the data with &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--multi&lt;/code&gt; option lets you do that using &lt;code&gt;sqlite-utils convert&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert data.db places location '
latitude, longitude = value.split(",")
return {
    "latitude": float(latitude),
    "longitude": float(longitude),
}' --multi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;--multi&lt;/code&gt; tells the command to expect the Python code to return dictionaries. It will then create new columns in the database corresponding to the keys in those dictionaries and populate them using the results of the transformation.&lt;/p&gt;
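&lt;p&gt;The code fragment passed with &lt;code&gt;--multi&lt;/code&gt; behaves like the body of an ordinary Python function that returns a dictionary. Written out standalone (the sample coordinates are illustrative), it looks like this:&lt;/p&gt;

```python
def convert_location(value):
    # Same logic as the --multi code fragment above, as a named function.
    latitude, longitude = value.split(",")
    return {
        "latitude": float(latitude),
        "longitude": float(longitude),
    }


# Each returned key becomes a new column; each value populates that
# column for the row being converted.
result = convert_location("37.7749,-122.4194")
print(result)  # {'latitude': 37.7749, 'longitude': -122.4194}
```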
&lt;p&gt;If the &lt;code&gt;places&lt;/code&gt; table started with just a &lt;code&gt;location&lt;/code&gt; column, after running the above command the new table schema will look like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [places] (
    [location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
    [latitude] FLOAT,
    [longitude] FLOAT
);&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Common recipes&lt;/h4&gt;
&lt;p&gt;This new feature in &lt;code&gt;sqlite-utils&lt;/code&gt; actually started life as a separate tool entirely, called &lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Part of the rationale for adding it to &lt;code&gt;sqlite-utils&lt;/code&gt; was to avoid confusion between what that tool did and the &lt;a href="https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/"&gt;sqlite-utils transform&lt;/a&gt; tool, which does something completely different (applies table transformations that aren't possible using SQLite's default &lt;code&gt;ALTER TABLE&lt;/code&gt; statement). Somewhere along the line I messed up with the naming of the two tools!&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sqlite-transform&lt;/code&gt; bundles a number of useful &lt;a href="https://github.com/simonw/sqlite-transform/blob/main/README.md#parsedate-and-parsedatetime"&gt;default transformation recipes&lt;/a&gt;, in addition to allowing arbitrary Python code. I ended up making these available in &lt;code&gt;sqlite-utils convert&lt;/code&gt; by exposing them as functions that can be called from the command-line code argument like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert my.db articles created_at \
    'r.parsedate(value)'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Implementing them as Python functions in this way meant I didn't need to invent a new command-line mechanism for passing in additional options to the individual recipes - instead, parameters are passed like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert my.db articles created_at \
    'r.parsedate(value, dayfirst=True)'
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Also available in the sqlite_utils Python library&lt;/h4&gt;
&lt;p&gt;Almost every feature that is exposed by the &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html"&gt;sqlite-utils command-line tool&lt;/a&gt; has a matching API in the &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html"&gt;sqlite_utils Python library&lt;/a&gt;. &lt;code&gt;convert&lt;/code&gt; is no exception.&lt;/p&gt;
&lt;p&gt;The Python API lets you perform operations like the following:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;.&lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s"&gt;"dogs.db"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;db&lt;/span&gt;[&lt;span class="pl-s"&gt;"dogs"&lt;/span&gt;].&lt;span class="pl-en"&gt;convert&lt;/span&gt;(&lt;span class="pl-s"&gt;"name"&lt;/span&gt;, &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Any Python callable can be passed to &lt;code&gt;convert&lt;/code&gt;, and it will be applied to every value in the specified column - again, like using &lt;code&gt;map()&lt;/code&gt; to apply a transformation to every item in an array.&lt;/p&gt;
&lt;p&gt;You can also use the Python API to perform more complex operations like the following two examples:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# Convert title to upper case only for rows with id &amp;gt; 20&lt;/span&gt;
&lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"title"&lt;/span&gt;,
    &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;v&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;(),
    &lt;span class="pl-s1"&gt;where&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"id &amp;gt; :id"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;where_args&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"id"&lt;/span&gt;: &lt;span class="pl-c1"&gt;20&lt;/span&gt;}
)

&lt;span class="pl-c"&gt;# Create two new columns, "upper" and "lower",&lt;/span&gt;
&lt;span class="pl-c"&gt;# and populate them from the converted title&lt;/span&gt;
&lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"title"&lt;/span&gt;,
    &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;v&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;"upper"&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;upper&lt;/span&gt;(),
        &lt;span class="pl-s"&gt;"lower"&lt;/span&gt;: &lt;span class="pl-s1"&gt;v&lt;/span&gt;.&lt;span class="pl-en"&gt;lower&lt;/span&gt;()
    }, &lt;span class="pl-s1"&gt;multi&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)&lt;/pre&gt;
&lt;p&gt;See the &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#converting-data-in-columns"&gt;full documentation for table.convert()&lt;/a&gt; for more options.&lt;/p&gt;
&lt;h4 id="blog-performance"&gt;A more sophisticated example: analyzing log files&lt;/h4&gt;
&lt;p&gt;I used the new &lt;code&gt;sqlite-utils convert&lt;/code&gt; command earlier today, to debug a performance issue with my blog.&lt;/p&gt;
&lt;p&gt;Most of my blog traffic is served via Cloudflare with a 15 minute cache timeout - but occasionally I'll hit an uncached page, and they had started to feel not quite as snappy as I would expect.&lt;/p&gt;
&lt;p&gt;So I dipped into the Heroku dashboard, and saw this pretty sad looking graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Performance graph showing 95th percentile of 17s and max of 23s" src="https://static.simonwillison.net/static/2021/sad-performance.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Somehow my 50th percentile was nearly 10 seconds, and my maximum page response time was 23 seconds! Something was clearly very wrong.&lt;/p&gt;
&lt;p&gt;I use NGINX as part of my Heroku setup to buffer responses (see &lt;a href="https://simonwillison.net/2017/Oct/2/nginx-heroku/"&gt;Running gunicorn behind nginx on Heroku for buffering and logging&lt;/a&gt;), and I have custom NGINX configuration to write to the Heroku logs - mainly to work around a limitation in Heroku's default logging where it fails to record full user-agents or referrer headers.&lt;/p&gt;
&lt;p&gt;I extended that configuration to record the NGINX &lt;code&gt;request_time&lt;/code&gt;, &lt;code&gt;upstream_response_time&lt;/code&gt;, &lt;code&gt;upstream_connect_time&lt;/code&gt; and &lt;code&gt;upstream_header_time&lt;/code&gt; variables, which I hoped would help me figure out what was going on.&lt;/p&gt;
&lt;p&gt;After &lt;a href="https://github.com/simonw/simonwillisonblog/commit/dd0faaa64c0e361ae1d760894e201cac7b0224a4"&gt;applying that change&lt;/a&gt; I started seeing Heroku log lines that looked like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&amp;amp;page=2&amp;amp;tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Next step: analyze those log lines.&lt;/p&gt;
&lt;p&gt;I ran this command for a few minutes to gather some logs:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;heroku logs -a simonwillisonblog --tail | grep 'measure#nginx.service' &amp;gt; /tmp/log.txt&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Having collected 488 log lines, the next step was to load them into SQLite.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sqlite-utils insert&lt;/code&gt; command likes to work with JSON, but I just had raw log lines. I used &lt;code&gt;jq&lt;/code&gt; to convert each line into a &lt;code&gt;{"line": "raw log line"}&lt;/code&gt; JSON object, then piped that as newline-delimited JSON into &lt;code&gt;sqlite-utils insert&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /tmp/log.txt | \
    jq --raw-input '{line: .}' --compact-output | \
    sqlite-utils insert /tmp/logs.db log - --nl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;jq --raw-input&lt;/code&gt; accepts input that is just raw lines of text, not yet valid JSON. &lt;code&gt;'{line: .}'&lt;/code&gt; is a tiny &lt;code&gt;jq&lt;/code&gt; program that builds &lt;code&gt;{"line": "raw input"}&lt;/code&gt; objects. &lt;code&gt;--compact-output&lt;/code&gt; causes &lt;code&gt;jq&lt;/code&gt; to output newline-delimited JSON.&lt;/p&gt;
&lt;p&gt;Then &lt;code&gt;sqlite-utils insert /tmp/logs.db log - --nl&lt;/code&gt; reads that newline-delimited JSON into a new SQLite &lt;code&gt;log&lt;/code&gt; table in a &lt;code&gt;logs.db&lt;/code&gt; database file (&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-newline-delimited-json"&gt;full documentation here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 6th January 2022:&lt;/strong&gt; &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-20"&gt;sqlite-utils 3.20&lt;/a&gt; introduced a new &lt;code&gt;sqlite-utils insert ... --lines&lt;/code&gt; option for importing raw lines, so you can now achieve this without using &lt;code&gt;jq&lt;/code&gt; at all. See 
&lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-unstructured-data-with-lines-and-text"&gt;Inserting unstructured data with --lines and --text&lt;/a&gt; for details.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now I had a SQLite table with a single column, &lt;code&gt;line&lt;/code&gt;. Next step: parse that nasty log format.&lt;/p&gt;
&lt;p&gt;To my surprise I couldn't find an existing Python library for parsing &lt;code&gt;key=value key2="quoted value"&lt;/code&gt; log lines. Instead I had to figure out a regular expression:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;([^\s=]+)=(?:"(.*?)"|(\S+))
&lt;/code&gt;&lt;/pre&gt;
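&lt;p&gt;To show what that expression captures, here's a short demo applying it to a simplified version of one of the log lines above - each match produces a key plus two alternative value groups, exactly one of which is non-empty:&lt;/p&gt;

```python
import re

# Matches key=value pairs where the value is either double-quoted
# (group 2) or a bare run of non-whitespace (group 3).
pattern = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')

# A trimmed-down version of the NGINX-formatted Heroku log line.
line = 'status_code=404 body_bytes_sent=179 via="1.1 vegur" request_time="4.212"'

pairs = {}
for key, quoted, bare in pattern.findall(line):
    pairs[key] = quoted or bare

print(pairs)
# {'status_code': '404', 'body_bytes_sent': '179',
#  'via': '1.1 vegur', 'request_time': '4.212'}
```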
&lt;p&gt;Here's that expression visualized using &lt;a href="https://www.debuggex.com/"&gt;Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the regex visualized with debuggex" src="https://static.simonwillison.net/static/2021/debuggex-log-parser-regex.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I used that regular expression as part of a custom function passed in to the &lt;code&gt;sqlite-utils convert&lt;/code&gt; tool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils convert /tmp/logs.db log line --import re --multi "$(cat &amp;lt;&amp;lt;EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(value):
        pairs[key] = value1 or value2
    return pairs
EOD
)"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(This uses a &lt;code&gt;cat &amp;lt;&amp;lt;EOD&lt;/code&gt; trick to avoid having to figure out how to escape the single and double quotes in the Python code for usage in a zsh shell command.)&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;--multi&lt;/code&gt; here created new columns for each of the key/value pairs seen in that log file.&lt;/p&gt;
&lt;p&gt;One last step: convert the types. The new columns are all of type &lt;code&gt;text&lt;/code&gt; but I want to do sorting and arithmetic on them so I need to convert them to integers and floats. I used &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#transforming-tables"&gt;sqlite-utils transform&lt;/a&gt; for that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils transform /tmp/logs.db log \
    --type 'measure#nginx.service' float \
    --type 'status_code' integer \
    --type 'body_bytes_sent' integer \
    --type 'request_time' float \
    --type 'upstream_response_time' float \
    --type 'upstream_connect_time' float \
    --type 'upstream_header_time' float
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://lite.datasette.io/?url=https://gist.githubusercontent.com/simonw/3454951e23cab709da42d25520dd78cf/raw/3383a16cd1f423d39c9c6923b6b37a3e74c4f148/logs.db#/logs/log"&gt;resulting log table&lt;/a&gt; (in Datasette Lite).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Datasette showing the log table" src="https://static.simonwillison.net/static/2021/performance-logs.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Once the logs were in Datasette, the problem quickly became apparent when I &lt;a href="https://lite.datasette.io/?url=https://gist.githubusercontent.com/simonw/3454951e23cab709da42d25520dd78cf/raw/3383a16cd1f423d39c9c6923b6b37a3e74c4f148/logs.db#/logs/log?_sort_desc=request_time"&gt;sorted by request_time&lt;/a&gt;: an army of search engine crawlers were hitting deep linked filters in &lt;a href="https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/"&gt;my faceted search engine&lt;/a&gt;, like &lt;code&gt;/search/?tag=geolocation&amp;amp;tag=offlineresources&amp;amp;tag=canvas&amp;amp;tag=javascript&amp;amp;tag=performance&amp;amp;tag=dragndrop&amp;amp;tag=crossdomain&amp;amp;tag=mozilla&amp;amp;tag=video&amp;amp;tag=tracemonkey&amp;amp;year=2009&amp;amp;type=blogmark&lt;/code&gt;. These are expensive pages to generate! They're also very unlikely to be in my Cloudflare cache.&lt;/p&gt;
&lt;p&gt;Could the answer be as simple as a &lt;code&gt;robots.txt&lt;/code&gt; rule blocking access to &lt;code&gt;/search/&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/simonwillisonblog/commit/4c0de5b9f01bb16fc89c587128a276055b0033bb"&gt;shipped that change&lt;/a&gt; and waited a few hours to see what the impact would be:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heroku metrics showing a dramatic improvement after the deploy, and especially about 8 hours later" src="https://static.simonwillison.net/static/2021/robots-txt-effect.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It took a while for the crawlers to notice that my &lt;code&gt;robots.txt&lt;/code&gt; had changed, but by 8 hours later my site performance was dramatically improved - I'm now seeing 99th percentile of around 450ms, compared to 25 seconds before I shipped the &lt;code&gt;robots.txt&lt;/code&gt; change!&lt;/p&gt;
&lt;p&gt;With this latest addition, &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; has evolved into a powerful tool for importing, cleaning and re-shaping data - especially when coupled with Datasette in order to explore, analyze and publish the results.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/vscode/vs-code-regular-expressions"&gt;Search and replace with regular expressions in VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/codespell"&gt;Check spelling using codespell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/imagemagick/set-a-gif-to-loop"&gt;Set a GIF to loop using ImageMagick&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sqlite-aggregate-filter-clauses"&gt;SQLite aggregate filter clauses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/imagemagick/compress-animated-gif"&gt;Compressing an animated GIF with ImageMagick mogrify&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-transform"&gt;sqlite-transform&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-transform/releases/tag/1.2.1"&gt;1.2.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-transform/releases"&gt;10 releases total&lt;/a&gt;) - 2021-08-02
&lt;br /&gt;Tool for running transformations on columns in a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.14"&gt;3.14&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;82 releases total&lt;/a&gt;) - 2021-08-02
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-json-html"&gt;datasette-json-html&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-json-html/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-json-html/releases"&gt;6 releases total&lt;/a&gt;) - 2021-07-31
&lt;br /&gt;Datasette plugin for rendering HTML based on JSON values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.0.2"&gt;1.0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-fly/releases"&gt;5 releases total&lt;/a&gt;) - 2021-07-30
&lt;br /&gt;Datasette plugin for publishing data using Fly&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="performance"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="data-science"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>The data team: a short story</title><link href="https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag" rel="alternate"/><published>2021-07-08T23:12:59+00:00</published><updated>2021-07-08T23:12:59+00:00</updated><id>https://simonwillison.net/2021/Jul/8/the-data-team-a-short-story/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://erikbern.com/2021/07/07/the-data-team-a-short-story.html"&gt;The data team: a short story&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=27777594"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leadership"&gt;leadership&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="data-science"/><category term="leadership"/></entry><entry><title>Group thousands of similar spreadsheet text cells in seconds</title><link href="https://simonwillison.net/2021/Jun/27/similar-text-cells/#atom-tag" rel="alternate"/><published>2021-06-27T16:24:38+00:00</published><updated>2021-06-27T16:24:38+00:00</updated><id>https://simonwillison.net/2021/Jun/27/similar-text-cells/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d"&gt;Group thousands of similar spreadsheet text cells in seconds&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Luke Whyte explains how to efficiently group similar text values in a table column (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://github.com/lukewhyte/textpack"&gt;lukewhyte/textpack&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="data-science"/></entry><entry><title>What I've learned about data recently</title><link href="https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag" rel="alternate"/><published>2021-06-22T17:09:07+00:00</published><updated>2021-06-22T17:09:07+00:00</updated><id>https://simonwillison.net/2021/Jun/22/what-ive-learned-about-data-recently/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://seldo.com/posts/what-i-ve-learned-about-data-recently"&gt;What I&amp;#x27;ve learned about data recently&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/seldo/status/1407370508576780290"&gt;@seldo&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laurie-voss"&gt;laurie-voss&lt;/a&gt;&lt;/p&gt;



</summary><category term="data"/><category term="big-data"/><category term="data-science"/><category term="laurie-voss"/></entry><entry><title>Defining Data Intuition</title><link href="https://simonwillison.net/2020/Oct/29/defining-data-intuition/#atom-tag" rel="alternate"/><published>2020-10-29T15:14:28+00:00</published><updated>2020-10-29T15:14:28+00:00</updated><id>https://simonwillison.net/2020/Oct/29/defining-data-intuition/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.harterrt.com/data_intuition.html"&gt;Defining Data Intuition&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ryan T. Harter, Principal Data Scientist at Mozilla, defines data intuition as “a resilience to misleading data and analyses”. He also introduces the term “data-stink”, by analogy with “code smell”: your intuition should lead you to distrust analysis that exhibits certain characteristics without first digging in further. I strongly believe that data reports should include a link to the raw methodology and numbers to ensure they can be more easily vetted, so that data-stink can be investigated with the least amount of resistance.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="mozilla"/><category term="data-science"/></entry><entry><title>Announcing the Consortium for Python Data API Standards</title><link href="https://simonwillison.net/2020/Aug/19/announcing-consortium-python-data-api-standards/#atom-tag" rel="alternate"/><published>2020-08-19T05:48:11+00:00</published><updated>2020-08-19T05:48:11+00:00</updated><id>https://simonwillison.net/2020/Aug/19/announcing-consortium-python-data-api-standards/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://data-apis.org/blog/announcing_the_consortium/"&gt;Announcing the Consortium for Python Data API Standards&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting effort to unify the fragmented DataFrame API ecosystem, where increasing numbers of libraries offer APIs inspired by Pandas that imitate each other but aren’t 100% compatible. The announcement includes some very clever code to support the effort: custom tooling to compare the existing APIs, and an ingenious GitHub Actions setup to run traces (via sys.settrace), derive type signatures and commit those generated signatures back to a repository.
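The sys.settrace part of that is fun to demonstrate in miniature. This is a hedged sketch of the general technique (record the argument types actually seen at runtime), not the Consortium’s tooling:

```python
import sys
from collections import defaultdict

# Map of function name -> set of argument-type tuples observed at call time
traced = defaultdict(set)

def tracer(frame, event, arg):
    if event == "call":
        code = frame.f_code
        # Positional argument names live at the front of co_varnames
        argtypes = tuple(
            type(frame.f_locals[name]).__name__
            for name in code.co_varnames[:code.co_argcount]
        )
        traced[code.co_name].add(argtypes)
    return None  # no line-level tracing needed

def add(a, b):
    return a + b

sys.settrace(tracer)
add(1, 2)
add("x", "y")
sys.settrace(None)

print(sorted(traced["add"]))  # [('int', 'int'), ('str', 'str')]
```

Run this over a library’s test suite instead of two toy calls and you get exactly the kind of observed type signatures the Consortium is committing back to a repository.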

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ralfgommers/status/1295296141387599879"&gt;@ralfgommers&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="standards"/><category term="data-science"/><category term="github-actions"/></entry><entry><title>Quoting GPT-3</title><link href="https://simonwillison.net/2020/Jun/29/gpt-3-shepherded-max-woolf/#atom-tag" rel="alternate"/><published>2020-06-29T04:45:49+00:00</published><updated>2020-06-29T04:45:49+00:00</updated><id>https://simonwillison.net/2020/Jun/29/gpt-3-shepherded-max-woolf/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/minimaxir/status/1277436629368668160"&gt;&lt;p&gt;Data Science is a lot like Harry Potter, except there's no magic, it's just math, and instead of a sorting hat you just sort the data with a Python script.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/minimaxir/status/1277436629368668160"&gt;GPT-3&lt;/a&gt;, shepherded by Max Woolf&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="data-science"/><category term="max-woolf"/></entry><entry><title>Data science is different now</title><link href="https://simonwillison.net/2019/Feb/15/data-science-different-now/#atom-tag" rel="alternate"/><published>2019-02-15T15:36:11+00:00</published><updated>2019-02-15T15:36:11+00:00</updated><id>https://simonwillison.net/2019/Feb/15/data-science-different-now/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/"&gt;Data science is different now&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Detailed examination of the current state of the job market for data science. Boot camps and university courses have produced a growing volume of junior data scientists seeking work, but the job market is much more competitive than many expected—especially for those without prior experience. Meanwhile the job itself is much more about data cleanup and software engineering skills: machine learning models and applied statistics end up being a small portion of the actual work.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/vboykis/status/1096032872153133056"&gt;@vboykis&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-science"/></entry><entry><title>Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces</title><link href="https://simonwillison.net/2018/Dec/11/real-world-data-science/#atom-tag" rel="alternate"/><published>2018-12-11T20:51:19+00:00</published><updated>2018-12-11T20:51:19+00:00</updated><id>https://simonwillison.net/2018/Dec/11/real-world-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://minimaxir.com/2018/10/data-science-protips/"&gt;Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really good article, pointing out that carefully optimizing machine learning models is only a small part of the day-to-day work of a data scientist: cleaning up data, building dashboards, shipping models to production, deciding on trade-offs between model performance and production constraints, and considering the product design and ethical implications of what you are doing make up a much larger portion of the job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=18651463"&gt;minimaxir&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/max-woolf"&gt;max-woolf&lt;/a&gt;&lt;/p&gt;



</summary><category term="ethics"/><category term="data-science"/><category term="max-woolf"/></entry><entry><title>Serverless for data scientists</title><link href="https://simonwillison.net/2018/Aug/25/serverless-data-scientists/#atom-tag" rel="alternate"/><published>2018-08-25T23:01:09+00:00</published><updated>2018-08-25T23:01:09+00:00</updated><id>https://simonwillison.net/2018/Aug/25/serverless-data-scientists/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mike.place/talks/serverless/"&gt;Serverless for data scientists&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Slides and accompanying notes from a talk by Mike Lee Williams at PyBay, providing an overview of Zappa and diving a bit more deeply into pywren, which makes it trivial to parallelize a function across a set of AWS Lambda instances (essentially serverless Python map() execution). I really like this format for sharing presentations; I used something similar for my own PyBay talk.
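pywren’s documented shape is roughly an executor object with a map() method. Here’s a sketch of that pattern, with the standard library’s ThreadPoolExecutor standing in for the Lambda fleet since pywren itself needs AWS credentials to run:

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(x):
    # Stand-in for an expensive per-item computation
    return x * x

# With pywren this would be roughly:
#   pwex = pywren.default_executor()
#   futures = pwex.map(featurize, range(8))
#   results = pywren.get_all_results(futures)
# Locally, the same map() shape with threads:
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(featurize, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The appeal for data science work is that the map() abstraction stays the same while the executor swaps out: prototype against threads, then point the same function at a few thousand Lambda instances.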

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mikepqr/status/1031231759793319936"&gt;@mikepqr&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/amazonaws"&gt;amazonaws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/serverless"&gt;serverless&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="amazonaws"/><category term="serverless"/><category term="data-science"/></entry><entry><title>Computational and Inferential Thinking: The Foundations of Data Science</title><link href="https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag" rel="alternate"/><published>2018-08-25T22:13:51+00:00</published><updated>2018-08-25T22:13:51+00:00</updated><id>https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.inferentialthinking.com/"&gt;Computational and Inferential Thinking: The Foundations of Data Science&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Free online textbook written for the UC Berkeley Foundations of Data Science class. The examples are all provided as Jupyter notebooks, using the mybinder web application to allow students to launch interactive notebooks for any of the examples without having to install any software on their own machines.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/education"&gt;education&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="education"/><category term="jupyter"/><category term="data-science"/></entry><entry><title>Beginner's Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!)</title><link href="https://simonwillison.net/2018/May/24/jupyter-notebooks/#atom-tag" rel="alternate"/><published>2018-05-24T13:58:15+00:00</published><updated>2018-05-24T13:58:15+00:00</updated><id>https://simonwillison.net/2018/May/24/jupyter-notebooks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2018/05/starters-guide-jupyter-notebook/"&gt;Beginner&amp;#x27;s Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
If you haven’t yet got on the Jupyter notebooks bandwagon this should help. It’s the single biggest productivity improvement I’ve made to my workflow in a very long time.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mikeloukides/status/999612877681102848"&gt;Mike Loukides&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="data-science"/></entry></feed>