Simon Willison’s Weblog

On json, github, jupyter, javascript, opensource, ...

 

Recent entries

Notes from my appearance on the Changelog podcast 11 days ago

After I spoke at Zeit Day SF last weekend I sat down with Adam Stacoviak to record a 25 minute segment for episode 296 of the Changelog podcast, talking about Datasette. We covered a lot of ground!

A transcript should be available soon (the Changelog have excellent transcripts) but in the meantime here are my own notes on the topics that we covered.

My section starts at 54:45 in the episode.

From the intro

Datasette was inspired by my work at the Guardian with Simon Rogers, where we started the Guardian Data Blog.

Here’s Datasette on GitHub, and Datasette: instantly create and publish an API for your SQLite databases where I first announced the project.

My sf-trees.com website, which is powered by this Datasette API using data from the San Francisco Department of Public Works.

The hosting platform I use for immutable deployments of Docker containers is Zeit Now.

The Datasette tool suite

Here’s the full Datasette suite of tools:

  • csvs-to-sqlite is a command-line tool for converting CSV files into a SQLite database.
  • Datasette is a web application for exploring SQLite databases and interacting with them via a read-only JSON API (there’s a small sketch of consuming that API just after this list). It also includes the datasette publish command-line tool for publishing those databases using Zeit Now or Heroku.
  • Datasette Publish is a web app that lets you upload CSV files and deploy them to a new Datasette instance without installing any software. Here’s my announcement post.
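
One nice property of that JSON API is that any script can consume a published database over HTTP. Here’s a minimal sketch in Python; the instance URL, database and table names are hypothetical placeholders, and ?_shape=array (a Datasette 0.21 feature) asks for a plain JSON array of rows:

import json
from urllib.request import urlopen

# Hypothetical Datasette instance, database and table:
url = "https://example-datasette.now.sh/mydb/mytable.json?_shape=array"
rows = json.loads(urlopen(url).read().decode("utf-8"))
for row in rows[:5]:
    print(row)  # each row is a dict keyed by column name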

We talked a lot about SQLite. Here’s a document describing How SQLite is tested and a bunch more about SQLite from my blog.

Datasette examples

A few more fun examples of datasettes that we discussed:

Tools for manipulating data

I’m a huge fan of the combination of Jupyter notebooks and Pandas. Here’s a collection of notes on jupyter and pandas from my blog.

Contributing to Datasette

The best thing you can do is use Datasette to publish your data and then give me feedback!

If you want to get involved in the code, take a look at the help wanted label on GitHub or have a look at the open issues for csvs-to-sqlite.

Even better: write plugins! Datasette’s plugin architecture is just getting started so the more experience and feedback we can get with it the better.

And finally… I mentioned US Navy Airship and Zeppelin photos. I also have a Pinterest Board.

Exploring the UK Register of Members Interests with SQL and Datasette 25 days ago

Ever wondered which UK Members of Parliament get gifted the most helicopter rides? How about which MPs have been given Christmas hampers by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.

Gifts from the Sultan

mySociety have been building incredible civic participation applications like TheyWorkForYou and FixMyStreet for nearly 15 years now, and have accumulated all kinds of interesting data along the way.

They recently launched their own data portal at data.mysociety.org listing all of the information they have available. While exploring it I stumbled across their copy of the UK Register of Members Interests. Every UK Member of Parliament has to register their conflicts of interest and income sources, and mySociety have an ongoing project to parse that data into a more useful format.

It won’t surprise you to hear that I couldn’t resist turning their XML files into a SQLite database.

The result is register-of-members-interests.datasettes.com—a Datasette instance running against a SQLite database containing over 1.3 million line-items registered by 1,419 MPs over the course of 18 years.

Some fun queries

A few of my favourites so far:

Understanding the data model

Most of the action takes place in the items table, where each item is a line-item from an MP’s filing. You can search that table by keyword (see helicopter example above) or apply filters to it using the standard Datasette interface. You can also execute your own SQL directly against the database.

Each item is filed against a category. There appears to have been quite a bit of churn in the way that the categories are defined over the years, plus the data is pretty untidy—there are no less than 10 ways of spelling “Remunerated employment, office, profession etc.” for example!

Categories

There are also a LOT of duplicate items in the set—it appears that MPs frequently list the same item (a rental property for example) every time they fill out the register. SQL DISTINCT clauses can help filter through these, as seen in some of the above examples.

The data also has the concepts of both members and people. As far as I can tell people are distinct, but members may contain duplicates—presumably to represent MPs who have served more than one term in office. It looks like the member field stopped being populated in March 2015 so analysis is best performed against the people table.

One concept I have introduced myself is the record_id. In the XML documents the items are often grouped together into a related collection, like this:

<regmem personid="uk.org.publicwhip/person/10001"
    memberid="uk.org.publicwhip/member/40289" membername="Diane Abbott" date="2014-07-14">
    <category type="2" name="Remunerated employment, office, profession etc">
        <item>Payments from MRL Public Sector Consultants, Pepple House, 8 Broad Street, Great Cambourne, Cambridge CB23 6HJ:</item>
        <item>26 November 2013, I received a fee of £1,000 for speaking at the 1st African Legislative Summit, National Assembly, Abuja, Nigeria.  Hours: 8 hrs. The cost of my flights, transfers and hotel accommodation in Abuja were also met; estimated value £5,000. <em>(Registered 3 December 2013)</em></item>
        <item>23 July 2013, I received a fee of £5,000 for appearing as a contestant on ITV&#8217;s &#8216;The Chase Celebrity &#8211; Series 3&#8217; television programme.  Address of payer:  ITV Studios Ltd, London Television Centre, Upper Ground, London SE1 9Lt.  Hours: 12 hrs.   <em>(Registered 23 July 2013)</em></item>
    </category>
</regmem>

While these items are presented as separate line items, their grouping carries meaning: the first line item here acts as a kind of heading to help provide context to the other items.

To model this in the simplest way possible, I’ve attempted to preserve the order of these groups using a pair of additional columns: the record_id and the sort_order. I construct the record_id using a collection of other fields—the idea is for it to be sortable, and for each line-item in the same grouping to have the same record_id:

record_id = "{date}-{category_id}-{person_id}-{record}".format(
    date=date,
    category_id=category_id,
    person_id=person_id.split("/")[-1],
    record=record,
)

The resulting record_id might look like this: 2018-04-16-70b64e89-24878-0

To recreate that particular sequence of line-items, you can search for all items matching that record_id and then sort them by their sort_order. Here’s that record from Diane Abbott shown with its surrounding context.

A single record
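
If you have a local copy of the SQLite file, that lookup is a couple of lines of Python with the sqlite3 module (the filename here is a hypothetical placeholder; the record_id is the example from above):

import sqlite3

conn = sqlite3.connect("regmem.db")  # hypothetical local copy of the database
rows = conn.execute(
    """
    SELECT sort_order, item
    FROM items
    WHERE record_id = ?
    ORDER BY sort_order
    """,
    ("2018-04-16-70b64e89-24878-0",),
).fetchall()
for sort_order, item in rows:
    print(sort_order, item)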

How I built it

The short version: I downloaded all of the XML files and wrote a Python script which parsed them using ElementTree and inserted them into a SQLite database. I’ve put the code on GitHub.
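
As a rough sketch of what that parsing step looks like (simplified, with hypothetical dictionary keys; the real script also derives the record_id and category identifiers described above):

import xml.etree.ElementTree as ET

def parse_items(path):
    # Each file contains <regmem> elements like the example shown earlier
    root = ET.parse(path).getroot()
    regmems = [root] if root.tag == "regmem" else root.findall(".//regmem")
    for regmem in regmems:
        person_id = regmem.attrib["personid"]
        person_name = regmem.attrib["membername"]
        date = regmem.attrib["date"]
        for category in regmem.findall("category"):
            category_name = category.attrib.get("name", "")
            for sort_order, item_el in enumerate(category.findall("item")):
                yield {
                    "person_id": person_id,
                    "person_name": person_name,
                    "date": date,
                    "category_name": category_name,
                    "sort_order": sort_order,
                    # Flatten nested markup such as <em> into plain text
                    "item": "".join(item_el.itertext()).strip(),
                }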

A couple of fun tricks: firstly, I borrowed some code from csvs-to-sqlite to create the full-text search index and enable searching:

def create_and_populate_fts(conn):
    create_sql = """
        CREATE VIRTUAL TABLE "items_fts"
        USING {fts_version} (item, person_name, content="items")
    """.format(
        fts_version=best_fts_version()
    )
    conn.executescript(create_sql)
    conn.executescript(
        """
        INSERT INTO "items_fts" (rowid, item, person_name)
        SELECT items.rowid, items.item, people.name
        FROM items LEFT JOIN people ON items.person_id = people.id
    """
    )

The best_fts_version() function implements basic feature detection against SQLite by trying operations in an in-memory database.
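
The effect is roughly equivalent to this sketch (the real implementation lives in csvs-to-sqlite, this just illustrates the idea):

import sqlite3

def best_fts_version():
    # Try each FTS module in an in-memory database, newest first
    conn = sqlite3.connect(":memory:")
    for fts in ("FTS5", "FTS4", "FTS3"):
        try:
            conn.execute("CREATE VIRTUAL TABLE v USING {}(t);".format(fts))
            conn.execute("DROP TABLE v;")
            return fts
        except sqlite3.OperationalError:
            continue
    return None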

Secondly, I ended up writing my own tiny utility function for inserting records into SQLite. SQLite has a useful INSERT OR REPLACE INTO syntax which lets you insert a record, automatically replacing any existing record that matches on the primary key. This meant a single utility function could handle all of my data inserts:

def insert_or_replace(conn, table, record):
    pairs = record.items()
    columns = [p[0] for p in pairs]
    params = [p[1] for p in pairs]
    sql = "INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});".format(
        table=table,
        column_list=", ".join(columns),
        value_list=", ".join(["?" for p in params]),
    )
    conn.execute(sql, params)

# ...

insert_or_replace(
    db,
    "people",
    {
        "id": person_id,
        "name": regmem_el.attrib["membername"],
    },
)

What can you find?

I’ve really only scratched the surface of what’s in here with my initial queries. What can you find? Send me Datasette query links on Twitter with your discoveries!

Datasette plugins, and building a clustered map visualization one month ago

Datasette now supports plugins!

Last Saturday I asked Twitter for examples of Python projects with successful plugin ecosystems. pytest was the clear winner: the pytest plugin compatibility table (an ingenious innovation that I would love to eventually copy for Datasette) lists 457 plugins, and even the core pytest system itself is built as a collection of default plugins that can be replaced or overridden.

Best of all: pytest’s plugin mechanism is available as a separate package: pluggy. And pluggy was exactly what I needed for Datasette.

You can follow the ongoing development of the feature in issue #14. This morning I released Datasette 0.20 with support for a number of different plugin hooks: plugins can add custom template tags and SQL functions, and can also bundle their own static assets, JavaScript, CSS and templates. The hooks are described in some detail in the Datasette Plugins documentation.

datasette-cluster-map

I also released my first plugin: datasette-cluster-map. Once installed, it looks out for database tables that have a latitude and longitude column. When it finds them, it draws all of the points on an interactive map using Leaflet and Leaflet.markercluster.

Let’s try it out on some polar bears!

Polar Bears on a cluster map

The USGS Alaska Science Center have released a delightful set of data entitled Sensor and Location data from Ear Tag PTTs Deployed on Polar Bears in the Southern Beaufort Sea 2009 to 2011. It’s a collection of CSV files, which means it’s trivial to convert it to SQLite using my csvs-to-sqlite tool.

Having created the SQLite database, we can deploy it to a hosting account on Zeit Now alongside the new plugin like this:

# Make sure we have the latest datasette
pip3 install datasette --upgrade
# Deploy polar-bears.db to now with an increased default page_size
datasette publish now \
    --install=datasette-cluster-map \
    --extra-options "--page_size=500" \
    polar-bears.db

The --install option is new in Datasette 0.20 (it works for datasette publish heroku as well)—it tells the publishing provider to pip install the specified package. You can use it more than once to install multiple plugins, and it accepts a path to a zip file in addition to the name of a PyPI package.

Explore the full demo at https://datasette-cluster-map-demo.now.sh/polar-bears

Visualize any query on a map

Since the plugin inserts itself at the top of any Datasette table view with latitude and longitude columns, there are all sorts of neat tricks you can do with it.

I also loaded the San Francisco tree list (thanks, Department of Public Works) into the demo. Impressively, you can click “load all” on this page and Leaflet.markercluster will load in all 189,144 points and display them on the same map… and it works fine on my laptop and my phone. Computers in 2018 are pretty good!

But since it’s a Datasette table, we can filter it. Here’s a map of every New Zealand Xmas Tree in San Francisco (8,683 points). Here’s every tree where the Caretaker is Friends of the Urban Forest. Here’s every palm tree planted in 1990:

Palm trees planted in 1990

Update: This is an incorrect example: there are 21 matches on “palm avenue” because the FTS search index covers the address field—they’re not actually palm trees. Here’s a corrected query for palm trees planted in 1990.

The plugin currently only works against columns called latitude and longitude… but if your columns are called something else, don’t worry: you can craft a custom SQL query that aliases your columns and everything will work as intended. Here’s an example against some more polar bear data:

select *, "Capture Latitude" as latitude, "Capture Longitude" as longitude
from [USGS_WC_eartag_deployments_2009-2011]

Writing your own plugins

I’m really excited to see what people invent. If you want to have a go, your first stop should be the Plugins documentation. If you want an example of a simple plugin (including the all-important mechanism for packaging it up using setup.py) take a look at datasette-cluster-map on GitHub.
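
To give a sense of the shape of a plugin, here’s a minimal sketch of a single-module plugin (with the hypothetical name datasette_random_number) that uses the prepare_connection hook to register a custom SQL function. Packaging it up is then a matter of a small setup.py that registers the module as a datasette entry point, which is exactly what the datasette-cluster-map repository demonstrates:

import random

from datasette import hookimpl


@hookimpl
def prepare_connection(conn):
    # Makes random_integer(low, high) available in any SQL query
    conn.create_function(
        "random_integer", 2, lambda low, high: random.randint(low, high)
    )

Once installed, select random_integer(1, 100) works in Datasette’s SQL editor.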

And if you have any thoughts, ideas or suggestions on how the plugin mechanism can be further employed please join the conversation on issue #14. I’ve literally just got started with Datasette’s plugin hooks, and I’m very keen to hear about things people want to build that aren’t yet supported.

Building a combined stream of recent additions using the Django ORM one month ago

I’m a big believer in the importance of a “recent additions” feed. Any time you’re building an application that involves users adding and editing records it’s useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.

I’ve used a number of techniques to build these in the past—from an extra database table (e.g. the Django Admin’s LogEntry model) to a Solr or Elasticsearch index that exists just to serve recent additions.

For a recent small project I found myself needing a recent additions feed and realized that there’s a new, simple way to build one thanks to the QuerySet.union() method introduced in Django 1.11 back in April 2017.

Consider a number of different ORM models that can be added by users, each with a created timestamp field.

Prior to QuerySet.union(), building a combined recent additions feed across multiple models was difficult: it’s easy to show recent additions for a single model, but how can we intersperse and paginate additions made to models stored across more than one table?

Using .union() to combine records from different models

Consider the following three models:

from django.db import models


class Project(models.Model):
    name = models.CharField(max_length=128)
    description = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Image(models.Model):
    project = models.ForeignKey(
        Project, related_name='images', on_delete=models.CASCADE
    )
    image = models.ImageField()
    created = models.DateTimeField(auto_now_add=True)

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

Let’s build a single QuerySet that returns objects from all three models ordered by their created dates, most recent first.

Using .values() we can reduce these different models to a common subset of fields, which we can then .union() together like so:

recent = Project.objects.values(
    'pk', 'created'
).union(
    Image.objects.values('pk', 'created'),
    Comment.objects.values('pk', 'created'),
).order_by('-created')[:4]

Now if we print out list(recent) it will look something like this:

[{'created': datetime.datetime(2018, 3, 24, 1, 27, 23, 625195, tzinfo=<UTC>),
  'pk': 28},
 {'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15},
 {'created': datetime.datetime(2018, 3, 23, 20, 14, 3, 31648, tzinfo=<UTC>),
  'pk': 5},
 {'created': datetime.datetime(2018, 3, 23, 18, 57, 36, 585376, tzinfo=<UTC>),
  'pk': 11}]

We’ve successfully combined recent additions from three different tables! Here’s what the SQL for that looks like:

>>> from django.db import connection
>>> print(connection.queries[-1]['sql'])
(SELECT "myapp_project"."id", "myapp_project"."created" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created" FROM "myapp_comment")

There’s just one problem: we got back a bunch of pk and created records, but we don’t know which model each of those rows represents.

Using .annotate() to add a type constant to the rows

We can fix this by using Django’s annotate() method combined with a Value() object to attach a constant string to each record specifying the type of the row it represents. Here’s how to do that for a single model:

>>> from django.db.models import Value, CharField
>>> list(Image.objects.annotate(
...     type=Value('image', output_field=CharField())
... ).values('pk', 'type', 'created')[:2])
[{'created': datetime.datetime(2018, 3, 22, 17, 16, 33, 964900, tzinfo=<UTC>),
  'pk': 3,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 22, 17, 49, 47, 527907, tzinfo=<UTC>),
  'pk': 4,
  'type': 'image'}]

We’ve added the key/value pair 'type': 'image' to every record returned from the queryset. Now let’s do that to all three of our models and combine the results using .union():

recent = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values(
    'pk', 'created', 'type'
).union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')[:4]

If we run list(recent) we get this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'pk': 14,
  'type': 'comment'}]

This is looking pretty good! We’ve successfully run a single SQL UNION query across three different tables and returned the combined results in reverse chronological order. Thanks to the type column we know which model each record corresponds to.

Inflating the full referenced objects

Now we need to inflate those primary key references into full ORM objects from their corresponding tables.

The most efficient way to do this is to collect together the IDs for each type and then run a single SQL query per type to load the full objects.

Here’s code that does exactly that: it first collects the list of primary keys that need to be loaded for each type, then executes an efficient SQL IN query against each type to fetch the underlying objects:

records = list(recent)

type_to_queryset = {
    'image': Image.objects.all(),
    'comment': Comment.objects.all(),
    'project': Project.objects.all(),
}

# Collect the pks we need to load for each type:
to_load = {}
for record in records:
    to_load.setdefault(record['type'], []).append(record['pk'])

# Fetch them 
fetched = {}
for type, pks in to_load.items():
    for object in type_to_queryset[type].filter(pk__in=pks):
        fetched[(type, object.pk)] = object

# Annotate 'records' with loaded objects
for record in records:
    key = (record['type'], record['pk'])
    record['object'] = fetched[key]

After running the above code, records looks like this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

We can now feed this to a template and use it to render our recent additions page.

Wrapping it in a re-usable function

Here’s a function that implements the above in a re-usable way:

from django.db.models import CharField, Value


def combined_recent(limit, **kwargs):
    datetime_field = kwargs.pop('datetime_field', 'created')
    querysets = []
    for key, queryset in kwargs.items():
        querysets.append(
            queryset.annotate(
                recent_changes_type=Value(
                    key, output_field=CharField()
                )
            ).values('pk', 'recent_changes_type', datetime_field)
        )
    union_qs = querysets[0].union(*querysets[1:])
    records = []
    for row in union_qs.order_by('-{}'.format(datetime_field))[:limit]:
        records.append({
            'type': row['recent_changes_type'],
            'when': row[datetime_field],
            'pk': row['pk']
        })
    # Now we bulk-load each object type in turn
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])
    fetched = {}
    for key, pks in to_load.items():
        for item in kwargs[key].filter(pk__in=pks):
            fetched[(key, item.pk)] = item
    # Annotate 'records' with loaded objects
    for record in records:
        record['object'] = fetched[(record['type'], record['pk'])]
    return records

This is also available as a gist.

I can now use that function to combine arbitrary querysets (provided they share a created datestamp field) like so:

recent = combined_recent(
    20,
    project=Project.objects.all(),
    image=Image.objects.all(),
    comment=Comment.objects.all(),
)

This will return the most recent 20 records across all three types, with the results looking like this:

[{'when': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'when': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'when': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'when': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

Efficient object loading with select/prefetch_related

If you’re going to render these objects on a page, it’s pretty likely you’ll need to load additional data about them. My example models above are deliberately simplified, but in any serious Django project it’s likely they will have additional references to other tables.

We can apply Django’s magic select_related() and prefetch_related() methods directly to the querysets we pass to the function, like so:

recent = combined_recent(
    20,
    project=Project.objects.all().prefetch_related('tags'),
    image=Image.objects.all().select_related('uploaded_by'),
    comment=Comment.objects.all().select_related('author'),
)

Django’s query optimizer is smart enough to ignore those calls entirely when building the initial union queries, so even with the above extras the initial union query will still look like this:

(SELECT "myapp_project"."id", "myapp_project"."created", 'project' AS "recent_changes_type" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created", 'image' AS "recent_changes_type" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created", 'comment' AS "recent_changes_type" FROM "myapp_comment")
ORDER BY (2) DESC LIMIT 20

The select_related() and prefetch_related() clauses will then be incorporated into the subsequent SQL queries that are used to efficiently inflate the full objects from the database.

Taking it further

There are a bunch of interesting extensions that can be made to this pattern.

Want pagination? The initial unioned queryset can be paginated using offset/limit by slicing the queryset, or using the Django Paginator class.
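
Here’s a minimal sketch of both approaches, using the same combined queryset built earlier (just without the [:4] slice):

from django.core.paginator import Paginator
from django.db.models import CharField, Value

union_qs = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values('pk', 'created', 'type').union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')

page_2 = list(union_qs[20:40])       # plain offset/limit via slicing

paginator = Paginator(union_qs, 20)  # Paginator only needs COUNT(*) and
page = paginator.page(2)             # slicing, both supported after union()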

Want more efficient pagination (since offset/limit tends to get slow after the first few thousand rows)? We’re ordering by created already which means it’s not difficult to build efficient range-based pagination, requesting all records where the created date is less than the earliest date seen on the previous page.
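
A sketch of that idea, reusing the combined_recent() function from above: apply the range filter to each queryset before it is combined (the filters have to go on the individual querysets, since they can’t be applied after the union):

def recent_page(before=None, page_size=20):
    querysets = {
        'project': Project.objects.all(),
        'image': Image.objects.all(),
        'comment': Comment.objects.all(),
    }
    if before is not None:
        # Only fetch records older than the last one on the previous page
        querysets = {
            key: qs.filter(created__lt=before)
            for key, qs in querysets.items()
        }
    return combined_recent(page_size, **querysets)

page_1 = recent_page()
page_2 = recent_page(before=page_1[-1]['when'])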

Since everything is based on regular Django querysets, it’s possible to build all kinds of variants of the recent additions feed. So far we’ve just built one showing all changes across an entire application, but it’s not hard to apply additional filters to only show changes made by a specific user, or changes made relating to a specific foreign key relationship. If you can represent it as a collection of querysets that each expose a created column you can combine them into a single feed.
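
For example, here’s the same feed scoped to a single user, simply by pre-filtering each queryset before it is passed in (the user-related fields here are hypothetical):

recent_for_user = combined_recent(
    20,
    project=Project.objects.filter(created_by=user),  # hypothetical user fields
    image=Image.objects.filter(uploaded_by=user),
    comment=Comment.objects.filter(author=user),
)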

You don’t even need to use records that share a created column: if you have objects with columns of differing names you can use an annotation to alias those columns, like so:

recent = combined_recent(
    20,
    project=Project.objects.annotate(
        when=models.F('updated')
    ).prefetch_related('tags'),
    image=Image.objects.annotate(
        when=models.F('uploaded_at')
    ).select_related('uploaded_by'),
    comment=Comment.objects.annotate(
        when=models.F('commented_at')
    ).select_related('created_by'),
    datetime_field='when'
)

I haven’t extensively load-tested this pattern, but I expect it will work fine for databases with tens of thousands of records but may start running into trouble if you have millions of records (though an index on the created column should help a lot). If you need a recent additions feed at a larger scale than that you should probably look at a separate logging table or an external index in something like Elasticsearch instead.
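
If you do add that index, Django’s Meta.indexes option (available since 1.11) is the simplest way to declare it. A sketch for one of the models above:

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [models.Index(fields=['created'])]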

For another interesting thing you can do with .union() check out my article on Implementing faceted search with Django and PostgreSQL.

Elsewhere

17th May 2018

  • sqlitebiter. Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages—run “sqlitebiter -v url ’https://en.wikipedia.org/wiki/Comparison_of_firewalls’” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there. #
  • sql.js Online SQL interpreter (via) This is fascinating: sql.js is a project that compiles the whole of SQLite to JavaScript using Emscripten. The demo is an online SQL interpreter which lets you import an existing SQLite database from your filesystem and run queries against it directly in your browser. #
  • Django #8936: Add view (read-only) permission to admin (closed). Opened 10 years ago. Closed 15 hours ago. I apparently filed this issue during the first DjangoCon back in September 2008, when Adrian and Jacob mentioned on-stage that they would like to see a read-only permission for the Django Admin. Thanks to Olivier Dalang from Fiji and Petr Dlouhý from Prague it’s going to be a feature shipping in Django 2.1. Open source is a beautiful thing. #

16th May 2018

  • How to number rows in MySQL. MySQL’s user variables can be used to add a “rank” or “row_number” column to a database query that shows the ranking of a row against a specific unique value. This means you can return the first N rows for any given column—for example, given a list of articles return just the first three tags for each article. I’ve recently found myself using this trick for a few different things—once you know it, chances to use it crop up surprisingly often. #
  • isomorphic-git (via) A pure-JavaScript implementation of the git protocol and underlying tools which works both server-side (Node.js) AND in the client, using an emulation of the fs API. Given the right CORS headers it can clone a GitHub repository over HTTPS right into your browser. Impressive. #

12th May 2018

  • Datasette: Full-text search. I wrote some documentation for Datasette’s full-text search feature, which detects tables which have been configured to use the SQLite FTS module and adds a search input box and support for a _search= querystring parameter. #

11th May 2018

  • Pyre: Fast Type Checking for Python (via) Facebook’s alternative to mypy. “Pyre is designed to be highly parallel, optimizing for near-instant responses so that you get immediate feedback, even in a large codebase”. Like their Hack type checker for PHP, Pyre is implemented in OCaml. #

10th May 2018

  • The latest SQLite 3.8.7 alpha version is 50% faster than the 3.7.17 release from 16 months ago.  That is to say, it does 50% more work using the same number of CPU cycles. [...] The 50% faster number above is not about better query plans.  This is 50% faster at the low-level grunt work of moving bits on and off disk and searching b-trees.  We have achieved this by incorporating hundreds of micro-optimizations.  Each micro-optimization might improve the performance by as little as 0.05%.  If we get one that improves performance by 0.25%, that is considered a huge win.  Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.

    D. Richard Hipp #

  • The synthetic voice of synthetic intelligence should sound synthetic. Successful spoofing of any kind destroys trust. When trust is gone, what remains becomes vicious fast.

    Stewart Brand #

8th May 2018

  • mendoza-trees-workshop (via) Eventbrite Argentina has an academy program to train new Python/Django developers. I presented a workshop there this morning showing how Django and Jupyter can be used together to iterate on a project. Since the session was primarily about demonstrating Jupyter it was mostly live-coding, but the joy of Jupyter is that at the end of a workshop you can go back and add inline commentary to the notebooks that you used. In putting together the workshop I learned about the django_extensions “/manage.py shell_plus --notebook” command—it’s brilliant! It launches Jupyter in a way that lets you directly import your Django models without having to mess around with DJANGO_SETTINGS_MODULE. #

7th May 2018

  • Somebody should write up how the early-2000s push for open standards and the Web Standards Project’s advocacy are a major factor in why Apple was able to create its enormously valuable comeback. Put another way, one of the killer moments of the first iPhone demo was Jobs saying it had the “real” web, not the “baby” web, by demonstrating the NYT homepage. That would’ve been IE-only & Windows-only if not for effective advocacy from the web standards community.

    Anil Dash #

5th May 2018

  • Datasette 0.21: New _shape=, new _size=, search within columns. Nothing earth-shattering here but it’s accumulated enough small improvements that it warranted a new release. You can now send ?_shape=array to get back a plain JSON array of results, ?_size=XXX|max to get back a specific number of rows from a table view and ?_search_COLUMN=text to run full-text search against a specific column. #

3rd May 2018

  • Iodide Notebook: Project Examples (via) Iodide is a very promising looking open source JavaScript notebook project, and these examples do a great job of showing what it can do. It’s not as slick (yet) as Observable but it does run completely independently using just a browser. #