Simon Willison’s Weblog


14 items tagged “big-data”


Big Data is Dead (via) Don’t be distracted by the headline, this is well worth your time. Jordan Tigani spent ten years working on Google BigQuery, during which time he was surprised to learn that the median data storage size for regular customers was much less than 100GB. In this piece he argues that genuine Big Data solutions are relevant to a tiny fraction of companies, and there’s way more value in solving problems for everyone else. I’ve been talking about Datasette as a tool for solving “small data” problems for a while, and this article has given me a whole bunch of new arguments I can use to support that concept.

# 7th February 2023, 7:25 pm / big-data, small-data


What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

# 22nd June 2021, 5:09 pm / data, big-data, datascience, laurievoss

Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) (via) ClickHouse by Yandex is an open source column-oriented data warehouse, designed to run analytical queries against TBs of data. They’ve loaded the full GitHub Archive of events since 2011 into a public instance, which is a great way of both exploring GitHub activity and trying out ClickHouse. Here’s a query I just ran that shows number of watch events per year, for example:

SELECT toYear(created_at) AS yyyy, count() FROM github_events WHERE event_type = 'WatchEvent' GROUP BY yyyy
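
To get a feel for what that query computes without touching the public ClickHouse instance, here is a rough sketch that runs the same aggregation against a tiny in-memory SQLite table. The rows are made up for illustration, the two-column table is a simplification of the real github_events schema, and since SQLite has no toYear() function, strftime('%Y', ...) stands in for it.

```python
import sqlite3

# Hypothetical toy rows; the real github_events table is vastly larger
# and has many more columns than the two used here.
ROWS = [
    ("2019-03-01 10:00:00", "WatchEvent"),
    ("2019-07-15 12:30:00", "WatchEvent"),
    ("2019-08-02 09:00:00", "PushEvent"),
    ("2020-01-20 18:45:00", "WatchEvent"),
]


def watch_events_per_year(rows):
    """Count WatchEvent rows per calendar year, mirroring the ClickHouse query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE github_events (created_at TEXT, event_type TEXT)")
    conn.executemany("INSERT INTO github_events VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT strftime('%Y', created_at) AS yyyy, count(*)
        FROM github_events
        WHERE event_type = 'WatchEvent'
        GROUP BY yyyy
        ORDER BY yyyy
        """
    ).fetchall()
    conn.close()
    return result


print(watch_events_per_year(ROWS))
```

Note that SQLite returns the year as a string here, whereas ClickHouse’s toYear() returns a number.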

# 5th January 2021, 1:02 am / analytics, github, sql, big-data, clickhouse


Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms.

Netflix Technology Blog

# 18th August 2018, 5:35 pm / big-data, jupyter

Usage of ARIA attributes via HTTP Archive. A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.

# 12th July 2018, 3:16 am / aria, http, internet-archive, big-data

ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the Raft consensus algorithm (so sharded with no single points of failure) and a MySQL protocol interface. Interesting combination of technologies.

# 24th June 2018, 9:48 pm / erlang, scaling, sqlite, big-data

Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data (most CSV files, for example). Colin reports being able to shrink a 1291MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42MB (3% of the original), then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.
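
As a rough illustration of why columns full of duplicate values shrink so well, here is a toy sketch of dictionary encoding followed by run-length encoding, two of the techniques Parquet uses. This is not Parquet’s actual on-disk format, and the function names and the sample column are invented for the example.

```python
def dictionary_encode(column):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in dictionary
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


# Made-up column of province names with lots of repetition.
province = ["Ontario"] * 4 + ["Quebec"] * 3 + ["Ontario"] * 2

dictionary, codes = dictionary_encode(province)
runs = run_length_encode(codes)

print(dictionary)  # ['Ontario', 'Quebec']
print(runs)        # [[0, 4], [1, 3], [0, 2]]
```

Nine strings become a two-entry dictionary plus three short runs. A row-oriented format like CSV interleaves values from different columns, so repeats within a single column are never adjacent and this kind of collapsing is not available.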

# 24th June 2018, 7:44 pm / sqlite, big-data, datasette, parquet, colin-dellow

Mozilla Telemetry: In-depth Data Pipeline (via) Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to an alternative backend.

# 12th April 2018, 3:44 pm / analytics, lua, mozilla, nginx, big-data, kafka


What can startups do on big data day one?

Log everything, and then forget about it. That way you’ll have data you can analyse later on, but aside from setting up logging and log storage you won’t waste any time messing around with Big Data when you haven’t yet found product-market fit.

[... 58 words]

I would like to attend a Big Data conference but I am short of funds. Are there any big data conferences that help students attend through scholarships?

The traditional route for students who can’t afford to attend a conference is for them to volunteer. Contact event organisers of Big Data conferences that look relevant and ask if they are looking for volunteers.

[... 70 words]

What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?

We have a pretty comprehensive list of (mostly tech) conferences in the Midwest USA here:

[... 45 words]

Where can I find an updated DB of countries, states and cities?

This is a surprisingly complicated question. The first thing you might want to ask yourself is “what’s a country”—how do you deal with places on this List of states with limited recognition for example?

[... 182 words]


What are the best big data conferences?

O’Reilly’s Strata is excellent—I went to their first event in February in Santa Clara, and they’re running another one in New York on 22nd-23rd September:

[... 128 words]


The Seven Secrets of Successful Data Scientists. Some sensible advice, including pick the right sized tool, compress everything, split up your data, use open source and run the analysis where the data is.

# 3rd September 2010, 12:36 am / data, big-data, recovered