Simon Willison’s Weblog

10 items tagged “bigdata”

Usage of ARIA attributes via HTTP Archive. A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way. # 12th July 2018, 3:16 am
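To make the idea concrete without BigQuery, here is a rough local sketch of the same kind of tally: count how often each aria-* attribute appears across a handful of HTML documents. This is not the linked query and does not touch the HTTP Archive dataset; `AriaCounter`, `count_aria_attributes` and `SAMPLE_PAGES` are made-up names for illustration, and the sample markup is invented.

```python
from collections import Counter
from html.parser import HTMLParser


class AriaCounter(HTMLParser):
    """Tally every aria-* attribute seen in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, _value in attrs:
            if name.startswith("aria-"):
                self.counts[name] += 1


def count_aria_attributes(pages):
    """Return a Counter of aria-* attribute usage across HTML strings."""
    totals = Counter()
    for html in pages:
        parser = AriaCounter()
        parser.feed(html)
        parser.close()
        totals.update(parser.counts)
    return totals


# Invented sample markup, standing in for crawled pages.
SAMPLE_PAGES = [
    '<button aria-label="Close" aria-hidden="false">x</button>',
    '<nav aria-label="Main"><a href="/" aria-current="page">Home</a></nav>',
]

if __name__ == "__main__":
    for name, count in count_aria_attributes(SAMPLE_PAGES).most_common():
        print(name, count)
```

The real query runs the equivalent aggregation over the crawl's stored response data, which is what makes the result representative rather than anecdotal.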

ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the Raft consensus algorithm (so sharded with no single points of failure) and a MySQL protocol interface. Interesting combination of technologies. # 24th June 2018, 9:48 pm

Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data, which describes most CSV files. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42 MB (3% of the original), then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette. # 24th June 2018, 7:44 pm
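A toy illustration of why repeated column values compress so well in a columnar layout: store each distinct value once (dictionary encoding), then collapse consecutive repeats of the resulting indices (run-length encoding). Parquet's real encodings are considerably more elaborate, and this has nothing to do with how Colin's extension is implemented. The function names and the `PROVINCES` sample column are assumptions made up for the sketch.

```python
def dictionary_encode(column):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in column:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return runs


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[index] for index, length in runs for _ in range(length)]


# Invented sample column with lots of duplicate values.
PROVINCES = ["Ontario"] * 4 + ["Quebec"] * 3 + ["Ontario"] * 2

if __name__ == "__main__":
    dictionary, indices = dictionary_encode(PROVINCES)
    runs = run_length_encode(indices)
    print(dictionary)
    print(runs)
    assert decode(dictionary, runs) == PROVINCES
```

Nine strings become two dictionary entries plus three short runs. A row-oriented CSV repeats the full string on every row, which is roughly where savings like the one Colin reports come from.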

Mozilla Telemetry: In-depth Data Pipeline (via) Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to alternative backends. # 12th April 2018, 3:44 pm

What can startups do on big data day one?

Log everything, and then forget about it. That way you’ll have data you can analyse later on, but aside from setting up logging and log storage you won’t waste any time messing around with Big Data when you haven’t yet found product-market fit.

[... 58 words]

I would like to attend a Big Data conference but I am short of funds. Is there any Big Data conference that helps students attend through a scholarship?

The traditional route for students who can’t afford to attend a conference is for them to volunteer. Contact event organisers of Big Data conferences that look relevant and ask if they are looking for volunteers.

[... 70 words]

What is a good list of conferences, speaking gigs, hackathons, and other technology-centric events where one can reach software architects and developers?

We have a pretty comprehensive list of (mostly tech) conferences in the Midwest USA here: http://lanyrd.com/places/midwest...

[... 45 words]

Where can I find an updated DB of countries, states and cities?

This is a surprisingly complicated question. The first thing you might want to ask yourself is “what’s a country”—how do you deal with places on this List of states with limited recognition for example?

[... 182 words]

What are the best big data conferences?

O’Reilly’s Strata is excellent—I went to their first event in February in Santa Clara, and they’re running another one in New York on 22nd-23rd September: http://lanyrd.com/2011/stratany/

[... 128 words]

The Seven Secrets of Successful Data Scientists. Some sensible advice, including pick the right sized tool, compress everything, split up your data, use open source and run the analysis where the data is. # 3rd September 2010, 12:36 am
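Two of those tips, “compress everything” and “split up your data”, are easy to sketch together: group records by a key and write each group to its own gzip-compressed JSON-lines file, so a later analysis can read only the partition it needs. This is a minimal illustration, not anything taken from the linked article; `write_partitions`, `read_partition`, the `key=value` file naming and the sample records are all assumptions.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def write_partitions(records, key, directory):
    """Group records by `key`; write each group to its own gzip JSON-lines file."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    paths = {}
    for value, rows in groups.items():
        # Hypothetical naming scheme: one file per distinct key value.
        path = os.path.join(directory, f"{key}={value}.jsonl.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        paths[value] = path
    return paths


def read_partition(path):
    """Load every record from a single compressed partition."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# Invented sample records.
SAMPLE_EVENTS = [
    {"day": "2010-09-01", "event": "signup"},
    {"day": "2010-09-01", "event": "login"},
    {"day": "2010-09-02", "event": "login"},
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        paths = write_partitions(SAMPLE_EVENTS, "day", directory)
        print(sorted(os.path.basename(p) for p in paths.values()))
        print(read_partition(paths["2010-09-02"]))
```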