Blogmarks
Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look.
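Because vtgate speaks the MySQL wire protocol, existing client code should connect unchanged. A minimal sketch in Python; the host, port and keyspace here are placeholder assumptions for a local Vitess setup, not values from the Vitess docs:

```python
import mysql.connector  # pip install mysql-connector-python

# Connect to the vtgate proxy as if it were an ordinary MySQL server.
conn = mysql.connector.connect(
    host="127.0.0.1",
    port=15306,  # assumed local vtgate port
    user="root",
    database="my_keyspace",  # hypothetical keyspace name
)
cursor = conn.cursor()
cursor.execute("SELECT id, title FROM posts LIMIT 5")
for row in cursor.fetchall():
    print(row)
conn.close()
```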
django-zombodb (via) The hardest part of working with an external search engine like Elasticsearch is always keeping that index synchronized with your relational database. ZomboDB is a PostgreSQL extension which lets you create a new type of index backed by an external Elasticsearch cluster. Updated rows will be pushed to the index automatically, and custom SQL syntax can then be used to execute searches. django-zombodb is a brand new library by Flávio Juvenal which integrates ZomboDB directly into the Django ORM, letting you add Elasticsearch-backed functionality with just a few lines of extra configuration. It even includes custom Django migrations for enabling the extension in PostgreSQL!
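A rough sketch of what the integration looks like, based on the project README; the class and method names below are approximate, so verify them against the docs before relying on them:

```python
from django.db import models
from django_zombodb.indexes import ZomboDBIndex
from django_zombodb.querysets import SearchQuerySet


class Restaurant(models.Model):
    name = models.TextField()
    street = models.TextField()

    objects = SearchQuerySet.as_manager()

    class Meta:
        indexes = [
            # A PostgreSQL index backed by your Elasticsearch cluster;
            # rows are pushed to Elasticsearch on insert/update.
            ZomboDBIndex(fields=["name", "street"]),
        ]


# Run an Elasticsearch query-string search, executed via PostgreSQL:
results = Restaurant.objects.query_string_search("name:pizza OR street:main")
```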
socrata2sql (via) Phenomenal new open source tool released by Andrew Chavez at the Dallas Morning News. Socrata is the open data portal software used by huge numbers of local governments worldwide. socrata2sql is a tool that interacts with the standard Socrata API and can use it to suck down a dataset and save it as a SQLite, PostgreSQL, MySQL or other SQLAlchemy-supported database. I just tried this and it took a single command to create a SQLite database of every police arrest in Dallas in the past five years.
db-to-sqlite (via) I just released version 0.2 of a tiny CLI utility I’ve been working on. It builds on top of SQLAlchemy and lets you connect to any SQLAlchemy-supported database and convert the data from it to a local SQLite database file. The new --all option will mirror all available tables (including foreign key relationships), or you can use --sql to save the results of custom SQL queries.
Questions for a new technology. Kellan poses 8 questions which should be asked of any technology that is being proposed for inclusion in an existing tech stack. I’m particularly fond of “Will this solution kill and eat the solution that it replaces?”. My rule of thumb these days is that new technology either needs to make something possible that isn’t possible at all with the existing stack, or it needs to represent at least a 3X productivity improvement in order to compensate for the switching and retraining costs across a large team.
The Datasette Ecosystem. I’ve written a page of documentation that introduces the wider Datasette Ecosystem: csvs-to-sqlite, sqlite-utils, db-to-sqlite, dbf-to-sqlite, markdown-to-sqlite and a full collection of Datasette plugins.
Datasette 0.27 (via) The latest release of Datasette introduces an option to output tables and SQL query results as newline-delimited JSON—plus a new “datasette plugins” command for listing available plugins.
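Newline-delimited JSON is convenient to stream. A quick sketch of consuming it from Python; the URL is illustrative, but the ?_shape=array&_nl=on query string is what triggers the new format:

```python
import json
import urllib.request

url = "https://latest.datasette.io/fixtures/facetable.json?_shape=array&_nl=on"
with urllib.request.urlopen(url) as response:
    # One JSON object per line, so rows can be processed without
    # loading the whole result set into memory:
    for line in response:
        print(json.loads(line))
```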
websocketd (via) Delightfully clever piece of design: “It’s like CGI, twenty years later, for WebSockets”. Simply run “websocketd --port=8080 my-program” and it will start up a WebSocket server on port 8080 and fire up a new process running your script every time it sees a new WebSocket connection. Standard in and standard out are automatically hooked up to the socket connection. Since it spawns a new process per connection this won’t work well with thousands of connections but for smaller scale projects it’s an excellent addition to the toolbox—and since it’s written in Go there are pre-compiled binaries available for almost everything.
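Here’s a Python version of the project’s counting demo. Run “websocketd --port=8080 ./count.py” and every client that connects gets its own copy of the script, with each printed line delivered as a WebSocket message:

```python
#!/usr/bin/env python3
# count.py
import sys
import time

for i in range(1, 11):
    print(i)            # each line becomes a message sent to the client
    sys.stdout.flush()  # flush, or the output sits in the buffer
    time.sleep(0.5)
```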
Practical Deep Learning for Coders 2019 (via) The deep learning evening course I took a few months ago has now been shared online in full, and it’s outstanding. “After the first lesson you’ll be able to train a state-of-the-art image classification model on your own data”—can confirm: after just the first lesson I built a bobcat vs. cougar classifier using photos from iNaturalist.
The biggest thing I learned from the course is how powerful transfer learning is. I used to think you needed a huge amount of data to get good results from deep learning. That’s no longer true: you can take an existing model (e.g. ResNet for image classification) and train on top of it.
ResNet can classify images into 1,000 classes (house, cat, etc)—training it on a few hundred extra images of e.g. bobcats vs. cougars takes only a couple of minutes on a GPU and can give you crazily accurate results.
It works because the pre-trained model can already pick up really subtle details—fur patterns, ear shapes and so on—so you only need to train a few additional layers for it to classify the patterns in your new set of training images.
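Here is roughly what that looks like using the fastai library from the course. This is a sketch from memory: function names like cnn_learner shifted between fastai 1.x releases (older versions call it create_cnn), so treat it as illustrative rather than exact.

```python
from fastai.vision import *

# Assumes images organized into class folders: data/bobcat/... and data/cougar/...
path = Path("data")
data = ImageDataBunch.from_folder(
    path, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=224
)

# Start from ResNet-34's pre-trained weights and train a new head
# for our two classes:
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)  # a few minutes on a GPU
```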
And this doesn’t just work for image classification! Natural language processing benefits from transfer learning too: take an existing model trained on the entire corpus of Wikipedia (so it has already learned the patterns of sentence structure) and you can build an IMDB sentiment classifier on top of it. That’s in lesson 4.
Subscribe to my blog on Telegram (via) I created a Telegram bot that’s subscribed to my Atom feed, so if you want to get notifications when I post to my blog you can do that using Telegram now.
togeojson (via) Handy JavaScript library and command-line tool for converting KML and GPX to GeoJSON, by Tom MacWright.
SQLite in 2018: A state of the art SQL dialect (via) In 2018 SQLite gained boolean literals, window functions, filter clauses, upserts and the ability to rename a column. If you want to try them out, the latest official datasetteproject/datasette Docker image now bundles SQLite 3.26.
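Here are two of those features exercised through Python’s built-in sqlite3 module. One caveat: the stdlib module uses whichever SQLite your Python was compiled against, so this needs a build with SQLite 3.25 or newer (check sqlite3.sqlite_version).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT PRIMARY KEY, points INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("alice", 10), ("bob", 5)])

# Upsert, new in SQLite 3.24:
conn.execute(
    """INSERT INTO scores (player, points) VALUES (?, ?)
       ON CONFLICT(player) DO UPDATE SET points = points + excluded.points""",
    ("alice", 5),
)

# Window function, new in SQLite 3.25:
for row in conn.execute(
    "SELECT player, points, rank() OVER (ORDER BY points DESC) FROM scores"
):
    print(row)  # ('alice', 15, 1) then ('bob', 5, 2)
```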
An Interactive Introduction to Fourier Transforms (via) I love interactive explorable explanations and this is the best I’ve seen in a while: Jez Swanson breaks down exactly what a Fourier transform does, first by letting you interactively draw and deconstruct wave patterns and then by showing epicycles and explaining JPEG compression. All with not a formula in sight!
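For reference, since the essay itself keeps the formulas hidden: the discrete Fourier transform behind those epicycle drawings is

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1
```

Each coefficient X_k becomes one circle in the animation: its radius is |X_k|, its starting angle is arg(X_k), and it rotates k times per full cycle.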
Usable Data (via) A Paul Ford essay from February 2016 in which he advocates for SQLite as the ideal format for sharing interesting data. I don’t know how I missed this one—it predates Datasette, but it perfectly captures the benefits that I’m trying to expose with the project. “In my dream universe, there would be a massive searchable torrent site filled with open, explorable data sets, in SQLite format, some with full text search indexes already in place.”
Launching LiteCLI (via) Really neat alternative command-line client for SQLite, written in Python and using the same underlying framework as the similar pgcli (PostgreSQL) and mycli (MySQL) tools. Provides really intuitive autocomplete against table names, columns and other bits and pieces of SQLite syntax. Installation is as easy as “pip install litecli”.
The Friendship That Made Google Huge. The New Yorker profiles Jeff Dean and Sanjay Ghemawat, Google’s first and only level 11 Senior Fellows. This is some of the best writing on complex software engineering topics (MapReduce, TensorFlow and the like) aimed at a general audience that I’ve ever seen. Also a very compelling case study in pair programming.
benfred/py-spy (via) A Python port of Julia Evans’ rbspy profiler, which she describes as “probably better” than the original. I just tried it out and it’s really impressive: it’s written in Rust but has precompiled binaries so you can just run “pip install py-spy” to install it. Shows live output in the terminal while your program is running and also includes the option to generate neat SVG flame graphs.
Fast Autocomplete Search for Your Website (via) I wrote a tutorial for the 24 ways advent calendar on building fast autocomplete search for a website on top of Datasette and SQLite. I built the demo against 24 ways itself—I used wget to recursively fetch all 330 articles as HTML, then wrote code in a Jupyter notebook to extract the raw data from them (with BeautifulSoup) and load them into SQLite using my sqlite-utils Python library. I deployed the resulting database using Datasette, then wrote some vanilla JavaScript to implement autocomplete using fast SQL queries against the Datasette JSON API.
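The core building block is SQLite’s built-in full-text search. A self-contained sketch of the idea, assuming an SQLite build with the FTS5 extension (the table and column names here are invented for illustration, not taken from the tutorial):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Fast Autocomplete Search", "Building search with Datasette and SQLite"),
        ("Naturalist Superpowers", "Observable notebooks and the iNaturalist API"),
    ],
)

# The trailing * turns this into a prefix match, which is what makes
# search-as-you-type feel instant:
for (title,) in conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank",
    ("fast*",),
):
    print(title)
```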
Develop Your Naturalist Superpowers with Observable Notebooks and iNaturalist (via) Natalie’s article for this year’s 24 ways advent calendar shows how you can use Observable notebooks to quickly build interactive visualizations against web APIs. She uses the iNaturalist API to show species of Nudibranchs that you might see in a given month, plus a Vega-powered graph of sightings over the course of the year. This really inspired me to think harder about how I can use Observable to solve some of my API debugging needs, and I’ve already spun up a couple of private Notebooks to exercise new APIs that I’m building at work. It’s a huge productivity boost.
PEP 8016 -- The Steering Council Model (via) The votes are in and Python has a new governance model, partly inspired by the model used by the Django Software Foundation. A core elected council of five people (with a maximum of two employees from any individual company) will oversee the project.
Pampy: Pattern Matching for Python (via) Ingenious implementation of Erlang/Rust style pattern matching in just 150 lines of extremely cleanly designed and well-tested Python.
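The README’s Fibonacci example gives the flavor (reproduced from memory, so double-check it against the repo):

```python
from pampy import match, _  # pip install pampy

def fib(n):
    # Patterns and actions alternate; the _ wildcard binds the matched
    # value and passes it to the callable.
    return match(n,
        1, 1,
        2, 1,
        _, lambda x: fib(x - 1) + fib(x - 2),
    )

print(fib(10))  # 55
```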
nip.io.
"NIP.IO maps <anything>.<IP Address>.nip.io to the corresponding <IP Address>, even 127.0.0.1.nip.io maps to 127.0.0.1" - looks useful. xip.io is a different service that does the same thing. Being able to put anything at the start looks handy for testing systems that handle different subdomains.
The _repr_html_ method in Jupyter notebooks (via) Today I learned that if you add a _repr_html_ method returning a string of HTML to any Python class, Jupyter notebooks will render that HTML inline to represent that object.
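A minimal example: evaluate an instance as the last expression in a notebook cell and Jupyter renders the HTML in place of the usual repr.

```python
class Swatch:
    def __init__(self, color):
        self.color = color

    def _repr_html_(self):
        # Jupyter calls this automatically when displaying the object:
        return (
            f'<div style="background: {self.color}; padding: 1em;">'
            f"{self.color}</div>"
        )

Swatch("salmon")  # shows a colored box instead of <__main__.Swatch ...>
```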
Things About Real-World Data Science Not Discussed In MOOCs and Thought Pieces (via) Really good article, pointing out that carefully optimizing machine learning models is only a small part of the day-to-day work of a data scientist: cleaning up data, building dashboards, shipping models to production, weighing trade-offs between performance and production constraints and considering the product design and ethical implications of what you are doing make up a much larger portion of the job.
How Does React Tell a Class from a Function? (via) Understanding the answer requires a deep understanding of the history of the JavaScript class system plus how Babel and other tools work with it.
Dan Abramov:
For an API to be simple to use, you often need to consider the language semantics (possibly, for several languages, including future directions), runtime performance, ergonomics with and without compile-time steps, the state of the ecosystem and packaging solutions, early warnings, and many other things. The end result might not always be the most elegant, but it must be practical.
Software Sprawl, The Golden Path, and Scaling Teams With Agency (via) This is smart: the "golden path" approach to encouraging a standard stack within a large engineering organization. If you build using the components on the golden path you get guaranteed ongoing support and as much free monitoring/tooling as can possibly be provided. I also really like the suggestion that this should be managed by a "council" of senior engineers with one member of the council rotated out every quarter to keep things from getting stale and cabal-like.
Charity Majors describes the process like this:
- Assemble a small council of trusted senior engineers.
- Task them with creating a recommended list of default components for developers to use when building out new services. This will be your Golden Path, the path of convergence (and the path of least resistance).
- Tell all your engineers that going forward, the Golden Path will be fully supported by the org. Upgrades, patches, security fixes; backups, monitoring, build pipeline; deploy tooling, artifact versioning, development environment, even tier 1 on call support. Pave the path with gold. Nobody HAS to use these components ... but if they don't, they're on their own. They will have to support it themselves. [...]
repo2docker (via) Neat tool from the Jupyter project team: run “jupyter-repo2docker https://github.com/norvig/pytudes” and it will pull a GitHub repository, create a new Docker container for it, install Jupyter and launch a Jupyter instance for you to start trying out the library. I’ve been doing this by hand using virtual environments, but using Docker for even cleaner isolation seems like a smart improvement.
BigInt: arbitrary-precision integers in JavaScript (via) The BigInt specification is now supported in Chrome—but it hasn’t yet made it to other browsers. The Chrome team have a really interesting solution: they’ve released a JSBI library which you can use to do BigInt calculations in any browser today, and an accompanying Babel plugin which can rewrite calls to that library into BigInt syntax once browser support catches up. I’ve never seen a library that includes a tool for refactoring itself into oblivion before.
AWS Ground Station – Ingest and Process Data from Orbiting Satellites. OK this is cool. “Instead of building your own ground station or entering in to a long-term contract, you can make use of AWS Ground Station on an as-needed, pay-as-you-go basis. [...] You don’t need to build or maintain antennas, and can focus on your work or research. We’re starting out with a pair of ground stations today, and will have 12 in operation by mid-2019. Each ground station is associated with a particular AWS Region; the raw analog data from the satellite is processed by our modem digitizer into a data stream (in what is formally known as VITA 49 baseband or VITA 49 RF over IP data streams) and routed to an EC2 instance that is responsible for doing the signal processing to turn it into a byte stream.”
Helicopter accident analysis notebook (via) Ben Welsh worked on an article for the LA Times about helicopter accident rates, and has published the underlying analysis as an extremely detailed Jupyter notebook. Lots of neat new (to me) notebook tricks in here as well.