Blogmarks
Filters: Sorted by date
How to rewrite your SQL queries in Pandas, and more (via) I still haven’t fully internalized the idioms needed to manipulate DataFrames in pandas. This tutorial helps a great deal—it shows the Pandas equivalents for a host of common SQL queries.
Intro to Threads and Processes in Python (via) I really like the diagrams in this article which compares the performance of Python threads and processes for different types of task via the excellent concurrent.futures library.
How to Use Static Type Checking in Python 3.6 (via) Useful introduction to optional static typing in Python 3.6, including how to use mypy, PyCharm and the Atom mypy plugin.
The best of Python: a collection of my favorite articles from 2017 and 2018 (so far). Gergely Szerovay has brought together an outstandingly interesting selection of Python articles from the last couple of years of activity of the Python community on Medium. A whole load of gems in here that I hadn’t seen before.
Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter (via) ScriptedForms is a fascinating Jupyter hack that lets you construct dynamic documents defined using markdown that provide form fields and evaluate Python code instantly as you interact with them.
What’s New in MySQL 8.0. MySQL 8 has lots of exciting improvements: Window functions, SRS aware spatial types for GIS, utf8mb4 by default, a ton of JSON improvements and atomic DDL. I no longer feel at a significant disadvantage when I have to use MySQL in place of PostgreSQL.
Text Embedding Models Contain Bias. Here’s Why That Matters (via) Excellent discussion from the Google AI team of the enormous challenge of building machine learning models without accidentally encoding harmful bias in a way that cannot be easily detected.
Datasette 0.19: Plugins Documentation (via) I’ve released the first preview of Datasette’s new plugin support, which uses the pluggy package originally developed for py.test. So far the only two plugin hooks are for SQLite connection creation (allowing custom SQL functions to be registered) and Jinja2 template environment initialization (for custom template tags), but this release is mainly about exercising the plugin registration mechanism and starting to gather feedback. Lots more to come.
Datasette 0.18: units (via) This release features the first Datasette feature that was entirely designed and implemented by someone else (yay open source)—Russ Garrett wanted unit support (Hz, ft etc) for his Wireless Telegraphy Register project. It’s a really neat implementation: you can tell Datasette what units are in use for a particular database column and it will display the correct SI symbols on the page. Specifying units also enables unit-aware filtering: if Datasette knows that a column is measured in meters you can now query it for all rows that are less than 50 feet for example.
What do you mean “average”? (via) Lovely example of an interactive explorable demonstrating mode/mean/median, built as an Observable notebook using D3.
Wireless Telegraphy Register (via) Russ Garrett used Datasette to build a browsable interface to the UK’s register of business radio licenses, using data from Ofcom.
Mozilla Telemetry: In-depth Data Pipeline (via) Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to alternative backend.
The Academic Vanity Honeypot phishing scheme. Twitter thread describing a nasty phishing attack where an academic receives an email from a respected peer congratulating them on a recent article and suggesting further reading. The further reading link is a phishing site that emulates the victim’s institution’s login page.
Visualizing disk IO activity using log-scale banded graphs (via) This is a neat data visualization trick: to display rates of disk I/O, it splits the rate into a GB, MB and KB section on a stacked chart. This means that if you are getting jitter in the order of KBs even while running at 400+MB/second you can see the jitter in the KB section.
GitHub for Nonprofits (via) TIL GitHub provide legally recognized nonprofits with free organization accounts with unlimited users and unlimited private repos—and they’ve registered 30,000 nonprofit accounts through the program as of May 2017.
Deckset for Mac (via) $29 desktop Mac application that creates presentations using a cleverly designed markdown dialect. You edit the underlying markdown in your standard text editor and the Deskset app shows a preview of the presentation and lets you hit “play” to run it or export it as a PDF.
elasticsearch-dump. Neat open source utility by TaskRabbit for importing and exporting data in bulk from Elasticsearch. It can copy data from one Elasticsearch cluster directly to another or to an intermediary file, making it a swiss-army knife for migrating data around. I successfully used the “docker run” incantation to execute it without needing to worry about having the correct version of Node.js installed.
Datasette 0.15: sort by column (via) I’ve released the latest version of Datasette to PyPI. The key new feature is the ability to sort tables by column, using clickable column headers or directly via the new _sort= and _sort_desc= querystring parameters.
awesome-falsehood: Curated list of falsehoods programmers believe in (via) I really like the general category of “falsehoods programmers believe”, and Kevin Deldyckehas done an outstanding job curating this collection. Categories covered include date and time, email, human identity, geography, addresses, internationalization and more. This is a particularly good example of the “awesome lists” format in that each link is accompanied by a useful description.
Cookies-over-HTTP Bad (via) Mike West from the Chrome security team proposes a way for browsers to start discouraging the use of tracking cookies sent over HTTP—which represent a significant threat to user privacy from network attackers. It’s a clever piece of thinking: browsers would slowly ramp up the forced expiry deadline for non-HTTPS cookies, further encouraging sites to switch to HTTPS cookies while giving them ample time to adapt.
Typesense (via) A new (to me) open source search engine, with a focus on being “typo-tolerant” and offering great, fast autocomplete—incredibly important now that most searches take place using a mobile phone keyboard. Similar to Elasticsearch or Solr in that it runs as an HTTP server that you serve JSON via POST and GET—and it offers read-only replicas for scaling and high availability. And since it’s 2018, if you have Docker running (I use Docker for Mac) you can start up a test instance with a one-line shell command.
Parsing CSV using ANTLR and Python 3. I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me.
gron. Ingenious tool for working with JSON on the command line: run “gron URL/filepath” to transform a JSON document into a multi-line assignment structure designed to be easy to run grep against. Grep it, then pipe it back into “gron --ungron” to convert the filtered data back to JSON again. It solves a similar problem to jq—which is addressed in the README: “gron’s primary purpose is to make it easy to find the path to a value in a deeply nested JSON blob when you don’t already know the structure; much of jq’s power is unlocked only once you know that structure”.
Ask HN: What are the best MOOCs you’ve taken? Most useful Hacker News thread I’ve seen in a while: a torrent of great recommendations for online courses to learn everything from machine learning to astrophysics to songwriting.
rubber-docker/linux.c. rubber-docker is a workshop that talks through building a simply Docker clone from scratch in Python. I particularly liked this detail: linux.c is a Python extension written in C that exposes a small collection of Linux syscalls that are needed for the project—clone, mount, pivot_root, setns, umount, umount2 and unshare. Just reading through this module gives a really nice overview of how some of Docker’s underlying magic actually work.
import-pypi. A devious Python 3 hack which abuses importlib.machinery to add a hook such that any time you type “import modulename” it checks to see if the module is installed and runs “pip install modulename” first if it isn’t. Intended as a joke, but if you habitually fire up temporary virtual environments for exploratory programming like I do this could actually be a neat little time-saver.
The original Reddit source code, written in Lisp in 2005 (via) “If anyone’s interested, I found a hard drive in my garage with the original Reddit Lisp code from 2005. Been looking for it for years. Enjoy.”—spez
Use The Index, Luke! Paging Through Results (via) The best explanation of keyset pagination I’ve seen. Keyset pagination is where instead of using OFFSET/LIMIT to return the next page of results you instead track the last seen value in the column you sort by and then return the next X results that follow it. This allows you to paginate to arbitrarily deep offsets within a table, whereas OFFSET/LIMIT requires first iterating across all preceding rows and tends to stop working well after the first few thousand results.
Vega-Lite. A “high-level grammar of interactive graphics”. Part of the Vega project, which provides a mechanism for creating declarative visualizations by defining them using JSON. Vega-Lite is particularly interesting to me because it makes extremely tasteful decisions about how data should be visualized—give it some records, tell it which properties to plot on an axis and it will default to a display that makes sense for that data. The more I play with this the more impressed I am at the quality of its default settings.
Baltimore Sun Public Salary Records (via) The Baltimore Sun have published an interactive search engine for public salaries of Maryland state employees, and it’s powered by Datasette! Since data journalism is one of my key use-cases for Datasette I’m incredibly excited to see this in the wild. They’ve also published the underlying source code (see the via link) which is a really nice example of how to use Datasette’s custom templates and canned query functionality.