Simon Willison’s Weblog

Items tagged scaling in 2018

Filters: Year: 2018 × scaling ×

October 21 post-incident analysis (via) Legitimately fascinating post-mortem by GitHub. They run database masters in multiple data centers with raft for leader election... but when they had an unexpected network split between east and west coast they ended up with several seconds of write that had not been correctly replicated. Cleaning up the resulting mess took the best part of 24 hours! Distributed systems are hard. # 31st October 2018, 8:50 pm

Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems). # 27th June 2018, 3:05 pm

ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the raft consensus algorithm (so sharded with no single-points of failure) and a MySQL protocol interface. Interesting combination of technologies. # 24th June 2018, 9:48 pm

MySQL High Availability at GitHub. Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters. # 20th June 2018, 11:05 pm

github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers. # 19th June 2018, 4:12 pm

Scaling a High-traffic Rate Limiting Stack With Redis Cluster. Brandur Leach describes the simple, elegant and performant design of Redis Cluster, and talks about how Stripe used it to scaled their rate-limiting from one to ten nodes. # 26th April 2018, 6:34 pm

Why it took a long time to build that tiny link preview on Wikipedia (via) Wikipedia now shows a little preview card on internal links with an image and summary paragraph of the linked page. As a Wikpedia user I absolutely love this feature—and as an engineer and product designer, it’s fascinating to hear the challenges they overcame to ship it. Of particular interest: actually generating a useful summary of a page, while stripping out the cruft that often accumulates at the beginning of their text. It’s also an impressive scaling challenge: the API they use for this feature is now handling more than 500,000 requests per minute. # 23rd April 2018, 9:07 pm