Simon Willison on scaling

142 posts tagged “scaling”

2019

Vitess (via) I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look.

# 14th February 2019, 5:35 am / mysql, scaling, sharding, youtube, slack, vitess

2018

October 21 post-incident analysis (via) Legitimately fascinating post-mortem by GitHub. They run database masters in multiple data centers with raft for leader election... but when they had an unexpected network split between the east and west coast they ended up with several seconds of writes that had not been correctly replicated. Cleaning up the resulting mess took the best part of 24 hours! Distributed systems are hard.

# 31st October 2018, 8:50 pm / github, scaling, postmortem

Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems).

# 27th June 2018, 3:05 pm / facebook, migration, mysql, scaling, zero-downtime

ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the raft consensus algorithm (so sharded with no single points of failure) and a MySQL protocol interface. Interesting combination of technologies.

# 24th June 2018, 9:48 pm / erlang, scaling, sqlite, big-data

MySQL High Availability at GitHub. Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters.

# 20th June 2018, 11:05 pm / github, highavailability, mysql, scaling, shlominoach

github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers.

# 19th June 2018, 4:12 pm / databases, mysql, scaling, sql, shlominoach

Scaling a High-traffic Rate Limiting Stack With Redis Cluster. Brandur Leach describes the simple, elegant and performant design of Redis Cluster, and talks about how Stripe used it to scale their rate-limiting from one to ten nodes.
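Redis Cluster's routing rule is what keeps that scaling story simple: every key maps to one of 16,384 hash slots, computed as the CRC16 of the key modulo 16384, and each master node serves a subset of those slots. Here's a minimal Python sketch of that mapping, including the `{hash tag}` rule that lets related keys share a slot. The `ratelimit:{user:42}:...` key names are hypothetical, not Stripe's actual scheme.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (polynomial 0x1021, initial value 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of Redis Cluster's 16384 hash slots, honouring {hash tags}."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only hash the tag if at least one byte sits between "{" and the next "}"
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Hypothetical rate-limit keys: the shared {user:42} tag puts both counters
# in the same slot, and therefore on the same node.
per_second = "ratelimit:{user:42}:second"
per_minute = "ratelimit:{user:42}:minute"
print(key_slot(per_second) == key_slot(per_minute))  # True
```

Hash tags are the design lever here: keys that one operation must touch together (in a MULTI block or a Lua script, for example) need to live in the same slot, so giving each user's counters a shared tag keeps them on one node.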

# 26th April 2018, 6:34 pm / rate-limiting, redis, scaling, brandur-leach, stripe

Why it took a long time to build that tiny link preview on Wikipedia (via) Wikipedia now shows a little preview card on internal links with an image and summary paragraph of the linked page. As a Wikipedia user I absolutely love this feature, and as an engineer and product designer it’s fascinating to hear about the challenges they overcame to ship it. Of particular interest: actually generating a useful summary of a page, while stripping out the cruft that often accumulates at the beginning of their text. It’s also an impressive scaling challenge: the API they use for this feature is now handling more than 500,000 requests per minute.

# 23rd April 2018, 9:07 pm / scaling, wikipedia

2017

Scaling Postgres with Read Replicas & Using WAL to Counter Stale Reads (via) The problem with sending writes to the primary and balancing reads across replicas is dealing with replica lag: what if you write to the primary and then read from a replica that hasn’t had the new state applied to it yet? Brandur Leach dives deep into an elegant solution using PostgreSQL’s LSNs (log sequence numbers), accessed using pg_last_wal_replay_lsn(). An observer process continuously polls the replicas for their most recently applied LSN and stores them in a table. A column in the Users table then records the min_lsn valid for that user, updating it to the pg_current_wal_lsn() of the primary whenever that user makes a write. Combining the two allows the application to randomly select a replica that is up-to-date for the purposes of a specific user any time it needs to make a read.
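Here's a minimal in-memory sketch of the routing logic described above, under some loud assumptions: LSNs are modelled as plain integers (PostgreSQL actually exposes them as the pg_lsn type), two dicts stand in for the observer's table and the Users.min_lsn column, and falling back to the primary when no replica has caught up is an assumption, not necessarily what the original post does. `ReplicaRouter` and its method names are hypothetical.

```python
import random
from typing import Dict, Optional


class ReplicaRouter:
    """Toy model of LSN-based read routing; LSNs are plain integers here."""

    def __init__(self, rng: Optional[random.Random] = None):
        self.replica_lsns: Dict[str, int] = {}  # observer's table: replica -> last replayed LSN
        self.user_min_lsn: Dict[int, int] = {}  # stand-in for the Users.min_lsn column
        self.rng = rng or random.Random()

    def observe(self, replica: str, replayed_lsn: int) -> None:
        """Called by the observer process after polling a replica's pg_last_wal_replay_lsn()."""
        self.replica_lsns[replica] = replayed_lsn

    def record_write(self, user_id: int, primary_lsn: int) -> None:
        """Called after a user's write, with the primary's pg_current_wal_lsn()."""
        self.user_min_lsn[user_id] = primary_lsn

    def replica_for_read(self, user_id: int) -> str:
        """Pick a random replica that has replayed at least up to this user's min_lsn."""
        min_lsn = self.user_min_lsn.get(user_id, 0)
        fresh = sorted(r for r, lsn in self.replica_lsns.items() if lsn >= min_lsn)
        # Assumption: read from the primary when no replica is fresh enough yet
        return self.rng.choice(fresh) if fresh else "primary"


router = ReplicaRouter(random.Random(0))
router.observe("replica-1", 1200)
router.observe("replica-2", 900)
router.record_write(user_id=7, primary_lsn=1000)
print(router.replica_for_read(7))  # replica-1 (the only replica at or past LSN 1000)
```

Users who haven't written recently have a min_lsn of zero, so their reads spread across every replica; only users with fresh writes get narrowed down to the caught-up ones.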

# 18th November 2017, 6:42 pm / postgresql, replication, scaling, brandur-leach

django-multitenant (via) Absolutely fascinating Django library for horizontally sharding a database using a multi-tenant pattern, from the team at Citus. In this pattern every relevant table includes a “tenant_id”, and all queries should specifically select against that ID. Once you have that in place, you can shard your rows across multiple different databases and route to the correct database based on the tenant ID, safe in the knowledge that joins will still work provided they are against other rows belonging to the same tenant.
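The core routing idea is small enough to sketch. This is not django-multitenant's API; it's a hypothetical illustration of why the pattern works: if every row carries a tenant_id and routing depends only on that value, all of a tenant's rows across all tables end up on the same shard. The shard names, `shard_for_tenant` and `place_rows` are assumptions.

```python
import zlib
from typing import Dict, List

# Hypothetical database aliases, one per shard
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for_tenant(tenant_id: int, shards: List[str] = SHARDS) -> str:
    """Deterministically map a tenant_id to a shard using a stable hash."""
    # zlib.crc32 gives the same answer in every process, unlike Python's
    # hash() for str, which is randomized per process
    return shards[zlib.crc32(str(tenant_id).encode("ascii")) % len(shards)]


def place_rows(rows: List[dict]) -> Dict[str, List[dict]]:
    """Group rows from any table by the shard their tenant_id routes to."""
    placement: Dict[str, List[dict]] = {name: [] for name in SHARDS}
    for row in rows:
        placement[shard_for_tenant(row["tenant_id"])].append(row)
    return placement


rows = [
    {"table": "customers", "tenant_id": 17, "id": 1},
    {"table": "orders", "tenant_id": 17, "id": 99, "customer_id": 1},
    {"table": "customers", "tenant_id": 42, "id": 2},
]
placement = place_rows(rows)
# Tenant 17's customer and order share a shard, so joining them stays local
print([r["table"] for r in placement[shard_for_tenant(17)] if r["tenant_id"] == 17])
```

One design caveat: plain modulo hashing remaps most tenants whenever the number of shards changes, so real systems often use fixed hash ranges or a tenant-to-shard lookup table instead. In a Django project, logic like this could plausibly sit inside a database router (the `DATABASE_ROUTERS` setting with `db_for_read` and `db_for_write` methods).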

# 16th November 2017, 9:12 pm / django, postgresql, scaling

How Sentry Receives 20 Billion Events Per Month While Preparing To Handle Twice That. RabbitMQ federation, nginx and HAProxy, Riak as a key/value store, data processing is still mainly Python with a little bit of Rust. As of July 2017 it’s all hosted on Google Cloud Platform.

# 8th November 2017, 11:32 pm / scaling, rust, sentry

How Balanced does Database Migrations with Zero-Downtime. I’m fascinated by the idea of “pausing” traffic during a blocking site maintenance activity (like a database migration) and then un-pausing when the operation is complete, so end clients just see some of their requests taking a few seconds longer than expected. I first saw this trick described by Braintree. Balanced wrote about a neat way of doing this using just HAProxy, which lets you live-reconfigure the maxconn setting for your backend down to zero (causing traffic to be queued up) and then bring it back up again a few seconds later to un-pause those requests.
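The mechanism is easy to model inside a single process. Here's a minimal asyncio sketch of the same idea: while the gate is paused, incoming requests wait in line rather than failing, and resuming releases them all. `PausableGate` and `demo` are hypothetical names; in the setup described above HAProxy does the queueing for you, so this only illustrates the behaviour, not Balanced's configuration.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class PausableGate:
    """Hypothetical in-process analogue of dropping a backend's maxconn to zero and back."""

    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()  # start un-paused

    def pause(self) -> None:
        self._open.clear()  # new requests now wait instead of failing

    def resume(self) -> None:
        self._open.set()  # release every waiting request

    async def handle(self, request_id: int, handler: Callable[[int], Awaitable[int]]) -> int:
        await self._open.wait()  # blocks only while paused
        return await handler(request_id)


async def demo() -> Tuple[List[int], List[int]]:
    gate = PausableGate()
    completed: List[int] = []

    async def handler(request_id: int) -> int:
        completed.append(request_id)
        return request_id

    gate.pause()
    tasks = [asyncio.create_task(gate.handle(i, handler)) for i in range(3)]
    await asyncio.sleep(0.05)  # the blocking maintenance would happen here
    finished_while_paused = list(completed)
    gate.resume()
    results = await asyncio.gather(*tasks)
    return finished_while_paused, list(results)


print(asyncio.run(demo()))  # ([], [0, 1, 2])
```

The same constraint applies to the real thing: queued requests only wait as long as client and proxy timeouts allow, so the pause has to stay short.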

# 7th November 2017, 11:36 am / haproxy, highavailability, http, migrations, scaling, zero-downtime

Scaling the GitLab database. Lots of interesting details on how GitLab have worked to scale their PostgreSQL setup. They’ve avoided sharding so far, instead opting for connection pooling with pgbouncer and read-only replicas using hot standbys. I like the way they deal with replica lag: they store the current WAL position in a Redis key for the user every time there’s a write, then use pg_last_xlog_replay_location() on the various replicas to check and see if they have caught up next time the user makes a request that needs to read some data.
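GitLab's check boils down to comparing two WAL positions. PostgreSQL prints them as two hexadecimal halves separated by a slash (for example `16/B374D848`), so they need converting to numbers before comparison. Here's a minimal sketch, assuming a plain dict as a stand-in for Redis and a hypothetical `wal-position:user:<id>` key scheme; `parse_lsn` and `StickyReadTracker` are illustrative names. (In PostgreSQL 10 and later, pg_last_xlog_replay_location() is named pg_last_wal_replay_lsn().)

```python
from typing import Dict, Optional


def parse_lsn(position: str) -> int:
    """Convert a textual WAL position such as '16/B374D848' into a 64-bit integer."""
    high, low = position.split("/")
    return (int(high, 16) << 32) | int(low, 16)


class StickyReadTracker:
    """Per-user WAL positions; a plain dict stands in for Redis here."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def after_write(self, user_id: int, primary_position: str) -> None:
        # Hypothetical key naming scheme
        self.store[f"wal-position:user:{user_id}"] = primary_position

    def replica_caught_up(self, user_id: int, replica_replay_position: str) -> bool:
        wanted: Optional[str] = self.store.get(f"wal-position:user:{user_id}")
        if wanted is None:
            return True  # no recorded write for this user, so any replica will do
        return parse_lsn(replica_replay_position) >= parse_lsn(wanted)


tracker = StickyReadTracker()
tracker.after_write(7, "0/A000060")
print(tracker.replica_caught_up(7, "0/9FFFF00"))  # False
print(tracker.replica_caught_up(7, "1/0"))        # True
```

Comparing the raw strings would be a subtle bug: as text, `9/0` sorts after `10/0`, even though `10/0` is the later position.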

# 30th October 2017, 8:53 pm / postgresql, redis, replication, scaling, gitlab

2013

What are some strategies for scaling sites & infrastructure so global response times are relatively close to US response times?

You need to run your application in multiple data centers around the world, partitioned such that an incoming HTTP request can be completely serviced by a single data center. Then you use global DNS load balancing to direct users to the data center that is closest to them.

[... 185 words]
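Geo-aware DNS services typically estimate where a user is (often from the address of their DNS resolver) and answer with the address of the closest data center. Here's a minimal sketch of just that "closest" decision, using great-circle distance. The data center names and coordinates are hypothetical, and real services may also weigh measured latency and health checks.

```python
import math
from typing import Dict, Tuple

# Hypothetical data centers with approximate (latitude, longitude) locations
DATA_CENTERS: Dict[str, Tuple[float, float]] = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-southeast": (1.35, 103.8),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # min() guards against floating-point values a hair above 1.0
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def nearest_data_center(client: Tuple[float, float],
                        centers: Dict[str, Tuple[float, float]] = DATA_CENTERS) -> str:
    """The data center a geo-aware DNS answer would point this client at, in this simplified model."""
    return min(centers, key=lambda name: haversine_km(client, centers[name]))


print(nearest_data_center((51.5, -0.13)))  # eu-west (a client in London)
```

This only covers the DNS half of the answer; it works because each data center can serve a request completely on its own.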

10:52 am / 18th October 2013 / scaling, web-development, quora

What’s the best material on scalability?

Cal Henderson’s book Building Scalable Websites offers a good grounding.

[... 32 words]

1:40 pm / 7th April 2013 / scaling, quora

2012

What are good ways to develop software architectures using multiple languages?

There are a bunch of options for communicating between different languages, but these days the simplest is definitely JSON—it maps directly to common data structures in PHP, Python, Ruby and so on. Treat it as your common interchange format and you can’t go far wrong. It’s very easy to build simple internal web services on top of JSON.

[... 109 words]
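To make the JSON point concrete, here's a small Python sketch of a transport-agnostic internal "service" call: the request and response are plain JSON bytes, so the other side could just as easily be written in PHP or Ruby. The `handle_request` function and its `numbers`/`total` fields are hypothetical.

```python
import json
from typing import Any, Dict


def encode(payload: Dict[str, Any]) -> bytes:
    # sort_keys makes the output deterministic, which helps with caching and tests
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> Dict[str, Any]:
    return json.loads(raw.decode("utf-8"))


def handle_request(raw_request: bytes) -> bytes:
    """Hypothetical internal service: sum a list of numbers sent as JSON."""
    request = decode(raw_request)
    numbers = request.get("numbers", [])
    return encode({"ok": True, "total": sum(numbers)})


response = decode(handle_request(encode({"numbers": [1, 2, 3.5]})))
print(response)  # {'ok': True, 'total': 6.5}
```

The mapping is what makes this painless: Python dicts become JSON objects, lists and tuples become arrays, and None becomes null. One thing to watch is that tuples come back as lists on the way in.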

3:03 pm / 25th December 2012 / scaling, software-engineering, web-development, quora, software-architecture

Did Mark Zuckerberg have any knowledge on building scalable social networks prior to starting work on Facebook?

I’m going to bet he didn’t have this knowledge, simply because back when he launched Facebook in 2004 almost NO ONE had this knowledge—there simply weren’t enough “web scale” products around for the patterns needed to run them to be widely discussed.

[... 143 words]

3:27 pm / 18th September 2012 / facebook, programming, scaling, socialnetworks, software-engineering, webapps, web-development, quora

Scalability: What is the best way to store and serve hundreds of GB of images for a heavy traffic website?

If you’re not going to use a service like S3, your best bet is to run something like MogileFS (which was designed by LiveJournal for handling images) and stick Varnish (a screamingly fast HTTP caching server) in front of it.

[... 66 words]

6:10 pm / 17th September 2012 / scaling, quora

How does Twitter select trending topics?

They use stream processing algorithms—they mention trending topics calculation in their technical blog entry about Storm, their open source stream processing software: http://engineering.twitter.com/2...

[... 38 words]
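This is not Twitter's actual algorithm, just one common building block for trend detection: compare how often a term appears in the most recent time window against a baseline from an earlier window. Here's a toy Python sketch. The function name, the +1 smoothing and the `min_count`/`min_ratio` thresholds are all assumptions, and a stream processor like Storm would maintain these counts incrementally as events arrive rather than from lists.

```python
from collections import Counter
from typing import Iterable, List


def trending_terms(previous: Iterable[str], current: Iterable[str],
                   min_count: int = 3, min_ratio: float = 2.0) -> List[str]:
    """Return terms whose frequency jumped between two time windows, biggest jump first."""
    before = Counter(previous)
    now = Counter(current)
    scores = {}
    for term, count in now.items():
        if count < min_count:
            continue  # ignore terms too rare to be meaningful
        ratio = count / (before[term] + 1)  # +1 smoothing: brand-new terms don't divide by zero
        if ratio >= min_ratio:
            scores[term] = ratio
    return sorted(scores, key=lambda term: (-scores[term], term))


previous_window = ["cats"] * 5 + ["news"]
current_window = ["cats"] * 5 + ["news"] * 6 + ["rare"]
print(trending_terms(previous_window, current_window))  # ['news']
```

Consistently popular terms ("cats") don't trend; what matters is the change relative to the baseline.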

2:57 pm / 6th August 2012 / scaling, twitter, quora

How can I sort a huge amount of numbers?

Sorting large amounts of data is one of the first exercises you’ll see described in any Hadoop or map/reduce tutorial—so I’d suggest taking a look at Hadoop.

[... 44 words]
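If the numbers fit on one machine's disk but not in its memory, the classic single-machine technique (a different tool from Hadoop, named plainly here) is an external merge sort: sort fixed-size chunks in memory, write each sorted run to a temporary file, then stream a k-way merge over the runs. Here's a minimal Python sketch that assumes the input is integers; `external_sort` and `chunk_size` are illustrative names.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def external_sort(numbers: Iterable[int], chunk_size: int = 100_000) -> Iterator[int]:
    """Yield integers in ascending order, sorting at most chunk_size of them in memory at a time."""
    run_paths: List[str] = []
    chunk: List[int] = []

    def flush() -> None:
        # Sort the in-memory chunk and write it out as one sorted run
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{n}\n" for n in chunk)
        run_paths.append(path)
        chunk.clear()

    for n in numbers:
        chunk.append(n)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(path) for path in run_paths]
    try:
        # heapq.merge lazily performs a k-way merge over the sorted runs
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)


print(list(external_sort([5, 3, 9, 1, 7], chunk_size=2)))  # [1, 3, 5, 7, 9]
```

Map/reduce frameworks apply a similar sort-then-merge pattern, spread across many machines.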

2:37 pm / 25th February 2012 / google, scaling, quora

What’s the best way to learn how to scale web applications?

Read “Building Scalable Websites” by Cal Henderson. It’s a few years old now but still very relevant—it basically covers everything he learnt the hard way scaling Flickr. It’s a really fun read, too.

[... 98 words]

2:42 pm / 23rd February 2012 / programming, scaling, quora

Can Scala gain wider usage than Java any time soon?

No, because Scala is harder to master than Java.

[... 54 words]

5:17 pm / 11th February 2012 / concurrency, java, programming, scala, scaling, web-development, quora

Which is the best open source tool to populate my database with test data for my load test?

I’ve seen tools that do this, but to be honest it’s very simple to write your own script for this (especially if you’re using an ORM). The other benefit to writing your own script for this is that you’ll have a much better chance of accurately representing your expected data, sizes etc.

[... 221 words]
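Here's a minimal sketch of the roll-your-own approach, using Python's standard library sqlite3 module. The `users`/`posts` schema, the Pareto-shaped number of posts per user and the body sizes are all assumptions; the point is that these knobs are exactly where you encode what you expect your real data to look like. With an ORM you would create model instances instead of writing SQL.

```python
import random
import sqlite3
from typing import Tuple


def populate(conn: sqlite3.Connection, n_users: int, max_posts: int = 200,
             seed: int = 42) -> Tuple[int, int]:
    """Fill hypothetical users/posts tables with synthetic load-test data."""
    rng = random.Random(seed)  # fixed seed: every load-test run gets the same dataset
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL REFERENCES users(id), body TEXT NOT NULL)"
    )
    users = [(i, f"user{i}") for i in range(1, n_users + 1)]
    conn.executemany("INSERT INTO users (id, username) VALUES (?, ?)", users)

    posts = []
    for user_id, _ in users:
        # Assumed skewed distribution: most users post a little, a few post a lot
        n_posts = min(int(rng.paretovariate(1.2)) - 1, max_posts)
        for _ in range(n_posts):
            posts.append((user_id, "x" * rng.randint(20, 2000)))  # assumed body sizes
    conn.executemany("INSERT INTO posts (user_id, body) VALUES (?, ?)", posts)
    conn.commit()
    return len(users), len(posts)


conn = sqlite3.connect(":memory:")
print(populate(conn, n_users=1000))  # (1000, <number of posts generated>)
```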

9:19 am / 11th February 2012 / open-source, scaling, quora

2011

What are some good blogs, videos, papers, etc. on scaling Django?

We’re building up a pretty sizable collection of video (and slides) from talks about Django on http://lanyrd.com/—including plenty that talk about scaling issues. Try this: http://lanyrd.com/search/?q=djan...—we have 16 videos and 16 slide decks from talks at events all over the world.

[... 102 words]

5:48 pm / 20th December 2011 / django, scaling, quora

Why does Django still not have support for multiple joins?

I don’t fully understand the question, but if you’re talking about doing a single join across multiple tables the Django ORM handles that just fine. Let’s say you want to get every BlogEntry written by a User who belongs to the Group with the name “admins”:

[... 67 words]

4:10 pm / 9th May 2011 / databases, django, scaling, web-development, quora

2010

What is the largest production deployment of Redis?

I’d guess Twitter or Craigslist.

[... 19 words]

9:21 am / 21st December 2010 / databases, redis, scaling, quora

Using MySQL as a NoSQL—A story for exceeding 750,000 qps on a commodity server. Very interesting approach: much of the speed difference between MySQL/InnoDB and memcached is due to the overhead involved in parsing and processing SQL, so the team at DeNA wrote their own MySQL plugin, HandlerSocket, which exposes a NoSQL-style network protocol for directly calling the low level MySQL storage engine APIs—resulting in a 7.5x performance increase.

# 27th October 2010, 11:10 pm / mysql, nosql, scaling, recovered

Bees with machine guns! Low-cost, distributed load-testing using EC2. Great name for a useful project—Bees with machine guns is a Fabric script which fires up a bunch of EC2 instances, uses them to load test a website and then spins them back down again.

# 27th October 2010, 11:04 pm / ec2, fabric, performance, scaling, recovered, load-testing

When should one switch from MySQL to Oracle or PostgreSQL?

When your own benchmarks prove that your application’s particular load characteristics will perform better on another database—and the difference is large enough that it’s worth the cost involved in retargeting your code. If that cost is high (and it probably will be) it may be worth paying for some expert consultants to ensure that your implementations against the different databases are properly optimised.

[... 102 words]

4:13 pm / 12th October 2010 / mysql, oracle, postgresql, scaling, quora

Django (web framework): Why did theonion.com stop using Drupal?

They wrote about their reasons in detail in a post to the Django sub-reddit a while ago: http://www.reddit.com/r/django/c...

[... 165 words]

5:40 pm / 25th August 2010 / django, drupal, scaling, quora
