Simon Willison’s Weblog

Subscribe

129 items tagged “scaling”

2013

What are some strategies for scaling sites & infrastructure so global response times are relatively close to US response times?

You need to run your application in multiple data centers around the world, partitioned such that an incoming HTTP request can be completely serviced by a single data center. Then you use global DNS load balancing to direct users to the data center that is closest to them.

[... 185 words]

What’s the best material on scalability?

Cal Henderson’s book Building Scalable Websites offers a good grounding.

[... 32 words]

2012

What are good ways to develop software architectures using multiple languages?

There are a bunch of options for communicating between different languages, but these days the simplest is definitely JSON—it maps directly to common data structures in PHP, Python, Ruby and so on. Treat it as your common interchange format and you can’t go far wrong. It’s very easy to build simple internal web services on top of JSON.

[... 109 words]

Did Mark Zuckerberg have any knowledge on building scalable social networks prior to starting work on Facebook?

I’m going to bet he didn’t have this knowledge, simply because back when he launched Facebook in 2004 almost NO ONE had this knowledge—there simply weren’t enough “web scale” products around for the patterns needed to run them to be widely discussed.

[... 143 words]

Scalability: What is the best way to store and serve hundreds of GB of images for a heavy traffic website?

If you’re not going to use a service like S3, your best bet is to run something like MogileFS (which was designed by LiveJournal for handling images) and stick Varnish (a screamingly fast HTTP caching server) in front of it.

[... 66 words]

How does Twitter select trending topics?

They use stream processing algorithms—they mention trending topics calculation in their technical blog entry about Storm, their open source stream processing software: http://engineering.twitter.com/2...

[... 38 words]

How can I sort a huge amount of numbers?

Sorting large amounts of data is one of the first exercises you’ll see described in any Hadoop or map/reduce tutorial—so I’d suggest taking a look at Hadoop.

[... 44 words]

What’s the best way to learn how to scale web applications?

Read “Building Scalable Websites” by Cal Henderson. It’s a few years old now but still very relevant—it basically covers everything he learnt the hard way scaling Flickr. It’s a really fun read, too.

[... 98 words]

Can Scala gain wider usage than Java any time soon?

No, because Scala is harder to master than Java.

[... 54 words]

Which is the best open source tool to populate my database with test data for my load test?

I’ve seen tools that do this, but to be honest it’s very simple to write your own script for this (especially if you’re using an ORM). The other benefit to writing your own script for this is that you’ll have a much better chance of accurately representing your expected data, sizes etc.

[... 221 words]

2011

What are you some good blogs, videos, papers, etc. on scaling Django?

We’re building up a pretty sizable collection of video (and slides) from talks about Django on http://lanyrd.com/—including plenty that talk about scaling issues. Try this: http://lanyrd.com/search/?q=djan...—we have 16 videos and 16 slide decks from talks at events all over the world.

[... 102 words]

Why does Django still not have support for multiple joins?

I don’t fully understand the question, but if you’re talking about doing a single join across multiple tables the Django ORM handles that just fine. Let’s say you want to get every BlogEntry written by a User who belongs to the Group with the name “admins”:

[... 67 words]

2010

What is the largest production deployment of Redis?

I’d guess Twitter or Craigslist.

[... 19 words]

Using MySQL as a NoSQL—A story for exceeding 750,000 qps on a commodity server. Very interesting approach: much of the speed difference between MySQL/InnoDB and memcached is due to the overhead involved in parsing and processing SQL, so the team at DeNA wrote their own MySQL plugin, HandlerSocket, which exposes a NoSQL-style network protocol for directly calling the low level MySQL storage engine APIs—resulting in a 7.5x performance increase. # 27th October 2010, 11:10 pm

Bees with machine guns! Low-cost, distributed load-testing using EC2. Great name for a useful project—Bees with machine guns is a Fabric script which fires up a bunch of EC2 instances, uses them to load test a website and then spins them back down again. # 27th October 2010, 11:04 pm

When should one switch from MySQL to Oracle or PostgreSQL?

When your own benchmarks prove that your application’s particular load characteristics will perform better on another database—and the difference is large enough that it’s worth the cost involved in retargeting your code. If that cost is high (and it probably will be) it may be worth paying for some expert consultants to ensure that your implementations against the different databases are properly optimised.

[... 102 words]

Django (web framework): Why did theonion.com stop using Drupal?

They wrote about their reasons in detail in a post to the Django sub-reddit a while ago: http://www.reddit.com/r/django/c...

[... 165 words]

What is the largest production deployment of CouchDB for online use?

The BBC have a pretty big CouchDB cluster, which they use mostly as a replicated key-value store. It’s used by their new identity platform which includes customisation features for iPlayer.

[... 47 words]

What is the highest traffic website built on top of Django?

My best guess would be Disqus. Instagram are pretty enormous these days as well.

[... 31 words]

Three new features for reddit gold. Reddit’s experiments with a subscriber program are interesting to watch. 9,000 people signed up as subscribers without there being any benefit at all, and they’re now being rewarded with the ability to opt out of ads and access to computationally expensive features (like different ways of sorting their own user page) that wouldn’t scale for the entire user base. # 20th July 2010, 5:54 pm

What’s powering the Content API? The new Guardian Content API runs on Solr, scaled using EC2 and Solr replication and with a Scala web service layer sitting between Solr and the API’s end users. # 24th May 2010, 2:08 pm

twitter’s gizzard (via) Intriguing new open source project from Twitter. Gizzard is a sharding framework which provides a network service for partitioning data across arbitrary backend datastores, managing its own forwarding table to map key ranges to partitions and adding support for tree-based replication. # 11th April 2010, 9:39 pm

Notes from a production MongoDB deployment. Notes from running MongoDB for 8 months in production, with 664 million documents spread across 72 GB master and slave servers located in two different data centers. # 28th February 2010, 11:05 pm

Django Advent: Scaling Django. Mike Malone’s advice on scaling Django applications, including taking advantage of new features in 1.2. # 26th February 2010, 7:22 pm

Search Engine Time Machine. Detailed explanation of how ElasticSearch provides high availability, through clever sharding and replication strategies and configurable gateways for long-term persistent storage. # 17th February 2010, 10:32 pm

Elastic Search (via) Solr has competition! Like Solr, Elastic Search provides a RESTful JSON HTTP interface to Lucene. The focus here is on distribution, auto-sharding and high availability. It’s even easier to get started with than Solr, partly due to the focus on providing a schema-less document store, but it’s currently missing out on a bunch of useful Solr features (a web interface and faceting are the two that stand out). The high availability features look particularly interesting. UPDATE: I was incorrect, basic faceted queries are already supported. # 11th February 2010, 6:33 pm

dogproxy. Another of my experiments with Node.js—this is a very simple HTTP proxy which addresses the dog pile effect (also known as the thundering herd) by watching out for multiple requests for a URL that is currently “in flight” and bundling them together. # 3rd February 2010, 1:05 pm

2009

PostgreSQL 8.5alpha3 now available. “Hot Standby, allowing read-only connections during recovery, provides a built-in master-slave replication solution.” Woohoo! # 23rd December 2009, 9:57 am

Django | Multiple Databases. Russell just checked in the final patch developed from Alex Gaynor’s Summer of Code project to add multiple database support to Django. I’d link to the 21,000 line changeset but it crashed our Trac, so here’s the documentation instead. # 22nd December 2009, 5:22 pm

PostgreSQL 8.5 alpha 2 is out. “P.S. If you’re wondering about Hot Standby and Synchronous Replication, they’re still under heavy development and still (at this point) expected to be in 8.5.”—Hot Standby is PostgreSQL-speak for MySQL-style master/slave replication for scaling your reads. # 28th October 2009, 9:02 am