Know Your Scaling Enemy

August 26, 2010 by John Ellis

I’ve got scalability on the brain lately. In particular, I’ve been thinking about caching strategies as a way to accelerate applications, reduce I/O and increase scalability.

A recent post on High Scalability entitled "6 Ways to Kill Your Servers – Learning How to Scale the Hard Way" has been circulating the Internet’s tubes lately and is an interesting read on how someone came to understand scalability for a web site. It is narrated from a timeline perspective, detailing what had to be incrementally learned as the author scaled a website to beyond one million users a month. Each iteration was a lesson on what you had to learn… or your site would die.

All the lessons had a common thread: under load, I/O will eventually kill your site. It may start with network bottlenecks, then progress to open file handles, then to filesystem I/O. Eventually reading and writing blocks to disk or the network will become the critical path for your application and bring it to its knees.

It may sound like a hack but the solution is always the same: cache data like mad. Put as much data in-memory as humanly possible so you don’t need to read it from disk or *gasp* across the network. Cache data like there’s no tomorrow.

There are tons of advanced solutions for data caching. There are centralized solutions such as memcached and distributed solutions from Terracotta, Tangosol Coherence, JBoss Cache and others, but sometimes the simplest implementations of caching are the best. Unless you actually need massive cache stores that can persist to disk, you may get the best leverage from local caches that reside entirely in-memory on the same server as the process that consumes them. One example is having an individual, entirely in-memory and independent EhCache region within every running application. This implementation is very straightforward and, best of all, requires no network I/O for retrieval. True, you may end up with a bunch of redundant data spread across each running application, but for me that’s an acceptable trade-off for sub-millisecond access to the data I need. Even with aggressive cache invalidation the I/O savings can be huge. As Lesson #5 taught the author, caching can reduce I/O load by up to 80%. That’s a pretty huge savings.
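To make that concrete, here is a minimal sketch of a local, entirely in-memory EhCache region using the classic EhCache 2.x API. The cache name, sizes and TTL values are hypothetical placeholders I picked for illustration, not settings from any real deployment.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class LocalCacheExample {
    public static void main(String[] args) {
        // One CacheManager per running application instance.
        CacheManager manager = CacheManager.create();

        // A purely in-memory region: no disk overflow, not eternal,
        // entries live at most 300 seconds (all values are illustrative).
        Cache localCache = new Cache("customerAccounts", 10000, false, false, 300, 60);
        manager.addCache(localCache);

        // Cache a value so the next lookup skips disk and network I/O.
        localCache.put(new Element("account:42", loadAccountFromDatabase(42)));

        // Reads are served straight from the local JVM heap.
        Element hit = localCache.get("account:42");
        if (hit != null) {
            System.out.println("Served from local cache: " + hit.getObjectValue());
        }

        manager.shutdown();
    }

    // Stand-in for the expensive database read we are trying to avoid.
    private static Object loadAccountFromDatabase(int id) {
        return "account-" + id;
    }
}
```

Every application instance holds its own copy of the region, which is exactly the redundant-but-fast trade-off described above.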

When you move into managed cloud hosting your strategies may need to change. Since you can dynamically size memory in a VMware cloud, it may make more sense to have a centralized memcached or EhCache store. Since you can shrink or expand VMs on demand you don’t necessarily have to worry about a server’s RAM going unused. And since a good cloud service provider (such as BlueLock) will have gigabit interconnects between VMs, network latency becomes less of an issue. You could have twenty very lean VMs with 1 GB of RAM each connecting to a central memcached server with 16 GB of RAM holding a ton of cached data. You can even pre-fill it with frequently accessed data: think calendar dates, city/state/zip combinations, customer account data, previous invoices… all the stuff that will likely not change and therefore won’t need to be invalidated. If a node happens to be re-deployed or upgraded you don’t need to re-fetch that data either; your central cache server will still keep it faithfully in-memory.
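Here is a rough sketch of that pre-fill idea using the spymemcached Java client. The host name, keys and expiry times are made up for illustration; the central cache server could just as easily be an EhCache-backed service.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CentralCacheExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical central cache VM with a big slab of RAM.
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("cache01.internal.example", 11211));

        // Pre-fill rarely-changing reference data so the lean application
        // nodes never have to fetch it from disk or a database.
        cache.set("zip:46204", 86400, "Indianapolis, IN");
        cache.set("invoice:2010-07:cust42", 86400, "paid");

        // Any of the application VMs can now read the same entries,
        // even after one of them is re-deployed or upgraded.
        Object city = cache.get("zip:46204");
        System.out.println("From central cache: " + city);

        cache.shutdown();
    }
}
```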

Caching strategies in a physical datacenter world are very different from those in the cloud computing world. That’s a good thing: the lines between servers become blurred with cloud computing infrastructure, making "cleaner" solutions like centralized caching strategies more practical. Picking the right caching strategy is a big win; you end up doing more with less, you reduce response times and you make customers happier for it. Everyone wins!

  • Lori MacVittie

    "Since you can shrink or expand VMs on demand you don’t necessarily have to worry about a server’s RAM going unused. "

    One assumes that there is RAM available on the physical server on which the VM resides, correct?

    I assume in a shared environment that reservations for additional RAM are not the norm, and that it is potentially gambling to assume more RAM will be available.

    Is that correct?

    Great post in general, though. Caching is increasingly important and SSD might be one answer to the I/O bottleneck…

    Lori

  • John Ellis

One of the benefits of a VMware Cloud infrastructure is that you don’t have to worry about constraints imposed by a single physical server: all the resources of an entire datacenter are pooled together into huge "buckets" that you can draw from as you wish.

    For example, BlueLock may add 10 servers to its datacenter that have 144GB and 16 CPUs each. These servers all work together to give us an additional 1.4 TB of RAM and 160 CPUs. You may then come to us and request 1 TB of RAM and 80 CPUs for your own personal cloud, which you can then build from as you wish.

You may take an initial guesstimate at what size of server you need and try a 4 GB, 4 CPU virtual machine. After monitoring it for a while you may discover it never uses more than 1 GB of RAM, so you can pare it down to 1.5 GB and place the remaining 2.5 GB back into your "pool" of resources. Better yet, you may decide that your 32 GB memcached server needs twice as much RAM, so you move the memory slider a bit to the right and grant the virtual machine 64 GB of RAM instead.

    It’s liberating not to have to worry about a huge capital expense that gets under-utilized, or under-sizing a server and quickly outgrowing it. You can instead re-size the server on demand and deal with sizing as you learn about your application’s profile.

SSDs are definitely helping with I/O speeds, especially when talking about real-time data indexing and the like. Operating systems are just now coming up to speed with SSDs by supporting the TRIM command and hybrid drives, while filesystems are getting very close to making the most out of an SSD purchase by catering to an SSD’s unique write and erase patterns. It wouldn’t surprise me to see SSD-based SANs become a mainstay within the next four years, existing alongside traditional spinning disks.