I've got scalability on the brain lately. Right now I've been thinking about caching strategies as a way to accelerate applications, reduce I/O and increase scalability.
A recent post on High Availability entitled "6 Ways to Kill Your Servers - Learning How to Scale the Hard Way" has been circulating the Internet's tubes lately and is an interesting read on how someone came to understand scalability for a web site. It was narrated from a timeline perspective, detailing what had to be incremenetally learned as they scaled a website to beyond one million users a month. Each iteration was a lesson on what you had to learn... or your site will die.
All the lessons had a common thread: under load, I/O will eventually kill your site. It may start with network bottlenecks, then progress to open file handles, then to filesystem I/O. Eventually reading/writing blocks to disk or the network will become the critical path for your application and make it crawl to its knees.
It may sound like a hack but the solution is always the same: cache data like mad. Put as much data in-memory as humanly possible so you don't need to read it from disk or *gasp* across the network. Cache data like there's no tomorrow.
There are tons of advanced solutions for data caching. There are centralized solutions such as memcached or distributed solutions from Terracotta, Tangosol Coherence, JBoss Cache and others, but sometimes the most simple implementations of caching are the best. Unless you actually need massive cache stores that can persist to disk you may get the best leverage from local caches that reside entirely in-memory on the same server as the process that consumes them. One example is having an individual, entirely in-memory and independent EhCache region within every running application. This implementation is very straight-foward and best of all requires no network I/O for retrieval. True, you may end up with a bunch of redundant data spread across each running application, but for me that's an acceptable trade-off for sub-millisecond access to the data I need. Even with aggressive cache invalidation the I/O savings can be huge. As Lesson #5 taught the author, caching can reduce I/O load by up to 80%. That's a pretty huge savings.
When you move into managed cloud hosting your strategies may need to change. Since you can dynamically size memory with a VMware cloud, it may make more sense to have a centralized memcached or EhCache store. Since you can shrink or expand VMs on demand you don't necessarily have to worry about a server's RAM going unused. And since a good cloud service provider (such as BlueLock) will have gigabit interconnects between VMs, network latency may be a diminishing issue. You could have twenty very lean VMs with 1 GB RAM each connecting to a central memcached server with 16 GB of RAM that has a ton of cached data. You can even pre-fill it with frequently accessed data: think calendar dates, city/state/zip combinations, customer account data, previous invoices... all the stuff that will likely not change and need to be invalidated. If a node happens to be re-deployed or upgraded you don't need to re-fetch that data either - your central cache server will still keep it faithfully in-memory.
Caching strategies in a physical datacenter world are very different than in the cloud computing world. That's a good thing - lines between servers become blurred with cloud computing infrastructure, making "cleaner" solutions like centralized caching strategies more practical. Picking the right caching strategy is a big win for everyone; you end up doing more with less, you reduce response times and make customers happier for it. Everyone wins!