For some reason or another I keep hearing people talk about the CAP Theorem this week. So now it’s stuck in my head like a top 40 jingle. Thanks Interwebs.
Ever hear the euphemism "you can have it quick, fast or cheap: pick two?" CAP is one of those cake-and-eat-it to Theorems… the postulate being that you can pick two: Consistency, Availability or Partition Tolerance. Clustered relational databases usually are a combo of Availability and Consistency, NoSQL databases are often a combo of Availability and Partition Tolerance.
In a recent blog post, GigaSpaces CTO Nati Shalom talks about how people often over-complicate scalability decisions and jump to the concept that they must sacrifice ACID consistency to achieve scalability, often by allowing batched writes to happen asynchronously. Most of the time you can achieve the speed ya need by sharding relational databases instead of moving to incosistent NoSQL databases. By sharding data you can spread data logically across your infrastructure, farming database load out to multiple servers. If you use an ORM you can hide this complexity with something like Hibernate Shards to make partitioning even easier.
The best way to design and develop against a sharded database is to start virtualized. If you leverage a managed cloud hosting provider with appropriate I/O guarantees you can easily start by horizontally sharding your data then vertically scalign your infrastructure as you need. For example, all odd-numbered primary keys might go into one shard, even-numbered primary keys to the other. Once demand for one shard or another picks up, you can add memory for caching, CPUs for parallel execution or disks for storage. The right Infrastructure as a Service partner can make relational database scaling much easier – you don’t need to give up the farm to achieve massive scale.