Embracing Chaos Testing Helps Create Near-Perfect Clouds
If you Google "cloud computing outage" at any time of the year, the top results will almost certainly include a recent outage. A recent case is Telstra's cloud outage, which affected many of its high-profile customers for some twenty-four hours. A look at similar reports from past years makes it evident that most of these outages are man-made and pose a real threat to the promising business models that cloud computing offers.
In his book The Big Switch: Rewiring the World, from Edison to Google, Nicholas Carr draws an interesting parallel between the advent of electricity as a utility and cloud computing. Carr says, "Hooked up to the Internet’s global computing grid, massive information-processing plants have begun pumping data and software code into our homes and businesses. This time, it’s computing that’s turning into a utility."
Like electricity, however, cloud computing depends on one characteristic that is crucial to its success: high, uninterrupted availability. The five 9s principle perhaps describes the notion of high availability best. “As Evan Marcus, Principal Engineer at Veritas Software, observes, 99.999 availability works out to 5.39 minutes of total downtime—planned or unplanned—in a given year.” Most cloud providers strive for such reliability in their offerings.
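To get a feel for what each additional "9" buys, here is a minimal Python sketch of the arithmetic. It assumes a 365-day year, and the function name is purely illustrative. Under that assumption, five 9s comes out at roughly 5.26 minutes a year, in the same range as the figure quoted above.

```python
# Assumption: a 365-day year; leap years shift the result only slightly.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        budget = downtime_minutes_per_year(availability)
        print(f"{availability}% available -> {budget:.2f} minutes of downtime per year")
```

Each extra 9 cuts the downtime budget by a factor of ten, which is why the last few 9s are so hard to earn.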
Netflix came up with the idea of the Simian Army, a set of services in the cloud (referred to as Monkeys) that generate various kinds of failures or abnormal conditions and test the ability of the system to survive them. One of the goals of the Simian Army is to help keep the cloud highly available, which is done with the help of a service called Chaos Monkey.
Chaos Monkey is simply a software service that runs in the Amazon Web Services cloud and randomly kills instances and components that make up the cloud setup. It works on the simple premise that to design for high availability, we must design for failure. And to design for failure, we need ways to simulate failures as they would happen in real-world situations. This is exactly what Chaos Monkey helps achieve in a cloud setup.
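The premise is easy to try out on a toy model. The sketch below is not Netflix's implementation and does not call any real cloud API. It only simulates a fleet as a Python dictionary, kills one running instance at random, and checks whether enough instances remain healthy. The names pick_victim, run_chaos_round, and min_healthy are all hypothetical.

```python
import random


def pick_victim(instances, rng):
    """Choose one running instance at random, or None if nothing is running."""
    running = [name for name, is_up in instances.items() if is_up]
    return rng.choice(running) if running else None


def run_chaos_round(instances, min_healthy, rng):
    """Simulate killing one random instance.

    Returns the victim's name (or None) and whether the fleet still has
    at least min_healthy running instances afterwards.
    """
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = False  # simulated termination, no real API call
    healthy = sum(1 for is_up in instances.values() if is_up)
    return victim, healthy >= min_healthy


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here for repeatable experiments
    fleet = {f"web-{i}": True for i in range(4)}  # hypothetical instance names
    victim, survived = run_chaos_round(fleet, min_healthy=3, rng=rng)
    print(f"terminated {victim}; service still healthy: {survived}")
```

In a real setup, the dictionary update would be replaced by an actual termination request, and the health check by monitoring of the live service. The point of the sketch is only that the failure is induced deliberately and the survival condition is checked right after.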
Netflix recently released Chaos Monkey (and other Simian Army services) as open source and announced that more such monkeys will be made available to the community.
"Netflix notes that over the last year Chaos Monkey has terminated over 65,000 instances running in the Netflix production and testing environments." In other words, Netflix has put its infrastructure through some 65,000 induced instance failures and has built ways to respond to such outages into its architecture.
Cloud architects, do you have the courage to let Chaos Monkey test your architecture? Let us know in the comments below.