Why DASHbay didn’t crash when Amazon-East did

By now, we’ve all heard about the Great Cloudburst of 2011. On April 21, Amazon’s Virginia-based data center experienced a huge reduction in service, triggered by what the company called “a networking event” and subsequent “re-mirroring of EBS volumes”.

I’ll leave examinations of the cause and response to other websites, and discuss the impact on DASHbay.

DASHbay builds and supports data-centric applications, focusing on open source software solutions and often using cloud deployments. Amazon is our most frequently used cloud data services provider.

At the time of the outage, several of our customers had mission-critical DASHbay-deployed applications running in the cloud. How did those customers fare? And, since our customers’ problems are our problems, how did DASHbay fare?

I’m pleased to report that none of our customers were severely impacted by the outage.

Why not?

Here are some case studies of apps we built for clients, and the mitigation strategies that saved our bacon during Amazon’s failure.

The first is a real-time, high-availability mobile analytics collection application we deployed for Nielsen Mobile. Because this app’s continuous availability is mission critical, it was designed not to depend on any one AWS region. When Amazon-East went down, it failed over seamlessly to Amazon-West, and data-gathering continued normally. According to Brian Edgar, Group Program Manager at The Nielsen Company’s Telecom Practice: “While the outage at Amazon East was certainly bad news for Amazon and many of its clients, it was a great example of why the technology choices DASHbay recommended for us were ideal for this application. Our application was architected for geographic redundancy and fully leveraged the cloud model with dynamic DNS routing and load balancing using servers in multiple zones and regions. Our mission-critical, highly-available application experienced no outage at all. The Amazon regional failure proves we did the right thing.”
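For readers who want a concrete picture of region failover, here is a minimal Python sketch of the core idea: try regions in priority order and route to the first one that passes a health check. This is not the code we deployed for Nielsen. The region names, the `example.com` endpoints, and the `pick_endpoint` helper are all hypothetical, and the health check is passed in as a plain function so the sketch runs without touching a network. In a real deployment this decision is usually made by DNS routing and load balancers rather than by application code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical regions and endpoints, for illustration only.
REGION_ENDPOINTS = {
    "us-east": "collector.east.example.com",
    "us-west": "collector.west.example.com",
}


def pick_endpoint(
    priority: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the endpoint of the first healthy region in priority order.

    Returns None when no region in the list passes the health check.
    """
    for region in priority:
        if is_healthy(region):
            return REGION_ENDPOINTS[region]
    return None


if __name__ == "__main__":
    # Simulate an outage in the primary (East) region.
    regions_down = {"us-east"}
    endpoint = pick_endpoint(
        ["us-east", "us-west"],
        lambda region: region not in regions_down,
    )
    print(endpoint)
```

The point of the sketch is the ordering: the primary region is preferred whenever it is healthy, and traffic moves to the secondary only when the check fails, so no manual intervention is needed during an outage.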

Another is a data acquisition app built for our client Credit.com. Unstructured data is gathered and marshaled into transaction reports. Data loss can directly impact Credit.com’s ability to monitor its own financial performance. This app was deployed only in the Amazon-East region, and was not available for over 24 hours. However, we anticipated the possibility of an outage, and had offshore staff in Nagpur, India, trained to perform manual workarounds for as long as necessary. These manual processes kicked in, and kept the data flowing. According to Credit.com’s CEO Ian Cohen, “We’ve been working with Dashbay for the last year and were really pleased with the measures they put in place to provide redundancies for our data acquisition applications. They positioned an offshore failsafe that allowed us to operate without interruption.”

What’s the message here? I think it’s this: data centers can fail! Design operational processes and real-time architectures with failover in mind. We used a variety of approaches, from automated failover to human-intensive procedures that were nevertheless ready to go.

Which risk-mitigation strategies are right for a particular app? That depends on factors such as the volume of data and the tolerable latency of gathering and moving that data. We’re committed to thinking those factors through with our clients, and designing applications and processes with failover in mind.

One more important thing: let’s all keep the mindset of learning from mistakes and, if necessary, changing architectures and backup procedures to keep our businesses running.

Terry Joyce, DASHbay founder