Big Data and the Cloud

One of the trends hitting analytics is the adoption of Big Data. A theory I raised a few years ago was that a side effect of running a web company is that it generates an enormous amount of data. Prescient companies like Yahoo and Facebook understood the value of this data and started to figure out ways to analyze it. Out of those early projects came several open source products that incorporated the architectures these companies espoused.

Two of the most notable open-source projects to gain broad adoption are Hadoop and Cassandra. The idea was that not only were companies seeing a lot more data, but much of it was unstructured, and that type of data didn’t lend itself to analysis with existing relational database tools. Hadoop grew out of Google’s published designs and was developed heavily at Yahoo (a major contributor to the project) to solve exactly this problem.

Since many web companies were already using Hadoop, and Big Data started to gain traction at the same time as cloud computing, combining the two soon seemed like a natural fit. On the surface, this makes sense. Hadoop, and NoSQL solutions like it, are designed for massively parallel computing. Cloud computing is designed to let you spin up servers very quickly and scale horizontally. Seems like a natural match, doesn’t it?
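
To make the parallelism concrete, here is a minimal word-count job written for Hadoop Streaming in Python. This is an illustrative sketch rather than anything from the projects above, and the file names are placeholders; Hadoop runs many copies of the mapper in parallel across the cluster, sorts the intermediate output by key, and then runs the reducers in parallel.

    #!/usr/bin/env python
    # mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for each word on stdin.
    # Hadoop runs many copies of this script in parallel, one per input split.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- Hadoop Streaming reducer: sum the counts for each word.
    # The framework sorts mapper output by key, so identical words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The same pair of scripts runs unchanged on a ten-node cluster or a thousand-node cluster, which is what makes the elastic, scale-out model of the cloud look so attractive.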

It depends on your use case. Mission-critical production systems that need access to the cluster 24/7 and have strict performance requirements want a lot of compute power as well as high-performance I/O, and not all cloud providers can deliver that level of performance. Notably, Amazon is currently the only cloud provider attempting to offer Hadoop as a service on its multi-tenant cloud. MapReduce jobs can be run through Amazon Elastic MapReduce, which spins up instances as needed to run the jobs in parallel and then writes the result set to S3. If you tune your setup correctly, this can result in significant savings for targeted batch jobs – a good example is the TimesMachine project run by the New York Times, which used Amazon and Hadoop to convert 150 years’ worth of TIFF scans into smaller PNG images in 36 hours.
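
As a rough sketch of how such a targeted batch job gets submitted (using the AWS SDK for Python, boto3, which post-dates the original Elastic MapReduce API), launching a transient cluster that runs one Hadoop Streaming step and then terminates might look like the following. The bucket names, paths, instance types, and release label are all placeholders.

    # Sketch: launch a transient Amazon EMR cluster, run one Hadoop Streaming
    # step over data in S3, write the results back to S3, and terminate.
    # Bucket names, paths, instance types, and the release label are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="targeted-batch-job",
        ReleaseLabel="emr-5.36.0",
        LogUri="s3://example-bucket/emr-logs/",
        Applications=[{"Name": "Hadoop"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 10,                   # scale out only for this run
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        Steps=[{
            "Name": "streaming-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://example-bucket/code/mapper.py,s3://example-bucket/code/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://example-bucket/input/",
                    "-output", "s3://example-bucket/output/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])

Because the cluster is not kept alive between steps, it exists only for the duration of the job, which is where the cost savings for intermittent workloads come from.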

Other cloud providers rely on single-tenant hardware solutions, which provide both dedicated compute and local storage for high-performance I/O. In fact, the Hadoop Distributed File System (HDFS) is designed specifically to be rack-aware, so that it can replicate data across multiple nodes and even across different racks. Cloudera, which has its own distribution of Hadoop, publishes specific hardware recommendations depending on the workload. MapR, whose distribution includes additional code for performance gains, requires access to physical hard disks and may not play well in every multi-tenant environment.
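
Rack awareness does not come for free, though: the administrator has to tell HDFS which rack each node sits in, typically by pointing a property such as net.topology.script.file.name in core-site.xml (the exact property name depends on the Hadoop version) at a small script that maps node addresses to rack IDs. A minimal sketch, with made-up subnets and rack names:

    #!/usr/bin/env python
    # rack_topology.py -- minimal HDFS rack-awareness topology script (illustrative).
    # Hadoop invokes the script with one or more node IPs/hostnames as arguments
    # and expects one rack path per argument on stdout. The subnet-to-rack mapping
    # below is a made-up example for a small two-rack deployment.
    import sys

    RACK_BY_SUBNET = {
        "10.1.1.": "/datacenter1/rack1",
        "10.1.2.": "/datacenter1/rack2",
    }
    DEFAULT_RACK = "/default-rack"

    for node in sys.argv[1:]:
        rack = DEFAULT_RACK
        for subnet, rack_id in RACK_BY_SUBNET.items():
            if node.startswith(subnet):
                rack = rack_id
                break
        print(rack)

With a mapping like this in place, HDFS’s default placement policy puts at least one replica of each block on a different rack from the first, which is the behavior described above.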

However, the choice is not just between virtual-only and hardware-only. Some cloud providers let you deploy both virtual, multi-tenant machines and dedicated, single-tenant servers. While the implementations differ, being able to deploy both gives you additional flexibility. Companies such as Rackspace and GoGrid offer both virtual and dedicated hardware. If you are looking for a purpose-built server, Joyent offers servers pre-configured and pre-installed with Riak – a distributed database from Basho.

If you are looking for Hadoop-as-a-Service for targeted jobs that need to be low cost and don’t need to run continuously, then Amazon fits the bill. However, you will take a performance hit – “It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster,” according to this article by InfoWorld. If you are running a production system over large, continuous data sets with high-performance requirements, you are better off with a hybrid offering from a provider like Rackspace or GoGrid.