Tag Archives: emr

Big Confusion about Big Data

Not only do people often confuse what exactly the term “Big Data” means, but the dizzying array of products that are out there that solve for Big Data problems add to the confusion. So what’s the difference between Hadoop, Cassandra, EMR, Big Query or Riak?

First, it’s important to define what Big Data is. First of all, I refer to Big Data to mean the data itself – although it is often used interchangeably with the solutions (such as Hadoop). I believe that data should satisfy 3 criteria before being considered “Big Data”:

  • Volume – the amount of data has to be large, in petabytes not just gigabytes
  • Velocity – the data has to be frequent, daily or even real-time
  • Structure – the data is typically but not always unstructured (like videos, tweets, chats)

elephant_rgb_sq

Hadoop

To deal with this type of data, Big Data solutions have been developed. One of the most well known is Hadoop. It was first developed by Doug Cutting after reading how Google implemented their distributed file system and Map Reduce functionality. As such, it is not a database but a framework that implements the Hadoop Distributed File System (HDFS) and Map Reduce. There are several distributions of this open-source product – Cloudera, Hortonworks and MapR being the most well-known. Amazon Elastic Map Reduce (EMR) is a cloud-based platform-as-a-service implementation of Hadoop. Instead of installing the distribution in your data center, you configure the jobs using Amazon’s platform.

Why would people use Hadoop? Typically they are pulling large amounts of unstructured data and need to be able to run sorting and simple calculations on the data. For example:

  • Counting words – this is the standard Map Reduce example
  • High-volume analysis – gathering and analyzing large scale ad network data
  • Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
  • Social graphs – Determining relationships between individuals

One of the weaknesses of Hadoop is its job oriented nature. Map Reduce is designed to be a batch process so there is a significant penalty is waiting for the job to start up and complete. It is typically not a strong candidate for real-time analysis. It also has the challenge of having a master node that is a single point of failure. Recently, Cloudera has released their latest distribution that implements a failover master node but this is unique to their distribution. Additionally, developers typically use Java (although other languages are supported) and not SQL to create Map Reduce jobs.

Riak_product_logo

Cassandra / Riak / Dynamo

Amazon wrote a paper on how to implement a key value store and Cassandra and Riak are implementations of that key value store construct. Amazon also has their own implementation called Dynamo DB.

While there are differences among the implementations, customers will want to use these products because they want fast, predictable performance and built-in fault tolerance. This is for applications that require very low latency and highly available. Typical use cases would be:

  • Session Storage
  • User Data Storage
  • Scalable, low-latency storage for mobile apps
  • Critical data storage for medical data (a great example is how the Danish government uses Riak for their medical prescription program)
  • Building block for a custom distributed system

These solutions are typically also clustered in a ring formation with no particular node marked as a master and thus so single point of failure. Customer will typically not use these solutions if they require complex ad-hoc querying or heavy analytics. As with Hadoop, building queries here require using a programming language like Java or Erlang and not SQL. There is also a project called HBase which provides similar columnar data store functionality for data on Hadoop so that is another option that is available. This project was modeled after Google Big Table and not Dynamo.

mondodb

MongoDB

This is a project that has become very popular and is managed by 10Gen and also implemented in a SaaS model by MongoLabs. MongoDB has become very popular because it is designed with developers in mind and scales the gap between a relational database (like MySQL) and key-value store (like Riak). MongoDB is excellent if you need dynamic queries and want to maintain similar functionality to a relational database that is more friendly to object-oriented programming languages. 10Gen emphasizes that MongoDB is designed to easily scale out and also accelerate development (for example, if conducting agile development). Typical use cases would be:

  • Companies storing data in flat files
  • Using an XML store for data too complex to model in a relational database
  • Deployments to public or private clouds
  • Electronic health records

As with it’s other NoSQL brethren, MongoDB does not use SQL. Instead you will use a programming language (like Javascript) to interact with the database. It is also not good with applications that require multi-object transactions which MongoDB currently does not support.

dremel

Dremel

No, this is not a power tool but a technology implemented by Google in a recent research paper. In order to understand this product, you need to be familiar with Hive and Pig since they are often compared. As noted above, Hadoop has two main challenges – that the jobs have a start-up lag and that Java (or some other programming language) is required to write and implement MapReduce. Hive and Pig are open-source projects that sit on top of Hadoop and allow for users to implement SQL (in the case of Hive) or a scripting language (in the case of Pig by writing Pig Latin) to query data. Note that since this is a layer above Hadoop, MapReduce code is still being written and executed. It just translates SQL or Pig Latin into MapReduce code. This does solve for the case of being able to write queries against Hadoop but does not solve the latency problem. This is what Dremel attempts to do.

Dremel is designed to allow users to write almost real-time, interfactive and ad-hoc queries against a large data set. Because it uses its own query execution engine and doesn’t rely on MapReduce, this attempts to solve for the Hadoop latency problem. However, in its current implementation, it is purely an analysis tool since it is only read-only and organizes data in a columnar format. While there is an open-source project for this product, it is also currently commercially available as Google Big Query.

Where to go with Big Data

As you can see, there are a lot of products out there that all solve for Big Data problems in the different ways. It’s important to understand your use case and see which product is a best fit – it’s unlikely that you can find one that solves for the universe of Big Data problems. In addition, it is likely that you will still need to use a traditional relational database if you need to do reporting or interface with an enterprise application. Out of all the projects, Hadoop has the largest ecosystem with other supporting projects that broadens the functionality of the core product. As mentioned earlier, Hive is used to write SQL queries; there is also Sqoop that integrates relational data and HBase that provides low-latency capabilities. Fortunately, the majority of these projects are open-source so you only need to learn how to use them and find a cost-effective platform to implement it – like the cloud, a natural fit for distributed databases and NoSQL projects.