Tag Archives: large data sets

2012 is the year of Big Data

This is an article that I wrote recently that was published in the Cloud Computing Journal. While 2011 may have been the year of the cloud, 2012 is proving to be the year that Big Data breaks through in a big way. I discuss a brief history of Big Data and go into a little more detail as to why I think the cloud and Big Data are a natural fit. The key is ensuring that the right architecture is available to support the performance needs of Big Data applications. Click on the link below to get the full article.

The Big Data Revolution — For many years, companies collected data from various sources that often found its way into relational databases like Oracle and MySQL. However, the rise of the Internet, Web 2.0, and recently social media began an enormous increase in the amount of data created as well as in the type of data. No longer was data relegated to types that easily fit into standard data fields. Instead, it now came in the form of photos, geographic information, chats, Twitter feeds, and emails. The age of Big Data is upon us. A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes (1 zettabyte equals 1 trillion gigabytes). To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More important, this data has to go somewhere, and IDC’s report projects that by 2020, more than one-third of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be how to collect, store, and analyze what it means.
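As a quick sanity check on those figures, the unit conversion and the bitrate implied by the "24" comparison can be worked out in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units and a 365.25-day year, and the function names are made up for illustration rather than taken from the IDC report.

```python
# Back-of-the-envelope check of the zettabyte figures quoted above.
# Assumptions (not from the IDC report): SI decimal units, 365.25-day year.
BYTES_PER_ZETTABYTE = 10**21
BYTES_PER_GIGABYTE = 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def zettabytes_to_gigabytes(zettabytes):
    """Convert zettabytes to gigabytes using decimal units."""
    return zettabytes * BYTES_PER_ZETTABYTE / BYTES_PER_GIGABYTE


def implied_bitrate_mbps(zettabytes, years):
    """Video bitrate (megabits per second) implied by playing that much data for that long."""
    total_bits = zettabytes * BYTES_PER_ZETTABYTE * 8
    return total_bits / (years * SECONDS_PER_YEAR) / 1e6


if __name__ == "__main__":
    print(zettabytes_to_gigabytes(1))          # gigabytes in one zettabyte
    print(implied_bitrate_mbps(1.2, 125e6))    # Mbit/s for 1.2 ZB over 125 million years
```

Under these assumptions, one zettabyte comes out to one trillion gigabytes, and the comparison implies a video stream of roughly 2.4 Mbit/s, which is a plausible rate for compressed television.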

Big Data and the Cloud


One of the trends hitting analytics is the adoption of Big Data. A theory I raised a few years ago was that web companies generate a lot of data as a side effect of doing business. Prescient companies like Yahoo and Facebook understood the value of this data and started to figure out ways to analyze it. Out of these early projects came several open source products that incorporated the architectures espoused by these companies.

Two of the most notable open-source projects to gain wider adoption are Hadoop and Cassandra. The idea was that not only were companies starting to see a lot more data, but they were also seeing a lot of unstructured data. This type of data didn’t lend itself to analysis by existing relational database tools. Hadoop was inspired by work at both Google and Yahoo (which is a major contributor to the project) to solve this exact problem.

Since many web companies use Hadoop, and Big Data started to gain traction at the same time as cloud computing, combining the two soon looked like a natural fit. On the surface, this makes sense. Hadoop, and NoSQL solutions like it, are designed for massively parallel computing. Cloud computing is designed to provide the ability to spin up servers very quickly and scale horizontally. Seems like a natural match, doesn’t it?

It depends on your use case. Mission-critical production systems that need access to the cluster 24/7 and have strict performance requirements will want a lot of compute power and high-performance I/O. Not all cloud providers can deliver that level of performance. Notably, Amazon is currently the only cloud provider attempting to offer Hadoop as a service on its multi-tenant cloud environment. MapReduce jobs can be run through Amazon Elastic MapReduce, which spins up instances as needed to run the jobs in parallel and then writes the result set to S3. If you tune your setup correctly, this can result in significant savings for targeted batch jobs. A good example is the TimesMachine project run by the New York Times, which used Amazon and Hadoop to convert TIFF files into smaller PNG images, processing 150 years’ worth of images in 36 hours.
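For readers who have not seen the programming model, here is a minimal sketch of the map, shuffle, and reduce phases, using a word count as the job. It is a toy that runs in a single Python process with a thread pool standing in for cluster nodes. It does not use Hadoop's or Amazon's actual APIs, and all function names are illustrative.

```python
# Toy illustration of the MapReduce model: map each input split independently,
# group the emitted pairs by key, then reduce each group to a single value.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a total count."""
    return key, sum(values)


def word_count(chunks, workers=4):
    """Run the map phase in parallel, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "Big Data needs big clusters"]))
```

Because each split is mapped independently, adding more workers (or, on a real cluster, more instances) shortens the map phase without changing the code, which is why the model pairs so naturally with servers that can be spun up on demand.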

Other cloud providers rely on providing single-tenant hardware solutions. This provides both the dedicated compute and the local storage needed for high-performance I/O. In fact, the Hadoop Distributed File System (HDFS) is designed specifically to be rack-aware so that it can replicate data across multiple nodes and even across different racks. Cloudera, which has its own distribution of Hadoop, has specific hardware recommendations depending on the workload required. MapR, which has additional code in its distribution for performance gains, requires access to physical hard disks and may not play well in all multi-tenant environments.
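To make rack awareness concrete, the sketch below places three replicas of a block: one on the node doing the write, one on a node in a different rack, and one on another node in that same remote rack. This is a simplified approximation of HDFS's default placement as I understand it, not Hadoop's actual code. The topology dictionary, node names, and function name are all hypothetical.

```python
# Simplified, illustrative rack-aware replica placement (not Hadoop's real code).
import random


def place_replicas(topology, writer_node, replication=3, rng=None):
    """Pick nodes for a block's replicas.

    topology maps a rack name to the list of node names in that rack.
    """
    rng = rng or random.Random()
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = node_rack[writer_node]

    # First replica: the node that is writing the block.
    replicas = [writer_node]

    # Second replica: a node on a different rack, to survive a rack failure.
    remote_racks = [r for r in topology if r != local_rack and topology[r]]
    if replication >= 2 and remote_racks:
        remote_rack = rng.choice(remote_racks)
        second = rng.choice(topology[remote_rack])
        replicas.append(second)

        # Third replica: a different node on that same remote rack.
        if replication >= 3:
            candidates = [n for n in topology[remote_rack] if n != second]
            if candidates:
                replicas.append(rng.choice(candidates))

    # Fallback: fill any remaining slots from nodes not yet used.
    unused = [n for n in node_rack if n not in replicas]
    rng.shuffle(unused)
    while len(replicas) < replication and unused:
        replicas.append(unused.pop())

    return replicas[:replication]


if __name__ == "__main__":
    racks = {"rack1": ["node-a", "node-b"], "rack2": ["node-c", "node-d"]}
    print(place_replicas(racks, "node-a"))
```

A placement like this only helps if the software can see which physical rack a node lives in, which is part of why dedicated hardware and multi-tenant virtual machines behave differently for Hadoop.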

However, the choice is not just between virtual-only and hardware-only. There are cloud providers who provide the ability to deploy both virtual, multi-tenant machines and dedicated, single-tenant servers. While the implementations differ, the capability to deploy both virtual and dedicated machines will give you additional flexibility. Companies such as Rackspace and GoGrid provide both virtual and dedicated hardware. If you are looking for a purpose-built server, Joyent provides servers pre-configured and pre-installed with Riak – a distributed database product by Basho.

If you are looking for Hadoop-as-a-Service and have targeted jobs that need to be low cost and don’t need to run continuously, then Amazon will fit the bill. However, you will take a performance hit: “It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster,” according to this article by InfoWorld. If you are looking for a production system running large continuous data sets with high-performance requirements, then you are better off using a hybrid offering from a provider like Rackspace or GoGrid.

Microsoft Excel PowerPivot: Just What the Analyst Ordered

We do a lot of ad-hoc data investigation and analysis around here, and are always on the lookout for tools that make our lives easier.

While raw SQL and a database remain a top weapon of choice for getting a sense of a new data set, it leaves a bit to be desired in the presentation department. Plus, a lot of people just aren’t comfortable with SQL and we often need to hand off our work to more businessy types that have more starch in their shirts than comp sci classes under their belts.

It’d be great to have something like Excel, but without the limitations. Excel’s fine for simple, small data sets but it:

  • only works up to about a million rows
  • isn’t great at selecting subsets of data
  • isn’t good at stitching multiple tables together

Wouldn’t it be great to have something like Excel that serves as a front end to multiple databases and DOESN’T cost a bajillion dollars like Tableau?

Introducing PowerPivot

Well, there IS such a beast, and the surprise is, it’s Excel itself! Excel 2010 has a free add-on called PowerPivot that addresses many of our longstanding issues with it as a business intelligence tool.

  • Excel with PowerPivot doesn’t max out until hundreds of millions of rows
  • Excel with PowerPivot lets you combine multiple sources (database tables, web services, any feed!) in a single model (spreadsheet/pivot table)
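If the idea of relating tables and then pivoting over them is unfamiliar, the sketch below shows the shape of it in plain Python: join each row of a fact table to a lookup table on a shared key, then sum a value into a row-by-column grid. This is not how PowerPivot works internally. The sample data, field names, and function name are all made up for illustration.

```python
# Illustrative "relate two tables, then pivot" in plain Python.
# All data and names here are hypothetical examples.
from collections import defaultdict

# Fact table: one row per sale.
SALES = [
    {"store_id": 1, "year": 2011, "amount": 10.0},
    {"store_id": 1, "year": 2011, "amount": 5.0},
    {"store_id": 1, "year": 2012, "amount": 7.0},
    {"store_id": 2, "year": 2011, "amount": 3.0},
    {"store_id": 9, "year": 2011, "amount": 100.0},  # no matching store
]

# Lookup table keyed by store_id, e.g. loaded from a second source.
STORES = {1: {"region": "West"}, 2: {"region": "East"}}


def pivot_sum(facts, lookup, key, row_field, col_field, value_field):
    """Join facts to lookup on `key`, then sum `value_field` by row and column."""
    grid = defaultdict(lambda: defaultdict(float))
    for fact in facts:
        dims = lookup.get(fact[key])
        if dims is None:
            continue  # skip fact rows with no related lookup row
        merged = {**fact, **dims}
        grid[merged[row_field]][merged[col_field]] += merged[value_field]
    return {row: dict(cols) for row, cols in grid.items()}


if __name__ == "__main__":
    print(pivot_sum(SALES, STORES, "store_id", "region", "year", "amount"))
```

The appeal of a tool like PowerPivot is that an analyst gets this relate-and-summarize behavior by dragging fields around, without writing the join by hand.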

But Does It, Really?

We didn’t buy the hype on the scalability, so we tried it for ourselves. Our preliminary test was a 1 million row table, imported from a CSV. PowerPivot made a nice fast pivot table over it, no problem. In addition, the workbook on disk was 36MB whereas the raw CSV was 75MB, a reduction of roughly 50% in size. We were also able to relate a few other small tables to the pivot table with no loss in speed.

We were deeply impressed and will be shifting more of our ad-hoc analysis over to Excel.

The only downside? It’s only for Windows. No PowerPivot for Excel Mac yet. This test was run on OS X, within a 64-bit Windows 7 Parallels VM given 4GB RAM. Office itself was the 32-bit version.