Category Archives: Analysis

Preparing for the Storm


In a previous post, I described Big Data as meeting certain criteria: Volume, Velocity and Structure (probably worth revising to Variety, as that is now more commonly used). While Hadoop is currently the primary choice for analytics on Big Data, it is not necessarily designed to handle velocity. That is where Storm comes in.

Hadoop is designed to operate in batch mode: its strength is in consolidating a lot of data, applying operations on the data in a massively parallel way, and reducing the results to a file. Since this is a batch job, it has a defined beginning and a defined end. But what about data that is continuous and needs to be processed in real time? Given the popularity of Hadoop, several open source projects have risen to meet this challenge. Most notably, HBase is designed to work with Hadoop, storing large quantities of sparse data on top of the Hadoop Distributed File System (HDFS).

Storm Use Cases

But what about real-time analysis of a Twitter stream? This use case is a perfect fit for Storm. A recent project developed at Twitter, it is similar to Hadoop but tuned for dealing with high volume and velocity. Storm is intended to handle a real-time stream of information with no defined end; it can keep analyzing streams of data indefinitely. If you are familiar with Yahoo’s S4, Storm is similar. It handles three broad use cases (but not exclusively these):

  • Stream processing
  • Real time computation
  • Distributed remote procedure calls (RPC)

Storm is relatively new but has been adopted by several companies, most notably Twitter (obviously), Groupon and Alibaba. You’ll notice that even though these companies are in different industries, they all have to deal with large amounts of streaming data.

Components

Storm integrates with queuing systems (like Kestrel or JMS) and then writes results to a database. It is designed to be massively scalable, processing very high throughputs of messages with very low latency. It is also fault-tolerant: if any workers or nodes in a cluster die, they are automatically restarted. This gives it the ability to guarantee data processing; since it can track lineage, it will know if data isn’t processed and can make sure that it is. Since Storm has no concept of storage, you will need to store the results in another system (like a database or another application).

Storm also has its own components which bear some explaining. Although there are several components in a Storm cluster (including ZooKeeper, which is used for coordination and maintaining state), there are three that are referenced frequently (a minimal topology sketch follows the list):

  • Topologies – the equivalent of a MapReduce job in Hadoop, except that it doesn’t really end (unless you kill it, of course)
  • Spouts – the primitive that takes source data and emits it into streams for processing
  • Bolts – the primitive that takes the data and does the actual processing (filtering, joining, applying functions, etc.). This is similar to the part of the MapReduce WordCount example that sums the count of words.
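To make these components concrete, here is a minimal, hedged sketch of a word-count topology using Storm's classic Java API. The class and stream names are illustrative; a production topology would read from a real source such as a Kestrel queue rather than the hard-coded test spout below.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class WordCountTopology {

        /** Spout: emits a stream of sentences (here, canned test data). */
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] sentences = {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away"
            };
            private int index = 0;

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values(sentences[index++ % sentences.length]));
                Utils.sleep(100);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sentence"));
            }
        }

        /** Bolt: splits sentences into words and keeps a running count. */
        public static class WordCountBolt extends BaseBasicBolt {
            private final Map<String, Integer> counts = new HashMap<String, Integer>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                for (String word : tuple.getString(0).split("\\s+")) {
                    Integer count = counts.get(word);
                    count = (count == null) ? 1 : count + 1;
                    counts.put(word, count);
                    collector.emit(new Values(word, count));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }

        public static void main(String[] args) {
            // The topology wires spouts to bolts; it runs until it is killed.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 1);
            builder.setBolt("counts", new WordCountBolt(), 4).shuffleGrouping("sentences");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
        }
    }

In a real deployment the running counts would be written out to an external store, since Storm itself does not persist anything.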

Since Storm is intended to run in a highly distributed fashion, it is also a natural fit for the cloud. In fact, there is a project that makes it easy to deploy a Storm cluster on AWS. As an open source project, it can run on any Linux machine and works well with servers that can be horizontally scaled. If you want to learn more technical details, check out the excellent wiki on GitHub, where you can also download the latest version.

Big Confusion about Big Data

Not only do people often confuse what exactly the term “Big Data” means, but the dizzying array of products out there that solve Big Data problems adds to the confusion. So what’s the difference between Hadoop, Cassandra, EMR, BigQuery or Riak?

First, it’s important to define what Big Data is. I use the term to refer to the data itself, although it is often used interchangeably with the solutions that process it (such as Hadoop). I believe that data should satisfy three criteria before being considered “Big Data”:

  • Volume – the amount of data has to be large, in petabytes not just gigabytes
  • Velocity – the data has to be frequent, daily or even real-time
  • Structure – the data is typically but not always unstructured (like videos, tweets, chats)


Hadoop

To deal with this type of data, Big Data solutions have been developed. One of the most well known is Hadoop. It was first developed by Doug Cutting after reading how Google implemented its distributed file system and MapReduce functionality. As such, it is not a database but a framework that implements the Hadoop Distributed File System (HDFS) and MapReduce. There are several distributions of this open-source product, with Cloudera, Hortonworks and MapR being the most well known. Amazon Elastic MapReduce (EMR) is a cloud-based platform-as-a-service implementation of Hadoop: instead of installing a distribution in your data center, you configure and run the jobs on Amazon’s platform.

Why would people use Hadoop? Typically they are pulling large amounts of unstructured data and need to be able to run sorting and simple calculations on the data. For example:

  • Counting words – the standard MapReduce example (a minimal mapper and reducer sketch follows this list)
  • High-volume analysis – gathering and analyzing large-scale ad network data
  • Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
  • Social graphs – determining relationships between individuals
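To show what the word-count case looks like in practice, here is a minimal sketch of the standard WordCount mapper and reducer against Hadoop's Java MapReduce API (the job driver and setup code are omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        /** Mapper: emits a (word, 1) pair for every word in the input line. */
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        /** Reducer: sums the counts for each word. */
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

Everything else (splitting the input, shuffling intermediate pairs to the reducers, writing the output file) is handled by the framework in batch.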

One of the weaknesses of Hadoop is its job-oriented nature. MapReduce is designed to be a batch process, so there is a significant penalty in waiting for the job to start up and complete. It is typically not a strong candidate for real-time analysis. It also has the challenge of having a master node that is a single point of failure. Cloudera has recently released a distribution that implements a failover master node, but this is unique to their distribution. Additionally, developers typically use Java (although other languages are supported), not SQL, to create MapReduce jobs.


Cassandra / Riak / Dynamo

Amazon published a paper (Dynamo) describing how to implement a distributed key-value store, and Cassandra and Riak are implementations of that construct. Amazon also offers its own hosted implementation, DynamoDB.

While there are differences among the implementations, customers will want to use these products because they want fast, predictable performance and built-in fault tolerance. They are aimed at applications that require very low latency and high availability. Typical use cases (a brief sketch of the session-storage case follows the list) would be:

  • Session Storage
  • User Data Storage
  • Scalable, low-latency storage for mobile apps
  • Critical data storage for medical data (a great example is how the Danish government uses Riak for their medical prescription program)
  • Building block for a custom distributed system
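To make the session-storage case concrete, here is a brief, hedged sketch against a hypothetical key-value client. The KeyValueClient interface and SessionStore class below are illustrative only, not the actual Riak, Cassandra or DynamoDB client APIs.

    import java.util.UUID;

    public class SessionStore {

        /** Hypothetical key-value client: get/put a value by bucket and key. */
        public interface KeyValueClient {
            void put(String bucket, String key, byte[] value);
            byte[] get(String bucket, String key);
        }

        private static final String BUCKET = "sessions";
        private final KeyValueClient client;

        public SessionStore(KeyValueClient client) {
            this.client = client;
        }

        /** Create a session and store its payload under a generated key. */
        public String createSession(byte[] payload) {
            String sessionId = UUID.randomUUID().toString();
            client.put(BUCKET, sessionId, payload);
            return sessionId;
        }

        /** Look up a session by its key: a single, low-latency read. */
        public byte[] loadSession(String sessionId) {
            return client.get(BUCKET, sessionId);
        }
    }

The point is that every operation is a single get or put by key, which is exactly the access pattern these stores are built to serve with predictable latency.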

These solutions are typically clustered in a ring formation with no particular node marked as a master, and thus no single point of failure. Customers will typically not use these solutions if they require complex ad-hoc querying or heavy analytics. As with Hadoop, building queries here requires using a programming language like Java or Erlang rather than SQL. There is also a project called HBase, which provides similar columnar data store functionality for data on Hadoop, so that is another available option; that project was modeled after Google BigTable rather than Dynamo.


MongoDB

This is a project that has become very popular; it is managed by 10gen and also offered in a SaaS model by MongoLab. MongoDB has become popular because it is designed with developers in mind and bridges the gap between a relational database (like MySQL) and a key-value store (like Riak). MongoDB is excellent if you need dynamic queries and want to maintain functionality similar to a relational database while staying friendlier to object-oriented programming languages. 10gen emphasizes that MongoDB is designed to scale out easily and also to accelerate development (for example, when conducting agile development). Typical use cases would be:

  • Companies storing data in flat files
  • Using an XML store for data too complex to model in a relational database
  • Deployments to public or private clouds
  • Electronic health records

As with its other NoSQL brethren, MongoDB does not use SQL. Instead, you use a programming language (like JavaScript) to interact with the database. It is also not a good fit for applications that require multi-object transactions, which MongoDB currently does not support.
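For example, here is a hedged sketch of a dynamic query using the classic (2.x-era) MongoDB Java driver; the ehr database and patients collection names are made up for illustration:

    import java.util.Arrays;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class PatientQuery {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("ehr");
            DBCollection patients = db.getCollection("patients");

            // Insert a document; no schema has to be declared up front.
            patients.insert(new BasicDBObject("name", "Jane Doe")
                    .append("age", 47)
                    .append("conditions", Arrays.asList("hypertension")));

            // Dynamic query: patients older than 40, expressed as a document
            // rather than as SQL.
            DBObject query = new BasicDBObject("age", new BasicDBObject("$gt", 40));
            DBCursor cursor = patients.find(query);
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }
            mongo.close();
        }
    }

The query itself is just a document, which is what makes ad-hoc, dynamic querying much more natural here than in a pure key-value store.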


Dremel

No, this is not a power tool but a technology described by Google in a recent research paper. To understand this product, you need to be familiar with Hive and Pig, since they are often compared. As noted above, Hadoop has two main challenges: jobs have start-up lag, and Java (or some other programming language) is required to write and implement MapReduce. Hive and Pig are open-source projects that sit on top of Hadoop and allow users to query data with SQL (in the case of Hive) or a scripting language (in the case of Pig, by writing Pig Latin). Note that since this is a layer above Hadoop, MapReduce code is still being generated and executed; the layer just translates SQL or Pig Latin into MapReduce jobs. This solves the problem of writing queries against Hadoop but does not solve the latency problem. That is what Dremel attempts to do.
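To make the Hive point concrete, here is a hedged sketch of querying Hadoop data through Hive's JDBC interface, assuming a HiveServer2 endpoint on localhost; the pageviews table is made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // Plain SQL; under the covers this becomes one or more MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }

The query reads like ordinary SQL, but the latency of the underlying MapReduce jobs remains; that latency is exactly what Dremel tries to eliminate.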

Dremel is designed to allow users to write near real-time, interactive, ad-hoc queries against a large data set. Because it uses its own query execution engine and doesn’t rely on MapReduce, it attempts to solve the Hadoop latency problem. However, in its current implementation it is purely an analysis tool, since it is read-only and organizes data in a columnar format. While there is an open-source project for this product, it is also commercially available as Google BigQuery.

Where to go with Big Data

As you can see, there are a lot of products out there that all solve Big Data problems in different ways. It’s important to understand your use case and see which product is the best fit; it’s unlikely that you can find one that solves the entire universe of Big Data problems. In addition, it is likely that you will still need a traditional relational database if you need to do reporting or interface with an enterprise application. Of all the projects, Hadoop has the largest ecosystem, with supporting projects that broaden the functionality of the core product. As mentioned earlier, Hive is used to write SQL queries; there is also Sqoop, which integrates relational data, and HBase, which provides low-latency capabilities. Fortunately, the majority of these projects are open source, so you only need to learn how to use them and find a cost-effective platform to run them on, like the cloud, a natural fit for distributed databases and NoSQL projects.

Pining for Pinterest

What does the Web’s hottest social network hold for data analytics?

Pinterest recently broke 11 million users this year. It was reportedly the fastest social network to reach 10 million unique users since its closed-beta launch in March 2010. However, that is somewhat in dispute, with many claiming that Formspring still holds that title. Either way, there is no debate that Pinterest is growing at a phenomenal rate.

What is Pinterest anyway?

Since Pinterest headquarters is just down the Peninsula, I caught on to this site relatively early. However, a recent dinner with some out-of-town colleagues reminded me that not everyone obsessively follows startup news like I do.

Pinterest is a social network that links people together by their interests rather than their social circle. So unlike Facebook, which revolves around your friends, on Pinterest people express themselves and find friends through common interests. Another twist is that this is done via the metaphor of a pin board. Users “pin” photos, whether found on the site, from around the web, or uploaded directly, onto user-created “boards”, which organize their pins into categories like “Travel” or “Humor”.

Since Pinterest is picture based, visual content shows far better. Infographics, for example, are far more represented there than articles (some of which have no pictures to even pin). Brands are starting to take notice: Life magazine has a large following, posting mostly archived photos. Retailers are particularly strong here, since most of the items they sell are physical and therefore lend themselves well to a pictorial presentation. In fact, Pinterest is driving more referral traffic than Google+, YouTube and LinkedIn. Combined.

So who is on Pinterest?

Modea, a digital advertising agency, compiled some demographic information on Pinterest users, and the results were quite interesting. On average, users spend 15.8 minutes “pinning”, while Facebook users spend only 12.1 minutes liking things. Almost a third of Pinterest users have an annual household income of at least $100,000, and almost 70% are women, with the majority aged 25-34. So young, upper-middle class and female: sounds like the demographic that advertisers will want to target. Of course, that is just the US. The demographics in the UK are decidedly different: 56% male, 29% in the highest income bracket and interested in venture capital (go figure).

However, this doesn’t mean that there isn’t a good mix of interests on Pinterest. The Board of Man, a board focused on more “manly” interests, has 220,000 followers. Even the US Army has gotten in on it with its own set of boards, managed by its Chief of Public Affairs.

It’s all about the data

Pinning is fun and all, but the real value is the information captured from all those young and wealthy consumers. It’s also capturing a demographic that isn’t as prevalent on most social networks (which skew mostly male and coastal). Users are specifically stating their interests, in particular things that they covet or plan to purchase. In addition, the content is persistent. Unlike Twitter or Facebook, your pins stay on the board for you to easily review later (and don’t come back from the dead, as people discovered during the Facebook switch to Timeline). Retailers are starting to take notice, creating their own pages and engaging with users through interesting content.

Pinterest is still in beta so it has some things that I think it needs to add to become more robust.

  • It really needs some sort of robust and standards based API (like REST). This will give websites the ability to better integrate with the social network. It also creates the capability for 3rd parties to start extracting some interesting data.
  • An easier way to find people. It’s all about the interests and not friends but still, it’s impossible to find someone if they aren’t already friends with you on Facebook.
  • Allow some pins to stand out in some way (either promoted or simply popular)
  • Work in some additional content other than photos. While photos really enrich the site, the ability to add music or text for additional context would be helpful.

In particular, a Pinterest API used in conjunction with other APIs would let developers create some interesting apps in combination with Facebook, Twitter or Foursquare. The mash-up of the Facebook social graph with the Pinterest interest graph would be quite interesting. Since the information posted on Pinterest is public by default and generated in real time, it would be a good indication of the zeitgeist in terms of interests, which would be a welcome addition to the topic trends already surfacing on Twitter.

Customer retention metrics

Last night (Tue July 19th), I was fortunate to be able to speak to the SVForum Business Intelligence special interest group (SIG).

After introducing the audience to DASHbay, I took them through an implementation we did using our Quick Analysis practice, which leverages open source software (especially BIRT and PostgreSQL), cloud computing (on AWS), and rapid, iterative development.

The implementation itself was a dashboard, built with BIRT in less than a week and showing metrics for account acquisition and retention. The metrics help any business track not just how well they are acquiring customers, but how well they are keeping them.

Account retention dashboard
Our customer was able to get at the metrics via a URL to a server running in the cloud, set up just for them. It’s a great way to leverage cloud computing: no IT procurement costs or delays, and you only pay for it while you need it.
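The arithmetic behind a retention metric is simple. Here is a hedged sketch (illustrative only, not DASHbay's actual implementation) of a cohort retention rate: of the accounts acquired in a given month, what fraction was still active in a later month?

    import java.util.HashSet;
    import java.util.Set;

    public class RetentionMetrics {

        /**
         * @param acquiredInCohort  account ids first seen in the cohort month
         * @param activeInMonth     account ids active in the month being measured
         * @return retention rate as a fraction between 0 and 1
         */
        public static double retentionRate(Set<String> acquiredInCohort,
                                           Set<String> activeInMonth) {
            if (acquiredInCohort.isEmpty()) {
                return 0.0;
            }
            // Accounts that were both acquired in the cohort and still active.
            Set<String> retained = new HashSet<String>(acquiredInCohort);
            retained.retainAll(activeInMonth);
            return (double) retained.size() / acquiredInCohort.size();
        }
    }

The acquisition side of the picture is just the size of each monthly cohort; plotting the two together gives an acquisition-versus-retention view.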

We talked about DASHbay’s Report Server product, which, among other features, allows us to capture any useful piece of a report and include it in any web page. It also provides permissioning and authentication, a taxonomy for organizing reports, and more.

I got an excellent reception from the audience, and was pleased with the reaction and discussions afterwards. Thanks to all who attended!

If you didn’t get a chance to be there, please get in touch so we can talk to you more about our Report Server for BIRT, our Quick Analysis Service, or many custom BI and Data Analytics services. Customer Retention is one very useful application which we can provide, but our tools and techniques are applicable to most common business analysis problems.

Terry

Microsoft Excel PowerPivot: Just What the Analyst Ordered

We do a lot of ad-hoc data investigation and analysis around here, and are always on the lookout for tools that make our lives easier.

While raw SQL and a database remain a top weapon of choice for getting a sense of a new data set, it leaves a bit to be desired in the presentation department. Plus, a lot of people just aren’t comfortable with SQL and we often need to hand off our work to more businessy types that have more starch in their shirts than comp sci classes under their belts.

It’d be great to have something like Excel, but without the limitations. Excel’s fine for simple, small data sets but it:

  • only works up to about a million rows
  • isn’t great at selecting subsets of data
  • isn’t good at stitching multiple tables together

Wouldn’t it be great to have something like Excel, but it serves more like the front end to multiple databases and DOESN’T cost a bajillion dollars like Tableau?

Introducing PowerPivot

Well there IS such a beast, and the surprise is, it’s Excel itself!  Excel 2010 has a free add-on called PowerPivot that addresses many of our longstanding issues with it as a business intelligence tool.

  • Excel with PowerPivot scales to hundreds of millions of rows
  • Excel with PowerPivot lets you combine multiple sources (database tables, web services, any feed!) in a single model (spreadsheet/pivot table)

But Does It, Really?

We didn’t buy the hype on the scalability, so we tried it for ourselves. Our preliminary test was a 1 million row table, imported from a CSV. PowerPivot made a nice fast pivot table over it, no problem. In addition, the workbook on disk was 36MB whereas the raw CSV was 75MB, roughly a 50% reduction in size. We were also able to relate a few other small tables to the pivot table with no loss in speed.

We were deeply impressed and will be shifting more of our ad-hoc analysis over to Excel.

The only downside? It’s only for Windows. No PowerPivot for Excel on the Mac yet. This test was run on OS X, within a 64-bit Windows 7 Parallels VM given 4GB of RAM. Office itself was the 32-bit version.