Category Archives: Databases

Big Confusion about Big Data

Not only do people often confuse what exactly the term “Big Data” means, but the dizzying array of products out there that solve Big Data problems adds to the confusion. So what’s the difference between Hadoop, Cassandra, EMR, BigQuery or Riak?

First, it’s important to define what Big Data is. I use “Big Data” to mean the data itself – although the term is often used interchangeably with the solutions that process it (such as Hadoop). I believe that data should satisfy three criteria before being considered “Big Data”:

  • Volume – the amount of data has to be large, in petabytes rather than just gigabytes
  • Velocity – the data has to arrive frequently, daily or even in real time
  • Structure – the data is typically, but not always, unstructured (like videos, tweets, chats)


Hadoop

To deal with this type of data, Big Data solutions have been developed. One of the most well known is Hadoop. It was first developed by Doug Cutting after reading how Google implemented its distributed file system and MapReduce functionality. As such, it is not a database but a framework that implements the Hadoop Distributed File System (HDFS) and MapReduce. There are several distributions of this open-source product – Cloudera, Hortonworks and MapR being the most well-known. Amazon Elastic MapReduce (EMR) is a cloud-based platform-as-a-service implementation of Hadoop: instead of installing a distribution in your data center, you configure the jobs using Amazon’s platform.

Why would people use Hadoop? Typically they are pulling large amounts of unstructured data and need to be able to run sorting and simple calculations on the data. For example:

  • Counting words – the standard MapReduce example (see the sketch after this list)
  • High-volume analysis – gathering and analyzing large-scale ad network data
  • Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
  • Social graphs – determining relationships between individuals
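
To make the word-count example concrete, here is a minimal sketch of a mapper and a reducer written for Hadoop Streaming in Python (the file names and submission command are illustrative; any language that reads stdin and writes stdout would do):

    # mapper.py – emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    # reducer.py – sum the counts per word; Hadoop Streaming sorts the
    # mapper output, so lines with the same key arrive contiguously
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

A job like this would be submitted with the hadoop-streaming jar, pointing -mapper and -reducer at the two scripts and -input/-output at HDFS (or, on EMR, S3) paths.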

One of the weaknesses of Hadoop is its job-oriented nature. MapReduce is designed as a batch process, so there is a significant penalty in waiting for a job to start up and complete. It is typically not a strong candidate for real-time analysis. It also has the challenge of a master node that is a single point of failure. Recently, Cloudera released a distribution that implements a failover master node, but this is unique to their distribution. Additionally, developers typically use Java (although other languages are supported), not SQL, to create MapReduce jobs.


Cassandra / Riak / Dynamo

Amazon published a paper on how to implement a distributed key-value store, and Cassandra and Riak are implementations of that key-value store design. Amazon also has its own hosted implementation called DynamoDB.

While there are differences among the implementations, customers will want to use these products because they want fast, predictable performance and built-in fault tolerance. They suit applications that require very low latency and high availability. Typical use cases would be:

  • Session Storage
  • User Data Storage
  • Scalable, low-latency storage for mobile apps
  • Critical data storage for medical data (a great example is how the Danish government uses Riak for their medical prescription program)
  • Building block for a custom distributed system

These solutions are typically clustered in a ring formation with no particular node marked as a master, and thus no single point of failure. Customers will typically not use these solutions if they require complex ad-hoc querying or heavy analytics. As with Hadoop, building queries here requires using a programming language like Java or Erlang rather than SQL. There is also a project called HBase, which provides a similar columnar data store for data on Hadoop, so that is another available option – though it was modeled after Google BigTable, not Dynamo.
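
As a small illustration of the session-storage use case above, here is a sketch that stores and fetches a session object through Riak’s HTTP interface using Python’s requests library (the endpoint, bucket and key names are assumptions for the example):

    import json
    import requests

    RIAK = "http://localhost:8098"            # default Riak HTTP port, assuming a local node
    url = RIAK + "/riak/sessions/user-1234"   # bucket "sessions", key "user-1234" (illustrative)

    # Store the session object as JSON
    session = {"user_id": 1234, "cart": ["sku-1", "sku-7"]}
    requests.put(url, data=json.dumps(session),
                 headers={"Content-Type": "application/json"})

    # Fetch it back by key – a single, predictable-latency lookup
    resp = requests.get(url)
    if resp.status_code == 200:
        print(resp.json())

Cassandra and DynamoDB expose the same get/put-by-key style through their own drivers; the point is that access is always by key, not by arbitrary query.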


MongoDB

This is a project that has become very popular; it is managed by 10gen and also offered in a SaaS model by MongoLab. MongoDB has become popular because it is designed with developers in mind and bridges the gap between a relational database (like MySQL) and a key-value store (like Riak). MongoDB is excellent if you need dynamic queries and want to keep much of the functionality of a relational database while being friendlier to object-oriented programming languages. 10gen emphasizes that MongoDB is designed to scale out easily and also to accelerate development (for example, when conducting agile development). Typical use cases would be:

  • Companies storing data in flat files
  • Using an XML store for data too complex to model in a relational database
  • Deployments to public or private clouds
  • Electronic health records

As with its other NoSQL brethren, MongoDB does not use SQL. Instead, you use a programming language (like JavaScript) to interact with the database. It is also not a good fit for applications that require multi-object transactions, which MongoDB currently does not support.
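
A minimal sketch of that dynamic-query style using the pymongo driver (the database, collection and field names are made up for the example):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
    db = client["healthcare"]                          # illustrative database name

    # Documents can be nested and need no up-front schema
    db.records.insert_one({
        "patient_id": 42,
        "visits": [{"date": "2012-05-01", "diagnosis": "flu"}],
    })

    # Dynamic, ad-hoc query on a nested field – no SQL, just a query document
    for rec in db.records.find({"visits.diagnosis": "flu"}):
        print(rec["patient_id"])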


Dremel

No, this is not a power tool but a technology described by Google in a recent research paper. To understand this product, you need to be familiar with Hive and Pig, since they are often compared. As noted above, Hadoop has two main challenges – the jobs have a start-up lag, and Java (or some other programming language) is required to write MapReduce code. Hive and Pig are open-source projects that sit on top of Hadoop and allow users to query data with SQL (in the case of Hive) or a scripting language (in the case of Pig, by writing Pig Latin). Note that since this is a layer above Hadoop, MapReduce code is still being generated and executed – these tools simply translate SQL or Pig Latin into MapReduce jobs. This solves the problem of being able to write queries against Hadoop, but it does not solve the latency problem. That is what Dremel attempts to do.
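
For example, the canonical word count can be expressed in HiveQL and handed to the Hive command-line client, which compiles it into MapReduce jobs behind the scenes (the table name and the use of Python’s subprocess to invoke hive -e are just for illustration):

    import subprocess

    # HiveQL word count over an illustrative table docs(line STRING);
    # Hive compiles this into one or more MapReduce jobs.
    query = """
    SELECT word, COUNT(*) AS freq
    FROM docs LATERAL VIEW explode(split(line, ' ')) t AS word
    GROUP BY word
    """

    # The Hive CLI accepts a query string via -e
    subprocess.run(["hive", "-e", query], check=True)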

Dremel is designed to let users run near real-time, interactive, ad-hoc queries against a large data set. Because it uses its own query execution engine and doesn’t rely on MapReduce, it attempts to solve the Hadoop latency problem. However, in its current implementation it is purely an analysis tool, since it is read-only and organizes data in a columnar format. While there is an open-source project around this design, it is also commercially available as Google BigQuery.
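
A sketch of what such an interactive, ad-hoc query looks like against BigQuery, here using Google’s Python client library (the project, dataset and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # placeholder project id

    # An ad-hoc, read-only query over a columnar store; no MapReduce job to schedule
    query = """
        SELECT word, COUNT(*) AS freq
        FROM `my-project.my_dataset.documents`
        GROUP BY word
        ORDER BY freq DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.word, row.freq)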

Where to go with Big Data

As you can see, there are a lot of products out there that all solve Big Data problems in different ways. It’s important to understand your use case and see which product is the best fit – it’s unlikely that you will find one that solves the entire universe of Big Data problems. In addition, it is likely that you will still need a traditional relational database if you need to do reporting or interface with an enterprise application. Of all the projects, Hadoop has the largest ecosystem, with supporting projects that broaden the functionality of the core product. As mentioned earlier, Hive is used to write SQL queries; there is also Sqoop, which integrates relational data, and HBase, which provides low-latency capabilities. Fortunately, the majority of these projects are open-source, so you only need to learn how to use them and find a cost-effective platform to run them on – like the cloud, a natural fit for distributed databases and NoSQL projects.

Why BI in the Cloud?

Business Intelligence (BI) has been around for a while, but recently the interest in analytics and the tools that support it has grown considerably. Previously, only large enterprises could afford the infrastructure and license costs of traditional business intelligence. However, the advent of the cloud is an opportunity for everyone to take advantage of the transformative power of data.


Traditional BI was typically a client-server setup, with companies shelling out for dedicated equipment in their own data center or a co-lo, paying license fees for the software and employing a dedicated staff to manage the equipment and maintain the servers. The more data you had to analyze, the more equipment you had to buy and the more staff you had to hire. The massive investment in this area became a sunk cost, and some enterprises are still tied to this model despite the fact that new models have arisen to tackle the data challenge.

The Cloud

The cloud opens up new options for companies looking to either build out their own solution or leverage new products that are native to the cloud. There are currently three generally accepted cloud delivery models: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). I consider SaaS and IaaS to be the models that will be most effective for BI.

One of the concerns often raised regarding the cloud is security – particularly with BI, since it involves data and sometimes personal information. The question then is: how is a provider that specializes in managing data centers (in the case of IaaS) or large-scale application and data hosting (in the case of SaaS) any less secure than a company whose focus is on its own business rather than on securing a data center? You will need to verify a few things depending on the delivery model you are using. For SaaS, you will want to ensure that the provider offers secure login and authentication, the ability to define granular levels of access, SSL, and the option of encryption at rest. For IaaS, the customer is typically responsible for setting up security, but you will want to see SSAE 16 / ISAE 3402 attestation and ISO 27001 certification, DDoS mitigation, and options to provision firewalls and load balancers. The Cloud Security Alliance (CSA) is currently working on its Security, Trust and Assurance Registry (STAR), which will make it easier to evaluate the security criteria of a cloud provider, but it is currently only in the preview stage.

Deciding which Cloud Model

If you are interested in simply providing data and then having software available to analyze, manipulate and report on it, SaaS products may work for you. Examples of SaaS BI companies are BIRST, PivotLink and GoodData. All of them provide the full BI stack: a way to extract, transform and load data, a place to store the data, and a front-end to create ad-hoc reports and dashboards. The advantage of this style of BI is that all the management of the infrastructure and software is handled by the vendor. There are also minimal up-front costs – the model is to simply pay for what you use.

If you want to build your own BI platform, you can leverage IaaS and open-source software. You will need to find an IaaS vendor that best suits your needs – Amazon Web Services, GoGrid and Rackspace are the leaders in this space. Using IaaS, you have full control of your infrastructure – you determine how many servers to spin up, the security you want to use and how you want to store the data. You will also need to build out the software, assuming you have that expertise in-house. Some good open-source options are Talend for data migration and manipulation, Pentaho for data integration and analytics, BIRT for reporting, and MySQL or Postgres for the database. This model requires more system administration expertise and developer resources to build out a custom BI solution, but it may be worth the investment if you want tighter control of your product or have very specific custom needs. In either case, if you can leverage the features offered by open-source software, you will also minimize up-front costs and pay only for the infrastructure that you use – with the option of spinning up servers to meet demand or removing servers when they are no longer needed. You can start off with a small 1 GB server and expand to more cores, more RAM and more storage quickly and easily.

In addition, the flexibility of the cloud gives you the option to expand your infrastructure if you need to incorporate a Big Data solution to meet a particular use case. Currently, most of the popular technology is open-source, such as Hadoop, MongoDB and Cassandra. I discuss Big Data in more depth in a previous blog post.

Ultimately, you may decide that you need all the features offered by a traditional BI vendor, or you may have already made the investment in a particular infrastructure or technology. After all, these companies have been around a long time and there are many talented individuals who are well-versed in their products. However, if you are interested in lowering costs and off-loading the infrastructure and software work to another vendor, or are new to BI and want to get started with minimal up-front costs, cloud-based BI solutions might be the right option for you. Instead of having to project growth in order to buy hardware up-front, you will have the ability to pay as you go, adding infrastructure and cost only as your growth demands. DASHbay is experienced in both delivery models discussed here, and we can provide the right analytics expertise and development experience for your BI needs. Considering the growth in data, the flexibility of the cloud and the much-needed analytic features of a BI solution work well together to provide a powerful, low-cost and scalable solution. Make sure that you work with the right vendors and partners to make your project a success!

Big Data and the Cloud


One of the trends hitting analytics is the adoption of Big Data. A point I raised a few years ago was that a side effect of web companies is that they generate a lot of data. Many prescient companies like Yahoo and Facebook understood the value of this data and started to figure out ways to analyze it. Out of these early projects came several open-source products that incorporated the architectures espoused by these companies.

Two of the most notable open-source projects to gain broad adoption are Hadoop and Cassandra. The idea was that companies were not only starting to see a lot more data, they were also seeing a lot of unstructured data. This type of data didn’t lend itself to analysis by existing relational database tools. Hadoop was inspired by both Google and Yahoo (who is a major contributor to the project) to solve this exact problem.

Since many web companies were using Hadoop, and Big Data started to gain traction at the same time as cloud computing, it soon became natural to combine Big Data with cloud computing. On the surface, this makes sense. Hadoop, and NoSQL solutions like it, are designed for massively parallel computing. Cloud computing is designed to provide the ability to spin up servers very quickly and scale horizontally. Seems like a natural match, doesn’t it?

It depends on your use case. Mission-critical production systems that need access to the cluster 24/7 and have performance requirements will want a lot of compute power as well as high-performance I/O. Not all cloud providers are able to give the level of performance required. Notably, Amazon is currently the only cloud provider attempting to provide Hadoop as a service on its multi-tenant cloud environment. MapReduce jobs can be run through Amazon Elastic MapReduce, spinning up instances as needed to run the jobs in parallel and then returning the result set to S3. If you can tune your setup correctly, this can result in significant savings for targeted batch jobs – a good example is the TimesMachine project run by the New York Times, which converted TIFF files into smaller PNG images using Amazon and Hadoop, processing 150 years’ worth of images in 36 hours.
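
A hedged sketch of how such a batch job can be submitted to EMR with the boto3 SDK – Amazon spins up the instances, runs a streaming step and tears the cluster down again (the bucket names, conversion script and instance counts are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.run_job_flow(
        Name="tiff-conversion",                  # illustrative job name
        ReleaseLabel="emr-5.36.0",               # any current EMR release
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 20,                 # scale out for the batch, then terminate
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "convert-images",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hadoop-streaming",
                         "-files", "s3://my-bucket/convert.py",   # placeholder script
                         "-mapper", "convert.py",
                         "-input", "s3://my-bucket/tiffs/",
                         "-output", "s3://my-bucket/pngs/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )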

Other cloud providers rely on providing single-tenant hardware solutions. This provides both dedicated compute and local storage for high-performance I/O. In fact, the Hadoop Distributed File System (HDFS) is designed specifically to be rack-aware so that it can ensure replication of data across multiple nodes and even across different racks. Cloudera, which has its own distribution of Hadoop, publishes specific hardware recommendations depending on the workload. MapR, which adds code to its distribution for performance gains, requires access to physical hard disks and may not play well in all multi-tenant environments.

However, the choice is not just between virtual-only and hardware-only. There are cloud providers that offer the ability to deploy both virtual, multi-tenant machines and dedicated, single-tenant servers. While the implementations differ, the capability to deploy both virtual and dedicated machines gives you additional flexibility. Companies such as Rackspace and GoGrid provide both virtual and dedicated hardware. If you are looking for a purpose-built server, Joyent provides servers pre-configured and pre-installed with Riak – a distributed database product by Basho.

If you are looking for Hadoop-as-a-Service and have targeted jobs that need to be low cost and don’t need to run consistently, then Amazon will fit the bill. However, you will take a performance hit – “It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster” according to this article by Infoworld. If you are looking for a production system running large continuous data sets with high-performance requirements, then you are better off using a hybrid offering on a provider like Rackspace or GoGrid.

Customer retention metrics

Last night (Tue July 19th), I was fortunate to be able to speak to the SVForum Business Intelligence special interest group (SIG).

After introducing the audience to DASHbay, I took them through an implementation we did using our Quick Analysis practice, which leverages open-source software (especially BIRT and PostgreSQL), cloud computing (on AWS), and rapid, iterative development.

The implementation itself was a dashboard, built with BIRT in less than a week and showing metrics for account acquisition and retention. The metrics help any business track not just how well they are acquiring customers, but how well they are keeping them.

Account retention dashboard
Our customer was able to get at the metrics via a URL to a server running in the cloud, set up just for them. It’s a great way to leverage cloud computing: no IT procurement costs or delays, and you only pay for it while you need it.

We talked about DASHbay’s Report Server product, which among other features, allows us to capture any useful piece of the report, and include it in any web page. It also provides permissioning and authentication, taxonomy for organizing reports, and more.

I got an excellent reception from the audience, and was pleased with the reaction and discussions afterwards. Thanks to all who attended!

If you didn’t get a chance to be there, please get in touch so we can talk to you more about our Report Server for BIRT, our Quick Analysis Service, or our many custom BI and Data Analytics services. Customer Retention is one very useful application we can provide, but our tools and techniques are applicable to most common business analysis problems.

Terry

Drizzle – a new MySQL fork

There is a new MySQL fork that recently went GA called “Drizzle”. It is written by former developers of MySQL and is open-source. It’s designed to be fast, lightweight and optimized for the cloud.

http://drizzle.org/

Since databases are a part of every Business Intelligence implementation, I’m going to investigate how easy it is to install, deploy and configure.

I built out a VM using the Alpha 3 build of Natty Narwhal (Ubuntu 11.04). Since there is already an Ubuntu package, the deployment went smoothly.

As expected, the Alpha build of Ubuntu has some stability issues (for example, the new Unity interface didn’t load until the VM was launched a second time, and some random components crashed). However, these do not seem to impact Drizzle.

My intent is to give Drizzle a test run by running some code that I wrote for MySQL 5.1 and note the differences. This will be closer to what someone would do if conducting some light SQL development and will not go into more advanced administration such as trying to test its capabilities for massive concurrency or clustered systems.

Working with Postgres on Windows via ODBC

The most straightforward way to hook up Excel or most other Windows programs to a PostgreSQL database is ODBC.
ODBC lets you connect to a local or remote Postgres.
To do this, you need to get an ODBC driver for Postgres. But which one? In most cases, you’re going to want the free official ODBC driver from Postgres. Its documentation is so poor that I overlooked it for a day before figuring out that it’s the real deal. The driver’s homepage makes it look like the last version was released in 2005, but in fact there have been releases as recently as October 2010 (and by the time you’re reading this, probably more recent ones).
Get the Windows installer here: http://www.postgresql.org/ftp/odbc/versions/msi/
Make sure you install the right version. Even if you have 64-bit Windows, you’ll need the 32-bit ODBC driver if the program that’s connecting to the database (like Excel) is 32-bit. For Excel specifically, File > Help says whether it’s 32- or 64-bit.
The official installation and configuration FAQ: http://psqlodbc.projects.postgresql.org/faq.html
Set up a DSN
Once you’ve installed the ODBC driver you need to make it available to programs like Excel.
You do this via your computer’s “Data Sources (ODBC)” panel. There’s a 32-bit and a 64-bit version, so make sure you’re looking in the right one: http://support.microsoft.com/kb/942976
Once it’s open, click either the User DSN or System DSN tab:
  • User DSN: is available just for the current user.
    • If you’re just using it within Excel or Access, this is the one you want.
  • System DSN: is available for all users and services on the machine.

Click ‘Add…’ and select a PostgreSQL driver.  There are two to choose from:

  • PostgreSQL Unicode: use this if your database was set up with the UTF-8 character set.
  • PostgreSQL ANSI: use this if your database was set up with a LATIN character set.
Fill in your server and database details.

  • Data Source: a friendly name for the data source that you’ll later use when looking up this DSN in Excel or other programs.
  • Database: the name of the database to connect to (‘postgres’ is the default maintenance database).
  • Server: domain name or IP address of server
  • User Name: postgres database username
  • Password: postgres database password
Go Forth And Use ODBC
Now go to the program in which you need to connect to Postgres and use whatever ODBC connection UI it provides.
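
If you want to pull rows programmatically rather than through a GUI, the same DSN also works from Python via the pyodbc module – a minimal sketch, assuming a DSN named ‘MyPostgres’ set up as above (the credentials are placeholders):

    import pyodbc

    # Connect through the DSN configured in the ODBC Data Sources panel
    conn = pyodbc.connect("DSN=MyPostgres;UID=postgres;PWD=secret")
    cursor = conn.cursor()

    # List the tables in the public schema as a quick connectivity check
    cursor.execute("SELECT table_name FROM information_schema.tables "
                   "WHERE table_schema = 'public'")
    for row in cursor.fetchall():
        print(row.table_name)

    conn.close()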