Security in the cloud


All the news about PRISM and the collection of internet traffic in the US and abroad got me thinking about how security is even more relevant now that so much of our work is done on the Internet. While basic security techniques won't stop entities with special back-door access, being vigilant about security while using cloud services is always a best practice.

The cloud is often divided into three main categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). The majority of users are most familiar with SaaS, either through an application at work (most likely Salesforce, among others) or through web mail (Gmail, Yahoo Mail, etc.).


SaaS Security

SaaS security considerations are much the same as those for any website. You will almost always want to:

  • Create a unique password that is difficult to guess
  • Enable HTTPS whenever possible
  • Enable two-factor authentication
  • Guard against socially engineered attacks

As a user, you are reliant on the SaaS provider to ensure that the systems that they are running on are secure, that they employ best practices, and are PCI-compliant (if they are taking your credit card information). The basic items listed above are some of the steps you can take to improve security.

The first item is familiar to anyone who has used email, so I will go into more depth on the other options. Enabling HTTPS means that the traffic between your browser and the server is encrypted, so anything sent over the internet will be difficult to read if it is ever intercepted. This is clearly important when sending your login information, since a complex password is meaningless if anyone sniffing your traffic can grab it. The details of the encryption can be fairly complex, but essentially you want providers who use at least a 1024-bit key, and the standard is quickly moving to 2048 (the longer the key, the harder it is to crack). However, this only encrypts traffic in transit, NOT at rest. Once your data is on the server, it may or may not be encrypted.

Two-factor authentication is all the rage, since passwords are guessed, stolen or simply given away in socially engineered attacks. The idea behind it is that to get in, you need something you know (your password) and something you have (your token), and it's unlikely that a remote hacker will have both. The token can take the form of a physical key (like an RSA token that generates a random number every 60 seconds), an app that does the same thing on your smartphone (like Google Authenticator), or a random number sent as a text message to your phone. While there are ways to defeat two-factor authentication (like man-in-the-middle attacks), it is generally safer than single-factor authentication.
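The app-based token is typically the TOTP algorithm (RFC 6238): a shared secret plus the current 30-second time window, run through HMAC. A minimal standard-library sketch, shown only to illustrate the mechanism:

```python
import base64
import hmac
import struct
import time

def totp(secret_b32, t=None, digits=6, period=30):
    """Compute an RFC 6238 time-based one-time password (SHA-1 variant)."""
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of elapsed time steps since the epoch.
    counter = int((t if t is not None else time.time()) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    # Dynamic truncation: the last nibble picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test secret "12345678901234567890", base32-encoded:
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", t=59))  # 287082
```

Both your phone and the server run this same computation, so the code never travels over the network until you type it in.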

Social engineering is the electronic version of the con game – using psychological manipulation to gain information for the purposes of theft, fraud or other nefarious activities. The danger in these threats is that you are dependent on people you don't even know, who have access to your personal information, being vigilant about not giving it away. There is little you can do about this other than ensuring that your SaaS vendor has known security policies and procedures to prevent it. You can also protect yourself somewhat by not linking your Internet life together: use different passwords and try not to link too many services together, since a vulnerability in one can implode the rest. Mat Honan wrote an excellent article about his unfortunate experience of getting hacked and the steps he could have taken to prevent it.


IaaS Security

I’m going to skip PaaS since it is rapidly beginning to blend into IaaS (and vice versa). When you are using an IaaS provider, you are operating at the infrastructure layer, meaning the individual components that are used to host your application. You are no longer interfacing with a website but rather dealing at the compute, network and storage layer. You are most likely using this type of provider because you are hosting your own SaaS service and want it to reside in the cloud or are hosting some sort of online service, like a website or mobile app. Because you are now dealing with infrastructure, you are faced with a whole different set of security requirements than SaaS users. You will want to look into:

  • The physical security of the data center where your infrastructure is located
  • The security settings of the server
  • The security services offered by the provider
  • Any certifications that you need to meet for your particular use case
  • Specific security you need to provide for your customers
  • Data backup and disaster recovery
  • Your own security policies and procedures

At the IaaS layer, you are often a business serving your own customers so you have more security considerations to think about. You are also responsible for configuring your own security above the server layer as this is often not the responsibility of the IaaS provider.

One of the things to look for is whether the provider's data centers have strong security and controls – this is often reflected in an SSAE 16 Type II and/or ISO 27001 certified data center designation. These are internationally recognized auditing standards that produce a detailed report of the provider's controls and security and, in the case of Type II, the auditor's opinion on whether the controls were operating effectively. Note that SSAE 16 Type II replaces SAS 70 Type II (they are one and the same). Determine whether the provider has the right compliance (PCI, HIPAA, FISMA, etc.) for your particular requirements. Data retention (what happens when you cancel your account) and data distribution policies (is your data automatically replicated or backed up to other data centers?) should also be investigated.

The security settings of the server also have to be determined. There are a lot of factors to consider here that could probably consume a full-time training course. Suffice it to say, you will need to research what to tweak on your operating system of choice in order to meet the security requirements of your use case. One option to consider is installing CloudPassage on your server – in addition to providing firewall management for Linux and Windows, it also makes security recommendations. Note that some of these changes are uncommon and may not be compatible with all applications.

Research what security services are offered by your cloud provider. There is often a firewall option (software or hardware), a virtual private cloud option (a way to create network isolation), VPN services and even more advanced options like a Web Application Firewall (WAF) or an Intrusion Detection / Prevention System (IDS/IPS). Firehost is one company that focuses specifically on security and provides a variety of options.

An important consideration in any situation is how you can back up your data in the case of failure. Does the vendor provide a mechanism for backing up your data, or will you need to provide that yourself? In some cases the vendor has built-in high availability (for example, Amazon offers 99.999999999% durability for S3). However, standard data backup (defining a secondary location to store your data) is a good practice, since durability guarantees do nothing if your sysadmin deletes your files by accident.
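A quick back-of-the-envelope on what that durability figure means in practice (the object count here is purely hypothetical):

```python
durability = 0.99999999999           # 11 nines: per object, per year
annual_loss_prob = 1 - durability    # about 1e-11
objects = 10_000_000                 # hypothetical number of stored objects

# Expected number of objects lost to hardware failure in a year.
expected_losses_per_year = objects * annual_loss_prob
print(expected_losses_per_year)      # about 0.0001, i.e. one loss per ~10,000 years
```

Even at that rate, an accidental delete removes the object with probability 1 – durability protects against disk failure, not against operator error, which is why independent backups still matter.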

You are responsible for your own security

Ultimately, you have to do the legwork to ensure that wherever you run your application or website, it has the level of security that you need. I don't think the cloud is inherently less safe than on-premise infrastructure. After all, maintaining the integrity and security of their service is in the vendor's best interest. In the case of IaaS, their core competency is maintaining infrastructure – something your company is probably not focused on; your company's expertise is most likely geared towards your particular industry. However, you will need to be aware of the security options available to you and make sure to take advantage of the ones offered by the vendor.

Preparing for the Storm


In a previous post, I described Big Data as meeting certain criteria: Volume, Velocity and Structure (probably worth revising to Variety, as that is now more commonly used). While Hadoop is currently the primary choice for analytics on Big Data, it is not necessarily designed to handle velocity. That is where Storm comes in.

Hadoop is designed to operate in batch mode: its strength is in consolidating a lot of data, applying operations to the data in a massively parallel way and reducing the results to a file. Since this is a batch job, it has a defined beginning and a defined end. But what about data that is continuous and needs to be processed in real time? Given the popularity of Hadoop, several open source projects have risen to meet this challenge. Most notably, HBase is a project designed to work with Hadoop that provides storage for large quantities of sparse data, leveraging the Hadoop Distributed File System (HDFS).

Storm Use Cases

But what about real-time analysis of a Twitter stream? This use case is a perfect fit for Storm. A recent project developed at Twitter, it is similar to Hadoop but tuned for dealing with high volume and velocity. Storm is intended to handle a real-time stream of information with no defined end – it can keep analyzing streams of data forever. If you are familiar with Yahoo's S4, Storm is similar. It handles three broad use cases (but not exclusively these):

  • Stream processing
  • Real time computation
  • Distributed remote procedure calls (RPC)

Storm is relatively new but has been adopted by several companies, most notably Twitter (obviously), Groupon and Alibaba. You’ll notice that even though these companies are in different industries, they all have to deal with large amounts of streaming data.

Components

Storm integrates with queuing systems (like Kestrel or JMS) and then writes results to a database. It is designed to be massively scalable, processing very high throughputs of messages with very low latency. It is also fault-tolerant: if any workers or nodes in a cluster die, they are automatically restarted. This gives it the ability to guarantee data processing – since it can track lineage, it knows if data isn't processed and can make sure that it is. Since Storm has no built-in storage, you will need to store results in another system (like a database or another application).

Storm also has its own components, which bear some explaining. Although there are several components in a Storm cluster (including Zookeeper, which is used for coordination and maintaining state), three are referenced frequently:

  • Topologies – the equivalent of a Map Reduce job in Hadoop, except that it doesn't really end (unless you kill it, of course)
  • Spouts – the primitive that takes source data and emits it into streams for processing
  • Bolts – the primitive that takes the data and does the actual processing (filtering, joining, functions, etc.). This is similar to the part of the Map Reduce WordCount example that sums the word counts.
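To make the division of labor concrete, here is a plain-Python simulation of a word-count topology. This imitates only the spout → bolt data flow; it is not Storm's actual API:

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: emits a stream of tuples (here, one sentence at a time)."""
    for sentence in sentences:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each incoming sentence into word tuples."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word

def count_bolt(stream):
    """Bolt: maintains a running count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout(["the quick fox", "the lazy dog"])))
print(counts["the"])  # 2
```

In real Storm, each of these stages would run as many parallel tasks across the cluster, with tuples streaming between them indefinitely rather than from a finite list.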

Since Storm is intended to run in a highly distributed fashion, it is also a natural fit for the cloud. In fact, there is a project that makes it easy to deploy a Storm cluster on AWS. As an open source project, it can run on any Linux machine and works well with servers that can be horizontally scaled. If you want more technical details, check out the excellent wiki on GitHub, where you can also download the latest version.

Big Confusion about Big Data

Not only do people often confuse what exactly the term “Big Data” means, but the dizzying array of products that are out there that solve for Big Data problems add to the confusion. So what’s the difference between Hadoop, Cassandra, EMR, Big Query or Riak?

First, it's important to define what Big Data is. I use the term to mean the data itself – although it is often used interchangeably with the solutions that handle it (such as Hadoop). I believe that data should satisfy three criteria before being considered "Big Data":

  • Volume – the amount of data has to be large, in petabytes not just gigabytes
  • Velocity – the data has to be frequent, daily or even real-time
  • Structure – the data is typically but not always unstructured (like videos, tweets, chats)


Hadoop

To deal with this type of data, Big Data solutions have been developed. One of the most well known is Hadoop. It was first developed by Doug Cutting after reading how Google implemented their distributed file system and Map Reduce functionality. As such, it is not a database but a framework that implements the Hadoop Distributed File System (HDFS) and Map Reduce. There are several distributions of this open-source product – Cloudera, Hortonworks and MapR being the most well-known. Amazon Elastic Map Reduce (EMR) is a cloud-based platform-as-a-service implementation of Hadoop. Instead of installing the distribution in your data center, you configure the jobs using Amazon’s platform.

Why would people use Hadoop? Typically they are pulling large amounts of unstructured data and need to be able to run sorting and simple calculations on the data. For example:

  • Counting words – this is the standard Map Reduce example
  • High-volume analysis – gathering and analyzing large scale ad network data
  • Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
  • Social graphs – Determining relationships between individuals

One of the weaknesses of Hadoop is its job-oriented nature. Map Reduce is designed to be a batch process, so there is a significant penalty in waiting for a job to start up and complete. It is typically not a strong candidate for real-time analysis. It also has the challenge of a master node that is a single point of failure. Recently, Cloudera released a distribution that implements a failover master node, but this is unique to their distribution. Additionally, developers typically use Java (although other languages are supported) and not SQL to create Map Reduce jobs.


Cassandra / Riak / Dynamo

Amazon wrote a paper on how to implement a key-value store, and Cassandra and Riak are implementations of that key-value store construct. Amazon also has its own implementation called DynamoDB.

While there are differences among the implementations, customers will want to use these products because they want fast, predictable performance and built-in fault tolerance. They suit applications that require very low latency and high availability. Typical use cases would be:

  • Session Storage
  • User Data Storage
  • Scalable, low-latency storage for mobile apps
  • Critical data storage for medical data (a great example is how the Danish government uses Riak for their medical prescription program)
  • Building block for a custom distributed system

These solutions are typically clustered in a ring formation with no particular node marked as a master, and thus no single point of failure. Customers will typically not use these solutions if they require complex ad-hoc querying or heavy analytics. As with Hadoop, building queries here requires a programming language like Java or Erlang rather than SQL. There is also a project called HBase that provides similar columnar data store functionality for data on Hadoop, so that is another option; that project was modeled after Google Bigtable, not Dynamo.
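The masterless ring design is usually built on consistent hashing: every node owns a position on the ring, and each key belongs to the first node clockwise from the key's own hash, so adding or removing a node remaps only a fraction of the keys. A toy sketch of the placement logic (real systems add virtual nodes and replication):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each key is owned by the first node
    at or after the key's position on the ring, wrapping around."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("session:42"))  # deterministic: always the same node
```

Because every node can compute the same placement independently, any node can route any request – there is no coordinator to lose.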


MongoDB

This is a project that has become very popular; it is managed by 10Gen and also offered in a SaaS model by MongoLabs. MongoDB has become popular because it is designed with developers in mind and bridges the gap between a relational database (like MySQL) and a key-value store (like Riak). MongoDB is excellent if you need dynamic queries and want to maintain functionality similar to a relational database while staying friendly to object-oriented programming languages. 10Gen emphasizes that MongoDB is designed to scale out easily and also to accelerate development (for example, in agile development). Typical use cases would be:

  • Companies storing data in flat files
  • Using an XML store for data too complex to model in a relational database
  • Deployments to public or private clouds
  • Electronic health records

As with its other NoSQL brethren, MongoDB does not use SQL. Instead you use a programming language (like JavaScript) to interact with the database. It is also a poor fit for applications that require multi-object transactions, which MongoDB currently does not support.
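MongoDB's query-by-example style can be illustrated with a small matcher for a subset of its operators. This is plain Python imitating the shape of a Mongo query document, not the actual driver API:

```python
def matches(doc, query):
    """Return True if a document satisfies a Mongo-style query dict.
    Supports equality plus the $gt / $lt operators on top-level fields."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif value != cond:
            return False
    return True

patients = [
    {"name": "Ada", "age": 34, "clinic": "north"},
    {"name": "Bob", "age": 61, "clinic": "south"},
]
over_40 = [d["name"] for d in patients if matches(d, {"age": {"$gt": 40}})]
print(over_40)  # ['Bob']
```

In MongoDB itself, the same filter shape is passed straight to find() on a collection, which is what makes ad-hoc queries over schemaless documents so natural for developers.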


Dremel

No, this is not a power tool but a technology Google described in a recent research paper. To understand this product, you need to be familiar with Hive and Pig, since they are often compared. As noted above, Hadoop has two main challenges: jobs have a start-up lag, and Java (or some other programming language) is required to write and implement MapReduce. Hive and Pig are open-source projects that sit on top of Hadoop and allow users to query data with SQL (in the case of Hive) or a scripting language (in the case of Pig, by writing Pig Latin). Note that since this is a layer above Hadoop, MapReduce code is still being generated and executed – the layer just translates SQL or Pig Latin into MapReduce code. This solves the problem of writing queries against Hadoop but does not solve the latency problem. That is what Dremel attempts to do.

Dremel is designed to let users write near real-time, interactive and ad-hoc queries against a large data set. Because it uses its own query execution engine and doesn't rely on MapReduce, it attempts to solve the Hadoop latency problem. However, in its current implementation it is purely an analysis tool, since it is read-only and organizes data in a columnar format. While there is an open-source project for this product, it is also commercially available as Google BigQuery.

Where to go with Big Data

As you can see, there are a lot of products out there that solve Big Data problems in different ways. It's important to understand your use case and see which product is the best fit – it's unlikely that you will find one that solves the entire universe of Big Data problems. In addition, you will likely still need a traditional relational database if you need to do reporting or interface with an enterprise application. Of all the projects, Hadoop has the largest ecosystem, with supporting projects that broaden the functionality of the core product. As mentioned earlier, Hive is used to write SQL queries; there is also Sqoop, which integrates relational data, and HBase, which provides low-latency capabilities. Fortunately, the majority of these projects are open-source, so you only need to learn how to use them and find a cost-effective platform to run them on – like the cloud, a natural fit for distributed databases and NoSQL projects.

2012 is the year of Big Data

This is an article that I wrote recently that was published in the Cloud Computing Journal. While 2011 may have been the year of the cloud, 2012 is proving to be the year that Big Data breaks through in a big way. I discuss a brief history of Big Data and go into a little more detail as to why I think the cloud and Big Data are a natural fit. The key is ensuring that the right architecture is available to support the performance needs of Big Data applications. Click on the link below to get the full article.

The Big Data Revolution — For many years, companies collected data from various sources that often found its way into relational databases like Oracle and MySQL. However, the rise of the Internet, Web 2.0, and recently social media began an enormous increase in the amount of data created as well as in the type of data. No longer was data relegated to types that easily fit into standard data fields. Instead, it now came in the form of photos, geographic information, chats, Twitter feeds, and emails. The age of Big Data is upon us. A study by IDC titled “The Digital Universe Decade” projects a 45-fold increase in annual data by 2020. In 2010, the amount of digital information was 1.2 zettabytes (1 zettabyte equals 1 trillion gigabytes). To put that in perspective, the equivalent of 1.2 zettabytes is a full-length episode of “24” running continuously for 125 million years, according to IDC. That’s a lot of data. More important, this data has to go somewhere, and IDC’s report projects that by 2020, more than one-third of all digital information created annually will either live in or pass through the cloud. With all this data being created, the challenge will be how to collect, store, and analyze what it means.

Why BI in the Cloud?

Business Intelligence (BI) has been around for a while, but recently interest in analytics and the tools to support it has grown rapidly. Previously, only large enterprises could afford the infrastructure and license costs to implement traditional business intelligence. However, the advent of the cloud is an opportunity for everyone to take advantage of the transformative power of data.


Traditional BI was typically a client-server setup, with companies shelling out for dedicated equipment in their own data center or a co-lo, license fees for the software, and a dedicated staff to manage all the equipment and maintain the servers. The more data you had to analyze, the more equipment you had to buy and the more staff you had to hire. The massive investment in this area became a sunk cost, and some enterprises are still tied to this model despite the fact that new models have arisen to tackle the data challenge.

The Cloud

The cloud opens up all new options for companies looking to either build out their own solution or to leverage new products that are native to cloud. There are currently three generally accepted cloud delivery models: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). I consider SaaS and IaaS to be the models that will be the most effective for BI.

One of the concerns often raised regarding the cloud is security – particularly with BI, since it involves data and sometimes personal information. The question then is: how would a provider that specializes in managing a data center (in the case of IaaS) or large-scale application and data hosting (in the case of SaaS) be any less secure than a company whose focus is on its own business and not on securing a data center? You will need to verify a few things depending on the delivery model that you are using. For SaaS, you will want to ensure that the provider can give you secure login and authentication, the ability to define granular levels of access, SSL, and the option of encryption at rest. For IaaS, the customer is typically responsible for setting up security, but you will want to see SSAE 16 or ISO 27001 certification and DDoS mitigation, as well as options to provision firewalls and load balancers. The Cloud Security Alliance (CSA) is currently working on its Security, Trust and Assurance Registry (STAR), which will make it easier to determine security criteria for a cloud provider, but it is currently only in the preview stage.

Deciding which Cloud Model

If you are interested in just providing data and then having software available to analyze, manipulate and create reports, SaaS products may work for you. Examples of SaaS BI companies are BIRST, PivotLink, and GoodData. All of them provide the BI stack: a method for extracting, transforming and loading data, a place to store the data, and a front-end to create ad-hoc reports and dashboards. The advantage of this style of BI is that all the management of the infrastructure and software is handled by the vendor. There are also minimal up-front costs – the model is to simply pay for what you use.

If you want to build your own BI platform, you can leverage IaaS and open-source software. You will need to find the IaaS vendor that best suits your needs – Amazon Web Services, GoGrid and Rackspace are the leaders in this space. Using IaaS, you have full control of your infrastructure: you determine how many servers to spin up, the security you want to use and how you want to store the data. You will also need to build out the software, assuming you have that expertise in-house. Some good open-source options are Talend for data migration and manipulation, Pentaho for data integration and analytics, BIRT for reporting, and MySQL or Postgres for the database. This model requires more system administration expertise and developer resources to build out a custom BI solution, but it may be worth the investment if you want tighter control of your product or have very specific custom needs. In either case, if you can leverage the features offered in open-source software, you will also minimize up-front costs and pay only for the infrastructure that you use – with the option of spinning up servers to meet demand or removing servers when they are no longer needed. You can start off with a small 1 GB server and expand to more cores, more RAM and more storage quickly and easily.

In addition, the flexibility of the cloud gives you the option to expand your infrastructure if you need to incorporate a Big Data solution to meet a particular use case. Currently, most of the popular technology is open-source such as Hadoop, MongoDB and Cassandra. I discuss Big Data in more depth in a previous blog post.

Ultimately, you may decide that you need all the features offered by a traditional BI vendor, or you may have already invested in a particular infrastructure or technology. After all, these companies have been around a long time and there are many talented individuals who are well-versed in their products. However, if you are interested in lowering costs and off-loading the infrastructure and software work to another vendor, or are new to BI and want to get started with minimal up-front costs, cloud-based BI solutions might be the right option for you. Instead of having to project growth in order to buy hardware up-front, you have the ability to pay as you go, adding infrastructure and cost only as your growth demands. DASHbay is experienced in both delivery models discussed here, and we can provide the right analytics expertise and development experience for your BI needs. Considering the growth in data, the flexibility of the cloud and the much-needed analytic features of a BI solution work well together to provide a powerful, low-cost and scalable solution. Make sure that you work with the right vendors and partners to make your project a success!

The Personal Analytics System

In a previous post, I discussed the explosion of data due to the growth in compute power coupled with the advent of Web 2.0 and social networks. However, this is not the only source of new and interesting data. People don't just generate and contribute data via check-ins and tweets; they generate data by simply going about their everyday lives. Until recently, there was no way to capture that information easily. Now, new gadgets have arrived that contribute to Big Data – in this case, data captured from human activity.

The devices currently on the market are the Nike+ FuelBand, the Jawbone Up, and the Fitbit Ultra. They are all essentially pedometers on steroids – a fun way to track your activity throughout the day and integrate it with a web dashboard and your social networks. These differ from the more serious Nike+ SportWatch or the Garmin Forerunner, which include GPS and are designed for serious athletes. The more lifestyle-oriented devices are intended as a low-cost way to track daily activity as well as sleep efficiency at night. The idea is that if you can collect data on your various activities throughout the day, you will have data points from which to make a better health plan or plan for a certain goal. They are also typically tied to other apps or websites where you can track food consumption in order to have a fuller picture of caloric gains and losses.

Personal Metrics Driven Management

While the intent is to improve health, this trend toward tracking gadgets for casual use is interesting because it generates data that pertains to a specific individual. You now have a way of tracking, over time, very detailed information about your activities, habits, and sleep cycles. Successful companies have long used metrics to drive decisions and improve performance and efficiency. Those tools are now starting to filter down to individuals, who can take advantage of analytics to improve their lives.


While all three trackers have their metrics, the one I have been testing is the Fitbit Ultra. I integrate it with MyFitnessPal, since it has a superior food-tracking database and a better mobile app. The product is easy enough to set up, and it tracks things like steps taken during the day, floors climbed, and of course calories burned. It's all interesting information that you can use to assess your current activity levels and gain insight into whether you should make changes in order to meet your goals. There are some pre-set goals, like achieving 10,000 steps in a week, but you can modify them to your liking.


The amount of data collected is astonishing, but the real power is not in the point-in-time snapshot of activity but rather in the accumulation of data over time to determine patterns, and in the integration with mobile and other apps (in particular, through the Fitbit API). You may notice that your activity drops significantly on days after you have low "sleep efficiency" (a metric Fitbit uses to determine how much sleep you get during the night without waking up). Or you may realize that your workout routine isn't active enough to offset your time in front of a computer. Sadly, I determined that most of my day was sedentary, since I spend most of my time at work behind a desk. The integration is also key, since it increases the amount of quality data that you can collect – for example, caloric intake in the case of MyFitnessPal. Integration with mobile apps also ensures that you always have a mechanism for recording data that the Fitbit does not, since most people carry their smartphone with them everywhere.
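As a toy example of the kind of pattern mining this enables, here is a correlation check between sleep efficiency and next-day activity. The week of numbers is entirely made up for illustration, not real Fitbit data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical week: previous night's sleep efficiency (%) vs next-day steps.
sleep_eff = [92, 85, 70, 95, 60, 88, 75]
steps = [11200, 9800, 6500, 12000, 5400, 10400, 7100]

print(round(pearson(sleep_eff, steps), 2))  # strongly positive in this sample
```

A dashboard doing this across months of accumulated data, rather than a single invented week, is where the personal-analytics value actually shows up.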

Integration is Key

All these websites also include the requisite integration with social networks like Facebook and Twitter. Although I believe that goals are best achieved when announced and supported by friends, it does seem a bit creepy to announce when you go to bed and wake up and how many calories you consumed that day.

Corporations aren't the only ones who can benefit from better data collection and analysis methods. Personal activity trackers now give the power of automated data collection and analysis to consumers. The websites even follow the Metrics Driven Management technique of a dashboard that displays all your pertinent metrics, with the ability to drill down for additional details. Data is now everywhere, even in your everyday activities, and companies are using data collection techniques and business intelligence technology to bring analysis to all aspects of our lives.

Pining for Pinterest

What does the Web’s hottest social network hold for data analytics?

Pinterest Icon Pinterest broke 11 million users this year. It was reportedly the fastest social network to reach 10 million unique users since its closed-beta launch in March 2010. However, that claim is somewhat in dispute, with many arguing that Formspring still holds the title. Either way, there is no debate that Pinterest is growing at a phenomenal rate.

What is Pinterest anyway?

Since Pinterest headquarters is just down the Peninsula, I caught on to this site relatively early. However, a recent dinner with some out-of-town colleagues reminded me that not everyone obsessively follows startup news like I do.

Pinterest is a social network that links people together by their interests rather than their social circle. So unlike Facebook, which revolves around your friends, on Pinterest people express themselves and find friends through common interests. Another twist is that this is done via the metaphor of a pin board. Users “pin” photos that they find on the site, from around the web, or upload directly, onto user-created “boards”. Users create boards to organize their pins into categories like “Travel” or “Humor”.

Since Pinterest is picture based, visual content shows far better. Infographics, for example, are far better represented there than articles (some of which have no pictures to pin at all). Brands are starting to take notice: Life magazine has a large following, posting mostly archived photos. Retailers are particularly strong here, since most of the items they sell are physical and therefore lend themselves well to pictorial presentation. In fact, Pinterest is driving more referral traffic than Google+, YouTube and LinkedIn. Combined.

So who is on Pinterest?

Modea, a digital advertising agency, compiled some demographic information on Pinterest users, and the results were quite interesting. On average, users spend 15.8 minutes “pinning” while Facebook users spend only 12.1 minutes liking things. Almost a third of Pinterest users have an annual household income of at least $100,000, and almost 70% are women, with the majority aged 25-34. So young, upper-middle class and female – sounds like the demographic that advertisers will want to target. Of course, that is just the US. The demographics in the UK are decidedly different – 56% male, 29% in the highest income bracket and interested in venture capital (go figure).

However, this doesn’t mean that there isn’t a good mix of interests on Pinterest. The Board of Man, a board focused on more “manly” interests, has 220,000 followers. Even the US Army has gotten in on the act with its own set of boards, managed by its Chief of Public Affairs.

It’s all about the data

Pinning is fun and all, but the real value is the information captured from all those young and wealthy consumers. Pinterest is also capturing a demographic that isn’t as prevalent on most social networks (which skew mostly male and coastal). Users are explicitly stating their interests – in particular, things that they covet or plan to purchase. In addition, the content is persistent. Unlike Twitter or Facebook, your pins stay on the board for you to easily review later (and don’t come back from the dead, as people discovered during the Facebook switch to Timeline). Retailers are starting to take notice – creating their own pages and engaging users with interesting content.

Pinterest is still in beta so it has some things that I think it needs to add to become more robust.

  • It really needs a robust, standards-based API (like REST). This would give websites the ability to integrate better with the social network. It would also allow third parties to start extracting some interesting data.
  • An easier way to find people. It’s all about interests rather than friends, but still, it’s impossible to find someone unless they are already friends with you on Facebook.
  • Allow some pins to stand out in some way (either promoted or simply popular)
  • Work in some additional content other than photos. While photos really enrich the site, the ability for music or text to add context to the photos would be helpful.

In particular, an API, used in conjunction with other APIs, would allow developers to create some interesting apps in combination with Facebook, Twitter or Foursquare. The mash-up of the Facebook social graph with the Pinterest interest graph would be quite interesting. Since the information posted on Pinterest is public by default and generated in real-time, it would be a good indication of the zeitgeist in terms of interests, which would be a welcome addition to the topic trends already surfaced on Twitter.

Big Data and the Cloud

big-data

One of the trends hitting analytics is the adoption of Big Data. A theory I raised a few years ago was that a side effect of running a web company is that you generate a lot of data. Many prescient companies like Yahoo and Facebook understood the value of this data and started to figure out ways to analyze it. Out of these early projects came several open source products that incorporated the architectures espoused by these companies.

Two of the most notable open-source projects to gain adoption are Hadoop and Cassandra. The idea was that not only were companies starting to see a lot more data, but they were also seeing a lot of unstructured data. This type of data didn’t lend itself to analysis by existing relational database tools. Hadoop, inspired by Google’s papers and developed largely at Yahoo (still a major contributor to the project), was built to solve this exact problem.

Since many web companies are using Hadoop, and Big Data started to gain traction at the same time as cloud computing, it soon became natural to combine the two. On the surface, this makes sense. Hadoop, and NoSQL solutions like it, are designed for massively parallel computing. Cloud computing is designed to let you spin up servers very quickly and scale horizontally. Seems like a natural match, doesn’t it?

It depends on your use case. Mission-critical production systems that need access to the cluster 24/7 and have performance requirements will want a lot of compute power and also high-performance I/O. Not all cloud providers are able to give that level of performance. Notably, Amazon is currently the only cloud provider attempting to provide Hadoop as a service on its multi-tenant cloud environment. MapReduce jobs can be run through Amazon Elastic MapReduce, spinning up instances as needed to run the jobs in parallel and then returning the result set to S3. If you tune your setup correctly, this can result in significant savings for targeted batch jobs. A good example is the TimesMachine project run by the New York Times, which converted TIFF files into smaller PNG images using Amazon and Hadoop, processing 150 years’ worth of images in 36 hours.
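For readers new to the model, the map/shuffle/reduce pipeline that Hadoop and Elastic MapReduce distribute across many machines can be sketched in-process in a few lines. This is a toy word count (the canonical MapReduce example) with all function names my own; a real cluster runs the same three phases, just with each phase spread across nodes:

```javascript
// Minimal in-process model of MapReduce: map each record to [key, value]
// pairs, shuffle (group) values by key, then reduce each group to one value.
function mapReduce(records, mapFn, reduceFn) {
  // map + shuffle: collect every emitted value under its key
  var groups = {};
  records.forEach(function (record) {
    mapFn(record).forEach(function (pair) {
      var key = pair[0];
      (groups[key] = groups[key] || []).push(pair[1]);
    });
  });
  // reduce: collapse each key's list of values into a single result
  var result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = reduceFn(key, groups[key]);
  });
  return result;
}

// Word count: each line emits [word, 1]; reduce sums the ones.
var counts = mapReduce(
  ['big data', 'big cloud'],
  function (line) {
    return line.split(' ').map(function (w) { return [w, 1]; });
  },
  function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
);
// counts -> { big: 2, data: 1, cloud: 1 }
```

The point of the architecture is that the map calls and the reduce calls are independent of each other, which is exactly what lets a cluster (or a fleet of cloud instances) run them in parallel.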

Other cloud providers rely on providing single-tenant hardware solutions. This provides both the dedicated compute and the local storage for high-performance I/O. In fact, the Hadoop Distributed File System (HDFS) is designed specifically to be rack-aware so that it can ensure replication of data across multiple nodes and even across different racks. Cloudera, which has its own distribution of Hadoop, publishes specific hardware recommendations depending on the workload. MapR, whose distribution includes additional code for performance gains, requires access to physical hard disks and may not play well in all multi-tenant environments.

However, the choice is not just between virtual-only and hardware-only. Some cloud providers let you deploy both virtual, multi-tenant machines and dedicated, single-tenant servers. While the implementations differ, the capability to deploy both virtual and dedicated machines gives you additional flexibility. Companies such as Rackspace and GoGrid provide both virtual and dedicated hardware. If you are looking for a purpose-built server, Joyent provides servers pre-configured and pre-installed with Riak – a distributed database product by Basho.

If you are looking for Hadoop-as-a-Service and have targeted jobs that need to be low cost and don’t need to run constantly, then Amazon fits the bill. However, you will take a performance hit – “It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster,” according to this article by InfoWorld. If you are looking for a production system running large, continuous data sets with high-performance requirements, then you are better off using a hybrid offering from a provider like Rackspace or GoGrid.

retention_dashboard_rptsrvr

Customer retention metrics

Last night (Tue July 19th), I was fortunate to be able to speak to the SVForum Business Intelligence special interest group (SIG).

After introducing the audience to DASHbay, I took them through an implementation we did using our Quick Analysis practice, which leverages open source software (especially BIRT and PostgreSQL), cloud computing (on AWS), and rapid, iterative development.

The implementation itself was a dashboard, built with BIRT in less than a week and showing metrics for account acquisition and retention. The metrics help any business track not just how well they are acquiring customers, but how well they are keeping them.

Account retention dashboard
Our customer was able to get at the metrics via a URL to a server running in the cloud, set up just for them. It’s a great way to leverage cloud computing: no IT procurement costs or delays, and you only pay for it while you need it.

We talked about DASHbay’s Report Server product, which, among other features, allows us to capture any useful piece of a report and include it in any web page. It also provides permissioning and authentication, a taxonomy for organizing reports, and more.

I got an excellent reception from the audience, and was pleased with the reaction and discussions afterwards. Thanks to all who attended!

If you didn’t get a chance to be there, please get in touch so we can talk to you more about our Report Server for BIRT, our Quick Analysis Service, or many custom BI and Data Analytics services. Customer Retention is one very useful application which we can provide, but our tools and techniques are applicable to most common business analysis problems.

Terry

Internet Explorer’s Ajax Caching: What Are YOU Going To Do About It?

It’s well known that Internet Explorer aggressively caches ajax calls, whereas all the other browsers fetch the data fresh every time. This is usually bad: I’ve never encountered a case where I want ajax NOT to contact the server. Firefox, Safari and the other browsers know this and don’t cache ajax calls. It’s not illegal or anything; IE is just caching to the maximum extent permitted by the spec.

In IE9’s new F12 Developer Console (finally!), cached requests show up as 304 Not Modified with a very suspicious < 1 ms response time.  The 304 is a red herring:  the server is never contacted.  It simply means IE in its infinite wisdom has decided the response wouldn’t have been modified IF it had asked the server, and serves it from browser cache.  Debugging on the server side shows that IE is really and truly *not* hitting the server.

This is a problem on all versions of IE:   IE9, IE8, IE7, and IE6.  People were scratching their heads over it way back in 2006.

To prevent Internet Explorer from caching, you have several options:

  • Add a cache busting token to the query string, like ?date=[timestamp].  In jQuery and YUI you can tell them to do this automatically.
  • Use a POST instead of a GET
  • Send an HTTP response header that specifically forbids browsers from caching the response

Of these, the HTTP response header is what I consider the “correct” solution.  But I’ll go through all of them below.

Cache Buster in Query String

On every ajax request, you can add a cache busting token to the query string, like ?date=[timestamp]

If you are using jQuery or YUI (the two frameworks I use the most), you can tell THEM to add the cache-busting parameter.

jQuery even has a global parameter to disable caching: set it in your <head> (like in your site template), and you’ve just cache-busted any ajax call jQuery makes.
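The manual version of the token is trivial. A sketch (the helper name is my own invention), alongside the jQuery global setting the paragraph above refers to:

```javascript
// Append a throwaway timestamp parameter so every request URL is unique,
// which means IE can never find it in its cache. Helper name is hypothetical.
function addCacheBuster(url) {
  var sep = url.indexOf('?') === -1 ? '?' : '&';
  return url + sep + '_=' + new Date().getTime();
}

// jQuery's built-in equivalent: set once, and it appends _=[timestamp]
// to every ajax request automatically.
// $.ajaxSetup({ cache: false });
```

For example, `addCacheBuster('/api/data')` yields something like `/api/data?_=1311100000000`, and a URL that already has a query string gets `&_=...` appended instead.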

However, I don’t like this approach:

  • It sends unnecessary data to the server that may interfere with the request.  Like, maybe the request takes an arbitrary number of parameters and the cache busting parameter becomes one of them.
  • Also, it clutters up the browser’s cache with a bunch of content that it’ll never serve, leaving less space for the content that SHOULD be cached.

POST Instead of GET

You can send your ajax call as a POST instead of a GET.  Browsers will never cache a POST.

However, I don’t like this approach either:

  • POSTs are for calls that modify the server.  If you are retrieving information from the server without modifying anything, you should stick with a GET.
  • I’ve also heard that POSTs are slower in ajax than GETs, but that’s a very minor secondary consideration.

Set Cache Headers Correctly

This is what I’d call the “correct” approach: when the server sends an ajax response, have it set an HTTP header that tells all browsers NOT to cache the content because, well, you don’t WANT it cached.

You should, anyway: Firefox and Safari just happen to be making a convenient assumption that you don’t want your ajax calls cached, whereas IE caches as much as it can get away with. Setting the header explicitly makes all browsers behave consistently.

How to do this? Well, setting the caching headers in every single one of your server calls is for the birds. You want something that applies to all ajax calls globally. Fortunately, most JavaScript libraries (including jQuery, YUI, MooTools and Prototype) send an X-Requested-With: XMLHttpRequest header on every ajax request.

In the Java world, you can write a servlet filter that detects the X-Requested-With: XMLHttpRequest header and sets a “don’t cache this” header on the response.

But in the Groovy world of Grails, there’s something even easier (of course). Below is a Grails filter that prevents caching of ajax requests that identify themselves with the X-Requested-With: XMLHttpRequest header:

// put this class in grails-app/conf/
class AjaxFilters {
    def filters = {
        all(controller:'*', action:'*') {
            before = {
                if (request.getHeader('X-Requested-With')?.equals('XMLHttpRequest')) {
                    response.setHeader('Expires', '-1')
                }
            }
        }
    }
}
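The same idea is a few lines in any server stack. Here is a framework-agnostic JavaScript sketch of the decision the Grails filter makes (function name is mine; the commented-out wiring assumes Express’s `req.get`/`res.set` API):

```javascript
// Given the incoming request headers (lower-cased keys, as Node presents
// them), return the response headers to add. Mirrors the Grails filter:
// ajax requests get a "don't cache" header, everything else is untouched.
function noCacheHeadersFor(requestHeaders) {
  if (requestHeaders['x-requested-with'] === 'XMLHttpRequest') {
    return { 'Expires': '-1' };  // or { 'Cache-Control': 'no-cache' }
  }
  return {};
}

// Hypothetical Express wiring:
// app.use(function (req, res, next) {
//   if (req.get('X-Requested-With') === 'XMLHttpRequest') {
//     res.set('Expires', '-1');
//   }
//   next();
// });
```

Keeping the header decision in a pure function like this also makes it easy to unit test without spinning up a server.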

Some people prefer to use the Cache-Control: no-cache header instead of Expires. Here’s the difference:

  • Cache-Control: no-cache – absolutely NO caching
  • Expires: -1 – the browser “usually” contacts the Web server for updates to that page via a conditional If-Modified-Since request. However, the page remains in the disk cache and is used in appropriate situations without contacting the remote Web server, such as when the BACK and FORWARD buttons are used to access the navigation history or when the browser is in offline mode.
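Concretely, each variant is a single line in the raw HTTP response (setting both is a common belt-and-suspenders choice):

```
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: -1
```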

By adding this filter, you make Internet Explorer’s caching consistent with what Firefox and Safari already do.