Why DASHbay didn’t crash when Amazon-East did

By now, we’ve all heard about the Great Cloudburst of 2011. On April 21, Amazon’s Virginia-based data center experienced a huge reduction in service, triggered by what the company called “a networking event” and subsequent “re-mirroring of EBS volumes”.

I’ll leave examinations of the cause and response to other websites, and discuss the impact to DASHbay.

DASHbay builds and supports data-centric applications, focusing on open source software solutions and often using cloud deployments. Amazon is our most frequently used cloud data services provider.

At the time of the crash, several of our customers had mission-critical DASHbay-deployed applications running in the cloud. Our customers’ problems are our problems, so how did they fare?

I’m pleased to report that none of our customers were severely impacted by the outage.

Why not?

Here are some case studies of apps we built for clients, and the mitigation strategies that saved our bacon during Amazon’s failure.

The first is a real-time, high-availability mobile analytics collection application we deployed for Nielsen Mobile. Because this app’s continuous availability is mission critical, it was designed to not be dependent on any one AWS region. It failed over seamlessly to Amazon-West, and data-gathering continued normally. According to Brian Edgar, Group Program Manager at The Nielsen Company’s Telecom Practice: “While the outage at Amazon East was certainly bad news for Amazon and many of its clients, it was a great example of why the technology choices DASHbay recommended for us were ideal for this application. Our application was architected for geographic redundancy and fully leveraged the cloud model with dynamic DNS routing and load balancing using servers in multiple zones and regions. Our mission-critical, highly-available application experienced no outage at all. The Amazon regional failure proves we did the right thing.”

Another is a data acquisition app built for our client Credit.com. Unstructured data is gathered and marshaled into transaction reports. Data loss can directly impact Credit.com’s ability to monitor its own financial performance. This app was deployed only in the Amazon-East region, and was not available for over 24 hours. However, we anticipated the possibility of an outage, and had offshore staff in Nagpur, India, trained to perform manual workarounds for as long as necessary. These manual processes kicked in, and kept the data flowing. According to Credit.com’s CEO Ian Cohen, “We’ve been working with Dashbay for the last year and were really pleased with the measures they put in place to provide redundancies for our data acquisition applications. They positioned an offshore failsafe that allowed us to operate without interruption.”

What’s the message here? I think it’s this: data centers can fail! Design operational processes and real-time architectures with fail-over in mind. We used a variety of approaches, from human-intensive procedures that were nevertheless ready to go, to automated failover.

Which risk-mitigation strategies are right for a particular app? That depends on factors such as the volume of data and the tolerable latency of gathering and moving that data. We’re committed to thinking those factors through with our clients, and designing applications and processes with failover in mind.

One more important thing: let’s all keep the mindset of learning from mistakes, and if necessary changing architectures and backup procedures to keep our businesses running.

Terry Joyce, DASHbay founder

Drizzle – a new MySQL fork

Drizzle is a new MySQL fork that recently went GA. It is open source, written by former MySQL developers, and designed to be fast, lightweight and optimized for the cloud.

http://drizzle.org/

Since databases are a part of every Business Intelligence implementation, I’m going to investigate how easy Drizzle is to install, deploy and configure.

I built out a VM using the Alpha 3 build of Natty Narwhal (Ubuntu 11.04).  Since there is already an Ubuntu package, the deployment went smoothly.

As expected, the Alpha build of Ubuntu has some stability issues (for example, the new Unity interface didn’t load until the VM was launched a second time, and some random components crashed). However, these do not seem to impact Drizzle.

My intent is to give Drizzle a test run by running some code that I wrote for MySQL 5.1 and noting the differences. This is closer to what someone would do for light SQL development; it will not go into more advanced administration, such as testing Drizzle’s capabilities for massive concurrency or clustered systems.

Installing Groovy on OS X 10.6

Groovy is an agile, dynamic language for the Java platform. It has the speed and flexibility of Ruby but it compiles down to Java bytecode. It’s the foundation of the Grails framework.

Here’s how to install Groovy 1.7.9 on Mac OS X 10.6 Snow Leopard.  It should also work on earlier versions of Groovy and OS X. I will note that Groovy does provide instructions, but I prefer these. There’s also a MacPorts installer, but I never have any luck with MacPorts.

Install Java SDK

Before you install Groovy, make sure you have a Java SDK 1.5+ installed.  The Java runtime that comes preinstalled with OS X does not have the SDK.

Download Groovy

Download Groovy from groovy.codehaus.org.

Install Groovy

Unpack the archive you downloaded.  It’ll be a folder.

Move the folder to /usr/share/java. I just used Finder, but here’s the obligatory Terminal command:

$ sudo mv ~/downloads/groovy-1.7.9 /usr/share/java

Create a symlink that allows you to access Groovy without referencing the version number.  This way, to upgrade your Groovy you just install the newer version alongside the old one, and update this symlink.  This is a standard best practice.

$ cd /usr/share/java
$ sudo ln -s groovy-1.7.9 groovy
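
The upgrade step described above can be tried out safely with a small sketch like this one. It runs in a throwaway directory rather than /usr/share/java (so no sudo is needed), and the 1.8.0 version number is only a stand-in for whatever newer release you install.

```shell
# Work in a throwaway directory so this is safe to try; on a real machine
# the directory would be /usr/share/java and the commands would need sudo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Two versions installed side by side (1.8.0 is a stand-in for a newer release).
mkdir groovy-1.7.9 groovy-1.8.0

# Initial install: point the unversioned name at the current version.
ln -s groovy-1.7.9 groovy

# Upgrade: repoint the symlink. -f replaces the existing link, and
# -n stops ln from following the old link into the old folder.
ln -sfn groovy-1.8.0 groovy

# Show where the unversioned name points now.
readlink groovy
```

Because GROOVY_HOME refers to the symlink and not to a versioned folder, nothing in your profile has to change after an upgrade.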

To make Groovy available, add the following either to /etc/profile (for everyone) or ~/.profile (for just you):

export GROOVY_HOME=/usr/share/java/groovy
export PATH=$GROOVY_HOME/bin:$PATH

If you haven’t set the JAVA_HOME environment variable yet, set it in .profile so that it points to the path where Java is installed. It’ll be something like this:

export JAVA_HOME=/Library/Java/Home
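
If you aren’t sure where Java lives, a sketch like the following may help. It assumes your OS X ships Apple’s /usr/libexec/java_home helper, which prints the home of the default Java install; if the helper is missing or reports nothing, it falls back to the path shown above.

```shell
# Start with the path used above, then prefer whatever Apple's helper reports.
JAVA_HOME=/Library/Java/Home
if [ -x /usr/libexec/java_home ]; then
  # The helper prints the default Java home; ignore its errors if no JDK is found.
  detected="$(/usr/libexec/java_home 2>/dev/null || true)"
  if [ -n "$detected" ]; then
    JAVA_HOME="$detected"
  fi
fi
export JAVA_HOME
echo "JAVA_HOME is $JAVA_HOME"
```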

Let’s see if it works. Open Terminal and type:

$ groovysh

You should see:

Groovy Shell (1.7.9, JVM: 1.6.0_24)
Type 'help' or '\h' for help.
------------------------------------------------------------
groovy:000>

This is Groovy’s interactive command shell. Hit Ctrl-C to exit.

Now go get yer Groov on!

Installing Grails on OS X 10.6

Grails is an excellent rapid application development framework that integrates well with many industrial-strength analytics products.

Here’s how to install Grails 1.3.7 on Mac OS X 10.6 Snow Leopard.  It should also work on earlier versions of Grails and OS X. I will note that Grails.org does provide instructions, but I prefer these.

Install Java SDK

Before you install Grails, make sure you have a Java SDK 1.5+ installed.  The Java runtime that comes preinstalled with OS X does not have the SDK.

Download Grails

Download Grails from grails.org.

Install Grails

Unpack the archive you downloaded.  It’ll be a folder.

Move the folder to /usr/share/java. I just used Finder, but here’s the obligatory Terminal command:

$ sudo mv ~/downloads/grails-1.3.7 /usr/share/java

Note:  some people install it to /usr/share/, but these instructions install it to /usr/share/java because that’s where other Java tools like Maven, Ant, JUnit and Derby are installed. Grails’ official instructions suggest your home directory (~/grails), which is just plain weird.

Change ownership and permissions:

$ cd /usr/share/java
$ sudo chown -R root:wheel grails-1.3.7/
$ sudo chmod 0755 grails-1.3.7/bin/*

Create a symlink that allows you to access Grails without referencing the version number.  This way, to upgrade your Grails you just install the newer version alongside the old one, and update this symlink.  This is a standard best practice.

$ sudo ln -s grails-1.3.7 grails

To make Grails available, add the following either to /etc/profile (for everyone) or ~/.profile (for just you):

export GRAILS_HOME=/usr/share/java/grails
export PATH=$GRAILS_HOME/bin:$PATH

If you haven’t set the JAVA_HOME environment variable yet, set it in .profile so that it points to the path where you have installed Java. It’ll be something like this:

export JAVA_HOME=/Library/Java/Home

Let’s see if it works. Open Terminal and type:

$ grails

You should see:

Welcome to Grails 1.3.7 - http://grails.org/
Licensed under Apache Standard License 2.0
Grails home is set to: /usr/share/java/grails
No script name specified. Use 'grails help' for more info or 'grails interactive' to enter interactive mode

Happy Grailing!

Microsoft Excel PowerPivot: Just What the Analyst Ordered

We do a lot of ad-hoc data investigation and analysis around here, and are always on the lookout for tools that make our lives easier.

While raw SQL and a database remain a top weapon of choice for getting a sense of a new data set, they leave a bit to be desired in the presentation department. Plus, a lot of people just aren’t comfortable with SQL, and we often need to hand off our work to more businessy types who have more starch in their shirts than comp sci classes under their belts.

It’d be great to have something like Excel, but without the limitations. Excel’s fine for simple, small data sets but it:

  • only handles about a million rows
  • isn’t great at selecting subsets of data
  • isn’t good at stitching multiple tables together

Wouldn’t it be great to have something like Excel that serves as a front end to multiple databases and DOESN’T cost a bajillion dollars like Tableau?

Introducing PowerPivot

Well there IS such a beast, and the surprise is, it’s Excel itself!  Excel 2010 has a free add-on called PowerPivot that addresses many of our longstanding issues with it as a business intelligence tool.

  • Excel with PowerPivot doesn’t max out until hundreds of millions of rows
  • Excel with PowerPivot lets you combine multiple sources (database tables, web services, any feed!) in a single model (spreadsheet/pivot table)

But Does It, Really?

We didn’t buy the hype on the scalability, so we tried it for ourselves.  Our preliminary test was a 1 million row table, imported from a CSV.  PowerPivot made a nice fast pivot table over it, no problem.  In addition, the workbook on disk was 36MB whereas the raw CSV was 75MB, roughly a 50% reduction in size. We were also able to relate a few other small tables to the pivot table with no loss in speed.

We were deeply impressed and will be shifting more of our ad-hoc analysis over to Excel.

The only downside? It’s only for Windows. No PowerPivot for Excel Mac yet. This test was run on OS X, within a 64 bit Windows 7 Parallels VM given 4GB RAM. Office itself was the 32 bit version.

Maven Profile Inheritance (okay, DISinheritance)

One best practice with the Maven build system is to take configuration that’s common to many projects and put it into a parent pom.xml.

Unfortunately, parent pom inheritance has some limits.  One of them is that profiles aren’t inherited.

You may be forgiven for thinking that profiles ARE inherited.  Not only does it seem like a logical (some would say necessary) feature of Maven, you might have seen it working… or so you thought.   See, profiles in parent poms can be triggered by the building of a child. When activated, they are applied directly to the parent pom, prior to that parent being used for inheritance into the child. So the effects of an active profile in a parent pom may be felt by the child.

Sometimes this indirect inheritance works well.  Sometimes, it’s just crack-addled.

When It Works…

Here’s an example where it works well. Say you have a parent pom with a ‘prod’ profile whose job is to insert some properties into the build.   On a production system, Maven will activate that profile and the parent pom will contain the properties.  Since properties are inherited from parents, your child pom will get them.  The end effect is just like the profile was inherited, but in fact the profile was NOT inherited.

… and When It Doesn’t

Here’s an example where it doesn’t.  Say your parent pom has a ‘prod’ profile that adds a plugin for publishing your app to a webserver.  On your production system, Maven will activate that profile and the parent pom will contain the plugin.  However, that plugin will try to publish the parent to the webserver, not your child project.

If you simply insist on trying to “inherit” plugins via a profile, you’d have to:

  1. Set up the profile in the parent pom, but instead of defining the plugin in a <plugins> section, define it in a <pluginManagement> section.  The <pluginManagement> section is explicitly for declaring plugins that aren’t used in the parent but need to be propagated to the child.
  2. Set up the profile again in the child pom (yes, copy and paste).  Within the profile, REdeclare the plugin (in a <plugins> section, not <pluginManagement>), but without all the configuration.  This will activate the plugin defined in the parent.
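
As a rough sketch, the two steps above might look like the following pom.xml fragments. The plugin coordinates (com.example:publish-maven-plugin), its version and the targetUrl setting are hypothetical placeholders for whatever plugin you are trying to share; they don’t refer to a real plugin.

```xml
<!-- PARENT pom.xml: the profile only declares the plugin and its configuration -->
<profiles>
	<profile>
		<id>prod</id>
		<build>
			<pluginManagement>
				<plugins>
					<plugin>
						<!-- hypothetical plugin coordinates -->
						<groupId>com.example</groupId>
						<artifactId>publish-maven-plugin</artifactId>
						<version>1.0</version>
						<configuration>
							<!-- hypothetical setting -->
							<targetUrl>http://webserver.example.com/apps</targetUrl>
						</configuration>
					</plugin>
				</plugins>
			</pluginManagement>
		</build>
	</profile>
</profiles>

<!-- CHILD pom.xml: the same profile id, redeclaring the plugin with no configuration -->
<profiles>
	<profile>
		<id>prod</id>
		<build>
			<plugins>
				<plugin>
					<groupId>com.example</groupId>
					<artifactId>publish-maven-plugin</artifactId>
				</plugin>
			</plugins>
		</build>
	</profile>
</profiles>
```

The child names only the groupId and artifactId; the version and configuration come from the parent’s pluginManagement entry once the ‘prod’ profile is active in both poms.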

Yes, it sucks that you have to double-declare things.   I try to find other solutions if at all possible.

Maven: activation conditions really DON’T work as advertised

Many people using the Maven build system want to conditionally trigger parts of the build:   say, add a .dll only when on Windows, or add a library only when some file is missing.

This is why Maven profiles exist. Profile activation conditions are the mechanism for turning on a profile based on conditions like operating system and missing files. However, they work almost the opposite of how they are documented to work.

Despite all the examples in the official docs, profile activation conditions aren’t ANDed together, but ORed. For example, you might want to activate a profile when on Windows AND a file is missing.  You can’t:  the profile will activate if either condition is true.
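
For illustration, here is a sketch of the kind of activation block that paragraph describes. The profile id is made up and the file path is only an example. You would expect this profile to turn on only for Windows machines that are missing the file, but on the Maven versions discussed here it turns on when either condition holds.

```xml
<profile>
	<!-- hypothetical profile id -->
	<id>windows-and-file-missing</id>
	<activation>
		<!-- condition 1: running on Windows -->
		<os>
			<family>windows</family>
		</os>
		<!-- condition 2: the file does not exist yet -->
		<file>
			<missing>src/main/some-file</missing>
		</file>
	</activation>
</profile>
```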

This is true in version 3.0.2 and has been thus ever since Maven 2.1: you can see the bug ticket where someone “fixed” it the wrong way. This is an egregious flaw that makes activation conditions unusable for many common use cases.

Working Around It Using Groovy

One workaround is to dip into the Groovy scripting language to do multiple conditions.

In the Maven pom.xml sample below, you can see that I’m using a combination of profile activation (to test for the existence of a file) and Groovy conditionals (to test for the operating system). The profile activation isn’t necessary, really; I could have done the whole thing with Groovy conditions.

<profiles>
	<!-- a Maven profile that does something when a particular file is missing from the build -->
	<profile>
		<id>install-shared-resources</id>
		<activation>
			<file>
				<!-- Activate this profile when there's no src/main/some-file yet -->
				<missing>src/main/some-file</missing>
			</file>
		</activation>
		<build>
			<plugins>
				<!-- Use the Groovy scripting language to detect the OS and do something OS-specific -->
				<plugin>
					<groupId>org.codehaus.mojo</groupId>
					<artifactId>groovy-maven-plugin</artifactId>
					<version>1.3</version>
					<executions>
						<execution>
							<phase>package</phase>
							<goals>
								<goal>execute</goal>
							</goals>
							<configuration>
								<source>
									// True for any POSIX compliant operating system such as AIX,
									// HP-UX, Irix, Linux, MacOSX, Solaris or SUN OS
									if (org.apache.commons.lang.SystemUtils.IS_OS_UNIX)
									{
										// do something like execute a shell script
									}
									// Else if Windows 7 or Vista
									else if (org.apache.commons.lang.SystemUtils.IS_OS_WINDOWS_VISTA ||
											 System.properties['os.name'].toLowerCase().contains('windows 7'))
									{
										// do something like execute a shell script
									}
									// Else it's an unsupported OS (in this case, like Windows XP)
									else
									{
										fail("\nYou're on an unsupported operating system: " + System.properties['os.name']);
									}
								</source>
							</configuration>
						</execution>
					</executions>
				</plugin>
			</plugins>
		</build>
	</profile>
</profiles>

Mercurial Source Control Hosting Providers

Our small consulting company uses the Mercurial source control system, and we needed to find a website that would host our repositories on the cheap.

The providers you hear the most about, such as Bitbucket and Google Code, are geared towards personal and not corporate use.   For example, Bitbucket *does* have private repositories, but makes you permission each person individually to each repository — which sucks when you’ve got a bunch of repositories.   Ideally you want everyone on your team to automatically see new repositories.

Here were some of our criteria:

  • supports multiple logins under one master account
  • closer to 10 bucks a month than 100
  • online code browser and code diff
  • creates private repos by default (unlike Bitbucket which creates public ones)
  • user groups, so that we can permission all the employees of our outsourcing partner to specific repos easily, and revoke it just as easily
  • wiki would be nice
  • nice to have:  organize multiple Mercurial repositories into a single project

The Verdict

Below we evaluate the contenders and their pros and cons.
However, let’s get straight to the point and say that RepositoryHosting.com and CodebaseHQ were the two top choices by a long shot.  If your requirements are anything like ours you should check them both out.
In the end we chose RepositoryHosting.com, even though we found CodebaseHQ’s UI much more pleasing.  Why?
  • RepositoryHosting.com has permissionable usergroups
  • RepositoryHosting.com is cheaper

CodebaseHQ

£  5/mo /  3 projects / 500MB / unlimited repos  /          10 users
£13/mo / 15 projects /    2GB / unlimited repos  / unlimited users
£21/mo / 30 projects /    4GB / unlimited repos  / unlimited users
£40/mo / 60 projects /  10GB / unlimited repos  / unlimited users
Thoughts:
  • Nice UI, simple and helpful
  • Love that a single project can contain multiple repositories
  • Excellent visual diff tool
  • Good user setup, very easy to add a SSH key
  • Doesn’t have usergroups

RepositoryHosting

$6/mo unlimited users, 2GB.  $1/mo for each additional GB.
Thoughts:
  • Good user controls.  It’s quite easy to create a group of users and permission them to one project, or even a category of projects.
  • We liked that you can archive projects, which gets important as time goes on.
  • We liked that you can have different project groups, which auto-apply different settings to the projects
  • We didn’t like that a project can have only one repo
  • It uses Trac for each project, which is ancient but it’ll do
  • The whole thing’s a little pokey and sluggish
  • UI’s a little archaic feeling, as is Trac itself
  • The UI’s not well integrated. For example, to get to the list of commits you have to go into yet another area with its own authentication, a website produced by the hg application itself

RepositoryHosting vs. CodebaseHQ

Pros for CodebaseHQ

  • CodebaseHQ very nicely namespaces the repositories within a project, so we can create a project called Customer and have all of their repositories within it.
  • The UI is miles better than RepositoryHosting’s:  instead of each repo having separately authenticated sites for Trac and the hg browser, it’s all one smooth thing.
    • For example, I appreciate how my dashboard shows a combined list of all the commits and status messages across all projects, and each status message is fully hyperlinked to the stuff it talks about.

Pros for RepositoryHosting

  • It’s cheaper.
    • CodebaseHQ makes you pay more for more projects.  With RepositoryHosting, pricing is purely by the GB.
  • It has user groups and CodebaseHQ doesn’t.
    • This is really needed for the Infocepts scenario.
  • CodebaseHQ takes a few clicks to get to the list of repositories
  • A little easier to see what hg commands are available

Rejected Providers

Versionshelf

http://www.versionshelf.com/ 

$19/mo 3GB /            20 accounts /           15 repos
$79/mo 15GB / unlimited accounts / unlimited repos
Points:
  • We needed more than 15 repositories, which puts it at too high a price point.

XP Dev

unlimited projects/users
$5/mo  1GB
$15/mo 4GB
$30/mo 10GB
Points:
  • Each repository is created in the same namespace as all other repositories hosted by XP-Dev across all accounts, which means all the meaningful names like Main and Primary have been taken.
  • All users are in the same namespace across all accounts.  There’s only one “john” account in the entire system.  I have a fairly unique login name and it was already taken, and it *silently renamed* me to something else.  Fail!
  • As if it needed any more strikes against it, the UI stinks
  • It *is* fast, though

Assembla

$10/mo for  3 users  /  1 space   /  1 GB
$29/mo for 15 users /  2 spaces  /  4 GB
$49/mo for 30 users / 10 spaces / 10 GB
Points:
  • Looks industrial strength.  Seems more geared to Git and SVN, but they say they have Mercurial.
  • Has lots of features we don’t need,  like group video chat.
  • Each space can only have a single repository, so we’d have needed at least the $49/mo plan, which was too much for us.

Indefero

http://www.indefero.net/

£49/year / 1GB / unlimited projects & users
1GB extra £39/year
They say they have Mercurial, but it’s only when installing on your own server.

Parallels: browser test your OS X localhost

I’m a developer, and you’ll have to pry my MacBook out of my cold, dead fingers.  But most of the world lives and breathes Windows, so I test my web application on Microsoft Internet Explorer and Firefox for Windows.

The easiest way to do this is to run Parallels.  I’ve got Parallels virtual machines for XP, Vista and Windows 7.   I’ve got different virtual machines for XP with IE 6, 7 and 8.  I even browser test on Ubuntu (god knows why).

But how do all those Windows browsers access the web server running on your OS X localhost?  Inside a virtual machine, localhost refers to the virtual machine itself, not the host Mac.

Take your pick:

http://<yourcomputer>.local:  OS X creates a DNS entry on the local network named <yourcomputer>.local.   You find your computer’s name in System Preferences > Sharing.   I named my computer MoBook, so in Internet Explorer I type http://mobook.local/ to hit the webserver I have running on OS X at http://localhost:80/

http://yourgatewayIPaddress: Parallels creates a little DHCP network for your virtual machines, and the OS X machine itself is at the gateway IP.   Go into Parallels > Preferences > Advanced > Network; the gateway is most likely the Start Address with its last number changed to 0, 1 or 2.  For example, my Start Address is 10.211.55.1 and in Internet Explorer I type http://10.211.55.2 to hit the webserver I have running on OS X at http://localhost:80.
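
If you’d rather not guess, a small sketch like this lists the addresses worth trying for a given Start Address. The candidate_gateways name is made up for this example, and it assumes the host Mac shares the first three numbers of the Start Address and ends in 0, 1 or 2.

```shell
# Hypothetical helper: print the URLs where the host Mac is most likely
# reachable, given the Start Address shown in the Parallels preferences.
candidate_gateways() {
  start_address="$1"
  # Drop the last number of the address, e.g. 10.211.55.1 becomes 10.211.55
  prefix="${start_address%.*}"
  for last in 0 1 2; do
    echo "http://${prefix}.${last}/"
  done
}

candidate_gateways 10.211.55.1
```

Try each printed URL in the Windows browser until one reaches your OS X webserver.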

This is with Parallels 6.  These should also work in previous versions of Parallels, as well as VMware Fusion.

Working with Postgres on Windows via ODBC

The most straightforward way to hook up Excel or most other Windows programs to a PostgreSQL database is ODBC, which lets you connect to a local or remote Postgres server.

To do this, you need an ODBC driver for Postgres.  But which one?  In most cases, you’re going to want the free official ODBC driver from Postgres.  Its documentation is so poor that I overlooked it for a day before figuring out that it’s the real deal.  The driver’s homepage makes it look like the last version was in 2005, but in fact there were releases in October 2010 (and by the time you’re reading this, probably more recent ones).

Get the Windows installer here: http://www.postgresql.org/ftp/odbc/versions/msi/

Make sure you install the right version.  Even if you have 64 bit Windows you’ll need the 32 bit ODBC driver if the program that’s connecting to the database (like Excel) is 32 bit.  For Excel specifically, File > Help says whether it’s 32 or 64 bit.

The official installation and configuration FAQ: http://psqlodbc.projects.postgresql.org/faq.html

Set up a DSN

Once you’ve installed the ODBC driver you need to make it available to programs like Excel.

You do this via your computer’s “Data Sources (ODBC)” panel.  There’s a 32 bit and a 64 bit version, so make sure you’re looking in the right one: http://support.microsoft.com/kb/942976

Once it’s open, click either the User DSN or System DSN tab:
  • User DSN: is available just for the current user.
    • If you’re just using it within Excel or Access, this is the one you want.
  • System DSN: is available for all users and services on the machine.

Click ‘Add…’ and select a PostgreSQL driver.  There are two to choose from:

  • PostgreSQL Unicode: use this if your database was set up with the UTF-8 character set.
  • PostgreSQL ANSI: use this if your database was set up with a LATIN character set.

Fill in your server and database details.

  • Data Source: a friendly name for the data source, which you’ll later use when looking up this DSN in Excel or other programs.
  • Database: the name of the database to connect to. Use ‘postgres’ for the default database, or type the name of a specific database.
  • Server: domain name or IP address of server
  • User Name: postgres database username
  • Password: postgres database password

Go Forth And Use ODBC

Now go to the program in which you need to connect to Postgres and use whatever ODBC connection UI it provides.