Wednesday, March 5, 2014

Playing With My Friend Cassandra

IRL, I do Java/EE development, and I was faced with an annoying issue we've had for a while.  I was tasked to work on a simple archive process that goes a bit like this:
We have a bunch of customers, and we create a bunch of different kinds of date-sensitive statements for these customers.  Sometimes our customers lose some statements and ask us for copies.  Forever, we've just maintained a monstrous file system of archived statements organized by date and customer ID, and we just go pluck them off that file system and give them to our customers.  As time has passed, maintaining the archive has become slow and cumbersome.
It turns out that one big problem with our existing solution was that it was super slow to maintain a disaster recovery (DR) copy of the archive.  We basically just did everything twice while archiving statements.

The sample code for this post and the companion posts is on github at

Somewhat unrelated to this task: my place of employment (POE, perhaps) has started to use our own ThoughtWorks-style Tech Radar to help focus our attention on a vetted set of tools, processes, frameworks and such while we progress through the 2014 work year.

What Does a Tech Radar Have To Do With My Problem?

Well, I had slack in the reins to explore discarding large portions of our existing archive statements solution and propose something new.  So I consulted the radar to fish for some new tools, or languages, or other things that might be helpful for this problem.  NoSQL was in the "trial ring" in our platforms quadrant.

Fast Forwarding A Bit

A first thought I had about my task was to not have our archival process deal with DR at all and let the infrastructure deal with it.  I came up with this idea of using some sort of NoSQL thing to develop a simple statement database.  Our archival process would just insert dated statement records into a NoSQL thing, and it would be that thing's job to deal with DR preparedness.  Also, I somehow zeroed in on Cassandra as a candidate NoSQL.

What's A Cassandra?  How Will It Solve My Problem?

Just go read some yourself; this is a good link and so is this, oh and I mustn't forget this link.  I was thinking that Cassandra would be my archiving 'infrastructure' and it (she?) would deal with DR, replication, availability and other things like that.

Starting with this post and carrying over to another post or two, I will share with you some of the things I have learned.

Where To Start?

If you are like me and have little or no NoSQL or Cassandra background, really, just go follow the getting started wiki topic.  After much, much more playing, learning and hacking (I will present some of this soon), I realized that I really needed a proper cluster in order to learn about all this replication stuff.

So I set up a 2 node cluster on a pair of RHEL 5 machines, again following stuff from the getting started guide.  After a bit more play, I found I had lots to learn about how the cluster and replication work in Cassandra.  I also found the 2 machine/2 node cluster cumbersome to experiment with.


I googled my way to this tool, CCM, which was specifically built for setting up a cluster on a single machine and toying around with that cluster.  It took me a few minutes to get this thing installed and running, and here are a few things/steps that might help:

  1. I assume you already have a JDK -- but if not, now is a good time to get one.   Make sure it (i.e. 'java') is in your path, too.  I am using 1.7.0_51 (Sun).
  2. Go get ANT.    Put this in your path too.  I am using 1.9.3.  Why ANT?  That will become clearer in a bit.
  3. Python -  CCM is written in Python.   I am running on Fedora 20, so python is everywhere.   
  4. I needed to yum up PyYAML (yum install PyYAML) as ccm needs that.
  5. Grab the ccm zip from GitHub and unzip it somewhere.  Or you can git clone it.
  6. Follow the set up directions in the  (i.e, as root in dir containing ccm:    ./ install )

That's about it.  I put my ccm in /opt/ccm-master.  So to test if ccm is working I issued:  /opt/ccm-master/ccm  -h.   If everything is OK, you'll see a long help screen like I did.    
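Before running the install, the prerequisites from the list above can be checked in one go.  This is only a rough sketch: the function names check_tool and check_prereqs are made up for illustration, and it assumes the interpreter is invoked as 'python' (on some systems it is 'python3').

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# (check_tool is a hypothetical helper name, not part of ccm.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok:      $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the tools from the list above: a JDK, Ant, Python and PyYAML.
check_prereqs() {
    status=0
    for tool in java ant python; do
        check_tool "$tool" || status=1
    done
    # PyYAML is a Python module, so test it with an import
    # rather than a PATH lookup.
    if python -c 'import yaml' >/dev/null 2>&1; then
        echo "ok:      PyYAML"
    else
        echo "missing: PyYAML"
        status=1
    fi
    return $status
}
```

Calling check_prereqs prints one line per tool and returns non-zero if anything is missing, so it can gate the rest of a setup script.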

Back to "Why Ant?":  CCM actually downloads the Cassandra source and builds it.  All this, plus any clusters you build, winds up in $HOME/.ccm.  I had been testing with Cassandra 2.0.5, so following the "Short Version" usage directions on the (just read this, it has good stuff in it and it is short), I did:

/opt/ccm-master/ccm create MyTestCluster -v 2.0.5
/opt/ccm-master/ccm populate -n 3
/opt/ccm-master/ccm start

After doing all this, I had a 3 node Cassandra cluster up and running on localhost.  FWIW: I went back to working on my laptop instead of the virtualized RHEL 5 servers.
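To see for yourself that the nodes are really up, ccm has a couple of inspection commands.  Here is a small sketch that wraps them; inspect_cluster is a name I made up, and the default path assumes ccm lives in /opt/ccm-master as it does on my machine.  It also assumes the cluster created above is still the current one.

```shell
#!/bin/sh
# Location of the ccm script; override by setting CCM before calling.
CCM="${CCM:-/opt/ccm-master/ccm}"

# Show the state of the current ccm cluster.
# (inspect_cluster is a hypothetical wrapper, not a ccm command.)
inspect_cluster() {
    if [ ! -x "$CCM" ]; then
        echo "ccm not found at $CCM" >&2
        return 1
    fi
    # Reports each node in the current cluster and whether it is up.
    "$CCM" status || return 1
    # Runs nodetool's ring view against node1 to show token ownership.
    "$CCM" node1 ring
}
```

Run inspect_cluster after "ccm start"; when you are done playing, "ccm stop" shuts the nodes down again.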

Neat!  Now what?  In the next post, I will go over some of what I picked up while exploring and using ad-hoc CQL tools for Cassandra.
