Tuesday, April 29, 2014

More Java Code for Fetching Data From Cassandra

So far I have made several posts about Cassandra and some very basic experiments that I have been doing:


In this post, I will share some more stuff I did to fetch data from Cassandra.

The sample code for this post and the companion posts is on github at https://github.com/fwelland/CassandraStatementTools.

Fetch By ID, Revisited

I 'freshened up' the simple selectStatements() from an earlier post.    The only thing that is really different is using this nifty lib, j-text-utils, I googled upon to format text tables.   Here it is (again):

public void selectStatements()
    throws Exception
{
 Select q = QueryBuilder.select("archived_statement_id", "customer_id", "day", "month","year","statement_type", "statement_filename").from(keyspace, table);
 if(null != statementUUID)
 {
     q.where(eq("archived_statement_id", statementUUID));
 }

 if(null != clevel)
 {
     q.setConsistencyLevel(clevel);            
 }

 ResultSet rs = session.execute(q);

 String colNames [] = {"statement id", "customer id", "statement type"};
 List<Object> list = new ArrayList<>();
 for(Row r : rs)
 {
     Object row[] = new Object[3];
     row[0] = r.getUUID("archived_statement_id").toString();
     row[1] = r.getInt("customer_id"); 
     row[2] = r.getString("statement_type"); 
     list.add(row);                        
 }
 Object data[][] = new Object[list.size()][];
 list.toArray(data); 
 TextTable tt = new TextTable(colNames, data);         
 tt.printTable();         
}

Now when I run my program with the following command args:

String margs[] = new String[]{"-node", "127.0.0.1", "-node", "127.0.0.2", "-node", "127.0.0.3", "-select",  "-uuid", "53ad6a82-cfa6-4808-9423-9f76d9fd6de9"};

the output looks like:
:run
Connected to cluster: My2_6TestCluster
Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
___________________________________________________________________
| statement id                        | customer id| statement type|
|==================================================================|
| 53ad6a82-cfa6-4808-9423-9f76d9fd6de9| 47900      | 0890          |


Fetch By Other Attributes

Now I will expand on selectStatements() to be able to list statements by other attributes like customer_id.   That should be easy enough; and here is what I came up with:

        ...
        if(null != statementUUID)
        {
            q.where(eq("archived_statement_id", statementUUID));
        }
        else
        {
            if( null != customerId)
            {
                q.where(eq("customer_id", customerId));
            }
        }
        ...

Basically, if the select operation is not by id, then in the else branch, I can string together some eq() clause for other attributes specified.    So I will test the simple case where just a customer id is specified; here are the results:

:run
Connected to cluster: My2_6TestCluster
Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
com.datastax.driver.core.exceptions.InvalidQueryException: No indexed columns present in by-columns clause with Equal operator
 at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
 at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:271)
 at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:187)
 at com.datastax.driver.core.Session.execute(Session.java:126)
 at com.fhw.CLoad.selectStatements(CLoad.java:176)
 at com.fhw.CLoad.main(CLoad.java:75)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: No indexed columns present in by-columns clause with Equal operator
 at com.datastax.driver.core.Responses$Error.asException(Responses.java:96)
        ...

Bummer, not what I wanted.

Indexing in Cassandra

So Cassandra is largely a KV store and things work best if you ask for a row by key.    Recall that when I created the statements table, I made a surrogate id, 'archived_statement_id'  of type UUID.   Now it seems I am stuck only fetching by ID; YUCK!  

Fortunately you can create secondary indexes that can be used for locating records by other attributes.   So my uneducated approach to solving this is:


CREATE INDEX customer_id_index ON statementarchive.statements(  customer_id );

I used DevCenter to apply this. After the index was created, I re-ran the program and got this:

:run
Connected to cluster: My2_6TestCluster
Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
_______________________________________________________________________________
| statement id                        | date      | customer id| statement type|
|==============================================================================|
| 77669f72-c8be-4d6c-b8be-f6ef7b4325bb| 14-07-2011| 47900      | 0770          |
| 00463a07-c4ed-4276-8877-97f465d96b71| 23-07-2011| 47900      | 0770          |
| 390d3455-f9de-4cce-8b6b-5397f718e20a| 31-07-2011| 47900      | 0890          |
| fe9bf132-e500-49cb-86aa-ffc092c66879| 03-07-2011| 47900      | 0890          |
| 25dcfcac-acea-4e51-b1c6-59f4da5d6f6e| 27-07-2011| 47900      | 0770          |
| 87383c26-93c3-4784-b36f-eebe5ebe9903| 13-07-2011| 47900      | 0747          |
| 40107f00-065f-4cba-8fe3-4c96e718114a| 28-07-2011| 47900      | 0770          |
| e1c979b0-c11c-4e2b-93fd-676c29517213| 26-07-2011| 47900      | 0890          |
| 862bf808-3e80-45f5-a627-545697c65e05| 29-07-2011| 47900      | 0747          |
.....
93 records found.

WooHoo!

Indexing Causes Lots of Questions 

So what happened when the index was created?  Haven't got a clue.  Was this a good idea?   Not a clue, I need to read and learn more about this (may be I will start here).      Does the index work across the nodes?    Looks likes it; I tore down nodes systematically and my test continued to work.    

What about asking questions, show me all the statements for customer 84950, in March of 2012?  What kind of indexes would I need for that?

In a forth coming post, I hope to explore more about indexing as well as actually extracting the statement files and see if they are correct.


No comments:

Post a Comment