Saturday, April 12, 2014

Generating Test Files and Loading Them into Cassandra

If you have been following along, a quick reminder: the sample code for this post and its companion posts is on GitHub at https://github.com/fwelland/CassandraStatementTools.

Now I need to load some more realistic data, and a larger volume of it, so I can do more testing and learn more about Cassandra.

Fabricating PDF Files

Grabbing some real statement files isn't really an option, so I needed a simple way to make some.  If you go back to the first post in this series, you may recall that my existing statement data looked a bit like this:


This should be pretty straightforward.  The statement file name starts with a 'B' followed by the customer id, then an 'R' followed by the statement type, and then the extension.  Most customer statements are PDF files, but there are other types of statements in different file formats.  I needed a way to generate a bunch of mock statements in the above structure, with some identifying data in each PDF that I can use for verification.  So this happened:

#!/bin/sh
cd ~/statements
for year in 2011 2012 2013 ; do
    for cust in 47900 58240 43241 ; do
        for month in 01 02 03 04 05 06 07 08 09 10 11 12 ; do
            # cal prints the day numbers for the month; grep -v strips the header lines
            for dy in `cal ${month} ${year} | grep -v '[a-z]'` ; do
                day=`printf "%02d" ${dy}`
                mkdir -p ~/statements/${year}/${cust}/${month}/${day}
                for stype in "0770" "0890" "0747" ; do
                    fname=~/statements/${year}/${cust}/${month}/${day}/B${cust}R${stype}
                    echo "statement ${stype} for customer ${cust} on day ${day}-${month}-${year}" | enscript -t "Statement ${stype}" -o- | ps2pdf - > ${fname}.pdf
                done
            done
        done
    done
done

The only bits of 'magic' in this script are:

  • The hunk to get the number of days in a month:  cal ${month} ${year} | grep -v '[a-z]'  (credit: Google)
  • The tip to use enscript to make a PostScript file and then ps2pdf to turn it into a PDF  (again, credit: Google)
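For what it's worth, the same days-in-a-month count the `cal`/`grep` hunk produces is available in plain JDK 1.7 Java via `Calendar.getActualMaximum()`, which handles leap years the same way `cal` does.  A small sketch (the class and method names are mine):

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

// Java equivalent of the `cal ${month} ${year} | grep -v '[a-z]'` trick:
// ask the calendar for the last valid day-of-month.
public class DaysInMonth
{
    static int daysIn(int year, int month)       // month is 1-12 here
    {
        Calendar c = new GregorianCalendar(year, month - 1, 1);
        return c.getActualMaximum(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args)
    {
        System.out.println(daysIn(2012, 2));     // 29 -- 2012 is a leap year
        System.out.println(daysIn(2011, 2));     // 28
    }
}
```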
The rest of the gook is pretty mundane shell scripting.  After launching this script and waiting 20 minutes or so, I had 9864 mock customer statements organized like the picture above, each with some bread crumbs in it that I can use for validation later.
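Since those bread crumbs matter later, it helps to be explicit about the naming scheme the script builds.  A tiny sketch (the class and method names are mine, the sample values are the mock ones from the script):

```java
// Sketch of the statement file naming scheme:
// 'B' + customer id + 'R' + statement type + extension.
public class StatementName
{
    static String build(int customerId, String statementType, String ext)
    {
        return "B" + customerId + "R" + statementType + "." + ext;
    }

    public static void main(String[] args)
    {
        System.out.println(build(47900, "0770", "pdf"));   // B47900R0770.pdf
    }
}
```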

Now To Load These Into Cassandra

From a post or two back, I have an 'addStatement()' routine that can stuff a statement file into Cassandra along with metadata about that file.   So I'll need to do two things:
  • Make a file system crawler to locate statements and derive the statement metadata from each file's location in the file system and its file name.
  • Refactor the addStatement() routine to accept some sort of statement bean rather than the instance attributes it uses now.
Here is the new, improved addStatement():

    private String addStatement(Statement s)
            throws IOException
    {        
        ByteBuffer buffer;
        try (RandomAccessFile aFile = new RandomAccessFile(s.getStatementPath(), "r"); FileChannel inChannel = aFile.getChannel())
        {
            long fileSize = inChannel.size();
            buffer = ByteBuffer.allocate((int) fileSize);
            // a single read() isn't guaranteed to fill the buffer, so read until it is
            while (buffer.hasRemaining() && inChannel.read(buffer) > 0)
            {
            }
            buffer.rewind();
        }
        Insert i = QueryBuilder.insertInto(keyspace,table);
        i.value("archived_statement_id", s.getArchivedStatementId());
        i.value("customer_id", s.getCustomerId());
        i.value("day", s.getDay());
        i.value("month", s.getMonth());
        i.value("year", s.getYear()); 
        i.value("statement_type", s.getStatementType()); 
        i.value("statement_filename", s.getStatementFilename());
        i.value("statement", buffer);
        i.setConsistencyLevel(clevel); 
        session.execute(i);
        return(s.getArchivedStatementId().toString()); 
    }

The changes I made are pretty simple: I return the stringified UUID value after insertion, and I use a new bean called "Statement" as the source of the metadata for the statement being loaded. The Statement bean is pretty simple; here it is:

package com.fhw;

import java.io.*;
import java.util.*;
import lombok.*;
@Data
public class Statement
{
    private UUID archivedStatementId; 
    private int customerId;
    private int day;
    private int month;
    private int year;
    private String statementType; 
    private String statementPath;
        
    public UUID getArchivedStatementId()
    {
        if(null == archivedStatementId)
        {
            archivedStatementId = UUID.randomUUID();
        }
        return(archivedStatementId); 
    }
    public String getStatementFilename()
    {
        String fname = null; 
        if(null != statementPath)
        {
            File f = new File(statementPath);
            fname = f.getName();
        }
        return(fname); 
    }
}

Wow, really simple!  Where is everything?   Well, it is sort of outside the scope of this topic, but a friend turned me on to Lombok.    The @Data annotation takes care of lots of boilerplate getter and setter code, and Lombok probably does other good stuff that I haven't learned about yet.  Check it out.
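To see the two hand-written getters in action without dragging Lombok in, here is a minimal stand-in for the bean (the class name and the sample path are mine; the real setters are Lombok-generated):

```java
import java.io.File;
import java.util.UUID;

// Minimal stand-in for the Statement bean showing its two hand-written getters:
// the lazily created UUID and the filename derived from the full path.
public class StatementSketch
{
    private UUID archivedStatementId;
    // hypothetical sample path in the layout the generator script builds
    private String statementPath = "/home/me/statements/2011/47900/07/15/B47900R0770.pdf";

    public UUID getArchivedStatementId()
    {
        if (null == archivedStatementId)
        {
            archivedStatementId = UUID.randomUUID();   // created once, reused after
        }
        return archivedStatementId;
    }

    public String getStatementFilename()
    {
        return (null == statementPath) ? null : new File(statementPath).getName();
    }

    public static void main(String[] args)
    {
        StatementSketch s = new StatementSketch();
        System.out.println(s.getStatementFilename());   // B47900R0770.pdf
        // the id is stable across calls
        System.out.println(s.getArchivedStatementId().equals(s.getArchivedStatementId()));   // true
    }
}
```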

....And Now The Crawler

JDK 1.7 added java.nio.file.Files to the NIO packages.   A helper that seems well suited to building a file-system crawler is the Files.walkFileTree() family of methods.   Using these, I get:

    public void loadReports()
            throws IOException
    {
        StatementFileVistor crawler = new StatementFileVistor(); 
        crawler.setLoader(this);
        crawler.setRootLen(root_len);
        Files.walkFileTree(Paths.get(root), crawler);
    }

Easy! It turns out most of the work is in the visitor, so here it is:

package com.fhw;

import java.nio.file.*;
import static java.nio.file.FileVisitResult.CONTINUE;
import java.nio.file.attribute.*;
import lombok.*;

@Data
public class StatementFileVistor
    extends SimpleFileVisitor<Path>
{
    private CLoad loader;
    private int rootLen;
    private int count = 0; 
    
    public StatementFileVistor()
    {

    }
    
    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attr)
    {
        try
        {
            if(attr.isRegularFile())
            {
                loader.addStatement(makeStatement(file)); 
                count++;
            }
        }
        catch (Exception e)
        {
            System.out.println("failed adding " + file.toString() + "; error:  " + e.getMessage());
        }
        return CONTINUE;
    }
    
    private Statement makeStatement(Path file)
    {
        int customerId;
        int year;
        int day;
        int month;
        String statementType;

        String absPath = file.toAbsolutePath().toString();
        String p = absPath.substring(rootLen + 1);   // path relative to the statements root
        String s = p.substring(0, 4);                // four-digit year directory
        year = Integer.parseInt(s);
        s = p.substring(5, 10);                      // five-digit customer id directory
        customerId = Integer.parseInt(s);
        s = p.substring(11, 13);                     // two-digit month directory
        month = Integer.parseInt(s);
        int idx = p.indexOf('/', 14);
        s = p.substring(14, idx);                    // two-digit day directory
        day = Integer.parseInt(s);
        String fileName = file.getFileName().toString();
        idx = fileName.indexOf('R') + 1;             // statement type sits between 'R' and '.'
        int end = fileName.indexOf('.');
        statementType = fileName.substring(idx, end);
        Statement stmt = new Statement();
        stmt.setCustomerId(customerId);
        stmt.setDay(day);
        stmt.setMonth(month);
        stmt.setYear(year);
        stmt.setStatementType(statementType);
        stmt.setStatementPath(file.toAbsolutePath().toString());
        return(stmt); 
    }
}
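The fixed substring offsets in makeStatement() assume four-digit years, five-digit customer ids, and two-digit month and day directories.  A slightly more forgiving variant (a sketch of my own, not what the repo does) splits the root-relative path on the separator instead:

```java
// Hypothetical alternative to the fixed-offset parsing in makeStatement():
// split the root-relative path on '/' so the parsing survives ids of any width,
// as long as the year/cust/month/day directory layout holds.
public class PathParse
{
    // returns {year, customerId, month, day}
    static int[] parseParts(String relative)
    {
        String[] parts = relative.split("/");
        return new int[] {
            Integer.parseInt(parts[0]),   // year directory
            Integer.parseInt(parts[1]),   // customer id directory
            Integer.parseInt(parts[2]),   // month directory
            Integer.parseInt(parts[3])    // day directory
        };
    }

    static String parseType(String fileName)
    {
        // the statement type sits between the 'R' and the extension dot
        return fileName.substring(fileName.indexOf('R') + 1, fileName.indexOf('.'));
    }

    public static void main(String[] args)
    {
        int[] f = parseParts("2011/47900/07/15/B47900R0770.pdf");
        System.out.println(f[0] + " " + f[1] + " " + f[2] + " " + f[3]);   // 2011 47900 7 15
        System.out.println(parseType("B47900R0770.pdf"));                  // 0770
    }
}
```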

....But does it Work???

I was a bit surprised; casual inspection suggests it works.   For this first test, I trimmed the statement set down to something smaller: just the statements for July 2011 for a single customer, 93 statements in all.    Here is some output from the real test run (from within NetBeans):

:run
Connected to cluster: My2_6TestCluster
Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
I added 93 statements

BUILD SUCCESSFUL

Total time: 1.637 secs

That suggests I loaded 93 PDFs in 1.6 seconds into a Cassandra cluster with a replication factor of 3 (albeit with all nodes local).   Checking Cassandra using DevCenter confirms there are 93 records, and a few spot checks to make sure the data is on the other nodes hold up too.    I was really expecting statement loading to be slow.

I think there is a bit more to learn here.    Next, I need to try extracting some statements to make sure they are really there.




