Tuesday, November 19, 2013

Journey With JackRabbit

                        

   The way Jackrabbit is supposed to work most of the time with versioning capabilities is as follows (a minimal code sketch of this flow follows the list):
    1.    Create a repository, say "MyRepository".
    2.    Populate the repository with our data in a proper tree structure by creating nodes and child nodes.
    3.    To work with versioning, we might want to create a workspace and copy the nodes from the repository to the workspace.
    4.    Work on the nodes in the workspace and save them.
    5.    Once the work is complete, merge the changed nodes back into the repository.
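
   As a rough illustration of this flow, here is a minimal sketch against the JCR API using Jackrabbit's embedded TransientRepository. The "admin" credentials, the "default" and "editing" workspace names and the node paths are assumptions made up for the example, not part of any particular setup.

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.version.VersionManager;

import org.apache.jackrabbit.core.TransientRepository;

public class VersioningFlow {
 public static void main(String[] args) throws Exception {
  // 1. Open the repository (an embedded TransientRepository stands in for "MyRepository").
  Repository repository = new TransientRepository();
  SimpleCredentials admin = new SimpleCredentials("admin", "admin".toCharArray());
  Session session = repository.login(admin, "default");

  // 2. Populate the repository with a small tree of versionable nodes.
  Node docs = session.getRootNode().addNode("docs");
  Node doc = docs.addNode("doc1");
  doc.addMixin("mix:versionable");
  doc.setProperty("title", "first draft");
  session.save();
  VersionManager defaultVm = session.getWorkspace().getVersionManager();
  defaultVm.checkin(doc.getPath());

  // 3. Create a workspace and bring the nodes we want to work on into it.
  session.getWorkspace().createWorkspace("editing");
  Session wsSession = repository.login(admin, "editing");
  wsSession.getWorkspace().clone("default", "/docs", "/docs", true);

  // 4. Work on the nodes in the workspace and save them.
  VersionManager wsVm = wsSession.getWorkspace().getVersionManager();
  wsVm.checkout("/docs/doc1");
  wsSession.getNode("/docs/doc1").setProperty("title", "edited draft");
  wsSession.save();
  wsVm.checkin("/docs/doc1");

  // 5. Once the work is complete, merge the changed node back into the default workspace.
  defaultVm.merge("/docs/doc1", "editing", true);

  wsSession.logout();
  session.logout();
 }
}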
   
   
   Data in the repository and workspaces can be persisted in the following stores
    1.    Local DerbyDB
    2.    MSSQL
    3.    Oracle DB
    4.    PostgreSQL
    5.    MySQL
   
   
   Seems pretty straightforward. But a problem arises when the size of the repository increases: creating workspaces for a repository remains a costly affair because nodes are copied. There are APIs in Jackrabbit that allow creating a blank workspace and then selectively copying the desired node subtree (a sketch follows below).
   This might suffice and resolve most of the size and time cost of creating a new workspace to work on. But in some scenarios we might need to copy the whole workspace, because of dependencies and the way the application using Jackrabbit is designed.
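
   A hedged sketch of that selective approach, again against the plain JCR API (the workspace and path names such as "editing" and "/docs/project-a" are invented for the example):

import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.Workspace;

public class SelectiveCopy {
 public static void copySubtree(Repository repository) throws Exception {
  SimpleCredentials admin = new SimpleCredentials("admin", "admin".toCharArray());
  Session defaultSession = repository.login(admin, "default");

  // Create a blank workspace instead of duplicating the whole repository.
  defaultSession.getWorkspace().createWorkspace("editing");

  Session editingSession = repository.login(admin, "editing");
  Workspace editing = editingSession.getWorkspace();

  // Copy only the subtree we actually need; the destination parent (here the root) must exist.
  editing.copy("default", "/docs/project-a", "/project-a");

  editingSession.logout();
  defaultSession.logout();
 }
}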
  
   In scenarios where selective copying is not an appropriate solution, one has to think about how to make the I/O fast.
   For such a requirement, a NoSQL DB that stores and distributes data horizontally can effectively increase the overall I/O throughput.
   A couple of such persistence managers are being thought of and are available in a very early form.
   The underlying technology for persisting Jackrabbit data can be
    1.    OrientDB, a NoSQL graph and document database
    2.    MongoDB, a NoSQL document database
   
    A persistence manager for OrientDB can be found at https://github.com/eiswind/jackrabbit-orient . It stores the information in a human-readable format.
   
    A persistence manager for MongoDB can be found at http://svn.apache.org/repos/asf/jackrabbit/sandbox/jackrabbit-mongo-persistence/src/main/java/org/apache/jackrabbit/core/persistence/mongo . It serializes the information and stores it in MongoDB.
   
     On comparing the performance of LocalDerbyDB, PostgreSQL, OrientDB and MongoDB, it was observed that, for creating a workspace for a repository with 3000 nodes on a standard laptop (PM = persistence manager, DS = data store):
         LocalDerbyDB  : ~70000 ms   (PM and DS on Derby DB)
         PostgreSQL    : ~60000 ms   (PM and DS on PostgreSQL)
         OrientDB      : ~55000 ms   (PM on OrientDB and DS on PostgreSQL)
         MongoDB       : ~45000 ms   (PM and DS on MongoDB)
       
       
     So although MongoDB had the overhead of serializing/deserializing, it performed best among all the options tried. There is scope for performance improvement in both OrientDB and MongoDB.
    
     MongoDB also supports sharding, so data can be distributed horizontally across nodes and retrieval can be very fast using map-reduce.
     The only caveat with MongoDB is its AGPL license.
  

Friday, October 11, 2013

Unlimited Possibilities with RaspberryPI

Everyone would want to have smart devices around. Smart devices are now capable of taking decisions by reading information from the internet, reading information from most of the digital devices around them and controlling those devices remotely over the "Internet". Many microcontrollers and dedicated hardware devices are used for doing the same. But wouldn't it be nice if we had a computer with Java installed instead of microcontrollers? Microcontroller programming is a "pain" (for me at least).
       Enter the Raspberry Pi (http://www.raspberrypi.org/): a very small computer, running Linux from an SD card, with bare-minimum interfaces like a LAN connection, USB ports and HDMI output. It has decent hardware with an ARM chip, 512 MB RAM and an SD card slot, is powered over USB, and supports Oracle Java 7 as well as Python. There is a very good tutorial site (http://learn.adafruit.com/category/raspberry-pi) with lots of examples, including programming with motion sensors, motors, LEDs and relays. Best part is, it is only the size of a credit card.
       Use cases include home automation (https://code.google.com/p/openhab/), teaching in schools in developing countries (due to the low cost: $25-$35), helping people with disabilities, building a Raspberry Pi cluster, robotics, unmanned vehicles and making GPS trackers for pets. Endless possibilities.
      A cheap device with pretty decent computing power (in fact more powerful than the desktop computer I bought in the year 2000) and portability, which can be easily programmed, store data, transmit data over the internet and read data from the internet, makes it a wonderful device capable of automating everything. Possibilities are endless!!
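
      As a small illustration of the Java support, here is a minimal LED-blink sketch using the third-party Pi4J library. The pin choice, wiring and timing are assumptions made up for the example, not taken from any particular tutorial.

import com.pi4j.io.gpio.GpioController;
import com.pi4j.io.gpio.GpioFactory;
import com.pi4j.io.gpio.GpioPinDigitalOutput;
import com.pi4j.io.gpio.PinState;
import com.pi4j.io.gpio.RaspiPin;

public class BlinkLed {
 public static void main(String[] args) throws InterruptedException {
  GpioController gpio = GpioFactory.getInstance();
  // Assumes an LED wired to GPIO pin 1 (Pi4J/WiringPi numbering), initially off.
  GpioPinDigitalOutput led = gpio.provisionDigitalOutputPin(RaspiPin.GPIO_01, "LED", PinState.LOW);

  for (int i = 0; i < 10; i++) {
   led.toggle();       // switch the LED on/off
   Thread.sleep(500);  // half-second blink interval
  }

  gpio.shutdown();      // release GPIO resources
 }
}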
    

Monday, June 11, 2012

Apache Pig over Hadoop

In the last three blog posts we looked at:
   
  • Hadoop and HDFS setup
        
  • Hive installation and an example
        
  • Using the Jasper Hive plugin to generate Jasper reports


Pig is another such tool: it exposes a structured language which runs over Hadoop and HDFS.

In the post below we will try installing Pig and running the same sample example, where we will extract the mobile phone number and name of the persons whose id is less than or equal to 10. We will write Pig scripts for the same.



We start by installing Pig.

  • Download and install the Pig debian package.
             
    • dpkg -i pig_0.10.0-1_i386.deb
       
  • Start the dfs server and the mapred service
        
         
    • start-all.sh
  • If Pig is to be run in local mode, there is no need to perform the above step
  • Connect to the Pig shell (we will connect locally here)
           
    • pig -x local
     
  • Once we are in the Pig shell (the prompt name is grunt :) .. funny .. ), we will now load the file from the local file system to HDFS using Pig.

           
    • copyFromLocal export.csv export.csv
          
  • We will now load the data from HDFS into a Pig relation (similar to a table in Hive)
       
    • person = LOAD 'export.csv' USING PigStorage(',') AS      (PERSON_ID:int,NAME:chararray,FIRST_NAME:chararray,LAST_NAME:chararray,MIDDLE_NAMES:chararray,TITLE:chararray,STREET_ADDRESS:chararray,CITY:chararray,COUNTRY:chararray,POST_CODE:chararray,HOME_PHONE:chararray,WORK_PHONE:chararray,MOBILE_PHONE:chararray,NI_NUMBER:chararray,CREDITLIMIT:chararray,CREDIT_CARD:chararray,CREDITCARD_START_DATE:chararray,CREDITCARD_END_DATE:chararray,CREDITCARD_CVC:chararray,DOB:chararray);
     
  • We can see the output of the person relation using the dump command
               
    • dump person;
  • Run a script to filter out the persons whose person id is less than or equal to 10
              
    • top_ten = FILTER person BY PERSON_ID <= 10;
    Dump top_ten to see the output
   
  • Run a script to extract the name and the mobile number from that list

    • mobile_numbers = FOREACH top_ten GENERATE NAME, MOBILE_PHONE;

    Dump mobile_numbers to see the output

   This is the output we desire. The same flow can also be driven from embedded Java, as sketched below.
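
    The sketch below is a rough illustration of the same flow embedded in Java through Pig's PigServer API, assuming local mode. To avoid repeating the full 20-column schema it references the CSV columns by position ($0 = PERSON_ID, $1 = NAME, $12 = MOBILE_PHONE), which is an assumption about the column order of export.csv.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class TopTenMobileNumbers {
 public static void main(String[] args) throws Exception {
  // Run Pig in local mode, embedded in the JVM.
  PigServer pig = new PigServer(ExecType.LOCAL);

  // Same three steps as the grunt session above, referencing fields by position.
  pig.registerQuery("person = LOAD 'export.csv' USING PigStorage(',');");
  pig.registerQuery("top_ten = FILTER person BY (int)$0 <= 10;");
  pig.registerQuery("mobile_numbers = FOREACH top_ten GENERATE $1 AS NAME, $12 AS MOBILE_PHONE;");

  // Iterate over the result instead of DUMPing it on the grunt shell.
  Iterator<Tuple> it = pig.openIterator("mobile_numbers");
  while (it.hasNext()) {
   System.out.println(it.next());
  }

  pig.shutdown();
 }
}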

Friday, June 8, 2012

Develop Jasper report with Hive


In the last two blog posts we learned:
  • Setting up Hadoop and writing simple map-reduce jobs
  • Setting up Hive and firing SQL queries over it

In this blog we will use Jasper Reports to generate a report which uses Hive as the data store.
We will generate the report from the list of customers who have a mobile phone.
It is assumed that you have Jaspersoft iReport Designer pre-installed.

  • Start Hive in server mode so that we can connect to it using a JDBC client (see the JDBC sketch after this list)
      • hive --service hiveserver

  • Create the table and load the data into the Hive table from the Hive shell. This is done so that we can query it; Hadoop map-reduce programs will be called internally to fetch data from this table. The data will be distributed over HDFS and will be collected and returned according to the query
      • hive -p 10000 -h localhost
      • CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';
      • load data inpath 'export.csv' overwrite into table person;
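
  Under the hood, iReport talks to the Hive server over JDBC. The following is a minimal sketch of that connection in plain Java, assuming the Hive 0.x HiveServer driver and localhost:10000 as started above; the WHERE clause is an illustrative stand-in for whatever query the report actually uses.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
 public static void main(String[] args) throws Exception {
  // HiveServer1 JDBC driver shipped with Hive 0.x.
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
  Statement stmt = con.createStatement();

  // Same kind of query the report needs: customers who have a mobile phone.
  ResultSet rs = stmt.executeQuery(
    "SELECT name, mobile_phone FROM person WHERE mobile_phone <> ''");
  while (rs.next()) {
   System.out.println(rs.getString(1) + " : " + rs.getString(2));
  }

  rs.close();
  stmt.close();
  con.close();
 }
}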




  • Start the iReport Designer
    • Create a new data source to connect to the Hive database. This is the first step, which adds the Hive database connection.


  • Create a new report. Refer to the screenshots for more details. A query is given to fetch the appropriate data from Hive.




    This way we now have a distributed file system (HDFS), a map-reduce engine above it (Hadoop), a data warehousing tool over these frameworks (Hive), and finally a reporting tool to extract meaningful data out of it and display it. Jasper Reports has built-in capabilities to communicate with Hive (via JDBC).


    Peace.
    Sanket Raut

Thursday, June 7, 2012

Apache Hive example


Once you have HDFS and Hadoop configured, Hive is a data warehousing solution which runs on top of HDFS and Hadoop. I have considered the same input file and fired Hive queries, which in turn fire Hadoop MapReduce jobs.

The following steps were done to install Hive.
  • Assume you have a Hadoop installation up and running (described in the earlier post)
  • Download the Hive binaries from the Apache site.
  • Unzip hive-0.9.0-bin.tar.gz into a directory
  • cd to the unzipped directory and fire the following commands
    • export HIVE_HOME=$PWD
    • export PATH=$PWD/bin:$PATH

Once all the above steps are done, we are ready to enter the Hive shell. This shell will let us enter Hive commands.
  • enter the command:
    • hive

Once you are in the Hive shell, you are ready to fire HiveQL commands.
Since in the earlier post we had a CSV file, we will create a table for it. This creates a Hive table into which we will load the data. The data will be distributed over HDFS across all the nodes.
  • CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';

Then we will load the data from the CSV file.
  • load data inpath '<PATH_TO_FILE>/export.csv' overwrite into table person;

Now we are ready to fire some HiveQL queries, which will call the corresponding map-reduce jobs:

  • select * from person where person_id=1;
  • select count(1) from person;
  • select * from person where name like '%Bob%';

Hive makes map-reduce programming simpler by adding warehousing and SQL capabilities on top of it.

Peace.
Sanket Raut

Wednesday, June 6, 2012

Hadoop Simple Example

This post will help you understand how to install (as a standalone node) and run a sample map-reduce job on Hadoop. Although the example does not reflect the most realistic usage of map-reduce, it is good for starters learning and coding Hadoop.



Setting up Hadoop on Ubuntu (or any other linux)
  • Download the debian file "hadoop_1.0.3-1_i386.deb" from apache hadoop site
  • Create a group named hadoop. If you are using Ubuntu, you need to create the group explicitly, because the debian package tries to create a group with id 123 and that group id usually already exists.
                    sudo groupadd -g 142 -r hadoop 
  • Install hadoop using a debian package
                    sudo dpkg -i hadoop_1.0.3-1_i386.deb
  • Set up passphraseless SSH (this is, I suppose, a limitation of the framework, as it requires password-less SSH to be enabled; maybe in an actual setup with multiple nodes this is not needed):
       sudo su -
       ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
       cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  • Create an HDFS and format it:
      Create a new directory for HDFS and format it; we need to initialize it before first use.
                     hadoop namenode -format
  • Start the Hadoop node. Starting dfs ensures the distributed file system service is started. We also need to start the map-reduce service, which will run the map-reduce jobs. 
                     start-dfs.sh
                     start-mapred.sh

  • Java Code
This Java code will prepend the text "Pudo" to each field in the CSV file.

package com.sanket;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PsudoFile {

 public static class Map extends Mapper<Text, Text, Text, Text> {
  private Text outKey = new Text();
  private Text word = new Text();

  public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
   // With KeyValueTextInputFormat the whole line arrives as the key (there is no tab
   // separator in the CSV), so the value is empty and we parse the key instead.
   String line = key.toString();
   StringTokenizer tokenizer = new StringTokenizer(line, ",");
   // Use the first field (PERSON_ID) as the map output key and the full line as the value.
   outKey.set(tokenizer.nextToken());
   word.set(line);
   context.write(outKey, word);
  }
 }

 public static class Reduce extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
   for (Text val : values) {
    // Prefix every field of the line with "Pudo" and write the rebuilt line out.
    StringTokenizer st = new StringTokenizer(val.toString(), ",");
    StringBuffer sb = new StringBuffer();
    while (st.hasMoreTokens()) {
     sb.append("Pudo").append(st.nextToken()).append(",");
    }
    context.write(new Text(), new Text(sb.substring(0, sb.length() - 1)));
   }
  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job reverse = new Job(conf, "ReadCSV");
  reverse.setJarByClass(PsudoFile.class);
  reverse.setOutputKeyClass(Text.class);
  reverse.setOutputValueClass(Text.class);
  reverse.setMapOutputKeyClass(Text.class);
  reverse.setMapOutputValueClass(Text.class);
  reverse.setMapperClass(Map.class);
  reverse.setReducerClass(Reduce.class);
  // The regular TextInputFormat gives a class cast exception here:
  // java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
  reverse.setInputFormatClass(KeyValueTextInputFormat.class);
  reverse.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(reverse, new Path(args[0]));
  FileOutputFormat.setOutputPath(reverse, new Path(args[1]));
  reverse.waitForCompletion(true);
 }
}


  • Compile the class and jar it. We will use the jar to run the map-reduce program.
  javac -classpath hadoop-core-1.0.3.jar -d bin PsudoFile.java
  jar cvf FirstProgram.jar -C bin/ .
  • Add the input file to HDFS from the local file system
       hadoop fs -mkdir inputcsv
       hadoop fs -put export.csv inputcsv
  • Run the program using the following command
           hadoop jar FirstProgram.jar com.sanket.PsudoFile inputcsv/export.csv outputcsv


  • Check the output and then delete the existing output directory. The output adapter gives an error if the output directory already exists
      NOW=$(date +"%b-%d-%s")
      LOGFILE="log-$NOW.log"
      hadoop fs -cat outputcsv/part-r-00000 > $LOGFILE
      hadoop fs -rmr outputcsv 
  • The output will be present in the log-$NOW.log file


The output will not be in the same sequence because of the internal sort done by map-reduce. The workaround is to implement your own Key class and override the compare method; that ensures the output comes out in the same order as the input (a sketch of such a key follows below). But ideally, the data to be analyzed need not be kept in the same sequence, since the goal is to derive meaning out of a huge chunk of data.
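
A hedged sketch of such a custom key, assuming the mapper tags each record with an explicit line index; the class and field names are made up for illustration.

package com.sanket;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Custom key that sorts by an explicit line index, so the reducer output
// keeps the original input order.
public class OrderedLineKey implements WritableComparable<OrderedLineKey> {
 private long lineIndex;

 public OrderedLineKey() {
 }

 public OrderedLineKey(long lineIndex) {
  this.lineIndex = lineIndex;
 }

 public void write(DataOutput out) throws IOException {
  out.writeLong(lineIndex);
 }

 public void readFields(DataInput in) throws IOException {
  lineIndex = in.readLong();
 }

 // The compare step of the shuffle uses this ordering, so output follows the line index.
 public int compareTo(OrderedLineKey other) {
  if (lineIndex < other.lineIndex) {
   return -1;
  }
  return (lineIndex == other.lineIndex) ? 0 : 1;
 }

 public int hashCode() {
  return (int) (lineIndex ^ (lineIndex >>> 32));
 }

 public boolean equals(Object o) {
  return (o instanceof OrderedLineKey) && ((OrderedLineKey) o).lineIndex == lineIndex;
 }
}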

A very practical example is  :
http://online.wsj.com/article/SB10001424052702303444204577460552615646874.html


Peace.
Sanket Raut