Monday, June 11, 2012

Apache Pig over Hadoop

In the last three blog posts we looked at:

  • Hadoop and HDFS setup

  • Hive installation and example

  • Using the Jasper Hive plugin to generate Jasper reports


Pig is another such tool: it exposes a structured language (Pig Latin)
that runs over Hadoop and HDFS.

In this post we will install Pig and run the same sample example,
where we extract the name and the mobile phone number of the persons
whose id is less than or equal to 10. We will write Pig scripts
for this.



We start by installing Pig.

  • Download and install the Pig Debian package.
             
    • dpkg -i pig_0.10.0-1_i386.deb
       
  • Start the DFS and MapReduce services
       
        
    • start-all.sh
  • If Pig is going to run in local mode, the step above is not needed
  • Connect to the Pig shell (here we will connect in local mode)
           
    • pig -x local
     
  • Once we are in the Pig shell (the prompt is named grunt :) .. funny .. ), we load the file from the local file system into HDFS using Pig. In local mode the copy simply stays on the local file system.

          
    • copyFromLocal export.csv person
          
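  • We will now load the data into a Pig relation (similar to a table in Hive). In local mode 'export.csv' is read from the local file system; when running against HDFS, the path to load is the copy made above ('person').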
       
    • person = LOAD 'export.csv' USING PigStorage(',') AS      (PERSON_ID:int,NAME:chararray,FIRST_NAME:chararray,LAST_NAME:chararray,MIDDLE_NAMES:chararray,TITLE:chararray,STREET_ADDRESS:chararray,CITY:chararray,COUNTRY:chararray,POST_CODE:chararray,HOME_PHONE:chararray,WORK_PHONE:chararray,MOBILE_PHONE:chararray,NI_NUMBER:chararray,CREDITLIMIT:chararray,CREDIT_CARD:chararray,CREDITCARD_START_DATE:chararray,CREDITCARD_END_DATE:chararray,CREDITCARD_CVC:chararray,DOB:chararray);
     
  • We can see the contents of the person relation using the dump command
              
    • dump person;
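  • Run a script to select the persons whose person id is less than or equal to 10. Field names in Pig are case-sensitive, so the filter has to use PERSON_ID exactly as declared in the LOAD statement.

    • top_ten = FILTER person BY PERSON_ID <= 10;
    Run dump top_ten; to see the output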
   
  • Run a script to extract the name and the mobile number from that list

    • mobile_numbers = FOREACH top_ten
              GENERATE NAME , MOBILE_PHONE;

    Run dump mobile_numbers; to see the output

   This is the output we desire.
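
The steps above can also be collected into a single script instead of being typed one by one at the grunt prompt. The following is only a minimal sketch. It assumes export.csv sits in the directory Pig is started from and has the column layout used above. The file name mobile_numbers.pig is made up for illustration.

```pig
-- mobile_numbers.pig (hypothetical file name, chosen for this example)

-- Load the CSV into a relation, using the same schema as in the steps above.
person = LOAD 'export.csv' USING PigStorage(',') AS (
    PERSON_ID:int, NAME:chararray, FIRST_NAME:chararray, LAST_NAME:chararray,
    MIDDLE_NAMES:chararray, TITLE:chararray, STREET_ADDRESS:chararray,
    CITY:chararray, COUNTRY:chararray, POST_CODE:chararray,
    HOME_PHONE:chararray, WORK_PHONE:chararray, MOBILE_PHONE:chararray,
    NI_NUMBER:chararray, CREDITLIMIT:chararray, CREDIT_CARD:chararray,
    CREDITCARD_START_DATE:chararray, CREDITCARD_END_DATE:chararray,
    CREDITCARD_CVC:chararray, DOB:chararray);

-- Keep only the persons whose id is less than or equal to 10.
top_ten = FILTER person BY PERSON_ID <= 10;

-- Project just the name and the mobile phone number.
mobile_numbers = FOREACH top_ten GENERATE NAME, MOBILE_PHONE;

-- Print the result to the console.
DUMP mobile_numbers;
```

Saved under that name, the script should be runnable in local mode with pig -x local mobile_numbers.pig.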
