In the last 3 blog posts we looked at:
- Hadoop and HDFS setup
- Hive installation and example
- Use Jasper Hive plugin and generate Jasper reports
Pig is another such tool: it exposes a structured language that runs over Hadoop and HDFS.
In this post we will install Pig and run the same sample example, in which we
extract the mobile phone number and name of the persons
whose id is less than or equal to 10. We will write Pig scripts
for this.
We start by installing Pig.
- Download and install the Pig Debian package.
- dpkg -i pig_0.10.0-1_i386.deb
- Start the HDFS and MapReduce services
- start-all.sh
- If Pig is going to run in local mode, the step above is not needed
- Connect to the Pig shell (we will connect locally here)
- pig -x local
- Once we are in the Pig shell (the prompt is named grunt :) .. funny ..), we copy the file from the local file system to HDFS using Pig. In local mode this is simply a copy on the local file system.
- copyFromLocal export.csv person
- We will now load the data into a Pig relation (similar to a table in Hive)
- person = LOAD 'export.csv' USING PigStorage(',') AS (PERSON_ID:int,NAME:chararray,FIRST_NAME:chararray,LAST_NAME:chararray,MIDDLE_NAMES:chararray,TITLE:chararray,STREET_ADDRESS:chararray,CITY:chararray,COUNTRY:chararray,POST_CODE:chararray,HOME_PHONE:chararray,WORK_PHONE:chararray,MOBILE_PHONE:chararray,NI_NUMBER:chararray,CREDITLIMIT:chararray,CREDIT_CARD:chararray,CREDITCARD_START_DATE:chararray,CREDITCARD_END_DATE:chararray,CREDITCARD_CVC:chararray,DOB:chararray);
- We can see the contents of person using the dump command
- dump person;
- Run a statement to keep only the persons whose person id is less than or equal to 10 (field names are case sensitive, so the name must match the schema exactly)
- top_ten = FILTER person BY PERSON_ID <= 10;
- Dump top_ten to see the output
- Run a statement to extract the name and the mobile number from that list
- mobile_numbers = FOREACH top_ten GENERATE NAME, MOBILE_PHONE;
- Dump mobile_numbers to see the output
This is the output we desire.
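The interactive steps above can also be collected into a single script. The following is only a minimal sketch, assuming export.csv sits in the directory from which Pig is started in local mode. The file name top_ten.pig is hypothetical and does not come from the steps above; the statements themselves are the ones entered at the grunt prompt.

```pig
-- top_ten.pig (hypothetical file name, used here only for illustration)

-- Load the CSV file with the same schema as in the interactive session.
person = LOAD 'export.csv' USING PigStorage(',') AS (
    PERSON_ID:int, NAME:chararray, FIRST_NAME:chararray, LAST_NAME:chararray,
    MIDDLE_NAMES:chararray, TITLE:chararray, STREET_ADDRESS:chararray,
    CITY:chararray, COUNTRY:chararray, POST_CODE:chararray,
    HOME_PHONE:chararray, WORK_PHONE:chararray, MOBILE_PHONE:chararray,
    NI_NUMBER:chararray, CREDITLIMIT:chararray, CREDIT_CARD:chararray,
    CREDITCARD_START_DATE:chararray, CREDITCARD_END_DATE:chararray,
    CREDITCARD_CVC:chararray, DOB:chararray
);

-- Keep only the persons whose id is less than or equal to 10.
-- Field names are case sensitive, so PERSON_ID must match the schema.
top_ten = FILTER person BY PERSON_ID <= 10;

-- Project just the name and the mobile phone number.
mobile_numbers = FOREACH top_ten GENERATE NAME, MOBILE_PHONE;

-- Print the result to the console.
DUMP mobile_numbers;
```

Assuming the script is saved as top_ten.pig, it could be run in local mode with pig -x local top_ten.pig instead of typing each statement at the grunt prompt.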