Thursday, June 7, 2012

Apache Hive example


Once you have HDFS and Hadoop configured, HIVE is a data warehousing solution which runs above HDFS and Hadoop. I have considered the same input file and fired the HIVE queries , which inturn fires hadoop MapReduce jobs.

Following steps were done to install HIVE.
  • Assume you have hadoop installation up and running (described in earlier post)
  • Download the HIVE binaries from apache site.
  • UnZip the hive-0.9.0-bin.tar.gz into a directory
  • cd to the unzipped directory and fire following command
    • export HIVE_HOME=$PWD
    • export PATH=$PWD/bin:$PATH

Once all the above steps are done , we are ready to enter the HIVE shell. This shell will help us enter hive commands.
  • enter command :
    • hive

Once you are in hive shell, you are ready to fire hivesql commands
Since in earlier post we had a csv file we will create a table for the same. This will create a hive table in which we will load the data. This data will be distributed over HDFS across all the nodes.
  • CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';

Then we will load the data from the csv file.
  • load data inpath '<PATH_TO_FILE>/export.csv' overwrite into table person

Now we are ready to fire some HiveQL queries , which will call corresponding map-reduce jobs

  • select * from person where person_id=1;
  • select count(1) from person;
  • select * from person where name like '%Bob%'

Hive is making Map-Reduce programming job simpler by giving warehousing and SQL capabilities.

Peace.
Sanket Raut

1 comment:

alfred said...
This comment has been removed by a blog administrator.