Once you have HDFS and Hadoop configured, Hive is a data warehousing solution that runs on top of them. I have taken the same input file as before and fired Hive queries against it, which in turn run Hadoop MapReduce jobs.
The following steps were done to install Hive:
- Assume you have a Hadoop installation up and running (described in an earlier post)
- Download the Hive binaries from the Apache site.
- Unzip hive-0.9.0-bin.tar.gz into a directory
- cd to the unzipped directory and run the following commands
- export HIVE_HOME=$PWD
- export PATH=$PWD/bin:$PATH
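The install steps above can be sketched as a short shell session; the tarball name matches the version used in this post, and the working directory is an example:

```shell
# Unpack the Hive release and put its bin directory on the PATH
tar -xzf hive-0.9.0-bin.tar.gz
cd hive-0.9.0-bin

export HIVE_HOME=$PWD            # Hive's launcher scripts look up this variable
export PATH=$HIVE_HOME/bin:$PATH # makes the hive command available in this session
```

These exports only last for the current shell session; add them to your shell profile to make them permanent.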
Once all the above steps are done, we are ready to enter the Hive shell, which lets us run Hive commands.
- enter command : hive
Once you are in the Hive shell, you are ready to fire HiveQL commands.
Since the earlier post left us with a CSV file, we will create a Hive table for it and load the data into it. The data will be stored in HDFS, distributed across all the nodes.
- CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';
Then we will load the data from the CSV file.
- load data inpath '<PATH_TO_FILE>/export.csv' overwrite into table person;
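Note that load data inpath (without the local keyword) reads from HDFS, so the CSV has to be copied there first. A minimal sketch, where /user/hduser/input is an example path:

```shell
# Copy the local CSV into HDFS so LOAD DATA INPATH can find it
hadoop fs -mkdir /user/hduser/input
hadoop fs -put export.csv /user/hduser/input/

# Verify the file is in place
hadoop fs -ls /user/hduser/input
```

Also be aware that LOAD DATA INPATH moves the file into Hive's warehouse directory rather than copying it, so the source file disappears from its original HDFS location.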
Now we are ready to fire some HiveQL queries, which will run the corresponding MapReduce jobs:
- select * from person where person_id=1;
- select count(1) from person;
- select * from person where name like '%Bob%';
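The same queries can also be run non-interactively from the OS shell, which is handy for scripting; a sketch assuming the person table above exists (queries.hql is a hypothetical script file):

```shell
# Run a single HiveQL statement without entering the interactive shell
hive -e "select count(1) from person;"

# Or keep several statements in a file and run them in one go
hive -f queries.hql
```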
Hive makes MapReduce programming simpler by layering warehousing and SQL capabilities on top of Hadoop.