Introduction to Apache Hive (Big Data, Final Seminar)

Presentation on Big Data

Presented by: TAKRIM UL ISLAM LASKAR (120103006)

Overview:

• Origin

• What is HIVE?

• How does Hive work?

• Hive vs Simple Web App

• Using Hive as Enterprise Data Warehouse

• Hive Architecture

• Hive Database

• Hive Data Model

• Metastore

• Hive Physical Layout

• Hive Configuration

• Hive Commands

• Hive Functions

• Database specific Hive commands

• Creation of a table on Hive

• Load data into a Hive Table

• Store Hive table to HDFS file.

Origin

• Hive started at Facebook as a way to manage very large volumes of data.

• The data was stored in an Oracle database every night.

• ETL (Extract, Transform, Load) was performed on the data.

• The data growth was exponential
  • By 2006: 1 TB/day
  • By 2010: 10 TB/day
  • By 2006: about 500,000,000 logs per day

• A more effective way to manage the data was needed.

What is HIVE?

• Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs and runs them on the cluster.

• What is a Data Warehouse (DW)?
  • A data warehouse is a database designed specifically for analysis and reporting purposes.

How does Hive work?

• Hive is built on top of Hadoop
  • Think HDFS and MapReduce

• Hive stores its data in HDFS

• Hive compiles SQL queries into MapReduce jobs and runs the jobs on the Hadoop cluster.
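
To make the compilation idea concrete, here is a minimal Python sketch of how a query such as SELECT dept, COUNT(*) FROM student GROUP BY dept could be expressed as map, shuffle, and reduce phases. It only illustrates the idea and is not Hive's actual query planner; the dept column, the sample rows, and all function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows of a table with columns (student_id, dept).
ROWS = [(1, "cs"), (2, "ee"), (3, "cs"), (4, "me"), (5, "cs")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _student_id, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate the values of each key; here the aggregate is COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases for the GROUP BY / COUNT(*) query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_group_by_count(ROWS))
```

The point of the sketch is that the user only writes the query; the three phases are generated and run for them.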

Hive vs Simple Web App

Fig 1. Difference between Hive and other database systems (diagram labels: Hive Query, SQL Query, Any Database)

Using Hive as Enterprise Data Warehouse

• First, scribe and load data from the database into HDFS

• Write MapReduce jobs to process data

• So, what is missing?
  • Command-line interface for end users
  • Ad hoc query support without writing full MapReduce jobs
  • Schema information

Hive Architecture

Fig 2. Hive Architecture [5]

Hive Architecture

• External interfaces
  • CLI
  • Web UI
  • API (JDBC and ODBC)

• Thrift Server
  • Client API to execute HiveQL statements

• Metastore
  • System catalog
  • All components of Hive interact with the Metastore

Hive Database

• Data Model

• Query Language

Hive Data Model

• Hive structures data into well-defined database concepts, i.e. tables, columns and rows, partitions, buckets, etc.

Fig 3. Hive Data Model (diagram labels: DB; Tables (HDFS directory); Partitions (subdirectory); Buckets (files))

Hive Data Model

• Tables
  • Typed columns (int, float, string, date, boolean)
  • Supports array/map/struct for JSON-like data

• Partitions
  • e.g., range-partition tables by date

• Buckets
  • Hash partitions within ranges
  • Useful for sampling and join optimization
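
The bucketing idea can be sketched in a few lines of Python: a row goes to the bucket given by the hash of its bucketing column modulo the number of buckets. This is a simplified illustration, assuming non-negative integer keys whose hash is the integer itself; the function names and sample keys are hypothetical.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Return the bucket index for a key: hash(key) mod num_buckets.

    Assumption: keys are non-negative integers and hash to themselves.
    """
    return key % num_buckets


def assign_buckets(keys, num_buckets):
    """Distribute keys over num_buckets lists, one list per bucket."""
    buckets = {index: [] for index in range(num_buckets)}
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets


if __name__ == "__main__":
    # Seven hypothetical student IDs spread over four buckets.
    print(assign_buckets([1, 2, 3, 4, 5, 6, 7], 4))
```

Because every key lands in exactly one bucket, reading a single bucket gives a rough sample of the data, and two tables bucketed the same way on the join key only need matching buckets compared.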

Metastore

• Database
  • Namespace containing a set of tables

• Table
  • Contains a list of columns and their types

• Partition
  • Each partition can have its own columns and storage info
  • Mapping to HDFS directories

• Statistics
  • Info about the database
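
As a rough mental model, the catalog entries listed above can be pictured as nested records. The Python sketch below is only an illustration of that structure, not the Metastore's real schema; all class names, field names, and the sample path are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    spec: dict      # partition column -> value, e.g. {"ds": "2010-03-01"}
    location: str   # HDFS directory that holds this partition's files


@dataclass
class Table:
    name: str
    columns: dict   # column name -> type name
    partitions: list = field(default_factory=list)


@dataclass
class Database:
    name: str       # namespace containing a set of tables
    tables: dict = field(default_factory=dict)


def describe(db: Database, table_name: str):
    """Return 'column type' lines for a table, similar in spirit to DESCRIBE."""
    table = db.tables[table_name]
    return [f"{column} {type_name}" for column, type_name in table.columns.items()]


if __name__ == "__main__":
    student = Table("student", {"studentID": "int", "studentNAME": "string"})
    student.partitions.append(
        Partition({"year": "2015"}, "/user/hive/warehouse/student/year=2015")
    )
    db = Database("default", {"student": student})
    print(describe(db, "student"))
```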

Hive Physical Layout

• Warehouse directory in HDFS

• Table row data is stored in a subdirectory of the warehouse directory

• Each partition creates a subdirectory within the table directory

• Actual data is stored in flat files
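
The layout described above can be sketched as simple path construction. The Python snippet below assumes the default warehouse location /user/hive/warehouse and a table in the default database; the table name, partition columns, and helper names are hypothetical.

```python
import posixpath

# Default warehouse location; it is configurable, so treat this as an assumption.
WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table, warehouse=WAREHOUSE_DIR):
    """Directory that holds a table's data (table in the default database)."""
    return posixpath.join(warehouse, table)


def partition_dir(table, partition_spec, warehouse=WAREHOUSE_DIR):
    """Directory of one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one 'column=value' subdirectory level under the table directory.
    """
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir(table, warehouse), *levels)


if __name__ == "__main__":
    print(table_dir("page_views"))
    print(partition_dir("page_views", [("ds", "2010-03-01"), ("country", "US")]))
```

A query that filters on a partition column only needs to read the matching subdirectories, which is why partitioning by date is so common.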

Hive Configuration

• Hive allows users to override Hadoop configuration properties
  • For example, to set the number of reduce tasks, use
  • mapred.reduce.tasks=1 (v1)

Hive Commands

• Setting Hadoop or Hive configuration properties
  • hive> set CONFIGURATION_NAME=CONFIGURATION_VALUE;

• List all properties and their values
  • hive> set -v;
  • This prints a large amount of information about Hive and your cluster

• Adding a resource to the distributed cache (DCache)
  • hive> add [JAR|FILE|ARCHIVE] file_name;
  • hive> add FILE newcode.py;

Hive Functions

• hive> show functions;

• hive> describe function <function_name>;

Database specific Hive commands

• hive> show tables;

• hive> describe <table_name>;

• hive> describe extended <table_name>;

• hive> create table <…>;

• hive> alter table <…>;

• hive> drop table <…>;

Creation of a table on Hive

• hive> CREATE TABLE student (studentID INT, studentNAME STRING);

Load data into a Hive Table

• Hive does not do any transformation while loading data into tables.

• Load operations are currently pure copy/move operations that move data files into the locations corresponding to their tables.

• Syntax:
  • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]
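
The "pure copy/move" behaviour can be sketched in Python by using local directories as a stand-in for HDFS. This is an illustration under that assumption, not Hive's implementation: the load_data helper and its parameters are hypothetical, with local=True modelling a copy from the local filesystem and local=False modelling a move of a file that is already in the cluster.

```python
import os
import shutil
import tempfile


def load_data(filepath, table_dir, local=False, overwrite=False):
    """Place a data file into a table directory without transforming it."""
    os.makedirs(table_dir, exist_ok=True)
    if overwrite:
        # OVERWRITE: drop the files already in the table directory.
        # Assumption: the directory contains only flat files.
        for name in os.listdir(table_dir):
            os.remove(os.path.join(table_dir, name))
    target = os.path.join(table_dir, os.path.basename(filepath))
    if local:
        shutil.copy(filepath, target)   # the source file stays in place
    else:
        shutil.move(filepath, target)   # the source file is relocated
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "students.txt")
        with open(source, "w") as handle:
            handle.write("1\talice\n2\tbob\n")
        loaded = load_data(source, os.path.join(tmp, "warehouse", "student"))
        with open(loaded) as handle:
            print(handle.read(), end="")  # contents are unchanged by the load
```

Since nothing is parsed or rewritten at load time, a file that does not match the table's schema is only noticed when the table is queried.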

Store Hive table to HDFS file:

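• INSERT OVERWRITE LOCAL DIRECTORY '/path/to/local/dir/file' SELECT * FROM TABLE_NAME;

• With the LOCAL keyword, the result is written to a directory on the local filesystem; omit LOCAL to write to an HDFS directory.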

Conclusion:

Big Data is the backbone of development in today's cyber world. Every enterprise and large company depends heavily on this technology for storing and analyzing its data. Hive is an ETL engine that converts Structured Query Language statements into MapReduce jobs and runs them on the Hadoop cluster.

References

[1] Hive Presentation Collection (2015, February). [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/Presentations#Presentations-February2013HiveUserGroupMeetup

[2] Apache Hive Tutorial. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/tutorial.html

[3] Hugo Pérez, Sergio Mendoza, Carlos Fenoy (2012, March). "Master in Computer Architecture, Networks and Systems - CANS". Universitat Politècnica de Catalunya. [Online]. Available: http://www.jorditorres.org/wp-content/uploads/2012/03/1.Apache_Hive.pdf

[4] Facebook Hive Team (2010, March). Hive New Features and API. [Online]. Available: http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team

[5] Cloudera Inc. (2010). Hive Quick Start. [Online]. Available: http://fr.slideshare.net/cwsteinbach/hive-quick-start-tutorial

[6] Owen O'Malley, Hortonworks Inc. (2012, June). High Volume Updates in Hive. [Online]. Available: http://www.slideshare.net/oom65/high-volume-updates-in-apache-hive

Thank you for your patience.