Introduction to Apache Hive(Big Data, Final Seminar)

Introduction to Apache Hive

Presentation on Big Data

Presented by: TAKRIM UL ISLAM LASKAR (120103006)

Overview

• Origin

• What is HIVE?

• How does Hive work?

• Hive vs Simple Web App

• Using Hive as Enterprise Data Warehouse

• Hive Architecture

• Hive Database

• Hive Data Model

• Metastore

• Hive Physical Layout

• Hive Configuration

• Hive Commands

• Hive Functions

• Database specific Hive commands

• Creation of a table on Hive

• Load data into a Hive Table

• Store Hive table to HDFS file.

Origin

• Hive started at Facebook as a way to manage large volumes of data.

• The data was loaded into an Oracle database every night.

• ETL (Extract, Transform, Load) was performed on the data.

• The data growth was exponential:
  • By 2006: 1 TB/day
  • By 2010: 10 TB/day
  • By 2006: about 500,000,000 logs per day

• Facebook needed a way to manage this data effectively.

What is HIVE?

• Hive is a data warehouse infrastructure built on top of Hadoop. It compiles SQL queries into MapReduce jobs and runs those jobs on the cluster.

• What is a data warehouse (DW)?
  • A data warehouse is a database designed specifically for analysis and reporting.

How does Hive work?

• Hive is built on top of Hadoop
  • Think HDFS and MapReduce

• Hive stores its data in HDFS

• Hive compiles SQL queries into MapReduce jobs and runs those jobs on the Hadoop cluster.
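To make the SQL-to-MapReduce idea concrete, here is a minimal sketch in plain Python of how a query such as SELECT country, COUNT(*) FROM page_views GROUP BY country could be split into a map phase, a shuffle, and a reduce phase. The table page_views, its columns, and the function names are assumptions made for illustration. This is not Hive's actual query planner, only the general pattern it targets.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, country); not from the slides.
rows = [
    (1, "IN"),
    (2, "US"),
    (3, "IN"),
    (4, "BD"),
    (5, "IN"),
]

def map_phase(row):
    """Emit (key, 1) for each row, keyed by the GROUP BY column."""
    _user_id, country = row
    yield country, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values of one key; summing the 1s implements COUNT(*)."""
    return key, sum(values)

def run_query(table_rows):
    """Run map, shuffle, and reduce in sequence on an in-memory table."""
    pairs = [pair for row in table_rows for pair in map_phase(row)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

result = run_query(rows)
print(result)
```

On a real cluster the map and reduce functions run in parallel on many machines, and the shuffle moves data across the network. The sketch only shows the division of work.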

Hive vs Simple Web App

Fig 1. Difference between Hive and other DB systems (figure labels: Hive Query, SQL Query, Any Database)

Using Hive as Enterprise Data Warehouse

• First, collect data with Scribe and load data from the database into HDFS

• Write MapReduce jobs to process the data

• So, what is missing?
  • A command-line interface for end users
  • Ad-hoc query support
    • Without writing full MapReduce jobs
  • Schema information

Hive Architecture

Fig 2. Hive Architecture [5]

Hive Architecture

• External interfaces
  • CLI
  • Web UI
  • API
    • JDBC and ODBC

• Thrift Server
  • Client API to execute HiveQL statements

• Metastore
  • System catalog
  • All components of Hive interact with the Metastore

Hive Database

• Data Model

• Query Language

Hive Data Model

• Hive structures data into well-defined database concepts, i.e. tables, columns and rows, partitions, buckets, etc.

Fig 3. Hive Data Model (figure labels: DB, Tables, HDFS Directory, Partitions (sub-directory), Buckets (files))

Hive Data Model

• Tables
  • Typed columns (int, float, string, date, boolean)
  • Supports array/map/struct for JSON-like data

• Partitions
  • e.g., range-partition tables by date

• Buckets
  • Hash partitions within ranges
  • Useful for sampling and join optimization
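The partition and bucket ideas above can be sketched in a few lines of Python. Each row is assigned to a partition by its date column and, within that partition, to a bucket by hashing another column. The column names, the bucket count, and the use of zlib.crc32 are assumptions chosen for illustration. Hive uses its own hash function, so real bucket numbers will differ.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch

def partition_for(row):
    """Partition key: all rows with the same date land in the same partition."""
    return f"dt={row['dt']}"

def bucket_for(row, num_buckets=NUM_BUCKETS):
    """Bucket index: a stable hash of the bucketing column modulo the bucket count.

    zlib.crc32 is a stand-in here, not Hive's actual hash function.
    """
    digest = zlib.crc32(str(row["user_id"]).encode("utf-8"))
    return digest % num_buckets

# Hypothetical rows; the column names are not taken from the slides.
rows = [
    {"user_id": 101, "dt": "2015-02-01"},
    {"user_id": 102, "dt": "2015-02-01"},
    {"user_id": 103, "dt": "2015-02-02"},
    {"user_id": 101, "dt": "2015-02-02"},
]

# layout maps partition -> bucket index -> list of user ids stored there.
layout = {}
for row in rows:
    buckets = layout.setdefault(partition_for(row), {})
    buckets.setdefault(bucket_for(row), []).append(row["user_id"])

for partition, buckets in sorted(layout.items()):
    print(partition, buckets)
```

Because a given user_id always hashes to the same bucket index, reading a single bucket yields a consistent sample, and two tables bucketed the same way on the join column can be joined bucket by bucket.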

Metastore

• Database
  • Namespace containing a set of tables

• Table
  • Contains the list of columns and their types

• Partition
  • Each partition can have its own columns and storage information
  • Mapping to HDFS directories

• Statistics
  • Information about the database

Hive Physical Layout

• The warehouse directory lives in HDFS

• Table row data is stored in subdirectories of the warehouse directory

• Partitions create subdirectories within table directories

• The actual data is stored in flat files
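As a rough illustration of this layout, the Python sketch below builds the HDFS paths for a table and for one of its partitions. It assumes Hive's default warehouse location, /user/hive/warehouse, which a cluster may override. The helper names table_dir and partition_dir and the table names are hypothetical.

```python
import posixpath

# Hive's default warehouse location; a cluster may be configured differently.
WAREHOUSE_DIR = "/user/hive/warehouse"

def table_dir(table, database="default"):
    """Directory holding a managed table's data files."""
    if database == "default":
        return posixpath.join(WAREHOUSE_DIR, table)
    # Tables in other databases live under a "<database>.db" directory.
    return posixpath.join(WAREHOUSE_DIR, f"{database}.db", table)

def partition_dir(table, partition_spec, database="default"):
    """Subdirectory for one partition: one "column=value" level per partition column."""
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir(table, database), *levels)

print(table_dir("student"))
print(partition_dir("logs", [("dt", "2015-02-01"), ("country", "IN")]))
```

The flat data files themselves sit inside the deepest directory, so dropping a partition amounts to removing one subdirectory.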

Hive Configuration

• Hive allows users to override Hadoop configuration properties
  • For example, to set the number of reduce tasks for a MapReduce job, use
  • mapred.reduce.tasks=1 (MapReduce v1)
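The override behaviour can be pictured as two layers of settings, where a value set in the Hive session shadows the cluster-wide Hadoop value without changing it. The Python sketch below models this with collections.ChainMap. The default values and the set_property helper are hypothetical and do not reflect any particular cluster or Hive's internal implementation.

```python
from collections import ChainMap

# Hypothetical cluster-wide Hadoop settings; the values are illustrative only.
hadoop_defaults = {
    "mapred.reduce.tasks": "2",
    "mapred.job.name": "unnamed",
}

# Settings changed during the current session start out empty.
session_overrides = {}

# Lookups check the session layer first, then fall back to the Hadoop layer.
conf = ChainMap(session_overrides, hadoop_defaults)

def set_property(name, value):
    """Model a session-level override; the Hadoop layer is left untouched."""
    session_overrides[name] = value

before = conf["mapred.reduce.tasks"]
set_property("mapred.reduce.tasks", "1")
after = conf["mapred.reduce.tasks"]

print(before, after)
```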

Hive Commands

• Setting Hadoop or Hive configuration properties
  • hive> set CONFIGURATION_NAME=CONFIGURATION_VALUE;

• Listing all properties and their values
  • hive> set -v;
  • This prints a large amount of configuration detail for Hive and Hadoop on your cluster

• Adding a resource to the distributed cache
  • hive> add [JAR|FILE|ARCHIVE] file_name;
  • hive> add FILE newcode.py;

Hive Functions

• hive> show functions;

• hive> describe function <function_name>;

Database specific Hive commands

• hive> show tables;

• hive> describe <table_name>;

• hive> describe extended <table_name>;

• hive> create table <…>;

• hive> alter table <…>;

• hive> drop table <…>;

Creation of a table on Hive

• hive> CREATE TABLE student (studentID INT, studentNAME STRING);

Load data into a Hive Table

• Hive does not do any transformation while loading data into tables.

• Load operations are currently pure copy/move operations that move data files into the locations corresponding to their tables.

• Syntax:
  • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]
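The point that loading is a pure copy or move can be illustrated with a small simulation on the local file system. The Python sketch below mimics the LOCAL and OVERWRITE options: with LOCAL the source file is copied, without it the file is moved, and the bytes are never changed. The function load_data, the file names, and the directory names are hypothetical. Real Hive performs these operations against HDFS.

```python
import shutil
import tempfile
from pathlib import Path

def load_data(src, table_dir, local=False, overwrite=False):
    """Simulate LOAD DATA: place the file in the table directory unchanged."""
    src, table_dir = Path(src), Path(table_dir)
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: remove the data files already present in the target.
        for old_file in table_dir.iterdir():
            if old_file.is_file():
                old_file.unlink()
    dest = table_dir / src.name
    if local:
        shutil.copy(src, dest)            # LOCAL: the source file is copied
    else:
        shutil.move(str(src), str(dest))  # otherwise the source file is moved
    return dest

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    staged = root / "staging" / "students.csv"
    staged.parent.mkdir()
    staged.write_text("1,Alice\n2,Bob\n")

    dest = load_data(staged, root / "warehouse" / "student")
    moved_away = not staged.exists()  # the source is gone after a move
    loaded_text = dest.read_text()    # the contents are byte-for-byte the same

print(moved_away, repr(loaded_text))
```

Since nothing is parsed or validated at load time, a file whose format does not match the table definition is only noticed later, when a query reads it.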

Store Hive table to HDFS file:

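• INSERT OVERWRITE DIRECTORY '/path/to/hdfs/dir' SELECT * FROM table_name;

• With the LOCAL keyword, the result is written to the local file system instead of HDFS: INSERT OVERWRITE LOCAL DIRECTORY '/path/to/local/dir/file' SELECT * FROM table_name;

• In both cases Hive treats the target path as a directory and writes the result files inside it.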

Conclusion:

Big Data is a backbone of development in today's cyber world. Many enterprises and large companies depend heavily on this technology for storing and analyzing their data. Hive is an ETL tool that converts Structured Query Language queries into MapReduce jobs and runs them on the Hadoop cluster.

References

[1] Hive Presentation Collection (2015, February). [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/Presentations#Presentations-February2013HiveUserGroupMeetup

[2] Apache Hive Tutorial. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/tutorial.html

[3] Hugo Pérez, Sergio Mendoza, Carlos Fenoy. "Master in Computer Architecture, Networks and Systems - CANS" (2012, March). Universitat Politècnica de Catalunya. [Online]. Available: http://www.jorditorres.org/wp-content/uploads/2012/03/1.Apache_Hive.pdf

[4] Facebook Hive Team (2010, March). Hive New Features and API. [Online]. Available: http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team

[5] Cloudera Inc. (2010). Hive Quick Start. [Online]. Available: http://fr.slideshare.net/cwsteinbach/hive-quick-start-tutorial

[6] Owen O'Malley, Hortonworks Inc. (2012, June). High Volume Updates in Hive. [Online]. Available: http://www.slideshare.net/oom65/high-volume-updates-in-apache-hive

Thank You - For Your Patience.