CIS 612, Sunnie Chung: cis.csuohio.edu/~sschung/cis612/LectureNotes_Hive_ACID.pdf


APACHE HIVE

CIS 612

SUNNIE CHUNG


APACHE HIVE IS

• Data warehouse infrastructure built on top of Hadoop, enabling data summarization and ad-hoc queries.
• Initially developed by Facebook.
• Hive Query Language statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
• Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions.
• It supports primitive types, as well as associative arrays, lists, and structs.
• HQL supports DDL and DML.
• Users can embed custom map-reduce scripts.

Sunnie Chung CIS 612 Lecture Notes


HOW HIVE WORKS

• Hive is built on top of Hadoop.
• Hive stores data in the Hadoop Distributed File System (HDFS).
• Hive compiles HQL statements into MapReduce jobs that are executed on the Hadoop cluster.
• HQL has limited equality and join predicates, and has no inserts into existing tables. (It can overwrite tables.)


HIVE

• Supports an SQL-like query language: HiveQL
• Data in Hive is organized into tables
• Provides structure for unstructured Big Data
• Works with data inside HDFS
• Tables
  • Data: a file or group of files in HDFS
  • Schema: in the form of metadata stored in a relational database
  • Have a corresponding HDFS directory
  • Data in a table is serialized
• Supports primitive column types and nestable collection types (Array and Map)
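As a minimal sketch of the nestable collection types mentioned above, the HiveQL below declares a table with ARRAY, MAP and STRUCT columns and reads individual elements from each. The table name employees_demo, its columns, the map key 'Federal Taxes' and the delimiter choices are illustrative assumptions, not part of the lecture.

```sql
-- Hypothetical table, used only to illustrate collection types.
CREATE TABLE IF NOT EXISTS employees_demo (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE;

-- Array elements are read by index, map values by key, struct fields by name.
SELECT name,
       subordinates[0],
       deductions['Federal Taxes'],
       address.city
FROM employees_demo;
```

The delimiters tell the text SerDe how to split a serialized row into fields, collection items and map keys.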


HIVE QUERY LANGUAGE

• SQL-like language
  • DDL: to create tables with specific serialization formats
  • DML: to load data from external sources and insert query results into Hive tables
  • Does not support updating and deleting rows in existing tables
• Supports multi-table insert
• Supports custom map-reduce scripts written in any language
• Can be extended with custom functions (UDFs)
  • User-Defined Table-Generating Function (UDTF)
  • User-Defined Aggregate Function (UDAF)
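The multi-table insert mentioned above can be sketched as follows. It assumes a source table such as the stocks table created later in these notes (with a ymd column holding dates as 'YYYY-MM-DD' strings). The two target tables stocks_2009 and stocks_2010 are hypothetical names introduced only for this example.

```sql
-- Hypothetical target tables with the same schema as the source table.
CREATE TABLE IF NOT EXISTS stocks_2009 LIKE stocks;
CREATE TABLE IF NOT EXISTS stocks_2010 LIKE stocks;

-- Multi-table insert: a single scan of stocks feeds both targets.
FROM stocks s
INSERT OVERWRITE TABLE stocks_2009
  SELECT s.* WHERE s.ymd LIKE '2009-%'
INSERT OVERWRITE TABLE stocks_2010
  SELECT s.* WHERE s.ymd LIKE '2010-%';
```

Placing the FROM clause first lets Hive read the source once and route rows to each INSERT branch, rather than scanning the table once per target.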


WHAT HIVE DOES

• Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to SQL statements, but with a more limited set of commands.
• It therefore allows developers to explore and structure massive amounts of data, analyze it, and then turn it into business insight.
• Hive queries have very high latency because Hive is based on Hadoop.
• Hive is read-based and not appropriate for write operations.


HIVE’S ADVANTAGES

• Familiar: hundreds of unique users can simultaneously query the data using a language familiar to SQL users.
• Fast: response times are typically much faster than other types of queries on the same type of huge datasets.
• Scalable and extensible: as data variety and volume grow, more commodity machines can be added to the cluster without a corresponding reduction in performance.
• Informative: familiar JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting. Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats. (SerDe: the serializer/deserializer API used to move data in and out of tables.)


HIVE ARCHITECTURE

• External interfaces:
  • Web UI: management
  • Hive CLI: run queries, browse tables, etc.
  • API: JDBC, ODBC
• Metastore: system catalog which contains metadata about Hive tables
• Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution
• Compiler: translates a HiveQL statement into a plan which consists of a DAG of map-reduce jobs
• Database: a namespace for tables
• Table: the metadata for a table contains the list of columns and their types, owner, storage and SerDe information. It also contains any user-supplied key and value data.
• Partition: each partition can have its own columns, SerDe and storage information.

[Figure: Hive architecture diagrams (two slides); images not recoverable from the transcript]

COMMAND LINE INTERFACE

• There are several ways to interact with Hive, including some popular graphical user interfaces, but the CLI is sometimes preferable. The CLI allows creating tables, inspecting schemas, querying tables, etc.
• All commands and queries go to the Driver, which compiles, optimizes and executes queries, usually with MapReduce jobs.
• Hive doesn’t generate MapReduce programs; it uses generic Mapper and Reducer modules. Hive communicates with the JobTracker to initiate the MapReduce job.
• Data files to be processed are usually in HDFS, managed by the NameNode.
• Hive uses Hive Query Language (HQL), which is similar to SQL.


INPUT DATA

• Hive has no row-level insert, update or delete operations. The only way to put data into a table is to use one of the load operations.
• Four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC and RCFILE.
• Example: ’NASDAQ_daily_prices_B.csv’, a log file of NASDAQ stock records:

exchange,stock_symbol,date,stock_price_open,stock_price_high,stock_price_low,stock_price_close,stock_volume,stock_price_adj_close
NASDAQ,BBND,2010-02-08,2.92,2.98,2.86,2.96,483800,2.96
NASDAQ,BBND,2010-02-05,2.85,2.94,2.79,2.93,884000,2.93
NASDAQ,BBND,2010-02-04,2.83,2.88,2.78,2.83,1333300,2.83
….


CREATE TABLE TO HOLD THE DATA:

hive> CREATE TABLE IF NOT EXISTS stocks (
    >   exchange        STRING,
    >   symbol          STRING,
    >   ymd             STRING,
    >   price_open      FLOAT,
    >   price_high      FLOAT,
    >   price_low       FLOAT,
    >   price_close     FLOAT,
    >   volume          INT,
    >   price_adj_close FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';


HIVE QUERY LANGUAGE: HIVEQL

• Create a database:

hive> CREATE DATABASE financials;
or
hive> CREATE DATABASE IF NOT EXISTS financials;

• Describe database:

hive> DESCRIBE DATABASE financials;
OK
financials    hdfs://localhost:54310/user/hive/warehouse/financials.db

• Use database:

hive> USE financials;

• Drop database:

hive> DROP DATABASE IF EXISTS financials;


HOW TO LOAD DATA INTO A HIVE TABLE

• Use LOAD DATA to import data into a Hive table:

hive> LOAD DATA LOCAL INPATH '/home/sunny/EmployeeDetails.txt' INTO TABLE Employee;

• Add the keyword OVERWRITE (before INTO TABLE) to replace the table's existing contents; without it, the loaded file is added to the table.
• The LOCAL keyword, as in the example above, loads data from the local file system rather than from HDFS.
• Data can also be inserted into a table from the result of a SELECT statement, for example:

hive> INSERT OVERWRITE TABLE <table_name> SELECT * FROM Employee;


MANAGING TABLES

Operation                Command syntax
See current tables       hive> SHOW TABLES;
Check the table schema   hive> DESCRIBE <table_name>;
Change the table name    hive> ALTER TABLE <table_name> RENAME TO mytab;
Add a column             hive> ALTER TABLE <table_name> ADD COLUMNS (MyID STRING);
Drop a partition         hive> ALTER TABLE <table_name> DROP PARTITION (Age > 70);
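Dropping a partition only makes sense for a partitioned table, so a short sketch of one may help. It assumes the stocks table from the earlier slide. The table stocks_by_year and its partition column yr are hypothetical names chosen for illustration.

```sql
-- Hypothetical partitioned table. The partition column is not stored in the
-- data files; it is encoded in the HDFS directory name (e.g. .../yr=2010/).
CREATE TABLE IF NOT EXISTS stocks_by_year (
  symbol      STRING,
  ymd         STRING,
  price_close FLOAT
)
PARTITIONED BY (yr INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Static-partition insert: all selected rows go into partition yr=2010.
INSERT OVERWRITE TABLE stocks_by_year PARTITION (yr = 2010)
SELECT symbol, ymd, price_close
FROM stocks
WHERE ymd LIKE '2010-%';

-- List the partitions that now exist.
SHOW PARTITIONS stocks_by_year;

-- Drop every partition older than 2005, if any exist.
ALTER TABLE stocks_by_year DROP IF EXISTS PARTITION (yr < 2005);
```

Because each partition is a separate directory, dropping one removes its data without rewriting the rest of the table.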


HIVE SUPPORTS THE FOLLOWING:

• WHERE clause
• UNION ALL and DISTINCT
• GROUP BY and HAVING
• LIMIT clause
• Subqueries, but only in the FROM clause
• JOINs, ORDER BY, SORT BY
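Several of the clauses listed above can be combined in one statement. The query below is a sketch against the stocks table defined earlier; the date cutoff, the price threshold and the alias names are arbitrary choices made for illustration.

```sql
-- Subquery in the FROM clause, with WHERE, GROUP BY and HAVING inside it,
-- and ORDER BY plus LIMIT applied to its result.
SELECT t.symbol, t.avg_close
FROM (
  SELECT symbol, AVG(price_close) AS avg_close
  FROM stocks
  WHERE ymd >= '2010-01-01'
  GROUP BY symbol
  HAVING AVG(price_close) > 10.0
) t
ORDER BY t.avg_close DESC
LIMIT 5;
```

ORDER BY gives a total order over the whole result, whereas SORT BY only orders rows within each reducer.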


OUTPUT DATA

• Output data produced by Hive is structured: result tables are stored in HDFS, while their metadata is kept in a relational database (the metastore).
• For a cluster, MySQL or a similar relational database is required for the metastore.
• The result tables can then be manipulated using HiveQL, in much the same way as SQL is used with a relational database.


LOAD FILE INTO TABLE:

hive> LOAD DATA LOCAL INPATH '/Users/nqt289/Desktop/NASDAQ_daily_prices_B.csv'
    > OVERWRITE INTO TABLE stocks;
Copying data from file:/Users/nqt289/Desktop/NASDAQ_daily_prices_B.csv
Copying file: file:/Users/nqt289/Desktop/NASDAQ_daily_prices_B.csv
Loading data to table mydb.stocks
Deleted hdfs://localhost:54310/Users/nqt289/Desktop/NASDAQ_daily_prices_B.csv
OK
Time taken: 0.231 seconds


EXAMPLE OF OUTPUT OF HIVE

hive> SELECT * FROM STOCKS WHERE price_open='2.92';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201403311509_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201403311509_0003
Kill Command = /Users/nqt289/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201403311509_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-03-31 15:39:20,577 Stage-1 map = 0%, reduce = 0%
2014-03-31 15:39:23,597 Stage-1 map = 100%, reduce = 0%
2014-03-31 15:39:26,625 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201403311509_0003
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 21998523 HDFS Write: 5166 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
NASDAQ  BBND  2010-02-08  2.92  2.98  2.86  2.96  483800  2.96
NASDAQ  BTFG  2009-12-21  2.92  2.92  2.75  2.79  15100   2.79
NASDAQ  BJCT  2004-04-21  2.92  2.98  2.9   2.98  3200    2.98
NASDAQ  BJCT  2004-04-20  2.92  3.0   2.92  2.95  27900   2.95
Time taken: 12.785 seconds
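The MapReduce job reported in the session above comes from a plan built by the compiler. One way to inspect such a plan without running the query is the EXPLAIN statement, sketched here against the same stocks table. The aggregate query is an arbitrary example. The plan output is omitted because its layout differs between Hive versions.

```sql
-- Show the stage plan Hive would execute, without running the query.
-- A GROUP BY typically adds a reduce phase, unlike the map-only SELECT above.
EXPLAIN
SELECT symbol, AVG(price_close) AS avg_close
FROM stocks
GROUP BY symbol;
```

The output usually lists the stage dependencies first and then the operator tree for each stage.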


ACID

• Atomicity
  • Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
• Consistency
  • The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined rules.
• Isolation
  • The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method, the effects of an incomplete transaction might not even be visible to another transaction.
• Durability
  • Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in non-volatile memory.


ACID

• ACID support was added to Hive with the following use cases in mind:
  • A set of inserts and updates is processed once an hour.
  • A set of deletes is processed once a day.
  • A log of transactions is exported from an RDBMS to reflect new data once an hour.
• The delay is not an important issue here, given the purpose of Hive; also, the number of rows committed each time is huge (100 to 500 thousand rows).
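The batch insert, update and delete use cases above can be sketched with a transactional table. This assumes Hive 0.14 or later and a metastore already configured for transactions. The table stocks_acid, the bucket count and the literal values are hypothetical. The SET lines are only a commonly cited subset of the required configuration, and further settings may be needed depending on the Hive version.

```sql
-- Session settings commonly required for Hive transactions (assumed subset).
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical ACID table: bucketed, stored as ORC, flagged transactional.
CREATE TABLE IF NOT EXISTS stocks_acid (
  symbol      STRING,
  ymd         STRING,
  price_close FLOAT
)
CLUSTERED BY (symbol) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML; each statement runs as its own transaction.
INSERT INTO TABLE stocks_acid VALUES ('BBND', '2010-02-08', 2.96);

UPDATE stocks_acid
SET price_close = 2.97
WHERE symbol = 'BBND' AND ymd = '2010-02-08';

DELETE FROM stocks_acid
WHERE ymd < '2005-01-01';
```

Updates and deletes are written as delta files that are merged later by compaction, which suits the hourly or daily batches described above better than frequent single-row changes.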


HIVE ACHIEVEMENTS & FUTURE PLANS

• First step to provide a warehousing layer for Hadoop (web-based Map-Reduce data processing system)
• Accepts only a subset of SQL: working to subsume SQL syntax
• Working on a rule-based optimizer: plans to build a cost-based optimizer
• Enhancing JDBC and ODBC drivers to enable interaction with commercial BI tools
• Working on making it perform better


PROJECTS & TOOLS ON HADOOP

• HBase
• Hive
• Pig
• Jaql
• ZooKeeper
• AVRO
• UIMA
• Sqoop


HIVE TUTORIAL

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-HiveTutorial

