Big data analytics -hive

WDABT 2016 – BHARATHIAR UNIVERSITY
Dr. V. Bhuvaneswari, Asst. Professor, Dept. of Comp. Appl., Bharathiar University

Transcript of Big data analytics -hive


Page 3: Big data analytics -hive

component of


Page 4: Big data analytics -hive

Why HIVE

• Structured data
• Large data sets
• MapReduce
• Parallel distribution
• Query data

Page 5: Big data analytics -hive

Features of Hive

Page 6: Big data analytics -hive

HIVE Architecture

• User Interface: Web UI, Hive Command Line, HD Insight
• Meta Store
• Hive QL Process Engine
• Execution Engine
• HDFS or HBase Storage System

Page 7: Big data analytics -hive

Embedded Metastore

Local Metastore

Remote Metastore

Page 8: Big data analytics -hive

Hive File Formats

• Text Files - Delimited by parameters
• Sequence Files - Less data
• RC Files - Analytic processing
• ORC Files – Optimized file format in binary format
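The file format is chosen per table with the STORED AS clause. The following is a minimal sketch, not taken from the slides: the table names (SALES_TEXT, SALES_SEQ, SALES_RC, SALES_ORC) and the two-column schema are assumptions for illustration, and ORC requires Hive 0.11 or later.

```sql
-- Hypothetical tables, one per file format; names and columns are illustrative only.
CREATE TABLE IF NOT EXISTS SALES_TEXT (Product STRING, Price INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS SALES_SEQ (Product STRING, Price INT)
STORED AS SEQUENCEFILE;

CREATE TABLE IF NOT EXISTS SALES_RC (Product STRING, Price INT)
STORED AS RCFILE;

CREATE TABLE IF NOT EXISTS SALES_ORC (Product STRING, Price INT)
STORED AS ORC;

-- Binary formats are usually filled from an existing table
-- rather than loaded directly from raw text files.
INSERT OVERWRITE TABLE SALES_ORC
SELECT Product, Price FROM SALES_TEXT;
```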


Page 9: Big data analytics -hive

HIVE Query Language (HQL)

Hive query language offers:

• Create databases
• Create, manage and partition tables
• Supports various operators like Relational, Arithmetic and Logical to evaluate functions
• Supports DDL and DML

Page 10: Big data analytics -hive

DDL (Data Definition Language) Statements

The DDL commands are listed below:

• Create, Alter, Drop database
• Create, Alter, Drop, Truncate table
• Create, Alter with Partitioning and Bucketing
• Create Views
• Show
• Describe
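A short sketch of the Show, Describe, view and Alter commands, written against the LIBTBL table used in this deck. The view name TEACHING_MEMBERS and the column remarks are hypothetical names chosen only for illustration.

```sql
-- List the tables in the current database.
SHOW TABLES;

-- Column names and types; FORMATTED adds storage details such as location and table type.
DESCRIBE LIBTBL;
DESCRIBE FORMATTED LIBTBL;

-- A view stores only the query, not the data (TEACHING_MEMBERS is a hypothetical name).
CREATE VIEW IF NOT EXISTS TEACHING_MEMBERS AS
SELECT Member_Name, Designation
FROM LIBTBL
WHERE group_name = 'TEACHING';

-- Add a column to an existing table (remarks is a hypothetical column).
ALTER TABLE LIBTBL ADD COLUMNS (remarks STRING);

-- Remove the view again.
DROP VIEW IF EXISTS TEACHING_MEMBERS;
```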


Page 11: Big data analytics -hive

DML (Data Manipulation Language) Statements

• Loading files
• Inserting data into Hive tables from queries

Page 12: Big data analytics -hive

Database Operations

Syntax

CREATE DATABASE IF NOT EXISTS db_name
COMMENT 'db_name Details'
WITH DBPROPERTIES ('creator' = 'name');

Example

CREATE DATABASE IF NOT EXISTS LIBDETS
COMMENT 'LIBRARY DETAILS'
WITH DBPROPERTIES ('creator' = 'KIRUTHI');

Page 13: Big data analytics -hive

Database Operations

Syntax

SHOW DATABASES; -- displays the available databases

Example

SHOW DATABASES;

Syntax

DESCRIBE DATABASE db_name; -- displays the schema of the database
DESCRIBE DATABASE EXTENDED db_name;

Example

DESCRIBE DATABASE LIBDETS;
DESCRIBE DATABASE EXTENDED LIBDETS;

Page 14: Big data analytics -hive

ALTER Database

Syntax

-- Alter database properties
ALTER DATABASE db_name
SET DBPROPERTIES ('edited-by' = 'name');

Example

ALTER DATABASE LIBDETS
SET DBPROPERTIES ('edited-by' = 'KANI');

Page 15: Big data analytics -hive

USE, DROP Database

Syntax

USE db_name; -- makes the database the current working database

Example

USE LIBDETS;

Syntax

DROP DATABASE db_name; -- deletes the database

Example

DROP DATABASE LIBDETS;
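By default DROP DATABASE uses RESTRICT behaviour and fails if the database still contains tables. A hedged sketch of the two common variants, reusing the LIBDETS database from the example:

```sql
-- IF EXISTS avoids an error when the database is already gone.
DROP DATABASE IF EXISTS LIBDETS;

-- CASCADE drops the tables inside the database first, then the database itself.
-- Use with care: the data of managed tables is deleted as well.
DROP DATABASE IF EXISTS LIBDETS CASCADE;
```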


Page 16: Big data analytics -hive

TABLES

Hive supports two types of tables

• Managed Table – Table data is stored in the Hive warehouse folder; dropping the table deletes the data as well.
• External Table – Data stays in the specified location even when the table is dropped; only the table definition is removed.

Page 17: Big data analytics -hive

Managed Table

Creating Managed Table

Syntax

CREATE TABLE IF NOT EXISTS tb_name (
  column_name data_type,
  column_name data_type,
  column_name data_type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Example

CREATE TABLE IF NOT EXISTS LIBTBL (
  Member_Code INT,
  Member_Name STRING,
  Designation STRING,
  Dept_code INT,
  dept_name STRING,
  group_name STRING,
  course_name STRING,
  title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Page 18: Big data analytics -hive

External Table

Creating External Table

Syntax

-- LOCATION names the directory that holds the data files, not a single file
CREATE EXTERNAL TABLE IF NOT EXISTS tb_name (
  column_name datatype,
  column_name datatype,
  column_name datatype)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'directory_path';

Example

-- The directory below is the one that holds Book2.csv
CREATE EXTERNAL TABLE IF NOT EXISTS LIBTBL (
  Member_Code INT,
  Member_Name STRING,
  Designation STRING,
  Dept_code INT,
  course_code INT,
  dept_name STRING,
  group_name STRING,
  course_name STRING,
  title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/livrith/Desktop';
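The practical difference between the two table types shows up when a table is dropped. A small sketch against LIBTBL; the comments describe the behaviour for each table type.

```sql
-- The "Table Type" row of the output reads MANAGED_TABLE or EXTERNAL_TABLE,
-- and the "Location" row shows where the data files live.
DESCRIBE FORMATTED LIBTBL;

-- Managed table: removes the metastore entry and deletes the data files.
-- External table: removes only the metastore entry; the files at LOCATION remain
-- and can be attached again with a new CREATE EXTERNAL TABLE statement.
DROP TABLE IF EXISTS LIBTBL;
```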


Page 19: Big data analytics -hive

Loading Data into Table

Syntax

-- With LOCAL the path refers to the local file system, not HDFS
LOAD DATA LOCAL INPATH 'local_file_or_directory_path' OVERWRITE INTO TABLE tb_name;

Example

LOAD DATA LOCAL INPATH '/home/kiruthika/Documents/Book2.csv' OVERWRITE INTO TABLE LIBTBL;
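Two variations of the load statement are worth knowing. The sketch below assumes a hypothetical HDFS path, /user/hive/staging/Book2.csv, which does not appear in the slides.

```sql
-- Without LOCAL the path is resolved on HDFS, and the file is moved
-- (not copied) into the table's directory.
-- '/user/hive/staging/Book2.csv' is a hypothetical path for illustration.
LOAD DATA INPATH '/user/hive/staging/Book2.csv' INTO TABLE LIBTBL;

-- Without OVERWRITE the new file is added to the rows already in the table,
-- so the row count grows with every load.
SELECT COUNT(*) FROM LIBTBL;
```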


Page 20: Big data analytics -hive

Select clause

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM tb_name
[WHERE where_condition]
[GROUP BY column_name]
[HAVING having_condition]
[ORDER BY column_name]
[DISTRIBUTE BY column_name]
[LIMIT number];

Example 1

SELECT * FROM LIBTBL;

Example 2

SELECT Member_Name, Designation FROM LIBTBL;

Page 21: Big data analytics -hive

Select – where

Example

SELECT * FROM LIBUDET WHERE group_name = 'TEACHING' OR group_name = 'student' AND Dept_name >= '18';
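A query that mixes OR and AND is easy to misread, because AND binds more tightly than OR. The sketch below spells out how the query is evaluated and how parentheses change the result; which reading was intended is not stated on the slide.

```sql
-- How the query above is actually evaluated: the Dept_name test
-- applies only to the 'student' rows.
SELECT * FROM LIBUDET
WHERE group_name = 'TEACHING'
   OR (group_name = 'student' AND Dept_name >= '18');

-- With parentheses the Dept_name test applies to both groups.
SELECT * FROM LIBUDET
WHERE (group_name = 'TEACHING' OR group_name = 'student')
  AND Dept_name >= '18';
```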

Select - pattern matching (LIKE)

Syntax

-- % matches any sequence of characters
SELECT column1, column2, column3 FROM tb_name WHERE column_name LIKE '%alp%';

Example

SELECT PRODUCT, STATE, CITY FROM SALESDETS WHERE City LIKE '%O%';
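LIKE uses SQL wildcards rather than regular expressions; Hive offers RLIKE for Java regular expressions. A brief sketch on the SALESDETS table, with patterns chosen only for illustration:

```sql
-- LIKE: % matches any run of characters, _ matches exactly one character.
SELECT PRODUCT, STATE, CITY FROM SALESDETS WHERE City LIKE 'O%';   -- City starts with O

-- RLIKE: Java regular expression; true if any part of the value matches,
-- so anchors (^ and $) are needed to match the whole value.
SELECT PRODUCT, STATE, CITY FROM SALESDETS WHERE City RLIKE '^[A-M].*o$';
```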


Page 22: Big data analytics -hive

Group by

Example

-- Every non-aggregated column in the SELECT list must appear in GROUP BY
SELECT PRODUCT, COUNT(PRODUCT) AS C1, STATE, COUNTRY
FROM SALESDETS
GROUP BY PRODUCT, STATE, COUNTRY;

Order by (sorts all rows; uses only one reducer)

Example

SELECT PRODUCT, STATE, PRICE, COUNTRY
FROM SALESDETS
ORDER BY COUNTRY;

Page 23: Big data analytics -hive

Sort by (sorts the rows before they are fed to each reducer; the output is ordered per reducer, not globally)

Example

SELECT PRODUCT, STATE, COUNTRY FROM SALESDETS SORT BY COUNTRY LIMIT 10;

Having (filters the groups produced by Group By)

Example

SELECT PRODUCT, COUNT(PRODUCT) AS C1, STATE, COUNTRY
FROM SALESDETS
GROUP BY PRODUCT, STATE, COUNTRY
HAVING C1 > 5;

Page 24: Big data analytics -hive

Limit

Example

SELECT PRODUCT, STATE, PRICE, COUNTRY FROM SALESDETS LIMIT 10;

Distribute by (distributes rows among reducers)

Syntax

SELECT column_name1, column_name2, column_name3
FROM tb_name
DISTRIBUTE BY column_name
SORT BY column_name ASC, column_name ASC
LIMIT count;

Example

SELECT PRODUCT, PRICE, STATE
FROM SALESDETS
DISTRIBUTE BY STATE
SORT BY STATE ASC, PRODUCT ASC
LIMIT 50;

Page 25: Big data analytics -hive

Cluster by (does the job of both Distribute by and Sort by)

Example

SELECT PRODUCT, PRICE, STATE FROM SALESDETS
CLUSTER BY STATE LIMIT 50;

Difference in Execution of Order By, Sort By, Distribute By, Cluster By
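The four clauses can be compared side by side on the SALESDETS table. This is a sketch for experimentation: the reducer count of 3 is an arbitrary assumption, and the differences are only visible when more than one reducer runs.

```sql
-- Ask for three reducers so that the differences become visible (assumed value).
SET mapreduce.job.reduces = 3;

-- ORDER BY: a single reducer receives all rows; the output is globally sorted.
SELECT PRODUCT, PRICE, STATE FROM SALESDETS ORDER BY STATE;

-- SORT BY: each reducer sorts only its own rows; rows of one STATE
-- may be spread over several reducers.
SELECT PRODUCT, PRICE, STATE FROM SALESDETS SORT BY STATE;

-- DISTRIBUTE BY: all rows of one STATE go to the same reducer, with no ordering.
SELECT PRODUCT, PRICE, STATE FROM SALESDETS DISTRIBUTE BY STATE;

-- DISTRIBUTE BY + SORT BY: same routing, then sorted within each reducer;
-- the sort columns may differ from the distribution column.
SELECT PRODUCT, PRICE, STATE FROM SALESDETS
DISTRIBUTE BY STATE SORT BY STATE, PRICE DESC;

-- CLUSTER BY STATE is shorthand for DISTRIBUTE BY STATE SORT BY STATE.
SELECT PRODUCT, PRICE, STATE FROM SALESDETS CLUSTER BY STATE;
```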


Page 26: Big data analytics -hive

Data Aggregation

• COUNT
• AVG, AVG(DISTINCT)
• MIN, MIN(DISTINCT)
• MAX, MAX(DISTINCT)
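A minimal sketch of these aggregate functions on the SALESDETS table, grouped by state. The column aliases are illustrative names, not part of the original slides.

```sql
SELECT STATE,
       COUNT(*)                AS num_sales,       -- rows per state
       COUNT(DISTINCT PRODUCT) AS num_products,    -- different products sold
       AVG(PRICE)              AS avg_price,
       AVG(DISTINCT PRICE)     AS avg_distinct_price,  -- each price value counted once
       MIN(PRICE)              AS min_price,
       MAX(PRICE)              AS max_price
FROM SALESDETS
GROUP BY STATE;
```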


Page 27: Big data analytics -hive

Partitions

Without partitioning, Hive reads the entire dataset from the warehouse even when a filter condition is specified on a particular column. This becomes a bottleneck in MapReduce jobs and involves a huge amount of I/O.

Partitioning is used to break a larger dataset into small chunks based on column values.

Hive supports two types of partition

• Static partition
• Dynamic partition

Page 28: Big data analytics -hive

Creating partition table

Syntax

CREATE TABLE tb_name (
  column1 datatype,
  column2 datatype,
  column3 datatype)
COMMENT 'Details of the dataset'
PARTITIONED BY (column_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Example

CREATE TABLE MY_TABLE1 (
  Member_Name STRING,
  dept_name STRING,
  group_name STRING,
  course_name STRING,
  title STRING)
COMMENT 'User information'
PARTITIONED BY (Designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Page 29: Big data analytics -hive

Load data into static partition table

Syntax

LOAD DATA LOCAL INPATH 'file_path' OVERWRITE INTO TABLE tb_name;

Example

LOAD DATA LOCAL INPATH '/home/livrith/Desktop/mytab.csv' OVERWRITE INTO TABLE MY_TABLE2;
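When the target table itself is partitioned, a static load names the partition value explicitly in a PARTITION clause. The sketch below is an assumption-laden illustration: the file path /tmp/professors.csv and the value 'Professor' are hypothetical, and the file is assumed to contain only the non-partition columns.

```sql
-- Static partition load: the partition value is given by hand.
-- '/tmp/professors.csv' and 'Professor' are hypothetical.
LOAD DATA LOCAL INPATH '/tmp/professors.csv'
OVERWRITE INTO TABLE MY_TABLE1
PARTITION (Designation = 'Professor');

-- The same static partition filled from a query on the staging table.
INSERT OVERWRITE TABLE MY_TABLE1 PARTITION (Designation = 'Professor')
SELECT Member_Name, dept_name, group_name, course_name, title
FROM MY_TABLE2
WHERE Designation = 'Professor';
```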


Page 30: Big data analytics -hive

Set dynamic partition

The following settings have to be modified to execute dynamic partitions.

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Page 31: Big data analytics -hive

Insert data - Dynamic partition table

Syntax

INSERT OVERWRITE TABLE 1st_tb_name PARTITION (partition_column)
SELECT column_name1, column_name2, column_name3, partition_column
FROM 2nd_tb_name;

Note: the partition column must be the last column in the SELECT list when inserting data.

Example

INSERT OVERWRITE TABLE MY_TABLE1 PARTITION (Designation)
SELECT Member_Name, dept_name, group_name, course_name, title, Designation
FROM MY_TABLE2;
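After the insert, the partitions that Hive created can be listed, and a filter on the partition column reads only the matching partition directory. A short sketch; the value 'Professor' is a hypothetical designation used for illustration.

```sql
-- One line per partition, in the form designation=<value>.
SHOW PARTITIONS MY_TABLE1;

-- The filter is on the partition column, so Hive scans only the
-- directory of that partition instead of the whole table.
SELECT Member_Name, dept_name
FROM MY_TABLE1
WHERE Designation = 'Professor';
```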


Page 32: Big data analytics -hive

Bucketing


Page 33: Big data analytics -hive

Bucketing

Bucketing is similar to partitioning.

A bucket is a file.

Buckets divide the data into a fixed number of files by hashing the values of a specified column, whereas partitioning divides the data into a separate directory for each distinct value of the partition column.

Page 34: Big data analytics -hive

Table creation

Syntax

CREATE TABLE IF NOT EXISTS tb_name (
  column1 datatype,
  column2 datatype,
  column3 datatype)
CLUSTERED BY (column_name) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Example

CREATE TABLE SALES_BUC1 (
  Transaction_date TIMESTAMP,
  Product STRING,
  Price INT,
  Payment_Type STRING,
  Name STRING,
  City STRING,
  State STRING,
  Country STRING,
  Account_Created TIMESTAMP)
CLUSTERED BY (Price) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Page 35: Big data analytics -hive

Load data into table

Syntax

FROM 1st_tb_name INSERT OVERWRITE TABLE 2nd_tb_name
SELECT column_name1, column_name2, column_name3;

Example

FROM SALESDETS INSERT OVERWRITE TABLE SALES_BUC1
SELECT Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created;
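On older Hive releases the insert only produces one file per bucket if bucketing is enforced first. The sketch below assumes a release before Hive 2.0; later releases always enforce bucketing and no longer have this property, so the SET line should be left out there.

```sql
-- Assumption: Hive release earlier than 2.0.
SET hive.enforce.bucketing = true;

-- Rows are hashed on Price and written to the 3 bucket files of SALES_BUC1.
FROM SALESDETS INSERT OVERWRITE TABLE SALES_BUC1
SELECT Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created;
```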


Page 36: Big data analytics -hive

Select from bucket table

Syntax 1

SELECT DISTINCT column_name FROM tb_name TABLESAMPLE (BUCKET 1 OUT OF 3 ON column_name);

Example

SELECT DISTINCT Price FROM SALES_BUC1 TABLESAMPLE (BUCKET 1 OUT OF 3 ON Price);

Syntax 2

SELECT DISTINCT column_name FROM tb_name TABLESAMPLE (BUCKET 1 OUT OF 2 ON column_name);

Example

SELECT DISTINCT Price FROM SALES_BUC1 TABLESAMPLE (BUCKET 1 OUT OF 2 ON Price);

Page 37: Big data analytics -hive

Sampling

• Sampling is used in Hive to produce a small dataset from an existing large dataset. Sampling selects records randomly to create the small dataset.

Syntax

SELECT COUNT(*) FROM tb_name TABLESAMPLE (BUCKET 1 OUT OF 3 ON column_name);

Example

In the example given below, a sample is created from the table SALES_BUC using one of the 3 available buckets.

SELECT COUNT(*) FROM SALES_BUC TABLESAMPLE (BUCKET 1 OUT OF 3 ON Price);
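Two other forms of TABLESAMPLE are sketched below on the same table. The figures (3 buckets, 10 percent) are arbitrary, and block sampling depends on the input format in use, so treat this as an illustration rather than a guaranteed recipe.

```sql
-- Sampling on rand() picks rows at random instead of by column value;
-- it works on any table but has to scan the whole table.
SELECT COUNT(*) FROM SALES_BUC TABLESAMPLE (BUCKET 1 OUT OF 3 ON rand());

-- Block sampling: reads roughly 10 percent of the data by size.
-- The granularity is the storage block, so small tables may return more.
SELECT COUNT(*) FROM SALES_BUC TABLESAMPLE (10 PERCENT);
```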


Page 38: Big data analytics -hive

• Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable.
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Page 39: Big data analytics -hive

NoSQL Databases

• NoSQL – Not only SQL, Non-relational/Non-SQL databases
• Schema-less
• Ideology: BASE – Basically Available, Soft state, Eventual consistency
• Can support only two properties: availability, replication

Page 40: Big data analytics -hive

NoSQL Types

• Key-value store - Amazon S3, Riak
• Document-based store – CouchDB, MongoDB
• Column-based store - HBase, Cassandra
• Graph-based store - Neo4j, OrientDB

Page 41: Big data analytics -hive

HBASE is Not

• Table with one primary key (row key)
• No join operations
• Limited atomicity and transaction support
• Manipulated by SQL

Page 42: Big data analytics -hive

HBase components

• Master - Manages load balancing and scripting
• Region server – Range of tables assigned by the master
• ZooKeeper – Clients communicate via ZooKeeper for read/write operations in region servers; stores node details
• Region server uses Memstore, similar to cache memory
• ZooKeeper provides services for synchronization and maintenance
