Post on 11-May-2015
Introduction to Hive
Ajit Koti
Analysis of data is done by both engineering and non-engineering people. The data is growing fast: the volume was 15 TB in 2007 and grew to 200 TB by 2010. Current RDBMSs cannot handle it, and existing solutions are not scalable, and are expensive and proprietary.
Motivation
MapReduce is a programming model and an associated implementation introduced by Google in 2004. Apache Hadoop is a software framework inspired by Google's MapReduce.
Map/Reduce - Apache Hadoop
Hadoop supports data-intensive distributed applications.
However...
– MapReduce is hard to program (users know SQL/bash/Python).
– No schema.
Motivation (cont.)
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
– ETL
– Structure
– Access to different storage
– Query execution via MapReduce
Key Building Principles:
– SQL is a familiar language
– Extensibility: types, functions, formats, scripts
– Performance
What is HIVE?
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g., Google Analytics)
Predictive modeling, hypothesis testing
Hive Applications
Hive Architecture
Databases
Tables
Partitions
Buckets (or Clusters)
Data Units
Primitive types:
– Integers: TINYINT, SMALLINT, INT, BIGINT
– Boolean: BOOLEAN
– Floating point numbers: FLOAT, DOUBLE
– String: STRING
Complex types:
– Structs: {a INT; b INT}
– Maps: M['group']
– Arrays: ['a', 'b', 'c'], where A[1] returns 'b'
Type System
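The complex-type accessors above can be sketched in plain Python (an illustration of the semantics, not Hive itself; the values are made up):

```python
# Struct {a INT; b INT} -> field access, e.g. s.a
struct = {"a": 1, "b": 2}

# Map -> M['group'] looks up a value by key
M = {"group": "admins"}

# Array -> zero-indexed, so A[1] returns 'b'
A = ["a", "b", "c"]

print(struct["a"])   # struct field a -> 1
print(M["group"])    # map lookup -> 'admins'
print(A[1])          # second element -> 'b'
```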
Warehouse directory in HDFS, e.g., /user/hive/warehouse
Table row data stored in subdirectories of the warehouse
Partitions form subdirectories of table directories
Actual data stored in flat files:
– Control-character-delimited text, or SequenceFiles
– With a custom SerDe, can use arbitrary format
Physical Layout
HiveQL
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
Examples – DDL Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
Examples – DML Operations
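Note that LOAD DATA does not parse or transform rows; it moves (or, with LOCAL, copies) the file into the partition's directory. A sketch of the resulting destination, assuming the default warehouse layout:

```python
def load_destination(table, partition_col, partition_val, filename):
    # file lands under <warehouse>/<table>/<col>=<val>/
    return "/user/hive/warehouse/%s/%s=%s/%s" % (
        table, partition_col, partition_val, filename)

print(load_destination("sample", "ds", "2012-02-24", "sample.txt"))
# /user/hive/warehouse/sample/ds=2012-02-24/sample.txt
```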
SELECT foo FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;
SELECTS and FILTERS
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;
Aggregations and Groups
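What the GROUP BY query computes, sketched in plain Python over a few hypothetical (ds, foo) rows of the sample table:

```python
from collections import defaultdict

rows = [  # hypothetical rows: (ds, foo)
    ("2012-02-24", 1),
    ("2012-02-24", 2),
    ("2012-02-25", 5),
]

# per-ds COUNT(*) and SUM(foo), as in:
# SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
counts = defaultdict(int)
sums = defaultdict(int)
for ds, foo in rows:
    counts[ds] += 1
    sums[ds] += foo

for ds in sorted(counts):
    print(ds, counts[ds], sums[ds])
```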
SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);
SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id, sum(price) AS exp FROM order_cust GROUP BY cus_id) ce ON (c.id = ce.cus_id);
Join
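The second query (a join against a derived table of per-customer totals) can be sketched in plain Python; the rows here are made-up examples:

```python
customers = [  # hypothetical (id, name, address) rows
    (1, "Ann", "A St"),
    (2, "Bob", "B St"),
]
orders = [  # hypothetical (id, cus_id, prod_id, price) rows
    (10, 1, 7, 100),
    (11, 1, 8, 50),
    (12, 2, 7, 100),
]

# derived table ce: SELECT cus_id, sum(price) AS exp ... GROUP BY cus_id
exp = {}
for _id, cus_id, _prod, price in orders:
    exp[cus_id] = exp.get(cus_id, 0) + price

# JOIN ... ON (c.id = ce.cus_id)
result = [(cid, name, addr, exp[cid])
          for cid, name, addr in customers if cid in exp]
print(result)  # [(1, 'Ann', 'A St', 150), (2, 'Bob', 'B St', 100)]
```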
CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';
CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Extension mechanisms
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) {
      return null;
    }
    return new Text(s.toString().toLowerCase());
  }
}
User-defined function
Registering the class
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
Using the function
SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);
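The evaluate() logic of the Java UDF can be sketched in Python to show the behavior per row: null-safe lowercasing, as my_lower would apply it.

```python
def my_lower(s):
    if s is None:      # Hive UDFs must handle NULL input
        return None
    return s.lower()

print(my_lower("HiveQL"))  # hiveql
print(my_lower(None))      # None
```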
Mathematical: round, floor, ceil, rand, exp...
Collection: size, map_keys, map_values, array_contains.
Type Conversion: cast.
Date: from_unixtime, to_date, year, datediff...
Conditional: if, case, coalesce.
String: length, reverse, upper, trim...
Built-in Functions
my_append.py
Using the function
Map/Reduce Scripts
import sys

i = 0
for line in sys.stdin:
    line = line.strip()
    key = line.split('\t')[0]
    value = line.split('\t')[1]
    print key + str(i) + '\t' + value + str(i)
    i = i + 1
SELECT TRANSFORM (foo, bar) USING 'python ./my_append.py' FROM sample;
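TRANSFORM serializes each (foo, bar) row as tab-separated text and pipes it through the script's stdin. A sketch of my_append.py's per-line logic as a testable function, with made-up input rows:

```python
def append_counter(lines):
    # appends a running row counter to both key and value,
    # as my_append.py does over stdin
    out = []
    i = 0
    for line in lines:
        key, value = line.strip().split("\t")
        out.append(key + str(i) + "\t" + value + str(i))
        i += 1
    return out

print(append_counter(["1\ta", "2\tb"]))  # ['10\ta0', '21\tb1']
```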
Install and Config Hive
From a release tarball:
$ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH
Installing Hive
Java 1.6
Hadoop >= 0.17-0.20
Hive *MUST* be able to find Hadoop:
– $HADOOP_HOME=<hadoop-install-dir>
– Add $HADOOP_HOME/bin to $PATH
Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
Hive Dependencies
Default configuration is in $HIVE_HOME/conf/hive-default.xml – DO NOT TOUCH THIS FILE!
(Re)define properties in $HIVE_HOME/conf/hive-site.xml
Use $HIVE_CONF_DIR to specify an alternate conf dir location
You can override Hadoop configuration properties in Hive's configuration, e.g.:
– mapred.reduce.tasks=1
Hive Configuration
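A minimal sketch of a $HIVE_HOME/conf/hive-site.xml override (the property values are illustrative assumptions, not required settings):

```xml
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- overriding a Hadoop property from Hive's configuration -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
</configuration>
```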
Hive CLI
Start a terminal and run: $ hive
You should see a prompt like: hive>
Set a Hive or Hadoop conf property:
– hive> set propkey=value;
List all properties and values:
– hive> set -v;
Add a resource to the distributed cache:
– hive> add [ARCHIVE|FILE|JAR] filename;
Hive CLI Commands
List tables:
– hive> show tables;
Describe a table:
– hive> describe <tablename>;
More information:
– hive> describe extended <tablename>;
List functions:
– hive> show functions;
More information:
– hive> describe function <functionname>;
Hive CLI Commands
An easy way to process large-scale data
Supports SQL-based queries
Provides user-defined interfaces to extend programmability
Files in HDFS are immutable
Typical uses:
– Log processing: daily reports, user activity measurement
– Data/text mining: machine learning (training data)
– Business intelligence: advertising delivery, spam detection
Conclusion
Thank You
Follow @ajitkoti