
Introduction into Big Data analytics

Lecture 7 – Apache Hive

Janusz Szwabiński

Outlook:

1. Introduction
2. Installation and configuration
3. Hive data model
4. Hive SQL
5. Example use case
6. Accessing Hive with Python

Further reading:

https://cwiki.apache.org/confluence/display/Hive/Home
https://www.guru99.com/introduction-hive.html


Introduction

https://hive.apache.org/

- initially developed by Facebook
- data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax
- built on top of Apache Hadoop:
  - MapReduce for execution
  - HDFS for storage
  - metadata on raw files
- designed for managing and querying only structured data and semi-structured data that is stored in tables

Features

- a procedural language, HPL/SQL
- it separates the user from the complexity of MapReduce programming (automatically translates SQL-like queries into MapReduce jobs)
- it reuses familiar concepts from the relational database world, such as tables, rows, columns and schemas
- standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics
- Hive's SQL can be extended with user code (see the sketch after this list) via:
  - user defined functions (UDFs)
  - user defined aggregates (UDAFs)
  - user defined table functions (UDTFs)
- data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis
- a mechanism to impose structure on a variety of data formats
- access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- query execution via Apache Tez, Apache Spark, or MapReduce
- sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider
- not a single "Hive format" in which data must be stored:
  - built-in connectors for comma and tab-separated values (CSV/TSV) text files, Apache Parquet, Apache ORC, and other formats
  - users can extend Hive with connectors for other formats
- not designed for online transaction processing (OLTP) workloads


- best used for traditional data warehousing tasks
- designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose coupling with its input formats
- components of Hive include HCatalog and WebHCat:
  - HCatalog is a table and storage management layer for Hadoop that gives easier access to data on the grid
  - WebHCat provides a service to run Hadoop MapReduce (or YARN), Pig and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST style) interface
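
A minimal sketch of the UDF extension mechanism mentioned in the feature list above; the jar path, the Java class name and the table are hypothetical placeholders, and the example assumes a custom UDF class has already been compiled into a jar:

-- register a jar containing a custom UDF class (hypothetical path)
ADD JAR /home/szwabin/Tools/udfs/my_udfs.jar;

-- expose the (hypothetical) Java class as a function for this session
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- use it like any built-in function
SELECT my_lower(name) FROM some_table;

UDAFs and UDTFs are registered the same way, only the underlying Java class implements a different interface.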

Architecture

Metastore
- stores metadata for each of the tables, such as their schema and location
- it also includes the partition metadata, which helps the driver to track the progress of various data sets distributed over the cluster
- the data is stored in a traditional RDBMS format
- the metadata helps the driver to keep track of the data and is highly crucial
- it should be regularly backed up

Driver
- acts like a controller which receives the HiveQL statements
- it starts the execution of a statement by creating sessions, and monitors the life cycle and progress of the execution
- it stores the necessary metadata generated during the execution of a HiveQL statement
- it also acts as a collection point of data or query results obtained after the Reduce operation

Compiler
- performs compilation of the HiveQL query, which converts the query to an execution plan
- this plan contains the tasks and steps needed to be performed by Hadoop MapReduce to get the output as translated by the query
- the compiler converts the query to an abstract syntax tree (AST)
- after checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG)


- the DAG divides operators into MapReduce stages and tasks based on the input query and data

Optimizer
- performs various transformations on the execution plan to get an optimized DAG
- transformations can be aggregated together, such as converting a pipeline of joins to a single join, for better performance
- it can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability

Executor
- after compilation and optimization, the executor executes the tasks
- it interacts with the job tracker of Hadoop to schedule tasks to be run
- it takes care of pipelining the tasks by making sure that a task with dependencies gets executed only if all its prerequisites have run

CLI, UI, and Thrift Server
- a command-line interface (CLI) provides a user interface for an external user to interact with Hive by submitting queries and instructions and monitoring the process status
- the Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols

Limitations of Hive

- it does not offer real-time queries and row-level updates
- not good for online transaction processing
- latency for Apache Hive queries is generally very high


Digression - data warehouse

A data warehouse (DW or DWH) is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place, which is used for creating analytical reports for workers throughout the enterprise.

The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data. This enables each department to isolate the use, manipulation and development of their data. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

Benefits

- congregation of data to a single database so a single query engine can be used to present data in an ODS
- mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running analysis queries in transaction processing databases
- maintain data history, even if the source transaction systems do not
- integrate data from multiple source systems, enabling a central view across the enterprise


  - always valuable, but particularly so when the organization has grown by merger
- improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data
- present the organization's information consistently
- provide a single common data model for all data of interest regardless of the data's source
- restructure the data so that it makes sense to the business users
- restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems
- make decision-support queries easier to write
- organize and disambiguate repetitive data

Database vs Data Warehouse

- Database: used for Online Transactional Processing (OLTP); it records the data coming from users as transactions happen. Data Warehouse: used for Online Analytical Processing (OLAP); it reads the historical data of the users for business decisions.
- Database: the tables and joins are complex since they are normalized, to reduce redundant data and save storage space. Data Warehouse: the tables and joins are simple since they are denormalized, to reduce the response time for analytical queries.
- Database: entity-relationship modeling techniques are used for design. Data Warehouse: data modeling techniques are used for design.
- Database: optimized for write operations. Data Warehouse: optimized for read operations.
- Database: performance is low for analytical queries. Data Warehouse: high performance for analytical queries.


Installation and configuration

Download

http://ftp.man.poznan.pl/apache/hive/stable-2/

Unpack the archive

cd /home/szwabin/Tools

mv ~/Downloads/apache-hive-2.3.3-bin.tar.gz .

tar zxvf apache-hive-2.3.3-bin.tar.gz

mv apache-hive-2.3.3-bin hive-2.3.3

rm apache-hive-2.3.3-bin.tar.gz

Set up environment

Add the following lines to the .bashrc file:

export HIVE_HOME=/home/szwabin/Tools/hive-2.3.3

export PATH=$HIVE_HOME/bin:$PATH

Start Hadoop

If they are not already running, start the Hadoop services set up earlier:

start-dfs.sh

start-yarn.sh

Create hadoop directories

hadoop fs -mkdir /tmp

hadoop fs -mkdir /szwabin

hadoop fs -mkdir /szwabin/hive

hadoop fs -mkdir /szwabin/hive/warehouse

hadoop fs -chmod g+w /tmp

hadoop fs -chmod -R g+w /szwabin/hive/warehouse

Running Hive CLI

If you work with earlier versions of Hive, you may simply start the CLI with

hive

However, Hive CLI is deprecated in favor of Beeline, the CLI of HiveServer2. In order to start the Beeline, we need to run the schematool command as an initialization step:

schematool -initSchema -dbType derby

where derby is the database type for Hive's metastore. Derby is a reasonable choice in case of a single-user installation. If you want to configure Hive with MySQL, have a look e.g. at https://www.guru99.com/installation-configuration-hive-mysql.html


If you tried to run the deprecated Hive CLI first, you may see the following error while running the schematool:

Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!

It is due to the fact that running the Hive CLI, even though it fails, creates a metastore_db subdirectory in the current working directory. So, when we tried to initialize Hive with schematool, the metastore already existed, but not in a complete form. We have to remove it first:

rm -r metastore_db

Then we re-run the initialization step:

schematool -initSchema -dbType derby

It may happen that we see the following warning at this point:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/szwabin/Tools/hive-2.3.3/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/szwabin/Tools/hadoop-3.0.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

To get rid of it, we remove all the mentioned jar files except the first one from the classpath.

Now we can start HiveServer2:

hiveserver2

To run the Beeline CLI, we issue the command:

beeline -n szwabin -u jdbc:hive2://localhost:10000

At this point, we will probably get an error like:

Connecting to jdbc:hive2://localhost:10000
18/04/03 08:54:03 [main]: WARN jdbc.HiveConnection: Failed to connect to localhost:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: szwabin is not allowed to impersonate szwabin (state=08S01,code=0)
Beeline version 2.3.3 by Apache Hive

We solve that issue by adding the following properties inside the <configuration> element of Hadoop's core-site.xml file:

4.04.2018 7_hive

file:///home/szwabin/Dropbox/Zajecia/BigData/Lectures/7_hive/7_hive.html 9/24

<property>
  <name>hadoop.proxyuser.szwabin.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.szwabin.hosts</name>
  <value>*</value>
</property>

We restart all Hadoop services and try to connect to the HiveServer2 again:

beeline -n szwabin -u jdbc:hive2://localhost:10000

Connecting to jdbc:hive2://localhost:10000

Connected to: Apache Hive (version 2.3.3)

Driver: Hive JDBC (version 2.3.3)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline version 2.3.3 by Apache Hive

0: jdbc:hive2://localhost:10000>

Testing the installation


0: jdbc:hive2://localhost:10000> CREATE TABLE pokes (foo INT, bar STRING);
0: jdbc:hive2://localhost:10000> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

0: jdbc:hive2://localhost:10000> SHOW TABLES;

+-----------+

| tab_name |

+-----------+

| invites |

| pokes |

+-----------+

2 rows selected (2.062 seconds)

0: jdbc:hive2://localhost:10000> DESCRIBE invites;
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| foo                      | int        |          |
| bar                      | string     |          |
| ds                       | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
|                          | NULL       | NULL     |
| ds                       | string     |          |
+--------------------------+------------+----------+
8 rows selected (0.752 seconds)

0: jdbc:hive2://localhost:10000> ALTER TABLE pokes ADD COLUMNS (new_col INT);
No rows affected (0.775 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
No rows affected (0.409 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
No rows affected (0.423 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
No rows affected (0.327 seconds)
0: jdbc:hive2://localhost:10000> DROP TABLE pokes;
No rows affected (2.846 seconds)


0: jdbc:hive2://localhost:10000> DESCRIBE invites;
+--------------------------+------------+------------------------------+
|         col_name         | data_type  |           comment            |
+--------------------------+------------+------------------------------+
| foo                      | int        | only keep the first column   |
| ds                       | string     |                              |
|                          | NULL       | NULL                         |
| # Partition Information  | NULL       | NULL                         |
| # col_name               | data_type  | comment                      |
|                          | NULL       | NULL                         |
| ds                       | string     |                              |
+--------------------------+------------+------------------------------+
7 rows selected (0.363 seconds)


Configuration

- Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
- the location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable
- configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default
- Hive configuration can be manipulated by:
  - editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
  - using the set command
  - invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:

hive --hiveconf x1=y1 --hiveconf x2=y2

hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2

beeline --hiveconf x1=y1 --hiveconf x2=y2

  - setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2", which does the same as above

Runtime configuration

beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;

beeline> SET -v;


Hive data model

Tables

- analogous to tables in relational DBs
- each table has a corresponding directory in HDFS, e.g.:
  - table name: pvs
  - HDFS directory: /szwabin/wh/pvs

Partitions

- analogous to dense indexes on partition columns
- for grouping the same type of data together
- each table can have one or more partition keys to identify a particular partition
- faster queries on slices of the data
- nested subdirectories in HDFS for each combination of partition column values
- example:

  - partition columns: ds, ctry
  - HDFS subdirectory for ds=20090801, ctry=US: /szwabin/wh/pvs/ds=20090801/ctry=US
  - HDFS subdirectory for ds=20090801, ctry=CA: /szwabin/wh/pvs/ds=20090801/ctry=CA
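
A minimal sketch of how such a partitioned table could be declared and loaded in HiveQL; the column list and the local file path are illustrative assumptions based on the pvs example above, not taken from the lecture:

CREATE TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING);

-- load a local file into one specific partition (hypothetical path)
LOAD DATA LOCAL INPATH '/home/szwabin/MyHive/pvs_20090801_us.txt'
INTO TABLE pvs PARTITION (ds='20090801', ctry='US');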

Buckets

- give extra structure to the data that may be used for more efficient queries
- split data based on the hash of a column – mainly for parallelism
- one HDFS file per bucket within the partition subdirectory
- example:
  /szwabin/wh/pvs/ds=20090801/ctry=US/part-00000
  ...
  /szwabin/wh/pvs/ds=20090801/ctry=US/part-00020
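
A minimal sketch of a bucketed (and partitioned) variant of the same table; the bucket column and the bucket count are illustrative choices, not part of the lecture examples:

-- in older Hive versions this setting makes inserts honour the bucket definition
SET hive.enforce.bucketing = true;

CREATE TABLE pvs_bucketed (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;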

External tables

- point to existing data directories in HDFS
- can create tables and partitions – partition columns just become annotations to external directories
- examples:

create external table with partitions

CREATE EXTERNAL TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
STORED AS TEXTFILE
LOCATION 'path/to/existing/table';

add partition to external table

ALTER TABLE pvs
ADD PARTITION (ds='20090801', ctry='US')
LOCATION 'path/to/existing/partition';


Hive Language

Data types in Hive

Numeric Types

Type       Memory allocation
TINYINT    1-byte signed integer (-128 to 127)
SMALLINT   2-byte signed integer (-32,768 to 32,767)
INT        4-byte signed integer (-2,147,483,648 to 2,147,483,647)
BIGINT     8-byte signed integer
FLOAT      4-byte single precision floating point number
DOUBLE     8-byte double precision floating point number
DECIMAL    precision and scale can be defined by the user, e.g. DECIMAL(10,2)

String Types

Type       Length
CHAR       fixed length, maximum 255 characters
VARCHAR    variable length, 1 to 65535 characters
STRING     variable length, no declared limit

Date/Time Types

Type Usage

Timestamp Traditional Unix timestamp with optional nanosecond precision

Date YYYY-MM-DD format, the range from 0000-01-01 to 9999-12-31
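
A small sketch (not from the lecture) of how these types are typically produced in queries; from_unixtime and to_date are standard Hive built-in functions, and the numeric literal is just an example Unix timestamp:

-- turn a Unix timestamp into TIMESTAMP and DATE values
SELECT CAST(from_unixtime(881250949) AS TIMESTAMP) AS ts,
       TO_DATE(from_unixtime(881250949))           AS dt;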

Complex Types

Type      Usage
Arrays    ARRAY<data_type>; negative values and non-constant expressions not allowed
Maps      MAP<primitive_type, data_type>; negative values and non-constant expressions not allowed
Structs   STRUCT<col_name : data_type, ...>
Union     UNIONTYPE<data_type, data_type, ...>
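
A minimal sketch of how some of these types could appear in a table definition and a query; the table, columns and delimiters are hypothetical, not part of the lecture examples:

CREATE TABLE user_profiles (
  id       INT,
  name     STRING,
  balance  DECIMAL(10,2),                        -- precision 10, scale 2
  emails   ARRAY<STRING>,
  props    MAP<STRING, STRING>,
  address  STRUCT<street: STRING, city: STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- element access: array by index, map by key, struct by field name
SELECT name, emails[0], props['lang'], address.city FROM user_profiles;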

Creation and dropping of databases

CREATE DATABASE database_name;

DROP DATABASE database_name;


Table operations

CREATE TABLE my_users (id int, name string);

ALTER TABLE my_users RENAME TO my_hive_users;

DROP TABLE my_hive_users;

Loading data into table

CREATE TABLE hive_users (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/szwabin/MyHive/users.txt' INTO TABLE hive_users;

Displaying the content of the table

SELECT * FROM hive_users;

External tables

- table created on data already stored in HDFS (schema on data)
- when the table is dropped, only the schema is dropped; the data remains available in HDFS as before
- external tables make it possible to create multiple schemas for the data stored in HDFS, instead of deleting the data every time the schema changes
- when to use:
  - when processing data that is already available in HDFS
  - when the files are also being used outside of Hive

CREATE EXTERNAL TABLE hive_users_external(id INT, name STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

LOCATION '/szwabin/hive/hive_users_external';

Views

- similar to tables
- generated based on some requirement
- any query result may be saved as a view
- all table operations can be performed on views

CREATE VIEW middle_class AS SELECT * FROM employees WHERE salary > 25000;

Indexes

- pointers to particular columns of a table
- have to be defined manually

CREATE INDEX sample_index ON TABLE hive_users (name) AS 'COMPACT' WITH DEFERRED REBUILD;

ORDER BY queries

- used with the SELECT statement


- will order the data in the result
- if the ORDER BY field is a string, the result will be displayed in lexicographical order

SELECT * FROM employees ORDER BY Department;

GROUP BY queries

- used with the SELECT statement
- the query will select and display results grouped according to particular values in a specified column

SELECT Department, COUNT(*) FROM employees GROUP BY Department;

(total count of employees in each department)

SORT BY queries

- also used with the SELECT statement
- unlike ORDER BY, SORT BY orders the data only within each reducer, so the overall result is not guaranteed to be totally ordered

SELECT * FROM employees SORT BY id DESC;


Joins

- demographic breakdown (by gender) of page_view for 2008-03-03
- a join of the page_view table and the user table on the userid column is required:

SELECT pv.*, u.gender, u.age

FROM user u JOIN page_view pv ON (pv.userid = u.id)

WHERE pv.date = '2008-03-03';

an outer join of the above tables

SELECT pv.*, u.gender, u.age

FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)

WHERE pv.date = '2008-03-03';

Inserts

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
GROUP BY pv_users.gender;

FROM pv_users
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
GROUP BY pv_users.age;

In certain situations we would want to write the output into a local file so that we could load it into an Excel spreadsheet:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'

SELECT pv_gender_sum.*

FROM pv_gender_sum;

Simple Hive ‘cheat sheet’ for SQL users (by Hortonworks)


Simple use case

We want to analyze user ratings from the https://movielens.org/ site. To this end, we first create a table with tab-delimited text file format:

0: jdbc:hive2://localhost:10000> CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
No rows affected (0.341 seconds)

Next, we download the data files from MovieLens on the GroupLens datasets page:

cd /home/szwabin/MyHive

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

unzip ml-100k.zip

We load the data into the table:

0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH '/home/szwabin/MyHive/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
No rows affected (1.23 seconds)

Let us count the number of rows in the dataset:

0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM u_data;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

+---------+

| _c0 |

+---------+

| 100000 |

+---------+

1 row selected (31.472 seconds)

Now we can do some complex data analysis on the table u_data. To this end let us create the following file:


In [1]:

%%writefile /home/szwabin/MyHive/weekday_mapper.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))

Writing /home/szwabin/MyHive/weekday_mapper.py


We create an additional table for the transformed data:

0: jdbc:hive2://localhost:10000> CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
No rows affected (0.297 seconds)

We add the file

0: jdbc:hive2://localhost:10000> add FILE /home/szwabin/MyHive/weekday_mapper.py;
No rows affected (0.015 seconds)

and do the transformation:

0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python /home/szwabin/MyHive/weekday_mapper.py'
AS (userid, movieid, rating, weekday) FROM u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (27.257 seconds)

Let us look at the weekday statistics:

0: jdbc:hive2://localhost:10000> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

+----------+--------+

| weekday | _c1 |

+----------+--------+

| 1 | 13444 |

| 2 | 12858 |

| 3 | 16750 |

| 4 | 14700 |

| 5 | 15274 |

| 6 | 14865 |

| 7 | 12109 |

+----------+--------+

7 rows selected (31.69 seconds)


Accessing Hive with Python

https://github.com/dropbox/PyHive

- a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive

Requirements

- Python 2.7 / Python 3
- HiveServer2 daemon
- thrift_sasl

Installation

apt-get install libsasl2-dev

pip3 install thrift_sasl

pip3 install pyhive[hive]

Sample session

>>> from pyhive import hive

>>> cursor = hive.connect('localhost').cursor()

>>> cursor.execute('SELECT * FROM u_data LIMIT 10')

>>> print(cursor.fetchone())

(196, 242, 3, '881250949')