4.04.2018 7_hive
file:///home/szwabin/Dropbox/Zajecia/BigData/Lectures/7_hive/7_hive.html 1/24
Introduction into Big Data analytics
Lecture 7 – Apache Hive
Janusz Szwabiński
Outline:
1. Introduction
2. Installation and configuration
3. Hive data model
4. Hive SQL
5. Example use case
6. Accessing Hive with Python
Further reading:
https://cwiki.apache.org/confluence/display/Hive/Home
https://www.guru99.com/introduction-hive.html
Introduction

- https://hive.apache.org/
- initially developed by Facebook
- data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax
- built on top of Apache Hadoop:
  - MapReduce for execution
  - HDFS for storage
  - metadata on raw files
- designed for managing and querying only structured data and semi-structured data that is stored in tables
Features

- procedural language HPL-SQL
- separates the user from the complexity of MapReduce programming (automatically translates SQL-like queries into MapReduce jobs)
- reuses familiar concepts from the relational database world, such as tables, rows, columns and schemas
- standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics
- Hive's SQL can be extended with user code via:
  - user defined functions (UDFs)
  - user defined aggregates (UDAFs)
  - user defined table functions (UDTFs)
- data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis
- a mechanism to impose structure on a variety of data formats
- access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- query execution via Apache Tez, Apache Spark, or MapReduce
- sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider
- no single "Hive format" in which data must be stored
- built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet, Apache ORC, and other formats
- users can extend Hive with connectors for other formats
- not designed for online transaction processing (OLTP) workloads
- best used for traditional data warehousing tasks
- designed to maximize scalability (scale out with more machines added dynamically to the Hadoop cluster), performance, extensibility, fault-tolerance, and loose coupling with its input formats
- components of Hive include HCatalog and WebHCat:
  - HCatalog is a table and storage management layer for Hadoop that provides easier access to data on the grid
  - WebHCat provides a service to run Hadoop MapReduce (or YARN), Pig and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface
Architecture
Metastore
- stores metadata for each of the tables, such as their schema and location
- also includes the partition metadata, which helps the driver to track the progress of various data sets distributed over the cluster
- the data is stored in a traditional RDBMS format
- the metadata helps the driver to keep track of the data, and it is highly crucial
- it should be regularly backed up
Driver
- acts like a controller which receives the HiveQL statements
- starts the execution of a statement by creating sessions, and monitors the life cycle and progress of the execution
- stores the necessary metadata generated during the execution of a HiveQL statement
- also acts as a collection point of data or query results obtained after the Reduce operation
Compiler
- compiles the HiveQL query, converting it to an execution plan
- the plan contains the tasks and steps needed to be performed by Hadoop MapReduce to get the output as translated by the query
- the compiler converts the query to an abstract syntax tree (AST)
- after checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG)
- the DAG divides operators into MapReduce stages and tasks based on the input query and data
Optimizer
- performs various transformations on the execution plan to get an optimized DAG
- transformations can be aggregated together, such as converting a pipeline of joins to a single join, for better performance
- it can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability
Executor
- after compilation and optimization, the executor executes the tasks
- it interacts with the job tracker of Hadoop to schedule tasks to be run
- it takes care of pipelining the tasks by making sure that a task with a dependency gets executed only if all its prerequisites have run
CLI, UI, and Thrift Server
- the command-line interface (CLI) provides a user interface for an external user to interact with Hive by submitting queries and instructions and monitoring the process status
- the Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols
Limitations of Hive
- does not offer real-time queries or row-level updates
- not good for online transaction processing
- latency for Apache Hive queries is generally very high
Digression - data warehouse
A data warehouse (DW or DWH) is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place, used for creating analytical reports for workers throughout the enterprise.

The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data. This enables each department to isolate the use, manipulation and development of their data. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.
Benefits
- congregation of data into a single database, so a single query engine can be used to present data in an ODS
- mitigates the problem of database isolation level lock contention in transaction processing systems, caused by attempts to run large, long-running analysis queries in transaction processing databases
- maintains data history, even if the source transaction systems do not
- integrates data from multiple source systems, enabling a central view across the enterprise
- always valuable, but particularly so when the organization has grown by merger
- improves data quality by providing consistent codes and descriptions, flagging or even fixing bad data
- presents the organization's information consistently
- provides a single common data model for all data of interest, regardless of the data's source
- restructures the data so that it makes sense to the business users
- restructures the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems
- makes decision-support queries easier to write
- organizes and disambiguates repetitive data
Database vs Data Warehouse
Database:
- used for Online Transaction Processing (OLTP); records the data from a user for history
- the tables and joins are complex since they are normalized, to reduce redundant data and save storage space
- entity-relationship modeling techniques are used for database design
- optimized for write operations
- performance is low for analytical queries

Data Warehouse:
- used for Online Analytical Processing (OLAP); reads the historical data of the users for business decisions
- the tables and joins are simple since they are denormalized, to reduce the response time for analytical queries
- data modeling techniques are used for warehouse design
- optimized for read operations
- high performance for analytical queries
Installation and configuration
Download
http://ftp.man.poznan.pl/apache/hive/stable-2/ (http://ftp.man.poznan.pl/apache/hive/stable-2/)
Unpack the archive
cd /home/szwabin/Tools
mv ~/Downloads/apache-hive-2.3.3-bin.tar.gz .
tar zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin hive-2.3.3
rm apache-hive-2.3.3-bin.tar.gz
Set up environment
Add the following lines to the .bashrc file:
export HIVE_HOME=/home/szwabin/Tools/hive-2.3.3
export PATH=$HIVE_HOME/bin:$PATH
Start Hadoop
If not already running, start the Hadoop services set up earlier:
start-dfs.sh
start-yarn.sh
Create hadoop directories
hadoop fs -mkdir /tmp
hadoop fs -mkdir /szwabin
hadoop fs -mkdir /szwabin/hive
hadoop fs -mkdir /szwabin/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod -R g+w /szwabin/hive/warehouse
Running Hive CLI
If you work with earlier versions of Hive, you may simply start the CLI with
hive
However, the Hive CLI is deprecated in favor of Beeline, the CLI of HiveServer2. Before starting Beeline, we need to run the schematool command as an initialization step:
schematool -initSchema -dbType derby
where derby is the database type for Hive's metastore. Derby is a reasonable choice for a single-user installation. If you want to configure Hive with MySQL, have a look e.g. at https://www.guru99.com/installation-configuration-hive-mysql.html
If you tried to run the deprecated Hive CLI first, you may see the following error while running the schematool:
Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
This is because running the Hive CLI, even though it fails, creates a metastore_db subdirectory in the current working directory. So when we tried to initialize Hive with schematool, the metastore already existed, but in an incomplete form. We have to remove it first:
rm -r metastore_db
Then we re-run the initialization step:
schematool -initSchema -dbType derby
It may happen that we see the following warning at this point:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/szwabin/Tools/hive-2.3.3/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/szwabin/Tools/hadoop-3.0.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
To get rid of it, we remove all the mentioned jar files except the first one from the classpath.
Now we can start HiveServer2:
hiveserver2
To run the Beeline CLI, we issue the command:
beeline -n szwabin -u jdbc:hive2://localhost:10000
At this point, we will probably get an error like:
Connecting to jdbc:hive2://localhost:10000
18/04/03 08:54:03 [main]: WARN jdbc.HiveConnection: Failed to connect to localhost:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: szwabin is not allowed to impersonate szwabin (state=08S01,code=0)
Beeline version 2.3.3 by Apache Hive
We solve that issue by adding the following properties to Hadoop's core-site.xml file:
<property>
<name>hadoop.proxyuser.szwabin.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.szwabin.hosts</name>
<value>*</value>
</property>
We restart all Hadoop services and try to connect to the HiveServer2 again:
beeline -n szwabin -u jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.3)
Driver: Hive JDBC (version 2.3.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.3 by Apache Hive
0: jdbc:hive2://localhost:10000>
Testing the installation
0: jdbc:hive2://localhost:10000> CREATE TABLE pokes (foo INT, bar STRING);
0: jdbc:hive2://localhost:10000> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
0: jdbc:hive2://localhost:10000> SHOW TABLES;
+-----------+
| tab_name  |
+-----------+
| invites   |
| pokes     |
+-----------+
2 rows selected (2.062 seconds)
0: jdbc:hive2://localhost:10000> DESCRIBE invites;
+--------------------------+------------+----------+
| col_name                 | data_type  | comment  |
+--------------------------+------------+----------+
| foo                      | int        |          |
| bar                      | string     |          |
| ds                       | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
|                          | NULL       | NULL     |
| ds                       | string     |          |
+--------------------------+------------+----------+
8 rows selected (0.752 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE pokes ADD COLUMNS (new_col INT);
No rows affected (0.775 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
No rows affected (0.409 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
No rows affected (0.423 seconds)
0: jdbc:hive2://localhost:10000> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');
No rows affected (0.327 seconds)
0: jdbc:hive2://localhost:10000> DROP TABLE pokes;
No rows affected (2.846 seconds)
0: jdbc:hive2://localhost:10000> DESCRIBE invites;
+--------------------------+------------+------------------------------+
| col_name                 | data_type  | comment                      |
+--------------------------+------------+------------------------------+
| foo                      | int        | only keep the first column   |
| ds                       | string     |                              |
|                          | NULL       | NULL                         |
| # Partition Information  | NULL       | NULL                         |
| # col_name               | data_type  | comment                      |
|                          | NULL       | NULL                         |
| ds                       | string     |                              |
+--------------------------+------------+------------------------------+
7 rows selected (0.363 seconds)
Configuration
- Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
- the location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable
- configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default
- Hive configuration can be manipulated by:
  - editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
  - using the set command
  - invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:
hive --hiveconf x1=y1 --hiveconf x2=y2
hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2
beeline --hiveconf x1=y1 --hiveconf x2=y2
  - setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2", which does the same as above
Runtime configuration
beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
beeline> SET -v;
Hive data model
Tables
- analogous to tables in relational DBs
- each table has a corresponding directory in HDFS, e.g.
  - table name: pvs
  - HDFS directory: /szwabin/wh/pvs
Partitions
- analogous to dense indexes on partition columns
- for grouping the same type of data together
- each table can have one or more partition keys to identify a particular partition
- faster queries on slices of the data
- nested subdirectories in HDFS for each combination of partition column values
- example:
  - partition columns: ds, ctry
  - HDFS subdirectory for ds=20090801, ctry=US: /szwabin/wh/pvs/ds=20090801/ctry=US
  - HDFS subdirectory for ds=20090801, ctry=CA: /szwabin/wh/pvs/ds=20090801/ctry=CA
Buckets
- give extra structure to the data that may be used for more efficient queries
- split data based on the hash of a column - mainly for parallelism
- one HDFS file per bucket within the partition subdirectory
- example:
  - /szwabin/wh/pvs/ds=20090801/ctry=US/part-00000
  - ...
  - /szwabin/wh/pvs/ds=20090801/ctry=US/part-00020
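The bucket assignment above can be sketched in Python. This is a simplified illustration, not Hive's exact hash function (Hive's hash depends on the column type; for integer columns it is essentially the value itself):

```python
def bucket_for(key, num_buckets):
    """Assign a row to a bucket, conceptually as Hive does:
    bucket index = hash(column value) mod number of buckets.
    Simplified sketch: Python's hash of an int is the int itself."""
    return hash(key) % num_buckets

# 32 buckets within the partition ds=20090801/ctry=US:
# each bucket corresponds to one HDFS file part-000NN.
userids = [21, 53, 85, 40]  # hypothetical userid values
files = {uid: "part-%05d" % bucket_for(uid, 32) for uid in userids}
# note: 21, 53 and 85 all land in bucket 21, i.e. the same file
```

Because rows with equal column values always land in the same bucket file, queries that sample or join on the bucketed column can read only the relevant files.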
External tables
- point to existing data directories in HDFS
- can create tables and partitions - partition columns just become annotations to external directories
- examples:
create external table with partitions
CREATE EXTERNAL TABLE pvs (userid INT, pageid INT)
PARTITIONED BY (ds STRING, ctry STRING)
STORED AS TEXTFILE
LOCATION 'path/to/existing/table';
add partition to external table
ALTER TABLE pvs
ADD PARTITION (ds='20090801', ctry='US')
LOCATION 'path/to/existing/partition';
Hive Language
Data types in Hive
Numeric Types
Type      Memory allocation
TINYINT   1-byte signed integer (-128 to 127)
SMALLINT  2-byte signed integer (-32,768 to 32,767)
INT       4-byte signed integer (-2,147,483,648 to 2,147,483,647)
BIGINT    8-byte signed integer
FLOAT     4-byte single-precision floating point number
DOUBLE    8-byte double-precision floating point number
DECIMAL   precision and scale can be defined by the user
String Types
Type Length
CHAR      fixed length, up to 255
VARCHAR   variable length, 1 to 65535
STRING    variable length, no declared limit
Date/Time Types
Type Usage
Timestamp Traditional Unix timestamp with optional nanosecond precision
Date YYYY-MM-DD format, the range from 0000-01-01 to 9999-12-31
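Hive's TIMESTAMP and DATE render naturally from Unix timestamps like the one stored as a string in the MovieLens use case later on. A minimal Python sketch of the two textual forms (UTC assumed):

```python
from datetime import datetime, timezone

def to_hive_date(unix_seconds):
    """Render a Unix timestamp as a Hive DATE literal (YYYY-MM-DD, UTC)."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%Y-%m-%d")

def to_hive_timestamp(unix_seconds):
    """Render a Unix timestamp as a Hive TIMESTAMP-style string (second precision)."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# the epoch itself:
d = to_hive_date(0)        # '1970-01-01'
t = to_hive_timestamp(0)   # '1970-01-01 00:00:00'
```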
Complex Types
Type Usage
Arrays   ARRAY<data_type>; negative indices and non-constant expressions not allowed
Maps     MAP<primitive_type, data_type>; negative values and non-constant expressions not allowed
Structs  STRUCT<col_name : data_type, ...>
Union    UNIONTYPE<data_type, data_type, ...>
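In Hive's default text storage format, complex types are flattened with control-character delimiters: ^A (\x01) between fields, ^B (\x02) between collection items, ^C (\x03) between a map key and its value. A sketch of how one row with an ARRAY and a MAP column is laid out (default delimiters assumed, nesting beyond one level omitted):

```python
FIELD, ITEM, KEYVAL = "\x01", "\x02", "\x03"  # Hive's default text delimiters

def serialize_row(name, tags, props):
    """Serialize a (STRING, ARRAY<STRING>, MAP<STRING,STRING>) row
    for a Hive text table using the default delimiters (simplified)."""
    arr = ITEM.join(tags)                                       # array items: ^B-separated
    mp = ITEM.join(k + KEYVAL + v for k, v in props.items())    # entries ^B-separated, key^Cvalue
    return FIELD.join([name, arr, mp])                          # top-level fields: ^A-separated

line = serialize_row("alice", ["a", "b"], {"k": "v"})
```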
Creation and dropping of databases
CREATE DATABASE database_name;
DROP DATABASE database_name;
Table operations
CREATE TABLE my_users (id int, name string);
ALTER TABLE my_users RENAME TO my_hive_users;
DROP TABLE my_hive_users;
Loading data into table
CREATE TABLE hive_users(id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/szwabin/MyHive/users.txt' INTO TABLE hive_users;
Displaying the content of the table
SELECT * FROM hive_users;
External tables
- the table is created on data already present in HDFS (schema on data)
- dropping the table drops only the schema; the data remains available in HDFS as before
- external tables make it possible to create multiple schemas for data stored in HDFS, instead of deleting the data every time the schema changes
- when to use:
  - when processing data already available in HDFS
  - when the files are also being used outside of Hive
CREATE EXTERNAL TABLE hive_users_external(id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/szwabin/hive/hive_users_external';
Views
- similar to tables
- generated based on some requirement
- any query result may be saved as a view
- all table operations can be performed on views
CREATE VIEW middle_class AS SELECT * FROM employees WHERE salary > 25000;
Indexes
- pointers to particular columns of a table
- have to be defined manually

CREATE INDEX sample_index ON TABLE hive_users(name) AS 'COMPACT' WITH DEFERRED REBUILD;
ORDER BY queries
used with SELECT statement
- orders the data in the result
- if the ORDER BY field is a string, the result is displayed in lexicographical order
SELECT * FROM employees ORDER BY Department;
GROUP BY queries
- used with the SELECT statement
- the query selects and displays results grouped by particular values of a specified column
SELECT Department, COUNT(*) FROM employees GROUP BY Department;
(total count of employees in each department)
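Hive translates such a GROUP BY into a MapReduce job: the map phase emits (department, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. A toy Python sketch of this pipeline (the sample rows are hypothetical):

```python
from collections import defaultdict

def group_by_count(rows):
    """Toy MapReduce for: SELECT Department, COUNT(*) FROM employees GROUP BY Department."""
    # map phase: emit (department, 1) for every employee row
    emitted = [(dept, 1) for (_emp_id, dept) in rows]
    # shuffle phase: group emitted values by key
    groups = defaultdict(list)
    for dept, one in emitted:
        groups[dept].append(one)
    # reduce phase: sum the counts within each group
    return {dept: sum(ones) for dept, ones in groups.items()}

employees = [(1, "HR"), (2, "IT"), (3, "IT")]  # hypothetical (id, department) rows
counts = group_by_count(employees)             # {'HR': 1, 'IT': 2}
```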
SORT BY queries
SELECT * FROM employees SORT BY id DESC;
Joins
- demographic breakdown (by gender) of page views on 2008-03-03
- a join of the page_view table and the user table on the userid column is required:
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
an outer join of the above tables
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
Inserts
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
GROUP BY pv_users.gender;

INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.age;
In certain situations we may want to write the output into a local file, so that we can load it into an Excel spreadsheet:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
Simple Hive ‘cheat sheet’ for SQL users (by Hortonworks)
Simple use case

We want to analyze user ratings from the https://movielens.org/ site. To this end, we first create a table with tab-delimited text file format:
0: jdbc:hive2://localhost:10000> CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
No rows affected (0.341 seconds)
Next, we download the MovieLens data files from the GroupLens datasets page:
cd /home/szwabin/MyHive
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
We load the data into the table:
0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH '/home/szwabin/MyHive/ml-100k/u.data' OVERWRITE INTO TABLE u_data;
No rows affected (1.23 seconds)
Let us count the number of rows in the dataset:
0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+
| _c0 |
+---------+
| 100000 |
+---------+
1 row selected (31.472 seconds)
Now we can do some complex data analysis on the table u_data. To this end let us create the following file:
In [1]:
%%writefile /home/szwabin/MyHive/weekday_mapper.py
import sys
import datetime
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
Writing /home/szwabin/MyHive/weekday_mapper.py
We create additional table for the transformed data:
0: jdbc:hive2://localhost:10000> CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
No rows affected (0.297 seconds)
We add the file
0: jdbc:hive2://localhost:10000> add FILE /home/szwabin/MyHive/weekday_mapper.py;
No rows affected (0.015 seconds)
and do the transformation:
0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python /home/szwabin/MyHive/weekday_mapper.py'
AS (userid, movieid, rating, weekday) FROM u_data;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (27.257 seconds)
Let us look at the weekday statistics:
0: jdbc:hive2://localhost:10000> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+----------+--------+
| weekday | _c1 |
+----------+--------+
| 1 | 13444 |
| 2 | 12858 |
| 3 | 16750 |
| 4 | 14700 |
| 5 | 15274 |
| 6 | 14865 |
| 7 | 12109 |
+----------+--------+
7 rows selected (31.69 seconds)
Accessing Hive with Python

https://github.com/dropbox/PyHive

- a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive
Requirements
- Python 2.7 / Python 3
- HiveServer2 daemon
- thrift_sasl
Installation
apt-get install libsasl2-dev
pip3 install thrift_sasl
pip3 install pyhive[hive]
Sample session
>>> from pyhive import hive
>>> cursor = hive.connect('localhost').cursor()
>>> cursor.execute('SELECT * FROM u_data LIMIT 10')
>>> print(cursor.fetchone())
(196, 242, 3, '881250949')
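The cursor returns plain DB-API tuples, so any post-processing is ordinary Python. A sketch of averaging the rating column over fetched u_data rows; the sample rows below are hardcoded stand-ins in the shape returned above, not real query output:

```python
# Rows in the shape pyhive's fetchall() returns for u_data:
# (userid, movieid, rating, unixtime) - hypothetical sample values.
rows = [(196, 242, 3, '881250949'),
        (186, 302, 3, '891717742'),
        (22, 377, 1, '878887116')]

def mean_rating(rows):
    """Average the rating column (index 2) of fetched u_data rows."""
    return sum(r[2] for r in rows) / len(rows)

avg = mean_rating(rows)  # (3 + 3 + 1) / 3
```

In a live session one would obtain the rows with `rows = cursor.fetchall()` after `cursor.execute(...)`, instead of hardcoding them.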