MODULE 2
Prof. Mohammed Tanzeem Agra
MODULE 2
1) Essential Hadoop Tools
2) Hadoop Yarn Applications
3) Managing Hadoop with Apache Ambari
4) Basic Hadoop Administration Procedures
Essential Hadoop Tools
● Apache Pig
● Apache Hive (data warehouse infrastructure)
● Apache Sqoop (relational data)
● Apache Oozie (designed to run and manage multiple Hadoop jobs)
● Apache HBase (modeled on Google Bigtable)
Apache Pig
● Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
● Pig is often used for data-set operations such as aggregate, join, and sort.
● It is also used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
● There are two modes:
● Local mode : all processing is done on the local machine.
● Non-local mode : executes the job on the cluster using either the MapReduce or Tez engine.
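The data-set operations Pig typically expresses (aggregate, join, sort) can be sketched in plain Python; the records and field names below are invented for illustration and are not Pig syntax:

```python
from collections import defaultdict

# Hypothetical records: (user, amount) purchases and (user, city) profiles.
purchases = [("ann", 30), ("bob", 20), ("ann", 25), ("cat", 40)]
profiles = [("ann", "Oslo"), ("bob", "Pune"), ("cat", "Lima")]

# Aggregate: total purchase amount per user (like GROUP/SUM in Pig).
totals = defaultdict(int)
for user, amount in purchases:
    totals[user] += amount

# Join: attach each user's city to their total (like JOIN in Pig).
cities = dict(profiles)
joined = {user: (total, cities[user]) for user, total in totals.items()}

# Sort: order users by total, descending (like ORDER ... BY in Pig).
ranked = sorted(joined.items(), key=lambda kv: kv[1][0], reverse=True)
print(ranked)  # [('ann', (55, 'Oslo')), ('cat', (40, 'Lima')), ('bob', (20, 'Pune'))]
```

In a real Pig script these steps would run as MapReduce or Tez jobs across the cluster rather than in one process.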
Apache Hive
● Apache Hive is a data warehouse built on top of Hadoop for providing data summarization, ad hoc queries, and analysis of large data sets using an SQL-like language called HiveQL.
● Features
– Tools to enable easy data extraction, transformation, and loading (ETL)
– A mechanism to impose structure on a variety of data formats
– Access to files stored either directly in HDFS or in other data storage systems
– Query execution via MapReduce and Tez
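As a loose analogy (not Hive itself), the idea of imposing a table structure on raw delimited records and then summarizing them with SQL can be sketched with Python's built-in sqlite3; the table, columns, and records are invented for illustration:

```python
import sqlite3

# Raw delimited records, as they might sit in a file in HDFS.
raw_lines = ["2024-01-01,click,3", "2024-01-01,view,10", "2024-01-02,click,5"]

conn = sqlite3.connect(":memory:")
# "Impose structure": declare a schema over the otherwise untyped records.
conn.execute("CREATE TABLE events (day TEXT, kind TEXT, n INTEGER)")
rows = [line.split(",") for line in raw_lines]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# An ad hoc, SQL-like summarization query (HiveQL is similar in spirit,
# but executes as MapReduce or Tez jobs over HDFS files).
total_clicks = conn.execute(
    "SELECT SUM(n) FROM events WHERE kind = 'click'"
).fetchone()[0]
print(total_clicks)  # 8
```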
2.2 : APACHE YARN (Yet Another Resource Negotiator)
Why YARN : MapReduce Version 1
In Hadoop version 1, MapReduce performed both processing and resource management functions. It consisted of a Job Tracker, which was the single master.
The Job Tracker allocated the resources, performed scheduling and monitored
the processing jobs.
It assigned map and reduce tasks to a number of subordinate processes called Task Trackers.
The Task Trackers periodically reported their progress to the Job Tracker.
Problem with Version 1
MRv1 had a scalability bottleneck due to its single Job Tracker.
IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.
Apart from this limitation, the utilization of computational resources was inefficient in MRv1. The Hadoop framework was also limited to the MapReduce processing paradigm.
INTRODUCTION TO YARN
YARN allows different data processing methods like graph processing,
interactive processing, stream processing as well as batch processing to run
and process data stored in HDFS.
Therefore YARN opens up Hadoop to other types of distributed applications
beyond MapReduce.
YARN enables users to perform operations as per requirement by using a variety of tools such as Spark for real-time processing, Hive for SQL, HBase for NoSQL, and others.
YARN COMPONENTS
YARN performs all your processing activities by allocating resources and
scheduling tasks. Apache Hadoop YARN Architecture consists of the following
main components :
Resource Manager
Node Manager
Application Master
Container
1. Resource Manager
It is the ultimate authority in resource allocation.
On receiving the processing requests, it passes parts of requests to
corresponding node managers accordingly, where the actual processing takes
place.
It is the arbitrator of the cluster resources and decides the allocation of the
available resources for competing applications.
It optimizes cluster utilization (keeping all resources in use as far as possible) against various constraints such as capacity guarantees, fairness, and SLAs.
Resource Manager Components
It has two major components: a) Scheduler b) Application Manager
Scheduler : The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.
It performs scheduling based on the resource requirements of the applications.
If there is an application failure or a hardware failure, the Scheduler does not guarantee to restart the failed tasks.
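A minimal sketch of the Scheduler's job, granting containers to queues subject to capacity constraints; the queue names, capacities, and units below are all invented for illustration, not the actual YARN scheduler API:

```python
# A toy cluster with 100 units of memory, split between two hypothetical queues.
capacity = {"prod": 60, "dev": 40}   # per-queue capacity limits
used = {"prod": 0, "dev": 0}

def allocate(queue, mem):
    """Grant a container only if the queue stays within its capacity."""
    if used[queue] + mem <= capacity[queue]:
        used[queue] += mem
        return True
    return False  # no guarantee: the request is simply not satisfied now

assert allocate("prod", 50)      # fits within prod's 60 units
assert not allocate("prod", 20)  # would exceed prod's capacity
assert allocate("dev", 30)       # dev has its own budget
```

Real YARN schedulers (Capacity, Fair) add queues of pending requests, preemption, and locality, but the core decision is this capacity check.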
Application Manager
It is responsible for accepting job submissions.
Negotiates the first container from the Resource Manager for executing the
application specific Application Master.
Manages running the Application Masters in a cluster and provides service for
restarting the Application Master container on failure.
2. Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the node.
Its primary goal is to manage application containers assigned to it by the resource manager.
It keeps up-to-date with the Resource Manager.
The Application Master requests the assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
Monitors resource usage (memory, CPU) of individual containers.
Performs Log management.
It also kills the container as directed by the Resource Manager.
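The Container Launch Context and the Node Manager's launch/kill duties described above can be pictured as plain records; every field, class, and method name below is illustrative, not the actual YARN API:

```python
# Illustrative Container Launch Context (CLC): everything needed to run.
clc = {
    "environment": {"JAVA_HOME": "/usr/lib/jvm/default"},
    "dependencies": ["hdfs:///apps/myapp/job.jar"],  # remotely stored
    "security_tokens": ["<token>"],
    "command": ["java", "-jar", "job.jar"],
}

class NodeManager:
    def __init__(self):
        self.containers = {}

    def launch(self, container_id, clc):
        # Create the requested container process from the CLC and start it.
        self.containers[container_id] = {"clc": clc, "state": "RUNNING"}

    def kill(self, container_id):
        # Kill the container as directed by the Resource Manager.
        self.containers[container_id]["state"] = "KILLED"

nm = NodeManager()
nm.launch("container_01", clc)
nm.kill("container_01")
print(nm.containers["container_01"]["state"])  # KILLED
```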
3. Application Master
An application is a single job submitted to the framework. Each such
application has a unique Application Master associated with it which is a
framework specific entity.
It is the process that coordinates an application’s execution in the cluster and
also manages faults.
Its task is to negotiate resources from the Resource Manager and work with
the Node Manager to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status, and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to
affirm its health and to update the record of its resource demands.
4. Container
It is a collection of physical resources such as RAM, CPU cores, and disks on a
single node.
YARN containers are managed by a Container Launch Context (CLC), the record that governs the container life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
It grants rights to an application to use a specific amount of
resources (memory, CPU etc.) on a specific host.
Application Submission in YARN
Data Flow in YARN
The client submits an application.
The Resource Manager allocates a container to start the Application Master.
The Application Master registers with the Resource Manager.
The Application Master requests containers from the Resource Manager.
The Application Master notifies the Node Managers to launch the containers.
Application code is executed in the containers.
The client contacts the Resource Manager / Application Master to monitor the application's status.
The Application Master unregisters from the Resource Manager.
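The steps above can be sketched as a toy simulation; all class, method, and identifier names are invented for illustration, not the YARN client API:

```python
class ResourceManager:
    def __init__(self):
        self.apps = {}

    def submit(self, app_id):                 # 1. client submits an application
        self.apps[app_id] = "SUBMITTED"
        return "am_container"                 # 2. container to start the AM

    def register_am(self, app_id):            # 3. AM registers with the RM
        self.apps[app_id] = "RUNNING"

    def request_containers(self, n):          # 4. AM requests containers
        return [f"container_{i}" for i in range(n)]

    def unregister_am(self, app_id):          # 8. AM unregisters on completion
        self.apps[app_id] = "FINISHED"

rm = ResourceManager()
am_container = rm.submit("app_1")
rm.register_am("app_1")
containers = rm.request_containers(2)
# 5-6. the AM would notify Node Managers to launch these containers,
# where the application code actually executes.
# 7. the client polls the RM/AM to monitor status:
print(rm.apps["app_1"])                       # RUNNING
rm.unregister_am("app_1")
print(rm.apps["app_1"])                       # FINISHED
```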
Using Apache Sqoop to Acquire Relational Data
● Sqoop is a tool designed to transfer data between Hadoop and relational databases.
● Sqoop can be used with any JDBC-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
● It can also load imported data directly into Hive or HBase.
There are two methods: import and export.
● Apache Sqoop Import Method
● Step 1 : Sqoop examines the database to gather the necessary metadata for the data to be imported.
● Step 2 : Sqoop submits a map-only Hadoop job to the cluster to transfer the actual data.
● Each node doing the import must have access to the database.
● The imported data are saved in an HDFS directory.
● Sqoop will use the table name for the directory by default.
Data export from the cluster works in a similar fashion. The export is done in two steps. Step 1 : Sqoop examines the database for metadata. Step 2 : Sqoop divides the input data set into splits, then uses individual map tasks to push the splits to the database.
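The split-then-push step can be sketched as follows; the split size, record shapes, and the in-memory "database" are all invented stand-ins for illustration:

```python
# Rows to export from HDFS back to a relational table.
rows = list(range(10))          # stand-in for the exported records
SPLIT_SIZE = 4

# Step 2: divide the input data set into splits ...
splits = [rows[i:i + SPLIT_SIZE] for i in range(0, len(rows), SPLIT_SIZE)]

# ... then use an individual map task per split to push rows to the database.
database = []                   # stand-in for the target table
def map_task(split):
    database.extend(split)      # each task INSERTs its own split of rows

for split in splits:            # in Sqoop these tasks run in parallel
    map_task(split)

print(len(splits), sorted(database) == rows)  # 3 True
```

Parallelism comes from running one map task per split on different cluster nodes, each holding its own JDBC connection to the database.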
SQOOP DATA EXPORT METHOD
Managing HADOOP with
Apache Ambari (GUI for Hadoop)
Why APACHE AMBARI
Managing a Hadoop installation by hand can be tedious and time-consuming. Keeping configuration files synchronized across a cluster, and starting, stopping, and restarting Hadoop services and their dependent services in the right order, is not a simple task.
Apache Ambari is a tool designed to help with exactly this; the sections below walk through basic navigation and usage scenarios for Apache Ambari.
INTRODUCTION
Apache Ambari is an open-source graphical installation and management tool for installing Hadoop version 2.
A minimum of a four-node cluster is required.
It can support: HDFS, YARN, MapReduce, Tez, Hive, HBase, Pig, Sqoop, Oozie, ZooKeeper, and Flume.
To use Ambari, the entire Hadoop installation must be performed with Ambari. It is not possible to use Ambari with a Hadoop cluster that has been installed by other means.
Architecture
Application of Ambari
Centralized point of administration for the Hadoop cluster.
Users can configure cluster services.
Monitor the status of cluster hosts.
Start and stop services.
Add new hosts to the cluster.
It also provides real-time reporting of important metrics.
Quick Tour of Apache Ambari
Dashboard View : widgets can be moved, edited, removed, or added
CPU Usage
Service View
Provides a detailed look at each service running on the cluster.
It also provides a graphical method to configure each service, instead of hand-editing the files in /etc/hadoop/conf.
The currently installed services are listed in the left-side menu. To select a service, click the service name in the menu.
The status of the NameNode, SecondaryNameNode, and DataNodes, the uptime, and the available disk space are displayed in the Summary window.
Clicking the Config tab opens the options form. The options are the same ones that are set in the Hadoop XML files.
Ambari Hadoop 2 – service window
Hosts View
The Hosts menu item provides information such as host name, IP address, number of cores, memory, disk usage, current load average, and installed Hadoop components in tabular form.
New hosts can be added by using the Actions pull-down menu.
The Hosts view provides three sub-windows:
Components
Host Metrics
Summary information
Each service can be stopped, restarted, decommissioned, or placed in maintenance mode.
Hosts View
Admin View
The Admin view provides three options:
List of installed software
Service Accounts : Hortonworks Data Platform (HDP)
Security
Admin Pull-Down Menu
About : provides the current version of Ambari.
Manage Ambari : opens the management screen where Users, Groups, Permissions, and Ambari Views can be created and configured.
Settings : provides the option to turn off the progress window.
Sign Out : exits the interface.
Fault Tolerance
✔ Fault tolerance is a property that enables a system to continue operating properly in the event of the failure of some of its components.
✔ Strict control of data flow throughout the execution : mapper processes do not exchange data with other mapper processes, and data can only go from mappers to reducers.
✔ Recovery from the failure of one or many map processes : if a server fails, the map tasks that were running on that machine can easily be restarted on another working server.
✔ Failed reducers can also be restarted, but additional work has to be redone in that case.
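The recovery of failed map tasks can be sketched as rescheduling onto healthy servers; the server names, task IDs, and failure set are invented for illustration:

```python
# Map tasks initially assigned across three hypothetical servers.
assignment = {"t1": "srv_a", "t2": "srv_a", "t3": "srv_b", "t4": "srv_c"}
failed_servers = {"srv_a"}
healthy = sorted({"srv_a", "srv_b", "srv_c"} - failed_servers)

# Restart every map task that was running on a failed machine elsewhere.
for i, (task, server) in enumerate(sorted(assignment.items())):
    if server in failed_servers:
        assignment[task] = healthy[i % len(healthy)]  # simple round-robin

# No task is left on a failed server; the job continues without restarting.
print(all(s not in failed_servers for s in assignment.values()))  # True
```

This works precisely because of the strict data-flow rule above: a map task has no hidden shared state, so rerunning it anywhere produces the same output.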
Speculative Execution
● Executing a program on a large cluster is challenging: any one node may be slow or failing, even though central control and monitoring of resources is straightforward.
● When one part of a MapReduce job runs slowly, it ultimately slows down everything, as is typical in parallel computing.
● With speculative execution, Hadoop launches a duplicate copy of a slow-running task on another node and uses the result of whichever copy finishes first; the other copy is then killed.
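Speculative execution can be sketched deterministically: run a duplicate of the straggling attempt and keep whichever (simulated) copy would finish first. The attempt names and durations below are invented for illustration:

```python
# Simulated task attempts: (name, seconds to finish). The original attempt
# is a straggler -- perhaps its node is overloaded -- so the framework
# launches a speculative duplicate of the same work on another node.
original = ("attempt_0", 90.0)     # slow attempt
speculative = ("attempt_1", 12.0)  # duplicate on a healthier node

# The job uses the result of whichever attempt finishes first and kills
# the other, so one slow node no longer gates the whole job.
winner = min([original, speculative], key=lambda a: a[1])
job_time = winner[1]

print(winner[0], job_time)  # attempt_1 12.0
```

The trade-off is extra cluster work for the duplicate attempt in exchange for a shorter overall job completion time.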
Hadoop MapReduce Hardware
● Server
● Storage (Hard disk)
● Processing (Processor)
● Old and New Hardware ?