Hadoop Manual


TABLE OF CONTENTS

S.NO CONTENTS

1 Basics of Big Data and Hadoop

2 Installing and Configuring Hadoop on Ubuntu

3 Running Hadoop on localhost on Ubuntu

4 Installing and Configuring Hive on Ubuntu

5 Basics of Hive

6 Performing CRUD Operations in Hive


Basics of Big Data And Hadoop

What is Big Data? Big Data is, as the name suggests, very large data: a collection of large datasets that cannot be processed using traditional computing techniques. Big Data is not merely data; it has become a complete subject in itself, involving various tools, techniques and frameworks.

What Comes Under Big Data? Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.

Black Box Data : The black box is a component of helicopters, airplanes, and jets. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.

Social Media Data : Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.

Stock Exchange Data : Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on the shares of different companies.

Power Grid Data : Power grid data holds information about the power consumed by a particular node with respect to a base station.

Transport Data : Transport data includes model, capacity, distance and availability of a vehicle.

Search Engine Data : Search engines retrieve lots of data from different databases.


Thus Big Data includes data of huge volume, high velocity, and an extensible variety. This data will be of three types:

Structured data : Relational data.

Semi Structured data : XML data.

Unstructured data : Word, PDF, Text, Media Logs.

Benefits of Big Data Big Data is critical to our lives and is emerging as one of the most important technologies in the modern world. The following are just a few benefits that are well known to all of us:

Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.

Using information from social media, such as the preferences and product perception of their consumers, product companies and retail organizations plan their production.

Using the data regarding the previous medical history of patients, hospitals provide better and quicker service.

Big Data Technologies Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.


To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:

Operational Big Data: This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data: This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.

These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems

                  Operational         Analytical

Latency           1 ms - 100 ms       1 min - 100 min
Concurrency       1000 - 100,000      1 - 10
Access Pattern    Writes and Reads    Reads
Queries           Selective           Unselective
Data Scope        Operational         Retrospective
End User          Customer            Data Scientist
Technology        NoSQL               MapReduce, MPP Database

Big Data Challenges The major challenges associated with big data are as follows:

Capturing data, Curation, Storage, Searching, Sharing, Transfer, Analysis, Presentation

To meet the above challenges, organizations normally take the help of enterprise servers.

Traditional Approach In this approach, an enterprise will have a computer to store and process big data. Here data will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis purposes.

Limitation: This approach works well when we have a smaller volume of data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is really a tedious task to process such data through a traditional database server.

Google’s Solution

Page 7: Hadoop Manual

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

Hadoop Doug Cutting, Mike Cafarella, and their team took the solution provided by Google and started an open source project called HADOOP in 2005; Doug named it after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.

Page 8: Hadoop Manual

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

The following diagram depicts the four components available in the Hadoop framework.

Page 9: Hadoop Manual

Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.

MapReduce Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:

The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).

The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.

Page 10: Hadoop Manual

Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.

The JobTracker is a single point of failure for the Hadoop MapReduce service which means if JobTracker goes down, all running jobs are halted.

Hadoop Distributed File System Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small computer machines in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata and one or more slave DataNodes that store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations on the file system. They also take care of block creation, deletion and replication based on the instructions given by the NameNode.

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. These shell commands will be covered in a separate chapter along with appropriate examples; a few common ones are previewed below.
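As a quick preview (a hypothetical single-node session; the paths and file names are only illustrative), the HDFS shell is used like this:

hduser@laptop:~$ hadoop fs -mkdir -p /user/hduser/input          # create a directory in HDFS
hduser@laptop:~$ hadoop fs -put localfile.txt /user/hduser/input # copy a local file into HDFS
hduser@laptop:~$ hadoop fs -ls /user/hduser/input                # list the directory
hduser@laptop:~$ hadoop fs -cat /user/hduser/input/localfile.txt # print the file contents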

How Does Hadoop Work? Stage 1: A user/application can submit a job to Hadoop (a Hadoop job client) for the required process by specifying the following items:

1. The location of the input and output files in the distributed file system.

Page 11: Hadoop Manual

2. The Java classes in the form of a jar file containing the implementation of the map and reduce functions.

3. The job configuration by setting different parameters specific to the job.

Stage 2: The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Stage 3: The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in the output files on the file system.
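To make the three stages concrete, the sketch below runs the wordcount example that ships with Hadoop 2.6.0 (the jar path and input file are assumptions and may differ on your system): the map tasks emit (word, 1) pairs and the reduce tasks sum them per word.

hduser@laptop:~$ hadoop fs -mkdir -p /user/hduser/wordcount/input
hduser@laptop:~$ hadoop fs -put ~/some-text-file.txt /user/hduser/wordcount/input
hduser@laptop:~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
    wordcount /user/hduser/wordcount/input /user/hduser/wordcount/output
hduser@laptop:~$ hadoop fs -cat /user/hduser/wordcount/output/part-r-00000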

Advantages of Hadoop The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.

Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.

Installing and Configuring Hadoop on Ubuntu

Page 12: Hadoop Manual

Installing Java

k@laptop:~$ cd ~

# Update the source list
k@laptop:~$ sudo apt-get update

# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
k@laptop:~$ sudo apt-get install default-jdk

k@laptop:~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Adding a dedicated Hadoop user

k@laptop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.

k@laptop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
    Full Name []:
    Room Number []:
    Work Phone []:
    Home Phone []:
    Other []:
Is the information correct? [Y/n] Y

Installing SSH

Page 13: Hadoop Manual

ssh has two main components:

1. ssh : The command we use to connect to remote machines - the client.
2. sshd : The daemon that is running on the server and allows clients to connect to the server.

The ssh client is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh first. Use this command to do that:

k@laptop:~$ sudo apt-get install ssh

This will install ssh on our machine. If we get output similar to the following, we can assume it is set up properly:

k@laptop:~$ which ssh
/usr/bin/ssh

k@laptop:~$ which sshd
/usr/sbin/sshd

Create and Setup SSH Certificates

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.

So, we need to have SSH up and running on our machine and configure it to allow SSH public key authentication.

Hadoop uses SSH (to access its nodes) which would normally require the user to enter a password. However, this requirement can be eliminated by creating and setting up SSH certificates using the following commands. If asked for a filename just leave it blank and press the enter key to continue.

k@laptop:~$ su hduser
Password:

k@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
+--[ RSA 2048]----+
|      .oo.o      |
|     . .o=. o    |
|    . + . o .    |
|     o = E       |
|      S +        |
|       . +       |
|        O +      |
|         O o     |
|          o..    |
+-----------------+

hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The second command adds the newly created key to the list of authorized keys so that Hadoop can use ssh without prompting for a password. We can check if ssh works:

hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)

Install Hadoop

hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@laptop:~$ tar xvzf hadoop-2.6.0.tar.gz

We want to move the Hadoop installation to the /usr/local/hadoop directory using the following command:

hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
[sudo] password for hduser:
hduser is not in the sudoers file. This incident will be reported.

Oops!... We got:

"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved by switching back to a user with sudo privileges and then adding hduser to the sudo group:

hduser@laptop:~/hadoop-2.6.0$ su k
Password:

k@laptop:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

Now that hduser has sudo privileges, we can move the Hadoop installation to the /usr/local/hadoop directory without any problem:

k@laptop:/home/hduser$ sudo su hduser

hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop

Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:

1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

1. ~/.bashrc:

Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed to set the JAVA_HOME environment variable using the following command:


hduser@laptop:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.

Now we can append the following to the end of ~/.bashrc:

hduser@laptop:~$ vi ~/.bashrc

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

hduser@laptop:~$ source ~/.bashrc

Note that JAVA_HOME should be set to the path just before '.../bin/':

hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75

hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac

hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
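As a quick sanity check (assuming the paths above), the variables from ~/.bashrc should now be in effect and the hadoop command should be found on the PATH, reporting version 2.6.0:

hduser@laptop:~$ echo $JAVA_HOME
/usr/lib/jvm/java-7-openjdk-amd64

hduser@laptop:~$ hadoop version
Hadoop 2.6.0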

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

We need to set JAVA_HOME by modifying the hadoop-env.sh file.

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


Adding the above statement in the hadoop-env.sh file ensures that the value of the JAVA_HOME variable will be available to Hadoop whenever it starts up.

3. /usr/local/hadoop/etc/hadoop/core-site.xml:

The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up. This file can be used to override the default settings that Hadoop starts with.

hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp
hduser@laptop:~$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file and enter the following in between the <configuration></configuration> tag:

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>

4. /usr/local/hadoop/etc/hadoop/mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains the mapred-site.xml.template file, which has to be copied/renamed to mapred-site.xml:


hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce. We need to enter the following content in between the <configuration></configuration> tag:

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
 </property>
</configuration>

5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. It is used to specify the directories which will be used as the namenode and the datanode on that host.

Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation. This can be done using the following commands:

hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file and enter the following content in between the <configuration></configuration> tag:

hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

Format the New Hadoop Filesystem

Now, the Hadoop file system needs to be formatted so that we can start to use it. The format command should be issued with write permission since it creates the current directory under the /usr/local/hadoop_store/hdfs/namenode folder:

hduser@laptop:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/04/18 14:43:03 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = laptop/192.168.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.0
STARTUP_MSG: classpath = /usr/local/hadoop/etc/hadoop...
STARTUP_MSG: java = 1.7.0_65
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
15/04/18 14:43:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-5b1100a2bf7f
15/04/18 14:43:09 INFO namenode.FSNamesystem: No KeyProvider found.
15/04/18 14:43:09 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:10 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/04/18 14:43:10 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: defaultReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplication = 512
15/04/18 14:43:10 INFO blockmanagement.BlockManager: minReplication = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
15/04/18 14:43:10 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: encryptDataTransfer = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsOwner = hduser (auth:SIMPLE)
15/04/18 14:43:10 INFO namenode.FSNamesystem: supergroup = supergroup
15/04/18 14:43:10 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/04/18 14:43:10 INFO namenode.FSNamesystem: HA Enabled: false
15/04/18 14:43:10 INFO namenode.FSNamesystem: Append Enabled: true
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map INodeMap
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^20 = 1048576 entries
15/04/18 14:43:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^18 = 262144 entries
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^15 = 32768 entries
15/04/18 14:43:11 INFO namenode.NNConf: ACLs enabled? false
15/04/18 14:43:11 INFO namenode.NNConf: XAttrs enabled? true
15/04/18 14:43:11 INFO namenode.NNConf: Maximum size of an xattr: 16384
15/04/18 14:43:12 INFO namenode.FSImage: Allocated new BlockPoolId: BP-130729900-192.168.1.1-1429393391595
15/04/18 14:43:12 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
15/04/18 14:43:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0
15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/


Running Hadoop on localhost on Ubuntu

LOG INTO ADMINISTRATOR HADOOP USER ACCOUNT:

$ sudo su hduser

CHANGE DIRECTORY TO HADOOP INSTALLATION:

$ cd /usr/local/hadoop/sbin/


START ALL HADOOP SERVICES:

$ start-all.sh
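Note that in Hadoop 2.x the start-all.sh script still works but is marked deprecated; an equivalent way to start the daemons is to run the HDFS and YARN scripts separately:

$ start-dfs.sh
$ start-yarn.sh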

CHECK IF ALL THE NODES ARE RUNNING


$ jps
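If everything started correctly, jps should list roughly the following daemons (the process IDs below are only illustrative, not output captured from the run above):

$ jps
4520 NameNode
4653 DataNode
4842 SecondaryNameNode
5010 ResourceManager
5137 NodeManager
5440 Jps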

CHECK IF NAME NODE IS RUNNING IN LOCAL HOST AT PORT 50070

http://localhost:50070

CHECK IF SECONDARY NAMENODE IS RUNNING IN LOCAL HOST AT PORT 50090

http://localhost:50090/status.jsp


CHECK IF DATA NODE IS RUNNING IN LOCAL HOST AT PORT 50070

http://localhost:50070/dfshealth.html#tab-datanode

STOPPING ALL HADOOP SERVICES

$ stop-all.sh


Basics of Hive

What is Hive? Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as an open source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not:

A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates

Features of Hive

It stores schema in a database and processed data into HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.

Architecture of Hive The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name Operation

User Interface Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server).

Meta Store Hive chooses respective database servers to store the schema or Metadata of tables, databases, columns in a table, their data types, and HDFS mapping.

HiveQL Process Engine

HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.


Execution Engine The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBase The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.

Working of Hive The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step No. Operation

1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.

3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

6. Execute Plan: The driver sends the execute plan to the execution engine.

7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.

7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.

8. Fetch Result: The execution engine receives the results from the Data nodes.

9. Send Results: The execution engine sends those resultant values to the driver.

Column Types Column types are used as the column data types of Hive. They are as follows:

Integral Types: Integer type data can be specified using the integral data types, INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

Type       Postfix    Example

TINYINT    Y          10Y
SMALLINT   S          10S
INT        -          10
BIGINT     L          10L

String Types: String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.

The following table depicts various CHAR data types:

Data Type    Length

VARCHAR      1 to 65535
CHAR         255

Timestamp: It supports a traditional UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and the format “yyyy-mm-dd hh:mm:ss.ffffffffff”.

Dates: DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals: The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:

DECIMAL(precision, scale)

decimal(10,0)

Union Types: A union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and an example are as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}

{1:2.0}

{2:["three","four"]}

{3:{"a":5,"b":"five"}}

{2:["six","seven"]}


{3:{"a":8,"b":"eight"}}

{0:9}

{1:10.0}

Literals The following literals are used in Hive:

Floating Point Types: Floating point types are nothing but numbers with decimal points. Generally, this type of data is composed of the DOUBLE data type.

Decimal Type: Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.

Null Value Missing values are represented by the special value NULL.

Complex Types The Hive complex data types are as follows:

Arrays: Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>
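Pulling a few of these column types together, a hypothetical CREATE TABLE statement (not part of the original manual) might look like this:

hive> create table employee(
          name STRING,
          salary DECIMAL(10,2),
          joined TIMESTAMP,
          skills ARRAY<STRING>);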

Installing and Configuring Hive on Ubuntu

INTRODUCTION: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Apache Hive supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL (Hive Query Language) while maintaining full support for map/reduce.

Hive Installation

Installing HIVE:

1. Browse to the link: http://apache.claz.org/hive/stable/
2. Click the apache-hive-0.13.0-bin.tar.gz
3. Save and extract it

Commands:

user@ubuntu:~$ cd /usr/lib/
user@ubuntu:~$ sudo mkdir hive
user@ubuntu:~$ cd Downloads
user@ubuntu:~$ sudo mv apache-hive-0.13.0-bin /usr/lib/hive

Setting Hive environment variable:

Commands:

user@ubuntu:~$ cd
user@ubuntu:~$ sudo gedit ~/.bashrc

Copy and paste the following lines at the end of the file:

# Set HIVE_HOME
export HIVE_HOME="/usr/lib/hive/apache-hive-0.13.0-bin"
PATH=$PATH:$HIVE_HOME/bin
export PATH
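After saving the file, reload the shell configuration so that HIVE_HOME and the updated PATH take effect in the current session (a step the original screenshots presumably assume):

user@ubuntu:~$ source ~/.bashrc
user@ubuntu:~$ echo $HIVE_HOME
/usr/lib/hive/apache-hive-0.13.0-bin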

Setting HADOOP_HOME in hive-config.sh

Commands:

user@ubuntu:~$ cd /usr/lib/hive/apache-hive-0.13.0-bin/bin
user@ubuntu:~$ sudo gedit hive-config.sh

Go to the line where the following statements are written:

# Allow alternate conf dir location.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"

export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH

Below this write the following:

export HADOOP_HOME=/usr/local/hadoop (use the path where Hadoop is installed)

Create Hive directories within HDFS

Command:

user@ubuntu:~$ hadoop fs -mkdir -p /usr/hive/warehouse

Setting READ/WRITE permission for table:

Command:

user@ubuntu:~$ hadoop fs -chmod g+w /usr/hive/warehouse
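Hive also expects a /tmp directory in HDFS with group write permission (per the standard Hive setup); if it does not already exist, create it the same way:

user@ubuntu:~$ hadoop fs -mkdir -p /tmp
user@ubuntu:~$ hadoop fs -chmod g+w /tmp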


Performing C.R.U.D Operations in Hive

Launching HIVE

Command:

user@ubuntu:~$ hive


Creating a database:

Command:

hive> create database mydb;

Show all databases:

Command:

hive> show databases;
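By default, new tables are created in the default database. If the intent is to create the following table inside mydb instead (an optional step, not shown in the original screenshots), switch to it first:

hive> use mydb;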


Create a Table:

Command:

hive> create table john(id varchar(30),number varchar(30));
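To verify the table's schema (an optional check, not part of the original screenshots):

hive> describe john;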

Insert into a Table:

Command:

hive> Insert into john values(20,40);


Display a Table:

Command:

hive> Select * from john;
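Aggregate queries such as the one below are compiled into a MapReduce job behind the scenes, which ties back to the MapReduce workflow described earlier (an illustrative extra query, not from the original manual):

hive> select count(*) from john;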

Alter the Table:

Command:


hive> ALTER TABLE john RENAME TO jerry;

Drop the Table:

Command:

hive> DROP TABLE jerry;