Hadoop single node-installation


Transcript of Hadoop single node-installation


Hadoop Single Node Cluster Installation

www.kannandreams.wordpress.com
www.facebook.com/groups/huge360/

Page 2: Hadoop single node-installation

Prerequisites

Operating System: Ubuntu 10.04 / 12.04. This tutorial was tested on Ubuntu 12.04; Linux is the recommended OS for running Hadoop.

Java SDK (Java 1.6, i.e. Java 6). Hadoop requires a working Java 1.5+ installation; this tutorial was tested on 1.6.

OpenSSH. This package is required to configure SSH access to localhost.

Page 3: Hadoop single node-installation

Java 6

Hadoop is written in Java, so a working Java installation is required to run it. Hadoop runs on Java 1.5+, but Java 1.6 (aka Java 6) is recommended.

Oracle Java 6 is preferred for Hadoop

Steps:
1. To check the installed Java version: $ java -version
2. To install Java 1.6:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java6-installer
3. To check the installation location: /usr/lib/jvm/java-6-oracle

Note: Step 2 is only required when java -version does not show the recommended version.
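
After the installer finishes, rerunning java -version confirms the right JVM is active. The output below is illustrative; the exact update and build numbers will differ:

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)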

Page 4: Hadoop single node-installation

Adding a dedicated user
We will use a dedicated Hadoop user account for running Hadoop. It is not required but recommended

because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

1. Add New group : $ sudo addgroup hadoop

2. Add user to the group : $ sudo adduser --ingroup hadoop hduser

Note: To give hduser sudo privileges, modify the sudoers file and grant it the ALL option.
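
A minimal sketch of that change (always edit the sudoers file via visudo, never directly; the exact policy is up to you):

$ sudo visudo
# add a line like this below the existing root entry:
hduser ALL=(ALL:ALL) ALL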

Page 5: Hadoop single node-installation

Configuring SSH
Hadoop uses SSH to manage and access its nodes. For our single-node installation we configure SSH access to localhost on the local machine. To know more about OpenSSH: https://help.ubuntu.com/10.04/serverguide/openssh-server.html

1. To install OpenSSH : $ sudo apt-get install openssh-server

2. $ su - hduser

3. $ ssh-keygen -t rsa -P ""

4. To enable SSH access to your local machine with this newly created key:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

5. The final step is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.

$ ssh localhost

Step 3 creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so the key can be used without your interaction.

Output:
kannan@kannandreams:~$ su - hduser
Password:
hduser@kannandreams:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9c:6c:18:0d:32:06:eb:7b:c8:ce:ee:a6:2f:5f:e1:54 hduser@kannandreams
The key's randomart image is:
+--[ RSA 2048]----+
|  ..+ .          |
| o o o           |
|  . E .          |
| . . = .         |
|  . o . S        |
|   . = . .       |
|    + +          |
|.o.o             |
|.OB              |
+-----------------+
hduser@kannandreams:~$

Pages 6-9: Hadoop single node-installation

Disable IPv6
Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. Please refer to these JIRA issues:

https://issues.apache.org/jira/browse/HADOOP-3437
https://issues.apache.org/jira/browse/HADOOP-6056

1. To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in an editor.
2. Add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled.

Note: You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437.
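
A sketch of that Hadoop-only alternative: instead of changing sysctl.conf system-wide, make Hadoop's JVMs prefer the IPv4 stack by adding this line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true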

Page 10: Hadoop single node-installation

Download & Install Hadoop
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group.

http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/
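
As a concrete sketch, assuming the hadoop-1.2.1.tar.gz tarball from the mirror above:

$ cd /usr/local
$ sudo wget http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ sudo tar xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop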

Page 11: Hadoop single node-installation

I have captured some mistakes here to help you understand what we are doing.

Page 12: Hadoop single node-installation

Profile File Configuration

1. Environment variables (these go in hduser's shell profile, e.g. ~/.bashrc):

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export PATH=$PATH:$HADOOP_HOME/bin

2. Convenient aliases and functions for running Hadoop-related commands:

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

Note: first we unalias any existing definition of fs (&> /dev/null discards whatever unalias writes, including errors), then we alias fs as a shortcut for the hadoop fs command. We do the same for hls as a shortcut for fs -ls.

We will see more on HDFS commands in the next tutorial.
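
To pick up these settings in the current shell, reload the profile and try one of the new aliases. This is a hypothetical session: it assumes the entries above went into hduser's ~/.bashrc, and fs -ls / will only succeed once the cluster is up (see the later pages):

$ source ~/.bashrc
$ fs -ls /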

Page 13: Hadoop single node-installation

Continuing Profile Configuration

lzop is a file compressor very similar to gzip; lzop favors speed over compression ratio.

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP, you can conveniently inspect an
# LZOP compressed file from the command line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

Pages 14-15: Hadoop single node-installation

Configurations

We are going to modify the properties in the configuration files to set up the single-node cluster options and the directory structure.

hadoop-env.sh

conf/*-site.xml

core-site.xml
mapred-site.xml
hdfs-site.xml

Page 16: Hadoop single node-installation

hadoop-env.sh
/usr/local/hadoop/conf/hadoop-env.sh

Search for the commented line below and modify it as follows:

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

Page 17: Hadoop single node-installation

Directory creation
Configure the directory where Hadoop will store its data files, the network ports it listens to, etc.

We will use the directory /app/hadoop/tmp in this tutorial.

Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the NameNode in a later step.

Pages 18-19: Hadoop single node-installation

conf/core-site.xml

Page 20: Hadoop single node-installation

Add the following lines between the <configuration> tags:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Page 21: Hadoop single node-installation

conf/mapred-site.xml (the mapred.job.tracker property belongs in this file):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Page 22: Hadoop single node-installation

conf/hdfs-site.xml (the dfs.replication property belongs in this file):

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Page 23: Hadoop single node-installation

Hadoop NameNode formatting
Before we start the Hadoop services, we need to format the Hadoop filesystem, which is deployed on top

of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

/usr/local/hadoop/bin/hadoop namenode -format

Why do we need formatting?
The Hadoop NameNode is the centralized piece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. In short, it maintains the metadata store for the DataNodes.

When we format the NameNode, it wipes the metadata related to the DataNodes. By doing that, all information about the data on the DataNodes is lost, and they can be reused for new data.

Note: formatting has to be done at initial setup only, not later. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
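
On success, the format command's output should end with a line reporting that the storage directory was formatted; with the hadoop.tmp.dir we configured earlier, that line should look roughly like:

$ /usr/local/hadoop/bin/hadoop namenode -format
...
... Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.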

Page 24: Hadoop single node-installation

Note: the warning "$HADOOP_HOME is deprecated" can be ignored; newer Hadoop versions have stopped using this environment variable.

Page 25: Hadoop single node-installation

Service Control
To start the services (this will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine):

/usr/local/hadoop/bin/start-all.sh

To stop the services (this stops all Hadoop-related processes running on your machine):

/usr/local/hadoop/bin/stop-all.sh
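
After start-all.sh you can verify the daemons with jps, the JDK's Java process lister. The PIDs below are illustrative, and a SecondaryNameNode typically appears in addition to the four daemons named above:

$ jps
1788 NameNode
1938 DataNode
2085 SecondaryNameNode
2149 JobTracker
2287 TaskTracker
2349 Jps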

Page 26: Hadoop single node-installation

Warning: "$HADOOP_HOME is deprecated" can be safely ignored here as well; higher versions of the Hadoop package no longer refer to this environment variable.

Page 27: Hadoop single node-installation

If the jps command is not working, don't panic; this has nothing to do with the Hadoop installation. jps is a Java process-monitoring utility that shows running JVM processes. Install Oracle Java 6 and set the path, and it should work.
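
A sketch of that fix, assuming the Oracle Java 6 location used throughout this tutorial (jps ships in the JDK's bin directory):

$ export PATH=$PATH:/usr/lib/jvm/java-6-oracle/bin
$ jps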

Pages 28-29: Hadoop single node-installation

Hadoop Web Portal
NameNode: http://localhost:50070/

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

JobTracker: http://localhost:50030/

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

Task Tracker: http://localhost:50060/

The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.
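
As a quick check from the shell that a web UI is up (an HTTP 200 response means the daemon is serving; curl may need to be installed first, and the same check works for ports 50030 and 50060):

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
200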

Pages 30-32: Hadoop single node-installation