Hadoop single node-installation


Transcript of Hadoop single node-installation


Hadoop Single Node Cluster Installation

www.kannandreams.wordpress.com
www.facebook.com/groups/huge360/

Page 2: Hadoop single node-installation

Prerequisites

Operating System: Ubuntu 10.04 / 12.04. This tutorial was tested on Ubuntu 12.04; Linux is the recommended OS for running Hadoop.

Java SDK (Java 1.6, i.e. Java 6). Hadoop requires a working Java 1.5+ installation; this tutorial was tested on 1.6.

OpenSSH. This package is required to configure SSH access to localhost.

Page 3: Hadoop single node-installation

Java 6

Hadoop is written in Java, so a working Java installation is required to run it. Hadoop runs on Java 1.5+, but Java 1.6 (aka Java 6) is recommended.

Oracle Java 6 is preferred for Hadoop

Steps:
1. To check the installed Java version: $ java -version
2. To install Java 1.6:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java6-installer
3. To check the installation location: /usr/lib/jvm/java-6-oracle

Note: Step 2 is only required when java -version does not show the recommended version.
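
After the installer finishes, rerunning java -version confirms the right JVM is active. The output below is illustrative; the exact update and build numbers will differ:

$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)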

Page 4: Hadoop single node-installation

Adding a dedicated user
We will use a dedicated Hadoop user account for running Hadoop. It is not required but recommended

because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

1. Add New group : $ sudo addgroup hadoop

2. Add user to the group : $ sudo adduser --ingroup hadoop hduser

Note: To give hduser sudo privileges, modify the sudoers file and grant it the ALL option.
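
A minimal sketch of that change (always edit the sudoers file via visudo, never directly; the exact policy is up to you):

$ sudo visudo
# add a line like this below the existing root entry:
hduser ALL=(ALL:ALL) ALL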

Page 5: Hadoop single node-installation

Configuring SSH
Hadoop uses SSH to manage and access its nodes. For our single-node installation we configure SSH access to localhost on the local machine. To know more about OpenSSH: https://help.ubuntu.com/10.04/serverguide/openssh-server.html

1. To install OpenSSH : $ sudo apt-get install openssh-server

2. $ su - hduser

3. $ ssh-keygen -t rsa -P ""

4. To enable SSH access to your local machine with this newly created key:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

5. The final step is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.

$ ssh localhost

Step 3 creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so the key can be used without your interaction.

Output:
kannan@kannandreams:~$ su - hduser
Password:
hduser@kannandreams:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
/home/hduser/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9c:6c:18:0d:32:06:eb:7b:c8:ce:ee:a6:2f:5f:e1:54 hduser@kannandreams
The key's randomart image is:
+--[ RSA 2048]----+
|  ..+ .          |
| o o o           |
|  . E .          |
| . . = .         |
|  . o . S        |
|   . = . .       |
|    + +          |
|.o.o             |
|.OB              |
+-----------------+
hduser@kannandreams:~$

Pages 6-9: Hadoop single node-installation

Disable IPv6
Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. Please refer to these JIRA issues:

https://issues.apache.org/jira/browse/HADOOP-3437
https://issues.apache.org/jira/browse/HADOOP-6056

1. To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in an editor.
2. Add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled.

Note: You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437.
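
A sketch of that Hadoop-only alternative: instead of changing sysctl.conf system-wide, make Hadoop's JVMs prefer the IPv4 stack by adding this line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true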

Page 10: Hadoop single node-installation

Download & Install Hadoop
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group.

http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/
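
As a concrete sketch, assuming the hadoop-1.2.1.tar.gz tarball from the mirror above:

$ cd /usr/local
$ sudo wget http://apache.mirrors.pair.com/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ sudo tar xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop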

Page 11: Hadoop single node-installation

I have captured some mistakes here to help you understand what we are doing.

Page 12: Hadoop single node-installation

Profile File Configuration

1. Environment variables (these go in hduser's shell profile, e.g. ~/.bashrc):

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export PATH=$PATH:$HADOOP_HOME/bin

2. Convenient aliases and functions for running Hadoop-related commands:

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

Note: first we unalias any existing definition of fs (&> /dev/null discards whatever unalias writes, including errors), then we alias fs as a shortcut for the hadoop fs command. We do the same for hls as a shortcut for fs -ls.

We will see more on HDFS commands in the next tutorial.
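
To pick up these settings in the current shell, reload the profile and try one of the new aliases. This is a hypothetical session: it assumes the entries above went into hduser's ~/.bashrc, and fs -ls / will only succeed once the cluster is up (see the later pages):

$ source ~/.bashrc
$ fs -ls /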

Page 13: Hadoop single node-installation

Continuing Profile Configuration

lzop is a file compressor very similar to gzip; lzop favors speed over compression ratio.

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP, you can conveniently inspect an
# LZOP compressed file from the command line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

Pages 14-15: Hadoop single node-installation

Configurations

We are going to modify the properties in the configuration files to set up the single-node cluster options and the directory structure.

hadoop-env.sh

conf/*-site.xml

core-site.xml
mapred-site.xml
hdfs-site.xml

Page 16: Hadoop single node-installation

hadoop-env.sh
/usr/local/hadoop/conf/hadoop-env.sh

Search for the commented line below and modify it as follows:

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

Page 17: Hadoop single node-installation

Directory creation
Configure the directory where Hadoop will store its data files, the network ports it listens to, etc.

We will use the directory /app/hadoop/tmp in this tutorial.

Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the NameNode in a later step.

Pages 18-19: Hadoop single node-installation

conf/core-site.xml

Page 20: Hadoop single node-installation

Add the following lines between the <configuration> tags:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Page 21: Hadoop single node-installation

conf/mapred-site.xml (the mapred.job.tracker property belongs in this file):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Page 22: Hadoop single node-installation

conf/hdfs-site.xml (the dfs.replication property belongs in this file):

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Page 23: Hadoop single node-installation

Hadoop NameNode formatting
Before we start the Hadoop services, we need to format the Hadoop filesystem, which is deployed on top

of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

/usr/local/hadoop/bin/hadoop namenode -format

Why do we need formatting?
The Hadoop NameNode is the centralized piece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. In short, it maintains the metadata store for the DataNodes.

When we format the NameNode, it wipes the metadata related to the DataNodes. By doing that, all information about the data on the DataNodes is lost, and they can be reused for new data.

Note: formatting has to be done at initial setup only, not later. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS)!
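
On success, the format command's output should end with a line reporting that the storage directory was formatted; with the hadoop.tmp.dir we configured earlier, that line should look roughly like:

$ /usr/local/hadoop/bin/hadoop namenode -format
...
... Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.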

Page 24: Hadoop single node-installation

Note: the warning "$HADOOP_HOME is deprecated" can be ignored; newer Hadoop versions have stopped using this environment variable.

Page 25: Hadoop single node-installation

Service Control
To start the services (this will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine):

/usr/local/hadoop/bin/start-all.sh

To stop the services (this stops all Hadoop-related processes running on your machine):

/usr/local/hadoop/bin/stop-all.sh
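
After start-all.sh you can verify the daemons with jps, the JDK's Java process lister. The PIDs below are illustrative, and a SecondaryNameNode typically appears in addition to the four daemons named above:

$ jps
1788 NameNode
1938 DataNode
2085 SecondaryNameNode
2149 JobTracker
2287 TaskTracker
2349 Jps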

Page 26: Hadoop single node-installation

Warning: "$HADOOP_HOME is deprecated" can be safely ignored here as well; higher versions of the Hadoop package no longer refer to this environment variable.

Page 27: Hadoop single node-installation

If the jps command is not working, don't panic; this has nothing to do with the Hadoop installation. jps is a Java process-monitoring utility that shows running JVM processes. Install Oracle Java 6 and set the path, and it should work.
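
A sketch of that fix, assuming the Oracle Java 6 location used throughout this tutorial (jps ships in the JDK's bin directory):

$ export PATH=$PATH:/usr/lib/jvm/java-6-oracle/bin
$ jps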

Pages 28-29: Hadoop single node-installation

Hadoop Web Portal
NameNode: http://localhost:50070/

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

JobTracker: http://localhost:50030/

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

Task Tracker: http://localhost:50060/

The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.
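
As a quick check from the shell that a web UI is up (an HTTP 200 response means the daemon is serving; curl may need to be installed first, and the same check works for ports 50030 and 50060):

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
200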

Pages 30-32: Hadoop single node-installation