Hadoop

A Hadoop introduction and configuration document

What is Hadoop (Storage perspective)?

Hadoop is a Java framework (software platform) for storing vast amounts of data (and also processing it). It can be set up on commonly available computers.

Use Case

It can be used when the following requirements arise:

Store terabytes of data: HDFS uses commonly available computers and storage devices and pools the storage space of all the systems into one large volume.

Streaming access to data: HDFS is designed more for batch processing than for interactive use. The emphasis is on high throughput of data access rather than low latency.

Large data sets: File sizes are typically in gigabytes. HDFS is tuned to support large files; it should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.

WORM requirement: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.

High availability: HDFS stores multiple replicas of data on various systems in the cluster. This ensures availability of data even if some systems go down.

Architecture

Hadoop is based on a master-slave architecture. An HDFS cluster consists of a single Namenode (the master server) that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Datanodes (slaves), usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of Datanodes. The Namenode executes file system namespace operations like opening, closing, and renaming files and directories; it also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from the file system's clients. The Datanodes also perform block creation, deletion, and replication upon instruction from the Namenode. The Namenode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the Namenode.

[Figure: HDFS architecture - Client, NameNode, and DataNodes]

Key Features

File System Namespace: HDFS supports hierarchical file organization. It supports operations such as create, remove, move, and rename on files as well as directories. It does not implement permissions or quotas.

Replication: HDFS stores files as a series of blocks. Blocks are replicated for fault tolerance; the replication factor and block size are configurable. Files in HDFS are write-once and have strictly one writer at any time. Replica placement is critical for performance. In large clusters, nodes are spread across racks, and the racks are connected via switches. Traffic between nodes within a rack is typically much higher than traffic across racks, so replica placement takes the rack topology into account to improve reliability while limiting inter-rack traffic. To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred for the read.

File System Metadata: The Namenode uses an EditLog to record every change to the file system metadata. The entire file system namespace is stored in a file called FsImage.

Robustness (network or disk failure and data integrity): Datanodes send heartbeat messages to the Namenode; when the Namenode stops receiving them, the Datanode is marked dead. This may cause the replication count of some blocks to fall. The Namenode constantly monitors the replication count for each block and re-replicates blocks when the count falls. This can happen because a replica is corrupted, a Datanode is dead, or the replication factor of a file is increased. The Namenode also rebalances data when free space on a node falls below a threshold value. A checksum is stored for each block and verified when the block is retrieved.

NameNode failure: The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the Namenode can be configured to maintain multiple copies of the FsImage and EditLog. Any update to either file causes each of the copies to be updated synchronously. This synchronous updating may degrade the rate of namespace transactions per second that a Namenode can support. However, this degradation is acceptable because, even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a Namenode restarts, it selects the latest consistent FsImage and EditLog to use. The Namenode machine is a single point of failure for an HDFS cluster: if it fails, manual intervention is necessary. Currently, automatic restart and failover of the Namenode software to another machine is not supported.

Data organization: The default HDFS block size is 64 MB, and HDFS supports write-once-read-many semantics. The HDFS client does local caching. Suppose an HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of Datanodes from the Namenode; this list contains the Datanodes that will host a replica of that block. The client then flushes the data block to the first Datanode. The first Datanode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second Datanode in the list. The second Datanode, in turn, receives each portion of the data block, writes it to its repository, and then flushes it to the third Datanode. Finally, the third Datanode writes the data to its local repository. Thus a Datanode can be receiving data from the previous node in the pipeline and at the same time forwarding data to the next one; the data is pipelined from one Datanode to the next. When data is deleted, it is not removed immediately; rather it remains in /trash, from where it can be either removed or restored. How long data is kept in trash is configurable; the default value is 6 hours.

How to access HDFS? DFSShell: from the shell, a user can create, remove, and rename directories as well as files. This is intended for applications that use scripting languages to interact with HDFS. Browser interface: an HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser. For administration, the DFSAdmin command is also provided; some typical invocations are sketched below.
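A few typical DFSShell and DFSAdmin invocations (a sketch; the paths are illustrative, not taken from this setup):

# list the root of the HDFS namespace
bin/hadoop dfs -ls /
# create a directory and copy a local file into it
bin/hadoop dfs -mkdir /user/hadoop/demo
bin/hadoop dfs -copyFromLocal /tmp/sample.txt /user/hadoop/demo/
# print the file and remove the directory again
bin/hadoop dfs -cat /user/hadoop/demo/sample.txt
bin/hadoop dfs -rmr /user/hadoop/demo
# cluster-wide report: capacity, live and dead data nodes
bin/hadoop dfsadmin -report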

Setting up Hadoop

We used three Linux boxes running CentOS to set up a Hadoop cluster. Details are as follows:

[Figure: cluster layout - NameNode, DataNode-1, DataNode-2]

Property        NameNode        DataNode-1      DataNode-2
Hostname        NameNode        DataNode-1      DataNode-2
Role            name node       data node       data node
IP              10.245.0.121    10.245.0.131    10.245.0.57
Storage Space   -               35 GB           9 GB

Step By Step Approach

Multi Node Cluster

Step-1: (steps 1 to 5 need to be done on all nodes)

Set the host names of the three systems as indicated above, and add the following entries to the /etc/hosts file on each node:

127.0.0.1    localhost localhost.localdomain localhost
10.245.0.57  DataNode-1
10.245.0.131 DataNode-2
10.245.0.121 NameNode
10.245.0.192 DataNode-3

Then run the following command on each of the three systems and reboot them: hostname XXX (where XXX is the hostname of that system).
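On CentOS the name set with the hostname command is not persistent on its own; a sketch of making it survive reboots (assuming the stock /etc/sysconfig/network layout), shown here for the NameNode:

# run as root on NameNode; repeat with the proper name on each DataNode
hostname NameNode
sed -i 's/^HOSTNAME=.*/HOSTNAME=NameNode/' /etc/sysconfig/network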

Step-2:

Added a dedicated system user named hadoop.

[root@NameNode]# groupadd hadoop
[root@NameNode]# useradd -g hadoop hadoop
[root@NameNode]# passwd hadoop

Step-3:

Installed JDK (jdk-1_5_0_14-linux-i586.rpm) and hadoop (hadoop-0.14.4.tar.gz) as user hadoop in /home/hadoop.

Step-4:

Set up the Linux systems in such a way that any system can ssh to any other system without a password: copy the public key of every system in the cluster (including itself) into the authorized_keys file, as sketched below.
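A minimal sketch of the key setup, assuming RSA keys and the default OpenSSH paths; run as the hadoop user on each node and repeat the append step for every node's public key:

# generate a key pair with an empty passphrase
ssh-keygen -t rsa -P ""
# authorize the node's own key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# push the public key to the other nodes and append it there, e.g. from NameNode:
scp ~/.ssh/id_rsa.pub hadoop@DataNode-1:/tmp/NameNode.pub
ssh hadoop@DataNode-1 'cat /tmp/NameNode.pub >> ~/.ssh/authorized_keys'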

Step-5:

Set the JAVA_HOME variable in conf/hadoop-env.sh to the correct path. In our case it was:

export JAVA_HOME=/usr/java/jdk1.5.0_14/

Step-6: (on NameNode)

Add the following entry to the conf/masters file:

NameNode

Add the following entries to the conf/slaves file:

DataNode-1
DataNode-2

The conf/slaves file on the master is used only by scripts such as bin/start-dfs.sh or bin/stop-dfs.sh to start the data nodes.

Step-7: (on data nodes)

Create a directory named hadoop-datastore (any name of your choice) where Hadoop will store all its data. The path of this directory needs to be given in the hadoop-site.xml file as the value of the hadoop.tmp.dir property.
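For example, on each data node (a minimal sketch; the path matches the hadoop.tmp.dir value used in Step-8):

# as the hadoop user
mkdir -p /home/hadoop/hadoop-datastore
chmod 755 /home/hadoop/hadoop-datastore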

Step- 8:

Change the conf/hadoop-site.xml file. On the NameNode the file looks as follows:

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://NameNode:54310</value>
    <description>The name of the default file system: a URI whose scheme and
    authority determine the file system implementation. The URI's scheme
    determines the config property (fs.SCHEME.impl) naming the file system
    implementation class; the URI's authority is used to determine the host,
    port, etc. for the file system.</description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>NameNode:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Number of replications when a file is created.</description>
  </property>

</configuration>

On the data nodes one extra property is added:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-datastore</value>
    <description>A base for Hadoop temporary directories.</description>
  </property>

Step- 9:

Format Hadoop's distributed filesystem (HDFS) via the namenode. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop namenode; this will erase all the data in the HDFS filesystem. The command is

bin/hadoop namenode -format

The HDFS name table is stored on the namenode's (here: NameNode) local filesystem in the directory specified by dfs.name.dir. The name table is used by the namenode to store tracking and coordination information for the datanodes.

Run the command bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on that machine and datanodes on the machines listed in the conf/slaves file.

Run the command bin/stop-dfs.sh on the namenode machine to stop the cluster.

Step-10:

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://NameNode:50070/ - web UI for the HDFS name node(s)

These web interfaces provide concise information about what's happening in your Hadoop cluster. You may have to update the hosts file on your Windows system to resolve the names to their IPs. From the NameNode you can do management as well as file operations via DFSShell. The command bin/hadoop dfs -help lists the operations permitted by DFS, and bin/hadoop dfsadmin -help lists the supported administration operations.

Step-11:

To add a new data node on the fly, just follow the above steps on the new node and execute the following command on it to join the cluster:

bin/hadoop-daemon.sh --config <conf-dir> start datanode

Step-12:

To set up a client machine, install Hadoop on it and set the JAVA_HOME variable in conf/hadoop-env.sh. To copy data to HDFS from the client, use the -fs switch of dfs with the URI of the namenode:

bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -mkdir remotecopy
bin/hadoop dfs -fs hdfs://10.245.0.121:54310 -copyFromLocal /home/devendra/jdk-1_5_0_14-linux-i586-rpm.bin remotecopy

Observations

1. I/O handling.

See the appendix for some test scripts and the log analysis.

2. Fault Tolerance

Observations on a two data-node cluster with replication factor 2: the data was accessible even if one of the data nodes was down.

Observations on a three data-node cluster with replication factor 2: the data was accessible when one of the data nodes was down; some data was accessible when two nodes were down.

Overflow condition: with a two data-node setup in which the nodes had 20 GB and 1 GB of free space, we tried to copy 10 GB of data. The copy operation was successful without any errors, but warnings in the log messages indicated that only one copy was made. (Presumably, if another data node were connected on the fly, the data would be replicated onto the new system; this still has to be verified.)

Accidental data loss: even if data blocks are removed from one of the data nodes, they are re-synchronized (this was observed).

Appendix

SCRIPT-1: The script copies 1 GB of data to HDFS and back to the local system indefinitely. The md5 checksums matched after the script was stopped. The script was executed from the namenode.

i=0
echo "[`date +%X`] :: start script" >>log
echo "size of movies directory is 1 GB" >>log
echo "[`date +%X`] :: creating a directory Movies" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -mkdir Movies
if [ $? -eq 0 ]
then
    echo "[`date +%X`] :: mkdir successful" >>log
fi
while [ 1 = 1 ]
do
    echo "------------------LOOP $i ------------------------" >>log
    echo "[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/Movies /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: copy successful" >>log
    fi
    echo "[`date +%X`] :: removing copy of file" >>log
    rm -rf /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: remove successful" >>log
    fi
    echo "[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/Movies /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: move back successful" >>log
    fi
    echo "[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: move back successful" >>log
    fi
    i=`expr $i + 1`
done

LOG

[03:48:52 PM] :: start script
size of movies directory is 1 GB
[03:48:52 PM] :: creating a directory Movies
[03:48:54 PM] :: mkdir successful
------------------LOOP 0 ------------------------
[03:48:54 PM] :: copying data into the directory
[03:51:15 PM] :: copy successful
[03:51:15 PM] :: removing copy of file
[03:51:16 PM] :: remove successful
[03:51:16 PM] :: copying back to local system
[03:52:58 PM] :: move back successful
[03:52:58 PM] :: removing the file from hadoop
[03:53:01 PM] :: move back successful
------------------LOOP 1 ------------------------
[03:53:01 PM] :: copying data into the directory
[03:55:23 PM] :: copy successful
[03:55:23 PM] :: removing copy of file
[03:55:24 PM] :: remove successful
[03:55:24 PM] :: copying back to local system
[03:57:03 PM] :: move back successful
[03:57:03 PM] :: removing the file from hadoop
[03:57:06 PM] :: move back successful
------------------LOOP 2 ------------------------
[03:57:06 PM] :: copying data into the directory
[03:59:26 PM] :: copy successful

Copying 1 GB of data from the local file system into Hadoop over a 100 Mbps LAN took about 140 seconds on average (see the copy timestamps in the log above), which works out to roughly 58 Mbps. Copying the same 1 GB back from Hadoop to the local file system took about 100 seconds on average (the copy-back timestamps), which works out to roughly 80 Mbps.
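The quoted rates are consistent with counting 1 GB as 1024 MB, i.e. 8192 Mbit; a quick check of the arithmetic:

# 1 GB = 1024 MB = 8192 Mbit
echo "scale=1; 8192/140" | bc    # ~58.5 Mbps for the ~140 s writes
echo "scale=1; 8192/100" | bc    # ~81.9 Mbps for the ~100 s reads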

Multithreaded script

Two threads are spawned, each of which copies data from the local system into Hadoop and back from Hadoop to the local system indefinitely. Logs are captured to analyze the I/O performance. The script was run for 48 hours and 850 loops were executed.

(
i=0
echo "thread1:[`date +%X`] :: start thread1" >>log
echo "thread1:size of thread1 directory 640MB" >>log
while [ 1 = 1 ]
do
    echo "thread1:------------------LOOP $i ------------------------" >>log
    echo "thread1:[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread1 /user/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: copy successful" >>log
    fi
    echo "thread1:[`date +%X`] :: removing copy of file" >>log
    rm -rf /home/hadoop/thread1
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: remove successful" >>log
    fi
    echo "thread1:[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread1 /home/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: move back successful" >>log
    fi
    echo "thread1:[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread1
    if [ $? -eq 0 ]
    then
        echo "thread1:[`date +%X`] :: deletion successful" >>log
    fi
    i=`expr $i + 1`
done
) &
(
j=0
echo "thread2:[`date +%X`] :: start thread2" >>log
echo "thread2:size of thread2 directory 640MB" >>log
while [ 1 = 1 ]
do
    echo "thread2:------------------LOOP $j ------------------------" >>log
    echo "thread2:[`date +%X`] :: copying data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread2 /user/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: copy successful" >>log
    fi
    echo "thread2:[`date +%X`] :: removing copy of file" >>log
    rm -rf /home/hadoop/thread2
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: remove successful" >>log
    fi
    echo "thread2:[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread2 /home/hadoop/
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: move back successful" >>log
    fi
    echo "thread2:[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread2
    if [ $? -eq 0 ]
    then
        echo "thread2:[`date +%X`] :: deletion successful" >>log
    fi
    j=`expr $j + 1`
done
) &
wait

LOG

Messages from thread1 are prefixed with "thread1:" and those from thread2 with "thread2:"; timing annotations are noted inline.

thread1:[05:15:00 PM] :: start thread1
thread2:[05:15:00 PM] :: start thread2
thread1:size of thread1 directory 640 MB
thread2:size of thread2 directory 640MB
thread1:------------------LOOP 0 ------------------------
thread2:------------------LOOP 0 ------------------------
thread1:[05:15:00 PM] :: copying data into the directory
thread2:[05:15:00 PM] :: copying data into the directory
  -- 139 seconds (to write)
thread1:[05:17:19 PM] :: copy successful
  -- NOTE (see below)
thread1:[05:17:19 PM] :: removing copy of file
thread1:[05:17:20 PM] :: remove successful
  -- 152 seconds (to write)
thread1:[05:17:20 PM] :: copying back to local system
thread2:[05:17:32 PM] :: copy successful
thread2:[05:17:32 PM] :: removing copy of file
thread2:[05:17:33 PM] :: remove successful
thread2:[05:17:33 PM] :: copying back to local system
  -- 110 seconds (to read)
thread2:[05:19:23 PM] :: move back successful
thread2:[05:19:23 PM] :: removing the file from hadoop
thread1:[05:19:26 PM] :: move back successful
thread1:[05:19:26 PM] :: removing the file from hadoop
thread2:[05:19:28 PM] :: deletion successful
thread1:[05:19:29 PM] :: deletion successful
thread1:------------------LOOP 1 ------------------------
thread1:[05:19:29 PM] :: copying data into the directory
thread2:------------------LOOP 1 ------------------------
thread2:[05:19:29 PM] :: copying data into the directory
thread1:[05:21:43 PM] :: copy successful
thread1:[05:21:43 PM] :: removing copy of file
thread1:[05:21:44 PM] :: remove successful
thread1:[05:21:44 PM] :: copying back to local system
thread2:[05:21:48 PM] :: copy successful
thread2:[05:21:48 PM] :: removing copy of file
thread2:[05:21:49 PM] :: remove successful
  -- 120 seconds (to read)
thread2:[05:21:49 PM] :: copying back to local system
thread1:[05:23:44 PM] :: move back successful
thread1:[05:23:44 PM] :: removing the file from hadoop
thread1:[05:23:49 PM] :: deletion successful
  -- 125 seconds (to read)
thread1:------------------LOOP 2 ------------------------
thread1:[05:23:49 PM] :: copying data into the directory
thread2:[05:23:49 PM] :: move back successful

NOTE: Copying started at the same time for thread1 and thread2 and also finished at about the same time. That means 1.28 GB of data was transferred in 152 seconds, an average throughput of about 70 Mbps. So does adding more systems to the cluster increase the aggregate speed (though there would certainly be a saturation point beyond which the speed comes down)?

MAP-REDUCE

MapReduce is a programming model for processing large data sets. A map function processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system (Hadoop) takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

For implementation details of MapReduce in Hadoop, follow the link below:
http://wiki.apache.org/lucene-hadoop-data/attachments/HadoopPresentations/attachments/HadoopMapReduceArch.pdf

For a clear understanding of MapReduce, follow the link below:
http://209.85.163.132/papers/mapreduce-osdi04.pdf

Sample MapReduce implementation

The program mimics the WordCount example: it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. The "trick" behind the following Python code is that we use Hadoop Streaming to pass data between our map and reduce code via STDIN (standard input) and STDOUT (standard output); we simply use Python's sys.stdin to read input data and print our own output to sys.stdout. Save the files as mapper.py and reducer.py (Python 2.4 or greater is required) in /home/hadoop and give them executable permissions. The MapReduce daemons need to be started before submitting jobs (bin/start-mapred.sh); both steps are shown below.
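A short sketch of these two preparation steps, using the paths from this document:

# make the streaming scripts executable
chmod +x /home/hadoop/mapper.py /home/hadoop/reducer.py
# start the MapReduce daemons (JobTracker on the master, TaskTrackers on the slaves)
bin/start-mapred.sh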

mapper.py

It reads data from STDIN (standard input), splits it into words, and outputs a list of lines mapping words to their (intermediate) counts to STDOUT (standard output).

#!/usr/bin/env python

import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

reducer.py

It reads the results of mapper.py from STDIN (standard input), sums the occurrences of each word to a final count, and outputs the results to STDOUT (standard output).

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split()
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)

Test the code as follows

[hadoop@NameNode ~]$echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | /home/hadoop/reducer.py

bar     1
foo     3
labs    1
quux    2

Implementation on hadoop

Copy some large plain text files (typically in GBs) into a local directory, say test, and copy the data into HDFS:

[hadoop@NameNode ~]$ hadoop dfs -copyFromLocal /path/to/test test

Run the mapreduce job

[hadoop@NameNode ~]$ bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input test/* -output mapreduce-output

The results can be viewed at http://localhost:50030/, or the output can be copied to the local system:

hadoop dfs -copyToLocal mapreduce-output <local-dir>

Inverted Index: Example of a mapreduce job

Suppose there are three documents with some text content and we have to compute the inverted index using map-reduce.

Doc-1          Doc-2          Doc-3
Hello          Hello          World
World          India          is
welcome        welcoming
to             India
India

Map Phase

Reduce Phase

Words such as "to", "is", etc. are considered noise and should be filtered out appropriately.
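The map and reduce phases above can be simulated locally with a small shell sketch (not a Hadoop job; Doc-1.txt, Doc-2.txt, and Doc-3.txt are assumed to hold the sample text, and sort stands in for the shuffle). The map step emits one (word, document) pair per word occurrence; the reduce step folds the sorted pairs into word -> list of documents. A stop-word filter for words such as "to" and "is" could be added in the map step.

#!/bin/sh
# Map phase: emit one "word document" pair per word occurrence in each file.
# The shuffle is simulated by sort; the awk program plays the reduce phase
# and collects, for every word, the list of documents it occurs in.
for doc in Doc-1.txt Doc-2.txt Doc-3.txt
do
    tr -s ' \t' '\n\n' < "$doc" | grep -v '^$' | sed "s|\$| $doc|"
done | sort | awk '
{
    if ($1 != prev) {
        if (prev != "") print prev ": " docs
        prev = $1; docs = $2
    } else if (index(docs, $2) == 0) {
        docs = docs ", " $2
    }
}
END { if (prev != "") print prev ": " docs }'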