Hadoop on osx

Apache Hadoop cluster on Macintosh OSX

The Trigger #DIY

http://dilbert.com/strips/comic/1995-12-10/

The Kitchen Setup

The Network

Master Chef a.k.a Namenode

Helpers a.k.a Datanode(s)

The Base Ingredients

• OSX 10.7.5

• Hadoop 2.4.0

• Hive 0.13.0

• MySQL 5.6.17

• Java 1.7.0.55

• Homebrew 0.9.5

• 200 MB/s

http://brew.sh/

http://hadoop.apache.org/

Basics

• Ensure that all the namenode and datanode machines are running the same OSX version

• For the purpose of this POC, I have selected OSX 10.7.5. All sample commands are specific to this OS. You may need to tweak the commands to suit your OS version

• I am a Homebrew fan, so I have used this tried-and-tested Ruby-based platform to download all the software needed to run the POC. You may well opt to download the installers individually and tweak the process if you wish

• You will need a fair bit of understanding of OSX and Hadoop to follow and interpret the steps. If not, no worries: most of the stuff can be looked up online with a simple Google search

• The “Namenode” machine needs more RAM than the “Datanode” machines. Please configure the namenode machine with at least 8 GB RAM

http://www.google.com

The Cooking

• Ensure that ALL datanode and namenode machines are running the same OSX version and preferably have a regulated software update strategy (i.e. automatic software updates disabled)

• Disable the automatic “sleep” options on the machines so that they do not go into hibernation (from System Preferences)

• Download and install “Xcode command line tools for Lion” (skip if Xcode is present)

• As of today, Hadoop is not IPv6 friendly. So, please disable IPv6 on all machines:

“networksetup -listallnetworkservices” displays all the network service names that your machine uses to connect to your network (e.g. Ethernet, Wi-Fi etc.)

“networksetup -setv6off Ethernet” disables IPv6 over Ethernet (change the network name if yours is different)
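The two commands above can be combined so that IPv6 is switched off on every network service rather than on Ethernet alone. The sketch below is one possible way to do it, not part of the original procedure: the function name ipv6_off_commands is made up for illustration, and it only prints the commands so that they can be reviewed first. It assumes that the first line of the listing is an explanatory header and that disabled services are prefixed with an asterisk.

```shell
#!/bin/sh
# ipv6_off_commands (hypothetical helper): reads the output of
# "networksetup -listallnetworkservices" on stdin and prints one
# "networksetup -setv6off" command per network service.
ipv6_off_commands() {
  # Assumption: line 1 is the header explaining the asterisk notation.
  tail -n +2 | while IFS= read -r service; do
    service="${service#\*}"        # drop the "*" that marks a disabled service
    [ -n "$service" ] || continue  # ignore blank lines
    printf 'networksetup -setv6off "%s"\n' "$service"
  done
}
```

On a real machine this could be used as "networksetup -listallnetworkservices | ipv6_off_commands". Review the printed commands before running them; administrator rights may be needed.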

http://bit.ly/1ztwcO5

The Cooking..

• Give logical names to ALL machines, e.g. namenode.local, datanode01.local, datanode02.local et al. (from System Preferences -> Sharing -> Computer Name)

• Enable the following services from the Sharing panel of System Preferences

– File Sharing

– Remote Login

– Remote Management

• Create one universal username (with Administrator privileges) on all machines, e.g. hadoopuser. Preferably use the same password everywhere

• For the rest of the steps, please log in as this user and execute the commands

The Cooking

• On the namenode, run the command:

sudo vi /etc/hosts

• Add all datanode hostnames, one host per line, each preceded by its IP address

• On each of the datanodes, run the command:

sudo vi /etc/hosts

• Add the namenode hostname, preceded by its IP address
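If you would rather script the hosts-file edits than type them in vi, a small helper along the following lines may do. This is only a sketch: the function name add_host_entry is hypothetical, and any IP addresses passed to it (such as 192.168.1.11) are placeholders for the real addresses on your network.

```shell
#!/bin/sh
# add_host_entry (hypothetical helper): append "IP<TAB>hostname" to a hosts
# file unless the hostname already appears somewhere in that file.
# $1 = hosts file, $2 = IP address, $3 = hostname
add_host_entry() {
  hosts_file="$1"; ip="$2"; name="$3"
  # -F: fixed string, -w: whole word only, -q: no output
  if grep -qwF -- "$name" "$hosts_file" 2>/dev/null; then
    return 0
  fi
  printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
}
```

A call would look like "add_host_entry /etc/hosts 192.168.1.11 datanode01.local". Because /etc/hosts is owned by root, a script using this helper would have to be run with sudo. Running it twice for the same hostname leaves the file unchanged.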

sudo visudo

• Add an entry as the last line of the file, as follows:

hadoopuser ALL=(ALL) NOPASSWD: ALL

Coffee Time

• Install the Java JDK and JRE on all the machines from the Oracle site (http://bit.ly/1s2i7VC)

• Set $JAVA_HOME on ALL machines. Usually, it is best to configure it in your .profile file. Run the following command to open your .profile:

vi ~/.profile

• Paste the following line into the file and save it:

export JAVA_HOME="`/System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java_home`"

• You may additionally paste the following lines into the same file:

export PATH=$PATH:/usr/local/sbin

PS1="\H : \d \t: \w :"

This is helpful for housekeeping activities

The Brewing

• Install “brew” and the other components from it. Run on terminal:

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

[the quotes need to be there]

Run the following command on terminal to ensure that it has been installed properly:

brew doctor

Run the following commands in the same order on terminal:

brew install makedepend

brew install wget

brew install ssh-copy-id

brew install hadoop

Run the following commands on the “namenode” machine:

brew install hive

brew install mysql

[the assumption is that the namenode will host the resourcemanager, jobtracker, hive metastore and hiveserver.

brew installs the software in the “/usr/local/Cellar” location]
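Since the same installs are repeated on every machine, they can be wrapped in a short loop. The following is a sketch only: install_formulae, BREW, COMMON_FORMULAE and NAMENODE_FORMULAE are names invented here, and the BREW variable exists purely so that the commands can be previewed (BREW=echo) before anything is installed.

```shell
#!/bin/sh
# BREW can be overridden (for example BREW=echo) to preview the commands.
BREW="${BREW:-brew}"

# install_formulae (hypothetical helper): install each formula in the order
# given, stopping at the first failure.
install_formulae() {
  for formula in "$@"; do
    "$BREW" install "$formula" || return 1
  done
}

# Formulae for every machine, in the order listed in the steps above.
COMMON_FORMULAE="makedepend wget ssh-copy-id hadoop"
# Additional formulae for the namenode only.
NAMENODE_FORMULAE="hive mysql"
```

On every machine you would then call "install_formulae $COMMON_FORMULAE", and on the namenode additionally "install_formulae $NAMENODE_FORMULAE". The variables are left unquoted on purpose so that each formula becomes a separate argument.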

Run the following command to set up keyless login from the namenode to ALL datanodes. Run the command on the namenode:

ssh-keygen

[press the Enter key at each prompt to accept the default RSA key file and an empty passphrase]

Run the following command once for EACH datanode hostname. Run the command on the namenode:

ssh-copy-id hadoopuser@datanode01.local

Provide the password when prompted. The command is verbose and tells you whether the key has been installed properly. You may validate this by executing the command:

ssh hadoopuser@datanode01.local

It should NOT ask you to supply a password anymore.
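With more than a couple of datanodes, the key copy is easier to run as a loop on the namenode. The sketch below assumes the shared account is hadoopuser and that the datanodes are named as suggested earlier (datanode01.local and so on). The function name copy_key_to_datanodes and the SSH_COPY_ID and HADOOP_USER variables are made up for this example.

```shell
#!/bin/sh
# SSH_COPY_ID can be overridden (for example SSH_COPY_ID=echo) for a dry run.
SSH_COPY_ID="${SSH_COPY_ID:-ssh-copy-id}"
# Assumption: the universal account created earlier is called hadoopuser.
HADOOP_USER="${HADOOP_USER:-hadoopuser}"

# copy_key_to_datanodes (hypothetical helper): install the namenode's public
# key on every datanode passed as an argument; report failures and carry on.
copy_key_to_datanodes() {
  for host in "$@"; do
    "$SSH_COPY_ID" "$HADOOP_USER@$host" || echo "key copy failed for $host" >&2
  done
}
```

For example: "copy_key_to_datanodes datanode01.local datanode02.local". You will still be asked for the password once per datanode.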

After the requisite software has been installed, the next step is to configure the different components in a stepwise manner. Hadoop works in a distributed mode with the “namenode” being the central hub of the cluster. This is reason enough to create the common configuration files on the namenode first, and then copy them in an automated manner to all the datanodes. Let’s start with the .profile changes on the namenode machine.

The Saute

We are going to configure Hive to use MySQL as the metastore for this POC. All we need to do is create a DB user “hiveuser” with a valid password in the MySQL DB installed and running on the namenode, AND copy the MySQL driver jar into the Hive lib directory
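The user creation could look roughly like the sketch below, which only prints the SQL so that it can be inspected before it is sent to MySQL. Several things here are assumptions and not taken from the slides: the database name metastore, the 'localhost' host part of the account, and the helper name metastore_sql. Copying the driver jar is not covered, since its location depends on where you downloaded it.

```shell
#!/bin/sh
# metastore_sql (hypothetical helper): print SQL that creates a metastore
# database and the "hiveuser" account with full rights on it.
# $1 = password for hiveuser
# $2 = database name (assumed default: metastore)
metastore_sql() {
  password="$1"; database="${2:-metastore}"
  cat <<EOF
CREATE DATABASE IF NOT EXISTS ${database};
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY '${password}';
GRANT ALL PRIVILEGES ON ${database}.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
EOF
}
```

On the namenode, the output could be piped into the MySQL client, e.g. "metastore_sql 'choose-a-password' | mysql -u root -p".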

On the namenode, run the command to go to your HADOOP_CONF_DIR location:

cd /usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop

Here, we need to create/modify the following set of files:

slaves

core-site.xml

hdfs-site.xml

mapred-site.xml

yarn-site.xml

log4j.properties
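To give a feel for the two simplest files in this list, here is a hedged sketch that writes a slaves file (one datanode hostname per line) and a minimal core-site.xml pointing at the namenode. It is not the configuration used in this POC: the helper name write_basic_conf is invented, port 9000 is an assumption, and a real cluster will need more properties than fs.defaultFS. Note that the function overwrites both files in the target directory.

```shell
#!/bin/sh
# write_basic_conf (hypothetical helper): OVERWRITES slaves and core-site.xml.
# $1 = conf directory, $2 = namenode hostname, remaining args = datanodes
write_basic_conf() {
  conf_dir="$1"; namenode="$2"; shift 2
  # slaves: one datanode hostname per line
  printf '%s\n' "$@" > "$conf_dir/slaves"
  # core-site.xml: tell every node where the namenode's HDFS endpoint is.
  # Port 9000 is an assumption; use whatever port your cluster uses.
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${namenode}:9000</value>
  </property>
</configuration>
EOF
}
```

Try it against a scratch directory first, e.g. "write_basic_conf /tmp/hadoop-conf namenode.local datanode01.local datanode02.local", and copy the result into HADOOP_CONF_DIR once you are happy with it.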

On the namenode, run the command to go to your HIVE_CONF_DIR location:

cd /usr/local/Cellar/hive/0.13.0/libexec/conf

Here, we need to create/modify the following set of files:

hive-site.xml

hive-log4j.properties

The Slow cooking

http://bit.ly/W3vism

Please find attached a simple script that, if installed on the namenode, can help you copy your config files to ALL datanodes (I call it the config-push)

Please find attached another simple script that I use for rebooting all the datanodes.
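The attached scripts are not reproduced in this transcript. As a stand-in, the sketch below shows one way such helpers might be written; it is not the author's config-push or reboot script. It assumes rsync and ssh are available, that keyless login and passwordless sudo are set up for hadoopuser, and that the datanode hostnames are read from a slaves file. The names config_push, reboot_datanodes, RUN, HADOOP_USER and CONF_DIR are all invented for the example.

```shell
#!/bin/sh
# RUN is prepended to every command; set RUN=echo to print instead of execute.
RUN="${RUN:-}"
HADOOP_USER="${HADOOP_USER:-hadoopuser}"
CONF_DIR="${CONF_DIR:-/usr/local/Cellar/hadoop/2.4.0/libexec/etc/hadoop}"

# config_push (hypothetical): copy the namenode's conf directory to the same
# path on every host listed in the slaves file given as $1.
config_push() {
  slaves_file="$1"
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    # stdin comes from /dev/null so the command cannot swallow the host list
    $RUN rsync -a "$CONF_DIR/" "$HADOOP_USER@$host:$CONF_DIR/" < /dev/null
  done < "$slaves_file"
}

# reboot_datanodes (hypothetical): restart every host listed in the slaves file.
reboot_datanodes() {
  slaves_file="$1"
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    $RUN ssh "$HADOOP_USER@$host" "sudo shutdown -r now" < /dev/null
  done < "$slaves_file"
}
```

A dry run such as "RUN=echo; config_push slaves" prints the rsync commands without touching any machine, which is a sensible first check before pushing to the whole cluster.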

The Plating

You may wish to take the next steps if desired:

• Install zookeeper

• Configure and run journalnodes

• Go for a High Availability cluster implementation with multiple Namenodes

Leave feedback if you would like to see the Hadoop configuration samples

The Garnishing

Disclaimer: Don’t sue me for any damage/infringement, I am not rich
