Automated Hadoop Cluster Construction on EC2
Uploaded by markkerzner
Automated Hadoop Clusters on EC2
Mark Kerzner, SHMsoft
What is Hadoop? :) :) :)
Everybody knows that ... What is your definition?
What is a cloud?
Everybody knows that, but
1. Elastic resources
2. Internet delivery
3. SaaS
4. Virtualization
5. Device-enabled
6. Only (1) or all of the above
You are the Hadoop programmer
... and you need tools
What are your alternatives?
● IDE - compile and run the code
● Local "cluster" - local file system
● Pseudo-distributed cluster - test outside
● EC2 - test on the cluster, test for scale
What are your resources?
● Tom White, "Hadoop: The Definitive Guide"
● www.hadoopilluminated.com
For real play, you need a cluster
Hadoop+ (oh, by the way...)
HBase, Cassandra, MongoDB, NoSQL, Dynamo, BigTable, Dryad (MS), Azure (MS), MapReduce, MapR (EMC), Cloudera distribution, EMC distribution, IBM distribution...
Whirr
Setup
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
Install
    curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
    tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1
Generate key
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
Run
    bin/whirr launch-cluster --config recipes/zookeeper-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr
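The launch command above uses the ZooKeeper recipe. For a Hadoop cluster, a recipe along the following lines could be generated instead. This is only a sketch: the file name hadoop-ec2.properties, the cluster name, and the worker count are illustrative assumptions, and the property and role names are written from memory of Whirr's bundled Hadoop recipe, so check them against the recipes directory of your Whirr release.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal Whirr recipe for a small Hadoop cluster on EC2.
# File name, cluster name, and node counts are illustrative assumptions.
set -eu

write_hadoop_recipe() {
  local out_file="$1"   # where to write the recipe
  local workers="$2"    # number of datanode+tasktracker instances
  # \${...} keeps Whirr's own placeholders literal; ${workers} is expanded by the shell.
  cat > "$out_file" <<EOF
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,${workers} hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=\${env:AWS_ACCESS_KEY_ID}
whirr.credential=\${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=\${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=\${sys:user.home}/.ssh/id_rsa_whirr.pub
EOF
}

write_hadoop_recipe hadoop-ec2.properties 3
```

The generated file would then be passed to bin/whirr launch-cluster --config hadoop-ec2.properties, and the same file to bin/whirr destroy-cluster when the cluster is no longer needed.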
Whirr limitations
● No EBS
● All or nothing
● Generates configuration artifacts
● Takes over your computer, no more local development - uses proxy
● Hard to customize
Amazon EMR
EMR limitations
● No choice of image
● Fixed architecture
● Hard to debug
● Hard to customize
You do it
Repeat the manual procedure, only automate it
Prepare: AMI, Java, Hadoop
On-the-fly: Start AMI, login, configure, start services, verify, run test jobs
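The on-the-fly steps can be sketched as a script. The talk does not say which tools were used, so this is an assumption-heavy illustration: it uses the AWS CLI and ssh/scp, and the AMI id, key file, login, host name, instance type, Hadoop paths, and the examples jar name are all hypothetical placeholders. By default it only prints each command (DRY_RUN=1), so it is safe to run as-is.

```shell
#!/usr/bin/env bash
# Sketch: start AMI, login, configure, start services, verify, run a test job.
# All values below are hypothetical placeholders, not the author's setup.
set -eu

DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands, anything else = execute
AMI_ID="${AMI_ID:-ami-EXAMPLE}"          # prepared image with Java and Hadoop
NODE_COUNT="${NODE_COUNT:-4}"
INSTANCE_TYPE="${INSTANCE_TYPE:-m1.large}"
KEY_NAME="${KEY_NAME:-hadoop-key}"
KEY_FILE="${KEY_FILE:-hadoop-key.pem}"
LOGIN="${LOGIN:-ec2-user}"
MASTER="${MASTER:-master.example.com}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run a command on the master node over ssh.
remote() {
  run ssh -i "$KEY_FILE" "$LOGIN@$MASTER" "$@"
}

build_cluster() {
  # Start AMI
  run aws ec2 run-instances --image-id "$AMI_ID" --count "$NODE_COUNT" \
      --instance-type "$INSTANCE_TYPE" --key-name "$KEY_NAME"
  # Login and configure: push the site configuration to the master
  run scp -i "$KEY_FILE" conf/core-site.xml "$LOGIN@$MASTER:hadoop/conf/"
  # Start services
  remote hadoop/bin/start-dfs.sh
  remote hadoop/bin/start-mapred.sh
  # Verify
  remote hadoop/bin/hadoop dfsadmin -report
  # Run a test job (jar name is a placeholder)
  remote hadoop/bin/hadoop jar hadoop/hadoop-examples.jar pi 2 10
}

build_cluster
```

Keeping every step behind run makes it easy to review the full command sequence before letting the script touch a real account.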
You do it - advanced
On startup: Under-provision, over-provision, progress
On-the-fly: Monitor, run test jobs, watch for cluster deterioration
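One way to picture the startup checks is a small polling loop that compares the number of live nodes with the number requested. This is a hedged sketch, not the author's tooling: provision_status and wait_for_nodes are hypothetical helpers, and the command that reports the live node count is passed in by the caller (here a stub), since how you obtain that count depends on your cluster.

```shell
#!/usr/bin/env bash
# Sketch: report under-/over-provisioning and startup progress.
# Helper names and the node-count source are illustrative assumptions.
set -eu

# Compare the live node count with the expected one.
provision_status() {
  local expected="$1" live="$2"
  if [ "$live" -lt "$expected" ]; then
    echo "under-provisioned: $live of $expected nodes live, $((expected - live)) missing"
  elif [ "$live" -gt "$expected" ]; then
    echo "over-provisioned: $live of $expected nodes live, $((live - expected)) extra"
  else
    echo "ok: $live of $expected nodes live"
  fi
}

# Poll until enough nodes are live or the attempts run out.
# count_cmd is any command or function that prints the live node count.
wait_for_nodes() {
  local expected="$1" attempts="$2" delay="$3" count_cmd="$4"
  local i live
  for ((i = 1; i <= attempts; i++)); do
    live="$($count_cmd)"
    echo "attempt $i: $(provision_status "$expected" "$live")"
    if [ "$live" -ge "$expected" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Stub standing in for a real query against the cluster.
fake_live_count() { echo 3; }

provision_status 4 3
wait_for_nodes 3 5 0 fake_live_count
```

The same loop, run periodically after startup with a real count command, is a simple way to watch for cluster deterioration.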
Cloudera Manager
MapR Manager
On the large scale
Hadoop 0.20 - up to 4,000 nodes
Hadoop 0.23 - up to 20,000
GridGain - 100's of 1,000's
Thank you
Questions?