Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter
Hadoop Conference Japan 2013 Winter
Jan 21, 2013 @ryu_kobayashi
Huahin Framework for Hadoop
• Ryu Kobayashi (@ryu_kobayashi)
• BrainPad Inc.
• Hadoop, Cassandra, Machine Learning, ...
AD
Now on sale!!!
What is Huahin Framework?
Huahin Framework http://huahinframework.org
Hadoop Family
Logo is ...
Huahin logo is ...
Very very very cute!
In June 2012 we released as OSS some software that had been developed in-house.
* It is the software that was used for the panel log analysis.
* For more information, please refer to the slides from "Hadoop Conference Japan 2011 Fall".
http://goo.gl/C9tzf
Huahin Framework is a general term for multiple products.
The origin of the name of Huahin Framework
Our office has a custom of using wine regions as code names.
Huahin = Hua Hin = Tourist destination in Thailand = Wine region
When it comes to Thailand...
It is the elephant!
Hence the Huahin image.
Huahin Framework Configuration
It mainly consists of the following elements:
• Huahin Core
• Huahin Tools
• Huahin Manager
Huahin Core
• Simplifies MapReduce programs
• You do not have to write Writable classes or Secondary Sort yourself
• Basic grouping, sorting, etc. follow ideas from SQL
• If you want to, you can still write natural MapReduce
• This is similar to C++ being a superset of C
• Hive or Pig can do the same work. However, Huahin Core is for when you really need performance (parallel computation, etc.)
• Huahin Unit is provided as a test driver
• It wraps MRUnit
• Example of implementation
Huahin Example
• Page top 10 rank example
First, natural MapReduce. Second, Huahin MapReduce.
Data of page top 10 rank
Jan 21, 2013	user1	/index.html
Jan 21, 2013	user1	/index2.html
Jan 21, 2013	user2	/contents/foo.html
Jan 21, 2013	user42	/bar.html
Jan 21, 2013	user3	/index.html
Jan 21, 2013	user7	/news/index.html
Jan 21, 2013	user4	/release/2013.html
Jan 21, 2013	user3	/index2.html
Jan 21, 2013	user7	/download.html
Jan 21, 2013	user5	/bar.html
Jan 21, 2013	user12	/release/2012.html
Jan 21, 2013	user5	/contents/foo.html
Jan 21, 2013	user23	/page2.html
Jan 21, 2013	user53	/news.html
Jan 21, 2013	user6	/download.html
Jan 21, 2013	user21	/bar.html
Jan 21, 2013	user18	/index.html
Example: the format is tab-delimited
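As a point of reference for what both versions of the job compute, here is a minimal single-process sketch in plain Java. It assumes each line is date, user and path separated by tabs, as in the sample data. The class and method names (TopPagesSketch, countViews, top) are made up for illustration and are not part of Hadoop or Huahin, and the tie-break by path is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TopPagesSketch {
    /** Counts page views per date and path from tab-delimited "date, user, path" lines. */
    static Map<String, Map<String, Integer>> countViews(List<String> lines) {
        Map<String, Map<String, Integer>> byDate = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip malformed lines
            }
            byDate.computeIfAbsent(fields[0], d -> new TreeMap<>())
                  .merge(fields[2], 1, Integer::sum);
        }
        return byDate;
    }

    /** Returns up to n "path TAB count" rows, highest count first (ties ordered by path). */
    static List<String> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            rows.add(e.getKey() + "\t" + e.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Jan 21, 2013\tuser1\t/index.html",
                "Jan 21, 2013\tuser42\t/bar.html",
                "Jan 21, 2013\tuser3\t/index.html");
        countViews(lines).forEach((date, counts) ->
                top(counts, 10).forEach(row -> System.out.println(date + "\t" + row)));
    }
}
```

The MapReduce versions split the same two steps (count, then rank) across two jobs so that each step can run in parallel.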
Page top 10 rank of natural MapReduce
JobTools

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;

public class PathTop10RankJobTool extends Configured implements Tool {
    @Override
    public int run(String[] arg0) throws Exception {
        // First job: count page views per (date, page).
        Job firstJob = new Job(getConf(), "first");
        firstJob.setJarByClass(PathTop10RankJobTool.class);

        TextInputFormat.setInputPaths(firstJob, "input");
        firstJob.setInputFormatClass(TextInputFormat.class);

        firstJob.setMapperClass(PathTop10RankFirstMapper.class);
        firstJob.setMapOutputKeyClass(FirstKeyWritable.class);
        firstJob.setMapOutputValueClass(IntWritable.class);

        firstJob.setReducerClass(PathTop10RankFirstReducer.class);
        firstJob.setOutputKeyClass(SecondKeyWritable.class);
        firstJob.setOutputValueClass(IntWritable.class);

        SequenceFileOutputFormat.setOutputPath(firstJob, new Path("first"));
        firstJob.setOutputFormatClass(SequenceFileOutputFormat.class);

        if (!firstJob.waitForCompletion(true)) {
            return -1;
        }

        // Second job: rank the pages of each date by page views (secondary sort).
        Job secondJob = new Job(getConf(), "second");
        secondJob.setJarByClass(PathTop10RankJobTool.class);

        SequenceFileInputFormat.setInputPaths(secondJob, "first");
        secondJob.setInputFormatClass(SequenceFileInputFormat.class);

        // The identity Mapper passes the first job's output through unchanged.
        secondJob.setMapperClass(Mapper.class);
        secondJob.setMapOutputKeyClass(SecondKeyWritable.class);
        secondJob.setMapOutputValueClass(IntWritable.class);

        secondJob.setGroupingComparatorClass(PathTop10RankGroupingComparatorClass.class);
        secondJob.setPartitionerClass(PathTop10RankPartitioner.class);
        secondJob.setSortComparatorClass(PathTop10RankingSortComparator.class);

        secondJob.setReducerClass(PathTop10RankSecondReducer.class);
        secondJob.setOutputKeyClass(SecondKeyWritable.class);
        secondJob.setOutputValueClass(IntWritable.class);

        TextOutputFormat.setOutputPath(secondJob, new Path("output"));
        secondJob.setOutputFormatClass(TextOutputFormat.class);

        return secondJob.waitForCompletion(true) ? 0 : -1;
    }
}
Page top 10 rank of natural MapReduce
FirstMapper

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PathTop10RankFirstMapper
        extends Mapper<LongWritable, Text, FirstKeyWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input columns: date, user, path (tab-delimited).
        String[] s = value.toString().split("\t");
        context.write(new FirstKeyWritable(s[0], s[2]), ONE);
    }
}

FirstReducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PathTop10RankFirstReducer
        extends Reducer<FirstKeyWritable, IntWritable, SecondKeyWritable, IntWritable> {
    @Override
    protected void reduce(FirstKeyWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the page views of one (date, page) pair.
        int pv = 0;
        for (IntWritable i : values) {
            pv += i.get();
        }

        context.write(
                new SecondKeyWritable(key.getDate().toString(), key.getPage().toString(), pv),
                new IntWritable(pv));
    }
}
Page top 10 rank of natural MapReduce
SecondReducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PathTop10RankSecondReducer
        extends Reducer<SecondKeyWritable, IntWritable, SecondKeyWritable, IntWritable> {
    @Override
    protected void reduce(SecondKeyWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int rank = 0;
        for (IntWritable i : values) {
            // Stop after 10 records (rank counts from 0, so compare with >=).
            if (rank >= 10) {
                break;
            }

            context.write(key, i);
            rank++;
        }
    }
}
Page top 10 rank of natural MapReduce
FirstKeyWritable

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class FirstKeyWritable implements WritableComparable<FirstKeyWritable> {
    private Text date = new Text();
    private Text page = new Text();

    public FirstKeyWritable() {
    }

    public FirstKeyWritable(String date, String page) {
        this.date.set(date);
        this.page.set(page);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.date.readFields(in);
        this.page.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        this.date.write(out);
        this.page.write(out);
    }

    @Override
    public int compareTo(FirstKeyWritable o) {
        int compare = this.date.toString().compareTo(o.date.toString());
        if (compare != 0) {
            return compare;
        }
        return this.page.toString().compareTo(o.page.toString());
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }

        if (!(obj instanceof FirstKeyWritable)) {
            return false;
        }

        FirstKeyWritable o = (FirstKeyWritable) obj;
        return this.date.equals(o.getDate()) && this.page.equals(o.getPage());
    }

    // Needed because the default HashPartitioner partitions by hashCode():
    // equal keys must hash equally on every node.
    @Override
    public int hashCode() {
        return 31 * this.date.hashCode() + this.page.hashCode();
    }

    /** @return the date */
    public Text getDate() {
        return date;
    }

    /** @param date the date to set */
    public void setDate(Text date) {
        this.date = date;
    }

    /** @return the page */
    public Text getPage() {
        return page;
    }

    /** @param page the page to set */
    public void setPage(Text page) {
        this.page = page;
    }
}

SecondKeyWritable

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class SecondKeyWritable implements WritableComparable<SecondKeyWritable> {
    private Text date = new Text();
    private Text page = new Text();
    private IntWritable pv = new IntWritable();

    public SecondKeyWritable() {
    }

    public SecondKeyWritable(String date, String page, int pv) {
        this.date.set(date);
        this.page.set(page);
        this.pv.set(pv);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.date.readFields(in);
        this.page.readFields(in);
        this.pv.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        this.date.write(out);
        this.page.write(out);
        this.pv.write(out);
    }

    // Natural order and equality use the date only; the ordering by page views
    // is supplied by the sort comparator set on the job.
    @Override
    public int compareTo(SecondKeyWritable o) {
        return this.date.toString().compareTo(o.date.toString());
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }

        if (!(obj instanceof SecondKeyWritable)) {
            return false;
        }

        SecondKeyWritable o = (SecondKeyWritable) obj;
        return this.date.equals(o.getDate());
    }

    // Kept consistent with equals(), which compares the date only.
    @Override
    public int hashCode() {
        return this.date.hashCode();
    }

    @Override
    public String toString() {
        return this.date + "\t" + this.page;
    }

    /** @return the date */
    public Text getDate() {
        return date;
    }

    /** @param date the date to set */
    public void setDate(Text date) {
        this.date = date;
    }

    /** @return the page */
    public Text getPage() {
        return page;
    }

    /** @param page the page to set */
    public void setPage(Text page) {
        this.page = page;
    }

    /** @return the pv */
    public IntWritable getPv() {
        return pv;
    }

    /** @param pv the pv to set */
    public void setPv(IntWritable pv) {
        this.pv = pv;
    }
}
Page top 10 rank of natural MapReduce
GroupingComparator

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class PathTop10RankGroupingComparatorClass extends WritableComparator {
    public PathTop10RankGroupingComparatorClass() {
        super(SecondKeyWritable.class, true);
    }

    // Override the WritableComparable overload: it is the one that the
    // byte-level compare() of WritableComparator calls after deserializing.
    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        SecondKeyWritable one = (SecondKeyWritable) a;
        SecondKeyWritable another = (SecondKeyWritable) b;
        // Keys with the same date go to the same reduce() call.
        return one.getDate().compareTo(another.getDate());
    }
}

Partitioner

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class PathTop10RankPartitioner extends Partitioner<SecondKeyWritable, IntWritable> {
    @Override
    public int getPartition(SecondKeyWritable key, IntWritable value, int numPartitions) {
        // Mask the sign bit instead of Math.abs(), which stays negative
        // for Integer.MIN_VALUE.
        return (key.getDate().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

SortComparator

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class PathTop10RankingSortComparator extends WritableComparator {
    public PathTop10RankingSortComparator() {
        super(SecondKeyWritable.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        SecondKeyWritable one = (SecondKeyWritable) a;
        SecondKeyWritable another = (SecondKeyWritable) b;

        int compare = one.getDate().compareTo(another.getDate());
        if (compare != 0) {
            return compare;
        }

        // Descending page views, so that the first 10 values are the top 10.
        return another.getPv().compareTo(one.getPv());
    }
}
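How the partitioner, sort comparator and grouping comparator cooperate can be easier to see outside Hadoop. The following is a small simulation in plain Java, not Hadoop code: Key is a simplified stand-in for SecondKeyWritable, and all names (SecondarySortSketch, partition, sortCompare, sortAndGroup) are hypothetical. It assumes the ranking should be by descending page views.

```java
import java.util.ArrayList;
import java.util.List;

final class SecondarySortSketch {
    /** Simplified stand-in for SecondKeyWritable (assumption, not the real class). */
    static final class Key {
        final String date;
        final String page;
        final int pv;

        Key(String date, String page, int pv) {
            this.date = date;
            this.page = page;
            this.pv = pv;
        }
    }

    /** Partitioner role: every key of one date lands in the same partition. */
    static int partition(Key key, int numPartitions) {
        return (key.date.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sort comparator role: date ascending, then page views descending. */
    static int sortCompare(Key a, Key b) {
        int c = a.date.compareTo(b.date);
        return c != 0 ? c : Integer.compare(b.pv, a.pv);
    }

    /** Grouping comparator role: keys with the same date share one reduce() call. */
    static boolean sameGroup(Key a, Key b) {
        return a.date.equals(b.date);
    }

    /** Sorts the keys, then cuts the sorted run into groups. */
    static List<List<Key>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SecondarySortSketch::sortCompare);
        List<List<Key>> groups = new ArrayList<>();
        for (Key k : sorted) {
            if (groups.isEmpty() || !sameGroup(groups.get(groups.size() - 1).get(0), k)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = List.of(
                new Key("Jan 21, 2013", "/index.html", 3),
                new Key("Jan 20, 2013", "/bar.html", 5),
                new Key("Jan 21, 2013", "/news.html", 7));
        for (List<Key> group : sortAndGroup(keys)) {
            int rank = 1;
            for (Key k : group) {
                if (rank > 10) {
                    break;
                }
                System.out.println(k.date + "\t" + rank++ + "\t" + k.page + "\t" + k.pv);
            }
        }
    }
}
```

Because the sort has already ordered each date's pages by page views, the reducer only has to emit the first 10 values it sees.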
Page top 10 rank of natural MapReduce
• This is very long ...
• About 307 lines
Page top 10 rank of Huahin MapReduce
JobTools

public class PathRankingJobTool extends SimpleJobTool {
    @Override
    protected String setInputPath(String[] args) {
        return args[0];
    }

    @Override
    protected String setOutputPath(String[] args) {
        return args[1];
    }

    /* (non-Javadoc)
     * @see org.huahin.core.SimpleJobTool#setup()
     */
    @Override
    protected void setup() throws Exception {
        final String[] labels = new String[] { "DATE", "USER", "URL" };

        SimpleJob job1 = addJob(labels, StringUtil.TAB);
        job1.setFilter(FirstFilter.class);
        job1.setSummarizer(FirstSummarizer.class);

        SimpleJob job2 = addJob();
        job2.setSummarizer(SecondSummarizer.class);
    }
}
FirstFilter

public class FirstFilter extends Filter {
    @Override
    public void init() {
    }

    @Override
    public void filter(Record record, Writer writer)
            throws IOException, InterruptedException {
        Record emitRecord = new Record();
        emitRecord.addGrouping("DATE", record.getValueString("DATE"));
        emitRecord.addGrouping("PATH", record.getValueString("URL"));
        emitRecord.addValue("PV", 1);
        writer.write(emitRecord);
    }

    @Override
    public void filterSetup() {
    }
}
FirstSummarizer

public class FirstSummarizer extends Summarizer {
    @Override
    public void init() {
    }

    @Override
    public void summarize(Writer writer)
            throws IOException, InterruptedException {
        int pv = 0;
        while (hasNext()) {
            Record record = next(writer);
            pv += record.getValueInteger("PV");
        }

        Record emitRecord = new Record();
        emitRecord.addGrouping("DATE", getGroupingRecord().getGroupingString("DATE"));
        emitRecord.addSort(pv, Record.SORT_UPPER, 1);
        emitRecord.addValue("PATH", getGroupingRecord().getGroupingString("PATH"));
        emitRecord.addValue("PV", pv);
        writer.write(emitRecord);
    }

    @Override
    public void summarizerSetup() {
    }
}
SecondSummarizer

public class SecondSummarizer extends Summarizer {
    @Override
    public void init() {
    }

    @Override
    public void summarize(Writer writer)
            throws IOException, InterruptedException {
        int rank = 1;
        while (hasNext()) {
            if (rank > 10) {
                break;
            }

            Record record = next(writer);
            Record emitRecord = new Record();
            emitRecord.addValue("PATH", record.getValueString("PATH"));
            // FirstSummarizer emits the count under "PV", so read "PV" here.
            emitRecord.addValue("PV", record.getValueInteger("PV"));

            writer.write(emitRecord);
            rank++;
        }
    }

    @Override
    public void summarizerSetup() {
    }
}
Page top 10 rank of Huahin MapReduce
• This is very short!!
• About 100 lines
Huahin Core
• Other
• Simple Join
• Big Join
• etc ...
Huahin Tools
• A collection of tools for generic operations.
• Currently only Apache log formatting...
• Operating environment
  • On-premises Hadoop
  • Standalone
    • Multi-threaded execution for small data
  • EMR
    • S3://huahin/tools/huahin-tools.0.1.0.jar
Huahin Manager
• Manager for MapReduce Jobs
  • Get the Job list
  • Get the Job detail
  • Kill a Job
  • Execute a Job
  • Run queue management
    • MapReduce JAR
    • Hive scripts
    • Pig scripts
• Execute Hive queries
• Execute Pig Latin
• All operations are performed through the REST API.
• Supports Apache Hadoop 1.0.X and 2.0.2-alpha
• Supports CDH3 and CDH4
Huahin Manager
• For 2.0.2-alpha and CDH4
  • Getting the Application list
  • Getting the Cluster info
  • Kill Application
  • Proxy to YARN APIs
Huahin Manager
• EMR support
• Bootstrap setting: s3://huahin/manager/configure
• A security group setting is needed in order to access the REST API.
• The security group to configure is the one created when EMR starts up: ElasticMapReduce-master
• Values to be set
  • Port range: 9010
  • Source: IP addresses that are allowed to connect
Huahin Manager
Operating environment of Huahin Manager
[Diagram: various operations, REST API, Huahin Manager, Hadoop Cluster, HiveServer (1 and 2)]
Huahin EManager
Manager that specializes in EMR
• Manager for Job Flows
  • Get the Job Flow list
  • Get the Job Flow detail
  • Kill a Job Flow Step
  • Execute a Job
• Run queue management
  • Register a queue
  • Get the queue detail
  • Remove a queue
Huahin EManager
• Register queue
• The following functions supported by EMR can be assigned to the queue:
  • Hive
  • Pig
  • Streaming
  • Custom JAR
• EManager can specify the size of the cluster to be started.
EManager assigns a queue to a cluster that is free. (Being able to bring up multiple clusters is a good point of EMR!)
Huahin EManager
Operating environment of Huahin EManager
[Diagram: various operations, REST API, Huahin EManager (on premises or EC2 instance), Amazon Elastic MapReduce clusters]
Huahin Manager will be started by the Master node bootstrap.
※ NOTICE: Set up the security group
Huahin EManager
Operating environment of Huahin EManager
How EManager differs from starting EMR from the Management Console or tools:
• EManager reuses one Job Flow
  It does not start and terminate EMR every time, in order to save cost and improve performance.
  ※ This currently cannot be done from the Management Console. However, it can be done from the command line and the SDK.
• However, the Job Flow is restarted automatically when the number of Steps reaches the upper limit of 255.
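The reuse-until-the-limit behaviour can be sketched as a small counter. This is only an illustration of the rule stated on the slide (255 Steps per Job Flow); the class and method names are hypothetical and this is not EManager's actual implementation.

```java
final class JobFlowRecyclerSketch {
    /** Upper limit of Steps per Job Flow, as stated on the slide. */
    static final int MAX_STEPS = 255;

    private int flowGeneration = 1; // which Job Flow is currently in use
    private int stepsUsed = 0;      // Steps already submitted to it

    /**
     * Registers one more Step and returns the generation of the Job Flow
     * that should run it. A new Job Flow is started once the current one
     * has used up all of its Steps.
     */
    int assignStep() {
        if (stepsUsed >= MAX_STEPS) {
            flowGeneration++; // hypothetical: restart with a fresh Job Flow
            stepsUsed = 0;
        }
        stepsUsed++;
        return flowGeneration;
    }

    public static void main(String[] args) {
        JobFlowRecyclerSketch recycler = new JobFlowRecyclerSketch();
        int last = 0;
        for (int i = 0; i < 256; i++) {
            last = recycler.assignStep();
        }
        System.out.println("Job Flow generation used by step 256: " + last);
    }
}
```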
Huahin EManager
Operating environment of Huahin EManager
How EManager differs from starting EMR from the Management Console or tools:
• It stays up in units of one hour
  • for cost (billing and performance)
• It shuts down automatically just before the next billing time.
• However, if a Job is running, it is carried over to the next billing period.
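The shutdown rule above can be expressed as a small decision function. This is a sketch under assumptions: billing is taken to be per full hour of uptime, and the safety margin before the hour boundary is a made-up parameter. The names are hypothetical and do not come from EManager.

```java
import java.time.Duration;

final class BillingHourSketch {
    /** Billing period assumed to be one hour, as on the slide. */
    static final Duration BILLING_PERIOD = Duration.ofHours(1);

    /**
     * Decides whether an idle cluster should shut down now.
     *
     * @param uptime     time since the cluster started
     * @param jobRunning true if a Job is still running (it is carried over)
     * @param margin     how long before the billing boundary to shut down (assumed)
     */
    static boolean shouldShutDown(Duration uptime, boolean jobRunning, Duration margin) {
        if (jobRunning) {
            return false; // keep running into the next billing period
        }
        long intoPeriod = uptime.toMillis() % BILLING_PERIOD.toMillis();
        return intoPeriod >= BILLING_PERIOD.minus(margin).toMillis();
    }

    public static void main(String[] args) {
        Duration margin = Duration.ofMinutes(5);
        System.out.println(shouldShutDown(Duration.ofMinutes(30), false, margin));
        System.out.println(shouldShutDown(Duration.ofMinutes(56), false, margin));
        System.out.println(shouldShutDown(Duration.ofMinutes(56), true, margin));
    }
}
```

Checking the position inside the current period, rather than total uptime, lets the same rule apply after a Job has been carried over into a later hour.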
Huahin EManager
Register queue
A queue is registered using the PUT or POST method.
• PUT: If the script or JAR is already on S3, it starts a Job Flow or only executes a Step.
• POST: It uploads a local JAR or script to S3, then starts a Job Flow and executes a Step. This feature is not in EMR itself. There is also an option to remove the files that were POSTed.
• All registration is done in JSON.
Huahin EManager
Register queue
Example of PUT for Hive:
$ curl -X PUT http://localhost:9020/queue/register/hive \
    -F ARGUMENTS='{"script":"s3://huahin/wordcount.hql","arguments":["arg1","arg2"]}'
Optional JSON arguments
Example of POST for Hive:
$ curl -X POST http://localhost:9020/queue/register/hive \
    -F [email protected] \
    -F ARGUMENTS='{"script":"s3://huahin/wordcount.hql","arguments":["arg1","arg2"]}'
Optional JSON arguments: setting "deleteOnExit" to "true" deletes the file after execution. By default it is not deleted.
Huahin EManager
List of Job Flow
Example of getting the list of all Job Flows:
$ curl -X GET http://localhost:9020/jobflow/list
Example of getting the list of running Job Flows:
$ curl -X GET http://localhost:9020/jobflow/runnings
Example of getting the Job Flow detail:
$ curl -X GET http://localhost:9020/jobflow/describe/j-XXXXXXXXXXXX
Huahin EManager
Queue API
Example of getting the list of registered queues:
$ curl -X GET http://localhost:9020/queue/list
Example of getting the list of running queues:
$ curl -X GET http://localhost:9020/queue/runnings
Example of getting the queue detail:
$ curl -X GET http://localhost:9020/queue/describe/S_XXXXXXXXXXXX
Example of deleting a queue:
$ curl -X DELETE http://localhost:9020/queue/kill/S_XXXXXXXXXXXX
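The same calls can be made from a program, which is the point of offering everything as a REST API. The following sketch builds the requests with the JDK's java.net.http API (Java 11 or later). The base URL and paths are taken from the curl examples above; the class and method names are hypothetical, and nothing is sent over the network here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class EManagerRequestsSketch {
    /** Base URL as used in the curl examples; adjust for a real deployment. */
    static final String BASE = "http://localhost:9020";

    /** GET /queue/list */
    static HttpRequest listQueues() {
        return HttpRequest.newBuilder(URI.create(BASE + "/queue/list")).GET().build();
    }

    /** GET /queue/describe/{id} */
    static HttpRequest describeQueue(String queueId) {
        return HttpRequest.newBuilder(URI.create(BASE + "/queue/describe/" + queueId)).GET().build();
    }

    /** DELETE /queue/kill/{id} */
    static HttpRequest killQueue(String queueId) {
        return HttpRequest.newBuilder(URI.create(BASE + "/queue/kill/" + queueId)).DELETE().build();
    }

    public static void main(String[] args) {
        // Only prints the requests; no connection is opened.
        for (HttpRequest request : new HttpRequest[] {
                listQueues(), describeQueue("S_XXXXXXXXXXXX"), killQueue("S_XXXXXXXXXXXX") }) {
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

To actually send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the response body, which the slides do not show.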
Huahin EManager
Kill of Job
There is a command to kill a Job running on Hadoop:
hadoop job -kill job_XXXXXXXXXX
However, EMR has no such function. If you start a Job by mistake, there is no choice but to terminate the Job Flow.
You can kill the Job by connecting to the EMR master node over SSH and typing the above command.
Troublesome...
Huahin EManager
Kill of Job
The Kill API of EManager (Manager) makes this possible!
Example of Step kill:
$ curl -X DELETE http://localhost:9020/jobflow/kill/step/S_XXXXXXXXXXXX
Conclusion
• Huahin Core
  • Unlike Hive and Pig
  • For when you want to use MapReduce in a reasonably natural way.
• Huahin Tools
  • Still...
• Huahin Manager
  • All operations via REST API
  • Integration with other systems
• Huahin EManager
  • Integration with other systems
  • Cost and performance management
  • Kill Step of Job Flow!
The current version
• Huahin Core 0.1.4
• Huahin Unit 0.1.4
• Huahin Tools 0.1.0
• Huahin Manager
  • 0.1.4 for Apache Hadoop 1.0.4
  • 0.1.4 for CDH3
  • 0.2.1 for Apache Hadoop 2.0.2-alpha
  • 0.2.1 for CDH4
• Huahin EManager 0.1.1
Thanks!!!