Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter


Page 1: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Hadoop Conference Japan 2013 Winter

Jan 21, 2013

@ryu_kobayashi

Huahin Framework for Hadoop

Page 2: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

• Ryu Kobayashi (@ryu_kobayashi)

• BrainPad Inc.

• Hadoop, Cassandra, Machine Learning, ...

AD

Now on sale!!!

Page 3: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

What is Huahin Framework?

Page 4: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Framework

http://huahinframework.org

Page 5: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Hadoop Family

Logo is ...

Page 6: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin logo is ...

Very very very cute!

Page 7: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

In June 2012, we released as OSS some software that had been developed in our office.

 * It was used for panel log analysis.

 * Please refer to the slides from "Hadoop Conference Japan 2011 Fall" for more information.

   http://goo.gl/C9tzf

Huahin Framework is a general term for multiple products.

Page 8: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

The origin of the name of Huahin Framework

Our office has a custom of choosing code names from wine regions.

Huahin = Hua Hin = tourist destination in Thailand = wine region

When it comes to Thailand...

It is the elephant!

Hence the Huahin image.

Page 9: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Framework Configuration

It mainly consists of the following elements:

• Huahin Core
• Huahin Tools
• Huahin Manager

Page 10: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Core

• Simplifies MapReduce programs
  • You do not have to write Writables and Secondary Sort yourself
  • The basic grouping, sorting, etc. take their ideas from SQL
• If you want to, you can still write natural (plain) MapReduce
  • Just as C++ is a superset of C
• Hive or Pig can do the same work. However, Huahin Core is for when you really want performance (parallel computation, etc.)
• Huahin Unit is provided as a test driver
  • It wraps MRUnit
• Example of implementation

Page 11: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Example

• Page top 10 rank example

First, natural MapReduce. Second, Huahin MapReduce.

Page 12: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Data of page top 10 rank

Jan 21, 2013 user1 /index.html
Jan 21, 2013 user1 /index2.html
Jan 21, 2013 user2 /contents/foo.html
Jan 21, 2013 user42 /bar.html
Jan 21, 2013 user3 /index.html
Jan 21, 2013 user7 /news/index.html
Jan 21, 2013 user4 /release/2013.html
Jan 21, 2013 user3 /index2.html
Jan 21, 2013 user7 /download.html
Jan 21, 2013 user5 /bar.html
Jan 21, 2013 user12 /release/2012.html
Jan 21, 2013 user5 /contents/foo.html
Jan 21, 2013 user23 /page2.html
Jan 21, 2013 user53 /news.html
Jan 21, 2013 user6 /download.html
Jan 21, 2013 user21 /bar.html
Jan 21, 2013 user18 /index.html

Example: the format is tab-delimited (date, user, page).
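As a point of reference for the two implementations that follow, here is a minimal plain-Java sketch of what "page top 10 rank" computes on this data, without Hadoop. The class name PageRankSketch and the method topPages are hypothetical and are not part of Huahin or of the slides. The sketch assumes the three tab-delimited fields are date, user, and page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the ranking; class and method names are hypothetical. */
public class PageRankSketch {

    /**
     * Counts page views per (date, page) from lines of the form
     * date TAB user TAB page, then returns up to `limit` "page TAB pv"
     * entries per date, highest page-view count first.
     */
    public static Map<String, List<String>> topPages(List<String> lines, int limit) {
        // Step 1 (the first job in the slides): count page views per date and page.
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            if (f.length < 3) {
                continue; // skip malformed lines
            }
            counts.computeIfAbsent(f[0], d -> new TreeMap<>()).merge(f[2], 1, Integer::sum);
        }

        // Step 2 (the second job): per date, sort by count descending and keep the top entries.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> byDate : counts.entrySet()) {
            List<Map.Entry<String, Integer>> pages = new ArrayList<>(byDate.getValue().entrySet());
            pages.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue()); // descending
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> top = new ArrayList<>();
            for (Map.Entry<String, Integer> e : pages.subList(0, Math.min(limit, pages.size()))) {
                top.add(e.getKey() + "\t" + e.getValue());
            }
            result.put(byDate.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Jan 21, 2013\tuser1\t/index.html",
                "Jan 21, 2013\tuser3\t/index.html",
                "Jan 21, 2013\tuser42\t/bar.html");
        System.out.println(topPages(lines, 10));
    }
}
```

Everything fits in memory here, which is exactly what does not hold at Hadoop scale; the MapReduce versions split the same two steps into two jobs.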

Page 13: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

JobTools

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;

    public class PathTop10RankJobTool extends Configured implements Tool {
        @Override
        public int run(String[] arg0) throws Exception {
            // First job: count page views per (date, page).
            Job firstJob = new Job(getConf(), "first");
            firstJob.setJarByClass(PathTop10RankJobTool.class);

            TextInputFormat.setInputPaths(firstJob, "input");
            firstJob.setInputFormatClass(TextInputFormat.class);

            firstJob.setMapperClass(PathTop10RankFirstMapper.class);
            firstJob.setMapOutputKeyClass(FirstKeyWritable.class);
            firstJob.setMapOutputValueClass(IntWritable.class);

            firstJob.setReducerClass(PathTop10RankFirstReducer.class);
            firstJob.setOutputKeyClass(SecondKeyWritable.class);
            firstJob.setOutputValueClass(IntWritable.class);

            SequenceFileOutputFormat.setOutputPath(firstJob, new Path("first"));
            firstJob.setOutputFormatClass(SequenceFileOutputFormat.class);

            if (!firstJob.waitForCompletion(true)) {
                return -1;
            }

            // Second job: secondary sort by page views, then emit the top 10 per date.
            Job secondJob = new Job(getConf(), "second");
            secondJob.setJarByClass(PathTop10RankJobTool.class);

            SequenceFileInputFormat.setInputPaths(secondJob, "first");
            secondJob.setInputFormatClass(SequenceFileInputFormat.class);

            secondJob.setMapperClass(Mapper.class); // identity mapper
            secondJob.setMapOutputKeyClass(SecondKeyWritable.class);
            secondJob.setMapOutputValueClass(IntWritable.class);

            secondJob.setGroupingComparatorClass(PathTop10RankGroupingComparatorClass.class);
            secondJob.setPartitionerClass(PathTop10RankPartitioner.class);
            secondJob.setSortComparatorClass(PathTop10RankingSortComparator.class);

            secondJob.setReducerClass(PathTop10RankSecondReducer.class);
            secondJob.setOutputKeyClass(SecondKeyWritable.class);
            secondJob.setOutputValueClass(IntWritable.class);

            TextOutputFormat.setOutputPath(secondJob, new Path("output"));
            secondJob.setOutputFormatClass(TextOutputFormat.class);

            return secondJob.waitForCompletion(true) ? 0 : -1;
        }
    }

Page 14: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

FirstMapper

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PathTop10RankFirstMapper
            extends Mapper<LongWritable, Text, FirstKeyWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Fields: s[0] = date, s[1] = user, s[2] = page.
            String[] s = value.toString().split("\t");
            context.write(new FirstKeyWritable(s[0], s[2]), ONE);
        }
    }

FirstReducer

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PathTop10RankFirstReducer
            extends Reducer<FirstKeyWritable, IntWritable, SecondKeyWritable, IntWritable> {
        @Override
        protected void reduce(FirstKeyWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the page views for this (date, page).
            int pv = 0;
            for (IntWritable i : values) {
                pv += i.get();
            }

            context.write(
                    new SecondKeyWritable(key.getDate().toString(), key.getPage().toString(), pv),
                    new IntWritable(pv));
        }
    }

Page 15: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

SecondReducer

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PathTop10RankSecondReducer
            extends Reducer<SecondKeyWritable, IntWritable, SecondKeyWritable, IntWritable> {
        @Override
        protected void reduce(SecondKeyWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int rank = 0;
            for (IntWritable i : values) {
                // rank starts at 0, so stop once 10 records have been written
                // (the original "rank > 10" check let an 11th record through).
                if (rank >= 10) {
                    break;
                }

                context.write(key, i);
                rank++;
            }
        }
    }

Page 16: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

FirstKeyWritable

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class FirstKeyWritable implements WritableComparable<FirstKeyWritable> {
        private Text date = new Text();
        private Text page = new Text();

        public FirstKeyWritable() {
        }

        public FirstKeyWritable(String date, String page) {
            this.date.set(date);
            this.page.set(page);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.date.readFields(in);
            this.page.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            this.date.write(out);
            this.page.write(out);
        }

        @Override
        public int compareTo(FirstKeyWritable o) {
            int compare = this.date.toString().compareTo(o.date.toString());
            if (compare != 0) {
                return compare;
            }
            return this.page.toString().compareTo(o.page.toString());
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof FirstKeyWritable)) {
                return false; // also covers obj == null
            }

            FirstKeyWritable o = (FirstKeyWritable) obj;
            return this.date.equals(o.getDate()) && this.page.equals(o.getPage());
        }

        // Added: the default HashPartitioner uses hashCode(), so it must agree
        // with equals() or equal keys can be sent to different reducers.
        @Override
        public int hashCode() {
            return 31 * this.date.hashCode() + this.page.hashCode();
        }

        /** @return the date */
        public Text getDate() {
            return date;
        }

        /** @param date the date to set */
        public void setDate(Text date) {
            this.date = date;
        }

        /** @return the page */
        public Text getPage() {
            return page;
        }

        /** @param page the page to set */
        public void setPage(Text page) {
            this.page = page;
        }
    }

SecondKeyWritable

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class SecondKeyWritable implements WritableComparable<SecondKeyWritable> {
        private Text date = new Text();
        private Text page = new Text();
        private IntWritable pv = new IntWritable();

        public SecondKeyWritable() {
        }

        public SecondKeyWritable(String date, String page, int pv) {
            this.date.set(date);
            this.page.set(page);
            this.pv.set(pv);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.date.readFields(in);
            this.page.readFields(in);
            this.pv.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            this.date.write(out);
            this.page.write(out);
            this.pv.write(out);
        }

        // The natural order is by date only; ordering by pv is left to the sort comparator.
        @Override
        public int compareTo(SecondKeyWritable o) {
            return this.date.toString().compareTo(o.date.toString());
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof SecondKeyWritable)) {
                return false; // also covers obj == null
            }

            SecondKeyWritable o = (SecondKeyWritable) obj;
            return this.date.equals(o.getDate());
        }

        // Added so that hashCode() stays consistent with equals().
        @Override
        public int hashCode() {
            return this.date.hashCode();
        }

        @Override
        public String toString() {
            return this.date + "\t" + this.page;
        }

        /** @return the date */
        public Text getDate() {
            return date;
        }

        /** @param date the date to set */
        public void setDate(Text date) {
            this.date = date;
        }

        /** @return the page */
        public Text getPage() {
            return page;
        }

        /** @param page the page to set */
        public void setPage(Text page) {
            this.page = page;
        }

        /** @return the pv */
        public IntWritable getPv() {
            return pv;
        }

        /** @param pv the pv to set */
        public void setPv(IntWritable pv) {
            this.pv = pv;
        }
    }

Page 17: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

GroupingComparator

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class PathTop10RankGroupingComparatorClass extends WritableComparator {
        public PathTop10RankGroupingComparatorClass() {
            super(SecondKeyWritable.class, true);
        }

        // Overrides the WritableComparable overload, which is the one that
        // WritableComparator's byte-level compare() calls after deserializing the keys.
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // Group by date only, so one reduce() call sees every page of a date.
            SecondKeyWritable one = (SecondKeyWritable) a;
            SecondKeyWritable another = (SecondKeyWritable) b;
            return one.getDate().compareTo(another.getDate());
        }
    }

Partitioner

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class PathTop10RankPartitioner extends Partitioner<SecondKeyWritable, IntWritable> {
        @Override
        public int getPartition(SecondKeyWritable key, IntWritable value, int numPartitions) {
            // Mask the sign bit instead of using Math.abs(), which stays negative
            // for Integer.MIN_VALUE and would yield an invalid partition number.
            return (key.getDate().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

SortComparator

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class PathTop10RankingSortComparator extends WritableComparator {
        public PathTop10RankingSortComparator() {
            super(SecondKeyWritable.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            SecondKeyWritable one = (SecondKeyWritable) a;
            SecondKeyWritable another = (SecondKeyWritable) b;

            int compare = one.getDate().compareTo(another.getDate());
            if (compare != 0) {
                return compare;
            }

            // Descending by page views, so the first 10 values of each date are the
            // top 10 (the original ascending order would return the bottom 10).
            return another.getPv().compareTo(one.getPv());
        }
    }
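The interplay of the sort comparator and the grouping comparator is the core of the secondary sort, so here is a small plain-Java sketch of it without Hadoop. The class SecondarySortSketch, its Key type, and sortAndGroup are hypothetical names used only for illustration. It assumes the sort order is date first, then page views descending, and that grouping looks at the date only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java illustration of secondary sort; all names are hypothetical. */
public class SecondarySortSketch {

    /** Composite key: the natural key (date) plus the field to order by (pv). */
    public static final class Key {
        public final String date;
        public final String page;
        public final int pv;

        public Key(String date, String page, int pv) {
            this.date = date;
            this.page = page;
            this.pv = pv;
        }
    }

    /** Sort comparator: by date, then by page views in descending order. */
    static final Comparator<Key> SORT = (a, b) -> {
        int byDate = a.date.compareTo(b.date);
        return byDate != 0 ? byDate : Integer.compare(b.pv, a.pv);
    };

    /** Grouping comparator: keys with the same date belong to the same reduce call. */
    static final Comparator<Key> GROUP = (a, b) -> a.date.compareTo(b.date);

    /** Sorts with SORT, then starts a new group whenever GROUP sees a different date. */
    public static List<List<Key>> sortAndGroup(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        List<List<Key>> groups = new ArrayList<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Key>> groups = sortAndGroup(Arrays.asList(
                new Key("Jan 21, 2013", "/bar.html", 3),
                new Key("Jan 22, 2013", "/index.html", 1),
                new Key("Jan 21, 2013", "/index.html", 5)));
        for (List<Key> group : groups) {
            for (Key k : group) {
                System.out.println(k.date + "\t" + k.page + "\t" + k.pv);
            }
        }
    }
}
```

Each inner list corresponds to the values one reduce() call would iterate over, already ordered so that taking the first 10 gives the ranking.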

Page 18: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of natural MapReduce

• This is very long ...
• About 307 lines

Page 19: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of Huahin MapReduce

JobTools

    public class PathRankingJobTool extends SimpleJobTool {
        @Override
        protected String setInputPath(String[] args) {
            return args[0];
        }

        @Override
        protected String setOutputPath(String[] args) {
            return args[1];
        }

        /* (non-Javadoc)
         * @see org.huahin.core.SimpleJobTool#setup()
         */
        @Override
        protected void setup() throws Exception {
            // Column labels of the tab-delimited input.
            final String[] labels = new String[] { "DATE", "USER", "URL" };

            // First job: count page views per (DATE, PATH).
            SimpleJob job1 = addJob(labels, StringUtil.TAB);
            job1.setFilter(FirstFilter.class);
            job1.setSummarizer(FirstSummarizer.class);

            // Second job: emit the top 10 per DATE.
            SimpleJob job2 = addJob();
            job2.setSummarizer(SecondSummarizer.class);
        }
    }

FirstFilter

    public class FirstFilter extends Filter {
        @Override
        public void init() {
        }

        @Override
        public void filter(Record record, Writer writer)
                throws IOException, InterruptedException {
            Record emitRecord = new Record();
            emitRecord.addGrouping("DATE", record.getValueString("DATE"));
            emitRecord.addGrouping("PATH", record.getValueString("URL"));
            emitRecord.addValue("PV", 1);
            writer.write(emitRecord);
        }

        @Override
        public void filterSetup() {
        }
    }

FirstSummarizer

    public class FirstSummarizer extends Summarizer {
        @Override
        public void init() {
        }

        @Override
        public void summarize(Writer writer)
                throws IOException, InterruptedException {
            int pv = 0;
            while (hasNext()) {
                Record record = next(writer);
                pv += record.getValueInteger("PV");
            }

            Record emitRecord = new Record();
            emitRecord.addGrouping("DATE", getGroupingRecord().getGroupingString("DATE"));
            emitRecord.addSort(pv, Record.SORT_UPPER, 1);
            emitRecord.addValue("PATH", getGroupingRecord().getGroupingString("PATH"));
            emitRecord.addValue("PV", pv);
            writer.write(emitRecord);
        }

        @Override
        public void summarizerSetup() {
        }
    }

SecondSummarizer

    public class SecondSummarizer extends Summarizer {
        @Override
        public void init() {
        }

        @Override
        public void summarize(Writer writer)
                throws IOException, InterruptedException {
            int rank = 1;
            while (hasNext()) {
                if (rank > 10) {
                    break;
                }

                Record record = next(writer);
                Record emitRecord = new Record();
                emitRecord.addValue("PATH", record.getValueString("PATH"));
                // FirstSummarizer emits the count under "PV"; the slide read "UU" here,
                // a label that no earlier stage writes.
                emitRecord.addValue("PV", record.getValueInteger("PV"));

                writer.write(emitRecord);
                rank++;
            }
        }

        @Override
        public void summarizerSetup() {
        }
    }

Page 20: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Page top 10 rank of Huahin MapReduce

• This is very short!!
• About 100 lines

Page 21: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Core

• Other
  • Simple Join
  • Big Join
  • etc ...

Page 22: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Tools

• A collection of tools for generic operations.
  • Currently only Apache log formatting...
• Operating environment
  • On-premises Hadoop
  • Standalone
    • Multi-threaded execution for small data
  • EMR
    • S3://huahin/tools/huahin-tools.0.1.0.jar

Page 23: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Manager

• Manager for MapReduce Jobs
  • Get the Job list
  • Get the Job detail
  • Kill a Job
  • Execute a Job
• Run queue management
  • MapReduce Jar
  • Hive Scripts
  • Pig Scripts
• Execute Hive Query
• Execute Pig Latin
• All operations are performed through the REST API.
• Supports Apache Hadoop 1.0.X and 2.0.2-alpha
• Supports CDH3 and CDH4

Page 24: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Manager

• For 2.0.2-alpha and CDH4
  • Getting the Application list
  • Getting the Cluster info
  • Kill Application
  • Proxy to YARN APIs

Page 25: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Manager

• EMR Support
  • Set the bootstrap action: s3://huahin/manager/configure
  • A security group setting is required in order to access the REST API.
  • The security group to configure is the one created when EMR starts: ElasticMapReduce-master
  • Values to be set
    • Port range: 9010
    • Source: IP addresses that are allowed to connect

Page 26: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin Manager

Operating environment of Huahin Manager

[Diagram labels: Various operations, REST API, Huahin Manager, Hadoop Cluster, HiveServer (1 and 2)]

Page 27: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Manager that specializes in EMR

• Manager for Job Flows
  • Get the Job Flow list
  • Get the Job Flow detail
  • Kill a Job Flow Step
  • Execute a Job
• Run queue management
  • Register a queue
  • Get the queue detail
  • Remove a queue

Page 28: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

• Register queue
  • The following functions, which EMR supports, can be assigned to the queue:
    • Hive
    • Pig
    • Streaming
    • Custom JAR
• EManager can specify the size of the cluster to be started.

EManager assigns a queue to a cluster that is free. (Being able to bring up multiple clusters is a good point of EMR!)
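The slides do not show how EManager picks a free cluster, so the following is only a minimal sketch of the idea under a first-in, first-out assumption. QueueAssignSketch, assign, and the queue and cluster identifiers are all hypothetical; none of them is EManager's actual API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of handing waiting queues to free clusters; not EManager's real logic. */
public class QueueAssignSketch {

    /**
     * Takes queues from the head of `waiting` (oldest first) and pairs each with
     * one free cluster. Queues left without a cluster stay in `waiting`.
     *
     * @return mapping of queue id to the cluster id it was assigned to
     */
    public static Map<String, String> assign(Deque<String> waiting, List<String> freeClusters) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (String cluster : freeClusters) {
            String queueId = waiting.poll();
            if (queueId == null) {
                break; // nothing left to run
            }
            assignment.put(queueId, cluster);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Deque<String> waiting = new ArrayDeque<>(Arrays.asList("queue-1", "queue-2", "queue-3"));
        Map<String, String> assignment = assign(waiting, Arrays.asList("cluster-a", "cluster-b"));
        System.out.println(assignment + ", still waiting: " + waiting);
    }
}
```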

Page 29: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Operating environment of Huahin EManager

[Diagram: Huahin EManager, running on premises or on an EC2 instance, performs various operations through the REST API on two Amazon Elastic MapReduce clusters. On each cluster, Huahin Manager is started by the master node bootstrap.]

※ NOTICE: Set up the security group

Page 30: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Operating environment of Huahin EManager

How starting EMR from EManager differs from starting it from the Management Console or tools:

• EManager reuses one Job Flow
  It does not start and terminate EMR for every execution, in order to save cost and improve performance.
  ※ Currently this cannot be done from the Management Console. However, it can be done from the command line and the SDK.
• However, it restarts automatically when the number of Steps reaches the upper limit of 255.
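The reuse rule can be pictured with a small counter. This is a sketch only: JobFlowReuseSketch and addStep are hypothetical names, the "restart" is reduced to incrementing a generation number, and only the limit of 255 Steps comes from the slide.

```java
/** Sketch of reusing one Job Flow until a Step limit is reached; all names are hypothetical. */
public class JobFlowReuseSketch {
    /** Upper limit of Steps per Job Flow, as stated on the slide. */
    public static final int STEP_LIMIT = 255;

    private int generation = 0; // how many times the Job Flow has been restarted
    private int stepsUsed = 0;  // Steps already added to the current Job Flow

    /**
     * Registers one Step and returns the generation of the Job Flow that receives it.
     * When the current Job Flow has used up its Steps, a new one is started first.
     */
    public int addStep() {
        if (stepsUsed >= STEP_LIMIT) {
            generation++;  // stands in for terminating and restarting the Job Flow
            stepsUsed = 0;
        }
        stepsUsed++;
        return generation;
    }

    public static void main(String[] args) {
        JobFlowReuseSketch flow = new JobFlowReuseSketch();
        int last = 0;
        for (int i = 0; i < 256; i++) {
            last = flow.addStep();
        }
        System.out.println("Step 256 ran on Job Flow generation " + last);
    }
}
```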

Page 31: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Operating environment of Huahin EManager

How starting EMR from EManager differs from starting it from the Management Console or tools:

• It stays up in one-hour units
  • for cost reasons (billing and performance)
• It shuts down automatically just before the next billing time.
• However, if a Job is running, the shutdown is carried over to the next billing time.
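The shutdown rule can be expressed as a small predicate. This is a sketch under assumptions: billing is treated as whole hours counted from cluster start, and the safety margin in minutes is a made-up parameter. BillingHourSketch and shouldShutDown are hypothetical names, not EManager code.

```java
/** Sketch of the "shut down just before the next billing hour" rule; names are hypothetical. */
public class BillingHourSketch {

    /**
     * @param minutesSinceStart minutes elapsed since the cluster was started
     * @param jobRunning        whether a Job is currently running on the cluster
     * @param marginMinutes     how many minutes before the hour boundary to shut down
     * @return true if the cluster should be shut down now
     */
    public static boolean shouldShutDown(long minutesSinceStart, boolean jobRunning, int marginMinutes) {
        if (jobRunning) {
            return false; // carried over: check again near the next hour boundary
        }
        long minutesIntoHour = minutesSinceStart % 60;
        return minutesIntoHour >= 60 - marginMinutes;
    }

    public static void main(String[] args) {
        System.out.println(shouldShutDown(30, false, 5));  // mid-hour, idle
        System.out.println(shouldShutDown(56, false, 5));  // close to the boundary, idle
        System.out.println(shouldShutDown(56, true, 5));   // close to the boundary, busy
    }
}
```

An idle cluster is kept until shortly before the hour it has already paid for runs out, and a busy cluster is never interrupted; the check simply repeats in the next hour.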

Page 32: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Register queue

Queues are registered using the PUT or POST method.

• PUT: If the script or JAR is already on S3, it starts a Job Flow or only executes a Step.
• POST: It uploads a local JAR or script to S3, then starts the Job Flow and executes the Step. This is a feature that EMR itself does not have. There is also an option to remove the files that were POSTed.
• All registration is done in JSON.

Page 33: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Register queue

Example of PUT for Hive:

$ curl -X PUT http://localhost:9020/queue/register/hive \
    -F ARGUMENTS='{"script":"s3://huahin/wordcount.hql","arguments":["arg1","arg2"]}'

Optional JSON arguments

Example of POST for Hive:

$ curl -X POST http://localhost:9020/queue/register/hive \
    -F [email protected] \
    -F ARGUMENTS='{"script":"s3://huahin/wordcount.hql","arguments":["arg1","arg2"]}'

Optional JSON arguments

Setting "deleteOnExit" to "true" deletes the files after execution. By default they are not deleted.
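When the registration is driven from a program rather than from curl, the ARGUMENTS value has to be built as a JSON string. The following is a minimal sketch of that, using only the keys that appear on the slide ("script", "arguments", "deleteOnExit"). QueueArgumentsSketch and build are hypothetical names, and sending "deleteOnExit" as the string "true" is an assumption based on the slide's wording.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch that builds the ARGUMENTS JSON for queue registration; names are hypothetical. */
public class QueueArgumentsSketch {

    /** Quotes a string for JSON. Only quotes and backslashes are escaped in this sketch. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    /** Builds {"script":...,"arguments":[...]} and appends "deleteOnExit" only when requested. */
    public static String build(String script, List<String> arguments, boolean deleteOnExit) {
        StringBuilder json = new StringBuilder("{");
        json.append("\"script\":").append(quote(script));
        json.append(",\"arguments\":[");
        for (int i = 0; i < arguments.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append(quote(arguments.get(i)));
        }
        json.append(']');
        if (deleteOnExit) {
            json.append(",\"deleteOnExit\":\"true\""); // assumed to be sent as a string
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("s3://huahin/wordcount.hql", Arrays.asList("arg1", "arg2"), false));
    }
}
```

In real code a JSON library would be a safer choice than hand-built strings; the manual version is used here only to keep the example free of dependencies.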

Page 34: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

List of Job Flow

Example of getting the list of all Job Flows:

$ curl -X GET http://localhost:9020/jobflow/list

Example of getting the list of running Job Flows:

$ curl -X GET http://localhost:9020/jobflow/runnings

Example of getting the Job Flow detail:

$ curl -X GET http://localhost:9020/jobflow/describe/j-XXXXXXXXXXXX

Page 35: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Queue API

Example of getting the list of registered queues:

$ curl -X GET http://localhost:9020/queue/list

Example of getting the list of running queues:

$ curl -X GET http://localhost:9020/queue/runnings

Example of getting the queue detail:

$ curl -X GET http://localhost:9020/queue/describe/S_XXXXXXXXXXXX

Example of deleting a queue:

$ curl -X DELETE http://localhost:9020/queue/kill/S_XXXXXXXXXXXX

Page 36: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Killing a Job

There is a command to kill a Job running on Hadoop.

hadoop job -kill job_XXXXXXXXXX

However, EMR has no such function. If you start a Job by mistake, there is no choice but to terminate the Job Flow.

You can kill it by connecting to the EMR master node over SSH and typing the above command.

Troublesome...

Page 37: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Huahin EManager

Killing a Job

The Kill API makes this possible from EManager (Manager)!

Example of killing a Step:

$ curl -X DELETE http://localhost:9020/jobflow/kill/step/S_XXXXXXXXXXXX
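Because the Kill API is plain REST, it can also be called from another system instead of curl. Here is a minimal Java sketch that prepares the same DELETE request with the JDK's HttpURLConnection. KillStepSketch and prepareKillStep are hypothetical names; the base URL and the Step id are the placeholders from the slide. The sketch only prepares the request and does not send it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Sketch that prepares the Step-kill request from Java; names are hypothetical. */
public class KillStepSketch {

    /**
     * Builds a DELETE request for {baseUrl}/jobflow/kill/step/{stepId}.
     * Nothing is sent until the caller asks for the response,
     * for example with getResponseCode().
     */
    public static HttpURLConnection prepareKillStep(String baseUrl, String stepId) {
        try {
            URI uri = URI.create(baseUrl + "/jobflow/kill/step/" + stepId);
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setRequestMethod("DELETE");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepareKillStep("http://localhost:9020", "S_XXXXXXXXXXXX");
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
        // conn.getResponseCode() would actually send the request to EManager.
    }
}
```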

Page 38: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Conclusion

• Huahin Core
  • Unlike Hive and Pig
  • For when you want to use MapReduce in a reasonably natural form
• Huahin Tools
  • Still...
• Huahin Manager
  • All operations through the REST API
  • Integration with other systems
• Huahin EManager
  • Integration with other systems
  • Cost and performance management
  • Kill Step of Job Flow!

Page 39: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

The current version

• Huahin Core 0.1.4
• Huahin Unit 0.1.4
• Huahin Tools 0.1.0
• Huahin Manager
  • 0.1.4 for Apache Hadoop 1.0.4
  • 0.1.4 for CDH3
  • 0.2.1 for Apache Hadoop 2.0.2-alpha
  • 0.2.1 for CDH4
• Huahin EManager 0.1.1

Page 40: Huahin Framework for Hadoop, Hadoop Conference Japan 2013 Winter

Thanks!!!