Rapid Development of Big Data applications using Spring for Apache Hadoop
![Page 1: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/1.jpg)
Spring for Apache Hadoop
By Zenyk Matchyshyn
![Page 2: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/2.jpg)
Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related
![Page 3: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/3.jpg)
Big Data – Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection
![Page 4: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/4.jpg)
Goals
• Provide a programmatic model to work with the Hadoop ecosystem
• Simplify client libraries usage
• Provide Spring friendly wrappers
• Enable real-world usage as a part of Spring Batch & Spring Integration
• Leverage Spring features
![Page 5: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/5.jpg)
Supported distros
• Apache Hadoop 1.2.1/2.0.6/2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0/1.1
![Page 6: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/6.jpg)
HADOOP INTRODUCTION
![Page 7: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/7.jpg)
Hadoop
Hadoop Map/Reduce
HDFS
HBase
Pig Hive
![Page 8: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/8.jpg)
Hadoop basics
Split → Map → Shuffle → Reduce
• Split: Dog ate the bone | Cat ate the fish
• Map: Dog, 1 | Ate, 1 | The, 1 | Bone, 1 | Cat, 1 | Ate, 1 | The, 1 | Fish, 1
• Shuffle: Dog, 1 | Ate, {1, 1} | The, {1, 1} | Bone, 1 | Cat, 1 | Fish, 1
• Reduce: Dog, 1 | Ate, 2 | The, 2 | Bone, 1 | Cat, 1 | Fish, 1
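The four phases on this slide can be traced in plain Java without a cluster. The following is only an in-memory sketch of the idea, not Hadoop code: the class and method names (`WordCountSketch`, `map`, `shuffle`, `reduce`, `count`) are made up for illustration, and words are lower-cased so that keys compare equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of Split -> Map -> Shuffle -> Reduce (hypothetical names, no Hadoop API).
public class WordCountSketch {

    // Map: emit one lower-cased word per token; each emitted word stands for a (word, 1) pair.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // Shuffle: group the emitted 1s by key, e.g. ate -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<String> emitted) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : emitted) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: sum the grouped values for every key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Each list element plays the role of one input split.
    static Map<String, Integer> count(List<String> splits) {
        List<String> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        // prints {ate=2, bone=1, cat=1, dog=1, fish=1, the=2}
        System.out.println(count(Arrays.asList("Dog ate the bone", "Cat ate the fish")));
    }
}
```

In a real job the map and reduce steps run on different nodes and the framework performs the shuffle; here everything happens in one JVM purely to show the data shapes.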
![Page 9: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/9.jpg)
Configuration
< … XML …>
<context:property-placeholder location="hadoop.properties"/>
<hdp:configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${hd.jt}
</hdp:configuration>
<… XML … >
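To make the configuration slide concrete: the body of the configuration element is a list of `key=value` lines whose `${...}` placeholders are filled in from a properties file. The sketch below imitates that in plain Java, under the assumption that the body can be read like a `java.util.Properties` file. It is not how Spring resolves placeholders internally; the class name `ConfigBodySketch`, its methods, and the host/port values are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: resolve ${...} placeholders, then parse key=value lines (hypothetical helper).
public class ConfigBodySketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${key} with its value; unknown keys are left untouched.
    static String resolve(String text, Properties values) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.getProperty(matcher.group(1), matcher.group(0));
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Parse the resolved body with the standard Properties reader.
    static Properties parse(String body, Properties placeholders) {
        Properties parsed = new Properties();
        try {
            parsed.load(new StringReader(resolve(body, placeholders)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Stand-in for hadoop.properties; the values are assumed for the example.
        Properties placeholders = new Properties();
        placeholders.setProperty("hd.fs", "hdfs://localhost:9000");
        placeholders.setProperty("hd.jt", "localhost:9001");

        String body = "fs.default.name=${hd.fs}\nmapred.job.tracker=${hd.jt}\n";
        Properties conf = parse(body, placeholders);

        System.out.println(conf.getProperty("fs.default.name"));    // hdfs://localhost:9000
        System.out.println(conf.getProperty("mapred.job.tracker")); // localhost:9001
    }
}
```

Keeping cluster addresses in a separate properties file is what lets the same XML be reused across development and production environments.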
![Page 10: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/10.jpg)
Job definition

Spring for Apache Hadoop:

<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>

Plain Hadoop API:

Configuration conf = new Configuration();
Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
![Page 11: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/11.jpg)
Job Execution

• Basic:

<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob"/>

• Scheduled
  – TaskScheduler
  – Quartz
• Custom
![Page 12: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/12.jpg)
HIGH LEVEL TOOLS
![Page 13: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/13.jpg)
Solutions
• HBase
• Hive
• Pig
• Cascading
![Page 14: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/14.jpg)
Simplifies
• Thread safety
• DAO friendliness, wrappers and basic mappers
• Simple connection interfaces
• Runners, Template and callback methods
• Common scenarios simplifications
• Scripting support
![Page 15: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/15.jpg)
Example - Template
template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});

<hdp:hbase-configuration/>

<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate"
    p:configuration-ref="hbaseConfiguration"/>
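The template-and-callback idea behind this example can be shown without HBase on the classpath. In the sketch below the template owns the resource lifecycle and exception handling, while the caller supplies only the callback. All names (`TemplateCallbackSketch`, `FakeTable`, `TableAction`, `SimpleTemplate`) are hypothetical stand-ins and deliberately not the Spring for Apache Hadoop or HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Generic template/callback pattern; hypothetical types, not the HBase or Spring API.
public class TemplateCallbackSketch {

    // Stand-in for a table handle that must be closed after use.
    static class FakeTable implements AutoCloseable {
        final String name;
        final List<String> rows = new ArrayList<>();
        boolean closed = false;

        FakeTable(String name) { this.name = name; }

        void put(String row) { rows.add(row); }

        @Override
        public void close() { closed = true; }
    }

    // Callback: the only piece the caller has to write.
    interface TableAction<T> {
        T doInTable(FakeTable table) throws Exception;
    }

    // Template: opens the resource, runs the callback, always cleans up.
    static class SimpleTemplate {
        FakeTable lastTable; // kept only so the example can inspect the handle afterwards

        <T> T execute(String tableName, TableAction<T> action) {
            FakeTable table = new FakeTable(tableName);
            lastTable = table;
            try {
                return action.doInTable(table);
            } catch (Exception e) {
                // Translate checked exceptions into one unchecked type.
                throw new IllegalStateException("Table action failed", e);
            } finally {
                table.close();
            }
        }
    }

    public static void main(String[] args) {
        SimpleTemplate template = new SimpleTemplate();
        Integer rows = template.execute("MyTable", table -> {
            table.put("SomeRow");
            return table.rows.size();
        });
        // prints: 1 row(s) written, closed=true
        System.out.println(rows + " row(s) written, closed=" + template.lastTable.closed);
    }
}
```

Because cleanup sits in the template's `finally` block, the resource is released even when the callback throws, which is the main benefit this style offers over hand-written open/close code.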
![Page 16: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/16.jpg)
Example – Script Runner

<hdp:hive-server host="hivehost" port="10001"/>

<hdp:hive-template/>

<hdp:hive-client-factory host="some-host" port="some-port">
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>

<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>
![Page 17: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/17.jpg)
WORKFLOWS
![Page 18: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/18.jpg)
Typical Big Data Processing Flow
Capture → Pre-Process → Insert → Process → Extract → Present
![Page 19: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/19.jpg)
Spring Batch & Spring Integration
• Big Data Flows are based on Spring Integration & Spring Batch
• Spring for Hadoop provides:
  – Spring Batch tasklets
  – Spring Integration support
![Page 20: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/20.jpg)
Tasklets
• Job runners
• Script runners
• Hive
• Pig
• Cascading
![Page 21: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/21.jpg)
Example
<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true"/>

<batch:job id="job1">
    <batch:step id="import" next="ht">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="ht">
        <batch:tasklet ref="hadoop-tasklet"/>
    </batch:step>
</batch:job>
![Page 22: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/22.jpg)
SCRIPTING & MIGRATION
![Page 23: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/23.jpg)
Details
• Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic HDFS shell
• Exposes DistCp to mimic distcp from Hadoop
![Page 24: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/24.jpg)
Example

<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"

        if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
        if (fsh.test(outputPath)) { fsh.rmr(outputPath) }

        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
![Page 25: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/25.jpg)
Migration

Hadoop Streaming:

<hdp:streaming id="streaming" input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}"/>

Hadoop Tool Executor:

<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>
![Page 26: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/26.jpg)
Alternatives
• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export
![Page 27: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/27.jpg)
TESTING & RELATED TOOLS
![Page 28: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/28.jpg)
Testing
• JUnit/Mocks + MRUnit
• Mini-HDFS and Mini-MapReduce cluster
• LocalJobRunner
![Page 29: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/29.jpg)
Spring YARN
Hadoop 1.x
• Map/Reduce (cluster / data process)
• HDFS (storage)

Hadoop 2.x
• Map/Reduce (data process)
• Other (like Spark - data)
• YARN (cluster)
• HDFS (storage)
![Page 30: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/30.jpg)
Spring eXtreme Data (XD)
• Ultimate data processing solution
• Implements most common approach, business logic up to you
• On top of Spring Batch and Spring Integration
• Has DSL
• Scalable
![Page 31: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/31.jpg)
More speedups
• Use provider quick start VM for initial development
• Use cloud based images for production (start/stop)
• Don’t use Map/Reduce without real need. Start with higher abstraction.
• Don’t migrate without real need!
• Invest in DevOps (Chef / Puppet / Vagrant…)
![Page 32: Rapid Development of Big Data applications using Spring for Apache Hadoop](https://reader037.fdocuments.us/reader037/viewer/2022102922/54c654034a7959b1098b4652/html5/thumbnails/32.jpg)
Q/A
?