Njug presentation
Uploaded by iwrigley
Category: Technology
Transcript of Njug presentation
01-1 © Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop 101: Writing a Java MapReduce Program Ian Wrigley Sr. Curriculum Manager, Cloudera [email protected] | @iwrigley
And, by the way, what is Hadoop?
Why the World Needs Hadoop
§ Every day… – More than 1.5 billion shares are traded on the NYSE – Facebook stores 2.7 billion comments and Likes
§ Every minute… – Foursquare handles more than 2,000 check-ins – TransUnion makes nearly 70,000 updates to credit files
§ And every second… – Banks process more than 10,000 credit card transactions
Volume
§ We are generating data faster than ever – Processes are increasingly automated – People are increasingly interacting online – Systems are increasingly interconnected
Velocity
§ We’re producing a variety of data, including – Audio – Video – Images – Log files – Web pages – Product rating comments – Social network connections
§ Not all of this maps cleanly to the relational model
Variety
§ One tweet is an anecdote – But a million tweets may signal important trends
§ One person’s product review is an opinion – But a million reviews might uncover a design flaw
§ One person’s diagnosis is an isolated case – But a million medical records could lead to a cure
Big Data Can Mean Big Opportunity
A Scalable Data Processing Framework
MapReduce
§ MapReduce is a programming model – It’s a way of processing data
§ In Hadoop, you supply two functions to process data: Map and Reduce – Map: typically used to transform, parse, or filter data – Reduce: typically used to summarize results
§ The Map function always runs first – The Reduce function runs afterwards – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function
§ Each piece is simple, but can be powerful when combined
What is MapReduce?
§ … in which Ian waves his hands around and attempts to explain the MapReduce flow
MapReduce: An Example
§ MapReduce processing in Hadoop is batch-oriented
§ Usually written in Java – This uses Hadoop’s API directly – You can do basic MapReduce in other languages
– Using the Hadoop Streaming wrapper program – Some advanced features require Java code
MapReduce Code for Hadoop
§ Some (very) basic concepts: – Input and output data is typed – The framework passes each input record to the Mapper in turn – A record is a (key, value) pair – For text files:
– The key is the byte offset of the start of the line – The value is the line itself
– Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort – Reducers receive (key, Iterable of values) sets, in sorted key order – Job is configured and executed using a driver class
Basic Java API Concepts
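The byte-offset keying described above can be illustrated in plain Java. This is a hypothetical sketch of what the default text input format hands to the Mapper (the class and method names here are invented, not Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TextRecordDemo {
    // For each line of input, emit a (key, value) record where the key is
    // the byte offset of the start of the line and the value is the line
    // itself, mirroring Hadoop's default behavior for text files.
    public static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            // skip the empty fragment after a trailing newline
            if (!line.isEmpty() || offset < fileContents.length()) {
                records.put(offset, line);
            }
            offset += line.getBytes().length + 1; // +1 for the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("apple\nbanana\ncherry")
            .forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

For the three-line input above, the keys are 0, 6, and 13: each line's key is the previous key plus the previous line's length plus one newline byte.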
Data Flow
Map input:
    Nashville    J. Jones       12.95    2013-07-21
    Memphis      S. Smith       66.57    2013-07-21
    Nashville    T. Harding     55.35    2013-07-22
    Knoxville    S. Warne       10.99    2013-07-22
    Kingsport    M. Thompson    99.95    2013-07-22

Map output:
    (Nashville, 12.95)
    (Memphis, 66.57)
    (Nashville, 55.35)
    (Knoxville, 10.99)
    (Kingsport, 99.95)

Shuffle and sort produces Reduce input:
    (Kingsport, [99.95])
    (Knoxville, [10.99])
    (Memphis, [66.57])
    (Nashville, [12.95, 55.35])

Reduce output:
    Kingsport    99.95
    Knoxville    10.99
    Memphis      66.57
    Nashville    68.30
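The flow above can be simulated in plain Java, with a sorted map standing in for the shuffle and sort. This is only a sketch to make the grouping concrete; no Hadoop classes are involved and all names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DataFlowDemo {
    // Map phase: extract (store, sale) from each tab-separated record.
    // Shuffle and sort: group sale values by store, in sorted key order
    // (TreeMap keeps keys sorted, like the MapReduce shuffle and sort).
    // Reduce phase: sum the sales for each store.
    public static Map<String, Double> run(List<String> records) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] fields = record.split("\t");
            grouped.computeIfAbsent(fields[0], k -> new ArrayList<>())
                   .add(Double.parseDouble(fields[2]));
        }
        Map<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            double sum = 0;
            for (double v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "Nashville\tJ. Jones\t12.95\t2013-07-21",
            "Memphis\tS. Smith\t66.57\t2013-07-21",
            "Nashville\tT. Harding\t55.35\t2013-07-22",
            "Knoxville\tS. Warne\t10.99\t2013-07-22",
            "Kingsport\tM. Thompson\t99.95\t2013-07-22");
        run(input).forEach((store, total) ->
            System.out.printf("%s\t%.2f%n", store, total));
    }
}
```

Running this reproduces the Reduce output shown above, including the two Nashville sales combining into a single total.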
Java MR Job Example: Mapper
package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class StoreSalesMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
Input key and value types
Output key and value types
Java MR Job Example: Mapper
    /*
     * The map method is invoked once for each line of text in the
     * input data. The method receives a key of type LongWritable
     * (which corresponds to the byte offset in the current input
     * file), a value of type Text (representing the line of input
     * data), and a Context object (which allows us to print status
     * messages, among other things).
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
Java MR Job Example: Mapper
        String line = value.toString();

        // ignore empty lines
        if (line.trim().isEmpty()) {
            return;
        }

        String[] fields = line.split("\t");

        // ensure this line is not malformed
        if (fields.length != 4) {
            return;
        }
Convert value to a Java String
Defensive programming!
Split record into fields
Even more defensive programming!
Java MR Job Example: Mapper
        String storeName = fields[0];
        Double saleValue = Double.parseDouble(fields[2]);

        context.write(new Text(storeName), new DoubleWritable(saleValue));
    }
}
Output key and value
Extract based on position
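The Mapper's parsing logic can be exercised without a cluster by pulling it into a plain helper. This is a sketch; the helper class and method are invented here and are not part of the original code:

```java
public class MapperLogicDemo {
    // Mirrors the Mapper body: skip empty or malformed lines, otherwise
    // return {storeName, saleValue} extracted by field position.
    public static Object[] parse(String line) {
        if (line.trim().isEmpty()) {
            return null;                       // empty line: emit nothing
        }
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            return null;                       // malformed line: emit nothing
        }
        return new Object[] { fields[0], Double.parseDouble(fields[2]) };
    }

    public static void main(String[] args) {
        Object[] kv = parse("Nashville\tJ. Jones\t12.95\t2013-07-21");
        System.out.println(kv[0] + " -> " + kv[1]);
        System.out.println(parse("   "));          // null: dropped
        System.out.println(parse("bad\trecord"));  // null: dropped
    }
}
```

The defensive checks matter at scale: one malformed line among millions would otherwise throw and fail the whole task, so the Mapper simply emits nothing for bad records.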
Java MR Job Example: Reducer
package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
Output key and value types
Input key and value types
Java MR Job Example: Reducer
    /*
     * The reduce method is invoked once for each key received from
     * the shuffle and sort phase of the MapReduce framework.
     * The method receives a key of type Text (representing the key),
     * a set of values of type DoubleWritable, and a Context object.
     */
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
Java MR Job Example: Reducer
        // used to sum up the store sales
        double sum = 0;

        // add to it for each new value received
        for (DoubleWritable value : values) {
            sum += value.get();
        }

        // our output is the store name (key) and the sum (value)
        context.write(key, new DoubleWritable(sum));
    }
}
Output key and value
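Because the Reducer's summing loop is pure arithmetic, it is easy to check in isolation. A minimal sketch (the helper name is invented, not part of the original code):

```java
import java.util.List;

public class ReducerLogicDemo {
    // Mirrors the reduce body: sum every value received for one key.
    public static double sum(Iterable<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // The two Nashville sales from the data-flow example
        System.out.printf("Nashville\t%.2f%n", sum(List.of(12.95, 55.35)));
    }
}
```

Fed the grouped Nashville values [12.95, 55.35] from the shuffle and sort, this produces the 68.30 total shown in the data-flow slide.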
Java MR Job Example: Driver
package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The driver is just a regular Java class with a "main" method
public class StoreSales {

    public static void main(String[] args) throws Exception {
Java MR Job Example: Driver
        // validate command line arguments (we require the user
        // to specify the HDFS paths to use for the job; see below)
        if (args.length != 2) {
            System.out.printf("Usage: Driver <input dir> <output dir>\n");
            System.exit(-1);
        }

        // instantiate a Job object for our job's configuration
        Job job = new Job();

        // configure input and output paths based on supplied arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
Java MR Job Example: Driver
        // tell Hadoop to copy the JAR containing this class
        // to cluster nodes, as required to run this job
        job.setJarByClass(StoreSales.class);

        // give the job a descriptive name. This is optional, but
        // helps us identify this job on a busy cluster
        job.setJobName("Store Sale Aggregator");

        // specify which classes to use for the Mapper and Reducer
        job.setMapperClass(StoreSalesMapper.class);
        job.setReducerClass(SumReducer.class);
Java MR Job Example: Driver
        // specify the Mapper's output key and value classes
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // specify the job's output key and value classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // start the MapReduce job and wait for it to finish;
        // if it finishes successfully, exit with 0, otherwise 1
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
§ And now… the program actually running on a pseudo-distributed cluster
Demo
§ Obviously there’s much more to the Hadoop API than this – Partitioners – Combiners – Custom Writables, custom WritableComparables – DistributedCache – Counters – etc.
§ …but even with just this amount of knowledge, you could write real-world Hadoop applications
Conclusion
§ Helps companies profit from all their data – Founded by experts from Facebook, Google, Oracle, and Yahoo
§ We offer products and services for large-scale data analysis – Software (CDH distribution and Cloudera Manager) – Consulting and support services – Training and certification
§ Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class
About Cloudera