CS267 Hadoop Programming

Hadoop Installation & MapReduce Programming
CS267 - Data Mining & Machine Learning
- Kuldeep Dhole

Description: MapReduce Basics

Transcript of CS267 Hadoop Programming

Page 1

Hadoop Installation & MapReduce Programming

CS267 - Data Mining & Machine Learning

-Kuldeep Dhole

Page 2

WHW: Why, How, What

Why: To be able to deal with Big Data Mining.

How: By learning Hadoop & MR programming

What: Hadoop Installation, HDFS basics, & MR programming for Hadoop

Page 3

Hadoop Installation

Amazon EC2 cloud - Cloudera's Hadoop Installation: https://www.dropbox.com/s/s8zc3iwlq936hak/Amazon_Cloudera_Hadoop.pdf

Page 4

Hadoop Components

- HDFS (Hadoop Distributed File System)

- MapReduce Model

Page 5

HDFS Shell

CLUSTER / LOCAL MACHINE

File system of the local OS (Linux, Windows, etc.), e.g. /home/user1:
> ls -l
> mv f1 f2
> cp f1 f2

HDFS, e.g. /tmp:
> hadoop fs -ls
> hadoop fs -mv hdfs_f1 hdfs_f2
> hadoop fs -cp hdfs_f1 hdfs_f2

- HDFS has its own shell commands

- You need to transfer data between the LOCAL FS and HDFS (see the example commands below).

- The same concept applies to all machines in the cluster; the Hadoop realm on all machines is kept in sync.
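For example, to copy a file in each direction (a minimal sketch; the paths are illustrative):

> hadoop fs -put /home/user1/f1 /tmp/f1          (LOCAL FS -> HDFS)
> hadoop fs -get /tmp/f1 /home/user1/f1_copy     (HDFS -> LOCAL FS)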

Page 6

MapReduce Concept

- Programming model for distributed parallel computing.
- Used on scalable commodity-hardware clusters.
- Can process Big Data (100s of GBs, TBs).
- Based on a Key-Value structure.
- Parallel MAP tasks, which emit <K, V> data.
- Parallel REDUCE tasks, which process <K, V[ ]> data.

Page 7

MapReduce Model

[Diagram: parallel map tasks M1-M4 each read <K, V> input and emit <K, V> pairs; a Sort, Merge & Shuffle stage groups the pairs into <K1, V[ ]>, <K2, V[ ]>, <K3, V[ ]>, <K4, V[ ]>; parallel reduce tasks R1-R4 process the grouped pairs and emit <K, V> output.]

Page 8

MapReduce Model In Brief

(K1, V1) -> MAP -> List(K2, V2)

(K2, List(V2)) -> REDUCE -> List(K3, V3)
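As a concrete trace (added for illustration, using the reverse-indexing application from the later slides): MAP turns (f1, "w1 w2 w3 w4") into List[(w1, f1), (w2, f1), (w3, f1), (w4, f1)], and REDUCE turns (w2, List[f1, f2]) into List[(w2, "f1, f2")].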

Page 9

Hadoop MapReduce Application

- Implemented in Java
- Components:
  - Mapper
  - Reducer
  - Job Configuration
- Can also be written in other languages (Python, Perl, Shell, etc.) using the Streaming concept.

Page 10

Complete Application

public class YourApp {

    Mapper {}

    Reducer {}

    Job Configuration {}
}

Page 11

Mapper Class & Function

public static class YourMap extends Mapper<K1, V1, K2, V2> {
    public void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, value
        // K2 NewKey
        // V2 NewValue
        context.write(NewKey, NewValue);
    }
}

Page 12

Reducer Class & Function

public static class YourReduce extends Reducer<K2, V2, K3, V3> {
    public void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // DO YOUR PROCESSING ON key, values
        // K3 NewKey
        // V3 NewValue
        context.write(NewKey, NewValue);
    }
}

Page 13

What are I/O Formats?
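In brief (an answer added for context; the classes named here appear in the job configuration on the following slides): an InputFormat controls how each input split is parsed into <K, V> pairs for the Mapper. For example, KeyValueTextInputFormat turns every line into a (Text, Text) pair by splitting at the first separator (a tab by default). An OutputFormat controls how the Reducer's <K, V> pairs are written to the output files; for example, TextOutputFormat writes one key<TAB>value line per pair.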

Page 14

Job Configuration

public static void main(String[] args) throws Exception {
    // Create Configuration
    Configuration conf = new Configuration();
    // Create Job
    Job job = new Job(conf, "YourApp");
    // Specify input directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Specify output directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    // Specify the input split format by which the Mapper reads <K, V>
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // Specify the output format by which the Mapper emits <K, V>
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    // Specify the output format by which the Reducer emits <K, V>
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Specify the output format by which output is written to the output files
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setJarByClass(org.myorg.YourApp.class);
    job.waitForCompletion(true);
}

Page 15

Reverse Indexing Application

Input File: /hdfs/f1.dat

f1 w1 w2 w3 w4
f2 w2 w3 w4 w5
f3 w3 w4 w5 w6

Output File: /hdfs_op/o1.dat

w1 f1
w2 f1, f2
w3 f1, f2, f3
w4 f1, f2, f3
w5 f2, f3
w6 f3

[Diagram: a Job, consisting of CONF, MAP, and REDUCE, is submitted to the Hadoop System.]

Page 16

Mapper & Reducer Algo

Mapper:
    read line as K<filename>, V<rest of contents>
    tokenize V
    for every token t:
        emit K<t>, V<filename>

Reducer:
    receive K<token>, V[ ]<filenames>
    make a unique list of V[ ]
    form a comma-separated string str of the filenames in V[ ]
    emit K<token>, V<str>
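(Why the unique list: a token can occur several times in the same file, so the same filename may be emitted more than once and appear repeatedly in V[ ].)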

Page 17

Actual Java Program: Mapper

public static class Map extends Mapper<Text, Text, Text, Text> {
    private Text word = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String temp = tokenizer.nextToken();
            // Strip the last character from a word if it is not a letter
            if (!temp.matches(".*[a-zA-Z]$")) {
                word.set(temp.substring(0, temp.length() - 1));
            } else {
                word.set(temp);
            }
            context.write(word, key);
        }
    }
}
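A note on the input (an assumption, not stated on the slide): KeyValueTextInputFormat splits each line at the first separator, a tab by default, so the input lines are presumably tab-delimited between the filename (the key) and the rest of the contents (the value).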

Page 18

Actual Java Program: Reducer

public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String doc_list = "";
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        for (Text val : values) {
            map.put(val.toString(), 1);
        }
        Iterator<String> keySetIterator = map.keySet().iterator();
        while (keySetIterator.hasNext()) {
            String k = keySetIterator.next();
            doc_list += k + ",";
        }
        if (doc_list.length() > 0 && doc_list.charAt(doc_list.length() - 1) == ',') {
            doc_list = doc_list.substring(0, doc_list.length() - 1);
        }
        context.write(key, new Text(doc_list));
    }
}
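Design note (an observation, not from the slides): the HashMap is used purely to deduplicate filenames; its values are never read, so a HashSet<String> would express the same intent more directly.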

Page 19

Actual Java Program: Main()

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "reverse-index");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setJarByClass(org.myorg.ReverseIndex.class);
    job.waitForCompletion(true);
}

Page 20

Actual Java Program: Complete App

package org.myorg;

// IMPORT RELEVANT API Libraries (see the sketch below)

public class AppName {

    Mapper() {}

    Reducer() {}

    Main() {}
}
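For reference, a sketch of the imports such an app typically needs (added for completeness; this assumes the new org.apache.hadoop.mapreduce API used throughout these slides):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;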

Page 21

How to Execute?

- Compile:
  /usr/java/jdk1.7.0_25/bin/javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d classes ip1/ReverseIndex.java

- Make a JAR:
  /usr/java/jdk1.7.0_25/bin/jar -cvf jar/reverse_index.jar -C classes/ .

- Submit the JAR as a job:
  hadoop jar jar/reverse_index.jar org.myorg.ReverseIndex ip op
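Once the job completes, the result can be inspected with, e.g., hadoop fs -cat op/part-r-00000 (the standard reducer output file name; the exact name may vary with the number of reducers).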

Page 22

DEMO

Page 23

Important Links

A few examples at my github: https://github.com/dkuldeep11/hadoop

Clear Basics: https://www.udacity.com/course/ud617

Hadoop MR Concept: http://developer.yahoo.com/hadoop/tutorial/module4.html#basics

MR Coding Basics: http://hadoop.apache.org/docs/stable1/mapred_tutorial.html

In Depth: http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-mapreduce-programming/

Page 24

Thank You!

Q/A