Hadoop introduction

Introduction to Hadoop - Arnaud Cogoluègnes - Zenika

Transcript of Hadoop introduction

Page 1: Hadoop introduction

Introduction to Hadoop
Arnaud Cogoluègnes - Zenika

Page 2: Hadoop introduction

Hadoop overview

Distributed system:

● Distributed file system (HDFS)
● Programming interface (MapReduce, YARN)

Page 3: Hadoop introduction

Hadoop

[Diagram: the Hadoop stack: your application runs on MapReduce/YARN, which run on top of HDFS.]

Page 4: Hadoop introduction

Hadoop Distributed File System

What is it good for?

Page 5: Hadoop introduction

HDFS is scalable

More nodes, more space
Parallel reads

Page 6: Hadoop introduction

HDFS is fault-tolerant

File blocks are replicated across nodes
Default replication is 3
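As a small illustration (a sketch using the HDFS Java API shown later in this deck; the file path is hypothetical and not from the slides), the replication factor of a file can be read and changed programmatically:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("/tmp/file.csv"); // hypothetical file
// current replication factor of this file
short replication = fs.getFileStatus(file).getReplication();
// ask the namenode to keep 3 replicas of each block (the default)
fs.setReplication(file, (short) 3);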

Page 7: Hadoop introduction

HDFS has plenty of features

Data can be partitioned (by using directories)
The storage is tunable (compression)

There are many storage formats (Avro, Parquet)

Page 8: Hadoop introduction

HDFS has limitations

Simply put, files are append-only
Good for "write once, read many times"

A few large files are better than many small files

Page 9: Hadoop introduction

Blocks, datanodes, namenode

[Diagram: file.csv is made of 3 blocks B1, B2, B3 (default block size is 128 MB). Datanodes DN 1 to DN 4 store the file blocks; here block B3 is under-replicated. The namenode keeps the block locations (B1: 1, 2, 3; B2: 1, 3, 4; B3: 2, 4), handles file metadata and enforces replication.]

Page 10: Hadoop introduction

HDFS Java API

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path inputDir = new Path("/tmp");
FileStatus[] inputFiles = fs.listStatus(inputDir);
// open each file of the input directory for reading
for (FileStatus inputFile : inputFiles) {
    FSDataInputStream in = fs.open(inputFile.getPath());
    // ... read from the stream, then close it
}

Page 11: Hadoop introduction

HDFS Shell

$ hdfs dfs -ls /

Found 3 items

drwxr-xr-x - acogoluegnes supergroup 0 2014-08-05 22:12 /apps

drwxrwx--- - acogoluegnes supergroup 0 2014-08-13 21:38 /tmp

drwxr-xr-x - acogoluegnes supergroup 0 2014-07-27 12:30 /user

$

Page 12: Hadoop introduction

MapReduce is...

… scalable
… simple, yet allows for many functions on data
… for batch, not real-time processing

Page 13: Hadoop introduction

Keys, Values

map(row) {
  emit(k, v);
}

reduce(k, [v1 … vn]) {
  ...
  emit(...)
}

Page 14: Hadoop introduction

Key, value?

word count
● key = word, value = 1
● reducer sums

distinct/unique
● key = ID, value = row
● reducer emits only one row

aggregation (group by / sum)
● key = group column, value = row
● reducer sums on given column
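For the word-count case, a minimal mapper and reducer could look as follows (a sketch in the spirit of the examples later in this deck; the class names are made up):

public static class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // key = word, value = 1
      }
    }
  }
}

public static class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get(); // the reducer sums the 1s emitted for this word
    }
    context.write(key, new IntWritable(sum));
  }
}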

Page 15: Hadoop introduction

MapReduce

[Diagram: each block of file.csv (B1, B2, B3) is read by one Mapper; mapper outputs such as (k1, v1) and (k1, v2) are grouped by key and sent to the Reducers as (k1, [v1, v2]).]

Page 16: Hadoop introduction

Code goes to data

[Diagram: same MapReduce flow, overlaid on the datanodes: each Mapper runs on one of the datanodes that store its block (here DN 1, DN 3 and DN 4), so the code moves to the data rather than the data to the code.]

Page 17: Hadoop introduction

MapReduce in Hadoop

1 mapper input = 1 split = 1 block
Map and reduce can be retried
Map and reduce must be idempotent

Page 18: Hadoop introduction

Shuffling

[Diagram: between the Mappers (reading B1, B2, B3) and the Reducers, mapper outputs such as (k1, v1) and (k1, v2) are transferred and grouped by key into (k1, [v1, v2]): this phase is the shuffling.]

Page 19: Hadoop introduction

Unique mapper

public static class ByIdMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. line = "2,some value"
    String id = StringUtils.split(value.toString(), ",")[0];
    LongWritable emittedKey = new LongWritable(Long.valueOf(id));
    context.write(emittedKey, value);
  }
}

Page 20: Hadoop introduction

Unique reducer

public static class OneValueEmitReducer
    extends Reducer<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    context.write(NullWritable.get(), values.iterator().next());
  }
}

Page 21: Hadoop introduction

MapReduce job

Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setMapperClass(ByIdMapper.class);
job.setReducerClass(OneValueEmitReducer.class);
FileInputFormat.setInputPaths(job, new Path("/data/in"));
FileOutputFormat.setOutputPath(job, new Path("/work/out"));
job.setInputFormatClass(TextInputFormat.class);
boolean result = job.waitForCompletion(true);
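One note on this configuration (a hedged addition, not on the original slide): when the map and reduce output types differ from Hadoop's defaults, they usually have to be declared too. A sketch matching the ByIdMapper and OneValueEmitReducer above:

job.setMapOutputKeyClass(LongWritable.class);  // ByIdMapper emits LongWritable keys
job.setMapOutputValueClass(Text.class);        // ... and Text values
job.setOutputKeyClass(NullWritable.class);     // OneValueEmitReducer emits NullWritable keys
job.setOutputValueClass(Text.class);           // ... and Text values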

Page 22: Hadoop introduction

MapReduce in Hadoop 1

MR = the only available processing model
Master = job tracker
Slave = task tracker (runs map or reduce tasks)

Page 23: Hadoop introduction

Hadoop 1

[Diagram: Hadoop 1 stack: MapReduce on top of HDFS.]

Page 24: Hadoop introduction

MapReduce in Hadoop 2

Yet Another Resource Negotiator
MapReduce re-implemented on top of YARN

Master = Resource Manager

Page 25: Hadoop introduction

Hadoop 2

[Diagram: Hadoop 2 stack: MapReduce and your application run on YARN, on top of HDFS.]

Page 26: Hadoop introduction

MapReduce limitations

Low-level code
Hard to re-use logic

Prefer higher-level abstractions like Cascading

Page 27: Hadoop introduction

File formats

Critical question: how to store my data?

Page 28: Hadoop introduction

File format criteria

Interoperability?
Splittability?
Compression support?
Kind(s) of access?
Schema evolution?

Support in MapReduce and Hadoop?

Page 29: Hadoop introduction

Hadoop file formats

SequenceFile: very efficient, but limited in features (e.g. no schema, not appendable)

Avro file: efficient, schema support
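To make the SequenceFile format concrete, here is a small sketch of writing one (assuming Hadoop 2's SequenceFile.Writer option API; the path and records are invented for the example):

Configuration conf = new Configuration();
Path path = new Path("/tmp/example.seq"); // hypothetical output path
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
    writer.append(new LongWritable(1L), new Text("first record"));
    writer.append(new LongWritable(2L), new Text("second record"));
}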

Page 30: Hadoop introduction

“Community” file formats

Thrift and Protocol Buffers: quite popular, schema support

ORC and Parquet: columnar storage, good for field selection and projection

Page 31: Hadoop introduction

Compression codecs

deflate, gzip: efficient but slow, good for archiving
snappy: less efficient but faster than gzip, good for everyday usage
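As an example (a sketch, not from the slides; it assumes a MapReduce job with a file-based output format and a cluster where the Snappy native libraries are available), output compression can be enabled on the job:

// compress the job output with Snappy
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
// for archiving, org.apache.hadoop.io.compress.GzipCodec could be used instead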

Page 32: Hadoop introduction

Compression and splittability

Take a CSV file: it's splittable
Compress it with GZIP: it's no longer splittable

Few compression codecs are splittable

Page 33: Hadoop introduction

File formats and compression

Some file formats are splittable regardless of the compression codec.

Good news! Avro, Parquet, SequenceFile, Thrift

Page 34: Hadoop introduction

Lambda architecture

General architecture for big data systems
Computing arbitrary functions on arbitrary data in real time

Page 35: Hadoop introduction

Lambda architecture

Page 36: Hadoop introduction

Lambda architecture wish list

● Fault-tolerant
● Low latency
● Scalable
● General
● Extensible
● Ad hoc queries
● Minimal maintenance
● Debuggable

Page 37: Hadoop introduction

Layers

Speed layer

Serving layer

Batch layer

Page 38: Hadoop introduction

Batch layer


Dataset storage. Views computation.

Page 39: Hadoop introduction

Serving layer


Random access to batch views.

Page 40: Hadoop introduction

Speed layer


Low latency access.

Page 41: Hadoop introduction

Batch layer


Hadoop (MapReduce, HDFS). Thrift, Cascalog.

Page 42: Hadoop introduction

Serving layer


ElephantDB, BerkeleyDB.

Page 43: Hadoop introduction

Speed layer


Cassandra, Storm, Kafka.

Page 44: Hadoop introduction

Apache Tez

“Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks”

Page 45: Hadoop introduction

Apache Tez’s motivations

More general than MapReduce
Petabyte scale

Page 46: Hadoop introduction

Tez and Hadoop

[Diagram: Tez and your application run on YARN, on top of HDFS.]

Page 47: Hadoop introduction

Source: http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/

Page 48: Hadoop introduction

Resource management

Re-use of containers
Caching with sessions
Runtime optimization

Page 49: Hadoop introduction

Apache Tez integration

Already in Hive and Pig
Cascading Tez planner in progress

Page 50: Hadoop introduction

Apache Spark

“fast and general engine for large-scale data processing”

Page 51: Hadoop introduction

Spark and Hadoop

[Diagram: Spark and your application run on YARN, on top of HDFS.]

Page 52: Hadoop introduction

Resilient distributed dataset (RDD)

Loads files from HDFS or the local FS
Does computations on them
Stores them in memory for re-use

Page 53: Hadoop introduction

Caching

“Caching is a key tool for iterative algorithms and fast interactive use”

Page 54: Hadoop introduction

Spark Java API

// sc is a JavaSparkContext
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
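Tying this back to the caching slide (a small addition, not in the original snippet): an RDD that will be re-used can be kept in memory before the first action runs:

// keep lineLengths in memory so later actions re-use it instead of recomputing it
lineLengths.persist(StorageLevel.MEMORY_ONLY());
// shortcut for the same storage level: lineLengths.cache();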

Page 55: Hadoop introduction

HBase, content of a table

[Table: one row of an HBase table.

row key = 123
  column family "info"
    column "nom":  value = 'Lyon', version = 2
                   value = 'Lugdunum', version = 1
                   ...
    column "date": value = '2014-11-12', version = 1
                   ...
  ...

Each (row key, column family, column, version) identifies a cell.]
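A hedged sketch of reading such a cell with the HBase Java client (the deck does not show this code; it assumes the HBase 1.0+ client API and a hypothetical table named "cities"):

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("cities"))) {
    Get get = new Get(Bytes.toBytes("123"));                    // row key
    get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nom")); // column family + column
    Result result = table.get(get);
    // latest version of the cell, e.g. 'Lyon'
    String nom = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("nom")));
}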

Page 56: Hadoop introduction

Repartition join

[Diagram: mappers read both datasets (users: "jdoe,US", "pmartin,FR"; logs: "jdoe,/products", "pmartin,/checkout", "jdoe,/account") and emit every record keyed by user; each reducer then receives all the records for one user (e.g. "jdoe,/products", "jdoe,US", "jdoe,/account") and performs an in-memory cartesian product to produce the joined output ("jdoe,/products,US", "jdoe,/account,US").]
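What the reduce side of such a repartition join could look like (a sketch, not from the deck; the dataset tags "U"/"L" and the class name are invented, and mappers are assumed to key each record by user and to prefix it with its dataset):

public static class RepartitionJoinReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  @Override
  protected void reduce(Text user, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> users = new ArrayList<>();
    List<String> logs = new ArrayList<>();
    for (Text value : values) {
      String record = value.toString();
      if (record.startsWith("U,")) {    // user record, e.g. "U,US"
        users.add(record.substring(2));
      } else {                          // log record, e.g. "L,/products"
        logs.add(record.substring(2));
      }
    }
    // in-memory cartesian product, e.g. "jdoe,/products,US"
    for (String log : logs) {
      for (String u : users) {
        context.write(NullWritable.get(), new Text(user + "," + log + "," + u));
      }
    }
  }
}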

Page 57: Hadoop introduction

Repartition join optimization

[Diagram: same repartition join, but records are sorted so that the user record ("jdoe,US") reaches the reducer before the log records; only the "users" records have to be buffered in memory (thanks to dataset indicator sorting, i.e. "secondary sort").]

Page 58: Hadoop introduction

Optimization in Cascading CoGroup

“During co-grouping, for any given unique grouping key, all of the rightmost pipes will accumulate the current grouping values into memory so they may be iterated across for every value in the left hand side pipe. (...) There is no accumulation for the left hand side pipe, only for those to the "right".

Thus, for the pipe that has the largest number of values per unique key grouping, on average, it should be made the "left hand side" pipe (lhs).”

Page 59: Hadoop introduction

Replicated/asymmetrical join

[Diagram: the small "users" dataset ("jdoe,US", "pmartin,FR") is loaded in the distributed cache (hence "replicated") and is therefore available in memory on every mapper; each mapper streams its log records ("jdoe,/products", "pmartin,/checkout", "jdoe,/account") and joins them directly, producing "jdoe,/products,US", "jdoe,/account,US", "pmartin,/checkout,FR" without any reduce phase.]
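A sketch of such a replicated (map-side) join with the MapReduce API (not from the deck; the class name is invented, the driver is assumed to have called job.addCacheFile() for the users file, and the localization details of the distributed cache may vary with the Hadoop version):

public static class ReplicatedJoinMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Map<String, String> userCountries = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // cached files are localized on each node and symlinked under their base name
    for (URI cacheFile : context.getCacheFiles()) {
      File localFile = new File(new Path(cacheFile.getPath()).getName());
      try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split(",");       // e.g. "jdoe,US"
          userCountries.put(fields[0], fields[1]); // user -> country, kept in memory
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(","); // e.g. "jdoe,/products"
    String country = userCountries.get(fields[0]); // in-memory lookup, no reducer needed
    if (country != null) {
      context.write(NullWritable.get(), new Text(value.toString() + "," + country));
    }
  }
}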

Page 60: Hadoop introduction

Hive, Pig, Cascading

UDF: User Defined Function

Hive
+ SQL (non-standard)
+ Low learning curve
+ Extensible with UDF
- So-so testability
- So-so reusability
- No flow control
- Spread logic (script, Java, shell)
- Programming with UDF

Pig
+ Pig Latin
+ Low learning curve
+ Extensible with UDF
- So-so testability
- So-so reusability
- Spread logic (script, Java, shell)
- Programming with UDF

Cascading
+ Java API
+ Unit testable
+ Flow control (if, try/catch, etc.)
+ Good re-usability
- Programming needed

Page 61: Hadoop introduction

Typical processing

Receiving data (bulk or streams)Processing in batch mode

Feed to real-time systems (RDBMs, NoSQL)

Page 62: Hadoop introduction

Use cases

Parsing, processing, aggregating data
"Diff-ing" 2 datasets

Joining data

Page 63: Hadoop introduction

Join generated and reference data

[Diagram: generated data and reference data both flow into Hadoop, where they are joined and transformed; the result feeds reporting.]

Page 64: Hadoop introduction

Data handling

[Diagram: the data pipeline on HDFS.
● Raw data (archives): Avro + GZIP, kept forever
● Parsed data (view on the data): Parquet + Snappy, 2 years of data kept
● Processing and insertion (transformations): Cascading, results loaded into a real-time DB]

Page 65: Hadoop introduction

Flow handling with Spring Batch

[Diagram: a Spring Batch flow: Archiving, then several Processing steps, then Cleaning; the steps use the Java HDFS API, Cascading and MapReduce.]