Hadoop introduction

Introduction to Hadoop - Arnaud Cogoluègnes - Zenika

Transcript of Hadoop introduction

Page 1: Hadoop introduction

Introduction to Hadoop
Arnaud Cogoluègnes - Zenika

Page 2: Hadoop introduction

Hadoop overview

Distributed system:

● Distributed file system (HDFS)
● Programming interface (MapReduce, YARN)

Page 3: Hadoop introduction

Hadoop

[Diagram: the Hadoop stack: your application runs on MapReduce/YARN, which run on top of HDFS.]

Page 4: Hadoop introduction

Hadoop Distributed File System

What is it good for?

Page 5: Hadoop introduction

HDFS is scalable

More nodes, more space
Parallel reads

Page 6: Hadoop introduction

HDFS is fault-tolerant

File blocks are replicated across nodes
Default replication is 3
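As a small illustration (a sketch using the HDFS Java API shown later in this deck; the file path is hypothetical and not from the slides), the replication factor of a file can be read and changed programmatically:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("/tmp/file.csv"); // hypothetical file
// current replication factor of this file
short replication = fs.getFileStatus(file).getReplication();
// ask the namenode to keep 3 replicas of each block (the default)
fs.setReplication(file, (short) 3);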

Page 7: Hadoop introduction

HDFS has plenty of features

Data can be partitioned (by using directories)
The storage is tunable (compression)

There are many storage formats (Avro, Parquet)

Page 8: Hadoop introduction

HDFS has limitations

Simply put, files are append-only
Good for "write once, read many times"

A few large files are better than many small files

Page 9: Hadoop introduction

Blocks, datanodes, namenode

[Diagram: file.csv is made of 3 blocks B1, B2, B3 (default block size is 128 MB). Datanodes DN 1 to DN 4 store the file blocks; here block B3 is under-replicated. The namenode keeps the block locations (B1: 1, 2, 3; B2: 1, 3, 4; B3: 2, 4), handles file metadata and enforces replication.]

Page 10: Hadoop introduction

HDFS Java API

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path inputDir = new Path("/tmp");
FileStatus[] inputFiles = fs.listStatus(inputDir);
// open each file of the input directory for reading
for (FileStatus inputFile : inputFiles) {
    FSDataInputStream in = fs.open(inputFile.getPath());
    // ... read from the stream, then close it
}

Page 11: Hadoop introduction

HDFS Shell

$ hdfs dfs -ls /

Found 3 items

drwxr-xr-x - acogoluegnes supergroup 0 2014-08-05 22:12 /apps

drwxrwx--- - acogoluegnes supergroup 0 2014-08-13 21:38 /tmp

drwxr-xr-x - acogoluegnes supergroup 0 2014-07-27 12:30 /user

$

Page 12: Hadoop introduction

MapReduce is...

… scalable
… simple, yet allows for many functions on data
… for batch, not real-time processing

Page 13: Hadoop introduction

Keys, Values

map(row) {
  emit(k, v);
}

reduce(k, [v1 … vn]) {
  ...
  emit(...)
}

Page 14: Hadoop introduction

Key, value?

word count
● key = word, value = 1
● reducer sums

distinct/unique
● key = ID, value = row
● reducer emits only one row

aggregation (group by / sum)
● key = group column, value = row
● reducer sums on given column
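For the word-count case, a minimal mapper and reducer could look as follows (a sketch in the spirit of the examples later in this deck; the class names are made up):

public static class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // key = word, value = 1
      }
    }
  }
}

public static class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get(); // the reducer sums the 1s emitted for this word
    }
    context.write(key, new IntWritable(sum));
  }
}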

Page 15: Hadoop introduction

MapReduce

[Diagram: each block of file.csv (B1, B2, B3) is read by one Mapper; mapper outputs such as (k1, v1) and (k1, v2) are grouped by key and sent to the Reducers as (k1, [v1, v2]).]

Page 16: Hadoop introduction

Code goes to data

[Diagram: same MapReduce flow, overlaid on the datanodes: each Mapper runs on one of the datanodes that store its block (here DN 1, DN 3 and DN 4), so the code moves to the data rather than the data to the code.]

Page 17: Hadoop introduction

MapReduce in Hadoop

1 mapper input = 1 split = 1 block
Map and reduce can be retried
Map and reduce must be idempotent

Page 18: Hadoop introduction

Shuffling

[Diagram: between the Mappers (reading B1, B2, B3) and the Reducers, mapper outputs such as (k1, v1) and (k1, v2) are transferred and grouped by key into (k1, [v1, v2]): this phase is the shuffling.]

Page 19: Hadoop introduction

Unique mapper

public static class ByIdMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. line = "2,some value"
    String id = StringUtils.split(value.toString(), ",")[0];
    LongWritable emittedKey = new LongWritable(Long.valueOf(id));
    context.write(emittedKey, value);
  }
}

Page 20: Hadoop introduction

Unique reducer

public static class OneValueEmitReducer
    extends Reducer<LongWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    context.write(NullWritable.get(), values.iterator().next());
  }
}

Page 21: Hadoop introduction

MapReduce job

Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setMapperClass(ByIdMapper.class);
job.setReducerClass(OneValueEmitReducer.class);
FileInputFormat.setInputPaths(job, new Path("/data/in"));
FileOutputFormat.setOutputPath(job, new Path("/work/out"));
job.setInputFormatClass(TextInputFormat.class);
boolean result = job.waitForCompletion(true);
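One note on this configuration (a hedged addition, not on the original slide): when the map and reduce output types differ from Hadoop's defaults, they usually have to be declared too. A sketch matching the ByIdMapper and OneValueEmitReducer above:

job.setMapOutputKeyClass(LongWritable.class);  // ByIdMapper emits LongWritable keys
job.setMapOutputValueClass(Text.class);        // ... and Text values
job.setOutputKeyClass(NullWritable.class);     // OneValueEmitReducer emits NullWritable keys
job.setOutputValueClass(Text.class);           // ... and Text values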

Page 22: Hadoop introduction

MapReduce in Hadoop 1

MR = the only available processing model
Master = job tracker
Slave = task tracker (runs map or reduce tasks)

Page 23: Hadoop introduction

Hadoop 1

[Diagram: Hadoop 1 stack: MapReduce on top of HDFS.]

Page 24: Hadoop introduction

MapReduce in Hadoop 2

Yet Another Resource Negotiator
MapReduce re-implemented on top of YARN

Master = Resource Manager

Page 25: Hadoop introduction

Hadoop 2

[Diagram: Hadoop 2 stack: MapReduce and your application run on YARN, on top of HDFS.]

Page 26: Hadoop introduction

MapReduce limitations

Low-level code
Hard to re-use logic

Prefer higher-level abstractions like Cascading

Page 27: Hadoop introduction

File formats

Critical question: how to store my data?

Page 28: Hadoop introduction

File format criteria

Interoperability?
Splittability?
Compression support?
Kind(s) of access?
Schema evolution?

Support in MapReduce and Hadoop?

Page 29: Hadoop introduction

Hadoop file formats

SequenceFile: very efficient, but limited in features (e.g. no schema, not appendable)

Avro file: efficient, schema support
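To make the SequenceFile format concrete, here is a small sketch of writing one (assuming Hadoop 2's SequenceFile.Writer option API; the path and records are invented for the example):

Configuration conf = new Configuration();
Path path = new Path("/tmp/example.seq"); // hypothetical output path
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
    writer.append(new LongWritable(1L), new Text("first record"));
    writer.append(new LongWritable(2L), new Text("second record"));
}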

Page 30: Hadoop introduction

“Community” file formats

Thrift and Protocol Buffers: quite popular, schema support

ORC and Parquet: columnar storage, good for field selection and projection

Page 31: Hadoop introduction

Compression codecs

deflate, gzip: efficient but slow, good for archiving
snappy: less efficient but faster than gzip, good for everyday usage
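As an example (a sketch, not from the slides; it assumes a MapReduce job with a file-based output format and a cluster where the Snappy native libraries are available), output compression can be enabled on the job:

// compress the job output with Snappy
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
// for archiving, org.apache.hadoop.io.compress.GzipCodec could be used instead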

Page 32: Hadoop introduction

Compression and splittability

Take a CSV file: it's splittable
Compress it with GZIP: it's no longer splittable

Few compression codecs are splittable

Page 33: Hadoop introduction

File formats and compression

Some file formats are splittable regardless of the compression codec.

Good news! Avro, Parquet, SequenceFile, Thrift

Page 34: Hadoop introduction

Lambda architecture

General architecture for big data systems
Computing arbitrary functions on arbitrary data in real time

Page 35: Hadoop introduction

Lambda architecture

Page 36: Hadoop introduction

Lambda architecture wish list

● Fault-tolerant
● Low latency
● Scalable
● General
● Extensible
● Ad hoc queries
● Minimal maintenance
● Debuggable

Page 37: Hadoop introduction

Layers

Speed layer

Serving layer

Batch layer

Page 38: Hadoop introduction

Batch layer


Dataset storage. Views computation.

Page 39: Hadoop introduction

Serving layer


Random access to batch views.

Page 40: Hadoop introduction

Speed layer


Low latency access.

Page 41: Hadoop introduction

Batch layer


Hadoop (MapReduce, HDFS). Thrift, Cascalog.

Page 42: Hadoop introduction

Serving layer


ElephantDB, BerkeleyDB.

Page 43: Hadoop introduction

Speed layer


Cassandra, Storm, Kafka.

Page 44: Hadoop introduction

Apache Tez

“Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks”

Page 45: Hadoop introduction

Apache Tez’s motivations

More general than MapReduce
Petabyte scale

Page 46: Hadoop introduction

Tez and Hadoop

[Diagram: Tez and your application run on YARN, on top of HDFS.]

Page 47: Hadoop introduction

Source: http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/

Page 48: Hadoop introduction

Resource management

Re-use of containers
Caching with sessions
Runtime optimization

Page 49: Hadoop introduction

Apache Tez integration

Already in Hive and Pig
Cascading Tez planner in progress

Page 50: Hadoop introduction

Apache Spark

“fast and general engine for large-scale data processing”

Page 51: Hadoop introduction

Spark and Hadoop

[Diagram: Spark and your application run on YARN, on top of HDFS.]

Page 52: Hadoop introduction

Resilient distributed dataset (RDD)

Loads files from HDFS or the local FS
Does computations on them
Stores them in memory for re-use

Page 53: Hadoop introduction

Caching

“Caching is a key tool for iterative algorithms and fast interactive use”

Page 54: Hadoop introduction

Spark Java API

// sc is a JavaSparkContext
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
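Tying this back to the caching slide (a small addition, not in the original snippet): an RDD that will be re-used can be kept in memory before the first action runs:

// keep lineLengths in memory so later actions re-use it instead of recomputing it
lineLengths.persist(StorageLevel.MEMORY_ONLY());
// shortcut for the same storage level: lineLengths.cache();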

Page 55: Hadoop introduction

HBase, content of a table

[Table: one row of an HBase table.

row key = 123
  column family "info"
    column "nom":  value = 'Lyon', version = 2
                   value = 'Lugdunum', version = 1
                   ...
    column "date": value = '2014-11-12', version = 1
                   ...
  ...

Each (row key, column family, column, version) identifies a cell.]
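A hedged sketch of reading such a cell with the HBase Java client (the deck does not show this code; it assumes the HBase 1.0+ client API and a hypothetical table named "cities"):

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("cities"))) {
    Get get = new Get(Bytes.toBytes("123"));                    // row key
    get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nom")); // column family + column
    Result result = table.get(get);
    // latest version of the cell, e.g. 'Lyon'
    String nom = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("nom")));
}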

Page 56: Hadoop introduction

Repartition join

[Diagram: mappers read both datasets (users: "jdoe,US", "pmartin,FR"; logs: "jdoe,/products", "pmartin,/checkout", "jdoe,/account") and emit every record keyed by user; each reducer then receives all the records for one user (e.g. "jdoe,/products", "jdoe,US", "jdoe,/account") and performs an in-memory cartesian product to produce the joined output ("jdoe,/products,US", "jdoe,/account,US").]
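What the reduce side of such a repartition join could look like (a sketch, not from the deck; the dataset tags "U"/"L" and the class name are invented, and mappers are assumed to key each record by user and to prefix it with its dataset):

public static class RepartitionJoinReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  @Override
  protected void reduce(Text user, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> users = new ArrayList<>();
    List<String> logs = new ArrayList<>();
    for (Text value : values) {
      String record = value.toString();
      if (record.startsWith("U,")) {    // user record, e.g. "U,US"
        users.add(record.substring(2));
      } else {                          // log record, e.g. "L,/products"
        logs.add(record.substring(2));
      }
    }
    // in-memory cartesian product, e.g. "jdoe,/products,US"
    for (String log : logs) {
      for (String u : users) {
        context.write(NullWritable.get(), new Text(user + "," + log + "," + u));
      }
    }
  }
}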

Page 57: Hadoop introduction

Repartition join optimization

[Diagram: same repartition join, but records are sorted so that the user record ("jdoe,US") reaches the reducer before the log records; only the "users" records have to be buffered in memory (thanks to dataset indicator sorting, i.e. "secondary sort").]

Page 58: Hadoop introduction

Optimization in Cascading CoGroup

“During co-grouping, for any given unique grouping key, all of the rightmost pipes will accumulate the current grouping values into memory so they may be iterated across for every value in the left hand side pipe. (...) There is no accumulation for the left hand side pipe, only for those to the "right".

Thus, for the pipe that has the largest number of values per unique key grouping, on average, it should be made the "left hand side" pipe (lhs).”

Page 59: Hadoop introduction

Replicated/asymmetrical join

[Diagram: the small "users" dataset ("jdoe,US", "pmartin,FR") is loaded in the distributed cache (hence "replicated") and is therefore available in memory on every mapper; each mapper streams its log records ("jdoe,/products", "pmartin,/checkout", "jdoe,/account") and joins them directly, producing "jdoe,/products,US", "jdoe,/account,US", "pmartin,/checkout,FR" without any reduce phase.]
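A sketch of such a replicated (map-side) join with the MapReduce API (not from the deck; the class name is invented, the driver is assumed to have called job.addCacheFile() for the users file, and the localization details of the distributed cache may vary with the Hadoop version):

public static class ReplicatedJoinMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final Map<String, String> userCountries = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // cached files are localized on each node and symlinked under their base name
    for (URI cacheFile : context.getCacheFiles()) {
      File localFile = new File(new Path(cacheFile.getPath()).getName());
      try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split(",");       // e.g. "jdoe,US"
          userCountries.put(fields[0], fields[1]); // user -> country, kept in memory
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(","); // e.g. "jdoe,/products"
    String country = userCountries.get(fields[0]); // in-memory lookup, no reducer needed
    if (country != null) {
      context.write(NullWritable.get(), new Text(value.toString() + "," + country));
    }
  }
}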

Page 60: Hadoop introduction

Hive, Pig, Cascading

UDF: User Defined Function

Hive
+ SQL (non-standard)
+ Low learning curve
+ Extensible with UDF
- So-so testability
- So-so reusability
- No flow control
- Spread logic (script, Java, shell)
- Programming with UDF

Pig
+ Pig Latin
+ Low learning curve
+ Extensible with UDF
- So-so testability
- So-so reusability
- Spread logic (script, Java, shell)
- Programming with UDF

Cascading
+ Java API
+ Unit testable
+ Flow control (if, try/catch, etc.)
+ Good re-usability
- Programming needed

Page 61: Hadoop introduction

Typical processing

Receiving data (bulk or streams)Processing in batch mode

Feed to real-time systems (RDBMs, NoSQL)

Page 62: Hadoop introduction

Use cases

Parsing, processing, aggregating data
"Diff-ing" 2 datasets

Joining data

Page 63: Hadoop introduction

Join generated and reference data

[Diagram: generated data and reference data both flow into Hadoop, where they are joined and transformed; the result feeds reporting.]

Page 64: Hadoop introduction

Data handling

[Diagram: the data pipeline on HDFS.
● Raw data (archives): Avro + GZIP, kept forever
● Parsed data (view on the data): Parquet + Snappy, 2 years of data kept
● Processing and insertion (transformations): Cascading, results loaded into a real-time DB]

Page 65: Hadoop introduction

Flow handling with Spring Batch

[Diagram: a Spring Batch flow: Archiving, then several Processing steps, then Cleaning; the steps use the Java HDFS API, Cascading and MapReduce.]