Post on 15-Jul-2015
Introduction to Hadoop
Arnaud Cogoluègnes - Zenika
Hadoop overview
Distributed system:
● Distributed file system (HDFS)
● Programming interface (MapReduce, YARN)
Hadoop
HDFS
MapReduce, YARN
Your application
Hadoop Distributed File System
What is it good for?
HDFS is scalable
More nodes, more space
Parallel reads
HDFS is fault-tolerant
File blocks are replicated across nodes
Default replication is 3
HDFS has plenty of features
Data can be partitioned (by using directories)
The storage is tunable (compression)
There are many storage formats (Avro, Parquet)
HDFS has limitations
Simply put, files are append-only
Good for “write once, read many times”
A few large files are better than many small files
Blocks, datanodes, namenode
file.csv is made of 3 blocks: B1, B2, B3 (default block size is 128 MB)

DN 1: B1, B2    DN 2: B1, B3
DN 3: B1, B2    DN 4: B2, B3
Datanodes store file blocks (here, block B3 is under-replicated)

Namenode: B1: 1, 2, 3; B2: 1, 3, 4; B3: 2, 4
The namenode handles file metadata and enforces replication
HDFS Java API
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path inputDir = new Path("/tmp");
FileStatus[] inputFiles = fs.listStatus(inputDir);
for (FileStatus status : inputFiles) {
    FSDataInputStream in = fs.open(status.getPath());
    // ... read from the stream, then close it
    in.close();
}
HDFS Shell
$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x - acogoluegnes supergroup 0 2014-08-05 22:12 /apps
drwxrwx--- - acogoluegnes supergroup 0 2014-08-13 21:38 /tmp
drwxr-xr-x - acogoluegnes supergroup 0 2014-07-27 12:30 /user
$
MapReduce is...
… scalable
… simple, yet allows for many functions on data
… for batch, not real-time processing
Keys, Values
map(row) {
emit(k,v);
}
reduce(k,[v1… vn]) {
...
emit(...)
}
Key, value?
word count
● key = word, value = 1
● reducer sums
distinct/unique
● key = ID, value = row
● reducer emits only one row
aggregation (group by / sum)
● key = group column, value = row
● reducer sums on given column
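As an illustration, the word count case can be simulated in plain Java, with explicit map, shuffle, and reduce phases. This is a sketch of the model, not Hadoop code; the class and method names are ours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    public static Map<String, Integer> wordCount(List<String> rows) {
        // map phase: emit (word, 1) for every word of every row
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String row : rows) {
            for (String word : row.split("\\s+")) {
                emitted.add(Map.entry(word, 1));
            }
        }
        // shuffle phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        // reduce phase: sum the values of each key
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. counts "a" 3 times and "b" 2 times
        System.out.println(wordCount(List.of("a b a", "b a")));
    }
}
```

In Hadoop, the shuffle phase is performed by the framework between the mappers and the reducers; only the map and reduce functions are yours to write.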
MapReduce
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
Code goes to data
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
B1 B2 B1 B3
B1 B2 B2 B3
DN 1 DN 2
DN 4 DN 3
DN 1
DN 3
DN 4
MapReduce in Hadoop
1 mapper input = 1 split = 1 block
Map and reduce can be retried
Map and reduce must be idempotent
Shuffling
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
shuffling
Unique mapper
public static class ByIdMapper extends Mapper<LongWritable,Text,LongWritable,Text> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// e.g. line = “2,some value”
String id = StringUtils.split(value.toString(),",")[0];
LongWritable emittedKey = new LongWritable(Long.valueOf(id));
context.write(emittedKey,value);
}
}
Unique reducer
public static class OneValueEmitReducer
extends Reducer<LongWritable,Text,NullWritable,Text> {
@Override
protected void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
context.write(NullWritable.get(), values.iterator().next());
}
}
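The combined effect of the two classes above (one output row per id) can be sketched in plain Java, outside Hadoop. The class name is ours, and the sketch ignores that Hadoop also sorts keys before the reduce phase.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    // "map": key = id (first CSV column), value = the row
    // "reduce": emit only the first row seen for each key
    public static List<String> distinctById(List<String> rows) {
        Map<String, String> firstByKey = new LinkedHashMap<>();
        for (String row : rows) {
            String id = row.split(",")[0];   // e.g. "2,some value" -> "2"
            firstByKey.putIfAbsent(id, row); // keep only the first row per id
        }
        return List.copyOf(firstByKey.values());
    }

    public static void main(String[] args) {
        System.out.println(distinctById(List.of("2,some value", "1,other", "2,dup")));
        // → [2,some value, 1,other]
    }
}
```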
MapReduce job
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setMapperClass(ByIdMapper.class);
job.setReducerClass(OneValueEmitReducer.class);
FileInputFormat.setInputPaths(job, new Path("/data/in"));
FileOutputFormat.setOutputPath(job, new Path("/work/out"));
job.setInputFormatClass(TextInputFormat.class);
// key/value classes must match what the mapper and the reducer emit
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
boolean result = job.waitForCompletion(true);
MapReduce in Hadoop 1
MR = the only available processing
Master = job tracker
Slave = task tracker (map or reduce tasks)
Hadoop 1
HDFS
MapReduce
MapReduce in Hadoop 2
Yet Another Resource Negotiator
MapReduce re-implemented on top of YARN
Master = Resource Manager
Hadoop 2
HDFS
YARN
MapReduce Your app
MapReduce limitations
Low-level code
Hard to re-use logic
Prefer higher-level abstractions like Cascading
File formats
Critical question: how to store my data?
File formats criterions
Interoperability?
Splittability?
Compression support?
Kind(s) of access?
Schema evolution?
Support in MapReduce and Hadoop?
Hadoop file formats
SequenceFile : very efficient, but limited in features (e.g. no schema, not appendable)
Avro file : efficient, schema support
“Community” file formats
Thrift and protocol buffers : quite popular, schema support
ORC and Parquet : columnar storage, good for field selection and projection
Compression codecs
deflate, gzip: efficient but slow, good for archiving
snappy: less efficient but faster than gzip, good for everyday usage
Compression and splittability
Take a CSV file, it’s splittable
Compress it with GZIP, it’s no longer splittable
Few compression codecs are splittable
File formats and compression
Some file formats are splittable, whatever the compression codec. Good news!
Avro, Parquet, SequenceFile, Thrift
Lambda architecture
General architecture for big data systems
Computing arbitrary functions on arbitrary data in realtime
Lambda architecture
Lambda architecture wish list
● Fault-tolerant
● Low latency
● Scalable
● General
● Extensible
● Ad hoc queries
● Minimal maintenance
● Debuggable
Layers
Speed layer
Serving layer
Batch layer

Batch layer
Dataset storage. Views computation.
Hadoop (MapReduce, HDFS). Thrift, Cascalog.

Serving layer
Random access to batch views.
ElephantDB, BerkeleyDB.

Speed layer
Low latency access.
Cassandra, Storm, Kafka.
Apache Tez
“Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed
acyclic graph) of tasks”
Apache Tez’s motivations
More general than MapReduce
Petabyte scale
Tez and Hadoop
HDFS
YARN
Tez Your app
Source: http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
Resource management
Re-use of containers
Caching with sessions
Runtime optimization
Apache Tez integration
Already in Hive and Pig
Cascading Tez planner in progress
Apache Spark
“fast and general engine for large-scale data processing”
Spark and Hadoop
HDFS
YARN
Spark Your app
Resilient distributed dataset (RDD)
Loads files from HDFS or local FS
Does computations on them
Stores them in memory for re-use
Caching
“Caching is a key tool for iterative algorithms and fast interactive use”
Spark Java API
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
HBase, content of a table

row key 123 =>
  column family “info”
    column “nom”: value = ‘Lyon’, version = 2; value = ‘Lugdunum’, version = 1; ...
    column “date”: value = ‘2014-11-12’, version = 1; ...
  ...
(key, column family, column, cell)
Repartition join

Mappers read both datasets:
users: jdoe,US; pmartin,FR
logs: jdoe,/products; pmartin,/checkout; jdoe,/account

Reducer input for key jdoe: jdoe,/products; jdoe,US; jdoe,/account
Reducer output: jdoe,/products,US; jdoe,/account,US
In-memory cartesian product in the reducer
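The reduce step of the repartition join can be sketched in plain Java. This is a simulation, not Hadoop code; the class and method names are ours, and telling the two datasets apart by a leading “/” is our simplification (a real job carries a dataset indicator with each value).

```java
import java.util.ArrayList;
import java.util.List;

public class RepartitionJoinSketch {

    // reduce step for one key: values from both datasets arrive mixed;
    // the reducer buffers them in memory, separates the two sides, then
    // performs an in-memory cartesian product
    public static List<String> reduceJoin(String key, List<String> values) {
        List<String> users = new ArrayList<>();
        List<String> logs = new ArrayList<>();
        for (String v : values) {
            // simplification: log values start with '/', user values do not
            if (v.startsWith("/")) {
                logs.add(v);
            } else {
                users.add(v);
            }
        }
        List<String> out = new ArrayList<>();
        for (String log : logs) {
            for (String user : users) {
                out.add(key + "," + log + "," + user);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // key jdoe, values mixed as in the slide
        System.out.println(reduceJoin("jdoe", List.of("/products", "US", "/account")));
        // → [jdoe,/products,US, jdoe,/account,US]
    }
}
```

The buffering of all values for a key is exactly what the optimization on the next slide reduces: with secondary sort, only the small side needs to stay in memory.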
Repartition join optimization

Mappers read both datasets:
users: jdoe,US; pmartin,FR
logs: jdoe,/products; pmartin,/checkout; jdoe,/account

Reducer input for key jdoe: jdoe,US; jdoe,/products; jdoe,/account
Reducer output: jdoe,/products,US; jdoe,/account,US
Only “users” are kept in memory (thanks to dataset indicator sorting,
i.e. “secondary sort”)
Optimization in Cascading CoGroup
“During co-grouping, for any given unique grouping key, all of the rightmost pipes will accumulate the current grouping values into memory so they may be iterated across for every value in the left hand side pipe. (...) There is no accumulation for the left hand side pipe, only for those to the "right". Thus, for the pipe that has the largest number of values per unique key grouping, on average, it should be made the "left hand side" pipe (lhs).”
Replicated/asymmetrical join

Map-only join, no reduce phase.
The small dataset (users: jdoe,US; pmartin,FR) is loaded in the distributed cache (hence “replicated”) and joined in each mapper against a split of the logs (jdoe,/products; pmartin,/checkout; jdoe,/account).
Mapper outputs: jdoe,/products,US; jdoe,/account,US; pmartin,/checkout,FR
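The map-side join can be sketched in plain Java. This is a simulation, not Hadoop code: in a real job the small dataset would be read from the distributed cache; here it is passed in as a map, and the class and method names are ours.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicatedJoinSketch {

    // map-side join: the small "users" dataset is fully held in memory,
    // the big "logs" dataset is streamed through the mappers;
    // no shuffle and no reduce phase are needed
    public static List<String> mapJoin(Map<String, String> users, List<String> logs) {
        List<String> out = new ArrayList<>();
        for (String log : logs) {
            String userId = log.split(",")[0];     // e.g. "jdoe,/products" -> "jdoe"
            String country = users.get(userId);    // in-memory lookup
            if (country != null) {
                out.add(log + "," + country);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("jdoe", "US");
        users.put("pmartin", "FR");
        List<String> logs = List.of("jdoe,/products", "pmartin,/checkout", "jdoe,/account");
        System.out.println(mapJoin(users, logs));
        // → [jdoe,/products,US, pmartin,/checkout,FR, jdoe,/account,US]
    }
}
```

This only works when one side of the join is small enough to fit in each mapper’s memory, hence “asymmetrical”.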
Hive, Pig, Cascading
UDF : User Defined Function
Hive
+
SQL (non-standard)
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
No flow control
Logic spread across scripts, Java, shell
Programming with UDF
Pig
+
Pig Latin
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
Logic spread across scripts, Java, shell
Programming with UDF
Cascading
+
Java API
Unit testable
Flow control (if, try/catch, etc.)
Good re-usability
-
Programming needed
Typical processing
Receiving data (bulk or streams)
Processing in batch mode
Feeding real-time systems (RDBMS, NoSQL)
Use cases
Parsing, processing, aggregating data
“Diff-ing” 2 datasets
Joining data
Join generated and reference data
Generated data and reference data feed Hadoop processing (join, transformation), whose output feeds reporting.
Data handling
● Raw data → archives: Avro, GZIP, kept forever
● Parsed data → view on data: Parquet, Snappy, 2 years of data kept
● Processing and insertion → transformations: Cascading, from HDFS to a real-time DB
Flow handling with Spring Batch
Archiving → Processing → Processing → Processing → Cleaning
Java + HDFS API, Cascading, MapReduce