Hadoop 101 - Kansas City Big Data Summit 2014
Hadoop 101
Scott Kahler (Twitter: @boogabee)
http://simpit.com
Community Engineer - Greenplum Database
xkcd.com
Primary Hadoop Use Case
Data Lake, Active Archive, Staging Area
[Timeline slide]
2002: Doug Cutting and Mike Cafarella begin work (Nutch)
2003-2004: Google publishes the GFS and MapReduce papers
2006: Doug Cutting spins the work out as Apache Hadoop
Apache Hadoop
The project includes these modules:
● Hadoop Common
● Hadoop Distributed File System (HDFS™)
● Hadoop MapReduce
● Hadoop YARN
Apache Hadoop Ecosystem
Distributed Filesystem
Red Hat GlusterFS
Quantcast File System QFS
Ceph Filesystem
Lustre file system
Tachyon
GridGain
Distributed Programming
Apache Pig
JAQL
Apache Spark
Apache Flink
Netflix PigPen
AMPLab SIMR
Facebook Corona
Apache Twill
Damballa Parkour
Apache Hama
Datasalt Pangool
Apache Tez
Apache DataFu
Pydoop
Kangaroo
NoSQL Databases
Column Data Model
Apache HBase
Apache Cassandra
Hypertable
Apache Accumulo
Document Data Model
MongoDB
RethinkDB
ArangoDB
Stream Data Model
EventStore
Key-Value Data Model
Redis DataBase
Linkedin Voldemort
RocksDB
OpenTSDB
Graph Data Model
ArangoDB
Neo4j
NewSQL Databases
TokuDB
HandlerSocket
Akiban Server
Drizzle
Haeinsa
SenseiDB
Sky
BayesDB
InfluxDB
SQL-On-Hadoop
Apache Hive
Apache HCatalog
AMPLAB Shark
Apache Drill
Cloudera Impala
Facebook Presto
Datasalt Splout SQL
Apache Tajo
Apache Phoenix
Apache MRQL
Data Ingestion
Apache Flume
Apache Sqoop
Facebook Scribe
Apache Chukwa
Apache Storm
Apache Kafka
Netflix Suro
Apache Samza
Cloudera Morphline
HIHO
Service Programming
Apache Thrift
Apache Zookeeper
Apache Avro
Apache Curator
Apache Karaf
Twitter Elephant Bird
Linkedin Norbert
Scheduling
Apache Oozie
Linkedin Azkaban
Apache Falcon
Machine Learning
Apache Mahout
WEKA
Cloudera Oryx
MADlib
H2O
Sparkling Water
Benchmarking
Apache Hadoop Benchmarking
Yahoo Gridmix3
PUMA Benchmarking
Berkeley SWIM Benchmark
Intel HiBench
Security
Apache Sentry
Apache Knox Gateway
Apache Ranger
System Deployment
Apache Ambari
Cloudera HUE
Apache Whirr
Apache Mesos
Myriad
Marathon
Brooklyn
Hortonworks HOYA
Apache Helix
Apache Bigtop
Buildoop
Deploop
Applications
Apache Nutch
Sphinx Search Server
Apache OODT
HIPI Library
PivotalR
Development Frameworks
Spring XD
Categorize Pending ...
Twitter Summingbird
Apache Kiji
S4 Yahoo
Metamarkets Druid
Concurrent Cascading
Concurrent Lingual
Concurrent Pattern
Apache Giraph
Talend
Akka Toolkit
Eclipse BIRT
SpagoBI
Jedox Palo
Twitter Finagle
Intel GraphBuilder
Apache Tika
http://hadoopecosystemtable.github.io/
Apache Bigtop
Hadoop Distributed File System (HDFS™)
A distributed file system that provides high-throughput access to application data
hdfs dfs -copyFromLocal File.txt hdfs://nn.hadoopcluster.local/user/hadoop/
[Diagram sequence: writing File.txt (blocks A, B, C) to HDFS — NameNode plus Data Nodes 1-6]
1. Client to NameNode: "I have File.txt and I want to write block A of it."
2. NameNode to Client: "Write that to Data Nodes 2, 5 and 6."
3. Client sets up a write pipeline to Nodes 2, 5 and 6.
4. The block is pushed down the pipeline, leaving a copy of A on each node.
5. The Data Nodes acknowledge: "It worked! Got a block."
6. Repeat until all blocks are in the system. Final placement: A: 2,5,6  B: 1,3,4  C: 6,2,4
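The placement above can be sketched as a toy simulation. This is illustrative only: real HDFS placement is rack-aware and pipeline-based, while this sketch just picks three distinct nodes per block.

```python
import random

REPLICATION = 3  # HDFS default replication factor

def place_blocks(blocks, data_nodes, replication=REPLICATION):
    """Toy NameNode: pick `replication` distinct data nodes per block."""
    return {block: random.sample(data_nodes, replication) for block in blocks}

nodes = [1, 2, 3, 4, 5, 6]
block_map = place_blocks(["A", "B", "C"], nodes)
for block, placement in block_map.items():
    print(block, "->", placement)
```

The resulting block map is exactly what the NameNode hands back to a reader later: block IDs mapped to the Data Nodes holding a replica.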
hdfs dfs -copyToLocal hdfs://nn.hadoopcluster.local/user/hadoop/File.txt File2.txt
[Diagram sequence: reading File.txt back — block map A: 2,5,6  B: 1,3,4  C: 6,2,4]
1. Client to NameNode: "I want File.txt."
2. NameNode to Client: "File.txt is A: 2,5,6  B: 1,3,4  C: 6,2,4."
3. Client fetches blocks A, B and C directly from the Data Nodes and reassembles File.txt.
System Health
[Diagram sequence: system health and re-replication]
1. Each Data Node periodically sends the NameNode a heartbeat and a block report.
2. NameNode: "No heartbeat from Node 4."
3. NameNode: "Copies on Node 4 must be gone. Need to get B & C back up to 3 copies."
4. Surviving nodes re-replicate the lost blocks; the block map becomes A: 2,5,6  B: 1,3,5  C: 6,2,1.
5. NameNode: "Okay, all good now." Node 4's copies of B and C are now surplus.
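The repair loop above can be sketched as follows. The starting block map and node numbers come from the slides; the choice of re-replication targets here (first available live node) is an arbitrary stand-in for the NameNode's real placement policy.

```python
REPLICATION = 3  # target replica count per block

def handle_dead_node(block_map, live_nodes, dead_node):
    """Toy NameNode repair: drop the dead node's replicas, then
    re-replicate any under-replicated block onto another live node."""
    for block, placement in block_map.items():
        placement[:] = [n for n in placement if n != dead_node]
        for candidate in live_nodes:
            if len(placement) >= REPLICATION:
                break
            if candidate not in placement:
                placement.append(candidate)
    return block_map

block_map = {"A": [2, 5, 6], "B": [1, 3, 4], "C": [6, 2, 4]}
live = [1, 2, 3, 5, 6]  # Node 4 stopped heartbeating
handle_dead_node(block_map, live, dead_node=4)
print(block_map)  # every block is back at 3 replicas on live nodes
```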
Map → Shuffle/Sort → Reduce

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Input:
Peter Piper picked a peck of pickled peppers;
A peck of pickled peppers Peter Piper picked;
If Peter Piper picked a peck of pickled peppers,
Where's the peck of pickled peppers Peter Piper picked?

Map output (one (word, 1) pair per token, lower-cased):
peter -> 1, piper -> 1, picked -> 1, a -> 1, peck -> 1, of -> 1, pickled -> 1, peppers -> 1
a -> 1, peck -> 1, of -> 1, pickled -> 1, peppers -> 1, peter -> 1, piper -> 1, picked -> 1
if -> 1, peter -> 1, piper -> 1, picked -> 1, a -> 1, peck -> 1, of -> 1, pickled -> 1, peppers -> 1
where's -> 1, the -> 1, peck -> 1, of -> 1, pickled -> 1, peppers -> 1, peter -> 1, piper -> 1, picked -> 1

Shuffle/Sort output (values grouped by key):
a -> 1,1,1
if -> 1
of -> 1,1,1,1
peck -> 1,1,1,1
peppers -> 1,1,1,1
peter -> 1,1,1,1
picked -> 1,1,1,1
pickled -> 1,1,1,1
piper -> 1,1,1,1
the -> 1
where's -> 1

Reduce output (values summed per key):
a -> 3
if -> 1
of -> 4
peck -> 4
peppers -> 4
peter -> 4
picked -> 4
pickled -> 4
piper -> 4
the -> 1
where's -> 1
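The three stages above can be sketched in-process. This is a toy model of the data flow, not the Hadoop API; tokenization and lower-casing are simplified relative to a real job.

```python
import re
from collections import defaultdict

def map_phase(lines):
    # map: (K1, V1) -> list(K2, V2) — emit (word, 1) per token
    for line in lines:
        for word in re.findall(r"[\w']+", line.lower()):
            yield word, 1

def shuffle_sort(pairs):
    # group all values by key, then sort the keys
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    # reduce: (K2, list(V2)) -> list(K3, V3) — sum the counts
    return [(key, sum(values)) for key, values in grouped]

lines = [
    "Peter Piper picked a peck of pickled peppers;",
    "A peck of pickled peppers Peter Piper picked;",
    "If Peter Piper picked a peck of pickled peppers,",
    "Where's the peck of pickled peppers Peter Piper picked?",
]
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(counts)
```

Running this reproduces the reduce output on the slide: "a" appears 3 times, the seven p-words 4 times each, and "if", "the" and "where's" once each.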
$HADOOP_HOME/bin/hadoop jar wc.jar WordCount
/user/hadoop/wordcount/input /user/hadoop/wordcount/output
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py
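With streaming, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value pairs on stdout. A mapper in the style of the myPythonScript.py referenced above might look like this; the filename comes from the slide, but the script body is an illustrative sketch.

```python
import sys

def stream_map(lines, out=sys.stdout):
    """Streaming mapper: emit "word\t1" for every whitespace token."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

if __name__ == "__main__":
    # in a real streaming job this would be: stream_map(sys.stdin)
    stream_map(["Peter Piper picked a peck"])
```

The framework sorts the mapper's output by key before feeding it to the reducer (here /bin/wc, which simply counts the emitted pairs).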
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;
public class WordCount2 {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
static enum CountersEnum { INPUT_WORDS }
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private boolean caseSensitive;
private Set<String> patternsToSkip = new HashSet<String>();
private Configuration conf;
private BufferedReader fis;
@Override
public void setup(Context context) throws IOException,
InterruptedException {
conf = context.getConfiguration();
caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
if (conf.getBoolean("wordcount.skip.patterns", false)) {
URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
for (URI patternsURI : patternsURIs) {
Path patternsPath = new Path(patternsURI.getPath());
String patternsFileName = patternsPath.getName().toString();
parseSkipFile(patternsFileName);
}
}
}
private void parseSkipFile(String fileName) {
try {
fis = new BufferedReader(new FileReader(fileName));
String pattern = null;
while ((pattern = fis.readLine()) != null) {
patternsToSkip.add(pattern);
}
} catch (IOException ioe) {
System.err.println("Caught exception while parsing the cached file '"
+ StringUtils.stringifyException(ioe));
}
}
@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line = (caseSensitive) ?
value.toString() : value.toString().toLowerCase();
for (String pattern : patternsToSkip) {
line = line.replaceAll(pattern, "");
}
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
Counter counter = context.getCounter(CountersEnum.class.getName(),
CountersEnum.INPUT_WORDS.toString());
counter.increment(1);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
GenericOptionsParser optionParser = new GenericOptionsParser(conf,
args);
String[] remainingArgs = optionParser.getRemainingArgs();
if (remainingArgs.length != 2 && remainingArgs.length != 4) {
System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
System.exit(2);
}
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
List<String> otherArgs = new ArrayList<String>();
for (int i=0; i < remainingArgs.length; ++i) {
if ("-skip".equals(remainingArgs[i])) {
job.addCacheFile(new Path(remainingArgs[++i]).toUri());
job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
} else {
otherArgs.add(remainingArgs[i]);
}
}
FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
lines = LOAD '/user/hadoop/File.txt' AS (line:chararray);
words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE
COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;
Apache Pig
Hive provides a mechanism to project structure
onto data in HDFS and query it using a
SQL-like language called HiveQL.
Apache Hive
Resource Management
[Diagram sequence: JobTracker plus Task Trackers 1-6; each Task Tracker has four Map slots (M|M|M|M) and two Reduce slots (R|R)]
1. Client submits the wordcount job to the JobTracker.
2. The JobTracker farms wordcount's Map and Reduce tasks out to free slots on the Task Trackers.
3. Slots are fixed-purpose: a Map slot can only run a Map task and a Reduce slot only a Reduce task.
[Diagram: the same nodes also running HBase and Solr alongside the MapReduce slots]
YARN
MapReduce v2
[Diagram sequence: YARN — ResourceManager plus Node Managers 1-6]
1. Client submits the wordcount job to the ResourceManager.
2. The ResourceManager starts an Application Master for wordcount in a container on one of the nodes: "I need a container to run my wordcount MR job."
3. Application Master to ResourceManager: "I need 4 Mapper and 2 Reducer containers."
4. The ResourceManager grants containers across the Node Managers, and the Application Master runs the Map (M) and Reduce (R) tasks in them.
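The negotiation above can be sketched as a toy ResourceManager. The round-robin grant policy and the node list are invented for illustration; real YARN schedulers (Capacity, Fair) weigh memory, vcores and locality.

```python
from itertools import cycle

def grant_containers(request, node_managers):
    """Toy ResourceManager: satisfy an Application Master's request
    by handing out containers round-robin across the Node Managers."""
    grants = []
    nodes = cycle(node_managers)
    for task_type, count in request.items():
        for _ in range(count):
            grants.append((task_type, next(nodes)))
    return grants

# Application Master - Wordcount asks for 4 Mappers and 2 Reducers
request = {"M": 4, "R": 2}
grants = grant_containers(request, node_managers=[1, 2, 3, 4, 5, 6])
print(grants)
```

Unlike the fixed Map/Reduce slots of MapReduce v1, a YARN container is generic: the same grant mechanism serves any application type, which is what lets HBase, Solr, or Spark share the cluster.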
Emerging Hadoop Use Case
Application Container Management