Hadoop 101 - Kansas City Big Data Summit 2014


Transcript of Hadoop 101 - Kansas City Big Data Summit 2014

Page 1: Hadoop 101 - Kansas City Big Data Summit 2014

Hadoop 101

Page 2: Hadoop 101 - Kansas City Big Data Summit 2014

Scott Kahler

Twitter: boogabee

http://simpit.com

[email protected]

Community Engineer - Greenplum Database

Page 3: Hadoop 101 - Kansas City Big Data Summit 2014
Page 4: Hadoop 101 - Kansas City Big Data Summit 2014

xkcd.com

Page 5: Hadoop 101 - Kansas City Big Data Summit 2014

Primary Hadoop Use Case

Page 6: Hadoop 101 - Kansas City Big Data Summit 2014

Data Lake, Active Archive, Staging Area

Page 7: Hadoop 101 - Kansas City Big Data Summit 2014

DATA

Page 8: Hadoop 101 - Kansas City Big Data Summit 2014

2002

Page 9: Hadoop 101 - Kansas City Big Data Summit 2014

Doug Cutting and Mike Cafarella

2002, 2003, 2004

Page 10: Hadoop 101 - Kansas City Big Data Summit 2014

Doug Cutting 2006

Page 11: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Hadoop

The project includes these modules:

● Hadoop Common

● Hadoop Distributed File System (HDFS™)

● Hadoop MapReduce

● Hadoop YARN

Page 12: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Hadoop Ecosystem

Distributed Filesystem

Red Hat GlusterFS

Quantcast File System QFS

Ceph Filesystem

Lustre file system

Tachyon

GridGain

Distributed Programming

Apache Pig

JAQL

Apache Spark

Apache Flink

Netflix PigPen

AMPLab SIMR

Facebook Corona

Apache Twill

Damballa Parkour

Apache Hama

Datasalt Pangool

Apache Tez

Apache DataFu

Pydoop

Kangaroo

NoSQL Databases

Column Data Model

Apache HBase

Apache Cassandra

Hypertable

Apache Accumulo

Document Data Model

MongoDB

RethinkDB

ArangoDB

Stream Data Model

EventStore

Key-Value Data Model

Redis DataBase

Linkedin Voldemort

RocksDB

OpenTSDB

Graph Data Model

ArangoDB

Neo4j

NewSQL Databases

TokuDB

HandlerSocket

Akiban Server

Drizzle

Haeinsa

SenseiDB

Sky

BayesDB

InfluxDB

SQL-On-Hadoop

Apache Hive

Apache HCatalog

AMPLab Shark

Apache Drill

Cloudera Impala

Facebook Presto

Datasalt Splout SQL

Apache Tajo

Apache Phoenix

Apache MRQL

Data Ingestion

Apache Flume

Apache Sqoop

Facebook Scribe

Apache Chukwa

Apache Storm

Apache Kafka

Netflix Suro

Apache Samza

Cloudera Morphline

HIHO

Service Programming

Apache Thrift

Apache Zookeeper

Apache Avro

Apache Curator

Apache Karaf

Twitter Elephant Bird

Linkedin Norbert

Scheduling

Apache Oozie

Linkedin Azkaban

Apache Falcon

Machine Learning

Apache Mahout

WEKA

Cloudera Oryx

MADlib

H2O

Sparkling Water

Benchmarking

Apache Hadoop Benchmarking

Yahoo Gridmix3

PUMA Benchmarking

Berkeley SWIM Benchmark

Intel HiBench

Security

Apache Sentry

Apache Knox Gateway

Apache Ranger

System Deployment

Apache Ambari

Cloudera HUE

Apache Whirr

Apache Mesos

Myriad

Marathon

Brooklyn

Hortonworks HOYA

Apache Helix

Apache Bigtop

Buildoop

Deploop

Applications

Apache Nutch

Sphinx Search Server

Apache OODT

HIPI Library

PivotalR

Development Frameworks

Spring XD

Categorize Pending ...

Twitter Summingbird

Apache Kiji

Yahoo S4

Metamarkets Druid

Concurrent Cascading

Concurrent Lingual

Concurrent Pattern

Apache Giraph

Talend

Akka Toolkit

Eclipse BIRT

SpagoBI

Jedox Palo

Twitter Finagle

Intel GraphBuilder

Apache Tika

http://hadoopecosystemtable.github.io/

Page 13: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Bigtop

Page 14: Hadoop 101 - Kansas City Big Data Summit 2014

Hadoop Distributed File System (HDFS™)

A distributed file system that provides high-throughput access to application data

Page 15: Hadoop 101 - Kansas City Big Data Summit 2014

hdfs dfs -copyFromLocal File.txt hdfs://nn.hadoopcluster.local/user/hadoop/
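As an editorial aside, the same copy can be done from Java through the Hadoop FileSystem API. A minimal sketch, assuming the cluster URI from the slide (the class name and paths are illustrative, not from the talk):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster named on the slide.
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn.hadoopcluster.local"), conf);
    // Copy the local file into HDFS; the NameNode decides which
    // DataNodes receive each block, as the next slides illustrate.
    fs.copyFromLocalFile(new Path("File.txt"), new Path("/user/hadoop/File.txt"));
    fs.close();
  }
}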

Page 16: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Name Node, Data Nodes 1-6, and a Client holding File.txt split into blocks A, B, and C]

Client to Name Node: "I have File.txt and I want to write block A of it."

Page 17: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster layout]

Name Node to Client: "Write that to Data Nodes 2, 5, and 6."

Page 18: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster layout]

Client: "Setting up a pipeline to Nodes 2, 5, 6."

Page 19: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: block A streaming down the pipeline to Data Nodes 2, 5, and 6]

Client: "Pushing block A down the pipeline."

Page 20: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: block A now stored on Data Nodes 2, 5, and 6]

Data Nodes to Client: "It worked! Got a block."

Page 21: Hadoop 101 - Kansas City Big Data Summit 2014

Repeat until all blocks are in the system

Page 22: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: all blocks written. A is on Data Nodes 2, 5, 6; B on 1, 3, 4; C on 6, 2, 4. The Name Node records the block locations for File.txt: A: 2,5,6; B: 1,3,4; C: 6,2,4]

Page 23: Hadoop 101 - Kansas City Big Data Summit 2014

hdfs dfs -copyToLocal hdfs://nn.hadoopcluster.local/user/hadoop/File.txt File2.txt

Page 24: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Client to Name Node: "I want File.txt."

Page 25: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node to Client: "File.txt is A: 2,5,6; B: 1,3,4; C: 6,2,4."

Page 26: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client reads blocks A, B, and C of File.txt directly from the Data Nodes]

Page 27: Hadoop 101 - Kansas City Big Data Summit 2014

System Health

Page 28: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state; each Data Node sends a Block Report to the Name Node]

Page 29: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node: "No heartbeat from Node 4."

Page 30: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node: "Copies on Node 4 must be gone."

Page 31: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Data Node 4 removed; remaining copies are A: 2,5,6; B: 1,3; C: 6,2]

Name Node: "Need to get B & C back up to 3 copies."

Page 32: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: B is re-replicated to Data Node 5 and C to Data Node 1; block locations are now A: 2,5,6; B: 1,3,5; C: 6,2,1]

Name Node: "Okay, all good now."

Page 33: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Data Node 4 returns to the cluster; block locations remain A: 2,5,6; B: 1,3,5; C: 6,2,1]

Page 34: Hadoop 101 - Kansas City Big Data Summit 2014
Page 35: Hadoop 101 - Kansas City Big Data Summit 2014

Map Shuffle/Sort Reduce

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)
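To make those signatures concrete, here is a minimal, editorially added word-count mapper and reducer against the Hadoop Java API (the talk's full example, with extra options, appears on a later slide):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // map: (K1 = byte offset, V1 = line of text) -> list(K2 = word, V2 = 1)
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // reduce: (K2 = word, list(V2) = all the 1s) -> list(K3 = word, V3 = total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}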

Page 36: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 37: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 38: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1,1,1

if -> 1

of -> 1,1,1,1

peck -> 1,1,1,1

peppers -> 1,1,1,1

peter -> 1,1,1,1

picked -> 1,1,1,1

pickled -> 1,1,1,1

piper -> 1,1,1,1

the -> 1

where’s -> 1

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 39: Hadoop 101 - Kansas City Big Data Summit 2014

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 3

if -> 1

of -> 4

peck -> 4

peppers -> 4

peter -> 4

picked -> 4

pickled -> 4

piper -> 4

the -> 1

where’s -> 1

a -> 1,1,1

if -> 1

of -> 1,1,1,1

peck -> 1,1,1,1

peppers -> 1,1,1,1

peter -> 1,1,1,1

picked -> 1,1,1,1

pickled -> 1,1,1,1

piper -> 1,1,1,1

the -> 1

where’s -> 1

Page 40: Hadoop 101 - Kansas City Big Data Summit 2014

$HADOOP_HOME/bin/hadoop jar wc.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

-input myInputDirs \

-output myOutputDir \

-mapper myPythonScript.py \

-reducer /bin/wc \

-file myPythonScript.py

Page 41: Hadoop 101 - Kansas City Big Data Summit 2014

import java.io.BufferedReader;

import java.io.FileReader;

import java.io.IOException;

import java.net.URI;

import java.util.ArrayList;

import java.util.HashSet;

import java.util.List;

import java.util.Set;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.Counter;

import org.apache.hadoop.util.GenericOptionsParser;

import org.apache.hadoop.util.StringUtils;

public class WordCount2 {

public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{

static enum CountersEnum { INPUT_WORDS }

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

private boolean caseSensitive;

private Set<String> patternsToSkip = new HashSet<String>();

private Configuration conf;

private BufferedReader fis;

@Override

public void setup(Context context) throws IOException,

InterruptedException {

conf = context.getConfiguration();

caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);

if (conf.getBoolean("wordcount.skip.patterns", true)) {

URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();

for (URI patternsURI : patternsURIs) {

Path patternsPath = new Path(patternsURI.getPath());

String patternsFileName = patternsPath.getName().toString();

parseSkipFile(patternsFileName);

}

}

}

private void parseSkipFile(String fileName) {

try {

fis = new BufferedReader(new FileReader(fileName));

String pattern = null;

while ((pattern = fis.readLine()) != null) {

patternsToSkip.add(pattern);

}

} catch (IOException ioe) {

System.err.println("Caught exception while parsing the cached file '"

+ StringUtils.stringifyException(ioe));

}

}

@Override

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

String line = (caseSensitive) ?

value.toString() : value.toString().toLowerCase();

for (String pattern : patternsToSkip) {

line = line.replaceAll(pattern, "");

}

StringTokenizer itr = new StringTokenizer(line);

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

Counter counter = context.getCounter(CountersEnum.class.getName(),

CountersEnum.INPUT_WORDS.toString());

counter.increment(1);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}

}

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

GenericOptionsParser optionParser = new GenericOptionsParser(conf,

args);

String[] remainingArgs = optionParser.getRemainingArgs();

if (remainingArgs.length != 2 && remainingArgs.length != 4) {

System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");

System.exit(2);

}

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount2.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

List<String> otherArgs = new ArrayList<String>();

for (int i=0; i < remainingArgs.length; ++i) {

if ("-skip".equals(remainingArgs[i])) {

job.addCacheFile(new Path(remainingArgs[++i]).toUri());

job.getConfiguration().setBoolean("wordcount.skip.patterns", true);

} else {

otherArgs.add(remainingArgs[i]);

}

}

FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));

FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

}

Page 42: Hadoop 101 - Kansas City Big Data Summit 2014

lines = LOAD '/user/hadoop/File.txt' AS (line:chararray);

words = FOREACH lines GENERATE

FLATTEN(TOKENIZE(line)) AS word;

filtered_words = FILTER words BY word MATCHES '\\w+';

word_groups = GROUP filtered_words BY word;

word_count = FOREACH word_groups GENERATE

COUNT(filtered_words) AS count, group AS word;

ordered_word_count = ORDER word_count BY count DESC;

DUMP ordered_word_count;

Apache Pig

Page 43: Hadoop 101 - Kansas City Big Data Summit 2014

Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL.

Apache Hive

Page 44: Hadoop 101 - Kansas City Big Data Summit 2014

Resource Management

Page 45: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: MapReduce v1 resource management: a Job Tracker, a Client, and Task Trackers 1-6, each with four fixed map slots (M) and two fixed reduce slots (R)]

Page 46: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client submits the wordcount job to the Job Tracker]

Page 47: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Job Tracker schedules the wordcount tasks into the map and reduce slots on the Task Trackers]

Page 48: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the same nodes also run HBase and Solr alongside the fixed map and reduce slots]

Page 49: Hadoop 101 - Kansas City Big Data Summit 2014

YARN

MapReduce v2

Page 50: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: YARN layout: a Resource Manager, a Client, and Node Managers 1-6]

Page 51: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client submits the wordcount job; an Application Master for wordcount is launched in a container on one of the Node Managers]

Client to Resource Manager: "I need a container to run my wordcount MR job."

Page 52: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Application Master asks for containers and the Resource Manager grants four map (M) and two reduce (R) containers across the Node Managers]

Application Master to Resource Manager: "I need 4 Mapper and 2 Reducer containers."
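As an editorial illustration of that request (not from the talk), here is a minimal sketch of how an Application Master could ask the Resource Manager for those containers with the YARN AMRMClient API; the memory sizes and priorities are illustrative assumptions:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, "");   // tell the RM this AM is alive

    Resource mapSize = Resource.newInstance(1024, 1);    // 1 GB, 1 vcore (assumed)
    Resource reduceSize = Resource.newInstance(2048, 1); // 2 GB, 1 vcore (assumed)

    // "I need 4 Mapper and 2 Reducer containers"
    for (int i = 0; i < 4; i++) {
      rm.addContainerRequest(new ContainerRequest(mapSize, null, null, Priority.newInstance(0)));
    }
    for (int i = 0; i < 2; i++) {
      rm.addContainerRequest(new ContainerRequest(reduceSize, null, null, Priority.newInstance(1)));
    }

    // Granted containers come back from subsequent rm.allocate(progress) calls,
    // after which the AM launches its map and reduce tasks in them.
  }
}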

Page 53: Hadoop 101 - Kansas City Big Data Summit 2014

Emerging Hadoop Use Case

Page 54: Hadoop 101 - Kansas City Big Data Summit 2014

Application Container Management