Hadoop 101 - Kansas City Big Data Summit 2014


Transcript of Hadoop 101 - Kansas City Big Data Summit 2014

Page 1: Hadoop 101 - Kansas City Big Data Summit 2014

Hadoop 101

Page 2: Hadoop 101 - Kansas City Big Data Summit 2014

Scott Kahler

Twitter: boogabee

http://simpit.com

[email protected]

Community Engineer - Greenplum Database

Page 3: Hadoop 101 - Kansas City Big Data Summit 2014
Page 4: Hadoop 101 - Kansas City Big Data Summit 2014

xkcd.com

Page 5: Hadoop 101 - Kansas City Big Data Summit 2014

Primary Hadoop Use Case

Page 6: Hadoop 101 - Kansas City Big Data Summit 2014

Data Lake, Active Archive, Staging Area

Page 7: Hadoop 101 - Kansas City Big Data Summit 2014

DATA

Page 8: Hadoop 101 - Kansas City Big Data Summit 2014

2002

Page 9: Hadoop 101 - Kansas City Big Data Summit 2014

Doug Cutting and Mike Cafarella

2002, 2003, 2004

Page 10: Hadoop 101 - Kansas City Big Data Summit 2014

Doug Cutting 2006

Page 11: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Hadoop

The project includes these modules:

● Hadoop Common

● Hadoop Distributed File System (HDFS™)

● Hadoop MapReduce

● Hadoop YARN

Page 12: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Hadoop Ecosystem

Distributed Filesystem

Red Hat GlusterFS

Quantcast File System QFS

Ceph Filesystem

Lustre file system

Tachyon

GridGain

Distributed Programming

Apache Pig

JAQL

Apache Spark

Apache Flink

Netflix PigPen

AMPLab SIMR

Facebook Corona

Apache Twill

Damballa Parkour

Apache Hama

Datasalt Pangool

Apache Tez

Apache DataFu

Pydoop

Kangaroo

NoSQL Databases

Column Data Model

Apache HBase

Apache Cassandra

Hypertable

Apache Accumulo

Document Data Model

MongoDB

RethinkDB

ArangoDB

Stream Data Model

EventStore

Key-Value Data Model

Redis DataBase

Linkedin Voldemort

RocksDB

OpenTSDB

Graph Data Model

ArangoDB

Neo4j

NewSQL Databases

TokuDB

HandlerSocket

Akiban Server

Drizzle

Haeinsa

SenseiDB

Sky

BayesDB

InfluxDB

SQL-On-Hadoop

Apache Hive

Apache HCatalog

AMPLab Shark

Apache Drill

Cloudera Impala

Facebook Presto

Datasalt Splout SQL

Apache Tajo

Apache Phoenix

Apache MRQL

Data Ingestion

Apache Flume

Apache Sqoop

Facebook Scribe

Apache Chukwa

Apache Storm

Apache Kafka

Netflix Suro

Apache Samza

Cloudera Morphline

HIHO

Service Programming

Apache Thrift

Apache Zookeeper

Apache Avro

Apache Curator

Apache Karaf

Twitter Elephant Bird

Linkedin Norbert

Scheduling

Apache Oozie

Linkedin Azkaban

Apache Falcon

Machine Learning

Apache Mahout

WEKA

Cloudera Oryx

MADlib

H2O

Sparkling Water

Benchmarking

Apache Hadoop Benchmarking

Yahoo Gridmix3

PUMA Benchmarking

Berkeley SWIM Benchmark

Intel HiBench

Security

Apache Sentry

Apache Knox Gateway

Apache Ranger

System Deployment

Apache Ambari

Cloudera HUE

Apache Whirr

Apache Mesos

Myriad

Marathon

Brooklyn

Hortonworks HOYA

Apache Helix

Apache Bigtop

Buildoop

Deploop

Applications

Apache Nutch

Sphinx Search Server

Apache OODT

HIPI Library

PivotalR

Development Frameworks

Spring XD

Categorize Pending ...

Twitter Summingbird

Apache Kiji

Yahoo S4

Metamarkets Druid

Concurrent Cascading

Concurrent Lingual

Concurrent Pattern

Apache Giraph

Talend

Akka Toolkit

Eclipse BIRT

SpagoBI

Jedox Palo

Twitter Finagle

Intel GraphBuilder

Apache Tika

http://hadoopecosystemtable.github.io/

Page 13: Hadoop 101 - Kansas City Big Data Summit 2014

Apache Bigtop

Page 14: Hadoop 101 - Kansas City Big Data Summit 2014

Hadoop Distributed File System (HDFS™)

A distributed file system that provides high-throughput access to application data

Page 15: Hadoop 101 - Kansas City Big Data Summit 2014

hdfs dfs -copyFromLocal File.txt hdfs://nn.hadoopcluster.local/user/hadoop/
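As an editorial aside, the same copy can be done from Java through the Hadoop FileSystem API. A minimal sketch, assuming the cluster URI from the slide (the class name and paths are illustrative, not from the talk):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster named on the slide.
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn.hadoopcluster.local"), conf);
    // Copy the local file into HDFS; the NameNode decides which
    // DataNodes receive each block, as the next slides illustrate.
    fs.copyFromLocalFile(new Path("File.txt"), new Path("/user/hadoop/File.txt"));
    fs.close();
  }
}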

Page 16: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Name Node, Data Nodes 1-6, and a Client holding File.txt split into blocks A, B, and C]

Client to Name Node: "I have File.txt and I want to write block A of it."

Page 17: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster layout]

Name Node to Client: "Write that to Data Nodes 2, 5, and 6."

Page 18: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster layout]

Client: "Setting up a pipeline to Nodes 2, 5, 6."

Page 19: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: block A streaming down the pipeline to Data Nodes 2, 5, and 6]

Client: "Pushing block A down the pipeline."

Page 20: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: block A now stored on Data Nodes 2, 5, and 6]

Data Nodes to Client: "It worked! Got a block."

Page 21: Hadoop 101 - Kansas City Big Data Summit 2014

Repeat until all blocks are in the system

Page 22: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: all blocks written. A is on Data Nodes 2, 5, 6; B on 1, 3, 4; C on 6, 2, 4. The Name Node records the block locations for File.txt: A: 2,5,6; B: 1,3,4; C: 6,2,4]

Page 23: Hadoop 101 - Kansas City Big Data Summit 2014

hdfs dfs -copyToLocal hdfs://nn.hadoopcluster.local/user/hadoop/File.txt File2.txt

Page 24: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Client to Name Node: "I want File.txt."

Page 25: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node to Client: "File.txt is A: 2,5,6; B: 1,3,4; C: 6,2,4."

Page 26: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client reads blocks A, B, and C of File.txt directly from the Data Nodes]

Page 27: Hadoop 101 - Kansas City Big Data Summit 2014

System Health

Page 28: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state; each Data Node sends a Block Report to the Name Node]

Page 29: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node: "No heartbeat from Node 4."

Page 30: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: same cluster state]

Name Node: "Copies on Node 4 must be gone."

Page 31: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Data Node 4 removed; remaining copies are A: 2,5,6; B: 1,3; C: 6,2]

Name Node: "Need to get B & C back up to 3 copies."

Page 32: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: B is re-replicated to Data Node 5 and C to Data Node 1; block locations are now A: 2,5,6; B: 1,3,5; C: 6,2,1]

Name Node: "Okay, all good now."

Page 33: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: Data Node 4 returns to the cluster; block locations remain A: 2,5,6; B: 1,3,5; C: 6,2,1]

Page 34: Hadoop 101 - Kansas City Big Data Summit 2014
Page 35: Hadoop 101 - Kansas City Big Data Summit 2014

Map Shuffle/Sort Reduce

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)
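To make those signatures concrete, here is a minimal, editorially added word-count mapper and reducer against the Hadoop Java API (the talk's full example, with extra options, appears on a later slide):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // map: (K1 = byte offset, V1 = line of text) -> list(K2 = word, V2 = 1)
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // reduce: (K2 = word, list(V2) = all the 1s) -> list(K3 = word, V3 = total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}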

Page 36: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 37: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 38: Hadoop 101 - Kansas City Big Data Summit 2014

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1,1,1

if -> 1

of -> 1,1,1,1

peck -> 1,1,1,1

peppers -> 1,1,1,1

peter -> 1,1,1,1

picked -> 1,1,1,1

pickled -> 1,1,1,1

piper -> 1,1,1,1

the -> 1

where’s -> 1

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Page 39: Hadoop 101 - Kansas City Big Data Summit 2014

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

Peter Piper picked a peck of pickled peppers;

A peck of pickled peppers Peter Piper picked;

If Peter Piper picked a peck of pickled peppers,

Where's the peck of pickled peppers Peter Piper picked?

Map Shuffle/Sort Reduce

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

if -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

where’s -> 1

the -> 1

peck -> 1

of -> 1

pickled -> 1

peppers -> 1

peter -> 1

piper -> 1

picked -> 1

a -> 3

if -> 1

of -> 4

peck -> 4

peppers -> 4

peter -> 4

picked -> 4

pickled -> 4

piper -> 4

the -> 1

where’s -> 1

a -> 1,1,1

if -> 1

of -> 1,1,1,1

peck -> 1,1,1,1

peppers -> 1,1,1,1

peter -> 1,1,1,1

picked -> 1,1,1,1

pickled -> 1,1,1,1

piper -> 1,1,1,1

the -> 1

where’s -> 1

Page 40: Hadoop 101 - Kansas City Big Data Summit 2014

$HADOOP_HOME/bin/hadoop jar wc.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

-input myInputDirs \

-output myOutputDir \

-mapper myPythonScript.py \

-reducer /bin/wc \

-file myPythonScript.py

Page 41: Hadoop 101 - Kansas City Big Data Summit 2014

import java.io.BufferedReader;

import java.io.FileReader;

import java.io.IOException;

import java.net.URI;

import java.util.ArrayList;

import java.util.HashSet;

import java.util.List;

import java.util.Set;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.Counter;

import org.apache.hadoop.util.GenericOptionsParser;

import org.apache.hadoop.util.StringUtils;

public class WordCount2 {

public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{

static enum CountersEnum { INPUT_WORDS }

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

private boolean caseSensitive;

private Set<String> patternsToSkip = new HashSet<String>();

private Configuration conf;

private BufferedReader fis;

@Override

public void setup(Context context) throws IOException,

InterruptedException {

conf = context.getConfiguration();

caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);

if (conf.getBoolean("wordcount.skip.patterns", true)) {

URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();

for (URI patternsURI : patternsURIs) {

Path patternsPath = new Path(patternsURI.getPath());

String patternsFileName = patternsPath.getName().toString();

parseSkipFile(patternsFileName);

}

}

}

private void parseSkipFile(String fileName) {

try {

fis = new BufferedReader(new FileReader(fileName));

String pattern = null;

while ((pattern = fis.readLine()) != null) {

patternsToSkip.add(pattern);

}

} catch (IOException ioe) {

System.err.println("Caught exception while parsing the cached file '"

+ StringUtils.stringifyException(ioe));

}

}

@Override

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

String line = (caseSensitive) ?

value.toString() : value.toString().toLowerCase();

for (String pattern : patternsToSkip) {

line = line.replaceAll(pattern, "");

}

StringTokenizer itr = new StringTokenizer(line);

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

Counter counter = context.getCounter(CountersEnum.class.getName(),

CountersEnum.INPUT_WORDS.toString());

counter.increment(1);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}

}

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

GenericOptionsParser optionParser = new GenericOptionsParser(conf,

args);

String[] remainingArgs = optionParser.getRemainingArgs();

if (remainingArgs.length != 2 && remainingArgs.length != 4) {

System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");

System.exit(2);

}

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount2.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

List<String> otherArgs = new ArrayList<String>();

for (int i=0; i < remainingArgs.length; ++i) {

if ("-skip".equals(remainingArgs[i])) {

job.addCacheFile(new Path(remainingArgs[++i]).toUri());

job.getConfiguration().setBoolean("wordcount.skip.patterns", true);

} else {

otherArgs.add(remainingArgs[i]);

}

}

FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));

FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

}

Page 42: Hadoop 101 - Kansas City Big Data Summit 2014

lines = LOAD '/user/hadoop/File.txt' AS (line:chararray);

words = FOREACH lines GENERATE

FLATTEN(TOKENIZE(line)) AS word;

filtered_words = FILTER words BY word MATCHES '\\w+';

word_groups = GROUP filtered_words BY word;

word_count = FOREACH word_groups GENERATE

COUNT(filtered_words) AS count, group AS word;

ordered_word_count = ORDER word_count BY count DESC;

DUMP ordered_word_count;

Apache Pig

Page 43: Hadoop 101 - Kansas City Big Data Summit 2014

Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL.

Apache Hive

Page 44: Hadoop 101 - Kansas City Big Data Summit 2014

Resource Management

Page 45: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: MapReduce v1 resource management: a Job Tracker, a Client, and Task Trackers 1-6, each with four fixed map slots (M) and two fixed reduce slots (R)]

Page 46: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client submits the wordcount job to the Job Tracker]

Page 47: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Job Tracker schedules the wordcount tasks into the map and reduce slots on the Task Trackers]

Page 48: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the same nodes also run HBase and Solr alongside the fixed map and reduce slots]

Page 49: Hadoop 101 - Kansas City Big Data Summit 2014

YARN

MapReduce v2

Page 50: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: YARN layout: a Resource Manager, a Client, and Node Managers 1-6]

Page 51: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Client submits the wordcount job; an Application Master for wordcount is launched in a container on one of the Node Managers]

Client to Resource Manager: "I need a container to run my wordcount MR job."

Page 52: Hadoop 101 - Kansas City Big Data Summit 2014

[Diagram: the Application Master asks for containers and the Resource Manager grants four map (M) and two reduce (R) containers across the Node Managers]

Application Master to Resource Manager: "I need 4 Mapper and 2 Reducer containers."
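As an editorial illustration of that request (not from the talk), here is a minimal sketch of how an Application Master could ask the Resource Manager for those containers with the YARN AMRMClient API; the memory sizes and priorities are illustrative assumptions:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, "");   // tell the RM this AM is alive

    Resource mapSize = Resource.newInstance(1024, 1);    // 1 GB, 1 vcore (assumed)
    Resource reduceSize = Resource.newInstance(2048, 1); // 2 GB, 1 vcore (assumed)

    // "I need 4 Mapper and 2 Reducer containers"
    for (int i = 0; i < 4; i++) {
      rm.addContainerRequest(new ContainerRequest(mapSize, null, null, Priority.newInstance(0)));
    }
    for (int i = 0; i < 2; i++) {
      rm.addContainerRequest(new ContainerRequest(reduceSize, null, null, Priority.newInstance(1)));
    }

    // Granted containers come back from subsequent rm.allocate(progress) calls,
    // after which the AM launches its map and reduce tasks in them.
  }
}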

Page 53: Hadoop 101 - Kansas City Big Data Summit 2014

Emerging Hadoop Use Case

Page 54: Hadoop 101 - Kansas City Big Data Summit 2014

Application Container Management