Hadoop
Scott Leberknight

Transcript of Hadoop

Page 1: Hadoop

Hadoop
Scott Leberknight

Page 2: Hadoop
Page 3: Hadoop
Page 4: Hadoop

Yahoo! "Search Assist"

Page 5: Hadoop

Yahoo!

Facebook

Twitter

Baidu

eBay

LinkedIn

New York Times

Rackspace

eHarmony

Powerset

http://wiki.apache.org/hadoop/PoweredBy

Notable Hadoop users...

Page 6: Hadoop

Hadoop in the Real World...

Page 7: Hadoop

Financial analysis

Recommendation systems

Natural Language Processing (NLP)

Image/video processing

Log analysis

Data warehousing

Correlation engines

Market research/forecasting

Page 8: Hadoop

Finance

Health & Life Sciences

Academic research

Government

Social networking

Telecommunications

Page 9: Hadoop

History...

Page 10: Hadoop

Originally built to support distribution for the Nutch search engine

Created by Doug Cutting

Inspired by Google's GFS (2003) and MapReduce (2004) papers

Named after a stuffed elephant

Page 11: Hadoop

OK, so what exactly is Hadoop?

Page 12: Hadoop

An open source...

general purpose framework for creating distributed applications that process huge amounts of data.

batch/offline oriented...

data & I/O intensive...

Page 13: Hadoop

One definition of "huge"

25,000 machines

More than 10 clusters

3 petabytes of data (compressed, unreplicated)

700+ users

10,000+ jobs/week

Page 14: Hadoop

Major Hadoop Components:

Distributed File System (HDFS)

Map/Reduce System

Page 15: Hadoop

But first, what isn't Hadoop?

Page 16: Hadoop

Hadoop is NOT:

...a relational database!

...an online transaction processing (OLTP) system!

...a structured data store of any kind!

Page 17: Hadoop

Hadoop vs. Relational

Page 18: Hadoop

Hadoop                          Relational
------                          ----------
Scale-out                       Scale-up (*)
Key/value pairs                 Tables
Say how to process the data     Say what you want (SQL)
Offline/batch                   Online/real-time

(*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best

Page 19: Hadoop

HDFS
(Hadoop Distributed File System)

Page 20: Hadoop

Data is distributed and replicated over multiple machines

Designed for large files (where "large" means GB to TB)

Block oriented

Linux-style commands, e.g. ls, cp, mv, rm, etc.
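As a brief sketch (my addition, not from the slides), the same operations are also available programmatically through Hadoop's Java FileSystem API; the paths here are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (e.g. fs.default.name) from the config on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        // "cp" a local file into HDFS
        fs.copyFromLocalFile(new Path("data1.txt"), new Path("/user/aaron/data1.txt"));

        // "ls" a directory
        for (FileStatus status : fs.listStatus(new Path("/user/aaron"))) {
            System.out.println(status.getPath());
        }

        // "mv", then "rm" (false = non-recursive delete)
        fs.rename(new Path("/user/aaron/data1.txt"), new Path("/user/aaron/old.txt"));
        fs.delete(new Path("/user/aaron/old.txt"), false);
    }
}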

Page 21: Hadoop

File Block Mappings (kept by the NameNode):

/user/aaron/data1.txt -> 1, 2, 3
/user/aaron/data2.txt -> 4, 5
/user/andrew/data3.txt -> 6, 7

[Diagram: blocks 1-7 replicated across the DataNode(s), each DataNode holding a different subset of blocks]

Page 22: Hadoop

Fault tolerant when nodes fail

Self-healing: rebalances files across the cluster

Scalable just by adding new nodes!

Page 23: Hadoop

Map/Reduce

Page 24: Hadoop

Operate on key/value pairs

Mappers filter & transform input data

Reducers aggregate mapper output

Split input files (e.g. by HDFS blocks)

Page 25: Hadoop

move code to data

Page 26: Hadoop

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)

Page 27: Hadoop

Word Count
(the canonical Map/Reduce example)

Page 28: Hadoop

the quick brown fox
jumped over
the lazy brown dog

Page 29: Hadoop

(0, "the quick brown fox")

(K1, V1)

(20, "jumped over")

(32, "the lazy brown dog")

map phase -

inputs

Page 30: Hadoop

map phase - outputs, list(K2, V2):

("the", 1) ("quick", 1) ("brown", 1) ("fox", 1)
("jumped", 1) ("over", 1)
("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)

Page 31: Hadoop

reduce phase - inputs, (K2, list(V2)):

("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1))
("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))

Page 32: Hadoop

reduce phase - outputs, list(K3, V3):

("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1)
("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)

Page 33: Hadoop

WordCount in code...

Page 34: Hadoop

public class SimpleWordCount extends Configured implements Tool {

    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }

    public int run(String[] args) throws Exception { ... }

    public static void main(String[] args) { ... }
}

Page 35: Hadoop

public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and emit (word, 1) for every token
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, ONE);
        }
    }
}

Page 36: Hadoop

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}

Page 37: Hadoop

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job job = new Job(conf, "Counting Words");
    job.setJarByClass(SimpleWordCount.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);  // the reducer emits IntWritable counts
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
}

Page 38: Hadoop

public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
    System.exit(result);
}

Page 39: Hadoop

Map/Reduce Data Flow

(Image from Hadoop in Action...great book!)

Page 40: Hadoop

Partitioning

Deciding which keys go to which reducer

Desire even distribution across reducers

Skewed data can overload a single reducer!
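To make the idea concrete, here is a minimal sketch (my addition, using the same org.apache.hadoop.mapreduce API as the WordCount code above) of a custom partitioner; it simply mirrors the default hash behavior, which is the place a real implementation would instead encode knowledge about skewed keys:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the modulus is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is wired into a job with job.setPartitionerClass(WordPartitioner.class).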

Page 41: Hadoop

Map/Reduce Partitioning & Shuffling

(Image from Hadoop in Action...great book!)

Page 42: Hadoop

Combiner

a.k.a. "Local Reduce"

Effectively a reduce in the mappers

Page 43: Hadoop

Shuffling WordCount
(looking at one mapper that sees the word "the" 1000 times)

                    data             # k/v pairs shuffled
without combiner    ("the", 1)       1000
with combiner       ("the", 1000)    1
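For WordCount the reducer can double as the combiner, since summing counts is associative and commutative. A one-line sketch of the wiring (my addition; it would go in run() alongside setMapperClass/setReducerClass):

job.setCombinerClass(Reduce.class);

Not every reducer is safe to reuse this way: the framework may run the combiner zero, one, or many times, so it must not change the final result.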

Page 44: Hadoop

Advanced Map/Reduce

Chaining Map/Reduce jobs

Hadoop Streaming

Joining data

Bloom filters

Page 45: Hadoop

Architecture

Page 46: Hadoop

HDFS: NameNode, SecondaryNameNode, DataNode

Map/Reduce: JobTracker, TaskTracker

Page 47: Hadoop

[Diagram: a master machine runs the NameNode, SecondaryNameNode, and JobTracker; each of worker machines 1..N runs a DataNode paired with a TaskTracker, and each TaskTracker hosts map and reduce tasks]

Page 48: Hadoop

NameNode

Bookkeeper for HDFS

Single point of failure!

Should not store data or run jobs

Manages DataNodes

Page 49: Hadoop
Page 50: Hadoop
Page 51: Hadoop

DataNode

Store actual file blocks on disk

Does not store entire files!

Report block info to NameNode

Receive instructions from NameNode

Page 52: Hadoop

Secondary NameNode

Not a failover server for NameNode!

Snapshot of NameNode

Help minimize downtime/data loss if NameNode fails

Page 53: Hadoop

JobTracker

Track map/reduce tasks

Partition tasks across HDFS cluster

Re-start failed tasks on different nodes

Speculative execution

Page 54: Hadoop
Page 55: Hadoop
Page 56: Hadoop

TaskTracker

Track individual map & reduce tasks

Report progress to JobTracker

Page 57: Hadoop
Page 58: Hadoop

Monitoring/Debugging

Page 59: Hadoop

distributed processing means distributed debugging...

Page 60: Hadoop

Logs

View task logs on machine where specific task was processed

(or via web UI)

$HADOOP_HOME/logs/userlogs on task tracker

Page 61: Hadoop
Page 62: Hadoop

Counters

Define one or more counters

Increment counters during map/reduce tasks

Counter values displayed in job tracker UI
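As a rough sketch (my addition; the enum and its names are hypothetical), a counter is declared as an enum and incremented through the task context, here inside the WordCount mapper from Page 35:

public enum WordStats { TOTAL_WORDS, EMPTY_LINES }

@Override
protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
        context.getCounter(WordStats.EMPTY_LINES).increment(1);
        return;
    }
    StringTokenizer st = new StringTokenizer(value.toString());
    while (st.hasMoreTokens()) {
        word.set(st.nextToken());  // 'word' and 'ONE' as declared on Page 35
        context.getCounter(WordStats.TOTAL_WORDS).increment(1);
        context.write(word, ONE);
    }
}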

Page 63: Hadoop
Page 64: Hadoop

IsolationRunner

Re-run failed tasks with original input data

Must set keep.failed.task.files to 'true'
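Programmatically that is one configuration flag, e.g. inside run() (a sketch, my addition):

Configuration conf = getConf();
conf.setBoolean("keep.failed.task.files", true);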

Page 65: Hadoop

Skipping Bad Records

Data may not always be clean

New data may have new interesting twists

Can you pre-process to filter & validate input?
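If pre-filtering is not possible, Hadoop also has a skipping mode that steps over records which repeatedly crash a task. A sketch (my addition; the SkipBadRecords helper lives in the older org.apache.hadoop.mapred API, so verify it against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.SkipBadRecords;

Configuration conf = getConf();
SkipBadRecords.setAttemptsToStartSkipping(conf, 2);  // begin skipping after 2 failed attempts
SkipBadRecords.setMapperMaxSkipRecords(conf, 1000);  // max acceptable records skipped around a bad one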

Page 66: Hadoop

Performance Tuning

Page 67: Hadoop

Speculative execution (on by default)

Reduce amount of input data

Data compression

Use a Combiner

JVM re-use (be careful)

Refactor code/algorithms
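A few of these knobs as they might look in code (a sketch, my addition; the property names are the Hadoop 0.20/1.x-era ones and worth verifying against your version):

Configuration conf = getConf();
conf.setBoolean("mapred.compress.map.output", true);              // compress intermediate map output
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);                // unlimited JVM re-use (be careful)
conf.setBoolean("mapred.map.tasks.speculative.execution", true);  // on by default anyway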

Page 68: Hadoop

Managing Hadoop

Page 69: Hadoop

Lots of knobs

Needs active management

"Fair" scheduling

Trash can

Add/remove data nodes

Network topology/rack awareness

NameNode/SNN management

Permissions/quotas

Page 70: Hadoop

Hive

Page 71: Hadoop

Simulate structure for data stored in Hadoop

Query language analogous to SQL (Hive QL)

Translates queries into Map/Reduce job(s)...

...so not for real-time processing!

Page 72: Hadoop

Queries:

Projection
Joins (inner, outer, semi)
Grouping
Aggregation
Sub-queries
Multi-table insert

Customizable:

User-defined functions
Input/output formats with SerDe

Page 73: Hadoop

"CITING","CITED"3858241,9562033858241,13242343858241,33984063858241,35573843858241,36348893858242,15157013858242,33192613858242,36687053858242,37070043858243,29496113858243,31464653858243,31569273858243,32213413858243,3574238...

Patent citation dataset

http://www.nber.org/patents

/user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt

Page 74: Hadoop

create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';

create table citation_histogram (num_citations int, count int)
stored as sequencefile;

Page 75: Hadoop

insert overwrite table citation_histogram
select num_citations, count(num_citations)
from (
  select cited, count(cited) as num_citations
  from patent_citations
  group by cited
) citation_counts
group by num_citations
order by num_citations;

Page 76: Hadoop
Page 77: Hadoop

Hadoop in the clouds

Page 78: Hadoop

Amazon EC2 + S3

EC2 instances are compute nodes (Map/Reduce)

Storage options:

HDFS on EC2 nodes

HDFS on EC2 nodes loading data from S3

Native S3 (bypasses HDFS)
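For the native S3 option, job input and output paths can point at S3 via the s3n:// scheme (a sketch, my addition; the bucket name is hypothetical and AWS credentials must be supplied in the configuration):

FileInputFormat.setInputPaths(job, new Path("s3n://my-bucket/input"));
FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output"));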

Page 79: Hadoop

Amazon Elastic MapReduce

Interact via web-based console

EMR configures & launches Hadoop cluster for job

Submit Map/Reduce job(streaming, Hive, Pig, or JAR)

Uses S3 for data input/output

Page 80: Hadoop

Recap...

Page 81: Hadoop

Hadoop = HDFS + Map/Reduce

Horizontal scale-out

Designed for fault tolerance

Distributed, parallel processing

Structure & queries via Hive

Page 82: Hadoop

References

Page 83: Hadoop

http://hadoop.apache.org/

http://hadoop.apache.org/hive/

Hadoop in Action
http://www.manning.com/lam/

Hadoop: The Definitive Guide, 2nd ed.
http://oreilly.com/catalog/0636920010388

Yahoo! Hadoop blog
http://developer.yahoo.net/blogs/hadoop/

Cloudera
http://www.cloudera.com/

Page 84: Hadoop

http://lmgtfy.com/?q=hadoop

http://www.letmebingthatforyou.com/?q=hadoop

Page 85: Hadoop

[email protected]

www.nearinfinity.com/blogs/

twitter: sleberknight

(my info)