Hadoop
Scott Leberknight
Yahoo! "Search Assist"
Yahoo!
Baidu
eBay
New York Times
Rackspace
eHarmony
Powerset
http://wiki.apache.org/hadoop/PoweredBy
Notable Hadoop users...
Hadoop in the Real World...
Financial analysis
Recommendation systems
Natural LanguageProcessing (NLP)
Image/video processing
Log analysis
Data warehousing
Correlation engines
Market research/forecasting
Finance
Health & Life Sciences
Academic research
Government
Social networking
Telecommunications
History...
Originally built to support distribution for Nutch search engine
Created by Doug Cutting
Inspired by the Google GFS and MapReduce papers circa 2003-2004
Named after a stuffed elephant
OK, so what exactly is Hadoop?
An open source...
general-purpose framework for creating distributed applications that process huge amounts of data.
batch/offline oriented...
data & I/O intensive...
One definition of "huge"
25,000 machines
More than 10 clusters
3 petabytes of data (compressed, unreplicated)
700+ users
10,000+ jobs/week
Major Hadoop Components:
Distributed File System (HDFS)
Map/Reduce System
But first, what isn't Hadoop?
Hadoop is NOT:
...a relational database!
...an online transaction processing (OLTP) system!
...a structured data store of any kind!
Hadoop vs. Relational

Hadoop                      | Relational
Scale-out                   | Scale-up (*)
Key/value pairs             | Tables
Say how to process the data | Say what you want (SQL)
Offline/batch               | Online/real-time

(*) Sharding attempts to horizontally scale an RDBMS, but is difficult at best
HDFS
(Hadoop Distributed File System)
Data is distributed and replicated over multiple machines
Designed for large files (where "large" means GB to TB)
Block oriented
Linux-style commands, e.g. ls, cp, mv, rm, etc.
File Block Mappings (held by the NameNode):
/user/aaron/data1.txt -> 1, 2, 3
/user/aaron/data2.txt -> 4, 5
/user/andrew/data3.txt -> 6, 7

[Diagram: the NameNode stores the file-to-block mappings; the blocks (1-7) live on the DataNodes, each block replicated across several nodes]
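The split between file metadata and block storage can be sketched in plain Java. This is a toy model for illustration, not the Hadoop API; all class and method names here are invented:

```java
import java.util.*;

// Toy model of HDFS metadata: the NameNode knows which blocks make up a
// file and which DataNodes hold each block; DataNodes only store blocks.
public class ToyNameNode {
    private final Map<String, List<Integer>> fileToBlocks = new HashMap<>();
    private final Map<Integer, List<String>> blockToNodes = new HashMap<>();

    public void addFile(String path, List<Integer> blocks) {
        fileToBlocks.put(path, blocks);
    }

    public void placeBlock(int block, List<String> dataNodes) {
        blockToNodes.put(block, dataNodes);
    }

    // To read a file, a client asks the NameNode for the DataNodes
    // holding each of its blocks, then reads the blocks directly.
    public List<List<String>> locate(String path) {
        List<List<String>> locations = new ArrayList<>();
        for (int block : fileToBlocks.get(path)) {
            locations.add(blockToNodes.get(block));
        }
        return locations;
    }
}
```

The point of the model: clients never stream file data through the NameNode; they only ask it where the blocks are.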
Self-healing & fault tolerant: rebalances files across the cluster when nodes fail
Scalable: just by adding new nodes!
Map/Reduce
Operate on key/value pairs
Mappers filter & transform input data
Reducers aggregate mapper output
Split input files (e.g. by HDFS blocks)
move code to data
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
Word Count
(the canonical Map/Reduce example)
the quick brown fox
jumped over
the lazy brown dog
map phase - inputs (K1, V1):
(0, "the quick brown fox")
(20, "jumped over")
(32, "the lazy brown dog")

map phase - outputs list(K2, V2):
("the", 1) ("quick", 1) ("brown", 1) ("fox", 1)
("jumped", 1) ("over", 1)
("the", 1) ("lazy", 1) ("brown", 1) ("dog", 1)
reduce phase - inputs (K2, list(V2)):
("brown", (1, 1)) ("dog", (1)) ("fox", (1)) ("jumped", (1))
("lazy", (1)) ("over", (1)) ("quick", (1)) ("the", (1, 1))
reduce phase - outputs list(K3, V3):
("brown", 2) ("dog", 1) ("fox", 1) ("jumped", 1)
("lazy", 1) ("over", 1) ("quick", 1) ("the", 2)
WordCount in code...
public class SimpleWordCount extends Configured implements Tool {
  public static class MapClass extends Mapper<Object, Text, Text, IntWritable> { ... }
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
  public int run(String[] args) throws Exception { ... }
  public static void main(String[] args) { ... }
}

public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer st = new StringTokenizer(value.toString());
    while (st.hasMoreTokens()) {
      word.set(st.nextToken());
      context.write(word, ONE);
    }
  }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable count = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    count.set(sum);
    context.write(key, count);
  }
}

public int run(String[] args) throws Exception {
  Configuration conf = getConf();
  Job job = new Job(conf, "Counting Words");
  job.setJarByClass(SimpleWordCount.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(Reduce.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.setInputPaths(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
  int result = ToolRunner.run(new Configuration(), new SimpleWordCount(), args);
  System.exit(result);
}
Map/Reduce Data Flow
(Image from Hadoop in Action...great book!)
Partitioning
Deciding which keys go to which reducer
Desire even distribution across reducers
Skewed data can overload a single reducer!
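Hadoop's default strategy assigns a key to a reducer by hashing it. A minimal plain-Java sketch of that modulo-of-hash idea (a toy standalone class, not the actual Partitioner API):

```java
// Sketch of hash partitioning: every occurrence of a given key maps to
// the same reducer, and hashing tends to spread distinct keys evenly.
public class ToyPartitioner {
    // Mask off the sign bit so the modulo result is never negative,
    // even when hashCode() returns a negative value.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Note that hashing spreads distinct keys, not values: if one key dominates the data, all of its pairs still land on a single reducer, which is exactly the skew problem above.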
Map/Reduce Partitioning & Shuffling
(Image from Hadoop in Action...great book!)
Combiner
a.k.a. "Local Reduce"
Effectively a reduce in the mappers
Reduces the amount of data (# of k/v pairs) shuffled

Shuffling WordCount
(looking at one mapper that sees the word "the" 1000 times)

without combiner: ("the", 1) x 1000 pairs shuffled
with combiner: ("the", 1000) x 1 pair shuffled
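The combiner's effect can be simulated in plain Java: pre-aggregating one mapper's output collapses the 1000 ("the", 1) pairs into a single ("the", 1000) pair before anything is shuffled (a toy illustration, not the Hadoop Combiner API):

```java
import java.util.*;

// Simulate a combiner: locally sum each key's counts within one mapper's
// output, so only one pair per distinct key is shuffled to the reducers.
public class ToyCombiner {
    public static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String key : mapperOutputKeys) {
            combined.merge(key, 1, Integer::sum);  // ("the", 1) x N -> ("the", N)
        }
        return combined;
    }
}
```

Because word-count reduction is just addition (commutative and associative), running it early in the mapper changes nothing about the final result, only the volume shuffled.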
Advanced Map/Reduce
Chaining Map/Reduce jobs
Hadoop Streaming
Joining data
Bloom filters
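Of the techniques above, a Bloom filter is the easiest to sketch: a compact bit set that answers "definitely not present" or "probably present", which lets a join pre-filter records without loading the other dataset. A toy version with two ad-hoc hash functions (real implementations tune the bit-array size and hash count for a target false-positive rate):

```java
import java.util.BitSet;

// Toy Bloom filter: no false negatives, occasional false positives.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    public ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String key) { return (key.hashCode() & Integer.MAX_VALUE) % size; }
    private int h2(String key) { return ((key.hashCode() * 31 + 7) & Integer.MAX_VALUE) % size; }

    public void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // false means "definitely never added"; true means "probably added"
    public boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }
}
```

In a map-side join, each mapper can load a small Bloom filter of the join keys and drop non-matching records immediately, instead of shuffling them.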
Architecture

HDFS: NameNode, SecondaryNameNode, DataNode
Map/Reduce: JobTracker, TaskTracker
[Diagram: a master node runs the NameNode, JobTracker, and SecondaryNameNode; each worker node (1..N) runs a DataNode and a TaskTracker, and each TaskTracker runs map and reduce tasks]
NameNode
Bookkeeper for HDFS
Single point of failure!
Should not store data or run jobs
Manages DataNodes
DataNode
Store actual file blocks on disk
Does not store entire files!
Report block info to NameNode
Receive instructions from NameNode
Secondary NameNode
Not a failover server for NameNode!
Snapshot of NameNode
Help minimize downtime/data loss if NameNode fails
JobTracker
Track map/reduce tasks
Partition tasks across HDFS cluster
Re-start failed tasks on different nodes
Speculative execution
TaskTracker
Track individual map & reduce tasks
Report progress to JobTracker
Monitoring/Debugging
distributed processing means distributed debugging
Logs
View task logs on machine where specific task was processed
(or via web UI)
$HADOOP_HOME/logs/userlogs on task tracker
Counters
Define one or more counters
Increment counters during map/reduce tasks
Counter values displayed in job tracker UI
IsolationRunner
Re-run failed tasks with original input data
Must set keep.failed.tasks.files to 'true'
Skipping Bad Records
Data may not always be clean
New data may have new interesting twists
Can you pre-process to filter & validate input?
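A common alternative to record skipping is to validate inside the mapper and simply drop (and count) malformed records. A plain-Java sketch of such a filter for the "citing,cited" records used later in this deck; the exact format check is an assumption for illustration:

```java
// Validate a "citing,cited" record: exactly two non-empty numeric fields.
public class RecordValidator {
    public static boolean isValid(String line) {
        String[] fields = line.split(",");
        if (fields.length != 2) return false;
        for (String f : fields) {
            // Reject header rows, quoted values, and other junk.
            if (f.isEmpty() || !f.chars().allMatch(Character::isDigit)) return false;
        }
        return true;
    }
}
```

Pairing a filter like this with a "bad records" counter makes dirty input visible in the job tracker UI instead of crashing tasks.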
Performance Tuning
Speculative execution(on by default)
Reduce amount of input data
Data compression
Use a Combiner
JVM Re-use(be careful)
Refactor code/algorithms
Managing
Hadoop
Lots of knobs
Needs active management
"Fair" scheduling
Trash can
Add/remove data nodes
Network topology/rack awareness
NameNode/SNN management
Permissions/quotas
Hive
Simulate structure for data stored in Hadoop
Query language analogous to SQL (Hive QL)
Translates queries into Map/Reduce job(s)...
...so not for real-time processing!
Queries:
Projection Joins (inner, outer, semi)
Grouping Aggregation
Sub-queries Multi-table insert
Customizable:
User-defined functions
Input/output formats with SerDe
Patent citation dataset
http://www.nber.org/patents
/user/sleberkn/nber-patent/tables/patent_citation/cite75_99.txt

"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
...
create external table patent_citations (citing string, cited string)
row format delimited fields terminated by ','
stored as textfile
location '/user/sleberkn/nber-patent/tables/patent_citation';

create table citation_histogram (num_citations int, count int)
stored as sequencefile;

insert overwrite table citation_histogram
select num_citations, count(num_citations)
from (select cited, count(cited) as num_citations
      from patent_citations
      group by cited) citation_counts
group by num_citations
order by num_citations;
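The nested Hive query does two passes: count citations per patent, then count how many patents share each citation count. The same logic in plain Java makes the two grouping steps explicit (a toy equivalent for illustration, not what Hive generates):

```java
import java.util.*;

// Two-level aggregation mirroring the Hive query: cited -> #citations,
// then #citations -> how many patents received that many citations.
public class CitationHistogram {
    public static SortedMap<Integer, Integer> histogram(List<String[]> citations) {
        // Step 1 (inner sub-query): count citations per cited patent.
        Map<String, Integer> perPatent = new HashMap<>();
        for (String[] pair : citations) {          // pair = {citing, cited}
            perPatent.merge(pair[1], 1, Integer::sum);
        }
        // Step 2 (outer query): group patents by citation count.
        SortedMap<Integer, Integer> hist = new TreeMap<>();
        for (int n : perPatent.values()) {
            hist.merge(n, 1, Integer::sum);
        }
        return hist;
    }
}
```

Each grouping step corresponds to one Map/Reduce job, which is why chained queries like this are batch work, not real-time lookups.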
Hadoop in the clouds
Amazon EC2 + S3
EC2 instances are compute nodes (Map/Reduce)
Storage options:
HDFS on EC2 nodes
HDFS on EC2 nodes loading data from S3
Native S3 (bypasses HDFS)
Amazon Elastic MapReduce
Interact via web-based console
EMR configures & launches Hadoop cluster for job
Submit Map/Reduce job(streaming, Hive, Pig, or JAR)
Uses S3 for data input/output
Recap...
Hadoop = HDFS + Map/Reduce
Horizontal scale-out
Designed for fault tolerance
Distributed, parallel processing
Structure & queries via Hive
References
http://hadoop.apache.org/
http://hadoop.apache.org/hive/
Hadoop in Action: http://www.manning.com/lam/
Hadoop: The Definitive Guide, 2nd ed.: http://oreilly.com/catalog/0636920010388
Yahoo! Hadoop blog: http://developer.yahoo.net/blogs/hadoop/
Cloudera: http://www.cloudera.com/
http://lmgtfy.com/?q=hadoop
http://www.letmebingthatforyou.com/?q=hadoop