Hadoop 101 v1

Post on 26-Jan-2015

Given to Hadoop.SG meetup, Sept 25, 2013

Transcript of Hadoop 101 v1

HADOOP 101

A really quick overview of what it is

(and why you should care...)

Monday, 30 September, 13

HADOOP IS A TOOL TO SOLVE BIG DATA PROBLEMS

HADOOP IS A (ONE KIND OF) TOOL TO SOLVE (SOME KINDS OF) BIG DATA PROBLEMS

OK... SO WHAT IS A BIG DATA PROBLEM?

“A PROBLEM IS A BIG DATA PROBLEM WHEN THE SIZE OF THE DATA IS PART OF THE PROBLEM”

A few Terabytes of Data...

Text processing--a few hours?

But what if you have more data?

Network Storage--Petabytes!

What if you need compute power for complex algorithms?

8 cores? 16 cores? 64 cores? 512 GB RAM?

More processing in one box--price grows exponentially

[Chart: Single Machine; vertical axis ticks 0 to 300,000, horizontal axis ticks 16 to 128]

Compute cost for many commodity machines scales linearly

[Chart: Single Machine vs. Many Commodity Machines; vertical axis ticks 0 to 300,000, horizontal axis ticks 16 to 128]

A network of commodity computers

Run jobs on part of the data on each one. Compile the results.
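
The idea of running a job on part of the data on each machine and then compiling the results can be sketched in a few lines of Python. This is only an illustration: threads stand in for separate commodity machines, and the names `split_into_parts`, `count_errors`, and `run_on_cluster` are made up for this sketch, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(records, n_parts):
    """Deal records round-robin into n_parts lists, one per pretend machine."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_errors(part):
    """The 'job' each machine runs on its own part of the data."""
    return sum(1 for line in part if "ERROR" in line)


def run_on_cluster(records, n_machines=4):
    parts = split_into_parts(records, n_machines)
    # Assumption: threads play the role of separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_machines) as pool:
        partial_counts = list(pool.map(count_errors, parts))
    # "Compile the results": combine the per-machine answers into one.
    return sum(partial_counts)


log_lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
total_errors = run_on_cluster(log_lines, n_machines=3)
print(total_errors)
```

Each worker only ever sees its own slice of the data, which is why adding machines lets the same job cover more data.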

Let’s add a computer to manage the process of job delegation, merging the results...

We also need something to keep track of what files are where, so we know what data needs to be computed...

When you have a lot of computers, and even more hard drives, one thing I can guarantee...

Computers will eventually fail.

Hard drives will eventually fail.

Even whole racks will fail.

If a computer fails and you only have one copy of your data...

You will be very, very unhappy.

So let’s store multiple copies of the data. Hard drives are CHEAP!

If one hard drive fails... we are still OK

If one computer fails... we are still OK

Even if a whole rack fails... we are still OK

Once we find a failure, let’s have the system make new copies to replace the lost ones.
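
The bookkeeping described above (which copies live where, and what to do when a machine dies) might look roughly like the toy model below. It is a sketch under stated assumptions: three copies per block, random placement, and no notion of racks. The `TinyCluster` class and its methods are hypothetical names for illustration and are not Hadoop code.

```python
import random

REPLICATION = 3  # assumed number of copies kept for every block


class TinyCluster:
    """Toy model: tracks which nodes hold a copy of each block."""

    def __init__(self, node_names, seed=0):
        self.live_nodes = set(node_names)
        self.block_locations = {}  # block id -> set of nodes holding a copy
        self.rng = random.Random(seed)

    def store(self, block_id):
        # Place copies on REPLICATION distinct live nodes.
        nodes = self.rng.sample(sorted(self.live_nodes), REPLICATION)
        self.block_locations[block_id] = set(nodes)

    def fail(self, node):
        # The node is gone, and so is every copy it held.
        self.live_nodes.discard(node)
        for holders in self.block_locations.values():
            holders.discard(node)
            # Re-copy: top the block back up from nodes that lack it.
            spare = sorted(self.live_nodes - holders)
            missing = REPLICATION - len(holders)
            holders.update(self.rng.sample(spare, min(missing, len(spare))))


cluster = TinyCluster(["n1", "n2", "n3", "n4", "n5"])
for block in range(10):
    cluster.store(block)
cluster.fail("n2")
print(all(len(h) == REPLICATION for h in cluster.block_locations.values()))
```

A real system would also spread copies across racks so that losing a whole rack does not take out every copy of a block; this sketch leaves that out to stay short.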

Guess what? We’ve just invented Hadoop!

So let’s talk about the pieces of Hadoop.

Data nodes store and manage the data on a single “slave” computer

Task trackers manage the compute

Job tracker manages task trackers, ships code to compute nodes

Name node manages distribution and replication on the data nodes

[Diagram: Data Node, Task Tracker, Job Tracker, Name Node]

MapReduce: Task Tracker, Job Tracker

HDFS (Hadoop Distributed File System): Data Node, Name Node

VISUAL EXAMPLE

MAP

SHUFFLE

REDUCE
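
Since the slide images did not survive, here is a stand-in for the visual example: the classic word count written as three plain Python functions, one per phase. It runs in a single process and only mimics the shape of map, shuffle, and reduce; the function names are chosen for this sketch and are not the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """MAP: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """SHUFFLE: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

On a cluster, the map calls would run next to the data on many machines, the shuffle would move pairs over the network so that each key ends up in one place, and the reduce calls would run per key.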

PUTTING IT ALL TOGETHER

SO WHAT?

THAT’S WHAT!

OPPORTUNITY!

HADOOP ECOSYSTEM