Apache Hadoop
- Large Scale Data Processing
Sharath Bandaru & Sai Dinesh Koppuravuri
Advanced Topics Presentation, ISYE 582: Engineering Information Systems
Overview
Understanding Big Data
Structured/Unstructured Data
Limitations Of Existing Data Analytics Architecture
Apache Hadoop
Hadoop Architecture
HDFS
Map Reduce
Conclusions
References
Understanding Big Data
Big Data is creating large and growing files
Measured in: Terabytes (10^12 bytes) and Petabytes (10^15 bytes)
Which is largely unstructured
Structured/Unstructured Data
Why now? Data growth
[Chart: growth of data from 1980 to 2013; structured data about 20%, unstructured data about 80%]
Source: Cloudera, 2013
Challenges posed by Big Data
• Velocity: 400 million tweets a day on Twitter; 1 million transactions every hour at Wal-Mart
• Volume: 2.5 petabytes created by Wal-Mart transactions in an hour
• Variety: videos, photos, text messages, images, audio, documents, emails, etc.
Limitations Of Existing Data Analytics Architecture
[Stack diagram, top to bottom: BI Reports + Interactive Apps → RDBMS (aggregated data) → ETL Compute Grid → Storage-Only Grid (original raw data) → Collection → Instrumentation]
• Moving data to compute doesn't scale
• Can't explore the original high-fidelity raw data
• Archiving = premature data death
So What is Apache Hadoop?
• A set of tools that supports running applications on big data.
• Core Hadoop has two main systems:
- HDFS : self-healing high-bandwidth clustered storage.
- Map Reduce : distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
History
[Timeline of Hadoop's origins]
Source: Cloudera, 2013
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms the data to the DB-internal structure.
• New columns must be added explicitly before data for those columns can be loaded into the database.
• Pros: reads are fast; standards/governance.

Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
• Pros: loads are fast; flexibility/agility.
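To make late binding concrete, below is a minimal Java sketch of the schema-on-read idea. It is an illustration only, not Hive's actual SerDe interface; the Sale record, the field layout, and the file contents are invented for the example. The raw file is stored untouched, and a parser that supplies the schema is applied only at read time.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

public class SchemaOnReadSketch {
    // Hypothetical record type; its columns are bound late, at read time.
    record Sale(String product, double amount) {}

    public static void main(String[] args) throws Exception {
        // "Load" is fast: the raw bytes are copied as-is, no transformation step.
        Path raw = Files.createTempFile("sales", ".csv");
        Files.writeString(raw, "apple,1.50\norange,0.75\n");

        // The schema lives in the reader (the SerDe role), not in the store.
        Function<String, Sale> serde = line -> {
            String[] f = line.split(",");
            return new Sale(f[0], Double.parseDouble(f[1]));
        };

        List<Sale> sales = Files.readAllLines(raw).stream().map(serde).toList();
        sales.forEach(System.out::println);
    }
}

Updating the parser is all it takes for new columns to appear retroactively, which is the flexibility/agility the slide describes.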
Use The Right Tool For The Right Job
Relational Databases, use when:
• Interactive OLAP analytics (< 1 sec)
• Multistep ACID transactions
• 100% SQL compliance

Hadoop, use when:
• Structured or not (flexibility)
• Scalability of storage/compute
• Complex data processing
Traditional Approach
[Diagram: all of the big data is pushed to one powerful computer, which runs into its processing limit]
Enterprise Approach:
Hadoop Architecture
[Cluster diagram: a Master node runs the Job Tracker and the Name Node (plus its own Task Tracker and Data Node), while each Slave node runs a Task Tracker and a Data Node. The Job Tracker and the Task Trackers form the Map Reduce layer; the Name Node and the Data Nodes form the HDFS layer. An Application submits its job to the Job Tracker, which hands tasks to the Task Trackers.]
HDFS: Hadoop Distributed File System
• A given file is broken into blocks (default = 64 MB), and each block is replicated across the cluster (default = 3 replicas).
[HDFS diagram: a file split into blocks 1-5; each block is stored on three different Data Nodes, spread across servers and racks]
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block replication for:
• Durability
• Availability
• Throughput
Block Replicas are distributed across servers and racks
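As a sketch of how a client talks to HDFS, the following uses Hadoop's Java FileSystem API to write a file while requesting a replication factor of 3 and a 64 MB block size. The path and contents are made up for illustration, and the Configuration is assumed to pick up the cluster's core-site.xml/hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster config
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // hypothetical path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas and 64 MB blocks, matching the defaults above.
        FSDataOutputStream out =
            fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello HDFS");
        out.close();
        fs.close();
    }
}

The Name Node then decides which Data Nodes hold each replica, spreading them across servers and racks as in the diagram above.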
Fault Tolerance for Data
[Cluster diagram (HDFS): when a Data Node fails, the Name Node notices the lost replicas and re-replicates its blocks from the surviving copies onto other Data Nodes]
Fault Tolerance for Processing
[Cluster diagram (Map Reduce): when a Task Tracker fails, the Job Tracker re-runs its tasks on the remaining Task Trackers, so in-flight work is backed up rather than lost]
Map Reduce
Input Data → [Map] [Map] [Map] [Map] [Map] → Shuffle → [Reduce] [Reduce] → Results
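The canonical example of this pipeline is WordCount; the sketch below follows the standard Hadoop Java MapReduce tutorial shape (input and output paths come from the command line). Each mapper emits a <word, 1> pair per word, the shuffle groups the pairs by word, and each reducer sums the counts.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit <word, 1> for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the shuffle has grouped the 1s by word; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Combiner: a reducer-like step run on each mapper's output
    // before the shuffle, to cut network traffic.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with hadoop jar wordcount.jar WordCount <input> <output>, the framework runs one map task per input split and writes one output file per reducer.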
Understanding the concept of Map Reduce
The Story of Sam
• Mother, Sam, and an apple
• Sam believed "an apple a day keeps a doctor away"
Understanding the concept of Map Reduce
• Sam thought of "drinking" the apple
• He used a knife to cut the apple and a blender to make the juice.
Understanding the concept of Map Reduce
Next day
• Sam applied his invention to all the fruits he could find in the fruit basket
(map '( )')  (reduce '( )')
The classical notion of map/reduce in functional programming: a list of values is mapped into another list of values, which then gets reduced into a single value.
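The same classical notion in Java, as a tiny sketch with arbitrary numbers: the stream is mapped into another list of values, which reduce then folds into a single value.

import java.util.List;

public class MapReduceNotion {
    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4, 5);
        int sumOfSquares = values.stream()
                .map(x -> x * x)          // (map ...)    -> 1, 4, 9, 16, 25
                .reduce(0, Integer::sum); // (reduce ...) -> 55
        System.out.println(sumOfSquares);
    }
}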
Understanding the concept of Map Reduce
18 Years Later
• Sam got his first job at "Tropicana" for his expertise in making juices.
• Now it's not just one basket but a whole container of fruits
• Also, they produce a list of juice types separately
• Large data, and a list of values for output
• Wait! Sam had just ONE knife and ONE blender. NOT ENOUGH!!
Understanding the concept of Map Reduce
Brave Sam
• Sam implemented a parallel version of his innovation:
Fruits: (<a, >, <o, >, <p, >, …)
• Each input to a map is a list of <key, value> pairs
• Each output of a map is a list of <key, value> pairs: (<a', >, <o', >, <p', >, …)
• The outputs are grouped by key
• Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a', (…)>
• Each <key, value-list> is reduced into a list of values (see the sketch below)
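A minimal in-memory sketch of this map, group-by-key, reduce flow, with fruit names standing in for Sam's keys (the names and counts are illustrative only, not the Hadoop API):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ShuffleSketch {
    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("apple", "orange", "apple", "pear", "orange");

        // Map phase: each fruit becomes a <key, value> pair, e.g. <"apple", 1>.
        // Shuffle: the pairs are grouped by key into <key, value-list>.
        Map<String, List<Integer>> grouped = fruits.stream()
            .map(f -> Map.entry(f, 1))
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: each <key, value-list> is folded into one result per key.
        grouped.forEach((fruit, pieces) ->
            System.out.println(fruit + " juice from " + pieces.size() + " pieces"));
    }
}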
Understanding the concept of Map Reduce
• Sam realized:
– To create his favorite mixed-fruit juice, he can use a combiner after the reducers
– If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), use the blender (reducer) separately on each of them
– The knife (mapper) and the blender (reducer) should not contain residue after use: side-effect free
Source: (Ekanayake, 2010).
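A note on Hadoop's actual combiner: it is a reducer-like step that runs on each mapper's output before the shuffle (set via job.setCombinerClass, as in the WordCount sketch above), whereas the story's mixed-fruit combiner sits after the reducers; the analogy captures the extra aggregation stage rather than its exact position.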
Conclusions
• The key benefits of Apache Hadoop:
1) Agility/Flexibility (Quickest Time to Insight)
2) Complex Data Processing (Any Language, Any Problem)
3) Scalability of Storage/Compute (Freedom to Grow)
4) Economical Storage (Keep All Your Data Alive Forever)
• The key systems for Apache Hadoop are:
1) Hadoop Distributed File System : self-healing high-bandwidth clustered storage.
2) Map Reduce : distributed fault-tolerant resource management coupled with scalable data processing.
References
• Ekanayake, S. (2010, March). MapReduce: The Story of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html
• Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
• The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/
• Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8
• Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI