Hadoop - University of...

HadoopYizheng (Ethan) Chen

Advisor: Prof. Aditya Akella

Outline

Hadoop

Yarn (NextGen Hadoop)

Hadoop

• What is Apache Hadoop– A framework (open‐source software) for reliable, scalable, distributed computing

• Hadoop MapReduce– A system for parallel processing of large data sets

• Hadoop Distributed File System (HDFS™)– A distributed file system that provides high‐throughput access to application data

– Similar to GFS

– http://hadoop.apache.org/• Why Hadoop?

Hadoop Job Execution

Hello 1Hadoop 1Goodbye 1Hadoop 1

Word Count

Hello World Bye World

Hello HadoopGoodbye Hadoop

Hello 1World 1Bye 1World 1

MapTask 1 sort

ReduceTask 1 (keys: A‐G)

merge/sort

Bye 1Hello 1World 1World 1

Goodbye 1Hadoop 1Hadoop 1Hello 1

combiner (local aggregation)

Bye 1Hello 1World 2

Goodbye 1Hadoop 2Hello 1

Goodbye 1

Bye 1Hello 1World 2

Goodbye 1Hadoop 2Hello 1

MapTask 2output

Hello 1World 2

Hadoop 2Hello 1

MapTask 2

ReduceTask 2 (keys: H‐Z)

Bye 1Goodbye 1

Hadoop 2Hello 1Hello 1World 2

Bye 1Goodbye 1

Hadoop 2Hello 2World 2

Bye 1Goodbye 1

Hadoop 2Hello 2World 2

shuffle HDFS part0

HDFS part1

MapTask 1output

Hadoop MapReduce

Hadoop Schedulers

• A pluggable framework for job scheduling algorithm available since Hadoop 0.19– FIFO– Fair Scheduler (Facebook)– Capacity Scheduler (Yahoo!)

FIFO schedulerOriginally optimized for large batch jobs(web index construction)FIFO order + priority queues

Fair Scheduler

Capacity Scheduler

HDFS Architecture

Hadoop Ecosystem

HBase: BigTable‐likeHive: Data summarization and ad hoc queryingPig: A high‐level data‐flow language and execution framework for parallel computationHCatalog: Table and storage management service (table abstraction of data)Zookeeper: A high‐performance coordination service for distributed applications

Hadoop and Hadoop‐derived Distributtions

https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know

*Cloudera Distribution Including Apache Hadoop (CDH)

*Greenplum HD (EMC)*Hortonworks Data Platform (Yarn)*MapR

Outline

Hadoop

Yarn (NextGen Hadoop)

ResourceManager:*Scheduler: allocate resources to the various running applications (pluggable policy plug‐in)*ApplicationsManager : accept job‐submissions/launch the first container for ApplicationMaster

Split up the two major functionalities of the JobTracker: * Management* Job scheduling/monitoring

Resource Allocation• the resource request understood by the Scheduler is of the form:

<priority, (hostname/rackname/*), capability, #containers>

• Scheduler APIThere is a single API between the Scheduler and the

ApplicationMaster:(List <Container> newContainers, List <ContainerStatus> containerStatuses) allocate (List <ResourceRequest> ask, List<Container> release)

Compact:O(clustersize)

HDFS Federation

Benefits:* Namespace Scalability* Performance* Isolation

Hadoop - University of...

Documents

Transcript of Hadoop - University of...

Hadoop Conf 2014 - Hadoop BigQuery Connector

Hadoop 3 (2017 hadoop taiwan workshop)

[Hadoop] Terapot: Massive Email Archiving with Hadoop

Introduction to PhysBAM - University of Wisconsin–Madisonpages.cs.wisc.edu/~sifakis/courses/cs838-f11/lecture_notes/CS838_12... · CS838 Advanced Modeling and Simulation About PhysBAM

Collision processing - University of …pages.cs.wisc.edu/~sifakis/courses/cs838-f11/lecture...CS838 Advanced Modeling and Simulation Collision processing • Different types of collisions

Using the Cloud - University of Wisconsin–Madisonpages.cs.wisc.edu/~akella/CS838/F12/notes/Aaron_intro.pdf · Using the Cloud Assignments, ... Assignments Overview 1. Prepare to

Hadoop Training #4: Programming with Hadoop

Continuous Delivery for Linux/Windows/Hadoop...Beta Cluster Hadoop JobTracker Jenkins Slave Hadoop node Hadoop node Hadoop node Hadoop node Slave Node Gateway Prod. Cluster PigServer

Hadoop World 2010: Productionizing Hadoop: Lessons Learned

Hadoop Interview Questions Version 2.0.0 Author: Hadoop ...kpbigdata.com/img/Hadoop_Interview_question.pdf · Hadoop Interview Questions Version 2.0.0 Author: Hadoop Learning Resource

250 Hadoop Interview Questions and Answers for Experienced Hadoop Developers - Hadoop Online Tutorials

Hadoop Deployment Manual - Hyadespleiades.ucsc.edu/doc/bright/hadoop-deployment-manual.pdf2.2 Ncurses Installation Of Hadoop Using cm-hadoop-setup ... •The Hadoop Deployment Manual

Hadoop , Hadoop , Hadoop !!!

Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.

Megastore: Providing Scalable, Highly Available Storage …pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers… · · 2012-05-24Megastore: Providing Scalable, Highly Available

Hue: The Hadoop UI - Hadoop Singapore

Обзор Hadoop-дистрибутивов. Тюнинг «узких мест» Hadoop

Why use Hadoop?, Challenges / Learning Hadoop & Average Salary of Hadoop Professional

Hadoop virtualization extensions hadoop world meetup

Hadoop - Architectural road map for Hadoop Ecosystem