StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters

Post on 13-Aug-2015


StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters

LIU Kai
Email: kiwenlau@163.com
Blog: http://kiwenlau.com/

National Institute of Informatics, Japan

04/15/2023 1LIU Kai, National Institute of Informatics

Contents

- Introduction (What?)
- Motivation (Why?)
- Implementation (How?)
- Personal Ideas

Introduction – What is StoreApp?


Background

- Hadoop (version 1): for big data storage and computation
  - Hadoop Distributed File System (HDFS): for storage
  - Hadoop MapReduce framework: for computation
- Master/slave architecture
  - Storage (DataNode) and computation (TaskTracker) are co-located on each node

[Figure: Hadoop 1 cluster. One master runs the NameNode and JobTracker; four slaves each run a DataNode and a TaskTracker. Each node is a physical or virtual machine.]

Overview

- What is StoreApp?
  - A Hadoop plugin
  - Speeds up Hadoop running in virtual machines
  - Separates storage (DataNode) from computation (TaskTracker)

[Figure: two physical machines. In one, every virtual machine runs both a DataNode and a TaskTracker; in the other, one virtual machine runs the DataNode and separate virtual machines run the TaskTrackers.]

Benefit

- Improves HDFS throughput by 78.3%
  - The storage VM has higher scheduling priority than the computation VMs
  - Consolidating storage into one VM reduces I/O contention
- Reduces job completion time by 61%
  - Most Hadoop jobs are data intensive
  - Their performance is bottlenecked by slow disk access
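To put the two figures on a common scale, the percentages can be converted to ratios. This is a minimal sketch: only the two percentages come from the slides, and the function names are made up for illustration.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Speedup factor implied by cutting completion time by `reduction` (0..1)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def ratio_from_improvement(improvement: float) -> float:
    """New/old ratio implied by a relative improvement (0.783 means +78.3%)."""
    return 1.0 + improvement


job_speedup = speedup_from_time_reduction(0.61)   # jobs finish in 39% of the original time
throughput_ratio = ratio_from_improvement(0.783)  # HDFS throughput relative to the baseline
print(f"job speedup: {job_speedup:.2f}x, throughput: {throughput_ratio:.2f}x")
```

A 61% shorter completion time corresponds to roughly a 2.56x speedup, which is larger than the 1.78x throughput ratio because the two numbers measure different things.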


Motivation – Why do we need StoreApp?


Challenge 1

- Nodes cannot be added or removed easily
  - Rebalancing data incurs significant data movement
  - The cluster cannot exploit the elasticity of virtual machines

[Figure: co-located layout. Every virtual machine on each physical machine runs both a DataNode and a TaskTracker.]

Solution 1

- Separate storage from computation
  - Adding or removing computation nodes requires no data movement
  - The optimal number of computation nodes can be found for each Hadoop job
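The effect on data movement can be illustrated with a back-of-the-envelope model. The sketch below assumes data is spread evenly over the storage-holding nodes and that rebalancing restores an even spread; it ignores replication. The function names and the 900 GB figure are hypothetical, not taken from the paper.

```python
def rebalance_movement_gb(total_gb: float, old_nodes: int, new_nodes: int) -> float:
    """Data that must move to restore an even spread after resizing.

    Assumes each node stores total_gb / n before and after the change.
    Growing: the new nodes must receive their share.
    Shrinking: the removed nodes must hand off everything they hold.
    """
    if old_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    if new_nodes >= old_nodes:
        return total_gb * (new_nodes - old_nodes) / new_nodes
    return total_gb * (old_nodes - new_nodes) / old_nodes


def movement_on_resize_gb(total_gb: float, old_nodes: int, new_nodes: int,
                          storage_separated: bool) -> float:
    """With storage in its own VM, resizing compute nodes moves no HDFS data."""
    if storage_separated:
        return 0.0
    return rebalance_movement_gb(total_gb, old_nodes, new_nodes)


# Growing a 9-node cluster holding 900 GB to 10 nodes.
colocated = movement_on_resize_gb(900.0, 9, 10, storage_separated=False)
separated = movement_on_resize_gb(900.0, 9, 10, storage_separated=True)
print(colocated, separated)
```

Under these assumptions the co-located cluster moves 90 GB to fill the new node, while the separated design moves nothing because only TaskTracker VMs are added.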

[Figure: separated layout. Each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]

Challenge 2

- Co-located virtual machines often access the disk concurrently
  - Their random I/O operations compete with each other
  - This significantly degrades Hadoop job performance

[Figure: co-located layout. Every virtual machine on each physical machine runs both a DataNode and a TaskTracker.]

Solution 2

- Separate storage from computation
  - Each physical machine has only one storage virtual machine
  - Only the storage virtual machine is I/O intensive
  - No serious contention from concurrent I/O operations

[Figure: separated layout. Each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]

Challenge 3

- Virtual machines cannot be scheduled efficiently
  - I/O-intensive VMs can be prioritized since they consume less CPU
  - However, every VM is I/O intensive!

[Figure: co-located layout. Every virtual machine on each physical machine runs both a DataNode and a TaskTracker.]

Solution 3

- Separate storage from computation
  - Only the storage virtual machine is I/O intensive
  - The storage virtual machine receives a higher priority

[Figure: separated layout. Each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]

Implementation – How to design StoreApp?


Architecture

- A StoreApp manager and multiple storage nodes
  - The StoreApp manager runs on the master node
  - Each physical machine has one storage node

Components

- StoreApp manager
  - Coordinates the operations of all data nodes
- Scheduler
  - Schedules tasks according to data locations
- HDFS Proxy
  - Receives all HDFS requests and forwards them to the DataNode
- Shuffler
  - Receives map output and pushes it to the DataNode
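One way to picture scheduling by data location: the sketch below places each task on the physical machine whose storage node holds its input block when a compute slot is free there, and otherwise falls back to any machine with spare slots. All names (`Task`, `assign_tasks`, the `pm1`/`pm2` host labels) are hypothetical; the slides do not describe StoreApp's actual scheduling policy at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    block_id: str


def assign_tasks(tasks, block_host, free_slots):
    """Greedy locality-first assignment.

    tasks:      list of Task
    block_host: dict block_id -> physical host whose storage node has the block
    free_slots: dict physical host -> free compute slots (copied, not mutated)
    Returns dict task_id -> (host, is_local).
    """
    slots = dict(free_slots)
    assignment = {}
    deferred = []
    # First pass: place tasks next to their data when a slot is free.
    for task in tasks:
        host = block_host.get(task.block_id)
        if host is not None and slots.get(host, 0) > 0:
            slots[host] -= 1
            assignment[task.task_id] = (host, True)
        else:
            deferred.append(task)
    # Second pass: remaining tasks go to the host with the most free slots.
    for task in deferred:
        candidates = [h for h, n in slots.items() if n > 0]
        if not candidates:
            break  # no capacity left; these tasks wait for a later round
        host = max(candidates, key=lambda h: (slots[h], h))
        slots[host] -= 1
        assignment[task.task_id] = (host, False)
    return assignment


tasks = [Task("t0", "b1"), Task("t1", "b1"), Task("t2", "b2")]
plan = assign_tasks(tasks, {"b1": "pm1", "b2": "pm2"}, {"pm1": 1, "pm2": 2})
print(plan)
```

Here `t0` and `t2` run next to their blocks, while `t1` is placed remotely on `pm2` because `pm1` has only one free slot.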


HDFS Prefetching

- Read the whole block b1 instead of only the needed partial records
- Unused data of block b1 is kept in memory
- Read the consecutive block into memory to form input split s1
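The idea can be sketched as a reader that always loads whole blocks and keeps them in memory, so later reads of the same block, or of a prefetched consecutive block, skip the disk. `PrefetchingReader`, its FIFO eviction, and the sample data are assumptions made for illustration and are not StoreApp's implementation.

```python
class PrefetchingReader:
    """Serves record reads from whole blocks cached in memory."""

    def __init__(self, read_block, cache_blocks=2):
        self._read_block = read_block  # callable: block_id -> bytes (one disk read)
        self._cache = {}               # block_id -> bytes, in insertion order
        self._cache_blocks = cache_blocks
        self.disk_reads = 0

    def _load(self, block_id):
        if block_id not in self._cache:
            if len(self._cache) >= self._cache_blocks:
                # Evict the oldest cached block (FIFO, chosen for simplicity).
                self._cache.pop(next(iter(self._cache)))
            self._cache[block_id] = self._read_block(block_id)
            self.disk_reads += 1
        return self._cache[block_id]

    def read(self, block_id, offset, length, next_block_id=None):
        """Return `length` bytes at `offset`; optionally prefetch the next block."""
        data = self._load(block_id)[offset:offset + length]
        if next_block_id is not None:
            self._load(next_block_id)
        return data


blocks = {"b1": b"record-A|record-B|", "b2": b"record-C|"}
reader = PrefetchingReader(blocks.__getitem__)
first = reader.read("b1", 0, 8, next_block_id="b2")  # loads b1, prefetches b2
second = reader.read("b1", 9, 8)                     # served from memory
third = reader.read("b2", 0, 8)                      # served from the prefetched block
print(first, second, third, reader.disk_reads)
```

Three record reads cost only two block reads in this toy run; the trade-off is the memory held by cached blocks.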

[Figure: HDFS prefetching, showing task0 and task1.]

Automated Cluster Resizing

- Dynamically change the cluster size during job execution
- An iterative algorithm searches for the optimal cluster size
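The slides do not spell out the iterative algorithm. Purely as an illustration, the sketch below uses plain hill climbing: try one step smaller and one step larger, move to whichever neighbour gives a better measured progress rate, and stop when neither does. `measure_rate` and `toy_rate` are hypothetical stand-ins; a real system would resize the cluster and sample the running job.

```python
def search_cluster_size(measure_rate, start, min_size=1, max_size=64, step=1):
    """Hill-climb toward the cluster size with the best measured progress rate.

    measure_rate: callable(size) -> progress rate at that size (higher is better).
    """
    size = max(min_size, min(start, max_size))
    best = measure_rate(size)
    while True:
        candidates = [s for s in (size - step, size + step)
                      if min_size <= s <= max_size]
        rates = {s: measure_rate(s) for s in candidates}
        if not rates:
            return size
        nxt = max(rates, key=rates.get)
        if rates[nxt] <= best:
            return size  # neither neighbour improves: local optimum
        size, best = nxt, rates[nxt]


# Toy rate model (an assumption): more nodes add parallelism, but shared disk
# bandwidth caps throughput and every node adds a fixed coordination overhead.
def toy_rate(n):
    return min(10.0 * n, 80.0) - 1.5 * n


best_size = search_cluster_size(toy_rate, start=2)
print(best_size)
```

With this toy model the search settles on 8 nodes, the point where the bandwidth cap is reached. Hill climbing only finds a local optimum, so it suits rate curves with a single peak.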

Personal Ideas


Pros and cons

- Pros
  - Simple idea with good results
  - Clear logic in locating and solving the problems
- Cons
  - Restricted to Hadoop 1
  - Not open source

Future direction

- From Hadoop 1 to Hadoop 2
  - Hadoop 2 is quite different from Hadoop 1
  - Hadoop 2 can support more application frameworks, such as Spark
- From virtual machines to containers
  - Containers are a more lightweight virtualization technology
  - Containers are more resource efficient than virtual machines
  - Containers are easier to scale than virtual machines

References

Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015

Thank you!
