Serengeti - 虚拟化你的大数据应用

© 2009 VMware Inc. All rights reserved

Serengeti - 虚拟化你的大数据应用

蔺永华

Vmware, Inc.

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Today’s Big Data System:

ETL

Real Time

Streams

Unstructured Data (HDFS)

Real Time

Structured

Database

Big SQL

Data Parallel Batch

Processing

Real-Time

Processing (s4, storm)

Analytics

Agenda







• Summary

• Q & A

Challenges To Use Hadoop in physical infrastructure

Deployment

• Difficult to deploy, cost several people for several days even months

• Difficult to tune cluster performance

Low Efficiency

• Hadoop clusters are typically not 100% utilized across all hardware resources.

• Difficult to share resources safely between different workload

Single Point of Failure

• Single point of failure for Name Node and Job tracker

• No HA for Hive, HCatalog, etc.

Why Virtualize Hadoop? - Get your Hadoop cluster in minutes

Hadoop Installation and Configuration

Network Configuration

OS installation

Server preparation

Manual process, cost days

Fully automated process,

10 minutes to get a

Hadoop/HBase cluster from

scratch

1/1000 human efforts,

Least Hadoop operation knowledge

Automate by Serengeti on

vSphere with best practice

Why Virtualize Hadoop? - Consolidate sprawling clusters

Single purpose clusters for various

business applications lead to cluster

sprawl.

Clusters share

servers with strong isolation

Simplify

• Single Hardware Infrastructure

• Unified operations

Optimize

• Shared Resources = higher utilization

• Elastic resources = faster on-demand access

Hadoop Dev

Hadoop Prod

HBase

Cluster Sprawling

Cluster Consolidation

Finance Hadoop

Virtualization Platform

Hadoop Dev

Hadoop Prod

HBase ... Portal

Hadoop

Portal Hadoop

30% CAPEX Down

Why Virtualize Hadoop? –

Utilize all your resources to solve the priority problem

50%+ resources are sitting

idle while high priority job is

burning up its cluster.

Utilize all resources from

pool on demand.

Dynamic elastic

scaling on shared resource pool

3X faster to get analytic results

vSphere High Availability (HA) - protection against unplanned downtime

• Protection against host and VM failures

• Automatic failure detection (host, guest OS)

• Automatic virtual machine restart in minutes, on any available host in cluster

• OS and application-independent, does not require complex configuration

changes

Overview

High Availability for the Hadoop Stack

HDFS

(Hadoop Distributed File System)

HBase (Key-Value store)

MapReduce (Job Scheduling/Execution System)

Pig (Data Flow) Hive (SQL)

BI Reporting ETL Tools

Managem

ent

Serv

er

Zookeepr

(Coord

ination) HCatalog

RDBMS

Namenode

Jobtracker

Hive MetaDB

Hcatalog MDB

Server

vSphere Fault Tolerance provides continuous protection

App

OS

App

OS

App

OS X X App

OS

App

OS

App

OS

App

OS

X

VMware ESX VMware ESX

• Single identical VMs running in

lockstep on separate hosts

• Zero downtime, zero data loss

failover for all virtual machines in

case of hardware failures

• Integrated with VMware HA/DRS

• No complex clustering or

specialized hardware required

• Single common mechanism for all

applications and operating

systems

FT HA HA

Overview

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

Agenda







• Summary

• Q & A

Easy and rapid deployment and management

Open source project launched in June 2012, 0.8 is released at Apr.

and will release 0.9 at Jun.

Toolkit that leverage virtualization to simplify Hadoop deployment

and operations

Deploy a cluster in 10 Minutes fully automated

Customize Hadoop and HBase cluster

Automated cluster operation

Come with eco-system components

Support all popular Hadoop Distributions

Serengeti

Demo: 10 minutes to a Hadoop cluster with Serengeti

Agenda







• Summary

• Q & A

Common questions about virtualization

Local Disk

• Can local disk be used in virtualization environment?

Flexibility and Scalability

• How to flexible schedule resources between clusters and different

applications as mentioned above?

Data stability

• In virtual environment, how can we distribute data across host and rack?

Data locality

• Hadoop will schedule compute tasks near by the data, to reduce network

IO for data R/W. Can virtual environment get the same result?

Performance

• How about the performance in virtual environment?

Agenda







• Summary

• Q & A

Can I use local disk easily?

Serengeti Extend Virtual Storage Architecture to Include Local Disk

Shared Storage: SAN or NAS

• Easy to provision

• Automated cluster rebalancing

Hybrid Storage

• SAN for boot images, other

workloads

• Local disk for Hadoop & HDFS

Host

Ha

do

op

Oth

er

VM

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Oth

er

VM

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

How to flexible scale in/scale out

How to flexible schedule resources between clusters and

different applications?

Storage

Evolution of Hadoop on VMs – Data/Compute separation

Compute Current Hadoop: Combined Storage/Compute

Storage

T1 T2

VM VM VM

VM VM

VM

Hadoop in VM

- * VM lifecycle determined by Datanode

- * Limited elasticity

Separate Storage

- * Separate compute from data

- * Remove elastic constrain

- by Datanode

- * Elastic compute

- * Raise utilization

Separate Compute Clusters

- * Separate virtual compute

- * Compute cluster per tenant

- * Stronger VM-grade security and resource isolation

Slave Node

Serengeti Node Scale Out / Scale In

Host

NameNode

Host

D

C JobTracker C

C C

Host

D

C C

C C

Host

D

C C

C C

Host

D

C C

C C

Serengeti Ballooning Enhancement for Java Application

JVM

Guest OS

Host JVM

Guest OS

Guest OS

Host JVM

How to keep data stability?

How to access data locally if data node and compute node

are located in different VM?

Distributed and Data/Compute Associated VM Placement

Data node and task tracker combined cluster Data Compute separated cluster

Host

master

Host

worker

Host

worker

Host

master

Host

Data node

Task tracker

Host

Data node

Task tracker

Host

Name node

Job tracker

Host

Data node

Task tracker

Host

Data node

Task tracker

Job tracker Task tracker Task tracker

HDFS cluster

Compute only cluster1

Compute only cluster2

Compute Only cluster

Rack 1 Rack 2 Rack 1 Rack 2

Rack 1 Rack 2

Hadoop Topology Awareness – Serengeti HVE

Hadoop Topology Changes

for Virtualization

/

D1 D2

R1 R2

N1

H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12

R3 R4

1 2 3

/

D1 D2

R1 R2

H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12

R3 R4

1 2 3

N2 N3 N4 N5 N6 N7 N8

1 3 2

1 2 3 4

Hadoop Virtualization Extensions for Topology

HADOOP-8468 (Umbrella JIRA)

HADOOP-8469

HDFS-3495

HDFS-3498

Hadoop

HVE

Task Scheduling Policy Extension

Balancer Policy Extension

Replica Choosing Policy Extension

Replica Placement Policy Extension

Network Topology Extension

Replica Removal Policy Extension

HDFS MapReduce

Hadoop Common

MAPREDUCE-4310

MAPREDUCE-4309

HADOOP-8470

HADOOP-8472

Is there significant performance degradation in virtualization

environment?

Is there any performance data?

Virtualized Hadoop Performance

Native versus Virtual Platforms, 32 hosts, 16 disks/host

Source: http://www.vmware.com/resources/techresources/10360

Agenda







• Summary

• Q & A

Serengeti architecture diagram

CLI Client

Spring Shell

UI Client

Flex UI

Serengeti

Web

Service

Hibernate/

DAO

Spring Batch

Rest API

Update

Meta DB

step

vPostgres

VM

Placement

calculation

VC adapter

VM

Provision

step

Sof tware

Mgmt

step

Ironfan

service

Thrift Service

Ironfan Progress

report

Chef

server

Rest API

Cookbook

VHM

step RabbitMQ

VM runtime

Manager

Host Host Host Host Host

Virtualization Platform

Hadoop

Node

Chef Client

HA kit

Hadoop

Node

Hadoop

Node

Package

repository

vCenter

Customizing your Hadoop/HBase cluster with Serengeti

Choice of distros

Storage configuration

• Choice of shared storage or Local disk

Resource configuration

High availability option

# of nodes

… "distro":"apache", "groups":[ { "name":"master", "roles":[ "hadoop_namenode", "hadoop_jobtracker”], "storage": {

"type": "SHARED",

"sizeGB": 20}, "instance_type":MEDIUM, "instance_num":1, "ha":true}, {"name":"worker", "roles":[ "hadoop_datanode", "hadoop_tasktracker" ], "instance_type":SMALL, "instance_num":5, "ha":false …

One command to scale out your cluster with Serengeti

>cluster resize –name <clustername> --nodegroup worker –instanceNum <#>

Configure/reconfigure Hadoop with ease by Serengeti

Modify Hadoop cluster configuration from Serengeti

• Use the “configuration” section of the json spec file

• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml,

hadoop-env.sh, log4j.properties

• Apply new Hadoop configuration using the edited spec file

"configuration": {

"hadoop": {

"core-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html

},

"hdfs-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html

},

"mapred-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html

"io.sort.mb": "300"

} ,

"hadoop-env.sh": {

// "HADOOP_HEAPSIZE": "",

// "HADOOP_NAMENODE_OPTS": "",

// "HADOOP_DATANODE_OPTS": "",

…

> cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

Freedom of Choice and Open Source

Community Projects Distributions

• Flexibility to choose from major distributions

• Support for multiple projects

• Open architecture to welcome industry participation

• Contributing Hadoop Virtualization Extensions (HVE) to open

source community

cluster create --name myHadoop --distro apache

HDFS2 with Namenode Federation and HA

Deploy CDH4 Hadoop cluster

• Name Node Federation

• Name Node HA

• MapReduce v1

• HBase, Pig, Hive, and Hive Server

CDH4 configurations

Scale out

Elasticity

JobTracker HA/FT

Namenode Group 1

Active Namenode Standby Namenode

Namenode Group 2

Active Namenode Standby Namenode

Zookeeper Group

ZK ZK ZK

Coordinate Coordinate

Quorum-based metadata store

Data Nodes

Datanode Datanode Datanode Datanode Datanode Datanode Datanode Datanode

Block report Block report

Proactive monitoring and tuning with VCOPs

Proactively monitoring through VCOPs

Gain comprehensive visibility

Eliminate manual processes with intelligent automation

Proactively manage operations

Agenda







• Summary

• Q & A

VMWare brings Agility, Efficiency, and Elasticity to Big Data

Enable full elasticity

through separation of

Data and Compute

Scale In/Out Hadoop

with Resource

Constrain

Elasticity

Deploy, configure and

monitor Hadoop

clusters on the fly

Dynamic reconfiguring

of Hadoop to meet

changing business

demands

Agility

Consolidate Hadoop

to achieve higher

utilization

Pool resources to

allow for increased

performance and

priority job processing

Efficiency

Serengeti Resources

Download and try Serengeti

• projectserengeti.org

VMware Hadoop site

• vmware.com/hadoop

Hadoop performance on vSphere

• http://www.vmware.com/files/pdf/techpa

per/hadoop-vsphere51-32hosts.pdf

Hadoop High Availability solution

• vmware.com/files/pdf/Apache-Hadoop-

VMware-HA-solution.pdf

projectserengeti.org

vmware.com/hadoop

vmware.com/hadoop

http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf






http://www.vmware.com/files/pdf/Apache-Hadoop-VMware-HA-solution.pdf











Serengeti - 虚拟化你的大数据应用

Documents

Transcript of Serengeti - 虚拟化你的大数据应用