Serengeti - 虚拟化你的大数据应用

41
© 2009 VMware Inc. All rights reserved Serengeti - 化你的大数据Vmware, Inc.

Transcript of Serengeti - 虚拟化你的大数据应用

Page 1: Serengeti - 虚拟化你的大数据应用

© 2009 VMware Inc. All rights reserved

Serengeti - 虚拟化你的大数据应用

蔺永华

Vmware, Inc.

Page 2: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 3: Serengeti - 虚拟化你的大数据应用

Today’s Big Data System:

ETL

Real Time

Streams

Unstructured Data (HDFS)

Real Time

Structured

Database

Big SQL

Data Parallel Batch

Processing

Real-Time

Processing (s4, storm)

Analytics

Page 4: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 5: Serengeti - 虚拟化你的大数据应用

Challenges To Use Hadoop in physical infrastructure

Deployment

• Difficult to deploy, cost several people for several days even months

• Difficult to tune cluster performance

Low Efficiency

• Hadoop clusters are typically not 100% utilized across all hardware resources.

• Difficult to share resources safely between different workload

Single Point of Failure

• Single point of failure for Name Node and Job tracker

• No HA for Hive, HCatalog, etc.

Page 6: Serengeti - 虚拟化你的大数据应用

Why Virtualize Hadoop? - Get your Hadoop cluster in minutes

Hadoop Installation and Configuration

Network Configuration

OS installation

Server preparation

Manual process, cost days

Fully automated process,

10 minutes to get a

Hadoop/HBase cluster from

scratch

1/1000 human efforts,

Least Hadoop operation knowledge

Automate by Serengeti on

vSphere with best practice

Page 7: Serengeti - 虚拟化你的大数据应用

Why Virtualize Hadoop? - Consolidate sprawling clusters

Single purpose clusters for various

business applications lead to cluster

sprawl.

Clusters share

servers with strong isolation

Simplify

• Single Hardware Infrastructure

• Unified operations

Optimize

• Shared Resources = higher utilization

• Elastic resources = faster on-demand access

Hadoop Dev

Hadoop Prod

HBase

Cluster Sprawling

Cluster Consolidation

Finance Hadoop

Virtualization Platform

Hadoop Dev

Hadoop Prod

HBase ... Portal

Hadoop

Portal Hadoop

30% CAPEX Down

Page 8: Serengeti - 虚拟化你的大数据应用

Why Virtualize Hadoop? –

Utilize all your resources to solve the priority problem

50%+ resources are sitting

idle while high priority job is

burning up its cluster.

Utilize all resources from

pool on demand.

Dynamic elastic

scaling on shared resource pool

3X faster to get analytic results

Page 9: Serengeti - 虚拟化你的大数据应用

vSphere High Availability (HA) - protection against unplanned downtime

• Protection against host and VM failures

• Automatic failure detection (host, guest OS)

• Automatic virtual machine restart in minutes, on any available host in cluster

• OS and application-independent, does not require complex configuration

changes

Overview

Page 10: Serengeti - 虚拟化你的大数据应用

High Availability for the Hadoop Stack

HDFS

(Hadoop Distributed File System)

HBase (Key-Value store)

MapReduce (Job Scheduling/Execution System)

Pig (Data Flow) Hive (SQL)

BI Reporting ETL Tools

Managem

ent

Serv

er

Zookeepr

(Coord

ination) HCatalog

RDBMS

Namenode

Jobtracker

Hive MetaDB

Hcatalog MDB

Server

Page 11: Serengeti - 虚拟化你的大数据应用

vSphere Fault Tolerance provides continuous protection

App

OS

App

OS

App

OS X X App

OS

App

OS

App

OS

App

OS

X

VMware ESX VMware ESX

• Single identical VMs running in

lockstep on separate hosts

• Zero downtime, zero data loss

failover for all virtual machines in

case of hardware failures

• Integrated with VMware HA/DRS

• No complex clustering or

specialized hardware required

• Single common mechanism for all

applications and operating

systems

FT HA HA

Overview

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

Page 12: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 13: Serengeti - 虚拟化你的大数据应用

Easy and rapid deployment and management

Open source project launched in June 2012, 0.8 is released at Apr.

and will release 0.9 at Jun.

Toolkit that leverage virtualization to simplify Hadoop deployment

and operations

Deploy a cluster in 10 Minutes fully automated

Customize Hadoop and HBase cluster

Automated cluster operation

Come with eco-system components

Support all popular Hadoop Distributions

Serengeti

Page 14: Serengeti - 虚拟化你的大数据应用

Demo: 10 minutes to a Hadoop cluster with Serengeti

Page 15: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 16: Serengeti - 虚拟化你的大数据应用

Common questions about virtualization

Local Disk

• Can local disk be used in virtualization environment?

Flexibility and Scalability

• How to flexible schedule resources between clusters and different

applications as mentioned above?

Data stability

• In virtual environment, how can we distribute data across host and rack?

Data locality

• Hadoop will schedule compute tasks near by the data, to reduce network

IO for data R/W. Can virtual environment get the same result?

Performance

• How about the performance in virtual environment?

Page 17: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 18: Serengeti - 虚拟化你的大数据应用

Can I use local disk easily?

Page 19: Serengeti - 虚拟化你的大数据应用

Serengeti Extend Virtual Storage Architecture to Include Local Disk

Shared Storage: SAN or NAS

• Easy to provision

• Automated cluster rebalancing

Hybrid Storage

• SAN for boot images, other

workloads

• Local disk for Hadoop & HDFS

Host

Ha

do

op

Oth

er

VM

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Oth

er

VM

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Host

Ha

do

op

Ha

do

op

Oth

er

VM

Page 20: Serengeti - 虚拟化你的大数据应用

How to flexible scale in/scale out

How to flexible schedule resources between clusters and

different applications?

Page 21: Serengeti - 虚拟化你的大数据应用

Storage

Evolution of Hadoop on VMs – Data/Compute separation

Compute Current Hadoop: Combined Storage/Compute

Storage

T1 T2

VM VM VM

VM VM

VM

Hadoop in VM

- * VM lifecycle determined by Datanode

- * Limited elasticity

Separate Storage

- * Separate compute from data

- * Remove elastic constrain

- by Datanode

- * Elastic compute

- * Raise utilization

Separate Compute Clusters

- * Separate virtual compute

- * Compute cluster per tenant

- * Stronger VM-grade security and resource isolation

Slave Node

Page 22: Serengeti - 虚拟化你的大数据应用

Serengeti Node Scale Out / Scale In

Host

NameNode

Host

D

C JobTracker C

C C

Host

D

C C

C C

Host

D

C C

C C

Host

D

C C

C C

Page 23: Serengeti - 虚拟化你的大数据应用

Serengeti Ballooning Enhancement for Java Application

JVM

Guest OS

Host JVM

Guest OS

Guest OS

Host JVM

Page 24: Serengeti - 虚拟化你的大数据应用

How to keep data stability?

How to access data locally if data node and compute node

are located in different VM?

Page 25: Serengeti - 虚拟化你的大数据应用

Distributed and Data/Compute Associated VM Placement

Data node and task tracker combined cluster Data Compute separated cluster

Host

master

Host

worker

Host

worker

Host

master

Host

Data node

Task tracker

Host

Data node

Task tracker

Host

Name node

Job tracker

Host

Data node

Task tracker

Host

Data node

Task tracker

Job tracker Task tracker Task tracker

HDFS cluster

Compute only cluster1

Compute only cluster2

Compute Only cluster

Rack 1 Rack 2 Rack 1 Rack 2

Rack 1 Rack 2

Page 26: Serengeti - 虚拟化你的大数据应用

Hadoop Topology Awareness – Serengeti HVE

Hadoop Topology Changes

for Virtualization

/

D1 D2

R1 R2

N1

H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12

R3 R4

1 2 3

/

D1 D2

R1 R2

H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12

R3 R4

1 2 3

N2 N3 N4 N5 N6 N7 N8

1 3 2

1 2 3 4

Page 27: Serengeti - 虚拟化你的大数据应用

Hadoop Virtualization Extensions for Topology

HADOOP-8468 (Umbrella JIRA)

HADOOP-8469

HDFS-3495

HDFS-3498

Hadoop

HVE

Task Scheduling Policy Extension

Balancer Policy Extension

Replica Choosing Policy Extension

Replica Placement Policy Extension

Network Topology Extension

Replica Removal Policy Extension

HDFS MapReduce

Hadoop Common

MAPREDUCE-4310

MAPREDUCE-4309

HADOOP-8470

HADOOP-8472

Page 28: Serengeti - 虚拟化你的大数据应用

Is there significant performance degradation in virtualization

environment?

Is there any performance data?

Page 29: Serengeti - 虚拟化你的大数据应用

Virtualized Hadoop Performance

Page 30: Serengeti - 虚拟化你的大数据应用

Native versus Virtual Platforms, 32 hosts, 16 disks/host

Source: http://www.vmware.com/resources/techresources/10360

Page 31: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 32: Serengeti - 虚拟化你的大数据应用

Serengeti architecture diagram

CLI Client

Spring Shell

UI Client

Flex UI

Serengeti

Web

Service

Hibernate/

DAO

Spring Batch

Rest API

Update

Meta DB

step

vPostgres

VM

Placement

calculation

VC adapter

VM

Provision

step

Sof tware

Mgmt

step

Ironfan

service

Thrift Service

Ironfan Progress

report

Chef

server

Rest API

Cookbook

VHM

step RabbitMQ

VM runtime

Manager

Host Host Host Host Host

Virtualization Platform

Hadoop

Node

Chef Client

HA kit

Hadoop

Node

Hadoop

Node

Package

repository

vCenter

Page 33: Serengeti - 虚拟化你的大数据应用

Customizing your Hadoop/HBase cluster with Serengeti

Choice of distros

Storage configuration

• Choice of shared storage or Local disk

Resource configuration

High availability option

# of nodes

… "distro":"apache", "groups":[ { "name":"master", "roles":[ "hadoop_namenode", "hadoop_jobtracker”], "storage": {

"type": "SHARED",

"sizeGB": 20}, "instance_type":MEDIUM, "instance_num":1, "ha":true}, {"name":"worker", "roles":[ "hadoop_datanode", "hadoop_tasktracker" ], "instance_type":SMALL, "instance_num":5, "ha":false …

Page 34: Serengeti - 虚拟化你的大数据应用

One command to scale out your cluster with Serengeti

>cluster resize –name <clustername> --nodegroup worker –instanceNum <#>

Page 35: Serengeti - 虚拟化你的大数据应用

Configure/reconfigure Hadoop with ease by Serengeti

Modify Hadoop cluster configuration from Serengeti

• Use the “configuration” section of the json spec file

• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml,

hadoop-env.sh, log4j.properties

• Apply new Hadoop configuration using the edited spec file

"configuration": {

"hadoop": {

"core-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html

},

"hdfs-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html

},

"mapred-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html

"io.sort.mb": "300"

} ,

"hadoop-env.sh": {

// "HADOOP_HEAPSIZE": "",

// "HADOOP_NAMENODE_OPTS": "",

// "HADOOP_DATANODE_OPTS": "",

> cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

Page 36: Serengeti - 虚拟化你的大数据应用

Freedom of Choice and Open Source

Community Projects Distributions

• Flexibility to choose from major distributions

• Support for multiple projects

• Open architecture to welcome industry participation

• Contributing Hadoop Virtualization Extensions (HVE) to open

source community

cluster create --name myHadoop --distro apache

Page 37: Serengeti - 虚拟化你的大数据应用

HDFS2 with Namenode Federation and HA

Deploy CDH4 Hadoop cluster

• Name Node Federation

• Name Node HA

• MapReduce v1

• HBase, Pig, Hive, and Hive Server

CDH4 configurations

Scale out

Elasticity

JobTracker HA/FT

Namenode Group 1

Active Namenode Standby Namenode

Namenode Group 2

Active Namenode Standby Namenode

Zookeeper Group

ZK ZK ZK

Coordinate Coordinate

Quorum-based metadata store

Data Nodes

Datanode Datanode Datanode Datanode Datanode Datanode Datanode Datanode

Block report Block report

Page 38: Serengeti - 虚拟化你的大数据应用

Proactive monitoring and tuning with VCOPs

Proactively monitoring through VCOPs

Gain comprehensive visibility

Eliminate manual processes with intelligent automation

Proactively manage operations

Page 39: Serengeti - 虚拟化你的大数据应用

Agenda

• Today’s big data system

• Why virtualize hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Page 40: Serengeti - 虚拟化你的大数据应用

VMWare brings Agility, Efficiency, and Elasticity to Big Data

Enable full elasticity

through separation of

Data and Compute

Scale In/Out Hadoop

with Resource

Constrain

Elasticity

Deploy, configure and

monitor Hadoop

clusters on the fly

Dynamic reconfiguring

of Hadoop to meet

changing business

demands

Agility

Consolidate Hadoop

to achieve higher

utilization

Pool resources to

allow for increased

performance and

priority job processing

Efficiency

Page 41: Serengeti - 虚拟化你的大数据应用

Serengeti Resources

Download and try Serengeti

• projectserengeti.org

VMware Hadoop site

• vmware.com/hadoop

Hadoop performance on vSphere

• http://www.vmware.com/files/pdf/techpa

per/hadoop-vsphere51-32hosts.pdf

Hadoop High Availability solution

• vmware.com/files/pdf/Apache-Hadoop-

VMware-HA-solution.pdf