Serengeti - Virtualize Your Big Data Applications



Serengeti - Virtualize Your Big Data Applications

蔺永华

VMware, Inc.

Agenda

• Today’s big data system

• Why virtualize Hadoop?

• Serengeti introduction

• Common questions about virtualization

• Serengeti solution

• Deep insight into Serengeti

• Summary

• Q & A

Today’s Big Data System:

(Diagram of today’s big data system: ETL, real-time streams, unstructured data (HDFS), structured databases, Big SQL, data-parallel batch processing, real-time processing (S4, Storm), and analytics.)


Challenges of Using Hadoop on Physical Infrastructure

Deployment

• Difficult to deploy; takes several people several days or even months

• Difficult to tune cluster performance

Low Efficiency

• Hadoop clusters are typically not 100% utilized across all hardware resources

• Difficult to share resources safely between different workloads

Single Point of Failure

• Single point of failure for the NameNode and JobTracker

• No HA for Hive, HCatalog, etc.

Why Virtualize Hadoop? - Get your Hadoop cluster in minutes

Deploying on physical hardware is a manual process that takes days:

• Server preparation

• OS installation

• Network configuration

• Hadoop installation and configuration

Serengeti automates this on vSphere with best practices built in:

• Fully automated process

• 10 minutes to a Hadoop/HBase cluster from scratch

• Roughly 1/1000 of the human effort

• Minimal Hadoop operations knowledge required

Why Virtualize Hadoop? - Consolidate sprawling clusters

Single-purpose clusters for various business applications lead to cluster sprawl. Consolidated clusters share servers with strong isolation.

Simplify

• Single hardware infrastructure

• Unified operations

Optimize

• Shared resources = higher utilization

• Elastic resources = faster on-demand access

(Diagram: separate Hadoop Dev, Hadoop Prod, HBase, Finance Hadoop, and Portal Hadoop clusters are consolidated onto a single virtualization platform, cutting CAPEX by 30%.)

Why Virtualize Hadoop? - Utilize all your resources to solve the priority problem

• 50%+ of resources sit idle while a high-priority job is burning up its own cluster.

• Instead, utilize all resources from the shared pool on demand.

• Dynamic elastic scaling on a shared resource pool delivers analytic results 3x faster.

vSphere High Availability (HA) - protection against unplanned downtime

• Protection against host and VM failures

• Automatic failure detection (host, guest OS)

• Automatic virtual machine restart in minutes, on any available host in the cluster

• OS- and application-independent; does not require complex configuration changes

High Availability for the Hadoop Stack

(Diagram of the Hadoop stack and its single points of failure: HDFS (Hadoop Distributed File System) with the NameNode, MapReduce (job scheduling/execution system) with the JobTracker, HBase (key-value store), Pig (data flow), Hive (SQL) with its metastore DB, HCatalog with its metadata DB, ZooKeeper (coordination), the management server, an RDBMS, and BI reporting / ETL tools.)

vSphere Fault Tolerance provides continuous protection

• An identical VM runs in lockstep on a separate host

• Zero-downtime, zero-data-loss failover for all virtual machines in case of hardware failures

• Integrated with VMware HA/DRS

• No complex clustering or specialized hardware required

• Single common mechanism for all applications and operating systems

Result: zero downtime for the NameNode, JobTracker, and other components in Hadoop clusters.


Easy and rapid deployment and management

• Serengeti is an open-source project launched in June 2012; 0.8 was released in April and 0.9 will be released in June

• A toolkit that leverages virtualization to simplify Hadoop deployment and operations

• Deploy a cluster in 10 minutes, fully automated

• Customize Hadoop and HBase clusters

• Automated cluster operations

• Comes with ecosystem components

• Supports all popular Hadoop distributions

Demo: 10 minutes to a Hadoop cluster with Serengeti


Common questions about virtualization

Local disk

• Can local disk be used in a virtualized environment?

Flexibility and scalability

• How can resources be flexibly scheduled between clusters and between the different applications mentioned above?

Data reliability

• In a virtual environment, how can we distribute data across hosts and racks?

Data locality

• Hadoop schedules compute tasks near the data to reduce network I/O for reads and writes. Can a virtual environment achieve the same result?

Performance

• How does Hadoop perform in a virtual environment?


Can I use local disk easily?

Serengeti extends the virtual storage architecture to include local disk.

Shared storage: SAN or NAS

• Easy to provision

• Automated cluster rebalancing

Hybrid storage

• SAN for boot images and other workloads

• Local disk for Hadoop and HDFS

(Diagram: each host runs Hadoop VMs alongside other VMs, with the Hadoop VMs using local disk.)
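As a rough illustration of how local disk is requested, the cluster spec (shown in full later in this deck) lets each node group declare its storage type; the "LOCAL" value and the size below are assumptions sketched by analogy to the "SHARED" example that appears later:

    {
      "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
      "storage": {
        // "LOCAL" is assumed as the counterpart of the "SHARED" type shown later
        "type": "LOCAL",
        "sizeGB": 50
      },
      "instance_type": "SMALL",
      "instance_num": 5
    }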

How to flexibly scale in and scale out?

How can resources be flexibly scheduled between clusters and different applications?

Evolution of Hadoop on VMs – data/compute separation

Current Hadoop: combined storage/compute in each VM

• VM lifecycle determined by the DataNode

• Limited elasticity

Separate storage

• Separate compute from data

• Removes the elasticity constraint imposed by the DataNode

• Elastic compute

• Raises utilization

Separate compute clusters

• Separate virtual compute clusters

• One compute cluster per tenant

• Stronger VM-grade security and resource isolation

(A cluster-spec sketch for data/compute separation follows below.)
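To make the data/compute split concrete, here is a minimal, hypothetical node-group sketch in the same spec format used later in this deck; the group names "data" and "compute", the instance types, and the counts are illustrative assumptions, while the roles come from the spec example shown later:

    "groups": [
      {
        // data-only group: holds HDFS blocks, stays at a fixed size
        "name": "data",
        "roles": [ "hadoop_datanode" ],
        "instance_type": "MEDIUM",
        "instance_num": 4
      },
      {
        // compute-only group: grows or shrinks with demand
        "name": "compute",
        "roles": [ "hadoop_tasktracker" ],
        "instance_type": "SMALL",
        "instance_num": 8
      }
    ]

Only the compute group needs to scale with workload; the DataNode VMs stay in place, so HDFS data never has to be re-replicated.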

Serengeti node scale out / scale in

(Diagram: the NameNode and JobTracker run in their own VMs; each host carries a data-node VM (D) and several compute-node VMs (C) as slave nodes. Scaling out or in adds or removes compute-node VMs across the hosts.)

Serengeti ballooning enhancement for Java applications

(Diagram: JVMs running inside guest OSes on each host, illustrating how memory ballooning interacts with Java heaps.)

How do we keep data reliable?

How can data be accessed locally if the data node and compute node are located in different VMs?

Distributed and data/compute-associated VM placement

(Diagram: in a combined data node + task tracker cluster, master and worker VMs are spread across hosts in Rack 1 and Rack 2. In a data/compute-separated cluster, the HDFS cluster and one or more compute-only clusters are placed so that each host carries a data-node VM together with the task-tracker VMs that use it, again spread across racks.)

Hadoop Topology Awareness – Serengeti HVE

Hadoop topology changes for virtualization:

(Diagram: the standard topology tree runs / → data center (D1, D2) → rack (R1–R4) → host (H1–H12), with replicas 1, 2, 3 spread across hosts. HVE inserts a node group layer between rack and host, / → data center → rack → node group (N1–N8) → host, so that replica placement and task scheduling know which VMs share a physical host.)

Hadoop Virtualization Extensions for Topology

(Diagram: HVE layers on top of Hadoop Common, HDFS, and MapReduce, extending the network topology, replica placement policy, replica choosing policy, replica removal policy, balancer policy, and task scheduling policy. Tracked under HADOOP-8468 (umbrella JIRA) together with HADOOP-8469, HADOOP-8470, HADOOP-8472, HDFS-3495, HDFS-3498, MAPREDUCE-4309, and MAPREDUCE-4310.)
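For reference, enabling HVE on a cluster boils down to a handful of Hadoop settings. The keys below follow my reading of the HVE documentation and should be verified against your distribution; they are expressed in the Serengeti "configuration" format shown later in this deck:

    "configuration": {
      "hadoop": {
        "core-site.xml": {
          // assumed HVE keys: make the topology node-group aware
          "net.topology.nodegroup.aware": "true",
          "net.topology.impl": "org.apache.hadoop.net.NetworkTopologyWithNodeGroup"
        },
        "hdfs-site.xml": {
          // assumed HVE key: node-group-aware replica placement
          "dfs.block.replicator.classname": "org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup"
        },
        "mapred-site.xml": {
          // assumed HVE key: node-group-aware task scheduling
          "mapred.jobtracker.nodegroup.aware": "true"
        }
      }
    }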

Is there significant performance degradation in a virtualized environment? Is there any performance data?

Virtualized Hadoop performance: native versus virtual platforms, 32 hosts, 16 disks/host

Source: http://www.vmware.com/resources/techresources/10360


Serengeti architecture diagram

(Diagram: a CLI client (Spring Shell) and a UI client (Flex UI) call the Serengeti web service through its REST API. The web service persists cluster metadata through Hibernate/DAO into a vPostgres meta DB and runs a Spring Batch workflow whose steps include updating the meta DB, VM placement calculation, VM provisioning via a VC (vCenter) adapter, a software management step backed by the Ironfan service, and a VHM step connected through RabbitMQ to the VM runtime manager. Ironfan reports progress and works against a Chef server (Thrift service, REST API, cookbooks). On the virtualization platform, each Hadoop node VM runs a Chef client and the HA kit and installs software from a package repository; the hosts are managed by vCenter.)

Customizing your Hadoop/HBase cluster with Serengeti

• Choice of distros

• Storage configuration: shared storage or local disk

• Resource configuration

• High-availability option

• Number of nodes

Example spec (abridged):

    …
    "distro": "apache",
    "groups": [
      {
        "name": "master",
        "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
        "storage": {
          "type": "SHARED",
          "sizeGB": 20
        },
        "instance_type": "MEDIUM",
        "instance_num": 1,
        "ha": true
      },
      {
        "name": "worker",
        "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
        "instance_type": "SMALL",
        "instance_num": 5,
        "ha": false
    …

One command to scale out your cluster with Serengeti

> cluster resize --name <clustername> --nodegroup worker --instanceNum <#>
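For example, to grow the worker group of the myHadoop cluster used elsewhere in this deck to 10 nodes (the target count is illustrative):

> cluster resize --name myHadoop --nodegroup worker --instanceNum 10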

Configure/reconfigure Hadoop with ease using Serengeti

Modify the Hadoop cluster configuration from Serengeti:

• Use the "configuration" section of the JSON spec file

• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, and log4j.properties

• Apply the new Hadoop configuration using the edited spec file

    "configuration": {
      "hadoop": {
        "core-site.xml": {
          // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
        },
        "hdfs-site.xml": {
          // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
        },
        "mapred-site.xml": {
          // check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
          "io.sort.mb": "300"
        },
        "hadoop-env.sh": {
          // "HADOOP_HEAPSIZE": "",
          // "HADOOP_NAMENODE_OPTS": "",
          // "HADOOP_DATANODE_OPTS": "",
    …

> cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

Freedom of Choice and Open Source

Community projects and distributions:

• Flexibility to choose from major distributions

• Support for multiple projects

• Open architecture to welcome industry participation

• Contributing Hadoop Virtualization Extensions (HVE) to the open-source community

> cluster create --name myHadoop --distro apache

HDFS2 with NameNode Federation and HA

Deploy a CDH4 Hadoop cluster with:

• NameNode federation

• NameNode HA

• MapReduce v1

• HBase, Pig, Hive, and Hive Server

• CDH4 configurations

• Scale out

• Elasticity

• JobTracker HA/FT

(Diagram: two NameNode groups, each with an active and a standby NameNode, coordinated through a ZooKeeper group (three ZK nodes) and a quorum-based metadata store; the data nodes send block reports to both groups.)

Proactive monitoring and tuning with VCOPs (vCenter Operations)

• Proactive monitoring through VCOPs

• Gain comprehensive visibility

• Eliminate manual processes with intelligent automation

• Proactively manage operations


VMware brings agility, efficiency, and elasticity to Big Data

Elasticity

• Enable full elasticity through separation of data and compute

• Scale Hadoop in and out under resource constraints

Agility

• Deploy, configure, and monitor Hadoop clusters on the fly

• Dynamically reconfigure Hadoop to meet changing business demands

Efficiency

• Consolidate Hadoop to achieve higher utilization

• Pool resources to allow for increased performance and priority job processing

Serengeti Resources

Download and try Serengeti

• projectserengeti.org

VMware Hadoop site

• vmware.com/hadoop

Hadoop performance on vSphere

• http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf

Hadoop High Availability solution

• vmware.com/files/pdf/Apache-Hadoop-VMware-HA-solution.pdf