20150704 benchmark and user experience in sahara weiting

Benchmarking and User Experience in Sahara

Weiting Chen

[email protected]

July 04 2015


LEGAL DISCLAIMERS

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

© 2015 Intel Corporation.

AGENDA

o Our Background
o Why Sahara
o Deployment Consideration
o Customer Experience
o The Future of Sahara

BACKGROUND

WHO WE ARE…

ABOUT OUR TEAM

Exploring new opportunities in Big Data-as-a-Service (BDaaS)
o Researching possible BDaaS solutions
o Making BDaaS work better within IT infrastructure
o Moving the future of BDaaS forward

Focusing on Sahara in OpenStack
o Bringing CDH into Sahara
o Creating more features in Sahara
o Ranked #1 in LOC and #3 in commits for Sahara contributions

WHY SAHARA?

FROM THE CUSTOMER NEEDS

o You or someone at your company is using a public Big Data application service such as AWS EMR. You need Sahara to migrate your Big Data applications to your private cloud.

o You have multiple Hadoop clusters in your environment and you would like to integrate them for better infrastructure utilization. You need Sahara to virtualize Hadoop into your cloud infrastructure.

o You have been using OpenStack as your IT cloud infrastructure for many years, and a Hadoop cluster is also running in your IT environment. You must use Sahara to bring them together as a unified IT environment that is easier to maintain.

BETTER USER EXPERIENCE MEANS…

Data Scientists/Analysts
o An elastic way to run big data applications

Developers
o A custom big data infrastructure tailored to different needs

Administrators/Operators
o A better way to maintain both the hardware platform and the software packages

Company
o Cost, cost, cost

Source: OpenStack Vancouver Design Summit, "Benchmarking Sahara-based as a Service solution" by Red Hat & Intel

A COMPLEX BIG DATA SOLUTION

[Diagram: three stages of a big data solution]
o Structured, Unstructured Data: different types of data sources
o Big Data Solution: complexity in organizing data (ETL); components shown include Pig and ZooKeeper
o BI Report: diverse BI reports

Deployment Consideration

SAHARA ARCHITECTURE

SAHARA DATA PROCESSING PATTERN

Pattern 1: Internal HDFS

OpenStack can create HDFS on Cinder volumes or on ephemeral disk. Ephemeral disk gives better data processing performance; Cinder persists the data, at lower performance.

Pros: Performance can be extremely fast (depending on the storage backend).

Cons: Data persistence may be a problem, because the data follows the lifetime of the virtual cluster.

[Diagram: Collect Application -> Collecting Data -> one OpenStack instance running both the Node Manager and the Data Node]


Pattern 2: External HDFS

You can also deploy the Node Manager and the HDFS Data Node to two different instances. This gives you more elasticity in managing your instances, for example when you want to save compute power by turning off the Node Manager instance.

Pros: Performance may be the same as Pattern 1, but you gain more flexibility to control your instances, save power, and persist your data in the Data Node.

Cons: A long-running cluster may still need another way to persist data.

[Diagram: Collect Application -> Collecting Data -> OpenStack with Instance 1 (Node Manager) and Instance 2 (Data Node)]


Pattern 3: Swift

With Swift, data can be streamed from storage to Hadoop directly. This stores your data externally and solves the data persistence problem. Swift currently also supports a data locality feature.

Pros: Streams data directly and integrates with your existing Swift infrastructure.

Cons: Performance could be an issue compared with the HDFS-based patterns.

[Diagram: Collect Application -> Collecting Data -> Swift -> Streaming Data -> OpenStack instance running the Node Manager]
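The trade-offs across the three patterns can be captured in a small lookup. This is only an illustrative sketch: the `Pattern` class, its two fields, and `pick_patterns` are hypothetical names, and the ratings simply restate the pros and cons above (with Pattern 1 rated for the ephemeral-disk case) rather than any measured result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pattern:
    name: str
    data_outlives_cluster: bool   # data persists after the virtual cluster is gone
    hdfs_level_performance: bool  # no slowdown relative to HDFS noted on the slides


# Ratings restate the slides' pros and cons; they are not benchmark results.
PATTERNS = [
    Pattern("Internal HDFS", data_outlives_cluster=False, hdfs_level_performance=True),
    Pattern("External HDFS", data_outlives_cluster=True, hdfs_level_performance=True),
    Pattern("Swift", data_outlives_cluster=True, hdfs_level_performance=False),
]


def pick_patterns(need_persistence: bool, need_hdfs_performance: bool) -> List[str]:
    """Return the names of the patterns that satisfy the stated requirements."""
    return [
        p.name
        for p in PATTERNS
        if (p.data_outlives_cluster or not need_persistence)
        and (p.hdfs_level_performance or not need_hdfs_performance)
    ]


if __name__ == "__main__":
    print(pick_patterns(need_persistence=True, need_hdfs_performance=False))
```

A real decision would also weigh cost and the existing storage infrastructure, which this sketch leaves out.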

Deployment Consideration

Cluster Deployment
o Service Deployment

Compute Engine Choice
o Bare metal, KVM, Docker, Hyper-V, vSphere, Xen

Storage Architecture
o Ephemeral Disk
o Persistent Volume
o Performance
o Cost
o Current IT Infrastructure

[Diagram: deployment layers on a host running multiple instances]
o Cluster Deployment: a Node Manager and Data on each instance
o Compute Engine: Bare Metal, KVM, Container
o Storage Infrastructure: Ephemeral, Block Storage, Object Storage

Customer Experience

Issue1 - Provisioning a Cluster Takes a Long Time

Problem Description:
o 10000+ jobs per day across several different workloads (some jobs run in seconds, some run for hours)
o It is hard to classify a job as small or large; it depends not only on data size but also on the job logic
o Provisioning a cluster takes longer than running a small job that finishes in seconds; for example, launching a 4-node cluster takes 10+ minutes

Customer’s Feedback:
o Finish jobs on time, without having to worry about provisioning a cluster

Possible Solutions/Alternatives:
o Run jobs in an existing cluster (depends on the case)
o Run jobs in a public cluster using Resource ACL (to be supported in Liberty)
o Reduce the time needed to provision a cluster -> plugin specific
o Docker can save time when launching an instance, but the services still need time to start
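To see why provisioning dominates short jobs, the sketch below computes the share of wall-clock time spent waiting for the cluster. The 10-minute provisioning time comes from the slide; the two job lengths and the function name `provisioning_share` are hypothetical, chosen only for illustration.

```python
def provisioning_share(provision_seconds: float, job_seconds: float) -> float:
    """Fraction of total wall-clock time spent provisioning the cluster."""
    if provision_seconds < 0 or job_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = provision_seconds + job_seconds
    return provision_seconds / total if total else 0.0


PROVISION_SECONDS = 10 * 60  # "10+ mins" for a 4-node cluster, from the slide

# Hypothetical job lengths: one job that runs in seconds, one that runs for hours.
for label, job_seconds in [("30 s job", 30), ("2 h job", 2 * 60 * 60)]:
    share = provisioning_share(PROVISION_SECONDS, job_seconds)
    print(f"{label}: {share:.0%} of wall-clock time is provisioning")
```

Under these assumed durations, provisioning accounts for almost all of the short job's elapsed time and only a small part of the long job's, which is why reusing an existing cluster matters most for jobs that run in seconds.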

Docker brings better boot time

10X boot time difference between Docker and KVM

Docker also has an advantage when the instance is idle

[Chart] Docker: Compute Node CPU (full test duration)
CPU Usage In Percent (0-80) vs. Time; series: usr, sys
Averages (usr, sys): 0.54, 0.17

[Chart] KVM: Compute Node CPU (full test duration)
CPU Usage In Percent (0-80) vs. Time; series: usr, sys
Averages (usr, sys): 7.64, 1.4

Source: Boden Russell (IBM), "Performance Characteristics of Traditional VMs vs Docker Containers"
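The gap between the two charts is easier to judge as a ratio. The sketch below divides the KVM averages by the Docker averages; it assumes the two numbers under each chart are listed in legend order (usr first, then sys), which the extracted slide does not state explicitly.

```python
# Averages read from the two charts, assumed to be in legend order (usr, sys).
DOCKER = {"usr": 0.54, "sys": 0.17}
KVM = {"usr": 7.64, "sys": 1.4}


def ratio(kvm: float, docker: float) -> float:
    """How many times more CPU the KVM compute node used than the Docker one."""
    return kvm / docker


ratios = {mode: ratio(KVM[mode], DOCKER[mode]) for mode in ("usr", "sys")}
ratios["usr+sys"] = ratio(sum(KVM.values()), sum(DOCKER.values()))

for mode, value in ratios.items():
    print(f"{mode}: KVM / Docker = {value:.1f}x")
```

Both averages are small in absolute terms (under 10% CPU), so the ratio mostly reflects the virtualization overhead on the compute node rather than the workload itself.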

Issue2 - Complex Data Processing

Problem Description:
o A job usually runs multiple sub-jobs in a row (e.g. Job A -> Job B -> Job C), and job scheduling also needs to be supported

Customer’s Feedback:
o Run a complex job to fulfill their use case
o Schedule a job using Sahara EDP
o Run a recurring job

Possible Solutions/Alternatives:
o Currently Sahara EDP only supports running a simple job
o Schedule a job -> BP: https://review.openstack.org/#/c/175719/
o Run a complex job -> Under discussion
o Run a recurring job -> Under discussion

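The "Job A -> Job B -> Job C" requirement amounts to running jobs in order and stopping when one fails. The sketch below shows that control flow only; it does not use the Sahara EDP API, and `run_chain` and the job callables are hypothetical stand-ins for real job submissions.

```python
from typing import Callable, List, Tuple

Job = Tuple[str, Callable[[], bool]]  # (name, callable returning True on success)


def run_chain(jobs: List[Job]) -> List[str]:
    """Run jobs in order and stop at the first failure.

    Returns the names of the jobs that completed successfully.
    """
    completed = []
    for name, job in jobs:
        if not job():
            break  # downstream jobs depend on this one, so do not run them
        completed.append(name)
    return completed


if __name__ == "__main__":
    chain = [
        ("Job A", lambda: True),
        ("Job B", lambda: True),
        ("Job C", lambda: True),
    ]
    print(run_chain(chain))
```

Until complex jobs are supported natively, a client-side loop of this shape is one way a user could chain simple jobs.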

Issue3 - Storage Architecture

Problem Description:
o Currently our customers use a separate Compute Cluster (using Nova) and Storage Cluster (using Swift as object storage for the data store). There is a performance issue when compute and data sit on different nodes, because the data must be transferred over the network.

Customer’s Expectation:
o Find a better solution that fulfills their requirements and integrates with their current storage architecture

Possible Solutions/Alternatives:
o Use Internal HDFS -> Needs a way to copy data from Swift to Internal HDFS
o Use Swift Data Locality Feature -> Must change their storage architecture

Sort Workload Profile

Two phases of disk writes while Sort is running:
o Shuffle Map-Reduce data -> uses the temp folder to store intermediate data (40% of total throughput)
o Write output -> HDFS write (60% of total throughput)

[Chart: disk I/O over time, annotated "Shuffling data using temp folder", "Write output to HDFS/External Storage", and "Disk IO Peak"]
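The 40/60 split suggests where each storage choice carries load. The sketch below distributes a Sort job's disk writes across the storage that backs the temp folder and HDFS. It assumes the throughput percentages can be read as shares of the data written; the function name and the location labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict

SHUFFLE_PCT = 40  # intermediate data written to the Hadoop temp folder
OUTPUT_PCT = 60   # final output written to HDFS


def writes_by_location(total_gb: float, temp_location: str, hdfs_location: str) -> Dict[str, float]:
    """Split a Sort job's disk writes across the storage backing each location."""
    writes: Dict[str, float] = defaultdict(float)
    writes[temp_location] += total_gb * SHUFFLE_PCT / 100
    writes[hdfs_location] += total_gb * OUTPUT_PCT / 100
    return dict(writes)


if __name__ == "__main__":
    # Temp folder on ephemeral disk, HDFS on a Cinder volume.
    print(writes_by_location(100, "ephemeral", "cinder"))
    # Both on the same Cinder volume: it absorbs every write.
    print(writes_by_location(100, "cinder", "cinder"))
```

Placing the temp folder on different storage than HDFS moves the shuffle writes off the HDFS backend, which is why the temp folder location is a storage decision in its own right.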

Storage Consideration

1. Hadoop temp folder location
2. HDFS location
3. Data persistence
4. Integration with the current storage architecture, which usually uses shared storage in the cloud
5. Optimize storage for your workload

Redundancy Issue When Running HDFS over Ceph/GlusterFS

[Diagram: a Compute Cluster of instances (Instance1, Instance2, ….., Instance3), each running HDFS on Cinder volumes backed by a Ceph Cluster. HDFS stores each data block (A, B, C) on three instances, and Ceph replicates each of those copies three times.]

3 (in HDFS) x 3 (in Ceph) = 9 replicas in the Ceph Cluster
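The replica arithmetic generalizes to any pair of replication factors. This is a minimal sketch with hypothetical function names; the defaults of 3 and 3 are the values from the slide.

```python
def raw_copies(hdfs_replication: int = 3, backend_replication: int = 3) -> int:
    """Physical copies kept of each block when both layers replicate."""
    return hdfs_replication * backend_replication


def raw_capacity_tb(logical_tb: float, hdfs_replication: int = 3,
                    backend_replication: int = 3) -> float:
    """Raw backend capacity consumed by a given amount of logical data."""
    return logical_tb * raw_copies(hdfs_replication, backend_replication)


if __name__ == "__main__":
    print(raw_copies())         # copies of each block with the slide's factors
    print(raw_capacity_tb(10))  # raw TB consumed by 10 TB of logical data
```

Because the factors multiply, every terabyte of logical data consumes nine terabytes of raw capacity with the slide's settings.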

Cinder Volume Instance Locality Support in Sahara

[Diagram: two compute hosts managed by Nova. Compute1 runs Instance1, Instance2, ….., Instance3 and Compute2 runs Instance4, Instance5, ….., Instance6, each instance running HDFS. Each host runs its own cinder-volume service: Volume1-Volume3 hold the data on Compute1 and Volume4-Volume6 on Compute2, so every volume sits on the same host as its instance.]
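In Sahara this locality is requested per node group. The sketch below shows what such a node group template body might look like as a Python dict. The field names are recalled from Sahara's node group template API and should be verified against your release; the name, flavor, and sizes are placeholders.

```python
import json

# Node group template body, written as a Python dict.
# Field names are an assumption recalled from Sahara's node group template API;
# check them against the release you run. All values are placeholders.
node_group_template = {
    "name": "worker-local-volumes",      # hypothetical name
    "plugin_name": "vanilla",
    "hadoop_version": "2.6.0",
    "flavor_id": "<flavor-id>",          # placeholder
    "node_processes": ["datanode", "nodemanager"],
    "volumes_per_node": 1,
    "volumes_size": 100,                 # GB
    "volume_local_to_instance": True,    # request volumes on the instance's own host
}

print(json.dumps(node_group_template, indent=2))
```

With the locality flag set, the intent is that HDFS reads and writes to the volume stay on the local host instead of crossing the network.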

Swift Performance Issue

Performance Impact
o Swift overhead comes from the “Rename” method in Hadoop
o The “List Endpoint” feature has a huge impact
o Larger data sizes may widen the performance gap

[Diagram: VMs on a Nova host reading from a separate Swift store vs. VMs running HDFS on the host. Chart labels: 1X, 1.25x overhead, 1.67x overhead]

Rename in Reduce Task

The output of the reduce function is written to a temporary location in HDFS. After the task completes, the output is automatically renamed from its temporary location to its final location.

ANALYSIS

• Object storage does not support rename, so swiftfs implements rename as “copy and delete”.
• HDFS Rename -> Changes metadata in the Name Node
• Swift Rename -> Copies to a new object and deletes the old one in Swift

[Chart labels: local to hdfs, local to swift, swift to swift; 1.5x overhead]
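The difference between the two rename strategies can be shown with two toy in-memory stores. Both classes below are hypothetical models written for this illustration, not HDFS or swiftfs code; `bytes_copied` counts how much data each rename has to move.

```python
class MetadataRenameStore:
    """HDFS-like model: a rename only updates the path-to-block mapping."""

    def __init__(self):
        self.paths = {}   # path -> block id (the "Name Node" metadata)
        self.blocks = {}  # block id -> data
        self.bytes_copied = 0

    def write(self, path: str, data: bytes) -> None:
        block_id = len(self.blocks)
        self.blocks[block_id] = data
        self.paths[path] = block_id

    def rename(self, src: str, dst: str) -> None:
        self.paths[dst] = self.paths.pop(src)  # metadata change only


class CopyDeleteRenameStore:
    """Object-store-like model: a rename is a copy followed by a delete."""

    def __init__(self):
        self.objects = {}  # object name -> data
        self.bytes_copied = 0

    def write(self, path: str, data: bytes) -> None:
        self.objects[path] = data

    def rename(self, src: str, dst: str) -> None:
        data = self.objects[src]
        self.objects[dst] = bytes(data)  # copy to a new object
        self.bytes_copied += len(data)
        del self.objects[src]            # delete the old object


if __name__ == "__main__":
    output = b"x" * 1000  # stand-in for reduce output
    for store in (MetadataRenameStore(), CopyDeleteRenameStore()):
        store.write("_temporary/part-00000", output)
        store.rename("_temporary/part-00000", "part-00000")
        print(type(store).__name__, "bytes copied:", store.bytes_copied)
```

In the copy-and-delete model the cost of a rename grows with the size of the output, which fits the observation that larger data sizes widen the performance gap.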

Issue4 - Scaling a Cluster

Problem Description:
o Customers have found several issues when scaling a cluster, and they would like to ask the community to improve the experience

Customer’s Expectation:
o Rebalance HDFS after scaling
o Auto-scale a cluster on request (e.g. by job size, etc.)

Possible Solutions/Alternatives:
o Rebalance HDFS -> BP: https://blueprints.launchpad.net/sahara/+spec/hdfs-rebalance
o Auto-scaling -> Needs to be discussed

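Why scaling calls for a rebalance can be seen with a simple utilization check. The sketch below borrows the threshold idea used by the HDFS balancer (a node counts as balanced when its utilization is within a threshold of the cluster-wide utilization). The function name and the node figures are hypothetical, and this is not the implementation proposed in the blueprint.

```python
from typing import Dict, List, Tuple


def unbalanced_nodes(nodes: Dict[str, Tuple[float, float]],
                     threshold_pct: float = 10.0) -> List[str]:
    """Return DataNodes whose utilization is too far from the cluster average.

    nodes maps a node name to (used, capacity), both in the same unit.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    if total_capacity == 0:
        return []
    cluster_pct = 100.0 * total_used / total_capacity
    return sorted(
        name
        for name, (used, capacity) in nodes.items()
        if abs(100.0 * used / capacity - cluster_pct) > threshold_pct
    )


if __name__ == "__main__":
    cluster = {"dn1": (80, 100), "dn2": (80, 100), "dn3": (80, 100)}
    print(unbalanced_nodes(cluster))  # before scaling
    cluster["dn4"] = (0, 100)         # scale out with one empty DataNode
    print(unbalanced_nodes(cluster))  # after scaling
```

Adding an empty node lowers the cluster-wide utilization, so the existing nodes as well as the new one fall outside the threshold until data is moved.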

Issue5 - OpenStack Version Support

Problem Description:
o New features are usually supported only in a new release, but customers would like to use them in an older environment
o Some new features cannot be accepted as backports to an older release

Customer’s Expectation:
o Customers would like to use new features from Kilo or later OpenStack versions

Possible Solutions/Alternatives:
o Rolling upgrade from Juno to Kilo
o Use only Sahara and Horizon from Kilo and keep the other OpenStack projects on Juno -> We haven’t tried this
o In the future, plugins will support backward compatibility, so that plugins can be separated from Sahara

The Future of Sahara

What’s New in Kilo

o Vanilla supports Hadoop v1.2.1 and Hadoop 2.6
o Spark Plugin
o Cloudera CDH Plugin
o MapR Plugin
o Storm Plugin
o New Horizon UI with a Guide Panel
o Default Template Support

The Future of Sahara

o Sahara EDP is the focus for processing data flows
o Support more data sources and storage architectures
o Support more Big Data projects
o Integrate with other OpenStack projects
o Bare metal -> Ironic
o Docker -> Magnum
o Application Catalog -> Murano
