Lego Cloud SAP Virtualization Week 2012

Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast; Aidan Shribman, Sr. Researcher, SAP Research Israel. TRND04: The Lego Cloud

Description

This session will demonstrate that by extending KVM we can deliver, non-disruptively, the next level of IaaS platform modularization. We will first show instantaneous live migration of VMs. Then we will introduce the memory aggregation concept, and finally show how to achieve full operational flexibility by disaggregating the datacenter's resources into their core elements.

Transcript of Lego Cloud SAP Virtualization Week 2012

Page 1: Lego Cloud SAP Virtualization Week 2012

Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast

Aidan Shribman Sr. Researcher; SAP Research Israel

TRND04

The Lego Cloud

Page 2: Lego Cloud SAP Virtualization Week 2012


Agenda

Introduction

Hardware Trends

Live Migration

Memory Aggregation

Compute Aggregation

Summary

Page 3: Lego Cloud SAP Virtualization Week 2012

Introduction: The evolution of the datacenter

Page 4: Lego Cloud SAP Virtualization Week 2012

Evolution of Virtualization

No virtualization → Basic consolidation → Flexible resource management (Cloud) → Resource disaggregation (true utility cloud)

Page 5: Lego Cloud SAP Virtualization Week 2012

Why Disaggregate Resources?

Better Performance

Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM)

Many remote devices working in parallel (e.g. DRAM, disk, compute)

Superior Scalability

Going beyond the boundaries of a single node

Improved Economics

Do more with existing hardware

Reach better hardware utilization levels

Page 6: Lego Cloud SAP Virtualization Week 2012

The Hecatonchires Project

Hecatonchires: "Hundred-Headed One"

Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud

Strategic goal: full resource liberation brought to the cloud by:

Providing more resource flexibility to the current cloud paradigm by breaking nodes down into their basic elements (CPU, memory, I/O)

Extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities

Using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 Gigabit or 10 GE)

Initiated by Benoit Hudzia in 2011. Currently developed by two small teams of researchers from the TI Practice, located in Belfast and Ra'anana

Page 7: Lego Cloud SAP Virtualization Week 2012

High Level Architecture

No special hardware required beyond RDMA-enabled NICs, which support the low-overhead, low-latency communication layer

VMs are no longer bounded by host size, as resources such as memory, I/O and compute can be aggregated

Different-sized VMs can share infrastructure, so smaller VMs not requiring dedicated hosts are still supported

Application stack runs unmodified

[Diagram: Servers #1 to #n, each contributing CPUs, memory and I/O; guest VMs (App/OS stacks) span hosts over a fast RDMA communication layer]

Page 8: Lego Cloud SAP Virtualization Week 2012


The Team - Panoramic View

Page 9: Lego Cloud SAP Virtualization Week 2012

Hardware Trends: Are hosts getting closer?

Page 10: Lego Cloud SAP Virtualization Week 2012

CPUs stopped getting faster

Moore's law prevailed until 2003, when core speeds hit a practical limit of about 3.4 GHz

In the datacenter, cores are even slower, running at 2.0 - 2.8 GHz for power-conservation reasons

Since 2000 you do get more cores, but that does not affect compute cycle and compute instruction latencies

Effectively, arbitrary sequential algorithms have not gotten faster since

Source: http://www.intel.com/pressroom/kits/quickrefyr.htm

Page 11: Lego Cloud SAP Virtualization Week 2012

DRAM latency has remained constant

CPU clock speed and memory bandwidth increased steadily (at least until 2000)

But memory latency remained constant, so local memory has gotten slower from the CPU's perspective

Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010

Page 12: Lego Cloud SAP Virtualization Week 2012

Disk latency has virtually not improved

Spindle speed (RPM):   3,600  4,200  4,500  4,900  5,200  5,400  7,200  10,000  12,000  15,000
Average latency (ms):    8.3    7.1    6.7    6.1    5.8    5.6    4.2     3.0     2.5     2.0

The 1980s standard disk ran at 3,600 RPM; the 2010s standard disk runs at 7,200 RPM. A 2x speedup in 30 years is negligible: effectively, disk has become slower from the CPU's perspective.

Panda et al. Supercomputing 2009

Page 13: Lego Cloud SAP Virtualization Week 2012

But: Networks Are Steadily Getting Faster

[Chart: network performance over time, 0 to 70 Gbit/s]

Since 1979 we went from 0.01 Gbit/s up to 64 Gbit/s, a 6,400x speedup

A competitive marketplace:

10 and 40 Gbps Ethernet, which originated from network interconnects

40 Gbps QDR InfiniBand, which originated from computer-internal bus technology

InfiniBand/Ethernet convergence:

Virtual Protocol Interconnects

InfiniBand over Ethernet

RDMA over Converged Enhanced Ethernet

Using standard semantics defined by OFED

Panda et al. Supercomputing 2009

Page 14: Lego Cloud SAP Virtualization Week 2012

And: Communication Stacks Are Becoming Faster

Network stack deficiencies:

Application / OS context switches

Intermediate buffer copies

Transport processing

The RDMA OFED Verbs API provides:

Zero copy

Offloading of transport processing to the NIC using RoCE

Flexibility to use IB, GE or iWARP

Resulting in:

Reduced latency

Processor offloading

Operational flexibility
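A minimal libibverbs sketch of the buffer-registration step behind zero-copy Verbs transfers (queue-pair setup and remote-key exchange are omitted; the 4 KB buffer size is an arbitrary choice for illustration):

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

/* Open the first RDMA device and register a 4 KB buffer for zero-copy
 * transfers. Connection management is omitted for brevity. */
int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *buf = aligned_alloc(4096, 4096);

    /* Registration pins the pages so the NIC can DMA to and from them
     * directly: no kernel copies, no context switches on the data path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```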

Page 15: Lego Cloud SAP Virtualization Week 2012

Benchmarking Modern Interconnects

Intel MPI Benchmarks (IMB), used typically in HPC and parallel computing

Comparing:

4x DDR IB using the Verbs API

10 GE TOE (TCP offload engine) with iWARP

1 GE

Measured latencies:

IB: 2 us

10 GE: 8.23 us

1 GE: 46.52 us

[Charts: broadcast latency; exchange bandwidth]

Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM

Page 16: Lego Cloud SAP Virtualization Week 2012

Conclusion: Remote Nodes Have Gotten Closer

Interconnects have become much faster

Fast interconnects have become a commodity and are moving out of the High Performance Computing (HPC) niche

IB latency of 2,000 ns is only 20x slower than RAM and 100x faster than SSD

Remote page faulting is much faster than traditional disk-backed page swapping!

HANA Performance Analysis, Chaim Bendelac, 2011

Page 17: Lego Cloud SAP Virtualization Week 2012

Result: Blurring of the physical node boundaries

[Chart: access latencies — local disk ~10,000,000 ns; remote disk ~10,000,000 ns; remote memory over IB ~2,000 ns; local DRAM ~100 ns]

Page 18: Lego Cloud SAP Virtualization Week 2012

Live Migration: Pretext to Hecatonchire

Page 19: Lego Cloud SAP Virtualization Week 2012

Enabling Live Migration of SAP Workloads

Business Problem

Typical SAP workloads (e.g. SAP ERP) are transactional, large (possibly 64 GB), and have a fast rate of memory writes. Classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again

Hecatonchire's Solution

Enable live migration by reducing both the number of pages re-sent and the cost of a page re-send

Non-intrusively reducing downtime, service degradation, and total migration time

Page 20: Lego Cloud SAP Virtualization Week 2012

Live Migration Technique

Pre-migration: VM active on host A; destination host selected (block devices mirrored)

Reservation: initialize container on target host

Iterative pre-copy: copy dirty pages in successive rounds

Stop and copy: suspend VM on host A; redirect network traffic; sync remaining state

Commitment: activate on host B; VM state on host A released
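A C-style sketch of the flow above; all helper names are hypothetical, not QEMU's actual functions:

```c
/* Illustrative only: hypothetical helpers, not QEMU's implementation. */
void precopy_migrate(vm_t *vm, host_t *dst)
{
    reserve_container(dst);                        /* reservation      */
    send_all_pages(vm, dst);                       /* first full round */

    /* Iterative pre-copy. For write-heavy workloads (the business
     * problem on the previous slide) this loop may never converge. */
    while (dirty_page_count(vm) > STOP_THRESHOLD)
        send_pages(vm, dst, collect_dirty_pages(vm));

    suspend_vm(vm);                                /* stop and copy    */
    send_pages(vm, dst, collect_dirty_pages(vm));
    send_device_state(vm, dst);
    redirect_network(vm, dst);

    activate_vm(dst);                              /* commitment       */
    release_vm_state(vm);
}
```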

Page 21: Lego Cloud SAP Virtualization Week 2012

Pre-copy live migration

Reducing the number of page re-sends: page LRU reordering, so that pages with a high chance of being re-dirtied soon are delayed until later

Reducing the cost of a page re-send: the XBZRLE delta encoder represents page changes much more efficiently, as the toy encoder below illustrates
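A toy encoder in the spirit of XBZRLE (not QEMU's exact wire format; run lengths are capped at 255 for simplicity):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes identical in both versions of a page XOR to zero, so a lightly
 * dirtied page encodes to a few short (zero-run, data-run) pairs
 * instead of a full page re-send. */
size_t delta_encode(const uint8_t *old, const uint8_t *cur, size_t page,
                    uint8_t *out, size_t out_max)
{
    size_t i = 0, o = 0;
    while (i < page) {
        size_t zrun = 0, drun = 0;
        while (i + zrun < page && zrun < 255 && old[i + zrun] == cur[i + zrun])
            zrun++;                               /* unchanged bytes    */
        i += zrun;
        while (i + drun < page && drun < 255 && old[i + drun] != cur[i + drun])
            drun++;                               /* changed bytes      */
        if (o + 2 + drun > out_max)
            return 0;                             /* delta too big: send raw */
        out[o++] = (uint8_t)zrun;
        out[o++] = (uint8_t)drun;
        for (size_t k = 0; k < drun; k++)
            out[o++] = old[i + k] ^ cur[i + k];   /* XOR delta payload  */
        i += drun;
    }
    return o;                                     /* encoded size       */
}
```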

Page 22: Lego Cloud SAP Virtualization Week 2012

More Than One Way to Live Migrate …

[Diagram: timelines of the three migration styles]

Pre-copy live migration: pre-migrate; reservation; iterative pre-copy (X rounds); stop and copy; commit. The VM is live on A for most of the total migration time, with downtime only during stop-and-copy, then live on B.

Post-copy live migration: pre-migrate; reservation; stop and copy; page pushing (1 round); commit. Downtime is short, but the VM runs degraded on B while its pages are still being pushed, then fully live on B.

Hybrid post-copy live migration: pre-migrate; reservation; iterative pre-copy (X rounds); stop and copy; page pushing (1 round); commit. Combines a pre-copy phase with post-copy page pushing.

Page 23: Lego Cloud SAP Virtualization Week 2012

Post-copy live migration using fast interconnects

In post-copy live migration, the state of the VM is transferred to the destination and activated before memory is transferred

The post-copy implementation includes handling of remote page faults and background transfer of memory pages (sketched below)

Service degradation is mitigated by:

RDMA zero-copy interconnects

Pre-paging, similar in concept to pre-fetching

Hybrid post-copy, which begins with a pre-copy phase

MMU integration, eliminating the need for a VM pause
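A conceptual sketch of the destination side; all helpers are hypothetical, and the real implementation integrates with the kernel MMU rather than polling:

```c
/* Conceptual only: the VM runs on the destination while its memory is
 * still on the source. Helper names are hypothetical. */
void postcopy_destination_loop(vm_t *vm, rdma_conn_t *src)
{
    activate_vm(vm);                    /* VM live before memory arrives */

    while (pages_missing(vm)) {
        fault_t f;
        if (poll_remote_page_fault(vm, &f)) {
            /* demand-fetch the faulting page with a zero-copy RDMA read */
            rdma_read_page(src, f.gfn, guest_page_addr(vm, f.gfn));
            prefetch_neighbours(src, vm, f.gfn);   /* pre-paging */
            resume_vcpu(vm, f.vcpu);
        } else {
            /* background transfer fills in the remaining pages */
            uint64_t gfn = next_unfetched_gfn(vm);
            rdma_read_page(src, gfn, guest_page_addr(vm, gfn));
        }
    }
}
```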

Page 24: Lego Cloud SAP Virtualization Week 2012

Demo

Page 25: Lego Cloud SAP Virtualization Week 2012

Memory Aggregation: In the oven …

Page 26: Lego Cloud SAP Virtualization Week 2012

The Memory Cloud: Turns memory into a distributed memory service

Breaks memory from the bounds of the physical box

Yields double-digit percentage gains in IT economics

Transparent deployment, with performance at scale and reliability

[Diagram: memory, application and storage layers each spanning Server1 to Server3; apps and VMs draw RAM from the distributed memory service]

Page 27: Lego Cloud SAP Virtualization Week 2012

RRAIM: Remote Redundant Array of Inexpensive Memory. Supporting large-memory instances on demand

Business Problem

Current instance memory sizes are constrained by the physical host's memory size (Amazon's biggest VM occupies a whole physical host)

Heavy swap usage slows execution time for data-intensive applications

Hecatonchire Solution

Access remote DRAM via fast zero-copy RDMA interconnects

Hide remote DRAM latency by using page pre-pushing

MMU integration for transparency to applications and VMs

Reliability by using a RAID-1 (mirroring) like scheme, sketched below

Hecatonchire Value Proposition

Provide memory aggregation on demand

Totally transparent to the workload (no integration needed)

No hardware investment! No dedicated servers!

[Diagram: RRAIM solution: the application VM swaps to the memory cloud (RAM; compression / de-duplication / N-tier storage / HR-HA)]
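A sketch of the RAID-1-like write path, with hypothetical helper names: each page pushed to the memory cloud is mirrored to two sponsor hosts, so a single remote failure loses nothing:

```c
/* Illustrative only: hypothetical helpers showing the mirroring idea. */
int rraim_push_page(rraim_t *r, uint64_t gfn, const void *page)
{
    sponsor_t *m1 = primary_sponsor(r, gfn);
    sponsor_t *m2 = secondary_sponsor(r, gfn);

    /* post both zero-copy RDMA writes in parallel ... */
    post_rdma_write(m1, gfn, page);
    post_rdma_write(m2, gfn, page);

    /* ... and only count the page as safe once both complete */
    if (wait_for_completion(m1) || wait_for_completion(m2))
        return rebuild_on_spare(r, gfn, page);  /* degraded: re-mirror */
    return 0;
}
```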

Page 28: Lego Cloud SAP Virtualization Week 2012

Hecatonchire / RRAIM: Breakthrough Capability. Breaking the memory-box barrier for memory-intensive applications

[Chart: capacity (MB to PB) versus access speed (1 μsec to 10 msec): embedded resources (L1/L2 cache), local resources (DRAM, SSD, local disk), networked resources (NAS, SAN); a performance barrier separates local DRAM from network-attached resources]

Page 29: Lego Cloud SAP Virtualization Week 2012

Lego Cloud Architecture (Memory block)

[Diagram: many physical nodes hosting a variety of VMs, tied together by Memory Cloud Management Services and RRAIM; VM memory is drawn from the memory cloud in three configurations: guest memory, combined guest & host memory, and host memory]

Page 30: Lego Cloud SAP Virtualization Week 2012

Instant Flash Cloning On-Demand

Business Problem

Burst load / service usage that cannot be satisfied in time

Existing solutions (vendors: Amazon / VMware / RightScale) start the VM from a disk image, which requires the full VM OS startup sequence

Hecatonchire Solution

Using a paused VM as the source for Copy-on-Write (CoW), we perform a post-copy live migration

Hecatonchire Value Proposition

Just-in-time (sub-second) provisioning

Page 31: Lego Cloud SAP Virtualization Week 2012

Instant Flash Cloning On-Demand

We can clone VMs to meet demand much faster than other solutions

Reducing infrastructure costs while still minimizing lost opportunities: just-in-time provisioning

Requires application integration:

We track OS/application metrics in running VMs or in the Load Balancer (LB)

Alerts are raised when metrics pass a pre-defined threshold

According to alerts we can scale up, adding more resources, or scale down to save on unutilized resources (a toy decision function follows)

Amazon Web Services - Guide
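A toy illustration of the alert logic (not any vendor's API):

```c
#include <stdio.h>

/* Compare a tracked metric against alert thresholds and decide whether
 * to flash-clone another instance or retire one. */
typedef enum { SCALE_UP, SCALE_DOWN, HOLD } decision_t;

decision_t decide(double load, double up_thresh, double down_thresh)
{
    if (load > up_thresh)   return SCALE_UP;    /* clone a VM  */
    if (load < down_thresh) return SCALE_DOWN;  /* retire a VM */
    return HOLD;
}

int main(void)
{
    double samples[] = { 0.35, 0.91, 0.12 };
    for (int i = 0; i < 3; i++)
        printf("load %.2f -> decision %d\n", samples[i],
               decide(samples[i], 0.80, 0.20));
    return 0;
}
```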

Page 32: Lego Cloud SAP Virtualization Week 2012

Compute Aggregation: Our next challenge

Page 33: Lego Cloud SAP Virtualization Week 2012

Cost Effective "Small" HPC Grid

High Performance Computing (HPC)

Supercomputers at the frontline of processing speed: 10k-100k cores

Typical benchmark: Top 500 (linear algebra)

Small HPC: using 10-20 commodity (2 TB / 80 core) nodes

Typical Applications

Relational databases

Analytics tasks (linear algebra)

Simulations

Hecatonchire Value Proposition

Optimal price/performance by using commodity hardware

Operational flexibility: node downtime without downing the cluster

Seamless deployment within an existing cloud

Page 34: Lego Cloud SAP Virtualization Week 2012

Distributed Shared Memory (DSM)

Traditional cluster

Distributed memory

Standard interconnects

An OS instance on each node

Distribution handled by the application

ccNUMA

Cache-coherent shared memory

Fast interconnects

One OS instance

Distribution handled by hardware

Vendors: ScaleMP, Numascale, others

Page 35: Lego Cloud SAP Virtualization Week 2012

Distributed Shared Memory – Inherent Limitations

Linux provides NUMA topology discovery: the distance between compute cores, and the distance from cores to memory (see the libnuma sketch below)

While the Linux OS is aware of the NUMA layout, the application may not be aware …

Cache coherency may get very expensive:

Inter-core: L3 cache, 20 ns

Inter-socket: main memory, 100 ns

Inter-node (IB): remote memory, 2,000 ns

Thus the ccNUMA architecture may not "really" be transparent to the application!
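The topology Linux exposes can be queried with libnuma (link with -lnuma); numa_distance() reports the relative access cost between two nodes, 10 meaning local:

```c
#include <numa.h>
#include <stdio.h>

/* Print the NUMA distance matrix the kernel exposes. */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int max = numa_max_node();
    for (int from = 0; from <= max; from++) {
        for (int to = 0; to <= max; to++)
            printf("%4d", numa_distance(from, to));  /* 10 = local */
        printf("\n");
    }
    return 0;
}
```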

Page 36: Lego Cloud SAP Virtualization Week 2012

Summary

Page 37: Lego Cloud SAP Virtualization Week 2012


Roadmap

2011

• Live Migration

• Pre-copy XBZRLE Delta Encoding

• Pre-copy LRU page reordering

• Post-copy using RDMA interconnects

2012

• Resource Aggregation

• Cloud Management Integration

• Memory Aggregation – RAIM (Redundant Array of Inexpensive Memory)

• I/O Aggregation – vRAID (virtual Redundant Array of Inexpensive Disks)

• Flash cloning

2013

• Lego Landscape

• CPU Aggregation - ccNUMA

• Flexible resource management

Page 38: Lego Cloud SAP Virtualization Week 2012

Key takeaways

Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware

With Hecatonchire, unmodified applications or VMs can tap into remote resources transparently

To be released as open source under the GPLv2 and LGPL licenses to the Qemu and Linux communities

Developed by SAP Research TI

Page 40: Lego Cloud SAP Virtualization Week 2012

Appendix

Page 41: Lego Cloud SAP Virtualization Week 2012

Linux Kernel Virtual Machine (KVM)

Released as a Linux Kernel Module (LKM) under the GPLv2 license in 2007 by Qumranet

Full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set

Uses Qemu for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration (a minimal sketch of the /dev/kvm interface follows)

KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE
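A minimal sketch of the kernel interface Qemu drives, using only documented KVM ioctls:

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open the KVM module, check the API version, create an empty VM.
 * Qemu performs these same first steps before adding memory and vCPUs. */
int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) { perror("/dev/kvm"); return 1; }

    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);   /* returns a new VM fd */
    if (vm < 0) {
        perror("KVM_CREATE_VM");
    } else {
        printf("created empty VM, fd %d\n", vm);
        close(vm);
    }
    close(kvm);
    return 0;
}
```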

Page 42: Lego Cloud SAP Virtualization Week 2012

Remote Page Faulting Architecture Comparison

Hecatonchire

No context switches

Zero-copy

Uses iWARP RDMA

Yobusame

Context switches into user mode

Uses standard TCP/IP transport

Horofuchi and Yamahata, KVM Forum 2011; Hudzia and Shribman, SYSTOR 2012

Page 43: Lego Cloud SAP Virtualization Week 2012

Legal Disclaimer

The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP's strategy and possible future developments, products and/or platform directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.

All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.