Will Computer Systems With Performance Guarantees Ever Go ...Keynote. HASE 2014. Miami, FL, USA. Jan...

Will Computer Systems With Performance Guarantees Ever Go Mainstream? Keynote. HASE 2014. Miami, FL, USA. Jan 10, 2014 Juan A. Colmenares, Ph.D. 1 Will Computer Systems With Performance Guarantees Ever Go Mainstream? Keynote Presented at the 15th IEEE International Symposium on High Assurance Systems Engineering (HASE 2014) January 10, 2014 / Miami, Florida, USA Juan A. Colmenares Computer Science Laboratory (CSL) Samsung Research America – Silicon Valley (SRA-SV) [email protected] Disclaimer No part of this presentation necessarily represents the views and opinions of my current and former employers or my research collaborators.


Will Computer Systems With Performance Guarantees Ever Go Mainstream?

Keynote

Presented at the 15th IEEE International Symposium on High Assurance Systems Engineering (HASE 2014)

January 10, 2014 / Miami, Florida, USA

Juan A. Colmenares

Computer Science Laboratory (CSL)

Samsung Research America – Silicon Valley (SRA-SV)

[email protected]

Disclaimer

• No part of this presentation necessarily represents the views and opinions of my current and former employers or my research collaborators.


Introduction

• Performance guarantees are key in mission-critical, cyber-physical systems
– Considered an ultra-specialized area
• Average performance
– Today’s common figure of merit for software systems
• For 10+ years, sustained demand for high-quality multimedia applications
– Multi-party video conferencing and video/audio on demand
– Typical motivation for (probabilistic) performance guarantees
• Now, Internet-based service providers are starting to show interest in offering predictably responsive interactive services
– To differentiate themselves from the competition
– To retain existing customers/users and attract new ones

Introduction

• Developing distributed computing systems with performance guarantees is
– Harder
– More expensive
– More time-consuming
• We do it only when it is strictly necessary
• Naturally, we defer such hard problems until there is no other choice but to face them
– Notable recent example: parallel computing


Questions

• Will current trends force us to develop massively used computer systems with some type of performance guarantees?

• And if so, are we prepared?

Mainstream Applications and Systems: Some Targets

• Currently or expected to become popular (used by millions)
• With clear demands for performance guarantees
• Developed and supported by multiple, large teams

http://www.thinkgig.com http://www.isisingenieria.com

Data Centers (Cloud Computing)

Networked Sensors and Actuators

[ Phrase borrowed from Prof. Jan Rabaey, UC Berkeley Swarm Lab ]


Cloud Computing: A Target for Performance Guarantees

• Run in data centers
– Serving millions of people
• Examples of what to guarantee
– Their contributions to the service response times experienced by users
– Throughput for media content delivery

Web Search

www.adobe.com

thinkjudd.com

Media Content Delivery

Networked Sensors and Actuators: A Target for Performance Guarantees

• Some apps are basically control systems, wired and/or wireless
• Deployed in the environment
• Examples of what to guarantee
– Response times (latencies) of critical actions
• Some apps with strict requirements

Autonomous Cars

From article by Tom Vanderbilt (Feb 2012): http://www.wired.com/magazine/2012/01/ff_autonomouscars/

Amazon Prime Air Rotorcraft

Source: Honda

Robotic Assistants

http://www.amazon.com


Swarm of Devices at the Edge of the Cloud

[ Prof. Jan Rabaey, ASPDAC’08 ]

Infrastructural core

The Cloud Mobile Access & Relay

The Swarm

Swarm of Devices at the Edge of the Cloud

http://www.wired.com

www.popsci.com.au

The Cloud


Smart Homes and Spaces: Swarm of Devices at the Edge of the Cloud

http://www.corning.com

http://www.corning.com

http://www.samsung.com

Samsung Smart Home Corning’s A Day Made of Glass

Smart Cities: Focus of the TerraSwarm Research Center [TS12]

• Meant to handle two cases
– Normal operation and disasters
• Integrate
– Fixed infrastructure
  • e.g., environmental monitoring, energy usage, tracking and mapping
– Mobile assets (automatic vehicles, UAVs, robots)
– Immersive humans
• Cloud as a companion, but data locality is key for latency

[TS12] Lee et al. The TerraSwarm Research Center (TSRC) (A White Paper). Tech Report No. UCB/EECS-2012-207


Smart Cities: Swarm of Devices at the Edge of the Cloud

• Some initiatives in industry and academia
– IBM’s The Smarter City
– Schneider Electric’s Smart Cities Solution
– TerraSwarm Research Center @ UC Berkeley
– Center for Urban Science + Progress (CUSP) @ NYU

http://www.schneider-electric.com

Clear Demands for Performance Guarantees May Not Be Enough

• GUIs should provide response-time guarantees to users
– At least for some meaningful actions
• But I don’t expect major improvement here soon

Desktops: Popularity in decline, so no major interest in improving guarantees of desktop GUIs

Mobile Devices (e.g., smartphones and tablets): Popular indeed, but battery life is a much more pressing issue

Wearable Devices (e.g., smart watches and glasses): On the rise, but with a similar battery-life issue

GUI hardware acceleration is key in keeping response times low

Google Glasses

Samsung Gear


What Performance Guarantees Are We Talking About?

• We often seek guarantees on:
– Throughput (e.g., requests per second)
– Latency to response (e.g., service time)
• Other interesting performance metrics
– Energy and power consumption (e.g., energy/power budget)
– Time to recovery (e.g., guaranteed maximum recovery time)
• What type of guarantees?
– Probabilistic with high confidence (mostly)
  • Often easier targets than hard guarantees
  • Leave more room for tradeoffs instead of largely overprovisioning for the infrequent worst cases

Our focus today

Any Performance Guarantees Offered by Public Clouds?

• None so far
– At least by 3 major cloud providers

• Service Level Agreements (SLAs) are only about availability and accessibility

– e.g., monthly availability > 99.95%; otherwise, you get service credits

• Will competition have any effect?

https://cloud.google.com http://aws.amazon.com http://www.windowsazure.com
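To put the availability figure above in perspective, the arithmetic can be sketched as follows. This is only an illustration: the function name and the 30-day month are assumptions, not part of any provider's SLA text.

```python
def allowed_downtime_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes per month a service may be unavailable while still
    meeting the given monthly availability percentage.

    Hypothetical helper; the 30-day month is an assumption.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


# A 99.95% monthly target leaves a downtime budget of roughly 21.6 minutes.
budget = allowed_downtime_minutes(99.95)
```

Note that such a budget says nothing about response time or throughput while the service is up, which is the gap the slide points at.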


I/O Bandwidth Provisioning in Amazon Elastic Block Store (EBS)

Signs of Improvement in Public Clouds

• Amazon EBS offers volumes
– Durable, block-level storage devices
• For a virtual-machine instance, an EBS volume appears as a native block device similar to a hard drive
• Provisioned IOPS volumes
– Offer consistent performance for I/O-intensive workloads (e.g., databases) in Amazon EC2
– Designed to deliver within 10% of the specified IOPS rate 99.9% of the time
  • But this is NOT part of any SLA!
– IOPS rate up to 4000 IOPS per volume
– Volume sizes from 10 GB to 1 TB
• Possibly inspired by
– Gulati et al. mClock: Handling throughput variability for hypervisor IO scheduling. OSDI 2010.

Source: http://aws.amazon.com/ebs/piops/
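A statement such as "within 10% of the specified IOPS rate 99.9% of the time" is a probabilistic guarantee that a user could check against measurements. The sketch below is one possible reading, under two assumptions: "within 10%" is taken to mean "at least 90% of the provisioned rate", and "of the time" is approximated by the fraction of equal-length measurement intervals. The function name and parameters are hypothetical and do not correspond to any AWS API.

```python
def meets_iops_target(samples, provisioned_iops,
                      tolerance=0.10, required_fraction=0.999):
    """Return True if enough per-interval IOPS samples reach the target.

    samples: measured IOPS, one value per equal-length interval.
    A sample is "good" if it reaches (1 - tolerance) * provisioned_iops.
    This interpretation of the vendor statement is an assumption.
    """
    if not samples:
        raise ValueError("need at least one sample")
    floor = provisioned_iops * (1.0 - tolerance)
    good = sum(1 for s in samples if s >= floor)
    return good / len(samples) >= required_fraction
```

For example, 999 good intervals out of 1000 pass this check, while 998 out of 1000 do not.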

SolidFire’s All-Flash Storage Infrastructure with QoS

Signs of Improvement in Public Clouds

• At full scale (100 nodes) able to deliver
– 3.4 PB of effective capacity
– 7.5 million IOPS
• Reduced cost
– Below $3/GB and below $1/IOPS (60 TB to 3.4 PB)
  • Below the cost of traditional performance disk solutions
• Able to guarantee performance to thousands of volumes within a shared storage
– Pending patent on QoS capabilities

http://www.solidfire.com

July 2013


Allocated Bandwidth for Streaming Servers in Windows Azure

Signs of Improvement in Public Clouds

• Media Services enable creation, management, and distribution of media
– e.g., encoding and on-demand streaming
• Reserved Units (RUs)
– Dedicated set of resources for media processing tasks
– Highly recommended for on-demand streaming
  • Actually, availability SLA only valid with RUs
• Each RU provides bandwidth up to 200 Mbps for streaming origin servers
– Bandwidth allocation NOT part of any SLA
– Availability SLA only applies when using <= 80% of available bandwidth

Source: http://www.windowsazure.com/en-us/support/legal/sla/

Research Efforts: Clear Interest in Improving

• Barker and Shenoy (UMass). Empirical evaluation of latency-sensitive application performance in the cloud. MMSys 2010.
– Focus on interference of dynamically varying background load on latency-sensitive tasks
– Careful configurations mitigate, but do not eliminate, interference
• Dean and Barroso (Google). The tail at scale. CACM 2013.
– Latency tail-tolerant software techniques to build predictable systems out of less predictable parts
• Ferguson et al. (MSR). Jockey: Guaranteed job latency in data parallel clusters. EuroSys 2012.
– Latency guarantees for parallel data-processing jobs using a resource allocation control loop
• Terry et al. (MSR). Consistency-based service level agreements for cloud storage. SOSP 2013.
– A replicated key-value store that allows applications to declare their consistency and latency priorities via consistency-based SLAs


So Far …

• In cloud computing
– No public offerings with high-confidence performance guarantees available
  • As far as we can tell
– How about private offerings?
• In some apps with networked sensors and actuators
– Clear requirements of performance guarantees due to safety
• Future swarm applications
– A number will require performance guarantees
– Also need to interact with the Cloud

Design Principles and Techniques to Build Software Systems with High-Confidence Performance Guarantees

• Already available for system developers to adopt
• Some challenges
– Performance guarantees considered less important than other requirements
  • Not perceived as a differentiating factor
– Additional complexity
– End-to-end properties
  • Multi-layered factors
  • A piece-by-piece game with distributed responsibility
– Input dependent
– Cost effectiveness
  • Especially considering the investment in existing systems
– Trained workforce


Design Principles and Techniques to Build Software Systems with High-Confidence Performance Guarantees

• Next we will discuss
– Divide-and-conquer design principle
– Limiting system load
– Mitigating performance variability

Divide and Conquer

• Systems should be built to enable systematic evaluation (via analysis and/or measurements) of:

– The individual contributions of factors that make up the system’s performance

– The effects of the combination of those factors on the system’s performance

• This design principle is key
– Systems include multiple components or sub-systems
– Performance guarantees of interest are usually end-to-end
– Multiple factors influence the system’s performance
  • e.g., architectural features, algorithmic efficiency, task scheduling, memory management, I/O behavior, thread affinity, cache locality, etc.
• But too many factors can influence performance
– To make it practical, we should consider the most important ones for the system at hand


Performance Decoupling of System Components: Enabling Divide and Conquer

• Extension of software componentization to performance aspects

– Software components are used to divide the system’s logic in parts of manageable complexity

• The idea is to evaluate the contributions of individual components to the system’s performance

An In-Memory Key-Value Cache: Example of Performance Decoupling and Customization

• KV-Cache [UCC13]
– Hash table coupled with a replacement logic
– Exploits a software absolute zero-copy approach and aggressive customization to offer high performance
– Implemented on Genode/Fiasco.OC µkernel

[Figure: KV-Cache architecture. Labels: Application Layer; Hash Table (with Fine Grain Locks); Non-Blocking Queue-based CLOCK (Replacement Logic); Comm & Mem Mgmt Layer (10G NIC Driver + UDP + Mem Pools); Non-Blocking Channels; > Decoupling <]

[UCC13] Waddington, Colmenares, Kuang and Song. KV-Cache: A scalable high-performance web-object caching for manycore. 6th IEEE/ACM Int’l Conference on Utility and Cloud Computing.


In-Memory Web-Object Caching: Overview

• Widely used by Internet-based service providers to reduce latency and increase system throughput

– Memcached: a popular example

• www.memcached.org

Typical Side-Cache Deployment

A De Facto Figure of Merit: Capacity [IGCC11, BagLRU]

Maximum throughput (in RPS) the system can sustain with an average round-trip time (RTT) below 1 ms

[BagLRU] Wiggins and Langston. Enhancing the scalability of memcached. Intel Tech. Rep. 2012

(http://software.intel.com/en-us/articles/enhancing-the-scalability-of-memcached-0)

[IGCC11] Berezecki et al. Manycore key-value store. Proc. of the 2011 Int’l Green Computing Conference. 2011.

Experimental Results: KV-Cache vs. Intel’s Bag-LRU Memcached [BagLRU]

[Figure: Latency comparison for one million GET requests at 600K RPS (a slow rate)]

[Figure: Throughput comparison with average round-trip time < 1 ms; chart annotation: 2x]


An In-Memory Key-Value Cache: Example of Performance Decoupling and Customization

• We could have a stricter figure of merit (stronger guarantees)
– Maximum throughput (in RPS) the system can sustain with
  • A target RTT of 1 ms observed on average, and
  • No more than 0.1% of late responses, arriving after the target RTT

[Figure: Round-trip time distribution at 3 million RPS for a single NIC (3 million GET requests)]

KV-Cache [UCC13] never exceeded the target round-trip time of 1 ms!
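The stricter figure of merit above can be written down as a small check over measured round-trip times. This is a minimal sketch, not the evaluation harness used for KV-Cache: the function names, the dictionary layout, and the use of "mean at or below the target" are assumptions made for illustration.

```python
def meets_rtt_target(rtts_ms, target_ms=1.0, max_late_fraction=0.001):
    """True if the mean RTT is at or below the target and at most
    max_late_fraction of the responses arrive after the target."""
    if not rtts_ms:
        raise ValueError("need at least one RTT sample")
    mean = sum(rtts_ms) / len(rtts_ms)
    late = sum(1 for r in rtts_ms if r > target_ms)
    return mean <= target_ms and late / len(rtts_ms) <= max_late_fraction


def capacity_rps(measurements, target_ms=1.0, max_late_fraction=0.001):
    """Highest offered load (RPS) whose RTT samples meet the target.

    measurements: dict mapping offered load in RPS to a list of RTTs (ms).
    Returns None if no load level qualifies.
    """
    passing = [rps for rps, rtts in measurements.items()
               if meets_rtt_target(rtts, target_ms, max_late_fraction)]
    return max(passing) if passing else None
```

Compared with the average-only definition of capacity, the extra late-response condition rejects load levels whose mean looks fine but whose tail does not.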


Space-Time Partitioning: Enabling Divide and Conquer

[Figure: partitions drawn over Space and Time axes; the yellow partition grows due to adaptation]

Spatial Partition: Key for performance isolation
• Hard boundaries and controlled communication between partitions

Spatial partitioning is not static and may vary over time
• Partitions can be time multiplexed; resources are gang-scheduled
• Partitioning adapts to system’s needs

• Each partition receives a vector of basic resources
– A number of hardware threads, memory pages, a portion of cache segments, memory bandwidth, and energy budget
• A partition may also receive
– Exclusive access to other resources (e.g., a device)
– Guaranteed fractional services from other partitions


Controlled multiplexing is key
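The "vector of basic resources" can be pictured as a plain record, with spatial partitioning reduced to a per-dimension capacity check. The sketch below is illustrative only: the class, field names, and units are assumptions and do not reflect the actual interfaces of the system described in the talk.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ResourceVector:
    # Field names and units are hypothetical.
    hw_threads: int
    memory_pages: int
    cache_segments: int
    mem_bandwidth_mbps: int
    energy_budget_mw: int

    def __add__(self, other):
        return ResourceVector(*(a + b for a, b in
                                zip(astuple(self), astuple(other))))

    def fits_within(self, capacity):
        """True if no dimension exceeds the given capacity."""
        return all(a <= b for a, b in
                   zip(astuple(self), astuple(capacity)))


def can_run_concurrently(partitions, machine):
    """True if the partitions can be mapped to hardware at the same time
    without oversubscribing any resource dimension."""
    total = ResourceVector(0, 0, 0, 0, 0)
    for p in partitions:
        total = total + p
    return total.fits_within(machine)
```

When the check fails, the partitions cannot all be resident at once, which is where time multiplexing of partitions comes in.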

The Cell: Our Partitioning Abstraction

User-level Software Container with Guaranteed Access to Resources

[Figure: Cell A and Cell B drawn over Space and Time axes. Labels: Task; Address Space A; Address Space B; 2nd-level Scheduling; 2nd-level Mem Mgmt. The yellow partition grows due to adaptation.]

• Basic properties of cells
– Full control over resources it owns when mapped to hardware
– One or more address spaces (protection domains)
– Efficient inter-cell communication channels

2nd-level runtime must be adaptive, too


Basis of a Component-based Model with Composable Performance

• Applications = Set of interacting components deployed on different cells
– Applications split into performance-incompatible and mutually distrusting cells with controlled communication
– OS Services are independent servers that provide QoS
• Requires fast inter-cell communication
– Could use hardware acceleration for fast messaging

[Figure labels: Application Component; Device Drivers; File Service; Real-time Cell; Core Application; Parallel Library; Channel; Storage Device; Tessellation Kernel (Partition Support)]

Customizable User-Level Runtimes

PULSE: A framework for Preemptive User-Level SchEdulers

• Available preemptive schedulers
– Round-robin (and pthreads)
– EDF and Fixed Priority
– Multiprocessor Constant Bandwidth Server (M-CBS) [ECRTS’04]
– Juggle: A load balancer for SPMD applications [CLUSTER’12]
• Able to handle cell resizing

[Figure labels: Application Cell; Scheduler X; PULSE Framework; Timer interrupts; Hardware cores; Tessellation Kernel (Partition Support)]

[ECRTS’04] S. Baruah et al. Executing aperiodic jobs in a multiprocessor constant-bandwidth server implementation. ECRTS'04.

[CLUSTER’12] S. Hofmeyr, J. Colmenares et al. Juggle: Addressing extrinsic load imbalances in SPMD applications on multicore computers. Cluster Computing Journal.


Network Service: An OS Service with QoS Guarantees [DAC’13, JAES’13]

• Supports reservations (i.e., differentiated service classes) and proportional share of bandwidth
– Using mClock scheduling algorithm [OSDI’10] (on top of PULSE)
• NIC driver is entirely contained in user-space
– No system calls when transmitting and receiving buffers

[Figure annotation: Avg. throughput = 125.2 KB/s]

[DAC’13] Colmenares, et al. Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation.

[JAES’13] Colmenares, et al. A multi-core operating system with QoS-guarantees for network audio applications.

[OSDI’10] A. Gulati et al. mClock: handling throughput variability for hypervisor IO scheduling.

A Divide and Conquer Approach to Deriving Time Bounds

• Individual functions running in isolation
– Analytically derived execution-time bounds of functions + execution-time measurements of functions
– Combined via a hybrid approach (H) into tight execution-time bounds of functions
• Concurrent function-executions, resource sharing, and communication activities
– Analytically derived service-time bounds of functions + service-time measurements of functions
– Combined via a hybrid approach (H) into tight service-time bounds of functions

[Figure: two plots over time with vertical axis from 0.2 to 1.0. Labels: Max. Observed Value (optimistic); Adopted Bound; Analytically Derived Bound (pessimistic)]

[IESS09] Colmenares et al. Experimental evaluation of a hybrid approach for deriving service-time bounds of methods in real-time distributed computing objects. Proc. Int'l Embedded Systems Symposium 2009.


Basic Approaches for Deriving Time Bounds

Static Analysis Approaches
• Hard bound with a practically zero probability of being violated at run time
• Tend to produce excessively loose bounds when applied to modern fully-featured processors

Measurement-based Approaches
• Maximum measured execution-time value + safety margin
• Soft bound with a non-negligible probability of being exceeded at run time
• May not cover the worst case


We want a tight time bound in between.

But how to determine the safety margin?


Curve Fitting Technique: Central to the Hybrid Approach

• Combines (1) measurements and (2) loose but analytically-derived hard bounds to produce reasonably safe and tight time bounds

[Figure labels: α; margin value; probability of the soft bound being exceeded at run time]
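One way to picture the idea is to take a sigmoid curve that has already been fitted to the measured distribution, read off the time at which it reaches 1 - α, and never go above the analytical hard bound. The sketch below uses the standard generalized logistic (Richards) curve with asymptotes 0 and 1. The parameter names, the example values, and the clamping rule are assumptions for illustration; the slides do not give the exact parameterization or fitting procedure of [IESS09], and no fitting is performed here.

```python
import math


def richards_cdf(t, growth, midpoint, q=1.0, nu=1.0):
    """Generalized logistic curve with lower asymptote 0 and upper
    asymptote 1, used here as a smooth stand-in for a fitted CDF."""
    return (1.0 + q * math.exp(-growth * (t - midpoint))) ** (-1.0 / nu)


def soft_bound(alpha, growth, midpoint, q=1.0, nu=1.0):
    """Time at which the curve reaches 1 - alpha (closed-form inverse)."""
    p = 1.0 - alpha
    return midpoint - math.log((p ** (-nu) - 1.0) / q) / growth


def adopted_bound(alpha, analytic_bound, max_observed, **curve_params):
    """Bound between the maximum observed value and the analytical bound.

    The clamping on both sides is an assumption of this sketch.
    """
    candidate = soft_bound(alpha, **curve_params)
    return min(max(candidate, max_observed), analytic_bound)
```

With made-up parameters (growth 0.5 per ms, midpoint 15 ms, α = 1e-4), the candidate bound is about 33.4 ms, which would be adopted if the analytical bound were larger and the maximum observed value smaller.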

A Televideo Application

[Figure: two nodes (Node 1, Node 2), each with a local user and a remote user, display windows, and a TVTMO running on TMOSM over an OS/HW platform; the nodes are connected by a network and exchange video streams and performance metric reports (feedback)]

Network Performance Metrics

• Throughput (at the application level)

• Message loss rate

• End-to-end delay


A Televideo Application

• Frame size: 320 x 240
• Frame rate: 10 fps
• Color depth: 24 bits
• CODEC: MPEG-4 (implementation FFMpeg)

Obtaining a Tight Service Bound for a Function via the Hybrid Approach

[Figure: estimated probability (0 to 1.2) versus time (0 to 60 ms), showing the CDF and the fitted Richards model. Adopted bound: 30 ms. Analytically derived bound: 54 ms.]


Limiting System Load

• We can only guarantee performance under certain load limits and conditions (i.e., input)

[Figure: Example. Labels: Avg.; 100% GET requests]

On-Line Admission Control: Limiting System Load

• Hey system! Can you guarantee performance X for this job?
• Some possible answers
– Sure, no problema! [The rare happy case]
– Yes, but let me put order in the house
  • Possible downgrading and revocation
– Nope. I am sorry. Bye.
– Nope, but let’s negotiate a little bit
  • Not with performance X, but with performance Y. Is this OK with you?
• Typical issues
– Computational cost
– Reduction in effective system utilization due to pessimism in the analysis
• Some efforts to deal with those issues
– Nie et al. Capacity-based admission control for mixed periodic and aperiodic real time service processes. SOCA 2011.
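Three of the answers above (accept, reject, negotiate) can be sketched with a very simple capacity-share test. This is a toy model under stated assumptions: performance X is reduced to a requested fraction of one resource, the class and method names are hypothetical, and the "put order in the house" case (downgrading or revoking existing reservations) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    admitted: bool
    granted_share: float  # fraction of capacity granted (0.0 if rejected)


class AdmissionController:
    """Toy on-line admission control over a single shared resource."""

    def __init__(self, capacity=1.0):
        self.capacity = capacity
        self.reserved = 0.0

    def request(self, share, min_share=None):
        """Ask for `share` of the capacity; `min_share` is the lowest
        level the job would accept when negotiating."""
        free = self.capacity - self.reserved
        if share <= free:
            # "Sure, no problema!"
            self.reserved += share
            return Decision(True, share)
        if min_share is not None and 0.0 < min_share <= free:
            # "Not with performance X, but with performance Y."
            self.reserved += free
            return Decision(True, free)
        # "Nope. I am sorry. Bye."
        return Decision(False, 0.0)
```

A real controller would also have to account for the pessimism of its own analysis, which is exactly the utilization cost listed under typical issues.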


Load Regulation and Shaping: Limiting System Load

• Limit request rate or progress rate
– Maximum number of requests in a given interval, or maximum inter-arrival rate (MIR)
• Leaky bucket
– Classic textbook example of traffic shaping
• Handling excess of work
– Queue requests, and drop if too many
– Trade off content quality
  • Good-enough in-time content can be better than late content
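A leaky bucket used as a request meter can be sketched in a few lines. The class below is a minimal illustration with hypothetical names: the bucket drains at a fixed rate, each accepted request adds one unit, and a request that would overflow the bucket is rejected (the "drop if too many" case). Timestamps are passed in explicitly to keep the example deterministic.

```python
class LeakyBucket:
    """Leaky-bucket meter: accepts a request only if the bucket has room."""

    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate  # units drained per second
        self.capacity = capacity    # maximum bucket level
        self.level = 0.0
        self.last = None            # time of the previous arrival

    def offer(self, now):
        """Return True if a request arriving at time `now` (seconds,
        non-decreasing across calls) is accepted."""
        if self.last is not None:
            drained = (now - self.last) * self.leak_rate
            self.level = max(0.0, self.level - drained)
        self.last = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True
        return False
```

With a leak rate of 10 per second and a capacity of 2, a burst of three simultaneous requests gets two accepted and one rejected; half a second later the bucket has fully drained and accepts again.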

Mitigating Performance Variability

• Computer systems (architectures, networking, and software) are often built favoring average performance over performance predictability

– e.g., multi-level caches and deep pipelines with dynamic dispatch and speculative execution

• Often in practice, building the system from scratch to remove/reduce unpredictability is not economically feasible

– So, to learn to live with it, we need. [Yoda!]

• Common technique: Overprovisioning

However, some are trying to reintroduce timing predictability and repeatability from the ground up for safety-critical systems
• Precision Timed (PRET) Machines @ UC Berkeley [http://chess.eecs.berkeley.edu/pret/]
• Time-Predictable Multi-Core Architecture for Embedded Systems (T-CREST) -- An EU Research Project [http://www.t-crest.org/]


Mitigating Latency Variability: In Data Centers [CACM13]

• Issue the same request to multiple replicas and use the first response you get (hedged requests)
– Copies of the same request are sent with a short delay among them
– The client cancels outstanding requests once it gets the response
• Requests sent to multiple servers and the servers do cross-server status updates (tied requests)
– e.g., a server sends cancellations to the others once it starts servicing the request
• Can reduce latency with modest load increase
– If causes of variability do not simultaneously affect the replicas

[CACM13] Dean and Barroso (Google). The tail at scale. Communications of the ACM. 2013.
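The hedged-request pattern described above can be sketched on the client side with Python's asyncio. This is an illustration under assumptions, not code from [CACM13]: `hedged_request`, its parameters, and the representation of replicas as zero-argument coroutine functions are all hypothetical.

```python
import asyncio


async def hedged_request(replicas, hedge_delay):
    """Send the request to the first replica; if no response arrives
    within hedge_delay seconds, also send it to the next one, and so on.
    Returns the first response and cancels the outstanding copies.

    replicas: list of zero-argument coroutine functions (hypothetical
    stand-ins for "send this request to replica i").
    """
    tasks = []
    try:
        for call in replicas:
            tasks.append(asyncio.ensure_future(call()))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay,
                return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # All copies are in flight; wait for whichever finishes first.
        done, _ = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for t in tasks:
            t.cancel()  # no effect on tasks that already finished
```

If the first replica answers within the hedge delay, no extra copy is ever sent, which is how the extra load stays modest.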

Mitigating Latency Variability: In Data Centers [CACM13]

• Latency-induced probation
– In some situations the system performs better by excluding a particularly slow machine and putting it on probation
  • Slowness is often caused by temporary phenomena
• Interesting point
– Removal of serving capacity from a live system during periods of high load actually improves latency
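A probation decision could be as simple as comparing each machine's recent latency with the group median. The policy below is an assumption made for illustration: the threshold factor, the minimum number of machines kept in service, and the function name are hypothetical and are not taken from [CACM13].

```python
from statistics import median


def select_probation(avg_latency_ms, slow_factor=3.0, min_active=2):
    """Return the set of machines to put on probation.

    avg_latency_ms: dict mapping machine name to its recent average
    latency in milliseconds. A machine is considered slow if its latency
    exceeds slow_factor times the median; at least min_active machines
    are always kept in service.
    """
    med = median(avg_latency_ms.values())
    slow = sorted(
        (name for name, lat in avg_latency_ms.items()
         if lat > slow_factor * med),
        key=lambda name: avg_latency_ms[name],
        reverse=True)
    max_removable = max(0, len(avg_latency_ms) - min_active)
    return set(slow[:max_removable])
```

Since slowness is often temporary, a real system would also need a way to bring machines back from probation; that part is left out here.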


Adaptive Resource Allocation: A Complementary Technique

• Systems need to adapt to changes in the workloads (application and request mixes) and resource availability
• Number of efforts in this area:
– Yang et al. Redline: First class support for interactivity in commodity operating systems. OSDI 2008.
– Padala et al. Automated control of multiple virtualized resources. EuroSys 2009.
– Hoffmann et al. SEEC: a general and extensible framework for self-aware computing. Technical Report MIT-CSAIL-TR-2011-046.
– Sharifi et al. METE: meeting end-to-end QoS in multicores through system-wide resource management. SIGMETRICS Perform. Eval. Rev., 39(1):13–24, June 2011.

Example Adaptive Control Loop

[Figure: Running System (Data Plane): cells hosting Application 1, Application 2, Block Service, Network Service, and GUI Service, each service with a QoS-aware scheduler, connected by channels. Resource Allocation (Control Plane): Observation and Modeling; Partitioning and Distribution. Performance reports flow from the data plane to the control plane; resource assignments flow back.]

[DAC13] Colmenares et al. Tessellation: refactoring the OS around explicit resource containers with continuous adaptation. DAC 2013.
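One step of such a loop, from a performance report to a new resource assignment, can be sketched as a proportional controller. This is a deliberately simple stand-in: the function name, the gain, and the unit limits are assumptions, and it is not the allocation policy of Tessellation [DAC13].

```python
def adjust_allocation(current_units, observed_latency_ms, target_latency_ms,
                      min_units=1, max_units=64, gain=0.5):
    """Propose a new resource allocation for one cell.

    Grows the allocation when observed latency is above the target and
    shrinks it when latency is below, in proportion to the relative
    error, then clamps the result to [min_units, max_units].
    """
    error = (observed_latency_ms - target_latency_ms) / target_latency_ms
    proposed = current_units * (1.0 + gain * error)
    return max(min_units, min(max_units, round(proposed)))
```

For instance, a cell holding 8 units that observes twice its target latency would be proposed 12 units, while one that observes half its target latency would be proposed 6.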


Other Complementary Techniques

• Workload characterization
• Load balancing
• Differentiating service classes
• Managing background activities and synchronized disruption
• Software customization
• High-precision global time
– e.g., Precision Time Protocol (PTP) -- IEEE 1588

Conclusions

• Current trends indicate that distributed software systems with performance guarantees are likely to
– Become very popular
– Demand a large number of software developers
• When!?
• Obstacles
– Other requirements perceived as more urgent
  • Power and energy efficiency
  • Security and privacy
  • High availability
– Legal hurdles for motivating apps (e.g., autonomous cars)
• Design principles and techniques are available
– But need to be adapted to the system at hand
• Major challenges
– Cost effectiveness
– Trained workforce


THANKS

Questions?