Capriccio: Scalable Threads for Internet Service
Introduction
Internet services have ever-increasing scalability demands
Current hardware is meeting these demands; software has lagged behind
Recent approaches are event-based: pipeline stages of events
Drawbacks of Events
Event systems hide the control flow, making programs difficult to understand and debug
They eventually evolve into call-and-return event pairs; programmers must match related events and save/restore state between them
Capriccio: instead of the event-based model, fix the thread-based model
Goals of Capriccio
Support for the existing thread API, with few changes to existing applications
Scalability to thousands of threads, one thread per execution
Flexibility to address application-specific needs
[Figure: performance vs. ease of programming, plotting Threads, Events, and the Ideal]
Thread Design Principles
Kernel-level threads are for true concurrency
User-level threads provide a clean programming model with useful invariants and semantics
Decouple user-level from kernel-level threads: more portable
Capriccio
Thread package: all thread operations are O(1)
Linked stacks
Address the problem of stack allocation for large numbers of threads
Combination of compile-time and run-time analysis
Resource-aware scheduler
Thread Design and Scalability
POSIX API, backward compatible
User-Level Threads
+ Performance
+ Flexibility
- Complex preemption
- Bad interaction with the kernel scheduler
Flexibility
Decoupling user and kernel threads allows faster innovation
Can use new kernel thread features without changing application code
Scheduler tailored for applications; lightweight
Performance
Reduce the overhead of thread synchronization
No kernel crossing for preemptive threading
More efficient memory management at user level
Disadvantages
Need to replace blocking calls with nonblocking ones to hold the CPU: translation overhead
Problems with multiple processors: synchronization becomes more expensive
Context Switches
Built on top of Edgar Toernig’s coroutine library
Fast context switches when threads voluntarily yield
I/O
Capriccio intercepts blocking I/O calls
Uses epoll for asynchronous I/O
Scheduling
Very much like an event-driven application
Events are hidden from programmers
Synchronization
Supports cooperative threading on single-CPU machines
Requires only Boolean checks
Threading Microbenchmarks
SMP with two 2.4 GHz Xeon processors, 1 GB memory, two 10K RPM SCSI Ultra II hard drives, Linux 2.5.70
Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux (NPTL)
Latencies of Thread Primitives
                         Capriccio   LinuxThreads   NPTL
Thread creation            21.5          21.5       17.7
Thread context switch       0.24          0.71       0.65
Uncontended mutex lock      0.04          0.14       0.15
(times in microseconds)
Thread Scalability
Producer-consumer microbenchmark
LinuxThreads begins to degrade after 20 threads; NPTL degrades after 100
Capriccio scales to 32K producers and consumers (64K threads total)
I/O Performance
Network performance: token passing among pipes, simulating the effect of slow client links
10% overhead compared to epoll
Twice as fast as both LinuxThreads and NPTL with more than 1000 threads
Disk I/O comparable to kernel threads
Linked Stack Management
LinuxThreads allocates 2 MB per stack: 1 GB of VM holds only 500 threads
[Figure: fixed stacks]
Linked Stack Management
But most threads consume only a few KB of stack space at any given time
Dynamic stack allocation can significantly reduce VM usage
[Figure: linked stack]
Compiler Analysis and Linked Stacks
Whole-program analysis based on the call graph
Problematic for recursion; static estimation may be too conservative
Compiler Analysis and Linked Stacks
Grow and shrink the stack on demand
Insert checkpoints to determine whether more stack must be allocated before the next checkpoint
Results in noncontiguous stacks
Placing Checkpoints
One checkpoint in every cycle in the call graph
Bound the size between checkpoints with the deepest call path
Dealing with Special Cases
Function pointers: the procedure being called is unknown at compile time
Can find a potential set of target procedures
Dealing with Special Cases
External functions: allow programmers to annotate external library functions with trusted stack bounds
Allow larger stack chunks to be linked for external functions
Tuning the Algorithm
Stack space can be wasted: internal and external fragmentation
Tradeoffs: number of stack linkings vs. external fragmentation
Memory Benefits
Tuning can be application-specific
No preallocation of large stacks
Reduced VM requirement to run large numbers of threads
Better paging behavior: stack chunks are used LIFO
Case Study: Apache 2.0.44
Maximum stack allocation chunk: 2 KB
Apache under SPECweb99: overall slowdown is about 3%
Dynamic allocation: 0.1%
Linking to large chunks for external functions: 0.5%
Stack removal: 10%
Resource-Aware Scheduling
Advantages of event-based scheduling: tailored for applications, with event handlers
Events provide two important pieces of information for scheduling: whether a process is close to completion, and whether the system is overloaded
Resource-Aware Scheduling
Thread-based: view applications as a sequence of stages, separated by blocking calls
Analogous to an event-based scheduler
Blocking Graph
Node: A location in the program that blocked
Edge: between two nodes if they were consecutive blocking points
Generated at runtime
Resource-Aware Scheduling
1. Keep track of resource utilization
2. Annotate each node and its outgoing edges with the resources used
3. Dynamically prioritize nodes: prefer nodes that release resources
Resources
CPU
Memory (malloc)
File descriptors (open, close)
Pitfalls
Tricky to determine the maximum capacity of a resource; thrashing depends on the workload
The disk can handle more requests when they are sequential rather than random
Resources interact: VM vs. disk
Applications may manage memory themselves
Yield Profiling
User threads are problematic if a thread fails to yield
They are easy to detect, since their running times are orders of magnitude larger
Yield profiling identifies places where programs fail to yield sufficiently often
Web Server Performance
4x500 MHz Pentium server, 2 GB memory, Intel e1000 Gigabit Ethernet card, Linux 2.4.20
Workload: requests for 3.2 GB of static file data
Web Server Performance
Request frequencies match those of SPECweb99
A client connects to the server repeatedly, issuing a series of five requests separated by 20 ms pauses
Apache’s performance improved by 15% with Capriccio
Resource-Aware Admission Control
Producer-consumer application
Producers loop, allocating memory and randomly touching pages
Consumers loop, removing memory from the pool and freeing it
A fast producer may run out of virtual address space
Resource-Aware Admission Control
Touching pages too quickly will cause thrashing
Capriccio can quickly detect the overload conditions and limit the number of producers
Programming Models for High Concurrency
Events: application-specific optimization
Threads: efficient thread runtimes
User-Level Threads
Capriccio is unique: blocking graph, resource-aware scheduling
Targets a large number of blocking threads
POSIX compliant
Application-Specific Optimization
Most approaches require programmers to tailor their application to manage resources
Nonstandard APIs, less portable
Stack Management
No garbage collection
Future Work
Multi-CPU machines
Profiling tools for system tuning