Capriccio: Scalable Threads for Internet Service
Introduction
Internet services have ever-increasing scalability demands
Current hardware is meeting these demands; software has lagged behind
Recent approaches are event-based: pipeline stages of events
Drawbacks of Events
Event systems hide the control flow, making programs difficult to understand and debug
They eventually evolve into call-and-return event pairs; programmers must match related events and save/restore state between them
Capriccio: instead of the event-based model, fix the thread-based model
Goals of Capriccio
Support for the existing thread API, with few changes to existing applications
Scalability to thousands of threads, one thread per execution
Flexibility to address application-specific needs
[Figure: performance vs. ease of programming, plotting Threads, Events, and the Ideal]
Thread Design Principles
Kernel-level threads are for true concurrency
User-level threads provide a clean programming model with useful invariants and semantics
Decouple user-level from kernel-level threads: more portable
Capriccio
Thread package: all thread operations are O(1)
Linked stacks
Address the problem of stack allocation for large numbers of threads
Combination of compile-time and run-time analysis
Resource-aware scheduler
Thread Design and Scalability
POSIX API, backward compatible
User-Level Threads
+ Performance
+ Flexibility
- Complex preemption
- Bad interaction with the kernel scheduler
Flexibility
Decoupling user and kernel threads allows faster innovation
Can use new kernel thread features without changing application code
Scheduler tailored for applications; lightweight
Performance
Reduce the overhead of thread synchronization
No kernel crossing for preemptive threading
More efficient memory management at user level
Disadvantages
Need to replace blocking calls with nonblocking ones to hold the CPU: translation overhead
Problems with multiple processors: synchronization becomes more expensive
Context Switches
Built on top of Edgar Toernig’s coroutine library
Fast context switches when threads voluntarily yield
I/O
Capriccio intercepts blocking I/O calls
Uses epoll for asynchronous I/O
Scheduling
Very much like an event-driven application
Events are hidden from programmers
Synchronization
Supports cooperative threading on single-CPU machines
Requires only Boolean checks
Threading Microbenchmarks
SMP with two 2.4 GHz Xeon processors, 1 GB memory, two 10K RPM SCSI Ultra II hard drives, Linux 2.5.70
Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux (NPTL)
Latencies of Thread Primitives
                         Capriccio   LinuxThreads   NPTL
Thread creation            21.5          21.5       17.7
Thread context switch       0.24          0.71       0.65
Uncontended mutex lock      0.04          0.14       0.15
(times in microseconds)
Thread Scalability
Producer-consumer microbenchmark
LinuxThreads begins to degrade after 20 threads; NPTL degrades after 100
Capriccio scales to 32K producers and consumers (64K threads total)
I/O Performance
Network performance: token passing among pipes, simulating the effect of slow client links
10% overhead compared to epoll
Twice as fast as both LinuxThreads and NPTL with more than 1000 threads
Disk I/O comparable to kernel threads
Linked Stack Management
LinuxThreads allocates 2 MB per stack: 1 GB of VM holds only 500 threads
[Figure: fixed stacks]
Linked Stack Management
But most threads consume only a few KB of stack space at any given time
Dynamic stack allocation can significantly reduce VM usage
[Figure: linked stack]
Compiler Analysis and Linked Stacks
Whole-program analysis based on the call graph
Problematic for recursion; static estimation may be too conservative
Compiler Analysis and Linked Stacks
Grow and shrink the stack on demand
Insert checkpoints to determine whether more stack must be allocated before the next checkpoint
Results in noncontiguous stacks
Placing Checkpoints
One checkpoint in every cycle in the call graph
Bound the size between checkpoints with the deepest call path
Dealing with Special Cases
Function pointers: the procedure being called is unknown at compile time
Can find a potential set of target procedures
Dealing with Special Cases
External functions: allow programmers to annotate external library functions with trusted stack bounds
Allow larger stack chunks to be linked for external functions
Tuning the Algorithm
Stack space can be wasted: internal and external fragmentation
Tradeoffs: number of stack linkings vs. external fragmentation
Memory Benefits
Tuning can be application-specific
No preallocation of large stacks
Reduced VM requirement to run large numbers of threads
Better paging behavior: stack chunks are used LIFO
Case Study: Apache 2.0.44
Maximum stack allocation chunk: 2 KB
Apache under SPECweb99: overall slowdown is about 3%
Dynamic allocation: 0.1%
Linking to large chunks for external functions: 0.5%
Stack removal: 10%
Resource-Aware Scheduling
Advantages of event-based scheduling: tailored for applications, with event handlers
Events provide two important pieces of information for scheduling: whether a process is close to completion, and whether the system is overloaded
Resource-Aware Scheduling
Thread-based: view applications as a sequence of stages, separated by blocking calls
Analogous to an event-based scheduler
Blocking Graph
Node: A location in the program that blocked
Edge: between two nodes if they were consecutive blocking points
Generated at runtime
Resource-Aware Scheduling
1. Keep track of resource utilization
2. Annotate each node and its outgoing edges with the resources used
3. Dynamically prioritize nodes: prefer nodes that release resources
Resources
CPU
Memory (malloc)
File descriptors (open, close)
Pitfalls
Tricky to determine the maximum capacity of a resource; thrashing depends on the workload
The disk can handle more requests when they are sequential rather than random
Resources interact: VM vs. disk
Applications may manage memory themselves
Yield Profiling
User threads are problematic if a thread fails to yield
They are easy to detect, since their running times are orders of magnitude larger
Yield profiling identifies places where programs fail to yield sufficiently often
Web Server Performance
4x500 MHz Pentium server, 2 GB memory, Intel e1000 Gigabit Ethernet card, Linux 2.4.20
Workload: requests for 3.2 GB of static file data
Web Server Performance
Request frequencies match those of SPECweb99
A client connects to the server repeatedly, issuing a series of five requests separated by 20 ms pauses
Apache’s performance improved by 15% with Capriccio
Resource-Aware Admission Control
Producer-consumer application
Producers loop, allocating memory and randomly touching pages
Consumers loop, removing memory from the pool and freeing it
A fast producer may run out of virtual address space
Resource-Aware Admission Control
Touching pages too quickly will cause thrashing
Capriccio can quickly detect the overload conditions and limit the number of producers
Programming Models for High Concurrency
Events: application-specific optimization
Threads: efficient thread runtimes
User-Level Threads
Capriccio is unique: blocking graph, resource-aware scheduling
Targets a large number of blocking threads
POSIX compliant
Application-Specific Optimization
Most approaches require programmers to tailor their application to manage resources
Nonstandard APIs, less portable
Stack Management
No garbage collection
Future Work
Multi-CPU machines
Profiling tools for system tuning