3rd Joint Workshop on Embedded and Ubiquitous Computing
Effect of Context Aware Scheduler on TLB
Satoshi Yamada
PhD Candidate
Kusakabe Laboratory
Contents
• Introduction
• Overhead of Context Switch
• Context Aware Scheduler
• Benchmark Applications and Measurement Environment
• Result
• Related Works
• Conclusion
Widespread Multithreading
• Multithreading hides the latency of disk I/O and network access
• Threads in many languages, such as Java, Perl, and Python, correspond to OS threads
* More context switches happen today
* The process scheduler in the OS bears more responsibility for system performance
Context Switch and Caches
• The overhead of a context switch
  – includes that of loading the working set of the next process
  – is deeply related to the utilization of caches
• Agarwal et al., "Cache performance of operating system and multiprogramming workloads" (1988)
• Mogul et al., "The effect of context switches on cache performance" (1991)
[Figure: as Processes A and B alternate, their combined working sets overflow the cache, which ends up holding lines used by A only, by B only, and by both A and B]
Advantage of Sibling Threads
• fork() creates a PROCESS: the child's task_struct points to its own copies of the parent's mm_struct, signal_struct, open files, etc.
• clone() creates a THREAD: the child's task_struct shares the parent's mm_struct, signal_struct, open files, etc. (sibling threads)
• The OS does not have to switch memory address spaces when switching between sibling threads
* We can expect a reduction in the overhead of context switches
[Figure: fork() copies the parent's mm_struct and signal_struct into the child, while clone() shares them between parent and child]
Contents
• Introduction
• Overhead of Context Switch and TLB
• Context Aware Scheduler
• Benchmark Applications and Measurement Environment
• Result
• Related Works
• Conclusion
TLB Flush in a Context Switch
• The TLB is a cache that stores translations from virtual addresses to physical addresses
  – TLB translation latency: about 1 ns
  – TLB miss overhead: several accesses to memory
• On x86 processors, most TLB entries are invalidated (flushed) on every context switch because the memory address space changes
* A TLB flush does not happen on a context switch between sibling threads
Overhead Due to a Context Switch
[Figure: context switch overhead measured with lat_ctx from LMbench]
Contents
• Introduction
• Overhead of Context Switch and TLB
• Context Aware Scheduler
• Benchmark Applications and Measurement Environment
• Result
• Related Works
• Conclusion
O(1) Scheduler in Linux
• The O(1) scheduler runqueue has
  – an active queue and an expired queue
  – a priority bitmap and an array of linked lists of threads
• The O(1) scheduler
  – searches the priority bitmap
  – chooses a thread with the highest priority
* Scheduling overhead is independent of the number of threads
Context Aware (CA) Scheduler
• The CA scheduler creates an auxiliary runqueue per group of sibling threads
• The CA scheduler compares Preg and Paux
  – Preg: the highest priority in the regular O(1) scheduler runqueue
  – Paux: the highest priority in the auxiliary runqueue
• If Preg - Paux <= threshold, the thread with Paux is chosen
Context Aware Scheduler
• Our CA scheduler aggregates sibling threads
[Figure: schedules of threads A–E; the Linux O(1) scheduler causes 3 context switches between processes, while the CA scheduler, by running sibling threads back to back, causes only 1]
Fairness
• The O(1) scheduler keeps fairness by epochs
  – cycles of the active queue and the expired queue
• The CA scheduler also follows epochs
  – it guarantees the same level of fairness as the O(1) scheduler
Contents
• Introduction
• Overhead of Context Switch
• Context Aware Scheduler
• Benchmarks and Measurement Environment
• Result
• Related Works
• Conclusion
Benchmarks
• Java
  – Volano Benchmark (Volano)
  – lusearch program in the DaCapo benchmark suite (DaCapo)
• C
  – Chat benchmark (Chat)
  – memory program in the SysBench benchmark suite (SysBench)
[Table: information about each benchmark application]
Measurement Environment
• Hardware [table]
• Sun's J2SE 5.0
• Threshold of the context aware scheduler: 1 and 10
• Perfctr to count TLB misses
• GNU time command to measure total system performance
Contents
• Introduction
• Overhead of Context Switch
• Context Aware Scheduler
• Benchmarks and Measurement Environment
• Result
• Related Works
• Conclusion
Effect on TLB
• The CA scheduler significantly reduces TLB misses
• A bigger threshold is more effective
  – because dynamic priority changes thread priorities frequently, especially in DaCapo and Volano
[Table: TLB miss counts (millions)]
Effect on System Performance
[Tables: results by the time command (seconds); results of the counters in each application (seconds)]
The CA scheduler
• enhances the throughput of every application
• reduces the total elapsed time by 43%
Contents
• Introduction
• Overhead of Context Switch
• Context Aware Scheduler
• Benchmarks and Measurement Environment
• Result
• Related Works
• Conclusion
H. L. Sujay Parekh et al., "Thread-Sensitive Scheduling for SMT Processors" (2000)
• Parekh's scheduler
  – runs groups of threads in parallel and samples information about
    • IPC
    • TLB misses
    • L2 cache misses, etc.
  – schedules based on the sampled information
[Figure: alternating sampling and scheduling phases]
Pranay Koka et al., "Opportunities for Cache Friendly Process Scheduling" (2005)
• Koka's scheduler
  – traces the execution of each thread
  – focuses on the memory space shared between threads
  – schedules based on the information above
[Figure: alternating tracing and scheduling phases]
Conclusion
• Conclusions
  – The CA scheduler is effective in reducing TLB misses
  – The CA scheduler enhances the throughput of every application
• Future work
  – Evaluation on other platforms
  – Investigation of fairness within an epoch
    • comparison with the Completely Fair Scheduler (Linux 2.6.23)
Backup: Results of Context Switch
[Figure: context switch cost (microseconds) for Processes A, B, and C with working sets of 0, 1 MB, and 2 MB; L2 cache size: 2 MB]