![Page 1: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/1.jpg)
Thread Clustering
Thread Clustering: Sharing-Aware Thread Scheduling on SMP-CMP-SMT Multiprocessors
David Tam, Reza Azimi, Michael Stumm
University of Toronto
{tamda, azimi, stumm}@eecg.toronto.edu
![Page 3: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/3.jpg)
Multiprocessors Today
Example: IBM Power 5 system
[Figure: system diagram with the labels SMP, CMP, SMT, and SHARED CACHE]
![Page 4: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/4.jpg)
Multiprocessors Today
Example: IBM Power 5 system
Disparity in L2 latencies
[Figure: system diagram with latency labels 120 and 14]
![Page 5: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/5.jpg)
Operating Systems Today
CPU schedulers:
● Ignore disparity in L2 latencies
● Ignore data sharing among threads
● Distribute threads poorly
  ● Cross-chip traffic
  ● Remote L2 cache accesses
● Cause performance problems
![Page 6: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/6.jpg)
Our Goal: Sharing-Aware Scheduling
● Detect sharing patterns
● Cluster threads
Benefits:
● Decrease cross-chip traffic
● Increase on-chip cache locality
● Exploit shared L2 caches
![Page 7: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/7.jpg)
Our Online Technique
Steps (repeated):
1) Monitor remote cache access rate
2) Detect thread sharing patterns
3) Determine thread clusters
4) Migrate thread clusters
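The four repeated steps can be pictured as one pass of a control loop. The sketch below is only an illustration: the function name `clustering_round`, the `threshold` trigger, and the three callables are hypothetical and do not come from the slides, which do not say what condition moves the technique from monitoring to detection.

```python
from typing import Callable, Dict, List


def clustering_round(
    remote_access_rate: float,
    threshold: float,
    detect: Callable[[], Dict[int, List[int]]],
    cluster: Callable[[Dict[int, List[int]]], List[List[int]]],
    migrate: Callable[[List[List[int]]], None],
) -> bool:
    """Run one pass of the four steps; return True if clusters were migrated."""
    # Step 1: monitor. Assumed trigger: skip the pass while remote accesses are rare.
    if remote_access_rate < threshold:
        return False
    signatures = detect()           # Step 2: thread id -> sharing signature
    clusters = cluster(signatures)  # Step 3: groups of thread ids
    migrate(clusters)               # Step 4: hand each group to the migration hook
    return True
```

A caller would invoke `clustering_round` periodically, which corresponds to the REPEAT on the slide.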
![Page 8: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/8.jpg)
Sharing Detection
● To observe remote cache accesses:
  ● Exploit HPCs (hardware performance counters)
  ● Sample remote cache miss addresses
    ● Local cache misses satisfied by remote cache
    ● IBM Power 5 continuous data sampling
![Page 12: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/12.jpg)
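Sharing Signatures
● Construct for each thread
● Counts remote cache accesses
[Figure, conceptual: the virtual address space from virtual address 0 to virtual address 2^64 is divided into blocks, each with an 8-bit counter; a remote cache access to block i increments its counter (ctr_i++)]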
![Page 13: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/13.jpg)
Optimizations
● CPU: Temporal Sampling
  ● Sample every Nth remote cache access
● Memory: Spatial Sampling
  ● 256-entry vector
  ● Hash function
  ● Block ID filter
● Vectors still effective at indicating sharing
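Temporal sampling can be illustrated in software with a simple countdown. This is a sketch only: the class name `TemporalSampler` and its method are hypothetical, and on real hardware the "every Nth event" behaviour would presumably come from the performance counters rather than from code like this.

```python
class TemporalSampler:
    """Let through only every Nth event; N is a tunable, assumed parameter."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.seen = 0  # events observed since the last sample

    def should_sample(self) -> bool:
        """Call once per remote cache access; True on every Nth call."""
        self.seen += 1
        if self.seen == self.n:
            self.seen = 0
            return True
        return False
```

A larger N lowers the CPU cost of sampling at the price of a sparser view of remote accesses.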
![Page 14: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/14.jpg)
Spatial Sampling
● Hash collision & alias removal
[Figure, shown in five animation steps: a block ID is hashed into a 256-entry filter (entries 0 to 255, each Empty or Reserved) that sits alongside a 256-entry vector (entries 0 to 255)]
1) The block ID is hashed to a filter entry
2) Entry EMPTY: the block ID reserves it (First-Come-First-Reserved)
3) Entry reserved, block ID MATCH
4) Entry reserved, block ID MISMATCH: ALIASING PREVENTED
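The filter steps above can be sketched as follows. Several details are assumptions rather than facts from the slides: the 128-byte block size (`BLOCK_SHIFT`), the modulo hash, the saturating behaviour of the 8-bit counter, and the choice to count the access that first reserves an entry. The names `BlockFilter` and `record` are hypothetical.

```python
from typing import List, Optional

VECTOR_SIZE = 256   # entries per filter and per vector (from the slides)
COUNTER_MAX = 255   # largest value of an 8-bit counter
BLOCK_SHIFT = 7     # assumption: 128-byte blocks


class BlockFilter:
    """Maps each slot to the one block ID allowed to use it."""

    def __init__(self) -> None:
        self.reserved: List[Optional[int]] = [None] * VECTOR_SIZE  # None = Empty

    def admit(self, block_id: int) -> Optional[int]:
        """Return the slot for block_id, or None if another block owns it."""
        slot = block_id % VECTOR_SIZE          # placeholder hash function
        if self.reserved[slot] is None:
            self.reserved[slot] = block_id     # first come, first reserved
        if self.reserved[slot] != block_id:
            return None                        # mismatch: aliasing prevented
        return slot


def record(block_filter: BlockFilter, vector: List[int], address: int) -> bool:
    """Count one sampled remote access; return False if it was filtered out."""
    slot = block_filter.admit(address >> BLOCK_SHIFT)
    if slot is None:
        return False
    vector[slot] = min(vector[slot] + 1, COUNTER_MAX)  # saturate (assumed)
    return True
```

Keeping the filter separate from the counter vector lets several threads' vectors use the same slot-to-block mapping, so that slot i refers to the same block in every vector being compared.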
![Page 19: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/19.jpg)
Automated Clustering
Clustering Heuristic:
● Simple, one-pass algorithm
● Compare vector against existing clusters
● If not similar, create a new cluster
Similarity Metric:
● Shared blocks amplified
● Non-shared blocks nullified

$$\sum_{i=0}^{N} V_1[i] \cdot V_2[i]$$
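A minimal sketch of the metric and the one-pass heuristic is given below. The slides do not say how a vector is compared against an existing cluster or what counts as "similar", so comparing against the cluster's first member and using a fixed `threshold` are both assumptions; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def similarity(v1: Sequence[int], v2: Sequence[int]) -> int:
    """Dot product: entries shared by both vectors multiply up, others add 0."""
    return sum(a * b for a, b in zip(v1, v2))


def cluster_threads(
    signatures: Dict[int, Sequence[int]], threshold: int
) -> List[List[int]]:
    """One pass over the threads, joining the first similar cluster found."""
    clusters: List[List[int]] = []
    for tid, vec in signatures.items():
        for members in clusters:
            # Assumed representative: the first thread placed in the cluster.
            if similarity(vec, signatures[members[0]]) >= threshold:
                members.append(tid)
                break
        else:
            clusters.append([tid])  # not similar to any cluster: start a new one
    return clusters
```

For example, with signatures `{1: [9, 0, 0, 3], 2: [8, 0, 1, 0], 3: [0, 7, 0, 0], 4: [0, 9, 0, 0]}` and a threshold of 50, threads 1 and 2 share a heavily accessed first block and land in one cluster, while threads 3 and 4 form another.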
![Page 20: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/20.jpg)
Experimental Platform
● 8-way Power 5, 1.5GHz
● Linux 2.6
● IBM J2SE 5.0 JVM
[Figure: two sides of the system, each labeled 1.9MB L2, 36MB, and 4 GB]
![Page 21: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/21.jpg)
Workloads
Microbenchmark
● expect 4 clusters
● 4 threads per cluster
SPECjbb2000 (modified)
● expect 2 clusters
● 2 warehouses, 8 threads per warehouse
RUBiS + MySQL
● expect 2 clusters
● 2 databases, 16 threads per database
VolanoMark chat server
● expect 2 clusters
● 2 rooms, 8 threads per room
![Page 22: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/22.jpg)
Visualizing Clusters
● An example
[Figure: sharing vectors shaded by counter value (scale: 0, 64, 128, 255); braces mark Cluster A, 4 vectors, and Cluster B, 4 vectors]
![Page 25: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/25.jpg)
Visualizing Clusters
● Microbenchmark
[Figure: sharing vectors; a brace marks 4 vectors]
![Page 26: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/26.jpg)
Visualizing Clusters
● Modified SPECjbb2000 (4 warehouses)
[Figure: sharing vectors; a brace marks 16 vectors]
![Page 27: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/27.jpg)
Visualizing Clusters
● RUBiS + MySQL (2 databases)
[Figure: sharing vectors; a brace marks 24 vectors]
![Page 28: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/28.jpg)
Visualizing Clusters
● VolanoMark (4 rooms)
[Figure: sharing vectors]
![Page 29: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/29.jpg)
Remote Cache Impact
● Normalized to default Linux
[Figure: chart with value labels 32, 90, 43, 22, 72, 70, 92, -17]
![Page 30: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/30.jpg)
Performance Impact
● IPC: instructions per cycle
● Normalized to default Linux
[Figure: chart with value labels 7.4, 6.1, 6.1, 7.1, 5.1, 7.4, 5.0, 3.7, -0.8]
![Page 31: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/31.jpg)
Summary
BEFORE: Current Operating Systems
AFTER: Operating System With Thread Clustering
![Page 32: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/32.jpg)
Conclusions
● HPCs can detect sharing
● Sharing signatures are effective
● Automated thread clustering:
  ● Reduces remote cache accesses by up to 70%
  ● Improves performance by up to 7%
  ● All with low overhead
Future Work:
● More workloads
● Improve clustering algorithm
● Integration with load-balancing aspects
![Page 34: thread-clustering](https://reader033.fdocuments.us/reader033/viewer/2022060115/55792193d8b42a9c578b4ada/html5/thumbnails/34.jpg)
Sampling Overhead
● Modified SPECjbb2000