Department of Electrical and Computer Engineering
Enhancing Scalability and Energy Efficiency through
Application-Aware Communication Reduction
David Lilja, Ulya R. Karpuzcu, John Sartori
{lilja, ukarpuzc, jsartori}@umn.edu

Motivation
[Figure: fraction of time spent in communication (0 to 0.7) versus processor count (64, 128, 512, 1024, 2048) for WRF, AVUS, AMR, and HYCOM.]
Exascale computing study: Technology challenges in achieving exascale systems, 2008
[Second build of the same figure, with the annotations "10% or less" and "50% or more".]

Motivation
                 Problem Size (PS)   PS per core   Time (t)    Performance (PS/t)   PS share per core
Strong Scaling   unchanged           /N            /N          xN                   /N
Weak Scaling     xN                  unchanged     unchanged   xN                   /N
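The scaling factors in the table can be checked with a small calculation. The sketch below assumes ideal scaling (no communication overhead at all) and a one-core baseline; the function name and the `base_ps` and `base_t` parameters are hypothetical, introduced only for illustration.

```python
def ideal_scaling(mode, n_cores, base_ps=1.0, base_t=1.0):
    """Idealized strong/weak scaling relative to a one-core baseline.

    Assumption: perfect scaling, i.e. zero communication overhead.
    base_ps and base_t are the problem size and runtime on one core.
    """
    if mode == "strong":
        ps = base_ps              # total problem size stays fixed
        t = base_t / n_cores      # runtime shrinks by N
    elif mode == "weak":
        ps = base_ps * n_cores    # total problem size grows by N
        t = base_t                # runtime stays fixed
    else:
        raise ValueError("mode must be 'strong' or 'weak'")
    ps_per_core = ps / n_cores
    return {
        "problem_size": ps,
        "ps_per_core": ps_per_core,
        "time": t,
        "performance": ps / t,                  # PS/t
        "ps_share_per_core": ps_per_core / ps,  # fraction of PS held by one core
    }
```

In both modes the ideal performance grows by a factor of N and each core holds a 1/N share of the problem. Any time spent in communication pulls the measured numbers below these values.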

Executive Summary
• Goal: application-centric minimization of communication overhead
  • Communication overhead is proportional to
    • the frequency of communication
    • the number of data objects (quantity) to be communicated
• Bag of tricks to regulate the frequency and/or the quantity:
  • Approximate synchronization
    • To minimize frequency of communication
  • (Lossy) compression
    • To minimize quantity of data communicated
  • Accelerators in memory
    • To minimize both the frequency and the quantity
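The dependence on frequency and quantity can be illustrated with a simple cost model. The latency-plus-bandwidth form below is an assumption brought in for illustration, since the slides only state the proportionality. The function name and all default values (`bytes_per_object`, `latency_s`, `bandwidth_bytes_per_s`) are hypothetical.

```python
def communication_overhead(n_messages, objects_per_message,
                           bytes_per_object=8,
                           latency_s=1e-6,
                           bandwidth_bytes_per_s=1e9):
    """Estimated communication time in seconds.

    Assumed model: every message pays a fixed latency plus a transfer
    time proportional to the number of data objects it carries.
    """
    transfer_s = objects_per_message * bytes_per_object / bandwidth_bytes_per_s
    return n_messages * (latency_s + transfer_s)
```

Under this model, lowering `n_messages` corresponds to communicating less often, and lowering `objects_per_message` corresponds to communicating less data each time.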

Approximate Synchronization
• Synchronization imposes a partial or total order on parallel tasks
  • Each synchronization point represents a point of serialization
  • Serialization impairs parallel scalability
• Eliminate a subset of synchronization points to enhance parallel scalability
  • Divergence from fully-synchronized execution may result in
    • Catastrophic program termination
    • Lower computation accuracy
  • For approximation to be viable, accuracy loss must be bounded
• Taxonomy
  • Relaxed synchronization
    • Eliminate synchronized accesses to "non-critical" data
  • Localized synchronization
    • Eliminate energy-hungry non-local synchronization
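As a minimal sketch of eliminating a subset of synchronization points, the single-threaded simulation below splits a 1-D Jacobi relaxation between two simulated workers. Each worker refreshes its copy of the neighbour's boundary value only every `sync_every` iterations and otherwise computes with a stale copy. The choice of kernel, the function names, and the parameters are assumptions made for illustration; this is not the authors' mechanism.

```python
def relaxed_jacobi(n=8, iters=1000, sync_every=1):
    """Jacobi relaxation on n interior points with fixed ends 0.0 and 1.0.

    Two simulated workers each own half of the points. A synchronization
    point (boundary exchange) happens only every `sync_every` iterations;
    sync_every=1 reproduces fully synchronized execution.
    Returns (values, number_of_synchronization_points).
    """
    half = n // 2
    left = [0.0] * half           # points owned by worker 0
    right = [0.0] * (n - half)    # points owned by worker 1
    ghost_l = 0.0                 # worker 0's (possibly stale) copy of right[0]
    ghost_r = 0.0                 # worker 1's (possibly stale) copy of left[-1]
    syncs = 0
    for it in range(iters):
        if it % sync_every == 0:  # synchronization point
            ghost_l, ghost_r = right[0], left[-1]
            syncs += 1
        padded_l = [0.0] + left + [ghost_l]
        padded_r = [ghost_r] + right + [1.0]
        # Each point becomes the average of its two neighbours.
        left = [0.5 * (padded_l[i] + padded_l[i + 2]) for i in range(len(left))]
        right = [0.5 * (padded_r[i] + padded_r[i + 2]) for i in range(len(right))]
    return left + right, syncs


def max_error(values):
    """Largest deviation from the exact solution, a straight line from 0 to 1."""
    n = len(values)
    return max(abs(v - (i + 1) / (n + 1)) for i, v in enumerate(values))
```

Calling `relaxed_jacobi(sync_every=4)` performs a quarter of the boundary exchanges of `relaxed_jacobi(sync_every=1)`, and `max_error` gives the accuracy cost of doing so. This kernel tolerates stale boundary values, so the accuracy loss stays bounded. Kernels without that property may diverge, which is the failure mode the slide warns about.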

(Lossy) Compression
• Communication in time vs. communication in space
  • Compression in memory
  • Compression in communication medium
• Taxonomy
  • Criticality-based
    • Discard non-critical data
  • Equation-based
    • Represent data as a formula or subroutine
    • Construct or compute data at the destination
  • Distribution-based
    • Represent data as a parametrized distribution
    • Communicate parameters only
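The distribution-based variant can be sketched as follows. The sender communicates only three parameters, and the receiver regenerates a statistically similar data set. The Gaussian assumption, the function names, and the `seed` parameter are hypothetical choices for illustration; the slides do not prescribe a particular distribution.

```python
import random
import statistics


def compress(samples):
    """Sender side: summarize the samples as (count, mean, standard deviation)."""
    return len(samples), statistics.fmean(samples), statistics.pstdev(samples)


def reconstruct(params, seed=0):
    """Receiver side: draw `count` values from the communicated distribution.

    Assumption: the data are adequately described by a Gaussian. The
    individual values are NOT recovered, only their distribution.
    """
    count, mean, std = params
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(count)]
```

For example, `compress([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])` reduces eight values to three parameters. The saving grows with the sample count, but the scheme is only suitable when the consumer needs aggregate behaviour rather than exact values.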

Accelerators in Memory
• Target: communication-heavy data analysis routines
  • Compression
  • Sort
  • Search
  • Histogram
  • Fit
  • Reduction or filtering
  • Distribution or entropy extraction
  • …
• Challenges
  • How to map applications to accelerators?
  • How to distribute the accelerators across the memory hierarchy?
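To see why a routine such as histogramming is a natural target, the sketch below models the data movement in software. If the histogram is computed next to the data, only the bin counts cross the memory interface instead of every element. Both functions are hypothetical illustrations and do not model any real accelerator or its interface.

```python
def histogram_in_memory(data, n_bins, lo, hi):
    """Software stand-in for a histogram computed next to the data.

    Values are assumed to lie in [lo, hi]; the top edge falls in the last bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        b = min(int((x - lo) / width), n_bins - 1)
        counts[b] += 1
    return counts


def values_transferred(n_data, n_bins):
    """Number of values that must reach the processor in each scenario."""
    return {"host_side": n_data, "in_memory": n_bins}
```

With 1000 input values and 5 bins, the host-side version moves 1000 values while the in-memory version moves 5.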

Putting It All Together
• Goal: application-centric minimization of communication overhead
• Bag of tricks
  • Approximate synchronization
    • To minimize frequency of communication
  • (Lossy) compression
    • To minimize quantity of data communicated
  • Accelerators in memory
    • To minimize both the frequency and the quantity
• Agenda (in progress)
  • How effective is each trick?
  • How do tricks interact with each other?

Approximate Synchronization
[Figure: # occurrence (0 to 20) over an x-axis spanning 0.0002 to 0.0010, annotated "25% accuracy loss".]
[Figure: % of execution time (0 to 80) versus # active threads (0 to 60); legend: Classic, SLE, BLE.]

Approximate Synchronization
[Figure: speed-up (0 to 40) versus # threads (1, 2, 4, 8, 16, 32, 64); legend: Classic, SLE, BLE. Shown next to the plot of % of execution time versus # active threads.]

(Lossy) Compression
[Bar chart: % reduction (0 to 40), with Runtime Reduction and Bandwidth Reduction series, for histogram, SobelFilter, nbody, recursiveGaussian, and AVERAGE.]
Figure 2: Reducing communication in bandwidth-constrained applications can significantly improve performance by enabling heightened parallelism and scalability.
Figure 3: Original Gaussian blur filtering (L) and communication reduction result (R).
Figure 4: Lena images processed by the Sobel edge detection filter with communication reduction for 0%, 50%, and 100% of loads.

[…] in several scenarios. Figure 3 compares the pristine filter result for Gaussian blur filtering to the communication reduction result for the Lena input image. Figure 4 shows a progression of output images produced by Sobel edge detection filtering the Lena input while using communication reduction for an increasing fraction of loads (from 0% to 100%). The 0% case, of course, shows the pristine output. For the sample application results shown in Figures 3 and 4, the output images after communication reduction are visually indistinguishable from the pristine filtered images. Error tolerance stems from the spatially correlated image data and the nature of the computations being performed.

Research Issues to be Explored
We plan to address the following research questions in the area of compressed communication.
1. What formats of compressed communication result in the best quality-efficiency tradeoffs for different classes of applications?
2. How to quantify the impact of compressed communication on application output quality?
3. How to adjust the level of compression, possibly adaptively at runtime?

3.1.2 Approximate Synchronization
Clearly, even a relatively small dependence on synchronization can seriously inhibit efficiency as we seek greater levels of parallelism. However, many (perhaps most) real parallel applications require non-trivial amounts of synchronization. To illustrate this point, we performed a survey of parallel applications, characterizing a variety of application classes on a variety of parallel architecture platforms. We surveyed benchmarks from the PARSEC [?], NAS [?], EEMBC [?], Parboil [?], and NVIDIA CUDA SDK [?] benchmark suites. Benchmarks were run natively on the parallel processor platforms described at the bottom of Table 3. Table 3 shows the percentage of execution time spent performing synchronization operations (barriers, atomics, locks) for the surveyed applications.

The data in Table 3 show that the majority (58%) of surveyed applications spend over 10% of their runtime performing synchronization operations, while nearly all (97%) of applications spend over 1% of their runtime […]
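The claim that even a small dependence on synchronization inhibits efficiency at scale can be made concrete with Amdahl's law. The sketch below treats the synchronization share of runtime as fully serialized. That is a simplifying assumption made for illustration, not a measurement from the survey, and the function name is hypothetical.

```python
def speedup_bound(sync_fraction, n_cores):
    """Amdahl's-law speed-up when `sync_fraction` of the runtime is serial.

    Assumption: synchronization time does not shrink with more cores,
    while the remaining work parallelizes perfectly.
    """
    return 1.0 / (sync_fraction + (1.0 - sync_fraction) / n_cores)
```

Under this assumption, an application that spends 10% of its runtime synchronizing reaches a speed-up of only about 8.8 on 64 cores and can never exceed 10, however many cores are added. An application at 1% reaches about 39 on 64 cores.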