Enhancing Scalability and Energy Efficiency through Application-Aware Communication Reduction

David Lilja, Ulya R. Karpuzcu, John Sartori
{lilja, ukarpuzc, jsartori}@umn.edu
Department of Electrical and Computer Engineering


Page 1: Enhancing Scalability and Energy Efficiency through ...synergy.cs.vt.edu/2015-nsf-xps-workshop/reports/...Jun 01, 2015  · Department of Electrical and Computer Engineering Motivation



Motivation

[Figure: fraction of time spent in communication (y-axis, 0 to 0.7) vs. processor count (x-axis: 64, 128, 512, 1024, 2048) for WRF, AVUS, AMR, and HYCOM.]

Exascale computing study: Technology challenges in achieving exascale systems, 2008

[Same plot, annotated with "10% or less" and "50% or more".]


Motivation

                 Problem Size (PS)   PS per core   time (t)    Performance PS/t   PS share per core
Strong Scaling   unchanged           /N            /N          xN                 /N
Weak Scaling     xN                  unchanged     unchanged   xN                 /N
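The factors in the table can be checked with a small numeric sketch. The function name `scaling_row` and the baseline values are illustrative assumptions, not part of the slides, and the model assumes ideal scaling with no communication overhead, which is the very assumption the Motivation plot calls into question.

```python
from fractions import Fraction

def scaling_row(mode, n, base_ps=Fraction(1024), base_time=Fraction(8)):
    """Ideal (overhead-free) scaling from 1 core to n cores.

    Returns the factor by which each quantity changes relative to the
    single-core baseline. base_ps and base_time are arbitrary examples.
    """
    if mode == "strong":
        ps = base_ps              # total problem size is fixed
        time = base_time / n      # the same work is split n ways
    elif mode == "weak":
        ps = base_ps * n          # total problem size grows with n
        time = base_time          # per-core work, hence time, is fixed
    else:
        raise ValueError("mode must be 'strong' or 'weak'")
    ps_per_core = ps / n
    return {
        "PS": ps / base_ps,
        "PS per core": ps_per_core / base_ps,
        "t": time / base_time,
        "PS/t": (ps / time) / (base_ps / base_time),
        # With one core the share is 1, so the share itself is the factor.
        "PS share per core": ps_per_core / ps,
    }

for mode in ("strong", "weak"):
    print(mode, {k: str(v) for k, v in scaling_row(mode, 4).items()})
```

In both regimes each core ends up holding 1/N of the total problem, so the share of data that lives on other cores grows with N.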


Executive Summary

• Goal: application-centric minimization of communication overhead
  • Communication overhead is proportional to
    • the frequency of communication
    • the number of data objects (quantity) to be communicated
• Bag of tricks to regulate the frequency and/or the quantity:
  • Approximate synchronization
    • To minimize frequency of communication
  • (Lossy) compression
    • To minimize quantity of data communicated
  • Accelerators in memory
    • To minimize both the frequency and the quantity


Approximate Synchronization

• Synchronization imposes a partial or total order on parallel tasks
  • Each synchronization point represents a point of serialization
  • Serialization impairs parallel scalability
• Eliminate a subset of synchronization points to enhance parallel scalability
  • Divergence from fully-synchronized execution may result in
    • Catastrophic program termination
    • Lower computation accuracy
  • For approximation to be viable, accuracy loss must be bounded
• Taxonomy
  • Relaxed synchronization
    • Eliminate synchronized accesses to "non-critical" data
  • Localized synchronization
    • Eliminate energy-hungry non-local synchronization
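A toy illustration of dropping a subset of synchronization points, offered as a sketch rather than the authors' method: a 1-D Jacobi smoothing is split into chunks, and neighbouring chunks exchange boundary ("halo") values only every `sync_every` iterations. The function `jacobi`, the problem size, and the iteration counts are all assumptions made for the example. The accuracy loss stays bounded here only because this particular iteration is a contraction; nothing of the sort is guaranteed for arbitrary programs.

```python
def jacobi(values, iterations, workers, sync_every):
    """1-D Jacobi smoothing with fixed end points, split into `workers` chunks.

    Each chunk reads its neighbours' boundary cells from a halo copy that is
    refreshed only every `sync_every` iterations (1 = fully synchronized).
    Returns (result, number_of_halo_exchanges).
    """
    n = len(values)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    chunks = [list(values[lo:hi]) for lo, hi in bounds]
    halos = []
    exchanges = 0
    for it in range(iterations):
        if it % sync_every == 0:
            # Synchronization point: every chunk publishes its boundary cells.
            halos = [
                (chunks[w - 1][-1] if w > 0 else None,
                 chunks[w + 1][0] if w < workers - 1 else None)
                for w in range(workers)
            ]
            exchanges += 1
        updated = []
        for (left_halo, right_halo), chunk in zip(halos, chunks):
            new = []
            for i, x in enumerate(chunk):
                left = chunk[i - 1] if i > 0 else left_halo
                right = chunk[i + 1] if i + 1 < len(chunk) else right_halo
                # Global end points have no neighbour on one side: keep them fixed.
                new.append(x if left is None or right is None else 0.5 * (left + right))
            updated.append(new)
        chunks = updated
    return [x for chunk in chunks for x in chunk], exchanges

n = 16
start = [0.0] * (n - 1) + [1.0]
exact, full_syncs = jacobi(start, 2000, workers=4, sync_every=1)
relaxed, few_syncs = jacobi(start, 2000, workers=4, sync_every=4)
max_err = max(abs(a - b) for a, b in zip(exact, relaxed))
print(f"exchanges: {full_syncs} vs {few_syncs}, max deviation: {max_err:.2e}")
```

The relaxed run performs a quarter of the halo exchanges; the deviation it reports is the accuracy cost paid for them.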


(Lossy) Compression

• Communication in time vs. communication in space
  • Compression in memory
  • Compression in communication medium
• Taxonomy
  • Criticality-based
    • Discard non-critical data
  • Equation-based
    • Represent data as a formula or subroutine
    • Construct or compute data at the destination
  • Distribution-based
    • Represent data as a parametrized distribution
    • Communicate parameters only
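The distribution-based entry can be sketched as follows. This is a minimal illustration under a strong assumption, namely that the data are well described by a normal distribution; the helper names `compress` and `reconstruct` are hypothetical and do not come from the slides. The sender transmits three numbers instead of the full array, and the receiver draws surrogate data with the same statistics. Individual values are not preserved, which is what makes the scheme lossy.

```python
import random
import statistics

def compress(samples):
    """Sender side: summarize the data as (count, mean, standard deviation)."""
    return len(samples), statistics.fmean(samples), statistics.stdev(samples)

def reconstruct(params, seed=0):
    """Receiver side: draw surrogate data from the communicated parameters."""
    count, mean, stdev = params
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(count)]

# Example data (assumed Gaussian for the purpose of the sketch).
source_rng = random.Random(42)
original = [source_rng.gauss(10.0, 2.0) for _ in range(10_000)]

params = compress(original)          # 3 numbers cross the link...
surrogate = reconstruct(params)      # ...instead of 10,000

print("values sent:", len(params), "instead of", len(original))
print("mean  original/surrogate:", params[1], statistics.fmean(surrogate))
print("stdev original/surrogate:", params[2], statistics.stdev(surrogate))
```

Whether such a surrogate is acceptable depends on the application: it suits consumers that only need aggregate statistics, not ones that index individual elements.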


Accelerators in Memory

• Target: communication-heavy data analysis routines
  • Compression
  • Sort
  • Search
  • Histogram
  • Fit
  • Reduction or filtering
  • Distribution or entropy extraction
  • …
• Challenges
  • How to map applications to accelerators?
  • How to distribute the accelerators across the memory hierarchy?
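To make the appeal of the histogram target concrete, the sketch below counts how many values would have to cross the memory interface with and without binning at the data. It is a software stand-in only: `histogram_near_memory` is a hypothetical name, no real accelerator API is modelled, and the word counts are a simplified proxy for traffic.

```python
def histogram_near_memory(data, num_bins, lo, hi):
    """Stand-in for an in-memory accelerator: bins the data where it lives.

    Assumes lo <= x <= hi for every element.
    """
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in data:
        idx = min(int((x - lo) / width), num_bins - 1)  # clamp x == hi into the last bin
        counts[idx] += 1
    return counts

data = [i % 100 for i in range(100_000)]
counts = histogram_near_memory(data, num_bins=10, lo=0, hi=100)

words_moved_baseline = len(data)        # ship every element to the processor
words_moved_accelerated = len(counts)   # ship only the bin counts

print("bin counts:", counts)
print("words moved:", words_moved_baseline, "vs", words_moved_accelerated)
```

The same pattern applies to the other listed routines whose output is much smaller than their input, such as reduction, filtering, and search.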


Putting It All Together

• Goal: application-centric minimization of communication overhead
• Bag of tricks
  • Approximate synchronization
    • To minimize frequency of communication
  • (Lossy) compression
    • To minimize quantity of data communicated
  • Accelerators in memory
    • To minimize both the frequency and the quantity
• Agenda (in progress)
  • How effective is each trick?
  • How do tricks interact with each other?


Approximate Synchronization

[Figure: histogram of # occurrence (0 to 25) vs. % accuracy loss (0.0002 to 0.0010).]

[Figure: % of execution time (0 to 80) vs. # active threads (0 to 60) for Classic, SLE, and BLE.]


Approximate Synchronization

[Figure: % of execution time (0 to 80) vs. # active threads (0 to 60) for Classic, SLE, and BLE.]

[Figure: speed-up (0 to 40) vs. # threads (1, 2, 4, 8, 16, 32, 64) for Classic, SLE, and BLE.]


(Lossy) Compression

[Figure 2 (bar chart): % reduction in runtime and in bandwidth for histogram, SobelFilter, nbody, recursiveGaussian, and AVERAGE; y-axis 0 to 40.]

Figure 2: Reducing communication in bandwidth-constrained applications can significantly improve performance by enabling heightened parallelism and scalability.

Figure 3: Original Gaussian blur filtering (L) and communication reduction result (R).

Figure 4: Lena images processed by the Sobel edge detection filter with communication reduction for 0%, 50%, and 100% of loads.

in several scenarios. Figure 3 compares the pristine filter result for Gaussian blur filtering to the communication reduction result for the Lena input image. Figure 4 shows a progression of output images produced by Sobel edge detection filtering the Lena input while using communication reduction for an increasing fraction of loads (from 0% to 100%). The 0% case, of course, shows the pristine output. For the sample application results shown in Figures 3 and 4, the output images after communication reduction are visually indistinguishable from the pristine filtered images. Error tolerance stems from the spatially correlated image data and the nature of the computations being performed.

Research Issues to be Explored

We plan to address the following research questions in the area of compressed communication.
1. What formats of compressed communication result in the best quality-efficiency tradeoffs for different classes of applications?
2. How to quantify the impact of compressed communication on application output quality?
3. How to adjust the level of compression, possibly adaptively at runtime?

3.1.2 Approximate Synchronization

Clearly, even a relatively small dependence on synchronization can seriously inhibit efficiency as we seek greater levels of parallelism. However, many (perhaps most) real parallel applications require non-trivial amounts of synchronization. To illustrate this point, we performed a survey of parallel applications, characterizing a variety of application classes on a variety of parallel architecture platforms. We surveyed benchmarks from PARSEC [?], NAS [?], EEMBC [?], Parboil [?], and NVIDIA CUDA SDK [?] benchmark suites. Benchmarks were run natively on the parallel processor platforms described at the bottom of Table 3. Table 3 shows the percentage of execution time spent performing synchronization operations (barriers, atomics, locks) for the surveyed applications.
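One common way to see why a small synchronization fraction matters is Amdahl's law, which the text does not name explicitly. The sketch below treats time spent in synchronization as purely serial, which is a simplifying assumption, and the fractions and core counts are illustrative values rather than measurements from Table 3.

```python
def amdahl_speedup(serial_fraction, n):
    """Upper bound on speed-up with n processors under Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

for serial_fraction in (0.01, 0.10):
    for n in (64, 1024):
        speedup = amdahl_speedup(serial_fraction, n)
        print(f"serial fraction {serial_fraction:.0%}, {n:5d} processors: "
              f"speed-up <= {speedup:6.2f} (efficiency {speedup / n:.1%})")
```

Under this model the speed-up can never exceed 1 / serial_fraction, however many processors are added, so efficiency falls steadily as the processor count grows.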

The data in Table 3 show that the majority (58%) of surveyed applications spend over 10% of their runtime performing synchronization operations, while nearly all (97%) of applications spend over 1% of their runtime
