CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.

Summary (1) Architecture

• Modern architecture designs are driven by energy constraints

• Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput

• Some parallelism is implicit (out-of-order superscalar processing), but it has limits

• Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it

Summary (2) Memory

• Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning platter hard disks on the other

• Locality (relationships between memory accesses) can help us get the best of all cases

• Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)

Summary (3) Software

• Want to fully occupy your hardware?
  – Express locality (tiling)
  – Vectorize (compiler or manual)
  – Multithread (e.g. OpenMP)
  – Accelerate (e.g. CUDA, OpenCL)

• Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.

Research Perspective (2010)

• Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
  – Across multiple architectures
  – Across many applications

• What kinds of performance trends are we seeing from successive GPU generations?

• Conclusion – GPUs aren’t special, and parallel programming is getting easier

Application Survey

• Surveyed the GPU Computing Gems chapters

• Studied the Parboil benchmarks in detail

Results:

• Eight (for now) major categories of optimization transformations
  – Performance impact of individual optimizations on certain Parboil benchmarks included in the paper

1. (Input) Data Access Tiling

[Figure: data paths from DRAM. With a cache, data is copied implicitly and then accessed locally; with a scratchpad, data is copied explicitly and then accessed locally.]
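A minimal sketch of the explicit-copy flavor of tiling, assuming a dense matrix multiply as the workload (the slide does not name one). The function names and the `tile` parameter are illustrative only, and Python lists stand in for DRAM and scratchpad storage, so this shows the access pattern rather than any speedup.

```python
def matmul_naive(a, b):
    """Reference version: every operand is read straight from the full matrices."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Tiled version: stage small blocks in local buffers, then reuse them."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Explicit copy: bring one tile of each input into a small buffer
                # (the role a scratchpad would play on real hardware).
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Local access: all reads in the inner loops hit the staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

Each staged tile is read several times before the next one is loaded, which is the locality the tiling is meant to expose. Slicing handles matrix sizes that are not a multiple of `tile`.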

2. (Output) Privatization

• Avoid contention by aggregating updates locally

• Requires storage resources to keep copies of data structures

[Figure: private results are aggregated into local results, which are then aggregated into global results.]
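A rough sketch of privatization, assuming a histogram as the contended output (the slide does not name a workload). The workers are simulated sequentially and `histogram_privatized` is a made-up name; on real hardware each private copy would live in registers or local memory, and the merge step would be the only place that touches the shared result.

```python
def histogram_privatized(values, num_bins, num_workers=4):
    """Build a histogram from per-worker private copies merged at the end."""
    # Split the input into contiguous chunks, one per (simulated) worker.
    chunk = (len(values) + num_workers - 1) // num_workers
    private = []
    for w in range(num_workers):
        local = [0] * num_bins  # extra storage: one full copy per worker
        for v in values[w * chunk:(w + 1) * chunk]:
            local[v % num_bins] += 1  # uncontended update to the private copy
        private.append(local)

    # Merge: the shared result is updated once per bin per worker,
    # instead of once per input element.
    result = [0] * num_bins
    for local in private:
        for b in range(num_bins):
            result[b] += local[b]
    return result
```

The cost named on the slide is visible here: `num_workers` copies of the output structure exist at the same time.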

Running Example: SpMV

Ax = v

[Figure: sparse matrix A stored as Row, Data, and Col arrays, with input vector x and output vector v.]

3. “Scatter to Gather” Transformation

Ax = v

[Figure: sparse matrix A stored as Row, Data, and Col arrays, with input vector x and output vector v.]
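A sketch of the two formulations for the SpMV running example, under assumptions: the scatter form walks nonzeros stored as coordinate triples, and the gather form uses a CSR-style row-pointer layout. The slides only show Row, Data, and Col arrays, so the exact storage format and all function names here are guesses for illustration.

```python
def spmv_scatter(rows, cols, data, x, num_rows):
    """Input-driven: each nonzero pushes its contribution into v."""
    v = [0] * num_rows
    for r, c, d in zip(rows, cols, data):
        # Several nonzeros target the same v[r]; in parallel code this
        # update would need an atomic or some other form of coordination.
        v[r] += d * x[c]
    return v


def coo_to_csr(rows, cols, data, num_rows):
    """Group nonzeros by row so each output can find its own inputs."""
    row_ptr = [0] * (num_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1
    for i in range(num_rows):
        row_ptr[i + 1] += row_ptr[i]
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr is not modified
    csr_cols = [0] * len(data)
    csr_data = [0] * len(data)
    for r, c, d in zip(rows, cols, data):
        k = next_slot[r]
        csr_cols[k], csr_data[k] = c, d
        next_slot[r] += 1
    return row_ptr, csr_cols, csr_data


def spmv_gather(row_ptr, csr_cols, csr_data, x):
    """Output-driven: each row pulls the inputs it needs and writes once."""
    v = []
    for r in range(len(row_ptr) - 1):
        acc = 0  # private accumulator, no contention
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += csr_data[k] * x[csr_cols[k]]
        v.append(acc)
    return v
```

In the gather form every output element has exactly one writer, so rows can be processed independently.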

4. Binning
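The slide gives only the title, so the following is one common reading of binning, not necessarily the lecture's example: sort inputs into bins by spatial region so that a gather only has to inspect nearby bins. The 1-D setting, the function names, and the assumption that the cutoff is no larger than the bin width are all illustrative.

```python
from collections import defaultdict


def build_bins(points, bin_width):
    """Map each bin index to the indices of the points that fall in it."""
    bins = defaultdict(list)
    for idx, p in enumerate(points):
        bins[int(p // bin_width)].append(idx)
    return bins


def neighbors_within(points, bins, bin_width, query, cutoff):
    """Find points within `cutoff` of `query` by checking adjacent bins only.

    Assumes cutoff <= bin_width, so no neighbor can lie farther than
    one bin away from the query's own bin.
    """
    home = int(query // bin_width)
    found = []
    for b in (home - 1, home, home + 1):
        for idx in bins.get(b, []):
            if abs(points[idx] - query) <= cutoff:
                found.append(idx)
    return sorted(found)
```

Without the bins, every query would have to test every point; with them, the work per query depends only on the local density.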

5. Regularization (Load Balancing)

6. Compaction
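Only the title survives for this slide, so this is a sketch of the usual meaning of compaction: packing the elements that still need work into a dense array. It assumes the standard flag-plus-exclusive-scan formulation; `compact` and `keep` are illustrative names.

```python
def compact(values, keep):
    """Pack the elements for which keep(v) is true into a dense list."""
    flags = [1 if keep(v) else 0 for v in values]

    # Exclusive prefix sum of the flags gives each kept element its
    # destination index in the output.
    positions = []
    total = 0
    for f in flags:
        positions.append(total)
        total += f

    out = [None] * total
    for v, f, pos in zip(values, flags, positions):
        if f:
            out[pos] = v  # each kept element writes to a unique slot
    return out
```

Because every destination index is known before any write happens, the final loop has no ordering dependence and could run in parallel.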

7. Data Layout Transformation
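The slide content is not recoverable beyond its title, so as an assumed example this sketch shows the array-of-structures to structure-of-arrays conversion, one widely used layout transformation. The flat-list representation and the name `aos_to_soa` are illustrative.

```python
def aos_to_soa(flat_aos, num_fields):
    """Reorder a flat array from [x0, y0, z0, x1, y1, z1, ...]
    to [x0, x1, ..., y0, y1, ..., z0, z1, ...]."""
    n = len(flat_aos) // num_fields  # number of records
    flat_soa = [None] * len(flat_aos)
    for i in range(n):
        for f in range(num_fields):
            # AoS index: record-major. SoA index: field-major.
            flat_soa[f * n + i] = flat_aos[i * num_fields + f]
    return flat_soa
```

After the transformation, neighboring workers that read the same field of neighboring records touch adjacent memory locations, which is the access pattern coalescing hardware favors.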

8. Granularity Coarsening

• Parallel execution often requires redundant and coordination work
  – Merging multiple threads into one allows reuse of results, reducing redundancy

[Figure: essential and redundant work over time for 4-way parallel versus 2-way parallel execution.]
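A toy illustration of coarsening, assuming a workload where every output in a row depends on a per-row setup value. The functions and the call counter are made up for this sketch; the counter simply makes the redundant work visible.

```python
def run_fine(n_rows, n_cols, row_setup):
    """One work item per output element: setup is redone for every element."""
    setup_calls = 0
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            s = row_setup(i)  # redundant across the items of the same row
            setup_calls += 1
            out[i][j] = s + j
    return out, setup_calls


def run_coarsened(n_rows, n_cols, row_setup, factor):
    """One work item per `factor` adjacent outputs: setup is shared."""
    setup_calls = 0
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j0 in range(0, n_cols, factor):
            s = row_setup(i)  # computed once, reused `factor` times
            setup_calls += 1
            for j in range(j0, min(j0 + factor, n_cols)):
                out[i][j] = s + j
    return out, setup_calls
```

With 4 columns, `factor=2` halves the number of work items per row and halves the setup calls, at the price of less available parallelism.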

How much faster do applications really get each hardware generation?

Unoptimized Code Has Improved Drastically

• Orders of magnitude speedup in many cases

• Hardware does not solve all problems
  – Coalescing (lbm)
  – Highly contentious atomics (bfs)

Optimized Code Is Improving Faster than “Peak Performance”

• Caches capture locality that scratchpads can’t capture efficiently (spmv, stencil)

• Increased local storage capacity enables extra optimization (sad)

• Some benchmarks need atomic throughput more than flops (bfs, histo)

Optimization Still Matters

• Hardware never changes algorithmic complexity (cutcp)

• Caches do not solve layout problems for big data (lbm)

• Coarsening still makes a big difference (cutcp, sgemm)

• Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)

Stuff we haven’t covered

• Good tools exist for profiling code beyond simple timing (cache misses, etc.). If you can’t find why a particular piece of code is taking so long, look into hardware performance counters.

• Patterns and practice
  – We covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.

Fill Out Evaluations!