CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts

Post on 25-Feb-2016



Transcript of CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts

CS 395 Last Lecture

Summary, Anti-summary, and Final Thoughts


Summary (1) Architecture

• Modern architecture designs are driven by energy constraints

• Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput

• Some parallelism is implicit (out-of-order superscalar processing), but it has limits

• Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it


Summary (2) Memory

• Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning platter hard disks on the other

• Locality (relationships between memory accesses) can help us get the best of all cases

• Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)
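As a concrete illustration of locality (not from the slides), here is a minimal C sketch: both functions compute the same sum, but the row-major traversal matches the array's memory layout, so each cache line fetched from DRAM is fully used, while the column-major traversal touches a new line on nearly every access.

```c
#include <stddef.h>

#define N 256

/* Row-major traversal: sequential addresses, cache-friendly. */
double sum_row_major(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: strides of N * sizeof(double) bytes, so each
 * access lands on a different cache line; same result, worse locality. */
double sum_col_major(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```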


Summary (3) Software

• Want to fully occupy your hardware?
– Express locality (tiling)
– Vectorize (compiler or manual)
– Multithread (e.g. OpenMP)
– Accelerate (e.g. CUDA, OpenCL)

• Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.
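A minimal C sketch combining the first three items above, assuming a square row-major matrix (the tile size and names are illustrative): blocking expresses locality, the unit-stride inner loop is a pattern most compilers can auto-vectorize, and the OpenMP pragma multithreads the outer loop when built with `-fopenmp` (and is harmlessly ignored otherwise).

```c
#include <stddef.h>

#define TILE 32

/* Tiled matrix multiply C = A * B (all n x n, row-major). */
void matmul_tiled(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n * n; i++) C[i] = 0.0f;

    /* Each thread owns disjoint rows i, so no write conflicts. */
    #pragma omp parallel for
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* Work on one TILE x TILE block: fits in cache. */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        /* Unit-stride inner loop: vectorizable. */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```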


Research Perspective (2010)

• Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
– Across multiple architectures
– Across many applications

• What kinds of performance trends are we seeing from successive GPU generations?

• Conclusion – GPUs aren’t special, and parallel programming is getting easier


Application Survey

• Surveyed the GPU Computing Gems chapters
• Studied the Parboil benchmarks in detail

Results:
• Eight (for now) major categories of optimization transformations
– Performance impact of individual optimizations on certain Parboil benchmarks is included in the paper

1. (Input) Data Access Tiling

[Figure: three data-access schemes: direct DRAM access; DRAM with an implicit copy into a cache; DRAM with an explicit copy into a scratchpad. Local accesses are then served on-chip.]


2. (Output) Privatization

• Avoid contention by aggregating updates locally

• Requires storage resources to keep copies of data structures

[Figure: private results are merged into local results, then into global results]
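A minimal C sketch of the idea (the worker count and bin count are illustrative), shown sequentially: each worker accumulates into its own private histogram, so there is no contention on the shared bins, and the private copies are merged in a final reduction. With real threads, each chunk would run concurrently and only the merge needs synchronization.

```c
#include <stddef.h>
#include <string.h>

#define NBINS 16
#define NWORKERS 4

void histogram_privatized(const unsigned char *data, size_t n,
                          unsigned out[NBINS]) {
    unsigned priv[NWORKERS][NBINS];     /* one private copy per worker */
    memset(priv, 0, sizeof priv);

    for (int w = 0; w < NWORKERS; w++) {
        size_t lo = n * w / NWORKERS, hi = n * (w + 1) / NWORKERS;
        for (size_t i = lo; i < hi; i++)
            priv[w][data[i] % NBINS]++; /* private update: no atomics */
    }

    memset(out, 0, NBINS * sizeof *out);
    for (int w = 0; w < NWORKERS; w++)  /* merge the private results */
        for (int b = 0; b < NBINS; b++)
            out[b] += priv[w][b];
}
```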

Running Example: SpMV (Ax = v)

[Figure: sparse matrix A stored as Row, Data, and Col arrays (CSR format), multiplied by vector x to produce v]
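A minimal C sketch of the running example's storage format, assuming the Row/Data/Col arrays form a standard CSR representation (row[i]..row[i+1] spans the nonzeros of row i):

```c
#include <stddef.h>

/* Build the CSR ("Row, Data, Col") arrays from a dense n x n matrix. */
void dense_to_csr(size_t n, const double *A,
                  size_t *row, size_t *col, double *data) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        row[i] = k;                     /* first nonzero of row i */
        for (size_t j = 0; j < n; j++)
            if (A[i * n + j] != 0.0) {
                col[k]  = j;            /* column index of the nonzero */
                data[k] = A[i * n + j]; /* its value */
                k++;
            }
    }
    row[n] = k;                         /* total number of nonzeros */
}
```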

3. “Scatter to Gather” Transformation

[Figure: for Ax = v, the scatter form pushes each element's contributions into v, while the gather form has each element of v pull its inputs from A and x]
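A minimal C sketch of the transformation on SpMV (not from the slides): the scatter formulation loops over nonzeros and accumulates into shared output elements, which would require atomics if the loop ran in parallel; the gather formulation gives each output row its own private reduction, so no atomics are needed.

```c
#include <stddef.h>

/* Scatter: one iteration per nonzero, writing into a shared output.
 * Parallelizing over k would need atomic adds on v[rowidx[k]]. */
void spmv_scatter(size_t nnz, const size_t *rowidx, const size_t *col,
                  const double *data, const double *x, double *v) {
    for (size_t k = 0; k < nnz; k++)
        v[rowidx[k]] += data[k] * x[col[k]];   /* contended write */
}

/* Gather: one iteration per output row (CSR), reading its own nonzeros.
 * Each row's write is private, so rows can run in parallel safely. */
void spmv_gather(size_t nrows, const size_t *rowptr, const size_t *col,
                 const double *data, const double *x, double *v) {
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += data[k] * x[col[k]];
        v[i] = sum;
    }
}
```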

4. Binning

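The slide gives no code, so here is a minimal 1-D C sketch of binning (the bin count, capacity, and [0,1) domain are illustrative): points are pre-sorted into fixed-width spatial bins, so a range query only inspects the few bins that overlap it instead of scanning every point.

```c
#include <stddef.h>

#define BINS 8
#define MAXPTS 64

typedef struct {
    double pts[BINS][MAXPTS];  /* points stored per bin */
    size_t count[BINS];        /* occupancy of each bin */
} Bins;

/* Sort points in [0,1) into BINS equal-width bins. */
void bins_build(Bins *b, const double *p, size_t n) {
    for (int i = 0; i < BINS; i++) b->count[i] = 0;
    for (size_t i = 0; i < n; i++) {
        int bin = (int)(p[i] * BINS);
        b->pts[bin][b->count[bin]++] = p[i];
    }
}

/* Count points within radius r of q, touching only overlapping bins. */
size_t bins_range_count(const Bins *b, double q, double r) {
    int lo = (int)((q - r) * BINS), hi = (int)((q + r) * BINS);
    if (lo < 0) lo = 0;
    if (hi > BINS - 1) hi = BINS - 1;
    size_t cnt = 0;
    for (int bin = lo; bin <= hi; bin++)
        for (size_t k = 0; k < b->count[bin]; k++) {
            double d = b->pts[bin][k] - q;
            if (d < 0) d = -d;
            if (d <= r) cnt++;
        }
    return cnt;
}
```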

5. Regularization (Load Balancing)


6. Compaction


7. Data Layout Transformation

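A minimal C sketch of one common layout transformation, array-of-structs to struct-of-arrays (the particle fields are illustrative): in SoA form, all x values are contiguous, so consecutive GPU threads or vector lanes reading field x touch consecutive addresses and their accesses coalesce.

```c
#include <stddef.h>

#define NPART 1024

/* Array-of-Structs: one particle's fields are adjacent, so a kernel
 * that only reads x strides over unused y/z bytes. */
typedef struct { float x, y, z; } ParticleAoS;

/* Struct-of-Arrays: each field is a contiguous array. */
typedef struct { float x[NPART], y[NPART], z[NPART]; } ParticlesSoA;

void aos_to_soa(const ParticleAoS *in, ParticlesSoA *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```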

8. Granularity Coarsening

• Parallel execution often requires redundant computation and coordination work
– Merging multiple threads into one allows results to be reused, reducing redundancy

[Figure: timelines of 4-way and 2-way parallel execution, with each thread's work split into essential and redundant portions]
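A minimal C sketch of 2-way coarsening on a 3-point average (sizes illustrative): one coarse iteration produces two outputs and reuses the two loads they share, trading a little parallelism for less redundant memory traffic.

```c
#include <stddef.h>

/* Fine-grained: one iteration ("thread") per output; each loads all
 * three of its inputs independently. */
void blur_fine(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* 2-way coarsened: one iteration produces two adjacent outputs and
 * reuses the loads they share (in[i] and in[i+1]). */
void blur_coarse2(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 2 < n; i += 2) {
        float a = in[i - 1], b = in[i], c = in[i + 1], d = in[i + 2];
        out[i]     = (a + b + c) / 3.0f;
        out[i + 1] = (b + c + d) / 3.0f;   /* b, c reused */
    }
    if ((n - 2) % 2) {                     /* leftover interior element */
        size_t i = n - 2;
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}
```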

How much faster do applications really get each hardware generation?


Unoptimized Code Has Improved Drastically

• Orders of magnitude speedup in many cases

• Hardware does not solve all problems
– Coalescing (lbm)
– Highly contended atomics (bfs)


Optimized Code Is Improving Faster than “Peak Performance”

• Caches capture locality that scratchpad memory can’t capture efficiently (spmv, stencil)

• Increased local storage capacity enables extra optimization (sad)

• Some benchmarks need atomic throughput more than flops (bfs, histo)


Optimization Still Matters

• Hardware never changes algorithmic complexity (cutcp)

• Caches do not solve layout problems for big data (lbm)

• Coarsening still makes a big difference (cutcp, sgemm)

• Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)


Stuff we haven’t covered

• Good tools exist for profiling code beyond simple timing (cache misses, etc.). If you can’t find out why a particular piece of code is taking so long, look into hardware performance counters.

• Patterns and practice
– We covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.


Fill Out Evaluations!