Avoiding Catastrophic Performance Loss

Description

This talk, delivered at GDC 2014, describes a method to detect CPU-GPU sync points. CPU-GPU sync points rob applications of performance and often go undetected. As a single CPU-GPU sync point can halve an application's frame rate, it is important that they be understood and detected as quickly as possible.

Transcript of Avoiding Catastrophic Performance Loss

Avoiding Catastrophic Performance Loss

Detecting CPU-GPU Sync Points

John McDonald, NVIDIA Corporation

Topics

● D3D/GL Driver Models

● Types of Sync Points

● How bad are they, really?

● Detection

● Repair

● Summary

D3D Driver Model

● Multithreaded

● Client Thread (Your Application + D3D Runtime)

● Server Thread (D3D Runtime [DDI] + Driver)

● GPU (??)

● Remains in user-mode for as long as possible

GL Driver

● Very similar

● Client thread (your application + GL entry points)

● Server thread (shelved data + expansion)

● GPU

● Again, very little time in Kernel Mode

Example Healthy Timeline

[Slide diagram: example healthy timeline with Client, Runtime, Driver, Runtime (DDI), and GPU lanes, showing state changes, action methods (draw, clear, etc.), and Present.]

Types of Sync Points

● Driver Sync Point

● CPU-GPU Sync Point

● Can be Server->GPU

● Can be Client->GPU

Driver Sync Point

● Major concern in OpenGL

● Minor concern in D3D

● Caused when Client thread would need information available only to Server thread

● In GL, any function that returns a value (see the sketch below)

● In D3D, certain State-getting operations
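
A minimal GL sketch of that "returns a value" case, assuming a current GL context and <GL/gl.h>; whether a particular driver can answer from a client-side cache varies, but calls like these can force the client thread to wait on the server thread:

// Both calls hand a value back to the application, so the client thread may
// stall until the driver's server thread produces the answer.
GLint maxTexSize = 0;
glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxTexSize);  // value must come back from the driver
GLenum err = glGetError();                        // likewise a round trip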

Healthy Timeline

[Slide diagram: the healthy timeline again, for comparison.]

Driver Sync Point

[Slide diagram: the same timeline with a Driver Sync Point marked; the Client thread stalls waiting on the Server thread.]

CPU-GPU Sync Point: Defined

When an application-side operation requires GPU work to finish prior to the completion of the provoking operation, a CPU-GPU Sync Point has been introduced.

CPU-GPU Sync Point (cont’d)

● Primary causes are buffer updates and obtaining query results

● GPU readback

● e.g. ReadPixels (see the readback sketch below)

● Locking the Backbuffer

● Complete list of entry points in Appendix
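
As an illustration of the readback case, here is a hedged D3D11 sketch; ctx, pStagingTex, and pRenderTargetTex are assumed to be an existing immediate context, staging texture, and render-target texture (assumes <d3d11.h>):

ctx->CopyResource(pStagingTex, pRenderTargetTex);   // queued; executes on the GPU later
D3D11_MAPPED_SUBRESOURCE mapped = {};
// Mapping the staging copy right away forces the CPU to wait until the GPU
// has finished every command up to and including the copy.
if (SUCCEEDED(ctx->Map(pStagingTex, 0, D3D11_MAP_READ, 0, &mapped)))
{
    // ... read pixels from mapped.pData ...
    ctx->Unmap(pStagingTex, 0);
}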

CPU-GPU Sync Point Visualized

● Ideal frame time should be max(CPU time, GPU time)

● Sync points cause this to be CPU Time + GPU Time.
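
For example, a frame with 10 ms of CPU work and 10 ms of GPU work should ideally cost max(10, 10) = 10 ms; a sync point serializes the two into 10 + 10 = 20 ms, which is where "one sync point can halve your frame rate" comes from.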

Healthy Timeline

[Slide diagram: healthy timeline, for comparison.]

CPU-GPU (Server->GPU) Sync Point

[Slide diagram: the same timeline with a Server->GPU Sync Point marked; the Server thread stalls waiting for the GPU.]

Healthy Timeline

[Slide diagram: healthy timeline, for comparison.]

CPU-GPU (Client->GPU) Sync Point

[Slide diagram: the same timeline with a Client->GPU Sync Point marked; the Client thread stalls waiting for the GPU.]

How bad are they, really?

● One CPU-GPU Sync Point can halve your framerate.

● The more there are, the harder they are to detect

● They are hard to detect with sampling profilers—the time disappears into Kernel Time.

We get it. They suck. Now what?

● GPU Timestamp Queries to the rescue!

Finding CPU-GPU Sync Points

● For each entry point that could cause a CPU-GPU sync point…

● Wrap the call with two GPU Timestamp Queries (Don’t forget the Disjoint Query)

● Ideally: record a portion of the stack at the call site

● Also record CPU timestamps around the call

Finding Sync Points (cont’d)

● Later:

● Compute the elapsed time between the queries

● If it is small (< 10 ns), then no GPU kickoff was required

● If it’s larger, a GPU kickoff probably occurred—you’ve found a CPU-GPU Sync Point!

Code! (Original)

ctx->Map(...);

Code! (New)

ctx->Begin(pDisjoint);
ctx->End(pTimestampBefore);
double earlier = timer::now();
ctx->Map(...);
double cpuElapsed = timer::now() - earlier;
ctx->End(pTimestampAfter);
ctx->End(pDisjoint);
stack = getStackRecord();
gSPChecker->Register(pDisjoint, pTimestampBefore, pTimestampAfter,
                     stack, cpuElapsed);

Four Possibilities

CPU Elapsed | GPU Elapsed     | Meaning
------------|-----------------|------------------------------------
Low         | ~None (<10 ns)  | No problem!
High        | ~None (<10 ns)  | Possible Driver Sync (Bad)
Low         | Low* (~1 us)    | Possible Server->GPU Sync (Worse)
High        | Low* (~1 us)    | Possible Client->GPU Sync (Ugh)

* Let’s talk about this in a bit
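
A hedged C++ sketch of the later resolution step implied by the table above; SyncPointRecord, Classify, and the threshold argument are illustrative assumptions, not part of the talk (assumes <d3d11.h>):

enum class SyncPointClass { NotReady, None, DriverSync, ServerToGpuSync, ClientToGpuSync };

struct SyncPointRecord
{
    ID3D11Query* pDisjoint;
    ID3D11Query* pTimestampBefore;
    ID3D11Query* pTimestampAfter;
    double       cpuElapsed;   // wall-clock time recorded around the call
};

SyncPointClass Classify(ID3D11DeviceContext* ctx, const SyncPointRecord& r, double cpuHighThreshold)
{
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjoint = {};
    UINT64 before = 0, after = 0;

    // GetData returns S_FALSE until results are available; resolve a frame later.
    // A disjoint interval means the timestamps are unreliable, so skip it.
    if (ctx->GetData(r.pDisjoint, &disjoint, sizeof(disjoint), 0) != S_OK || disjoint.Disjoint)
        return SyncPointClass::NotReady;
    if (ctx->GetData(r.pTimestampBefore, &before, sizeof(before), 0) != S_OK ||
        ctx->GetData(r.pTimestampAfter,  &after,  sizeof(after),  0) != S_OK)
        return SyncPointClass::NotReady;

    double gpuElapsedNs = 1e9 * double(after - before) / double(disjoint.Frequency);
    bool   gpuKickoff   = gpuElapsedNs >= 10.0;               // "<10 ns" means no kickoff
    bool   cpuHigh      = r.cpuElapsed > cpuHighThreshold;    // threshold chosen by the app

    if (!gpuKickoff)
        return cpuHigh ? SyncPointClass::DriverSync : SyncPointClass::None;
    return cpuHigh ? SyncPointClass::ClientToGpuSync : SyncPointClass::ServerToGpuSync;
}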

No problem!

[Slide diagram: a well-behaved Map, bracketed by the GPU timestamp queries and CPU timestamps; no GPU stall.]

Client->GPU Sync Point - detected

[Slide diagram: the same instrumentation around a Map that blocks on the GPU; the queries and CPU timestamps straddle the CPU-GPU Sync Point, so it is detected.]

Low elapsed GPU?

● GPU is fed commands in FIFO order

● Likely the only command caught between the queries is a WFI (wait-for-idle)

● Which is ~1,000 clocks, or ~1 us or more.

● Subject for future improvements

Split push buffer?

● Two calls right next to each other may wind up in different pushbuffer fragments

● And different GPU kickoffs

● This doesn’t hurt our scheme—Timestamp queries occur after “all results of previous commands are realized.”

● This means the timestamp is from the end of the pipeline—not the beginning.

Split Pushbuffer (cont’d)

● Shouldn’t be an issue unless you are CPU-bound and barely using the GPU

● Workarounds. Only report (see the sketch below):

● violators with large elapsed GPU times (>1 us); or

● call sites whose hashed call stack shows up repeatedly.
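
A minimal sketch of that repeat-offender filter, assuming the call stack is captured as a list of return addresses; the names and the FNV-1a hash choice are assumptions for illustration:

#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the captured return addresses of the call site.
uint64_t HashStack(const std::vector<void*>& stack)
{
    uint64_t h = 1469598103934665603ull;
    for (void* addr : stack) {
        h ^= reinterpret_cast<uint64_t>(addr);
        h *= 1099511628211ull;
    }
    return h;
}

static std::unordered_map<uint64_t, int> gHitCounts;   // stack hash -> violation count

// Report large GPU stalls immediately; report small ones only once the same
// call site keeps repeating, so a stray split-pushbuffer artifact is ignored.
bool ShouldReport(const std::vector<void*>& stack, double gpuElapsedUs, int repeatThreshold = 3)
{
    int hits = ++gHitCounts[HashStack(stack)];
    return gpuElapsedUs > 1.0 || hits >= repeatThreshold;
}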

Fixing CPU-GPU Sync Points

● Adjust flags

● E.g. in D3D9, never Lock a default-pool buffer with Flags = 0

● Be wary of using nearly all GPU memory

● May not be enough room for DISCARD operations

● Don’t spin-lock on query results; that is definitely a CPU-GPU Sync Point, regardless of API.

Fixing CPU-GPU Sync Points (cont’d)

● Use NO_OVERWRITE in combination with GPU fences (or event queries) to ensure safe, contention-free updates (see the sketch below)

● Defer Query resolution until at least one frame later

● Use PBOs to do asynchronous readbacks

● And wait “awhile” before mapping.
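
A hedged D3D11 sketch of the NO_OVERWRITE-plus-event-query suggestion; RangeIsFree, UpdateRange, and the per-range bookkeeping are illustrative assumptions (assumes <d3d11.h> and <cstring>):

// An event query issued when a buffer range was last consumed acts as the
// fence; poll it (never spin) before reusing that range.
bool RangeIsFree(ID3D11DeviceContext* ctx, ID3D11Query* pRangeEvent)
{
    BOOL done = FALSE;
    return ctx->GetData(pRangeEvent, &done, sizeof(done),
                        D3D11_ASYNC_GETDATA_DONOTFLUSH) == S_OK && done;
}

void UpdateRange(ID3D11DeviceContext* ctx, ID3D11Buffer* pDynamicVB,
                 UINT offset, const void* src, UINT bytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // NO_OVERWRITE promises the driver we won't touch data the GPU may still
    // be reading, so the Map never has to stall the CPU.
    if (SUCCEEDED(ctx->Map(pDynamicVB, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &mapped)))
    {
        memcpy(static_cast<BYTE*>(mapped.pData) + offset, src, bytes);
        ctx->Unmap(pDynamicVB, 0);
    }
}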

Summary

CPU-GPU Sync Points. Not even one.

Questions

● jmcdonald at nvidia dot com

Appendix

GPU Timestamp Queries

● Tells you the GPU time at which all preceding operations have completed, including writes to the framebuffer.

● Two timestamp queries adjacent in the pushbuffer will have an elapsed time of 1/(Clock Frequency). (Very, very small).

Problematic D3D9 Entry Points

● Create*^

● IDirect3DQuery9::GetData

● *::Lock

● *::LockRect

● Present

^ Rare, but possible

Problematic D3D11 Entry Points

● ID3D11Device::CreateBuffer*^

● ID3D11Device::CreateTexture*^

● ID3D11DeviceContext::Map

● ID3D11DeviceContext::GetData

● IDXGISwapChain::Present

^ Rare, but possible

Problematic GL Entry Points

● glBufferData^

● glBufferSubData^

● glClientWaitSync

● glFinish

^ Rare, but possible

Problematic GL Entry Points (cont’d)

● glGetQueryObject*

● glMap*

● glTexImage*^

● glTexSubImage*^

● SwapBuffers

^ Rare, but possible