Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released...
Transcript of Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released...
![Page 1: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/1.jpg)
Advanced DirectX® 11 technology: DirectCompute by Example
Justin Hensley
Lee Howes
Advanced Micro Devices Inc.
September 2nd 2010
![Page 2: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/2.jpg)
DirectX 11 Basics
• New API from Microsoft
– Released alongside Windows ® 7
– Runs on Windows Vista® as well
• Supports downlevel hardware
– DirectX9, DirectX10, DirectX11-class HW supported
– Exposed features depend on GPU
• Allows the use of the same API for multiple
generations of GPUs
– However Windows Vista/Windows 7 required
• Lots of new features…
![Page 3: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/3.jpg)
What is DirectCompute?
• DirectCompute brings GPGPU to DirectX
• DirectCompute is both separate from and
integrated with the DirectX graphics
pipeline
– Compute Shader
– Compute features in Pixel Shader
• Potential applications (応用分野)
– Physics
– AI
– Image processing
![Page 4: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/4.jpg)
DirectCompute – part of DirectX
• DirectX 11 helps efficiently combine Compute
work with graphics
– Sharing of buffers is trivial
– Work graph is scheduled efficiently by the driver
Input
Assembler
Vertex
Shader Tessellation
Geometry
Shader Rasterizer
Pixel
Shader
Compute
Shader
Graphics pipeline
![Page 5: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/5.jpg)
DirectCompute Features
• Scattered writes
• Atomic operations
• Append/consume buffer
• Shared memory (local data share)
• Structured buffers
• Double precision (if supported)
![Page 6: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/6.jpg)
DirectCompute by Example
• Order Independent Transparency (OIT)
– Atomic operations
– Scattered writes
– Append buffer feature
• Bullet Cloth Simulation
– Shared memory
– Shared compute and graphics buffers
![Page 7: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/7.jpg)
Order Independent Transparency
![Page 8: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/8.jpg)
Transparency Problem
• Classic problem in computer graphics
• Correct rendering of semi-transparent geometry requires sorting –
blending is an order dependent operation
半透明な物体の正確なレンダリングはそれぞれのトライアングルの視点からの距離でのソートが必要
• Sometimes sorting triangles is enough but not always
しかしトライアングルのソートでは不十分なときがある
– Difficult to sort: Multiple meshes interacting (many draw calls)
– Impossible to sort: Intersecting triangles (must sort fragments)
Try doing this
in PowerPoint! パワーポイントでこれをどうやって作るか知ってるかい?
大変だったんだ、これが。
![Page 9: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/9.jpg)
Background
• A-buffer – Carpenter „84
– CPU side linked list per-pixel for anti-aliasing
• Fixed array per-pixel
– F-buffer, stencil routed A-buffer, Z3 buffer, and k-buffer, Slice
map, bucket depth peeling
• Multi-pass
– Depth peeling methods for transparency
• Recent
– Freepipe, PreCalc [DirectX11 SDK]
![Page 10: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/10.jpg)
OIT using Per-Pixel Linked Lists
• Fast creation of linked lists of arbitrary size
on the GPU using D3D11
– Computes correct transparency
正しい半透明の描画が可能
• Integration into the standard graphics pipeline
– Demonstrates compute from rasterized data
– DirectCompute features in Pixel Shader
– Works with depth and stencil testing
– Works with and without MSAA
• Example of programmable blend
![Page 11: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/11.jpg)
Linked List Construction
• Two Buffers
– Head pointer buffer
• addresses/offsets
• Initialized to end-of-list (EOL) value (e.g., -1)
– Node buffer
• arbitrary payload data + “next pointer”
任意のデータと次のデータへのポインタ
• Each shader thread
1. Retrieve and increment global counter value
グローバルのカウンターを取り出し、インクリメント
2. Atomic exchange into head pointer buffer
ヘッドのポインタバッファをアトミックエクスチェンジ
3. Add new entry into the node buffer at location from step 1
新しいデータをステップ1での場所に書き込む
![Page 12: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/12.jpg)
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists
![Page 13: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/13.jpg)
Step 0 – Render Opaque
• Render all opaque geometry normally
不透明物体の描画を普通に行う
Render Target
![Page 14: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/14.jpg)
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
半透明物体の描画
– All fragments are stored using per-pixel linked
lists
全てのフラグメントはピクセルごとのリンクリストとして書き込まれる
– Store fragment‟s: color, alpha, & depth
2. Screen quad resolves and composites
fragment lists
![Page 15: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/15.jpg)
Setup
• Two buffers
– Screen sized head pointer buffer
スクリーンサイズのヘッドポインタバッファ
– Node buffer – large enough to handle all fragments
ノードバッファ(全てのフラグメントを格納するのに十分な大きさ)
• Render as usual
不透明物体を普通に描画
• Disable render target writes
レンダーターゲットへの書き込みを無効化
• Insert render target data into linked list
半透明物体のデータをリンクリストに書き込み
![Page 16: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/16.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
Head Pointer Buffer
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
Node Buffer 0 1 2 3 4 5 6 …
Counter = 0
Render Target
![Page 17: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/17.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 0
![Page 18: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/18.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 1
IncrementCounter()
![Page 19: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/19.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 0 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 1
InterlockedExchange()
![Page 20: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/20.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 0 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 1 Scatter Write
![Page 21: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/21.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 0 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
Culled due to existing
scene geometry depth.
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 3
![Page 22: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/22.jpg)
Node Buffer 0 1 2 3 4 5 6 …
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 3 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Counter = 5
![Page 23: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/23.jpg)
Step 1 – Create Linked List
-1 -1 -1 -1 -1 -1
-1 5 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
0.71
3
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
Counter = 6
![Page 24: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/24.jpg)
Node Buffer Counter
• Counter allocated in GPU memory (i.e. a buffer)
– Atomic updates
– Contention issues
• DirectX11 Append feature
– Compacted linear writes to a buffer
– Implicit writes
• Append()
– Explicit writes
• IncrementCounter()
• Standard memory operations
– Up to 60% faster than memory counters
![Page 25: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/25.jpg)
Code Example
RWStructuredBuffer RWStructuredCounter;
RWTexture2D<int> tRWFragmentListHead;
RWTexture2D<float4> tRWFragmentColor;
RWTexture2D<int2> tRWFragmentDepthAndLink;
[earlydepthstencil]
void PS( PsInput input )
{
float4 vFragment = ComputeFragmentColor(input);
int2 vScreenAddress = int2(input.vPositionSS.xy);
// Get counter value and increment
int nNewFragmentAddress = RWStructuredCounter.IncrementCounter();
if ( nNewFragmentAddress == FRAGMENT_LIST_NULL ) { return; }
// Update head buffer
int nOldFragmentAddress;
InterlockedExchange(tRWFragmentListHead[vScreenAddress], nNewHeadAddress,
nOldHeadAddress );
// Write the fragment attributes to the node buffer
int2 vAddress = GetAddress( nNewFragmentAddress );
tRWFragmentColor[vAddress] = vFragment;
tRWFragmentDepthAndLink[vAddress] = int2(
int(saturate(input.vPositionSS.z))*0x7fffffff), nOldFragmentAddress );
return;
}
![Page 26: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/26.jpg)
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists – Single pass
– Pixel shader sorts associated linked list (e.g., insertion sort)
– Composite fragments in sorted order with background
– Output final fragment
![Page 27: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/27.jpg)
Step 2 – Render Fragments
-1 -1 -1 -1 -1 -1
-1 5 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
0.71
3
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
(0,0)->(1,1):
Fetch Head Pointer: -1
-1 indicates no fragment
to render
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
![Page 28: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/28.jpg)
Step 2 – Render Fragments
-1 -1 -1 -1 -1 -1
-1 5 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
0.71
3
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
(1,1):
Fetch Head Pointer: 5
Fetch Node Data (5)
Walk the list and store in
temp array
0.71 0.65 0.87
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
![Page 29: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/29.jpg)
Step 2 – Render Fragments
-1 -1 -1 -1 -1 -1
-1 5 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
0.71
3
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
(1,1):
Sort temp array
Blend colors and write out
0.65 0.71 0.87
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
![Page 30: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/30.jpg)
Step 2 – Render Fragments
-1 -1 -1 -1 -1 -1
-1 5 4 -1 -1 -1
-1 -1 -1 -1 -1 -1
0.87
-1
0.89
-1
0.90
-1
0.65
0
0.65
-1
0.71
3
-1 -1 -1 -1 -1 -1
-1 -1 -1 1 2 -1
-1 -1 -1 -1 -1 -1
…
Render Target
Head Pointer Buffer
Node Buffer 0 1 2 3 4 5 6 …
![Page 31: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/31.jpg)
Anti-Aliasing
• Store coverage information in the linked
list
• Resolve per-sample
– Execute a shader at each sample location
– Use MSAA hardware
• Resolve per-pixel
– Execute a shader at each pixel location
– Average all sample contributions within the
shader
![Page 32: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/32.jpg)
Mecha Demo
• 602K scene triangles
– 254K transparent triangles
![Page 33: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/33.jpg)
Layers
![Page 34: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/34.jpg)
Scaling
![Page 35: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/35.jpg)
Future Work
• Memory allocation
• Sort on insert
• Other linked list applications
– Indirect illumination
– Motion blur
– Shadows
• More complex data structures
![Page 36: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/36.jpg)
Bullet Cloth Simulation
![Page 37: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/37.jpg)
概要 Overview
• 布のシミュレーションについて
• 2つの解法
–バッチソルバ(Batched Solver)
– SIMDバッチソルバ(SIMD Batched Solver)
• そしてもう一つ
–データのGPU上での直接のコピー(GPU Copy)
![Page 38: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/38.jpg)
DirectCompute for physics
• DirectCompute in the Bullet physics SDK
–布のシミュレーションの紹介
– An introduction to cloth simulation
– DirectComputeでの実装のティップス
– Some tips for implementation in
DirectCompute
![Page 39: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/39.jpg)
Cloth simulation
• Large number of particles
– Appropriate for parallel processing
– Force from each spring constraint applied to both
connected particles
• 大量のパーティクル
– 並列処理に適している
– それぞれのバネをつなぐパーティクルに拘束条件を適用
静止状態
現在の状態
力は静止状態からの伸
びとして求められる
求まった状態 位置の補正量の計算
![Page 40: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/40.jpg)
Cloth simulation
• Large number of particles
– Appropriate for parallel processing
– Force from each spring constraint applied to both
connected particles
• 大量のパーティクル
– 並列処理に適している
– それぞれのバネをつなぐパーティクルに拘束条件を適用
静止状態
現在の状態
力は静止状態からの伸
びとして求められる
求まった状態 位置の補正量の計算
Rest length of spring
![Page 41: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/41.jpg)
Cloth simulation steps
• For each simulation iteration:
– Compute forces in each link based on its
length
– Correct positions of masses/vertices from
forces
– Compute new vertex positions
静止状態 現在の状態
力は静止状態からの伸
びとして求められる
求まった状態 位置の補正量の計算
![Page 42: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/42.jpg)
Cloth simulation steps
• For each simulation iteration:
– Compute forces in each link based on its
length
– Correct positions of masses/vertices from
forces
– Compute new vertex positions
静止状態 現在の状態
力は静止状態からの伸
びとして求められる
求まった状態 位置の補正量の計算
![Page 43: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/43.jpg)
Cloth simulation steps
• For each simulation iteration:
– Compute forces in each link based on its
length
– Correct positions of masses/vertices from
forces
– Compute new vertex positions
静止状態 現在の状態
力は静止状態からの伸
びとして求められる
求まった状態 位置の補正量の計算
![Page 44: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/44.jpg)
Springs and masses
• Two or three main types of springs
– Structural/shearing 構造バネ
– Bending 曲げバネ
![Page 45: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/45.jpg)
Springs and masses
• Two or three main types of springs
– Structural/shearing 構造バネ
– Bending 曲げバネ
![Page 46: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/46.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times
![Page 47: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/47.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times
![Page 48: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/48.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times
![Page 49: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/49.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times
![Page 50: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/50.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times
![Page 51: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/51.jpg)
CPU approach to simulation
• One link at a time
• リンク一つずつ解く
• Perform updates in place
• そして座標を更新
• “Gauss-Seidel” style
• ガウスザイデル法
• Conserves momentum
• Iterate n times Target
![Page 52: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/52.jpg)
Moving to the GPU: The pixel shader approach
• Offers full parallelism
• 完全な並列性
• One vertex per work-item
• All vertices together
• 全ての頂点を一度に解く
• No scattered writes
• Poor convergence
• 収束が悪いという問題がある
![Page 53: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/53.jpg)
Moving to the GPU: The pixel shader approach
• Offers full parallelism
• 完全な並列性
• One vertex per work-item
• All vertices together
• 全ての頂点を一度に解く
• No scattered writes
• Poor convergence
• 収束が悪いという問題がある
![Page 54: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/54.jpg)
Can DirectCompute help?
• Offers scattered writes as a feature as we
saw earlier
• The GPU implementation could be more
like the CPU
– Solver per-link rather than per-vertex
– Leads to races between links that update the
same vertex
![Page 55: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/55.jpg)
Execute independent subsets in parallel
• All links act at both ends
• Batch links
– No two links in a given
batch share a vertex
– No data races
![Page 56: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/56.jpg)
Execute independent subsets in parallel
• All links act at both ends
• Batch links
– No two links in a given
batch share a vertex
– No data races
![Page 57: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/57.jpg)
Execute independent subsets in parallel
• All links act at both ends
• Batch links
– No two links in a given
batch share a vertex
– No data races
![Page 58: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/58.jpg)
Execute independent subsets in parallel
• All links act at both ends
• Batch links
– No two links in a given
batch share a vertex
– No data races
![Page 59: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/59.jpg)
Execute independent subsets in parallel
• All links act at both ends
• Batch links
– No two links in a given
batch share a vertex
– No data races
![Page 60: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/60.jpg)
On a real cloth mesh we need many batches
• Create independent subsets of links through graph
coloring.
• Synchronize between batches
• 実際の布のメッシュでは多くのバッチが必要 – グラフカラーリングで独立なサブセットを作る必要がある
– それぞれのバッチを解く間に同期が必要
![Page 61: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/61.jpg)
On a real cloth mesh we need many batches
• Create independent subsets of links through graph
coloring.
• Synchronize between batches
• 実際の布のメッシュでは多くのバッチが必要 – グラフカラーリングで独立なサブセットを作る必要がある
– それぞれのバッチを解く間に同期が必要
1st batch
![Page 62: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/62.jpg)
On a real cloth mesh we need many batches
• Create independent subsets of links through graph
coloring.
• Synchronize between batches
• 実際の布のメッシュでは多くのバッチが必要 – グラフカラーリングで独立なサブセットを作る必要がある
– それぞれのバッチを解く間に同期が必要
2nd batch
![Page 63: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/63.jpg)
On a real cloth mesh we need many batches
• Create independent subsets of links through graph
coloring.
• Synchronize between batches
• 実際の布のメッシュでは多くのバッチが必要 – グラフカラーリングで独立なサブセットを作る必要がある
– それぞれのバッチを解く間に同期が必要
3rd batch
![Page 64: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/64.jpg)
On a real cloth mesh we need many batches
• Create independent subsets of links through graph
coloring.
• Synchronize between batches
• 実際の布のメッシュでは多くのバッチが必要 – グラフカラーリングで独立なサブセットを作る必要がある
– それぞれのバッチを解く間に同期が必要
10 batches
![Page 65: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/65.jpg)
Driving batches and synchronizing
Iteration 0
Batch 0
Batch 1
Batch 2
Batch 3
Batch 4
// Execute the kernel
context->CSSetShader(
solvePositionsFromLinksKernel.kernel, NULL, 0 );
int numBlocks =
(constBuffer.numLinks + (blockSize-1)) / blockSize;
context->Dispatch( numBlocks , 1, 1 );
![Page 66: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/66.jpg)
Packing for higher efficiency
• Can create clusters of
links
– The cloth is fixed-structure
– Can be preprocessed
• Apply a group per
DirectCompute “thread”
group
4 batches
![Page 67: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/67.jpg)
Driving batches and synchronizing
Iteration
Simulation step
Batch 0
Batch 1
Inner batch 0
Inner batch 1
Inner batch 0
Load data
Store data
Load data
Store data
• The next feature of DirectCompute:
shared memory
![Page 68: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/68.jpg)
Solving in shared memory
groupshared float4 positionSharedData[VERTS_PER_GROUP];
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid :
SV_GroupThreadID … )
{
for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
positionSharedData[vertex] = g_vertexPositions[vertexAddress];
}
... // Perform computation in shared buffer
for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
g_vertexPositions[vertexAddress] = positionSharedData[vertex];
}
}
![Page 69: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/69.jpg)
Solving in shared memory
groupshared float4 positionSharedData[VERTS_PER_GROUP];
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid :
SV_GroupThreadID … )
{
for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
positionSharedData[vertex] = g_vertexPositions[vertexAddress];
}
... // Perform computation in shared buffer
for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
g_vertexPositions[vertexAddress] = positionSharedData[vertex];
}
}
Define a “groupshared” buffer for
shared data storage
![Page 70: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/70.jpg)
Solving in shared memory
groupshared float4 positionSharedData[VERTS_PER_GROUP];
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid :
SV_GroupThreadID … )
{
for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
positionSharedData[vertex] = g_vertexPositions[vertexAddress];
}
... // Perform computation in shared buffer
for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
g_vertexPositions[vertexAddress] = positionSharedData[vertex];
}
}
Data will be shared across a group
of threads with these dimensions
![Page 71: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/71.jpg)
Solving in shared memory
groupshared float4 positionSharedData[VERTS_PER_GROUP];
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid :
SV_GroupThreadID … )
{
for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
positionSharedData[vertex] = g_vertexPositions[vertexAddress];
}
... // Perform computation in shared buffer
for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
g_vertexPositions[vertexAddress] = positionSharedData[vertex];
}
}
Load data from global buffers into
the shared region
![Page 72: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/72.jpg)
Solving in shared memory
groupshared float4 positionSharedData[VERTS_PER_GROUP];
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid :
SV_GroupThreadID … )
{
for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
positionSharedData[vertex] = g_vertexPositions[vertexAddress];
}
... // Perform computation in shared buffer
for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
{
int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
g_vertexPositions[vertexAddress] = positionSharedData[vertex];
}
}
Write back to the global buffer after
computation
![Page 73: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/73.jpg)
Group execution
• The sequence of inner batch operations for
the first cluster is:
• それぞれのクラスタ内で独立なバッチセットを作成 – バッチセットは右図のようになる
![Page 74: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/74.jpg)
Group execution
• The sequence of inner batch operations for
the first cluster is:
• それぞれのクラスタ内で独立なバッチセットを作成 – バッチセットは右図のようになる
![Page 75: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/75.jpg)
Group execution
• The sequence of inner batch operations for
the first cluster is:
• それぞれのクラスタ内で独立なバッチセットを作成 – バッチセットは右図のようになる
![Page 76: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/76.jpg)
Group execution
• The sequence of inner batch operations for
the first cluster is:
• それぞれのクラスタ内で独立なバッチセットを作成 – バッチセットは右図のようになる
Synchronize…
Synchronize…
Synchronize…
Synchronize…
Synchronize…
// load
AllMemoryBarrierWithGroupSync();
for( each subgroup ) {
// Process a subgroup
AllMemoryBarrierWithGroupSync();
}
// Store
![Page 77: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/77.jpg)
Why is this an improvement?
• In-cluster batches now in-shader loop
• グローバルな同期を取らなければならなかった計算がクラスタ内の同期だけで済む
• Only 4 shader dispatches: less overhead
• 4回の同期、シェーダ実行で済む為、少ないオーバーヘッド
• Barrier synchronization is still slow
• クラスタ内のバリアを使った同期はまだ遅い
• しかし
![Page 78: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/78.jpg)
Exploiting the SIMD architecture
• Hardware executes 64- or 32-wide SIMD
• AMDのハードウェアは64SIMDで実行
• Sequentially consistent at the SIMD level
– So clusters can run on SIMDs, not groups
• Synchronization is now implicit
• そのため各クラスタで64スレッドごと立ち上げれば同期を明示的に導入しなくても良い
![Page 79: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/79.jpg)
Driving batches and synchronizing
Iteration
Simulation step
Batch 0
Batch 1
Inner batch 0
Inner batch 1
Inner batch 0
Synchronize…
![Page 80: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/80.jpg)
One more thing…
• Remember the tight pipeline integration?
• How can we use this to our advantage?
• Write directly to vertex buffer!
• コンピュートシェーダを使ってGPU上の布の頂点データを計算しているので直接バーテックスバッファに書き出すことが可能
Input
Assembler
Vertex
Shader Tesselation
Geometry
Shader Rasterizer
Pixel
Shader
Compute
Shader
Graphics pipeline
![Page 81: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/81.jpg)
One more thing…
• Remember the tight pipeline integration?
• How can we use this to our advantage?
• Write directly to vertex buffer!
• コンピュートシェーダを使ってGPU上の布の頂点データを計算しているので直接バーテックスバッファに書き出すことが可能
Input
Assembler
Vertex
Shader Tesselation
Geometry
Shader Rasterizer
Pixel
Shader
Compute
Shader
Graphics pipeline
![Page 82: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/82.jpg)
Create a vertex buffer
// Create a vertex buffer with unordered access support
D3D11_BUFFER_DESC bd;
bd.Usage = D3D11_USAGE_DEFAULT;
bd.ByteWidth = vertexBufferSize * 32;
bd.BindFlags =
D3D11_BIND_VERTEX_BUFFER |
D3D11_BIND_UNORDERED_ACCESS;
bd.CPUAccessFlags = 0;
bd.MiscFlags = 0;
bd.StructureByteStride = 32;
hr = m_d3dDevice->CreateBuffer(&bd, NULL, &m_Buffer);
// Create an unordered access view of the buffer to allow writing
D3D11_UNORDERED_ACCESS_VIEW_DESC uavbuffer_desc;
ud.Format = DirectXGI_FORMAT_UNKNOWN;
ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
ud.Buffer.NumElements = vertexBufferSize;
hr = m_d3dDevice->CreateUnorderedAccessView(m_Buffer, &ud, &m_UAV);
Vertex buffer also bound for
unordered access.
Scattered writes!
![Page 83: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/83.jpg)
Performance gains
• For 90,000 links:
– Rendering and copy only 2.98 ms/frame
–バッチソルバ
(Batched solver) 3.84 ms/frame
– SIMDバッチソルバ
(SIMD batched solver) 3.22 ms/frame
– SIMDバッチソルバとGPU上でのコピー
(SIMD with GPU copy) 0.617 ms/frame
• 3.5x improvement in solver alone
Based on internal numbers
![Page 84: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/84.jpg)
Thanks
• Justin Hensley
• Holger Grün
• Nicholas Thibieroz
• Erwin Coumans
![Page 85: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/85.jpg)
References
• Yang J., Hensley J., Grün H., Thibieroz N.: Real-Time Concurrent Linked
List Construction on the GPU. In Rendering Techniques 2010: Eurographics
Symposium on Rendering (2010), vol. 29, Eurographics.
• Grün H., Thibieroz N.: OIT and Indirect Illumination using DirectX11 Linked
Lists. In Proceedings of Game Developers Conference 2010 (Mar. 2010).
http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Illuminati
on%20using%20DirectX11%20Linked%20Lists_forweb.ppsx
• http://developer.amd.com/samples/demos/pages/ATIRadeonHD5800Series
RealTimeDemos.aspx
• http://bulletphysics.org
![Page 86: Advanced DirectX 11 technology: DirectCompute by Example• New API from Microsoft –Released alongside Windows® 7 –Runs on Windows Vista® as well • Supports downlevel hardware](https://reader034.fdocuments.us/reader034/viewer/2022050404/5f81f8f97f4c943b143d2424/html5/thumbnails/86.jpg)
Trademark Attribution
AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. Microsoft, Windows, Windows Vista, Windows 7 and
DirectX are registered trademarks of Microsoft Corporation in the U.S. and/or other juristictions. Other names used in
this presentation are for identification purposes only and may be trademarks of their respective owners.
©2010 Advanced Micro Devices, Inc. All rights reserved.