DirectCompute: Capturing the Teraflop
Transcript of DirectCompute: Capturing the Teraflop
![Page 1: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/1.jpg)
DirectCompute: Capturing the Teraflop
Chas. Boyd, Architect, Microsoft Corporation
PDC09-CL03
![Page 2: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/2.jpg)
Overview
> Describing the GPU as a CPU
  > Fundamental principles in familiar terms
> Problem Set Definition
  > In what cases will I get the Teraflop?
> How to DirectCompute
  > Step by Step
> Managing I/O
  > Most codes are I/O bound
![Page 3: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/3.jpg)
Current CPU
4 cores
4-float-wide SIMD
3 GHz
48-96 GFlops
2x HyperThreaded
64 kB L1 cache per core
20 GB/s to memory
$200
200 W
[Diagram: four cores, CPU 0 to CPU 3, sharing an L2 cache]
![Page 4: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/4.jpg)
Current GPU
32 cores
32-float-wide SIMD
1 GHz
1 TeraFlop
32x "HyperThreaded"
64 kB L1 cache per core
150 GB/s to memory
$200, 200 W
[Diagram: an array of SIMD cores surrounding a shared L2 cache]
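The headline numbers on the CPU and GPU slides can be reproduced with simple arithmetic. The sketch below is an illustration, not part of the original deck: the function name and the `flops_per_lane_per_cycle` parameter are assumptions, and reading the CPU's 48-96 GFlops range as "one or two floating-point operations per lane per cycle" is an interpretation rather than something the slide states.

```cpp
#include <cassert>

// Peak throughput in GFlops:
//   cores x SIMD lanes per core x clock (GHz) x flops per lane per cycle.
// flops_per_lane_per_cycle is an assumption of this sketch: 1 if one
// operation issues per cycle, 2 if a multiply and an add both issue.
double peak_gflops(int cores, int simd_width, double clock_ghz,
                   int flops_per_lane_per_cycle) {
    assert(cores > 0 && simd_width > 0 && flops_per_lane_per_cycle > 0);
    return cores * simd_width * clock_ghz * flops_per_lane_per_cycle;
}
```

With the slide figures, `peak_gflops(4, 4, 3.0, 1)` and `peak_gflops(4, 4, 3.0, 2)` bracket the CPU range, and `peak_gflops(32, 32, 1.0, 1)` gives 1024 GFlops, roughly the GPU's 1 TeraFlop.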
![Page 5: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/5.jpg)
Comparison: Current Processors
[Diagram: the four-core CPU and the SIMD-core GPU from the previous two slides shown side by side, labeled "CPU" and "GPU", each with its L2 cache]
![Page 6: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/6.jpg)
CPU vs GPU
CPU
> Low latency memory
> Random accesses
> 20 GB/s bandwidth
> 0.1 TFlop compute
> 1 GFlops/watt
> Well known programming model
GPU
> High bandwidth memory
> Sequential accesses
> 100 GB/s bandwidth
> 1 TFlop compute
> 10 GFlops/watt
> Niche programming model
![Page 7: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/7.jpg)
An Asymmetric Multi-Processor System
CPU: 50 GFlops
GPU: 1 TFlop
CPU RAM: 4-6 GB
GPU RAM: 1 GB
[Diagram: links between these components labeled 10 GB/s, 100 GB/s, and 1 GB/s]
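One way to read this diagram is as a balance between compute rate and link bandwidth. The sketch below is a hedged illustration: the function names are made up, and treating the 1 GB/s figure as the CPU-to-GPU link is an assumption about the diagram, not something the transcript spells out.

```cpp
#include <cassert>

// Floating-point operations the processor must perform per byte moved over
// a link for compute time to equal transfer time. Work that does fewer
// flops per byte than this is limited by the link, not by the ALUs.
double break_even_flops_per_byte(double peak_flops_per_sec,
                                 double link_bytes_per_sec) {
    assert(peak_flops_per_sec > 0.0 && link_bytes_per_sec > 0.0);
    return peak_flops_per_sec / link_bytes_per_sec;
}

// Seconds needed to move `bytes` over a link of the given bandwidth.
double transfer_seconds(double bytes, double link_bytes_per_sec) {
    assert(bytes >= 0.0 && link_bytes_per_sec > 0.0);
    return bytes / link_bytes_per_sec;
}
```

Under that assumption, a 1 TFlop GPU fed over a 1 GB/s link needs about 1000 flops per transferred byte to stay busy, against about 10 flops per byte when the data already sits in GPU RAM at 100 GB/s. This is the arithmetic behind the overview's "most codes are I/O bound".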
![Page 8: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/8.jpg)
GPUs are Data-Parallel Processors
> GPU has 1000s of simultaneous ALUs
> Need 100s of 1000s of threads to hit peak
> Only data elements come in such numbers
![Page 9: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/9.jpg)
GPUs Need Data-Parallel Algorithms
> Image processing
  > Reduction, Histogram, FFT, Summed Area Table
> Video processing
  > Transcode, effects, analysis
> Audio
> Linear Algebra
> Simulation/Modeling:
  > Technical, Finance, Academic
> Some Databases
![Page 10: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/10.jpg)
Video Stabilization
[Video demo]
![Page 11: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/11.jpg)
Applications <> Algorithms
> Most important algorithms have known data-parallel versions
> Replace the serial algorithm with its data-parallel version:
  > Sorting: Quicksort is swapped for bitonic sort
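To show why bitonic sort suits a GPU, here is a minimal CPU-side sketch of the sorting network in C++. It is an illustration written for this transcript, not code from the talk, and it assumes the input length is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts `a` in ascending order with a bitonic sorting network.
// Precondition: a.size() is a power of two.
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of the sequences being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // distance between compared elements
            // Within one (k, j) pass every pair is disjoint, so on a GPU
            // each value of i could be handled by its own thread.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    const bool out_of_order =
                        ascending ? a[i] > a[partner] : a[i] < a[partner];
                    if (out_of_order) std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

Unlike quicksort, the sequence of comparisons does not depend on the data, which is what lets each pass map onto a fixed grid of threads.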
![Page 12: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/12.jpg)
N-Body Galaxy Simulation
[Demo: DirectCompute on an AMD HD 5870 with DirectX11]
![Page 13: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/13.jpg)
The Teraflop Today
N-Body Demo App:
AMD Phenom II X4 940 3GHz + Radeon HD 5850
> CPU: 13.7 GFlops (multicore SSE, not cache-aware)
> GPU: 537 GFlops (DirectCompute)
Intel Xeon E5410 2.33GHz + Radeon HD 5870
> CPU: 25.5 GFlops (multicore SSE, not cache-aware)
> GPU: 722 GFlops (DirectCompute)
![Page 14: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/14.jpg)
After
![Page 15: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/15.jpg)
Microsoft FFT Performance
[Chart: GFlops versus Log2(size)]
![Page 16: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/16.jpg)
Component Relationships
[Diagram: layered stack, top to bottom]
Applications: Media playback or processing, media UI, recognition, etc. Technical
Domain Languages: Accelerator, Brook+, Rapidmind, Ct
Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)
![Page 17: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/17.jpg)
DirectCompute Adds Client Scenarios
> Support for multiple vendors
  > All DirectX11 chips will support DirectCompute
  > Some DirectX10 chips already support it
> Tight integration with rendering
  > Client scenarios involve interactive playback
> Support media data-types
  > Hardware format conversion for pixel formats
> Server scenarios still supported
![Page 18: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/18.jpg)
Code Walkthrough
![Page 19: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/19.jpg)
DirectCompute Usage
> Initialize DirectCompute
> Create some GPU code in .hlsl
> Compile it using DirectX compiler
> Load the code onto the GPU
> Set up a GPU buffer for input data
  > And set up a view into it for access
> Make that data view current
> Execute the code on the GPU
> Copy the data back to CPU memory
![Page 20: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/20.jpg)
Initialize DirectCompute
    hr = D3D11CreateDevice(
        NULL,                      // default gfx adapter
        D3D_DRIVER_TYPE_HARDWARE,  // use hw
        NULL,                      // not sw rasterizer
        uCreationFlags,            // Debug, Threaded, etc.
        NULL,                      // feature levels
        0,                         // size of above
        D3D11_SDK_VERSION,         // SDK version
        ppDeviceOut,               // D3D Device
        &FeatureLevelOut,          // of actual device
        ppContextOut );            // subunit of device
![Page 21: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/21.jpg)
Example HLSL code
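    #define BLOCK_SIZE 256
    // The <float> element type is an assumption; the transcript shows no type argument.
    StructuredBuffer<float>   gBuf1;
    StructuredBuffer<float>   gBuf2;
    RWStructuredBuffer<float> gBufOut;

    [numthreads(BLOCK_SIZE,1,1)]
    void VectorAdd( uint3 id: SV_DispatchThreadID )
    {
        // The buffers are 1-D, so index them with the x component of the thread ID.
        gBufOut[id.x] = gBuf1[id.x] + gBuf2[id.x];
    }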
![Page 22: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/22.jpg)
The HLSL Language
> HLSL is the most widely used language for Data Parallel Programming
> Syntax is similar to C/C++
  > Preprocessor defines (#define, #ifdef, etc.)
  > Basic types (float, int, uint, bool, etc.)
  > Operators, variables, functions
> Has some important differences
  > No pointers
  > Built-in variables & types (float4, matrix, etc.)
  > Intrinsic functions (mul, normalize, etc.)
![Page 23: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/23.jpg)
Compile the HLSL code
    hr = D3DX11CompileFromFile(
        "myCode.hlsl",  // path to .hlsl file
        NULL,           // no preprocessor defines
        NULL,           // no include handler
        "VectorAdd",    // entry point
        pProfile,       // target shader profile
        NULL,           // Flags
        NULL,
        NULL,
        &pBlob,         // compiled shader
        &pErrorBlob,    // error log
        NULL );
![Page 24: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/24.jpg)
Compilation Steps
> Compiler (fxc or library) generates target-specific instructions (IL) from shader
> Different instruction sets for different generations of hardware
> Shader IL is highly optimized
[Diagram: HLSL Code -> FXC or D3D Compiler API -> Intermediate Language -> IHV Driver -> Hardware Native Code]
![Page 25: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/25.jpg)
Complete Compilation and Send to GPU
    pD3D->CreateComputeShader(
        pBlob->GetBufferPointer(),
        pBlob->GetBufferSize(),
        NULL,
        &pMyShader );  // hw fmt
    pD3D->CSSetShader( pMyShader, NULL, 0 );
![Page 26: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/26.jpg)
Setup Buffer Resource for Input Data
    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    desc.StructureByteStride = uElementSize;     // size of one record
    desc.ByteWidth = uElementSize * uCount;      // total size in bytes
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    pD3D->CreateBuffer( &desc, pInput, ppBuffer );
![Page 27: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/27.jpg)
Resources
> Resource Objects are used to store data
> Resource Views are interfaces to the Resource
[Diagram: a Resource Object ("My Data Buffer") reached by the Compute Shader through a Sampler Resource View and an Unordered Access View]
![Page 28: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/28.jpg)
DirectX Resources
> Data Objects in memory
> Enable out-of-bounds memory checking
  > Improves security, reliability of shipped code
  > Returns 0 on reads
  > Writes are No-Ops
> Facilitates interop with Direct3D for display
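The out-of-bounds rules on this slide can be modeled on the CPU in a few lines. The class below is a hypothetical stand-in written for illustration; it is not a Direct3D type, and it only mimics the two behaviors the slide names: reads past the end return 0 and writes past the end do nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side model of a bounds-checked buffer (illustrative, not a D3D API).
class CheckedBuffer {
public:
    explicit CheckedBuffer(std::size_t count) : data_(count, 0.0f) {
        assert(count > 0);  // an empty buffer would make every access out of bounds
    }

    // Out-of-bounds reads return 0 instead of touching unrelated memory.
    float read(std::size_t index) const {
        return index < data_.size() ? data_[index] : 0.0f;
    }

    // Out-of-bounds writes are silently dropped.
    void write(std::size_t index, float value) {
        if (index < data_.size()) data_[index] = value;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<float> data_;
};
```

One practical consequence is that a dispatch which launches a few more threads than there are elements does not corrupt memory; the surplus threads simply read zeros and write nothing.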
![Page 29: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/29.jpg)
DirectX Resource Types
> Buffer
  > Defines an arbitrary data struct for the records in this buffer object
  > Includes structured, raw, and streaming buffers
> Texture*
  > Storage for data that will be used in pixel tasks
  > Includes 1-D, 2-D, 3-D, Cubes and arrays thereof
![Page 30: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/30.jpg)
Buffer Resource Types
> Structured
  > Defines records of a fixed size
  > Pixel data format is not specified, so automatic type/format conversion is not provided
> Unstructured
  > Can provide type/format conversion
> Both types support non-order-preserving I/O
  > For use with Append()/Consume() I/O
![Page 31: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/31.jpg)
Image/Media Resource Types
> Texture1D, 2D, 3D, Cube, Array
  > A 2-D array of pixels in a specified format
  > R8G8B8A8, R32_UINT, R16G16_UINT
![Page 32: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/32.jpg)
Setup a View into the Buffer
    D3D11_UNORDERED_ACCESS_VIEW_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    desc.Buffer.FirstElement = 0;
    desc.Format = DXGI_FORMAT_UNKNOWN;
    desc.Buffer.NumElements = uCount;
    pD3D->CreateUnorderedAccessView(
        pBuffer,    // Buffer view is into
        &desc,      // above data
        &pMyUAV );  // result
![Page 33: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/33.jpg)
Resource Views
> Resource Views define the access mechanism for data stored in Resources (buffers)
> Support cool features like:
  > Hardware accelerated format conversion
  > Hardware accelerated linear filtering/sampling
> Can create multiple views onto one resource
> Enable data polymorphism while providing info to implementation for optimal layout
![Page 34: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/34.jpg)
Unordered Access View (UAV)
> Enables two alternative usage patterns:
> Unordered/random/scattered I/O to the buffer it is created into
  > Indexed operations for I/O
  > myBuffer[index] = x;
  > For Texture2D Resource, index is uint2
> Or Non-Order-Preserving I/O
  > Using Append()/Consume() intrinsics
![Page 35: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/35.jpg)
Non-Order Preserving I/O
> For fastest performance when ordering of records need not be preserved
> Or when the number of writes is unknown
    Append( ResourceVar, val );
> Corresponding read operation provided for completeness
    Consume( ResourceVar, val );
> Requires buffer to have flag enabling this
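The idea behind non-order-preserving I/O can be sketched on the CPU with an atomic counter that hands out slots. The class below is a hypothetical model for illustration, not the HLSL intrinsics or how any driver implements them. It assumes a fixed capacity and that a given phase either appends or consumes, never both at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side model of an append/consume buffer (illustrative only).
class AppendBuffer {
public:
    explicit AppendBuffer(std::size_t capacity) : slots_(capacity), count_(0) {
        assert(capacity > 0);
    }

    // Reserve the next free slot atomically, then fill it.
    // Returns false when the buffer is full. Callers cannot predict
    // which slot they get, so record order is not preserved.
    bool append(int value) {
        std::size_t n = count_.load();
        do {
            if (n >= slots_.size()) return false;
        } while (!count_.compare_exchange_weak(n, n + 1));
        slots_[n] = value;
        return true;
    }

    // Remove one record and store it in `out`.
    // Returns false when the buffer is empty.
    bool consume(int& out) {
        std::size_t n = count_.load();
        while (n > 0 && !count_.compare_exchange_weak(n, n - 1)) {
        }
        if (n == 0) return false;
        out = slots_[n - 1];
        return true;
    }

private:
    std::vector<int> slots_;
    std::atomic<std::size_t> count_;
};
```

Because writers only contend on the counter, a caller does not need to know in advance how many records will be produced, which matches the "number of writes is unknown" case above.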
![Page 36: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/36.jpg)
Shader Resource View (SRV)
> Enables hardware accelerated filtered sampling of the buffer
  > This hardware is a significant fraction of chip area
> Excellent for pixel data (images/video)
  > A single pixel format defined per View
  > Read-Only operation
  > Same resource cannot be bound to shader as SRV and as another view type at the same time
> Can also load w/o filtering
![Page 37: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/37.jpg)
Implementation Secrets
> Resources correspond to ranges of memory
> Views correspond to hardware logic units that perform data transformation on I/O
![Page 38: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/38.jpg)
Graphics vs Compute I/O
[Diagram: ALUs (shader execution) connected to GPU memory through two kinds of I/O units]
Texture Samplers: pixel format conversion, bi-linear filtering, gamma correction
Output Mergers: gamma correction, pixel format conversion, framebuffer prefetch
Latency labels in the diagram: 250 clocks and ~50 clocks
![Page 39: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/39.jpg)
Bind the Data, Launch the Work
    pD3D->CSSetUnorderedAccessViews( 0, 1, &pMyUAV, NULL );
    pD3D->Dispatch( GrpsX, GrpsY, GrpsZ );
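The slide does not say how the group counts passed to Dispatch are chosen. A common approach, shown here as an assumption rather than as the speaker's method, is to round the element count up to a whole number of thread groups. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of thread groups needed so that
// groups * threads_per_group >= element_count.
std::uint32_t groups_for(std::uint32_t element_count,
                         std::uint32_t threads_per_group) {
    assert(threads_per_group > 0);
    // Divide, then add one group if there is a remainder. This avoids the
    // overflow that (count + size - 1) / size can hit for very large counts.
    return element_count / threads_per_group +
           (element_count % threads_per_group != 0 ? 1u : 0u);
}
```

For the VectorAdd example with `BLOCK_SIZE` 256, 1000 elements would need `groups_for(1000, 256)` groups along X, with the Y and Z counts set to 1. The last group then contains threads whose index runs past the end of the data, which is where the out-of-bounds rules from the DirectX Resources slide come into play.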
![Page 40: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/40.jpg)
Thread Groups
> Not all threads in the call can/should share registers with each other
> Compute threads are structured into subsets or groups of threads
> Thread indices are available to the code:
  > SV_DispatchThreadID: index of thread in call
  > SV_GroupThreadID: index of thread in group
  > SV_GroupID: index of group in call
![Page 41: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/41.jpg)
Thread Groups
[Diagram: a 3 x 2 grid of thread groups; each group is a 4 x 4 block of threads indexed 00 to 33]
pDev11->Dispatch(3, 2, 1);
[numthreads(4, 4, 1)]
void MyCS(…)
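The relationship between the three thread indices can be written out directly. The sketch below is a CPU-side illustration with made-up type and function names; it relies on the documented HLSL relationship that the dispatch thread ID equals the group ID times the group dimensions plus the thread's ID within the group.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 {
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID,
// computed per component.
UInt3 dispatch_thread_id(UInt3 group_id, UInt3 group_thread_id,
                         UInt3 num_threads) {
    assert(group_thread_id.x < num_threads.x);
    assert(group_thread_id.y < num_threads.y);
    assert(group_thread_id.z < num_threads.z);
    return UInt3{group_id.x * num_threads.x + group_thread_id.x,
                 group_id.y * num_threads.y + group_thread_id.y,
                 group_id.z * num_threads.z + group_thread_id.z};
}

// Total threads launched by Dispatch(groups) with [numthreads(num_threads)].
std::uint64_t total_threads(UInt3 groups, UInt3 num_threads) {
    const std::uint64_t per_group =
        std::uint64_t(num_threads.x) * num_threads.y * num_threads.z;
    return per_group * groups.x * groups.y * groups.z;
}
```

For the example on this slide, Dispatch(3, 2, 1) with numthreads(4, 4, 1) launches 6 groups of 16 threads, 96 threads in total, and the last thread of the last group has dispatch thread ID (11, 7, 0).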
![Page 42: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/42.jpg)
Set up Buffer for Transfer to CPU
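    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof(desc) );
    desc.ByteWidth = uElementSize * uCount;       // assumed: same size as the GPU buffer to be copied
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;  // CPU will read the results
    desc.Usage = D3D11_USAGE_STAGING;             // staging resources can be copied to and mapped
    desc.BindFlags = 0;
    desc.MiscFlags = 0;
    pD3D->CreateBuffer( &desc, NULL, &StagingBuf );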
![Page 43: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/43.jpg)
Transfer Results to CPU
    pD3D->CopyResource( StagingBuf, pBuffer );  // destination first, then source
![Page 44: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/44.jpg)
Temporary Registers, aka General Purpose Registers
> Used for fast local variable storage
> Built as a block in each SIMD core
  > 16k 32-bit registers per core
> Registers available per thread depends on number of threads in the group (group size)
  > E.g. 16k registers/1024 threads in group means each thread gets 16 DWORDs
> Exceeding this limit has perf impacts:
  > Registers may be spilled to memory, or
  > Threads on core may be cut back (fewer "HyperThreads")
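The per-thread register budget on this slide is a single division. The helper below is a hypothetical illustration that assumes the register file is split evenly across the threads of one group, as the slide's example does.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit registers available to each thread when one core's register file
// is divided evenly among the threads of a group.
std::uint32_t registers_per_thread(std::uint32_t registers_per_core,
                                   std::uint32_t threads_per_group) {
    assert(threads_per_group > 0);
    return registers_per_core / threads_per_group;
}
```

With 16k registers per core, `registers_per_thread(16 * 1024, 1024)` gives the 16 DWORDs quoted above, while a 256-thread group would leave 64 registers per thread. Smaller groups buy more registers per thread at the cost of fewer threads in flight.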
![Page 45: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/45.jpg)
Groupshared Memory
> New register-type variable storage class
  > groupshared float sfFoo;
> A whole group of threads can access the same memory
  > Enables uses like user-controlled cache
> Max 32kB can be shared in DirectX11
  > 8k floats or 2k float4s
  > Vs 64kB of temporary registers
    > 16k floats or 4k float4s
> Using fewer is usually faster
![Page 46: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/46.jpg)
Barrier Intrinsics
GroupMemoryBarrier
DeviceMemoryBarrier
AllMemoryBarrier
> All I/O ops at the specified scope (group, device, or both) before this point must complete before any other I/O ops
GroupMemoryBarrierWithGroupSync
DeviceMemoryBarrierWithGroupSync
AllMemoryBarrierWithGroupSync
> All I/O ops at the specified scope (group, device, or both) before this point must complete before any other I/O ops
> AND all the specified threads must reach this point before any can continue
![Page 47: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/47.jpg)
Barrier Example
    Shader()
    {
        groupshared GS[GROUPSIZE];
        …compute the indices…
        GS[sid] = myBuffer[Tid];   // Load my data element
        GroupMemoryBarrierWithGroupSync();
        // process the data in groupshared memory…
        …
        GroupMemoryBarrierWithGroupSync();
        outBuffer[Tid] = GS[sid];  // write my data element
    }
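The load / barrier / process / barrier / store shape of this shader can be imitated on the CPU by running each phase as a separate loop over all threads of a group. The sketch below is an analogy written for this transcript, not GPU code. The "process" step (reversing each group) and all names are assumptions chosen so that every thread reads a slot written by a different thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reverses each group-sized block of `in`. Each inner loop stands for all
// threads of one group executing one phase; the end of a loop plays the
// role of GroupMemoryBarrierWithGroupSync().
std::vector<float> reverse_within_groups(const std::vector<float>& in,
                                         std::size_t group_size) {
    assert(group_size > 0 && in.size() % group_size == 0);
    std::vector<float> out(in.size());
    std::vector<float> gs(group_size);   // stands in for groupshared memory
    std::vector<float> reg(group_size);  // one private register per thread
    for (std::size_t base = 0; base < in.size(); base += group_size) {
        // Phase 1: each thread loads its own element into shared memory.
        for (std::size_t t = 0; t < group_size; ++t) gs[t] = in[base + t];
        // Barrier 1: all loads must land before anyone reads a neighbour's slot.
        for (std::size_t t = 0; t < group_size; ++t)
            reg[t] = gs[group_size - 1 - t];
        // Barrier 2: all reads must finish before any slot is overwritten.
        for (std::size_t t = 0; t < group_size; ++t) {
            gs[t] = reg[t];
            out[base + t] = gs[t];  // each thread writes its own element
        }
    }
    return out;
}
```

Without the first barrier a thread could read a slot its neighbour has not filled yet; without the second, a fast thread could overwrite a slot a slower thread still needs to read.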
![Page 48: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/48.jpg)
Implementation Secrets
> Thread Group corresponds to a SIMD core
  > 1 of 16-32 on the die
> Groupshared memory corresponds to a partition of that core's L1 cache
> GroupMemoryBarrier() corresponds to a flush of that core's I/O
![Page 49: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/49.jpg)
Data Parallel I/O
> I/O with 1600 active threads is not trivial
> Reads are broadcast, so should be fast, but:
  > Writes by many threads to one destination can result in serialization
> Less Obvious:
  > Even writing to a sequential location results in serialization on access to the address counter
> This is why DirectCompute provides a rich set of I/O operations and intrinsics
![Page 50: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/50.jpg)
Hardware Support
> DirectX11 Compute Shader runs on most current DirectX10 and 10.1 (4.x) parts
  > Explicit thread Dispatch()
  > Random-access I/O via resource variables
  > Private Write/Shared Read on groupshared data
> New DirectX11-class (5.x) hardware adds
  > Arbitrary accesses to groupshared data
  > Atomic intrinsic operators
  > Hardware format conversion on I/O
  > More streaming I/O methods
![Page 51: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/51.jpg)
Compute Shader 4.0 vs. 5.0

| Feature | CS 4.x | CS 5.0 |
| --- | --- | --- |
| Supported devices | DirectX10, DirectX11, Ref | DirectX11, Ref |
| Supported OSs | Windows7, Vista, S2008 | Windows7, Vista, S2008 |
| Max number of threads/group | 768 | 1024 |
| Restrictions on Zn | Zn = 1 | 1 <= Zn <= 64 |
| # 32-bit registers* | 4k | 8k |
| Shared register access | Private Write / Shared Read | Full Indexed |
| Atomic operations | Not supported | Supported |
| Max number of bound UAVs | 1 | 8 |
| Double Precision | No | Optional |
| DispatchIndirect( ) | No | Supported |
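Two rows of this table, the thread-count limit and the Zn restriction, translate into a simple check on a numthreads declaration. The sketch below encodes only those two rows. The enum and function names are hypothetical, and real hardware profiles impose further limits that the table does not list.

```cpp
#include <cstdint>

enum class CsProfile { CS4x, CS5_0 };

// True if [numthreads(x, y, z)] satisfies the two limits from the table:
//   CS 4.x: z == 1       and x * y * z <= 768
//   CS 5.0: 1 <= z <= 64 and x * y * z <= 1024
constexpr bool numthreads_allowed(CsProfile profile, std::uint32_t x,
                                  std::uint32_t y, std::uint32_t z) {
    return x > 0 && y > 0 && z > 0 &&
           (profile == CsProfile::CS4x
                ? (z == 1 && std::uint64_t(x) * y <= 768)
                : (z <= 64 && std::uint64_t(x) * y * z <= 1024));
}
```

For instance, a 32 x 32 x 1 group (1024 threads) passes for CS 5.0 but exceeds the 768-thread limit of CS 4.x, so code meant to run on both generations has to pick a smaller group size.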
![Page 52: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/52.jpg)
OS Support
> DirectCompute ships in DirectX11
  > DirectX11 is integrated into Windows7 and Server 2008R2
> Also available on Windows Vista SP2 and Windows Server 2008 via Platform Update
  > http://support.microsoft.com/kb/971644
  > Supports all new hardware features
> Developer SDK installs on either OS
  > http://msdn.microsoft.com/directx
![Page 53: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/53.jpg)
Call to Action
> Install the DirectX11 SDK
> Try out the DirectCompute samples
> Look for parts of your code that are data parallel
> Swap in GPU code using DirectCompute
> Experience Teraflop computing today
![Page 54: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/54.jpg)
YOUR FEEDBACK IS IMPORTANT TO US!
Please fill out session evaluation forms online at MicrosoftPDC.com
![Page 55: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/55.jpg)
Learn More On Channel 9
> Expand your PDC experience through Channel 9
> Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses
channel9.msdn.com/learn
Built by Developers for Developers…
![Page 56: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/56.jpg)
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
![Page 57: DirectCompute: Capturing the Teraflop](https://reader037.fdocuments.us/reader037/viewer/2022102617/568130ba550346895d96db97/html5/thumbnails/57.jpg)