Максим Гольдин
Transcript of Максим Гольдин
DEV301
Introduction to the C++ AMP heterogeneous computing platform and GPU tools in Visual Studio 11
Максим Гольдин, Senior Developer, Microsoft Corporation
Agenda
- Context
- Code
- IDE
- Summary
demo
N-Body Simulation
The Power of Heterogeneous Computing

| Speedup | Application |
| --- | --- |
| 146X | Interactive visualization of volumetric white matter connectivity |
| 36X | Ionic placement for molecular dynamics simulation on GPU |
| 19X | Transcoding HD video stream to H.264 |
| 17X | Simulation in Matlab using .mex file CUDA function |
| 100X | Astrophysics N-body simulation |
| 149X | Financial simulation of LIBOR model with swaptions |
| 47X | GLAME@lab: an M-script API for linear algebra operations on GPU |
| 20X | Ultrasound medical imaging for cancer diagnostics |
| 24X | Highly optimized object-oriented molecular dynamics |
| 30X | Cmatch exact string matching to find similar proteins and gene sequences |
CPUs vs GPUs today
CPU:
- Low memory bandwidth
- Higher power consumption
- Medium level of parallelism
- Deep execution pipelines
- Random accesses
- Supports general code
- Mainstream programming

GPU:
- High memory bandwidth
- Lower power consumption
- High level of parallelism
- Shallow execution pipelines
- Sequential accesses
- Supports data-parallel code
- Niche programming
images source: AMD
Tomorrow…
CPUs and GPUs are coming closer together; nothing is settled in this space, and things are still in motion.
C++ Accelerated Massive Parallelism is designed as a mainstream solution not only for today, but also for tomorrow
image source: AMD
C++ AMP
- Part of Visual C++
- Visual Studio integration
- STL-like library for multidimensional data
- Builds on Direct3D

performance, portability, productivity
Agenda checkpoint
- Context
- Code
- IDE
- Summary
Hello World: Array Addition
```cpp
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
```
How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?
Hello World: Array Addition
Serial (CPU):

```cpp
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}
```

C++ AMP:

```cpp
#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        }
    );
}
```
Basic Elements of C++ AMP coding

```cpp
void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(
        sum.grid,
        [=](index<1> idx) restrict(direct3d)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );
}
```

- array_view: wraps the data to operate on the accelerator
- array_view variables are captured and the associated data copied to the accelerator (on demand)
- parallel_for_each: executes the lambda on the accelerator once per thread
- grid: the number and shape of threads to execute the lambda
- index: the ID of the thread running the lambda, used to index into data
- restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
grid<N>, extent<N>, and index<N>
- index<N>: represents an N-dimensional point
- extent<N>: the number of units in each dimension of an N-dimensional space
- grid<N>: an origin (index<N>) plus an extent<N>
- N can be any number
Examples: grid, extent, and index
```cpp
index<1> i(2);        index<2> i(0,2);      index<3> i(2,0,1);

extent<1> e(6);       extent<2> e(3,4);     extent<3> e(3,2,2);
grid<1> g(e);         grid<2> g(e);         grid<3> g(e);
```
array<T,N>

Multi-dimensional array of rank N with element type T. Storage lives on the accelerator.

```cpp
vector<int> v(96);
extent<2> e(8,12);   // e[0] == 8; e[1] == 12;
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);     // i[0] == 3; i[1] == 9;
int o = a[i];        // or a[i] = 16;
//int o = a(i[0], i[1]);
```

(figure: the 8×12 array drawn as a grid, row indices 0-7, column indices 0-11)
array_view<T,N>
A view over existing data on the CPU or GPU: array_view<T,N> and array_view<const T,N>.

```cpp
vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);

// the above two lines can also be written
// array_view<int,2> a(2,5,v);
```
Data Classes Comparison
| array<T,N> | array_view<T,N> |
| --- | --- |
| Rank at compile time | Rank at compile time |
| Extent at runtime | Extent at runtime |
| Rectangular, dense | Rectangular, dense in one dimension |
| Container for data | Wrapper for data |
| Explicit copy | Implicit copy |
| Capture by reference [&] | Capture by value [=] |
parallel_for_each

Executes the lambda for each point in the extent. As-if synchronous in terms of visible side effects.

```cpp
parallel_for_each(
    g,                    // g is of type grid<N>
    [ ](index<N> idx) restrict(direct3d)
    {
        // kernel code
    }
);
```
Example: Matrix Multiplication

Serial:

```cpp
void MatrixMultiplySerial(vector<float>& vC,
                          const vector<float>& vA,
                          const vector<float>& vB,
                          int M, int N, int W)
{
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
}
```

C++ AMP:

```cpp
void MatrixMultiplyAMP(vector<float>& vC,
                       const vector<float>& vA,
                       const vector<float>& vB,
                       int M, int N, int W)
{
    array_view<const float,2> a(M,W,vA), b(W,N,vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid,
        [=](index<2> idx) restrict(direct3d) {
            int row = idx[0];
            int col = idx[1];
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += a(row, i) * b(i, col);
            c[idx] = sum;
        }
    );
}
```
accelerator, accelerator_view
- accelerator: e.g. a DX11 GPU, REF, or the CPU
- accelerator_view: a context for scheduling and memory management

(figure: host side with CPUs and system memory connected over PCIe to one or more GPUs on the accelerator side; "Host" vs "Accelerator (GPU example)")

- Data transfers between accelerator and host could be optimized away for an integrated memory architecture
Example: accelerator
```cpp
// Identify an accelerator based on Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");

// ...or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// ...or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
```
C++ AMP at a Glance (so far)
- restrict(direct3d, cpu)
- parallel_for_each
- class array<T,N>
- class array_view<T,N>
- class index<N>
- class extent<N>, grid<N>
- class accelerator
- class accelerator_view
Achieving maximum performance gains
- Schedule threads in tiles
- Avoid thread index remapping
- Gain the ability to use tile_static memory

The parallel_for_each overload for tiles accepts:
- a tiled_grid<D0>, tiled_grid<D0, D1>, or tiled_grid<D0, D1, D2>
- a lambda which accepts a tiled_index<D0>, tiled_index<D0, D1>, or tiled_index<D0, D1, D2>
```cpp
extent<2> e(8,6);
grid<2> g(e);
g.tile<2,2>();
g.tile<4,3>();
```

(figure: the same 8×6 thread grid shown three times: untiled, divided into 2×2 tiles, and divided into 4×3 tiles)
tiled_grid, tiled_index
Given:

```cpp
array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=](tiled_index<2,2> t_idx)... {
        ...
    });
```

When the lambda is executed by the thread marked T (at global position (6,3) in the figure's 8×6 grid):
- t_idx.global = index<2>(6,3)
- t_idx.local = index<2>(0,1)
- t_idx.tile = index<2>(3,1)
- t_idx.tile_origin = index<2>(6,2)
tile_static, tile_barrier
Within the tiled parallel_for_each lambda we can use:
- the tile_static storage class for local variables
  - indicates that the variable is allocated in fast cache memory, i.e. shared by each thread in a tile of threads
  - only applicable in restrict(direct3d) functions
- class tile_barrier
  - synchronizes all threads within a tile, e.g. t_idx.barrier.wait();
Example: Matrix Multiplication (tiled)

Simple:

```cpp
void MatrixMultSimple(vector<float>& vC, const vector<float>& vA,
                      const vector<float>& vB, int M, int N, int W)
{
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid,
        [=](index<2> idx) restrict(direct3d) {
            int row = idx[0];
            int col = idx[1];
            float sum = 0.0f;
            for (int k = 0; k < W; k++)
                sum += a(row, k) * b(k, col);
            c[idx] = sum;
        });
}
```

Tiled:

```cpp
void MatrixMultTiled(vector<float>& vC, const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W)
{
    static const int TS = 16;
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid.tile<TS,TS>(),
        [=](tiled_index<TS,TS> t_idx) restrict(direct3d) {
            int row = t_idx.local[0];
            int col = t_idx.local[1];
            float sum = 0.0f;
            for (int i = 0; i < W; i += TS) {
                tile_static float locA[TS][TS], locB[TS][TS];
                locA[row][col] = a(t_idx.global[0], col + i);
                locB[row][col] = b(row + i, t_idx.global[1]);
                t_idx.barrier.wait();
                for (int k = 0; k < TS; k++)
                    sum += locA[row][k] * locB[k][col];
                t_idx.barrier.wait();
            }
            c[t_idx.global] = sum;
        });
}
```
C++ AMP at a Glance
- restrict(direct3d, cpu)
- parallel_for_each
- class array<T,N>
- class array_view<T,N>
- class index<N>
- class extent<N>, grid<N>
- class accelerator
- class accelerator_view
- tile_static storage class
- class tiled_grid< , , >
- class tiled_index< , , >
- class tile_barrier
Agenda checkpoint
- Context
- Code
- IDE
- Summary
Visual Studio 11
Organize, Edit, Design, Build, Browse, Debug, Profile
demo
C++ AMP Parallel Debugger
Visual Studio 11
Organize, Edit, Design, Build, Browse, Debug, Profile
C++ AMP Parallel Debugger
- Well-known Visual Studio debugging features: Launch, Attach, Break, Stepping, Breakpoints, DataTips
- Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
- New features (for both CPU and GPU): Parallel Stacks window, Parallel Watch window, Barrier
- New GPU-specific: Emulator, GPU Threads window, race detection
demo
Concurrency Visualizer for GPU
Visual Studio 11
Organize, Edit, Design, Build, Browse, Debug, Profile
Concurrency Visualizer for GPU
- Direct3D-centric: supports any library/programming model built on it
- Integrated GPU and CPU view
- Goal is to analyze high-level performance metrics:
  - Memory copy overheads
  - Synchronization overheads across CPU/GPU
  - GPU activity and contention with other processes
Concurrency Visualizer for GPU
The team is exploring ways to provide data on:
- GPU memory utilization
- GPU HW counters
Agenda checkpoint
- Context
- Code
- IDE
- Summary
Summary
- Democratization of parallel hardware programmability
- Performance for the mainstream
- High-level abstractions in C++ (not C)
- State-of-the-art Visual Studio IDE
- Hardware abstraction platform

The intent is to make C++ AMP an open specification.
Resources
- Daniel Moth's blog (PM of C++ AMP): http://www.danielmoth.com/Blog/
- MSDN native parallelism blog (team blog): http://blogs.msdn.com/b/nativeconcurrency/
- MSDN Dev Center for Parallel Computing: http://msdn.com/concurrency
- MSDN forums to ask questions: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads
Feedback
Your feedback is very important! Please complete an evaluation form!
Thank you!
DEV: Hands-on Labs

November 9-10 in the self-study room; November 9 with an instructor

- 10:30-11:45 DEV201ILL: Visual Studio LightSwitch fundamentals
- 13:00-14:15 DEV303ILL: Debugging with IntelliTrace using Visual Studio 2010 Ultimate
- 14:30-15:45 DEV304ILL: Using Architecture Explorer for code analysis in Visual Studio 2010 Ultimate
- 16:00-17:15 DEV305ILL: Test-driven development in Microsoft Visual Studio 2010
- 17:30-18:45 DEV302ILL: Web performance and load testing fundamentals with Visual Studio 2010 Ultimate
Questions?
DEV301 Максим Гольдин
Senior Developer
[email protected]
http://blogs.msdn.com/b/mgoldin/

You can ask your questions in the "Ask the Expert" zone within an hour after the end of this session.
restrict(…)
- Applies to functions (including lambdas)
- Why restrict:
  - Target-specific language restrictions
  - Optimizations or special code-gen behavior
  - Future-proofing
- Functions can have multiple restrictions
- In the 1st release we are implementing direct3d and cpu; cpu is the implicit default
restrict(direct3d) restrictions
- Can only call other restrict(direct3d) functions
- All functions must be inlinable
- Only direct3d-supported types: int, unsigned int, float, double; structs & arrays of these types
- Pointers and references: lambdas cannot capture by reference¹, nor capture pointers; references and single-indirection pointers are supported only as local variables and function arguments
restrict(direct3d) restrictions
- No recursion, 'volatile', virtual functions, pointers to functions, pointers to member functions, pointers in structs, or pointers to pointers
- No goto or labeled statements, throw/try/catch, globals or statics, dynamic_cast or typeid, asm declarations, or varargs
- No unsupported types, e.g. char, short, long double
Example: restrict overloading
```cpp
double bar( double ) restrict(cpu,direct3d); // 1: same code for both
double cos( double );                        // 2a: general code
double cos( double ) restrict(direct3d);     // 2b: specific code

void SomeMethod(array<double> c)
{
    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d)
    {
        //...
        double d1 = bar(c[idx]); // ok
        double d2 = cos(c[idx]); // ok, chooses direct3d overload
        //...
    });
}
```
Not Covered
- Math library, e.g. acosf
- Atomic operations library, e.g. atomic_fetch_add
- Direct3D intrinsics: debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
- Direct3D interop: *get_device, create_accelerator_view, make_array, *get_buffer