Medical Image Processing Strategies for multi-core CPUs
Daniel Blezek, Mayo [email protected]
Poll
Does your primary computer have more than one core...?
Have you ever written parallel code?
It’s a parallel world...
SMP was formerly the domain of researchers
Thanks to Intel, now it’s everywhere!
Hardware has far outstripped software
Developers are not trained
Development of parallel software is difficult
Outside the box
– Erlang, Scala, ...
... but most of us think in serial ...
Parallel Computing – according to Google
“parallel computing”: 1.4M hits on Google
“multithreading”: 10M hits
“multicore”: 2.4M hits
“parallel programming”: 1.1M hits
Why is it so hard?
– the world is parallel
– we all think in parallel
– yet we are taught to program in serial
Degrees of parallelism (my take)
Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger-scale SIMD
– CT or MR reconstruction
– Each operation is independent, e.g. iFFT of slices
Worker thread – e.g. virus scanning software
Coarse-grained parallelism – SMP or MIMD
– Focus of this presentation, more in GPU talk
– Concurrency, OpenMP, TBB, pthreads/WinThreads
Large scale – MPI on cluster, tight coupling
Large scale – Grid computing, loose coupling
Pragmatic approach
C/C++ and Fortran are the kings of performance
– (I’ve never written a single line of Fortran, so don’t ask)
“Bolted on” parallel concepts
– Zero language support
Huge existing codebase
Pragmatic approach
Briefly touch on SIMD
Introduce SMP concepts
– Threads, concurrency
Development models
– pthreads/WinThreads
– OpenMP
– TBB
– ITK
Medical Image Processing
– Example problems
– Common errors
Next steps
SIMD
SIMD – basic principles
http://en.wikipedia.org/wiki/SIMD
Data structures for SIMD
Array of Structures
struct Vec {
  float x, y, z;
};
Vec[] points = new Vec[sz];
[Figure: X Y Z -- elements are packed, multiplied (*), and unpacked]
Data structures for SIMD
Structure of Arrays
struct Vec {
  float[] x;
  float[] y;
  float[] z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  };
};

Structure of Arrays
struct Vec {
  Vector4f[] v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  };
};
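The two layouts above can be sketched in compilable C++ roughly as follows. This is only an illustration: the names `PointAoS`, `PointsSoA`, `scaleAoS` and `scaleSoA` are made up for this sketch, and `std::vector` stands in for the slide's array notation. Whether the second loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: x, y, z of one point sit next to each other in memory.
struct PointAoS {
  float x, y, z;
};

// Structure of Arrays: all x values are contiguous, then all y, then all z.
struct PointsSoA {
  std::vector<float> x, y, z;
  explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Scale every point. Each component is accessed with a stride of three floats.
void scaleAoS(std::vector<PointAoS>& p, float s) {
  for (std::size_t i = 0; i < p.size(); ++i) {
    p[i].x *= s;
    p[i].y *= s;
    p[i].z *= s;
  }
}

// Same operation on unit-stride arrays: three simple loops that an
// autovectorizing compiler has a better chance of turning into SIMD code.
void scaleSoA(PointsSoA& p, float s) {
  assert(p.x.size() == p.y.size() && p.y.size() == p.z.size());
  for (std::size_t i = 0; i < p.x.size(); ++i) p.x[i] *= s;
  for (std::size_t i = 0; i < p.y.size(); ++i) p.y[i] *= s;
  for (std::size_t i = 0; i < p.z.size(); ++i) p.z[i] *= s;
}
```

Both functions compute the same result; only the memory layout differs, which is why moving existing code to SIMD may mean refactoring its data structures.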
SIMD pitfalls
Structure alignment
– Usually needs to be aligned on word boundary
Structure considerations
– May need to refactor existing code/structures
Generally not cross-platform
– MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc...
Performance gains are modest
– 2x – 4x common
Limited instructions
– Add, multiply, divide, round
– Not suitable for branching logic
Autovectorizing compilers for simple loops
– -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)
Threads
Threads – they’re everywhere
SMP concepts
Useful to think in terms of “cores”
– 2 dual-core CPUs = 4 “cores”
– Cores share main memory, may share cache
– Threads in same process share memory
Generally, one executing thread per core
– Other threads sleeping
Cores – they’re everywhere
How many cores does your laptop have?
Mine has 50(!)
– 2 Intel CPU cores (Core 2 Duo)
– 32 nVidia cores (9600M GT)
– 16 nVidia cores (9400M)
Parallel concepts for SMP
Process
– Started by the OS
– Single thread executes “main”
– No direct access to memory of other processes
Threads
– Stream of execution under a process
– Access to memory in containing process
– Private memory
– Lifetime may be less than main thread
Concurrency
– Coordination between threads
– High level (mutex, locks, barriers)
– Low level (atomic operations)
Processes & Threads
[Figure: Process and Thread]
Thread construction – pthread example
#include <pthread.h>

// Thread work function, must return pointer to void
void *doWork(void *work) {
  // Do work
  return work; // equivalent to pthread_exit ( work );
}
...
pthread_t child;
...
rc = pthread_create(&child, &attr, doWork, (void *)work);
...
rc = pthread_join ( child, &threadwork );
...
Thread construction – Win32 example
#include <windows.h>

DWORD WINAPI doWork( LPVOID work ) {
  // Do work
  return 0;
}
...
PMYDATA work;
DWORD childID;
HANDLE child;
child = CreateThread(
    NULL,      // default security attributes
    0,         // use default stack size
    doWork,    // thread function name
    work,      // argument to thread function
    0,         // use default creation flags
    &childID); // returns the thread identifier

// One thread: wait on its handle. For an array of NThreads handles,
// use WaitForMultipleObjects(NThreads, handles, TRUE, INFINITE);
WaitForSingleObject(child, INFINITE);
Thread construction – Java example
import java.lang.Thread;

class Worker implements Runnable {
  public Worker ( Work work ) {}

  public void run() {
    // Do work here
  }
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
Race Conditions
[Figure: Serial vs. Parallel. Problem!]
Mutex
Mutex
– Mutual exclusion lock
– Protects a section of code
– Only one thread has a lock on the object
– Threads may
  • wait for the mutex
  • return a status if the mutex is locked
Semaphore
– N threads
Critical Section
– One thread executes code
– Protects global resources
– Maintain consistent state
Race Conditions
...
N = 0;
...
// Start some threads
...
void* doWork() {
  N++; // get, incr, store
}

Solution w/Mutex
Mutex mutex;
...
void* doWork() {
  mutex.lock();
  N++; // get, incr, store
  mutex.release();
}
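A minimal compilable sketch of the mutex solution, under some assumptions: `countWithMutex` is a hypothetical helper, the threads come from an OpenMP `parallel for` (introduced later in this talk), and the slide's `Mutex` class is replaced by C++11 `std::mutex`, whose release call is `unlock()` and is handled here by `std::lock_guard`. Mixing `std::mutex` with OpenMP threads works on common compilers but should be treated as an assumption. If OpenMP is not enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical helper: increments a shared counter `iterations` times.
// The get / increment / store sequence is protected by a mutex, so
// concurrent threads cannot interleave inside it.
int countWithMutex(int iterations) {
  assert(iterations >= 0);
  int N = 0;         // shared by all threads
  std::mutex mutex;  // shared by all threads
#pragma omp parallel for
  for (int i = 0; i < iterations; ++i) {
    std::lock_guard<std::mutex> lock(mutex);  // unlocked at end of scope
    ++N;
  }
  return N;
}
```

Without the `lock_guard` line, the result could come out smaller than `iterations` when several threads run the loop, which is the race shown on the slide.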
Atomic operations
Locks are not perfect
– Cause blocking
– Relatively heavy-weight
Atomic operations
– Simple operations
– Hardware support
– Can implement w/Mutex
Conditions
– Invisibility – no other thread knows about the change
– Atomicity – if operation fails, return to original state
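As a rough illustration of a lock-free update, the sketch below keeps a running maximum with a compare-and-swap loop. The helper name `atomicMax` is hypothetical, C++11 `std::atomic` is used instead of any library from the slides, and the threads are assumed to come from an OpenMP `parallel for` (the pragma is ignored if OpenMP is not enabled).

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical helper: largest value in `data`, found without a lock.
int atomicMax(const std::vector<int>& data) {
  assert(!data.empty());
  std::atomic<int> best(data[0]);
  const int n = static_cast<int>(data.size());
#pragma omp parallel for
  for (int i = 1; i < n; ++i) {
    int seen = best.load();
    // Try to replace `seen` with data[i]. If another thread changed `best`
    // in the meantime, the exchange fails, `best` is left untouched, `seen`
    // is refreshed with the current value, and the condition is re-checked.
    while (data[i] > seen && !best.compare_exchange_weak(seen, data[i])) {
    }
  }
  return best.load();
}
```

A failed exchange changes nothing, which matches the atomicity condition on the slide: the operation either takes effect completely or leaves the original state in place.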
Deadlock
[Figure: threads contending for Mutex A and Mutex B]
Thread synchronization – barrier
Initialized with the number of threads expected
Threads signal when they are ready
– Wait until all expected threads are there
A stalled or dead thread can stall all the threads
Thread synchronization – Condition variables
Workers atomically release mutex and wait
Master atomically releases mutex and signals
Workers wake up and acquire mutex
[Figure: threads, Mutex A and a condition; states Wait and Working]
Thread pool & Futures
Maintains a “pool” of Worker threads
Work queued until thread available
Optionally notify through a “Future”
– Future can query status, holds return value
Thread returns to pool, no startup overhead
Core concept for OpenMP and TBB
OpenMP
Introduction to OpenMP
Scatter / gather paradigm
– Maintains a thread pool
Requires compiler support
– Visual C++, gcc 4.2 and later, Intel Compiler
Easy to adapt existing serial code, easy to debug
– Simple paradigm
OpenMP – simple parallel sections
#pragma omp parallel sections num_threads ( 5 )
{
  // 5 Threads scatter here
  #pragma omp section
  {
    // Do task 1
  }
  #pragma omp section
  {
    // Do task 2
  }
  ...
  #pragma omp section
  {
    // Do task N
  }
  // Implicit barrier
}
OpenMP – parallel for
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier
Scheduling the iterations
OpenMP – reduction
int TotalAmountOfWork = 0;

#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier

// TotalAmountOfWork was properly accumulated
// Each thread has local copy, barrier does reduction
// No need to use critical sections
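For a version that can be compiled and run, here is a small sketch of the same reduction pattern applied to a flat voxel buffer. The function name `sumVoxels` and the use of `std::vector<short>` are assumptions made for this example. If OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Sum of all voxel intensities. Each thread accumulates into its own
// private copy of `total`; the copies are combined when the loop ends.
long long sumVoxels(const std::vector<short>& voxels) {
  assert(voxels.size() <= static_cast<std::size_t>(INT_MAX));
  long long total = 0;
  const int n = static_cast<int>(voxels.size());
#pragma omp parallel for reduction(+ : total)
  for (int i = 0; i < n; ++i) {
    total += voxels[i];
  }
  return total;
}
```

For example, a buffer of 1000 voxels that all hold the value 3 sums to 3000, regardless of how many threads share the loop.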
OpenMP – “atomic” reduction
int TotalAmountOfWork = 0;

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier

// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
OpenMP – critical
int TotalAmountOfWork = 0;

#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );

  #pragma omp critical
  {
    // Execute by one thread at a time, e.g., “Mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single
int TotalAmountOfWork = 0;

#pragma omp parallel
{
  // Threads scatter here
  #pragma omp for reduction ( + : TotalAmountOfWork )
  for ( int i = 0; i < NumberOfIterations; i++ ) {
    // each thread has a private copy of i
    TotalAmountOfWork += doSomeWork( i );
  }
  // Implicit barrier at the end of the for, reduction is complete

  #pragma omp single nowait
  {
    // Execute by one thread, use “master” for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of “nowait” clause !!
}
// Implicit barrier
// NB: single is a worksharing construct, so it may not be nested
// inside the loop body of a parallel for
Threading Building Blocks (TBB)
Introduction to TBB
Commercial and Open Source Licenses
– GPL with runtime exception
Cross-platform C++ library
– Similar to STL
– Usual concurrency classes
Several different constructs for threading
– for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/
TBB – parallel for
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
 public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ),
                    tbb::auto_partitioner() );
TBB – parallel reduction
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) {...};
  // Splitting constructor: the new worker starts with an empty result
  ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {};
  // Merge the result of another worker into this one
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
  int getLocalWork() const { return mLocalWork; }

  void operator() ( const tbb::blocked_range<int>& r ) { ... }
};
...
ReducingWorker w;
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                       w,
                       tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction
TBB – synchronization
tbb::spin_mutex MyMutex;

void doWork ( /* ... */ ) {
  // Enter critical section, exit when lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );

  // NB: This is an error!!!
  // tbb::spin_mutex::scoped_lock ( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;      // Atomic
int i = MyCounter;  // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
...
MyCounter = 0; MyCounter += 2; // Watch out for other threads!
ITK Model
ITK Implementation
Threads operate across slices
– Only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
– Requires that you break the ITK model
– Uses Thread Join, higher overhead
– No thread pool
Comparison
Threads (C/C++)
+ Fine-grain control
- Not cross-platform
- Few constructs
ITK
+ Integrated
+ Simple
- Limited control
+/- ITK only
TBB
+/- More complex
+ Fine-grain control
+ Intel (-?)
+ Open Source
+ Some constructs
- Must re-write code
OpenMP
+ Simple
+ Adapt existing code
+/- Industry standard
+/- Compiler support
- Coarse-grain control
Language specific (Java)
+ Fine-grain control
+ Cross-platform easy(?)
+ Many constructs
+/- Language-specific
Medical Imaging
Image class
class Image {
 public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
Trivial problem – threshold
Threshold an image
– If intensity > 100, output 1
– otherwise output 0
Present from simple to complex
– OpenMP
– TBB
– ITK
– pthread (see extra slides)
Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}

// NB: can loop over slices, rows or columns by moving
// pragma, but must choose at compile time
Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}

// Likely a lot faster than previous code
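The flat-buffer threshold can be tried out without the `Image` class. In the sketch below, the function name `thresholdVoxels`, the `level` parameter and the `std::vector<short>` buffers are assumptions for illustration. Every iteration writes a different element of `out`, so no synchronization is needed. If OpenMP is not enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Binary threshold over a flat voxel buffer:
// out[s] is 1 where in[s] > level, and 0 otherwise.
void thresholdVoxels(const std::vector<short>& in, std::vector<short>& out,
                     short level) {
  assert(in.size() == out.size());
  const int n = static_cast<int>(in.size());
#pragma omp parallel for
  for (int s = 0; s < n; ++s) {
    out[s] = (in[s] > level) ? 1 : 0;
  }
}
```

A call such as `thresholdVoxels(in, out, 100)` reproduces the rule on the slides: voxels above 100 become 1, everything else 0.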
Threshold – TBB #1
class Threshold {
  Image* in;
  Image* out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int x = r.begin(); x != r.end(); ++x ) {
      if ( in->mData[x] > 100 ) {
        out->mData[x] = 1;
      } else {
        out->mData[x] = 0;
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
               Threshold ( in, out ),
               auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2
class Threshold {
  Image* in;
  Image* out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
    // Each 2D block (rows x cols) is processed through all slices
    for ( int z = 0; z < in->mDepth; z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth, 32 ),
               Threshold ( in, out ),
               auto_partitioner() );
Threshold – TBB #3
class Threshold {
  Image* in;
  Image* out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Threshold ( in, out ),
               auto_partitioner() );
Threshold – ITK solution
ThreadedGenerateData( const OutputImageRegionType& out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt(inputPtr, out);
  ImageRegionIterator<TOut> outputIt(outputPtr, out);

  inputIt.GoToBegin();
  outputIt.GoToBegin();

  while( !inputIt.IsAtEnd() ) {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
  }
}
Interesting problem – anisotropic diffusion
Edge-preserving smoothing method
– Perona and Malik. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on (1990) vol. 12 (7) pp. 629–639
Iterative process
Demonstrate
– OpenMP
– TBB
– (ITK has an implementation)
– (pthreads are tedious at the very least)
Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
#pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
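One way to structure such an iterative filter is sketched below, with loud caveats: it uses plain linear diffusion (a 6-neighbour Laplacian) as a stand-in for the Perona-Malik update, a hypothetical `Volume` struct instead of the `Image` class, and float voxels. The points it tries to show are that the time loop stays serial, the slice loop is parallel, all per-iteration variables are declared inside the loop so each thread has its own, and each step reads one buffer while writing the other.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal volume: flat float buffer, x fastest, then y, then z.
struct Volume {
  int w, h, d;
  std::vector<float> v;
  Volume(int w_, int h_, int d_, float fill = 0.0f)
      : w(w_), h(h_), d(d_), v(static_cast<std::size_t>(w_) * h_ * d_, fill) {}
  float& at(int x, int y, int z) {
    return v[(static_cast<std::size_t>(z) * h + y) * w + x];
  }
};

// Runs `steps` explicit diffusion iterations in place.
// Boundary voxels are left unchanged.
void diffuse(Volume& a, int steps, float rate) {
  assert(steps >= 0);
  Volume b = a;      // second buffer, so a step never reads what it writes
  Volume* src = &a;
  Volume* dst = &b;
  for (int t = 0; t < steps; ++t) {  // each step needs the previous one: serial
#pragma omp parallel for
    for (int z = 1; z < src->d - 1; ++z) {  // slices are independent: parallel
      for (int y = 1; y < src->h - 1; ++y) {
        for (int x = 1; x < src->w - 1; ++x) {
          const float c = src->at(x, y, z);
          const float lap = src->at(x - 1, y, z) + src->at(x + 1, y, z) +
                            src->at(x, y - 1, z) + src->at(x, y + 1, z) +
                            src->at(x, y, z - 1) + src->at(x, y, z + 1) -
                            6.0f * c;
          dst->at(x, y, z) = c + rate * lap;
        }
      }
    }
    std::swap(src, dst);  // done by one thread, after the implicit barrier
  }
  if (src != &a) a = *src;  // after an odd number of steps the result is in b
}
```

Parallelizing the time loop instead would let different threads work on different iterations at once, even though every iteration depends on the result of the one before it.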
Anisotropic diffusion – TBB #1
class doAD {
 public:
  static ADConstants* sConstants;
  doAD ( Image* in, Image* out ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    if ( sConstants == NULL ) {
      initConstants();
    }
    // process
    ...
  }
};
Anisotropic diffusion – TBB #2
class doAD {
 public:
  doAD ( ... ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
               doAD ( in, out ),
               auto_partitioner() );
Anisotropic diffusion – TBB #3
class doAD {
 public:
  static tbb::atomic<int> sProgress;
  tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – TBB #4
class doAD {
 public:
  static tbb::atomic<int> sProgress;
  static tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – OpenMP (Progress)
using namespace std;
void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single
      reportProgress ( progress );
      ...
    }
  }
}
Real-life problem
Compute Frangi’s vesselness measure
– Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
Memory constrained solution
– ITK implementation requires 1.2G for 100M volume
  • Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using
– OpenMP, TBB
Vesselness
ITK Implementation – computing the Hessian
6 volumes computed in serial
– Individual filters are threaded
– Good CPU usage
– High memory requirements
Design considerations
Break problem into blocks
– Compute Hessian, eigenvalues, and vesselness
– Reduces memory requirements
– Incurs overhead, boundary conditions
Design considerations
Keep CPUs full
Design considerations – boundary condition
Trade-offs
Algorithm sketch – Serial
int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP
int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

Each thread is on a different slice
– May cause cache contention
– Similar problems for “y” direction
Algorithm sketch – OpenMP
int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
#pragma omp parallel for
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

All threads on same rows
– May not utilize all CPUs
  • If ratio of Width to BlockSize < # CPUs
– Better cache utilization
Algorithm sketch – TBB
Individual blocks
– Full CPUs
– May not have best cache performance

class Vesselness {
 public:
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    // Process the block, could use ITK here
    processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                   r.cols().size(), r.rows().size(), r.pages().size() );
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 32,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Vesselness( in, out ),
               auto_partitioner() );
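The block decomposition used in these sketches can also be made explicit, which is one possible way to keep all CPUs busy with OpenMP. The code below is an assumption-laden illustration: `Block`, `makeBlocks` and `processAllBlocks` are hypothetical names, block extents are clamped at the far edges of the volume, and the per-block "work" is simply counting voxels where a real filter would compute the Hessian, eigenvalues and vesselness.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical block descriptor: origin and clamped extent along each axis.
struct Block {
  int x, y, z;     // origin of the block
  int sx, sy, sz;  // size, smaller than blockSize at the far edges
};

// Split a width x height x depth volume into blocks of at most
// blockSize voxels per side.
std::vector<Block> makeBlocks(int width, int height, int depth, int blockSize) {
  assert(blockSize > 0);
  std::vector<Block> blocks;
  for (int z = 0; z < depth; z += blockSize) {
    for (int y = 0; y < height; y += blockSize) {
      for (int x = 0; x < width; x += blockSize) {
        Block b = {x, y, z,
                   std::min(blockSize, width - x),
                   std::min(blockSize, height - y),
                   std::min(blockSize, depth - z)};
        blocks.push_back(b);
      }
    }
  }
  return blocks;
}

// Visit every block in parallel. With a flat list, the number of parallel
// work items is the total block count, not just the count along one axis.
long long processAllBlocks(const std::vector<Block>& blocks) {
  long long voxels = 0;
  const int n = static_cast<int>(blocks.size());
#pragma omp parallel for reduction(+ : voxels) schedule(dynamic)
  for (int i = 0; i < n; ++i) {
    voxels += static_cast<long long>(blocks[i].sx) * blocks[i].sy * blocks[i].sz;
  }
  return voxels;
}
```

The `schedule(dynamic)` clause hands blocks to threads as they become free, which may help when edge blocks are smaller than the rest. A real implementation would also have to read a margin of neighbouring voxels around each block to handle the boundary conditions.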
Next steps
Go try parallel development
– Try threads to gain understanding and insight
– Next OpenMP, adapting existing code
– TBB: more constructs, different approaches
Experiment with new languages
– Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun! It’s a brave new world out there...
Resources
TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
– Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
– Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
– ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
– The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
– Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
– pthreads (https://computing.llnl.gov/tutorials/pthreads/)
– OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
– LLNL (https://computing.llnl.gov/)
– Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
– GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
– Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources
Languages
– Erlang (http://www.erlang.org/)
– Scala (http://www.scala-lang.org/)
– Chapel (http://chapel.cs.washington.edu/)
– X10 (http://x10-lang.org/)
– Unified Parallel C (http://upc.gwu.edu/)
– Titanium (http://titanium.cs.berkeley.edu/)
– Co-Array Fortran (http://www.co-array.org/)
– ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
– High Performance Fortran (http://hpff.rice.edu/)
– Fortress (http://projectfortress.sun.com/Projects/Community/)
– Others (http://www.google.com/search?q=parallel+programming+language)
Thread construction – pthread example
#include <pthread.h>

void *(*start_routine)(void *);

int pthread_create(pthread_t *restrict thread,
                   const pthread_attr_t *restrict attr,
                   void *(*start_routine)(void *),
                   void *restrict arg);

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);
Mutex – pthread example
#include <pthread.h>
pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical Section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // trylock returns 0 when we did get the lock (EBUSY if it is held),
  // so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example
import java.lang.*;

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
    ...
  }

  Object resource;
  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}
Threshold – pthread
struct Work { Image* in; Image *out; int start; int end; };
Work workArray[THREADCOUNT];
pthread_t thread[THREADCOUNT];

void* doThreshold ( void* inWork ) {
  Work* work = (Work*) inWork;
  for ( int s = work->start; s < work->end; s++ ) {...}
  return NULL;
}
...
pthread_attr_t attributes;
pthread_attr_init ( &attributes );
pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
for ( int t = 0; t < THREADCOUNT; t++ ) {
  initializeWork ( in, out, t, workArray[t] );
  pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] );
}
for ( int t = 0; t < THREADCOUNT; t++ ) {
  pthread_join ( thread[t], NULL );
}
Insight Toolkit
Semaphore
Allow N threads access
– Protects limited resources
Binary semaphore
– N = 1
– Equivalent to Mutex
ITK – itk::MultiThreader
#include <itkMultiThreader.h>

// Win32
DWORD doWork ( LPVOID lpThreadParameter );
// Pthread - Linux, Mac, Unix
void* doWork ( void* inWork );

itk::MultiThreader::Pointer threader = itk::MultiThreader::New();

threader->SetNumberOfThreads ( NumberOfThreads );
for ( int i = 0; i < NumberOfThreads; i++ ) {
  threader->SetMultipleMethod ( i, doWork, (void*) work[i] );
}
// Explicit barrier, waits for Thread join
threader->MultipleMethodExecute();
Insight Toolkit
#include <itkImageToImageFilter.h>

template <class In, class Out>
class Worker : public itk::ImageToImageFilter<In, Out> {
  ...
  void BeforeThreadedGenerateData() {
    // Master thread only
    ...
  }
  void ThreadedGenerateData( const OutputImageRegionType &r, int tid ) {
    // Generate output data for r
    ...
  }
  void AfterThreadedGenerateData() {
    // Master thread only
    ...
  }
};
// Output split on last dimension
// i.e. Slices for 3D volumes
Anisotropic diffusion – OpenMP
using namespace std;
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
#pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
      ...
    }
  }
}