Chapter VIII Parallel Processor Organizations
-
PARALLEL PROCESSOR ORGANIZATIONS
Jehan-François [email protected]
-
Chapter Organization
Overview
Writing parallel programs
Multiprocessor organizations
Hardware multithreading
Alphabet soup (SISD, SIMD, MIMD, …)
Roofline performance model
-
OVERVIEW
-
The hardware side
Many parallel processing solutions
  Multiprocessor architectures
    Two or more microprocessor chips
    Multiple architectures
  Multicore architectures
    Several processors on a single chip
-
The software side
Two ways for software to exploit the parallel processing capabilities of hardware
  Job-level parallelism
    Several sequential processes run in parallel
    Easy to implement (OS does the job!)
  Process-level parallelism
    A single program runs on several processors at the same time
-
WRITING PARALLEL PROGRAMS
-
Overview
Some problems are embarrassingly parallel
  Many computer graphics tasks
  Brute-force searches in cryptography or password guessing
Much more difficult for other applications
  Communication overhead among sub-tasks
  Amdahl's law
  Balancing the load
-
Amdahl's Law
Assume a sequential process takes
  tp seconds to perform operations that could be performed in parallel
  ts seconds to perform purely sequential operations
The maximum speedup will be (tp + ts)/ts
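The bound above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the n-processor variant assumes the parallel part divides evenly among n processors with no communication overhead.

```c
#include <assert.h>

/* Maximum speedup with unlimited processors: (tp + ts) / ts. */
double amdahl_max_speedup(double tp, double ts) {
    assert(tp >= 0.0 && ts > 0.0);
    return (tp + ts) / ts;
}

/* Speedup with n processors, assuming the parallel part
   divides evenly and communication is free. */
double amdahl_speedup(double tp, double ts, int n) {
    assert(tp >= 0.0 && ts > 0.0 && n >= 1);
    return (tp + ts) / (ts + tp / n);
}
```

For tp = 90 and ts = 10, the maximum speedup is 10 no matter how many processors are added, and nine processors already give a speedup of 5.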
-
Balancing the load
Must ensure that the workload is equally divided among all the processors
Worst case is when one of the processors does much more work than all the others
-
Example (I)
Computation partitioned among n processors
One of them does 1/m of the work with m < n
That processor becomes a bottleneck
Maximum expected speedup: n
Actual maximum speedup: m
-
Example (II)
Computation partitioned among 64 processors
One of them does 1/8 of the work
Maximum expected speedup: 64
Actual maximum speedup: 8
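The bottleneck effect can be sketched in a few lines. This is an illustrative model only, assuming all processors start together and the run ends when the busiest one finishes; the function name and the work units are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup over a single processor when processor i performs work[i] units.
   The run ends when the busiest processor finishes, so the speedup is
   (total work) / (largest share). */
double speedup_from_shares(const double *work, size_t n) {
    assert(work != NULL && n > 0);
    double total = 0.0, largest = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += work[i];
        if (work[i] > largest)
            largest = work[i];
    }
    assert(largest > 0.0);
    return total / largest;
}
```

With 504 units of work on 64 processors, where one processor does 63 units (1/8 of the total) and the other 63 do 7 units each, the function returns 8 rather than 64.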
-
A last issue
Humans like to address issues one after the other
  We have meeting agendas
  We do not like to be interrupted
  We write sequential programs
-
René Descartes
Seventeenth-century French philosopher
Invented
  Cartesian coordinates
  Methodical doubt
    "never to accept anything for true which I did not clearly know to be such"
Proposed a scientific method based on four precepts
-
Method's third rule
"The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence."
-
MULTIPROCESSOR ORGANIZATIONS
-
Shared memory multiprocessors
(Figure labels: Interconnection network, RAM, I/O)
-
Shared memory multiprocessor
Can offer
  Uniform memory access to all processors (UMA)
    Easiest to program
  Non-uniform memory access to all processors (NUMA)
    Can scale up to larger sizes
    Offers faster access to nearby memory
-
Computer clusters
(Figure label: Interconnection network)
-
Computer clusters
Very easy to assemble
Can take advantage of high-speed LANs
  Gigabit Ethernet, Myrinet, …
Data exchanges must be done through message passing
-
Message passing (I)
If processor P wants to access data in the main memory of processor Q, it must
  Send a request to Q
  Wait for a reply
For this to work, processor Q must have a thread
  Waiting for messages from other processors
  Sending them replies
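The request/reply exchange described above can be sketched with two POSIX processes and a pair of pipes. This is a minimal illustration, not how a real cluster communicates: the two processes stand in for processors P and Q, pipes stand in for the interconnection network, and the names `q_serve`, `remote_read`, and `Q_MEM_SIZE` are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define Q_MEM_SIZE 4

/* Q's side: wait for requests and send replies until the
   request channel is closed. */
static void q_serve(int request_fd, int reply_fd) {
    const int q_memory[Q_MEM_SIZE] = {10, 20, 30, 40}; /* visible to Q only */
    int index;
    while (read(request_fd, &index, sizeof index) == (ssize_t)sizeof index) {
        int value = (index >= 0 && index < Q_MEM_SIZE) ? q_memory[index] : -1;
        if (write(reply_fd, &value, sizeof value) != (ssize_t)sizeof value)
            break;
    }
}

/* P's side: returns Q's q_memory[index], or -1 on any failure. */
int remote_read(int index) {
    int request[2], reply[2], value = -1;
    assert(index >= 0);              /* negative indices are never valid */
    if (pipe(request) != 0)
        return -1;
    if (pipe(reply) != 0) {
        close(request[0]);
        close(request[1]);
        return -1;
    }
    pid_t q = fork();
    if (q == 0) {                    /* the child plays the role of Q */
        close(request[1]);
        close(reply[0]);
        q_serve(request[0], reply[1]);
        _exit(0);
    }
    close(request[0]);
    close(reply[1]);
    if (q > 0) {
        /* Send a request to Q, then wait for its reply. */
        if (write(request[1], &index, sizeof index) != (ssize_t)sizeof index ||
            read(reply[0], &value, sizeof value) != (ssize_t)sizeof value)
            value = -1;
    }
    close(request[1]);               /* end of requests: Q's loop terminates */
    close(reply[0]);
    if (q > 0)
        waitpid(q, NULL, 0);
    return value;
}
```

Calling `remote_read(2)` returns 30 in this sketch. Even for a single integer, P pays for a full round trip, which is why message passing is much slower than a direct memory access.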
-
Message passing (II)
In a shared memory architecture, each processor can directly access all data
A proposed solution
  Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
  Still has performance issues
-
When things do not add up
Memory capacity is very important for big computing applications
If the data can fit into main memory, the computation will run much faster
-
A problem
A company replaced
  A single shared-memory computer with 32 GB of RAM
by
  Four clustered computers with 8 GB each
More I/O than ever
What happened?
-
The explanation
Assume the OS occupies one GB of RAM
The old shared-memory computer had 31 GB of free RAM
Each of the clustered computers has 7 GB of free RAM
The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
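The arithmetic generalizes to any number of nodes. A small sketch, assuming every node runs its own copy of the OS and that the OS footprint is the same on each node; the function name is hypothetical.

```c
#include <assert.h>

/* RAM left for applications, in GB, on `nodes` machines that each have
   `ram_per_node_gb` of RAM and each run their own copy of the OS. */
int free_ram_gb(int nodes, int ram_per_node_gb, int os_gb) {
    assert(nodes >= 1 && os_gb >= 0 && ram_per_node_gb >= os_gb);
    return nodes * (ram_per_node_gb - os_gb);
}
```

`free_ram_gb(1, 32, 1)` gives 31 and `free_ram_gb(4, 8, 1)` gives 28: the OS overhead is paid once per node, so splitting the same RAM across more machines leaves less of it for the program.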
-
Grid computing
The computers are distributed over a very large network
Sometimes computer time is donated
  Volunteer computing
  SETI@home
Works well with embarrassingly parallel workloads
  Searches in an n-dimensional space
-
HARDWARE MULTITHREADING
-
General idea
Let the processor switch to another thread of computation while the current one is stalled
Motivation:
  Increased cost of cache misses
-
Implementation
Entirely controlled by the hardware
  Unlike multiprogramming
Requires a processor capable of
  Keeping track of the state of each thread
    One set of registers, including the PC, for each concurrent thread
  Quickly switching among concurrent threads
-
Approaches
Fine-grained multithreading:
  Switches between threads for each instruction
  Provides highest throughputs
  Slows down execution of individual threads
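The throughput/latency trade-off of fine-grained multithreading can be illustrated with a toy cycle counter. This is a deliberately simplified model, not a description of any real processor: it assumes one instruction issued per cycle, round-robin selection among ready threads, and a fixed stall after every k-th instruction of a thread. All names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int remaining;     /* instructions still to issue */
    int stall_every;   /* a stall follows every stall_every-th instruction */
    int stall_cycles;  /* length of each stall, in cycles */
    int issued;        /* instructions issued so far */
    int ready_at;      /* first cycle at which the thread may issue again */
} HwThread;

/* Cycles needed to issue every instruction of every thread when the
   hardware picks, each cycle, the next ready thread in round-robin order. */
int fine_grained_cycles(HwThread *t, int n) {
    assert(t != 0 && n >= 1);
    int cycle = 0, next = 0, left = 0;
    for (int i = 0; i < n; i++) {
        assert(t[i].stall_every >= 1 && t[i].remaining >= 0);
        left += t[i].remaining;
    }
    while (left > 0) {
        for (int k = 0; k < n; k++) {
            HwThread *c = &t[(next + k) % n];
            if (c->remaining > 0 && c->ready_at <= cycle) {
                c->remaining--;
                c->issued++;
                left--;
                c->ready_at = cycle + 1;
                if (c->issued % c->stall_every == 0)
                    c->ready_at += c->stall_cycles;
                next = (next + k + 1) % n;
                break;
            }
        }
        cycle++;   /* this was an idle cycle if no thread was ready */
    }
    return cycle;
}
```

In this model, a thread of 4 instructions that stalls for 3 cycles after every second instruction needs 7 cycles when run alone. Two such threads interleaved need 10 cycles instead of the 14 required to run them back to back, but the first thread now finishes after 9 cycles rather than 7.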
-
Approaches
Coarse-grained multithreading:
  Switches between threads whenever a long stall is detected
  Easier to implement
  Cannot eliminate all stalls
-
Approaches
Simultaneous multithreading:
  Takes advantage of the ability of modern hardware to perform different tasks in parallel for instructions of different threads
  Best solution
-
ALPHABET SOUP
-
Overview
Used to describe processor organizations where
  The same instructions can be applied to multiple data instances
Encountered in
  Vector processors in the past
  Graphics processing units (GPUs)
  x86 multimedia extensions
-
Classification
SISD: Single instruction, single data
  Conventional uniprocessor architecture
MIMD: Multiple instructions, multiple data
  Conventional multiprocessor architecture
-
Classification
SIMD: Single instruction, multiple data
  Performs the same operations on a set of similar data
  Think of adding two vectors
    for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];
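As an illustration of how such a loop maps onto an x86 multimedia extension, here is a sketch using SSE intrinsics. It assumes single-precision data and a compiler that provides `<xmmintrin.h>`; the function name `vec_add` is hypothetical, and the code falls back to the plain scalar loop when SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* sum[i] = a[i] + b[i] for 0 <= i < n */
void vec_add(float *sum, const float *a, const float *b, size_t n) {
    assert(n == 0 || (sum != NULL && a != NULL && b != NULL));
    size_t i = 0;
#if defined(__SSE__)
    /* One SSE instruction adds four packed floats at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(sum + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles leftover elements, or everything without SSE. */
    for (; i < n; i++)
        sum[i] = a[i] + b[i];
}
```

The unaligned load and store intrinsics are used so that callers need not align their arrays. Optimizing compilers often generate similar code from the scalar loop on their own.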
-
Vector computing
Kind of SIMD architecture
Used by Cray computers
Pipelines multiple executions of a single instruction with different data (vectors) through the ALU
Requires
  Vector registers able to store multiple values
  Special vector instructions: say lv, addv, …
-
Benchmarking
Two factors to consider
  Memory bandwidth
    Depends on interconnection network
  Floating-point performance
Best known benchmark is LINPACK
-
Roofline model
Takes into account
  Memory bandwidth
  Floating-point performance
Introduces arithmetic intensity
  Total number of floating-point operations in a program divided by total number of bytes transferred to main memory
  Measured in FLOPs/byte
-
Roofline model
Attainable GFLOPs/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
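The roofline formula translates directly into code. A minimal sketch with a hypothetical function name; bandwidth is taken in GB/s, arithmetic intensity in FLOPs/byte, and peak performance in GFLOPs/s.

```c
#include <assert.h>

/* Attainable GFLOPs/s = min(peak memory BW x arithmetic intensity,
                             peak floating-point performance). */
double attainable_gflops(double peak_bw_gb_s, double intensity,
                         double peak_gflops) {
    assert(peak_bw_gb_s > 0.0 && intensity > 0.0 && peak_gflops > 0.0);
    double memory_bound = peak_bw_gb_s * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For an assumed machine with 16 GB/s of memory bandwidth and a 16 GFLOPs/s peak, `attainable_gflops(16.0, 0.125, 16.0)` evaluates to 2.0 (memory-bound), while `attainable_gflops(16.0, 4.0, 16.0)` evaluates to 16.0 (compute-bound).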
-
Roofline model
(Chart: attainable GFLOPs/s plotted against arithmetic intensity)
Chart annotations:
  Peak floating-point performance
  Floating-point performance is limited by memory bandwidth
Chart data:
  Arithmetic intensity   Attainable GFLOPs/s
  0.125                  2
  0.25                   4
  0.5                    8
  1                      16
  2                      16
  4                      16
  8                      16
  16                     16