CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing - SIMD
John Wawrzynek & Nick Weaver
http://inst.eecs.berkeley.edu/~cs61c
61C Survey
"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."
Agenda
- 61C: the big picture
- Parallel processing
- Single instruction, multiple data
- SIMD matrix multiplication
- Amdahl's law
- Loop unrolling
- Memory access strategy: blocking
- And in conclusion, ...
61C Topics So Far
- What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Processor micro-architecture
  6. Pipelining
  7. Caches
  8. Floating point
- What does this buy us?
  - Promise: execution speed
  - Let's check!
Reference Problem
- Matrix multiplication
  - Basic operation in many engineering, data, and image-processing tasks
  - Ex: image filtering, noise reduction, ...
  - Core operation in neural nets and deep learning
    - Image classification (cats...)
    - Robot cars
    - Machine translation
    - Fingerprint verification
    - Automatic game playing
- dgemm
  - Double-precision floating-point general matrix multiply
  - Standard, well-studied, and widely used routine
  - Part of LINPACK/LAPACK
2D Matrices
[Figure: an N x N matrix with indices running from 0 to N-1 in each dimension]
- Square matrix of dimension N x N
- i indexes rows
- j indexes columns
Matrix Multiplication
C = A * B
C_ij = Σ_k (A_ik * B_kj)
2D Matrix Memory Layout
- a[][] in C uses row-major order
- Fortran uses column-major order
- Our examples use column-major
- In row-major layout, consecutive addresses (0x0, 0x8, 0x10, ...) hold a00, a01, a02, a03, ...; in column-major layout they hold a00, a10, a20, a30, ...
- Row-major: a_ij is at a[i*N + j]
- Column-major: a_ij is at a[i + j*N]
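The two indexing formulas can be checked with a short illustrative sketch (pure Python; the names here are chosen just for demonstration):

```python
N = 4
# Build an N x N matrix of symbolic entries "aij".
names = [[f"a{i}{j}" for j in range(N)] for i in range(N)]

# Flatten the same matrix two ways.
row_major = [names[i][j] for i in range(N) for j in range(N)]  # rows contiguous
col_major = [names[i][j] for j in range(N) for i in range(N)]  # columns contiguous

i, j = 3, 1
# Row-major: a_ij lives at a[i*N + j]; column-major: at a[i + j*N].
print(row_major[i * N + j])  # -> a31
print(col_major[i + j * N])  # -> a31
```

Both formulas locate the same logical element; only the physical order of the flat array differs.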
dgemm Reference Code: Python
- Linear addressing; assumes "column-major" memory layout

N    Python [MFLOPS]
32   5.4
160  5.5
480  5.4
960  5.3

- 1 MFLOPS = 1 million floating-point operations per second (fadd, fmul)
- dgemm on N x N matrices takes 2*N^3 floating-point operations
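The Python reference code itself does not survive in this transcript; a minimal sketch consistent with the slide's description (linear addressing, column-major layout, 2*N^3 flops) might look like:

```python
def dgemm(N, a, b, c):
    # a, b, c are flat lists of N*N doubles in column-major order:
    # element (i, j) lives at index i + j*N.
    for i in range(N):
        for j in range(N):
            cij = c[i + j * N]
            for k in range(N):
                cij += a[i + k * N] * b[k + j * N]  # one fmul + one fadd
            c[i + j * N] = cij                      # => 2*N^3 flops total
```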
dgemm Reference Code: C
- c = a * b
- a, b, c are N x N matrices
Timing Program Execution
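The timing code on the slide is not reproduced in the transcript; a hypothetical Python equivalent (the function name and interface here are illustrative) wraps the kernel in a monotonic timer and converts the flop count into GFLOPS:

```python
import time

def time_kernel(N, kernel, a, b, c):
    """Return (seconds, GFLOPS) for one dgemm-style call on N x N data."""
    start = time.perf_counter()            # monotonic, high-resolution clock
    kernel(N, a, b, c)
    elapsed = time.perf_counter() - start
    flops = 2.0 * N**3                     # dgemm performs 2*N^3 flops
    # Guard against a zero reading for tiny kernels.
    return elapsed, flops / max(elapsed, 1e-12) / 1e9
```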
C versus Python

N    C [GFLOPS]  Python [GFLOPS]
32   1.30        0.0054
160  1.30        0.0055
480  1.32        0.0054
960  0.91        0.0053

- C is roughly 240x faster than Python!
- Which other class gives you this kind of power? We could stop here... but why? Let's do better!
Why Parallel Processing?
- CPU clock rates are no longer increasing
  - Technical & economic challenges
    - Advanced cooling technology is too expensive or impractical for most applications
    - Energy costs are prohibitive
- Parallel processing is the only path to higher speed
  - Compare airlines:
    - Maximum air speed limited by economics
    - Use more and larger airplanes to increase throughput
    - (And smaller seats...)
Using Parallelism for Performance
- Two basic approaches to parallelism:
  - Multiprogramming: run multiple independent programs in parallel ("easy")
  - Parallel computing: run one program faster ("hard")
- We'll focus on parallel computing in the next few lectures
New-School Machine Structures (It's a bit more complicated!)
How do we harness parallelism and achieve high performance?
- Parallel requests: assigned to a computer, e.g., search "Katz" (warehouse-scale computer)
- Parallel threads: assigned to a core, e.g., lookup, ads
- Parallel instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
- Parallel data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) <- today's lecture
- Hardware descriptions: all gates @ one time
- Programming languages
[Figure: hardware/software levels, from warehouse-scale computer and smartphone down to computer (cores, cache memory, input/output), core (instruction units, functional units), and logic gates]
Single-Instruction/Single-Data Stream (SISD)
- A sequential computer that exploits no parallelism in either the instruction or data streams
- Examples of SISD architecture are traditional uniprocessor machines, e.g., our RISC-V processor
- This is what we have done up to now in 61C
Single-Instruction/Multiple-Data Stream (SIMD, or "sim-dee")
- A SIMD computer processes multiple data streams using a single instruction stream, e.g., Intel SIMD instruction extensions or an NVIDIA Graphics Processing Unit (GPU)
- Today's topic
Multiple-Instruction/Multiple-Data Streams (MIMD, or "mim-dee")
- Multiple autonomous processors simultaneously executing different instructions on different data
- MIMD architectures include multicore and warehouse-scale computers
- Topic of Lecture 19 and beyond
[Figure: an instruction pool and a data pool feeding multiple processing units (PUs)]
Multiple-Instruction/Single-Data Stream (MISD)
- A computer that processes multiple instruction streams with a single data stream
- Of historical significance; this has few applications. Not covered in 61C.
Flynn* Taxonomy, 1966
- SIMD and MIMD are currently the most common forms of parallelism in architectures; usually both are present in the same system!
- Most common parallel-processing programming style: Single Program Multiple Data ("SPMD")
  - A single program runs on all processors of an MIMD machine
  - Cross-processor execution is coordinated using synchronization primitives

* Prof. Michael Flynn, Stanford
Patterson and Hennessy Win Turing Award!

NEW YORK, NY, March 21, 2018 – ACM, the Association for Computing Machinery, today named John L. Hennessy, former President of Stanford University, and David A. Patterson, retired Professor of the University of California, Berkeley, recipients of the 2017 ACM A.M. Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry. Hennessy and Patterson created a systematic and quantitative approach to designing faster, lower power, and reduced instruction set computer (RISC) microprocessors.
SIMD: "Single Instruction Multiple Data"
[Figure: scalar mode applies one operation to one pair of operands at a time; SIMD (vector) mode applies the same operation across a whole vector of operands at once]
SIMD Applications & Implementations
- Applications
  - Scientific computing: MATLAB, NumPy
  - Graphics and video processing: Photoshop, ...
  - Big data: deep learning
  - Gaming
  - ...
- Implementations
  - x86
  - ARM
  - RISC-V vector extensions
First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

Intel x86 MMX, 1997

AVX is also supported by AMD processors
https://chrisadkin.io/2015/06/04/under-the-hood-of-the-batch-engine-simd-with-sql-server-2016-ctp/
Laptop CPU Specs
$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
AVX SIMD Registers
[Figure: the AVX register file; 256-bit YMM registers whose low 128 bits are the older XMM registers]

SIMD Data Types
[Figure: the packed data types held in SIMD registers, from many narrow integers up to 4 packed doubles per 256-bit register]
(Now AVX-512 is also available.)
Problem
- Today's compilers can generate SIMD code, but in some cases better results come from coding by hand (assembly)
- We will study x86 (not RISC-V, since no RISC-V vector hardware is widely available yet)
  - Over 1000 instructions to learn...
- Can we get the compiler to generate all the non-SIMD instructions for us?
x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
[Figure: an Intel Intrinsics Guide entry, showing an intrinsic and its corresponding assembly instruction; the example performs 4 parallel multiplies at 2 instructions per clock cycle (CPI = 0.5)]
x86 Intrinsics AVX Data Types
- Intrinsics: direct access to assembly from C
[Figure: table of AVX C data types]
Intrinsics AVX Code Nomenclature
[Figure: table of the _mm256 intrinsic naming convention, e.g., the "pd" suffix = packed double precision]
Raw Double-Precision Throughput

Characteristic                          Value
CPU                                     i7-5557U
Clock rate (sustained)                  3.1 GHz
Instructions per clock (mul_pd)         2
Parallel multiplies per instruction     4
Peak double-precision throughput        3.1 GHz x 2 x 4 = 24.8 GFLOPS

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
Vectorized Matrix Multiplication
- "Vectorized" dgemm: the outer loop advances i four elements at a time (for i ...; i += 4), so each pass of the inner loop (for j ...) computes 4 elements of C at once
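The AVX C source is not reproduced in the transcript. As an illustrative stand-in, the same structure can be sketched in Python, with a 4-element list playing the role of one 256-bit YMM register; the comments name the AVX intrinsic each step corresponds to:

```python
def dgemm_vectorized_sketch(N, a, b, c):
    # Column-major layout; N must be a multiple of 4.
    # c0 mimics one 256-bit register holding c[i..i+3][j].
    for i in range(0, N, 4):
        for j in range(N):
            c0 = [c[i + x + j * N] for x in range(4)]    # _mm256_load_pd
            for k in range(N):
                bkj = b[k + j * N]                       # _mm256_broadcast_sd
                for x in range(4):
                    c0[x] += a[i + x + k * N] * bkj      # _mm256_mul_pd + _mm256_add_pd
            for x in range(4):
                c[i + x + j * N] = c0[x]                 # _mm256_store_pd
```

One multiply-add now updates 4 elements of C per "register" operation, which is where the roughly 4x speedup in the next table comes from.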
Performance

N    GFLOPS
     scalar   avx
32   1.30     4.56
160  1.30     5.47
480  1.32     5.27
960  0.91     3.64

- 4x faster
- But still << the theoretical ~25 GFLOPS!
Loop Unrolling
- On high-performance processors, optimizing compilers perform "loop unrolling" to expose more parallelism and improve performance:

for (i = 0; i < N; i++)
    x[i] = x[i] + s;

could become (assuming N is a multiple of 4):

for (i = 0; i < N; i += 4) {
    x[i]   = x[i]   + s;
    x[i+1] = x[i+1] + s;
    x[i+2] = x[i+2] + s;
    x[i+3] = x[i+3] + s;
}

Effects of unrolling:
1. Exposes data-level parallelism for vector (SIMD) instructions or superscalar multiple-instruction issue
2. Lets the pipeline mix unrelated operations, helping to reduce hazards
3. Reduces loop "overhead" (index updates, branches)
4. Makes code size larger
Amdahl’sLaw*appliedtodgemm• Measureddgemmperformance
− Peak5.5GFLOPS− Largematrices3.6GFLOPS− Processor 24.8GFLOPS
• Whyarewenotgetting(closeto)25GFLOPS?− Somethingelse(notfloating-pointALU)islimitingperformance!− Butwhat?Possibleculprits:
▪ Cache▪ Hazards▪ Let’slookatboth!
CS61c 44
* Amdahl’sLawstatesthattheamountaspeeduppossiblethroughparallelismislimitedbythesequential(non-parallel)work.
Pipeline Hazards in dgemm
- "add_pd" depends on the result of "mult_pd", which in turn depends on "load_pd"
Loop Unrolling in dgemm
- The compiler does the unrolling, using 4 registers to hold independent partial sums
- How do you verify that the generated code is actually unrolled? (Inspect the compiler's assembly output.)
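The unrolled C code is not reproduced in the transcript. A hypothetical Python sketch of the structure: the i loop is unrolled so that four independent 4-wide accumulators, standing in for four YMM registers, break the dependence chain between successive multiply-adds:

```python
def dgemm_unrolled_sketch(N, a, b, c):
    # Column-major layout; N must be a multiple of 16 (4 "registers" x 4 lanes).
    for i in range(0, N, 16):
        for j in range(N):
            # Four independent accumulators: no add waits on another add's result.
            acc = [[c[i + r * 4 + x + j * N] for x in range(4)] for r in range(4)]
            for k in range(N):
                bkj = b[k + j * N]
                for r in range(4):          # 4 independent register accumulators
                    for x in range(4):      # 4 lanes per register
                        acc[r][x] += a[i + r * 4 + x + k * N] * bkj
            for r in range(4):
                for x in range(4):
                    c[i + r * 4 + x + j * N] = acc[r][x]
```

Because the four accumulators are independent, the pipeline can issue their multiply-adds back to back instead of stalling on the add/multiply/load chain noted above.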
Performance

N    GFLOPS
     scalar   avx    unroll
32   1.30     4.56   12.95
160  1.30     5.47   19.70
480  1.32     5.27   14.50
960  0.91     3.64   6.91

- WOW!
- But why does performance fall off for the larger matrices?
FPU versus Memory Access
- How many floating-point operations does matrix multiply take?
  - F = 2 x N^3 (N^3 multiplies, N^3 adds)
- How many memory loads/stores?
  - M = 3 x N^2 (one per element of A, B, and C)
- Many more floating-point operations than memory accesses
  - q = F / M = (2/3) x N
  - Good, since arithmetic is faster than memory access
  - Let's check the code...
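These counts can be checked directly; for example, at N = 960 the ideal ratio of flops to memory accesses is 640:

```python
def arithmetic_intensity(N):
    F = 2 * N**3   # N^3 multiplies + N^3 adds
    M = 3 * N**2   # one access per element of A, B, and C
    return F / M   # q = (2/3) * N

print(arithmetic_intensity(960))  # -> 640.0
```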
But Memory Is Accessed Repeatedly
- Looking at the inner loop: q = F / M = 1.6! (2 floating-point operations per 1.25 loads)
Typical Memory Hierarchy
[Figure: on-chip components (control, datapath, register file, instruction and data caches), second- and third-level caches (SRAM), main memory (DRAM), and secondary memory (disk or flash).
Speed (cycles): 1/2's, 1's, 10's, 100's-1000, 1,000,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost/bit: highest on chip, lowest in secondary memory]
- Where are the operands (A, B, C) stored?
- What happens as N increases?
- Idea: arrange that most accesses are to the fast cache!
Blocking
- Idea:
  - Rearrange the code to use values loaded into the cache many times
  - Only "few" accesses to slow main memory (DRAM) per floating-point operation
  - Then throughput is limited by FP hardware and the cache, not by slow DRAM
- See P&H, RISC-V edition, p. 465
Blocking Matrix Multiply (divide and conquer: sub-matrix multiplication)
[Figure: the product is built up block by block; each sub-matrix of C is accumulated from a sequence of sub-matrix multiplies]
Memory Access Blocking
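The blocked dgemm source is not included in the transcript; a hypothetical Python sketch of the technique (block size chosen here just for illustration; real code picks it so three blocks fit in cache together) is:

```python
BLOCK = 4   # illustrative; tune so 3 blocks of doubles fit in the cache

def do_block(N, si, sj, sk, a, b, c):
    # One BLOCK x BLOCK sub-multiply: C[si.., sj..] += A[si.., sk..] * B[sk.., sj..]
    # (column-major flat layout, as in the earlier examples)
    for i in range(si, si + BLOCK):
        for j in range(sj, sj + BLOCK):
            cij = c[i + j * N]
            for k in range(sk, sk + BLOCK):
                cij += a[i + k * N] * b[k + j * N]
            c[i + j * N] = cij

def dgemm_blocked(N, a, b, c):
    # N must be a multiple of BLOCK. Each block of A and B is reused
    # BLOCK times while it is cache-resident, cutting DRAM traffic.
    for sj in range(0, N, BLOCK):
        for si in range(0, N, BLOCK):
            for sk in range(0, N, BLOCK):
                do_block(N, si, sj, sk, a, b, c)
```

The arithmetic is identical to the naive version; only the traversal order changes, so that the working set at any moment is three small blocks instead of three whole matrices.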
Performance

N    GFLOPS
     scalar   avx    unroll   blocking
32   1.30     4.56   12.95    13.80
160  1.30     5.47   19.70    21.79
480  1.32     5.27   14.50    20.17
960  0.91     3.64   6.91     15.82
And in Conclusion, ...
- Approaches to parallelism
  - SISD, SIMD, MIMD (next lecture)
- SIMD
  - One instruction operates on multiple operands simultaneously
- Example: matrix multiplication
  - Floating-point heavy -> exploit Moore's law to make it fast