CS 61C: Great Ideas in Computer Architecture Lecture 18...
CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing - SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g. stereo vision (project 4)
• dgemm
  − Double-precision floating-point matrix multiplication
CS61c Lecture18:ParallelProcessing- SIMD 5
Application Example: Deep Learning
• Image classification (cats …)
• Pick “best” vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing
Matrices
• Square (or rectangular) N x N array of numbers
  − Dimension N

$$C = A \cdot B, \qquad c_{ij} = \sum_{k} a_{ik} b_{kj}$$

(Figure: N x N matrix with row index i and column index j, each running from 0 to N-1; element c_ij marked.)
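Matrix Multiplication

$$C = A \cdot B, \qquad c_{ij} = \sum_{k} a_{ik} b_{kj}$$

(Figure: row i of A and column j of B combine along the summation index k to produce element c_ij.)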
Reference: Python
• Matrix multiplication in Python

| N | Python [Mflops] |
|---|---|
| 32 | 5.4 |
| 160 | 5.5 |
| 480 | 5.4 |
| 960 | 5.3 |

• 1 Mflops = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N…) takes 2*N^3 flops
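The slide's Python code did not survive in this transcript. As a minimal sketch of what such a reference implementation might look like, assuming matrices stored as lists of rows (the names `dgemm`, `dgemm_flops`, and `measure_mflops` are hypothetical, not taken from the lecture):

```python
import time

def dgemm(n, a, b):
    """Return c = a x b for n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]  # one fmul and one fadd
            c[i][j] = total
    return c

def dgemm_flops(n):
    """Flops in one n x n dgemm: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3

def measure_mflops(n):
    """Time one dgemm call and convert the result to Mflops (hypothetical helper)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    dgemm(n, a, b)
    elapsed = time.perf_counter() - start
    return dgemm_flops(n) / elapsed / 1e6
```

Calling `measure_mflops(32)` gives a number comparable in kind to the table above; the actual value depends on the machine and the Python version.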
C
• c = a x b
• a, b, c are N x N matrices
Timing Program Execution
C versus Python

| N | C [Gflops] | Python [Gflops] |
|---|---|---|
| 32 | 1.30 | 0.0054 |
| 160 | 1.30 | 0.0055 |
| 480 | 1.32 | 0.0054 |
| 960 | 0.91 | 0.0053 |

Which class gives you this kind of power? We could stop here … but why? Let’s do better!
240x!
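The 240x figure can be checked directly from the table. A small sketch, using only the Gflops values listed above (the variable names are illustrative):

```python
# Gflops values copied from the "C versus Python" table.
c_gflops = {32: 1.30, 160: 1.30, 480: 1.32, 960: 0.91}
python_gflops = {32: 0.0054, 160: 0.0055, 480: 0.0054, 960: 0.0053}

# Speedup of C over Python for each matrix dimension N.
speedup = {n: c_gflops[n] / python_gflops[n] for n in c_gflops}

for n, s in speedup.items():
    print(f"N={n}: C is about {s:.0f}x faster than Python")
```

The ratio is close to 240 for N = 32, 160, and 480, and drops for N = 960, where the C throughput itself falls to 0.91 Gflops.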
New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages

(Figure: software/hardware stack under the banner “Harness Parallelism & Achieve High Performance”. Labels, top to bottom: Warehouse Scale Computer and Smart Phone; Computer with Core … Core, Memory (Cache), Input/Output; Core with Instruction Unit(s) and Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3; Cache Memory; Logic Gates. A marker reads “Today’s Lecture”.)
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance

This has few applications. Not covered in 61C.
SIMD Applications & Implementations
• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …
Raw Double Precision Throughput (Bernhard’s Powerbook Pro)

| Characteristic | Value |
|---|---|
| CPU | i7-5557U |
| Clock rate (sustained) | 3.1 GHz |
| Instructions per clock (mul_pd) | 2 |
| Parallel multiplies per instruction | 4 |
| Peak double flops | 24.8 Gflops |

Actual performance is lower because of overhead
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
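The peak figure is the product of the three rows above it. A short worked check, using only the values from the table (variable names are illustrative):

```python
clock_hz = 3.1e9              # sustained clock rate
instr_per_clock = 2           # mul_pd instructions issued per clock
multiplies_per_instr = 4      # doubles processed by one instruction

# Peak double-precision throughput in Gflops.
peak_gflops = clock_hz * instr_per_clock * multiplies_per_instr / 1e9
print(f"peak = {peak_gflops:.1f} Gflops")
```

This counts multiplies only, as the table does; it is an upper bound, not a prediction of measured dgemm throughput.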
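Vectorized Matrix Multiplication

(Figure: matrix-multiplication diagram with row index i, column index j, and summation index k; i advances in steps of 4.)

Inner Loop:
for i …; i += 4
  for j ...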
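The loop structure above can be sketched in plain Python. This is not SIMD code and will not run faster; it only mimics what a 4-wide register would do, with the innermost `lane` loop standing in for work the hardware performs in one instruction. The column-major flat-list layout and the name `dgemm_strips` are assumptions, not taken from the slide.

```python
def dgemm_strips(n, a, b):
    """c = a x b for column-major flat lists; n must be a multiple of 4.

    Element (row r, column s) of each matrix lives at index r + s * n.
    """
    assert n % 4 == 0
    c = [0.0] * (n * n)
    for i in range(0, n, 4):               # four rows of c per step (i += 4)
        for j in range(n):
            c0 = [0.0, 0.0, 0.0, 0.0]      # stands in for one 4-wide register
            for k in range(n):
                b_kj = b[k + j * n]        # one element of b, shared by all lanes
                for lane in range(4):      # hardware would do these 4 at once
                    c0[lane] += a[i + lane + k * n] * b_kj
            c[i + j * n : i + j * n + 4] = c0   # store 4 results together
    return c
```

Each pass through the `k` loop updates four adjacent elements of one column of c, which is why the outer loop advances i by 4.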
“Vectorized” dgemm
Performance

| N | scalar [Gflops] | avx [Gflops] |
|---|---|---|
| 32 | 1.30 | 4.56 |
| 160 | 1.30 | 5.47 |
| 480 | 1.32 | 5.27 |
| 960 | 0.91 | 3.64 |

• 4x faster
• But still << theoretical 25 Gflops!
Pipeline Hazards - dgemm
Loop Unrolling
Compiler does the unrolling
How do you verify that the generated code is actually unrolled?
4 registers
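The slide's code is not preserved here. To show the transformation itself, here is a hand-unrolled dot product with four independent accumulators, loosely analogous to the "4 registers" above. It is a simplified stand-in, not the lecture's dgemm, and the name `dot_unrolled4` is hypothetical.

```python
def dot_unrolled4(x, y):
    """Dot product with the loop body unrolled four times."""
    n = len(x)
    assert n == len(y) and n % 4 == 0
    s0 = s1 = s2 = s3 = 0.0           # four independent accumulators
    for k in range(0, n, 4):          # one loop test per four multiply-adds
        s0 += x[k] * y[k]
        s1 += x[k + 1] * y[k + 1]
        s2 += x[k + 2] * y[k + 2]
        s3 += x[k + 3] * y[k + 3]
    return (s0 + s1) + (s2 + s3)      # combine the partial sums at the end
```

Because the four accumulators do not depend on each other, a pipelined processor can overlap their updates instead of waiting for each add to finish. The summation order differs from a simple loop, so floating-point results may differ in the last bits.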
Performance

| N | scalar [Gflops] | avx [Gflops] | unroll [Gflops] |
|---|---|---|---|
| 32 | 1.30 | 4.56 | 12.95 |
| 160 | 1.30 | 5.47 | 19.70 |
| 480 | 1.32 | 5.27 | 14.50 |
| 960 | 0.91 | 3.64 | 6.91 |
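To put these numbers in context, the sketch below computes each variant's speedup over scalar and the fraction of the 24.8 Gflops peak that the unrolled version reaches. It uses only values from the slides; the variable names are illustrative.

```python
peak = 24.8  # Gflops, from the raw-throughput slide

# N: (scalar, avx, unroll) in Gflops, copied from the table above.
results = {
    32: (1.30, 4.56, 12.95),
    160: (1.30, 5.47, 19.70),
    480: (1.32, 5.27, 14.50),
    960: (0.91, 3.64, 6.91),
}

for n, (scalar, avx, unroll) in results.items():
    print(f"N={n}: avx {avx / scalar:.1f}x, unroll {unroll / scalar:.1f}x, "
          f"unroll reaches {100 * unroll / peak:.0f}% of peak")
```

The unrolled version comes closest to peak at N = 160 and falls off at N = 960.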