CS252 Spring 2017 Graduate Computer...
Transcript of CS252 Spring 2017 Graduate Computer...
WU UCB CS252 SP17
CS252 Spring 2017Graduate Computer Architecture
Lecture 10:VLIW Architectures
Lisa Wu, Krste Asanovichttp://inst.eecs.berkeley.edu/~cs252/sp17
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
LastTimeinLecture9
Vectorsupercomputers§ Vectorregisterversusvectormemory§ Scalingperformancewithlanes§ Stripmining§ Chaining§ Masking§ Scatter/Gather
2
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
SequentialISABottleneck
3
Checkinstructiondependencies
Superscalarprocessor
a = foo(b);for (i=0, i<
Sequentialsourcecode
Superscalarcompiler
Findindependentoperations
Scheduleoperations
Sequentialmachinecode
Scheduleexecution
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
VLIW:VeryLongInstructionWord
§ Multipleoperationspackedintooneinstruction§ Eachoperationslotisforafixedfunction§ Constantoperationlatenciesarespecified§ Architecturerequiresguaranteeof:
- Parallelismwithinaninstruction=>nocross-operationRAWcheck- Nodatausebeforedataready=>nodatainterlocks
4
TwoIntegerUnits,SingleCycleLatency
TwoLoad/StoreUnits,ThreeCycleLatency TwoFloating-PointUnits,
FourCycleLatency
IntOp2 MemOp1 MemOp2 FPOp1 FPOp2Int Op1
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
EarlyVLIWMachines
§ FPSAP120B(1976)- scientificattachedarrayprocessor- firstcommercialwideinstructionmachine- hand-codedvectormathlibrariesusingsoftwarepipeliningandloopunrolling
§ MultiflowTrace(1987)- commercializationofideasfromFisher’sYalegroupincluding“tracescheduling”
- availableinconfigurationswith7,14,or28operations/instruction
- 28operationspackedintoa1024-bitinstructionword§ CydromeCydra-5(1987)- 7operationsencodedin256-bitinstructionword- rotatingregisterfile
5
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
VLIWCompilerResponsibilities
§ Scheduleoperationstomaximizeparallelexecution
§ Guaranteesintra-instructionparallelism
§ Scheduletoavoiddatahazards(nointerlocks)- TypicallyseparatesoperationswithexplicitNOPs
6
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
LoopExecution
7
HowmanyFPops/cycle?
for (i=0; i<N; i++)
B[i] = A[i] + C;Int1 Int 2 M1 M2 FP+ FPx
loop: fldadd x1
fadd
fsdadd x2 bne
1 fadd / 8 cycles = 0.125
loop: fld f1, 0(x1)
add x1, 8
fadd f2, f0, f1
fsd f2, 0(x2)
add x2, 8
bne x1, x3, loop
Compile
Schedule
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
LoopUnrolling
8
for (i=0; i<N; i++)
B[i] = A[i] + C;
for (i=0; i<N; i+=4)
{
B[i] = A[i] + C;
B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}
Unroll inner loop to perform 4 iterations at once
Need to handle values of N that are not multiples of unrolling factor with final cleanup loop
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
SchedulingLoopUnrolledCode
9
loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)fsd f8, 24(x2)add x2, 32bne x1, x3, loop
Schedule
Int1 Int 2 M1 M2 FP+ FPx
loop:
Unroll 4 ways
fld f1fld f2fld f3fld f4add x1 fadd f5
fadd f6fadd f7fadd f8
fsd f5fsd f6fsd f7fsd f8add x2 bne
How many FLOPS/cycle?4 fadds / 11 cycles = 0.36
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
SoftwarePipelining
10
HowmanyFLOPS/cycle?
loop: fld f1, 0(x1)fld f2, 8(x1)fld f3, 16(x1)fld f4, 24(x1)add x1, 32fadd f5, f0, f1fadd f6, f0, f2 fadd f7, f0, f3 fadd f8, f0, f4fsd f5, 0(x2)fsd f6, 8(x2)fsd f7, 16(x2)add x2, 32fsd f8, -8(x2)bne x1, x3, loop
Int1 Int 2 M1 M2 FP+ FPxUnroll 4 ways firstfld f1fld f2fld f3fld f4
fadd f5fadd f6fadd f7fadd f8
fsd f5fsd f6fsd f7fsd f8
add x1
add x2bne
fld f1fld f2fld f3fld f4
fadd f5fadd f6fadd f7fadd f8
fsd f5fsd f6fsd f7fsd f8
add x1
add x2bne
fld f1fld f2fld f3fld f4
fadd f5fadd f6fadd f7fadd f8
fsd f5
add x1
loop:iterate
prolog
epilog
4 fadds / 4 cycles = 1
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
SoftwarePipeliningvs.LoopUnrolling
11
time
performance
time
performance
LoopUnrolled
SoftwarePipelined
Startupoverhead
Wind-downoverhead
LoopIteration
LoopIterationSoftwarepipeliningpaysstartup/wind-downcostsonlyonceperloop,notonceperiteration
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
Whatiftherearenoloops?
§ Brancheslimitbasicblocksizeincontrol-flowintensiveirregularcode
§ DifficulttofindILPinindividualbasicblocks
12
Basicblock
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
TraceScheduling[Fisher,Ellis]
§ Pickstringofbasicblocks,atrace,thatrepresentsmostfrequentbranchpath
§ Useprofilingfeedbackorcompilerheuristicstofindcommonbranchpaths
§ Schedulewhole“trace”atonce§ Addfixup codetocopewithbranchesjumpingoutoftrace
13
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
Problemswith“Classic”VLIW
§ Object-codecompatibility- havetorecompileallcodeforeverymachine,evenfortwomachinesinsamegeneration
§ Objectcodesize- instructionpaddingwastesinstructionmemory/cache- loopunrolling/softwarepipeliningreplicatescode
§ Schedulingvariablelatencymemoryoperations- cachesand/ormemorybankconflictsimposestaticallyunpredictablevariability
§ Knowingbranchprobabilities- Profilingrequiresansignificantextrastepinbuildprocess
§ Schedulingforstaticallyunpredictablebranches- optimalschedulevarieswithbranchpath
14
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
VLIWInstructionEncoding
§ Schemestoreduceeffectofunusedfields- Compressedformatinmemory,expandonI-cacherefill- usedinMultiflow Trace- introducesinstructionaddressingchallenge
-Markparallelgroups- usedinTMS320C6xDSPs,IntelIA-64
- Provideasingle-opVLIWinstruction- Cydra-5UniOp instructions
15
Group 1 Group 2 Group 3
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IntelItanium,EPICIA-64
§ EPICisthestyleofarchitecture(cf.CISC,RISC)- ExplicitlyParallelInstructionComputing(reallyjustVLIW)
§ IA-64isIntel’schosenISA(cf.x86,MIPS)- IA-64=IntelArchitecture64-bit- Anobject-code-compatibleVLIW
§ MercedwasfirstItaniumimplementation(cf.8086)- Firstcustomershipmentexpected1997(actually2001)-McKinley,secondimplementationshippedin2002- Recentversion,Poulson,eightcores,32nm,announced2011
16
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
EightCoreItanium“Poulson”[Intel2011]
§ 8cores§ 1-cycle16KBL1I&Dcaches§ 9-cycle512KBL2I-cache§ 8-cycle256KBL2D-cache§ 32MBsharedL3cache§ 544mm2in32nmCMOS§ Over3billiontransistors
§ Coresare2-waymultithreaded
§ 6instruction/cyclefetch- Two128-bitbundles
§ Upto12insts/cycleexecute
17
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IA-64InstructionFormat
§ Templatebitsdescribegroupingoftheseinstructionswithothersinadjacentbundles
§ Eachgroupcontainsinstructionsthatcanexecuteinparallel
18
Instruction 2 Instruction 1 Instruction 0 Template
128-bit instruction bundle
group i group i+1 group i+2group i-1
bundle j bundle j+1bundle j+2bundle j-1
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IA-64Registers
§ 128GeneralPurpose64-bitIntegerRegisters§ 128GeneralPurpose64/80-bitFloatingPointRegisters
§ 641-bitPredicateRegisters
§ GPRs“rotate”toreducecodesizeforsoftwarepipelinedloops- Rotationisasimpleformofregisterrenamingallowingoneinstructiontoaddressdifferentphysicalregistersoneachiteration
19
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
RotatingRegisterFiles
20
Problems:Scheduledloopsrequirelotsofregisters,Lotsofduplicatedcodeinprolog,epilog
Solution:Allocatenewsetofregistersforeachloopiteration
20
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
RotatingRegisterFile
21
P0P1P2P3P4P5P6P7
RRB=3
+R1
RotatingRegisterBase(RRB)registerpointstobaseofcurrentregisterset. Valueaddedontologicalregisterspecifier togivephysicalregisternumber.Usually,splitintorotatingandnon-rotatingregisters.
21
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
RotatingRegisterFile(PreviousLoopExample)
22
bloopsd f9, ()fadd f5, f4, ...ld f1, ()
Three cycle load latency encoded as difference of 3
in register specifier number (f4 - f1 = 3)
Four cycle fadd latency encoded as difference of 4
in register specifier number (f9 – f5 = 4)
bloopsd P17, ()fadd P13, P12,ld P9, () RRB=8bloopsd P16, ()fadd P12, P11,ld P8, () RRB=7bloopsd P15, ()fadd P11, P10,ld P7, () RRB=6bloopsd P14, ()fadd P10, P9,ld P6, () RRB=5bloopsd P13, ()fadd P9, P8,ld P5, () RRB=4bloopsd P12, ()fadd P8, P7,ld P4, () RRB=3bloopsd P11, ()fadd P7, P6,ld P3, () RRB=2bloopsd P10, ()fadd P6, P5,ld P2, () RRB=1
22
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IA-64PredicatedExecution
23
Problem:Mispredicted brancheslimitILPSolution:Eliminatehardtopredictbrancheswithpredicatedexecution- AlmostallIA-64instructionscanbeexecutedconditionallyunderpredicate
- InstructionbecomesNOPifpredicateregisterfalse
Inst 1Inst 2br a==b, b2
Inst 3Inst 4br b3
Inst 5Inst 6
Inst 7Inst 8
b0:
b1:
b2:
b3:
if
else
then
Four basic blocks
Inst 1Inst 2p1,p2 <- cmp(a==b)(p1) Inst 3 || (p2) Inst 5(p1) Inst 4 || (p2) Inst 6Inst 7Inst 8
Predication
One basic block
Mahlke et al, ISCA95: On average >50% branches removed
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IA-64SpeculativeExecution
24
Problem:Branchesrestrictcompilercodemotion
Inst 1Inst 2br a==b, b2
Load r1Use r1Inst 3
Can’t move load above branch because might cause spurious
exception
Load.s r1Inst 1Inst 2br a==b, b2
Chk.s r1Use r1Inst 3
Speculative load never causes
exception, but sets “poison” bit on
destination register
Check for exception in original home block
jumps to fixup code if exception detected
Particularly useful for scheduling long latency loads early
Solution: Speculative operations that don’t cause exceptions
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
IA-64DataSpeculation
25
Problem:Possiblememoryhazardslimitcodescheduling
Requires associative hardware in address check table
Inst 1Inst 2Store
Load r1Use r1Inst 3
Can’t move load above store because store might be to same
address
Load.a r1Inst 1Inst 2Store
Load.cUse r1Inst 3
Data speculative load adds address to
address check table
Store invalidates any matching loads in
address check table
Check if load invalid (or missing), jump to fixup
code if so
Solution: Hardware to check pointer hazards
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
LimitsofStaticScheduling
§ Unpredictablebranches§ Variablememorylatency(unpredictablecachemisses)
§ Codesizeexplosion§ Compilercomplexity§ Despiteseveralattempts,VLIWhasfailedingeneral-purposecomputingarena(sofar).-MorecomplexVLIWarchitecturesclosetoin-ordersuperscalarincomplexity,norealadvantageonlargecomplexapps.
§ SuccessfulinembeddedDSPmarket- SimplerVLIWswithmoreconstrainedenvironment,friendliercode.
26
©KrsteAsanovic,2015CS252,Fall2015,Lecture10
Acknowledgements
§ ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:- Arvind (MIT)- JoelEmer (Intel/MIT)- JamesHoe(CMU)- JohnKubiatowicz (UCB)- DavidPatterson(UCB)
27