Explicit HW and SW Hierarchies
High-Level Abstractions for giving the system what it wants
Mattan Erez
The University of Texas at Austin
Salishan 2011
NNN (c) Mattan Erez, UT Austin
Power and reliability bound performance
• More and more components
• Per-component improvement too slow
[Figure: system power on a log scale from 1 KW to 1 GW for Tera-, Peta-, and Exa-scale systems]
[Figure: Impact of per-socket FIT rate. MTTI [Hours] (log scale, 0.1 to 1,000,000) versus Performance [PFLOPs] (0.125 to 512), with one curve each for 500, 2,000, 8,000, and 32,000 FIT per socket]
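The trend in this plot follows from the definition of FIT (one failure per 10^9 device-hours). The sketch below is illustrative only: it assumes socket failures are independent and that every socket failure interrupts the whole system, and the per-socket performance used to turn PFLOPs into a socket count is a hypothetical parameter, not a number from the slides.

```python
FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def mtti_hours(fit_per_socket: float, sockets: int) -> float:
    """System MTTI in hours, assuming independent socket failures that
    each interrupt the whole system (so failure rates simply add up)."""
    return FIT_HOURS / (fit_per_socket * sockets)


def sockets_for(pflops: float, tflops_per_socket: float) -> int:
    """Hypothetical sizing helper: sockets needed for a target performance."""
    return max(1, round(pflops * 1000.0 / tflops_per_socket))


if __name__ == "__main__":
    # tflops_per_socket=1.0 is an illustrative assumption only.
    n = sockets_for(pflops=32, tflops_per_socket=1.0)
    for fit in (500, 2_000, 8_000, 32_000):
        print(f"{fit:>6} FIT/socket, {n} sockets: MTTI = {mtti_hours(fit, n):.2f} h")
```

Under these assumptions MTTI falls inversely with both the per-socket FIT rate and the socket count, which is why larger machines need either more reliable components or cheaper recovery.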
What can we do?
• Compute less and store less
  – Use better algorithms
• Specialize more
  – But still innovate on algorithms
• Waste less
  – Minimize movement
  – Dynamically rebalance hardware
• Efficient resiliency for reliability
  – Minimize redundancy
  – Trade off inherent reliability and resiliency
Power is a zero-sum game
• Trade off control, compute, storage, and communication
  – Dense algebra
  – Large sparse data
  – Building data structures
[Figure: power budget divided among ALU/FPU, registers, caches, control, NoC, I/O, reliability, and other]
Hierarchy enables HW/SW co-tuning and co-design
• Hierarchy as common abstraction for HW and SW
  – Basic engineering
  – Match abstractions
• Portability to ensure progress
  – Co-design cycle
• Portability to ensure efficiency
  – Co-tune for proportionality
Hardware hierarchy – locality
• Communication and storage dominate energy
• Closer and smaller == better
  – Amortize cost of global operations
[Figure: energy costs on a 28nm, 20mm chip. Labels: 64-bit DP; 256-bit buses; 256-bit access, 8 kB SRAM; 500 pJ efficient off-chip link; 16 nJ DRAM Rd/Wr. Other values shown: 20 pJ, 26 pJ, 50 pJ, 256 pJ, 1 nJ]
Locality hierarchy “minimizes” hardware
• Efficiency/performance tradeoffs
  – Efficiency goes up as BW goes down
Hardware hierarchy – control
• Specialization is a form of hierarchy
  – Amortize SW control decisions in HW
• Sophisticated high-level control
  – Dynamic rebalancing
• Simple low-level control
  – Minimize hardware waste
• How far can we push this?
Hierarchical HW, hierarchical SW
• Hierarchy is least abstract common denominator
[Figure: memory hierarchies of four example machines. Dual-core PC: ALUs, L1 caches, L2 cache, main memory. 4 node cluster of PCs: ALUs, L1 cache, L2 cache, and node memory per node, under an aggregate cluster memory (virtual level). Cluster of dual Cell blades: LS units and main memory per blade, under an aggregate cluster memory (virtual level). System with a GPU: ALUs grouped into SMs, GPU memory, main memory]
[Figure: task hierarchy for matmul. The top-level matmul (large matrix mult on A, B, C) decomposes into matmul_L2 tasks (256x256 matrix mult), each of which decomposes into matmul_L1 tasks (32x32 matrix mult)]
Task hierarchies

    task matmul::inner( in    float A[M][T],
                        in    float B[T][N],
                        inout float C[M][N] )
    {
      tunable int P, Q, R;
      mappar( int i=0 to M/P, int j=0 to N/R ) {
        mapseq( int k=0 to T/Q ) {
          matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
                  B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
                  C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
        }
      }
    }

    task matmul::leaf( in    float A[M][T],
                       in    float B[T][N],
                       inout float C[M][N] )
    {
      for (int i=0; i<M; i++)
        for (int j=0; j<N; j++)
          for (int k=0; k<T; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
[Figure: variant call graph with nodes matmul::inner and matmul::leaf]
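The inner/leaf split can be mimicked in ordinary Python to show how one task body maps onto several levels of a hierarchy. This is a rough sketch, not the slide's language or runtime: `tunables` is a hypothetical per-level list of (P, Q, R) block sizes, `block` copies data to model a move to a lower level, and the loops that the slide marks `mappar` run sequentially here.

```python
def matmul_leaf(A, B, C):
    """Leaf variant: plain triple loop, C += A * B, on lists of lists."""
    M, T, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(T):
                C[i][j] += A[i][k] * B[k][j]


def block(X, r0, r1, c0, c1):
    """Copy sub-block X[r0:r1][c0:c1], modelling a move to a lower level."""
    return [row[c0:c1] for row in X[r0:r1]]


def matmul_inner(A, B, C, tunables, level=0):
    """Inner variant: split into P x Q x R blocks and recurse one level down.
    When no tunables remain for this level, fall through to the leaf."""
    if level == len(tunables):
        matmul_leaf(A, B, C)
        return
    P, Q, R = tunables[level]
    M, T, N = len(A), len(B), len(B[0])
    for i in range(0, M, P):              # mappar on the slide: (i, j) independent
        for j in range(0, N, R):
            c = block(C, i, i + P, j, j + R)
            for k in range(0, T, Q):      # mapseq: k accumulates into the same C block
                a = block(A, i, i + P, k, k + Q)
                b = block(B, k, k + Q, j, j + R)
                matmul_inner(a, b, c, tunables, level + 1)
            for di, row in enumerate(c):  # write the finished block back up
                C[i + di][j:j + len(row)] = row


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [[0, 0], [0, 0]]
    matmul_inner(A, B, C, tunables=[(1, 1, 1)])
    print(C)
```

Only the `tunables` list changes when the target machine changes; the task bodies stay the same, which is the portability argument the slides make.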
[Figure: the same task code annotated with the calling task matmul::inner and the callee task matmul::leaf. Each task has its own A, B, and C blocks; one task is located at level X and the other at level Y]
Hierarchical software enables efficiency
• Portability
  – Hierarchy is least abstract common denominator
  – It’s what systems want
• Proportionality
  – Co-tune hardware and software
  – Path to true efficiency
• Co-design cycles
  – Maintain efficiency with new technology
• How strict is the hierarchy?
Hierarchical software enables co-tuning
• Locality profiles drive dynamic rebalancing
[Figure: locality profile. % Miss (0 to 120) versus Storage Size (log scale, 1.0E+0 to 1.0E+12)]
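One common way to obtain a miss-versus-size curve like this is from LRU stack distances. The sketch below takes that route as an assumption; the slides do not say how their locality profiles were measured, and the function name and trace format are hypothetical.

```python
def lru_miss_curve(trace, sizes):
    """Return {size: miss percentage} for a fully associative LRU store.

    Uses stack distances: an access hits in a store of `size` entries only
    if fewer than `size` distinct addresses were touched since its last use.
    """
    stack = []       # most recently used address is last
    distances = []   # None marks a cold (first-touch) miss
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)  # distinct addrs since last use
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)

    total = max(1, len(trace))  # avoid dividing by zero on an empty trace
    curve = {}
    for size in sizes:
        misses = sum(1 for d in distances if d is None or d >= size)
        curve[size] = 100.0 * misses / total
    return curve


if __name__ == "__main__":
    # A cyclic sweep over three addresses only starts hitting at size 3.
    print(lru_miss_curve([0, 1, 2, 0, 1, 2], sizes=[1, 2, 3]))
```

A flat region of such a curve marks storage that buys no additional hits, which is where capacity or power could be rebalanced toward another level.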
Proportional and efficient resiliency
• Resiliency principles:
  – Detect fault
  – Correct erroneous data if possible
  – Contain fault
  – Repair/reconfigure
  – Restore state and re-execute
• Each step can be improved with co-tuning
  – Ignore certain faults (allow some errors)
  – Detect at coarse granularity
  – Contain where cheapest
  – Re-map application instead of repairing/reconfiguring hardware
  – Preserve and restore minimally and effectively
Hierarchical resiliency – containment domains
• Containment domains enable proportionality
• Match locality hierarchy with resiliency hierarchy
  – Efficient state preservation and restoration
  – Predictable (minimal) overhead
• Hierarchy provides natural domains for managing faults (and rebalancing)
  – Co-tune resiliency scheme in HW and SW
  – Range of hardware error detection and correction mechanisms
  – Mechanisms introduce minimal overhead when not in use
Containment Domains: a full-system approach to resiliency
• Hierarchy provides natural domains for containing faults
• Containment domains enable software-controlled resilience
  – Preserve data on domain start
  – Detect faults before domain commits
  – Recover: restore data and re-execute when necessary
• Arbitrary nesting
  – Tasks
  – Functions
  – Loop iterations
  – Instructions
• Amenable to compiler analysis
• Constructs for programmer tuning
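The preserve / detect / recover cycle above can be sketched as a small wrapper. This is only an illustration of the control flow: `run_domain`, `FaultDetected`, the dict-based state, and the retry count are all hypothetical names and choices, not an API from the slides.

```python
import copy


class FaultDetected(Exception):
    """Raised when a domain gives up and escalates to its parent."""


def run_domain(state, body, check, retries=2):
    """Run `body(state)` as one containment domain over a dict `state`.

    Preserve on entry, detect before commit, restore and re-execute on a
    fault, and escalate to the enclosing domain once retries run out.
    """
    preserved = copy.deepcopy(state)            # preserve data on domain start
    for _attempt in range(retries + 1):
        try:
            body(state)
            if check(state):                    # detect faults before commit
                return state                    # commit
        except FaultDetected:
            pass                                # a nested domain escalated to us
        state.clear()
        state.update(copy.deepcopy(preserved))  # restore, then re-execute
    raise FaultDetected("retries exhausted, escalating to parent domain")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_increment(s):
        """Corrupts its result on the first attempt to simulate a soft error."""
        attempts["n"] += 1
        s["x"] = s["x"] + 1 if attempts["n"] > 1 else -999

    print(run_domain({"x": 1}, flaky_increment, check=lambda s: s["x"] == 2))
```

Because `body` may itself call `run_domain`, domains nest naturally: a child that exhausts its retries raises `FaultDetected`, and the parent treats that as one of its own detected faults.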
Tunable error protection
• High AMTTI requires strong error protection
  – Global redundancy overhead can be high
  – Hardware mechanisms can help
  – Can do even better with software control
• Containment domains enable specialized protection
  – Each domain can have unique detection routine
    • May even be scenario specific
  – Redundancy can be added at any granularity
[Figure: computations on A, B, and C shown with and without redundant copies; duplicated results are compared with "=?" checks]
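Per-domain detection can be illustrated with a wrapper that either duplicates a computation and compares the results (the "=?" in the figure) or runs a cheaper check chosen for that domain. The names `run_duplicated` and `protect` are hypothetical, and the sketch assumes the protected function is deterministic and side-effect free.

```python
def run_duplicated(fn, *args):
    """Detect soft errors by running `fn` twice and comparing the results."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("mismatch between redundant executions")
    return first


def protect(fn, detect=None):
    """Wrap one domain with its own detection routine.

    detect=None means full duplication; otherwise `detect(args, result)`
    is a cheaper, domain-specific check supplied by the programmer.
    """
    def wrapped(*args):
        if detect is None:
            return run_duplicated(fn, *args)
        result = fn(*args)
        if not detect(args, result):
            raise RuntimeError("domain-specific check failed")
        return result
    return wrapped


def is_sorted(_args, result):
    """Algorithm-specific assert for a sort: output must be non-decreasing."""
    return all(a <= b for a, b in zip(result, result[1:]))


if __name__ == "__main__":
    checked_sum = protect(sum)                       # duplicated execution
    checked_sort = protect(sorted, detect=is_sorted)  # cheap specialised check
    print(checked_sum([1, 2, 3]), checked_sort([3, 1, 2]))
```

Duplication roughly doubles the work of a domain, while a specialized check such as `is_sorted` costs a single pass, so choosing per domain is where the tuning comes in.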
State preservation and restoration
• Match storage hierarchy
• Utilize NV memory
• Explicit software control
• Trade off overheads:
  – Storage, local and global bandwidth, recomputation, complexity and effort
Faults and default behavior encompass current approaches
• Soft memory errors
  – Detect: hardware ECC
  – Recover: retry, if fail then restore, re-execute
• Hard memory fault
  – Detect: runtime liveness
  – Recover:
    • Map-out bad mem
    • If enough space then: recover and re-exec
    • Else: escalate failure
• Soft arithmetic error
  – Detect: user-selectable
    • Duplicated execution (HW/SW)
    • Other HW techniques
    • Algorithm-specific assert
  – Recover: retry, if fail then restore, re-execute
• Soft control errors
  – Detect:
    • User selectable signatures
    • Implicit exceptions
  – Recover: restore, re-execute
• Hard compute fault
  – Detect: runtime liveness
  – Recover:
    • Map-out bad PE
    • If OK w/o resource or spare available then: recover and re-exec
    • Else: escalate failure
• High-level unhandled faults
  – Detect: runtime heartbeat
  – Recover:
    • Escalate failure
Containment domains example

    void task<inner> SpMV( in matrix, in veci, out resi ) {
      forall(…) reduce(…)
        SpMV(matrix[…],veci[…],resi[…]);
    }
    preserve {preserve_NV(matrix);} //inner
    restore_for_child {…}

    void task<leaf> SpMV(…) {
      for r=0..N
        for c=rowS[r]..rowS[r+1] {
          contain { resi[r]+=data[c]*veci[cIdx[c]]; }
          check {fault<fail>(c > prevC);}
          prevC=c;
        }
    }
    preserve {preserve_NV(matrix);} //leaf
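A plain Python rendering of the leaf task shows the shape of the per-element check. It is a sketch under one interpretation: the slide's `c > prevC` test is read here as "the element index must keep increasing", so a violation signals corrupted row pointers or control flow. The `Fault` class and the argument names are hypothetical.

```python
class Fault(Exception):
    """Signals a failed check inside the leaf domain."""


def spmv_leaf(row_start, col_idx, data, vec):
    """CSR sparse matrix-vector product with a cheap control-flow check."""
    n_rows = len(row_start) - 1
    res = [0.0] * n_rows
    prev_c = -1
    for r in range(n_rows):
        for c in range(row_start[r], row_start[r + 1]):
            res[r] += data[c] * vec[col_idx[c]]  # the contained computation
            if not c > prev_c:                   # element index must advance
                raise Fault(f"element index went backwards at c={c}")
            prev_c = c
    return res


if __name__ == "__main__":
    # Matrix [[1, 0, 2], [0, 3, 0]] in CSR form, multiplied by [1, 1, 1].
    print(spmv_leaf([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

In a full containment-domain setting the `Fault` would be caught by the enclosing domain, which would restore the preserved matrix and re-execute the leaf.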
Summary
• Hierarchy is basic engineering approach
  – Works for hardware and works for software
• Hierarchy is inevitable
  – Minimize movement
  – Amortize control
• Match explicit hierarchies in HW and SW
  – Lowest abstract common denominator
• Natural domains and boundaries enable:
  – Co-design
  – Co-tuning
  – Dynamic rebalancing
  – Resiliency