Approximate On-Chip Communication
Davide Patti, Ph.D., University of Catania, Italy
…in the Previous Episodes
1. The goal of computing was to be the fastest
2. The race to maximize MHz hit the “power wall” in the mid-2000s
3. Initial solution: “ok, no problem, let’s optimise for speed and power…”
4. …but, eventually, the dramatically increasing workloads ruined the party…
Why?
• Ever-increasing amount of information
• Industry reports, 2010 to 2020:
– the amount of information will expand by 50x
– ...the number of servers will only grow by a factor of 10!
Emerging RMS Applications
Error-Resilience Property
Forgiving workloads (multimedia, recognition, search) can tolerate imperfect computing. Examples:
• Inexact inputs, derived from noisy and redundant sources (e.g. sensors)
• Human consumers of results may not discern small variations
• Data/algorithms that include statistical/probabilistic computations
• Computations that may be refined over multiple iterations
Approximate Computing: A Third Dimension for Optimization
“Error” or “Feature”?
• Approximation not as a “problem” to deal with, not as a “limitation”, but as part of the game
• A neuron spikes when the combination of all the excitation and inhibition it receives makes it reach threshold (around -50 mV)
Approximating at Multiple Levels of the Stack
Hardware level
• Less accurate yet energy-efficient circuits (e.g., simplified adder)
• Tuning the supply voltage
Software level
• Ignore some computations (skip loop iterations, relaxing control dependences)
• Data structures, e.g., reducing vector sizes
• Ignore certain memory accesses replacing them by estimated values
Current Applications
• Database Querying/Visualization:
• BlinkDB, Facebook’s Presto, M4 from SAP
2B points (70 mins) vs 1M points (3 mins)
Current Applications
• Neural Networks
• Using a NN to replace some expensive computation or algorithm
• Approximate NN implementations for inference (e.g., fewer bits to represent weights)
• SqueezeNet, Google’s Neural Machine Translation
Approximate Communication: the NOC Case Study
■ Shared bus
➔ Low area
➔ Poor scalability
➔ High energy consumption
■ Network-on-Chip
➔ Mesh of Routers (in red)
➔ Each Processing Element connected to a Router
➔ Scalability and modularity
➔ Low energy consumption
➔ Increased design complexity
[Figure: shared bus compared with a Network-on-Chip mesh]
Communication Overhead
• Interconnection networks consume 10% to 20% of the power in current HPC systems
• Majority due to the network's links
NoC-based design
• More than one-third of the chip's power consumption
Example
for (i=0; i<n; i++) v[i] = f(w[i]);
[Figure: CPU and Memory (with MI) exchanging traffic for the loop above]
• Load w[i]: Address and Data travel between CPU and Memory
• Store v[i]: Data travels between CPU and Memory
Approximate Communication
• Send(data, destination)
• Send(data, destination, reliability_level)
[Figure: Communication Energy as a function of Reliability Level]
Communication system “aware” of error resilience, acting on two knobs:
• Voltage swing (wired)
• Transmission power (wireless)
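The extended Send signature above can be sketched in C as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names reliability_level, link_setting, swing_for, send_approx and send_exact are hypothetical, the two-level scale is an assumption, and a caller-supplied buffer stands in for the physical link. The two swing voltages are the ones quoted in the Experiments slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-level reliability scale; the slides only name the parameter. */
typedef enum { RL_APPROX = 0, RL_NOMINAL = 1 } reliability_level;

/* Setting the link would use for one message (wired knob: voltage swing). */
typedef struct {
    int destination;
    reliability_level rl;
    double swing_volts;
} link_setting;

/* Map a reliability level to a voltage swing (values from the Experiments slide). */
double swing_for(reliability_level rl) {
    return rl == RL_NOMINAL ? 1.1 : 0.6;
}

/* Stand-in for Send(data, destination, reliability_level): copies the payload
 * into `wire` (a fake link) and reports the setting selected for it. */
link_setting send_approx(const void *data, size_t len, void *wire,
                         int destination, reliability_level rl) {
    assert(data != NULL && wire != NULL);
    memcpy(wire, data, len);
    link_setting s = { destination, rl, swing_for(rl) };
    return s;
}

/* Plain Send(data, destination) is the fully reliable special case. */
link_setting send_exact(const void *data, size_t len, void *wire,
                        int destination) {
    return send_approx(data, len, wire, destination, RL_NOMINAL);
}
```

Keeping the reliability level as an explicit argument lets existing callers keep the exact behaviour through send_exact, while annotated data can opt in to the low-swing setting.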
Tuning the Link Voltage Swing
• Reliability vs. energy (1 mm bit-line):
– Nominal voltage swing → low BER, high energy
– Low voltage swing → high BER, low energy
Reconfigurable Link
[Figure: mesh NoC made of tiles connected by physical links. Each tile contains an IP core (core), a Network Interface (NI) and a Router (R).]
HSPICE Link Simulation
• 45 nm CMOS technology (NanGate's Open Cell Library):
• 10 metal layers
• 3 mm link line using the seventh metal layer
• 2 GHz target frequency
• 70% saving, 3% overhead
Improving energy efficiency in wireless network-on-chip architectures, V Catania, A Mineo, S Monteleone, M Palesi, D Patti, ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 2018
Implementation
Packet: Header | Data | Data | Data | Tail
Header fields: Reliability Level | Destination | Other control info
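A header carrying the reliability level could be packed as in the sketch below. The slide gives only the field names, so everything else here is an assumption made for illustration: the 32-bit header width, the 2/8/22-bit split, and the names header_fields, pack_header and unpack_header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths (not from the slides): 2 + 8 + 22 = 32 bits. */
enum { RL_BITS = 2, DEST_BITS = 8, CTRL_BITS = 22 };

typedef struct {
    uint32_t reliability_level; /* read by each router to configure the link */
    uint32_t destination;       /* destination node identifier */
    uint32_t control;           /* other control info */
} header_fields;

/* Pack the fields into one 32-bit header, reliability level in the top bits. */
uint32_t pack_header(header_fields h) {
    assert(h.reliability_level < (1u << RL_BITS));
    assert(h.destination < (1u << DEST_BITS));
    assert(h.control < (1u << CTRL_BITS));
    return (h.reliability_level << (DEST_BITS + CTRL_BITS))
         | (h.destination << CTRL_BITS)
         | h.control;
}

/* Recover the fields from a packed header. */
header_fields unpack_header(uint32_t flit) {
    header_fields h;
    h.reliability_level = flit >> (DEST_BITS + CTRL_BITS);
    h.destination = (flit >> CTRL_BITS) & ((1u << DEST_BITS) - 1u);
    h.control = flit & ((1u << CTRL_BITS) - 1u);
    return h;
}
```

Because the level travels in the header, the data and tail parts of the packet need no extra bits; a router would only have to decode the header to choose the swing for the rest of the packet.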
Annotation Example
• Data coming from/delivered to w[i] travel with a reliability level rl
#pragma resilient(w, rl)
for (i=0; i<n; i++)
  v[i] = f(w[i]);
Application Characterization
• How does imprecision in the inputs and internal data reflect on the outputs?
• Classify data structures according to their impact on the outputs
Exploitation
• Storing less sensitive data in energy-efficient memories (low voltage, low refresh rate, ...)
• Optimizing communication of less sensitive data (unreliable communications, lossy compression, ...)
Experiments
• Two voltage swing levels
– Nominal, 1.1 V → BER: 10^-17, Ebit: 512 fJ
– Low, 0.6 V → BER: 10^-6, Ebit: 152 fJ
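The trade-off between these two levels can be worked out with a few lines of C. The sketch below only does arithmetic on the figures quoted above; the names swing_level, NOMINAL, LOW and the three helper functions are hypothetical, and the expected error count assumes independent bit errors.

```c
#include <assert.h>

/* One link operating point: swing, bit error rate, energy per bit. */
typedef struct {
    double swing_v;
    double ber;
    double ebit_fj;
} swing_level;

/* Values quoted in the slide. */
static const swing_level NOMINAL = { 1.1, 1e-17, 512.0 };
static const swing_level LOW     = { 0.6, 1e-6,  152.0 };

/* Energy in fJ to move nbits over one link at the given level. */
double transfer_energy_fj(swing_level l, double nbits) {
    assert(nbits >= 0.0);
    return l.ebit_fj * nbits;
}

/* Expected number of corrupted bits, assuming independent errors. */
double expected_bit_errors(swing_level l, double nbits) {
    assert(nbits >= 0.0);
    return l.ber * nbits;
}

/* Fraction of per-bit energy saved by using alt instead of base. */
double energy_saving(swing_level base, swing_level alt) {
    return 1.0 - alt.ebit_fj / base.ebit_fj;
}
```

With these numbers, energy_saving(NOMINAL, LOW) is 1 - 152/512, about 0.70, at the cost of roughly one expected bit error per million bits sent at the low swing.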
Experiments
• JPEG encoding pipeline (AXBench)
[Figure: Level Shift, DCT, Quantize and Entropy Encode stages, with the Quantizer Table and the Huffman Table]
UINT8* encodeMcu(UINT32 imageFormat, UINT8 *outputBuffer) {
  levelShift(Y1);
  dct(Y1);
  quantization(Y1, ILqt);
  outputBuffer = huffman(1, outputBuffer);
  return outputBuffer;
}
Three annotation variants, each placed before levelShift(Y1):
• #pragma resilient_load(Y1, rl_load): traffic from Memory travels with reliability level rl_load
• #pragma resilient_store(Y1, rl_store): traffic to Memory travels with reliability level rl_store
• #pragma resilient(Y1, rl): traffic in both directions travels with reliability level rl
Approximation Profiles
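Experiments
[Figure: the JPEG pipeline stages (Level Shift, DCT, Quantize, Entropy Encode), the Quantizer Table and the Huffman Table, served by two memories, Mem 1 and Mem 2]
[Figure: NoC mapping. Level Shifter, DCT, Quantizer, Entropy Encoder and two MC nodes (for Mem 1 and Mem 2), each attached to a router (R)]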
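Evaluation Flow
[Figure: evaluation flow. Blocks shown in the diagram:]
• Application, Resilient data selection, Annotated application
• Resilience level selection
• Full simulation (MIT Graphite), producing the memory reference trace
• NoC architecture and Energy estimation (Noxim), producing the communication energy
• Error injection, Perturbed application, Execution, producing imprecise results
• Execution of the original application, producing exact results
• Comparison of imprecise and exact results, producing the quality metric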
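The error injection step of the flow can be illustrated with the sketch below. The slides do not say how errors are injected, so this assumes the simplest model: every transmitted bit is flipped independently with probability equal to the link BER. The function names and the xorshift64* generator (used only to make runs reproducible) are hypothetical choices, not part of the authors' tool flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PRNG (xorshift64*), chosen so that runs are reproducible.
 * Returns a value in [0, 1); *state must be non-zero. */
static double next_uniform(uint64_t *state) {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    /* Keep the top 53 bits and scale by 2^53. */
    return (double)((x * 2685821657736338717ULL) >> 11) / 9007199254740992.0;
}

/* Flip each bit of buf independently with probability ber.
 * Returns the number of bits that were flipped. */
size_t inject_bit_errors(uint8_t *buf, size_t len, double ber,
                         uint64_t *state) {
    assert(buf != NULL && state != NULL && *state != 0);
    size_t flipped = 0;
    for (size_t i = 0; i < len; i++) {
        for (int bit = 0; bit < 8; bit++) {
            if (next_uniform(state) < ber) {
                buf[i] ^= (uint8_t)(1u << bit);
                flipped++;
            }
        }
    }
    return flipped;
}
```

Drawing one random number per bit is simple but slow for very small BER values; sampling the distance to the next error instead would be a natural refinement for long traces.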
Experiments
[Figure series: the JPEG pipeline (Level Shift, DCT, Quantize, Entropy Encode, Quantizer Table, Huffman Table, Mem 1, Mem 2) under ten configurations, Conf 0 (gold) to Conf 9. In each configuration every communication flow is marked either Nominal (high energy, high reliability) or Approx (low energy, low reliability).]
Experiments
[Figure: Image diff (RMSE), axis from 0.0000 to 0.0006, and Normalized energy, axis from 0 to 1, for Configurations 0 to 9]
Sensitivity Analysis
[Figure: Sensitivity (axis from 0.00000 to 0.00030) for each annotated flow: in Y1/levelShift, out Y1/levelShift, in Y1/dct, out Y1/dct, in Y1/quantization, in Ilqt/quantization, out Temp/quantization, in Temp/huffman, out outputBuffer/huffman]
Experiments
[Figure: Conf 9, with each flow of the JPEG pipeline marked Nominal (high energy, high reliability) or Approx (low energy, low reliability)]
[Figure: Normalized energy (axis from 0.00 to 1.20) versus Image diff (RMSE) (axis from 0.0E+0 to 6.0E-4)]
Next Step: On-Chip Wireless Communications
V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, “Improving energy efficiency in wireless network-on-chip architectures,” ACM Journal on Emerging Technologies in Computing Systems, vol. 14, no. 1, 2017.
Tuning Transmitting Power
• High BER as compared to wired NoC
– 10^-9 vs. 10^-14
• General approach
– Increasing the transmitting power to compensate for the attenuation introduced by the wireless medium
• Proposed approach
– Tuning the transmitting power based on the reliability level of the data currently being transmitted
Tunable Transmitting Power
Zigzag antenna modeled with Ansoft HFSS to compute attenuation (16 Gbps)
Variable Power Amplifier
• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.
• A. Mineo, M. Palesi, G. Ascia, and V. Catania, “Exploiting antenna directivity in wireless noc architectures,” Microprocessors and Microsystems, vol. 43, pp. 59–66, 2016.
Simulation Setup
• Two transmission profiles:
• (normal) BER 10^-12 → 1.47 pJ/bit
• (approximate) BER 10^-6 → 1 pJ/bit
• Wireless Interfaces placed at the same positions as the Memory Controllers (mesh corners)
• 8 × 8 mesh-based NoC architecture simulated by using the Graphite Multicore Simulator with the following parameters:
Representative Applications
(Application: Description. Approximated Regions.)
• streamcluster: an RMS kernel developed by Princeton University that solves the online clustering problem.
Approximated regions: regions of 256 bytes, each storing the 64 dimensions of a point (one 4-byte floating-point value per dimension), for a total of 8192 regions.
• canneal: developed by Princeton University, it uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design.
Approximated regions: the annotation was applied to the netlist elements, for a total of 160,000 instances of 64 bytes each.
• blackscholes: an Intel RMS benchmark that calculates prices for a portfolio of European options analytically with the Black-Scholes partial differential equation.
Approximated regions: two data structures have been annotated, optiondata (a 36-byte floating-point structure) and prices (4-byte floating point), for a total of 147,456 bytes and 16,384 bytes, respectively.
• radiosity: computes the equilibrium distribution of light in a scene using the hierarchical diffuse radiosity method.
Approximated regions: elemvertex buf.col, a data structure encoding the three RGB components as 4-byte floating-point values, and elemvertex buf.vertex, a data structure encoding the 3-dimensional coordinates of each vertex of the polygons describing the 3D model of the scene. Each of these two structures occupies 12 bytes, for a total of 65,535 regions and 786,420 annotated bytes each.
Evaluation Flow
Four scenarios:
1. NoC
2. WiNoC
3. Approx. NoC
4. Approx. WiNoC
Results
∗ All energy values are normalized with respect to the wired NoC energy consumption.
Results – Performance Metrics
Conclusions
• Approximate communication technique for improving the energy efficiency of WiNoC architectures
– Dynamic link voltage swing (NoC links)
– Dynamic transmitting power modulation (wireless communications)
• Pragma-based annotation of the application code
– Load- and store-induced communications related to error-tolerant data
• Assessment on a set of representative benchmarks
– Energy saving versus application accuracy trade-off
– Up to 30% of total communication energy saving has been observed without any appreciable impact on the accuracy metrics
Future Developments
• Generalize & automate in order to reduce the required knowledge about the application
• A methodology to identify approximable communication flows
• Automated choice of the most efficient approximation technique (reduced-bit representation, reduced iterations, etc.)
• Automatic exploration loop
Bibliography
• Vincenzo Catania, Andrea Mineo, Salvatore Monteleone, Maurizio Palesi, and Davide Patti. 2016. Cycle-Accurate Network on Chip Simulation with Noxim. ACM Trans. Model. Comput. Simul. 27, 1, Article 4 (August 2016), 25 pages. DOI: https://doi.org/10.1145/2953878
• V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, “Improving energy efficiency in wireless network-on-chip architectures,” ACM Journal on Emerging Technologies in Computing Systems (JETC) 14 (1), 9
• S. Kaushik, M. Agrawal, H. K. Mondal, S. H. Gade, and S. Deb, “Path loss-aware adaptive transmission power control scheme for energy-efficient wireless noc,” in International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2017, pp. 132–135.
• C. Roth, H. Bucher, S. Reder, F. Buciuman, O. Sander, and J. Becker. 2013. A SystemC modeling and simulation methodology for fast and accurate parallel MPSoC simulation. In Integrated Circuits and Systems Design (SBCCI), 2013 26th Symposium on. 1–6. DOI:http://dx.doi.org/10.1109/SBCCI.2013.6644853
• S. Deb, K. Chang, M. Cosic, A. Ganguly, P. P. Pande, D. Heo, and B. Belzer, “Enhancing performance of network-on-chip architectures with millimeter-wave wireless interconnects,” in IEEE International Conference on Application-specific Systems Architectures and Processors, 2010, pp. 73–80.
• E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator for multicores,” in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1–12.