Computing an Exact Dot Product A Hardware Accelerator...

A Hardware Accelerator for Computing an Exact Dot Product

Jack Koenig, David Biancolin, Jonathan Bachrach, Krste Asanović

Challenges with Floating Point

Addition and multiplication are not associative

1020 + 1 - 1020 = 0

Multithreaded, not even reproducible!

1020 + 1 - 1020 = 0 or 1

Solutions

● MPFR - Exact but much slower than hardware● ExactBLAS - Faster than MPFR, still slower than hardware● ReproBLAS - Fast and reproducible, but not exact

Moore’s Law Winding Down

3[Hennessy & Patterson, 2017]

From Moore’s Law to Dark Silicon

● System-on-Chips have billions of transistors

● Power density constraints prevent all transistors from being used at once

● Accelerators orders-of-magnitude more efficient than CPUs

● Can turn off unused specialized units to save power

⇒ Use those extra transistors for

specialized hardware

4NVIDIA Tegra 2

Specialization is already here!

Motivation

Why Dot Product?

1) A kernel of many applications2) Reduction is good candidate for exact representation

Why Exact?

Simplifies error analysis

Related Work

● We were heavily influenced by the work of Ulrich Kulisch et. al○ XPA 3233 in 1994○ PCI-based co-processor○ 0.8um process

● Recently, Uguen and Dinechin published a design-space exploration for FPGA-hosted implementations of Kulisch’s design

6[XPA 3233]

Principle of Operation

● Fixed point representation of entire space○ 1 + 2 ⨉ (2bits(exp) + bits(mant))○ 2100 bits to represent 1 double-precision number○ 4200 bits for product of 2 doubles○ 88 bits to preclude overflow○ 4288 bits in our complete representation (CR)

● Accumulation○ Fetch elements of each vector from memory○ Calculate product of the mantissas and sum of the

exponents○ Use sum of exponents to align product of mantissas

with complete representation○ Accumulate○ Propagate carry or borrow if necessary

7[Kulisch 2008]

System Architecture

Rocket Chip Generator

● A RISC-V processor generator● RISC-V is an open-source, extensible ISA● Provides Rocket Custom Coprocessor

Interface (RoCC)

Used in over 12 academic tapeouts and at least 1 commercial tapeout

EOS 22 (2014)

RoCC Accelerator

● Integrated with Rocket Chip via RoCC○ 5 stage in-order pipeline (Rocket)○ 32 KiB L1 instruction and data caches○ 256 KiB L2 cache

● Instructions are fetched by Rocket core and forwarded to the accelerator

● Memory interface is parameterized for○ 64-bit L1 cache interface○ 128-bit L2 cache interface

Instructions

Name Description

CLR_CR Clear the complete register

RD_DBL/RD_FLT Round the complete register and return the result to a general-purpose register

LD_CR Loads a complete register from memory

ST_CR Stores the complete register to memory

ADD_CR Adds a complete register in memory to the current value

PRE_DP Initializes vector base address registers

RUN_DP Specifies vector length; instructs accelerator to begin computation

Control & Memory Unit

Control Unit

● Decodes instructions● Rounds complete register

Memory Unit

● Fetches operands from memory and re-orders responses to feed to datapath

● Parameterized for 64-bit L1 or 128-bit L2 interface

Segmented Accumulator

● Divide complete register into segments● Each segment gets its own adder● Accumulates a portion of the product of the mantissas and incoming

carry/borrow

Centralized Accumulator

● Uses a single adder● For double, product of mantissas gives

104-bit summand● Read appropriate 4 words from

accumulator based on sum of exponents

● Add summand to lower-order 3 words, propagated carry/borrow into 4th

● Stall if carry or borrow propagates beyond

● all_ones, all_zeros helps with propagation 14

Methodology & Evaluation Overview

Performance evaluation requires both cycles-per-dot-product and cycle time.

1) Cycles-per-dot-product:○ Simulate the SoC in RTL simulation, measure execution time in cycles

2) Cycle time: ○ Push SoC through synthesis and P&R, determine critical path, area

Design space exploration over three parameters:

Complete Register (Centralized, Segmented)

{C,S}_{L1,L2}_{D,F}

Cache Interface(L1 = 64 bit, L2 = 128 bit)

Operand Precision(Double, Float)

Measuring Cycles-Per-Operation● simulate entire SoC in RTL simulation (Synopsys VCS)● microbenchmark: random vectors uniformly in the mantissa and exponent spac● measure cycles-per-element (CPE) of software libraries on a host with similar

caches:

Software libraries:

● ReproBLAS● Intel MKL

Host machine: Intel Xeon E5-2667

● caches: 32 KiB L1 D$, 256 KiB unified L2● ISA extensions: SSE 4.1, 4.2, AVX 16

Comparison: CPE vs Vector Length

Single Precision Double Precision

VLSI Evaluation

Push the complete SoC through CAD flow, measure cycle time and area.

Flow details:

● Synthesis: Synopsys Design Compiler● Place & Route: Synopsys IC Compiler● Technology: TSMC 45nm● No SRAM compiler

○ generate timing and area models using CACTI

Area Breakdown of Accelerator

Area Breakdown of Core excluding L2

1.24 ns

1.09 ns

Outstanding Questions & Future Work

● How to use effectively in BLAS-2 and BLAS-3 kernels?○ Must amortize overhead of accelerator setup○ Cost of saving intermediate exact results is high

● Measure energy and compare to software libraries. ○ Compare to software libraries.

Conclusion

● Realizable with modest area costs● Easily saturates available memory bandwidth

Strong case for integration in application specific SoCs; more careful evaluation required to motivate integration in general-purpose machines

Acknowledgements

● Special thanks to Jim Demmel, William Kahan, Hong Diep Nguyen, and Colin Schmidt

● This research was partially funded by DARPA Award Number HR0011-12-2-0016 and ASPIRE Lab industrial sponsors and affiliates Intel, Google, HPE, Huawei, LGE, Nokia, NVIDIA, Oracle, and Samsung.

Computing an Exact Dot Product A Hardware Accelerator...

Documents

Transcript of Computing an Exact Dot Product A Hardware Accelerator...

ApproximateNeumann SeriesorExactMatrix InversionforMassive ...arith24.arithsymposium.org/slides/s10-gustafsson.pdf · •Assumemultiply-and-add(MAD)operations ... MatrixInversionforMassiveMIMO

L C SL C S Energy Aware Lossless Data Compression Kenneth Barr and Krste Asanović MIT Laboratory for Computer Science.

AN INTERACTIVE EMBEDDED SYSTEM & AI TO PLAY …pc/courses/432/2014/projects/11_connect5/Fi… · AN INTERACTIVE EMBEDDED SYSTEM & AI TO PLAY CONNECT5 David Biancolin, Ritchie Zhao,

Mark Hampton and Krste Asanović April 9, 2008

Simmani: Runtime Power Modeling for Arbitrary RTL with … · 2020. 4. 13. · jrb@berkeley.edu Krste Asanović University of California, Berkeley krste@berkeley.edu ABSTRACT This

Tutorial)Introduc.on) - RISC-V FoundationTutorial)Introduc.on) Krste)Asanović) krste@eecs.berkeley.edu) )) RISC8V)Tutorial,)HPCA,)SFO)Marriot February)8,)2015) ISAs%don’t%ma-er%

A Hardware Accelerator for Computing an Exact Dot Productkrste/papers/koenig-ms.pdf · 2018. 6. 4. · David Biancolin was my project partner on the class project, as well as my collaborator

IEEE Symposium on Computer Arithmetic July 24 , 2017arith24.arithsymposium.org/slides/s5-cornea.pdf · Included in Intel® Parallel Studio XE and Intel® System Studio Suites Optimized

Ananian/Asanović/Kuszmaul/Leiserson/Lie: Unbounded Transactional Memory, HPCA '05 (1) Unbounded Transactional Memory C. Scott Ananian, Krste Asanović,

FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ...biancolin/papers/firesim-isca18.pdf · sign FPGA-based applications that run in the cloud. Using an FPGA-enabled public cloud

RAMP Gold: Architecture and Timing Model Andrew Waterman, Zhangxi Tan, Rimas Avizienis, Yunsup Lee, David Patterson, Krste Asanović Parallel Computing.

Workshop(Introduc/on( - RISC-VIntroduc/on(Krste(Asanović(krste@eecs.berkeley.edu( ! ((FirstRISC;V(Workshop,(Monterey,(CA(January(14,(2015

PPD: Platform for Private Data Mohit Tiwari with Krste Asanović, Dawn Song, Petros Maniatis*, Prashanth Mohan, Charalampos Papamanthou, Elaine Shi, Emil.

“Flash and Chips” The FireBox Warehouse-Scale …The FireBox Warehouse-Scale Computer Krste Asanović, David Patterson, Randy Katz, ] + FireBox group and ASPIRE Lab krste@berkeley.edu

1 Energy-Efficient Register Access Jessica H. Tseng and Krste Asanović MIT Laboratory for Computer Science, Cambridge, MA 02139, USA SBCCI2000.

The Hwacha Microarchitecture Manual, Version 3.8...The Hwacha Microarchitecture Manual, Version 3.8.1 Yunsup Lee Albert Ou Colin Schmidt Sagar Karandikar Howard Mao Krste Asanović

Emmett Witchel Krste Asanović MIT Lab for Computer Science Hardware Works, Software Doesn’t: Enforcing Modularity with Mondriaan Memory Protection.

Free and Open Instruction Sets & Other Stuff Krste Asanović, representing the ASPIRE Lab krste@eecs.berkeley.edu .

Mixed Precision Vector Processorspeople.eecs.berkeley.edu/~krste/papers/EECS-2015-265.pdf · Mixed Precision Vector Processors Albert Ou Krste Asanović, Ed. Vladimir Stojanovic,

SC 11.april Auto put broj 2 Novi Beograd - joga.co.rs · Astrologija, Svetlana Studen Lični razvoj I psiho-terapija Geštalt terapija, Marija Asanović, Ivana Kristić I Ljubica