CS152: Computer Systems Architecture Dark Silicon, Application-Specific Acceleration Sang-Woo Jun Winter 2019

Page 1:

CS152: Computer Systems Architecture

Dark Silicon, Application-Specific Acceleration

Sang-Woo Jun

Winter 2019

Page 2:

Not All Transistors Can Be Active!

Utilization wall: “With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints.” -- Venkatesh, ASPLOS ’10

The following slides are adapted from Michael Taylor’s 2012 talk “Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse” – Marked ‘**’

Page 3:

Tradeoffs Between Cores And Frequency**

4 cores @ 1.8 GHz

Next generation:
o 4 cores @ 2x1.8 GHz (12 cores dark)
o 2x4 cores @ 1.8 GHz (8 cores dark, 8 dim)
o 4x4 cores @ 0.9 GHz (16 dim)

… …
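The three next-generation options on this slide all spend the same power budget. The sketch below reproduces them under assumptions that are not stated on the slide: a linear scaling factor of 2 between generations, four times as many cores fitting in the same area, power proportional to (active cores × frequency), and a budget that affords only twice the original core-GHz. The function name and parameters are hypothetical.

```python
def next_gen_options(base_cores=4, base_ghz=1.8, s=2):
    """Enumerate (active cores, GHz per core, dark cores) under a fixed power budget.

    Assumptions (illustrative only): core count grows by s**2, and the
    affordable cores-times-frequency product grows only by s.
    """
    total_cores = base_cores * s * s          # 16 cores fit in the same area
    budget = base_cores * base_ghz * s        # affordable core-GHz (14.4 here)
    options = []
    active = base_cores
    while active <= total_cores:
        ghz = budget / active                 # frequency each active core can run at
        options.append((active, ghz, total_cores - active))
        active *= 2
    return options


for active, ghz, dark in next_gen_options():
    print(f"{active} cores @ {ghz:.1f} GHz, {dark} dark")
```

Every option delivers the same aggregate core-GHz; the choice is only whether the unused capability shows up as dark cores or as dim (underclocked) ones.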

Page 4:

The Four Horsemen**

What do we do with this dark silicon?

“Four top contenders, each of which seemed like an unlikely candidate from the beginning, carrying unwelcome burdens in design, manufacturing and programming. None is ideal, but each has its benefit and the optimal solution probably incorporates all four of them…”

Page 5:

The Shrinking Horseman (#1)**

“Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!”

First, dark silicon doesn’t mean useless silicon, it just means it’s under-clocked or not used all of the time.

There’s lots of dark silicon in current chips:
o On-chip GPU on AMD Fusion or Intel Sandy Bridge when running GCC
o L3 cache is very dark for applications with small working sets
o SSE units for integer apps
o …

Page 6:

The Shrinking Horseman (#1)**

Competition and Margins
o If there is an advantage to be had from using dark silicon, you have to use it too, to keep up with the Joneses.

Diminished Returns (e.g., $10 silicon selling for $200 today)
o Savings Exponentially Diminishing: $5, $2.5, $1.25, 63c
o Overheads: packaging, test, marketing, etc.
o Chip structures like I/O Pad Area do not scale

Exponential increase in Power Density -> Exponential Rise in Temperature

But, some chips will shrink
o Nasty low margin, high competition chips; or a monopoly (Sony Cell)
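The savings sequence in the diminished-returns example can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming each shrink halves the silicon cost while the $200 selling price stays fixed; the function name is hypothetical.

```python
def shrink_savings(silicon_cost, generations):
    """Per-generation savings if each shrink halves the silicon cost."""
    savings = []
    for _ in range(generations):
        silicon_cost /= 2              # the new die costs half as much
        savings.append(silicon_cost)   # which is also the amount saved
    return savings


savings = shrink_savings(10.0, 4)      # dollars saved per generation
total_fraction = sum(savings) / 200.0  # share of the selling price recovered
print(savings, f"{total_fraction:.1%}")
```

Under these assumptions, four generations of shrinking recover under 5% of the selling price, before counting the overheads that do not scale at all.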

Page 7:

The Dim Horseman (#2)**

Spatial dimming: Have enough cores to exceed power budget, but underclock them

Gen 1 & 2 Multicores (higher core count, lower freqs)

Near Threshold Voltage (NTV) Operation
o Delay Loss/Lower clock speed > Energy Gain
o But, make it up with lots of dim cores

Page 8:

The Dim Horseman (#2)**

Temporal Dimming: Have enough cores to exceed power budget, but use them only in bursts
o Dim cores, but overclock if cold – e.g., Intel Turbo Boost
o E.g., ARM A15 Core in mobile phones
  • A15 power usage way above sustainable for phone.
  • 10 second bursts at most (big.LITTLE)
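One rough way to reason about temporal dimming is a steady-state average-power budget: the core may burst only for the fraction of time that keeps average power at or below the sustainable level. The sketch below uses that simplification (it ignores thermal capacitance and temperature limits), and all wattages are made-up illustrative values, not measurements of any real core.

```python
def max_duty_cycle(p_burst, p_idle, p_sustain):
    """Largest fraction of time spent bursting such that
    d * p_burst + (1 - d) * p_idle <= p_sustain."""
    if not p_idle <= p_sustain <= p_burst:
        raise ValueError("expected p_idle <= p_sustain <= p_burst")
    if p_burst == p_idle:
        return 1.0
    return (p_sustain - p_idle) / (p_burst - p_idle)


def cooldown_seconds(burst_s, duty):
    """Idle time needed after a burst to stay within the duty cycle."""
    if duty <= 0:
        raise ValueError("no bursting is possible at this budget")
    return burst_s * (1 - duty) / duty


# Hypothetical numbers: 4 W burst, 0.5 W idle, 2 W sustainable.
duty = max_duty_cycle(p_burst=4.0, p_idle=0.5, p_sustain=2.0)
rest = cooldown_seconds(burst_s=10.0, duty=duty)
print(f"duty cycle {duty:.2f}, {rest:.1f} s of idle per 10 s burst")
```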

Page 9:

The Specialized Horseman (#3)**

“We will use all of that dark silicon area to build specialized cores, each of them tuned for the task at hand (10-100x more energy efficient), and only turn on the ones we need…”

Insights:
o Power is now more expensive than area
o Specialized logic can improve energy efficiency by 10-1000x

Page 10:

The Specialized Horseman (#3)**

C-cores Approach:
o Fill dark silicon with Conservation Cores, or c-cores, which are automatically-generated, specialized energy-saving coprocessors that save energy on common apps

Execution jumps among c-cores (hot code) and a host CPU (cold code)
o Power-gate HW that is not currently in use
o Coherent Memory & Patching Support for C-cores

Page 11:

Typical Energy Savings**

Page 12:

The Specialized Horseman (#3) -- Pssst

Another active thrust in this area is reconfigurable hardware acceleration using Field-Programmable Gate Arrays (FPGAs)
o A single FPGA fabric can be configured at runtime to act like any C-core
o Not as efficient as a prefabricated C-core, but can cover any function at runtime
o More on this later!

Page 13:

The Deus Ex Machina Horseman (#4)**

Deus Ex Machina: “A plot device whereby a seemingly unsolvable problem is suddenly and abruptly solved with the unexpected intervention of some new event, character, ability or object.”

“MOSFETs are the fundamental problem”

“FinFets, Trigate, High-K, nanotubes, 3D, for one-time improvements, but none are sustainable solutions across process generations.”

Page 14:

The Deus Ex Machina Horseman (#4)**

Possible “Beyond CMOS” Device Directions

o Nano-electrical Mechanical Relays?

o Tunnel Field Effect Transistors (TFETs)?

o Spin-Transfer Torque MRAM (STT-MRAM)?

o Graphene?

o Human brain?

o DNA Computing?

Page 15:

CS152: Computer Systems Architecture

Field-Programmable Gate Arrays

Sang-Woo Jun

Winter 2019

Page 16:

What Are FPGAs

Field-Programmable Gate Array

Can be configured to act like any circuit – More later!

Can do many things, but we focus on computation acceleration

Page 17:

FPGAs Come In Many Forms

PCIe-Attached

CPU-Integrated

In-Network

In-Storage

Page 18:

How Is It Different From CPU/GPUs

GPU – The other major accelerator

CPU/GPU hardware is fixed
o “General purpose”
o We write programs (sequences of instructions) for them

FPGA hardware is not fixed
o “Special purpose”
o Hardware can be whatever we want
o Will our hardware require/support software? Maybe!

Optimized hardware is very efficient
o GPU-level performance**
o 10x power efficiency (300 W vs 30 W)

Page 19:

Analogy

CPU/GPU comes with fixed circuits

FPGA gives you a big bag of components to build whatever
o Could be a CPU/GPU!

Image credits: “The Z-Berry”; “Experimental Investigations on Radiation Characteristics of IC Chips”; benryves.com “Z80 Computer”; Shadi Soundation: Homebrew 4 bit CPU

Page 20:

Fine-Grained Parallelism of Special-Purpose Circuits

Example -- Calculating gravitational force: $\frac{G \times m_1 \times m_2}{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

8 instructions on a CPU → 8 cycles**

A = G × m1
B = A × m2
C = x1 - x2
D = C²
E = y1 - y2
F = E²
H = D + F
Ret = B / H

Far fewer cycles on a special-purpose circuit

o 4 cycles with basic operations: independent operations run in the same cycle (A, C, E; then B, D, F; then H; then Ret)

o 3 cycles with compound operations:
  A = G × m1 × m2, B = (x1 - x2)², C = (y1 - y2)²
  D = B + C
  Ret = A / D

o 1 cycle with even further compound operations (may slow down clock):
  Ret = (G × m1 × m2) / ((x1 - x2)² + (y1 - y2)²)
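The 8-cycle and 4-cycle figures for basic operations can be checked by treating the eight operations as a dataflow graph and computing its critical path. This is a minimal sketch assuming every basic operation takes exactly one cycle and the circuit has enough hardware to run all ready operations at once; the node name `denom` (for D + F) and the function names are illustrative.

```python
# Each operation maps to the operations it depends on. Raw inputs
# (G, m1, m2, x1, x2, y1, y2) are available at cycle 0 and are not listed.
DEPS = {
    "A": [],                # G * m1
    "B": ["A"],             # A * m2
    "C": [],                # x1 - x2
    "D": ["C"],             # C squared
    "E": [],                # y1 - y2
    "F": ["E"],             # E squared
    "denom": ["D", "F"],    # D + F
    "Ret": ["B", "denom"],  # B / denom
}


def finish_cycle(op, deps=DEPS):
    """Earliest cycle in which `op` can finish if every op takes one cycle."""
    return 1 + max((finish_cycle(d, deps) for d in deps[op]), default=0)


def schedule(deps=DEPS):
    """Group operations by the cycle in which a dataflow circuit can run them."""
    cycles = {}
    for op in deps:
        cycles.setdefault(finish_cycle(op, deps), []).append(op)
    return [cycles[c] for c in sorted(cycles)]


sequential_cycles = len(DEPS)          # one instruction per cycle on a simple CPU
parallel_cycles = finish_cycle("Ret")  # critical path of the dataflow graph
print(sequential_cycles, parallel_cycles)
print(schedule())
```

Merging dependent operations into compound ones shortens the critical path further, at the cost of more logic between registers, which is why the single-cycle version may need a slower clock.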

Page 21:

Coarse-Grained Parallelism of Special-Purpose Circuits

The typical unit of parallelism for general-purpose units is the thread (~= core)

Special-purpose processing units can also be replicated for parallelism
o Large, complex processing units: Few can fit in chip
o Small, simple processing units: Many can fit in chip

Independent operations can explicitly be parallelized across dedicated hardware modules
o Hundreds/thousands of operations are regularly done in parallel

Only hardware useful for the application is generated
o Instruction? Decoding? Cache? Coherence?

Page 22:

How Is It Different From ASICs

ASIC (Application-Specific Integrated Circuit)
o Special chip purpose-built for an application
o E.g., ASIC bitcoin miner, Intel neural network accelerator
o Function cannot be changed once expensively built

+ FPGAs can be field-programmed
o Function can be changed completely whenever
o FPGA fabric emulates custom circuits

- Emulated circuits are not as efficient as bare-metal; an ASIC has
o ~10x performance (larger circuits, faster clock)
o ~10x power efficiency

Page 23:

Basic FPGA Architecture

Figure labels: “Configurable logic block (CLB)”, programmable interconnect, I/O block

Inside a CLB:
o 6-Input Look-Up Table (programmable)
o FF/Latch: stores state for sequential circuit construction

Ex) 2-LUT for “AND”

Input 1  Input 2  Output
0        0        0
0        1        0
1        0        0
1        1        1
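A look-up table can be modeled in software as a small memory indexed by its inputs: programming the LUT means filling in the truth table. The sketch below is a behavioral model only, not how any vendor tool represents a CLB; the class name `LUT` and the toggle example are illustrative.

```python
class LUT:
    """A k-input look-up table: 2**k configuration bits, one per input pattern."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.table = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)  # Input 1 is the most significant bit
        return self.table[index]


# Same hardware, different configuration bits, different gate.
and2 = LUT(2, [0, 0, 0, 1])
xor2 = LUT(2, [0, 1, 1, 0])

# A flip-flop holding the LUT output gives a sequential circuit:
# here a 1-bit state that toggles whenever `enable` is 1.
state = 0
trace = []
for enable in [1, 1, 0, 1]:
    state = xor2(state, enable)  # the register captures the LUT output each clock
    trace.append(state)
print(trace)
```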

Page 24:

Basic FPGA Architecture – DSP Blocks

CLBs act as gates – Many needed to implement high-level logic

Arithmetic operations are provided as efficient ALU blocks
o “Digital Signal Processing (DSP) blocks”
o Each block provides an adder + multiplier

Figure: “DSP block” containing × and +/- units

Page 25:

Basic FPGA Architecture – Block RAM

CLBs can act as flip-flops
o (~1 bit/block) – tiny!

Some on-chip SRAM is provided as blocks
o ~18/36 Kbit/block, MBs per chip
o Massively parallel access to data → multi-TB/s bandwidth

“Block RAM”
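The multi-TB/s figure comes from every block being readable in the same cycle. The arithmetic below is a sketch with hypothetical parameters (1000 blocks of 36 Kbit, two 36-bit ports each, a 500 MHz clock); none of these numbers describe a specific chip.

```python
def aggregate_bandwidth_bytes_per_s(blocks, ports_per_block, port_width_bits, clock_hz):
    """Peak bandwidth if every port of every block transfers one word per cycle."""
    return blocks * ports_per_block * port_width_bits * clock_hz // 8


def capacity_bytes(blocks, kbit_per_block):
    """Total on-chip block RAM capacity."""
    return blocks * kbit_per_block * 1024 // 8


# Hypothetical device parameters, chosen only to illustrate the scale.
bandwidth = aggregate_bandwidth_bytes_per_s(
    blocks=1000, ports_per_block=2, port_width_bits=36, clock_hz=500_000_000
)
capacity = capacity_bytes(blocks=1000, kbit_per_block=36)
print(f"{bandwidth / 1e12:.1f} TB/s across {capacity / 1e6:.1f} MB")
```

The capacity is only a few megabytes, so the bandwidth advantage applies to working sets that fit on chip.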

Page 26:

Basic FPGA Architecture – Hard Cores

Some functions are provided as efficient, non-configurable “hard cores”
o Multi-core ARM cores (“Zynq” series)
o Multi-Gigabit Transceivers
o PCIe/Ethernet PHY
o Memory controllers
o …

Figure labels: ARM, PCIe, Ethernet, Memory

Page 27:

Example Accelerator Card Architecture

Block diagram: an FPGA connected to PCIe, two DRAM blocks, 1GbE, 40GbE, and FMC; connections are labeled General-Purpose I/O Pins and Multi-Gigabit Transceivers

FMC: “FPGA Mezzanine Card” Expansion
o Network Ports, Memory, Storage, PCIe, …

Page 28:

Example Accelerator Card (VCU108)

Page 29:

Programming/Using an FPGA Accelerator

Bitfile is programmed to FPGA over “JTAG” interface
o Typically used over USB cable
o Supports FPGA programming, limited debugging access, etc.

PCIe-attached FPGA accelerator card is typically used similarly to GPUs
o Program FPGA, execute software
o Software copies data to FPGA board and notifies FPGA -> FPGA logic performs computations -> Software copies data back from FPGA

FPGA flexibility gives immense freedom of usage patterns
o Streaming, coherent memory, …
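The copy-in, compute, copy-out pattern for a PCIe-attached card can be sketched from the host's point of view. Everything below is a software stand-in: `SimulatedAccelerator` and its methods are hypothetical names, not a real driver or vendor API, and the "FPGA logic" is just a Python function.

```python
class SimulatedAccelerator:
    """Stand-in for an FPGA card; a real system would talk to a device driver."""

    def __init__(self, kernel):
        self.kernel = kernel     # plays the role of the programmed FPGA logic
        self.device_memory = {}  # plays the role of on-board DRAM
        self.done = False

    def write_buffer(self, name, data):
        self.device_memory[name] = list(data)  # host -> device copy

    def start(self, src, dst):
        # "Notify FPGA": the logic processes the input buffer into the output buffer.
        self.device_memory[dst] = [self.kernel(x) for x in self.device_memory[src]]
        self.done = True

    def read_buffer(self, name):
        return list(self.device_memory[name])  # device -> host copy


def run_offload(device, data):
    device.write_buffer("in", data)   # 1. software copies data to the board
    device.start("in", "out")         # 2. software notifies the FPGA, logic computes
    if not device.done:
        raise RuntimeError("accelerator did not finish")
    return device.read_buffer("out")  # 3. software copies the results back


result = run_offload(SimulatedAccelerator(lambda x: x * x), [1, 2, 3])
print(result)
```

A streaming design would replace the bulk copies with a continuous flow of data through the logic, which is one of the alternative usage patterns the slide mentions.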