Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

71
Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs Satnam Singh, Microsoft Research Cambridge, UK David Greaves, Computer Lab, Cambridge University, UK

description

Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs. Satnam Singh, Microsoft Research Cambridge, UK David Greaves, Computer Lab, Cambridge University, UK. XD2000i FPGA in-socket accelerator for Intel FSB. XD2000F FPGA in-socket accelerator for AMD socket F. - PowerPoint PPT Presentation

Transcript of Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Page 1: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Satnam Singh, Microsoft Research Cambridge, UK

David Greaves, Computer Lab, Cambridge University, UK

Page 2: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

XD2000i FPGA in-socketaccelerator for Intel FSB

XD2000F FPGA in-socketaccelerator for AMD socket F

XD1000 FPGA co-processormodule for socket 940

Page 3: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

The Future is Heterogeneous

Page 4: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 5: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 6: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 7: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 8: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 9: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 10: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Example Speedup: DNA Sequence Matching

Page 11: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Why are regular computers not fast enough?

Page 12: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 13: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

FPGAs are the Lego of Hardware

Page 14: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 15: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 16: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

LUT4 (OR)

Page 17: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

LUT4 (AND)

Page 18: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 19: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 20: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 21: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

opportunity

scientific computingdata miningsearchimage processingfinancial analytics

challenge

Page 22: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

The Accidental Semi-colon

;

Page 23: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Kiwi Thesis• Parallel programs are a good

representation for circuit designs. (?)• Separated at birth?

Page 24: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Objectives• A system for software engineers.• Model synchronous digital circuits in C# etc.– Software models offer greater productivity than

models in VHDL or Verilog.• Transform circuit models automatically into

circuit implementations.• Transform programs with dynamic memory

allocation into their array equivalents.• Exploit existing concurrent software

verification tools.

Page 25: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Previous Work• Starts with sequential C-style programs.• Uses various heuristics to discover

opportunities for parallelism esp. in nested loops.

• Good for certain idioms that can be recognized.• However:

– many parallelization opportunities are not discovered

– lack of control– no support for dynamic memory allocation

Page 26: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Kiwi

structural imperative (C)parallelimperative

gate-level VHDL/Verilog Kiwi C-to-

gates

&0

0

0

Q

QSET

CLR

S

R

;;;

jpeg.cthread 2

thread 3

thread 1

Page 27: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Key Points• We focus on compiling parallel C# programs into

parallel hardware.• Important because future processors will be

heterogeneous and we need to find ways to model and program multi-core CPUs, GPUs, FPGAs etc.

• Previous work has had some success with compiling sequential programs into hardware.

• Our hypothesis: it’s much better to try and produce parallel hardware from parallel programs.

• Our approach involves compiling .NET concurrency constructs into gates.

Page 28: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Self Inflicted Constraints• Use a standard programming

language with no special extensions (C#).

• Use standard mechanism for concurrency (System.Threading).

• Use concurrency of model circuit structure.

Page 29: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

I2C Bus Control in VHDL

Page 30: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Ports and Clockspublic static class I2C { [OutputBitPort("scl")] static bool scl;

[InputBitPort("sda_in")] static bool sda_in;

[OutputBitPort("sda_out")] static bool sda_out;

[OutputBitPort("rw")] static bool rw;

circuit ports identified by

custom attribute

Page 31: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

I2C Control private static void SendDeviceID() { Console.WriteLine("Sending device ID"); // Send out 7-bit device ID 0x76 int deviceID = 0x76;

for (int i = 7; i > 0; i--) { scl = false; sda_out = (deviceID & 64) != 0; Kiwi.Pause();

// Set it i-th bit of the device ID scl = true; Kiwi.Pause(); // Pulse SCL scl = false; deviceID = deviceID << 1; Kiwi.Pause(); } }

Page 32: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Generated Verilogmodule i2c_demo(clk, reset, I2CTest_I2C_scl, I2CTest_I2C_sda);

input clk; input reset; reg i2c_demo_CS$4$0000; reg I2CTest_I2C_SendDeviceID_CS$4$0000; reg I2CTest_I2C_SendDeviceID_second_CS$4$0000; reg I2CTest_I2C_ProcessACK_ack1; reg I2CTest_I2C_ProcessACK_fourth_ack1; reg I2CTest_I2C_ProcessACK_second_ack1; reg I2CTest_I2C_ProcessACK_third_ack1; integer I2CTest_I2C_SendDeviceID_deviceID; integer I2CTest_I2C_SendDeviceID_second_deviceID; integer I2CTest_I2C_SendDeviceID_i; integer i2c_demo_i; integer I2CTest_I2C_SendDeviceID_second_i; integer i2c_demo_inBit; integer i2c_demo_registerID; output I2CTest_I2C_scl; output I2CTest_I2C_sda;

Page 33: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

System Composition• We need a way to separately develop

components and then compose them together.

• Don’t invent new language constructs: reuse existing concurrency machinery.

• Adopt single-place channels for the composition of components.

• Model channels with regular concurrency constructs (monitors).

Page 34: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Writing to a Channelpublic class Channel<T>{ T datum; bool empty = true; public void Write(T v) { lock (this) { while (!empty) Monitor.Wait(this); datum = v; empty = false; Monitor.PulseAll(this); } }

Page 35: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Reading from a Channelpublic T Read(){ T r; lock (this) { while (empty) Monitor.Wait(this); empty = true; r = datum; Monitor.PulseAll(this); } return r;}

Page 36: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Our Implementation• Use regular Visual Studio technology to

generate a .NET IL assembly language file.• Our system then processes this file to produce

a circuit:– The .NET stack is analyzed and removed– The control structure of the code is analyzed and

broken into basic blocks which are then composed.– The concurrency constructs used in the program

are used to control the concurrency / clocking of the generated circuit.

Page 37: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

systems level concurrency constructsthreads, events, monitors, condition variables

rendezvous join patterns transactionalmemory

dataparallelism

user applications

domain specificlanguages

Page 38: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Higher Level Concurrency Constructs

• By providing hardware semantics for the system level concurrency abstractions we hope to then automatically deal with other higher level concurrency constructs:– Join patterns (C-Omega, CCR, .NET Joins

Library)– Rendezvous– Data parallel operations

Page 39: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 40: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

KiwiLibrary

Kiwi.cs

circuitmodel

JPEG.cs

Visual Studio

multi-thread simulationdebuggingverification

Kiwi Synthesis

circuitimplementation

JPEG.v

Page 41: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

parallelprogram

C#

Thread 1

Thread 2

Thread 3

Thread 3

C togates

C togates

C togates

C togates

circuit

circuit

circuit

circuitVerilog

for system

Page 42: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

public static int max2(int a, int b){ int result; if (a > b) result = a; else result = b; return result;}

.method public hidebysig static int32 max2(int32 a, int32 b) cil managed{ // Code size 12 (0xc) .maxstack 2 .locals init ([0] int32 result) IL_0000: ldarg.0 IL_0001: ldarg.1 IL_0002: ble.s IL_0008

IL_0004: ldarg.0 IL_0005: stloc.0 IL_0006: br.s IL_000a

IL_0008: ldarg.1 IL_0009: stloc.0 IL_000a: ldloc.0 IL_000b: ret}

max2(3, 7)

stack

local memory0

377

7

Page 43: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

System.Threading• We have decided to target hardware

synthesis for a sub-set of the concurrency features in the .NET library System.Threading–Monitors (synchronization)– Thread creation (circuit structure)

Page 44: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Kiwi Concurrency Library• A conventional concurrency library Kiwi is

exposed to the user which has two implementations:– A software implementation which is defined purely in

terms of the support .NET concurrency mechanisms (events, monitors, threads).

– A corresponding hardware semantics which is used to drive the .NET IL to Verilog flow to generate circuits.

• A Kiwi program should always be a sensible concurrent program but it may also be a sensible parallel circuit.

Page 45: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

System Composition• We need a way to separately develop

components and then compose them together.

• Don’t invent new language constructs: reuse existing concurrency machinery.

• Adopt single-place channels for the composition of components.

• Model channels with regular concurrency constructs (monitors).

Page 46: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Writing to a Channelpublic class Channel<T>{ T datum; bool empty = true; public void Write(T v) { lock (this) { while (!empty) Monitor.Wait(this); datum = v; empty = false; Monitor.PulseAll(this); } }

Page 47: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Reading from a Channelpublic T Read(){ T r; lock (this) { while (empty) Monitor.Wait(this); empty = true; r = datum; Monitor.PulseAll(this); } return r;}

Page 48: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 49: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

class FIFO2{ [Kiwi.OutputWordPort(“result“, 31, 0)] public static int result;

static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>(); static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

Page 50: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

public static void Consumer() { while (true) { int i = chan1.Read(); chan2.Write(2 * i); Kiwi.Pause(); } }

public static void Producer() { for (int i = 0; i < 10; i++) { chan1.Write(i); Kiwi.Pause(); } }

Page 51: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

public static void Behaviour(){ Thread ProducerThread = new Thread(new ThreadStart(Producer)); ProducerThread.Start();

Thread ConsumerThread = new Thread(new ThreadStart(Consumer)); ConsumerThread.Start();

Page 52: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 53: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

twoclock ticksper result

handshakingprotocol

Page 54: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 55: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Filter Example

thread one-placechannel

Page 56: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

public static int[] SequentialFIRFunction(int[] weights, int[] input) { int[] window = new int[size]; int[] result = new int[input.Length]; // Clear to window of x values to all zero. for (int w = 0; w < size; w++) window[w] = 0; // For each sample... for (int i = 0; i < input.Length; i++) { // Shift in the new x value for (int j = size - 1; j > 0; j--) window[j] = window[j - 1]; window[0] = input[i]; // Compute the result value int sum = 0; for (int z = 0; z < size; z++) sum += weights[z] * window[z]; result[i] = sum; } return result; }

Page 57: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 58: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Transposed Filter

Page 59: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

static void Tap(int i, byte w, Kiwi.Channel<byte> xIn, Kiwi.Channel<int> yIn, Kiwi.Channel<int> yout){ byte x; int y; while(true) { y = yIn.Read(); x = xIn.Read(); yout.Write(x * w + y); }}

Page 60: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Inter-thread Communication and Synchronization

// Create the channels to link together the tapsfor (int c = 0; c < size; c++){ Xchannels[c] = new Kiwi.Channel<byte>(); Ychannels[c] = new Kiwi.Channel<int>(); Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros}

Page 61: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

// Connect up the taps for a transposed filterfor (int i = 0; i < size; i++){ int j = i; // Quiz: why do we need the local j? Thread tapThread = new Thread(delegate() { Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]); }); tapThread.Start();}

Page 62: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Performance• Software

– Dual-core Pentium 2.67GHz, 3GB– 6,562,500 pixels per second

• BEE3 FPGA Performance– Xilinx XC5VLX110T FPGA, 100MHz– DDR2 memory, 2 DIMMS per channels, 288-bits per read– 4 cycles per pixel– 429,000,000 pixels per second

• Hand optimized core– Xilinx CoreGenerator: 400MHz

Page 63: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Current Limitations• Only integer arithmetic and string handling.• Floating point could be added easily.• Generation of statically allocated code:– Arrays must be dimensioned at compile time– Number of objects on the heap is determined at

compile time– Recursive function calling must bottom out at

compile time (so depth can not be run-time dependent)

Page 64: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Next Steps• Consider a series of concurrency constructs and

their meaning in hardware:– Transactional memory– Rendezvous.– Join patterns / chords– Data Parallel Descriptions

• Optimize away handshaking protocol.• Allow non trivial dynamic memory allocation.• Solve impedance mismatch with back-end tools

to improve performance.

Page 65: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Smith-Waterman Recurrence

Page 66: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Page 67: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

SW Diagonal Dependencies

Can perform all operations on an anti-diagonal in parallel.Can pass query and database data along channels between cells.However, each operation needs a scoring matrix read.

Page 68: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

for (int qpos = 0; qpos < height; qpos++) { short score = (dbval < 0 || seq[qpos] < 0) ? (short)0: pam250[dbval, seq[qpos]]; int left = prev[qpos]; int above = (qpos==0)? aboveScore: here[qpos-1]; int diag = (qpos==0)? prevAbove: prev[qpos-1]; int nv = Math.Max(0, Math.Max(left - 10, Math.Max(above - 10, diag + score))); if (nv > (int)max) max = (short)nv; here[qpos] = (short)nv; if (qpos == height-1) below_score.Write((short)nv); }

Page 69: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

data paralleldescription of

FFT-style operationsin a multi-core

bytecode

FPGAhardware(VHDL)

GPU code (CUDA)

SMPC#

Page 70: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Summary• Circuits can be modelled as regular parallel programs.• Automatically transform parallel circuit models into digital

circuit implementations.• Exploit shared memory and passage passing idioms for co-

design.• We don’t need to invent a new language:

– Exploit rich existing knowledge of concurrent programming.• Apply recent innovations in shape analysis and region types

to allow us to compile programs with lists and trees.• Is there an application for this work at Sanger/EBI?• More information about Kiwi synthesis at

http://research.microsoft.com/~satnams

Page 71: Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs

Synplify Pro FPGA Implementation:

First, preliminary result: Device: Virtex 5x110T-2: Static timing: 20 logic layers, Fmax=78MHz (12.7 ns). Utilisation = 3120 Virtex-5 slices, 17% of 17500. Clock cycles per streaming base: 10.

Future parameter exploration: QSL search string query limit increase = 256 or 512. N search parallelism (number of units) = 32 or 64. Clocks per cell : reduce to 4 or 2 (channel overheads then dominate). Extend Kiwi channels between the four chips on the BEE3 board.