Transcript of: Generic SOC Architecture for Convolutional Neural Networks, CDR 12.01.2015, by Merav Natanson & Yotam Platner, Supervisor: Guy Revach, HSDSL Lab, Technion

Page 1:

Generic SOC Architecture for Convolutional Neural Networks

CDR 12.01.2015

By: Merav Natanson & Yotam Platner
Supervisor: Guy Revach

HSDSL Lab, Technion

Page 2:

NN Coprocessor and Algorithm on SOC
Hardware implementation of a generic & modular NN coprocessor on FPGA logic

Software driver and API

Software implementation of specific test-case algorithms

Linux OS running on ARM processor

Our Board: Avnet ZedBoard (System on Chip)
Programmable Logic - Xilinx Zynq XC7Z020-1 (FPGA)
Processing System - Dual ARM Cortex-A9
Memory - 512MB DDR3
External Interface - 10/100/1000 Ethernet

Page 3:

Project A Stages
Research and learning of Convolutional Neural Networks, with focus on the LeNet-5 algorithm

Ramp-up on the ZedBoard and Zynq platforms

Hardware architecture document for the FPGA coprocessor

Analysis of feasibility & throughput for different operation modes (software/hardware configurations)

Architecture document for the software API and algorithm

Functional simulation of the coprocessor (ModelSim)

Page 4:


Background – Neural Networks (NN)
Neural networks are based on the biological neural system.

The basic units that construct the network are neurons and weights.

Neuron operation:
- Multiply all relevant inputs (pixels) by the appropriate weights
- Sum the products and add a constant bias
- Apply an activation function (e.g. tanh) to the result: output = f(Σ wᵢ·xᵢ + b)

Neuron - is connected to multiple inputs and outputs; the neuron's output is the result of an activation function applied to the sum of its weighted inputs.

Weight - is the basic unit that connects neurons; it multiplies the data passing through it by the weight value.
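To make the neuron operation concrete, here is a minimal software sketch of a single neuron (the function name and the choice of tanh as activation are illustrative; the coprocessor performs the same multiply-accumulate in hardware):

    #include <vector>
    #include <cmath>

    // Minimal sketch of a neuron: weighted sum plus bias, then activation.
    double neuron(const std::vector<double>& inputs,
                  const std::vector<double>& weights,
                  double bias)
    {
        double acc = 0.0;
        for (size_t i = 0; i < inputs.size() && i < weights.size(); ++i)
            acc += inputs[i] * weights[i];   // multiply-accumulate
        return std::tanh(acc + bias);        // activation on sum + bias
    }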

Page 5:

Background – Neural Networks (NN)
From neurons and weights we can construct a neural network with as many layers as we like.

Each layer contains a certain amount of neurons and a set of weights connects the layer to other layers.

The complexity of the network is determined by the dimension of the input: the more complex and variable the input, the more complex the network must be.

[Figure: example network with several layers between input and output]

Page 6:

Example of an algorithm – LeNet-5
Purpose – handwritten digit recognition.

Input – a handwritten digit represented by a 32x32 pixel matrix.

Output – 10 values of +1 or -1; the recognized digit should be the only one marked +1.

Page 7:

LeNet-5 - layer types
Convolution – matrix convolution between one or more input feature maps (FMs) and a small kernel matrix of weights.

Sub-sampling – performs local averaging, reducing the resolution of a feature map and the sensitivity of the outputs.

Fully connected – each output neuron receives all the previous layer's neurons as inputs, with a different weight for each input.
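A minimal software sketch of the three layer types (plain C++, single feature map, square kernels; all names and the use of tanh are illustrative assumptions, not the coprocessor interface):

    #include <vector>
    #include <cmath>
    using Mat = std::vector<std::vector<double>>;

    // Convolution: slide a KxK weight kernel over the input feature map.
    Mat convolve(const Mat& in, const Mat& kernel, double bias) {
        size_t K = kernel.size(), H = in.size() - K + 1, W = in[0].size() - K + 1;
        Mat out(H, std::vector<double>(W, 0.0));
        for (size_t y = 0; y < H; ++y)
            for (size_t x = 0; x < W; ++x) {
                double acc = bias;
                for (size_t i = 0; i < K; ++i)
                    for (size_t j = 0; j < K; ++j)
                        acc += in[y + i][x + j] * kernel[i][j];
                out[y][x] = std::tanh(acc);          // activation
            }
        return out;
    }

    // Sub-sampling: 2x2 local averaging, halving the resolution.
    Mat subsample(const Mat& in) {
        size_t H = in.size() / 2, W = in[0].size() / 2;
        Mat out(H, std::vector<double>(W, 0.0));
        for (size_t y = 0; y < H; ++y)
            for (size_t x = 0; x < W; ++x)
                out[y][x] = (in[2*y][2*x] + in[2*y][2*x+1] +
                             in[2*y+1][2*x] + in[2*y+1][2*x+1]) / 4.0;
        return out;
    }

    // Fully connected: every output neuron sees every input with its own weight.
    std::vector<double> fullyConnected(const std::vector<double>& in,
                                       const Mat& weights,          // [out][in]
                                       const std::vector<double>& bias) {
        std::vector<double> out(weights.size(), 0.0);
        for (size_t o = 0; o < weights.size(); ++o) {
            double acc = bias[o];
            for (size_t i = 0; i < in.size(); ++i)
                acc += in[i] * weights[o][i];
            out[o] = std::tanh(acc);
        }
        return out;
    }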


Page 8:

FPGA – Block Scheme
[Block diagram. Main blocks: AXI MEMmap Slave Interface (IP) and AXI REG Slave Interface (IP), with a REG Controller and Registers; Weights, Data, Bias and Configuration Memory Blocks; Execution Units 0..N, each containing a Neuron Write Controller, Neuron Read Controller, Registers and a SUM+ROUND+FUNC block; and a shared Neuron Bank with a MUX/DEMUX and Neurons 0..M, each built from a FIFO + Multiplier + Adder.]

Page 9:

Neuron Bank
Neuron operation:
- Data & weight inputs are received into the FIFO
- Multiply and accumulate until the finish flag is received
- Return the result to the neuron read controller
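A behavioral sketch of a single neuron's streaming operation (a C++ model for illustration only; in the design this is a FIFO feeding a multiplier and adder):

    #include <cstdint>
    #include <queue>

    // One entry pushed by the neuron write controller: a data/weight pair
    // plus a flag marking the last element of the current output.
    struct NeuronInput {
        int16_t data;
        int16_t weight;
        bool    finish;
    };

    // Behavioral model of a neuron: multiply-accumulate over the FIFO
    // contents until the finish flag arrives, then hand the raw sum back
    // to the neuron read controller (bias and activation come later).
    int64_t runNeuron(std::queue<NeuronInput>& fifo) {
        int64_t acc = 0;
        while (!fifo.empty()) {
            NeuronInput in = fifo.front();
            fifo.pop();
            acc += static_cast<int64_t>(in.data) * in.weight;   // MAC step
            if (in.finish)
                break;                                          // result ready
        }
        return acc;   // 48-bit accumulator in hardware; int64_t here
    }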

Page 10:

Execution Units
[Block diagram. An execution unit contains a Neuron Write Controller, Neuron Read Controller, Calculation Block and Results Write Controller, plus OUT ADDRESS and BIAS ADDRESS FIFOs and a Memory Transform Block. Buses: Image Memory write and read buses and a Bias Memory read bus (each 32 bits wide, 8K deep), a Neuron Bank write bus (pixels + weights & neuron address), and 48-bit neuron results returning to the Calculation Block.]

Page 11:

Execution Units

Neuron Write Controller:
- Read the stage configuration
- Write data & weights to the neurons (the transfer order is decided according to the mode)
- Raise finish flags to the neurons
- Write each configuration field to the relevant controller
- Repeat the operation for a new stage (if available)

Neuron Read Controller:
- Pulls the results from the assigned neurons in a cyclic order, until all outputs are finished
- Sends the results to the calculation unit, with the appropriate bias (from the bias memory block) and a "finish" flag
- In fully connected mode, the module pulls results from multiple neurons (with a counter); otherwise, every read produces an output
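A simplified behavioral sketch of the read-controller loop (C++; the neuron interface and names are assumptions for illustration):

    #include <cstdint>
    #include <vector>
    #include <functional>

    // Poll the assigned neurons in cyclic order and forward each result,
    // paired with its bias, to the calculation unit. In fully connected
    // mode several partial results belong to one output, so the finish
    // flag is raised only after 'neuronsPerOutput' reads.
    void neuronReadController(std::vector<std::function<int64_t()>>& neurons,
                              const std::vector<int16_t>& bias,
                              size_t outputsToProduce,
                              size_t neuronsPerOutput,   // 1 unless fully connected
                              std::function<void(int64_t, int16_t, bool)> toCalcUnit)
    {
        size_t next = 0, produced = 0, pulled = 0;
        while (produced < outputsToProduce) {
            int64_t result = neurons[next]();            // pull one neuron result
            next = (next + 1) % neurons.size();          // cyclic order
            ++pulled;
            bool finish = (pulled == neuronsPerOutput);  // last partial sum?
            toCalcUnit(result, bias[produced], finish);
            if (finish) { pulled = 0; ++produced; }
        }
    }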

Page 12:

Execution Units

Calculation Unit:
- Sums all its inputs until receiving the finish flag
- Adds the bias to the previous result
- Puts the result as an input to an activation function (defined in a LUT)
- Passes the results (in order) to the results write controller

Results Write Controller:
- Writes results into the data memory block with an adjacent valid bit

[Block diagram. Neuron results (48 bit) and the bias (pixel size) enter the summation; the 48-bit sum passes through a rounding block down to pixel size (8/16 bit), then through the activation function Fx (LUT), and on to the results write controller.]
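A behavioral sketch of the calculation unit's datapath (C++; the rounding shift and LUT addressing shown here are illustrative assumptions, the real widths come from the configuration):

    #include <cstdint>
    #include <array>

    // Behavioral model of the calculation unit: accumulate 48-bit neuron
    // results until the finish flag, add the bias, round the wide sum down
    // to pixel size, and look up the activation function in a LUT.
    class CalcUnit {
    public:
        CalcUnit(const std::array<int16_t, 256>& lut, int roundShift)
            : lut_(lut), roundShift_(roundShift) {}

        // Returns true and writes 'pixelOut' when an output is produced.
        bool push(int64_t neuronResult, int16_t bias, bool finish, int16_t& pixelOut) {
            acc_ += neuronResult;                     // sum partial results
            if (!finish)
                return false;
            int64_t sum = acc_ + bias;                // add bias
            acc_ = 0;
            int64_t rounded = sum >> roundShift_;     // rounding block (48b -> pixel size)
            uint8_t index = static_cast<uint8_t>(rounded & 0xFF);
            pixelOut = lut_[index];                   // activation via LUT
            return true;                              // goes to results write controller
        }

    private:
        const std::array<int16_t, 256>& lut_;
        int roundShift_;
        int64_t acc_ = 0;
    };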

Page 13:

Memory Blocks

Priorities:
- Data memory read priority is higher than write priority
- When the read FIFO is full, it sends a "force write" signal

Page 14:

Register Bank

Page 15:

Coprocessor Configuration
The configuration block is the processor's way of managing the coprocessor.

Therefore, all of its data (input) is transferred by the ARM.

The configuration block is built from fields.

The EU (neuron write controller) reads the configuration and transfers every field to its relevant FSMs.

For example, the neuron write controller needs the kernel dimension to know when to raise the "finish flag" (the neuron then stops accumulating and produces an output).
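As an illustration only (the actual field layout is defined in the hardware architecture document; apart from the kernel dimension, the field names below are assumptions based on the stage descriptions elsewhere in this talk), a configuration block for a convolution stage could be modeled like this:

    #include <cstdint>

    // Hypothetical software view of one stage's configuration block.
    struct StageConfig {
        uint8_t  stageType;      // CONV / SUBS / FC
        uint16_t inputWidth;
        uint16_t inputHeight;
        uint16_t outputWidth;
        uint16_t outputHeight;
        uint8_t  kernelDim;      // finish flag raised after kernelDim*kernelDim MACs
        uint16_t biasAddr;       // address in the bias memory block
        uint16_t weightsAddr;    // address in the weights memory block
        uint16_t outputAddr;     // address in the data memory block
    };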

Page 16:

Configuration Methods

Possible configurations – to achieve good performance & throughput for a specific stage:
- Allocate a large number of neurons to a single execution unit
- Divide each stage into several parallel execution units
- Use multiple EUs to run the algorithm on several inputs in parallel

Page 17:

Processor - Coprocessor

Roles of the CPU:
- Allocation of neurons to the execution units
- Transfer of data and weights to the FPGA's RAM (through DMA)
- Configuration flow per execution unit
- Starting the execution units

Page 18:

Software API & Driver
Running on the ARM processor and an embedded Linux OS
Low level - drivers for the Xilinx CDMA IP and for AXI register access
Mid level - API functions for the coprocessor:
- Add a new algorithm
- Add stages to an algorithm
- Switch between different configurations
- Run algorithm stages (on a single input or on multiple inputs in parallel)

High level – Specific application per algorithm

Page 19:

General Structs

sDataBlock: Holds a block of data in the DDR. Used for images, weights, bias and configuration blocks.
Fields:
- Size of the block in bytes
- Pointer to the start of the data in DDR
- Valid flag on the data

sSlotArray: A slot array is used as a double buffer for images or weights, in order to allow "online" writing/reading to/from the coprocessor memory and thus to increase throughput. For example:
- When images are loaded or read during the operation of the coprocessor
- When weights are too big to fit in the device memory
Fields:
- List of addresses on the AXI bus for the slots in the slot array
- List of data blocks to be written, or data blocks that were read
- ID of the execution unit that the slot array is assigned to
- Number of slots to be written in parallel (when the slot array is advanced)
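A possible C++ rendering of these two structs, directly following the listed fields (the type choices are assumptions):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // sDataBlock - a block of data in DDR (image, weights, bias or configuration).
    struct sDataBlock {
        size_t   size;      // size of the block in bytes
        uint8_t* data;      // pointer to start of data in DDR
        bool     valid;     // valid flag on the data
    };

    // sSlotArray - double buffer of slots in the coprocessor memory.
    struct sSlotArray {
        std::vector<uint32_t>   slotAddresses;   // AXI bus addresses of the slots
        std::vector<sDataBlock> blocks;          // blocks to write, or blocks read back
        int                     execUnitId;      // EU the slot array is assigned to
        int                     parallelSlots;   // slots written/advanced in parallel
    };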

Page 20:

sExecUnitConfig: A struct that includes information for the algorithm run function about what operations to do after an interrupt is received from a specific EU.
Fields:
- A wait vector of EU IDs (the operation will not continue until all of them finish)
- Boolean determining whether to start the EU again
- EU configuration address
- Neuron configuration for the EU
- Slot operations to perform (enum)
- A list of other EUs that should be configured
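Again as a sketch only (the field types and the enum values are assumptions based on the description above):

    #include <cstdint>
    #include <vector>

    // Assumed set of slot operations; the real enum lives in the API headers.
    enum class eSlotOp { NONE, ADVANCE_WRITE, ADVANCE_READ, ADVANCE_BOTH };

    // sExecUnitConfig - what to do when an interrupt arrives from a specific EU.
    struct sExecUnitConfig {
        std::vector<int> waitForEUs;        // do not continue until all have finished
        bool             restartEU;         // start this EU again?
        uint32_t         configAddress;     // EU configuration address
        std::vector<int> neuronAllocation;  // neuron configuration for the EU
        eSlotOp          slotOps;           // slot operations to perform
        std::vector<int> eusToConfigure;    // other EUs that should be configured
    };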

Page 21:

cNeuronBank

A static class that implements the main API functions.

Key Members:
- A list of algorithms (cAlgorithm) that were added by the user [sorted by name]
- A list of activation functions [sorted by name]
- A list of configuration methods [sorted by name]
- Four cMemHandler instances for the different memory blocks in the FPGA
- Configuration address & neuron allocation for each execution unit
- Pointer to the currently loaded algorithm
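A skeleton of the class as described (only the names given on these slides come from the design; the member and parameter types are assumptions):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>
    #include <functional>

    class cAlgorithm;    // described on a later slide
    class cMemHandler;   // described on a later slide

    // Skeleton of cNeuronBank, the static entry point of the API.
    class cNeuronBank {
    public:
        static cAlgorithm* addAlgorithm(const std::string& name,
                                        const std::string& configMethod,
                                        const std::string& activationFunc,
                                        int inputImageSize);
        static void configAlgorithm(const std::string& name);
        static void receivedIRQ();   // reads the interrupt cause (EU ID) from the registers

    private:
        static std::map<std::string, cAlgorithm*> algorithms;         // sorted by name
        static std::map<std::string, std::function<double(double)>> activationFuncs;
        static std::map<std::string, std::function<void(cAlgorithm&)>> configMethods;
        static cMemHandler* memHandlers[4];            // data, weights, bias, configuration
        static std::vector<uint32_t> euConfigAddr;     // configuration address per EU
        static std::vector<std::vector<int>> euNeurons;// neuron allocation per EU
        static cAlgorithm* currentAlgorithm;           // currently loaded algorithm
    };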

Page 22:

addAlgorithm: Adds a new deep neural network algorithm.
Parameters - algorithm name, chosen configuration method & activation function, input image size.
Return value - pointer to the created cAlgorithm object.

To add additional (private) configuration methods and activation functions: addPrivateMethod, addLUT

configAlgorithm: Activates the method pointed to by the chosen algorithm's "config function pointer" (for example, cascadeMethod). Readies the coprocessor and API for an algorithm run.
Parameters - algorithm name.

receivedIRQ: Called when an interrupt is received from the coprocessor. Checks the interrupt cause (execution unit ID) and calls the run method of the currently loaded algorithm.

Execution unit handling : setExecUnitConfiguration, setExeutionUnitNeurons, startExecUnit

Page 23:

cAlgorithm
This class holds the algorithm parameters and configuration data.

Key Members:
- A list of stages (cStage) that were added by the user
- Input image size
- Pointer to the LUT and configuration function chosen for this method
- Execution unit configuration lists for IRQ handling

addStage: Adds a new stage to the algorithm.
Parameters - the stage type (CONV/SUBS/FC/EUCLIDEAN), stage dimension (kernel size for CONV/SUBS, output size for FC/EUCLIDEAN), run stage on COP/NEON.
Return value - pointer to the created cStage object.
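A hypothetical usage sequence for building LeNet-5 through this API (the call order follows the slides; the exact argument values and enum spellings are assumptions, and weight/bias loading via addOfm, addIFM and addWeights is omitted):

    // Hypothetical host-side flow, assuming the API described on these slides.
    void buildLenet5() {
        // Add the algorithm: name, configuration method, activation function, input size.
        cAlgorithm* lenet = cNeuronBank::addAlgorithm("lenet5", "cascade", "tanh", 32);

        // Add the LeNet-5 stages: type, dimension, and where to run (COP or NEON).
        lenet->addStage(CONV, 5,  COP);        // C1: 5x5 convolution
        lenet->addStage(SUBS, 2,  COP);        // S2: 2x2 sub-sampling
        lenet->addStage(CONV, 5,  COP);        // C3
        lenet->addStage(SUBS, 2,  COP);        // S4
        lenet->addStage(CONV, 5,  COP);        // C5
        lenet->addStage(FC,   84, COP);        // F6: fully connected, 84 outputs
        lenet->addStage(EUCLIDEAN, 10, NEON);  // output layer (runs on NEON only)

        // Configure the coprocessor for this algorithm; from here on,
        // execution is driven by interrupts (receivedIRQ -> run).
        cNeuronBank::configAlgorithm("lenet5");
    }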

Page 24:

For LUT configuration : setLUTAddr, getLUTConfig

cascadeMethod (or singleMethod, splitMethod, or a private method): Called by configAlgorithm when the chosen config method is "cascade".
- Calculates the assignment of the algorithm stages to the different execution units
- Writes weights and bias to the appropriate memory handlers
- Creates write and read slots for the input and output images
- Generates the configuration blocks from the stage objects
- Writes the configuration blocks to the memory handler
- Creates IRQ handling lists for each EU (used in the run method)

run: Called by receivedIRQ for the currently loaded algorithm. Goes over the appropriate configuration lists for the execution units and accordingly changes the neuron assignment, changes config addresses, advances read and write slots, and activates EUs.
Parameters - vector of execution unit IDs for the received interrupt.

Page 25:

cStage
Abstract class that contains the basic stage data.
Key Members:
- Stage type
- Run on FPGA/NEON (enum)
- Next configuration address

runOnNeon: Each derived class implements it according to its needs. Runs the relevant NEON functions for the stage on the data.
Parameters - input data picture/FM for running the stage.
Return value - result data block (address & size).

getStageConfig: Creates the configuration data block. The configuration is made from the derived class's data (the class that implements this virtual method). See table for the configuration block structure.
Return value - configuration data block (address & size).
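A skeleton of the stage base class as described here and on the following slides (sDataBlock as sketched earlier; the enum names and exact signatures are assumptions):

    #include <cstdint>

    enum class eStageType { CONV, SUBS, FC, EUCLIDEAN };
    enum class eRunTarget { FPGA, NEON };

    // Abstract base class for all stage types (cStageFM, cStageFC, cStageEuclidean).
    class cStage {
    public:
        virtual ~cStage() = default;

        // Runs the stage on the NEON unit; each derived class implements it.
        virtual sDataBlock runOnNeon(const sDataBlock& input) = 0;

        // Builds the configuration data block from the derived class's data.
        virtual sDataBlock getStageConfig() = 0;

    protected:
        eStageType type;
        eRunTarget runOn;           // run on the FPGA coprocessor or on NEON
        uint32_t   nextConfigAddr;  // next configuration address
    };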

Page 26:

cStageFM (inherits cStage)
Data structure for the sub-sampling and convolution stages. Contains all the necessary data for creating the configuration data block.
Key Members:
- Input width & height
- Output width & height
- Kernel dimension
- A list of relevant output feature maps [cOfm] for this stage

addOfm: Creates a new output feature map object and associates it with the current stage.
Parameters - output feature map ID, bias block (address & size).
Return value - a pointer to the new cOfm object.

getStageConfig: Creates a configuration data block for the method's input variables (output feature maps) only. In this case one stage can contain more than one configuration.
Parameters - IDs of all the output feature maps for this configuration.
Return value - configuration data block (address & size).

Page 27:

cOfm

Data structure for output feature map data.
Key Members:
- DDR addresses - for the bias and for all input feature map weights
- BRAM addresses - for the stage output, bias, and input feature map weights & data

addIFM: Adds an input feature map (ID & weights) to the output feature map class. For a sub-sampling stage, only one IFM is allowed.
Parameters - input feature map ID, weights block (DDR address & size).

Page 28:

cStageFC (inherits cStage)
Data structure for the fully connected stage. Contains all the necessary data for creating the configuration data block.
Key Members:
- DDR addresses - for bias and weights
- BRAM addresses - for stage output and input, bias and weights
- Input & output size

addWeights – Receives weights and bias blocks.

cStageEuclidean (inherits cStage)
Data structure for the Euclidean stage. Contains all the necessary data for creating the configuration data block. Runs on NEON only.
Key Members: input & output size, weights DDR address.

Page 29:

cMemHandler

Data structure for handling all the FPGA memory usage (BAC, BRAMs).
Our implementation contains 4 cMemHandler objects: the Data, Weights, Bias & Configuration memory blocks.

Key Members:
- Number of memory units for this object, in bytes (all of the object's units are the same size)
- Start address for this object in the BAC
- Map structure for all the slot arrays that are used to return results (read back by the ARM)
- Map structure for all the slot arrays that are used by the ARM to write new data
- Next 'empty' BRAM address to write to, for each inner memory unit; this structure is used both for data blocks & slots, before the algorithm starts running

Page 30:

createSlot: Creates a new slot, and a slot array if needed.
Parameters - slot array ID, memory unit ID, execution unit ID, slot size, write/read slot flag (T/F), number of slots to advance in parallel at once.
Return value - the new slot's BRAM address.
createSlotDirect: Called from the second run of the algorithm onwards; does the same as createSlot, but the old data from the previous run already exists.

writeBlock: Writes a data block to a specific memory unit and returns its BRAM address.
writeBlockDirect: Writes data (weights/picture/NEON output) to a BRAM address. Called from the second run of the algorithm onwards (same as writeBlock, but the old data from the previous run already exists).
Parameters - BRAM address & DDR address.

writeDataToSlot: Writes data (weights/picture/NEON output) to a slot.
Parameters - slot array ID, data block (address & size), useOnlyOnce - a flag identifying whether this data will be written once (e.g. a picture) or more than once (e.g. weights); it is saved in the slot array and influences the advanceWriteSlots behaviour.
readDataFromSlot: Reads data from a specific block (and returns its DDR address).

advanceWriteSlots / advanceReadSlots: Advances nextSlotId in the sSlotArray defined by the method input. Called after every write to / read from a slot.
Parameters - vector of EU IDs, defining which sSlotArray(s) to advance (can be more than one).
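A hypothetical sequence showing how these calls fit together for a double-buffered image input (the IDs, sizes and memory handler instance are illustrative assumptions):

    #include <vector>

    // Hypothetical flow for streaming input images through a write slot array.
    void streamImages(cMemHandler& dataMem, const std::vector<sDataBlock>& images)
    {
        // Two slots in the data memory block, assigned to execution unit 0,
        // acting as a double buffer for the input images.
        const int slotArrayId = 0, memUnitId = 0, euId = 0, slotSize = 32 * 32;
        dataMem.createSlot(slotArrayId, memUnitId, euId, slotSize, /*write=*/true, /*parallel=*/1);
        dataMem.createSlot(slotArrayId, memUnitId, euId, slotSize, /*write=*/true, /*parallel=*/1);

        for (const sDataBlock& img : images) {
            // A picture is only used once, so useOnlyOnce is true.
            dataMem.writeDataToSlot(slotArrayId, img, /*useOnlyOnce=*/true);

            // Move to the other slot while the coprocessor consumes this one.
            dataMem.advanceWriteSlots({euId});
        }
    }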

Page 31:

Upper Bounds

Page 32:

Next task - Functional Simulation
- Simulation at the Execution Unit level
- Partial VHDL implementation – controllers only
- Read from files & write to files (no RAM)

Page 33:

Project A - Gantt