
DEIS

University of Bologna

Design methodology for multi processor systems design on regular platforms

Ph.D. in Electronics, Computer Science and Telecommunications

Ph.D. Student:

Davide Rossi

Ph.D. Tutor:

Prof. Roberto Guerrieri


Outline

Motivations

Approach overview

Software layers

Programming model

Compilation flow

SystemC simulator

Architecture

Cluster architecture

Tile architecture

Preliminary implementation results


Motivations

Embedded application requirements:

Time to market

Energy efficiency

Performance

Flexibility

Programmability

SoC trends:

Multi-core

Hierarchical memory architecture

Efficient connectivity

On-demand accelerator engines (*source: ITRS 2009)


Main Goals

Multi-/many-processor

Spatial computation vs. Sequential computation

Thread level parallelism

Homogeneous architecture

Distributed ASIC acceleration

Heterogeneous set of accelerators

Instruction level parallelism

Data level parallelism

Regularity

Architectural level: tile-based approach

Accelerators: implemented in a metal-programmable technology (customization with 9/36 masks, CMOS65LP)

Homogeneous Programming Model and Tool-Chain

High-level application exploration and partitioning

Kernel extraction

Development of hardware accelerators for the most critical kernel portions


Programming model

Based on the OpenCL 1.0 specification

Sequential code is executed on a host

Parallel kernels executed on the multi-core device are declared with the __kernel specifier

Data parallel programming model

One kernel can run as an NDRange space of simultaneous work-groups

A maximum of 8 work-items, batched into a work-group, run on the cluster

Task parallel programming model

There is no implicit declaration of an index space

Data structures are implemented by the device

Tasks and kernels are synchronized through events and barriers

[Figure: example of an NDRange index space]
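
As a concrete illustration of the data-parallel model sketched above, the fragment below is a minimal OpenCL C kernel; the kernel name and arguments are illustrative and not taken from the source. Each work-item of the NDRange index space processes one element:

    // Hypothetical data-parallel kernel: one work-item per element of the
    // NDRange index space (name and signature are illustrative only).
    __kernel void vector_scale(__global const float *in,
                               __global float *out,
                               const float gain)
    {
        size_t i = get_global_id(0);   // this work-item's position in the NDRange
        out[i] = gain * in[i];
    }

On the cluster described here, at most 8 such work-items are batched into one work-group and executed together.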


Parallel Memory Sharing

Private Memory: per work-item

Accessible only by the owning work-item

Local Memory: per work-group

Shared by threads of the same work-group

Inter-work-item communication

Global Memory: per-application

Shared by all work-groups

Inter-kernel communication

Data can be allocated in each memory space using the __private, __local and __global address space qualifiers (see the kernel sketch below)

[Figure: OpenCL memory hierarchy: private memory per work-item, shared/local memory per work-group, and global memory shared by NDRange kernels executed sequentially in time]
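
To make the address space qualifiers concrete, the following minimal OpenCL C sketch stages data from __global into __local memory shared by a work-group and synchronizes the work-items with a barrier; the kernel name, work-group size and reduction are illustrative, not taken from the source:

    // Hypothetical kernel illustrating the __global, __local and __private
    // address spaces; WG_SIZE is an assumed work-group size (e.g. 8 here).
    #define WG_SIZE 8

    __kernel void sum_tile(__global const int *in, __global int *out)
    {
        __local int tile[WG_SIZE];        // per-work-group local memory
        int lid = get_local_id(0);        // plain variables live in
        int gid = get_global_id(0);       // per-work-item private memory

        tile[lid] = in[gid];              // global -> local staging
        barrier(CLK_LOCAL_MEM_FENCE);     // inter-work-item synchronization

        if (lid == 0) {                   // one work-item reduces the tile
            int acc = 0;
            for (int i = 0; i < WG_SIZE; ++i)
                acc += tile[i];
            out[get_group_id(0)] = acc;   // result back to global memory
        }
    }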


Tool-chain flow overview

1) High-level partitioning:

Evaluation of task level parallelism using OpenCL

Profiling without hardware acceleration using the TLM simulator

2) Kernel extraction and implementation

Extraction of data level and instruction level parallelism

Thread-level profiling using the Griffy tools

Application-level profiling using the SystemC simulator + Griffy emulation library


Compilation flow

The OpenCL input file is processed and split into HOST and DEVICE C code:

HOST:

Contains the sequential parts of an application

Handles off-chip memory allocation

Configures, calls and synchronizes parallel and hardware accelerated kernels

DEVICE:

Executes threads in parallel

Each thread handles its own data-chunk transfers (DMA LIB)

Each thread configures and launches hardware accelerated functions (PGA LIB)

The HOST and DEVICE C code are then compiled with the processor-specific compiler and linked with the runtime libraries
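
To give an idea of what the generated DEVICE code might look like, below is a hypothetical C sketch of one kernel thread that stages its data chunk through the DMA library and triggers a hardware-accelerated function through the PGA library. The dma_* and pga_* function names and the job_t type are placeholders invented for illustration; the actual DMA LIB / PGA LIB interfaces are not documented in this presentation.

    /* Hypothetical DEVICE-side thread body; all dma_* / pga_* calls and the
     * job_t type are placeholder names, NOT the real DMA LIB / PGA LIB APIs. */
    typedef struct {
        const char *src;     /* source buffer in off-chip/global memory      */
        char       *dst;     /* destination buffer in global memory          */
        int width, height;   /* 2-D transfer geometry                        */
        int stride;          /* line stride of the 2-D addressing pattern    */
        int chunk_size;      /* bytes handled by one work-item               */
        int accelerator_id;  /* which hardware accelerator to use            */
    } job_t;

    extern char local_in[], local_out[];               /* buffers in the TCM */
    extern void dma_transfer_2d(const char *src, char *dst,
                                int w, int h, int stride);
    extern void dma_wait(void);
    extern void pga_configure(int id, const char *in, char *out);
    extern void pga_run(void);

    void kernel_thread(int work_item_id, const job_t *job)
    {
        /* 1. Stage this work-item's data chunk into the local TCM (DMA LIB). */
        dma_transfer_2d(job->src + work_item_id * job->chunk_size,
                        local_in, job->width, job->height, job->stride);
        dma_wait();

        /* 2. Configure and launch the hardware-accelerated function (PGA LIB). */
        pga_configure(job->accelerator_id, local_in, local_out);
        pga_run();

        /* 3. Write the result back to shared/global memory (DMA LIB). */
        dma_transfer_2d(local_out, job->dst + work_item_id * job->chunk_size,
                        job->width, job->height, job->stride);
        dma_wait();
    }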


Accelerator Design Flow

Accelerator description:

Single-assignment DFG description with a simplified C syntax (an illustrative fragment is sketched after this list)

The compilation flow automatically extracts information about

Routing-only operations

Pipeline stages

Outputs of the flow are:

Cycle-accurate emulation libraries for integration with the stand-alone cycle-accurate simulator

Functional emulation libraries for integration with the SystemC TLM simulator

RTL implementation of the pipelined accelerators

GDS-II macro and all FE and BE views for integration in Synopsys and Cadence place & route tools
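
The fragment below is a hypothetical example of the single-assignment, simplified-C style: every intermediate value is assigned exactly once, so the description maps directly onto data-flow graph nodes and pipeline stages. The exact Griffy syntax, pragmas and data types are not shown in the source, so this is only an illustration of the idea:

    /* Illustrative single-assignment DFG fragment (NOT actual Griffy syntax):
     * each intermediate is written exactly once, exposing the data-flow graph
     * and the pipeline stages directly.                                      */
    int sad4(int a0, int a1, int b0, int b1)
    {
        int d0   = a0 - b0;              /* stage 1: differences    */
        int d1   = a1 - b1;
        int abs0 = (d0 < 0) ? -d0 : d0;  /* stage 2: absolute values */
        int abs1 = (d1 < 0) ? -d1 : d1;
        int sad  = abs0 + abs1;          /* stage 3: accumulation   */
        return sad;                      /* routing-only: forwards the sum */
    }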


SystemC Simulator

Integrates:

N computational tiles instances (processor ISS, DMA, memories, buffers, registers, accelerators, local interconnect)

1 IO tile instance (processor, memories, registers)

Shared and device memories

Interconnect TLM models

Memory transfers are Transaction Accurate

ISS read/write requests are converted into packets flowing through the routers

Each time a packet crosses a router, a delay unit is introduced (see the latency sketch after this list)

Target configuration files contain

Memory map information

Processor configuration information

Links to the shared libraries containing the accelerator emulation functions
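
As a hedged sketch of the hop-based timing model described above (the actual TLM code is not shown in the source, and the per-router delay below is a placeholder, not a value from the presentation), the latency charged to a transaction simply grows with the number of routers crossed:

    /* Minimal sketch of the hop-based delay model: each router crossed adds
     * one delay unit. ROUTER_DELAY_CYCLES is a placeholder value.           */
    #define ROUTER_DELAY_CYCLES 1

    /* Latency (in cycles) of a memory transaction that traverses `hops`
     * routers, on top of the memory access time itself.                     */
    static inline unsigned noc_latency(unsigned hops, unsigned mem_cycles)
    {
        return mem_cycles + hops * ROUTER_DELAY_CYCLES;
    }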


Cluster Architecture

The NoC connects the computational resources with a multi-bank shared memory

The bus connects the shared memory with the IO

The cluster manager is responsible for:

Synchronization of work-items within a work-group

NoC configuration


CT Architecture (1)

GP processor with private data and program TCMs (16K each)

Dedicated interfaces for sub-system management:

MFU: DMA transfers

EFU: access to the hardware accelerators

EXT: access to the external memory space

Stream channel:

Operation is independent from the CPU

Separate read/write physical channels

Multiple logical channels

2-D addressing patterns

Features AUDIO and VIDEO addressing modes


CT Architecture (2)

Synchronization mechanism:

Local registers

Local configuration bus

Hardware accelerators:

Feature four 1024 × 32-bit buffers

Address generators:

2D addressing patterns (step/stride)

Circular addressing

Each accelerated operation is triggered by the processor
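
As a hedged illustration of the 2D (step/stride) and circular addressing patterns produced by the address generators (the real register layout and parameter widths are not given in the source), a minimal C model could look like this:

    /* Minimal C model of the two addressing patterns mentioned above; field
     * names and widths are illustrative, not the actual hardware registers. */
    typedef struct {
        unsigned base;    /* start address                              */
        unsigned step;    /* increment between consecutive elements     */
        unsigned stride;  /* increment between consecutive lines (2-D)  */
        unsigned length;  /* buffer length for circular addressing      */
    } agen_t;

    /* 2-D step/stride pattern: element j of line i. */
    static unsigned addr_2d(const agen_t *g, unsigned i, unsigned j)
    {
        return g->base + i * g->stride + j * g->step;
    }

    /* Circular pattern: the n-th access wraps around a buffer of `length`. */
    static unsigned addr_circular(const agen_t *g, unsigned n)
    {
        return g->base + (n * g->step) % g->length;
    }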


CT synthesis results

Technology: CMOS65LP, HVT/SVT, 1.2V

CT area: 1.25 mm²

Max frequency (wc, 125 °C, 1.1 V): 250 MHz

SVT/HVT ratio: 0.11


Collaborations

This Ph.D. is carried out in collaboration with STMicroelectronics

Collaborations in 3 European projects:

MORPHEUS (FP6)

MODERN (ENIAC)

THERMINATOR (FP7)


Publications

N. Voros et al., "Dynamic System Reconfiguration in Heterogeneous Platforms", Chapter 5: "The DREAM Digital Signal Processor", Chapter 8: "The MORPHEUS Data Communication and Storage Infrastructure", Springer, 2009.

D. Rossi et al., "A Heterogeneous Digital Signal Processor Implementation for Dynamically Reconfigurable Computing", CICC (Custom Integrated Circuits Conference), 2009.

D. Rossi et al., "A Multi-Core Signal Processor for Heterogeneous Reconfigurable Computing", International Symposium on System-on-Chip, Proceedings, 2009.

F. Campi et al., "RTL-to-Layout Implementation of an Embedded Coarse Grained Architecture for Dynamically Reconfigurable Computing in Systems-on-Chip", Proceedings, 2009.

D. Rossi et al., "A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing", IEEE Journal of Solid-State Circuits (JSSC), 2010.


Thank you for your attention