DEIS
University of Bologna
Design methodology for multi processor systems design on regular platforms
Ph.D in Electronics, Computer Science and Telecommunications
Ph.D Student:
Davide Rossi
Ph.D Tutor:
Prof. Roberto Guerrieri
Outline
Motivations
Approach overview
Software layers
Programming model
Compilation flow
System-C simulator
Architecture
Cluster architecture
Tile architecture
Preliminary implementation results
Motivations
Embedded Applications requirements:
Time to market
Energy efficiency
Performance
Flexibility
Programmability
SoC trends:
Multi-core
Hierarchical memory architecture
Efficient connectivity
On-demand accelerator engines
*Source: ITRS 2009
Main Goals
Multi-/Many-processor
Spatial computation vs. Sequential computation
Thread level parallelism
Homogeneous architecture
Distributed ASIC acceleration
Heterogeneous set of accelerators
Instruction level parallelism
Data level parallelism
Regularity
Architectural level: tile-based approach
Accelerators: implemented on a metal programmable technology (Customization with 9/36 masks CMOS65LP)
Homogeneous Programming Model and Tool-Chain
High level Application exploration and partitioning
Kernel extraction
Development of hardware accelerators for most critical kernel portions
Programming model
Based on the OpenCL 1.0 specifications
Sequential code is executed on a host
Parallel kernels executed on the multi core device are declared with the __kernel specifier
Data parallel programming model
One kernel can run as an NDRange space of simultaneous work-groups
A maximum of 8 work-items, batched into a work-group, run on the cluster
Task parallel programming model
There is no implicit declaration of index space
Data structures are implemented by the device
Tasks and Kernels are synchronized through events and barriers
Example of NDRange index space
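The relationship between the NDRange index space, work-groups and work-items can be sketched in plain C for the 1-D case. This is only an illustration of the standard OpenCL index arithmetic with a zero global offset, not code from the tool-chain; the names `num_work_groups`, `locate_work_item` and `MAX_WORK_ITEMS_PER_GROUP` are hypothetical, and the value 8 is taken from the work-group limit quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8 is the per-work-group limit quoted on the slide. */
#define MAX_WORK_ITEMS_PER_GROUP 8

typedef struct {
    size_t group_id;  /* index of the work-group in the NDRange */
    size_t local_id;  /* index of the work-item inside its work-group */
} work_item_pos;

/* Number of work-groups covering a 1-D NDRange. OpenCL 1.0 requires the
   global size to be a multiple of the work-group (local) size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && local_size <= MAX_WORK_ITEMS_PER_GROUP);
    assert(global_size % local_size == 0);
    return global_size / local_size;
}

/* Split a 1-D global id into (group id, local id):
   global_id = group_id * local_size + local_id. */
work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    work_item_pos p;
    assert(local_size > 0);
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}
```

With 64 work-items and work-groups of 8, the NDRange contains 8 work-groups, and global id 19 is work-item 3 of work-group 2.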
Parallel Memory Sharing
Private Memory: per work-item
Private per work-item
Local Memory: per work-group
Shared by threads of the same work group
Inter-work item communication
Global Memory: per-application
Shared by all work-groups
Inter-kernel communication
Data can be allocated in each memory using the __private, __local and __global address space qualifiers
[Figure: memory hierarchy. Each work-item has its private memory, each work-group has a shared memory, and the global memory is shared by NDRange kernel 1 and NDRange kernel 2, which run as sequential kernels in time.]
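The three memory levels can be illustrated with a sequential C emulation of one work-group computing a partial sum. This is a sketch under stated assumptions, not the device code: the function `group_sum` and the constant `GROUP_SIZE` are hypothetical, the work-items run one after another instead of in parallel, and ordinary C variables stand in for the `__private`, `__local` and `__global` address spaces.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8  /* work-items per work-group (limit quoted on the slides) */

/* Emulates one work-group. Each work-item sums its own slice of the input
   in a private accumulator and publishes the result to local memory; after
   the barrier, the group total is written to global memory. */
void group_sum(const int *global_in, size_t items_per_work_item,
               size_t group_id, int *global_out)
{
    int local_scratch[GROUP_SIZE];   /* stands in for a __local buffer */
    int total = 0;
    size_t lid, i;

    assert(global_in != NULL && global_out != NULL);

    for (lid = 0; lid < GROUP_SIZE; ++lid) {
        int private_acc = 0;         /* stands in for a __private variable */
        size_t base = (group_id * GROUP_SIZE + lid) * items_per_work_item;
        for (i = 0; i < items_per_work_item; ++i)
            private_acc += global_in[base + i];
        local_scratch[lid] = private_acc;
    }

    /* On a real device, barrier(CLK_LOCAL_MEM_FENCE) would be needed here
       so that every work-item's partial sum is visible to the group. */
    for (lid = 0; lid < GROUP_SIZE; ++lid)
        total += local_scratch[lid];

    global_out[group_id] = total;    /* stands in for a __global buffer */
}
```

Calling `group_sum` once per work-group leaves one partial sum per group in the output array, which a later kernel or the host can combine through global memory.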
Tool-chain flow overview
1) High level partitioning:
Evaluation of task level parallelism using OpenCL
Profiling without hardware acceleration using TLM simulator
2) Kernel extraction and implementation
Extraction of data level and instruction level parallelism
Thread level profiling using Griffy tools
Application-level profiling using the System-C simulator and the Griffy emulation library
Compilation flow
The OpenCL input file is processed and split into HOST and DEVICE C code:
HOST:
Contains the sequential parts of an application
Handles off-chip memory allocation
Configures, calls and synchronizes parallel and hardware accelerated kernels
DEVICE:
Executes threads in parallel
Each thread handles its own data chunk transfers (DMA LIB)
Each thread configures and launches hardware accelerated functions (PGA LIB)
The HOST and DEVICE C code is then compiled with the processor-specific compiler and linked with the runtime libraries
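As an illustration of how each device thread might determine the data chunk it has to transfer, the following C sketch splits a buffer as evenly as possible among the threads. The DMA LIB interface is not shown in the slides, so the sketch stops at computing the chunk; the names `chunk` and `thread_chunk` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t offset;  /* first element handled by the thread */
    size_t length;  /* number of elements handled by the thread */
} chunk;

/* Split `total` elements among `num_threads` threads. The first
   (total % num_threads) threads receive one extra element, so chunk
   lengths differ by at most one and the chunks cover the buffer exactly. */
chunk thread_chunk(size_t total, size_t num_threads, size_t tid)
{
    chunk c;
    size_t base, rem;

    assert(num_threads > 0 && tid < num_threads);
    base = total / num_threads;
    rem = total % num_threads;

    c.length = base + (tid < rem ? 1 : 0);
    c.offset = tid * base + (tid < rem ? tid : rem);
    return c;
}
```

Each thread would pass its own `offset` and `length` to the transfer routine, so no two threads move the same data.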
Accelerators Design Flow
Accelerator description
Single assignment DFG description with simplified C syntax
The compilation flow automatically extracts information about
Routing-only operations
Pipeline stages
Outputs of the flow are:
Cycle-accurate emulation libraries for integration with the stand-alone cycle-accurate simulator
Functional emulation libraries for integration with the SystemC TLM simulator
RTL implementation of the pipelined accelerators
GDS-II macro and all FE and BE views for integration in Synopsys and Cadence place & route tools
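To give an idea of what a single-assignment data-flow description looks like, here is a small kernel written in ordinary C. It is not the actual accelerator description syntax, which the slides do not show; the function `blend_pixel` is hypothetical. Every intermediate value is assigned exactly once (enforced here with `const`), so each line corresponds to one node of the data-flow graph.

```c
#include <assert.h>

/* Weighted blend of two 8-bit pixels, with weight w in the range 0..256.
   Each const variable is one DFG node; no value is ever reassigned. */
unsigned blend_pixel(unsigned a, unsigned b, unsigned w)
{
    assert(a <= 255u && b <= 255u && w <= 256u);

    const unsigned inv_w = 256u - w;   /* node 1: subtract */
    const unsigned wa    = a * w;      /* node 2: multiply */
    const unsigned wb    = b * inv_w;  /* node 3: multiply */
    const unsigned sum   = wa + wb;    /* node 4: add */
    const unsigned out   = sum >> 8;   /* node 5: shift by a constant */
    return out;
}
```

In a description of this form, a shift by a constant only rewires bits, which makes it a plausible candidate for the routing-only category, and the dependency chain between nodes is what a compilation flow would use to place pipeline stages.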
System-C Simulator
Integrates:
N computational tile instances (processor ISS, DMA, memories, buffers, registers, accelerators, local interconnect)
1 IO tile instance (processor, memories, registers)
Shared and device memories
Interconnect TLM models
Memory transfers are Transaction Accurate
ISS read/write requests are converted into packets flowing through routers
Each time a packet crosses a router, one delay unit is added
Target configuration files contain
Memory map information
Processors configuration information
Links to shared libraries containing the accelerator emulation functions
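The transaction-accurate timing rule for memory transfers (one delay unit per router crossed) can be sketched as follows. This is a simplified C model, not the SystemC TLM code: the `packet` structure and the functions `cross_router` and `route` are hypothetical, and every router is assumed to add the same fixed delay.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned src_tile;   /* tile that issued the read/write request */
    unsigned dst_tile;   /* tile or memory that serves it */
    unsigned hops;       /* routers crossed so far */
    unsigned long time;  /* accumulated delay */
} packet;

/* Crossing one router adds one delay unit to the packet. */
void cross_router(packet *p, unsigned long delay_unit)
{
    assert(p != NULL);
    p->hops += 1;
    p->time += delay_unit;
}

/* Send the packet through `routers_on_path` routers and return the
   total accumulated delay. */
unsigned long route(packet *p, unsigned routers_on_path,
                    unsigned long delay_unit)
{
    unsigned i;
    assert(p != NULL);
    for (i = 0; i < routers_on_path; ++i)
        cross_router(p, delay_unit);
    return p->time;
}
```

Under this model the latency of a transfer depends only on the number of routers between source and destination, which is what makes it transaction accurate instead of cycle accurate.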
Cluster Architecture
The NoC connects the computational resources with a multi-bank shared memory
The bus connects the shared memory with the IO
Cluster manager is responsible for
Synchronization of work-items within a work-group
NoC Configuration
CT Architecture (1)
GP Processor
Private Data and Program TCM (16K each)
Dedicated interfaces for sub-system management
MFU: DMA transfers
EFU: access hardware accelerators
EXT: access external memory space
Stream channel
Operation is independent from the CPU
Separate read/write physical channels
Multiple logical channels
2-D addressing patterns
Features AUDIO and VIDEO addressing modes
CT Architecture (2)
Synchronization mechanism
Local registers
Local configuration bus
Hardware accelerators
Features 4 1024 x 32-bit buffers
Address generators:
2D addressing patterns (step/stride)
Circular addressing
Each accelerated operation is triggered by the processor
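The two address generation modes can be illustrated with a short C sketch. The exact semantics of step and stride in the hardware are not given in the slides, so the interpretation below is an assumption: `step` is the distance between consecutive elements of a row and `stride` is the distance between consecutive rows, both in 32-bit words. The function names are hypothetical, and `BUFFER_WORDS` reflects the 1024-word buffers mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER_WORDS 1024  /* words per accelerator buffer, from the slide */

/* 2-D pattern: word address of element (row, col), assuming `step` words
   between elements of a row and `stride` words between rows. */
size_t addr_2d(size_t base, size_t row, size_t col,
               size_t step, size_t stride)
{
    size_t addr = base + row * stride + col * step;
    assert(addr < BUFFER_WORDS);
    return addr;
}

/* Circular addressing: advance by `step` words and wrap around inside a
   circular region of `size` words. */
size_t addr_circular_next(size_t current, size_t step, size_t size)
{
    assert(size > 0 && size <= BUFFER_WORDS && current < size);
    return (current + step) % size;
}
```

For a 64-word-wide frame stored row by row, `addr_2d(0, 2, 3, 1, 64)` addresses word 131, and a circular pointer at word 1022 advanced by 4 in a 1024-word region wraps to word 2.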
CT synthesis results
Technology: CMOS65LP, HVT/SVT, 1.2V
CT Area: 1.25 mm²
Max frequency (worst case, 125°C, 1.1V): 250 MHz
SVT/HVT ratio: 0.11
Collaborations
The PhD is in collaboration with STMicroelectronics
Collaborations in 3 European projects:
MORPHEUS (FP6)
MODERN (ENIAC)
THERMINATOR (FP7)
Publications
N. Voros et al. “Dynamic System Reconfiguration in Heterogeneous Platforms”, Chapter 5: “The DREAM digital Signal Processor”, Chapter 8: “The MORPHEUS Data Communication and Storage Infrastructure”, Springer, 2009.
D. Rossi et al. “A Heterogeneous Digital Signal Processor Implementation for Dynamically Reconfigurable Computing”, CICC (Custom Integrated Circuits Conference), 2009.
D. Rossi et al. “A Multi-Core Signal Processor for Heterogeneous Reconfigurable Computing”, International Symposium on System-on-Chip, Proceedings, 2009.
F. Campi et al. “RTL-to-Layout Implementation of an Embedded Coarse Grained Architecture for Dynamically Reconfigurable Computing in Systems-on-Chip”, Proceedings, 2009.
D. Rossi et al. “A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing”, IEEE Journal of Solid-State Circuits (JSSC), 2010.