The Tile Processor™: General Purpose Manycore for ...

36
The Tile Processor™: General Purpose Manycore for Multimedia, Networking, and Wireless Anant Agarwal Tilera Corporation and CSAIL, MIT Or, Anatomy of an In-Production Manycore

Transcript of The Tile Processor™: General Purpose Manycore for ...

Page 1: The Tile Processor™: General Purpose Manycore for ...

The Tile Processor™: General Purpose Manycorefor Multimedia, Networking, and Wireless

Anant AgarwalTilera Corporationand CSAIL, MIT

Or, Anatomy of an In-Production Manycore

Page 2: The Tile Processor™: General Purpose Manycore for ...

2

Agenda

• Introduction• Tile Processor Overview• Manycore Challenges• Tile Processor Innovations to handle challenges• Multicore Development Environment and Programming• Experience with Application Development and Performance• The Future of Multicore• Summary

Page 3: The Tile Processor™: General Purpose Manycore for ...

3

Markets Demanding More PerformanceNetworking market- Demand for high performance

- Faster speeds 1Gbps » 2Gbps » 4Gpbs » 10 Gbps- Demand for more services

- In-line L4 – L7 services, intelligence everywhereDigital Multimedia market- Demand for high performance

- H.264 encoding for High Definition- Demand for more services

- VoD, video conferencing, transcoding, transratingWireless Networks- Demand for higher throughput in both baseband and

backhaul- More channels, more services, lower power

- Fast moving standards- LTE, more services

Cable & BroadcastCable & Broadcast

Video ConferencingVideo Conferencing

Surveillance DVRSurveillance DVR

SwitchesSwitches

Security AppliancesSecurity Appliances

RoutersRouters

… and with power efficiency and programming ease

Page 4: The Tile Processor™: General Purpose Manycore for ...

4

Industry Aggressively Embracing Multicore

Time

Perfo

rman

ce

20072006 2010

16Cores

nCores

Inherent architectural bottlenecks:• No scalability • Power inefficiency • Primitive programming model

DualCores

QuadCores

Page 5: The Tile Processor™: General Purpose Manycore for ...

5

Tiled Multicore Closes the Performance Gap

• Cores connected by mesh network• Unlike buses, meshes scale• Distributed resources improve power

efficiency• Modular – easy to layout and verify

Current Bus Architecture

ProcessorCore

= TileCore + Switch

S

Page 6: The Tile Processor™: General Purpose Manycore for ...

6

TILE64™ Processor Stats

Core power (H.264 encoding, 64 cores) 1V, 700MHz 8WClock speed (MHz) 600-900

I/O bandwidth 40 GbpsMain Memory bandwidth 200 Gbps

Multicore Performance (90nm)Number of tiles (general purpose cores) 64

5 MBOperations @ 900MHz (32, 16, 8 bit) 172-230-460 BOPSOn chip interconnect bandwidth 32 Terabits per second

On chip distributed cache

Power Efficiency

I/O and Memory Bandwidth

ProgrammingANSI standard C/C++SMP Linux programmingStandard based tools

The TILE64 chip is shipping today

Page 7: The Tile Processor™: General Purpose Manycore for ...

7

PCIe 1MACPHY

PCIe 1MACPHY

PCIe 0MACPHY

PCIe 0MACPHY

SerdesSerdes

SerdesSerdes

Flexible IOFlexible IO

GbE 0GbE 0

GbE 1GbE 1Flexible IOFlexible IO

UART, HPIJTAG, I2C,

SPI

UART, HPIJTAG, I2C,

SPI

DDR2 Memory Controller 3DDR2 Memory Controller 3

DDR2 Memory Controller 0DDR2 Memory Controller 0

DDR2 Memory Controller 2DDR2 Memory Controller 2

DDR2 Memory Controller 1DDR2 Memory Controller 1

XAUIMAC

PHY 0

XAUIMAC

PHY 0

SerdesSerdes

XAUIMAC

PHY 1

XAUIMAC

PHY 1

SerdesSerdes

TILE64 Processor Block DiagramA Complete System on a Chip

PROCESSOR

P2

Reg File

P1 P0

CACHEL2 CACHE

L1I L1D

ITLB DTLB

2D DMA

STN

MDN TDN

UDN IDN

SWITCH

Page 8: The Tile Processor™: General Purpose Manycore for ...

8

TILExpressTM PCI-e Board

Page 9: The Tile Processor™: General Purpose Manycore for ...

9

Key Manycore Challenges and Innovations

1. General purpose cores enable standard OS and

programming – How to design a core to balance tile performance

with number of tiles

3. Multicore Distributed Coherent Cache– How to obtain both cache

capacity and locality

2. iMesh™ Network – How to desin the interconnect and

scale

PROCESSOR

P2

Reg File

P1 P0

CACHEL2 CACHEL1I L1D

ITLB DTLB2D DMA

STN

MDN TDN

UDN IDN

SWITCH

PROCESSOR CACHE

SWITCH

PROCESSOR CACHE

SWITCH

4. Multicore Hardwall™– How to virtualize multicore

5. Multicore Development Environment

– How to make it easy to develop software

Page 10: The Tile Processor™: General Purpose Manycore for ...

Tile Processor Architecture

Page 11: The Tile Processor™: General Purpose Manycore for ...

11

1- Full-Featured General Purpose Cores

• Processor– Homogeneous cores– 3-way VLIW CPU, 64-bit instruction size – SIMD instructions: 32, 16, and 8-bit ops– Instructions for video (e.g., SAD) and

networking (e.g., hashing)– Protection and interrupts

• Memory– L1 cache: 8KB I, 8KB D, 1 cycle latency– L2 cache: 64KB unified, 7 cycle latency– Off-chip main memory, ~70 cycle latency– 32-bit virtual address space per process– 64-bit physical address space– Instruction and data TLBs– Cache integrated 2D DMA engine

• Switch in each tile• Runs SMP Linux

PROCESSOR

P2

RegFile

P1 P0

CACHEL2 CACHE

L1I L1DITLB DTLB

2D DMA

STN

MDN TDN

UDN IDN

SWITCH

Page 12: The Tile Processor™: General Purpose Manycore for ...

12

Kill If Less than Linear

How to Size a Core – “KILL Rule” for Multicore

Increase resource size within a core only iffor every 1% increase in core area

there is at least a 1% increase in core performance

Leads to power-efficient multicore design

Insight: For parallel applications, multicore performance can increase in proportion to the increase in area as more cores are added

Page 13: The Tile Processor™: General Purpose Manycore for ...

13

2- iMesh On-Chip Network Architecture

• Distributed resources– 2D Mesh peer-to-peer tile networks– 5 independent networks – Each with 32-bit channels, full duplex– Tile-to-memory, tile-to-tile, and tile-to-IO data transfer– Packet switched, wormhole routed, point-to-point– Near-neighbor flow control– Dimension-ordered routing

• Performance– ASIC-like one cycle hop latency– 2 Tbps bisection bandwidth– 32 Tbps interconnect bandwidth

• 5 independent networks– One static, four dynamic – IDN – System and I/O– MDN – Cache misses, DMA, other memory– TDN – Tile to tile memory access– UDN, STN – User-level streaming and scalar transfer

STN

MDN TDNUDN IDN

SWITCH

Page 14: The Tile Processor™: General Purpose Manycore for ...

14

Meshes are Power Efficient

[Konstantakopoulos ’07]

More than 80% power savings over buses

Page 15: The Tile Processor™: General Purpose Manycore for ...

15

sub r5, r3, r55

DynamicSwitch

add r55, r3, r4

DynamicSwitch

Catch a

ll

Direct User Access to Interconnect

TAG

• Enables stream programming and fast messaging• Compute and send in one instruction• Automatic demultiplexing of streams into registers• Number of streams is virtualized• Streams do not necessarily go through memory for

power efficiency

Page 16: The Tile Processor™: General Purpose Manycore for ...

16

3- Distributed Coherent Cache

• Each tile has local L1 and L2 caches• Combined L2 caches of all tiles act as

distributed 4MB L3 cache• Access facilited by low latency mesh• Low Latency of local L1 and L2 caches• Capacity of large distributed L3 cache• Caches are coherent, enabling running

SMP Linux

iL1 dL1L2

L3

Page 17: The Tile Processor™: General Purpose Manycore for ...

17

The virtualization and protection challenge• Multicores need to run multiple OS’s and

applications • OS’s must be protected from each other• I/O and other shared resources must be

virtualized• User-level network access exacerbates this

problem

Multicore Hardwall technology• Protects applications and OS by prohibiting

unwanted interactions• Configurable to include one or many tiles in a

protected area• Supported by hypervisor

OS1/APP1

OS1/APP3

OS2/APP2

4- Multicore Hardwall Technology for Virtualization and Protection

Page 18: The Tile Processor™: General Purpose Manycore for ...

18

Multicore Hardwall Implementation

OS1/APP1

OS1/APP3

OS2/APP2

datavalidSwitch

datavalidSwitch

HARDWALL_ENABLE

Page 19: The Tile Processor™: General Purpose Manycore for ...

Multicore Development Environment and Programming

Page 20: The Tile Processor™: General Purpose Manycore for ...

20

5- Multicore Software Tools and Programming

• Big multicore challenge• Multicore software tools challenge

– Current tools are primitive – use single process based models– E.g., how do you single-step an app spread over many cores– GDB windows on each of 100 cores is a non-starter– Many multicore vendors do not even supply tools

• Multicore programming challenge– Key tension between getting up and running quickly using familiar models,

while providing means to obtain full multicore performance– How do you program 100—1000 cores?– Intel Webinar likens threads to the “Assembly of parallel programming” – but

familiar and still useful in the short term for small numbers of cores– Need a way to transition smoothly from today’s programming to tomorrow’s

Page 21: The Tile Processor™: General Purpose Manycore for ...

21

Multicore Development Environment:Comprehensive Standards-Based Tools

• High Performance Compiler– ANSI Standard C/C++– Gcc options and GNU extensions (3.0)– Profile-directed optimization– Software pipelining– Automatic & guided unrolling

• Multicore Debugger– GDB standard based– Aggregate control and state display– Whole-application model for collective control

• Multicore Profiler– Architectural and application details– Source-level tracking– Collective stats

• Simulator & Hardware Platforms– Cycle, timing & functional accuracy– Complete production hardware platform

• Command Line and Eclipse-based IDE

Page 22: The Tile Processor™: General Purpose Manycore for ...

22

Tile Processor’s Systems Software Stack

TILE64 hardware

I/O Shim I/O ShimIDN MDN IDN MDN

tile tile tile tile …

Hypervisor

Supervisor

Applications

XAUI, etc. drivers

Linux kernel drivers

Linuxlibraries

iLib, etc. libraries

• User mode apps and libraries– Standard libs libpthread, libc, libm...– iLib APIs, intrinsics, NetIO APIs

• Supervisor layer– SMP Linux 2.6, single OS instance

across multiple tiles– Familiar pthreads programming, with

tile specific extensions• Hypervisor layer

– Thin distributed message passing “OS”– Virtualizes tiles, I/O devices

• TILE64 hardware

Page 23: The Tile Processor™: General Purpose Manycore for ...

Gentle Slope Application Partitioning Approaches

and Performance

Page 24: The Tile Processor™: General Purpose Manycore for ...

24

Gentle Slope Programming Model

Gentle slope programming philosophy– Facilitates immediate results using off-the-shelf code– Incremental steps to reach performance goals

Four incremental steps• Compile and run standard C/C++ applications on a single tile

• Run the program in parallel using standard SMP Linux models – pthreads or processes

• Use stream programming and messaging using iLib – a light-weight sockets-like API

• Further performance tuning using profiling tools, NUMA tuning, and other optimizers

T

T

T

MEM

MEM

Page 25: The Tile Processor™: General Purpose Manycore for ...

25

Leveraging Sequential Code Development

DPI

DPI

DPI

LoadBalancer Histogram

Packet stream

Filter, or deep packet inspectionMostly unmodified sequential program

(Like Google’s Map)

Statistics, alerts(Like Google’s Reduce)

Library componentlike networkServer Load Balancer

Page 26: The Tile Processor™: General Purpose Manycore for ...

26

Multicore Programming without Parallel Programming

DPI

DPI

DPI

LoadBalancer Histogram

Packet stream

Filter, or deep packet inspectionMostly unmodified sequential program

(Like Google’s Map)

Statistics, alerts(Like Google’s Reduce)

Library componentlike networkServer Load Balancer

Page 27: The Tile Processor™: General Purpose Manycore for ...

27

Mapping On Tiled Multicores

PCIe 1MACPHY

PCIe 1MACPHY

PCIe 0MACPHY

PCIe 0MACPHY

SerdesSerdes

SerdesSerdes

Flexible IOFlexible IO

GbE 0GbE 0

GbE 1GbE 1Flexible IOFlexible IO

UART, HPIJTAG, I2C,

SPI

UART, HPIJTAG, I2C,

SPI

DDR2 Memory Controller 3DDR2 Memory Controller 3

DDR2 Memory Controller 0DDR2 Memory Controller 0

DDR2 Memory Controller 2DDR2 Memory Controller 2

DDR2 Memory Controller 1DDR2 Memory Controller 1

XAUIMAC

PHY 0

XAUIMAC

PHY 0

SerdesSerdes

XAUIMAC

PHY 1

XAUIMAC

PHY 1

SerdesSerdes

DPI

DPI

DPI

HST LB

Page 28: The Tile Processor™: General Purpose Manycore for ...

28

Networking Performance

Pattern Matching 10Gbps using 22 tiles

Snort(Intrusion Prevention

Application)

10Gbps throughput on full SNORT database

Page 29: The Tile Processor™: General Purpose Manycore for ...

29

Multimedia Performance

16x16 macroblock comparison kernel 504 M per second

Video Conferencing, 720p H.264 encoder

with opensource x264 code

2 streams @ 30fps

Video Conferencing, 720p H.264 encoder

by customer with prop. encoder

15 streams @ 30fps

Page 30: The Tile Processor™: General Purpose Manycore for ...

The Future of Multicore

Page 31: The Tile Processor™: General Purpose Manycore for ...

31

The Future of Multicore

Number of cores will double every 18 months – by a corollary of Moore’s law

‘05 ‘08 ‘11 ‘14

64 256 1024 4096

‘02

16AcademiaIndustry 16 64 256 10244

Page 32: The Tile Processor™: General Purpose Manycore for ...

32

Core

s

2008 2009 2010

Scalable Multicore Roadmap

Scaling up

Scaling down

Page 33: The Tile Processor™: General Purpose Manycore for ...

33

The Vision

The core is the logic gateof the 21st century!

pm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

spm

s

1000 General Purpose Cores by 2014!

Page 34: The Tile Processor™: General Purpose Manycore for ...

34 TILERA PROPRIETARY & CONFIDENTIAL MATERIAL

Software Systems on Chip -- Software Defined Hardware

DDR2

XAUI

2Gbe

PCIe

FlexibleI/O

DDR2DDR2

2 X 720p30 Encodes

8 X 720p30 Decodes

3 X Audio channels

Scaling functions

Distribution tiles

Comm function

AudioVideo

PCComm

MEMMEMMEM MEMMEMMEM

MEMMEMMEM

Page 35: The Tile Processor™: General Purpose Manycore for ...

35

Summary

•• Our industry is going Our industry is going multicoremulticore

•• Current multicore architectures face software and scalability Current multicore architectures face software and scalability challengeschallenges

•• Tile Processor Architecture delivers leading performance Tile Processor Architecture delivers leading performance and power efficiency, and scales smoothly for future and power efficiency, and scales smoothly for future designsdesigns

•• Gentle slope programming offers:Gentle slope programming offers:–– Convenience of standard C/Linux programming modelsConvenience of standard C/Linux programming models–– Leverages existing software code baseLeverages existing software code base

•• Tiled Tiled multicoremulticore approach enables scalable roadmapapproach enables scalable roadmap

Page 36: The Tile Processor™: General Purpose Manycore for ...

36

Additional Information

The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.

© Copyright 2008 Tilera Corporation