The Tile Processor™: General Purpose Manycore for ...
Transcript of The Tile Processor™: General Purpose Manycore for ...
The Tile Processor™: General Purpose Manycorefor Multimedia, Networking, and Wireless
Anant AgarwalTilera Corporationand CSAIL, MIT
Or, Anatomy of an In-Production Manycore
2
Agenda
• Introduction• Tile Processor Overview• Manycore Challenges• Tile Processor Innovations to handle challenges• Multicore Development Environment and Programming• Experience with Application Development and Performance• The Future of Multicore• Summary
3
Markets Demanding More PerformanceNetworking market- Demand for high performance
- Faster speeds 1Gbps » 2Gbps » 4Gpbs » 10 Gbps- Demand for more services
- In-line L4 – L7 services, intelligence everywhereDigital Multimedia market- Demand for high performance
- H.264 encoding for High Definition- Demand for more services
- VoD, video conferencing, transcoding, transratingWireless Networks- Demand for higher throughput in both baseband and
backhaul- More channels, more services, lower power
- Fast moving standards- LTE, more services
Cable & BroadcastCable & Broadcast
Video ConferencingVideo Conferencing
Surveillance DVRSurveillance DVR
SwitchesSwitches
Security AppliancesSecurity Appliances
RoutersRouters
… and with power efficiency and programming ease
4
Industry Aggressively Embracing Multicore
Time
Perfo
rman
ce
20072006 2010
16Cores
nCores
Inherent architectural bottlenecks:• No scalability • Power inefficiency • Primitive programming model
DualCores
QuadCores
5
Tiled Multicore Closes the Performance Gap
• Cores connected by mesh network• Unlike buses, meshes scale• Distributed resources improve power
efficiency• Modular – easy to layout and verify
Current Bus Architecture
ProcessorCore
= TileCore + Switch
S
6
TILE64™ Processor Stats
Core power (H.264 encoding, 64 cores) 1V, 700MHz 8WClock speed (MHz) 600-900
I/O bandwidth 40 GbpsMain Memory bandwidth 200 Gbps
Multicore Performance (90nm)Number of tiles (general purpose cores) 64
5 MBOperations @ 900MHz (32, 16, 8 bit) 172-230-460 BOPSOn chip interconnect bandwidth 32 Terabits per second
On chip distributed cache
Power Efficiency
I/O and Memory Bandwidth
ProgrammingANSI standard C/C++SMP Linux programmingStandard based tools
The TILE64 chip is shipping today
7
PCIe 1MACPHY
PCIe 1MACPHY
PCIe 0MACPHY
PCIe 0MACPHY
SerdesSerdes
SerdesSerdes
Flexible IOFlexible IO
GbE 0GbE 0
GbE 1GbE 1Flexible IOFlexible IO
UART, HPIJTAG, I2C,
SPI
UART, HPIJTAG, I2C,
SPI
DDR2 Memory Controller 3DDR2 Memory Controller 3
DDR2 Memory Controller 0DDR2 Memory Controller 0
DDR2 Memory Controller 2DDR2 Memory Controller 2
DDR2 Memory Controller 1DDR2 Memory Controller 1
XAUIMAC
PHY 0
XAUIMAC
PHY 0
SerdesSerdes
XAUIMAC
PHY 1
XAUIMAC
PHY 1
SerdesSerdes
TILE64 Processor Block DiagramA Complete System on a Chip
PROCESSOR
P2
Reg File
P1 P0
CACHEL2 CACHE
L1I L1D
ITLB DTLB
2D DMA
STN
MDN TDN
UDN IDN
SWITCH
8
TILExpressTM PCI-e Board
9
Key Manycore Challenges and Innovations
1. General purpose cores enable standard OS and
programming – How to design a core to balance tile performance
with number of tiles
3. Multicore Distributed Coherent Cache– How to obtain both cache
capacity and locality
2. iMesh™ Network – How to desin the interconnect and
scale
PROCESSOR
P2
Reg File
P1 P0
CACHEL2 CACHEL1I L1D
ITLB DTLB2D DMA
STN
MDN TDN
UDN IDN
SWITCH
PROCESSOR CACHE
SWITCH
PROCESSOR CACHE
SWITCH
4. Multicore Hardwall™– How to virtualize multicore
5. Multicore Development Environment
– How to make it easy to develop software
Tile Processor Architecture
11
1- Full-Featured General Purpose Cores
• Processor– Homogeneous cores– 3-way VLIW CPU, 64-bit instruction size – SIMD instructions: 32, 16, and 8-bit ops– Instructions for video (e.g., SAD) and
networking (e.g., hashing)– Protection and interrupts
• Memory– L1 cache: 8KB I, 8KB D, 1 cycle latency– L2 cache: 64KB unified, 7 cycle latency– Off-chip main memory, ~70 cycle latency– 32-bit virtual address space per process– 64-bit physical address space– Instruction and data TLBs– Cache integrated 2D DMA engine
• Switch in each tile• Runs SMP Linux
PROCESSOR
P2
RegFile
P1 P0
CACHEL2 CACHE
L1I L1DITLB DTLB
2D DMA
STN
MDN TDN
UDN IDN
SWITCH
12
Kill If Less than Linear
How to Size a Core – “KILL Rule” for Multicore
Increase resource size within a core only iffor every 1% increase in core area
there is at least a 1% increase in core performance
Leads to power-efficient multicore design
Insight: For parallel applications, multicore performance can increase in proportion to the increase in area as more cores are added
13
2- iMesh On-Chip Network Architecture
• Distributed resources– 2D Mesh peer-to-peer tile networks– 5 independent networks – Each with 32-bit channels, full duplex– Tile-to-memory, tile-to-tile, and tile-to-IO data transfer– Packet switched, wormhole routed, point-to-point– Near-neighbor flow control– Dimension-ordered routing
• Performance– ASIC-like one cycle hop latency– 2 Tbps bisection bandwidth– 32 Tbps interconnect bandwidth
• 5 independent networks– One static, four dynamic – IDN – System and I/O– MDN – Cache misses, DMA, other memory– TDN – Tile to tile memory access– UDN, STN – User-level streaming and scalar transfer
STN
MDN TDNUDN IDN
SWITCH
14
Meshes are Power Efficient
[Konstantakopoulos ’07]
More than 80% power savings over buses
15
sub r5, r3, r55
DynamicSwitch
add r55, r3, r4
DynamicSwitch
Catch a
ll
Direct User Access to Interconnect
TAG
• Enables stream programming and fast messaging• Compute and send in one instruction• Automatic demultiplexing of streams into registers• Number of streams is virtualized• Streams do not necessarily go through memory for
power efficiency
16
3- Distributed Coherent Cache
• Each tile has local L1 and L2 caches• Combined L2 caches of all tiles act as
distributed 4MB L3 cache• Access facilited by low latency mesh• Low Latency of local L1 and L2 caches• Capacity of large distributed L3 cache• Caches are coherent, enabling running
SMP Linux
iL1 dL1L2
L3
17
The virtualization and protection challenge• Multicores need to run multiple OS’s and
applications • OS’s must be protected from each other• I/O and other shared resources must be
virtualized• User-level network access exacerbates this
problem
Multicore Hardwall technology• Protects applications and OS by prohibiting
unwanted interactions• Configurable to include one or many tiles in a
protected area• Supported by hypervisor
OS1/APP1
OS1/APP3
OS2/APP2
4- Multicore Hardwall Technology for Virtualization and Protection
18
Multicore Hardwall Implementation
OS1/APP1
OS1/APP3
OS2/APP2
datavalidSwitch
datavalidSwitch
HARDWALL_ENABLE
Multicore Development Environment and Programming
20
5- Multicore Software Tools and Programming
• Big multicore challenge• Multicore software tools challenge
– Current tools are primitive – use single process based models– E.g., how do you single-step an app spread over many cores– GDB windows on each of 100 cores is a non-starter– Many multicore vendors do not even supply tools
• Multicore programming challenge– Key tension between getting up and running quickly using familiar models,
while providing means to obtain full multicore performance– How do you program 100—1000 cores?– Intel Webinar likens threads to the “Assembly of parallel programming” – but
familiar and still useful in the short term for small numbers of cores– Need a way to transition smoothly from today’s programming to tomorrow’s
21
Multicore Development Environment:Comprehensive Standards-Based Tools
• High Performance Compiler– ANSI Standard C/C++– Gcc options and GNU extensions (3.0)– Profile-directed optimization– Software pipelining– Automatic & guided unrolling
• Multicore Debugger– GDB standard based– Aggregate control and state display– Whole-application model for collective control
• Multicore Profiler– Architectural and application details– Source-level tracking– Collective stats
• Simulator & Hardware Platforms– Cycle, timing & functional accuracy– Complete production hardware platform
• Command Line and Eclipse-based IDE
22
Tile Processor’s Systems Software Stack
TILE64 hardware
I/O Shim I/O ShimIDN MDN IDN MDN
tile tile tile tile …
Hypervisor
Supervisor
Applications
XAUI, etc. drivers
Linux kernel drivers
Linuxlibraries
iLib, etc. libraries
• User mode apps and libraries– Standard libs libpthread, libc, libm...– iLib APIs, intrinsics, NetIO APIs
• Supervisor layer– SMP Linux 2.6, single OS instance
across multiple tiles– Familiar pthreads programming, with
tile specific extensions• Hypervisor layer
– Thin distributed message passing “OS”– Virtualizes tiles, I/O devices
• TILE64 hardware
Gentle Slope Application Partitioning Approaches
and Performance
24
Gentle Slope Programming Model
Gentle slope programming philosophy– Facilitates immediate results using off-the-shelf code– Incremental steps to reach performance goals
Four incremental steps• Compile and run standard C/C++ applications on a single tile
• Run the program in parallel using standard SMP Linux models – pthreads or processes
• Use stream programming and messaging using iLib – a light-weight sockets-like API
• Further performance tuning using profiling tools, NUMA tuning, and other optimizers
T
T
T
MEM
MEM
25
Leveraging Sequential Code Development
DPI
DPI
DPI
LoadBalancer Histogram
Packet stream
Filter, or deep packet inspectionMostly unmodified sequential program
(Like Google’s Map)
Statistics, alerts(Like Google’s Reduce)
Library componentlike networkServer Load Balancer
26
Multicore Programming without Parallel Programming
DPI
DPI
DPI
LoadBalancer Histogram
Packet stream
Filter, or deep packet inspectionMostly unmodified sequential program
(Like Google’s Map)
Statistics, alerts(Like Google’s Reduce)
Library componentlike networkServer Load Balancer
27
Mapping On Tiled Multicores
PCIe 1MACPHY
PCIe 1MACPHY
PCIe 0MACPHY
PCIe 0MACPHY
SerdesSerdes
SerdesSerdes
Flexible IOFlexible IO
GbE 0GbE 0
GbE 1GbE 1Flexible IOFlexible IO
UART, HPIJTAG, I2C,
SPI
UART, HPIJTAG, I2C,
SPI
DDR2 Memory Controller 3DDR2 Memory Controller 3
DDR2 Memory Controller 0DDR2 Memory Controller 0
DDR2 Memory Controller 2DDR2 Memory Controller 2
DDR2 Memory Controller 1DDR2 Memory Controller 1
XAUIMAC
PHY 0
XAUIMAC
PHY 0
SerdesSerdes
XAUIMAC
PHY 1
XAUIMAC
PHY 1
SerdesSerdes
DPI
DPI
DPI
HST LB
28
Networking Performance
Pattern Matching 10Gbps using 22 tiles
Snort(Intrusion Prevention
Application)
10Gbps throughput on full SNORT database
29
Multimedia Performance
16x16 macroblock comparison kernel 504 M per second
Video Conferencing, 720p H.264 encoder
with opensource x264 code
2 streams @ 30fps
Video Conferencing, 720p H.264 encoder
by customer with prop. encoder
15 streams @ 30fps
The Future of Multicore
31
The Future of Multicore
Number of cores will double every 18 months – by a corollary of Moore’s law
‘05 ‘08 ‘11 ‘14
64 256 1024 4096
‘02
16AcademiaIndustry 16 64 256 10244
32
Core
s
2008 2009 2010
Scalable Multicore Roadmap
Scaling up
Scaling down
33
The Vision
The core is the logic gateof the 21st century!
pm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
spm
s
1000 General Purpose Cores by 2014!
34 TILERA PROPRIETARY & CONFIDENTIAL MATERIAL
Software Systems on Chip -- Software Defined Hardware
DDR2
XAUI
2Gbe
PCIe
FlexibleI/O
DDR2DDR2
2 X 720p30 Encodes
8 X 720p30 Decodes
3 X Audio channels
Scaling functions
Distribution tiles
Comm function
AudioVideo
PCComm
MEMMEMMEM MEMMEMMEM
MEMMEMMEM
35
Summary
•• Our industry is going Our industry is going multicoremulticore
•• Current multicore architectures face software and scalability Current multicore architectures face software and scalability challengeschallenges
•• Tile Processor Architecture delivers leading performance Tile Processor Architecture delivers leading performance and power efficiency, and scales smoothly for future and power efficiency, and scales smoothly for future designsdesigns
•• Gentle slope programming offers:Gentle slope programming offers:–– Convenience of standard C/Linux programming modelsConvenience of standard C/Linux programming models–– Leverages existing software code baseLeverages existing software code base
•• Tiled Tiled multicoremulticore approach enables scalable roadmapapproach enables scalable roadmap
36
Additional Information
The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.
© Copyright 2008 Tilera Corporation