Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont....
Transcript of Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont....
Mapping Stream based Applications to an Intel IXP Network Processor using Compaan
Sjoerd Meijer, Bart Kienhuis, Johan Walters, David SnuijfUniversity Leiden, LIACS
2
Outline
• Need for multi-processor platforms
• Problem: how to program them
• Solution: compiler support
• Realization: the IMCA back-end
• Conclusion
3
Problem description
• Ferocious appetite for
more embedded
computation power
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate
0
500
1000
1500
2000
2500
2000 2001 2002 2003 2004 2005 2006
Bil
lio
n M
AC
/s
HDTV
MPEG4Voice
over IP
Video
over IP
3G Wireless/
WCDMA
FutureBroadband Standards
General Purpose DSP/CPU
IncreasingGap
2.5G
Solution: A need for Multi Processor Systems
4
Problem description – cont.
• Moore’s Law:
– More transistors
• Tooling:
– More lines of code
• Productivity Gap
Pro
ductivity
Lin
es o
f C
ode
/Man
Mon
th
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1981
1985
1989
1993
1997
2001
2005
2009
Lo
gic
tra
nsis
tors
pe
r ch
ip(K
)
10
100
1,000
10,000
100,000
1,000,000
10,000,000
Logic Tr./Chip
58% / Year compoundcomplexity growth rate
Loc/MM
8–10% / Year Compound Productivity growth rate
Productivity gap
}
We know how to build multiprocessor systems,But not how to program them efficiently
5
Problem Description
for( int t=1; t<=P; t++){
for( int i=1; i<=M; i++ ){
for( int j=4; j<=N; j++ ){
r1[i+1][j-3] = F1(...); //stm1
}
}
for( int l=3; l<=M; l++ ){
for( int m=3; m<=N-1; m++ ){
if ( l+m<= 7 ){
r2[l][m] = F2( r1[l-1][m-2] ); //stm2
}
if ( l+m>=8 ){
r2[l][m] = F3( r1[l][N-3] ); //stm3
}
... = F4( r2[l][m] ); //stm4
}
}
}
F3F2
F1
Get() Get()
Put() Put()
FIFO1 FIFO2
F4
FIFO3 FIFO4
Put() Put()
Get() Get()
Storage arrays (R1) are located in Global Memory
Se
qu
en
tia
lly O
rdere
d
Processes run autonomously
Communicate via unbounded FIFOs
We need tools to convert the sequential programto a parallel equivalent
6
Our Solution
for j = 1:1:5,
for i = j:1:5,
[r(j,i)] = ReadMatrix_Zeros_64x64();
end
end
for k = 1:1:6,
for j = 1:1:5,
[r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
for i = j+1:1:5,
[r(j,i), x(k,i), t] = Rotate( r(j,i), x(k,i), t );
end
end
end
Sequential Application
Specification
Application
P1 P3
P2 P4
P5
Parallel Application
Specification
Translator
EASY to specify DIFFICULT to specify
DIFFICULT to map EASY to map
FPGA
COMPAAN
IMCA
7
Questions
• Q1: Can we use the IXP architecture for stream-based applications?
• Q2: Can we map applications written as a KPN onto the IXP?
• Q3: Can we program the IXP using Compaan?
8
Intel IXP2400 Network Processor
• Optimized for streaming
– 2.6 Gbit ethernet connection
• Build to operate in real-time on internet traffic
– 8 microengines with 8 hardware supported
threads
• Completely programmable in C
– A SDK available for programming
9
IXP2400
• 8 microengines
• 1 XScale core
• MSF interface
• SRAM: 128 MB
• DRAM: 1 GB
• Scratchpad: 4 K
IXP2400
Media Switch
Fabric
Interface
RBUF
TBUF
Scratchpad
memory
CAP
Hash unit
DRAM
External Media
Device(s)
PCI
Intel
XScale
Core
Optional hosts
CPU, PCI
bus devices
SRAM
Controller 1
SRAM
Controller 0
DRAM
Controller 0
SRAM
ME 1:1ME 1:0
ME 1:3ME 1:2
ME 0:1ME 0:0
ME 0:3ME 0:2
10
Microengine
• Local memory: 640 K
• Registers– GPR
– Read Xfer
– Write Xfer
– Next neighbour
• Instruction store:4 K
• 8 Threads
A Operand
4K
Instruction
Store
Immed
640 long
words
Local
Memory
B Operand
Execution Datapath
(Shift, Add, Substract,
Multiply, Find Fist Bit Set)
16 entry CAM
LM addr 1
128 SRAM
Write Xfer
CRC Unit
CRC
Remainder
Local CRSs
128 DRAM
Write Xfer
LM addr 0
D-Pull
Bus
S-Pull
Bus
To Next
Neighbour
128 SRAM
Read Xfer
128 DRAM
Read Xfer
128 Next
Neighbour
128 GPRs
(B Bank)
128 GPRs
(A Bank)
S-Push
Bus
D-Push
Bus
From Next
Neighbour
11
Intel IXP Network Processor
• Not used on a large scale
• Difficult to program:
– Write code per microengine
– Infrastructure
– Synchronization
– Non-unified complex memory model
Conclusion: easier programming needed
12
Static Affine Nested Loop Programs
• Nested loop: all statements occur within loops
• Static: control flow known at compile time
• Affine: ax+bexpressions
• Explicit array references are required to extract data-level parallelism in the application
13
Kahn Process Network (KPN)
• Process Networks
– Processes run autonomously
– Communicate via unbounded FIFOs
– Synchronize via blocking read
• Process is either
– executing (Execute)
– communicating(Put/Get)
• BENEFITS:
– Deterministic Behavior
– Distributed Control
– Distributed Memory
Fc
A
Fa Fb
�Get
Execute
Put
Get
�Execute
Put
Put
�Get
Get
Execute
Put
Fifo
C
B
14
Threads
• 8 Threads per microengine
• Minimal costs swapping contexts
• Registers and Memory divided between threads:– Private: a copy for each thread
– Shared: all threads same value
• Instruction store shared by all threads
• 1 Process � 1 Thread
• Threads run in Round Robin schedule
• Non-pre-emptive: programmer must swap
15
Signals
• Synchronization between IXP elements
– Between threads and microengines
– Memory access
• Wait for signal � context switch
__declspec(sram_read_reg) x; SIGNAL sig;
__declspec(scratch)* addr = 0x400;
scratch_read(&x, addr, 1, sig_done, sig);
do_other_work();
__wait_for_all(&sig);
y = x;
16
Available FIFOs
• In hardware:– 16 Scratchpad memory rings
– 128 SRAM rings
– 7 Next neighbour rings
• In software:– Local memory
– Direct Xfer register access
– Scratchpad memory
17
FIFO mappings
• Small and frequently used:
– Scratchpad HW
– (Next neighbour rings)
• Small and less frequently used:
– Scratchpad SW
• Large and/or not
frequently used:
– SRAM rings
18
Process mapping
Interesting presentation related to this mapping problem is: An ILP Formulation for System-Level Application Mapping on Network Processor Architectures, Chris Ostlerand Karam S. Chatha, Proceedings of DATE’07, Nice 16-20 April 2007, France
19
Tool Flow Overview
• IMCA:
IXP
Mapper for
Compaan
Applications
20
Code generation
• Visitor design pattern
– Visits platform description, writes code per
element
• One “C” file per microengine
• FIFOs accessed uniformly
– Static functions to implement FIFO code
– Port struct to specify FIFO variables
21
Using the IMCA environment
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
22
Results
• QR algorithm
– 5 nodes
– 12 FIFOs
2108 Mhz213FPGA, full HW
39100 Mhz3865FPGA, 5 MB
67600 Mhz40247IXP
Time micro secs.
CPU freq.# clock cyclesArch.
23
Discussion
• Work presents a first try, still many open issues
– Selection of right communication channel
– Binding of the KPN processes to threads and
microengines
– MSF takes a lot of resources, what is the minimum
required.
• Future of the IXP is uncertain, perhaps the Cell
Processor is an interesting next research platform
24
Conclusion
• Q1: The IXP can be used for streaming applications– Yes, showed it for QR and DWT
• Q2: We automatically mapped QR– Yes, we can map the FIFO communication onto the
communication channels and the processes on the threads
• Q3: IMCA back-end generates IXP code– Yes, we can use Compaan to automatically generate
the Processes and FIFO channels that are subsequently mapped on the IXP