mapping stream based applications to an intel ixp network ... · 4 problem description – cont....
TRANSCRIPT
![Page 1: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/1.jpg)
Mapping Stream based Applications to an Intel IXP Network Processor using Compaan
Sjoerd Meijer, Bart Kienhuis, Johan Walters, David SnuijfUniversity Leiden, LIACS
![Page 2: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/2.jpg)
2
Outline
• Need for multi-processor platforms
• Problem: how to program them
• Solution: compiler support
• Realization: the IMCA back-end
• Conclusion
![Page 3: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/3.jpg)
3
Problem description
• Ferocious appetite for
more embedded
computation power
Source: TI, Xilinx – 1 MAC = 8 bit Multiply-Accumulate
0
500
1000
1500
2000
2500
2000 2001 2002 2003 2004 2005 2006
Bil
lio
n M
AC
/s
HDTV
MPEG4Voice
over IP
Video
over IP
3G Wireless/
WCDMA
FutureBroadband Standards
General Purpose DSP/CPU
IncreasingGap
2.5G
Solution: A need for Multi Processor Systems
![Page 4: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/4.jpg)
4
Problem description – cont.
• Moore’s Law:
– More transistors
• Tooling:
– More lines of code
• Productivity Gap
Pro
ductivity
Lin
es o
f C
ode
/Man
Mon
th
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1981
1985
1989
1993
1997
2001
2005
2009
Lo
gic
tra
nsis
tors
pe
r ch
ip(K
)
10
100
1,000
10,000
100,000
1,000,000
10,000,000
Logic Tr./Chip
58% / Year compoundcomplexity growth rate
Loc/MM
8–10% / Year Compound Productivity growth rate
Productivity gap
}
We know how to build multiprocessor systems,But not how to program them efficiently
![Page 5: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/5.jpg)
5
Problem Description
for( int t=1; t<=P; t++){
for( int i=1; i<=M; i++ ){
for( int j=4; j<=N; j++ ){
r1[i+1][j-3] = F1(...); //stm1
}
}
for( int l=3; l<=M; l++ ){
for( int m=3; m<=N-1; m++ ){
if ( l+m<= 7 ){
r2[l][m] = F2( r1[l-1][m-2] ); //stm2
}
if ( l+m>=8 ){
r2[l][m] = F3( r1[l][N-3] ); //stm3
}
... = F4( r2[l][m] ); //stm4
}
}
}
F3F2
F1
Get() Get()
Put() Put()
FIFO1 FIFO2
F4
FIFO3 FIFO4
Put() Put()
Get() Get()
Storage arrays (R1) are located in Global Memory
Se
qu
en
tia
lly O
rdere
d
Processes run autonomously
Communicate via unbounded FIFOs
We need tools to convert the sequential programto a parallel equivalent
![Page 6: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/6.jpg)
6
Our Solution
for j = 1:1:5,
for i = j:1:5,
[r(j,i)] = ReadMatrix_Zeros_64x64();
end
end
for k = 1:1:6,
for j = 1:1:5,
[r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
for i = j+1:1:5,
[r(j,i), x(k,i), t] = Rotate( r(j,i), x(k,i), t );
end
end
end
Sequential Application
Specification
Application
P1 P3
P2 P4
P5
Parallel Application
Specification
Translator
EASY to specify DIFFICULT to specify
DIFFICULT to map EASY to map
FPGA
COMPAAN
IMCA
![Page 7: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/7.jpg)
7
Questions
• Q1: Can we use the IXP architecture for stream-based applications?
• Q2: Can we map applications written as a KPN onto the IXP?
• Q3: Can we program the IXP using Compaan?
![Page 8: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/8.jpg)
8
Intel IXP2400 Network Processor
• Optimized for streaming
– 2.6 Gbit ethernet connection
• Build to operate in real-time on internet traffic
– 8 microengines with 8 hardware supported
threads
• Completely programmable in C
– A SDK available for programming
![Page 9: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/9.jpg)
9
IXP2400
• 8 microengines
• 1 XScale core
• MSF interface
• SRAM: 128 MB
• DRAM: 1 GB
• Scratchpad: 4 K
IXP2400
Media Switch
Fabric
Interface
RBUF
TBUF
Scratchpad
memory
CAP
Hash unit
DRAM
External Media
Device(s)
PCI
Intel
XScale
Core
Optional hosts
CPU, PCI
bus devices
SRAM
Controller 1
SRAM
Controller 0
DRAM
Controller 0
SRAM
ME 1:1ME 1:0
ME 1:3ME 1:2
ME 0:1ME 0:0
ME 0:3ME 0:2
![Page 10: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/10.jpg)
10
Microengine
• Local memory: 640 K
• Registers– GPR
– Read Xfer
– Write Xfer
– Next neighbour
• Instruction store:4 K
• 8 Threads
A Operand
4K
Instruction
Store
Immed
640 long
words
Local
Memory
B Operand
Execution Datapath
(Shift, Add, Substract,
Multiply, Find Fist Bit Set)
16 entry CAM
LM addr 1
128 SRAM
Write Xfer
CRC Unit
CRC
Remainder
Local CRSs
128 DRAM
Write Xfer
LM addr 0
D-Pull
Bus
S-Pull
Bus
To Next
Neighbour
128 SRAM
Read Xfer
128 DRAM
Read Xfer
128 Next
Neighbour
128 GPRs
(B Bank)
128 GPRs
(A Bank)
S-Push
Bus
D-Push
Bus
From Next
Neighbour
![Page 11: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/11.jpg)
11
Intel IXP Network Processor
• Not used on a large scale
• Difficult to program:
– Write code per microengine
– Infrastructure
– Synchronization
– Non-unified complex memory model
Conclusion: easier programming needed
![Page 12: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/12.jpg)
12
Static Affine Nested Loop Programs
• Nested loop: all statements occur within loops
• Static: control flow known at compile time
• Affine: ax+bexpressions
• Explicit array references are required to extract data-level parallelism in the application
![Page 13: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/13.jpg)
13
Kahn Process Network (KPN)
• Process Networks
– Processes run autonomously
– Communicate via unbounded FIFOs
– Synchronize via blocking read
• Process is either
– executing (Execute)
– communicating(Put/Get)
• BENEFITS:
– Deterministic Behavior
– Distributed Control
– Distributed Memory
Fc
A
Fa Fb
�Get
Execute
Put
Get
�Execute
Put
Put
�Get
Get
Execute
Put
Fifo
C
B
![Page 14: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/14.jpg)
14
Threads
• 8 Threads per microengine
• Minimal costs swapping contexts
• Registers and Memory divided between threads:– Private: a copy for each thread
– Shared: all threads same value
• Instruction store shared by all threads
• 1 Process � 1 Thread
• Threads run in Round Robin schedule
• Non-pre-emptive: programmer must swap
![Page 15: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/15.jpg)
15
Signals
• Synchronization between IXP elements
– Between threads and microengines
– Memory access
• Wait for signal � context switch
__declspec(sram_read_reg) x; SIGNAL sig;
__declspec(scratch)* addr = 0x400;
scratch_read(&x, addr, 1, sig_done, sig);
do_other_work();
__wait_for_all(&sig);
y = x;
![Page 16: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/16.jpg)
16
Available FIFOs
• In hardware:– 16 Scratchpad memory rings
– 128 SRAM rings
– 7 Next neighbour rings
• In software:– Local memory
– Direct Xfer register access
– Scratchpad memory
![Page 17: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/17.jpg)
17
FIFO mappings
• Small and frequently used:
– Scratchpad HW
– (Next neighbour rings)
• Small and less frequently used:
– Scratchpad SW
• Large and/or not
frequently used:
– SRAM rings
![Page 18: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/18.jpg)
18
Process mapping
Interesting presentation related to this mapping problem is: An ILP Formulation for System-Level Application Mapping on Network Processor Architectures, Chris Ostlerand Karam S. Chatha, Proceedings of DATE’07, Nice 16-20 April 2007, France
![Page 19: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/19.jpg)
19
Tool Flow Overview
• IMCA:
IXP
Mapper for
Compaan
Applications
![Page 20: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/20.jpg)
20
Code generation
• Visitor design pattern
– Visits platform description, writes code per
element
• One “C” file per microengine
• FIFOs accessed uniformly
– Static functions to implement FIFO code
– Port struct to specify FIFO variables
![Page 21: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/21.jpg)
21
Using the IMCA environment
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
Packet
![Page 22: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/22.jpg)
22
Results
• QR algorithm
– 5 nodes
– 12 FIFOs
2108 Mhz213FPGA, full HW
39100 Mhz3865FPGA, 5 MB
67600 Mhz40247IXP
Time micro secs.
CPU freq.# clock cyclesArch.
![Page 23: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/23.jpg)
23
Discussion
• Work presents a first try, still many open issues
– Selection of right communication channel
– Binding of the KPN processes to threads and
microengines
– MSF takes a lot of resources, what is the minimum
required.
• Future of the IXP is uncertain, perhaps the Cell
Processor is an interesting next research platform
![Page 24: Mapping Stream based Applications to an Intel IXP Network ... · 4 Problem description – cont. • Moore’s Law: – More transistors • Tooling: – More lines of code • Productivity](https://reader034.vdocuments.net/reader034/viewer/2022050202/5f55bb2935f01d31421d1412/html5/thumbnails/24.jpg)
24
Conclusion
• Q1: The IXP can be used for streaming applications– Yes, showed it for QR and DWT
• Q2: We automatically mapped QR– Yes, we can map the FIFO communication onto the
communication channels and the processes on the threads
• Q3: IMCA back-end generates IXP code– Yes, we can use Compaan to automatically generate
the Processes and FIFO channels that are subsequently mapped on the IXP