NetTM: Faster and Easier Synchronization for Soft Multicores
via Transactional Memory
Martin Labrecque
Prof. Greg Steffan
University of Toronto
FPGA, February 27th 2011
5
Processors in FPGAs
FPGAs in Telecommunications:
– Present in most high-end routers
– More than 40% of FPGA market
Deep packet inspection requires: software + CPUs
Our goal: implement those cores directly in the FPGA
[Figure: an FPGA containing soft processor(s), drawn as a datapath (PC, instruction memory, register array, ALU, data memory), alongside a DDR controller and an Ethernet MAC]
9
NetThreads: Our Base System
[Figure: NetThreads block diagram on the NetFPGA. Labeled components: two 4-thread processors, each with an instruction cache (I$); a synch. unit; input buffer (packet input), data cache, and output buffer (packet output); instruction, data, input, and output memories; off-chip DDR2]
8 threads? Write 1 program, run on all threads!
Released online: netfpga+netthreads
12
Parallelizing Stateful Applications
Ideal scenario: packets are data-independent and are processed in parallel
[Figure: timeline of Packet1 to Packet4 processed concurrently on Thread1 to Thread4]
Reality: programmers need to insert locks in case there is a dependence
[Figure: timeline in which threads wait on one another]
Experimental result: synchronizing packet-processing threads with fine/medium-grained global locks is overly conservative 80-90% of the time [ANCS'10]
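To make "overly conservative" concrete, here is a minimal sketch. It assumes each packet touches the state of exactly one flow, and counts how many pairs of in-flight packets a single global lock would serialize even though they share no state. The flow IDs, the batches, and the function name are hypothetical; this is not the methodology of [ANCS'10].

```python
from itertools import combinations

def overly_conservative_fraction(batches):
    """Fraction of concurrent packet pairs that a single global lock
    serializes even though they touch different flow state."""
    serialized = 0    # pairs forced to take turns under a global lock
    independent = 0   # pairs that did not actually share any state
    for batch in batches:
        for a, b in combinations(batch, 2):
            serialized += 1
            if a != b:
                independent += 1
    return independent / serialized if serialized else 0.0

# Each inner list holds the (hypothetical) flow IDs of packets in flight together.
batches = [
    ["flowA", "flowB", "flowC", "flowD"],
    ["flowA", "flowA", "flowB", "flowC"],
]
print(overly_conservative_fraction(batches))
```

Only one pair in the second batch truly shares state; every other pair waits on the lock for no reason.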
Transactional memory: data-independent packets are processed in parallel
[Figure: execution timeline under transactional memory, with one point marked "x"]
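The optimistic idea can be sketched in software as follows. Each packet handler runs as a transaction that records what it reads and writes; if two transactions overlap on a written address, one is aborted and its writes are rolled back. The class and function names, the dictionary-based memory, and the choice to roll back through a log of old values are all assumptions made for illustration; hardware would check for conflicts on each access rather than after the fact.

```python
class Transaction:
    """One packet handler's speculative execution (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> value
        self.read_set = set()
        self.write_set = set()
        self.undo_log = []     # (address, old value), in write order

    def read(self, addr):
        self.read_set.add(addr)
        return self.memory.get(addr, 0)   # absent addresses read as 0

    def write(self, addr, value):
        # Log the old value first so the write can be undone on abort.
        self.undo_log.append((addr, self.memory.get(addr, 0)))
        self.write_set.add(addr)
        self.memory[addr] = value

    def abort(self):
        # Restore old values in reverse order, then forget the attempt.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()
        self.read_set.clear()
        self.write_set.clear()


def conflicts(t1, t2):
    """True if one transaction wrote an address the other read or wrote."""
    return bool(t1.write_set & (t2.read_set | t2.write_set)
                or t2.write_set & (t1.read_set | t1.write_set))


memory = {"flow1.count": 5, "flow2.count": 7}
t1, t2 = Transaction(memory), Transaction(memory)
t1.write("flow1.count", t1.read("flow1.count") + 1)
t2.write("flow2.count", t2.read("flow2.count") + 1)
independent = conflicts(t1, t2)   # different flows: both may commit

t3 = Transaction(memory)
t3.write("flow1.count", t3.read("flow1.count") + 1)
dependent = conflicts(t1, t3)     # same flow: one must abort
t3.abort()                        # t3's write is rolled back
```

Packets on different flows never wait for each other here; only the truly dependent pair pays for an abort and retry.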
18
NetTM: extending NetThreads for TM
[Figure: the NetThreads block diagram (two 4-thread processors with I$, synch. unit, input buffer, data cache, output buffer, instruction/data/input/output memories, off-chip DDR2) with two additions: conflict detection in the synch. unit and an undo log]
- Conflict detection in the synch. unit: application-specific Bloom filters [ARC10]
- Undo log
- 1K words of speculative writes buffered per thread
- 4-LUT: +21%
- 16K BRAMs: +25%
- Preserved 125 MHz
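The slide names Bloom filters for conflict detection but gives no parameters. As a rough sketch of the general technique, a Bloom filter can summarize the set of addresses a thread has accessed in a fixed number of bits, so that a membership check costs a few bit tests. The filter size, the number of hashes, and the use of SHA-256 are assumptions chosen for readability; they do not reproduce the application-specific hash functions of [ARC10].

```python
import hashlib

class Signature:
    """Fixed-size Bloom filter summarizing a set of addresses."""

    def __init__(self, num_bits=64, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # bit vector stored as an int

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        # False means "definitely not accessed"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))


write_sig = Signature()
write_sig.add(0x1000)                  # a thread speculatively writes 0x1000
hit = write_sig.may_contain(0x1000)    # another thread's access to 0x1000 is flagged
```

A Bloom filter never misses a real conflict, but it can report one that did not happen. That costs an unnecessary abort, not correctness.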
25
Transactional Memory
● 1st HTM implementation tightly integrated with soft processors
● Supports conventional locks and TM without code modification!
● Can extract optimistic parallelism across packets
● Improves benchmark throughput: +6%, +54%, +57%
● Coarse critical sections and deadlock avoidance simplify programming
● Processor and conflict detection integration works well on FPGA
Future work: scale to more cores on newer FPGA/NetFPGA!
NetTM and NetThreads available online: netfpga+netthreads