NetTM: Faster and Easier Synchronization for Soft Multicores via Transactional Memory

Martin Labrecque
Prof. Greg Steffan
University of Toronto

FPGA, February 27th 2011




Processors in FPGAs

FPGAs in Telecommunications:

– Present in most high-end routers

– More than 40% of FPGA market

Deep packet inspection requires: software + CPUs

Our goal: implement those cores directly in the FPGA

[Figure: an FPGA containing one or more soft processors (datapath with PC, instruction memory, register array, ALU, and data memory), a DDR controller, and an Ethernet MAC]


NetThreads: Our Base System

[Figure: NetThreads on the NetFPGA: two 4-thread processors, each with an instruction cache (I$); a synchronization unit; a data cache; off-chip DDR2; an input buffer and input memory for packet input; an output memory and output buffer for packet output]

8 threads? Write 1 program, run on all threads!

Released online: netfpga+netthreads
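The "write 1 program, run on all threads" model can be sketched in a few lines. This is only an illustration in Python threads, not NetThreads code: `NUM_THREADS`, `worker`, and `run` are hypothetical names, the queue stands in for the packet input, and the "processing" is a placeholder.

```python
import queue
import threading

NUM_THREADS = 8  # two processors x four threads each, as on the slide


def worker(packets, results, results_lock):
    """The same code runs on every thread: take the next packet, process it."""
    while True:
        try:
            packet = packets.get_nowait()
        except queue.Empty:
            return  # no packets left for this thread
        processed = (packet, len(packet))  # placeholder for real processing
        with results_lock:
            results.append(processed)


def run(packet_list):
    """Start NUM_THREADS copies of worker and collect what they produce."""
    packets = queue.Queue()
    for packet in packet_list:
        packets.put(packet)
    results = []
    results_lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(packets, results, results_lock))
        for _ in range(NUM_THREADS)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The order of `results` depends on thread scheduling, so callers should not rely on it.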


Parallelizing Stateful Applications

Ideal scenario: Packet1, Packet2, Packet3, and Packet4 are data-independent and are processed in parallel by Thread1, Thread2, Thread3, and Thread4.

Reality: programmers need to insert locks in case there is a dependence, so threads wait.

Experimental result: synchronizing packet processing threads with fine/medium-grained global locks is overly conservative 80-90% of the time [ANCS'10].


Transactional memory: data-independent packets are processed in parallel.
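The difference between the lock and transactional timelines can be illustrated with a small counting model. It is a sketch under stated assumptions, not a model of the NetTM hardware: each packet is assumed to update the state of exactly one flow, identified by a hypothetical `flow_id`, and packets are assumed to arrive in groups with one packet per thread. Under a global lock every packet in a group takes its turn. Under optimistic execution, only a packet whose flow was already touched in the same round aborts and retries.

```python
def steps_with_global_lock(groups):
    """Every packet in a group holds the lock in turn: one step per packet."""
    return sum(len(group) for group in groups)


def steps_with_optimistic_execution(groups):
    """Packets in a group run together; a packet whose flow was already
    touched in the current round aborts and retries in a later round."""
    steps = 0
    for group in groups:
        pending = list(group)
        while pending:
            claimed = set()
            retry = []
            for flow_id in pending:
                if flow_id in claimed:
                    retry.append(flow_id)  # conflict: abort and retry
                else:
                    claimed.add(flow_id)   # no conflict: commits this round
            steps += 1
            pending = retry
    return steps
```

For the groups `[[1, 2, 3, 4], [5, 5, 6, 7]]` the lock model takes 8 steps and the optimistic model takes 3. When every packet in a group touches the same flow, both models take the same number of steps.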


NetTM: extending NetThreads for TM

[Figure: the NetThreads system (two 4-thread processors with instruction caches, synchronization unit, data cache, input and output buffers, output memory) extended with an undo log]

Undo log

– 1K words of speculative writes buffered per thread
– 4-LUT: +21%
– 16K BRAMs: +25%
– Preserved 125 MHz
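How an undo log supports aborting a transaction can be sketched as follows. This follows the usual meaning of an undo log (new values go to memory, old values are logged and restored on abort) and is an assumption about the general technique, not a description of the NetTM circuit. The names `UndoLog` and `UndoLogFull`, the rule that unwritten memory reads as 0, and the choice to raise an exception on overflow are all hypothetical; only the 1K-word capacity comes from the slide.

```python
LOG_CAPACITY = 1024  # 1K words per thread, as stated on the slide


class UndoLogFull(Exception):
    """Raised when a transaction writes more words than the log can hold."""


class UndoLog:
    """Per-thread log of the values overwritten by speculative writes."""

    def __init__(self, memory):
        self.memory = memory  # shared memory modeled as dict: address -> word
        self.entries = []     # (address, old value) pairs in write order

    def write(self, address, value):
        """Log the old value, then update memory in place."""
        if len(self.entries) >= LOG_CAPACITY:
            # Overflow policy is an assumption of this sketch.
            raise UndoLogFull("undo log capacity exceeded")
        self.entries.append((address, self.memory.get(address, 0)))
        self.memory[address] = value

    def commit(self):
        """Keep the new values and discard the logged old ones."""
        self.entries.clear()

    def abort(self):
        """Restore old values, newest entry first."""
        while self.entries:
            address, old_value = self.entries.pop()
            self.memory[address] = old_value
```

Restoring in reverse order matters when a transaction writes the same address more than once: the oldest logged value is applied last, so memory ends up as it was before the transaction began.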

Conflict detection (in the synchronization unit): application-specific Bloom filters [ARC10]
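Conflict detection with Bloom filters can be sketched in software as follows. This is a minimal sketch of the general signature technique, not the design from [ARC10]: it uses a generic hash where the slide refers to application-specific filters, and `SIGNATURE_BITS`, `NUM_HASHES`, `Signature`, `Transaction`, and `may_conflict` are hypothetical names and parameters.

```python
import hashlib

SIGNATURE_BITS = 64  # hypothetical signature width
NUM_HASHES = 2       # hypothetical number of hash functions


def _bit_positions(address):
    """Map an address to NUM_HASHES bit positions in the signature."""
    digest = hashlib.sha256(address.to_bytes(8, "little")).digest()
    return [
        int.from_bytes(digest[4 * i:4 * i + 4], "little") % SIGNATURE_BITS
        for i in range(NUM_HASHES)
    ]


class Signature:
    """Bloom filter over the addresses a transaction has accessed."""

    def __init__(self):
        self.bits = 0

    def add(self, address):
        for pos in _bit_positions(address):
            self.bits |= 1 << pos

    def may_contain(self, address):
        return all((self.bits >> pos) & 1 for pos in _bit_positions(address))


class Transaction:
    """Tracks the read set and write set of one in-flight transaction."""

    def __init__(self):
        self.reads = Signature()
        self.writes = Signature()

    def read(self, address):
        self.reads.add(address)

    def write(self, address):
        self.writes.add(address)


def may_conflict(address, is_write, other):
    """Check one access against another in-flight transaction.

    A read conflicts with the other's writes; a write conflicts with the
    other's reads or writes. False positives are possible, false negatives
    are not.
    """
    if other.writes.may_contain(address):
        return True
    return is_write and other.reads.may_contain(address)
```

Because a Bloom filter never reports an address it has seen as absent, a real conflict is never missed; the cost of a small signature is the occasional unnecessary abort.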


Transactional Memory

● 1st HTM implementation tightly integrated with soft processors

● Supports conventional locks and TM without code modification!

● Can extract optimistic parallelism across packets

● Improves benchmark throughput: +6%, +54%, +57%

● Coarse critical sections and deadlock avoidance simplify programming

● Processor and conflict detection integration works well on FPGA

Future work: scale to more cores on newer FPGA/NetFPGA!

NetTM and NetThreads available online: netfpga+netthreads

[email protected]