
Application-Specific Signatures for Transactional Memory in Soft Processors

Martin Labrecque

Mark Jeffrey

Gregory Steffan

ECE Dept. University of Toronto


FPGAs for Systems-on-Chip

• Increasingly large Systems-on-Chip: many CPUs, accelerators, IP blocks

• Processors are easier to program than hardware

• FPGAs & multicores: similar parallel programming challenge

[Figure: an FPGA containing soft processors (datapath with PC, instruction memory, register array, ALU, and data memory), a DDR controller, and Ethernet MAC controllers]

Why are parallel programs challenging?


Packet Processing Example

SINGLE-THREADED

    packet = get_packet();
    connection = database->lookup(packet);
    if (connection == NULL)
        connection = database->add(packet);
    connection->count++;
    global_packet_count++;

MULTI-THREADED

    The same code, with two regions marked "Atomic" on the slide.

Challenges:

1- Must correctly delimit atomic operations

2- Improve performance by finer-grain locking


Packet Processing Example

MULTI-THREADED, with two regions marked "Atomic" on the slide:

    packet = get_packet();
    connection = database->lookup(packet);
    if (connection == NULL)
        connection = database->add(packet);
    connection->count++;
    global_packet_count++;

Slide annotations:

• No Parallelism

• Optimistic Parallelism across Connections

• Opportunity for Parallelism


Exploit Opportunity for Parallelism

• Allow more than 1 thread in a critical section

• Will succeed if threads access different data

Transactional Memory

–the new hot topic for multiprocessor computers

–how to map TM to FPGAs?


Our Transactional Approach

• Modify main memory directly: reduce copies, faster commit

• Detect conflicts prior to corrupting main memory

• Undo changes on transaction abort

[Figure: processor1 and processor2 share a data cache backed by off-chip DDR; x marks appear on the data paths]

• How to efficiently detect conflicts?
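The "modify memory directly, undo on abort" idea above can be illustrated with a small software undo log. This is only a sketch under assumptions: the slides describe hardware support, and the names `undo_log_t`, `tx_store`, `tx_commit`, `tx_abort` and the capacity `UNDO_LOG_CAPACITY` are hypothetical, not taken from the deck.

```c
#include <stddef.h>
#include <stdint.h>

#define UNDO_LOG_CAPACITY 64 /* hypothetical size, not from the slides */

typedef struct {
    uint32_t *addr;      /* location that was overwritten */
    uint32_t  old_value; /* value it held before the speculative store */
} undo_entry_t;

typedef struct {
    undo_entry_t entries[UNDO_LOG_CAPACITY];
    size_t       count;
} undo_log_t;

/* Speculative store: save the old value, then write memory in place.
 * Returns 0 on success, -1 if the log is full. */
int tx_store(undo_log_t *ulog, uint32_t *addr, uint32_t value)
{
    if (ulog->count == UNDO_LOG_CAPACITY)
        return -1;
    ulog->entries[ulog->count].addr = addr;
    ulog->entries[ulog->count].old_value = *addr;
    ulog->count++;
    *addr = value;
    return 0;
}

/* Commit: memory already holds the new values, so only the log is dropped. */
void tx_commit(undo_log_t *ulog)
{
    ulog->count = 0;
}

/* Abort: restore old values newest-first, so that repeated stores
 * to one address unwind to the original value. */
void tx_abort(undo_log_t *ulog)
{
    while (ulog->count > 0) {
        ulog->count--;
        *ulog->entries[ulog->count].addr = ulog->entries[ulog->count].old_value;
    }
}
```

The design choice this mirrors is that commit is cheap (nothing to copy back), while abort pays the cost of walking the log.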


Conflict Detection

• Tracking speculative reads and writes

• Compare accesses across transactions:

Transaction1    Transaction2    Result
Read A          Read A          OK
Read B          Write B         CONFLICT
Write C         Read C          CONFLICT
Write D         Write D         CONFLICT

Must detect all conflicts for correctness

Reporting false conflicts is acceptable
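The conflict rules on this slide reduce to: two accesses to the same location conflict unless both are reads. A minimal sketch of that exact (non-signature) check follows; the type and function names are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind_t;

typedef struct {
    uint32_t      addr; /* location touched inside the transaction */
    access_kind_t kind; /* read or write */
} access_t;

/* Exact conflict check between one access from each transaction:
 * same address and at least one of the two is a write. */
bool accesses_conflict(access_t a, access_t b)
{
    if (a.addr != b.addr)
        return false;
    return a.kind == ACCESS_WRITE || b.kind == ACCESS_WRITE;
}
```

An exact check like this never reports a false conflict, but it needs the full address of every access, which is what signatures avoid storing.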


Related Work on Conflict Detection

• FPGAs: test speculative bits in the cache

–Complex to evict cache lines

–Lots of additional state

–Too restrictive in terms of storage capacity

• ASIC: compare signatures

–Signature: bit vector recording TM memory accesses

–No previous signature FPGA implementation

Signatures well suited to FPGA bitwise operations

How can signatures be efficiently implemented?


Conflict Detection with Signatures

• Hash of an address indexes into a bit vector

- More bits per signature → more resolution

- FPGA timing and area limit the number of bits

- Hash functions have varying complexity/accuracy

[Figure: a load from processor1 and a store from processor2 each pass through the hash function and set bits in Read and Write signatures; the signatures are compared with an AND]
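The signature mechanism described here can be sketched in software as a pair of bit vectors per transaction. Everything below is an illustrative assumption rather than the authors' hardware: the 64-bit signature length, the placeholder hash `sig_hash`, and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 64u /* hypothetical signature length */

typedef struct {
    uint64_t read_sig;  /* one bit set per hashed load address */
    uint64_t write_sig; /* one bit set per hashed store address */
} signature_t;

/* Placeholder hash (an assumption): low-order bits of the word address. */
static unsigned sig_hash(uint32_t addr)
{
    return (addr >> 2) % SIG_BITS;
}

void sig_record_load(signature_t *s, uint32_t addr)
{
    s->read_sig |= UINT64_C(1) << sig_hash(addr);
}

void sig_record_store(signature_t *s, uint32_t addr)
{
    s->write_sig |= UINT64_C(1) << sig_hash(addr);
}

/* Conflict if a write of one transaction overlaps any access of the other.
 * The comparison is a bitwise AND, so aliasing addresses produce false
 * conflicts but a real conflict is never missed. */
bool sig_conflict(const signature_t *a, const signature_t *b)
{
    return (a->write_sig & (b->read_sig | b->write_sig)) != 0
        || (a->read_sig & b->write_sig) != 0;
}

/* Clearing both sets at commit or abort is two assignments. */
void sig_clear(signature_t *s)
{
    s->read_sig = 0;
    s->write_sig = 0;
}
```

With this placeholder hash, addresses 0x100 and 0x200 map to the same bit, which is a concrete instance of the acceptable false conflict mentioned earlier.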


Goals of this Work

• Implement efficient signatures for TM on FPGAs

• FPGA reconfigurability → better/more-efficient TM

• Evaluate with real system


Existing Hash Functions

1. Bit Selection

Address bits: 0 1 1 0 ... ...

Hash = 0 1 1 0

A 4-bit hash indexes into 16 signature bits
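Bit selection simply copies a field of the address into the hash. A minimal sketch, assuming a 4-bit field as on the slide; which address bits are selected is not stated, so the `shift` parameter and both function names are hypothetical.

```c
#include <stdint.h>

/* Select 4 consecutive address bits starting at bit `shift`.
 * The choice of `shift` is an assumption left to the caller. */
unsigned bitsel_hash(uint32_t addr, unsigned shift)
{
    return (addr >> shift) & 0xFu;
}

/* One-hot position of the address within a 16-bit signature. */
uint16_t bitsel_signature_bit(uint32_t addr, unsigned shift)
{
    return (uint16_t)(1u << bitsel_hash(addr, shift));
}
```

Any two addresses that agree on the selected field alias to the same signature bit, which is why bit selection needs many signature bits to keep false conflicts low.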


Existing Hash Functions (continued)

2. H3: XOR random address bits

Address bits: 1 0 0 1 1 1 ...

Address bits: 0 0 1 1 0 1 ...

Hash_1 = 1 1

Hash_2 = 1 0

Multiple hash functions index different parts of the signature

We use 4 hash functions to improve performance/length
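In an H3-style hash, each output bit is the XOR (parity) of a randomly chosen subset of address bits. The sketch below expresses each subset as a mask; the masks themselves, and the names `parity32` and `h3_hash`, are illustrative assumptions and not values from the slides.

```c
#include <stdint.h>

/* XOR of all bits of x (1 if an odd number of bits are set). */
static unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* masks[i] selects the address bits XORed together to form hash bit i.
 * nbits must not exceed the width of unsigned. */
unsigned h3_hash(uint32_t addr, const uint32_t *masks, unsigned nbits)
{
    unsigned h = 0;
    for (unsigned i = 0; i < nbits; i++)
        h |= parity32(addr & masks[i]) << i;
    return h;
}
```

Using several such functions, each with its own masks and its own slice of the signature, corresponds to the "multiple hash functions index different parts of the signature" point above.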


Existing Hash Functions (continued)

3. PBX: XOR high-order bits with low-order ones

4. LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions

Examples shown on the slide:

Address bits: 1 1 0 1 ...    Hash_2 = 0 1

Address bits: 1 1 0 1 ...    Hash_1 = 0 1

Address bits: 0 0 1 0 ...    Hash_2 = 1 0
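A PBX-style hash can be sketched as the XOR of a high-order address field with a low-order one. The field positions and width are not given on the slide, so they are parameters here; `pbx_hash` and its argument names are hypothetical.

```c
#include <stdint.h>

/* XOR an nbits-wide field starting at bit `high_shift` with an
 * nbits-wide field starting at bit `low_shift`.
 * nbits must be in the range 1..31. */
unsigned pbx_hash(uint32_t addr, unsigned low_shift,
                  unsigned high_shift, unsigned nbits)
{
    uint32_t mask = (1u << nbits) - 1u;
    uint32_t low  = (addr >> low_shift) & mask;
    uint32_t high = (addr >> high_shift) & mask;
    return (unsigned)(low ^ high);
}
```

One way to approximate the LE-PBX idea of progressively omitting low-order bits would be to call this with a larger `low_shift` for each successive hash function; that mapping is an assumption, not something the slides specify.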


Signatures: an Opportunity for FPGAs

• ASIC hash functions on FPGA: very area consuming

• FPGAs allow customized hash function for each application

• Due to locality:

–applications access certain memory locations more frequently

–certain locations will have more conflicts than others

• Via app-specific signatures:

–increase tracking resolution of conflicting memory locations

–decrease tracking resolution of others

Application-specific signatures!


Trie-based Hashing for Signatures

Binary Addresses (profiling): 000, 011, 100, 101, 110, 111

Trie:

    root
      0xx
        00x
          000
        01x
          011
      1xx
        10x
          100
          101
        11x
          110
          111

Leaves are distinct addresses and correspond to signature bits

• Trie gives control on the resolution for different memory regions

• Complete trie of all TM accesses is HUGE

• Which leaves in the trie can/cannot be merged?


Trie-Based Conflict Detection

Load/Store address bits: A2, A1, A0

Trie:

    xxx
      0xx
        00x
          000
        01x
          011
      1xx
        10x
          100
          101
        11x
          110
          111

Simulation feedback: compact trie by only evaluating nodes with remaining branching

Resulting signature bits:

    A2 & A0
    A2 & !A0
    !A2

3 leaves in trie → 3 signature bits encompass all accesses

Representation is very efficient!
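The three expressions on this slide (A2 & A0, A2 & !A0, !A2) form a hash that maps every 3-bit address to exactly one of three signature bits. A minimal sketch follows; the assignment of each expression to a particular bit position, and the names used, are assumptions for illustration.

```c
#include <stdint.h>

/* Bit positions are an arbitrary choice made for this sketch. */
enum {
    SIG_A2_AND_A0     = 1u << 0, /* A2 & A0  */
    SIG_A2_AND_NOT_A0 = 1u << 1, /* A2 & !A0 */
    SIG_NOT_A2        = 1u << 2  /* !A2      */
};

/* Map a 3-bit address (A2 A1 A0) to its one-hot signature bit.
 * A1 is never examined: the compacted trie does not branch on it. */
unsigned trie_signature_bit(unsigned addr3)
{
    unsigned a2 = (addr3 >> 2) & 1u;
    unsigned a0 = addr3 & 1u;

    if (!a2)
        return SIG_NOT_A2;
    return a0 ? SIG_A2_AND_A0 : SIG_A2_AND_NOT_A0;
}
```

Because each signature bit is a small Boolean function of a few address bits, a hash of this shape maps naturally onto FPGA lookup tables.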


Trie-based Hash Function Evaluation

Training packet trace is different from test packet trace


Multiprocessor System

– NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces

– 2 processors @ 125 MHz (limited by FPGA)

– 64 MB DDR2 SDRAM @ 200 MHz

[Figure: processor1 and processor2 (1 thread each, each with its own I$) connected to an input buffer, a shared data cache, an output buffer, and a synch. unit; instruction, data, input, and output memories; packets enter through the input buffer and leave through the output buffer; off-chip DDR]

Real system executing real applications


Simulated Ratio of False Conflicts versus Number of Signature Bits

[Plot: NAT, percent false conflicts (y-axis, 0 to 45) versus number of signature bits (x-axis, log scale, 1 to 100000) for the BitSel, H3, LE-PBX, and Trie hash functions]

- Trie-based hashing function requires far fewer signature bits


Simulated Ratio of False Conflicts versus Number of Signature Bits

[Plots: four panels (Classifier, UDHCP, NAT, Intruder), each showing percent false conflicts versus number of signature bits (x-axis, log scale, 1 to 100000) for the BitSel, H3, LE-PBX, and Trie hash functions]

- Trie-based hashing function requires far fewer signature bits


Signatures are Critical to Performance

Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length

[Plot: normalized packet rate (y-axis, 0.5 to 1.1) versus signature length (x-axis, 0 to 200) for Classifier, UDHCP, Intruder, and NAT, with an "Ideal" reference]


2 Best Implementation Options

Maximum Design @ 125MHz

Signature storage    Signature bits per thread    Hash function
Block RAM            2048                         Bit-Select
Registers            ~100                         Arbitrary

We use trie-based signatures: they perform best at that size

Let’s Compare!


Trie-based Hashing Normalized to BitSelection

[Bar chart: Throughput and Area of trie-based hashing normalized to bit selection (y-axis, 0 to 1.8) for Classifier, NAT, UDHCP, and Intruder; bar labels in slide order: +12%, +58%, +9%, +71%]

- Significantly fewer rollbacks → packet rate increase

- At most 5% area overhead


Conclusions

- Conflict detection significantly impacts performance

- Trie-based hashing reduces required signature bits

- Trie-based hashing can be implemented in LUTs

  Preserve frequency, 5% area overhead

  Retiming is required to implement in RAMs

- Increased performance (up to 71%) versus other best implementation (RAM-based bit-select)

- Application-specific signatures enable first fully integrated TM processor for FPGA

- We now have an extended version working with 8 threads


Martin Labrecque

Mark Jeffrey

Gregory Steffan

ECE Dept. University of Toronto

martinL/[email protected]

Thank you!


Transactional Memory: Parallel Programming Made Easy

• Reduce conservative synchronization overhead

    Lock();
    if (shared_1)
        array[i] = 0;
    Unlock();

Only serialized when truly necessary

• Alleviate need for fine-grained synchronization

BEFORE

    Bool val = f(shared_1);
    if (val) {
        Lock();
        if (f(shared_1))
            shared_1 = 0;
        Unlock();
    }

AFTER

    Lock();
    if (f(shared_1))
        shared_1 = 0;
    Unlock();


Our Transactional Approach

• No program change required

• Modify main memory directly

• Detect conflicts prior to corrupting main memory

• Undo changes on transaction abort

[Figure: two processors share a data cache backed by off-chip DDR; x marks appear on the data paths]


sigsvn_udhcp/statsout fp rates

sigsvn_other/mat other stats


Transactional Single-Threaded Processor (simplified)

[Figure: pipeline with PC (+4 increment), instruction cache, register array, ALU, data cache, and hazard detection logic]

Hazard detection is too slow: use static hazard detection


Transactional Single-Threaded Processor (simplified)

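[Figure: the same pipeline (PC with +4 increment, instruction cache, register array, ALU, data cache) extended with conflict detection, an undo log, and duplicated PC and register array]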


Transactional Packet Processing

• Hardware support to revert speculative changes to:

– Register file

– Program counter

– Data memory

• To detect failed speculation:

– Record read and write sets of speculative threads

– Compare sets across threads

When does the set comparison take place?


Conflict Detection with Signatures

• Suited for FPGA bitwise operations

– Hash of an address sets bits in a bit vector

– Set comparison is an AND operation

– Clearing sets is done in 1 cycle

[Figure: Signature Thread 0 and Signature Thread 1, each with a Write (W) and a Read (R) bit vector attached to a processor; the vectors are shown as W 00000000 / R 00000000 and as W 01000000 / R 00000000]

- Requires many bits per thread

- Timing constraints allow read and write set tracking for 2 threads

- Made a single-threaded 2-processor implementation


[Figure: pruned trie]

    root
      0xx
        00x
          000
      1xx
        11x
          110
          111


A New Meaning for Locks

• Optimistically consider locks

• No program change required

• Reduce conservative synchronization overhead

• Reduce challenge of fine-grained synchronization

    Lock();
    if ( f( ) )
        shared_1 = a();
    else
        shared_2 = b();
    Unlock();

[Figure: execution timelines of Thread1 to Thread4 under LOCKS and under TRANSACTIONAL execution; an x appears on the transactional timeline]


• * can you list the apps?

• emphasize that train != test in methodology page