Application-Specific Signatures for Transactional Memory in Soft Processors
Martin Labrecque, Mark Jeffrey, Gregory Steffan
ECE Dept., University of Toronto
2
FPGAs for Systems-on-Chip
• Increasingly large Systems-on-Chip: many CPUs, accelerators, IP blocks
• Processors are easier to program than hardware
• FPGAs & multicores: similar parallel programming challenge
[Figure: an FPGA containing soft processors (datapath with PC, instruction memory, register array, ALU, and data memory), a DDR controller, and Ethernet MAC controllers]
Why are parallel programs challenging?
Why are parallel programs challenging?
3
Packet Processing Example
SINGLE-THREADED
packet = get_packet();
…
connection = database->lookup(packet);
if(connection == NULL)
    connection = database->add(packet);
connection->count++;
…
global_packet_count++;
MULTI-THREADED (same code, with two regions marked "Atomic")
packet = get_packet();
…
connection = database->lookup(packet);
if(connection == NULL)
    connection = database->add(packet);
connection->count++;
…
global_packet_count++;
Challenges:
1- Must correctly delimit atomic operations
2- Improve performance by finer-grain locking
4
Packet Processing Example
MULTI-THREADED (two regions marked "Atomic")
packet = get_packet();
…
connection = database->lookup(packet);
if(connection == NULL)
    connection = database->add(packet);
connection->count++;
…
global_packet_count++;
No Parallelism
Optimistic Parallelism across Connections
Opportunity for Parallelism
5
Exploit Opportunity for Parallelism
• Allow more than 1 thread in a critical section
• Will succeed if threads access different data
Transactional Memory
– the new hot topic for multiprocessor computers
– how to map TM to FPGAs?
6
Our Transactional Approach
• Modify main memory directly: reduce copies, faster commit
[Figure: processor1 and processor2 access data through a shared data cache backed by off-chip DDR; two accesses are marked with an x]
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
• How to efficiently detect conflicts?
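The approach on this slide (write main memory in place, undo on abort) can be sketched in software. This is only an illustration under stated assumptions: a Python dict stands in for main memory, and the `Transaction` class and its method names are hypothetical, not the hardware design from the talk.

```python
class Transaction:
    """Sketch: write memory in place and log old values so an abort can undo."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for main memory (assumption)
        self.undo_log = []     # (address, previous value) pairs, oldest first

    def write(self, addr, value):
        # Save the old value before overwriting it in place.
        # None marks "address was not present" (assumes None is never stored).
        self.undo_log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value

    def commit(self):
        # Memory already holds the new values, so commit only drops the log.
        self.undo_log.clear()

    def abort(self):
        # Replay the log newest-first to restore the pre-transaction state.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


memory = {0x10: 1}
tx = Transaction(memory)
tx.write(0x10, 2)
tx.write(0x10, 3)
tx.abort()   # memory[0x10] is restored to 1
```

Because the new values already live in memory, commit is cheap, while abort pays for the restore; that trade-off is why conflicts need to be caught before another thread observes the speculative data.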
7
Conflict Detection
• Tracking speculative reads and writes
• Compare accesses across transactions:
Transaction1   Transaction2   Result
Read A         Read A         OK
Read B         Write B        CONFLICT
Write C        Read C         CONFLICT
Write D        Write D        CONFLICT
Must detect all conflicts for correctness
Reporting false conflicts is acceptable
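The comparison rules on this slide can be written as a check on exact read and write sets. This is a minimal sketch; the function name `conflicts` and the use of Python sets are assumptions made for illustration.

```python
def conflicts(reads1, writes1, reads2, writes2):
    """Return True if two transactions conflict on exact read/write sets."""
    # A conflict needs at least one write: write/read, read/write, or write/write.
    return bool(writes1 & (reads2 | writes2)) or bool(reads1 & writes2)


# The four cases from the slide, one address per case.
read_read = conflicts({"A"}, set(), {"A"}, set())     # OK
read_write = conflicts({"B"}, set(), set(), {"B"})    # CONFLICT
write_read = conflicts(set(), {"C"}, {"C"}, set())    # CONFLICT
write_write = conflicts(set(), {"D"}, set(), {"D"})   # CONFLICT
```

Exact sets never report a false conflict, but they grow with the transaction, which is what motivates a fixed-size approximation.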
8
Related Work on Conflict Detection
• FPGAs: test speculative bits in the cache
– Complex to evict cache lines
– Lots of additional state
– Too restrictive in terms of storage capacity
• ASIC: compare signatures
– Signature: bit vector recording TM memory accesses
– No previous signature FPGA implementation
Signatures well suited to FPGA bitwise operations
How can signatures be efficiently implemented?
9
Conflict Detection with Signatures
• Hash of an address indexes into a bit vector
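- More bits per signature give more resolution
- FPGA timing and area limit the number of bits
- Hash functions have varying complexity/accuracy
[Figure: a load from processor1 and a store from processor2 each pass through the hash function to set bits in the Read and Write signatures; the signatures are compared with an AND]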
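A signature-based check along these lines can be sketched with integers as bit vectors. The 16-bit length, the modulo hash, and the names `Signature` and `may_conflict` are assumptions chosen for illustration, not the design evaluated in the talk.

```python
SIG_BITS = 16  # signature length; an arbitrary choice for this sketch


def hash_addr(addr):
    # Placeholder hash (assumption): low-order address bits pick one bit.
    return addr % SIG_BITS


class Signature:
    """Per-thread read and write bit vectors, stored as Python ints."""

    def __init__(self):
        self.read = 0
        self.write = 0

    def record_load(self, addr):
        self.read |= 1 << hash_addr(addr)

    def record_store(self, addr):
        self.write |= 1 << hash_addr(addr)

    def clear(self):
        self.read = 0
        self.write = 0


def may_conflict(a, b):
    # Set comparison is a bitwise AND; any shared bit involving a write counts.
    return bool(a.write & (b.read | b.write)) or bool(a.read & b.write)


p1, p2 = Signature(), Signature()
p1.record_load(0x04)
p2.record_store(0x14)                   # different address, same signature bit
false_conflict = may_conflict(p1, p2)   # True: a false conflict

p1.clear()
p2.clear()
p1.record_load(0x01)
p2.record_store(0x02)
no_conflict = may_conflict(p1, p2)      # False
```

The first case shows why false conflicts are acceptable but costly: two distinct addresses alias to the same bit, so the check errs on the safe side and a transaction is rolled back needlessly.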
10
Goals of this Work
• Implement efficient signatures for TM on FPGAs
• Use FPGA reconfigurability for better, more efficient TM
• Evaluate with a real system
11
Existing Hash Functions
1. Bit Selection
Address bits: 0 1 1 0 ... ...
Hash = 0 1 1 0
A 4-bit hash indexes into 16 signature bits
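Bit selection can be sketched as picking a fixed set of address bits and concatenating them. Which bit positions are selected is an assumption here; the slide does not say which ones it uses.

```python
def bit_select(addr, positions):
    """Concatenate the chosen address bits (first position is the hash MSB)."""
    h = 0
    for pos in positions:
        h = (h << 1) | ((addr >> pos) & 1)
    return h


addr = 0b10110110
low_nibble = bit_select(addr, (3, 2, 1, 0))   # 0b0110: selects signature bit 6 of 16
scattered = bit_select(addr, (7, 5, 2, 0))    # 0b1110: a different choice of bits
```

In hardware this hash is just wiring, which is why it is cheap, but addresses that differ only in unselected bits always alias.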
12
Existing Hash Functions (continued)
2. H3: XOR random address bits
Address bits: 1 0 0 1 1 1 ...
Address bits: 0 0 1 1 0 1 ...
Hash_1 = 1 1
Hash_2 = 1 0
Multiple hash functions index different parts of the signature
We use 4 hash functions to improve performance/length
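An H3-style hash can be sketched as follows: each output bit is the XOR (parity) of a randomly chosen subset of address bits. The helper names `make_h3` and `signature_bits`, the seeds, and the 32-bit address width are assumptions for illustration.

```python
import random


def make_h3(addr_bits, hash_bits, seed):
    """Build one H3-style hash: each output bit XORs a random subset of address bits."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(addr_bits) for _ in range(hash_bits)]

    def h3(addr):
        h = 0
        for mask in masks:
            parity = bin(addr & mask).count("1") & 1  # XOR of the selected bits
            h = (h << 1) | parity
        return h

    return h3


def signature_bits(addr, hashes, hash_bits):
    """Each hash function indexes its own partition of the signature."""
    partition = 1 << hash_bits
    sig = 0
    for i, h in enumerate(hashes):
        sig |= 1 << (i * partition + h(addr))
    return sig


# Four 2-bit hash functions, giving four 4-bit partitions (16 signature bits).
hashes = [make_h3(32, 2, seed=s) for s in range(4)]
sig = signature_bits(0xDEADBEEF, hashes, 2)
```

With several hash functions, two addresses conflict only if they collide in every partition, which lowers the false conflict rate for a given signature length at the price of more XOR logic.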
13
Existing Hash Functions (continued)
3. PBX: XOR high-order bits with low-order ones
Address bits: 1 1 0 1 ...
Hash_2 = 0 1
Address bits: 1 1 0 1 ...
Hash_1 = 0 1
Address bits: 0 0 1 0 ...
Hash_2 = 1 0
4. LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions
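The PBX idea of XORing high-order bits with low-order ones can be sketched as below. The field boundary `high_shift` and the function name `pbx` are assumptions; the slide does not give the exact bit fields, and the LE-PBX variant is not shown.

```python
def pbx(addr, hash_bits, high_shift):
    """XOR a low-order field of the address with a higher-order field."""
    mask = (1 << hash_bits) - 1
    low = addr & mask                    # low-order bits
    high = (addr >> high_shift) & mask   # high-order bits (position is an assumption)
    return low ^ high


index = pbx(0b11010110, hash_bits=2, high_shift=6)   # 0b10 ^ 0b11 = 0b01
```

Compared with H3, each output bit needs only one two-input XOR, so the hash is much smaller in logic.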
14
Signatures: an Opportunity for FPGAs
ASIC hash functions on FPGA: very area consuming
Due to locality:
- applications access certain memory locations more frequently
- certain locations will have more conflicts than others
Via app-specific signatures:
- increase tracking resolution of conflicting memory locations
- decrease tracking resolution of others
FPGAs allow customized hash function for each application
Application-specific signatures!
15
Trie-based Hashing for Signatures
Binary Addresses (profiling): 000, 011, 100, 101, 110, 111
[Figure: trie built from the profiled addresses, with nodes root, 1xx, 0xx, 11x, 10x, 01x, 00x and leaves 111, 110, 101, 100, 011, 000; leaves are distinct addresses and map to signature bits]
Trie gives control on the resolution for different memory regions
Complete trie of all TM accesses is HUGE
Which leaves in the trie can/cannot be merged?
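Building the trie from profiled addresses can be sketched with nested dictionaries. The representation and the names `build_trie` and `leaves` are assumptions for illustration; the six 3-bit addresses are the ones listed on the slide.

```python
def build_trie(addresses, width):
    """Insert each address bit by bit, most significant bit first."""
    root = {}
    for addr in addresses:
        node = root
        for pos in range(width - 1, -1, -1):
            bit = (addr >> pos) & 1
            node = node.setdefault(bit, {})
    return root


def leaves(node, prefix=""):
    """List the bit strings of all leaves under a node."""
    if not node:
        return [prefix]
    out = []
    for bit in sorted(node):
        out.extend(leaves(node[bit], prefix + str(bit)))
    return out


profiled = [0b000, 0b011, 0b100, 0b101, 0b110, 0b111]
trie = build_trie(profiled, width=3)
distinct = leaves(trie)   # ['000', '011', '100', '101', '110', '111']
```

One signature bit per leaf gives exact tracking, but with real address traces the leaf count explodes, so the interesting question is which subtrees can share a bit.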
16
Trie-Based Conflict Detection
[Figure: a load/store address with bits A2 A1 A0 is matched against the trie with nodes xxx, 1xx, 0xx, 11x, 10x, 01x, 00x and leaves 111, 110, 101, 100, 011, 000]
Simulation feedback:
3 leaves in trie, so 3 signature bits encompass all accesses
Signature bits: A2 & A0, A2 & !A0, !A2
Compact trie by only evaluating nodes with remaining branching
Representation is very efficient!
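The compacted trie on this slide reduces to three predicates on the address bits: A2 & A0, A2 & !A0, and !A2. A minimal sketch of the resulting hash follows; the numbering of the signature bits (0, 1, 2) and the function name are assumptions.

```python
def trie_signature_bit(addr):
    """Map a 3-bit address to one of three signature bits via the compacted trie."""
    a2 = (addr >> 2) & 1
    a0 = addr & 1
    if not a2:
        return 2             # !A2: all of 0xx shares one bit
    return 0 if a0 else 1    # A2 & A0, or A2 & !A0 (A1 is never examined)


mapping = {format(a, "03b"): trie_signature_bit(a) for a in range(8)}
```

Only two address bits are evaluated and no XOR tree is needed, which is consistent with the slide's point that the representation is efficient; the cost is that addresses sharing a leaf (for example 000 and 011) are tracked as one location.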
17
Trie-based Hash Function Evaluation
Training packet trace is different from test packet trace
18
Multiprocessor System
– NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces
– 2 processors @ 125 MHz (limited by FPGA)
– 64 MB DDR2 SDRAM @ 200 MHz
Real system executing real applications
[Figure: processor1 and processor2 (1 thread each, each with its own I$) connected to instruction memory, an input buffer and input memory, an output buffer and output memory, a synchronization unit, and a shared data cache backed by off-chip DDR; packets flow from packet input to packet output]
19
Simulated Ratio of False Conflicts versus Number of Signature Bits
[Figure: NAT, percent false conflicts (y-axis, 0 to 45) versus number of signature bits (x-axis, log scale, 1 to 100000), for BitSel, H3, LE-PBX, and Trie]
- Trie-based hashing function requires far fewer signature bits
20
Simulated Ratio of False Conflicts versus Number of Signature Bits
[Figure: four panels (Classifier, UDHCP, NAT, Intruder), each plotting false conflicts versus number of signature bits (x-axis, log scale, 1 to 100000), for BitSel, H3, LE-PBX, and Trie]
- Trie-based hashing function requires far fewer signature bits
21
Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length
[Figure: normalized packet rate (y-axis, 0.5 to 1.1) versus signature length (x-axis, 0 to 200) for Classifier, UDHCP, Intruder, and NAT, with the Ideal level marked]
Signatures are Critical to Performance
22
2 Best Implementation Options
Maximum Design @ 125MHz
Signatures in Block RAM:
– 2048 signature bits per thread
– Bit-Select hash function
Signatures in Registers:
– ~100 signature bits per thread
– Arbitrary hash function
We use trie-based signatures: they perform best at that size
Let’s Compare!
23
Trie-based Hashing Normalized to Bit Selection
[Figure: bar chart of Throughput and Area for Classifier, NAT, UDHCP, and Intruder (y-axis, 0 to 1.8); bar labels: +12%, +58%, +9%, +71%]
- Significantly fewer rollbacks, hence packet rate increase
- At most 5% area overhead
24
Conclusions
- Conflict detection significantly impacts performance
- Trie-based hashing reduces required signature bits
- Trie-based hashing can be implemented in LUTs: preserve frequency, 5% area overhead
- Retiming is required to implement in RAMs
- Increased performance (up to 71%) versus other best implementation (RAM-based bit-select)
- Application-specific signatures enable first fully integrated TM processor for FPGA
- We now have an extended version working with 8 threads
25
Martin Labrecque, Mark Jeffrey
Gregory Steffan
ECE Dept. University of Toronto
martinL/[email protected]
Thank you!
26
27
Transactional Memory: Parallel Programming Made Easy
• Reduce conservative synchronization overhead
Lock(); if (shared_1) array [ i ] = 0; Unlock();
Only serialized when truly necessary
• Alleviate need for fine-grained synchronization
BEFORE:
Bool val = f(shared_1);
if(val){ Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock();}
AFTER:
Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock();
28
Our Transactional Approach
• No program change required
• Modify main memory directly
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
[Figure: two processors access data through a shared data cache backed by off-chip DDR; accesses are marked with an x]
29
sigsvn_udhcp/statsout: fp rates
sigsvn_other/mat: other stats
31
Transactional Single-Threaded Processor (simplified)
[Figure: processor pipeline with instruction cache, PC (+4), register array, ALU, data cache, and hazard detection logic]
Hazard detection is too slow: use static hazard detection
32
Transactional Single-Threaded Processor (simplified)
[Figure: the pipeline (instruction cache, +4, ALU, data cache) extended with conflict detection, an undo log, and duplicated register array and PC]
33
Transactional Packet Processing
• Hardware support to revert speculative changes to:
– Register file
– Program counter
– Data memory
• To detect failed speculation:
– Record read and write sets of speculative threads
– Compare sets across threads
When does the set comparison take place?
34
Conflict Detection with Signatures
• Suited for FPGA bitwise operations
– Hash of an address sets bits in a bit vector
– Set comparison is an AND operation
– Clearing sets is done in 1 cycle
- Requires many bits per thread
- Timing constraints allow read and write set tracking for 2 threads
- Made a single-threaded 2-processor implementation
[Figure: signatures for Thread 0 and Thread 1, each with a write vector W and a read vector R; in both, W changes from 00000000 to 01000000 while R stays 00000000]
35
[Figure: pruned trie with nodes root, 1xx, 11x, 0xx, 00x and leaves 111, 110, 000]
36
37
A New Meaning for Locks
• Optimistically consider locks
• No program change required
Lock();
if ( f( ) )
    shared_1 = a();
else
    shared_2 = b();
Unlock();
[Figure: execution of Thread1 to Thread4 under LOCKS compared with TRANSACTIONAL execution; an x marks one thread in the transactional case]
• Reduce conservative synchronization overhead
• Reduce challenge of fine-grained synchronization
38
39
• * can you list the apps?
• emphasize that train != test in methodology page