
SigRace: Signature-Based Data Race Detection

Abdullah Muzahid, Dario Suarez*, Shanxiang Qi & Josep Torrellas

Computer Science Department, University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

*Universidad de Zaragoza, Spain

Debugging Multithreaded Programs

“Debugging a multithreaded program has a lot in common with medieval torture methods”

-- Random quote found via Google search

Data Race

• Two threads access the same variable without intervening synchronization, and at least one access is a write

    T1                T2
    lock L
    x++
    unlock L
                      x++

• Hard to detect and reproduce
• Common bug
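As a concrete illustration, here is a minimal C++ sketch of the slide's example; the thread functions and names are illustrative, not from the paper:

    // T1 increments x under lock L; T2 increments it with no lock.
    // The two accesses race: at least one is a write, with no ordering.
    #include <mutex>
    #include <thread>

    int x = 0;          // shared variable
    std::mutex L;

    void t1() {
        std::lock_guard<std::mutex> guard(L);  // lock L
        x++;
    }                                          // unlock L

    void t2() {
        x++;  // unsynchronized access: races with t1's increment
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join();
        b.join();
        // The final value of x is timing-dependent, which is why such
        // bugs are hard to detect and reproduce.
    }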

Dynamic Data Race Detection

• Mainly two approaches
  – Lockset: finds violations of the locking discipline
  – Happened-Before: finds concurrent conflicting accesses

Happened-Before Approach

• Each thread keeps a vector clock, starting at [0, 0]
• Epoch: sync to sync; each sync operation advances the thread's own clock component

    Thread 0                    Thread 1
    a: [0, 0]                   d: [0, 0]
    Lock L
    b: [1, 0]
    Unlock L
    c: [2, 0]                   Lock L
                                e: [2, 1]  (received [2, 0] joined with own [0, 1])

• a, b happened before e
• c, d unordered
• A write of x (“x =”) in one of two unordered epochs and an access of x (“= x”) in the other → Data Race
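The ordering test on these timestamps can be made concrete in a few lines. The sketch below assumes the convention the slide's numbers follow, where a thread advances its own vector-clock component at every sync operation; the function names are illustrative, not the paper's hardware encoding:

    #include <array>
    #include <cstdio>

    using VC = std::array<int, 2>;  // one component per thread

    // The epoch of thread i with clock a happened before an epoch with
    // clock b iff b has seen thread i advance past a's own component.
    bool happenedBefore(int i, const VC& a, const VC& b) {
        return a[i] < b[i];
    }

    bool unordered(int i, const VC& a, int j, const VC& b) {
        return !happenedBefore(i, a, b) && !happenedBefore(j, b, a);
    }

    int main() {
        VC a{0, 0}, b{1, 0}, c{2, 0};  // Thread 0's epochs
        VC d{0, 0}, e{2, 1};           // Thread 1's epochs
        std::printf("b -> e: %d\n", happenedBefore(0, b, e));  // 1: ordered
        std::printf("c || d: %d\n", unordered(0, c, 1, d));    // 1: unordered
    }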

Software Implementation

• Need to instrument every memory access
  – 10x – 50x slowdown
  – Not suitable for production runs

Hardware Implementation

• Each processor's cache (C1, C2) keeps a timestamp (TS) with the cached data
• Example: P1 caches x tagged with timestamp ts1; P2 then writes x in epoch ts2
• The write's coherence transaction carries ts2, and C1 checks the incoming ts2 against its local ts1 to detect a race

Limitations of HW Approaches

• Modify the cache and the coherence protocol
• Perform checking at least on every coherence transaction
• Lose detection ability when a cache line is displaced or invalidated

Our Contributions

• SigRace: novel HW mechanism for race detection based on signatures
  – Simple HW
    • Cache and coherence protocol are unchanged
  – Higher coverage than existing HW schemes
    • Detects races even if the line is displaced/invalidated
  – Usable on-the-fly in production runs
• SigRace finds 150% more injected races than a state-of-the-art HW proposal

Outline

• Motivation
• Main Idea
• Implementation
• Results
• Conclusions

Main Idea

Address Signature + Happened-Before

Hardware Address Signatures [Ceze et al., ISCA 2006]

• Every load/store address issued from the ROB is permuted/hashed and accumulated into a fixed-size signature register
• Logical AND of two signatures gives their intersection
• Has false positives but no false negatives
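In software, such a signature is essentially a Bloom filter over addresses. The sketch below is a minimal illustration: the 2 Kbit size matches the evaluation's default, but the hash mix is an arbitrary stand-in for the paper's H3 functions:

    #include <bitset>
    #include <cstdint>

    constexpr int kSigBits = 2048;  // 2 Kbit signature

    struct Signature {
        std::bitset<kSigBits> bits;

        // Illustrative hash; real designs use hardware H3 hash matrices.
        static uint32_t hash(uint64_t addr, uint64_t seed) {
            uint64_t h = (addr ^ seed) * 0x9E3779B97F4A7C15ull;
            return static_cast<uint32_t>(h >> 33) % kSigBits;
        }

        void insert(uint64_t addr) {            // record an accessed address
            bits.set(hash(addr, 1));
            bits.set(hash(addr, 2));
        }
        bool mayContain(uint64_t addr) const {  // may report false positives,
            return bits.test(hash(addr, 1)) &&  // never false negatives
                   bits.test(hash(addr, 2));
        }
        bool intersects(const Signature& o) const {  // logical AND
            return (bits & o.bits).any();
        }
        void clear() { bits.reset(); }          // emptied at block boundaries
    };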

Using Signatures for Race Detection

• An epoch (sync to sync) carries a timestamp TS
• Each epoch is divided into blocks; a block is a fixed number of dynamic instructions (not a cache block, basic block, or atomic block)
• While a block runs, its addresses accumulate into a signature: (TS1, Sig1), then (TS1, Sig2), …; after a sync the timestamp advances to TS2
• At the end of each block, the (TS, Sig) pair is sent to the Race Detection Module and the local signature is cleared (Ø)
• The Race Detection Module intersects signatures from different threads (Sig ∩ Sig)
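A software analogue of this per-thread collection, reusing the VC and Signature types from the sketches above, might look as follows; sendToRDM is a hypothetical interface, and counting memory accesses stands in for counting dynamic instructions:

    #include <cstdint>

    struct BlockSummary {   // what gets shipped at each block boundary
        VC ts;              // epoch timestamp
        Signature r, w;     // read and write address signatures
    };

    struct ThreadState {
        static constexpr int kBlockSize = 2000;  // paper's default block size
        VC ts{};
        Signature r, w;
        int count = 0;

        void onAccess(uint64_t addr, bool isWrite) {
            (isWrite ? w : r).insert(addr);
            if (++count >= kBlockSize) dumpBlock();
        }
        void dumpBlock() {
            // sendToRDM(BlockSummary{ts, r, w});  // hypothetical RDM hook
            r.clear();
            w.clear();
            count = 0;      // TS stays the same until the next sync
        }
    };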

On Chip Race Detection Module (RDM)

• The RDM resides on chip and keeps one queue per processor (Q1, Q2)
• Each queue entry is a block summary: timestamp, read signature, write signature (T, R, W)
• When a new summary (T2, R2, W2) arrives, it is checked against the other queues' entries (TJ, RJ, WJ), from the most recent backwards:

    If T2 & TJ unordered:
        R2 ∩ WJ
        W2 ∩ WJ
        W2 ∩ RJ
    Else stop

• Once an entry is ordered with T2, all older entries are too, so checking stops there
• Done in background
• Non-empty intersections may be false positives
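The queue walk reduces to a short loop. This sketch reuses happenedBefore and BlockSummary from the earlier sketches; it only illustrates the order of checks, not the pipelined hardware:

    #include <deque>

    // Check a newly arrived summary from one thread against another
    // thread's queue, from the most recent entry backwards.
    bool checkAgainstQueue(const BlockSummary& nw, int otherTid,
                           const std::deque<BlockSummary>& q) {
        for (auto it = q.rbegin(); it != q.rend(); ++it) {
            if (happenedBefore(otherTid, it->ts, nw.ts))
                break;  // ordered; all older entries are ordered too
            if (nw.r.intersects(it->w) ||   // R2 ∩ WJ
                nw.w.intersects(it->w) ||   // W2 ∩ WJ
                nw.w.intersects(it->r))     // W2 ∩ RJ
                return true;  // possible race; trigger re-execution
        }
        return false;
    }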

Re-execution

• Needed to:
  – Discard false positives
  – Identify the accesses involved

Support for Re-execution

• Save synchronization history in the TS Log
  – Timestamp at sync points
• Log inputs (interrupts, system calls, etc.)
• Take periodic checkpoints: ReVive [Prvulovic et al., ISCA 2002]

Modes of Operation

• Normal Execution
• Re-Execution: bring the program to just before the race
• Race Analysis: pinpoint the racy accesses or discard the false positive

SigRace Re-execution Mode

• Can be done on another machine
• Periodic checkpoints of memory state are taken during normal execution
• When two signatures s1 and s2 intersect (s1 ∩ s2 ≠ Ø → Data Race), they are saved as Conflict Sigs
• Execution rolls back to the previous checkpoint and the threads (T0, T1, T2) re-execute, using the TS Log to reproduce the original synchronization order up to the race

SigRace Analysis Mode

• Re-execution continues into the epochs flagged by the Conflict Sigs
• Every load/store address is intersected against the Conflict Sig (ld ∩ Conflict Sig); matching addresses are logged
• Logging proceeds on each racing thread until its next sync point
• Pinpoints racy addresses, or
• Identifies and discards false positives
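As a rough software analogue of the per-access test, reusing the Signature sketch from above (the function and log names are illustrative):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // During analysis-mode re-execution, each load/store in a flagged
    // epoch is tested against the conflict signature; hits are logged so
    // racy address pairs can be pinpointed or dismissed as false positives.
    void onAnalysisAccess(uint64_t addr, bool isWrite,
                          const Signature& conflictSig,
                          std::vector<std::pair<uint64_t, bool>>& raceLog) {
        if (conflictSig.mayContain(addr))
            raceLog.push_back({addr, isWrite});
    }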

Outline

• Motivation
• Main Idea
• Implementation
• Results
• Conclusions

New Instructions

• collect_on
  – Enable R and W address collection in the current thread
• collect_off
  – Disable R and W address collection in the current thread
• sync_reached
  – Dump TS, R, and W to the RDM over the network
  – Clear the signatures (R = Ø, W = Ø)
  – Update TS to TS'

Modifications in Sync Libraries

• The synchronization object gains a timestamp field next to its variable
• Unlock macro: dump the signatures (sync_reached), publish the new TS into the object, append TS to the TS Log, then release:

    UNLOCK (‘{
        sync_reached
        $1.timestamp = TS
        AppendtoTSLog(TS)
        unlock($1.lock)
    }’)

• Lock macro: acquire, then join the thread's TS with the lock's timestamp:

    LOCK (‘{
        lock($1.lock)
        TS = GenerateTS(TS, $1.timestamp)
    }’)

• Transparent to application code
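Rendered as plain functions around pthreads, the macros might look like the sketch below. sync_reached, GenerateTS, and AppendtoTSLog are declared as stand-ins for the new instruction and the helpers named on the slides, and VC is the vector-clock type from the earlier sketch:

    #include <pthread.h>

    void sync_reached();                      // stand-in for the new instruction:
                                              // dump (TS, R, W), clear sigs, update TS
    VC GenerateTS(const VC& a, const VC& b);  // join clocks, advance own component
    void AppendtoTSLog(const VC& ts);         // record synchronization history

    extern thread_local VC TS;                // current thread's timestamp

    struct SyncObject {
        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        VC timestamp{};                       // new field added for SigRace
    };

    void sig_unlock(SyncObject& s) {
        sync_reached();
        s.timestamp = TS;       // publish the releaser's clock in the object
        AppendtoTSLog(TS);      // needed later to replay the sync order
        pthread_mutex_unlock(&s.lock);
    }

    void sig_lock(SyncObject& s) {
        pthread_mutex_lock(&s.lock);
        TS = GenerateTS(TS, s.timestamp);  // inherit the last releaser's clock
    }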

Other Topics in Paper

• Easy to virtualize
• Queue overflow
• Detailed HW structures

Outline

• Motivation
• Main Idea
• Implementation
• Results
• Conclusions

Experimental Setup

• PIN: binary instrumentation tool
• Benchmarks: SPLASH-2, PARSEC
• Default parameters:
  – # of processors: 8
  – Signature size: 2 Kbits
  – Block size: 2,000 instructions
  – Queue size: 16 entries
  – Checkpoint interval: 1 million instructions

Race Detection Ability

• Three configurations:
  – SigRace Default
  – SigRace Ideal: stores every signature between 2 checkpoints
  – ReEnact [Prvulovic et al., ISCA 2003]: cache-based approach with a timestamp per word

    App            Ideal     Default   ReEnact
                   SigRace   SigRace
    Cholesky          16        16        16
    Barnes            11        11         6
    Volrend           27        27        18
    Ocean              1         1         1
    Radiosity         15        15        12
    Raytrace           4         4         3
    Water-sp           8         4         2
    Streamcluster     13        12        13
    Total             95        90        70

• More coverage than ReEnact
• Coverage comparable to the ideal configuration

Injected Races

• Removed one dynamic sync operation per run
• Each application runs 25 times, each with a different sync elimination
• More overall coverage than ReEnact
  – 150% more coverage

Conclusions

• Proposed SigRace:
  – Simple HW
    • Cache and coherence protocol are unchanged
  – Higher coverage than existing HW schemes
    • Detects races even if the line is displaced/invalidated
  – Usable on-the-fly in production runs
• SigRace finds 150% more injected races than word-based ReEnact



Back Up Slides

Execution Overhead

• No overhead in generating signatures (done in HW)
• Additional instructions are negligible
• Main overheads:
  – Checkpointing (ReVive: 6.3%)
  – Network traffic (63 bytes per 1,000 instructions, compressed)
  – Re-execution (depends on false positives and race position)
    • Can be done offline

Network Traffic Overhead

[Figure: average network traffic is about 63 bytes per 1,000 instructions, less than one cache line]

Re-execution Overhead

• Instructions re-executed until the first true data race is analyzed count as overhead
• In the process, re-execution may also encounter many false-positive races
• Instructions re-executed to analyze only the true race count as true overhead
• Instructions re-executed to filter out the false positives count as false overhead

Re-execution Overhead

[Figure: re-executed instructions average 22% of the original run, a modest overhead]

False Positives

• Parallel Bloom filters with the H3 hash function

[Figure: the false positive rate is 1.57%, low]

Virtualization

• The RDM uses as many queues as there are threads
• Timestamps are accessed by thread id
• Thread id remains the same even after migration
• Timestamps, flags, and the conflict signature are saved and restored at context switches
• The RDM intersects incoming signatures against all other threads’ signatures (even inactive ones)
• Threads can be re-executed without any scheduling constraints

Scalability

• For a small number of processors, scalability is not a problem
• The operation of the RDM can be pipelined
  – Simple, repetitive operation
• Network traffic (compressed messages) is around 63 bytes per 1,000 instructions
• Checkpointing is an issue
