
Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Department of Computer Architecture
Universitat Politècnica de Catalunya – BarcelonaTech

Barcelona Supercomputing Center

01 July 2010

Ferad Zyulkyarov

PhD Thesis Proposal


Publications

• Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero, Discovering and Understanding Performance Bottlenecks in Transactional Applications, PACT'10
• Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero, Debugging Programs that use Atomic Blocks and Transactional Memory, PPoPP'10
• Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory, ICS'09
• Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server, PPoPP'09
• Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, WormBench - A Configurable Workload for Evaluating Transactional Memory Systems, MEDEA'09
• Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors, OSHMA'09
• Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero, Compiler Support for Using Transactional Memory in C/C++ Applications, INTERACT'07


Work Plan

[Figure: work-plan timeline. Task durations shown: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m. Date marker: 01/10/2010.]


Transactional Memory

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
        ...
    }


The Big Questions

• Is programming with TM easy?
• Is TM competitive with locks?
• Are existing development tools sufficient?


Atomic Quake

• Parallel Quake game server
  – All locks are replaced with atomic blocks
• 27,400 LOC of C code in 56 files
• Rich transactional application
  – 63 atomic blocks
  – Rich uses of atomic blocks
    • Library calls, I/O, error handling, memory allocation, failure atomicity
  – Various transactional characteristics
• A workload to drive research in TM


Is programming with TM easy?

• Yes.
• In large applications that have many shared objects and need efficient fine-grained synchronization
  – Example: region-based locking in tree data structures and graphs.


Where Do Transactions Fit?

Guarding different types of objects with separate locks:

    switch (object->type) {   /* Lock phase */
    case KEY:    lock(key_mutex);    break;
    case LIFE:   lock(life_mutex);   break;
    case WEAPON: lock(weapon_mutex); break;
    case ARMOR:  lock(armor_mutex);  break;
    }

    pick_up_object(object);

    switch (object->type) {   /* Unlock phase */
    case KEY:    unlock(key_mutex);    break;
    case LIFE:   unlock(life_mutex);   break;
    case WEAPON: unlock(weapon_mutex); break;
    case ARMOR:  unlock(armor_mutex);  break;
    }

With an atomic block, the lock phase and the unlock phase disappear:

    atomic {
        pick_up_object(object);
    }
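The contrast between the two styles can be sketched in Python. This is only an illustration: Python has no transactional memory, so `atomic()` below is a hypothetical stand-in built on one global lock, and `inventory`, `type_locks` and the helper functions are names invented for this sketch. A real TM system would let non-conflicting blocks run in parallel, which a global lock does not.

```python
import threading
from contextlib import contextmanager

# Hypothetical object types, mirroring the KEY/LIFE/WEAPON/ARMOR cases.
OBJECT_TYPES = ("KEY", "LIFE", "WEAPON", "ARMOR")

# Lock-based style: one lock per object type.
type_locks = {t: threading.Lock() for t in OBJECT_TYPES}

# Stand-in for `atomic { ... }`: a single global lock (an assumption of
# this sketch; it models the programming interface, not TM performance).
_atomic_lock = threading.RLock()


@contextmanager
def atomic():
    with _atomic_lock:
        yield


inventory = {t: 0 for t in OBJECT_TYPES}


def pick_up_object(obj_type):
    inventory[obj_type] += 1


def pick_up_with_locks(obj_type):
    # The caller must know which lock guards which object type.
    with type_locks[obj_type]:
        pick_up_object(obj_type)


def pick_up_atomic(obj_type):
    # The caller only marks the region; no lock selection is needed.
    with atomic():
        pick_up_object(obj_type)
```

The two styles must not be mixed on the same data, because they use different locks.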


Is TM Competitive with Locks?

• No.
  – 4-5x slowdown on the single-threaded version.
• But it promises to become competitive because it scales well.

Threads  Transactions  Aborts (Num)  Aborts (%)  Irrevocable
1        36 667        0             0.00%       17
2        75 824        241           0.42%       31
4        166 000       2 612         1.58%       85
8        477 519       76 771        25.50%      237

Callouts on the slide:
• Scales OK up to 4 threads.
• Sudden increase in aborts.


Are Existing Tools Sufficient?

• No
• We need:
  – Richer language-level primitives and integration.
  – Mechanisms to handle I/O.
  – Dynamic error handling.
  – Debuggers.
  – Profilers.


Unstructured Use of Locks

Locks

    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        LOCK(cl_msg_lock[c - svs.clients]);
        <statements2>
        if (!c->send_message) {
            <statements3>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements4>
            continue;
        }
        <statements5>
        if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
            <statements6>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements7>
            continue;
        }
        <statements8>
        if (c->state == cs_spawned) {
            if (frame_threads_num > 1) LOCK(par_runcmd_lock);
            <statements9>
            if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
        }
        UNLOCK(cl_msg_lock[c - svs.clients]);
        <statements10>
    }

Atomic Block

    bool first_if = false;
    bool second_if = false;
    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        atomic {
            <statements2>
            if (!c->send_message) {
                <statements3>
                first_if = true;
            } else {
                <statements5>
                if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
                    <statements6>
                    second_if = true;
                } else {
                    <statements8>
                    if (c->state == cs_spawned) {
                        if (frame_threads_num > 1) {
                            atomic {
                                <statements9>
                            }
                        } else {
                            <statements9>;
                        }
                    }
                }
            }
        }
        if (first_if) {
            <statements4>;
            first_if = false;
            continue;
        }
        if (second_if) {
            <statements7>;
            second_if = false;
            continue;
        }
        <statements10>
    }

Callouts on the slide:
• Extra variables and code
• Complicated conditional logic
• Solution: explicit "commit"


Various Transactional Characteristics

Per-atomic block runtime statistics from Atomic Quake.

            Dynamic Length (CPU Cycles)               Read Set (Bytes)                   Write Set (Bytes)
ID  TX#     Total        Min      Max        Avg      Total       Min    Max     Avg     Total      Min    Max     Avg
56  26,962  172,872,572  288      112,832    6,412    1,328,536   20     104     49      0          0      0       0
60  5,931   5,810,152    224      41,552     980      76,212      12     640     13      928        0      116     0
61  1,095   20,573,540   4,560    49,984     19,208   723,474     88     776     661     90         84     84      84
59  1,042   3,117,844    1,520    39,344     2,999    29,176      5      28      28      16,672     16     16      16
57  1,038   401,502,152  288,704  522,528    387,552  10,963,719  7,614  15,490  10,562  2,592,367  1,680  3,656   2,497
58  1,002   134,949,344  87,056   1,341,504  134,949  5,054,282   3,028  53,566  5,044   931,445    548    11,161  930
15  3       67,660       720      48,176     1,735    96          32     32      32      18         6      6       6
5   2       99,988       592      36,384     1,923    64          32     32      32      10         5      5       5
22  2       43,632       12,176   35,504     21,816   72          36     36      36      128        64     64      64
36  2       40,476       6,800    44,880     20,238   249         108    141     125     55         22     33      28
38  2       71,368       2,144    31,504     4,461    90          44     46      45      26         12     14      13

Callouts on the slide:
• Very small transactions
• Very large transactions
• Different execution frequency -> phased behavior.
• Control flow does not reach all atomic blocks.
• Most frequent atomic block is read-only.


Debugging Transactional Applications

• Existing debuggers are not aware of atomic blocks and transactional memory
• New principles and approaches:
  – Debugging atomic blocks atomically
  – Debugging at the level of transactions
  – Managing transactions at debug-time
• Extension for WinDbg to debug programs with atomic blocks


Atomicity in Debugging

• Step over atomic blocks as if they were a single instruction.
• Abstracts whether atomic blocks are implemented with TM or lock inference.
• Good for debugging synchronization errors at the granularity of atomic blocks rather than individual statements inside the atomic blocks.

The slide shows the same code under a non-TM aware debugger and under a TM aware debugger:

    <statement 1>
    <statement 2>
    atomic {
        <statement 3>
        <statement 4>
        <statement 5>
        <statement 6>
    }
    <statement 7>
    <statement 8>

Callout on the slide: Debugging becomes frustrating when a transaction aborts.


Isolation in Debugging

• What if we want to debug wrong code within an atomic block?
  – Put a breakpoint inside the atomic block.
  – Validate the transaction.
  – Step within the transaction.
• The user does not observe intermediate results of concurrently running transactions
  – Switch the transaction to irrevocable mode after validation.

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }


Debugging at the Level of Transactions

• Assumes that atomic blocks are implemented with transactional memory.
• Examine the internal state of the TM
  – Read/write set, re-executions, status
• TM-specific watchpoints
  – Break when a conflict happens
  – Filters
• Concurrent work with Herlihy and Lev [PACT'09].


TM Specific Watchpoints

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }

Break when a conflict happens.

Conflict Information
    Conflicting Threads: T1, T2
    Address: 0x84D2F0
    Symbol: reservation@04
    Readers: T1
    Writers: T2

Filter: Break if
    Address = reservation@04
    AND
    Thread = T2
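A watchpoint filter of this kind can be thought of as a predicate over a conflict record whose conditions are combined with AND. The following is a minimal sketch under that assumption. `ConflictInfo` and `make_filter` are hypothetical names chosen for illustration and are not part of the WinDbg extension described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictInfo:
    """Conflict record with the fields shown under "Conflict Information"."""
    threads: tuple   # conflicting threads, e.g. ("T1", "T2")
    address: int     # conflicting memory address
    symbol: str      # symbol resolved for that address
    readers: tuple
    writers: tuple


def make_filter(symbol=None, thread=None):
    """Build a break predicate; every condition that is given must hold (AND)."""
    def should_break(conflict):
        if symbol is not None and conflict.symbol != symbol:
            return False
        if thread is not None and thread not in conflict.threads:
            return False
        return True
    return should_break
```

For the conflict shown on the slide, `make_filter(symbol="reservation@04", thread="T2")` would accept the record, while the same filter with `thread="T3"` would not.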


Managing Transactions at Debug-Time

• At the level of atomic blocks
  – Debug-time atomic blocks
  – Splitting atomic blocks
• At the level of transactions
  – Changing the state of the TM system (i.e. adding and removing entries from the read/write set, changing the status, aborting)
• Analogous to the functionality of existing debuggers to change the CPU state


Example Debug Time Atomic Blocks

    <statement 1>
    <statement 2>
    <statement 3>
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>


Example Debug Time Atomic Blocks

    <statement 1>
    <statement 2>
    <statement 3>
    StartDebugAtomic
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    EndDebugAtomic
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>

User marks the start and the end of the transactions.


Issues of Profiling TM Programs

• TM applications have unanticipated overheads
  – Problem raised by Pankratius [talk at ICSE'09] and Rossbach et al. [PPoPP'10]
• Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system
  – Experience of optimizing QuakeTM, Gajinov et al. [ICS'09]


Profiling TM Programs

• Design principles
  – Report results at source language constructs
  – Abstract the underlying TM system
  – Low probe effect and overhead
• Profiling techniques
  – Conflict point discovery
  – Identifying conflicting data structures
  – Visualizing transactions


Conflict Point Discovery

• Identifies the statements involved in conflicts
• Provides contextual information
• Finds the critical path

File:Line         #Conf.  Method    Line
Hashtable.cs:51   152     Add       if (_container[hashCode]…
Hashtable.cs:48   62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53   5       Add       _container[hashCode] = n …
Hashtable.cs:83   5       Add       while (entry != null) …
ArrayList.cs:79   3       Contains  for (int i = 0; i < count; i++ )
ArrayList.cs:52   1       Add       if (count == capacity - 1) …
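A table like the one above can be produced by counting conflicts per source location. The sketch below assumes the profiler has already logged one `(file, line, method)` tuple per detected conflict. That log format and the name `conflict_points` are assumptions made for illustration, not the profiler's actual interface.

```python
from collections import Counter


def conflict_points(conflict_log):
    """Rank source locations by the number of conflicts they were involved in.

    conflict_log: iterable of (file, line, method) tuples, one per conflict
    (a hypothetical log format).
    Returns a list of ((file, line, method), count), most frequent first.
    """
    return Counter(conflict_log).most_common()
```

Sorting by count puts the statement that conflicts most often at the top, which is how the table is ordered.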


Call Context

    increment() {
        counter++;
    }

    probability80() {
        probability = random() % 100;
        if (probability < 80) {
            atomic {
                increment();
            }
        }
    }

    probability20() {
        probability = random() % 100;
        if (probability >= 80) {
            atomic {
                increment();
            }
        }
    }

Thread 1 and Thread 2 each run:

    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }

Bottom-up view
    + increment (100%)
      |---- probability80 (80%)
      |---- probability20 (20%)

Top-down view
    + main (100%)
      |---- probability80 (80%)
            |---- increment (80%)
      |---- probability20 (20%)
            |---- increment (20%)
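The bottom-up view attributes each conflict in a function to the caller that led to it. A minimal sketch of that aggregation follows, assuming the profiler records the call stack (outermost frame first) at every conflict. The function name `bottom_up_view` and the stack format are hypothetical.

```python
from collections import Counter, defaultdict


def bottom_up_view(stacks):
    """Group conflicts by the function they occurred in, then by its caller.

    stacks: list of call stacks, outermost frame first, for example
    ("main", "probability80", "increment"); one stack per conflict.
    Returns {callee: {caller: percent of all conflicts}}.
    """
    total = len(stacks)
    counts = defaultdict(Counter)
    for stack in stacks:
        callee = stack[-1]
        caller = stack[-2] if len(stack) > 1 else None
        counts[callee][caller] += 1
    return {
        callee: {caller: 100.0 * n / total for caller, n in callers.items()}
        for callee, callers in counts.items()
    }
```

With 80 conflicts recorded under `probability80` and 20 under `probability20`, the result splits `increment` into 80% and 20%, which matches the bottom-up view on the slide.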


Aborts Graph (Bayes)

[Figure: aborts graph for Bayes with nodes AB1, AB2 and AB3. Edge labels: "Conf: 73%, Wasted: 63%" and "Conf: 20%, Wasted: 29%". Node label: "72% of wasted work".]

There are 15 atomic blocks and only one of them aborts most. Which atomic blocks cause AB3 to abort?
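One way to answer that question is to aggregate abort events into a graph whose edges carry the share of conflicts and of wasted work that each aborting block causes. The sketch below assumes the profiler logs `(aborter, victim, wasted_cycles)` per abort. The event format and the name `aborts_graph` are hypothetical, and the numbers in the figure are not reproduced here.

```python
from collections import defaultdict


def aborts_graph(abort_events):
    """Build per-victim edges: who aborts whom, and how much work is lost.

    abort_events: iterable of (aborter, victim, wasted_cycles) tuples
    (a hypothetical log format).
    Returns {victim: {aborter: (conflict_percent, wasted_percent)}}.
    """
    per_victim = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for aborter, victim, wasted in abort_events:
        cell = per_victim[victim][aborter]
        cell[0] += 1        # number of aborts caused by this aborter
        cell[1] += wasted   # cycles thrown away by those aborts

    graph = {}
    for victim, edges in per_victim.items():
        total_conf = sum(c for c, _ in edges.values())
        total_wasted = sum(w for _, w in edges.values())
        graph[victim] = {
            aborter: (
                100.0 * c / total_conf,
                100.0 * w / total_wasted if total_wasted else 0.0,
            )
            for aborter, (c, w) in edges.items()
        }
    return graph
```

The conflict share and the wasted-work share can differ, because aborts that happen late in a transaction discard more work than early ones.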


Identifying Conflicting Objects

Per-Object View

    + List.cs:1 "list" (42%)
      |--- ChangeNode (20%)
           +---- Replace (12%)
           +---- Add (8%)

    1: List list = new List();
    2: list.Add(1);
    3: list.Add(2);
    4: list.Add(3);
    ...
    atomic {
        list.Replace(2, 33);
    }

[Figure: the list with nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20. Components shown: GC, Memory Allocator, DbgEng. Mapping chain shown: Object Addr 0x20, GC Root 0x08, Instr Addr 0x446290, List.cs:1.]


Transaction Visualizer (Genome)

[Figure: transaction visualizer timeline for Genome, with regions annotated "Garbage Collection" and "Wait on barrier".]

Aborts occur at the first and last atomic blocks in program order.


Overhead and Probe Effect

Legend: + Profiling Enabled, - Profiling Disabled

Normalized Execution Time

Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      1.59    1.00    1.27  1.00  1.29    1.00    1.07   1.00   1.26  1.00  0.71  1.00
2      1.00    0.56    0.97  0.67  0.97    0.58    0.64   0.61   0.83  0.59  0.60  0.55
4      0.23    0.23    0.73  0.52  0.91    0.36    0.45   0.46   0.58  0.40  0.41  0.33
8      0.21    0.20    0.73  0.55  1.57    0.38    0.72   0.56   0.53  0.34  0.33  0.22

Abort Rate in %

Thrd#  Bayes+  Bayes-  Gen+  Gen-  Intrd+  Intrd-  Labr+  Labr-  Vac+  Vac-  WB+   WB-
1      0.00    0.00    0.00  0.00  0.00    0.00    0.00   0.00   0.00  0.00  0.00  0.00
2      4.39    4.69    0.07  0.07  3.69    3.51    0.19   0.15   0.80  0.80  0.00  0.00
4      16.29   27.31   0.26  0.36  14.90   13.65   0.35   0.36   2.30  2.45  0.00  0.00
8      53.74   66.08   0.53  0.80  39.64   37.41   0.40   0.47   4.91  5.30  0.02  0.03

Callouts on the slide:
• Standard deviation for the difference 27%
• Standard deviation for the difference 3.88%
• Process data offline or during GC.


Optimization Techniques

• Moving statements
• Atomic block scheduling
• Checkpoints and nested atomic blocks
• Pessimistic reads
• Early release


Moving Statements

Will this code execute the same?

    atomic {
        counter++;
        <statement1>
        <statement2>
        <statement3>
    }

    atomic {
        <statement1>
        <statement2>
        <statement3>
        counter++;
    }

No!
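The slide does not spell out why the two blocks behave differently. One common explanation, assumed here rather than taken from the slides, is that the shared `counter` stays exposed to conflicts from its first access until the block commits, so touching it late shortens that window. The toy model below uses hypothetical per-statement costs in arbitrary units, and `exposure_window` is an illustrative name.

```python
def exposure_window(costs, shared_index):
    """Time during which the shared location can cause a conflict.

    costs: per-statement costs in arbitrary units (hypothetical values).
    shared_index: position of the statement that touches the shared location.
    The window runs from that statement to the end of the atomic block.
    """
    return sum(costs[shared_index:])
```

With `counter++` costing 1 unit and the three other statements 10 units each, the window is 31 units when the increment comes first and 1 unit when it comes last.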


Checkpoints

    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <statement7>
    }

[Figure annotations: a "Conflicts" column with the values 2%, 15%, 4% and 79% next to the block, and an "Insert Checkpoint" marker.]


Checkpoints

    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <checkpoint>
        <statement7>
    }

[Figure annotations: a "Conflicts" column with the values 2%, 15%, 4% and 79% next to the block, and an "Insert Checkpoint" marker.]

Reduced wasted work for the atomic block by 40%.
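The benefit of a checkpoint can be illustrated with a small cost model. It assumes that a conflict detected after a checkpoint rolls the transaction back only to that checkpoint instead of to the start of the block. The per-statement costs are hypothetical, `wasted_work` is an illustrative name, and the model is not meant to reproduce the 40% figure.

```python
def wasted_work(costs, conflict_index, checkpoint_index=None):
    """Work that must be redone when a conflict hits at statement conflict_index.

    costs: per-statement costs in arbitrary units (hypothetical values).
    checkpoint_index: the checkpoint sits just before this statement, or None.
    Without a usable checkpoint, everything up to and including the
    conflicting statement is redone. With a checkpoint at or before the
    conflicting statement, only the work from the checkpoint onward is redone.
    """
    start = 0
    if checkpoint_index is not None and conflict_index >= checkpoint_index:
        start = checkpoint_index
    return sum(costs[start:conflict_index + 1])
```

With seven statements of 10 units each, a conflict at the last statement wastes 70 units without a checkpoint and 10 units with a checkpoint placed just before it. A conflict that occurs before the checkpoint gains nothing from it.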


Conclusion

• Study the programmability aspects of TM
• New debugging principles and approaches for TM applications
• New profiling techniques for TM applications
• Profile-guided optimization approaches for TM applications


End