TRANSCRIPT

Parallel GC (Chapter 14)
Eleanor Ainy, December 16th 2014
Outline of Today’s Talk
How to use parallelism in each of the 4 components of tracing GC:• Marking• Copying• Sweeping• Compaction
3
Till now …
Multiple mutator threads
But only 1 collector thread
Poor use of resources!
Assumption remains: No mutators run in parallel to the collector!
Introduction
4
Parallel vs. Non-Parallel Collection
MutatorCollection
Cycle 1Collection
Cycle 2
Introduction
5
The Goal
To reduce:• Time overhead of garbage collection• Pause times in case of stop-the-world collection
Introduction
6
Parallel GC Challenges
Ensure there is sufficient work to be done. Otherwise it’s not worth it!
Load balancing – distribute work & other resources in a way that minimizes the coordination needed.
Synchronization – needed for both correctness and to avoid repeating work.
Introduction
7
More on Load Balancing
Static Partitioning• Some processors will probably have more work to do compared to others.
• Some processors will exhaust their resources before others do.
Introduction
8
Dynamic Load Balancing
• Sometimes it is possible to obtain a good estimate of the amount of work to be done in advance.
• More often it is not possible to estimate that.
Solution:
(1) Over-partition the work into more tasks.
(2) Have each thread compete to claim one task at a time to execute.
Advantages:
(1) More resilient to changes in the number of processors available.
(2) If one task takes longer to execute, other threads can execute any further work.
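The claim-one-task-at-a-time scheme can be sketched as follows. This is our own minimal Python illustration (the function name and the use of a lock in place of an atomic fetch-and-add are our assumptions, not from the slides):

```python
import threading

def run_overpartitioned(tasks, num_workers):
    """Dynamic load balancing sketch: over-partition work into many small
    tasks, then let each worker thread compete to claim one task at a time."""
    next_task = 0                      # index of the next unclaimed task
    lock = threading.Lock()            # stands in for an atomic increment
    results = [None] * len(tasks)

    def worker():
        nonlocal next_task
        while True:
            with lock:                 # atomically claim one task
                if next_task >= len(tasks):
                    return             # no work left
                i = next_task
                next_task += 1
            results[i] = tasks[i]()    # execute outside the lock

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Over-partition: 16 small tasks for 4 workers, so faster workers claim more.
tasks = [(lambda k=k: k * k) for k in range(16)]
print(run_overpartitioned(tasks, 4))   # squares of 0..15, in order
```

Because tasks are claimed one at a time, a worker that draws a long task simply claims fewer tasks overall, which is the resilience property described above.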
More on Load Balancing
Introduction
9
Why not divide the work into the smallest possible independent tasks?
The coordination cost is too expensive!
Synchronization guarantees correctness and avoids unnecessary work, but has time & space overheads!
Algorithms try to minimize the synchronization needed by using thread-local data structures, for instance.
More on Load Balancing
Introduction
10
Processor-centric VS. Memory-centric
Processor-centric algorithms:
• threads acquire work that varies in size
• threads steal work from other threads
• little regard to the location of the objects

Memory-centric algorithms:
• take location into greater account
• operate on contiguous blocks of heap memory
• acquire/release work from/to shared pools of fixed-size buffers of work
Introduction
11
Algorithms’ Abstraction
Assumption: Each collector thread executes the following loop (*):

while not terminated()
    acquireWork()
    performWork()
    generateWork()
(*) in most cases.
Introduction
12
Outline of Today’s Talk
How to use parallelism in each of the 4 components of tracing GC:• Marking• Copying• Sweeping• Compaction
13
Marking comprises…
1) Acquisition of an object from a work list
2) Testing & setting marks
3) Generating further marking work by adding the object's children to the work list
Parallel Marking
14
Important Note
All known parallel marking algorithms are processor-centric!
Parallel Marking
15
When is Synchronization Required?
No synchronization: if the work list is thread-local.
Example: when an object's mark is represented by a bit in its header.

Synchronization needed: otherwise, the thread must acquire work atomically from some other thread's work list or from some global list.
Example: when marks are stored in a shared bitmap.
Parallel Marking
16
Endo et al [1997] Parallel Mark Sweep Algorithm
N – total number of threads
Each marker thread has its own:
• local mark stack
• stealable work queue

shared stealableWorkQueue[N]
me ← myThreadId

acquireWork():
    if not isEmpty(myMarkStack)
        return
    stealFromMyself()
    if isEmpty(myMarkStack)
        stealFromOthers()
Parallel Marking
17
An idle thread acquires work by first examining its own queue and then other threads' queues.

stealFromMyself():
    lock(stealableWorkQueue[me])
    n ← size(stealableWorkQueue[me]) / 2
    transfer(stealableWorkQueue[me], n, myMarkStack)
    unlock(stealableWorkQueue[me])
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
18
stealFromOthers():
    for each j in Threads
        if not locked(stealableWorkQueue[j])
            if lock(stealableWorkQueue[j])
                n ← size(stealableWorkQueue[j]) / 2
                transfer(stealableWorkQueue[j], n, myMarkStack)
                unlock(stealableWorkQueue[j])
                return
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
19
performWork():
    while pop(myMarkStack, ref)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                push(myMarkStack, child)
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
20
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
[Figure: Threads A and B, with their stacks and queues, both reach the same child C1 through pointers P1 and P2]
Notice: it is possible for threads to mark the same child object.
21
Each thread checks its own stealable work queue. If it is empty, it transfers its entire mark stack (apart from local roots) to the queue.

generateWork():
    if isEmpty(stealableWorkQueue[me])
        n ← size(myMarkStack)
        lock(stealableWorkQueue[me])
        transfer(myMarkStack, n, stealableWorkQueue[me])
        unlock(stealableWorkQueue[me])
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
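The acquire/perform/generate loop above can be sketched in Python. This is our own single-thread illustration (not from the slides): the mark stack and stealable queue are plain lists, and the object graph is a dict from object to children.

```python
# A minimal single-thread sketch of the Endo-style loop: a private mark
# stack plus a stealable work queue, over a dict-based object graph.
def mark_endo(roots, children, stealable_queue):
    marked = set(roots)
    stack = list(roots)
    while stack or stealable_queue:
        if not stack:                        # acquireWork: take half the queue
            n = max(1, len(stealable_queue) // 2)
            stack.extend(stealable_queue[:n])
            del stealable_queue[:n]
        ref = stack.pop()                    # performWork: mark children
        for child in children.get(ref, []):
            if child is not None and child not in marked:
                marked.add(child)
                stack.append(child)
        if not stealable_queue and stack:    # generateWork: donate the stack
            stealable_queue.extend(stack)
            stack.clear()
    return marked

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(sorted(mark_endo({"A"}, graph, [])))   # ['A', 'B', 'C', 'D']
```

With several threads, the stealable queue would be locked as in the pseudocode above; here it only shows how work cycles between the private stack and the donation queue.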
22
Parallel Marking With a Bitmap
The collector tests the bit and, only if it is not set, attempts to set it atomically, retrying if the set fails.

setMarked(ref):
    oldByte ← markByte(ref)
    bitPosition ← markBit(ref)
    loop
        if isMarked(oldByte, bitPosition)
            return
        newByte ← mark(oldByte, bitPosition)
        if CompareAndSet(&markByte(ref), oldByte, newByte)
            return
        oldByte ← markByte(ref)    /* lost the race: reload and retry */
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
CompareAndSet(x, old, new):
    atomic
        curr ← *x
        if curr = old
            *x ← new
            return true
        return false
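The test-then-CAS pattern can be sketched as follows. This is our own illustration: CPython has no hardware compare-and-set on a bytearray, so `_compare_and_set` is emulated with a lock; on real hardware it would be a single CompareAndSet instruction.

```python
import threading

class MarkBitmap:
    """Sketch of shared-bitmap marking: test the bit first, and only if it
    is clear attempt an atomic test-and-set, retrying on contention."""
    def __init__(self, nbits):
        self.bits = bytearray((nbits + 7) // 8)
        self._lock = threading.Lock()

    def _compare_and_set(self, byte_index, old, new):
        with self._lock:                      # stands in for hardware CAS
            if self.bits[byte_index] == old:
                self.bits[byte_index] = new
                return True
            return False

    def set_marked(self, ref):
        byte_index, bit = ref // 8, 1 << (ref % 8)
        while True:
            old = self.bits[byte_index]
            if old & bit:
                return False                  # already marked by someone
            if self._compare_and_set(byte_index, old, old | bit):
                return True                   # we won the race

bm = MarkBitmap(64)
print(bm.set_marked(10), bm.set_marked(10))   # True False
```

The return value tells the caller whether it won the race, so two threads reaching the same object through different pointers mark it exactly once.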
23
Termination Detection – Reminder From Previous Lecture:• Separate thread for termination detection.
• Symmetric detection – every thread can play the role of the detector.
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
24
Termination Detection – Reminder From Previous Lecture:
shared jobs[N] ← initial work assignments
shared busy[N] ← [true, …]
shared jobsMoved ← false
shared allDone ← false
me ← myThreadId
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
25
Termination Detection – Reminder From Previous Lecture:
worker():
    loop
        while not isEmpty(jobs[me])
            job ← dequeue(jobs[me])
            perform job
        if another thread j exists whose jobs set appears relatively large
            some ← stealJobs(j)
            enqueue(jobs[me], some)
            continue
        busy[me] ← false
        while no thread has jobs to steal && not allDone
            /* do nothing: wait for work or termination */
        if allDone
            return
        busy[me] ← true
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
26
Termination Detection – Reminder From Previous Lecture:
stealJobs(j):
    some ← atomicallyRemoveJobs(jobs[j])
    if not isEmpty(some)
        jobsMoved ← true
    return some
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
27
Termination Detection – Reminder From Previous Lecture:
detect():
    anyActive ← true
    while anyActive
        anyActive ← (∃i) busy[i]
        anyActive ← anyActive || jobsMoved
        jobsMoved ← false
    allDone ← true
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
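The detector's scan condition can be sketched as follows. This is our own sequential illustration (names are ours): termination is declared only when no thread is busy and no jobs moved during the scan, since a steal that races with the scan could make an already-checked thread busy again.

```python
# One detector scan of the symmetric termination condition above.
def detect_quiescent(busy, jobs_moved_flag):
    """Returns True if this scan proves termination. `jobs_moved_flag` is a
    one-element list standing in for the shared jobsMoved variable."""
    any_active = any(busy)
    any_active = any_active or jobs_moved_flag[0]
    jobs_moved_flag[0] = False          # reset for the next scan
    return not any_active

flag = [False]
print(detect_quiescent([False, False], flag))   # True: all idle, no steals
flag = [True]
print(detect_quiescent([False, False], flag))   # False: a steal raced the scan
```

The real detector loops until a scan succeeds; each failed scan clears jobsMoved so only steals that happen after the reset can invalidate the next scan.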
28
Running Example
[Figure: Threads A and B, each with a mark stack and a stealable work queue]
Initially: the queues are empty!
acquireWork – if the stack is non-empty, it returns.
Endo et al [1997] Parallel Mark Sweep Algorithm
Parallel Marking
29
Running Example
performWork pops, marks and pushes children.
[Figure: objects O1–O4 popped from and pushed onto Thread B's mark stack]
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
30
Running Example
generateWork moves all the objects from the stack to the queue!
[Figure: O2 and O3 transferred from Stack B to Queue B]
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
31
Running Example
acquireWork – if the stack is empty, moves half the queue to the stack.
[Figure: half of Queue B transferred to Stack B]
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
32
Running Example
acquireWork – if the queue is also empty, steals from other queues.
This continues until there is no more work (the detector will detect this!).
[Figure: Thread B steals from Thread A's queue]
Parallel Marking
Endo et al [1997] Parallel Mark Sweep Algorithm
33
N – total number of threads
• Each thread has its own stealable deque (double-ended queue).
• The deques are fixed size, to avoid allocation during collection; this means a deque can overflow.
• All threads share a global overflow set, implemented as a list of lists.

shared overflowSet
shared deque[N]
me ← myThreadId

Parallel Marking
Flood et al [2001] Parallel Mark Sweep Algorithm

acquireWork():
    if not isEmpty(deque[me])
        return
    n ← dequeFixedSize / 2
    if extractFromOverflowSet(n)
        return
    stealFromOthers()
34
• The Java class structure holds the head of a list of overflow objects of that type, linked through the class pointer field in their headers.
• An object's type field can be restored when it is removed from the overflow set (stop-the-world collection enables the type field to be used this way).
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
35
Idle threads acquire work by trying to fill half their deque from the overflow set before stealing from other deques.

extractFromOverflowSet(n):
    return transfer(overflowSet, n, deque[me])
Parallel Marking
Flood et al [2001] Parallel Mark Sweep Algorithm
36
Idle threads steal work from the top of others’ deques using remove.
stealFromOthers():
    for each j in Threads
        ref ← remove(deque[j])
        if ref ≠ null
            push(deque[me], ref)
            return
remove: requires synchronization!

Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
37
performWork():
    loop
        ref ← pop(deque[me])
        if ref = null
            return
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                setMarked(child)
                if not push(deque[me], child)
                    n ← size(deque[me]) / 2
                    transfer(deque[me], n, overflowSet)
                    push(deque[me], child)    /* retry after making room */

pop: requires synchronization only to claim the last element of the deque.
push: does not require synchronization.
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
38
Work is generated inside performWork by pushing to the deque or transferring to the overflow set.

generateWork():
    /* nop */
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
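The overflow mechanics of a fixed-size deque can be sketched as follows. This is our own illustration: synchronization is omitted, and a shared list stands in for the per-class overflow lists of the real algorithm.

```python
from collections import deque

class FixedDeque:
    """Sketch of a Flood-style work deque: fixed capacity, spilling half
    its contents to a shared overflow set when a push would overflow."""
    def __init__(self, capacity, overflow_set):
        self.capacity = capacity
        self.items = deque()
        self.overflow = overflow_set       # shared list standing in for the
                                           # per-class overflow lists

    def push(self, ref):
        if len(self.items) == self.capacity:
            n = self.capacity // 2         # spill half, oldest first
            for _ in range(n):
                self.overflow.append(self.items.popleft())
        self.items.append(ref)

    def pop(self):
        return self.items.pop() if self.items else None

overflow = []
d = FixedDeque(4, overflow)
for obj in ["O1", "O2", "O3", "O4", "O5"]:
    d.push(obj)
print(d.pop(), overflow)    # O5 ['O1', 'O2']
```

Spilling the oldest entries keeps the hottest (most recently pushed) work local, while idle threads can later refill their deques from the overflow set.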
39
Termination Detection
• Variation of the symmetric detection that we saw in the previous lecture.
• Status word – one bit per thread (active/inactive).

Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
40
Running Example
Initially: the deques are non-empty!
acquireWork – if the deque is non-empty, return.
[Figure: Threads A and B with their deques]
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
41
Running Example
performWork – pop, mark and push children.
[Figure: objects O1–O7 being marked and pushed onto the deques]
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
42
Running Example
performWork – if a push causes overflow, copies half the deque to the overflow set.
[Figure: Deque B overflows; objects are moved to the overflow set, linked through their class (A or B) structures]
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
43
Running Example
performWork – the overflow set in this case:
[Figure: the overflow set as per-class lists headed by the Class A and Class B structures, linked through the objects' class pointer fields]
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
44
Running Example
acquireWork – if the deque is empty, takes work from the overflow set. If that fails, removes from other threads' deques.
[Figure: Thread B steals O9 from Deque A]
Flood et al [2001] Parallel Mark Sweep Algorithm
Parallel Marking
45
• This technique is best employed when the number of threads is known in advance.
• It may be difficult for a thread:
  • to choose the best queue from which to steal
  • to detect termination

Mark Stacks With Work Stealing: Disadvantages
Parallel Marking
46
• Threads exchange marking tasks through single-writer, single-reader channels.
• In a system of N threads, each thread has an array of N−1 queues.
• The input channel from thread i to thread j is written i → j; it is also an output channel of thread i.

shared channel[N, N]
me ← myThreadId
Wu and Li [2007] Parallel Tracing With Channels
Parallel Marking
47
If the thread's stack is empty, it takes a task from some input channel k → me.

acquireWork():
    if not isEmpty(myMarkStack)
        return
    for each k in Threads
        if not isEmpty(channel[k, me])
            ref ← remove(channel[k, me])
            push(myMarkStack, ref)
            return
Wu and Li [2007] Parallel Tracing With Channels
Parallel Marking
48
Threads first try to add new tasks (marking children) to other threads' input channels (their output channels).

performWork():
    loop
        if isEmpty(myMarkStack)
            return
        ref ← pop(myMarkStack)
        for each fld in Pointers(ref)
            child ← *fld
            if child ≠ null && not isMarked(child)
                if not generateWork(child)
                    push(myMarkStack, child)

Wu and Li [2007] Parallel Tracing With Channels
Parallel Marking
49
• When a thread generates a new task, it first checks whether any other thread k needs work.
• If so, it adds the task to the output channel me → k.
• Otherwise, it pushes the task onto its own stack.

generateWork(ref):
    for each k in Threads
        if needsWork(k) && not isFull(channel[me, k])
            add(channel[me, k], ref)
            return true
    return false

Wu and Li [2007] Parallel Tracing With Channels
Parallel Marking
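The channel structure can be sketched as follows. This is our own sequential illustration (names like `needs_work` and the capacity of 2 are our assumptions): because only thread i ever adds to channel i → j and only thread j ever removes, no atomic operations are needed.

```python
from collections import deque

# Sketch of single-writer, single-reader channels: channel[i][j] carries
# tasks from thread i to thread j.
N = 3
channel = [[deque() for _ in range(N)] for _ in range(N)]
needs_work = [False] * N

def generate_work(me, ref, capacity=2):
    """Offer a new task to some idle thread's input channel; returns False
    if no one needs work, so the caller keeps it on its own mark stack."""
    for k in range(N):
        if k != me and needs_work[k] and len(channel[me][k]) < capacity:
            channel[me][k].append(ref)   # only thread `me` ever appends here
            return True
    return False

needs_work[2] = True
kept_locally = not generate_work(0, "taskA")
print(kept_locally, list(channel[0][2]))    # False ['taskA']
```

The tiny channel capacity mirrors the observation on the next slide that very short queues scaled best: a full channel simply makes the producer keep the task locally.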
50
Advantages:
• No expensive atomic operations!
• Performs better on servers with many processors.
• Keeps all threads busy.

(*) On a machine with 16 Intel Xeon processors, queues of size one or two were found to scale best.

Wu and Li [2007] Parallel Tracing With Channels
Parallel Marking
51
Outline of Today’s Talk
How to use parallelism in each of the 4 components of tracing GC:• Marking• Copying• Sweeping• Compaction
52
Copying is Different From Marking…
It is essential that an object be copied only once!
If an object is marked twice, it usually does not affect the correctness of the program.
Parallel Copying
53
Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying

Each copying thread is given its own stack and transfers work between its local stack and a shared stack.
k – size of a local stack

shared sharedStack
myCopyStack[k]
sp ← 0    /* local stack pointer */
Parallel Copying
54
Using rooms, they allow multiple threads to:
• pop elements from the shared stack in parallel
• push elements to the shared stack in parallel
But not pop and push in parallel!

shared gate ← OPEN
shared popClients     /* number of clients in the pop room */
shared pushClients    /* number of clients in the push room */

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
55
while not terminated()
    enterRoom()                   /* enter pop room */
    for i ← 1 to k
        if isLocalStackEmpty()
            acquireWork()
            if isLocalStackEmpty()
                break
        performWork()
    transitionRooms()
    generateWork()
    if exitRoom()                 /* exit push room */
        terminate()

acquireWork():
    sharedPop()

performWork():
    ref ← localPop()
    scan(ref)

generateWork():
    sharedPush()

isLocalStackEmpty():
    return sp = 0

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
56
localPush(ref):
    myCopyStack[sp++] ← ref

localPop():
    return myCopyStack[--sp]

[Figure: local stack with stack pointer sp; localPop() then localPush(ref)]

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
57
sharedPop():
    cursor ← FetchAndAdd(&sharedStack, 1)
    if cursor ≥ stackLimit
        FetchAndAdd(&sharedStack, -1)    /* stack was empty: undo */
    else
        myCopyStack[sp++] ← cursor[0]

FetchAndAdd(x, v):
    atomic
        old ← *x
        *x ← old + v
        return old

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
58
sharedPush():
    cursor ← FetchAndAdd(&sharedStack, -sp) - sp
    for i ← 0 to sp-1
        cursor[i] ← myCopyStack[i]
    sp ← 0

Parallel Copying
Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
59
enterRoom():
    while gate ≠ OPEN
        /* do nothing: wait */
    FetchAndAdd(&popClients, 1)
    while gate ≠ OPEN
        FetchAndAdd(&popClients, -1)    /* failure: return to previous state */
        while gate ≠ OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1)     /* try again */

Parallel Copying
Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
60
transitionRooms():    /* move from pop room to push room */
    gate ← CLOSED     /* close gate to pop room */
    FetchAndAdd(&pushClients, 1)
    FetchAndAdd(&popClients, -1)
    while popClients > 0
        /* do nothing: wait till none popping */

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
61
exitRoom():
    pushers ← FetchAndAdd(&pushClients, -1) - 1
    if pushers = 0                  /* last in push room */
        gate ← OPEN
        if isEmpty(sharedStack)     /* no work left */
            return true
    return false

Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Parallel Copying
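The room invariant (any number of poppers together, or any number of pushers together, never both) can be sketched as follows. This is our own illustration: it uses a condition variable for clarity instead of the FetchAndAdd spin loops above, so it is a model of the invariant, not of the lock-free implementation.

```python
import threading

class Rooms:
    """Sketch of the room abstraction: threads in the pop room and threads
    in the push room never overlap in time."""
    def __init__(self):
        self.cond = threading.Cond = threading.Condition()
        self.poppers = 0
        self.pushers = 0

    def enter_pop_room(self):
        with self.cond:
            while self.pushers > 0:       # wait until the push room empties
                self.cond.wait()
            self.poppers += 1

    def transition_to_push_room(self):
        with self.cond:
            self.poppers -= 1
            self.pushers += 1
            while self.poppers > 0:       # wait till none are still popping
                self.cond.wait()
            self.cond.notify_all()

    def exit_push_room(self):
        with self.cond:
            self.pushers -= 1
            last = self.pushers == 0
            if last:
                self.cond.notify_all()    # gate reopens for poppers
            return last

rooms = Rooms()
rooms.enter_pop_room()
rooms.transition_to_push_room()
print(rooms.exit_push_room())   # True: the last thread out reopens the gate
```

As in the pseudocode, only the last thread to leave the push room "opens the gate", which is also the only moment at which the shared stack can safely be tested for emptiness.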
62
Problem:
Any processor waiting to enter the push room must wait until all processors in the pop room have finished their work!

Possible Solution:
The work can be done outside the rooms!
This increases the likelihood that the pop room is empty, so threads will be able to enter the push room more quickly.
Parallel Copying
Processor-Centric Techniques:Cheng and Blelloch [2001] Parallel Copying
63
• Divide the heap into small, fixed-size chunks.
• Each thread receives its own chunks to scan and into which to copy survivors.
• Once a thread's copy chunk is full, it is transferred to a global pool, where idle threads compete to scan it, and a new empty chunk is obtained for the thread itself.
Parallel Copying
Memory-Centric Techniques:Block-Structured Heaps
64
Mechanisms Used To Ensure Good Load Balancing:
• Chunks acquired were small (256 words).
• To avoid fragmentation, they used big-bag-of-pages allocation for small objects.
• Larger objects and chunks were allocated from the shared heap using a lock.
Parallel Copying
Memory-Centric Techniques:Block-Structured Heaps
65
Mechanisms Used To Ensure Good Load Balancing:
• Balanced the load at a finer granularity.
• Each chunk was divided into smaller blocks (32 words).

Memory-Centric Techniques: Block-Structured Heaps
Parallel Copying
66
Mechanisms Used To Ensure Good Load Balancing:
• After scanning a slot, the thread checks whether it has reached a block boundary.
• If so, and the next object is smaller than a block:
  • the thread advances its scan pointer to the start of its current copy block;
  • this reduces contention, since the thread does not have to compete to acquire a new scan block;
  • un-scanned blocks in that area are given to the global pool.
• If the object is larger than a block but smaller than a chunk, the scan pointer is advanced to the start of its current copy chunk.
• If the object is large, the thread continues to scan it.
Parallel Copying
Memory-Centric Techniques:Block-Structured Heaps
67
Mechanisms Used To Ensure Good Load Balancing:
Parallel Copying
Memory-Centric Techniques:Block-Structured Heaps
68
Block States and Transitions:
Memory-Centric Techniques: Block-Structured Heaps
Parallel Copying
69
State Transition Logic:
Parallel Copying
Memory-Centric Techniques:Block-Structured Heaps
70
Outline of Today’s Talk
How to use parallelism in each of the 4 components of tracing GC:• Marking• Copying• Sweeping• Compaction
71
1) Statically partition the heap into contiguous blocks for threads to sweep.
2) Over-partition the heap and have threads compete for a block to sweep to a free-list.
Problem: the free-list becomes a bottleneck!
Solution: give each processor its own free-list.
Parallel Sweeping
Simple Strategies
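The over-partitioned sweep with per-processor free-lists can be sketched as follows. This is our own Python illustration (the block representation and names are ours): threads compete only for block claims, sweep without any locking, and merge their local free-lists once at the end.

```python
import threading

def parallel_sweep(blocks, num_workers):
    """Sketch: each block is a list of (object, marked) pairs. Workers
    claim blocks via a shared counter and sweep dead objects onto their
    OWN free-list, so the shared list is not a bottleneck."""
    next_block = 0
    claim_lock = threading.Lock()
    local_free = [[] for _ in range(num_workers)]

    def sweeper(wid):
        nonlocal next_block
        while True:
            with claim_lock:                  # claim the next block
                if next_block >= len(blocks):
                    return
                b = next_block
                next_block += 1
            for obj, marked in blocks[b]:     # sweep without any locking
                if not marked:
                    local_free[wid].append(obj)

    ts = [threading.Thread(target=sweeper, args=(w,)) for w in range(num_workers)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    free = []                                 # one merge per worker at the end
    for lst in local_free:
        free.extend(lst)
    return free

blocks = [[("a", True), ("b", False)], [("c", False), ("d", True)]]
print(sorted(parallel_sweep(blocks, 2)))      # ['b', 'c']
```

The single merge at the end replaces per-object contention on a shared free-list with one short critical section per worker.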
72
• A naturally parallel solution to sweeping partially full blocks.
• In the sweep phase, we need to identify empty blocks and return them to the block allocator.
• Need to reduce contention.
• Each thread is given several consecutive blocks to process locally.
• They used bitmap marking, with the bitmaps held in block headers (used to determine whether a block is empty or not).
• Empty blocks are added to a local free-block list.
• Partially full blocks are added to a local reclaim list for subsequent lazy sweeping.
• Once a processor finishes with its sweep set, it merges its local list with the global free-block list.
Parallel Sweeping
Endo et al [1997] Lazy Sweeping
73
Outline of Today’s Talk
How to use parallelism in each of the 4 components of tracing GC:• Marking• Copying• Sweeping• Compaction
74
Observation:
Uniprocessor compaction algorithms typically slide all live data to one end of the heap space.
If multiple threads do so in parallel, one thread can overwrite live data before another thread has moved it!
[Figure: two threads sliding objects toward the same end of the heap, one overwriting the other's live data]
Parallel Compaction
Flood et al [2001] Parallel Mark-Compact
75
Suggested Solution:
• Divide the heap space into several regions, one for each compacting thread.
• To reduce fragmentation, they also have threads alternate the direction in which they move objects in even- and odd-numbered regions.

Flood et al [2001] Parallel Mark-Compact
Parallel Compaction
76
4 Phases:
1) Parallel marking.
2) Calculate forwarding addresses.
3) Update references.
4) Move objects.

Flood et al [2001] Parallel Mark-Compact
Parallel Compaction
77
Phase 2 – Calculating Forwarding Addresses:
• Over-partition the space into M = 4N (N – number of threads) units of roughly the same size.
• Threads compete to claim units.
• Each thread counts the volume of live data in its unit.
• According to these volumes, they partition the space into N regions that contain approximately the same amount of live data.
• Threads compete to claim units and install forwarding addresses for each live object in their units.

Example: M = 12 units, N = 3 regions/threads
Unit live volumes: 3 6 13 7 10 5 7 5 12 4 8 9
Region live volumes: 30 29 30

Flood et al [2001] Parallel Mark-Compact
Parallel Compaction
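The partitioning step can be sketched as follows. This is our own greedy, unit-aligned illustration (the function name and the cut rule are our assumptions): the slide's regions balance more tightly because region boundaries need not coincide with unit boundaries, so a greedy split over whole units only approximates equal live volumes.

```python
def partition_into_regions(unit_live, n_regions):
    """Sketch: given the live volume counted in each unit, greedily group
    consecutive units into N regions of roughly equal live data. Cuts a
    region just before adding a unit would exceed the per-region target."""
    total = sum(unit_live)
    target = total / n_regions
    regions, current, acc = [], [], 0
    for i, v in enumerate(unit_live):
        if current and acc + v > target and len(regions) < n_regions - 1:
            regions.append(current)       # close this region
            current, acc = [], 0
        current.append(i)                 # unit i goes in the open region
        acc += v
    regions.append(current)               # last region takes the remainder
    return regions

units = [3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 8, 9]   # M = 12, N = 3
regions = partition_into_regions(units, 3)
print([sum(units[i] for i in r) for r in regions])   # [29, 27, 33]
```

Each resulting region is then compacted by one thread, so roughly equal live volumes mean roughly equal moving work per thread.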
78
Phase 3 – Updating References:
• Updating references to point to objects' new locations requires scanning:
  • objects stored in mutator threads' stacks that might contain references to objects in the heap space (young generation);
  • live objects in the heap space (old generation).
• Threads compete to claim old-generation units to scan, and a single thread scans the young generation.

Phase 4 – Moving Objects:
• Each thread is in charge of a region.
• Good load balancing is guaranteed because the regions contain roughly equal volumes of live data.

Flood et al [2001] Parallel Mark-Compact
Parallel Compaction
79
Disadvantages:
1) The algorithm makes 3 passes over the heap, while other compacting algorithms make fewer passes.
2) Rather than compacting all live data to one end of the heap, the algorithm compacts into N regions, leaving (N+1)/2 gaps for allocation. If a large number of threads is used, it is difficult for mutators to allocate very large objects.

Flood et al [2001] Parallel Mark-Compact
Parallel Compaction
80
1) Address the 3-passes problem:
• Calculate rather than store forwarding addresses, using the mark bitmap and an offset vector that holds the new address of the first live object in each block.
• To construct the offset vector, one pass over the mark-bit vector is needed.
• Only a single pass over the heap is then needed to move objects and update references using these vectors.

Abuaiadh et al [2004] Parallel Mark-Compact
Parallel Compaction
81
1) Address the 3-passes problem:
• Bits in the mark-bit vector indicate the start and end of each live object.
• Words in the offset vector hold the address to which the first live object in their corresponding block will be moved.
• Forwarding addresses are not stored, but are calculated when needed from the offset and mark-bit vectors.

Abuaiadh et al [2004] Parallel Mark-Compact
Parallel Compaction
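The calculation can be sketched as follows. This is our own simplified Python illustration: the granularity is one word per mark bit (the real algorithm marks object starts and ends), and the names are ours. A forwarding address is the block's offset-vector entry plus the count of live words preceding the object within its block.

```python
BLOCK_WORDS = 8   # words (and mark bits) per block, for this sketch

def build_offset_vector(mark_bits, to_space_start=0):
    """One pass over the mark-bit vector: offset[b] is the address that the
    first live word of block b will be moved to."""
    offset, addr = [], to_space_start
    for b in range(0, len(mark_bits), BLOCK_WORDS):
        offset.append(addr)
        addr += sum(mark_bits[b:b + BLOCK_WORDS])   # live words in block b
    return offset

def forwarding_address(old_addr, mark_bits, offset):
    """Recompute (not load) the forwarding address from the two vectors."""
    block = old_addr // BLOCK_WORDS
    live_before = sum(mark_bits[block * BLOCK_WORDS:old_addr])
    return offset[block] + live_before

# 16 words: live at 1, 2, 5 (block 0) and 9, 10 (block 1)
marks = [0, 1, 1, 0, 0, 1, 0, 0,  0, 1, 1, 0, 0, 0, 0, 0]
off = build_offset_vector(marks)
print(off, forwarding_address(9, marks, off))   # [0, 3] 3
```

Nothing is written into object headers: any thread can answer "where will this object land?" from the two read-only vectors, which is what makes the single moving/updating pass possible.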
82
2) Address the small-gaps problem:
• Over-partition the heap into fairly large areas.
• Threads race to claim the next area to compact, using an atomic operation to increment a global area index.
• If the thread succeeds, it has obtained an area to compact.
• If it fails, it tries to claim the next area.

Abuaiadh et al [2004] Parallel Mark-Compact
Parallel Compaction
83
2) Address the small-gaps problem:
• A table holds pointers to the beginning of the free space for each area.
• After winning an area to compact, a thread races to obtain an area into which it can move objects. It claims an area by trying to write null into its corresponding table slot.
• Threads never try to compact from, or into, an area whose table entry is null.
• Objects are never moved from a lower- to a higher-numbered area.
• Progress is guaranteed, since a thread can always compact an area into itself.
• Once a thread has finished with an area, it updates the area's free pointer. If an area is full, its free-space pointer will remain null.
84
2) Address the small-gaps problem:
[Figure: areas 1, 2, 3 with a table of free-space pointers (200, 1000, 1800); a claimed area's slot is set to NULL while objects A–E are compacted into it, after which its free pointer is updated]

Abuaiadh et al [2004] Parallel Mark-Compact
Parallel Compaction
85
2) Address the small-gaps problem:
Explored two ways in which objects can be moved:
a. Slide object by object.
b. To reduce compaction time, slide only complete blocks (256 bytes). Free space within each block is not squeezed out.

Abuaiadh et al [2004] Parallel Mark-Compact
Parallel Compaction
86
Discussion
• What is the tradeoff in the choice of the chunk size in parallel copying?
• What issues can parallel copying with no synchronization cause? For example, if an object is copied twice by two different threads, what can the consequences be?
[Figure: an object A copied twice by two different threads]
87
Something Extra
https://www.youtube.com/watch?v=YhKZe22tZlc
88
Conclusions & Summary
• There should be enough work to justify parallel collection.
• Need to take synchronization costs into account.
• Need to balance the load between the multiple threads.
• Learned different algorithms for marking, sweeping, copying and compaction that take all these challenges into account.
• Difference between marking and copying: marking an object twice is not so bad; copying an object twice can harm correctness.