Cache-Conscious Copying Collectors. Written by: Henry J. Baker. Presented by: Eliaz Tobias


Post on 21-Dec-2015


TRANSCRIPT

Page 1: Cache-Conscious Copying Collectors

Written By: Henry J. Baker
Presented By: Eliaz Tobias

Page 2: Motivation

Garbage collectors must minimize:
– Cache space
– Off-chip communication bandwidth

Performance optimization on modern single-chip computer architectures

Page 3: Introduction

Modern processor chips are no longer “CPU-bound”, but “I/O-bound”.

Example: Multiplying large matrices [LAM 91]
– Performance increased from 0.9 MFLOPS to 4 MFLOPS through careful management of the scarce resources.

Page 4: Introduction - Cont

Numeric processing - I/O-bound. There is a need to use ‘prefetching’.

Symbolic processing - also more I/O-bound than CPU-bound.

Page 5: Introduction - Cont

The control structures of symbolic processing tend to be data-dependent. This limits the potential of prefetching strategies.

The prefetching strategy isn’t appropriate for today’s limited-bandwidth processors.

Using GC can help us optimize the space & bandwidth management strategies.

Page 6: What are we going to see?

– Cache categories
– History
– Advantages of copying garbage collectors (CGCs) & non-copying garbage collectors (NCGCs)
– Strategies to overcome the drawbacks of copying garbage collectors
– Analysis of a parallel GC system
– Case study - the Intel 80860XP architecture
– Summary

Page 7: Cache Categories

Fully associative
– a block can be mapped anywhere

Direct mapped
– a block can be mapped to exactly one cache line

n-way set associative
– a block can be mapped into 1 set (of n lines) by the formula:
  (Block address) MOD (Number of sets in cache)
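The three categories can be seen as one formula with a different number of lines per set. The sketch below is only an illustration: the function name `cache_set_index` and its parameters are made up for this example and do not come from the slides.

```python
def cache_set_index(block_address: int, num_lines: int, ways: int) -> int:
    """Return the set a memory block maps to: (block address) MOD (number of sets).

    ways == 1          -> direct mapped (each set is a single line)
    ways == num_lines  -> fully associative (one set holding every line)
    otherwise          -> n-way set associative
    """
    if ways < 1 or num_lines % ways != 0:
        raise ValueError("ways must be positive and divide num_lines")
    num_sets = num_lines // ways
    return block_address % num_sets


# Block 13 in a hypothetical 8-line cache:
direct = cache_set_index(13, num_lines=8, ways=1)      # one of 8 lines
two_way = cache_set_index(13, num_lines=8, ways=2)     # one of 4 sets
fully = cache_set_index(13, num_lines=8, ways=8)       # the single set
```

Within the chosen set the block may occupy any of the `ways` lines; which one is a replacement-policy decision that this sketch does not model.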

Page 8: History

– Minsky 1963: the 1st copying garbage collector
– Cheney 1970: the elegant 2-space model
– Baker 1978: real-time version of the Cheney algorithm

Page 9: Non-Copying Garbage Collectors - Advantages

Less address space required
– Objects never occupy 2 different locations at the same time

Doesn’t move objects behind the compiler’s back
– Compiler optimizations are not invalidated

The smaller object space and the absence of object motion improve performance on cache-based architectures.

Page 10: Copying Garbage Collectors - Advantages

– Trivial allocation of non-homogeneous objects, due to the compactness of the free area.
– Can also be used to improve the locality for virtual memory / cache purposes.
– Can expand/contract the amount of space under management more easily.
– Single-phase algorithm
– Simplicity of storage management under widely varying demands.

Page 11: The Cheney algorithm

Reminder
– The Cheney algorithm uses a from-space and a to-space.
– When the from-space is filled, a “stop & copy” is done. All live objects move to the to-space.
– The copy is done using 2 pointers in the to-space: the scan & allocate pointers.
– After the copy, the to-space & the from-space swap roles.
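The reminder above can be sketched in a few lines. This is a toy model under stated assumptions, not the presenter's code: a space is a Python list of objects, an object is a list of fields, and the names `Ptr`, `Forward` and `cheney_collect` are invented for the illustration. The allocate pointer is simply the current length of to-space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ptr:
    addr: int  # index of an object inside a space


@dataclass(frozen=True)
class Forward:
    addr: int  # to-space index of an object that has already been moved


def cheney_collect(from_space, roots):
    """Stop & copy: move everything reachable from roots into a fresh to-space."""
    to_space = []  # the allocate pointer is len(to_space)

    def forward(ptr):
        obj = from_space[ptr.addr]
        if isinstance(obj, Forward):          # already moved: follow the forwarding pointer
            return Ptr(obj.addr)
        new_addr = len(to_space)
        to_space.append(list(obj))            # copy the object, bumping the allocate pointer
        from_space[ptr.addr] = Forward(new_addr)
        return Ptr(new_addr)

    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):               # scan pointer chases the allocate pointer
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ptr):
                obj[i] = forward(field)       # fix up pointers that still aim at from-space
        scan += 1
    return to_space, new_roots
```

After the call, the caller would treat the returned list as the new from-space for the next cycle, which is the role swap described on the slide. Note that the sketch overwrites from-space objects with `Forward` markers, as a real Cheney collector does.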

Page 12: Let’s overcome the drawbacks of the copying garbage collectors

Page 13: The methods

– CDR-Coding
– Large objects are not copied
– The Ghosting Argument
– Forwarding pointers
– Forwarding pointers - Pipeline
– Parallel garbage collection

Page 14: CDR-Coding

CDR-coding is a space-saving way to store lists in memory.

Usually used in Lisp.

The idea:
– We won’t hold 2 pointers for each element in the list.
– We’ll hold a value & a 2-bit “CDR code”. The CDR code may contain one of three values: CDR-NORMAL, CDR-NEXT, and CDR-NIL.
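A small sketch of the idea, assuming the usual meaning of the three codes: CDR-NEXT says the rest of the list starts in the next cell, CDR-NIL says this is the last element, and CDR-NORMAL says the next cell holds the address of the rest. The memory model (a Python list of `(code, value)` cells) and the function names are assumptions made for this example.

```python
CDR_NORMAL, CDR_NEXT, CDR_NIL = 0, 1, 2


def encode(items, memory):
    """Append items as one contiguous CDR-coded run; return its start address."""
    if not items:
        return None                      # the empty list needs no cells
    start = len(memory)
    for i, value in enumerate(items):
        last = i == len(items) - 1
        memory.append((CDR_NIL if last else CDR_NEXT, value))
    return start


def decode(memory, addr):
    """Walk a CDR-coded list starting at addr and return its elements."""
    out = []
    while addr is not None:
        code, value = memory[addr]
        out.append(value)
        if code == CDR_NEXT:
            addr += 1                    # rest of the list is the next cell
        elif code == CDR_NIL:
            addr = None                  # end of list
        else:                            # CDR_NORMAL: next cell stores the address of the rest
            addr = memory[addr + 1][1]   # (that cell's own code is ignored in this sketch)
    return out


memory = []
tail = encode([1, 2, 3], memory)         # three cells instead of three 2-pointer pairs
memory.append((CDR_NORMAL, 0))           # cons 0 onto the existing list
memory.append((CDR_NIL, tail))           # pointer cell holding the address of the tail
head = len(memory) - 2
```

Here `decode(memory, tail)` walks three adjacent cells, while `decode(memory, head)` follows one CDR-NORMAL link before reaching them.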

Page 15: Large objects are not copied

The idea
– Modifying the page map instead

What for?
– Saving copying overhead
– Saving physical address space

Page 16: Large objects are not copied

What do we need for this?
– System support for page maps which alias the same pages to different virtual addresses
– Large objects should not share pages with other objects

Improvement
– In a properly designed system, the aliased cache locations are accessible without reloading
– Avoiding unnecessary overhead
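As a rough picture of "modifying the page map instead", the sketch below models a page map as a dictionary from virtual page number to physical frame number. Both helper names and the dictionary model are hypothetical; a real system would do this through its operating system's memory-mapping facilities.

```python
def alias_large_object(page_map, old_pages, new_pages):
    """Make the to-space pages point at the frames that already hold the object."""
    if len(old_pages) != len(new_pages):
        raise ValueError("the object must span the same number of pages")
    for old, new in zip(old_pages, new_pages):
        page_map[new] = page_map[old]    # alias: two virtual pages, one physical frame


def unmap_pages(page_map, pages):
    """Drop the from-space mappings once the flip is complete."""
    for page in pages:
        del page_map[page]


page_map = {0: 7, 1: 8}                      # a large object on virtual pages 0-1
alias_large_object(page_map, [0, 1], [10, 11])
aliased = dict(page_map)                     # both views exist during collection
unmap_pages(page_map, [0, 1])
```

No object data moves: only two page-map entries change per page, which is why the object must not share its pages with anything else.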

Page 17: The Ghosting Argument

Let’s examine the following case:
– Cheney’s algorithm is run twice in succession on exactly the same data.
– Remember that on the 2nd run, from-space & to-space switch places.
– When an object is copied to to-space (on the second run), the location accessed through the allocate pointer in to-space is at the same relative location as the scan pointer in from-space.

Page 18: The Ghosting Argument (Cont)

There is a ghost/shadow of the allocate pointer in the from-space.

As a result, the page/cache line of the allocate pointer will almost certainly be in memory: minimal cache misses.

The ghosting argument works well even for other copying garbage collectors.

Page 19: Forwarding pointers

What for?
– The forwarding pointers are used so that the ‘from-space’ and the ‘to-space’ stay isomorphic after the copy - ‘object identity’.

What if the GC won’t use them?
– The ‘to-space’ will consist of objects with only one reference (shared objects are duplicated)
– Cyclic list structures become a problem: copying them would never terminate
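The effect of dropping forwarding pointers can be shown on a toy heap where objects are nested Python lists. This is a recursive illustration rather than a Cheney-style collector, and the forwarding information lives in a side dictionary keyed by `id()`; both choices are assumptions of the sketch.

```python
def copy_without_forwarding(obj):
    """Copy a structure with no record of what was already copied."""
    if not isinstance(obj, list):
        return obj
    return [copy_without_forwarding(field) for field in obj]


def copy_with_forwarding(obj, forwarding=None):
    """Copy a structure, consulting a forwarding table before copying anything."""
    if forwarding is None:
        forwarding = {}
    if not isinstance(obj, list):
        return obj
    if id(obj) in forwarding:             # already copied: reuse the one copy
        return forwarding[id(obj)]
    new = []
    forwarding[id(obj)] = new             # install before scanning so cycles terminate
    for field in obj:
        new.append(copy_with_forwarding(field, forwarding))
    return new


shared = [1, 2]
root = [shared, shared]                   # two references to one object
split = copy_without_forwarding(root)     # the copy holds two separate objects
kept = copy_with_forwarding(root)         # the copy still shares one object

cycle = [0]
cycle.append(cycle)                       # a list that contains itself
cycle_copy = copy_with_forwarding(cycle)  # the forwarding-free version would recurse forever
```

In `split` the two fields are equal but no longer the same object, so an update through one reference would not be seen through the other; `kept` preserves the sharing.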

Page 20: Forwarding pointers

Assumptions
– Shared objects, with more than one reference, are much less predictable.
– The GC needs a way to identify those relatively few shared objects.
– Based on intuition & experience, it’s known that objects with 1 reference are functional objects.
– The reason for preserving the forwarding pointers is not relevant for these objects.

Page 21: Forwarding pointers

Conclusion 1: We can get rid of the forwarding pointers for the functional objects.

Problem: There are cases in which some functional objects have more than 1 reference. Thus, we will use more memory space & copying overhead than needed.

Solution: An efficient & incremental calculation by the GC, for functional objects, of whether it’s better to copy or share the object.

Page 22: Forwarding pointers

Conclusion 2: If a substantial amount of the objects are functional, then the GC can be made more efficient.

What about the non-functional objects?
– Permanent objects, or objects referenced from the ‘outside’, cannot become garbage. We can eliminate the copying and forwarding of these non-functional objects.

Page 23: Forwarding pointers

What about the temporary, shared, non-functional objects?
– If all the above optimizations were implemented, then most of the cache misses would be of forwarding pointers.
– The cache stores mainly forwarding pointers now.
– Solution: Maximizing the density of the forwarding pointers in the cache.
– We can also use a separate table for the forwarding pointers.

Page 24: Forwarding pointers - Pipeline

The retrieval of the forwarding pointers limits the traditional Cheney CGC.

If the bottleneck is the latency and not the bandwidth, one can perform a number of forwarding pointer lookups in a pipelined fashion.

Appel used this scheme in his vectorized CGC in 1989.

Page 25: Parallel Garbage Collectors

Page 26: Parallel Garbage Collection

Problem: The CGC increases the traffic to memory, and pushes mutator information out of the cache/main memory.

If the CGC is implemented using another processor, those problems can be solved.

Let’s focus on the real-time CGC (RTCGC) scheme that Baker suggested in 1978.

Page 27: Parallel Garbage Collection

The motivation is to let the mutator processor work while the collector processor is still copying objects from from-space to to-space.

Let’s assume that the mutator is stopped during the flip (the switch of from-space with to-space), in order for the cache to be invalidated.

Page 28: Parallel Garbage Collection

Problem: What if the mutator, while working on to-space, finds a pointer to from-space?
– It can move the object by itself, accessing from-space.
– This leads to a miss in the mutator’s cache.
– As a result, information important to the mutator is removed from the cache.

Page 29: Parallel Garbage Collection

Questions
– If the mutator finds no forwarding pointer in the object, then it must copy it (or should the collector do it?).
– Should the mutator use an allocate pointer of its own, or gain access to the collector’s allocate pointer?

Page 30: Parallel Garbage Collection

What if the architecture supports a way for the mutator to access from-space while bypassing its cache, so that the cache contents are not changed?

How does the Baker RTCGC work?
– The mutator can work without synchronizing with the collector until it accesses from-space and does not find a forwarding pointer.
– Then the mutator tells the collector to copy the object, using a mailbox (containing the address of the desired object).

Page 31: Parallel Garbage Collection

How does the Baker RTCGC work (cont)?
– The collector checks this mailbox in every cycle.
– The collector finds the object address in its mailbox.
– Then, the collector transports the object to to-space.
– Then, it installs the forwarding pointer.
– When the collector finishes, it sends a message to the mutator using a special mailbox, telling it that the collector is going to sleep.
– The mutator is responsible for waking the collector.
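The mailbox steps above can be traced with a single-threaded simulation. Everything here is an assumption made for illustration: the `Collector` class, its fields, the `asleep` flag standing in for the "going to sleep" message, and the fact that the mutator "waits" by driving the collector's cycles itself. Pointer fields inside objects are not fixed up, and no real synchronization is modeled.

```python
class Collector:
    """Toy collector: copies one object per cycle, mailbox requests first."""

    def __init__(self, from_space, pending):
        self.from_space = from_space      # address -> object (list of fields)
        self.pending = list(pending)      # addresses still waiting to be copied
        self.to_space = []
        self.forwarding = {}              # from-space address -> to-space index
        self.mailbox = None               # address requested by the mutator
        self.asleep = False               # stands in for the "going to sleep" message

    def cycle(self):
        if self.mailbox is not None:      # the mailbox is checked in every cycle
            addr, self.mailbox = self.mailbox, None
        elif self.pending:
            addr = self.pending.pop(0)
        else:
            self.asleep = True            # nothing left to copy
            return
        if addr not in self.forwarding:
            self.to_space.append(list(self.from_space[addr]))   # transport the object
            self.forwarding[addr] = len(self.to_space) - 1      # install the forwarding pointer


def mutator_read(collector, addr):
    """Mutator access to a from-space address."""
    if addr not in collector.forwarding:  # no forwarding pointer found
        collector.mailbox = addr          # ask the collector to copy this object
        collector.asleep = False          # the mutator is responsible for waking it
        while addr not in collector.forwarding:
            collector.cycle()             # single-threaded stand-in for waiting
    return collector.to_space[collector.forwarding[addr]]


collector = Collector({0: [10], 1: [11], 2: [12]}, pending=[0, 1, 2])
collector.cycle()                         # background copying moves object 0
value = mutator_read(collector, 2)        # object 2 is copied on demand, ahead of object 1
```

The design point the slides make is visible in `mutator_read`: the mutator never touches the from-space object itself, so its own cache is not disturbed by the copy.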

Page 32: Case Study - The Intel 80860XP Architecture

Characteristics
– Modern 64-bit RISC processor
– 32-bit virtual address space mapped to 32-bit physical address space
– Instructions for load/store with quad-word blocks (which take more than 1 cycle)
– 16KB on-chip cache, which can be extended into an off-chip 256KB/512KB SRAM cache
– The cache is 4-way set-associative

Page 33: Case Study - The Intel 80860XP Architecture

Characteristics
– Contains pipelined ‘Load’ instructions which bypass the data cache
– Contains ‘Store’ instructions which write to the memory, or to the cache if a cache hit occurred. They don’t use new cache space
– Uses the MESI cache coherency protocol, based on a global bus and “snooping”

Page 34: Case Study - The Intel 80860XP Architecture

Analysis
– The 64-bit bus & the quad-word usage in the load/store instructions provide substantial bandwidth to support a CGC
– The cache-bypassing capability of the load instruction lets the collector read without damaging the cache locality of the mutator
– The 4-way set associativity of the cache avoids the copy-thrashing behavior of the direct-mapped cache
– The advanced cache coherency protocol should allow a dedicated GC processor to reduce the load on the mutator processor
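The copy-thrashing point can be checked with a small cache simulation. The model is an assumption of this sketch, not a description of the 80860XP: an allocate-on-miss cache with LRU replacement, a tiny hypothetical geometry (8 lines of 4 words), and a copy loop that alternates one read from from-space with one write to to-space, where the two spaces are a multiple of the cache size apart.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, block_size):
    """Count misses of an LRU set-associative cache over a word-address trace."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block = address // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)          # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)     # evict the least recently used block
            lines[block] = True
    return misses


def copy_trace(words, to_base):
    """Alternate a read of from-space word w with a write of to-space word w."""
    trace = []
    for w in range(words):
        trace += [w, to_base + w]
    return trace


trace = copy_trace(words=32, to_base=64)      # to-space starts 2 cache sizes away
direct_misses = count_misses(trace, num_lines=8, ways=1, block_size=4)
four_way_misses = count_misses(trace, num_lines=8, ways=4, block_size=4)
```

With this trace the direct-mapped cache misses on all 64 accesses, because each from-space block and its to-space twin keep evicting each other from the same line, while the 4-way cache misses only 16 times (once per block).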

Page 35: Summary

– We have examined the interaction of the CGC with the memory system of modern RISC architectures.
– We can easily overcome the drawbacks of copying collectors, especially with cache bypassing mechanisms.
– Modern cache coherency protocols support parallel garbage collection rather nicely.

Page 36: Summary - Cont

– We studied the Intel 80860XP architecture.
– We found it is very suitable for copying garbage collection, in both single-processor and multi-processor configurations.
– We learned of methods like CDR-coding, uncopied large objects, economizing on forwarding pointers, pipelines and parallel CGCs.

Page 37: Questions…

Page 38: The End