![Page 1: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/1.jpg)
Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor
Chen-Yong Cher, Michael Gschwind
![Page 2: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/2.jpg)
Automatic Dynamic Garbage Collector
•Offload to coprocessor
•performance benefit
•host processor keeps running independently
•BDW implementation
•offload mark phase to SPE
•Prefetch and avoid cache misses
![Page 3: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/3.jpg)
Contributions
•LS and sync for offloading mark phase
•coherency of liveness info b/w host and SPU
•MFC-based SW$ to improve performance
•hybrid caching schemes for different data types
•extend SW$ and Capitulative Loads using MFC DMA
![Page 4: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/4.jpg)
Usefulness
•Application space not the right place for such job
•Explore LS-based system
•low latency
•no overhead of cache coherence protocols
![Page 5: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/5.jpg)
BDW garbage collection
• Mark- (lazy) sweep
• ambiguous roots
• DFS for all reachable pointers on the heap
• un-marked are de-allocated
• lazy sweep: avoid touching large amounts of cold data
• preferable caching behavior
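The mark phase listed above can be sketched as follows. This is a minimal illustration, not the BDW code: the heap is modeled as a dict from hypothetical object addresses to the words they contain, any word that happens to equal an object address is conservatively treated as a pointer (ambiguous roots), and the sweep is done eagerly for brevity rather than lazily.

```python
def mark(heap, roots):
    """Conservative mark phase: depth-first traversal with an explicit mark stack."""
    marked = set()
    # Ambiguous roots: keep only the words that look like heap addresses.
    stack = [w for w in roots if w in heap]
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        for word in heap[obj]:
            if word in heap and word not in marked:
                stack.append(word)
    return marked


def sweep(heap, marked):
    """Eager sweep (the slide describes a lazy one): drop every unmarked object."""
    return {addr: words for addr, words in heap.items() if addr in marked}


# Hypothetical heap: 0x30 and 0x40 are unreachable from the roots.
heap = {0x10: [0x20, 7], 0x20: [0x10], 0x30: [0x20], 0x40: []}
live = mark(heap, roots=[0x10, 99])   # 99 is a non-pointer word found among the roots
heap_after = sweep(heap, live)
```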
![Page 6: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/6.jpg)
Initialization
• PPE
• offload marking phase to SPE
• send effective address to SPE (mailbox)
• SPE
• indicate ready to receive data
• sync for completion of PPE transfer
• MFC translates effective address to absolute (Segment Tables, Page Tables)
• Initiate DMA transfer (preferable)
• SPE LS -> Memory
![Page 7: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/7.jpg)
Use of Local Store
•20 KB instruction image (GC code)
•128 KB of software cache
•40 KB of header cache
•32 KB of local mark stack
•small activation record stack
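As a quick check of the budget above, the listed regions can be summed against the local store size. The 256 KB total is an assumption brought in from the Cell SPE architecture; it is not stated on the slide, and the dictionary keys are illustrative names.

```python
LS_TOTAL_KB = 256  # assumption: size of one SPE local store

# Regions as listed on the slide, in KB.
budget_kb = {
    "gc_code": 20,
    "software_cache": 128,
    "header_cache": 40,
    "mark_stack": 32,
}

used_kb = sum(budget_kb.values())
remaining_kb = LS_TOTAL_KB - used_kb  # what is left for the activation record stack
```

Under that assumption, the remainder is what the small activation record stack has to fit into.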
![Page 8: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/8.jpg)
Porting GC
• Porting BDW to SPE
• Application heap traversal (bulk of execution)
• Sync b/w PPE and SPE via mailbox (small data)
• BDW data structures
• Mark Stack
• Heap Blocks
• Header Blocks
• Traverse only live references (locality optimization)
![Page 9: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/9.jpg)
Porting GC cont.
• Pointer chasing
• poor locality on hardware cache and prefetchers
• operand buffers to improve locality
• fetch entire blocks not just reference
• object cache (for HDR blocks)
• temporary store of records
• hashed system memory effective address
• Naive implementation (baseline performance)
![Page 10: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/10.jpg)
Software caches
• Non-homogeneous and non-contiguous data structures
• exploit temporal and spatial locality
• Significant overheads on SPU
• access latency to compute and compare tags
• possible cache miss
• locate data buffer
• Poor match
• for regular and predictive access patterns
• Useful for large data sets
• statistical locality
• Adjusted in size
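The per-access overheads listed above (compute and compare the tag, handle a possible miss, locate the data buffer) show up directly in a sketch of a direct-mapped software cache. Everything here is an assumption for illustration: the line size, the number of lines, and the `bytes` object standing in for system memory reached by DMA.

```python
LINE = 128    # bytes per cache line (assumed)
NLINES = 8    # deliberately tiny cache


class SoftwareCache:
    def __init__(self, memory):
        self.memory = memory            # stands in for system memory behind the DMA engine
        self.tags = [None] * NLINES
        self.data = [None] * NLINES
        self.hits = 0
        self.misses = 0

    def load(self, addr):
        line_addr = addr // LINE        # compute the tag
        idx = line_addr % NLINES        # locate the data buffer
        if self.tags[idx] != line_addr:  # tag compare happens on every access
            self.misses += 1
            base = line_addr * LINE
            self.data[idx] = self.memory[base:base + LINE]  # modeled DMA of a whole line
            self.tags[idx] = line_addr
        else:
            self.hits += 1
        return self.data[idx][addr % LINE]


memory = bytes(range(256)) * 16         # 4096 bytes, memory[i] == i % 256
cache = SoftwareCache(memory)
values = [cache.load(a) for a in (0, 1, 130, 2, 1024 + 5)]
```

The last access maps to the same slot as address 0 and evicts it, which is the kind of conflict a size adjustment is meant to reduce.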
![Page 11: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/11.jpg)
Software caches cont.
• Hybrid design (caching strategy)
• SW$ + operand buffer
• partition blocks using SW$ or OPB
• use SW$ for small heap blocks
• reduces hit latency
• removes references with dense spatial locality to cache
• Intelligent tuning of code generation
• take advantage of Cell SPE ISA features
• ILP/DLP (not covered)
![Page 12: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/12.jpg)
Prefetching
• Large sets with poor locality
• Hide memory latency (techniques)
• Boehm
• uses mark stack for prefetching
• CHV and Capitulative Loads
• use 8-16 entry buffers (FIFO)
• CHV
• exploits parallel branches
• not a strict DFS traversal
• Capitulative Loads
• changes access ordering
• a demand load on cache hit, a prefetch on miss
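One way to read the capitulative-load bullets is sketched below: a reference that hits is processed immediately, while a miss only issues a prefetch and is parked in a small FIFO, to be processed later when it is pushed out. This is a simplified model under stated assumptions, not the authors' implementation: `cached` is a hypothetical set of resident addresses, and a prefetch is modeled as having arrived by the time its entry leaves the FIFO.

```python
from collections import deque


def traverse(addresses, cached, fifo_size=8):
    """Return the order in which addresses end up being processed."""
    pending = deque()   # deferred misses whose prefetch is in flight
    order = []
    for addr in addresses:
        if addr in cached:
            order.append(addr)          # hit: behaves like a demand load
        else:
            cached.add(addr)            # miss: issue a prefetch and move on
            pending.append(addr)
            if len(pending) > fifo_size:
                order.append(pending.popleft())  # oldest prefetch is assumed complete
    order.extend(pending)               # drain what is left at the end
    return order


order = traverse([1, 2, 3, 4, 5], cached={2, 5}, fifo_size=2)
```

The resulting order differs from the input order, which is the access reordering the slide refers to.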
![Page 13: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/13.jpg)
Prefetching cont.
• Advantage of Cell
• initiate prefetch under application control by DMA engine
• efficient for regular, dense data sets (e.g. tiled matrices)
• GC with irregular data patterns and unpredictable locality is the antithesis
• Prefetch of heap blocks for scanning pointers
• Locality: addresses are within heap block bounds
![Page 14: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/14.jpg)
Prefetching cont.
• Differences b/w prefetching on conventional procs vs. procs with LS
• addresses are binding
• granularity of prefetching
• cost of misspeculation
• cost of DMA
• Early/late arrival, replacement
• not enough work to overlap and hide the miss latency
• suffers from pollution effects and overheads
• virtual tie
![Page 15: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/15.jpg)
Data coherence and consistency management
• Sync b/w PPE and SPE is necessary
• application and control code on PPE
• bulk of mark phase on SPE
• Data are copied
• maintain consistency on updates
• hdr blocks
• lookup indices
• application heap
![Page 16: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/16.jpg)
Data coherence and consistency management cont.
• Scheme based on data usage
• SPE -> PPE only when SPE has completed work
• data sync necessary except for the mark stack
• Coherency and sync handled via MFC mailbox
• schedule mark operation
• PPE sends descriptor with part of the MS and parameters
• SPE sends back the parameters via DMA (for ACK) and mark phase begins
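The handshake in the bullets above can be modeled with two queues standing in for the mailboxes and a thread standing in for the SPE. This is only a sketch of the message order: the descriptor fields, the `heap_base` parameter and the toy heap are hypothetical, and the acknowledgement that the slide sends via DMA is modeled here as an ordinary queue message.

```python
import queue
import threading


def spe_worker(inbox, outbox, heap):
    desc = inbox.get()                        # wait for the descriptor from the PPE
    outbox.put(("ack", desc["params"]))       # echo the parameters back as acknowledgement
    marked, stack = set(), list(desc["mark_stack"])
    while stack:                              # mark phase runs on the SPE side
        obj = stack.pop()
        if obj not in marked:
            marked.add(obj)
            stack.extend(heap.get(obj, []))
    outbox.put(("done", marked))              # SPE -> PPE only once the work is complete


heap = {"a": ["b"], "b": [], "c": ["a"]}      # toy object graph; "c" is unreachable
to_spe, to_ppe = queue.Queue(), queue.Queue()
worker = threading.Thread(target=spe_worker, args=(to_spe, to_ppe, heap))
worker.start()

params = {"heap_base": 0x1000}                # hypothetical parameter block
to_spe.put({"mark_stack": ["a"], "params": params})
ack = to_ppe.get()
done = to_ppe.get()
worker.join()
```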
![Page 17: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/17.jpg)
Conclusions
•Data reference optimized strategy
• performance gain 400%-600%
• local store based data prefetch
• explore local environment
• Software controlled cache for garbage collectors
• Viable and competitive solution
• offload to coprocessor, increase utilization
![Page 18: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/18.jpg)
A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures
Jason Mars, Daniel Williams, Dan Upton, Sudeep Ghosh, Kim Hazelwood
![Page 19: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/19.jpg)
Software dynamic prefetchers
•Identify complex patterns
•high application overheads
•Unobtrusive prefetcher
•take advantage of neighbor’s underutilized cores
•Snooping, profiling, pattern detection, prefetching
![Page 20: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/20.jpg)
URPref
• Neighbor idle core observes miss patterns
• Analyze miss patterns
• continuously profile and adapt
• Use Sequitur
• pattern detection
• Perform prefetch
• first identify prefix of hot stream
• Reactive
• high cache miss rate
• Pointer chasing (complex and difficult for hardware)
![Page 21: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/21.jpg)
URPref cont.
•Contributions
•profile cache misses, detect patterns, prefetch
•propose hardware extensions
•snooping, etc.
•pattern based approach to detect miss patterns, adapt phases
![Page 22: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/22.jpg)
URPref Support
• FIFO Snoopy buffer
• Profiling on separate core
• 2 more ISA instructions
• OS must be aware of URPref
• Resume, easy halted, suspend, wait, late start
• Linear time pattern detection algorithm for miss patterns on cache misses
• Sequitur
![Page 23: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/23.jpg)
Detecting Hot Streams with Sequitur
• Hot cache miss
• length of miss stream
• frequency in the input sample window
• often large percentage in a small portion
• Snoopy sends each new miss to Sequitur
• determine if it forms a prefix
• if it matches, fetch the remainder of the hot stream into the neighboring L1 cache
![Page 24: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/24.jpg)
Detecting Hot Streams with Sequitur cont.
• Profile window
• series of cache misses
• Data miss stream
• sequence of data cache miss addresses that repeats in a profile window
• Hot stream
• a given percentage of the profiling window
• Sequitur builds a context free grammar of the cache miss patterns
• each production represents some sequence repeated more than once
![Page 25: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/25.jpg)
Sequitur
• Hottest streams used for prefetching
• hotness = uses x misses
• # times rule used in a grammar (cold uses)
• # of terminal symbols
• Detecting patterns
• actual cache addresses
• deltas
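The difference between detecting patterns on actual addresses and on deltas can be illustrated with a short sketch. The two miss sequences are made up: they walk different regions with the same stride, so they share no addresses but have identical delta sequences.

```python
def deltas(addrs):
    """Differences between consecutive miss addresses."""
    return [b - a for a, b in zip(addrs, addrs[1:])]


# Hypothetical miss streams from two traversals with the same 64-byte stride.
walk_a = [0x1000, 0x1040, 0x1080]
walk_b = [0x8000, 0x8040, 0x8080]

same_addresses = walk_a == walk_b
same_deltas = deltas(walk_a) == deltas(walk_b)
```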
![Page 26: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/26.jpg)
Complex patterns
![Page 27: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/27.jpg)
Using Sequitur
• Determine data b/w adjacent misses
• hotness = length x cold uses
• sum of the # of terminals for a given non-terminal
• # of times the non-terminal appears in the grammar
• prefix
• first few symbols of the hot stream
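The hotness formula above can be computed directly from a grammar. The sketch below does not run Sequitur; it takes a small hand-written grammar (an assumption for illustration, with uppercase keys as non-terminals and every other symbol as a terminal miss address) and evaluates length, cold uses and hotness as the bullets define them.

```python
def length(grammar, symbol):
    """Number of terminals a symbol expands to."""
    if symbol not in grammar:
        return 1                       # a terminal stands for one miss
    return sum(length(grammar, s) for s in grammar[symbol])


def cold_uses(grammar, nonterminal):
    """Number of times the non-terminal appears on a right-hand side."""
    return sum(rhs.count(nonterminal) for rhs in grammar.values())


def hotness(grammar, nonterminal):
    return length(grammar, nonterminal) * cold_uses(grammar, nonterminal)


# Hypothetical grammar: S is the start rule covering the whole profile window.
grammar = {
    "S": ["A", "x", "A", "y", "B", "A"],
    "A": ["a", "B"],
    "B": ["b", "c"],
}
hot = {nt: hotness(grammar, nt) for nt in grammar if nt != "S"}
prefix = grammar["A"][:1]              # first symbol(s) of the hottest rule, before expansion
```

With this grammar, `A` has length 3 and appears 3 times, while `B` has length 2 and appears twice, so `A` is the hotter stream.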
![Page 28: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/28.jpg)
Using Sequitur cont.
• Fill window of misses from snoopy
• dynamic size -> normalize # of hot streams
• Keep track of last 4 misses
• Active hot stream table
• Receive misses from Snoopy
• create prefix window
• search on active HS table
• if a match is found, prefetch it
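The matching steps above can be sketched as a small table lookup. The class name, the prefix length of 2 and the single-letter "addresses" are assumptions for illustration; the history of 4 misses follows the slide.

```python
from collections import deque


class HotStreamMatcher:
    def __init__(self, hot_streams, prefix_len=2, history=4):
        # Active hot stream table: prefix -> remainder of the stream to prefetch.
        self.table = {tuple(s[:prefix_len]): list(s[prefix_len:]) for s in hot_streams}
        self.prefix_len = prefix_len
        self.recent = deque(maxlen=history)   # last few misses received from Snoopy

    def on_miss(self, addr):
        """Record a miss and return the addresses to prefetch (empty if no match)."""
        self.recent.append(addr)
        window = tuple(self.recent)[-self.prefix_len:]   # prefix window
        return list(self.table.get(window, []))


matcher = HotStreamMatcher([list("ABCDEF")])
results = [matcher.on_miss(a) for a in "XAB"]
```

Only the third miss completes the prefix `A, B`, at which point the rest of the stream is returned for prefetching.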
![Page 29: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/29.jpg)
Prefetching
• New hot streams are added to the table
• hot streams no longer active are retired
• No hot streams found
• dynamic prefetcher remains dormant
• Hot streams become cold
• dynamic prefetcher stops
• Avoid conservative approach
• Application code doesn’t change
• No prefetch instructions are added
• no overhead
• No sync. b/w prefetcher and app core is needed
• no regard for latency when the tail of the stream is prefetched
![Page 30: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor](https://reader035.vdocuments.net/reader035/viewer/2022070406/56814084550346895dac0c79/html5/thumbnails/30.jpg)
Prefetcher: Adaptive Response
• Phase change
• avoid cache pollution
• incorrect prefetching
• achieve maximum performance
• Continuous profiling (one window latency)
• prefetch happens with prefixes from the previous window
• ABCDEF
• ABCGHI