Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Chen-Yong Cher, Michael Gschwind

Post on 09-Jan-2016

Page 1: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Chen-Yong Cher, Michael Gschwind

Page 2: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Automatic Dynamic Garbage Collector

•Offload to coprocessor

•performance benefit

•host processor keeps running independently

•BDW implementation

•offload mark phase to SPE

•Prefetch and avoid cache misses

Page 3: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Contributions

•LS and sync for offloading mark phase

•coherency of liveness info b/w host and SPU

•MFC-based SW$ to improve performance

•hybrid caching schemes for different data types

•extend SW$ and Capitulative Loads using MFC DMA

Page 4: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Usefulness

•Application space is not the right place for such a job

•Explore LS-based system

•low latency

•no overhead of cache coherence protocols

Page 5: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

BDW garbage collection

• Mark and (lazy) sweep

• ambiguous roots

• DFS for all reachable pointers on the heap

• unmarked objects are deallocated

• lazy sweep: avoid touching large amounts of cold data

• preferable caching behavior
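A minimal sketch of mark and lazy sweep, not the BDW code itself. The toy `heap` dictionary, the addresses, and the helper names `mark` and `lazy_sweep` are assumptions made for illustration. Any word that happens to equal a valid object address is treated as a pointer, which mirrors the ambiguous-roots idea.

```python
# Toy heap: address -> list of words stored in that object (hypothetical data).
# A word that equals a valid object address is treated as a pointer;
# everything else is plain data.
heap = {
    0x100: [0x200, 7],
    0x200: [0x300],
    0x300: [42],
    0x400: [0x100],   # not reachable from the roots below
}

def mark(roots, heap):
    """Depth-first mark with an explicit mark stack; returns marked addresses."""
    marked = set()
    stack = [w for w in roots if w in heap]   # filter ambiguous roots
    while stack:
        addr = stack.pop()
        if addr in marked:
            continue
        marked.add(addr)
        for word in heap[addr]:
            if word in heap and word not in marked:
                stack.append(word)
    return marked

def lazy_sweep(heap, marked):
    """Generator: reclaims unmarked objects one at a time, only when asked."""
    for addr in sorted(heap):
        if addr not in marked:
            del heap[addr]
            yield addr

marked = mark([0x100, 12345], heap)     # 12345 looks like data, so it is ignored
freed = list(lazy_sweep(heap, marked))  # forcing the generator sweeps everything
```

Because `lazy_sweep` is a generator, an allocator could pull one reclaimed block at a time instead of touching the whole heap at once.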

Page 6: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Initialization

• PPE

• offload marking phase to SPE

• send effective address to SPE (mailbox)

• SPE

• indicate ready to receive data

• sync for completion of PPE transfer

• MFC translates effective address to absolute (Segm. Tables, Page Tables)

• Initiate DMA transfer (preferable)

• SPE LS -> Memory

Page 7: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Use of Local Store

•20 KB instruction image (GC code)

•128 KB of software cache

•40 KB of header cache

•32 KB of local mark stack

•small activation record stack
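The budget above can be checked with simple arithmetic. The 256 KB local store size is a property of the Cell SPE and is not stated on the slide; the dictionary keys are just labels for the slide's items.

```python
LOCAL_STORE_KB = 256  # SPE local store size (not stated on the slide)

# Sizes taken from the slide, in KB.
budget_kb = {
    "instruction image (GC code)": 20,
    "software cache": 128,
    "header cache": 40,
    "local mark stack": 32,
}

used_kb = sum(budget_kb.values())
remaining_kb = LOCAL_STORE_KB - used_kb  # left for the activation record stack etc.
```

Under these numbers the four listed regions take 220 KB, which leaves 36 KB for the activation record stack and anything else.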

Page 8: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Porting GC

• Porting BDW to SPE

• Application heap traversal (bulk of execution)

• Sync b/w PPE and SPE via mailbox (small data)

• BDW data structures

• Mark Stack

• Heap Blocks

• Header Blocks

• Traverse only live references (locality optimization)

Page 9: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Porting GC cont.

• Pointer chasing

• poor locality on hardware cache and prefetchers

• operand buffers to improve locality

• fetch entire blocks not just reference

• object cache (for HDR blocks)

• temporary store of records

• hashed system memory effective address

• Naive implementation (baseline performance)

Page 10: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Software caches

• Non-homogeneous and non-continuous data structures

• exploit temporal and spatial locality

• Significant overheads on SPU

• access latency to compute and compare tags

• possible cache miss

• locate data buffer

• Poor match

• for regular and predictive access patterns

• Useful for large data sets

• statistical locality

• Adjusted in size
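To make the overheads listed above concrete (tag computation, tag compare, possible miss, locating the data buffer), here is a sketch of a direct-mapped software cache. The class name, the line and cache sizes, and the dictionary standing in for system memory are all assumptions; a real SPU implementation would fetch lines with MFC DMA.

```python
class SoftwareCache:
    """Direct-mapped software cache over a dict that stands in for system memory."""

    def __init__(self, memory, num_lines=4, line_size=16):
        self.memory = memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # one tag per cache line
        self.lines = [None] * num_lines   # data buffers
        self.hits = 0
        self.misses = 0

    def load(self, addr):
        line_addr = addr - (addr % self.line_size)             # compute tag
        index = (line_addr // self.line_size) % self.num_lines
        if self.tags[index] != line_addr:                      # compare tag
            self.misses += 1
            # Fetch the whole line; this stands in for a DMA transfer.
            self.lines[index] = [self.memory.get(line_addr + i, 0)
                                 for i in range(self.line_size)]
            self.tags[index] = line_addr
        else:
            self.hits += 1
        return self.lines[index][addr - line_addr]             # locate data

memory = {a: a * 2 for a in range(256)}
cache = SoftwareCache(memory)
values = [cache.load(a) for a in (0, 1, 2, 64, 0)]
```

Addresses 0 and 64 map to the same line in this configuration, so the final load of address 0 misses again even though it was fetched earlier.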

Page 11: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Software caches cont.

• Hybrid design (caching strategy)

• SW$ + operand buffer

• partition blocks using SW$ or OPB

• use SW$ for small heap blocks

• reduces hit latency

• removes references with dense spatial locality to cache

• Intelligent tuning of code generation

• take advantage of Cell SPE ISA features

• ILP/DLP (not covered)

Page 12: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Prefetching

• Large sets with poor locality

• Hide memory latency (techniques)

• Boehm

• uses mark stack for prefetching

• CHV and Capitulative Loads

• use 8-16 entry buffers (FIFO)

• CHV

• exploits parallel branches

• not a DFS traversal

• Capitulative Loads

• changes access ordering

• a demand load on cache hit, a prefetch on miss
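One way to read the last bullet is sketched below: a reference that hits is processed immediately, while a miss only issues a prefetch and is parked in a small FIFO, so the access order changes. This is an interpretation of the slide, not the published mechanism; `traverse`, `resident`, and `fifo_size` are hypothetical names, and a prefetch is assumed to have arrived by the time it leaves the FIFO.

```python
from collections import deque

def traverse(order, resident, fifo_size=2):
    """Return the order in which addresses are actually processed.

    order:    addresses in the order the marker would like to visit them
    resident: set of addresses already present in the local cache
    """
    pending = deque()    # prefetches in flight, oldest first
    processed = []
    for addr in order:
        if addr in resident:
            processed.append(addr)            # hit: behaves like a demand load
        else:
            pending.append(addr)              # miss: issue a prefetch, move on
            if len(pending) > fifo_size:
                arrived = pending.popleft()   # oldest prefetch assumed complete
                resident.add(arrived)
                processed.append(arrived)
    while pending:                            # drain what is still in flight
        arrived = pending.popleft()
        resident.add(arrived)
        processed.append(arrived)
    return processed

visit_order = traverse(["a", "b", "c", "d", "e"], {"b", "d"}, fifo_size=1)
```

With these inputs the hits on "b" and "d" are served first, and the missing references are handled later as their prefetches age out of the FIFO.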

Page 13: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Prefetching cont.

• Advantage of Cell

• initiate prefetch under application control by DMA engine

• efficient for regular, dense data sets (e.g., tiled matrices)

• GC with irregular data patterns and unpredictable locality is the antithesis

• Prefetch of heap blocks for scanning pointers

• Locality: addresses are within heap block bounds

Page 14: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Prefetching cont.

• Differences b/w prefetching on conventional procs vs. procs with LS

• addresses are binding

• granularity of prefetching

• cost of misspeculation

• cost of DMA

• Early/late arrival, replacement

• not enough work to overlap and hide the miss latency

• suffers from pollution effects and overheads

• virtual tie

Page 15: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Data coherence and consistency management

• Sync b/w PPE and SPE is necessary

• application and control code on PPE

• bulk of mark phase on SPE

• Data are copied

• maintain consistency on updates

• hdr blocks

• lookup indices

• application heap

Page 16: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Data coherence and consistency management cont.

• Scheme based on data usage

• SPE -> PPE only when SPE has completed work

• data sync. is necessary for everything except the mark stack

• Coherency and sync. are handled via the MFC mailbox

• schedule mark operation

• PPE sends descriptor with part of the MS and parameters

• SPE sends back the parameters via DMA (for ACK) and mark phase begins
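The handshake above can be mimicked with two queues and a thread. This is only a simulation under stated assumptions: `to_spe` and `to_ppe` stand in for the mailboxes (and for the DMA used for the ACK), and the descriptor fields `mark_stack` and `params` are hypothetical names.

```python
import queue
import threading

to_spe = queue.Queue()   # stands in for the PPE -> SPE mailbox
to_ppe = queue.Queue()   # stands in for the SPE -> PPE path

def spe_worker(heap):
    descriptor = to_spe.get()                   # wait for work from the PPE
    to_ppe.put(("ack", descriptor["params"]))   # echo parameters back as ACK
    marked = set()
    stack = list(descriptor["mark_stack"])      # the part of the mark stack sent over
    while stack:
        addr = stack.pop()
        if addr not in marked:
            marked.add(addr)
            stack.extend(heap.get(addr, []))
    to_ppe.put(("done", marked))                # results return only on completion

heap = {1: [2, 3], 2: [], 3: [1], 4: []}        # toy object graph
worker = threading.Thread(target=spe_worker, args=(heap,))
worker.start()

# PPE side: schedule the mark operation, then wait for ACK and completion.
to_spe.put({"mark_stack": [1], "params": {"heap_base": 0}})
ack = to_ppe.get()
done = to_ppe.get()
worker.join()
```

The PPE sees liveness results only in the final message, which matches the "SPE -> PPE only when SPE has completed work" rule on this slide.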

Page 17: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Conclusions

•Data reference optimized strategy

• performance gain 400%-600%

• local store based data prefetch

• explore local environment

• Software controlled cache for garbage collectors

• Viable and competitive solution

• offload to coprocessor, increase utilization

Page 18: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures

Jason Mars, Daniel Williams, Dan Upton, Sudeep Ghosh, Kim Hazelwood

Page 19: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Software dynamic prefetchers

•Identify complex patterns

•high application overheads

•Unobtrusive prefetcher

•take advantage of neighbor’s underutilized cores

•Snooping, profiling, pattern detection, prefetching

Page 20: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

URPref

• Neighbor idle core observes miss patterns

• Analyze miss patterns

• continuously profile and adapt

• Use Sequitur

• pattern detection

• Perform prefetch

• first identify prefix of hot stream

• Reactive

• high cache miss rate

• Pointer chasing (complex and difficult for hardware)

Page 21: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

URPref cont.

•Contributions

•profile cache misses, detect patterns, prefetch

•propose hardware extensions

•snooping, etc.

•pattern based approach to detect miss patterns, adapt phases

Page 22: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

URPref Support

• FIFO Snoopy buffer

• Profiling on separate core

• 2 more ISA instructions

• OS must be aware of URPref

• Resume, easy halted, suspend, wait, late start

• Linear time pattern detection algorithm for miss patterns on cache misses

• Sequitur

Page 23: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Detecting Hot Streams with Sequitur

• Hot cache miss

• length of miss stream

• frequency in the input sample window

• often large percentage in a small portion

• Snoopy sends each new miss to Sequitur

• determine if it forms a prefix

• if it matches, fetch the remainder of the hot stream into the neighboring L1 cache

Page 24: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Detecting Hot Streams with Sequitur cont.

• Profile window

• series of cache misses

• Data miss stream

• a sequence of data cache miss addresses that repeats within a profile window

• Hot stream

• a data miss stream that accounts for a given percentage of the profile window

• Sequitur builds a context free grammar of the cache miss patterns

• each production represents some sequence repeated more than once

Page 25: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Sequitur

• Hottest streams used for prefetching

• hotness = uses x misses

• # times rule used in a grammar (cold uses)

• # of terminal symbols

• Detecting patterns

• actual cache addresses

• deltas
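The difference between the two representations can be shown with a small example. The helper `deltas` and the addresses are made up for illustration: two runs that touch different addresses with the same stride look unrelated as raw addresses but identical as deltas.

```python
def deltas(miss_addresses):
    """Differences between consecutive miss addresses."""
    return [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]

# Two hypothetical miss sequences with the same 64-byte stride.
run_1 = [0x1000, 0x1040, 0x1080, 0x10C0]
run_2 = [0x8000, 0x8040, 0x8080, 0x80C0]

same_addresses = run_1 == run_2
same_deltas = deltas(run_1) == deltas(run_2)
```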

Page 26: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Complex patterns

Page 27: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Using Sequitur

• Determine data b/w adjacent misses

• hotness = length x cold uses

• sum of the # of terminals for a given non-terminal

• # of times the non-terminal appears in the grammar

• prefix

• first few symbols of the hot stream
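The hotness formula above can be worked through on a toy grammar. The grammar below is hand-written for illustration and is not produced by running Sequitur; the helpers `length`, `uses`, and `hotness` are hypothetical names. Lowercase symbols are terminals (miss addresses), uppercase symbols are non-terminals.

```python
# Hand-written grammar: non-terminal -> right-hand side.
grammar = {
    "S": ["A", "x", "A", "B", "B", "A"],
    "A": ["a", "b", "B"],
    "B": ["c", "d"],
}

def length(symbol, grammar):
    """Number of terminals a symbol expands to."""
    if symbol not in grammar:
        return 1
    return sum(length(s, grammar) for s in grammar[symbol])

def uses(symbol, grammar):
    """Number of times the symbol appears on any right-hand side (cold uses)."""
    return sum(rhs.count(symbol) for rhs in grammar.values())

def hotness(grammar, start="S"):
    """hotness = length x cold uses, for every non-terminal except the start."""
    return {nt: length(nt, grammar) * uses(nt, grammar)
            for nt in grammar if nt != start}

scores = hotness(grammar)
```

Here `A` expands to 4 terminals and appears 3 times, and `B` expands to 2 terminals and appears 3 times, so `A` is the hotter stream.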

Page 28: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Using Sequitur cont.

• Fill window of misses from snoopy

• dynamic size -> normalize # of hot streams

• Keep track of last 4 misses

• Active hot stream table

• Receive misses from Snoopy

• create prefix window

• search on active HS table

• if a match is found, prefetch it
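A sketch of the matching loop described above. `HotStreamTable` and `on_miss` are hypothetical names, and using a prefix of exactly 4 misses is an assumption based on the "last 4 misses" bullet. Streams are keyed by their prefix, so this sketch also assumes that no two active streams share the same prefix.

```python
from collections import deque

class HotStreamTable:
    """Maps a hot stream's prefix to the tail that should be prefetched."""

    def __init__(self, hot_streams, prefix_len=4):
        self.table = {tuple(s[:prefix_len]): list(s[prefix_len:])
                      for s in hot_streams}
        self.recent = deque(maxlen=prefix_len)   # last few misses seen

    def on_miss(self, addr):
        """Record a miss; return the addresses to prefetch (possibly none)."""
        self.recent.append(addr)
        return self.table.get(tuple(self.recent), [])

table = HotStreamTable([[10, 20, 30, 40, 50, 60]])
prefetches = [table.on_miss(a) for a in (99, 10, 20, 30, 40)]
```

Nothing is prefetched until the four most recent misses equal a stored prefix; at that point the remainder of the stream is returned for prefetching.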

Page 29: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Prefetching

• New hot streams are added to the table

• hot streams no longer active are retired

• No hot streams found

• dynamic prefetcher remains dormant

• Hot streams become cold

• dynamic prefetcher stops

• Avoid conservative approach

• Application code doesn’t change

• No prefetch instructions are added

• no overhead

• No sync. b/w prefetcher and app core is needed

• latency is not taken into account when the tail of a stream is prefetched

Page 30: Cell CG: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor

Prefetcher: Adaptive Response

• Phase change

• avoid cache pollution

• incorrect prefetching

• achieve maximum performance

• Continuous profiling (one window latency)

• prefetch happens with prefixes from the previous window

• ABCDEF

• ABCGHI
