Memory Subsystem Performance of Programs Using Copying Garbage Collection

Authors:

Amer Diwan

David Tarditi

Eliot Moss

Presented by: Ronen Shabo

Introduction

Heap allocation with copying garbage collection is believed to have poor memory subsystem performance.

However, with the appropriate memory subsystem organization, heap allocation can have good memory subsystem performance.

Agenda

Background.

Memory subsystem

Cache

Write buffer

Page mode

CPI

Copying garbage collection

SML

Related work

Methodology

Results and Analysis

Conclusions

Cache

CPUs keep getting faster relative to DRAM memory chips.

A solution to this problem is to add a small, fast memory called a cache.

A cache works by reducing the average memory access time.

This is possible because memory accesses exhibit temporal and spatial locality.
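The effect of a cache on average access time can be illustrated with the usual textbook formula (hit time plus miss rate times miss penalty). The sketch below is only an illustration; the latencies and the miss rate are hypothetical and are not taken from the study.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers, chosen only for illustration.
DRAM_CYCLES = 15          # assumed cost of going to main memory
CACHE_HIT_CYCLES = 1      # assumed cost of a cache hit
MISS_RATE = 0.05          # assumed fraction of accesses that miss

# Without a cache, every access pays the DRAM latency.
without_cache = average_access_time(DRAM_CYCLES, 0.0, 0)
# With a cache, only the misses pay the DRAM latency on top of the hit time.
with_cache = average_access_time(CACHE_HIT_CYCLES, MISS_RATE, DRAM_CYCLES)

print(f"without cache: {without_cache:.2f} cycles per access")
print(f"with cache:    {with_cache:.2f} cycles per access")
```

The lower the miss rate (that is, the better the locality), the closer the average gets to the cache hit time.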

[Figure: cache organization, showing a set, its associativity, and blocks that each carry a tag and a valid bit and are divided into subblocks with their own valid bits (V)]

Cache Hit Policies

On read hit

Read the word from cache.

Write through:

Write the word to cache and memory.

Write back:

Write the word to cache.

Mark the block as dirty.

When a block is evicted from the cache, write it to memory if it is dirty.

Cache Miss Policies

On a read miss, the block is copied from main memory into the cache.

Write no allocate:

Do not allocate block in the cache.

Send the write to main memory, without putting the write in the cache.

Write allocate, no subblock placement:

Allocate a block in the cache.

Fetch the corresponding memory block from main memory.

Write the word to cache and to memory.

Write allocate, subblock placement:

Allocate a block in the cache.

Write the word to the cache and to memory.

Invalidate the remaining words in the block.
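The three write-miss policies can be compared with a toy model. The sketch below assumes a direct-mapped, write-through cache with 4-word blocks and one valid bit per word; the class, the policy names, and the trace are hypothetical and are not the simulator used in the study. It counts how many blocks must be fetched from main memory when freshly allocated words are initialized and then read back.

```python
from dataclasses import dataclass, field

WORDS_PER_BLOCK = 4  # assumption: 16-byte blocks of 4-byte words


@dataclass
class Line:
    tag: int = -1
    valid: list = field(default_factory=lambda: [False] * WORDS_PER_BLOCK)


class WriteThroughCache:
    """Toy direct-mapped cache; policy is one of
    'no_allocate', 'allocate', 'allocate_subblock'."""

    def __init__(self, num_lines, policy):
        self.lines = [Line() for _ in range(num_lines)]
        self.policy = policy
        self.block_fetches = 0   # blocks read from main memory
        self.memory_writes = 0   # words written through to main memory

    def _locate(self, word_addr):
        block = word_addr // WORDS_PER_BLOCK
        line = self.lines[block % len(self.lines)]
        return line, block // len(self.lines), word_addr % WORDS_PER_BLOCK

    def _fetch(self, line, tag):
        line.tag = tag
        line.valid = [True] * WORDS_PER_BLOCK
        self.block_fetches += 1

    def read(self, word_addr):
        line, tag, offset = self._locate(word_addr)
        # Simplification: a read of an invalid word refetches the whole block.
        if line.tag != tag or not line.valid[offset]:
            self._fetch(line, tag)

    def write(self, word_addr):
        line, tag, offset = self._locate(word_addr)
        self.memory_writes += 1            # write-through: always update memory
        if line.tag == tag:                # block already present in the cache
            line.valid[offset] = True
        elif self.policy == "allocate":    # fetch the block, then write into it
            self._fetch(line, tag)
        elif self.policy == "allocate_subblock":
            line.tag = tag                 # claim the block without fetching it
            line.valid = [False] * WORDS_PER_BLOCK
            line.valid[offset] = True      # only the written word is valid
        # 'no_allocate': the cache is left unchanged


def count_fetches(policy, trace, num_lines=8):
    cache = WriteThroughCache(num_lines, policy)
    for op, addr in trace:
        (cache.write if op == "w" else cache.read)(addr)
    return cache.block_fetches


# Initialize 8 newly allocated words, then read them back.
trace = [("w", a) for a in range(8)] + [("r", a) for a in range(8)]
for policy in ("no_allocate", "allocate", "allocate_subblock"):
    print(policy, count_fetches(policy, trace))
```

On this trace, the two policies without subblock placement fetch the same number of blocks (one policy pays on the write, the other on the later read), while subblock placement avoids the fetches altogether.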

Memory Subsystem

Write buffer:

A queue containing writes that are to be sent to main memory.

Page mode:

Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to the same DRAM page.

CPI (cycles per useful instruction):

The number of CPU cycles needed to complete a program divided by the total number of useful instructions.
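As a small illustration of the CPI definition, the sketch below divides cycles by useful instructions, treating every instruction other than a nop as useful (the convention stated in the Methodology slide). The function name and the counts are hypothetical.

```python
def cycles_per_useful_instruction(total_cycles, total_instructions, nops):
    """CPI as defined on the slide: cycles divided by useful instructions."""
    useful = total_instructions - nops   # nops do no useful work
    return total_cycles / useful


# Hypothetical counts, for illustration only.
example_cpi = cycles_per_useful_instruction(
    total_cycles=1500, total_instructions=1100, nops=100
)
print(example_cpi)
```

Because nops are excluded from the denominator, a program padded with nops gets a higher CPI than its raw cycles-per-instruction ratio would suggest.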

Copying Garbage Collection

There are two memory areas, FROMSPACE and TOSPACE.

Memory allocation is done from FROMSPACE.

When FROMSPACE is full, all the live objects are moved from FROMSPACE to TOSPACE.

The two areas then exchange names.
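The steps above can be sketched as a toy semispace collector. This is a minimal illustration, not the SML/NJ collector: the class names are hypothetical, space is measured in objects rather than bytes, and the breadth-first scan of TOSPACE (in the style of Cheney's algorithm) is an assumption, since the slide does not say how live objects are traversed.

```python
class Obj:
    def __init__(self, *fields):
        self.fields = list(fields)  # each field is an Obj or a plain value
        self.forward = None         # set once the object has been copied


class SemispaceHeap:
    def __init__(self, capacity):
        self.capacity = capacity    # maximum number of objects per space
        self.fromspace = []         # allocation happens here
        self.tospace = []
        self.roots = []             # objects reachable by the program

    def allocate(self, *fields):
        if len(self.fromspace) >= self.capacity:
            # FROMSPACE is full: collect, keeping the pending fields alive.
            fields = self.collect(fields)
            if len(self.fromspace) >= self.capacity:
                raise MemoryError("live data fills the whole semispace")
        obj = Obj(*fields)
        self.fromspace.append(obj)
        return obj

    def collect(self, pending=()):
        """Copy live objects to TOSPACE, then exchange the two spaces."""
        self.roots = [self._copy(r) for r in self.roots]
        pending = tuple(self._copy(p) for p in pending)
        scan = 0
        while scan < len(self.tospace):      # scan objects copied so far
            obj = self.tospace[scan]
            obj.fields = [self._copy(f) for f in obj.fields]
            scan += 1
        self.fromspace, self.tospace = self.tospace, []   # exchange names
        return pending

    def _copy(self, value):
        if not isinstance(value, Obj):
            return value                     # plain values are not moved
        if value.forward is None:            # first visit: copy the object
            value.forward = Obj(*value.fields)
            self.tospace.append(value.forward)
        return value.forward


heap = SemispaceHeap(capacity=4)
a = heap.allocate(1)
b = heap.allocate(a)      # b points to a
heap.roots.append(b)      # only b (and therefore a) is live
heap.allocate(2)          # garbage
heap.allocate(3)          # garbage
heap.allocate(4)          # FROMSPACE is full, so a collection runs first
print(len(heap.fromspace))
```

After the collection only the copies of `a` and `b` survive, so the new FROMSPACE holds three objects once the last allocation completes. Note that the program must reach objects through `heap.roots` afterwards, because the old references point into the discarded space.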

Generational Copying GC

Objects are split into multiple areas by age.

Areas holding older objects are scanned less frequently.

Long-surviving objects are copied to the older generations' areas.

SML: Standard ML

Call by value

Safe

Polymorphic

Functional

Garbage collection

SML/NJ compiler

Makes allocation cheap and function calls fast.

Allocation is done in-line.

Aggressive β-reduction (in-lining of function calls) is used.

Extensive use of registers.

Procedure activation records are allocated on the heap instead of the stack.

Related work

Advantages of this work:

This work distinguishes between read misses and write misses and their penalties.

Previous work used overall miss ratios.

This work models the entire memory subsystem, including the write buffer and DRAM page mode.

Previous work did not model the entire memory subsystem.

A study of the effect of cache write policies on the performance of C and Fortran programs reached conclusions that support ours: write allocate with subblock placement is the preferred architecture.

Methodology

Tools:

QPT - Used to produce memory traces for SML/NJ programs.

Tycho - Used for the memory subsystem simulation.

Performance:

Performance numbers are given in CPI. All instructions other than nops are considered useful.

Benchmarks:

The benchmarks are the eight programs listed in the following table:

Program   Description

CW        The Concurrency Workbench, a tool for analyzing networks of finite-state processes expressed in Milner's Calculus of Communicating Systems.

Leroy     An implementation of the Knuth-Bendix completion algorithm.

Lexgen    A lexical-analyzer generator, processing the lexical description of Standard ML.

Life      The game of Life implemented using lists.

PIA       The Perspective Inversion Algorithm, which decides the location of an object in a perspective video image.

Simple    A spherical fluid-dynamics program.

VLIW      A Very-Long-Instruction-Word instruction scheduler.

YACC      An implementation of an LALR(1) parser generator, processing the grammar of Standard ML.

Benchmarks (cont.)

Program   Inst Fetches   Allocations (words)

CW        523,245,978    56,467,440

Leroy     312,086,438    67,733,930

Lexgen    328,422,283    33,046,349

Life      413,536,662    37,849,681

PIA       122,215,151    13,047,041

Simple    604,611,016    67,261,664

VLIW      399,812,033    59,496,919

YACC      133,043,324    17,015,250
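The table can be turned into per-program allocation rates. The sketch below assumes 4-byte words and treats the instruction-fetch count as the number of instructions executed; both are assumptions, not statements from the slides.

```python
BYTES_PER_WORD = 4  # assumption: 32-bit words

# (instruction fetches, words allocated), copied from the table above.
benchmarks = {
    "CW":     (523_245_978, 56_467_440),
    "Leroy":  (312_086_438, 67_733_930),
    "Lexgen": (328_422_283, 33_046_349),
    "Life":   (413_536_662, 37_849_681),
    "PIA":    (122_215_151, 13_047_041),
    "Simple": (604_611_016, 67_261_664),
    "VLIW":   (399_812_033, 59_496_919),
    "YACC":   (133_043_324, 17_015_250),
}

# Bytes allocated per instruction fetched, for each benchmark.
rates = {
    name: words * BYTES_PER_WORD / fetches
    for name, (fetches, words) in benchmarks.items()
}

for name, rate in rates.items():
    print(f"{name:7s} {rate:.2f} bytes per instruction")
print(f"range: {min(rates.values()):.2f} to {max(rates.values()):.2f}")
```

Under these assumptions the rates fall roughly in the 0.4-0.9 bytes per instruction range quoted in the conclusions.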

Memory Subsystem Simulation

The memory subsystem features and penalties used in this study are restricted to those of currently popular RISC workstations.

All simulations use:

A write buffer (depth 6)

Page mode

Separate data and instruction caches

A write-through policy

The simulations vary:

Cache size: 8K-128K

Direct-mapped and two-way set-associative caches (with LRU replacement)

Block sizes of 16 and 32 bytes

Write allocate versus write no allocate

Subblock placement versus no subblock placement

Results and Analysis

Analysis of SML/NJ programs:

Programs do heap allocation at a rate of 0.2-0.4 words per instruction.

The majority of writes are initialization writes.

Writes come in bursts, as they initialize newly allocated areas.

An aggressive write policy is necessary:

To avoid waiting for writes to memory, a write buffer and fast page mode are needed.

To avoid reading a cache block on a write miss, a write-allocate cache policy with subblock placement is needed.

Results

[Figure: summary graphs for CW]

[Figure: results for CW with write allocate, subblock placement, block size 16]

Conclusions

Write miss policy and subblock placement:

It is clear from this study that the best cache organization is write allocate with subblock placement. (Surprisingly, for direct-mapped caches larger than 64K, the memory subsystem overhead of SML/NJ programs is acceptable: less than 16%.)

The performance of write allocate / no subblock is almost identical to that of write no allocate / no subblock. (An address is read soon after being written, even for an 8K cache. Since our programs allocate 0.4-0.9 bytes per instruction, the read of a block occurs within 9K-20K instructions.)

Associativity:

Increasing the associativity improves the CPI. (This improvement is smaller than the one obtained from subblock placement.)

Higher associativity improves instruction cache performance but has little impact on the data cache. (Much of the instruction cache penalty is due to conflict misses, while the data cache penalty is due to capacity misses.)
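The 9K-20K figure in the write-miss discussion above appears to follow from the allocation rate: at 0.4-0.9 bytes per instruction, allocation sweeps through 8K bytes of fresh memory in roughly 9K to 20K instructions. The sketch below checks that arithmetic; reading the figure as an instruction count for an 8K cache is an interpretation, and the function name is hypothetical.

```python
CACHE_BYTES = 8 * 1024  # the 8K cache mentioned in the conclusions


def instructions_to_sweep(cache_bytes, bytes_per_instruction):
    """Instructions executed while allocation advances by one cache size."""
    return cache_bytes / bytes_per_instruction


fast = instructions_to_sweep(CACHE_BYTES, 0.9)   # highest allocation rate
slow = instructions_to_sweep(CACHE_BYTES, 0.4)   # lowest allocation rate
print(f"{fast:.0f} to {slow:.0f} instructions")
```

A block that is read within this window is still needed in the cache, so fetching it on the write miss (write allocate, no subblock) or on the later read miss (write no allocate) costs about the same.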

Conclusions

Block size:

Increasing the block size from 16 to 32 bytes improves performance.

Cache size:

Increasing the cache size improves performance. Most of the CPI improvement comes from the instruction cache. (From related work, we expect a sharp improvement once the cache can fit the allocation area; 512K is large enough for most benchmarks.)

Write buffer:

A six-deep write buffer with page mode is sufficient to absorb the bursty writes. (Their contribution to CPI is negligible.)
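The interaction between a six-deep write buffer and page mode can be illustrated with a toy timing model. Everything in the sketch below is an assumption made for illustration: the latencies (1 cycle for a write to the same DRAM page, 5 cycles otherwise), the page size, and the burst of one write per cycle are hypothetical and are not the parameters of the study.

```python
def write_stall_cycles(trace, depth=6, page_bytes=4096,
                       same_page_cycles=1, new_page_cycles=5):
    """trace: list of (issue_cycle, byte_address) sorted by issue_cycle.
    Returns the total cycles the CPU stalls on a full write buffer."""
    retire = []        # retire[i]: cycle at which write i has reached memory
    delay = 0          # stall cycles accumulated so far
    last_page = None
    for i, (cycle, addr) in enumerate(trace):
        enter = cycle + delay
        # The buffer is full if the write issued `depth` writes ago
        # has not yet drained to memory.
        if i >= depth and retire[i - depth] > enter:
            delay += retire[i - depth] - enter
            enter = retire[i - depth]
        page = addr // page_bytes
        latency = same_page_cycles if page == last_page else new_page_cycles
        # Memory handles writes in FIFO order, one at a time.
        start = max(enter, retire[-1]) if retire else enter
        retire.append(start + latency)
        last_page = page
    return delay


# A burst of 16 initializing writes to consecutive words, one per cycle.
burst = [(cycle, 4 * cycle) for cycle in range(16)]

with_page_mode = write_stall_cycles(burst)
without_page_mode = write_stall_cycles(burst, same_page_cycles=5)
print(with_page_mode, without_page_mode)
```

In this model the sequential burst never fills the buffer when page-mode writes are fast, whereas charging every write the full latency makes the CPU stall, which is consistent with the slide's point that the buffer and page mode together absorb bursty initialization writes.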

Summary

An in-depth study of the memory subsystem was made, and the results show that:

Programs with intensive heap allocation performed poorly on most memory subsystems.

However, on some machines (such as the DECstation 5000/200) the performance was good.

The most crucial parameter for good performance was subblock placement; with it, the overhead was under 16% for caches bigger than 64K.

Associativity and cache size (up to 128K) were more important for the instruction cache.

Higher associativity and larger block size made only a small contribution.
