CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented: March 19, 2008 Ankit Jain


CS 258 Parallel Computer Architecture

LimitLESS Directories: A Scalable Cache Coherence Scheme

David Chaiken, John Kubiatowicz, and Anant Agarwal

Presented: March 19, 2008

Ankit Jain


The Background & Problems

• Bus-Based Protocols
  – Do not scale because broadcasts are slow and limit parallelism
• Traditional Directory-Based Protocols
  – Monolithic Directories
    » Implicitly serialize all memory requests
  – Directory accesses consume a disproportionately large fraction of available network bandwidth
  – Full Directories are Large
    » Full Map Size: Total Memory Size * Number of Processors
  – Limited Directory Protocols
    » Allow only a limited number of simultaneous cached copies of any block of data
    » Pro: Size of directory is smaller
    » Con: Potential thrashing, since pointers are evicted and reassigned when more simultaneous copies are needed
    » Previous studies show a small set of pointers is sufficient to capture the worker-set of processors
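The size trade-off above can be made concrete with a little arithmetic. The following is a minimal sketch, not something taken from the slides: the function names, the block count, and the choice of four pointers are all illustrative assumptions. It counts directory bits for a full map (one presence bit per processor per block) and for a limited directory (a fixed number of pointers per block, each wide enough to name one processor).

```python
def full_map_bits(num_blocks: int, num_procs: int) -> int:
    """Full map: one presence bit per processor for every memory block."""
    return num_blocks * num_procs


def limited_dir_bits(num_blocks: int, num_procs: int, num_pointers: int) -> int:
    """Limited directory: num_pointers pointers per block.

    Each pointer needs enough bits to name any one processor.
    """
    pointer_width = max(1, (num_procs - 1).bit_length())
    return num_blocks * num_pointers * pointer_width


if __name__ == "__main__":
    blocks = 2 ** 20  # hypothetical number of memory blocks
    for procs in (16, 64, 256, 1024):
        full = full_map_bits(blocks, procs)
        limited = limited_dir_bits(blocks, procs, num_pointers=4)
        print(f"N={procs:5d}  full-map={full:>12d} bits  4-pointer={limited:>12d} bits")
```

Under these assumptions the full map grows linearly in the processor count for a fixed amount of memory, while the limited directory grows only with the pointer width.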


Alewife Architecture

• Cost-Effective Mesh Network
  – Pro: Scales in terms of hardware
  – Pro: Exploits locality
• Directory distributed along with main memory
  – Bandwidth scales with number of processors
• Con: Non-Uniform Latencies of Communication
  – Have to manage the mapping of processes/threads onto processors
  – Alewife employs techniques for latency minimization and latency tolerance so the programmer does not have to manage this
• Context switch in 11 cycles between processes on a remote memory request, which has to incur communication network latency
• Cache Controller holds tags and implements the coherence protocol


LimitLESS Protocol + Requirements

• Limited Directory that is Locally Extended through Software Support
• Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
• Processor with rapid trap handling (executes trap code within 5-10 cycles of initiation)
• State Shared
  – Processor needs complete access to coherence-related controller state in the hardware directories
  – Directory Controller can invoke processor trap handlers
• Machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets


The Protocol

Note: In the Read-Only State, the notation S: n>p indicates that the outputs from the state are handled through a software interrupt handler if the size of the pointer set (n) is greater than the size of the limited directory (p).
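The n > p condition in the note can be sketched in a few lines. This is a toy model under stated assumptions: the class name `DirectoryEntry`, its fields, and the "hardware"/"software" return values are hypothetical and do not come from the slides or from Alewife. It only shows which path handles a read request in the Read-Only state.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Read-Only state of one memory block (hypothetical model)."""
    hw_pointers: int                            # p: pointers available in hardware
    sharers: set = field(default_factory=set)   # pointer set P, with n = |P|

    def read_request(self, proc: int) -> str:
        """Record a reader and report which path handles the request."""
        self.sharers.add(proc)
        n = len(self.sharers)
        # S: n > p means the software interrupt handler takes over.
        return "software" if n > self.hw_pointers else "hardware"


if __name__ == "__main__":
    entry = DirectoryEntry(hw_pointers=4)
    print([entry.read_request(proc) for proc in range(6)])
```

With four hardware pointers, the first four distinct readers stay on the hardware path and the fifth and sixth are reported as handled in software.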


An Example

• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Read-Write

Processor j
  Data Block   State
  d            Invalid

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        i


An Example

• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Read-Write

Processor j
  Data Block   State
  d            Invalid

Messages
  j → WREQ
  Precondition: P = { i }
  INV → i

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        i


An Example

• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Invalid

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   1        j


An Example

• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Invalid

Messages
  AckCtr = 1, P = { j }
  i → ACKC

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   1        j


An Example

• Proc i has data block D from Proc d in Read-Write State
• Proc j wants to write a value to data block D

Processor i
  Data Block   State
  d            Invalid

Processor j
  Data Block   State
  d            Read-Write

Processor d Directory Entry
  Data Block   State        AckCtr   Owning Processors
  d            Read-Write   0        j
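The example slides can be replayed as a small simulation. This is a sketch under assumptions rather than the protocol implementation: the `Directory` class, the `write_transaction` function, and the dictionary of cache states are hypothetical, and the name WDATA for the final write-permission message is assumed, since the slides show only WREQ, INV, and ACKC.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Directory entry for block d at its home node (hypothetical model)."""
    state: str = "Read-Write"
    ack_ctr: int = 0
    owners: set = field(default_factory=set)


def write_transaction(directory, caches, requester):
    """Replay the example; return (message, AckCtr, owners) snapshots."""
    log = []
    old_owners = sorted(directory.owners - {requester})
    # Step 1: requester -> WREQ. The directory records the new owner and
    # expects one acknowledgement per previous owner.
    directory.owners = {requester}
    directory.ack_ctr = len(old_owners)
    log.append((f"{requester} -> WREQ", directory.ack_ctr, set(directory.owners)))
    for owner in old_owners:
        caches[owner] = "Invalid"  # INV -> owner
        log.append((f"INV -> {owner}", directory.ack_ctr, set(directory.owners)))
    # Step 2: each invalidated cache acknowledges and AckCtr counts down.
    for owner in old_owners:
        directory.ack_ctr -= 1     # owner -> ACKC
        log.append((f"{owner} -> ACKC", directory.ack_ctr, set(directory.owners)))
    # Step 3: AckCtr == 0, so write permission goes to the requester.
    # "WDATA" is an assumed message name.
    caches[requester] = "Read-Write"
    log.append((f"WDATA -> {requester}", directory.ack_ctr, set(directory.owners)))
    return log


if __name__ == "__main__":
    directory = Directory(owners={"i"})
    caches = {"i": "Read-Write", "j": "Invalid"}
    for message, ack_ctr, owners in write_transaction(directory, caches, "j"):
        print(f"{message:12s} AckCtr={ack_ctr} P={sorted(owners)}")
    print(caches)
```

Starting from the first example slide (i holds the block Read-Write, j is Invalid), the run ends in the state of the last one: i Invalid, j Read-Write, AckCtr 0, and j as the owning processor.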


Interprocessor-Interrupt (1/2)

• Trap routine can either discard packet or store it to memory

• Store-back capability permits message-passing and block transfers

• Potential Deadlock Scenario with Processor Stalled and waiting for a remote cache-fill

• Solution: Synchronous trap (handler stored in local memory) to empty the input queue


Interprocessor-Interrupt (2/2)

• Overflow Trap Scenario
  – First Instance: Full-Map bit-vector allocated in local memory, hardware pointers emptied into it, and vector entered into hash table
  – Otherwise: Empty hardware pointers into bit vector
  – Meta-State set to “Trap-On-Write”
  – While emptying hardware pointers, Meta-State: “Trans-In-Progress”
• Incoming Write Request Scenario
  – Empty hardware pointers to memory
  – Set AckCtr to number of bits that are set in bit-vector
  – Send invalidations to all caches except possibly the requesting one
  – Free vector in memory
  – Upon invalidate acknowledgement (AckCtr == 0), send Write-Permission and set Memory State to “Read-Write”
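The two trap scenarios above can be sketched as a software handler. This is a hypothetical model, not the Alewife trap code: the class `SoftwareDirectory`, its method names, and the use of a Python dict as the hash table are assumptions. One further assumption is labelled in the code: the requester's bit is cleared before counting, so that AckCtr matches the number of invalidations actually sent.

```python
class SoftwareDirectory:
    """Software-extended directory for one node (hypothetical model)."""

    def __init__(self, num_procs: int):
        self.num_procs = num_procs
        self.vectors = {}      # hash table: block address -> full-map bit vector
        self.meta_state = {}   # block address -> meta-state name

    def overflow_trap(self, block: int, hw_pointers: list) -> None:
        """Empty the hardware pointers into the block's bit vector."""
        self.meta_state[block] = "Trans-In-Progress"
        # First instance: allocate an empty vector and enter it in the table.
        vector = self.vectors.setdefault(block, 0)
        for proc in hw_pointers:
            vector |= 1 << proc
        self.vectors[block] = vector
        hw_pointers.clear()
        self.meta_state[block] = "Trap-On-Write"

    def write_request(self, block: int, hw_pointers: list, requester: int):
        """Handle a write to an overflowed block.

        Returns (AckCtr, processors to invalidate).
        """
        self.overflow_trap(block, hw_pointers)   # empty pointers to memory
        vector = self.vectors.pop(block)         # free the vector in memory
        # Assumption: skip the requester so AckCtr equals invalidations sent.
        vector &= ~(1 << requester)
        targets = [p for p in range(self.num_procs) if (vector >> p) & 1]
        ack_ctr = bin(vector).count("1")         # number of bits still set
        # Assumption: the block returns to hardware control after the trap.
        self.meta_state.pop(block, None)
        return ack_ctr, targets


if __name__ == "__main__":
    directory = SoftwareDirectory(num_procs=8)
    pointers = [0, 1, 2, 3]
    directory.overflow_trap(0x40, pointers)      # first overflow for block 0x40
    pointers.append(5)                           # a later reader fills a pointer
    print(directory.write_request(0x40, pointers, requester=1))
```

The caller is expected to send one invalidation per returned target and to grant write permission once that many acknowledgements have come back.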


Performance Technique

Notes:

• Multigrid: Small worker sets, so limited directories perform as well as full map

• SIMPLE implemented barrier synchronization with single lock

• Matexpr has worker sets up to 16 processors

• Weather has one variable initialized by one processor and then read by all the other processors


Results (1/3)


Results (2/3)


Results (3/3)


Summary

• LimitLESS directories can closely emulate Full-Map Directories while saving hardware resources

• LimitLESS is not as sensitive to tuning parameters as the Limited Directory approach

• The protocol is general enough to apply to other coherence techniques

• In the future, it can be extended to give feedback to programmers/compilers about hot-spots, etc


Full Memory State Transition Diagram