Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡ *UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft


TRANSCRIPT

Page 1:

Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors

and Fast Barriers

Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡

*UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft

Page 2:

Motivations

CMPs are not just small multiprocessors
– Different computation/communication ratio
– Different shared resources

Inter-core fabric offers potential to support optimizations/acceleration
– CMPs for vector, streaming workloads

Page 3:

Fine-grained Parallelism

CMPs in role of vector processors
– Software synchronization still expensive
– Can target inner-loop parallelism

Barriers a straightforward organizing tool
– Opportunity for hardware acceleration

Faster barriers allow greater parallelism
– 1.2x – 6.4x on 256-element vectors
– 3x – 12.2x on 1024-element vectors

Page 4:

Accelerating Barriers

Barrier Filters: a new method for barrier synchronization
– No dedicated networks
– No new instructions
– Changes only in shared memory system
– CMP-friendly design point

Competitive with dedicated barrier network
– Achieves 77%–95% of dedicated network performance

Page 5:

Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary

Page 6:

Observation and Intuition

Observations
– Barriers need to stall forward progress
– There exist events that already stall processors

Co-opt and extend existing stall behavior
– Cache misses
• Either I-Cache or D-Cache suffices

Page 7:

High-Level Barrier Behavior

A thread can be in one of three states:
1. Executing
– Perform work
– Enforce memory ordering
– Signal arrival at barrier
2. Blocking
– Stall at barrier until all arrive
3. Resuming
– Release from barrier
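The three-state protocol above can be mirrored in a small software model. The sketch below is my own illustrative Python analog, not the paper's mechanism: the barrier filter stalls blocked threads in hardware via unsatisfied cache fills, whereas this version blocks on a condition variable.

```python
import threading

class ThreeStateBarrier:
    """Illustrative software analog of the Executing/Blocking/Resuming
    protocol; blocks on a condition variable rather than a cache fill."""
    def __init__(self, n_threads):
        self.n = n_threads
        self.arrived = 0               # the filter's "Arrived-counter"
        self.generation = 0            # distinguishes successive episodes
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            gen = self.generation
            self.arrived += 1          # signal arrival: EXECUTING -> BLOCKING
            if self.arrived == self.n:
                self.arrived = 0       # counter resets on release
                self.generation += 1
                self.cv.notify_all()   # everyone switches to RESUMING
            else:
                while gen == self.generation:
                    self.cv.wait()     # BLOCKING: stall until all arrive
```

Threads call `wait()` between phases; the counter reset and generation bump correspond to the release step shown on page 14.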

Page 8:

Barrier Filter Example

CMP augmented with filter
– Private L1
– Shared, banked L2

[Filter state: # threads = 3; arrived-counter = 0; threads A, B, C all EXECUTING]

Page 9:

Example: Memory Ordering

Before/after for memory
– Each thread executes a memory fence

[Filter state: arrived-counter = 0; threads A, B, C all EXECUTING]

Page 10:

Example: Signaling Arrival

Communication with filter
– Each thread invalidates a designated cache line

[Filter state: arrived-counter = 0; threads A, B, C all EXECUTING]

Page 11:

Example: Signaling Arrival

Invalidation propagates to shared L2 cache

Filter snoops the invalidation
– Checks address for match
– Records arrival

[Filter state: arrived-counter 0 → 1; thread A EXECUTING → BLOCKING]

Page 12:

Example: Signaling Arrival (continued)

[Animation frame: thread C's invalidation is snooped next; arrived-counter 1 → 2; thread C EXECUTING → BLOCKING; thread B still EXECUTING]

Page 13:

Example: Stalling

Thread A attempts to fetch the invalidated data

Fill request not satisfied
– Thread stalling mechanism

[Filter state: arrived-counter = 2; threads A, C BLOCKING; thread B EXECUTING]

Page 14:

Example: Release

Last thread signals arrival

Barrier release
– Counter resets
– Filter state for all threads switches

[Filter state: arrived-counter 2 → 0; threads A, B, C all switch to RESUMING]

Page 15:

Example: Release

After release
– New cache-fill requests served
– Filter serves pending cache-fills

[Filter state: arrived-counter = 0; threads A, B, C all RESUMING]
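The walk-through on pages 8 through 16 can be condensed into a software model of the filter's bookkeeping. Everything below is an illustrative sketch with invented method names; the real filter is an FSM in the shared-L2 controller that snoops coherence traffic, not a Python object.

```python
class BarrierFilter:
    """Software model of the address filter from the example slides.
    Method names are mine; the real filter snoops coherence traffic."""
    EXECUTING, BLOCKING, RESUMING = "EXECUTING", "BLOCKING", "RESUMING"

    def __init__(self, n_threads):
        self.n = n_threads
        self.arrived = 0                 # the "Arrived-counter"
        self.state = {t: self.EXECUTING for t in range(n_threads)}
        self.pending_fills = []          # fill requests withheld until release

    def snoop_invalidate(self, thread_id):
        """A thread invalidated its designated line: record its arrival."""
        self.state[thread_id] = self.BLOCKING
        self.arrived += 1
        if self.arrived == self.n:       # last arrival triggers the release
            self.arrived = 0             # counter resets (page 14)
            for t in self.state:
                self.state[t] = self.RESUMING
            self.pending_fills.clear()   # withheld fills are now served

    def snoop_fill_request(self, thread_id):
        """A thread re-fetched its invalidated line."""
        if self.state[thread_id] == self.BLOCKING:
            self.pending_fills.append(thread_id)  # stall: withhold the fill
            return None
        return "data"                             # served normally
```

Replaying the slides' sequence (A arrives, A's fill is withheld, C arrives, B arrives last) drives the model through the same counter values and state switches shown in the frames above.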


Page 17:

Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary

Page 18:

Software Interface

Communication requirements
– Let hardware know # of threads
– Let threads know signal addresses

Barrier filters as virtualized resource
– Library interface
– Pure software fallback

User scenario
– Application calls OS to create barrier with # threads
– OS allocates barrier filter, relays address and # threads
– OS returns address to application
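The user scenario above might look as follows as a library interface. All names here (`barrier_create`, `BarrierHandle`) are hypothetical, since the paper describes the scenario but not an API, and the body uses the pure software fallback mentioned above, standing in for the case where no filter is available.

```python
import threading

class BarrierHandle:
    """Hypothetical handle returned by the OS barrier-creation call."""
    def __init__(self, signal_addr, n_threads):
        self.signal_addr = signal_addr        # address the OS relayed to the filter
        self._fallback = threading.Barrier(n_threads)

    def wait(self):
        # With filter hardware: memory fence, invalidate the line at
        # signal_addr, then load it (the load stalls until release).
        # Pure software fallback, used here:
        self._fallback.wait()

def barrier_create(n_threads):
    """Stand-in for the OS call: allocate a filter (or fall back to
    software) and return a handle holding the signal address."""
    signal_addr = 0x1000                      # illustrative value only
    return BarrierHandle(signal_addr, n_threads)
```

An application would create the barrier once, share the handle among its threads, and call `wait()` between parallel phases.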

Page 19:

Barrier Filter Hardware

Additional hardware: “address filter”
– In controller for shared memory level
– State table, associated FSMs
– Snoops invalidations, fill requests for designated addresses

Makes use of existing instructions and existing interconnect network

Page 20:

Barrier Filter Internals

Each barrier filter supports one barrier
– Barrier state
– Per-thread state, FSMs

Multiple barrier filters
– In each controller
– In banked caches, at a particular bank


Page 23:

Why have an exit address?

Needed for re-entry to barriers
– When does Resuming again become Executing?
– Additional fill requests may be issued

Delivery is not a guarantee of receipt
– Context switches
– Migration
– Cache eviction

Page 24:

Ping-Pong Optimization

Draws from sense-reversal barriers
– Entry and exit operations as duals

Two alternating arrival addresses
– Each conveys exit to the other’s barrier
– Eliminates explicit invalidate of exit address
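For reference, the sense-reversal scheme this optimization draws from can be sketched as follows. The two senses play the role of the two alternating arrival addresses, in that completing an episode under one sense implicitly "exits" the other. This is the textbook software construction, not the paper's hardware.

```python
import threading
import time

class SenseReversalBarrier:
    """Textbook sense-reversal barrier (not the paper's hardware): entry
    under one sense doubles as exit from the previous episode."""
    def __init__(self, n_threads):
        self.n = n_threads
        self.count = n_threads
        self.sense = False                 # global sense, flipped on release
        self.lock = threading.Lock()
        self.local = threading.local()     # per-thread sense

    def wait(self):
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = (self.count == 0)
            if last:
                self.count = self.n        # reset for the next episode
                self.sense = my_sense      # release: flip the global sense
        if not last:
            while self.sense != my_sense:  # spin until the sense flips
                time.sleep(1e-6)
```

Because the sense alternates each episode, no thread ever has to clear a separate "exit" flag, which is the property the ping-pong addresses exploit.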

Page 25:

Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary

Page 26:

Methodology

Used modified version of SMT-Sim

We performed experiments using 7 different barrier implementations
– Software:
• Centralized, combining tree
– Hardware:
• Filter barrier (4 variants), dedicated barrier network

We examined performance over a set of parallelizable kernels
– Livermore loops 2, 3, 6
– EEMBC kernels: autocorrelation, Viterbi
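Of the two software baselines, the combining tree is the less obvious. A minimal sketch follows; this is my own construction (the paper's baseline may differ), and a real implementation would use per-node flags rather than one global lock, which this simplification keeps for brevity.

```python
import threading

class CombiningTreeBarrier:
    """Minimal combining-tree barrier sketch: threads arrive in small
    groups, and only the last arrival in each group ascends a level, so
    no single counter sees all n arrivals. Assumes n_threads is a power
    of fan_in. Illustrative only; a real version avoids the global lock."""
    def __init__(self, n_threads, fan_in=2):
        self.fan_in = fan_in
        self.generation = 0
        self.cv = threading.Condition()
        self.counts = []               # counts[level][node] = arrivals so far
        width = n_threads
        while width > 1:
            width //= fan_in
            self.counts.append([0] * width)

    def wait(self, tid):
        with self.cv:
            gen = self.generation
            node = tid
            for level in range(len(self.counts)):
                node //= self.fan_in
                self.counts[level][node] += 1
                if self.counts[level][node] < self.fan_in:
                    break              # not last in this group: stop ascending
                self.counts[level][node] = 0   # reset the node for reuse
            else:
                # Completed the root: release every waiting thread.
                self.generation += 1
                self.cv.notify_all()
                return
            while gen == self.generation:
                self.cv.wait()
```

The tree structure is what distinguishes it from the centralized baseline, where every thread contends on one shared counter.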

Page 27:

Benchmark Selection

Barriers are seen as heavyweight operations
– Infrequently executed in most workloads

Example: Ocean from SPLASH-2
– On simulated 16-core CMP: 4% of time in barriers

Barriers will be used more frequently on CMPs

Page 28:

Latency Micro-benchmark

Average time of barrier execution (in isolation)
– #threads = #cores

Page 29:

Latency Micro-benchmark

Notable effects due to bus saturation
– Barrier filter scales well up until this point

Page 30:

Latency Micro-benchmark

Filters closer to dedicated network than software
– Significant speedup vs. software still exhibited

Page 31:

Autocorrelation Kernel

On 16-core CMP
– 7.98x speedup for dedicated network
– 7.31x speedup for best filter barrier
– 3.86x speedup for best software barrier

Significant speedup opportunities with fast barriers
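These figures are consistent with the 77%–95% range quoted on page 4, as a quick arithmetic check shows:

```python
# Cross-check the autocorrelation numbers against the claim that filter
# barriers reach 77%-95% of dedicated-network performance.
dedicated = 7.98    # speedup with the dedicated barrier network
filter_b = 7.31     # speedup with the best filter barrier
software = 3.86     # speedup with the best software barrier

assert 0.77 <= filter_b / dedicated <= 0.95
print(round(filter_b / dedicated, 3))   # fraction of network performance: 0.916
print(round(filter_b / software, 2))    # filter vs. best software: 1.89
```

So on this kernel the best filter barrier delivers about 92% of the dedicated network's speedup, and nearly twice that of the best software barrier.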

Page 32:

Viterbi Kernel

Not all applications can scale to an arbitrary number of cores

Viterbi performance higher on 4 or 8 cores than on 16 cores

[Figure: Viterbi on 4-core CMP]

Page 33:

Livermore Loops

Serial/parallel crossover
– HW achieves it on a 4x smaller problem

[Figure: Livermore Loop 3 on 16-core CMP]

Page 34:

Livermore Loops

Reduction in parallelism to avoid false sharing

[Figure: Livermore Loop 3 on 16-core CMP]

Page 35:

Result Summary

Fine-grained parallelism on CMPs
– Significant speedups possible
• 1.2x – 6.4x on 256-element vectors
• 3x – 12.2x on 1024-element vectors
– False sharing affects problem size/scaling

Faster barriers allow greater parallelism
– HW approaches extend worthwhile problem sizes

Barrier filters give competitive performance
– 77%–95% of dedicated network performance

Page 36:

Conclusions

Fast barriers
– Can organize fine-grained data parallelism on a CMP

CMPs can act in a vector processor role
– Exploit inner-loop parallelism

Barrier filters
– CMP-oriented fast barrier

Page 37:

(FIN)

Questions?

Page 38:

Extra Graphs

[Backup slides; graphs not captured in transcript]