Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors
and Fast Barriers
Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡
*UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft
Motivations
CMPs are not just small multiprocessors
– Different computation/communication ratio
– Different shared resources
Inter-core fabric offers potential to support optimizations/acceleration
– CMPs for vector, streaming workloads
Fine-grained Parallelism
CMPs in role of vector processors
– Software synchronization still expensive
– Can target inner-loop parallelism
Barriers a straightforward organizing tool
– Opportunity for hardware acceleration
Faster barriers allow greater parallelism
– 1.2x – 6.4x on 256-element vectors
– 3x – 12.2x on 1024-element vectors
Accelerating Barriers
Barrier Filters: a new method for barrier synchronization
– No dedicated networks
– No new instructions
– Changes only in shared memory system
– CMP-friendly design point
Competitive with dedicated barrier network
– Achieves 77%–95% of dedicated network performance
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary
Observation and Intuition
Observations
– Barriers need to stall forward progress
– There exist events that already stall processors
Co-opt and extend existing stall behavior
– Cache misses
• Either I-Cache or D-Cache suffices
High Level Barrier Behavior
A thread can be in one of three states (sketched below):
1. Executing
– Perform work
– Enforce memory ordering
– Signal arrival at barrier
2. Blocking
– Stall at barrier until all arrive
3. Resuming
– Release from barrier
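To make the three states concrete, here is a minimal sketch of the thread-side arrival sequence. cache_line_invalidate() is a hypothetical stand-in for an existing ISA line-invalidate instruction (the scheme adds no new instructions); the arrival address is the designated line handed out when the barrier filter is allocated.

```c
#include <stdint.h>

/* Hypothetical stand-in for an existing ISA line-invalidate
   instruction; not an API from the paper. */
extern void cache_line_invalidate(volatile void *line);

/* One thread's pass through the barrier, following the three states
   above. */
void barrier_arrive(volatile uint8_t *arrival_line)
{
    /* Executing: fence so all prior memory operations complete
       before arrival is signaled. */
    __sync_synchronize();

    /* Signal arrival: invalidate the designated line. The
       invalidation propagates to the shared cache, where the filter
       snoops it and increments its arrived-counter. */
    cache_line_invalidate(arrival_line);

    /* Blocking: re-fetch the invalidated line. The filter withholds
       the cache fill until every thread has arrived, so this load
       stalls the thread; it completes (Resuming) on release. */
    (void)*arrival_line;
}
```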
Barrier Filter Example
CMP augmented with filter
– Private L1
– Shared, banked L2
[Filter state: # threads = 3; arrived-counter = 0; A, B, C: EXECUTING]
Example: Memory Ordering
Before/after ordering for memory
– Each thread executes a memory fence
[Filter state unchanged: arrived-counter = 0; A, B, C: EXECUTING]
Example: Signaling Arrival
Communication with filter
– Each thread invalidates a designated cache line
[Filter state: arrived-counter = 0; A, B, C: EXECUTING]
Example: Signaling Arrival
Invalidation propagates to shared L2 cache
Filter snoops the invalidation
– Checks address for match
– Records arrival
[Filter state: arrived-counter 0 → 1; A: EXECUTING → BLOCKING; B, C: EXECUTING]
[Filter state: arrived-counter 1 → 2; C: EXECUTING → BLOCKING; B: EXECUTING]
Example: Stalling
Thread A attempts to fetch the invalidated data
Fill request not satisfied
– Thread stalling mechanism
[Filter state: arrived-counter = 2; A, C: BLOCKING; B: EXECUTING]
Example: Release
Last thread signals arrival
Barrier release
– Counter resets
– Filter state for all threads switches
[Filter state: arrived-counter 2 → 0 (reset); A, B, C → RESUMING]
Example: Release
After release
– New cache-fill requests served
– Filter serves pending cache fills
[Filter state: arrived-counter = 0; A, B, C: RESUMING]
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary
Software Interface
Communication requirements
– Let hardware know # of threads
– Let threads know signal addresses
Barrier filters as virtualized resource
– Library interface
– Pure software fallback
User scenario (sketched below)
– Application calls OS to create barrier with # threads
– OS allocates barrier filter, relays address and # threads
– OS returns address to application
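A minimal sketch of the library side of this scenario; os_create_barrier() is an illustrative stand-in for the actual OS call, not a documented API.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the OS call: allocates a barrier filter
   for nthreads and returns its arrival address, or NULL if no
   filter is available. */
extern volatile uint8_t *os_create_barrier(int nthreads);

typedef struct {
    volatile uint8_t *arrival;   /* designated arrival address */
    int               nthreads;
} barrier_t;

/* Returns 1 if a hardware filter was allocated; 0 means the library
   should fall back to a pure software barrier. */
int barrier_init(barrier_t *b, int nthreads)
{
    b->nthreads = nthreads;
    b->arrival  = os_create_barrier(nthreads);
    return b->arrival != NULL;
}
```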
Barrier Filter Hardware
Additional hardware: "address filter"
– In controller for shared memory level
– State table, associated FSMs
– Snoops invalidations, fill requests for designated addresses
Makes use of existing instructions and existing interconnect network
Barrier Filter Internals
Each barrier filter supports one barrier
– Barrier state
– Per-thread state, FSMs (sketched below)
Multiple barrier filters
– In each controller
– In banked caches, at a particular bank
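A sketch of one filter's state table and the two snoop handlers implied above; the field and handler names are illustrative, not the paper's hardware design.

```c
#include <stdbool.h>

#define MAX_THREADS 16            /* illustrative table size */

enum tstate { EXECUTING, BLOCKING, RESUMING };

struct barrier_filter {
    int         num_threads;               /* relayed by the OS */
    int         arrived;                   /* arrived-counter */
    enum tstate state[MAX_THREADS];
    bool        fill_pending[MAX_THREADS]; /* withheld cache fills */
};

extern void send_fill(int tid);   /* stand-in: deliver a cache fill */

/* Snooped invalidation of thread tid's designated arrival line. */
void on_arrival_invalidate(struct barrier_filter *f, int tid)
{
    f->state[tid] = BLOCKING;
    if (++f->arrived == f->num_threads) {   /* last arrival: release */
        f->arrived = 0;                     /* counter resets */
        for (int t = 0; t < f->num_threads; t++) {
            f->state[t] = RESUMING;
            if (f->fill_pending[t]) {       /* serve pending fills */
                f->fill_pending[t] = false;
                send_fill(t);
            }
        }
    }
}

/* Snooped fill request for thread tid's arrival line. */
void on_fill_request(struct barrier_filter *f, int tid)
{
    if (f->state[tid] == BLOCKING)
        f->fill_pending[tid] = true;        /* withhold: thread stalls */
    else
        send_fill(tid);                     /* served normally */
}
```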
Why have an exit address?
Needed for re-entry to barriers
– When does Resuming again become Executing?
– Additional fill requests may be issued
Delivery is not a guarantee of receipt
– Context switches
– Migration
– Cache eviction
Ping-Pong Optimization
Draws from sense-reversal barriers
– Entry and exit operations as duals
Two alternating arrival addresses (sketched below)
– Each conveys exit to the other's barrier
– Eliminates explicit invalidate of exit address
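A minimal sketch of the ping-pong scheme, reusing barrier_arrive() from the earlier sketch; the per-thread sense bit is an assumption of this sketch.

```c
#include <stdint.h>

extern void barrier_arrive(volatile uint8_t *line);  /* earlier sketch */

static __thread int sense = 0;    /* per-thread sense bit */

/* Alternate between the two arrival addresses on successive entries;
   arriving at one implicitly signals exit from the other's barrier,
   so no separate exit-address invalidate is issued. */
void barrier_wait(volatile uint8_t *const arrival[2])
{
    barrier_arrive(arrival[sense]);  /* fence, invalidate, stall-load */
    sense ^= 1;                      /* re-enter via the other address */
}
```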
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary
Methodology
Used a modified version of SMTSIM
We performed experiments using 7 different barrier implementations
– Software: centralized (sketched below), combining tree
– Hardware: filter barrier (4 variants), dedicated barrier network
We examined performance over a set of parallelizable kernels
– Livermore loops 2, 3, 6
– EEMBC kernels: autocorrelation, Viterbi
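For reference, here is a minimal centralized sense-reversing software barrier, the style of baseline the "centralized" variant refers to; the evaluated implementations may differ in detail.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;     /* threads arrived so far */
    atomic_int sense;     /* global sense, flipped each episode */
    int        nthreads;
} sw_barrier_t;

static _Thread_local int my_sense = 0;

void sw_barrier_wait(sw_barrier_t *b)
{
    my_sense ^= 1;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);         /* last thread resets */
        atomic_store(&b->sense, my_sense);  /* releases the spinners */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;  /* spin: this shared-counter traffic is what makes
                  software barriers expensive on a CMP */
    }
}
```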
Benchmark Selection
Barriers are seen as heavyweight operations
– Infrequently executed in most workloads
Example: Ocean from SPLASH-2
– On a simulated 16-core CMP: 4% of time in barriers
Barriers will be used more frequently on CMPs
Latency Micro-benchmark
Average time of barrier execution (in isolation), as sketched below
– #threads = #cores
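A sketch of how such a micro-benchmark might be structured, assuming the barrier_wait() interface from the earlier sketches and an illustrative cycle-counter read; the actual harness may differ.

```c
#include <stdint.h>

extern uint64_t read_cycles(void);                      /* illustrative timer */
extern void barrier_wait(volatile uint8_t *const [2]);  /* earlier sketch */

/* With one thread pinned per core, each thread times a run of
   back-to-back barriers in isolation and reports the average. */
double avg_barrier_cycles(volatile uint8_t *const arrival[2], int iters)
{
    uint64_t t0 = read_cycles();
    for (int i = 0; i < iters; i++)
        barrier_wait(arrival);
    return (double)(read_cycles() - t0) / iters;
}
```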
Latency Micro-benchmark
Notable effects due to bus saturation
– Barrier filter scales well up until this point
Latency Micro-benchmark
Filters closer to dedicated network than software
– Significant speedup vs. software still exhibited
Autocorrelation Kernel
On a 16-core CMP:
– 7.98x speedup for dedicated network
– 7.31x speedup for best filter barrier
– 3.86x speedup for best software barrier
Significant speedup opportunities with fast barriers
Viterbi Kernel
Not all applications can scale to an arbitrary number of cores
Viterbi performance higher on 4 or 8 cores than on 16 cores
Viterbi on 4-core CMP
Livermore Loops
Serial/parallel crossover
– HW barriers reach crossover on a 4x smaller problem size
Livermore Loop 3 on 16-core CMP
Livermore Loops
Reduction in parallelism to avoid false sharing (see the padding sketch below)
Livermore Loop 3 on 16-core CMP
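Livermore Loop 3 is an inner product (q += z[k]*x[k]). Here is a sketch of one way it parallelizes, with per-thread partial sums padded to a full cache line to avoid the false sharing noted above; the partitioning and names are illustrative, not the paper's exact code.

```c
#define NTHREADS   16
#define CACHE_LINE 64

/* Pad each partial sum to its own cache line so concurrent writes by
   different threads never share a line. */
struct padded { double v; char pad[CACHE_LINE - sizeof(double)]; };
static struct padded partial[NTHREADS];

extern void barrier(void);   /* stand-in for the filter barrier */

double loop3_worker(int tid, int n, const double *z, const double *x)
{
    double q = 0.0;
    for (int k = tid; k < n; k += NTHREADS)   /* strided partition */
        q += z[k] * x[k];
    partial[tid].v = q;

    barrier();                                /* all partials written */

    double sum = 0.0;                         /* each thread reduces */
    for (int t = 0; t < NTHREADS; t++)
        sum += partial[t].v;
    return sum;
}
```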
Result Summary
Fine-grained parallelism on CMPs
– Significant speedups possible
• 1.2x – 6.4x on 256-element vectors
• 3x – 12.2x on 1024-element vectors
– False sharing affects problem size/scaling
Faster barriers allow greater parallelism
– HW approaches extend worthwhile problem sizes
Barrier filters give competitive performance
– 77%–95% of dedicated network performance
Conclusions
Fast barriers
– Can organize fine-grained data parallelism on a CMP
CMPs can act in a vector processor role
– Exploit inner-loop parallelism
Barrier filters
– CMP-oriented fast barrier
(FIN)
Questions?