Westin Harbour Castle, August 24, 2000
The NUMAchine Multiprocessor
ICPP 2000
University of Toronto
Outline
Architecture: System Overview; Key Features; Fast Ring Routing
Hardware Cache Coherence
Memory Model: Sequential Consistency
Simulation Studies: Ring Performance; Network Cache Performance; Coherence Overhead
Prototype Performance
Hardware Status
Conclusion
System Architecture
Hierarchical ring network, based on clusters (NUMAchine's 'Stations') which are themselves bus-based SMPs
NUMAchine's Key Features
Hierarchical rings: allow for very fast and simple routing, and provide good support for broadcast and multicast
Hardware Cache Coherence: hierarchical, directory-based CC-NUMA system; writeback/invalidate protocol, designed to use the broadcast/ordering properties of rings
Sequentially Consistent Memory Model: the most intuitive model for programmers trained on uniprocessors
Simple, low cost, but with good flexibility, scalability and performance
Fast Ring Routing: Filtermasks
Fast ring routing is achieved by the use of filtermasks (i.e. simple bit-masks) to store cache-line location information; their imprecision reduces directory storage requirements
These filtermasks are used directly by the routing hardware in the ring interfaces
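As a rough illustration of the idea (a simplified model of our own, not the actual NUMAchine mask encoding), a filtermask can be pictured as a pair of one-hot bit fields that are OR-ed as sharers accumulate; decoding yields a superset of the true sharers, which is why the directory storage is small but imprecise:

```python
# Simplified sketch of a NUMAchine-style filtermask (illustrative only).
# Each Station sits on one local ring; its routing mask is a pair of
# one-hot bit fields: (local-ring bit, station-on-ring bit).

def encode(ring: int, slot: int) -> tuple[int, int]:
    """One-hot routing mask for the Station at (ring, slot)."""
    return (1 << ring, 1 << slot)

def merge(mask_a, mask_b):
    """Sharers accumulate by OR-ing their masks: cheap, but imprecise."""
    return (mask_a[0] | mask_b[0], mask_a[1] | mask_b[1])

def decode(mask, n_rings=4, n_slots=4):
    """Stations that MAY hold a copy: the cross product of the set bits.
    This is a superset of the true sharers -- it never misses one."""
    return {(r, s)
            for r in range(n_rings) if mask[0] & (1 << r)
            for s in range(n_slots) if mask[1] & (1 << s)}

# Two sharers on different rings...
m = merge(encode(0, 1), encode(1, 0))
# ...decode to four candidate Stations: the mask is conservative.
print(sorted(decode(m)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the mask is a direct bit pattern, a ring interface can decide whether to forward a packet with a single AND against its own position bit, which is what makes the routing (and multicast) fast.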
Hardware Cache Coherence
Hierarchical, directory-based, writeback/invalidate
Directory entries are stored both in the per-station memory (the 'home' location) and cached in the network interfaces (hence the name, Network Cache)
The Network Cache stores both the remotely cached directory information and the cache lines themselves, allowing the network interface to perform coherence operations locally (on-Station) and avoid remote accesses to the home directory
Filtermasks indicate which Stations (i.e. clusters) may potentially have a copy of a cache line (the fuzziness is due to the imprecise nature of the filtermasks)
Processor masks are used only within a Station, to indicate which particular caches may contain a copy (the fuzziness here is due to Shared lines that may have been silently ejected)
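The processor-mask behaviour can be sketched as follows (a hypothetical model; the class and method names are ours, not NUMAchine's):

```python
# Hypothetical sketch of the per-Station processor mask. The mask
# conservatively tracks which on-Station caches may hold a Shared line;
# a cache that silently ejects the line leaves its bit set, so invalidates
# can be over-sent but a real copy is never missed.

class LineDirectory:
    def __init__(self, n_cpus: int):
        self.n_cpus = n_cpus
        self.processor_mask = 0          # one bit per on-Station cache

    def read_miss(self, cpu: int):
        """A local cache fetches the line: set its bit."""
        self.processor_mask |= 1 << cpu

    def invalidate_targets(self):
        """On a write, invalidates go to every cache whose bit is set.
        Silently ejected copies still have their bit set (a harmless
        extra invalidate), but no live copy is ever skipped."""
        targets = [c for c in range(self.n_cpus)
                   if self.processor_mask & (1 << c)]
        self.processor_mask = 0
        return targets

d = LineDirectory(4)
d.read_miss(0)
d.read_miss(2)
# Suppose CPU 2 silently ejects its copy -- the directory is not told.
print(d.invalidate_targets())   # [0, 2]: CPU 2 gets a harmless extra invalidate
```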
Memory Model: Sequential Consistency
The most intuitive model for the normally trained programmer: increases the usability of the system
Easily supported by NUMAchine's ring network: the only change necessary is to force invalidates to pass through a global 'sequencing point' on the ring, increasing the average invalidation latency by 2 ring hops (40 ns with our default 50 MHz rings)
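The 40 ns figure follows directly from the ring clock, assuming one hop per ring clock cycle:

```python
# Sanity check of the latency figure above: at 50 MHz one ring hop takes
# one clock period (20 ns); routing invalidates through the sequencing
# point adds two hops on average.

ring_clock_mhz = 50
hop_ns = 1e3 / ring_clock_mhz      # 20.0 ns per hop
extra_hops = 2
print(extra_hops * hop_ns)         # 40.0 ns of added invalidation latency
```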
Simulation Studies: Ring Performance 1
We use the SPLASH-2 benchmark suite and a cycle-accurate hardware simulator with full modeling of the coherence protocol
Applications with high communication-to-computation ratios (e.g. FFT, Radix) show high utilizations, particularly in the Central Ring (indicating that a faster Central Ring would help)
Simulation Studies: Ring Performance 2
Maximum and average ring interface queue depths indicate the network congestion, which correlates with bursty traffic
Large differences between the maximum and average values indicate large variability in burst size
Simulation Studies: Network Cache
Graphs show a measure of the Network Cache's effect by looking at the hit rate (i.e. the reduction in remote data and coherence traffic)
By categorizing the hits by coherence directory state, we also see where the benefits come from: caching shared data, or reducing invalidations and coherence traffic
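The categorization can be pictured with a small tally of our own devising (the trace and state names are made up for illustration, not taken from the simulator):

```python
# Illustrative bookkeeping: tallying Network Cache hits by the directory
# state of the line at the time of the hit shows which benefit dominates,
# reuse of shared data vs. avoided coherence traffic.

from collections import Counter

def categorize_hits(accesses):
    """accesses: list of (hit, directory_state) observations."""
    hits = Counter(state for hit, state in accesses if hit)
    hit_rate = sum(hits.values()) / len(accesses)
    return hit_rate, dict(hits)

trace = [(True, "Shared"), (True, "Shared"), (True, "Dirty"),
         (False, "Invalid"), (True, "Shared")]
rate, by_state = categorize_hits(trace)
print(rate, by_state)   # 0.8 {'Shared': 3, 'Dirty': 1}
```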
Simulation Studies: Coherence Overhead
We measure the overhead due to cache coherence by allowing all writes to succeed immediately, without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise due to capacity effects)
Results indicate that in many cases it is basic data locality and/or poor parallelizability that impedes performance, not cache coherence
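The comparison amounts to a simple ratio (hypothetical timings; the function is our own notation, not the simulator's):

```python
# Sketch of the comparison described above: coherence overhead is the
# slowdown of the full-protocol run relative to a run in which every
# write succeeds immediately, both with infinite Network Caches.

def coherence_overhead(t_full_protocol_ms: float, t_writes_free_ms: float) -> float:
    """Fractional execution-time overhead attributable to coherence."""
    return t_full_protocol_ms / t_writes_free_ms - 1.0

# e.g. hypothetical timings: 150 ms with full coherence vs. 100 ms without
print(coherence_overhead(150.0, 100.0))   # 0.5 -> 50% overhead
```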
Prototype Performance
Speedups from the hardware prototype, compared against estimates from the simulator
Hardware Prototype Status
Fully operational running the custom Tornado OS on a 32-processor system
Conclusion
4- and 8-way SMPs are fast becoming commodity items
The NUMAchine project has shown that a simple, cost-effective CC-NUMA multiprocessor can be built using these SMP building blocks and a simple ring network, and still achieve good performance and scalability
In the medium-scale range (a few tens to hundreds of processors), rings are a good choice for a multiprocessor interconnect
We have demonstrated an efficient hardware cache coherence scheme, designed to make use of the natural ordering and broadcast capabilities of rings
NUMAchine's architecture efficiently supports a sequentially consistent memory model, which we feel is essential for increasing the ease of use and programmability of multiprocessors
Acknowledgments: The NUMAchine Team

Operating Systems
Prof. Michael Stumm
Orran Krieger (IBM)
Ben Gamsa
Jonathon Appavoo
Robert Ho

Compilers
Prof. Tarek Abdelrahman
Prof. Naraig Manjikian (Queens)

Applications
Prof. Ken Sevcik

Hardware
Prof. Zvonko Vranesic
Prof. Stephen Brown
Robin Grindley (SOMA Networks)
Alex Grbic
Prof. Zeljko Zilic (McGill)
Steve Caranci (Altera)
Derek DeVries (OANDA)
Guy Lemieux
Kelvin Loveless (GNNettest)
Prof. Sinisa Srbljic (Zagreb)
Paul McHardy
Mitch Gusat (IBM)