Exploration of Memory and Cluster
Modes in Directory-Based
Many-Core CMPs
Subodha Charles and Prabhat Mishra
University of Florida, USA
Chetan Arvind Patil and Umit Y. Ogras
Arizona State University, USA
This work was partially supported by the National Science
Foundation (NSF) grants CNS-1526687 and CNS-1526562
Outline
Introduction
Existing NoC Exploration Methods
Accurate Modeling and Exploration
❖ Motivation
❖ Modeling of Directory–Memory Traffic
❖ Exploration of Memory and Cluster Modes
Experimental Results
Conclusion
Increased Complexity of SoC Design
NoCs are Critical for Performance
Early interconnection designs were buses and point-to-point links
Does Not Scale!
Solution: NoC
Architecture of a Many-Core CMP
Traffic Optimization on NoC
Min # of MCs: Eitschberger et al., MCC ‘13
Optimum MC Placement: Xu et al., CODES+ISSS ‘13
Dynamic Workload Data Mapping: Awasthi et al., PACT ‘10
Optimum MC Placement
Column 0/7, Column 2/5, Diamond, Optimum, Slash
Xu et al., CODES+ISSS ‘13
KNL: 2nd Generation Xeon-Phi
38 tiles
36 active, 2 recovery
Each tile:
2 VPUs, Out of order
4 threads per core
4 separate NoCs
Traffic Model of gem5 Simulator
Life cycle of a memory request:
(1) Request forwarded to Directory Controller after miss in private cache
(2) Data retrieved from memory
(3) MC forwards data to the requestor
A Memory Controller at Each Tile?
Is this a realistic assumption?
Number of MCs < Number of tiles
Packaging constraints
High I/O pin cost
Intel Xeon-Phi 7210
Hotspots Introduced by MCs
Key Idea
The interactions between cores, directory controllers and memory controllers should be accurately modelled to enable exploration of NoC optimization
Modified Traffic Model
Life cycle of a memory request:
(1) Request forwarded to Directory Controller after miss in private cache
(2) Request forwarded to MC
(3) Data retrieved from memory
(4) MC forwards data to the requestor
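The difference between the two life cycles can be sketched as a hop count on a 2D mesh. This is only an illustration, not the gem5 implementation: it assumes XY routing (so hops equal Manhattan distance), assumes the default model behaves as if memory sits at the directory's tile, and uses hypothetical tile coordinates and function names.

```python
def hops(a, b):
    """Hop count between two (x, y) tiles under XY routing (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def default_model_hops(core, directory):
    # Assumption: memory is treated as co-located with the directory,
    # so only core -> directory and directory -> core cross the NoC.
    return hops(core, directory) + hops(directory, core)


def modified_model_hops(core, directory, mc):
    # Modified model: core -> directory, directory -> MC, MC -> core.
    return hops(core, directory) + hops(directory, mc) + hops(mc, core)


# Hypothetical coordinates on an 8x8 mesh.
CORE, DIRECTORY, MC = (0, 0), (3, 2), (7, 0)
default_total = default_model_hops(CORE, DIRECTORY)
modified_total = modified_model_hops(CORE, DIRECTORY, MC)
```

With these coordinates the default path costs 10 hops and the modified path 18, which shows how the extra directory-to-MC leg adds traffic that converges on the MC tiles.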
Modified Traffic Model
The inclusion of the new step (2) has a significant impact:
Introduces hotspots
Realistic estimate of power and performance data
Exploration of MC placement
Exploration of cluster and memory modes
Cluster Modes in KNL
All-to-all Mode
A request from a core can be forwarded to any directory controller. The memory request can be forwarded to any MC as well.
Quadrant Mode
Four virtual quadrants. A request from a core can be forwarded to any directory controller, but the memory request must be sent to an MC in the same quadrant as the directory.
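The two cluster modes differ only in which MCs a directory may forward to. A minimal sketch of that rule follows; the mesh size, MC positions, quadrant numbering and function names are all hypothetical and chosen for illustration, not taken from KNL or gem5.

```python
def quadrant(tile, width, height):
    """Return a quadrant id 0..3 for an (x, y) tile on a width x height mesh."""
    x, y = tile
    return (1 if x >= width // 2 else 0) + 2 * (1 if y >= height // 2 else 0)


def candidate_mcs(directory, mcs, mode, width, height):
    """List the MCs a directory may forward a memory request to."""
    if mode == "all-to-all":
        # Any MC is a legal target.
        return list(mcs)
    if mode == "quadrant":
        # Only MCs in the directory's own quadrant are legal targets.
        q = quadrant(directory, width, height)
        return [mc for mc in mcs if quadrant(mc, width, height) == q]
    raise ValueError(f"unknown cluster mode: {mode}")


# Hypothetical 8x8 mesh with one MC in each corner.
MCS = [(0, 0), (7, 0), (0, 7), (7, 7)]
all_to_all = candidate_mcs((5, 1), MCS, "all-to-all", 8, 8)
same_quadrant = candidate_mcs((5, 1), MCS, "quadrant", 8, 8)
```

For the directory at (5, 1), all-to-all mode allows all four MCs, while quadrant mode allows only the MC at (7, 0), which keeps the directory-to-MC leg short.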
Memory Modes in KNL
Flat Mode
DDR and MCDRAM in the same address space
Cache Mode
MCDRAM acting as last-level cache
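The two memory modes can be contrasted by which memories a request touches. The sketch below is a simplification under stated assumptions: in flat mode the address split between DDR and MCDRAM is an arbitrary boundary (`ddr_bytes`, hypothetical), and in cache mode the MCDRAM contents are modeled as a plain set of cached line numbers rather than a real cache.

```python
def memories_accessed(addr, mode, ddr_bytes, cached_lines=frozenset(), line_bytes=64):
    """Return the memories visited, in order, to serve one request."""
    if mode == "flat":
        # One address space: each address lives in exactly one memory.
        # Assumption: DDR occupies the low addresses, MCDRAM the rest.
        return ["DDR"] if addr < ddr_bytes else ["MCDRAM"]
    if mode == "cache":
        # MCDRAM is looked up first; DDR is accessed only on a miss.
        line = addr // line_bytes
        return ["MCDRAM"] if line in cached_lines else ["MCDRAM", "DDR"]
    raise ValueError(f"unknown memory mode: {mode}")


DDR_BYTES = 1 << 20  # hypothetical size used only for this example
flat_low = memories_accessed(0x100, "flat", DDR_BYTES)
flat_high = memories_accessed(DDR_BYTES + 64, "flat", DDR_BYTES)
cache_hit = memories_accessed(128, "cache", DDR_BYTES, cached_lines={2})
cache_miss = memories_accessed(0, "cache", DDR_BYTES, cached_lines={2})
```

A cache-mode miss visits both memories, which is why cache mode adds a step to the request life cycle compared with flat mode.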
Traffic Flow – Memory and Cluster Modes
Flat, All-to-all Mode
Cache, All-to-all Mode
Flat, Quadrant Mode
Experimental Setup
Architecture Simulator: gem5
NoC model: Garnet2.0
A CMP similar to Xeon-Phi 7210 modeled in gem5
Our model is implemented in the cache coherence traffic transitions.
gem5 output statistics are fed into the McPAT simulator to extract power results.
Network Traffic Analysis
The default gem5 model gives highly optimistic results.
The two modified models, KNL (all-to-all) and KNL (quadrant), give comparable results.
KNL (quadrant) gives better performance as it has high affinity between directory and memory controllers.
Memory Controller Placement
Exploration of memory controller placement under the modified model.
Compared with the work of Xu et al., the “Optimal” placement is no longer optimal.
The default gem5 model again gives highly optimistic results.
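One crude way to see why placement matters once the directory-to-MC leg is modeled is to average the hop count between every tile and every MC. The sketch below is not the evaluation used in the talk: it assumes an 8x8 mesh, uniform traffic, Manhattan-distance hops, and that "Column 0/7" and "Column 2/5" mean MCs on every tile of those two columns. It also ignores congestion entirely.

```python
def hops(a, b):
    """Hop count between two (x, y) tiles (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def avg_directory_to_mc_hops(mcs, width=8, height=8):
    """Average hops from a uniformly chosen tile to a uniformly chosen MC."""
    tiles = [(x, y) for x in range(width) for y in range(height)]
    total = sum(hops(t, mc) for t in tiles for mc in mcs)
    return total / (len(tiles) * len(mcs))


# Assumed layouts: MCs fill the two named columns of an 8x8 mesh.
COLUMN_0_7 = [(x, y) for x in (0, 7) for y in range(8)]
COLUMN_2_5 = [(x, y) for x in (2, 5) for y in range(8)]

avg_0_7 = avg_directory_to_mc_hops(COLUMN_0_7)
avg_2_5 = avg_directory_to_mc_hops(COLUMN_2_5)
```

Under these assumptions the inner columns give a lower average (4.875 versus 6.125 hops). A hop average like this says nothing about hotspots, so it cannot replace the simulation-based comparison.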
Memory and Cluster Mode Exploration
Compared to All-to-all Flat mode, All-to-all Cache mode gives the highest benefit: 18.62% less execution time on average.
Observations are in agreement with results obtained from the Xeon Phi 7210 hardware platform.
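For readers who prefer speedup figures, a reduction in execution time converts to a speedup as 1 / (1 - reduction). The short sketch below applies this to the 18.62% figure; the function name is hypothetical.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional execution-time reduction into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 18.62% less execution time, as reported for All-to-all Cache mode.
speedup = speedup_from_time_reduction(0.1862)
```

An 18.62% reduction therefore corresponds to a speedup of roughly 1.23x over All-to-all Flat mode.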
Conclusion
Thank you!
Questions?