![Page 1: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/1.jpg)
High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap
Yasuo Ishii, Kouhei Hosokawa, Mary Inaba, Kei Hiraki
![Page 2: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/2.jpg)
Design Goal: High Performance Scheduler
- Three evaluation metrics
  - Execution Time (Performance)
  - Energy-Delay Product
  - Performance-Fairness Product
- We found several trade-offs among these metrics
  - The configuration with the best execution time (performance) does not show the best energy-delay product
![Page 3: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/3.jpg)
Contribution
- Proposals
  - Compute-Phase Prediction: thread-priority control technique for multi-core processors
  - Writeback-Refresh Overlap: mitigates the refresh penalty on multi-rank memory systems
- Optimizations
  - MLP-aware priority control
  - Memory bus reservation
  - Activate throttling
![Page 5: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/5.jpg)
Thread-priority Control
- Thread-priority control is beneficial for multi-core chips
  - Network Fair Queuing [Nesbit+ 2006], ATLAS [Kim+ 2010], Thread Cluster Memory Scheduling (TCM) [Kim+ 2010]
- Typically, policies are updated periodically (each epoch contains millions of cycles in TCM)
[Figure: Core 0 and Core 1 send priority and non-priority requests to memory (DRAM). The cores switch between compute-intensive and memory-intensive behavior, but the priority status is not yet changed, so the high priority stays where it was.]
![Page 6: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/6.jpg)
Example: Memory Traffic of Blackscholes
- One application contains both memory-intensive phases and compute-intensive phases
[Figure: misses per kilo-instruction (MPKI) of Blackscholes, y-axis 0 to 100]
![Page 7: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/7.jpg)
Phase-prediction result of TCM
- We think this inaccurate classification is caused by the conventional strategy of updating the prediction periodically
[Figure: misses per kilo-instruction (MPKI), y-axis 0 to 100, with regions classified as compute-phase or memory-phase]
![Page 8: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/8.jpg)
Contribution 1: Compute-Phase Prediction
- "Distance-based phase prediction" realizes a fine-grain thread-priority control scheme
- Distance = number of committed instructions between two memory requests
- Compute-phase: the distance between requests exceeds Θinterval
- Memory-phase: non-distant requests continue Θdistant times
[Figure: core and memory (DRAM) with request streams illustrating the two phase transitions]
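The two transition rules on this slide can be sketched as a small per-core state machine. This is a minimal illustration, not the authors' hardware design: the class name, the method names, the initial phase, and the threshold values used in the usage note are all assumptions.

```python
from dataclasses import dataclass

COMPUTE = "compute"
MEMORY = "memory"


@dataclass
class PhasePredictor:
    """Distance-based phase predictor for one core (illustrative sketch)."""
    theta_interval: int        # distance threshold, in committed instructions
    theta_distant: int         # consecutive non-distant requests needed for memory-phase
    phase: str = COMPUTE       # initial phase is an assumption
    distance: int = 0          # instructions committed since the last memory request
    non_distant_run: int = 0   # current run of non-distant requests

    def commit(self, n: int = 1) -> None:
        """Account for n committed instructions."""
        self.distance += n
        if self.distance > self.theta_interval:
            # The distance exceeds the interval threshold: compute-phase.
            self.phase = COMPUTE
            self.non_distant_run = 0

    def memory_request(self) -> None:
        """Account for one memory request reaching the controller."""
        if self.distance <= self.theta_interval:
            self.non_distant_run += 1
            if self.non_distant_run >= self.theta_distant:
                # Non-distant requests continued theta_distant times: memory-phase.
                self.phase = MEMORY
        else:
            self.non_distant_run = 0
        self.distance = 0
```

With hypothetical thresholds `PhasePredictor(theta_interval=100, theta_distant=3)`, committing 150 instructions puts the core in the compute-phase, and three requests separated by 10 instructions each move it to the memory-phase. Because the state is updated on every request rather than once per epoch, the prediction can follow phase changes at a much finer grain than a periodically updated policy.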
![Page 9: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/9.jpg)
Phase-prediction of Compute-Phase Prediction
- The prediction result nearly matches the optimal classification
- Improves fairness and system throughput
[Figure: misses per kilo-instruction (MPKI), y-axis 0 to 100]
![Page 11: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/11.jpg)
DRAM refreshing penalty
- DRAM refreshing increases the stall time of read requests
  - Stalled read requests increase the execution time
  - Shifting the refresh timing cannot reduce the stall time
- This increases the risk of stalls for read requests
[Figure: timing diagram of Rank-0, Rank-1, and the memory bus, marking tREFI, tRFC, and the stall of read requests]
![Page 12: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/12.jpg)
Contribution 2: Writeback-Refresh Overlap
- Typically, modern controllers separate read phases and write phases to reduce bus turnaround penalties
- Overlaps the refresh command with the write phase
- Avoids increasing the stall time of read requests
[Figure: timing diagram of Rank-0, Rank-1, and the memory bus alternating read (R) and write (W) phases, marking where read requests stall]
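One way to picture the overlap is as a planning step at the start of a write phase. The sketch below assumes that a rank under refresh cannot accept writes and that writes to the other ranks are drained in the meantime. The function name, the `Write` record, and the rule for choosing which rank to refresh are illustrative assumptions, not details from the slides.

```python
from collections import namedtuple

# Hypothetical write-queue entry: target rank and address.
Write = namedtuple("Write", ["rank", "addr"])


def plan_write_phase(write_queue, refresh_pending):
    """Choose a rank to refresh during a write phase and the writes to drain.

    write_queue: list of Write entries waiting in the write queue.
    refresh_pending: set of rank ids whose refresh is due.
    Returns (rank_to_refresh_or_None, writes_to_issue).
    """
    if not refresh_pending:
        return None, list(write_queue)
    # Assumed heuristic: refresh the pending rank with the fewest queued
    # writes, so that most of the write phase can proceed on other ranks.
    queued = {r: sum(1 for w in write_queue if w.rank == r) for r in refresh_pending}
    rank = min(sorted(refresh_pending), key=lambda r: queued[r])
    # Writes to the refreshing rank wait; the rest overlap with the refresh.
    drain = [w for w in write_queue if w.rank != rank]
    return rank, drain
```

Since reads are not being served during the write phase anyway, hiding the refresh behind it keeps the refresh latency off the read path.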
![Page 14: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/14.jpg)
Optimization 1: MLP-Aware Priority Control
- Prioritizes low-MLP requests to reduce the stall time
- This priority is higher than the priority control of compute-phase prediction
- Minimalist [Kaseridis+ 2011] also uses MLP-aware scheduling
[Figure: loads from Core 0 and Core 1 in the request queue in front of memory (DDR3); one request is given extra priority, and the resulting core stalls are shown]
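The priority ordering described here can be sketched as a sort key over the read queue. This is only an illustration: treating a core as low-MLP when it has at most `low_mlp_threshold` outstanding reads, the `Read` record, and the age-based tie-break are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    core: int            # issuing core
    arrival: int         # arrival cycle, smaller means older
    compute_phase: bool  # priority bit from compute-phase prediction


def pick_next(queue, low_mlp_threshold=1):
    """Return the read to serve next from a non-empty queue."""
    outstanding = Counter(r.core for r in queue)

    def key(r):
        low_mlp = outstanding[r.core] <= low_mlp_threshold
        # Low-MLP first, then compute-phase priority, then the oldest request.
        return (not low_mlp, not r.compute_phase, r.arrival)

    return min(queue, key=key)
```

A core with a single outstanding load is fully stalled until that load returns, whereas a core with many outstanding loads overlaps their latencies, which is why the low-MLP check sits above the compute-phase priority bit in the key.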
![Page 15: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/15.jpg)
Optimization 2: Memory Bus Reservation
- Reserves HW resources to reduce the latency of critical read requests
  - Data bus for read and write (considering the tRTR/tWTR penalty)
- This method improves system throughput and fairness
[Figure: command timing for Rank-0 (ACT, RD, tRAS) and Rank-1 (RD) with BL transfers on the memory bus, marking the additional penalty]
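A reservation check on the data bus might look like the sketch below. It is a simplified model under stated assumptions: the reserved window belongs to one critical read, tRTR is treated as a gap required between transfers of different ranks, and the tWTR case is not modeled. The function name and parameters are hypothetical.

```python
def conflicts(reserved_start, reserved_len, reserved_rank,
              start, length, rank, t_rtr):
    """Return True if a transfer would collide with a reserved bus window.

    The reserved window covers cycles [reserved_start, reserved_start + reserved_len).
    A candidate transfer covers [start, start + length). When the ranks differ,
    the window is padded by t_rtr cycles on both sides (assumed switch penalty).
    """
    gap = t_rtr if rank != reserved_rank else 0
    lo = reserved_start - gap
    hi = reserved_start + reserved_len + gap
    return start < hi and start + length > lo
```

A scheduler using such a check would issue a non-critical transfer only when `conflicts(...)` is false, so the critical read never waits for a bus turnaround caused by a request issued just before it.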
![Page 16: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/16.jpg)
Optimization 3: Activate Throttling
- Controls precharge / activation based on tFAW tracking
  - A precharge command issued too early does not reduce the latency of the following activate command
- Activate throttling increases the chance of row-hit access
[Figure: Rank-0 command timeline against the memory clock, showing PRE and four ACT commands (1 to 4) inside a tFAW window, tRP, a further ACT, and a row-conflict]
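The tFAW rule allows at most four activates to a rank within any tFAW window, so a per-rank tracker only needs the last four activate times. The sketch below illustrates the idea; the class and method names are assumptions, and the timing values in the usage note are placeholders rather than real DDR3 parameters.

```python
from collections import deque


class FawTracker:
    """Tracks the four most recent activates of one rank (illustrative sketch)."""

    def __init__(self, t_faw, t_rp):
        self.t_faw = t_faw          # four-activate window, in cycles
        self.t_rp = t_rp            # precharge-to-activate delay, in cycles
        self.acts = deque(maxlen=4)  # cycles of the last four activates

    def record_activate(self, cycle):
        self.acts.append(cycle)

    def earliest_activate(self, now):
        """Earliest cycle at which tFAW permits the next activate."""
        if len(self.acts) < 4:
            return now
        return max(now, self.acts[0] + self.t_faw)

    def should_precharge(self, now):
        """A precharge helps only if the activate can follow once tRP elapses."""
        return now + self.t_rp >= self.earliest_activate(now)
```

For example, with `t_faw=20`, `t_rp=5`, and activates recorded at cycles 0, 2, 4, and 6, the next activate cannot be issued before cycle 20. A precharge at cycle 8 would finish long before that, so holding it back until cycle 15 costs nothing and leaves the row open for possible row hits.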
![Page 17: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/17.jpg)
Optimization: Other Techniques
- Aggressive precharge
  - Reduces row-conflict penalties
- Force refreshing
  - When the tREFI timer has expired, a forced refresh is issued
- Adds extra priority to timed-out requests
  - Promotes old read requests to a higher priority
  - Eliminates starvation
![Page 18: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/18.jpg)
Implementation: Optimized Memory Controller
- The optimized controller does not require a large HW cost
- Our new scheduling technique mainly extends the thread-priority control and the controller state
  - Adds a priority bit for each request
  - Extends the controller state (2-bit)
[Figure: block diagram with the processor core, read queue, write queue, refresh queue, MUX, controller state, refresh timer, and DDR3 devices; the added blocks are the thread priority control and the enhanced controller state]
![Page 19: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/19.jpg)
Implementation: Hardware Cost
- Per-channel resource (341.25B)
  - Compute-Phase Prediction (258B)
  - Writeback-Refresh Overlap (2-bit)
  - Other features (83B)
- Per-request resource (3-bit)
  - Priority bit, Row-hit bit, Timeout flag bit
- Overall hardware cost: 2649B
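The per-channel figure can be checked from its three components, counting the 2-bit writeback-refresh state as a quarter of a byte. The variable names below are only for illustration. The overall 2649B figure also depends on the channel and queue-entry counts, which the slide does not state, so it is not reproduced here.

```python
BITS_PER_BYTE = 8

compute_phase_prediction_bytes = 258
writeback_refresh_overlap_bytes = 2 / BITS_PER_BYTE  # 2-bit controller state
other_features_bytes = 83

per_channel_bytes = (compute_phase_prediction_bytes
                     + writeback_refresh_overlap_bytes
                     + other_features_bytes)
print(per_channel_bytes)  # prints 341.25
```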
![Page 20: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/20.jpg)
Evaluation Results
- Performance improvement
  - Exec time: 11.2%
  - PFP: 20.9%
  - EDP: 20.2%
  - Max Slowdown: 10.8%
- Max values marked on the chart: 12.9%, 26.2%, 14.9%
![Page 23: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/23.jpg)
Evaluation Results
[Figure: Total Execution Time (y-axis 0 to 1) for FCFS, Close, and Proposed, with bars for 1-channel (1chan) and 4-channel (4chan) configurations. Workloads: MT-canneal, bl-bl-fr-fr, c1-c1, c1-c2, c1-c1-c2-c2, c2, fa-fa-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe-bl-bl-fr-fr-c1-c1-st-st, fl-sw-c2-c2, st-st-st-st, and Overall]
![Page 26: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/26.jpg)
Evaluation Results
[Figure: Max Slowdown and EDP (y-axes 0 to 1) for FCFS, Close, and Proposed, with bars for 1-channel (1chan) and 4-channel (4chan) configurations. Workloads: MT-canneal, bl-bl-fr-fr, c1-c1, c1-c2, c1-c1-c2-c2, c2, fa-fa-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe, fl-fl-sw-sw-c2-c2-fe-fe-bl-bl-fr-fr-c1-c1-st-st, fl-sw-c2-c2, st-st-st-st, and Overall]
![Page 27: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/27.jpg)
Optimization Breakdown
- The 11.2% performance improvement over FCFS consists of
  - Close Page Policy: 4.2%
  - Baseline Optimization: 4.9%
  - Proposal Optimization: 1.9%
- With the baseline optimization, the improvement reaches 9.1% (4.2% + 4.9%)
- Baseline optimization
  - Timeout Detection
  - Write Queue Spill Prevention
  - Auto-Precharge
  - Max Activate-Number Restriction
- Proposed optimization
  - Proposals: Compute-phase prediction, Writeback-refresh overlap
  - Optimizations: MLP-aware priority control, Memory bus reservation, Activate throttling
[Figure: stacked bar of performance improvement, 0% to 10%, with segments FCFS (base), Close Page, Baseline Optimization, and Proposed Optimization]
![Page 29: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/29.jpg)
Performance/EDP summary
[Figure: scatter plot of (Exec time, EDP) pairs; Exec time axis 2941 to 3141, EDP axis 19.06 to 21.56.
Points: (3065, 20.58), (3054, 20.25), (3173, 21.7), (2987, 20.08), (2975, 19.79), (3012, 19.71), (2990, 19.17), (2981, 19.11), (2941, 19.06).
Labels: Close Page Policy, Optimization baseline, Final score, Y. Moon, K. Fang, K. Kuroyanagi, C. Li, L. Chen, T. Ikeda, Ours]
![Page 32: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/32.jpg)
Optimization History
[Figure: scatter plot of (Exec time, EDP) pairs; Exec time axis 2940 to 3010, EDP axis 18.75 to 19.75.
Points: (2975, 19.79), (3012, 19.71), (2990, 19.17), (2981, 19.11), (2941, 19.06), (2953, 18.75).
Labels: Optimization baseline, Final score, Y. Moon, K. Fang, K. Kuroyanagi, Ours.
Annotated steps: Opt 1: MLP-aware priority control, Opt 2: Memory bus reservation, Opt 3: ACT throttling; then Compute-phase prediction and Writeback-refresh overlap]
![Page 36: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/36.jpg)
Conclusion
- High Performance Memory Access Scheduling
  - Proposals
    - Novel thread-priority control method: Compute-phase prediction
    - Cost-effective refreshing method: Writeback-refresh overlap
  - Optimization strategies
    - MLP-aware priority control, memory bus reservation, activate throttling, aggressive precharge, force refresh, timeout handling
- The optimized scheduler reduces exec time by 11.2%
  - Several trade-offs between performance and EDP
  - Aggregating the various optimization strategies is most important for DRAM system efficiency
![Page 37: Yasuo Ishii, Kouhei Hosokawa, Mary Inaba , Kei Hiraki](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816223550346895dd25005/html5/thumbnails/37.jpg)
Q&A