[Paper Review]
Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+
+ The University of Texas at Austin, * IBM Corp.
MICRO'11

Presented by Jinil Chung (정진일) ([email protected])
Korea University, VLSI Signal Processing Lab.
[email protected] ( 2 )
Abstract
DRAM must balance performance, power, and storage density. To realize good performance, the memory controller must manage the structural and timing restrictions of the DRAM devices.
Use of the "page-mode" feature can mitigate many DRAM constraints.
However, aggressive page-mode use results in many conflicts (e.g. bank conflicts) when multiple workloads in a many-core system map to the same DRAM resources.
In this paper, the Minimalist approach uses "just enough" page-mode accesses to get the benefits while avoiding unfairness.
Proposed: address hashing + data prefetch engine + per-request priority.
1. Introduction
Row buffer (or "page-mode") access
-. Open-page policy: page-mode gain by reducing row access latency; but with multiple requests in a many-core system, it introduces priority inversion and fairness/starvation problems
-. Closed-page policy: no page-mode gain (single column access per row activation); but it avoids the complexities of row buffer management
This paper proposes a combination of the open/closed-page policies based on:
1) Page-mode gain is achievable with only a small number of page accesses
   → Propose a fair DRAM address mapping scheme: low RBL & high BLP
2) Page-mode hits come from spatial locality (NOT temporal locality!), which can be captured by prefetch engines
   → Propose an intuitive criticality-based memory request priority scheme
RBL: Row-buffer Locality, BLP: Bank-level Parallelism
2. Background
DRAM timing constraints result in "dead time" before and after a random access. The memory controller (MC)'s job is to reduce these performance-limiting gaps using parallelism.
1) tRC (row cycle time; ACT-to-ACT @same bank)
   : after the MC activates a page, it must wait tRC before the next ACT @same bank
   → multiple threads accessing different rows @same bank incur a latency overhead (tRC delay)
2) tRP (row precharge time; PRE-to-ACT @same bank)
   : in an open-page policy, when the MC activates another page there is a tRP penalty @same bank (= the current page must be closed before the new page is opened)
Example @same bank: ACT → PRE → ACT, with tRP (e.g. 12ns), tRAS (e.g. 36ns), tRC (e.g. 48ns)
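The same-bank constraint above can be sketched numerically. This is a minimal, hypothetical helper (not the paper's model), using the example timing values from this slide:

```python
# Same-bank timing constraints: tRC gates ACT-to-ACT, tRP gates PRE-to-ACT.
# Values are the example numbers from the slide, in nanoseconds.
TRC = 48   # row cycle time (ACT -> next ACT, same bank)
TRP = 12   # row precharge time (PRE -> ACT, same bank)
TRAS = 36  # minimum ACT -> PRE time

def next_activate_time(last_act_ns, precharge_ns):
    """Earliest time a new ACT may be issued to the same bank."""
    return max(last_act_ns + TRC, precharge_ns + TRP)

# A row activated at t=0 and precharged at the earliest legal point
# (t = tRAS = 36ns) cannot be reopened before t = 48ns, since
# tRAS + tRP = 36 + 12 = 48 = tRC.
print(next_activate_time(0, TRAS))  # -> 48
```

A later precharge pushes the next ACT out correspondingly: precharging at t=40ns gives an earliest reopen at t=52ns, so tRP (not tRC) becomes the binding constraint.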
3. Motivation
Use of "page-mode":
1) Latency effects: due to tRC & tRP, overall latency increases → keep the number of accesses per activation small
2) Power reduction: only the activate power is reduced → a small number of accesses is enough
3) Bank utilization: benefits drop off quickly as accesses increase → a small number of accesses is enough (if bank utilization is high, the probability that a new request will conflict with a busy bank is greater)
4) Other DRAM complexities: a small number of accesses is enough to soften restrictions
   ex) tFAW (Four-page Activate time Window; 30ns), cache block transfer delay = 6ns
   -. single access per ACT: limited peak utilization (6ns*4/30ns = 80%)
   -. two or more accesses per ACT: peak utilization not limited (12ns*4/30ns > 100%)
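The tFAW arithmetic above can be checked with a short sketch, using the 6 ns per cache-block transfer implied by the 6ns*4/30ns figure:

```python
# Peak data-bus utilization under tFAW: at most 4 ACTs per 30ns window,
# each ACT yielding (accesses_per_act * 6ns) of data transfer.
TFAW = 30.0      # ns, four-activate window
TRANSFER = 6.0   # ns, one cache-block transfer

def peak_utilization(accesses_per_act):
    """Fraction of the data bus usable, capped at 1.0."""
    data_per_window = 4 * accesses_per_act * TRANSFER
    return min(data_per_window / TFAW, 1.0)

print(peak_utilization(1))  # -> 0.8  (activate-limited, as on the slide)
print(peak_utilization(2))  # -> 1.0  (no longer activate-limited)
```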
3. Motivation
3.1 Row-buffer Locality in Modern Processors
: current workstation/server-class designs have a large last-level cache (e.g. IBM POWER7)
-. Temporal locality turns into hits in the large last-level cache
-. Row buffers therefore exploit only spatial locality (RBL)
-. Using prefetch engines, spatial locality can be predicted
3. Motivation
3.2 Bank and Row Buffer Locality Interplay with Address Mapping
-. DRAM device address: row, column, and bank
-. Example: Workload A issues a long sequential access sequence; Workload B issues a single operation
   +. Workload A at higher priority (e.g. FR-FCFS; all column bits in the low-order real address) → B0 is slowed
   +. Workload B at higher priority (e.g. ATLAS, PAR-BS; all column bits in the low-order real address) → A4 is slowed
   +. High BLP (Bank-level Parallelism) (e.g. Minimalist; column & bank bits in the low-order real address) → B0 can be serviced w/o degrading traffic to workload A
4. Minimalist Open-page Mode
4.1 DRAM Address Mapping Scheme
(Figure: 7-bit / 5-bit / 2-bit fields)
-. The basic difference is that the column access bits are split in two places:
   +. the 2 LSB bits are located right after the Block bits → for sequential access of 4 cache lines
   +. the 5 MSB bits are located just before the Row bits
-. (Not shown in the figure) higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits → reducing row buffer conflicts [Zhang et al./MICRO'00]
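The bit-splitting and XOR bank hashing can be sketched as below. The 2/5-bit column split follows the slide; the 6-bit block offset and 3-bit bank field are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of the Section 4.1 mapping: column bits split
# around the bank bits, plus permutation-based bank hashing
# [Zhang et al., MICRO'00]. Field widths other than the column split
# are assumed for illustration.
BLOCK_BITS = 6  # 64B cache block (assumed)
COL_LO = 2      # 2 LSB column bits -> 4 sequential blocks per activation
BANK_BITS = 3   # 8 banks (assumed)
COL_HI = 5      # 5 MSB column bits, just below the row bits

def map_address(paddr):
    """Return (row, bank, column) for a physical address."""
    a = paddr >> BLOCK_BITS
    col_lo = a & ((1 << COL_LO) - 1);    a >>= COL_LO
    bank   = a & ((1 << BANK_BITS) - 1); a >>= BANK_BITS
    col_hi = a & ((1 << COL_HI) - 1);    a >>= COL_HI
    row = a
    bank ^= row & ((1 << BANK_BITS) - 1)  # XOR row bits into bank select
    return row, bank, (col_hi << COL_LO) | col_lo

# Blocks 0..3 share one (row, bank) -> page-mode; block 4 rotates banks.
print(map_address(0), map_address(64 * 3), map_address(64 * 4))
```

The intent matches the slide: a stream of 4 sequential cache lines stays in one row buffer, while longer streams spread across banks for high BLP.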
4. Minimalist Open-page Mode
4.2 Data Prefetch Engine [IBM POWER6]
: predictable "page-mode" opportunities → need for an accurate prefetch engine
: each core includes a HW prefetcher with a prefetch depth distance predictor
1) Multi-line Prefetch Requests
-. A multi-line prefetch operation is a single request (indicating a specific sequence of cache lines)
-. Reduces command bandwidth and queue resources
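A hypothetical encoding of such a request makes the bandwidth saving concrete: one queue entry carries a base block and a count instead of N separate prefetch commands (the class and field names are illustrative, not from the paper):

```python
# One multi-line prefetch request stands in for several single-line
# prefetches to consecutive blocks in the same DRAM page.
from dataclasses import dataclass

@dataclass
class MultiLinePrefetch:
    base_block: int  # first cache-block address in the page
    count: int       # number of consecutive blocks to fetch

    def blocks(self):
        """Expand to the individual cache-block addresses."""
        return [self.base_block + i for i in range(self.count)]

req = MultiLinePrefetch(base_block=0x1000, count=3)
print(req.blocks())  # -> [4096, 4097, 4098]
```

One request in the queue thus represents three line transfers, which is the command-bandwidth and queue-resource reduction the slide describes.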
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme
: in OOO execution, the importance of each request can vary both between and within applications → need for a dynamic priority scheme
1) DRAM Memory Request Priority Calculation
-. Different priority based on criticality to performance
-. The priority of each request is increased every 100ns time interval → time-based aging
-. 2 categories: read (normal) and prefetch → a read request has higher priority
-. MLP information from the MSHRs in each core: many outstanding misses → each request is less important
-. Distance information from the prefetch engine (4.2)
MLP: Memory Level Parallelism, MSHR: Miss Status Holding Register
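The interplay of these inputs can be sketched as a toy priority function. The base values and bonuses are assumptions for illustration; only the ingredients (read vs. prefetch, MLP, 100 ns aging) come from the paper:

```python
# Toy per-request priority: reads start above prefetches, an isolated
# miss (low MLP) is treated as more critical, and priority is bumped
# for every 100ns a request has waited.
def priority(is_prefetch, mshr_misses, wait_ns):
    base = 0 if is_prefetch else 4            # reads above prefetches
    mlp_bonus = 2 if mshr_misses <= 1 else 0  # isolated miss = critical
    aging = wait_ns // 100                    # time-based escalation
    return base + mlp_bonus + aging

print(priority(False, 4, 0))   # fresh read, high MLP   -> 4
print(priority(True, 4, 700))  # old prefetch           -> 7
```

Note the effect of aging: a long-waiting prefetch eventually outranks a fresh read, which bounds starvation without any per-thread state.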
4. Minimalist Open-page Mode
4.3 Memory Request Queue Scheduling Scheme (cont.)
2) DRAM Page Closure (Precharge) Policy
-. Using auto-precharge avoids the extra precharge command → no increase in command bandwidth
3) Overall Memory Request Scheduling Scheme (Priority Rules)
-. The same rules are used by all MCs → no need for communication among MCs
-. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request → a very critical request can be serviced with the smallest latency
4) Handling Write Operations
-. The dynamic priority scheme does not apply to writes
-. Using the VWQ (Virtual Write Queue) causes minimal write-induced disruption
5. Evaluation
-. 8-core CMP system using the Simics functional model extended with the GEMS toolset
-. Simulates DDR3 1333MHz DRAM, using the memory controller policy under test for each experiment
-. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
   1) PAR-BS (Parallelism-Aware Batch Scheduler)
   2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
   3) FR-FCFS (First-Ready, First-Come-First-Served): baseline
5. Evaluation
5.1 Throughput
-. Overall, "Minimalist Hash+Priority" demonstrated the best throughput improvement over the other schemes, achieving a 10% improvement.
-. This compares against ATLAS and PAR-BS, which achieved 3.2% and 2.8% throughput improvements over the whole workload suite.
5. Evaluation
5.2 Fairness
-. Minimalist improves fairness by up to 15%, with overall improvements of 7.5%, 3.4%, and 2.5% over FR-FCFS, PAR-BS, and ATLAS, respectively.
5. Evaluation
5.3 Row Buffer Accesses per Activation
-. The observed page-access rate for the aggressive open-page policies falls significantly short: the high page hit rate is simply not possible given the interleaving of requests between the eight executing programs.
-. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.
5. Evaluation
5.4 Target Page-hit Count Sensitivity
-. The Minimalist scheme requires a target number of page hits to be selected, indicating the maximum number of page hits the scheme attempts to achieve per row activation.
-. A target of 4 page hits provides the best results (though a different system configuration may shift the optimal page-mode hit count).
5. Evaluation
5.5 DRAM Energy Consumption
-. To estimate power consumption, the Micron power calculator was used.
-. Energy is approximately the same as FR-FCFS; PAR-BS, ATLAS, and "Minimalist Hash+Priority" provide a small decrease of approximately 5% in overall energy consumption.
-. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).
Conclusions
Minimalist Open-page memory scheduling policy
-. Page-mode gain with a small number of page accesses per page activation
-. Per-request priority assigned using request stream information from MLP and the data prefetch engine
Improves throughput and fairness
-. Throughput increased by 10% on average (compared to FR-FCFS)
-. No need for thread-based priority information
-. No need for communication/coordination among multiple MCs or the OS
Appendix. Detailed simulation information
Thanks,