on the off-chip memory latency of real-time systems: is...
Post on 25-Jun-2020
0 Views
Preview:
TRANSCRIPT
On the Off-chip Memory Latency of Real-Time
Systems: Is DDR DRAM Really the Best Option?
Mohamed Hassan
Outline
01 02 0403 05Motivation DRAMs PREDICTABILITY RLDRAM Results
SRAMs MOTIVATION
1
• Historically, SRAMs have been the option for
real-time safety-critical embedded systems
• With the increase in data demand
→ the cost became unaffordable
Work in Off-chip Memory MOTIVATION
[Burchard et al, DATE][Heithecker and Ernst, DAC]
Smart PhonesIBM’s Acorn
2005
20112007
2009 2013
Predator[Akesson et al., CODES+ISSS]
AMC[Paolieri et al., ESL]
PRET[Reineke et al., CODES+ISSS]
COP [Goossens et al.]MultiChannel [Gomony et al.]
ORP [Zheng et al.]
MCMC[Ecco et al., RTCSA2014]DCmc [Jalle et al., RTSS2014]
[Kim et al., RTAS2014]ROC[Krishnapillai et al., ECRTS2014]
RTMem[Li et al., ECRTS2014]ReOrder[Ecco et al., ECRTS2016]
[Guo and Pellizzoni, RTAS2017]
PMC [Hassan et al., RTAS2015][Kim et al., RTAS2015]
MEDUSA[Valsan and Yun, CPSNA2015][Yun et al., ECRTS2015]
ReoRorder[Ecco and Ernst, RTSS2015]
[Hassan and Pellizzoni,EMSOFT2018]
2014
2015
2016/17
2018
1
Work in Off-chip Memory MOTIVATION
[Burchard et al, DATE][Heithecker and Ernst, DAC]
Smart PhonesIBM’s Acorn
2005
20112007
2009 2013
Predator[Akesson et al., CODES+ISSS]
AMC[Paolieri et al., ESL]
PRET[Reineke et al., CODES+ISSS]
COP [Goossens et al.]MultiChannel [Gomony et al.]
ORP [Zheng et al.]
MCMC[Ecco et al., RTCSA2014]DCmc [Jalle et al., RTSS2014]
[Kim et al., RTAS2014]ROC[Krishnapillai et al., ECRTS2014]
RTMem[Li et al., ECRTS2014]ReOrder[Ecco et al., ECRTS2016]
[Guo and Pellizzoni, RTAS2017]
PMC [Hassan et al., RTAS2015][Kim et al., RTAS2015]
MEDUSA[Valsan and Yun, CPSNA2015][Yun et al., ECRTS2015]
ReoRorder[Ecco and Ernst, RTSS2015]
[Hassan and Pellizzoni,EMSOFT2018]
2014
2015
2016/17
2018
All in Double Data Rate (DDR) DRAMs
1
A Context about DDRs MOTIVATION
• Low cost
• Large capacity
• High BW
• DDR DRAM is the commodity off-chip memory, Why?
• What is the most important requirement for real-time/safety-critical systems?
• Yes, Predictability
• How is DDRx for predictability?
• DDRx Random Access Memories are not Random at all!!
• Access latency varies notably based on many factors• access patterns
• transaction type (read or write)
• DRAM state from previous accesses
2
DRAM
• Multiplexed address mode:
• The address bits are split into two segments provided to the
device in two stages:
1. Row address → row decoder
2. Column address → column decoder
✓ Low cost (less pin count)
High latency
Huge variability
Background
3
DRAM
• A request in general can consist of one, two, or three
commands:
• ACTIVATE (A) command:
• Bring data row from cells into sense amplifiers
Background
3
DRAM
• DRAM Consists of multiple banks
• The memory controller (MC) manages accesses to DRAM
• A request in general consists of:
• ACTIVATE (A) command:
• Bring data row from cells into sense amplifiers
• Read/Write (R/W) commands:
• To read/write from specific columns in
the sense amplifiers
Background
3
DRAM
• DRAM Consists of multiple banks
• The memory controller (MC) manages accesses to DRAM
• A request in general consists of:
• ACTIVATE (A) command:
• Bring data row from cells into sense amplifiers
• Read/Write (R/W) commands:
• To read/write from specific columns in
the sense amplifiers
• PRECHARGE (P) command:
• to write back a previous row in the sense
amplifiers before bringing the new one
Background
3
DRAM
• DRAM Consists of multiple banks
• The memory controller (MC) manages accesses to DRAM
• A request in general consists of:
• ACTIVATE (A) command:
• Bring data row from cells into sense amplifiers
• Read/Write (R/W) commands:
• To read/write from specific columns in
the sense amplifiers
• PRECHARGE (P) command:
• to write back a previous row in the sense
amplifiers before bringing the new one
Background
Row Conflict: P+ A + R/W
Row Idle/Close: A + R/W
Row Hit: R/W
3
Background DRAM
• DRAM Consists of multiple banks
• The memory controller (MC) manages accesses to DRAM
• A request in general consists of:
• ACTIVATE command
• R/W commands
• PRECHARGE command
• All commands have associated timing constraints that have
to be satisfied by the controller (20+ timing constraints)
R1
DATAtRL
0 37 41
A0
DATAtWL tWTR
27
A1
W0
3 9
tRCD
3
Work in Off-chip Memory MOTIVATION
[Burchard et al, DATE][Heithecker and Ernst, DAC]
Smart PhonesIBM’s Acorn
2005
20112007
2009 2013
Predator[Akesson et al., CODES+ISSS]
AMC[Paolieri et al., ESL]
PRET[Reineke et al., CODES+ISSS]
COP [Goossens et al.]MultiChannel [Gomony et al.]
ORP [Zheng et al.]
MCMC[Ecco et al., RTCSA2014]DCmc [Jalle et al., RTSS2014]
[Kim et al., RTAS2014]ROC[Krishnapillai et al., ECRTS2014]
RTMem[Li et al., ECRTS2014]ReOrder[Ecco et al., ECRTS2016]
[Guo and Pellizzoni, RTAS2017]
PMC [Hassan et al., RTAS2015][Kim et al., RTAS2015]
MEDUSA[Valsan and Yun, CPSNA2015][Yun et al., ECRTS2015]
ReoRorder[Ecco and Ernst, RTSS2015]
[Hassan and Pellizzoni,EMSOFT2018]
2014
2015
2016/17
2018
1. It can not address the variability in the access latency of the DDRx chips.
2. Still suffers from high WCLs due to the complex interactions between DDRx commands
4
Assessing DDRs for Predictability MOTIVATION
• Low cost
• high capacity
• High BW
• DDR DRAM is the commodity off-chip memory, Why?
• What is the most important requirement for real-time/safety-critical systems?
• Yes, Predictability
• How is DDRx for predictability?
• DDRx Random Access Memories are not Random at all!!
• access latency varies notably based on many factors• access patterns
• transaction type (read or write)
• DRAM state from previous accesses Comprehensively study these factorsሽ
5
Assessing DDRs for Predictability PREDICTABILITY
• How is DDRx for predictability?
• Predictability has different definitions in the real-time literature
• One important measure is the relative difference between best- and worst-case execution times (or latencies in case of memories) [Wilhelm et al, TECS08]
• We define “Variability Window” (VW) to quantitatively measure the
DRAM predictability
𝑉𝑊 =𝑊𝐶𝐿 − 𝐵𝐶𝐿
𝑊𝐶𝐿× 100
5
Assessing DDRs for Predictability PREDICTABILITY
R1
DATAtRL
0 10 14
R1
DATAtRL
0 27 31
W0
DATAtWL tWTR
17
• Targets an open row (only R command)
• (a) is best-case
• (e) arrives after a write to same rank
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
(a)
(e)
Assessing DDRs for Predictability PREDICTABILITY
R1
DATAtRL
0 10 14
R1
DATAtRL
0 27 31
W0
DATAtWL tWTR
17
• Targets an open row (only R command)
• (a) is best-case
• (e) arrives after a write to same rank
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
(a)
(e)
Assessing DDRs for Predictability PREDICTABILITY
• Targets a close row (A + R command)
• (f) is best-case
• (j) arrives after a closed write to same rank
R1
DATAtRL
0 10 24
R1
DATAtRL
0 37 41
A0
DATAtWL tWTR
27
A1
A1
W0
3 9
tRCD
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
20
(f)
(j)
Assessing DDRs for Predictability PREDICTABILITY
• Targets a close row (A + R command)
• (f) is best-case
• (j) arrives after a closed write to same rank
R1
DATAtRL
0 10 24
R1
DATAtRL
0 37 41
A0
DATAtWL tWTR
27
A1
A1
W0
3 9
tRCD
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
(f)
(j)
20
Assessing DDRs for Predictability PREDICTABILITY
• Targets a conflict row (P + A + R command)
• (k) is best-case
• (o) arrives after a conflict write to same rank
R1
DATAtRL
0 20 34
0 42 72
A0
DATAtWL
27
A1
W0
9
tRCD
P1
10
P0
R1
DATAtRL
A1
P1
tRL
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
(k)
(o)
30
Assessing DDRs for Predictability PREDICTABILITY
• Targets a conflict row (P + A + R command)
• (k) is best-case
• (o) arrives after a conflict write to same rank
R1
DATAtRL
0 20 34
0 42 72
A0
DATAtWL
27
A1
W0
9
tRCD
P1
10
P0
R1
DATAtRL
A1
P1
tRL
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
6
(k)
(o)
30
Assessing DDRs for Predictability PREDICTABILITY
15 2
1
19
.5
19
.5
40
.5
30 3
6
34
.5
34
.5
55
.5
45
79
.5
93 94
.5
10
8
0
20
40
60
80
100
120
a b c d e f g h i j k l m n o
Late
ncy
[n
s]
VW= 620%
WCL=108ns
• 15 cases for a read request
• Another 15 for a write request
6
Assessing DDR Controllers for Predictability PREDICTABILITY
0
200
400
600
800
1000
1200
VW
%
• We calculate the VW for 8 of the state-of-the-art DDRx DRAM Controllers
7
Assessing DDR Controllers for Predictability PREDICTABILITY
0
200
400
600
800
1000
1200
VW
%
• We calculate the VW for 8 of the state-of-the-art DDRx DRAM Controllers
6 out of the 8 exceed 800%
7
Assessing DDR Controllers for predictability PREDICTABILITY
0
200
400
600
800
1000
1200
VW
%
• We calculate the VW for 8 of the state-of-the-art DDRx DRAM Controllers
6 out of the 8 exceed 800%Achieve less variability at the expense of
1. complexity:
• Bank partitioning
• Rank switching
2. Conservatism:
• e.g. using close-page (MCMC)
7
Assessing DDR Controllers for predictability PREDICTABILITY
0
200
400
600
800
1000
1200
VW
%
• We calculate the VW for 8 of the state-of-the-art DDRx DRAM Controllers
6 out of the 8 exceed 800%Achieve less variability at the expense of
1. complexity:
• Bank partitioning
• Rank switching
2. Conservatism:
• e.g. using close-page (MCMC)
Even with the pessimism and complexity, 261% is still a significant variability for safety-critical systems
7
Assessing DDR Controllers for predictability PREDICTABILITY
0
200
400
600
800
1000
1200
VW
%
• We calculate the VW for 8 of the state-of-the-art DDRx DRAM Controllers
6 out of the 8 exceed 800%Achieve less variability at the expense of
1. complexity:
• Bank partitioning
• Rank switching
2. Conservatism:
• e.g. using close-page (MCMC)
Even with the pessimism and complexity, 261% is still a significant variability for safety-critical systems
Exploring other types of memories that address these limitations is unavoidable towards providing more predictable memory performance with less variability and tighter bounds
7
RLDRAM: An Alternative RLDRAM
RLDRAM Introducedby Infinion
1999
RLDRAM2 was introduced by Infinion and Micron
RLDRAM3 was introduced by Micron
2003
2012
+8
Why RLDRAM? RLDRAM
DDRDRAM
12
RASRow Address
CASCol Address
Row Conflict: P+ A + R/W
Row Idle/Close: A + R/W
Row Hit: R/W
RLDRAMAll Address
Bits
R/W
9Multiplexed Address Mode Non-Multiplexed Address Mode
Assessing RLDRAM for Predictability RLDRAM
R1
DATAtRL
0 13 17
W1
DATAtWL
0 14 18
R1
DATAtRL
3 16 20
R0
0
W1
DATAtWL
2 16 20
R0
0
R1
DATAtRL
4 17 21
W0
0
R1
DATAtWL
3 17 21
W0
0
R1
DATAtRL
5 18 22
C0
0
tRC
W1
DATAtWL
5 23
C0
0
tRC
19
10
Assessing RLDRAM for Predictability RLDRAM
R1
DATA
tRL
0 13 17
W1
DATA
tWL
0 14 18
R1
DATA
tRL
3 16 20
R0
0
W1
DATA
tWL
2 16 20
R0
0
R1
DATA
tRL
4 17 21
W0
0
R1
DATA
tWL
3 17 21
W0
0
R1
DATA
tRL
5 18 22
C0
0
tRC
W1
DATA
tWL
5 23
C0
0
tRC
19
• A total of 8 cases• Variability window is 46.2%
(13.4x reduction)• WCL=28.5ns (3.79x reduction)
19
.5 21
24
24 2
5.5
25
.5 27
28
.5
0
5
10
15
20
25
30
a b c d e f g h
Late
ncy
[n
s]46.2%
10
RLDC: A Predictable Controller for RLDRAM RLDRAM
Pro
cess
or
Dec
od
er
Command Generation
Address Mapping
Timing Checker
RR
Arb
iter
RLD
RA
M
type
address
perPEBuffers
cmd
ba
addr
• RLDC to predictably manage accesses to RLDRAM
• Round Robin• Support both bank sharing and
bank partitioning• Simple timing checker → good for
analyzability, V&V, Certification
11
Bounding Memory Latency RLDRAM
R
DATAtRL
6 12 18
W
0
RW
31
21 43
11 11
: Processor
: Bank
• Bank Sharing Scheme:
𝑊𝐶𝐿𝑠ℎ𝑎𝑟𝑒 = 𝑁 − 1 × 𝑡𝑅𝐶+𝑡𝐶𝐿
N is number of processing elements
tRC
12
Bounding Memory Latency RLDRAM
N is number of processing elements
R
DATAtRL
5 8 13
W
0
RW
26
21 43
21 43
: Processor
: Bank
𝑊𝐶𝐿𝑝𝑎𝑟𝑡 =𝑁 − 1
2× 𝑡𝑊𝐿 − 𝑡𝑅𝐿 +
𝐵𝐿
2
+𝑁 − 1
2× 𝑡𝑅𝐿 − 𝑡𝑊𝐿 +
𝐵𝐿
2
+ 𝑡𝐶𝐿
W-to-R Delay
R-to-W Delay
• Bank Partitioning Scheme:
12
Evaluation Setup RESULTS
PEs
4 Processorsin-order pipelinea private 16KB L1 a shared 1MB L2 cache
DRAM Either RLDRAM or DDRx
RLDRAMRLDRAM3-1600RLDC manages accesses to RLDRAM
DDRxDDR3-1600 AMC, PMC, RTMem, DCmc, ORP, MCMC, ROC, or ReOrder manages access to DDR3
Bank Management
We experiment with both bank partitioning and bank sharing among PEs for RLDC
Benchmarks EEMBC Automotive
13
Worst-Case Latency RESULTS
0
50
100
150
200
250
300
WC
L (n
s)
experimental analytical
1. DDRx MC has 2.5x to 6.46xworse analytical WCL than RLDC→Very similar numbers for exp.
2. Relatively low WCL of MCMC, ROC, ReOrder is due to 4 ranks!
3. For RLDC: bank partitioning provide tighter WCL than sharing→ at the expense of flexibility
4. Gap between exp. vs analytical WCL is much higher for DDR →again due to inherent variability
14
Variability Window RESULTS
1. Already discussed analytical VW2. Exp. VW for DDRx MCS:
• >400% for 4 MCs, • 300%-400% for 3 MCs and• ~200% for 1 MC.
3. for RLDC:• 76.9% for partitioned banks• 84.6% for shared banks
0
200
400
600
800
1000
1200
VW
experimental analytical
15
Scalability: # Processors RESULTS
1. The WCL latency gap between RLDC and majority of DDRx MCs increases drastically
16
Scalability: # Processors RESULTS
1. The WCL latency gap between RLDC and majority of DDRx MCs increases drastically
2. DDRx MCs can be categorized into three categories
cat1 cat2 cat3
Bank
sharing
Bank
partBank part +
multi-rank
16
Scalability: # Processors RESULTS
1. The WCL latency gap between RLDC and majority of DDRx MCs increases drastically
2. DDRx MCs can be categorized into three categories
3. RLDC’s WCL is less than all categories for all #PEs → for both part and sharing→ without complex arbitration/ reorderings (better analyzability and composability)
cat1 cat2 cat3
Bank
sharing
Bank
partBank part +
multi-rank
16
Discussion RESULTS
1. How mature is RLDRAM?
• Has been there since 1999
• Long-term Supported by Micron
• The main off-chip memory in networking
and other low-latency needs
17
Discussion RESULTS
2. Can I buy RLDRAM off-the-shelf?
• Definitely
• They are sold as discrete components (which is the norm for
embedded systems anyway)
• There are also specialized DDR-compatible sockets for
RLDRAM
17
Discussion RESULTS
3. Are there any platforms/boards/test-beds?
ATCA-9305-NSPProcessing blade
Zynq UltraScale/+ MPSoC
Arria10 SoC
17
Discussion RESULTS
4. If it is that good, why it did not take off then?
• Take off = replaces DDR? → It is not meant to be!
• It is a specialized type of memory for low-latency guarantees
• Commodity general-purpose market is looking for high BW, capacity,
cost
• We should ask ourselves, what are we looking for?
• An analogy: FPGAs have been there for long-time, why are they
taking off now? → they satisfy a “new” need in the AI market
• Will they replace CPUs then?
• It is not an either-or decision
17
Discussion RESULTS
5. Does that mean we no longer need DRAMs in real-time systems?
• No
• Use-case dependent
• A heterogenous memory system?
• Mixed Criticality with different requirements?
17
Discussion RESULTS
6. As a scheduling researcher, why should I care?
• Taking the shared resources interference into account is
unavoidable to provide more accurate numbers
• DDR DRAM is very complex to account for its details (rd vs
wr, conf vs hit, ..etc) → explodes the analysis
• RLDRAM alleviates this complexity..which can make the
analysis more feasible
17
Discussion RESULTS
7. Limitations/trade-off
17
COST
DDRx RLDRAM SRAM
DDRx RLDRAM SRAM
Latency
15 21
19
.51
9.5 4
0.5
30 36
34
.53
4.5 5
5.5
45
79
.5 93 94
.5 10
8
0
50
100
150
a c e g i k m o
Late
ncy
[n
s] VW= 620%
0
200
400
600
800
1000
1200
VW
%
19
.5 21
24
24 2
5.5
25
.5 27
28
.5
0
5
10
15
20
25
30
a b c d e f g h
Late
ncy
[n
s]
46.2%
Pro
cess
or
Dec
od
er Command Generation
Address Mapping
Timing Checker
RR
Arb
iter
RLD
RA
M
type
address
perPEBuffers
cmd
ba
addr
11× less timing variability6.4× less WCL
Main Lessons:1. DDR DRAM is not
designed for predictability,RLDRAM is.
2. Looking for solns that address our needs insteadof starting from the mainstream soln?
3. Not either-or: A heterogenous memoryto address conflictingneeds of MCS
15 21
19
.51
9.5 4
0.5
30 36
34
.53
4.5 5
5.5
45
79
.5 93 94
.5 10
8
0
50
100
150
a c e g i k m o
Late
ncy
[n
s] VW= 620%
0
200
400
600
800
1000
1200
VW
%
19
.5 21
24
24 2
5.5
25
.5 27
28
.5
0
5
10
15
20
25
30
a b c d e f g h
Late
ncy
[n
s]
46.2%
Pro
cess
or
Dec
od
er Command Generation
Address Mapping
Timing Checker
RR
Arb
iter
RLD
RA
M
type
address
perPEBuffers
cmd
ba
addr
11× less timing variability6.4× less WCL
Main Lessons:1. DDR DRAM is not
designed for predictability,RLDRAM is.
2. Looking for solns that address our needs insteadof starting from the mainstream soln?
3. Not either-or: A heterogenous memoryto address conflictingneeds of MCS
top related