NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level
Myoungsoo Jung (MJ), Ellis H. Wilson III, David Donofrio, John Shalf, Mahmut T. Kandemir
Agenda
• Revisiting NAND flash technology
• Advanced NAND flash operations
• NANDFlashSim
• Evaluation
Intrinsic Latency Variation
• Fowler-Nordheim tunneling
  – Making an electron channel
  – Voltage is applied over a certain threshold
• Incremental step pulse programming (ISPP)
[Figure: distribution of cells; floating-gate cell showing the gate, floating gate, source, drain, and channel]
Intrinsic Latency Variation
• Each step of ISPP needs a different programming duration (latency)
• Latencies of the NAND flash memory fluctuate depending on the address of the pages in a block
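The link between ISPP and latency can be illustrated with a small sketch. It models programming as a loop of pulse-plus-verify steps that ends once the cell reaches its target threshold voltage, so cells with different targets need different numbers of steps. The voltages, step size, and timings are illustrative assumptions, not values from the talk or from any datasheet, and `ispp_program` is a hypothetical helper.

```python
def ispp_program(target_vth, erased_vth=-2.0, step_v=0.5,
                 pulse_us=20.0, verify_us=5.0, max_pulses=32):
    """Return (pulses, latency_us) needed to reach target_vth.

    Assumption: each program pulse raises the cell threshold voltage by
    step_v, and every pulse is followed by one verify step.
    """
    vth = erased_vth
    pulses = 0
    while vth < target_vth:
        if pulses == max_pulses:
            raise RuntimeError("cell did not reach target within max_pulses")
        vth += step_v          # one program pulse
        pulses += 1            # followed by one verify
    return pulses, pulses * (pulse_us + verify_us)


# Two pages whose cells have different (assumed) target levels.
fast = ispp_program(target_vth=1.0)
slow = ispp_program(target_vth=3.0)
print(fast, slow)
```

Under these assumptions the second page needs more pulses and therefore takes longer, which is the kind of page-dependent fluctuation a constant-latency model cannot express.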
NAND Flash Architecture
• Employing cache and data registers
• Multiple planes (memory array)
• Multiple dies
[Figure: memory array (plane)]
Density Trend
• Flash technology
  – Each cell is capable of storing multiple bits
  – Manufacturing feature size is scaling down
• So far, density has been increasing by two to four times every 2 years
[Figure: flash chip density (Gb), 0 to 70, versus year, 2002 to 2012]
Density Trend
• Shrinking the manufacturing feature size may hit a limit at around 12 nanometers
• Multi-die stack technology
  – Flash packages continue to scale up by employing multiple dies and planes
[Figure: density (Gb) versus year for a single flash chip (2002 to 2012, 0 to 70 Gb) and for flash chips in a multi-chip package (MCP; 2002 to 2014, 0 to 2000 Gb, with points labeled 1Gb, 32Gb, and 64Gb and a "?" for the future trend). Example package: MOSAID HLNAND (32Gb 16-die NAND stack)]
How does performance behavior change?
Legacy Operation
• An I/O operation splits into several operation stages
• Each stage should be appropriately handled by device drivers
[Figure: write operation timing with cmd, addr, data, and status check stages]
Cache Operation
• Cache mode operations use internal registers in an attempt to hide the performance overhead of data movements
[Figure: cache mode timing with data1 and data2 transfers]
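A rough way to see what cache mode hides is to compare a fully serialized write sequence with one in which the next page's data transfer overlaps the current page's programming. The sketch below is a simplified pipeline model under that assumption; the function names and timing values are hypothetical and not taken from the talk.

```python
def legacy_write_us(pages, xfer_us, prog_us):
    """Each page is transferred, then programmed, strictly one after another."""
    return pages * (xfer_us + prog_us)


def cache_write_us(pages, xfer_us, prog_us):
    """Assumed cache mode: the transfer of page i+1 into the cache register
    overlaps with the programming of page i from the data register."""
    if pages == 0:
        return 0
    # First transfer cannot be hidden; afterwards the slower of the two
    # activities paces the pipeline; the last program is always exposed.
    return xfer_us + (pages - 1) * max(xfer_us, prog_us) + prog_us


legacy_total = legacy_write_us(pages=4, xfer_us=50, prog_us=200)
cache_total = cache_write_us(pages=4, xfer_us=50, prog_us=200)
print(legacy_total, cache_total)
```

With these example numbers only the first transfer remains visible in cache mode; all later transfers are hidden behind programming.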
Internal Data Move Mode
• Saves space and cycles when copying data
• Source and destination page addresses should be located in the same die
[Figure: internal data move timing with cmd, src addr, cmd, dst addr, and cmd stages; no data movement through the NAND interface]
Multi-plane Mode Operation
• Two different pages can be served in parallel
• Addresses should indicate the same page offset in a block and the same die address, and should have different plane addresses (plane addressing rule)
[Figure: multi-plane timing with cmd, addr1, addr2, data1, data2, and status check stages]
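The plane addressing rule can be written down as a simple predicate. The sketch below checks exactly the three conditions listed on the slide; the `PageAddress` layout and the function name are hypothetical, and the block number is deliberately left unconstrained because the slide does not restrict it.

```python
from collections import namedtuple

# Hypothetical address layout for illustration only.
PageAddress = namedtuple("PageAddress", ["die", "plane", "block", "page_offset"])


def satisfies_plane_addressing_rule(addresses):
    """True if the pages could be issued as one multi-plane operation."""
    if len(addresses) < 2:
        return False  # a multi-plane operation needs at least two pages
    same_die = len({a.die for a in addresses}) == 1
    same_offset = len({a.page_offset for a in addresses}) == 1
    distinct_planes = len({a.plane for a in addresses}) == len(addresses)
    return same_die and same_offset and distinct_planes


ok = [PageAddress(0, 0, 10, 5), PageAddress(0, 1, 42, 5)]
bad_offset = [PageAddress(0, 0, 10, 5), PageAddress(0, 1, 42, 6)]
bad_plane = [PageAddress(0, 0, 10, 5), PageAddress(0, 0, 11, 5)]
bad_die = [PageAddress(0, 0, 10, 5), PageAddress(1, 1, 10, 5)]
print([satisfies_plane_addressing_rule(x)
       for x in (ok, bad_offset, bad_plane, bad_die)])
```

A scheduler that wants to use multi-plane mode would have to find pending requests that pass such a check, which is why this mode is more restrictive than interleaving across dies.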
Interleaved-Die Mode Operation
• Provides a way to take advantage of internal parallelism by interleaving NAND transactions
• Scheduling of NAND transactions and bus arbitration are critical determinants of memory system performance
[Figure: interleaved timing of two transactions, one with cmd, addr1, data1 and one with cmd, addr2, data2]
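Why scheduling and bus arbitration matter can be seen in a minimal timeline model: the bus is shared, so transfers are serialized, while cell activity on different dies can overlap. The sketch below issues requests in order with a greedy policy; the request format, timing values, and the absence of cache registers are all simplifying assumptions.

```python
def interleaved_finish_time_us(requests):
    """requests: list of (die, xfer_us, cell_us) issued in order.

    Assumptions: one shared bus, one outstanding operation per die, and a
    transfer to a die may only start once that die is idle.
    """
    bus_free = 0
    die_free = {}
    for die, xfer_us, cell_us in requests:
        start = max(bus_free, die_free.get(die, 0))
        bus_free = start + xfer_us            # bus is held during the transfer
        die_free[die] = bus_free + cell_us    # die is busy until cell activity ends
    return max(die_free.values(), default=0)


single_die = interleaved_finish_time_us([(0, 50, 200)] * 4)
two_dies = interleaved_finish_time_us([(0, 50, 200), (1, 50, 200)] * 2)
print(single_die, two_dies)
```

In this example the same four writes finish considerably earlier when they alternate between two dies, because one die's cell activity overlaps the other die's bus transfer.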
Challenges
• Performance varies based on:
  – Intrinsic latency variation characteristics
  – Internal parallelism
  – Advanced flash operation types
• Performance is affected by:
  – How to deal with diverse advanced flash operations
  – How to effectively schedule NAND transactions
Prior Simulation Works
• Flash-based solid state disk simulation
  – Tightly coupled to specific flash firmware
• Unaware of latency variation of NAND flash
  – Latency approximation model with constants
• Coarse-grain NAND command handling
  – In-order execution
[Figure: SSD simulator within an I/O subsystem, consisting of flash firmware on top of a flash latency approximation]
NANDFlashSim
• Simulating and modeling NAND flash rather than flash firmware or SSDs
  – NANDFlashSim can be applied to diverse applications such as off-chip caches of a multi-core system and I/O subsystems of mobile systems
  – Multiple instances can be used for building SATA and PCI-e based SSDs
[Figure: SSD simulator within an I/O subsystem (flash firmware and flash latency approximation), shown alongside flash firmware]
NANDFlashSim
• Detailed timing model
• Awareness of intrinsic latency variation
  – Designed to be performance variation-aware; employs different page offsets in a physical block
• Reconfigurable microarchitecture
  – Supports highly reconfigurable architectures in terms of multiple dies and planes
• Fine-grain NAND flash command handling
  – 16 combinations of advanced flash operations
  – Supports out-of-order execution
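To make the contrast with constant-latency models concrete, the sketch below compares three ways of assigning a program latency to a page: a worst-case constant, a typical-case constant, and a lookup keyed on the page offset within the block. The alternating fast/slow pattern and the microsecond values are purely illustrative assumptions; real devices have their own offset-to-latency mapping.

```python
FAST_US, SLOW_US = 250.0, 1250.0   # assumed fast/slow page program times


def variation_aware_us(page_offset):
    # Assumption: even offsets are fast pages, odd offsets are slow pages.
    return FAST_US if page_offset % 2 == 0 else SLOW_US


def worst_case_us(page_offset):
    return SLOW_US


def typical_case_us(page_offset):
    return (FAST_US + SLOW_US) / 2


def total_us(model, page_offsets):
    """Sum the modeled program latency over the accessed page offsets."""
    return sum(model(offset) for offset in page_offsets)


whole_block = range(8)             # all 8 pages of a tiny example block
fast_only = [0, 2, 4, 6]           # a workload that touches only fast pages

for name, offsets in (("whole block", whole_block), ("fast pages", fast_only)):
    print(name,
          total_us(variation_aware_us, offsets),
          total_us(typical_case_us, offsets),
          total_us(worst_case_us, offsets))
```

Under these assumptions the typical-case constant happens to match when every page is written, but both constant models overestimate once the access pattern favors fast pages, which is the situation a variation-aware model is meant to capture.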
High-level View
• Command set architecture and an individual state machine associated with it
• Host and NAND flash clock domains are separate
• All entries (controller, register, die, …) are updated at every cycle
[Figure: block diagram showing the logical unit, bus, NAND flash I/O, and k*j blocks]
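A cycle-driven simulator with separate clock domains can be sketched as a loop that advances time and ticks every component whose domain has a clock edge at that instant. The class names, the 1 ns time base, and the clock periods below are assumptions for illustration; they are not NANDFlashSim's actual interfaces.

```python
class Component:
    """Any simulated entity (controller, register, die, ...)."""

    def __init__(self, name):
        self.name = name
        self.cycles = 0

    def tick(self):
        self.cycles += 1   # a real model would advance its state machine here


class ClockDomain:
    def __init__(self, period_ns, components):
        self.period_ns = period_ns
        self.components = components

    def maybe_tick(self, now_ns):
        # Update every component in this domain on each clock edge.
        if now_ns % self.period_ns == 0:
            for component in self.components:
                component.tick()


def run(domains, duration_ns):
    for now_ns in range(duration_ns):
        for domain in domains:
            domain.maybe_tick(now_ns)


host = ClockDomain(5, [Component("host_controller")])
nand = ClockDomain(20, [Component("register"), Component("die")])
run([host, nand], duration_ns=100)
print(host.components[0].cycles, [c.cycles for c in nand.components])
```

Keeping the domains separate lets the host side run at a different frequency from the flash side while both still observe a consistent global time.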
Command Set Architecture
• Multi-stage operation
  – Stages are defined by common operations
  – CLE, ALE, TIR, TIN, TOR, TON, etc.
• Command chains
  – Define command sequences
[Figure: incoming requests (Req.) and latency variation generators]
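One way to picture a command chain is as an ordered tuple of stages that a small state machine steps through, rejecting any stage that arrives out of order. The stage names below come from the slide, but the two example chains and the `ChainTracker` class are assumptions made for illustration; the talk does not spell out the actual sequences.

```python
from enum import Enum


class Stage(Enum):
    CLE = "CLE"
    ALE = "ALE"
    TIR = "TIR"
    TIN = "TIN"
    TOR = "TOR"
    TON = "TON"


# Assumed example chains; the real command set may order stages differently.
COMMAND_CHAINS = {
    "write": (Stage.CLE, Stage.ALE, Stage.TIR, Stage.CLE, Stage.TIN),
    "read": (Stage.CLE, Stage.ALE, Stage.CLE, Stage.TON, Stage.TOR),
}


class ChainTracker:
    """Tracks progress of one operation through its command chain."""

    def __init__(self, operation):
        self.chain = COMMAND_CHAINS[operation]
        self.position = 0

    @property
    def done(self):
        return self.position == len(self.chain)

    def advance(self, stage):
        if self.done:
            raise ValueError("chain already complete")
        expected = self.chain[self.position]
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.position += 1


tracker = ChainTracker("write")
for stage in COMMAND_CHAINS["write"]:
    tracker.advance(stage)
print(tracker.done)
```

Because each stage is tracked separately, a simulator built this way can charge a different latency to every stage and detect malformed command sequences early.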
Validation (Throughput)
[Figure: bandwidth (MB/s) of Legacy, Cache, Two-plane (2x), and 2x Cache operations for the worst-case based, typical-case based, and variation-aware based models and MSIS. Panels: Write Performance (Single Die), Read Performance (Single Die), Write Performance (Dual Die)]
Annotations on the plots:
• 12%, 2.1%, 1.6%, 3.8%
• 4.1% ~ 11.2% deviation
• 83.1 ~ 170.9% (typical-case)
• 7.3% ~ 42% (worst-case)
• 5.3 ~ 9.4% (NANDFlashSim)
Improvement from deviation: 48.7% ~ 79.6% in the typical-case model, 44.4% ~ 53.5% in the worst-case model
Validation (Latency)
Write-intensive workloads (MSN Server, Financial OLTP)
Latencies of NANDFlashSim almost completely overlap with real product latencies (Hynix)
Read-intensive workloads (Websearch, User)
Performance of Multiple Planes
• Write performance is significantly enhanced as the number of planes increases
  – Cell activities (TIN) can be executed in parallel
• Data movement (TOR) is a dominant factor in determining bandwidth
[Figure: bandwidth (MB/s) of Legacy, Cache Mode, Two-plane, Four-plane, Six-plane, and Eight-plane operations with 2048-, 4096-, and 8192-byte transfer sizes. Panels: Write Performance (Single Die), Read Performance (Single Die)]
Larger transfer sizes couldn't take advantage of multiple planes well
Performance of Multiple Dies
• Similar to multi-plane, write performance is improved by increasing the number of dies
• A multiple-die architecture provides slightly worse performance than multi-plane
[Figure: bandwidth (MB/s) versus package type, 1 (SDP), 2 (DDP), 4 (QDP), and 8 (ODP), for Legacy, Cache Mode, Two-plane, and Two-plane Cache Mode operations; two panels]
Multi-plane vs. Multi-die
• Under disk-friendly workloads
  – The performance of interleaved-die operation is 54.5% better than multi-plane operation on average
  – Interleaved-die operations have fewer restrictions for addressing
[Figure: IOPS (0 to 35000) versus the number of planes (1, 2, 4, 8, 16) and versus the number of dies (1, 2, 4, 8, 16) for the msnfs, web, fin, usr, and prn workloads]
Breakdown of Cycles
• During writes, most cycles are spent in the NAND flash itself, whereas reads spend at least 50.5% of the total time doing …
[Figure: cycles (ns) split into bus activities and memory cell activities, broken down by stage (TIN, TON, TOR, TIR, CLE, ALE), for Legacy, Cache Mode, Two-plane (2x), and 2x Cache operations; two panels. Labeled values: 66.9%, 62.8%, 50.5%, 7.0%, 6.8%, 3.6%, 3.5%]
Page Migration Test
[Figure: total cycles (ns) and energy (uJ) versus migration block size (2, 4, 8, 16, 32 blocks) for Legacy, Cache Mode, Internal, 2x, 2x cache mode, and 2x internal operations]
Energy consumption depends on the number of active components
[Figure: cycles (ns) broken down by ALE, CLE, TIR, TOR, TIN, TON, and BER for Legacy, Cache Mode, Internal Data Move, Two-plane (2x), 2x cache, and 2x internal operations, with the saving marked]
Conclusion & Future Works
• A research vehicle for evaluating parallelism and architecture trends
  – Single instance
    • Integrating it into GEM5 and Simics
    • Plan to apply it with Green Flash and Xtensa of CoDEx
  – Multiple instances
    • We successfully built a multi-channel SSD framework with 1024 instances (~16384 dies, ~131072 planes)
• Open source project
  – Static/shared library
  – Standalone simulation
Q & A
• Download
  – http://www.cse.psu.edu/~mqj5086/nfs/
• Mailing list
  – [email protected]
• Thanks to
  – Dean Klein, Micron Technology, Inc.
  – Seung-hwan Song, University of Minnesota
  – Michael Kim, Corelinks
  – Kurt Lee, Corelinks
  – Leonard Ko, Corelinks
  – Yulwon Cho, Stanford University