pen: design and evaluation of partial-erase for 3d nand ... · impacts on block size •we use four...
TRANSCRIPT
PEN: Design and Evaluation of Partial-Erase for 3D NAND-Based High Density SSDsChun-Yi Liu, Jagadish B. Kotra, *Myoungsoo Jung, Mahmut T. Kandemir
The Pennsylvania State University, *Yonsei University
1
Outline• Introduction and Background
• Motivation: Impact on Block Size
• Controller Hardware for Partial-Erase
• FTL for Partial-Erase
• Evaluation
• Conclusion
2
The trend of NAND Flash
• The issues of increasing 2D NAND flash density (smaller cells) Reliability - various serious disturbances
Program (write) disturbances, read disturbances, and retention errors.
Performance – longer program (write) time
Cells are more sensitive to the program voltage.
• 3D NAND can highly mitigate those issues. Larger stakability cells
Increasing layers to increase density.
32 -> 48 -> 64 -> 96 layers
• 3D NAND will dominate MLC / TLC
NAND market. 020406080
100
2D > 2Xnm 2D 20nm 2D <1Xnm 3D NAND 3
Overview of 2D/3D NAND-based SSDs
SSDcontroller
SSD
DRAM
Flash Package Y
Flash Package 0….
….
….
…
….Die 0
Plane 0 Plane 1
Channel 0 Channel 1 Channel X
… …
Host interface: NVMe, PCIe, SATA
2D NAND dies are replaced by 3D NAND dies.
P-0P-1
P-Z
Block W
P-0P-1
P-Z
Block 0
…
…
PlaneFlash Package
P-0
P-1
P-2
P-3
Bit 1 Bit 2 Bit n…
…
…
…
…
…
…
2D NAND Block
P-0
P-4
P-8
P-12
Bit 0 Bit 1 Bit n
3D NAND Block
…
…
…
……
123
567
9
13
…
…1011
1415
4
Plane 0
Plane
Abstract
View
Block Layer
View
SG select gate
CG control gate
ST select transistor
P-0
P-4
P-8
P-12
Bit 0 Bit 1 Bit n
Slice 0
LowerST
…
…
…
…
…
123
567
9
13
…
…
PageDecoder
USG 3
CG 0
CG 3
LSG
1011
1415
Block Physical View
Upper
ST
P-0P-1P-2P-3
P-14
Block N
P-15
…
P-0P-1P-2P-3
P-14
Block 0
P-15
… Lower ST
layer
Cell layer 3Cell layer 2Cell layer 1Cell layer 0
Upper ST
layer
Block 0
The Big Block Problem
Cross-layer signals:They are difficult to shrink.
All Read/write/erase operationsare controlled by the top layer.
Multiple slices have to share the control circuits due to cross-layer signals.As the number of layers (n) increases, pages per page in 3D NAND increases in O(n2), not only O(n).
To read/write P-4:CG 1 and USG 0 have to be set to proper value.
USG 0
CG 1
5
SSD Management Software:Flash Translation Layers (FTLs)
• NAND flash properties: Read/Write operation: at a page unit
Typically, hundred of microsecond (us)
Erase: at a block unit
Typically, milliseconds (ms) or tens of ms
• Out-place-update:
• Flash Translation Layers (FTLs): Address mapping
Maintain a mapping table
Garbage Collection (GC)
Different FTLs have their own GC algorithms.
Page-0
Page-1
Page-14
Block 0
Page-15
…
Page-0
Page-1
Page-14
Block 0
Page-15
…
In-place-update
Out-place-update
Page-0
Page-1
Page-14
Block 0
Page-15
…
Page-0Page-1
Page-14
Block 0
Page-15
…
1 erase (10ms) 1 write (900us)
1 write (900us)
Mapping table required
UpdatePage-0
Invalidate
6
Flash Translation Layers (FTLs)
• Three categories of FTLs: Page-level mapping:
Pro: providing the best performance
Con: huge mapping table required
1 TB SSDs requires 1 GB mapping table.
Block-level mapping:
Pro: small mapping table
Con: performance degradation
A well-known implementation: NFTL
Hybrid mapping:
Combining the best of previous two mappings
We focus on the Superblock FTL.
Huge Mapping table
Logical address
Page-0
Page-1
Page-14
Block 0
Page-15
…
Page-0
Page-1
Page-14
Block N
Page-15
…
….
Page-level mapping
Small Mapping table
Logical address
Page-0
Page-1
Page-14
Block 0
Page-15
…
Page-0
Page-1
Page-14
Block N
Page-15
…
….
Block-level mapping7
Introduction of NFTL
D-block (Data-block)
Mapping table
Logical address
Block Index (block x)
Calculation Block x
Write request to a new address
D-block
Update request to the same addressThe data is updated (logged) in U-block.
U-block(Update-block)
D-block
Read request to the same address(We may need to sequentially search U-block.)
U-block
GC scenario 1: fully-utilized U-block
D-block U-block
Write Request
Read Request
GC scenario 2: unpaired D-block(The number of U-blocks is mush smaller than that of D-blocks.)
D-block
Write Request
Allocate a U-block
Write Request
8
Chip interface
Free pageValid pageInvalid page
D-block U-block(block pair)
Free block
Read page to controller
Step-1: Read valid page
D-block U-blockStep-2:Write valid page &
invalidate page
Write page to free block
Repeat steps (1) & (2) until no valid pages in
D/U-blocksD-block U-block
Step-3: Erase D/U-block
New D-blockStep-4: Update
mapping table(FTL)
WRWWWW
I/O Request QueueStalledStalled
Chip 0
SSD ControllerVictim blocks
Free blockFree block
Introduction of NFTL (cont’d)• NFTL GC (Merge) overview
The number of copied valid pages increase, as the number of pages per block increase.
9
Introduction of Superblock FTL• Superblock FTL GC overview
Intra-superblock GC (similar to fully-utilized U-block in NFTL)
Inter-superblock GC (similar to unpaired D-block in NFTL)
A superblock
A superblock
Allocatea
block
10
Outline• Introduction and Background
• Motivation: Impact on Block Size
• Controller Hardware for Partial-Erase
• FTL for Partial-Erase
• Evaluation
• Conclusion
11
Impacts on block size• We use four iso-capacity SSD configurations to show the impacts.
To prevent that the capacity affects the GC overheads (GC scenario 2 in NFTL)
• (blocks per plane, pages per block) (15014, 72), (7552, 144), (3776, 288), (1888, 576)
As number of pages per block increases, the time spending in reading/writing valid pages during GC increases.
0
30
60
72
14
4
28
8
57
6
72
14
4
28
8
57
6
72
14
4
28
8
57
6
72
14
4
28
8
57
6
72
14
4
28
8
57
6
72
14
4
28
8
57
6
prn_1 proj_0 proj_1 proj_2 usr_1 usr_2
Wri
te L
ate
ncy
(in
m
sec)
Spare Area Read Write GC Spare Area Read GC Read GC Write GC Erase
12
Impacts on block size (cont’d)
• The write amplifications slightly increase.
• GC triggered frequencies increase due to a few number of blocks.
0
6
12
18
Wri
te A
mp
.
Blk_72 Blk_144 Blk_288 Blk_576
0
0.2
0.4
0.6
Nu
mb
er
of
GC
s (i
n
mill
ion
)
Blk_72 Blk_144 Blk_288 Blk_576
13
• We propose PEN (Partial-Erase for 3D NAND flash) to address the 3D NAND performance degradation. The number of copied valid pages can be reduced.
• The PEN system contains two parts: Hardware part:
Partial-erase operation
Indexing the partial-blocks
Software part:
FTL for partial-erase
M-merge
Program disturbance
Wear-leveling
Overview of Partial-Erase Operation
576 valid pages
Free page
Valid page
Invalid page
Block-ErasePartial-Erase
7272
432
144
36
36
360576
7272
432
144
36
36
360
72 valid pages
Baseline Proposed PEN system14
Outline• Introduction and Background
• Motivation: Impact on Block Size
• Controller Hardware for Partial-Erase
• FTL for Partial-Erase
• Evaluation
• Conclusion
15
Controller Hardware for Partial-Erase
• PEN enables the partial-erase by setting proper control signals while erase operation is performed. E.g. CG0~CG3 in the figure.
• The minimum unit is a layer of page.
• In the bottom figure, the layer (Page-4 ~ Page-7) controlled by CG1 is erased.
• The partial-erase introduce additional program disturbances. E.g. upper layer (Page-0 ~ Page-3) and lower layer
(Page-8 ~ Page-11).
P-0P-4
P-8
P-12
Bit 0 Bit 1 Bit n
Slice 0 Block 0
…
…123
567
913
…
…1011
1415
0V (CG1)
Floating
Floating
Floating
Floating
Partial Erase
Floating
……
…
P-0P-4
P-8
P-12
Bit 0 Bit 1 Bit n
Slice 0
…
…123
567
913
…
…1011
1415
0V
0V
0V (CG3)
Floating
Floating
Block Erase
0V (CG0)
……
…
ProgramDisturbance
Block 0
16
Hardware Overheads & Indexing the Partial Blocks
Start Address Address Address EndCommand Ignored Block Index
Block IndexBlock-Erase (baseline)Partial-Erase PB Index
1 byte 24 bits 1 byte
P - 0123
P - 4567
P - 89
1011
P - 12131415
1
2
3
5
89
10
11
12
13
14
15Block 426
(426, 5)
Block Index
PB Index
4
5
6
7
*2
*2 +1…
3D NAND Flash Cell
Array(plane 0)
3D NAND Flash Cell
Array(plane 1)
Blo
ck De
cod
er
Page buffers and column decoderPeripheral circuits
Charge Pump Charge PumpCommand logic andAnalog Circuits
Top view of 3D NAND Flash Chip
PD PD
Page Decoder (PD)
ControlLogic
High voltageswitch
High voltageswitch
Control logics in page decoders(PD) need to be duplicated. Partial-block index mechanism.
Partial-erase command format.
17The latency of the partial-erase operation is only slightly faster than the block erase.
Outline• Introduction and Background
• Motivation: Impact on Block Size
• Controller Hardware for Partial-Erase
• FTL for Partial-Erase
• Evaluation
• Conclusion
18
Partial Block vs. Smaller Block
• Pretending the block size is 72 pages, instead of 576 pages.
• The drawback of smaller block (72 pages) through the partial-erase. 8x of the mapping table
Reducing the block size will increase the number of blocks.
Inefficient partial-erase operations
Not aware of disturbance
1 block erase (576 pages)
8 partial-erases (8 * 72 pages)
19
Different Block Pair Scenarios
7272
432
144
36
36
360
(1)
(2)
18 pages
(3)
How/Should we use the partial-erase in the following scenarios?
Partial-erases may not be beneficial.
2 pages
6 pages
Partial-erase may still introduce valid page copy.
(4)
Assume that minimum partial-block has 18 pages.
72
36108
Deciding partial-block size is important.
Should select 72 + 36 pages,not 36 + 36 + 36 pages.
36 pages
36 pages
2 pages
2 pages
20
M-Merge (Modified-Merge) Algorithm
• Restore operation: (1) copy out valid pages
(2) partial-erase operation
(3) copy back valid pages
936
36
72
432
14
D
43
U
936
36
72
Restore (PB 9)
(2) partial-erase (3) Copy back valid pages
432
7214
D
45
27
U
432
7214
D
45
27
U
2 pages
Restore (PB 14)
(2) partial-erase (3) Copy back valid pages (1) Copy out valid pages
D U D U
1
2
3
89101112131415
4
5
6
7
72
288
72
D
21
M-Merge Algorithm (cont’d)• Recursive relation to decide the restored partial-blocks (pb)
𝑐𝑜𝑠𝑡 𝑝𝑏 = min(𝑟𝑒𝑠𝑡𝑜𝑟𝑒 𝑝𝑏 , 𝑐𝑜𝑠𝑡 𝑝𝑏 ∗ 2 + 𝑐𝑜𝑠𝑡[𝑝𝑏 ∗ 2 + 1])
If ( pb = the smallest PB ) 𝑐𝑜𝑠𝑡 𝑝𝑏 = 𝑟𝑒𝑠𝑡𝑜𝑟𝑒 𝑝𝑏
• Check whether the Cost(M-Merge) < cost(Merge)
P - 0123
P - 4567
P - 89
1011
P - 12131415
1
2
3
5
4
5
6
7
1
2
3
4
5
6
7
Restore(PB 1) = 12 copies + 1 erase + 16 copiesRestore(PB 2) = 4 copies + 1 erase + 8 copiesRestore (PB 5) = 1 erase + 4 copiesRestore (other PBs) = 0
1
2
3
4
5
6
7
*2
*2+1
Restore(PB 2) > cost[4] + cost[5]
1
2
3
4
5
6
7
*2
*2+1
Restore(PB 1) > cost[2] + cost[3]
Cost[PB 1] is the final cost of the M-merge.Merge algorithm will restore PBs 3, 4, 5 to complete merge.Note restore(PB 3) and restore(PB 4) are 0.
Calculate Restore costs
22
M-Merge for Block-level Mapping (NFTL)
• M-Merge use the recursive relation to decide the restored partial-block. PBs 5, 6, 8, 9, 14, and 15
• Apply restore to those PBs PBs 5, 6, 8, and 15 can be skipped.
Free page
Valid pageInvalid page
D,U=>Victim blocks M-Merge
72
288
72
(1) Skipped (5, 6, 8, 15)
432
(3) Restore (PB 14)
14
(2) Restore (PB 9)
9
576
14
D D DU
6
5
8
15
36
36
U
72
D
43
U
72
D
45
27
U
2 pages
1
2
3
8
9
10
11
12
13
14
15
4
5
6
7
72
288
72
D
23
M-Merge with Program Disturbance
• To prevent the data corruption, M-Merge restored the previous disturbed pages.
• At time t2 and t4, neighbored partial-blocks are erased.
• To prevent wear-unleveling, a block can only be M-Merged limited times.
72
D
To be disturbedValid pageTo be erased (PB 9)To be erased in case of data corruption
9
t1 t2 t3 t4 time
P-0P-4
P-8
P-12
Bit 0 Bit 1 Bit n
Slice 0 Block 0
…
…123
567
913
…
…1011
1415
Partial Erase
……
…
ProgramDisturbance
24
BlockErase
(Merge)
t5
M-Merge for Hybrid Mapping (Superblock FTL)
• Superblock FTL has no D-/U-block concept Before M-Merge, we assign D-/U-block-sets, based on number of valid
pages.
A superblock
D-block-set
U-block-set
25
Outline• Introduction and Background
• Motivation: Impact on Block Size
• Controller Hardware for Partial-Erase
• FTL for Partial-Erase
• Evaluation
• Conclusion
26
Experimental Setup• We use 12 write-dominant workload traces.
• Four metrics Average write latency
Write amplification
AEP – the average number of erase operations per page
VEP – the variance number of erase operations per page
Typically, AEB (average number of erase operation per block) are used.
Due to the partial-erase operation, a finer measurement is required.
27
NFTL Results
0
0.5
1
1.5
No
rmal
ized
Wri
te
Late
ncy
Baseline PEN
0
5
10
15
20
Wri
te A
mp
. Baseline PEN
0
5
10
15
AEP
Baseline PEN
0
5
10
15
VEP
Baseline PEN
28
Sensitivity Results
0
0.5
1
1.5
No
rmal
ized
Wri
te
Late
ncy
Baseline L=1 2 3 4 5 6
0
0.5
1
1.5
No
rmal
ized
Wri
te
Late
ncy
Baseline PEN without disturbance PEN
0
0.5
1
1.5
No
rmal
ized
Wri
te
Late
ncy
Baseline Unlimited 16 4
0
6
12
18
VEP
Baseline Unlimited 16 4
29
Superblock FTL Results
0
0.5
1
1.5
No
rmal
ized
Wri
te
Late
ncy
4_5_BSL 4_5_PEN 4_8_BSL 4_8_PEN
0
1.5
3
4.5
6
7.5
Wri
te A
mp
. 4_5_BSL 4_5_PEN 4_8_BSL 4_8_PEN
0
1
2
3
4
AEP
4_5_BSL 4_5_PEN 4_8_BSL 4_8_PEN
0
4
8
12
16
VEP
4_5_BSL 4_5_PEN 4_8_BSL 4_8_PEN
30
Related Works
• Partial-erase proposals Hardware
2D NAND partial-erase proposal – not beneficial due to a smaller block
Partial-block erase (PBE) – reduce the whole capacity
Subblock management – provide only three partial-blocks
Software
Subblock-erase – designed for page-level mapping
• Partial-GC proposals Those redistribute the GC overheads to idle time, cannot reduce GC overheads.
Those can be combined with our partial-erase operation.
31
Conclusion• we propose and evaluate a novel partial-erase based PEN architecture in
emerging 3D NAND flashes, which minimizes the number of valid pages copied during a GC operation.
• To show the effectiveness of our proposed partial-erase operation, we introduce our M-Merge algorithm that employs our partial-erase operation for NFTL and Superblock FTL.
• Our extensive experimental evaluations show that the average write latency under the proposed PEN system is reduced up to 47.9%.
32
Q & A
33
Thank You!
34