A 3D NAND Flash Ready 8-Bit Convolutional Neural Network Core Demonstrated in a Standard Logic Process
M. Kim1, M. Liu1, L. Everson1, G. Park1, Y. Jeon2, S. Kim2, S. Lee2, S. Song2, and C. H. Kim1
1Dept. of ECE, University of Minnesota, Minneapolis, MN, USA, email: [email protected]
2ANAFLASH Inc., San Jose, CA, USA

Abstract – A convolutional neural network (CNN) core that can be readily mapped to a 3D NAND Flash array was demonstrated in a standard 65nm CMOS process. Logic-compatible embedded flash memory cells store multi-level synaptic weights while a bit-serial architecture enables 8 bit multiply-and-accumulate operations. A novel back-pattern tolerant program-verify scheme reduces the cell current variation to less than 0.6µA. Positive and negative weights are stored in eFlash cells in adjacent bitlines, generating a differential output signal. Our eNAND based neural network core achieves a 98.5% handwritten digit recognition accuracy, close to the software accuracy of 99.0% for the same precision. This work represents the first physical demonstration of an embedded NAND Flash based neuromorphic chip in a standard logic process.
I. INTRODUCTION
Deep neural networks (DNNs) contain multiple computation layers, each performing a massive number of multiply-and-accumulate (MAC) operations between the input data and trained weights. Due to the huge amount of data processing and computation required, the performance and energy efficiency of DNN chips can be limited by the available memory bandwidth and MAC engines. A promising approach to alleviate this issue is the compute-in-memory (CIM) paradigm, where computation occurs where the data is stored, using massively parallel analog MAC engines. One embodiment of this approach is a resistive RAM (ReRAM) crossbar array where weights are stored in the form of the transconductance of ReRAM cells while the MAC operation is performed in the analog domain. Despite their tremendous potential, past CIM demonstrations are mostly limited to small-scale networks with low precision operands (e.g. 1-4 bits) due to fabrication difficulties [1-5]. Many array level studies are based on software models extracted from individual device probing, which cannot capture the intricate device and circuit level behaviors. To make CIM a practical reality for future DNNs with hundreds of convolutional layers, it is imperative to focus on a memory technology that is non-volatile, ultra-high density, low cost, and highly manufacturable. 3D NAND Flash technology is the leading candidate that satisfies all these requirements; however, almost no experimental work exists on 3D NAND Flash based CIM designs due to the proprietary nature of the technology.
In this work, we investigate the hardware and software design aspects of a 3D NAND Flash based CNN accelerator through a physical chip implementation. A demonstrator chip that mimics a 3D NAND Flash array was fabricated in a 65nm logic process. Logic-compatible single-poly eFlash memory was used for the synaptic device. The proposed design features multi-level non-volatile weight storage, single cycle current integration, bit-serial MAC operation, and multi bit output sensing. One of the highlights of this work is the back-pattern tolerant program-verify sequence which reduces the cell current variation to less than 0.6µA, allowing 28 individual cell currents to be summed up in a single cycle while delivering an MNIST classification accuracy of 98.5%.
II. NEURAL NETWORK CORE CIRCUIT DESIGN
Fig. 1 shows a Bit Cost Scalable (BiCS, [6]) 3D NAND array composed of 40 BLs, 4 SGDs, and 16 WLs, which was flattened for implementation in a standard logic process. Unlike NOR type designs where each memory cell requires a selecting device, transistors in a NAND string share source/drain nodes, so selecting devices are only required at the top and bottom of the stack. This results in an area reduction of up to 20%. The bitline capacitance of NAND is also reduced compared to NOR for the same capacity, enabling a faster bitline sensing operation. The logic-compatible 3T eFlash cell shown in Fig. 2 consists of two asymmetrically sized PMOS devices for efficient program and erase operation and an NMOS read device for the NAND string. Due to the series resistance of the unselected devices, the I-V characteristics of an eNAND Flash cell vary depending on its location in the stack. Simulation results in Fig. 2 (right) show that this location dependent variation can be cancelled out by fine-tuning the floating gate (FG) voltage.
A pair of eNAND cells in adjacent bitlines is used to store a single weight, as illustrated in Fig. 3. If the weight is positive, the cell current of the left bitline is increased accordingly while the cell on the right bitline is programmed to <0.1µA, and vice versa. We included a repair block in which three eNAND cells connected to each WL are reserved for redundancy and bias weights. Input data is loaded simultaneously onto the SGD lines, which activate multiple memory cell currents. During this operation, the selected and unselected wordlines are held at VREAD and VPASS, respectively. The sum of the individual cell currents flows through each bitline, so the bitline pair generates two currents: a positive weight current and a negative weight current. Both currents are converted to the corresponding output voltages by the bitline regulator circuit (Fig. 4). Finally, a voltage controlled oscillator (VCO) circuit converts each output voltage to a frequency, which is measured using off-chip equipment for multi-bit sensing.
The overall chip architecture is shown in Fig. 4 (left); it contains high voltage wordline drivers [3], BL sensing circuits, VCO and frequency divider circuits, and scan chains. The circuit diagram and layout of the 16 stack eNAND string are shown in Fig. 4 (right). The weights are written to the array by first erasing the entire array and then selectively programming the threshold voltage of each individual eNAND cell. Each cell can store a 2 bit weight segment with target cell current levels of 0, 3, 6 and 9µA. Retention characteristics shown later in Fig. 14 confirm that a 3µA margin between the different levels is sufficient to overcome charge loss issues. During inference mode, the selected and unselected WL gates are biased at VREAD=1.1V and VPASS=2.6V, respectively, while the drain bias is fixed at VBL=0.8V. Positive and negative weights are stored in a bitline pair which generates a differential bitline current of -9, -6, -3, 0, 3, 6 or 9µA. The bias conditions of the wordline, SGD, and bitline voltages for the different operating modes are shown in Fig. 5. To obtain a precise cell current, the bitline voltage was regulated to 0.8V during both verify and inference modes. The bitline current is measured indirectly by reading out the feedback voltage driving the PMOS load, as shown in Fig. 4 (right) [3]. This voltage is fed to a VCO circuit which generates a frequency output for multi-bit inference. The timing sequence of a 5x5 convolution operation with 8 bit inputs and 8 bit weights is shown in Fig. 6. A bit-serial inner-product computing scheme was adopted, where each operand is serialized and asserted one bit at a time to the array [7]. This architecture enables variable precision MAC operation without any significant modification to the CIM hardware.
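The bit-serial scheme can be summarized in a short functional model, a sketch under the indexing of Fig. 6 (input bit i carries weight 2^(7-i); 2 bit weight segment j carries 2^(2(3-j))), not the actual hardware:

```python
# Functional model of the bit-serial inner-product scheme [7]: 8 bit
# inputs are asserted one bit at a time, 8 bit weights are split into
# four 2 bit segments (one eNAND cell each), and the analog partial
# sums are shifted and accumulated digitally.
def bit_serial_mac(xs, ws):
    """xs, ws: equal-length lists of unsigned 8-bit ints."""
    acc = 0
    for i in range(8):              # input bit cycles, MSB first
        for j in range(4):          # 2 bit weight segments, MSB segment first
            # analog summation on the bitline: one bit of every input
            # times one 2 bit segment of every weight
            partial = sum(((x >> (7 - i)) & 1) * ((w >> (2 * (3 - j))) & 3)
                          for x, w in zip(xs, ws))
            # digital shift-and-accumulate restores the bit significance
            acc += partial << ((7 - i) + 2 * (3 - j))
    return acc

xs = [200, 17, 3]
ws = [5, 255, 128]
assert bit_serial_mac(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

The 8 x 4 = 32 cycles match the 32-cycle count in Fig. 6, and changing the loop bounds is all that variable precision requires, which is why the scheme needs no CIM hardware changes.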
III. EXPERIMENTAL RESULTS
Measured data in Fig. 7 confirms excellent program and program inhibition characteristics for the flash cells in the 16 stack NAND string. The output frequency of the VCO was proportional to the cell current (Fig. 8), which is essential for accurate program verify and inference operations. One critical requirement for precise cell current readout is that the VPASS bias applied to the unselected wordlines must be high enough to minimize the series resistance of the unselected cells. However, a higher VPASS can lead to unwanted threshold voltage shift in the unselected cells. We chose a VPASS voltage of 2.6V which offered a good compromise between the two competing effects.
Another critical challenge we faced while programming the weights into the NAND string is the so-called back pattern dependency where the programmed cell current is affected by the program state of other cells in the array. In our testing, we saw the cell current increase by up to 3µA depending on whether the rest of the array is in erase mode (lowest Vt) or weight 0 mode (highest Vt). To overcome this
issue, we devised a novel back-pattern tolerant program-verify scheme (Fig. 10) which ensures that the programmed cell current remains constant irrespective of the weights stored in the rest of the array. The operating sequence is as follows. In phase one, we program the weight 0 cells on a given wordline while inhibiting the weight 1, 2 and 3 cells. To ensure that the cell currents of all weight 0 cells are below 0.1µA, we apply high voltage (8.0V), long duration (20µs) pulses until the target current is reached. Once this is completed, we coarsely program the weight 1, 2 and 3 cell currents close to their intended targets using additional program pulses, as shown in Fig. 10. The first pulse is applied to the weight 1, 2 and 3 cells, the second pulse to the weight 1 and 2 cells, and the third pulse to the weight 1 cells. The specific program voltage of each pulse was carefully chosen based on the program and program inhibition characteristics in Fig. 7. The same sequence is repeated for the remaining wordlines until the entire array is programmed. In phase two, using smaller and shorter pulses (i.e. 7.0V, 10µs), we fine-tune the weight 3 cells to 9µA, the weight 2 cells to 6µA and the weight 1 cells to 3µA. A VPASS voltage of 2.6V was used throughout the entire program-verify sequence. Fig. 11 shows how the cell current changes with an increasing number of program pulses for weight 1, 2 and 3 cells. The cell current variation is reduced from 3µA to 0.6µA after the program-verify operation. This offers a significant advantage over SRAM or MRAM based neuromorphic implementations, which do not have any post-silicon tuning capabilities. Fig. 12 shows the measured cell current and output frequency after programming the entire array using MNIST trained weights [8].
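The two-phase sequence above can be sketched as control flow. This is our reconstruction, not the authors' test firmware: the 8.0V/20µs and 7.0V/10µs pulse settings come from the text, but using 8.0V for all three coarse pulses, the toy cell model, and its program slope are assumptions for illustration.

```python
# Sketch of the back-pattern tolerant program-verify sequence with a toy
# cell model standing in for the silicon.
TARGET_UA = {0: 0.1, 1: 3.0, 2: 6.0, 3: 9.0}   # target cell currents (uA)

class MockChip:
    """Toy model: each program pulse raises Vt, i.e. lowers cell current."""
    def __init__(self, rows, cols, erased_ua=12.0):
        self.i = [[erased_ua] * cols for _ in range(rows)]
    def read_ua(self, wl, bl):
        return self.i[wl][bl]
    def pulse(self, wl, bls, vpgm, width_us):
        # assumed linear program slope; real cells follow Fig. 7
        drop = 0.5 * (vpgm - 6.0) * (width_us / 20.0)
        for bl in bls:
            self.i[wl][bl] = max(0.0, self.i[wl][bl] - drop)

def program_array(chip, weights):
    # Phase 1: per wordline, drive weight 0 cells below 0.1uA with strong
    # pulses, then walk weights 1-3 toward their targets with three
    # coarse pulses (weights {1,2,3}, then {1,2}, then {1}, as in Fig. 10).
    for wl, row in enumerate(weights):
        w0 = [bl for bl, w in enumerate(row) if w == 0]
        while any(chip.read_ua(wl, bl) > TARGET_UA[0] for bl in w0):
            chip.pulse(wl, w0, vpgm=8.0, width_us=20)
        for group in ([1, 2, 3], [1, 2], [1]):
            cells = [bl for bl, w in enumerate(row) if w in group]
            chip.pulse(wl, cells, vpgm=8.0, width_us=20)
    # Phase 2: fine program-verify with smaller, shorter pulses.
    for wl, row in enumerate(weights):
        for bl, w in enumerate(row):
            while w != 0 and chip.read_ua(wl, bl) > TARGET_UA[w]:
                chip.pulse(wl, [bl], vpgm=7.0, width_us=10)

chip = MockChip(rows=1, cols=4)
program_array(chip, [[0, 1, 2, 3]])
# toy cells settle near the 0/3/6/9 uA targets after program-verify
```

Programming all weight 0 (highest Vt) cells on a wordline first is what removes the back-pattern dependency: every later verify read sees the same worst-case series resistance from the already-programmed cells.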
Fig. 13 shows the LeNet-5 CNN demonstration flow for the MNIST handwritten digit recognition application. Weights were trained based on 60,000 handwritten digit images from the MNIST dataset and were preloaded to the test chip. During inference mode, the neural network core generates a frequency output based on 28x28 pixel 8 bit grayscale images and the preloaded 8 bit weights. Due to the limited test time, this paper presents classification results of 1,000 randomly chosen MNIST images (full test results will be available soon). The classification accuracy measured from the test chip was 98.5% (Fig. 13) which is close to the software accuracy of 99.0% for the same weight and data precision. The small discrepancy can be attributed to noise and variation effects in a real chip. Charge loss was minimal when the chip was baked at 150°C for 16 hours. Comparison with previous NOR type neuromorphic core designs in various memory technologies is shown in Fig. 15. The die photo and chip feature summary are given in Fig. 16.
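A back-of-the-envelope check (our illustration, not from the paper) shows why the tightened cell current variation is compatible with summing 28 string currents on one bitline: independent per-cell errors grow only as sqrt(28), so the summed error stays inside the 3µA spacing between adjacent weight levels. The Gaussian error model and sigma_ua = 0.2 (roughly 3σ ≈ 0.6µA) are assumptions.

```python
# Monte Carlo estimate of the RMS bitline-sum error when 28 cell
# currents with independent Gaussian errors are summed (toy model).
import random

def bitline_rms_error_ua(n_cells=28, sigma_ua=0.2, trials=20000, seed=0):
    rng = random.Random(seed)
    sq = 0.0
    for _ in range(trials):
        err = sum(rng.gauss(0.0, sigma_ua) for _ in range(n_cells))
        sq += err * err
    return (sq / trials) ** 0.5

rms = bitline_rms_error_ua()
# errors add in quadrature: ~0.2 * sqrt(28) ~ 1.1 uA, well below the
# 3 uA level spacing, consistent with the small accuracy gap reported
assert rms < 3.0
```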
REFERENCES
[1] X. Guo, F. Merrikh Bayat, M. Bavandpour, et al., IEDM, pp. 6.5.1-6.5.4, Dec. 2017.
[2] W. Chen, K. Li, W. Lin, et al., ISSCC, pp. 494-495, Feb. 2018.
[3] M. Kim, J. Kim, G. Park, et al., IEDM, pp. 15.4.1-15.4.4, Dec. 2018.
[4] C. Xue, W. Chen, J. Liu, et al., ISSCC, pp. 388-389, Feb. 2019.
[5] X. Si, J. Chen, Y. Tu, et al., ISSCC, pp. 396-397, Feb. 2019.
[6] H. Maejima, K. Kanda, S. Fujimura, et al., ISSCC, pp. 336-337, Feb. 2018.
[7] P. Judd, J. Albericio, A. Moshovos, IEEE Computer Architecture Letters, vol. 16, pp. 80-83, 2017.
[8] Y. LeCun, L. Bottou, Y. Bengio, et al., Proc. IEEE, pp. 1-36, Nov. 1998.
Fig. 4. (Left) Overall diagram of the CNN core with high voltage wordline drivers, a 16 stack eNAND Flash array, and BL sensing circuits. (Right) BL pair and readout circuit along with the layout of a 16 stack eNAND string.
Fig. 6. Timing diagram of a 5 x 5 convolution operation with 8 bit data and 8 bit weights, taking 32 (= 8 input bits x 4 weight segments) cycles. Four cells are used to store a single 8 bit weight. Bit-serial operation produces a multi-bit inner product: OUT = sum over i=0..7 and j=0..3 of X<i>·2^(7-i)·W<j>·2^(2·(3-j)), where X is the input neuron and W is the weight.
Fig. 3. Positive, negative and bias weight values are stored in two adjacent bitlines. 28 individual NAND string currents are summed up and converted to frequencies for multi-bit sensing.
Fig. 7. Cell current versus the number of program pulses for two program inhibition modes (BL high and SGD off) and program mode. The average current of 100 cells is shown for each wordline, for program biases from 7.0V to 8.0V in 0.2V increments at a constant pulse width of 20µs (25°C, VREAD=1.1V, VBL=0.8V). Test chip data shows reliable programming with minimal program disturbance.
Fig. 5. Bias conditions of the logic-compatible eNAND Flash cell for erase and program operations.
Fig. 1. (a) 3D NAND BiCS architecture [6] with 40 BLs x 4 SGDs x 16 WLs. (b) Flattened 3D NAND architecture for demonstrating an eNAND Flash based deep neural network accelerator in a standard logic process.
Fig. 2. Simulated I-V characteristics of top and bottom eFlash cells of a 16 stack NAND string. Series resistance varies depending on the location in the stack which can be compensated by incremental programming.
Fig. 12. Individual cell currents and frequency measurements in the bitline and wordline directions for MNIST trained weights (25°C, VREAD=1.1V, VBL=0.8V).
Fig. 16. Die microphotograph and test chip feature summary: 65nm logic technology; core size 1100 x 600 µm²; VDD (core/IO) 1.2V/2.5V; 640 2 bit filters; 20K (=64x320) synapses; power 4.95 µW per filter; throughput 0.5G pixels/s per core (tREAD: 50ns).
Fig. 13. (Upper) LeNet-5 Convolutional Neural Network (CNN) flow [8] using the proposed neural network core. (Lower) Handwritten digit recognition results measured from the test chip for 1,000 8 bit grayscale MNIST test images.
Fig. 10. Pulse sequence for programming weights 0, 1, 2 and 3 into the 16 stack eNAND array. 2 bit weight segments can be programmed into each cell using the proposed back-pattern tolerant program-verify scheme.
Fig. 14. Retention characteristics of weight 0, 1, 2 and 3 cell currents confirm minimal charge loss (1K pre-cycles, baking temperature 150°C).
Fig. 15. Comparison with prior works [1-5] in terms of technology, cell type, non-volatility, logic compatibility, supply voltage, input and weight resolution, program-verify capability, number of currents summed up, and neural network architecture. This work is the only NAND type design and achieves 8 bit precision with program-verify, summing 28 cell currents.
Fig. 8. VCO frequency versus cell current for BL0 to BL39 and the different weight levels (65nm, VDD 1.2V, 25°C).
Fig. 11. Cell current versus fine program pulse count for weight 1, 2 and 3 cells (25°C, VREAD=1.1V, VBL=0.8V). The program-verify operation ensures precise weight storage.
Fig. 9. VPASS disturbance characteristics of eNAND cells during read operation (25°C, VREAD=1.1V, VBL=0.8V). Cell current remains constant for VPASS < 3.3V.