
NEUROMORPHIC SYSTEM DESIGN AND APPLICATION

    by

    Beiye Liu

B.S. in Information Engineering, Southeast University, Nanjing, China, 2011

    M.S. in Electrical Engineering, University of Pittsburgh, Pittsburgh, 2014

    Submitted to the Graduate Faculty of

    the Swanson School of Engineering in partial fulfillment

    of the requirements for the degree of

    Doctor of Philosophy

    University of Pittsburgh

    2016


    UNIVERSITY OF PITTSBURGH

    SWANSON SCHOOL OF ENGINEERING

    This dissertation was presented

    by

    Beiye Liu

    It was defended on

    March 30, 2016

    and approved by

    Yiran Chen, Ph.D., Associate Professor, Department of Electrical and Computer Engineering

    Hai Li, Ph.D., Associate Professor, Department of Electrical and Computer Engineering

    Xin Li, Ph.D., Associate Professor,

    Department of Electrical and Computer Engineering, Carnegie Mellon University

    Zhi-Hong Mao, Ph.D., Associate Professor,

    Department of Electrical and Computer Engineering

    Ervin Sejdic, Ph.D., Assistant Professor, Department of Electrical and Computer Engineering

    Dissertation Director:

    Yiran Chen, Ph.D., Associate Professor, Department of Electrical and Computer Engineering


    Copyright © by Beiye Liu

    2016


With the boom of large-scale data-related applications, cognitive systems that leverage modern data processing technologies, e.g., machine learning and data mining, are widely used in various industry fields. These applications bring challenges to conventional computer systems in both semiconductor manufacturing and computing architecture. The neuromorphic computing system (NCS), inspired by the working mechanism of the human brain, is a promising architecture to combat the well-known memory bottleneck of the von Neumann architecture. Recent breakthroughs in memristor devices and the crossbar structure are an important step toward realizing a low-power, small-footprint NCS on a chip. However, the currently low manufacturing reliability of nano-devices and circuit-level constraints, e.g., the voltage IR-drop along metal wires and analog signal noise from the peripheral circuits, challenge the scalability, precision, and robustness of memristor crossbar based NCS.

In this dissertation, we quantitatively analyze the robustness of memristor crossbar based NCS when considering device process variations, signal fluctuation, and IR-drop. Based on our analysis, we develop a deeper understanding of hardware training methods, e.g., on-device training and off-device training. Then, new techniques, e.g., noise-eliminating training, variation-aware training, and adaptive mapping, specifically designed to improve the training quality on memristor crossbar hardware, are proposed in this dissertation. A digital


initialization step for hardware training is also introduced to reduce training time. Circuit-level constraints also limit the scalability of a single memristor crossbar, which decreases the efficiency of NCS implementations. We therefore leverage system reduction/compression techniques to reduce the required crossbar size for certain applications. Besides, running machine learning algorithms on embedded systems brings new security concerns to the service providers and the users. In this dissertation, we first explore these security concerns using examples from real applications. These examples demonstrate how attackers can access confidential user data, replicate a sensitive data processing model without any access to model details, and expose some key features of the training data by using the service as a normal user. Based on our understanding of these security concerns, we use the unique properties of memristor devices to build a secured NCS.


    TABLE OF CONTENTS

    PREFACE ................................................................................................................................... XV

    ACKNOWLEDGEMENTS .................................................................................................. XVII

    1.0 INTRODUCTION ........................................................................................................ 1

    1.1 MOTIVATION .................................................................................................... 1

    1.1.1 Challenge 1: Training with Imperfect Hardware ......................................... 2

    1.1.2 Challenge 2: Limited System Scalability ....................................................... 3

    1.1.3 Challenge 3: Security Concerns in Cognitive Systems ................................. 4

    1.2 DISSERTATION CONTRIBUTION AND OUTLINE ................................... 5

    2.0 DESIGN BASICS ......................................................................................................... 7

    2.1 MEMRISTOR BASICS ...................................................................................... 7

    2.2 MEMRISTOR CROSSBAR ............................................................................... 9

    2.3 MEMRISTOR CROSSBAR BASED NCS ...................................................... 11

    2.3.1 Feedforward Sensing ..................................................................................... 11

    2.3.2 Hardware Training........................................................................................ 12

    2.3.2.1 Off-device training .............................................................................. 13

    2.3.2.2 On-device training............................................................................... 15

    3.0 VARIATION-AWARE OFF-DEVICE TRAINING .............................................. 16

    3.1 IMPACT OF DEVICE VARIATION .............................................................. 16


    3.2 VORTEX ............................................................................................................ 18

    3.2.1 Variation-aware Training (VAT) ................................................................. 18

    3.2.1.1 Algorithm ............................................................................................. 18

    3.2.1.2 Variation Tolerance vs. Training Rate ............................................. 21

    3.2.1.3 Self-tuning and Validation ................................................................. 22

    3.2.2 Adaptive Mapping (AMP) ............................................................................ 24

    3.2.2.1 Basic Steps of AMP ............................................................................. 24

    3.2.2.2 Greedy mapping algorithm ................................................................ 26

    3.2.3 Integration of VAT and AMP....................................................................... 28

    3.3 EXPERIMENTS ................................................................................................ 28

    3.3.1 Effectiveness of AMP ..................................................................................... 29

    3.3.2 ADC Resolution ............................................................................................. 29

    3.3.3 Design Redundancy ....................................................................................... 30

    3.4 SECTION SUMMARY ..................................................................................... 32

4.0 ROBUST ON-DEVICE TRAINING WITH DIGITAL INITIALIZATION ....... 33

    4.1 NOISE-ELIMINATING ON-DEVICE TRAINING ...................................... 33

    4.1.1 Impacts of Device Variation and Signal Noise ............................................ 33

    4.1.2 Noise Sensitivity of On-device Training ...................................................... 36

    4.1.3 Noise-Eliminating Training Scheme ............................................................ 38

    4.2 DIGITAL INITIALIZATION .......................................................................... 40

    4.2.1 Basic Idea........................................................................................................ 40

    4.2.2 Digitalization of Weight Matrix ................................................................... 41

    4.3 EXPERIMENTS AND RESULTS ................................................................... 42


    4.3.1 Noise Elimination ........................................................................................... 42

    4.3.2 Digital-Assisted Initialization ....................................................................... 43

    4.3.3 Case study ....................................................................................................... 46

    4.4 SECTION SUMMARY ..................................................................................... 49

    5.0 SCALABILITY .......................................................................................................... 51

    5.1 IR-DROP LIMITS SINGLE CROSSBAR SIZE ............................................ 51

    5.1.1 Impact of IR-Drop on Memristor Crossbar ................................................ 51

    5.1.2 Problem Formulation .................................................................................... 52

    5.1.2.1 Training................................................................................................ 52

    5.1.2.2 Sensing.................................................................................................. 53

    5.2 SYSTEM REDUCTION ................................................................................... 54

    5.2.1 Weight Matrix Approximation..................................................................... 55

    5.2.2 One-dimensional (1-D) Reduction ................................................................ 56

    5.2.3 Two-dimensional (2-D) Reduction ............................................................... 58

    5.2.4 Implementation Example .............................................................................. 59

5.3 IR-DROP COMPENSATION ......................................................................... 62

    5.3.1 Sensing Compensation .................................................................................. 62

    5.3.2 Training Compensation ................................................................................ 65

    5.4 MODEL COMPRESSION ............................................................................... 66

    5.5 EXPERIMENTAL RESULTS ......................................................................... 68

    5.5.1 Training Quality ............................................................................................ 69

    5.5.2 Reading Accuracy and Selection of r ........................................................... 71

    5.5.3 Training Performance ................................................................................... 73


    5.5.4 Area ................................................................................................................. 74

    5.5.5 Robustness ...................................................................................................... 75

    5.5.5.1 Training and Testing with IR-drop ................................................... 75

    5.5.5.2 Impact of memristor/wire resistance variation ................................ 76

    5.5.5.3 Tradeoff between 1-D/2-D system reduction .................................... 77

    5.5.6 Model compression ........................................................................................ 78

    5.6 SECTION SUMMARY ..................................................................................... 80

    6.0 SECURITY APPLICATION .................................................................................... 81

    6.1 SECURITY CONCERNS IN COGNITIVE SYSTEMS ................................ 82

    6.1.1 Test Data Privacy........................................................................................... 82

    6.1.2 Training Data Security .................................................................................. 83

    6.1.3 Model Security ............................................................................................... 86

    6.2 MODEL REPLICATION ATTACK ............................................................... 86

    6.2.1 Attacking Model ............................................................................................ 86

    6.2.2 Demonstration ................................................................................................ 89

    6.3 MEMRISTOR-BASED SECURED NCS ........................................................ 91

    6.3.1 Drifting Effect ................................................................................................ 91

    6.3.2 Secured NCS Design ...................................................................................... 92

    6.4 EXPERIMENT RESULTS ............................................................................... 93

    6.4.1 Drifting vs. Degradation................................................................................ 94

    6.4.2 Replication Quality ........................................................................................ 95

    6.5 SECTION SUMMARY ..................................................................................... 96

    7.0 CONCLUSION AND FUTURE WORK ................................................................. 97


    7.1 DISSERTATION CONCLUSION ................................................................... 97

    7.2 FUTURE WORK ............................................................................................... 99

    7.2.1 Function Generality of Memristor-based NCS ........................................... 99

    7.2.2 Usability of Memristor-based Secured NCS ............................................. 100

    BIBLIOGRAPHY ..................................................................................................................... 102


    LIST OF TABLES

    Table 1. Simulation setup ............................................................................................................. 46

    Table 2. Training failure rate ........................................................................................................ 48

    Table 3. Experiment parameters. .................................................................................................. 69

    Table 4. Recall successful rate of NCS with different sizes ......................................................... 78


    LIST OF FIGURES

    Figure 1. Metal-oxide memristor [58]. ........................................................................................... 8

    Figure 2. Device programming [48]. .............................................................................................. 9

    Figure 3. Memristor crossbar [36]. ............................................................................................... 10

    Figure 4. Single layer neural network[56]. ................................................................................... 12

    Figure 5. (a) On-device training method, (b) Off-device training method. .................................. 13

    Figure 6. Impact of device variation. ............................................................................................ 17

    Figure 7. Tradeoff between variation tolerance and training rate. ................................................ 22

    Figure 8. Self-tuning process in training. ...................................................................................... 23

    Figure 9. Adaptive mapping. ........................................................................................................ 25

    Figure 10. Algorithm 1. ................................................................................................................ 27

    Figure 11. Effectiveness of AMP.................................................................................................. 29

    Figure 12. ADC resolution VS test rate. ....................................................................................... 30

    Figure 13. Overhead vs. Test rate. ................................................................................................ 31

    Figure 14. Training under memristor variation and input noise. .................................................. 35

    Figure 15. Training process with noise. ........................................................................................ 36

    Figure 16. Noise elimination mechanism. .................................................................................... 38

    Figure 17. Digital initialization. .................................................................................................... 40


    Figure 18. Effectiveness of noise-eliminating training. ................................................................ 43

    Figure 19. Comparison of convergence rate of different initialization. ........................................ 44

    Figure 20. The impact of initialization on total training time. ...................................................... 46

    Figure 21. 3-layer network recall rate test of dynamic threshold training algorithm. .................. 47

    Figure 22. Comparisons of overall training time. ......................................................................... 49

    Figure 23. Voltage distribution with IR-drop. .............................................................................. 52

    Figure 24. System reduction improves reliability. ........................................................................ 57

    Figure 25. Conceptual schematics of (a) 1-D reduction (b) 2-D reduction. ................................. 60

    Figure 26. Compensation for both training and sensing process. ................................................. 62

    Figure 27. Sensitivity analysis based compensation. .................................................................... 64

    Figure 28. Model compression. .................................................................................................... 67

    Figure 29. Trained resistance discrepancy. ................................................................................... 70

    Figure 30. Recall discrepancy (a) respect to r/n, (b) respect to ε. ................................................. 71

    Figure 31. Training time comparison............................................................................................ 73

    Figure 32. Area cost comparison. ................................................................................................. 74

    Figure 33. Recall successful rates of three NCS designs considering IR-drop. ........................... 75

    Figure 34. Model compression. .................................................................................................... 79

    Figure 35. Encrypted neural network. ........................................................................................... 83

    Figure 36. Reverse estimation....................................................................................................... 85

    Figure 37. Training and replication of the learning model. .......................................................... 87

    Figure 38. Model replication......................................................................................................... 90

    Figure 39. Resistance change and system degradation. ................................................................ 95

    Figure 40. Effectiveness of memristor-based secured neuromorphic system. .............................. 96


    Figure 41. Memristor crossbar-based CNN. ............................................................................... 100

    Figure 42. Ideal degradation. ...................................................................................................... 101


    PREFACE

    This dissertation is submitted in partial fulfillment of the requirements for Beiye Liu’s degree of

Doctor of Philosophy in Electrical and Computer Engineering. It contains the work done from September 2011 to March 2016. My advisor is Yiran Chen, University of Pittsburgh (2010 – present).

The work is, to the best of my knowledge, original, except where acknowledgement and reference are made to previous work. No similar dissertation has been submitted for any other degree at any other university.

Part of the work has been published in the following conference papers:

    1. DAC2013: B. Liu, M. Hu, H. Li, ZH. Mao, Y. Chen, T. Huang, W. Zhang, “Digital-

    assisted noise-eliminating training for memristor crossbar-based analog neuromorphic

    computing engine,” Design Automation Conference (DAC), pp. 1-6, 2013.

    2. ICCAD2014: B. Liu, X. Li, T. Huang, Q. Wu, M. Barnell, H. Li, Y. Chen,

    “Reduction and IR-drop compensations techniques for reliable neuromorphic

    computing systems,” International Conference on Computer-Aided Design (ICCAD),

    pp. 63-70, 2014.

3. DAC2015: B. Liu, X. Li, Q. Wu, T. Huang, H. Li, Y. Chen, "Vortex: variation-aware training for memristor X-bar," Design Automation Conference (DAC), pp. 1-6, 2015.


4. DAC2015: B. Liu, C. Wu, Q. Wu, M. Barnell, Q. Qiu, H. Li, Y. Chen, "Cloning your mind: security challenges in cognitive system designs and their solutions," invited to Design Automation Conference (DAC), pp. 95, 2015.

Part of the work has also been published in the following journal publication:

5. B. Liu, Y. Chen, B. Wysocki, T. Huang, "Reconfigurable neuromorphic computing system with memristor-based synapse design," Neural Processing Letters, vol. 41, pp. 159-167, 2015.


    ACKNOWLEDGEMENTS

I would like to acknowledge my advisor, Yiran Chen, whose support made this work possible, and the 50th Design Automation Conference (DAC 2013) Richard Newton Young Student Fellow award for providing financial support. I'd like to thank Professor Yiran Chen and Professor Xin Li for their excellent guidance during the research. Professor Yiran Chen gave me guidance on emerging nonvolatile memory designs and neuromorphic system development. Professor Xin Li gave me guidance on CAD tool development, simulations, and validations.

    Special thanks go to Professor Hai (Helen) Li, Professor Zhi-Hong Mao, and Professor Ervin

    Sejdic for being my committee members.

Besides, I'd like to express my gratitude to the members of the Evolutional Intelligent (EI) lab at the Swanson School of Engineering for their consistent support during my research. Finally, I'd like to thank my parents for their great encouragement during the whole Ph.D. research.


    1.0 INTRODUCTION

    1.1 MOTIVATION

Machine learning technology has been widely used in data processing to help users better understand the underlying properties of the data [1]. As a popular type of machine learning algorithm, a neural network processes input data by multiplying it with layers of weighted connections. Various types of neural network designs, e.g., the convolutional neural network (CNN) [2] and the recurrent neural network (RNN) [3], have repeatedly and significantly improved the best performances in the literature for multiple databases from different application fields, including computer vision and natural language processing [3][4]. However, the neural network is a computation-intensive software algorithm, and it is a remarkable fact that implementations of many milestone neural network models were enabled by significant hardware breakthroughs, e.g., the graphics processing unit (GPU) [5].

In recent years, the computer hardware industry has been experiencing great revolutions in its two foundation stones: semiconductor manufacturing and computing architecture. On the one hand, the scaling of conventional CMOS devices is approaching its limit [6][7]. Scalable emerging nano-devices, e.g., spintronic and resistive devices (memristors) [8]-[12], nanotubes [14][15], etc., are under extensive investigation; on the other hand, the well-known "memory wall" challenge of the von Neumann architecture [8], i.e., the ever-increasing gap between CPU performance and memory bandwidth, motivates many studies on alternative computing architectures for highly parallel software algorithms, e.g., neural networks.

The neuro-biological architecture is one such promising candidate. After a twenty-year trough, neuromorphic computing, which denotes the VLSI realization of neuro-biological architectures, has recently been revitalized by the discovery of nanoscale resistive devices, e.g., the memristor [13]. The similarity between the programmable resistance state of memristors and the variable synaptic strengths of biological synapses dramatically simplifies the design of neural network circuits [17]-[25]. Moreover, the crossbar structure, which is the densest interconnect topology that can be achieved by modern planar semiconductor manufacturing, further boosts the integration density and power efficiency of memristor-based neuromorphic computing systems (NCS) [26]-[47] to the levels of 10^10 synapses/inch^2 and Tera-FLOPS/Watt, respectively. Besides, the memristor crossbar structure has recently been introduced to improve the execution efficiency of matrix-vector multiplication, which is one of the most common operations in the mathematical representation of neural networks [37]. However, the implementation of an NCS with memristor crossbars faces several major technical challenges mainly introduced by the physical limitations of the hardware circuit.

    1.1.1 Challenge 1: Training with Imperfect Hardware

In machine learning theory, "training" is defined as the process of calculating the values of all the variables in a specific model based on training data. In a memristor crossbar based NCS, we need to not only calculate the values of all variables, but also have the memristors programmed to accurately represent those values. For clarity, we use "hardware training" to denote the whole process of calculation and device programming.

The most intuitive hardware training scheme is called the "off-device" method, which separates the whole process into two steps: the first step is identical to conventional software training, which calculates all the variables based on the given training data; the second step is programming every memristor based on the calculation in the first step [42]. Due to the difficulty of accurately monitoring the memristor state in real time, off-device training is vulnerable to the intrinsic device switching variations and manufacturing defects.

Surprisingly, the process of hardware training does not necessarily need to be separated into two steps. Another type of hardware training scheme is the "on-device" method, which directly implements the gradient descent training (GDT) algorithm on the memristor crossbar by repeating the loop of "programming and sensing" [30]. The on-device method is able to adaptively adjust the training inputs to reduce the impact of memristor variability by sensing the memristors (more precisely, the output current from the crossbar) in real time. However, due to the very limited precision of analog signals on the hardware, the quality of on-device training is severely affected by the signal noise and sensing accuracy. At the same time, iterative programming and sensing slows down the overall hardware training process.

    1.1.2 Challenge 2: Limited System Scalability

Besides the hardware training challenges, the scale of a single memristor crossbar is limited by the IR-drop along the resistance network (the memristor crossbar) composed of metal wires and memristors. Analysis of the impact of IR-drop on crossbar-based digital memory shows that a 64×64 crossbar already suffers severe voltage degradation [57]. As the memristor crossbar size increases, the impact of IR-drop becomes more critical, resulting in performance variations or even functional failures of the NCS. Even though a large-scale neural network can be partitioned and mapped onto multiple memristor crossbars, the significant hardware/energy/speed overhead of "partition & mapping" [46] makes high single-crossbar capacity a very important research challenge.

    1.1.3 Challenge 3: Security Concerns in Cognitive Systems

Besides the design challenges on NCS hardware, cognitive systems, e.g., machine learning algorithms/models, implemented on memristor crossbars or any other hardware platform also raise security concerns. Common cognitive systems work as follows: given a subset of a certain type of data (training data), the cognitive system tries to extract (learn) patterns or intrinsic relationships (the trained model) between variables. Then, based on the model built upon the training data, the cognitive system can make predictions/inferences on unknown data (test data). In this process, there are three key elements: training data, trained model, and test data. There are many scenarios in which one or more of these three elements are confidential or highly valuable to the system owner. Among all the security concerns, model privacy interests us most. Running learning models on an embedded device brings obvious conveniences such as run-time processing and high efficiency, but unfortunately also introduces security challenges: the learning model is exposed to the risk of being attacked by unauthorized attackers who have physical access to the device.


    1.2 DISSERTATION CONTRIBUTION AND OUTLINE

According to the above three challenges, our proposed work can be decoupled into the following four main research scopes: 1) eliminate the impact of device variation on the off-device training method; 2) improve the training quality and speed of the on-device method, which is limited by the precision and time consumption of analog computing; 3) enhance the system scalability by reducing the required crossbar size for large network models and increasing the size of a single implementable crossbar; and 4) utilize the unique properties of memristors for a learning system platform that protects model privacy against security attacks.

Section 2.0 introduces the background knowledge of memristor devices and also describes two different hardware training methods in detail.

Our work for research scope 1 is described in Section 3.0. We perform an insightful analysis of the impacts of hardware design factors on the off-device training quality of NCS. Based on our analysis, we propose a novel variation-aware off-device training scheme, namely Vortex, for training robustness enhancement: it first modifies the programming pre-calculation algorithm to compensate for the impact of memristor variations, and then introduces an adaptive mapping process to selectively map the synapses with large impact on the network output onto the memristors with low variations. Integrating these two complementary techniques together can further improve training quality.

In Section 4.0, we quantitatively analyze the sensitivity of the on-device hardware training method to process variations and input signal noise for research scope 2. We then propose a noise-eliminating training method with a correspondingly modified crossbar structure to minimize noise accumulation during on-device training and enhance the trained system performance, i.e., the testing accuracy. A digital initialization step for memristor crossbar training is also introduced to reduce the training failure rate as well as the training time. Experimental results show that our technique can significantly improve the performance and training time of the neuromorphic computing system by up to 39.35% and 23.33%, respectively.

Section 5.0 focuses on research scope 3. We investigate the IR-drop induced physical limitations and reliability issues in memristor crossbars. More specifically, we first formulate the effect of IR-drop in NCS designs and evaluate its impact. In order to enhance the computing capacity and reliability of NCS, we propose a system reduction scheme that can effectively reduce the required crossbar size for a specific problem while still maintaining high computation accuracy and robustness, enabling simpler and more scalable NCS implementations. To further improve the robustness of NCS, we propose a novel design method that can actively compensate for the IR-drop induced signal degradation in training and computing. Note that the system reduction and IR-drop compensation methods are implemented at different design levels and are thus complementary to each other. Experimental results demonstrate a much smaller implementation area (i.e., 61.3% of the original design circuit area) and better computing robustness (i.e., 27.0% computing accuracy improvement) of NCS after combining these two approaches.

Section 6.0 shows our work for research scope 4. We study the learning process that allows an attacker to attack the privacy of a model on an embedded device. We then investigate using the unique drifting property of memristor devices to build a secured NCS that prevents replicating the model hard-coded in memristor crossbars. The performance of the secured system gradually degrades without regular calibrations.


    2.0 DESIGN BASICS

    2.1 MEMRISTOR BASICS

As predicted by Prof. Leon Chua in 1971 [13], the memristor is the fourth fundamental circuit element, uniquely defining the relationship between magnetic flux (φ) and electrical charge (q) as dφ = M·dq. Here the electrical property of the memristor is represented by the memristance (M) in units of Ω. Since φ and q are time-dependent parameters, the instantaneous resistance (memristance) of a memristor is determined by the historical profile of the electrical excitations through the device. In other words, the resistance state of a memristor can be programmed by applying a current or voltage. In 2008, HP Labs reported that the memristive effect was realized by moving the doping front along a TiO2 thin-film device [58]. Since then, many different memristive materials and structures have been found or rediscovered [48].


    Figure 1. Metal-oxide memristor [58].

Figure 1 depicts an ion migration filament model of metal-oxide memristors [58]. A metal-oxide layer is sandwiched between two metal electrodes. During the reset process, the memristor switches from the low resistance state (LRS) to the high resistance state (HRS): the oxygen ions migrate from the electrode/oxide interface and recombine with the oxygen vacancies, and a partially ruptured conductive filament region with a high resistance per unit length (Roff) is formed to the left of the conductive filament region with a low resistance per unit length (Ron). During the set process, the memristor switches from HRS to LRS and the ruptured conductive filament region shrinks. The resistance of a memristor can be programmed to any arbitrary value between LRS and HRS by applying a programming current or voltage with different pulse widths or magnitudes. Note that the relationship between the programming voltage amplitude/pulse width and the memristor resistance change is usually a highly nonlinear function, as shown in Figure 2 [48]. For example, with a programming voltage of -2.9 V, it takes ~500 ns to switch the device


from LRS to 900 kΩ (point 'A' in Figure 2). However, with the same programming time, -2.8 V only switches the device to ~400 kΩ, which is about half of the resistance marked by point 'A'.

    Figure 2. Device programming [48].

    2.2 MEMRISTOR CROSSBAR

As shown in Figure 3, a memristor crossbar is a connection structure that integrates a matrix of memristors (M) with metal wires. Each memristor is connected to a top horizontal metal wire and a bottom vertical metal wire. The crossbar structure realizes the highest possible integration density of memristor devices within a single layer, in which each memristor occupies a 4f² circuit area (f = feature size).

Read: The resistances of memristors in a crossbar can be read individually. For example, when reading the resistance of mij, which is the memristor connecting the i-th top metal wire and the j-th bottom metal wire, a sensing voltage v is applied on the i-th top wire while all the other wires are grounded. The current cj can be sensed from the j-th bottom metal wire, and the resistance of mij is v/cj. Besides, the resistances of a column of memristors can be sensed together, as we will describe in Section 2.3.1.

    Figure 3. Memristor crossbar [36].

Program: Likewise, memristors in a crossbar can be programmed individually. During the programming of a memristor crossbar, programming pulses of different amplitudes and durations are directly applied to the target memristor based on the desired resistance change: the voltages of the word-line (WL) and bit-line (BL) connecting the target memristor are set to +Vbias and GND, respectively, while all other WLs and BLs are connected to


+Vbias/2. Hence, only the target memristor is subjected to the full Vbias, which is above the threshold that can change the device's resistance state, while the rest of the memristors in the crossbar remain unchanged because they are only half-selected with a voltage of Vbias/2 [57]. Due to the intrinsic switching characteristics, memristors under the half-programming voltage barely change their resistances.
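To make the half-select scheme concrete, the following Python sketch builds the voltage seen by every cell of a small crossbar when a single memristor is programmed (the bias value and array size are illustrative assumptions, not design parameters from this dissertation):

import numpy as np

def program_bias_map(n_rows, n_cols, target_row, target_col, v_bias=2.0):
    # Half-select biasing: the selected WL is driven to +Vbias, the selected BL to GND,
    # and every unselected WL/BL sits at +Vbias/2, so only the target cell sees the full bias.
    wl = np.full(n_rows, v_bias / 2)
    bl = np.full(n_cols, v_bias / 2)
    wl[target_row] = v_bias
    bl[target_col] = 0.0
    return wl[:, None] - bl[None, :]   # voltage across each memristor (WL minus BL)

v = program_bias_map(4, 4, target_row=1, target_col=2)
# v[1, 2] equals the full 2.0 V; every other cell sees only Vbias/2 or 0 V.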

    2.3 MEMRISTOR CROSSBAR BASED NCS

    2.3.1 Feedforward Sensing

Figure 4 depicts a conceptual overview of a neural network that can be implemented with the memristor crossbar based NCS shown in Figure 3. Two groups of neurons are connected by a set of synapses. The input neurons send signals into the network, and the output neurons collect the information from the input neurons through the synapses and process it with an activation function. The synapses apply different weights (synaptic strengths) to the information during the transmission. In general, the relationship between the input pattern x and the output pattern y can be described as [28]:

y = x · Wn×m.    (1)

    Here the weight matrix Wn×m denotes the synaptic strengths between the two neuron

    groups. In an NCS, the matrix-vector multiplication shown in the equation (1) is one of the most

    intensive operations. Because of the structural similarity, a memristor crossbar is conceptually

    efficient in executing matrix-vector multiplications [37].


Figure 4. Single layer neural network [56].

The computation process defined by equation (1) is called "sensing". In the hardware implementation shown in Figure 3, during the sensing process of a memristor crossbar-based NCS, x is mimicked by the input voltage vector applied to the WLs of the memristor crossbar while the BLs are grounded. Each memristor is programmed to a resistance state representing the weight of the corresponding synapse. The current along each BL of the memristor crossbar is collected and converted to the output voltage vector y by "neurons", e.g., CMOS analog circuits or emerging domain-wall devices [36]. The matrix Wn×m is often implemented by two crossbars, which represent the positive and negative elements of Wn×m, respectively.
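As a minimal numerical sketch of this sensing operation (an idealized model that ignores IR-drop, sneak paths, and device variation; the conductance range is an assumed parameter), the signed weight matrix can be split across a positive and a negative crossbar so that the difference of the bit-line currents recovers y = x·W:

import numpy as np

def crossbar_sense(x, W, g_min=1e-6, g_max=1e-4):
    # Map signed weights onto two conductance matrices (positive and negative crossbars),
    # then compute the bit-line currents for the input voltage vector x.
    scale = (g_max - g_min) / np.abs(W).max()
    G_pos = g_min + scale * np.clip(W, 0, None)    # positive elements of W
    G_neg = g_min + scale * np.clip(-W, 0, None)   # negative elements of W
    i_pos = x @ G_pos                              # currents from the "positive" crossbar
    i_neg = x @ G_neg                              # currents from the "negative" crossbar
    return (i_pos - i_neg) / scale                 # rescaled difference equals x . W

x = np.array([0.5, 1.0, -0.2])
W = np.array([[0.3, -0.8], [1.2, 0.1], [-0.5, 0.9]])
print(crossbar_sense(x, W))    # matches x @ W up to floating-point error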

    2.3.2 Hardware Training

As mentioned in section 1.1.1, we use "hardware training" to define the process of calculating and programming all the memristors to the target resistance states.


    Figure 5. (a) On-device training method, (b) Off-device training method.

    2.3.2.1 Off-device training

As mentioned in section 1.1.1, the most intuitive hardware training scheme is called the "off-device" training method (Figure 5 (b)), which separates the whole process into two steps. The first step is identical to conventional software training, which calculates all the variables. A neural network is usually trained with a supervised learning method to perform a classification or regression function. Without loss of generality, we use classification tasks as examples in this dissertation. Assume we have a data set that consists of training samples and testing samples. Each training/testing sample contains a set of feature vectors FR/FT and a set of label vectors LR/LT. The task of the neural network is to make predictions as close as possible to LT based on FT.

For generality, e.g., for multi-task multi-class problems, we assume the labels LR/LT are vectors of values instead of a single value. Each column of memristors is used to perform


    classification on one bit of labels. Hence, every column of memristors can be trained

    individually. For each column, the difference between the current neural network output and the

training labels can be described as the following cost function:

c = Σ_{s=1}^{n} ( lr_s − o_s )².    (2)

Here, c is the cost value, lr_s is the s-th label in the training label vector LR, and o_s is the output when given the s-th training feature in FR, i.e., fr_s. Assuming there are n samples in the training data, the goal of software training is to minimize the sum of errors, i.e., the cost value defined by equation (2). Since we are able to compare the network's calculated values for the output nodes to these "correct" labels, equation (2) can be optimized by the GDT algorithm:

w ← w + α · Σ_{s=1}^{n} ( lr_s − o_s ) · fr_s.    (3)

Here α is the training speed parameter, and the error terms are used to adjust the weights so that, the next time around, the output values will be closer to the "correct" values LR. Usually the labels are a vector of +1/-1, indicating whether a given fr_s belongs to a certain class or not.

Once the training converges, e.g., the cost value stabilizes below a certain threshold, the weight matrix W can be used for testing. Based on W, the programming pulse voltage amplitude and duration for each memristor can be calculated from the relationship shown in Figure 2. Then the memristor devices can be updated accordingly, which is the second step of off-device training.
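A minimal software sketch of this pre-calculation step for a single column, assuming a linear output together with the cost function (2) and update rule (3) above (the learning rate, epoch count, and stopping threshold are illustrative assumptions), is:

import numpy as np

def offdevice_train_column(FR, lr, alpha=0.01, epochs=200, tol=1e-3):
    # FR: s x n matrix of training features; lr: length-s vector of +1/-1 labels
    # for one output column. Returns the trained weight column w.
    s, n = FR.shape
    w = np.zeros(n)
    for _ in range(epochs):
        o = FR @ w                        # current outputs for all training samples
        cost = np.sum((lr - o) ** 2)      # cost function, equation (2)
        if cost < tol:                    # training has converged
            break
        w += alpha * FR.T @ (lr - o)      # GDT weight update, equation (3)
    return w                              # next step: translate w into programming pulses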


    2.3.2.2 On-device training

Surprisingly, the processes of calculating the variable values and programming the memristors do not need to be separated. Another type of hardware training scheme is the "on-device" method (Figure 5 (a)), which directly implements the GDT algorithm on the memristor crossbar by repeating the loop of "programming and sensing" [30]. The on-device method is able to adaptively adjust the training inputs to reduce the impact of memristor variability by sensing the memristors (more precisely, the output current from the crossbar) in real time.

One significant difference between our on-device training scheme and conventional software training is that the feedforward operation is performed on the memristor crossbar itself. The features are applied as voltages on the input terminals and the output results are sensed as introduced in Section 2.3.1. There are clear benefits to implementing the feedforward operation on device. Since there is no step of programming memristors based on pre-calculated connection weights, there is no discrepancy between the theoretical weights and the hardware weights. Besides, device variations and defects can be automatically compensated by the on-device training because of the closed-loop training algorithm. The main shortcoming of this training scheme is that the analog values need to be quantized through the interfaces as:

    analog values need to be quantized through the interfaces as:

x_q = round( x / Δ ) · Δ,  where  Δ = (x_max − x_min) / 2^bits.    (4)

Limited by the ADC hardware, the parameter bits cannot reach 32 or even 64 as used in software training.
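The following Python sketch illustrates this "programming and sensing" loop for one column. Here sense_column and program_column are hypothetical stand-ins for the analog crossbar interfaces, and the uniform quantization model, bit width, and other hyperparameters are assumptions rather than the exact settings used in this dissertation:

import numpy as np

def quantize(x, x_min, x_max, bits=8):
    # Uniform quantization through the ADC/DAC interfaces (illustrative model of eq. (4)).
    step = (x_max - x_min) / (2 ** bits - 1)
    return np.round((np.clip(x, x_min, x_max) - x_min) / step) * step + x_min

def ondevice_train_column(sense_column, program_column, FR, lr,
                          alpha=0.01, iterations=200, bits=8):
    # sense_column(x): sensed (noisy, non-ideal) column output for input voltages x
    # program_column(delta): applies an incremental weight update as programming pulses
    for _ in range(iterations):
        for x, target in zip(FR, lr):
            o = quantize(sense_column(x), -1.0, 1.0, bits)   # feedforward on the crossbar
            delta = alpha * (target - o) * x                 # GDT update from the sensed error
            program_column(delta)                            # close the loop on the device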


    3.0 VARIATION-AWARE OFF-DEVICE TRAINING

    3.1 IMPACT OF DEVICE VARIATION

As mentioned in section 1.1.1, the off-device training method separates the hardware training process into two steps, i.e., calculating the weights of all connections in software and programming each memristor based on the calculation. In hardware implementations, off-device training is subject to many realistic factors and constraints. In this section, we will investigate the impacts of these limitations on the robustness of different hardware training methods. Here, "robustness" is quantitatively measured as the test accuracy of a memristor crossbar-based NCS trained by a specific method.

    The main difference between on-device and off-device training is that on-device training

    adaptively adjusts the programming signal during the iterations based on the sensed output

    current of the memristor crossbar. Theoretically, memristor device variations can be naturally

    tolerated in this process if the analog output current can be precisely sensed. On the contrary, the

    off-device training determines the programming pulse width and magnitude before accessing the

devices. Hence, device variations inevitably incur a discrepancy between the targeted value and the actually programmed memristor resistance.

To illustrate the impact of device variations on the training of memristor crossbars, we performed on-device and off-device training on a column of 100 memristors. The nominal on- and off-state resistances of the memristor are set to 10 kΩ and 1 MΩ, respectively. Here we assume the memristor device variation follows a lognormal distribution [63]: for an on-state memristor, its resistance is r = e^θ · 10 kΩ, where θ ~ N(0, σ²). The training goal is to ensure that when the input wires are all connected to 1 V, the memristor column generates an output current of 1 mA. Figure 6 shows 1000-run Monte-Carlo simulation results as the standard deviation σ changes. As σ increases, the on-device training result constantly maintains a low discrepancy between the trained output and the target output, while this output discrepancy keeps growing for the off-device training result. This experiment assumes that the analog sensing of on-device training is perfectly precise, which is unrealistic in hardware.

    Figure 6. Impact of device variation.

(The plot shows the sensing discrepancy, in units of 0.1 mA, versus σ for on-chip and off-chip training.)
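A rough Monte-Carlo sketch of the off-device half of this experiment (with the simplifying assumption that the pre-calculation assigns every one of the 100 cells the same 100 kΩ target so that the column sums to 1 mA; the closed-loop on-device half is omitted) is shown below:

import numpy as np

def offdevice_discrepancy(sigma, n_cells=100, runs=1000, seed=0):
    # Each cell targets 100 kOhm (assumed uniform solution); the programmed resistance
    # deviates as r = exp(theta) * target with theta ~ N(0, sigma^2), i.e., lognormal.
    rng = np.random.default_rng(seed)
    target_r = 100e3
    theta = rng.normal(0.0, sigma, size=(runs, n_cells))
    column_current = (1.0 / (np.exp(theta) * target_r)).sum(axis=1)   # 1 V per input wire
    return np.mean(np.abs(column_current - 1e-3))                     # distance from 1 mA goal

for sigma in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"sigma = {sigma}: discrepancy = {offdevice_discrepancy(sigma):.2e} A")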


    3.2 VORTEX

The low requirement on sensing resolution makes off-device training an attractive solution in memristor crossbar-based NCS designs. However, compared with on-device training, the disadvantage of off-device training is also obvious: the open-loop scheme intrinsically lacks a mechanism to tolerate memristor device variations. In this work, we propose Vortex – a variation-aware robust training scheme that can tolerate the memristor device variations using the following two techniques:

• Variation-aware training (VAT) – an off-device training method that models the device variations and adjusts the training goal to tolerate the variation impact in the pre-calculation step.

• Adaptive mapping (AMP) – a method to pre-test all devices and then adaptively map the synaptic connections to physical devices based on the actual memristors' variations.

    3.2.1 Variation-aware Training (VAT)

    3.2.1.1 Algorithm

Without loss of generality, we use a one-layer neural network as an example. The goal of the conventional GDT algorithm is to find the connection weights W that can successfully classify as many training samples as possible. The computation of the network can be expressed as equation (1). Here x is a 1×n input feature vector, W is an n×m weight connection matrix, and y is a 1×m output vector that corresponds to m classes. As each column of W is trained separately, the training of the r-th column (w_r) can be summarized as the following optimization process:

min_{w_r} Σ_{i=1}^{s} ( y_r(i) − x(i)·w_r )²   s.t.   | x(i)·w_r | ≤ 1  for all i.    (5)

Here the i-th training sample contains the input feature vector x(i) and the target output y_r(i). s is the total number of training samples. The optimization process minimizes the difference between the actual output (x(i)·w_r) and the target output (y_r(i)), given the condition that the output current of a crossbar is physically bounded by circuit limitations, which is numerically represented as "1" in equation (5). Here we use the "1 vs. all" method in the output neuron design: y_r(i) = +1 only when a training sample is labeled as class r.

In a memristor crossbar-based NCS, the weight matrix W is represented by the crossbar. When memristor variations are taken into account, the actual programmed weight matrix W' may be different from the target W even if we can perfectly control the programming voltage pulse width and magnitude. Similar to Section 3.1, here we assume the memristor device variation follows a lognormal distribution [63], i.e., w'_jr = w_jr·e^(θ_j) with θ_j ~ N(0, σ²). The optimization constraint of equation (5) then becomes:

| Σ_{j=1}^{n} x_j(i)·w_jr·e^(θ_j) | ≤ 1.    (6)

As θ is a small variation, we can simplify equation (6) using the linear approximation e^θ ≈ 1 + θ:

| Σ_{j=1}^{n} x_j(i)·w_jr·( 1 + θ_j ) | ≤ 1.    (7)


Equation (7) can be further reorganized (bounded) as:

| Σ_{j=1}^{n} x_j(i)·w_jr·θ_j | + | x(i)·w_r | ≤ 1.    (8)

The second term of equation (8) is also used as the constraint in the conventional GDT algorithm. We call the first term the "penalty of variations" because it represents the crossbar output deviation induced by device variations. However, the optimization process cannot be performed with the random variables θ_j. Therefore, we estimate the upper bound of the penalty of variations by the Cauchy–Schwarz inequality:

| Σ_{j=1}^{n} x_j(i)·w_jr·θ_j | ≤ ‖ x(i)∘w_r ‖₂ · ‖θ‖₂,    (9)

where ∘ denotes the element-wise product.

‖θ‖₂ is the 2-norm of a vector of random variables that follow normal distributions. At a certain confidence level, we can restrict ‖θ‖₂ ≤ ρ based on the Chi-square distribution with n degrees of freedom. Then the modified training process under the consideration of weight variations can be expressed as:

min_{w_r} Σ_{i=1}^{s} ( y_r(i) − x(i)·w_r )²   S.T.   ρ·‖ x(i)∘w_r ‖₂ + | x(i)·w_r | ≤ 1  for all i.    (10)

    We refer to this technique as VAT (variation-aware training).
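To make the VAT formulation concrete, the sketch below trains one column under the tightened constraint of equation (10); rescaling the weight vector after each gradient step is a simplified way of enforcing the constraint (both constraint terms scale linearly with w), not necessarily the optimizer used in the dissertation:

import numpy as np

def vat_train_column(X, y, rho, alpha=0.01, epochs=500):
    # X: s x n training features, y: length-s targets for column r, rho: bound on ||theta||_2.
    # Enforces rho*||x(i) o w||_2 + |x(i) . w| <= 1 for every sample (equation (10)).
    s, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        w += alpha * X.T @ (y - X @ w)                         # GDT step on the squared error
        margins = rho * np.linalg.norm(X * w, axis=1) + np.abs(X @ w)
        worst = margins.max()
        if worst > 1.0:
            w /= worst        # both terms scale linearly with w, so this restores feasibility
    return w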


    3.2.1.2 Variation Tolerance vs. Training Rate

The introduction of the estimated penalty of variations makes the training procedure aware of device variations to a predetermined degree and includes them in the training constraints. The trained memristor crossbar hence becomes more robust in tolerating device variations during computations, allowing us to obtain the desired output even when there are variations in the programmed weights. However, such a method applies a tighter constraint to the training and results in a lower training rate.

To evaluate the tradeoff between the training rate and the variation tolerance of the NCS under VAT, we vary the estimated penalty of variations in equation (10) by multiplying it by a scalar γ (0 < γ ≤ 1).


    Figure 7. Tradeoff between variation tolerance and training rate.

The left side of Figure 7 shows that the test rate (w/ variation) is significantly lower than the test rate (w/o variation), indicating a significant impact of device variations on the training quality in this range. When γ rises, the test rate (w/o variation) continues to decrease due to the disturbance that the introduced estimated penalty of variations causes to the optimization process. The test rate (w/ variation), however, first rises to a peak when γ increases to 0.2, which clearly shows the efficacy of VAT in tolerating device variations in the training. Continuing to increase γ, however, may not further improve the variation tolerance: the disturbance to the training process starts to dominate and results in a decrease of the test rate.

    3.2.1.3 Self-tuning and Validation

Figure 7 shows that for a specific memristor crossbar based NCS, there exists an optimal γ that ensures the maximum test rate (but does not necessarily correspond to the highest training rate). Hence, we propose a self-tuning process that is very similar to the regularization used in regression to prevent over-fitting and maximize the test rate [67]. The details of the proposed process are shown in Figure 8. Instead of training the neural network only once with all training samples, we separate the training samples into two groups (one large and one small). The large group is used as the actual "training samples" while the small group is used for "validation". After training, a validation step is launched: we first model the memristor variations and inject them into the weight matrix W trained by the training samples. Then the training quality of the NCS is tested with the validation samples under a fixed γ. We repeat training-validation loops by scanning the value of γ until achieving the maximum test rate over all the validation samples, and the corresponding γ is selected for the final training process. Note that the efficacy of the self-tuning loop varies when the memristor device variation model changes. In this dissertation, we use the lognormal distribution as our memristor device variation model [63]. However, our proposed techniques are not restricted to any particular variation model.

    Figure 8. Self-tuning process in training.
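A compact sketch of this self-tuning loop (Figure 8) is given below; train_column stands for any penalty-scaled training routine such as the VAT sketch above, labels are assumed to be +1/-1, and the γ grid and number of variation-injection trials are illustrative choices:

import numpy as np

def self_tune_gamma(train_column, train, valid, sigma, rho,
                    gammas=np.linspace(0.05, 1.0, 20), trials=50):
    # train_column(X, y, rho): training routine whose penalty of variations is scaled by gamma*rho.
    # Injects modeled lognormal variations into the trained weights and keeps the gamma
    # that maximizes the test rate on the validation samples.
    (X_tr, y_tr), (X_va, y_va) = train, valid
    rng = np.random.default_rng(0)
    best_gamma, best_rate = None, -1.0
    for gamma in gammas:
        w = train_column(X_tr, y_tr, gamma * rho)                 # train with the scaled penalty
        rates = []
        for _ in range(trials):
            w_var = w * np.exp(rng.normal(0.0, sigma, w.shape))   # inject device variations
            rates.append(np.mean(np.sign(X_va @ w_var) == y_va))  # validation test rate
        if np.mean(rates) > best_rate:
            best_gamma, best_rate = gamma, float(np.mean(rates))
    return best_gamma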


    3.2.2 Adaptive Mapping (AMP)

VAT aims at optimizing the training algorithm to tolerate the impact of device variations. In this section, we propose adaptive mapping (AMP) – a hardware solution that mitigates the impact of device variations by optimizing the mapping of the computation onto the crossbar and leveraging design redundancy.

    3.2.2.1 Basic Steps of AMP

    AMP includes three sequential steps:

Pre-testing – After a memristor crossbar is manufactured, we program every memristor targeting a certain resistance state and then sense the device resistance to obtain the distribution of memristor resistance in the crossbar (we may need to sense multiple times to eliminate the impact of switching variations). To minimize the impact of IR-drop and sneak paths, we perform pre-testing on each individual memristor while keeping all other memristors in the high-resistance state (HRS). The obtained distribution should follow a lognormal distribution [63].

Sensitivity analysis – The variability of different memristors has a different impact on the computation accuracy of an NCS. To identify the memristors that have a large impact on the NCS computation accuracy and need better control of device variations, a sensitivity analysis is performed. In an m×n crossbar, the sensitivity of the j-th output y_j to the device variation θ_ij of a specific memristor (m_ij) is:

∂y_j / ∂θ_ij = x_i·w_ij·e^(θ_ij) ≈ x_i·w_ij.    (12)


    Equation (12) shows that the impact of a memristor's variations on the NCS computation accuracy is proportional to the product of the input and the weight that the memristor represents. Since the weight matrix W of a neural network is often highly skewed (e.g., max(wij) can easily be >1000× min(wij)), the memristors with a low resistance (i.e., a large weight) and a high input demand better variation control.
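    The small Python sketch below computes such a sensitivity map following equation (12); the input vector x and weight matrix W are illustrative placeholders.

    import numpy as np

    def device_sensitivity(x, W):
        """Per-device sensitivity of the crossbar outputs y = x @ W to a small
        relative deviation of each memristor: proportional to |input * weight|.
        x has shape (m,), W has shape (m, n); an (m, n) map is returned."""
        return np.abs(x[:, None] * W)

    # Example: a skewed weight matrix makes a few devices dominate the sensitivity.
    x = np.array([1.0, 0.2, 0.9])
    W = np.array([[0.001, 2.0], [0.5, 0.01], [1.5, 0.003]])
    print(device_sensitivity(x, W))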

    Figure 9. Adaptive mapping.

    Mapping – To minimize the impact of device variations on the NCS computation accuracy, we would like to replace a memristor whose resistance significantly deviates from the nominal value and that has a high impact on the NCS computation accuracy with a device of smaller variation. It is hard to physically replace a memristor after the crossbar has been fabricated; however, changing the mapping relation between the elements of the weight matrix and the memristors in the crossbar can be done easily. Based on the permutation property of vector-matrix multiplication, switching two rows of the weight matrix together with their inputs does not change the output of the multiplication, as illustrated by the short example below. Hence, if one row (e.g., row1 in Figure 9) in the crossbar has a memristor with large variation that matches a high-weight connection, we can assign the input signals originally on row1 to the input of another row (e.g., row2 in Figure 9) and program row2 to the original weights of row1. In this way, the high-weight connections are no longer matched to the large-variation memristors.
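    The following toy Python check illustrates this permutation property; the sizes and values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.random(4)               # inputs on the word-lines
    W = rng.random((4, 3))          # weights mapped onto a 4x3 crossbar
    perm = [2, 0, 3, 1]             # an arbitrary re-assignment of the rows

    y_original = x @ W
    y_permuted = x[perm] @ W[perm]  # move the inputs together with their weight rows
    assert np.allclose(y_original, y_permuted)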

    3.2.2.2 Greedy Mapping Algorithm

    To achieve a minimum impact of memristor device variations across the whole crossbar, we adopt a greedy mapping algorithm in AMP to determine the mapping relations between the weight matrix and the crossbar, as depicted in Figure 10. The whole mapping process can be summarized as follows:

    We first calculate the impact of device variations incurred by mapping the p-th row of the weight matrix onto the q-th row of the memristor crossbar. As discussed in the sensitivity analysis, such an impact can be measured by the "summed weighted variation (SWV)" as:

    SWV(p,q) = Σj=1..n |wpj·θqj|.                (13)

    Here we assume both the crossbar and the weight matrix W have n columns in total. wpj is the connection weight at location (p,j) in W, and θqj represents the device variation of the memristor at location (q,j) in the crossbar, i.e., wpj·θqj is the difference between the ideal weight (wpj) and the actual weight represented by the memristor (wpj(1+θqj)). Here θ ~ N(0, σ²).


    Figure 10. Algorithm 1.

    The mapping starts with the row of W that has the largest device variation sensitivity calculated by equation (12) and maps it to the row of the crossbar with the smallest SWV. After a row is mapped, its original row in W and the mapped row in the crossbar are removed from the queue of to-be-mapped rows. AMP repeats this process until all the rows are properly mapped, as sketched in the code below. As in many redundant designs, we may also leverage additional memristor columns/rows to further improve the efficacy of the mapping; the mapping algorithm remains almost the same except that more memristor rows are available.
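    A compact Python sketch of this greedy procedure is shown below; the per-device deviations theta and typical input magnitudes are assumed to come from pre-testing and profiling, and the function and variable names are illustrative.

    import numpy as np

    def greedy_map(W, theta, x_typical=None):
        """Greedy row mapping: assign the most variation-sensitive weight row
        first, always to the crossbar row with the smallest summed weighted
        variation (SWV). W is (m, n); theta is (m_xbar, n) with m_xbar >= m
        when redundant rows exist. Returns mapping[p] = q, meaning weight row p
        is programmed onto crossbar row q."""
        m, n = W.shape
        x = np.ones(m) if x_typical is None else np.asarray(x_typical)
        row_sens = np.abs(x) * np.abs(W).sum(axis=1)   # per-row sensitivity, eq. (12) summed over outputs
        swv = np.abs(W) @ np.abs(theta).T              # swv[p, q] = sum_j |w_pj * theta_qj|
        mapping, free_rows = {}, set(range(theta.shape[0]))
        for p in np.argsort(-row_sens):                # most sensitive weight row first
            q = min(free_rows, key=lambda r: swv[p, r])
            mapping[p] = q
            free_rows.remove(q)
        return mapping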

    Defective cells are another reliability issue in the fabrication of memristor crossbars, leaving the device resistance stuck at HRS or LRS. Such defective cells can be detected as memristors with large variations and handled by AMP following a similar approach.


    3.2.3 Integration of VAT and AMP

    VAT and AMP are two complementary techniques that can be seamlessly integrated, and their efficacies are stackable: for example, if the effective device variations of the memristor crossbar have been reduced by AMP, this reduction will be captured by the memristor device variation model used in the self-tuning process of VAT. As a result, a smaller variation penalty will be needed in VAT, leading to a potentially higher training rate and test rate.

    3.3 EXPERIMENTS

    To evaluate our proposed Vortex scheme, we implement a two-layer neural network on a memristor crossbar based NCS for the well-known MNIST digit classification task [76]. The input signals of the crossbars are digital voltages corresponding to the pixels of the original benchmark images. The output signals are the currents sensed from the ten vertical wires of the crossbar, each of which represents one class from '0' to '9'. The "1 vs. all" method is still used in the output neuron design. Each benchmark image has 28×28 pixels, requiring a 784×10 crossbar for the computation. The nominal on-state and off-state resistances of the memristors used in our experiments are 10kΩ and 1MΩ, respectively. The benchmark images may need to be under-sampled to fit into memristor crossbars with different sizes in the relevant evaluations.
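    As a rough illustration, the Python snippet below maps a normalized weight matrix onto the 10kΩ–1MΩ conductance range and under-samples a 28×28 image for a smaller crossbar; the linear mapping and the nearest-pixel down-sampling are simplifying assumptions rather than the exact procedure of our flow.

    import numpy as np

    R_ON, R_OFF = 10e3, 1e6                     # nominal LRS / HRS of the devices
    G_MIN, G_MAX = 1.0 / R_OFF, 1.0 / R_ON

    def weights_to_conductance(W):
        """Linearly scale a weight matrix into the available conductance range."""
        Wn = (W - W.min()) / (W.max() - W.min() + 1e-12)
        return G_MIN + Wn * (G_MAX - G_MIN)

    def undersample(img28, n_inputs):
        """Pick a uniform subset of pixels so a 28x28 image fits an
        n_inputs x 10 crossbar (n_inputs is assumed to be a perfect square)."""
        side = int(np.sqrt(n_inputs))
        idx = np.linspace(0, 27, side).astype(int)
        return img28[np.ix_(idx, idx)].reshape(-1)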


    3.3.1 Effectiveness of AMP

    Figure 11 illustrates the training rate of VAT and the test rates of the crossbar before and after applying AMP. As expected, after AMP is applied, the impact of the device variations of the crossbar decreases, resulting in an improved test rate w.r.t. the case before AMP is applied. Besides, the optimal γ also reduces from 0.4 (before AMP) to 0.2 (after AMP).

    Figure 11. Effectiveness of AMP.

    3.3.2 ADC Resolution

    The resolution of the analog-to-digital converter (ADC) is an important factor that affects the efficacy of AMP by influencing the accuracy of the memristor resistance pre-testing. We analyze the impact of different resolutions on the NCS computation robustness (test rate). No redundancy is added in this analysis. Figure 12 shows the test rates of the NCS with different ADC resolutions


    under different device variations. Low resolutions (4-bit/5-bit) significantly limit the computation robustness. The test rates of the NCS with different variations start to saturate when a 6-bit ADC is applied, and further improving the ADC resolution yields only marginal additional robustness. Hence, we fix the ADC resolution at 6-bit in the following experiments.
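    The effect of the pre-testing resolution can be mimicked with a simple uniform quantizer, as in the sketch below; the sensing range and the uniform-in-resistance quantization are assumptions made only for illustration.

    import numpy as np

    def sensed_resistance(r_true, bits=6, r_min=10e3, r_max=1e6):
        """Quantize a sensed resistance to one of 2**bits uniform levels over the
        sensing range; a coarse ADC blurs the variation map that AMP relies on."""
        levels = 2 ** bits
        step = (r_max - r_min) / (levels - 1)
        r_clipped = np.clip(r_true, r_min, r_max)
        return r_min + np.round((r_clipped - r_min) / step) * step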

    Figure 12. ADC resolution vs. test rate.

    3.3.3 Design Redundancy

    We analyze the tradeoff between design redundancy and NCS computation robustness using the same experimental setup as in Section 3.3.1. When the memristors have a large variation (σ=0.8) and there are no redundant rows, the test rate is generally low, i.e., 71.8%. To improve the computation robustness, we may add p extra rows. Figure 13 shows the test rates of the crossbar with different p under different training schemes.


    Figure 13. Overhead vs. Test rate.

    In general, increasing the redundancy (p) helps to improve the test rates. However, the test rates are primarily determined by the device variations rather than the redundancy, and the help of redundancy is more prominent when the device variations are large. For comparison purposes, Figure 13 also shows the test rates under conventional off-device and on-device training without design redundancy. On average, Vortex achieves 29.6% and 26.4% higher test rates compared to off-device and on-device training, respectively, even without redundant rows. Here the theoretical maximum test rate in this configuration is ~85%, which is determined by the nature of the adopted neural network model. In the following experiments, we choose 100 redundant rows and σ=0.6 as our default setup.


    3.4 SECTION SUMMARY

    In this section, we studied the training robustness of memristor crossbars by quantitatively analyzing the influence of several hardware limitations, e.g., device variation and sensing resolution. Based on our analysis, "Vortex" – a variation-aware off-device training scheme – was developed to better tolerate device imperfections and design constraints. Experimental results show that Vortex achieves significantly improved training quality, i.e., a 29.6% higher test rate, w.r.t. conventional off-device training.


    4.0 ROBUST ON-DEVICE TRAINING WITH DIGITAL INITIALIZATION

    4.1 NOISE-ELIMINATING ON-DEVICE TRAINING

    As mentioned in Section 2.3.2.2, memristor crossbar hardware training does not necessarily need to be separated into two steps as described for off-device training. Another type of hardware training scheme is the "on-device" method, which directly implements the gradient descent training (GDT) algorithm on the memristor crossbar by repeating the loop of "programming and sensing" [30]. Since the on-device method is a closed-loop operation, it may be able to adaptively adjust the training inputs to reduce the impact of memristor variability by sensing the memristors (in fact, the output current from the crossbar) in real time. However, the impact of the very limited precision of the analog signals in the hardware on the quality of on-device training needs further investigation.

    4.1.1 Impacts of Device Variation and Signal Noise

    To have an intuitive view of the impact of device variation and signal noise on crossbar training, we perform the following experiment. Figure 14 shows an example of the output comparison step in the on-device training process when a set of read voltages (Vrd, 0, Vrd/2) is applied to the WLs of three memristors R1-R3 in the same column. Here we assume the three memristors are all at HRS. The ideal voltage on the BL shared by these three memristors should


    be Vrd/2. However, device non-uniformity and input voltage fluctuations may change the bias applied across the memristors. For example, if the resistance of R1 is larger than that of R2, the voltage on the BL will be below Vrd/2, as shown in Figure 14 (a). Also, if the input voltage on the WL of R1 changes to Vrd + ∆V, the voltage on the BL will be above Vrd/2, as shown in Figure 14 (b). In both cases, the calculated difference between the current output and the target output will differ from the ideal case. Such deviations can accumulate over the training iterations. Together with the fluctuations of the programming voltage and the process variations, they cause the programmed memristor resistance to deviate from the ideal value during the programming step of the memristor crossbar training process and finally affect

    and input signal noise on the memristor crossbar training. A 64 x 64 crossbar is implemented to

    realize the synapse connection of a one-layer neural network. Figure 14 (d) shows the resistance

    difference between the ideally trained crossbar (no process variation or input signal) and the

    crossbars trained with considering process variation (top row) or input signal noise (bottom row),

    respectively. In the evaluation of process variation's impact, the distribution of the memristor cell

    size in the crossbar is generated randomly for every iteration with Gaussian distribution. Note

    that since the input noise for write will result in the variation of the crossbar memristance, we

    consider the write input noise with process variation together. The standard deviation of the

    memristance variation is assumed to be 10% (σ = 0.1), 20% (σ = 0.2), and 30% (σ= 0.3) of its

    nominal value. In the evaluation of the read input signal noise's impact, similarly, a random noise

    following Gaussian distribution is generated on the input signals of the crossbar in every

    iteration. The standard deviation of the noise is assumed to be 10% (σ = 0.05), 20% (σ = 0.1),


    and 30% (σ = 0.15) of Vrd. The mean of the noise is zero. The GDT rule is applied in the

    training.

    Figure 14. Training under memristor variation and input noise.

    Our simulation shows very marginal degradation in the training robustness as the process variation increases. This is because the device variations are reflected in the difference between the current output and the target output during each iteration and compensated by the closed-loop training. Similarly, write pulse noise causes variation in the memristance change of each iteration, which is also compensated by the closed-loop training. However, input signal noise is generated on-the-fly and accumulates during the training process, leading to a large difference from the ideal trained result.


    Figure 15. Training process with noise.

    4.1.2 Noise Sensitivity of On-device Training

    Figure 15 illustrates how this dynamic threshold training scheme works at the system level. We assume F is the output activation function of the NCS, i.e., the comparators, which translates the output of the crossbar into a digital value {1, -1}. The input signal noise N is added to the output of F before it is sent to the next iteration. Different from the conventional GDT, our method tries to minimize not only the 2-norm distance between the actual and the target outputs, but also the system's sensitivity to the noise as:

    (14)

    In the above cost function, J1 and J2 denote the memristor crossbar output distance and the noise sensitivity, respectively. At the end of iteration t, the adjustment of the memristor crossbar for the next iteration, W(t+1), can be derived from the current W(t) as:


    (15)

    or,

    (16)

    The choice of the training rate is discussed in [30]. For the second term on the right of equation

    (16), we have:

    (17)

    Equation (17) means that the variations of W (the process variation) are reflected in the output distance (y – y*). J2 is determined by the activation function f as:

    (18)

    For the two popular activation functions in neuromorphic computing, i.e., the sigmoid function and the sgn function, equation (18) can be expressed as:

    (19)

    and

    (20)


    respectively. In both cases, the noise sensitivity decreases as the magnitude of the activation function's input increases, as shown in Figure 16.

    Figure 16. Noise elimination mechanism.
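    For the sigmoid case this behaviour can be seen directly from its derivative, as in the small numerical check below (the sample points are illustrative).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_noise_gain(x):
        """First-order sensitivity of a sigmoid neuron to a small additive input
        noise, i.e. f'(x); it shrinks as |x| moves away from zero."""
        s = sigmoid(x)
        return s * (1.0 - s)

    xs = np.array([0.0, 1.0, 2.0, 4.0])
    print(sigmoid_noise_gain(xs))   # ~0.250, 0.197, 0.105, 0.018 - monotonically decaying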

    4.1.3 Noise-Eliminating Training Scheme

    Based on our observations on equations (19) and (20), we propose a noise-eliminating training scheme to minimize the noise accumulation during on-device training. Redundant rows are added on top of the memristor array to generate an offset current B that is opposite to the target output yi* of the column during on-device training, as shown in Figure 14 (c). The offset adds a bias to the calculated difference between the current output and the target output of the crossbar so that its magnitude is shifted out of the sensitive region of f(x) as:

    (21)


    As shown in Figure 16, by applying the bias, the residue of the noise in the sensitive region of the activation function is reduced and the accumulation of the noise over the training iterations is minimized. The selection of the bias is important in our proposed scheme: a bias larger than necessary may make the training process bypass the convergence region, making convergence difficult, while a bias that is too small may not efficiently suppress the noise. A detailed evaluation of the bias selection is given in Section 4.3.1.

    We define the bias amplitude a to measure the ability of the reference memristors to offset the crossbar output as:

    (22)

    Here Ron is the HRS of a memristor, Rref is the average resistance of the reference memristors, Ncol is the number of memristors in a column, and Nref is the number of reference memristors in a column. During on-device training, a training failure is defined as unsuccessful convergence after a maximum of n training iterations, where n is a threshold usually much larger than the normal number of iterations required for convergence. If a training failure happens, we reset the reference memristors to reduce a and redo the training process until the training succeeds or a=0, at which point the training degrades to the conventional training scheme (see the sketch below).
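    A minimal Python sketch of this failure-and-retry loop is given below; train_step() and converged() are hypothetical placeholders for one program-and-sense iteration under the bias and for the convergence check, and the numeric defaults are illustrative.

    def train_with_bias(train_step, converged, a_init=0.05, a_step=0.01,
                        max_iters=5000):
        """Noise-eliminating on-device training with bias back-off: declare a
        training failure after max_iters iterations without convergence, reduce
        the bias amplitude a, and redo the training; a == 0 falls back to the
        conventional scheme."""
        a = a_init
        while a >= 0.0:
            for _ in range(max_iters):
                err = train_step(a)          # one programming-and-sensing iteration
                if converged(err):
                    return a                 # training succeeded with this bias
            a = round(a - a_step, 10)        # training failure: shrink the bias
        return None                          # even a = 0 failed to converge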


    4.2 DIGITAL INITIALIZATION

    4.2.1 Basic Idea

    In our noise-eliminating training scheme, the introduction of the bias affects the convergence process of on-device training and may cause convergence failures. In this section, we propose a digital initialization step for the on-device training to reduce the training failure rate and the training time.

    Figure 17. Digital initialization.

    As shown in Figure 17, in the initialization step the target W, which can be calculated beforehand, is quantized to a digital version in which every element is represented by a multi-level cell (MLC) value, e.g., a 2-bit digit. The quantized W is then written into the crossbar by the programming method introduced in Section 2.3.2.1, regardless of the device variations. Our digital-assisted training initialization step can improve the convergence speed of on-device training by setting the initial resistance of the memristors close to the target values. Different from off-device training, the digital initialization does not require programming the memristors precisely to the digitalized resistance levels and can tolerate the device variations. Note that the digitalization of W depends on the specific training algorithm, as we will show next for our approach.

    4.2.2 Digitalization of Weight Matrix

    In the conventional MLC memory cell design, the distances between two adjacent resistance states of the memristor must be the same to maximize the sense margin [69]. The threshold to differentiate different MLC levels is set at the cross point between the distributions of two adjacent resistance states. In on-device training, however, the convergence rate of the training process is conceptually determined by the distance between the target value and the initial value. Therefore, the partition method of MLC memory design does not necessarily give us the minimum distance in the digitalization of the weight matrix W.

    We propose a heuristic method to determine the memristor resistance states corresponding to the different digitalized levels of W: for an M-level digitalization, the elements of W are equally classified into M baskets based on their values. We then find, for each basket, the representative resistance state that minimizes the summed distance between the basket elements and that state; this state is the optimal memristor resistance state for the corresponding level of the digitalization. Here we use the 1-norm resistance distance to measure the impact of the difference between the ideal weights and their digitalized levels on the overall convergence rate of the on-device training. For different on-device training algorithms, other metrics, e.g., the 2-norm distance or the maximum distance, may also be adopted. Considering the practical memristor programming resolution, we set M=4 here. Note that this method may result in a smaller MLC sensing margin; however, we do not need to read out the value of each MLC, and the initialization accuracy is sufficient to guarantee the training quality.
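    A possible Python sketch of this basket-based digitalization is shown below; using the basket median as the representative level is one way to minimize the summed 1-norm distance within a basket, and the boundary handling is simplified for illustration.

    import numpy as np

    def digitalize_weights(W, m_levels=4):
        """Sort the weights, split them into m equally sized baskets, and use the
        median of each basket as its representative level (the median minimizes
        the summed 1-norm distance within a basket). Returns the quantized W and
        the chosen levels."""
        flat = np.sort(W.ravel())
        baskets = np.array_split(flat, m_levels)
        levels = np.array([np.median(b) for b in baskets])
        edges = [b[-1] for b in baskets[:-1]]     # basket boundaries
        codes = np.digitize(W, edges)             # basket index of every weight
        return levels[codes], levels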

    4.3 EXPERIMENTS AND RESULTS

    4.3.1 Noise Elimination

    Figure 18 illustrates the effectiveness of the noise-eliminating training method in improving the performance of the memristor crossbar-based NCS. A Hopfield network with 128 input neurons is built on a crossbar with a one-layer iterative structure to remember 16 patterns. We choose the conventional delta rule (DR) training method for comparison. In our simulation, we set the bias amplitude a to 0.05. Monte-Carlo simulations are conducted under different process variations and input signal noise levels to measure the success rate of recognizing the images. As shown in Figure 18 (a) and (b), even in the worst simulated cases of process variation or input signal noise at each comparison, our method still achieves the best performance.


    Figure 18. Effectiveness of noise-eliminating training.

    4.3.2 Digital-Assisted Initialization

    Figure 19 compares the training speed of the same on-device training simulated in Section 4.3.1. The Y-axis is the Hamming distance between the output vectors of the crossbar and the target output vectors; the X-axis is the training iteration number. The size of the training input vector set is 16, and the crossbar training ends when the generated outputs match the target patterns. Four combinations of process variations and input signal noise levels are simulated. To exclusively measure the effect of the digital-assisted initialization, noise-eliminating training is not applied in these simulations.


    Figure 19. Comparison of convergence rates with different initializations.

    Among all the simulated results, initializing the states of all memristors to '1' (HRS) requires the largest number of training iterations, while initializing them to '0' (LRS) requires the smallest number among all the simulations except the ones with digital-assisted initialization. This indicates that the majority of the target memristor states are close to '0'.

    The "MLC-based digital-assisted" curve denotes the results of using the digitalization method of the 2-bit MLC memory design for the W initialization, while the "Optimized digital-assisted" curve denotes the results of using the heuristic method proposed in Section 4.2. Both of them demonstrate a much lower iteration number than the training processes without the digital-assisted initialization step. Our heuristic method offers the best result among all the training methods: when both process variation and input signal noise are considered, the training iteration number of "Optimized digital-assisted" is 23.3% less than that of "initialization with '0'".


    The introduction of process variation causes the deviation of the initial states of the

    memristors from the target states in the digital-assisted initialization step. It raises the Hamming

    distances of the first several iterations and increases the iteration numbers considerably, as

    shown in Figure 19 (b) and (d).

    In general, the total training time of a conventional back propagation (BP) training method can be calculated by:

    (23)

    Here, the quantities in equation (23) are the input size of the crossbar, the overall training time, the programming and comparison time consumed in each iteration, and the number of iterations. When the digital-assisted initialization step is applied, the initialization time is added to the total training time. Therefore, to achieve a net benefit, the speed-up introduced by the digital-assisted initialization step must outweigh the extra initialization time, as sketched below. Figure 20 shows that for a crossbar with a size of n < 128, the digital-assisted initialization step does not provide any benefit in training time reduction under the simulated conditions.
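    The break-even argument can be written as a one-line comparison, as in the sketch below; the time constants are hypothetical stand-ins for the per-iteration programming and comparison times and the one-off initialization cost.

    def total_training_time(n_iters, t_program, t_compare, t_init=0.0):
        """Overall on-device training time in the spirit of equation (23): every
        iteration pays one programming and one comparison step, plus an optional
        one-off digital initialization cost."""
        return t_init + n_iters * (t_program + t_compare)

    # Illustrative break-even check (all numbers hypothetical): initialization
    # only pays off if the iterations it saves outweigh its own cost.
    baseline  = total_training_time(n_iters=400, t_program=1.0, t_compare=0.2)
    with_init = total_training_time(n_iters=300, t_program=1.0, t_compare=0.2, t_init=80.0)
    print(with_init < baseline)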


    Figure 20. The impact of initialization on total training time.

    4.3.3 Case Study

    To comprehensively evaluate the effectiveness of all our proposed techniques, we implemented a three-layer feed-forward neural network on a neuromorphic computing system with multiple crossbar computing engines. BP training is used as the comparison training algorithm in this case. Other simulation parameters can be found in Table 1.

    Table 1. Simulation setup


    Four sets of image patterns (i.e., face, animal, building, and fingerprint) are adopted to train the neuromorphic computing system. As shown in Figure 21, each pattern set has 8 images of the same pixel size.

    Figure 21. 3-layer network recall rate test of dynamic threshold training algorithm.

    Figure 21 compares the recall success rates of the conventional back propagation (BP) training and the modified noise-eliminating method. Our method surpasses the conventional training method over all the simulation cases. As the bias amplitude increases, the recall success rate improvement introduced by the noise-eliminating training method becomes more prominent.


    Table 2. Training failure rate

    Table 2 shows the training failure rate and the training time (without the digital-assisted initialization step) under different bias amplitudes a. The increase in bias amplitude reduces the training time of each iteration while rapidly raising the training failure rate. As mentioned in Section 4.1.3, a training failure prolongs the total training time since we redo the training with a reduced a. The overall training time therefore becomes the accumulation over the attempted bias amplitudes, where the training time of each iteration and the training failure rate both correspond to the training process with bias amplitude a.

    Figure 22 shows the overall training time comparison between conventional BP training and the modified noise-eliminating training, with and without the digital-assisted initialization step, starting with different a. Our techniques generally reduce the on-device training time by 12.6%~14.1% for the same recall success rate, or improve the recall success rate by 18.7%~36.2% for the same training time. Designers can pick the best combination based on the specific system requirements.

    Figure 22. Comparisons of overall training time.

    4.4 SECTION SUMMARY

    In this section, we proposed a noise-eliminating training method and a digital-assisted initialization step to improve the robustness and performance of on-device training for memristor crossbar-based NCS. Experimental results show that our techniques can significantly improve the testing accuracy of the neuromorphic computing system by up to 18.7%~36.2% and reduce the training time by 12.6%~14.1%, by suppressing the noise accumulation over the training iterations and reducing the mismatch between the initial weight matrix state and the target values.


    5.0 SCALABILITY

    5.1 IR-DROP LIMITS SINGLE CROSSBAR SIZE

    5.1.1 Impact of IR-Drop on Memristor Crossbar

    In a memristor crossbar, the voltage applied to the two terminals of a memristor is affected by the device location in the crossbar and the resistance states of all other memristors. In [57], the author explained that, in the worst case, both sensing and programming of the crossbar encounter severe reliability issues when the array size grows beyond 64×64. Although an NCS can intrinsically tolerate certain random errors in the sensing process, IR-drop remains an issue in NCS training.

    Figure 23 depicts the distribution of the actual voltage drop V' on each memristor in a 128×128 crossbar during the training process. Here Vbias = 2.9V, and V'ij is the voltage actually applied to the memristor between WLi and BLj. The largest IR-drop normally occurs at the far-end of the WL and BL (i.e., V'(128,128)). The smallest/largest voltage degradation (IR-drop) occurs when all memristors are at their HRS/LRS. Figure 23 (b) shows that, in the worst case, the largest IR-drop quickly increases to an unacceptable level as the crossbar size increases. It greatly decreases the programmability of the crossbar and degrades the computation accuracy of the NCS. Degradation also occurs in the recall process, as shown in Figure 23 (c) and (d).



    Figure 23. Voltage distribution with IR-drop.
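    A simplified, single-line IR-drop estimate can be obtained by solving the resistive ladder formed by the wire segments and the cells, as in the Python sketch below; it models only one word-line with an ideally grounded bit-line, so it understates the full crossbar effect, and the wire and cell resistance values are illustrative.

    import numpy as np

    def far_end_voltage(v_bias, n_cells, r_cell, r_wire_seg):
        """Solve the node voltages of one word-line modeled as a ladder of wire
        segments (r_wire_seg each) feeding identical cells (r_cell to ground) and
        return the voltage seen by the far-end device."""
        g_w, g_c = 1.0 / r_wire_seg, 1.0 / r_cell
        A = np.zeros((n_cells, n_cells))
        b = np.zeros(n_cells)
        for i in range(n_cells):
            A[i, i] = g_c + g_w + (g_w if i < n_cells - 1 else 0.0)
            if i > 0:
                A[i, i - 1] = -g_w
            if i < n_cells - 1:
                A[i, i + 1] = -g_w
        b[0] = g_w * v_bias                 # the first node is driven through one segment
        v = np.linalg.solve(A, b)
        return v[-1]

    # Worst case: all 128 cells at LRS (10 kOhm) with 1 Ohm per wire segment.
    print(far_end_voltage(2.9, 128, 10e3, 1.0))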

    5.1.2 Problem Formulation

    5.1.2.1 Training

    Normally the training of a crossbar starts with an initial state where all memristors are at their HRS. To program the initialized memristor crossbar (RHRS) to the target memristor resistance state R that represents the weight matrix W, a training time matrix T is generated based on the characterized relationship between the memristor resistance change and the programming time and voltage [48]:


    T = f⁻¹(R, V, RHRS),                (24)

    where V is the ideal programming voltage (V(i,j) = Vbias). After including the impact of IR-drop, the actually trained memristor resistance state is R' = f(T, V', RHRS). Thus, if V' deviates from the ideal V due to IR-drop, the actually trained crossbar state R' will be distinct from R. The difference between R and R' depends on the size of the crossbar. As shown in Figure 2, when the programming v