simba: a skyrmionic in-memory binary neural network accelerator · 2020-03-12 · journal of latex...

12
JOURNAL OF L A T E X CLASS FILES 1 SIMBA: A S kyrmionic I n-M emory B inary Neural Network A ccelerator Venkata Pavan Kumar Miriyala, Student Member, IEEE, Kale Rahul Vishwanath, Xuanyao Fong, Member, IEEE Department of Electrical and Computer Engineering, National University of Singapore, Singapore, 117583. Magnetic skyrmions are emerging as potential candidates for next generation non-volatile memories. In this paper, we propose an in-memory binary neural network (BNN) accelerator based on the non-volatile skyrmionic memory, which we call as SIMBA. SIMBA consumes 26.7 mJ of energy and 2.7 ms of latency when running an inference on a VGG-like BNN. Furthermore, we demonstrate improvements in the performance of SIMBA by optimizing material parameters such as saturation magnetization, anisotropic energy and damping ratio. Finally, we show that the inference accuracy of BNNs is robust against the possible stochastic behavior of SIMBA (88.5% ±1%). Index Terms—Binary neural networks, magnetic skyrmions, in-memory computing, spintronics, spin hall effect and spin-transfer torque nano oscillators. I. I NTRODUCTION Spintronic memories utilizing magnetic skyrmions have recently been the subject of great interest [1]–[6]. This is due to scalability of skyrmions down to 2 nm [7]–[9], the ease of electrical manipulation [1], [10] and their robustness and stability due to topological protection of skyrmions [11], [12]. Existing skyrmionic memory proposals replace the magnetic domain wall devices in racetrack memories [13], [14] with their skyrmionic counterparts. However, micro-magnetic sim- ulations recently show that when a spin-torque nano-oscillator (STNO) [15] is used to generate spin waves (SWs) in a ferromagnetic film, the presence of a skyrmion in the film can influence its steady state dynamics [16]. A skyrmion is topologically protected and hence, does not allow the SWs to pass through it. Moreover, the skyrmion backscatters the SWs as if it is reradiating the SWs. The interference between SWs emitted by the STNO and SWs backscattered by the skyrmion, which are well-described by wave equations, can affect the steady state dynamics of the STNO—sufficiently strong constructive interference between SWs can lead to nucleation of a skyrmion in the STNO [16]. In this paper, we propose a novel way of using the SW me- diated interactions between magnetic skyrmions and STNOs to augment the skyrmionic memory with compute capabilities and implement an ultra-low power non-volatile in-memory computing engine (NVIMCE) using the presence and absence of a skyrmion as a state variable. To the best of our knowledge, this is the first skyrmionic in-memory computing proposal. Moreover, unlike other skyrmionic logic devices that have been proposed in the literature [17]–[19], the issues arising from the transverse motion of skyrmions [11] are avoided in our proposed device because logic operations do not rely on skyrmion motion. On the other hand, the NVIMCE, which saves on energy associated with data storage and movement is a crucial design Manuscript received XXXXX; revised August XXXXX. Correspond- ing authors: V. P. K. Miriyala (email: [email protected]), X. Fong ([email protected]). technique for accelerating neural network algorithms [20]– [22]. The NVIMCEs can be powered down to achieve near- zero standby power consumption when the device is in sleep mode. When the device wakes up, data in the NVIMCE is up- dated and processed in-situ, which saves on latency and energy consumed to transfer data between memory and processing elements in the conventional von Neumann architecture. In this paper, we also propose a skyrmionic NVIMCE (abbreviated as SIMC) based hardware accelerator for binary neural networks (BNNs) [23], which we call as SIMBA. A hybrid simulation framework consisting of device, circuit and architecture-level simulation tools is developed to design, analyze and evaluate SIMBA. Our simulation results show that high energy efficiency and fast inference operations are achieved in SIMBA due to reduced data movement. Further, we have also demonstrated improvements in performance of SIMBA by optimizing the material parameters. The rest of this paper is structured as follows. The SW mediated interactions between skyrmions and STNOs, and the skyrmionic logic gates will be first introduced in Section II. The hardware architectures of our proposed SIMC and SIMBA are then presented in Section III and Section IV, respectively. Results and discussion on proposed devices, circuits and archi- tectures are then presented in Section V. Finally, Section VI concludes this paper. II. SKYRMIONIC LOGIC COMPUTATION USING STNOS For completeness sake, the underlying device physics of the non-volatile skyrmionic devices with which SIMBA is implemented is briefly presented in the following subsection. Further details of the device-level simulations are discussed in Section V. A. Interactions between Skyrmions and STNOs Let us first consider the device structure shown in Fig. 1 (a), which consists of an oxide/ferromagnetic free layer (FL)/heavy metal (HM) trilayer structure in which the magnetization of FL can be manipulated. Two magnetically pinned layers (PL) arXiv:2003.05132v1 [cs.ET] 11 Mar 2020

Upload: others

Post on 06-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 1

SIMBA: A Skyrmionic In-Memory Binary Neural NetworkAccelerator

Venkata Pavan Kumar Miriyala, Student Member, IEEE, Kale Rahul Vishwanath, Xuanyao Fong, Member, IEEEDepartment of Electrical and Computer Engineering, National University of Singapore, Singapore, 117583.

Magnetic skyrmions are emerging as potential candidates for next generation non-volatile memories. In this paper, we proposean in-memory binary neural network (BNN) accelerator based on the non-volatile skyrmionic memory, which we call as SIMBA.SIMBA consumes 26.7 mJ of energy and 2.7 ms of latency when running an inference on a VGG-like BNN. Furthermore, wedemonstrate improvements in the performance of SIMBA by optimizing material parameters such as saturation magnetization,anisotropic energy and damping ratio. Finally, we show that the inference accuracy of BNNs is robust against the possible stochasticbehavior of SIMBA (88.5% ±1%).

Index Terms—Binary neural networks, magnetic skyrmions, in-memory computing, spintronics, spin hall effect and spin-transfertorque nano oscillators.

I. INTRODUCTION

Spintronic memories utilizing magnetic skyrmions haverecently been the subject of great interest [1]–[6]. This is dueto scalability of skyrmions down to 2 nm [7]–[9], the easeof electrical manipulation [1], [10] and their robustness andstability due to topological protection of skyrmions [11], [12].Existing skyrmionic memory proposals replace the magneticdomain wall devices in racetrack memories [13], [14] withtheir skyrmionic counterparts. However, micro-magnetic sim-ulations recently show that when a spin-torque nano-oscillator(STNO) [15] is used to generate spin waves (SWs) in aferromagnetic film, the presence of a skyrmion in the filmcan influence its steady state dynamics [16]. A skyrmion istopologically protected and hence, does not allow the SWsto pass through it. Moreover, the skyrmion backscatters theSWs as if it is reradiating the SWs. The interference betweenSWs emitted by the STNO and SWs backscattered by theskyrmion, which are well-described by wave equations, canaffect the steady state dynamics of the STNO—sufficientlystrong constructive interference between SWs can lead tonucleation of a skyrmion in the STNO [16].

In this paper, we propose a novel way of using the SW me-diated interactions between magnetic skyrmions and STNOsto augment the skyrmionic memory with compute capabilitiesand implement an ultra-low power non-volatile in-memorycomputing engine (NVIMCE) using the presence and absenceof a skyrmion as a state variable. To the best of our knowledge,this is the first skyrmionic in-memory computing proposal.Moreover, unlike other skyrmionic logic devices that havebeen proposed in the literature [17]–[19], the issues arisingfrom the transverse motion of skyrmions [11] are avoided inour proposed device because logic operations do not rely onskyrmion motion.

On the other hand, the NVIMCE, which saves on energyassociated with data storage and movement is a crucial design

Manuscript received XXXXX; revised August XXXXX. Correspond-ing authors: V. P. K. Miriyala (email: [email protected]), X. Fong([email protected]).

technique for accelerating neural network algorithms [20]–[22]. The NVIMCEs can be powered down to achieve near-zero standby power consumption when the device is in sleepmode. When the device wakes up, data in the NVIMCE is up-dated and processed in-situ, which saves on latency and energyconsumed to transfer data between memory and processingelements in the conventional von Neumann architecture.

In this paper, we also propose a skyrmionic NVIMCE(abbreviated as SIMC) based hardware accelerator for binaryneural networks (BNNs) [23], which we call as SIMBA.A hybrid simulation framework consisting of device, circuitand architecture-level simulation tools is developed to design,analyze and evaluate SIMBA. Our simulation results showthat high energy efficiency and fast inference operations areachieved in SIMBA due to reduced data movement. Further,we have also demonstrated improvements in performance ofSIMBA by optimizing the material parameters.

The rest of this paper is structured as follows. The SWmediated interactions between skyrmions and STNOs, and theskyrmionic logic gates will be first introduced in Section II.The hardware architectures of our proposed SIMC and SIMBAare then presented in Section III and Section IV, respectively.Results and discussion on proposed devices, circuits and archi-tectures are then presented in Section V. Finally, Section VIconcludes this paper.

II. SKYRMIONIC LOGIC COMPUTATION USING STNOS

For completeness sake, the underlying device physics ofthe non-volatile skyrmionic devices with which SIMBA isimplemented is briefly presented in the following subsection.Further details of the device-level simulations are discussed inSection V.

A. Interactions between Skyrmions and STNOs

Let us first consider the device structure shown in Fig. 1 (a),which consists of an oxide/ferromagnetic free layer (FL)/heavymetal (HM) trilayer structure in which the magnetization ofFL can be manipulated. Two magnetically pinned layers (PL)

arX

iv:2

003.

0513

2v1

[cs

.ET

] 1

1 M

ar 2

020

Page 2: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 2

2.20 2.25 2.30 2.35 2.40

-0.02

0.00

0.02

0.04

0.06

Mag

. alo

ng

x-d

irecti

on

Time (ns)

mx

0 1 2 3 4 5-0.04

-0.02

0.00

0.02

0.04

No skyrmion

Ma

g. a

lon

g x

-dir

ec

tio

n

Time (ns)0 1 2 3 4 5

-0.8

-0.6

-0.4

-0.2

0.0

0.2

0.4

0.6

0.8

skyrmion at d = 100 nm

Mag

. alo

ng

x-d

irecti

on

Time (ns)0 1 2 3 4 5

-0.04

-0.02

0.00

0.02

0.04

0.06skyrmion at d = 125 nm

Mag

. alo

ng

x-d

irecti

on

Time (ns)

(a) (c) (b)

(d) (e) (f)

R-2 R-1 𝑽𝟏 𝑽𝟐

d

Heavy metal

Free layer

Oxide Layer Pin layer

Fig. 1. The cross-sectional view of the device structure used to demonstrate the interactions between skyrmions and STNOs. V1 generates the skyrmion inregion R-1 and V2 causes the local magnetization in region R-2 to precess continuously, shown in (b) and (c). The precession leads to the propagation ofSWs outward from R-2 as shown in (c). (d) The local magnetization in R-2 when no skyrmion in present in R-1. (e) and (f) show the local magnetizationin R-2 when a skyrmion is at d = 100 nm and d = 125 nm, respectively. The stabilized negative in-plane magnetization in (e) denotes the formation of newskyrmion in R-2.

are placed on top of the oxide layer to form the magnetictunnel junctions (MTJs) [24], [25] labeled as R-1 and R-2.

Now, consider the injection of current into the FL throughR-2 (by applying a voltage, V2, across the corresponding MTJ)in the absence of any skyrmions in the FL layer. Resultsof the micromagnetic simulation in MuMax3 [26], shownin Fig. 1 (b), demonstrate that current-induced spin-transfertorque (STT) [27] can cause the local magnetization in R-2 tooscillate continuously like an STNO. Due to the interactionbetween various magnetization energies (exchange, demag-netizing, anisotropies, etc.), the oscillations in R-2 radiateoutward through the rest of FL in the form of SWs as Fig. 1 (c)shows. Hence, applying a voltage across the MTJ in R-2 inthis manner activates the STNO at R-2.

However, if the current-induced STT on the FL is suffi-ciently strong, the local magnetization of the FL may switchto a skyrmionic magnetization state instead of continuouslyoscillating, as shown in Fig. 1 (b). A skyrmion acts as ascatterer of SWs due to its topological protection. Hence, ifa skyrmion is present at R-1 (which may be nucleated byapplying a voltage, V1, across the corresponding MTJ) beforethe STNO at R-2 is activated, the SWs emitted by the STNO atR-2 (SWsemitted) will be backscattered by the skyrmion at R-1.The SWs backscattered by the skyrmion in R-1 (SWs scattered)interfere with SWsemitted and can affect the dynamics of theSTNO at R-2. Results of MuMax3 simulations show thatSTNO oscillations are amplified (Fig. 1 (e)) and attenuated(Fig. 1 (f)) when the interference is constructive and destruc-tive, respectively. This phenomenon can be modeled using the

following wave equations

SWsemitted ∝ A1ekkf

x−ωtcos(kx− ωt) (1)

SWs scattered ∝ B1ekkf

x−ωt− Φ2Φf cos(kx− ωt− Φ) (2)

SWs total = SWsemitted + SWs scattered (3)

where A1 and B1 are the wave amplitudes, k = 2π/λ is thewave vector, λ is the wave length, ω is the angular frequency,kf and Φf are fitting constants, Φ = 2k (d-R) is the phase

TABLE ISIMULATION PARAMETERS

Quantity Value

Size of FL 300×300×0.4 nm3

Mesh size 2 nm×2 nm×0.4 nm

Diameter of A, B, O1 and O2 40 nm

Saturation Magnetization 580×103 A/m

Exchange constant 1.5×1011 J/m

Anisotropic energy density 0.8×106 J/m3

Interfacial Dzyaloshinskii–Moriya

interaction constant

3.5 mJ/m2

Damping ratio (α) 0.03

Polarization (P) 0.56

Spin torque efficiency (Λ) 1

Field-free torque coefficient (ϵ1) 10-3

Spin hall angle (θsh) 0.1 Resistance area product 1.63 Ω∙µm2

Tunneling Magnetoresistance ratio 125%

HM resistivity 2 µΩ∙m

Page 3: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 3

𝑯𝒆𝒂𝒗𝒚 𝒎𝒆𝒕𝒂𝒍

𝑭𝒓𝒆𝒆 𝑳𝒂𝒚𝒆𝒓

𝑶𝒙𝒊𝒅𝒆 𝑳𝒂𝒚𝒆𝒓 𝑷𝒊𝒏𝒏𝒆𝒅 𝑳𝒂𝒚𝒆𝒓

A B

𝑶𝟏

𝑶𝟐

𝑽𝑰𝑵𝟏 𝑽𝑰𝑵𝟐

𝑽𝑶𝟐

𝑽𝑶𝟏

tI = 0.5 ns tI = 0 ns

A B

tI = 1 ns tI = 2 ns

tC1 = 0 ns tC1 = 2.5 ns tC2= 0 ns tC2 = 5 ns

o1

o2 o2

o1

o2 o2

o1

o2 o2

o1

o2

o1

o2

o1

o2 tA= 0 ns tA = 1 ns tA= 2 ns tA = 3 ns

a)

b)

f)

c)

d)

e)

g)

Fig. 2. (a) The proposed skyrmionic logic device. Regions, A and B are usedfor storing inputs whereas O1 and O2 are used for computing. Micromagneticsimulation of (b-c) input skyrmion nucleation, (d-f) logic computation and (g)skyrmion annihilation (reset) operations. The x -component of magnetization(mx) are shown in (b-g). The blue and red dumbbell-like structures in (d-g) are the skyrmions. (g) Skyrmions are annihilated by passing an in-planereset current, illustrated in (a), through the HM, which pushes skyrmions to theboundary of the FL. The time delays during nucleation of inputs, computationduring phase-1 and phase-2, and annihilation operations are denoted as tI, tC1,tC2 and tA, respectively. The inset in (g) shows the color map used for plottingmx in (b-g).

difference, and R is the radius of the skyrmion. When λ isconstant, SWsemitted and SWs scattered interfere constructivelyfor d = λ+ nλ/4 ∀ n ∈ even positive integers, whereasthey interfere destructively for d = λ+ nλ/4 ∀ n ∈ oddpositive integers. Interestingly, it was found that when theinterfacial Dzyaloshinskii-Moriya interactions are sufficientlystrong, the constructive interference leads to nucleation of askyrmion in the STNO in R-2 [16] (Fig. 1 (e)). The stabilizednegative x -component of magnetization in Fig. 1 (e) denotesthe nucleation of a skyrmion. We will next discuss the design

of skyrmionic logic gates that utilize the effects discussed inthis sub-section.

B. Spin-wave-based Skyrmionic Logic

The structure of our proposed skyrmionic logic gate, shownin Fig. 2 (a), contains four point-contact regions (A, B, O1and O2). The regions A and B are used for storing inputs,whereas regions O1 and O2 are used during computation.The presence (absence) of the skyrmions in these regionsdenote the bit ‘1’ (‘0’) and can be electrically sensed asthe resistance of corresponding MTJs [24]. Fig. 2 (b) showsthe time evolution of FL magnetization when A and B areprogrammed to be ‘0’ and ‘1’, respectively, whereas Fig. 2 (c)shows the time evolution of FL magnetization when A and Bare both programmed to be ‘1’.

Computation on inputs stored at A and B is performedby using O1 and O2 as STNOs. O1 and O2 are placed atdistance, d = 125 nm apart from each other and at a distance,d = 150 nm from both the input bits, A and B. The materialparameters assumed in this work lead to λ ≈ 100 nm. Inthis geometry, if the STNO at O1 is activated, the presence ofa skyrmion at either A or B leads to constructive interferencebetween SWsemitted and SWs scattered in O1 and amplify theoscillations in the STNO, resulting in the formation of askyrmion at O1. Furthermore, if skyrmions are present inboth A and B, the amplification in STNO oscillations wouldbe large and a skyrmion forms at O1 within relatively shorttime delays. Due to symmetry, the same effect is observedwhen the STNO in O2 is activated. However, in the presenceof a skyrmion at O1, SWsemitted by O2 and SWs scattered bythe skyrmion O1 interfere destructively and oscillations at O2become attenuated.

Consider only the STNO at O1 with no skyrmions in O2. IfO1 is activated for 2.5 ns, a new skyrmion will only form inO1 if skyrmions are present in both A and B (see Fig. 2 (d)from tC1 = 0 ns to 2.5 ns). Otherwise, no skyrmion will formin O1 (see Fig. 2 (e)-(f) from tC1 = 0 ns to 2.5 ns). Hence,the output stored at O1 may be expressed using the Booleanexpression O1 = A ·B (i.e., A AND B). If only one of Aor B has a skyrmion instead, the pulse width of VO1 needsto be 5 ns in order for a skyrmion to form in O1. Due to thesymmetry between O1 and O2, same effect can be observedat O2 as well. We demonstrate the latter case by activating O2for 5 ns (see Fig. 2 (e)-(f) from tC2 = 0 ns to 5 ns). The outputstored at O2 may be expressed using the Boolean expressionO2 = A + B (i.e., A OR B). Hence, A · B and A + Boperations can be performed by ensuring that O2 and O1 donot have any skyrmion before activating the STNO with therequired pulse width at the desired location.

The XOR operation between A and B (A ⊕ B) needs tobe performed in two phases. First, we note that the Booleanexpression for the XOR operation may be written as

A ⊕ B = (A · B + A · B) = (A + B) · (A · B) (4)

Hence, in the first phase, A · B operation is performed andstored at O1. In the second phase, the voltage pulse widthis given to the STNO in O2 as if A +B operation is being

Page 4: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 4

(c)

(d)

36 F

𝑹𝑺𝑳𝟏

𝑹𝑺𝑳𝟐

𝑹𝑺𝑳𝟑

𝑹𝑺𝑳𝟒

𝑶𝟏

𝑨

𝑩

𝑪𝑺𝑳𝟐 𝑪𝑺𝑳𝟑

25 F 𝑶𝟐

𝑻𝟏

𝑻𝟐

𝑻𝟑

𝑻𝟒

(e) 𝑹𝑺𝑳𝟏 𝑪𝑺𝑳𝟏 𝑹𝑺𝑳𝟑 𝑪𝑺𝑳𝟑 𝑹𝑺𝑳𝟒 𝑹𝑺𝑳𝟐

Ro

w co

ntro

l circuitry

Re

set cu

rren

t drive

r

Column control circuitry

𝑪𝑺𝑳𝟐

𝑪𝑺𝑳𝟏

(a) (b)

𝑹𝑺𝑳𝟏

𝑪𝑺𝑳𝟐

𝑹𝑺𝑳𝟐

𝑹𝑺𝑳𝟒

𝑹𝑺𝑳𝟑

𝑹𝑨 𝑹𝑩

𝑹𝑶𝟐

𝑹𝑶𝟏

𝑻𝟏 𝑻𝟐

𝑻𝟒

𝑻𝟑

𝑪𝑺𝑳𝟐

𝑻𝟏

𝑻𝟑

𝑻𝟒 𝑻𝟐

𝑹𝑺𝑳𝟏 𝑹𝑺𝑳𝟐

𝑹𝑺𝑳𝟑

𝑹𝑺𝑳𝟒

A B

𝑶𝟐

𝑶𝟏

𝑪𝑺𝑳𝟏

Fig. 3. (a) The device structure of a bit-cell in the proposed SIMC with four access transistors (T1, T2, T3 and T4). Regions, A, B, O1 and O2 areconnected to select line, CSL1 at the bottom (not shown in (a) for better illustration). (b) Electrical model of the bit-cell. The regions A, B, O1 and O2 areassumed as simple MTJs with resistances, RA, RB, RO1 and RO2. Therefore, the proposed bit-cell can be seen as a simple circuit block with four 1-transistor1-resistances. (c) Layout schematic of a bit-cell obtained using the NCSU FreePDK45 [28]. (d) Array-level design of SIMC with 3×3 bit-cells. Row selectlines, RSL1, RSL2 are represented as yellow-colored line and dotted-lines, respectively. Similarly, RSL3, RSL4 are represented as red-colored line anddotted-lines, respectively. The column select lines, CSL1, CSL2 and CSL3 are represented as green, blue and black-colored lines, respectively. (e) Anexample layout of this 3×3 bit-cell array. The rectangular box in (e) marks a single bit-cell.

performed. However, the result of the output of O2 depends onthe output stored inO1, as discussed earlier. Due to destructiveinterference between SWsemitted by O2 and SWs scattered by theskyrmion O1, the result at O2 at the end of the two-phaseoperation is (A +B) · (A ·B) = A ⊕ B. For instance,if no skyrmions are present in both A and B (i.e., A = B =‘0’), magnetization at O1 during first phase and magnetizationat O2 during second phase would only oscillate continuouslyand no skyrmion will form at O2, which is in accordancewith the Eq. 4. If a skyrmion is present at either A or B (i.e.,A ⊕ B = ‘1’), the SWsemitted by O2 and SWs scattered bythe skyrmion interfere constructively at O2. Consequently, askyrmion will form at O2 in agreement with the Eq. 4. Finally,if skyrmions are present at both A and B (i.e., A = B = ‘1’),they would also cause the SWs to interfere constructively atO2. However, the interference between SWs at O2 is not asstrong in the presence of SWs backscattered by the skyrmionthat is nucleated at O1 during the first phase. Consequently, askyrmion will not form in O2 when skyrmions are present inboth A and B, which satisfies the Eq. 4. The XOR operationbetween A and B operation is validated in micromagneticsimulations shown in Fig. 2 (d)-(f). As shown in Fig. 2 (d),when A and B are ‘1’, O2 is ‘0’. If only one of A or B =

‘1’, O2 = ‘1’ (Fig. 2 (e)). Finally, when both A and B are‘0’, O2 is ‘0’ (Fig. 2 (f)).

Note that if any skyrmion in the FL is to be annihilatedin any operation, then all skyrmions in the FL would haveto be annihilated. Skyrmions are then nucleated at the de-sired locations after the annihilation (reset) process. The resetprocess is performed by passing current in the plane of theHM layer. A spin-orbit torque (SOT) induced by the currentflow is exerted on the FL magnetization [29]. If this torque issufficiently strong, skyrmions get pushed off the boundariesof the FL. Simulation results in Fig. 2 (g) show the scenariowhere skyrmions are getting pushed towards the left boundaryof the FL and finally getting annihilated.

III. SKYRMIONIC IN-MEMORY COMPUTING ENGINE(SIMC)

The device structure presented in Section II-B is used toimplement a multi-level bit-cell (see Fig. 3) in our proposedSIMC. Since our proposed device is non-volatile, it may alsobe used as a memory device that stores data at A and B.As discussed in Section II-B, O1 and O2 may be used forperforming logic operations on the stored data.

Page 5: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 5

A. Multi-level bit cell

Each of the four regions of our proposed device is connectedto an access transistor, as Fig. 3 (a) shows. Regions A and Bare connected to column select line, CSL1 at the bottom, andto column select line, CSL2 through transistors, T1 and T2(which are controlled by row select lines, RSL1 and RSL2,respectively). Similarly, regions O1 and O2 are connectedto column select line, CSL1 at the bottom, and to columnselect line, CSL3 through transistors, T3 and T4 (which arecontrolled by row select lines, RSL3 and RSL4, respectively).For better illustration, connections to CSL1 at the bottomare omitted from Fig. 3 (a). However, all the connections areshown in Fig. 3 (b), which represents the circuit model of ourproposed bit-cell.

The write operation to the bit-cell occurs as follows. Whenbit ‘1’ is to be written into A (i.e., nucleate a skyrmion in A),T1 is turned ON to pass current from CSL2 to CSL1 throughA. To write bit ‘0’ into A (i.e., do not nucleate a skyrmion),T1 is turned OFF and no current passes through A. Similarly,to write bit ‘1’ into B, T2 is turned ON and the current ispassed from CSL2 to CSL1 through B. To write bit ‘0’ intoB, T2 is turned OFF and no current is passed through B.

For performing the logic computations on data stored in Aand B, O1 and O2 are utilized as STNOs. To activate theSTNO in O1, T3 is turned ON and the current is passed fromCSL3 to CSL1 through T3. Otherwise, T3 is turned OFF andno current is passed through O1. Similarly, to activate theSTNO in O2, T4 is turned ON and the current is passed fromCSL3 to CSL1 through T4. Otherwise, T4 is turned OFF andno current is passed through O2.

Note that, as mentioned earlier in Section II-B, if any inputbit needs to switch from ‘1’ to ‘0’ or if the logic computationneeds to be re-evaluated, the reset current must first be passedin-plane through the HM to annihilate the skyrmions, followedby nucleation of skyrmions at the desired locations. Thisprocess may require the read operation of the data storedin A and B. The data stored in A and B, and the outputof computation stored in O1 and O2, may be sensed as theresistance of the MTJ at the corresponding regions. A current-sensing scheme with conventional sense amplifiers may beused to perform the read operation [30]–[33]. Note that thevoltages across the MTJs during read operations need to beoptimized to avoid the read-failure or read-disturb errors.

B. Hardware architecture of SIMC

The proposed multi-level bit-cells are arranged into regulararrays to construct the SIMC. The layout for an individualbit-cell is depicted in Fig. 3 (c). The CSL1 and CSL2 arerouted on metal-2 whereas CSL3 is routed on metal-3. Thebit-cells are arranged into a large array and a 3×3 area ofthe bit-cell array is shown in Fig. 3 (d). The light orange-colored region in Fig. 3 (d) depicts the HM layer that is sharedamong the bit-cells whereas the blue-colored square regionsdepict the magnetic device structures. The colored-lines anddotted-lines represent the various control lines to the bit-cells.Note that separate row and column control circuitry drive thecontrol lines as shown in Fig. 3 (d). Furthermore, a constant

current driver supplies the reset current to the HM during resetoperations. The layout of a 3×3 area of the bit cell array isshown in Fig. 3 (e).

IV. HARDWARE ARCHITECTURE OF SIMBA

The SIMC proposed in Section III can be used to implementthe hardware accelerator for binary neural network (BNN)algorithms. We first introduce BNNs and show the operationsthat are needed to be performed. The mapping of theseoperations to dedicated hardware are then discussed beforeshowing how they are put together to give the hardwareimplementation of SIMBA.

A. Binary Neural Networks

BNNs are a class of deep learning algorithms that performinference operations using binarized inputs and weights [23].The binarized pixel data of images are first fed to the binaryconvolution (BinConv) layer in the form of input feature maps(IFs) with size (W ,H ) (see Fig. 4 (a)). In BinConv layer, thedata in IFs is convolved with binary weights stored in filters(KFs) of size (kw , kh). Mathematically,

f (i , j ) = popcount(IF (W ,H ) XNOR KF (kw , kh))

OF (i , j ) = 1; if f (i , j ) ≥ threshold(f )

= 0; if f (i , j ) < threshold(f )

(5)

where (i , j ) are the array indices of output feature map (OF ).When given a string of bits, the popcount operation counts thenumber of ‘1’s in the string. As expressed in Eq. 5, convolutionon binary data is achieved by performing a sequence of bit-wise XNOR, popcount and integer comparator operations.

The dimensions of resultant OFs from the BinConv layerare compressed in the maxpooling (Maxpool) layer. In theMaxpool layer, the maximum pixel value in a selected groupof pixel values is estimated and forwarded to the subsequentlayers. Due to the binarization of the data, the results of theMaxpool layer can be obtained using bit-wise OR operations.As shown in Fig. 4 (a), the compressed OFs are propagatedthrough several BinConv and Maxpool layers. The final OFsare then given as inputs to the fully-connected (FullyConn)layer to obtain the inference/classification results. FullyConnlayer is a feed-forward multi-perceptron network, where theweighted sum of inputs is compared with a threshold ineach perceptron. Due to the binarization of the network,computations in this layer can also be executed using bit-wiseAND operations followed by popcount and comparator .

B. Hardware Implementation of SIMBA

The proposed hardware architecture of SIMBA is shown inFig. 4 (b). It is assumed that training of the BNN is performedoffline and the configuration of the trained BNN then is loadedinto the SIMBA hardware, which performs the test/inferenceoperations. The hardware architecture of SIMBA consists ofhardware dedicated to the execution of operations for theBinConv, Maxpool, and FullyConn layers. Note from Eq. 5that the BNN algorithm requires the computation of bit-wiseXNOR followed by popcount , which counts the number of

Page 6: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 6

kh

kw Pre-processed filters with binary weights

Input image

Bin

ary

Co

nv

olu

tio

n

Ma

x P

oo

lin

g

Fu

lly

Co

nn

ecte

d

La

yer

Dog = 0.65 Cat = 0.2 Monkey = 0.1 Rabbit = 0.05

∙∙∙∙

feature maps with binarized pixel data

Bin

ary

Co

nv

olu

tio

n

Ma

x P

oo

lin

g

a)

b)

Wri

te c

ircu

itry

4000 1 KB SIMC

units

Read Circuitry

Popcount circuitry

Comparator

BinConv layer

Wri

te c

ircu

itry

4000 1 KB SIMC

units

Read Circuitry

Maxpool layer

Write circuitry

Wri

te c

ircu

itry

4000 1 KB SIMC

units

Read Circuitry

Popcount circuitry

Comparator

BinConv layer

Wri

te c

ircu

itry

4000 1 KB SIMC

units

Read Circuitry

Maxpool layer

Write circuitry

Wri

te c

ircu

itry

4000 1 KB SIMC

units

Read Circuitry

Popcount circuitry

Comparator

FullyConn layer

… HM HM 𝑷𝟏

𝑽𝒄𝒐𝒎𝒑

HM

PL Oxide

FL

CL1

HM

𝑽𝒓𝒆𝒂𝒅

𝑰 𝒄𝒐

𝒎𝒑

/𝑰 𝒓

𝒆𝒔

𝒆𝒕

𝑰𝒓𝒆𝒂𝒅

GND

CL2

𝑪𝒐𝒎𝒑𝒂𝒓𝒂𝒕𝒐𝒓 𝒑𝒐𝒑𝒄𝒐𝒖𝒏𝒕 𝒄𝒊𝒓𝒄𝒖𝒊𝒕

𝑻𝒄𝒐𝒎𝒑

𝑻𝒓𝒆𝒂𝒅

c)

𝑷𝟐 𝑷𝒏

IC

Fig. 4. (a) The data flow of BNNs computation when classifying an input image into different categories. A typical BNN consists of different binaryconvolution (BinConv), max pooling (Maxpool) and fully connected (FullyConn) layers. (b) SIMBA, our proposed hardware architecture for accelerating theBNN algorithm. 1 KB SIMC units are used to perform the 2-bit XOR/OR/AND operations. (c) The proposed popcount and comparator circuits utilizingthe spin-orbit torque MTJs.

bits that match each other. The same result may be obtainedby counting the number of bit mismatches and subtractingthe result from the total number of bit comparisons, which isknown a priori. Hence, the result of f(i, j) in Eq. 5 can alsobe computed using XOR operations (which returns whetherthe input bits mismatch) followed by a suitably implementedpopcount circuit. For executing BinConv layers in SIMBA,SIMC units capable of 1 KB data storage perform the XORcomputations, which are read out and passed to the popcountcircuitry.

Since our SIMC units perform the XOR operation, OF (i, j)may be calculated using one of two methods. In the firstmethod, the conventional popcount is used to count thenumber of ‘1’s returned by the XOR operations. OF (i, j) isset to ‘0’ if the result exceeds the threshold and ‘1’ otherwise.In the second method, a popcount circuit that counts the

number of ‘0’s instead of ‘1’ is given the result of the XORoperations. OF (i, j) is set to ‘1’ if the result of exceeds thethreshold and ‘0’ otherwise. In this paper, we implementedthe former method to compute OF (i, j). Fig. 4 (c) shows ourproposed popcount circuit, which consists of ‘n’ spin-orbittorque MTJs (SOT-MTJs), each of which receives as inputthe output of each XOR operation. All the popcount SOT-MTJs are initially set to low resistance state (i.e., state ‘0’)by providing the negative reset voltage (−VP < 0 V) to theHMs of SOT-MTJs. The comparator SOT-MTJ is initializedto the high resistance state (‘1’) by passing current through theHM of comparator SOT-MTJ. When the output of i-th XORoperation is ‘0’ (‘1’), Pi is set as 0 V (+VP > 0 V). Positivevoltage on Pi causes current to flow through the correspondingHM, which switches the resistance state of the correspondingSOT-MTJ from low (i.e., state ‘0’) to high (i.e., state ‘1’).

Page 7: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 7

Simulation/experimental magnetoresistance data

m-dependent 𝑅A,𝑅B, 𝑅O1 and 𝑅O2

Circuit Simulation tool- Cadence Simulation/experimental

data of the CMOS access device

Micromagnetics tool- MuMax3 Magnetic material and

device parameters

Tunneling magnetoresistance model

m-dependent voltage stimuli acting on the access device and the magnetic device.

Time evolution of 𝑚

Electrical behavior of the bit-cell

Layout & DRC analysis using Cadence

System/architecture level performance evaluation using NVSim tool.

Studying magnetization dynamics using micromagnetics tool- MuMax3

Magnetic material and device parameters

Fig. 5. Flow chart of the proposed simulation framework.

After all XOR results have been processed, the transistor,Tcomp, is turned ON to pass positive current from CL1 to theground. The resistance states of popcount MTJs determine themagnitude of Icomp passing through the comparator SOT-MTJ(see Fig. 4 (c)). If the magnitude of Icomp exceeds the thresholdswitching current, the resistance state of the comparator SOT-MTJ changes from high to low (‘1’ to ‘0’). The output resultcan read by turning ON the transistor, Tread (by setting theread voltage Vread) and sensing the current flowing into thecomparator SOT-MTJ through CL2.

For executing the Maxpool layer, as depicted in Fig. 4 (b),the same SIMC unit hardware design may be used. However,the SIMC units for Maxpool layers are configured to performonly OR operations. Furthermore, popcount and comparatorcircuitry are not needed. Also shown in Fig. 4 (b), the hard-ware used for executing the computations in the FullyConnlayer is similar to the hardware for executing the BinConvlayer. However, the SIMC units for the FullyConn layer areconfigured to perform AND operations only.

V. RESULTS AND DISCUSSION

We will now focus on the design and evaluation of thedevices, circuits and hardware architecture that constituteSIMBA. In order to study the interactions between the differentlevels of design abstraction, we developed a hybrid mixed-mode devices-to-hardware architecture simulation framework.We will first present the simulation framework used for thiswork before discussing the simulation results for the devices,circuits and hardware architecture of SIMBA.

A. Modeling and Simulation Framework

Fig. 5 shows the flow-chart of our simulation framework.At the device-level, we used the MuMax3 micromagnetic

simulator [26] to study the magnetization dynamics and theinteractions between skyrmions and STNOs. The magneticdevice and material parameters we assumed in our simulationsare extracted from experimental results [11] and listed inTable I. We have modified the MuMax3 simulator [34] tomodel the current flowing through an MTJ (in Fig. 1 andFig. 2) and the current-induced spin-transfer torque when avoltage is applied across it.

The structure and electrical model of our proposed multi-level bit-cell are shown in Fig. 3 (a)-(b). The resistances, RA,RB, RO1 and RO2 vary with respect to the magnetization stateof the regions A, B, O1 and O2, respectively and can bewritten as

RX = RP + (RAP −RP)

(1−m ·mp

2

)(6)

where, RX ∈ {RA, RB, RO1 and RO2}, m and mp are unitvectors describing the magnetization direction of FL and PL,respectively, in the corresponding regions. RP (RAP) is theresistance of the region when m and mp are parallel (anti-parallel) to each other.

In order to study the magnetization dynamics at the bit-cell level, m-dependent values of RA, RB, RO1 and RO2are first calculated using Eq. 6. The electrical model of ourproposed bit-cell is simulated using SPICE-like [35], [36]circuit simulation tools to extract the electrical stimuli onthe individual resistors and access transistors (Fig. 3 (b)). Inthis work, we have modeled the access transistors using theNCSU FreePDK45 [28]. The circuit simulation results are usedin the MuMax3 micromagnetic simulation to determine themagnetization dynamics. At this point, the characterizationdata about the individual bit-cell is available and fed to thearray/architecture-level simulator. We have used the NVSimsimulation tool [30] to estimate the array/architecture-levelenergy consumption and latency of the proposed SIMC.

Fig. 6. Simulated electrical response at A, B, O1 and O2 of the bit-cellduring write and XOR operations. The output of XOR computation betweenA = 1 and B = 1 is sensed from the current output at O2. Low normalizedcurrent outputs in A, B and O1 corresponds to bit ‘1’, whereas the highnormalized current output in O2 corresponds to bit ‘0’.

Page 8: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 8

TABLE IIELECTRICAL INPUTS FOR THE BIT-CELL DURING ITS DIFFERENT OPERATIONS

CSL1 CSL2 CSL3 RSL1 RSL2 RSL3 RSL4

Write (A) 0 0.81 0 1 0 0 0

Write (B) 0 0.81 0 0 1 0 0

Compute (O1) 0 0 0.79 0 0 1 0

Compute (O2) 0 0 0.79 0 0 0 1

Read 0 0 0.25 0 0 0 1

**All the values are in units of volts (V)

0 1 2 3 4 5 6

0.76

0.78

0.79

0.80

0.81

0.82

Magnetization along x-direction

Time (ns)

Vo

ltag

e (

V)

-0.03

-0.02

-0.01

0

0.01

0.02

0.03

Window-C

Window-B

Window-A

Fig. 7. The dependence of magnetization’s x -component on input voltageswith different pulse widths. The magnetic attenuation (window-A), stableoscillations (window-B) and skyrmion formation (window-C) regimes in thephase plot are labelled accordingly. Negative magnetization x -component inwindow-C denotes the nucleation of a skyrmion.

B. Electrical Behavior of the Bit-cell

Fig. 6 shows the electrical behavior of the bit-cell duringwrite followed by the XOR operation. The electrical stimulilisted in “Write (A)” row of Table II is applied to the bit-cell when bit ‘1’ is being written into A. Consequently, askyrmion is nucleated in A and RA increases, which reducesthe magnitude of current flow in A (see Fig. 6 (a)). Thetotal delay for the write operation is about 2.5 ns (inclusiveof 0.5 ns relaxation delay). Similarly, when bit ‘1’ is beingwritten into B, the electrical stimuli listed in “Write (B)”row of Table II is applied to the bit-cell. Consequently, theskyrmion gets nucleated in B and the magnitude of currentflow in B reduces (see Fig. 6 (b)). The total delay of this writeoperation is also about 2.5 ns (inclusive of 0.5 ns relaxationdelay).

The two-cycle XOR operation is next performed by firstactivating the STNO at O1 to compute A · B and thenactivating the STNO at O2 in the second phase. For activatingthe STNO at O1, the electrical stimuli listed in “Compute(O1)” row of Table II is applied to the bit-cell. As shownin Fig. 6, when both the inputs, A and B, are in state 1, a

TABLE IIIPERFORMANCE EVALUATION OF 1 KB SIMC

Operation Energy Latency

Write 6.721 pJ 2.308 ns Read 1.739 pJ 1.15 ns

OR 15.793 pJ 5.308 ns

AND 8.233 pJ 2.808 ns XOR 24 pJ 8.1160 ns

Reset 0.432 pJ 4 ns

skyrmion is nucleated in O1. As a result, RO1 increases andcurrent flow through O1 is reduced (See Fig. 6 (c)). The totaldelay of this operation is about 3.0 ns (inclusive of 0.5 nsrelaxation delay). In the final compute cycle, the STNO inO2 is activated for 5 ns using the electrical stimuli listed in“Compute (O1)” row of Table II. Despite the voltage stimulus,no skyrmion gets nucleated in O2 due to the presence of askyrmion in O1. Hence, there is no change in RO2 and themagnitude of current flow through O2 (see Fig. 6 (d)). In thiscase, the output at O2 is ‘0’, which corresponds to the resultof the XOR operation between data stored at A and B. Afterthe compute operation has finished, the output at O2 can beread by applying the electrical stimuli listed in “Read” row ofTable II to the bit-cell.

The voltages applied to CSL2 and CSL3 during write andcompute operations, respectively, are determined from themagnetization phase diagram shown in Fig. 7, which wasobtained by varying the voltage magnitude and pulse width.When the input voltage is less than 0.78 V, the magneticoscillations attenuate within the time period of 4 ns (Window-A). If the input voltage is in between 0.788 V and 0.79 V,magnetization oscillates continuously for 6 ns (Window-B).Finally, if the voltage is greater than 0.8 V, magnetic oscilla-tions are amplified and a skyrmion is nucleated (Window-C).Therefore, we apply 0.81 V to CSL2 during write operation,and 0.79 V to CSL3 during compute operation (Table II). Thevoltage used for the read operation (i.e., 0.25 V) is selectedto avoid read-failure and read-disturb errors.

Page 9: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 9

TABLE IVVGG-LIKE BNN MODEL

Description W×H×IFs W×H×OFs

1 BinConv1 32×32×3 32×32×128

2 BinConv2 32×32×128 32×32×128

3 Maxpool1 32×32×128 16×16×128

4 BinConv3 16×16×128 16×16×256 5 BinConv4 16×16×256 16×16×256

6 Maxpool2 16×16×256 8×8×256

7 BinConv5 8×8×256 8×8×512 8 BinConv6 8×8×512 8×8×512

9 Maxpool3 8×8×512 4×4×512

10 FullyConn1 1×1×8912 1×1×1024 11 FullyConn2 1×1×1024 1×1×1024

12 FullyConn3 1×1×1024 1×1×10

1 2 3 4 5 6 7 8 9 1 0 1 1 1 201234567

1 2 3 4 5 6 7 8 9 1 0 1 1 1 20 . 00 . 10 . 20 . 30 . 40 . 50 . 60 . 70 . 8

L a y e r n u m b e r

Exec

ution

Time

(ms)

b )a ) L a y e r n u m b e r

Energ

y Con

sump

tion (

mJ)

Fig. 8. Estimated layer-wise (a) energy consumption and (b) execution timesper inference in SIMBA.

C. Evaluation of SIMBA Hardware Architecture

Based on the device write current requirements andFreePDK45 design rules, the area of each bit-cell in the SIMCis about 900 F2 (Fig. 3 (a)). We note that the bit-cell area islimited by the transistor size due to the high write currentrequirements (≈ 490 µA) of the memory device. Using theextracted electrical characteristics of an individual bit-cell, anSIMC with 1 KB data storage capability was simulated inNVSim to estimate the energy consumption and latency fordifferent operations, which are summarized in Table III. Thesimulation results for the 1 KB SIMC units are used to evaluatethe overall SIMBA hardware architecture.

In addition, the electrical stimuli applied to the popcountand comparator circuits proposed in Fig. 4 (c) are listed inTable. V. Since the switching characteristics of SOT-MTJs arewell known in literature [37]–[39], the simulation results ofpopcount and comparator circuits have been omitted. Theenergy and latency consumed during these operations aredetermined to be 7.5 pJ and 10 ns, respectively. A popcountcircuit processing 100 bits at a time is used for estimatingthese energy and latency values.

In this work, we designed SIMBA to accelerate the compu-tations for a VGG-like BNN with 12 layers (see Table IV)[23], [40]. The VGG-like BNN model performs the clas-sification task on CIFAR-10 image data with an accuracyof 88.5%. Each BinConv layer in the network performs

i) kw × kh × OF × W × H × IF number of XORoperations, ii) W × H × OF number of popcount oper-ations (each on kw × kh × IF number of bits), and iii)W × H × OF number of comparator operations. EachMaxpool layer performs 3 × W × H × OF number of2-bit OR operations, where the maxpooling filter size is 2×2.Each FullyConn layer performs i) IF × OF number ofAND operations, ii) popcount operation on IF number ofbits, and iii) OF number of comparator operations. Usingthe results in Table III and results from circuit and devicesimulations, the estimated energy consumption and time delayin each BNN layer are shown in Fig. 8, respectively. Over all,SIMBA is estimated to require 26.7 mJ of energy and 2.7 mstime delay to perform one classification task, which translatesto throughput of 370.4 (images/sec).

1) Analysis of Device Characteristics on SIMBA Perfor-mance

We have also explored the impact of material properties(such as damping ratio, α, anisotropic energy density, KU, andsaturation magnetization, MS) on the performance of proposedhardware. We found that a 33.3% of reduction in α improvesthe energy efficiency by 47% with 1.3× speedup (Fig. 9 (b-c)). Reduction in α decreases the current required to drivemagnetization oscillations and improves energy efficiency andexecution times. Separately, over 37.5% reduction of KUimproves the energy efficiency by 36.84% with 1.18× speedup(Fig. 9 (d-e)). Similar to the reduction in α, the currentrequired to induce magnetization oscillations or switchingreduces with reduced KU. When only MS is increased by10%, the energy consumption is reduced by 23.68% withspeedup of 1.13× (Fig. 9 (f-g)). This is due to increase in thedemagnetizing field as MS increases. The demagnetizing fieldtends to reduce the current needed to induce magnetizationoscillations or switching.

Since thermal effects can affect the magnetization dynamicsof the spintronic devices and introduce errors in SIMBA, thereis a need to study the influence of hardware errors on theoverall classification accuracy. We induced errors in 1%-30%of the pixels in each BNN layer and observed the overallclassification accuracy. Note that the magnitude of error ineach pixel is randomly varied from ±1 to ±30. As shownin Fig. 9 (a), the classification accuracy degrades by only 1%despite the presence of 30% errors in each BinConv layer. Theworst classification error rate we observed is 12.4% whereasthe error rate without thermal effects is around 11.5%.

VI. CONCLUSION

In summary, we propose a binary neural network (BNN)accelerator, referred to as SIMBA, based on a novel non-volatile skyrmionic in-memory processing engine. A hybridmixed-mode device-to-architecture simulation framework wasdeveloped to design and evaluate SIMBA. Simulation resultsshow that SIMBA consumes 26.7 mJ of energy and 2.7 ms ofdelay per inference task on a VGG-like BNN. The inferenceerror rate is less than 1% even in the presence of 30% bit errorsin the skyrmionic devices. Finally, we showed improvementsin the performance of SIMBA that can be achieved by tuning

Page 10: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 10

(b)

(c)

(d)

(e)

(f)

(g)

(a) 0 5 10 15 20 25 3010

11

12

Device/Circuit Error Robustness

Magnitude of error added in each pixel

±1 ±5 ±10 ±15

±20 ±25 ±30Cla

ssif

icati

on

err

or

(%)

% errors induced in each BinConv layer

Fig. 9. (a) Variations in error rate per inference induced by the stochastic behavior of devices/circuits used in the hardware. (b-g) The influence of materialproperties on the performance of SIMBA (energy consumption and execution time). (b-c) damping ratio, α, (d-e) anisotropic energy density, KU, and (f-g)saturation magnetization, MS.

TABLE VELECTRICAL INPUTS FOR THE popcount AND comparator CIRCUITS

IC Pi Vcomp CL2 Vread

Reset-popcount 0 -0.01 V 0 0 0

Reset-comparator -520 µA 0 1 V 0 0

Write-popcount 0 0 V (bit ‘0’)

0.01 V (bit ‘1’)

0 0 0

Write-comparator 380 µA 0 1 V 0 0

Read-comparator 0 0 0 0.25 V 1 V

device material parameters such as damping ratio, anisotropicenergy, and saturation magnetization.

ACKNOWLEDGMENT

This work was supported in part by Singapore Ministryof Education (MOE) Grant MOE 2017-T2-1-114, in part by

Page 11: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 11

MOE Academic Research Fund Tier 1, in part by the Agencyfor Science, Technology and Research (A*STAR) SpOT-LITEProgram, in part by the NUS Start-up Grant, and in part byCompetitive Research Programme (CRP) under Grant NRF-CRP12-2013-01.

REFERENCES

[1] A. Fert, N. Reyren, and V. Cros, “Magnetic skyrmions: advances inphysics and potential applications,” Nature Review Materials, vol. 2,p. 17031, 2017. [Online]. Available: https://www.nature.com/articles/natrevmats201731

[2] X. Chen, H. Zhang, E. Deng, M. Yang, N. Lei, Y. Zhang, W. Kang,and W. Zhao, “Sky-ram: Skyrmionic random access memory,” IEEEElectron Device Letters, vol. 40, no. 5, pp. 722–725, 2019. [Online].Available: https://ieeexplore.ieee.org/document/8667872

[3] M.-C. Chen, A. Ranjan, A. Raghunathan, and K. Roy, “Cachememory design with magnetic skyrmions in a long nanotrack,”IEEE Transactions on Magnetics, vol. 55, no. 8, p. 1500309,2019. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8698445&isnumber=8766151

[4] P. Lai, G. P. Zhao, H. Tang, N. Ran, S. Q. Wu, J. Xia, X. Zhang, andY. Zhou, “An improved racetrack structure for transporting a skyrmion,”Scientific Reports, vol. 7, no. 45330, pp. 1–8, 2017. [Online]. Available:https://www.nature.com/articles/srep45330

[5] T. Chen, R. K. Dumas, A. Eklund, P. K. M. A. Houshang, A. A. Awad,P. Drrenfeld, B. G. Malm, A. Rusu, and J. Akerman, “Skyrmionsin magnetic multilayers,” Physics Reports, vol. 704, pp. 1–49, 2017.[Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0370157317302934

[6] R. Tomasello, E. Martinez, R. Zivieri, L. Torres, M. Carpentieri,and G. Finocchio, “A strategy for the design of skyrmion racetrackmemories,” Scientific Reports, vol. 4, no. 6784, pp. 1–7, 2014. [Online].Available: https://www.nature.com/articles/srep06784

[7] X. S. Wang, H. Y. Yuan, and X. R. Wang, “A theory on skyrmionsize,” Communication Physics, vol. 1, p. 31, 2018. [Online]. Available:https://www.nature.com/articles/s42005-018-0029-0

[8] S. Meyer, M. Perini, S. von Malottki, A. Kubetzka, R. Wiesendanger,K. von Bergmann, and S. Heinze, “Isolated zero field sub-10nm skyrmions in ultrathin co films,” Nature Communications,vol. 10, no. 3823, p. 3823, 2019. [Online]. Available: https://www.nature.com/articles/s41467-019-11831-4

[9] N. Romming, A. Kubetzka, C. Hanneken, K. von Bergmann,and R. Wiesendanger, “Field-dependent size and shape of singlemagnetic skyrmions,” Physical Review Letters, vol. 114, p. 177203,2015. [Online]. Available: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.177203

[10] N. Romming, C. Hanneken, M. Menzel, J. E. Bickel, B. Wolter, K. vonBergmann, A. Kubetzka, and R. Wiesendanger, “Writing and deletingsingle magnetic skyrmions,” Science, vol. 341, no. 6146, pp. 636–639,2013. [Online]. Available: https://science.sciencemag.org/content/341/6146/636

[11] J. Sampaio, V. Cros, S. Rohart, A. Thiaville, and A. Fert, “Nucleation,stability and current-induced motion of isolated magnetic skyrmions innanostructures,” Nature Nanotechnology, vol. 8, no. 11, pp. 839–844,2013. [Online]. Available: https://www.nature.com/articles/nnano.2013.210

[12] N. Nagaosa and Y. Tokura, “Topological properties and dynamicsof magnetic skyrmions,” Nature Nanotechnology, vol. 8, no. 12, pp.899–911, 2013. [Online]. Available: https://www.nature.com/articles/nnano.2013.243

[13] S. S. P. Parkin, M. Hayashi, and L. Thomas, “Magnetic domain-wall racetrack memory,” Science, vol. 320, no. 5873, pp. 190–194,2008. [Online]. Available: https://science.sciencemag.org/content/320/5873/190

[14] S. S. P. Parkin and S.-H. Yang, “Memory on the racetrack,” NatureNanotechnology, vol. 10, pp. 195–198, 2015. [Online]. Available:https://www.nature.com/articles/nnano.2015.41

[15] T. Chen, R. K. Dumas, A. Eklund, P. K. M. A. Houshang, A. A.Awad, P. Drrenfeld, B. G. Malm, A. Rusu, and J. Akerman,“Spin-torque and spin-hall nano-oscillators,” Proceedings of theIEEE, vol. 104, no. 10, pp. 1919–1945, 2016. [Online]. Available:https://ieeexplore.ieee.org/document/7505988/

[16] V. P. K. Miriyala, Z. Zhifeng, G. Liang, and X. Fong, “Spin-wavemediated interactions for majority computation using skyrmions andspin-torque nano-oscillators,” Journal of Magnetism and MagneticMaterials, vol. 486, p. 165271, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0304885319306675

[17] M. G. Mankalale, Z. Zhao, J.-P. Wang, and S. S. Sapatnekar, “Skylogicaproposal for a skyrmion-based logic device,” IEEE Transactions onElectron Devices, vol. 66, pp. 1990–1996, 2019. [Online]. Available:https://ieeexplore.ieee.org/abstract/document/8660572

[18] S. Luo, M. Song, X. Li, Y. Zhang, J. Hong, X. Yang, X. Zou, N. Xu, andL. You, “Reconfigurable skyrmion logic gates,” Nano Letters, vol. 18,pp. 1180–1184, 2018.

[19] X. Zhang, M. Ezawa, and Y. Zhou, “Magnetic skyrmion logic gates:conversion, duplication and merging of skyrmions,” Scientific Reports,vol. 5, no. 9400, 2015. [Online]. Available: https://www.nature.com/articles/srep09400

[20] L. Chang, X. Ma, Z. Wang, Y. Zhang, Y. Xie, and W. Zhao, “Pxnor-bnn: In/with spin-orbit torque mram preset-xnor operation-based binaryneural networks,” IEEE Transactions on Very Large Scale Integration(VLSI) Systems, vol. 27, no. 11, pp. 2668–2679, Nov 2019. [Online].Available: https://ieeexplore.ieee.org/document/8768197

[21] D. Fan and S. Angizi, “Energy efficient in-memory binary deepneural network accelerator with dual-mode sot-mram,” in 2017 IEEEInternational Conference on Computer Design (ICCD), Nov 2017,pp. 609–612. [Online]. Available: https://ieeexplore.ieee.org/document/8119280

[22] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang,and Y. Xie, “Prime: A novel processing-in-memory architecturefor neural network computation in reram-based main memory,” in2016 ACM/IEEE 43rd Annual International Symposium on ComputerArchitecture (ISCA), June 2016, pp. 27–39. [Online]. Available:https://ieeexplore.ieee.org/document/7551380/

[23] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,“Binarynet: training deep neural networks with weights and activationsconstrained to +1 or -1,” arXiv preprint arXiv:1602.02830v3, 2016.[Online]. Available: https://arxiv.org/abs/1602.02830v3

[24] S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma, H. D. Gan, M. Endo,S. Kanai, J. Hayakawa, F. Matsukura, and H. Ohno, “A perpendicular-anisotropy cofebmgo magnetic tunnel junction,” Nature Materials,vol. 9, pp. 721–7224, 2010.

[25] V. P. K. Miriyala, X. Fong, and G. Liang, “Influence of size and shapeon the performance of vcma-based mtjs,” IEEE Transactions on ElectronDevices, vol. 66, no. 2, pp. 944–949, 2019.

[26] A. Vansteenkiste, J. Leliaert, M. Dvornik, M. Helsen, F. Garcia-Sanchez,and B. V. Waeyenberge, “The design and verification of mumax3,” AIPAdvances, vol. 4, no. 10, p. 107133, 2014.

[27] J. Xiao, A. Zangwill, and M. D. Stiles, “Boltzmann test of slonczewski’stheory of spin-transfer torque,” Physical Review B, vol. 70, no. 17, p.172405, 2004.

[28] E. NCSU, http://www.eda.ncsu.edu/wiki/FreePDK45:Contents, 2011.[29] L. Liu, Chi-Feng, Y. Li, H. W. Tseng, D. C. Ralph, and R. A. Buhrman,

“Spin-torque switching with the giant spin hall effect of tantalum,”Science, vol. 336, pp. 555–558, 2012.

[30] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatilememory,” IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems, vol. 31, pp. 994–1007, 2012. [Online]. Available:https://ieeexplore.ieee.org/document/6218223

[31] C. Kim, K. Kwon, C. Park, S. Jang, and J. Choi, “7.4 a covalent-bondedcross-coupled current-mode sense amplifier for stt-mram with 1t1mtjcommon source-line structure array,” in 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, Feb2015, pp. 1–3.

[32] J. Kim, K. Ryu, J. P. Kim, S. H. Kang, and S.-O. Jung, “Stt-mramsensing circuit with self-body biasing in deep submicron technologies,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems,vol. 22, pp. 1630–1634, 2014.

[33] Q.-K. Trinh, S. Ruocco, and M. Alioto, “Time-based sensing forreference-less and robust read in stt-mram memories,” IEEE Transac-tions on Circuits and Systems I: Regular Papers, vol. 65, pp. 3338–3348,2018.

[34] X. Fong, https://github.com/xfong/3/tree/seeder, 2019.[35] HSPICE, https://www.synopsys.com/verification/ams-

verification/hspice.html, 2019.[36] Cadence, https://www.cadence.com/, 2019.[37] K. Garello, F. Yasin, S. Couet, L. Souriau, J. Swerts, S. Rao, S. V.

Beek, W. Kim, E. Liu, S. Kundu, D. Tsvetanova, K. Croes, N. Jossart,

Page 12: SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator · 2020-03-12 · JOURNAL OF LATEX CLASS FILES 1 SIMBA: A Skyrmionic In-Memory Binary Neural Network Accelerator Venkata

JOURNAL OF LATEX CLASS FILES 12

E. Grimaldi, M. Baumgartner, D. Crotti, A. Fummont, P. Gambardella,and G. S. Kar, “Sot-mram 300mm integration for low power and ultrafastembedded memories,” in 2018 IEEE Symposium on VLSI Circuits, June2018, pp. 81–82.

[38] X. Ma, L. Chang, S. Li, L. Deng, Y. Ding, and Y. Xie,“In-memory multiplication engine with sot-mram based stochasticcomputings,” arXiv preprint arXiv:1809.08358v1, 2018. [Online].Available: https://arxiv.org/abs/1809.08358v1

[39] F. Parveen, S. Angizi, Z. He, and D. Fan, “Low power in-memory com-puting based on dual-mode sot-mram,” in 2017 IEEE/ACM InternationalSymposium on Low Power Electronics and Design (ISLPED), July 2017,pp. 1–6.

[40] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, “Fbna: Afully binarized neural network accelerator,” in 2018 28th InternationalConference on Field Programmable Logic and Applications (FPL), Aug2018, pp. 51–513.

Venkata Pavan Kumar Miriyala received theB.Tech degree in electronics and communicationengineering from SSN Engineering College, India in2014. He previously held positions at IBM-ResearchTokyo, Japan, and Indian Institute of Technology(IIT), Hyderabad. He is currently working towardsa Ph.D. at the National University of Singapore.

Kale Rahul Vishwanath received the M.Sc. Degreein Electrical Engineering from National Universityof Singapore and B.Tech. degree in ElectronicsEngineering from VJTI, Mumbai, India, in 2016and 2014 respectively. He is currently working asa Research Engineer at the National University ofSingapore.

Xuanyao Fong (SM′06, M′14) received the Ph.D.and B.Sc. degrees in electrical engineering fromPurdue University, West Lafayette, IN, in 2014 and2006, respectively. He previously held positions atAdvanced Micro Devices, Inc., Boston Design Cen-ter, Boxboro, MA, was a Research Assistant andPostdoctoral Research Assistant in Purdue Univer-sity, and a Research Scientist in the Institute ofMicroelectronics, A*STAR, Singapore. Dr. Fong iscurrently an Assistant Professor in the Departmentof Electrical and Computer Engineering, National

University of Singapore.