Training of a Spiking Neural Network on spintronics based analog hardware for handwritten digit recognition

Abstract—Spiking Neural Networks (SNNs) have been shown to consume very low power for inference, i.e., forward computation on test data with a pre-trained model. However, training such SNNs has remained a challenge. In this paper, we use Spike Time Dependent Plasticity (STDP) based learning to train an SNN through implementation on analog hardware. We design and simulate the synapses and neurons in the hardware using a combination of ferromagnetic metal/heavy metal heterostructure based spintronic devices and transistor based electronic circuits. We train the SNN on the MNIST dataset of handwritten digits. The architecture we use has fewer network layers than those used previously for MNIST classification. We use two different modes for training/learning, both STDP enabled but with different levels of control over the spiking of the output stage neurons (post-neurons): completely unsupervised learning and partially supervised learning. Finally, we report the time taken and the total energy consumed in our designed synapse circuits to enable the learning.


I. INTRODUCTION
A. Motivations for training Spiking Neural Network (SNN) on analog hardware

The Spiking Neural Network (SNN) algorithm is considered to be of special interest [1]-[5] among the different Neural Network (NN) algorithms proposed for data classification. An SNN processes information in the form of spikes, much like the brain [6]-[9]. As a result of this use of spikes, SNNs have been shown to consume very low power for inference, i.e., forward computation of a pre-trained model with already tuned weights on test data [10]-[12]. This has led to the interest in SNNs. However, training the SNN has remained a challenge [10].
One possible way of training an SNN is to train a non-spiking NN first and then convert it to an SNN, but this method has several limitations [10], [13], [14]. The alternative is to train the SNN itself. SNNs have already been trained on CPUs and GPUs by simulating biologically inspired processes, like the Leaky Integrate Fire (LIF) property of neurons and the Spike Time Dependent Plasticity (STDP) property of synapses, on the CPU or GPU. SNNs have also been trained on customized digital neuromorphic chips [15]-[17]. For this purpose, the time dynamics of the LIF and STDP processes need to be solved on the CPU, GPU or customized neuromorphic chip. This is highly time and energy consuming because such computing units are fundamentally clock driven: a process like LIF or STDP has to be split into a multitude of time steps and solved at each time step on such a clock driven computing unit [10]. A more efficient way to train the SNN is hence on analog hardware built with emerging devices, where the physics of the devices mimics the time dynamics of the synapses (STDP) or the neurons (LIF), so that these time dynamics are computed automatically. Spintronic devices, coupled with transistor based electronic circuits, form one such class of emerging devices [18]-[24].
Another motivation to train SNNs on analog hardware designed with emerging devices is the in-memory computing architecture it offers. The CPU follows the Von Neumann architecture and hence has memory and computing separated from each other, so a lot of time and energy is consumed in shuffling data between the memory and the computing units. This is known as the Von Neumann bottleneck. It becomes particularly critical when training NNs because of the frequent need to tune the weight parameters [10], [25]. Though the GPU and the specialized neuromorphic chips [15]-[17] have a multitude of cores, memory and computing units are still physically separated inside each core, so the Von Neumann bottleneck is not completely eliminated in them either. However, analog hardware designed with emerging non-volatile devices, e.g., Resistive Random Access Memory (RRAM), Phase Change Memory (PCM) and spintronic devices, mostly follows a crossbar architecture that completely intertwines memory and computing [25]-[35], eliminating the Von Neumann bottleneck altogether. Since the SNN here is trained using the STDP property of synapses, as opposed to optimization of a global loss function as in non-spiking NNs [36], the training is very local in nature. Thus it is particularly suitable for implementation on such analog in-memory computing hardware [11].

B. Motivation for choosing spintronic hardware SNN
In this paper, we use heavy metal/ferromagnetic metal heterostructure based spin orbit torque driven domain wall devices, coupled with transistor based electronic circuits, both as synapses and neurons in our designed analog hardware SNN. We choose such spintronic devices, as opposed to other emerging devices [25]-[27], [29]-[33], for the following two reasons.
1. The same kind of spintronic device can enable both neuron and synapse functionality [18], [21], [24]; only the nature of the transistor based circuit connected to it changes between the two cases.
2. Such a domain wall based spintronic device has a linear and symmetric synaptic characteristic, unlike RRAM and PCM devices. This makes it easier to achieve learning in the spintronic hardware SNN, since learning involves frequent positive and negative weight updates at the synapses [25], [37]-[40].

C. Work done in this paper
Through a combination of micromagnetic modeling of the spintronic devices in the micromagnetic package "mumax3", simulation of the transistor circuits on the Cadence Virtuoso SPICE circuit simulator, and system level simulation of the SNN in a high level programming language, we demonstrate training of the designed SNN on the popular MNIST dataset of handwritten digits.
The novelties of our work compared to previous works are as follows: 1. Training on the MNIST dataset with spintronic hardware has been reported earlier [21]. The algorithm used in [21] is based on that proposed in [6] and employs an SNN with three layers. On the other hand, the SNN we use here has only two layers: a layer of pre-neurons and a layer of post-neurons (Fig. 1(a),(b)). This makes hardware implementation simpler in our case.
2. Only a completely unsupervised learning method is used in [6], [21], with the synapses exhibiting the biologically inspired STDP property and the neurons exhibiting the biologically inspired homoeostasis property. In our implementation, we train the SNN in two modes. One mode is the STDP and homoeostasis enabled completely unsupervised mode used in [6], [21] (Fig. 1(a)). The other mode is a partially supervised mode. In this mode, the synapses still update their weights via STDP, contributing the unsupervised component of the learning. The supervised component comes from applying bias currents to the output stage neurons (post-neurons) only during the learning phase, to control them and make them fire selectively (Fig. 1(b)) [20], [24], [41], [42]. While SNNs have been trained on simple datasets like Fisher's Iris dataset and the Wisconsin Breast Cancer dataset in the partially supervised mode in previous reports [20], [24], [41], [42], training on a more complex dataset with more input features and more output classes, like the MNIST dataset, has not been demonstrated before in this mode, whether with spintronic hardware or otherwise, to the best of our knowledge.
3. Earlier works [24], [43] report only the energy consumed in the spintronic devices themselves for the STDP enabled learning. In this paper, we report the net energy consumed in the peripheral transistor circuits as well. Some other earlier works [11], [12] on spintronic implementations of SNN report energy consumption only for inference.
The paper is organized as follows. In Section II we discuss our SNN training algorithm, the layers used in the SNN and the two modes for training/learning (Fig. 1). We also report the classification accuracy results obtained on the MNIST dataset with our algorithm (Fig. 2, Fig. 3, Fig. 4). In Section III we discuss our simulation of the spintronic devices (Fig. 5, Fig. 6) and the associated transistor based electronic circuits (Fig. 7), which are used to obtain the STDP property of synapses (Fig. 8) and the LIF property of neurons. In Section IV we show our calculation of the total time taken and the net energy consumed in all the synapse circuits to achieve the learning (Table I).

II. SNN TRAINING ALGORITHM AND CLASSIFICATION RESULTS

A. Description of the algorithm
The SNN designed by us in this paper has a layer of pre-neurons, to which the input is applied, and a layer of post-neurons, whose spiking pattern determines the output label that the SNN generates for the input (Fig. 1). Since we classify the MNIST dataset of handwritten digits here, $N_1 = 784$ pre-neurons are used, corresponding to the 784 pixels in each input image of the dataset. $N_2 = 400$ post-neurons are used, with several neurons designed to spike corresponding to the different variations with which the same digit is written, using the homoeostasis property as discussed in [6], [21]. However, as mentioned in Section I, [6] and [21] use a separate layer of excitatory neurons and a separate layer of inhibitory neurons at the output stage, unlike here, where we use only a single layer of post-neurons [24], [41], [42]. Thus our network is simpler in design than that in [6], [21].
Every pre-neuron and post-neuron in our designed SNN follows the biologically motivated Leaky Integrate Fire (LIF) model [6], [8], [24]. Following the LIF model, the membrane potential $v(t)$ of each neuron is governed by the following equation:

$$C\,\frac{dv(t)}{dt} = I(t) - G_L\left(v(t) - E_L\right) \qquad (1)$$

where $I(t)$ is the input current to the neuron, $G_L$ is the membrane conductance, $E_L$ is the resting potential of the neuron and $C$ is the membrane capacitance. Once $v(t)$ reaches the threshold potential ($V_{th}$), the neuron generates a spike and $v(t)$ drops to $E_L$. For our designed SNN, we choose $G_L = 30$ nS, $E_L = -70$ mV and $C = 500$ fF for the LIF model of both the pre-neurons and the post-neurons. For the pre-neurons, $V_{th}$ is fixed at 20 mV. For the post-neurons, $V_{th}$ is a function of time, following the biologically plausible homoeostasis property of neurons [6], [9]. Every time the neuron spikes (at $t_{spike}$), $V_{th}$ increases by a fixed amount $\Delta V_{th} = 7$ mV and then decays with a homoeostasis time constant $\tau_{homeo} = 15$ µs as follows:

$$V_{th}(t) = V_{th}^{0} + \left(V_{th}(t_{spike}) + \Delta V_{th} - V_{th}^{0}\right)e^{-(t - t_{spike})/\tau_{homeo}} \qquad (2)$$

for time $t$ greater than $t_{spike}$, until the occurrence of the next spike in the neuron, where $V_{th}^{0}$ is the baseline threshold value.
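For concreteness, the following minimal Python sketch integrates equations (1) and (2) with a forward-Euler scheme; the integration time step dt and the baseline threshold v_th_rest are our own illustrative assumptions, not values from our hardware simulations:

```python
import numpy as np

# Minimal sketch of the LIF neuron with homeostatic threshold,
# equations (1) and (2). Parameter values are taken from the text;
# the forward-Euler time step dt is an illustrative assumption.
G_L = 30e-9        # membrane conductance (S)
E_L = -70e-3       # resting potential (V)
C = 500e-15        # membrane capacitance (F)
DV_TH = 7e-3       # threshold increment per spike (V)
TAU_HOMEO = 15e-6  # homeostasis time constant (s)
dt = 1e-8          # integration time step (s), assumed

def lif_step(v, v_th, i_in, v_th_rest):
    """Advance membrane potential and threshold by one Euler step.

    Returns (v, v_th, spiked). On a spike, v resets to E_L and the
    threshold jumps by DV_TH (homeostasis), after which it relaxes
    back toward v_th_rest with time constant TAU_HOMEO.
    """
    v = v + dt * (i_in - G_L * (v - E_L)) / C                        # equation (1)
    v_th = v_th_rest + (v_th - v_th_rest) * np.exp(-dt / TAU_HOMEO)  # equation (2)
    spiked = v >= v_th
    if spiked:
        v = E_L        # reset after spike
        v_th += DV_TH  # homeostatic threshold increase
    return v, v_th, spiked
```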
Each pre-neuron is connected to each post-neuron in our designed SNN through a synapse, as shown in Fig. 1(a),(b). When the pre-neuron numbered $i$ fires at time $t_{spike,i}$, the post-neuron numbered $j$ receives an input current from pre-neuron $i$, given by:

$$I_{j,i}(t) = w_{j,i}\left(e^{-(t - t_{spike,i})/\tau_M} - e^{-(t - t_{spike,i})/\tau_S}\right) \qquad (3)$$

where $\tau_M = 10$ µs and $\tau_S = 2.5$ µs are two time constants [41], [42], and $w_{j,i}$ is the weight stored in the synapse. The weight is updated following the biologically plausible Spike Time Dependent Plasticity (STDP) rule:

$$\Delta w_{j,i} = \begin{cases} \Gamma_1\, e^{-(t_{spike,j} - t_{spike,i})/\tau_1} & \text{if } t_{spike,j} > t_{spike,i} \\ -\,\Gamma_2\, e^{-(t_{spike,i} - t_{spike,j})/\tau_2} & \text{if } t_{spike,j} < t_{spike,i} \end{cases} \qquad (4)$$

where $\Gamma_1 = 9$, $\Gamma_2 = 15$, $\tau_1 = 10$ µs, $\tau_2 = 20$ µs and $\mu = 1.7$ are constants related to STDP [41], [42].
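A minimal software sketch of equations (3) and (4) is given below; the clipping of the weight to the range [0, 900] (the weight range mentioned in Section II-B) and the omission of the weight-dependence constant µ are our own simplifications for illustration:

```python
import numpy as np

# Minimal sketch of the synaptic current kernel of equation (3) and
# the exponential STDP update of equation (4). Function names and the
# weight clipping are illustrative choices; mu is omitted here.
TAU_M, TAU_S = 10e-6, 2.5e-6  # current kernel time constants (s)
GAMMA_1, GAMMA_2 = 9.0, 15.0  # STDP amplitudes
TAU_1, TAU_2 = 10e-6, 20e-6   # STDP time constants (s)
W_MAX = 900.0                 # maximum synaptic weight (Section II-B)

def synaptic_current(w, t, t_spike_pre):
    """Current injected into post-neuron j by pre-neuron i, equation (3)."""
    dt = t - t_spike_pre
    if dt < 0:
        return 0.0
    return w * (np.exp(-dt / TAU_M) - np.exp(-dt / TAU_S))

def stdp_update(w, t_spike_post, t_spike_pre):
    """Weight change for one (pre, post) spike pair, equation (4)."""
    if t_spike_post > t_spike_pre:  # potentiation
        dw = GAMMA_1 * np.exp(-(t_spike_post - t_spike_pre) / TAU_1)
    else:                           # depression
        dw = -GAMMA_2 * np.exp(-(t_spike_pre - t_spike_post) / TAU_2)
    return np.clip(w + dw, 0.0, W_MAX)
```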
The increase in weight that happens when post-neuron $j$ spikes after pre-neuron $i$ ($t_{spike,j} > t_{spike,i}$) is known as "synaptic potentiation", while the decrease in weight that happens when post-neuron $j$ spikes before pre-neuron $i$ ($t_{spike,j} < t_{spike,i}$) is known as "synaptic depression". Since the weight update, and hence the learning, happens in the synapse based on the spiking pattern of pre-neuron $i$ and post-neuron $j$, the learning is local in nature. It is also largely unsupervised, though some amount of supervision may be provided by controlling the spiking of the post-neurons [41], [42]. Thus, as mentioned in Section I, we use two modes of learning/training for our designed SNN in this paper: completely unsupervised learning and partially supervised learning. In either case, each post-neuron exhibits the homoeostasis property as described above. Also, in either case, the post-neurons are connected inhibitorily with each other to implement the "Winner Take All" (WTA) mechanism, as shown in Fig. 1(a),(b) [6], [41], [42].
However, in the partially supervised learning mode, during the training phase, when the input belongs to a particular digit/class, no inhibitory current is applied on the 40 post-neurons allotted for that digit, but inhibitory current is applied on the other post-neurons, so that only some specific neurons out of those 40 post-neurons fire (Fig. 1(b)) [24], [41], [42]. Thus knowledge of the digit/class each input belongs to is used during training, which makes the learning partially supervised. In the completely unsupervised learning mode, no such inhibitory current is applied on the post-neurons at any stage (Fig. 1(a)). Knowledge of which digit/class each input belongs to is not needed at the training/learning stage, but only while determining the classification accuracy after training. To determine the classification accuracy, all the inputs corresponding to the different digits are applied to the SNN. For each post-neuron, it is noted how many times the neuron fires for the inputs corresponding to each digit/class. The neuron is assigned the label of the digit/class for which it fires the most. Then, for each input, if the label of the post-neuron that fires the most matches the digit/class the input belongs to, it is counted as a success and increases the classification accuracy [6], [24].
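The post-neuron labeling and accuracy evaluation described above can be summarized in a short sketch; the array shapes and function names below are our own illustrative assumptions:

```python
import numpy as np

# Minimal sketch of post-neuron labeling and accuracy evaluation.
# spike_counts[n, c] is assumed to hold how many times post-neuron n
# fired across all training inputs of digit class c.
def assign_labels(spike_counts):
    """Give each post-neuron the label of the class it fires most for."""
    return np.argmax(spike_counts, axis=1)  # shape: (N2,)

def classify(per_input_spikes, labels):
    """Predict via the label of the post-neuron that fired the most."""
    return labels[np.argmax(per_input_spikes)]

def accuracy(all_input_spikes, true_classes, labels):
    """Fraction of inputs whose predicted label matches the true class."""
    preds = [classify(s, labels) for s in all_input_spikes]
    return np.mean(np.array(preds) == np.array(true_classes))
```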

B. Accuracy results and related discussion
Using 1000 images from the MNIST dataset for training and 100 images for testing, we obtain fairly high classification accuracy on both the train and test datasets, for both completely unsupervised and partially supervised learning, as shown in the accuracy versus epoch plots of Fig. 2(a) and (b) respectively. Reasonably high accuracy is obtained after only 1 or 2 epochs, and the accuracy metric does not improve after that. This is because, unlike in a non-spiking NN, no global loss function is optimized here to train the weights [36]; instead, a very local spiking based feedback method is used. Hence a gradual increase in accuracy with epoch is not expected here, as opposed to the case with loss function based non-spiking NN algorithms [36].
Rather, the training accuracy decreases with the number of epochs in the case of completely unsupervised learning (Fig. 2(a)). We can explain this result by studying the evolution of the synaptic weights in the SNN with epochs. Fig. 3(a) shows the weights for completely unsupervised learning after the 1st epoch (train accuracy = 86%) and Fig. 3(b) shows the weights for the same learning after the 5th epoch (train accuracy = 82%). In each case, the weights of all the synapses connecting the 784 pre-neurons to a post-neuron are plotted as a 28×28 pixel grayscale image. The images corresponding to all 400 post-neurons are then plotted together. The intensity of each pixel is proportional to the value of the corresponding weight (see the vertical bar next to the image). The minimum allowed weight value is 0 and the maximum is 900.
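This visualization can be reproduced with a few lines; the grid layout helper below is a minimal sketch, with the variable names and the use of matplotlib being our own choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal sketch of the weight visualization of Fig. 3: the 784 synaptic
# weights feeding each post-neuron are reshaped into a 28x28 grayscale
# tile, and the 400 tiles are arranged on a 20x20 grid.
def plot_weights(W, n_grid=20, w_max=900.0):
    """W has shape (400, 784): one row of synaptic weights per post-neuron."""
    canvas = np.zeros((n_grid * 28, n_grid * 28))
    for n, row in enumerate(W):
        r, c = divmod(n, n_grid)
        canvas[r*28:(r+1)*28, c*28:(c+1)*28] = row.reshape(28, 28)
    plt.imshow(canvas, cmap="gray", vmin=0.0, vmax=w_max)
    plt.colorbar(label="synaptic weight")
    plt.show()
```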
In order to identify the reason for the drop in accuracy, in Fig. 3(a) and (b) we have put yellow circles around the images corresponding to the weights of synapses attached to those post-neurons which give correct results after the 1st epoch but not after the 5th epoch. After the 1st epoch, these post-neurons fire the most for input images belonging to the class/label that matches the label assigned to those post-neurons. But after the 5th epoch, the two no longer match, leading to the drop in accuracy (Fig. 3(a),(b)). This happens because the network learns as the synaptic weights form impressions of the different images. As the number of epochs increases, the impression of an image belonging to one digit/class/label can superpose on that of an image belonging to another digit/class/label, as can be observed for the post-neurons marked with yellow circles in Fig. 3(a) and (b). This leads to the inaccuracy.

Fig. 4(a) and (b) show the weights of the SNN in the case of partially supervised learning after the 1st and 2nd epoch respectively, plotted in the same way as before. In the case of partially supervised learning, since inhibitory bias currents are applied on all post-neurons other than post-neurons 1-20 and 201-220 during training for input images belonging to class/digit "0", post-neurons 1-20 and 201-220 capture different variations of digit "0". The same is true for the other digits. On the other hand, in the case of unsupervised learning, since no supervision is provided to any specific post-neuron in the form of bias currents, the different post-neurons randomly capture impressions of different digits from "0" to "9", as expected [6] (Fig. 3(a),(b)).
The sharp increase in accuracy from the 1st epoch to the 2nd epoch in the case of partially supervised learning (Fig. 2(b)) can also be explained from the images of the synaptic weights in Fig. 4(a) and (b). As shown through yellow circles around the images of synaptic weights corresponding to specific post-neurons, those post-neurons do not fire correctly after the 1st epoch because their corresponding synaptic weights have not yet formed the impression of an image belonging to any class/digit/label. After the 2nd epoch, those impressions form, the neurons fire correctly and the accuracy increases.

III. SPINTRONIC SYNAPSE AND NEURON DEVICES
As mentioned in Section I, spintronic devices, more specifically spin orbit torque driven domain wall devices, are ideal for the implementation of an analog hardware SNN because such devices can act both as synapses and as neurons [18]-[24]. Next, we discuss the characteristics of the domain wall devices, and of the associated transistor based circuits, that enable the synapse and neuron functionality.

A. Domain wall synapse
Current driven domain wall motion in heavy metal/ferromagnetic metal heterostructure based spintronic devices has been utilized to propose, and experimentally demonstrate, synaptic behavior in such devices [34], [35], [44], [45]. An in-plane current, also known as the "write" current, flowing through the device moves the ferromagnetic domain wall in the ferromagnetic layer even in the absence of a magnetic field, as observed in experiments and simulations [46]-[50] (Fig. 5(a)). For a fixed pulse duration (3 ns in our case, Fig. 5(b)), the wall moves by different lengths for different magnitudes of the current pulse. Different positions of the domain wall correspond to different non-volatile conductance states of the vertical Magnetic Tunnel Junction (MTJ) structure (fixed ferromagnetic layer/insulating oxide layer/free ferromagnetic layer). This is because, according to the physics of the Tunneling Magneto Resistance (TMR) effect [51], [52], the conductance of the MTJ is proportional to the average perpendicular magnetization (in the vertically down direction) of the free ferromagnetic layer of the MTJ, which changes as the domain wall moves. The different conductance values correspond to the different values of the weight stored in the synapse (Fig. 5(b)).
We carry out micromagnetic simulations on the numerical package "mumax3" [53] to model the domain wall motion and obtain the variation of the conductance of the device as a function of the magnitude of the programming current pulse. We choose parameters with respect to the Pt (heavy metal)/CoFe (ferromagnetic layer)/MgO (insulating oxide) system and calibrate our model against experimental data [46]-[50]. More details of our simulation method can be found in [24], [35], [40]. A transistor based peripheral circuit has also been designed (Fig. 7) to enable the STDP based weight update characteristic, given by equation (4), in the domain wall device. Transistor T3 drives current into the domain wall synapse when the spike at the gate of transistor T4 (post-neuron spike) occurs after the spike at the gate of transistor T2 (pre-neuron spike) (Fig. 7). The current flows inside the domain wall synapse in a direction such that the conductance of the synapse increases ("synaptic potentiation"), following the conductance versus current plot of Fig. 5(b). At the time of the pre-neuron spike, T2 turns on and the voltage across capacitor C1 becomes equal to $V_1 = 1.5 - 0.8 = 0.7$ V (Fig. 7). C1 is then discharged steadily with time through transistor T1, which approximately acts as a constant current source. Thus the voltage across C1 decreases linearly with time, leading to a linear increase in the gate voltage of T3 with time. T3 operates in the sub-threshold regime. Hence the current it drives ($I_{write,1}$) through T4, when T4 turns on, into the domain wall device is an exponential function of the negative of the time difference between the spiking times of the post-neuron and the pre-neuron [54]. Thus the positive weight update ("synaptic potentiation") given by equation (4) is achieved in the domain wall synapse.
Similarly, T7 drives a current $I_{write,2}$ into the domain wall synapse when the spike at the gate of transistor T8 (pre-neuron spike) occurs after the spike at the gate of transistor T6 (post-neuron spike) (Fig. 7). In this case, the current flows inside the domain wall synapse in a direction opposite to that in the previous case. Hence, the net in-plane "write" current flowing through the domain wall device is $I_{write} = I_{write,1} - I_{write,2}$ (Fig. 7). As a result, the conductance of the synapse decreases, as desired ("synaptic depression"). The exponential dependence on the negative of the time difference between the pre-neuron spike and the post-neuron spike in equation (4) is achieved in this case through the steady discharge of capacitor C2 with time [11], [21], [24].
We simulate this transistor based electronic circuit of Fig. 7, which enables the STDP property in the domain wall synapse as described above, on the Cadence Virtuoso SPICE circuit simulator meant for analog design. From the SPICE simulation, the conductance change of the domain wall synapse device is obtained as a function of the timing difference between a spike at a post-neuron connected to it and a spike at a pre-neuron connected to it (Fig. 8). The desired STDP characteristic of equation (4) is obtained both for positive conductance/weight update ("synaptic potentiation", red plot) and for negative conductance/weight update ("synaptic depression", blue plot). All transistors are designed at the 65 nm technology node. The transistor parameters and the capacitance values of capacitors C1 and C2 (C1 = 4.9 pF, C2 = 13.7 pF) are chosen such that the STDP time constants are $\tau_1 = 10$ µs and $\tau_2 = 20$ µs, as desired in our SNN algorithm of Section II.
The design of the transistor circuit ensures that the current driven into the domain wall synapse depends exponentially on the difference between the time of occurrence of a spike at a pre-neuron and that at a post-neuron. However, it is to be noted that the weight update caused by that current flow also follows the same exponential relationship, owing to the fact that the conductance of the domain wall synapse varies linearly and symmetrically (between positive and negative current) with the in-plane current flowing through the device (Fig. 5(b)). This linear and symmetric characteristic is absent in other Non Volatile Memory (NVM) devices proposed for similar applications, e.g., RRAM and PCM synapses [25], [37]-[40]. Hence we have chosen the spintronic (domain wall) synapse to implement the hardware SNN in our paper, as already mentioned in Section I.
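To summarize the behavior of this circuit, the following minimal sketch maps a spike-timing difference to a write current and then to a conductance update; the peak currents I0_POT, I0_DEP and the slope K_G_PER_I are assumed illustrative constants, not values extracted from our SPICE or micromagnetic simulations:

```python
import numpy as np

# Behavioral sketch of the STDP synapse circuit of Fig. 7: the write
# current is exponential in the spike-timing difference (set by the
# discharge of C1 or C2), and the conductance change is linear in the
# write current (Fig. 5(b)). I0_POT, I0_DEP and K_G_PER_I are assumed
# proportionality constants for illustration only.
TAU_1, TAU_2 = 10e-6, 20e-6    # STDP time constants set by C1, C2 (s)
I0_POT, I0_DEP = 50e-6, 50e-6  # peak write currents (A), assumed
K_G_PER_I = 1e-3               # conductance change per unit current (S/A), assumed

def write_current(t_post, t_pre):
    """Net in-plane write current I_write = I_write,1 - I_write,2."""
    if t_post > t_pre:  # T3/T4 path: potentiation
        return I0_POT * np.exp(-(t_post - t_pre) / TAU_1)
    else:               # T7/T8 path: depression (opposite current direction)
        return -I0_DEP * np.exp(-(t_pre - t_post) / TAU_2)

def conductance_update(g, t_post, t_pre):
    """Linear, symmetric G-I characteristic maps current to weight change."""
    return g + K_G_PER_I * write_current(t_post, t_pre)
```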

B. Domain wall neuron
The same physics of spin orbit torque driven domain wall motion in a ferromagnetic metal/heavy metal heterostructure has been utilized to propose the domain wall based neuron device [18], [24]. The difference in design with respect to the domain wall synapse is that the MTJ structure is present only at one end of the ferromagnetic track in which the domain wall moves (Fig. 6). Thus the conductance of the MTJ remains unchanged for most of the domain wall motion along the track. It then drops sharply when the domain wall reaches the region underneath the MTJ structure (Fig. 6), because the magnetic moment of the free layer is then switched to the vertically down direction. This results in an anti-parallel alignment of the moments in the fixed and free layers, leading to a drop in conductance following the Tunneling Magneto Resistance (TMR) effect [51], [52].
Because of the potential divider circuit in Fig. 7, when the conductance of the MTJ drops, the voltage $V_{out}$ in the circuit associated with the neuron device (Fig. 6) increases sharply. The operating principle of the STDP synapse circuit of Fig. 7 is such that the post-neuron needs to generate a positive spike, i.e., the gate voltages of T4 and T6 need to increase sharply with time, for a successful weight update at the domain wall synapse [24]. This is accomplished by the sharp increase in $V_{out}$. The pre-neuron needs to generate a negative spike, i.e., the gate voltages of T2 and T8 need to drop sharply with time [24]. Hence an extra inverter circuit, comprising transistors T10 and T11, is present at the output stage of the pre-neuron circuit, which is not present in the post-neuron circuit. Whenever the domain wall reaches the end of the track in the pre-neuron or post-neuron circuit, transistor T9 or T12 turns on. This applies a reverse current that moves the domain wall back to its initial position, which is equivalent to the potential dropping to its reset value after every spike in the LIF model [8], [24].
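Behaviorally, the neuron device integrates the input current as domain wall displacement and spikes when the wall reaches the MTJ end of the track; the sketch below captures this, with the track length, wall mobility and time step being assumed illustrative values, not extracted device parameters:

```python
# Behavioral sketch of the domain wall neuron: the input current moves
# the wall along the track (integration), V_out spikes when the wall
# reaches the MTJ at the track end, and the reverse current applied via
# T9/T12 drives the wall back to its initial position (reset).
TRACK_LENGTH = 500e-9  # ferromagnetic track length (m), assumed
MOBILITY = 5e-3        # wall displacement per unit current per second (m/(A*s)), assumed
dt = 1e-9              # time step (s), assumed

def neuron_step(x, i_in):
    """Advance the domain wall position x by one step; spike at track end."""
    x = x + MOBILITY * i_in * dt  # current-driven wall motion (integration)
    spiked = x >= TRACK_LENGTH    # wall under the MTJ: conductance drop, V_out spike
    if spiked:
        x = 0.0                   # reverse current resets the wall (T9/T12)
    return x, spiked
```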
We simulate the transistor circuit of Fig. 7 that enables the pre-neuron and post-neuron functionalities on the Cadence Virtuoso SPICE circuit simulator, just as we do for the transistor circuit that enables the synaptic property.

IV. ENERGY CONSUMPTION AND SPEED METRICS FOR TRAINING THE SPINTRONIC HARDWARE SNN

In this section, we compute the time taken and the energy consumed in implementing the SNN training algorithm discussed in Section II on the spintronic hardware discussed in Section III. We only calculate the energy consumed in the synaptic circuits of Fig. 7 corresponding to all the synapses in the SNN (Fig. 1), and not in the neuron circuits. However, we account both for the energy dissipated in the domain wall devices and for that dissipated in the rest of the transistor based circuitry that enables the STDP based synaptic functionality (Fig. 7).
C1 is charged instantaneously to $V_1 = 0.7$ V (Fig. 7) and then discharged steadily every time a pre-neuron spikes, as explained in Section III. Hence an energy $\frac{1}{2}C_1V_1^2$ is dissipated for charging and $\frac{1}{2}C_1V_1^2$ for discharging in each of the $N_2 = 400$ synapse circuits connected to each pre-neuron. Similarly, C2 is charged instantaneously to $V_2 = 1.1$ V (Fig. 7) and discharged steadily through T5 every time a post-neuron spikes. Hence $\frac{1}{2}C_2V_2^2$ is dissipated for charging and $\frac{1}{2}C_2V_2^2$ for discharging in each of the $N_1 = 784$ synapse circuits connected to each post-neuron. In addition, for the synapse connecting post-neuron $j$ with pre-neuron $i$, the energy dissipated in the domain wall device due to the current flow ($I_{write}^{j,i}$) that causes the weight update, together with the energy dissipated in the transistors through which the same current flows (T3 and T4 for positive weight update, T7 and T8 for negative weight update), can be written as $V_{DD}\,|I_{write}^{j,i}|\,\Delta t_{pulse}$, where $V_{DD} = 1.5$ V and $\Delta t_{pulse}$ is the duration of each current pulse (3 ns, as used in our micromagnetic simulations, Fig. 5(b)). $I_{write}^{j,i}$ for each synapse can be obtained from the weight update ($\Delta w_{j,i}$) in the synapse given by equation (4), using the relationship between weight and conductance, and then between conductance and current (Fig. 5(b)).
Thus, for each epoch, the total energy dissipated in the STDP circuits ($E_{epoch}$) is given by:

$$E_{epoch} = \sum_{i=1}^{N_1} n_{pre}^{\,i}\, C_1 V_1^2\, N_2 \;+\; \sum_{j=1}^{N_2} n_{post}^{\,j}\, C_2 V_2^2\, N_1 \;+\; \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \sum_{\substack{\text{all spikes of pre-neuron } i,\\ \text{all spikes of post-neuron } j}} |I_{write}^{j,i}|\, V_{DD}\, \Delta t_{pulse} \qquad (5)$$

where $n_{pre}^{\,i}$ is the total number of times pre-neuron $i$ spikes per epoch and $n_{post}^{\,j}$ is the total number of times post-neuron $j$ spikes per epoch. Table I lists the training accuracy, the test accuracy, the total number of pre-neuron spikes, the total number of post-neuron spikes, the time taken for training, the energy consumed in training (given by equation (5) above) and the power consumed in training, both for completely unsupervised learning and for partially supervised learning on the MNIST dataset (1000 train samples, 100 test samples). Only 1 epoch is used in the case of unsupervised learning and 2 epochs are used for partially supervised learning, because reasonable train and test accuracies are achieved by then (Fig. 2).
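As a cross-check of equation (5), a minimal sketch of the per-epoch energy computation is given below; the variable names are our own, and the spike counts and write-current magnitudes are assumed to be collected during the simulated training run:

```python
import numpy as np

# Minimal sketch of the per-epoch energy estimate of equation (5).
# n_pre and n_post are spike counts per epoch; write_current_pulses
# collects the magnitudes of the write-current pulses over all
# weight-update events in the epoch.
C1, V1 = 4.9e-12, 0.7      # synapse-circuit capacitor (F) and voltage (V)
C2, V2 = 13.7e-12, 1.1
VDD, DT_PULSE = 1.5, 3e-9  # supply voltage (V) and write-pulse duration (s)
N1, N2 = 784, 400

def epoch_energy(n_pre, n_post, write_current_pulses):
    """Equation (5): capacitor charging/discharging plus write-pulse energy.

    n_pre: length-N1 array of pre-neuron spike counts per epoch.
    n_post: length-N2 array of post-neuron spike counts per epoch.
    write_current_pulses: iterable of |I_write| magnitudes (A), one per
    weight-update event over all synapses in the epoch.
    """
    e_c1 = np.sum(n_pre) * C1 * V1**2 * N2   # charge + discharge of C1
    e_c2 = np.sum(n_post) * C2 * V2**2 * N1  # charge + discharge of C2
    e_write = VDD * DT_PULSE * np.sum(np.abs(write_current_pulses))
    return e_c1 + e_c2 + e_write
```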
The energy consumed per epoch is almost the same for completely unsupervised and partially supervised learning. It is to be noted that earlier reports [24], [43] only consider the energy dissipated in the spintronic devices of the synapse circuits during training. Hence the total energy for learning reported in [24], [43] is much lower than what we report in Table I.

V. CONCLUSION
Thus, in this paper, we have implemented the training of an analog hardware SNN, based on STDP enabled spintronic synapses and neurons, on the MNIST dataset of handwritten digits. We have used two modes of learning and obtained high classification accuracies. We have also reported the net energy consumed for learning in the spintronic devices and the associated transistor based circuits that enable the synaptic functionality.