SpinDrop: Dropout-Based Bayesian Binary Neural Networks With Spintronic Implementation

Neural Networks (NNs) provide an effective solution in numerous application domains, including autonomous driving and medical applications. Nevertheless, NN predictions can be incorrect if the input sample is outside of the training distribution or contaminated by noise. Consequently, quantifying the uncertainty of a NN prediction allows the system to make more insightful decisions by avoiding blind predictions. Therefore, uncertainty quantification is crucial for a variety of applications, including safety-critical ones. Bayesian NNs (BayNNs) using a Dropout-based approximation provide a systematic approach for estimating the uncertainty of predictions. Despite this merit, BayNNs are generally not suitable for implementation in embedded devices and cannot meet the high-performance demands of certain applications. Computation-in-memory (CiM) architectures with emerging non-volatile memories (NVMs) are strong candidates for high-performance and low-power acceleration of BayNNs in hardware. Among NVMs, Magnetic Tunnel Junctions (MTJs) offer many benefits, but they also suffer from various non-idealities and limited bit-level resolution. As a result, binarizing BayNNs is an attractive option that allows a direct implementation of BayNNs in a CiM architecture and achieves the benefits of both CiM architectures and BayNNs at the same time. Existing in-memory hardware implementations emphasize conventional NNs, which can only make point predictions and account for neither device nor input uncertainty, thus reducing both reliability and performance. In this paper, we propose for the first time Binary Bayesian NNs (BinBayNN) with an end-to-end approach (from the algorithmic level to the device level) for their implementation. Our approach takes the inherent stochastic properties of MTJs as a feature to implement Dropout-based Bayesian Neural Networks. We provide an extensive evaluation of our approach, from the device level up to the algorithmic level, on various benchmark datasets.


I. INTRODUCTION
Machine learning approaches like Neural Networks (NNs) are at the center of modern computing paradigms. They offer the ability to solve complex tasks which are difficult, if not impossible, to address with conventional computational methods. Applications like autonomous driving or computer-aided medical diagnostics would not be possible without embedding human-inspired computation.
Running inference to calculate the results of a NN based on its inputs can be heavily time and energy-demanding. Therefore, hardware accelerators, from more generic hardware such as Graphic Processing Units (GPUs) to highly specialized hardware such as Tensor Processing Units (TPUs) [1] have been developed and presented as potential alternatives. As a significant part of NN utilization, the inference process is based on performing a series of matrix-vector multiplication on a very large set of data. Compute in-Memory (CiM) architectures promise to reduce the memory transfer overhead. Consequently, the inference speed is improved, and the energy consumption is reduced.
In terms of NN algorithms, standard approaches consist of minimizing a task-specific loss function to estimate a single point value for each network parameter (also called point-estimate NNs). They use the back-propagation algorithm to fit the dataset [4]. These approaches discard all other possible parametrizations of the network that might also fit the input dataset well; they also lack explainability and fail to give the uncertainty of a prediction. Furthermore, point-estimate NNs generally tend to be overconfident about their predictions, even when they receive out-of-distribution data at inference time. Alternatively, Bayesian Neural Networks (BayNNs) [5] offer a principled approach to train uncertainty-aware neural networks. Recent work shows that Dropout, which is typically used in neural networks to reduce overfitting, can be used
as an approximation for Gaussian processes [6]. Therefore, to evaluate the quality of NN results, the model uncertainty can be obtained at a very low cost, from common Dropout models.
In terms of NN parameter resolution, reducing the bit-width of the NN weights and/or activations from full precision (32-bit floating point) to a binary value, e.g., +1 or −1, is an attractive solution for NVM-based NNs. For example, the Binary Neural Network (BNN) is presented in [7]. Conventional BNNs and their CiM implementations can be interpreted as point-estimate NNs. Therefore, they also lack the ability to quantify the uncertainty of an inference result, fail to distinguish out-of-distribution data, and additionally suffer from device non-idealities, e.g., thermal fluctuations of NVMs [8]. These points are of particular interest for highly safe and robust applications. Binarizing a Bayesian Neural Network is therefore a very attractive approach that combines the benefits of both Bayesian and Binary Neural Networks in a single solution.
In this paper, we present a Bayesian Binary Neural Network (BayBNN) using spintronic-based Dropout as a Gaussian-process approximation and develop a complete solution and flow spanning from the training algorithm to the circuit-level hardware implementation. Our contributions can be summarized as follows:
• We demonstrate that the optimization objective of a BNN is mathematically equivalent to an approximation of the probabilistic deep Gaussian process; therefore, it can be used in a BayBNN.
• We propose a hardware-based Dropout method and present its hardware implementation using spintronic devices, specifically STT-MTJs.
• We demonstrate the concept of Bayesian Binary Neural Networks through a holistic, extensive analysis from device- and circuit-level evaluations up to the algorithmic level.
• We show that the proposed approach can be used to detect out-of-distribution data while being robust against variability in the Dropout generation mechanism and thermal fluctuations of the spintronic-based NVMs.
Our proposed approach embraces the randomness of the STT-spintronic-based CiM architecture, treating it as a feature rather than an issue for the implementation of Dropout-based Bayesian Binary Neural Networks. Moreover, at the architecture level, we show that it does not require changes in the bit-cell design, allowing the re-use of bit-cell arrays designed for classic MRAM memory or for classic Binary NNs. The main changes are found in the peripheral circuitry, which is adapted to enable the Dropout technique for Bayesian inference. To the best of our knowledge, there is no other work presenting a combination of a Bayesian Binary Neural Network using the Dropout approximation with its physical-level implementation using STT-MRAM.
The paper is structured as follows. Section II covers the background of our work. Section III introduces the mathematical groundwork for Bayesian inference in Binary Neural Networks using Dropout and presents the concept of the proposed circuitry and its architectural usage (Section III-C). Section IV shows the evaluation of our proposed methods, and Section V concludes the paper.

II. BACKGROUND

A. Magnetic Tunnel Junction
The Magnetic Tunnel Junction (MTJ) is the basic building component of magnetic memory bit-cell design. It consists of two ferromagnetic layers separated by a tunnel oxide layer. The first layer is called the Free Layer (FL), as its magnetic orientation can switch between two stable directions. The second one is called the pinned layer, because its magnetization is pinned. The relative orientations of the two ferromagnetic layers result in two resistance states. The MTJ exhibits a high resistance state when the magnetic orientations of the two layers are anti-parallel (R_AP) and a low resistance state when they are parallel (R_P). The resistance state can be changed by a spin-polarized current passing through it, inducing the Spin Transfer Torque (STT) effect [9]. The ratio between the two resistance states is called the Tunnel Magneto-Resistance (TMR).
In the presence of thermal noise, the switching of the MTJ has a stochastic behavior: for a given amplitude and duration of a current passing through the MTJ, the switching occurs with a certain probability, expressed as:

P_sw(t, I) = 1 − exp( −(t/τ_0) · exp( −Δ(1 − I/I_c0) ) )    (1)

where Δ is the thermal stability factor, t is the pulse duration, τ_0 is the attempt time, I is the applied current, and I_c0 is the critical current at 0 K. Figure 1 presents the switching probability of an MTJ with respect to the applied voltage. From left to right, the curves show the impact on the switching probability when the voltage level is decreased [10]. In addition to stochastic switching, an MTJ can also suffer from several reliability issues, process variation being the main one. Process variation can severely impact the resistance value of the MTJ and consequently the TMR ratio. Moreover, STT-MRAM is also prone to other reported reliability challenges, including write failure, retention failure, and read disturb [11], which could lead to errors and potential system failures depending on the run-time system utilization.
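For illustration, the following minimal Python sketch evaluates this thermally activated switching model; the parameter values (Δ = 40, τ_0 = 1 ns, I_c0 = 150 µA) and the pulse currents are assumptions chosen for illustration only, not device data from this work.

import numpy as np

def mtj_switching_probability(i, t, delta=40.0, tau_0=1e-9, i_c0=150e-6):
    # Thermally activated STT switching model of Equation 1.
    # delta: thermal stability factor, tau_0: attempt time [s],
    # i_c0: critical current at 0 K [A]. All values are illustrative.
    tau = tau_0 * np.exp(delta * (1.0 - i / i_c0))   # mean switching time
    return 1.0 - np.exp(-t / tau)

# Example: switching probability of a 10 ns write pulse at different currents
for i in (80e-6, 110e-6, 130e-6, 145e-6):
    print(f"I = {i * 1e6:.0f} uA -> P_sw = {mtj_switching_probability(i, 10e-9):.3f}")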
B. Neural Networks

1) Point Estimate NNs: Neural Networks are a powerful tool used for many applications nowadays. They can be split into several classes, e.g., Multi-Layer Perceptrons (MLPs), Recurrent Neural Networks (RNNs), or Convolutional Neural Networks (CNNs), to name only a few. Given a set of Q input and target output pairs {(x_1, y_1), ..., (x_Q, y_Q)}, a NN computes a function defined by its parameters θ that maps the inputs x_i to observed outputs ŷ_i ∈ R^D; this can be described as ŷ = f_θ(x). Here, for brevity, θ denotes all the model parameters, e.g., the weights w representing the network connections and the biases b. Each class of NN has a single input layer l_0, multiple hidden layers l_i, i = 1 ... L − 1, and an output layer L. The simplest layer of a NN, the linear layer, can be represented as a linear transformation of the input x followed by an element-wise nonlinear activation function S(·), e.g.:

ŷ = S(W x + b)    (2)

The optimization goal of a NN is to find a single value for the parameters θ that reduces the loss function E, e.g., the mean squared error (MSE), using the backpropagation algorithm.
The loss function can also be defined as the log-likelihood of the training set. This type of NNs is referred to as pointestimate NNs. It discards all other possible parameterizations of the NN.

2) Regularization Methods:
Overfitting is a typical problem of point-estimate NNs, in which the model performs extremely well on the training data but not on unseen inference data. To improve the generalization of the NN model to inference data, a regularization term such as L2 (ridge regression) is usually applied to each parameter of the NN. The overall learning objective of a regularized NN can be described as:

L = E + λ Σ_{i=1}^{L} ( ||W_i||² + ||b_i||² )    (3)

where λ is the hyperparameter that controls the strength of the regularization. Stochastic regularization techniques (SRTs) such as Dropout [12] are also commonly used in NN models. Dropout perturbs the model stochastically to perform regularization; consequently, the loss becomes stochastic as well. The Dropout approach produces an average of the predictions of a large ensemble of different neural networks in a computationally inexpensive way. During training, it randomly omits some neurons of the hidden layers with a predefined probability. Dropout is applied by sampling binary vectors Z_i, i ∈ [1, ..., L − 1], of the same dimension as the bias vector b. Each element of Z_i is distributed according to a Bernoulli distribution with a probability p ∈ [0, 1]. Therefore, Dropout can be described as Z_i ∼ Bernoulli(1 − p), and it sets the given input x_i of a layer to zero, x_i ⊙ Z_i, with probability p, where ⊙ is the Hadamard product.
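As an illustration, a minimal PyTorch sketch of the Dropout operation described above is given below; the activation size is a hypothetical example.

import torch

def dropout_forward(x, p=0.2, training=True):
    # Standard Dropout: sample Z ~ Bernoulli(1 - p) and zero out the
    # corresponding activations (Hadamard product x ⊙ Z).
    if not training or p == 0.0:
        return x
    z = torch.bernoulli(torch.full_like(x, 1.0 - p))
    return x * z

x = torch.randn(4, 256)   # a batch of hidden activations (illustrative size)
print(dropout_forward(x).count_nonzero().item(), "of", x.numel(), "activations kept")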
3) Out-of-Distribution Data and Uncertainty in NNs: Ideally, a neural network model should generalize well to inference data that were not used during training. For example, in the simple case of the MNIST dataset, the distributions of the training and inference data are the same, e.g., both training and inference data contain handwritten digits 0 ... 9. Data that the NN model is not trained for, i.e., that have a different distribution, are referred to as out-of-distribution data. Preferably, when a NN receives out-of-distribution data, it should either not predict at all or treat it as an uncertain input, i.e., return a result with high uncertainty. Otherwise, this weakness can be exploited to completely fool the NN by adding specially calculated noise to the data [13]. In the uncertain case, the input can be passed to a human, e.g., a human taking control of driving the autonomous vehicle in an uncertain environment.
Furthermore, an ideal NN should also predict confidently. The confidence of a NN model is usually given by the SoftMax layer, which converts the final results of a NN into class probabilities. The confidence of a NN model in its prediction is different from uncertainty in the prediction. For example, a NN for a classification task predicts "confidences" in the range of 0 − 100% for each label. A well-trained NN is expected to predict confidently a correct label.
There are two types of uncertainty in NNs: epistemic and aleatoric. Epistemic uncertainty represents model uncertainty resulting from a lack of information regarding the training data, and it decreases as the size of the training data grows. In contrast, aleatoric uncertainty refers to the uncertainty brought about by noise in the observations. More data is not associated with a decline in aleatoric uncertainty.
4) Bayesian Neural Networks: BayNNs are stochastic NNs trained using a Bayesian approach and can be seen as probabilistic deep learning models. They assign prior distributions over the model parameters in order to find parameters that could have generated the given dataset. Typically, a standard Gaussian distribution is set over the model parameters, P(θ) = N(0, I). BayNNs make predictions over an ensemble of an infinite number of NNs, i.e., NNs with various configurations of weights and biases, which is computationally intractable. Variational inference [14] provides a solution by approximating the distribution of NN parameters with a simpler known distribution, the approximate variational distribution, and minimizing the Kullback-Leibler (KL) divergence between those distributions. The KL divergence encourages the approximate variational distribution to become similar to the prior distribution, and minimizing this term is similar to minimizing the negative evidence lower bound (ELBO). However, this requires integrating over the entire parameter space and is intractable; it can be estimated with Monte-Carlo (MC) sampling over θ of the NN. As a result, BayNNs can estimate uncertainty, reduce over-fitting, and make it easier for the NN to learn from small datasets.
5) Dropout as Bayesian Approximation: Gal and Ghahramani [6] provided the mathematical groundwork showing that a variational distribution whose columns are randomly set to zero with a probability p can act as an approximation of the intractable posterior distribution, and that minimizing the L2 regularization is then similar to minimizing the KL divergence. The overall objective can be described as:

L_dropout ∝ (1/N) Σ_{n=1}^{N} E(y_n, ŷ_n) + (l²(1 − p))/(2Nτ) Σ_{i=1}^{L} ||M_i||² + l²/(2Nτ) Σ_{i=1}^{L} ||m_i||²    (4)

The overall objective is similar to Equation 3, with l the prior length-scale that defines a more expressive prior and τ the model precision. Therefore, sampling stochastic parameters θ̂ from the Bernoulli distribution of the binary variables z_{i,j} is equivalent to the binary variables of Dropout. Here, θ̂ summarizes the stochastic weights M and biases m of Equation 4. Hence, sampling T sets of vectors {z_1^t, ..., z_L^t}_{t=1}^{T} gives {θ̂_1^t, ..., θ̂_L^t}_{t=1}^{T}. The predictive performance can be determined by averaging the outputs of the T samples; this is referred to as BayNN in the following sections. Also, the variance of the T samples can be used as a measure of the uncertainty of a prediction. Therefore, uncertainty can easily be obtained for a NN with floating-point parameters θ by using Dropout during inference, given that it was trained with Dropout and L2 regularization.
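A minimal sketch of this Monte-Carlo Dropout inference procedure, assuming a hypothetical PyTorch model that contains Dropout layers, is:

import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=10):
    # Keep Dropout active at inference time, run T stochastic forward passes,
    # and use their mean as the prediction and their variance as the uncertainty.
    model.train()   # keeps Dropout stochastic (the toy model below has no BatchNorm)
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0), probs.var(dim=0)

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Dropout(p=0.2), torch.nn.Linear(256, 10))
mean, var = mc_dropout_predict(model, torch.randn(1, 784), T=10)
print(mean.argmax(dim=-1).item(), var.max().item())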

C. Related Works
Prior research was conducted on hardware solutions for Bayesian and Binary Neural Networks. The general trend was to use implementations based on CMOS technology, such as GPU platforms, for inference and training [15]. Some other studies favored the use of FPGAs or ASIC solutions [16], [17]. However, these solutions suffer from excessive power consumption due to data fetching between memory and core unit [18]. Another alternative is to rely on the lower implementation overhead of NVMs technologies [2]. In [19], the authors exploit resistive RAM (RRAM) non-idealities to be used for Bayesian learning. In [20], the authors introduced a stochastic RRAM-based selector to allow synapse Dropout. In [21], the RRAM stochasticity is leveraged by the probabilistic Gaussian sampling in the input of the crossbar. A more recent study [22], presented a stochastic Bayesian machine based on a Resistive memory in conjunction with Linear Feedback Shift Register (LFSR) for Bayesian inference. The stochasticity due to magnetization fluctuation in the MTJ has often been proposed to implement True Random Number Generators (TRNGs). In [23], the MTJ is designed to continuously switch between R P and R A P by reducing the energy barrier between the two states, i.e., Superparamagnetic tunnel junction (SMTJs). In [24], the stochasticity is implemented by calculating the switching time when a given current is applied to the MTJ, through several consecutive set and reset operations. Stochastic MTJs were even proposed as a synapse for neuromorphic computing [25]. This work presents for the first time the implementation of the In-Memory Bayesian Binary Neural Network with STT-MRAM devices. For this purpose, the two modes of operation of the MTJ are explored, respectively, the stochastic feature implemented in the SpinDrop module and the deterministic behavior for the synapse storing in the crossbar array. This approach has the benefit of not requiring changes in the bit-cell of the crossbar architecture compared to [20], all design changes occur in the peripheral circuits. Moreover, compared to the RRAM-based stochastic engine presented in [21], the STT-MRAM operates at lower voltages (around 1V), thus resulting in less power consumption.

III. BAYESIAN BINARY NEURAL NETWORK
For efficient uncertainty quantification utilizing Bayesian Binary NNs, algorithmic issues and the hardware characteristics of the neural network building blocks must be taken into account.
In addition, Bayesian training and inference are accompanied by a number of challenges. They are discussed in the sections that follow.

A. Algorithmic Description and Optimization of BayBNNs
Binary NNs [7] reduce the bit precision of weights and activations with the Sign(x) function, which can be described as:

Sign(x) = +1 if x ≥ 0, and −1 otherwise    (5)

During the forward pass, the real-valued weights and activations are binarized; therefore, the loss is computed on binary weights and activations. As the L2 regularization term is minimized when the weights are zero, applying it to binary +1 or −1 weights results in a constant penalty term that cannot be optimized. Alternatively, the L2 regularization can be applied to the real-valued proxy weights, but then it has no effect. For example, when the algorithm proposed in [7] is augmented with L2 regularization, the training loss and error rate are not affected, as shown in Figure 2. Since adding L2 only moves the numerical values of the weights closer to zero, the sign of each weight is not affected. Therefore, the Dropout-based Bayesian approximation described in Section II-B5 cannot be applied directly to those kinds of BNN training algorithms. Alternatively, since the weights of a BNN are either +1 or −1, a regularization term for BNN training should encourage the weights to be around those values. Many such regularization functions can be designed for BNNs, for example R_1 = (α − |W|)², which has two minima at ±α [26], where α can be a scalar value, e.g., 0.5.
In most BNN algorithms, a learnable scale factor is introduced to reduce the quantization error and to improve performance [26], [27]. In that case, α ∈ R^[C_out,1,1,1] can be incorporated into the regularization term, which encourages the weights to be around the scale parameter. Therefore, the overall learning objective becomes:

L = E + λ Σ_{i=1}^{L} (α_i − |W_i|)²    (6)

It can be rewritten according to Equation 4 for the Dropout-based Bayesian approximation (with the real-valued proxy weights W playing the role of M, and the bias term removed for simplicity) as:

L_dropout ∝ (1/N) Σ_{n=1}^{N} E(y_n, ŷ_n) + (l²(1 − p))/(2Nτ) Σ_{i=1}^{L} (α_i − |W_i|)²    (7)

Furthermore, the R_1 regularization term in Equation 7 can be expanded as follows:

(α − |W|)² = α² − 2α|W| + |W|²    (8)

For a large NN model, the third term of Equation 8 numerically dominates the first term. Therefore, α² + |W|² ≈ |W|², and Equation 8 can be further approximated as:

(α − |W|)² ≈ |W|² − 2α|W| = |W| (|W| − 2α)    (9)

Let β = |W| − 2α and substitute it into the overall BNN objective:

L_dropout ∝ (1/N) Σ_{n=1}^{N} E(y_n, ŷ_n) + (l²(1 − p))/(2Nτ) Σ_{i=1}^{L} |W_i| β_i    (10)

which is equivalent to the learning objective of BayNN (Equation 4). Specifically, when the scale factor α is very small, i.e., close to zero, β ≈ |W| and Equation 10 becomes exactly equal to Equation 4. Therefore, minimizing the R_1 regularization term is equivalent to minimizing the KL divergence. As a result, a BNN trained with Dropout and R_1 is also a Bayesian approximation (BayBNN). This demonstration encouraged us to pursue the hardware implementation of the concept.
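For illustration, a minimal PyTorch sketch of the R_1 penalty added to a task loss is given below; the layer shape, the per-channel scale α = 0.5, and the regularization strength are hypothetical values, not the training configuration of this work.

import torch

def r1_regularizer(weights, alphas):
    # R_1 = (alpha - |W|)^2 summed over the binarized layers: it has two minima
    # at W = +alpha and W = -alpha, pushing latent weights toward the scale factor.
    return sum(((a - w.abs()) ** 2).sum() for w, a in zip(weights, alphas))

w = torch.randn(128, 784, requires_grad=True)   # latent (proxy) weights of one layer
alpha = torch.full((128, 1), 0.5)               # e.g., one scale per output channel
task_loss = torch.tensor(0.0)                   # stand-in for the cross-entropy term
loss = task_loss + 1e-5 * r1_regularizer([w], [alpha])
loss.backward()
print(w.grad.abs().mean().item())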

B. Spintronic-Based Dropout Model Description
This section describes the Spintronic-based Dropout (referred to in the following as SpinDrop) neural network model. As mentioned previously, in normal Dropout, a random Bernoulli mask is sampled with a fixed dropout probability p ∈ [0, 1]. As Dropout is usually implemented at the software level or in a digital fashion, the dropout probability itself is deterministic.
On the other hand, implementation of the SpinDrop module with stochastic and analog components will result in a dropout probability p which can vary and can be considered stochastic. The stochasticity here could also be coming from manufacturing variations and/or thermal noise.
With SpinDrop, the feed-forward operation of a layer l becomes:

p̂ = p + ε,  ε ∼ N(µ, σ²),  M̂^l ∼ Bernoulli(1 − p̂),  ŷ^l = Sign( (x^{l,b} * M̂^l) W^{l,b} )

Here, the mean µ and standard deviation σ come from the device technology, M̂ denotes the dropout mask subject to process variation, * denotes the element-wise product, p̂ is the dropout probability with variations, and the superscript b denotes a binary matrix or vector. We show later that using SpinDrop during training can sometimes outperform regular Dropout-based Bayesian inference.
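A behavioral PyTorch sketch of this SpinDrop forward pass, with illustrative values for µ and σ (which in practice come from the device technology), could look as follows:

import torch

def spindrop_forward(x_b, w_b, p=0.2, mu=0.0, sigma=0.01):
    # SpinDrop sketch: the dropout probability itself is noisy,
    # p_hat = p + eps with eps ~ N(mu, sigma^2), emulating device variation.
    # x_b and w_b are binary (+1/-1) activations and weights.
    p_hat = (p + mu + sigma * torch.randn(x_b.shape[-1])).clamp(0.0, 1.0)
    mask = torch.bernoulli(1.0 - p_hat)        # one SpinDrop bit per input line
    return torch.sign((x_b * mask) @ w_b.t())  # binary activation of the layer

x_b = torch.sign(torch.randn(1, 256))          # binary inputs
w_b = torch.sign(torch.randn(128, 256))        # binary weights
print(spindrop_forward(x_b, w_b).shape)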

C. Spintronic-Based Bayesian Dropout (SpinDrop) Implementation
The intrinsic MTJ stochasticity due to magnetization fluctuations has a strong impact on the MRAM memory writing time. In standard MRAM memory applications, this effect is not desired.
In this study, a binary crossbar array is implemented with STT-MRAM technology to specifically allow parallel reading. This allows one-step in-memory computing operations to be performed efficiently, which in turn is used to accelerate our proposed Bayesian Binary NNs. In this array, the stochasticity of MTJs is used in the periphery of the crossbar to allow the random selection, and thus dropping, of one or more rows of the crossbar at a time, while the MTJs used for synapse storage and in-memory computation are operated in a deterministic way. We thus propose to combine the stochastic and deterministic aspects of the STT-MRAM to implement the Dropout required for the BayNN approach, as explained in Sections II-B4 and III.
1) Magnetic Tunnel Junction as a Tuneable Bitstream for Dropout Generation: The functionality of the MTJ stochastic switching has been simulated with the Verilog-A model presented in [28]. The stochasticity is obtained by using the probability density of the switching time of the MTJ as a function of the injected current [10]. For a given amplitude and duration of the current pulse, the model computes a certain probability of switching from one state to the other. This model was used to design an MTJ-based Random Number Generator (RNG). To allow control of the current, and thus of the resulting probability, a current generator driven by a digital value Q is used. In this case, it is obtained by five PMOS transistors connected in parallel. The first four transistors are controlled by the digital value Q = Q_4 Q_3 Q_2 Q_1 (Figure 3), allowing a linear increase of the current and of the resulting probability. The fifth one (not shown) is grounded and ensures a minimum current flowing through the structure. The number of transistors can be extended or reduced depending on the chosen digital resolution; nevertheless, it must be able to cover the entire probability range of the STT-MRAM. A bitstream with a given probability is generated through several alternating SET and RESET operations. After a SET write operation, the state of the MTJ is read using a sense amplifier [29] to check whether switching has occurred. The read result is the Dropout signal. After the read operation, the MTJ is RESET to the P state to prepare the next Dropout signal generation. Figure 3 (b) shows the evolution of the MTJ switching probability according to the value of Q. We distinguish two regions: a linear region corresponding to the transistors operating in the sub-threshold regime, and a plateau zone corresponding to the transistors operating in saturation.
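A behavioral Python sketch of one SET/read/RESET cycle of this bitstream generator is given below; the linear mapping from Q to the write current and the MTJ parameters are illustrative assumptions that reuse the switching model of Equation 1, not extracted circuit data.

import numpy as np

rng = np.random.default_rng(0)

def current_from_q(q):
    # Behavioral stand-in for the PMOS current DAC: the write current grows
    # roughly linearly with the 4-bit code Q (the 80-150 uA range is illustrative).
    return 80e-6 + (150e-6 - 80e-6) * q / 15

def spindrop_bit(q, t_pulse=10e-9, delta=40.0, tau_0=1e-9, i_c0=150e-6):
    # One SET / read / RESET cycle of the MTJ-based RNG: the read-back state
    # after the stochastic SET attempt is the Dropout signal, after which the
    # MTJ is deterministically RESET to the P state.
    tau = tau_0 * np.exp(delta * (1.0 - current_from_q(q) / i_c0))
    p_sw = 1.0 - np.exp(-t_pulse / tau)
    return int(rng.random() < p_sw)

bits = [spindrop_bit(q=13) for _ in range(1000)]
print("estimated dropout probability:", np.mean(bits))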
Both MTJs and CMOS transistors from the peripheral circuits could be subject to manufacturing variations that can impact the quality of the generated bitstream thus introducing a deviation from the target probability. To mitigate manufacturing issues, some studies proposed the use of multiple (N) MTJs [30], since the random sequences generated by several MTJs have a lower standard deviation (divided by N). To compensate for probability deviation, another solution consists in adding a feedback loop to increase the accuracy of the generated probability [31], [32]. However, note that in our particular case, the probability deviation related to manufacturing defects is not necessarily a concern. In fact, instead of trying to target an accurate probability, these variations could be useful within the Bayesian Neural Networks.
2) Architecture Design: To perform bitwise Bayesian inference computation within the memory, the typical classic STT-MRAM crossbar architecture needs to be modified.
In classic BNNs, the conventional matrix-vector multiplications are reduced to XNOR operations for higher efficiency. It is thus necessary to design a spintronic-based bit-cell that allows this particular operation to be performed. Several digital [33] and analog implementations have been presented in the literature; analog solutions, in particular, can take advantage of the summation of currents according to Kirchhoff's current law.
The proposed architecture is organized around the crossbar MTJ array, in which each trained weight is stored in a unit composed of two 1T-1MTJ cells, as in conventional BNN crossbars. In addition, a wordline decoder with the capability to enable multiple consecutive addresses is used. One SpinDrop module, described in the previous section, is connected to each wordline pair to implement the Dropout concept described earlier. A bitline conditioning circuit is used to set the bitlines according to the inputs. One ADC per sensed bitline and a sourceline conditioning circuit are connected to the output to sense and convert the state of the enabled cells. In addition, a CMOS-based accumulator-adder module sums up the partial matrix-vector product results according to Equation 2. Finally, a comparator and a running-average CMOS circuit complete the schematic, ensuring that the computation produces the predictive mean needed for the Bayesian inference presented in Section II-B4. The sourceline periphery (ADC, accumulator-adder, and so on) can be shared by multiple sourcelines using MUXes to save chip area by reusing components that would otherwise be temporarily idle. The output of the CMOS periphery of the crossbar can then be provided as input to the next crossbar, which follows the same architectural design.
Moreover, as Bayesian inference with the Dropout approximation requires spatial and temporal independence, the probability that a neuron is dropped must be independent across neurons and also from one input to another. To achieve temporal independence, the proposed SpinDrop module randomly deactivates and activates each wordline with probability p and 1 − p, respectively, during each forward pass. In addition, to achieve spatial independence, a separate SpinDrop module is used for each wordline of the crossbar. Figure 5 (b) illustrates the Bayesian inference flow for a linear layer with 10 Monte Carlo samples (T = 10).
As mentioned before, the presented architecture produces the weighted-sum calculation of a single layer. The output of a layer is then used as the input to the next layer. Each input neuron is mapped to two rows, and each output neuron is mapped to one column; that is to say, two wordlines feed one input neuron while a column line feeds one output neuron. Each pair of cells thus represents the connection between an input neuron and an output neuron. To evaluate the output of a neuron, the following steps have to be performed serially for each layer of the BayBNN.
At the beginning of the inference operation, each of the SpinDrop modules randomly drops its corresponding neuron with the Dropout probability, which is implemented by enabling or disabling the respective wordlines. As only a limited number of cells per bitline can be reliably sensed at once [34], our architecture supports the computation of group-wise CiM operations. Based on the number of cells that can be measured in parallel, Z_CIM, the array with Z_{l−1} inputs (stored in 2 × Z_{l−1} rows) is split into Z_groups = Z_{l−1}/Z_CIM groups. Each group is selected one by one via the wordline decoder, and the SpinDrop modules are used to drop an input by disabling the wordline of the respective input neuron. The SpinDrop module is integrated by adding a pass-gate that allows access either to the classical decoder or to the stochastic wordline (WL). Here, one SpinDrop module is used per wordline, but this number can be reduced down to four (depending on the maximal CiM operation group size) and thus multiplexed among the different group operations. The ADC is used to digitize the result of the XNOR operations, which is then summed up in the accumulator-adder module. After all of the groups have been summed up, the comparator applies a threshold function. The threshold function chosen here is the Sign(x) function, which is used to binarize weights and activations in a typical BNN, as described in Section III. The MUXes can be used to map multiple layers onto the same crossbar and evaluate them one by one.
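The following Python sketch illustrates this group-wise dataflow for one layer; the layer dimensions, dropout rate, and Z_CIM = 4 are illustrative, and the ±1 dot product stands in for the XNOR/popcount read performed by the ADC.

import numpy as np

def baybnn_layer(x_b, w_b, drop_mask, z_cim=4):
    # Group-wise CiM evaluation of one binary layer: the input rows are read in
    # groups of z_cim cells, each group's partial sum (what the ADC and the
    # accumulator-adder produce) is accumulated, and the comparator finally
    # applies the Sign() activation.  drop_mask is the SpinDrop output
    # (1 = wordline enabled, 0 = dropped input).
    n_in, n_out = w_b.shape
    acc = np.zeros(n_out)
    for start in range(0, n_in, z_cim):          # one CiM operation per group
        sel = slice(start, start + z_cim)
        active = drop_mask[sel]
        acc += (x_b[sel] * active) @ w_b[sel]    # partial weighted sum
    return np.where(acc >= 0, 1, -1)             # comparator = Sign(x)

rng = np.random.default_rng(0)
x_b = rng.choice([-1, 1], size=256)
w_b = rng.choice([-1, 1], size=(256, 128))
mask = rng.binomial(1, 0.8, size=256)            # 20% dropout of the input rows
print(baybnn_layer(x_b, w_b, mask)[:10])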
To get the predictive performance of a Dropout-based BayNN, the average result of all individual Monte-Carlo inferences with SpinDrop enabled has to be calculated. The results of the neurons of the output layer are fed into a running average to evaluate the predictive performance and uncertainty of the BayNN. The calculated mean value is therefore the Bayesian inference result, and the variance is the corresponding confidence (uncertainty) in this result.
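A streaming (Welford-style) computation of this running mean and variance, as the averaging circuit would accumulate it over the Monte-Carlo passes, can be sketched as follows; the output size and the random stand-in samples are illustrative.

import numpy as np

class RunningMoments:
    # Streaming mean/variance: the mean over the T Monte-Carlo passes is the
    # Bayesian prediction, their variance is the uncertainty estimate.
    def __init__(self, shape):
        self.n, self.mean, self.m2 = 0, np.zeros(shape), np.zeros(shape)

    def update(self, sample):
        self.n += 1
        delta = sample - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (sample - self.mean)

    @property
    def variance(self):
        return self.m2 / max(self.n - 1, 1)

stats = RunningMoments(shape=10)
for _ in range(10):                      # T = 10 stochastic forward passes
    stats.update(np.random.rand(10))     # stand-in for one output-layer result
print(stats.mean.round(2), stats.variance.round(3))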

IV. EVALUATION

A. Simulation Setup
For the architectural simulation, we first obtained the circuit characteristics of the peripheral blocks described in Section III-C. The accumulator-adder, comparator, and averaging circuits were synthesized with the Synopsys Design Compiler using the TSMC 40 nm low-power PDK-based standard IP libraries. Decoding and sensing for the CiM operation were evaluated at the circuit-level array using NVSim [35]. To do so, we adjusted the NVSim simulation with the data presented in Section IV-B to account for four active cells, thus modeling the CiM operation. We also replaced the single-bit sense amplifiers with multi-bit ADCs. The results for each individual component are shown in Table I.
To evaluate both predictive performance and uncertainty estimation, we trained an MLP with four layers (256 neurons per layer for the three hidden layers, the last layer depending on the dataset) and a CNN with a LeNet-5 topology on the MNIST dataset. Furthermore, a VGG topology (nine layers) is evaluated on the CIFAR-10 dataset, and the ResNet-18 topology (eighteen layers) is evaluated on both the CIFAR-10 and CIFAR-100 datasets. For all topologies, we used the Adam optimization algorithm with default settings in PyTorch to minimize the cross-entropy loss function with λ = 1 × 10⁻⁵. The model precision τ can be derived from the λ value. We trained on the MNIST dataset for more epochs (400) than on the CIFAR-10 and CIFAR-100 datasets (100 epochs) due to the network sizes. We applied RandomHorizontalFlip and RandomResizedCrop data augmentation to the CIFAR datasets to improve accuracy; no data augmentation is applied to the MNIST dataset.
Moreover, to show the scalability of our proposed approach to even larger topologies and harder tasks, several real-world biomedical image segmentation and classification datasets are evaluated on state-of-the-art topologies. For the biomedical image classification, we have trained the DenseNet-121 (121 layers) topology on the pneumonia detection dataset from chest X-Rays.
On the other hand, for biomedical image segmentation, the Digital Retinal Images for Vessel Extraction (DRIVE) [36], breast ultrasound scan (breast cancer) [37], and mitochondrial electron microscopy [38] datasets are evaluated on U-Net [39], Bayesian SegNet [40], and a Feature Pyramid Network (FPN) [41] with ResNet-50 (50 layers) as feature extractor, respectively. The DRIVE dataset comprises 40 images with a size of 584 by 565 pixels, 20 of which are used for training and the other 20 for testing. The breast ultrasound scan dataset (breast cancer) is used for the early detection of breast cancer, which is one of the most common causes of death among women worldwide. The dataset is categorized into three classes: normal, benign, and malignant images. There are a total of 780 images, with an average image size of 500 by 500 pixels. The electron microscopy dataset depicts a section of the CA1 hippocampal area of the brain with a voxel size of 5 × 5 × 5 nm, corresponding to a 1065 × 2048 × 1536 volume. Since the size of each image is too large to fit into the NN topologies, we have cut each image and its corresponding mask into 256 × 256 patches for training.
The metrics used to determine the performance of the segmentation tasks are pixel-wise accuracy, Intersection-over-Union (IoU), sensitivity, specificity, Area Under the ROC Curve (AUC), F1 score, and precision. IoU is one of the most commonly used metrics in segmentation tasks and is the ratio of the area of overlap to the area of union. The pixel-wise accuracy states the percentage of pixels in the predicted image that are classified correctly. Specificity represents the proportion of actual negative cases accurately recognized by the model. For the out-of-distribution evaluation, 10,000 inputs were used for each dataset, each with the same shape (28 × 28) as the MNIST data. The random noise in each dataset changes for each Monte-Carlo sample of the Bayesian NN.
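For reference, a minimal sketch of how these pixel-wise metrics are computed from binary masks is given below; the toy random masks are for illustration only.

import numpy as np

def segmentation_metrics(pred, target):
    # Pixel-wise metrics for binary (0/1) segmentation masks.
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.sum(pred & target)
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    return {
        "IoU": tp / (tp + fp + fn),              # area of overlap / area of union
        "pixel_accuracy": (tp + tn) / pred.size,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

pred = np.random.rand(256, 256) > 0.5            # toy prediction mask
target = np.random.rand(256, 256) > 0.5          # toy ground-truth mask
print(segmentation_metrics(pred, target))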
To evaluate the robustness of the proposed SpinDrop BayBNN to thermal fluctuations, we inject random Gaussian noise N(0, I) into the weighted sum of each layer. Furthermore, we have explored different implementations of SpinDrop in terms of dropout probability and location of the Dropout layer to check their impact on inference accuracy. For our hardware analysis, the SpinDrop module is used in all layers and all crossbars.

B. Cell-Level Simulation
To evaluate and understand the global effects of the stochasticity at the bit-cell level, an initial study of the crossbar output current variability, taking into account the various MTJ states, was conducted; an example is shown in the map of Figure 7. Our objective was to determine the range of operation (best case; worst case) of the currents and the impact of the MTJ states on the output current. Moreover, this study allows us to determine the maximum size of the crossbar that remains functional for the proposed SpinDrop BayBNN architecture. The simulation has been performed through several Monte-Carlo (i.e., 30) runs on the bit-cell, considering both MTJ and selector CMOS device variations. We extracted 15 different crossbar sensing states, as shown in Figure 7. We notice that when the current is higher, that is to say when the number of active MTJs in the crossbar is large, the deviation of the output current related to device variability also increases. Our evaluation shows that the same current values are obtained for different bitline and wordline activations. We also observe a peak current of 140 µA. These current levels will also be used for the power consumption estimates in Section IV-D.
A second evaluation has been performed, this time concerning the SpinDrop module. In order to control the switching probabilities of the MTJs, the current flowing through it is adjusted thanks to the value of Q, as shown in Section III-C1. The current varies from 80 to 150 µA for a duration of 10 ns for the SET signal. For the RESET signal, an amplitude of 300 µA for a duration of 4 ns has been used. This current has to be high enough to ensure that the MTJ is reliably switched.
This evaluation of the SpinDrop module aims at assessing how accurately the module reaches a target probability, since our objective is to quantify the effects of variability and stochasticity on that target accuracy.
To do so, we performed several Monte Carlo analyses (i.e., 100 switchings of the MTJ within 20 Monte Carlo runs, which is equivalent to 2000 Monte Carlo simulations of an MTJ). Note that the MC simulations performed here are different from the MC sampling required for Bayesian inference. The probability of dropping a neuron from the crossbar output is then equal to the MTJ switching probability set by the digital value Q, estimated as the fraction of SET attempts after which a switch is detected. As we can see in Figure 8, the generated probability is not fixed but follows a Gaussian distribution. In fact, when the probability of switching the MTJ approaches 100%, it becomes difficult to generate an accurate probability, as we are limited by the saturation regime of the transistors (as explained in Section III-C1). As can be seen in Figure 3, in this region the SpinDrop module is no longer linear and the values are quite close to one another. These cell-level evaluation results are used in the circuit-level simulation.

C. Circuit-Level Simulation
A 20 × 10 crossbar array is implemented according to the architecture presented in Figure 4. With this implementation, we intend to illustrate the impact of the Dropout concept on a larger crossbar with accurate Spice-level simulation. For this validation, the crossbar outputs are connected to a winner-takes-all (WTA) circuit [42]. A WTA circuit inhibits all of its inputs except the one corresponding to the highest current at the output of the crossbar; therefore, the prediction is directly implemented in the crossbar. To implement the proposed SpinDrop module, another peripheral circuit is added that selectively connects the activation of the previous layer to the stochastic wordline of the current layer. In Figure 8, the impact of the Dropout rate on the overall accuracy is depicted. Firstly, we notice that the overall accuracy is only slightly affected. Indeed, when a wordline is dropped, all the outputs are impacted in the same way, and the output with the highest current will still be declared the winner by the WTA circuit. Nevertheless, the Dropout clearly starts to have an impact on the accuracy around a rate of 20%. Starting from this rate, the probability of dropping all the wordlines of the crossbar increases, leading to potential classification errors. Furthermore, when coupled with the Monte Carlo process-variation simulations, the loss of accuracy could be even higher.

D. Architecture Simulation
Data from Table I were used to approximate the delay and the inference energy of a single-image forward pass for the MNIST dataset. Figure 9 shows the delay and energy consumption of both the MLP and CNN topologies, implemented in Binary and Bayesian Binary styles. In each graph, we compare the BNN values with the resources needed for the proposed SpinDrop BayBNN with 10 MC samples taken for the predictive performance and a CiM group size of Z_CIM = 4. Taking more MC samples does not improve the predictive performance, as shown in Figure 13.
Note that a BayNN implementation tends to be more computationally expensive than a conventional BNN implementation, since one must compute the expectation (average) of many inference results. To perform a single inference with the BayBNN, each layer is evaluated using multiple CiM operations to calculate the results of the matrix-vector multiplication between the input of the layer and its weights. Each of these CiM operations gives a partial sum of the final neuron output, which is generated by adding up the partial sums in the accumulator-adder module. After all partial sums are added up, the comparator is used to implement the threshold activation (Sign(x)) function. This is done for all layers to get the prediction of a single inference. When performing the BayNN inference, this procedure has to be repeated multiple times, depending on the number of inference samples, with the SpinDrop module randomly selecting the neurons during the CiM operations. To get the final result, the averaging circuit computes the running average of the MC runs. However, the main contributions to energy and delay are the individual forward passes occurring during the predictive mean computation. Therefore, the total delay and energy consumption scale linearly with the number of inference samples for the Bayesian inference.
In Table II, the proposed approach is compared with State-Of-The-Art (SOTA) implementations in terms of energy consumption, all based on the MNIST dataset. When compared to an NVM implementation [21], the proposed approach with STT-MRAM technology consumes 4.65× less energy. It should be noted that the RRAM implementation uses only 2 hidden layers, whereas our implementation is realized on a LeNet-5 topology. When considering FPGA implementations, the energy consumption is 10× lower compared to an implementation with two hidden layers [43], [44], and 20× lower compared to a LeNet-5 topology [45]. This demonstrates the energy efficiency of the proposed approach with the SpinDrop module and the MTJ crossbar array.

E. Uncertainty Estimation
Here, we have used Dropout with a Dropout probability of 20% on all the layers and crossbar array for uncertainty estimation.
1) Epistemic Uncertainty: In critical applications, any NN result must be correct and trustworthy; otherwise, an error flag should be raised. Such NNs should only predict an answer when the input distribution matches the training distribution D_train. Point-estimate NNs cannot infer "Undecided" if they receive out-of-distribution data. We found that when a BNN gets out-of-distribution data, it predicts random MNIST labels. For example, it mostly predicts the MNIST labels 0 (with a frequency of 28.28%) and 8 (with a frequency of 62.17%) when dataset D_2 (random Gaussian noise) is applied as an input. In fact, our evaluated point-estimate NN predicts 66.13% of the random Gaussian D_2 inputs and 97.77% of the uniform D_3 inputs with 100% confidence that the input is a handwritten digit. If a fail-safe NN model receives in-distribution data, it should confidently predict the correct label (i.e., the prediction probability should be close to 100%) and, for each of the MC samples, have low variance in the prediction probability. On the contrary, the variance in prediction probability for out-of-distribution data is expected to be high. We have utilized this expected behavior of a fail-safe system to introduce two metrics for detecting out-of-distribution data. Specifically, the NN only predicts when it is highly confident (prediction probability ≥ 95%) and has low uncertainty, i.e., is highly certain, in its prediction (quantile score of 10%). A 10% quantile score means that 90% of the MC samples' confidence scores are above that value, i.e., the variance is low. As a result, the proposed SpinDrop BayBNN can detect up to 100% of the out-of-distribution data from the datasets D_1 ... D_6, as shown in Figure 11.
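A minimal sketch of this decision rule is given below; applying the same 95% threshold to the 10% quantile of the per-pass confidence scores is an assumption made for illustration, as is the toy data.

import numpy as np

def predict_or_undecided(mc_probs, conf_thr=0.95, q=0.10):
    # Predict only when the mean confidence of the winning class is >= conf_thr
    # AND the 10% quantile of the per-pass confidence scores is also above the
    # threshold (low variance); otherwise flag the input as out-of-distribution.
    # mc_probs: (T, num_classes) softmax outputs of T stochastic forward passes.
    mean_probs = mc_probs.mean(axis=0)
    cls = int(mean_probs.argmax())
    quantile_score = np.quantile(mc_probs[:, cls], q)
    if mean_probs[cls] >= conf_thr and quantile_score >= conf_thr:
        return cls
    return None                                    # "undecided"

rng = np.random.default_rng(0)
mc_probs = rng.dirichlet(np.ones(10) * 0.5, size=20)   # toy MC softmax outputs
print(predict_or_undecided(mc_probs))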
To further emphasize the significance of uncertainty in detecting out-of-distribution data, we have also assessed the situation in which the model predicts when uncertainty is moderate (or moderately certain) and high (that means not certain). Nonetheless, the prediction confidence is still considered high. A quantile score of 50% (median of the MC samples) and 90% is considered for moderate and high uncertainty analysis. If the model predicts despite moderate uncertainty, its ability to detect out-of-distribution data could decrease by as much as 61.68%. However, the model can not detect any out-of-distribution data for most of the datasets when it predicts despite high uncertainty, as shown in Figure 11. This demonstrates the significance of uncertainty estimation in spotting out-of-distribution data.
Additionally, our proposed method is also robust against random data poisoning. When 20% of the MNIST validation data is poisoned with random Gaussian noise (dataset D 2 ) with random labels, our proposed Dropout based Bayesian BNN can detect 84.45% of poisoned data and achieves an accuracy of 96.28% on predicted inputs. On the contrary, the accuracy of the point-estimate NN decreases from 98.83% to 81.19%. Consequently, our proposed SpinDrop BayBNN improves inference accuracy by 15%, in addition to obtaining fail-safe properties. Data poisoning in our context is considered to occur when 1) an adversary is aware of the inference data of the model, and has the power to alter a small fraction of the inference data in order to degrade the trained model's overall accuracy, or 2) the data generation process is noisy.
We have also conducted an experiment similar to [6] with a continuously rotated image of the digit 1 on the LeNet-5 topology, as shown in Figure 12. We performed 100 stochastic forward passes and observed the softmax input (the output of the final fully connected layer) and the softmax output. As mentioned previously, the softmax output is the class probability based on the output of the NN. For the 12 images, the point-estimate BNN predicts the classes [1 1 1 0 0 5 3 3 5 2 5 7]. The figure shows that, initially, the majority of the stochastic softmax outputs for label one (the correct label) are close to one (100% confidence) and there is very low variance in the predictions. However, as the rotation increases, the softmax output for label one decreases and those of other (incorrect) labels increase. Nevertheless, the point-estimate model still predicts even though the uncertainty in the prediction is very high. In this scenario, it would be reasonable for the model to reject the prediction and request a label from an external annotator for this input. Model uncertainty can be obtained from the entropy or the variation across stochastic runs.
2) Aleatoric (Model) Uncertainty: Due to the non-idealities of spintronic technologies, the on-chip model introduces additional in-field uncertainty. For example, dynamic thermal fluctuations cause noisy weighted sums in the crossbar array. The proposed SpinDrop BayBNN can be leveraged to handle this model uncertainty and attain robustness to dynamic thermal fluctuations. In our analysis of thermal fluctuations, the inference accuracy of the BNN reduces to 90.16%, 76.41%, and 69.89%, respectively, for noise strengths of N(0, σI) with σ ∈ {3, 4, 5}. However, the inference accuracy of the proposed SpinDrop BayBNN does not reduce when 50% or fewer MC samples are noisy due to thermal fluctuations, as shown in Figure 13 (a).
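A minimal sketch of this noise-injection model (additive Gaussian noise on the pre-activation of a binary layer) is shown below; the layer sizes and the noise strength are illustrative.

import torch

def noisy_weighted_sum(x_b, w_b, sigma=1.0):
    # Thermal-fluctuation model used in the robustness analysis above:
    # additive Gaussian noise injected into the weighted sum (pre-activation)
    # of a binary layer; sigma controls the noise strength.
    z = x_b @ w_b.t()
    return z + sigma * torch.randn_like(z)

x_b = torch.sign(torch.randn(1, 256))
w_b = torch.sign(torch.randn(128, 256))
print(noisy_weighted_sum(x_b, w_b, sigma=3.0).shape)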
The model uncertainty results for the biomedical segmentation tasks are depicted qualitatively in Figure 10. The uncertainty masks show the pixel-wise uncertainty for each prediction. It can be observed that the proposed SpinDrop BayBNN method generates model uncertainty similar to the MC-Dropout method. Ideally, uncertainty is expected to be high around misclassified pixels and low around correctly classified pixels. Overall, it can be observed in Figure 10 that the uncertainty is indeed high around the misclassified pixels, while the correctly classified pixels have low uncertainty. In general, MC-Dropout produces slightly stronger uncertainty masks, as a higher Dropout probability (50%) is used.

(Fig. 11 caption: Capability of the proposed method in detecting out-of-distribution (OoD) datasets D_1 to D_6, evaluated on a model trained for the MNIST dataset with a Dropout probability of 15%.)

F. Predictive Performance

1) Predictive Performance of BayBNN: Among the full-precision baselines, the MC-Dropout model [6] with ReLU activation is slightly better than the others. This analysis shows that a binarized BayNN can still achieve predictive performance comparable to full-precision BayNNs. However, the hardware implementation of our solution may lead to a smaller area and a better power-performance product thanks to simpler activation functions and binary weights.
Here, a smaller dropout probability of 20% is used in our analysis. Additionally, we have explored several implementations with different locations of the Dropout layer; they are discussed further in Section IV-F3. Furthermore, our proposed BayBNN achieves inference accuracy comparable to the state-of-the-art (SOTA) point-estimate BNN algorithms for the VGG and ResNet-18 CNN topologies on CIFAR-10. On the CIFAR-100 dataset, our method outperforms the SOTA point-estimate BNN algorithm by 3.6%, as summarized in Tables III and IV. In the worst-case scenario, performance is only 0.57% and 2.3% worse than the SOTA IR-Net [46] and DIR-Net [48], respectively, on the ResNet-18 topology. In general, we have found that taking T Monte-Carlo samples and averaging them increases the inference accuracy of NN models trained both with and without Dropout. For the biomedical image classification task, our proposed method outperforms both the full-precision and the binarized baselines by up to 3.04%, as depicted in Table VI. Our method is trained with a lower dropout probability (10%) compared to the MC-Dropout method. When a higher dropout probability is used, e.g., 20%, the performance of our method is reduced by 3%. Similar to the other datasets, we have found that using Monte Carlo sampling for the Bayesian inference improves the predictive performance. We have used 20 samples (T = 20) for Bayesian inference.
Moreover, for the biomedical image segmentation tasks, the proposed method performs similarly to the 32-bit full-precision MC-Dropout method on all metrics, with up to a 1.09% improvement in IoU score. In the worst case, our method achieves a 3.65% lower IoU score. The results are summarized in Table V. Similar to the other datasets, we have taken 20 samples (T = 20) for Bayesian inference. We used higher-precision 4-bit activations for the segmentation tasks, as they are much more difficult than classification tasks. We have used the algorithm proposed in [50] for activation quantization. Since the weights are still kept at 1 bit, no changes to the crossbar structure are required; only peripheral changes, such as a higher-resolution ADC, are needed.
Two examples of predictive performance for each dataset are shown in Figure 10, where the performance of SpinDrop BayBNN is also compared qualitatively to the full-precision MC-Dropout method. Our observations show that the proposed SpinDrop BayBNN performs similarly to MC-Dropout, i.e., it predicts similar segmentation masks. In general, most of the misclassified pixels are on the boundary of the ground-truth masks.
2) Performance of BayBNN With SpinDrop: Although the predictive performance of BayBNN (algorithmic, i.e., no variations in the SpinDrop module) is comparable to the full precision implementation, it should also be robust to the variations of the Spintronic-based implementation, SpinDrop. The evaluations show that the predictive performance of the proposed BayNN with SpinDrop considering variations is also comparable to algorithmic Dropout, as shown in Table VIII for MNIST and CIFAR-10 datasets on LeNet-5 and VGG topologies, respectively. We have performed two experiments to evaluate the robustness of the proposed approach to variations in the SpinDrop module. In one case, we trained a NN with normal Dropout but during Bayesian inference, we evaluated the model against our proposed SpinDrop Dropout with up to 3× the standard deviation σ of the manufacturing variations. In this case, the predictive performance of both MNIST and CIFAR-10 datasets improves slightly (+0.66%) compared to the original algorithmic Dropout BayBNN. Here, the σ of the SpinDrop module is as high as 3.3% (3×). That means the Dropout probability of each neuron can fluctuate by ±10% from the trained value, without any impact on the performance. In our experiments, SpinDrop with a dropout probability of 20% is used for all the layers and all crossbars.
In the other case, the NN is trained with the SpinDrop technique instead of the original Dropout, so that it already experiences a Dropout probability p̂ = p + ϵ, ϵ ∼ N(µ, σ²), during training. In this case, the predictive performance improves slightly more; e.g., an accuracy improvement of 1% is achieved. Variation in the dropout probability leads to more sparsity during Bayesian inference; as a result, the accuracy improves slightly.
3) Analysis of Dropout Rate and Location: We have used a smaller Dropout rate of 20% in our analysis for both MNIST and CIFAR-10 datasets. The results for other smaller dropout rates, e.g., from 10% to 20% are summarized in Table IX. Here, for each Dropout probability p, we have taken up to 50 MC samples for the Bayesian inference. Our results show that the predictive performance improved for all the Dropout rates by up to 1.7% compared to BNNs.
Usually, a default Dropout probability of 50% is used during training. Although this improves the predictive performance of full-precision NNs, it does not improve the performance of binary NNs, as shown in Figure 14: using a 50% Dropout probability achieves lower validation accuracy than a 20% Dropout probability or no Dropout at all. We therefore used lower dropout rates in our evaluation of the predictive performance. In all NN topologies and classes, the location of the Dropout layer is very important. In all cases in our analysis, the Dropout layer is applied after the Batch Normalization and Sign(x) layers; otherwise, the effect of Dropout is canceled out. In the MLP topology, Dropout was applied in all hidden layers. In contrast, the location of Dropout in a CNN depends on the specific topology. In our experiment with the ResNet-18 topology, Dropout is applied only to the last few layers, which have a large number of parameters; applying Dropout to all hidden layers of ResNet-18 is found to decrease performance. Similar to the MLP, Dropout was applied to all hidden layers in the VGG and LeNet-5 topologies.

V. CONCLUSION
In this paper, we propose for the first time the algorithmic groundwork for a Dropout-based Bayesian binary neural network and the corresponding CiM-based implementation in STT-MRAM. For this purpose, the stochastic and deterministic aspects of STT-MRAM have been combined in a crossbar-array-based architecture. The stochastic behavior of the STT-MRAM is leveraged for the hardware implementation of the Dropout required for Bayesian NNs, while its deterministic behavior is exploited for the NN weight storage. The results show up to 100% detection capability for out-of-distribution data and up to 15% improvement in accuracy for poisoned data. Furthermore, our results show the high resilience of the proposed concept to process and thermal variations. The combination of the algorithmic Bayesian approach with the cost-effective and energy-efficient implementation of CiM-based Binary NNs enables reliable and more accurate deep learning architectures and their usage in critical applications.