A Non-Volatile All-Spin Analog Matrix Multiplier: An Efficient Hardware Accelerator for Machine Learning

We propose and analyze a compact and non-volatile
nanomagnetic (all-spin) analog matrix multiplier performing the multiply-and-accumulate (MAC) operation using two magnetic tunnel junctions –
one activated by strain to act as the multiplier, and the other activated by
spin-orbit torque pulses to act as a domain wall synapse that performs the
operation of the accumulator. Each MAC operation can be performed in ~1 ns and
the maximum energy dissipated per operation is ~100 aJ. This provides a very
useful hardware accelerator for machine learning (e.g. training of deep neural
networks), solving combinatorial optimization problems with Ising type
machines, and other artificial intelligence tasks which often involve the
multiplication of large matrices. The non-volatility allows the matrix
multiplier to be embedded in powerful non-von-Neumann architectures.


I. INTRODUCTION
A RTIFICAL intelligence (AI) is pervasive and ubiquitous in modern life (smart cities, smart appliances, autonomous self-driving vehicles, information processing, speech recognition, patient monitoring, etc.). Estimates by OpenAi predict an explosive growth of computational requirements in AI by a factor of 100  every two years, which is a 50  faster rate than Moore's law governing the evolution of the chip industry [1]. Most AI applications leverage machine learning (or deep learning based on neural networks) to perform two primary functions -training and inference. Algorithms for these tasks require multiplication of large matrices, such as in updating the synaptic weight matrices in deep learning networks, which is an essential feature of training a neuronal circuit, solving combinatorial optimization problems ______________________________________ Hardware accelerators that can perform matrix multiplications rapidly and efficiently are therefore very attractive since they can speed up AI tasks immensely. They are particularly useful in computer vision [2], image and other classification tasks [3], approximate computing [4], speech recognition [5], patient monitoring [6] and biomedicine [7]. The earliest ideas for devising hardware-based matrix multipliers date back to 1909. Percy Ludgate conceived of a machine made of mechanical parts that was understandably unwieldy, slow and unreliable [8]. Modern matrix multipliers employ electronic charge-based circuitry that are fast, convenient and reliable [9], but also energy-hungry and volatile, i.e. they lose all information once powered off. Recently, matrix multipliers have been implemented with optical networks [10,11], which can be extremely energyefficient and fast, but their drawback is the large footprint. They too are usually volatile since they use capacitors. In this paper, we present an all-magnetic (all-spin) implementation of a matrix multiplier, which is energy efficient, fast and has a much smaller footprint than its optical counterparts. Its most important advantage is that it is non-volatile and hence the matrix products can be stored indefinitely in the device after powering off.
Consider the matrix multiplication operation . This operation consists of multiplying pairs of numbers (one member of the pair picked from a row of one matrix and the other from a column of the other matrix) and then adding up the products of the pairs to produce an element of the product matrix. Thus, one would need: (1) a "multiplier" to multiply pairs of numbers, and (2) an "accumulator" (which accumulates the individual products and adds them up). These are the two ingredients of a hardware accelerator for matrix multiplication. In this work, we implement the multiplier with a single straintronic magnetic tunnel junction (MTJ) and the accumulator with another magnetic tunnel junction (driven by spin-orbit torque) acting as a domain wall synapse [12]. Each MTJ can have a footprint of ~(100 nm) 2 , and with all the peripherals, the footprint of the entire device can be < 1 m 2 . The matrix multiplier can operate at clock rates of ~GHz and dissipate ~100 aJ of energy per multiply-and-accumulate (MAC) operation. In the next two sections, we describe the multiplier and the accumulator.

II. MULTIPLIER
A schematic of the proposed multiplier is shown in Fig. 1. It consists of an elliptical MTJ that has a (magnetically) "hard" layer and a "soft" layer, separated by an intervening insulating spacer layer. Any residual dipole interaction between the hard and the soft layer creates an effective magnetic field Hd in the soft layer that is directed along the latter's major axis (easy axis) in a direction opposite to the magnetization of the hard layer. The soft layer is magnetostrictive and placed in elastic contact with an underlying poled piezoelectric thin film deposited on a conducting substrate (this construct constitutes a 2-phase multiferroic). Two electrically shorted electrodes, delineated on the piezoelectric film, flank the MTJ, while the back of the substrate is connected to ground. When a (gate) voltage V G is applied to the shorted electrode pair, it generates biaxial strain in the piezoelectric film pinched between the two electrodes, which is transferred to the elliptical soft layer. The strain is either compressive along the major axis and tensile along the minor axis of the soft layer, or vice versa, depending on the voltage polarity [13]. With the right voltage polarity, these strains rotate the soft layer's magnetization away from the major axis of the ellipse (the easy axis) towards the minor axis (hard axis) because of the Villari effect. The rotation is arrested midway by the magnetic field Hd and hence the magnetization ultimately settles into a steadystate orientation that subtends some angle  ss with the major axis (or the magnetization of the hard layer). The value of  ss depends on the applied strain and H d . The hard layer's magnetization remains unaffected. Since the resistance of the s-MTJ depends on the angle  ss between the magnetizations of the hard and the soft layers, the strain changes the MTJ resistance since the value of H d is fixed. This is the operational basis of a "straintronic" MTJ (s-MTJ), whose basic function was demonstrated in [14].
To implement the multiplier, a constant current source I bias is connected between the hard and soft layers of the s-MTJ (terminals '1' and '2'), as shown in Fig. 1(a). This drives a current through the s-MTJ. The gate voltage V G is applied at terminal '3' to generate the strain in the soft layer, and a fourth terminal is connected to the hard layer (common with terminal '1'), which outputs a voltage V 0 . Terminal 2, connected to the soft layer, is grounded and hence 0 is the resistance of the s-MTJ that can be altered by the gate voltage V G generating strain, as explained before.

A. Rotation of the soft layer's magnetization due to the gate voltage
We have modeled the rotation of the soft layer's magnetization as a function of the gate voltage V G in the presence of H d and thermal noise using stochastic Landau-Lifshitz-Gilbert simulations [15]. This allows us to find the  ss versus V G relation. The s-MTJ resistance is given by    3 3 characteristic, which we show qualitatively in Fig. 1(b). With proper choice of the s-MTJ parameters, we can produce a linear region in the G s-MTJ vs. V G characteristic where We show this analytically in the Appendix. In Fig. 2, we plot the  ss versus V G characteristics obtained from the stochastic Landau Lifshitz Gilbert simulation and the resulting G s-MTJ versus V G plot. The simulation procedure is described in ref. [15] and the Appendix. The parameters for the elliptical soft layer of the s-MTJ used in the simulation are given in Table I. The soft layer is assumed to be made of Terfenol-D, which has large magnetostriction. The piezoelectric film is assumed to be (001) PMN-PT which has a large piezoelectric coefficient. The plot in Fig. 2 . When the gate voltage V G is chosen to be in that region, one can perform an analog multiplication of two input voltages V in1 and V in2 encoding the two matrix elements that are to be multiplied. We elucidate this in the next subsection.  Table I.

B. Operation of the multiplier
To understand how the multiplier works, refer to Fig. 1(c) and note that 1 That implements a "multiplier" since the current I out flowing through the s-MTJ (which is also the current through the series resistor R) is proportional to the product of the two input voltages V in1 and V in2 . The voltage out V is proportional to this current and hence it too is proportional to the product . Similar ideas were used to design probability composer circuits for Bayesian inference engines in the past [16]. In our case, V in1 and V in2 are voltage "pulses" of fixed width and varying amplitude. Their amplitudes are proportional to the two matrix elements to be multiplied. Note from Fig. 2(b) that the linear region in the plot extends over a voltage range of ~100 mV. Therefore, for this choice of parameters, the amplitude of the V in1 pulse should be no more than ~50 mV. Since we would like the two voltage pulses V in1 and V in2 to have similar limits on the amplitude, both should have an amplitude no more than 50 mV. We can, of course, increase the voltage range by redesigning with different parameters, but that will increase the energy dissipation per MAC operation.

III. ACCUMULATOR
Next, imagine that the resistor R of Fig. 1(c) is a heavy metal (HM) strip, on top of which we place a p-MTJ (which is an MTJ whose ferromagnetic layers have perpendicular magnetic anisotropy) with the soft layer in contact with the HM strip. We can insert a thin insulating layer and a thin metallic layer between the soft layer and the heavy metal, which will not 4 4 impede the operation of the accumulator. This configuration is shown in Fig. 3(a). The current pulses I out pass through the heavy metal strip (which is the resistor R) and because of spinorbit interaction in that strip, they inject spins into the soft layer of the p-MTJ (through the thin insulating and metallic layers). That causes domain wall motion in the latter during each pulse because of spin orbit torque due to the spin Hall effect [17][18][19]. The distance a domain wall moves over the duration of a pulse is approximately proportional to the amplitude of the pulse since the domain wall velocity is proportional to the current density. The arrangement is shown in Fig. 3(b). After any number of pulses, a fraction of the soft layer will have its magnetization parallel to that of the hard layer, a small fraction will be un-magnetized and will be the "domain wall" separating two domains, and the remainder of the soft layer will have its magnetization antiparallel to that of the hard layer. The fractions with parallel and anti-parallel magnetizations change with successive current pulses flowing through the heavy metal.  Fig. 1. (c) The conductance of the p-MTJ is the conductance of the parallel combination of three conductors associated with the anti-parallel configuration, domain wall interface, and parallel configuration. This is the well-known basis of a domain wall synapse [12]. Here, we have used a p-MTJ in the spirit of ref. [12], but there is no reason why an MTJ with in-plane magnetic anisotropy cannot be used instead.
The conductance of the p-MTJ (measured between its hard and soft layers) is the conductance of the parallel combination of three conductors corresponding to the parallel configuration of the p-MTJ, the domain wall (DW) interface and the antiparallel configuration [12], as shown in Fig. 3(c). If the domain wall in the soft layer of the p-MTJ is located at a distance x from one edge and L is the length of the soft layer (excluding the domain wall width, which is w), then [12]  where G P is the p-MTJ conductance in the parallel state, G AP is the conductance in the antiparallel state and G DW is the conductance associated with the domain wall in the soft layer.

A. Operation of the accumulator
To understand how the accumulator works, consider the fact that the amplitudes of the voltage pulses V in1 and V in2 are proportional to the two matrix elements a and b that are to be multiplied. The pulses all have a fixed width of t. The current The i-th current pulse will move the domain wall by an amount and v i is the domain wall velocity imparted by the i-th current pulse. The domain wall velocity is proportional to current density for low densities [18] and is hence proportional to the amplitude of the current pulse. Therefore, from Equation (4), we get The last equation is an important result showing that the amount by which the domain wall moves after each pulse is proportional to the product of the two numbers a and b. Since  constants. Finally, from Equation (6), we obtain Fig. 4 shows the composite system that constitutes the allspin matrix multiplier. In addition to the multiplier shown in Fig. 1(c) and the accumulator shown in Fig. 3(a), we use a voltage source V s proportional to 1/B, a conductor whose conductance is equal to A, and another conductor whose which is proportional to the (i, j)-th element of the product matrix. The voltage dropped over the last conductor is proportional to this current and hence proportional to the (i, j)th element of the product matrix c ij . We just have to measure this voltage after the pulse sequence has ended (i.e. one row has been multiplied with one column) to obtain a voltage proportional to c ij , which is the result of multiplying the i-th row of the first matrix with the j-th column of the second. After obtaining c ij , the domain wall synapse is reset with a magnetic field or a reverse current pulse to make x = 0, and then the process is repeated to obtain the product of multiplying another row of the first matrix with another column of the second (which would be the next element of the product matrix).

B. Energy dissipation
The energy dissipation incurred during the rotation of a nanomagnet's magnetization due to strain is very smalltheoretically around 1 aJ at room temperature [15], while the energy dissipation associated with domain wall motion will be on the order of 2 where I is the current inducing the domain wall motion, R is the resistance of the heavy metal strip and t is the pulse width. There is some additional dissipation in the passive resistors, but they can be made arbitrarily small by choosing the bias voltages to be small. We will neglect any other dissipation due to domain wall viscosity, which would be comparatively smaller. Therefore, the energy dissipated during each MAC operation is ~2 I R t  . We will assume that the HM strip has a width of 50 nm and thickness 5 nm (cross-sectional area = 250 nm 2 ) and length 100 nm. Hence its resistance is R = 40 ohms, if it is made of Pt whose resistivity is 10 -7 ohm-m. From Fig. 1(c) we see that the current through the heavy metal strip will have a maximum value of will have a maximum value of ~ 50 A since V in2 (max) ~ 50 mV and R P = 1 kW. Assuming a pulse width t = 1 ns, the maximum energy dissipation per multiply-and-accumulate (MAC) operation is This is a small energy price to pay for the small footprint and the non-volatility of this device.

IV. CONCLUSION
We have shown how to implement a matrix multiplier with two MTJs, passive resistors and some bias sources. The energy dissipation per multiply and accumulate (MAC) operation is much smaller than what would be encountered in traditional electronic implementations, although not as small as in optical implementations [10]. Our matrix multiplier is also not as fast as optical implementations, or even electronic implementations, but it is non-volatile and will retain the result of the operation (i.e. the matrix element cij) indefinitely after powering off. The non-volatility is a major advantage since it will allow most or all computing to be performed at the edge without the need to access the cloud. This reduces the likelihood of hacking, data loss, intrusion and eavesdropping. Cybersecurity is critical for artificial intelligence and the ability to perform all or most computing at the edge offers increased protection against cyber threats.
The extremely low energy dissipation also offers protection against hardware Trojans, which are disastrous for AI and are very hard to detect. Trojans, however, have an effect on the power consumption and therefore can be detected with a technique called side channel analysis [20], which searches for anomalies in power consumption. A low power matrix multiplier, which consumes very little power itself, will exacerbate power anomalies and facilitate Trojan detection.

A.1:
We consider the elliptical soft layer of a straintronic MTJ as shown in Fig. 5. This figure shows the axis designation with the z-axis along the major (easy) axis of the soft layer and y-axis along the minor (hard) axis. We will assume that the hard layer's magnetization is along its own easy axis and is pointing along the +z-direction. In that case, the polar angle  shown in Fig. 5 is the angle between the magnetizations of the hard and soft layers of the s-MTJ. Ref. [15] showed that the stochastic Landau-Lifshitz-Gilbert equation yields the following equations to describe the temporal evolution of the polar and azimuthal angles of the magnetization vector () in the soft layer in the presence of thermal noise: Here M s is the saturation magnetization of the soft layer,  is the uniaxial stress generated in the soft layer along the major axis by the applied gate voltage V G (we neglect the strain generated along the minor axis since it is much smaller),  s is the saturation magnetization of the soft layer's material,  is the soft layer's volume, L maj is the length of the major axis, L min is the length of the minor axis and d is the thickness of the soft layer. The quantity  0 is the permeability of free space and  B is the Bohr magneton. The field h i (t) [i = x, y, z] is the random magnetic field due to thermal noise and      Table I to find the steady state value of  (i.e.  ss ) as a function of V G . This is shown in Fig. 2(a). Thermal noise introduces some randomness in the magnetization trajectory, and hence we find  ss as a function of V G by averaging over 100 trajectories. This yields the plot in Fig. 2(a) and ultimately the plot in Fig.  2 where  ss is the steadystate angle between the magnetizations of the hard and the soft layer at any given stress (or, equivalently, any given V G ). From ref. [15], we obtain that the magneto-static energy in the plane of the nanomagnet (i. e. when  = 90 0 ) for any magnetization orientation and at any given stress is   where H d is the effective magnetic field in the soft layer due to any residual dipole coupling with the hard layer. As mentioned earlier, this field is antiparallel to the magnetization of the hard layer. The strength of this field can be tailored by engineering the material composition of the hard layer, which is usually made of a synthetic antiferromagnet. It can also be adjusted with an external in-plane magnetic field, if needed. The steady state value of the angle  is that where the magneto-static energy is minimized. Taking the derivative of Equation (A3) with respect to  and setting it equal to zero, we find the angle where the energy is minimum, and it corresponds to the steady state value ss. We get   Using the values in Table I,

G V  
and that is what we observe in Fig. 2 and hence 1 1  . Thus, we have derived the existence of the linear region in the G s-MTJ vs. V G characteristic analytically and found that it exists when G V   is close to  .
Since  = 0.26 V and  = -0.001 V, while R AP = 2 k, we find that  = -0.96 (k-V) -1 and  = -0.261 V. This value of  shows excellent agreement with what we obtained in Fig. 2(b), 8 8 but  is larger in magnitude by more than a factor of 2, which is still acceptable within the limits of the approximations used to derive this analytical result.

A. Maximum current pulse amplitude
The maximum current that flows through the heavy metal strip was calculated as 50 A. The corresponding current density through a 250 nm 2 cross-section is 2  10 11 A/m 2 . If v d is the domain wall velocity at that current density (which is material dependent), then the maximum domain wall displacement caused by the maximum current pulse of duration 1 ns is v d t with t = 1 ns. Since the soft layer of the p-MTJ is 100 nm long, it can sustain N = 100 nm/v d t current pulses before the domain wall moves completely through it. Hence the largest matrix size that can be handled is N N  .

B. Digital (non-binary) multiplier
If we wish to use this device as a digital, but not just binary, multiplier (meaning its elements can have integral values that are not just 0 and 1), then we need to know what is the largest digit we can have as a matrix element. That depends on how small we can make the quantization step size when we digitize. The minimum step size is, say, twice the thermal noise voltage appearing at any input terminal and that is 2 where C in is the input terminal capacitance [21]. We can reasonably assume that C in ~ 1fF when we factor in line capacitances. This makes the minimum step size ~4 mV at room temperature. Hence the largest digit that we can encode is 50 mV/4 mV = 12. We can, of course, increase this number by using optimized design where the amplitude of the voltage pulses can exceed 50 mV. This would require decreasing .
Here, however, we were interested in demonstrating just the basic principle and hence have not focused on design optimization.
We can also calculate the current density through the HM strip at the minimum step size of 4 mV. This is ~ 4 A. The corresponding current density is 4 A/250 nm 2 = 1.6 10 10 A/m 2 , which is more than enough to induce domain wall motion in many materials [22]. Hence, the smallest digit is 1 since the current pulse corresponding to this digit can also induce domain wall motion.