Spin Wave Based Approximate 4:2 Compressor

In this paper, we propose an energy efficient SW based approximate 4:2 compressor comprising a 3-input and a 5-input Majority gate. We validate our proposal by means of micromagnetic simulations, and assess and compare its performance with that of state-of-the-art SW, 45 nm CMOS, and Spin-CMOS counterparts. The evaluation results indicate that the proposed compressor consumes 31.5% less energy than its accurate SW design version. Furthermore, it has the same energy consumption and error rate as the approximate compressor with Directional Coupler (DC), but it exhibits 3x lower delay. In addition, it consumes 14% less energy, while having a 17% lower average error rate, than the approximate 45 nm CMOS counterpart. When compared with the other emerging technologies, the proposed compressor outperforms the approximate Spin-CMOS based compressor by 3 orders of magnitude in terms of energy consumption while providing the same error rate. Finally, the proposed compressor requires the smallest chip real-estate measured in terms of devices.


I. INTRODUCTION
The information technology revolution has led to a rapid increase in raw data, the processing of which calls for high performance computing platforms [1]. To date, downscaling Complementary Metal Oxide Semiconductor (CMOS) technology has been effective in satisfying these requirements; however, Moore's law has nearly reached its economical end as CMOS feature size reduction is becoming increasingly difficult due to leakage, reliability, and cost walls [2]. As a result, different technologies have been investigated to replace CMOS, such as graphene devices [3], memristors [4], and spintronics [5]. In this paper, we study one type of spintronics technology, the Spin Wave (SW) technology, which appears to open the way towards the most energy efficient digital computing paradigm [6][7][8][9]. SW based computing is promising for three main reasons [6][7][8][9]: 1) it has ultra-low energy consumption potential because it does not rely on electron movement but just on electron spinning around the magnetic field orientation, 2) it is highly scalable because the SW wavelength (which is the distance between two electrons that exhibit the same behavior) can reach the nanometer scale, and 3) it has an acceptable delay. As a consequence of these promising features, different research groups have made use of SW interaction to build logic gates and circuits.
The first experimental SW logic gate is an inverter, designed by utilizing a Mach-Zehnder interferometer [10]. Moreover, the Mach-Zehnder interferometer has been used to build single-output Majority, (N)AND, (N)OR, and X(N)OR gates [10], while multi-output SW logic gates have been introduced in [9,11,12]. Furthermore, multi-frequency logic gates that enhance SW computing and storage capabilities have been proposed in [8,13], and wave pipelining has been achieved with pulse mode operation in the SW domain by utilizing four cascaded Majority gates [14]. In addition, different SW circuits have also been demonstrated at the conceptual level [15], at the simulation level [7,16], and as practical millimeter scale prototypes [17]. All the aforementioned logic gates and circuits were designed to provide accurate results; however, many applications such as multimedia processing and social media are error-tolerant, and within certain error limits, they still function correctly [18]. Hence, those applications can benefit from approximate computing circuits, which save significant energy, delay, and area.
Based on the previous discussion on the SW technology potential and the approximate computing benefits, one can conclude that SW approximate circuits are of great interest. In view of this observation, and given that multiplication is heavily utilized in error tolerant applications and that fast state-of-the-art multipliers are built with 4:2 compressors [19], we introduce in this paper a novel approximate SW 4:2 compressor. The paper's main contributions can be summarized as follows:
• Developing and designing an approximate SW 4:2 compressor: We propose an approximate 4:2 compressor consisting of two Majority gates that provides an average error rate of 31%.
• Enabling directional coupler free approximate circuit design: We demonstrate that Majority gates can be directly cascaded, i.e., without amplitude normalization or domain conversion, to form a 4:2 compressor with no additional average error rate penalty.
• Validating the proposed 4:2 Compressor: We demonstrate by means of MuMax3 micromagnetics simulations the correct functionality of the proposed approximate 4:2 compressor.
• Demonstrating the superiority: The proposed approximate SW 4:2 compressor performance is assessed and compared with state-of-the-art SW, 45 nm CMOS, and Spin-CMOS counterparts. The evaluation results indicate that the proposed compressor saves 31.5% energy in comparison with the accurate SW design, whereas it has the same energy consumption and error rate as the approximate compressor with Directional Coupler (DC) while being 3x faster. In addition, the proposed compressor consumes 14% less energy while providing a 17% lower error rate when compared with the approximate 45 nm CMOS counterpart. Moreover, the proposed compressor outperforms the approximate Spin-CMOS equivalent design by 3 orders of magnitude in terms of energy while having the same error rate. Finally, the proposed compressor requires the smallest chip real-estate.
The rest of the paper is organized as follows. Section II explains the SW computing background. Section III introduces the proposed approximate 4:2 compressor and Section IV provides insight into the simulation setup and results. Section V reports performance evaluation and comparison with state-of-the-art data. Section VI concludes the paper.

II. COMPUTING PARADIGM
The magnetization dynamics caused by the magnetic torque when the magnetic material magnetization is out of equilibrium are captured by the Landau-Lifshitz-Gilbert (LLG) equation

$$ \frac{d\vec{M}}{dt} = -|\gamma| \mu_0 \left( \vec{M} \times \vec{H}_{eff} \right) + \frac{\alpha}{M_s} \left( \vec{M} \times \frac{d\vec{M}}{dt} \right), \qquad (1) $$

where γ is the gyromagnetic ratio, µ_0 the vacuum permeability, M the magnetization, M_s the saturation magnetization, α the damping factor, and H_eff the effective field, which consists of the external field, the exchange field, the demagnetizing field, and the magneto-crystalline field.
Equation (1) has wave-like solutions under small magnetic disturbances, which are called Spin Waves (SWs) and are the collective excitations of the magnetization within the magnetic material [6]. A SW, as any other wave, is described by its amplitude A, phase φ, wavelength λ, wavenumber k = 2π/λ, and frequency f, as graphically presented in Figure 1a) [6]. SW frequency and wavenumber are linked by the so-called dispersion relation, which plays a fundamental role during the SW circuit design process [6].
Generally speaking, information can be encoded in SW amplitude and phase at different frequencies [6,8], while the interaction between SWs coexisting in the same waveguide is governed by the interference principle. Figure 1b) presents two SW interaction situations: if the SWs have the same phase, i.e., ∆φ = 0, they interfere constructively, resulting in a larger amplitude SW, whereas if they have opposite phases, i.e., ∆φ = π, they interfere destructively, resulting in a diminished amplitude SW. Due to their very nature, SWs provide natural support for Majority function evaluation, as the interference of an odd number of SWs emulates a Majority decision. For instance, if 3 SWs with the same amplitude, frequency, and wavelength interfere, the result is a 0 phase SW (logic 0) if no more than one of them has a π phase, and a π phase SW (logic 1) otherwise, which is equivalent to the behavior of a 3-input Majority gate. Note that a CMOS 3-input Majority gate implementation requires 18 transistors, while in SW technology it only requires one waveguide. We note that if the SWs have different A, λ, and f, their interaction results in more sophisticated interferences, which might open different SW based computation paradigms. However, in this paper, we

Threshold detection relies on the comparison of the output SW amplitude with a predefined threshold value T, i.e., if the SW amplitude is larger than T, the output is logic 1, and 0 otherwise [6,20].
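The interference-based Majority behaviour described above can be illustrated with a small numerical sketch. It is an idealized model, not the micromagnetic simulation used in the paper: each SW is treated as a unit-amplitude phasor with phase 0 (logic 0) or π (logic 1), damping and dispersion are ignored, and the helper names (`interfere`, `phase_detect`, `threshold_detect`) as well as the threshold value are assumptions made for illustration.

```python
import math
from itertools import product

def encode(bit):
    """Phase encoding assumed here: logic 0 -> phase 0, logic 1 -> phase pi."""
    return 0.0 if bit == 0 else math.pi

def interfere(phases, amplitude=1.0):
    """Superpose equal-frequency, equal-amplitude waves; return (amplitude, phase)."""
    re = sum(amplitude * math.cos(p) for p in phases)
    im = sum(amplitude * math.sin(p) for p in phases)
    return math.hypot(re, im), math.atan2(im, re)

def phase_detect(phases):
    """Phase detection: logic 0 if the result is closer to phase 0, else logic 1."""
    _, phi = interfere(phases)
    return 0 if abs(phi) < math.pi / 2 else 1

def threshold_detect(phases, threshold):
    """Threshold detection: logic 1 if the resulting amplitude exceeds T."""
    amp, _ = interfere(phases)
    return 1 if amp > threshold else 0

def majority(bits):
    """Reference Boolean Majority for an odd number of inputs."""
    return 1 if sum(bits) > len(bits) // 2 else 0

# Check that idealized interference reproduces 3- and 5-input Majority.
for n in (3, 5):
    for bits in product((0, 1), repeat=n):
        assert phase_detect([encode(b) for b in bits]) == majority(bits)
```

In this idealized picture the sum of an odd number of 0/π phasors never cancels completely, so the phase of the result always identifies the majority value; threshold detection instead distinguishes strong from weak constructive interference, and its outcome depends on the chosen T.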


III. SW APPROXIMATE 4:2 COMPRESSOR
For many state-of-the-art applications that heavily rely on multiplications, e.g., artificial neural networks, machine vision, and event detection such as visual surveillance and people counting, the availability of fast multipliers is essential. Wallace or Dadda tree multipliers are the fastest and can perform a multiplication within 2 clock cycles. They embed 3 stages, i.e., partial product generation, reduction tree, and carry propagate adder. In an n-bit multiplier the first stage requires n^2 gates to produce the partial product matrix, the second stage provides a logarithmic depth reduction of n n-bit numbers to two numbers without carry propagation, and the final stage is a carry propagate adder that sums up the reduction tree outputs [21]. The n to 2 reduction has traditionally been done by means of Full Adders (FAs) and Half Adders, but n:2 compressor based reduction trees can be shallower and have a more regular layout [21]. Thus, most state-of-the-art CMOS implementations make use of 4:2 compressors, for which implementations faster than 2 cascaded FAs exist [19,22,23]. Essentially speaking, a 4:2 compressor processes 4 dots in the same column and generates one dot in the current column and a carry to the next column. To properly preserve the value carried by the inputs, after one FA delay, the 4:2 compressor generates a transport to the next column and receives a transport from the previous position, which it further processes to generate the sum and a carry for the next column. Thus, the compressor has 5 inputs (one of them coming from the previous column), 2 real outputs, and one intermediate transport to the next column. Given that multiplication dominated error tolerant applications exist, e.g., multimedia processing and social media [18], approximate CMOS 4:2 compressors have been proposed [19], which enable significant energy and area savings.
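As a reference point for the accurate 4:2 compressor behaviour described above, the following sketch models it as two cascaded Full Adders and checks that the value carried by the five inputs is preserved. The function and variable names are illustrative assumptions; the transport and carry outputs are simply called `transport` and `carry` here.

```python
from itertools import product

def full_adder(a, b, c):
    """Exact Full Adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def compressor_4_2(i1, i2, i3, i4, cin):
    """Accurate 4:2 compressor built from two cascaded Full Adders.

    The first FA produces the transport to the next column; it does not
    depend on cin, so there is no carry propagation along the row.
    """
    s1, transport = full_adder(i1, i2, i3)
    s, carry = full_adder(s1, i4, cin)
    return s, carry, transport

# Value preservation: the five input dots equal S plus two weight-2 outputs.
for i1, i2, i3, i4, cin in product((0, 1), repeat=5):
    s, carry, transport = compressor_4_2(i1, i2, i3, i4, cin)
    assert i1 + i2 + i3 + i4 + cin == s + 2 * (carry + transport)
```

The assertion holds for all 32 input combinations, which is the property an approximate compressor gives up in exchange for lower energy, delay, and area.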
Figure 3 presents the approximate compressor obtained by cascading two approximate FAs by means of a normalizer (directional coupler). However, the directional coupler induces substantial delay and area overheads, which makes working without it desirable. Therefore, we propose the novel directional coupler free approximate compressor depicted in Figure 4. The behaviour of the 2 directly cascaded FAs is now obtained with a 3-input Majority gate and a 5-input Majority gate computing C_o1 = MAJ(X, Y, C_i) and S = C_o2 = MAJ(I_1, I_2, I_3, I_4, C_in), respectively. The proposed 4:2 approximate compressor generates C_o1 without any error, and S and C_o2 with an average error rate of 31.25% and 18.75%, respectively. Table I presents the truth table of the accurate 4:2 compressor (C_o1, S_ac, and C_o2ac), the approximate 4:2 compressor without directional coupler (C_o1, C_o2ap1, and S_ap1), and the approximate 4:2 compressor with directional coupler (C_o1, C_o2ap2, and S_ap2). As it can be observed from the

To achieve proper functionality for the structure in Figure 4, the SWs must be excited at the same amplitude, wavelength, and frequency, and the waveguide lengths must be accurately computed as they determine the SW interaction modes. For example, if SW constructive (destructive) interference is envisaged for in phase (out of phase) SWs, the distances must be equal to n × λ, where n = 0, 1, 2, . . .; this is the case for d_1, d_3, d_4, and d_6 in Figure 4. In contrast, if SW constructive (destructive) interference is envisaged for out of phase (in phase) SWs, the distances must be equal to (n + 1/2) × λ; this is the case for d_2 and d_5 in Figure 4. On the output side, it is important to detect the output at a specific position, i.e., if the desired output is the output itself, which is the case for C_o1 in Figure 4, d_7 must be equal to n × λ, whereas if the inverted output is desired, the distance must be equal to (n + 1/2) × λ.
Moreover, the outputs must be detected as near as possible to the last interference point in order to capture a large SW amplitude.
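The distance rules above follow from the phase a wave accumulates while propagating: over a distance d it gains k × d = 2πd/λ. The short sketch below checks that a path of n × λ leaves the phase unchanged while a path of (n + 1/2) × λ inverts it. The wavelength is an arbitrary illustrative unit, not a value taken from the paper, and the helper names are assumptions.

```python
import math

def propagation_phase(distance, wavelength):
    """Phase accumulated over `distance`, reduced to the interval [0, 2*pi)."""
    return (2 * math.pi * distance / wavelength) % (2 * math.pi)

def preserves_phase(distance, wavelength, tol=1e-9):
    """True if the path adds a multiple of 2*pi (distance = n * wavelength)."""
    phi = propagation_phase(distance, wavelength)
    return min(phi, 2 * math.pi - phi) < tol

def inverts_phase(distance, wavelength, tol=1e-9):
    """True if the path adds pi modulo 2*pi (distance = (n + 1/2) * wavelength)."""
    return abs(propagation_phase(distance, wavelength) - math.pi) < tol

wavelength = 1.0  # arbitrary unit, illustrative only
for n in range(4):
    assert preserves_phase(n * wavelength, wavelength)
    assert inverts_phase((n + 0.5) * wavelength, wavelength)
```

Under this model a π shift on a path is equivalent to logically inverting the signal that travels along it, which is why the same rule decides both how inputs combine and whether an output is read directly or inverted.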
The proposed SW 4:2 compressor operation principle is as follows:
• C_o1: SWs are excited at I_1, I_2, and I_3 with the same amplitude, wavelength, and frequency at the same moment. The I_2 SW interferes constructively or destructively with the I_3 SW depending on their phase difference, the resulting SW propagates through the waveguide, and subsequently interferes with the I_1 SW. The resulting SW is captured at the output C_o1 based on phase detection.
• S and C_o2: The I_2 SW interferes constructively or destructively with the I_3 SW depending on their phase difference, and the resulting SW propagates through the waveguide to interfere with the SWs excited at I_4 and C_in. The resulting SW propagates, and subsequently interferes with the I_1 SW. Finally, the resulting SW is captured at the outputs S and C_o2 based on threshold detection.
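At the logic level, the quoted error rates for C_o1 and S can be checked by exhaustive enumeration. The sketch below assumes that the exact sum is the XOR of the five inputs, that the exact transport is the carry of a Full Adder over I_1, I_2, and I_3, and that the approximate outputs are plain 3-input and 5-input Majority functions. It does not cover C_o2, whose error rate depends on the reference truth table (Table I), which is not reproduced here.

```python
from itertools import product

def maj(*bits):
    """Boolean Majority for an odd number of inputs."""
    return int(sum(bits) > len(bits) // 2)

def exact_sum(i1, i2, i3, i4, cin):
    """Exact 4:2 compressor sum bit (parity of the five inputs)."""
    return i1 ^ i2 ^ i3 ^ i4 ^ cin

def exact_transport(i1, i2, i3):
    """Exact transport: carry of a Full Adder over the first three inputs."""
    return (i1 & i2) | (i1 & i3) | (i2 & i3)

def error_rate(approx, exact, n_inputs):
    """Fraction of input combinations where approx and exact disagree."""
    rows = list(product((0, 1), repeat=n_inputs))
    wrong = sum(approx(*row) != exact(*row) for row in rows)
    return wrong / len(rows)

s_error = error_rate(maj, exact_sum, 5)           # approximate S = MAJ of 5 inputs
c_o1_error = error_rate(maj, exact_transport, 3)  # C_o1 = MAJ of 3 inputs

print(s_error, c_o1_error)  # 0.3125 0.0
```

The 5-input Majority disagrees with the parity function exactly when one or four inputs are 1 (10 of the 32 combinations), which gives the 31.25% figure, while the 3-input Majority coincides with the Full Adder carry and is therefore error free.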

IV. SIMULATION SETUP AND RESULTS
In order to validate the proposed structure by means of MuMax3 [25], we made use of the parameters specified in Table III [26]. In addition, we assumed waveguide thickness and width of 1 nm and
0.4 ns reading window starting 1.80 ns after the input application. Table III presents
respectively. We note that in order to perform amplitude normalization the DC has to be rather long [7], which results in a large delay overhead.
given that the approximate 4:2 compressor in [28] has the same average error rate as the one we propose, we can infer that replacing their compressor with ours does not change the image quality while resulting in 3 orders of magnitude lower energy consumption.
We note that the main goal of this paper is to propose and validate a SW 4:2 approximate compressor and as such we do not take into consideration thermal and variability effects.
However, in [31], it was suggested that thermal noise, edge roughness, and waveguide trapezoidal cross section do not have a noticeable impact on gate functionality. Thus, we expect that the 4:2 approximate compressor functions correctly in their presence. Nevertheless, further investigation of such phenomena is of great interest but cannot be performed before technology data and suitable simulation tools become available.

VI. CONCLUSIONS
This paper proposed a Spin Wave (SW) based 4:2 approximate compressor, which consists of 3-input and 5-input Majority gates. We reported the design of approximate circuits without directional couplers, which are essential to normalize gate output(s) when cascading them in accurate circuit designs. We validated the proposed compressor by means of micromagnetic simulations, and compared it with the state-of-the-art SW, 22 nm CMOS, 45 nm CMOS, and Spin-CMOS counterparts.
The evaluation results indicated that the proposed 4:2 compressor saves 31.5% energy in comparison with the accurate SW compressor and has the same energy consumption and error rate as the approximate compressor with DC, while requiring 3x less delay. Moreover, it consumes 14% less energy, while having a 17% lower error rate, when compared with the approximate 45 nm CMOS counterpart. Furthermore, it outperforms the approximate Spin-CMOS based compressor by 3 orders of magnitude in terms of energy consumption while providing the same error rate. Last but not least, the proposed compressor requires the smallest number of devices, thus it potentially requires the lowest chip real-estate.