A Codesigned Integrated Photonic Electronic Neuron

In the modern era of artificial intelligence, increasingly sophisticated artificial neural networks (ANNs) are implemented, which pose challenges in terms of execution speed and power consumption. To tackle this problem, recent research on reduced-precision ANNs opened the possibility to exploit analog hardware for neuromorphic acceleration. In this scenario, photonic-electronic engines are emerging as a short-medium term solution to exploit the high speed and inherent parallelism of optics for the linear computations needed in ANNs, while resorting to electronic circuitry for signal conditioning and memory storage. In this paper we introduce a precision-scalable integrated photonic-electronic multiply-accumulate neuron, namely PEMAN. The proposed device relies on (i) an analog photonic engine to perform reduced-precision multiplications at high speed and low power, and (ii) an electronic front-end for accumulation and application of the nonlinear activation function by means of a nonlinear encoding in the analog-to-digital converter (ADC). The device, based on the iSiPP50G SOI process for the photonic engine and a commercial 28 nm CMOS process for the electronic front-end, has been numerically validated through cosimulations to perform multiply-accumulate (MAC) operations. PEMAN exhibits a multiplication accuracy of 6.1 ENOB up to 10 GMAC/s, while it can perform computations up to 56 GMAC/s with a reduced accuracy down to 2.1 ENOB. The device can trade off speed with resolution and power consumption; it outperforms its analog electronics counterparts both in terms of speed and power consumption, and also brings substantial improvements compared to a leading GPU.


I. INTRODUCTION
NOWADAYS, machine learning technology is pervasively used for a wide range of applications including image classification, speech recognition and language translation, decision making, web searches, content filtering on social networks, and recommendations on e-commerce websites [1]. Deep learning is one of the fastest-growing machine learning methods, exploiting multi-layered artificial neural networks (ANNs) implemented in digital electronics for processing large data sets, combining and analysing vast amounts of information quickly without the need for explicit instructions [2]. The spreading of artificial intelligence (AI)-driven systems for an increasing number of applications is testified by the fact that the computing power required to train state-of-the-art AI has doubled every 3.4 months since 2012 [3]. Recent deep learning milestones include ResNet winning the ImageNet challenge in 2015 by reaching a super-human level of accuracy in object recognition [4], and GPT-3, the largest AI model to date, capable of producing high quality human-like writings thanks to an over-100-billion parameter ANN trained over a large part of the internet [5].
These results have been achieved thanks to increasingly sophisticated ANNs and training algorithms, and by leveraging a very large amount of computing power. Indeed, general purpose graphical processing units (GPGPUs) have been identified as particularly suitable for implementing the parallel computing tasks typical of ANNs, and contributed significantly to their current success in real application scenarios [6]. Recently, field-programmable gate arrays (FPGAs) and digital or mixed-signal application-specific integrated circuits (ASICs) [7]-[9] have been specifically designed to implement ANN computations, improving both speed and energy efficiency for learning tasks. To this aim, these novel electronic solutions focus on advanced numerical representations and memory architectures suitable for high-speed matrix multiplications, and on a very high bidirectional off-chip bandwidth (exceeding 1 Tb/s) to enable model and data parallelism. Compact and energy-efficient neuromorphic hardware is indeed of paramount importance due to the high power dissipation of large neural network models, reaching several MW during both training and inference [10]-[12].
Driven by the research on low-precision computing for ANNs, analog engines (e.g., based on memristors [13]) are promising as neuromorphic accelerators. The aim is to avoid the quadratic growth of the linear ANN computations (i.e., vector-matrix multiplications) as a function of the neural network layer size. Indeed, analog hardware, though more expensive than digital solutions, can be used to parallelize the linear computations.
In this scenario, photonic solutions show a great potential to realize analog low-power high-throughput accelerators for machine learning [14,15]. Photonic implementations typically exploit free-space optics, such as the diffractive architectures that make use of micromachined lenses [16], or integrated optics, e.g., the coherent solutions based on Mach-Zehnder interferometer meshes [17,18]. Despite many research efforts, all-optical approaches must still overcome several challenges before their practical exploitation. The issues concern the large-scale integration and control of many photonic devices (comprising light sources) and the lack of suitable photonic nonlinearities for the activation function. The latter appears to be the main limitation towards truly deep photonic neural networks [19]. While some interesting works on optical nonlinearities are emerging [20]-[22], photonics can be exploited in the short-medium term to implement just the linear computations required in ANNs, thus realizing photonic-electronic accelerators.
The DEAP (Digital Electronics and Analog Photonics), proposed in [23], is an example of such photonic-electronic neuromorphic cores derived from the broadcast-and-weight architecture [24]. It is a wavelength division multiplexing (WDM)-based optical network that relies on double bus ring resonators connected to a balanced photodetector to perform bipolar multiplications. Another example of these hybrid devices is represented by the photonic tensor core proposed in [25]. This architecture exploits a phase change material to implement photonic memory elements used to record multiplicands. In both solutions, the multiplication results are encoded in the amplitude of a photocurrent after photodetection. Even though these solutions provide a sound system-level photonic-electronic architecture, an in-depth codesign of the photonic and electronic circuits towards the integration of both parts has still to be properly tackled.
Building upon the preliminary results reported in [26], in this paper we present the photonic-electronic multiply-accumulate neuron (PEMAN). It is a reduced-precision integrated photonic-electronic device based on a multiply-accumulate (MAC) processor with an ADC-embedded nonlinearity, suited to accelerate ANNs based on memoryless layers [27]. The PEMAN photonic engine exploits two Mach-Zehnder modulators and a balanced photodetector to perform high-speed bipolar multiplications. The electronic front-end comprises an accumulation capacitance and a loop-unrolled successive approximation register (SAR) ADC. This last element applies the nonlinearity of interest within the analog-to-digital conversion. This architecture is able to trade off speed with multiplication accuracy.
The remainder of this paper is structured as follows: after a background on ANN, precision-scalable and analog computing reported in Sec. II, in Sec. III we present the integrated photonic-electronic neuron. Sec. IV analyzes the performance of the components and of the full photonic engine through circuit-based simulations, while Sec. V discusses speed, resolution, and energy consumption of the designed photonic-electronic device. Sec. VI concludes the paper.

II. BACKGROUND
After recalling the main operations involved in ANN computation, this section focuses on the rationale behind reduced-precision computing for neuromorphic applications, and on the problem of interfacing analog computing to digital memories, with an emphasis on the relevant metrics.
ANNs are a class of machine learning methods vaguely inspired by biological neurons. An ANN is a collection of elementary units, called neurons, arranged in layers. Neurons can be connected either to all or to only a part of the neurons in adjacent layers, thus forming either fully-connected or sparsely-connected layers, respectively. In ANN models, the stimulus of a neuron is computed by adding all the input values, each one multiplied by a proper weight, which corresponds to a MAC operation. Finally, a nonlinear function is applied to the accumulation result. This last step is of paramount importance to guarantee the network a proper expressiveness, i.e., the ability to generalize what the ANN has learned to unforeseen data. The computations involved in an ANN layer composed of M neurons fed by a previous layer with N neurons are formalized as follows:

$$O_i = f\left(\sum_{n=1}^{N} w_{i,n}\, x_n + \theta_i\right) \qquad (1)$$

where O_i (i ∈ 1, ..., M) is the output of the i-th neuron of the layer, x_n (n ∈ 1, ..., N) is the output of the n-th neuron of the previous layer feeding the current layer, w_{i,n} is the weight connecting the n-th neuron of the previous layer to the i-th neuron of the current layer, θ_i is the i-th neuron bias term, and f(·) is the nonlinear function. The building blocks of an ANN are therefore three: 1) a linear part, performing the MAC operations; 2) a nonlinear part, which applies the nonlinear function to the result of MAC operations; 3) a memory element, storing the neuron output in order to be utilized in the successive layers. While the weights w_{i,n} are normally bipolar, positive-only inputs x_n are widely used in neuromorphic applications since many nonlinear functions f(·) (e.g., the sigmoid, the softplus, and the rectified linear unit or ReLU) have positive-only outputs [4].
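As a minimal illustration of the layer computation in Eq. 1, the following NumPy sketch evaluates a fully-connected layer as one vector-matrix product followed by the nonlinearity. The function names, the choice of the sigmoid for f(·), and the numerical values are assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, one possible choice for f(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, theta, f=sigmoid):
    """Compute O_i = f(sum_n W[i, n] * x[n] + theta[i]) for all i.

    x     : shape (N,), outputs of the previous layer
    W     : shape (M, N), weights (bipolar in general)
    theta : shape (M,), bias terms
    """
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # The M*N multiply-accumulate operations are the linear part;
    # f is applied only M times.
    return f(W @ x + theta)

# Illustrative values only: N = 3 inputs in [0, 1], M = 2 neurons.
x = np.array([0.2, 0.7, 1.0])
W = np.array([[0.5, -1.0, 0.25],
              [-0.3, 0.8, 0.1]])
theta = np.array([0.1, -0.2])
O = layer_forward(x, W, theta)
print(O)
```

The matrix product makes the O(MN) cost of the linear part explicit, while the activation is evaluated once per neuron.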

A. Reduced-Precision Computing in ANN
The most computationally intensive and time consuming workload in ANNs is constituted by the linear part, i.e., by MAC operations. This is because MAC operations in an ANN layer, described by Eq. 1, grow as O(MN), while the computations in the nonlinear part grow only as O(M) [27]. For this reason GPGPUs, particularly suited to perform vector-matrix multiplications, have enabled the effective use of deep ANNs with thousands or even millions of neurons per layer [6]. In recent years, many research activities have been performed to lower the burden of ANN computations, for instance to exploit hardware-constrained devices and/or to apply ANNs in applications with low-latency requirements, such as safety-critical ones [28].
Several hardware and software solutions are emerging in order to meet these low memory and low computing capacity constraints. The main goal of software solutions is to develop ANNs that require less memory, relying on simpler arithmetic, while exhibiting negligible accuracy losses. For instance, ANNs have been pruned by removing less relevant connections, parameters have been normalized, and optimizations have been performed in dataflows to reduce data movement and storage [29]. Furthermore, works on reduced-precision computing have demonstrated the possibility to avoid the cumbersome floating point (FP) arithmetic by exploiting a small number of bits to represent ANN parameters with nearly negligible accuracy loss in several edge node applications [30]. These works report ANNs with parameters encoded with ≤ 8 bits [31] down to 1 bit in binary neural networks [32,33]. Based on the research activities in the field of reduced-precision ANNs, state-of-the-art GPGPUs implement dedicated hardware to perform integer operations (down to 1 bit) in order to reduce the power consumption and latency of ANNs [34,35]. Moreover, a new class of devices has recently emerged: precision-scalable MAC architectures [36]. These digital electronic architectures are designed to accelerate MAC operations in ANNs, making it possible to choose the number of bits used in computations, typically in three configurations: either 8, 4 or 2 bits of resolution. A lower precision results in higher speed and energy efficiency, making it possible to trade off speed and power efficiency with bit resolution.
In this scenario, analog hardware is gaining momentum to implement neuromorphic accelerators exploiting physical properties of circuits [37]. These analog engines aim to circumvent the quadratic growth in computational time associated with the number of neurons per layer, at the expense of more complex hardware [38]-[40]. Analog electronics mainly exploits fundamental circuit laws and device properties (e.g., current sum in a circuit node) to perform MAC operations [41]-[43]. A remarkable class of electronic neuromorphic devices are memristor crossbar arrays, also known as resistive RAMs (ReRAMs) [13]. However, ReRAM-based engines (being inherently resistive) suffer from high power dissipation issues, and lack reliable process standards and accurate models for simulation frameworks.
In the roadmap towards low power and high density MAC engines, neuromorphic photonics promises to bring sub-fJ per MAC power efficiency with high compactness, while relying on an inherently parallel hardware that reduces the complexity growth [44,45]. Nevertheless, several challenges must be tackled to enable effective all-optical approaches for neuromorphic hardware, including the efficient large-scale integration of many active and passive devices, and the reduction of losses and impairments, which may cause a significant accuracy drop (up to 70% in Mach-Zehnder-based coherent approaches) [45]-[48]. While considerable effort is being devoted to overcoming these issues [17,49]-[52], photonic analog processors are also emerging within hybrid photonic-electronic accelerators, being particularly suited to perform high-speed MAC operations for reduced-precision ANNs [14,23,53].

B. Resolution in Analog Engines
Analog signals can be represented by a set of continuous values, while digital ones can be represented by a set of discrete values. However, analog computing cannot express continuously variable quantities, i.e., with arbitrarily high resolution, because of noise and distortions introduced by the analog hardware. This indeed limits the resolution of the analog system, i.e., the minimum distance between two distinguishable values. For any noise distribution the standard deviation σ provides an estimate of the noise interval, namely the spreading of the values around the expected value.
As there is currently no established analog memory, information needs to be digitized in order to be stored. For this reason, the "number of bits" is an appropriate metric to define the resolution of a photonic or electronic analog system, as it provides the bits needed to manage and store the information. To this aim the effective number of bits (ENOB) can be estimated, taking into account both noise and distortions. The ENOB depends on the signal-to-noise-and-distortion ratio (SINAD), which can in turn be computed from the signal-to-noise ratio (SNR) and the total harmonic distortion (THD). Equations 2 and 3 state the relations between ENOB, SINAD, SNR and THD, all in dB [54].
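For reference, the textbook relations between these quantities are SINAD = −10·log10(10^(−SNR/10) + 10^(THD/10)) and ENOB = (SINAD − 1.76)/6.02, with THD expressed as a negative number of dB relative to the fundamental. The sketch below encodes these standard forms; that they coincide with the exact expressions and numbering of Eqs. 2 and 3 is an assumption, and the function names are chosen only for this example.

```python
import math

def sinad_db(snr_db, thd_db):
    """Combine SNR and THD (both in dB) into SINAD (dB).

    Assumes the usual convention: SNR is positive, THD is negative
    (distortion power relative to the fundamental).
    """
    noise_rel = 10.0 ** (-snr_db / 10.0)       # noise power / signal power
    distortion_rel = 10.0 ** (thd_db / 10.0)   # distortion power / signal power
    return -10.0 * math.log10(noise_rel + distortion_rel)

def enob(sinad):
    """Effective number of bits from SINAD (dB), full-scale sine wave."""
    return (sinad - 1.76) / 6.02

# Illustrative values only: equal noise and distortion contributions
# cost about 3 dB of SINAD with respect to either one alone.
s = sinad_db(snr_db=40.0, thd_db=-40.0)
print(f"SINAD = {s:.2f} dB, ENOB = {enob(s):.2f}")
```

When one of the two contributions dominates, SINAD (and hence ENOB) tracks it, which is the behaviour used later to tell noise-limited from distortion-limited regimes.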

C. Metrics for analog neuromorphic photonics
Digital hardware makes use of floating point operations per second (i.e., FLOPS) to evaluate the computational speed. Systems based on reduced precision, such as analog engines, cannot be directly compared to electronic engines based on a floating point arithmetic. Once a given arithmetic precision has been chosen, an appropriate metric for reduced-precision systems is the MAC/s, which quantifies the speed at which MAC operations are carried out. Another metric that cannot be used for analog computing is the bit error rate (BER), assessing the number of altered bits in digital communications; indeed, for analog systems ENOB (and SINAD) are relevant. Moreover, the energy efficiency of analog processors has to be properly normalized over the kind of operation performed, i.e., Joule per MAC (J/MAC).

III. THE INTEGRATED PHOTONIC ELECTRONIC NEURON
This section introduces the PEMAN, an integrated photonic-electronic precision-scalable MAC architecture with ADC-embedded nonlinearity. The device has been codesigned in order to exploit the strengths of both photonic and electronic domains to perform the computations needed in neuromorphic applications. In particular, as depicted in Fig. 1, the PEMAN leverages: (i) an analog photonic engine to carry out reduced-precision multiplications at high speed and low power, and (ii) an electronic front-end to accumulate the multiplication results and compute the nonlinear function. For the first time in an analog neuromorphic engine, the nonlinearity is computed within the ADC.

A. Working Principle
The photonic engine relies on two cascaded travelling wave (TW) Mach-Zehnder modulators (MZMs) to act on the amplitude of an incoming lightwave and perform dot product multiplications. The first one is a 1 × 1 MZM able to impress input values in the range to 1 (all-pass state); the unity-limited range can be overcome by a simple scaling of inputs and output. The photocurrent generated by the balanced PD represents the multiplication result. The accumulation is then carried out after the opto-electronic conversion by charging (or discharging) a capacitor, thus implementing the MAC operation. The capacitor voltage is reset every N + 1 accumulations, i.e., after accumulating the results of the N input-weight multiplications and the bias term θ_i, as shown in Eq. 1. The capacitor is connected to a differential amplifier, needed to properly drive the subsequent ADC. During the reset phase, the amplifier input is disconnected, the ADC samples the capacitor voltage, and subsequently the capacitor is reset to zero. The ADC has been designed with a nonlinear coding that allows inherently applying the neuron nonlinearity within the sampling operation, as detailed in Sec. III-B.
Differently from a transimpedance amplifier (TIA)-based photoreceiver, the integrating front-end accumulates in the analog domain the results of several operations before sampling, hence relaxing the ADC bandwidth specifications. In particular, sampling every N + 1 operations allows the ADC rate to be N + 1 times lower than the MAC rate. This is a critical aspect to reduce the ADC power consumption and to increase the achievable ENOB (typically quite low for high-speed ADCs, e.g., ∼ 2 for ADCs operating at ≥ 5 GSa/s [55]).
The photonic engine has been emulated using IMEC iSiPP50G platform [56], while the electronic front-end has been designed using a commercial 28 nm CMOS process. The entire PEMAN system has been validated through cosimulations using Lumerical Interconnect and Cadence Spectre for the photonic and electrical domain, respectively.

B. Electronic Analog Front-End
The electronic analog front-end implements the second part of the MAC operations, i.e., the accumulation, followed by the analog-to-digital conversion embedding the activation function. A single frame of the MAC operations performed by the PEMAN is depicted in Fig. 2(a). The results of N consecutive multiplications (w_{i,n} · x_n) plus the bias term θ_i, each of them associated to a time slot of length T_MAC, are accumulated on the accumulation capacitor C_A shown in Fig. 1. Index n represents the multiplication step during the accumulation, while index i represents the computed output, i.e., the i-th overall PEMAN operation. After the (N + 1)-th T_MAC, i.e., after T_ACC, a transition of a digital signal commands the sampling of the amplifier output voltage V_out by the ADC. To finalize these operations, a time T_S is needed for the amplifier to reach a stable state after the last accumulation (MAC_{i,N+1} in Fig. 2). Finally, the ADC, within a conversion time T_C, converts the result of the sampling operation into the digital code D_{out,i}, which is then stored in a memory. During this conversion phase NLF_i, the nonlinear function is also applied. The sum of the accumulation time, the sampling period and the conversion time determines the time needed for a whole PEMAN operation, T_PEMAN. The proposed architecture offers the possibility to time-interleave part of the operations of the analog front-end, thus allowing a lower T_PEMAN and a higher computational speed, without penalizing the electronic performance. In particular, once the correct reset of the accumulation capacitor and the proper sampling of the amplifier output voltage V_{out,i} are guaranteed, the following accumulation ACC_{i+1} can start while the ADC is still converting the result of the previous sampling phase S_i. With this approach, the conversion time of the ADC could be as long as the accumulation time T_ACC without introducing penalties on the PEMAN speed.
With time interleaving, the overall T_PEMAN period is then equal to the sum of the accumulation time T_ACC = (N + 1) T_MAC and the sampling period T_S. It is worth noting that a larger number of accumulations relaxes the design of both the differential amplifier and the ADC. Compared to the minimum achievable T_PEMAN ∼ T_ACC, the only overhead introduced by the proposed solution is T_S, which cannot be avoided in any electronic front-end to allow the correct settling before the analog-to-digital conversion. Nevertheless, the larger N is, the lower the impact of the sampling period on the PEMAN speed. Similarly, a large number of accumulations allows an extended conversion time for the ADC.
The maximum number of accumulations allowed by the PEMAN is related to the maximum photodiode current, the accumulation capacitor C_A, its maximum voltage swing V_C, and the integration time. For the sake of simplicity, let us consider an integration time equal to T_MAC and a photodiode current constant during the whole period T_ACC. Considering the maximum photodiode current, as in the case of maximum input x_n and maximum weight w_{i,n}, we obtain the worst-case estimate of the maximum number of accumulations. The minimum value of the accumulation capacitor C_A is given by the parallel combination of the photodiodes' parasitic capacitances, which could be as low as a few hundred femtofarads as in monolithically-integrated solutions [57], and the switch parasitic capacitances. Due to the nonlinearity of these capacitances, a large V_C swing may cause severe harmonic distortions, limiting the SINAD of the analog front-end. For this reason, additional linear capacitors (Metal-Insulator-Metal or Metal-Oxide-Metal capacitors, typically present in current commercial CMOS processes) with capacitances of the order of picofarads or tens of picofarads can be added in parallel, increasing the overall value of C_A and the linearity of the analog front-end, with relatively small impact on the area occupation. Consequently, the voltage headroom of V_C is mainly limited by the supply voltage of the front-end, which is close to 1 V in modern CMOS processes. Anticipating some values obtained from the numerical simulations described in Sec. IV-B, a maximum photodiode current of 1 mA and a T_MAC of 100 ps are here used to estimate the maximum number of accumulations N + 1 ∼ 200. Considering the high speed of the photonic operations (tens of GHz) and a flexible range of N from a few tens to hundreds, the sampling frequency of the ADC is on the order of 100 MS/s to 1 GS/s.
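The timing relations above can be sketched numerically as follows. The function name and the settling time of 200 ps are hypothetical choices made only for illustration; the 100 ps MAC slot corresponds to a 10 GMAC/s rate.

```python
def peman_timing(n_inputs, t_mac, t_s):
    """Timing figures for one time-interleaved PEMAN frame (seconds).

    n_inputs : N, number of input-weight products per neuron
    t_mac    : duration of one MAC slot, T_MAC
    t_s      : amplifier settling / sampling time, T_S
    """
    t_acc = (n_inputs + 1) * t_mac   # N products plus the bias slot
    t_peman = t_acc + t_s            # conversion overlaps the next accumulation
    return {
        "t_acc": t_acc,
        "t_peman": t_peman,
        "adc_rate": 1.0 / t_peman,           # one ADC sample per frame
        "max_conversion_time": t_acc,        # T_C can be as long as T_ACC
        "sampling_overhead": t_s / t_peman,  # fraction of the frame spent settling
    }

# T_S = 200 ps is an assumed placeholder, not a value from the design.
for n in (19, 99, 199):
    frame = peman_timing(n, t_mac=100e-12, t_s=200e-12)
    print(n, f"{frame['adc_rate'] / 1e6:.0f} MS/s",
          f"overhead {100 * frame['sampling_overhead']:.1f} %")
```

Under these assumptions the relative weight of T_S shrinks as N grows, and the ADC rate falls to roughly 1/(N + 1) of the MAC rate.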
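The worst-case estimate follows from a charge budget: each MAC slot deposits at most I_max · T_MAC on the capacitor, while the capacitor tolerates at most C_A · V_C. A minimal sketch is given below; the capacitance of 20 pF is an assumption picked from the "tens of picofarads" range because the text does not state the value of C_A, and the function name is hypothetical.

```python
def max_accumulations(c_a, v_swing, i_pd_max, t_mac):
    """Worst-case number of MAC slots (N + 1) before the capacitor saturates.

    Assumes a constant photocurrent i_pd_max during every slot of length t_mac.
    """
    charge_budget = c_a * v_swing         # maximum charge the capacitor may store
    charge_per_mac = i_pd_max * t_mac     # charge deposited by one worst-case MAC
    return charge_budget / charge_per_mac

# 1 mA and 100 ps are taken from the text; 20 pF and 1 V are assumptions.
n_max = max_accumulations(c_a=20e-12, v_swing=1.0, i_pd_max=1e-3, t_mac=100e-12)
print(f"N + 1 ~ {n_max:.0f}")
```

The estimate scales linearly with C_A and V_C, so a smaller allowed swing requires a proportionally larger capacitor for the same number of accumulations.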
Given the constraints of speed and resolution, a feasible and energy efficient solution is represented by the Successive Approximation Register (SAR) converter, depicted in Fig. 3. In particular, a loop unrolled topology [58] has been chosen due to the improved feedback delay, which guarantees a higher sampling rate compared to the conventional SAR topology. The fully-differential architecture brings several advantages in terms of improved linearity, common-mode noise rejection, and SAR algorithm efficiency. For this reason, the ADC is preceded by the differential amplifier that converts the unipolar voltage V C into the differential voltage V out . The presence of N 0 different comparators, where N 0 is the nominal resolution, allows the intrinsic speed-up of the chosen topology by removing the digital delay to store each comparison result, as well as the comparator reset time. At the same time, it has some unavoidable drawbacks, namely an increase of area consumption, and the need of additional hardware overhead for the offset calibration.
The nonlinear function of the neuron is embedded inside the capacitive DAC of the SAR converter. Instead of employing the typical binary weighting, an ad-hoc weighting strategy has been developed, obtaining a sigmoid transfer function, as shown in Fig. 4. The system has been designed with a standard 28 nm bulk CMOS process, with a nominal resolution of 6 bit and a max sampling frequency of 1.4 GS/s, and simulated by means of Cadence Spectre. The non-linear encoding requires additional logical circuitry, which slightly increases the delay time of the critical path, and a larger number of capacitors. Nevertheless, it is possible to keep the total capacitance of the DAC (and consequently the area and the power consumption of the ADC) sufficiently low by exploiting parallel connections of several capacitors at the same depth level of the algorithm.
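The idea of folding the activation into the converter can be illustrated with a purely behavioural model: a SAR-style binary search that compares the input against non-uniformly spaced decision thresholds, so that the output code follows a sigmoid of the input. This is only a sketch under stated assumptions; the threshold placement via the inverse sigmoid, the gain parameter, and the function names are hypothetical and do not describe the actual capacitive DAC weighting.

```python
import numpy as np

N_BITS = 6  # nominal resolution quoted in the text

def sigmoid_thresholds(n_bits=N_BITS, gain=1.0):
    """Decision thresholds placed at the inverse sigmoid of uniform code levels.

    Element k-1 is the input value above which the code is at least k.
    """
    levels = 2 ** n_bits
    p = np.arange(1, levels) / levels
    return np.log(p / (1.0 - p)) / gain

def sar_convert(v, thresholds, n_bits=N_BITS):
    """Successive approximation: one comparison per bit, MSB first."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Keep the bit if the input exceeds the threshold of the trial code.
        if v >= thresholds[trial - 1]:
            code = trial
    return code

thresholds = sigmoid_thresholds()
for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    code = sar_convert(v, thresholds)
    print(f"v = {v:+.1f}  code = {code:2d}  code/64 = {code / 64:.3f}")
```

Because the thresholds are monotonic, the binary search remains valid, and the normalized code approximates the sigmoid of the input to within one least significant bit.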

IV. THE PHOTONIC LINEAR ENGINE
The PEMAN optical engine has been simulated on a commercial silicon photonic platform, namely IMEC iSiPP50G [56]. This section details the photonic engine and reports the results of numerical simulations performed to validate its performance. The first part focuses on the characterization of the single MZM-based weighting element. The second part presents the validation of the whole photonic engine.
Each MZM has been simulated in Lumerical Interconnect exploiting TW elements with the following characteristics: 2.5 mm-long electro-optic phase shifters, V π = 3.6 V, and a free spectral range of 14.5 nm. The phase shifters are driven in a push-pull configuration within the range [0, V π ]. The MZMs are unbalanced in order to match the driving voltage ends with the representation ends: the minimum (maximum) value is encoded by applying a null (V π ) voltage to the upper arm and a V π (null) voltage in the lower arm of the MZM. Regarding the 1 × 2 MZM, to obtain 0 both arms are driven with V π /2, thus resulting in a theoretically null current at the balanced PD.

A. MZM-based Bipolar Multiplication
This section reports the characterization of the bipolar MZM element, i.e., the one used to encode the weight w within the interval [−1, 1]. As the model library provides just the 1×1 TW MZM, the 1×2 version has been implemented by adding a 1×2 multi-mode interferometer (MMI) to split the input into two 1×1 MZMs.
As discussed in Sec. II-B, the characterization of an analog processing element is done taking into account both noise (SNR) and distortions (THD), which determine the SINAD, ultimately dictating the element resolution through the ENOB. To this aim, the elements must be driven to provide a full-scale sinusoidal signal, i.e., with the largest peak-to-peak voltage (V_PP). Given the inherent sinusoidal characteristic of the MZM, a sine wave is obtained by driving it with triangular waves in push-pull configuration. The following noise sources have been taken into account in the simulations: −150 dB/Hz laser relative intensity noise (RIN), 1 MHz laser linewidth, thermal noise (temperature 300 K), dark current and shot noise in the PD. TW phase shifters allow taking into account the related delays and distortions.
Simulations have been performed at first as a function of the triangular wave frequency (from 1 GHz to 56 GHz) to evaluate the impact of the MZM finite bandwidth on the ENOB. These simulations have been carried out at a fixed laser power of 10 dBm. Subsequently, simulations relating the laser power to the ENOB are reported, varying the input power from 0.05 mW (−13 dBm) to 10 mW (10 dBm) at a constant frequency of 10 GHz. 128 wave periods have been simulated with 512 points per period.
The results as a function of the frequency are summarized in Fig. 5, reporting the SNR, the opposite of THD, and the ENOB. The main limiting factor at low frequencies is the THD (mainly caused by the MZM nonlinear behaviour), which varies non-monotonically: its opposite peaks near 5 GHz, where the distortion is minimum and the obtained ENOB is the largest, about 7.5. Conversely, the SNR decreases monotonically, becoming the main limiting factor at higher frequencies. Indeed the ENOB follows the THD trend at lower frequencies, and then it is SNR-limited at higher frequencies, being ≤ 5.5 above 50 GHz.
Fig. 6 shows the ENOB as a function of the input laser power. SNR and ENOB grow for increasing power, while THD does not vary significantly. The results show the relevance of the distortion in the MZM-based weighting element: an input power increase of 23 dB causes an ENOB increment of 2, lower than the maximum value of 3.8 obtainable through the SNR increase, according to Eq. 3. This is a common pattern in analog systems, where an increased power increases proportionally both the signal and the distortions, thus keeping the THD value constant. This behaviour limits the effect of the SNR improvement on the overall SINAD and ENOB.
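The figure of 3.8 quoted above can be checked with a short worked calculation. It assumes that Eq. 3 has the usual form ENOB = (SINAD − 1.76)/6.02 and, as the text does, that in the absence of distortion the SINAD would improve by the same 23 dB as the input power.

```latex
\Delta P = 10\log_{10}\!\left(\frac{10\ \mathrm{mW}}{0.05\ \mathrm{mW}}\right) \approx 23\ \mathrm{dB},
\qquad
\Delta\mathrm{ENOB}_{\max} = \frac{\Delta\mathrm{SINAD}}{6.02\ \mathrm{dB/bit}}
\approx \frac{23}{6.02} \approx 3.8
```

The constant offset of 1.76 dB cancels in the difference, so only the slope of about 6.02 dB per bit matters for the increment.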

B. PEMAN Validation with Random-valued Multiplications
In this section we discuss the simulations done to evaluate the whole photonic engine composed by the CW laser, a 1x1 MZM, a 1x2 MZM and the balanced PD. In this case the method used for the single MZM cannot be exploited, since SNR and THD of multiple cascaded devices cannot be directly derived. Instead, an equivalent noise interval has been derived by means of random-valued multiplications, aimed to assess the ENOB of the full circuit. Sec. II-B discussed the derivation of ENOB from the noise standard deviation. Using the same rationale the multiplication error standard deviation has been used to compute an equivalent noise interval (equal to 6σ) taking into account distortions and bandwidth limitations, thus evaluating the system resolution.
The PEMAN has been tested for same frequencies and input powers used for the MZM-based weighting element in Sec. IV-A, performing dot product multiplications on a dataset of random-valued input-weight pairs. The dataset has been produced using the Python library NumPy, thus generating values in the range [0, 1] for x and values in the range [−1, 1] for w, both rounded at the third decimal. The obtained values have been translated into the corresponding MZM voltage values by means of a nonlinear coding based on the static characteristic. The simulation output returns the multiplication results as time-dependent photocurrent waveforms. The waveforms have been analyzed to extract the standard deviation σ relative to the multiplication error. These simulations have been performed with the same settings used for the weighting MZM characterisation and using 256 points per period. Fig. 7 reports the ENOB as a function of the MAC rate with a dataset of 1024 multiplications at a constant input laser power of 10 dBm. It shows a constant ENOB of 6.1 up to 10 GHz, while it decreases down to 2.1 at 56 GHz. The lowfrequency plateau and the subsequent decay in the ENOB reflect the fact that the system resolution is noise-limited up to 10 GHz, while at higher frequencies it is bandwidthlimited. The plateau, not present in the case of the single MZM element, is due to the nonlinear coding used to drive the MZMs, which mitigates the effects of distortions. Fig. 8 shows the ENOB as a function of the input laser power with a dataset of 256 multiplications, a fixed MAC rate of 10 GHz, and all other parameters unchanged. The ENOB grows from 4.3 to 6.1 for increasing input laser power. Similarly to the case of the single weighting element, an input power increase of 23 dB causes an ENOB increment of 1.8, lower than the maximum value of 3.8 achievable through the SNR increase, according to Eq. 3. This is due to the fact that an increased power causes higher distortions.
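The test procedure can be sketched in NumPy as follows. Everything beyond the dataset generation is an assumption made for illustration: the MZMs are replaced by idealized static characteristics (cos² for the 1 × 1 modulator, cos for the 1 × 2 modulator with balanced detection), the circuit simulation is replaced by additive Gaussian noise with a hypothetical standard deviation, and the ENOB is taken as log2 of the full-scale range divided by the 6σ noise interval.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_mult = 1024

# Random dataset: inputs in [0, 1], weights in [-1, 1], three decimals.
x = np.round(rng.uniform(0.0, 1.0, n_mult), 3)
w = np.round(rng.uniform(-1.0, 1.0, n_mult), 3)

def encode_input(x):
    """Phase giving transmission x for an ideal 1x1 MZM, T = cos^2(phi / 2)."""
    return 2.0 * np.arccos(np.sqrt(x))

def encode_weight(w):
    """Phase giving balanced output w for an ideal 1x2 MZM, cos(phi)."""
    return np.arccos(w)

def photonic_multiply(phi_x, phi_w, noise_sigma, rng):
    """Idealized cascade of the two MZMs plus additive Gaussian noise."""
    ideal = np.cos(phi_x / 2.0) ** 2 * np.cos(phi_w)
    return ideal + rng.normal(0.0, noise_sigma, size=ideal.shape)

def enob_from_sigma(sigma, full_scale=2.0):
    """ENOB from an equivalent noise interval of 6 sigma (assumed definition)."""
    return np.log2(full_scale / (6.0 * sigma))

# noise_sigma is a placeholder standing in for the simulated impairments.
y = photonic_multiply(encode_input(x), encode_weight(w), noise_sigma=2e-3, rng=rng)
sigma = np.std(y - x * w)
print(f"sigma = {sigma:.2e}, ENOB = {enob_from_sigma(sigma):.2f}")
```

The inverse-characteristic encoding plays the role of the nonlinear coding: it pre-distorts the drive so that the static response of each modulator is linear in the encoded value.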
The results obtained for the overall PEMAN architecture are consistent with the performance of its basic MZM elements, which were developed for digital communications. In particular, they show that the PEMAN can trade off not only speed with resolution, but also power consumption with resolution. Moreover, the ENOB values found are in line with the performance of similar devices reported in the recent literature [59,60].
The maximum number of MAC operations that can be accumulated can be derived from the maximum photocurrent produced by the device, found to be 1.1 mA. This value is reached when both the input $x_n$ and the weight $w_{i,n}$ are equal to 1, the MAC rate is 10 GMAC/s, and the laser power is set at 10 dBm. In order to keep the voltage variation on the accumulation capacitor $V_C$ ≤ 0.1 V, so that the bias point of the photodiodes is not significantly altered, the PEMAN can accumulate ∼ 200 multiplications.
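The order of magnitude of this limit can be illustrated with a simple charge-balance model. This is only a sketch under stated assumptions: every MAC is taken to deposit the worst-case charge I_max·T on the capacitor, and the capacitance `C_ACC` is a hypothetical value chosen only so that the toy model lands near the ∼ 200 multiplications quoted above. It is not a value reported in the paper.

```python
def max_accumulations(i_max_a, mac_rate_hz, capacitance_f, delta_v_max):
    """Worst-case number of MACs before the capacitor voltage moves by delta_v_max.

    Assumes each MAC deposits the maximum charge Q = I_max / f_MAC.
    """
    charge_per_mac = i_max_a / mac_rate_hz            # coulombs per MAC
    return capacitance_f * delta_v_max / charge_per_mac


I_MAX = 1.1e-3    # maximum photocurrent (A), from the text
MAC_RATE = 10e9   # MAC rate (MAC/s), from the text
DELTA_V = 0.1     # allowed voltage variation (V), from the text
C_ACC = 220e-12   # accumulation capacitance (F): hypothetical, illustration only

n_max = max_accumulations(I_MAX, MAC_RATE, C_ACC, DELTA_V)
print(f"worst-case accumulations: {n_max:.0f}")
```

Since real inputs and weights are rarely all equal to 1, the average charge per MAC is lower than the worst case assumed here.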

V. DISCUSSION
In this section we discuss the obtained results, focusing on the PEMAN performance and physical implementation. We aim to position the proposed photonic-electronic neuron among the current solutions based on digital and analog electronics.
Tab. I compares the PEMAN (operating at different MAC rates), four analog electronic solutions (HICANN [61], NeuroGrid [62], SpiNNaker [63], and TrueNorth [64]), and a leading GPGPU (NVIDIA Tesla V100 [65]) in terms of speed, resolution, and power efficiency. To derive the power budget, the PEMAN is considered with an input laser power of 10 dBm and the electronic ADC working at its maximum speed of 1.4 GS/s. In these conditions, the power consumption of each element is as follows: 81 mW for the laser source [44], < 1 mW for the balanced PD, and 13 mW for the electronics (amplifier and ADC). Concerning the MZMs, besides a static power consumption < 1 mW, the phase shifter energy consumption is fixed at 6.9 pJ/MAC, according to its equivalent capacitance and neglecting the driving circuitry. The power consumption of the analog electronic solutions has been evaluated by dividing the dissipated power by the number of basic elements (i.e., neurons, which correspond to the number of MAC operations) and by the MAC speed per processing core.
The codesigned photonic-electronic neuromorphic device has the potential to outperform its analog electronics counterparts by several orders of magnitude in terms of speed per core. Moreover, it outperforms all the electronic engines except TrueNorth also in terms of power consumption. The adopted opto-electronic approach overcomes the main electronic bottleneck caused by the dynamic power growing exponentially with clock speed [66], trading off speed and resolution. It can reach MAC rates exceeding 50 GMAC/s while reducing the energy per MAC operation, since the contribution of the static power to each operation scales down with speed. The drawback of higher MAC rates is a reduced resolution in terms of ENOB, due to the finite bandwidth of the MZM elements. As reported in Sec. IV, rates below 10 GMAC/s are not convenient: there is no resolution improvement, while the energy per MAC increases due to the static power consumption.
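The dependence of the energy per MAC on the MAC rate can be checked with a short calculation based on the power figures listed above. The sketch assumes that the energy per MAC is the 6.9 pJ phase-shifter contribution plus the total static power divided by the MAC rate, and it takes the "< 1 mW" entries at their 1 mW upper bound. The breakdown used for Tab. I may differ slightly, and the names are hypothetical.

```python
# Static power contributions in watts (upper bounds where the text gives "< 1 mW").
STATIC_POWER_W = {
    "laser": 81e-3,
    "balanced_pd": 1e-3,
    "electronics": 13e-3,  # amplifier and ADC
    "mzm_static": 1e-3,
}
DYNAMIC_ENERGY_J = 6.9e-12  # phase-shifter energy per MAC


def energy_per_mac_pj(mac_rate_hz):
    """Energy per MAC in pJ: dynamic term plus static power spread over the rate."""
    static_w = sum(STATIC_POWER_W.values())
    return (DYNAMIC_ENERGY_J + static_w / mac_rate_hz) * 1e12


for rate in (1e9, 10e9, 56e9):
    print(f"{rate / 1e9:.0f} GMAC/s: {energy_per_mac_pj(rate):.1f} pJ/MAC")
```

Under these assumptions the model gives about 16.5 pJ/MAC at 10 GMAC/s and about 8.6 pJ/MAC at 56 GMAC/s, while at 1 GMAC/s the static term dominates.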
The computed power consumption also accounts for the computation of the nonlinear function. By applying the nonlinearity while sampling, the system avoids (i) an additional DRAM read/write and (ii) the nonlinearity computation (typically 10 arithmetic operations), saving 76 pJ every time the ADC samples, according to the energy costs of 5 pJ/bit for a DRAM read/write and 0.1 pJ/bit for a floating-point operation [67].
[Tab. I, surviving rows: … | 2.5×10^−6 | 5 | 0.27; Tesla V100 [65] | 2.93 | 32 | 20]

VI. CONCLUSION
In this paper we propose a precision-scalable integrated photonic-electronic multiply-accumulate neuron (PEMAN) and report a first performance evaluation through numerical simulations. The device is intended for neuromorphic acceleration at low power, with the ability to trade off speed and accuracy. The hybrid device implements the high-speed multiplication stage in the optical domain, while embedding the nonlinear activation function in the analog-to-digital conversion process in the electronic domain. The numerical simulations have been performed considering the IMEC iSiPP50G silicon photonic platform for the optical part and a standard 28 nm CMOS process for the electronic front-end. Photonic-electronic co-simulations show that the PEMAN has the potential to largely outperform equivalent analog and digital electronic solutions, in particular in terms of power consumption at operating frequencies from 10 GMAC/s up to 56 GMAC/s, where an energy consumption as low as 8.6 pJ/MAC is achieved. In addition, the PEMAN can flexibly adapt its operation, trading off speed against accuracy as needed.
A silicon photonics platform has been chosen for the integrated optical engine with a view to an all-silicon implementation of the PEMAN, ideally using a common platform for the photonic and electronic parts. The technological platforms used for the implementation clearly have deep and complex techno-economic implications in terms of form factor and CAPEX/OPEX of the ANN. Limiting the considerations to the ANN power efficiency, which is the key aspect discussed in the paper, future work will also tackle the design of the driving circuitry and consider alternative platforms for implementing the photonic and electronic parts. For example, the InP monolithic integration platform is a promising candidate, as it provides all the required photonic building blocks, including the laser source. Alternative electronic technologies such as finFET platforms can also be investigated, as they have the potential to improve the speed and power efficiency of the PEMAN.