Analysis of ENF Signal Extraction From Videos Acquired by Rolling Shutters

Electric network frequency (ENF) analysis is a promising forensic technique for authenticating multimedia recordings and detecting tampering. The validity of ENF analysis relies heavily on the capability of extracting high-quality ENF signals from multimedia recordings. This paper analyzes and compares two representative methods for extracting ENF signals from visual signals acquired by cameras using the rolling-shutter mechanism. The first method, direct concatenation, proposed in prior work, ignores the idle period of each frame. The second method, periodic zeroing-out, proposed in this paper, inserts zeros at the missing sample points instead of ignoring the idle period. Our theoretical analyses using multirate signal processing reveal, and experiments confirm, that while the first method can extract ENF signals without knowing the exact value of the camera's read-out time, it introduces mild distortion to the extracted ENF signals. In contrast, the second method, which takes the read-out time as an additional input, is capable of extracting distortion-free ENF signals, and its frequency component of the highest strength is always located at the nominal frequency. Additionally, we examine the aliased DC and negative ENF components caused by the two methods and show that their impact on the accuracy of frequency estimation is minimal. This paper facilitates a fundamental understanding of extracting ENF signals from videos. The research findings imply that the periodic zeroing-out method offers more accurate frequency estimates, but the performance improvement is moderate.

Electric network frequency (ENF) analysis is a forensic technique for authenticating multimedia recordings [5] and detecting tampering within the recordings [6], [7]. The ENF is the supply frequency of an electric power grid, and it fluctuates temporally around a nominal value of 60 Hz in North America or 50 Hz in most other regions of the world due to the mismatch between the demand and the supply within the power network [3], [8], [9]. Since different nodes within a grid are interconnected, the frequency fluctuations at different locations of the same grid share similar patterns [3], [10]. The instantaneous values of the ENF over time are regarded as the ENF signal, which can serve as a natural timestamp for the authentication of multimedia recordings.
The ENF signal can be embedded into audio recordings via the sensing of acoustic vibrations or via electromagnetic interference in sensing circuits [3], [11]. Several studies show that ENF signals extracted from audio recordings can be used to assess the authenticity of the recordings [3], [7], [12], [13], [14] and that the ENF extracted from the audio track of videos can be used to identify the location of recording [15], [16]. Furthermore, recent research has discovered that ENF analysis can be extended beyond the realm of forensic science to multimedia signal processing, e.g., the synchronization of audio and video recordings [17], the synchronization of videos without overlapped scenes [9], and the alignment of historical audio recordings [17].
More recent studies tackle a more challenging problem: extracting ENF signals from the visual track of multimedia recordings [1], [4], [8], [18], [19], [20]. Indoor lighting sources, such as fluorescent lights and incandescent bulbs, vary in light intensity at twice the supply frequency, leading to near-invisible flickering in the illuminated environment. Consequently, cameras operating in an indoor illumination environment may capture videos that contain ENF signals. One of the major concerns in extracting the embedded ENF signal from visual tracks is aliasing to DC because of the insufficient sampling rate [8]. For example, when the nominal value of the ENF in the visual form is 120 Hz and the frame rate is 30 frames per second (fps), the ENF traces centered at both 120 Hz and −120 Hz alias to be centered at 0 Hz, mixing two mirrored traces that cannot be separated. Even worse, the ENF traces centered at 0 Hz are further corrupted by the native DC content [8]. Fig. 1 illustrates this challenge.
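The aliasing described above can be checked with a short calculation: a frequency f sampled at rate f_s appears at ((f + f_s/2) mod f_s) − f_s/2. The snippet below (our own sketch, not from the paper) confirms that both the +120 Hz and −120 Hz components fold onto DC at 30 fps.

```python
def alias_frequency(f, fs):
    """Map an analog frequency f (Hz) to its alias in (-fs/2, fs/2] after sampling at fs (Hz)."""
    return ((f + fs / 2) % fs) - fs / 2

# Both mirrored ENF components fold onto 0 Hz when a 30 fps global shutter is used.
print(alias_frequency(120.0, 30.0))   # 0.0
print(alias_frequency(-120.0, 30.0))  # 0.0
```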
Fig. 1. An issue of mixed frequency components caused by aliasing when global-shutter CCD cameras are used. (Left) Time-frequency representation of the original signal containing substantial signal content around ±120 Hz and DC (0 Hz); (Right) Mixed/aliased frequency components around DC after being sampled at 30 fps. Two mirrored components, in general, cannot be separated once mixed. Their overlap with the DC component further hinders the estimation of the desired frequency component [8].

The rolling shutter mechanism is traditionally considered detrimental to image and video analysis. However, the authors in [8] demonstrate that the rolling shutter mechanism used in complementary metal-oxide-semiconductor (CMOS)
cameras could be exploited to increase the effective sampling rate. The rolling shutter acquires videos by sequentially reading and storing the pixel values of horizontal or vertical lines of each frame. Since successive lines of a frame are acquired at different time instants, the rolling shutter can facilitate ENF signal extraction from the visual track by increasing the effective sampling rate by a factor equal to the number of lines. In [8], the authors proposed an extraction method that we refer to in this paper as the direct concatenation method. It retains and concatenates only the signals available at the beginning of each frame period and ignores the signals during the idle time toward the end of each frame period. Although the method works in practice, the authors did not systematically investigate the direct concatenation method or characterize the pros and cons of such extraction procedures. This paper proposes another intuitively designed ENF signal extraction method, analyzes it along with the direct concatenation method, and compares the two. As an existing method, the direct concatenation method treats each row of a video frame captured during the read-out time as a single sample point and concatenates these points into a one-dimensional (1-d) signal. We show that the ENF signal extracted by direct concatenation exhibits mild distortion. We also examine our proposed extraction method, referred to as the periodic zeroing-out method. In contrast to the direct concatenation method, which pauses sampling during the idle time, the periodic zeroing-out method performs an equivalently uniform temporal sampling by returning zeros during the idle period toward the end of each frame period. It achieves a higher signal-to-noise ratio (SNR) at the cost of needing to know the read-out time parameter, which is camera-model dependent.
In addition to the multirate analyses for the direct concatenation and periodic zeroing-out methods that were presented in our preliminary work [1], [2], we highlight the unique contributions of this journal version as follows:
• We mathematically prove the closed forms of the aliased component index and the frequency value that achieve the highest SNR for both extraction methods.
• We investigate the distortion due to the direct concatenation method and conclude that the distortion does not significantly affect the quality of the extracted ENF signal.
• We examine the typical scenario of corruption effects caused by the aliased components and reveal that the effects are minimal because the magnitude of the aliased DC and negative ENF components is usually much smaller than that of the ENF component of interest.

The rest of this paper is structured as follows. Section II reviews the rolling shutter mechanism and the preprocessing steps for frequency estimation. Section III presents the theoretical analysis of the two ENF extraction methods for rolling-shutter videos and compares them. Section IV confirms the theoretical results with experiments. Section V compares the proposed method with an existing phase-based ENF extraction method [19]. Section VI provides a discussion, and Section VII concludes the paper.

II. BACKGROUND AND PRELIMINARIES
In this section, we review relevant literature and establish definitions to facilitate the understanding of the processes of extracting ENF signals from videos acquired by rolling shutters.

A. ENF Signal Extraction
Several classical frequency estimation methods have been used to extract the ENF signal from media recordings, including approaches based on the short-time Fourier transform (STFT) [21] and subspace analysis [22], [23]. The STFT is commonly used for analyzing time-varying signals such as the ENF signal. To extract the dominant instantaneous frequencies from the STFT-based periodogram, one can find the peak location and improve its precision with quadratic interpolation [24] or with the weighted energy approach [8]. On the other hand, subspace-based methods such as multiple signal classification (MUSIC) [22] and the estimation of signal parameters via rotational invariance techniques (ESPRIT) [23] exploit the orthogonality between the signal subspace and the noise subspace. ESPRIT is advantageous over MUSIC because of its lower computational and storage costs [23].
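As a concrete illustration of the peak refinement step, the sketch below (our own illustration, with hypothetical parameter values) fits a parabola through the periodogram bin of maximum magnitude and its two neighbors, in the spirit of the quadratic interpolation referenced in [24].

```python
import numpy as np

def quadratic_peak(mag, k):
    """Refine the location of a spectral peak at bin k by fitting a parabola
    through the magnitudes at bins k-1, k, and k+1."""
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    delta = 0.5 * (a - c) / (a - 2 * b + c)  # vertex offset of the fitted parabola
    return k + delta

fs, n = 1000.0, 1000               # hypothetical sampling rate (Hz) and signal length
t = np.arange(n) / fs
x = np.cos(2 * np.pi * 120.3 * t)  # tone located between two FFT bins
mag = np.abs(np.fft.rfft(x))
k = int(np.argmax(mag))
f_hat = quadratic_peak(mag, k) * fs / n  # refined estimate near 120.3 Hz
```

The refined estimate lands well inside the 1 Hz bin width, whereas the raw argmax alone can be off by up to half a bin.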
Dedicated methods based on frequency demodulation or frequency tracking have been developed to precisely extract the ENF signal. Studies in [25] and [26] consider the ENF trace as a frequency-modulated signal. In [25], frequency demodulation is applied to estimate the ENF. In [26], the authors propose a robust filtering algorithm (RFA) that effectively suppresses the additive noise and improves the SNR condition before ENF estimation. Combined with frequency estimation methods such as the STFT and ESPRIT, the use of the RFA leads to an improvement in ENF extraction accuracy. Additionally, some frequency tracking methods that exploit temporal smoothness have also been proposed for precise ENF estimation. In [27], the authors use dynamic programming to search for a minimum-cost path, which prevents abrupt frequency jumps from one time frame to another. Similarly, the authors in [28] propose offline and online multiple-frequency-trace tracking methods based on iterative dynamic programming and trace compensation, which can provide robust performance under low SNR conditions.

Certain preprocessing is required to extract ENF signals from multimedia recordings before the aforementioned frequency estimation or tracking algorithms are applied, as illustrated in Fig. 2.

Fig. 2. A block diagram for estimating the ENF signal from video frames. DCC and PZO stand for the direct concatenation (DCC) method and the periodic zeroing-out (PZO) method, respectively.

For visual recordings, the temporal variation of light intensity embedded in the video frames can be exploited to extract the ENF signal. The study in [8] takes the average of the pixel values in each row, produces a 1-d time-series signal, and uses it for frequency estimation. This process of making time-series signals will be reviewed in detail in Section II-C. Audio recordings, which are 1-d signals, are passed through bandpass filters with a narrow passband centered at the frequency of interest, i.e., 50/60 Hz.
They can then be readily used as input for frequency estimation. Bandpass filtering as a preprocessing step is necessary for the subspace-based approaches, but it is not necessarily required for the STFT-based methods.
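A minimal numpy-only sketch of such narrowband preprocessing (our own, not the paper's pipeline) zeroes all FFT bins outside the band of interest; in practice, a proper IIR/FIR bandpass filter would typically be used instead.

```python
import numpy as np

def narrowband_filter(x, fs, f_lo, f_hi):
    """Crude bandpass via FFT masking: keep only content between f_lo and f_hi (Hz)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

fs = 1000.0
t = np.arange(4000) / fs
x = np.sin(2 * np.pi * 60.0 * t) + 0.5 * np.sin(2 * np.pi * 300.0 * t)
y = narrowband_filter(x, fs, 59.0, 61.0)  # retains only the 60 Hz component
```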
More recent studies focus on designing dedicated ENF extraction algorithms for videos and images. In [18], the authors improve the quality of frequency estimation by reducing the discontinuities in an intermediate luminance signal. Specifically, pixels that are less impacted by motion are selected to contribute to the calculation of the luminance signal. In [20] and [29], the authors improve the ENF estimation by compensating for the temporal magnitude variation of the intermediate luminance signal due to the relative geometry among the light, the scene, and the camera. These preprocessing ideas are compatible with and orthogonal to the methods proposed in this paper. In [19], the authors propose a phase-based ENF extraction algorithm that works for short videos and does not require knowledge of the read-out time. That method assumes an ideal sinusoidal model and exploits phase differences of row signals from consecutive frames to estimate the ENF signal. Our comparison with [19] in Section V reveals that it underperforms our proposed method when embedded ENF signals have low SNR, especially when the magnitude of the ENF varies across the row index.

B. Video Frame Acquisition by Rolling Shutter
With a rolling shutter, each frame is recorded by scanning across a rectangular CMOS sensor line by line instead of capturing the whole frame at a single time instant, as in the case of a global shutter. Fig. 3 illustrates the video acquisition process of a camera using the rolling shutter mechanism. In each frame period T_c = 1/f_c, where f_c is the frame rate, the rows of a frame are sequentially exposed to light and acquired during the first T_ro seconds, followed by an idle period of T_idle seconds during which no action is taken before proceeding to the acquisition of the rows of the next frame [9]. T_ro is called the read-out time [30], defined as the amount of time required for all the rows of a frame to be acquired, and is a camera-dependent parameter [30], [31]. The study in [31] also shows that, within a camera device, T_ro can vary with video resolution or video frame rate.
Since pixels in different rows are exposed at different time instants but displayed simultaneously during playback, the rolling shutter may cause distortions such as skew, smear, and other image artifacts, especially with fast-moving objects and rapid flashes of light [32].

Fig. 3. An illustration of the video acquisition process of a camera using the rolling shutter mechanism: Rows of pixels of each frame, from Row 1 to Row L, are sequentially sampled during the read-out time T_ro, and no action is taken during the idle period T_idle. For ENF extraction, each row of a video frame is treated as a single sample point. The direct concatenation method concatenates all available sample points into a 1-d signal for frequency estimation.

The sequential read-out mechanism of the rolling shutter has traditionally been considered detrimental to image/video quality due to its accompanying artifacts. However, recent work has shown that the rolling shutter can be exploited with computer vision and computational photography techniques [33], [34]. In this work, we exploit the rolling shutter mechanism to significantly increase the sampling rate, thereby avoiding the undesirable aliasing caused by the low sampling rate of video cameras.

C. Visual Content Removal
The ENF signal is embedded in video recordings on top of the visual content, in the form of light interference. To separate the ENF signal from the video signal, we first focus on removing the visual content, corresponding to the first block of Fig. 2, and on estimating the time-domain ENF signal. The estimation of the ENF itself, our ultimate goal, is deferred to Section III.
In an indoor lighting environment, the light intensity of fluorescent and incandescent lights typically fluctuates at twice the nominal supply frequency, a frequency we denote by f_e, because the instantaneous power delivered is proportional to the square of the supply voltage. Although the ENF is time-varying, we assume it is constant in our derivation due to its extremely small variation range. As defined in [30], the ENF signal can be formulated as a zero-mean sinusoid with random initial phase φ and magnitude A as follows:

E_0(r, i) = A cos(Ω_a i + Ω_e r + φ),    (1)

where 1 ≤ i ≤ N and 1 ≤ r ≤ M denote the frame index and the row index, respectively. The angular frequencies Ω_a (in radians/frame) and Ω_e = 2π f_e/f_s (in radians/row) correspond to the frame and row dimensions, respectively, where f_s (in Hz) is the rolling shutter's row-wise sampling rate; more details can be found in [30]. Note that E_0(r, i) contains the ENF frequency component f_e (in Hz) to be estimated. When the ENF is embedded into videos, pixel locations with different visual contents may have slightly different magnitude responses. We therefore model such an effect as an additive noise term e(r, c, i) and express the resulting noisy ENF signal, which is the extra signal on top of the visual content, as

E(r, c, i) = E_0(r, i) + e(r, c, i),    (2)

where 1 ≤ c ≤ C denotes the column index and e(r, c, i) is assumed to be zero-mean additive white noise whose values at any pair of locations are independent and identically distributed (iid). We now denote an ENF-containing video acquired by a rolling shutter as

I(r, c, i) = V(r, c, i) + E(r, c, i),    (3)

where V(r, c, i) is the visual content and E(r, c, i) is an additive term to the video signal. To estimate f_e buried in E(r, c, i), we first estimate the visual content V(r, c, i). Second, we subtract it from the video signal I(r, c, i) to obtain a residual video Ê(r, c, i). Third, we average Ê(r, c, i) across the columns, which finally leads to an estimate of E_0(r, i).
To this end, we can use motion compensation [9] and obtain a residual video:

Ê(r, c, i) = I(r, c, i) − V̂(r, c, i),    (4)

where V̂(r, c, i) is the estimated visual content. Note that the pixels of the same row in Ê(r, c, i) are exposed to incoming light simultaneously and their noise terms are iid. Thus, averaging Ê(r, c, i) across the columns [9], we obtain:

(1/C) Σ_{c=1}^{C} Ê(r, c, i)    (5a)
  = E_0(r, i) + (1/C) Σ_{c=1}^{C} [V(r, c, i) − V̂(r, c, i)] + (1/C) Σ_{c=1}^{C} e(r, c, i).    (5b)

Due to the law of large numbers, the third term is close to 0. The second term is close to zero when V̂(r, c, i) is precise. Hence, we are able to extract E_0(r, i) by simply averaging Ê(r, c, i) with respect to the column. This also brings another benefit: averaging the pixel values of each row across the columns lowers the variance, i.e., increases the SNR, by a factor of C.
In the rest of this paper, we consider the special case of a static scene, i.e., V(r, c, i) ≃ V(r, c) for all frames i, and proceed with the corresponding derivation. As this paper focuses on estimating the ENF and not on estimating V(r, c, i), we use static-scene videos, which in turn helps obtain a higher SNR for the signal of interest and a simpler derivation. Using the fact that the visual content of every frame of a static-scene video is identical, we estimate the static visual scene from the average of a random subset of frames as follows:

V̂(r, c) = (1/|𝓘|) Σ_{i∈𝓘} I(r, c, i)    (6a)
  = V(r, c) + (1/|𝓘|) Σ_{i∈𝓘} E_0(r, i) + (1/|𝓘|) Σ_{i∈𝓘} e(r, c, i)    (6b)
  ≈ V(r, c),    (6c)

where 𝓘 is a random subset of {1, 2, . . . , N} and | · | is the cardinality operator that returns the number of elements of a set. Since E_0(r, i) and e(r, c, i) are zero-mean signals, (6c) shows that V̂(r, c) is an unbiased estimator of V(r, c) with a small variance due to the aggregation of many frames, and the second term of (5b) has an even smaller variance (by a factor of 1/C) due to averaging across the columns.
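The three-step procedure above (estimate the visual content, subtract, average across columns) can be sketched for the static-scene case as follows; the array shapes and helper names are ours, not the paper's.

```python
import numpy as np

def extract_row_signal(video, subset):
    """video: (N, R, C) array of frames; subset: frame indices for the static-scene estimate.
    Returns an (N, R) estimate of the row-wise ENF signal E0(r, i)."""
    V_hat = video[subset].mean(axis=0)   # static-scene estimate, cf. (6)
    residual = video - V_hat             # residual video, cf. (4)
    return residual.mean(axis=2)         # column average, cf. (5)

# Synthetic check: static scene plus a row-wise sinusoid shared by all columns.
rng = np.random.default_rng(0)
N, R, C = 20, 8, 64
V = rng.uniform(0, 255, size=(R, C))                  # static visual content
E0 = 2.0 * np.cos(0.3 * np.arange(N)[:, None] + 0.1 * np.arange(R)[None, :])
video = V[None, :, :] + E0[:, :, None]
e0_hat = extract_row_signal(video, subset=np.arange(N))
```

In this noise-free example with the full frame set used as the subset, the recovered signal equals E0 minus its per-row temporal mean, consistent with the small bias noted after (6c).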

III. ANALYSES OF TWO ENF EXTRACTION METHODS
This section analyzes and compares two ENF extraction approaches for correctly assembling the signal. Both methods assemble the row samples into a 1-d signal in the time domain and should be applied before the frequency estimation/tracking algorithms, corresponding to the second block of Fig. 2.

A. Direct Concatenation Method
In [8], the rolling shutter mechanism was exploited for the first time to resolve the insufficient sampling rate for ENF extraction from the visual track. In this subsection, we provide a theoretical analysis of the pros and cons of this method, which we refer to as the direct concatenation method. Fig. 4(a) shows how the row signal y(n) is formed by this extraction method from an available subset of all sample points x(n) captured by the rolling shutter mechanism. During each frame period T_c, a total of M sample points may be captured at the speed of the rolling shutter, among which only the L sample points captured during the first T_ro seconds are retained as the rows of the frame. The remaining M − L sample points are discarded, where L ≤ M and T_c = T_ro + T_idle. Segments of sample points of length L are concatenated into a single 1-d row signal y(n). We can thus define the 1-d row signals x(n) and y(n) as concatenated time-series sample points of E_0(r, i) in (5a), where n is given by (i − 1)M + r for x(n) and by (i − 1)L + r for y(n). Note that x(n) is close to a single-tone time-domain signal.
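In code, the direct concatenation step is simply a row-major flattening of the per-frame row means, with the result interpreted at the perceptual sampling rate f_ps = L·f_c; the sketch below uses our own hypothetical array layout.

```python
import numpy as np

def direct_concatenation(row_means, fc):
    """row_means: (N, L) per-row averages of N frames (only the L read-out rows exist).
    Returns the 1-d row signal y(n) and its perceptual sampling rate f_ps = L * fc."""
    n_frames, L = row_means.shape
    y = row_means.reshape(-1)  # concatenate the L retained rows of each frame
    return y, L * fc

rows = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames, L = 4 rows each
y, fps = direct_concatenation(rows, fc=30.0)     # fps = 120.0 samples/s
```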
The rolling shutter's row-wise sampling rate f_s can be defined in terms of the pair (M, T_c) or (L, T_ro) [30] as

f_s = 1/T_s = M/T_c = L/T_ro,    (7)

where T_s is the amount of time between the start of sampling one row and the start of sampling the subsequent row. We adopt an L-branch filter bank model where each branch has an M-fold downsampler and an L-fold upsampler with advance/delay operators at the beginning/end of each branch, as shown in Fig. 4(b). The lth branch of Fig. 4(b) is separately illustrated in Fig. 5 with the sampling rates reported for each intermediate stage. Using the theory of multirate signal processing [35], one can obtain the DTFT of the output signal at the lth branch shown in Fig. 5, given in (8); the detailed derivations can be found in Appendix A of the supplementary materials. The DTFT of the directly concatenated signal y(n) is given by (9), where the scaling function is defined in (10) and the aliased sinc function

asinc_L(x) ≜ sin(Lx/2) / (L sin(x/2))

is a periodic function that has zero crossings at integer multiples of 2π/L and achieves maxima at integer multiples of 2π for odd L and of 4π for even L. Some illustrative examples of the aliased sinc function in (10) are given in Fig. 13 of the supplementary materials.
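The aliased sinc function and its stated zero crossings and maxima can be verified numerically; the helper below is our own, with the removable singularities at multiples of 2π filled in via L'Hôpital's rule.

```python
import numpy as np

def asinc(L, x):
    """Aliased sinc: sin(L*x/2) / (L*sin(x/2)), with removable singularities filled in."""
    x = np.asarray(x, dtype=float)
    den = L * np.sin(x / 2)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(L * x / 2) / den
    # At x = 2*pi*k the limit is cos(L*x/2)/cos(x/2) by L'Hopital's rule.
    sing = np.isclose(np.sin(x / 2), 0.0)
    return np.where(sing, np.cos(L * x / 2) / np.cos(x / 2), out)

L = 7  # odd example
vals = asinc(L, np.array([0.0, 2 * np.pi / L, 2 * np.pi]))
# vals: maximum 1 at x = 0, zero crossing at 2*pi/L, and maximum 1 again at 2*pi (odd L)
```

For even L (e.g., L = 4), the value at x = 2π is −1, so the function attains its positive maxima only at multiples of 4π, as stated above.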
The direct concatenation method results in a change of sampling frequency from f_s for the complete row signal x(n) to (L/M) f_s for the directly concatenated signal y(n). As shown in Fig. 5, the complete row signal x(n) and its advanced version x(n + l) are sampled at the rolling shutter's effective sampling rate f_s. The signal is then downsampled by M and upsampled by L, such that the branch output signal y_l(n) is sampled at

f_ps = (L/M) f_s = L f_c,

which we define as the perceptual sampling frequency. We can relate the row signal in the analog frequency domain to the available row signal by "sampling" Y(Ω) of (9) using f_ps. That is, we substitute Ω by 2π f/f_ps and obtain (11), where the scaling function A_m(f; f_ps) is given in (12) and the frame rate f_c = 1/T_c. Details for obtaining (11)–(12) are given in Appendix B of the supplementary materials.
Eq. (11) shows that Y(f; f_ps) is the weighted sum of distorted copies of X(f; f_s) shifted by integer multiples of the frame rate f_c. A closer inspection reveals that the 0th summation term is the ideal input source signal X(f; f_s) distorted by the scaling function A_0(f; f_ps). Observe that the shift step is f_c = f_s/M, which means that the original and shifted copies X(f − m f_c; f_s) for m = 0, . . . , M − 1 will evenly occupy the full sampling range [0, f_s) Hz. The frequency components that were aliased under the lower sampling rate f_c are now well within range under the rolling shutter's effective sampling frequency f_s.
To extract high-quality ENF signals, one simple but effective strategy is to pick the frequency component of the highest SNR. The set of elements m* that maximizes the magnitude of (12), namely, the scaled asinc term, can be obtained by solving the optimization problem (13), with a technical side condition that the solution to (13) should be obtained from the main lobe of the aliased sinc function of (12). This condition is guaranteed by Lemma 3 of Appendix C of the supplementary materials. The ENF signal is embedded in camera-captured video recordings through light intensity changes that fluctuate at f_e, twice the nominal supply frequency. The instantaneous value of the ENF of a power grid varies around its nominal value, but the variation range is typically very narrow (e.g., 49.90–50.10 Hz for Europe [3] and 59.90–60.10 Hz for the United States [10]). Hence, we assume that the frequency-domain representation of a clean/pure ENF trace is nonzero only at ±f_e, with periods f_ps and f_s. We then need only to solve the optimization problem for the frequencies f taking the values in (14).
Substituting the positive half of (14) into the optimization problem (13), we may rewrite it as (15). We claim that the m* given in (16), where round(x) returns the integer nearest to x, is a solution to (13); the proof is given in Appendix C of the supplementary materials.
To guide the practical ENF extraction process, it is convenient to know the exact frequency location/range from which the strongest signal can be extracted. To this end, we plug m* into (15) and obtain (17), which reveals that the direct concatenation method needs both T_c and T_ro to determine the frequency location of the strongest ENF signal. T_ro is camera-model dependent, but for most cameras its value is not publicly available and is nontrivial to obtain. Instead, we can use visual cues by choosing the relatively strong straps from the spectrogram, which does not require the estimation of T_ro. Substituting the subset of frequencies resulting from the negative component of (14) into the optimization problem (13) gives (18) and (19). Results (17)–(19) are proven in Appendix C of the supplementary materials.

B. Periodic Zeroing-Out Method
In this paper, we propose a periodic zeroing-out method for extracting ENF signals from the visual track. Fig. 6 illustrates the timing schedule for acquiring the nonzero sample points from Row 1 to Row L and the zero-valued, imaginary sample points from Row L + 1 to Row M. Unlike the direct concatenation method, this method requires an estimate of T_ro. In return, the advantage is that the frequency component with the maximum SNR can always be extracted at twice the nominal supply frequency, ±f_e. We denote the time-domain input and output signals by x(n) and w(n), respectively, as shown in Fig. 4(c). Out of all M points of the available signal within one frame period, we zero out the last M − L sample points, which correspond to the idle period, and denote the resulting output signal as w(n). In this scenario, the effective sampling rate is the same for both x(n) and w(n).
Such an input–output relationship may be modeled by an L-branch filter bank with M-factor downsamplers/upsamplers and advance/delay operators, as shown in Fig. 4(d).

Fig. 6. An illustration of the sample acquisition process of the periodic zeroing-out method: Rows 1 to L of each frame are sequentially recorded during the read-out time T_ro, and the remaining "imaginary" rows' outputs are zeros during the idle period T_idle.

The DTFT of the output signal w_l(n) at the lth branch is given by (20); hence, the DTFT of the periodically zeroed-out signal w(n) is given by (21), where the scaling factor A′_m is defined in (22). The detailed derivations for (20)–(22) can be found in Appendix D of the supplementary materials. We substitute Ω by 2π f/f_s and obtain (23). Note that A′_m is not a function of f, whereas A_m(f; f_ps) of (11) in the direct concatenation method multiplicatively distorts X(f − m f_c; f_s). We observe that the periodically zeroed-out output W(f; f_s) is a weighted sum of non-distorted available inputs X(f; f_s) shifted by multiples of the frame rate f_c; it has frequency components at the values given in (24). Similar to the direct concatenation method in Section III-A, we want to find the frequency components achieving the maximum signal strength out of the M copies A′_m X(f − m f_c; f_s) for m = 0, . . . , M − 1. The set of elements m* that maximizes the magnitude of (22), namely, the asinc term, can be found by the optimization problem (25), and the solution m* leading to a zero cost in (25) is given in (26). To find the frequency components of the strongest ENF signals, we plug m* into (24) and obtain (27), which means that we can extract the frequency components of the strongest ENF signals at ±f_e. Equivalently, the frequency location of the strongest ENF signal is not a function of T_ro, although T_ro is used at the beginning to calculate the number of zeros to be inserted.
Another directly observable advantage of the periodic zeroing-out method over the direct concatenation method is that the ENF signal is not distorted.
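In implementation terms, the periodic zeroing-out method pads each frame's L row samples with M − L zeros so that the assembled signal stays uniformly sampled at f_s = M·f_c; from the rate relation f_s = M/T_c = L/T_ro, M can be derived as round(L·T_c/T_ro). The array layout below is our own sketch.

```python
import numpy as np

def periodic_zeroing_out(row_means, M):
    """row_means: (N, L) per-row averages; M: rows per frame period (requires knowing T_ro).
    Returns w(n): each frame's L samples followed by M - L zeros, at the full rate f_s."""
    n_frames, L = row_means.shape
    w = np.zeros((n_frames, M))
    w[:, :L] = row_means   # retained read-out rows; idle period stays zero
    return w.reshape(-1)

rows = np.ones((2, 3))               # 2 frames, L = 3 rows read out
w = periodic_zeroing_out(rows, M=5)  # -> [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
```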

C. Comparisons of Two Extraction Methods Using Derived Models
TABLE I compares the two extraction methods. First, the scaling factor A′_m of the periodic zeroing-out method uniformly scales the magnitude of the ENF signal along the frequency axis, whereas the scaling function A_m(f; f_ps) of the direct concatenation method distorts the spectrum along the frequency axis. The periodic zeroing-out method ensures that the ENF trace with the highest scaling magnitude can always be extracted at the doubled frequency of the nominal ENF per the results in (26) and (27), e.g., f_e = 120 Hz in the United States. In comparison, the results for the direct concatenation method in (17) and (18) reveal that the position of the highest scaling magnitude depends on both the frame period T_c and the read-out time T_ro. TABLE II shows the calculated frequencies that achieve the highest scaling magnitude for the two methods for an iPhone 6s operating at various frame rates in an environment with ENF f_e = 120 Hz. The iPhone 6s camera has an estimated read-out time T_ro = 19.8 ms; the estimation process can be found in Section IV-B.
The read-out time T ro is needed for the periodic zeroing-out method to generate the row signal w(n). T ro is also needed for the direct concatenation method to find the exact frequency location/range from which the strongest signal should be extracted. Note that T ro is not needed for the direct concatenation method if the spectrogram is clean and the visually strongest strap is selected.
TABLE II
ALIASED FREQUENCIES WITH HIGHEST MAGNITUDE PER THEORETICAL RESULTS OF SECTIONS III-A AND III-B

To compare the resulting SNR of the two extraction methods, we assume that the available row signal x(n) is corrupted by additive white Gaussian noise with a flat power spectrum σ². The noise power spectra at the outputs of the downsamplers in Fig. 4(b) and Fig. 4(d) are σ².
In contrast, upsampling reduces the powers to σ²/L and σ²/M in an averaged sense. Hence, the noise power spectra of y(n) and w(n) are σ² and Lσ²/M, respectively. In this paper, we calculate the local SNR at a peak frequency f_peak based on the definition in [11] as follows:

SNR_local = 10 log_10 ( PSD_signal,f_peak / PSD_noise,f_peak ),    (28)

where PSD_signal,f_peak and PSD_noise,f_peak are the power spectral densities (PSDs) of the signal and the noise components, respectively, at f_peak. The local SNR of the periodic zeroing-out method at its f* is 10 log_10 [ PSD_signal,f_e / (2σ²/f_s · L/M) ]. In contrast, the local SNR of the direct concatenation method at its f* is lower than or equal to this value. That is because the direct concatenation method has a maximum signal power less than or equal to that of the periodic zeroing-out method but has the same time-averaged noise PSD as the periodic zeroing-out method.
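The local SNR definition of (28) can be sketched as follows: take the PSD value at the peak bin as the signal term and a median of nearby bins (excluding a small guard band) as the local noise term. The guard and window sizes here are our own choices, not from [11].

```python
import numpy as np

def local_snr_db(psd, k, guard=3, span=30):
    """10*log10(PSD_signal / PSD_noise) at peak bin k, with the noise PSD estimated
    as the median of neighboring bins outside a guard band around the peak."""
    left = psd[max(0, k - span): max(0, k - guard)]
    right = psd[k + guard + 1: k + span + 1]
    noise = np.median(np.concatenate([left, right]))
    return 10.0 * np.log10(psd[k] / noise)

rng = np.random.default_rng(1)
fs, n = 1000.0, 8192
t = np.arange(n) / fs
x = np.cos(2 * np.pi * 120.0 * t) + 0.1 * rng.standard_normal(n)
psd = np.abs(np.fft.rfft(x)) ** 2 / (n * fs)   # one-sided periodogram
k = int(np.argmax(psd[1:])) + 1                # skip the DC bin
snr = local_snr_db(psd, k)                     # strongly positive for this clean tone
```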
The periodic zeroing-out method should, in principle, produce more accurate frequency estimates than direct concatenation. To quantitatively investigate the distortion due to the scaling function of the direct concatenation method, we examine the scaling function A_m(f; f_ps) at the best frequency component m = m* and within a small neighborhood of the typical ENF range in the US, namely, f ∈ F = (f* − δ, f* + δ), where δ = 0.05 Hz. Substituting the m* of (16) into (11), we obtain the strength as an asinc in f, given in (29), where r(x) = x − round(x) ∈ [−1/2, 1/2) is a residual function. We evaluate A_m*(f) for four example camera settings; A_m*(f) has very mild slopes in all four cases. We calculated the distortion span in percentage, defined in (30). As shown in TABLE III, the span of the strength within F is no more than 0.19% for all four cases. We also calculated the relative strength between the direct concatenation and the periodic zeroing-out cases, namely, A_m*/A′_m′*, where m′* is the m* defined in (26). It reveals that the strength of the direct concatenation method is more than 90% of that of the periodic zeroing-out method.
In this subsection, we quantitatively compared the two scaling functions A_m and A′_m. The results show that the two methods have similar scaling functions when the best frequency component is adopted. It should also be noted that the scaling functions do not directly reflect frequency estimation performance, which will be examined in Sections IV-D and IV-E.
IV. EXPERIMENTAL RESULTS

A. Experimental Setup
1) Experimental Setup for Video Capturing:
We conducted experiments using the back camera of an iPhone 6s in an indoor environment with electric lighting in Raleigh, USA. Videos of a static scene were acquired by facing the camera toward a white wall, using various frame rates. The frame rate, which is the inverse of the frame period during which light is integrated by the CMOS sensor cells, was controlled indirectly through the exposure time in an iPhone app named Yamera. The exposure time is a controllable parameter whose maximum value equals the frame period; in this case, the completion time for sampling a row in the current frame is the starting time for sampling the same row in the next frame. For example, using 1/23 sec as the exposure time, the camera can achieve a frame rate of up to 23 fps. We captured a dataset of 11 video recordings, each lasting about ten minutes and featuring a white wall. TABLE IV gives detailed information on the capturing conditions of the first ten recordings, which are investigated in Sections IV-B to IV-D. The last recording was captured at a special frame rate and is analyzed in Section IV-E.
We recorded power mains signals in parallel and regarded the ENF signals extracted from them as reference ENF signals due to their high SNR. The voltage of the power mains was first stepped down using a 110 V AC to 12 V AC transformer and then attenuated by a factor of 1001 (i.e., scaled by 1/1001) using a voltage divider circuit consisting of 33 Ω and 33 kΩ resistors. The output of the voltage divider, a sinusoidal waveform of roughly 12 mV, was plugged directly into a portable recorder, SONY ICD-UX543F, for recording.
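As a quick sanity check of the divider arithmetic above (the formula is the standard resistive-divider ratio; variable names are ours):

```python
# Voltage divider sanity check for the power-mains reference setup.
r_small, r_large = 33.0, 33_000.0        # ohms
ratio = r_small / (r_small + r_large)     # divider ratio = 33/33033 = 1/1001
v_in = 12.0                               # volts, after the 110 V -> 12 V transformer
v_out_mv = v_in * ratio * 1000            # output amplitude in millivolts
print(round(1 / ratio), round(v_out_mv, 1))  # -> 1001 12.0
```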
2) Practical Schemes for Visual Content Estimation: Unlike the static-scene estimation scheme in (6a), which randomly samples frames for averaging, in practice we use all available frames to estimate the visual content. Because the ENF trace is a zero-mean oscillation that is generally not synchronized with the frame rate, averaging over all frames tends to cancel it and consequently produces a reasonable visual content estimate.
Many cameras have an automatic brightness control mechanism, which increases or decreases the brightness of a visual scene depending on the illumination conditions [9]. To mitigate the negative effect of the brightness adjustment on ENF estimation, one may conduct intensity and contrast compensation. In [9], brightness-adjusted visual content was estimated using a linear transformation and then subtracted from each frame, leaving the ENF trace dominant in the resulting video, from which the row signal can subsequently be generated for ENF extraction. However, the magnitude of the ENF trace in the resulting ENF-containing video is still affected by the brightness adjustment. In this paper, we adjust the brightness of all frames to that of a fixed reference frame i_ref of the video as follows:

Ĩ_i = a_{i_ref, i} · I_i + b_{i_ref, i},

where I_i denotes the intensity of frame i, and the scaling parameter a_{i_ref, i} and the bias parameter b_{i_ref, i} for each frame are obtained by regressing the intensity of the current frame against that of the reference frame. To avoid picking an outlier frame, we choose the middle frame of the video as the reference frame. As a preprocessing step, this improves the approach in [9] by reducing both the brightness transition effects between consecutive frames in the ENF-containing video and the computational burden of calculating the scaling and bias parameters.
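The per-frame compensation above can be sketched as a least-squares fit of a scaling and a bias that map each frame's intensities onto the reference frame. This is a minimal illustration under our reading of the regression step; the function and variable names are ours, not the paper's:

```python
import numpy as np

def match_brightness(frame, ref):
    """Linearly map `frame` so its intensities match `ref`:
    find a, b minimizing ||a*frame + b - ref||^2 (ordinary least squares)."""
    x = frame.ravel().astype(float)
    y = ref.ravel().astype(float)
    A = np.column_stack([x, np.ones_like(x)])   # design matrix [frame, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * frame + b

# A frame that is a dimmed, offset copy of the reference is restored exactly.
rng = np.random.default_rng(1)
ref = rng.uniform(0, 255, size=(4, 6))
dimmed = 0.5 * ref - 10.0
restored = match_brightness(dimmed, ref)
print(np.allclose(restored, ref))  # -> True
```

Fitting every frame against one fixed reference, rather than against its predecessor, is what avoids accumulating brightness-transition effects across consecutive frames.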
3) Time-frequency Analysis: To visualize a row signal or reference signal in the frequency domain, we plot a spectrogram by splitting the signal into 12-second time windows with 90% overlap. We zoom into the spectrogram around the frequency of interest and use the frequency value of the spectral peak as the ENF estimate for each window. Quadratic interpolation is used to refine the ENF estimates [37].
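The quadratic interpolation step can be sketched as the standard parabolic refinement: fit a parabola through the log-magnitudes of the peak bin and its two neighbors and take the vertex as the refined frequency. This is a textbook version of the technique; the paper's exact implementation [37] may differ in details such as the window choice:

```python
import numpy as np

def refined_peak_freq(x, fs):
    """Refine an FFT peak location by parabolic (quadratic) interpolation."""
    n = len(x)
    mag = np.abs(np.fft.rfft(x * np.hanning(n)))   # windowed magnitude spectrum
    k = int(np.argmax(mag))
    # Vertex offset of the parabola through the three log-magnitude samples.
    a, b, c = np.log(mag[k - 1]), np.log(mag[k]), np.log(mag[k + 1])
    delta = 0.5 * (a - c) / (a - 2 * b + c)        # in [-0.5, 0.5] bins
    return (k + delta) * fs / n

fs = 600.0
t = np.arange(int(12 * fs)) / fs                   # one 12-second window
x = np.cos(2 * np.pi * 120.037 * t)                # off-bin tone near 120 Hz
print(refined_peak_freq(x, fs))                    # close to 120.037 Hz
```

Without the refinement, the estimate would be quantized to the FFT bin spacing (here 1/12 Hz), which is too coarse to track ENF fluctuations of a few mHz.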
We estimate the local SNR at a frequency of interest, f, using the ratio of the estimated signal power to the estimated noise power within a small neighborhood around f [11]. Under the assumption that the signal of interest is buried in white Gaussian noise, we use a different definition of the noise power: the median of all power values in the range between 0 and f_s/2. For the signal power, we adopt the same definition as in [11], subtracting the obtained noise power estimate from the power of the periodogram peak in the vicinity of F.

4) Performance Measures: We evaluate the (dis)similarity between two signals {x_i}_{i=1}^N and {y_i}_{i=1}^N using three metrics: i) the normalized cross-correlation (NCC), ii) the root-mean-square error (RMSE), and iii) the mean absolute error (MAE).
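The three measures can be computed directly; the sketch below uses standard definitions (the NCC here is the zero-lag normalized cross-correlation of mean-removed signals, which is one common convention and may differ slightly from the paper's exact formula):

```python
import numpy as np

def ncc(x, y):
    """Zero-lag normalized cross-correlation of mean-removed signals."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def rmse(x, y):
    return float(np.sqrt(np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)))

def mae(x, y):
    return float(np.mean(np.abs(np.asarray(x, float) - np.asarray(y, float))))

# Toy example: a reference ENF trace around 60 Hz and a noisy estimate of it.
t = np.linspace(0, 10, 1000)
ref = 60 + 0.01 * np.sin(0.5 * t)
est = ref + 0.001 * np.cos(3 * t)
print(ncc(ref, est) > 0.99, rmse(ref, est) < 0.01, mae(ref, est) < 0.01)
```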

B. Camera Read-Out Time T ro Estimation
We estimate the read-out time T_ro for the cameras used in this experiment using the vertical phase method [30], which analyzes the phase shift in the discrete Fourier transforms (DFTs) of the row signal and calculates the quantity e in (1) needed for estimating the read-out time. We improved the accuracy of the T_ro estimation by compensating for the effect of the camera's brightness control, as described in the previous subsection.
TABLE IV shows the T_ro estimates for various videos captured under different recording conditions. There are two distinct groups of closely clustered T_ro values. The first group contains recordings #1-#6, with an average of 19.8 ms and a standard deviation of 0.3 ms. The second group contains recordings #7-#10, with an average of 12.6 ms and a standard deviation of 0.2 ms. One possible explanation for such closely clustered T_ro values is that the iPhone 6s has a few discrete preprogrammed T_ro values that are used for different frame resolutions.
Specifically, for all videos with no more than 480 lines, the true value corresponding to the 19.8 ms estimate is used, whereas for all videos with at least 720 lines, the true value corresponding to the 12.6 ms estimate is used. In the following experiments in Sections IV-C to IV-E, we deal with videos with L = 480, for which 19.8 ms is considered to be the true T_ro value.

Fig. 7(a) and Fig. 7(e) illustrate two spectrograms for row signals generated by the direct concatenation method and the periodic zeroing-out method, respectively. The video was captured under the nominal ENF of 60 Hz, i.e., f_e = 120 Hz, and the camera parameters were f_c = 23.0062 fps, L = 480, and T_ro = 19.8 ms. The frame rate was chosen not to be a divisor of the nominal ENF frequency to avoid mixing the DC and ENF components, for visualization purposes. We will investigate the scenario in which these components are mixed in Section IV-E. The spectrograms show the aliased DC components as thick red strips at frequency locations {m f_c : m ∈ Z} and the aliased ENF components as thin red strips at frequency locations {±f_e + m f_c : m ∈ Z}, which is consistent with the predictions from our theoretical analyses of the two models in Section III.
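The predicted locations of the aliased components can be enumerated directly from these sets. A sketch with the recording's parameters (f_e = 120 Hz, f_c = 23.0062 fps; the zoom-in band is an illustrative choice of ours):

```python
# Enumerate predicted aliased component locations within a display band.
f_e, f_c = 120.0, 23.0062          # ENF harmonic (Hz) and frame rate (fps)
band = (40.0, 70.0)                # an example zoom-in range (Hz)

def aliases(base, f_c, band, m_range=range(-10, 11)):
    """Frequencies {base + m*f_c} falling inside `band`."""
    return sorted(base + m * f_c for m in m_range
                  if band[0] <= base + m * f_c <= band[1])

dc = aliases(0.0, f_c, band)       # aliased DC components {m*f_c}
pos = aliases(f_e, f_c, band)      # aliased positive ENF components {+f_e + m*f_c}
neg = aliases(-f_e, f_c, band)     # aliased negative ENF components {-f_e + m*f_c}
print(dc, pos, neg)
# The positive list contains 120 - 3*23.0062 = 50.9814 Hz, the component
# identified in Section IV-D as the strongest for direct concatenation.
```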

C. ENF Signal Extraction Using Derived Models
In the zoomed-in regions of Fig. 7(a) and Fig. 7(e), we annotated all aliased components, namely, "positive (+ve)," "dc," and "negative (-ve)," wherever a specific component exists. We calculated all possible aliased frequency components within the displayed ranges and listed them in Fig. 7(b) and Fig. 7(f). Comparing the annotated frequency components in the spectrograms with the theoretical ones summarized in the tables reveals that only the components with larger magnitudes are visible. It is interesting to note that within the displayed ranges, most of the negative aliased ENF components are not visible due to their small magnitudes. Fig. 7(c) and Fig. 7(g) plot the measured scaling values against the theoretical scaling values for all aliased ENF components in Fig. 7(a) and Fig. 7(e). Both figures indicate a roughly linear relationship, which verifies that the theoretical predictions of the aliased ENF components in (18) and (27) are consistent with the practical measurements from the video recording.

D. Comparison of ENF Signal Extraction Methods
Fig. 7(d) shows extracted ENF signals using the direct concatenation method from the component of the highest magnitude at 50.9814 Hz, which is calculated from (18) with k* + ν* = 0, and from two neighboring components at 50.9814 ± 23.0062 Hz. Similarly, Fig. 7(h) shows extracted ENF signals using the periodic zeroing-out method from the component of the highest magnitude at 120 Hz and from two neighboring components at 120 ± 23.0062 Hz. The empirical SNRs of the components were measured, and we observe that the theory-predicted frequency components, 50.9814 Hz and 120 Hz for the respective methods, have the highest SNRs.

We further evaluate the performance of the direct concatenation method and the periodic zeroing-out method in low local-SNR scenarios simulated by adding zero-mean white Gaussian noise to the rows, or equivalently, to x(n), of all ten recordings given in TABLE IV. For each recording, five noise realizations were generated, for a sample of size 50 in total. To examine the relative performance of the two methods in the statistical sense, t-tests with a significance level of 0.05 were conducted between the two groups' results. Fig. 8 reveals that the periodic zeroing-out method outperforms the direct concatenation method at all local SNR levels in terms of the NCC, RMSE, and MAE of the frequency estimates. For the NCC, only the p-value at 15 dB is statistically significant; for the RMSE and MAE, the p-values at all local SNR levels except 25 dB are statistically significant. These results indicate that the periodic zeroing-out method is, on average, more robust against noise than the direct concatenation method. However, the performance improvement over the direct concatenation method is moderate, which implies that the direct concatenation method can still deliver reasonable performance.
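The statistical comparison can be reproduced with a two-sample t-test. Below is a minimal numpy sketch using Welch's t-statistic; the paper does not specify its exact test variant, and the synthetic samples here merely stand in for the two methods' per-realization error values:

```python
import numpy as np

def welch_t(x, y):
    """Welch's two-sample t-statistic and its degrees of freedom."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df

rng = np.random.default_rng(7)
# Hypothetical RMSE samples (Hz), 50 realizations per method, where
# "method B" is assumed to be more accurate than "method A".
rmse_a = rng.normal(0.015, 0.002, size=50)
rmse_b = rng.normal(0.010, 0.002, size=50)
t, df = welch_t(rmse_a, rmse_b)
print(t > 2.0)  # |t| well beyond ~2 indicates significance at the 0.05 level
```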

E. ENF Extraction When Frequency of Interest Is Corrupted by Aliased Components
In this subsection, we investigate special frame rates that cause the frequency of interest to be corrupted by other aliased components, and subsequently show that the consequence is mild. The aliased DC component corrupts the frequency of interest, f_e, when there exists m_dc ∈ Z such that 0 + m_dc f_c = f_e. Similarly, the aliased negative ENF component corrupts f_e when there exists m_neg ∈ Z such that −f_e + m_neg f_c = f_e. It is noted that only integer-valued frame rates may lead to corruption of f_e = 100 or 120 Hz. We summarize the ENF frequency and frame rate pairs that lead to corruption due to aliased components.

Next, we show that even if aliasing exists, the corruption caused by it is of relatively small magnitude, so there is no significant adverse effect. We captured a white-wall video using the iPhone 6s under the nominal ENF at f_e = 120 Hz, frame rate f_c = 30 fps, and L = 480 with T_ro = 19.8 ms. For the periodic zeroing-out method, the strongest positive ENF component is located at f* = 120 Hz with index m = 0. We chose 120 Hz as the frequency of interest; as illustrated in Fig. 9, both the aliased DC component and the aliased negative ENF component overlap with it. To evaluate the impact of the corruption on the accuracy of frequency estimation, we extract the ENF signal from the strongest component and compare it with the reference ENF signal in Fig. 10. The NCC, RMSE, and MAE with respect to the reference signal are shown at the bottom right of the plot. The figure reveals that the ENF signals estimated by both the direct concatenation method and the periodic zeroing-out method are similar to the reference ENF signal. This implies that under our experimental setup with the iPhone 6s, f_e = 120 Hz, and f_c = 30 fps, the DC and negative ENF components do not have a strong effect on the accuracy of frequency estimation when overlapping with the positive ENF component.

Fig. 10. The ENF signals extracted from the strongest component at 60 Hz using the direct concatenation method (blue dash-dotted) and at 120 Hz using the periodic zeroing-out method (red dotted). The black solid curve is the reference ENF signal. Three (dis)similarity measures are shown at the bottom (left values for the direct concatenation and right values for the periodic zeroing-out). The blue and red ENF signals were shifted and scaled to align with the reference ENF signal.
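The corruption conditions above reduce to simple divisibility checks: the aliased DC component lands on f_e when f_e/f_c is an integer, and the aliased negative ENF component lands on f_e when 2 f_e/f_c is an integer. A quick enumeration over integer frame rates (the 1-60 fps range is our illustrative choice):

```python
def corrupting_rates(f_e, max_fc=60):
    """Integer frame rates whose aliased DC / negative ENF components land
    exactly on f_e (conditions: f_e/f_c integer and 2*f_e/f_c integer)."""
    dc_hit = [fc for fc in range(1, max_fc + 1) if f_e % fc == 0]
    neg_hit = [fc for fc in range(1, max_fc + 1) if (2 * f_e) % fc == 0]
    return dc_hit, neg_hit

for f_e in (100, 120):
    dc_hit, neg_hit = corrupting_rates(f_e)
    print(f_e, dc_hit, neg_hit)
# f_c = 30 fps with f_e = 120 Hz (the case studied above) appears in both
# lists: m_dc = 4 (4*30 = 120) and m_neg = 8 (-120 + 8*30 = 120).
```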
To examine whether the above result holds for cameras with different T_ro values, we vary T_ro and examine the relative magnitudes of the aliased components. The results for two typical cases are shown in Fig. 11. No matter how T_ro changes, the effect of the aliased components is minor. This can be explained by the shape of the aliased sinc function, which has peaks surrounded by oscillations of decaying amplitude.

V. COMPARISON WITH PHASE-BASED ENF EXTRACTION METHOD
Following the publication of our preliminary work on the PZO method [2], Han et al. [19] proposed a phase-based method and showed that it outperforms the PZO method in their implementation. Specifically, they replaced the precise read-out time (see Section IV-B and [30]) with a read-out time estimator that is based on the given recording. We refer to Han et al.'s implementation of PZO as PZO-Han. We will show that PZO outperforms both the phase-based method and PZO-Han.
We implemented the phase-based method using a time window of 12 seconds and used the same parameters as those in [19]. For PZO-Han, we kept everything the same in our PZO implementation, with the sole exception of using the read-out time estimator by Han et al.
TABLE VI mainly compares the performance of PZO and the phase-based method [19] and reveals that PZO performs significantly better. For 9 out of 11 video recordings, the phase-based method fails to generate reliable ENF signals, as indicated by the low NCC values highlighted in red, while DCC and PZO extract ENF signals stably. Only for recordings #8 and #11 is the performance of the phase-based method comparable. Our inspection of the estimated waveforms reveals that the phase-based method performs worse under low SNR (a characteristic of our dataset), especially when the magnitude of the ENF varies across the row index. Specifically, we show in Fig. 12 some representative examples of row signals and their corresponding ENF traces estimated by the phase-based method. The row signals from our dataset in Fig. 12(a)-(b) are noisier and less sinusoidal than the row signals from Han et al.'s dataset [19] in Fig. 12(c)-(d). The lowered performance of the phase-based method on our dataset may be attributed to model-data mismatch, as the phase-based method assumes an ideal sinusoidal model in its objective function. One potential remedy is to allow the magnitude of the sinusoidal signal to be time- or space-varying [20], [29], [38]. Another is to adopt or design a frequency estimation method that is less sensitive to departures from an ideal sinusoidal model and more robust against the acquisition and quantization noise common in video signals. More examples of row signals are provided in Appendix E of the supplementary materials.
TABLE VI also compares the performance of PZO-Han [19] and the phase-based method [19] on our dataset and reveals a different conclusion from that reported in [19] on their dataset. We observe that PZO-Han performs well on 9 out of 11 video recordings even when using the less precise estimates T̂_ro. Only recordings #5 and #10 from our dataset performed poorly, likely due to significantly overestimated read-out times. This is because a larger-than-normal read-out time reduces M and causes the upsampler-induced noise floor to rise, which decreases the SNR for ENF estimation.

Fig. 12. Examples of row signals and their ENF traces estimated by the phase-based method [19]. (a)-(b) The row signals from our dataset have more noise and less ideal sinusoidal shapes than (c)-(d) the row signals from the dataset by Han et al. [19]. The phase-based method [19] performs well only in the latter cases.
We acknowledge the contributions of the new methods [18], [19], [20] for ENF extraction from videos that appeared after the initial publication of our PZO method. As mentioned in Section II-A, ideas such as pixel selection [18] and compensation of magnitude variations [20], [29], [38] are compatible with our proposed methods: DCC and PZO focus on temporally arranging the sample points, whereas [18], [20], [29] focus on improving the quality of the generated sample point values. Our analysis reveals how different ways of assembling time-domain signals lead to different interpretations in the Fourier domain.
VI. DISCUSSION
A. Aliased Components Are Not Harmonics
Using the spectrum combining technique proposed in [39] will not yield more accurate ENF signals despite the multiple strips observed, because the strips are artifacts of the two extraction methods rather than physical harmonics caused by nonlinearities. Note that the available row signal contains both the ENF signal and noise. After passing through either extraction method, the ENF is duplicated at different frequency locations along with its noise, which originates from the exact same realization; the duplicates therefore carry no independent noise that spectrum combining could average out.
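The duplication can be illustrated with the standard multirate identity: zero-insertion upsampling leaves the spectrum unchanged apart from frequency compression, so a tone reappears as images at predictable locations, each carrying the same noise realization. This is a toy sketch, not the paper's exact pipeline:

```python
import numpy as np

# A noisy tone at 60 Hz, then zero-insertion upsampling by a factor of 3.
fs, L = 300.0, 3
t = np.arange(3000) / fs
rng = np.random.default_rng(2)
x = np.cos(2 * np.pi * 60.0 * t) + 0.05 * rng.standard_normal(t.size)
up = np.zeros(L * x.size)
up[::L] = x                        # insert L-1 zeros between samples

mag = np.abs(np.fft.rfft(up))
freqs = np.fft.rfftfreq(up.size, d=1 / (L * fs))
# Images of the 60 Hz tone appear at 60, fs-60 = 240, and fs+60 = 360 Hz.
peaks = [round(freqs[np.argmax(mag * ((freqs > f - 5) & (freqs < f + 5)))])
         for f in (60.0, 240.0, 360.0)]
print(peaks)  # -> [60, 240, 360]
```

All three images are copies of the same spectral slice, which is why combining them offers no diversity gain over using the strongest one alone.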

B. Comparison With Parallel Work
While working on extending our preliminary results on periodic zeroing-out presented at ICASSP 2019 [2], we became aware of a parallel effort by Vatansever et al. [31], which also used multirate signal processing to analyze ENF in videos acquired by rolling shutters. In light of the similar submission time of [2] and [31], we believe that the two lines of research were conducted in parallel. Both papers were inspired by our earlier results published at ICIP 2014 [1].
We discuss below the similarities and differences between Vatansever et al. [31] and this paper. Both works include an analytic model for the periodically aliased ENF components in rolling-shutter captured videos. Both works examine the magnitude attenuation of frequency components and formulate the optimization problem of finding the ENF component with the strongest magnitude. Based on their model and its predictions, Vatansever et al. [31] propose an idle period estimator and a new time-of-recording authentication technique. In comparison, this work derives the closed-form solution of the optimization problem for finding the strongest ENF component, which had not been done in prior work. This work also conducts in-depth mathematical analyses and experimental comparisons for both the direct concatenation and periodic zeroing-out methods, whereas Vatansever et al. [31] analyze only the direct concatenation method. Additionally, this work investigates the corruption effect due to the aliased components and finds that this effect is practically negligible.

VII. CONCLUSION
In this paper, we have examined and compared two methods for extracting ENF signals from rolling-shutter videos. The direct concatenation method treats each row of a video frame exposed during the read-out time as a single sample point and concatenates the rows into a 1-D signal for frequency estimation. The periodic zeroing-out method achieves an equivalent uniform temporal sampling by inserting zeros during the idle period. Using the multirate filter bank model, the two methods have been examined and compared theoretically and experimentally. Our experimental results have verified our theoretical predictions of where the ENF component achieving the maximum SNR can be found, and have shown that the distortion and corruption effects on the quality of the extracted ENF signal are minor. More importantly, it has been revealed that neither method performs significantly better than the other; rather, they are complementary: direct concatenation requires no knowledge of the exact read-out time, whereas periodic zeroing-out, given the read-out time, extracts distortion-free ENF signals.