Robust Tensor Decomposition Based Background/Foreground Separation in Noisy Videos and Its Applications in Additive Manufacturing

Background/foreground separation is one of the most fundamental tasks in computer vision, especially for video data. Robust PCA (RPCA) and its tensor extension, namely, Robust Tensor PCA (RTPCA), provide an effective framework for background/foreground separation by decomposing the data into low-rank and sparse components, which contain the background and the foreground (moving objects), respectively. However, in real-world applications, the video data is contaminated with noise. For example, in metal additive manufacturing (AM), the processed X-ray video used to study melt pool dynamics is very noisy. RPCA and RTPCA are not able to separate the background, foreground, and noise simultaneously. As a result, the noise will contaminate the background, the foreground, or both, and there is a need to remove the noise from the background and foreground. To achieve the three-component decomposition, a smooth sparse Robust Tensor Decomposition (SS-RTD) model is proposed to decompose the data into static background, smooth foreground, and noise, respectively. Specifically, the static background is modeled by the low-rank Tucker decomposition; the smooth foreground (moving objects) is modeled by spatio-temporal continuity, which is enforced by total variation regularization; and the noise is modeled by sparsity, which is enforced by the $\ell _{1}$ norm. An efficient algorithm based on the alternating direction method of multipliers (ADMM) is implemented to solve the proposed model. Extensive experiments on both simulated and real data demonstrate that the proposed method significantly outperforms state-of-the-art approaches for background/foreground separation in noisy cases.

Note to Practitioners—This work is motivated by melt pool detection in metal additive manufacturing, where the processed X-ray video from the monitoring system is very noisy.
The objective is to recover the background with porosity defects and the foreground with the melt pool in the presence of noise. Existing methods fail to separate the noise from the background and foreground since RPCA and RTPCA have only two components, which cannot explain the three components in the data. This paper puts forward a smooth sparse Robust Tensor Decomposition that decomposes the tensor data into low-rank, smooth, and sparse components, respectively. It is a highly effective method for background/foreground separation in noisy cases. In the case studies on simulated video and X-ray data, the proposed method can handle non-additive noise, even at a high noise ratio. In the proposed algorithm, there is only one tuning parameter $\lambda $. Based on the case studies, our method achieves satisfactory performance for any $\lambda \in [{0.2,1}]$ with anisotropic total variation regularization. With this observation, practitioners can apply the proposed method without extensive parameter tuning work. Furthermore, the proposed method is also applicable to other popular industrial applications. Practitioners can use the proposed SS-RTD for degradation process monitoring, where the degradation image contains a static background, anomalies, and random disturbances.


NOMENCLATURE

H, W, T: The height and width of an image frame, and the number of frames
(r_1, r_2, r_3): The multi-linear rank in the Tucker decomposition
λ: The balance coefficient in the proposed objective function
X: The order-three tensor in R^{H×W×T}, represented by {X_1, · · · , X_T}
X_t: The t-th image frame in R^{H×W}
L: The low-rank tensor (static video background)
S: The smooth tensor (smooth moving objects)
E: The noise tensor (all kinds of noise)
X ×_n U: The mode-n multiplication of a tensor X with a matrix U
G: The core tensor in the Tucker decomposition
U_1, U_2, U_3: The factor matrices in the Tucker decomposition
f: The auxiliary variable
D_h, D_v, D_t: The vectorized difference operators along the horizontal, vertical, and temporal directions
D: The concatenated difference operator
$\|\cdot\|_F$: The Frobenius norm
$\|\cdot\|_1$: The $\ell_1$ norm
$\|\cdot\|_2$: The $\ell_2$ norm
$\|\cdot\|_{TV1}$: The anisotropic total variation norm
vec(·): The vectorization operator
ten(·): The tensorization operator
λ_f, Λ_X: The Lagrange multiplier vector and tensor
β_f, β_X: The positive penalty scalars
c_1, c_2: The coefficients in the adaptive updating scheme for β_f, β_X
fftn: The fast 3D Fourier transform
ifftn: The inverse fast 3D Fourier transform
soft(·, ·): The soft-thresholding operator
γ: The parameter associated with the convergence rate in ADMM
Err(·): The error of the auxiliary variable
relChgA: The relative change of A
relErrA: The relative error of A

1545-5955 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION
Background/foreground separation is a fundamental step for moving object detection in many video data applications [1], [2]. It is usually performed by separating the moving objects, called the "foreground," from the static objects, called the "background" [3], [4]. In particular, we are interested in foreground/background separation in the presence of noise. In many real-world applications, the presence of noise is a very common but challenging issue [5], [6].
One motivating example comes from metal additive manufacturing (AM) research. In metal AM, high-speed X-ray imaging has been applied to study melt pool formation and evolution [7]. The geometry of the melt pool can be described by the solid/liquid interface, which can be further utilized to characterize the microstructure of the final part [8]. To capture the melt pool, a high-speed X-ray imaging system is applied to monitor the printing process [9]. One unprocessed X-ray image from the monitoring system is shown in Fig. 1a, where the solid/liquid interface (red boundary in Fig. 1c) cannot be identified. This is because the densities of the solid and liquid states of the metal material are almost the same. In practice, researchers [7], [10], [11] need to apply preprocessing techniques to enhance the melt pool boundary. In this work, the X-ray image captured six frames earlier is subtracted from the unprocessed X-ray image, and the contrast is then adjusted to obtain the processed X-ray image. The processed X-ray image is shown in Fig. 1b, where the melt pool boundary can be located by the human eye, as outlined in red in Fig. 1c. Even though the melt pool boundary is enhanced, this subtraction step generates a large amount of random noise, causing problems for accurate boundary detection.
Recent research on background/foreground separation is based on decomposing the video data into low-rank and sparse components. It is an effective framework to separate the foreground from the background, which are modeled by the sparse and low-rank components, respectively. Among them, the most representative problem formulation is Robust Principal Component Analysis (RPCA) [3], [12], a modification of the widely used statistical procedure named principal component analysis (PCA). RPCA [12] decomposes the data matrix X into the sum of a low-rank matrix L and a sparse matrix S, where the low-rankness and sparsity are measured by the nuclear norm $\|\cdot\|_*$ and the $\ell_1$ norm $\|\cdot\|_1$, respectively. Apart from background/foreground separation, RPCA can also be applied to the task of image denoising [2], where the noise is represented by the sparse component.
One major disadvantage of RPCA is that it can only deal with 2-D matrix data since the nuclear norm $\|\cdot\|_*$ is designed for matrices. However, real-world data is usually multi-dimensional in nature, where rich information is stored in multi-way arrays known as tensors [13]. For example, a greyscale video is 3-D data, which stacks multiple images along the time domain; a color image is also 3-D data that has three channels, red, green, and blue, where each channel is a 2-D image. To apply RPCA to these datasets, the multi-way tensor data has to be reshaped into a matrix. Such preprocessing usually leads to information loss and performance degradation since the structure information in the data is deteriorated. To address this issue, it is necessary to extend RPCA to manipulate tensor data directly by taking advantage of its multi-dimensional structure. However, doing so is challenging since the numerical algebra of tensors is fraught with many computationally hard problems [14], [15].
Enabled by the newly developed tensor multiplication scheme in t-SVD [16], Zhang et al. [17] proposed the tensor tubal rank as well as the tensor nuclear norm for image denoising. Based on the tensor nuclear norm, Lu et al. [18] developed Robust Tensor PCA (RTPCA) by extending RPCA from 2-D matrix to 3-D tensor data, aiming to exactly recover a low-rank tensor contaminated by sparse errors. More specifically, it tries to recover the low-rank tensor L and the sparse tensor E from the data tensor X, which can be represented as X = L + E. However, none of these methods is able to address the background/foreground separation problem in noisy cases because the low-rank and sparse components extracted by RPCA or RTPCA algorithms do not account for the noise component. There are works [19], [20] that decompose the video X into three components, namely, L, S, and E. They focus on using L + E to model the dynamic background, where L represents the static background and E represents the small changes in the dynamic background. They are all vector-based methods that do not use the structure information in the tensor.
Deep learning-based methods for background/foreground separation have also shown promise. In [21], the authors present a background subtraction algorithm based on spatial features learned with convolutional neural networks. In [22], background estimation at each pixel is carried out by weightless neural networks designed to learn pixel color frequency during video play, and all networks share the same rule for memory retention during training. Another supervised learning-based approach is to apply a semantic segmentation model trained on a labeled dataset containing objects of interest to directly produce foreground masks for each frame of a video. In the field of semantic segmentation, deep convolutional networks are the most popular models, such as Mask R-CNN [23] and DeepLab [24]. The disadvantage of such supervised methods is that they require the classes of foreground objects during training. Instead, unsupervised techniques like RPCA and RTPCA are applicable to datasets with arbitrary semantic content. While deep learning-based methods have shown promise in the noiseless regime, they are generally not designed to perform foreground/background separation in the presence of noise, which is the focus of our work.
The objective of this study is to address the problem of background/foreground separation in the presence of noise and to apply it to additive manufacturing applications (Fig. 1). To achieve this objective, a smooth sparse Robust Tensor Decomposition (SS-RTD) is proposed to decompose the data tensor X into a low-rank tensor (background) L, a smooth tensor (foreground) S, and a sparse tensor (noise) E, namely, X = L + S + E. In the SS-RTD, the background is modeled by the low-rank Tucker decomposition [13]. Spatio-temporal continuity is applied to formulate the moving objects (foreground) [20], [25]. That is, the moving objects in the video foreground are spatially continuous in both their support regions and their intensity values within these regions. Moreover, the moving objects are temporally continuous across successive frames. The noise is modeled by a sparse tensor.

To summarize, the contributions of this paper are as follows: (i) Propose the smooth sparse Robust Tensor Decomposition model for background/foreground separation in the presence of noise by decomposing the data tensor into a low-rank tensor, a smooth tensor, and a sparse tensor, respectively; (ii) Implement an efficient algorithm based on the alternating direction method of multipliers (ADMM) [26] to solve the proposed model.

The remainder of this paper is organized as follows. A brief review of notation and related research work is provided in Section II. The proposed model and the algorithm to solve it are introduced in Section III. Numerical studies in Section IV and a real-world additive manufacturing application in Section V are provided for testing and validation of the proposed method. Finally, conclusions and future work are discussed in Section VI.

II. NOTATION AND RESEARCH BACKGROUND
In Section II-A, the notation and basics in multi-linear algebra used in this paper are reviewed. Then, the smooth sparse decomposition and Robust Tensor PCA are reviewed briefly in Section II-B. Afterwards, the research gaps of the existing work are identified in Section II-C.

A. Notation and Tensor Basics
Throughout this paper, scalars are denoted by lowercase letters, e.g., x; vectors are denoted by lowercase boldface letters, e.g., x; matrices are denoted by uppercase boldface, e.g., X; and tensors are denoted by calligraphic letters, e.g., X. The order of a tensor is the number of its modes or dimensions. A real-valued tensor of order N is denoted by X ∈ R^{I_1×I_2×···×I_N} and its entries by X(i_1, i_2, · · · , i_N). The multi-linear Tucker rank of an N-order tensor is the tuple of the ranks of the mode-n unfoldings X_{(n)} ∈ R^{I_n×(I_1 ··· I_{n−1} I_{n+1} ··· I_N)}. The inner product of two same-sized tensors X and Y is the sum of the products of their entries, i.e., $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1, \ldots, i_N} \mathcal{X}(i_1, \ldots, i_N)\, \mathcal{Y}(i_1, \ldots, i_N)$. Following the definition of the inner product, the Frobenius norm of a tensor X is defined as $\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$. The mode-n multiplication of a tensor X with a matrix U amounts to the multiplication of all mode-n vector fibers with U, i.e., $(\mathcal{X} \times_n \mathbf{U})(i_1, \ldots, i_{n-1}, j_n, i_{n+1}, \ldots, i_N) = \sum_{i_n} \mathcal{X}(i_1, \ldots, i_N)\, \mathbf{U}(j_n, i_n)$.
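As a concrete illustration, the mode-n unfolding and the mode-n multiplication can be sketched in a few lines of NumPy (an illustrative sketch; `unfold` and `mode_n_product` are our own helper names, not notation from this paper):

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: move mode n to the front and flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    """Mode-n multiplication X x_n U: multiply every mode-n fiber of X by U."""
    Yn = U @ unfold(X, n)                        # (J_n, product of the other dims)
    other_dims = [s for i, s in enumerate(X.shape) if i != n]
    # undo the moveaxis + reshape performed by unfold
    return np.moveaxis(Yn.reshape([U.shape[0]] + other_dims), 0, n)
```

For a 3-way tensor X ∈ R^{3×4×5} and U ∈ R^{6×4}, `mode_n_product(X, U, 1)` returns a 3×6×5 tensor, matching the summation definition above.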

B. Related Work
In this subsection, three directions of related work that motivate the research in this paper are introduced.

1) Smooth Sparse Decomposition:
Yan et al. [27] proposed a smooth sparse decomposition (SSD) approach to integrate image smoothing and anomaly detection into a single task. It decomposes a signal y into a smooth background μ, sparse anomalous regions a, and random noise e, namely, y = μ + a + e. To further enhance the model, the authors proposed to use a smooth spline basis B and a sparse spline basis B_a to formulate the mean and anomalies. Specifically, the enhanced SSD model is y = Bθ + B_a θ_a + e, where θ and θ_a are the smooth and sparse basis coefficients, respectively. To estimate θ and θ_a, a penalized least squares problem is solved:

$$\min_{\theta, \theta_a} \|e\|_2^2 + \tau_1 \theta^{\top} A \theta + \tau_2 \|\theta_a\|_1, \quad \text{s.t. } e = y - B\theta - B_a \theta_a, \quad (1)$$

where $\|e\|_2^2$ is the fitting error; the first penalty term measures the smoothness of θ through the roughness matrix A; the second penalty term describes the sparsity of θ_a; and τ_1, τ_2 > 0 are regularization coefficients to be tuned. Along the same direction, Yan et al. extended the SSD model (1) to consider spatio-temporal information in a video, where B = B_t ⊗ B_s and B_a = B_{at} ⊗ B_{as}. The idea of smooth and sparse decomposition has been further extended to the literature on tensor completion [28].
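A rough numerical sketch of this penalized least squares problem follows (our own illustrative alternating scheme with a closed-form update for θ and an ISTA-style soft-thresholding step for θ_a; it is not the solver used in [27]):

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ssd(y, B, Ba, A, tau1, tau2, n_iter=200):
    """Alternating sketch for min ||y - B@th - Ba@tha||^2
    + tau1 * th' A th + tau2 * ||tha||_1."""
    th = np.zeros(B.shape[1])
    tha = np.zeros(Ba.shape[1])
    # closed-form smooth update: (B'B + tau1*A) th = B'(y - Ba tha)
    M = np.linalg.solve(B.T @ B + tau1 * A, B.T)
    step = 1.0 / (2.0 * np.linalg.norm(Ba, 2) ** 2)   # ISTA step, 1/Lipschitz
    for _ in range(n_iter):
        th = M @ (y - Ba @ tha)
        r = y - B @ th - Ba @ tha                     # current residual
        tha = soft(tha + 2.0 * step * (Ba.T @ r), step * tau2)
    return th, tha
```

Both updates monotonically decrease the objective, so even this naive scheme converges to a stationary point on small problems.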
2) Robust Tensor PCA: As the tensor extension of the popular Robust PCA, the recently proposed RTPCA [18] aims to recover the low-rank tensor L_0 ∈ R^{I_1×I_2×I_3} and the sparse tensor E_0 ∈ R^{I_1×I_2×I_3} from their sum. RTPCA solves the following convex optimization problem

$$\min_{\mathcal{L}, \mathcal{E}} \|\mathcal{L}\|_{TNN} + \lambda \|\mathcal{E}\|_1, \quad \text{s.t. } \mathcal{X} = \mathcal{L} + \mathcal{E},$$

where $\|\cdot\|_{TNN}$ is their proposed tensor nuclear norm, a convex relaxation of the tensor tubal rank. The tensor nuclear norm and tensor tubal rank are defined based on the t-SVD proposed in [17]. Following this direction, to further exploit the low-rank structures in tensor data, Liu et al. [29] extracted a low-rank component for the core matrix whose entries are the diagonal elements of the core tensor. Based on this idea, they defined a new tensor nuclear norm and proposed a new algorithm for RTPCA problems. Beyond the work based on the tensor tubal rank, Yang et al. [30] considered a new model for RTPCA based on the tensor train rank. These methods have been applied to background/foreground separation, image/video denoising, etc.
3) RPCA Under Dynamic Background: TVRPCA [20] is a unified framework to detect moving foreground objects by separating the dynamic background from the moving objects and detecting lingering objects using the spatial and temporal continuity of the foreground. The observed video X is decomposed as X = L + M, where L is the background and M is the residual. M is further separated into two terms, M = S + E, where S and E correspond to the intrinsic foreground and the dynamic background component, respectively. Thereby, the problem can be formulated as

$$\min_{\mathcal{L}, \mathcal{M}, \mathcal{S}, \mathcal{E}} \|\mathcal{L}\|_* + \lambda_1 \|\mathcal{M}\|_p + \lambda_2 \|\mathcal{E}\|_p + \lambda_3 \|\mathcal{S}\|_{TV}, \quad \text{s.t. } \mathcal{X} = \mathcal{L} + \mathcal{M},\ \mathcal{M} = \mathcal{S} + \mathcal{E}, \quad (2)$$

where λ_1, λ_2, and λ_3 are the weights balancing the corresponding terms in (2); p can be either 1 or {2, 1}, denoting the $\ell_1$ and $\ell_{2,1}$ norms, respectively. L + E is used to model the final dynamic background. GoDec [19] follows the same strategy as TVRPCA but with a different formulation.

C. Research Gap Identification
In real-world applications, the video data is often contaminated with noise during acquisition, compression, and transmission [6], as exemplified in Fig. 1. If the smooth sparse decomposition method [27] is applied to the noisy video, the background and foreground cannot be separated well, since together they are treated as the smooth component while the noise is treated as the sparse component. If Robust Tensor PCA [29]–[31] is applied, either the detected moving objects or the background is contaminated with noise, since RTPCA cannot handle the three components (i.e., background, foreground, and noise) simultaneously. If TVRPCA or GoDec is applied, the physical meaning of each component does not fully match our case. Therefore, this work seeks to address these research gaps by devising a new smooth sparse Robust Tensor Decomposition (SS-RTD). The proposed model can be viewed as performing background/foreground separation together with video denoising through a new three-component decomposition.

III. PROPOSED METHOD
In Section III-A, the proposed smooth sparse RTD for background/foreground separation in the presence of noise is presented. Specifically, the low-rankness, spatio-temporal continuity, and sparsity are formulated by the Tucker decomposition, total variation (TV) regularization, and $\ell_1$ regularization, respectively. In Section III-B, an efficient algorithm based on ADMM [26] is designed to solve the proposed model.

A. Proposed Smooth Sparse RTD Model
Throughout this work, we focus on videos that can be represented as a third-order tensor X ∈ R^{H×W×T}: H, W, and T denote the height and width of an image frame and the number of frames, respectively. The three modes of the tensor X are the height, width, and time of the video.
As discussed in Section II-C, for a noisy video, it is necessary to decompose the video data X into the low-rank tensor L (the static video background), the smooth tensor S (the smooth moving objects in the foreground), and the sparse tensor E (absorbing impulse noise), respectively. In the static background, the image frames remain unchanged over time; this can be achieved by restricting L to be a low-rank tensor in the time domain. The moving objects in the video foreground are continuous spatially and temporally, so they can be represented as a smooth tensor S. Impulse noise [32] corruption is very common in digital images. Impulse noise is independent of and uncorrelated with the image pixels and is randomly distributed over the image: for an impulse-noise-corrupted image, a fraction of the image pixels is noisy while the rest are noise-free. Therefore, the impulse noise is absorbed by the sparse tensor E so that the noise can be excluded from the background and foreground. An illustration of the video decomposition strategy of the proposed method is provided in Fig. 2. Specifically, it has the form X = L + S + E as proposed in Section I.
To model the low-rankness, the static background L is approximated by the well-known Tucker decomposition [13] with rank-(r_1, r_2, r_3). Specifically, the Tucker decomposition has the following form

$$\mathcal{L} = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3, \quad (3)$$

where U_1 ∈ R^{H×r_1} (r_1 < H) and U_2 ∈ R^{W×r_2} (r_2 < W) are orthogonal factor matrices for the two spatial domains, U_3 ∈ R^{T×r_3} (r_3 < T) is the orthogonal factor matrix for the temporal domain, and the core tensor G ∈ R^{r_1×r_2×r_3} captures the interactions among these factors. The determination of (r_1, r_2, r_3) is provided in Section III-C. By formulating the low-rank tensor L using the Tucker decomposition, a more accurate video background can be reconstructed than with matrix-based low-rank models, because the Tucker decomposition considers not only the spatial but also the temporal correlations in the video background. The smooth tensor S (moving objects) is assumed to have the spatio-temporal continuity property, such that the foreground moves smoothly and coherently in the spatial and temporal directions. In the literature, imposing spatio-temporal continuity constraints on moving objects in the foreground is well studied and proven to be effective [20], [25]. To measure the sensitivity of a quantity to change, the derivative is often applied in mathematics; for discrete functions, difference operators approximate the derivative. Given a third-order tensor S ∈ R^{H×W×T}, S(x, y, t) indicates the intensity at position (x, y) and time t. Denote by S_h, S_v, and S_t the results of the difference operations with periodic boundary conditions along the horizontal, vertical, and temporal directions, respectively. For simplicity of computation, all the entries of S can be stacked into a column vector s = vec(S), where vec(·) represents the vectorization operator.
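In NumPy, the rank-(r_1, r_2, r_3) Tucker form can be evaluated with a single einsum (an illustrative sketch; `tucker_reconstruct` is our own name). Note that with r_3 = 1, every frame of the reconstructed background is proportional to one spatial pattern, which is how a static background can be encoded:

```python
import numpy as np

def tucker_reconstruct(G, U1, U2, U3):
    """L = G x1 U1 x2 U2 x3 U3 for a third-order core G and factor matrices."""
    return np.einsum('ijk,ai,bj,ck->abc', G, U1, U2, U3)
```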
Then D_h s = vec(S_h), D_v s = vec(S_v), and D_t s = vec(S_t) denote the vectorizations of the three difference operation results, respectively. Specifically, the anisotropic total variation norm is defined as

$$\|\mathcal{S}\|_{TV1} = \|D_h s\|_1 + \|D_v s\|_1 + \|D_t s\|_1 = \|D s\|_1, \quad (4)$$

which is the $\ell_1$ norm of $[D_h s; D_v s; D_t s]$, with $D = [D_h; D_v; D_t]$ the concatenated difference operator. The total variation regularization has been widely used in image and video denoising and restoration [33], [34] due to its superiority in detecting discontinuous changes in image processing.
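The anisotropic TV norm with periodic boundary conditions can be computed directly on the tensor, without forming D explicitly (a sketch; `tv1` is our own name):

```python
import numpy as np

def tv1(S):
    """Anisotropic total variation of a 3-way tensor with periodic
    boundaries: the l1 norm of the horizontal (axis 1), vertical
    (axis 0), and temporal (axis 2) forward differences."""
    Sh = np.roll(S, -1, axis=1) - S
    Sv = np.roll(S, -1, axis=0) - S
    St = np.roll(S, -1, axis=2) - S
    return np.abs(Sh).sum() + np.abs(Sv).sum() + np.abs(St).sum()
```

A constant tensor has zero TV, while an isolated bright pixel pays a penalty in every direction, which is why TV suppresses impulsive outliers in the foreground estimate.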
By combining the advantages of the Tucker decomposition for the low-rank tensor and the total variation regularization for the smooth tensor, the proposed smooth sparse RTD has the following formulation

$$\min_{\mathcal{G}, \{\mathbf{U}_j\}, \mathcal{S}, \mathcal{E}} \|\mathcal{E}\|_1 + \lambda \|\mathcal{S}\|_{TV1}, \quad \text{s.t. } \mathcal{X} = \mathcal{L} + \mathcal{S} + \mathcal{E}, \ \mathcal{L} = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3, \quad (5)$$

where the factor matrices U_j, j = 1, 2, 3 in the Tucker decomposition (3) have orthonormal columns. The first term in the objective function measures the sparsity of the noise tensor E by the $\ell_1$ norm. The second term encodes the spatio-temporal continuity of the smooth tensor S, which is measured by the total variation regularization defined in (4). λ > 0 is the coefficient balancing the two terms in the objective function. The first constraint shows the decomposition strategy of the tensor data, and the second constraint shows the Tucker decomposition of the low-rank tensor L.

Remark 1 (Comparison With [35]–[37]): References [35], [37] and our paper share the concept of a three-component decomposition. However, the physical meaning of each component is different, which results in a totally different problem formulation and optimization algorithm. As shown in Section II-B, [35] mainly decomposes the video into a smooth background (which changes dynamically over time), sparse anomalies, and random noise. Reference [37] decomposes a video into the varying background, the static hotspots, and the moving hotspots. Therefore, we cannot apply the methods in [35], [37] to our problem. Reference [36] has the same problem setting as our paper and is utilized as part of the benchmarks for comparison. The difference is that [36] uses the matrix nuclear norm to model the low-rankness; besides, it is a vector-based method that does not use the structure information in the tensor.

B. Optimization Algorithm
In this section, an efficient algorithm based on the alternating direction method of multipliers (ADMM) [26] is proposed and implemented for solving the proposed SS-RTD model in (5). Specifically, a multi-block version of ADMM is developed. To apply ADMM, the SS-RTD model in (5) is rewritten in the following equivalent form by introducing an auxiliary variable f:

$$\min_{\mathcal{G}, \{\mathbf{U}_j\}, \mathcal{S}, \mathcal{E}, f} \|\mathcal{E}\|_1 + \lambda \|f\|_1, \quad \text{s.t. } \mathcal{X} = \mathcal{L} + \mathcal{S} + \mathcal{E}, \ \mathcal{L} = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3, \ f = D\,\mathrm{vec}(\mathcal{S}), \quad (6)$$

where the factor matrices U_j, j = 1, 2, 3 have orthonormal columns. The optimization problem in (6) can be solved in its Lagrangian dual form by augmenting the constraints into the objective function. Specifically, the augmented Lagrangian function of (6) has the following form

$$\mathcal{L}_A = \|\mathcal{E}\|_1 + \lambda \|f\|_1 + \langle \lambda_f,\, D\,\mathrm{vec}(\mathcal{S}) - f \rangle + \frac{\beta_f}{2} \|f - D\,\mathrm{vec}(\mathcal{S})\|_2^2 + \langle \Lambda_{\mathcal{X}},\, \mathcal{L} + \mathcal{S} + \mathcal{E} - \mathcal{X} \rangle + \frac{\beta_{\mathcal{X}}}{2} \|\mathcal{X} - \mathcal{L} - \mathcal{S} - \mathcal{E}\|_F^2, \quad (7)$$

where λ_f and Λ_X are the Lagrange multiplier vector and tensor, respectively, and β_f, β_X > 0 are positive penalty scalars. The optimization problem in (7) is non-convex, so optimizing all variables simultaneously is difficult. Instead, problem (7) is solved by alternately minimizing one variable with the others fixed.
Under the framework of multi-block ADMM, the optimization problem (7) with respect to each variable can be solved by the following sub-problems. 1) Optimization on G, U_j or L: The optimization sub-problem of (7) with respect to G and U_j, j = 1, 2, 3 can be rewritten as

$$\min_{\mathcal{G}, \{\mathbf{U}_j\}} \|\tilde{\mathcal{X}} - \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3\|_F^2, \quad (8)$$

where $\tilde{\mathcal{X}} = \mathcal{X} - \mathcal{S} - \mathcal{E} - \Lambda_{\mathcal{X}}/\beta_{\mathcal{X}}$. The classic HOOI algorithm [13] can be applied to solve this sub-problem, which is itself an alternating iterative algorithm. Due to the non-convexity of the sub-problem (8), HOOI cannot obtain the optimal solution in general; however, the iterative sequence from HOOI has a monotone decreasing property.
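A compact HOOI sketch for this sub-problem is given below (illustrative only; initialization by truncated HOSVD, and the helper names are ours, not the paper's implementation):

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def project(Y, U, m):
    """Apply U' along mode m of Y (i.e., Y x_m U')."""
    return np.moveaxis(np.tensordot(U.T, np.moveaxis(Y, m, 0), axes=1), 0, m)

def hooi(X, ranks, n_iter=5):
    """Rank-(r1, r2, r3) Tucker approximation of X by HOOI
    (monotone but not guaranteed globally optimal)."""
    # initialize the factors by truncated HOSVD
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    for _ in range(n_iter):
        for n in range(3):
            Y = X
            for m in range(3):
                if m != n:
                    Y = project(Y, U[m], m)     # compress the other two modes
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    G = X
    for m in range(3):
        G = project(G, U[m], m)                 # core: G = X x1 U1' x2 U2' x3 U3'
    return G, U
```

For a tensor of exact multilinear rank (r_1, r_2, r_3), even the HOSVD initialization already recovers it exactly; the iterations matter when, as here, the input is a noisy residual.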
2) Optimization on S: By setting the gradient of the augmented Lagrangian (7) with respect to S to zero, the S sub-problem reduces to the linear system

$$(\beta_{\mathcal{X}} I + \beta_f D^{\top} D)\,\mathrm{vec}(\mathcal{S}) = \beta_{\mathcal{X}}\,\mathrm{vec}(\mathcal{X} - \mathcal{L} - \mathcal{E} - \Lambda_{\mathcal{X}}/\beta_{\mathcal{X}}) + \beta_f D^{\top}(f - \lambda_f/\beta_f). \quad (9)$$

Thanks to the block-circulant structure of the matrix corresponding to the operator $D^{\top}D$, it can be diagonalized by the 3D FFT matrix [38]. Therefore, denoting the right-hand side of (9) by b, S can be computed quickly by

$$\mathcal{S} = \mathrm{ifftn}\!\left(\frac{\mathrm{fftn}(\mathrm{ten}(b))}{\beta_{\mathcal{X}} + \beta_f \Phi}\right), \quad (10)$$

where ten(·) denotes the tensorization operator, the division is element-wise, and $\Phi = |\mathrm{fftn}(D_h)|^2 + |\mathrm{fftn}(D_v)|^2 + |\mathrm{fftn}(D_t)|^2$ depends only on the tensor size and not on the data or model parameters, so it only needs to be calculated once in the whole algorithm. fftn and ifftn indicate the fast 3D Fourier transform and its inverse, respectively.
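The FFT-domain solve can be sketched as follows (our own illustrative implementation under periodic boundaries; the names `beta_X` and `beta_f` mirror the paper's penalty scalars, and the exact right-hand side depends on the multiplier sign convention):

```python
import numpy as np

def diff(S, ax):
    """Forward difference along one mode with periodic boundary."""
    return np.roll(S, -1, axis=ax) - S

def diffT(S, ax):
    """Adjoint of the periodic forward difference."""
    return np.roll(S, 1, axis=ax) - S

def solve_S(rhs, beta_X, beta_f):
    """Solve (beta_X*I + beta_f*D'D) vec(S) = vec(rhs) in the Fourier
    domain, where D stacks periodic differences along the three modes."""
    Phi = np.zeros(rhs.shape)
    for ax in range(3):
        d = np.zeros(rhs.shape)
        idx = [0, 0, 0]
        d[tuple(idx)] = -1.0
        idx[ax] = 1
        d[tuple(idx)] = 1.0                    # difference kernel on this mode
        Phi += np.abs(np.fft.fftn(d)) ** 2     # spectrum of D'D per mode
    return np.real(np.fft.ifftn(np.fft.fftn(rhs) / (beta_X + beta_f * Phi)))
```

Because Phi depends only on the tensor size, it can be precomputed once and reused across all iterations, as noted above.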
3) Optimization on f: The sub-problem of (7) with respect to f can be rewritten as

$$\min_f \lambda \|f\|_1 + \frac{\beta_f}{2}\left\|f - \left(D\,\mathrm{vec}(\mathcal{S}) + \lambda_f/\beta_f\right)\right\|_2^2. \quad (11)$$

For the anisotropic total variation, the well-known soft-thresholding operator [39] solves this sub-problem in closed form:

$$f = \mathrm{soft}\left(D\,\mathrm{vec}(\mathcal{S}) + \lambda_f/\beta_f,\ \lambda/\beta_f\right), \quad (12)$$

where the soft-thresholding operator soft(A, τ) = sign(A) · max(|A| − τ, 0) is applied element-wise.
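The element-wise soft-thresholding operator is one line of NumPy (a sketch); it is the proximal operator of the scaled $\ell_1$ norm, which is exactly the structure of the f- and E-subproblems:

```python
import numpy as np

def soft(A, tau):
    """Element-wise soft-thresholding: sign(A) * max(|A| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)
```

The assertion below also checks numerically, on a fine grid, that soft(b, τ) minimizes the scalar objective τ|f| + ½(f − b)².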

4) Optimization on E: The sub-problem of (7) with respect to E has the closed-form solution

$$\mathcal{E} = \mathrm{soft}\left(\mathcal{X} - \mathcal{L} - \mathcal{S} - \Lambda_{\mathcal{X}}/\beta_{\mathcal{X}},\ 1/\beta_{\mathcal{X}}\right),$$

where soft(·, ·) is the same element-wise soft-thresholding operator as in (12).

5) Updating Multipliers:
According to ADMM, the multipliers are updated by

$$\lambda_f \leftarrow \lambda_f + \gamma \beta_f \left(D\,\mathrm{vec}(\mathcal{S}) - f\right), \qquad \Lambda_{\mathcal{X}} \leftarrow \Lambda_{\mathcal{X}} + \gamma \beta_{\mathcal{X}} \left(\mathcal{L} + \mathcal{S} + \mathcal{E} - \mathcal{X}\right),$$

where γ > 0 is a parameter associated with the convergence rate, and the penalty parameters β_f and β_X follow an adaptive updating scheme as suggested in [40]. Taking β_f as an example,

$$\beta_f^{k+1} = \begin{cases} c_1 \beta_f^{k}, & \text{if } \mathrm{Err}(f^{k+1}) > c_2\,\mathrm{Err}(f^{k}), \\ \beta_f^{k}, & \text{otherwise,} \end{cases}$$

where $\mathrm{Err}(f^{k}) = \|f^{k} - D\,\mathrm{vec}(\mathcal{S}^{k})\|_2$. As suggested in [25], γ = 1.1, and c_1, c_2 can be taken as 1.15 and 0.95, respectively. The proposed algorithm for SS-RTD is summarized in Algorithm 1. The derivations for (8), (10), and (12) are provided in the Appendix.
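The adaptive penalty rule can be sketched as follows (one common variant; the precise trigger condition in [40] may differ, so treat the inequality and the roles of `c1`, `c2` as illustrative):

```python
def update_beta(beta, err_new, err_old, c1=1.15, c2=0.95):
    """Grow the penalty by c1 when the constraint residual has not
    shrunk by at least the factor c2; otherwise keep it unchanged."""
    return c1 * beta if err_new > c2 * err_old else beta
```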

C. Implementation Details and Complexity Analysis
In the SS-RTD model (5), there exist four parameters, i.e., r_1, r_2, r_3, and λ, where r_1 and r_2 control the complexity of spatial redundancy, r_3 controls the complexity of temporal redundancy, and λ handles the trade-off between noise and foreground modeling. In all experiments, r_1 and r_2 are set to ceil(0.80 × H) and ceil(0.80 × W), respectively, where ceil(·) rounds its argument to the nearest integer greater than or equal to it. By doing so, the accumulated energy ratio of the top normalized singular values (AccEgyR) attains a ratio over 0.9 for various natural images, as reported in [25]. For r_3, the value 1 is taken in all experiments so that each image frame in L is the same [41]. In terms of λ, it needs to be carefully tuned based on the data. Specifically, λ is taken in the range [0.2, 1]. Empirically, the proposed algorithm achieves satisfactory performance with any λ ∈ [0.2, 1], so practitioners can apply the proposed method without extensive parameter tuning.
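The rank-selection rule stated above amounts to the following direct transcription (the function name is ours):

```python
import math

def select_ranks(H, W, ratio=0.80):
    """r1 = ceil(0.80*H), r2 = ceil(0.80*W), and r3 = 1 so that every
    frame of the low-rank background L is identical up to scaling."""
    return math.ceil(ratio * H), math.ceil(ratio * W), 1
```

For the 288×352 Candela frames used in Section IV, this gives (231, 282, 1); for the 256×384 Caviar frames, (205, 308, 1).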
Next, the computational complexity of the proposed SS-RTD is analyzed. In Algorithm 1, one outer iteration from line 1 to 7 includes updating rules for L, S, f , E, and multipliers, respectively. Updating L is achieved by the iterative HOOI algorithm, which needs O(W 3

IV. NUMERICAL STUDIES
To evaluate the performance of the proposed SS-RTD, its performance on open-source video data is presented in this section. In Section IV-A, the empirical convergence and sensitivity of the proposed algorithm are illustrated. The performance of the proposed algorithm for background subtraction and foreground detection is presented in Sections IV-B and IV-C, respectively. In Sections IV-B and IV-C, RTPCA [31], IRTPCA [29], TTNN [30], TVRPCA [20], GoDec [19], and PRPCA [36] are selected as benchmarks for comparison with the proposed SS-RTD; they are state-of-the-art methods in the related area. The benchmarks fall into two categories: (1) RTPCA, IRTPCA, and TTNN are the most advanced Robust Tensor PCA algorithms in the literature; (2) TVRPCA, GoDec, and PRPCA are vector-based algorithms that decompose a video into three components. All results in this section are the averages of 20 repetitions. The code of SS-RTD is implemented in Matlab 2019a. The experiments in this paper were conducted on a computer with an Intel® Xeon® Processor E-2286M (8 cores, 2.40-GHz Turbo, 16 MB cache).
Performance Evaluation Indices and Parameter Tuning: For the task of background subtraction, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used to measure the recovery accuracy. PSNR and SSIM measure the similarity of two images in intensity and structure, respectively. Specifically, PSNR is defined as $\mathrm{PSNR} = 10 \times \log_{10}\left(255^2 / \mathrm{MSE}(I, \hat{I})\right)$, where $I$ and $\hat{I}$ are the original and recovered background, respectively, and MSE denotes the mean squared error. SSIM measures the structural similarity of two images; see [43] for details. The average PSNR and SSIM over all image frames in the video are used to evaluate the recovery performance of the video background. For the task of foreground detection, the F-measure is applied to assess the detection performance; the average F-measure over all image frames in the video is used. Therefore, 20 repetitions are sufficient to represent the performance of each method, since each repetition is already the average over multiple image frames. For all three indices PSNR, SSIM, and F-measure, higher values indicate better performance.
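PSNR for 8-bit frames can be computed as follows (a sketch; a peak value of 255 is assumed, consistent with the [0, 255] pixel range used in these experiments):

```python
import numpy as np

def psnr(I, I_hat, peak=255.0):
    """Peak signal-to-noise ratio between an original and a recovered frame."""
    mse = np.mean((np.asarray(I, float) - np.asarray(I_hat, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```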
For the proposed method as well as the benchmark methods, parameter tuning is performed by searching over 100 sets of parameters sampled by the maximin Latin hypercube design [44], such that the average PSNR and F-measure achieve the best values for background subtraction and foreground detection, respectively. In the proposed Algorithm 1, λ can be selected from [0.2, 1] based on the empirical results in Section IV-A.
A. Convergence and Sensitivity Analysis

1) Convergence Analysis: The video dataset Candela from the SBI dataset [45] is used in this section. In total, this video dataset has 855 image frames, where each grayscale image has size 288×352. For simplicity, the first 80 image frames in the sequence are used for the experiments, so the tensor size is 288 × 352 × 80. One image frame from Candela is shown in Fig. 5a. In the video, a man enters and leaves a room, abandoning a bag. For each image, 10% of the pixels are randomly selected and set to random integers in [0, 255], and the positions of the contaminated pixels are unknown (one example is shown in Fig. 5d). To evaluate the convergence of the proposed algorithm, the relative change $\mathrm{relChg}A = \|A^{k} - A^{k-1}\|_F / \|A^{k-1}\|_F$ and the relative error $\mathrm{relErr}A = \|A^{k} - A^{0}\|_F / \|A^{0}\|_F$ are applied as the assessment indices of algorithm convergence, where $A^{k}$ is the result in the k-th iteration and $A^{0}$ is the ground truth. The ground truth of the static video background is provided in the first column of Fig. 6.
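The corruption model used in these experiments (a given ratio of pixels replaced by random integers in [0, 255] at unknown positions) can be simulated as follows (a sketch; the function name and seeding are ours):

```python
import numpy as np

def add_impulse_noise(video, ratio=0.10, seed=None):
    """Replace about `ratio` of the pixels with uniform random integers
    in [0, 255]; the corrupted positions are returned for evaluation only."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(video, float).copy()
    mask = rng.random(video.shape) < ratio
    noisy[mask] = rng.integers(0, 256, size=int(mask.sum()))
    return noisy, mask
```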
The curves of the relative change of the video background L and the video foreground S, and the relative error of the video background L, are shown in Fig. 3. The curve of the relative error of the video foreground S is not provided since the ground-truth video foreground for real data is unknown. In the proposed Algorithm 1, the parameter λ is set to 0.4. As shown in Fig. 3, the relative change converges to zero as the number of iterations increases. Meanwhile, the corresponding relative error of the video background L gradually decreases to a stable value. In general, the results in this experiment demonstrate the empirical convergence of the proposed algorithm.
2) Sensitivity Analysis: Since λ is the only tuning parameter in the proposed Algorithm 1, the sensitivity of the algorithm to λ is studied to further explore its effect. The same dataset with 10% noise as in Section IV-A1 is used here. However, in this experiment, λ varies from 0.01 to 3. The relative error of the video background L is plotted against λ in Fig. 4. The relative error of the proposed SS-RTD drops very quickly when 0 < λ < 0.2 and remains very stable when 0.2 ≤ λ < 3. Overall, our method has a flat region where its performance is fairly good for a wide range of λ. Therefore, any λ ∈ [0.2, 1] is recommended for practitioners.

B. Background Subtraction
In this subsection, the proposed Algorithm 1 is applied to background subtraction. The video dataset Candela from Section IV-A, together with Caviar1 and Caviar2 from the SBI dataset [45], is used for the experiments. Figs. 5a, 5b, 5c show one image frame from each of the three video datasets. In the dataset of Caviar1, there are people slowly walking along a corridor, with mild shadows. In the dataset of Caviar2, there are people entering and leaving a store, standing only for a few frames. For both videos, the first 80 image frames are used for the experiments. Thus, the tensor data size of Caviar1 and Caviar2 is 256 × 384 × 80. The background in each video dataset is static, which is provided as the ground truth for comparison. The people walking in the videos are treated as the smooth foreground. The noise is simulated in the same way as in [31]. Specifically, a given ratio of the pixels in each image is set to random integers in [0, 255], and the positions of the contaminated pixels are unknown. The corresponding noisy images from the three video datasets are shown in the second row of Fig. 5. Note that the simulated noise in this section is much harder to handle than additive noise such as Gaussian noise, since the information in each contaminated pixel is completely destroyed.
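The corruption described above can be sketched as follows (an illustrative numpy version; the exact sampling scheme of [31] may differ in details such as sampling with or without replacement):

```python
import numpy as np

def corrupt(video, ratio, rng=None):
    """Replace `ratio` of the pixels in each frame with random integers in
    [0, 255]; the corrupted positions are not recorded, matching the setting
    where contaminated pixel locations are unknown."""
    rng = np.random.default_rng(rng)
    noisy = video.astype(np.float64).copy()
    h, w, n_frames = noisy.shape
    n_bad = int(round(ratio * h * w))
    for t in range(n_frames):
        idx = rng.choice(h * w, size=n_bad, replace=False)  # distinct pixels
        noisy[..., t].flat[idx] = rng.integers(0, 256, size=n_bad)
    return noisy
```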
In this experiment, the cases of 10%, 20%, and 30% noise are studied to show the performance of background subtraction under different noise ratios. The quantitative results of all benchmark methods with different noise ratios on the simulated Candela, Caviar1, and Caviar2 are summarized in Table I in terms of PSNR, SSIM, and computational time, respectively. In terms of both PSNR and SSIM, our method achieves the best performance in all scenarios. When the noise ratio increases, our proposed SS-RTD is the most consistent one among all the benchmark methods. Specifically, our approach performs much better than RTPCA, IRTPCA, and TTNN. This result demonstrates the advantage of the low-rank Tucker decomposition for the static background model over the tensor nuclear norm. PRPCA shows performance similar to the nuclear norm-based algorithms, while TVRPCA and GoDec perform better than PRPCA and these nuclear norm-based algorithms; our proposed method still outperforms them due to the advantage of tensor modeling over vector-based methods. Our approach ranks third among all benchmarks in terms of computational time, but it is within the same order of magnitude as the fastest one (GoDec) and very close to the second-fastest algorithm, i.e., TTNN. PRPCA is the most time-consuming one. These results demonstrate that the proposed algorithm has the best accuracy and also runs efficiently.
To show the visualization results, the background subtraction results for the case of 10% noise on all three video datasets are presented. The visualizations of the recovered video background from each video dataset for the different methods are shown in Fig. 6. Our proposed approach generally produces the cleanest background for all three video datasets. For the backgrounds generated by RTPCA, IRTPCA, and TTNN, people still remain in the background even though most of the noise is removed. For TVRPCA and GoDec, their performance is very close to that of the proposed method on Candela and Caviar2, but they perform poorly on Caviar1, where shadows of people are left in the background. PRPCA can almost recover the background without people, but the result is quite blurry.
Since the quantitative results in Table I are the best performance of each algorithm over 100 sets of tuning parameters, boxplots are used to show the performance variability of each algorithm. For the case of 10% noise, the boxplots for Candela, Caviar1, and Caviar2 in Fig. 7 summarize the five-number summaries (the minimum, the maximum, the sample median, and the first and third quartiles) of each algorithm in terms of PSNR. This result shows that the proposed method is very stable for different values of λ since the variation among these five numbers is quite small. Even though TVRPCA and GoDec can achieve good performance, they are more sensitive to the tuning parameters than our method, especially TVRPCA. PRPCA is also a very stable algorithm; however, its performance is inferior. This comparison shows that the proposed SS-RTD is a ready-to-use algorithm since its performance is robust across different values of λ.
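The five-number summary underlying each boxplot can be computed as follows (an illustrative sketch using numpy's default linear-interpolation percentiles; the paper does not specify the quartile convention):

```python
import numpy as np

def five_number_summary(scores):
    """Minimum, first quartile, median, third quartile, and maximum of the
    scores collected over the tuning-parameter sets."""
    q0, q1, q2, q3, q4 = np.percentile(scores, [0, 25, 50, 75, 100])
    return {"min": q0, "q1": q1, "median": q2, "q3": q3, "max": q4}
```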

C. Foreground Detection
In this subsection, our Algorithm 1 is applied to foreground detection. The video datasets Highway, Office, and Pedestrians [46] are used for the experiments. Figs. 8a, 8b, and 8c show one image frame from each of the three video datasets, respectively. In the dataset of Highway, there are vehicles moving along the highway, where the size of each grayscale image is 240 × 320.
The corresponding noisy images from the three video datasets are shown in the second row of Fig. 8. In the experiment, we apply two pretrained deep learning models (available at https://github.com/matlab-deep-learning) to our datasets directly: a Mask R-CNN [23] model trained on the COCO 2014 dataset and a DeepLab [24] model trained on the PASCAL VOC dataset.
In this experiment, the cases of 10% and 20% noise are investigated to show the performance of foreground detection under different noise ratios. The quantitative results of all benchmark methods with different noise ratios on the simulated Highway, Office, and Pedestrians are summarized in Table II in terms of Precision, Recall, F-measure, and computational time, respectively. In terms of precision, our method achieves the best performance in five out of the six scenarios. In terms of recall, the proposed SS-RTD achieves the best performance in three out of the six scenarios. More importantly, our method is the best in all six scenarios in terms of F-measure. When the noise ratio increases, our proposed SS-RTD is the most consistent one among all the benchmark methods. TVRPCA and GoDec perform better than the tensor nuclear norm-based algorithms and PRPCA. However, our proposed approach shows the best performance due to the advantage of tensor modeling over vector-based methods. The performance of the two deep learning models, i.e., Mask R-CNN and DeepLab, is very poor since they are not designed for noisy videos. Our method runs efficiently in terms of computational time, even though it is not the fastest. PRPCA is still the slowest one.
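For reference, the three detection indices follow from the overlap between a binary foreground mask and the ground-truth mask. A minimal numpy sketch (the function name and the guard against empty masks are illustrative):

```python
import numpy as np

def detection_scores(pred, truth):
    """Precision, recall, and F-measure of a binary foreground mask."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)                  # true-positive pixels
    precision = tp / max(np.sum(pred), 1)      # fraction of detections correct
    recall = tp / max(np.sum(truth), 1)        # fraction of foreground found
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f
```

Averaging the F-measure over all frames of a video gives the reported score.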
The foreground detection results for the case of 20% noise are presented for visualization. The visualizations of one frame from each video dataset for the different methods are shown in Fig. 9. In general, our method detects the most accurate foreground among all benchmarks even though the video is noisy. For the foreground masks extracted by RTPCA, IRTPCA, TTNN, and GoDec, a lot of noise remains in the foreground, and the moving objects are not well detected. TVRPCA and PRPCA can remove the noise from the foreground; however, their detected foregrounds are not as complete as ours.
Since the quantitative results in Table II are the best performance of each algorithm over 100 sets of tuning parameters, boxplots are used to show the performance variability of each algorithm. For the case of 20% noise, the boxplots for Highway, Office, and Pedestrians in Fig. 10 summarize the five-number summaries (the minimum, the maximum, the sample median, and the first and third quartiles) of each algorithm in terms of F-measure. This result shows that our method is very stable for different values of λ since these five numbers have very small variations. Even though TVRPCA can achieve good performance, it is very sensitive to the tuning parameters compared with our method. PRPCA is a very stable algorithm; however, its performance is poor. In this experiment, the worst performance of our method is still much better than the best performance of all the other methods.

V. APPLICATION IN ADDITIVE MANUFACTURING
In this section, the application of the proposed method to melt pool detection is presented. The same benchmark methods as in Section IV are applied here. The laser powder bed fusion (LPBF) process using Ti-6Al-4V alloy starts from the specimen shown in Fig. 11a. The laser is applied to the top surface and moves from the bottom left to the top right. The experimental conditions for the laser are: power 190 W, spot size 100 μm, and scan speed 0.25 m/s. Although the process is conceptually simple, many highly dynamic and transient physical phenomena are involved because of the extremely high heating and cooling rates, e.g., melting and partial vaporization of metallic powders, flow of the molten metal, and rapid solidification. To probe the dynamics of the LPBF process in situ, a high-speed X-ray imaging system at the Advanced Photon Source (APS), as shown in Fig. 11b, is applied, where an X-ray beam passes through the thickness of the sample as shown in Fig. 11a. In terms of X-ray imaging conditions, the pixel dimension is 2 μm, the duration of each frame is 16.67 μs, the field of view is 768 μm × 1440 μm, and the frame rate is 60 kHz. The melt pool boundary, i.e., the solid/liquid interface, can be extracted from the high-speed X-ray images to track the melt pool information. The melt pool boundary can further be used to characterize the solidification rate [9], which plays a critical role in determining the microstructure of additively manufactured metals. One unprocessed example from the imaging system is shown in Fig. 1a. As discussed earlier, it is hard to identify the melt pool boundary in it. To enhance the boundary of the melt pool, the X-ray image captured six frames earlier is first subtracted from each unprocessed X-ray image, and the contrast is then adjusted to obtain a processed X-ray image with a wide boundary, which is shown in Fig. 1b.
The lag of six frames is determined by trying various values, where six gives the best visualization. However, this preprocessing generates a large amount of noise. In the literature, researchers [7], [10], [11] manually extract the melt pool boundary from the processed X-ray images. In this work, the proposed algorithm is the first attempt to extract the melt pool information by decomposing the video data and removing the noise.
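The preprocessing described above can be sketched as follows. This is an illustrative numpy version: the paper does not specify its contrast adjustment, so a global linear rescaling of the difference images to [0, 255] is assumed here, and the frame stack is assumed to be stored as a height × width × time array.

```python
import numpy as np

def preprocess_xray(frames, lag=6):
    """Subtract the frame captured `lag` frames earlier from each frame and
    linearly rescale the differences to [0, 255] to widen the melt pool
    boundary. Output has `lag` fewer frames than the input."""
    diff = frames[..., lag:].astype(np.float64) - frames[..., :-lag].astype(np.float64)
    lo, hi = diff.min(), diff.max()
    return (diff - lo) / max(hi - lo, 1e-12) * 255.0
```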
In this experiment, there are 100 processed X-ray image frames capturing the melt pool dynamics, where the size of each image frame is 384 × 720. Therefore, the tensor size is 384 × 720 × 100. Since the ground truth of the background and foreground is unknown, visualization results are presented to show the performance of the different methods. The performance of the proposed SS-RTD is reported with λ = 0.55. In Fig. 12, the background/foreground separation results of all methods are presented. In terms of the background, our method can recover the porosity defects existing in the specimen, which cannot be seen in the noisy image. Besides, the shape and location of the porosity defects are consistent with those in Fig. 1a. GoDec can also recover the porosity defects; however, its recovered background is noisy and the melt pool can still be seen in it. PRPCA can recover a noiseless background, but the porosity defects cannot be identified. The other methods cannot recover the background. In terms of the foreground, our method and TVRPCA can remove the noise from the foreground while the other methods cannot. However, our detected foreground shows the entire melt pool better than that of TVRPCA. The results in this section show that our algorithm can recover the background with the porosity defects, which are hidden by the noise, and obtains the most accurate melt pool geometry among all methods.
VI. CONCLUSION

In this article, a new smooth sparse Robust Tensor Decomposition (SS-RTD) is developed for background/foreground separation in noisy video data. The proposed SS-RTD decomposes the video data into low-rank, smooth, and sparse components, respectively. To compute the solution efficiently, an ADMM-based algorithm for SS-RTD is implemented. The empirical convergence experiment shows that the proposed SS-RTD converges and runs efficiently in practice. The background subtraction and foreground detection results on simulated video data demonstrate that our method outperforms RTPCA, IRTPCA, TTNN, TVRPCA, and GoDec, which are state-of-the-art algorithms in the literature. The proposed SS-RTD is the only method that can extract the melt pool information from the X-ray data. These results also illustrate the effectiveness of the Tucker decomposition for the low-rank tensor, the total variation regularization for the smooth tensor, and the sparse tensor model for noise removal. More importantly, the proposed SS-RTD can be considered a ready-to-use algorithm since its performance is very stable when the only tuning parameter λ takes any value in [0.2, 1]. In addition, some aspects of SS-RTD deserve further investigation. First, the video background is assumed to be static in the proposed model; the extension to the case of a dynamic background should be further investigated. Second, only the empirical convergence of the proposed ADMM-based algorithm is demonstrated in our case study; its theoretical convergence needs further research.

APPENDIX
DERIVATIONS FOR ALGORITHM 1
This appendix provides the detailed derivations for Algorithm 1. Specifically, the steps to obtain (8), (10), and (12) are shown in detail.
The terms in the augmented Lagrangian function (7) related to $\mathcal{G}, \mathbf{U}_j$ are as follows:
$$\left\langle \Lambda, \mathcal{L} - \mathcal{X} + \mathcal{S} + \mathcal{E} \right\rangle + \frac{\beta}{2}\left\| \mathcal{X} - \mathcal{L} - \mathcal{S} - \mathcal{E} \right\|_F^2 = \frac{\beta}{2}\left\| \mathcal{L} - \tilde{\mathcal{X}} \right\|_F^2 - \frac{1}{2\beta}\|\Lambda\|_F^2, \tag{15}$$
where $\mathcal{L} = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$ and $\tilde{\mathcal{X}} = \mathcal{X} - \mathcal{S} - \mathcal{E} - \Lambda/\beta$. Since the second term on the right-hand side of (15) is a constant, the sub-problem with respect to $\mathcal{G}, \mathbf{U}_j$ can be represented as (8).
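This sub-problem amounts to a low-rank Tucker approximation of the residual tensor. As an illustration only (the paper's solver may instead use HOOI or another alternating scheme), a truncated higher-order SVD sketch in numpy:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M for a 3-way tensor."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(X, ranks):
    """Truncated HOSVD: factors U_j are the leading left singular vectors of
    each unfolding; the core is G = X x_1 U_1^T x_2 U_2^T x_3 U_3^T."""
    U = [np.linalg.svd(unfold(X, j), full_matrices=False)[0][:, :r]
         for j, r in enumerate(ranks)]
    G = X
    for j in range(3):
        G = mode_product(G, U[j].T, j)
    return G, U

def tucker_reconstruct(G, U):
    """Rebuild the low-rank tensor G x_1 U_1 x_2 U_2 x_3 U_3."""
    X = G
    for j in range(3):
        X = mode_product(X, U[j], j)
    return X
```

For a tensor of exact multilinear rank `ranks`, this reconstruction is exact; otherwise it is a quasi-optimal low-rank approximation.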
The terms in the augmented Lagrangian function (7) related to $f$ have the following representation:
$$\|f\|_1 + \left\langle \Gamma, f - \nabla\mathcal{S} \right\rangle + \frac{\beta}{2}\left\| \nabla\mathcal{S} - f \right\|_F^2 = \|f\|_1 + \frac{\beta}{2}\left\| f - \left( \nabla\mathcal{S} - \Gamma/\beta \right) \right\|_F^2 - \frac{1}{2\beta}\|\Gamma\|_F^2, \tag{16}$$
where the third term on the right-hand side of (16) is a constant. Therefore, the sub-problem with respect to $f$ can be represented as (10).
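The total variation term acts on finite differences of the foreground tensor. A minimal sketch of the discrete gradient operator involved, assuming anisotropic TV built from forward differences with replicated (Neumann) boundaries, which the paper does not spell out here:

```python
import numpy as np

def grad3(S):
    """Forward differences of a height x width x time video tensor along the
    two spatial directions and the temporal direction; the last slice in each
    direction is replicated so the outputs keep the input shape."""
    gx = np.diff(S, axis=0, append=S[-1:, :, :])
    gy = np.diff(S, axis=1, append=S[:, -1:, :])
    gt = np.diff(S, axis=2, append=S[:, :, -1:])
    return gx, gy, gt
```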
The terms in the augmented Lagrangian function (7) related to $\mathcal{E}$ take the following form:
$$\lambda\|\mathcal{E}\|_1 + \left\langle \Lambda, \mathcal{E} - \mathcal{X} + \mathcal{L} + \mathcal{S} \right\rangle + \frac{\beta}{2}\left\| \mathcal{X} - \mathcal{L} - \mathcal{S} - \mathcal{E} \right\|_F^2 = \lambda\|\mathcal{E}\|_1 + \frac{\beta}{2}\left\| \mathcal{E} - \tilde{\mathcal{X}}_{\mathcal{E}} \right\|_F^2 - \frac{1}{2\beta}\|\Lambda\|_F^2, \tag{17}$$
where $\tilde{\mathcal{X}}_{\mathcal{E}} = \mathcal{X} - \mathcal{L} - \mathcal{S} - \Lambda/\beta$ and the third term on the right-hand side of (17) is a constant. Thus, equation (12) can be derived.
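A sub-problem of this ℓ1-plus-quadratic form has a closed-form solution via elementwise soft-thresholding, the proximal operator of $\tau\|\cdot\|_1$. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding: shrink each entry toward zero by tau,
    the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)
```

Applied with the appropriate threshold (here λ/β) to the residual tensor, it yields the sparse-noise update.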