Inequality-Constrained 3D Morphable Face Model Fitting

3D morphable model (3DMM) fitting on 2D data is traditionally done via unconstrained optimization with regularization terms to ensure that the result is a plausible face shape and is consistent with a set of 2D landmarks. This paper presents inequality-constrained 3DMM fitting as the first alternative to regularization in optimization-based 3DMM fitting. Inequality constraints on the 3DMM's shape coefficients ensure face-like shapes without modifying the objective function for smoothness, thus allowing for more flexibility to capture person-specific shape details. Moreover, inequality constraints on landmarks increase robustness in a way that does not require per-image tuning. We show that the proposed method stands out with its ability to estimate person-specific face shapes by jointly fitting a 3DMM to multiple frames of a person. Further, when used with a robust objective function, namely gradient correlation, the method can work "in-the-wild" even with a 3DMM constructed from controlled data. Lastly, we show how to use the log-barrier method to efficiently implement the method. To our knowledge, we present the first 3DMM fitting framework that requires no learning yet is accurate, robust, and efficient. The absence of learning enables a generic solution that allows flexibility in the input image size, interchangeable morphable models, and incorporation of a camera matrix.


Evangelos Sariyanidi, Casey J. Zampella, Robert T. Schultz, and Birkan Tunç

I. INTRODUCTION
3D MORPHABLE model (3DMM) fitting is the prevailing approach for reconstructing 3D face shape from 2D images. This approach has the unique ability of separately quantifying key factors that govern a face image, namely, identity, expression, illumination and pose [18]. Therefore, it is suitable for tasks such as face or facial expression recognition, or decoupling of pose and expression.

Evangelos Sariyanidi and Casey J. Zampella are with the Center for Autism Research, Children's Hospital of Philadelphia, Philadelphia, PA 19104 USA (e-mail: sariyanide@chop.edu; zampellac@chop.edu).
Robert T. Schultz and Birkan Tunç are with the Center for Autism Research, Children's Hospital of Philadelphia, Philadelphia, PA 19104 USA, and also with the University of Pennsylvania, Philadelphia, PA 19104 USA (e-mail: schultzrt@chop.edu; tuncb@chop.edu).
Digital Object Identifier 10.1109/TPAMI.2023.3334948

Most 3DMMs represent facial shape as a linear sum of basis functions learned from a large set of face scans. 3D reconstruction from 2D data is achieved by inferring the basis coefficients in this sum. This inference has been traditionally done via unconstrained optimization with ℓ2 penalty terms, which ensure a face-like output by preventing the magnitudes of the basis coefficients from being too large (Fig. 1(a)). However, the weights of these terms are not easy to tune and may lead to overly smooth shapes [18] that miss personal characteristics. A penalty on deviation from landmarks is also added for increased robustness, but it is questionable whether this term can be tuned in a way that works for all images, since the weight should ideally depend on landmark detection error (Section I-A1).
An alternative is to use learning-based methods, which can overcome some limitations of unconstrained optimization methods, but require sophisticated training and a large amount of data. Also, training imposes rigidity on the pipeline by fixing certain properties of the input data and the model, which in turn limits the ability to work with images of arbitrary sizes, to incorporate a camera matrix, or to use a new or modified morphable model (Section I-A2).
We introduce a novel optimization-based 3DMM fitting framework that preserves individual face details, and achieves robustness on par with learning-based methods without suffering from their rigidity. The proposed framework is based on inequality-constrained optimization. Specifically, we use inequality constraints to ensure a face-like shape by forcing the shape parameters to stay within a viable range (Fig. 1(a)) but leave the objective function otherwise intact, thus allowing for more flexibility to capture person-specific details (Section II-A1). Moreover, we increase robustness through inequality constraints that ensure that the output is consistent with detected landmarks (Fig. 1(b)), in a way that is practically applicable to any face image (Section II-A2).
Our experiments show that, while learning-based methods may have a performance advantage in estimating identity from a single or a small number of frames, the proposed inequality-constrained approach performs on a par with or better than learning-based methods as an increasing number of frames per subject are used. Moreover, when combined with a robust objective function, namely gradient correlation (GC) [49], the proposed approach achieves robustness comparable to the best deep learning methods, even though it requires no learning at all. Finally, an efficient GPU-based implementation can process a video frame in ∼20 ms after shape identity is estimated. We provide experiments with a novel metric, within- and between-subject effect size (WBES), and show that inequality-constrained optimization outperforms state-of-the-art methods in terms of generating increasingly more person-specific face shapes as more frames per subject are used (Fig. 2). Critically, this study shows for the first time that an optimization-based approach that needs no learning can be robust, accurate, and efficient, in direct contrast to criticisms of previous optimization approaches that were deemed too complex [23], [26], [27]; inefficient [2], [15], [17], [50], [53]; fragile or highly sensitive to landmarks [2], [20], [50], [52]; and reliant on heavily engineered components [2]. The CUDA implementation of the method is available at https://github.com/Computational-Psychiatry/3DI.
In light of these shortcomings, we propose to use inequality-constrained optimization. While optimization-based approaches have been deemed complex and heavily engineered, our approach is refreshingly simple; it requires but the tuning of three coefficients for the 3DMM (Section II-A1) and determining bounds for landmark constraints, which practically amounts to measuring the error bounds of the landmark detector that is used (Section II-A2). The search space that satisfies all inequality constraints is the feasible region, and solutions that fall outside this region are prohibited as they are too intricate or inconsistent with detected landmarks (Fig. 3). The solutions in the feasible region are not penalized at all, leading to higher flexibility to capture person-specific face shapes (Section II-A1) and less sensitivity to small landmark detection errors (Section II-A2). Finally, when inequality-constrained optimization is used with GC as the objective function, it can generalize to "in-the-wild" images even when it uses a highly controlled 3DMM. In other words, there is no need to learn a robust 3DMM [9] when GC is used.

Fig. 3. Inequality constraints in action. Each row shows the objective function for 3DMM fitting; the function is reduced to the line containing the unconstrained and inequality-constrained solutions, θ_u and θ_3DI. Green rectangles show the feasible region that satisfies the inequalities. Top row: inequality constraints prevent convergence to the local minimum with a shape that is too intricate. Bottom row: the local minimum θ_u that leads to a poor shape is ruled out by the inequality constraints.
This paper extends an earlier version of this study [41] in four ways: (1) showing how to attain an efficient implementation; (2) experiments with a new metric (WBES) showing that the method attains person-specific results; (3) experiments on two new datasets (Florence [1] and YT Faces [51]); and (4) direct comparison of inequality constraints versus ℓ2 regularization.
2) Learning-Based Methods: Learning-based methods obviate the need for de novo optimization per image by estimating 3DMM parameters with a model trained offline [18]. Numerous methods have been proposed [33], but still, the use of ℓ2 regularization (while training the model) is common to virtually all [15], [17], [31], [36], [45], [47], [52], [53], [55]. Robust learning-based pipelines are also difficult to construct. They require a careful design with sophisticated cost functions that balance many (e.g., five [17] or seven [52]) loss terms, and models with millions of parameters [30] tuned in a lengthy training (even with powerful GPUs) that requires large amounts of data. Moreover, exceptions [30] aside, training typically brings rigidity, warranting re-training each time a new morphable model is considered (e.g., different age or ethnicity groups, a new expression model), the number of 3DMM components is adjusted, or the input image size is changed. Further rigidity is imposed on the camera model, which is often either fixed to weak perspective [2], [16], [20], [27], [36], [47], [52], which limits applications [40], or is adaptable (e.g., thanks to a differentiable renderer [26]) but fixed at training [17], [44], which may limit accuracy if test images undergo a largely different perspective warping compared to training images. In contrast, optimization-based approaches are flexible in all these regards.
3) Identity Reconstruction: A key purported benefit of 3DMM fitting is parametrizing facial identity. Yet, the accuracy of identity capture has only been partially studied. Traditional metrics, such as the error between the estimated 3D face and the ground truth scan, do not provide direct information about whether a person's reconstruction differs from that of others. Some studies addressed this by measuring face similarity or face verification performance [2], [25], [26], [35], [48], or by clustering [26]. However, those evaluations were not based on shape representation alone, as they used rendered images (i.e., shape and texture information) or deep embeddings. In response, we directly measure the ability to produce a person-specific 3D shape representation, and the consistency of this representation across reconstructions from images of different poses and expressions, with a novel metric (Section IV-A1).
4) Fitting Morphable Models to 3D Scans: A 3DMM can also be fit to 3D scans, rather than 2D data, for example, to manipulate the facial expression on the scans [7]. While (linear) 3DMMs have been the predominant approach [22], nonlinear and topology-independent approaches have also been proposed recently [14]. Inequality constraints on shape parameters have been explored in the context of 3D-3D fitting [7], [11]. However, no study to our knowledge has used inequality constraints on landmarks, and the constraints that we propose in this study can be extended to the case of 3D-3D fitting (Section V).

II. INEQUALITY-CONSTRAINED 3DMM FITTING
3DMMs are models that can generate face images by controlling various factors, such as the person's identity, expression and pose, as well as the illumination conditions [5]. Fitting the 3DMM to an input image I^0 amounts to finding the parameters that best represent the image. This is achieved by minimizing an objective function f that quantifies the dissimilarity between I^0 and the 3DMM-generated image, I. Specifically, I is a vector I := (I_1, ..., I_P)^T of P pixels and depends on the 3DMM's shape parameters (identity and expression) α := (α_id, α_exp), texture parameters β, illumination coefficients φ and camera view transformation c := (u, τ), which includes rotation and translation parameters, u and τ. Thus, the objective is a function of these parameters and the input image, f(α, β, c, φ; I^0). Our choice of objective function is GC, as discussed in Section III-A.

A. Inequality Constraints

1) 3DMM Parameter Constraints:
The 3D face shape generated by the 3DMM is a vector of N 3-dimensional points p = (p_1^T, p_2^T, ..., p_N^T)^T and is generated as p := p̄ + Aα, where p̄ ∈ R^{3N} is the 3DMM's average shape vector, and A is a 3N × K_α matrix representing the shape basis of the 3DMM. (For conciseness, we assume that A includes both identity- and expression-related shape variation.) One can ensure that the shape parameters produce a face-like shape by enforcing the inequality constraints [7], [11]

α̲ ⪯ α ⪯ ᾱ,    (1)

where α̲ and ᾱ are the lower and upper bounds of the inequalities, respectively, and ⪯ is elementwise inequality (Fig. 1(a)). Unlike ℓ2 regularization, this strategy does not penalize shape coefficients α that satisfy (1). Thus, the output does not gravitate towards smoother faces, allowing for higher flexibility to capture person-specific shape cues. We determine the bounds α̲ and ᾱ by tuning two parameters (Section IV-A3). Cross-dataset experiments demonstrate that these parameters generalize well.

Each point of the mesh p has a person-specific (grayscale) pixel intensity, contained in the vector t := t̄ + Bβ, where t̄ is the average texture vector of the 3DMM and B is an N × K_β matrix, which is the 3DMM's texture basis. The inequality constraint

β̲ ⪯ β ⪯ β̄    (2)

is enforced for a realistic-looking face. The bounds β̲ and β̄ are tuned through a single parameter (Section IV-A3).
We prefer to use box constraints as in (1) and (2) for computational simplicity. Of note, the resulting feasible regions can allow too much flexibility. For example, for morphable models constructed via PCA, a more appropriate approach would be to restrict the search space to ellipsoids by enforcing an upper bound on the Mahalanobis distances α^T C_α^{-1} α and β^T C_β^{-1} β, defined by the covariance matrices C_α and C_β of the model. While our 3D-to-2D fitting implementation considers only box constraints, for comparison, we also provide 3D-3D fitting results with both ellipsoid and box constraints via the same morphable model (Supp. Appendix E, available online).
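The difference between the two constraint types can be illustrated with a toy feasibility check; the helper names below are hypothetical and not part of the described implementation:

```python
import numpy as np

def in_box(alpha, lower, upper):
    """Feasibility under elementwise box constraints as in (1)."""
    return bool(np.all(alpha >= lower) and np.all(alpha <= upper))

def in_ellipsoid(alpha, cov, radius):
    """Feasibility under a Mahalanobis-distance (ellipsoid) bound."""
    d2 = float(alpha @ np.linalg.solve(cov, alpha))
    return d2 <= radius ** 2

# Toy 2D case: the box corner passes the box test but fails the ellipsoid
# test (its Mahalanobis distance is sqrt(8) > 2), illustrating that box
# constraints admit a larger feasible region than the matched ellipsoid.
cov = np.eye(2)
lower, upper = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
corner = np.array([2.0, 2.0])
inside_box = in_box(corner, lower, upper)
inside_ellipsoid = in_ellipsoid(corner, cov, 2.0)
```

This is the sense in which box constraints "can allow too much flexibility": the corners of the box lie outside the equal-probability ellipsoid of a PCA model.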
2) Landmark Constraints: 3DMM fitting performance can be improved by ensuring that the 2D projection of the facial mesh is consistent with L pre-detected facial landmarks on the 2D image, namely x̃ := (x̃_1, ỹ_1, ..., x̃_L, ỹ_L). Let the 2D projection of the facial mesh be a set of N points (x_1, y_1), ..., (x_N, y_N), computed as

x_i = φ_x (v_xi / v_zi) + γ_x,   y_i = φ_y (v_yi / v_zi) + γ_y,    (3)

where v_xi, v_yi and v_zi are the view-transformed 3D coordinates obtained as (v_xi, v_yi, v_zi)^T = R(u) p_i + τ; φ_x, φ_y, γ_x, γ_y are the parameters of the camera's perspective projection; and R(u) is the 3D rotation matrix, parametrized by u. Furthermore, let x_L(α, c) := (x_{l_1}, y_{l_1}, ..., x_{l_L}, y_{l_L}) be the vector-valued function containing the L mesh points that correspond to the L facial landmarks.
The traditional approach to making use of the landmarks is to augment the objective function for 3DMM fitting with a regularization term that penalizes the discrepancy x_L(α, c) − x̃. This term must ideally depend on the error distributions of the landmark detector. Nearly all 3DMM methods use ℓ2 regularization, which is an appropriate term if the error follows a Gaussian distribution. However, empirical error distributions suggest that Gaussianity assumptions are not always valid (Fig. 4), as error distributions are often super/sub-Gaussian (i.e., have positive/negative kurtosis) and skewed. One can consider using alternative regularization terms, such as the wing loss [21], which has parameters that can better adapt to error distributions with varying degrees of kurtosis. Still, wing loss or other robust functions (e.g., L1) are typically symmetric and would be limited in their ability to handle skewed error distributions. Moreover, the error of landmark detectors is systemic: it depends on pose (Section IV-B5) and possibly other factors. Thus, the ideal regularization term must depend on pose, otherwise systemic errors that are not accounted for may be reflected in the 3DMM fitting outcome.
Given the difficulties with devising adequate regularization terms, we apply a different strategy. Specifically, we use inequality constraints that enforce the spatial discrepancy between the detected landmarks and the corresponding mesh points to be within a range as

ε̌ ⪯ x_L(α, c) − x̃ ⪯ ε̂,    (4)

where ε̌ and ε̂ are the lower and upper bounds. Restricting the search space in this way ensures that the 3DMM fitting procedure does not diverge (Appendix D, available online): the outcome is forced to be consistent with the landmarks, which are typically estimated very reliably with modern detectors [12]. However, any solution that satisfies the landmark constraints is not penalized at all; thus, small landmark errors caused by systemic bias have little effect on the 3DMM fitting outcome (Section IV-B5).
The bounds ε̌ and ε̂ can be determined without the need for tuning with respect to a specific application. Suppose that we measure the error of a landmark detector on a large dataset of face images in terms of signed horizontal and vertical differences for each landmark. The horizontal (signed) differences for the ith landmark form an empirical distribution (Fig. 4), and we set the lower bound for the constraint of this landmark, ε̌_xi, to the δth percentile of this distribution; and the upper bound, ε̂_xi, to the (100 − δ)th percentile, where δ is a small (e.g., 2.5) but nonzero value, as setting it to zero would render the bounds highly sensitive to outliers. The corresponding vertical bounds, ε̌_yi and ε̂_yi, are defined similarly, and the upper and lower bound vectors for all landmarks are determined as

ε̌ := (ε̌_x1, ε̌_y1, ..., ε̌_xL, ε̌_yL),   ε̂ := (ε̂_x1, ε̂_y1, ..., ε̂_xL, ε̂_yL).
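The percentile-based bound computation for a single landmark can be sketched as follows; the error sample here is synthetic, drawn from a skewed distribution purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signed horizontal errors (in pixels) of one landmark,
# drawn from a skewed distribution to mimic real detector behavior.
errors_x = rng.gamma(shape=2.0, scale=1.5, size=10_000) - 2.0

delta = 2.5  # small but nonzero percentile, as suggested in the text
eps_check_x = float(np.percentile(errors_x, delta))        # lower bound
eps_hat_x = float(np.percentile(errors_x, 100.0 - delta))  # upper bound
```

Because the percentiles are taken from the empirical distribution itself, skew and kurtosis are handled automatically; the resulting bounds are asymmetric whenever the error distribution is.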

B. Problem Formulation
We can now formulate inequality-constrained 3DMM fitting as the following optimization problem:

minimize f(α, β, c, φ; I^0)
subject to α̲ ⪯ α ⪯ ᾱ,  β̲ ⪯ β ⪯ β̄,  ε̌ ⪯ x_L(α, c) − x̃ ⪯ ε̂.

Note that the formulation above considers single-frame 3DMM fitting. This is for notational conciseness, and jointly fitting a 3DMM to multiple frames of a subject is a straightforward extension of this problem. Specifically, the adjustments needed for multi-frame fitting are (i) defining per-frame expression, illumination and camera view parameters; (ii) extending the set of inequality constraints to all per-frame parameters; and (iii) defining the objective function as the sum of per-frame objective functions while keeping the identity-specific parameters α_id and β common.
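The structure of the multi-frame extension can be sketched with a toy stand-in objective (all names hypothetical; the actual objective is GC, not the placeholder used here):

```python
import numpy as np

def multiframe_objective(alpha_id, per_frame_params, frames, objective):
    """Sum of per-frame objectives: identity parameters are shared across
    frames, while expression/camera/illumination vary per frame."""
    return sum(
        objective(alpha_id, p["alpha_exp"], p["cam"], p["illum"], frame)
        for p, frame in zip(per_frame_params, frames)
    )

# Toy stand-in objective: squared norm of the parameters it receives.
toy = lambda aid, aexp, cam, illum, frame: float(aid @ aid + aexp @ aexp)

frames = [None, None]  # placeholders for two image frames
params = [{"alpha_exp": np.ones(3), "cam": None, "illum": None} for _ in frames]
val = multiframe_objective(np.zeros(4), params, frames, toy)
```

Minimizing such a sum over a shared α_id (and β) while each frame keeps its own expression and camera parameters is what pools identity evidence across frames.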

III. EFFICIENT AND ROBUST IMPLEMENTATION
Optimization algorithms that have been used widely for unconstrained 3DMM fitting cannot be used for the proposed framework, as they cannot handle inequality constraints. We propose using the log-barrier method [10], as it is a well-established approach for performing inequality-constrained optimization with differentiable objective functions and inequality constraints.

A. Robust Objective Function
The robustness of 3DMM fitting relies largely on the objective function, because there are many discrepancies between a 3DMM-synthesized image I and a real-world image I^0 that cannot be compensated for analytically. For example, facial accessories such as glasses or facial hair cause occlusion. The target image may also have undergone an unknown nonlinear pixel transformation, such as histogram equalization or some camera gain function. Finally, if the 3DMM is not collected from a sufficiently diverse sample, factors such as skin color may not be accurately captured by the texture model.
Traditionally, photometric objective functions have been used for 3DMM fitting via optimization [5], [6], [18]. While one can use a robust photometric function to account for some of the challenges above (e.g., occlusion), the factors that affect the entire image (e.g., nonlinear pixel transformations) are likely to be problematic even for robust photometric functions, unless one performs an adequate pre-processing step, which is a problem on its own. We use GC as the objective function, as it proved to be remarkably robust in the similar problem of 2D-2D alignment [49] and is differentiable. This function has a built-in ability to ignore outliers: whenever there is a regional mismatch between two images, the computation of GC over this region resembles the realization of a stationary random process with a uniform distribution, thus the GC computed from the region can be assumed to be practically zero [49]. Moreover, GC uses the gradient directions of the pixels, which, unlike the absolute pixel values or the gradient magnitudes, show little sensitivity to nonlinear pixel transformations, as we also show experimentally (Supp. Appendix C, available online). The definition of GC as the objective function f is provided in Section III-C along with its derivatives that are needed for optimization with the log-barrier method.

B. The Log-Barrier Method
The crux of the log-barrier method is to create an augmented objective function that incorporates the inequality constraints and the original objective function f. Then, the inequality-constrained optimization problem is replaced with a series of unconstrained minimization problems (Fig. 5). Specifically, the augmented objective function f̃ is

f̃(α, β, c, φ) := f(α, β, c, φ; I^0) − (1/t) Σ_i log h_i(α, β, c).    (5)

Here, t is a positive parameter and h_i are the entries of the vector-valued function h(α, β, c) that incorporates the inequalities. That is, h is computed by vertically concatenating the vectors α − α̲, ᾱ − α, β − β̲, β̄ − β, x_L − x̃ − ε̌ and ε̂ − x_L + x̃. The logarithm in (5) approximates the inequality constraints with a differentiable function, allowing for derivative-based optimization techniques. A series of optimization steps is performed by progressively increasing t, and thus improving approximation quality (Fig. 5 and Table I). Each of these steps is an unconstrained problem (i.e., minimization of f̃ for a given t) called an outer iteration. This process shows little sensitivity to t values, as they typically change the number of iterations but not the result [10].

Fig. 5. Log-barrier method. Solid lines depict the bounds of an inequality constraint; the intersection of all constraints is the polygon. Dashed curves show how the method approximates the polygon using the logarithm of the inequality constraints [see (5)]. Approximation quality improves as t increases, t_0, t_1, ..., t_T. Thus, the log-barrier method transforms an inequality-constrained problem into a series of unconstrained problems, the solution of which (i.e., θ̃*) gradually approaches that of the original problem, θ*.

TABLE I. OVERVIEW OF THE LOG-BARRIER METHOD
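The augmented objective can be sketched as follows, assuming the common barrier form f − (1/t) Σ_i log h_i and treating any infeasible point (some h_i ≤ 0) as having infinite value:

```python
import numpy as np

def barrier_objective(f_val, h_vals, t):
    """Log-barrier augmented objective: original value plus the barrier
    term -(1/t) * sum(log h_i); defined only on the feasible interior."""
    h = np.asarray(h_vals, dtype=float)
    if np.min(h) <= 0:
        return float("inf")  # infeasible point: barrier is infinite
    return float(f_val - np.sum(np.log(h)) / t)

# As t grows, the barrier term shrinks and the augmented objective
# approaches the original objective value at interior points.
vals = [barrier_objective(1.0, [0.5, 1.5, 2.0], t) for t in (1.0, 10.0, 1000.0)]
assert abs(vals[2] - 1.0) < abs(vals[0] - 1.0)
```

This mirrors the schedule of outer iterations: each outer iteration minimizes the augmented objective for a fixed t, and t is then increased for the next one.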
The log-barrier algorithm is an interior-point method; therefore, it needs to be initialized from a feasible point. The texture parameters β can be trivially initialized by setting them to any vector that satisfies (2). However, the shape and pose parameters cannot be initialized trivially due to the landmark constraints in (4). Thus, one needs to solve the feasibility problem, i.e., the problem of finding some parameters α, c that satisfy (4), which can be accomplished by using the Phase I method [10]. Fortunately, the implementation of the latter method for our problem is not computationally involved, as it amounts to minimization over the small-dimensional space of the L landmarks rather than the space of the dense mesh points (Supp. Appendix B, available online).

C. Gauss-Newton Method for Each Outer Iteration
We use the Gauss-Newton method to approximately solve each outer iteration in Table I. This method uses a first-order approximation of the Hessian, which is efficient in our context and similar problems [3], [9]. It is an iterative method that starts with an initial point θ ← (α, β, c, φ), and updates it until convergence is reached. At each inner iteration, θ is updated as θ ← θ + sΔθ, where s is the step size and Δθ the search direction. The computation of these elements is explained in the remainder of this section.
The search direction Δθ is computed using the first-order approximation to the Hessian of the objective function, ∇̃²f̃, and the gradient of the objective function, ∇f̃, as

Δθ := −(∇̃²f̃)^{-1} ∇f̃.    (6)

The algorithm to solve an outer iteration is given in Table II.
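The role of the first-order Hessian approximation can be sketched on a toy least-squares residual r(θ) = Aθ − b, where J^T J plays the role of the approximate Hessian (the paper applies the analogous approximation to GC; names here are hypothetical):

```python
import numpy as np

def gauss_newton_direction(jac, residual):
    """Solve (J^T J) dtheta = -J^T r, i.e. the Newton step with the
    first-order Hessian approximation J^T J of a sum-of-squares cost."""
    return np.linalg.solve(jac.T @ jac, -(jac.T @ residual))

# Because the toy residual is linear in theta, a single Gauss-Newton
# step from theta = 0 lands exactly on the least-squares minimizer.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 2.0])
theta = np.zeros(2)
theta = theta + gauss_newton_direction(A, A @ theta - b)
```

For a nonlinear objective, the same update is applied repeatedly inside each outer iteration, with a step size s chosen by line search.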
Clearly, the needed derivatives, ∇f̃ and ∇̃²f̃, depend on the definition of the objective function. Our objective function, GC, can be computed as

f(θ) := −(g_x(I)^T g_x^0 + g_y(I)^T g_y^0),

where g_x(I) and g_y(I) are the horizontal and vertical (magnitude-normalized) gradients of I; g_x^0 := g_x(I^0) and g_y^0 := g_y(I^0) are those of the input image; and T is transpose. The kth entries of g_x(I) and g_y(I), namely g_xk and g_yk, are the image gradients at pixel k, normalized by the gradient magnitude at that pixel. The derivatives of these entries w.r.t. θ, ∇g_xk and ∇g_yk, are computed via the chain rule from the derivatives of the image pixels w.r.t. the parameters θ, namely the entries of ∇I := (∇I_1, ..., ∇I_P), which depend on implementation details, such as the parametrization of rotation or illumination. We use the Rodrigues rotation formula and the Phong illumination model, and provide the needed derivatives in Supp. Appendix A, available online.
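A minimal sketch of GC with simple forward-difference, magnitude-normalized gradients (an assumption for illustration; the exact discretization and sign convention in the implementation may differ):

```python
import numpy as np

def normalized_gradients(img):
    """Forward-difference gradients, normalized by the gradient magnitude
    (a small epsilon avoids division by zero at flat pixels)."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    mag = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    return gx / mag, gy / mag

def gradient_correlation(img, img0):
    """Sum of inner products of the normalized gradient fields; the
    fitting objective is the negative of such a correlation."""
    gx, gy = normalized_gradients(img)
    gx0, gy0 = normalized_gradients(img0)
    return float(gx.ravel() @ gx0.ravel() + gy.ravel() @ gy0.ravel())

# Gradient directions survive a monotone nonlinear intensity change
# (here a square root), which is why GC tolerates such transformations.
rng = np.random.default_rng(1)
im = rng.random((16, 16))
same = gradient_correlation(im, im)
warped = gradient_correlation(np.sqrt(im), im)
assert warped > 0.0
```

Each pixel contributes the cosine of the angle between the two gradient directions, so a regional mismatch contributes roughly zero on average rather than a large penalty.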
To compute the derivatives of GC in (12), one must identify the indices of the neighbors (i.e., k_l, k_r, k_b, and k_a) of each pixel. The latter is not trivial when the rendered region is non-rectangular, but fortunately can be done in constant time as follows. Suppose that the image vector I is constructed by vertically concatenating the columns corresponding to the rendered pixels, with the rendered pixels indicated by a mask M[i] as in Fig. 6.

2) Step Size Computation: The last component needed to implement the algorithm in Table II is the step size s. To make the most of each iteration, we use a simple but adaptive approach, backtracking line search [10], which is an iterative method that starts from a step of s := 1 and reduces it as s ← μs until the condition

f̃(θ + sΔθ) ≤ f̃(θ) + ν s ∇f̃(θ)^T Δθ    (13)

is satisfied, where μ ∈ (0, 1) and ν ∈ (0, 0.5) are given coefficients. The choice of μ and ν values has limited effect on the total running time, as values that lead to a more thorough search lead to fewer parameter updates and vice versa [10].

In cases where a logarithm in (5) gets a negative argument, it is possible that the argument of f̃ on the left-hand side of (13), namely θ + sΔθ, violates the inequality constraints and leads to a non-real output f̃(θ + sΔθ). This indicates that the step size is too large and needs to be reduced until a real output is produced. We can identify whether the inequality constraints are violated without the costly operation of image rendering by simply checking if the minimal entry of the h function, namely min_i h_i, is negative, hence step III.a in Table II.
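Backtracking line search with the cheap feasibility check (min_i h_i > 0) folded in can be sketched as follows; the helper names and the μ, ν values are illustrative:

```python
import numpy as np

def backtracking(theta, dtheta, f, grad, h, mu=0.5, nu=1e-4):
    """Shrink s until the trial point is feasible (min_i h_i > 0, checked
    before any costly evaluation) and satisfies sufficient decrease."""
    s = 1.0
    f0 = f(theta)
    slope = float(grad(theta) @ dtheta)  # directional derivative
    while True:
        trial = theta + s * dtheta
        if np.min(h(trial)) > 0 and f(trial) <= f0 + nu * s * slope:
            return s
        s *= mu

# Toy 1D problem: f(x) = x^2 with the single constraint h(x) = 1 - x > 0.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
h = lambda x: np.array([1.0 - x[0]])
s = backtracking(np.array([0.9]), np.array([-2.0]), f, grad, h)
```

In the actual fitting loop, the feasibility test on h replaces a full image rendering, which is exactly what makes the check inexpensive.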

IV. EXPERIMENTS
We validate inequality-constrained 3DMM fitting in terms of geometric error and the ability to produce person-specific and consistent face shapes. We present experiments on four datasets and compare with four state-of-the-art approaches as well as ℓ2 regularization. We also provide an ablation study and report computation times.

A. Experimental Setup

1) Evaluation Metrics:
We use the standard metric for 3D face reconstruction, namely the geometric error between a reconstructed mesh and the corresponding 3D ground truth scan, which is computed as follows. First, we rigidly align the reconstructed mesh to the ground truth. Then, we establish correspondence for each point on the reconstructed mesh by finding the nearest neighbor of the point on the ground truth. Finally, the geometric error for a mesh is computed as the average Euclidean distance between the corresponding mesh and ground truth points. Depending on the landmark points that are available on the dataset, we perform rigid alignment either by using the (sparse) landmark points or via ICP (Section IV-A2). The error computed in this way quantifies the reconstruction quality of the facial shape alone, as any rigid misalignment is compensated for. Additional results computed with scans that include pose (i.e., rotation) variation are provided in Supp. Appendix G, available online.
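The geometric error computation can be sketched with a brute-force nearest-neighbor search (rigid alignment assumed to have been applied beforehand; names hypothetical):

```python
import numpy as np

def geometric_error(recon, gt):
    """Mean distance from each reconstructed point to its nearest
    ground-truth point (brute force; a k-d tree would scale better)."""
    diffs = recon[:, None, :] - gt[None, :, :]      # (N_r, N_g, 3)
    dists = np.linalg.norm(diffs, axis=2)            # pairwise distances
    return float(dists.min(axis=1).mean())           # NN distance, averaged

# Toy "scan": 100 points on a line; the reconstruction is offset by 0.1
# along x, so every point is exactly 0.1 from its nearest neighbor.
gt = np.stack([np.arange(100.0), np.zeros(100), np.zeros(100)], axis=1)
recon = gt + np.array([0.1, 0.0, 0.0])
err = geometric_error(recon, gt)
```

Note that the correspondence is directional (reconstruction to ground truth), matching the description above.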
We also propose a new metric, Within- and Between-subject Effect Size (WBES), to evaluate a method's ability to produce a consistent and person-specific 3D shape. To compute WBES, we generate three identity reconstructions per subject from different images, r_s1, r_s2, r_s3, and compute the Euclidean distance between each unique pair, yielding a set of 3N_s within-subject distances, where N_s is the number of subjects. We also compute a distribution with the distances between all pairs of reconstructions of different subjects, ||r_si − r_s'j|| with s ≠ s'. We then quantify the separability of the within- and between-subject distributions using Cohen's effect size, which is computed as

WBES = (μ_b − μ_w) / s_p,   s_p = sqrt( ((n_b − 1) v_b + (n_w − 1) v_w) / (n_b + n_w − 2) ),

where μ_b, n_b and v_b are respectively the mean, sample size and variance of the between-subject distribution; and μ_w, n_w and v_w are the same statistics for the within-subject distribution.
Cohen's effect size quantifies the separation of distributions relative to the (pooled) standard deviation; e.g., WBES = 2 means that the averages of the distributions are two standard deviations apart. Ideally, within-subject distances must be close to zero and between-subject distances must be significantly large, yielding a large WBES. We investigate how the geometric error or the WBES improves with the number of frames used per reconstruction. For single-frame methods, we do a multi-frame extension as in [48] by computing each reconstruction as the average of multiple reconstructions from different frames. For multi-frame methods, we perform reconstruction by jointly fitting a 3DMM to the available frames, unless we fit to 10 or more frames, in which case, to circumvent computational memory restrictions, we perform reconstruction over groups of frames of a subject and average over these reconstructions.
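The WBES computation from within- and between-subject distance samples can be sketched as follows (synthetic distance samples for illustration):

```python
import numpy as np

def wbes(within, between):
    """Cohen's effect size between within- and between-subject distance
    distributions, using the pooled standard deviation."""
    w = np.asarray(within, dtype=float)
    b = np.asarray(between, dtype=float)
    nw, nb = len(w), len(b)
    vw, vb = w.var(ddof=1), b.var(ddof=1)
    pooled = np.sqrt(((nb - 1) * vb + (nw - 1) * vw) / (nb + nw - 2))
    return float((b.mean() - w.mean()) / pooled)

# Distributions whose means sit two pooled standard deviations apart
# should yield a score near 2, matching the interpretation in the text.
rng = np.random.default_rng(0)
within = rng.normal(0.0, 1.0, 5000)   # small within-subject distances
between = rng.normal(2.0, 1.0, 5000)  # larger between-subject distances
score = wbes(within, between)
```

A larger score indicates reconstructions that are both consistent within a subject and distinctive across subjects.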
Our final metric is the normalized mean error (NME) for 2D landmarks, which is computed by measuring the per-landmark error on the 2D image, then computing the average over the 51 landmarks corresponding to the inner face, and finally dividing this average by the diagonal of the box enclosing the face.
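A sketch of the NME computation under the stated convention (names hypothetical):

```python
import numpy as np

def nme(pred, gt, box_diagonal):
    """Normalized mean error: mean per-landmark 2D error over the 51
    inner-face landmarks, divided by the face bounding-box diagonal."""
    per_landmark = np.linalg.norm(pred - gt, axis=1)
    return float(per_landmark.mean() / box_diagonal)

# Toy check: a uniform 3-pixel horizontal offset with a 100-pixel
# bounding-box diagonal yields an NME of 0.03.
gt = np.zeros((51, 2))
pred = gt + np.array([3.0, 0.0])
error = nme(pred, gt, 100.0)
```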
2) Datasets and Data Generation Procedures: We use five datasets: BU4DFE [54], Florence [1], NoW [38], YT Faces [51] and AFLW3D-2000 [56]. The 3D geometric error is reported on the datasets with ground truth scans (BU4DFE, Florence and NoW). For BU4DFE and Florence, the images that are used for 3D face reconstruction are rendered using the texture data that is available alongside the meshes. Each image is rendered with a random yaw and pitch variation, sampled from uniform distributions ranging respectively from −80 to 80 and −15 to 25 degrees. Images are rendered with the weak perspective camera model, as nearly all reconstruction methods use this model.
The Florence dataset contains only neutral scans and therefore the rendered images do not contain expression variation. The images rendered from the BU4DFE dataset contain expression variation as they are picked from the (dynamic) scans with the six basic facial expressions. However, the task is still to estimate the neutral shape of each subject, as the point correspondence for the geometric error is difficult to establish in the presence of facial expressions [22]. Thus, the ground truth scan used for error computation is neutral, and we use the neutral outputs of the reconstruction methods. This procedure is also the one that is applicable to the NoW dataset, which contains a single and neutral ground truth scan but multiple (real) images per person that are collected with an iPhone. We use NoW images which, similarly to BU4DFE, contain pose and expression as well as illumination variation.
The rigid alignment needed for the geometric error computation on Florence and BU4DFE is performed using five keypoints on the ground-truth scan, namely the outer eye corners, the mouth corners and the nose tip. For the NoW dataset, we perform dense rigid alignment as suggested in the dataset's benchmark [38].
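Keypoint-based rigid (similarity) alignment of this kind is commonly solved in closed form with Umeyama's method; the sketch below shows that standard solution under the assumption that a least-squares similarity transform is what is meant (the paper does not specify its solver).

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform (s, R, t) mapping source
    keypoints onto destination keypoints (Umeyama's method).
    src, dst: (N, 3) arrays, e.g., N = 5 keypoints (outer eye corners,
    mouth corners, nose tip)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dc.T @ sc / len(src))
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / sc.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The reflection guard matters with few, nearly planar keypoints, where an unconstrained orthogonal fit can otherwise return a mirrored solution.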
A critical aspect of our experiments is to show how the estimation of shape identity improves as we use more frames per subject. Thus, we do not use the standard benchmark procedure for the NoW dataset, which is designed for single-frame reconstruction, but instead show how the quality of reconstructions improves as they are computed from an increasing number of frames per subject. We compute multi- as well as single-frame performance with all compared methods, and restrict our experiments to the validation set of this dataset.
We compare methods in terms of WBES on BU4DFE, by generating 1,000 frames per subject as described above. We also use the YT Faces dataset to measure WBES under uncontrolled conditions, using only sequences with a minimum of 200 frames. We use only one sequence per subject, as sequences undergo unknown (camera) distortions, imposing restrictions on recovering face identity (Fig. 11(b)).
Finally, we further test the robustness of our implementation on AFLW3D-2000 by measuring 2D landmark estimation error. We do not conduct experiments using the 3D annotations of this dataset, since 3D annotations derived from single frames cannot be deemed reliable.
3) Implementation and Compared Methods: We refer to the open-source implementation of our framework as Open3DI. We detect landmarks with 2D-FAN [12] and use the Basel '09 Face Model (BFM) [34] with K_αid = K_β = 199 shape and texture components, and the FaceWarehouse expression model [13] with K_αexp = 29 vectors [56]. We parametrize rotation with the Rodrigues formula. We use the Phong model for illumination, but fit illumination parameters only during initialization, separately for each frame. We also ignore the specular reflection parameter of the Phong model for computational simplicity. For fair comparison, we use a weak perspective camera in experiments with data that we generated from 3D scans. Open3DI needs three parameters for the 3DMM constraints (i.e., identity, texture and expression constraints), which are determined as follows. Let σ_β be a vector containing the standard deviations of the 3DMM's texture components. Then, we set the lower and upper bounds as β̌ := −λ_β σ_β and β̂ := λ_β σ_β by tuning a parameter λ_β. Similarly, we have two more parameters, λ_αid and λ_αexp, for the (shape) identity and expression constraints. The bounds of the landmark constraints are determined by measuring the per-landmark error as described in Section II-A2. To have exact ground truth, we measure 2D landmark error on a dataset that we synthesized using the BFM for identity and the FaceWarehouse model for expression. Further implementation details and information on how to run the code are in Supp. Appendix B, available online.
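The box constraints above, and the log-barrier method mentioned in the abstract for enforcing them, can be sketched as follows. This is the textbook barrier form, not necessarily the paper's exact formulation, and the σ values shown are placeholders rather than the tuned settings.

```python
import numpy as np

def box_bounds(sigma, lam):
    """Symmetric lower/upper bounds at lam standard deviations per
    3DMM basis component, so that lo <= beta <= hi."""
    sigma = np.asarray(sigma, float)
    return -lam * sigma, lam * sigma

def log_barrier(beta, lo, hi, t=100.0):
    """Log-barrier penalty for the box constraints lo < beta < hi,
    added to the data term so an interior-point (Newton-type) solver
    stays strictly feasible. Returns the penalty and its gradient."""
    beta, lo, hi = (np.asarray(a, float) for a in (beta, lo, hi))
    if np.any(beta <= lo) or np.any(beta >= hi):
        return np.inf, None                  # infeasible point
    value = -(np.log(beta - lo) + np.log(hi - beta)).sum() / t
    grad = (-1.0 / (beta - lo) + 1.0 / (hi - beta)) / t
    return value, grad
```

As t grows, the barrier's influence inside the box vanishes while the penalty still diverges at the bounds, which is what allows constrained fitting without smoothing the objective itself.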
For comparison, we perform ablation studies with an ℓ2-regularized variant of the same objective function (i.e., GC), and refer to it as GC-L2. The weights of the regularization terms are tuned via cross-database cross-validation. Single-frame versions of Open3DI and GC-L2, referred to as Open3DI-S and GC-L2-S, are also added to the comparisons.
We also compare with state-of-the-art methods, namely 3DDFA (v2) [27], Deep3DFace (TensorFlow version) [17], RingNet [38], DECA [19] and INORig [2]. The latter is the only multi-frame method, as publicly available multi-frame methods are difficult to find [2]. We are unable to compare with methods that are too slow, as our experiments run on thousands of images. For INORig, we use the robust model [2]. Prior to quantifying performance, we crop reconstructions tightly over the facial region (see Fig. 9(a)), which leads to a mesh with 23,470 points for the BFM-based methods.

B. Results
1) 3D Reconstruction Results: Fig. 7 shows the cumulative error distribution (CED) of several approaches; legend numbers are median errors. The results for the multi-frame methods (Open3DI and INORig) in Fig. 7 are obtained through reconstruction from five frames per subject. Results for reconstruction with varying numbers of frames per subject are reported in Fig. 8, and qualitative reconstruction results on the Florence dataset are shown in Fig. 9.
Open3DI achieves the smallest median error on the Florence dataset, followed by INORig (Fig. 7(a)). However, on the BU4DFE dataset, INORig performs better than Open3DI when both methods use five frames per subject, and even the single-frame method Deep3DFace performs comparably to Open3DI. This is likely because BU4DFE contains exaggerated expressions, and, while learning-based methods leverage large amounts of training data to predict how the neutral face looks in the presence of facial expressions, Open3DI has no learning component to address this expression-identity ambiguity. However, this ambiguity becomes less problematic as more frames per person become available, and Open3DI outperforms the other methods in comparisons where all methods use ten frames or more (Fig. 8). The performance of Open3DI improves at a higher rate than that of all other methods, a trend also observed for reconstruction from the real images of the NoW dataset. Given that face videos typically contain an abundance of frames with varying pose and expression, being able to fully benefit from extra frames is a practically important ability.
So far, we have compared methods in terms of geometric error. However, it must be highlighted that geometric errors reported on real 3D scans are merely estimates, as the true point correspondence between a reconstructed face and the corresponding 3D scan is unknown, and the nearest-neighbor (i.e., Chamfer) criterion used for error computation may lead to semantically incorrect correspondences [22], [39]. Thus, comparison with additional metrics that do not require point correspondence is important. Fig. 10 compares methods in terms of the WBES metric on the BU4DFE dataset. A critical outcome of this comparison is exposing limitations of single-frame reconstruction that cannot be easily deduced from geometric error. For example, the mean geometric error of Deep3DFace on BU4DFE is 1.78 mm for single-frame reconstruction and decreases to 1.63 mm with multi-frame reconstruction (Fig. 8), a relative improvement that may not appear impressive. However, Fig. 10(b) shows that the multi-frame extension of this method is significantly better than its single-frame version in terms of capturing person-specific details, as the within- and between-subject distributions are largely separated for multi- but not single-frame reconstruction. Open3DI stands out with its ability to capture person-specific shape cues with increasingly better accuracy as more frames become available, as its WBES grows at the highest rate and reaches the highest value. Also, Open3DI visibly outperforms Open3DI-S. That is, fitting multiple frames jointly, as opposed to naive averaging, leads to significant improvement, as multi-frame reconstructions of a person are more consistent than single-frame reconstructions (Section IV-B2). This is a critical feature of Open3DI, making it highly suitable for image sequences.
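The nearest-neighbor (Chamfer) criterion referred to above can be sketched as follows; a brute-force illustration (dense meshes would use a k-d tree), with the caveat the text itself raises built into the comments.

```python
import numpy as np

def nearest_neighbor_error(recon, scan):
    """Per-vertex geometric error under the nearest-neighbor (Chamfer)
    criterion: each reconstructed vertex is matched to its closest scan
    point (after rigid alignment). Note that the closest point is not
    necessarily the semantically corresponding one, which is the source
    of the bias discussed in the text."""
    recon, scan = np.asarray(recon, float), np.asarray(scan, float)
    # Pairwise squared distances, O(N*M) in memory and time.
    d2 = ((recon[:, None, :] - scan[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))
```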
WBES for the uncontrolled YT Faces dataset is provided in Fig. 11(a). Results must be interpreted with some caution, as YT Faces videos are brief and may contain limited pose variation, which may make producing consistent reconstructions from different frames easier. Nevertheless, the two critical outcomes described in the previous paragraph are replicated: (1) Deep3DFace is the best single-frame method, but Open3DI-S approaches it as the number of reconstructions averaged over increases, and (2) Open3DI shows the highest increase in WBES per newly available frame. The major difference from the experiments on the controlled datasets is that INORig performs better on YT Faces than on controlled data. Still, Open3DI reaches INORig's performance as the number of frames per reconstruction increases, and qualitative results (Fig. 12) support Open3DI's ability to produce person-specific results on uncontrolled data. In sum, Open3DI emerges as the best method for producing increasingly accurate and person-specific reconstructions as more frames become available. Its ability to generalize from a highly controlled 3DMM to uncontrolled data is remarkable and unique; no other method to our knowledge achieves comparable precision and robustness without training on large amounts of "in-the-wild" data.
2) Mahalanobis Distances of Shape Parameters: A question that arises when we average over an increasing number of reconstructions of a subject is whether the resulting face shape becomes overly smooth. To investigate this issue, we report the mean Mahalanobis distance of the identity vectors (w.r.t. the covariance matrix of the BFM) for the BU4DFE subjects. Table III shows the mean Mahalanobis distance against the number of (averaged) shape vectors per subject. With single-frame reconstructions (i.e., Open3DI-S), the mean Mahalanobis distance reduces significantly, to 10.4, as we average over multiple identity vectors, which suggests that a non-negligible smoothing effect occurs. However, the Mahalanobis distance converges around 19.7 for Open3DI, indicating that multi-frame reconstructions of a subject are more consistent among themselves than single-frame reconstructions.
In fact, one may question whether the averaged shapes resulting from multi-frame reconstructions are too intricate, as 19.7 is higher than the threshold distance of 15.8 that covers 99% of the ellipsoid defined by the covariance matrix of the BFM (Supp. Appendix E, available online). Thus, we conducted additional 3D-3D fitting experiments, where we enforced ellipsoid constraints in addition to or instead of the box constraints (Supp. Fig. E.3, available online). Results suggest that the perceivable differences between enforcing box and Mahalanobis constraints are minor and at a very granular level for the BFM. However, this may not be the case for other morphable models, particularly if their shape space is of higher dimensionality.
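Since a PCA-based model such as the BFM has a diagonal covariance in its own basis (with the squared component standard deviations on the diagonal), the Mahalanobis distance of an identity vector reduces to the norm of the whitened coefficients; a minimal sketch under that assumption.

```python
import numpy as np

def mahalanobis_pca(alpha, sigma):
    """Mahalanobis distance of shape coefficients alpha w.r.t. a 3DMM
    whose covariance is diag(sigma**2) in the PCA basis:
    d = sqrt(alpha^T diag(sigma^2)^(-1) alpha) = ||alpha / sigma||."""
    alpha, sigma = np.asarray(alpha, float), np.asarray(sigma, float)
    return float(np.linalg.norm(alpha / sigma))
```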
3) Landmark Detection Results: We further evaluate Open3DI on uncontrolled data through 2D landmark detection experiments on the AFLW3D-2000 dataset (Fig. 13). Results demonstrate that the landmark detection accuracy of Open3DI is on a par with state-of-the-art approaches. Fig. 14 shows qualitative landmark detection and face reconstruction results on the AFLW3D-2000 dataset (see Supp. Fig. G.6 for further similar results, available online), illustrating that Open3DI is effective in unconstrained imaging conditions despite needing no training at all.
4) Ablation Study: The median geometric error on BU4DFE increased by ∼63% and ∼303% when we respectively removed the landmark and the 3DMM parameter constraints. This is not surprising, as implausible shapes are produced with unconstrained parameters [18]. An alternative is to replace the inequality constraints with standard ℓ2 regularization (i.e., the GC-L2 method), which increased the median error by ∼37% on BU4DFE (Supp. Fig. G4b, available online). The WBES of GC-L2 on BU4DFE is 1.67 and 1.88 for F = 9 and F = 27 frames per reconstruction, respectively, which is much smaller than the WBES of 2.19 and 2.78 achieved by Open3DI for the same F values.

5) Sensitivity to Pose Variation: The error of the facial landmark detector that we use for the landmark constraints of our method increases as facial pose deviates from frontal (Fig. 15(a)). Our discussion in Section II-A2 suggests that ℓ2 regularization (i.e., GC-L2) should be more sensitive to landmark detection errors than Open3DI, and experiments support this hypothesis. Fig. 15(b) plots the dense NME of the two approaches versus pose (i.e., yaw angle) variation on the BU4DFE dataset: GC-L2 leads to higher error as pose approaches profile, whereas inequality-constrained optimization has little sensitivity to pose variation. Statistical tests support this observation: the NME of GC-L2 is highly and significantly correlated (ρ = 0.9; p = 0.001) with pose-dependent landmark errors, whereas the NME of Open3DI is not (ρ = 0.2; p = 0.650).
6) Computation Time: Fig. 16(a) shows processing times for the multi-frame methods Open3DI and INORig on an NVIDIA GTX 1080 GPU. The illustrated increase of time versus number of frames is slightly higher than linear for both methods. For Open3DI, this is caused primarily by the matrix multiplications ∇g_x^T ∇g_x and ∇g_y^T ∇g_y needed for the Hessian in (11), which are of higher-than-linear complexity despite the usage of a GPU (Fig. 16(b)). The average time to fit a complete set of 3DMM parameters to one frame is 237.6 ms. Notably, once the identity (shape and texture) parameters are identified, fitting camera and expression parameters to one frame takes 20.4 ms on average, making Open3DI viable for video processing. These results are significantly faster than those recently reported for optimization-based approaches (e.g., 120 s [15], [45]) or implementations with off-the-shelf solvers [41] that take tens of seconds per image. While some learning-based methods (e.g., [45]) can currently run much faster than Open3DI, it must be noted that we present the first attempt at an efficient implementation, whereas learning-based methods have been studied for years and enjoy a large community and software support.

C. Limitations
The single-frame version of the proposed method does not typically perform better than learning-based methods and produces highly overlapping within- and between-subject distances from a single frame (Fig. 10(b); bottom-left). But even successful learning-based methods lead to highly overlapping within- and between-subject distributions (Fig. 10(b); top-left), which highlights that single-frame identity shape estimation has significant limitations. The lack of regularization in our method may reduce performance for single-frame estimation, but it becomes a major strength in multi-frame estimation, where the availability of multiple frames reduces the ill-posedness of the problem and thereby the need for regularization. Another limitation is that closed eyes and the inner parts of the mouth are not always accurately captured (Supp. Fig. G.5, available online), but we observed similar issues for other methods that use the same expression model. This issue can be remedied by fine-tuning constraints for some landmarks [17], [35], [42] or, possibly, by replacing the expression model (e.g., with FLAME [32]). Finally, Open3DI produces higher error on facial regions with low gradient variation, such as the cheeks and jaw (Fig. 17). These are particularly difficult regions for fitting a 3DMM to 2D data, as the lack of texture variation creates ambiguity; learning-based methods can achieve relatively smaller errors there, as data-driven architectures can implicitly learn the difficulty of these regions and produce more conservative estimates.

V. DISCUSSION AND FUTURE DIRECTIONS
This study calls for a re-evaluation of the conventional wisdom that optimization-based methods are inappropriate for real-life applications (Section I). While such impressions may be valid for unconstrained optimization methods, which were inferior to learning-based methods in our experiments and in other studies, we show that optimization-based methods are not categorically unsuitable for real-life applications. Rather, 3DMM fitting based on inequality-constrained optimization of a robust objective function offers a highly promising alternative to learning-based methods. These findings are not only of theoretical interest, but also of practical significance, given that optimization-based methods are more flexible than learning-based methods in terms of the morphable model that is used and the camera parameters (Section I). The proposed approach emerged as particularly suitable for video-based applications, given that its multi-frame reconstruction performance is similar to or better than that of the compared methods, and that it can estimate per-frame 3DMM parameters (i.e., pose and expression) efficiently (Section IV-B6).
While the focus of this study was 3DMM fitting to 2D data, some of our findings may be relevant for 3DMM fitting to 3D data. In particular, the inequality constraints on landmarks, which, to our knowledge, have not been used for 3D-3D fitting, merit investigation in that context. The minimization of nearest-neighbor distance, a standard criterion for 3D-3D fitting, may lead to semantically incorrect point correspondences between the 3DMM mesh and the 3D scan, as the error on landmarks, which are arguably the easiest points for which to determine correspondence, can be high [22]. Inequality constraints can restrict the search during 3DMM fitting to solutions consistent with selected landmarks for improved semantic correspondence, while also allowing flexibility to absorb the errors of the landmark detector. In fact, when texture data is available along with 3D shape scans, it is conceptually straightforward to develop a unified inequality-constrained approach that simultaneously performs 3D-2D fitting to rendered image(s) as well as 3D-3D fitting to the shape scan.

VI. CONCLUSION
This paper formulates 3D morphable model (3DMM) fitting as an inequality-constrained optimization problem, offering a strong alternative to ℓ2 regularization that is particularly advantageous for applications where multiple frames per person are available. We showed that inequality-constrained optimization with gradient correlation (GC) is robust, accurate, and efficient, and produces person-specific face shapes. This is a notable contribution, as all other state-of-the-art approaches need sophisticated and time-consuming training, which is not trivial and imposes rigidity on the pipeline. We present an initial example of an efficient implementation, though further research should examine opportunities for improvement, as inequality constraints have been surprisingly understudied in computer vision. Given that ℓ2 regularization is a cornerstone of many computer vision methods, our results may serve as a foundation and inspiration for future research beyond 3DMM fitting.

Manuscript received 24 February 2022; revised 29 September 2023; accepted 3 November 2023. Date of publication 28 November 2023; date of current version 8 January 2024. This work was supported in part by the National Institutes of Health (NIH), the Office of the Director (OD), the National Institute of Child Health and Human Development (NICHD), and the National Institute of Mental Health (NIMH) of the US under Grants R01MH118327, R01MH122599, 5P50HD105354-02 and R21HD102078, and in part by the IDDRC at CHOP/Penn. Recommended for acceptance by S. Gao. (Corresponding author: Evangelos Sariyanidi.)

Fig. 1 .
Fig. 1. (a) Inequality constraints on the 3DMM shape coefficients. Each row depicts how the face shape changes with one component of the 3DMM's shape basis. The produced shape becomes implausible as the magnitude of a basis coefficient becomes too large. Our inequality constraints enforce basis coefficients to stay in the feasible (i.e., green-shaded) region. (b) Inequality constraints for landmarks. For each landmark, we learn a feasible region (i.e., a rectangle with edge lengths ℓ_x, ℓ_y) and prohibit solutions where the landmark falls outside this region.

Fig. 2 .
Fig. 2. How 3D reconstruction consistency changes with the number of frames used. Each column shows the average of 20 3D shape renderings for 2 subjects. When reconstruction is done with F = 1 frame, there is inconsistency between different reconstructions of the same person and the average is blurry. Reconstructions become more consistent and person-specific as F increases.

Fig. 4 .
Fig. 4. Distributions of the spatial (i.e., horizontal or vertical) discrepancy between the estimated and actual locations of landmarks (lower lip, nose, left eye and right eye), computed on a set of synthesized images. The numbers in each panel are the kurtosis (k) and skewness (s) coefficients of the corresponding distribution.
where δ_xk(I) := I_{k_r} − I_{k_l} and δ_yk(I) := I_{k_b} − I_{k_a} are the horizontal and vertical gradient operators for the kth pixel, and ‖δ_k‖ is the ℓ2 norm of δ_k := (δ_xk, δ_yk)^T. The indices k_r, k_l, k_b and k_a respectively denote the pixels to the right of, to the left of, below and above the kth pixel.

Fig. 6 .
Fig. 6. Illustration of how the neighbors of a pixel at (x, y) are located within the image vector I that contains a non-rectangular region. M[i] is a binary 2D array that indicates rendered pixels, and C[i] is its cumulative sum; both arrays follow a column-major representation. Suppose that the pixel at (x, y) in the 2D image space corresponds to the kth entry in I, where k = C[i] = C[y + (x−1)H]. The pixels to the left and right of (x, y) are located at the entries k_l = C[i−H] and k_r = C[i+H] of I.
, then I contains 18 values: the 3 values in the first rendered column, the 5 values in the second rendered column, and so on. The cumulative sum of the binary 2D array, C[i] := Σ_{j=1}^{i} M[j], efficiently establishes the link between the 2D location of a rendered pixel, (x, y), and the index k that corresponds to the same pixel within the vector I, as k = C[i], where i := y + (x−1)H and H is the image height. Moreover, the pixels to the left and right of the kth pixel, namely those at (x−1, y) and (x+1, y), correspond to the k_l-th and k_r-th entries of I respectively, where k_l = C[i − H] and k_r = C[i + H]. The pixels above and below the kth pixel are simply k_a = k−1 and k_b = k+1.
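The cumulative-sum bookkeeping described above can be sketched in code as follows; this is our own illustration, adapted to 0-based indexing (the text uses 1-based), with a hypothetical helper that returns, for each rendered pixel, its index k in I and the indices of its four neighbors (−1 where a neighbor is absent or not rendered).

```python
import numpy as np

def neighbor_indices(M):
    """For a binary mask M (H x W) of rendered pixels, map each rendered
    pixel to (left, right, above, below) neighbor indices within the
    column-major vector I of rendered values, using the cumulative-sum
    trick: C[i] gives the rank of pixel i among rendered pixels."""
    H, W = M.shape
    m = M.flatten(order="F").astype(bool)   # column-major pixel order
    C = np.cumsum(m)                        # cumulative count of rendered pixels
    k_of = np.where(m, C - 1, -1)           # 0-based index in I, -1 if masked
    out = {}
    for i in np.flatnonzero(m):
        x, y = divmod(i, H)                 # i = x*H + y in column-major order
        left  = k_of[i - H] if x > 0     else -1
        right = k_of[i + H] if x < W - 1 else -1
        above = k_of[i - 1] if y > 0     else -1
        below = k_of[i + 1] if y < H - 1 else -1
        out[int(k_of[i])] = (left, right, above, below)
    return out
```

Because C is computed once, each neighbor lookup is a constant-time array access, which is what makes gradient computation on the non-rectangular rendered region efficient.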

Fig. 7 .
Fig. 7. Cumulative error distribution (CED) for the compared methods, showing the percentage of subjects whose (mean) face reconstruction error is below the listed threshold values. Numbers in the legend indicate each method's median reconstruction error on the dataset.

Fig. 8 .
Fig. 8. Average geometric error against the number of frames used per reconstruction, shown on three datasets.

Fig. 10 .
Fig. 10. (a) Within- and between-subject effect size (WBES) against the number of frames per reconstruction for the BU4DFE dataset; higher WBES indicates better capacity to distinguish between subjects. (b) Distributions of within- and between-subject distances for the best methods in (a), namely Deep3DFace and Open3DI, for various numbers of frames per reconstruction, F.

Fig. 11 .
Fig. 11. (a) Within- and between-subject effect size (WBES) for YT Faces; higher WBES is better. (b) Some faces in the dataset are unnaturally squeezed by some distortion.

Fig. 12 .
Fig. 12. Qualitative results on the YT Faces dataset, depicting the input frame, the reconstructed face and the estimated identity shape. Further qualitative results are in Supp. Fig. G.5.

Fig. 13 .
Fig. 13. Cumulative error distribution (CED) of Open3DI for 2D landmark detection, compared to widely used landmark detection methods.

Fig. 14 .
Fig. 14. Open3DI's performance on the AFLW3D-2000 dataset. For each test image, we show the landmark points as estimated by Open3DI, the dense mesh on the input image, and the estimated (neutral) identity shape.

Fig. 15 .
Fig. 15. (a) The landmark detection error of our 2D-FAN implementation, measured on a synthesized dataset in terms of median NME (normalized mean error) versus pose (i.e., yaw angle) variation. (b) Median (normalized) dense 3D reconstruction error versus pose variation for Open3DI versus GC-L2; normalization is performed by dividing by the inter-ocular distance.

Fig. 16 .
Fig. 16. (a) Processing time versus number of frames per reconstruction for Open3DI and INORig on an NVIDIA GTX 1080 GPU. (b) Processing time (per iteration) for the first-order approximation of the Hessian needed by Open3DI.

Fig. 17 .
Fig. 17. Limitations of Open3DI illustrated over reconstructions of two (synthesized) subjects. While Open3DI captures the facial features (eyes, brows, mouth) relatively well, regions with low texture variation (cheeks, jaw) are captured with visibly less accuracy.

TABLE II: MINIMIZATION OF f FOR EACH OUTER ITERATION (TABLE I)

TABLE III: MEAN MAHALANOBIS DISTANCE OF THE SHAPE PARAMETER VECTORS OF BU4DFE SUBJECTS