Identity-Preserving Pose-Robust Face Hallucination Through Face Subspace Prior

Ali Abbasi and Mohammad Rahmati, Member, IEEE

Abstract—Over the past few decades, numerous attempts have been made to address the problem of recovering a high-resolution (HR) facial image from its corresponding low-resolution (LR) counterpart, a task commonly referred to as face hallucination. Despite the impressive performance achieved by position-patch and deep learning-based methods, most of these techniques are still unable to recover identity-specific features of faces. The former group of algorithms often produces blurry and oversmoothed outputs particularly in the presence of higher levels of degradation, whereas the latter generates faces which sometimes by no means resemble the individuals in the input images. In this paper, a novel face super-resolution approach will be introduced, in which the hallucinated face is forced to lie in a subspace spanned by the available training faces. Therefore, in contrast to the majority of existing face hallucination techniques and thanks to this face subspace prior, the reconstruction is performed in favor of recovering person-specific facial features, rather than merely increasing image quantitative scores. Furthermore, inspired by recent advances in the area of 3D face reconstruction, an efficient 3D dictionary alignment scheme is also presented, through which the algorithm becomes capable of dealing with low-resolution faces taken in uncontrolled conditions. In extensive experiments carried out on several well-known face datasets, the proposed algorithm shows remarkable performance by generating detailed and close to ground truth results which outperform the state-of-the-art face hallucination algorithms by significant margins both in quantitative and qualitative evaluations.



I. INTRODUCTION
Our desire to enhance the resolution of an already-recorded image is arguably as old as the time when the early photographs were taken. With the emergence of digital images, the idea also started to attract the attention of many researchers, leading to the introduction of a popular field in the area of image processing known as image super-resolution [1]. Even today, despite cameras having ever-increasing resolution, there is still a huge demand for increasing the resolution of existing images, particularly in specific applications such as law enforcement, surveillance, and monitoring, where images are taken under uncontrolled conditions and are required to be further processed before being used for a particular purpose. More importantly, most computer vision algorithms are designed to work with high-quality images, which means their performance would be severely affected when given a low-resolution input [2].
Ali Abbasi is with the Pattern Recognition and Image Processing Lab, Department of Computer Engineering, Amirkabir University of Technology (Tehran Polytechnic), 424 Hafez Ave., Tehran, Iran. E-mail: ali.abbasi@aut.ac.ir

Among the different variations of image super-resolution applications, those dealing with super-resolving face images have always been of special interest to researchers, and are often categorized under the name face hallucination. The term was first coined by Baker and Kanade [3] in 2000, and has since gained huge popularity due to its wide range of applications, with dozens of algorithms proposed so far.
One can hardly offer an explicit classification of the algorithms presented in the literature, as in many cases the distinction among different categories of methods is not clear. Consequently, different studies have suggested different criteria for classifying face hallucination algorithms, including operating domain (spatial vs. frequency), number of input images (single vs. multiple), and reconstruction method (reconstruction-based vs. learning-based). Early researchers relied more on statistical approaches to predict the HR face image given the LR observation. In their pioneering effort [3], Baker and Kanade used a Bayesian framework with gradient priors to estimate high-frequency components of a face image. Inspired by their work, Su et al. [4] proposed a similar formulation in which the prior was estimated by matching local low-level facial features from the input LR and the training HR face images. Meanwhile, Markov random fields (MRF) also started to draw the attention of researchers, after a two-step method was suggested by Liu et al. [5].
Another major group of researchers focused on making use of training samples to learn a projection matrix which could later be used to project the LR input into a high-dimensional space and obtain the reconstructed HR output. They based their idea on the fact that face images share structural similarities, and therefore can be synthesized from a linear combination of other samples. Wang and Tang [6] addressed the problem by applying PCA to fit the input face image as a linear combination of the training low-resolution face images, and then reconstructing the HR output by using the combination weights for the training high-resolution images. Despite considerable performance, their method failed to recover fine details of face images as it only focused on global face information. To alleviate this, various methods have been suggested in the literature. In [7], the authors adopted locality preserving projection (LPP) to learn the projection weights. Yang et al. [8] employed non-negative matrix factorization (NMF) to find the face subspace, along with a patch-based sparse representation method using coupled overcomplete dictionaries to generate the final hallucinated image. Also in [9], the coefficient vector was obtained through a recursive error back-projection method.
To find the aforementioned subspace, many studies have utilized the idea of manifold learning by assuming that LR face images and their HR counterparts are sampled from two manifolds which have similar local neighborhood structures. Liu et al. [10] maintained the local features by developing a multilinear patch-based reconstruction method. Fan and Yeung [11] addressed the problem through a two-step approach using neighbor embedding over visual primitive features. Huang et al. [12] applied canonical correlation analysis (CCA) to determine this subspace. The authors in [13] managed to learn a pixel-wise structure prior, represented as embedding coefficients, to estimate the final result. Because of the one-to-many mapping relation between LR and HR samples, some researchers cast doubt on the above manifold assumption and suggested different alternative strategies. Li et al. [14] presented a manifold alignment approach which projected the two manifolds to a common hidden manifold. In another study [15], a strategy was devised to learn linear models based on the local geometrical structure on the high-resolution manifold. To avoid dealing with the difficulties of preserving local geometry across resolutions, [16] directly regularized the relationship between the target patch and the training patches in the HR space. Later, Shi et al. [17] addressed this challenge by training a series of adaptive kernel regression mappings for predicting the missing details from LR patches.
Position-patch based face hallucination methods have also gained wide popularity during the last decade. The main intuition behind these algorithms is that the HR counterpart of a given input LR image patch can be reconstructed by applying neighbor embedding to those patches located in the same position as the test patch. Ma et al. [18] were the first to suggest this method, computing the reconstruction weights by solving a least squares problem. To obtain a more suitable solution, Jung et al. [19] replaced the least squares estimation with a convex constrained optimization. Various attempts have been made recently to use the idea of locality-constrained representation (LcR) in order to impose a locality constraint on the least squares inversion problem and encourage sparsity and locality simultaneously [20]-[22].
In recent years, with the advancement of neural networks, deep learning-based face hallucination algorithms have become increasingly prevalent in the literature. Motivated by the powerful representation abilities of CNNs, Zhou et al. [23] designed a network architecture to learn the mapping between the raw input image and the face representations extracted by a deep convolutional network. To avoid the oversmoothing problem and preserve more textural details, WaveletSRNet [24] reconstructed HR images in the wavelet coefficient domain. Chen et al. [25] extracted multi-scale features by incorporating multiple encoders and decoders in bottom-up and top-down patterns. In [26], a super-resolution technique was suggested which decomposed faces and recovered different components. Jiang et al. [27] also developed a network with two individual branches to learn the global facial shape and local facial components.
The emergence of generative adversarial networks also had a great impact on face super-resolution studies. Yu et al. [28] pioneered GAN-based face hallucination algorithms by considering an architecture which consisted of a discriminative and a generative network. They further extended their work in [29] by incorporating spatial transformer networks (STN) into their architecture to improve the alignment and upsampling performance. Later, they enhanced robustness against noisy inputs and inputs with non-fixed resolution in [30] and [31], respectively. Bulat et al. [32] discussed the idea of learning the degradation before super-resolution in a two-stage process. In [33], the problem was formulated with a collaborative suppression and replenishment framework, whereas the algorithm of [34] made the training phase more effective and efficient by introducing spatial attention into the generator. A self-supervised method in which the problem of face super-resolution is expressed as a generation problem was developed in [35]. [36] also took a different approach by integrating multiple deep learning networks of different types.

A. Motivation and Contribution
Recent face hallucination studies have been dominated by two approaches: position-patch and deep learning-based methods. The first relies on the basic assumption that small patches in the LR and HR spaces create manifolds with similar local geometry, and hence considers the reconstruction weights in both spaces to be equal. However, it has been shown [14] that, due to the nonisometric one-to-multiple mappings from LR patches to HR ones, this assumption is not always met in practice. Therefore, face images of two entirely different individuals may have similar LR patches, whereas HR/LR patch pairs of a specific person may bear no similarity at all [37]. This becomes more severe as the degradation level of the LR input increases [21], or as patches of smaller sizes are considered. Fig. 1 illustrates four patches of different sizes extracted from the same LR image. As shown, training patches belonging to the test subject tend to be more similar to the LR test patch as they increase in size. To further demonstrate this point, the average neighborhood preservation rates (NPR) [38] between the LR and HR image manifolds for different patch sizes and three levels of degradation are presented in Fig. 2.

Fig. 2. Influence of the patch size on the average neighborhood preservation rates between the LR and HR image manifolds, based on different levels of degradation (i.e., blur kernel size) and when the number of neighbors K is set to 200. Patch-based strategies tend to be more erroneous as the patches decrease in size, and this becomes even worse in the presence of higher levels of degradation. Considering the entire face image as a patch leads to a sharp increase in the level of NPR and more robustness against facial degradation.

As illustrated, the NPR grows with the size of the selected patches. However, the rising trend gradually weakens before the case when the whole image is taken into consideration, in which a sudden jump in the values of NPR is observable. The figure also reveals that, as the image becomes more degraded, more invalid patches are selected as the neighboring patches. Still, the case when the entire image is considered is relatively less affected by this change.
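For readers who want to reproduce this kind of measurement, the snippet below gives one plausible way to compute an average neighborhood preservation rate between corresponding LR and HR patch sets: for each sample, it counts how many of its K nearest LR-space neighbors are also among its K nearest HR-space neighbors. The patch extraction, the toy data, and the exact definition used in [38] are assumptions here, not the paper's code.

    import numpy as np

    def average_npr(lr_feats, hr_feats, k=200):
        """Average neighborhood preservation rate between two patch manifolds.

        lr_feats, hr_feats: (N, d_lr) and (N, d_hr) arrays whose i-th rows are the
        vectorized LR and HR patches taken from the same position of the same face.
        """
        def knn(X, k):
            sq = (X ** 2).sum(1)
            d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
            np.fill_diagonal(d2, np.inf)                     # exclude the patch itself
            return np.argsort(d2, axis=1)[:, :k]

        lr_nn, hr_nn = knn(lr_feats, k), knn(hr_feats, k)
        overlap = [len(set(lr_nn[i]) & set(hr_nn[i])) / k for i in range(len(lr_feats))]
        return float(np.mean(overlap))

    # Toy usage: 500 random "HR" patches and their decimated, noisy "LR" versions.
    rng = np.random.default_rng(0)
    hr = rng.standard_normal((500, 16 * 16))
    lr = hr.reshape(500, 16, 16)[:, ::4, ::4].reshape(500, -1)
    lr = lr + 0.05 * rng.standard_normal(lr.shape)
    print(f"average NPR: {average_npr(lr, hr, k=20):.3f}")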
Generally, patch-based methods face the dilemma of selecting an appropriate patch size. On the one hand, to capture the global nature of faces and find more meaningful and accurate neighbors, larger patches are preferred. On the other hand, as a consequence of the curse of dimensionality [39], the size of the training set should grow exponentially with the patch size to guarantee valid matches [40] and avoid ghosting effects [22]. Although several approaches have been suggested to alleviate this problem [21], [22], [37], existing patch-based methods still fail to recover person-specific facial features due to the above-mentioned limitations.
A similar argument can also be made in the case of deep learning-based face super-resolution algorithms. In spite of their great ability to add visually pleasing details to LR images, these algorithms often neglect how beneficial the added information is for recognizing the identity of the face [41]. Most of the loss functions considered in the literature are designed to minimize the mean square error (MSE) between the HR image and its reconstructed counterpart, which, although it can sometimes achieve high MSE-oriented quality metrics, in most cases produces blurry and over-smoothed results [42]. For deep learning-based methods to be able to learn identity-aware representations, they need to be trained with a large, well-labeled face dataset, which tends to be very costly [41]. In recent years, several network architectures and loss functions have been suggested to incorporate the identity prior into the learning procedure [41], [42]; however, in many cases the hallucinated face still hardly resembles the person in the test image, as shown in Fig. 3.
In this paper, a novel face hallucination approach is presented in which super-resolution is performed in the subspace spanned by the available training faces, often referred to as the face subspace [43]. To accomplish this, the face subspace prior is incorporated as a regularization term, and through a simple yet highly effective MAP-based formulation, the benefits of global hallucination are achieved while the drawbacks of patch-based methods are avoided. Additionally, although the proposed algorithm can be considered a global reconstruction scheme, the hallucinated faces are artifact-free and robust to ghosting effects, and there are no constraints on the dataset size either. The optimization of the proposed objective function is also addressed through a highly effective, recently introduced closed-form solution [44].
Furthermore, to better deal with the cases where there is a significant misalignment between the input LR face and the training faces, an effective 3D dictionary alignment technique which allows us to perform face super-resolution on unconstrained face images will be suggested. The proposed alignment procedure can also be used in the pipeline of other similar algorithms, and since the majority of current face hallucination approaches only produce satisfactory results when given frontal LR face images, this method can substantially increase their robustness against pose variations in LR inputs.
Therefore, the major contributions of the paper can be summarized as follows:
• We force the reconstructed face to lie in the linear span of the training faces; hence, unlike most existing face hallucination algorithms, the reconstruction is performed in favor of recovering identity-specific face attributes as well as enhancing image quantitative measures.
• We will show that no more than three samples per subject are required for our algorithm to guarantee an identity-preserving result and outperform the existing methods in both tasks of face super-resolution and face recognition.
• By incorporating an efficient 3D alignment procedure, the algorithm extends its superior performance to the case where the LR face pose is significantly different from the ones in the training set. More importantly, the proposed alignment scheme allows us to deal with face hallucination problems in which face images from both the training and testing sets are unconstrained.
• In contrast to patch-based algorithms, the proposed method is barely affected by increasing the level of degradation, and shows outstanding robustness when given very low-resolution (VLR) face images.
• By utilizing an efficient closed-form solution for the proposed objective function, our method is a very fast face hallucination algorithm whose computational time is comparatively less affected by increasing the size of the HR image or the number of samples in the dataset.
• The proposed method achieves superior performance over competitive state-of-the-art algorithms in various frontal, non-frontal, and in-the-wild face hallucination experiments conducted on several well-known face datasets, and surpasses position-patch and deep learning-based methods both quantitatively and qualitatively.
The rest of the paper is organized as follows: in section II, the face subspace prior along with our main MAP-based model is explained. The details of the optimization procedure of the proposed algorithm are then scrutinized, before the introduction of our 3D dictionary alignment pipeline, which is detailed in the same section. The experimental evaluations and comparisons with other competitive algorithms are the subject of section III, followed by the conclusion and possible future works, which are presented in section IV.

II. PROPOSED METHOD
A common assumption in the problem of single image super-resolution is that the low-resolution input is a noisy, blurred, and decimated counterpart of the unknown high-resolution image. Consequently, the following forward degradation model is often taken into consideration:

$$\mathbf{y} = \mathbf{S}\mathbf{H}\mathbf{x} + \boldsymbol{\eta}, \qquad (1)$$

where $\mathbf{y} \in \mathbb{R}^{m_l \times 1}$ is the observed LR image and $\mathbf{x} \in \mathbb{R}^{m_h \times 1}$ is the unknown HR image. In addition, $\mathbf{H} \in \mathbb{R}^{m_h \times m_h}$ represents the blurring filter, $\mathbf{S} \in \mathbb{R}^{m_l \times m_h}$ denotes the decimation operator with scaling factor $d$, hence $m_h = m_l \times d^2$, and $\boldsymbol{\eta} \in \mathbb{R}^{m_l \times 1}$ indicates the additive white Gaussian noise (AWGN) encountered through the image acquisition process. The problem of super-resolution can therefore be written as an optimization problem derived from the maximum likelihood (ML) estimator of the high-resolution image $\mathbf{x}$ as below:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{S}\mathbf{H}\mathbf{x}\|_2^2, \qquad (2)$$

which leads to the following solution:

$$\hat{\mathbf{x}} = \big(\mathbf{H}^T\mathbf{S}^T\mathbf{S}\mathbf{H}\big)^{-1}\mathbf{H}^T\mathbf{S}^T\mathbf{y}. \qquad (3)$$

This solution, which is equivalent to the least-squares solution of the inverse problem of (1), is known to be ill-conditioned due to its sensitivity to small noise and measurement errors [1]. If $\mathbf{H}^T\mathbf{S}^T\mathbf{S}\mathbf{H}$ is singular, the problem is also ill-posed, with an infinite space of possible solutions. Moreover, solving (3) requires inverting the matrix $\mathbf{H}^T\mathbf{S}^T\mathbf{S}\mathbf{H}$ with a computational complexity of the order $O(m_h^3)$, which makes it practically inefficient in many real scenarios [44].
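To make the forward model in (1) concrete, the sketch below degrades a toy HR image with a circular-convolution blur H and a d-fold decimation S, then adds Gaussian noise. The 4 x 4 average kernel happens to match the setting used later in the frontal experiments, while the noise level and image size are illustrative assumptions.

    import numpy as np

    def degrade(x, kernel, d=4, noise_std=0.01, rng=None):
        """Forward model of Eq. (1): blur, decimate by d, and add AWGN."""
        rng = rng or np.random.default_rng(0)
        # H: circular convolution with the PSF, applied in the Fourier domain.
        psf = np.zeros_like(x)
        kh, kw = kernel.shape
        psf[:kh, :kw] = kernel
        psf = np.roll(psf, (-(kh // 2), -(kw // 2)), axis=(0, 1))   # center the PSF
        blurred = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psf)))
        # S: keep every d-th pixel in each direction, so m_h = m_l * d^2.
        decimated = blurred[::d, ::d]
        return decimated + noise_std * rng.standard_normal(decimated.shape)

    # Toy usage with a 4 x 4 average blur and a scaling factor of 4.
    x = np.random.default_rng(1).random((64, 48))   # stand-in for an HR face
    y = degrade(x, np.ones((4, 4)) / 16.0, d=4)
    print(x.shape, "->", y.shape)                   # (64, 48) -> (16, 12)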
To overcome these problems, some additional information is needed to constrain the space of solutions and stabilize the problem. This is often accomplished by introducing a new term to (2), converting the maximum likelihood problem into a maximum a posteriori (MAP) problem:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{S}\mathbf{H}\mathbf{x}\|_2^2 + \xi\,\|\mathbf{R}\mathbf{x}\|_2^2, \qquad (4)$$

which consists of a fidelity term corresponding to the model likelihood and a regularization term representing a priori knowledge about the original image, with the regularization parameter $\xi$ determining the contribution of each term. $\mathbf{R}$ is a matrix which can be defined according to the application. Various priors for natural images have been suggested in the literature, with Tikhonov regularization and Total Variation as the most notable ones [1]. However, to the best of our knowledge, there are few, if any, such priors introduced for the purpose of addressing the single-frame global face hallucination problem, and the existing studies mostly include multi-image [45] or patch-based [46] approaches.

A. Identity-Preserving Face Prior
In the area of pattern recognition, a well-established assumption is that patterns from a specific object class lie on a linear subspace [39]. In regard to the facial recognition problem, it has been verified that face images belonging to a certain subject create a low-dimensional subspace, and this idea has been the cornerstone of various successful face recognition algorithms [43]. Let $\mathbf{x}_{i,1}, \mathbf{x}_{i,2}, \ldots, \mathbf{x}_{i,n_i}$ be the HR training faces of the $i$th subject. On condition that $n_i$ is sufficiently large, the above assumption implies that if the hallucinated face image $\mathbf{x}$ belongs to subject $i$, then, for some scalar coefficients $\alpha_{i,j} \in \mathbb{R}$, $j = 1, 2, \ldots, n_i$, it can be represented as

$$\mathbf{x} = \sum_{j=1}^{n_i} \alpha_{i,j}\, \mathbf{x}_{i,j}. \qquad (5)$$

Therefore, in case the subject of the input LR face is given beforehand, a suitable prior term for (4) would be $\|\mathbf{x} - \sum_{j=1}^{n_i} \alpha_{i,j}\, \mathbf{x}_{i,j}\|_2^2$. To generalize the above term, a dictionary matrix $\mathbf{D}_h$ is defined which contains the whole $n$ training faces of all $c$ subjects, in which, to facilitate the classification task, face images of the same subject are arranged beside each other, that is,

$$\mathbf{D}_h = [\mathbf{x}_{1,1}, \ldots, \mathbf{x}_{1,n_1}, \mathbf{x}_{2,1}, \ldots, \mathbf{x}_{c,n_c}], \qquad \mathbf{x} = \mathbf{D}_h\boldsymbol{\alpha}, \qquad (6)$$

where $\boldsymbol{\alpha}$ is the coefficient vector with non-zero entries for those elements associated with subject $i$, and zero elsewhere. Incorporating this prior term into (4) gives

$$\min_{\mathbf{x},\,\boldsymbol{\alpha}} \|\mathbf{y} - \mathbf{S}\mathbf{H}\mathbf{x}\|_2^2 + \mu\,\|\mathbf{x} - \mathbf{D}_h\boldsymbol{\alpha}\|_2^2. \qquad (7)$$

Provided that the number of subjects is sufficiently large, $\boldsymbol{\alpha}$ is expected to have a sparse representation, since only a few of its elements will have nonzero values. Therefore, a new regularization term $\|\boldsymbol{\alpha}\|_0$ should be included in (7), and since this leads to an NP-hard problem, the $\ell_0$-norm is replaced by the $\ell_1$-norm [43]:

$$\min_{\mathbf{x},\,\boldsymbol{\alpha}} \|\mathbf{y} - \mathbf{S}\mathbf{H}\mathbf{x}\|_2^2 + \mu\,\|\mathbf{x} - \mathbf{D}_h\boldsymbol{\alpha}\|_2^2 + \lambda\,\|\boldsymbol{\alpha}\|_1. \qquad (8)$$

With the above formulation, the hallucination is constrained to the subspace spanned by the subject which gives the sparsest coefficient vector with respect to the input image. Thus, if there are enough samples of the subject to which the input face belongs, the super-resolution is performed for the benefit of recovering true facial attributes. In section III we will show that no more than three samples per subject are required to satisfy this condition.
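The sparse-coding component of (8) can be isolated and solved on its own; as a minimal sketch, the routine below uses ISTA (a standard proximal-gradient method, chosen here only for brevity rather than taken from the paper) to compute coefficients that trade reconstruction error against their l1-norm. The toy dictionary and parameter values are assumptions for illustration.

    import numpy as np

    def ista(D, t, lam=0.1, n_iter=500):
        """Solve  min_a ||t - D a||_2^2 + lam * ||a||_1  with proximal gradient (ISTA)."""
        L = 2.0 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
        a = np.zeros(D.shape[1])
        for _ in range(n_iter):
            z = a - 2.0 * D.T @ (D @ a - t) / L      # gradient step on the quadratic term
            a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        return a

    # Toy usage: a "dictionary" of 30 vectorized faces from 10 subjects (3 columns each);
    # the target is built from subject 4's samples (columns 12-14), so the recovered
    # coefficients should concentrate mostly on those columns.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((200, 30))
    target = D[:, 12:15] @ np.array([0.5, 0.3, 0.2]) + 0.01 * rng.standard_normal(200)
    print(np.round(ista(D, target, lam=0.5), 2))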

B. Optimization
The optimization problem (8) can be divided into two subproblems associated with each of the variables $\mathbf{x}$ and $\boldsymbol{\alpha}$, and solved iteratively for one variable while fixing the other. The following two optimization steps can therefore be defined:

1) Optimizing for x: The intermediate HR estimate of the LR input $\mathbf{y}$ in a given iteration $t$ can be obtained through the following $\ell_2$-regularized optimization problem:

$$\hat{\mathbf{x}}^{t} = \arg\min_{\mathbf{x}} \|\mathbf{y} - \mathbf{S}\mathbf{H}\mathbf{x}\|_2^2 + \mu\,\|\mathbf{x} - \mathbf{D}_h\boldsymbol{\alpha}^{t-1}\|_2^2, \qquad (9)$$

whose closed-form solution is given by

$$\hat{\mathbf{x}}^{t} = \big(\mathbf{H}^H\mathbf{S}^H\mathbf{S}\mathbf{H} + \mu\,\mathbf{I}_{m_h}\big)^{-1}\big(\mathbf{H}^H\mathbf{S}^H\mathbf{y} + \mu\,\mathbf{D}_h\boldsymbol{\alpha}^{t-1}\big). \qquad (10)$$

Unlike the optimization procedure of other similar inverse problems (e.g., image deblurring [47]), which can be solved efficiently in the frequency domain, here the presence of the decimation operator $\mathbf{S}$ in the fidelity term, and the fact that the product matrix $\mathbf{S}\mathbf{H}$ does not have a block-circulant structure and cannot be diagonalized in the frequency domain, make the problem impossible to solve directly using the Fourier transform. However, [44] showed that under certain assumptions on the decimation operator $\mathbf{S}$ and the blurring matrix $\mathbf{H}$, the optimization problem admits a closed-form solution in the frequency domain. More precisely, assuming $\mathbf{H}$ is the matrix representation of the cyclic convolution operator, one can decompose the blurring operator $\mathbf{H}$ and its conjugate transpose $\mathbf{H}^H$ as

$$\mathbf{H} = \mathbf{F}^H\boldsymbol{\Lambda}\mathbf{F}, \qquad \mathbf{H}^H = \mathbf{F}^H\boldsymbol{\Lambda}^H\mathbf{F}, \qquad (11)$$

where $\mathbf{F} \in \mathbb{C}^{m_h \times m_h}$ is the discrete Fourier transform matrix with the property $\mathbf{F}\mathbf{F}^H = \mathbf{F}^H\mathbf{F} = \mathbf{I}_{m_h}$, and $\boldsymbol{\Lambda} \in \mathbb{C}^{m_h \times m_h}$ is a diagonal matrix whose elements are the Fourier transform of the zero-padded PSF, that is, the first column of the blurring matrix $\mathbf{H}$. Additionally, $\mathbf{S}$ is assumed to be a downsampling operator whose conjugate transpose $\mathbf{S}^H$ interpolates the decimated image with zeros, and satisfies the relationship $\mathbf{S}\mathbf{S}^H = \mathbf{I}_{m_l}$. By considering $\underline{\mathbf{S}} \triangleq \mathbf{S}^H\mathbf{S}$, which operates as an element-wise multiplication by an $m_h \times m_h$ matrix with ones at the sampled positions and zeros elsewhere, [48] showed that

$$\mathbf{F}\,\underline{\mathbf{S}}\,\mathbf{F}^H = \frac{1}{d^2}\,\mathbf{J}_{d^2} \otimes \mathbf{I}_{m_l}, \qquad (12)$$

in which $\otimes$ denotes the Kronecker product, $\mathbf{J}_{d^2} \in \mathbb{R}^{d^2 \times d^2}$ is a matrix of ones, and $\mathbf{I}_{m_l} \in \mathbb{R}^{m_l \times m_l}$ is an identity matrix. Bearing in mind (12) as well as the previously mentioned assumptions, one can rewrite the analytical solution (10) as

$$\hat{\mathbf{x}}^{t} = \big(\mathbf{F}^H\boldsymbol{\Lambda}^H\mathbf{F}\,\underline{\mathbf{S}}\,\mathbf{F}^H\boldsymbol{\Lambda}\mathbf{F} + \mu\,\mathbf{I}_{m_h}\big)^{-1}\big(\mathbf{H}^H\mathbf{S}^H\mathbf{y} + \mu\,\mathbf{D}_h\boldsymbol{\alpha}^{t-1}\big). \qquad (13)$$
This can be further simplified by incorporating the Woodbury matrix identity [49] into (13), obtaining the following closed-form solution:

$$\hat{\mathbf{x}}^{t} = \frac{1}{\mu}\,\mathbf{r} - \frac{1}{\mu}\,\mathbf{H}^H\mathbf{S}^H\big(\mu\,\mathbf{I}_{m_l} + \mathbf{S}\mathbf{H}\mathbf{H}^H\mathbf{S}^H\big)^{-1}\mathbf{S}\mathbf{H}\,\mathbf{r}, \qquad (14)$$

where

$$\mathbf{r} = \mathbf{H}^H\mathbf{S}^H\mathbf{y} + \mu\,\mathbf{D}_h\boldsymbol{\alpha}^{t-1},$$

and, thanks to (11) and (12), the $m_l \times m_l$ matrix $\mathbf{S}\mathbf{H}\mathbf{H}^H\mathbf{S}^H$ can be computed efficiently in the frequency domain.

2) Optimizing for α: The coefficient vector $\boldsymbol{\alpha}$ is updated using the following well-known $\ell_1$-minimization problem:

$$\hat{\boldsymbol{\alpha}}^{t} = \arg\min_{\boldsymbol{\alpha}} \|\hat{\mathbf{x}}^{t} - \mathbf{D}_h\boldsymbol{\alpha}\|_2^2 + \lambda\,\|\boldsymbol{\alpha}\|_1, \qquad (15)$$

which can be efficiently solved using various $\ell_1$-minimization algorithms. It should also be noted that the initial $\boldsymbol{\alpha}$ vector is obtained by solving the above minimization problem for the input face $\mathbf{y}$ and the low-resolution training dictionary $\mathbf{D}_l$, i.e., $\boldsymbol{\alpha}^{0} = \arg\min_{\boldsymbol{\alpha}} \|\mathbf{y} - \mathbf{D}_l\boldsymbol{\alpha}\|_2^2 + \lambda\,\|\boldsymbol{\alpha}\|_1$. Algorithm 1 summarizes the entire procedure of the proposed face hallucination approach.
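The alternating scheme of Algorithm 1 can be prototyped in a few dozen lines. The sketch below keeps the structure (x-step of (9), α-step of (15), with α0 computed from the LR dictionary) but deliberately swaps in simpler solvers: the x-step uses conjugate gradients on the normal equations instead of the frequency-domain closed form of [44], and the α-step uses plain ISTA. All operator shapes, parameter values, and helper names are assumptions for illustration, not the authors' implementation.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def make_operators(hr_shape, kernel, d):
        """Return A(x) = S H x and At(y) = H^T S^T y for a circular blur H and decimation S."""
        psf = np.zeros(hr_shape)
        kh, kw = kernel.shape
        psf[:kh, :kw] = kernel
        psf = np.roll(psf, (-(kh // 2), -(kw // 2)), axis=(0, 1))
        K = np.fft.fft2(psf)                          # Fourier eigenvalues of H

        def A(x):
            blurred = np.real(np.fft.ifft2(np.fft.fft2(x.reshape(hr_shape)) * K))
            return blurred[::d, ::d].ravel()

        def At(y):
            up = np.zeros(hr_shape)                   # S^T interpolates with zeros
            up[::d, ::d] = y.reshape(hr_shape[0] // d, hr_shape[1] // d)
            return np.real(np.fft.ifft2(np.fft.fft2(up) * np.conj(K))).ravel()

        return A, At

    def ista(D, t, lam, n_iter=300):
        """min_a ||t - D a||_2^2 + lam ||a||_1 via proximal gradient."""
        L = 2.0 * np.linalg.norm(D, 2) ** 2
        a = np.zeros(D.shape[1])
        for _ in range(n_iter):
            z = a - 2.0 * D.T @ (D @ a - t) / L
            a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        return a

    def hallucinate(y, D_h, D_l, hr_shape, kernel, d, mu=1e-3, lam=0.1, T=10):
        """Alternating minimization of (8): x-step of (9) by CG, alpha-step of (15) by ISTA."""
        A, At = make_operators(hr_shape, kernel, d)
        m_h = D_h.shape[0]
        alpha = ista(D_l, y.ravel(), lam)             # alpha^0 from the LR dictionary
        x = D_h @ alpha
        M = LinearOperator((m_h, m_h), matvec=lambda v: At(A(v)) + mu * v)
        for _ in range(T):
            b = At(y.ravel()) + mu * (D_h @ alpha)    # right-hand side of the normal equations
            x, _ = cg(M, b, x0=x, maxiter=100)
            alpha = ista(D_h, x, lam)                 # sparse-code the HR estimate over D_h
        return x.reshape(hr_shape), alpha

    # Toy usage: a random dictionary of 12 "training faces"; the LR input comes from a face
    # lying in their span, so the coefficients should concentrate on the third and fourth columns.
    rng = np.random.default_rng(0)
    hr_shape, d = (32, 24), 4
    kernel = np.ones((4, 4)) / 16.0
    D_h = rng.standard_normal((hr_shape[0] * hr_shape[1], 12))
    A, _ = make_operators(hr_shape, kernel, d)
    D_l = np.stack([A(D_h[:, j]) for j in range(D_h.shape[1])], axis=1)
    y = A(D_h @ np.array([0.0, 0.0, 0.6, 0.4] + [0.0] * 8))
    x_hat, alpha_hat = hallucinate(y, D_h, D_l, hr_shape, kernel, d)
    print(x_hat.shape, np.round(alpha_hat, 2))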

C. 3D Dictionary Alignment
As face images of different poses are distributed on a highly nonlinear manifold [50], the majority of dictionary-based face hallucination algorithms fail to achieve satisfactory results when given non-frontal input faces. Inspired by recent advances in 3D face reconstruction and alignment studies, in this section an efficient dictionary alignment procedure will be presented, by which the training faces are registered with respect to the LR face pose before being used in the hallucination process. This additional step gives significant flexibility to the main algorithm and boosts its performance in reconstructing LR faces in the presence of high pose variations, even when the training faces are also non-frontal and unconstrained. The proposed alignment procedure, which is visually presented in Fig. 4, contains the following steps:

Fig. 4. Flowchart of the 3D dictionary alignment framework. The 3D facial landmarks are first extracted from the upscaled version of the LR input face, before being used to align it with respect to a set of landmark points defined as reference and obtain the associated transformation matrix. Having previously calculated the transformations between the dictionary samples and the reference landmarks, the transformation matrices between the training samples and the input face image can therefore be efficiently obtained. After applying these transformation matrices to their corresponding training face objects and performing 3D rendering, the aligned HR and LR dictionaries are obtained, which will later be used along with the masked LR input in the process of face hallucination.

1) Training Faces 3D Reconstruction: In order to perform 3D alignment, we first use [51] to generate the 3D geometries associated with each of the training faces, where each reconstructed object is the 3D mesh associated with the ith training face, with V_i ∈ IR^(n_v×3), T_i ∈ IN^(n_t×3), and C_i ∈ IR^(n_v×3) as its vertices, triangles, and color attributes, and n_v and n_t denoting the number of vertices and triangles, respectively. As might be expected, the whole process of face reconstruction is performed offline, hence it does not affect the runtime of the main hallucination process.
2) LR Face Landmark Detection: The most crucial part of the alignment pipeline is to locate facial landmarks on the LR input face. Since the proposed hallucination algorithm will accept degraded facial images with high variations in pose, the landmark detection method is expected to be highly robust and perform well in uncontrolled conditions. Fortunately, recent advances in deep neural networks have allowed researchers to propose powerful facial landmark detectors with considerable speed, accuracy, and stability. According to our experiments, most of the current state-of-the-art approaches fully satisfy our desired level of robustness, and therefore are eligible to be used in the alignment procedure. Fig. 5 illustrates the average normalized mean error (NME) between the landmarks of a set of upscaled LR face images detected by [52] and their corresponding ground truth points, based on different levels of degradation. According to the figure, in most cases the difference between the landmarks detected in the LR face images and their ground truth points is fairly negligible, even when face images as small as 15 × 15 pixels with a 5 × 5 blur kernel are considered. Since conventional degradation settings are often far more lenient, we can rest assured that the detected landmarks p_y will not affect the main alignment procedure.

3) 3D Face Alignment: The detected landmarks p_y are used to align the input face with respect to a set of reference landmark points and obtain the associated transformation matrix. Since the transformations between the dictionary samples and the reference landmarks are calculated beforehand, the transformation matrices between the training samples and the input face can be obtained efficiently. The transformation of the 3D face objects is also efficiently implemented by applying each transformation to the corresponding object vertices.

4) 3D Face Rendering: Finally, the 3D face objects are converted to 2D images using the available object rendering algorithms. This step is the most time-consuming part of the whole process, and extra care must be taken to preserve the details and information of the face object.
The above steps are performed on each of the training samples to obtain the aligned HR and LR dictionaries D_h^a and D_l^a. To reduce the error caused by non-face regions in the process of hallucination, we also remove the excess pixels in the LR input face using a mask extracted from the average of the registered LR faces to obtain y_m, which will later be used along with D_h^a and D_l^a in Algorithm 1 to perform pose-robust face hallucination. As will be shown in section III, the proposed 3D dictionary alignment significantly improves the face hallucination performance even in the presence of high pose variations. Moreover, the whole process takes roughly 0.04 seconds for a 20 × 20 training face image and its corresponding object with n_v = 43,867 and n_t = 86,906, which is indeed a reasonable time considering the benefits it offers. A summary of the 3D dictionary alignment procedure is presented in Algorithm 2 as pseudo-code:

    7:  x_i^a ← render(V_i^a, T_i, C_i)
    8:  Add x_i^a to the HR dictionary D_h^a
    9:  end for
    10: D_l^a ← degrade(D_h^a)
    11: Obtain y_m by applying the mask extracted from the average of the LR aligned faces to y.
    Output: Aligned dictionaries D_h^a and D_l^a, masked LR input y_m.
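The geometric core of steps 3 and 4 is estimating, from two sets of corresponding 3D landmarks, a similarity transform that registers one face to the other and then applying it to the mesh vertices before rendering. The sketch below uses the standard Umeyama/Procrustes solution for that estimation; the landmark detector, the renderer, and all the toy data are stand-ins for the components built on [51] and [52], not the paper's actual pipeline.

    import numpy as np

    def umeyama(src, dst):
        """Least-squares similarity transform (s, R, t) such that  dst ≈ s * R @ src + t.

        src, dst: (n, 3) corresponding 3D landmark sets (e.g., 68 facial landmarks).
        """
        mu_s, mu_d = src.mean(0), dst.mean(0)
        src_c, dst_c = src - mu_s, dst - mu_d
        cov = dst_c.T @ src_c / len(src)
        U, S, Vt = np.linalg.svd(cov)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # avoid reflections
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
        t = mu_d - s * R @ mu_s
        return s, R, t

    def transform_vertices(V, s, R, t):
        """Apply the similarity transform to an (n_v, 3) array of mesh vertices."""
        return s * V @ R.T + t

    # Toy usage: register the landmarks of a "training" face to those of the LR input,
    # then move the whole mesh with the same transform (rendering is omitted here).
    rng = np.random.default_rng(0)
    lms_train = rng.random((68, 3))
    R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(R_true) < 0:
        R_true[:, 0] *= -1                       # make it a proper rotation
    lms_input = 0.8 * lms_train @ R_true.T + np.array([0.1, -0.2, 0.05])
    s, R, t = umeyama(lms_train, lms_input)
    V_aligned = transform_vertices(rng.random((1000, 3)), s, R, t)
    print(np.allclose(transform_vertices(lms_train, s, R, t), lms_input, atol=1e-6))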

III. EXPERIMENTAL RESULTS
In this section, several experiments have been carried out to evaluate the performance of the proposed algorithm and demonstrate its efficiency and applicability. For this purpose, a number of recently published state-of-the-art face hallucination methods have been selected, with their parameters tuned so that they produce their optimal results. The experiments on frontal face hallucination are performed on the FERET [53], the CMU Multi-PIE [54], and the AR [55] public face datasets, whereas the pose-robust face super-resolution algorithm is tested on the CMU Multi-PIE and the LFW [56] databases.

A. Experiments on Frontal Faces
The proposed method is first assessed on faces taken in controlled conditions. Unless otherwise specified, all the LR face images are obtained after applying downsampling and blurring (with a 4 × 4 average smoothing filter) to their HR counterparts. Samples from all databases were aligned based on the location of the eye corners. For all the experiments, one random image per subject was selected as the test sample and the remaining ones were used in the training phase. The regularization parameters µ and λ are chosen to be 10^-8 and 2700, respectively, whereas the number of iterations T is set to 30. In the patch-based approaches, the patch size and the overlapping parameters are chosen according to the LR and HR image sizes. In [6], the eigenvalue accumulation contribution rate is set to 0.99. [15] is modified so that it includes the blur information of the LR inputs. The implementations of [20] and [21] were slightly changed to prevent errors in recovering very low-resolution face images.

1) The FERET dataset: We first evaluate the performance of the proposed method on the frontal facial images from the FERET database. We only select subjects with five or more samples in the database, which leads to a subset of 519 images from 70 individuals, each with an unequal number of samples. The LR input faces are of size 15 × 10, and the scaling factors are set to 2, 4, and 8. The PSNR and SSIM performance of the different algorithms is summarized in Table I. In all three experiments and with different scaling factors, the proposed method outperforms the second-best algorithm by 1.18 dB, 1.00 dB, and 1.00 dB in PSNR and 0.0095, 0.0199, and 0.0353 in SSIM, respectively. Fig. 7 displays the performance of the algorithms on each of the test samples when the scaling factor is 4, which demonstrates the dominance of the proposed method over the competitive ones on almost all the available test images. Fig. 6 (top three rows) also qualitatively compares the methods on three testing images with different scaling factors. It is observable that the competitive methods were unable to recover facial details such as eyeglasses and wrinkles. In the presence of facial expressions, [6] produced undesirable artifacts in the recovered face images. The position-patch based methods also produced blurry and oversmoothed images, particularly around the mouth regions. The results of LM-CSS [15] appear to be more similar to the original HR faces than those of the remaining approaches; however, the noise and artifacts added to the resultant faces have made this method quantitatively unsatisfactory. In general, the results obtained by the proposed algorithm are obviously more detailed, clear, and artifact-free compared to those produced by the others.
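For completeness, the quantitative scores reported throughout this section are standard PSNR and SSIM values; a minimal evaluation sketch using scikit-image (a tooling choice assumed here, not necessarily the one used by the authors) is:

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate(hr, sr):
        """PSNR / SSIM between a ground-truth HR face and a hallucinated result in [0, 1]."""
        return (peak_signal_noise_ratio(hr, sr, data_range=1.0),
                structural_similarity(hr, sr, data_range=1.0))

    # Toy usage with a synthetic image and a slightly perturbed "reconstruction".
    rng = np.random.default_rng(0)
    hr = rng.random((60, 45))
    sr = np.clip(hr + 0.02 * rng.standard_normal(hr.shape), 0.0, 1.0)
    print("PSNR %.2f dB, SSIM %.4f" % evaluate(hr, sr))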
2) The Multi-PIE dataset: The quantitative and qualitative results of the different methods on this dataset can be seen in Table II and Fig. 6. One of the competing methods does comparatively well in recovering some facial attributes (e.g., eyeglasses); however, its results still suffer from blurriness and lack of details.
3) The AR dataset (VLR Face Hallucination): To demonstrate the efficiency of the proposed algorithm in hallucinating very low-resolution face images [2], extensive experiments were conducted on the AR face database. From the cropped version of the database [59], we discard the images with occlusion and select a subset of 1400 images from 100 subjects, each with 14 samples. The low-resolution input faces are chosen to be 5 × 4, making it an extreme case of the VLR face super-resolution task with only 20 LR pixels available. The dataset also contains facial images with significant expression variations, which make the problem even more challenging. We upscale the existing LR faces by factors of 4, 8, and 16. As can be seen in Table III, our approach shows its capability in recovering very low-resolution inputs in all three experiments by improving the PSNR by 1.18 dB, 2.37 dB, and 1.5 dB, and the SSIM by 0.0186, 0.1235, and 0.1166, respectively. More importantly, as the visual comparison in the last three rows of Fig. 6 suggests, the performance of the position-patch based methods drops dramatically when given VLR inputs, with their results being blurry and mostly irrelevant. Conversely, despite being a classic algorithm, [6] manages to outperform several recently introduced patch-based techniques thanks to its global reconstruction approach. All in all, the results achieved by the proposed algorithm bear much more visual resemblance to the ground truth faces compared to those of the others.
B. Parameters Analysis

1) Regularization Parameters: There are two regularization parameters in the proposed formulation, namely µ and λ. The first determines the closeness of the reconstructed face to the subspace spanned by the available faces, whereas the second decides how strictly this face subspace should be estimated. In this subsection, we perform experiments on a randomly selected face image and tune each parameter separately while keeping the other fixed. By fixing µ and changing λ over the range [0, 10^4] with an interval of 500, as plotted in Fig. 8, one can notice that the best performance is achieved when λ is roughly set to 10^3. We next fix the value of λ and choose 20 different values for µ from the range [0, 20]. The variations of PSNR and SSIM (Fig. 9) suggest that values closer to zero are more desirable for this parameter. When µ is set to zero, however, the hallucinated face image will be incalculable (NaN) due to the absence of the regularization term in the optimization function. We therefore set λ = 2700 and µ = 10^-8 in our experiments.
2) Number of Training Samples per Subject: The proposed algorithm utilizes the face subspace as prior knowledge, thus it is desirable to see how the quality of the subspace spanned by the training face images affects its performance. In this regard, the diversity among the faces (i.e., the availability of faces with different variations), which is related to the number of training faces per subject, is expected to be decisive. To verify this, we create a test set by randomly selecting one sample from each subject of the AR dataset, and perform a series of experiments on the selected set, each time with a different number of training samples per subject. Fig. 10 shows the quantitative metrics obtained by the different methods in each experiment. It is observable that when there is only one sample per subject, TLcR-RL [22] and TRNR [58] achieve better performance compared to the proposed method. This can be justified by the fact that in this case, no subject-specific face subspace is formed, and subsequently the prior term used in the formulation is of no benefit. However, the performance of the proposed method improves significantly when another training sample is added for each subject, with only 0.1 dB lower PSNR and 0.0246 higher SSIM compared to the leading method. When there are three samples per subject, the proposed method clearly outperforms the remaining algorithms, and with increasing numbers of samples it continues to extend its dominance, whereas the performance of the other methods remains relatively unchanged. Fig. 11 displays the influence of the number of samples per subject on the visual appearance of a face image reconstructed by the proposed method (bottom) and [22] (top). As the number of samples per subject increases, more details appear in the hallucinated face, and the undesired effects (for example, on the forehead region) diminish. With four samples per subject, a clean face image close to the ground truth is obtained, whereas the true facial expression is reconstructed when seven samples per subject are used. The figure also reveals that the results generated by TLcR-RL are less affected by the addition of extra samples to the training set, and are still oversmoothed and blurry even when 13 samples per subject are available. To summarize, the experiment demonstrates that the proposed algorithm requires only two to three images per subject to outperform the competitive methods both quantitatively and qualitatively.

3) Number of Iterations:
To investigate the influence of the face subspace prior, it is worthwhile to compare the average PSNR and SSIM values across different iterations. As shown in Fig. 12, the two quantitative metrics improve dramatically in the early iterations, reaching 32.92 dB in PSNR and 0.9622 in SSIM after a single iteration (improvements of 2.65 dB and 0.0278, respectively) and surpassing 33.64 dB in PSNR and 0.9670 in SSIM after the first five iterations (improvements of 3.36 dB and 0.0326, respectively) with respect to the initial state, i.e., x̂ ≈ D_h α^0. The growth of these two indicators becomes stable at roughly iteration 30, which is therefore taken as the number of iterations in our experiments.

4) Blur Kernel Size:
We further test the performance of our approach against various levels of degradation to measure how robust the proposed algorithm is when less facial information is available. We set the LR input size and the upsampling factor to 12 × 9 and 4, respectively, and perform several experiments with average blur kernels of different sizes. The quantitative results of the different approaches in this experiment are presented in Fig. 13. Despite the significant decline in the performance of the other methods, our algorithm is barely affected by the increase in blur kernel size, and even in extreme cases, its quantitative measures remain more or less the same. As discussed in section I, as a result of selecting inappropriate neighboring patches, the performance of the patch-based face hallucination methods is prone to be seriously affected when LR input images become more degraded.

C. Influence of the Face Subspace Prior
The effectiveness of the face subspace prior in the proposed approach is further evaluated by making more quantitative and qualitative comparisons between the final reconstructed face obtained by our algorithm and its initial state, i.e., the case when only the concept of neighbor embedding is employed and the hallucinated face equals D_h α^0. Fig. 14 displays the quantitative indicators for the test faces in the FERET database. The linear embedding-based approach achieves an average of 30.29 dB in PSNR and 0.9344 in SSIM, which are lower than those of the proposed algorithm by 3.42 dB and 0.0323, respectively. This substantial difference between the initial and final states of the algorithm is the result of the improvements made by the face subspace prior, which is also clearly reflected in the facial details recovered by the two approaches. According to the examples shown in Fig. 15, in the initial phase of the algorithm, only the basic structure of each face is recovered, and, in contrast to the final hallucination results, fine details such as eyeglasses, nose shape, eyebrow direction, and face pose are all ignored. As a consequence of considering the faces merely as linear combinations of the training samples, these images contain blurry regions and in some cases are by no means close to their ground truth counterparts.

D. Face Recognition Accuracy
To clarify the advantages of our algorithm in recovering person-specific facial features, we conduct experiments on the task of low-resolution face recognition. The evaluations are performed on the Multi-PIE and the AR face datasets with 130 and 100 subjects, respectively, such that each subject has only two samples in the training set. The size of the LR images in the Multi-PIE subset is 8 × 6, whereas the inputs in the AR subset are of size 10 × 8, and the scaling factor in both experiments is 4. LR faces are obtained by applying a 7 × 7 Gaussian filter with σ = 2 to the HR images before downsampling them to the desired sizes. The resultant images of all methods are classified using the SRC classifier [43]. As for the proposed method, the final coefficient vector α̂ is used in the classification. Table IV reports the recognition accuracy achieved by the different approaches in both experiments, whereas Fig. 16 compares their cumulative recognition rates in the first five ranks. The recognition rates of the proposed method on the Multi-PIE and the AR datasets are 92.25% and 76%, outperforming the others by 8.53% and 3%, respectively. This clearly illustrates the effectiveness of the recognition-oriented aspect of our face hallucination algorithm, even when each subject has only two samples in the training set.
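The classification rule follows the SRC idea of [43]: the probe is assigned to the class whose training columns best reconstruct it, measured by the class-wise residual computed from the coefficient vector. A compact sketch is given below, with a ridge fit standing in for the sparse coding step and purely hypothetical toy data.

    import numpy as np

    def src_classify(x, D, labels, alpha):
        """SRC-style decision: the class whose columns of D best reconstruct x.

        x: (m,) probe (e.g., the hallucinated face); D: (m, n) training dictionary;
        labels: (n,) subject id of each column; alpha: (n,) coefficients of x over D.
        """
        classes = np.unique(labels)
        residuals = [np.linalg.norm(x - D @ np.where(labels == c, alpha, 0.0))
                     for c in classes]                       # keep only class-c coefficients
        return classes[int(np.argmin(residuals))]

    # Toy usage: 5 subjects with 2 samples each; the probe is built from subject 3's
    # columns (indices 6 and 7), so the decision should come back as 3.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((120, 10))
    labels = np.repeat(np.arange(5), 2)
    x = D[:, 6:8] @ np.array([0.7, 0.3])
    alpha = np.linalg.solve(D.T @ D + 1e-3 * np.eye(10), D.T @ x)   # stand-in for alpha-hat
    print(src_classify(x, D, labels, alpha))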

E. Comparison with Deep Learning-Based Methods
We further evaluate our algorithm by comparing its performance on color images with several popular and/or successful deep learning-based approaches. We consider SRCNN [60] as a baseline algorithm along with DCSCN [61], DBPN [62], ESRGAN [63], SPSR [64], SRGAN [65], Real-ESRGAN [66], SwinIR [67], and PULSE [35], of which the latter is a recently developed face hallucination technique. The CNN-based methods were trained using our data, whereas the remaining networks were used with their available pre-trained models. The quantitative results are reported in Table V, and some hallucinated faces generated by the different methods are also displayed in Fig. 17. It appears that SRCNN manages to enhance parts of the face images, but leaves undesired artifacts around the boundary regions. DCSCN and DBPN do only slightly better than bicubic interpolation, whereas the results produced by the GAN-based algorithms are deformed and unclear. PULSE generates noise-free faces which hardly resemble their true identities. In general, despite achieving satisfactory results at higher resolutions, deep learning-based super-resolution algorithms seem to fail to enhance very low-resolution face images, with their results being vague and distorted.

F. Computational Complexity
The optimization procedure of the proposed algorithm consists of two phases. As discussed in section II, the ℓ2-ℓ2 minimization problem can be solved through a closed-form solution, hence the ℓ1-minimization problem is basically the most time-consuming part of the reconstruction process, and it too can be solved efficiently using various ℓ1-optimization approaches. Overall, the proposed method is a very fast algorithm, requiring no more than 30 iterations to converge. To compare the computational time of our algorithm with those of the other methods and to evaluate the effects of different parameters on its runtime, we perform experiments on the AR and the FERET databases with an LR input size of 10 × 8 and a scaling factor of 4, using MATLAB 2020b on a computer with 6 GB of memory and a 1.8 GHz CPU. Table VI presents the average runtime of each method on the AR database when the training set size is 100. One can notice that the proposed algorithm performs face hallucination at a reasonable computational cost compared to the competitive methods. We also measure the runtime of each algorithm with respect to the dataset size and the scaling factor. The results, displayed in Fig. 18, reveal that the computational time of our algorithm is not much affected by either parameter, whereas those of TLcR-RL [22] and SSR [57] grow exponentially, making them practically inefficient in the case of large images or big datasets. It should also be noted that reducing the number of iterations, which was previously shown to have only a slight effect on the quality of the hallucinated face image after the first few iterations, would decrease the computational cost of the proposed algorithm even further. One can also think of applying collaborative representation [68] instead of sparse representation, which leads to a super-fast face hallucination procedure with both subproblems having closed-form solutions.
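As a pointer to that last remark, replacing the ℓ1 penalty on α with an ℓ2 one turns the α-step into ridge regression, whose solution is closed form; a minimal sketch of this collaborative-representation variant (with toy data and an assumed regularization weight) is:

    import numpy as np

    def cr_coefficients(D, x, lam=1e-2):
        """Collaborative representation:  argmin_a ||x - D a||_2^2 + lam * ||a||_2^2."""
        n = D.shape[1]
        return np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ x)

    # With this substitution both sub-problems of the alternating scheme have closed-form
    # solutions, and (D^T D + lam I)^{-1} D^T can even be precomputed once per dictionary.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((768, 30))
    x = D @ rng.standard_normal(30)                 # a face lying exactly in the span of D
    print(np.allclose(D @ cr_coefficients(D, x), x, atol=1e-3))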

G. Pose-Robust Face Hallucination
In order to evaluate the efficiency of the proposed 3D dictionary alignment procedure introduced in section II, extensive experiments were conducted on the Multi-PIE and the LFW face databases. Since the methods used in the frontal face experiments are unable to perform pose-robust face hallucination, we integrate them with our 3D dictionary alignment scheme, hence they use the same aligned dictionary to perform face hallucination. Moreover, the parameters and settings associated with all the algorithms remain the same as in the previous experiments. We use [52] to extract 3D landmarks from both the LR and HR faces, and [51] to perform 3D face reconstruction. To apply face alignment, all 68 landmark points were taken into consideration.
1) The Multi-PIE Dataset: For each of the subjects in the Multi-PIE face database, we consider illumination condition 10 and select the frontal face images (camera 05-1) taken in the neutral position from all sessions as the training samples (thus, there are five samples per subject in the training set), and randomly select one sample from their images in the same settings but under different pose variations (cameras 04-1, 05-…) as the test samples. The results presented in Fig. 19 also verify the effectiveness of our proposed dictionary alignment technique, which has clearly enhanced the performance of all the face super-resolution algorithms. Once again, by investigating the hallucinated faces, one can easily notice more recovered facial details in our results than in those of the other approaches.

2) The LFW Dataset (Face Hallucination in the Wild): The performance of the proposed method is further evaluated in a real-world scenario, where samples in both the training and test sets are taken in uncontrolled conditions. Among the face images in the LFW database, we consider those belonging to subjects with 10 to 14 samples in the database, which forms a subset containing 73 subjects and 894 samples. Since the images in the LFW dataset are taken in the wild, unlike in the previous experiment, training samples may contain various degradations such as occlusion and illumination variations, which makes the problem even more challenging. Although the proposed alignment procedure is expected to handle the difficulties related to pose variations well, in some infrequent cases there is a considerable difference between the poses of a training sample and the test face, making the registration unfavorable for the task of face hallucination.
This often occurs when the two faces are significantly rotated in different directions. One intuitive way to detect these cases is to use the image histogram, as an erroneous alignment causes considerable changes in the intensity values of the images (Fig. 20). We therefore define a threshold value θ, and accept an alignment if

‖hist(I) − hist(I_a)‖ < θ,    (16)

where hist returns the histogram of the input image, and I and I_a are the original and transformed faces, respectively. We empirically set θ = 100. Also, the size of the LR faces is 20 × 20 (similar to the previous experiment, the actual face regions are considerably smaller), and the scaling factor is 4. The PSNR and SSIM scores obtained by the different methods (Table VII) show differences of 2.13 dB and 0.0392 in favor of the proposed algorithm, respectively. Also, from Fig. 21, one can observe that the proposed method has added much more information to the reconstructed faces compared to the competitive algorithms. This not only once again highlights the superiority of the proposed face hallucination algorithm over the other approaches, but also shows how efficient our 3D dictionary alignment method is, even when both the training and testing images are taken in uncontrolled conditions.
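A direct reading of the acceptance test in (16) is sketched below; the number of bins and the use of an L1 distance between the two histograms are assumptions, since the text only fixes the threshold θ = 100.

    import numpy as np

    def alignment_accepted(I, I_a, theta=100.0, bins=256):
        """Accept the aligned face I_a if its intensity histogram stays close to I's (Eq. 16)."""
        h = np.histogram(I, bins=bins, range=(0, 256))[0]
        h_a = np.histogram(I_a, bins=bins, range=(0, 256))[0]
        return np.abs(h - h_a).sum() < theta        # assumed: L1 distance between histograms

    # Toy usage: an alignment that barely alters the intensities is accepted,
    # while a grossly wrong one is rejected.
    rng = np.random.default_rng(0)
    face = rng.integers(0, 256, size=(40, 40))
    mild = face.copy()
    mild.flat[:10] = np.clip(mild.flat[:10] + 5, 0, 255)   # only a few pixels change
    print(alignment_accepted(face, mild), alignment_accepted(face, np.zeros_like(face)))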

IV. CONCLUSION
This paper presented a fast and robust face super-resolution algorithm which, unlike most of the existing methods, attempts to recover the true individual-specific attributes of the low-resolution input face. This is achieved by introducing a MAP estimator which encourages the reconstructed image to lie near the subspace spanned by the training samples of the subject to which the input face belongs. Our experiments indicated that no more than three samples per subject are required to form such a subspace and obtain clear, detailed, and artifact-free outputs, even on datasets with high variations in facial expressions. We further extended our method by proposing a 3D dictionary alignment framework which proved to be vastly effective in real-world scenarios. Our evaluations revealed that, owing to its efficient closed-form optimization procedure, the proposed algorithm is a very fast method which shows impressive robustness against face degradations. The identity-preserving aspect of our algorithm also demonstrated its significance in the task of low-resolution face recognition, even when there were only two samples of each subject available. The comparison of the proposed method with competitive algorithms, including both position-patch and deep learning-based methods, illustrated its superiority over the state-of-the-art face hallucination techniques in terms
of both quantitative measurements and visual impressions. One possible extension of our proposed algorithm is to employ more robust sparse representation-based regularization terms (e.g., [69] and [70]) to enhance its flexibility and robustness.
Our future work will also focus on introducing the blurring operator to the main objective function as a variable and designing a blind face hallucination scheme.