TVA-GAN: Attention Guided Generative Adversarial Network For Thermal To Visible Face Transformations

Generative Adversarial Networks (GANs) play a vital role in recent machine learning methods for realistic image generation and image translation. A GAN generates novel samples that look indistinguishable from real images. Image translation using a GAN is treated as an unsupervised learning task. In this paper, we translate thermal images into visible images. Thermal-to-visible image translation is challenging because thermal images lack accurate semantic information and have smooth textures. A thermal image has only a single channel, which holds the image's luminance and carries few features. We develop a new Cyclic Attention-based Generative Adversarial Network for Thermal to Visible Face transformation (TVA-GAN) by incorporating a new attention-based network. We use attention guidance with a recurrent block through an Inception module to narrow the learning space towards the optimum solution. TVA-GAN is tested and evaluated for thermal to visible face synthesis on the WHU-IIP and Tufts Face Thermal2RGB datasets. The results of the proposed TVA-GAN are promising for face synthesis compared with state-of-the-art GAN methods.


I. INTRODUCTION
Generating visible images from thermal images is considerably more challenging than generating them from Infrared or Near-Infrared images. Near-infrared (NIR) images are captured close to red-light wavelengths, between 700 nm and 1400 nm. NIR images are very close to human vision but discard the color information. As a result, most objects in NIR images look similar to their gray-scale counterparts. Most NIR cameras that use IR LEDs for illumination at night are limited in range, usually to no more than 500 m. Thermal images, on the other hand, are far-infrared images with wide-area emission detection. Thermal Infrared (TIR) cameras are sensitive to the heat radiation produced by a body. Heat is the electromagnetic radiation emitted by a body above absolute zero temperature, and it spans different wavelengths. NIR and TIR images capture non-overlapping parts of the electromagnetic spectrum. Moreover, thermal images are very different from near-infrared images because they capture only a particular range of temperatures. Thus, thermal (TIR) images are noisier than NIR images, and it is more challenging to generate the actual visible-domain images from the corresponding thermal-domain images.

N.K. Yadav and S.K. Singh are with the Computer Vision and Biometrics Laboratory at Indian Institute of Information Technology, Allahabad-211015, India (email: nandkmyadav@gmail.com, sk.singh@iiita.ac.in).
In the current deep learning landscape [1], image generation serves various computer vision applications, including image restoration [2], image synthesis [3], face synthesis [4] [5], and many more. We treat visible face synthesis from a thermal face image as an image-to-image translation problem because it transforms images between domains. The image-to-image translation [6] method is inspired by the language translation problem posed by Mark Twain [7]. There, the text is first translated from French to English and then back to French, and the final result is compared with the source text to judge the quality of the translation. Image-to-image translation is handled effectively by Generative Adversarial Networks (GANs), which train a model by balancing false results against true results. With the modern influence of deep learning, different GAN methods [8], [9], [10], [11] have been proposed for image-to-image translation problems. GAN-based models have also been used for other applications such as image segmentation [12], image colorization [13], image super-resolution [14], image style transfer [15], and face photo-sketch synthesis [16].
Deep learning methods are prevalent for image-to-image translation in multi-domain scenarios in computer vision and biometrics [1] [17]. These methods fall into two categories: supervised and unsupervised learning. The supervised framework needs tremendous manual work for labeling the data. Generative Adversarial Networks (GANs) have gained massive popularity because of their ability to generate realistic samples within the distribution of the training samples. In the proposed TVA-GAN, a thermal face image is fed into the generator network, which produces a synthesized, real-looking visible face image as the output. GAN-based image-to-image translation methods comprise two networks: a generator and a discriminator. The discriminator network is a Convolutional Neural Network (CNN) for two-class classification between real and fake samples. The generator network is an auto-encoder [6][2] that produces high-quality images within the given training set distribution.
The main contributions of this paper are as follows:
• We propose an Attention-based Generative Adversarial Network (TVA-GAN) for thermal to visible face transformation using an image-to-image translation framework.
• We narrow the learning space of the proposed TVA-GAN towards optimal learning by using attention guidance and deep feature extraction with the inception network, which helps to learn more local sparse structure and performs better than traditional methods.
• We propose a novel generator architecture for TVA-GAN using a Recurrent Inception block with an attention mechanism to improve the training of the attention network.
• We test the proposed TVA-GAN for thermal to visible face synthesis using real thermal face images and find improvement over various state-of-the-art methods.
The rest of the paper is organized as follows: a concise literature review of image translation and thermal to visual transformation is presented in Section II; the proposed TVA-GAN, with its network analysis and losses, is described in Section III; the experimental setup is described in Section IV; the experimental results and observations are described in Section V; and the conclusion of the paper is provided in Section VI.

II. RELATED WORK
Among machine learning methods that use feature classification for the recognition task, Jun Li et al. [18] proposed hallucinating faces from thermal infrared images. In comparison, Choi et al. [19] preprocessed the thermal images and normalized their intensity values. Choi et al. used the Self Quotient Image (SQI) with Difference of Gaussians (DoG) filtering for the recognition task. Cunjian Chen et al. [20] used the Pyramid Scale Invariant Feature Transform (PSIFT) for matching images in the thermal and visible domains. The primary aim of these non-deep-learning methods is to reduce the domain gap for learning features.
Among deep learning approaches, Vishal M. Patel et al. used polarimetric thermal faces and generative adversarial networks [21] for high-quality visible face synthesis. The Polarimetric Thermal Database [22], used in [21] for face recognition, contains polarimetric images with more facial features than ordinary thermal images. The database consists of only gray-channel images, not visible color images. On the same database, Iranmanesh et al. proposed Deep Cross Polarimetric Thermal-to-visible face recognition [23]. The authors used two CNNs and contrastive loss functions to recognize faces from the polarimetric and visible domains. Generative Adversarial Networks (GANs) appeared as an unsupervised learning framework for generating new samples within a given dataset distribution. Different authors have proposed different versions of GANs to deal with different problems in image generation, translation, and new sample generation. Various researchers have proposed GAN-based image-to-image translation methods, which translate images from one domain to another [24], [9], [10], [25], [26], [27], [28]. The ConditionalGAN [29] can be seen as the baseline: its generator network generates new samples based on prior conditions, such as class labels. In 2016, Ming-Yu Liu and Oncel Tuzel proposed the first unsupervised image translation network using GANs, named CoGAN (Coupled Generative Adversarial Network) [30], which is capable of learning the joint distribution from the marginal distributions of two different domains.
The pix2pix model was based on ConditionalGAN, while CycleGAN was quite similar to CoGAN in inter-domain feature learning. CycleGAN was a state-of-the-art model for unpaired image-to-image translation. Its generator was capable of generating more realistic samples than other methods dealing with unpaired data. The pix2pix model used the Markovian PatchGAN [8] discriminator network, which labels the generated image patches, and it displayed promising results for paired image transformation. However, pix2pix is restricted to paired image transformation, which uses the same set of images in different domains. Collecting paired image datasets is expensive and involves long procedures. To remove these problems associated with pix2pix, CycleGAN was proposed, which can transform inter-domain images without paired datasets. CycleGAN converts the source domain images into target domain images with the same semantic information, and a further network converts them back to the source domain images. This helps to decrease the divergence of the learning space and to increase the quality of the generated images. On the other hand, Yi et al. proposed DualGAN [10], which is similar to CycleGAN for image-to-image translation but differs in its loss functions. DualGAN uses a reconstruction loss, whereas CycleGAN uses the cycle-consistency loss. In most cases, CycleGAN outperforms DualGAN. Thus, we use the CycleGAN framework in the proposed model.
In a recent development, Self-Attention GAN was proposed [28]. It is also known as an intra-attention network and is capable of boosting CNN performance because the attention network focuses more on the essential features of the images. Self-Attention GAN learns long-range, multi-level dependencies by attending to the response at a specific position of the image. Attention-based networks help to avoid the intensive training of deep neural networks required by CNN models [28] [31] [32]. Recently, attention-based networks have also been proposed by Mejjati et al. [26] and Tang et al. [33] for image-to-image translation using GANs. Both of these methods use an attention-guided generator for foreground image generation, preserve the background information using the inverse mapping of the generator output, and concatenate the two in the final synthesis. There are a few more attention-based GANs for image-to-image translation, including Multi-channel Attention GAN [34] and Deep-Attention GAN [35]. Attribute-guided GAN [36] was proposed for sketch generation. Attention-based two-stream CNNs [37], [32] were proposed for spoofing detection in faces.

III. PROPOSED TVA-GAN MODEL
In this section, we present the proposed Thermal to Visible transformation Attention Guided Generative Adversarial Network (TVA-GAN) for thermal to visible face synthesis. The proposed TVA-GAN architecture is illustrated in Fig. 3. We use the paired dataset $\{A_j\}_{j=1}^{n} = \{(x_j, y_j)\}_{j=1}^{n}$, with $x_j \in X$ and $y_j \in Y$, where $x_j$ and $y_j$ are the pairs of thermal and corresponding visible images. We use the CycleGAN [9] framework with a U-Net [38] based architecture. The generator network consists of an encoder and a decoder. The encoder is based on Recurrent-Inception modules and the decoder is based on the attention mechanism. The proposed TVA-GAN translates images from the source domain ($x$) to the target domain ($y$) and from the target domain ($y$) back to the source domain ($x$) in a cyclic manner. We use two Attention Guided Generator Networks: $G_{xy}$ translates images from domain $x$ to domain $y$ ($x \to y$), and $G_{yx}$ generates images in domain $x$ from domain $y$ ($y \to x$). The generator network used in the proposed TVA-GAN has an inbuilt attention mechanism.
The proposed TVA-GAN method is trained end to end using several loss functions. For better convergence, we combine multiple losses to add different curvatures to the optimization. The losses used in this paper are the Adversarial loss, Cycle loss, Synthesized loss, Cycle-synthesized loss, and Feature reconstruction loss (i.e., perceptual loss).
Attention Block: We use attention gates [39] as the attention block in our proposed network to capture a sizeable receptive field and semantic contextual information. In a multi-stage CNN, the attention gate reduces the feature responses in irrelevant background regions, so there is no need to crop an ROI (region of interest) between the network layers. The attention gate output is obtained by element-wise multiplication of the input feature maps and the attention map, denoted $z^k$ and $q^k_{att}$ respectively. Here $z^k$ is the feature map of the $k$-th layer of the CNN, and $z^k_j \in \mathbb{R}^{F_k}$, where $F_k$ is the number of feature maps in the $k$-th layer. The attention gate helps to focus on a subset of a specific region of the target structure. The gating vector $g_j \in \mathbb{R}^{F_g}$ provides contextual and activation information for analysing spatial regions and is used to determine the focus region for pixel $j$. In the attention block, $\sigma_1$ denotes the ReLU activation.
We use additive attention, where the attention map is calculated between the previous up-sampling layer and the corresponding down-sampling layer of the encoder block in the network. The two layers are combined and processed to obtain $q^k_{att}$. After a 1 × 1 channel-wise convolution, both vectors are summed element-wise, because additive attention shows better results than multiplicative attention [40] (element-wise multiplication increases the network complexity).
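Using the symbols defined in this section, and assuming the gate follows the additive form of the attention gates in [39], the attention map can be sketched as below. The arrangement of terms is an assumption based on that formulation, and $\alpha^k_j$ is a symbol introduced here only to name the resulting per-pixel mask.

```latex
q^{k}_{att,j} = \phi^{T}\, \sigma_1\!\left( W_z^{T} z^{k}_{j} + W_g^{T} g_{j} + b_g \right) + b_\phi,
\qquad
\alpha^{k}_{j} = \sigma_2\!\left( q^{k}_{att,j} \right)
```

The dimensions are consistent with the definitions in the text: $W_z^{T} z^{k}_{j}$ and $W_g^{T} g_{j}$ both lie in $\mathbb{R}^{F_{int}}$, and $\phi$ maps the activated sum to a scalar per pixel.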
Here $\sigma_2(z_{j,c}) = \frac{1}{1 + \exp(-z_{j,c})}$ is the Sigmoid activation function, where $j$ and $c$ denote the spatial and channel dimensions. $W_z \in \mathbb{R}^{F_k \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$ and $\phi \in \mathbb{R}^{F_{int} \times 1}$ represent the linear transformations, $F_{int}$ denotes the number of output channels of each 1 × 1 convolution, and $b_\phi \in \mathbb{R}$ and $b_g \in \mathbb{R}^{F_{int}}$ are the bias terms. In brief, the two input feature maps are passed through 1 × 1 × 1 channel-wise convolutions, and the outputs are added and passed through a ReLU activation. A second channel-wise convolution with 1 × 1 × 1 kernels is then applied and passed through the Sigmoid layer to obtain the attention mask, which is concatenated with the up-sampled feature maps. The attention block is shown in Fig. 1 and described in Table II. Note that the linear transformations are computed by 1 × 1 × 1 channel-wise convolutions.
Recurrent Inception Block: For better learning of the contextual information, we use a recurrent block with t = 2 occurrences. In the proposed RCIN, the recurrent block gives more network depth with fewer parameters because it learns by weight sharing. To learn both globally and locally, we combine the inception network with the recurrent network. Inception also makes the network computationally cheaper in terms of parameters. When using two recurrent blocks together, we observed a large number of parameters. We therefore use a novel recurrent inception module that reduces the parameters and learns both locally and globally thanks to its large and small filter sizes (3 × 3, 5 × 5 and 1 × 1). To overcome the problem of vanishing gradients, we pass each layer (except the max-pooling layer) through a ReLU layer, as shown in Fig. 1. The ReLU used in the architecture offers faster and more efficient learning, since it does not degrade the gradients during back-propagation, and it is computationally cheaper than softmax.
To make the network smaller, we fix the number of output filters for the largest kernel size (5 × 5) in the inception block to 16 instead of deriving it from the input parameters, because deriving it from the input parameters introduces more filters into the network and increases the network complexity. The recurrent inception block architecture is described in Table I.
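The effect of fixing the 5 × 5 branch width, and of weight sharing in the recurrent layer, can be illustrated with a simple parameter count. The sketch below is a rough illustration only: the input width of 64 channels and the alternative rule of deriving the branch width as half the input channels are hypothetical choices, not values from Table I.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Number of weights plus biases in one k x k convolution layer."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


in_ch = 64  # hypothetical input width, not taken from Table I

# 5 x 5 branch: fixed to 16 filters vs. a width derived from the input.
fixed_5x5 = conv2d_params(in_ch, 16, 5)
derived_5x5 = conv2d_params(in_ch, in_ch // 2, 5)  # hypothetical derivation rule

# Recurrent layer with t = 2: one kernel is reused at both steps,
# whereas two stacked independent layers would store two kernels.
shared_recurrent = conv2d_params(in_ch, in_ch, 3)
unshared_stack = 2 * conv2d_params(in_ch, in_ch, 3)

print(f"5x5 branch, fixed to 16 filters: {fixed_5x5}")
print(f"5x5 branch, derived width:       {derived_5x5}")
print(f"3x3 recurrent (shared, t = 2):   {shared_recurrent}")
print(f"3x3 two independent layers:      {unshared_stack}")
```

Under these assumed widths the fixed branch needs about half the parameters of the derived one, and the shared recurrent kernel halves the cost of two independent layers of the same depth.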
Adversarial Loss: The adversarial loss measures the error of the generator and discriminator networks. The generator network generates fake image samples. The discriminator network labels each generated sample as fake or real, depending on how well the distribution of the generated images matches the distribution of the corresponding real images. The vanilla GAN uses the negative log-likelihood loss [41], which leads to instability in training. To overcome this instability, the proposed TVA-GAN model uses the LSGAN loss [42]. The adversarial loss for the $X \to Y$ transformation is denoted $L_{GAN}(G_{xy}, D_Y)$, where $G_{xy}$ is the generator function that transforms images from domain $x$ to domain $y$, $D_Y$ is the discriminator function for domain $Y$, $x \in X$ and $y \in Y$. Similarly, the adversarial loss $L_{GAN}(G_{yx}, D_X)$ is computed for the $Y \to X$ transformation, where $G_{yx}$ is the generator function that transforms images from domain $y$ to domain $x$ and $D_X$ is the discriminator function for domain $X$.
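For illustration, a least-squares adversarial objective in the style of LSGAN [42] can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, discriminator outputs are represented as plain lists of patch scores, and the target labels 1 (real) and 0 (fake) together with the 0.5 factor are common LSGAN conventions assumed here.

```python
from statistics import fmean
from typing import Sequence


def lsgan_discriminator_loss(real_scores: Sequence[float],
                             fake_scores: Sequence[float]) -> float:
    """Push D(y) towards 1 for real images and D(G(x)) towards 0 for fakes."""
    real_term = fmean((s - 1.0) ** 2 for s in real_scores)
    fake_term = fmean(s ** 2 for s in fake_scores)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores: Sequence[float]) -> float:
    """The generator is rewarded when D scores its outputs close to 1."""
    return fmean((s - 1.0) ** 2 for s in fake_scores)


# A discriminator that is undecided (0.5 everywhere) on both inputs.
print(lsgan_discriminator_loss([0.5, 0.5], [0.5, 0.5]))
print(lsgan_generator_loss([0.5, 0.5]))
```

Because the penalty is quadratic in the distance from the target label, samples that are classified correctly but lie far from the target still contribute gradient, which is the usual argument for the improved stability of this loss.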
Cycle Loss: We use the cycle-consistency loss (cycle loss) [9] in the objective function of the proposed method. It is computed as the $L_1$ distance between the real image and the cyclically reconstructed image in both the forward and backward transformations. The forward cycle loss is the $L_1$ distance between $x$ and $G_{yx}(G_{xy}(x))$; similarly, the backward cycle loss is the $L_1$ distance between $y$ and $G_{xy}(G_{yx}(y))$.
Cycle-Synthesized Loss: The cycle-synthesized loss [43] is used in the proposed model to improve training. We calculate it as the $L_1$ loss between the cycled (reconstructed) image and the synthesized image across the two domains, where $G_{yx}(y)$ and $G_{xy}(x)$ are the synthesized images and $G_{yx}(G_{xy}(x))$ and $G_{xy}(G_{yx}(y))$ are the cycled images.
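The two losses above can be sketched with toy stand-ins for the generators. In this sketch, images are flat lists of floats and `g_xy` / `g_yx` are hypothetical shift functions chosen only so that the arithmetic is easy to follow. Pairing the cycled image with the synthesized image of the same domain is an assumption drawn from the description above.

```python
from statistics import fmean
from typing import Callable, List

Image = List[float]
Generator = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two images of equal size."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return fmean(abs(p - q) for p, q in zip(a, b))


def cycle_loss(x: Image, g_xy: Generator, g_yx: Generator) -> float:
    """Forward cycle loss: real x against its reconstruction G_yx(G_xy(x))."""
    return l1_distance(x, g_yx(g_xy(x)))


def cycle_synthesized_loss(x: Image, y: Image,
                           g_xy: Generator, g_yx: Generator) -> float:
    """Cycled image G_yx(G_xy(x)) against the synthesized image G_yx(y)."""
    return l1_distance(g_yx(g_xy(x)), g_yx(y))


# Toy generators (hypothetical): X -> Y adds 0.5, Y -> X subtracts 0.25.
def g_xy(img: Image) -> Image:
    return [p + 0.5 for p in img]


def g_yx(img: Image) -> Image:
    return [p - 0.25 for p in img]


x = [0.0, 0.5, 1.0]  # toy "thermal" pixels
y = [0.5, 1.0, 1.5]  # paired toy "visible" pixels

print(cycle_loss(x, g_xy, g_yx))
print(cycle_synthesized_loss(x, y, g_xy, g_yx))
```

The backward losses follow by swapping the roles of the two generators and of `x` and `y`.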
Synthesized Loss: The synthesized loss is calculated between the generated image and the input image without detaching the computation graph, which allows the network loss to be back-propagated. For $A \in X$ and $B \in Y$, the synthesized losses are computed in both domains $A$ and $B$.
Feature Reconstruction Loss: We estimate the loss between the feature representations of the target image and the generated image. The same is done for the target image and the corresponding reconstructed image. We use the mean square error to compute the distance between the extracted features, where feature extraction is performed using the pre-trained VGG-19 network used in the perceptual loss. For a pre-trained network $\psi$ (here the VGG-19 model), let $\psi_k(y)$ denote the activation feature map of dimension $W_k \times H_k \times C_k$ obtained at the $k$-th convolution layer when image $y$ is processed through $\psi$, where $C$ is the number of channels, $W$ the width and $H$ the height.
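With the mean square error taken over the feature map, the feature reconstruction loss between an original image $y$ and a generated image $\hat{y}$ takes the usual perceptual-loss form. The normalization by $C_k H_k W_k$ shown below is the common convention and is assumed here rather than taken from this text:

```latex
l^{\psi,k}_{feat}(y, \hat{y}) = \frac{1}{C_k H_k W_k} \left\| \psi_k(y) - \psi_k(\hat{y}) \right\|_2^2
```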
Here $y$ and $\hat{y}$ are the original and the generated images, respectively, and $l^{\psi,k}_{feat}$ denotes the feature loss computed on $\psi_k$. Using this function, we compute the following feature reconstruction losses, where $x \in X$ and $y \in Y$:
$L_{fake\_real}(A) = l^{\psi,k}_{feat}(x, G_{yx}(x))$
$L_{fake\_real}(B) = l^{\psi,k}_{feat}(y, G_{xy}(y))$
$L_{recon\_real}(A) = l^{\psi,k}_{feat}(x, G_{xy}(G_{yx}(x)))$
$L_{recon\_real}(B) = l^{\psi,k}_{feat}(y, G_{yx}(G_{xy}(y)))$
$L_{recon\_fake}(A) = l^{\psi,k}_{feat}(G_{xy}(y), G_{xy}(G_{yx}(x)))$
$L_{recon\_fake}(B) = l^{\psi,k}_{feat}(G_{yx}(x), G_{yx}(G_{xy}(y)))$
Objective Function: The final objective function for the proposed TVA-GAN is the weighted combination of the losses described above, where each $\lambda$ is the weight hyperparameter of the corresponding loss.

A. Network Architecture
To train the network, we use the newly proposed recurrent inception block with attention networks. Integrating the recurrent inception block with attention networks improves learning in the image-to-image translation task. We use the CycleGAN network as the base model for the translation task. The proposed method produces more realistic and accurate translations when synthesising the images. It contains two generator networks (i.e., $G_{xy}$ and $G_{yx}$) and two discriminator networks (i.e., $D_Y$ and $D_X$), one for each domain. The generator has an inbuilt attention mechanism [28], which also helps the proposed TVA-GAN to handle the background information without introducing any new network. We follow an architecture with an integrated attention module to take care of long-range dependencies.
Generator Network: The generator network uses the recurrent-inception attention-based architecture proposed in this paper. The encoder of the generator includes the recurrent-inception block detailed in Table I. The recurrent-inception block helps to improve the network performance and the learning of the optimal local sparse structure. The attention block follows the Attention-Gate [39] architecture outlined in Table II. It is used only in the decoder, after every up-sampling layer, and is followed by a convolutional layer combined with batch normalization and a ReLU activation function. The attention block finds a scalar attention value for each pixel vector through additive attention, learned by linear transformations using 1 × 1 × 1 channel-wise convolutions. The generator architecture is summarized in Table III.
Discriminator Network: For the discriminator, we use the PatchGAN discriminator proposed in pix2pix, known as the Markovian PatchGAN discriminator, with a five-layer architecture. We feed the discriminator network with the 256 × 256 images generated by the generator network. The first layer of the discriminator is a convolution layer with a LeakyReLU activation function. After that, each convolution layer is followed by instance normalization and a LeakyReLU activation function. We use a 4 × 4 kernel in each convolutional layer with stride 2 and padding 1. The last layer of the architecture contains only a convolution layer. The network architecture of the discriminator is summarized in Table IV.
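The spatial size of the discriminator output follows from standard convolution arithmetic. The sketch below applies it to the settings stated above (4 × 4 kernel, stride 2, padding 1, five layers, 256 × 256 input). It assumes every layer really uses stride 2, whereas Table IV may specify otherwise for individual layers; the function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Output width (or height) of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_map_size(input_size: int, num_layers: int) -> int:
    """Apply the same convolution settings `num_layers` times."""
    size = input_size
    for _ in range(num_layers):
        size = conv_output_size(size)
    return size


# 256 -> 128 -> 64 -> 32 -> 16 -> 8 under the stride-2 assumption.
print(patch_map_size(256, 5))
```

Each cell of the resulting map scores one overlapping patch of the input as real or fake, which is what makes the discriminator patch-based rather than image-level.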

B. Baseline Methods
The proposed TVA-GAN for thermal-visible synthesis is compared with the following baseline image-to-image translation methods, each used with its original settings.
1) pix2pix [24]: pix2pix is designed for paired image datasets and translates images from one domain to another using a U-Net generator network with the PatchGAN discriminator network. It works on conditional data input. The original settings are used to evaluate its performance.
2) CycleGAN [9]: CycleGAN is proposed for unpaired image-to-image translation using the cycle-consistency loss. It transforms the source domain image into the target domain image and then reconstructs the source domain image from the target domain image. The cycle-consistency loss is calculated between the source image and the reconstructed image.
3) DualGAN [10]: DualGAN follows nearly the same methodology as CycleGAN but uses a reconstruction loss rather than the cycle-consistency loss. It also does not require paired data for the image translation task. DualGAN is used with its original settings for performance evaluation.
4) PCSGAN [25]: PCSGAN follows nearly the same methodology as CycleGAN but uses the cycle perceptual loss with the synthesized perceptual loss rather than the cycle-consistency loss. It uses paired data for the image translation task.
5) AGGAN [26]: An attention-guided model (AGGAN), proposed by Mejjati et al., extracts an attention map to find the foreground and background of images. The attention mechanism discovers the region to translate in the opposite domain by finding the attention map.
6) AttentionGAN [27]: AttentionGAN uses the same mechanism introduced in CycleGAN together with an inbuilt attention mechanism that finds an attention mask and a content mask to transform images from one domain to another.

C. Datasets Used
We test our model on two thermal-visible datasets, namely WHU-IIP [44] and Tufts Face Thermal2RGB [45]; both contain pairs of thermal and real visible face images. We use them for thermal to visible face synthesis with the proposed TVA-GAN method and the existing GAN-based methods. For WHU-IIP, 552 training image pairs and 240 testing image pairs are considered in the experiments. For Tufts Face Thermal2RGB, we use 403 images for training and 156 images for testing in a paired manner. The Tufts Face Thermal2RGB dataset contains more diverse data than WHU-IIP, which allows us to judge the generalization capability of the proposed model. It includes images of people of various races with different facial attributes, including some people wearing sunglasses and spectacles.

D. Parameter Settings
For all the datasets used for training and testing, the images are resized to 256 × 256 × 3 (where 3 denotes the number of channels). Similar to CycleGAN, the pool size is set to 50. We use the diffGrad optimizer [46] for the proposed TVA-GAN, with a learning rate of 0.0002 and momentum terms $\beta_1 = 0.5$ and $\beta_2 = 0.999$, because previously proposed optimizers [47], [48] have difficulty adjusting the learning-rate update. Linear decay is used to reduce the optimizer's learning rate. For the pix2pix, CycleGAN and DualGAN methods, we use batch normalization as in the original implementations for comparison with our results. We use the LSGAN loss [42], as used in CycleGAN, for training stability of the proposed model throughout the training process. The loss weight hyperparameters used in the final objective function are listed in Table VII.
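A linear learning-rate decay of this kind can be sketched as below. The schedule shape (a constant phase followed by a linear ramp to zero) and the epoch counts are assumptions borrowed from the common CycleGAN training recipe; the text above specifies only the initial rate of 0.0002 and that the decay is linear.

```python
def linear_decay_lr(epoch: int, base_lr: float = 0.0002,
                    constant_epochs: int = 100, decay_epochs: int = 100) -> float:
    """Learning rate for a 0-indexed epoch: flat, then linearly down to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


for epoch in (0, 99, 100, 150, 200):
    print(epoch, linear_decay_lr(epoch))
```

The value returned here would be handed to the optimizer at the start of each epoch; how that is wired into diffGrad depends on the implementation used.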

E. Evaluation Metrics
For the quantitative comparison of our results with the state-of-the-art methods, we use the SSIM [49], LPIPS [50], PSNR [49] and VGG-FaceLoss evaluation metrics. The Structural Similarity Index (SSIM) measures the structural similarity between the generated and real visible face images and reflects human-level visual perception; a higher SSIM means closer structural similarity between the generated image and the actual visible face image. Peak Signal-to-Noise Ratio (PSNR) is computed to measure the quality of the generated images. Learned Perceptual Image Patch Similarity (LPIPS) measures patch-level similarity, which suits our use of the PatchGAN discriminator, and helps to assess the quality of the images generated by the proposed method. We also compute VGG-FaceLoss to ensure feature-level similarity. It uses a pre-trained VGGFace to extract features from the synthesized face image and the actual visible face image and computes the L1 distance between them. We also use Visual Information Fidelity (VIF) [51] to study the proposed method under different losses. VIF compares the visual information in the reference image and the generated image in a way that resembles the human visual system. VIF therefore helps to understand how accurately the proposed method transforms thermal images into visible images.
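As an illustration of the simplest of these metrics, PSNR can be computed from the mean squared error between two images. The sketch below assumes 8-bit images given as flat lists of pixel values with a peak value of 255; the function name is hypothetical.

```python
import math
from statistics import fmean
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    mse = fmean((r - g) ** 2 for r, g in zip(reference, generated))
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one gray level gives an MSE of 1.
print(round(psnr([10, 20, 30], [11, 21, 31]), 2))
```

SSIM, LPIPS and VIF need windowed statistics or learned features and are normally taken from an existing library rather than written by hand.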

V. EXPERIMENTAL RESULTS AND OBSERVATIONS
A. Quantitative Result Analysis
The proposed TVA-GAN generates more realistic and natural-looking images when transforming the thermal domain into the visual domain. TVA-GAN shows more promising results than the state-of-the-art attention-based and non-attention-based GAN models.
For thermal to visual synthesis, the quantitative results of TVA-GAN and various state-of-the-art methods are reported in Table V for the WHU-IIP dataset and Table VI for the Tufts Face Thermal2RGB dataset. TVA-GAN performs better than all state-of-the-art methods in terms of SSIM, LPIPS, and VGG-FaceLoss on both the WHU-IIP and Tufts Face Thermal2RGB datasets. Its PSNR is slightly lower than that of PCSGAN on the WHU-IIP dataset.

B. Qualitative Result Analysis
The qualitative comparison between the ground truth images and the images generated by the proposed TVA-GAN and the existing GAN models is shown in Fig. 6 and Fig. 7 for the Tufts Face Thermal2RGB and WHU-IIP datasets, respectively. The comparison covers the non-attention-based methods pix2pix, CycleGAN, DualGAN and PCSGAN and the attention-based methods AGGAN and AttentionGAN. Fig. 6 shows that TVA-GAN produces better results on the more diverse dataset than the existing state-of-the-art methods. Fig. 7 similarly shows that the TVA-GAN results are convincing on the less diverse WHU-IIP dataset compared with both attention-based and non-attention-based methods. The results of non-attention-based methods such as pix2pix, CycleGAN, DualGAN, and PCSGAN have missing features because these methods lack attention: they do not accurately learn local as well as global feature details, as shown in self-attention GAN [26]. The attention-based methods AGGAN and AttentionGAN, on the other hand, learn the foreground and background using masking and inverse masking, which does not work accurately when few feature details are available. In contrast, our method, which uses recurrent inception with the attention block, performs better owing to better segmentation through attention while translating the images. TVA-GAN translates the foreground and background information simultaneously, using recurrent inception to increase the network depth and to learn global and local features with fewer parameters. The images generated by TVA-GAN preserve structure better and are closer to the ground truth than the results produced by the other methods.
C. Impact of different losses used in TVA-GAN.
For the proposed TVA-GAN, we evaluate the impact of the different losses used for training. We perform an ablation study over the Adversarial loss, Cycle loss, Cycle-synthesized loss, Synthesized loss, and Feature reconstruction loss. The qualitative comparison of the various losses used in the proposed method is shown in Fig. 8 for the WHU-IIP dataset and in Fig. 9 for the Tufts Face Thermal2RGB dataset. These results are summarized as follows:
• The proposed TVA-GAN performs better than both the attention-based and non-attention-based models for thermal to visible face synthesis.
• The proposed TVA-GAN generates more genuine visual representations from thermal face images, with more precise details and fewer artifacts in the generated images.
• The model fails to distinguish between different persons when trained with only the Adversarial loss on the WHU-IIP dataset, and with the Adversarial loss plus the Cycle loss on the Tufts Face Thermal2RGB dataset, although it performs well during training. Hence, these two losses are not enough for generalization over different subjects. The benefit of the remaining losses is evident from the high-quality images generated after combining the Adversarial loss, Cycle loss, Synthesized loss, Cycle-synthesized loss and Feature reconstruction loss.
For WHU-IIP dataset, the proposed TVA-GAN leads to better SSIM, VIF and PSNR in terms of %, using combined adversarial loss,cycle loss, synthesized loss, feature reconstruction loss and combined adversarial loss,cycle loss, synthesized loss, feature reconstruction loss with Cycle synthesized loss combinations. We perceived % increment of 60.99%, 6.73%, 5.60% for SSIM, VIF and PSNR, respectively, and % reduction of 73.47% for LPIPS compared to only adversarial loss as reported in Table IX. With compared to combination of adversarial loss and cycle loss, TVA-GAN achieves the increment of 10.18%, 0.54%, 2.25% for SSIM, VIF and PSNR using combination of adversarial loss, cycle loss, synthesized loss and , while shows 37.35% of reduction for LPIPS as shown in Table IX Compared to combination of adversarial loss, cycle loss and synthesized loss, TVA-GAN gains 1.86%, 0.02%, 0.13% for SSIM, VIF and PSNR while reports a reduction of 10.34% for LPIPS by using the combination of adversarial loss, cycle loss, synthesized loss and feature reconstruction loss as depicted in Table IX. For Tufts Face Thermal2RGB, with combination of adversarial loss, cycle loss, cycle synthesized loss, feature reconstruction loss and synthesized loss, the proposed TVA-GAN shows improvement of 39.51%, 3.24%, 8.31% for SSIM, VIF and PSNR while shows 5.75% reduction in LPIPS as compared to the only adversarial loss as shown in Table  X. With combination of adversarial loss, cycle loss, cycle synthesized loss, feature reconstruction loss and synthesized loss, the proposed TVA-GAN shows improvement over combination of adversarial and cycle loss with the gain of 42.26%, 4.04%, 9.17% for SSIM, VIF and PSNR while shows 8.08% reduction for LPIPS as reported in Table X. For Tufts Face Thermal2RGB dataset, the proposed method with all the losses also shows gain over combination of adversarial, cycle and synthesized loss by 1.63%, 0.33%, 0.19% for SSIM, VIF and PSNR while shows 1.42% reduction in LPIPS. 
Adding the cycle-synthesized loss to the combination of feature reconstruction, cycle, synthesized and adversarial losses gains 0.28%, 0.48% and 0.54% in SSIM, VIF and PSNR, with a 4.00% reduction in LPIPS, on the Tufts Face Thermal2RGB dataset.
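The percentage increments and reductions quoted in this ablation can be read as relative changes with respect to a baseline loss setting. The sketch below assumes the usual formula (value - baseline) / baseline x 100; the section does not state the formula explicitly, and the function names and example scores are placeholders rather than values from Table IX or Table X.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` with respect to `baseline`, in percent.

    Positive means an increment, negative means a reduction.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / abs(baseline) * 100.0


def compare(baseline: dict, proposed: dict) -> dict:
    """Relative change for every metric present in `baseline`."""
    return {name: relative_change(baseline[name], proposed[name])
            for name in baseline}


if __name__ == "__main__":
    # Placeholder scores for illustration only (not taken from the paper).
    adversarial_only = {"SSIM": 0.50, "PSNR": 20.0, "LPIPS": 0.40}
    all_losses = {"SSIM": 0.60, "PSNR": 21.0, "LPIPS": 0.30}
    for name, pct in compare(adversarial_only, all_losses).items():
        print(f"{name}: {pct:+.2f}%")
```

For SSIM, VIF and PSNR a positive change is an improvement, whereas for LPIPS a negative change (a reduction) is the improvement.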

D. Face Verification Results for the Proposed TVA-GAN
To better assess the quality of the generated faces, we evaluate them using a face verification framework in this subsection. Fig. 10 plots the receiver operating characteristic (ROC) curves for the face samples generated by the proposed TVA-GAN and by the other GAN methods over the WHU-IIP and Tufts Face Thermal2RGB face datasets. We use the DeepFace [52] framework with pre-trained deep face models to calculate the distance between a generated face sample and the ground-truth image. We use cosine similarity [53] as the distance metric and take the resulting value as the score of the generated sample. A positive pair consists of a ground-truth image and its corresponding generated image; a negative pair consists of a ground-truth image and a randomly chosen generated sample from another subject. The cosine-similarity scores of the positive and negative pairs are then used to plot the ROC curves. As shown in Fig. 10, the proposed TVA-GAN shows a gain in face verification on both the WHU-IIP and Tufts Face Thermal2RGB datasets. For the WHU-IIP face dataset, the proposed TVA-GAN shows gains of 1.177%, 9.330%, 4.182%, 5.040%, 10.259% and 15.152% compared to pix2pix, CycleGAN, DualGAN, PCS-GAN, AGGAN and AttentionGAN, respectively. For the Tufts Face Thermal2RGB dataset, the proposed method shows improvements of 39.726%, 12.899%, 44.742%, 28.136%, 19.786% and 24.071% compared to pix2pix, CycleGAN, DualGAN, PCS-GAN, AGGAN and AttentionGAN, respectively.
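The pairing and scoring protocol described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' evaluation code: face embeddings are assumed to be precomputed (for example by a pre-trained face model) and stored per subject, the cosine similarity itself is used as the score (higher means more alike), and the helper names `build_pair_scores`, `roc_points` and `auc` are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def build_pair_scores(real_emb, gen_emb, seed=0):
    """Score one positive and one negative pair per subject.

    real_emb, gen_emb -- dicts mapping subject id to an embedding vector of the
    ground-truth and generated face, respectively (assumed precomputed).
    """
    subjects = sorted(real_emb)
    if len(subjects) < 2:
        raise ValueError("need at least two subjects to form negative pairs")
    rng = random.Random(seed)
    scores, labels = [], []
    for s in subjects:
        # Positive pair: ground truth with its own generated image.
        scores.append(cosine_similarity(real_emb[s], gen_emb[s]))
        labels.append(1)
        # Negative pair: ground truth with a generated image of another subject.
        other = rng.choice([t for t in subjects if t != s])
        scores.append(cosine_similarity(real_emb[s], gen_emb[other]))
        labels.append(0)
    return scores, labels


def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative pairs")
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A usage example would pass two dictionaries of embeddings to `build_pair_scores` and feed the result to `roc_points` and `auc`; comparing the area under the curve across methods is one common way to summarize ROC plots, although the section does not state which summary statistic underlies the reported gains.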

VI. CONCLUSION
This paper proposes a new attention-guided generative adversarial network for thermal to visible face synthesis (TVA-GAN). The proposed model generates more realistic face images than the state-of-the-art methods. We design the network with multiple losses to tackle problems of image synthesis such as blur, artifact generation, and semantic distortion. The losses include the Adversarial loss, Cycle loss, Cycle-synthesized loss, Feature reconstruction loss and Synthesized loss. The proposed generator network learns both local and global features accurately while transforming from the thermal to the visible domain. Unlike AGGAN and AttentionGAN, which translate only the foreground information, it translates the foreground and background information simultaneously without separating them. We use a recurrent inception block together with an attention block. The recurrent inception block learns the global and local features effectively when translating images from the thermal to the visible domain. It handles salient parts and contextual information both locally and globally by using large and small kernels, with more depth and fewer parameters because of the recurrent layer. During decoding, the attention block takes care of large receptive fields and learns more semantic contextual information. It handles multi-stage CNN localization by progressively suppressing the feature responses in unrelated background regions. The proposed TVA-GAN is tested for the thermal to visible face synthesis problem using the WHU-IIP and Tufts Face Thermal2RGB datasets. It outperforms the existing state-of-the-art non-attention-based GAN models such as pix2pix, CycleGAN, DualGAN and PCS-GAN, as well as attention-based GAN models such as AGGAN and AttentionGAN. It produces more realistic faces that are closer to the target image, with fewer artifacts and with identity preservation.