TVA-GAN: attention guided generative adversarial network for thermal to visible image transformations

With recent improvements in deep learning approaches for realistic image generation and translation, Generative Adversarial Networks (GANs) have delivered favorable results. A GAN generates novel samples that look indistinguishable from authentic images. This paper proposes a novel generative network for thermal-to-visible image translation. Thermal-to-visible synthesis is challenging because accurate semantic and textural information is not available in thermal images. Thermal sensors acquire thermal face images by capturing the object's luminance, with few details about the actual facial information. However, they are advantageous for low-light and night-time vision, where image information cannot be captured in a complex environment by an RGB camera. We design a new Attention-guided Cyclic Generative Adversarial Network for Thermal to Visible face transformation (TVA-GAN) by integrating a new attention network. We use attention guidance with a recurrent block and an Inception module to contract the learning space toward the optimum solution. The proposed TVA-GAN is trained and evaluated for thermal-to-visible face synthesis over three benchmark datasets: WHU-IIP, Tufts Face Thermal2RGB, and CVBL-CHILD. The results show promising improvement in face synthesis compared to state-of-the-art GAN methods. Code for the proposed TVA-GAN is available at: https://github.com/GANGREEK/TVA-GAN.


Introduction
Visible image generation from thermal images is considerably more challenging than from near-infrared (NIR) images [1,2]. NIR imaging operates at wavelengths between 700 nm and 2500 nm, close to red light and to human vision. NIR image formation discards color wavelength information and relies on infrared LEDs; most NIR cameras use IR LEDs for illumination at night, which are limited in range, usually to no more than 500 m. Thermal images, on the other hand, are far-infrared images with wide-area emission detection. Thermal infrared (TIR) [3,4] cameras are sensitive to the heat radiation produced by objects in the scene; any body above absolute zero emits electromagnetic waves spanning different wavelengths. NIR and TIR images thus capture non-overlapping parts of the electromagnetic spectrum. Thermal and near-infrared images are also very different from each other, since thermal images capture only a particular range of temperatures and therefore contain noisier data than NIR images. Consequently, generating visible-domain images from corresponding thermal-domain images is more challenging.
In the present deep learning scenario [5], image generation methods are utilized in various computer vision applications, such as image restoration [6], image synthesis [7], face synthesis [8,9], facial expression synthesis [10], image fusion [11], and many more. Deep learning methods prevail over traditional machine learning-based methods by using image translation in multi-domain computer vision [12] and biometrics [5] scenarios. We consider visible face synthesis from thermal face images as an image translation problem. Thermal-to-visible face translation is effectively handled by Generative Adversarial Networks (GANs) [13-15]. The GAN model works on the adversarial training principle, where a generator model learns by balancing false results against true results. With the modern leverage of deep learning, different GAN variants [16-19] have been investigated for image-to-image translation, and GAN-based models have been applied to various computer vision tasks such as image colorization [20], image super-resolution [21], image segmentation [22], style transfer [23], text-to-image synthesis [24], distortion rectification [25], and face photo-sketch synthesis [26]. GANs have gained massive popularity because they generate realistic samples within the training data distribution. GAN-based image-to-image translation methods comprise two networks: a generator and a discriminator. The generator network is generally an auto-encoder [6,14] that produces high-quality images within the given training set distribution. During training, the discriminator network, usually a CNN (Convolutional Neural Network), classifies the generated image samples as fake or real. The proposed TVA-GAN feeds a thermal face image into the generator network and produces a synthesized, realistic-looking visible face image as the output.
The significant contributions of this paper are as follows:

- Using an image-to-image translation framework, we propose an Attention-guided Generative Adversarial Network (TVA-GAN) for thermal-to-visible face transformation.
- We propose a novel and efficient U-Net-based generator for TVA-GAN using the Recurrent Inception block with an attention mechanism, leading to better generalization capability and fewer parameters than the original U-Net.
- The proposed TVA-GAN's learning space is contracted toward the most efficient learning space by utilizing attention gates with in-depth feature extraction through the inception network, which supports learning more sparse local structures and performs better than the standard methods.
- We test the proposed TVA-GAN for thermal-to-visible face synthesis using real thermal face images on three benchmark datasets and observe improvement over various state-of-the-art methods.
- We also verify the generated faces by evaluating them with a recent state-of-the-art deep-face [27] metric and analyze the results using ROC curves.
The rest of the paper is organized as follows: Sect. 2 presents a concise literature review of image translation methods; Sect. 3 describes the proposed TVA-GAN network and its loss functions; Sect. 4 illustrates the experimental setup; Sect. 5 presents the empirical study of the experimental results; and Sect. 6 concludes the findings of the proposed method.

Related work
Image synthesis from the thermal to the visible domain was initially attempted using machine learning techniques. Li et al. [28] hallucinated faces from thermal infrared images using classifiers. Choi et al. [29] pre-processed thermal images by normalizing their intensity values and used a self-quotient image with Gaussian filtering for the recognition task. Chen et al. [30] used a pyramid scale-invariant feature transform to match images across the thermal and visible domains. The primary aim of these non-deep-learning methods was to reduce the domain gap for learning features. However, they suffer in performance because they are not capable enough of learning the image features and the corresponding mapping between domains. Recently, deep learning has shown significant improvement in image synthesis performance. Among deep learning approaches, Patel et al. used polarimetric thermal faces and GANs [31] for high-quality visible face synthesis. The polarimetric thermal database [32], used in [31] for face recognition, contains polarimetric images with more facial features than plain thermal images and consists only of grey-channel images. For the same database, Iranmanesh et al. proposed a polarimetric thermal-to-visible face recognition method [33] that uses two CNNs and a contrastive loss function to recognize faces across the polarimetric and visible domains. GANs emerged as an unsupervised learning method for generating new data from the distribution of the training datasets, and researchers have proposed different versions of GANs to deal with image generation and translation problems, including several image-to-image translation methods [17,18,34-38]. ConditionalGAN [39] emerged as a baseline for condition-specified image generation, producing new samples based on prior conditions such as class labels. In 2016, Liu et al. proposed a generative model named Coupled Generative Adversarial Network (CoGAN) [40] for learning the joint distribution of multi-domain images.
Pix2Pix [16] emerged as a revolutionary work for image translation using conditional embedding. It builds on ConditionalGAN and is similar to CoGAN in inter-domain feature learning. Pix2Pix used a PatchGAN discriminator [16], generated promising results over paired datasets, and transformed scene images effectively. However, collecting paired image datasets is expensive and involves long procedural processes. To tackle this problem, CycleGAN performs inter-domain translation with a cycle-consistency loss and can deal with unpaired data. CycleGAN generates natural-looking samples by using cyclic training: a cycle-consistency loss is calculated between the source-domain image and the cyclically reconstructed image, which helps to reduce the divergence of the learning space and increase the quality of the generated images. Yi et al. proposed DualGAN [18], which is similar to CycleGAN for unpaired image translation but uses a reconstruction loss instead of the cycle-consistency loss; in most cases, CycleGAN surpasses DualGAN. Other state-of-the-art methods [35,37,41] also use cyclic training and report better results. Thus, we use the CycleGAN framework in the proposed model.
In recent developments of GAN training and learning, Self-Attention GAN [38], also known as an intra-attention network, is used to boost model performance because the attention network focuses on the most essential features of the images. Self-Attention GAN can learn long-range, multi-level dependencies by attending to the response at specific positions of the images. Attention-based networks also help to avoid the intense training required by deep CNN models [38,42,43]. Recently, attention-based networks have been proposed by Mejjati et al. [36] and Tang et al. [44] for unpaired image translation using GANs. Other attention-based GANs for image translation, such as the Multi-channel Attention GAN [45] and the Attribute-guided sketch-generation GAN [46], have also been proposed, as have two-stream CNN networks with attention mechanisms [43,47] for face spoof detection. However, these attention-based methods have yet to utilize the capacity of recurrent modules to capture the inter-relations in images required for quality image generation. We address this problem in our work by incorporating the Recurrent Inception block with an attention mechanism for improved thermal-to-visible face translation. Some researchers have also worked on thermal-to-visible translation using GANs. Nyberg et al. proposed unpaired thermal-to-visible translation using adversarial training [48], while Kuang et al. proposed a Conditional GAN-based colorization network [49] for translating thermal scene images into visible scene images. GAN models have also been exploited for thermal-to-visible face translation using different network architectures [50-53]. Note that these methods utilize existing GAN models without any sophisticated network arrangement or objective function, whereas the proposed method uses the recurrent inception block to suitably encode variable-length dependencies together with various loss functions.
More recently, Lahiri et al. used an inception-based network with a light spatial transition layer (LIST) [54] for image restoration, image denoising, and image inpainting; however, that method is not applied to image translation. In contrast, we use an inception-based module with a fixed number of channels for the larger convolutions to handle image-to-image translation. Incremental image translation [55], proposed by Tan et al., utilizes incremental learning, which has the advantage of embedding additional domains at learning time; however, the inherent ambiguity of thermal faces makes incremental learning difficult to apply here. E2I, proposed by Xu et al. [56], performs image restoration by exploiting edge-to-image inpainting, but it is not tested on human faces. An enhanced colorization method was proposed by Zhong et al. [57] for person re-identification using infrared images; it is tested on NIR images and uses deep cross-modality learning for identification. In our work, we verify the generated visible-domain images using face verification methods: after generating the faces, we compute scores using a deep face model and present the ROC curve for verification.

Proposed TVA-GAN model
This section illustrates the proposed Attention-guided Generative Adversarial Network (TVA-GAN) for thermal-to-visible face transformation. Figure 1 shows the proposed TVA-GAN architecture in detail. It includes two different domains, depicted on the two sides of the figure, for transforming images from one domain to the other. Each side contains three images: the real image from the corresponding domain, the generated fake image, and the image reconstructed from the generated fake image. Two generator networks, $G_{xy}$ and $G_{yx}$, are used, where $G_{xy}$ transforms thermal-domain images into visible-domain images and $G_{yx}$ transforms visible-domain images into thermal-domain images. For each domain, a cycle loss is calculated between the input and the cyclically reconstructed image. Figure 1 uses the same color for each loss to show the inputs from the different domains. The discriminator network uses the training image distribution to assess the quality of the generated images, so it is shown with only one arrow. A pre-trained network is used for the feature reconstruction loss. The losses between domains are described in Sect. 3.3.
We utilize the dataset in a paired manner as $\{A_j\}_{j=1}^{n} = \{(x_j, y_j)\}_{j=1}^{n}$, with $x \in X$ and $y \in Y$, where $x_j$ and $y_j$ are the pairs of thermal and corresponding visible images.
We use cyclic training [17] with a U-Net [58] based architecture. The generator network is an encoder-decoder: the encoder uses the proposed Recurrent-Inception modules, while the decoder uses attention mechanisms, up-convolutional layers, and recurrent-inception modules. Two such attention-equipped generator networks are formed, $G_{xy}$ and $G_{yx}$. The first generator network, $G_{xy}$, transforms source-domain images into target-domain images ($x \to y$), and the second generator network, $G_{yx}$, transforms target-domain images back into the source domain. The whole operation is performed in a cyclic manner, i.e., $x \to y \to x$, for each domain. Attention gates are used in each generator network to realize the attention mechanism. A minimal sketch of one cyclic pass is given below, followed by the different blocks and losses used in the proposed model.
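The following sketch illustrates the cyclic translation described above in PyTorch-style pseudocode. It is a minimal illustration, not the released implementation: `G_xy` and `G_yx` stand for any two generator modules, and returning the summed cycle losses is our choice for compactness.

```python
import torch

def cycle_forward(G_xy, G_yx, x_thermal, y_visible):
    """One cyclic pass: x -> y -> x and y -> x -> y.

    G_xy, G_yx are the two generator networks (thermal->visible and
    visible->thermal); inputs are image batches of shape (N, 3, 256, 256).
    """
    # Forward cycle: thermal -> fake visible -> reconstructed thermal
    fake_y = G_xy(x_thermal)
    rec_x = G_yx(fake_y)

    # Backward cycle: visible -> fake thermal -> reconstructed visible
    fake_x = G_yx(y_visible)
    rec_y = G_xy(fake_x)

    # Cycle-consistency (L1) losses between inputs and cyclic reconstructions
    cyc_loss_x = torch.nn.functional.l1_loss(rec_x, x_thermal)
    cyc_loss_y = torch.nn.functional.l1_loss(rec_y, y_visible)
    return fake_x, fake_y, rec_x, rec_y, cyc_loss_x + cyc_loss_y
```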

Attention block
We use attention gates [59] as attention blocks in the proposed network to capture a sizeable receptive field and semantic contextual information. The attention gate reduces the feature responses of irrelevant background regions in the CNN layers, and there is no need to crop an ROI (region of interest) between network layers. The attention gate output is obtained by element-wise multiplication between the input feature map $z^k$ and the attention map $q_{att}^{k}$, where $z^k$ is the feature map of the $k$-th CNN layer with $z_j^k \in \mathbb{R}^{F_k}$ and $F_k$ is the number of feature maps in the $k$-th layer. The attention gate helps the network focus on a subset of the target structure. A gating vector $g_j \in \mathbb{R}^{F_g}$ provides contextual and activation information for determining the focus region. The feature map of the $k$-th layer after incorporating attention is computed as

$$\hat{z}^{k} = z^{k} \odot q_{att}^{k}, \tag{1}$$

where $z^k$ and $\hat{z}^k$ are the feature maps of the $k$-th layer before and after attention, respectively. We use additive attention, where the attention map is computed between the previous up-sampling layer and the corresponding down-sampling layer of the encoder: the two vectors, after channel-wise $1 \times 1$ convolutions, are summed element-wise, because additive attention shows better results than multiplicative attention [60], and element-wise multiplication would increase the network complexity. The attention map $q_{att}^{k}$ of the $k$-th layer is computed as

$$q_{att}^{k} = \sigma_2\Big(\psi^{T}\,\sigma_1\big(W_z^{T} z_j^{k} + W_g^{T} g_j + b_g\big) + b_{\psi}\Big), \tag{2}$$

where $T$ denotes the transpose, $\sigma_1$ is the ReLU activation, and

$$\sigma_2(z_{j,c}) = \frac{1}{1 + \exp(-z_{j,c})} \tag{3}$$

is the Sigmoid activation function, with $j$ and $c$ denoting the spatial and channel dimensions, respectively. Here $W_z \in \mathbb{R}^{F_k \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$ and $\psi \in \mathbb{R}^{F_{int} \times 1}$ are linear transformations, $F_{int}$ is the number of output channels of each $1 \times 1$ convolution, and $b_{\psi} \in \mathbb{R}$ and $b_g \in \mathbb{R}^{F_{int}}$ are bias terms. In brief, the two input feature maps are passed through $1 \times 1 \times 1$ channel-wise convolutions, summed, and operated on by a ReLU activation; a further channel-wise $1 \times 1 \times 1$ convolution followed by a Sigmoid layer yields the attention mask, which is then combined with the up-sampled feature maps. The attention block is shown in Fig. 2 and described in Table 1. Note that the linear transformations are computed as $1 \times 1 \times 1$ channel-wise convolutions.
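A minimal 2-D sketch of such an additive attention gate (in the spirit of Oktay et al. [59]) is shown below. The channel arguments `F_l`, `F_g`, and `F_int` are our own parameter names, and the sketch assumes the gating signal has already been resampled to the spatial size of the skip feature map; the exact layer configuration of Table 1 is not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate sketch: gate skip features z with signal g."""
    def __init__(self, F_l, F_g, F_int):
        super().__init__()
        self.W_z = nn.Conv2d(F_l, F_int, kernel_size=1, bias=False)  # transforms skip features
        self.W_g = nn.Conv2d(F_g, F_int, kernel_size=1, bias=True)   # transforms gating signal
        self.psi = nn.Conv2d(F_int, 1, kernel_size=1, bias=True)     # maps to one scalar per pixel
        self.relu = nn.ReLU(inplace=True)                            # sigma_1
        self.sigmoid = nn.Sigmoid()                                  # sigma_2

    def forward(self, z, g):
        # Additive attention: sum the two linearly transformed inputs (Eq. 2)
        q_att = self.psi(self.relu(self.W_z(z) + self.W_g(g)))
        alpha = self.sigmoid(q_att)      # attention coefficients in [0, 1]
        return z * alpha                 # gated skip features (Eq. 1)
```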

Recurrent inception block (RCIN)
For better learning of contextual information, we use a recurrent block with $t = 2$ time steps. In the proposed RCIN, the recurrent block adds network depth while learning through weight sharing, and recurrent networks capture more spatial and contextual information as the time step increases. To learn both global and local contexts, we use the inception module together with the recurrent network [61]. Using two recurrent blocks together results in a large number of parameters, so we combine the inception module with a single recurrent block. The recurrent unit is defined as

$$z_{m,n,k}(t) = (W_k^{f})^{T}\, in_{m,n}^{f}(t) + (W_k^{r})^{T}\, in_{m,n}^{r}(t-1), \tag{4}$$

where $z_{m,n,k}(t)$ is the input of the recurrent block at time step $t$ for the vectorized patch of the $k$-th layer centered at $(m, n)$, and $W_k^{f}$ and $W_k^{r}$ are the weight vectors associated with the $k$-th layer feature map of the feed-forward and recurrent paths, respectively. Here $in_{m,n}^{f}(t)$ is the input of the feed-forward path at time step $t$ and $in_{m,n}^{r}(t-1)$ is the input of the recurrent path from time step $t-1$. In (4), the first term corresponds to the $1 \times 1$ convolutional layer and the second to the recurrent block. The intermediate state of this unit is given by

$$\hat{z}_{m,n,k}(t) = g_r\big(f(z_{m,n,k}(t))\big), \tag{5}$$

where $f$ denotes the ReLU activation function and $g_r$ is local response normalization [62]. Thus, we use a novel recurrent inception module that has fewer parameters than double recurrent blocks and learns both local and global contexts through the large and small filter sizes (i.e., $3 \times 3$, $5 \times 5$ and $1 \times 1$) of the inception block [63]. We pass the activation maps through a ReLU layer (except after the max-pooling layer), as shown in Fig. 2, to overcome the problem of vanishing gradients; ReLU also leads to faster and more efficient learning. To keep the network small, we fix the number of output filters at 16 for the $5 \times 5$ kernel, the largest kernel size in the inception block; reducing the number of filters with large kernels leads to comparatively fewer parameters. Table 2 summarizes the recurrent inception block architecture.
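The following is a compact sketch of such a block under stated assumptions: it combines $1 \times 1$, $3 \times 3$, and $5 \times 5$ inception branches (the $5 \times 5$ branch capped at 16 filters, as in the text) with a weight-sharing recurrent convolution applied for $t$ steps. The pooling branch of Fig. 2 is omitted, batch normalization stands in for local response normalization, and the remaining channel counts are illustrative, not the values of Table 2.

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    """Recurrent convolution unit: the same weights are applied t times,
    each step re-adding the feed-forward input (in the spirit of Eqs. 4-5)."""
    def __init__(self, channels, t=2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),   # stands in for local response normalization g_r
            nn.ReLU(inplace=True),      # activation f
        )

    def forward(self, x):
        state = self.conv(x)
        for _ in range(self.t - 1):
            state = self.conv(x + state)   # feed-forward input + recurrent state
        return state


class RecurrentInceptionBlock(nn.Module):
    """RCIN sketch: parallel 1x1 / 3x3 / 5x5 branches, then a recurrent conv.
    Assumes out_ch > 18 so the 1x1 and 3x3 branches get at least one filter."""
    def __init__(self, in_ch, out_ch, t=2):
        super().__init__()
        c5 = 16                           # fixed number of 5x5 filters (largest kernel)
        c1 = c3 = (out_ch - c5) // 2      # split the remaining filters (illustrative)
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)
        self.recurrent = RecurrentConv(c1 + c3 + c5, t=t)

    def forward(self, x):
        y = torch.cat([self.relu(self.branch1(x)),
                       self.relu(self.branch3(x)),
                       self.relu(self.branch5(x))], dim=1)
        return self.recurrent(y)
```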

Objective function using different loss functions
The proposed TVA-GAN is trained cyclically with a combination of loss functions. To improve network convergence, we combine multiple losses, which adds different curvatures to the optimization space. The losses used in this paper are the following: adversarial loss, cycle loss, synthesized loss, cycle-synthesized loss, and feature reconstruction loss (i.e., perceptual loss).

Adversarial loss: The adversarial loss is calculated between the generator and discriminator networks to estimate the discrepancy between the distribution of generated samples and the corresponding ground-truth training distribution. The discriminator network labels the generated samples as real or fake based on how well they match the ground truth. To tackle training instability, the proposed TVA-GAN uses the Least Squares GAN (LSGAN) loss [64]. The adversarial loss for the $X \to Y$ transformation, using the generator $G_{xy}$ that transforms images from domain $X$ to domain $Y$ and the discriminator $D_Y$ for domain $Y$, is described in (6).
Similarly, the adversarial loss for the $Y \to X$ transformation is described in (7), where $G_{yx}$ denotes the generator that transforms images from $Y$ to $X$ and $D_X$ is the discriminator for domain $X$.
Cycle loss: In the proposed method, the cycle-consistency loss (cycle loss) [16] is employed in the objective function. It is the $L_1$ distance between the real image and the cyclically reconstructed image. The forward-pass cycle loss is calculated as shown in (8), and the backward-pass cycle loss as shown in (9).

Synthesized loss: The synthesized loss is the $L_1$ distance between the generated and input images, computed without detaching from the computation graph so that the loss can be back-propagated. The synthesized losses in domains $X$ and $Y$ are defined in (10).

Cycle-synthesized loss: Motivated by CSGAN, the cycle-synthesized loss [65] is employed in the proposed model to enhance learning. The cycle-synthesized losses are given in (11) as

$$L_{Csl_1} = \|G_{xy}(G_{yx}(y)) - G_{xy}(x)\|_1 \quad \text{and} \quad L_{Csl_2} = \|G_{yx}(G_{xy}(x)) - G_{yx}(y)\|_1. \tag{11}$$

Feature reconstruction loss: We evaluate the feature reconstruction loss to align the feature representations of the target image and the generated image. We use the mean squared error between features extracted by a pre-trained VGG-19 network; this is also known as the perceptual loss and is described in (12). For a pre-trained network $\psi$, let $\psi_k(y)$ denote the activation feature map of size $W_k \times H_k \times C_k$ associated with the $k$-th convolution layer, where $C_k$, $W_k$, and $H_k$ are the number of channels, the width, and the height of the feature map, respectively. Processing an image $y$ through the $k$-th layer of the pre-trained VGG-19 network $\psi$ gives the feature map $\psi_k(y)$, and the dissimilarity in feature space is calculated between $\psi_k(y)$ and $\psi_k(\hat{y})$, where $y$ and $\hat{y}$ are the original and generated images, respectively. Using this function, we compute the feature reconstruction losses between the real, generated (fake), and cyclically reconstructed images in both domains, i.e., $L^{fake}_{real}(X)$, $L^{fake}_{real}(Y)$, $L^{recon}_{real}(X)$, $L^{recon}_{real}(Y)$, $L^{recon}_{fake}(X)$, and $L^{recon}_{fake}(Y)$.

Objective function: The final objective of TVA-GAN combines the adversarial, cycle, synthesized, and cycle-synthesized losses with the weighted feature reconstruction term $\lambda_{fr}\,\big(L^{fake}_{real}(X) + L^{fake}_{real}(Y) + L^{recon}_{real}(X) + L^{recon}_{real}(Y) + L^{recon}_{fake}(X) + L^{recon}_{fake}(Y)\big)$, where the $\lambda$ terms are the weight hyper-parameters for the different types of losses.
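The sketch below shows how these terms could be combined on the generator side. It is an illustration under stated assumptions, not the paper's implementation: the lambda values are placeholders, `vgg_features` stands for any pre-trained VGG-19 feature extractor, and only the generated-vs-real feature reconstruction terms are shown.

```python
import torch
import torch.nn.functional as F

def tvagan_generator_loss(G_xy, G_yx, D_X, D_Y, vgg_features, x, y,
                          lam_cyc=10.0, lam_sl=5.0, lam_csl=5.0, lam_fr=1.0):
    """Sketch of the combined generator objective described above."""
    fake_y, fake_x = G_xy(x), G_yx(y)
    rec_x, rec_y = G_yx(fake_y), G_xy(fake_x)

    # Least-squares adversarial loss (generator side), cf. Eqs. (6)-(7)
    d_fake_y, d_fake_x = D_Y(fake_y), D_X(fake_x)
    adv = F.mse_loss(d_fake_y, torch.ones_like(d_fake_y)) + \
          F.mse_loss(d_fake_x, torch.ones_like(d_fake_x))

    # Cycle-consistency losses, cf. Eqs. (8)-(9)
    cyc = F.l1_loss(rec_x, x) + F.l1_loss(rec_y, y)

    # Synthesized losses between generated images and paired targets, cf. Eq. (10)
    sl = F.l1_loss(fake_y, y) + F.l1_loss(fake_x, x)

    # Cycle-synthesized losses, cf. Eq. (11)
    csl = F.l1_loss(rec_y, fake_y) + F.l1_loss(rec_x, fake_x)

    # Feature reconstruction (perceptual) loss with pre-trained VGG-19, cf. Eq. (12)
    fr = F.mse_loss(vgg_features(fake_y), vgg_features(y)) + \
         F.mse_loss(vgg_features(fake_x), vgg_features(x))

    return adv + lam_cyc * cyc + lam_sl * sl + lam_csl * csl + lam_fr * fr
```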

Network architecture
The integration of the recurrent inception block with attention networks improves network learning in the image translation task. Zhang et al. proposed a self-attention [38] network for meaningful region-specific learning, which helps to handle long-range dependencies when training convolutional neural networks. We adopt the cyclic framework of CycleGAN as the base model for the translation task. The proposed method contains two built-in attention-guided generator networks for image translation between domains $X$ and $Y$ (i.e., $G_{xy}$ and $G_{yx}$), and two discriminator networks (i.e., $D_Y$ and $D_X$) associated with domains $Y$ and $X$, respectively. The proposed method can generate more realistic samples and delivers accurate image translation.
The attention mechanism with recurrent inception helps the proposed TVA-GAN focus on the important regions of the images without worrying about background information. The detailed network architecture is illustrated in Fig. 3.

Generator network: We use a recurrent-inception attention-based architecture in the generator network. The encoder of the generator includes recurrent-inception blocks, as shown in Fig. 2 and summarized in Table 2; the recurrent inception helps improve network performance and learn the optimal local sparse structure. The attention block follows the Attention-Gate [59] architecture outlined in Table 1 and is used only in the decoder, after every up-sampling layer, followed by a convolutional layer combined with batch normalization and a ReLU activation. The attention block computes a scalar attention value for each pixel vector through additive attention, learned via linear transformations implemented as $1 \times 1 \times 1$ channel-wise convolutions. The generator architecture is summarized in Table 3.
Discriminator network: We use the PatchGAN discriminator proposed in Pix2Pix, also known as the Markovian PatchGAN discriminator, with a five-layer architecture. The discriminator is fed $256 \times 256$ images produced by the generator network. The first layer is a convolution layer with a LeakyReLU activation; each subsequent convolution layer is followed by instance normalization and a LeakyReLU activation. We use a $4 \times 4$ kernel in each convolutional layer with stride 2 and padding 1, and the last layer contains only a convolution. The detailed discriminator architecture is summarized in Table 4.
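A minimal sketch of such a five-layer PatchGAN discriminator is given below. The filter counts follow the usual Pix2Pix convention and are an assumption here, not the contents of Table 4.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Markovian PatchGAN discriminator sketch: 4x4 kernels, stride 2, padding 1,
    instance normalization + LeakyReLU, final layer convolution only."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.model = nn.Sequential(
            # Layer 1: convolution + LeakyReLU (no normalization)
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            # Layers 2-4: convolution + instance norm + LeakyReLU
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, base * 8, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # Layer 5: convolution only, producing the patch-level real/fake map
            nn.Conv2d(base * 8, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):          # x: (N, 3, 256, 256) real or generated image
        return self.model(x)       # map of patch-wise real/fake scores
```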

Baseline methods
The proposed TVA-GAN for thermal-to-visible synthesis is compared with current baseline methods for image-to-image translation, each used with its original settings.

Pix2Pix [34]
Using paired images and a conditional generative adversarial network, Pix2Pix translates images from one domain to another with a U-Net generator network. We use the original settings to evaluate its performance.

DualGAN [18]
DualGAN follows nearly the same methodology as CycleGAN but uses a reconstruction loss rather than the cycle-consistency loss, and it does not require paired data for the image translation task. DualGAN with its original settings is used for performance evaluation.

PCSGAN [35]
PCSGAN also follows nearly the same methodology as CycleGAN but uses a cycle perceptual loss and a synthesized perceptual loss along with the cycle-consistency loss. It uses paired data for the image translation task.

AGGAN [36]
The attention-guided model (AGGAN), proposed by Mejjati et al., extracts an attention map to separate the foreground and background of images. The attention mechanism discovers the region to be translated in the opposite domain by finding the attention map.

AttentionGAN [37]
AttentionGAN uses the same mechanism introduced in CycleGAN with a built-in attention mechanism that finds an attention mask and a content mask to transform images from one domain to another.

ThermalGAN [66]
ThermalGAN was originally proposed for color-thermal cross-modality person re-identification using thermal face images. We train it in the paired scenario of thermal-visible face translation following the original training hyper-parameters.

Datasets used
We test our model on three thermal-visible datasets, namely WHU-IIP [67], Tufts Face Thermal2RGB [68], and our CVBL-CHILD [69] dataset. All three datasets contain thermal and real visible face pairs, and we use them for thermal-to-visible face synthesis with the proposed TVA-GAN and existing GAN-based methods. The WHU-IIP dataset contains 552 training image pairs and 240 testing image pairs. From the Tufts Face Thermal2RGB dataset, we use 403 image pairs for training and 156 for testing. The Tufts dataset contains more diverse images than WHU-IIP and thus helps judge the generalization capability of the proposed model; it includes people of various races with different facial attributes, including some wearing sunglasses or spectacles. Sample images from the WHU-IIP and Tufts datasets are shown in Figs. 4 and 5, respectively. Additionally, we use the CVBL-CHILD dataset, developed in our laboratory. It is built from thermal images and corresponding face images [69] that initially comprised 2500 unpaired, unregistered images in each domain from 125 subjects. After registering the thermal images to the corresponding visible images, the CVBL-CHILD dataset contains 2096 image pairs from 125 individuals with different poses for paired image-to-image translation. Of these, 1399 randomly chosen image pairs are used for training and the remaining 697 for testing in a paired manner.

Parameter settings
We use 3-channel images with a height and width of $256 \times 256$ for each dataset, for both training and testing. When a thermal image has only one channel (luminance), we convert it to 3 channels by stacking the luminance channel, so the image dimensions in each domain are $256 \times 256 \times 3$. As in CycleGAN, the image pool size is 50. We use the diffGrad optimizer [70] for the proposed TVA-GAN because previously proposed optimizers [71] suffer from poor automatic adjustment of the learning rate; diffGrad is employed with a learning rate of 0.0002 and momentum terms $\beta_1 = 0.5$ and $\beta_2 = 0.999$. A linear decay reduces the learning rate toward 0 by applying a reduction every 50 epochs, i.e., the learning rate is updated every 50 epochs. We use the LSGAN loss [64].
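A possible configuration of this optimizer and schedule is sketched below. It assumes the third-party `torch-optimizer` package for diffGrad, and the piecewise-linear decay rule (and the 200-epoch total) is our reading of the schedule, not a setting stated in the paper.

```python
import torch_optimizer  # third-party package providing DiffGrad (pip install torch-optimizer)
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(generator_params, total_epochs=200):
    """diffGrad with lr = 0.0002, betas = (0.5, 0.999), and a learning rate
    reduced linearly toward zero in steps applied every 50 epochs."""
    optimizer = torch_optimizer.DiffGrad(generator_params, lr=2e-4, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # Full rate for the first 50 epochs, then reduced at each
        # 50-epoch boundary until the multiplier reaches 0.
        step = epoch // 50
        return max(0.0, 1.0 - step / (total_epochs // 50))

    scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler
```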

Evaluation metrics
We use SSIM, PSNR, LPIPS [72], and VGG-FaceLoss as evaluation metrics for the quantitative comparison of the proposed TVA-GAN with state-of-the-art methods. The Structural Similarity Index (SSIM) measures the structural similarity between the generated visible face and the actual visible face image; a higher SSIM means closer structural similarity between the generated and authentic visible face images. Peak Signal-to-Noise Ratio (PSNR) is used to assess the quality of the generated images. Learned Perceptual Image Patch Similarity (LPIPS) measures the patch-level similarity between the generated face and the real face image, which helps assess the quality of images generated by the proposed TVA-GAN. We also compute VGG-FaceLoss to assess facial feature similarity, defined as the L1 distance between the facial features of the generated face and the real face, where a pre-trained VGGFace network extracts the facial features from the input and the corresponding generated images.
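The sketch below shows how three of these metrics could be computed for one image pair, assuming the `scikit-image` and `lpips` packages (a recent scikit-image with the `channel_axis` argument is assumed). The VGG-FaceLoss term, which needs a pre-trained VGGFace feature extractor, is omitted here.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips (an assumption)
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_model = lpips.LPIPS(net='alex')          # learned perceptual similarity network

def evaluate_pair(generated, target):
    """Compute SSIM, PSNR, and LPIPS for one HxWx3 uint8 image pair."""
    ssim = structural_similarity(generated, target, channel_axis=-1)
    psnr = peak_signal_noise_ratio(target, generated)

    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lpips_score = lpips_model(to_tensor(generated), to_tensor(target)).item()
    return ssim, psnr, lpips_score
```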

Quantitative result analysis
The proposed TVA-GAN delivers better results than the other state-of-the-art attention-guided methods AGGAN [36] and AttentionGAN [37], as well as the non-attention-guided methods Pix2Pix [34], CycleGAN [17], DualGAN [18], PCSGAN [35], and ThermalGAN [66]. TVA-GAN produces more realistic and natural-looking images than both the attention-based and the non-attention-based GAN models. For thermal-to-visible face synthesis, the quantitative results of TVA-GAN compared with the different state-of-the-art methods are reported in Table 5 for the WHU-IIP dataset, Table 6 for the Tufts Face Thermal2RGB dataset, and Table 7 for the CVBL-CHILD dataset. TVA-GAN performs better than the state-of-the-art methods in terms of SSIM, LPIPS, and VGG-FaceLoss on the WHU-IIP, Tufts Face Thermal2RGB, and CVBL-CHILD datasets; only PCSGAN performs slightly better on WHU-IIP in terms of PSNR. The proposed TVA-GAN also attains lower (better) scores for LPIPS and VGG-FaceLoss on all three datasets.

Qualitative result analysis
The qualitative comparison between the generated and ground-truth images for the proposed TVA-GAN and the existing GAN models is depicted in Figs. 6, 7 and 8. The results of the non-attention-guided methods Pix2Pix, CycleGAN, DualGAN, and PCSGAN, and the attention-guided methods AGGAN and AttentionGAN, are shown in Figs. 6, 7 and 8 for the Tufts Face Thermal2RGB, WHU-IIP, and CVBL-CHILD datasets, respectively. It is visible in these figures that TVA-GAN produces promising results on more diverse datasets than the existing state-of-the-art methods.
As observed in Fig. 6, TVA-GAN results are convincing on the less diverse WHU-IIP dataset compared to both attention-guided and non-attention-guided methods. The non-attention-guided methods Pix2Pix, CycleGAN, DualGAN, and PCSGAN miss facial features because, lacking attention, they do not accurately learn local and global feature details, while the attention-guided methods AGGAN and AttentionGAN learn foreground and background using masking and inverted masking but do not perform well when feature details are scarce. Our method, using recurrent inception with an attention block, performs better owing to a better visual representation while translating the images: TVA-GAN simultaneously translates foreground and background information, with the recurrent inception increasing network depth and learning both global and local features. Images generated by TVA-GAN are more structure-preserving and closer to the ground truth than those produced by the other methods. A similar trend is also observed for the more diverse Tufts Thermal2RGB and CVBL-CHILD datasets, whose qualitative results are reported in Figs. 7 and 8, respectively.

Impact of different losses used in TVA-GAN
We evaluate the impact of the different losses used to train the proposed TVA-GAN. Additionally, the visual information fidelity (VIF) [73] metric is used to approximate the visual information shared between the reference and generated images when evaluating the impact of the different loss functions. VIF assesses the quality of the generated images relative to the actual visible face images in a way that mimics the human visual system, and thus indicates how accurate the thermal-to-visible transformation produced by the proposed method is. We perform the ablation study over the Adversarial loss (AL), Cycle loss (Cyc), Cycle-synthesized loss (Csl), Synthesized loss (Sl), and Feature reconstruction loss (FR). The quantitative results over the WHU-IIP, Tufts Face Thermal2RGB, and CVBL-CHILD datasets are presented in Tables 8, 9 and 10, respectively.
The qualitative comparison of the various losses used in the proposed method is shown in Figs. 9 and 10 for the WHU-IIP and Tufts Face Thermal2RGB datasets, respectively, and in Fig. 11 for the CVBL-CHILD dataset.
These results are summarized as follows:

• The proposed TVA-GAN performs better than both attention-based and non-attention-based models for thermal-to-visible face synthesis.

• The proposed TVA-GAN generates more genuine visual representations from thermal face images, with more precise details and fewer artifacts in the generated images.

• With only the Adversarial loss, the model fails to distinguish between different persons on the WHU-IIP, Tufts Face Thermal2RGB, and CVBL-CHILD datasets, and the combination of Adversarial loss with Cycle loss is not sufficient for the Tufts Face Thermal2RGB and CVBL-CHILD datasets, even though it performs well during training. Hence, these two losses alone are not enough for generalization over different subjects. This is evident from the high-quality images generated after combining the Adversarial, Cycle, Synthesized, Cycle-synthesized, and Feature reconstruction losses.

Saliency map detection for proposed TVA-GAN
To justify the improved performance of the proposed TVA-GAN model, we analyze the saliency maps with and without the different proposed sub-modules, namely the RCIN and attention blocks. As the proposed GAN-based thermal-to-visible translation uses unlabeled data, we use saliency map detection inspired by the Vanilla Gradient method [74]. A saliency map shows each pixel's importance during image generation for unlabeled data: brighter pixels indicate regions that contribute more significantly to image generation than dark pixels. We perform this analysis for the proposed TVA-GAN with and without these sub-modules. It is observed that the proposed TVA-GAN, with both the RCIN and attention blocks, utilizes the gradients of the relevant features of the input image in the most suitable way. The U-Net-based model cannot attend to the right features of the input image, which confirms our choice of the RCIN module, as it better captures pixel relations. The saliency maps also show that attention plays an important role: without attention, the saliency map does not exploit gradients from important facial regions, whereas it does when attention is used. Hence, the proposed RCIN and attention modules provide complementary information and form a better generator model than the standard U-Net.
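A minimal sketch of such a vanilla-gradient saliency computation for a generator is shown below. Reducing the generated image to a scalar via its L1 norm before back-propagation is our choice; the paper does not specify the exact reduction.

```python
import torch

def saliency_map(generator, thermal_img):
    """Vanilla-gradient saliency [74]: back-propagate a scalar of the generator
    output to the input and visualise the per-pixel gradient magnitude.

    `generator` is a trained G_xy and `thermal_img` a (1, 3, H, W) tensor.
    """
    thermal_img = thermal_img.clone().requires_grad_(True)
    generated = generator(thermal_img)

    # Reduce the generated image to a scalar and back-propagate to the input
    generated.abs().sum().backward()

    # Per-pixel importance: maximum absolute gradient over the colour channels
    saliency = thermal_img.grad.abs().max(dim=1)[0]     # shape (1, H, W)
    return saliency / (saliency.max() + 1e-8)           # normalise to [0, 1]
```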

Face verification results for proposed TVA-GAN
To ensure satisfactory performance of the generated faces, we evaluate them using a face verification framework in this subsection. We plot receiver operating characteristic (ROC) curves in Fig. 13 over the face samples generated by TVA-GAN and the other GAN methods for the WHU-IIP and Tufts Face Thermal2RGB datasets. We use the DeepFace [27] framework with pre-trained face models to calculate the distance between each generated face sample and the corresponding ground-truth image, using cosine similarity [75] as the distance metric and taking the resulting distance as the score for the generated sample. Positive pairs consist of a ground-truth image and its corresponding generated image, while negative pairs consist of a ground-truth image and a randomly chosen generated sample from another subject. We calculate the cosine similarity score for the positive and negative pairs and use it as the score for the ROC plot.
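The sketch below illustrates one way this verification protocol could be scored, assuming the `deepface` and `scikit-learn` packages. Converting the DeepFace cosine distance to a similarity via (1 - distance) and the handling of `enforce_detection` are our choices, not settings stated in the paper.

```python
from deepface import DeepFace            # pip install deepface
from sklearn.metrics import roc_curve, auc

def verification_roc(positive_pairs, negative_pairs):
    """Score (ground-truth, generated) image-path pairs with DeepFace cosine
    distance and return the ROC curve and its AUC."""
    scores, labels = [], []
    for pairs, label in [(positive_pairs, 1), (negative_pairs, 0)]:
        for gt_path, gen_path in pairs:
            result = DeepFace.verify(gt_path, gen_path,
                                     distance_metric="cosine",
                                     enforce_detection=False)
            scores.append(1.0 - result["distance"])   # higher = more similar
            labels.append(label)

    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```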
The proposed TVA-GAN shows a clear gain over the compared methods in the ROC curves of Fig. 13.

Conclusion
This paper proposes a new attention-guided generative adversarial network for thermal-to-visible face synthesis (TVA-GAN). The proposed model generates more realistic face images than the state-of-the-art methods. We design the network with multiple losses to tackle the various problems related to image synthesis, blur, artifact generation, and semantic distortions; the losses include the Adversarial loss, Cycle loss, Cycle-synthesized loss, Feature reconstruction loss, and Synthesized loss. The proposed generator uses a recurrent inception block with an attention module. The recurrent inception block effectively learns global and local features while translating images from the thermal to the visible domain: it handles salient parts and contextual information locally and globally through the large and small kernels of the inception network and the extra depth provided by the recurrent layer. During decoding, the attention block captures large receptive fields and learns more semantic contextual information, handling multi-stage CNN localization by progressively suppressing the feature responses in unrelated background regions. The proposed TVA-GAN model is tested on the thermal-to-visible face synthesis problem using the WHU-IIP, Tufts Face Thermal2RGB, and CVBL-CHILD datasets. It outperforms the existing state-of-the-art non-attention-based GAN models such as Pix2Pix, CycleGAN, DualGAN, PCSGAN, and ThermalGAN, as well as the attention-based GAN models AGGAN and AttentionGAN, delivering more realistic faces that are closer to the target image, contain fewer artifacts, and preserve identity. Future work includes reducing the network size using knowledge distillation for resource-constrained edge devices, applying the proposed model to other types of image-to-image translation problems, and extending the proposed model to video data.