Quickly Transforming Discriminator in Pre-Trained GAN to Encoder

Abstract—Well-designed deep Generative Adversarial Networks (GANs) can generate high-quality (HQ) images. However, the discriminator in a GAN only serves to distinguish candidates produced by the generator from the true data distribution, and many generated samples remain unclear and unrealistic. Starting from a pre-trained GAN, we offer a self-supervised method to quickly transform the discriminator into an encoder and fine-tune the pre-trained GAN into an auto-encoder. The parameters of the pre-trained discriminator are reused and converted into an encoder that outputs a reformed latent space. The transformation changes the previous GAN into a symmetrical architecture, and the generator can reconstruct HQ images from the reformed latent space. With the generator fixed, the reformed latent space yields a better representation than that of the pre-trained GAN, and the performance of the pre-trained GAN is improved by the transformed encoder.


I. INTRODUCTION
Since DCGAN [1] was proposed, using a convolutional neural network to implement the generative model has become popular. However, as resolution increases, a tremendous number of learnable parameters and large-scale HQ datasets are required for training. Recent novel GANs, such as PGGAN [2], StyleGAN [3], and others [4], [5], [6], achieve stable training and effective results. These GANs take advantage of techniques such as pixel normalization and an equalized learning rate to control the update rate of training-related parameters. Compared with DCGAN, which doubles the number of parameters at each successive layer, the improvements above yield GANs that can generate higher-resolution images with fewer parameters. However, these GANs have no ability to reconstruct images from latent variables. In fact, the methods above inevitably produce many defective samples, with blurred local details and perturbed regions.
A GAN always maps a low-dimensional latent space to high-dimensional images. In the inversion process, embedding images into the latent space is analogous to mapping high-dimensional data onto a low-dimensional manifold. There are two types of methods for embedding images into latent space. Previous work [7] tries to embed images by searching for the corresponding latent code via perceptual loss [8], but since the method does not use an auto-encoder, it must optimize every target separately. Other works [9], [10], [11], [12] use auto-encoders to embed images, but the reformed latent space produced by the encoder cannot be reconstructed well, especially for HQ images, and such methods often suffer from latent-space entanglement.
So far, auto-encoders cannot perform the embedding task well, and there is still plenty of room for improvement. We propose a method to embed and reconstruct HQ images. Using a pre-trained GAN, we quickly transform its discriminator into an encoder, which uses self-generated images from the generator in place of ground-truth samples, so that we can improve the generative model's ability to reconstruct images and represent latent space. Meanwhile, the discriminator of the pre-trained GAN is slightly modified. We reuse the discriminative parameters to inherit the features of the pre-trained GAN and explore how the model performs as we increase the symmetry of its architecture and parameters. In conclusion, we highlight our contributions in the following three parts:
1. For pre-trained GANs, such as DCGAN and PGGAN, we propose a novel approach to quickly transform the discriminator into an encoder. Trained in a self-supervised manner for only a few epochs (limited to 10), the transformed encoder can embed images into a new latent space, and with this new latent space the transformed model outperforms the pre-trained GAN baseline.
2. We change the parameters of the discriminator output layer so that the output size of the discriminator equals the input size of the generator. On this basis, we train the modified discriminator to obtain a transformed encoder. By reusing the parameters of the pre-trained discriminator, the transformed encoder inherits the features of the pre-trained GAN and trains quickly.
3. We analyze the relationship between model performance and model architecture, and we observe that increasing the symmetry of the model composed of the transformed encoder and the generator improves the quality of the generated images. Finally, the above improvements reduce the chaotic areas in generated images and improve generative performance as measured by PSNR [13], SSIM [14], FID [15], and LPIPS [16].

II. PROPOSED APPROACH
From the perspective of training strategy, a GAN is trained by asynchronously updating two networks: a generator G and a discriminator D. In each step, we update the parameters of D according to the loss gap between the outputs of G and ground-truth (GT) samples x. Then we update the parameters of G along the descent direction given by the outputs of D.
Different from a GAN, training an auto-encoder is a synchronous process involving an encoder (E) and a decoder. Here, we regard the decoder as a generator (G) and try to transform D into E. In a Bayesian probabilistic model such as the variational auto-encoder (VAE) [17], E and G can also be denoted q(z|x) and p(x|z). Both E and G are updated in the same back-propagation pass. The input of the auto-encoder is a target image x sampled from the ground truth (GT), and the output is a generated image G(z). As in a VAE, we denote by z a multi-dimensional random variable; here, we choose z ∼ N(0, 1).
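The contrast between the two training strategies can be sketched as follows. This is a minimal illustration with toy 1-D networks (the layer sizes and loss choices here are illustrative assumptions, not the paper's architectures): the GAN alternates separate D and G updates, while the auto-encoder updates E and G in one backward pass.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
G = nn.Linear(8, 32)   # toy generator: latent z -> "image"
D = nn.Linear(32, 1)   # toy discriminator: "image" -> real/fake score
opt_g = optim.Adam(G.parameters(), lr=1e-3)
opt_d = optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(4, 32)          # ground-truth batch
z = torch.randn(4, 8)           # latent batch, z ~ N(0, 1)

# GAN: asynchronous updates -- D first, then G, in separate steps.
d_loss = bce(D(x), torch.ones(4, 1)) + bce(D(G(z).detach()), torch.zeros(4, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
g_loss = bce(D(G(z)), torch.ones(4, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Auto-encoder: synchronous update -- E and G share one backward pass.
E = nn.Linear(32, 8)            # encoder whose output size matches z
opt_ae = optim.Adam(list(E.parameters()) + list(G.parameters()), lr=1e-3)
recon_loss = nn.functional.mse_loss(G(E(x)), x)
opt_ae.zero_grad(); recon_loss.backward(); opt_ae.step()
```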
We briefly write the loss function of the vanilla VAE as Eq. 1. The first term is the distribution gap between GT samples and generations, measured by the KL divergence; the second term is the reconstruction loss L_R, which maximizes the expected log-likelihood of image generation:

L = D_KL(q(z|x) ‖ p(z)) + L_R, where L_R = −E_{q(z|x)}[log p(x|z)]. (1)
Generalized latent spaces lie inside the overall generative model and are produced by the corresponding layers. There are two commonly used latent spaces: one comes from the encoder's output layer, E(·), and the other from the decoder's input layer, z. Motivated by [10], we derive the gap between z and E(·) when they have the same dimensionality. Denoting one dimension of z as z_i, with mean µ_i and standard deviation σ_i on that dimension, the first term of Eq. 1 measures the gap between the two latent spaces as follows:

D_KL = (1/2) Σ_i (µ_i² + σ_i² − log σ_i² − 1). (2)

By using only z, we aim to minimize this latent-space gap in a self-supervised manner.
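The per-dimension KL gap of Eq. 2 has a simple closed form, which can be sketched as below. The function name is ours; the symbols follow the text (µ_i, σ_i per dimension of z).

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu_i, sigma_i^2) || N(0, 1) ) per dimension, summed over dims:
    0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# The gap vanishes exactly when every dimension already matches N(0, 1):
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # -> 0.0
```

The gap is non-negative and grows as the encoder's latent statistics drift away from the standard normal prior, which is what the self-supervised objective tries to prevent.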

A. Self-Supervised Transforming
The loss function of the original GAN is not suitable for training the transformed auto-encoder, so we modify the model architecture to resemble a VAE. However, the KL divergence is based on a Bayesian probabilistic model and cannot guide the fine details of the latent-space representation. Meanwhile, E cannot effectively reconstruct in-the-wild images when the target is far from the training dataset. Inspired by [7], [18], we propose a self-supervised way to train the transformed E, replacing the log-likelihood loss with a mean-square-error (MSE) loss and a perceptual loss [8]:

L_R = ‖x − G(z)‖² + L_percep(x, G(z)). (3)

Eq. 3 is the standard loss that compares G(z) with GT samples x. We treat the self-generated output G(z) as the target in place of x, so that the gap between generations and their reconstructions is reformulated as that between G(z) and G(E(G(z))). The optimization target is thus replaced, and there is no need to input x:

L_R = ‖G(z) − G(E(G(z)))‖² + L_percep(G(z), G(E(G(z)))). (4)

In addition, if we do not control the latent values output by E, the two latent spaces (z and E(·)) will be too far apart for image embedding. We therefore assume a fixed mean µ = 0 and variance σ = 1 when optimizing E(·). This is the key to improving encoding efficiency: E should find a more suitable space for image embedding, so that images are embedded in an overlapping space. The change also improves reconstruction performance, alleviating the coupling of latent variables. This regularization trick reformulates Eq. 2 as follows:

L_z = (1/2) Σ_i (µ_i² + σ_i² − log σ_i² − 1), (5)

where µ_i and σ_i are now the per-dimension statistics of the E(·) output. For pre-trained GANs, we usually do not know the specific values of µ and σ, so we use MSE to optimize the latent space produced by E. Eq. 5 can be replaced by the following loss function:

L_z = ‖E(G(z)) − z‖². (6)

Finally, we briefly summarize the final loss function in Eq. 7, which replaces the original auto-encoder loss function:

L = L_R + L_z. (7)

Fig. 1 demonstrates the principle of transforming a pre-trained discriminator into an encoder.
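One training step of this self-supervised objective can be sketched as below, assuming toy 1-D stand-ins for G and E (the perceptual term of Eq. 4 is omitted for brevity; only the MSE parts of Eqs. 4, 6, and 7 are shown).

```python
import torch
from torch import nn

torch.manual_seed(0)
G = nn.Linear(8, 32)            # stand-in for the frozen pre-trained generator
E = nn.Linear(32, 8)            # transformed encoder being trained
for p in G.parameters():        # fix the generator, train only E
    p.requires_grad_(False)

z = torch.randn(16, 8)          # z ~ N(0, 1): the only input needed, no GT x
gz = G(z)                       # self-generated "ground truth" G(z)
z_hat = E(gz)                   # reformed latent code E(G(z))

recon_loss = nn.functional.mse_loss(G(z_hat), gz)   # Eq. 4 (MSE part)
latent_loss = nn.functional.mse_loss(z_hat, z)      # Eq. 6
loss = recon_loss + latent_loss                     # Eq. 7
loss.backward()                 # gradients reach only E's parameters
```

Because G is frozen, the backward pass differentiates through G's layers but accumulates gradients only in E, which is what makes the transformation quick.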

B. Fine-tune Networks to Symmetrical Architecture
In common GANs, the role of D is only to classify x and G(z), and its output size is usually one dimension, smaller than the input size of G. Our goal is to transform D into E, so we fix the pre-trained G and train only the transformed E. To make G and E form a symmetrical encoding-decoding architecture, the output size of E (E_output) must equal the input size of G (G_input). The transformation amounts to adding more parameters to the output layer of D; with these added parameters, we increase the symmetry of the transformed model composed of G and E. We achieve the transformation by replacing the output layer of D (D_output). In each layer, we consider the input and output dimensions (input, output), and we only modify the output dimensions of the D output layer. In a fully connected layer (FC), we increase the number of output nodes; in a convolutional layer (CONV), we increase the number of output channels. The modified D_output for different GANs is listed in Tab. I. We use a 3-layer fully connected architecture (3-FC) and a 5-layer convolutional architecture (5-CONV) to process 28×28 images. Besides, DCGAN handles 256×256 images and PGGAN handles 1024×1024 images.
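The output-layer replacement can be sketched as follows. The layer sizes, the 512-dimensional latent, and the global pooling at the end are illustrative assumptions, not the exact architectures of Tab. I; the point is that only D's head is widened from one channel to G_input channels.

```python
import torch
from torch import nn

latent_dim = 512                # assumed G input size (PGGAN-style)

class SmallD(nn.Module):
    """Toy convolutional discriminator with a 1-channel output head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        # Vanilla discriminator head: one output channel (real/fake score).
        self.output = nn.Conv2d(128, 1, 1)

d = SmallD()
# Transform D -> E: replace only the output layer, widening its channels
# from 1 to latent_dim so that E_output equals G_input.
d.output = nn.Conv2d(128, latent_dim, 1)

feats = d.features(torch.randn(2, 3, 16, 16))
code = d.output(feats).mean(dim=(2, 3))   # pool spatially to a flat latent code
assert code.shape == (2, latent_dim)
```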

C. Sharing Discriminator Parameters to Encoders
A recent work [19] reuses the pre-trained parameters of the discriminator for the task of image-to-image style translation. Similar to Cycle-GAN [20], it needs another network to form a paired model to handle the image-to-image translation task. The difference in our work is that we reuse the parameters only to speed up training convergence; we do not use additional pairs of networks. Algorithm 1 shows the pseudo-code of the whole proposed method. In Fig. 2 and Fig. 5, we display pre-trained PGGAN samples and the corresponding reconstructions.
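In PyTorch, this kind of parameter reuse can be sketched with `load_state_dict(strict=False)`: every layer whose name and shape match inherits D's weights, while the widened output layer keeps its fresh initialization. The two toy networks below are illustrative stand-ins, not the paper's architectures.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Pre-trained discriminator (1-dim head) and transformed encoder (64-dim head).
D_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
E_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))

# Copy only the parameters whose shapes still match after the transformation.
e_state = E_net.state_dict()
reused = {k: v for k, v in D_net.state_dict().items()
          if k in e_state and v.shape == e_state[k].shape}
result = E_net.load_state_dict(reused, strict=False)

# The shared hidden layer is inherited; the new head stays freshly initialized.
assert torch.equal(E_net[0].weight, D_net[0].weight)
print(result.missing_keys)   # the head's parameters were not overwritten
```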

III. EXPERIMENT
We use three datasets in our experiments. The first is Fashion-MNIST: we use 60,000 training samples of size 28×28 with a batch size of 128. The second is CelebA [21]: of its 202,599 face images, we choose 30,000 samples for training, resized to 256×256 with a batch size of 30. The last is the HQ dataset CelebA-HQ [22], which contains 30,000 face images of size 1024×1024; we use CelebA-HQ only for evaluation. The framework is PyTorch (version 1.5.1, CUDA 10.2) on a GPU card (Nvidia Tesla V100-SXM3 32GB). We use the Adam optimizer with a learning rate of 0.0015, β1 = 0.5, β2 = 0.99, and ε = 1e-8 in all the experiments above.

A. Comparison between Different Architectures
On Fashion-MNIST, we use 5 convolutional layers (5-CONV) to build G and D, and different numbers of fully connected and convolutional layers to construct E. These architectures include a one-layer fully connected network (1-FC), 3 convolutional layers (3-CONV), 3 fully connected layers (3-FC), and 5 convolutional layers (5-CONV). When we choose different architectures to build E for auto-encoding with G, the images reconstructed from the reformed latent space are not much different. We report the evaluation metrics in Tab. II: 5,000 reconstructed images are evaluated by PSNR and SSIM after training for one epoch. We notice that performance improves as we increase the model's symmetry, even though the improvement is not obvious to human perception. We choose DCGAN to evaluate the symmetrical architecture on CelebA. Different from vanilla DCGAN, we replace batch normalization with spectral normalization [23] in D. With this replacement, D satisfies Lipschitz continuity, which makes training more stable. Compared with the previous size of 64×64, the modification allows DCGAN to handle 256×256 images, but it is still difficult to train, so we test two architectures on 256×256 images. As shown in Tab. I, the original D output is one channel, used to classify samples as fake or real. In the modified D (which we call E), we increase the parameters of the last layer by changing its channels, so that the output size of E equals the input size of G (see Tab. I). We train the two architectures separately without pre-training. The results show that training with the vanilla output layer fails as the number of epochs increases, whereas E makes training more effective. We report the training process in Fig. 3.
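The normalization swap described above can be sketched with PyTorch's built-in spectral normalization wrapper. The layer sizes here are illustrative, not the modified DCGAN's exact configuration; the point is that spectral normalization wraps each convolution to constrain its Lipschitz constant instead of normalizing activations as batch normalization does.

```python
import torch
from torch import nn
from torch.nn.utils import spectral_norm

# Discriminator stem with spectral normalization in place of batch norm.
d = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, 2, 1)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)), nn.LeakyReLU(0.2),
)
out = d(torch.randn(2, 3, 32, 32))   # each conv halves the spatial size
```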

B. Self-Supervised Training for Encoder Transform
On CelebA-HQ, we choose 10,000 samples for evaluation.
To train E, we use 20,000 images generated by the pre-trained PGGAN (with batch size 4). As shown in Fig. 2, the pre-trained samples still have many distortions and blurry blobs in local details. The first row shows images generated by PGGAN; the results of our transformation method are shown in the second row. By encoding through the reformed latent space, our method's samples (G(E(G(z)))) are better than the pre-trained model's samples (G(z)). On the LSUN dataset (car, tower, and horse) [24], we report more cases in Fig. 5 with pre-trained PGGAN samples and our reconstructions.

C. Reusing Parameters with Symmetric Architecture
To evaluate parameter reuse, we design three different encoders based on PGGAN's E (E_s, E_n, and E_p). PGGAN's network needs 9 blocks to go from 4×4 to 1024×1024. Here, the output block of the three encoders is the same. As for the other blocks, E_s has half the layer parameters of D (two-layer blocks reduced to one-layer blocks); E_n has the same parameter size as D but does not reuse D's parameters; and E_p has the same parameter size as E_n but reuses D's parameters. We report the experimental results in Tab. III. For FID, we compare the three encoders against GT (10,000 samples); for LPIPS, we compare against G(z) (2,000 samples). All results are obtained at the 10th epoch.
We also compare the three encoders during training. As shown in Fig. 4, in the early epochs the FID of E_s and E_n is slightly higher than that of the G(z) baseline. As the epochs increase, E_p outperforms E_s and E_n and converges faster; the LPIPS of E_p is also better than the others. This verifies our intuition that the transformed E is better when we reuse D's parameters and increase the model's symmetry.

IV. CONCLUSION
We offered a novel approach for quickly transforming the discriminator of a pre-trained GAN into an encoder, in which we adjust the parameters of the discriminator output layer to match the size of the generator input layer. We train the reformed encoder in a self-supervised manner. By reusing the parameters and increasing the networks' symmetry, our proposed scheme yields an efficient encoder that enhances the performance of latent space representation and image reconstruction.