MR-InpaintNet: Toward Deep Multi-Resolution Learning for Progressive Image Inpainting

Abstract—Deep learning-based image inpainting methods have greatly improved performance due to the powerful representation ability of deep learning. However, current deep inpainting methods still tend to produce unreasonable structures and blurry textures, implying that image inpainting remains a challenging topic due to the ill-posed nature of the task. To address these issues, we propose a novel deep multi-resolution learning-based progressive image inpainting method, termed MR-InpaintNet, which takes damaged images of different resolutions as input and then fuses the multi-resolution features for repairing the damaged images. The idea is motivated by the fact that images of different resolutions can provide different levels of feature information. Specifically, the low-resolution image provides strong semantic information and the high-resolution image offers detailed texture information. The middle-resolution image can be used to reduce the gap between the low-resolution and high-resolution images, which can further refine the inpainting result. To fuse and improve the multi-resolution features, a novel multi-resolution feature learning (MRFL) process is designed, which consists of a multi-resolution feature fusion (MRFF) module, an adaptive feature enhancement (AFE) module and a memory enhanced mechanism (MEM) module for information preservation. The refined multi-resolution features then contain both rich semantic information and detailed texture information from multiple resolutions. We further process the refined multi-resolution features with the decoder to obtain the recovered image. Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that the proposed MR-InpaintNet can effectively recover textures and structures, and performs favorably against state-of-the-art methods.


I. INTRODUCTION
Image inpainting is the task of generating visually plausible content to fill in the missing regions of a damaged image, or of repairing deteriorated images. Due to its broad practical applications, e.g., human face editing [3], privacy protection [4], photo disocclusion [5] and object removal [6], image inpainting has emerged as an important topic in the areas of image processing and computer vision. However, the image inpainting task still faces great challenges, since the restored global structures and local textures in the synthesized image often remain inconsistent with the known region.

H. Zheng. S. Yan is with Sea AI Lab (SAIL), Singapore; also with the National University of Singapore, Singapore 117583. E-mail: shuicheng.yan@gmail.com.
A meaningful image must contain redundancy and non-local similarities. As such, traditional inpainting methods assume that the contents of the missing region have similar information to the known region [7]-[13]. By seeking the similarity between different patches of an image, one can choose the most similar patches to fill in the missing region. Note that although copying the known patches can produce realistic local textures, it cannot capture high-level features or understand global contextual information. These issues can result in inaccurate or even wrong structures in the repaired image. Besides, the above assumption is unreasonable in some cases, since there is no guarantee that similar content for all missing regions can be found in the known area. As a result, simply copying the known patches prevents the method from generating new content, so the inpainting results will be poor when the masked region is large.
With the rapid development of deep neural networks, there is growing attention to designing advanced deep models for different computer vision tasks [14]-[17], and deep image inpainting methods have obtained impressive results compared to the traditional methods. Most existing deep inpainting models employ an encoder-decoder pipeline. To be specific, given a corrupted image and a corresponding mask, deep inpainting methods map the input to a feature space, and decode the features to obtain a repaired image. Carefully-designed modules and losses have also been used to enhance the global semantic information flow and capture local textural features. Thanks to these new modules, deep inpainting methods have been reported to generate visually plausible inpainting results. However, for a high-resolution image, convolutional neural network (CNN)-based deep models may be ineffective at building long-term relationships between distant regions, since the receptive field dilates only progressively with depth, which limits the contextual information the network can learn. As a result, the repaired image may contain artifacts, inaccurate structures and blurry textures that are inconsistent with the known region, as shown in Figure 1.
In this paper, we therefore present a novel deep inpainting network via deep multi-resolution learning, as illustrated in Figure 2, so that both the global structure information and detailed texture information can be discovered jointly. The main contributions of this paper are summarized as follows:
• Technically, a new and effective deep multi-resolution learning-based progressive image inpainting network, termed MR-InpaintNet, is proposed. Different from current deep inpainting models, MR-InpaintNet handles the task based on multiple damaged images and multiple masks from different resolutions, so that different levels of feature information can be learnt to enhance the performance. To be specific, high-resolution images supply exhaustive textural information to recover the texture details, and low-resolution images provide rich semantic information for better understanding the global context. Middle-resolution images can be applied to alleviate the gap between high-resolution and low-resolution images. To the best of our knowledge, this work is one of the few existing deep inpainting methods to jointly recover the structure and texture information of the missing regions based on multiple resolution images.
• To build the relationship between distant regions in a high-resolution image and obtain strong feature representations containing rich textural and semantic information, a novel module called multi-resolution feature learning (MRFL) is designed to aggregate the information from different resolutions, which can jointly fuse and enhance the multi-resolution features for image inpainting.
• In deep neural networks, useful information from shallow layers is usually degraded as the network deepens. Therefore, we also propose a memory enhanced mechanism (MEM) to solve this issue, which can prevent our inpainting model from information degradation as the number of layers increases. The features from those shallow layers can also be reused in the deeper layers for information preservation.
• Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets show that our inpainting method can generate more reasonable structures and realistic textures than current state-of-the-art deep inpainting methods.
This paper is organized as follows. A brief review of related inpainting methods is given in Section 2. In Section 3, we propose MR-InpaintNet. Experiments and analysis on several real-world datasets are presented in Section 4. Finally, the conclusion and some future work are offered in Section 5.

II. RELATED WORK
In this section, we briefly introduce some image inpainting algorithms related to our proposed method.

A. Traditional Image Inpainting
For an image to be repaired, traditional methods try to find similar components in the background to fill in the missing regions. The basic assumption of these traditional methods is that images always contain redundant information and non-local similarities. Roughly, traditional methods fall into two types, i.e., patch-based methods [7]-[10] and diffusion-based methods [12], [13].
Patch-based methods. Patch-based methods copy known patches into the missing region, under the prerequisite that the information of the missing parts can be found in the known area. As a result, these methods cannot generate new content to fill in the missing region. PatchMatch [7], one of the most classic patch-based methods, is an interactive image inpainting and editing method driven by a novel randomized algorithm that finds nearest-neighbor matches between patches. Ghorai et al. [10] also presented a multi-pyramid-based inpainting method that preserves texture consistency and structure continuity based on local patch statistics and sparse representation.
Diffusion-based methods. A robust inpainting method was proposed in [13], which uses the fractional-order nonlinear diffusion with difference curvature, and obtains a good balance between edge preservation and elimination of image staircase and speckle artifacts. For non-textured-based inpainting, total variation is a classic method. Li et al. [12] proposed an improved total variation method that computes the diffusion coefficient based on the damaged pixel and its neighbors.
Remarks. Traditional methods may produce realistic textures in some cases, since they simply copy information from the known regions. However, they cannot capture high-level features, extract contextual information, generate new content or retain semantic consistency, especially for large masks, so these methods may cause inconsistent structures in the recovered images.

B. Deep Learning-based Image Inpainting
Deep inpainting methods design novel encoder-decoder neural networks that perform feature learning for image inpainting. They encode the damaged images to obtain feature representations and then decode the feature maps to produce the repaired images. Deep image inpainting methods can be roughly divided into single-solution inpainting methods [1], [2], [18]-[38] and multi-solution inpainting methods [39], [40].
Single-solution inpainting methods. Most current image inpainting methods belong to this category, which produces one output image when repairing a masked image. Context encoders (CE) [18] is the first deep image inpainting method that takes an encoder-decoder pipeline as the main framework. For a damaged image, the useless information in the damaged region may greatly affect the recovered result. To this end, a new partial convolution (PC) [1] was proposed for image inpainting. Yu et al. [41] presented region normalization (RN) to replace batch normalization in deep neural networks according to the masked region. A mutual encoder-decoder network [2] was also proposed for image inpainting, which contains a carefully-designed feature equalization (FE) module to help the decoder obtain better results.
Multi-solution inpainting methods. Image inpainting only requires an output image with reasonable structures and textures, so the task has no unique solution. Therefore, researchers have also tried to obtain multiple solutions, such as PIC [39] and PD-GAN [40]. To be specific, PIC can synthesize multiple image inpainting results by using a probabilistically principled framework, and is the first method to explore diverse inpainting. PD-GAN proposed a novel normalization method to dynamically balance the realism and diversity inside the hole region. A perceptual diversity loss was also presented to further promote diverse content generation.
Remarks. Compared with traditional methods, deep inpainting methods can obtain high-level features to learn better structures. However, current deep inpainting methods still suffer from unreasonable structures, blurry textures and artifacts. As such, how to better restore the textures and structures of the missing area while ensuring consistency with the known area remains a difficult issue for deep inpainting.

III. MR-INPAINTNET: DEEP MULTI-RESOLUTION LEARNING FOR IMAGE INPAINTING
In this section, we introduce the architecture and modules of MR-InpaintNet. The main motivation and network architecture are first introduced. Based on deep multi-resolution learning, a novel multi-resolution feature learning (MRFL) process is proposed, which aggregates the information of multiple resolutions and can adaptively enhance the multi-resolution features. Finally, the loss functions are detailed.

A. Motivation
In the task of deep image inpainting, similar textural and structural information exists within an image, as can be seen in Figure 3. Therefore, we can apply the information of the known region to repair the unknown region. That is, we need to establish connections among similar regions in the image. However, the distance between similar patches may be large, making deep CNNs inept or ineffective at connecting distant regions, especially for high-resolution images. To address this issue, we propose to map the distant areas of a high-resolution image to a new data space in which the distance between similar regions is smaller. A simple idea is to down-sample the high-resolution image to obtain the corresponding lower-resolution images. Then, we can use the smaller patches in the lower-resolution images to represent the corresponding patches in the high-resolution image. As shown in Figure 3, we illustrate the detailed process of building the relationship between distant patches in a high-resolution image via multi-resolution learning. For a lower-resolution image, a CNN can quickly expand its receptive field to the entire picture as the number of network layers increases. As a result, a CNN can easily model the relationship between distant regions in lower-resolution images. Through the correspondence between high-resolution and lower-resolution images, the modeling of remote regions in high-resolution images can then be built indirectly.

B. Network Architecture
The framework of MR-InpaintNet is shown in Figure 2. As can be seen, MR-InpaintNet has three subnetworks, i.e., a generator to produce the inpainting results, a discriminator network to conduct adversarial training with the generator, and a pretrained VGG-Net to compute the relevant loss functions. Only the generator network is used in the test phase.
Given the ground-truth image $I_{gt}$ and a corresponding mask $M$ (0 for the known region and 1 for the unknown region), the input image $I_{in}$ can be defined as

$$I_{in} = I_{gt} \odot (1 - M),$$

where $\odot$ represents element-wise multiplication. The low-resolution and middle-resolution inputs $I_{in}^{low}$ and $I_{in}^{mid}$ can then be obtained by down-sampling $I_{gt}$ and $M$. The generator uses the classic U-Net [42] as the basic model, which contains three main components, i.e., a multi-resolution encoder (MR-Encoder), a multi-resolution feature learning (MRFL) module and a decoder. Specifically, the MR-Encoder adopts multiple encoders to handle the multi-resolution inputs. As a result, the low-resolution inputs can provide strong semantic information to promote semantic learning in MR-InpaintNet. The high-resolution inputs are encoded as features that supply better texture information, which is beneficial for consistency in details. The middle-resolution inputs can be used to balance and reduce the gap between semantic information and texture information in the feature space to obtain more realistic repaired results. The structure of the discriminator is shown in Figure 4; it distinguishes real pictures from synthesized pictures. The input of the discriminator has two elements, i.e., an image and a random patch of this image, where the entire image is adopted to keep the global consistency and the random patch is utilized to keep the local consistency. As such, the discriminator contains two branches: the first handles the entire image and the other deals with the patch. The extracted features from these two branches are then concatenated for distinguishing the real and generated pictures.
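As a concrete illustration, the multi-resolution inputs described above can be built by masking the ground truth and down-sampling. The following is a minimal PyTorch sketch; the function name, the 1/2 and 1/4 down-sampling factors, and the interpolation modes are our assumptions, since the paper does not specify them:

```python
import torch
import torch.nn.functional as F

def build_multires_inputs(i_gt, mask):
    """Build the three damaged inputs for a multi-resolution encoder.

    i_gt: ground-truth image, shape (B, 3, H, W), values in [0, 1].
    mask: binary mask, shape (B, 1, H, W); 0 = known, 1 = missing.
    """
    i_in = i_gt * (1.0 - mask)  # zero out the unknown region
    outs = {}
    for name, scale in [("high", 1.0), ("mid", 0.5), ("low", 0.25)]:
        if scale == 1.0:
            outs[name] = (i_in, mask)
        else:
            img = F.interpolate(i_in, scale_factor=scale,
                                mode="bilinear", align_corners=False)
            m = F.interpolate(mask, scale_factor=scale, mode="nearest")
            outs[name] = (img, m)
    return outs

ins = build_multires_inputs(torch.rand(1, 3, 256, 256),
                            (torch.rand(1, 1, 256, 256) > 0.5).float())
print(ins["low"][0].shape)  # torch.Size([1, 3, 64, 64])
```

Nearest-neighbor interpolation is used for the mask so it stays binary; the image can be resized with any smooth filter.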

C. Multi-Resolution Feature Learning (MRFL)
Given the encoded features from multiple resolutions, it is important to merge, refine and enhance the multi-resolution features to obtain stronger feature representations. We therefore propose MRFL to perform better multi-resolution representation learning. The detailed structure of MRFL is given in Figure 5; it contains two main units, i.e., a multi-resolution feature fusion (MRFF) module and an adaptive feature enhancement (AFE) module. Specifically, MRFF fuses the multi-resolution features in both the spatial and channel dimensions to construct the fused features, and AFE can adaptively embed the fused features into the multi-resolution features so that the multi-resolution features are refined. By cascading these two modules, MRFL is able to merge, refine and enhance the multi-resolution features progressively. In addition, we also present a novel memory enhanced mechanism (MEM) module in MRFL for information preservation.
Multi-resolution feature fusion (MRFF). The structure of MRFF is shown in Figure 6. MRFF fuses the multi-resolution features from two aspects, i.e., channel fusion and spatial fusion. For the input multi-resolution features $F_i^{h}$, $F_i^{m}$ and $F_i^{l}$ in the $i$-th stage, we use global average pooling to turn each input tensor into a vector:

$$v_i^{r} = \frac{1}{H_r W_r} \sum_{x=1}^{H_r} \sum_{y=1}^{W_r} F_i^{r}(x, y), \quad r \in \{h, m, l\},$$

where $H_r$ and $W_r$ denote the height and width of the input multi-resolution feature $F_i^{r}$, and $v_i^{h}$, $v_i^{m}$ and $v_i^{l}$ are the corresponding vectors in the $i$-th stage. To enhance the information flow, $v_i^{h}$, $v_i^{m}$ and $v_i^{l}$ are each processed by a fully-connected (FC) layer, so that information within the same resolution can be fully exchanged. Furthermore, to fuse these vectors from the FC layers and aggregate richer information from different resolutions, we sum them directly:

$$C_i^{fuse} = fc(v_i^{h}) + fc(v_i^{m}) + fc(v_i^{l}),$$

where $C_i^{fuse}$ denotes the result of channel fusion in the $i$-th stage and $fc(\cdot)$ denotes the FC operation.
For the spatial fusion, the input multi-resolution features $F_i^{h}$, $F_i^{m}$ and $F_i^{l}$ in the $i$-th stage are first handled by a convolution layer and a residual learning process to obtain the intermediate features $D_i^{h}$, $D_i^{m}$ and $D_i^{l}$:

$$D_i^{r} = F_i^{r} + conv(F_i^{r}), \quad r \in \{h, m, l\},$$

where $conv(\cdot)$ represents the convolution layer that promotes the feature transformation. The residual learning process can reuse the previous information and prevent the network from gradient vanishing. Then, $D_i^{h}$, $D_i^{m}$ and $D_i^{l}$ are concatenated so that the information from the different resolution images can be merged. Since the concatenation operation increases the number of channels, we use a convolution layer to reduce the channels and fuse them:

$$S_i^{fuse} = conv([D_i^{h}; D_i^{m}; D_i^{l}]),$$

where $S_i^{fuse}$ denotes the spatial fusion result in the $i$-th stage and $[\cdot;\cdot;\cdot]$ denotes concatenation. To further fuse the features from multiple resolutions, we merge the results of both the channel fusion and the spatial fusion by performing element-wise multiplication between $C_i^{fuse}$ and $S_i^{fuse}$:

$$F_i^{fuse} = C_i^{fuse} \otimes S_i^{fuse},$$

where $\otimes$ denotes element-wise multiplication with broadcasting, which mixes the information from multiple resolutions in the spatial dimension. As a result, $F_i^{fuse}$ merges the information from multiple resolutions in both the channel and spatial dimensions, which makes it a better representation of the context.
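Putting the channel and spatial branches together, the MRFF module can be sketched in PyTorch as follows. The channel count, kernel sizes, and the up-sampling of the lower-resolution branches to a common size before concatenation are our assumptions; the paper describes the two branches but not these hyper-parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRFF(nn.Module):
    """Sketch of multi-resolution feature fusion (channel + spatial)."""
    def __init__(self, c=64):
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(c, c) for _ in range(3))
        self.convs = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1)
                                   for _ in range(3))
        self.reduce = nn.Conv2d(3 * c, c, 1)  # shrink channels after concat

    def forward(self, feats):  # feats: [high, mid, low], each (B, C, h, w)
        # Channel fusion: global average pool -> FC per resolution -> sum.
        c_fuse = sum(fc(f.mean(dim=(2, 3))) for fc, f in zip(self.fcs, feats))
        # Spatial fusion: conv + residual, resize to the high resolution,
        # concatenate, then a 1x1 conv restores the channel count.
        tgt = feats[0].shape[2:]
        ds = [f + conv(f) for conv, f in zip(self.convs, feats)]
        ds = [d if d.shape[2:] == tgt else
              F.interpolate(d, size=tgt, mode="bilinear", align_corners=False)
              for d in ds]
        s_fuse = self.reduce(torch.cat(ds, dim=1))
        # Element-wise mix of the two branches (channel vector broadcast).
        return s_fuse * c_fuse[:, :, None, None]

mrff = MRFF(c=64)
feats = [torch.rand(1, 64, s, s) for s in (64, 32, 16)]
print(mrff(feats).shape)  # torch.Size([1, 64, 64, 64])
```

The broadcasted multiplication at the end realizes $F_i^{fuse} = C_i^{fuse} \otimes S_i^{fuse}$, with the channel vector expanded over the spatial dimensions.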
Memory enhanced mechanism (MEM). It is known that when the number of neural network layers increases, some useful information in shallow layers, which is helpful for the model to restore the damaged images, may be lost. MRFL may also suffer from this issue, since it cascades the MRFFs and AFEs. We therefore design a new memory enhanced mechanism (MEM) module to preserve the previous information and further enhance the representation ability. Given the fused feature maps $F_i^{fuse}$ $(i = 1, 2, \ldots, N)$ in the $i$-th stage of MRFL, the process of MEM can be described as

$$F_i^{mem} = \sum_{j=1}^{i} F_j^{fuse},$$

where $F_i^{mem}$ denotes the summation of all previous fused feature maps up to the $i$-th stage. Thus, the features from the shallow layers are retained. Simultaneously, the information from different layers can be aggregated together, which helps to address the long-term dependency problem.
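The mechanism above amounts to keeping a running sum of the fused feature maps across stages. A minimal sketch, assuming the accumulation is a plain element-wise sum (our reading of the description; the paper's exact form may differ):

```python
import torch

class MEM:
    """Sketch of the memory enhanced mechanism: accumulate all fused
    feature maps produced so far, so that features from shallow stages
    survive into deeper stages."""
    def __init__(self):
        self.memory = None  # holds sum_{j<=i} F_fuse_j

    def __call__(self, f_fuse):
        self.memory = f_fuse if self.memory is None else self.memory + f_fuse
        return self.memory

mem = MEM()
f1, f2 = torch.ones(1, 8, 4, 4), 2 * torch.ones(1, 8, 4, 4)
print(mem(f1).mean().item(), mem(f2).mean().item())  # 1.0 3.0
```

Each call returns the memory-enhanced map $F_i^{mem}$ for the current stage.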
Adaptive feature enhancement (AFE). Although the features fused by MRFF contain rich information from different resolutions, some important information from the original resolution may be dropped during the feature fusion process. To address this issue, we take the features of the different resolutions as the main component and use the fused features to enhance them. To be specific, a new module, AFE, is designed to enhance the features adaptively. The feature enhancement strategy of AFE is to compute the weighted summation of the fused features and the input multi-resolution features. The structure of AFE can be seen in Figure 7. Since the input multi-resolution features are the primary part, we simply set their weights to ones. As for the weights of the fused features, a ResBlock and the sigmoid function are used to learn them adaptively. The weights $\theta_i^{h}$, $\theta_i^{m}$ and $\theta_i^{l}$ in the $i$-th stage are obtained as

$$\theta_i^{r} = sigmoid(RB(F_i^{fuse})), \quad r \in \{h, m, l\},$$

where $RB(\cdot)$ represents the ResBlock and $sigmoid(\cdot)$ denotes the sigmoid function. Specifically, the ResBlock maps the input features into the parameter space, and the sigmoid function further constrains the values to $(0, 1)$. Note that values less than 1 ensure the dominance of the multi-resolution features. The refined multi-resolution features are then computed by the following weighted summation:

$$F_{i+1}^{r} = F_i^{r} + \theta_i^{r} \otimes F_i^{fuse}, \quad r \in \{h, m, l\},$$

where $F_{i+1}^{h}$, $F_{i+1}^{m}$ and $F_{i+1}^{l}$ denote the refined features, which serve as the input of the $(i+1)$-th stage. As a result, the final multi-resolution features contain the information from different resolutions and different layers, which enables MR-InpaintNet to better understand the semantics and texture details, and further obtain more realistic inpainting results.
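The AFE gating can be sketched as below. We assume the fused feature has already been resized to match the resolution branch it enhances, and the two-conv ResBlock layout is our assumption (the paper does not give its exact structure):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal two-conv residual block (assumed layout)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class AFE(nn.Module):
    """Sketch of adaptive feature enhancement: the input features keep
    a fixed weight of 1, while the fused features are gated by
    sigmoid(ResBlock(.)), constraining their weight to (0, 1)."""
    def __init__(self, c):
        super().__init__()
        self.rb = ResBlock(c)

    def forward(self, f_res, f_fuse):
        theta = torch.sigmoid(self.rb(f_fuse))  # adaptive gate in (0, 1)
        return f_res + theta * f_fuse           # weighted summation
```

Because the gate stays below 1, the resolution-specific features always dominate the refined output, matching the design intent stated above.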

D. Loss Functions
Given an input image $I_{in}$ and a mask $M$, the image predicted by our MR-InpaintNet is denoted as $I_{pre}$. The final output image $I_{out}$ merges the predicted image $I_{pre}$ and the input image $I_{in}$, because the input image contains the known regions. This process is performed as

$$I_{out} = I_{in} \odot (1 - M) + I_{pre} \odot M.$$

To better retain the consistency in both texture and structure, several loss functions are involved in training, i.e., a reconstruction loss, a perceptual loss, a style loss and an adversarial loss. Next, we briefly introduce them one by one.
Reconstruction loss. To measure the difference at the pixel level, we use two reconstruction losses covering the known and unknown regions:

$$L_{known} = \frac{1}{N} \left\| (I_{pre} - I_{gt}) \odot (1 - M) \right\|_1, \qquad L_{unknown} = \frac{1}{N} \left\| (I_{pre} - I_{gt}) \odot M \right\|_1,$$

where $N$ denotes the total number of elements in $I_{pre}$. The reconstruction losses guide our inpainting model to restore the damaged images pixel by pixel.

Perceptual loss. We then introduce the VGG-16 based perceptual loss to capture high-level semantic information and simulate human perception of image quality:

$$L_{perceptual} = \sum_{i} \frac{1}{C_i H_i W_i} \left\| \phi_i^{out} - \phi_i^{gt} \right\|_1,$$

where $C_i$, $H_i$ and $W_i$ denote the channel size, height and width of the $i$-th feature map in a pretrained and fixed VGG-16, and $\phi_i^{out}$ and $\phi_i^{gt}$ denote the corresponding feature maps of the $i$-th pooling layer. The perceptual loss instructs our MR-InpaintNet to learn semantic information, and further keeps the semantic coherence.
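The output compositing and the two region-wise reconstruction losses can be written directly from the formulas above. A minimal sketch (the function names are ours):

```python
import torch

def composite(i_pre, i_in, mask):
    """Keep known pixels from the input, fill the hole with the
    prediction (mask convention: 1 = missing, as in the paper)."""
    return i_in * (1.0 - mask) + i_pre * mask

def recon_losses(i_pre, i_gt, mask):
    """L1 reconstruction losses over the known and unknown regions,
    both normalized by the total element count N (our reading of the
    paper's normalization)."""
    n = i_pre.numel()
    l_known = torch.abs((i_pre - i_gt) * (1.0 - mask)).sum() / n
    l_unknown = torch.abs((i_pre - i_gt) * mask).sum() / n
    return l_known, l_unknown
```

With a single-pixel hole, for example, the unknown-region loss only accumulates error inside the hole while the known-region loss covers the rest.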
Style loss. Similar to the perceptual loss, a VGG-16 based style loss is also adopted to further ensure semantic consistency, defined as

$$L_{style} = \sum_{i} \left\| \frac{1}{C_i H_i W_i} \left( G(\phi_i^{out}) - G(\phi_i^{gt}) \right) \right\|_1,$$

where $G(\cdot)$ denotes the Gram matrix of the selected feature maps, which measures the difference between the recovered image and the ground truth. The style loss therefore encourages the content consistency of the synthesized image.
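A sketch of the Gram-matrix computation and the style loss follows; the normalization by $C_i H_i W_i$ matches the definition above, while the use of an L1 distance between Gram matrices is our assumption:

```python
import torch

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by
    C*H*W; returns (B, C, C)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feats_out, feats_gt):
    """L1 distance between Gram matrices of matched VGG feature maps
    (choice of L1 is an assumption)."""
    return sum(torch.abs(gram(a) - gram(b)).mean()
               for a, b in zip(feats_out, feats_gt))
```

In practice `feats_out` and `feats_gt` would be the pooling-layer activations of the fixed VGG-16 for the output and ground-truth images.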
Adversarial loss. The adversarial loss is used to alternately train the generator and discriminator:

$$L_{adv} = \min_{G} \max_{D} \; \mathbb{E}\left[\log D(I_{gt})\right] + \mathbb{E}\left[\log\left(1 - D(I_{out})\right)\right],$$

where $G$ and $D$ denote the generator and discriminator, respectively. This loss makes the restored image close to a real one, i.e., it makes the generated image look more realistic.
Total loss. Based on the above losses, the objective function of our MR-InpaintNet can be described as

$$L_{total} = \lambda_{known} L_{known} + \lambda_{unknown} L_{unknown} + \lambda_{style} L_{style} + \lambda_{perceptual} L_{perceptual} + \lambda_{adv} L_{adv},$$

where $\lambda_{known}$, $\lambda_{unknown}$, $\lambda_{style}$, $\lambda_{perceptual}$ and $\lambda_{adv}$ are the tradeoff parameters. The reconstruction losses $L_{known}$ and $L_{unknown}$ reduce the difference between the recovered image and the ground-truth image in Euclidean space. The style loss and perceptual loss measure the difference between the recovered image and the ground-truth image in semantic space. The adversarial loss makes the generated image closer to the real image from both local and global perspectives.
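The weighted combination is a one-liner; the default weights below are the values reported later in the experiments section:

```python
def total_loss(l_known, l_unknown, l_style, l_perceptual, l_adv,
               lambdas=(1.0, 4.0, 250.0, 0.1, 0.2)):
    """Weighted sum of the five training losses; the default lambdas
    are the tradeoff parameters from the experiments section."""
    lk, lu, ls, lp, la = lambdas
    return (lk * l_known + lu * l_unknown + ls * l_style
            + lp * l_perceptual + la * l_adv)
```

The same function works whether the individual losses are Python floats or scalar tensors.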

IV. EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the effectiveness of our method, along with comparison results against other related methods. We perform all experiments with the PyTorch framework [43] on an NVIDIA GeForce RTX 2080 Ti GPU. All images in both training and testing are resized to 256 × 256 pixels for each method for fair comparison. We adopt the Adam optimizer [44] with an initial learning rate of 0.0001 to train our model. Additionally, the learning rate decays by 10% per epoch until it falls below 0.00001. For the tradeoff parameters in the objective function of our MR-InpaintNet, we empirically set $\lambda_{known} = 1$, $\lambda_{unknown} = 4$, $\lambda_{style} = 250$, $\lambda_{perceptual} = 0.1$ and $\lambda_{adv} = 0.2$. In what follows, we introduce the datasets and evaluation metrics used in our experiments.
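The optimizer and learning-rate schedule described above can be set up as follows; treating the decay as a per-epoch multiplication by 0.9 with a hard floor is our reading of "decays by 10% per epoch until it falls below 0.00001":

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the generator
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def decay_lr(optimizer, factor=0.9, floor=1e-5):
    """Multiply the learning rate by `factor` each epoch, never
    letting it drop below `floor`."""
    for g in optimizer.param_groups:
        g["lr"] = max(g["lr"] * factor, floor)

for _ in range(5):  # e.g. after five epochs
    decay_lr(opt)
print(opt.param_groups[0]["lr"])  # 1e-4 * 0.9**5, about 5.9e-05
```

The same effect could also be obtained with `torch.optim.lr_scheduler.ExponentialLR` plus a manual floor check.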

A. Datasets and Evaluations
Datasets. In this study, we use four widely-used datasets for evaluating the image inpainting tasks, including three image datasets (i.e., Places2 [45], Paris Street View and CelebA-HQ [22]) and one mask dataset. For the mask dataset, we use the free-form masks provided in [1].
• Places2 dataset is a large-scale image dataset containing more than 1.8 million images from 365 scenes. Each scene has 5k training images and 100 test images. In this study, we pick four scenes, i.e., butte, snowfield, canyon and farm, for our experiments.
• Paris Street View dataset is composed of street view images in Paris. It contains 15k images, with 14.9k images for training and 100 images for testing.
• CelebA-HQ dataset is a human face dataset that is a high-quality version of the CelebA dataset [47]. There are 30k face images in the CelebA-HQ dataset. We randomly select 27k images for training and 3k images for testing.
• Mask dataset [1] contains 12k masks, all of which are irregular. The masks are divided into six types according to the size of the masked region.
Compared methods. We compare our MR-InpaintNet with the following representative deep inpainting methods:
• PC: a novel convolution layer termed partial convolution for image inpainting. The overall pipeline adopts a basic U-Net [42] architecture in which all convolutions are replaced with partial convolutions.
• PIC: a new and probabilistically principled framework with two parallel paths, which is the first image inpainting method to obtain diverse inpainting results.
• RN: two novel normalization methods called RN-B and RN-L, which allow the whole model to avoid the mean and variance shifts in image inpainting problems.
• FE: a new feature equalization module that can retain the consistency between structure and texture. A mutual encoder-decoder is used as its main framework.
For PIC, RN and FE, we directly use the source codes shared by the authors. For PC, we reimplement the code because the authors did not release theirs.
Evaluation metrics. To evaluate the inpainting performance of each method, three commonly-used evaluation metrics are utilized following [41]: the $l_1$ loss, the peak signal-to-noise ratio (PSNR), and the structural similarity (SSIM) with a window size of 11. Specifically, the $l_1$ loss evaluates the difference at the pixel level, so the smaller the $l_1$ loss, the better the inpainting result. In contrast, the larger the PSNR and SSIM values, the more realistic the restored images.
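The first two metrics are straightforward to compute; a minimal sketch is below (SSIM with an 11-pixel window is omitted for brevity; in practice a library implementation such as scikit-image's `structural_similarity` would be used):

```python
import torch

def l1_metric(out, gt):
    """Mean absolute pixel error (lower is better)."""
    return torch.abs(out - gt).mean().item()

def psnr(out, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better);
    assumes pixel values lie in [0, max_val]."""
    mse = torch.mean((out - gt) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```

For example, a uniform error of 0.1 on images in [0, 1] gives an $l_1$ loss of 0.1 and a PSNR of 20 dB.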

B. Image Inpainting Results
To fully compare the performance of each method, we provide both quantitative and qualitative inpainting results. For a fair comparison, the output image of each method is directly used as the restored image without any post-processing.

1) Results on Places2 dataset: We evaluate each method for repairing the corrupted images of the Places2 dataset. We first visualize some image inpainting results in Figure 8. As can be seen, when the mask is small, all inpainting methods can obtain plausible visual results. We also notice that there are some artifacts in the restored images of the other compared methods. For example, in the first row of Figure 8, the PIC method produces blurry textures; the recovered image of the PC method contains the shadow of the mask; the FE method fails to maintain the consistency of light intensity and synthesizes blurry textures; and there is also a faint shadow for the RN method. In contrast, our MR-InpaintNet generates the most realistic and accurate images. Table I reports the numerical results on the Places2 dataset with different levels of masks. We can find that: (1) the overall performance of our MR-InpaintNet is better than the other competitors in most cases; (2) when the mask is relatively small, the PC and FE methods obtain the best results on some metrics, and our MR-InpaintNet also delivers highly competitive results; (3) as the masked area increases, the performance of each inpainting method degrades, which indicates that it is more difficult to fill in a large missing region, since less known information can be used for prediction. This is also consistent with the visual results.
2) Results on Paris Street View dataset: We first visualize some inpainting results in Figure 9, from which we can see that: (1) each inpainting method can produce reasonable structures, which means that current methods can well understand the semantics of the damaged image; (2) when the scene is complex, both RN and PIC methods synthesize unrealistic details and blurry textures. In contrast, our MR-InpaintNet obtains better texture details and more consistent structures.
The numerical comparison results are reported in Table II. We see that: (1) our method obtains the best results for all six kinds of masks, compared with the competitors; (2) the inpainting performance of each method also degrades as the masked region grows, and the performance of our MR-InpaintNet degrades the slowest, which means that our method has stronger learning ability and robustness to corruption.
3) Results on CelebA-HQ dataset: We further evaluate each method on the CelebA-HQ dataset to show the generalization ability. Some examples of image inpainting results are shown in Figure 10. We see that all image inpainting methods can restore the damaged images well for small masks. We also observe that: (1) the PIC method cannot recover a plausible shape of the human face, as shown in the second row of Figure 10; the FE method suffers from brown spots on the human face; the RN method cannot obtain realistic eyes when the eyes are masked; and the restored images of the PC method contain some artifacts; (2) our MR-InpaintNet obtains more realistic human face images. Note that the hair in the third row of Figure 10 is difficult to recover, although our method generates the most reasonable hair. This is because the inpainting network confuses the black hat with the hair; when the network tries to recover the black hat and the hair at the same time, a blurry result is generated.
The numerical inpainting results on the CelebA-HQ dataset are reported in Table II. We can find that: (1) compared with the scene images in the Places2 dataset and the street view images in the Paris Street View dataset, there is an obvious performance improvement on CelebA-HQ, because human faces carry more prior knowledge and their structures and textures are easier to handle; (2) our MR-InpaintNet obtains the highest PSNR and SSIM and the lowest $l_1$ loss, which verifies its strong generalization ability on human faces.

C. Ablation Study
We conduct experiments to explore the effect of different modules on the performance of our MR-InpaintNet. To be specific, we evaluate the effectiveness of the deep multi-resolution learning and of the MRFF, AFE and MEM modules.
1) Effect of deep multi-resolution learning: To verify the effectiveness of multi-resolution learning for image inpainting, we conduct an ablation study that varies the resolutions of the input images. Four settings are evaluated: (1) High: using only the high-resolution images as input; (2) High + Mid: using the high-resolution and middle-resolution images as input, with two encoders handling them respectively; (3) High + Low: using the high-resolution and low-resolution images as input, with two encoders handling them respectively; (4) using images of all the resolutions as input, i.e., our MR-InpaintNet, denoted as Ours. As reported in Table III, the High setting obtains the worst performance among all the models, while our MR-InpaintNet obtains the best results. That is, taking images of multiple resolutions as input effectively helps build relationships between distant regions in high-resolution images.
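The multi-resolution inputs in the settings above can be obtained by simply downsampling the damaged image. The sketch below builds such a three-level pyramid with 2x average pooling; the exact downsampling operator and resolutions (256/128/64 here) are our assumptions for illustration, not specified by this section.

```python
import numpy as np

def downsample_half(img):
    """2x average-pool downsampling of an (H, W, C) image (H, W even)."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def build_pyramid(img):
    """High / middle / low resolution versions of a damaged image."""
    high = img
    mid = downsample_half(high)   # e.g. 256 -> 128
    low = downsample_half(mid)    # e.g. 128 -> 64
    return high, mid, low

high, mid, low = build_pyramid(np.zeros((256, 256, 3)))
print(high.shape, mid.shape, low.shape)  # (256, 256, 3) (128, 128, 3) (64, 64, 3)
```

In the ablation settings, each resolution level is handled by its own encoder; the High setting simply drops the `mid` and `low` branches.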
We further visualize the feature maps of the different encoders over the multi-resolution images in Figure 11. We see that the feature maps from the encoder handling high-resolution images contain many detailed textures, whereas few details are directly visible in the feature maps from the encoder handling low-resolution images. In addition, the feature maps from the encoder handling middle-resolution images are more concrete than those from the low-resolution encoder. As mentioned earlier, the middle resolution serves as an intermediate level that adjusts and reduces the gap between the high and low resolutions, which further improves the image inpainting results.
2) Effect of MRFF: To evaluate the effectiveness of the MRFF module, we remove it from the framework of our MR-InpaintNet for comparison, denoted as W/O MRFF. That is, W/O MRFF uses three independent paths to deal with the input images of different resolutions; the extracted multi-resolution features are not fused, but are directly concatenated and fed into the decoder. From Table IV, we see an obvious degradation in terms of ℓ1 loss, PSNR and SSIM. This implies that the MRFF module is indeed effective for the inpainting task: MRFF fuses the features from different resolutions and exploits the information at different levels, so that better feature representations can be obtained to enhance image inpainting.
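The contrast between the two variants can be sketched as follows: W/O MRFF only aligns and concatenates the per-resolution features, while a fusion step additionally mixes information across the resolution branches. The nearest-neighbour upsampling and the channel-mixing matrix `w` (a stand-in for a learned 1x1 convolution) are illustrative assumptions, not the paper's actual MRFF architecture.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fuse_features(f_high, f_mid, f_low, w):
    """Toy MRFF-style fusion: align resolutions, then mix channels with
    a learned matrix `w`. W/O MRFF would stop at the concatenation."""
    f_mid_up = upsample2x(f_mid)
    f_low_up = upsample2x(upsample2x(f_low))
    stacked = np.concatenate([f_high, f_mid_up, f_low_up], axis=-1)
    return stacked @ w  # (H, W, 3C) @ (3C, C) -> (H, W, C)

c = 8
f_high = np.random.rand(32, 32, c)
f_mid = np.random.rand(16, 16, c)
f_low = np.random.rand(8, 8, c)
w = np.random.rand(3 * c, c) / (3 * c)
fused = fuse_features(f_high, f_mid, f_low, w)
print(fused.shape)  # (32, 32, 8)
```

The point of the ablation is precisely the last line of `fuse_features`: without the cross-resolution mixing, each branch's information reaches the decoder unblended, which Table IV shows to be inferior.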

3) Effect of AFE:
To examine the effect of the AFE module, we replace the weighted summation operation in AFE with a plain sum of the multi-resolution features and the fused features, denoted as W/O AFE. Different from MR-InpaintNet, which adaptively learns the weights of the fused features, W/O AFE fixes these weights to ones, i.e., equal to the weight of the multi-resolution features. As can be seen in Table IV, the performance of W/O AFE degrades compared with that of MR-InpaintNet, which verifies the effectiveness of the adaptive weighting in AFE.
4) Effect of MEM: We further remove the MEM module from MR-InpaintNet, denoted as W/O MEM. From Table IV, the performance of W/O MEM also degrades compared with that of MR-InpaintNet. This means that removing the MEM module indeed hurts the results: without MEM, the deep neural network loses some information from the shallow layers as it propagates forward. Our MR-InpaintNet equipped with MEM resolves this issue, since it preserves the previous information and directly reuses the features from shallow layers in deeper layers.
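The two ablated operations can be summarized in a few lines. Below, `afe` shows the weighted summation (W/O AFE corresponds to fixing `alpha` to ones), and `mem_forward` shows the idea of re-injecting preserved shallow-layer features into deeper layers. Both are simplified sketches under our own assumptions (per-channel weights, additive reuse), not the paper's exact formulations.

```python
import numpy as np

def afe(feat, fused, alpha):
    """Adaptive feature enhancement (sketch): weight the fused features
    before adding them to the multi-resolution features.
    W/O AFE fixes alpha to ones, i.e. a plain sum."""
    return feat + alpha * fused

def mem_forward(x, layers):
    """Memory enhanced mechanism (sketch): preserve the shallow-layer
    features and directly reuse them at every deeper layer, so forward
    propagation does not forget early information."""
    shallow = x                 # preserved shallow-layer information
    for layer in layers:
        x = layer(x) + shallow  # reuse shallow features in deeper layers
    return x

feat = np.ones((4, 4, 2))
fused = np.full((4, 4, 2), 0.5)
alpha = np.full((1, 1, 2), 0.2)  # learned per-channel weights (assumed form)
enhanced = afe(feat, fused, alpha)
print(enhanced[0, 0])  # [1.1 1.1]
```

In this picture, W/O MEM simply drops the `+ shallow` term, so each layer sees only the output of the previous one.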

5) Visual analysis of deep layers in decoder:
The idea of deep neural networks originates from the neurons of the human brain, and such networks often imitate how humans perform specific tasks. We therefore explore how our MR-InpaintNet repairs damaged images. When humans reconstruct an image, they usually draw the outline first and then add the details; in other words, the overall structures are constructed first, and the detailed textures are restored afterwards. In Figure 12, we show the feature maps obtained from different layers in the decoder of our method. We can see that more and more detailed information appears from the shallow to the deep layers. Specifically, in the first layer of the decoder, we can only see a blurred outline. In the second and third layers, the feature maps contain some obvious structures. In the fourth layer, the feature maps clearly contain rich structural and textural information. Hence, in terms of repairing a damaged image, the learning process of our MR-InpaintNet is similar to that of humans according to Figure 12. To a certain extent, these observations demonstrate that the design of our MR-InpaintNet is reasonable.

V. CONCLUSION
We presented deep multi-resolution learning for the image inpainting task and proposed a novel deep inpainting network termed MR-InpaintNet. Based on images of different resolutions, MR-InpaintNet can build relationships between distant regions and obtain rich information at different levels, including both detailed texture information and meaningful semantic information. To propagate these relationships and further improve the features, we designed two new modules: a multi-resolution feature fusion module for fusing the features from different resolutions, and an adaptive feature enhancement module for automatically refining the multi-resolution features by embedding the fused features. Furthermore, to avoid information loss in deeper layers and prevent MR-InpaintNet from performance degradation, we also presented a memory enhanced mechanism that retains the previous information in each layer. Extensive visual and numerical results have demonstrated the effectiveness of our MR-InpaintNet. In the future, we hope that image inpainting results can be controlled by users in real-world applications; designing an image inpainting algorithm that humans can flexibly intervene in is therefore an important open problem. Besides, applying deep multi-resolution learning to other computer vision tasks is also an interesting direction for future work.