Yan Luo

and 5 more

Text-guided image diffusion models have developed rapidly for the generation and editing of high-quality images. To extend this success to video editing, several efforts have combined image generation with video editing, but they achieve only inferior performance. We attribute this to two challenges: 1) unlike static image generation, dynamic video content makes it difficult to ensure temporal fidelity, i.e., motion consistency across frames; 2) the randomness of the frame generation process makes it hard to consistently retain spatial fidelity to the original detailed features. In this paper, we propose a new high-fidelity diffusion model-based zero-shot text-guided video editing network, called HiFiVEditor, which aims to conduct effective video editing with high fidelity to the original video's detailed and dynamic information. Specifically, we propose a Spatial-Temporal Fidelity Block (STFB) that restores spatial features by enlarging the spatial perceptual field to avoid the loss of important information, and captures more dynamic information by attending to all frames, preserving temporal consistency and achieving better temporal fidelity. In addition, we introduce Null-Text Embedding to create a soft text embedding that optimizes the noise learning process so that the latent noise can be aligned with the prompt. Furthermore, to tune the video style and render it more realistic, we employ a Prior-Guided Perceptual Loss that constrains the predictions from deviating from the original video style. Extensive experiments demonstrate superior video editing capability compared to existing works.
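
As a rough illustration of the kind of spatial-temporal attention the STFB description suggests, the following is a minimal PyTorch sketch in which every spatial token attends across all frames at once; the module name, tensor layout, and attention design are our own assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of joint spatial-temporal attention over all frames,
# assuming PyTorch; names and shapes are hypothetical, not HiFiVEditor's code.
import torch
import torch.nn as nn


class SpatialTemporalFidelityBlock(nn.Module):
    """Attends over all frames jointly so every spatial location can draw on
    the full temporal context (one possible reading of the STFB idea)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim) latent features of all frames
        b, f, hw, d = x.shape
        tokens = x.reshape(b, f * hw, d)            # flatten frames and space together
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)                 # full spatial-temporal attention
        return (tokens + out).reshape(b, f, hw, d)  # residual connection


# Usage: 2 frames of 16x16 latents with 64 channels.
feats = torch.randn(1, 2, 16 * 16, 64)
print(SpatialTemporalFidelityBlock(64)(feats).shape)  # torch.Size([1, 2, 256, 64])
```

Attending jointly over frames and spatial positions is one straightforward way to trade extra computation for stronger cross-frame consistency.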

Zhao Zhang

and 5 more

Post-training quantization (PTQ) can reduce the memory footprint and latency of deep model inference while preserving model accuracy, using only a small unlabeled calibration set and without retraining on the full training set. To calibrate a quantized model, current PTQ methods usually select some unlabeled data from the training set at random as calibration data. However, we prove that random data selection results in performance instability and degradation due to activation distribution mismatch. In this paper, we address the crucial task of optimal calibration data selection and propose a novel one-shot calibration data selection method termed SelectQ, which selects specific data for calibration via dynamic clustering. SelectQ uses activation statistics and performs layer-wise clustering to learn the activation distribution of the training set. For this purpose, a new metric called Knowledge Distance is proposed to calculate the distances of activation statistics from the cluster centroids. Finally, after calibration with the selected data, quantization noise can be alleviated by mitigating the distribution mismatch within activations. Extensive experiments on the ImageNet dataset show that SelectQ increases the Top-1 accuracy of ResNet18 by over 15% in 4-bit quantization compared to a randomly sampled calibration set. Notably, SelectQ involves neither backward propagation nor Batch Normalization parameters, which means it has fewer limitations in practical applications.
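
A minimal sketch of how clustering-based calibration selection could look, assuming NumPy and scikit-learn; the choice of activation statistics, the use of k-means, and this instantiation of the "Knowledge Distance" are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch: select calibration samples whose activation statistics best cover the
# clustered activation distribution of the training set (assumed instantiation).
import numpy as np
from sklearn.cluster import KMeans


def select_calibration_set(act_stats: np.ndarray, num_calib: int, num_clusters: int = 8):
    """act_stats: (num_samples, feat_dim) per-sample activation statistics
    (e.g. channel-wise means) collected from one layer of the full-precision model."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(act_stats)
    # Distance of each sample's statistics to its nearest centroid
    # (one assumed reading of the paper's Knowledge Distance metric).
    dist = np.linalg.norm(act_stats - km.cluster_centers_[km.labels_], axis=1)
    selected = []
    per_cluster = num_calib // num_clusters
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        # keep the samples closest to the centroid, i.e. most representative
        selected.extend(members[np.argsort(dist[members])[:per_cluster]])
    return np.asarray(selected)


# Usage with random statistics standing in for real activations.
stats = np.random.randn(1000, 64)
print(select_calibration_set(stats, num_calib=32).shape)  # (32,)
```

Selecting per-cluster representatives rather than random samples is what keeps the calibration set's activation statistics close to those of the full training set.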

Huan Zhang

and 5 more

Deep learning-based image inpainting methods have greatly improved performance thanks to the powerful representation ability of deep networks. However, current deep inpainting methods still tend to produce unreasonable structures and blurry textures, implying that image inpainting remains a challenging topic due to the ill-posed nature of the task. To address these issues, we propose a novel deep multi-resolution learning-based progressive image inpainting method, termed MR-InpaintNet, which takes damaged images at different resolutions as input and then fuses the multi-resolution features to repair the damaged images. The idea is motivated by the fact that images at different resolutions provide different levels of feature information: the low-resolution image provides strong semantic information, the high-resolution image offers detailed texture information, and the middle-resolution image reduces the gap between the two, further refining the inpainting result. To fuse and improve the multi-resolution features, a novel multi-resolution feature learning (MRFL) process is designed, which consists of a multi-resolution feature fusion (MRFF) module, an adaptive feature enhancement (AFE) module and a memory enhanced mechanism (MEM) module for information preservation. The refined multi-resolution features then contain both rich semantic information and detailed texture information from multiple resolutions, and are passed to the decoder to obtain the recovered image. Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that the proposed MR-InpaintNet effectively recovers textures and structures, and performs favorably against state-of-the-art methods.
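
The following is a minimal PyTorch sketch of multi-resolution feature fusion in the spirit of the MRFF module; the layer choices, scales, and fusion rule are illustrative assumptions, not the paper's architecture.

```python
# Sketch: encode the damaged image at three resolutions and fuse the features,
# so low-resolution semantics and high-resolution textures are combined.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiResolutionFusion(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.enc = nn.ModuleList([nn.Conv2d(3, channels, 3, padding=1) for _ in range(3)])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        h, w = image.shape[-2:]
        feats = []
        for scale, conv in zip((0.25, 0.5, 1.0), self.enc):
            x = image if scale == 1.0 else F.interpolate(
                image, scale_factor=scale, mode="bilinear", align_corners=False)
            f = F.relu(conv(x))
            # bring every resolution back to the full size before fusion
            feats.append(F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))


# Usage: a 3x256x256 damaged image produces a fused 32-channel feature map.
print(MultiResolutionFusion()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 32, 256, 256])
```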

Suiyi Zhao

and 6 more

Unsupervised blind motion deblurring remains a challenging topic due to its inherent ill-posed nature and the lack of paired data and accurate quality assessment methods. In addition, virtually all current studies suffer from large chromatic aberration between the latent and original images, which directly causes the loss of image details. However, how to appropriately model and quantify this chromatic aberration remains an open and pressing issue. In this paper, we propose a general unsupervised color retention network, termed CRNet, for blind motion deblurring, which can be easily extended to other tasks suffering from chromatic aberration. The new concepts of blur offset estimation and adaptive blur correction are introduced so that more detailed information can be retained to improve the deblurring task. Specifically, CRNet first learns a mapping from the blurry image to a motion offset, rather than directly from the blurry image to the latent image as in previous work. With the obtained motion offset, an adaptive blur correction operation is then performed on the original blurry image to obtain the latent image, thereby retaining the color information of the image to the greatest extent. A new pyramid global blur feature perception module is also designed to further retain color information and extract more blur information. To assess the color retention ability of image deblurring methods, we present a new chromatic aberration quantization metric termed Color-Sensitive Error (CSE), in line with human perception, which can be applied both with and without paired data. Extensive experiments demonstrate the effectiveness of CRNet for color retention in unsupervised deblurring.
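
As a rough sketch of the "predict an offset, then correct the original blurry image" idea, the snippet below (PyTorch assumed) predicts a per-pixel offset field and resamples the input image with it, so every output color comes directly from the original input; the offset network and this warping-based correction are illustrative assumptions, not CRNet's actual design.

```python
# Sketch: learn blurry image -> offset field, then sample the blurry image
# itself at the offset locations (hypothetical stand-in for blur correction).
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetCorrection(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 2, 3, padding=1),  # 2-channel (dx, dy) offsets
        )

    def forward(self, blurry: torch.Tensor) -> torch.Tensor:
        b, _, h, w = blurry.shape
        # offsets are in normalized [-1, 1] grid coordinates
        offsets = self.offset_net(blurry).permute(0, 2, 3, 1)  # (b, h, w, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1).to(blurry)
        # correct the original image by sampling it at the offset locations
        return F.grid_sample(blurry, grid + offsets, align_corners=False)


# Usage: the latent image is resampled from the original blurry input.
print(OffsetCorrection()(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])
```

Because the output is resampled from the input rather than regenerated from scratch, its color statistics stay tied to the original image, which is the intuition behind retaining color information.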

Zhao Zhang

and 5 more

The Single Image Deraining (SID) task aims at recovering the rain-free background from an image degraded by rain streaks and rain accumulation. Owing to the powerful fitting ability of deep neural networks and massive training data, data-driven deep SID methods have obtained significant improvements over traditional ones. Current SID methods usually focus on improving deraining performance by proposing different kinds of deraining networks, while neglecting the interpretation of the solving process. As a result, their generalization ability may still be limited in real-world scenarios, and the deraining results also cannot effectively improve the performance of subsequent high-level tasks (e.g., object detection). To explore these issues, in this paper we re-examine the three important factors (i.e., data, rain model and network architecture) of the SID problem, and analyze them by proposing new and more reasonable criteria (i.e., general vs. specific, synthetic vs. mathematical, black-box vs. white-box). We also study the relationship of the three factors from a new, data-centric perspective, and reveal two different solving paradigms (explicit vs. implicit) for the SID task. We further discuss current mainstream data-driven SID methods from five aspects, i.e., training strategy, network pipeline, domain knowledge, data preprocessing, and objective function, and summarize some useful conclusions by statistics. In addition, we study one of the three factors, i.e., data, in depth, and measure the performance of current methods on different datasets through extensive experiments to reveal the effectiveness of SID data. Finally, with this comprehensive review and in-depth analysis, we draw some valuable conclusions and suggestions for future research. Please cite this work as: Zhao Zhang, Yanyan Wei, Haijun Zhang, Yi Yang, Shuicheng Yan and Meng Wang, “Data-Driven Single Image Deraining: A Comprehensive Review and New Perspectives,” Pattern Recognition (PR), May 2023.
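
For context on what a "mathematical" rain model means here, the simplest and most widely used formulation in the SID literature composes the observation additively from the background and one or more rain-streak layers (more elaborate variants further account for rain accumulation and atmospheric veiling):

$$ O = B + \sum_{i=1}^{n} R_i, $$

where $O$ is the observed rainy image, $B$ is the rain-free background to be recovered, and $R_i$ denotes the $i$-th rain-streak layer.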