Abstract
Text-guided diffusion models have advanced rapidly in generating and
editing high-quality images. To extend this success to video editing,
several efforts have adapted image generation models to videos, yet
they achieve only inferior performance. We attribute this to two
challenges: 1) unlike static image generation, dynamic video content
makes it difficult to ensure temporal fidelity, i.e., motion
consistency across frames; 2) the randomness of the per-frame
generation process makes it hard to consistently preserve spatial
fidelity to the original detailed features. In this paper, we propose
HiFiVEditor, a diffusion-based zero-shot text-guided video editing
network that performs effective edits while remaining faithful to the
original video's detailed and dynamic information. Specifically, we
propose a Spatial-Temporal Fidelity Block (STFB) that restores spatial
features by enlarging the spatial receptive field to avoid losing
important information, and captures dynamic information across frames
by attending to all frames, thereby preserving temporal consistency and
achieving better temporal fidelity. In addition, we introduce a
Null-Text Embedding that serves as a soft text embedding to optimize
the noise learning process, so that the latent noise is aligned with
the prompt. Furthermore, to tune the video style and keep it realistic,
we employ a Prior-Guided Perceptual Loss that constrains the
predictions from deviating from the original video's style. Extensive
experiments demonstrate superior video editing capability compared with
existing methods.