
Foundation Models for Video Understanding: A Survey
  • Neelu Madan,
  • Andreas Møgelmose,
  • Rajat Modi,
  • Yogesh S Rawat,
  • Thomas B Moeslund
Neelu Madan

Corresponding Author: [email protected]



Video Foundation Models (ViFMs) aim to develop general-purpose representations for various video understanding tasks by leveraging large-scale datasets and powerful models to capture robust and generic features from video data. This survey analyzes over 200 methods, offering a comprehensive overview of benchmarks and evaluation metrics across 15 distinct video tasks, categorized into three main groups. Additionally, we provide an in-depth performance analysis of these models for the six most common video tasks. We identify three main approaches to constructing ViFMs: 1) Image-based ViFMs, which adapt image foundation models for video tasks; 2) Video-based ViFMs, which utilize video-specific encoding methods; and 3) Universal Foundation Models (UFMs), which integrate multiple modalities (image, video, audio, text, etc.) within a single framework. Each approach is further subdivided based on either practical implementation perspectives or pretraining objective types. By comparing the performance of various ViFMs on common video tasks, we offer valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Our analysis reveals that image-based ViFMs consistently outperform video-based ViFMs on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance across all video tasks. We provide the comprehensive list of ViFMs studied in this work at: https://github.com/NeeluMadan/ViFM_Survey.git.
10 Jun 2024: Submitted to TechRxiv
14 Jun 2024: Published in TechRxiv