MMVAD: A Vision-Language Model for Cross-Domain Video Anomaly Detection with Contrastive Learning and Scale-Adaptive Frame Segmentation
  • Debojyoti Biswas, Texas State University
  • Jelena Tesic, Texas State University

Corresponding Author: [email protected]

Video Anomaly Detection (VAD) is crucial for public safety and for detecting abnormalities in risk-prone zones. Anomaly detection from weakly labeled datasets has proven very challenging for CCTV surveillance videos, and the challenge intensifies for high-altitude drone videos. Very few works address drone-captured VAD, and even the existing CCTV VAD methods suffer from several limitations that hinder their performance. Prior VAD work mostly uses single-modal data, e.g., video alone, which is insufficient for understanding complex scene context. Moreover, existing multimodal systems use traditional linear fusion to capture multimodal feature interaction, which does not address the misalignment between modalities. Existing work also relies on fixed-scale video segmentation, which fails to preserve fine-grained local and global context, and feature-magnitude-based VAD does not reliably represent anomalous events. To address these issues, we present a novel vision-language-based video anomaly detection method for drone videos. We use adaptive long-short-term video segmentation (ALSVS) for local-global knowledge extraction. Next, we propose a shallow yet efficient attention-based feature fusion (AFF) technique for multimodal VAD (MMVAD) tasks. Finally, for the first time, we introduce feature anomaly learning based on a saliency-aware contrastive algorithm; we find contrastive anomaly feature learning to be more robust than magnitude-based loss calculation. We performed experiments on two of the latest drone VAD datasets (Drone-Anomaly and UIT-Drone), as well as two CCTV VAD datasets (UCF-Crime and XD-Violence). Compared to the baseline and the closest SOTA, we achieved at least +3.8% and +3.3% AUC gains on the drone and CCTV datasets, respectively.
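The abstract's attention-based feature fusion (AFF) replaces linear fusion with an attention step so that each video-segment feature is aligned against the language features before they are combined. The paper's exact architecture is not given here, so the following is only a minimal NumPy sketch of the general idea (scaled dot-product cross-attention from video segments to text tokens, with a residual combination); the function names, shapes, and the residual choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(video_feats, text_feats):
    """Sketch of attention-based multimodal fusion (illustrative, not the
    paper's AFF): each video segment attends over text tokens, so the fused
    feature is a text-conditioned mixture rather than a fixed linear blend.

    video_feats: (num_segments, d) segment embeddings
    text_feats:  (num_tokens, d) language embeddings (same dim assumed)
    """
    d = video_feats.shape[-1]
    # Scaled dot-product scores between every segment and every token.
    scores = video_feats @ text_feats.T / np.sqrt(d)   # (segments, tokens)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    attended = weights @ text_feats                    # (segments, d)
    # Residual combination keeps the original visual signal.
    return video_feats + attended
```

In this sketch the attention weights let each segment soft-select the text tokens it aligns with, which is the property a plain linear (concatenate-and-project) fusion lacks.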
Submitted to TechRxiv: 15 May 2024
Published in TechRxiv: 20 May 2024