Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

Shuyu Yang 1 Yilun Wang 1 Yaxiong Wang 2 Li Zhu 1 Zhedong Zheng 3

1Xi'an Jiaotong University 2Hefei University of Technology 3University of Macau

[Paper]     [Dataset]     [Github]     [BibTeX]


Anomaly Videos

Normal Videos

Abstract


Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via an off-the-shelf Large Language Model (LLM), covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA's challenging nature and its effectiveness in evaluating robust cross-modal retrieval methods. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios.

Approach


Data scarcity caused by the long-tail nature of real-world anomalies and privacy constraints hinder large-scale anomaly data collection from real-world scenarios. Fortunately, recent advances in Large Language Models (LLMs) and video foundation models enable high-quality text and video generation. Inspired by this, we propose to leverage LLMs and video generative models to create a large-scale cross-modal video anomaly retrieval benchmark. We provide an overview of our Synthetic Video-Text Anomaly (SVTA) benchmark construction. First, we collect and generate diverse video descriptions via an LLM. Second, we leverage a state-of-the-art open-source video generative model to craft high-quality videos. Third, we adopt an LLM to assign preliminary attributes to samples lacking explicit normal/anomaly labels and refine all labels via K-Means clustering and manual verification. The final dataset integrates 41,315 rigorously curated video-text pairs.
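The label-refinement step can be illustrated with a short sketch: cluster caption embeddings with K-Means and flag samples whose preliminary LLM label disagrees with their cluster's majority label for manual verification. The sentence encoder, the two-cluster setting, and the toy data below are illustrative assumptions, not the exact pipeline.

```python
# Minimal sketch of the label-refinement idea: cluster caption embeddings with
# K-Means and flag captions whose cluster disagrees with the preliminary LLM
# label for manual verification. Encoder choice and data layout are assumptions.
from collections import Counter
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

captions = [
    "A man walks calmly along a sidewalk.",          # preliminary label: normal
    "Two people start fighting in a parking lot.",   # preliminary label: anomaly
    "A car explodes near a gas station.",            # preliminary label: anomaly
]
prelim_labels = ["normal", "anomaly", "anomaly"]     # from the LLM pass (illustrative)

# Embed captions and cluster them (K=2 here purely for illustration).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(captions, normalize_embeddings=True)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# The majority preliminary label inside each cluster becomes the cluster's label.
cluster_to_label = {
    c: Counter(l for l, cc in zip(prelim_labels, clusters) if cc == c).most_common(1)[0][0]
    for c in set(clusters)
}

# Samples whose preliminary label disagrees with their cluster go to manual review.
needs_review = [
    (cap, lab, cluster_to_label[c])
    for cap, lab, c in zip(captions, prelim_labels, clusters)
    if lab != cluster_to_label[c]
]
print(needs_review)
```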


Statistics

Comparison of the proposed SVTA dataset with other publicly available datasets for anomaly detection and anomaly retrieval. Our dataset provides many more video samples, action classes (anomaly and normal), and backgrounds than other available datasets for anomaly retrieval (Anno. = Annotation).

| Dataset | Modality | Annotation | Anno. Format | #Videos | #Texts | #Anomaly Types | Anomaly : Normal | Data Source |
|---|---|---|---|---|---|---|---|---|
| UBnormal | Video | Frame-level Tag | Action Label | 543 | - | 22 | 2:3 | Synthesis |
| ShanghaiTech | Video | Frame-level Tag | Action Label | 437 | - | 11 | 1:18 | Collection |
| UCF-Crime | Video | Video-level Tag | Action Label | 1,900 | - | 13 | 1:1 | Collection |
| UCA | Video, Text | Video-level Text | Action Text | 1,900 | 23,542 | 13 | 1:1 | Collection |
| UCFCrime-AR | Video, Text | Video-level Text | Action Text | 1,900 | 1,900 | 13 | 1:1 | Collection |
| SVTA (Ours) | Video, Text | Video-level Text | Action Text | 41,315 | 41,315 | 68 | 3:2 | Synthesis |
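Statistics such as the anomaly-to-normal ratio above can be reproduced directly from the annotation files. The sketch below shows one way to do so; the CSV layout, column names, and file name are hypothetical assumptions for illustration, not the released SVTA annotation format.

```python
# Hypothetical loader sketch: the CSV layout and column names below are
# illustrative assumptions, not the released SVTA annotation format.
import csv
from collections import Counter

def load_svta_annotations(path):
    """Yield (video_path, caption, category, is_anomaly) tuples from a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield (row["video"], row["caption"], row["category"],
                   row["label"] == "anomaly")

def summarize(annotations):
    """Count videos per split and per category, mirroring the statistics above."""
    split_counts, category_counts = Counter(), Counter()
    for _, _, category, is_anomaly in annotations:
        split_counts["anomaly" if is_anomaly else "normal"] += 1
        category_counts[category] += 1
    return split_counts, category_counts

# Example usage (hypothetical file name):
# splits, cats = summarize(load_svta_annotations("svta_annotations.csv"))
# print(splits["anomaly"] / splits["normal"])   # roughly 3:2 per the table above
```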

An overview of the SVTA attribute annotations, including the distribution of (a) normal and anomaly videos, (b) normal categories, and (c) anomaly categories.



Results

This table presents the results of multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on SVTA, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).

| Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip-MeanP | 54.0 | 81.7 | 88.9 | 1.0 | 8.8 | 55.8 | 82.5 | 89.4 | 1.0 | 7.9 |
| CLIP4Clip-seqLSTM | 53.9 | 81.7 | 88.7 | 1.0 | 8.7 | 55.7 | 82.4 | 89.4 | 1.0 | 7.8 |
| CLIP4Clip-seqTransf | 55.4 | 82.6 | 89.4 | 1.0 | 7.9 | 55.7 | 82.9 | 89.7 | 1.0 | 7.6 |
| CLIP4Clip-tightTransf | 46.3 | 75.6 | 84.7 | 2.0 | 15.3 | 46.9 | 76.2 | 85.2 | 2.0 | 16.3 |
| X-CLIP (ViT-B/32) | 52.9 | 79.9 | 88.1 | 1.0 | 9.0 | 52.9 | 80.2 | 87.9 | 1.0 | 9.4 |
| X-CLIP (ViT-B/16) | 55.8 | 82.2 | 89.6 | 1.0 | 8.0 | 56.2 | 82.1 | 89.4 | 1.0 | 8.1 |
| GRAM | 57.3 | 82.0 | 88.7 | 1.0 | 130.5 | 56.5 | 81.6 | 88.3 | 1.0 | 137.9 |
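For reference, the metrics reported in these tables (R@k, MdR, MnR) are typically computed from an N x N text-video similarity matrix whose diagonal holds the ground-truth pairs. The sketch below mirrors this common evaluation practice; it is not tied to a specific baseline's codebase.

```python
# Minimal sketch of how the retrieval metrics in these tables are typically
# computed from an N x N text-video similarity matrix (ground-truth pairs on
# the diagonal). This mirrors common practice, not a specific codebase.
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j]: similarity of text i and video j; pair (i, i) is the match."""
    order = np.argsort(-sim, axis=1)                                   # best match first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1  # 1-indexed rank
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }

sim = np.random.randn(1000, 1000)        # stand-in for model similarity scores
t2v = retrieval_metrics(sim)             # text-to-video: rank videos per text
v2t = retrieval_metrics(sim.T)           # video-to-text: rank texts per video
```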

Retrieval examples of GRAM on SVTA. We visualize the top-3 retrieved videos for each query (green: correct; orange: incorrect). The ranking lists show that our dataset remains challenging for prevailing cross-modal models.


This table presents the results of zero-shot multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on UCFCrime-AR, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).

| Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip-MeanP | 23.6 | 50.0 | 63.0 | 5.5 | 15.7 | 16.7 | 39.5 | 54.1 | 9.0 | 22.6 |
| CLIP4Clip-seqLSTM | 22.9 | 49.0 | 64.4 | 6.0 | 16.0 | 18.4 | 36.1 | 52.4 | 10.0 | 23.5 |
| CLIP4Clip-seqTransf | 24.0 | 47.6 | 64.0 | 6.0 | 16.1 | 17.7 | 36.4 | 51.0 | 10.0 | 22.4 |
| CLIP4Clip-tightTransf | 16.8 | 41.4 | 53.4 | 8.0 | 32.9 | 14.3 | 34.0 | 49.0 | 12.0 | 39.4 |
| X-CLIP (ViT-B/32) | 24.0 | 49.7 | 63.4 | 6.0 | 16.4 | 17.7 | 36.4 | 52.7 | 9.0 | 22.7 |
| X-CLIP (ViT-B/16) | 27.4 | 53.1 | 67.8 | 5.0 | 14.0 | 20.4 | 44.6 | 59.5 | 7.0 | 19.6 |
| GRAM | 34.5 | 60.7 | 70.7 | 3.0 | 17.8 | 32.4 | 57.2 | 68.6 | 4.0 | 26.3 |

This table presents the results of zero-shot multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on OOPS!, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).

| Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4Clip-MeanP | 16.1 | 35.9 | 45.6 | 14.0 | 112.4 | 14.0 | 31.2 | 40.6 | 19.0 | 127.6 |
| CLIP4Clip-seqLSTM | 15.5 | 35.5 | 45.2 | 14.0 | 114.0 | 12.8 | 30.5 | 39.5 | 20.0 | 129.5 |
| CLIP4Clip-seqTransf | 16.0 | 35.5 | 45.5 | 14.0 | 114.4 | 13.0 | 30.6 | 40.6 | 19.0 | 127.4 |
| CLIP4Clip-tightTransf | 11.8 | 27.7 | 36.8 | 24.0 | 219.3 | 9.5 | 24.9 | 33.3 | 29.0 | 230.8 |
| X-CLIP (ViT-B/32) | 15.7 | 35.6 | 46.9 | 13.0 | 108.0 | 14.1 | 32.8 | 41.8 | 17.0 | 110.9 |
| X-CLIP (ViT-B/16) | 18.9 | 41.8 | 52.2 | 9.0 | 84.4 | 16.5 | 36.8 | 47.0 | 12.0 | 91.9 |
| GRAM | 18.6 | 39.0 | 48.1 | 12.0 | 542.2 | 19.0 | 39.1 | 49.1 | 11.0 | 551.0 |

BibTeX

@article{yang2025towards,
  title={Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark},
  author={Yang, Shuyu and Wang, Yilun and Wang, Yaxiong and Zhu, Li and Zheng, Zhedong},
  journal={arXiv preprint arXiv:2506.01466},
  year={2025},
}

Project page template is borrowed from AnimateDiff.