Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
1Xi'an Jiaotong University | 2Hefei University of Technology | 3University of Macau |
[Paper] [Dataset] [Github] [BibTeX]
Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address these issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, which leverages generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via an off-the-shelf LLM (Large Language Model), covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We use these texts to guide a video generative model to produce diverse and high-quality videos. In total, SVTA comprises 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely used video-text retrieval baselines to comprehensively test SVTA, revealing its challenging nature and its effectiveness in evaluating robust cross-modal retrieval methods. SVTA eliminates the privacy risks associated with real-world anomaly collection while maintaining realistic scenarios.
Data scarcity caused by the long-tail nature of real-world anomalies, together with privacy constraints, hinders large-scale anomaly data collection from real-world scenarios. Fortunately, recent advances in Large Language Models (LLMs) and video foundation models enable high-quality text and video generation. Inspired by this, we propose to leverage LLMs and video generative models to create a large-scale cross-modal video anomaly retrieval benchmark. We provide an overview of the construction of our Synthetic Video-Text Anomaly (SVTA) benchmark. First, we collect and generate diverse video descriptions via an LLM. Second, we leverage a state-of-the-art open-source video generative model to craft high-quality videos. Third, we adopt an LLM to assign preliminary attributes to samples lacking explicit normal/anomaly labels and refine all labels with K-Means clustering and manual verification. The final dataset integrates 41,315 rigorously curated video-text pairs.
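To make the third step concrete, the sketch below clusters caption embeddings with K-Means and surfaces a representative caption per cluster for manual normal/anomaly verification. It is a minimal illustration only: the sentence encoder (all-MiniLM-L6-v2), the cluster count, and the toy captions are assumptions, not the exact configuration used to build SVTA.

```python
# Minimal sketch of the label-refinement idea: embed captions, cluster them,
# and surface cluster representatives for manual normal/anomaly verification.
# The encoder, cluster count, and captions below are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

captions = [
    "A man throws a bottle at a passing car on a rainy street.",    # anomaly-like
    "Two people walk calmly through a park in the afternoon.",      # normal-like
    "A crowd runs away from smoke rising near a subway entrance.",  # anomaly-like
    "A woman stands at a bus stop reading her phone.",              # normal-like
]

# 1) Embed captions with an off-the-shelf sentence encoder (assumed choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(captions, normalize_embeddings=True)

# 2) Cluster the embeddings with K-Means, as in the refinement step above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# 3) For each cluster, pick the caption closest to the centroid as a
#    representative to speed up manual inspection of preliminary labels.
for c in range(kmeans.n_clusters):
    idx = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
    rep = idx[np.argmin(dists)]
    print(f"cluster {c}: {len(idx)} captions, representative -> {captions[rep]}")
```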
Comparison of the proposed SVTA dataset with other publicly available datasets for anomaly detection and anomaly retrieval. Our dataset provides many more video samples, action classes (anomaly and normal), and backgrounds than other available datasets for anomaly retrieval (Anno. means Annotation).
Datasets | Modality | Annotation | Anno. Format | #Videos | #Texts | #Anomaly Types | Anomaly : Normal | Data source |
---|---|---|---|---|---|---|---|---|
UBnormal | Video | Frame-level Tag | Action Label | 543 | - | 22 Anomaly | 2:3 | Synthesis |
ShanghaiTech | Video | Frame-level Tag | Action Label | 437 | - | 11 Anomaly | 1:18 | Collection |
UCF-Crime | Video | Video-level Tag | Action Label | 1,900 | - | 13 Anomaly | 1:1 | Collection |
UCA | Video, Text | Video-level Text | Action Text | 1,900 | 23,542 | 13 Anomaly | 1:1 | Collection |
UCFCrime-AR | Video, Text | Video-level Text | Action Text | 1,900 | 1,900 | 13 Anomaly | 1:1 | Collection |
SVTA (Ours) | Video, Text | Video-level Text | Action Text | 41,315 | 41,315 | 68 Anomaly | 3:2 | Synthesis |
An overview of the SVTA attribute annotations, including the distribution of (a) normal and anomaly videos, (b) normal categories, and (c) anomaly categories.
This table presents the results of multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on SVTA, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).
Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
---|---|---|---|---|---|---|---|---|---|---|
CLIP4Clip-MeanP | 54.0 | 81.7 | 88.9 | 1.0 | 8.8 | 55.8 | 82.5 | 89.4 | 1.0 | 7.9 |
CLIP4Clip-seqLSTM | 53.9 | 81.7 | 88.7 | 1.0 | 8.7 | 55.7 | 82.4 | 89.4 | 1.0 | 7.8 |
CLIP4Clip-seqTransf | 55.4 | 82.6 | 89.4 | 1.0 | 7.9 | 55.7 | 82.9 | 89.7 | 1.0 | 7.6 |
CLIP4Clip-tightTransf | 46.3 | 75.6 | 84.7 | 2.0 | 15.3 | 46.9 | 76.2 | 85.2 | 2.0 | 16.3 |
X-CLIP (ViT-B/32) | 52.9 | 79.9 | 88.1 | 1.0 | 9.0 | 52.9 | 80.2 | 87.9 | 1.0 | 9.4 |
X-CLIP (ViT-B/16) | 55.8 | 82.2 | 89.6 | 1.0 | 8.0 | 56.2 | 82.1 | 89.4 | 1.0 | 8.1 |
GRAM | 57.3 | 82.0 | 88.7 | 1.0 | 130.5 | 56.5 | 81.6 | 88.3 | 1.0 | 137.9 |
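For reference, the recall and rank metrics reported above can be computed from a query-by-gallery similarity matrix as in the minimal sketch below. It assumes the ground-truth match of query i is gallery item i (one caption per video), which is the usual convention for such benchmarks; this is an illustration, not code from the SVTA release.

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray):
    """Compute R@1/5/10, median rank (MdR), and mean rank (MnR) from a
    similarity matrix where sim[i, j] scores query i against gallery item j
    and the ground-truth match of query i is gallery item i (assumed layout)."""
    order = np.argsort(-sim, axis=1)               # gallery indices, best first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1     # rank of ground truth (1 = best)
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MdR": float(np.median(ranks)),
        "MnR": float(np.mean(ranks)),
    }

# T2V uses the text-query-by-video-gallery matrix; V2T simply transposes it.
sim_t2v = np.random.randn(1000, 1000)              # placeholder similarity scores
print(retrieval_metrics(sim_t2v))                  # T2V metrics
print(retrieval_metrics(sim_t2v.T))                # V2T metrics
```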
These are some retrieval examples of GRAM on SVTA. We visualize the top-3 retrieved videos (green: correct; orange: incorrect) and show the ranking lists on our SVTA dataset. We observe that our dataset remains challenging for prevailing cross-modal models.
This table presents the results of zero-shot multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on UCFCrime-AR, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).
Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
---|---|---|---|---|---|---|---|---|---|---|
CLIP4Clip-MeanP | 23.6 | 50.0 | 63.0 | 5.5 | 15.7 | 16.7 | 39.5 | 54.1 | 9.0 | 22.6 |
CLIP4Clip-seqLSTM | 22.9 | 49.0 | 64.4 | 6.0 | 16.0 | 18.4 | 36.1 | 52.4 | 10.0 | 23.5 |
CLIP4Clip-seqTransf | 24.0 | 47.6 | 64.0 | 6.0 | 16.1 | 17.7 | 36.4 | 51.0 | 10.0 | 22.4 |
CLIP4Clip-tightTransf | 16.8 | 41.4 | 53.4 | 8.0 | 32.9 | 14.3 | 34.0 | 49.0 | 12.0 | 39.4 |
X-CLIP (ViT-B/32) | 24.0 | 49.7 | 63.4 | 6.0 | 16.4 | 17.7 | 36.4 | 52.7 | 9.0 | 22.7 |
X-CLIP (ViT-B/16) | 27.4 | 53.1 | 67.8 | 5.0 | 14.0 | 20.4 | 44.6 | 59.5 | 7.0 | 19.6 |
GRAM | 34.5 | 60.7 | 70.7 | 3.0 | 17.8 | 32.4 | 57.2 | 68.6 | 4.0 | 26.3 |
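As a rough illustration of how the simplest baseline above (CLIP4Clip with mean pooling) scores a text query against a video in the zero-shot setting, the sketch below mean-pools per-frame CLIP image embeddings and ranks videos by cosine similarity with the text embedding. The openai/clip-vit-base-patch32 checkpoint and the frame-sampling interface are assumed stand-ins, not the authors' evaluation code.

```python
# Sketch of CLIP4Clip-MeanP-style scoring: mean-pool frame embeddings, then
# rank videos by cosine similarity with the text query. Uses an off-the-shelf
# CLIP checkpoint as a stand-in for the trained retrieval model.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_embedding(frames):
    """frames: list of PIL.Image sampled from one video."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0)                        # mean pooling over frames
    return pooled / pooled.norm()

@torch.no_grad()
def text_embedding(query: str):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    feat = model.get_text_features(**inputs)[0]
    return feat / feat.norm()

# Ranking a video gallery for one query (frame sampling omitted for brevity):
# sims = torch.stack([video_embedding(f) @ text_embedding(query) for f in gallery_frames])
# top3 = sims.topk(3).indices
```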
This table presents the results of zero-shot multimodal text-to-video (T2V) and video-to-text (V2T) retrieval on OOPS!, in terms of Recall Rate (R@1, R@5, R@10), Median Rank (MdR), and Mean Rank (MnR).
Method | T2V R@1↑ | T2V R@5↑ | T2V R@10↑ | T2V MdR↓ | T2V MnR↓ | V2T R@1↑ | V2T R@5↑ | V2T R@10↑ | V2T MdR↓ | V2T MnR↓ |
---|---|---|---|---|---|---|---|---|---|---|
CLIP4Clip-MeanP | 16.1 | 35.9 | 45.6 | 14.0 | 112.4 | 14.0 | 31.2 | 40.6 | 19.0 | 127.6 |
CLIP4Clip-seqLSTM | 15.5 | 35.5 | 45.2 | 14.0 | 114.0 | 12.8 | 30.5 | 39.5 | 20.0 | 129.5 |
CLIP4Clip-seqTransf | 16.0 | 35.5 | 45.5 | 14.0 | 114.4 | 13.0 | 30.6 | 40.6 | 19.0 | 127.4 |
CLIP4Clip-tightTransf | 11.8 | 27.7 | 36.8 | 24.0 | 219.3 | 9.5 | 24.9 | 33.3 | 29.0 | 230.8 |
X-CLIP (ViT-B/32) | 15.7 | 35.6 | 46.9 | 13.0 | 108.0 | 14.1 | 32.8 | 41.8 | 17.0 | 110.9 |
X-CLIP (ViT-B/16) | 18.9 | 41.8 | 52.2 | 9.0 | 84.4 | 16.5 | 36.8 | 47.0 | 12.0 | 91.9 |
GRAM | 18.6 | 39.0 | 48.1 | 12.0 | 542.2 | 19.0 | 39.1 | 49.1 | 11.0 | 551.0 |
BibTeX
@article{yang2025towards,
title={Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark},
author={Yang, Shuyu and Wang, Yilun and Wang, Yaxiong and Zhu, Li and Zheng, Zhedong},
journal={arXiv preprint arXiv:2506.01466},
year={2025},
}
Project page template is borrowed from AnimateDiff.