arXiv Paper Digest

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

First: 2024-12-31T18:56:46+00:00 · Latest: 2025-03-25T08:10:15+00:00

Comments: 17 pages, 14 figures, technical report

Abstract

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle to capture fine-grained spatial and temporal details. In addition, the lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders their advancement. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any object throughout the video. Specifically, we develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. First, we introduce a multi-agent data engine to curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we create VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

Summary

The work targets a weakness of Video Large Language Models (Video LLMs): they handle holistic comprehension well but miss fine-grained spatial and temporal detail. The authors introduce the VideoRefer Suite, which comprises VideoRefer-700K, a large-scale object-level video instruction dataset built with a multi-agent data engine; the VideoRefer model, which adds a spatial-temporal object encoder; and VideoRefer-Bench, a benchmark for spatial-temporal understanding. Experiments show that the VideoRefer model performs well on video referring benchmarks and also improves general video understanding.

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Authors: Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou

First: 2025-03-20T09:16:20+00:00 · Latest: 2025-03-25T03:05:36+00:00

Abstract

Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.

Summary

The work addresses a limitation of existing video prompting methods for vision-language models such as CLIP: a single static prompt is applied to every video sequence. STOP combines two modules. Intra-frame spatial prompts adaptively highlight discriminative regions within each frame, and inter-frame temporal prompts are dynamically inserted between frames with high temporal variance. Experiments show that STOP outperforms state-of-the-art methods on various video benchmarks. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.
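
The inter-frame idea in the STOP abstract, placing prompts between frames whose content changes the most, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-frame feature vectors are already available, uses one minus cosine similarity as the temporal variation score, and picks a fixed number of gaps (`num_prompts`); all three choices and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_prompt_positions(frame_features, num_prompts):
    """Return indices i where a prompt would go between frame i and frame i + 1.

    Assumption: temporal variation is scored as 1 - cosine similarity of
    consecutive frames, and the num_prompts highest-scoring gaps are chosen.
    """
    variation = [
        (1.0 - cosine_similarity(frame_features[i], frame_features[i + 1]), i)
        for i in range(len(frame_features) - 1)
    ]
    top = sorted(variation, key=lambda pair: pair[0], reverse=True)[:num_prompts]
    return sorted(index for _, index in top)


# Toy features: the content changes sharply after frame 1 and after frame 3.
frames = [
    [1.0, 0.0],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.1, 1.0],
    [1.0, 0.0],
]
positions = temporal_prompt_positions(frames, num_prompts=2)
print(positions)
```

In this toy input the two selected gaps are the ones where consecutive frames are least similar; a learned model would insert learnable prompt tokens at those positions rather than simply reporting indices.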

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Authors: Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

First: 2024-09-30T08:05:00+00:00 · Latest: 2025-03-02T12:17:51+00:00

Abstract

With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.

Summary

Q-Bench-Video is a new benchmark for evaluating how well Large Multi-modal Models (LMMs) discern video quality. It draws videos from natural scenes, AI-generated content (AIGC), and computer graphics (CG), and adds AIGC distortions to the usual technical, aesthetic, and temporal aspects. The benchmark includes multiple-choice, open-ended, and video pair quality comparison questions, and is used to test 17 LMMs (12 open-source and 5 proprietary). The results show that LMMs have a basic understanding of video quality but remain incomplete and imprecise compared with human performance.

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu

First: 2025-03-31T17:55:23+00:00 · Latest: 2025-03-31T17:55:23+00:00

Comments: Technical Report (In Progress); Code released at: https://github.com/TencentARC/SEED-Bench-R1

Abstract

Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

Summary

The study examines how reinforcement learning (RL) affects multimodal large language models (MLLMs) on video understanding tasks. SEED-Bench-R1, a new benchmark, evaluates post-training methods for MLLMs using complex real-world videos and everyday planning tasks posed as multiple-choice questions. With Qwen2-VL-Instruct-7B as the base model, RL shows better data efficiency and performance than supervised fine-tuning (SFT) on both in-distribution and out-of-distribution tasks, and also on general video understanding benchmarks such as LongVideoBench. RL improves visual perception, but its reasoning chains are often less logically coherent and can overlook visual cues.
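
Because the benchmark uses multiple-choice questions with easily verifiable ground-truth answers, an RL reward can in principle be computed by a simple rule rather than a learned reward model. The sketch below shows one such rule. It is an illustration under stated assumptions, not the paper's reward: the `<answer>...</answer>` tag format, the option letters A to D, and the function names `extract_choice` and `outcome_reward` are all hypothetical.

```python
import re


def extract_choice(response):
    """Return the option letter inside <answer>...</answer>, or None if absent.

    Assumption: the model is prompted to wrap its final choice in these tags.
    """
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    return match.group(1) if match else None


def outcome_reward(response, ground_truth):
    """Binary outcome reward: 1.0 for the correct letter, 0.0 otherwise."""
    return 1.0 if extract_choice(response) == ground_truth else 0.0


good = "<think>The person picks up the pan first.</think><answer>B</answer>"
bad = "<think>The person opens the fridge.</think><answer>C</answer>"
print(outcome_reward(good, "B"), outcome_reward(bad, "B"))
```

A reward of this kind scores only the final answer, which is consistent with the abstract's observation that RL can raise accuracy while leaving the intermediate reasoning chain less coherent.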

DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding

Authors: Xiaokai Chen, Ke Gao

First: 2018-05-19T08:53:44+00:00 · Latest: 2018-05-19T08:53:44+00:00

Comments: 7 pages

Abstract

Many of the leading approaches for video understanding are data-hungry and time-consuming, failing to capture the gist of spatial-temporal evolution in an efficient manner. Recent research shows that CNNs can reason about static relations between entities in images. To further exploit their capacity for reasoning about dynamic evolution, we introduce a novel network module called DenseImage Network (DIN) with two main contributions. 1) A novel compact representation of video, which distills its significant spatial-temporal evolution into a matrix called DenseImage, primed for efficient video encoding. 2) A simple yet powerful learning strategy for video understanding, based on DenseImage and a temporal-order-preserving CNN, which contains a local temporal correlation constraint capturing temporal evolution at multiple time scales with different filter widths. Extensive experiments on two recent challenging benchmarks demonstrate that our DenseImage Network can accurately capture the common spatial-temporal evolution between similar actions, even with enormous visual variations or different time scales. Moreover, we obtain state-of-the-art results in action and gesture recognition with much lower time and memory cost, indicating its immense potential in video representation and understanding.

Summary

The work addresses the inefficiency of video understanding methods by introducing the DenseImage Network (DIN). DIN contributes a compact DenseImage representation of a video and a temporal-order-preserving CNN with a local temporal correlation constraint. Experiments on two benchmarks show state-of-the-art action and gesture recognition results at much lower time and memory cost.
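
The two ideas in the DenseImage abstract, a matrix that keeps frames in temporal order and filters of different widths along the time axis, can be sketched in a few lines. This is only an illustration: it assumes one feature vector per frame stacked as rows, and it substitutes fixed uniform averaging for the learned convolution filters of the actual network. The names `build_dense_image` and `temporal_filter` are hypothetical.

```python
def build_dense_image(frame_features):
    """Stack per-frame feature vectors in temporal order, one row per frame."""
    width = len(frame_features[0])
    if any(len(row) != width for row in frame_features):
        raise ValueError("all frame feature vectors must have the same length")
    return [list(row) for row in frame_features]


def temporal_filter(dense_image, filter_width):
    """Slide a uniform filter over filter_width consecutive rows (time steps).

    Assumption: uniform averaging stands in for learned filter weights.
    The filter spans the full feature dimension and moves only along time,
    so the temporal order of the rows is preserved in the output.
    """
    outputs = []
    for start in range(len(dense_image) - filter_width + 1):
        window = dense_image[start:start + filter_width]
        outputs.append([sum(column) / filter_width for column in zip(*window)])
    return outputs


dense_image = build_dense_image([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]])
# Several filter widths give views of the same video at several time scales.
multi_scale = {width: temporal_filter(dense_image, width) for width in (1, 2, 3)}
print(multi_scale[2])
```

Wider filters summarize longer stretches of the video and produce fewer output rows, which is one simple way to read "multiple time scales with different filter widths".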

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

First: 2025-03-22T17:55:53+00:00 · Latest: 2025-03-22T17:55:53+00:00

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

Summary

4D-Bench is the first benchmark for evaluating the 4D object understanding of Multimodal Large Language Models (MLLMs), covering 4D object question answering and 4D object captioning. It provides diverse 4D objects with high-quality annotations and tasks that require multi-view spatial-temporal understanding. In captioning, MLLMs show weaker temporal understanding than appearance understanding; open-source models approach closed-source models on appearance but trail them by a larger margin on temporal understanding. In 4D object QA, the state-of-the-art GPT-4o reaches only 63% accuracy against a human baseline of 91%, which points to a substantial gap in 4D object understanding.

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Authors: Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu

First: 2025-10-20T16:38:40+00:00 · Latest: 2025-10-20T16:38:40+00:00

Comments: Project Website: https://github.com/NJU-LINK/MT-Video-Bench

Abstract

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

Summary

MT-Video-Bench evaluates MLLMs in multi-turn dialogues, addressing the restriction of existing benchmarks to single-turn question answering. It assesses six core competencies across 987 multi-turn dialogues from diverse domains, aligned with applications such as interactive sports analysis and video-based intelligent tutoring. Evaluation of open-source and closed-source MLLMs reveals significant performance discrepancies and limitations in handling multi-turn video dialogues.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

First: 2024-06-11T17:22:23+00:00 · Latest: 2024-10-30T06:49:54+00:00

Comments: ZC, SL, HZ, YX, and XL contributed equally to this project. Code: https://github.com/DAMO-NLP-SG/VideoLLaMA2

Abstract

In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

Summary

VideoLLaMA 2 is a set of Video Large Language Models designed to improve spatial-temporal modeling and audio understanding in video and audio-oriented tasks. It incorporates a Spatial-Temporal Convolution (STC) connector and an Audio Branch integrated through joint training. Evaluations show competitive results among open-source models on MC-VQA, OE-VQA, and video captioning, along with improvements over existing models on audio-only and audio-video question answering.

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

Authors: Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp

Venue: EMNLP 2024

First: 2024-09-30T15:04:14+00:00 · Latest: 2024-10-04T21:57:23+00:00

Comments: EMNLP 2024 Findings; 22 pages; Code: https://github.com/mayhugotong/VideoINSTA

Abstract

In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. The code is released here: https://github.com/mayhugotong/VideoINSTA.

Summary

VideoINSTA is a zero-shot framework for long video understanding with LLMs, aimed at the difficulty of reasoning over extended timespans and the information redundancy of long videos. It combines event-based temporal reasoning, content-based spatial reasoning, and a self-reflective information reasoning scheme based on information sufficiency and prediction confidence. The model improves the state of the art on the long video question-answering benchmarks EgoSchema, NextQA, and IntentQA, and on the open question answering dataset ActivityNetQA.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan

First: 2024-06-13T17:59:59+00:00 · Latest: 2024-06-13T17:59:59+00:00

Comments: Technical Report

Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop 112K video-instruction set using a novel semi-automatic annotation pipeline which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.
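
The processing step described in the VideoGPT+ abstract, dividing a video into smaller segments and applying adaptive pooling to encoder features, can be illustrated with a minimal sketch. It is not the released implementation. It assumes one scalar feature per frame for readability, pools along the time axis (the abstract does not say which axis is pooled), and uses the bin boundaries commonly used for adaptive average pooling; `split_into_segments`, `adaptive_avg_pool`, and the segment length are hypothetical.

```python
import math


def split_into_segments(frames, segment_length):
    """Split a frame list into consecutive segments of at most segment_length."""
    return [frames[i:i + segment_length] for i in range(0, len(frames), segment_length)]


def adaptive_avg_pool(values, output_size):
    """Average values into output_size bins, whatever the input length.

    Bin i covers indices floor(i * n / output_size) up to, but not including,
    ceil((i + 1) * n / output_size).
    """
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = math.ceil((i + 1) * n / output_size)
        pooled.append(sum(values[start:end]) / (end - start))
    return pooled


# Eight per-frame features (scalars stand in for encoder feature vectors).
frame_features = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
segments = split_into_segments(frame_features, segment_length=4)
pooled_segments = [adaptive_avg_pool(segment, output_size=2) for segment in segments]
print(pooled_segments)
```

The point of the adaptive variant is that the output size is fixed regardless of how many features a segment contains, which keeps the number of tokens passed to the language model bounded.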

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Authors: Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, Jiebo Luo

First: 2025-08-30T04:36:41+00:00 · Latest: 2025-08-30T04:36:41+00:00

Abstract

Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist: inadequate visual content perception and insufficient temporal awareness in surgical videos. Both hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to strengthen the spatial focus on surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.

中文标题/摘要

标题：SurgLLM：一种具有空间聚焦和时间意识的多功能多模态模型及其在手术视频理解中的应用

手术视频理解对于促进计算机辅助手术（CAS）系统至关重要。尽管现有研究取得了显著进展，但仍存在两大主要限制：手术视频中的视觉内容感知不足和时间意识不足，这阻碍了多功能CAS解决方案的发展。本文提出了SurgLLM框架，这是一种面向多功能手术视频理解任务的有效大型多模态模型，具有增强的空间聚焦和时间意识。具体而言，为了增强对手术视频的空间聚焦，我们首先为SurgLLM的视频编码器设计了手术上下文感知的多模态预训练（Surg-Pretrain），即执行以器械为中心的掩蔽视频重建（MV-Recon）和后续的多模态对齐。为了将手术时间知识融入SurgLLM，我们进一步提出了时间感知的多模态调优（TM-Tuning），通过交错的多模态嵌入增强时间推理。此外，为了无冲突地适应手术视频的各种理解任务，我们设计了手术任务动态集成（Surgical Task Dynamic Ensemble），在SurgLLM中高效地为每个查询分派最优的可学习参数。在包括字幕生成、通用VQA和时间VQA在内的多种手术视频理解任务上进行的广泛实验表明，SurgLLM显著优于现有最佳方法，验证了其在多功能手术视频理解中的有效性。源代码可在 https://github.com/franciszchen/SurgLLM 获取。

Summary / 总结

The research aims to address the limitations of existing surgical video understanding models by enhancing spatial focus and temporal awareness. The SurgLLM framework is proposed, which includes Surgical Context-aware Multimodal Pretraining for spatial focus and Temporal-aware Multimodal Tuning for temporal reasoning. Experimental results show significant improvements in various tasks such as captioning, general VQA, and temporal VQA, validating the effectiveness of SurgLLM in versatile surgical video understanding. The source code is available on GitHub.

SurgLLM旨在通过解决视觉内容感知不足和时间感知不足的问题，增强手术视频理解，以支持计算机辅助手术系统。它通过手术上下文感知多模态预训练来增强空间聚焦，并通过时间感知多模态调优来增强时间推理。实验结果显示，在包括字幕生成、通用VQA和时间VQA等任务中，SurgLLM相比现有方法有显著改进，验证了其在多功能手术视频理解中的有效性。源代码可在GitHub上获得。

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Authors: Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen

First: 2025-03-19T06:42:32+00:00 · Latest: 2025-03-19T06:42:32+00:00

Comments: FAVOR-Bench project page: https://favor-bench.github.io/

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.

中文标题/摘要

标题：FAVOR-Bench：细粒度视频运动理解的综合基准

多模态大型语言模型（MLLMs）在视频内容理解方面表现出色，但在细粒度运动理解方面仍存在困难。为了全面评估现有MLLMs的运动理解能力，我们引入了FAVOR-Bench，包含1,776个视频，附有各种运动的结构化手动注释。我们的基准包括封闭式和开放式任务。对于封闭式评估，我们精心设计了8,184个多选题-答案对，涵盖六个不同的子任务。对于开放式评估，我们开发了一种新型低成本的LLM-free字幕评估方法和一种GPT辅助的字幕评估方法，前者可以增强基准测试的可解释性和可重复性。对21种最先进的MLLMs进行的全面实验表明，它们在理解和描述视频运动的详细时间动态方面存在显著局限。为缓解这一局限，我们进一步构建了FAVOR-Train数据集，包含17,152个视频，附有细粒度运动注释。在FAVOR-Train上微调Qwen2.5-VL后，模型在TVBench、MotionBench和我们的FAVOR-Bench的运动相关任务上均取得了一致的改进。全面的评估结果表明，提出的FAVOR-Bench和FAVOR-Train为社区开发更强大的视频理解模型提供了有价值的工具。项目主页：https://favor-bench.github.io/

Summary / 总结

FAVOR-Bench is a comprehensive benchmark for evaluating the fine-grained motion understanding of Multimodal Large Language Models (MLLMs), comprising 1,776 videos with structured annotations. It includes both close-ended and open-ended tasks, with 8,184 multiple-choice questions for close-ended evaluation and novel caption assessment methods for open-ended evaluation. Experiments with 21 state-of-the-art MLLMs reveal their limitations in comprehending detailed temporal dynamics, and finetuning Qwen2.5-VL on FAVOR-Train improves performance on motion-related tasks. This benchmark and training dataset provide valuable tools for developing more powerful video understanding models.

FAVOR-Bench 是一个全面的基准，用于评估多模态大型语言模型（MLLMs）的细粒度运动理解能力。它包含 1,776 个带有结构化注释的视频，并包括封闭式和开放式任务。该基准评估了 21 种最先进的 MLLMs，并揭示了它们在理解视频运动中的时间动态方面的局限性。为了应对这些局限性，开发了 FAVOR-Train 数据集，包含 17,152 个带有细粒度运动注释的视频，这在不同的基准测试中一致地提高了运动相关任务的表现。

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Authors: Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara

First: 2025-09-23T13:46:31+00:00 · Latest: 2025-09-23T13:46:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

中文标题/摘要

标题：VIR-Bench：通过旅行视频行程重建评估MLLMs的地理空间和时间理解能力

近期多模态大型语言模型（MLLMs）的发展显著提升了视频理解能力，为实际应用开辟了新的可能性。然而，当前的视频基准主要集中在室内场景或短距离户外活动上，长距离旅行相关的挑战尚未得到充分探索。掌握扩展的地理空间-时间轨迹对于下一代MLLMs至关重要，支撑着诸如具身AI规划和导航等现实世界任务。为弥合这一差距，我们提出了VIR-Bench，这是一个由200个旅行视频组成的新型基准，将行程重建作为一项具有挑战性的任务，旨在评估和推动MLLMs的地理空间-时间智能。实验结果表明，包括专有模型在内的最新MLLMs难以获得高分，突显了处理跨越广阔空间和时间尺度的视频的难度。此外，我们进行了一项深入的案例研究，开发了一个原型旅行规划代理，利用从VIR-Bench中获得的见解。该代理的行程推荐质量显著提升，验证了我们的评估协议不仅能有效评测模型，还能转化为面向用户应用中的实际性能提升。

Summary / 总结

VIR-Bench evaluates the geospatial and temporal understanding of MLLMs through travel video itinerary reconstruction, addressing the current benchmarks' focus on indoor scenes and short outdoor activities. The benchmark consists of 200 travel videos and reveals that state-of-the-art MLLMs, including proprietary ones, struggle to handle extended spatial and temporal scales. An in-depth case study shows that the evaluation protocol effectively benchmarks models and translates into improved performance in travel planning applications.

VIR-Bench 通过旅行视频行程重建来评估 MLLMs 的地理时空理解能力，弥补了现有基准在长距离旅行方面的不足。该基准包含 200 个旅行视频，实验结果显示最先进的 MLLMs 表现不佳，突显了处理扩展时空尺度的难度。此外，基于 VIR-Bench 的洞察开发的旅行规划代理展示了改进的行程建议，验证了该基准的有效性。

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan

First: 2025-05-26T17:56:30+00:00 · Latest: 2025-06-01T21:20:16+00:00

Comments: Project Page: https://vlm-3r.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

中文标题/摘要

标题：VLM-3R：视觉语言模型结合指令对齐的3D重建

大型多模态模型（LMMs）在2D图像和视频上的快速进步推动了将这些模型扩展到3D场景理解的研究，旨在实现类人的视觉-空间智能。然而，实现与人类能力相媲美的深度空间理解在模型编码和数据获取方面提出了重大挑战。现有方法经常依赖外部深度传感器进行几何捕获，或者利用现成的算法预先构建3D地图，从而限制了其可扩展性，尤其是在常见的单目视频输入和时间敏感应用方面。在本文中，我们提出了VLM-3R，这是一种结合3D重建指令调优的统一视觉语言模型（VLMs）框架。VLM-3R通过使用几何编码器处理单目视频帧，推导出表示空间理解的隐式3D令牌。利用我们的空间-视觉-视角融合以及超过20万条精心策划的3D重建指令调优问答（QA）对，VLM-3R有效地将现实世界的空间上下文与语言指令对齐。这使得单目3D空间辅助和具身推理成为可能。为了促进时间推理的评估，我们引入了视觉-空间-时间智能基准，其中包括超过13.86万条问答对，涉及五个不同任务，专注于不断变化的空间关系。广泛的实验表明，我们的模型VLM-3R不仅促进了稳健的视觉-空间推理，还能够理解3D上下文随时间的变化，无论在准确性还是可扩展性方面都表现出色。

Summary / 总结

VLM-3R is a unified framework for Vision-Language Models that integrates 3D reconstructive instruction tuning to enhance spatial understanding. It processes monocular video frames using a geometry encoder to derive implicit 3D tokens, aligning real-world spatial context with language instructions. VLM-3R demonstrates robust visual-spatial reasoning and an understanding of temporal 3D context changes, excelling in accuracy and scalability. To evaluate temporal reasoning, the authors introduce the Vision-Spatial-Temporal Intelligence benchmark, which includes over 138.6K QA pairs across five tasks focusing on evolving spatial relationships.

VLM-3R 是一个统一框架，通过 3D 重建指令调优增强视觉-语言模型，处理单目视频帧并推导出隐式的 3D 令牌以理解空间关系。它借助空间-视觉-视角融合以及超过 20 万条精心整理的 3D 重建指令调优 QA 对，将现实世界的空间上下文与语言指令对齐。实验表明，VLM-3R 在视觉空间推理方面表现出色，并能理解 3D 上下文随时间的变化，具有高准确性和可扩展性。为评估时间推理能力，作者还提出了包含超过 13.86 万条 QA 对的视觉-空间-时间智能基准，涵盖五个专注于演变空间关系的任务。

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

Authors: Peiran Wu, Yunze Liu, Miao Liu, Junxiao Shen

First: 2025-03-16T15:24:11+00:00 · Latest: 2025-04-23T03:44:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Humans excel at spatial-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly understand the 4D world remains uncertain. This paper explores multimodal spatial-temporal reasoning from an egocentric perspective, aiming to equip MLLMs with human-like reasoning capabilities. To support this objective, we introduce Ego-ST Bench, a novel benchmark containing over 5,000 question-answer pairs across four categories, systematically evaluating spatial, temporal, and integrated spatial-temporal reasoning. Additionally, we propose ST-R1 training paradigm, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process, significantly enhancing performance. We combine long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, achieving notable improvements with limited high-quality data. Ego-ST Bench and ST-R1 provide valuable insights and resources for advancing video-based spatial-temporal reasoning research.

中文标题/摘要

标题：ST-Think: 多模态大型语言模型如何从第一人称视频中推理四维世界

人类在空间-时间推理方面表现出色，能够轻松地从第一人称视角解释动态视觉事件。然而，多模态大型语言模型（MLLMs）是否能够以类似的方式理解四维世界仍不确定。本文探讨了从第一人称视角进行多模态空间-时间推理的可能性，旨在赋予MLLMs类似人类的推理能力。为了支持这一目标，我们引入了Ego-ST Bench，这是一个包含超过5,000个问题-答案对的新基准，系统地评估了空间、时间和整合的空间-时间推理能力。此外，我们提出了ST-R1训练范式，这是一种基于视频的推理模型，将逆向思考融入其强化学习过程中，显著提高了性能。我们结合了长链推理（long-CoT）监督微调与组相对策略优化（GRPO）强化学习，使用有限的高质量数据取得了显著的改进。Ego-ST Bench和ST-R1为推进基于视频的空间-时间推理研究提供了宝贵的知识和资源。

Summary / 总结

This paper investigates whether multimodal large language models can understand 4D worlds from an egocentric viewpoint. It introduces Ego-ST Bench, a new benchmark with over 5,000 question-answer pairs, and ST-R1, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process. Combining long-CoT supervised fine-tuning with GRPO reinforcement learning yields notable improvements with limited high-quality data.

该论文探讨了多模态大型语言模型如何从第一人称视角理解4D世界，引入了包含超过5,000个问题-答案对的Ego-ST Bench基准，并提出了ST-R1训练范式，该范式将逆向思考融入强化学习过程以提升性能。模型结合了长链推理监督微调与GRPO强化学习，即使使用少量高质量数据也取得了显著改进。
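
The GRPO step mentioned above scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows; the function name, the `eps` stabiliser, and the choice of population standard deviation are illustrative assumptions, and ST-R1's reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their group average get a positive training signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; a group whose rewards are all identical yields zero advantage for every response.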

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Authors: Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

First: 2023-10-29T16:25:32+00:00 · Latest: 2023-10-29T16:25:32+00:00

Comments: 16 pages, 9 figures, code is available at https://github.com/RenShuhuai-Andy/TESTA

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.

中文标题/摘要

标题：TESTA：时空令牌聚合用于长视频语言理解

大规模视频语言预训练在推进视频语言理解任务方面取得了显著进展。然而，视频编码的沉重计算负担仍然是效率瓶颈，尤其是对于长视频。这些视频由于其固有的三维特性和时空冗余，包含大量的视觉令牌，这使得捕捉复杂的时空关系变得具有挑战性。为了解决这个问题，我们提出了一种高效的方法，称为时空令牌聚合（TESTA）。TESTA 通过自适应聚合相似的帧以及每个帧内的相似块来压缩视频语义。TESTA 可以将视觉令牌的数量减少 75%，从而加速视频编码。基于 TESTA，我们引入了一种预训练的视频语言模型，该模型在每个视频编码块中配备了分割的空间时间令牌聚合模块。我们在五个数据集上对我们的模型进行了评估，用于段落到视频检索和长视频问答任务。实验结果表明，TESTA 将计算效率提高了 1.7 倍，并且由于其在处理更长输入帧方面的可扩展性，实现了显著的性能提升，例如在 QuerYD 上 +13.7 R@1 和在 Condensed Movie 上 +6.5 R@1。

Summary / 总结

The research aims to address the computational efficiency bottleneck in long-form video-language understanding tasks by proposing TESTA, which adaptively aggregates similar frames and patches to reduce visual tokens by 75%. The method significantly improves computing efficiency by 1.7 times and achieves notable performance gains, especially for longer input frames, with improvements of +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.

研究旨在解决长视频语言理解任务中视频编码的计算效率瓶颈。提出的TESTA方法通过自适应聚合相似帧和帧内相似块，将视觉令牌数量减少75%，计算效率提高1.7倍。得益于可处理更长输入帧的可扩展性，TESTA在QuerYD上提升了13.7 R@1，在Condensed Movie上提升了6.5 R@1。
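
To make the idea of aggregating similar tokens concrete, the sketch below repeatedly averages the most similar pair of adjacent tokens until a quarter of them remain (a 75% reduction). This greedy adjacent-pair merge is a deliberately simple stand-in chosen for illustration, not TESTA's actual aggregation module; the function names and `keep_ratio` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate_tokens(tokens, keep_ratio=0.25):
    """Greedily merge the most similar adjacent pair until keep_ratio remain."""
    tokens = [list(t) for t in tokens]
    weights = [1] * len(tokens)  # how many original tokens each entry represents
    target = max(1, math.ceil(len(tokens) * keep_ratio))
    while len(tokens) > target:
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        wa, wb = weights[i], weights[i + 1]
        merged = [(wa * x + wb * y) / (wa + wb)
                  for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
        weights[i:i + 2] = [wa + wb]
    return tokens


# Hypothetical usage: 8 frame-level tokens reduced to 2.
frame_tokens = [[1, 0], [1, 0.1], [0, 1], [0.1, 1], [1, 0], [1, 0], [0, 1], [0, 1]]
condensed = aggregate_tokens(frame_tokens)
```

The same routine could be applied along the temporal axis (frames) and the spatial axis (patches within a frame) in turn; the weighted average keeps each merged token equal to the mean of the originals it absorbed.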

MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

Authors: Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang

Venue: CVPR 2025

First: 2024-05-23T09:08:09+00:00 · Latest: 2025-02-27T02:34:16+00:00

Comments: Accepted by CVPR 2025. The first two authors contribute equally

Abs · PDF · Code1 · Code2 · Code3

Abstract

Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4D achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 Score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. Especially, for long video sequences, our method has a significant efficiency improvement with 87.5% GPU memory reduction and 5.36 times speed-up. Codes will be released at https://github.com/IRMVLab/Mamba4D.

中文标题/摘要

标题：MAMBA4D：基于解耦时空状态空间模型的高效长序列点云视频理解

点云视频能够真实地捕捉现实世界的空间几何和时间动态，这对于使智能代理理解动态变化的世界至关重要。然而，设计有效的4D主干结构仍然具有挑战性，主要是由于点的不规则和无序分布以及帧间的时间不一致性。此外，最近基于Transformer的4D主干通常由于其二次复杂性而面临巨大的计算成本，特别是在长视频序列中。为了解决这些挑战，我们提出了一种完全基于状态空间模型（SSMs）的全新点云视频理解主干。具体来说，我们首先在4D视频序列中解耦空间和时间，然后通过我们设计的Mamba模块建立时空相关性。我们开发了帧内空间Mamba模块（Intra-frame Spatial Mamba）来在一定时间跨度内编码局部相似的几何结构。随后，局部相关令牌被传递到帧间时间Mamba模块（Inter-frame Temporal Mamba），该模块以线性复杂度整合整个视频中的长期点特征。我们提出的Mamba4D在MSR-Action3D动作识别（+10.4%准确率）、HOI4D动作分割（+0.7 F1分数）和Synthia4D语义分割（+0.19 mIoU）数据集上取得了有竞争力的表现。特别是对于长视频序列，我们的方法效率提升显著，GPU内存减少87.5%，速度提升5.36倍。代码将在https://github.com/IRMVLab/Mamba4D上发布。

Summary / 总结

This paper addresses the challenge of understanding long-sequence point cloud videos by proposing MAMBA4D, a novel backbone based on State Space Models. It disentangles spatial and temporal components, using Mamba blocks to encode local geometric structures and integrate long-term features efficiently. The method achieves competitive performance on action recognition, action segmentation, and semantic segmentation, and delivers significant efficiency gains on long sequences, reducing GPU memory usage by 87.5% and running 5.36 times faster.

研究旨在通过解决不规则点分布和时间不一致性的问题，开发一种高效的点云视频理解骨干网络。方法引入了解耦的空间-时间状态空间模型MAMBA4D，使用Mamba模块编码帧内空间结构并以线性复杂度整合帧间时间特征。所提模型在动作识别、动作分割和语义分割任务上取得了有竞争力的表现，并且在长视频序列上显示出显著的效率提升，GPU内存使用减少了87.5%，速度提高了5.36倍。
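
The linear complexity attributed to state space models comes from a recurrence that visits each sequence element once. The sketch below is a scalar, fixed-parameter linear state space scan written purely for illustration; Mamba-style blocks use input-dependent (selective) parameters and vector states, which this example omits, and the parameter names `a`, `b`, `c` are assumptions.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t for each input.

    One pass over the sequence, so cost grows linearly with its length,
    in contrast to the pairwise (quadratic) cost of full self-attention.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# Hypothetical usage: an impulse decays geometrically through the hidden state.
response = ssm_scan([1.0, 0.0, 0.0], a=0.5)
```

The hidden state `h` carries a compressed summary of all earlier inputs, which is how long-range information can reach later frames without comparing every pair of tokens.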

UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen

First: 2025-09-29T08:14:26+00:00 · Latest: 2025-09-29T08:14:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

中文标题/摘要

标题：UI2V-Bench：基于理解的图像到视频生成基准

生成扩散模型正在迅速发展并受到广泛关注，这得益于其广泛的应用范围。图像到视频（I2V）生成已成为视频合成领域的重点。然而，现有的评估基准主要集中在视频质量和时间一致性方面，而很大程度上忽视了模型对输入图像中特定主题的语义理解能力，或确保生成的视频符合物理定律和人类常识。为解决这一问题，我们提出了UI2V-Bench，这是一种新的基准，专注于语义理解和推理评估。它引入了四个主要评估维度：空间理解、属性绑定、类别理解以及推理。为了评估这些维度，我们基于多模态大型语言模型（MLLMs）设计了两种评估方法：一种是实例级的管道，用于细粒度的语义理解，另一种是基于反馈的推理管道，能够逐步进行因果评估，以实现更准确的评估。UI2V-Bench 包含约500个精心构建的图文对，并在所有定义的维度上评估了各种开源和闭源的I2V模型。我们进一步纳入了人工评估，结果显示与提出的基于MLLM的度量标准高度一致。总体而言，UI2V-Bench 通过强调语义理解和推理能力填补了I2V评估中的关键空白，提供了一个强大的框架和数据集，以支持该领域的未来研究和模型开发。

Summary / 总结

UI2V-Bench is a new benchmark for evaluating Image-to-Video generation models, focusing on semantic understanding and reasoning. It introduces four evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. Two evaluation methods based on Multimodal Large Language Models are designed: an instance-level pipeline for fine-grained semantic understanding and a feedback-based reasoning pipeline for step-by-step causal assessment. The benchmark includes approximately 500 carefully constructed text-image pairs and evaluates a range of open-source and closed-source I2V models. Human evaluations align strongly with the proposed MLLM-based metrics, and the benchmark addresses a gap in existing benchmarks by emphasizing semantic comprehension and reasoning ability.

UI2V-Bench 是一个新的 Image-to-Video (I2V) 模型评估基准，重点关注语义理解和推理能力。它引入了四个评估维度：空间理解、属性绑定、类别理解以及推理。该基准使用基于多模态大型语言模型 (MLLMs) 的两种方法来评估这些维度，包括一个实例级管道进行细粒度语义理解以及一个反馈驱动的推理管道进行逐步因果评估。UI2V-Bench 包含约 500 个精心构建的文本-图像对，评估了多种开源和闭源 I2V 模型，其基于 MLLM 的指标与人工评估结果高度一致。该基准填补了现有评估基准的空白，强调了 I2V 模型的语义理解和推理能力。

Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling

Authors: Hsin-Ying Lee, Hung-Ting Su, Bing-Chen Tsai, Tsung-Han Wu, Jia-Fong Yeh, Winston H. Hsu

First: 2022-10-08T07:03:31+00:00 · Latest: 2022-10-08T07:03:31+00:00

Comments: BMVC 2022. Code is available at https://github.com/shinying/dest

Abs · PDF · Code1 · Code2 · Code3

Abstract

While recent large-scale video-language pre-training made great progress in video question answering, the design of spatial modeling of video-language models is less fine-grained than that of image-language models; existing practices of temporal modeling also suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline, Decoupled Spatial-Temporal Encoders, integrating an image- and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn temporal relations for video QA, we propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences. Extensive experiments demonstrate that our model outperforms previous work pre-trained on orders of magnitude larger datasets.

中文标题/摘要

标题：通过解耦空时建模学习面向视频问答的细粒度视觉理解

尽管近期大规模视频-语言预训练在视频问答方面取得了巨大进展，但视频-语言模型的空间建模设计不如图像-语言模型精细；现有的时间建模实践也受到模态间对齐弱且有噪声的困扰。为了学习细粒度的视觉理解，我们解耦空时建模并提出了一种混合管道，即解耦空时编码器（Decoupled Spatial-Temporal Encoders），该管道整合了一个图像-语言编码器和一个视频-语言编码器。前者独立于时间，从较大但稀疏采样的帧中编码空间语义，而后者在较低的空间分辨率但更高的时间分辨率下建模时间动态。为了帮助视频-语言模型学习视频问答中的时间关系，我们提出了一种新的预训练目标，即时间引用建模（Temporal Referring Modeling），该目标要求模型识别视频序列中事件的时间位置。广泛的实验表明，我们的模型优于那些在大几个数量级的数据集上预训练的先前工作。

Summary / 总结

The research aims to enhance fine-grained visual understanding in video question answering by decoupling spatial and temporal modeling. The proposed method, Decoupled Spatial-Temporal Encoders, integrates an image-language encoder for spatial semantics and a video-language encoder for temporal dynamics. The Temporal Referring Modeling objective is introduced to improve the model's ability to learn temporal relations. Experiments show that this approach outperforms previous work that was pre-trained on datasets orders of magnitude larger.

研究旨在通过分离空间和时间建模来提高视频问答中的细粒度视觉理解。方法引入了混合管道Decoupled Spatial-Temporal Encoders，使用图像语言编码器从稀疏采样的帧中捕获空间语义，使用视频语言编码器建模时间动态。提出了一个新的预训练目标，Temporal Referring Modeling，以帮助模型学习时间关系。实验表明，所提出的方法优于在大几个数量级的数据集上预训练的先前工作。
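
The decoupled design described above feeds two streams with different sampling trade-offs: a few large frames for spatial detail and many small frames for temporal dynamics. The sketch below only plans which frame indices and resolutions each stream would receive; the frame counts (4 and 16) and resolutions (384 and 224) are hypothetical values chosen for illustration, not the paper's settings.

```python
def uniform_indices(num_frames, num_samples):
    """Pick frame indices at the centres of `num_samples` equal-width bins."""
    num_samples = min(num_samples, num_frames)
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def plan_streams(num_frames, sparse=4, dense=16, hi_res=384, lo_res=224):
    """Return the sampling plan for the spatial (image) and temporal (video) streams."""
    return {
        # Few frames at high resolution: spatial semantics, independent of time.
        "image_stream": {"indices": uniform_indices(num_frames, sparse),
                         "resolution": hi_res},
        # Many frames at low resolution: temporal dynamics.
        "video_stream": {"indices": uniform_indices(num_frames, dense),
                         "resolution": lo_res},
    }


# Hypothetical usage on a 32-frame clip.
plan = plan_streams(32)
```

Splitting the budget this way keeps the total number of pixels processed manageable while letting each encoder specialise in the axis it handles best.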

EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization

Authors: Xiaoqi Wang, Yi Wang, Lap-Pui Chau

First: 2025-06-17T09:51:51+00:00 · Latest: 2025-06-17T09:51:51+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .

中文标题/摘要

标题：EVA02-AT：基于空间-时间旋转位置嵌入和对称优化的第一人称视角视频-语言理解

第一人称视角视频-语言理解需要高效性和精确的空间-时间建模。现有方法面临三个关键挑战：1）多阶段预训练管道导致的高昂预训练成本，2）由于手动拆分的3D旋转位置嵌入导致的空间-时间编码效果不佳，妨碍了特征交互，3）软标签多实例检索中的不精确学习目标，忽视了负样本对之间的相关性。本文中，我们引入了EVA02-AT，这是一套基于EVA02的视频-语言基础模型，专门针对第一人称视角视频理解任务。EVA02-AT首先通过单阶段预训练将基于图像的CLIP模型高效地转换为统一的视频编码器。其次，我们没有将旋转位置嵌入分别应用于孤立的维度，而是引入了空间-时间旋转位置嵌入和联合注意力，可以有效地在整个隐藏维度上编码空间和时间信息。这种空间-时间特征的联合编码使模型能够学习跨轴关系，这对于准确建模视频中的运动和交互至关重要。第三，针对多实例视频-语言检索任务，我们引入了对称多相似性（SMS）损失和一种新的训练框架，同时推进正样本对和负样本对的所有软标签，从而提供更精确的学习目标。在Ego4D、EPIC-Kitchens-100和Charades-Ego数据集上的零样本和微调设置下的广泛实验表明，EVA02-AT以更少的参数在各种第一人称视角视频-语言任务中实现了最先进的性能。使用我们SMS损失的模型在多实例检索基准测试中也显示出显著的性能提升。我们的代码和模型已公开发布在 https://github.com/xqwang14/EVA02-AT 。

Summary / 总结

EVA02-AT addresses the challenges of egocentric video-language understanding by introducing a single-stage pretraining method, spatial-temporal rotary positional embeddings, and a novel Symmetric Multi-Similarity loss. The model outperforms existing approaches on Ego4D, EPIC-Kitchens-100, and Charades-Ego with fewer parameters, particularly in multi-instance retrieval tasks.

EVA02-AT 通过引入单阶段预训练方法、时空旋转位置嵌入以及新型对称多相似性损失，解决了第一人称视角视频-语言理解的挑战。该模型以更少的参数在 Ego4D、EPIC-Kitchens-100 和 Charades-Ego 上优于现有方法，在多实例检索任务中提升尤为明显。

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Authors: Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

Venue: CVPR 2024 highlight

First: 2023-11-28T17:59:04+00:00 · Latest: 2024-05-23T14:49:29+00:00

Comments: CVPR 2024 highlight: updated version with Mistral and better performances for MVBench/NExT-QA/STAR/TVQA/EgoSchema/IntentQA

Abs · PDF · Code1 · Code2 · Code3

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

中文标题/摘要

标题：MVBench：一个全面的多模态视频理解基准

随着多模态大型语言模型（MLLMs）的快速发展，最近出现了一些诊断基准来评估这些模型的理解能力。然而，大多数基准主要评估静态图像任务中的空间理解，而忽视了动态视频任务中的时间理解。为了解决这一问题，我们引入了一个全面的多模态视频理解基准，即MVBench，涵盖了20个无法仅凭单帧有效解决的具有挑战性的视频任务。具体来说，我们首先引入了一种新颖的静态到动态方法来定义这些与时间相关的任务。通过将各种静态任务转化为动态任务，我们得以系统地生成视频任务，这些任务需要从感知到认知的广泛时间技能。然后，在任务定义的指导下，我们自动将公共视频注释转换为多项选择问答，以评估每个任务。一方面，这种独特的范式使我们能够高效地构建MVBench，而不需要太多的手动干预。另一方面，它保证了使用真实视频注释进行评估的公平性，避免了LLM的偏见评分。此外，我们进一步开发了一个稳健的视频MLLM基线，即VideoChat2，通过渐进的多模态训练和多样化的指令调优数据进行训练。我们在MVBench上的广泛结果表明，现有的MLLMs在时间理解方面远未令人满意，而我们的VideoChat2在MVBench上领先这些模型超过15%。所有模型和数据可在 https://github.com/OpenGVLab/Ask-Anything 获取。

Summary / 总结

MVBench is a comprehensive benchmark for multi-modal video understanding, addressing the lack of temporal understanding in existing benchmarks. It introduces 20 challenging video tasks and a static-to-dynamic method to define temporal tasks, converting static tasks into dynamic ones. The benchmark evaluates models through automatically generated multiple-choice QA from video annotations, ensuring fairness. The VideoChat2 model, trained with diverse instruction-tuning data, significantly outperforms existing models on MVBench, demonstrating the need for improved temporal understanding in MLLMs.

MVBench 是一个全面的多模态视频理解基准，旨在解决现有基准在时间理解方面的不足。它引入了一种新颖的静态到动态方法来定义 20 个具有挑战性的视频任务，并自动将视频注释转换为多项选择题。该基准评估了如 VideoChat2 等模型，VideoChat2 在 MVBench 上的表现超过了领先模型超过 15%，突显了现有多模态大语言模型在时间理解方面的局限性。所有模型和数据可在 https://github.com/OpenGVLab/Ask-Anything 获取。
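The annotation-to-multiple-choice step described above can be illustrated with a small sketch. This is not MVBench's actual pipeline: the function names, the dictionary layout, and the way distractors are supplied are all assumptions made for illustration. It only shows why scoring against ground-truth options needs no LLM judge.

```python
import random


def annotation_to_mcq(question, correct, distractors, rng):
    """Turn one ground-truth annotation into a multiple-choice item.

    `distractors` are wrong answers supplied by the caller (hypothetical input);
    they must differ from `correct`.
    """
    options = [correct] + list(distractors)
    rng.shuffle(options)  # avoid always placing the answer first
    letters = "ABCDEFGH"[: len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(correct)],
    }


def accuracy(items, predictions):
    """Fraction of items whose predicted letter matches the ground truth."""
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item["answer"])
    return hits / len(items)


rng = random.Random(0)
item = annotation_to_mcq(
    "What does the person do after picking up the cup?",
    "drinks from it",
    ["puts it down", "throws it", "washes it"],
    rng,
)
```

Because the answer key comes straight from the annotation, a model's letter choice can be scored by exact match.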

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Authors: Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, Jinwoo Choi

Venue: CVPR 2025

First: 2025-03-20T05:48:59+00:00 · Latest: 2025-03-20T05:48:59+00:00

Comments: Accepted for CVPR 2025

Abs · PDF · Code1 · Code2

Abstract

In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.

中文标题/摘要

标题：MASH-VLM：通过解耦空间-时间表示减轻视频大语言模型中的动作场景幻觉

在本文中，我们解决了视频大语言模型（Video-LLMs）中的动作场景幻觉问题，即模型基于场景上下文错误地预测动作，或基于观察到的动作错误地预测场景。我们观察到，现有Video-LLMs由于两个主要原因经常遭受动作场景幻觉。首先，现有Video-LLMs通过在所有标记之间应用注意力操作来混合空间和时间特征。其次，它们使用标准的旋转位置嵌入（RoPE），这使得文本标记会根据其顺序过度强调某些类型的标记。为了解决这些问题，我们提出了MASH-VLM，一种通过解耦空间-时间表示来减轻Video-LLMs中动作场景幻觉的方法。我们的方法包括两个关键创新：（1）DST-注意力，一种新颖的注意力机制，通过使用掩码注意力来限制空间和时间标记之间的直接交互，从而在LLM中解耦空间和时间标记；（2）谐波-RoPE，它扩展了位置ID的维度，使空间和时间标记能够相对于文本标记保持平衡的位置。为了评估Video-LLMs中的动作场景幻觉，我们引入了包含1,320个视频和4,078个问答对的UNSCENE基准。广泛的实验表明，MASH-VLM在UNSCENE基准以及现有视频理解基准上均取得了最先进的结果。

Summary / 总结

This work addresses action-scene hallucination in Video Large Language Models (Video-LLMs) by introducing MASH-VLM, which uses disentangled spatial-temporal representations. MASH-VLM employs DST-attention to separate spatial and temporal tokens and Harmonic-RoPE to balance their positions relative to text tokens. Experiments show MASH-VLM outperforms existing methods on the UNSCENE benchmark and other video understanding benchmarks.

本文通过引入使用分离时空表示的MASH-VLM来解决Video Large Language Models（Video-LLMs）中的动作场景幻觉问题。MASH-VLM采用DST-注意力机制分离时空令牌，并使用Harmonic-RoPE保持它们与文本令牌的相对位置平衡。实验表明，MASH-VLM在UNSCENE基准和现有视频理解基准上均优于其他方法。
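The masked-attention idea behind DST-attention can be sketched as a boolean mask over token pairs. This is a minimal illustration, not the authors' implementation: the token-type labels, the function name `dst_mask`, and the added causal constraint are assumptions. The abstract only states that direct interactions between spatial and temporal tokens are restricted.

```python
def dst_mask(token_types):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumptions (not from the paper): tokens are labelled "spatial", "temporal",
    or "text", and attention is also causal (a query sees only itself and earlier keys).
    """
    n = len(token_types)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys up to and including position q
            pair = {token_types[q], token_types[k]}
            # Block only direct spatial <-> temporal interaction.
            mask[q][k] = pair != {"spatial", "temporal"}
    return mask


types = ["spatial", "spatial", "temporal", "text"]
mask = dst_mask(types)
```

In this sketch the text token still attends to both spatial and temporal tokens, so the two streams are combined only through text.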

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

Authors: Ziyi Wang, Haoran Wu, Yiming Rong, Deyang Jiang, Yixin Zhang, Yunlong Zhao, Shuang Xu, Bo XU

First: 2025-04-09T12:51:10+00:00 · Latest: 2025-04-09T12:51:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.

中文标题/摘要

标题：LVC：一种轻量级压缩框架以增强VLMs在长视频理解中的能力

长视频理解是一项复杂的任务，需要同时具备空间细节和时间意识。虽然视觉语言模型（VLMs）通过多帧输入获得帧级理解能力，但由于稀疏采样策略，它们会遭受信息损失。相比之下，视频大型语言模型（Video-LLMs）能够捕捉视觉特征内的时间关系，但受限于高质量视频-文本数据集的稀缺性。为了以最小的数据和计算成本将长视频理解能力转移到VLMs，我们提出了一种名为轻量级视频压缩（LVC）的新方法，该方法具有查询-注意视频压缩机制，有效解决了VLMs中的稀疏采样问题。通过仅使用10000个短视频-文本对训练对齐层，LVC显著增强了VLMs的时间推理能力。广泛实验表明，LVC在包括InternVL2系列和Phi-3.5-Vision在内的多种模型中提供了持续的性能改进。值得注意的是，InternVL2-40B-LVC在长视频理解基准MLVU和Video-MME上的得分为68.2和65.9，分别相对提高了14.6%和7.7%。增强后的模型和代码将很快公开。

Summary / 总结

The paper proposes Lightweight Video Compression (LVC) to enhance VLMs for long video understanding by addressing the sparse sampling issue. LVC introduces the Query-Attention Video Compression mechanism and trains only the alignment layer, using 10k short video-text pairs, to improve temporal reasoning. Experiments show consistent gains across models including the InternVL2 series and Phi-3.5-Vision; InternVL2-40B-LVC scores 68.2 on MLVU and 65.9 on Video-MME, relative improvements of 14.6% and 7.7%.

论文提出了一种轻量级视频压缩（LVC）方法，通过解决稀疏采样问题来增强VLMs的长视频理解能力。LVC使用Query-Attention视频压缩机制，仅用10k短视频-文本对训练对齐层，提升时间推理能力。实验结果显示，各种模型均表现出一致的性能提升，InternVL2-40B-LVC在MLVU和Video-MME基准上的得分分别提高了14.6%和7.7%。

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Authors: Xiaoyi Bao, Chenwei Xie, Hao Tang, Tingyu Weng, Xiaofeng Wang, Yun Zheng, Xingang Wang

Venue: ICCV 2025

First: 2025-07-21T12:50:49+00:00 · Latest: 2025-07-21T12:50:49+00:00

Comments: Accepted by ICCV 2025

Abs · PDF · Code1 · Code2

Abstract

In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.

中文标题/摘要

标题：DynImg：视觉提示下的关键帧是多模态视频理解的良好表示

近年来，多模态大型语言模型（MLLMs）在视频理解任务中的应用越来越普遍。然而，如何有效整合时间信息仍然是一个关键的研究重点。传统方法将空间和时间信息分开处理。由于运动模糊等问题，快速移动物体的空间信息难以准确表示，这可能导致在空间特征提取过程中重要的时间区域被低估，从而妨碍准确的空间-时间交互和视频理解。为了解决这一局限性，我们提出了一种名为动态图像（DynImg）的创新视频表示方法。具体来说，我们引入了一组非关键帧作为时间提示，以突出包含快速移动物体的空间区域。在视觉特征提取过程中，这些提示引导模型额外关注这些区域对应的空间细节特征。此外，为了保持DynImg的正确顺序，我们采用相应的4D视频旋转位置嵌入。这保留了DynImg的时间和空间邻近性，帮助MLLM理解这种组合格式中的空间-时间顺序。实验评估表明，DynImg在多个视频理解基准测试中比最先进的方法高出约2%，证明了我们的时间提示在增强视频理解方面的有效性。

Summary / 总结

The paper proposes DynImg, a method that uses non-key frames as temporal prompts to enhance video understanding. It addresses the challenge of accurately representing fast-moving objects by guiding the model to focus on their spatial details. DynImg incorporates 4D video Rotary Position Embedding to maintain temporal and spatial adjacency, improving MLLM's ability to understand spatio-temporal interactions. Experiments show that DynImg outperforms existing methods by about 2% across various video understanding benchmarks.

该研究提出DynImg方法，通过引入视觉提示整合时间信息，以增强多模态视频理解。通过在视觉特征提取过程中引入非关键帧作为时间提示，解决快速移动物体准确表示的难题。实验结果显示，DynImg在多个基准测试中比现有方法高出约2%，证明了这些提示在提升视频理解方面的有效性。

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Authors: Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Venue: ICLR 2025

First: 2024-09-19T17:59:51+00:00 · Latest: 2025-02-27T06:09:46+00:00

Comments: Accepted to ICLR 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.

中文标题/摘要

标题：Oryx MLLM：任意分辨率下的时空理解

视觉数据形式多样，从几像素的小图标到数小时的长视频不等。现有的多模态LLM通常将这些多样的视觉输入标准化为固定分辨率，并为LLM生成相同数量的令牌。这种方法对于多模态理解来说并不理想，也不利于处理长视频和短视频输入。为了解决这个问题，我们提出了Oryx，一种统一的多模态架构，用于图像、视频和多视图3D场景的空间-时间理解。Oryx通过两项核心创新提供了一种按需解决方案，以无缝且高效地处理任意空间大小和时间长度的视觉输入：1）预训练的OryxViT模型，可以将图像以任意分辨率编码为LLM友好的视觉表示；2）动态压缩模块，可根据需求支持1x到16x的视觉令牌压缩。这些设计特性使Oryx能够以较低分辨率和高压缩率处理极长的视觉上下文，如视频，同时在文档理解等任务中保持高识别精度，无需压缩。除了架构改进，增强的数据整理和针对长上下文检索和空间感知数据的专业训练帮助Oryx在图像、视频和3D多模态理解方面实现了强大的能力。我们的工作已开源，地址为https://github.com/Oryx-mllm/Oryx。

Summary / 总结

The paper proposes Oryx, a unified multimodal architecture designed for spatial-temporal understanding of images, videos, and multi-view 3D scenes. It introduces a pre-trained OryxViT model that can encode images at any resolution and a dynamic compressor module that supports 1x to 16x compression of visual tokens on request. This lets Oryx handle extremely long visual contexts such as videos at lower resolution and high compression, while keeping high recognition precision on tasks like document understanding at native resolution with no compression. The authors also credit enhanced data curation and specialized training on long-context retrieval and spatial-aware data for Oryx's strong image, video, and 3D understanding.

论文提出了Oryx，一种统一的多模态架构，用于处理图像、视频和3D场景的空间-时间理解。它引入了一个预训练的OryxViT模型，可以将图像以任意分辨率编码，并且有一个动态压缩模块，支持按需压缩视觉标记。实验表明，Oryx可以高效地处理长视觉上下文，同时在文档理解等任务中保持高识别精度。除了架构改进，增强的数据整理和专门的长上下文检索训练提高了其在多模态理解方面的能力。
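To make the "1x to 16x compression on request" idea concrete, here is a rough sketch that shortens a token sequence by a caller-chosen ratio. Mean pooling over consecutive tokens is a stand-in chosen for illustration; the abstract does not describe how Oryx's dynamic compressor works internally, and the function name `compress_tokens` is hypothetical.

```python
def compress_tokens(tokens, ratio):
    """Average every `ratio` consecutive token vectors into one.

    `tokens` is a list of equal-length float lists. Mean pooling is an
    illustrative stand-in, not Oryx's actual compressor.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    compressed = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start : start + ratio]
        dim = len(group[0])
        compressed.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return compressed


# 32 toy two-dimensional "visual tokens".
tokens = [[float(i), 1.0] for i in range(32)]
full = compress_tokens(tokens, 1)       # e.g. a document image: keep every token
short = compress_tokens(tokens, 16)     # e.g. a long video: 16x fewer tokens
```

The same input yields 32 tokens at ratio 1 and 2 tokens at ratio 16, which is the trade-off the abstract describes between detail and context length.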

CoS: Chain-of-Shot Prompting for Long Video Understanding

Authors: Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong

First: 2025-02-10T13:03:05+00:00 · Latest: 2025-02-11T14:59:25+00:00

Comments: A training-free test-time optimisation approach for long video understanding

Abs · PDF · Code1 · Code2 · Project1

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, leaving the context filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.

中文标题/摘要

标题：CoS：链式-shot提示在长视频理解中的应用

多模态大型语言模型（MLLMs）在处理长视频时遇到困难，因为需要大量的视觉标记。这些标记远远超过了MLLMs的上下文长度，导致视频中充满了冗余且与任务无关的镜头。如何选择镜头是一个未解决的关键问题：稀疏采样可能会错过关键细节，而全面采样则会令模型陷入无关内容的困扰，导致视频理解错误。为了解决这个问题，我们提出了链式-shot提示（CoS）。核心思想是将镜头选择视为测试时的视觉提示优化，通过优化镜头与视频理解语义任务的对齐来选择适应视频理解任务的镜头。CoS有两个关键部分：（1）一种二元视频摘要机制，执行伪时间定位，发现二元编码以识别任务相关镜头，（2）一种视频共推理模块，利用二元编码将任务相关正镜头与无关负镜头配对（学习对齐）。它将优化的镜头选择嵌入到原始视频中，有助于聚焦于相关上下文以优化长视频理解。在三个基线和五个数据集上的实验表明CoS的有效性和适应性。代码见https://lwpyh.github.io/CoS。

Summary / 总结

The research addresses the challenge of long video understanding for Multi-modal Large Language Models (MLLMs) by proposing Chain-of-Shot prompting (CoS). CoS optimizes shot selection at test time by aligning shots with the semantic task, using a binary video summary mechanism and a video co-reasoning module. Experiments show that CoS improves long video understanding across three baselines and five datasets, demonstrating its effectiveness and adaptability.

论文提出了一种Chain-of-Shot提示（CoS）方法，以解决多模态大型语言模型（MLLMs）在理解长视频时面临的挑战。CoS在测试时通过使镜头与语义任务对齐来优化镜头选择，使用二元视频摘要机制和视频共推理模块。实验表明，CoS在多个基线和数据集上提高了长视频理解的效果，且无需训练。
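The binary coding described for CoS can be pictured as a 0/1 vector over shots that splits them into task-relevant positives and irrelevant negatives. The sketch below assumes per-shot relevance scores are already available; in the paper they come from pseudo temporal grounding, which is not reproduced here. The threshold, scores, and function names are hypothetical.

```python
def binary_code(scores, threshold=0.5):
    """Map per-shot relevance scores (hypothetical inputs) to a 0/1 code."""
    return [1 if score >= threshold else 0 for score in scores]


def split_shots(shots, code):
    """Separate shots into task-relevant positives and irrelevant negatives."""
    positive = [shot for shot, bit in zip(shots, code) if bit == 1]
    negative = [shot for shot, bit in zip(shots, code) if bit == 0]
    return positive, negative


shots = ["shot_0", "shot_1", "shot_2", "shot_3"]
scores = [0.1, 0.8, 0.4, 0.9]  # made-up relevance to the question
code = binary_code(scores)
positive, negative = split_shots(shots, code)
```

Keeping both lists matters in this reading of the abstract, since the co-reasoning module pairs positive shots with negative ones rather than discarding the negatives.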

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

First: 2024-03-22T17:57:42+00:00 · Latest: 2024-08-14T14:31:50+00:00

Comments: a technical report about video understanding (accepted to ECCV2024)

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

中文标题/摘要

标题：InternVideo2: 扩展多模态视频理解的基础模型

我们介绍了InternVideo2，这是一种新的视频基础模型（ViFM）家族，其在视频识别、视频-文本任务和视频为中心的对话中达到了最先进的结果。我们的核心设计是一种渐进式训练方法，将掩蔽视频建模、跨模态对比学习和下一个标记预测统一起来，将视频编码器的规模扩展到60亿参数。在数据层面，我们通过语义分割视频并生成视频-音频-语音字幕来优先考虑时空一致性，这提高了视频和文本之间的对齐。通过广泛的实验，我们验证了我们的设计，并在超过60个视频和音频任务上展示了优越的性能。值得注意的是，我们的模型在各种视频相关对话和长视频理解基准测试中表现优于其他模型，突显了其对更长上下文进行推理和理解的能力。代码和模型可在https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/ 获取。

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Authors: Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu

First: 2025-09-15T12:39:19+00:00 · Latest: 2025-09-15T12:39:19+00:00

Comments: 25 pages, 16 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.

中文标题/摘要

标题：Dr.V：一种分层感知-时序-认知框架，通过细粒度的空间-时序定位诊断视频幻觉

近年来，大型视频模型（LVMs）在视频理解方面取得了显著进步。然而，这些模型仍然存在幻觉问题，生成与输入视频冲突的内容。为了解决这一问题，我们提出了Dr.V，一种涵盖感知、时序和认知层次的分层框架，通过细粒度的空间-时序定位诊断视频幻觉。Dr.V 包含两个关键组件：基准数据集Dr.V-Bench和卫星视频代理Dr.V-Agent。Dr.V-Bench 包含来自4,974个视频的10,000个实例，涵盖多种任务，每个实例都附有详细的时空注释。Dr.V-Agent 通过在感知和时序层次上系统地应用细粒度的空间-时序定位，然后进行认知层次推理，来检测LVMs中的幻觉。这一逐步管道模仿了人类的视频理解过程，并有效地识别了幻觉。大量实验表明，Dr.V-Agent 在诊断幻觉的同时提高了可解释性和可靠性，为在实际场景中实现稳健的视频理解提供了实用的蓝图。所有数据和代码均可在https://github.com/Eurekaleo/Dr.V 获取。

Summary / 总结

Dr.V is a hierarchical framework that diagnoses video hallucination by fine-grained spatial-temporal grounding, covering perceptive, temporal, and cognitive levels. It includes a benchmark dataset, Dr.V-Bench, with 10k instances drawn from 4,974 videos, and a satellite video agent, Dr.V-Agent. Dr.V-Agent detects hallucinations in large video models by applying spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. Extensive experiments show that Dr.V-Agent diagnoses hallucinations effectively while enhancing interpretability and reliability, which the authors present as a practical blueprint for robust video understanding in real-world scenarios.

Dr.V 是一个分层框架，通过在感知、时间和认知层面应用细粒度的空间-时间定位来诊断视频幻觉。它包含一个基准数据集 Dr.V-Bench，包含 10k 个实例，以及一个卫星视频代理 Dr.V-Agent。Dr.V-Agent 通过应用细粒度的空间-时间定位和认知推理来检测幻觉，有效识别幻觉并增强视频理解的可解释性和可靠性。广泛的实验显示其在大型视频模型中诊断幻觉的有效性。

Video Panels for Long Video Understanding

Authors: Lars Doorenbos, Federico Spurio, Juergen Gall

First: 2025-09-28T08:05:55+00:00 · Latest: 2025-09-28T08:05:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.

中文标题/摘要

标题：长视频理解的视频面板

近期的视频-语言模型（VLMs）在长视频理解任务上取得了令人鼓舞的结果，但其性能仍落后于涉及图像或短视频的任务。这引发了通过引入新颖模块和额外复杂性来改进VLMs长上下文建模的极大兴趣。在本文中，我们采取了不同的方法：我们不使用有限的数据微调VLMs，而是尝试最大化现有模型的性能。为此，我们提出了一种专为长视频理解设计的新型视觉提示策略。通过将多个帧作为面板组合成一张图像，我们有效地以空间细节换取时间分辨率。我们的方法是无需训练的、无需参数的、模型无关的，并且可以无缝集成到现有的VLMs中。在五个广泛使用的基准上的大量实验，涵盖了各种模型架构、大小和上下文窗口，证实了我们方法的一致性。对于TimeScope（长）数据集，该数据集包含最长的视频，视频问答的准确性提高了高达19.4%。总体而言，我们的方法提高了长视频理解模型的标准。我们将在接受后提供我们的代码。

Summary / 总结

This paper addresses the limitations of Video-Language Models (VLMs) in understanding long videos by proposing a novel visual prompting strategy. Instead of fine-tuning VLMs, the authors combine multiple frames into panels to enhance temporal resolution while trading off spatial details. This approach is training-free, parameter-free, and model-agnostic, and it improves video question answering accuracy by up to 19.4% on the TimeScope (Long) dataset, demonstrating its effectiveness across various model architectures and context windows.

本文提出了一种新颖的视觉提示策略，将多个帧组合成面板以增强时间分辨率，解决长视频理解的挑战。该方法无需训练和参数，且适用于各种模型架构，并在TimeScope (Long) 数据集上将视频问答的准确性提高了高达19.4%，在不同模型架构和上下文窗口下表现出一致的性能。
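The core operation, combining multiple frames as panels into one image, can be sketched with plain nested lists standing in for grayscale frames. This is an illustrative sketch only: the function name `make_panel`, the row-major panel order, and the grid shape are assumptions, and the downscaling that would keep the combined image at a fixed resolution is omitted.

```python
def make_panel(frames, rows, cols):
    """Tile rows*cols equally sized frames into one image, in row-major order.

    Each frame is a list of pixel rows (lists of numbers). Resizing the panel
    image back to a fixed resolution is left out of this sketch.
    """
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    panel = []
    for r in range(rows):
        for y in range(height):
            line = []
            for c in range(cols):
                line.extend(frames[r * cols + c][y])  # append this frame's row y
            panel.append(line)
    return panel


# Four toy 2x2 frames, each filled with its own frame index.
frames = [[[i, i], [i, i]] for i in range(4)]
panel = make_panel(frames, rows=2, cols=2)
```

With a fixed image budget, a 2x2 panel shows four time steps in the space of one frame, so each frame gets a quarter of the pixels. That is the spatial-for-temporal trade-off the abstract describes.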

T*: Re-thinking Temporal Search for Long-Form Video Understanding

Authors: Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li

Venue: CVPR 2025 long

First: 2025-04-03T04:03:10+00:00 · Latest: 2025-08-25T02:57:46+00:00

Comments: Accepted by CVPR 2025; A real-world long video needle-in-haystack benchmark; long-video QA with human ref frames

Abs · PDF · Code1 · Code2

Abstract

Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold: First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on specific queries. Upon this formulation, we introduce LV-Haystack, the first dataset with 480 hours of videos, 15,092 human-annotated instances for both training and evaluation aiming to improve temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods only achieving 2.1% temporal F1 score on the Longvideobench subset. Next, inspired by visual search in images, we propose a lightweight temporal search framework, T* that reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-OV-72B's performance from 56.5% to 62.4% on the Longvideobench XL subset. Our code, benchmark, and models are provided in the Supplementary material.

中文标题/摘要

标题：T*: 重新思考长视频时间搜索

高效理解长视频仍然是计算机视觉中的一个重大挑战。在本文中，我们重新审视了长视频理解的时间搜索范式，并解决了所有最先进的（SOTA）长上下文视觉-语言模型（VLMs）中的一个基本问题。我们的贡献有两个方面：首先，我们将时间搜索重新定义为长视频中的“长视频干草堆”问题：从成千上万的帧中找到一组相关的帧（例如一到五帧），基于特定的查询。在此基础上，我们引入了LV-干草堆数据集，这是第一个包含480小时视频、15,092个人标注实例的数据集，用于训练和评估，旨在提高时间搜索的质量和效率。LV-干草堆上的结果突显了时间搜索能力的研究缺口，当前SOTA搜索方法在Longvideobench子集上的时间F1分数仅为2.1%。其次，受图像中视觉搜索的启发，我们提出了一种轻量级的时间搜索框架T*，将昂贵的时间搜索重新定义为空间搜索。T*利用了在图像中常用的强大视觉定位技术，并引入了一种在时间和空间维度上都起作用的自适应放大机制。广泛的实验表明，将T*与现有方法结合使用可以显著提高SOTA长视频理解。在32帧的推理预算下，T*将GPT-4o在Longvideobench XL子集上的性能从50.5%提高到53.1%，将LLaVA-OneVision-OV-72B的性能从56.5%提高到62.4%。我们的代码、基准和模型在补充材料中提供。

Summary / 总结

This paper addresses the challenge of efficiently understanding long-form videos by revisiting temporal search paradigms. It introduces LV-Haystack, a dataset with 480 hours of video and 15,092 human-annotated instances for temporal search, on which current SOTA search methods reach only a 2.1% temporal F1 score on the Longvideobench subset. It also proposes T*, a lightweight framework that reframes temporal search as spatial search. Under an inference budget of 32 frames on the Longvideobench XL subset, T* improves GPT-4o from 50.5% to 53.1% and LLaVA-OneVision-OV-72B from 56.5% to 62.4%.

本文重新审视了长视频理解中的时间搜索范式，引入了LV-Haystack数据集，并提出了一种轻量级框架T*，将时间搜索重新定义为空间搜索。在LV-Haystack上，当前SOTA搜索方法在Longvideobench子集上的时间F1分数仅为2.1%。实验表明，T*显著提高了现有方法在长视频理解任务上的性能：在32帧的推理预算下，T*在Longvideobench XL子集上将GPT-4o的性能从50.5%提高到53.1%，将LLaVA-OneVision-OV-72B的性能从56.5%提高到62.4%。
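As a quick check on the figures quoted in the abstract, the absolute gains on the Longvideobench XL subset can be computed directly from the reported before and after scores. The dictionary layout below is just for illustration; the numbers are those stated in the abstract.

```python
# (before, after) accuracy in percent, as reported in the abstract
# for a 32-frame inference budget on the Longvideobench XL subset.
reported = {
    "GPT-4o": (50.5, 53.1),
    "LLaVA-OneVision-OV-72B": (56.5, 62.4),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {model: round(after - before, 1) for model, (before, after) in reported.items()}
```

The two gains differ (2.6 and 5.9 percentage points), so no single improvement figure describes both models.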

Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models

Authors: Alex Zook, Josef Spjut, Jonathan Tremblay

First: 2025-07-16T22:45:40+00:00 · Latest: 2025-07-16T22:45:40+00:00

Comments: Published at Reinforcement Learning and Video Games workshop https://sites.google.com/view/rlvg-workshop-2025/home

Abs · PDF · Code1 · Code2 · Project1

Abstract

Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate results that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.

中文标题/摘要

标题：飞、失败、修复：基于强化学习和大型多模态模型的游戏迭代修复

游戏设计依赖于理解静态规则和内容如何转化为动态玩家行为——而现代仅检查游戏代码或资产的生成系统难以捕捉这一点。我们提出了一种自动化设计迭代框架，通过将强化学习（RL）代理与大型多模态模型（LMM）配对来弥补这一差距。RL代理测试游戏，LMM根据代理的行为修改游戏。在每个循环中，RL玩家完成多个回合，产生（i）数值游戏指标或（ii）最近视频帧的紧凑图像摘要。LMM设计师接收游戏目标和当前游戏配置，分析游戏轨迹，并编辑配置以引导未来行为朝向目标。我们展示了LMM能够基于RL代理提供的行为轨迹迭代细化游戏机制的结果，这表明了AI辅助游戏设计的实用、可扩展工具的可能性。
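The playtest-then-revise loop described in the abstract can be sketched with two stand-in callables: one playing the role of the RL agent and one the LMM designer. Everything concrete here is hypothetical: the `gap` parameter, the toy scoring rule, and the rule-based `toy_revise` are placeholders for a trained agent and a multimodal model, neither of which is reproduced.

```python
def design_loop(config, playtest, revise, goal, iterations):
    """Alternate playtesting and revision, recording each round's evidence."""
    history = []
    for _ in range(iterations):
        metrics = playtest(config)               # RL player runs episodes
        history.append((dict(config), metrics))  # keep a copy of what was tested
        config = revise(goal, config, metrics)   # designer edits the configuration
    return config, history


def toy_playtest(config):
    """Stand-in for the RL agent: a wider gap yields a higher mean score."""
    return {"mean_score": 10 * config["gap"]}


def toy_revise(goal, config, metrics):
    """Stand-in for the LMM designer: nudge the gap toward the target score."""
    revised = dict(config)
    if metrics["mean_score"] < goal:
        revised["gap"] += 1
    elif metrics["mean_score"] > goal:
        revised["gap"] -= 1
    return revised


final_config, history = design_loop({"gap": 1}, toy_playtest, toy_revise, goal=30, iterations=5)
```

In the paper's setting, the metrics could also include an image strip of recent frames, which is the evidence a multimodal designer would reason over instead of this numeric rule.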

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Authors: Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

First: 2025-04-02T15:12:17+00:00 · Latest: 2025-05-21T09:38:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.

中文标题/摘要

标题：SpaceR：强化视频空间推理的MLLMs

视频空间推理涉及从观察到的视频帧中推断出潜在的空间结构，这给现有的多模态大型语言模型（MLLMs）带来了重大挑战。这一局限主要源于1）缺乏针对此任务的高质量数据集，以及2）缺乏有效的训练策略来发展空间推理能力。受验证奖励强化学习（RLVR）在解锁LLM推理能力方面的成功启发，这项工作旨在通过RLVR范式改进MLLMs在视频空间推理中的表现。为此，我们引入了SpaceR框架。首先，我们提出了包含91000个问题的SpaceR-151k数据集，这些问题覆盖了多种具有验证答案的空间推理场景，并有60000个样本用于保持多模态的一般理解。其次，我们提出了空间引导的RLVR（SG-RLVR），这是一种新颖的强化学习方法，它扩展了组相对策略优化（GRPO），并引入了一种新的地图想象机制，该机制鼓励模型在思考过程中推断空间布局，从而促进更有效的空间推理。广泛的实验表明，SpaceR在空间推理基准测试（如VSI-Bench、STI-Bench和SPAR-Bench）上达到了最先进的性能，同时在视频理解基准测试（如Video-MME、TempCompass和LongVideoBench）上保持了竞争力。值得注意的是，SpaceR在VSI-Bench上的准确率比先进的GPT-4o高出11.6%，并与领先的专有模型Gemini-2.0-Flash持平，突显了我们的SpaceR-151k数据集和SG-RLVR在强化MLLMs的空间推理能力方面的有效性。代码、模型和数据集可在https://github.com/OuyangKun10/SpaceR/获得。

Summary / 总结

The research aims to enhance the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) by addressing the lack of high-quality datasets and effective training strategies. The SpaceR framework introduces SpaceR-151k, a dataset with 91k questions and 60k samples, and Spatially-Guided RLVR (SG-RLVR), a reinforcement learning approach that improves spatial reasoning. Experiments show that SpaceR outperforms existing models on spatial reasoning benchmarks and maintains competitive performance on video understanding benchmarks, with a significant 11.6% improvement over GPT-4o on VSI-Bench.

研究旨在通过解决高质量数据集和有效训练策略的缺乏问题，增强多模态大型语言模型（MLLMs）的空间推理能力。SpaceR框架引入了包含91k问题和60k样本的SpaceR-151k数据集和空间引导的强化学习方法（SG-RLVR），以提升空间推理能力。实验表明，SpaceR在空间推理基准测试中表现出色，并在视频理解基准测试中保持竞争力，特别是在VSI-Bench上的表现比GPT-4o高出11.6%。

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Authors: Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

First: 2025-10-21T17:59:36+00:00 · Latest: 2025-10-21T17:59:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.

中文标题/摘要

标题：DSI-Bench：动态空间智能基准

关于动态空间关系的推理至关重要，因为观察者和物体经常同时移动。尽管视觉语言模型（VLMs）和视觉专门模型在2D任务和静态场景中表现出色，但它们理解动态3D场景的能力仍然有限。我们引入了动态空间智能，并提出了DSI-Bench基准，该基准包含近1000个动态视频和超过1700个手动标注的问题，涵盖了观察者和物体九种解耦的运动模式。空间和时间对称的设计减少了偏差，使模型对自身运动和物体运动的推理进行系统的评估成为可能。我们对14个VLMs和专家模型的评估揭示了关键的局限性：模型经常混淆观察者和物体的运动，表现出语义偏差，并且在动态场景中无法准确推断相对关系。我们的DSI-Bench提供了关于未来开发具有动态空间智能的一般和专门模型的重要发现和见解。

Summary / 总结

The work evaluates how well vision-language models and visual expertise models understand dynamic 3D scenarios, in which observers and objects often move simultaneously. The authors build DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. An evaluation of 14 VLMs and expert models shows that models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios.

研究旨在评估模型在理解动态空间关系方面的能力，特别是在观察者和物体同时移动的情况下。作者提出了DSI-Bench，包含近1,000个动态视频和1,700个标注问题，以评估模型对自身运动和物体运动的推理能力。评估14个VLM和专家模型后，研究发现模型常常混淆观察者和物体的运动，表现出语义偏见，并且在动态场景中难以准确推断相对关系，这表明需要开发具有动态空间智能的更好模型。

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Authors: Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang

First: 2024-12-16T18:46:45+00:00 · Latest: 2024-12-16T18:46:45+00:00

Comments: 14 pages, 9 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The few benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the correct answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. To compensate for the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods: clue-grounded white box and black box evaluations, to assess whether the model generates answers based on the correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at https://cg-bench.github.io/leaderboard/.

中文标题/摘要

标题：CG-Bench：面向长视频理解的线索导向型问答基准

大多数现有的视频理解基准主要针对多模态大型语言模型（MLLMs）中的短视频。对于长视频理解的有限基准往往仅依赖于多项选择题（MCQs）。然而，由于MCQ评估固有的局限性以及MLLMs推理能力的增强，模型可以通过结合短视频理解和排除来给出当前答案，而无需真正理解视频内容。为解决这一问题，我们引入了CG-Bench，这是一种针对长视频的新型线索导向型问答基准。CG-Bench强调模型检索问题相关线索的能力，增强评估的可信度。它包含1,219个手工策划的视频，按精细分类系统分为14个主要类别、171个次要类别和638个三级类别，使其成为最大的长视频分析基准。基准包括12,129个问答对，涵盖三大类问题：感知、推理和幻觉。为弥补纯MCQ评估的不足，我们设计了两种新的基于线索的评估方法：线索导向的白盒和黑盒评估，以评估模型是否基于对视频的正确理解生成答案。我们在CG-Bench上评估了多个闭源和开源MLLMs。结果表明，当前模型在理解长视频方面显著不如理解短视频，开源模型与商用模型之间存在显著差距。我们希望CG-Bench能够促进更可信和强大的MLLMs在长视频理解方面的开发。所有注释和视频数据可在https://cg-bench.github.io/leaderboard/上获取。

Summary / 总结

CG-Bench is a new benchmark for clue-grounded question answering in long videos. It addresses two limitations of existing benchmarks: most cover only short videos, and the few long-video ones rely solely on multiple-choice questions. It includes 1,219 manually curated videos and 12,129 QA pairs, covering perception, reasoning, and hallucination. Two clue-based evaluation methods (clue-grounded white box and black box evaluation) assess whether a model's answers rest on a correct understanding of the video. The evaluation shows that current models perform markedly worse on long videos than on short ones, with a significant gap between open-source and commercial models. The benchmark aims to support the development of more trustworthy MLLMs for long video understanding.

CG-Bench 是一个针对长视频理解的新基准，旨在解决现有短视频基准的局限性。它强调基于线索的问题回答，包含1,219个手动整理的视频和12,129个问答对。引入了两种新的评估方法来评估模型的理解能力。结果显示，当前模型在长视频上的表现较差，开源和商用模型之间存在显著差距。

VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding

Authors: Henghao Zhao, Ge-Peng Ji, Rui Yan, Huan Xiong, Zechao Li

First: 2025-04-10T07:33:39+00:00 · Latest: 2025-04-10T07:33:39+00:00

Abs · PDF · Code1 · Code2

Abstract

The core challenge in video understanding lies in perceiving dynamic content changes over time. However, multimodal large language models struggle with temporal-sensitive video tasks, which require generating timestamps to mark the occurrence of specific events. Existing strategies require MLLMs to generate absolute or relative timestamps directly. We have observed that those MLLMs tend to rely more on language patterns than visual cues when generating timestamps, affecting their performance. To address this problem, we propose VideoExpert, a general-purpose MLLM suitable for several temporal-sensitive video tasks. Inspired by the expert concept, VideoExpert integrates two parallel modules: the Temporal Expert and the Spatial Expert. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. It processes high-frame-rate yet compressed tokens to capture dynamic variations in videos and includes a lightweight prediction head for precise event localization. The Spatial Expert focuses on content detail analysis and instruction following. It handles specially designed spatial tokens and language input, aiming to generate content-related responses. These two experts collaborate seamlessly via a special token, ensuring coordinated temporal grounding and content generation. Notably, the Temporal and Spatial Experts maintain independent parameter sets. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions. Moreover, we introduce a Spatial Compress module to obtain spatial tokens. This module filters and compresses patch tokens while preserving key information, delivering compact yet detail-rich input for the Spatial Expert. Extensive experiments demonstrate the effectiveness and versatility of the VideoExpert.

中文标题/摘要

标题：VideoExpert：用于时间敏感视频理解的增强LLM

视频理解的核心挑战在于感知随时间变化的动态内容。然而，多模态大型语言模型在处理时间敏感的视频任务时存在困难，这需要生成时间戳来标记特定事件的发生。现有策略要求MLLM直接生成绝对或相对时间戳。我们观察到，这些MLLM在生成时间戳时更依赖于语言模式而非视觉线索，影响了它们的性能。为了解决这一问题，我们提出了VideoExpert，这是一种适用于多种时间敏感视频任务的通用MLLM。受专家概念的启发，VideoExpert集成了两个并行模块：时间专家和空间专家。时间专家负责建模时间序列和执行时间定位。它处理高帧率但压缩的令牌以捕捉视频中的动态变化，并包含一个轻量级的预测头以实现精确的事件定位。空间专家专注于内容细节分析和指令遵循。它处理特别设计的空间令牌和语言输入，旨在生成内容相关的响应。这两个专家通过一个特殊令牌无缝协作，确保协调的时间定位和内容生成。值得注意的是，时间和空间专家保持独立的参数集。通过将时间定位从内容生成中卸载，VideoExpert防止了时间戳预测中的文本模式偏差。此外，我们引入了空间压缩模块以获取空间令牌。该模块过滤并压缩补丁令牌，同时保留关键信息，为空间专家提供紧凑但细节丰富的输入。广泛的实验表明，VideoExpert的有效性和多功能性。

Summary / 总结

The core challenge in video understanding is perceiving dynamic content changes over time, which is difficult for multimodal large language models (MLLMs) due to their reliance on language patterns rather than visual cues when generating timestamps. To address this, VideoExpert is proposed, integrating a Temporal Expert for modeling time sequences and a Spatial Expert for content detail analysis. These experts collaborate via a special token, ensuring accurate event localization and content generation. Experiments show VideoExpert's effectiveness and versatility in various temporal-sensitive video tasks.

研究旨在通过解决生成特定事件准确时间戳的挑战，提高动态视频理解能力。提出了VideoExpert，这是一种多模态大型语言模型，包含两个并行模块：时间专家负责时间序列建模和时间定位，空间专家负责内容分析和指令跟随。时间专家处理高帧率的令牌以捕捉动态变化，并包含一个预测头以实现精确的事件定位。空间专家专注于内容细节分析，并基于空间令牌和语言输入生成响应。通过广泛的实验，该模型展示了其有效性和多功能性，减少了时间戳预测中的文本模式偏差，并通过空间压缩模块提供了紧凑且详细的输入。

TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler

Authors: Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang

First: 2025-01-26T13:10:12+00:00 · Latest: 2025-06-10T14:30:19+00:00

Comments: code and training recipes are available at https://github.com/ZhangXJ199/TinyLLaVA-Video

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Though large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in effectively processing long visual sequences and temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks. In addition, TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs. It surpasses several existing 7B-parameter models on multiple benchmarks. We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.

中文标题/摘要

标题：TinyLLaVA-Video：面向视频理解的轻量级LMM及其组重采样器

视频行为识别和场景理解是多模态智能中的基本任务，是众多实际应用的关键构建块。尽管大型多模态模型（LMMs）在视频理解方面取得了显著进展，但大多数现有的开源模型依赖于超过7B参数，并且需要大规模数据集进行训练，这使得它们资源密集型且对许多研究人员来说难以获取。此外，轻量级模型在有效处理长视觉序列和时间理解方面仍面临持续挑战。在本工作中，我们介绍了TinyLLaVA-Video，这是一种具有约3.6B参数的轻量级但强大的视频理解模型。我们设计的核心是视频级别组重采样器，这是一种新颖的机制，可以显著减少和控制视频级别的视觉标记数量。与传统的图像级别重采样器不同，我们的方法有效缓解了冗余问题，同时增强了时间理解能力，从而在基于视频的任务上取得了更好的性能。此外，TinyLLaVA-Video展示了出色的效率，仅需在8块A100-40G GPU上训练一天即可。它在多个基准测试上超过了几个现有的7B参数模型。我们认为，这项工作为未来轻量级视频理解模型的研究提供了有价值的基石。代码和权重可在https://github.com/ZhangXJ199/TinyLLaVA-Video 获取。

Summary / 总结

This paper introduces TinyLLaVA-Video, a lightweight video understanding model with about 3.6 billion parameters, designed to address the resource-intensive nature of existing large multimodal models. The key innovation is the video-level group resampler, which reduces visual token redundancy and enhances temporal understanding. Experiments show that TinyLLaVA-Video outperforms several 7B-parameter models on multiple benchmarks and requires only one day of training on 8 A100-40G GPUs, making it a valuable resource for researchers. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.

研究旨在开发一种更小但有效的视频理解模型，解决现有大型多模态模型资源密集的问题。TinyLLaVA-Video，参数量约为3.6B，引入了视频级分组重采样器来减少视觉标记冗余并增强时间理解。该模型在多个基准测试中表现出色，优于多个7B参数模型，并且仅需在8块A100-40G GPU上训练一天。
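
To make the idea of controlling the visual token count at the video level concrete, here is a minimal sketch of a generic query-based resampler: a fixed set of query vectors attends over all tokens from all frames, so the output length equals the number of queries regardless of video length. This is not the paper's group resampler, whose grouping details are not given in the abstract. The function names, dimensions, and random "queries" (which would be learned parameters in a real model) are all assumptions for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def resample(tokens, queries):
    """Cross-attention pooling: each query returns a weighted average of all tokens.

    tokens:  list of N vectors (all frames flattened together)
    queries: list of Q vectors (hypothetical stand-ins for learned queries)
    returns: list of Q vectors, independent of N
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)])
    return out


# Hypothetical sizes chosen only for the demo.
random.seed(0)
dim, frames, tokens_per_frame, num_queries = 8, 16, 10, 4
video_tokens = [[random.gauss(0, 1) for _ in range(dim)]
                for _ in range(frames * tokens_per_frame)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

compressed = resample(video_tokens, queries)
print(len(video_tokens), "tokens in,", len(compressed), "tokens out")
```

Because the queries see tokens from every frame at once, doubling the number of frames changes the input length but not the output length, which is the property the abstract attributes to resampling at the video level rather than per image.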

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen

First: 2024-12-01T18:27:28+00:00 · Latest: 2024-12-01T18:27:28+00:00

Comments: Project Page: https://tiger-ai-lab.github.io/VISTA/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.

中文标题/摘要

标题：VISTA：通过视频时空增强提高长时长和高分辨率视频理解

当前的大规模多模态模型（LMMs）在处理和理解长时长或高分辨率视频时面临重大挑战，主要是由于缺乏高质量的数据集。从数据为中心的角度出发，我们提出了VISTA，一种简单而有效的视频时空增强框架，该框架从现有的视频字幕数据集中合成长时长和高分辨率视频指令跟随对。VISTA通过时空结合视频，创建具有延长时长和增强分辨率的新合成视频，并随后生成与这些新合成视频相关的问题-答案对。基于此范式，我们开发了七种视频增强方法，并编纂了VISTA-400K视频指令跟随数据集，旨在提高长时长和高分辨率视频理解。在我们的数据上微调各种视频LMMs后，在四个具有挑战性的长视频理解基准上平均提高了3.3%。此外，我们引入了第一个全面的高分辨率视频理解基准HRVideoBench，在此基准上，我们微调的模型实现了6.5%的性能提升。这些结果突显了我们框架的有效性。

Summary / 总结

The paper proposes VISTA, a Video Spatiotemporal Augmentation framework to enhance long-duration and high-resolution video understanding by synthesizing new videos from existing datasets. Seven augmentation methods are developed to create extended and high-resolution videos, and a new dataset VISTA-400K is curated. Finetuning various large multimodal models on this dataset improves performance by 3.3% on four benchmarks and 6.5% on the new HRVideoBench.

研究旨在通过解决高质量数据不足的问题，增强大型多模态模型对长时长和高分辨率视频的理解能力。VISTA是一种视频时空增强框架，从现有数据集中合成新的长时长和高分辨率视频指令跟随对。该框架在四个基准测试中将长视频理解能力提高了3.3%，并在新引入的高分辨率视频理解基准HRVideoBench上实现了6.5%的性能提升。
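
The abstract describes combining existing videos temporally (for longer duration) and spatially (for higher resolution). A minimal sketch of those two operations on toy data might look like the following. The function names and the list-of-lists frame representation are assumptions for illustration, and the sketch does not reproduce any of the paper's seven augmentation methods or its question-answer generation step.

```python
from typing import List

Frame = List[List[int]]  # H x W grid of pixel values
Video = List[Frame]      # sequence of frames


def concat_temporal(clips: List[Video]) -> Video:
    """Join clips end to end, giving one longer video."""
    return [frame for clip in clips for frame in clip]


def tile_horizontal(left: Video, right: Video) -> Video:
    """Place two videos side by side, frame by frame, giving wider frames."""
    if len(left) != len(right):
        raise ValueError("videos must have the same number of frames")
    tiled = []
    for frame_a, frame_b in zip(left, right):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same height")
        tiled.append([row_a + row_b for row_a, row_b in zip(frame_a, frame_b)])
    return tiled


def make_clip(value: int, frames: int, height: int, width: int) -> Video:
    """Build a toy clip whose pixels all share one value."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


clip_a = make_clip(1, frames=2, height=2, width=3)
clip_b = make_clip(2, frames=3, height=2, width=3)

long_video = concat_temporal([clip_a, clip_b])                 # 2 + 3 frames
wide_video = tile_horizontal(clip_a, make_clip(2, 2, 2, 3))    # frames of width 3 + 3

print(len(long_video), len(wide_video[0][0]))
```

Since the source clips already come with captions, a synthetic video built this way carries known content at known times and positions, which is what makes it possible to generate question-answer pairs about the combined result.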

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

Venue: NeurIPS 2025

First: 2025-05-04T10:55:21+00:00 · Latest: 2025-10-24T00:11:36+00:00

Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.

中文标题/摘要

标题：RTV-Bench：通过实时视频评估MLLM连续感知、理解和推理

多模态大型语言模型（MLLMs）在感知、理解和推理方面表现出色。然而，当前的基准测试未能充分评估它们在动态、真实环境中的连续执行这些任务的能力。为弥补这一差距，我们引入了RTV-Bench，这是一种针对MLLM实时视频分析的精细基准测试。RTV-Bench采用三个关键原则：（1）多时间戳问答（MTQA），答案随场景变化而演变；（2）层次化问题结构，结合基本和高级查询；（3）多维度评估，评估连续感知、理解和推理的能力。RTV-Bench包含552个多样化的视频（167.2小时）和4,631个高质量的问答对。我们评估了包括专有（GPT-4o、Gemini 2.0）、开源离线（Qwen2.5-VL、VideoLLaMA3）和开源实时（VITA-1.5、InternLM-XComposer2.5-OmniLive）模型在内的领先MLLM。实验结果表明，开源实时模型在很大程度上优于离线模型，但仍然落后于顶级专有模型。我们的分析还表明，更大的模型规模或更高的帧采样率并不显著提升RTV-Bench性能，有时甚至会导致轻微下降。这强调了需要更好的模型架构，以优化视频流处理和长序列，从而推动MLLM实时视频分析的进步。我们的基准测试工具包可在以下链接获取：https://github.com/LJungang/RTV-Bench。

Summary / 总结

RTV-Bench is a benchmark for evaluating MLLMs in real-time video analysis, focusing on continuous perception, understanding, and reasoning. It uses three principles: Multi-Timestamp Question Answering, Hierarchical Question Structure, and Multi-dimensional Evaluation. The benchmark includes 552 diverse videos and 4,631 QA pairs. Evaluations of leading MLLMs show that open-source real-time models outperform offline models but lag behind proprietary models. The study also finds that larger model sizes or higher frame sampling rates do not significantly improve performance, highlighting the need for better model architectures for real-time video analysis.

RTV-Bench 是一个用于评估 MLLMs 在实时视频分析中连续感知、理解和推理能力的基准。它采用了三个原则：多时间戳问答、层次化问题结构和多维度评估。基准包括 552 个多样化的视频和 4,631 对 QA。对领先 MLLMs 的评估显示，开源实时模型优于离线模型，但落后于专有模型。更大的模型规模或更高的帧采样率并未显著提升性能，这表明需要更好的模型架构来实现 MLLMs 的实时视频分析。
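
Multi-Timestamp Question Answering means the same question can have different correct answers depending on when it is asked. A minimal sketch of that lookup and scoring is shown below. The data layout (a sorted list of `(start_time, answer)` pairs), the function names, and the sample values are assumptions for illustration, not the RTV-Bench annotation format or its scoring code.

```python
from bisect import bisect_right


def answer_at(timeline, t):
    """Return the answer valid at time t.

    timeline: list of (start_time_seconds, answer), sorted by start time.
    Each answer holds from its start time until the next entry begins.
    """
    starts = [start for start, _ in timeline]
    i = bisect_right(starts, t) - 1
    if i < 0:
        raise ValueError("query time precedes the first annotated timestamp")
    return timeline[i][1]


def accuracy(timeline, predictions):
    """predictions: list of (query_time_seconds, predicted_answer)."""
    correct = sum(1 for t, pred in predictions if answer_at(timeline, t) == pred)
    return correct / len(predictions)


# Hypothetical question whose answer changes twice as the scene evolves.
timeline = [(0.0, "A"), (30.0, "B"), (75.0, "C")]

# A model that keeps answering "A" is right only at the first query time.
stale_model = [(10.0, "A"), (40.0, "A"), (90.0, "A")]
print(accuracy(timeline, stale_model))
```

A model that answers once and never updates scores well only on the earliest query, which is why this question style probes continuous perception rather than one-shot understanding.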

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Authors: Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao

First: 2025-03-22T11:30:46+00:00 · Latest: 2025-03-22T11:30:46+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.

中文标题/摘要

标题：V2P-Bench：使用视觉提示评估视频-语言理解以改善人-模型交互

大型视觉-语言模型（LVLMs）在视频理解领域取得了显著进展。然而，当前的基准测试统一依赖于文本提示进行评估，这通常需要复杂的参照语言，并且无法提供精确的空间和时间参考。这一限制降低了人-模型交互的体验和效率。为了解决这一限制，我们提出了视频视觉提示基准测试（V2P-Bench），这是一个专门设计用于评估LVLMs在多模态人-模型交互场景中视频理解能力的基准测试。V2P-Bench 包含980个独特的视频和1,172个问答对，涵盖了5个主要任务和12个维度，促进了与人类认知相一致的实例级细粒度理解。基准测试结果表明，即使是最强大的模型在V2P-Bench上的表现也很差（GPT-4o为65.4%，Gemini-1.5-Pro为67.9%），远低于人类专家的88.3%，突显了LVLMs在理解视频视觉提示方面的当前不足。我们希望V2P-Bench能够成为推动多模态人-模型交互和视频理解评估的基础。项目页面：https://github.com/gaotiexinqu/V2P-Bench.

Summary / 总结

The research aims to improve the evaluation of large vision-language models (LVLMs) by addressing the limitations of current benchmarks that rely on text prompts. V2P-Bench, a new benchmark, introduces visual prompts to better assess LVLMs' video understanding in human-model interaction scenarios. The benchmark includes 980 videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions. Experimental results show that even the most advanced models, such as GPT-4o and Gemini-1.5-Pro, achieve only 65.4% and 67.9% accuracy, respectively, compared to human experts' 88.3%. This highlights the need for further improvements in LVLMs' video understanding capabilities.

V2P-Bench 是一个新基准，旨在评估大型视觉语言模型（LVLM）在人机交互场景中的视频理解能力。它使用视觉提示而不是文本提示，以更好地捕捉空间和时间细节，提高模型响应的精确性。该基准包括980个视频和1,172个问答对，涵盖5个主要任务和12个维度。实验结果显示，即使是最先进的模型，如GPT-4o和Gemini-1.5-Pro，也只能分别达到65.4%和67.9%的准确率，而人类专家的准确率为88.3%，表明LVLM在视频理解方面仍有很大的改进空间。
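
The size of the gap to human experts follows directly from the three accuracy figures quoted above. The short calculation below makes it explicit; the variable names are illustrative only.

```python
# Accuracy figures (percent) as reported in the abstract.
model_scores = {"GPT-4o": 65.4, "Gemini-1.5-Pro": 67.9}
human_experts = 88.3

# Gap to human experts in percentage points, rounded to one decimal place.
gaps = {name: round(human_experts - score, 1) for name, score in model_scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below human experts")
```

Both models trail human experts by more than 20 percentage points (22.9 for GPT-4o and 20.4 for Gemini-1.5-Pro).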

Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions

Authors: Chenyu Yi, Siyuan Yang, Haoliang Li, Yap-peng Tan, Alex Kot

Venue: NeurIPS 2021

First: 2021-10-13T05:59:39+00:00 · Latest: 2022-08-22T14:37:16+00:00

Comments: Accepted to NeurIPS 2021 Datasets and Benchmarks Track. Our code is available at https://github.com/Newbeeyoung/Video-Corruption-Robustness

Abs · PDF · Code1 · Code2 · Code3

Abstract

State-of-the-art deep neural networks are vulnerable to common corruptions (e.g., input data degradations, distortions, and disturbances caused by weather changes, system error, and processing). While much progress has been made in analyzing and improving the robustness of models in image understanding, the robustness in video understanding is largely unexplored. In this paper, we establish a corruption robustness benchmark, Mini Kinetics-C and Mini SSV2-C, which considers temporal corruptions beyond spatial corruptions in images. We make the first attempt to conduct an exhaustive study on the corruption robustness of established CNN-based and Transformer-based spatial-temporal models. The study provides some guidance on robust model design and training: Transformer-based models perform better than CNN-based models on corruption robustness; the generalization ability of spatial-temporal models implies robustness against temporal corruptions; model corruption robustness (especially robustness in the temporal domain) enhances with computational cost and model capacity, which may contradict the current trend of improving the computational efficiency of models. Moreover, we find the robustness intervention for image-related tasks (e.g., training models with noise) may not work for spatial-temporal models.

中文标题/摘要

标题：空间-时间模型对抗破坏的鲁棒性基准测试

最先进的深度神经网络对常见的破坏（例如输入数据退化、失真和由天气变化、系统错误和处理引起的干扰）非常脆弱。虽然在图像理解中已经取得了分析和提高模型鲁棒性的许多进展，但在视频理解中的鲁棒性研究却很少。在本文中，我们建立了考虑时间破坏的鲁棒性基准，Mini Kinetics-C和Mini SSV2-C，该基准不仅考虑了图像的空间破坏。我们首次对已建立的基于CNN和Transformer的空间-时间模型的破坏鲁棒性进行了全面研究。研究为鲁棒模型的设计和训练提供了一些指导：Transformer模型在破坏鲁棒性方面优于基于CNN的模型；空间-时间模型的一般化能力意味着对时间破坏的鲁棒性；空间-时间模型的计算成本和模型容量增加时，其破坏鲁棒性（尤其是时间域的鲁棒性）会增强，这可能与当前提高模型计算效率的趋势相矛盾。此外，我们发现针对图像相关任务（例如用噪声训练模型）的鲁棒性干预可能对空间-时间模型无效。

Summary / 总结

This paper aims to evaluate the robustness of spatial-temporal models against corruptions, particularly in video understanding, which has been less explored. The authors introduce two benchmarks, Mini Kinetics-C and Mini SSV2-C, to study the corruption robustness of CNN-based and Transformer-based models. Key findings include that Transformer-based models outperform CNN-based models in corruption robustness, and that robustness against temporal corruptions generally increases with model capacity and computational cost, contrary to the trend of improving computational efficiency. Additionally, robustness interventions effective for image tasks may not be applicable to spatial-temporal models.

本文旨在评估空间-时间模型在对抗干扰方面的鲁棒性，特别是在视频理解方面，这比图像理解方面探索得较少。作者引入了Mini Kinetics-C和Mini SSV2-C基准来研究基于CNN和Transformer的空间-时间模型的抗干扰能力。主要发现包括Transformer模型在抗干扰方面优于CNN模型，鲁棒性随着计算成本和模型容量的增加而提高，这与提高模型计算效率的趋势相反。此外，对图像任务有效的鲁棒性干预可能不适用于空间-时间模型。