arXiv Paper Digest

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

First: 2024-12-31T18:56:46+00:00 · Latest: 2025-03-26T00:40:06+00:00

Comments: 17 pages, 14 figures, technical report

Abs · Code1

Abstract

Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle to capture fine-grained spatial and temporal details. In addition, the lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders their advancement. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any object throughout the video. Specifically, we develop the VideoRefer Suite across three essential aspects: dataset, model, and benchmark. First, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.

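Title / Abstract (translated from Chinese)

Title: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Video Large Language Models (Video LLMs) have recently shown remarkable capability in general video understanding. However, they focus mainly on holistic comprehension and struggle to capture fine-grained spatial and temporal details. In addition, the lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders their progress. To address these challenges, we introduce the VideoRefer Suite to strengthen the fine-grained spatial-temporal video understanding of Video LLMs, that is, perception and reasoning about any object in a video. Specifically, we develop the VideoRefer Suite across three aspects: dataset, model, and benchmark. First, we introduce a multi-agent data engine to carefully curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Second, we present the VideoRefer model, which is equipped with a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we carefully build VideoRefer-Bench to comprehensively evaluate the spatial-temporal understanding of Video LLMs across multiple aspects. Extensive experiments and analyses show that the VideoRefer model not only achieves promising performance on video referring benchmarks but also improves general video understanding.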

Summary / 总结

The research aims to enhance the spatial-temporal understanding of Video Large Language Models (Video LLMs) by addressing their limitations in capturing fine-grained details. The authors introduce the VideoRefer Suite, which includes a curated dataset, a specialized model, and a benchmark. The VideoRefer model uses a spatial-temporal object encoder to improve the precision of regional and sequential representations. Experiments show that the VideoRefer model performs well on video referring benchmarks and enhances general video understanding capabilities.

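The research aims to improve the spatial-temporal object understanding of Video Large Language Models (Video LLMs) by addressing their limitations in capturing fine-grained details. It introduces the VideoRefer Suite, which comprises a multi-agent data engine for building a large-scale, high-quality object-level video instruction dataset, the spatial-temporal object encoder in the VideoRefer model, and a comprehensive benchmark for evaluating the spatial-temporal understanding of Video LLMs. Experimental results show that the VideoRefer model performs strongly on video referring benchmarks and improves general video understanding.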
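
The abstract does not describe the internals of the VideoRefer object encoder, so the following is only a rough sketch of one common way to obtain "regional and sequential" object representations: average the features inside an object mask in each frame, then keep the per-frame results in temporal order. The function names, the flat pixel layout, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
def masked_average(pixel_features, mask):
    """Average the feature vectors of the pixels where mask is 1.

    Returns None when the object is not visible in the frame.
    """
    selected = [vec for vec, flag in zip(pixel_features, mask) if flag]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


def object_tokens(video_features, video_masks):
    """Build one object token per frame in which the object appears, in temporal order."""
    tokens = []
    for pixel_features, mask in zip(video_features, video_masks):
        token = masked_average(pixel_features, mask)
        if token is not None:
            tokens.append(token)
    return tokens


# Hypothetical toy input: two frames, three pixels each, 2-D features.
video_features = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
video_masks = [
    [1, 1, 0],  # object covers the first two pixels of frame 0
    [0, 0, 0],  # object absent from frame 1
]
tokens = object_tokens(video_features, video_masks)
```

Here the object yields a single token from frame 0, and frame 1 contributes nothing because the mask is empty. A learned encoder would replace the plain average and would typically also merge or select tokens across frames.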

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Authors: Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou

First: 2025-03-20T09:16:20+00:00 · Latest: 2025-03-26T00:25:44+00:00

Abs · Code1

Abstract

Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.

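Title / Abstract (translated from Chinese)

Title: STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Vision-language models such as CLIP, pre-trained on large numbers of image-text pairs, have shown promising zero-shot generalization across many image tasks. Extending these capabilities to video tasks remains challenging, however, because labeled video data is scarce and training is costly. Recent video prompting methods try to adapt CLIP to video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations across frames. This limitation seriously hinders the model's ability to capture the temporal information needed for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model consisting of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. The intra-frame spatial prompts use intra-frame attention and temporal variation to adaptively highlight discriminative regions in each frame, so that the model attends to regions with significant temporal dynamics and captures fine-grained spatial details. In addition, to reflect the differing importance of frames for video understanding, we introduce inter-frame temporal prompts, which are dynamically inserted between frames with high temporal variance as measured by frame similarity. This lets the model prioritize key frames and strengthens its ability to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks show that STOP consistently outperforms state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP/.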

Summary / 总结

The research aims to improve the zero-shot generalization of vision-language models like CLIP for video understanding tasks by addressing the limitations of static prompts. The proposed STOP model introduces two modules: intra-frame spatial prompting and inter-frame temporal prompting. Intra-frame spatial prompts adaptively highlight discriminative regions within each frame, while inter-frame temporal prompts dynamically insert prompts between frames with high temporal variance. Experiments show that STOP outperforms existing methods on various video benchmarks, demonstrating its effectiveness in capturing temporal dynamics and spatial variations. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.

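The research aims to improve the performance of vision-language models on video understanding tasks by addressing the limitations of static prompts. The proposed STOP model introduces two modules: intra-frame spatial prompting and inter-frame temporal prompting. Intra-frame spatial prompts adaptively highlight key regions in each frame, while inter-frame temporal prompts are dynamically inserted between frames with high temporal variance. Experiments show that STOP outperforms existing methods on various video benchmarks.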
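
As a minimal sketch of the inter-frame idea summarized above, the snippet below scores each pair of neighbouring frames by one minus the cosine similarity of their feature vectors and picks the pairs with the highest score as places to insert a prompt. The scoring rule, the fixed prompt budget `num_prompts`, and the toy features are assumptions for illustration; the STOP paper and repository define the actual mechanism.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prompt_insertion_points(frame_features, num_prompts):
    """Return indices i such that a prompt goes between frame i and frame i + 1.

    Pairs of neighbouring frames are ranked by temporal variation,
    measured here as 1 - cosine similarity (an assumed proxy).
    """
    variation = [
        (1.0 - cosine_similarity(frame_features[i], frame_features[i + 1]), i)
        for i in range(len(frame_features) - 1)
    ]
    # Highest variation first; ties go to the earlier pair.
    ranked = sorted(variation, key=lambda item: (-item[0], item[1]))
    return sorted(index for _, index in ranked[:num_prompts])


# Hypothetical per-frame features: the content changes sharply after frames 1 and 3.
frames = [
    [1.0, 0.0],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.1, 1.0],
    [1.0, 0.0],
]
points = prompt_insertion_points(frames, num_prompts=2)
```

With these toy features the two selected positions are after frame 1 and after frame 3, which is where neighbouring frames differ most.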

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Authors: Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

First: 2024-09-30T08:05:00+00:00 · Latest: 2025-03-04T02:19:05+00:00

Abs

Abstract

With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.

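Title / Abstract (translated from Chinese)

Title: Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

With growing research interest in Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension while neglecting systematic exploration of video quality understanding. To address this gap, this paper introduces Q-Bench-Video, a new benchmark designed specifically to evaluate how well LMMs discern video quality. a) To ensure diverse video sources, Q-Bench-Video includes videos from natural scenes, AI-generated content (AIGC), and computer graphics (CG). b) Building on the traditional multiple-choice format, we add open-ended questions to better evaluate complex scenarios and introduce video-pair quality comparison questions to improve comprehensiveness. c) Beyond the traditional technical, aesthetic, and temporal distortions, we extend the evaluation dimensions to include AIGC distortions, in response to the growing demand for video generation. Finally, we collect 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that although LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable gap to human performance. Through Q-Bench-Video, we aim to stimulate community interest, encourage further research, and unlock the potential of LMMs to close the gap in video quality understanding.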

Summary / 总结

Q-Bench-Video is a new benchmark designed to evaluate LMMs' ability to understand video quality, covering diverse video sources and expanding evaluation aspects to include AIGC distortions. It includes multiple-choice, open-ended, and video pair quality comparison questions. The study tested 17 LMMs and found that while LMMs have a basic understanding of video quality, their performance is incomplete and imprecise compared to human performance.

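Q-Bench-Video is a new benchmark for evaluating the ability of Large Multi-modal Models (LMMs) to discern video quality, covering natural scenes, AI-generated content, and computer graphics. It includes multiple-choice, open-ended, and video-pair quality comparison questions, and extends evaluation to AIGC distortions. After testing 12 open-source and 5 proprietary LMMs, the study finds that LMMs have a basic understanding of video quality, but their performance is incomplete and falls short of human performance.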

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu

First: 2025-03-31T17:55:23+00:00 · Latest: 2025-04-01T02:08:31+00:00

Comments: Technical Report (In Progress); Code released at: https://github.com/TencentARC/SEED-Bench-R1

Abs · Code1 · Code2

Abstract

Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

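Title / Abstract (translated from Chinese)

Title: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Recent advances in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), and reinforcement learning (RL) has become an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks that require both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It contains intricate real-world videos and complex everyday planning tasks, presented as multiple-choice questions that require sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy of in-distribution, cross-environment, and cross-environment-task scenarios, and comes with a large-scale training dataset whose ground-truth answers are easy to verify. Using Qwen2-VL-Instruct-7B as the base model, we compare RL with supervised fine-tuning (SFT) and show RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks; RL even outperforms SFT on general video understanding benchmarks such as LongVideoBench. Our detailed analysis shows that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations, such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness to noisy signals.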

Summary / 总结

The research explores how reinforcement learning (RL) affects the reasoning capabilities of multimodal large language models (MLLMs) in video understanding tasks. SEED-Bench-R1, a new benchmark, evaluates post-training methods for MLLMs, focusing on complex real-world videos and everyday planning tasks. The study uses Qwen2-VL-Instruct-7B as a base model and compares RL with supervised fine-tuning (SFT), showing that RL is more data-efficient and performs better on both in-distribution and out-of-distribution tasks, even outperforming SFT on benchmarks like LongVideoBench. The analysis also finds that RL enhances visual perception but often produces less logically coherent reasoning chains, which points to the need for improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

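The research explores the effect of reinforcement learning (RL) on the reasoning capabilities of multimodal large language models (MLLMs) in video understanding tasks. SEED-Bench-R1 is a new benchmark for evaluating post-training methods for MLLMs, focusing on complex real-world videos and everyday planning tasks. Using Qwen2-VL-Instruct-7B as the base model, the study compares RL with supervised fine-tuning (SFT) and shows that RL is better in both performance and data efficiency, even outperforming SFT on the general video understanding benchmark LongVideoBench. RL improves visual perception but often produces less coherent reasoning chains; the study identifies key limitations and suggests future improvements in base model reasoning, reward modeling, and RL robustness to noisy signals.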
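
Because the benchmark uses multiple-choice questions with easily verifiable answers, an RL reward can be computed by a simple rule rather than a learned model. The sketch below shows one such rule, assuming the model is asked to wrap its final choice in `<answer>` tags and that options are labeled A to D; both the tag format and the function names are hypothetical and are not taken from the SEED-Bench-R1 code.

```python
import re

# Assumed output convention: the final choice appears as <answer>X</answer>.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-D])\s*</answer>", re.IGNORECASE)


def extract_choice(response):
    """Return the option letter inside the answer tags, or None if absent."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


def outcome_reward(response, ground_truth):
    """Return 1.0 when the extracted choice matches the ground truth, else 0.0."""
    choice = extract_choice(response)
    return 1.0 if choice == ground_truth.upper() else 0.0


reward_correct = outcome_reward("The pan is already hot. <answer>B</answer>", "B")
reward_untagged = outcome_reward("The answer is B", "B")
```

A reward of this kind scores only the final choice and says nothing about the reasoning text that precedes it.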

DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding

Authors: Xiaokai Chen, Ke Gao

First: 2018-05-19T08:53:44+00:00 · Latest: 2018-05-22T00:06:54+00:00

Comments: 7 pages

Abs

Abstract

Many of the leading approaches for video understanding are data-hungry and time-consuming, failing to capture the gist of spatial-temporal evolution in an efficient manner. The latest research shows that CNNs can reason about static relations between entities in images. To further exploit this capacity for reasoning about dynamic evolution, we introduce a novel network module called DenseImage Network (DIN) with two main contributions. 1) A novel compact representation of video which distills its significant spatial-temporal evolution into a matrix called DenseImage, primed for efficient video encoding. 2) A simple yet powerful learning strategy for video understanding, based on DenseImage and a temporal-order-preserving CNN, which contains a local temporal correlation constraint capturing temporal evolution at multiple time scales with different filter widths. Extensive experiments on two recent challenging benchmarks demonstrate that our DenseImage Network can accurately capture the common spatial-temporal evolution between similar actions, even with enormous visual variations or different time scales. Moreover, we obtain state-of-the-art results in action and gesture recognition with much lower time and memory cost, indicating its immense potential in video representation and understanding.

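Title / Abstract (translated from Chinese)

Title: DenseImage Network: Video Spatial-Temporal Evolution Encoding and Understanding

Many leading video understanding methods are both data-hungry and time-consuming, and fail to capture the gist of spatial-temporal evolution efficiently. Recent research shows that CNNs can reason about static relations between entities in images. To further exploit this capacity for reasoning about dynamic evolution, we propose a new network module called DenseImage Network (DIN), with two main contributions. 1) A novel compact video representation that distills a video's key spatial-temporal evolution into a matrix called DenseImage, suited to efficient video encoding. 2) A simple yet powerful learning strategy for video understanding, based on DenseImage and a temporal-order-preserving CNN, which includes a local temporal correlation constraint that captures temporal evolution at multiple time scales with different filter widths. Extensive experiments on two recent challenging benchmarks show that DenseImage Network accurately captures the common spatial-temporal evolution between similar actions, even under large visual variations or different time scales. Moreover, we obtain state-of-the-art results in action and gesture recognition at lower time and memory cost, indicating its great potential for video representation and understanding.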

Summary / 总结

The research aims to address the inefficiency of current data-hungry video understanding methods by introducing DenseImage Network (DIN). DIN proposes a novel compact representation called DenseImage to capture the spatial-temporal evolution efficiently and a learning strategy based on DenseImage and a temporal-order-preserving CNN network with a local temporal correlation constraint. The experiments show that DIN can accurately recognize actions and gestures with significant improvements in time-and-memory efficiency compared to existing methods.

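The research addresses the inefficiency of existing video understanding methods by proposing the DenseImage Network (DIN). DIN introduces the DenseImage representation to extract key spatial-temporal information and adopts a learning strategy with a local temporal correlation constraint to capture temporal evolution at multiple time scales. Experiments show that DIN accurately recognizes actions and gestures at lower computational cost, demonstrating its potential for video understanding and representation.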
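
To make the "matrix plus temporal filters of different widths" idea concrete, here is a small sketch under stated assumptions: each frame is already reduced to a feature vector, the DenseImage is taken to be those vectors stacked as rows in temporal order, and a plain moving average stands in for the learned filters of the paper's CNN. None of the names or values below come from the paper.

```python
def build_dense_image(frame_features):
    """Stack per-frame feature vectors as rows, in temporal order (a T x D matrix)."""
    dim = len(frame_features[0])
    if any(len(row) != dim for row in frame_features):
        raise ValueError("all frame features must have the same length")
    return [list(row) for row in frame_features]


def temporal_response(dense_image, width):
    """Slide a window of `width` consecutive rows along the time axis.

    A mean filter is used as a stand-in for learned weights; the output
    has T - width + 1 rows and keeps the temporal order of the input.
    """
    steps = len(dense_image) - width + 1
    dim = len(dense_image[0])
    return [
        [sum(dense_image[t + k][d] for k in range(width)) / width for d in range(dim)]
        for t in range(steps)
    ]


# Hypothetical features for four frames of a steadily changing scene.
dense_image = build_dense_image([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]])

# Different filter widths look at different time scales.
multi_scale = {width: temporal_response(dense_image, width) for width in (2, 3)}
```

Narrow windows respond to short-term changes while wider windows summarize slower evolution, which is the intuition behind using several filter widths.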

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

First: 2025-03-22T17:55:53+00:00 · Latest: 2025-03-25T00:36:59+00:00

Abs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

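Title / Abstract (translated from Chinese)

Title: 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Multimodal Large Language Models (MLLMs) perform impressively in 2D image/video understanding. However, there is currently no public standard benchmark for assessing the ability of MLLMs to understand 4D objects (3D objects that evolve over time). This paper introduces 4D-Bench, the first benchmark for evaluating MLLMs on 4D object understanding, with 4D object question answering (4D object QA) and 4D object captioning tasks. Unlike existing benchmarks based on 2D images or videos, 4D-Bench provides 4D objects from diverse categories, high-quality annotations, and tasks that require multi-view spatial-temporal understanding. With 4D-Bench, we evaluate a range of open-source and closed-source MLLMs. Results from the 4D object captioning experiment show that MLLMs are generally weaker at temporal understanding than at appearance understanding; although open-source models approach closed-source models in appearance understanding, they show a larger performance gap in temporal understanding. 4D object QA yields surprising results: even on simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o reaching only 63% accuracy against a human baseline of 91%. These findings highlight a large gap in 4D object understanding and the need for further progress in MLLMs.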

Summary / 总结

4D-Bench is a new benchmark designed to evaluate the 4D object understanding capabilities of Multimodal Large Language Models (MLLMs), focusing on 4D object Question Answering and captioning. The benchmark includes diverse 4D objects with high-quality annotations and tasks requiring multi-view spatial-temporal understanding. Experimental results show that MLLMs are generally weaker at temporal understanding than at appearance understanding, and that the gap between open-source and closed-source models is larger for temporal understanding. Moreover, even with simple single-object videos, MLLMs perform poorly in 4D object QA: the state-of-the-art GPT-4o achieves only 63% accuracy compared to a human baseline of 91%. This highlights the need for further advancements in the 4D object understanding capabilities of MLLMs.

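The paper presents 4D-Bench for evaluating the ability of MLLMs to understand 4D objects, with tasks including 4D object question answering and captioning. The evaluation shows that MLLMs are generally weaker at temporal understanding than at appearance understanding, and that open-source models lag behind closed-source models in temporal understanding. Notably, even on simple single-object videos, a state-of-the-art model such as GPT-4o reaches only 63% accuracy, far below the human baseline of 91%.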

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Authors: Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu

First: 2025-10-20T16:38:40+00:00 · Latest: 2025-10-21T01:39:45+00:00

Comments: Project Website: https://github.com/NJU-LINK/MT-Video-Bench

Abs · Code1

Abstract

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

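Title / Abstract (translated from Chinese)

Title: MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering and overlook the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, MT-Video-Bench mainly assesses six core competencies covering perceptivity and interactivity, with 987 carefully curated multi-turn dialogues from diverse domains. These competencies are closely aligned with real-world applications such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate a variety of state-of-the-art open-source and closed-source MLLMs, revealing significant performance differences and limitations in handling multi-turn video dialogues. The benchmark will be publicly released to foster future research.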

Summary / 总结

The paper introduces MT-Video-Bench, a benchmark for evaluating MLLMs in multi-turn dialogues, addressing the limitations of existing single-turn benchmarks. It assesses six core competencies through 987 curated dialogues, focusing on perceptivity and interactivity. The evaluation reveals significant performance discrepancies among various MLLMs, highlighting their limitations in multi-turn video dialogues.

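MT-Video-Bench is a benchmark for evaluating multimodal large language models (MLLMs) in multi-turn dialogues, addressing the limitations of existing single-turn benchmarks. It assesses six core competencies, centered on perceptivity and interactivity, through 987 multi-turn dialogues from diverse domains. The benchmark reveals significant performance differences among MLLMs in handling multi-turn video dialogues and aims to foster future research.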

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

First: 2024-06-11T17:22:23+00:00 · Latest: 2024-10-31T00:30:47+00:00

Comments: ZC, SL, HZ, YX, and XL contributed equally to this project. Code: https://github.com/DAMO-NLP-SG/VideoLLaMA2

Abs · Code1 · Code2 · Code3

Abstract

In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

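Title / Abstract (translated from Chinese)

Title: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio tasks. Building on its predecessor, VideoLLaMA 2 introduces a tailor-made Spatial-Temporal Convolution (STC) connector that effectively captures the intricate spatial and temporal dynamics of video data. In addition, we integrate an audio branch into the model through joint training, enriching the model's multimodal understanding by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks show that VideoLLaMA 2 consistently achieves competitive results among open-source models and comes close to some proprietary models on several benchmarks. VideoLLaMA 2 also shows reasonable improvements on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advances underline VideoLLaMA 2's superior performance in multimodal comprehension and set a new standard for intelligent video analysis systems. All models are public to facilitate further research.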

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

Authors: Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp

Venue: EMNLP 2024

First: 2024-09-30T15:04:14+00:00 · Latest: 2024-10-08T00:15:02+00:00

Comments: EMNLP 2024 Findings; 22 pages; Code: https://github.com/mayhugotong/VideoINSTA

Abs · Code1 · Code2 · Code3

Abstract

In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have become competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state-of-the-art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. The code is released here: https://github.com/mayhugotong/VideoINSTA.

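Title / Abstract (translated from Chinese)

Title: VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

In the video-language domain, recent work that uses zero-shot Large Language Model-based reasoning for video understanding has become a competitive challenger to earlier end-to-end models. However, long video understanding poses unique challenges because of the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. Information redundancy in long videos raises the question of what specific information large language models (LLMs) need and how to use it for complex spatial-temporal reasoning in long-form video analysis. We propose the VideoINSTA framework, i.e., INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach that lets LLMs reason over spatial-temporal information in videos; and (3) a self-reflective information reasoning scheme based on information sufficiency and prediction confidence. Our model significantly improves the state of the art on three long video question-answering benchmarks, EgoSchema, NextQA, and IntentQA, and on the open question answering dataset ActivityNetQA. The code is released at https://github.com/mayhugotong/VideoINSTA.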

Summary / 总结

VideoINSTA is a zero-shot framework for long video understanding using LLMs, which introduces an event-based temporal reasoning and content-based spatial reasoning approach to handle the complexity of long videos. It also includes a self-reflective information reasoning scheme that balances temporal factors based on information sufficiency and prediction confidence. The model achieves significant improvements on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, as well as the open question answering dataset ActivityNetQA.

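VideoINSTA is a zero-shot framework for long video understanding using large language models (LLMs). It introduces event-based temporal reasoning and content-based spatial reasoning to handle the complexity of long videos, together with a self-reflective information reasoning scheme that balances temporal factors based on information sufficiency and prediction confidence. The model achieves significant improvements on three long video question-answering benchmarks (EgoSchema, NextQA, and IntentQA) and on the open question answering dataset ActivityNetQA.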

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan

First: 2024-06-13T17:59:59+00:00 · Latest: 2024-06-14T01:04:18+00:00

Comments: Technical Report

Abs · Code1 · Code2

Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop 112K video-instruction set using a novel semi-automatic annotation pipeline which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.

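Title / Abstract (translated from Chinese)

Title: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Building on advances in language models, Large Multimodal Models (LMMs) have brought significant improvements in video understanding. Current video LMMs use advanced Large Language Models (LLMs), but they rely on either image or video encoders to process visual inputs, and each type has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can matter in videos with intricate action sequences. Video encoders, on the other hand, provide temporal context but, because of computational constraints, can usually process only sparse frames at lower resolution, which reduces contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary strengths of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applying an adaptive pooling strategy to the features extracted by both the image and video encoders. Our architecture shows improved performance across multiple video benchmarks, including VCGBench, MVBench, and zero-shot question answering. In addition, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves model performance. To comprehensively evaluate video LMMs, we also present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. With 4,354 question-answer pairs, this benchmark evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.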

Summary / 总结

VideoGPT+ integrates image and video encoders to enhance video understanding, addressing the limitations of using either type of encoder alone. It processes videos by segmenting them and applying an adaptive pooling strategy on features from both encoders, showing improved performance on multiple benchmarks. Additionally, a novel semi-automatic annotation pipeline was developed to create a diverse video instruction set, and a new benchmark, VCGBench-Diverse, was introduced to comprehensively evaluate video LMMs across various video categories and dynamics.

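VideoGPT+ combines an image encoder and a video encoder to enhance video understanding, addressing the limitations of using either encoder alone. The model processes videos in segments and applies an adaptive pooling strategy to the features from both encoders, showing improved performance on multiple benchmarks. In addition, a new semi-automatic annotation pipeline is developed to create a diverse video instruction set, and a new benchmark, VCGBench-Diverse, is introduced to comprehensively evaluate the generalization of video LMMs across video categories and dynamics.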
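
The segment-then-pool pipeline described above can be sketched in a few lines. The version below splits a frame list into fixed-length segments, reduces each encoder's token sequence to a fixed number of tokens by adaptive average pooling, and concatenates the two pooled sequences. The segment length, the output sizes, the concatenation step, and the one-dimensional toy tokens are assumptions for illustration; see the VideoGPT+ repository for the actual design.

```python
def split_into_segments(frames, segment_length):
    """Split a list of frames into consecutive segments (the last may be shorter)."""
    return [frames[i:i + segment_length] for i in range(0, len(frames), segment_length)]


def adaptive_avg_pool(tokens, output_size):
    """Average-pool a token sequence down to `output_size` tokens.

    Bin i covers indices floor(i * n / out) up to ceil((i + 1) * n / out),
    so the bins adapt to any input length n >= 1.
    """
    n = len(tokens)
    dim = len(tokens[0])
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = -((-(i + 1) * n) // output_size)  # integer ceiling division
        window = tokens[start:end]
        pooled.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return pooled


def fuse_segment(image_tokens, video_tokens, image_out, video_out):
    """Pool each encoder's tokens separately, then concatenate (assumed fusion)."""
    return adaptive_avg_pool(image_tokens, image_out) + adaptive_avg_pool(video_tokens, video_out)


segments = split_into_segments(list(range(10)), segment_length=4)

# Hypothetical tokens for one segment from the two encoders.
image_tokens = [[1.0], [3.0], [5.0], [7.0], [9.0], [11.0]]
video_tokens = [[2.0], [4.0]]
fused = fuse_segment(image_tokens, video_tokens, image_out=3, video_out=1)
```

Pooling keeps the token count per segment fixed regardless of how many tokens each encoder emits, which bounds the sequence length passed to the language model.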

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Authors: Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, Jiebo Luo

First: 2025-08-30T04:36:41+00:00 · Latest: 2025-09-03T00:14:13+00:00

Abs · Code1

Abstract

Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.

中文标题/摘要

标题：SurgLLM：一种具有空间聚焦和时间意识的多功能多模态模型及其在手术视频理解中的应用

手术视频理解对于促进计算机辅助手术（CAS）系统至关重要。尽管现有研究取得了显著进展，但仍然存在两个主要限制，包括手术视频中视觉内容感知不足和时间意识不足，这阻碍了多功能CAS解决方案的发展。在本文中，我们提出了SurgLLM框架，这是一种针对多功能手术视频理解任务的高效多模态模型，具有增强的空间聚焦和时间意识。具体而言，为了增强SurgLLM的空间聚焦，我们首先为SurgLLM的视频编码器设计了手术上下文感知多模态预训练（Surg-Pretrain），通过执行以器械为中心的掩蔽视频重建（MV-Recon）和后续的多模态对齐。为了将手术时间知识融入SurgLLM，我们进一步提出了时间感知多模态调优（TM-Tuning），通过交错的多模态嵌入增强时间推理。此外，为了适应手术视频的各种理解任务而不产生冲突，我们设计了手术任务动态集成，以高效地对查询进行分类，并在我们的SurgLLM中使用可学习参数。在包括字幕生成、通用VQA和时间VQA在内的多种手术视频理解任务上进行的广泛实验表明，SurgLLM在这些任务上显著优于现有方法，验证了SurgLLM在多功能手术视频理解中的有效性。源代码可在https://github.com/franciszchen/SurgLLM获取。

Summary / 总结

The research aims to improve surgical video understanding for Computer-Assisted Surgery systems by addressing limitations in visual content perception and temporal awareness. The SurgLLM framework is proposed, which includes Surgical Context-aware Multimodal Pretraining for spatial focus and Temporal-aware Multimodal Tuning for temporal reasoning. Experimental results show significant improvements over existing methods in tasks such as captioning, general VQA, and temporal VQA, validating the effectiveness of SurgLLM in versatile surgical video understanding tasks.

研究旨在通过解决视觉内容感知不足和时间感知不足的问题，提高手术视频理解能力，以促进计算机辅助手术（CAS）系统的开发。提出了SurgLLM框架，包括手术上下文感知多模态预训练以增强空间聚焦，以及时间感知多模态调优以增强时间推理。实验结果显示，在包括字幕生成、通用VQA和时间VQA等任务中，SurgLLM相比现有方法有显著改进，验证了其在多功能手术视频理解中的有效性。

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Authors: Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen

First: 2025-03-19T06:42:32+00:00 · Latest: 2025-03-20T00:29:58+00:00

Comments: FAVOR-Bench project page: https://favor-bench.github.io/

Abs · Project1 · Project2

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.

中文标题/摘要

标题：FAVOR-Bench：细粒度视频运动理解的综合基准

多模态大型语言模型（MLLMs）在视频内容理解方面表现出色，但在细粒度运动理解方面仍存在困难。为了全面评估现有MLLMs的运动理解能力，我们引入了FAVOR-Bench，包含1,776个视频，附有各种运动的结构化手动注释。我们的基准包括封闭式和开放式任务。对于封闭式评估，我们精心设计了8,184个多项选择问答对，涵盖六个不同的子任务。对于开放式评估，我们开发了一种新颖、低成本且无需LLM的字幕评估方法，以及一种GPT辅助的字幕评估方法，前者可以增强基准测试的可解释性和可重复性。对21种最先进MLLMs的全面实验表明，它们在理解和描述视频运动的详细时间动态方面存在显著局限。为缓解这一局限，我们进一步构建了FAVOR-Train数据集，包含17,152个视频，附有细粒度运动注释。在FAVOR-Train上微调Qwen2.5-VL后，模型在TVBench、MotionBench和我们的FAVOR-Bench的运动相关任务上均取得了一致的提升。全面的评估结果表明，提出的FAVOR-Bench和FAVOR-Train为社区开发更强大的视频理解模型提供了有价值的工具。项目主页：https://favor-bench.github.io/。

Summary / 总结

FAVOR-Bench is a comprehensive benchmark for evaluating the fine-grained motion understanding ability of Multimodal Large Language Models (MLLMs). It consists of 1,776 videos with structured annotations and includes both close-ended and open-ended tasks. The benchmark reveals significant limitations in MLLMs' ability to comprehend temporal dynamics in video motions. To address this, FAVOR-Train, a dataset with 17,152 videos, was created, and finetuning Qwen2.5-VL on it improved performance on motion-related tasks. This work provides valuable tools for developing more powerful video understanding models.

FAVOR-Bench 是一个全面的基准，用于评估多模态大型语言模型（MLLMs）的细粒度运动理解能力。它包含 1,776 个带有结构化注释的视频，并包括封闭式和开放式任务。基准测试揭示了 MLLMs 在理解和描述视频运动的详细时间动态方面的显著局限性。为了应对这一挑战，研究者构建了 FAVOR-Train 数据集，包含 17,152 个带有细粒度运动注释的视频。在 FAVOR-Train 上微调 Qwen2.5-VL 后，模型在 TVBench、MotionBench 和 FAVOR-Bench 的运动相关任务上均有提升。这项工作为开发更强大的视频理解模型提供了有价值的工具。

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Authors: Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara

First: 2025-09-23T13:46:31+00:00 · Latest: 2025-09-24T00:53:43+00:00

Abs · Code1 · Code2 · Code3

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

中文标题/摘要

标题：VIR-Bench：通过旅行视频行程重建评估MLLMs的地理空间和时间理解能力

近期多模态大型语言模型（MLLMs）的发展显著提升了视频理解能力，为实际应用开辟了新的可能性。然而，当前的视频基准主要集中在室内场景或短距离户外活动上，长距离旅行相关的挑战尚未得到充分探索。掌握扩展的地理空间-时间轨迹对于下一代MLLMs至关重要，支撑着诸如具身AI规划和导航等现实世界任务。为弥合这一差距，我们提出了VIR-Bench，这是一个由200个旅行视频组成的新型基准，将行程重建作为一项具有挑战性的任务，旨在评估和推动MLLMs的地理空间-时间智能。实验结果表明，最先进的MLLMs，包括专有模型，难以获得高分，突显了处理跨越广阔空间和时间尺度的视频的难度。此外，我们进行了一项深入的研究，开发了一个原型旅行规划代理，利用从VIR-Bench中获得的见解。代理显著改进的行程建议验证了我们的评估协议不仅有效地基准化了模型，还转化为面向用户的实际性能提升。

Summary / 总结

VIR-Bench evaluates the geospatial and temporal understanding of MLLMs through travel video itinerary reconstruction, addressing the lack of benchmarks for long-distance travel. The benchmark includes 200 travel videos and reveals that state-of-the-art MLLMs, including proprietary ones, struggle with videos that span extended spatial and temporal scales. In a case study, a prototype travel-planning agent built on insights from VIR-Bench produces markedly better itinerary recommendations, suggesting that the evaluation protocol translates into gains in user-facing applications.

VIR-Bench 通过旅行视频行程重建评估 MLLMs 的地理时空理解能力，弥补了长距离旅行基准的不足。该基准包含 200 个旅行视频，结果显示当前最先进的 MLLMs 在扩展的空间和时间尺度上表现不佳。深入的研究案例表明，该评估协议不仅能够有效评估模型，还能提高旅行规划代理的性能。

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan

First: 2025-05-26T17:56:30+00:00 · Latest: 2025-06-03T01:19:40+00:00

Comments: Project Page: https://vlm-3r.github.io/

Abs · Code1 · Project1

Abstract

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

中文标题/摘要

标题：VLM-3R：视觉语言模型结合指令对齐的3D重建

大型多模态模型（LMMs）在2D图像和视频上的快速进步激发了将这些模型扩展到理解3D场景的动机，以实现类人的视觉-空间智能。然而，实现与人类能力相媲美的深度空间理解在模型编码和数据获取方面提出了重大挑战。现有方法经常依赖外部深度传感器进行几何捕获，或者利用现成的算法预先构建3D地图，从而限制了其可扩展性，尤其是在使用单目视频输入和时间敏感应用方面。在本文中，我们提出了VLM-3R，这是一种结合3D重建指令调优的统一视觉语言模型（VLMs）框架。VLM-3R通过使用几何编码器处理单目视频帧，推导出表示空间理解的隐式3D令牌。利用我们的空间-视觉-视角融合以及超过20万条精心策划的3D重建指令调优问答（QA）对，VLM-3R有效地将现实世界的空间上下文与语言指令对齐。这使得单目3D空间辅助和具身推理成为可能。为了促进时间推理的评估，我们引入了视觉-空间-时间智能基准，其中包括超过13.86万条问答对，涉及五个不同任务，专注于不断变化的空间关系。广泛的实验表明，我们的模型VLM-3R不仅促进了稳健的视觉-空间推理，还能够理解3D上下文的变化，无论在准确性和可扩展性方面都表现出色。

Summary / 总结

VLM-3R is a unified framework that augments Vision-Language Models with 3D reconstructive instruction tuning to improve their spatial understanding. It uses a geometry encoder to derive implicit 3D tokens from monocular video frames and aligns real-world spatial context with language instructions using Spatial-Visual-View Fusion and over 200K curated QA pairs, enabling monocular 3D spatial assistance and embodied reasoning. The study also introduces the Vision-Spatial-Temporal Intelligence benchmark, with over 138.6K QA pairs across five tasks, for evaluating temporal reasoning. Experiments show that VLM-3R performs well at both visual-spatial reasoning and understanding temporal changes in 3D context, with gains in accuracy and scalability.

VLM-3R 是一种统一框架，通过 3D 重建指令调优增强视觉-语言模型，处理单目视频帧并推导出隐式的 3D 令牌以理解空间关系。它使用超过 20 万个 3D 重建指令调优 QA 对来对齐现实世界的空间上下文与语言指令，从而实现单目 3D 空间辅助和具身推理。实验表明，VLM-3R 在视觉-空间推理和理解 3D 上下文随时间的变化方面表现出色，展示了其准确性和可扩展性。

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

Authors: Peiran Wu, Yunze Liu, Miao Liu, Junxiao Shen

First: 2025-03-16T15:24:11+00:00 · Latest: 2025-04-24T00:19:20+00:00

Abs · Code1 · Code2

Abstract

Humans excel at spatial-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly understand the 4D world remains uncertain. This paper explores multimodal spatial-temporal reasoning from an egocentric perspective, aiming to equip MLLMs with human-like reasoning capabilities. To support this objective, we introduce Ego-ST Bench, a novel benchmark containing over 5,000 question-answer pairs across four categories, systematically evaluating spatial, temporal, and integrated spatial-temporal reasoning. Additionally, we propose ST-R1 training paradigm, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process, significantly enhancing performance. We combine long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, achieving notable improvements with limited high-quality data. Ego-ST Bench and ST-R1 provide valuable insights and resources for advancing video-based spatial-temporal reasoning research.

中文标题/摘要

标题：ST-Think: 多模态大型语言模型如何从第一人称视频中推理四维世界

人类在空间-时间推理方面表现出色，能够轻松地从第一人称视角解释动态视觉事件。然而，多模态大型语言模型（MLLMs）是否能够以类似的方式理解四维世界仍不确定。本文探讨了从第一人称视角进行多模态空间-时间推理的可能性，旨在赋予MLLMs类似人类的推理能力。为了支持这一目标，我们引入了Ego-ST Bench，这是一个包含超过5,000个问题-答案对、涵盖四个类别的新基准，系统地评估了空间、时间和整合的空间-时间推理能力。此外，我们还提出了ST-R1训练范式，这是一种基于视频的推理模型，将逆向思考融入其强化学习过程中，显著提高了性能。我们结合了长链推理（long-CoT）监督微调与组相对策略优化（GRPO）强化学习，即使在有限的高质量数据下也取得了显著的改进。Ego-ST Bench和ST-R1为推进基于视频的空间-时间推理研究提供了宝贵的见解和资源。

Summary / 总结

This paper investigates how multimodal large language models can reason about 4D worlds from an egocentric viewpoint, introducing Ego-ST Bench, a benchmark with over 5,000 question-answer pairs, and the ST-R1 training paradigm, which incorporates reverse thinking into reinforcement learning to enhance performance. The model combines long-chain-of-thought supervised fine-tuning with Group Relative Policy Optimization, achieving significant improvements with limited high-quality data.

该研究通过引入包含超过5,000个问答对的Ego-ST Bench基准，探讨了多模态大型语言模型是否能像人类一样进行时空推理。研究提出了一种基于视频的推理模型ST-R1，该模型在强化学习过程中采用逆向思考，显著提升了性能。结合长链推理监督微调与GRPO强化学习，该模型在有限高质量数据的情况下取得了显著改进。
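
The "group relative" idea in GRPO can be illustrated with a small sketch. It assumes the commonly described formulation in which several responses are sampled for one prompt and each reward is standardized against the mean and standard deviation of its group. The function name, the `eps` term, and the example rewards are hypothetical, and this is not ST-R1's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for several responses to the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one video question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.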

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Authors: Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

First: 2023-10-29T16:25:32+00:00 · Latest: 2023-10-31T01:39:37+00:00

Comments: 16 pages, 9 figures, code is available at https://github.com/RenShuhuai-Andy/TESTA

Abs · Code1

Abstract

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.

中文标题/摘要

标题：TESTA：时空令牌聚合用于长视频语言理解

大规模视频语言预训练在推进视频语言理解任务方面取得了显著进展。然而，视频编码的沉重计算负担仍然是效率瓶颈，尤其是对于长视频。这些视频由于其固有的三维特性和时空冗余，包含大量的视觉令牌，这使得捕捉复杂的时空关系变得具有挑战性。为了解决这个问题，我们提出了一种高效的方法，称为时空令牌聚合（TESTA）。TESTA 通过自适应聚合相似的帧以及每个帧内的相似块来压缩视频语义。TESTA 可以将视觉令牌的数量减少 75%，从而加速视频编码。基于 TESTA，我们引入了一种预训练的视频语言模型，该模型在每个视频编码块中配备了分割的空间时间令牌聚合模块。我们在五个数据集上评估了我们的模型，用于段落到视频检索和长视频问答任务。实验结果表明，TESTA 将计算效率提高了 1.7 倍，并且由于其在处理更长输入帧方面的可扩展性，实现了显著的性能提升，例如在 QuerYD 上 +13.7 R@1 和在 Condensed Movie 上 +6.5 R@1。

Summary / 总结

The research targets the computational bottleneck of video encoding in video-language understanding, especially for long-form videos. TESTA adaptively aggregates similar frames and similar patches within each frame, reducing the number of visual tokens by 75% and accelerating video encoding. On five datasets for paragraph-to-video retrieval and long-form VideoQA, TESTA improves computing efficiency by 1.7 times and, because it can process more input frames, achieves gains such as +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.

研究旨在解决视频-语言理解中视频编码的计算效率瓶颈，特别是对于长视频。提出的时空令牌聚合方法TESTA通过自适应聚合相似的帧以及帧内相似的图像块，将视觉令牌数量减少75%，从而加速视频编码。在段落到视频检索和长视频问答的五个数据集上，该模型的计算效率提高了1.7倍，并因能够处理更多输入帧而取得显著的性能提升，例如在QuerYD上R@1提高13.7，在Condensed Movie上R@1提高6.5。
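
To make the idea of aggregating similar tokens concrete, the following is a minimal sketch rather than TESTA's actual algorithm. It assumes a simple greedy rule (repeatedly average the most similar pair of adjacent tokens until a target count remains), and the function names and the `keep_ratio` parameter are hypothetical. A `keep_ratio` of 0.25 corresponds to the 75% token reduction quoted in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_tokens(tokens, keep_ratio=0.25):
    """Greedily merge the most similar adjacent tokens.

    Returns a list of (vector, size) pairs, where size counts how many
    original tokens were averaged into that vector.
    """
    target = max(1, round(len(tokens) * keep_ratio))
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > target:
        # Index of the adjacent pair with the highest similarity.
        best = max(
            range(len(merged) - 1),
            key=lambda i: cosine(merged[i][0], merged[i + 1][0]),
        )
        (va, ca), (vb, cb) = merged[best], merged[best + 1]
        total = ca + cb
        # Size-weighted average keeps every original token equally weighted.
        avg = [(x * ca + y * cb) / total for x, y in zip(va, vb)]
        merged[best:best + 2] = [(avg, total)]
    return merged


# Eight toy frame features forming two visually distinct segments.
frames = [[1, 0], [1, 0.1], [0.9, 0], [1, 0.05],
          [0, 1], [0.1, 1], [0, 0.9], [0.05, 1]]
condensed = aggregate_tokens(frames, keep_ratio=0.25)
```

In this toy input the eight frames collapse into two tokens, one per segment. The same routine could be applied to patches within a frame to mimic the spatial half of the aggregation.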

MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

Authors: Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang

Venue: CVPR 2025

First: 2024-05-23T09:08:09+00:00 · Latest: 2025-02-28T01:16:33+00:00

Comments: Accepted by CVPR 2025. The first two authors contribute equally

Abs · Code1

Abstract

Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4d achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 Score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. Especially, for long video sequences, our method has a significant efficiency improvement with 87.5% GPU memory reduction and 5.36 times speed-up. Codes will be released at https://github.com/IRMVLab/Mamba4D.

中文标题/摘要

标题：MAMBA4D：基于解耦时空状态空间模型的高效长序列点云视频理解

点云视频能够真实地捕捉现实世界的空间几何和时间动态，这对于使智能体理解动态变化的世界至关重要。然而，设计有效的4D主干网络仍然具有挑战性，主要是由于点的不规则和无序分布以及帧间的时间不一致性。此外，最近基于Transformer的4D主干网络通常由于其二次复杂度而面临巨大的计算成本，特别是在长视频序列中。为了解决这些挑战，我们提出了一种完全基于状态空间模型（SSMs）的新型点云视频理解主干网络。具体来说，我们首先在4D视频序列中解耦空间和时间，然后通过我们设计的Mamba模块建立时空相关性。我们开发了帧内空间Mamba模块，在一定时间步幅内编码局部相似的几何结构。随后，局部相关的令牌被传递到帧间时间Mamba模块，该模块以线性复杂度整合整个视频中的长期点特征。我们提出的Mamba4D在MSR-Action3D动作识别（+10.4%准确率）、HOI4D动作分割（+0.7 F1分数）和Synthia4D语义分割（+0.19 mIoU）数据集上取得了有竞争力的表现。特别是对于长视频序列，我们的方法效率显著提升，GPU内存减少87.5%，速度提升5.36倍。代码将在https://github.com/IRMVLab/Mamba4D上发布。

Summary / 总结

The research aims to develop an efficient backbone for understanding long-sequence point cloud videos, addressing challenges such as irregular point distributions and the quadratic cost of transformer-based backbones. The proposed Mamba4D backbone is based purely on State Space Models: it disentangles space and time, using an Intra-frame Spatial Mamba module to encode local geometric structures and an Inter-frame Temporal Mamba module to integrate long-term point features across the video with linear complexity. Experiments show competitive performance on MSR-Action3D action recognition, HOI4D action segmentation, and Synthia4D semantic segmentation, and for long sequences GPU memory usage drops by 87.5% with a 5.36 times speed-up.

本文提出了一种基于State Space Models的MAMBA4D骨干网络，以解决长序列点云视频的理解问题。该方法将空间和时间成分分离，使用Mamba模块编码局部几何结构并高效地整合长期特征。该方法在动作识别、分割和语义分割数据集上取得了竞争力的表现，并且在长序列上具有显著的效率提升，GPU内存使用减少了87.5%，速度提高了5.36倍。

UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen

First: 2025-09-29T08:14:26+00:00 · Latest: 2025-09-30T01:43:04+00:00

Abs

Abstract

Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

中文标题/摘要

标题：UI2V-Bench：基于理解的图像到视频生成基准

生成扩散模型正在迅速发展并受到广泛关注，这得益于其广泛的应用范围。图像到视频（I2V）生成已成为视频合成领域的重点。然而，现有的评估基准主要集中在视频质量和时间一致性方面，而很大程度上忽视了模型对输入图像中特定主题的语义理解能力，或确保生成的视频符合物理定律和人类常识。为解决这一问题，我们提出了UI2V-Bench，这是一种新的基准，专注于语义理解和推理评估。它引入了四个主要评估维度：空间理解、属性绑定、类别理解以及推理。为了评估这些维度，我们基于多模态大型语言模型（MLLMs）设计了两种评估方法：一种是实例级的管道，用于细粒度的语义理解，另一种是基于反馈的推理管道，能够逐步进行因果评估，以实现更准确的评估。UI2V-Bench 包含约500个精心构建的图文对，并在所有定义的维度上评估了各种开源和闭源的I2V模型。我们进一步纳入了人工评估，结果显示与提出的基于MLLM的度量标准高度一致。总体而言，UI2V-Bench 通过强调语义理解和推理能力填补了I2V评估中的关键空白，提供了一个强大的框架和数据集，以支持该领域的未来研究和模型开发。

Summary / 总结

UI2V-Bench is a new benchmark for evaluating Image-to-Video (I2V) models, focusing on semantic understanding and reasoning. It introduces four evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. The benchmark uses Multimodal Large Language Models (MLLMs) to assess these dimensions through an instance-level pipeline for fine-grained semantic understanding and a feedback-based reasoning pipeline for step-by-step causal assessment. UI2V-Bench contains approximately 500 text-image pairs, evaluates a range of open-source and closed-source I2V models, and includes human evaluations that align strongly with the proposed metrics, addressing the limitations of existing benchmarks by emphasizing semantic comprehension and reasoning ability.

UI2V-Bench 是一个新的 Image-to-Video (I2V) 模型评估基准，侧重于语义理解和推理。它引入了四个评估维度：空间理解、属性绑定、类别理解和推理。该基准使用多模态大型语言模型 (MLLMs)，通过用于细粒度语义理解的实例级管道和支持逐步因果评估的基于反馈的推理管道来评估这些维度。UI2V-Bench 包含约 500 个文本-图像对，评估了多种开源和闭源 I2V 模型，并纳入了与所提度量标准高度一致的人工评估，通过强调语义理解和推理能力弥补了现有基准的不足。

Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling

Authors: Hsin-Ying Lee, Hung-Ting Su, Bing-Chen Tsai, Tsung-Han Wu, Jia-Fong Yeh, Winston H. Hsu

First: 2022-10-08T07:03:31+00:00 · Latest: 2022-10-11T00:07:39+00:00

Comments: BMVC 2022. Code is available at https://github.com/shinying/dest

Abs · Code1

Abstract

While recent large-scale video-language pre-training made great progress in video question answering, the design of spatial modeling of video-language models is less fine-grained than that of image-language models; existing practices of temporal modeling also suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline, Decoupled Spatial-Temporal Encoders, integrating an image- and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn temporal relations for video QA, we propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences. Extensive experiments demonstrate that our model outperforms previous work pre-trained on orders of magnitude larger datasets.

中文标题/摘要

标题：通过解耦空时建模学习细粒度视觉理解以视频问答

尽管近期大规模视频-语言预训练在视频问答方面取得了巨大进展，但视频-语言模型的空间建模设计不如图像-语言模型精细；现有的时间建模实践也受到模态间对齐弱且含噪声的困扰。为了学习细粒度的视觉理解，我们解耦空时建模并提出了一种混合管道，即解耦空时编码器，该管道整合了一个图像-语言编码器和一个视频-语言编码器。前者独立于时间，从分辨率较大但稀疏采样的帧中编码空间语义，而后者在较低的空间分辨率但更高的时间分辨率下建模时间动态。为了帮助视频-语言模型学习视频问答中的时间关系，我们提出了一种新的预训练目标，即时间引用建模，该目标要求模型识别视频序列中事件的时间位置。广泛的实验表明，我们的模型优于此前在大几个数量级的数据集上预训练的工作。

Summary / 总结

The research aims to improve fine-grained visual understanding in video question answering by decoupling spatial and temporal modeling. The proposed method, Decoupled Spatial-Temporal Encoders, uses an image-language encoder to capture spatial semantics from sparsely sampled frames and a video-language encoder to model temporal dynamics. A novel pre-training objective, Temporal Referring Modeling, is introduced to help the model learn temporal relations. Experiments show that the proposed model outperforms previous work pre-trained on much larger datasets.

研究旨在通过解耦空间和时间建模来提高视频问答中的细粒度视觉理解。提出的Decoupled Spatial-Temporal Encoders方法使用图像语言编码器从稀疏采样的帧中捕获空间语义，使用视频语言编码器建模时间动态。引入了一种新的预训练目标，即时间引用建模，以帮助模型学习时间关系。实验表明，该模型优于此前在大得多的数据集上预训练的工作。

EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization

Authors: Xiaoqi Wang, Yi Wang, Lap-Pui Chau

First: 2025-06-17T09:51:51+00:00 · Latest: 2025-06-18T00:36:11+00:00

Abs · Code1

Abstract

Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .

中文标题/摘要

标题：EVA02-AT：基于空间-时间旋转位置嵌入和对称优化的主观视角视频-语言理解

主观视角视频-语言理解需要高效性和精确的空间-时间建模。现有方法面临三个关键挑战：1）多阶段预训练管道导致的高昂预训练成本，2）由于手动拆分的3D旋转位置嵌入导致的空间-时间编码效果不佳，这阻碍了特征交互，3）软标签多实例检索中的不精确学习目标，忽略了负样本对之间的相关性。本文中，我们引入了EVA02-AT，这是一种基于EVA02的视频-语言基础模型，专门针对主观视角视频理解任务。EVA02-AT首先通过单阶段预训练将基于图像的CLIP模型高效地转换为统一的视频编码器。其次，我们引入了空间-时间旋转位置嵌入和联合注意力，这可以有效地在整个隐藏维度上编码空间和时间信息。这种空间-时间特征的联合编码使模型能够学习跨轴关系，这对于准确建模视频中的运动和交互至关重要。第三，针对多实例视频-语言检索任务，我们引入了对称多相似性（SMS）损失和一种新的训练框架，该框架可以同时为正样本和负样本对的所有软标签提供更精确的学习目标。在Ego4D、EPIC-Kitchens-100和Charades-Ego数据集上的零样本和微调设置下的广泛实验表明，EVA02-AT在各种主观视角视频-语言任务中实现了最先进的性能，且参数更少。使用我们SMS损失的模型在多实例检索基准测试中也显示出显著的性能提升。我们的代码和模型已公开发布在https://github.com/xqwang14/EVA02-AT 。

Summary / 总结

EVA02-AT addresses the challenges of egocentric video-language understanding by introducing a single-stage pretraining method, spatial-temporal rotary positional embeddings, and a symmetric multi-similarity loss. The model demonstrates superior performance across various egocentric video-language tasks with fewer parameters compared to existing approaches, and shows significant improvements in multi-instance retrieval benchmarks.

EVA02-AT 通过引入单阶段预训练方法、时空旋转位置嵌入和对称多相似性损失，解决了自视点视频-语言理解的挑战。该模型在各种自视点视频-语言任务中表现出更优性能，参数更少，并在多实例检索基准测试中显示出显著的性能提升。
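
The contrast between rotary embeddings split across isolated dimensions and a joint spatial-temporal encoding can be sketched as follows. This is an illustrative assumption, not EVA02-AT's formulation: every pair of hidden dimensions is rotated by an angle that sums a spatial term and a temporal term, so each pair carries both kinds of position. The function names and the two frequency bases are hypothetical.

```python
import math


def joint_angles(t, p, dim, spatial_base=10000.0, temporal_base=100.0):
    """One rotation angle per dimension pair for frame index t and patch index p.

    Assumption: the angle is a sum of a spatial and a temporal term with
    different frequency bases, so all pairs encode both axes.
    """
    half = dim // 2
    return [
        p * spatial_base ** (-i / half) + t * temporal_base ** (-i / half)
        for i in range(half)
    ]


def rope_rotate(vec, angles):
    """Rotate consecutive (even, odd) pairs of vec by the given angles."""
    out = []
    for i, ang in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(ang), math.sin(ang)
        out += [x * c - y * s, x * s + y * c]
    return out


def attention_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and key; positions are (t, p) tuples."""
    dim = len(q)
    rq = rope_rotate(q, joint_angles(*q_pos, dim))
    rk = rope_rotate(k, joint_angles(*k_pos, dim))
    return sum(a * b for a, b in zip(rq, rk))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.2, -0.4, 0.9]
# Both calls use the same offsets (2 frames, 3 patches) between query and key.
score_a = attention_score(q, k, (1, 2), (3, 5))
score_b = attention_score(q, k, (4, 10), (6, 13))
```

Because the angle is linear in both indices, the score depends only on the relative offset in time and space, which is the property that makes rotary embeddings attractive for attention.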

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Authors: Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

Venue: CVPR 2024 highlight

First: 2023-11-28T17:59:04+00:00 · Latest: 2024-05-24T01:48:45+00:00

Comments: CVPR 2024 highlight: updated version with Mistral and better performances for MVBench/NExT-QA/STAR/TVQA/EgoSchema/IntentQA

Abs · Code1 · Code2 · Code3

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

中文标题/摘要

标题：MVBench：一个全面的多模态视频理解基准

随着多模态大型语言模型（MLLMs）的快速发展，最近出现了一些诊断基准来评估这些模型的理解能力。然而，大多数基准主要评估静态图像任务中的空间理解，而忽视了动态视频任务中的时间理解。为了解决这一问题，我们引入了一个全面的多模态视频理解基准，即MVBench，涵盖了20个无法仅凭单帧有效解决的具有挑战性的视频任务。具体来说，我们首先引入了一种新颖的静态到动态方法来定义这些与时间相关的任务。通过将各种静态任务转化为动态任务，我们能够系统地生成需要从感知到认知的广泛时间技能的视频任务。然后，在任务定义的指导下，我们自动将公共视频注释转换为多项选择问答来评估每个任务。一方面，这种独特的范式使我们能够高效地构建MVBench，而不需要太多的手动干预。另一方面，它依靠真实的视频注释保证了评估的公平性，避免了由LLM评分带来的偏差。此外，我们进一步开发了一个稳健的视频MLLM基线，即VideoChat2，通过渐进的多模态训练和多样化的指令调优数据进行训练。我们在MVBench上的广泛结果表明，现有的MLLMs在时间理解方面远未令人满意，而我们的VideoChat2在MVBench上比这些领先模型高出15%以上。所有模型和数据可在https://github.com/OpenGVLab/Ask-Anything获取。

Summary / 总结

MVBench is introduced to evaluate the temporal understanding of multi-modal large language models (MLLMs) in video tasks, which most existing benchmarks overlook. It covers 20 challenging video tasks that cannot be solved effectively from a single frame, defined with a static-to-dynamic method. Public video annotations are automatically converted into multiple-choice QA, so evaluation relies on ground-truth annotations rather than LLM scoring. The VideoChat2 baseline, built by progressive multi-modal training with diverse instruction-tuning data, surpasses leading MLLMs by over 15% on MVBench, while the results show that existing MLLMs remain far from satisfactory in temporal understanding. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

MVBench 是一个用于评估多模态大语言模型（MLLMs）在视频任务中时间理解能力的基准，解决了现有缺乏此类基准的问题。它包含20个具有挑战性的视频任务，并使用静态到动态的方法定义这些任务。该基准通过将公共视频注释自动转换为多项选择题来生成，确保了公平性。通过多样指令调优数据进行渐进式多模态训练的 VideoChat2 模型，在 MVBench 上显著超越了现有 MLLMs，展示了 MLLMs 在时间理解能力方面存在的不足。所有模型和数据可在 https://github.com/OpenGVLab/Ask-Anything 获取。
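
The conversion of ground-truth annotations into multiple-choice QA described above might look roughly like the following. This is a minimal sketch, not the MVBench pipeline: the helper name `annotation_to_mcq`, the record layout, and the idea of sampling distractors from a pool of other labels are all assumptions made for illustration.

```python
import random
import string


def annotation_to_mcq(question, answer, label_pool, num_options=4, seed=0):
    """Turn one annotated (question, answer) pair into a multiple-choice item.

    Hypothetical helper: distractors are drawn from the other labels in
    `label_pool`, which is an assumption, not the published procedure.
    """
    rng = random.Random(seed)
    distractors = sorted(set(label_pool) - {answer})
    if len(distractors) < num_options - 1:
        raise ValueError("not enough distinct labels to build distractors")
    options = rng.sample(distractors, num_options - 1) + [answer]
    rng.shuffle(options)
    letters = string.ascii_uppercase[:num_options]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(answer)],  # letter of the ground truth
    }


item = annotation_to_mcq(
    "What does the person do after opening the door?",
    "sit down",
    ["sit down", "run", "jump", "wave", "clap"],
)
```

Because the correct option comes straight from the annotation, scoring reduces to comparing the predicted letter with `item["answer"]`, with no LLM judge involved.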

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Authors: Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, Jinwoo Choi

Venue: CVPR 2025

First: 2025-03-20T05:48:59+00:00 · Latest: 2025-03-21T00:29:57+00:00

Comments: Accepted for CVPR 2025

Abs

Abstract

In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.

中文标题/摘要

标题：MASH-VLM：通过解耦空间-时间表示减轻视频大语言模型中的动作场景幻觉

在本文中，我们解决了视频大语言模型（Video-LLMs）中的动作场景幻觉问题，即模型基于场景上下文错误地预测动作，或基于观察到的动作错误地预测场景。我们观察到，现有Video-LLMs由于两个主要原因经常遭受动作场景幻觉。首先，现有Video-LLMs通过在所有标记之间应用注意力操作来混合空间和时间特征。其次，它们使用标准的旋转位置嵌入（RoPE），这使得文本标记会根据其顺序过度强调某些类型的标记。为了解决这些问题，我们提出了MASH-VLM，一种通过解耦空间-时间表示来减轻Video-LLMs中动作场景幻觉的方法。我们的方法包括两个关键创新：（1）DST-注意力，一种新颖的注意力机制，通过使用掩码注意力来限制空间和时间标记之间的直接交互，从而在LLM中解耦空间和时间标记；（2）Harmonic-RoPE，它扩展了位置ID的维度，使空间和时间标记能够相对于文本标记保持平衡的位置。为了评估Video-LLMs中的动作场景幻觉，我们引入了包含1,320个视频和4,078个问答对的UNSCENE基准。广泛的实验表明，MASH-VLM在UNSCENE基准和现有视频理解基准上均取得了最先进的结果。

Summary / 总结

The research addresses action-scene hallucination in Video Large Language Models (Video-LLMs) by introducing MASH-VLM, which uses DST-attention to disentangle spatial and temporal tokens and Harmonic-RoPE to balance positional embeddings. The method significantly reduces hallucination and achieves state-of-the-art results on the UNSCENE benchmark and existing video understanding benchmarks.

研究旨在通过引入MASH-VLM，使用分离的空间和时间表示来解决视频大型语言模型中的动作场景幻觉问题。MASH-VLM 使用 DST-注意力机制将空间和时间标记分离，并使用 Harmonic-RoPE 来平衡它们与文本标记的位置关系。实验表明，MASH-VLM 在 UNSCENE 基准和现有视频理解基准上均取得了最佳结果。
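
The masking idea behind DST-attention, restricting direct interaction between spatial and temporal tokens, can be illustrated with a small boolean mask. This is a sketch under stated assumptions: the abstract does not give the exact masking rules, so the token-type labels, the function name `build_disentangling_mask`, and the choice to let text tokens attend everywhere (with no causal mask) are hypothetical.

```python
def build_disentangling_mask(token_types):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumption: only spatial<->temporal pairs are blocked; every other pair,
    including anything involving text tokens, stays visible.
    """
    n = len(token_types)
    mask = [[True] * n for _ in range(n)]
    for i, query_type in enumerate(token_types):
        for j, key_type in enumerate(token_types):
            if {query_type, key_type} == {"spatial", "temporal"}:
                mask[i][j] = False  # block direct spatial-temporal interaction
    return mask


types = ["spatial", "spatial", "temporal", "text"]
mask = build_disentangling_mask(types)
```

In a real model such a mask would be combined with the usual causal mask and applied to the attention logits before the softmax.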

DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Authors: Xiaoyi Bao, Chenwei Xie, Hao Tang, Tingyu Weng, Xiaofeng Wang, Yun Zheng, Xingang Wang

Venue: ICCV 2025

First: 2025-07-21T12:50:49+00:00 · Latest: 2025-07-22T01:19:59+00:00

Comments: Accepted by ICCV 2025

Abs

Abstract

In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.

中文标题/摘要

标题：DynImg：视觉提示下的关键帧是多模态视频理解的良好表示

近年来，多模态大型语言模型（MLLMs）在视频理解任务中的应用越来越普遍。然而，如何有效整合时间信息仍然是一个关键的研究重点。传统方法将空间和时间信息分开处理。由于运动模糊等问题，快速移动物体的空间信息难以准确表示，这可能导致在空间特征提取过程中重要的时间区域被低估，从而妨碍准确的空间-时间交互和视频理解。为了解决这一局限性，我们提出了一种名为动态图像（DynImg）的创新视频表示方法。具体来说，我们引入了一组非关键帧作为时间提示，以突出包含快速移动物体的空间区域。在视觉特征提取过程中，这些提示引导模型额外关注这些区域对应的空间细节特征。此外，为了保持DynImg的正确顺序，我们采用相应的4D视频旋转位置嵌入。这保留了DynImg的时间和空间邻近性，帮助MLLM理解这种组合格式中的空间-时间顺序。实验评估表明，DynImg在多个视频理解基准测试中比最先进的方法高出约2%，证明了我们的时间提示在增强视频理解方面的有效性。

Summary / 总结

The paper addresses the challenge of integrating temporal information in video understanding tasks by proposing DynImg, which uses non-key frames as temporal prompts to highlight fast-moving object regions. This method enhances visual feature extraction and maintains the correct sequence through 4D video Rotary Position Embedding. Experiments show that DynImg outperforms existing methods by about 2% across various benchmarks, demonstrating the effectiveness of temporal prompts in improving video understanding.

论文提出DynImg方法，通过使用非关键帧作为时间提示来突出快速移动物体的区域，以解决视频理解任务中整合时间信息的挑战。该方法增强了细粒度空间特征的提取，并通过4D视频旋转位置嵌入保持正确的时空顺序。实验结果显示，DynImg在多个视频理解基准测试中比现有方法高出约2%，证明了时间提示在提高视频理解方面的有效性。

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

Authors: Ziyi Wang, Haoran Wu, Yiming Rong, Deyang Jiang, Yixin Zhang, Yunlong Zhao, Shuang Xu, Bo XU

First: 2025-04-09T12:51:10+00:00 · Latest: 2025-04-10T00:45:17+00:00

Abs

Abstract

Long video understanding is a complex task that requires both spatial detail and temporal awareness. While Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input, they suffer from information loss due to the sparse sampling strategy. In contrast, Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. To transfer long video understanding capabilities to VLMs with minimal data and computational cost, we propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism, which effectively tackles the sparse sampling problem in VLMs. By training only the alignment layer with 10k short video-text pairs, LVC significantly enhances the temporal reasoning abilities of VLMs. Extensive experiments show that LVC provides consistent performance improvements across various models, including the InternVL2 series and Phi-3.5-Vision. Notably, the InternVL2-40B-LVC achieves scores of 68.2 and 65.9 on the long video understanding benchmarks MLVU and Video-MME, respectively, with relative improvements of 14.6% and 7.7%. The enhanced models and code will be publicly available soon.

中文标题/摘要

标题：LVC：一种轻量级压缩框架以增强VLMs在长视频理解中的能力

长视频理解是一项复杂的任务，需要同时具备空间细节和时间意识。虽然视觉语言模型（VLMs）通过多帧输入获得帧级理解能力，但由于稀疏采样策略，它们会遭受信息损失。相比之下，视频大型语言模型（Video-LLMs）能够捕捉视觉特征内的时间关系，但受限于高质量视频-文本数据集的稀缺性。为了以最小的数据和计算成本将长视频理解能力转移到VLMs，我们提出了一种名为轻量级视频压缩（LVC）的新方法，该方法具有查询-注意视频压缩机制，有效解决了VLMs中的稀疏采样问题。通过仅使用10000个短视频-文本对训练对齐层，LVC显著增强了VLMs的时间推理能力。广泛实验表明，LVC在包括InternVL2系列和Phi-3.5-Vision在内的各种模型中提供了持续的性能改进。值得注意的是，InternVL2-40B-LVC在长视频理解基准MLVU和Video-MME上的得分为68.2和65.9，分别相对提高了14.6%和7.7%。增强后的模型和代码将很快公开。

Summary / 总结

The paper proposes LVC, a lightweight compression framework that addresses the sparse sampling issue in VLMs by using the Query-Attention Video Compression mechanism. By training only the alignment layer with 10k short video-text pairs, LVC enhances the temporal reasoning abilities of VLMs. The method shows consistent performance improvements across various models, with notable relative improvements of 14.6% and 7.7% on MLVU and Video-MME benchmarks respectively.

论文提出了一种轻量级压缩框架LVC，通过解决稀疏采样问题来增强VLMs在长视频理解中的能力。LVC采用Query-Attention视频压缩机制，并仅用10k短视频-文本对进行训练，显著提升了时间推理能力。实验结果显示，LVC在各种模型上表现出一致的性能提升，在MLVU和Video-MME基准测试中分别取得了68.2和65.9的分数，相对提高了14.6%和7.7%。

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Authors: Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

Venue: ICLR 2025

First: 2024-09-19T17:59:51+00:00 · Latest: 2025-02-28T01:26:05+00:00

Comments: Accepted to ICLR 2025

Abs · Code1

Abstract

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.

中文标题/摘要

标题：Oryx MLLM：任意分辨率下的时空理解

视觉数据形式多样，从几像素的小图标到数小时的长视频不等。现有的多模态LLM通常将这些多样的视觉输入标准化为固定分辨率，并为LLM生成相同数量的令牌。这种方法对于多模态理解来说并不理想，也不利于处理长视频和短视频输入。为了解决这个问题，我们提出了Oryx，一种统一的多模态架构，用于图像、视频和多视图3D场景的空间-时间理解。Oryx通过两项核心创新提供了一种按需解决方案，以无缝且高效地处理任意空间大小和时间长度的视觉输入：1) 一个预训练的OryxViT模型，可以将图像以任意分辨率编码为LLM友好的视觉表示；2) 一个动态压缩模块，可以根据请求支持1到16倍的视觉令牌压缩。这些设计特性使Oryx能够容纳极长的视觉上下文，如低分辨率的视频，同时在文档理解等任务中保持高识别精度，无需压缩。除了架构改进，增强的数据整理和针对长上下文检索和空间感知数据的专业训练帮助Oryx同时具备强大的图像、视频和3D多模态理解能力。我们的工作已开源在https://github.com/Oryx-mllm/Oryx。

Summary / 总结

The paper proposes Oryx, a unified multimodal architecture for spatial-temporal understanding of images, videos, and 3D scenes. It introduces a pre-trained OryxViT model for encoding images at any resolution and a dynamic compressor module for efficient token compression. Oryx demonstrates strong performance in tasks like document understanding with high precision and can handle long visual contexts with lower resolution and high compression. Beyond the architecture, enhanced data curation and specialized training improve its capabilities in multimodal understanding.

论文提出了Oryx，一种统一的多模态架构，用于处理图像、视频和3D场景的空间-时间理解。该架构包含一个预训练的OryxViT模型，可以将图像以任意分辨率编码，并且包含一个动态压缩模块，支持按需压缩视觉标记。实验结果显示，Oryx能够处理长视觉上下文，同时以较低分辨率和高压缩率进行处理，而在文档理解等任务中保持高识别精度，无需压缩。除了架构改进之外，增强的数据整理和针对长上下文检索和空间感知数据的专业训练也增强了Oryx在多模态理解方面的强大能力。
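
To make the "1x to 16x compression on visual tokens" concrete, here is a minimal sketch that shrinks a token sequence by averaging consecutive groups. Average pooling is a stand-in chosen for illustration; the abstract does not describe how the dynamic compressor is implemented, and the name `compress_tokens` is hypothetical.

```python
def compress_tokens(tokens, ratio):
    """Average every `ratio` consecutive token vectors into one.

    `tokens` is a list of equal-length float lists. The 1..16 bound mirrors
    the range quoted in the abstract; the pooling itself is an assumption.
    """
    if not 1 <= ratio <= 16:
        raise ValueError("ratio must be between 1 and 16")
    compressed = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        compressed.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return compressed


tokens = [[float(i), float(2 * i)] for i in range(32)]
low_compression = compress_tokens(tokens, 1)    # e.g. document images
high_compression = compress_tokens(tokens, 16)  # e.g. long videos
```

The trade-off is the one the abstract describes: a ratio of 1 keeps every token for detail-sensitive inputs, while a ratio of 16 fits far more frames into the same context length.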

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

First: 2024-03-22T17:57:42+00:00 · Latest: 2024-08-15T00:42:41+00:00

Comments: a technical report about video understanding (accepted to ECCV2024)

Abs · Code1 · Code2

Abstract

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

中文标题/摘要

标题：InternVideo2: 扩展多模态视频理解的基础模型

我们介绍了InternVideo2，这是一种新的视频基础模型（ViFM）家族，其在视频识别、视频-文本任务和视频中心对话中达到了最先进的结果。我们的核心设计是一种渐进式训练方法，将掩蔽视频建模、跨模态对比学习和下一个标记预测统一起来，将视频编码器的规模扩展到60亿参数。在数据层面，我们通过语义分割视频并生成视频-音频-语音字幕来优先考虑时空一致性，这提高了视频和文本之间的对齐。通过广泛的实验，我们验证了我们的设计，并在超过60个视频和音频任务上展示了优越的性能。值得注意的是，我们的模型在各种视频相关对话和长视频理解基准测试中表现优于其他模型，突显了其在推理和理解更长上下文方面的能力。代码和模型可在https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/ 获取。

Summary / 总结

The research introduces InternVideo2, a new family of video foundation models that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. The model uses a progressive training approach that combines masked video modeling, crossmodal contrastive learning, and next token prediction, with a video encoder size of 6B parameters. The data is processed to prioritize spatiotemporal consistency through semantic segmentation and video-audio-speech caption generation. Experiments show superior performance on over 60 video and audio tasks, particularly in video-related dialogue and long video understanding benchmarks, demonstrating the model's ability to handle longer contexts effectively.

研究引入了InternVideo2，这是一种新的视频基础模型家族，在视频识别、视频-文本任务和视频中心对话方面达到了最先进的结果。该模型采用了一种渐进式训练方法，结合了掩蔽视频建模、跨模态对比学习和下一个标记预测，视频编码器的参数量为6B。数据处理时通过语义分割和视频-音频-语音字幕生成来优先考虑时空一致性。实验结果显示，该模型在超过60个视频和音频任务中表现出色，特别是在视频相关对话和长视频理解基准测试中表现出色，展示了其处理更长上下文的能力。

CoS: Chain-of-Shot Prompting for Long Video Understanding

Authors: Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong

First: 2025-02-10T13:03:05+00:00 · Latest: 2025-02-12T01:57:10+00:00

Comments: A training-free test-time optimisation approach for long video understanding

Abs · Project1

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, which ends up filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.

中文标题/摘要

标题：CoS：链式-shot提示在长视频理解中的应用

多模态大型语言模型（MLLMs）在处理长视频时遇到困难，因为需要大量的视觉标记。这些标记远远超过了MLLMs的上下文长度，导致视频中充满了冗余且与任务无关的镜头。如何选择镜头是一个未解决的关键问题：稀疏采样可能会错过关键细节，而全面采样则会令模型陷入无关内容的困扰，导致视频理解错误。为了解决这个问题，我们提出了链式-shot提示（CoS）。核心思想是将镜头选择视为测试时的视觉提示优化，通过优化镜头与视频理解语义任务的对齐来选择适应视频理解任务的镜头。CoS有两个关键部分：（1）一种二元视频摘要机制，执行伪时间定位，发现二元编码以识别任务相关镜头，（2）一种视频共推理模块，利用二元编码将任务相关正镜头与无关负镜头配对（学习对齐）。它将优化的镜头选择嵌入到原始视频中，有助于聚焦于相关上下文以优化长视频理解。在三个基线和五个数据集上的实验表明CoS的有效性和适应性。代码见https://lwpyh.github.io/CoS。

Summary / 总结

The paper addresses the challenge of long video understanding for Multi-modal Large Language Models (MLLMs) by proposing Chain-of-Shot prompting (CoS). CoS optimizes shot selection at test time to align with the semantic task, using a binary video summary mechanism and a video co-reasoning module. Experiments show that CoS improves long video understanding across three baselines and five datasets, demonstrating its effectiveness and adaptability.

论文针对多模态大型语言模型（MLLMs）在长视频理解上的挑战，提出了链式镜头提示（CoS）方法。CoS在测试时优化镜头选择，使其与语义任务对齐，使用二元视频摘要机制和视频共推理模块。实验表明，CoS在多个基准和数据集上提高了长视频理解的效果，无需训练。
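
The role of the binary coding in CoS, marking each shot as task-relevant or not, can be sketched as a simple partition of the shot list. How the coding is produced (the pseudo temporal grounding step) is not covered here; the function name `split_shots` and the shot labels are hypothetical.

```python
def split_shots(shots, binary_code):
    """Partition shots into (positive, negative) lists using a 0/1 code.

    A 1 marks a task-relevant (positive) shot and a 0 an irrelevant
    (negative) one. Producing the code itself is out of scope for this sketch.
    """
    if len(shots) != len(binary_code):
        raise ValueError("one code bit is required per shot")
    positive = [shot for shot, bit in zip(shots, binary_code) if bit == 1]
    negative = [shot for shot, bit in zip(shots, binary_code) if bit == 0]
    return positive, negative


shots = ["shot_0", "shot_1", "shot_2", "shot_3", "shot_4"]
positive, negative = split_shots(shots, [0, 1, 1, 0, 0])
```

The two lists correspond to the positive and negative shot sets that the co-reasoning module pairs with the original video.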

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Authors: Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu

First: 2025-09-15T12:39:19+00:00 · Latest: 2025-09-16T01:20:44+00:00

Comments: 25 pages, 16 figures

Abs · Code1

Abstract

Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.

中文标题/摘要

标题：Dr.V：一种分层感知-时序-认知框架，通过细粒度的空间-时序定位诊断视频幻觉

近年来，大型视频模型（LVMs）在视频理解方面取得了显著进步。然而，这些模型仍然存在幻觉问题，生成与输入视频冲突的内容。为了解决这一问题，我们提出了Dr.V，一种涵盖感知、时序和认知层次的分层框架，通过细粒度的空间-时序定位诊断视频幻觉。Dr.V 包含两个关键组件：基准数据集Dr.V-Bench和卫星视频代理Dr.V-Agent。Dr.V-Bench 包含来自4,974个视频的10,000个实例，涵盖多种任务，每个实例都附有详细的时空注释。Dr.V-Agent 通过在感知和时序层次上系统地应用细粒度的空间-时序定位，然后进行认知层次推理，来检测LVMs中的幻觉。这一逐步管道模仿了人类的视频理解过程，并有效地识别了幻觉。大量实验表明，Dr.V-Agent 在诊断幻觉的同时提高了可解释性和可靠性，为在实际场景中实现稳健的视频理解提供了实用的蓝图。所有数据和代码均可在https://github.com/Eurekaleo/Dr.V 获取。

Summary / 总结

The research aims to address hallucinations in large video models (LVMs) by proposing Dr.V, a hierarchical framework that includes a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Agent diagnoses hallucinations through fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive level reasoning, effectively identifying hallucinations and enhancing interpretability and reliability in video understanding. Extensive experiments show Dr.V-Agent's effectiveness in diagnosing hallucinations and providing a practical blueprint for robust video understanding.

研究旨在通过提出Dr.V框架解决大型视频模型（LVMs）中的幻觉问题，该框架包括基准数据集Dr.V-Bench和卫星视频代理Dr.V-Agent。Dr.V-Agent通过在感知和时间层面进行精细的空间-时间定位，随后进行认知推理来诊断幻觉，有效识别视频理解中的不一致。实验表明，Dr.V-Agent在诊断幻觉方面提高了可解释性和可靠性，提供了一种实用的解决方案以实现稳健的视频理解。

Video Panels for Long Video Understanding

Authors: Lars Doorenbos, Federico Spurio, Juergen Gall

First: 2025-09-28T08:05:55+00:00 · Latest: 2025-09-30T01:01:00+00:00

Abs · Code1

Abstract

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.

中文标题/摘要

标题：长视频理解的视频面板

近期的视频-语言模型（VLMs）在长视频理解任务上取得了令人鼓舞的结果，但其性能仍落后于涉及图像或短视频的任务。这导致了对通过引入新颖模块和额外复杂性来提高VLMs的长上下文建模的兴趣。在本文中，我们采取了不同的方法：我们不使用有限的数据微调VLMs，而是试图最大化现有模型的性能。为此，我们提出了一种专为长视频理解设计的新型视觉提示策略。通过将多个帧作为面板组合成一张图像，我们有效地以空间细节换取时间分辨率。我们的方法是无需训练的、无需参数的、模型无关的，并且可以无缝集成到现有的VLMs中。在五个成熟基准上的大量实验，涵盖了各种模型架构、大小和上下文窗口，证实了我们方法的一致性。对于包含最长视频的TimeScope（Long）数据集，视频问答的准确性提高了高达19.4%。总体而言，我们的方法提高了长视频理解模型的标准。我们将在论文被接收后提供我们的代码。

Summary / 总结

This paper addresses the challenge of long-video understanding by proposing a novel visual prompting strategy that combines multiple frames into panels to enhance temporal resolution. The approach is training-free, parameter-free, and model-agnostic, and it improves the accuracy of video question answering by up to 19.4% on the TimeScope (Long) dataset, demonstrating consistent performance across various model architectures and context windows.

本文提出了一种新颖的视觉提示策略，通过将多个帧组合成面板来增强时间分辨率，以解决长视频理解的挑战。该方法无需训练和参数，并且适用于各种视频理解基准。在TimeScope (Long)数据集上，视频问答的准确性提高了高达19.4%。该方法为长视频理解模型设定了新的基准。
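
Combining several frames as panels of one image can be sketched with plain nested lists standing in for pixel grids. This is only an illustration of the tiling step: the grid layout, the absence of resizing, and the name `make_panel` are assumptions, since the abstract does not specify them.

```python
def make_panel(frames, rows, cols):
    """Tile rows*cols equally sized frames into a single image.

    Each frame is a list of pixel rows. Frames are placed left to right,
    top to bottom, so reading order follows temporal order (an assumption).
    """
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    panel = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate the frames side by side
            panel.append(line)
    return panel


# Four 2x3 frames, each filled with its own frame index.
frames = [[[i] * 3 for _ in range(2)] for i in range(4)]
panel = make_panel(frames, rows=2, cols=2)
```

Under a fixed input resolution, a 2x2 panel gives each frame a quarter of the pixels while one image now covers four time steps, which is the spatial-for-temporal trade described in the abstract.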

T*: Re-thinking Temporal Search for Long-Form Video Understanding

Authors: Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li

Venue: CVPR 2025 long

First: 2025-04-03T04:03:10+00:00 · Latest: 2025-08-26T00:56:49+00:00

Comments: Accepted by CVPR 2025; A real-world long video needle-in-haystack benchmark; long-video QA with human ref frames

Abs · Code1 · Code2 · Code3

Abstract

Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold: First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on specific queries. Upon this formulation, we introduce LV-Haystack, the first dataset with 480 hours of videos, 15,092 human-annotated instances for both training and evaluation aiming to improve temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods only achieving 2.1% temporal F1 score on the Longvideobench subset. Next, inspired by visual search in images, we propose a lightweight temporal search framework, T* that reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-OV-72B's performance from 56.5% to 62.4% on the Longvideobench XL subset. Our code, benchmark, and models are provided in the Supplementary material.

中文标题/摘要

标题：T*: 重新思考长视频时间搜索

高效理解长视频仍然是计算机视觉中的一个重大挑战。在本文中，我们重新审视了长视频理解的时间搜索范式，并解决了所有最先进的（SOTA）长上下文视觉-语言模型（VLMs）中的一个基本问题。我们的贡献有两个方面：首先，我们将时间搜索定义为“长视频干草堆”问题：基于特定的查询，从数万帧中找到一组最小的相关帧（例如，一到五帧）。在此基础上，我们引入了LV-Haystack，这是第一个包含480小时视频、15,092个人工标注实例的数据集，用于训练和评估，旨在提高时间搜索的质量和效率。LV-Haystack上的结果突显了时间搜索能力的研究缺口，当前SOTA搜索方法在Longvideobench子集上的时间F1分数仅为2.1%。其次，受图像中视觉搜索的启发，我们提出了一种轻量级的时间搜索框架T*，将昂贵的时间搜索重新定义为空间搜索。T*利用了在图像中常用的强大视觉定位技术，并引入了一种在时间和空间维度上都起作用的自适应放大机制。广泛的实验表明，将T*与现有方法结合使用可以显著提高SOTA长视频理解。在32帧的推理预算下，T*在Longvideobench XL子集上将GPT-4o的性能从50.5%提高到53.1%，将LLaVA-OneVision-OV-72B的性能从56.5%提高到62.4%。我们的代码、基准和模型在补充材料中提供。

Summary / 总结

This paper addresses the challenge of efficiently understanding long-form videos by revisiting temporal search paradigms. It introduces LV-Haystack, a new dataset for temporal search, and proposes T*, a lightweight framework that reframes temporal search as spatial search. Experiments show that integrating T* with existing methods significantly improves long-form video understanding: under an inference budget of 32 frames on the Longvideobench XL subset, T* raises GPT-4o from 50.5% to 53.1% and LLaVA-OneVision-OV-72B from 56.5% to 62.4%.

本文通过重新审视时间搜索范式来解决长视频高效理解的挑战，引入了LV-Haystack这一新的时间搜索数据集，并提出了T*，一种将时间搜索重新定义为空间搜索的轻量级框架。实验表明，T*显著提升了现有方法在长视频理解上的性能：在32帧的推理预算下，T*在Longvideobench XL子集上将GPT-4o从50.5%提高到53.1%，将LLaVA-OneVision-OV-72B从56.5%提高到62.4%。
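
The size of the gains quoted in the abstract can be checked with a few lines of arithmetic. The sketch below separates the absolute gain in percentage points from the relative gain; the helper name `gain` is hypothetical, and the four scores are the ones reported in the abstract.

```python
def gain(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 1), round(relative, 1)


# Scores reported in the abstract (Longvideobench XL subset, 32-frame budget).
reported = {
    "GPT-4o": (50.5, 53.1),
    "LLaVA-OneVision-OV-72B": (56.5, 62.4),
}
for model, (before, after) in reported.items():
    points, relative = gain(before, after)
    print(f"{model}: +{points} points ({relative}% relative)")
```

The absolute gains come out to 2.6 and 5.9 percentage points respectively, which is why stating the before and after scores is less ambiguous than quoting a single improvement figure.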

Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models

Authors: Alex Zook, Josef Spjut, Jonathan Tremblay

First: 2025-07-16T22:45:40+00:00 · Latest: 2025-07-18T00:09:41+00:00

Comments: Published at Reinforcement Learning and Video Games workshop https://sites.google.com/view/rlvg-workshop-2025/home

Abs · Project1

Abstract

Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate results that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.

中文标题/摘要

标题：飞、失败、修复：基于强化学习和大型多模态模型的游戏迭代修复

游戏设计依赖于理解静态规则和内容如何转化为动态玩家行为，而现代仅检查游戏代码或资产的生成系统难以捕捉这一点。我们提出了一种自动化设计迭代框架，通过将强化学习（RL）代理与大型多模态模型（LMM）配对来弥补这一差距，其中RL代理进行游戏测试，LMM基于代理的行为修订游戏。在每个循环中，RL玩家完成多个回合，产生（i）数值游戏指标和/或（ii）概括最近视频帧的紧凑图像条。LMM设计师接收游戏目标和当前游戏配置，分析游戏轨迹，并编辑配置以引导未来行为朝向目标。我们的结果表明，LMM能够基于RL代理提供的行为轨迹进行推理，迭代地改进游戏机制，这为AI辅助游戏设计指向了实用、可扩展的工具。
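
The playtest-then-revise loop described in the abstract can be sketched as a small driver that takes the two roles as plain callables. Everything here is illustrative: `design_loop`, the toy `playtest` and `revise` functions, and the `gap` parameter are hypothetical stand-ins for the RL agent, the LMM designer, and a real game configuration.

```python
def design_loop(config, playtest, revise, goal_met, max_iters=10):
    """Alternate playtesting and revision until the goal is met.

    `playtest(config)` plays the game and returns metrics (the RL agent's
    role); `revise(config, metrics)` returns an edited config (the LMM
    designer's role). Both are supplied by the caller.
    """
    history = []
    for _ in range(max_iters):
        metrics = playtest(config)
        history.append((dict(config), metrics))
        if goal_met(metrics):
            break
        config = revise(config, metrics)
    return config, history


# Toy stand-ins: a wider gap makes the level easier to pass.
def playtest(config):
    return {"pass_rate": min(1.0, config["gap"] / 4)}


def revise(config, metrics):
    return {**config, "gap": config["gap"] + 1}


final_config, history = design_loop(
    {"gap": 1}, playtest, revise, lambda m: m["pass_rate"] >= 0.75
)
```

Keeping the two roles as injected callables means the same loop works whether the play traces are numerical metrics, image strips, or both.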

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Authors: Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

First: 2025-04-02T15:12:17+00:00 · Latest: 2025-05-22T00:41:26+00:00

Abs · Code1 · Code2

Abstract

Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the SpaceR framework. First, we present SpaceR-151k, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose Spatially-Guided RLVR (SG-RLVR), a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.

中文标题/摘要

标题：SpaceR：强化视频空间推理的MLLMs

视频空间推理涉及从观察到的视频帧中推断出潜在的空间结构，这给现有的多模态大型语言模型（MLLMs）带来了重大挑战。这一局限主要源于1）缺乏针对此任务的高质量数据集，以及2）缺乏有效的训练策略来发展空间推理能力。受验证奖励强化学习（RLVR）在解锁LLM推理能力方面的成功启发，本工作旨在通过RLVR范式改进MLLMs在视频空间推理方面的表现。为此，我们引入了SpaceR框架。首先，我们提出了包含91000个问题的SpaceR-151k数据集，这些问题覆盖了多种具有验证答案的空间推理场景，并有60000个样本用于保持多模态的一般理解。其次，我们提出了空间引导的RLVR（SG-RLVR），这是一种新颖的强化学习方法，它扩展了组相对策略优化（GRPO），并引入了一种新的地图想象机制，该机制鼓励模型在思考过程中推断空间布局，从而促进更有效的空间推理。广泛的实验表明，SpaceR在空间推理基准测试（如VSI-Bench、STI-Bench和SPAR-Bench）上达到了最先进的性能，同时在视频理解基准测试（如Video-MME、TempCompass和LongVideoBench）上保持了竞争力。值得注意的是，SpaceR在VSI-Bench上的准确率比先进的GPT-4o高出11.6%，并与领先的专有模型Gemini-2.0-Flash齐平，突显了我们的SpaceR-151k数据集和SG-RLVR在强化MLLMs的空间推理能力方面的有效性。代码、模型和数据集可在https://github.com/OuyangKun10/SpaceR/获得。

Summary / 总结

The research aims to enhance Multimodal Large Language Models (MLLMs) in video spatial reasoning by addressing the lack of high-quality datasets and effective training strategies. It introduces the SpaceR framework, which includes the SpaceR-151k dataset and the Spatially-Guided RLVR (SG-RLVR) method. The dataset contains 91k questions with verifiable answers and 60k samples for maintaining general multimodal understanding. SG-RLVR extends Group Relative Policy Optimization with a map imagination mechanism to encourage spatial layout inference. Experiments show that SpaceR outperforms existing models on spatial reasoning benchmarks and maintains competitive results on video understanding benchmarks, achieving 11.6% higher accuracy on VSI-Bench compared to GPT-4o.

研究旨在通过解决高质量数据集和有效训练策略的缺乏，提升多模态大型语言模型（MLLMs）在视频空间推理中的能力。引入了SpaceR框架，包括SpaceR-151k数据集和Spatially-Guided RLVR（SG-RLVR）方法。数据集包含91k个带有可验证答案的问题和60k个样本以保持多模态的一般理解能力。SG-RLVR通过扩展Group Relative Policy Optimization并引入地图想象机制来鼓励空间布局的推理。实验表明，SpaceR在空间推理基准测试中表现出色，并在视频理解基准测试中保持竞争力，相比GPT-4o在VSI-Bench上的准确率高出11.6%。
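
SG-RLVR builds on Group Relative Policy Optimization (GRPO), whose core step is scoring each sampled response relative to the other responses for the same prompt. The sketch below shows only that group-relative normalisation as it is commonly formulated; the map imagination mechanism is not covered, and the function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled responses.

    advantage_i = (reward_i - group mean) / (group std + eps).
    `eps` guards against a zero standard deviation when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Verifiable rewards for four sampled answers to one question (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

With verifiable answers the rewards are cheap to compute, and a group in which every response gets the same reward yields zero advantages, so it contributes no learning signal.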

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Authors: Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

First: 2025-10-21T17:59:36+00:00 · Latest: 2025-10-22T01:07:55+00:00

Abs · Code1

Abstract

Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.

中文标题/摘要

标题：DSI-Bench：动态空间智能基准

关于动态空间关系的推理至关重要，因为观察者和物体经常同时移动。尽管视觉语言模型（VLMs）和视觉专门模型在二维任务和静态场景中表现出色，但它们理解动态三维场景的能力仍然有限。我们引入了动态空间智能，并提出了DSI-Bench基准，该基准包含近1000个动态视频和超过1700个手动标注的问题，涵盖了观察者和物体九种解耦的运动模式。空间和时间对称的设计减少了偏差，使模型对自身运动和物体运动的推理进行系统的评估成为可能。我们对14个VLMs和专家模型的评估揭示了关键的局限性：模型经常混淆观察者和物体的运动，表现出语义偏差，并且在动态场景中无法准确推断相对关系。我们的DSI-Bench提供了关于未来开发具有动态空间智能的一般和专门模型的重要发现和见解。

Summary / 总结

The research aims to evaluate models' ability to understand dynamic spatial relationships, which are crucial when both observers and objects are moving. The study introduces DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Evaluating 14 VLMs and expert models, the study finds that models often confuse observer and object motion, show semantic biases, and struggle to infer accurate relative relationships in dynamic scenarios. DSI-Bench provides insights into the limitations of current models and highlights the need for improvements in dynamic spatial intelligence.

研究旨在评估模型在处理动态空间关系方面的能力，特别是在观察者和物体同时移动的场景中。研究引入了DSI-Bench，包含近1,000个动态视频和1,700个标注问题，关注九种解耦的运动模式。评估14个视觉语言模型和专家模型后，研究发现这些模型常常混淆观察者和物体的运动，表现出语义偏见，并且在动态场景中难以准确推断相对关系。DSI-Bench为提高模型的动态空间智能提供了有价值的见解。
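
The abstract does not say how the spatially and temporally symmetric designs are built. One plausible reading, sketched below under that assumption, is that each clip is paired with a mirrored and a time-reversed copy whose answers are flipped accordingly, so a model cannot score well by always preferring one direction. The sample layout, label maps, and function names are all hypothetical.

```python
# Hypothetical label maps: how a motion-direction answer changes under each transform.
MIRROR_LABEL = {"left": "right", "right": "left"}
REVERSE_LABEL = {"left": "right", "right": "left",
                 "forward": "backward", "backward": "forward"}


def mirror(sample):
    """Flip every frame horizontally; only left/right answers change."""
    frames = [[row[::-1] for row in frame] for frame in sample["frames"]]
    answer = MIRROR_LABEL.get(sample["answer"], sample["answer"])
    return {"frames": frames, "answer": answer}


def reverse_time(sample):
    """Play the clip backwards; every motion direction is inverted."""
    return {"frames": sample["frames"][::-1],
            "answer": REVERSE_LABEL[sample["answer"]]}


# Toy clip: a single bright pixel moving left to right across a 1x3 frame.
clip = {"frames": [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]], "answer": "right"}
mirrored = mirror(clip)
reversed_clip = reverse_time(clip)
```

In a balanced set built this way, each answer option appears equally often across a clip and its transformed copies.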

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Authors: Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang

First: 2024-12-16T18:46:45+00:00 · Latest: 2024-12-17T02:56:48+00:00

Comments: 14 pages, 9 figures

Abs · Code1 · Project1

Abstract

Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The few benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the correct answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. To compensate for the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods: clue-grounded white box and black box evaluations, to assess whether the model generates answers based on the correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at https://cg-bench.github.io/leaderboard/.

中文标题/摘要

标题：CG-Bench：面向长视频理解的线索导向型问答基准

大多数现有的视频理解基准主要针对多模态大型语言模型（MLLMs）中的短视频。对于长视频理解的有限基准往往仅依赖于多项选择题（MCQs）。然而，由于MCQ评估固有的局限性以及MLLMs推理能力的增强，模型可以通过结合短视频理解和排除来给出正确答案，而无需真正理解视频内容。为解决这一问题，我们引入了CG-Bench，这是一种针对长视频的新型线索导向型问答基准。CG-Bench强调模型检索问题相关线索的能力，增强评估的可信度。它包含1,219个手工策划的视频，按精细分类系统分为14个主要类别、171个次要类别和638个三级类别，使其成为最大的长视频分析基准。基准包括12,129个问答对，涵盖三大类问题：感知、推理和幻觉。为弥补纯MCQ评估的不足，我们设计了两种新的基于线索的评估方法：线索导向的白盒和黑盒评估，以评估模型是否基于对视频的正确理解生成答案。我们在CG-Bench上评估了多个闭源和开源MLLMs。结果表明，当前模型在理解长视频方面显著不如理解短视频，开源模型与商用模型之间存在显著差距。我们希望CG-Bench能够促进更可信和强大的MLLMs在长视频理解方面的开发。所有注释和视频数据可在https://cg-bench.github.io/leaderboard/上获取。

Summary / 总结

CG-Bench is a new benchmark for clue-grounded question answering in long videos. It addresses the limitations of existing benchmarks, which mostly cover short videos or, for long videos, rely solely on multiple-choice questions. It includes 1,219 manually curated videos and 12,129 QA pairs spanning perception, reasoning, and hallucination. Two novel clue-based evaluation methods (white box and black box) assess whether a model's answers rest on a correct understanding of the video. The study finds that current models perform markedly worse on long videos than on short ones, with a significant gap between open-source and commercial models.

CG-Bench 是一个针对长视频的线索导向型问答基准，旨在解决现有短视频基准的局限性。它包含1,219个手工精选的视频和12,129个问答对，涵盖感知、推理和幻觉三种问题类型。研究引入了两种新型的线索导向评估方法来评估模型的理解能力。研究结果表明，当前模型在长视频理解方面表现不佳，开源模型与商用模型之间存在显著差距。
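
The abstract does not define the white-box and black-box protocols. A generic way to check whether a model found the right clue is temporal intersection-over-union between the predicted and annotated clue intervals, sketched below. The 0.5 threshold and the rule that an answer only counts when the clue also matches are assumptions for illustration, not the benchmark's official scoring.

```python
def temporal_iou(pred, gold):
    """IoU of two (start_s, end_s) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def clue_grounded_correct(answer, gold_answer, pred_clue, gold_clue,
                          iou_threshold=0.5):
    """Credit an answer only if the cited clue overlaps the annotated clue."""
    return (answer == gold_answer
            and temporal_iou(pred_clue, gold_clue) >= iou_threshold)


# A correct option backed by the wrong part of the video earns no credit.
lucky_guess = clue_grounded_correct("C", "C", (300.0, 320.0), (40.0, 55.0))
grounded = clue_grounded_correct("C", "C", (41.0, 56.0), (40.0, 55.0))
```

Scoring this way separates answers reached by elimination from answers supported by the relevant segment.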

VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding

Authors: Henghao Zhao, Ge-Peng Ji, Rui Yan, Huan Xiong, Zechao Li

First: 2025-04-10T07:33:39+00:00 · Latest: 2025-04-11T00:30:52+00:00

Abs

Abstract

The core challenge in video understanding lies in perceiving dynamic content changes over time. However, multimodal large language models struggle with temporal-sensitive video tasks, which require generating timestamps to mark the occurrence of specific events. Existing strategies require MLLMs to generate absolute or relative timestamps directly. We have observed that those MLLMs tend to rely more on language patterns than visual cues when generating timestamps, affecting their performance. To address this problem, we propose VideoExpert, a general-purpose MLLM suitable for several temporal-sensitive video tasks. Inspired by the expert concept, VideoExpert integrates two parallel modules: the Temporal Expert and the Spatial Expert. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. It processes high-frame-rate yet compressed tokens to capture dynamic variations in videos and includes a lightweight prediction head for precise event localization. The Spatial Expert focuses on content detail analysis and instruction following. It handles specially designed spatial tokens and language input, aiming to generate content-related responses. These two experts collaborate seamlessly via a special token, ensuring coordinated temporal grounding and content generation. Notably, the Temporal and Spatial Experts maintain independent parameter sets. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions. Moreover, we introduce a Spatial Compress module to obtain spatial tokens. This module filters and compresses patch tokens while preserving key information, delivering compact yet detail-rich input for the Spatial Expert. Extensive experiments demonstrate the effectiveness and versatility of VideoExpert.

中文标题/摘要

标题：VideoExpert：用于时间敏感视频理解的增强LLM

视频理解的核心挑战在于感知随时间变化的动态内容。然而，多模态大型语言模型在处理时间敏感的视频任务时存在困难，这需要生成时间戳来标记特定事件的发生。现有策略要求MLLM直接生成绝对或相对时间戳。我们观察到，这些MLLM在生成时间戳时更依赖于语言模式而非视觉线索，影响了它们的性能。为了解决这一问题，我们提出了VideoExpert，这是一种适用于多种时间敏感视频任务的通用MLLM。受专家概念的启发，VideoExpert集成了两个并行模块：时间专家和空间专家。时间专家负责建模时间序列和执行时间定位。它处理高帧率但压缩的令牌以捕捉视频中的动态变化，并包含一个轻量级的预测头以实现精确的事件定位。空间专家专注于内容细节分析和指令遵循。它处理特别设计的空间令牌和语言输入，旨在生成内容相关的响应。这两个专家通过一个特殊令牌无缝协作，确保协调的时间定位和内容生成。值得注意的是，时间和空间专家保持独立的参数集。通过将时间定位从内容生成中卸载，VideoExpert防止了时间戳预测中的文本模式偏差。此外，我们引入了空间压缩模块以获取空间令牌。该模块过滤并压缩补丁令牌，同时保留关键信息，为空间专家提供紧凑但细节丰富的输入。广泛的实验表明，VideoExpert的有效性和多功能性。

Summary / 总结

The core challenge in video understanding is perceiving dynamic content changes over time, which is difficult for multimodal large language models. VideoExpert is proposed to address this issue by integrating a Temporal Expert and a Spatial Expert. The Temporal Expert models time sequences and performs temporal grounding, while the Spatial Expert focuses on content detail analysis and instruction following. These experts collaborate via a special token, ensuring coordinated temporal grounding and content generation. Experiments show that VideoExpert effectively handles temporal-sensitive video tasks and reduces text pattern biases in timestamp predictions.

研究旨在通过解决现有大规模多模态语言模型（MLLM）在生成准确时间戳方面的局限性，提高动态视频理解能力。提出了VideoExpert，这是一种通用的MLLM，集成了时间专家模块和空间专家模块。时间专家模块负责建模时间序列和进行时间定位，处理高帧率的压缩令牌以捕捉视频中的动态变化，并包含一个预测头以实现精确的事件定位；空间专家模块专注于内容细节分析和指令跟随，处理特别设计的空间令牌和语言输入，以生成内容相关的响应。实验表明，VideoExpert有效解决了文本模式偏差问题，并展示了在各种动态视频任务中的灵活性。

TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler

Authors: Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang

First: 2025-01-26T13:10:12+00:00 · Latest: 2025-06-11T00:47:24+00:00

Comments: code and training recipes are available at https://github.com/ZhangXJ199/TinyLLaVA-Video

Abs · Code1

Abstract

Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Though large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in effectively processing long visual sequences and temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks. In addition, TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs. It surpasses several existing 7B-parameter models on multiple benchmarks. We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.

中文标题/摘要

标题：TinyLLaVA-Video：面向视频理解的轻量级LMM及其组重采样器

视频行为识别和场景理解是多模态智能中的基本任务，是众多实际应用的关键构建块。尽管大型多模态模型（LMMs）在视频理解方面取得了显著进展，但大多数现有的开源模型依赖于超过7B参数，并且需要大规模数据集进行训练，这使得它们资源密集型且对许多研究人员来说难以获取。此外，轻量级模型在有效处理长视觉序列和时间理解方面仍面临持续挑战。在本工作中，我们介绍了TinyLLaVA-Video，这是一种具有约3.6B参数的轻量级但强大的视频理解模型。我们设计的核心是视频级别组重采样器，这是一种新颖的机制，可以显著减少和控制视频级别的视觉标记数量。与传统的图像级别重采样器不同，我们的方法有效缓解了冗余问题，同时增强了时间理解能力，从而在基于视频的任务上取得了更好的性能。此外，TinyLLaVA-Video展示了出色的效率，仅需在8块A100-40G GPU上训练一天即可。它在多个基准测试中超过了几个现有的7B参数模型。我们认为，这项工作为未来轻量级视频理解模型的研究提供了有价值的基石。代码和权重可在https://github.com/ZhangXJ199/TinyLLaVA-Video 获取。

Summary / 总结

This work introduces TinyLLaVA-Video, a lightweight video understanding model with about 3.6 billion parameters, designed to address the resource-intensive nature of large multimodal models. The key method involves a video-level group resampler that reduces visual token redundancy while enhancing temporal understanding. Experimental results show that TinyLLaVA-Video outperforms several 7B-parameter models on multiple benchmarks and requires only one day of training on 8 A100-40G GPUs, demonstrating its efficiency and effectiveness. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.

研究旨在开发一种更小但有效的视频理解模型，解决现有大型多模态模型资源密集的问题。TinyLLaVA-Video，参数量约为3.6B，通过视频级分组重采样器减少视觉标记冗余并增强时间理解。该模型在多个基准测试中表现出色，超越了多个7B参数模型，并且仅需在8块A100-40G GPU上训练一天。

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen

First: 2024-12-01T18:27:28+00:00 · Latest: 2024-12-03T02:00:10+00:00

Comments: Project Page: https://tiger-ai-lab.github.io/VISTA/

Abs · Code1 · Code2 · Project1

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.

中文标题/摘要

标题：VISTA：通过视频时空增强提高长时长和高分辨率视频理解

当前的大规模多模态模型（LMMs）在处理和理解长时长或高分辨率视频时面临重大挑战，主要原因是缺乏高质量的数据集。为了解决这一问题，我们从数据为中心的角度出发，提出了一种简单而有效的视频时空增强框架VISTA，该框架从现有的视频字幕数据集中合成长时长和高分辨率视频指令跟随对。VISTA通过时空结合视频，创建新的合成视频，这些视频具有延长的时长和增强的分辨率，并随后生成与这些新合成视频相关的问题-答案对。基于这一范式，我们开发了七种视频增强方法，并构建了VISTA-400K视频指令跟随数据集，旨在提高长时长和高分辨率视频的理解。在我们的数据上微调各种视频LMMs后，在四个具有挑战性的长视频理解基准上平均提高了3.3%。此外，我们引入了第一个全面的高分辨率视频理解基准HRVideoBench，在该基准上，我们微调的模型实现了6.5%的性能提升。这些结果突显了我们框架的有效性。

Summary / 总结

The paper proposes VISTA, a Video Spatiotemporal Augmentation framework to enhance the understanding of long-duration and high-resolution videos. It synthesizes new videos by spatially and temporally combining existing ones and generates question-answer pairs about them. Finetuning various video LMMs on the resulting VISTA-400K dataset yields an average improvement of 3.3% across four long-video understanding benchmarks and a 6.5% gain on the newly introduced HRVideoBench, demonstrating the effectiveness of the approach.

研究旨在通过解决高质量数据不足的问题，提高对长视频和高分辨率视频的理解。VISTA是一种视频时空增强框架，从现有数据集中合成新的视频以延长其时长和提高分辨率。该框架在四个基准测试中将长视频理解提高了平均3.3%，并在新引入的HRVideoBench基准测试中实现了6.5%的性能提升。
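
The two kinds of combination that the abstract describes, temporal (longer clips) and spatial (larger frames), can be sketched on toy data. The abstract does not list the seven augmentation methods, so the helpers below are generic illustrations with hypothetical names, not VISTA's actual pipeline. A clip is a list of frames and a frame is a list of pixel rows.

```python
def concat_in_time(clip_a, clip_b):
    """Temporal combination: play clip_b after clip_a to extend the duration."""
    return clip_a + clip_b


def tile_side_by_side(clip_a, clip_b):
    """Spatial combination: put frames next to each other to widen the frame.

    Assumes both clips share the same frame height; the longer clip is
    truncated to the length of the shorter one.
    """
    return [[row_a + row_b for row_a, row_b in zip(frame_a, frame_b)]
            for frame_a, frame_b in zip(clip_a, clip_b)]


def qa_for_concatenation(caption_a, caption_b):
    """Hypothetical QA pair derived from the source captions of both clips."""
    return {"question": "What happens in the second part of the video?",
            "answer": caption_b,
            "distractor": caption_a}


clip_a = [[[1, 1], [1, 1]] for _ in range(2)]  # 2 frames of 2x2 pixels
clip_b = [[[2, 2], [2, 2]] for _ in range(3)]  # 3 frames of 2x2 pixels
long_clip = concat_in_time(clip_a, clip_b)
wide_clip = tile_side_by_side(clip_a, clip_b)
qa = qa_for_concatenation("a dog runs on a beach", "a chef slices onions")
```

Because the source captions are known, questions about the synthesized clip can be generated without new manual annotation.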

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

Venue: NeurIPS 2025

First: 2025-05-04T10:55:21+00:00 · Latest: 2025-10-27T00:13:19+00:00

Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track

Abs · Code1

Abstract

Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.

中文标题/摘要

标题：RTV-Bench：通过实时视频评估MLLM连续感知、理解和推理

多模态大型语言模型（MLLMs）在感知、理解和推理方面表现出色。然而，当前的基准测试未能充分评估它们在动态、真实环境中的连续执行这些任务的能力。为弥补这一差距，我们引入了RTV-Bench，这是一种针对MLLM实时视频分析的精细基准测试。RTV-Bench采用三个关键原则：（1）多时间戳问答（MTQA），答案随场景变化而演变；（2）层次化问题结构，结合基本和高级查询；（3）多维度评估，评估连续感知、理解和推理的能力。RTV-Bench包含552个多样化的视频（167.2小时）和4,631个高质量的问答对。我们评估了包括专有（GPT-4o、Gemini 2.0）、开源离线（Qwen2.5-VL、VideoLLaMA3）和开源实时（VITA-1.5、InternLM-XComposer2.5-OmniLive）模型在内的领先MLLM。实验结果表明，开源实时模型在很大程度上优于离线模型，但仍然落后于顶级专有模型。我们的分析还表明，更大的模型规模或更高的帧采样率并不显著提升RTV-Bench性能，有时甚至会导致轻微下降。这强调了需要更好的模型架构，优化视频流处理和长序列，以促进MLLM实时视频分析的进步。我们的基准测试工具包可在以下链接获取：https://github.com/LJungang/RTV-Bench。

Summary / 总结

RTV-Bench is a benchmark for evaluating MLLMs in real-time video analysis, focusing on continuous perception, understanding, and reasoning. It uses three principles: Multi-Timestamp Question Answering, Hierarchical Question Structure, and Multi-dimensional Evaluation. The benchmark includes 552 diverse videos and 4,631 QA pairs. Open-source real-time models outperformed offline models but lagged behind proprietary models. Larger models or higher frame sampling rates did not significantly improve performance, highlighting the need for better model architectures for real-time video analysis.

RTV-Bench 是一个用于评估 MLLMs 在实时视频分析中的基准，重点关注连续感知、理解和推理。它采用了三个原则：多时间戳问答、层次化问题结构和多维度评估。基准包括 552 个视频和 4,631 对 QA。开源实时模型优于离线模型，但落后于专有模型。更大的模型或更高的帧采样率并不能显著提高性能，强调了需要更好的模型架构来实现实时视频分析。
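
Multi-Timestamp Question Answering can be pictured as one question paired with a timeline of answers, where a model is queried at each timestamp with only the video seen so far. The data layout and scoring rule below are assumptions for illustration; the official format is defined by the RTV-Bench toolkit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TimedAnswer:
    timestamp_s: float  # moment in the stream at which the question is asked
    answer: str         # ground-truth answer valid at that moment


@dataclass
class MultiTimestampQuestion:
    question: str
    timeline: List[TimedAnswer]


def mtqa_accuracy(item: MultiTimestampQuestion,
                  predict: Callable[[str, float], str]) -> float:
    """Fraction of timestamps at which the prediction matches the evolving answer."""
    hits = sum(predict(item.question, t.timestamp_s) == t.answer
               for t in item.timeline)
    return hits / len(item.timeline)


item = MultiTimestampQuestion(
    question="How many people are in the room?",
    timeline=[TimedAnswer(10.0, "1"), TimedAnswer(60.0, "2"),
              TimedAnswer(120.0, "0")],
)
# A model that ignores scene changes and always answers "2" is right once.
static_score = mtqa_accuracy(item, lambda question, t: "2")
```

A model that answers once and never updates is penalized under this scheme, which is the behavior that continuous perception is meant to test.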

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Authors: Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao

First: 2025-03-22T11:30:46+00:00 · Latest: 2025-03-25T00:29:12+00:00

Abs · Code1

Abstract

Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.

中文标题/摘要

标题：V2P-Bench：使用视觉提示评估视频-语言理解以改善人-模型交互

大型视觉-语言模型（LVLMs）在视频理解领域取得了显著进展。然而，当前的基准测试统一依赖于文本提示进行评估，这通常需要复杂的参照语言，并且无法提供精确的空间和时间参考。这一限制降低了人-模型交互的体验和效率。为了解决这一限制，我们提出了视频视觉提示基准测试（V2P-Bench），这是一个专门设计用于评估LVLMs在多模态人-模型交互场景中视频理解能力的综合基准测试。V2P-Bench 包含980个独特的视频和1,172个问答对，涵盖了5个主要任务和12个维度，促进了与人类认知相一致的实例级细粒度理解。基准测试结果表明，即使是最强大的模型在V2P-Bench上的表现也很差（GPT-4o为65.4%，Gemini-1.5-Pro为67.9%），远低于人类专家的88.3%，突显了LVLMs在理解视频视觉提示方面的当前不足。我们希望V2P-Bench能够成为推动多模态人-模型交互和视频理解评估的基础。项目页面：https://github.com/gaotiexinqu/V2P-Bench.

Summary / 总结

The research aims to improve the evaluation of video-language understanding by proposing V2P-Bench, which uses visual prompts to better align with human cognition. The benchmark includes 980 videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions. Experimental results show that even top models like GPT-4o and Gemini-1.5-Pro achieve only 65.4% and 67.9% accuracy, respectively, compared to human experts' 88.3%, indicating significant room for improvement in video understanding capabilities.

研究旨在通过引入V2P-Bench，使用视觉提示来评估视频语言模型（LVLMs）的视频理解能力。基准包括980个独特的视频和1,172个问答对，涵盖5个主要任务和12个维度。实验结果显示，即使是最先进的模型，如GPT-4o和Gemini-1.5-Pro，也只能分别达到65.4%和67.9%的准确率，而人类专家的准确率为88.3%。这突显了LVLMs在理解视频视觉提示方面的当前局限性。

Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions

Authors: Chenyu Yi, Siyuan Yang, Haoliang Li, Yap-peng Tan, Alex Kot

Venue: NeurIPS 2021

First: 2021-10-13T05:59:39+00:00 · Latest: 2022-08-23T00:29:10+00:00

Comments: Accepted to NeurIPS 2021 Datasets and Benchmarks Track. Our code is available at https://github.com/Newbeeyoung/Video-Corruption-Robustness

Abs · Code1

Abstract

State-of-the-art deep neural networks are vulnerable to common corruptions (e.g., input data degradations, distortions, and disturbances caused by weather changes, system error, and processing). While much progress has been made in analyzing and improving the robustness of models in image understanding, the robustness in video understanding is largely unexplored. In this paper, we establish a corruption robustness benchmark, Mini Kinetics-C and Mini SSV2-C, which considers temporal corruptions beyond spatial corruptions in images. We make the first attempt to conduct an exhaustive study on the corruption robustness of established CNN-based and Transformer-based spatial-temporal models. The study provides some guidance on robust model design and training: Transformer-based models perform better than CNN-based models on corruption robustness; the generalization ability of spatial-temporal models implies robustness against temporal corruptions; model corruption robustness (especially robustness in the temporal domain) enhances with computational cost and model capacity, which may contradict the current trend of improving the computational efficiency of models. Moreover, we find the robustness intervention for image-related tasks (e.g., training models with noise) may not work for spatial-temporal models.

中文标题/摘要

标题：空间-时间模型对抗破坏的鲁棒性基准测试

最先进的深度神经网络对常见的破坏（例如输入数据退化、失真和由天气变化、系统错误和处理引起的干扰）非常脆弱。尽管在图像理解模型的鲁棒性分析和改进方面取得了很大进展，但在视频理解中的鲁棒性研究却很少。在本文中，我们建立了考虑时间破坏的鲁棒性基准，Mini Kinetics-C和Mini SSV2-C，该基准不仅考虑了图像的空间破坏。我们首次对已建立的基于CNN的空间-时间模型和基于Transformer的空间-时间模型的破坏鲁棒性进行了全面研究。研究为鲁棒模型的设计和训练提供了一些指导：基于Transformer的模型在破坏鲁棒性方面优于基于CNN的模型；空间-时间模型的一般化能力意味着对时间破坏的鲁棒性；空间-时间模型的破坏鲁棒性（尤其是时间域的鲁棒性）随着计算成本和模型容量的增加而增强，这可能与当前提高模型计算效率的趋势相矛盾。此外，我们发现针对图像相关任务的鲁棒性干预（例如用噪声训练模型）可能对空间-时间模型无效。

Summary / 总结

This paper aims to evaluate the robustness of spatial-temporal models against corruptions, particularly in video understanding, which has been less studied compared to image understanding. The authors introduce two benchmarks, Mini Kinetics-C and Mini SSV2-C, to assess the models' performance under various corruptions. Key findings include that Transformer-based models outperform CNN-based models in corruption robustness, and that robustness against temporal corruptions increases with model capacity and computational cost. Additionally, the study reveals that methods effective for improving robustness in image tasks may not be applicable to spatial-temporal models.

本文旨在评估空间-时间模型在对抗干扰下的鲁棒性，特别是在视频理解领域，这比图像理解领域研究较少。作者引入了两个基准，Mini Kinetics-C和Mini SSV2-C，以研究基于CNN和Transformer的空间-时间模型的鲁棒性。主要发现包括Transformer模型在鲁棒性方面优于CNN模型，鲁棒性（尤其是时间域的鲁棒性）随着计算成本和模型容量的增加而提高，这与提高模型效率的趋势相矛盾。此外，适用于图像任务的鲁棒性干预措施可能不适用于空间-时间模型。
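
The abstract does not name its robustness metrics. Corruption benchmarks are commonly summarized by averaging accuracy over corruption types and severity levels and then relating that average to clean accuracy; the sketch below follows that convention. The metric names, corruption names, and all numbers are illustrative assumptions, not results from the paper.

```python
def mean_performance_under_corruption(acc_by_corruption):
    """Average accuracy over every corruption type and severity level."""
    scores = [acc for per_severity in acc_by_corruption.values()
              for acc in per_severity]
    return sum(scores) / len(scores)


def relative_performance(clean_acc, acc_by_corruption):
    """Share of clean accuracy that survives corruption (1.0 means no drop)."""
    return mean_performance_under_corruption(acc_by_corruption) / clean_acc


# Made-up accuracies for two severities of one spatial and one temporal corruption.
example_acc = {
    "shot_noise": [0.6, 0.5],  # spatial corruption applied to each frame
    "frame_drop": [0.7, 0.6],  # temporal corruption applied across frames
}
mpc = mean_performance_under_corruption(example_acc)
rpc = relative_performance(0.8, example_acc)
```

Reporting the relative figure alongside the mean keeps a model with high clean accuracy from looking robust merely because it started from a higher baseline.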