arXiv Paper Digest

Snapshot: 20260324_0402

MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Authors: Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang, Xinyan Chen, Ruichuan An, Ruofan Xing, Jiayi Zhang, Haojie Huang, Pheng-Ann Heng, Jonathan Tremblay, Lawson L. S. Wong

First: 2026-03-20T17:59:56+00:00 · Latest: 2026-03-20T17:59:56+00:00

Abstract

Video generative models show emerging reasoning behaviors. For reliable deployment, it is essential that generated events remain causally consistent across frames, a property we define as reasoning coherence. To address the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark for assessing reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing the necessary intermediate reasoning steps at the process level, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/

Summary

This work evaluates reasoning coherence in video generative models, a property critical for their reliable deployment. It proposes MME-CoF-Pro, a comprehensive benchmark of 303 samples across 16 categories that assesses reasoning coherence with a Reasoning Score under three evaluation settings: no hint, text hint, and visual hint. The study finds that video generative models exhibit weak reasoning coherence. Text hints boost apparent correctness but often cause inconsistency, while visual hints help with structured perceptual tasks but struggle with fine-grained perception.

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Authors: Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen

Venue: CVPR 2026

First: 2026-03-20T17:59:54+00:00 · Latest: 2026-03-20T17:59:54+00:00

Comments: Code and data at: https://github.com/VILA-Lab/PIXAR (Accepted in CVPR 2026 Findings, but not opted in)

Abstract

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors, reveal substantial over- and under-scoring when using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language descriptions, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
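
The mismatch between object masks and the true edit signal described above can be illustrated with a toy example. The sketch below is not the benchmark's construction, which the abstract does not specify. It assumes grayscale images stored as nested lists, and the helper names `tamper_map` and `mask_disagreement` are hypothetical.

```python
def tamper_map(original, edited):
    """Per-pixel absolute intensity change between two same-sized grayscale images."""
    return [[abs(o - e) for o, e in zip(row_o, row_e)]
            for row_o, row_e in zip(original, edited)]


def mask_disagreement(mask, change, threshold=0):
    """Count pixels where a binary object mask and the per-pixel change disagree.

    `threshold` is an assumed cutoff: a pixel counts as edited if its change exceeds it.
    """
    untouched_in_mask = 0  # inside the mask, but not actually changed
    edited_off_mask = 0    # actually changed, but outside the mask
    for mask_row, change_row in zip(mask, change):
        for m, c in zip(mask_row, change_row):
            edited = c > threshold
            if m and not edited:
                untouched_in_mask += 1
            elif edited and not m:
                edited_off_mask += 1
    return untouched_in_mask, edited_off_mask


# Toy 3x3 images: one edit inside the mask, one subtle edit outside it.
original = [[10, 10, 10],
            [10, 10, 10],
            [10, 10, 10]]
edited = [[10, 10, 40],
          [10, 90, 10],
          [10, 10, 10]]
mask = [[0, 0, 0],
        [1, 1, 0],
        [1, 1, 0]]

change = tamper_map(original, edited)
print(mask_disagreement(mask, change))  # (3, 1)
```

In this toy case three of the four masked pixels are untouched and one real edit lies outside the mask, which is the kind of over- and under-scoring a mask-only metric cannot see.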

Summary

The research addresses the limitations of existing tampering detection benchmarks that rely on object masks, which often misalign with the true edit signal. It introduces a new taxonomy of edit primitives and their semantic classes, together with a benchmark that provides per-pixel tamper maps and paired category supervision. The authors also develop a training framework and evaluation metrics that assess both pixel-level correctness and semantic understanding of tampered regions. The results show that mask-only metrics substantially over- and under-score existing strong segmentation/localization baselines, and expose failure modes on micro-edits and off-mask changes. By moving from masks to pixels, meanings, and language descriptions, the work aims to establish a rigorous standard for tamper localization, semantic classification, and description.

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Authors: Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu

Venue: ICLR 2026

First: 2026-03-20T17:59:46+00:00 · Latest: 2026-03-20T17:59:46+00:00

Comments: ICLR 2026 Camera Ready Version. Code and Models: https://jiazheng-xing.github.io/lumosx-home/

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
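
One simple way to picture "intra-group cohesion" in attention is to let each token attend only to tokens of the same subject group. The sketch below uses a hard group mask as a stand-in; this is a swapped-in simplification, not LumosX's Relational Self-Attention or Relational Cross-Attention, whose details the abstract does not give. The names `group_mask` and `masked_softmax` and the group ids are hypothetical.

```python
import math


def group_mask(group_ids):
    """allowed[i][j] is True when tokens i and j belong to the same subject group."""
    n = len(group_ids)
    return [[group_ids[i] == group_ids[j] for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Row-wise softmax over attention scores, with disallowed positions set to zero weight."""
    weights = []
    for row, allow in zip(scores, allowed):
        exps = [math.exp(s) if a else 0.0 for s, a in zip(row, allow)]
        total = sum(exps)  # never zero here: every token is in its own group
        weights.append([e / total for e in exps])
    return weights


# Four tokens: a face and its attribute for subject 0, then the same for subject 1.
group_ids = [0, 0, 1, 1]
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw scores, for illustration only

weights = masked_softmax(scores, group_mask(group_ids))
print(weights[0])  # [0.5, 0.5, 0.0, 0.0]
```

With the mask in place, subject 0's tokens place no weight on subject 1's tokens, so attributes cannot leak across subjects. A learned, softer bias would be an equally plausible reading of the abstract.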

Summary

LumosX addresses the challenge of precise face-attribute alignment in personalized video generation with a framework that covers both data and model design. On the data side, a tailored collection pipeline integrates captions and visual cues from independent videos, and multimodal large language models infer subject-specific dependencies. On the modeling side, Relational Self-Attention and Relational Cross-Attention strengthen intra-group consistency and increase the separation between subject clusters. Evaluations show that LumosX outperforms existing methods in generating fine-grained, identity-consistent, and semantically aligned personalized multi-subject videos.

CoVR-R: Reason-Aware Composed Video Retrieval

Authors: Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

Venue: CVPR 2026

First: 2026-03-20T17:59:25+00:00 · Latest: 2026-03-20T17:59:25+00:00

Comments: CVPR 2026 (findings)

Abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
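
The abstract reports results in terms of recall at K. For readers unfamiliar with the metric, the sketch below shows the standard single-target definition: the fraction of queries whose target video appears among the top K retrieved candidates. The function name `recall_at_k` and the video ids are made up for illustration, and nothing here reflects the paper's evaluation code.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose single target appears in the top-k ranked candidates."""
    hits = sum(1 for ranking, target in zip(ranked_ids, target_ids)
               if target in ranking[:k])
    return hits / len(target_ids)


# Three queries, each with a ranked candidate list (best first) and one target video.
rankings = [["v3", "v1", "v7"],
            ["v2", "v9", "v4"],
            ["v5", "v6", "v8"]]
targets = ["v1", "v4", "v0"]

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, targets, k))
```

Here no target is ranked first, one of three is found within the top 2, and two of three within the top 3. The third target is never retrieved.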

Summary

CoVR-R introduces a reasoning-first approach to Composed Video Retrieval (CoVR) that leverages large multimodal models to infer the causal and temporal consequences of an edit and to align the reasoned queries to candidate videos without task-specific fine-tuning. The method outperforms strong retrieval baselines on recall at K, especially on implicit-effect subsets, and yields higher step consistency and effect factuality in the retrieved results. The approach reduces dependence on task-specific supervision, improves generalization, and enhances interpretability, pointing toward a scalable framework for explainable video search. The model, code, and benchmark are publicly available.

MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms

Authors: Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson

First: 2026-03-20T17:59:18+00:00 · Latest: 2026-03-20T17:59:18+00:00