arXiv Paper Digest

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Authors: Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

Venue: NeurIPS 2025

First: 2025-11-13T18:59:57+00:00 · Latest: 2025-11-13T18:59:57+00:00

Comments: Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)

Abstract

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.

Summary

The paper targets unfaithful trajectories in outcome-reward reinforcement learning (RL) for multimodal large language models (MLLMs): in multiple-choice settings, a trajectory that guesses the correct option after a faulty chain of thought receives the same reward as genuine reasoning. The proposed Self-Consistency Sampling (SCS) introduces small visual perturbations and repeatedly truncates and resamples an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. With Qwen2.5-VL-7B-Instruct as the base model, adding SCS to RLOO, GRPO, and the REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on Qwen2.5-VL-3B-Instruct and InternVL3-8B.
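To make the down-weighting idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the consistency score is simply the fraction of resampled continuations whose final answer matches the original trajectory's answer (the paper describes a differentiable score, which this simplification is not), and it pairs the scaled rewards with a leave-one-out baseline in the style of RLOO. The function names `consistency_score`, `scale_rewards`, and `leave_one_out_advantages` are hypothetical.

```python
from typing import List, Sequence


def consistency_score(original_answer: str, resampled_answers: Sequence[str]) -> float:
    """Assumed score: share of resampled continuations that reach the same answer."""
    if not resampled_answers:
        raise ValueError("need at least one resampled answer")
    matches = sum(1 for a in resampled_answers if a == original_answer)
    return matches / len(resampled_answers)


def scale_rewards(outcome_rewards: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Down-weight each outcome reward by its trajectory's consistency score."""
    if len(outcome_rewards) != len(scores):
        raise ValueError("rewards and scores must have the same length")
    return [r * s for r, s in zip(outcome_rewards, scores)]


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out baseline: each reward minus the mean of the other rewards."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two trajectories")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Three sampled trajectories for one multiple-choice question (correct option: "B").
# The first two pick "B" and earn outcome reward 1.0; the third is wrong.
scores = [
    consistency_score("B", ["B", "B", "B", "B"]),  # stable reasoning
    consistency_score("B", ["B", "A", "C", "B"]),  # answer flips under resampling
    consistency_score("D", ["D", "D", "D", "D"]),  # consistently wrong
]
weighted = scale_rewards([1.0, 1.0, 0.0], scores)  # [1.0, 0.5, 0.0]
advantages = leave_one_out_advantages(weighted)    # [0.75, 0.0, -0.75]
```

Under these assumptions, the lucky-guess trajectory ends up with a neutral advantage instead of the same positive signal as the stable one, which is the effect the abstract attributes to SCS.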

Depth Anything 3: Recovering the Visual Space from Any Views

Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang

First: 2025-11-13T18:59:53+00:00 · Latest: 2025-11-13T18:59:53+00:00

Comments: https://depth-anything-3.github.io/

Abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

Summary

Depth Anything 3 (DA3) predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. It uses a single plain transformer (e.g., a vanilla DINO encoder) as its backbone and a single depth-ray prediction target, avoiding complex multi-task learning. Trained with a teacher-student paradigm, DA3 matches Depth Anything 2 (DA2) in detail and generalization and outperforms it in monocular depth estimation. On a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering, DA3 sets a new state of the art across all tasks, surpassing VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. All models are trained exclusively on public academic datasets.
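As a rough illustration of why a depth-plus-ray output is enough to recover geometry, the sketch below back-projects per-pixel predictions into 3D points. It assumes each pixel comes with a ray origin, a ray direction, and a depth measured in units of that direction vector, so that point = origin + depth × direction; the abstract does not specify DA3's exact parameterization, and the function name `backproject` is hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(
    depths: Sequence[float],
    origins: Sequence[Vec3],
    directions: Sequence[Vec3],
) -> List[Vec3]:
    """Turn per-pixel (depth, ray) pairs into 3D points: origin + depth * direction."""
    if not (len(depths) == len(origins) == len(directions)):
        raise ValueError("depths, origins, and directions must have the same length")
    points: List[Vec3] = []
    for d, o, v in zip(depths, origins, directions):
        points.append((o[0] + d * v[0], o[1] + d * v[1], o[2] + d * v[2]))
    return points


# Two pixels seen from two different camera centers (illustrative values only).
points = backproject(
    depths=[2.0, 2.0],
    origins=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    directions=[(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)],
)
# points == [(0.0, 0.0, 2.0), (2.0, 0.0, 2.0)]
```

Because the rays carry the camera information, points from different views land in one shared coordinate frame without a separate pose input, which is one way to read the "any views" claim.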

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Authors: Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

First: 2025-10-14T17:57:04+00:00 · Latest: 2025-11-13T18:56:10+00:00