arXiv 论文速递

Glyph: Scaling Context Windows via Visual-Text Compression

Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

First: 2025-10-20T17:58:56+00:00 · Latest: 2025-10-20T17:58:56+00:00

Abstract

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.

中文标题/摘要

标题：Glyph：通过视觉-文本压缩扩展上下文窗口

大型语言模型（LLMs）越来越多地依赖于长上下文建模，用于文档理解、代码分析和多步推理等任务。然而，将上下文窗口扩展到百万词级别带来了巨大的计算和内存成本，限制了长上下文LLMs的实际应用。在本工作中，我们从视觉上下文扩展的角度出发，应对这一挑战。我们不扩展基于词元的序列，而是提出了一种名为Glyph的框架，将长文本渲染为图像，并使用视觉-语言模型（VLMs）处理这些图像。这种方法在大幅压缩文本输入的同时保留了语义信息，并进一步设计了一种由LLM驱动的遗传搜索，以平衡准确性和压缩率。通过广泛的实验，我们证明了我们的方法在各种长上下文基准测试中，实现了3-4倍的词元压缩，同时保持与领先LLM（如Qwen3-8B）相当的准确性。这种压缩还导致填充和解码速度提高了约4倍，以及约2倍的SFT训练速度。此外，在极端压缩下，一个128K上下文的VLM能够扩展处理百万词级别的文本任务。此外，渲染的文本数据还为现实世界的多模态任务，如文档理解，带来了益处。我们的代码和模型已发布在https://github.com/thu-coai/Glyph。

Summary / 总结

The research addresses the cost of scaling LLM context windows to long documents by proposing Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). The method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. It also speeds up prefilling and decoding by around 4x and SFT training by around 2x, and under extreme compression a 128K-context VLM could handle 1M-token-level text tasks. The rendered text data also benefits real-world multimodal tasks such as document understanding.

该研究提出了一种名为Glyph的方法，通过将长文本转换为图像并使用视觉语言模型（VLMs）处理，来解决大规模语言模型（LLMs）在处理长文档时扩展上下文窗口的挑战。该方法实现了3-4倍的令牌压缩，同时保持与领先LLM相当的准确性，并且还能提高预填充、解码和SFT训练的速度。在极端压缩下，一个128K上下文的VLM可以处理1M令牌级别的文本任务，而渲染后的文本数据也有助于多模态任务，如文档理解。
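The compression figures above can be checked with simple arithmetic. The sketch below is only an illustration: it assumes "128K" means 128,000 visual tokens and that a compression ratio means text tokens represented per visual token; both function names are hypothetical and not part of the Glyph codebase.

```python
def effective_text_tokens(visual_context_tokens: int, compression_ratio: float) -> int:
    """Text tokens that fit when each visual token stands in for
    `compression_ratio` text tokens on average (an assumed definition)."""
    return int(visual_context_tokens * compression_ratio)


def required_ratio(text_tokens: int, visual_context_tokens: int) -> float:
    """Compression ratio needed to fit `text_tokens` into the visual context."""
    return text_tokens / visual_context_tokens


context = 128_000  # assumed size of a "128K" context window

# Reported 3-4x compression range.
for ratio in (3.0, 4.0):
    print(f"{ratio}x -> {effective_text_tokens(context, ratio):,} text tokens")

# Ratio needed for a 1M-token task, i.e. the "extreme compression" regime.
print(f"1M tokens needs about {required_ratio(1_000_000, context):.2f}x")
```

Under these assumptions, the reported 3-4x range covers roughly 384K to 512K text tokens, so reaching the 1M-token level needs a ratio close to 8x.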

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

First: 2025-10-20T17:52:06+00:00 · Latest: 2025-10-20T17:52:06+00:00

Comments: 29 pages, 9 tables, 6 figures

Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

中文标题/摘要

标题：基础自动评估器：扩展多任务生成评估器训练以适应以推理为中心的领域

针对训练和测试期间不断增加的可扩展评估需求，微调专门的生成评估器已成为一个流行的范式。然而，近期的研究主要集中在使用强化学习（RL）等新方法训练评估器上，而避免了大规模、数据驱动的发展。在本研究中，我们专注于数据扩展，收集了涵盖五个独特评估任务（成对、步骤级、无参考和有参考验证，以及单一评分）和多个以推理评估为中心的领域的250万样本。利用这些数据，我们训练了基础自动推理评估器（FARE），这是一个包含80亿和200亿参数（其中36亿活跃参数）的评估器家族，使用简单的迭代拒绝采样监督微调（SFT）方法。FARE-8B挑战了更大的专门强化学习训练的评估器，而FARE-20B则成为开源评估器的新标准，超越了专门的700亿+评估器。除了静态基准，我们还在实际任务中评估了FARE：作为推理时间的重排序器，FARE-20B在MATH上达到了接近完美的性能。作为强化学习训练中的验证器，FARE提高了下游强化学习训练模型的性能，最高可达14.1%，优于字符串匹配验证器。从FARE初始化的持续微调FARE-Code在评估测试案例质量方面比gpt-oss-20B高出65%。

Summary / 总结

This paper addresses the need for scalable evaluation in reasoning-centric domains by scaling the training data for generative evaluators. The authors curate 2.5M samples across five evaluation tasks and multiple domains. Using a simple iterative rejection-sampling supervised finetuning approach, they train Foundational Automatic Reasoning Evaluators (FARE) at 8B and 20B (3.6B active) parameters. FARE-20B surpasses specialized 70B+ evaluators and sets a new standard for open-source evaluators. Used as an inference-time reranker, FARE-20B achieves near-oracle performance on MATH; used as a verifier in RL training, FARE improves downstream model performance by up to 14.1% over string-matching verifiers.

本文旨在通过扩大生成评估器的训练规模，解决推理导向领域中的可扩展评估需求。作者收集了250万样本，涵盖五个评估任务和多个领域。使用简单的迭代拒绝采样监督微调方法，他们训练了具有80亿和200亿参数的Foundational Automatic Reasoning Evaluators (FARE)。FARE-20B在性能上超越了专门训练的700亿+评估器，并成为开源评估器的新标准。在实际任务中，FARE-20B在MATH推理中的表现接近完美，并将强化学习训练的模型性能提高了14.1%，相较于字符串匹配验证器。
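The inference-time reranking use case can be sketched as best-of-N selection. This is a minimal illustration, not the FARE interface: `rerank_best_of_n` and `toy_score` are hypothetical names, and the toy scorer stands in for a real evaluator model.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate the evaluator scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(question, c))


def toy_score(question: str, candidate: str) -> float:
    # Hypothetical stand-in for an evaluator: rewards answers ending in "4".
    return 1.0 if candidate.strip().endswith("4") else 0.0


best = rerank_best_of_n(
    "What is 2 + 2?",
    ["The answer is 5", "The answer is 4", "It is 3"],
    toy_score,
)
print(best)
```

"Near-oracle" in this setting means the evaluator picks a correct candidate almost as often as a selector with access to the ground truth would.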

SoftMimic: Learning Compliant Whole-body Control from Examples

Authors: Gabriel B. Margolis, Michelle Wang, Nolan Fey, Pulkit Agrawal

First: 2025-10-20T17:49:27+00:00 · Latest: 2025-10-20T17:49:27+00:00

Comments: Website: https://gmargo11.github.io/softmimic/

Abstract

We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.

中文标题/摘要

标题：SoftMimic：从示例中学习柔顺全身控制策略

我们介绍了SoftMimic，一种用于从示例动作中学习类人机器人柔顺全身控制策略的框架。使用强化学习模仿人类动作可以让类人机器人快速学习新技能，但现有方法倾向于激励刚性控制，积极纠正参考动作的偏差，导致机器人在遇到意外接触时表现出脆弱且不安全的行为。相比之下，SoftMimic使机器人能够对外部力作出柔顺响应，同时保持平衡和姿势。我们的方法利用逆运动学求解器生成一个可行的柔顺动作增强数据集，我们使用该数据集来训练强化学习策略。通过奖励策略匹配柔顺响应而不是严格跟踪参考动作，SoftMimic能够吸收干扰并从单个动作片段中泛化到各种任务。我们通过仿真和实际实验验证了该方法，展示了其与环境安全有效的互动能力。

Summary / 总结

SoftMimic is a framework for learning compliant whole-body control policies for humanoid robots from example motions. It addresses the issue of stiff control in existing methods by enabling robots to respond compliantly to external forces while maintaining balance. The approach uses an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which is then used to train a reinforcement learning policy. The policy is rewarded for matching compliant responses rather than rigidly tracking the reference motion, allowing the robot to absorb disturbances and generalize to various tasks from a single motion clip. Experiments in simulations and real-world settings validate the safety and effectiveness of the method in interacting with the environment.

SoftMimic 是一个框架，通过示例动作学习人形机器人的全身柔顺控制策略。它通过激励对外部力的柔顺响应来解决现有方法中的僵硬控制问题，从而提高机器人的安全性和适应性。该方法使用逆运动学求解器生成可行的柔顺动作数据集，并训练强化学习策略以匹配这些响应，使机器人能够安全地处理意外接触并从单个动作片段中泛化到各种任务。
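The difference between rewarding rigid tracking and rewarding a compliant response can be illustrated in one dimension. The abstract does not give the actual reward, so everything here is an assumption: the spring-like displacement `force / stiffness`, the Gaussian-shaped reward, and all names and constants.

```python
import math


def tracking_reward(position: float, target: float, sigma: float = 0.05) -> float:
    """Gaussian-shaped reward that peaks when position equals target."""
    return math.exp(-((position - target) ** 2) / sigma**2)


def compliant_target(reference: float, force: float, stiffness: float) -> float:
    """Assumed compliant response: the reference shifted like a spring under load."""
    return reference + force / stiffness


reference = 0.50   # reference position from the motion clip (m)
force = 10.0       # external push (N)
stiffness = 200.0  # desired stiffness (N/m)
position = 0.55    # where the robot actually is after yielding to the push

# Rigid tracking penalizes yielding; the compliant target rewards it.
stiff_reward = tracking_reward(position, reference)
soft_reward = tracking_reward(position, compliant_target(reference, force, stiffness))
print(f"stiff: {stiff_reward:.3f}, compliant: {soft_reward:.3f}")
```

In this toy setup, a policy trained on the stiff reward is pushed to fight the disturbance, while the compliant reward is maximized by yielding the expected amount.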

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

First: 2025-10-20T17:48:26+00:00 · Latest: 2025-10-20T17:48:26+00:00

Abstract

Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

中文标题/摘要

标题：UltraCUA：具有混合动作的基础模型计算机使用代理

依赖于点击、输入和滚动等原始动作的多模态计算机使用代理需要精确的视觉定位和漫长的执行链，导致级联失败和性能瓶颈。虽然其他代理利用了丰富的程序接口（API、MCP服务器、工具），但计算机使用代理（CUAs）仍与这些能力隔绝。我们提出了UltraCUA，这是一种通过混合动作来弥合这一差距的基础模型——无缝地将GUI原语与高级程序工具调用集成在一起。为了实现这一点，我们的方法包括四个关键组件：（1）一个自动流水线，从软件文档、开源仓库和代码生成中扩展程序工具；（2）一个合成数据引擎，生成超过17,000个可验证的任务，涵盖现实世界的计算机使用场景；（3）一个大规模高质量的混合动作轨迹集合，包含低级GUI动作和高级程序工具调用；（4）一个两阶段训练流水线，结合监督微调和在线强化学习，使代理能够在低级和高级动作之间进行战略性的切换。我们的7B和32B模型实验表明，UltraCUA在性能上显著优于最先进的代理。在OSWorld上，UltraCUA模型相对于基线模型平均提高了22%的相对性能，同时每步速度快了11%。在WindowsAgentArena上的跨域评估中，我们的模型达到了21.7%的成功率，优于基于Windows数据训练的基线模型。混合动作机制至关重要，它减少了错误传播，同时保持了执行效率。

Summary / 总结

UltraCUA enhances computer-use agents by integrating GUI primitives with high-level programmatic tool calls, addressing the limitations of relying on primitive actions alone. It comprises an automated pipeline for scaling programmatic tools, a synthetic data engine producing over 17,000 verifiable tasks, a hybrid action trajectory collection, and a two-stage training pipeline that combines supervised fine-tuning with online reinforcement learning. On OSWorld, the 7B and 32B models achieve an average 22% relative improvement over base models while using 11% fewer steps, and out-of-domain evaluation on WindowsAgentArena reaches a 21.7% success rate.

UltraCUA 是一种基础模型，旨在通过集成 GUI 原始动作与高级程序化工具来增强计算机使用代理。它包括一个自动扩展程序化工具的管道、一个合成数据引擎、一个混合动作轨迹集合以及一个两阶段训练管道。实验表明，UltraCUA 模型在 OSWorld 上的性能优于最先进的代理，相对改进了 22%，并且在 WindowsAgentArena 的离域评估中达到了 21.7% 的成功率，优于基于 Windows 数据训练的基线模型。
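A hybrid action space can be pictured as a union of two action types handled by one executor. The sketch below is a hypothetical illustration: the class names, the `rename_file` tool, and the logging format are assumptions, not UltraCUA's actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union

GUI_KINDS = {"click", "type", "scroll"}


@dataclass
class GuiAction:
    kind: str  # one of GUI_KINDS
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolCall:
    name: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


Action = Union[GuiAction, ToolCall]


def execute(action: Action, tools: Dict[str, Callable[..., Any]], log: List[str]) -> Any:
    """Dispatch a step to a programmatic tool or record a GUI primitive."""
    if isinstance(action, ToolCall):
        if action.name not in tools:
            raise KeyError(f"unknown tool: {action.name}")
        result = tools[action.name](**action.kwargs)
        log.append(f"tool:{action.name}")
        return result
    if action.kind not in GUI_KINDS:
        raise ValueError(f"unknown GUI primitive: {action.kind}")
    log.append(f"gui:{action.kind}")  # a real agent would drive the UI here
    return None


# Hypothetical tool registry and a two-step hybrid plan.
tools = {"rename_file": lambda src, dst: f"{src} -> {dst}"}
log: List[str] = []
plan: List[Action] = [
    GuiAction("click", {"x": 40, "y": 12}),
    ToolCall("rename_file", {"src": "a.txt", "dst": "b.txt"}),
]
results = [execute(step, tools, log) for step in plan]
print(log, results)
```

A single tool call can replace a long chain of clicks and keystrokes, which is one way to read the reported reduction in steps and error propagation.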

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Authors: Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu

First: 2025-10-20T17:35:47+00:00 · Latest: 2025-10-20T17:35:47+00:00

Abstract

Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.

中文标题/摘要

标题：SparseVILA：解耦视觉稀疏性以实现高效的VLM推理

视觉语言模型（VLMs）在视觉和文本推理的整合方面取得了快速进展，推动了高分辨率图像理解、长视频分析和多轮对话等应用的发展。然而，它们的可扩展性仍然受到主导推理延迟的视觉标记数量不断增长的限制。我们提出了SparseVILA，这是一种新的高效VLM推理范式，它在预填充和解码阶段解耦视觉稀疏性。SparseVILA通过在预填充阶段剪枝冗余的视觉标记，并在解码阶段仅检索与查询相关的标记，来在阶段之间分配稀疏性。这种解耦设计与领先的预填充剪枝方法相匹配，同时通过保留大部分视觉缓存来保留多轮对话的保真度，以便在每次对话轮次中检索查询相关的标记。基于AWQ优化的推理管道，SparseVILA在长上下文视频任务中实现了最高4.0倍的预填充速度、2.5倍的解码速度和整体2.6倍的端到端加速，同时在文档理解与推理任务中提高了准确性。通过解耦查询无关的剪枝和查询相关的检索，SparseVILA为高效的多模态推理确立了一个新方向，提供了一个无需训练、架构无关的框架，用于加速大型VLMs而不牺牲其能力。

Summary / 总结

SparseVILA improves the efficiency of Vision Language Models (VLMs) by decoupling visual sparsity across the prefilling and decoding stages. It prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, achieving up to 4.0 times faster prefilling, 2.5 times faster decoding, and a 2.6 times end-to-end speedup on long-context video tasks. It also improves accuracy on document-understanding and reasoning tasks, and preserves multi-turn fidelity by retaining most of the visual cache. By separating query-agnostic pruning from query-aware retrieval, SparseVILA offers a training-free, architecture-agnostic framework for accelerating VLMs without sacrificing capability.

SparseVILA 通过在预填充和解码阶段解耦视觉稀疏性来提高视觉语言模型（VLM）的效率。它在预填充阶段修剪冗余的视觉令牌，在解码阶段仅检索相关令牌，从而实现预填充4.0倍、解码2.5倍的加速以及整体2.6倍的端到端加速，特别是在长上下文视频任务中。这种方法在文档理解和推理任务中提高了准确性，同时保持了多轮对话的忠实度和查询感知的检索能力。
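The two decoupled stages can be sketched with plain lists. This is an illustrative toy, not SparseVILA's implementation: the salience scores, the two-dimensional key vectors, and both function names are assumptions.

```python
from typing import Dict, List, Sequence


def prune_prefill(salience: Sequence[float], keep: int) -> List[int]:
    """Query-agnostic stage: keep the `keep` most salient visual tokens."""
    ranked = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    return sorted(ranked[:keep])


def retrieve_for_query(
    cache: Dict[int, Sequence[float]], query: Sequence[float], k: int
) -> List[int]:
    """Query-aware stage: pick the k cached tokens most similar to the query."""

    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(cache, key=lambda i: dot(cache[i], query), reverse=True)
    return sorted(ranked[:k])


# Hypothetical salience scores for five visual tokens.
salience = [0.9, 0.1, 0.8, 0.05, 0.7]
kept = prune_prefill(salience, keep=3)

# Hypothetical key vectors for the tokens that survive prefill.
keys = {0: [1.0, 0.0], 2: [0.0, 1.0], 4: [0.7, 0.7]}
cache = {i: keys[i] for i in kept}

# Each conversation turn retrieves a different subset from the same cache.
turn_one = retrieve_for_query(cache, [1.0, 0.0], k=1)
turn_two = retrieve_for_query(cache, [0.0, 1.0], k=1)
print(kept, turn_one, turn_two)
```

Because the cache is pruned once and then queried per turn, a later question can still reach tokens that were irrelevant to an earlier one.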

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

Authors: Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong

First: 2025-10-20T17:31:09+00:00 · Latest: 2025-10-20T17:31:09+00:00

Comments: 21 pages, 10 figures, 6 tables

Abstract

Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term ``seeing but not believing'' that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it, making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.

中文标题/摘要

标题：见而不信：探索VLMs中视觉注意与答案正确性之间的脱节

视觉-语言模型（VLMs）在视觉问答等多模态任务上取得了很好的成绩，但即使在正确视觉证据存在的情况下，它们仍然可能失败。在本研究中，我们系统地探讨了这些失败是由于未能感知证据还是未能有效利用证据。通过分析逐层的注意力动态，我们发现浅层主要关注文本，而深层则稀疏但可靠地关注局部证据区域。令人惊讶的是，VLMs在输出错误答案时往往能够感知到视觉证据，我们将其称为“见而不信”的现象，这种现象在主要的VLM家族中普遍存在。基于此，我们提出了一种推理时的干预方法，通过基于注意力的选择性掩蔽突出深层证据区域。这种方法无需训练，并且在多个家族中（包括LLaVA、Qwen、Gemma和InternVL）一致提高了准确性。这些结果表明，VLMs内部编码了可靠的证据，但利用率不足，使这些信号显性化可以弥合感知与推理之间的差距，从而推进对VLMs的诊断理解和可靠性。
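One way to read "selective attention-based masking" is to keep the image patches that deep layers attend to most and suppress the rest. The abstract does not spell out the procedure, so the sketch below is an assumption throughout, including the function names, the `keep_fraction` parameter, and the toy values.

```python
from typing import List, Sequence


def evidence_mask(attention: Sequence[float], keep_fraction: float) -> List[bool]:
    """Mark the patches with the highest deep-layer attention as evidence."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(attention) * keep_fraction))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    top = set(ranked[:n_keep])
    return [i in top for i in range(len(attention))]


def apply_mask(patches: Sequence[float], mask: Sequence[bool], fill: float = 0.0) -> List[float]:
    """Keep evidence patches and replace all others with `fill`."""
    return [p if m else fill for p, m in zip(patches, mask)]


# Hypothetical per-patch attention from a deep layer, and toy patch values.
attention = [0.02, 0.40, 0.03, 0.35, 0.20]
patches = [10.0, 20.0, 30.0, 40.0, 50.0]

mask = evidence_mask(attention, keep_fraction=0.4)
highlighted = apply_mask(patches, mask)
print(mask, highlighted)
```

The masked input would then be passed through the model again, which requires no training because only the input changes.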

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Authors: Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf

Venue: NeurIPS 2025 Spotlight

First: 2025-05-30T16:20:18+00:00 · Latest: 2025-10-20T17:28:35+00:00

Comments: NeurIPS 2025 Spotlight. For code, see https://github.com/open-thought/reasoning-gym

Abstract

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.

中文标题/摘要

标题：推理健身房：具有可验证奖励的强化学习推理环境

我们介绍了推理健身房（RG），一个具有可验证奖励的强化学习推理环境库。它提供了超过100个跨多个领域（包括代数、算术、计算、认知、几何、图论、逻辑以及各种常见游戏）的数据生成器和验证器。其关键创新在于能够生成几乎无限的训练数据，并且可以调整复杂度，不同于大多数之前的推理数据集通常是固定的。这种过程生成方法允许在不同难度级别上进行持续评估。我们的实验结果表明，RG在评估和强化学习推理模型方面都表现出有效性。

Summary / 总结

Reasoning Gym (RG) is a library designed for reinforcement learning with verifiable rewards, offering over 100 data generators and verifiers across various domains. Its main innovation is the ability to generate virtually infinite training data with adjustable complexity, enabling continuous evaluation at different difficulty levels. Experiments show RG's effectiveness in evaluating and training reasoning models.

Reasoning Gym (RG) 是一个用于强化学习和可验证奖励的库，提供了涵盖多个领域的超过100个数据生成器和验证器。它使用程序生成方法创建几乎无限的可调整复杂度的训练数据，不同于固定的数据集。实验表明，RG 在不同难度级别上评估和训练推理模型的有效性。
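Procedural generation with adjustable complexity and a verifiable reward can be sketched in a few lines. This is not the Reasoning Gym API: `generate_sums` and `verify` are hypothetical names, and the addition task is chosen only because it is easy to check.

```python
import random
from typing import Dict, Iterator


def generate_sums(seed: int, num_terms: int, max_value: int = 99) -> Iterator[Dict[str, str]]:
    """Yield an unbounded stream of addition problems.

    `num_terms` is the difficulty knob: more terms means a longer chain.
    """
    rng = random.Random(seed)  # seeded so the stream is reproducible
    while True:
        terms = [rng.randint(1, max_value) for _ in range(num_terms)]
        yield {
            "question": " + ".join(map(str, terms)) + " = ?",
            "answer": str(sum(terms)),
        }


def verify(entry: Dict[str, str], model_answer: str) -> float:
    """Verifiable reward: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if model_answer.strip() == entry["answer"] else 0.0


easy = next(generate_sums(seed=0, num_terms=2))
hard = next(generate_sums(seed=0, num_terms=6))
print(easy["question"], "|", hard["question"])
print(verify(easy, easy["answer"]), verify(easy, "not a number"))
```

Because the data comes from a generator instead of a fixed file, difficulty can be raised during training without exhausting the dataset.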

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang

First: 2025-10-20T17:12:10+00:00 · Latest: 2025-10-20T17:12:10+00:00

Comments: 18 pages, 7 figures

Abstract

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

中文标题/摘要

标题：VERA-V：视觉语言模型脱狱的变分推理框架

视觉语言模型（VLMs）通过视觉推理扩展了大型语言模型，但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖于脆弱的模板，专注于单一攻击场景，并仅暴露了一小部分漏洞。为了解决这些局限性，我们引入了VERA-V，这是一种变分推理框架，将多模态脱狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得生成隐蔽的、耦合的对抗输入成为可能，这些输入可以绕过模型的护栏。我们训练了一个轻量级的攻击者来近似后验，从而实现多样脱狱的高效采样，并提供关于漏洞的分布见解。VERA-V 进一步整合了三种互补策略：（i）基于字体的文本提示，嵌入有害线索；（ii）基于扩散的图像合成，引入对抗信号；（iii）结构化的干扰物，以分散 VLM 的注意力。在 HarmBench 和 HADES 基准测试中，VERA-V 在开源和前沿 VLM 上始终优于最先进的基线，相对于最佳基线在 GPT-4o 上的攻击成功率（ASR）提高了 53.75%。

UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang

Venue: NeurIPS 2025

First: 2025-05-20T17:56:01+00:00 · Latest: 2025-10-20T16:56:39+00:00

Abstract

Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. One example is generating "<bo> wearing its hat" for a given concept <bo> without additional textual descriptions of its hat. We call this kind of generation personalized attribute-reasoning generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and attribute-reasoning generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized attribute-reasoning generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.

中文标题/摘要

标题：UniCTokens：通过统一概念令牌增强个性化理解和生成

个性化模型在理解和生成用户提供的概念方面取得了显著的成功。然而，现有方法使用单独的概念令牌分别进行理解和生成，将这两个任务孤立处理。这可能导致在生成具有复杂提示的图像时存在局限性。例如，给定概念<bo>，生成"<bo>戴着它的帽子"而无需额外的关于其帽子的文字描述。我们称这种生成为个性化属性推理生成。为了解决这一局限，我们提出了UniCTokens，这是一种新颖的框架，能够有效地将个性化信息整合到统一的视觉语言模型（VLM）中，用于理解和生成。UniCTokens训练一组统一的概念令牌，利用互补的语义，增强两个个性化任务。此外，我们提出了一种分阶段的训练策略，分为理解预热、从理解中启动生成和从生成深化理解三个阶段，以增强两个任务之间的相互收益。为了定量评估统一VLM的个性化，我们提出了UnifyBench，这是第一个用于评估概念理解、概念生成和属性推理生成的基准。UnifyBench上的实验结果表明，UniCTokens在概念理解、概念生成方面表现出竞争力，并在个性化属性推理生成方面达到了最先进的结果。我们的研究证明，增强的理解可以提高生成，而生成过程也可以为理解提供有价值的见解。我们的代码和数据集将在以下链接发布：https://github.com/arctanxarc/UniCTokens。

Summary / 总结

The research aims to improve personalized understanding and generation by integrating concept tokens for both tasks into a unified framework. UniCTokens trains unified concept tokens shared by understanding and generation, and introduces a three-stage progressive training strategy to strengthen the mutual benefits between the two tasks. Experiments on UnifyBench show that UniCTokens is competitive with leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized attribute-reasoning generation.

研究旨在通过将概念令牌整合到统一的视觉语言模型中，提升个性化理解和生成。UniCTokens 使用三阶段的渐进训练策略来增强理解和生成之间的相互作用。实验表明，UniCTokens 在概念理解和概念生成方面与领先方法相当，并在个性化属性推理生成方面达到了最先进的结果，展示了统一个性化的效果。生成过程还能提供对理解的见解，代码和数据集将公开发布。

Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Authors: Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman

First: 2025-10-20T16:45:43+00:00 · Latest: 2025-10-20T16:45:43+00:00

Abstract

Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.

中文标题/摘要

标题：训练以求真理，保持技能：二元检索增强奖励减轻幻觉现象

语言模型经常生成与其训练数据不符的事实错误信息，这种现象被称为外生幻觉。现有的缓解方法往往在开放生成和下游任务上降低性能，限制了其实用价值。我们提出了一种在线强化学习方法，使用一种新颖的二元检索增强奖励（RAR）来解决这一权衡。与连续奖励方案不同，我们的方法仅当模型的输出完全符合事实时才给予奖励1，否则给予奖励0。我们在Qwen3推理模型上对各种任务进行了评估。对于开放生成，二元RAR将幻觉率降低了39.3%，显著优于监督训练和连续奖励RL基线。在简短问答中，模型学会了校准的回避，面对不足的参数知识时战略性地输出“我不知道”。这分别在PopQA和GPQA上减少了44.4%和21.7%的错误答案。最关键的是，这些事实上的收益不会在指令遵循、数学或代码上降低性能，而连续奖励RL尽管提高了事实性，却导致了质量倒退。

Summary / 总结

The paper addresses the issue of extrinsic hallucination in language models by proposing a binary retrieval-augmented reward (RAR) method in an online reinforcement learning setting. This method assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. The approach significantly reduces hallucination rates in open-ended generation by 39.3% and decreases incorrect answers in short-form question answering by 44.4% and 21.7% on PopQA and GPQA, respectively, without degrading performance on instruction following, math, or code tasks.

论文通过提出一种基于二元检索增强奖励（RAR）的在线强化学习方法来解决语言模型中的外生幻觉问题。与连续奖励方案不同，该方法仅在模型输出完全符合事实时才给予奖励为一，否则为零。该方法在开放生成中将幻觉率显著降低了39.3%，并在PopQA和GPQA上分别减少了44.4%和21.7%的错误答案，同时在指令遵循、数学或代码等其他任务上没有性能下降。
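The contrast between a binary and a continuous factuality reward can be made concrete. The sketch assumes the output has already been split into claims and each claim checked against retrieved evidence; that pipeline, the function names, and the treatment of an empty claim list (an abstention) as vacuously correct are all assumptions.

```python
from typing import Sequence


def continuous_reward(claim_supported: Sequence[bool]) -> float:
    """Fraction of claims supported by evidence (the continuous baseline)."""
    if not claim_supported:
        return 1.0  # assumption: no claims means nothing is wrong
    return sum(claim_supported) / len(claim_supported)


def binary_reward(claim_supported: Sequence[bool]) -> float:
    """1.0 only if every claim is supported, else 0.0."""
    return 1.0 if all(claim_supported) else 0.0


mostly_right = [True, True, False]  # one unsupported claim
fully_right = [True, True]
abstain: list = []                  # e.g. "I don't know"

for name, claims in [("mostly", mostly_right), ("fully", fully_right), ("abstain", abstain)]:
    print(name, round(continuous_reward(claims), 3), binary_reward(claims))
```

Under the binary scheme, a response with one unsupported claim earns nothing, so adding a risky extra claim is never rewarded, which is consistent with the abstention behavior reported above.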

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Authors: Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu

First: 2025-10-20T16:38:40+00:00 · Latest: 2025-10-20T16:38:40+00:00

Comments: Project Website: https://github.com/NJU-LINK/MT-Video-Bench

Abstract

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

中文标题/摘要

标题：MT-Video-Bench：评估多模态大语言模型在多轮对话中视频理解能力的综合性基准

近年来，多模态大语言模型（MLLMs）的发展显著提升了AI对视觉模态的理解能力。然而，现有的评估基准仍然局限于单轮问答，忽视了现实场景中多轮对话的复杂性。为弥补这一不足，我们引入了MT-Video-Bench，这是一个全面的视频理解基准，用于评估MLLMs在多轮对话中的表现。具体而言，我们的MT-Video-Bench主要评估六个核心能力，这些能力涵盖了感知性和互动性，包括987个精心挑选的多轮对话，这些对话来自多个领域。这些能力严格与实际应用对齐，如互动体育分析和基于视频的多轮智能辅导。通过MT-Video-Bench，我们广泛评估了各种最先进的开源和闭源MLLMs，揭示了它们在处理多轮视频对话方面的显著性能差异和局限性。该基准将公开发布，以促进未来的研究。

Summary / 总结

MT-Video-Bench is a benchmark for evaluating Multimodal Large Language Models (MLLMs) in multi-turn video dialogues, addressing the single-turn limitation of existing benchmarks. It assesses six core competencies centered on perceptivity and interactivity, using 987 curated multi-turn dialogues from diverse domains. The evaluation reveals significant performance discrepancies among open-source and closed-source MLLMs and highlights their limitations in handling multi-turn video dialogues.

MT-Video-Bench 是一个用于评估多模态大型语言模型（MLLMs）在多轮对话中表现的基准，解决了现有单轮对话基准的局限性。它通过987个来自不同领域的精心挑选的对话，评估了感知性和互动性等六个核心能力。该基准揭示了不同 MLLMs 在处理多轮视频对话时的显著性能差异和局限性，并将公开发布以促进未来研究。

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang

Venue: NeurIPS 2025

First: 2025-10-20T16:10:56+00:00 · Latest: 2025-10-20T16:10:56+00:00

Comments: Accepted to NeurIPS 2025

Abstract

Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when global human guidance over the whole multi-agent system is impractical in large-scale MARL. On the other hand, designing mechanisms to coordinate agents mostly relies on empirical studies and lacks an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce interaction paradigms that leverage MAIDs to analyze and visualize existing approaches in MARL. Then, we design a new interaction paradigm based on MAIDs, referred to as targeted intervention, that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.

中文标题/摘要

标题：多智能体强化学习中目标干预的原则

引导合作的多智能体强化学习（MARL）向期望的结果发展具有挑战性，特别是在大规模MARL中，从整体多智能体系统获得全局指导对人类来说往往是不切实际的。另一方面，设计机制来协调智能体主要依赖于经验研究，缺乏一个易于使用的研究工具。在本文中，我们采用多智能体影响图（MAIDs）作为图形框架来解决上述问题。首先，我们引入了利用MAIDs分析和可视化现有MARL方法的交互范式。然后，我们基于MAIDs设计了一种新的交互范式，称为针对单个目标智能体的应用干预，从而减轻了全局指导的问题。在我们的实现中，我们引入了一种因果推理技术——称为预策略干预（PSI）——来实现目标干预范式。由于MAIDs可以被视为因果图的一个特殊类别，通过PSI最大化相应的因果效应可以实现综合期望结果，即主要任务目标和附加期望结果的整合。此外，MAIDs的捆绑相关图分析提供了一种工具，用于确定在交互范式设计下MARL学习范式是否可行。在实验中，我们展示了我们提出的针对干预的有效性，并验证了相关图分析的结果。

Summary / 总结

This work addresses the challenge of steering cooperative multi-agent reinforcement learning towards desired outcomes by employing multi-agent influence diagrams (MAIDs). It introduces a new interaction paradigm called targeted intervention, which focuses on a single agent to mitigate the need for global guidance. The effectiveness of this approach is demonstrated through experiments, showing improved performance in achieving composite desired outcomes.

本文通过引入基于多智能体影响图（MAIDs）的新型干预范式——目标干预，解决多智能体强化学习（MARL）向期望结果引导的难题。利用因果推理技术Pre-Strategy Intervention（PSI），该方法仅干预单个智能体，从而减轻全局指导的需求。实验表明目标干预的有效性，并验证了相关图分析工具在识别可工作MARL学习范式方面的有效性。

Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning

Authors: Xihong Su

First: 2025-10-20T16:06:01+00:00 · Latest: 2025-10-20T16:06:01+00:00

Comments: Dissertation

Abs · PDF · Code1 · Code2

Abstract

This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in MMDPs and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm to compute a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts model weights iteratively to guarantee monotone policy improvements to a local maximum. Second, we establish sufficient and necessary conditions for the exponential ERM Bellman operator to be a contraction and prove the existence of stationary deterministic optimal policies for ERM-TRC and EVaR-TRC. We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies for ERM-TRC and EVaR-TRC. Third, we propose model-free Q-learning algorithms for computing policies with risk-averse objectives: ERM-TRC and EVaR-TRC. The challenge is that the Q-learning ERM Bellman operator may not be a contraction. Instead, we use the monotonicity of Q-learning ERM Bellman operators to derive a rigorous proof that the ERM-TRC and the EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions. The proposed Q-learning algorithms compute the optimal stationary policy for ERM-TRC and EVaR-TRC.

中文标题/摘要

标题：缓解强化学习中不确定性和风险的高效算法

本论文做出了三项主要贡献。首先，我们发现策略梯度与MMDP中的动态规划之间的一个新联系，并提出了坐标上升动态规划(CADP)算法，以计算在不确定模型上平均折扣回报最大的马尔可夫策略。CADP通过迭代调整模型权重来保证策略的单调改进到局部最大值。其次，我们建立了指数ERM贝尔曼算子为收缩的充分必要条件，并证明了ERM-TRC和EVaR-TRC的稳态确定性最优策略的存在性。我们还提出了指数值迭代、策略迭代和线性规划算法来计算ERM-TRC和EVaR-TRC的最优稳态策略。第三，我们提出了无模型Q学习算法来计算具有风险厌恶目标的策略：ERM-TRC和EVaR-TRC。挑战在于Q学习ERM贝尔曼算子可能不是收缩的。相反，我们利用Q学习ERM贝尔曼算子的单调性来严格证明ERM-TRC和EVaR-TRC的Q学习算法收敛到最优风险厌恶价值函数。提出的Q学习算法计算了ERM-TRC和EVaR-TRC的最优稳态策略。

Summary / 总结

This dissertation addresses the challenges of uncertainty and risk in reinforcement learning by proposing efficient algorithms. It introduces the CADP algorithm to compute a Markov policy that maximizes discounted return under uncertain models, and establishes conditions for exponential ERM Bellman operators to be contractions, leading to optimal stationary policies via algorithms like exponential value iteration. Additionally, it develops model-free Q-learning algorithms for risk-averse objectives, ensuring convergence to optimal risk-averse value functions through the monotonicity of Q-learning ERM Bellman operators.

本论文提出了三项主要贡献，以减轻强化学习中的不确定性与风险。它提出了坐标上升动态规划（CADP）算法，以最大化不确定模型下的折现回报。此外，它还建立了指数ERM贝尔曼算子为收缩的条件，并开发了计算最优稳态策略的算法。最后，它提出了针对风险规避目标的无模型Q学习算法，并证明了它们收敛到最优风险规避价值函数。
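
The risk-averse objective behind ERM-TRC can be illustrated with a small numeric sketch. The abstract does not expand the acronym; the code below assumes ERM refers to the entropic risk measure in its common reward-based form, ERM_beta[X] = -(1/beta) * log E[exp(-beta * X)], estimated from samples. The function name and the sample returns are hypothetical, and this is not the dissertation's algorithm.

```python
import math


def entropic_risk(returns, beta):
    """Empirical entropic risk of a list of sampled returns.

    Assumed convention (not stated in the abstract):
        ERM_beta[X] = -(1 / beta) * log(mean(exp(-beta * x)))
    Larger beta means stronger risk aversion.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not returns:
        raise ValueError("returns must be non-empty")
    # Log-sum-exp trick keeps the exponentials numerically stable.
    scaled = [-beta * x for x in returns]
    m = max(scaled)
    log_mean = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -log_mean / beta


if __name__ == "__main__":
    samples = [0.0, 10.0]  # hypothetical returns
    print(entropic_risk(samples, beta=1.0))
    print(sum(samples) / len(samples))
```

Under this convention the entropic risk of a non-constant return distribution lies below its mean, which is what makes the objective risk-averse.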

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

Authors: Xu Zhang, Hao Li, Zhichao Lu

First: 2025-10-20T16:02:34+00:00 · Latest: 2025-10-20T16:02:34+00:00

Comments: 14 pages, 8 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.

中文标题/摘要

标题：CrossGuard：保护MLLM免受联合模态隐式恶意攻击

多模态大型语言模型（MLLMs）实现了强大的推理和感知能力，但越来越容易受到逃逸攻击。虽然现有工作主要关注显式攻击，其中恶意内容存在于单一模态中，但最近的研究揭示了隐式攻击，其中看似无害的文本和图像输入联合表达了不安全的意图。这种联合模态威胁难以检测且尚未充分探索，主要是由于高质量隐式数据的稀缺性。我们提出了ImpForge，这是一种利用定制奖励模块的强化学习自动化红队管道，用于生成14个领域中的多样化隐式样本。在此数据集的基础上，我们进一步开发了CrossGuard，这是一种意图感知的保护措施，能够提供针对显式和隐式威胁的稳健和全面的防御。广泛的实验表明，CrossGuard在安全和不安全基准、隐式和显式攻击以及多个跨域设置中均显著优于现有防御，包括先进的MLLMs和护栏，同时提供了更强的安全性并保持了高实用性。这为增强MLLM对现实世界多模态威胁的鲁棒性提供了一个平衡且实用的解决方案。

Summary / 总结

The research aims to protect MLLMs from joint-modal implicit malicious attacks, which are harder to detect than explicit attacks. The study proposes ImpForge, an automated pipeline using reinforcement learning to generate diverse implicit samples, and CrossGuard, an intent-aware safeguard. CrossGuard effectively defends against both explicit and implicit threats, outperforming existing defenses in various benchmarks and settings, while maintaining high utility.

研究旨在应对多模态大型语言模型（MLLMs）对联合模态隐式恶意攻击的脆弱性，这类攻击比显式攻击更难检测。研究提出了ImpForge，一个使用强化学习生成多样化隐式样本的自动化管道，以及CrossGuard，一种意图感知的防护措施。实验表明，CrossGuard能够有效防御显式和隐式威胁，其性能优于现有防护措施，同时保持高实用性，提供了一种增强MLLM鲁棒性的实用解决方案。

Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

Authors: Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah

First: 2025-10-20T15:26:43+00:00 · Latest: 2025-10-20T15:26:43+00:00

Comments: 7 pages, 1 figure, FLTA 2025

Abs · PDF · Code1 · Code2

Abstract

We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.

中文标题/摘要

标题：节俭联邦学习在暴力检测中的应用：LoRA调优视觉语言模型与个性化CNN的比较

我们通过比较两种互补策略来研究节俭联邦学习在暴力检测中的应用：(i) 零样本和联邦微调视觉语言模型(VLMs)，以及(ii) 个性化训练紧凑的3D卷积神经网络(CNN3D)。使用LLaVA-7B和一个65.8M参数的CNN3D作为代表案例，我们在非IID的现实环境中评估准确率、校准和能耗。两种方法的准确率均超过90%。CNN3D在ROC AUC和log loss上略优于LoRA调优的VLMs，同时能耗更低。VLMs在上下文推理和多模态推理方面仍占优势。我们量化了训练和推理过程中的能耗和CO2排放，并分析了部署中的可持续性权衡。据我们所知，这是首次对LoRA调优的视觉语言模型和个性化CNN在联邦暴力检测中的比较研究，重点在于能效和环境指标。这些发现支持一种混合模型：轻量级CNN用于常规分类，而选择性激活VLM用于复杂或描述性场景。该框架为视频监控中的负责任、资源感知AI提供了一个可重复的基线，并扩展到实时、多模态和生命周期感知系统。

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

Authors: Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, Yansong Tang, Ziwei Wang

First: 2025-10-20T15:21:12+00:00 · Latest: 2025-10-20T15:21:12+00:00

Comments: 9 pages, 7 figures, submitted to ICRA 2026

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action models (VLAs) have demonstrated remarkable performance on complex robotic manipulation tasks through imitation learning. However, existing imitation learning datasets contain only successful trajectories and lack failure or recovery data, especially for out-of-distribution (OOD) states where the robot deviates from the main policy due to minor perturbations or errors, leading VLA models to struggle with states deviating from the training distribution. To this end, we propose an automated OOD data augmentation framework named RESample through exploratory sampling. Specifically, we first leverage offline reinforcement learning to obtain an action-value network that accurately identifies sub-optimal actions under the current manipulation policy. We further sample potential OOD states from trajectories via rollout, and design an exploratory sampling mechanism that adaptively incorporates these action proxies into the training dataset to ensure efficiency. Subsequently, our framework explicitly encourages the VLAs to recover from OOD states and enhances their robustness against distributional shifts. We conduct extensive experiments on the LIBERO benchmark as well as real-world robotic manipulation tasks, demonstrating that RESample consistently improves the stability and generalization ability of VLA models.

中文标题/摘要

标题：RESample：通过探索性采样实现鲁棒数据增强的机器人操作框架

视觉-语言-动作模型（VLAs）通过模仿学习在复杂的机器人操作任务中表现出色。然而，现有的模仿学习数据集仅包含成功的轨迹，缺乏失败或恢复数据，尤其是对于机器人因轻微扰动或错误偏离主要策略的离分布（OOD）状态，导致VLAs模型难以处理与训练分布偏离的状态。为此，我们提出了一种名为RESample的自动化OOD数据增强框架，通过探索性采样实现。具体而言，我们首先利用离线强化学习获得一个动作-价值网络，该网络能够准确识别当前操作策略下的次优动作。我们进一步通过回放从轨迹中采样潜在的OOD状态，并设计了一种探索性采样机制，该机制能够自适应地将这些动作代理整合到训练数据集中以确保效率。随后，我们的框架明确鼓励VLAs从OOD状态中恢复并增强其对分布偏移的鲁棒性。我们在LIBERO基准以及实际的机器人操作任务上进行了广泛的实验，证明RESample能够一致地提高VLAs模型的稳定性和泛化能力。

Summary / 总结

The paper proposes RESample, an automated data augmentation framework for improving the robustness of Vision-Language-Action models in robotic manipulation tasks. It uses offline reinforcement learning to identify sub-optimal actions and then samples potential out-of-distribution states to enhance the training dataset. Experimental results on the LIBERO benchmark and real-world tasks show that RESample improves the stability and generalization ability of these models.

论文提出RESample框架，通过离线强化学习识别次优动作并采样潜在的离分布状态来增强训练数据集，以解决Vision-Language-Action模型在机器人操作中的鲁棒性问题。该框架提高了模型从离分布状态中恢复的能力，并增强了其对分布偏移的鲁棒性。在LIBERO基准和实际任务中的实验显示，该方法在稳定性和泛化能力上表现出一致的改进。

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Authors: Samer Al-Hamadani

First: 2025-10-13T11:48:48+00:00 · Latest: 2025-10-20T15:09:23+00:00

Comments: 30 pages, 12 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.

中文标题/摘要

标题：监督训练何时见效？视觉语言模型时代目标检测的隐含经济学

目标检测传统上依赖昂贵的手动标注。我们首次进行全面的成本效益分析，比较了监督YOLO和零样本视觉语言模型（Gemini Flash 2.5和GPT-4）。在5,000张分层COCO图像和500张多样产品图像上进行评估，并结合总拥有成本建模，我们推导出架构选择的临界点。结果显示，监督YOLO在标准类别上的准确率为91.2%，而Gemini为68.5%，GPT-4为71.3%；对于100个类别的系统，标注费用为10,800美元，只有在超过5500万次推理（相当于一年内每天处理151,000张图像）时，准确率优势才开始见效。在多样产品类别上，Gemini达到52.3%，GPT-4达到55.1%，而监督YOLO无法检测未训练的类别。在10万次推理时，就每次正确检测的成本而言，Gemini（0.00050美元）和GPT-4（0.00067美元）优于YOLO（0.143美元）。我们提供了决策框架，表明最优架构选择取决于推理量、类别稳定性、预算和准确率要求。

Summary / 总结

This study analyzes the cost-effectiveness of supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4) in object detection. Evaluations on 5,000 COCO images and 500 product images, combined with Total Cost of Ownership modeling, reveal that supervised YOLO outperforms Gemini and GPT-4 with 91.2% accuracy compared to 68.5% and 71.3%, respectively. However, the accuracy advantage only pays off beyond 55 million inferences, and the annotation cost for a 100-category system is $10,800. Gemini and GPT-4 have lower cost-per-correct-detection at 100,000 inferences, favoring them in certain scenarios.

研究分析了监督YOLO和零样本视觉语言模型（Gemini Flash 2.5和GPT-4）在目标检测中的成本效益。通过对5,000张COCO图像和500张产品图像的评估，并结合总拥有成本模型，结果显示监督YOLO在标准类别上的准确率为91.2%，而Gemini和GPT-4分别为68.5%和71.3%。然而，准确率优势仅在超过5500万次推理后显现，100类系统的注释成本为10,800美元。Gemini和GPT-4在100,000次推理时的每正确检测成本较低，更适用于某些场景。
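
The cost-per-correct-detection comparison above can be read as total cost divided by the number of correct detections. The sketch below assumes a simple model with a fixed up-front cost plus a constant per-inference cost. Only the $10,800 annotation cost and the accuracies come from the abstract; the per-inference price is a hypothetical placeholder, so this will not reproduce the paper's Total Cost of Ownership figures or its 55 million break-even point.

```python
import math


def cost_per_correct_detection(fixed_cost, cost_per_inference, n_inferences, accuracy):
    """Total cost divided by the expected number of correct detections."""
    if n_inferences <= 0 or not 0 < accuracy <= 1:
        raise ValueError("need n_inferences > 0 and accuracy in (0, 1]")
    total = fixed_cost + cost_per_inference * n_inferences
    return total / (n_inferences * accuracy)


def break_even_inferences(fixed_a, per_inf_a, acc_a, fixed_b, per_inf_b, acc_b):
    """Smallest inference count at which option A is no more expensive per
    correct detection than option B, or None if A never catches up.

    Derived from (fixed_a + per_inf_a * n) / acc_a <= (fixed_b + per_inf_b * n) / acc_b.
    """
    gap = fixed_a / acc_a - fixed_b / acc_b        # up-front handicap of A
    slope = per_inf_b / acc_b - per_inf_a / acc_a  # A's saving per inference
    if slope <= 0:
        return 0 if gap <= 0 else None
    return max(0, math.ceil(gap / slope))


if __name__ == "__main__":
    # Hypothetical API price per image; NOT taken from the paper.
    vlm_price = 0.0003425
    n = break_even_inferences(10_800, 0.0, 0.912, 0.0, vlm_price, 0.685)
    print("break-even inferences under these assumptions:", n)
    print(cost_per_correct_detection(10_800, 0.0, 100_000, 0.912))
```

Under this model a supervised detector with a large fixed cost only becomes cheaper per correct detection once the inference volume is high enough to amortize the annotation expense, which is the qualitative point the abstract makes.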

Grounded Reinforcement Learning for Visual Reasoning

Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

First: 2025-05-29T17:20:26+00:00 · Latest: 2025-10-20T14:54:22+00:00

Comments: Project website: https://visually-grounded-rl.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks (including SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding), ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

中文标题/摘要

标题：基于视觉锚定的强化学习视觉推理

虽然强化学习（RL）在语言模型中的链式思维任务（如数学和编程）上取得了显著进展，但视觉推理增加了复杂性，要求模型能够引导视觉注意力、解释感知输入，并将抽象推理与空间证据联系起来。我们提出了ViGoRL（视觉锚定的强化学习），这是一种通过RL训练的视觉-语言模型，旨在明确将每一步推理与特定的视觉坐标锚定。受人类视觉决策的启发，ViGoRL 学会生成空间锚定的推理轨迹，在每一步引导视觉注意力到任务相关区域。当需要精细探索时，我们提出的新型多轮RL框架使模型能够随着推理展开动态地聚焦到预测的坐标上。在包括SAT-2和BLINK的空间推理、V*bench的视觉搜索、ScreenSpot和VisualWebArena的基于网页的定位等一系列视觉推理基准测试中，ViGoRL 一贯优于监督微调和缺乏明确锚定机制的传统RL基线。结合多轮RL与聚焦视觉反馈显著提高了ViGoRL 在定位小GUI元素和视觉搜索方面的性能，达到V*bench的86.4%。此外，我们发现锚定放大了其他视觉行为，如区域探索、锚定子目标设置和视觉验证。最后，人类评估表明，模型的视觉参考不仅在空间上准确，而且有助于理解模型的推理步骤。我们的结果表明，视觉锚定的RL是一种强大的范式，可以赋予模型通用的视觉推理能力。

Summary / 总结

The research aims to enhance visual reasoning by integrating reinforcement learning (RL) with visual grounding, addressing the complexity of directing visual attention and interpreting perceptual inputs. ViGoRL, a vision-language model, is trained using RL to explicitly link reasoning steps to specific visual coordinates. Across various visual reasoning benchmarks, ViGoRL outperforms both supervised fine-tuning and conventional RL baselines, demonstrating superior performance in tasks such as spatial reasoning and visual search, with a notable 86.4% score on V*Bench.

研究旨在通过将强化学习（RL）与视觉接地相结合，提升语言模型的视觉推理能力。ViGoRL 是一种视觉语言模型，通过 RL 训练，将每个推理步骤与特定的视觉坐标明确关联。这种方法在多种视觉推理基准测试中表现出色，优于监督微调和缺乏明确接地机制的传统 RL 方法。关键发现包括在局部化小 GUI 元素和视觉搜索方面的显著改进，V*Bench 的得分为 86.4%。接地还增强了区域探索等其他视觉行为，并且人类评估确认了模型视觉参考的空间准确性和帮助性。

MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

Authors: Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma

First: 2025-10-20T14:40:26+00:00 · Latest: 2025-10-20T14:40:26+00:00

Comments: 16 pages, 3 tables, 1 figure

Abs · PDF · Code1 · Code2

Abstract

Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval and outputs structured, citation-linked rationales. On the MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining a 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.

中文标题/摘要

标题：MIRAGE：基于网页证据推理的多模态虚假信息检测代理框架

虚假信息通过每天数十亿条结合文本和图像的多模态帖子在网页平台上传播，超出了人工事实核查的能力。监督检测模型需要特定领域的训练数据，并且无法泛化到多种操纵策略。我们提出了 MIRAGE，一种推理时、模型可插拔的代理框架，将多模态验证分解为四个顺序模块：视觉真实性评估检测 AI 生成的图像，跨模态一致性分析识别脱离上下文的重新利用，检索增强的事实核查通过迭代问题生成将声明与网页证据联系起来，以及一个校准判断模块整合所有信号。MIRAGE 将视觉语言模型推理与有针对性的网页检索相结合，并输出结构化、附带引文链接的推理依据。在 MMFakeBench 验证集（1,000 个样本）上，MIRAGE 使用 GPT-4o-mini 达到 81.65% 的 F1 和 75.1% 的准确率，优于最强的零样本基线（GPT-4V 与 MMD-Agent 的 74.0% F1）7.65 个百分点，同时假阳性率为 34.3%，而仅使用判断模块的基线为 97.3%。测试集结果（5,000 个样本）证实了泛化能力，F1 为 81.44%，准确率为 75.08%。消融研究显示视觉验证贡献了 5.18 个 F1 点，检索增强的推理贡献了 2.97 个点。我们的结果表明，分解的代理推理与网页检索可以匹配监督检测器的性能，无需特定领域的训练，从而在标注数据稀缺的多模态领域实现虚假信息检测。

Summary / 总结

MIRAGE is an agentic framework designed to detect multimodal misinformation on web platforms. It decomposes the verification process into four modules: visual veracity assessment, cross-modal consistency analysis, retrieval-augmented factual checking, and a calibrated judgment module. On the MMFakeBench validation set, MIRAGE with GPT-4o-mini achieved 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline by 7.65 points. Ablation studies showed that visual verification and retrieval-augmented reasoning significantly contributed to the performance.

MIRAGE 是一个用于检测网络平台上多模态虚假信息的代理框架，它将验证过程分解为四个模块：视觉真实性评估、跨模态一致性分析、检索增强的事实核查以及一个校准判断模块。在 MMFakeBench 验证集上，MIRAGE 使用 GPT-4o-mini 达到了 81.65% 的 F1 得分和 75.1% 的准确率，比最强的零样本基线高出 7.65 个百分点。消融研究显示，视觉验证和检索增强的推理对性能有显著贡献。
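
The four-module decomposition described above can be sketched as a sequential pipeline in which each module adds a signal and a rationale, and a final judge combines them. This is a structural illustration only: every class, function, score and threshold below is hypothetical, the modules are stubs rather than real VLM or web-retrieval calls, and the averaging rule stands in for the paper's calibrated judgment module.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Post:
    text: str
    image_path: str


@dataclass
class Evidence:
    # Suspicion scores in [0, 1], keyed by module name (hypothetical convention).
    signals: Dict[str, float] = field(default_factory=dict)
    rationale: List[str] = field(default_factory=list)


Module = Callable[[Post, Evidence], None]


def visual_veracity(post: Post, ev: Evidence) -> None:
    # Stub: a real module would ask a VLM whether the image looks AI-generated.
    ev.signals["visual"] = 0.2
    ev.rationale.append("visual: no strong generation artifacts (stub)")


def cross_modal_consistency(post: Post, ev: Evidence) -> None:
    # Stub: a real module would check whether the caption matches the image.
    ev.signals["consistency"] = 0.8
    ev.rationale.append("consistency: caption may be out of context (stub)")


def retrieval_fact_check(post: Post, ev: Evidence) -> None:
    # Stub: a real module would generate questions and retrieve web evidence.
    ev.signals["retrieval"] = 0.7
    ev.rationale.append("retrieval: claim not supported by sources (stub)")


def judge(ev: Evidence, threshold: float = 0.5) -> str:
    # Placeholder rule: flag the post when the mean suspicion exceeds the threshold.
    score = sum(ev.signals.values()) / len(ev.signals)
    return "fake" if score > threshold else "real"


def run_pipeline(post: Post, modules: List[Module],
                 judge_fn: Callable[[Evidence], str] = judge) -> Tuple[str, Evidence]:
    ev = Evidence()
    for module in modules:  # modules run strictly in order
        module(post, ev)
    return judge_fn(ev), ev


if __name__ == "__main__":
    verdict, evidence = run_pipeline(
        Post("example caption", "photo.jpg"),
        [visual_veracity, cross_modal_consistency, retrieval_fact_check],
    )
    print(verdict, evidence.rationale)
```

Keeping each module behind the same callable signature is one way to get the "model-pluggable" property: a stub can be swapped for a real model call without touching the orchestration loop.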

Quantum Reinforcement Learning Trading Agent for Sector Rotation in the Taiwan Stock Market

Authors: Chi-Sheng Chen, Xinyu Zhang, Ya-Chuan Chen

First: 2025-06-26T01:29:19+00:00 · Latest: 2025-10-20T14:32:07+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a hybrid quantum-classical reinforcement learning framework for sector rotation in the Taiwan stock market. Our system employs Proximal Policy Optimization (PPO) as the backbone algorithm and integrates both classical architectures (LSTM, Transformer) and quantum-enhanced models (QNN, QRWKV, QASA) as policy and value networks. An automated feature engineering pipeline extracts financial indicators from capital share data to ensure consistent model input across all configurations. Empirical backtesting reveals a key finding: although quantum-enhanced models consistently achieve higher training rewards, they underperform classical models in real-world investment metrics such as cumulative return and Sharpe ratio. This discrepancy highlights a core challenge in applying reinforcement learning to financial domains, namely the mismatch between proxy reward signals and true investment objectives. Our analysis suggests that current reward designs may incentivize overfitting to short-term volatility rather than optimizing risk-adjusted returns. This issue is compounded by the inherent expressiveness and optimization instability of quantum circuits under Noisy Intermediate-Scale Quantum (NISQ) constraints. We discuss the implications of this reward-performance gap and propose directions for future improvement, including reward shaping, model regularization, and validation-based early stopping. Our work offers a reproducible benchmark and critical insights into the practical challenges of deploying quantum reinforcement learning in real-world finance.

中文标题/摘要

标题：用于台湾股市板块轮动的量子强化学习交易代理

我们提出了一种混合量子-经典强化学习框架，用于台湾股市的板块轮动。该系统以近端策略优化（PPO）作为基础算法，并结合了经典架构（LSTM，Transformer）和量子增强模型（QNN，QRWKV，QASA）作为策略和价值网络。自动特征工程流水线从资本份额数据中提取财务指标，以确保所有配置的一致模型输入。实证回测揭示了一个关键发现：尽管量子增强模型在训练奖励方面始终表现更优，但在实际投资指标（如累计回报和夏普比率）方面却不如经典模型。这种差异突显了将强化学习应用于金融领域的一个核心挑战——即代理奖励信号与真实投资目标之间的不匹配。我们的分析表明，当前的奖励设计可能激励对短期波动的过度拟合，而不是优化风险调整后的回报。这一问题在量子电路在噪声中尺度量子（NISQ）约束下的固有表达能力和优化不稳定性的背景下被进一步放大。我们讨论了这种奖励-性能差距的影响，并提出了未来改进的方向，包括奖励塑形、模型正则化和基于验证的早期停止。我们的工作提供了一个可重复的基准，并对在实际金融中部署量子强化学习的实践挑战提供了关键见解。

Summary / 总结

The study proposes a hybrid quantum-classical reinforcement learning framework for sector rotation in the Taiwan stock market, using Proximal Policy Optimization (PPO) and integrating classical and quantum-enhanced models. The research finds that while quantum-enhanced models achieve higher training rewards, they underperform classical models in real-world metrics like cumulative return and Sharpe ratio. This indicates a mismatch between proxy reward signals and true investment objectives, suggesting the need for better reward designs and model regularization.

研究提出了一种混合量子-经典强化学习框架，用于台湾股市的板块轮动，使用了Proximal Policy Optimization (PPO)，并结合了经典和量子增强模型。研究发现，尽管量子增强模型在训练奖励中表现更好，但在实际投资指标如累计回报率和夏普比率方面却不如经典模型。这表明代理奖励信号与真正的投资目标之间存在不匹配，需要改进奖励设计和模型正则化。
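
The two investment metrics named above can be computed from a series of per-period returns. The sketch below uses the textbook definitions (compounded growth for cumulative return, mean excess return over its sample standard deviation for the Sharpe ratio); the annualization factor of 252 trading days and the sample returns are assumptions, and the paper's exact backtesting conventions are not stated in the abstract.

```python
import math
from statistics import mean, stdev


def cumulative_return(period_returns):
    """Compounded return over the whole series, e.g. 0.05 means +5%."""
    growth = 1.0
    for r in period_returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(period_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio; risk_free_rate is expressed per period."""
    excess = [r - risk_free_rate for r in period_returns]
    if len(excess) < 2:
        raise ValueError("need at least two returns")
    sd = stdev(excess)  # sample standard deviation
    if sd == 0:
        raise ValueError("returns have zero variance")
    return mean(excess) / sd * math.sqrt(periods_per_year)


if __name__ == "__main__":
    daily = [0.01, -0.004, 0.007, -0.012, 0.003]  # hypothetical daily returns
    print(cumulative_return(daily))
    print(sharpe_ratio(daily))
```

A policy can earn a high training reward while scoring poorly on both metrics if the reward tracks short-term price moves rather than compounded, volatility-adjusted performance, which is the mismatch the abstract describes.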

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue

First: 2025-07-01T05:23:05+00:00 · Latest: 2025-10-20T14:27:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

中文标题/摘要

标题：数学推理能否提升通用大语言模型能力？理解大语言模型推理的迁移性

数学推理已成为大语言模型（LLMs）进步的代名词，新模型迅速在MATH和AIME等基准测试中超越人类水平。但随着数学排行榜每周的提升，值得问的是：这些进步是否反映了更广泛的问题解决能力，还是仅仅狭窄的过拟合？为回答这一问题，我们评估了超过20个开放权重推理调优模型在广泛的任务套件中的表现，包括数学、科学问答、智能体规划、编程和标准指令遵循。我们惊讶地发现，大多数在数学中取得成功的模型无法将其收益转移到其他领域。为了严格地研究这一现象，我们仅使用数学数据但采用不同调优方法，在Qwen3-14B模型上进行了受控实验。我们发现，强化学习（RL）调优模型在不同领域中表现出良好的泛化能力，而监督微调（SFT）调优模型往往忘记通用能力。潜在空间表示和词元空间分布偏移分析表明，SFT导致了显著的表示和输出漂移，而RL保留了通用领域的结构。我们的结果表明，需要重新思考标准的后训练方案，特别是依赖SFT蒸馏数据来推进推理模型的做法。

Summary / 总结

The study investigates whether improvements in math reasoning capabilities of large language models (LLMs) translate to broader problem-solving abilities. By evaluating over 20 reasoning-tuned models across various tasks, the research finds that models excelling in math often struggle to apply their skills in other domains. Controlled experiments on Qwen3-14B models using reinforcement learning (RL) and supervised fine-tuning (SFT) methods reveal that RL-tuned models generalize better across domains, whereas SFT-tuned models tend to lose general capabilities. The findings highlight the need to reconsider standard post-training recipes, particularly the use of SFT-distilled data for enhancing reasoning models.

研究探讨了大型语言模型（LLM）在数学推理能力上的提升是否能转化为更广泛的问题解决能力。通过评估超过20个推理调优模型在各种任务中的表现，研究发现，在数学上表现优异的模型往往难以将其技能应用到其他领域。对Qwen3-14B模型进行的控制实验表明，使用强化学习（RL）和监督微调（SFT）方法调优的模型在跨领域应用上表现不同，RL调优模型能更好地泛化，而SFT调优模型往往会失去通用能力。研究结果强调了需要重新考虑标准后训练配方，特别是对使用SFT提炼的数据来增强推理模型的依赖。
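
The token-space distribution shift mentioned above can be quantified by comparing the next-token distribution of a tuned model with that of its base model on the same prompt. The abstract does not say which measure the authors use; the sketch below assumes KL divergence, one common choice, and uses made-up four-token distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            # eps guards against log of zero when q assigns no mass to a token.
            total += pi * math.log(pi / max(qi, eps))
    return total


if __name__ == "__main__":
    base = [0.70, 0.20, 0.05, 0.05]   # hypothetical base-model probabilities
    tuned = [0.40, 0.10, 0.45, 0.05]  # hypothetical tuned-model probabilities
    print(kl_divergence(tuned, base))
```

Averaging such a divergence over many prompts and positions gives a single drift score per model, so a larger value for an SFT-tuned model than for an RL-tuned one would correspond to the output drift the abstract reports.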

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Authors: Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

First: 2025-07-30T16:41:21+00:00 · Latest: 2025-10-20T14:13:43+00:00

Comments: ScreenCoder-v2

Abs · PDF · Code1 · Code2 · Code3

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.

中文标题/摘要

标题：ScreenCoder：通过模块化多模态代理推进前端自动化视觉到代码生成

自动化用户界面（UI）设计到前端代码的转换对加速软件开发和普及设计工作流具有重要意义。虽然多模态大型语言模型（MLLM）可以将图像转换为代码，但在处理复杂UI时往往失败，难以在一个单一的大型模型中统一视觉感知、布局规划和代码合成，从而导致频繁的感知和规划错误。为了解决这一问题，我们提出了ScreenCoder，这是一种模块化多代理框架，将任务分解为三个可解释的阶段：语义化、规划和生成。通过将这些不同的职责分配给专门的代理，我们的框架在鲁棒性和保真度方面显著优于端到端方法。此外，ScreenCoder充当可扩展的数据引擎，使我们能够生成高质量的图像-代码对。我们使用这些数据通过监督微调和强化学习的双阶段管道对开源MLLM进行微调，展示了其UI生成能力的显著提升。广泛的实验表明，我们的方法在布局准确性、结构连贯性和代码正确性方面达到了最先进的性能。我们的代码已公开发布在https://github.com/leigest519/ScreenCoder。

Summary / 总结

ScreenCoder is a modular multi-agent framework designed to automate the conversion of UI designs into front-end code. It decomposes the task into grounding, planning, and generation stages, each handled by specialized agents, which improves robustness and fidelity compared to end-to-end approaches. Experiments show that ScreenCoder outperforms existing methods in layout accuracy, structural coherence, and code correctness, and its code is publicly available.

ScreenCoder 是一个模块化的多智能体框架，旨在自动化将 UI 设计转换为前端代码的过程。它将任务分解为三个阶段：语义理解、规划和生成，每个阶段由专门的智能体负责。这种方法相比端到端模型提高了鲁棒性和准确性。实验表明，ScreenCoder 在布局准确性、结构连贯性和代码正确性方面优于现有方法。该框架还生成高质量的图像-代码对，用于微调开源 MLLM，增强其 UI 生成能力。

An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning

Authors: Lindsay Spoor, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland

First: 2025-10-20T14:13:17+00:00 · Latest: 2025-10-20T14:13:17+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

In safety-critical domains such as robotics, navigation and power systems, constrained optimization problems arise where maximizing performance must be carefully balanced with associated constraints. Safe reinforcement learning provides a framework to address these challenges, with Lagrangian methods being a popular choice. However, the effectiveness of Lagrangian methods crucially depends on the choice of the Lagrange multiplier $\lambda$, which governs the trade-off between return and constraint cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there remains limited empirical evidence on the robustness of an automated update and its influence on overall performance. Therefore, we analyze (i) optimality and (ii) stability of Lagrange multipliers in safe reinforcement learning across a range of tasks. We provide $\lambda$-profiles that give a complete visualization of the trade-off between return and constraint cost of the optimization problem. These profiles show the highly sensitive nature of $\lambda$ and moreover confirm the lack of general intuition for choosing the optimal value $\lambda^*$. Our findings additionally show that automated multiplier updates are able to recover and sometimes even exceed the optimal performance found at $\lambda^*$ due to the vast difference in their learning trajectories. Furthermore, we show that automated multiplier updates exhibit oscillatory behavior during training, which can be mitigated through PID-controlled updates. However, this method requires careful tuning to achieve consistently better performance across tasks. This highlights the need for further research on stabilizing Lagrangian methods in safe reinforcement learning. The code used to reproduce our results can be found at https://github.com/lindsayspoor/Lagrangian_SafeRL.

中文标题/摘要

标题：安全强化学习中拉格朗日方法的经验研究

在机器人技术、导航和电力系统等关键安全领域，会遇到约束优化问题，需要在性能最大化与相关约束之间仔细平衡。安全强化学习提供了一种解决这些挑战的框架，拉格朗日方法是其中一种流行的选择。然而，拉格朗日方法的有效性关键取决于拉格朗日乘数$\lambda$的选择，它决定了回报与约束成本之间的权衡。一种常见的方法是在训练过程中自动更新乘数。尽管在实践中这是标准做法，但关于自动更新的鲁棒性及其对整体性能的影响，仍缺乏充分的实证证据。因此，我们分析了在一系列任务中安全强化学习中拉格朗日乘数的（i）最优性和（ii）稳定性。我们提供了$\lambda$-曲线，完整地展示了优化问题中回报与约束成本之间的权衡。这些曲线显示了$\lambda$的高度敏感性，并且证实了选择最优值$\lambda^*$的一般直觉缺乏。我们的研究结果还表明，自动更新的乘数能够恢复甚至超过在$\lambda^*$处找到的最优性能，这是因为它们的学习轨迹存在巨大差异。此外，我们展示了自动更新的乘数在训练过程中表现出振荡行为，可以通过PID控制更新来缓解。然而，这种方法需要仔细调整以在不同任务中实现一致的更好性能。这突显了在安全强化学习中进一步研究稳定拉格朗日方法的必要性。用于重现我们结果的代码可以在https://github.com/lindsayspoor/Lagrangian_SafeRL找到。
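
The abstract above contrasts a fixed multiplier $\lambda$ with automated and PID-controlled multiplier updates. The following is a minimal sketch of what a PID-controlled update could look like. It is not the authors' implementation: the class name `PIDMultiplier`, the gains `kp`, `ki`, `kd`, the `cost_limit` value, and the helper `penalized_return` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDMultiplier:
    """Hypothetical PID controller for a Lagrange multiplier (illustrative only)."""
    kp: float = 0.05          # proportional gain (assumed value)
    ki: float = 0.01          # integral gain (assumed value)
    kd: float = 0.02          # derivative gain (assumed value)
    cost_limit: float = 25.0  # constraint threshold (assumed value)
    integral: float = 0.0
    prev_cost: float = 0.0

    def update(self, episode_cost: float) -> float:
        # Constraint violation: positive when the cost exceeds the limit.
        error = episode_cost - self.cost_limit
        # Integral term, clipped at zero so it cannot go negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to increasing cost.
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        # The multiplier must stay non-negative.
        return max(0.0, lam)


def penalized_return(episode_return: float, episode_cost: float, lam: float) -> float:
    # Scalarized objective: return traded off against constraint cost by lambda.
    return episode_return - lam * episode_cost
```

In this sketch, setting `kp = kd = 0` leaves only the integral term, which behaves like a plain accumulated (gradient-ascent style) multiplier update; the proportional and derivative terms are what a PID variant adds to damp oscillations.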

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Authors: Ehsan Latif, Zirak Khan, Xiaoming Zhai

First: 2025-06-29T11:35:10+00:00 · Latest: 2025-10-20T13:55:37+00:00

Comments: Submitted to NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation either as an image classification task or as a job for monolithic vision-language models, both of which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items, each with a different highest-order Bloom's level, that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance: for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.

中文标题/摘要

标题：SketchMind：一种评估学生绘制科学草图的认知多代理框架

科学草图（例如，模型）为了解学生概念理解提供了一种强大的视角，然而，利用人工智能自动评估这种自由形式、视觉多样的艺术作品仍然是一个关键挑战。现有解决方案往往将草图评估视为图像分类任务或单一的视觉-语言模型，缺乏可解释性、教学一致性以及跨认知层次的适应性。为了解决这些限制，我们提出了SketchMind，这是一种基于认知的、多代理的评估和改进学生绘制科学草图的框架。SketchMind 包含负责评分标准解析、草图感知、认知对齐以及迭代反馈与草图修改的模块化代理，从而实现个性化和透明的评估。我们使用一个包含3,575个学生生成的草图的定制数据集，这些草图涉及六个科学评估项目，不同最高阶的布鲁姆水平要求学生绘制模型来解释现象。与没有SRG的基线GPT-4o性能（平均准确率：55.6%）相比，集成SRG后，平均准确率达到了77.1%（平均绝对增益+21.4%）。我们还展示了多代理协调与SRG集成增强了SketchMind的性能，例如，GPT-4.1在草图预测准确性上平均提高了8.9%，在所有项目中都优于单代理管道。人类评估者对由SketchMind与GPT-4.1生成的反馈和共创草图的评价平均得分为4.1分，显著高于基线模型（例如，GPT-4o的得分为2.3分）。专家指出，该系统有可能通过引导性修订有意义地支持概念发展。我们的代码和（待审批）数据集将被发布，以支持可重复性和未来的人工智能驱动教育研究。

Summary / 总结

SketchMind is a multi-agent cognitive framework designed to assess student-drawn scientific sketches, addressing the limitations of existing AI solutions by offering interpretability, pedagogical alignment, and adaptability. It comprises agents for rubric parsing, sketch perception, cognitive alignment, and iterative feedback, enabling personalized and transparent evaluation. On a dataset of 3,575 student-generated sketches, SketchMind achieved an average accuracy of 77.1%, a 21.4% absolute gain over baseline GPT-4o without SRG, and outperformed single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind significantly higher than those of baseline models, indicating its potential to support conceptual growth through guided revision.

SketchMind 是一个多代理认知框架，旨在评估学生绘制的科学草图，解决了现有 AI 解决方案在可解释性和适应性方面的局限性。它包括用于评分标准解析、草图感知、认知对齐和迭代反馈的代理。在 3,575 个学生生成的草图上进行评估，SketchMind 达到了 77.1% 的平均准确率，比基线 GPT-4o 高 21.4%，并且在所有项目中都优于单代理管道。人类评估者对 SketchMind 生成的反馈和共同创作的草图的评价高于基线模型，表明其通过引导性修订支持概念成长的潜力。

Parameter Efficient Fine-tuning via Explained Variance Adaptation

Authors: Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter

Venue: NeurIPS 2025

First: 2024-10-09T17:59:06+00:00 · Latest: 2025-10-20T13:48:17+00:00

Comments: Accepted at NeurIPS 2025, Shared first authorship, Code available at https://github.com/ml-jku/EVA

Abs · PDF · Code1 · Code2 · Code3

Abstract

Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization strategies for LoRA often rely on singular value decompositions (SVD) of gradients or weight matrices. However, they do not provably maximize the expected gradient signal, which is critical for fast adaptation. To this end, we introduce Explained Variance Adaptation (EVA), an initialization scheme that uses the directions capturing the most activation variance, provably maximizing the expected gradient signal and accelerating fine-tuning. EVA performs incremental SVD on minibatches of activation vectors and selects the right-singular vectors for initialization once they have converged. Further, by selecting the directions that capture the most activation variance for a given rank budget, EVA accommodates adaptive ranks that reduce the number of trainable parameters. We apply EVA to a variety of fine-tuning tasks such as language generation and understanding, image classification, and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution. In summary, EVA establishes a new Pareto frontier compared to existing LoRA initialization schemes in both accuracy and efficiency.

中文标题/摘要

标题：通过解释方差适应实现参数高效微调

基础模型（FMs）在大规模数据集上预训练，然后针对特定下游任务进行微调。最常见的微调方法是通过低秩适应（LoRA）更新预训练权重。现有的LoRA初始化策略通常依赖于梯度或权重矩阵的奇异值分解（SVD），但它们并不能证明最大化期望梯度信号，这对于快速适应至关重要。为此，我们引入了解释方差适应（EVA），这是一种初始化方案，使用捕获最多激活方差的方向，证明了最大化期望梯度信号并加速微调。EVA 对激活向量的小批量进行增量SVD，并在收敛后选择右奇异向量进行初始化。此外，通过选择在给定秩预算下捕获最多激活方差的方向，EVA 适应了可减少可训练参数数量的自适应秩。我们将在语言生成和理解、图像分类和强化学习等多种微调任务中应用EVA。EVA 展现了比竞争对手更快的收敛速度，并在多个任务上实现了最高的平均得分，同时通过秩重新分配减少了可训练参数的数量。总之，EVA 在准确性和效率方面建立了与现有LoRA初始化方案相比的新帕累托前沿。

Summary / 总结

The research aims to improve the efficiency of fine-tuning foundation models by proposing a new initialization method called Explained Variance Adaptation (EVA). EVA uses singular value decomposition on minibatches of activation vectors to select right-singular vectors, which maximizes the expected gradient signal. The method achieves faster convergence and higher average scores across various tasks while reducing the number of trainable parameters through adaptive ranks.

论文提出了Explained Variance Adaptation (EVA) 方法，用于初始化基础模型的低秩适应（LoRA）。EVA 通过对激活向量的 minibatch 进行奇异值分解来选择能捕获最多方差的方向，从而最大化期望梯度信号。这种方法加速了微调过程，并减少了可训练参数的数量。实验表明，EVA 在各种任务中的收敛速度更快，并且在保持效率的同时实现了更高的平均得分。
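
The EVA abstract describes selecting, for a given rank budget, the directions that capture the most activation variance, which leads to adaptive per-layer ranks. The sketch below illustrates only that redistribution step under simplifying assumptions: singular values per layer are taken as given (the incremental SVD is not shown), explained variance is normalized within each layer, and the function name `allocate_ranks` and the layer names are hypothetical. It should not be read as the released EVA code.

```python
def allocate_ranks(singular_values, rank_budget):
    """Distribute a global rank budget across layers by explained variance.

    singular_values: dict mapping a layer name to a list of singular values
    of that layer's activation matrix (assumed to be precomputed).
    """
    scored = []
    for layer, values in singular_values.items():
        total = sum(v * v for v in values)
        for index, v in enumerate(values):
            # Explained-variance ratio of one component within its layer.
            ratio = (v * v) / total if total > 0 else 0.0
            scored.append((ratio, layer, index))

    # Highest explained variance first; ties keep insertion order.
    scored.sort(key=lambda item: -item[0])

    ranks = {layer: 0 for layer in singular_values}
    for _ratio, layer, _index in scored[:rank_budget]:
        ranks[layer] += 1
    return ranks


# Usage with made-up singular values for two hypothetical layers.
example = {"layer_a": [3.0, 1.0], "layer_b": [2.0, 2.0]}
ranks = allocate_ranks(example, rank_budget=3)
```

Here `layer_a` has one dominant direction, so it receives a single rank, while `layer_b` spreads its variance evenly and receives two. A layer whose variance is concentrated in few directions thus needs fewer trainable parameters under the same budget.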

NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation

Authors: Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

First: 2025-09-18T12:10:47+00:00 · Latest: 2025-10-20T13:45:36+00:00

Comments: Accepted at IEEE ISpaRo 2025 (International Conference on Space Robotics) (8 pages, 2 figures)

Abs · PDF · Code1 · Code2

Abstract

On-orbit operations require the estimation of the relative 6D pose, i.e., position and orientation, between a chaser spacecraft and its target. While data-driven spacecraft pose estimation methods have been developed, their adoption in real missions is hampered by the lack of understanding of their decision process. This paper presents a method to visualize the 3D visual cues on which a given pose estimator relies. For this purpose, we train a NeRF-based image generator using the gradients back-propagated through the pose estimation network. This enforces the generator to render the main 3D features exploited by the spacecraft pose estimation network. Experiments demonstrate that our method recovers the relevant 3D cues. Furthermore, they offer additional insights on the relationship between the pose estimation network supervision and its implicit representation of the target spacecraft.

中文标题/摘要

标题：基于NeRF的3D线索可视化支持数据驱动的航天器姿态估计

在轨操作需要估计追逐航天器与其目标之间的相对6D姿态，即位置和姿态。虽然已经开发出了数据驱动的航天器姿态估计方法，但在实际任务中的应用受到对其决策过程理解不足的限制。本文提出了一种可视化给定姿态估计器依赖的3D视觉线索的方法。为此，我们使用通过姿态估计网络反向传播的梯度训练了一个基于NeRF的图像生成器。这促使生成器渲染由航天器姿态估计网络利用的主要3D特征。实验表明，我们的方法恢复了相关的3D线索，并且还提供了姿态估计网络监督与其对目标航天器的隐式表示之间的关系的额外见解。

Summary / 总结

The research aims to improve the understanding of data-driven spacecraft pose estimation methods by visualizing the 3D cues used by these methods. The authors use a NeRF-based image generator trained with gradients from the pose estimation network to visualize these cues. Experiments show that the method successfully recovers the relevant 3D features and provides insights into the relationship between the supervision and the network's representation of the spacecraft.

研究旨在通过可视化数据驱动的航天器姿态估计方法所依赖的3D线索，来提高对这些方法的理解。作者使用一个通过姿态估计网络梯度训练的NeRF基图像生成器来可视化这些线索。实验表明，该方法成功地恢复了相关的3D特征，并提供了监督与网络对航天器的内在表示之间的关系的见解。

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

Authors: Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska

First: 2025-10-20T13:35:12+00:00 · Latest: 2025-10-20T13:35:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

中文标题/摘要

标题：OncoReason：为稳健和可解释的生存预测结构化临床推理在LLMs中的应用

预测癌症治疗结果需要既准确又可解释的模型，特别是在面对异质性临床数据时。虽然大型语言模型（LLMs）在生物医学NLP中表现出色，但它们通常缺乏关键的结构化推理能力，这对于高风险决策支持至关重要。我们提出了一种统一的多任务学习框架，将自回归LLMs与临床推理相结合，用于MSK-CHORD数据集的结果预测。我们的模型被训练以联合执行二元生存分类、连续生存时间回归和自然语言理由生成。我们评估了三种对齐策略：（1）标准监督微调（SFT），（2）带有步骤推理提示（CoT）的SFT以激发逐步推理，以及（3）组相对策略优化（GRPO），这是一种强化学习方法，将模型输出对齐到专家衍生的推理轨迹。使用LLaMa3-8B和Med42-8B骨干进行的实验表明，步骤推理提示提高了F1分数6.0%，降低了MAE 12%，而GRPO在BLEU、ROUGE和BERTScore上实现了最先进的可解释性和预测性能。我们进一步表明，现有的生物医学LLMs由于架构限制往往无法生成有效的推理轨迹。我们的研究结果强调了多任务临床建模中推理意识对齐的重要性，并为精准肿瘤学中的可解释、可信赖的LLMs设定了新的基准。

Summary / 总结

The research aims to develop models that can predict cancer treatment outcomes accurately and interpretably. It introduces a unified multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for survival prediction on the MSK-CHORD dataset. Three alignment strategies are evaluated: standard supervised fine-tuning, Chain-of-Thought prompting, and Group Relative Policy Optimization. Experiments with LLaMa3-8B and Med42-8B backbones show that Chain-of-Thought prompting improves F1 by +6.0 and reduces MAE by 12%, while Group Relative Policy Optimization achieves state-of-the-art interpretability and predictive performance. The study highlights the importance of reasoning-aware alignment in clinical modeling and sets a new benchmark for interpretable LLMs in precision oncology.

研究旨在通过增强大型语言模型（LLMs）的推理能力，实现癌症治疗中准确且可解释的生存预测。研究采用多任务学习方法，使用自回归LLMs进行二元生存分类、连续生存时间回归和自然语言推理生成的训练。三种对齐策略——标准监督微调、链式思考提示和组相对策略优化——被评估。结果表明，链式思考提示提高了6.0%的F1分数并减少了12%的平均绝对误差，而组相对策略优化实现了最先进的可解释性和预测性能。研究强调了临床建模中推理意识对齐的重要性，并为精准肿瘤学中的可解释LLMs设定了新的基准。

Plasma Shape Control via Zero-shot Generative Reinforcement Learning

Authors: Niannian Wu, Rongpeng Li, Zongyu Yang, Yong Xiao, Ning Wei, Yihang Chen, Bo Li, Zhifeng Zhao, Wulyu Zhong

First: 2025-10-20T13:34:51+00:00 · Latest: 2025-10-20T13:34:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Traditional PID controllers have limited adaptability for plasma shape control, and task-specific reinforcement learning (RL) methods suffer from limited generalization and the need for repetitive retraining. To overcome these challenges, this paper proposes a novel framework for developing a versatile, zero-shot control policy from a large-scale offline dataset of historical PID-controlled discharges. Our approach synergistically combines Generative Adversarial Imitation Learning (GAIL) with Hilbert space representation learning to achieve dual objectives: mimicking the stable operational style of the PID data and constructing a geometrically structured latent space for efficient, goal-directed control. The resulting foundation policy can be deployed for diverse trajectory tracking tasks in a zero-shot manner without any task-specific fine-tuning. Evaluations on the HL-3 tokamak simulator demonstrate that the policy excels at precisely and stably tracking reference trajectories for key shape parameters across a range of plasma scenarios. This work presents a viable pathway toward developing highly flexible and data-efficient intelligent control systems for future fusion reactors.

中文标题/摘要

标题：基于零样本生成强化学习的等离子体形状控制

传统的PID控制器在等离子体形状控制方面适应性有限，而针对特定任务的强化学习（RL）方法则面临泛化能力有限和需要重复重新训练的问题。为克服这些挑战，本文提出了一种新颖的框架，从大规模历史PID控制放电的离线数据集中开发出一种通用的零样本控制策略。我们的方法将生成对抗模仿学习（GAIL）与希尔伯特空间表示学习相结合，以实现双重目标：模仿PID数据的稳定操作风格，并构建一个几何结构化的潜在空间，以实现高效、目标导向的控制。生成的基础策略可以在零样本情况下部署，无需任何特定任务的微调。在HL-3托卡马克模拟器上的评估表明，该策略在各种等离子体场景中能够精确且稳定地跟踪关键形状参数的参考轨迹。本文提出了一条开发高度灵活和数据高效智能控制系统以供未来聚变反应堆使用的方法。

Summary / 总结

This paper addresses the limitations of traditional PID controllers and task-specific reinforcement learning methods in plasma shape control. It introduces a zero-shot generative reinforcement learning framework that uses a large offline dataset of PID-controlled discharges. By combining GAIL with Hilbert space representation learning, the framework aims to mimic stable operational styles and create a structured latent space for efficient control. Experimental results on the HL-3 tokamak simulator show that the policy can precisely and stably track reference trajectories for key shape parameters across various plasma scenarios without task-specific fine-tuning.

本文解决了传统PID控制器和任务特定强化学习方法在等离子体形状控制中的局限性。它提出了一种零样本生成强化学习框架，利用PID控制下的大量离线数据集。通过结合GAIL和希尔伯特空间表示学习，该框架旨在模仿稳定的操作风格并创建一个结构化的潜在空间以实现高效的控制。在HL-3托卡马克模拟器上的实验结果表明，该策略可以在各种等离子体场景中精确且稳定地跟踪关键形状参数的参考轨迹，而无需进行任务特定的微调。

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Authors: Leander Diaz-Bone, Marco Bagatella, Jonas Hübotter, Andreas Krause

Venue: NeurIPS 2025

First: 2025-05-26T11:35:07+00:00 · Latest: 2025-10-20T12:58:20+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL's core premise, requiring efficient exploration coupled with long-horizon credit assignment, and overcoming these challenges is key for building self-improving agents with superhuman ability. Prior work commonly explores with the objective of solving many sparse-reward tasks, making exploration of individual high-dimensional, long-horizon tasks intractable. We argue that solving such challenging tasks requires solving simpler tasks that are relevant to the target task, i.e., whose achievement will teach the agent skills required for solving the target task. We demonstrate that this sense of direction, necessary for effective exploration, can be extracted from existing RL algorithms, without leveraging any prior information. To this end, we propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target, but independent of the volume of the space of all tasks. We then perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.

中文标题/摘要

标题：DISCOVER：自动化的稀疏奖励强化学习课程

稀疏奖励强化学习（RL）可以建模广泛的高度复杂任务。解决稀疏奖励任务是RL的核心前提，需要高效探索与长期信用分配相结合，并克服这些挑战是构建具有超人类能力的自我改进代理的关键。先前的工作通常旨在解决许多稀疏奖励任务，使得探索单个高维、长期任务变得不可行。我们认为，解决这些具有挑战性的任务需要解决与目标任务相关的更简单的任务，即完成这些任务将教会代理解决目标任务所需的技能。我们证明了这种方向感，对于有效的探索是必要的，可以从现有的RL算法中提取，无需利用任何先验信息。为此，我们提出了一种定向稀疏奖励目标条件的长期RL方法（DISCOVER），该方法在目标任务的方向上选择探索性目标。我们将DISCOVER与原则性的bandits探索联系起来，正式界定了目标任务变得可实现所需的时间，这取决于代理到目标的初始距离，但与所有任务空间的体积无关。然后我们在高维环境中进行了彻底的评估。我们发现，DISCOVER的目标选择解决了先前最先进的RL探索方法无法解决的探索问题。

Summary / 总结

The research addresses exploration in sparse-reward reinforcement learning by proposing DISCOVER, a method that selects exploratory goals in the direction of the target task without using prior information. The authors connect DISCOVER to principled exploration in bandits, bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target rather than the volume of the task space. In high-dimensional environments, its directed goal selection solves exploration problems that prior state-of-the-art exploration methods cannot.

研究旨在通过提出DISCOVER方法解决稀疏奖励强化学习中的探索挑战，该方法在不使用任何先验信息的情况下选择指向目标任务的探索性目标。作者将该方法与多臂老虎机（bandits）中的原则性探索联系起来，并在高维环境中进行了全面评估，结果表明它能够解决先前最先进的探索方法无法处理的探索问题。

Certified Self-Consistency: Statistical Guarantees and Test-Time Training for Reliable Reasoning in LLMs

Authors: Paula Cordero-Encinar, Andrew B. Duncan

First: 2025-10-20T12:14:12+00:00 · Latest: 2025-10-20T12:14:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances such as self-consistency and test-time reinforcement learning (TTRL) improve the reliability of large language models (LLMs) without additional supervision, yet their underlying mechanisms and statistical guarantees remain poorly understood. We present a unified framework for certifiable inference in LLMs, showing that majority voting provides a statistical certificate of self-consistency: under mild assumptions, the aggregated answer coincides with the mode of the model's terminal distribution with high probability. We derive finite-sample and anytime-valid concentration bounds that quantify this confidence, and introduce the Martingale Majority Certificate (MMC), a sequential stopping rule that adaptively determines when sufficient samples have been drawn. We further prove that label-free post-training methods such as TTRL implicitly sharpen the answer distribution by exponentially tilting it toward its mode, thereby reducing the number of samples required for certification. Building on this insight, we propose new post-training objectives that explicitly optimise this trade-off between sharpness and bias. Together, these results explain and connect two central test-time scaling strategies, self-consistency and TTRL, within a single statistical framework for label-free, certifiable reliability in reasoning LLMs.

中文标题/摘要

标题：认证自我一致性：统计保证与推理时训练以提高LLMs的可靠推理

近期的进展如自我一致性与推理时强化学习(TTRL)在不增加额外监督的情况下提高了大型语言模型(LLMs)的可靠性，但其背后的机制和统计保证仍不甚明了。我们提出了一种统一的LLMs认证推理框架，表明多数投票提供了自我一致性的统计证书：在轻微假设下，聚合的答案以高概率与模型终端分布的众数一致。我们推导出有限样本和任意时点有效的集中界来量化这种信心，并引入了鞅多数证书(MMC)，这是一种顺序停止规则，能够自适应地确定是否已足够采样。我们进一步证明，无标签后训练方法如TTRL通过指数倾斜其众数来隐式地使答案分布更加尖锐，从而减少认证所需的样本数量。基于这一见解，我们提出了新的后训练目标，明确优化这种尖锐性和偏差之间的权衡。这些结果解释并连接了两种核心的推理时扩展策略——自我一致性与TTRL——在单一的统计框架下，以实现无标签的、可认证的LLMs推理可靠性。

Summary / 总结

The paper introduces a unified framework for certifiable inference in large language models (LLMs) by leveraging majority voting and test-time reinforcement learning (TTRL). It provides statistical guarantees for self-consistency, deriving concentration bounds and introducing the Martingale Majority Certificate (MMC) for adaptive sample collection. The study shows that TTRL implicitly sharpens the answer distribution, reducing the need for samples, and proposes new post-training objectives to optimize the trade-off between sharpness and bias, thus enhancing the reliability of LLMs without additional supervision.

本文旨在为自我一致性与测试时强化学习（TTRL）提供统计保证。作者提出了一种统一框架，表明多数投票可以在统计上认证自我一致性，并推导出集中性界来量化这种置信度。他们引入了鞅多数证书（MMC）这一序贯停止规则进行自适应采样，并证明TTRL隐式地使答案分布更加尖锐，从而减少认证所需的样本数量。该文提出了新的后训练目标，以优化尖锐性与偏差之间的权衡，将自我一致性与TTRL连接在一个统计框架中，用于LLM的可认证可靠性。
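
The abstract above treats majority voting over sampled answers as a statistical certificate and quantifies confidence with concentration bounds. The sketch below shows the general idea with a simple fixed-sample, Hoeffding-style margin check between the two most frequent answers. It is an assumption-laden illustration, not the paper's Martingale Majority Certificate: it is not anytime-valid, it ignores the union bound over all candidate answers, and the function name `majority_with_margin` and the default `delta` are hypothetical.

```python
import math
from collections import Counter


def majority_with_margin(answers, delta=0.05):
    """Return (majority answer, empirical margin, passes_check).

    The check compares the empirical margin between the top two answers
    with a Hoeffding-style threshold sqrt(2 * ln(1 / delta) / n).
    """
    n = len(answers)
    if n == 0:
        raise ValueError("at least one sampled answer is required")

    # Two most frequent answers and their counts.
    top_two = Counter(answers).most_common(2)
    top_answer, top_count = top_two[0]
    runner_up_count = top_two[1][1] if len(top_two) > 1 else 0

    # Empirical gap in frequency between the winner and the runner-up.
    margin = (top_count - runner_up_count) / n
    # Each vote difference lies in [-1, 1], giving this Hoeffding radius.
    threshold = math.sqrt(2.0 * math.log(1.0 / delta) / n)
    return top_answer, margin, margin > threshold


# Usage with made-up sampled answers.
answer, margin, ok = majority_with_margin(["42"] * 90 + ["41"] * 10)
```

A sharper answer distribution produces a larger margin, so fewer samples are needed before the check passes, which matches the abstract's point that sharpening reduces the number of samples required for certification.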