Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes
Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie
First: 2026-01-26T18:57:00+00:00 · Latest: 2026-01-26T18:57:00+00:00
Abstract
Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
中文标题/摘要
标题:重用你的FLOPs:通过条件化非常离策前缀扩展强化学习到难题
典型的强化学习(RL)方法在LLM推理中对难题浪费计算资源,因为正确的在策轨迹罕见,策略梯度消失,学习停滞。为了更高效地启动RL,我们考虑重用旧的采样FLOPs(来自先前推理或RL训练)作为离策轨迹。标准的离策方法使用离策数据进行监督,导致RL优化过程中出现不稳定性。我们引入了PrefixRL,其中我们以成功离策轨迹的前缀为条件,并运行在策RL来补全这些轨迹,从而绕过离策不稳定性。PrefixRL通过调整离策前缀长度来调节问题难度,从而增强在难题上的学习信号。我们证明PrefixRL目标不仅与标准RL目标一致,而且更具样本效率。实验中,我们发现了反向泛化:仅在带前缀的问题上训练即可泛化到分布外的无前缀问题,且学到的策略往往与前缀中的策略不同。在我们的实验中,我们通过对基础模型进行拒绝采样来获取离策轨迹,从而形成自我改进循环。在困难的推理问题上,即使计入初始拒绝采样的计算开销,PrefixRL达到相同训练奖励的速度也比最强基线(先在离策数据上进行SFT再进行RL)快2倍,并将最终奖励提高3倍。这些收益可迁移到留出的基准上,并且当离策轨迹来自不同的模型家族时,PrefixRL依然有效,验证了其在实际场景中的灵活性。
Summary / 总结
The paper addresses the inefficiency of RL for LLM reasoning on hard problems, where correct on-policy traces are rare and the learning signal vanishes. It introduces PrefixRL, which conditions on prefixes of successful off-policy traces and runs on-policy RL to complete them, using the prefix length to modulate problem difficulty. Experiments show that PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL), even after accounting for the initial rejection-sampling compute, and increases the final reward by 3x on hard reasoning problems. It also exhibits back-generalization: training only on prefixed problems improves performance on the unprefixed versions.
论文提出PrefixRL方法,通过利用成功的离策策略前缀来辅助完成在策学习,从而提高学习信号。实验表明,PrefixRL在处理难题时比最强基线快2倍达到相同的训练奖励,并将最终奖励提高3倍。此外,该方法还展示了反向泛化能力,即从前缀问题中学到的策略可以泛化到未见过的无前缀问题上。
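The prefix-conditioning idea in the PrefixRL abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function name `make_prefixed_prompts`, the token-list representation, and the fixed list of prefix fractions are assumptions introduced here. The abstract states only that the prefix length modulates problem difficulty.

```python
from typing import List, Sequence, Tuple


def make_prefixed_prompts(
    problem: Sequence[str],
    trace: Sequence[str],
    fractions: Sequence[float],
) -> List[Tuple[List[str], int]]:
    """Build prompts that append a prefix of a successful off-policy trace.

    Hypothetical helper: each returned pair is (prompt_tokens, prefix_len).
    A longer prefix leaves less of the solution for the policy to find,
    so the fraction acts as a difficulty knob.
    """
    prompts = []
    for frac in fractions:
        if not 0.0 <= frac <= 1.0:
            raise ValueError("prefix fraction must lie in [0, 1]")
        prefix_len = int(len(trace) * frac)
        # The prefix is context only; in this reading of the abstract,
        # on-policy RL would be applied to the tokens generated after it.
        prompts.append((list(problem) + list(trace[:prefix_len]), prefix_len))
    return prompts


if __name__ == "__main__":
    demo = make_prefixed_prompts(["Q"], ["a", "b", "c", "d"], [0.0, 0.5])
    for prompt, prefix_len in demo:
        print(prefix_len, prompt)
```

A fraction of 0.0 recovers the original unprefixed problem, which is the setting the abstract reports back-generalization to.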
Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Authors: Mumin Jia, Jairo Diaz-Rodriguez
First: 2026-01-26T18:54:34+00:00 · Latest: 2026-01-26T18:54:34+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.03437
Abstract
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
中文标题/摘要
标题:基于句子嵌入核变化点检测的无监督文本分段
无监督文本分段至关重要,因为边界标签成本高、主观性强且难以在不同领域和粒度选择间转移。我们提出了一种无需训练的方法Embed-KCPD,将句子表示为嵌入向量,并通过最小化惩罚核变化点检测目标来估计边界。除了算法实现,我们还开发了关于$m$依赖序列的核变化点检测的第一种依赖意识理论,这是一种语言中常见的短程依赖的有限记忆抽象。我们证明了总体惩罚风险的oracle不等式,并证明了每个真实变化点在相对于段长度较小的窗口内被恢复。为了将理论与实践连接起来,我们引入了一种基于LLM的模拟框架,生成具有可控有限记忆依赖和已知边界的合成文档,验证了预测的缩放行为。在标准分段基准测试中,Embed-KCPD经常优于强大的无监督基线。对泰勒·斯威夫特的推文进行的案例研究表明,Embed-KCPD结合了强大的理论保证、模拟可靠性和实际有效性。
Summary / 总结
The paper addresses the challenge of unsupervised text segmentation by proposing Embed-KCPD, which uses sentence embeddings and kernel change-point detection to estimate boundaries without training. The method provides theoretical guarantees for recovering true change points and is validated through an LLM-based simulation framework. Experiments on standard benchmarks show that Embed-KCPD often outperforms other unsupervised methods. A case study on Taylor Swift's tweets demonstrates its practical effectiveness and reliability.
论文通过提出Embed-KCPD方法,利用句子嵌入和核变化点检测来无监督地估计文本边界,无需训练。该方法提供了恢复真实变化点的理论保证,并通过基于LLM的模拟框架进行了验证。实验表明,Embed-KCPD在标准基准上通常优于其他无监督方法。以泰勒·斯威夫特的推文为例,展示了其实际有效性和可靠性。
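To make the "penalized KCPD objective" in the Embed-KCPD abstract concrete, here is a minimal sketch of generic penalized kernel change-point detection solved by dynamic programming. It is not the authors' code: the linear kernel, the constant per-change-point penalty, and the names `linear_kernel` and `kernel_change_points` are assumptions, and the toy vectors stand in for sentence embeddings.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    return sum(a * b for a, b in zip(x, y))


def kernel_change_points(
    X: Sequence[Vector],
    penalty: float,
    kernel: Callable[[Vector, Vector], float] = linear_kernel,
) -> List[int]:
    """Return change-point indices minimizing kernel cost + penalty * #changes."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    # 2D prefix sums of the Gram matrix and 1D prefix sums of its diagonal,
    # so that any segment cost can be read off in O(1).
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    D = [0.0] * (n + 1)
    for i in range(n):
        D[i + 1] = D[i] + K[i][i]
        for j in range(n):
            S[i + 1][j + 1] = K[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]

    def cost(a: int, b: int) -> float:
        # Within-segment scatter in feature space for points a..b-1.
        block = S[b][b] - S[a][b] - S[b][a] + S[a][a]
        return (D[b] - D[a]) - block / (b - a)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            c = best[s] + cost(s, t) + (penalty if s > 0 else 0.0)
            if c < best[t]:
                best[t], prev[t] = c, s
    # Backtrack through the stored segment starts.
    change_points, t = [], n
    while prev[t] > 0:
        t = prev[t]
        change_points.append(t)
    return sorted(change_points)


if __name__ == "__main__":
    toy = [[0.0, 0.0]] * 5 + [[5.0, 5.0]] * 5
    print(kernel_change_points(toy, penalty=1.0))
```

The penalty trades off fit against the number of segments: a larger value yields fewer boundaries. This exhaustive dynamic program is quadratic in the number of sentences, which is acceptable for a sketch but may not match how the paper computes its estimator.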
Multi-Objective Reinforcement Learning for Efficient Tactical Decision Making for Trucks in Highway Traffic
Authors: Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani
First: 2026-01-26T18:50:21+00:00 · Latest: 2026-01-26T18:50:21+00:00
Abstract
Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a continuous set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a continuous set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.
中文标题/摘要
标题:高速公路交通中重型卡车战术决策高效多目标强化学习
在高速公路驾驶中平衡安全、效率和运营成本构成了重型车辆决策制定的挑战性问题。一个主要困难是,传统的标量奖励形式化通常通过聚合这些竞争目标来模糊它们之间的权衡结构。我们提出了一种基于近端策略优化的多目标强化学习框架,该框架学习了一组连续的策略,明确表示这些权衡,并在可扩展的模拟平台上评估其在卡车战术决策中的应用。所提出的方法学习了一组帕累托最优策略,这些策略捕捉了三个相互冲突目标之间的权衡:安全性,用碰撞和成功完成度量;能源效率和时间效率,分别用能源成本和司机成本衡量。由此产生的帕累托前沿是平滑且可解释的,能够灵活地在不同冲突目标之间选择驾驶行为。该框架允许在无需重新训练的情况下无缝过渡到不同的驾驶策略,从而为自主卡车应用提供稳健且适应性强的决策制定策略。
Summary / 总结
The paper addresses the challenge of balancing safety, efficiency, and operational costs in highway driving for heavy-duty vehicles through a multi-objective reinforcement learning approach. It uses Proximal Policy Optimization to learn a continuous set of Pareto-optimal policies that explicitly represent the trade-offs among safety, energy efficiency, and time efficiency. The resulting Pareto frontier is smooth and interpretable, allowing for flexible and adaptive decision-making strategies for autonomous trucking applications.
论文通过多目标强化学习框架解决重型车辆在高速公路驾驶中平衡安全、效率和运营成本的挑战。该框架使用Proximal Policy Optimization来学习代表安全、能源效率和时间效率之间权衡的一系列帕累托最优策略。由此产生的帕累托前沿平滑且可解释,允许在不同驾驶场景中灵活决策而无需重新训练,这对于自动驾驶卡车应用至关重要。
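The notion of Pareto optimality used in the trucking abstract can be illustrated with a small dominance filter. This sketch only shows what "Pareto-optimal" means for a finite set of evaluated policies; it does not reproduce the paper's PPO-based method, which learns a continuous set of policies. The cost tuples (collision cost, energy cost, driver cost) and their values are hypothetical, with lower taken to be better.

```python
from typing import List, Sequence, Tuple

Costs = Tuple[float, ...]


def dominates(a: Costs, b: Costs) -> bool:
    """a dominates b if it is no worse on every objective and better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: Sequence[Costs]) -> List[Costs]:
    """Keep the points that no other point dominates (lower cost is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (collision, energy, driver) costs for four policies.
    evaluated = [(1.0, 5.0, 3.0), (2.0, 2.0, 2.0), (3.0, 3.0, 3.0), (1.0, 5.0, 4.0)]
    print(pareto_front(evaluated))
```

Policies on the front cannot be improved on one objective without giving something up on another, which is the trade-off structure a single scalar reward tends to hide.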
POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
Authors: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar
First: 2026-01-26T18:47:21+00:00 · Latest: 2026-01-26T18:47:21+00:00
Abstract
Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
中文标题/摘要
标题:POPE:通过特权在策探索学习在难题上推理
强化学习(RL)已提升了大型语言模型(LLMs)的推理能力,但最先进的方法在许多训练问题上仍然无法学习。在难题上,在策RL往往连一条正确的rollout都探索不到,导致奖励为零,没有可驱动改进的学习信号。我们发现,经典RL中用于缓解此类探索问题的自然手段,如熵奖励、更宽松的重要性比率裁剪或直接优化pass@k目标,都无法解决此问题,且往往在不提高可解性的情况下使优化不稳定。一种自然的替代方法是利用来自较易问题的迁移。然而,我们证明,在RL训练中混合易题和难题会适得其反,其原因是射线干扰(ray interference):优化集中在已可解的问题上,从而主动抑制了在更难问题上的进展。为应对这一挑战,我们提出了特权在策探索(POPE),该方法利用人类或其他oracle给出的解答作为特权信息来引导难题上的探索,不同于将oracle解答用作训练目标的方法(例如离策RL方法或以SFT热启动)。POPE为难题附加oracle解答的前缀,使RL能够在引导式rollout中获得非零奖励。关键的是,由此产生的行为通过指令遵循与推理之间的协同作用,迁移回原始的无引导问题。实验上,POPE扩展了可解问题的范围,并在具有挑战性的推理基准测试中显著提高了性能。
Summary / 总结
The paper addresses the failure of on-policy RL to learn on hard problems, where no correct rollout is sampled and the reward stays at zero. It proposes Privileged On-Policy Exploration (POPE), which augments hard problems with prefixes of human or other oracle solutions as privileged information to guide exploration, rather than using those solutions as training targets, so that RL obtains non-zero rewards during guided rollouts. The learned behaviors transfer back to the original, unguided problems, which expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
针对大型语言模型(LLMs)在难题上进行强化学习(RL)时的探索问题(标准方法在此类问题上得不到学习信号),论文提出了特权在策探索(POPE)方法,利用人类或其他oracle给出的解答来引导探索,使RL在引导式rollout中获得非零奖励,并在推理基准测试中显著提升性能。该方法通过指令遵循和推理之间的协同作用,将学到的行为迁移回原始的无引导问题。
Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
First: 2026-01-26T18:46:56+00:00 · Latest: 2026-01-26T18:46:56+00:00
Abstract
Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
中文标题/摘要
标题:教模型自我教学:可学习性边缘的推理
模型能否学会突破自身的学习瓶颈?针对初始成功率低、训练信号少的数据集,强化学习方法在微调大型推理模型时会停滞不前。我们探讨了一个基本问题:预训练的语言模型能否利用潜在知识为它无法解决的问题生成自动课程?为此,我们设计了SOAR:一种自我改进框架,通过元强化学习揭示这些教学信号。教师模型副本为学生模型副本提出合成问题,并根据其在一小部分难题上的改进程度获得奖励。关键的是,SOAR将课程建立在可测量的学生进步之上,而不是内在的代理奖励之上。在数学基准中最难的子集(0/128成功率)上进行的研究揭示了三个核心发现。首先,我们展示了通过增强预训练模型生成有用阶梯的能力,实现双层元强化学习的可能性,从而在稀疏的二元奖励下解锁学习。其次,基于实际奖励的表现优于先前LLM自我博弈中使用的内在奖励方案,可靠地避免了它们通常表现出的不稳定性及多样性崩溃模式。第三,分析生成的问题表明,结构质量和良好定义比解的正确性对学习进步更为关键。我们的研究结果表明,生成有用阶梯的能力不需要预先具备解决难题的能力,为在无需额外精心策划数据的情况下逃离推理瓶颈提供了一条有原则的路径。
Summary / 总结
The study aims to explore whether a model can learn to improve itself by generating a curriculum for problems it cannot solve yet. The researchers developed SOAR, a self-improvement framework using meta-reinforcement learning, where a teacher model proposes synthetic problems for a student model, and the teacher is rewarded based on the student's improvement on hard problems. Key findings include the realization of bi-level meta-RL under sparse rewards, the superiority of grounded rewards over intrinsic rewards, and the importance of structural quality and well-posedness in generating useful stepping stones for learning progress.
研究旨在探索模型是否可以通过生成自己的课程来自我改进,尤其是在遇到无法解决的问题时。研究人员设计了SOAR框架,使用元强化学习,其中教师模型为学生模型提出合成问题,教师根据学生在难题上的改进获得奖励。关键发现包括在稀疏奖励下实现双层元强化学习、使用外生奖励相比内在奖励表现更好、以及结构质量和良好定义性在生成有用的学习阶梯中更为关键。
DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views
Authors: William Huang, Siyou Pei, Leyi Zou, Eric J. Gonzalez, Ishan Chatterjee, Yang Zhang
First: 2026-01-21T23:00:43+00:00 · Latest: 2026-01-26T18:45:41+00:00
Comments: 16 pages, 11 figures, Presented at ACM CHI 2026. For associated codebase, see https://github.com/hilab-open-source/deltadorsal
Abstract
The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >= 50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement while minimizing model size.
中文标题/摘要
标题:DeltaDorsal:利用背侧特征增强第一人称手部姿态估计
XR设备的普及使得第一人称手部姿态估计成为一项重要任务,但这一视角因手指遮挡频繁而受到挑战。为解决这一问题,我们提出了一种新颖的方法,借助密集视觉特征提取器的最新进展,利用手背皮肤变形中的丰富信息。我们引入了一种双流差分编码器,通过对比动态手部与基线放松姿势的特征来学习姿态。我们的评估表明,仅使用裁剪的手背图像,我们的方法在自遮挡场景(手指遮挡不低于50%)中将平均每关节角度误差(MPJAE)降低了18%,优于依赖整个手部几何结构和大型模型骨干网络的最新技术。因此,我们的方法不仅提高了遮挡场景下食指捏合和轻击估计等下游任务的可靠性,还解锁了新的交互范式,例如在没有可见运动的情况下检测用于表面“点击”的等长力,同时最大限度地减小模型规模。
Summary / 总结
This paper addresses the challenge of finger occlusions in egocentric hand pose estimation by proposing DeltaDorsal, a method that utilizes dorsal hand skin deformation. It introduces a dual-stream delta encoder to learn hand poses by contrasting features from a dynamic hand with a relaxed baseline. The method significantly reduces Mean Per Joint Angle Error (MPJAE) by 18% in scenarios with heavy finger occlusions, improving the reliability of downstream tasks and enabling new interaction paradigms with smaller model sizes.
该研究提出了一种名为DeltaDorsal的方法,利用手背皮肤变形来改善在手指遮挡情况下的自我中心手部姿态估计。通过使用双流差分编码器,该方法在自我遮挡场景下的Mean Per Joint Angle Error (MPJAE) 比最先进的技术降低了18%。这种方法不仅提高了下游任务的可靠性,还开启了无需大模型就能实现的新交互模式。
Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory
Authors: Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang
First: 2026-01-26T18:42:33+00:00 · Latest: 2026-01-26T18:42:33+00:00
Comments: Dep-Search 1st version
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs' ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.
中文标题/摘要
标题:Dep-Search:使用持久内存学习依赖感知推理轨迹
大型语言模型(LLMs)在复杂推理任务中展现了卓越的能力,特别是在与搜索机制结合后,能够系统地探索外部知识库。该领域已从传统的检索增强生成(RAG)框架发展到更复杂的基于搜索的框架,通过显式的搜索策略协调多步推理。然而,现有的搜索框架仍然高度依赖隐式的自然语言推理来确定搜索策略以及如何在推理步骤中利用检索到的信息。这种对隐式推理的依赖为管理子问题之间的依赖关系、高效重用已检索的知识以及通过强化学习学习最优搜索策略带来了根本性的挑战。为了解决这些限制,我们提出了Dep-Search,这是一种依赖感知的搜索框架,通过GRPO整合结构化推理、检索和持久内存,超越了现有的搜索框架。Dep-Search引入了明确的控制机制,使模型能够分解具有依赖关系的问题,按需检索信息,从内存中访问之前存储的知识,并将长推理上下文总结为可重用的内存条目。通过在七个不同的问答数据集上进行广泛的实验,我们证明了Dep-Search显著增强了LLMs处理复杂多跳推理任务的能力,不同模型规模下均实现了显著的改进。
Summary / 总结
Dep-Search is a dependency-aware search framework that improves large language models' ability to handle complex reasoning tasks by integrating structured reasoning, retrieval, and persistent memory. It introduces explicit control mechanisms to manage dependencies between sub-questions, efficiently reuse retrieved knowledge, and summarize reasoning contexts. Experiments on seven datasets show that Dep-Search outperforms strong baselines, particularly in multi-hop reasoning tasks.
Dep-Search 是一种依赖感知的搜索框架,通过整合结构化推理、检索和持久化记忆来提升大型语言模型处理复杂推理任务的能力。它解决了现有搜索框架的局限性,通过实现对推理过程的显式控制,帮助管理子问题之间的依赖关系并高效重用检索到的知识。在七个不同数据集上的实验表明,Dep-Search 在不同模型规模下均优于强基线,特别是在复杂的多跳推理任务中表现出显著的提升。
HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration
Authors: Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, Linfeng Zhang
First: 2025-08-23T10:35:16+00:00 · Latest: 2026-01-26T18:39:41+00:00
Abstract
Diffusion models have achieved remarkable success in content generation but often incur prohibitive computational costs due to iterative sampling. Recent feature caching methods accelerate inference via temporal extrapolation, yet can suffer quality degradation from inaccurate modeling of the complex dynamics of feature evolution. We propose HiCache (Hermite Polynomial-based Feature Cache), a training-free acceleration framework that improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature-derivative approximations in diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials as a potentially optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy, and is also effective when applied standalone or integrated with TaylorSeer. Extensive experiments demonstrate HiCache's superiority, achieving 5.55x speedup on FLUX.1-dev while matching or exceeding baseline quality, and maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Moreover, HiCache can be naturally added to previous caching methods to enhance their performance, e.g., improving ClusCa from 0.9480 to 0.9840 in terms of image rewards. Code: https://github.com/fenglang918/HiCache
中文标题/摘要
标题:HiCache:面向Taylor式“先缓存后预测”扩散加速的即插即用缩放Hermite升级
扩散模型在内容生成方面取得了显著成功,但由于迭代采样,往往会产生高昂的计算成本。最近的特征缓存方法通过时间外推加速推理,但可能会因对特征演变复杂动态建模不准确而导致质量下降。我们提出了HiCache(基于Hermite多项式的特征缓存),这是一种无需训练的加速框架,通过将数学工具与经验特性对齐来提高特征预测能力。我们的核心见解是,扩散Transformer中的特征导数近似表现出多元高斯特性,这促使我们使用Hermite多项式作为高斯相关过程的潜在最优基。我们还引入了一种双重缩放机制,以确保数值稳定性同时保持预测准确性,并且在单独使用或与TaylorSeer集成时也有效。广泛的实验表明HiCache的优越性,在FLUX.1-dev上实现了5.55倍的加速,同时匹配或超过了基线质量,并在文本到图像、视频生成和超分辨率任务中保持了强大的性能。此外,HiCache可以自然地添加到先前的缓存方法中以增强其性能,例如,将ClusCa的图像奖励从0.9480提高到0.9840。代码:https://github.com/fenglang918/HiCache
Summary / 总结
HiCache is a training-free acceleration framework for diffusion models that uses Hermite polynomials to improve feature prediction, addressing the quality degradation issue in previous caching methods. It introduces a dual-scaling mechanism to ensure numerical stability and predictive accuracy. Experiments show that HiCache achieves a 5.55x speedup on FLUX.1-dev while maintaining or improving quality across various tasks, and can enhance the performance of existing caching methods.
HiCache 是一种无需训练的加速框架,通过使用 Hermite 多项式增强特征预测。它引入了一种双重缩放机制以确保数值稳定性和准确性。实验表明,HiCache 在 FLUX.1-dev 上实现了 5.55 倍的加速,同时在文本到图像、视频生成和超分辨率等任务中保持或提升了质量。此外,HiCache 还可以增强现有缓存方法如 ClusCa 的性能。
Learning to Discover: A Generalized Framework for Raga Identification without Forgetting
Authors: Parampreet Singh, Somya Kumar, Chaitanya Shailendra Nitawe, Vipul Arora
First: 2026-01-26T18:37:30+00:00 · Latest: 2026-01-26T18:37:30+00:00
Comments: Accepted at NCC 2026 conference
Abstract
Raga identification in Indian Art Music (IAM) remains challenging due to the presence of numerous rarely performed Ragas that are not represented in available training datasets. Traditional classification models struggle in this setting, as they assume a closed set of known categories and therefore fail to recognise or meaningfully group previously unseen Ragas. Recent works have tried categorizing unseen Ragas, but they run into a problem of catastrophic forgetting, where the knowledge of previously seen Ragas is diminished. To address this problem, we adopt a unified learning framework that leverages both labeled and unlabeled audio, enabling the model to discover coherent categories corresponding to the unseen Ragas, while retaining the knowledge of previously known ones. We test our model on benchmark Raga Identification datasets and demonstrate its performance in categorizing previously seen, unseen, and all Raga classes. The proposed approach surpasses the previous NCD-based pipeline even in discovering the unseen Raga categories, offering new insights into representation learning for IAM tasks.
中文标题/摘要
标题:学习发现:一种通用框架用于在不遗忘的情况下识别拉格
印度艺术音乐(IAM)中的拉格识别仍然具有挑战性,因为存在大量很少表演的拉格,这些拉格在可用的训练数据集中未被代表。传统的分类模型在这种情况下表现不佳,因为它们假设已知的封闭类别集,并因此无法识别或有意义地分组之前未见过的拉格。最近的研究试图对未见过的拉格进行分类,但它们遇到了灾难性遗忘的问题,即之前见过的拉格的知识被削弱。为了解决这个问题,我们采用了一种统一的学习框架,该框架利用了有标签和无标签的音频,使模型能够发现与未见过的拉格对应的连贯类别,同时保留已知拉格的知识。我们在基准拉格识别数据集上测试了我们的模型,并展示了其在分类已见过、未见过和所有拉格类别中的性能。所提出的方法即使在发现未见过的拉格类别方面也超越了基于NCD的管道,为IAM任务中的表示学习提供了新的见解。
Summary / 总结
The research aims to improve raga identification in Indian Art Music by addressing the challenge of numerous rarely performed ragas not present in training datasets. The method uses a unified learning framework that incorporates both labeled and unlabeled audio data to discover coherent categories for unseen ragas while retaining knowledge of previously known ones. The model outperforms previous approaches, particularly in identifying unseen raga categories, and demonstrates superior performance on benchmark datasets.
研究旨在通过解决训练数据集中未包含众多罕见表演的拉格问题,提高印度艺术音乐中的拉格识别。方法采用结合标记和未标记音频数据的统一学习框架,以发现未见过的拉格的连贯类别,同时保留已知拉格的知识。该模型在识别未见过的拉格类别方面优于先前的方法,展示了其在拉格识别任务中的有效性。
Brain-Inspired Perspective on Configurations: Unsupervised Similarity and Early Cognition
Authors: Juntang Wang, Yihan Wang, Hao Wu, Dongmian Zou, Shixin Xu
First: 2025-10-22T04:28:23+00:00 · Latest: 2026-01-26T18:34:12+00:00
Comments: 13 pages, 4 figures, conference paper. Equal contribution: Juntang Wang, Yihan Wang and Hao Wu
Abstract
Infants discover categories, detect novelty, and adapt to new contexts without supervision, a challenge for current machine learning. We present a brain-inspired perspective on configurations, a finite-resolution clustering framework that uses a single resolution parameter and attraction-repulsion dynamics to yield hierarchical organization, novelty sensitivity, and flexible adaptation. To evaluate these properties, we introduce mheatmap, which provides proportional heatmaps and a reassignment algorithm to fairly assess multi-resolution and dynamic behavior. Across datasets, configurations are competitive on standard clustering metrics, achieve 87% AUC in novelty detection, and show 35% better stability during dynamic category evolution. These results position configurations as a principled computational model of early cognitive categorization and a step toward brain-inspired AI.
中文标题/摘要
标题:基于大脑视角的配置研究:无监督相似性和早期认知
婴儿在无监督的情况下发现类别、检测新颖性并适应新环境——这对当前的机器学习来说是一个挑战。我们提出了一种基于大脑视角的配置方法,这是一种有限分辨率的聚类框架,使用单一的分辨率参数和吸引-排斥动力学来产生分层组织、新颖性敏感性和灵活的适应性。为了评估这些特性,我们引入了mheatmap,它提供了比例热图和重新分配算法以公平评估多分辨率和动态行为。在各种数据集中,配置在标准聚类指标上具有竞争力,在新颖性检测中达到87%的AUC,并在动态类别演变过程中表现出35%更好的稳定性。这些结果将配置定位为早期认知分类的原理性计算模型,并朝着基于大脑的AI迈出一步。
Summary / 总结
The paper takes inspiration from infant cognitive development to address unsupervised category discovery, novelty detection, and adaptation. It introduces configurations, a finite-resolution clustering framework that uses a single resolution parameter and attraction-repulsion dynamics to achieve hierarchical organization and adaptability. The mheatmap tool, which provides proportional heatmaps and a reassignment algorithm, is used to evaluate these properties: configurations are competitive on standard clustering metrics, reach 87% AUC in novelty detection, and show 35% better stability during dynamic category evolution. These findings position configurations as a candidate model of early cognitive categorization and a step toward brain-inspired AI.
论文通过提出一种受大脑启发的配置框架来解决机器学习中的无监督学习挑战。该框架使用单一分辨率参数和吸引-排斥动力学实现层次组织和适应性。开发了mheatmap工具来评估框架性能,结果显示在标准聚类指标上具有竞争力,新颖性检测的AUC达到87%,并且在动态类别演变过程中稳定性提高了35%。这些结果表明,配置可作为早期认知分类的原理性模型,并且是朝着大脑启发式AI迈出的一步。
DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Authors: Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
First: 2025-09-11T20:03:00+00:00 · Latest: 2026-01-26T18:33:05+00:00
Comments: Code and models are available at https://github.com/timbroed/DGFusion
Abstract
Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models are available at https://github.com/timbroed/DGFusion
中文标题/摘要
标题:DGFusion:深度引导的传感器融合以实现鲁棒的语义感知
自动驾驶车辆的鲁棒语义感知依赖于有效结合具有互补优势和劣势的多种传感器。最先进的语义感知传感器融合方法通常在输入的空间范围内均匀处理传感器数据,这在面对挑战性条件时会阻碍性能。相比之下,我们提出了一种新颖的深度引导多模态融合方法,通过整合深度信息来提升条件感知融合。我们的网络DGFusion将多模态分割视为一个多任务问题,利用通常在户外传感器套件中可用的激光雷达测量数据,作为模型的输入之一和学习深度的地面真实值。我们相应的辅助深度头有助于学习深度感知特征,这些特征被编码为在空间上变化的局部深度令牌,以条件我们的注意跨模态融合。结合全局条件令牌,这些局部深度令牌动态适应场景中每个传感器的空间变化可靠性,这在很大程度上取决于深度。此外,我们提出了一种鲁棒的深度损失,这对于从通常在恶劣条件下稀疏且噪声的激光雷达输入中学习至关重要。我们的方法在具有挑战性的MUSES和DeLiVER数据集上实现了最先进的全景和语义分割性能。代码和模型可在https://github.com/timbroed/DGFusion获取
Summary / 总结
DGFusion proposes a depth-guided multimodal fusion method for robust semantic perception in autonomous vehicles. It integrates depth information to enhance condition-aware fusion, using lidar measurements as both input and ground truth. The method dynamically adapts sensor fusion based on the spatial reliability of each sensor, achieving state-of-the-art performance on MUSES and DeLiVER datasets.
DGFusion 提出了一种基于深度的多模态融合方法,以增强自主车辆的语义感知。该方法通过整合通常来自激光雷达的深度信息,解决了现有方法在处理传感器数据时的局限性。网络 DGFusion 使用激光雷达数据作为输入和 ground truth 来学习深度感知特征,并将其编码为局部深度令牌,这些令牌可以条件化跨模态融合。这种方法根据每个传感器在场景中的空间可靠性动态调整融合,从而在 MUSES 和 DeLiVER 数据集上达到了最先进的性能。
$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
First: 2026-01-26T18:25:07+00:00 · Latest: 2026-01-26T18:25:07+00:00
Abstract
Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents in reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings.
We introduce $α^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $α^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers, including sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $α^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage).
We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $α^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench
中文标题/摘要
标题:$α^3$-SecBench:基于6G网络的LLM驱动无人机代理安全、韧性和信任的大规模评估套件
自主无人机系统在安全关键的网络环境中越来越广泛部署,必须在恶意对手的干扰下可靠运行。尽管最近的基准测试评估了基于大型语言模型(LLM)的无人机代理在推理、导航和效率方面的表现,但在对抗条件下的系统性安全、韧性和信任评估仍然鲜有探索,特别是在新兴的6G环境下。
我们提出了$α^{3}$-SecBench,这是首个针对基于LLM的无人机代理在现实对抗干扰下的安全感知自主性的大规模评估套件。该框架基于$α^{3}$-Bench的多轮对话无人机任务,在良性任务片段上叠加了20,000个经过验证的安全攻击场景,针对七个自主层,包括传感、感知、规划、控制、通信、边缘/云基础设施和LLM推理。$α^{3}$-SecBench从三个正交维度评估代理:安全(攻击检测与漏洞归因)、韧性(安全降级行为)和信任(符合策略的工具使用)。我们评估了来自主要工业提供商和领先AI实验室的23种最先进的LLM,使用了从涵盖175种威胁类型的113,475次任务语料库中采样的数千个对抗增强无人机任务片段。虽然许多模型能够可靠地检测异常行为,但有效的缓解、漏洞归因和可信赖的控制行动仍然不一致。归一化总体得分范围从12.9%到57.1%,突显了异常检测与安全感知自主决策之间的巨大差距。我们已在GitHub上发布了$α^{3}$-SecBench:https://github.com/maferrag/AlphaSecBench
VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks
Authors: Efthymios Tsaprazlis, Thanathai Lertpetchpun, Tiantian Feng, Sai Praneeth Karimireddy, Shrikanth Narayanan
First: 2025-09-22T20:57:48+00:00 · Latest: 2026-01-26T18:23:42+00:00
Abstract
Voice anonymization aims to conceal speaker identity and attributes while preserving intelligibility, but current evaluations rely almost exclusively on Equal Error Rate (EER), which obscures whether adversaries can mount high-precision attacks. We argue that privacy should instead be evaluated in the low false-positive rate (FPR) regime, where even a small number of successful identifications constitutes a meaningful breach. To this end, we introduce VoxGuard, a framework grounded in differential privacy and membership inference that formalizes two complementary notions: User Privacy, preventing speaker re-identification, and Attribute Privacy, protecting sensitive traits such as gender and accent. Across synthetic and real datasets, we find that informed adversaries, especially those using fine-tuned models and max-similarity scoring, achieve orders-of-magnitude stronger attacks at low FPR despite similar EER. For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization. Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation, and we recommend VoxGuard as a benchmark for evaluating privacy leakage.
中文标题/摘要
标题:VoxGuard:通过成员推理攻击评估语音中的用户和属性隐私
语音匿名化旨在在保持可懂度的同时隐藏说话人身份和属性,但当前的评估几乎完全依赖于等错误率(EER),这掩盖了对手能否发起高精度攻击的情况。我们认为,隐私应该在低假阳性率(FPR)的范围内进行评估,在这个范围内,即使成功识别的次数很少也构成了一种有意义的泄露。为此,我们引入了VoxGuard,这是一种基于差分隐私和成员推理的框架,它形式化了两种互补的概念:用户隐私,防止重新识别说话人,以及属性隐私,保护性别和口音等敏感特征。在合成和真实数据集上,我们发现,知情的对手,尤其是使用微调模型和最大相似度评分的对手,在低FPR下实现了数量级更强的攻击,尽管EER相似。对于属性,我们展示了简单的透明攻击即使在匿名化后也能以近乎完美的准确度恢复性别和口音。我们的结果表明,EER严重低估了泄露,突显了低FPR评估的必要性,并推荐VoxGuard作为评估隐私泄露的基准。
Summary / 总结
VoxGuard evaluates user and attribute privacy in speech through membership inference attacks, arguing that evaluations based on Equal Error Rate (EER) alone are insufficient. The framework formalizes two notions of privacy: User Privacy, which prevents speaker re-identification, and Attribute Privacy, which protects sensitive traits such as gender and accent. The authors argue that privacy should be measured in the low false-positive-rate (FPR) regime, where even a few successful identifications constitute a meaningful breach. Across synthetic and real datasets, informed adversaries achieve orders-of-magnitude stronger attacks at low FPR despite similar EER, and simple attacks recover gender and accent with near-perfect accuracy after anonymization.
VoxGuard 通过成员推理攻击评估语音中的用户隐私和属性隐私,认为仅基于等错误率(EER)的评估并不充分。该框架形式化了两种隐私概念:用户隐私,防止说话人被重新识别;属性隐私,保护性别和口音等敏感特征。作者主张应在低假阳性率(FPR)区间评估隐私,因为在该区间内即使少量成功识别也构成有意义的泄露。在合成和真实数据集上,知情的攻击者在 EER 相近的情况下,于低 FPR 区间实现了强出若干数量级的攻击;简单的攻击在匿名化之后仍能以近乎完美的准确率恢复性别和口音。
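The gap between EER and low-FPR evaluation can be illustrated with a small sketch. The code below is not VoxGuard's implementation; the function names and the toy score lists are assumptions made for illustration. It computes an approximate EER and the best true-positive rate achievable under an FPR cap, for two hypothetical attackers whose scores are constructed to share the same EER.

```python
def roc_points(pos_scores, neg_scores):
    """Return (fpr, tpr) pairs, one per candidate threshold, plus the (0, 0) origin."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Best TPR among operating points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in roc_points(pos_scores, neg_scores) if fpr <= max_fpr)


def equal_error_rate(pos_scores, neg_scores):
    """Approximate EER: average of FPR and FNR where the two are closest."""
    fpr, tpr = min(roc_points(pos_scores, neg_scores),
                   key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2.0


# Toy data (hypothetical): scores for non-members, and member scores for two attackers.
non_members = [i / 10 for i in range(1, 11)]
# Attacker A never scores a member above the highest non-member score.
attacker_a = [0.75, 0.76, 0.77, 0.78, 0.79, 0.81, 0.82, 0.83, 0.84, 0.85]
# Attacker B is very confident about three members and poor on three others.
attacker_b = [5.0, 5.0, 5.0, 0.75, 0.75, 0.75, 0.75, 0.05, 0.05, 0.05]

eer_a = equal_error_rate(attacker_a, non_members)
eer_b = equal_error_rate(attacker_b, non_members)
tpr_a = tpr_at_fpr(attacker_a, non_members, max_fpr=0.0)
tpr_b = tpr_at_fpr(attacker_b, non_members, max_fpr=0.0)
```

In this toy setup both attackers have an EER of about 0.3, yet only attacker B identifies any members at zero false positives, which is the kind of difference a single EER number hides.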
HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
Authors: Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, Dawei Zhou
Venue: ICLR
First: 2026-01-26T18:23:09+00:00 · Latest: 2026-01-26T18:23:09+00:00
Comments: Accepted at ICLR 2026
Abstract
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
中文标题/摘要
标题:HalluGuard:揭开大型语言模型中数据驱动和推理驱动幻觉的面纱
大型语言模型(LLMs)在医疗、法律和科学发现等高风险领域中的可靠性常常因幻觉而受损。这些失败通常源自两个来源:数据驱动的幻觉和推理驱动的幻觉。然而,现有的检测方法通常只解决其中一个来源,并依赖于特定任务的启发式方法,限制了它们在复杂场景中的泛化能力。为克服这些限制,我们引入了幻觉风险界,这是一种统一的理论框架,正式将幻觉风险分解为数据驱动和推理驱动的组件,分别与训练时的不匹配和推理时的不稳定性相关联。这为分析幻觉的产生和发展提供了原则性的基础。在此基础上,我们引入了HalluGuard,这是一种基于NTK的评分方法,利用NTK诱导的几何结构和捕获的表示来联合识别数据驱动和推理驱动的幻觉。我们在10个不同的基准测试、11个竞争性基线和9个流行的LLM基础模型上评估了HalluGuard,一致地实现了检测LLM幻觉的最新性能。
Summary / 总结
The paper addresses the issue of hallucinations in Large Language Models (LLMs) in high-stakes domains by introducing a unified theoretical framework called the Hallucination Risk Bound, which decomposes hallucination risk into data-driven and reasoning-driven components. Based on this framework, the authors developed HalluGuard, an NTK-based score that can jointly identify both types of hallucinations. HalluGuard outperforms 11 competitive baselines across 10 diverse benchmarks and 9 popular LLM backbones in detecting various forms of LLM hallucinations.
论文通过引入一个统一的理论框架——幻觉风险界,将幻觉风险分解为数据驱动和推理驱动两个部分,解决了大型语言模型(LLMs)在高风险领域中的问题。基于这一框架,作者开发了HalluGuard,这是一种基于NTK的得分方法,可以同时识别这两种类型的幻觉。HalluGuard在多种基准测试中进行了评估,并在检测LLM幻觉的各种形式方面优于现有方法。
Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback
Authors: Seyed Amir Hosseini, Maryam Abdolali, Amirhosein Tavakkoli, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
First: 2026-01-26T18:21:48+00:00 · Latest: 2026-01-26T18:21:48+00:00
Comments: Equal contribution: Seyed Amir Hosseini and Maryam Abdolali. Corresponding author: Maryam Abdolali (maryam.abdolali@kntu.ac.ir)
Abstract
Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability: some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that trust parameters naturally evolve during gradient-based optimization to be positive (trust), near zero (ignore), or negative (flip), enabling the model to automatically invert adversarial preferences and recover useful signal rather than merely discarding corrupted feedback. We provide theoretical analysis establishing identifiability guarantees and detailed gradient analysis that explains how expert separation emerges naturally during training without explicit supervision. Empirically, we evaluate TTP on four diverse domains spanning manipulation tasks (MetaWorld) and locomotion (DM Control) under various corruption scenarios. TTP achieves state-of-the-art robustness, maintaining near-oracle performance under adversarial corruption while standard PBRL methods fail catastrophically. Notably, TTP outperforms existing baselines by successfully learning from mixed expert pools containing both reliable and adversarial annotators, all while requiring no expert features beyond identification indices and integrating seamlessly with existing PBRL pipelines.
中文标题/摘要
标题:信任、不信任或翻转:基于多专家反馈的鲁棒偏好强化学习
基于偏好的强化学习(PBRL)通过从轨迹对的成对比较中学习,为显式奖励工程提供了一种有前途的替代方案。然而,现实世界的偏好数据通常来自可靠性各异的异质标注者:有些准确,有些嘈杂,有些则系统性地敌对。现有的PBRL方法要么平等对待所有反馈,要么尝试过滤掉不可靠的来源,但当面对系统性地提供错误偏好的敌对标注者时,这两种方法都会失败。我们引入了TriTrust-PBRL(TTP),这是一种统一框架,可以从多专家偏好反馈中联合学习共享奖励模型和专家特定的信任参数。关键洞察是,信任参数在基于梯度的优化过程中自然演化为正值(信任)、接近零(忽略)或负值(翻转),使模型能够自动反转敌对偏好并恢复有用的信号,而不是仅仅丢弃被破坏的反馈。我们提供了建立可识别性保证的理论分析,以及详细的梯度分析,解释了专家分离如何在训练过程中自然出现,而无需显式监督。实验上,我们在多种破坏场景下,在涵盖操纵任务(MetaWorld)和运动任务(DM Control)的四个不同领域上评估了TTP。TTP实现了最先进的鲁棒性,在敌对破坏下保持接近oracle的性能,而标准的PBRL方法则会灾难性地失败。值得注意的是,TTP成功地从同时包含可靠和敌对标注者的混合专家池中学习,从而优于现有基线,同时除标识索引外不需要任何专家特征,并可无缝集成到现有的PBRL管道中。
Summary / 总结
The paper addresses the challenge of learning from heterogeneous preference data in reinforcement learning, where some annotators can be adversarial. It introduces TriTrust-PBRL (TTP), which jointly learns a shared reward model and expert-specific trust parameters. TTP automatically inverts adversarial preferences and recovers useful signal, achieving state-of-the-art robustness in various domains under adversarial corruption, outperforming existing methods.
论文解决了来自异质注释者(其中一些可能是敌对的)的偏好数据学习的挑战。它引入了TriTrust-PBRL (TTP),该方法同时学习共享奖励模型和专家特定的信任参数。TTP 自动反转敌对的偏好,并在对抗性污染下保持接近完美的性能,优于现有方法在多个领域中的表现。
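The trust / ignore / flip behaviour can be illustrated with a toy model. The sketch below is not the TTP implementation: it assumes a Bradley-Terry-style likelihood in which expert k prefers the first trajectory with probability sigmoid(alpha_k * gap), where gap is the reward difference and alpha_k is a signed trust scalar. That functional form, the fixed (rather than jointly learned) reward gaps, and all names are assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_trust(labels_by_expert, reward_gaps, steps=500, lr=0.5):
    """Gradient ascent on the log-likelihood of each expert's labels.

    Assumed model: P(label = 1) = sigmoid(alpha_k * gap), whose log-likelihood
    gradient with respect to alpha_k is the mean of (label - p) * gap.
    """
    alphas = [0.0] * len(labels_by_expert)
    for _ in range(steps):
        for k, labels in enumerate(labels_by_expert):
            grad = sum((y - sigmoid(alphas[k] * g)) * g
                       for y, g in zip(labels, reward_gaps)) / len(labels)
            alphas[k] += lr * grad
    return alphas


rng = random.Random(0)
# Hypothetical reward gaps R(a) - R(b) for 200 trajectory pairs.
gaps = [rng.uniform(-1.0, 1.0) for _ in range(200)]
truth = [1 if g > 0 else 0 for g in gaps]

reliable = truth                                # always agrees with the reward
adversarial = [1 - y for y in truth]            # always reports the opposite
noisy = [rng.randint(0, 1) for _ in truth]      # coin flips, no information

alpha_reliable, alpha_adversarial, alpha_noisy = fit_trust(
    [reliable, adversarial, noisy], gaps)
```

Under these assumptions the reliable expert's scalar grows positive, the adversarial expert's grows negative by the same amount, and the noisy expert's stays comparatively close to zero, so flipped labels still contribute signal instead of being discarded.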
When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation
Authors: Nishchal Sapkota, Haoyan Shi, Yejia Zhang, Xianshi Ma, Bofang Zheng, Fabian Vazquez, Pengfei Gu, Danny Z. Chen
Venue: ISBI
First: 2025-11-06T05:44:57+00:00 · Latest: 2026-01-26T18:07:50+00:00
Comments: This paper has been accepted for publication in the Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract
Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST
中文标题/摘要
标题:当Swin Transformer遇到KANs:一种改进的医疗图像分割Transformer架构
医疗图像分割对于准确诊断和治疗规划至关重要,但由于复杂的解剖结构和有限的标注训练数据,仍具有挑战性。基于CNN的分割方法在局部特征提取方面表现出色,但在建模长程依赖关系方面存在困难。相比之下,Transformer能够更有效地捕捉全局上下文,但本质上数据需求量大且计算成本高。在本文中,我们提出了一种UKAST架构,该架构类似于U-Net,将基于柯尔莫哥洛夫-阿诺尔德网络(KANs)的有理函数整合到Swin Transformer编码器中。通过利用柯尔莫哥洛夫-阿诺尔德变换器(KAT)中的有理基函数和组有理KANs(GR-KANs),我们的架构解决了传统样条KANs的效率问题,提供了一个更具表现力且数据高效的框架,与SwinUNETR相比,FLOPs减少且参数数量仅略有增加。UKAST在四个不同的2D和3D医疗图像分割基准测试中均实现了最先进的性能,一致地超越了基于CNN和Transformer的基线。值得注意的是,它在数据稀缺的环境中实现了更高的准确性,缓解了标准视觉Transformer的数据饥渴限制。这些结果表明,KAN增强的Transformer有潜力推动数据高效的医疗图像分割。代码可在:https://github.com/nsapkota417/UKAST 获取
Summary / 总结
This paper addresses the challenges of medical image segmentation by introducing UKAST, a U-Net-like architecture that integrates rational-function-based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. Using Group Rational KANs (GR-KANs) avoids the inefficiencies of spline-based KANs, yielding reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four 2D and 3D medical image segmentation benchmarks, outperforming both CNN- and Transformer-based baselines, especially in data-scarce settings.
本文通过引入UKAST架构,结合Swin Transformer编码器和Group Rational KANs,解决了医学图像分割的挑战。该方法在捕捉全局上下文和降低计算成本方面优于传统CNN和Transformer。UKAST在四个医学图像分割基准测试中取得了最先进的性能,即使在数据稀缺的情况下也表现出色,并且参数量相比SwinUNETR仅略有增加。
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Authors: Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou
First: 2026-01-26T18:04:54+00:00 · Latest: 2026-01-26T18:04:54+00:00
Abstract
Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across the 4 dimensions that evaluate essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual representations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.
中文标题/摘要
标题:TSRBench:全面的多任务多模态时间序列推理基准,用于通用模型
时间序列数据在现实场景中无处不在,并且对于从能源管理到交通控制等关键应用至关重要。因此,能够对时间序列进行推理是通用模型解决实际问题的一项基本技能。然而,这一维度在现有通用模型基准中明显缺失。为了弥合这一差距,我们引入了TSRBench,这是一个全面的多模态基准,旨在全面测试时间序列推理能力。TSRBench 特点包括:i) 来自14个领域的4125个问题的多样化集合,并按感知、推理、预测和决策制定4个主要维度分类;ii) 4个维度下的15项任务评估关键推理能力(例如,数值推理)。通过广泛的实验,我们在TSRBench中评估了超过30个领先的专有和开源LLM、VLM和TSLLM。我们的发现表明:i) 规模法则适用于感知和推理,但在预测中失效;ii) 强大的推理并不保证准确的上下文感知预测,表明语义理解与数值预测之间的脱钩;iii) 尽管文本和视觉表示作为时间序列输入具有互补性,当前的多模态模型未能有效融合它们以实现相互性能增益。TSRBench 提供了一个标准化的评估平台,不仅突显了现有挑战,还为推进通用模型提供了宝贵的见解。我们的代码和数据集可在 https://tsrbench.github.io/ 获取。
Summary / 总结
TSRBench is a comprehensive benchmark for evaluating the time series reasoning capabilities of generalist models. It includes 4125 problems from 14 domains and 15 tasks spanning four dimensions: perception, reasoning, prediction, and decision-making. Experiments with more than 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs show that scaling laws hold for perception and reasoning but break down for prediction. Strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction. Additionally, current multimodal models fail to fuse textual and visual representations of time series effectively enough to gain from both. TSRBench provides a standardized platform to identify existing challenges and guide future advances in generalist models.
TSRBench 是一个全面的基准,用于评估通用模型在14个领域中的时间序列推理能力,包括4125个问题,分类为感知、推理、预测和决策。它评估了15个任务。实验表明,感知和推理遵循缩放定律,但预测不遵循,且强大的推理并不保证准确的预测。此外,当前的多模态模型难以有效结合文本和视觉时间序列数据以提高性能。该基准提供了关于现有挑战和未来方向的见解。
BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change
Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Venue: ICLR 2026
First: 2025-05-25T21:29:00+00:00 · Latest: 2026-01-26T18:01:53+00:00
Comments: 45 pages, 21 figures, ICLR 2026
Abstract
Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H, as is done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants' meta-data are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.
中文标题/摘要
标题:BAH数据集:视频中数字行为改变中犹豫/矛盾识别
犹豫和矛盾(A/H)是个人推迟、避免或放弃健康行为改变的主要原因。这是一种微妙且矛盾的情绪,使人在正向和负向态度之间、接受和拒绝之间处于一种状态。它表现为情感在不同模态之间或同一模态内的不一致,如面部和语音表达、肢体语言。尽管专家可以被训练来识别A/H,如面对面互动中所做的那样,将其整合到数字健康干预措施中是成本高昂且效果较差的。因此,自动识别A/H对于数字行为改变干预措施的个性化和成本效益至关重要。然而,目前没有用于设计机器学习模型识别A/H的数据集。本文介绍了为视频中多模态A/H识别收集的Behavioral Ambivalence/Hesitancy (BAH)数据集。该数据集包含1,427个视频,总时长10.60小时,来自加拿大300名参与者回答预定义问题以引发A/H。它旨在模拟现实世界的在线个性化行为改变干预措施。BAH由三位专家注释,提供A/H发生的时间戳,以及帧级和视频级带有A/H线索的注释。还提供了视频转录、裁剪和对齐的脸部以及参与者元数据。由于A和H在实践中表现相似,我们提供了二元注释,表明A/H的存在或不存在。此外,本文还包括了在BAH上使用基线模型进行帧级和视频级识别、零样本预测和个人化(使用无源域适应)的基准测试结果。数据、代码和预训练权重均可用。
Summary / 总结
This paper introduces the BAH dataset for recognizing ambivalence and hesitancy (A/H) in videos, a capability needed to personalize digital behaviour change interventions. The dataset contains 1,427 videos (10.60 hours in total) from 300 participants across Canada answering predefined questions designed to elicit A/H, annotated by three experts with timestamps of A/H occurrences and frame- and video-level A/H cues. The paper also reports baseline benchmarking results for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation.
该论文介绍了用于识别视频中矛盾和犹豫(A/H)的BAH数据集,这一能力是实现数字行为改变干预个性化所必需的。该数据集包含来自加拿大300名参与者的1,427个视频(总时长10.60小时),参与者回答旨在引发A/H的预定义问题;数据由三位专家标注,提供A/H发生位置的时间戳以及帧级和视频级的A/H线索。论文还报告了基线模型在帧级和视频级识别、零样本预测以及使用无源域适应进行个性化方面的基准测试结果。
Benchmarking Machine Learning Models for IoT Malware Detection under Data Scarcity and Drift
Authors: Jake Lyon, Ehsan Saeedizade, Shamik Sengupta
First: 2026-01-26T17:59:33+00:00 · Latest: 2026-01-26T17:59:33+00:00
Abstract
The rapid expansion of the Internet of Things (IoT) in domains such as smart cities, transportation, and industrial systems has heightened the urgency of addressing their security vulnerabilities. IoT devices often operate under limited computational resources, lack robust physical safeguards, and are deployed in heterogeneous and dynamic networks, making them prime targets for cyberattacks and malware applications. Machine learning (ML) offers a promising approach to automated malware detection and classification, but practical deployment requires models that are both effective and lightweight. The goal of this study is to investigate the effectiveness of four supervised learning models (Random Forest, LightGBM, Logistic Regression, and a Multi-Layer Perceptron) for malware detection and classification using the IoT-23 dataset. We evaluate model performance in both binary and multiclass classification tasks, assess sensitivity to training data volume, and analyze temporal robustness to simulate deployment in evolving threat landscapes. Our results show that tree-based models achieve high accuracy and generalization, even with limited training data, while performance deteriorates over time as malware diversity increases. These findings underscore the importance of adaptive, resource-efficient ML models for securing IoT systems in real-world environments.
中文标题/摘要
标题:在数据稀缺和漂移条件下物联网恶意软件检测的机器学习模型基准测试
物联网(IoT)在智慧城市、交通运输和工业系统等领域的迅速扩张加剧了对其安全漏洞的紧迫性。IoT设备通常在有限的计算资源下运行,缺乏坚固的物理保护,并部署在异构和动态网络中,使其成为网络攻击和恶意软件应用的主要目标。机器学习(ML)为自动恶意软件检测和分类提供了有希望的方法,但实际部署需要既有效又轻量级的模型。本研究旨在使用IoT-23数据集调查四种监督学习模型(随机森林、LightGBM、逻辑回归和多层感知机)在恶意软件检测和分类中的有效性。我们评估了模型在二分类和多分类任务中的性能,评估了对训练数据量的敏感性,并分析了时间上的鲁棒性以模拟在不断变化的威胁环境中部署。研究结果表明,基于树的模型即使在有限的训练数据下也能实现高准确率和泛化能力,而性能随恶意软件多样性增加而下降。这些发现强调了在实际环境中为IoT系统提供安全所需的自适应、资源高效ML模型的重要性。
Summary / 总结
This study investigates the effectiveness of four supervised learning models (Random Forest, LightGBM, Logistic Regression, and Multi-Layer Perceptron) for malware detection in IoT systems, using the IoT-23 dataset. The research evaluates model performance in binary and multiclass classification tasks, assesses sensitivity to training data volume, and analyzes temporal robustness. The findings indicate that tree-based models perform well with limited data and maintain high accuracy, but their performance degrades over time as malware diversity increases.
研究评估了四种监督学习模型(随机森林、LightGBM、逻辑回归和多层感知机)在IoT系统中进行恶意软件检测的有效性,使用了IoT-23数据集。研究在二分类和多分类任务中评估了模型性能,分析了对训练数据量的敏感性和时间上的鲁棒性。结果表明,基于树的模型在少量数据下表现良好,保持了高准确性,但随着恶意软件多样性的增加,其性能会随着时间的推移而下降。
Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems
Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Jiawei Yao, Jian Wang, Guanlong Qu, Ziliang Chen, Keze Wang
Venue: ICLR 2026
First: 2026-01-26T17:58:53+00:00 · Latest: 2026-01-26T17:58:53+00:00
Comments: Accepted to ICLR 2026
Abstract
Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.
中文标题/摘要
标题:为何将疑虑藏在心中?在多智能体 bandit 系统中交易视觉不确定性
视觉-语言模型(VLMs)能够实现强大的多智能体系统,但将其扩展在经济上是不可持续的:在信息不对称的情况下协调异构智能体往往会导致成本螺旋上升。现有的范式,如混合智能体和知识路由器,依赖于忽略成本的启发式代理,导致不确定性结构的坍塌,从而导致可证明的次优协调。我们提出了 Agora,一种将协调重新构想为不确定性分散市场的框架。Agora 将知识论不确定性形式化为可交易的结构化资产(感知、语义、推理),并基于理性经济规则在智能体之间实施基于盈利能力的交易。市场意识经纪人扩展了 Thompson 抽样,以启动合作并引导系统向成本效率均衡发展。在五个跨模态基准(MMMU、MMBench、MathVision、InfoVQA、CC-OCR)上的实验表明,Agora 在性能上优于强大的 VLMs 和启发式多智能体策略,例如在 MMMU 上的准确率比最佳基线高出 8.5%,同时成本降低了 3 倍以上。这些结果确立了基于市场的协调作为一种原理上可行且可扩展的范式,用于构建经济上可行的多智能体视觉智能系统。
Summary / 总结
The paper addresses the economic inefficiency of coordinating multi-agent systems using Vision-Language Models (VLMs) under information asymmetry. It proposes Agora, a framework that transforms epistemic uncertainty into tradable assets and uses a market-aware broker to guide agents towards cost-efficient equilibria. Experiments on five multimodal benchmarks demonstrate that Agora outperforms strong VLMs and heuristic strategies, achieving higher accuracy while reducing costs significantly.
论文针对多智能体系统中视觉-语言模型的经济不可持续性问题,信息不对称导致高昂成本。提出了一种名为Agora的框架,将知识不确定性视为可交易资产,使智能体基于理性经济规则进行交易。实验结果显示,Agora在五个多模态基准上优于强大的视觉-语言模型和启发式策略,实现了更高的准确率并大幅降低了成本。
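The abstract describes Agora's broker as extending Thompson Sampling. As background only, the sketch below shows plain Beta-Bernoulli Thompson Sampling over a few hypothetical agents with fixed success probabilities. It is not Agora's broker: the market mechanism, costs, and uncertainty types are deliberately omitted, and all names and numbers are assumptions.

```python
import random


def thompson_sampling(success_probs, rounds, seed=0):
    """Plain Beta-Bernoulli Thompson Sampling; returns how often each arm was chosen."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    wins = [1] * n_arms      # Beta posterior alpha, starting from a uniform prior
    losses = [1] * n_arms    # Beta posterior beta
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        # Simulated outcome: 1 if the chosen (hypothetical) agent answers correctly.
        reward = 1 if rng.random() < success_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Three hypothetical agents with different (unknown to the sampler) accuracies.
pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
```

Because each choice is made from posterior samples rather than point estimates, weak agents are still tried occasionally early on, while most later rounds go to the agent with the highest observed success rate.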
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
First: 2026-01-26T17:56:50+00:00 · Latest: 2026-01-26T17:56:50+00:00
Comments: 13 pages
Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
中文标题/摘要
标题:自我蒸馏推理器:面向大规模语言模型的在线自我蒸馏
知识蒸馏通过压缩教师大规模语言模型(LLM)的知识来训练较小的LLM,从而改善大型语言模型的推理能力。在线蒸馏通过让学生在教师LLM提供密集的标记级监督的同时,自行采样其自身的轨迹,来推进这一方法,从而解决了脱机蒸馏方法中训练与推理之间的分布不匹配问题。然而,在线蒸馏通常需要一个单独的、通常更大的教师LLM,并且不明确利用推理数据集中可用的真实解决方案。受足够强大的LLM能够合理化外部特权推理轨迹并教导其较弱的自我(即没有访问特权信息的版本)这一直觉的启发,我们引入了在线自我蒸馏(OPSD)框架,其中单个模型同时作为教师和学生,通过不同的上下文进行条件化。教师策略基于特权信息(例如,验证的推理轨迹)进行条件化,而学生策略仅看到问题;训练通过最小化学生自身模拟过程中这些分布之间的每个标记差异来进行。我们通过多个数学推理基准展示了我们方法的有效性,与强化学习方法(如GRPO)相比,实现了4-8倍的标记效率,并且优于脱机蒸馏方法。
Summary / 总结
The research aims to improve large language model reasoning through on-policy self-distillation, where a single model acts as both teacher and student. The method conditions the teacher on privileged information and the student on the question, minimizing token divergence during training. Experiments show that OPSD achieves 4-8x token efficiency compared to reinforcement learning methods and outperforms off-policy distillation methods on mathematical reasoning benchmarks.
研究旨在通过单模型的自蒸馏方法提升大型语言模型的推理能力,该方法让模型同时扮演教师和学生角色,教师策略基于特权信息,学生策略仅基于问题。训练过程中通过最小化学生自我采样轨迹与教师策略之间的token差异来优化。实验表明,OPSD相比强化学习方法如GRPO实现了4-8倍的token效率,并在数学推理基准测试中优于脱政策略蒸馏方法。
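The training signal described above, a per-token divergence between the teacher and student distributions along the student's own rollout, can be sketched as follows. This is an illustration rather than the OPSD implementation: the choice of KL(student || teacher), the toy logits, and the function names are assumptions, and a real setup would obtain both sets of logits from the same model under two different contexts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def per_token_divergence_loss(student_logits, teacher_logits):
    """Mean KL(student || teacher) across the positions of one student rollout."""
    per_token = [kl_divergence(softmax(s), softmax(t))
                 for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)


# Toy rollout of 3 tokens over a 4-word vocabulary (hypothetical numbers).
# Student logits: the model conditioned on the question only.
student = [[1.0, 0.5, 0.0, 0.0], [0.2, 0.2, 0.2, 0.2], [0.0, 2.0, 0.0, 0.0]]
# Teacher logits: the same model, additionally conditioned on a verified trace.
teacher = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0], [0.0, 2.0, 0.0, 0.0]]

loss = per_token_divergence_loss(student, teacher)
loss_if_identical = per_token_divergence_loss(student, student)
```

The loss is zero only where the two contexts already induce the same next-token distribution, so positions where the privileged context changes the prediction dominate the update.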
Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge
Authors: Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2026-01-26T17:56:19+00:00 · Latest: 2026-01-26T17:56:19+00:00
Comments: MARS Challenge @ NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI. Challenge page: https://mars-eai.github.io/MARS-Challenge-Webpage/
Abstract
Recent advancements in multimodal large language models and vision-language-action models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.
中文标题/摘要
标题:多智能体机器人系统(MARS)挑战的进展与创新
近年来,多模态大型语言模型和视觉-语言-动作模型的进展显著推动了嵌入式人工智能的发展。随着领域向更复杂的任务场景过渡,多智能体系统框架变得至关重要,以实现可扩展、高效和协作的解决方案。这一转变由三个主要因素推动:智能体能力的提高、通过任务分配增强系统效率以及实现高级的人机交互。为应对多智能体协作带来的挑战,我们提出了多智能体机器人系统(MARS)挑战,该挑战在NeurIPS 2025空间在视觉、语言和嵌入式人工智能研讨会中举办。竞赛集中在两个关键领域:规划与控制,参赛者利用视觉-语言模型(VLMs)进行多智能体嵌入式规划,以协调任务并执行机器人在动态环境中的操作。通过评估参赛者提交的解决方案,挑战提供了有关多智能体嵌入式系统设计和协调的宝贵见解,为未来先进协作人工智能系统的开发做出了贡献。
Summary / 总结
This paper reports on the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE and motivated by the need for scalable, efficient, and collaborative solutions as Embodied AI moves toward more complex task scenarios. The competition covers two areas: planning, where participants use vision-language models (VLMs) for multi-agent embodied planning and task coordination, and control, where policy execution performs robotic manipulation in dynamic environments. Evaluating the submitted solutions yields insights into the design and coordination of embodied multi-agent systems.
本文介绍了在NeurIPS 2025 SpaVLE研讨会上举办的多智能体机器人系统(MARS)挑战赛,其动机是随着具身智能转向更复杂的任务场景,需要可扩展、高效且协作的解决方案。竞赛涵盖两个方面:规划,即参赛者利用视觉-语言模型(VLMs)进行多智能体具身规划与任务协调;控制,即通过策略执行在动态环境中完成机器人操作。对参赛方案的评估为具身多智能体系统的设计与协调提供了见解。
Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale
Authors: Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman, Brandon Fain
First: 2026-01-26T17:54:54+00:00 · Latest: 2026-01-26T17:54:54+00:00
Abstract
The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as to avoid using biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose \textsc{reflect}, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. \textsc{reflect} operates entirely in-context, combining (i) a constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. \textsc{reflect}'s technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that \textsc{reflect} significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. \textsc{reflect} is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that \textsc{reflect} naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.
中文标题/摘要
标题:Reflect: 透明原则导向推理在大规模宪法对齐中的应用
宪法框架对齐旨在使大型语言模型(LLMs)与自然语言中写入的价值导向原则(如避免使用有偏见的语言)保持一致。先前的工作主要集中在参数微调技术上,例如从人类反馈中强化学习(RLHF),以灌输这些原则。然而,这些方法计算需求高,需要精细的工程和调整,并且通常需要难以获取的人类注释数据。我们提出了一种名为\textsc{reflect}的推理时框架,用于宪法对齐,该框架不需要任何训练或数据,提供了一种即插即用的方法,将指令调优模型与一组原则对齐。\textsc{reflect}完全基于上下文运行,结合了(i)宪法条件下的基础响应与(ii)生成后的自我评估、(iii)(a)自我批判和(iii)(b)最终修订。\textsc{reflect}在生成后进行显式基于上下文的原则推理的技术优于标准的少量示例提示,并提供了透明的推理痕迹。我们的结果表明,\textsc{reflect}显著提高了LLM对多样且复杂的原则的遵从性,包括与模型原始参数微调中强调的原则截然不同的原则,同时不牺牲事实推理。\textsc{reflect}特别有效地减少了原则罕见但重要的违反率,从而在生成分布的尾端提高了安全性和鲁棒性。最后,我们展示了\textsc{reflect}自然生成了传统参数微调技术的有用训练数据,允许在长期部署场景中高效扩展并减少推理时的计算开销。
Summary / 总结
The paper introduces Reflect, a framework for aligning large language models with value-laden principles during inference without requiring any training or data. It operates by conditioning the base response on the constitution, followed by self-evaluation, self-critique, and final revision. Reflect outperforms standard few-shot prompting and provides transparent reasoning traces. The results show that Reflect significantly improves LLM conformance to diverse principles, enhances safety and robustness, and can generate useful training data for traditional parameter fine-tuning techniques.
论文介绍了Reflect框架,该框架在推理过程中无需任何训练或数据即可使大型语言模型与价值导向的原则对齐。它使用上下文中的推理来根据原则条件化模型的响应,随后进行自我评估、自我批判和最终修订。Reflect在标准少量示例提示方面表现出色,并提供了透明的推理痕迹。研究显示,Reflect显著提高了LLM对多样且复杂的原则的遵从性,增强了安全性和鲁棒性,并且可以生成传统参数微调技术所需的有用训练数据。
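The four in-context stages named in the abstract (base response, self-evaluation, self-critique, final revision) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `reflect_style_pipeline`, the `generate` callable (any text-in, text-out model wrapper), the prompt wording, and the choice to skip critique and revision when no violation is flagged are all assumptions made here.

```python
from typing import Callable, List


def reflect_style_pipeline(
    generate: Callable[[str], str],
    principles: List[str],
    user_prompt: str,
) -> str:
    """Hypothetical sketch: respond, self-evaluate, then critique and revise."""
    constitution = "\n".join(f"- {p}" for p in principles)

    # (i) Constitution-conditioned base response.
    base = generate(
        f"Follow these principles:\n{constitution}\n\nUser request: {user_prompt}"
    )

    # (ii) Self-evaluation: the same model judges its own draft.
    verdict = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        "Does the response violate any principle? Answer YES or NO."
    )
    # Assumption: return the draft unchanged when no violation is flagged.
    if not verdict.strip().upper().startswith("YES"):
        return base

    # (iii)(a) Self-critique: name the violated principles.
    critique = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        "Explain which principles the response violates and why."
    )

    # (iii)(b) Final revision conditioned on the critique.
    return generate(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the response so that it satisfies every principle."
    )
```

Because every stage is a plain prompt to the same model, the intermediate strings (`verdict`, `critique`) double as a readable reasoning trace, which is in the spirit of the transparency the abstract describes.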
TensLoRA: Tensor Alternatives for Low-Rank Adaptation
Authors: Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau
Venue: ICASSP 2026
First: 2025-09-22T17:15:23+00:00 · Latest: 2026-01-26T17:51:38+00:00
Comments: Published at ICASSP 2026. 5 pages, 1 figure, 2 tables. Code can be found at https://github.com/ax-le/TensLoRA
Abstract
Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes better than standard LoRA under similar parameter counts.
中文标题/摘要
标题:TensLoRA:低秩适应的张量替代方法
低秩适应(LoRA)广泛用于通过在注意力投影中添加可训练的低秩矩阵来高效地适应Transformer。虽然有效,但这些矩阵被认为是每个注意力投影(查询、键和值)和每个层的独立项。最近的扩展考虑了联合的张量基适应,但仅限于有限的形式且没有系统框架。我们引入了TensLoRA,这是一种统一框架,将LoRA更新聚合为高阶张量,并建模了一类广泛的张量基低秩适应。我们的公式化方法扩展了现有的张量基方法,并允许特定模式的压缩率,使参数预算能够根据模态和任务进行调整。在视觉和语言基准上的实验表明,张量构造直接影响性能,在相似的参数计数下有时甚至优于标准LoRA。
Summary / 总结
TensLoRA is a unified framework that extends Low-Rank Adaptation (LoRA) by aggregating the per-projection, per-layer updates into higher-order tensors and modeling them with tensor-based low-rank adaptations. The formulation generalizes previous tensor-based approaches and enables mode-specific compression rates. Experiments on vision and language benchmarks show that the choice of tensor construction directly affects performance, sometimes outperforming standard LoRA under similar parameter counts.
TensLoRA 是一种统一框架,扩展了低秩适应(LoRA)方法,通过将更新聚合到高阶张量中,实现更广泛的张量基低秩适应。该方法扩展了现有张量基方法,并允许针对模态和任务进行特定模式的压缩率。实验表明,张量构造可以直接影响性能,在相似参数量下有时甚至优于标准的 LoRA 方法。
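To make the parameter-budget argument concrete, the sketch below compares the number of trainable parameters of independent LoRA adapters with that of one low-rank factorization over the stacked update tensor. The Tucker-style factorization, the tensor shape (d_in, d_out, projections, layers), and all sizes and ranks are illustrative assumptions; the abstract does not say which tensor constructions TensLoRA uses.

```python
from typing import Sequence


def lora_param_count(d_in: int, d_out: int, rank: int,
                     n_proj: int = 3, n_layers: int = 12) -> int:
    """Independent LoRA: one pair (A: rank x d_in, B: d_out x rank) per projection and layer."""
    return n_proj * n_layers * rank * (d_in + d_out)


def tucker_param_count(dims: Sequence[int], ranks: Sequence[int]) -> int:
    """Tucker format: a core of size prod(ranks) plus one (dim x rank) factor per mode."""
    core = 1
    for r in ranks:
        core *= r
    factors = sum(d * r for d, r in zip(dims, ranks))
    return core + factors


# Hypothetical sizes: 768-dim attention, Q/K/V projections, 12 layers.
independent = lora_param_count(768, 768, rank=8)
# One rank per mode, i.e. a separate compression rate for each mode.
stacked = tucker_param_count(dims=(768, 768, 3, 12), ranks=(8, 8, 3, 4))
print(independent, stacked)  # prints "442368 13113"
```

The per-mode ranks are what "mode-specific compression rates" would look like in this format: the projection and layer modes can be compressed differently from the two feature modes. A smaller parameter count alone says nothing about accuracy, which the abstract reports as depending on the tensor construction.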
Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning
Authors: Lintang Sutawika, Gokul Swamy, Zhiwei Steven Wu, Graham Neubig
First: 2026-01-26T17:46:44+00:00 · Latest: 2026-01-26T17:46:44+00:00
Comments: Code available at https://github.com/lintangsutawika/SP3F
Abstract
When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce \texttt{SP3F} (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without \textit{any} data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as \textit{privileged information}. Thus, even when none of the model's responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, \texttt{SP3F} greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than
of the training data across the single-language, multilingual, and generalization to unseen language settings.
中文标题/摘要
标题:翻译中的收获:特权成对评判者增强多语言推理
当用训练数据中较少见到的语言提问时,当前的推理大型语言模型(RLMs)往往在回答相同问题时表现出显著较低的性能。为应对这一问题,我们引入了 \texttt{SP3F}(自我博弈与特权成对反馈),这是一种在没有任何目标语言数据的情况下增强多语言推理的两阶段框架。首先,我们监督微调(SFT)翻译后的英语问题-答案对,以提高基础模型的准确性。其次,我们以自我博弈的方式进行强化学习(RL),评判者在成对评判时获得英语参考答案作为特权信息。因此,即使模型的任何回答都不完全正确,特权成对评判者仍然可以判断哪个回答更好。从端到端来看,\texttt{SP3F} 显著提高了基础模型的性能,在单语言、多语言以及对未见语言的泛化设置中,即使使用不到
的训练数据,也超过了完全后训练模型在多个数学和非数学任务上的表现。
Summary / 总结
The research addresses the performance drop of reasoning large language models (RLMs) when processing questions in languages less seen in their training data. It proposes SP3F, a two-stage framework that first fine-tunes the model on translated English question-answer pairs and then uses a privileged pairwise judge in a self-play reinforcement learning setup to improve performance. This method significantly enhances the base model's performance across various tasks, outperforming fully post-trained models with minimal training data.
研究针对当前大型语言模型(RLMs)在较少见过的语言中的推理性能与英语相比存在差距的问题。提出了一种名为SP3F的两阶段框架,首先对模型进行翻译后的英语问题-答案对的微调,然后通过一个获得英语参考答案特权信息的成对裁判进行自我博弈来提升推理能力。该方法显著提高了模型在各种任务中的表现,即使在少量训练数据的情况下也超过了完全后训练的模型。
Point transformer for protein structural heterogeneity analysis using CryoEM
Authors: Muyuan Chen, Muchen Li, Renjie Liao
First: 2026-01-26T17:38:52+00:00 · Latest: 2026-01-26T17:38:52+00:00
Abstract
Structural dynamics of macromolecules is critical to their structure-function relationship. Cryogenic electron microscopy (CryoEM) provides snapshots of vitrified protein at different compositional and conformational states, and the structural heterogeneity of proteins can be characterized through computational analysis of the images. For protein systems with multiple degrees of freedom, it is still challenging to disentangle and interpret the different modes of dynamics. Here, by implementing Point Transformer, a self-attention network designed for point cloud analysis, we are able to improve the performance of heterogeneity analysis on CryoEM data, and characterize the dynamics of highly complex protein systems in a more human-interpretable way.
中文标题/摘要
标题:蛋白质结构异质性分析的点变换器方法基于CryoEM
大分子的结构动力学对于其结构-功能关系至关重要。冷冻电子显微镜(CryoEM)提供了不同组成和构象状态下玻璃化蛋白质的快照,并可以通过图像的计算分析来表征蛋白质的结构异质性。对于具有多个自由度的蛋白质系统,仍然难以解开和解释不同的动力学模式。通过实施点变换器,一种为点云分析设计的自注意力网络,我们能够提高CryoEM数据异质性分析的性能,并以更易于人类解读的方式表征高度复杂的蛋白质系统的动力学。
Summary / 总结
The research aims to analyze the structural dynamics of macromolecules using CryoEM data to better understand their structure-function relationship. The Point Transformer, a self-attention network for point cloud analysis, was employed to enhance the heterogeneity analysis of proteins, particularly in complex systems with multiple degrees of freedom, making the dynamics more interpretable for humans.
研究旨在利用CryoEM数据分析大分子的结构动态,以更好地理解其结构-功能关系。通过使用点云分析的自注意力网络Point Transformer,提高了CryoEM图像的异质性分析性能。主要发现是,这种方法能够更好地分析复杂蛋白质系统的结构异质性,使动态过程更易于人类理解。
SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model
Authors: Jan Hagnberger, Mathias Niepert
First: 2026-01-26T17:34:16+00:00 · Latest: 2026-01-26T17:34:16+00:00
Abstract
Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.
中文标题/摘要
标题:SMART:使用基于变换器的代理模型从原始几何形状进行可扩展的无网格气动模拟
基于机器学习的代理模型已成为物理模拟(特别是在复杂几何形状如汽车车身)中更高效的替代方案,相比数值求解器。许多现有模型将模拟网格作为额外输入,从而减少预测误差。然而,为新几何形状生成模拟网格是计算成本高昂的。相比之下,无网格方法不依赖于模拟网格,通常会引入更高的误差。鉴于这些考虑,我们引入了SMART,这是一种神经代理模型,仅使用几何点云表示而不需访问模拟网格即可预测任意查询位置的物理量。几何和模拟参数被编码到一个共享的潜在空间中,该空间捕捉物理场的结构和参数特征。然后,物理解码器关注编码器的中间潜在表示,将空间查询映射到物理量。通过这种跨层交互,模型同时更新潜在几何特征和演变中的物理场。大量实验表明,SMART在与依赖于模拟网格输入的现有方法竞争时,往往表现出更优的性能,展示了其在工业级模拟中的能力。
Summary / 总结
SMART is a neural surrogate model designed to predict physical quantities from raw geometries without requiring a simulation mesh, addressing the computational cost of mesh generation. By encoding geometry and simulation parameters into a shared latent space and using a physics decoder to map spatial queries to physical quantities, SMART achieves competitive and often superior performance compared to existing mesh-dependent methods, making it suitable for industry-level simulations.
SMART 是一种神经代理模型,旨在无需模拟网格的情况下从原始几何形状预测物理量,从而解决网格生成的高计算成本问题。通过将几何形状和模拟参数编码到共享的潜在空间中,并使用物理解码器将空间查询映射到物理量,SMART 在性能上与现有依赖网格的方法相当甚至更优,使其适用于工业级模拟。
Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
Authors: Zhichao Yang, Sepehr Janghorbani, Dongxu Zhang, Jun Han, Qian Qian, Andrew Ressler, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
First: 2026-01-26T17:34:10+00:00 · Latest: 2026-01-26T17:34:10+00:00
Abstract
Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality and domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.
中文标题/摘要
标题:Health-SCORE: 向可扩展的健康LLM评估标准迈进
评估开放性LLM响应的评分标准在安全关键领域(如医疗保健)尤为重要。然而,创建高质量且领域特定的评分标准通常需要大量的人力和开发成本,这使得基于评分标准的评估和培训难以扩展。在本研究中,我们引入了Health-SCORE,这是一种通用且可扩展的基于评分标准的训练和评估框架,能够在不牺牲性能的情况下大幅降低评分标准开发成本。我们展示了Health-SCORE的两个实际优势:它可以作为结构化的奖励信号来引导具有安全意识监督的强化学习,也可以直接融入提示中以通过上下文学习提高响应质量。在开放性医疗保健任务中,Health-SCORE在评估质量上与人工创建的评分标准相当,同时显著降低了开发努力,使基于评分标准的评估和培训更具可扩展性。
Summary / 总结
Health-SCORE is a scalable rubric framework for evaluating and training health-oriented language models. It reduces the need for human expertise and development costs, making rubric-based evaluation and training more feasible. Health-SCORE not only serves as a robust evaluation tool but also acts as a structured reward signal for reinforcement learning and improves response quality through in-context learning. It achieves evaluation quality similar to human-created rubrics while substantially reducing development effort.
Health-SCORE 是一个可扩展的评分框架,用于评估和训练面向医疗的语言模型(LLMs),特别是在安全关键的医疗领域。它减少了对人类专业知识和开发成本的需求,使基于评分的评估和训练更具可行性。Health-SCORE 不仅提供高质量的评估,还可以作为强化学习中的结构化奖励信号,并通过上下文学习提高响应质量。它在评估质量上与人工创建的评分相当,但开发工作量大大降低,从而增强了可扩展性。
Data-Driven Qubit Characterization and Optimal Control using Deep Learning
Authors: Paul Surrey, Julian D. Teske, Tobias Hangleiter, Hendrik Bluhm, Pascal Cerfontaine
First: 2026-01-26T17:26:20+00:00 · Latest: 2026-01-26T17:26:20+00:00
Abstract
Quantum computing requires the optimization of control pulses to achieve high-fidelity quantum gates. We propose a machine learning-based protocol to address the challenges of evaluating gradients and modeling complex system dynamics. By training a recurrent neural network (RNN) to predict qubit behavior, our approach enables efficient gradient-based pulse optimization without the need for a detailed system model. First, we sample qubit dynamics using random control pulses with weak prior assumptions. We then train the RNN on the system's observed responses, and use the trained model to optimize high-fidelity control pulses. We demonstrate the effectiveness of this approach through simulations on a single $ST_0$ qubit.
中文标题/摘要
标题:基于数据的量子位表征和最优控制利用深度学习
量子计算需要优化控制脉冲以实现高保真度的量子门。我们提出了一种基于机器学习的协议,以应对评估梯度和建模复杂系统动力学的挑战。通过训练循环神经网络(RNN)来预测量子位行为,我们的方法能够在无需详细系统模型的情况下实现高效的梯度基脉冲优化。首先,我们使用弱先验假设的随机控制脉冲采样量子位动力学。然后,我们用系统观察到的响应训练RNN,并使用训练好的模型来优化高保真度的控制脉冲。我们通过在单个$ST_0$量子位上进行模拟来证明该方法的有效性。
Summary / 总结
The research aims to optimize control pulses for high-fidelity quantum gates using machine learning. The method involves training a recurrent neural network to predict qubit behavior based on observed responses to random control pulses. The key finding is that this approach can efficiently optimize control pulses without requiring a detailed system model, as demonstrated through simulations on a single $ST_0$ qubit.
研究旨在使用机器学习优化控制脉冲以实现高保真量子门。它提出了一种协议,其中循环神经网络根据随机控制脉冲的观察响应来预测量子比特的行为。这使得在不需要详细系统模型的情况下,可以高效地进行基于梯度的脉冲优化。该方法通过在单个$ST_0$量子比特上的模拟验证了其有效性,展示了其在优化控制脉冲以实现高保真量子门方面的效果。
Estimating the Joint Probability of Scenario Parameters with Gaussian Mixture Copula Models
Authors: Christian Reichenbächer, Philipp Rank, Jochen Hipp, Oliver Bringmann
First: 2025-06-11T18:30:20+00:00 · Latest: 2026-01-26T17:24:02+00:00
Comments: 9 pages, 4 figures; This work has been submitted to the IEEE for possible publication; Code available at: https://codeocean.com/capsule/1003615/tree
Abstract
This paper presents the first application of Gaussian Mixture Copula Models to the statistical modeling of driving scenarios for the safety validation of automated driving systems. Knowledge of the joint probability distribution of scenario parameters is essential for scenario-based safety assessment, where risk quantification depends on the likelihood of concrete parameter combinations. Gaussian Mixture Copula Models bring together the multimodal expressivity of Gaussian Mixture Models and the flexibility of copulas, enabling separate modeling of marginal distributions and dependence. We benchmark Gaussian Mixture Copula Models against previously proposed approaches - Gaussian Mixture Models and Gaussian Copula Models - using real-world driving data drawn from two scenarios defined in United Nations Regulation No. 157. Our evaluation on approximately 18 million instances of these two scenarios demonstrates that Gaussian Mixture Copula Models consistently surpass Gaussian Copula Models and perform competitively with Gaussian Mixture Models, as measured by both log-likelihood and Sinkhorn distance, with relative performance depending on the scenario. The results are promising for the adoption of Gaussian Mixture Copula Models as a statistical foundation for future scenario-based validation frameworks.
中文标题/摘要
标题:使用高斯混合 copula 模型估计场景参数的联合概率
本文首次将高斯混合 copula 模型应用于驾驶场景的统计建模,以验证自动驾驶系统的安全性。场景基于参数的联合概率分布的知识对于基于场景的安全评估至关重要,其中风险量化依赖于具体参数组合的可能性。高斯混合 copula 模型结合了高斯混合模型的多模态表达能力和 copula 的灵活性,能够分别建模边缘分布和依赖性。我们使用来自联合国第157号条例定义的两个场景的真实驾驶数据,将高斯混合 copula 模型与之前提出的高斯混合模型和高斯 copula 模型进行基准测试。在大约1800万个这些两个场景的实例上的评估表明,高斯混合 copula 模型在对数似然和 Sinkhorn 距离两个指标上都优于高斯 copula 模型,并且在与高斯混合模型的竞争中表现良好,相对性能取决于场景。这些结果表明高斯混合 copula 模型有望成为未来基于场景验证框架的统计基础。
Summary / 总结
This paper introduces Gaussian Mixture Copula Models for estimating the joint probability of scenario parameters in driving scenarios, crucial for the safety validation of automated driving systems. The models combine the multimodal expressivity of Gaussian Mixture Models with the flexibility of copulas, allowing for separate modeling of marginal distributions and dependence. Evaluations on real-world driving data from two scenarios show that Gaussian Mixture Copula Models outperform Gaussian Copula Models and perform competitively with Gaussian Mixture Models, as measured by log-likelihood and Sinkhorn distance.
本文介绍了用于驾驶场景中估计场景参数联合概率的高斯混合 copula 模型,这对于自动驾驶系统的安全性验证至关重要。该模型结合了高斯混合模型的多模态表达能力和 copula 的灵活性,允许分别建模边缘分布和依赖关系。对两个场景的真实驾驶数据进行评估显示,高斯混合 copula 模型在 log-likelihood 和 Sinkhorn 距离等指标上优于高斯 copula 模型,并且与高斯混合模型竞争。
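The separation of marginals from dependence that copulas provide can be illustrated with a small sampler. Note that this sketch uses a plain Gaussian copula (one of the baselines named in the abstract), not a Gaussian Mixture Copula Model, and the marginals, the correlation value, and every name below are hypothetical choices made for illustration only.

```python
import math
import random
from statistics import NormalDist
from typing import Callable, List, Tuple


def sample_gaussian_copula(
    n: int,
    rho: float,
    inv_cdf_x: Callable[[float], float],
    inv_cdf_y: Callable[[float], float],
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Draw n pairs whose dependence comes from a bivariate Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Dependence: correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms on (0, 1); this pair is a draw from the copula.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Marginals: applied independently through their inverse CDFs.
        samples.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return samples


def exponential_inv_cdf(u: float, rate: float = 0.5) -> float:
    """Inverse CDF of an exponential distribution (hypothetical marginal)."""
    return -math.log(1.0 - u) / rate


def uniform_inv_cdf(u: float, low: float = 10.0, high: float = 30.0) -> float:
    """Inverse CDF of a uniform distribution on [low, high] (hypothetical marginal)."""
    return low + (high - low) * u


pairs = sample_gaussian_copula(2000, 0.8, exponential_inv_cdf, uniform_inv_cdf)
```

Swapping either inverse CDF changes that parameter's marginal without touching the dependence structure. A Gaussian Mixture Copula Model, as described in the abstract, would replace the single correlated Gaussian in the latent step with a mixture so that multimodal dependence can be captured.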
TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent
Authors: Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin
First: 2026-01-26T17:15:27+00:00 · Latest: 2026-01-26T17:15:27+00:00
Abstract
Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents.
中文标题/摘要
标题:TEA-Bench:增强工具情感支持对话代理系统的系统性基准测试
情感支持对话不仅需要情感表达,还需要提供基于事实的实际支持,以提供可信赖的指导。然而,现有的情感支持对话(ESC)系统和基准主要集中在文本环境中的情感支持上,忽视了外部工具如何在多轮情感支持中实现事实性接地并减少幻觉。我们引入了TEA-Bench,这是第一个用于评估增强工具代理的交互式基准,它包括现实的情感场景、MCP风格的工具环境以及评估情感支持质量和事实性接地的过程级指标。对九种大语言模型的实验表明,工具增强通常可以提高情感支持的质量并减少幻觉,但收益强烈依赖于模型的能力:更强的模型更善于选择性且有效地使用工具,而较弱的模型仅能获得微小的收益。我们进一步发布了TEA-Dialog数据集,其中包含增强工具的情感支持对话,并发现监督微调在内部分布支持方面有所改进,但在泛化方面表现不佳。我们的结果强调了在构建可靠的情感支持代理时使用工具的重要性。
Summary / 总结
The research aims to evaluate tool-enhanced emotional support dialogue agents (ESDAs) by addressing the limitations of existing benchmarks that focus on affective support in text-only settings. The study introduces TEA-Bench, an interactive benchmark featuring realistic emotional scenarios and an MCP-style tool environment, to assess the quality and factual grounding of ESDAs. Experiments on nine large language models (LLMs) reveal that tool augmentation generally improves emotional support quality and reduces hallucination, though the benefits vary depending on the model's capacity. Stronger models use tools more effectively, while weaker models show only marginal improvements. The study also releases TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and finds that supervised fine-tuning enhances in-distribution support but does not generalize well. These findings highlight the significance of tool use in developing reliable ESDAs.
研究旨在通过解决现有基准主要关注文本-only设置中的情感支持,而忽视外部工具如何实现事实性支撑和减少幻觉的问题,来评估工具增强的情感支持对话代理(ESDA)。研究引入了TEA-Bench,一个包含现实情感场景和MCP风格工具环境的交互式基准,以评估ESDA的质量和事实性支撑。实验表明,工具增强通常可以提高情感支持的质量并减少幻觉,但不同能力的模型受益程度不同:更强的模型更有效地使用工具,而较弱的模型仅获得轻微改善。研究还发布了TEA-Dialog,一个工具增强的ESC对话数据集,并发现监督微调可以提高内部支持,但无法很好地泛化。这些结果强调了在构建可靠的情感支持代理时工具使用的重要性。
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning
Authors: Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Zihan Dong, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Linjun Zhang, Shujie Liu, Yan Lu, Huaxiu Yao
Venue: ICLR 2026
First: 2025-05-31T13:22:55+00:00 · Latest: 2026-01-26T17:15:26+00:00
Comments: ICLR 2026
Abstract
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL outperforms both open-source and proprietary Med-LVLMs. Notably, it achieves an average performance gain of 23.6% over strong baselines.
中文标题/摘要
标题:MMedAgent-RL:多模态医疗推理中多智能体协作的优化
医疗大型视觉-语言模型(Med-LVLMs)在多模态诊断任务中显示出强大的潜力。然而,现有的单智能体模型难以在多种医学专科之间泛化,限制了其性能。最近的努力引入了受临床工作流程启发的多智能体协作框架,其中全科医生(GPs)和专科医生按固定顺序交互。尽管有所改进,但这些静态管道缺乏推理的灵活性和适应性。为了解决这个问题,我们提出了MMedAgent-RL,这是一种基于强化学习(RL)的多智能体框架,能够实现医疗智能体之间的动态、优化协作。具体来说,我们通过RL训练了两个基于Qwen2.5-VL的全科医生智能体:分诊医生学习将患者分配到合适的专科,而主治医生则整合多专科医生的判断和自身知识来做出最终决定。为了解决专科医生输出的一致性问题,我们引入了一种带有动态熵调节的课程学习(CL)引导的RL策略,逐步教导主治医生在模仿专科医生和纠正其错误之间取得平衡。在五个医疗VQA基准上的实验表明,MMedAgent-RL优于开源和专有Med-LVLMs。值得注意的是,它在强基线上的平均性能提高了23.6%。
Summary / 总结
MMedAgent-RL is a reinforcement learning-based multi-agent framework designed to optimize dynamic collaboration among medical agents, addressing the limitations of static pipelines in medical reasoning. It trains two agents, a triage doctor and an attending physician, to collaboratively diagnose patients across various specialties. The attending physician learns to balance between imitating specialists and correcting their mistakes through a curriculum learning strategy with dynamic entropy regulation. Experiments show MMedAgent-RL outperforms existing models, achieving a 23.6% average performance gain over strong baselines on five medical VQA benchmarks.
MMedAgent-RL 是一个基于强化学习的多智能体框架,旨在优化医疗智能体之间的协作,以提高多模态诊断任务的性能。该框架训练两个全科医生智能体进行患者分流和综合专科判断。框架采用带有动态熵调节的渐进式学习策略,以增强主治医生平衡模仿和纠正的能力。实验表明,MMedAgent-RL 在五个医学问答基准测试中优于现有模型,平均性能提升 23.6%。
Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge
Authors: Xiao Liu, Jiawei Zhang
First: 2026-01-26T17:14:57+00:00 · Latest: 2026-01-26T17:14:57+00:00
Comments: Work in progress
Abstract
Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
中文标题/摘要
标题:视频生成模型在地理上公平吗?基于景点吸引力的全球视觉知识评估
近期文本到视频生成技术取得了令人信服的视觉成果,但尚不清楚这些模型是否编码了地理上公平的视觉知识。本文通过基于景点吸引力的评估,研究了文本到视频模型的地理公平性和地理上扎根的视觉知识。我们引入了Geo-Attraction Landmark Probing (GAP),这是一种系统框架,用于评估模型如何忠实合成来自不同地区的旅游景点,构建了包含500个全球分布的景点的GEOATTRACTION-500基准,这些景点覆盖了不同的地区和受欢迎程度。GAP 结合了互补的指标,将整体视频质量与景点特定知识分离,包括全球结构对齐、细粒度关键点对齐以及视觉-语言模型判断,所有这些都经过了人类评估的验证。将GAP 应用于最先进的文本到视频模型Sora 2,我们发现,与常见的地理偏见假设相反,该模型在不同地区、发展水平和文化群体中表现出相对均匀的地理上扎根的视觉知识水平,对景点受欢迎程度的依赖性较弱。这些结果表明,当前的文本到视频模型比预期更均匀地表达了全球视觉知识,既突显了其在全球部署应用中的潜力,也强调了随着此类系统的发展需要继续进行评估。
Summary / 总结
This work evaluates the geographic fairness of text-to-video generation models by introducing a systematic framework called Geo-Attraction Landmark Probing (GAP) and a benchmark of 500 globally distributed tourist attractions. The study finds that the state-of-the-art model Sora 2 exhibits relatively uniform geographically grounded visual knowledge across different regions and cultural groups, with only weak dependence on attraction popularity, challenging the common assumption of strong geographic bias.
这项工作通过引入Geo-Attraction Landmark Probing (GAP)系统框架,评估了文本到视频生成模型的地理公平性,该框架评估模型如何忠实合成来自不同地区的旅游景点。将GAP应用于最先进的模型Sora 2,研究发现该模型在地区、发展水平和文化群体方面表现出相对均匀的地理相关视觉知识,对景点的受欢迎程度依赖性较弱,这挑战了对强烈地理偏见的常见假设。
Explainability Methods for Hardware Trojan Detection: A Systematic Comparison
Authors: Paul Whitten, Francis Wolff, Chris Papachristou
First: 2026-01-26T17:13:00+00:00 · Latest: 2026-01-26T17:13:00+00:00
Abstract
Hardware trojan detection requires accurate identification and interpretable explanations for security engineers to validate and act on results. This work compares three explainability categories for gate-level trojan detection on the Trust-Hub benchmark: (1) domain-aware property-based analysis of 31 circuit-specific features from gate fanin patterns, flip-flop distances, and I/O connectivity; (2) case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution (LIME, SHAP, gradient).
Results show different advantages per approach. Property-based analysis provides explanations through circuit concepts like "high fanin complexity near outputs indicates potential triggers." Case-based reasoning achieves 97.4% correspondence between predictions and training exemplars, offering justifications grounded in precedent. LIME and SHAP provide feature attributions with strong inter-method correlation (r=0.94, p<0.001) but lack circuit-level context for validation.
XGBoost classification achieves 46.15% precision and 52.17% recall on 11,392 test samples, a 9-fold precision improvement over prior work (Hasegawa et al.: 5.13%) while reducing false positive rates from 5.6% to 0.25%. Gradient-based attribution runs 481 times faster than SHAP but provides similar domain-opaque insights.
This work demonstrates that property-based and case-based approaches offer domain alignment and precedent-based interpretability compared to generic feature rankings, with implications for XAI deployment where practitioners must validate ML predictions.
中文标题/摘要
标题:硬件木马检测的可解释性方法:系统性比较
硬件木马检测需要准确识别并提供可解释的解释,以便安全工程师验证和采取行动。本研究在Trust-Hub基准上比较了三种门级木马检测的可解释性类别:(1)基于门扇入模式、触发器距离和I/O连接的31个电路特定特征的领域感知属性分析;(2)基于k近邻的案例推理,用于先例解释;(3)模型不可知的特征归因(LIME、SHAP、梯度)。
结果表明每种方法各有优势。属性分析通过电路概念如“高扇入复杂性靠近输出端口可能指示潜在触发器”提供解释。案例推理在预测与训练示例之间的对应率为97.4%,提供基于先例的解释依据。LIME和SHAP提供特征归因,方法间相关性很强(r=0.94,p<0.001),但缺乏电路级上下文验证。XGBoost分类在11,392个测试样本上达到46.15%的精确率和52.17%的召回率,相比前人工作(Hasegawa等:5.13%)精确率提高了9倍,同时将假阳性率从5.6%降低到0.25%。基于梯度的归因比SHAP快481倍,但提供类似领域不透明的见解。
本研究证明,属性分析和案例推理方法在领域对齐和基于先例的可解释性方面优于通用特征排名,对XAI部署具有重要意义,其中实践者必须验证机器学习预测。
Summary / 总结
This study evaluates three explainability methods for hardware trojan detection: property-based analysis, case-based reasoning, and model-agnostic feature attribution. Property-based analysis provides circuit-level explanations, case-based reasoning offers precedent-based justifications, and model-agnostic methods like LIME and SHAP offer strong inter-method correlation but lack circuit-level context. XGBoost achieves 46.15% precision and 52.17% recall, a significant improvement over previous methods. Property-based and case-based approaches are more interpretable and aligned with domain knowledge compared to generic feature rankings.
该研究评估了三种硬件木马检测的可解释性方法:基于领域的属性分析、案例推理以及模型不可知的特征归因。属性分析提供了电路级别的解释,案例推理提供了基于先例的说明,而特征归因方法如LIME和SHAP则提供了强相关性但缺乏电路级别的上下文。XGBoost在精度和召回率上表现出色,超越了以往的工作,并且梯度基归因方法更快但透明度较低。研究强调了领域对齐和基于先例的可解释性在硬件安全中的重要性。
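The headline ratios in the abstract can be checked directly from the reported percentages. The short sketch below only re-derives the "9-fold" precision improvement and the implied reduction factor in false positive rate; the helper name `fold_change` is introduced here for illustration, and no values beyond those quoted in the abstract are used.

```python
def fold_change(new: float, old: float) -> float:
    """Ratio of two rates given in the same unit (here, percent)."""
    return new / old


# Precision: 46.15% (this work) vs. 5.13% (Hasegawa et al., as quoted in the abstract).
precision_fold = fold_change(46.15, 5.13)

# False positive rate: 5.6% (prior) vs. 0.25% (this work), expressed as a reduction factor.
fpr_reduction = fold_change(5.6, 0.25)

print(f"precision improvement: {precision_fold:.2f}x")  # prints "precision improvement: 9.00x"
print(f"false positive rate reduction: {fpr_reduction:.1f}x")  # prints "false positive rate reduction: 22.4x"
```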
A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Authors: Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
Venue: ICLR 2026
First: 2025-10-30T12:45:24+00:00 · Latest: 2026-01-26T17:12:54+00:00
Comments: Accepted at ICLR 2026
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
中文标题/摘要
标题:A-TPT:视觉语言模型测试时提示调谐的角多样性校准特性
测试时提示调谐(TPT)已成为一种有前途的技术,用于在无需依赖标记数据的情况下将大型视觉语言模型(VLMs)适应未见过的任务。然而,文本特征之间的缺乏分散性会损害校准性能,这引起了人们对VLMs可靠性的担忧。当前的TPT方法主要通过最大化平均文本特征分散度或施加正交约束来鼓励角度分离,从而提高提示校准。然而,这些方法可能无法始终在类别间文本特征之间实现最优的角度分离,这意味着忽视了角多样性的关键作用。为了解决这个问题,我们提出了一种名为A-TPT的新颖TPT框架,该框架引入了角多样性,以鼓励由相应可学习提示诱导的归一化文本特征的分布均匀性。这种均匀性是通过最大化单位超球面上特征之间的最小成对角度距离来实现的。我们通过在不同数据集上使用各种骨干网络进行广泛实验,展示了我们的方法在降低累积平均校准误差方面始终优于最先进的TPT方法,同时保持了相当的准确性。值得注意的是,我们的方法在自然分布转移的零样本校准性能方面表现出色,并且在医学数据集上具有良好的泛化能力。我们提供了广泛的分析,包括理论方面,以建立A-TPT的基础。这些结果突显了促进角多样性以实现分散良好的文本特征的重要性,显著提高了VLM在测试时适应过程中的校准。我们的代码将公开发布。
Summary / 总结
The paper introduces A-TPT, a test-time prompt tuning framework that improves the calibration of vision-language models by promoting angular diversity among class-wise textual features. It maximizes the minimum pairwise angular distance between normalized features on the unit hypersphere, which encourages a more uniform distribution. Experiments across several backbones and datasets show that A-TPT reduces average calibration error relative to state-of-the-art TPT methods while maintaining comparable accuracy, with superior zero-shot calibration under natural distribution shifts and good generalization to medical datasets.
论文提出了A-TPT,这是一种新颖的测试时提示调优框架,通过增强文本特征的角多样性来提高视觉-语言模型的校准。通过在单位超球面上最大化特征之间的最小成对角距离,A-TPT在减少校准误差的同时保持了准确性,并在自然分布偏移的零样本校准性能上表现出色,且能够很好地泛化到医学数据集上。
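The quantity that the A-TPT abstract says is maximized, the minimum pairwise angular distance between normalized textual features, can be illustrated with a small sketch. This only computes the quantity for fixed vectors; how A-TPT turns it into a loss over learnable prompts is not described in the abstract and is not shown here. The function names and the toy 2-D features are assumptions made for illustration.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def min_pairwise_angle(features):
    """Return the smallest angle (radians) between any two normalized features."""
    unit = [normalize(f) for f in features]
    angles = []
    for a, b in combinations(unit, 2):
        cos = sum(x * y for x, y in zip(a, b))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        angles.append(math.acos(cos))
    return min(angles)

# Hypothetical class-wise text features (toy 2-D values, not from the paper).
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]

angle_clustered = min_pairwise_angle(clustered)  # two features nearly coincide
angle_spread = min_pairwise_angle(spread)        # three features 120 degrees apart
```

The evenly spread set has a much larger minimum angle than the clustered set, which is the kind of dispersion the abstract associates with better calibration.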
A Pragmatic VLA Foundation Model
Authors: Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng
First: 2026-01-26T17:08:04+00:00 · Latest: 2026-01-26T17:08:04+00:00
Comments: Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/
Abstract
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second per GPU with an 8-GPU training setup, representing a 1.5x to 2.8x speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
中文标题/摘要
标题:一种务实的VLA基础模型
在机器人操作方面具有巨大的潜力,一个能力强大的视觉-语言-行动(VLA)基础模型有望在任务和平台之间忠实泛化,同时确保成本效益(例如,适应所需的代价数据和GPU小时数)。为此,我们开发了LingBot-VLA,使用来自9种流行双臂机器人配置的约20,000小时的真实世界数据。通过在3种机器人平台上进行系统评估,每个平台完成100个任务,每个任务有130个训练后回放,我们的模型在性能上明显优于竞争对手,展示了其强大的性能和广泛的泛化能力。我们还构建了一个高效的代码库,使用8块GPU的训练设置,每秒每GPU的吞吐量为261样本,比现有的VLA导向代码库快1.5~2.8倍(取决于所依赖的VLM基础模型)。上述特性确保了我们的模型适合实际部署。为了推进机器人学习领域的发展,我们提供了代码、基础模型和基准数据的开放访问,重点是使更多具有挑战性的任务成为可能,并促进合理的评估标准。
Summary / 总结
This paper introduces LingBot-VLA, a Vision-Language-Action foundation model developed with around 20,000 hours of real-world data from nine dual-arm robot configurations. In evaluations on three robotic platforms, each completing 100 tasks with 130 post-training episodes per task, LingBot-VLA outperforms competitors and shows broad generalizability. Its codebase achieves a throughput of 261 samples per second per GPU in an 8-GPU training setup, 1.5 to 2.8 times faster than existing VLA-oriented codebases. The code, base model, and benchmark data are openly available to support further research and sound evaluation standards in robot learning.
该论文介绍了使用九种双臂机器人配置的约20,000小时真实世界数据开发的Vision-Language-Action基础模型LingBot-VLA。通过在三个机器人平台上的评估(每个平台完成100个任务,每个任务130个训练后回合),LingBot-VLA展示了优越的性能和广泛的泛化能力。该模型的高效代码库支持每秒每GPU 261个样本的吞吐量,比现有的VLA定向代码库快1.5到2.8倍。作者提供了代码、基础模型和基准数据的开放访问,以促进进一步的研究和机器人学习的评估标准。
Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
Venue: ICLR 2026
First: 2025-12-18T18:59:27+00:00 · Latest: 2026-01-26T17:06:02+00:00
Comments: Accepted by ICLR 2026
Abstract
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
中文标题/摘要
标题:探索与利用:通过剪裁、熵和虚假奖励重新思考RLVR
本文探讨了带可验证奖励的强化学习(RLVR)中的探索与利用权衡问题,该框架旨在提高大型语言模型(LLMs)的推理能力。近期研究表明,RLVR可以通过两个看似矛盾的机制激发LLMs进行强大的数学推理:虚假奖励通过奖励与真实答案无关的结果来抑制利用,而熵最小化则通过促使模型产生更自信和确定性的输出来抑制探索,揭示了一个令人困惑的动态:抑制利用和抑制探索都能提高推理性能,但调和这两种效应的原理仍不甚明了。我们关注两个基本问题:(i)策略熵与性能的关系,(ii)虚假奖励是否能带来收益,可能是通过剪裁偏差和模型污染的相互作用。我们的结果显示,在虚假奖励下,剪裁偏差降低了策略熵,导致更自信和确定性的输出,而仅通过熵最小化无法实现改进。我们进一步提出一个奖励错配模型,解释为什么虚假奖励在污染环境之外也能增强性能。我们的发现阐明了虚假奖励益处背后的机制,并为更有效的RLVR训练提供了原则。
Summary / 总结
This paper investigates the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), focusing on the roles of spurious rewards and entropy minimization. It finds that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. The authors propose a reward-misalignment model to explain why spurious rewards can enhance performance beyond contaminated settings, clarifying the mechanisms behind spurious-reward benefits.
该论文研究了可验证奖励(RLVR)下的探索-利用权衡,重点关注伪奖励和熵最小化的作用。研究发现,在伪奖励下,剪裁偏差会降低策略的熵,导致更自信的输出,而仅通过熵最小化不足以提高性能。作者提出了一种奖励错配模型来解释为什么伪奖励在受污染环境之外也能提升性能,从而阐明了伪奖励益处背后的机制,并为更有效的RLVR训练提供了原则。
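The two quantities at the centre of the RLVR entry above, ratio clipping and policy entropy, can be made concrete with a minimal sketch. It assumes that "clipping" refers to the standard PPO-style clipped surrogate, which the abstract does not spell out, and it does not reproduce the paper's analysis of clipping bias. The function names and the example values are illustrative.

```python
import math

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style per-token objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# With a positive advantage, the objective stops growing once the ratio exceeds 1 + eps.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
inside = clipped_surrogate(ratio=1.1, advantage=1.0)
# With a negative advantage, the objective is flat once the ratio drops below 1 - eps.
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)

# A peaked (confident) policy has lower entropy than a uniform one.
uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])
peaked_entropy = entropy([0.97, 0.01, 0.01, 0.01])
```

Lower entropy corresponds to the "more confident and deterministic outputs" described in the abstract.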
Masked Generative Policy for Robotic Control
Authors: Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Edmond S. L. Ho, Gerardo Aragon Camarasa, Paul Henderson
First: 2025-12-09T20:37:40+00:00 · Latest: 2026-01-26T17:04:17+00:00
Abstract
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
中文标题/摘要
标题:掩码生成策略在机器人控制中的应用
我们提出了掩码生成策略(MGP),这是一种新颖的视觉运动模仿学习框架。我们将动作表示为离散的标记,并训练一个条件掩码变换器,该变换器并行生成标记,然后快速细化低置信度的标记。我们还提出了两种新的采样方法:MGP-Short,它使用基于分数的细化进行马尔可夫任务的并行掩码生成;MGP-Long,它在单次通过中预测完整轨迹,并根据新观察动态细化低置信度的动作标记。凭借全局一致的预测和稳健的自适应执行能力,MGP-Long 使可靠控制复杂和非马尔可夫任务成为可能,而这些任务对于先前的方法来说是具有挑战性的。在涵盖Meta-World和LIBERO基准的150个机器人操作任务的广泛评估中,MGP 在快速推理和成功率方面均优于最先进的扩散和自回归策略。具体而言,MGP 在150个任务中将平均成功率提高了9%,同时将每序列推理时间缩短了高达35倍。此外,在动态和缺失观察环境中,MGP 将平均成功率提高了60%,并解决了其他最先进的方法无法解决的两个非马尔可夫场景。
Summary / 总结
Masked Generative Policy (MGP) is a framework for visuomotor imitation learning that represents actions as discrete tokens and trains a conditional masked transformer to generate them in parallel, then refine only the low-confidence tokens. Its MGP-Long variant predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations, enabling reliable control on complex and non-Markovian tasks. Across 150 robotic manipulation tasks from Meta-World and LIBERO, MGP increases the average success rate by 9% and cuts per-sequence inference time by up to 35x compared to state-of-the-art diffusion and autoregressive policies. It also improves the average success rate by 60% in dynamic and missing-observation environments and solves two non-Markovian scenarios where other methods fail.
论文提出了Masked Generative Policy (MGP),一种新的视觉-运动模仿学习框架。MGP 将动作表示为离散的标记,并训练一个条件掩码变换器并行生成标记并快速细化低置信度的标记。提出了两种采样方法,MGP-Short 和 MGP-Long。MGP-Long 一次预测完整轨迹并根据新观察动态细化动作标记。在150个机器人操作任务上的评估显示,MGP 将平均成功率提高了9%,并将每序列推理时间减少了最多35倍,特别在动态和缺失观察环境中表现出色,并解决了其他先进方法无法解决的非马尔可夫场景。
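The generate-in-parallel-then-refine loop described in the MGP entry above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_model` is a random stand-in for the conditional masked transformer, and the fixed confidence threshold and number of refinement rounds are hypothetical choices, not the score-based refinement used in the paper.

```python
import random

def toy_model(tokens, positions, rng, vocab_size=8):
    """Stand-in for the masked transformer: returns (token, confidence) per position.

    A real model would condition on the unmasked entries of `tokens`;
    this placeholder ignores them and draws random values.
    """
    return {p: (rng.randrange(vocab_size), rng.random()) for p in positions}

def masked_generate(length, refine_steps=3, threshold=0.5, seed=0):
    """Fill all positions in one parallel pass, then re-predict low-confidence ones."""
    rng = random.Random(seed)
    tokens = [None] * length      # None marks a masked position
    confidence = [0.0] * length
    history = []                  # number of positions predicted in each round
    for _ in range(1 + refine_steps):
        masked = [p for p in range(length)
                  if tokens[p] is None or confidence[p] < threshold]
        if not masked:
            break
        for p in masked:
            tokens[p] = None      # re-mask before predicting again
        history.append(len(masked))
        for pos, (tok, conf) in toy_model(tokens, masked, rng).items():
            tokens[pos], confidence[pos] = tok, conf
    return tokens, history

tokens, history = masked_generate(length=12)
```

The first round predicts every position at once, and each later round touches only the positions whose confidence is still below the threshold, so the per-round workload never grows.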
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Authors: Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
Venue: ICLR 2026
First: 2025-10-09T08:07:19+00:00 · Latest: 2026-01-26T16:58:19+00:00
Comments: Accepted at ICLR 2026
Abstract
The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
中文标题/摘要
标题:MARC:增强记忆的RL标记压缩以实现高效的视频理解
大型语言模型(LLMs)的快速发展为多模态模型奠定了基础。然而,视觉语言模型(VLMs)在从图像扩展到视频时仍面临巨大的计算成本,因为视频具有高帧率和长持续时间。标记压缩是一种有前途的解决方案,但大多数现有的无训练方法会导致信息丢失和性能下降。为了解决这个问题,我们提出了**增强记忆的强化学习标记压缩(MARC)**,该方法结合了结构化检索和基于RL的蒸馏。MARC采用**检索-压缩**策略,使用**视觉记忆检索器(VMR)**选择关键片段,并使用**压缩组相对策略优化(C-GRPO)**框架从教师模型向学生模型传递推理能力。在六个视频基准上的实验表明,MARC仅使用一帧的标记即可达到接近基线的准确性——视觉标记减少**95%**,GPU内存减少**72%**,延迟减少**23.9%**。这表明其在资源受限的环境中(如视频问答、监控和自动驾驶)实现高效、实时视频理解的潜力。
Summary / 总结
MARC is a method that addresses the computational challenges of extending visual language models to video understanding by integrating a Visual Memory Retriever and a Compression Group Relative Policy Optimization framework. It reduces visual tokens by 95%, GPU memory by 72%, and latency by 23.9% while maintaining near-baseline accuracy on six video benchmarks.
MARC 是一种使用增强记忆的强化学习方法来压缩视频令牌以实现高效的视频理解。它使用视觉记忆检索器选择关键片段,并使用压缩组相对策略优化框架从教师模型向学生模型传递推理能力。实验结果显示,MARC 可以将视觉令牌减少 95%,GPU 内存减少 72%,延迟减少 23.9%,同时保持接近基线的准确性。
ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule
Authors: Yilie Huang, Wenpin Tang, Xunyu Zhou
First: 2026-01-26T16:56:40+00:00 · Latest: 2026-01-26T16:56:40+00:00
Comments: 17 pages, 7 figures
Abstract
We consider time discretization for score-based diffusion models to generate samples from a learned reverse-time dynamic on a finite grid. Uniform and hand-crafted grids can be suboptimal given a budget on the number of time steps. We introduce Adaptive Reparameterized Time (ART) that controls the clock speed of a reparameterized time variable, leading to a time change and uneven timesteps along the sampling trajectory while preserving the terminal time. The objective is to minimize the aggregate error arising from the discretized Euler scheme. We derive a randomized control companion, ART-RL, and formulate time change as a continuous-time reinforcement learning (RL) problem with Gaussian policies. We then prove that solving ART-RL recovers the optimal ART schedule, which in turn enables practical actor-critic updates to learn the latter in a data-driven way. Empirically, based on the official EDM pipeline, ART-RL improves Fréchet Inception Distance on CIFAR-10 over a wide range of budgets and transfers to AFHQv2, FFHQ, and ImageNet without the need of retraining.
中文标题/摘要
标题:ART 用于扩散采样:一种基于强化学习的时间步长安排方法
我们考虑将时间离散化应用于基于分数的扩散模型,以从学习到的逆时间动态在有限网格上生成样本。均匀和手工设计的时间网格在时间步数预算有限的情况下可能不是最优的。我们引入了自适应重参数化时间(ART),控制重参数化时间变量的时钟速度,从而在采样轨迹中产生时间变化和不均匀的时间步长,同时保持终端时间不变。目标是通过离散化的欧拉方案最小化累积误差。我们推导出随机控制伴侣ART-RL,并将时间变化形式化为具有高斯策略的连续时间强化学习(RL)问题。然后我们证明,解决ART-RL可以恢复最优的ART时间表,进而使实际的演员-评论家更新能够以数据驱动的方式学习后者。实验上,基于官方的EDM管道,ART-RL在CIFAR-10上广泛的时间步长预算范围内改进了Fréchet Inception Distance,并且可以转移到AFHQv2、FFHQ和ImageNet,无需重新训练。
Summary / 总结
The paper addresses the challenge of time discretization in score-based diffusion models, aiming to optimize the timestep schedule for sample generation. It introduces ART-RL, a reinforcement learning approach that formulates time change as a continuous-time RL problem, and proves that solving ART-RL recovers the optimal ART schedule. Experiments show that ART-RL improves Fréchet Inception Distance on CIFAR-10 and can be effectively transferred to other datasets without retraining.
论文通过引入自适应重参数化时间(ART)来优化时间离散化,以最小化离散化误差。ART-RL作为一种随机控制方法,将时间变化建模为连续时间的强化学习问题,从而能够学习最优的时间步长。实验表明,ART-RL在CIFAR-10上改进了Fréchet Inception Distance,并且在不同预算条件下具有可转移性,无需重新训练即可应用于其他数据集。
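The idea in the ART entry above, uneven timesteps that still end at the same terminal time and are consumed by an Euler scheme, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the mapping from per-step "clock speeds" to step sizes, the speed values, and the toy drift dx/dt = -x are all hypothetical, and neither the learned ART schedule nor the ART-RL updates are reproduced.

```python
def time_grid(clock_speeds, terminal_time=1.0):
    """Turn positive per-step clock speeds into an uneven grid ending at terminal_time."""
    total = sum(clock_speeds)
    grid = [0.0]
    for speed in clock_speeds:
        # Each step takes a share of the total time proportional to its speed.
        grid.append(grid[-1] + terminal_time * speed / total)
    grid[-1] = terminal_time  # remove floating-point drift at the endpoint
    return grid

def euler(drift, x0, grid):
    """Explicit Euler integration of dx/dt = drift(x, t) along the given grid."""
    x = x0
    for t0, t1 in zip(grid, grid[1:]):
        x = x + (t1 - t0) * drift(x, t0)
    return x

uniform = time_grid([1.0, 1.0, 1.0, 1.0])
uneven = time_grid([4.0, 3.0, 2.0, 1.0])  # hypothetical speeds: larger steps early

x_uniform = euler(lambda x, t: -x, 1.0, uniform)
x_uneven = euler(lambda x, t: -x, 1.0, uneven)
```

Both grids use the same number of steps and the same terminal time, yet the Euler results differ, which is why the choice of schedule matters under a fixed step budget. Which schedule is better depends on the dynamics; for this toy linear drift the uniform grid happens to be the more accurate one.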