arXiv 论文速递

Snapshot: 20260204_0352

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Authors: Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

First: 2026-02-02T18:59:04+00:00 · Latest: 2026-02-02T18:59:04+00:00

Comments: Code: https://github.com/Gen-Verse/Open-AgentRL

Abstract

We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also show that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

中文标题/摘要

标题：RLAnything：在完全动态的RL系统中锻造环境、策略和奖励模型

我们提出了一种名为RLAnything的强化学习框架，该框架通过闭环优化动态锻造环境、策略和奖励模型，增强学习信号，从而强化整个RL系统，适用于任何LLM或代理场景。具体而言，策略通过逐步反馈和结果信号的整合进行训练，而奖励模型则通过一致性反馈联合优化，进一步提高策略训练效果。此外，基于理论动机的自动环境适应通过利用每个模型的评论反馈，增强了奖励和策略模型的训练，使学习更加有效。实验证明，每个新增组件都一致地提高了整个系统的效果，RLAnything在各种代表性LLM和代理任务中均取得了显著提升，分别提高了Qwen3-VL-8B-Thinking在OSWorld上的9.1%，Qwen2.5-7B-Instruct在AlfWorld和LiveBench上的18.7%和11.9%。我们还发现优化后的奖励模型信号优于依赖于人工标签的结果。代码：https://github.com/Gen-Verse/Open-AgentRL

Summary / 总结

RLAnything is a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, strengthening the learning signal for LLM and agentic tasks. The policy is trained using integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which further improves policy training. The framework also includes theory-motivated automatic environment adaptation, which leverages critic feedback from both the reward and policy models to improve their training. Empirically, each added component improves the overall system, with RLAnything boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. Optimized reward-model signals also outperform outcome signals that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

RLAnything 是一种通过闭环优化动态创建环境、策略和奖励模型的强化学习框架，增强语言模型和代理场景的学习过程。策略通过步骤和结果信号的集成反馈进行训练，而奖励模型通过一致性反馈进行优化，进一步提高策略训练。该框架还包括基于批评反馈的自动环境适应，以提高两个模型的训练效果。实验证明，每个新增组件都能一致地提升系统性能，RLAnything 在各种任务上显著提升了性能，增幅从 9.1% 到 18.7% 不等。优化后的奖励模型信号优于基于人工标签的结果。代码: https://github.com/Gen-Verse/Open-AgentRL

Expanding the Capabilities of Reinforcement Learning via Text Feedback

Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette

First: 2026-02-02T18:56:56+00:00 · Latest: 2026-02-02T18:56:56+00:00

Comments: 43 pages, 6 figures

Abstract

The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.

中文标题/摘要

标题：通过文本反馈扩展强化学习的能力

对于LLM后训练的RL成功，其根源在于一个不合理地不具信息性的来源：每次rollout仅提供一个二进制奖励或偏好标签的信息。在另一端，蒸馏可以提供密集的监督，但需要演示，这既昂贵又难以扩展。我们研究文本反馈作为一种中间信号：比标量奖励更丰富，但比完整演示更便宜。文本反馈是人类交互的自然模式，并且在许多现实世界环境中已经非常普遍，用户、注释者和自动裁判员经常对LLM输出进行评价。为了大规模利用文本反馈，我们形式化了一个多轮RL设置，文本反馈强化学习（RLTF），其中在训练时有文本反馈，但在推理时没有。因此，模型必须学会内化反馈以提高其测试时单轮性能。为此，我们提出了两种方法：自我蒸馏（RLTF-SD），训练单轮策略使其匹配自身反馈条件下的第二轮生成；反馈建模（RLTF-FM），将预测反馈作为辅助目标。我们对这两种方法进行了理论分析，并在推理谜题、竞赛数学和创造性写作任务上进行了实验评估。我们的结果表明，两种方法在基准测试中均优于强基线，突显了大规模使用额外丰富监督源的RL的潜力。

Summary / 总结

This paper explores the use of text feedback to enhance reinforcement learning (RL) for large language models (LLMs), aiming to provide richer supervision than binary rewards but less costly than full demonstrations. Two methods, Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM), are proposed to leverage text feedback during training. The methods are evaluated on reasoning puzzles, competition math, and creative writing tasks, showing consistent improvement over strong baselines across different benchmarks.

本文探讨了使用文本反馈来增强大型语言模型（LLM）的强化学习（RL）。动机是提供比标量奖励更丰富的反馈，但避免全演示的成本。作者提出了两种方法：自我蒸馏（RLTF-SD）和反馈建模（RLTF-FM）。这两种方法通过在训练期间内化文本反馈来提高单轮性能。实验结果显示，这两种方法在逻辑谜题、竞赛数学和创意写作等任务上均优于强基线，展示了文本反馈在大规模RL中的潜力。

Flow Policy Gradients for Robot Control

Authors: Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E. Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel, Guanya Shi, Karen Liu, Angjoo Kanazawa

First: 2026-02-02T18:56:49+00:00 · Latest: 2026-02-02T18:56:49+00:00

Comments: Project webpage: https://hongsukchoi.github.io/fpo-control

Abstract

Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients -- a recent framework that bypasses likelihood computation -- can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis on training dynamics. Results show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.

中文标题/摘要

标题：机器人控制的流策略梯度

基于似然性的策略梯度方法是通过奖励训练机器人控制策略的主要方法。这些方法依赖于可微的动作似然性，这限制了策略输出为简单的分布，如高斯分布。在本文中，我们展示了如何使流匹配策略梯度——一种最近的框架，绕过了似然性计算——能够有效地训练和微调更具表现力的策略，以应对具有挑战性的机器人控制环境。我们引入了一个改进的目标，使其能够在腿部运动、类人运动跟踪和操作任务中取得成功，并在两个类人机器人上实现了鲁棒的模拟到现实世界的转移。然后，我们进行了训练动态的消融分析。结果表明，当从头开始训练时，策略可以利用流表示进行探索，以及与基线相比，改进了微调的鲁棒性。

Summary / 总结

This study addresses the limitations of likelihood-based policy gradient methods in robot control by introducing flow matching policy gradients, which do not require differentiable action likelihoods. The research demonstrates the effectiveness of this approach in training and fine-tuning more expressive policies for legged locomotion, humanoid motion tracking, and manipulation tasks. The method also shows robust sim-to-real transfer on two humanoid robots, with improved fine-tuning robustness compared to baseline methods.

本文探讨了使用流匹配策略梯度来训练更具表达性的机器人控制策略，解决了基于似然的方法的限制。作者引入了一个改进的目标，使其能够在复杂的任务如腿足运动、类人运动跟踪和操作中成功训练。该方法还在两个类人机器人上展示了稳健的模拟到现实的转移。实验表明，策略可以在训练过程中利用流表示进行有效的探索，并且在微调时表现出比基线方法更好的稳健性。

AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Authors: Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, Chetan Bansal

First: 2026-02-02T18:54:07+00:00 · Latest: 2026-02-02T18:54:07+00:00

Abstract

AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.

中文标题/摘要

标题：AgentRx：从执行轨迹诊断AI代理故障

AI代理常常以难以定位的方式失败，因为执行具有概率性、长时序、多代理性，并且受到嘈杂工具输出的中介。我们通过手动标注失败的代理运行并发布了一个包含115个失败轨迹的新基准，这些轨迹涵盖了结构化API工作流、事件管理以及开放性网络/文件任务。每个轨迹都标注了一个关键失败步骤和一个从扎根理论推导出的跨域故障分类学中的类别。为了减轻失败归因的人力成本，我们提出了AGENTRX，一个自动化的领域无关诊断框架，能够定位失败代理轨迹中的关键失败步骤。它综合了约束条件，逐步骤评估，并生成一个可审计的验证日志，其中包含与证据相关的约束违规；基于LLM的法官使用此日志来定位关键步骤和类别。我们的框架在三个领域中提高了步骤定位和故障归因的效果，优于现有基线。

Summary / 总结

The paper addresses the challenge of diagnosing AI agent failures with AgentRx, an automated, domain-agnostic diagnostic framework. The authors manually annotate 115 failed trajectories across three domains, labeling each with a critical failure step and a category from a grounded-theory derived failure taxonomy. AgentRx synthesizes constraints, evaluates them step-by-step, and generates an auditable validation log that an LLM-based judge uses to pinpoint the critical failure step and category, improving over existing baselines in step localization and failure attribution.

论文通过手动标注115个不同领域的失败轨迹来解决AI代理故障诊断的挑战。它引入了AGENTRX，一个自动诊断框架，可以识别关键故障步骤并提供可审计的验证日志。该框架在结构化API工作流、事件管理以及开放式网络/文件任务领域中优于现有方法，在定位故障步骤和故障归因方面表现更佳。

HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos

Authors: Yinhuai Wang, Qihan Zhao, Yuen Fui Lau, Runyi Yu, Hok Wai Tsui, Qifeng Chen, Jingbo Wang, Jiangmiao Pang, Ping Tan

First: 2026-02-02T18:53:01+00:00 · Latest: 2026-02-02T18:53:01+00:00

Abstract

Enabling humanoid robots to perform agile and adaptive interactive tasks has long been a core challenge in robotics. Current approaches are bottlenecked by either the scarcity of realistic interaction data or the need for meticulous, task-specific reward engineering, which limits their scalability. To narrow this gap, we present HumanX, a full-stack framework that compiles human video into generalizable, real-world interaction skills for humanoids, without task-specific rewards. HumanX integrates two co-designed components: XGen, a data generation pipeline that synthesizes diverse and physically plausible robot interaction data from video while supporting scalable data augmentation; and XMimic, a unified imitation learning framework that learns generalizable interaction skills. Evaluated across five distinct domains--basketball, football, badminton, cargo pickup, and reactive fighting--HumanX successfully acquires 10 different skills and transfers them zero-shot to a physical Unitree G1 humanoid. The learned capabilities include complex maneuvers such as pump-fake turnaround fadeaway jumpshots without any external perception, as well as interactive tasks like sustained human-robot passing sequences over 10 consecutive cycles--learned from a single video demonstration. Our experiments show that HumanX achieves over 8 times higher generalization success than prior methods, demonstrating a scalable and task-agnostic pathway for learning versatile, real-world robot interactive skills.

中文标题/摘要

标题：HumanX：从人类视频中实现灵活且通用的人形交互技能

使类人机器人能够执行灵活且适应性强的交互任务一直是机器人学中的核心挑战。当前的方法要么受限于现实交互数据的稀缺性，要么需要进行细致的任务特定奖励工程，这限制了它们的可扩展性。为缩小这一差距，我们提出了HumanX，这是一个全栈框架，能够将人类视频编译为类人机器人可以通用的、现实世界的交互技能，而无需任务特定的奖励。HumanX 结合了两个协同设计的组件：XGen，一个数据生成管道，能够从视频中合成多样且物理上合理的机器人交互数据，同时支持可扩展的数据增强；以及XMimic，一个统一的模仿学习框架，用于学习通用的交互技能。在五个不同的领域——篮球、足球、羽毛球、货物拾取和反应性格斗——中评估了HumanX，它成功地获得了10种不同的技能，并在零样本的情况下转移到了物理的Unitree G1类人机器人上。学习到的能力包括复杂的操作，如在没有任何外部感知的情况下进行假动作转身跳投，以及交互任务，如持续10个周期的人机传球序列——仅从一个视频演示中学习。我们的实验表明，与先前的方法相比，HumanX 的泛化成功率提高了8倍，展示了学习多功能、现实世界机器人交互技能的可扩展且任务无关的途径。

Summary / 总结

The research aims to enable humanoid robots to perform agile and adaptive interactive tasks by leveraging human video data. The method involves a full-stack framework called HumanX, which includes XGen for synthesizing diverse robot interaction data and XMimic for learning generalizable skills. Key findings show that HumanX successfully acquires 10 different skills across various domains and transfers them to a physical humanoid robot, achieving over 8 times higher generalization success compared to previous methods.

HumanX 是一个全栈框架，旨在使类人机器人能够从人类视频中学习通用的交互技能，而无需特定任务的奖励。它包括 XGen，一个数据生成管道，用于生成多样且物理上合理的机器人交互数据，以及 XMimic，一个模仿学习框架，用于学习这些技能。HumanX 成功地在各种领域中获得了 10 种不同的技能，并将其转移到了一个物理类人机器人上，其泛化成功率比以前的方法高 8 倍以上。

Helios 2.0: A Robust, Ultra-Low Power Gesture Recognition System Optimised for Event-Sensor based Wearables

Authors: Prarthana Bhattacharyya, Joshua Mitton, Ryan Page, Owen Morgan, Oliver Powell, Benjamin Menzies, Gabriel Homewood, Kemi Jacobs, Paolo Baesso, Taru Muhonen, Richard Vigars, Louis Berridge

First: 2025-03-10T20:12:06+00:00 · Latest: 2026-02-02T18:50:47+00:00

Comments: 24 pages, 14 figures. Prarthana Bhattacharyya, Joshua Mitton, Ryan Page, Owen Morgan, and Oliver Powell contributed equally to this paper

Abstract

We present an advance in wearable technology: a mobile-optimized, real-time, ultra-low-power event camera system that enables natural hand gesture control for smart glasses, dramatically improving user experience. While hand gesture recognition in computer vision has advanced significantly, critical challenges remain in creating systems that are intuitive, adaptable across diverse users and environments, and energy-efficient enough for practical wearable applications. Our approach tackles these challenges through carefully selected microgestures: lateral thumb swipes across the index finger (in both directions) and a double pinch between thumb and index fingertips. These human-centered interactions leverage natural hand movements, ensuring intuitive usability without requiring users to learn complex command sequences. To overcome variability in users and environments, we developed a novel simulation methodology that enables comprehensive domain sampling without extensive real-world data collection. Our power-optimised architecture maintains exceptional performance, achieving F1 scores above 80% on benchmark datasets featuring diverse users and environments. The resulting models operate at just 6-8 mW when exploiting the Qualcomm Snapdragon Hexagon DSP, with our 2-channel implementation exceeding 70% F1 accuracy and our 6-channel model surpassing 80% F1 accuracy across all gesture classes in user studies. These results were achieved using only synthetic training data. This improves on the state of the art in F1 accuracy by 20% while reducing power by 25x when using the DSP. This advance brings the deployment of ultra-low-power vision systems in wearable devices closer and opens new possibilities for seamless human-computer interaction.

中文标题/摘要

标题：Helios 2.0：一种针对事件传感器穿戴设备优化的稳健、超低功耗手势识别系统

我们提出了一项穿戴技术的进步：一种移动优化、实时、超低功耗的事件相机系统，能够为智能眼镜提供自然的手势控制，显著提升用户体验。尽管计算机视觉中的手势识别取得了显著进展，但在创建直观、适应不同用户和环境、且足够节能以适用于实际穿戴应用的系统方面仍面临重大挑战。我们通过精心选择的微手势——拇指在食指上的横向滑动（双向）和拇指与食指指尖之间的双捏——来应对这些挑战。这些以人为中心的交互利用了自然的手部动作，确保了直观的易用性，无需用户学习复杂的命令序列。为了克服用户和环境的差异性，我们开发了一种新颖的模拟方法，能够在不进行大量实地数据收集的情况下实现全面的领域采样。我们的节能架构保持了出色的性能，基准数据集上的F1分数超过80%。当利用高通骁龙Hexagon DSP时，这些模型在用户研究中仅消耗6-8 mW，我们的双通道实现超过70%的F1准确率，而我们的六通道模型在所有手势类别上的F1准确率超过80%。这些结果仅使用合成训练数据获得。与使用DSP时相比，这将F1准确率提高了20%，功耗降低了25倍。这一进展使部署超低功耗视觉系统在穿戴设备中更加接近，并为无缝的人机交互开辟了新的可能性。

Summary / 总结

This paper introduces Helios 2.0, a mobile-optimized, ultra-low-power gesture recognition system for event cameras in smart glasses. It addresses challenges in creating intuitive, adaptable, and energy-efficient systems by using microgestures like lateral thumb swipes and double pinches. The system achieves F1 scores above 80% on benchmark datasets and operates at 6-8 mW on the Qualcomm Snapdragon Hexagon DSP; in user studies, the 2-channel and 6-channel models exceed 70% and 80% F1, respectively, using only synthetic training data. This is a 20% improvement in F1 with a 25x power reduction over the state of the art when using the DSP.

论文介绍了Helios 2.0，这是一种低功耗手势识别系统，适用于可穿戴设备，特别是智能眼镜。该系统使用如拇指侧滑和双指捏合等自然且直观的微手势。系统采用了一种新颖的模拟方法进行领域采样，并进行了能效优化，实现了在基准数据集上的F1分数超过80%。模型在仅6-8 mW功耗下运行，6通道模型在所有手势类别上的F1准确率超过80%，相比现有技术提高了20%的F1准确率，功耗降低了25倍。

Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

Authors: Aiden Yiliu Li, Xinyue Hao, Shilong Liu, Mengdi Wang

First: 2026-02-02T18:50:07+00:00 · Latest: 2026-02-02T18:50:07+00:00

Abstract

Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.

中文标题/摘要

标题：Avenir-Web：基于混合定位专家、模仿人类经验的多模态网络代理

尽管多模态大型语言模型取得了进展，但自主网络代理仍然难以可靠地在复杂和动态的网络界面中执行长期任务。现有代理经常遭受元素定位不准确、缺乏特定站点的操作知识以及长期任务跟踪和记忆不稳定的问题，尤其是在处理复杂的文档对象模型结构时。为了解决这些限制，我们引入了Avenir-Web，这是一种在实际部署中达到在线-Mind2Web基准新开源前沿的网络代理。Avenir-Web利用混合定位专家、经验模仿规划以结合先验操作知识，并结合任务跟踪清单和自适应记忆，以实现跨不同用户界面范式的稳健和无缝交互。我们在严格的在线-Mind2Web基准上评估了Avenir-Web，这是一个以实时和用户为中心的网络任务基准。我们的结果表明，Avenir-Web显著超越了先前的开源代理，并达到了顶级专有模型的性能水平，从而为可靠的实时网站网络代理建立了新的开源前沿。

Summary / 总结

Avenir-Web is designed to improve the reliability of web agents in executing long-horizon tasks on complex web interfaces. It uses a Mixture of Grounding Experts, Experience-Imitation Planning, and a task-tracking checklist with adaptive memory to address issues such as inaccurate element grounding and unstable long-term task tracking. Avenir-Web achieves a new open-source state of the art on the Online-Mind2Web benchmark, surpassing prior open-source agents and matching the performance of top-tier proprietary models.

Avenir-Web旨在提高网络代理在执行复杂网页界面的长期任务时的可靠性。它使用混合接地专家、经验模仿规划以及带有自适应记忆的任务跟踪检查表来解决元素定位不准确和长期任务跟踪不稳定等问题。Avenir-Web在Online-Mind2Web基准测试中达到了新的开源最佳状态，超越了之前的开源代理，并与顶级专有模型的性能相当。

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Authors: Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

First: 2026-02-02T18:49:06+00:00 · Latest: 2026-02-02T18:49:06+00:00

Comments: 9 pages, 8 figures

Abstract

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

中文标题/摘要

标题：MentisOculi：揭示心智图像推理的局限

前沿模型正在从仅能摄取视觉信息的多模态大型语言模型（MLLMs）过渡到能够原生交错生成的统一多模态模型（UMMs）。这一转变激发了使用中间可视化作为推理辅助工具的兴趣，类似于人类的心智图像。这一想法的核心在于能够以目标导向的方式形成、维持和操作视觉表征。为了评估和探究这一能力，我们开发了MentisOculi，这是一种程序化的、分层的多步骤推理问题套件，适用于视觉解决方案，并针对前沿模型设置了挑战。评估从潜在标记到显式生成图像的各种视觉策略，我们发现它们通常未能提高性能。对UMMs的具体分析揭示了一个关键局限：尽管它们具备了解决任务的文本推理能力，并且有时能够生成正确的视觉，但它们会遭受累积生成错误，并且无法利用真实视觉信息。我们的研究结果表明，尽管视觉思维具有内在吸引力，但它们尚未对模型推理产生益处。MentisOculi为分析并弥合这一差距奠定了必要的基础，适用于多种模型家族。

Summary / 总结

The study evaluates visual reasoning aids in frontier models, in particular unified multimodal models (UMMs), by developing MentisOculi, a procedural, stratified suite of multi-step reasoning problems. Visual strategies ranging from latent tokens to explicitly generated imagery generally fail to improve performance. UMMs have the textual reasoning capacity to solve the tasks and can sometimes generate correct visuals, but they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. This suggests that visual thoughts do not yet benefit model reasoning.

研究旨在评估统一多模态模型（UMMs）使用视觉推理的能力，灵感来源于人类的思维图像。开发了MentisOculi，一套多步骤推理问题，以测试这一能力。研究发现，UMMs在使用视觉策略时通常无法提高性能，并且即使生成了正确的视觉，也会出现累积生成错误。这表明，尽管视觉思考具有潜力，但目前尚未提升模型的推理能力。MentisOculi为分析和解决这一问题提供了框架，适用于不同类型的模型。

Conflict-Aware Client Selection for Multi-Server Federated Learning

Authors: Mingwei Hong, Zheng Lin, Zehang Lin, Lin Li, Miao Yang, Xia Du, Zihan Fang, Zhaolu Kang, Dianxin Luan, Shunzhi Zhu

First: 2026-02-02T18:47:16+00:00 · Latest: 2026-02-02T18:47:16+00:00

Comments: 6 pages, 4 figures

Abstract

Federated learning (FL) has emerged as a promising distributed machine learning (ML) paradigm that enables collaborative model training across clients without exposing raw data, thereby preserving user privacy and reducing communication costs. Despite these benefits, traditional single-server FL suffers from high communication latency due to the aggregation of models from a large number of clients. While multi-server FL distributes workloads across edge servers, overlapping client coverage and uncoordinated selection often lead to resource contention, causing bandwidth conflicts and training failures. To address these limitations, we propose a decentralized reinforcement learning framework with conflict risk prediction, named RL-CRP, to optimize client selection in multi-server FL systems. Specifically, each server estimates the likelihood of client selection conflicts using a categorical hidden Markov model based on its sparse historical client selection sequence. Then, a fairness-aware reward mechanism is incorporated to promote long-term client participation for minimizing training latency and resource contention. Extensive experiments demonstrate that the proposed RL-CRP framework effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.

中文标题/摘要

标题：冲突感知客户端选择在多服务器联邦学习中的应用

联邦学习（FL）作为一种分布式机器学习（ML）方法，允许跨客户端协作模型训练而不暴露原始数据，从而保护用户隐私并减少通信成本。尽管具有这些优势，传统的单服务器FL由于需要聚合大量客户端的模型而遭受高通信延迟。多服务器FL将工作负载分布在边缘服务器上，但由于客户端覆盖范围的重叠和未协调的客户端选择，往往导致资源争用，造成带宽冲突和训练失败。为了解决这些限制，我们提出了一种去中心化的强化学习方法，结合冲突风险预测，称为RL-CRP，以优化多服务器FL系统中的客户端选择。具体来说，每个服务器使用基于其稀疏历史客户端选择序列的分类隐马尔可夫模型来估计客户端选择冲突的可能性。然后，引入了一种公平性意识的奖励机制，以促进长期客户端参与，从而减少训练延迟和资源争用。广泛的实验表明，所提出的RL-CRP框架有效地减少了服务器间的冲突，并在收敛速度和通信成本方面显著提高了训练效率。

Summary / 总结

The paper targets client selection in multi-server federated learning (FL), where overlapping client coverage and uncoordinated selection cause resource contention, bandwidth conflicts, and training failures. It proposes a decentralized reinforcement learning framework named RL-CRP, in which each server uses a categorical hidden Markov model over its sparse client selection history to predict selection conflicts, and a fairness-aware reward mechanism promotes long-term client participation. Experimental results show that this approach effectively reduces inter-server conflicts and improves training efficiency in terms of convergence speed and communication cost.

论文针对多服务器联邦学习中由于资源竞争和带宽冲突导致的高通信延迟问题，提出了一种去中心化的强化学习框架RL-CRP。该框架使用类别隐马尔可夫模型预测客户端选择冲突，并结合公平性意识的奖励机制来优化客户端选择。实验结果表明，RL-CRP能够减少服务器间的冲突，通过加快收敛速度和降低通信成本来提高训练效率。
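
The conflict-prediction step described in the RL-CRP abstract can be illustrated with a small sketch. This is not the authors' code: it assumes a binary observation encoding (0 = no conflict, 1 = conflict), and the function name `predict_next_obs` and all parameter values are hypothetical. It runs the standard forward filter of a categorical hidden Markov model and returns the predicted distribution over the next observation.

```python
def predict_next_obs(pi, A, B, obs):
    """Predict P(next observation = k | obs) under a categorical HMM.

    pi  : initial state probabilities, length n
    A   : n x n state transition matrix, A[i][j] = P(next state j | state i)
    B   : n x m emission matrix, B[j][k] = P(observation k | state j)
    obs : non-empty list of past observation indices
          (hypothetical encoding: 0 = no conflict, 1 = conflict)
    """
    n, m = len(pi), len(B[0])

    # Forward filtering, normalised at every step to avoid underflow.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    total = sum(alpha)
    alpha = [a / total for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
        total = sum(alpha)
        alpha = [a / total for a in alpha]

    # One-step state prediction, then marginalise over emissions.
    next_state = [sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)]
    return [sum(next_state[j] * B[j][k] for j in range(n)) for k in range(m)]


# Illustrative (made-up) parameters: state 0 = "quiet", state 1 = "contended".
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
probs = predict_next_obs(pi, A, B, [1, 1, 0])
print("P(conflict at next selection) =", probs[1])
```

In a setup like the one the abstract describes, a server could feed such a probability into its selection policy as a conflict-risk feature; how RL-CRP actually combines the two is not specified in the abstract.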

Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

Authors: Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, Yanfang Ye

First: 2026-02-02T18:46:16+00:00 · Latest: 2026-02-02T18:46:16+00:00

Comments: 65 pages, 40 figures

Abstract

As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce Drift-Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, Drift-Bench provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. Drift-Bench bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.

中文标题/摘要

标题：Drift-Bench：通过多轮交互诊断输入故障下LLM代理的合作破裂

随着大型语言模型转变为自主代理，用户输入经常违反合作假设（例如，隐含意图、缺失参数、虚假预设或模糊表达），这在仅基于文本的评估中无法捕捉到执行风险。现有基准通常假设明确的指令或仅限于单轮澄清的文本评估，因此无法衡量在基于执行风险的多轮澄清下的代理语用学。我们引入了Drift-Bench，这是第一个通过多轮澄清评估输入故障下代理语用学的诊断基准，跨越状态导向和服务导向的执行环境。基于经典通信理论，Drift-Bench 提供了一个统一的合作破裂分类，并采用以角色为导向的用户模拟器和Rise评估协议。实验显示，在这些故障下性能显著下降，澄清效果在不同用户角色和故障类型之间有所不同。Drift-Bench 将澄清研究与代理安全性评估相结合，使系统诊断可能导致不安全执行的失败成为可能。

Summary / 总结

Drift-Bench is a diagnostic benchmark that evaluates the cooperative breakdowns in LLM agents under input faults through multi-turn interaction. It addresses the execution risks not captured by text-only evaluations and introduces a unified taxonomy of cooperative breakdowns. The benchmark uses a persona-driven user simulator and the Rise evaluation protocol to assess performance drops and clarification effectiveness under various faults, highlighting the need for systematic diagnosis of potential unsafe executions.

Drift-Bench 是一个诊断基准，通过多轮交互评估 LLM 代理在输入故障下的合作破裂情况。它解决了纯文本评估无法捕捉的执行风险，并引入了一个合作破裂的统一分类体系。该基准使用基于人设的用户模拟器和 Rise 评估协议来评估在不同故障类型下的性能下降和澄清效果，突显了系统诊断潜在不安全执行的必要性。

World-Gymnast: Training Robots with Reinforcement Learning in a World Model

Authors: Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, Sherry Yang

First: 2026-02-02T18:44:45+00:00 · Latest: 2026-02-02T18:44:45+00:00

Comments: https://world-gymnast.github.io/

Abstract

Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.

中文标题/摘要

标题：世界体操手：使用世界模型中的强化学习训练机器人

机器人通过与物理世界互动学习受到物理互动成本的限制。两种替代方案，基于专家演示的监督微调（SFT）和基于软件模拟的强化学习（RL），分别受限于可用的专家数据量和操作的模拟到现实的差距。随着世界模型从真实世界视频-动作数据中学习的最近出现，我们提出了一个问题：在世界模型中训练策略是否比监督学习或软件模拟更能实现更好的机器人性能。我们提出了World-Gymnast，它通过在动作条件下的视频世界模型中展开策略，并用视觉-语言模型（VLM）奖励展开来执行视觉-语言-动作（VLA）策略的RL微调。在Bridge机器人设置中，World-Gymnast的表现比SFT高出18倍，比软件模拟高出2倍。更重要的是，World-Gymnast展示了使用世界模型进行RL的有趣能力，包括在多种语言指令和世界模型中的新场景上进行训练，在新场景中的测试时训练，以及在线迭代改进世界模型和策略。我们的结果表明，学习世界模型并在云端训练机器人策略可能是弥合演示中工作的机器人和在任何家庭中工作的机器人之间差距的关键。

Summary / 总结

World-Gymnast addresses the limitations of robot learning by using reinforcement learning (RL) in a world model to overcome the high cost of physical interaction and the sim-to-real gap. On the Bridge robot setup, it outperforms supervised finetuning by as much as 18x and a software simulator by as much as 2x. Key findings include the ability to train on diverse language instructions and novel scenes, test-time training in a novel scene, and online iterative improvement of the world model and policy.

World-Gymnast通过使用世界模型中的强化学习（RL）来克服物理交互的高成本和模拟到现实的差距。在Bridge机器人设置上，其表现最多比监督微调高出18倍，比软件模拟高出2倍。主要发现包括能够训练多种语言指令、在新场景中进行测试时的训练以及在线迭代改进世界模型和策略的能力。

Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Authors: Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao

First: 2026-02-02T18:43:57+00:00 · Latest: 2026-02-02T18:43:57+00:00

Comments: Working paper

Abstract

Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.

中文标题/摘要

标题：用漫画思考：通过结构化视觉叙事增强多模态推理

链式思考推理促使大型语言模型从仅处理文本扩展到处理图像和视频。然而，不同模态仍然存在明显的局限性：静态图像难以表现时间结构，而视频则引入了大量冗余和计算成本。在本研究中，我们提出了一种名为“用漫画思考”的视觉推理范式，该范式利用漫画作为介于图像和视频之间的一种高信息密度媒介。漫画保留了时间结构、嵌入文本和叙述连贯性，同时所需推理成本显著降低。我们系统地研究了两种基于漫画的推理路径，并在多种推理任务和长上下文理解任务中进行了评估。实验结果表明，用漫画思考在多步时间因果推理任务中优于用图像思考，同时在效率上远超用视频思考。进一步分析表明，不同漫画叙述结构和风格在不同任务中始终影响推理性能，这表明漫画作为一种有效的中间视觉表示，有助于提高多模态推理能力。

Summary / 总结

This work introduces Thinking with Comics, a visual reasoning paradigm that uses comics to enhance multimodal reasoning by preserving temporal structure and narrative coherence while reducing computational cost compared to videos. The study evaluates two comic-based reasoning paths and finds that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks while remaining more efficient than Thinking with Video. Comic narrative structure and style are shown to affect performance across tasks, suggesting that comics are an effective intermediate visual representation for multimodal reasoning.

本研究旨在通过提出Thinking with Comics这一视觉推理范式来增强多模态推理，该范式利用漫画平衡了图像和视频的局限性。研究评估了两种基于漫画的推理路径，并发现这种方法在多步时间因果推理任务中优于图像，同时比视频更高效。分析还表明，不同类型的漫画叙事结构和风格会影响各种任务中的表现，这表明漫画是提高多模态推理的有效中间视觉表示。

Active Causal Experimentalist (ACE): Learning Intervention Strategies via Direct Preference Optimization

Authors: Patrick Cooper, Alvaro Velasquez

First: 2026-02-02T18:43:52+00:00 · Latest: 2026-02-02T18:43:52+00:00

Comments: 9 pages, 5 figures

Abstract

Discovering causal relationships requires controlled experiments, but experimentalists face a sequential decision problem: each intervention reveals information that should inform what to try next. Traditional approaches such as random sampling, greedy information maximization, and round-robin coverage treat each decision in isolation, unable to learn adaptive strategies from experience. We propose Active Causal Experimentalist (ACE), which learns experimental design as a sequential policy. Our key insight is that while absolute information gains diminish as knowledge accumulates (making value-based RL unstable), relative comparisons between candidate interventions remain meaningful throughout. ACE exploits this via Direct Preference Optimization, learning from pairwise intervention comparisons rather than non-stationary reward magnitudes. Across synthetic benchmarks, physics simulations, and economic data, ACE achieves 70-71% improvement over baselines at equal intervention budgets (p < 0.001, Cohen's d ~ 2). Notably, the learned policy autonomously discovers that collider mechanisms require concentrated interventions on parent variables, a theoretically-grounded strategy that emerges purely from experience. This suggests preference-based learning can recover principled experimental strategies, complementing theory with learned domain adaptation.

中文标题/摘要

标题：活跃因果实验主义者（ACE）：通过直接偏好优化学习干预策略

发现因果关系需要控制实验，但实验主义者面临一个顺序决策问题：每次干预都会揭示信息，这些信息应该用来指导下一步的尝试。传统方法如随机抽样、贪婪的信息最大化和轮换覆盖将每个决策视为独立的，无法从经验中学习适应性策略。我们提出了活跃因果实验主义者（ACE），它将实验设计视为一个顺序策略。我们的关键见解是，虽然绝对信息增益随着知识积累而减少（使基于价值的强化学习不稳定），但候选干预措施之间的相对比较在整个过程中仍然有意义。ACE 通过直接偏好优化利用这一点，从两两干预比较中学习，而不是从非平稳的奖励幅度中学习。在合成基准、物理模拟和经济数据中，ACE 在相同干预预算下比基线提高了 70-71%（p < 0.001，Cohen's d ~ 2）。值得注意的是，学习到的策略自主发现碰撞机制需要在父变量上集中干预，这是一个从经验中纯粹涌现的、具有理论依据的策略。这表明偏好学习可以恢复原理性的实验策略，将理论与学习到的领域适应相结合。

Summary / 总结

The paper addresses the challenge of discovering causal relationships through controlled experiments, where each intervention informs subsequent decisions. It introduces Active Causal Experimentalist (ACE), which learns experimental design as a sequential policy using Direct Preference Optimization to compare candidate interventions. Experiments show ACE outperforms baseline methods by 70-71% with the same intervention budget, and it autonomously learns to concentrate interventions on parent variables in collider mechanisms, demonstrating the potential of preference-based learning to recover theoretically grounded strategies.

论文解决了通过控制实验发现因果关系的挑战，每个干预都会影响下一个干预。它提出了Active Causal Experimentalist (ACE)，该方法将实验设计视为一个顺序策略，通过关注候选干预措施之间的相对比较而非绝对信息增益。在各种基准测试中，ACE 的表现优于传统方法，提高了70-71%，并且它自主学习将干预集中在碰撞机制中的父变量上，展示了基于偏好学习恢复原理性实验策略的潜力。
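
As an illustration of the pairwise objective mentioned in the ACE abstract, the following minimal Python sketch computes the standard Direct Preference Optimization loss for one preferred/rejected pair of candidate interventions. The function name `dpo_loss`, its argument names, and the value `beta=0.1` are assumptions made for this example; how ACE builds its pairs and parameterizes its policy is not described in this digest.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are log-probabilities of each candidate under the trained policy
    and under a frozen reference policy. All names here are hypothetical.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen candidate than the reference policy does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); use a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss depends only on the difference between the two candidates, so adding the same constant to both policy log-probabilities leaves it unchanged. That is one way to read the abstract's point that relative comparisons stay meaningful even when absolute magnitudes drift.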

RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

Authors: Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, Reno Kriz

First: 2026-02-02T18:40:37+00:00 · Latest: 2026-02-02T18:40:37+00:00

Abstract

Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives, while being more efficient.

中文标题/摘要

标题：RANKVIDEO: 基于推理的视频检索重排序

重排序是现代检索系统中的关键组件，通常通过一个高效的初步检索器与一个更具表现力的模型配对来优化结果。虽然大型推理模型在文本中心的重排序方面取得了快速进展，但基于推理的视频检索重排序仍然未被充分探索。为了解决这一差距，我们引入了RANKVIDEO，这是一种基于推理的视频检索重排序器，它明确地通过利用视频内容来评估相关性来进行查询-视频对的推理。RANKVIDEO 通过一个包含感知导向的监督微调和结合点式、对式和教师置信度蒸馏目标的重排序训练的两阶段课程进行训练，并通过数据合成管道构建推理密集型查询-视频对。在大规模的MultiVENT 2.0基准测试上的实验表明，RANKVIDEO 在两阶段框架内始终能够提高检索性能，nDCG@10 的平均改进率为 31%，并优于仅基于文本和视觉语言的重排序替代方案，同时更加高效。

Summary / 总结

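RANKVIDEO is a reasoning-based reranker for video retrieval that assesses relevance by explicitly reasoning over query-video pairs using video content. It is trained with a two-stage curriculum (perception-grounded supervised fine-tuning, then reranking training with pointwise, pairwise, and teacher confidence distillation objectives) and is supported by a data synthesis pipeline. On the MultiVENT 2.0 benchmark, it yields an average nDCG@10 improvement of 31% within a two-stage retrieval framework and outperforms text-only and vision-language reranking alternatives while being more efficient.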

RANKVIDEO 是一种基于推理的视频检索重排序方法，利用视频内容评估相关性，填补了视频检索中基于推理的重排序研究的空白。该方法通过两阶段课程训练，结合了逐点、成对和教师置信度蒸馏目标，并辅以数据合成管道。在 MultiVENT 2.0 基准测试中，它在两阶段检索框架内将 nDCG@10 平均提高了 31%，并优于纯文本和视觉语言重排序方法。
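
For readers unfamiliar with the nDCG@10 metric reported for RANKVIDEO, here is a minimal sketch of how it is commonly computed. It assumes the linear-gain variant and assumes the input list holds relevance labels for all judged items of one query; the function names and the example labels are hypothetical, and the exact variant used in the MultiVENT 2.0 evaluation is not stated in this digest.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical relevance labels for one query, before and after reranking.
first_stage = [0, 1, 0, 1, 0]
reranked = [1, 1, 0, 0, 0]
```

A reranker improves this score when it moves relevant videos toward the top of the list, as in the `reranked` ordering compared with `first_stage`.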

Energy-Efficient Neuromorphic Computing for Edge AI: A Framework with Adaptive Spiking Neural Networks and Hardware-Aware Optimization

Authors: Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz, Duygu Erisken, Rana Irem Turhan

First: 2026-02-02T18:34:48+00:00 · Latest: 2026-02-02T18:34:48+00:00

Comments: 8 pages, 4 figures, 4 tables. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

Abstract

Edge AI applications increasingly require ultra-low-power, low-latency inference. Neuromorphic computing based on event-driven spiking neural networks (SNNs) offers an attractive path, but practical deployment on resource-constrained devices is limited by training difficulty, hardware-mapping overheads, and sensitivity to temporal dynamics. We present NeuEdge, a framework that combines adaptive SNN models with hardware-aware optimization for edge deployment. NeuEdge uses a temporal coding scheme that blends rate and spike-timing patterns to reduce spike activity while preserving accuracy, and a hardware-aware training procedure that co-optimizes network structure and on-chip placement to improve utilization on neuromorphic processors. An adaptive threshold mechanism adjusts neuron excitability from input statistics, reducing energy consumption without degrading performance. Across standard vision and audio benchmarks, NeuEdge achieves 91-96% accuracy with up to 2.3 ms inference latency on edge hardware and an estimated 847 GOp/s/W energy efficiency. A case study on an autonomous-drone workload shows up to 312x energy savings relative to conventional deep neural networks while maintaining real-time operation.

中文标题/摘要

标题：边缘AI中的能效神经形态计算：基于自适应脉冲神经网络和硬件感知优化的框架

边缘AI应用越来越需要超低功耗和低延迟的推理。基于事件驱动脉冲神经网络（SNN）的神经形态计算提供了一条有吸引力的途径，但在资源受限设备上的实际部署受限于训练难度、硬件映射开销以及对时间动态的敏感性。我们提出NeuEdge框架，结合了自适应SNN模型与硬件感知优化，以实现边缘部署。NeuEdge采用了一种时间编码方案，将速率和脉冲时间模式相结合，以减少脉冲活动同时保持准确性，并采用一种硬件感知的训练过程，同时优化网络结构和片上放置，以提高神经形态处理器上的利用率。一种自适应阈值机制根据输入统计调整神经元的兴奋性，从而降低能耗而不影响性能。在标准视觉和音频基准测试中，NeuEdge在边缘硬件上的推理延迟最高为2.3毫秒，准确率高达91-96%，估计能耗效率为847 GOp/s/W。在自主无人机工作负载案例研究中，与传统的深度神经网络相比，能耗节省高达312倍，同时保持实时操作。

Summary / 总结

NeuEdge is a framework that combines adaptive spiking neural networks with hardware-aware optimization to enhance energy efficiency and deployment on edge devices. It uses a temporal coding scheme to reduce spike activity while maintaining accuracy and a co-optimization procedure for network structure and on-chip placement to improve utilization. NeuEdge achieves 91-96% accuracy with up to 2.3 ms inference latency and an estimated 847 GOp/s/W energy efficiency on standard benchmarks. In a drone application, it saves up to 312x energy compared to conventional deep neural networks while maintaining real-time operation.

NeuEdge 是一个框架，结合自适应脉冲神经网络和硬件感知优化，在边缘设备上实现高能效的 AI 推理。它采用时间编码方案和硬件感知训练过程，以减少脉冲活动并提高神经形态处理器上的利用率。NeuEdge 在标准基准测试中实现了 91-96% 的准确率、最高 2.3 毫秒的推理延迟以及估计 847 GOp/s/W 的能效。在无人机应用中，与传统的深度神经网络相比，它最多可节省 312 倍的能量，同时保持实时操作。
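
To make the idea of an adaptive threshold concrete, the sketch below simulates a generic leaky integrate-and-fire neuron whose firing threshold rises with a running mean of the input magnitude. This is not NeuEdge's mechanism, which the digest does not specify; the function `simulate_lif`, its parameters, and the update rule are all assumptions chosen for illustration.

```python
def simulate_lif(inputs, base_threshold=1.0, leak=0.9, adapt_gain=0.0, stat_decay=0.9):
    """Simulate one leaky integrate-and-fire neuron and return its spike train.

    The threshold is base_threshold + adapt_gain * running_mean(|input|), a
    hypothetical rule that lowers excitability when inputs are strong.
    """
    membrane = 0.0
    running_mean = 0.0
    spikes = []
    for x in inputs:
        # Exponential moving average of the input magnitude (input statistic).
        running_mean = stat_decay * running_mean + (1.0 - stat_decay) * abs(x)
        threshold = base_threshold + adapt_gain * running_mean
        # Leaky integration of the input current.
        membrane = leak * membrane + x
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # reset after a spike
        else:
            spikes.append(0)
    return spikes


constant_drive = [0.6] * 50
fixed_spikes = simulate_lif(constant_drive, adapt_gain=0.0)
adaptive_spikes = simulate_lif(constant_drive, adapt_gain=2.0)
```

With the same constant drive, the adaptive neuron fires less often than the fixed-threshold one, which is the kind of spike-count reduction that lower energy use would rely on. Whether accuracy is preserved depends on the task and is not something this toy can show.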

UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Authors: Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang

First: 2026-02-02T18:34:35+00:00 · Latest: 2026-02-02T18:34:35+00:00

Abstract

Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

中文标题/摘要

标题：UniReason 1.0：统一的知识对齐图像生成与编辑推理框架

统一的多模态模型在处理需要深入推理的复杂合成任务时往往表现不佳，通常将文本到图像生成和图像编辑视为孤立的能力，而不是相互关联的推理步骤。为了解决这个问题，我们提出了UniReason，这是一种通过双重推理范式将这两种任务统一起来的框架。我们将生成视为增强世界知识的规划，以注入隐式约束，并利用编辑能力进行精细的视觉修正，通过自我反思进一步纠正视觉错误。这种方法在共享表示中统一了生成和编辑，反映了人类认知过程中的规划随后是细化。我们通过系统构建了一个大规模的以推理为中心的数据集（约30万样本），涵盖了五个主要的知识领域（例如，文化常识、物理等）用于规划，以及一个代理生成的语料库用于视觉自我修正。广泛的实验表明，UniReason在如WISE、KrisBench和UniREditBench等推理密集型基准测试中实现了先进的性能，同时保持了优越的一般合成能力。

Summary / 总结

The paper introduces UniReason 1.0, a unified framework for world knowledge-aligned image generation and editing. It addresses the limitations of existing multimodal models by integrating text-to-image generation and image editing into a dual reasoning paradigm. The framework formulates generation as world knowledge-enhanced planning and uses editing for fine-grained visual refinement. Experiments show that UniReason performs well on reasoning-intensive benchmarks and maintains strong general synthesis capabilities.

研究旨在改进统一多模态模型，使其能够处理需要深度推理的复杂合成任务，如文本到图像生成和图像编辑。UniReason 是一个统一框架，通过双重推理范式将生成和编辑结合起来，其中生成被视为增强世界知识的规划，编辑被视为自我反思以进一步修正视觉错误。该框架在推理密集型基准测试中表现出色，并保持了强大的通用合成能力。

SelvaMask: Segmenting Trees in Tropical Forests and Beyond

Authors: Simon-Olivier Duguay, Hugo Baudchon, Etienne Laliberté, Helene Muller-Landau, Gonzalo Rivas-Torres, Arthur Ouaknine

First: 2026-02-02T18:26:56+00:00 · Latest: 2026-02-02T18:26:56+00:00

Comments: 22 pages, 8 figures

Abstract

Tropical forests harbor most of the planet's tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs), using a domain-specific detection-prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.

中文标题/摘要

标题：SelvaMask：热带森林及其以外区域树木分割

热带森林拥有地球上大部分树木生物多样性，对全球生态平衡至关重要。冠层树木在碳储存和生态系统功能方面发挥着不成比例的作用。大规模研究冠层树木需要准确地划分单个树木冠层，通常使用高分辨率航空影像完成。尽管基于Transformer的模型在单个树木冠层分割方面取得了进展，但在大多数森林中，尤其是在热带森林中，性能仍然较低。为此，我们引入了SelvaMask，这是一个新的热带数据集，包含超过8,800个手工划分的树木冠层，分布在巴拿马、巴西和厄瓜多尔的三个新热带森林站点。SelvaMask 包含全面的注释，包括注释者间一致性评估，捕捉了热带森林的密集结构，并突显了任务的难度。利用这一基准，我们提出了一种模块化的检测-分割流水线，该流水线使用特定领域的检测提示器来适配视觉基础模型（VFMs）。我们的方法达到了最先进的性能，在密集的热带森林中优于零样本通用模型和端到端的完全监督方法。我们在外部的热带和温带数据集上验证了这些收益，证明SelvaMask 既是具有挑战性的基准，也是实现通用森林监测的关键。我们的代码和数据集将公开发布。

Summary / 总结

The research aims to improve the accuracy of individual tree crown segmentation in tropical forests, which is crucial for studying ecological balance and carbon storage. SelvaMask, a new dataset with over 8,800 manually delineated tree crowns from three Neotropical sites, was created to address the limitations of existing models. The proposed modular detection-segmentation pipeline, which adapts vision foundation models using a domain-specific detection-prompter, achieves state-of-the-art performance in dense tropical forests, outperforming both zero-shot generalist models and fully supervised end-to-end methods. The approach is validated on external tropical and temperate datasets, showing SelvaMask's utility as a benchmark and enabler for forest monitoring.

研究旨在提高热带森林中单个树冠分割的准确性，这对于生态平衡和碳储存研究至关重要。为了应对现有模型的局限性，研究创建了新数据集SelvaMask，包含来自三个新热带森林站点的超过8,800个手动标注的树冠。研究提出了一种模块化的检测-分割管道，该管道通过领域特定的检测提示器适配视觉基础模型，在密集热带森林中达到了最先进的性能，并在外部数据集上验证了其泛化能力。

Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Authors: Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer

First: 2026-02-02T18:25:33+00:00 · Latest: 2026-02-02T18:25:33+00:00

Abstract

Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.

中文标题/摘要

标题：重新利用蛋白质语言模型进行基于潜在流的适应度优化

蛋白质适应度优化面临着一个极其庞大的组合空间，其中高适应度变体极为稀少。许多现有方法要么表现不佳，要么需要计算代价高昂的基于梯度的采样。我们提出了一种名为CHASE的框架，该框架通过将预训练蛋白质语言模型的嵌入压缩到一个紧凑的潜在空间中，重新利用其进化知识。通过训练带有无分类器引导的条件流匹配模型，我们能够直接生成高适应度变体，而无需在ODE采样步骤中使用基于预测器的引导。CHASE在AAV和GFP蛋白质设计基准测试中达到了最先进的性能。最后，我们展示了使用合成数据进行自举可以进一步提高数据受限场景下的性能。

Summary / 总结

CHASE is a framework that uses pretrained protein language models to optimize protein fitness by compressing their embeddings into a compact latent space. It trains a conditional flow-matching model with classifier-free guidance to generate high-fitness variants directly, without predictor-based guidance during ODE sampling. CHASE outperforms existing methods on AAV and GFP protein design benchmarks and shows that using synthetic data can improve performance in data-limited scenarios.

CHASE 是一个框架，利用预训练的蛋白质语言模型，通过将其嵌入压缩到紧凑的潜在空间来优化蛋白质的适应度。它训练带有无分类器引导的条件流匹配模型来直接生成高适应度变体，在 ODE 采样步骤中无需预测器引导。CHASE 在 AAV 和 GFP 蛋白质设计基准测试中优于现有方法，并且在数据受限的情况下可通过合成数据进一步提高性能。
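
The combination of flow matching and classifier-free guidance mentioned for CHASE can be sketched in one dimension. The code below is a toy under stated assumptions: the two velocity fields are hand-written stand-ins for a trained network (each transports a point along a straight path toward a single target), and the names `guided_velocity`, `sample`, and `guidance_scale` are hypothetical. It only shows how guided ODE sampling blends a conditional and an unconditional velocity without any separate predictor.

```python
def guided_velocity(x, t, target, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional velocities.

    Toy stand-ins for a trained network: each velocity moves x along a
    straight path toward one point (target if conditioned, 0.0 otherwise).
    """
    v_uncond = (0.0 - x) / (1.0 - t)
    v_cond = (target - x) / (1.0 - t)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def sample(x0, target, guidance_scale, num_steps=8):
    """Integrate the guided ODE from t=0 toward t=1 with explicit Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays below 1, so (1 - t) is never zero
        x = x + dt * guided_velocity(x, t, target, guidance_scale)
    return x
```

In this toy, a guidance scale of 1 ends at the conditional target and a scale of 0 ends at the unconditional one, while larger scales extrapolate past the target. A real latent-space model would replace the two hand-written velocities with one network evaluated with and without the fitness condition.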

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Authors: Parth Asawa, Alan Zhu, Abby O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

First: 2025-10-02T18:02:39+00:00 · Latest: 2026-02-02T18:23:32+00:00

Abstract

Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5's performance on RuleArena (Taxes) by 71%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than those the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

中文标题/摘要

标题：如何训练你的顾问：使用顾问模型引导黑盒大语言模型

前沿语言模型作为黑盒服务部署，模型权重不可修改，定制仅限于提示。我们介绍了顾问模型，这是一种方法，通过训练小型开放权重模型生成针对每个实例的自然语言建议，以提高黑盒前沿模型的能力。顾问模型使GPT-5在RuleArena（税收）上的性能提高了71%，减少了Gemini 3 Pro在SWE代理任务中的步骤24.6%，并且在个性化GPT-5以满足用户偏好方面优于静态提示优化器（85-100% vs. 40-60%）。我们还发现，顾问具有可迁移性：用低成本学生模型训练的顾问仍然可以提高前沿模型的效果。此外，顾问模型具有鲁棒性：我们发现在训练管道之外的其他基准上没有观察到性能下降。我们的方法展示了如何以实用和成本效益的方式对黑盒前沿模型进行参数优化。

Summary / 总结

The research aims to enhance the capabilities of black-box language models by training small open-weight Advisor Models that generate dynamic, per-instance natural language advice. Key findings include a 71% improvement in GPT-5's performance on RuleArena (Taxes), a 24.6% reduction in steps taken by Gemini 3 Pro in SWE agent tasks, and better personalization of GPT-5 to user preferences than static prompt optimizers (85-100% vs. 40-60%). Advisor Models are also transferable and robust: an advisor trained with a low-cost student model still improves a frontier model, and no degradation is observed on benchmarks other than those used for training.

研究旨在通过训练小型开放权重的顾问模型来提供动态建议，以增强黑盒语言模型的能力。该方法为每个实例生成特定的自然语言建议，以提高这些模型的性能。关键发现包括GPT-5在RuleArena（税收）上的性能提高了71%，Gemini 3 Pro在SWE代理任务中的步骤减少了24.6%，并且与静态提示优化器相比，GPT-5的个性化效果更好。此外，顾问模型具有可迁移性和鲁棒性：使用低成本学生模型训练的顾问仍能提升前沿模型的效果，且在训练所用基准之外的其他基准上未观察到性能下降。

SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Authors: Qingni Wang, Yue Fan, Xin Eric Wang

First: 2026-02-02T18:22:45+00:00 · Latest: 2026-02-02T18:22:45+00:00

Abstract

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples from outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround on multiple GUI grounding models for the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and offers the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.

中文标题/摘要

标题：SafeGround: 通过不确定性校准了解何时信任GUI定位模型

图形用户界面（GUI）定位旨在将自然语言指令转换为可执行的屏幕坐标，从而实现GUI的自动化交互。然而，错误的定位可能导致代价高昂且难以撤销的操作（例如错误的支付批准），这引起了对模型可靠性的担忧。在本文中，我们提出了SafeGround，这是一种面向GUI定位模型的不确定性感知框架，通过测试前的校准实现风险感知预测。SafeGround 利用一种分布感知的不确定性量化方法来捕捉任何给定模型输出的随机样本的空间分散情况。然后，通过校准过程，SafeGround 得出一个具有统计保证的错误发现率（FDR）控制的测试时决策阈值。我们在具有挑战性的ScreenSpot-Pro基准测试上将SafeGround应用于多个GUI定位模型。实验结果表明，我们的不确定性度量在区分正确和错误预测方面始终优于现有基线，而校准后的阈值可靠地实现了严格的风险控制，并具有显著提升系统级准确率的潜力。在多个GUI定位模型中，与仅使用Gemini的推理相比，SafeGround 将系统级准确率提高了最多5.38个百分点。

Summary / 总结

SafeGround is an uncertainty-aware framework for GUI grounding models that enhances model reliability by calibrating predictions before testing. It uses a distribution-aware uncertainty quantification method to capture spatial dispersion and derives a test-time decision threshold with statistically guaranteed false discovery rate control. Experiments show that SafeGround outperforms existing baselines in distinguishing correct from incorrect predictions and improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.

SafeGround 是一个不确定性感知框架，通过测试前的校准来增强 GUI 定位模型的可靠性。它使用分布感知的不确定性量化方法来捕捉空间分散，并推导出具有统计保证的错误发现率控制的测试时决策阈值。实验结果表明，SafeGround 在区分正确和错误预测方面优于现有基线，并且与仅使用 Gemini 的推理相比，系统级准确率最多提高了 5.38 个百分点。
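
The calibration step described for SafeGround can be pictured with a simplified sketch: pick the largest uncertainty threshold for which the accepted calibration predictions contain at most a fraction `alpha` of errors. This version uses the plain empirical proportion and therefore does not carry the statistical guarantee the abstract refers to; the function names and the example data are hypothetical.

```python
def calibrate_threshold(uncertainties, correct, alpha):
    """Return the largest uncertainty threshold whose accepted calibration
    predictions have an empirical false discovery proportion <= alpha.

    Returns None if no threshold satisfies the constraint.
    """
    pairs = sorted(zip(uncertainties, correct))
    best = None
    errors = 0
    for k, (u, ok) in enumerate(pairs, start=1):
        if not ok:
            errors += 1
        # Only evaluate at the end of a run of tied uncertainty values, since
        # a threshold of u accepts every prediction with uncertainty <= u.
        is_last_of_tie = k == len(pairs) or pairs[k][0] != u
        if is_last_of_tie and errors / k <= alpha:
            best = u
    return best


def accept(uncertainty, threshold):
    """Act on a prediction only if its uncertainty is within the threshold."""
    return threshold is not None and uncertainty <= threshold


# Hypothetical calibration set: uncertainty score and whether the click was right.
cal_uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5]
cal_correct = [True, True, False, True, False]
```

At test time, predictions that fail `accept` would be deferred, for example to a stronger model or a human, rather than executed. A guaranteed procedure would replace the empirical proportion with a bound that accounts for the finite calibration sample.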

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Authors: Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Venue: ICLR 2026

First: 2025-09-24T23:59:05+00:00 · Latest: 2026-02-02T18:18:24+00:00

Comments: Accepted to ICLR 2026

Abstract

Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

中文标题/摘要

标题：FS-DFM：基于少步扩散语言模型的快速准确长文本生成

自回归语言模型（ARMs）能够提供强大的似然性，但本质上是串行的：它们每次前向传递生成一个词，这限制了吞吐量并增加了长序列的延迟。扩散语言模型（DLMs）可以在位置上并行化，因此对于语言生成看起来很有前景，但标准离散扩散通常需要数百到数千次模型评估才能达到高质量，用串行深度换取迭代广度。我们提出了FS-DFM，少量步骤离散流匹配。这是一种为速度而设计但不牺牲质量的离散流匹配模型。核心思想很简单：将采样步骤的数量作为显式参数，并训练模型在不同的步骤预算下保持一致，因此一次大动作的效果相当于多次小动作。我们还配以可靠的更新规则，该规则能够朝着正确的方向移动概率而不至于过冲，并且从长期运行轨迹中提取强大的教师指导。这些选择共同使得少量步骤采样稳定、准确且易于控制。在语言建模基准测试中，使用8个采样步骤的FS-DFM在生成1,024个词时，与1,024步离散流基线具有相同的困惑度，使用相似大小的模型，实现了高达128倍的采样速度，并相应地提高了延迟/吞吐量。

Summary / 总结

FS-DFM is a method designed to generate long text quickly and accurately by using a few sampling steps instead of many. It addresses the limitations of autoregressive models and diffusion models by training the model to be consistent across different step budgets and using a reliable update rule to ensure accurate sampling. On benchmarks, FS-DFM with 8 sampling steps achieves similar perplexity to a 1,024-step baseline but is up to 128 times faster, improving latency and throughput.

FS-DFM 是一种通过减少采样步骤来加速长文本生成的方法，同时保持高质量。它通过训练模型在不同步骤预算下保持一致性，并使用可靠的更新规则确保准确的采样。在基准测试中，FS-DFM 使用 8 个采样步骤与 1,024 步基线的困惑度相当，但速度快 128 倍，从而提高延迟和吞吐量。
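
The "up to 128 times faster" figure for FS-DFM is consistent with a simple step count, under the assumption (made here, not stated in the digest) that every sampling step costs one model evaluation of roughly equal cost. The symbols below are introduced only for this illustration.

```latex
\text{speedup} \approx \frac{N_{\text{baseline}}}{N_{\text{few}}} = \frac{1024}{8} = 128
```

This is an upper bound on wall-clock gains in practice, since fixed overheads outside the sampling loop do not shrink with the number of steps.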

ReasonEdit: Editing Vision-Language Models using Human Reasoning

Authors: Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

First: 2026-02-02T18:06:14+00:00 · Latest: 2026-02-02T18:06:14+00:00

Abstract

Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

中文标题/摘要

标题：ReasonEdit：使用人类推理编辑视觉-语言模型

模型编辑旨在纠正大型预训练模型中的错误，而不改变无关行为。虽然一些最近的工作已经编辑了视觉-语言模型（VLMs），但现有的编辑器没有解决需要人类和模型对图像进行推理的推理密集型任务。因此，我们提出了ReasonEdit，这是第一个允许用户在编辑过程中解释其推理的VLM编辑器，引入了一种新的、实用的模型编辑设置。ReasonEdit将人类推理持续存储在码本中，并使用一种受网络科学启发的新颖拓扑平衡多模态嵌入方法，在推理过程中仅检索相关事实。在四个VLMs的多个基于推理依据的视觉问答数据集上，ReasonEdit实现了最先进的编辑性能，最终表明在编辑过程中使用人类推理大大提高了编辑的泛化能力。

Summary / 总结

The research motivation is to correct errors in large pretrained models without affecting unrelated behaviors, particularly for reasoning-heavy tasks. The main method involves ReasonEdit, a new VLM editor that allows users to explain their reasoning during the editing process, using a topology-balanced multimodal embedding method. Key experimental findings show that ReasonEdit outperforms existing methods and significantly improves edit generalization across multiple VLMs on rationale-based visual question answering datasets.

研究动机是纠正大型预训练视觉-语言模型中的错误，而不影响其他行为。主要方法是ReasonEdit，它允许用户在编辑过程中解释他们的推理，使用一种由网络科学启发的拓扑平衡多模态嵌入方法。关键实验发现表明，ReasonEdit 在多个数据集上的编辑性能优于现有方法，达到最先进的水平，表明在编辑过程中结合人类推理显著提高了编辑的泛化能力。

SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation

Authors: Mu Huang, Hui Wang, Kerui Ren, Linning Xu, Yunsong Zhou, Mulin Yu, Bo Dai, Jiangmiao Pang

First: 2026-02-02T17:59:31+00:00 · Latest: 2026-02-02T17:59:31+00:00

Comments: Project page: https://city-super.github.io/SoMA/

Abstract

Simulating deformable objects under rich interactions remains a fundamental challenge for real-to-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Existing simulators rely on predefined physics or data-driven dynamics without robot-conditioned control, limiting accuracy, stability, and generalization. This paper presents SoMA, a 3D Gaussian Splat simulator for soft-body manipulation. SoMA couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. Modeling interactions over learned Gaussian splats enables controllable, stable long-horizon manipulation and generalization beyond observed trajectories without predefined physical models. SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.

中文标题/摘要

标题：SoMA：一种用于机器人软体操作的真实到仿真神经模拟器

在丰富交互下模拟可变形物体仍然是真实到仿真（real-to-sim）机器人操作的一项基本挑战，其动力学由环境效应和机器人动作共同驱动。现有模拟器依赖预定义的物理模型或缺乏机器人条件控制的数据驱动动力学，限制了准确性、稳定性和泛化能力。本文提出了SoMA，一种用于软体操作的3D高斯泼溅（Gaussian Splat）模拟器。SoMA在统一的潜在神经空间中耦合可变形动力学、环境力和机器人关节动作，实现端到端的真实到仿真模拟。在学习到的高斯泼溅上建模交互，使得可控、稳定的长时程操作以及超越观察轨迹的泛化成为可能，而无需预定义物理模型。SoMA将真实世界机器人操作的重模拟准确性和泛化能力提高了20%，能够稳定模拟长时程布料折叠等复杂任务。

Summary / 总结

SoMA is a 3D Gaussian Splat simulator designed for robotic soft-body manipulation. It couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. The key experimental finding is that SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.

SoMA 是一个用于机器人软体操作的 3D 高斯泼溅模拟器。它将可变形动力学、环境力和机器人关节动作统一在一个潜在的神经空间中进行端到端的真实到仿真模拟。该方法将真实世界机器人操作的重模拟准确性和泛化能力提高了 20%，能够稳定模拟长时程布料折叠等复杂任务。

Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Authors: Xinshun Wang, Peiming Li, Ziyi Wang, Zhongbin Fang, Zhichao Deng, Songtao Wu, Jason Li, Mengyuan Liu

First: 2026-02-02T17:59:01+00:00 · Latest: 2026-02-02T17:59:01+00:00

Abstract

Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

中文标题/摘要

标题：超人：统一骨架与视觉感知以实现人类运动感知与生成

人类运动分析任务，如时间3D姿态估计、运动预测和运动插值，在计算机视觉中起着重要作用。然而，当前的范式存在严重的碎片化问题。首先，该领域被分为“感知”模型，这些模型可以从视频中理解运动但仅输出文本，而“生成”模型则无法从原始视觉输入中感知。其次，生成的MLLM通常局限于单帧静态姿态，使用密集的参数化SMPL模型，无法处理时间运动。第三，现有的运动词汇表仅基于骨架数据构建，切断了与视觉领域的联系。为了解决这些挑战，我们引入了Superman，这是一种统一框架，将视觉感知与基于时间的骨架运动生成相结合。我们的解决方案分为两部分。首先，为克服模态断层，我们提出了一种视觉引导的运动分词器。利用3D骨架与视觉数据之间的自然几何对齐，该模块开创了从两种模态中进行稳健联合学习的先河，创建了一个统一的跨模态运动词汇表。其次，基于这种运动语言，一个统一的MLLM架构被训练以处理所有任务。该模块灵活处理各种时间输入，将基于视频的3D骨架姿态估计（感知）与基于骨架的运动预测和插值（生成）统一起来。在包括Human3.6M在内的标准基准上的广泛实验表明，我们的统一方法在所有运动任务中均实现了最先进的或具有竞争力的性能。这展示了使用骨架进行生成运动分析更高效和可扩展的路径。

Summary / 总结

The paper addresses the fragmentation in human motion analysis by introducing Superman, a unified framework that integrates visual perception with temporal motion generation. It proposes a Vision-Guided Motion Tokenizer to bridge the gap between visual and skeleton data, creating a unified motion vocabulary. The framework then uses a single MLLM to handle various motion tasks, including 3D pose estimation, motion prediction, and in-betweening. Experiments on benchmarks like Human3.6M show that Superman outperforms or matches existing methods across all tasks.

论文通过引入Superman框架，将视觉感知与时间序列运动生成统一起来，解决了人类运动分析中的碎片化问题。该框架提出了一种视觉引导的运动分词器，将3D骨架和视觉数据结合起来，创建了一个跨模态的运动词汇表。然后训练一个单一的MLLM架构来处理各种运动任务，包括从视频中估计3D姿态、运动预测和运动插值。实验表明，Superman在标准基准如Human3.6M上的各种运动任务中表现优于或与现有方法相当。

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Authors: Qizhen Zhang, Ankush Garg, Jakob Foerster, Niladri Chatterji, Kshitiz Malik, Mike Lewis

First: 2026-02-02T17:58:50+00:00 · Latest: 2026-02-02T17:58:50+00:00

Abstract

Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.

中文标题/摘要

标题：关于嘈杂数据和大语言模型预训练损失发散的经验研究

大规模预训练数据集推动了大语言模型（LLMs）的成功。然而，由于未经规范的网络内容或数据固有的随机性，这些网络规模的语料库不可避免地包含大量嘈杂数据。尽管LLM预训练者通常推测此类噪声会导致大规模LLM预训练的不稳定性，在最坏的情况下导致损失发散，但这种现象仍不甚明了。在本研究中，我们系统地研究了嘈杂数据是否会导致LLM预训练发散以及它是如何导致的。通过向原本干净的数据集注入可控的合成均匀随机噪声，我们分析了从4.8亿到52亿参数大小的模型的训练动态。我们展示了确实存在由噪声引起的训练损失发散，并且发散的概率强烈依赖于噪声类型、噪声量和模型规模。我们还发现，由噪声引起的发散表现出与高学习率引起的激活模式不同的特征，并提供了区分这两种失败模式的诊断方法。这些结果共同提供了大规模、受控的关于嘈杂数据如何影响LLM预训练损失发散的表征。

Summary / 总结

This study investigates the impact of noisy data on large language model (LLM) pretraining loss divergence. By introducing controlled synthetic noise into clean datasets, the researchers analyzed training dynamics across various model sizes. They found that noisy data can indeed cause training loss divergence, with the probability of divergence varying based on the type and amount of noise and the model scale. Additionally, the study identified distinct activation patterns for noise-induced divergences compared to those caused by high learning rates, offering diagnostics to differentiate these failure modes.

这项研究探讨了噪声数据对大规模语言模型（LLM）预训练中损失发散的影响。通过向干净的数据集引入受控的合成噪声，研究人员分析了不同模型规模下的训练动态。研究发现，噪声数据确实会导致损失发散，其可能性取决于噪声的类型、数量以及模型规模。研究还区分了由噪声引起的发散与由高学习率引起的发散，并提供了诊断方法来区分这两种故障模式。
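
The noise-injection setup described in the abstract above could look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not say how the noise is inserted, so replacing a fixed fraction of token positions with uniformly sampled token ids is a guess, and the names `inject_uniform_noise`, `vocab_size`, and `noise_fraction` are hypothetical rather than taken from the paper.

```python
import random


def inject_uniform_noise(tokens, vocab_size, noise_fraction, seed=0):
    """Return a copy of `tokens` with a fraction of positions replaced by
    token ids drawn uniformly at random (an assumed form of noise)."""
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(tokens)  # copy, so the clean sequence is left untouched
    n_noisy = round(noise_fraction * len(noisy))
    # Choose distinct positions to corrupt, then overwrite each one
    # with a uniform draw from the vocabulary.
    for pos in rng.sample(range(len(noisy)), n_noisy):
        noisy[pos] = rng.randrange(vocab_size)
    return noisy


clean = list(range(100))
noisy = inject_uniform_noise(clean, vocab_size=50_000, noise_fraction=0.2)
# Count positions that actually differ from the clean sequence.
changed = sum(a != b for a, b in zip(clean, noisy))
```

Sweeping `noise_fraction` while holding the seed and the clean data fixed is one simple way to vary the amount of noise, which the abstract names as a factor in divergence probability.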

A Backpropagation-Free Feedback-Hebbian Network for Continual Learning Dynamics

Authors: Josh Li, Fow-sen Choa

First: 2026-01-11T03:25:38+00:00 · Latest: 2026-02-02T17:58:38+00:00

Comments: 8 pages, 10 figures

Abstract

Feedback-rich neural architectures can regenerate earlier representations and inject temporal context, making them a natural setting for strictly local synaptic plasticity. Existing literature raises doubt about whether a minimal, backpropagation-free feedback-Hebbian system can already express interpretable continual-learning-relevant behaviors under controlled training schedules. In this work, we introduce a compact prediction-reconstruction architecture with a dedicated feedback pathway providing lightweight, locally trainable temporal context for continual adaptation. All synapses are updated by a unified local rule combining centered Hebbian covariance, Oja-style stabilization, and a local supervised drive where targets are available. With a simple two-pair association task, learning is characterized through layer-wise activity snapshots, connectivity trajectories (row and column means of learned weights), and a normalized retention index across phases. Under sequential A to B training, forward output connectivity exhibits a long-term depression (LTD)-like suppression of the earlier association, while feedback connectivity preserves an A-related trace during acquisition of B. Under an alternating sequence, both associations are concurrently maintained rather than sequentially suppressed. Architectural controls and rule-term ablations isolate the role of dedicated feedback in regeneration and co-maintenance, alongside the role of the local supervised term in output selectivity and unlearning. Together, the results show that a compact feedback pathway trained with local plasticity can support regeneration and continual-learning-relevant dynamics in a minimal, mechanistically transparent setting.

中文标题/摘要

标题：一种无需反向传播的反馈-海比网络以实现持续学习动力学

富含反馈的神经架构可以再生早期表示并注入时间上下文，使其成为严格局部突触可塑性的自然环境。现有文献对是否一个最小的、无需反向传播的反馈-海比系统在受控训练计划下能否表达可解释的持续学习相关行为表示怀疑。在本文中，我们引入了一种紧凑的预测-重构架构，其中包含一个专用的反馈路径，为持续适应提供轻量级的局部可训练时间上下文。所有突触通过结合中心海比协方差、Oja风格稳定化和局部监督驱动（当目标可用时）的统一局部规则进行更新。通过简单的两对关联任务，学习通过逐层活动快照、连接轨迹（学习权重的行和列均值）和各阶段归一化保留指数来表征。在顺序A到B训练下，前向输出连接表现出类似长期抑制（LTD）的早期关联抑制，而反馈连接在获得B期间保留A相关的痕迹。在交替序列下，两种关联同时维持而不是顺序抑制。架构控制和规则项消融隔离了专用反馈在再生和共同维持中的作用，以及局部监督项在输出选择性和遗忘中的作用。总之，结果表明，通过局部可塑性训练的紧凑反馈路径可以在最小的、机制上透明的环境中支持再生和持续学习相关动力学。

Summary / 总结

This work investigates whether a minimal, backpropagation-free feedback-Hebbian system can exhibit interpretable continual-learning behaviors. The authors introduce a compact prediction-reconstruction architecture with a dedicated feedback pathway for temporal context. Key findings include long-term depression (LTD)-like suppression of the earlier association in forward output connectivity under sequential A-to-B training, while feedback connectivity preserves an A-related trace, and concurrent maintenance of both associations under alternating sequences. The study highlights the roles of the feedback pathway and local supervised term in supporting regeneration and continual-learning dynamics.

该研究探讨了一个最小的、无需反向传播的反馈-海比系统是否能够表现出可解释的持续学习行为。作者引入了一种紧凑的预测-重建架构，带有反馈路径以提供时间上下文。关键发现包括在顺序训练中对早期关联的类长时程抑制（LTD）的抑制，在交替序列中同时保持两个关联。结果表明，局部可塑性可以在简单透明的设置中支持再生和持续学习动态。
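
The unified local rule named in the abstract above (centered Hebbian covariance, Oja-style stabilization, and a local supervised drive) might be sketched as follows. The abstract gives no formula, so the specific form of each term, the way the terms are summed, and the names `local_update`, `lr`, and `sup_gain` are all assumptions made for illustration.

```python
def local_update(W, x, y, x_mean, y_mean, lr=0.1, target=None, sup_gain=1.0):
    """Return an updated copy of W (rows: output units, columns: inputs).

    Each weight W[i][j] changes using only quantities local to that synapse:
    pre-synaptic activity x[j], post-synaptic activity y[i], their running
    means, and (when available) a target for output unit i.
    """
    new_W = []
    for i, row in enumerate(W):
        new_row = []
        for j, w in enumerate(row):
            # Centered Hebbian covariance: co-fluctuation around the means.
            hebb = (y[i] - y_mean[i]) * (x[j] - x_mean[j])
            # Oja-style decay: shrinks the weight in proportion to y^2,
            # which keeps weight growth bounded.
            oja = -(y[i] ** 2) * w
            # Local supervised drive (delta-rule form), only if a target exists.
            sup = 0.0
            if target is not None:
                sup = sup_gain * (target[i] - y[i]) * x[j]
            new_row.append(w + lr * (hebb + oja + sup))
        new_W.append(new_row)
    return new_W


# One output unit, two inputs; only the first input is active.
W1 = local_update(
    W=[[0.0, 0.0]], x=[1.0, 0.0], y=[0.5],
    x_mean=[0.0, 0.0], y_mean=[0.0], target=[1.0],
)
```

No error signal is propagated between layers here: every term is computed from values available at the synapse itself, which is the property that makes such a rule backpropagation-free.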

PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

Authors: Amisha Bhaskar, Pratap Tokekar, Stefano Di Cairano, Alexander Schperberg

First: 2026-02-02T17:57:37+00:00 · Latest: 2026-02-02T17:57:37+00:00

Comments: 10 pages main text and 4 figures, and 11 pages appendix and 10 figures, total 21 pages and 14 figures

Abstract

Robotic imitation learning typically requires models that capture multimodal action distributions while operating at real-time control rates and accommodating multiple sensing modalities. Although recent generative approaches such as diffusion models, flow matching, and Implicit Maximum Likelihood Estimation (IMLE) have achieved promising results, they often satisfy only a subset of these requirements. To address this, we introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder (integrating RGB, depth, tactile, audio, and proprioception) with a linear-attention generator using a Performer architecture. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation using a Unitree Go2 with a 7-DoF arm D1 and tabletop manipulation with a UR5 manipulator. Across challenging physical tasks such as pre-manipulation parking, high-precision insertion, and multi-object pick-and-place, PRISM outperforms state-of-the-art diffusion policies by 10-25% in success rate while maintaining high-frequency (30-50 Hz) closed-loop control. We further validate our approach on large-scale simulation benchmarks, including CALVIN, MetaWorld, and Robomimic. In CALVIN (10% data split), PRISM improves success rates by approximately 25% over diffusion and approximately 20% over flow matching, while simultaneously reducing trajectory jerk by 20x-50x. These results position PRISM as a fast, accurate, and multisensory imitation policy that retains multimodal action coverage without the latency of iterative sampling.

中文标题/摘要

标题：PRISM：用于单次通过多感官模仿学习的Performer RS-IMLE

机器人模仿学习通常需要能够捕捉多模态动作分布的模型，同时在实时控制速率下运行，并适应多种传感模态。尽管最近的生成方法，如扩散模型、流匹配和隐式最大似然估计(IMLE)已经取得了令人鼓舞的结果，但它们通常只能满足这些要求的一部分。为了解决这个问题，我们引入了PRISM，这是一种基于批全局拒绝采样变体的IMLE的单次通过策略。PRISM将一个时间多感官编码器（整合RGB、深度、触觉、音频和本体感受）与一个使用Performer架构的线性注意力生成器相结合。我们在一个多样化的实际硬件套件上展示了PRISM的有效性，包括使用Unitree Go2和7自由度手臂D1进行移动操作，以及使用UR5操作器进行桌面操作。在诸如预操作停车、高精度插入和多对象取放等具有挑战性的物理任务中，PRISM在成功率上比最先进的扩散策略高出10-25%，同时保持高频（30-50 Hz）闭环控制。我们还在大规模模拟基准测试中进一步验证了我们的方法，包括CALVIN、MetaWorld和Robomimic。在CALVIN（10%数据分割）中，PRISM在成功率上比扩散方法提高了约25%，比流匹配提高了约20%，同时将轨迹冲击降低了20-50倍。这些结果将PRISM定位为一种快速、准确且多感官的模仿策略，能够保留多模态动作覆盖，同时避免迭代采样带来的延迟。

Summary / 总结

PRISM is designed to address the challenges of real-time robotic imitation learning with multiple sensing modalities. It uses a single-pass policy based on a batch-global rejection-sampling variant of IMLE, integrating RGB, depth, tactile, audio, and proprioception data. PRISM demonstrates superior performance in various physical tasks, achieving a 10-25% higher success rate compared to state-of-the-art diffusion policies while maintaining high-frequency closed-loop control. In large-scale simulation benchmarks, PRISM also shows significant improvements in success rates and reduced trajectory jerk compared to diffusion and flow matching methods.

PRISM 是一种单次策略，采用批全局拒绝采样变体的隐式最大似然估计（IMLE），实现实时控制并处理多种传感模态。它通过一个时间多模态编码器和基于 Performer 架构的线性注意力生成器整合 RGB、深度、触觉、音频和本体感受数据。PRISM 在各种物理任务中的成功率比最先进的扩散策略高出 10-25%，同时保持高频闭环控制。在大规模模拟基准测试中，PRISM 的成功率比扩散高出 25%，比流匹配高出 20%，同时将轨迹抖动减少 20 到 50 倍。

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Authors: Samuel Nellessen, Tal Kachman

First: 2026-02-02T17:56:55+00:00 · Latest: 2026-02-02T17:56:55+00:00

Comments: Under review. 8 main pages, 2 figures, 2 tables. Appendix included

Abstract

The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.

中文标题/摘要

标题：大卫与歌利亚：通过强化学习实现可验证的代理间越狱

大型语言模型演进为自主代理，引入了利用合法工具权限的对抗性失败，将工具增强环境中的安全性评估从主观的自然语言处理任务转变为客观的控制问题。我们将这种威胁模型形式化为“搭便车攻击”：一种场景，其中工具匮乏的对手利用安全对齐操作员的受信任权限，仅通过对话诱导禁止的工具使用。为了验证这一威胁，我们提出了Slingshot，这是一种“冷启动”的强化学习框架，能够自主发现新兴的攻击向量，揭示了一个关键洞察：在我们的设置中，学习到的攻击往往收敛于简短的指令式句法模式，而非多轮说服。在保留的极端难度任务上，Slingshot 的成功率达到了67.0%（基线为1.7%），将首次成功解决任务的预期尝试次数从52.3次降低到1.3次。至关重要的是，Slingshot 能够零样本迁移至多个模型家族，包括闭源模型如Gemini 2.5 Flash（攻击成功率56.0%）和防御微调的开源模型如Meta-SecAlign-8B（攻击成功率39.2%）。我们的工作确立了“搭便车攻击”作为可验证的第一类威胁模型，并展示了通过环境交互即可从现成的开源权重模型中引发有效的代理攻击。

Summary / 总结

This paper addresses the threat of adversarial failures in tool-augmented environments, formalizing it as Tag-Along Attacks. It introduces Slingshot, a reinforcement learning framework that autonomously discovers attack vectors, achieving a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator, compared to a 1.7% baseline. Slingshot also demonstrates zero-shot transfer to various model families, including closed-source and defensive-fine-tuned models.

论文探讨了Tag-Along攻击，即一个没有工具的对手利用安全对齐操作员的权限诱导使用禁止工具的问题。它引入了Slingshot，一种自主发现攻击向量的强化学习框架，对Qwen2.5-32B-Instruct-AWQ操作员的成功率为67.0%。这些攻击通常表现为简短的指令式模式，并且Slingshot能够有效地转移到各种模型家族，展示了这些模型对这种攻击的脆弱性。

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Authors: Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

First: 2026-01-21T16:36:19+00:00 · Latest: 2026-02-02T17:56:01+00:00

Comments: 87 pages, 6 figures

Abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

中文标题/摘要

标题：基于结果的RL使Transformer自发具备推理能力，但仅在适当数据下有效

通过强化学习（RL）和基于结果的监督训练的Transformer可以自发发展生成中间推理步骤（链式思考）的能力。然而，稀疏奖励如何驱动策略梯度发现这种系统性推理的机制尚不明确。我们通过分析单层Transformer在合成图遍历任务中的策略梯度动力学来解决这一问题，该任务不借助链式思考就无法解决，但允许简单的迭代解决方案。我们证明，尽管仅基于最终答案的正确性进行训练，策略梯度仍能驱动Transformer收敛到一个结构化、可解释的算法，该算法逐个顶点迭代遍历图。我们描述了这种涌现所需的分布特性，确定了“简单示例”的关键作用：需要较少推理步骤的实例。当训练分布对这些更简单的示例赋予足够的权重时，Transformer可以学习一种可泛化的遍历策略，该策略可以外推到更长的链；当这种权重消失时，策略梯度学习变得不可行。我们通过在合成数据上的实验和现实世界的语言模型在数学推理任务上的实验，验证了我们的理论结果，证明了我们的理论发现适用于实际应用。

Summary / 总结

The study investigates how Transformers trained via reinforcement learning with outcome-based supervision can develop the ability to generate intermediate reasoning steps. By analyzing single-layer Transformers on a synthetic graph traversal task, the researchers prove that policy gradient training, despite using only final-answer correctness as supervision, drives the model to converge to a structured algorithm that iteratively traverses the graph. The study identifies the importance of 'simple examples' in facilitating this learning process, showing that when the training distribution includes enough of these, the model can learn a generalizable traversal strategy, whereas without them, learning becomes infeasible. Experiments on synthetic data and real-world language models validate these findings in practical settings.

研究探讨了通过基于结果的强化学习训练的Transformer如何自发发展出生成中间推理步骤（Chain-of-Thought）的能力。通过对合成图遍历任务的策略梯度动态分析，研究证明，仅通过最终答案的正确性进行训练的Transformer会收敛到一个迭代遍历图的结构化算法。研究指出，简单示例的充分分布是这种现象出现的关键，通过合成数据和现实世界语言模型在数学推理任务上的实验验证了这些理论发现。

CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang

Venue: ICLR 2026

First: 2025-05-24T07:01:31+00:00 · Latest: 2026-02-02T17:54:23+00:00

Comments: Accepted to ICLR 2026. Project page: https://danielshkao.github.io/cot-rvs.html. Code: https://github.com/DanielSHKao/CoT-RVS

Abstract

Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework's training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.

中文标题/摘要

标题：CoT-RVS：零样本视频对象分割中的链式思维推理分割

推理视频对象分割是一个具有挑战性的任务，旨在根据复杂的隐含文本查询从输入视频中生成一个掩码序列。现有工作通过微调多模态大型语言模型（MLLM）来完成此任务，但在面对复杂的时间敏感查询时，它们仍然无法处理视频输入，这表明它们在复杂场景中缺乏时间和空间的整合。在本文中，我们提出了一种名为CoT-RVS的新框架，利用MLLM的零样本链式思维（CoT）能力通过时间语义推理来解决这些复杂挑战：CoT-RVS分析给定帧中可能与语言查询匹配的可见对象（语义），并在所有帧中选择一个可以轻松观察到的对应关键帧（时间）。值得注意的是，CoT-RVS框架无需训练，并且兼容闭源的MLLM，可以应用于推理视频实例分割。我们框架的无需训练特性还允许其扩展以处理在线视频流，在这种情况下，CoT在测试时用于更新目标对象，当出现更好的目标并变得可见时。我们在具有显式和隐含查询的视频对象分割上进行了广泛的实验。结果表明，CoT-RVS在两种情况下都显著优于先前的工作，定性和定量上均是如此。

Summary / 总结

CoT-RVS is a novel framework that leverages the zero-shot Chain-of-Thought (CoT) capability of Multimodal Large Language Models (MLLM) to address the challenges of reasoning video object segmentation, especially for complex and temporally-sensitive queries. By performing temporal-semantic reasoning, CoT-RVS identifies relevant objects in each frame and selects keyframes for segmentation. The framework is training-free and compatible with closed-source MLLMs, demonstrating superior performance in both explicit and implicit query scenarios compared to previous methods.

研究旨在解决基于复杂文本查询从视频中生成掩码序列的挑战。CoT-RVS 提出了一种新颖的框架，利用多模态大型语言模型（MLLM）的零样本推理能力进行时空推理。该框架分析与查询匹配的帧中的可见对象，并为每个对象选择关键帧。实验结果显示，CoT-RVS 在显式和隐式查询场景中均显著优于先前方法，展示了在视频对象分割中的显著改进。

Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Authors: Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng

First: 2026-02-02T17:52:56+00:00 · Latest: 2026-02-02T17:52:56+00:00

Comments: 14 pages, 8 figures

Abstract

We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

中文标题/摘要

标题：Infinite-World：通过无姿态层次记忆扩展交互世界模型至1000帧视野

我们提出了Infinite-World，一种能够在复杂现实环境中保持1000帧以上连贯视觉记忆的稳健交互世界模型。虽然现有的世界模型可以在合成数据上高效优化，并具有完美的地面真值，但由于姿态估计的噪声和视角重访的稀缺性，它们缺乏有效的现实世界视频训练范式。为解决这一问题，我们首先引入了一种无姿态层次记忆压缩器（HPMC），它递归地将历史潜在变量精简为固定预算表示。通过与生成主干联合优化，HPMC使模型能够在有限的计算成本下自主锚定遥远的过去，从而消除显式几何先验的需求。其次，我们提出了一种不确定性感知动作标签模块，将连续运动离散化为三态逻辑。这一策略最大限度地利用了原始视频数据，同时保护了确定性动作空间不受噪声轨迹的污染，确保了稳健的动作-响应学习。此外，根据试点玩具研究的见解，我们使用一个紧凑的30分钟数据集采用重访密集微调策略来高效激活模型的长距离闭环能力。广泛的实验，包括客观指标和用户研究，表明Infinite-World在视觉质量、动作可控性和空间一致性方面取得了优越的性能。

Summary / 总结

Infinite-World is a robust interactive world model designed to maintain coherent visual memory over 1000 frames in complex real-world environments. It introduces a Hierarchical Pose-free Memory Compressor (HPMC) to recursively distill historical latents into a fixed-budget representation, enabling autonomous anchoring of generations with bounded computational cost. The model also includes an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic, ensuring robust action-response learning. Experimental results show that Infinite-World outperforms existing models in visual quality, action controllability, and spatial consistency.

Infinite-World 是一种能够在复杂现实环境中维持 1000 帧以上视觉记忆的鲁棒交互式世界模型。它引入了层次无姿态记忆压缩器（HPMC），通过递归地将历史潜在变量压缩到固定预算表示，使模型能够自主地在过去的遥远时刻进行生成，无需显式的几何先验。此外，它还提出了一种不确定性感知的动作标签模块，将连续运动离散化为三态逻辑，增强鲁棒的动作响应学习。实验表明，Infinite-World 在视觉质量、动作可控性和空间一致性方面优于现有模型。
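
The tri-state action labeling mentioned in the abstract above can be illustrated with a small sketch. The abstract does not define the three states, so the reading used here ("active", "inactive", "uncertain", chosen by two thresholds on an estimated motion magnitude) is an assumption, as are the names `label_action`, `low`, and `high`.

```python
def label_action(magnitude, low, high):
    """Map an estimated motion magnitude to one of three labels.

    Assumed reading of "tri-state logic": clearly small motion is
    "inactive", clearly large motion is "active", and anything in the
    ambiguous band between the thresholds is "uncertain".
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if magnitude >= high:
        return "active"
    if magnitude <= low:
        return "inactive"
    return "uncertain"


# Per-frame forward-motion estimates from a (hypothetical) noisy pose tracker.
estimates = [0.05, 0.40, 0.90]
labels = [label_action(m, low=0.2, high=0.6) for m in estimates]
```

Under this reading, frames labeled "uncertain" could still be used for training while being excluded from the action-conditioning signal, which is one way to keep noisy trajectories from corrupting the deterministic action labels.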

Trust by Design: Skill Profiles for Transparent, Cost-Aware LLM Routing

Authors: Mika Okamoto, Ansel Kaplan Erol, Glenn Matlin

First: 2026-02-02T17:49:30+00:00 · Latest: 2026-02-02T17:49:30+00:00

Comments: Appeared at MLSys YPS 2025

Abstract

How should Large Language Model (LLM) practitioners select the right model for a task without wasting money? We introduce BELLA (Budget-Efficient LLM Selection via Automated skill-profiling), a framework that recommends optimal LLM selection for tasks through interpretable skill-based model selection. Standard benchmarks report aggregate metrics that obscure which specific capabilities a task requires and whether a cheaper model could suffice. BELLA addresses this gap through three stages: (1) decomposing LLM outputs and extracting the granular skills required using critic-based profiling, (2) clustering skills into structured capability matrices, and (3) multi-objective optimization to select the right models to maximize performance while respecting budget constraints. BELLA provides a natural-language rationale for recommendations, providing transparency that current black-box routing systems lack. We describe the framework architecture, situate it within the landscape of LLM routing and evaluation, and discuss its application to financial reasoning as a representative domain exhibiting diverse skill requirements and cost variation across models. Our framework enables practitioners to make principled cost-performance trade-offs when deploying LLMs.

中文标题/摘要

标题：设计中的信任：基于技能概貌的透明、成本意识的大语言模型路由

大语言模型（LLM）从业者应如何选择合适的模型以完成任务而不浪费资金？我们提出了BELLA（Budget-Efficient LLM Selection via Automated skill-profiling），这是一种通过基于技能的可解释模型选择来推荐最优LLM选择的框架。标准基准报告聚合指标，模糊了任务所需的具体能力以及是否可以使用更便宜的模型。BELLA 通过三个阶段解决了这一问题：（1）通过基于批评者的概要分析分解LLM输出并提取所需的细粒度技能，（2）将技能聚类为结构化的能力矩阵，（3）多目标优化以选择合适的模型，最大化性能同时遵守预算限制。BELLA 提供了自然语言的推荐理由，提供了当前黑盒路由系统所缺乏的透明度。我们描述了该框架的架构，将其置于LLM路由和评估的背景下，并讨论了其在金融推理领域的应用，这是一个表现出多样化技能需求和模型成本变化的代表性领域。我们的框架使从业者能够进行有原则的成本-性能权衡以部署LLM。

Summary / 总结

BELLA is a framework for selecting the most cost-effective Large Language Model (LLM) for a task by profiling the skills required. It decomposes LLM outputs, clusters skills into capability matrices, and uses multi-objective optimization to recommend models that balance performance and budget. Unlike current black-box routing systems, it gives a natural-language rationale for each recommendation. The paper discusses financial reasoning as a representative application domain with diverse skill requirements and cost variation across models.

BELLA 是一个框架，通过技能剖析和多目标优化来选择最适合成本效益的大型语言模型（LLM）用于特定任务。它分解 LLM 输出，将技能聚类到能力矩阵中，并推荐模型以优化性能和预算。这为 LLM 选择提供了透明的理由，解决了当前黑盒路由系统的局限性。
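
The third BELLA stage described above (choosing a model that maximizes performance within a budget) can be sketched as a simple constrained selection over a capability matrix. This is a simplification, not the paper's optimizer: the scoring rule, the tie-break toward cheaper models, the function name `select_model`, and all model names, costs, and skill scores below are hypothetical.

```python
def select_model(models, required_skills, budget):
    """Pick the affordable model with the highest weighted skill score.

    models: {name: {"cost": float, "skills": {skill: score}}}
    required_skills: {skill: weight} describing what the task needs
    Returns (name, score), or (None, -inf) if nothing fits the budget.
    """
    best_name, best_score = None, float("-inf")
    for name, info in models.items():
        if info["cost"] > budget:
            continue  # over budget, not a candidate
        score = sum(
            weight * info["skills"].get(skill, 0.0)
            for skill, weight in required_skills.items()
        )
        cheaper_tie = (
            best_name is not None
            and score == best_score
            and info["cost"] < models[best_name]["cost"]
        )
        if score > best_score or cheaper_tie:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical capability matrix with made-up costs and scores.
models = {
    "small": {"cost": 1.0, "skills": {"arithmetic": 0.8, "table_reading": 0.6}},
    "large": {"cost": 10.0, "skills": {"arithmetic": 0.9, "table_reading": 0.9}},
}
task = {"arithmetic": 1.0}
pick_cheap = select_model(models, task, budget=5.0)
pick_rich = select_model(models, task, budget=20.0)
```

Because the per-skill scores and weights are explicit, the same numbers that drive the choice can also be used to explain it, which matches the transparency goal stated in the abstract.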

EUGens: Efficient, Unified, and General Dense Layers

Authors: Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski

First: 2026-01-30T05:01:03+00:00 · Latest: 2026-02-02T17:47:29+00:00

Comments: We want to update 2410.09771 with this submission

Abstract

Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

中文标题/摘要

标题：EUGens：高效、统一和通用的密集层

高效的神经网络对于将机器学习模型扩展到实时应用和资源受限环境至关重要。全连接前馈层（FFLs）在神经网络架构中引入了计算和参数数量瓶颈。为了解决这一挑战，本文提出了一种新的密集层类别，即Efficient、Unified和General密集层（EUGens）。EUGens利用随机特征来近似标准FFLs，并通过在其计算中直接依赖输入范数超越了它们。所提出的层统一了现有的高效FFL扩展，并通过将推理复杂度从二次降低到线性时间来提高效率。它们还导致了第一个无偏算法，可以近似任意多项式激活函数的FFLs。此外，EUGens减少了参数数量和计算开销，同时保持了FFLs的表达能力和适应性。我们还提出了一种层间知识转移技术，可以绕过反向传播，使EUGens能够高效地适应预训练模型。实验结果显示，将EUGens集成到Transformer和MLPs中，在包括图像分类、语言模型预训练和3D场景重建等多种任务中，可以显著提高推理速度（最高27%）和内存效率（最高30%）。总体而言，我们的结果突显了EUGens在实际场景中大规模部署神经网络的潜力。

Summary / 总结

EUGens are proposed to address the computational and parameter bottlenecks in fully-connected feedforward layers by leveraging random features and incorporating input norms. This results in linear-time inference complexity and unbiased algorithms for arbitrary polynomial activation functions. Experiments show that integrating EUGens into Transformers and MLPs improves inference speed by up to 27% and memory efficiency by up to 30% across various tasks. Layer-wise knowledge transfer techniques are also introduced to adapt EUGens to pre-trained models efficiently.

EUGens通过利用随机特征并结合输入范数来解决全连接前馈层（FFLs）的计算和参数瓶颈问题，实现了线性时间的推理复杂度并提高了效率。实验表明，在Transformer和MLP中集成EUGens可以将推理速度提升高达27%，并提高内存效率高达30%。

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Authors: Maksim Afanasyev, Illarion Iov

First: 2026-02-02T17:46:06+00:00 · Latest: 2026-02-02T17:46:06+00:00

Abstract

Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to "unlearning", where the model degrades the probability of high-quality outputs to satisfy margin constraints, and "formatting collapse" caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

中文标题/摘要

标题：SLIME：稳定似然隐式边距约束的偏好优化方法

直接偏好优化方法已成为基于人类反馈的强化学习（RLHF）的一种计算高效的替代方案，用于对齐大型语言模型（LLMs）。最新方法通过推导隐式奖励函数简化了对齐过程，但它们通常会遭受一个关键目标不匹配的问题：优化所选响应与被拒绝响应之间的相对边距并不保证所选响应绝对似然性的保持。这可能导致“反学习”，即模型降低高质量输出的概率以满足边距约束，以及由于被拒绝序列的过度惩罚而引起的“格式崩溃”。在本文中，我们提出了SLIME（稳定似然隐式边距约束），这是一种无需参考的对齐目标，旨在将偏好学习与生成质量脱钩。SLIME 包含三个目标：（1）锚定项以最大化偏好响应的似然性；（2）稳定惩罚以防止被拒绝标记的概率坍缩为零；（3）双重边距机制，结合硬约束和软约束以实现精确边界塑造。我们的结果表明，SLIME 在保持更高生成稳定性的同时，优于最先进的基线方法。

Summary / 总结

SLIME is a new alignment objective that addresses the limitations of existing methods in direct preference optimization for LLMs. It introduces a three-pronged approach: anchoring preferred responses, stabilizing rejected tokens, and using a dual-margin mechanism. SLIME outperforms current baselines and maintains generation stability.

SLIME 是一种新的对齐目标，解决了现有方法在直接偏好优化中对 LLM 的局限性。它采用了三管齐下的方法：锚定偏好响应的似然性、稳定拒绝标记的概率以及使用双重边界机制。实验表明，SLIME 在性能和生成稳定性方面都优于当前的基线方法。
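
To make the three-pronged objective described above more concrete, here is a rough sketch of a loss with the same three ingredients. It is not the SLIME formula, which the abstract does not give: the functional form of every term, the way they are combined, and the names `slime_like_loss`, `floor`, `hard_margin`, and `beta` are assumptions for illustration only. No reference model appears, in line with the reference-free description.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def slime_like_loss(logp_chosen, logp_rejected_tokens,
                    floor=-10.0, hard_margin=1.0, beta=1.0):
    """Return (total, anchor, stabilize, margin) for one preference pair.

    logp_chosen: length-normalized log-likelihood of the preferred response
    logp_rejected_tokens: per-token log-probabilities of the rejected response
    """
    n = len(logp_rejected_tokens)
    logp_rejected = sum(logp_rejected_tokens) / n

    # (1) Anchoring: directly maximize the chosen response's likelihood.
    anchor = -logp_chosen

    # (2) Stabilizing penalty: only active for rejected tokens whose
    #     log-probability has fallen below the floor.
    stabilize = sum(max(0.0, floor - lp) for lp in logp_rejected_tokens) / n

    # (3) Dual margin: a hard hinge plus a soft logistic term on the gap.
    gap = logp_chosen - logp_rejected
    margin = max(0.0, hard_margin - gap) + softplus(-beta * gap)

    return anchor + stabilize + margin, anchor, stabilize, margin


total, anchor, stabilize, margin = slime_like_loss(-0.5, [-2.0, -3.0])
```

In this sketch the anchoring term keeps pressure on the chosen response's absolute likelihood even after the margin is satisfied, and the floor stops the loss from rewarding ever-lower probabilities on rejected tokens, which are the two failure modes the abstract attributes to margin-only objectives.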

Self-Supervised Learning from Structural Invariance

Authors: Yipeng Zhang, Hafez Ghaemi, Jungyoon Lee, Shahab Bakhtiari, Eilif B. Muller, Laurent Charlin

Venue: ICLR 2026

First: 2026-02-02T17:44:44+00:00 · Latest: 2026-02-02T17:44:44+00:00

Comments: ICLR 2026

Abstract

Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.

中文标题/摘要

标题：基于结构不变性的自监督学习

联合嵌入自监督学习（SSL）是从视觉数据中进行无监督表示学习的关键范式，它通过语义相关数据对之间的不变性进行学习。我们研究了SSL中的一对多映射问题，即每个数据点可能映射到多个有效目标。当数据对来自自然发生的生成过程（例如连续的视频帧）时，就会出现这一问题。我们表明，现有方法难以灵活捕捉这种条件不确定性。为此，我们引入了一个潜在变量来刻画这种不确定性，并推导出配对嵌入之间互信息的变分下界。我们的推导为标准SSL目标给出了一个简单的正则化项。由此得到的方法称为AdaSSL，它同时适用于对比式和基于蒸馏的SSL目标，我们通过实验展示了其在因果表示学习、细粒度图像理解和视频世界建模中的通用性。

Summary / 总结

The paper addresses the challenge of one-to-many mappings in joint-embedding self-supervised learning (SSL) where each datum can map to multiple valid targets. It introduces AdaSSL, a method that uses a latent variable to capture conditional uncertainty, derived from a variational lower bound on the mutual information between paired embeddings. Experiments demonstrate AdaSSL's effectiveness in causal representation learning, fine-grained image understanding, and world modeling on videos, showing its versatility across different SSL objectives.

论文研究了一对多映射问题在自监督学习中的挑战，即每个数据可以映射到多个有效目标。它引入了AdaSSL，通过引入潜变量来捕捉条件不确定性，并提供配对嵌入之间互信息的变分下界。实验表明，AdaSSL在因果表示学习、细粒度图像理解和视频中的世界建模等方面具有广泛适用性，展示了其在不同自监督学习目标上的有效性。

Unified Personalized Reward Model for Vision Generation

Authors: Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang

First: 2026-02-02T17:44:21+00:00 · Latest: 2026-02-02T17:44:21+00:00

Comments: Website: https://codegoat24.github.io/UnifiedReward/flex

Abstract

Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
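
The abstract names two standard building blocks: Bradley-Terry preference modeling and direct preference optimization (DPO). The sketch below shows the textbook form of each for a single preference pair. It is not the UnifiedReward-Flex training code; the function names, the scalar log-probability inputs, and the default `beta` are assumptions chosen for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(a is preferred over b) = sigmoid(reward_a - reward_b)."""
    return math.exp(-softplus(reward_b - reward_a))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Textbook DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference policy. beta is an illustrative default.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the rejected one lowers the loss.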

中文标题/摘要

标题：面向视觉生成的统一个性化奖励模型

近年来，多模态奖励模型（RMs）的发展显著推动了视觉生成的进步。现有框架通常采用布雷德利-特里（Bradley-Terry）风格的偏好建模，或利用生成型VLM作为评判者，随后通过强化学习优化视觉生成模型。然而，当前的RMs存在固有的局限性：它们往往采用一刀切的范式，假设单一的偏好分布，或者依赖固定的评估标准。因此，它们对内容特定的视觉线索不够敏感，导致与主观且依赖上下文的人类偏好出现系统性偏差。为了解决这一问题，我们借鉴人类评估的方法，提出了UnifiedReward-Flex，这是一种面向视觉生成的统一个性化奖励模型，将奖励建模与灵活且上下文适应的推理相结合。具体来说，给定一个提示和生成的视觉内容，它首先解释语义意图并以视觉证据为依据，然后通过实例化预定义和自生成的高层维度下的细粒度标准，动态构建层次评估。我们的训练流程分为两个阶段：（1）我们首先从先进的闭源VLM中蒸馏结构化、高质量的推理痕迹，以启动SFT，使模型具备灵活且上下文适应的推理行为；（2）然后在精心策划的偏好对上进行直接偏好优化（DPO），进一步增强推理保真度和区分性对齐。为了验证其有效性，我们将UnifiedReward-Flex集成到GRPO框架中进行图像和视频合成，大量实验结果证明了其优越性。

Summary / 总结

The paper addresses the limitations of existing multimodal reward models in visual generation by proposing UnifiedReward-Flex, a unified personalized reward model that incorporates flexible and context-adaptive reasoning. It first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment. The training pipeline consists of two stages: distilling reasoning traces from advanced VLMs and performing direct preference optimization on curated preference pairs. Experimental results show that UnifiedReward-Flex outperforms existing methods in image and video synthesis.

论文提出了一种统一的个性化奖励模型UnifiedReward-Flex，该模型结合了灵活和上下文适应的推理能力。它首先解释语义意图并基于视觉证据，然后动态构建层次评估。训练管道分为两个阶段：从高级闭源VLM中提取推理痕迹，并在精心策划的偏好对上进行直接偏好优化。实验结果表明，UnifiedReward-Flex在图像和视频合成中优于现有方法。

Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback

Authors: Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, Huazheng Wang

First: 2026-02-02T17:34:50+00:00 · Latest: 2026-02-02T17:34:50+00:00

Comments: 13 pages

Abstract

Large language model (LLM) agents are increasingly equipped with memory, that is, stored experience and reusable guidance that can improve task-solving performance. Recent self-evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are developed for static train/test splits and only approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. We introduce Live-Evo, an online self-evolving memory system that learns from a stream of incoming data over time. Live-Evo decouples "what happened" from "how to use it" via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task. To manage memory online, Live-Evo maintains experience weights and updates them from feedback: experiences that consistently help are reinforced and retrieved more often, while misleading or stale experiences are down-weighted and gradually forgotten, analogous to reinforcement and decay in human memory. On the live Prophet Arena benchmark over a 10-week horizon, Live-Evo improves Brier score by 20.8% and increases market returns by 12.9%, while also transferring to deep-research benchmarks with consistent gains over strong baselines. Our code is available at https://github.com/ag2ai/Live-Evo.
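
The reinforce-and-decay memory management described above can be sketched as a small weighted store. This is a minimal illustration, not the Live-Evo implementation: the abstract does not give the update rule, so the multiplicative `gain`, the per-step `decay`, the `forget_below` threshold, and the class and method names are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    text: str
    weight: float = 1.0


@dataclass
class ExperienceBank:
    gain: float = 0.2          # hypothetical reinforcement step
    decay: float = 0.95        # hypothetical per-step decay factor
    forget_below: float = 0.1  # hypothetical forgetting threshold
    items: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.items.append(Experience(text))

    def retrieve(self, relevance, k: int = 2) -> list:
        """Return the top-k experiences ranked by relevance(text) * weight."""
        ranked = sorted(self.items,
                        key=lambda e: relevance(e.text) * e.weight,
                        reverse=True)
        return ranked[:k]

    def feedback(self, used, helped: bool) -> None:
        """Reinforce experiences that helped, down-weight those that misled."""
        factor = 1.0 + self.gain if helped else 1.0 - self.gain
        for exp in used:
            exp.weight *= factor

    def step(self) -> None:
        """Decay every weight, then forget experiences below the threshold."""
        for exp in self.items:
            exp.weight *= self.decay
        self.items = [e for e in self.items if e.weight >= self.forget_below]
```

A caller would retrieve experiences for a task, act on them, call `feedback` once the outcome is known, and call `step` once per time step so that unused experiences fade.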

中文标题/摘要

标题：Live-Evo：基于连续反馈的代理记忆在线进化

大型语言模型（LLM）代理越来越多地配备了记忆，这些记忆是存储的经验和可重复使用的指导，可以提高任务解决性能。最近的自我进化系统根据互动结果更新记忆，但大多数现有的进化管道是为静态训练/测试拆分开发的，并且仅通过折叠静态基准来近似在线学习，这使它们在真正的分布偏移和连续反馈下变得脆弱。我们引入了Live-Evo，这是一种在线自我进化记忆系统，能够从随时间不断到来的数据流中学习。Live-Evo通过经验银行和元指导方针银行将“发生了什么”与“如何使用它”解耦，从检索的经验中为每个任务编译任务自适应的指导方针。为了在线管理记忆，Live-Evo维护经验权重并从反馈中更新它们：始终有助于任务的经验得到强化并更频繁地检索，而误导性或过时的经验则被削弱并逐渐遗忘，类似于人类记忆中的强化和衰退。在为期10周的实时Prophet Arena基准测试中，Live-Evo将Brier分数提高了20.8%，增加了12.9%的市场回报，同时在深度研究基准测试中也表现出一致的收益，超过了强大的基线。我们的代码可在 https://github.com/ag2ai/Live-Evo 获取。

Summary / 总结

Live-Evo is an online self-evolving memory system that learns from continuous feedback by updating memory weights based on task performance. It uses an Experience Bank and a Meta-Guideline Bank to compile task-adaptive guidelines and manages memory by reinforcing helpful experiences and down-weighting misleading ones. Live-Evo shows significant improvements in Brier score and market returns on the Prophet Arena benchmark, and it also performs well on deep-research benchmarks compared to strong baselines.
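
For readers unfamiliar with the metric, the Brier score of binary forecasts is the mean squared difference between the predicted probability and the 0/1 outcome, so lower values are better. The sketch below shows only the standard definition; how the paper aggregates scores or computes the relative improvement is not stated in the abstract, and the function name is illustrative.

```python
def brier_score(probabilities, outcomes) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)
```

A perfect forecaster scores 0, and always predicting 0.5 scores 0.25.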

Live-Evo 是一个在线自我进化的记忆系统，能够从持续反馈中学习以提高任务解决性能。它使用经验银行和元指导方针银行来编译任务适应性指导方针，并根据反馈更新经验权重，强化有用的经验并降低误导性的经验权重。在 Prophet Arena 基准测试中，Live-Evo 的 Brier 分数提高了 20.8%，市场回报率提高了 12.9%，并且在深度研究基准测试中也表现出对强基线的一致性改进。

MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Authors: Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard

Venue: NeurIPS 2025

First: 2025-06-09T16:16:42+00:00 · Latest: 2026-02-02T17:34:13+00:00

Comments: The first two authors contributed equally to this work; Accepted to NeurIPS 2025

Abstract

Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably, without retraining or forgetting previous information, remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral backbones demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
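
The masking idea can be illustrated with a toy residual memory. This is a sketch under stated assumptions, not MEMOIR itself: the abstract does not say how masks are computed or how patterns are compared, so the top-k magnitude mask, the overlap-ratio test, the single delta-rule write, and all names below are hypothetical.

```python
def topk_mask(activation, k):
    """Indices of the k largest-magnitude entries (assumed mask rule)."""
    order = sorted(range(len(activation)),
                   key=lambda i: abs(activation[i]), reverse=True)
    return frozenset(order[:k])


class ResidualMemory:
    def __init__(self, d_in, d_out, k, min_overlap):
        self.W = [[0.0] * d_in for _ in range(d_out)]  # zero-initialized memory
        self.k = k
        self.min_overlap = min_overlap
        self.stored_masks = []

    def _apply(self, activation, mask):
        return [sum(row[i] * activation[i] for i in mask) for row in self.W]

    def edit(self, activation, target):
        """Write one edit, touching only the columns selected by the mask."""
        mask = topk_mask(activation, self.k)
        norm = sum(activation[i] ** 2 for i in mask)
        if norm == 0.0:
            raise ValueError("activation is zero on its own mask")
        pred = self._apply(activation, mask)
        for r, row in enumerate(self.W):
            err = target[r] - pred[r]
            for i in mask:
                row[i] += err * activation[i] / norm  # one delta-rule step
        self.stored_masks.append(mask)

    def forward(self, activation):
        """Return the memory residual, or zeros if no stored edit matches."""
        mask = topk_mask(activation, self.k)
        best = max((len(mask & m) / self.k for m in self.stored_masks),
                   default=0.0)
        if best < self.min_overlap:
            return [0.0] * len(self.W)
        return self._apply(activation, mask)
```

Because two inputs with disjoint masks write to disjoint columns, a later edit leaves the earlier one untouched, and a query whose mask overlaps no stored mask receives a zero residual.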

中文标题/摘要

标题：MEMOIR：面向LLM的终身模型编辑，实现最小覆盖与知情保留

部署在实际系统中的语言模型通常需要事后更新以纳入新或更正的知识。然而，高效且可靠地编辑这些模型，同时不重新训练或忘记先前的信息，仍然是一个重大挑战。现有的终身模型编辑方法要么牺牲泛化能力，要么干扰过去的编辑，要么无法扩展到长编辑序列。我们提出了一种名为MEMOIR的新颖可扩展框架，通过残差记忆，即专用参数模块，注入知识，同时保留预训练模型的核心能力。通过样本依赖的掩码稀疏化输入激活，MEMOIR将每次编辑限制在记忆参数的不同子集中，从而最小化编辑之间的干扰。在推理时，它通过比较新查询的稀疏激活模式与编辑期间存储的模式来识别相关编辑。这样可以只激活相关知识，从而泛化到改述后的查询，同时抑制无关提示引起的不必要的记忆激活。针对LLaMA-3和Mistral骨干网络的问答、幻觉纠正和分布外泛化基准实验表明，MEMOIR在可靠性、泛化性和局部性指标上达到了最先进的性能，并且能够扩展到数千次连续编辑，同时几乎不遗忘。

Summary / 总结

The research aims to develop an efficient and reliable method for updating language models with new information without retraining or forgetting previous knowledge. MEMOIR uses a residual memory to inject knowledge sparsely, minimizing interference between edits. Experiments show that MEMOIR outperforms existing methods on question answering, hallucination correction, and out-of-distribution generalization benchmarks, achieving state-of-the-art performance even with thousands of sequential edits.

论文解决了在不重新训练或忘记之前知识的情况下，高效地用新信息更新语言模型的挑战。它提出了MEMOIR框架，该框架使用残差记忆来注入知识，同时保持模型的核心能力。通过稀疏化输入激活，MEMOIR减少了编辑之间的干扰，并能够对重新表述的查询进行泛化。实验表明，MEMOIR在各种基准测试中优于现有方法，实现了最先进的性能和对数千次连续编辑的可扩展性。

ReasonCACHE: Teaching LLMs To Reason Without Weight Updates

Authors: Sharut Gupta, Phillip Isola, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja, Mark Ibrahim, Mohammad Pezeshki

First: 2026-02-02T17:24:23+00:00 · Latest: 2026-02-02T17:24:23+00:00

Comments: 26 pages, 17 Figures

Abstract

Can large language models (LLMs) learn to reason without any weight update and only through in-context learning (ICL)? ICL is strikingly sample-efficient, often learning from only a handful of demonstrations, but complex reasoning tasks typically demand many training examples to learn from. However, naively scaling ICL by adding more demonstrations breaks down at this scale: attention costs grow quadratically, performance saturates or degrades with longer contexts, and the approach remains a shallow form of learning. Due to these limitations, practitioners predominantly rely on in-weight learning (IWL) to induce reasoning. In this work, we show that by using Prefix Tuning, LLMs can learn to reason without overloading the context window and without any weight updates. We introduce ReasonCACHE, an instantiation of this mechanism that distills demonstrations into a fixed key-value cache. Empirically, across challenging reasoning benchmarks, including GPQA-Diamond, ReasonCACHE outperforms standard ICL and matches or surpasses IWL approaches. Further, it achieves this all while being more efficient across three key axes: data, inference cost, and trainable parameters. We also theoretically prove that ReasonCACHE can be strictly more expressive than low-rank weight update since the latter ties expressivity to input rank, whereas ReasonCACHE bypasses this constraint by directly injecting key-values into the attention mechanism. Together, our findings identify ReasonCACHE as a middle path between in-context and in-weight learning, providing a scalable algorithm for learning reasoning skills beyond the context window without modifying parameters. Our project page: https://reasoncache.github.io/
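
Injecting key-values into attention can be illustrated with a single-head, single-query toy. This is a generic prefix-style attention sketch, not the ReasonCACHE code: the function names and list-of-lists tensors are assumptions, and how the prefix entries are learned from demonstrations is out of scope here.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def attention_with_prefix(query, ctx_keys, ctx_values,
                          prefix_keys, prefix_values):
    """Attend over cached prefix key-values placed before the context.

    The prefix entries stand in for a fixed key-value cache; the model's
    projection weights and the context tokens are left unchanged.
    """
    return attention(query,
                     prefix_keys + ctx_keys,
                     prefix_values + ctx_values)
```

With an empty prefix the result equals ordinary attention over the context; a prefix key aligned with the query pulls the output toward its cached value without adding tokens to the prompt.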

中文标题/摘要

标题：ReasonCACHE：让大语言模型在无需权重更新的情况下进行推理

大型语言模型（LLMs）能否在没有任何权重更新的情况下，仅通过上下文学习（ICL）学会推理？ICL表现出惊人的样本效率，通常只需要少量示例即可学习，但复杂的推理任务通常需要大量训练示例才能学习。然而，简单地通过增加示例来扩展ICL在这种规模下会失效：注意力成本呈二次增长，性能在长上下文中饱和或退化，并且该方法仍是一种浅层的学习形式。由于这些限制，从业者主要依赖于权重内学习（IWL）来诱导推理。在本研究中，我们展示了通过使用前缀调优，LLMs可以在不使上下文窗口过载且无需任何权重更新的情况下学习推理。我们引入了ReasonCACHE，它是这一机制的一种实现，将示例蒸馏为固定的键值缓存。实验结果显示，在包括GPQA-Diamond在内的多个挑战性推理基准测试中，ReasonCACHE优于标准ICL，并匹配或超越IWL方法。此外，它在三个关键维度上更加高效：数据、推理成本和可训练参数。我们还从理论上证明，ReasonCACHE可以严格地比低秩权重更新更具表达能力，因为后者将表达能力与输入秩绑定，而ReasonCACHE通过直接向注意力机制注入键值绕过了这一限制。综上所述，我们的研究结果将ReasonCACHE确立为上下文学习和权重内学习之间的一条中间路径，提供了一种在不修改参数的情况下学习超出上下文窗口的推理技能的可扩展算法。项目主页：https://reasoncache.github.io/

Summary / 总结

ReasonCACHE enables large language models to learn to reason without weight updates and without overloading the context window. It uses Prefix Tuning to distill demonstrations into a fixed key-value cache. Across challenging reasoning benchmarks, including GPQA-Diamond, ReasonCACHE outperforms standard in-context learning and matches or surpasses in-weight learning approaches while being more efficient in data, inference cost, and trainable parameters. Theoretical analysis shows that ReasonCACHE can be strictly more expressive than low-rank weight updates. This work provides a scalable approach for learning reasoning skills beyond the context window without modifying parameters.

ReasonCACHE旨在让大型语言模型（LLMs）在无需权重更新的情况下学会推理。它使用前缀调优将示例蒸馏为固定的键值缓存，使LLMs能够高效处理复杂的推理任务。在各种推理基准测试中，ReasonCACHE优于标准的上下文学习（ICL），并匹配或超越权重内学习（IWL）方法，同时在数据、推理成本和可训练参数方面更为高效。理论分析表明，ReasonCACHE的表达能力可以强于低秩权重更新。这项工作提供了一种在不修改参数的情况下学习超出上下文窗口的推理技能的可扩展方法。

Future frame prediction in chest and liver cine MRI using the PCA respiratory motion model: comparing transformers and dynamically trained recurrent neural networks

Authors: Michel Pohl, Mitsuru Uesaka, Hiroyuki Takahashi, Kazuyuki Demachi, Ritu Bhusal Chhatkuli

First: 2024-10-08T10:21:43+00:00 · Latest: 2026-02-02T17:21:22+00:00

Comments: 43 pages, 19 figures, revised version (including transformer experiments, evaluation on liver MRI data, statistical analysis...)

Abstract

Respiratory motion complicates accurate irradiation of thoraco-abdominal tumors in radiotherapy, as treatment-system latency entails target-location uncertainties. This work addresses frame forecasting in chest and liver cine MRI to compensate for such delays. We investigate RNNs trained with online learning algorithms, enabling adaptation to changing respiratory patterns via on-the-fly parameter updates, and transformers, increasingly common in time series forecasting for their ability to capture long-term dependencies. Experiments were conducted using 12 sagittal thoracic and upper-abdominal cine-MRI sequences from ETH Zürich and OvGU. PCA decomposes the Lucas-Kanade optical-flow field into static deformations and low-dimensional time-dependent weights. We compare various methods forecasting the latter: linear filters, population and sequence-specific encoder-only transformers, and RNNs trained with real-time recurrent learning (RTRL), unbiased online recurrent optimization, decoupled neural interfaces, and sparse one-step approximation (SnAp-1). Predicted displacements were used to warp the reference frame and generate future images. Prediction accuracy decreased with the horizon h. Linear regression performed best at short horizons (1.3mm geometrical error at h=0.32s, ETH Zürich data), while RTRL and SnAp-1 outperformed the other algorithms at medium-to-long horizons, with geometrical errors below 1.4mm and 2.8mm on the sequences from ETH Zürich and OvGU (the latter featuring higher motion variability, noise, and lower contrast), respectively. The sequence-specific transformer was competitive for low-to-medium horizons, but transformers remained overall limited by data scarcity and domain shift between datasets. Predicted frames visually resembled the ground truth, with notable errors occurring near the diaphragm at end-inspiration and regions affected by out-of-plane motion.
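
The pipeline described above (forecast the low-dimensional PCA weights, then rebuild the displacement field from static components) can be sketched in a few lines. This is a toy illustration, not the paper's method: the single-lag least-squares predictor stands in for the linear filters, whose exact form the abstract does not give, and the mean field, the flat-list field layout, and the function names are assumptions.

```python
def fit_ar1(series):
    """Least-squares coefficient a minimizing sum (x[t+1] - a * x[t])**2."""
    num = sum(x0 * x1 for x0, x1 in zip(series, series[1:]))
    den = sum(x0 * x0 for x0 in series[:-1])
    if den == 0.0:
        raise ValueError("series has no energy to fit")
    return num / den


def forecast_weight(series, horizon):
    """Predict one PCA weight `horizon` steps ahead with the AR(1) fit."""
    return (fit_ar1(series) ** horizon) * series[-1]


def reconstruct_displacement(mean_field, components, weights):
    """u(x, t) = mean(x) + sum_j w_j(t) * component_j(x).

    mean_field and each component are flat lists over pixels; weights holds
    one (forecast) coefficient per component.
    """
    return [m + sum(w * comp[i] for w, comp in zip(weights, components))
            for i, m in enumerate(mean_field)]
```

In the paper's setting the forecast weights would come from linear filters, transformers, or online-trained RNNs, and the reconstructed displacements would then warp the reference frame to produce the future image.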

中文标题/摘要

标题：使用PCA呼吸运动模型在胸部和肝脏 cine MRI 中进行未来帧预测：比较 Transformer 与动态训练的循环神经网络

呼吸运动使放射治疗中对胸腹部肿瘤的精确照射变得复杂，因为治疗系统延迟会导致目标位置的不确定性。本研究旨在通过在胸部和肝脏 cine MRI 中进行帧预测来补偿这种延迟。我们研究了使用在线学习算法训练的 RNNs，这些算法能够通过实时参数更新来适应变化的呼吸模式，以及 Transformer，后者因其能够捕捉长期依赖关系而在时间序列预测中越来越常见。实验使用了来自 ETH Zürich 和 OvGU 的 12 个矢状面胸部和上腹部 cine-MRI 序列。PCA 将 Lucas-Kanade 光流场分解为静态变形和低维时间依赖权重。我们比较了预测后者的各种方法：线性滤波器、群体和序列特定的仅编码器 Transformer，以及使用实时递归学习 (RTRL)、无偏在线递归优化、解耦神经接口和稀疏一步近似 (SnAp-1) 训练的 RNNs。预测的位移用于对参考帧进行变形并生成未来图像。预测精度随着预测范围 h 的增加而降低。线性回归在短预测范围内表现最佳（在 h=0.32s 时几何误差为 1.3mm，ETH Zürich 数据），而 RTRL 和 SnAp-1 在中到长预测范围内优于其他算法，在 ETH Zürich 和 OvGU 序列上的几何误差分别低于 1.4mm 和 2.8mm（后者具有更高的运动变异性、噪声和较低的对比度）。序列特定的 Transformer 在低到中等预测范围内具有竞争力，但 Transformer 整体上仍受限于数据稀缺性和数据集之间的领域偏移。预测帧在视觉上与真实值相似，明显的误差出现在吸气末的膈肌附近以及受平面外运动影响的区域。

Summary / 总结

This study forecasts future frames in chest and liver cine MRI to compensate for treatment-system latency in radiotherapy, which, combined with respiratory motion, creates target-location uncertainty. It compares transformers and dynamically trained recurrent neural networks (RNNs) for frame forecasting. In experiments on 12 cine-MRI sequences from ETH Zürich and OvGU, linear regression performed best at short horizons, while RTRL and SnAp-1 outperformed the other algorithms at medium-to-long horizons, with geometrical errors below 1.4mm on the ETH Zürich sequences and below 2.8mm on the OvGU sequences. The sequence-specific transformer was competitive for low-to-medium horizons, but transformers were limited by data scarcity and domain shift between datasets.

该研究旨在通过预测胸部和肝脏 cine MRI 的未来帧来解决放射治疗中呼吸运动带来的不确定性问题。研究比较了 Transformer 和动态训练的循环神经网络（RNNs）在帧预测中的表现。使用来自 ETH Zürich 和 OvGU 的 12 个 cine-MRI 序列的实验表明，RTRL 和 SnAp-1 在中到长预测区间内表现最佳，ETH Zürich 和 OvGU 的几何误差分别低于 1.4mm 和 2.8mm。序列特定的 Transformer 在低到中区间内具有竞争力，但受限于数据稀缺性和数据集之间的领域偏移。
