WorldCompass: Reinforcement Learning for Long-Horizon World Models
Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao
First: 2026-02-09T18:59:47+00:00 · Latest: 2026-02-09T18:59:47+00:00
Comments: Project page: https://3d-models.hunyuan.tencent.com/world/
Abstract
This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ a negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
中文标题/摘要
标题:WorldCompass:长时程交互视频世界模型的强化学习后训练框架
本文介绍了WorldCompass,这是一种新颖的强化学习(RL)后训练框架,用于长时程、交互式的视频基世界模型,使它们能够基于交互信号更准确、更一致地探索世界。为了有效“引导”世界模型的探索,我们针对自回归视频生成范式引入了三项核心创新:1)片段级回放策略:我们在单个目标片段上生成和评估多个样本,这显著提高了回放效率并提供了精细的奖励信号。2)互补的奖励函数:我们设计了奖励函数,既考虑交互跟随的准确性,又考虑视觉质量,这提供了直接监督并有效抑制了奖励作弊行为。3)高效的RL算法:我们采用负向意识微调策略并结合各种效率优化,以高效有效地增强模型能力。在当前最先进的开源世界模型WorldPlay上的评估表明,WorldCompass在各种场景中显著提高了交互准确性和视觉保真度。
Summary / 总结
WorldCompass is a reinforcement learning post-training framework for long-horizon, interactive video-based world models. It introduces a clip-level rollout strategy, complementary reward functions for interaction-following accuracy and visual quality, and an efficient RL algorithm based on negative-aware fine-tuning. Evaluations on the open-source world model WorldPlay show significant improvements in interaction accuracy and visual fidelity across various scenarios.
WorldCompass 是一种针对长时程交互式视频世界模型的强化学习后训练框架,通过引入片段级回放策略、互补奖励函数和高效 RL 算法来提升交互准确性和视觉保真度。在开源世界模型 WorldPlay 上的评估结果显示,该框架在不同场景中显著提高了交互准确性和视觉质量。
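The clip-level rollout idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes several candidate samples for one target clip have already been scored, and the function name, the weighted-sum combination of the two rewards, and the within-group normalization are all assumptions made for illustration.

```python
import statistics


def clip_level_advantages(interaction_scores, quality_scores,
                          w_interaction=0.5, w_quality=0.5):
    """Combine two per-sample rewards for ONE target clip and normalize them
    within the group of samples (hypothetical scheme, for illustration)."""
    if len(interaction_scores) != len(quality_scores):
        raise ValueError("each sample needs both an interaction and a quality score")
    # Assumed combination: a weighted sum of the two complementary rewards.
    rewards = [w_interaction * a + w_quality * q
               for a, q in zip(interaction_scores, quality_scores)]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All samples scored the same, so there is no preference signal.
        return [0.0] * len(rewards)
    # Positive values mark samples better than the group average.
    return [(r - mean) / std for r in rewards]
```

Samples with a positive value would be reinforced and those with a negative value discouraged; how the negative-aware fine-tuning strategy actually uses such signals is not specified in the abstract.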
$χ_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
Authors: Checheng Yu, Chonghao Sima, Gangcheng Jiang, Hai Zhang, Haoguang Mai, Hongyang Li, Huijie Wang, Jin Chen, Kaiyang Wu, Li Chen, Lirui Zhao, Modi Shi, Ping Luo, Qingwen Bu, Shijia Peng, Tianyu Li, Yibo Yuan
First: 2026-02-09T18:59:45+00:00 · Latest: 2026-02-09T18:59:45+00:00
Abstract
High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution -- a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $χ_{0}$, a resource-efficient framework with effective modules designed to achieve production-level robustness in robotic manipulation. Our approach builds on three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $χ_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening and folding to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from an arbitrary initial state for 24 consecutive hours non-stop. Experiments validate that $χ_{0}$ surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, with only 20 hours of data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.
中文标题/摘要
标题:$χ_{0}$:通过驯服分布不一致性实现资源感知的鲁棒操作
高可靠性的长时程机器人操作传统上依赖大规模数据和计算来理解复杂的现实世界动力学。然而,我们发现现实世界鲁棒性的主要瓶颈并非仅仅是资源规模,而是人类演示分布、策略学到的归纳偏置和测试时执行分布之间的分布偏移,这是一种系统性不一致性,会导致多阶段任务中的累积错误。为了缓解这些不一致性,我们提出了$χ_{0}$,这是一种资源高效的框架,具有专门设计的有效模块,以实现机器人操作的生产级鲁棒性。我们的方法建立在三个技术支柱之上:(i) 模型算术,这是一种权重空间合并策略,能够高效地吸收不同演示的多样化分布,从物体外观到状态变化不等;(ii) 阶段优势,这是一种阶段感知的优势估计器,提供稳定、密集的进度信号,克服了先前非阶段方法的数值不稳定性;(iii) 训练部署对齐,通过时空增强、启发式DAgger修正和时间分块平滑来弥合分布差距。$χ_{0}$ 使两组双臂机器人能够协作执行长时程的服装操作,涵盖从平整、折叠到悬挂不同衣物的任务。我们的方法展示了高可靠性的自主性;我们能够从任意初始状态连续运行系统24小时不间断。实验验证了$χ_{0}$ 在成功率上比最先进的$π_{0.5}$ 高出近250%,仅使用20小时数据和8个A100 GPU。代码、数据和模型将被发布以促进社区的发展。
Summary / 总结
The paper addresses the challenge of achieving high-reliability long-horizon robotic manipulation by focusing on the distributional inconsistencies among human demonstrations, the inductive bias learned by the policy, and test-time execution. It introduces $χ_{0}$, a resource-efficient framework that combines Model Arithmetic, Stage Advantage, and Train-Deploy Alignment to mitigate these inconsistencies. The method enables two sets of dual-arm robots to collaboratively perform long-horizon garment manipulation and surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, using only 20 hours of data and 8 A100 GPUs.
论文针对高可靠性的长期机器人操作挑战,重点关注不同数据分布之间的分布差异。它提出了一个资源高效的框架$χ_{0}$,包括模型算术、阶段优势和训练部署对齐,以减轻这些不一致性。实验表明,$χ_{0}$在成功率上比最先进的方法$π_{0.5}$高出近250%,仅使用20小时的数据和8个A100 GPU。
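Weight-space merging, the general idea behind Model Arithmetic, can be sketched as a weighted average of checkpoints trained on different demonstration subsets. The abstract does not give the merging rule used by $χ_{0}$, so the averaging scheme, the function name, and the flat-list parameter format below are assumptions for illustration only.

```python
def merge_weights(checkpoints, coeffs=None):
    """Weighted average of parameter dictionaries (hypothetical merging rule).

    Each checkpoint maps a parameter name to a flat list of floats; all
    checkpoints are assumed to share the same names and sizes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if coeffs is None:
        # Default assumption: a uniform average over all checkpoints.
        coeffs = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(coeffs) != len(checkpoints):
        raise ValueError("one coefficient per checkpoint is required")
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints))
            for i in range(len(reference))
        ]
    return merged
```

For example, `merge_weights([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])` averages the two parameter vectors element by element.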
Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving
Authors: Amir Mallak, Alaa Maalouf
First: 2026-02-09T18:59:03+00:00 · Latest: 2026-02-09T18:59:03+00:00
Abstract
Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
Summary / 总结
The study analyzes out-of-distribution (OOD) robustness in autonomous driving by decomposing environments into five factors: scene, season, weather, time, and agent mix. Using closed-loop control in VISTA, the researchers benchmark FC, CNN, and ViT policies and find that ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies. The largest single-factor drops come from rural-to-urban and day-to-night shifts (about 31% each), while training on winter/snow is the most robust to single-factor shifts. Using multiple ID environments broadens coverage and raises urban OOD success from 60.6% to 70.1% at the cost of a small ID drop.
研究旨在通过将环境分解为五个因素(场景、季节、天气、时间和交通参与者组合)来理解自动驾驶中的分布外(OOD)鲁棒性。使用VISTA的闭环控制,研究对比了FC、CNN和ViT策略,并发现ViT策略在OOD鲁棒性上明显优于规模相当的CNN/FC。研究还表明,从乡村到城市和从白天到夜晚的转变是影响最大的单因素变化,各导致约31%的下降。此外,在冬季/雪地条件下训练对单因素变化最具鲁棒性;使用多个ID环境可以扩大覆盖范围并改善薄弱场景(城市OOD从60.6%提升至70.1%),代价是ID性能略有下降,而单一ID环境训练虽能保持峰值性能,但仅限于狭窄领域。
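The controlled $k$-factor perturbation design can be made concrete by enumerating which of the five axes are shifted at once. The sketch below only lists the axis subsets for $k \in \{0,1,2,3\}$; the axis identifiers are shorthand chosen here, and the concrete values each axis takes (for example, which season replaces which) are not modeled.

```python
from itertools import combinations

# The five environment axes named in the abstract (identifiers are our own shorthand).
AXES = ("scene", "season", "weather", "time", "agent_mix")


def k_factor_subsets(axes=AXES, max_k=3):
    """Map each k to the list of axis subsets that are perturbed simultaneously."""
    return {k: list(combinations(axes, k)) for k in range(max_k + 1)}
```

With five axes this gives 1, 5, 10, and 10 subsets for $k = 0, 1, 2, 3$, so 26 axis combinations in total before choosing concrete values per axis.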
Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models
Authors: Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala, Lerrel Pinto, Nur Muhammad Mahi Shafiullah
First: 2026-02-09T18:58:50+00:00 · Latest: 2026-02-09T18:58:50+00:00
Abstract
The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/
中文标题/摘要
标题:接触锚定策略:接触条件化创建强大的机器人效用模型
机器人学习的主流范式试图通过运行时的语言提示在不同环境、具身形态和任务之间进行泛化。这一方法存在根本性的矛盾:语言往往过于抽象,无法引导稳健操作所需的具体物理理解。在本研究中,我们引入了接触锚定策略(CAP),用空间中的物理接触点取代语言条件化。同时,我们将CAP结构化为模块化的效用模型库,而非单一的通用策略。这种分解允许我们实现从现实到模拟的迭代循环:我们构建了EgoGym,一个轻量级的模拟基准,以便在实际部署之前快速识别失败模式并改进我们的模型和数据集。我们展示了通过接触条件化和模拟迭代,CAP仅使用23小时的演示数据,就能在三种基本操作技能上开箱即用地泛化到新的环境和具身形态,并在零样本评估中比大型最先进的视觉-语言-动作模型(VLA)高出56%。所有模型检查点、代码库、硬件、模拟和数据集将开源。项目页面:https://cap-policy.github.io/
Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Authors: Hao Phung, Hadar Averbuch-Elor
First: 2026-02-09T18:58:46+00:00 · Latest: 2026-02-09T18:58:46+00:00
Comments: Code: https://anonymous.4open.science/r/Raster2Seq-BE73/
Abstract
Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and a varying number of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements--such as rooms, windows, and doors--are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing the attention mechanism to be directed effectively toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
中文标题/摘要
标题:Raster2Seq:面向楼层平面图重建的多边形序列生成
从栅格化楼面平面图图像重建结构化的矢量图形表示,通常是涉及楼面平面图的计算任务(如自动理解或CAD工作流)的重要前提。然而,现有的技术在忠实生成复杂楼面平面图结构和语义方面存在困难,这些复杂楼面平面图描绘了包含许多房间和不同数量多边形角的大室内空间。为此,我们提出了Raster2Seq,将楼面平面图重建框架为序列到序列的任务,在该任务中,楼面平面图元素(如房间、窗户和门)被表示为联合编码几何和语义的标记多边形序列。我们的方法引入了一种自回归解码器,该解码器在图像特征和先前生成的角的基础上学习预测下一个角,并使用可学习锚点的指导。这些锚点代表图像空间中的空间坐标,因此允许有效地引导注意力机制关注信息丰富的图像区域。通过采用自回归机制,我们的方法在输出格式上具有灵活性,能够高效处理具有众多房间和多种多边形结构的复杂楼面平面图。我们的方法在标准基准(如Structure3D、CubiCasa5K和Raster2Graph)上达到了最先进的性能,同时在包含多种房间结构和复杂几何变化的更具挑战性的数据集(如WAFFLE)上也表现出强大的泛化能力。
Summary / 总结
The research aims to reconstruct structured vector-graphics representations from rasterized floorplan images, which is crucial for tasks like automated understanding or CAD workflows. Raster2Seq is proposed as a sequence-to-sequence model that generates labeled polygon sequences to encode both geometry and semantics of floorplan elements. The model uses an autoregressive decoder with learnable anchors to predict the next corner based on image features and previously generated corners, effectively handling complex floorplans. Experimental results show that Raster2Seq outperforms existing methods on standard benchmarks and demonstrates strong generalization to more challenging datasets.
研究旨在从栅格化平面图图像中重建结构化的矢量图形表示,这对于自动化理解或CAD工作流程等任务至关重要。提出了一个序列到序列模型Raster2Seq,该模型生成表示楼层元素的带标签多边形序列,结合了几何和语义信息。该模型使用自回归解码器和可学习的锚点来预测下一个角落,聚焦于图像中的信息区域。实验表明,Raster2Seq在如Structure3D、CubiCasa5K和Raster2Graph等基准测试中优于现有方法,并且能够很好地适应如WAFFLE等更具挑战性的数据集,该数据集包含多样化的房间结构和复杂的几何变化。
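One way to picture a "labeled polygon sequence" is to flatten each floorplan element into its class label followed by its corner coordinates. The token layout below, including the `<SEP>` and `<EOS>` markers and the function name, is a hypothetical format for illustration and is not taken from the paper.

```python
def polygons_to_sequence(elements, sep="<SEP>", eos="<EOS>"):
    """Flatten (label, corners) pairs into one token sequence.

    elements: list of (label, [(x, y), ...]) tuples, e.g. rooms, doors, windows.
    The layout [label, x1, y1, ..., <SEP>] per element is an assumed format.
    """
    sequence = []
    for label, corners in elements:
        sequence.append(label)      # semantics: the element class
        for x, y in corners:
            sequence.extend((x, y))  # geometry: one corner at a time
        sequence.append(sep)        # closes the current polygon
    sequence.append(eos)            # marks the end of the floorplan
    return sequence
```

Because the sequence simply grows with the number of elements and corners, a decoder that emits one token at a time needs no fixed limit on rooms or corners per polygon, which is the output flexibility the abstract attributes to the autoregressive formulation.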
ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
Authors: Zihan Yang, Shuyuan Tu, Licheng Zhang, Qi Dai, Yu-Gang Jiang, Zuxuan Wu
First: 2026-02-09T18:56:14+00:00 · Latest: 2026-02-09T18:56:14+00:00
Abstract
Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.
中文标题/摘要
标题:ArcFlow:通过高精度非线性流蒸馏实现两步文本到图像生成
扩散模型在生成质量方面取得了显著成就,但由于其依赖于多个顺序去噪步骤,导致推理成本高昂,促使近期努力将此推理过程简化为几步流程。然而,现有的蒸馏方法通常通过使用线性捷径来近似教师轨迹,这使得难以匹配其随时间步变化不断变化的切线方向,从而导致质量下降。为解决这一局限性,我们提出ArcFlow,这是一种几步蒸馏框架,明确采用非线性流轨迹来近似预训练教师轨迹。具体而言,ArcFlow 将推理轨迹下的速度场参数化为连续动量过程的混合。这使ArcFlow能够捕捉速度演变并外推连贯的速度,以在每个去噪步骤内形成连续的非线性轨迹。重要的是,这种参数化允许对这条非线性轨迹进行解析积分,从而避免数值离散化误差并实现对教师轨迹的高精度近似。为了将此参数化训练成几步生成器,我们通过预训练教师模型使用轻量级适配器实现ArcFlow的轨迹蒸馏。这种策略确保了快速、稳定的收敛,同时保持生成多样性和质量。基于大规模模型(Qwen-Image-20B 和 FLUX.1-dev),ArcFlow 只微调不到5%的原始参数,并在不显著降低质量的情况下实现了比原始多步教师40倍的速度提升,仅需2个NFE。基准实验表明,ArcFlow 在定性和定量上都表现出有效性。
Summary / 总结
ArcFlow is a few-step distillation framework that uses non-linear flow trajectories to approximate the teacher trajectories of pre-trained diffusion models, addressing the quality degradation caused by linear shortcuts. By parameterizing the velocity field as a mixture of continuous momentum processes, ArcFlow captures velocity evolution and forms a continuous non-linear trajectory within each denoising step; the parameterization admits analytical integration, which avoids numerical discretization errors. Built on Qwen-Image-20B and FLUX.1-dev, ArcFlow fine-tunes less than 5% of the original parameters and achieves a 40x speedup with only 2 NFEs (number of function evaluations) over the multi-step teachers, without significant quality degradation.
ArcFlow 是一种新颖的少步蒸馏框架,使用非线性流轨迹来近似预训练扩散模型的教师轨迹,解决了由线性近似引起的质量下降问题。通过将速度场参数化为连续动量过程的混合,ArcFlow 捕捉速度演变并在每个去噪步骤中形成连续的非线性轨迹,从而实现高精度近似和快速、稳定的收敛。在大规模模型上,ArcFlow 只微调不到 5% 的原始参数,并且实现了 40 倍的速度提升,仅需 2 次函数评估 (NFE),且没有显著的质量损失。
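A toy one-dimensional example shows why a linear shortcut loses accuracy when the velocity changes within a step, and why a closed-form integral avoids that error. The exponentially decaying velocity $v(t) = v_0 e^{-\gamma t}$ used here is an assumption chosen because it integrates analytically; it is not ArcFlow's actual parameterization.

```python
import math


def linear_step(x0, v0, h):
    """Linear shortcut: hold the initial velocity fixed for the whole step."""
    return x0 + v0 * h


def analytic_step(x0, v0, gamma, h):
    """Exact endpoint when the velocity is v(t) = v0 * exp(-gamma * t) (toy assumption).

    The displacement is the integral of v over [0, h]:
    v0 * (1 - exp(-gamma * h)) / gamma.
    """
    if gamma == 0:
        # Constant velocity: the trajectory is a straight line.
        return x0 + v0 * h
    return x0 + v0 * (1.0 - math.exp(-gamma * h)) / gamma
```

With `x0 = 0`, `v0 = 1`, `gamma = 1`, and `h = 1`, the linear step overshoots the exact endpoint `1 - exp(-1)`, and the gap grows as the step gets larger, which is the regime a 2-step sampler operates in.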
Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction
Authors: Hongyi Chen, Tony Dong, Tiancheng Wu, Liquan Wang, Yash Jangir, Yaru Niu, Yufei Ye, Homanga Bharadhwaj, Zackory Erickson, Jeffrey Ichnowski
First: 2026-02-09T18:56:02+00:00 · Latest: 2026-02-09T18:56:02+00:00
Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
中文标题/摘要
标题:基于RGB人体视频的四维手-物轨迹重建的灵巧操作策略
多指机器人手操作和抓取由于高维动作空间和大规模训练数据获取的困难而具有挑战性。现有方法主要依赖穿戴设备或专用传感设备的人类远程操作来捕捉手-物交互,这限制了其可扩展性。在本工作中,我们提出了一种无需设备的框架VIDEOMANIP,可以直接从RGB人体视频中学习灵巧操作。利用计算机视觉的最新进展,VIDEOMANIP通过估计人类手部姿态、物体网格并重新定位重建的人类动作到机器人手中,从单目视频中重建明确的四维机器人-物体轨迹。为了使重建的机器人数据适合灵巧操作训练,我们引入了基于接触的灵巧操作优化和以交互为中心的抓取建模,以及一种从单个视频生成多样化训练轨迹的演示合成策略,从而在无需额外机器人演示的情况下实现泛化策略学习。在仿真中,使用Inspire手学习的抓取模型在20种不同物体上实现了70.25%的成功率。在现实世界中,从RGB视频训练的操作策略在使用LEAP手执行七项任务时实现了平均62.86%的成功率,优于基于重新定位的方法15.87%。项目视频可在videomanip.github.io获取。
Summary / 总结
This work addresses the challenge of multi-finger robotic hand manipulation by proposing VIDEOMANIP, a framework that learns from RGB human videos without requiring specialized sensing equipment. By reconstructing 4D hand-object trajectories and optimizing hand-object contacts, VIDEOMANIP enables the generation of diverse training trajectories from a single video. The approach achieves a 70.25% success rate in grasping 20 diverse objects in simulation and an average 62.86% success rate across seven tasks in the real world, outperforming existing methods.
研究旨在开发一种无需设备的框架VIDEOMANIP,直接从RGB人体视频中学习灵巧操作策略。通过估计人体手部姿态和物体网格来重建4D手-物体轨迹,并将这些动作重新定向到机器人手中。该框架包括手-物体接触优化和演示合成策略,以从单个视频生成多样化的训练轨迹。实验结果表明,学习到的抓取模型在模拟中实现了70.25%的成功率,在现实世界任务中平均实现了62.86%的成功率,优于基于重新定向的方法。
Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense
Authors: Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen
First: 2026-02-09T18:55:33+00:00 · Latest: 2026-02-09T18:55:33+00:00
Comments: Project page at https://greenoso.github.io/NextGen-CAPTCHAs_webpage/
Abstract
The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations. Notably, for backend-supported types, our system is capable of generating effectively unbounded CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.
中文标题/摘要
标题:下一代CAPTCHA:利用认知差距实现可扩展和多样化的GUI-代理防御
GUI使能代理的快速发展使传统CAPTCHA过时。虽然像OpenCaptchaWorld这样的基准测试为评估多模态代理奠定了基础,但最近的推理型模型,如Gemini3-Pro-High和GPT-5.2-Xhigh已经有效地消除了这一安全障碍,复杂逻辑谜题“宾果”上的通过率高达90%。为应对这一挑战,我们提出了下一代CAPTCHA,这是一种可扩展的防御框架,旨在保护下一代网络免受高级代理的攻击。与静态数据集不同,我们的基准测试基于强大的数据生成管道,允许大规模和易于扩展的评估,特别是对于后端支持的类型,我们的系统能够生成几乎无限的CAPTCHA实例。我们利用持续的人机“认知差距”在交互感知、记忆、决策和行动中的差异。通过设计需要适应性直觉而非细粒度规划的动态任务,我们重新建立了生物用户和人工代理之间的坚实区别,为代理时代提供了可扩展和多样化的防御机制。
Summary / 总结
The paper addresses the need for new CAPTCHA systems due to the advancement of GUI-enabled agents that can bypass traditional CAPTCHAs. It introduces Next-Gen CAPTCHAs, a scalable defense framework that leverages the cognitive gap between humans and artificial agents. The system generates dynamic tasks requiring adaptive intuition, which are difficult for reasoning-heavy models to solve, thus re-establishing a robust distinction between human users and artificial agents.
论文针对GUI智能体的进步导致传统CAPTCHA失效的问题,提出了Next-Gen CAPTCHAs这一新的防御框架。该系统利用人类与人工智能体在感知、记忆、决策和行动方面的认知差距,生成需要适应性直觉的动态任务,并通过强大的数据生成管道支持大规模、易扩展的评估。通过这种方式,Next-Gen CAPTCHAs旨在重新建立人类用户与高级AI智能体之间的可靠区分,提供一种可扩展且多样化的防御机制。
Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey
Authors: Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen Wang, Xiongxiao Xu, Baixiang Huang, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Ahmed A. Metwally, Jun Yan, Chen-Yu Lee, Hanqing Zeng, Yinglong Xia, Xiaokai Wei, Ali Payani, Yu Wang, Haitong Ma, Wenya Wang, Chengguang Wang, Yu Zhang, Xin Wang, Yongfeng Zhang, Jiaxuan You, Hanghang Tong, Xiao Luo, Xue Liu, Yizhou Sun, Wei Wang, Julian McAuley, James Zou, Jiawei Han, Philip S. Yu, Kai Shu
First: 2026-01-14T07:38:38+00:00 · Latest: 2026-02-09T18:53:33+00:00
Abstract
The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the "second half," the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.
中文标题/摘要
标题:重新思考基础代理在第二阶段的记忆机制:一项综述
人工智能的研究正在经历从优先考虑模型创新而非基准得分向强调问题定义和严格的现实世界评估的范式转变。随着领域进入“第二阶段”,中心挑战在于在长期、动态和用户依赖的环境中实现实际效用,其中代理面临上下文爆炸,必须在长时间交互中不断积累、管理和选择性重用大量信息。记忆,今年发布了数百篇论文,因此成为填补效用缺口的关键解决方案。在这项综述中,我们从三个维度提供了一致的基础代理记忆视图:记忆载体(内部和外部)、认知机制(事件、语义、感官、工作和程序),以及记忆主体(代理中心和用户中心)。然后我们分析了在不同代理拓扑结构下记忆是如何实现和操作的,并强调了对记忆操作的学习策略。最后,我们回顾了评估记忆效用的基准和指标,并概述了各种开放挑战和未来方向。
Summary / 总结
This survey reconsiders the memory mechanisms of foundation agents in the context of the AI field's shift towards emphasizing problem definition and real-world evaluation. It examines memory from three dimensions: substrate, cognitive mechanism, and subject, and analyzes how memory is instantiated and operated under different agent topologies. Key findings include the importance of memory in addressing the utility gap in long-horizon, dynamic environments, and the need for rigorous evaluation benchmarks and metrics to assess memory utility.
这篇综述重新审视了基础代理在AI范式转向实际应用时的内存机制。它从内存基质、认知机制和内存主体三个维度进行考察。关键发现包括在不同代理拓扑下的内存实例化和操作、内存操作上的学习策略以及评估内存实用性的基准和指标。研究强调了内存在解决动态环境中长期挑战和上下文爆炸问题中的关键作用。
GEBench: Benchmarking Image Generation Models as GUI Environments
Authors: Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
First: 2026-02-09T18:52:02+00:00 · Latest: 2026-02-09T18:52:02+00:00
Comments: 23 pages, 5 figures, 4 tables
Abstract
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
中文标题/摘要
标题:GEBench:将图像生成模型作为GUI环境进行基准测试
近年来,图像生成模型的进步使得根据用户指令预测未来图形用户界面(GUI)状态成为可能。然而,现有的基准测试主要集中在通用领域的视觉保真度上,GUI特定上下文中的状态转换和时间连贯性评估则相对不足。为解决这一问题,我们引入了GEBench,这是一个全面的基准测试,用于评估GUI生成中的动态交互和时间连贯性。GEBench 包含了跨越五个任务类别、共计700个精心挑选的样本,涵盖了单步骤交互和多步骤轨迹,包括现实世界和虚构场景,以及定位点的定位。为了支持系统的评估,我们提出了GE-Score,这是一种新颖的五维度度量标准,评估目标实现、交互逻辑、内容一致性、UI合理性以及视觉质量。对当前模型的广泛评估表明,虽然它们在单步骤转换上表现良好,但在长时间交互序列中保持时间连贯性和空间定位方面存在显著困难。我们的研究发现,图标解释、文本渲染和定位精度是关键瓶颈。这项工作为系统的评估提供了基础,并为未来研究构建高保真生成GUI环境指出了有希望的方向。代码可在:https://github.com/stepfun-ai/GEBench 获取。
Summary / 总结
GEBench is introduced to evaluate the dynamic interaction and temporal coherence of image generation models in GUI environments. It includes 700 samples across five task categories and proposes a five-dimensional metric, GE-Score, to assess goal achievement, interaction logic, content consistency, UI plausibility, and visual quality. Evaluations show that models excel in single-step transitions but struggle with maintaining coherence and spatial grounding in multi-step interactions, highlighting bottlenecks in icon interpretation, text rendering, and localization precision.
GEBench 是一个用于评估图像生成模型在 GUI 环境中动态交互和时间连贯性的基准,包含 700 个样本,覆盖五个任务类别,并提出了一种五维评估指标 GE-Score,用于评估目标实现、交互逻辑、内容一致性、UI 可信度和视觉质量。评估结果显示,模型在单步骤过渡中表现良好,但在多步骤交互中难以维持时间连贯性和空间定位,突出了图标解释、文本渲染和定位精度等关键瓶颈。
ARO: A New Lens On Matrix Optimization For Large Models
Authors: Wenbo Gong, Javier Zazo, Qijun Luo, Puqian Wang, James Hensman, Chao Ma
First: 2026-02-09T18:51:22+00:00 · Latest: 2026-02-09T18:51:22+00:00
Abstract
Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening-based methods. While these methods yield substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present Adaptively Rotated Optimization (ARO), a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by $1.3\sim1.35\times$) and orthogonalization methods (by $1.1\sim1.15\times$) in LLM pretraining at up to 8B activated parameters, and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.
中文标题/摘要
标题:ARO:大型模型矩阵优化的新视角
基于矩阵的优化器因其提高大规模语言模型(LLM)训练效率而引起了广泛关注,特别是在正交化/白化方法方面取得了显著进展。尽管这些方法带来了显著的性能提升,但一个基本问题也随之浮现:我们能否开发出超越正交化的新范式,进一步推动效率边界?我们提出了自适应旋转优化(ARO),这是一种新的矩阵优化框架,将梯度旋转视为首要设计原则。ARO 通过在旋转坐标系中执行范数最速下降来加速 LLM 训练,其中旋转由一种新颖的范数导向策略确定。这种视角产生的更新规则超越了现有的正交化和白化优化器,在实践中提高了样本效率。为了确保比较的可靠性,我们提出了一种严格控制的基准测试协议,减少了混淆和偏差。在该协议下,ARO 在多达 80 亿激活参数、多达 8 倍过训练预算的 LLM 预训练中始终优于 AdamW(1.3 至 1.35 倍)和正交化方法(1.1 至 1.15 倍),且没有证据显示边际效益递减。最后,我们讨论了如何将 ARO 重新表述为一种基于残差流旋转对称性的对称感知优化器,这启发了能够以较低计算成本利用跨层/跨模块耦合的更高级设计。
Summary / 总结
The paper introduces Adaptively Rotated Optimization (ARO), a new matrix optimization framework that enhances LLM training efficiency by treating gradient rotation as a first-class design principle. ARO performs normed steepest descent in a rotated coordinate system, with the rotation determined by a novel norm-informed policy. Under a controlled benchmarking protocol, ARO outperforms AdamW by 1.3 to 1.35 times and orthogonalization methods by 1.1 to 1.15 times in LLM pretraining with up to 8 billion activated parameters and up to 8 times overtrain budget, without evidence of diminishing returns.
论文提出了自适应旋转优化(ARO),这是一种新的矩阵优化框架,通过将梯度旋转视为首要原则来提升LLM训练效率。ARO在旋转坐标系中执行归一化最速下降,旋转由一种新颖的范数导向策略确定。实验表明,ARO在样本效率方面优于AdamW和正交化方法,并且在超过8倍的过训练预算下仍无效率衰减迹象。使用了一种严格的基准测试协议以确保比较的可靠性。
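As a rough illustration of "normed steepest descent in a rotated coordinate system", the sketch below rotates a gradient matrix, takes the steepest-descent direction under the max-abs norm (the elementwise sign) in the rotated basis, and maps the result back. This is a toy under stated assumptions and not the ARO update rule: the fixed-angle `rotation` helper, the choice of norm, and the function name `rotated_sign_step` are all hypothetical, and the paper's norm-informed rotation policy is not reproduced here.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rotation(theta):
    """2x2 rotation matrix; a fixed angle stands in for a learned or derived rotation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def sign(v):
    return float((v > 0) - (v < 0))

def rotated_sign_step(grad, rot):
    """Sign (max-abs-norm steepest descent) direction computed in rotated coordinates."""
    rotated = matmul(transpose(rot), grad)            # gradient expressed in the rotated basis
    direction = [[sign(v) for v in row] for row in rotated]
    return matmul(rot, direction)                     # map the direction back

def inner(a, b):
    """Frobenius inner product of two matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

grad = [[0.5, -2.0, 1.0], [3.0, 0.1, -0.4]]
step = rotated_sign_step(grad, rotation(0.3))
# The weights would then be updated as W <- W - lr * step.
```

Whatever orthogonal rotation is used, the inner product between `grad` and `step` equals the sum of absolute values of the rotated gradient, so the step never points uphill; the choice of rotation only changes which direction is selected.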
From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection
Authors: Zilin Fang, Anxing Xiao, David Hsu, Gim Hee Lee
First: 2026-02-09T18:46:12+00:00 · Latest: 2026-02-09T18:46:12+00:00
Comments: Accepted to IEEE Robotics and Automation Letters (RA-L)
Abstract
Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io
中文标题/摘要
标题:从障碍到礼仪:基于VLM的路径选择社会导航
在人类环境中进行社会导航不仅需要满足几何约束,碰撞自由路径仍可能干扰正在进行的活动或违背社会规范。解决这一挑战需要分析代理之间的交互,并将常识推理纳入规划中。本文提出了一种结合几何规划与情境社会推理的社会机器人导航框架。系统首先提取障碍和人类动态以生成几何上可行的候选路径,然后利用微调后的视觉语言模型(VLM)根据情境化的社会期望评估这些路径,选择一个社会优化路径供控制器使用。这种任务特定的VLM将大型基础模型中的社会推理提炼到一个更小且高效的模型中,使框架能够在多种人机交互场景中进行实时适应。在四个社会导航场景中的实验表明,我们的方法在最小个人空间侵犯时间、最小行人类面对时间和无社会区域侵入方面表现最佳。项目页面:https://path-etiquette.github.io
Summary / 总结
This paper addresses the challenge of social navigation for robots by integrating geometric planning with contextual social reasoning. The system extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then uses a fine-tuned vision-language model to evaluate these paths against social expectations and select a socially optimized route. Experiments in four social navigation contexts show that the method achieves the best overall performance, with the lowest personal space violation duration, the least pedestrian-facing time, and no social zone intrusions.
该论文通过结合几何规划和社会推理解决了机器人的社会导航挑战。它提出了一种框架,首先生成几何上可行的路径,然后使用微调后的视觉语言模型根据社会期望评估这些路径,选择最优化的社会路径。实验表明,在四个社会导航场景中,所提出的方法表现最佳,个人空间侵犯最少且没有侵犯社会区域。
iGRPO: Self-Feedback-Driven LLM Reasoning
Authors: Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz
First: 2026-02-09T18:45:11+00:00 · Latest: 2026-02-09T18:45:11+00:00
Comments: Tech report
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62\% and 79.64\% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
中文标题/摘要
标题:iGRPO:自我反馈驱动的LLM推理
大型语言模型(LLMs)在解决复杂数学问题方面显示出潜力,但仍难以提供准确且一致的解决方案。强化学习(RL)是一种框架,用于使这些模型与特定任务的奖励对齐,从而提高整体质量和可靠性。组相对策略优化(GRPO)是Proximal Policy Optimization(PPO)的一种高效、无价值函数的替代方案,利用了组相对奖励归一化。我们引入了迭代组相对策略优化(iGRPO),这是一种GRPO的两阶段扩展,通过模型生成的草稿添加动态自我条件。在第一阶段,iGRPO抽样多个探索性草稿,并使用相同的标量奖励信号选择最高奖励的草稿进行优化。在第二阶段,它将此最佳草稿附加到原始提示,并对条件草稿进行GRPO风格的更新,训练策略超越其最强的先前尝试。在匹配的展开预算下,iGRPO在基础模型(例如Nemotron-H-8B-Base-8K和DeepSeek-R1 Distilled)上始终优于GRPO,验证了其在各种推理基准上的有效性。此外,将iGRPO应用于在AceReason-Math上训练的OpenReasoning-Nemotron-7B,分别在AIME24和AIME25上取得了新的最佳结果85.62%和79.64%。消融实验进一步表明,改进的包装器超越了GRPO变体,受益于生成式评判,并通过延迟熵崩溃改变了学习动态。这些结果强调了迭代、基于自我反馈的RL在推进可验证数学推理方面的潜力。
Summary / 总结
iGRPO is a two-stage reinforcement learning method that enhances the reasoning capabilities of Large Language Models (LLMs) by conditioning on model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward one. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on the draft-conditioned refinements. Under matched rollout budgets, iGRPO outperforms GRPO across base models, and when applied to OpenReasoning-Nemotron-7B it achieves new state-of-the-art results of 85.62% on AIME24 and 79.64% on AIME25.
iGRPO 是一种迭代强化学习方法,通过模型生成的草稿进行自我条件化来增强大型语言模型(LLMs)的推理能力。在第一阶段,iGRPO 从多个探索性草稿中选择最高奖励的草稿;在第二阶段,它使用 GRPO 样式的更新来细化原始提示。这种方法在各种基础模型上表现优于 GRPO,并在应用到 OpenReasoning-Nemotron-7B 时,在 AIME24 和 AIME25 基准上取得了新的最佳结果。消融研究进一步表明,细化包装器和生成式评判者在提高学习动力学方面非常有效。
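The two stages described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: `igrpo_step`, the "Previous attempt" prompt template, and the toy `toy_sample` / `toy_reward` functions are hypothetical stand-ins for a real policy and reward, and the advantage computation is the usual group-relative normalization (reward minus group mean, divided by group standard deviation).

```python
import random
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalise rewards within the sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def igrpo_step(prompt, sample, reward, n_drafts=4, n_refinements=4):
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = [sample(prompt) for _ in range(n_drafts)]
    best_draft = max(drafts, key=reward)
    # Stage 2: append the best draft to the prompt and sample refinements.
    conditioned_prompt = f"{prompt}\n\nPrevious attempt:\n{best_draft}"
    refinements = [sample(conditioned_prompt) for _ in range(n_refinements)]
    # Advantages are computed over the draft-conditioned refinements only.
    advantages = group_relative_advantages([reward(r) for r in refinements])
    return best_draft, refinements, advantages

# Toy stand-ins (hypothetical): the "policy" guesses an integer, the reward prefers 7.
rng = random.Random(0)

def toy_sample(prompt):
    if "Previous attempt:" in prompt:
        previous = int(prompt.rsplit("\n", 1)[-1])
        return str(previous + rng.choice([-1, 0, 1]))  # refine near the draft
    return str(rng.randint(0, 10))

def toy_reward(text):
    return -abs(int(text) - 7)

best, refinements, advantages = igrpo_step("Guess the number.", toy_sample, toy_reward)
```

In a real training loop the advantages would weight a policy-gradient update on the refinement tokens; that step is omitted here.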
Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study
Authors: Arushi Rai, Adriana Kovashka
Venue: WACV 2026
First: 2026-02-09T18:41:43+00:00 · Latest: 2026-02-09T18:41:43+00:00
Comments: To appear at WACV 2026
Abstract
While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.
中文标题/摘要
标题:通过观看比赛和阅读书籍泛化体育反馈生成:以攀岩为例的研究
尽管视频-LLM在高级推理能力方面取得了快速进展,但先前的工作表明,这些模型在体育反馈生成这一具有挑战性的任务上表现不佳,需要为每项运动收集昂贵且难以获取的微调反馈数据。这一限制体现在微调过程中未见过的体育项目的泛化能力较差。此外,传统的文本生成评估指标(如BLEU-4、METEOR、ROUGE-L、BERTScore),最初是为机器翻译和摘要开发的,无法捕捉体育反馈质量的独特方面。为了解决第一个问题,以攀岩为例,我们提出使用目标领域中的辅助免费网络数据,如比赛视频和教练手册,以及来自不同源领域的现有体育反馈,以提高目标领域体育反馈生成的性能。为了改进评估,我们提出了两个评估指标:(1)具体性;(2)可操作性。我们的方法结合在一起,能够在有限注释的情况下实现更有意义和实用的体育反馈生成。
Summary / 总结
The research aims to improve sports feedback generation by leveraging freely available web data from the target domain, such as competition videos and coaching manuals, alongside existing sports feedback from a disjoint source domain, using rock climbing as the case study. It also proposes two new evaluation metrics, specificity and actionability, because traditional text generation metrics fail to capture sports feedback quality. Together, the approach is intended to enable more meaningful and practical feedback generation for a target sport under limited annotations.
该研究通过利用目标领域的辅助数据,如比赛视频和教练手册,以及不同领域的现有运动反馈,来解决为未见过的运动生成运动反馈的挑战。它引入了两个新的评估指标:具体性和可操作性,以更好地评估运动反馈的质量。该方法提高了运动反馈生成的泛化能力,并在有限标注的情况下提供了更具意义和实用性的反馈。
Block-Recurrent Dynamics in Vision Transformers
Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller
First: 2025-12-23T00:18:23+00:00 · Latest: 2026-02-09T18:40:41+00:00
Comments: 25 pages, 15 figures
Abstract
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). At small scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
中文标题/摘要
标题:视觉变换器中的块循环动力学
随着视觉变换器(ViTs)成为标准的视觉骨干网络,对其计算现象学的机制性解释变得至关重要。尽管架构线索暗示了动力学结构的存在,但目前尚无定型的框架能够将Transformer的深度解释为一个特征明确的流。在本文中,我们提出了块循环假设(BRH),认为训练后的ViTs具有块循环的深度结构,使得原始的$L$个块的计算可以仅通过$k \ll L$个不同的块的循环应用来准确重写。在多种ViTs中,层间表示相似性矩阵表明存在少量连续阶段。为了确定这些阶段是否反映了真正可重用的计算,我们训练了预训练ViTs的块循环近似模型:相位结构变换OR近似器(Raptor)。在小规模实验中,我们证明了随机深度和训练促进了循环结构的形成,并且与我们能够准确拟合Raptor的能力相关。然后,我们通过训练一个Raptor模型,仅用2个块就恢复了96%的DINOv2 ImageNet-1k线性探针精度,证明了BRH的存在。最后,我们利用这一假设发展了一种动力学可解释性程序。我们发现i) 方向性收敛到类依赖的角盆地,具有自我纠正的轨迹;ii) 令牌特定的动力学,其中cls执行尖锐的后期重新定向,而patch令牌则表现出强烈的后期阶段与它们均值方向的共轭;iii) 深度后期的低秩更新,与收敛到低维吸引子一致。总体而言,我们发现ViT深度中出现了一个紧凑的循环程序,指向一种低复杂度的规范性解决方案,使这些模型能够通过原理性的动力学系统分析进行研究。
Summary / 总结
This work introduces the Block-Recurrent Hypothesis (BRH) to explain the computational structure of Vision Transformers (ViTs). By training block-recurrent surrogates, Raptor, it demonstrates that ViTs can be accurately represented using a small number of recurrent blocks. Key findings include 96% accuracy recovery of DINOv2 ImageNet-1k linear probe with only 2 blocks, and evidence of directional convergence, token-specific dynamics, and low-rank updates in late depth, suggesting a compact recurrent program along ViT depth.
该研究提出了块递归假设(BRH),以解释视觉变换器(ViTs)的计算结构,表明ViTs的深度可以用少量递归块准确表示。实验表明,预训练的ViTs可以通过仅使用两个块的Raptor模型来有效近似,该模型在ImageNet-1k上的准确率达到DINOv2的96%。关键发现包括方向性收敛到类相关的角度盆地、令牌特定的动力学以及晚期深度的低秩更新,表明ViT深度中出现了一个紧凑的递归程序。
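The block-recurrent structure in the hypothesis (L layers of computation rewritten with k distinct blocks applied repeatedly) can be pictured with a depth schedule that maps each layer position to one of the k shared blocks. The sketch below is a toy under stated assumptions: the scalar "blocks", the phase lengths, and the helper names `contiguous_schedule` and `run_blocks` are hypothetical, and real Raptor blocks would be Transformer blocks acting on token sequences.

```python
def contiguous_schedule(phase_lengths):
    """Map each depth position to a block index, one contiguous phase per block.

    For example, phase lengths [3, 9] give three applications of block 0
    followed by nine applications of block 1 (12 layers, 2 distinct blocks).
    """
    return [block for block, n in enumerate(phase_lengths) for _ in range(n)]

def run_blocks(x, blocks, schedule):
    """Apply the shared blocks in the order given by the schedule."""
    for index in schedule:
        x = blocks[index](x)
    return x

# Toy scalar "blocks" standing in for Transformer blocks (illustrative only).
blocks = [
    lambda x: 0.5 * x + 1.0,  # early-phase block
    lambda x: 0.9 * x,        # late-phase block
]
schedule = contiguous_schedule([3, 9])
output = run_blocks(4.0, blocks, schedule)
```

Only the `k = 2` entries of `blocks` carry parameters, while the effective depth is `len(schedule) = 12`, which mirrors the "equivalent computational cost" framing in the abstract.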
Reproducible Benchmarking for Lung Nodule Detection and Malignancy Classification Across Multiple Low-Dose CT Datasets
Authors: Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Ehsan Samei, Michael R. Harowicz, Jayashree Kalpathy-Cramer, Kyle J. Lafata, Tina D. Tailor, Cynthia Rudin, Joseph Y. Lo
First: 2024-05-07T18:36:40+00:00 · Latest: 2026-02-09T18:39:15+00:00
Comments: 3 tables, 2 supplement tables, 5 figures
Abstract
Evaluation of artificial intelligence (AI) models for low-dose CT lung cancer screening is limited by heterogeneous datasets, annotation standards, and evaluation protocols, making performance difficult to compare and translate across clinical settings. We establish a public, reproducible multi-dataset benchmark for lung nodule detection and nodule-level cancer classification and quantify cross-dataset generalizability.
Using the Duke Lung Cancer Screening (DLCS) dataset as a clinically curated development set, we evaluate performance across LUNA16/LIDC-IDRI, NLST-3D, and LUNA25. Detection models trained on DLCS and LUNA16 were evaluated externally on NLST-3D using free-response ROC analysis. For malignancy classification, we compared five strategies: randomly initialized ResNet50, Models Genesis, Med3D, a Foundation Model for Cancer Biomarkers, and a Strategic Warm-Start (ResNet50-SWS) approach pretrained using detection-derived candidate patches stratified by confidence. Performance was summarized using AUC with 95% confidence intervals and DeLong tests.
Detection performance varied substantially by training dataset, with DLCS-trained models outperforming LUNA16-trained models on external NLST-3D evaluation (sensitivity at 2 false positives per scan: 0.72 vs. 0.64; p < 0.001). For malignancy classification, ResNet50-SWS achieved AUCs of 0.71 (DLCS), 0.90 (LUNA16), 0.81 (NLST-3D), and 0.80 (LUNA25), consistently matching or exceeding alternative pretraining strategies. These results demonstrate that dataset characteristics strongly influence lung cancer AI performance and highlight the need for transparent, multi-dataset benchmarking.
中文标题/摘要
标题:肺癌筛查低剂量CT多数据集可重复基准测试
低剂量CT肺癌筛查中的人工智能(AI)模型评估受限于异质性数据集、注释标准和评估协议,使得不同临床环境下的性能比较和转换变得困难。我们建立了一个公开的、可重复的多数据集基准,用于肺结节检测和结节级癌症分类,并量化了跨数据集的一般化能力。
使用杜克肺癌筛查(DLCS)数据集作为临床筛选开发集,我们评估了LUNA16/LIDC-IDRI、NLST-3D和LUNA25上的性能。在NLST-3D上的外部评估中,使用DLCS和LUNA16训练的检测模型采用了自由响应ROC分析。对于恶性肿瘤分类,我们比较了五种策略:随机初始化的ResNet50、Models Genesis、Med3D、癌症生物标志物基础模型以及使用检测衍生候选斑块按置信度分层预训练的Strategic Warm-Start(ResNet50-SWS)方法。性能用95%置信区间和DeLong检验总结。
检测性能在不同训练数据集之间差异显著,DLCS训练的模型在外部NLST-3D评估中优于LUNA16训练的模型(每扫描2个假阳性时的灵敏度:0.72 vs. 0.64;p < 0.001)。对于恶性肿瘤分类,ResNet50-SWS实现了DLCS 0.71、LUNA16 0.90、NLST-3D 0.81和LUNA25 0.80的AUC,始终与替代预训练策略相当或超过。这些结果表明,数据集特征强烈影响肺癌AI性能,并强调了透明的多数据集基准测试的必要性。
Summary / 总结
The study establishes a public, reproducible benchmark for evaluating AI models in low-dose CT lung cancer screening across multiple datasets. Using the Duke Lung Cancer Screening (DLCS) dataset as a development set, it evaluates detection and malignancy classification models on LUNA16/LIDC-IDRI, NLST-3D, and LUNA25. DLCS-trained detection models outperform LUNA16-trained models on external NLST-3D evaluation (sensitivity at 2 false positives per scan: 0.72 vs. 0.64). For malignancy classification, the Strategic Warm-Start (ResNet50-SWS) approach, pretrained on detection-derived candidate patches, reaches AUCs of 0.71 (DLCS), 0.90 (LUNA16), 0.81 (NLST-3D), and 0.80 (LUNA25), matching or exceeding the alternative pretraining strategies and showing that dataset characteristics strongly influence performance.
研究旨在建立一个可重复的基准,以评估不同数据集中的低剂量CT肺癌筛查AI模型性能。使用杜克肺癌筛查数据集作为开发集,研究在LUNA16/LIDC-IDRI、NLST-3D和LUNA25上评估检测和恶性分类模型。结果显示,DLCS训练的模型在外部NLST-3D评估中优于LUNA16训练的模型。对于恶性分类,使用检测衍生候选补丁进行预训练的战略预热(ResNet50-SWS)方法在各个数据集上实现了稳定的高AUC,表明了强跨数据集泛化能力。
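The detection metric quoted above, sensitivity at a fixed number of false positives per scan, is an operating point read off a free-response ROC curve. The following sketch shows one simple way to compute it; it is an illustration only, not the benchmark's evaluation code. The function name `sensitivity_at_fp_rate` and the toy detections are hypothetical, and it assumes each true-positive detection matches a distinct nodule and that confidence ties can be ignored.

```python
def sensitivity_at_fp_rate(detections, n_nodules, n_scans, max_fp_per_scan=2.0):
    """Best sensitivity reachable while average false positives per scan stay within budget.

    detections: list of (confidence, is_true_positive) pooled over all scans.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = false_pos = 0
    best = 0.0
    # Lowering the confidence threshold admits detections one at a time.
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / n_scans > max_fp_per_scan:
            break  # budget exceeded; earlier thresholds were the valid ones
        best = max(best, true_pos / n_nodules)
    return best

# Toy data (made up for illustration): 2 scans, 4 annotated nodules.
toy_detections = [
    (0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, False),
    (0.4, False), (0.3, True), (0.2, False), (0.1, True),
]
at_two_fp = sensitivity_at_fp_rate(toy_detections, n_nodules=4, n_scans=2)
at_half_fp = sensitivity_at_fp_rate(toy_detections, n_nodules=4, n_scans=2,
                                    max_fp_per_scan=0.5)
```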
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery
Authors: Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lilong Wang, Zifu Wang, Jiong Wang, Wanghan Xu, Yue Deng, Dongrui Liu, Yiheng Wang, Wenlong Zhang, Fenghua Ling, Shufei Zhang, Xiaosong Wang, Shuangjia Zheng, Xun Huang, Siqi Sun, Shuyue Hu, Peng Ye, Chunfeng Song, Bin Wang, Conghui He, Yihao Liu, Xin Li, Qibin Hou, Tao Chen, Xiangyu Yue, Bin Wang, Liang He, Dahua Lin, Bowen Zhou, Bo Zhang, Lei Bai
First: 2026-02-09T18:36:06+00:00 · Latest: 2026-02-09T18:36:06+00:00
Comments: Code and project page: https://github.com/InternScience/InternAgent
Abstract
We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long-horizon memory. The architecture allows InternAgent-1.5 to operate continuously across extended discovery cycles while maintaining coherent and improving behavior. It also enables the system to coordinate computational modeling and laboratory experimentation within a single unified system. We evaluate InternAgent-1.5 on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience, and the system achieves leading performance that demonstrates strong foundational capabilities. Beyond these benchmarks, we further assess two categories of discovery tasks. In algorithm discovery tasks, InternAgent-1.5 autonomously designs competitive methods for core machine learning problems. In empirical discovery tasks, it executes complete computational or wet lab experiments and produces scientific findings in earth, life, biological, and physical domains. Overall, these results show that InternAgent-1.5 provides a general and scalable framework for autonomous scientific discovery.
中文标题/摘要
标题:InternAgent-1.5:统一代理框架,用于长期自主科学发现
我们介绍了InternAgent-1.5,这是一个用于跨计算和经验领域端到端科学发现的统一系统。该系统基于一个结构化的架构,由三个协调的子系统组成,用于生成、验证和进化。这些子系统由深度研究的基础能力、解决方案优化和长期记忆支持。该架构使InternAgent-1.5能够在扩展的发现周期中连续运行,同时保持一致并改进行为。它还使系统能够在单一统一系统中协调计算建模和实验室实验。我们在GAIA、HLE、GPQA和FrontierScience等科学推理基准上评估了InternAgent-1.5,该系统展示了强大的基础能力,取得了领先性能。除了这些基准,我们还进一步评估了两类发现任务。在算法发现任务中,InternAgent-1.5自主设计了针对核心机器学习问题的竞争力方法。在经验发现任务中,它执行完整的计算或湿实验室实验,并在地球、生命、生物和物理领域产生科学发现。总体而言,这些结果表明,InternAgent-1.5提供了一个通用且可扩展的自主科学发现框架。
Summary / 总结
InternAgent-1.5 is a unified system for long-horizon scientific discovery, featuring a structured architecture with generation, verification, and evolution subsystems. It excels in scientific reasoning benchmarks and autonomously designs competitive methods for machine learning and executes experiments in various scientific domains, demonstrating strong foundational capabilities and scalability.
InternAgent-1.5 是一个用于长期科学发现的统一系统,具有生成、验证和进化子系统组成的结构化架构。它在科学推理基准测试中表现出色,并自主设计适用于机器学习的核心方法,同时执行各种科学领域的实验,展示了强大的基础能力和可扩展性。
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
First: 2026-02-05T18:01:52+00:00 · Latest: 2026-02-09T18:34:18+00:00
Abstract
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on-/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
中文标题/摘要
标题:f-GRPO及其扩展:基于偏差的强化学习算法在通用LLM对齐中的应用
近期研究表明,偏好对齐(PA)目标可以作为对齐(选择)和未对齐(拒绝)响应分布之间偏差的估计器。在此项工作中,我们将这种基于偏差的观点扩展到一般的对齐设置中,例如仅具有环境奖励的可验证奖励强化学习(RLVR)。在这一统一框架中,我们提出了f-组相对策略优化(f-GRPO),这是一种在线策略强化学习方法,以及f-混合对齐损失(f-HAL),这是一种混合在线/离线策略目标,基于f-偏差的变分表示,用于通用LLM对齐。我们提供了这些类目标在对齐后提高平均奖励的理论保证。实验上,我们在RLVR(数学推理)和PA任务(安全对齐)上验证了我们的框架,展示了与当前方法相比的优越性能和灵活性。
Summary / 总结
This research aims to enhance the alignment of large language models (LLMs) using divergence-based reinforcement learning (RL) methods. The study proposes f-Group Relative Policy Optimization (f-GRPO) and f-Hybrid Alignment Loss (f-HAL) to address general alignment settings, including reinforcement learning with verifiable rewards (RLVR). Theoretical guarantees show that these methods improve average rewards post-alignment. Experiments on RLVR and PA tasks demonstrate superior performance and flexibility compared to existing approaches.
该研究旨在通过基于发散的强化学习方法提升大型语言模型(LLM)的对齐。研究提出了f-Group Relative Policy Optimization (f-GRPO)和f-Hybrid Alignment Loss (f-HAL),以应对包括可验证奖励强化学习(RLVR)在内的通用对齐设置。理论保证表明,这些方法在对齐后可以提高平均奖励。实验结果表明,该方法在RLVR和PA任务上具有更优的性能和灵活性,优于现有方法。
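For readers unfamiliar with the variational representation of f-divergences mentioned in the abstract, the standard textbook form is given below. It is included only as background: how f-GRPO and f-HAL instantiate the critic $T$ and the two distributions is not stated in this entry, so the symbols here are generic rather than the paper's notation.

```latex
% f-divergence for a convex f with f(1) = 0
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right]

% Variational (dual) representation via the convex conjugate f^*
D_f(P \,\|\, Q) = \sup_{T} \; \Big\{ \mathbb{E}_{x \sim P}\big[ T(x) \big]
                  - \mathbb{E}_{x \sim Q}\big[ f^{*}(T(x)) \big] \Big\},
\qquad f^{*}(t) = \sup_{u} \{ u t - f(u) \}

% Example: KL divergence
f(u) = u \log u \quad \Longrightarrow \quad f^{*}(t) = e^{\,t - 1}
```

Any fixed $T$ gives a lower bound on the divergence, which is why an objective that can be written in this form acts as a divergence estimator between the two response distributions.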
Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning
Authors: Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg
First: 2026-02-09T18:34:17+00:00 · Latest: 2026-02-09T18:34:17+00:00
Comments: Accepted for publication in Transactions on Machine Learning Research (TMLR), 2026
Abstract
In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.
中文标题/摘要
标题:改进层次多标签学习中稀有节点的检测
在层次多标签分类中,一个持续的挑战是如何使模型预测能够达到层次结构的更深层次,以实现更详细或更精细的分类。这种困难部分源于某些类(或层次节点)的自然稀有性以及层次约束确保子节点几乎总是比其父节点更稀有的事实。为了解决这个问题,我们提出了一种结合节点不平衡加权和焦点加权组件的加权损失目标,后者利用现代对集成不确定性量化的认识。通过强调稀有节点而不是稀有观察(数据点),并在训练过程中关注每个模型输出分布中的不确定节点,我们发现在基准数据集上召回率提高了多达五倍,并且在F1分数上取得了统计学意义上的显著提升。我们还展示了我们的方法有助于卷积网络在具有次优编码器或有限数据的挑战性任务中。
Summary / 总结
The paper addresses the challenge of detecting rare nodes in hierarchical multi-label classification by proposing a weighted loss objective that combines node-wise imbalance weighting with focal weighting. This method improves recall by up to a factor of five on benchmark datasets and leads to statistically significant gains in F1 score. The approach also enhances the performance of convolutional networks in challenging tasks with suboptimal encoders or limited data.
研究旨在通过解决节点频率不平衡的问题,提高层次多标签分类中稀有节点的检测能力。作者提出了一种结合节点不平衡加权和焦点加权的损失函数,利用集成不确定性关注不确定节点。这种方法在基准数据集上将召回率提高了五倍,并在具有次优编码器或有限数据的挑战性任务中显著提升了$F_{1}$分数。
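A weighted loss that combines node-wise imbalance weighting with a focal term can be sketched as below. This is a simplified stand-in and not the paper's objective: the abstract's focal component draws on ensemble uncertainty estimates, whereas this sketch uses the standard focal modulation $(1 - p_t)^{\gamma}$, and the inverse-frequency weighting, function names, and toy counts are all assumptions made for illustration.

```python
import math

def node_weights(positive_counts, n_samples):
    """Inverse-frequency weight per hierarchy node, normalised to mean 1."""
    raw = [n_samples / max(count, 1) for count in positive_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def weighted_focal_bce(probs, targets, weights, gamma=2.0, eps=1e-7):
    """Binary cross-entropy over nodes with node weights and focal modulation."""
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        p_t = p if y == 1 else 1.0 - p            # probability given to the true label
        total += -w * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

# Toy hierarchy (made-up counts): a root, a child, and a rare grandchild node.
weights = node_weights(positive_counts=[900, 90, 10], n_samples=1000)
confident_loss = weighted_focal_bce([0.9, 0.9, 0.9], [1, 1, 1], weights)
uncertain_loss = weighted_focal_bce([0.6, 0.5, 0.2], [1, 1, 1], weights)
```

The weights attach to nodes rather than to individual data points, so a rare deep node is emphasised on every example, and the focal factor shifts the remaining emphasis toward nodes the model is still getting wrong.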
Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
Authors: Yuliang Liu, Yunchong Song, Yixuan Wang, Kewen Ge, Alex Lamb, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
First: 2026-02-09T18:33:31+00:00 · Latest: 2026-02-09T18:33:31+00:00
Abstract
We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters with up to 300B training data, including Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
中文标题/摘要
标题:在离散潜在空间中预测下一个概念使语言模型更强
我们提出了下一个概念预测(NCP),这是一种基于下一个标记预测(NTP)的生成预训练范式。NCP 预测跨越多个标记的概念,从而形成更具挑战性的预训练目标。我们的模型 ConceptLM 使用向量量化对隐藏状态进行量化,并构建概念词汇表。它利用 NCP 和 NTP 来驱动参数更新,并生成一个概念以指导后续标记的生成。我们从 70M 到 1.5B 参数规模对 ConceptLM 进行了从头训练,训练数据量高达 300B,包括 Pythia 和 GPT-2 基础模型。在 13 个基准测试上,NCP 在传统标记级模型上提供了持续的性能提升。此外,对一个 8B 参数的 Llama 模型进行的持续预训练实验表明,NCP 可以进一步提高 NTP 训练的模型。我们的分析表明,NCP 通过引入更难的预训练任务,使语言模型更加强大,为更好的语言建模提供了有希望的道路。
Summary / 总结
The paper introduces Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP) that predicts discrete concepts spanning multiple tokens, making the pretraining task more challenging. The model, ConceptLM, uses vector quantization of hidden states to form a concept vocabulary and combines NCP with NTP for parameter updates. ConceptLM was trained from scratch at scales from 70M to 1.5B parameters, showing consistent performance improvements over traditional token-level models across 13 benchmarks. Continual pretraining on an 8B-parameter Llama model indicates that NCP can further improve an NTP-trained model.
研究旨在通过引入Next Concept Prediction (NCP),预测跨越多个令牌的离散概念,使预训练任务更具挑战性。该模型ConceptLM使用向量量化形成概念词汇表,并结合Next Token Prediction (NTP)进行训练。在13个基准测试上的实验显示,NCP在性能上持续优于传统的令牌级模型,并且在8B参数的Llama模型上的持续预训练进一步证明了NCP在提升语言模型方面的有效性。
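The vector-quantization step that turns hidden states into discrete concept ids can be illustrated with a nearest-codebook lookup. The sketch below is a minimal example under stated assumptions: mean-pooling a span of token hidden states before quantizing, the tiny two-dimensional codebook, and the helper names are hypothetical, and ConceptLM's actual pooling and codebook training are not described in this entry.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Return (index, code vector) of the nearest codebook entry."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

def mean_pool(hidden_states):
    """Average a span of per-token hidden states into one vector (an assumption)."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]

# Toy codebook with three "concepts" in a 2-D hidden space (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
span_hidden_states = [[0.9, 1.2], [1.1, 0.8]]   # two tokens forming one span
concept_id, concept_vector = quantize(mean_pool(span_hidden_states), codebook)
```

The resulting `concept_id` plays the role of a target in the concept vocabulary, predicted alongside the usual next-token targets.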
StretchTime: Adaptive Time Series Forecasting via Symplectic Attention
Authors: Yubin Kim, Viresh Pati, Jevon Twitty, Vinh Pham, Shihao Yang, Jiecheng Lu
First: 2026-02-09T18:29:25+00:00 · Latest: 2026-02-09T18:29:25+00:00
Abstract
Transformer architectures have established strong baselines in time series forecasting, yet they typically rely on positional encodings that assume uniform, index-based temporal progression. However, real-world systems, from shifting financial cycles to elastic biological rhythms, frequently exhibit "time-warped" dynamics where the effective flow of time decouples from the sampling index. In this work, we first formalize this misalignment and prove that rotary position embedding (RoPE) is mathematically incapable of representing non-affine temporal warping. To address this, we propose Symplectic Positional Embeddings (SyPE), a learnable encoding framework derived from Hamiltonian mechanics. SyPE strictly generalizes RoPE by extending the rotation group $\mathrm{SO}(2)$ to the symplectic group $\mathrm{Sp}(2,\mathbb{R})$, modulated by a novel input-dependent adaptive warp module. By allowing the attention mechanism to adaptively dilate or contract temporal coordinates end-to-end, our approach captures locally varying periodicities without requiring pre-defined warping functions. We implement this mechanism in StretchTime, a multivariate forecasting architecture that achieves state-of-the-art performance on standard benchmarks, demonstrating superior robustness on datasets exhibiting non-stationary temporal dynamics.
中文标题/摘要
标题:StretchTime: 通过辛注意力实现自适应时间序列预测
基于Transformer的架构在时间序列预测中已经建立了强大的基准,但它们通常依赖于假设均匀、基于索引的时间进化的位置编码。然而,从不断变化的金融周期到弹性生物节律等现实系统中,经常会出现“时间扭曲”动力学,其中有效的时间流与采样索引脱钩。在这项工作中,我们首先形式化了这种不一致,并证明旋转位置嵌入(RoPE)在数学上无法表示非线性时间扭曲。为了解决这个问题,我们提出了辛位置嵌入(SyPE),这是一种源自哈密顿力学的可学习编码框架。SyPE严格推广了RoPE,通过将旋转群$\mathrm{SO}(2)$扩展到辛群$\mathrm{Sp}(2,\mathbb{R})$,并由一个新颖的输入依赖的自适应扭曲模块调制。通过允许注意力机制端到端地适应性地拉伸或压缩时间坐标,我们的方法能够捕捉局部变化的周期性,而无需预先定义的扭曲函数。我们通过在StretchTime中实现这一机制,构建了一种多变量预测架构,该架构在标准基准上实现了最先进的性能,并在表现出非平稳时间动态的数据集上展示了更好的鲁棒性。
Summary / 总结
The research aims to improve time series forecasting by addressing the limitations of positional encodings in capturing non-uniform temporal dynamics. The authors propose Symplectic Positional Embeddings (SyPE), which extend the rotation group to the symplectic group, allowing for adaptive temporal warping. Experiments show that the StretchTime architecture, incorporating SyPE, outperforms existing methods on standard benchmarks, especially for datasets with non-stationary temporal dynamics.
研究旨在通过解决现有位置编码假设均匀时间进展的局限性,改进时间序列预测。作者提出了Symplectic Positional Embeddings (SyPE),将旋转群扩展到辛群,允许适应性时间扭曲。实验结果表明,提出的StretchTime架构在标准基准上优于现有方法,特别是在具有非平稳时间动态的数据集上表现出色。
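To make the step from $\mathrm{SO}(2)$ to $\mathrm{Sp}(2,\mathbb{R})$ concrete: a 2x2 real matrix $M$ is symplectic when $M^\top J M = J$ with $J = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$, which for 2x2 matrices is the same as having determinant 1. Rotations satisfy this, but so do area-preserving stretches that are not rotations. The sketch below only checks this group-membership property; it is not the SyPE embedding, and the helper names are hypothetical.

```python
import math

J = [[0.0, 1.0], [-1.0, 0.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def is_symplectic(m):
    """Check M^T J M = J (equivalent to det M = 1 for 2x2 matrices)."""
    return close(matmul(transpose(m), matmul(J, m)), J)

def is_rotation_like(m):
    """Check orthogonality, M^T M = I, which every SO(2) element satisfies."""
    return close(matmul(transpose(m), m), IDENTITY)

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def squeeze(scale):
    """Stretch one axis and shrink the other; area-preserving but not a rotation."""
    return [[scale, 0.0], [0.0, 1.0 / scale]]

warped = matmul(rotation(0.7), squeeze(2.0))   # symplectic, yet outside SO(2)
```

Every rotation passes both checks, while `warped` passes only the symplectic one, which is the extra freedom a symplectic generalization of a rotary encoding has available.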
When do neural ordinary differential equations generalize on complex networks?
Authors: Moritz Laber, Tina Eliassi-Rad, Brennan Klein
First: 2026-02-09T18:28:41+00:00 · Latest: 2026-02-09T18:28:41+00:00
Abstract
Neural ordinary differential equations (neural ODEs) can effectively learn dynamical systems from time series data, but their behavior on graph-structured data remains poorly understood, especially when they are applied to graphs whose size or structure differs from those encountered during training. We study neural ODEs ($\mathtt{nODE}$s) with vector fields following the Barabási-Barzel form, trained on synthetic data from five common dynamical systems on graphs. Using the $\mathbb{S}^1$-model to generate graphs with realistic and tunable structure, we find that degree heterogeneity and the type of dynamical system are the primary factors in determining $\mathtt{nODE}$s' ability to generalize across graph sizes and properties. This extends to $\mathtt{nODE}$s' ability to capture fixed points and maintain performance amid missing data. Average clustering plays a secondary role in determining $\mathtt{nODE}$ performance. Our findings highlight $\mathtt{nODE}$s as a powerful approach to understanding complex systems but underscore challenges emerging from degree heterogeneity and clustering in realistic graphs.
中文标题/摘要
标题:神经常微分方程在复杂网络上的泛化何时发生?
神经常微分方程(神经 ODEs)能够有效从时间序列数据中学习动力系统,但在图结构数据上的行为仍然知之甚少,尤其是在应用于与训练时遇到的图大小或结构不同的图时。我们研究了遵循 Barabási-Barzel 形式的向量场的神经 ODEs($\mathtt{nODE}$s),这些神经 ODEs 在五种常见图动力系统的合成数据上进行训练。使用 $\mathbb{S}^1$ 模型生成具有现实且可调结构的图,我们发现度异质性和动力系统的类型是决定 $\mathtt{nODE}$ 跨图大小和属性泛化能力的主要因素。这扩展到 $\mathtt{nODE}$ 能够捕捉固定点并在数据缺失时保持性能。平均聚类在决定 $\mathtt{nODE}$ 性能方面起次要作用。我们的研究结果突显了 $\mathtt{nODE}$ 作为理解复杂系统的一种强大方法,但强调了来自度异质性和聚类的现实图中出现的挑战。
Summary / 总结
The study investigates the generalization capabilities of neural ODEs on complex networks, focusing on synthetic data from five common dynamical systems. By using the $\mathbb{S}^1$-model to generate graphs with tunable structure, the researchers find that degree heterogeneity and the type of dynamical system are key factors in determining neural ODEs' ability to generalize across different graph sizes and properties. The study also shows that neural ODEs can effectively capture fixed points and maintain performance even with missing data, although average clustering plays a less significant role in their performance.
研究探讨了神经常微分方程(nODEs)在图结构数据上的泛化能力,集中在五个常见动力系统的合成数据上。使用$S^1$-模型生成具有可调结构的图,研究发现度异质性和动力系统的类型是主要影响nODEs在不同图大小和属性上泛化能力的因素。研究还表明,nODEs能够捕捉固定点并在数据缺失的情况下保持性能,尽管平均聚类在它们的性能中起次要作用。
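As a rough illustration of the vector-field structure mentioned in the abstract, the Barabási-Barzel form is commonly written as dx_i/dt = F(x_i) + sum_j A_ij G(x_i, x_j): a self-dynamics term plus pairwise interaction terms summed over neighbors. The sketch below integrates that form with forward Euler. In the paper F and G are learned by the neural ODE; here they are hand-written SIS epidemic terms, and the ring graph, rates, and step size are all illustrative assumptions rather than details from the paper.

```python
def simulate(adj, x0, self_term, pair_term, dt=0.01, steps=2000):
    """Forward-Euler integration of dx_i/dt = F(x_i) + sum_j A_ij * G(x_i, x_j)."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        dx = [
            self_term(x[i])
            + sum(adj[i][j] * pair_term(x[i], x[j]) for j in range(n))
            for i in range(n)
        ]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


# Hypothetical stand-ins for the learned terms: SIS epidemic dynamics.
DELTA, BETA = 1.0, 1.0  # recovery and infection rates (assumed values)


def recovery(x):
    return -DELTA * x


def infection(x_i, x_j):
    return BETA * (1.0 - x_i) * x_j


# 4-node ring graph (every node has degree 2); illustrative only.
ring = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

final_state = simulate(ring, [0.1, 0.2, 0.3, 0.4], recovery, infection)
```

With these assumed rates, the uniform non-zero fixed point on the ring solves -x + 2(1 - x)x = 0, giving x = 0.5, so `final_state` should settle near 0.5 at every node. Checking whether a trained model reproduces such fixed points on unseen graphs is the kind of test the abstract refers to.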
ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models
Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Venue: ICLR 2026
First: 2025-05-20T11:43:25+00:00 · Latest: 2026-02-09T18:14:19+00:00
Comments: ICLR 2026. Raghav Singhal, Kaustubh Ponkshe, and Rohit Vartak contributed equally to this work
Abstract
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
中文标题/摘要
标题:ABBA-适配器:高效且富有表现力的基础模型微调
大型语言模型在广泛的任务中表现出强大的性能,但如何高效地将它们适应到新的领域仍然是一个关键挑战。参数高效微调(PEFT)方法通过引入轻量级、可训练的模块来解决这一问题,同时保持大部分预训练权重不变。目前占主导地位的方法LoRA使用低秩分解来建模更新,但其表现力受到秩的内在限制。最近的方法HiRA通过结合与冻结权重的哈达玛积来增加表现力,但仍依赖于预训练模型的结构。我们提出了ABBA,这是一种新的PEFT架构,将更新重新参数化为两个独立可学习低秩矩阵的哈达玛积。与先前工作不同,ABBA完全解耦了更新与预训练权重,使得两个组件可以自由优化。这在相同的参数预算下实现了显著更高的表现力,我们通过矩阵重构实验验证了这一点。实验中,ABBA在算术和常识推理基准测试中取得了最先进的结果,相对于现有PEFT方法在多个模型上均表现出显著的优越性。我们的代码已公开发布在:https://github.com/CERT-Lab/abba。
Summary / 总结
The research aims to improve the efficiency and expressivity of fine-tuning large language models for new domains. ABBA-Adapters reparameterize the update as a Hadamard product of two independently learnable low-rank matrices, decoupling the update from the pre-trained weights. This leads to higher expressivity under the same parameter budget and state-of-the-art performance on arithmetic and commonsense reasoning benchmarks, outperforming existing methods by a significant margin.
研究旨在提高大型语言模型在新领域中的微调效率和表达能力。ABBA提出了一种新的PEFT架构,将更新重新参数化为两个独立可学习的低秩矩阵的哈达玛积,完全解耦更新与预训练权重。这在相同参数预算下具有更高的表达能力,并在算术和常识推理基准测试中取得了最先进的结果,显著优于现有方法。
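To make the reparameterization concrete, the sketch below builds an update as the Hadamard (element-wise) product of two low-rank products, (B1 A1) ⊙ (B2 A2), and compares its rank with a single low-rank product B1 A1. It relies only on the general fact that the rank of a Hadamard product is at most the product of the two ranks. The matrices are hand-picked toy values, and names such as `lora_like` and `abba_like` are illustrative, not taken from the ABBA code base.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rank(m):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [u - factor * v for u, v in zip(rows[i], rows[r])]
        r += 1
    return r


# Toy 4x2 factors (rank r = 2 each); values chosen by hand for illustration.
b1 = [[1, 1], [1, -1], [1, 1], [1, -1]]
b2 = [[1, 1], [1, 1], [1, -1], [1, -1]]
a1, a2 = transpose(b1), transpose(b2)

lora_like = matmul(b1, a1)                             # rank at most r
abba_like = hadamard(matmul(b1, a1), matmul(b2, a2))   # rank at most r * r
```

Here `lora_like` has rank 2 while `abba_like` equals 4 times the identity and has full rank 4. At r = 2 the bounds r * r and 2r coincide, so this toy case only shows the bound being reached; for larger r, two rank-r pairs can reach rank up to r * r, whereas a single low-rank product with the same number of parameters has rank at most 2r.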
WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
Authors: Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Xin Zhang, Yinzhou Tang, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Yong Li
First: 2026-02-09T18:09:20+00:00 · Latest: 2026-02-09T18:09:20+00:00
Abstract
While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark, with a public leaderboard, is released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.
中文标题/摘要
标题:WorldArena:评估具身世界模型感知与功能实用性的统一基准
尽管世界模型已成为具身智能的基石,通过动作条件下的预测使代理能够推理环境动力学,但其评估仍处于碎片化状态。当前对具身世界模型的评估主要集中在感知保真度(例如,视频生成质量)上,忽视了这些模型在下游决策任务中的功能实用性。在本文中,我们引入了WorldArena,这是一个统一基准,旨在系统地从感知和功能两个维度评估具身世界模型。WorldArena 通过三个维度评估模型:视频感知质量,通过六个子维度下的 16 个指标进行测量;具身任务功能,评估世界模型作为数据引擎、策略评估器和动作规划器的能力,并结合主观的人类评估。此外,我们提出了EWMScore,这是一种综合多维度性能的统一指标。通过在 14 个代表性模型上的广泛实验,我们揭示了感知与功能之间的显著差距,表明高视觉质量并不一定转化为强大的具身任务能力。WorldArena 基准及其公开排行榜可在 https://worldarena.ai 上获取,为追踪向真正功能性的具身人工智能世界模型进展提供框架。
Summary / 总结
WorldArena is a unified benchmark to evaluate the perceptual and functional aspects of embodied world models. It assesses models through video perception quality using 16 metrics and embodied task functionality, including data engine, policy evaluator, and action planner roles, with subjective human evaluation. The study reveals a significant gap between high visual quality and strong embodied task capability, indicating that perceptual fidelity does not always correlate with functional utility. The benchmark includes a holistic metric, EWMScore, and is publicly available at https://worldarena.ai.
WorldArena 是一个统一基准,用于评估具身世界模型的感知和功能方面。它通过16个指标评估视频感知质量,并通过数据引擎、策略评估器和动作规划器的角色评估具身任务功能,同时包含主观的人类评估。研究揭示了高视觉质量与强大具身任务能力之间的显著差距,表明感知保真度并不总是与功能性实用性相关。该基准包括一个综合指标 EWMScore,并在 https://worldarena.ai 公开发布。
stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
Authors: Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, Randall Balestriero
First: 2026-02-09T18:04:22+00:00 · Latest: 2026-02-09T18:04:22+00:00
Abstract
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.
中文标题/摘要
标题:stable-worldmodel-v1: 可再现的世界建模研究与评估
世界模型已成为一种强大的范式,用于学习环境动力学的紧凑、预测性表示,使智能体能够推理、规划并超越直接经验进行泛化。尽管最近对世界模型的兴趣增加,但大多数可用的实现仍然具有出版物特定性,严重限制了其可重用性,增加了错误的风险,并降低了评估标准化。为缓解这些问题,我们引入了stable-worldmodel (SWM),这是一个模块化、经过测试和文档化的世界模型研究生态系统,提供了高效的数据收集工具、标准化环境、规划算法和基线实现。此外,SWM 中的每个环境都支持鲁棒性和持续学习研究,允许控制变化因素,包括视觉和物理属性。最后,我们通过使用SWM研究DINO-WM的零样本鲁棒性来展示SWM的实用性。
Summary / 总结
The research aims to improve the reusability and standardization of world models by introducing stable-worldmodel (SWM), a modular and well-documented ecosystem. SWM includes efficient data collection tools, standardized environments, and baseline implementations, allowing for robustness and continual learning research. The key finding is the utility of SWM in studying zero-shot robustness in DINO-WM, demonstrating its effectiveness in evaluating world models under varying conditions.
研究旨在通过引入稳定的世界模型(SWM)生态系统来提高世界模型的可重用性和标准化。SWM 包含高效的数据收集工具、标准化环境和基线实现,支持鲁棒性和持续学习研究。关键发现是,SWM 在研究 DINO-WM 的零样本鲁棒性方面表现出效用,证明了其在不同条件下评估世界模型的有效性。
Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning
Authors: John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, George J. Pappas
First: 2026-02-09T18:01:40+00:00 · Latest: 2026-02-09T18:01:40+00:00
Abstract
The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
中文标题/摘要
标题:通过量子纠缠学习协调在多智能体强化学习中的应用
在多智能体强化学习(MARL)中,无法通信是协调的主要挑战。先前的工作探索了通过共享随机性关联局部策略的方法,有时以相关设备的形式,作为辅助去中心化决策机制。与此相反,本文引入了第一个利用共享量子纠缠作为协调资源的MARL代理训练框架,这使得可以使用比仅共享随机性更多的无通信关联策略。这受到量子物理学中已知结果的启发,即对于某些无通信的一轮合作博弈,共享量子纠缠能够实现优于仅使用共享随机性的策略。在这种情况下,我们说存在量子优势。我们的框架基于一种新颖的可微分策略参数化,能够优化量子测量,以及一种新颖的策略架构,将联合策略分解为量子协调器和分散的本地执行者。为了说明我们提出的方法的有效性,我们首先展示了如何仅从经验中学习在被视为黑盒预言机(oracle)的一轮博弈中实现量子优势的策略。然后,我们展示了如何利用我们的机制在作为去中心化部分可观测马尔可夫决策过程(Dec-POMDP)提出的示例多智能体顺序决策问题中学习具有量子优势的策略。
Summary / 总结
This paper addresses the challenge of coordination in multi-agent reinforcement learning (MARL) by introducing a new framework that leverages quantum entanglement to enable communication-free coordination among agents. Motivated by quantum physics results showing that shared quantum entanglement can outperform shared randomness in certain cooperative games, the authors propose a differentiable policy parameterization and a policy architecture that decomposes joint policies into a quantum coordinator and local actors. The framework is tested in both single-round games and a multi-agent sequential decision-making problem, demonstrating the ability to learn strategies that achieve quantum advantage.
本文通过引入一种利用共享量子纠缠的框架,解决了多智能体强化学习(MARL)中无通信条件下的协调问题。受量子物理的启发,作者提出了一种方法,该方法允许比仅使用共享随机性更广泛的无通信协同策略。该框架包括一个可微分的策略参数化和一个将量子协调器与本地执行者分离的策略架构。实验表明,所提出的方法可以在单轮游戏和一个作为去中心化部分可观测马尔可夫决策过程(Dec-POMDP)形式化的顺序决策问题中学习具有量子优势的策略。
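The abstract does not name the single-round games it uses, but the CHSH game is the textbook case of the "quantum advantage" it describes, so the sketch below uses it as an assumed example. Two players receive bits x and y, answer bits a and b without communicating, and win when a XOR b equals x AND y. The code enumerates every deterministic classical strategy and evaluates the standard entangled strategy, in which both players measure a shared Bell pair at fixed angles and obtain equal outcomes with probability cos²(θ_a − θ_b). The measurement angles are the usual textbook choice, not values from the paper.

```python
import math
from itertools import product


def classical_best():
    """Best CHSH win rate over all deterministic strategies.

    Shared randomness only mixes deterministic strategies, so it cannot
    exceed this maximum.
    """
    best = 0.0
    for a0, a1, b0, b1 in product((0, 1), repeat=4):
        alice, bob = (a0, a1), (b0, b1)
        wins = sum(
            (alice[x] ^ bob[y]) == (x & y) for x, y in product((0, 1), repeat=2)
        )
        best = max(best, wins / 4)
    return best


def quantum_value(alice_angles, bob_angles):
    """CHSH win rate when both players measure a shared Bell pair.

    For the state (|00> + |11>) / sqrt(2), measuring at angles ta and tb
    gives equal outcomes with probability cos^2(ta - tb).
    """
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        p_same = math.cos(alice_angles[x] - bob_angles[y]) ** 2
        # Outcomes must differ only when x = y = 1.
        total += (1.0 - p_same) if (x & y) else p_same
    return total / 4


classical = classical_best()
quantum = quantum_value((0.0, math.pi / 4), (math.pi / 8, -math.pi / 8))
```

The classical optimum is 0.75, while the entangled strategy wins with probability cos²(π/8), about 0.854. The framework in the abstract aims to learn such measurement choices from experience instead of fixing them by hand.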
Latent Domain Modeling Improves Robustness to Geographic Shifts
Authors: Ruth Crasto, Esther Rolf
First: 2025-03-03T20:24:07+00:00 · Latest: 2026-02-09T18:01:33+00:00
Abstract
Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at inference time. Using standard empirical risk minimization (ERM) in this setting can lead to uneven generalization across different spatially-determined groups of interest such as continents or biomes. The most common approaches to tackling geographic distribution shift apply domain adaptation methods using discrete group labels, ignoring geographic coordinates that are often available as metadata. On the other hand, modeling methods that integrate geographic coordinates have been shown to improve overall performance, but their impact on geographic domain generalization has not been studied. In this work, we propose a general modeling framework for improving robustness to geographic distribution shift. The key idea is to model continuous, latent domain assignment using location encoders and to condition the main task predictor on the jointly-trained latents. On four diverse geo-tagged image datasets with different group splits, we show that instances of our framework achieve significant improvements in worst-group performance compared to existing domain adaptation and location-aware modeling methods. In particular, we achieve new state-of-the-art results on two datasets from the WILDS benchmark.
中文标题/摘要
标题:潜在领域建模提高地理转移的鲁棒性
地理分布转移是指训练数据集中地球上的位置分布与推理时所见的不同。在这种情况下使用标准的经验风险最小化(ERM)可能导致不同地理区域(如大陆或生物群落)的群体在泛化能力上不均衡。最常用的地理分布转移应对方法是使用离散的组标签进行领域适应方法,而忽略了通常作为元数据可用的地理坐标。另一方面,整合地理坐标的建模方法已被证明可以提高整体性能,但它们对地理领域泛化的影响尚未研究。在本文中,我们提出了一种通用建模框架,以提高对地理分布转移的鲁棒性。关键思想是使用位置编码器建模连续的潜在领域分配,并在联合训练的潜在变量上条件化主要任务预测器。在四个具有不同组划分的多样化地理标记图像数据集上,我们展示了我们框架的实例在最差组性能上比现有领域适应和位置感知建模方法取得了显著改进。特别是,我们在WILDS基准上的两个数据集上达到了新的最佳结果。
Summary / 总结
This study addresses the issue of geographic distribution shift in machine learning models, where the training data's geographic distribution differs from the inference environment. The authors propose a latent domain modeling framework that uses location encoders to model continuous geographic domains and conditions the main task predictor on these latent variables. Experiments on four diverse geo-tagged image datasets demonstrate that this approach significantly improves the worst-group performance compared to existing methods, achieving new state-of-the-art results on two datasets from the WILDS benchmark.
该研究解决了机器学习模型中的地理分布偏移问题,即训练数据和推理数据来自不同的地理区域。作者提出了一种潜域建模框架,使用位置编码器来建模连续的地理域,并在这些潜变量上条件化主要任务预测器。在四个不同的地理标记图像数据集上的实验表明,这种方法显著提高了最差群体的表现,特别是在WILDS基准上的两个数据集上达到了新的最佳结果。
A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
Authors: Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli
First: 2026-02-09T18:00:28+00:00 · Latest: 2026-02-09T18:00:28+00:00
Abstract
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.
中文标题/摘要
标题:语言模型代理目标导向性的行为与表征评估
理解代理的目标有助于解释和预测其行为,但尚未建立可靠的方法将目标归因于代理系统。我们提出了一种框架,将行为评估与模型内部表示的可解释性分析相结合,以评估目标导向性。作为案例研究,我们分析了一个LLM代理在2D网格世界中向目标状态导航的行为。行为上,我们根据不同的网格大小、障碍密度和目标结构,将代理与最优策略进行评估,发现性能随任务难度增加而增加,同时对难度保持变换和复杂的目标结构保持稳健。然后,我们使用探针方法解码代理对环境状态和多步行动计划的内部表示。我们发现,LLM代理非线性地编码了环境的粗略空间图,保留了关于其位置和目标位置的大致任务相关线索;其行动大致与这些内部表示一致;并且推理重新组织了这些表示,从更广泛的环境结构线索转向支持即时行动选择的信息。我们的研究结果支持这样的观点:除了行为评估之外,还需要进行反思性检查来描述代理如何表示和追求其目标。
Summary / 总结
The study aims to develop a framework for evaluating goal-directedness in language model agents by combining behavioral assessments with interpretability analyses of internal representations. The research examines an LLM agent navigating a 2D grid world, finding that the agent's performance scales with task difficulty and remains robust to transformations and complex goal structures. Probing methods reveal that the agent encodes a coarse spatial map of the environment, with reasoning shifting from broader structural cues to immediate action support.
论文旨在通过结合行为评估和对代理内部表示的可解释性分析,开发一种评估代理目标导向性的框架。使用一个语言模型(LM)代理在2D网格世界中导航作为案例研究,研究发现代理的性能随着任务难度的增加而增加,但在任务变换和复杂目标结构下保持稳健。探针方法显示,LM代理编码了环境的粗略空间图,推理从更广泛的结构线索转向支持即时行动的信息。
Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting
Authors: Guangxun Zhu, Xuan Liu, Nicolas Pugeault, Chongfeng Wei, Edmond S. L. Ho
Venue: ICRA
First: 2026-02-09T17:58:53+00:00 · Latest: 2026-02-09T17:58:53+00:00
Comments: Accepted for IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract
Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D
中文标题/摘要
标题:基于车辆条件的3D行人姿态预测模型:3D行人-车辆交互建模
准确预测行人运动对于复杂城市环境中安全可靠的自动驾驶至关重要。本文提出了一种基于车辆条件的3D行人姿态预测框架,明确地将周围车辆信息纳入其中。为此,我们扩展了Waymo-3DSkelMo数据集,添加了对齐的3D车辆边界框,使多智能体行人-车辆交互的现实建模成为可能。我们引入了一种采样方案,根据行人和车辆数量对场景进行分类,便于在不同交互复杂性下进行训练。我们提出的网络采用TBIFormer架构,配备专用的车辆编码器和行人-车辆交互交叉注意力模块,以融合行人和车辆特征,使预测能够同时基于历史行人运动和周围车辆。大量实验表明,预测准确性显著提高,并验证了不同行人-车辆交互建模方法的有效性,突显了自动驾驶中车辆感知3D姿态预测的重要性。代码可在:https://github.com/GuangxunZhu/VehCondPose3D 获取。
Summary / 总结
This work aims to improve pedestrian motion prediction for autonomous driving in urban environments by incorporating vehicle information. The authors enhance the Waymo-3DSkelMo dataset with 3D vehicle bounding boxes and propose a 3D vehicle-conditioned pedestrian pose forecasting framework. This framework uses a TBIFormer architecture with a vehicle encoder and interaction cross-attention module to condition predictions on both pedestrian history and surrounding vehicles. Experiments show significant improvements in forecasting accuracy and validate the importance of vehicle-aware 3D pose prediction for autonomous driving systems.
该研究旨在通过纳入车辆信息来提高城市环境中自主驾驶的行人运动预测。作者通过在Waymo-3DSkelMo数据集中添加3D车辆边界框来增强数据集,并提出了一种3D车辆条件下的行人姿态预测框架。该框架采用TBIFormer架构,包含车辆编码器和交互交叉注意力模块,以条件预测结合行人的历史运动和周围车辆。实验表明,在预测准确性方面取得了显著改进,并验证了车辆感知的3D姿态预测对于自主驾驶系统的重要性。
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Authors: Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng
First: 2026-02-09T17:58:12+00:00 · Latest: 2026-02-09T17:58:12+00:00
Comments: Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
Abstract
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
中文标题/摘要
标题:MotionCrafter:基于视频扩散的4D几何和密集运动重建
我们引入了MotionCrafter,一种基于视频扩散的框架,可以从单目视频中联合重建4D几何并估计密集运动。我们方法的核心是一种新颖的联合表示方法,将密集3D点图和3D场景流在共享坐标系中表示,并提出了一种新颖的4D VAE来有效学习这种表示。与之前的工作不同,这些工作强制3D值和潜在变量严格与RGB VAE潜在变量对齐——尽管它们的分布本质上是不同的,我们证明这种对齐是不必要的,并导致了次优性能。相反,我们提出了一种新的数据归一化和VAE训练策略,更好地转移了扩散先验,极大地提高了重建质量。在多个数据集上的大量实验表明,MotionCrafter在几何重建和密集场景流估计方面均达到了最先进的性能,分别在几何和运动重建上提高了38.64%和25.0%,且无需任何后优化。项目页面:https://ruijiezhu94.github.io/MotionCrafter_Page
Summary / 总结
MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. It uses a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system and a 4D VAE for learning. This approach improves reconstruction quality by avoiding strict alignment with RGB VAE latents and instead uses a new data normalization and VAE training strategy. Experiments show that MotionCrafter outperforms previous methods, achieving 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, without post-optimization.
MotionCrafter 是一种基于视频扩散的框架,能够从单目视频中联合重建 4D 几何和估计密集运动。它引入了一种新的联合表示方法,将密集的 3D 点图和 3D 场景流在共享坐标系中表示,并使用 4D VAE 进行有效学习。与以往方法不同,MotionCrafter 不强制 3D 值和 3D 潜变量与 RGB VAE 潜变量对齐,这导致了更好的性能。实验表明,MotionCrafter 在几何重建和运动重建上的表现分别优于现有方法 38.64% 和 25.0%,且无需后优化。
Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields
Authors: Weihan Luo, Lily Goli, Sherwin Bahmani, Felix Taubner, Andrea Tagliasacchi, David B. Lindell
First: 2026-02-09T17:55:01+00:00 · Latest: 2026-02-09T17:55:01+00:00
Comments: Project page: https://weihanluo.ca/growflow/
Abstract
Modeling the time-varying 3D appearance of plants during their growth poses unique challenges: unlike many dynamic scenes, plants generate new geometry over time as they expand, branch, and differentiate. Recent motion modeling techniques are ill-suited to this problem setting. For example, deformation fields cannot introduce new geometry, and 4D Gaussian splatting constrains motion to a linear trajectory in space and time and cannot track the same set of Gaussians over time. Here, we introduce a 3D Gaussian flow field representation that models plant growth as a time-varying derivative over Gaussian parameters -- position, scale, orientation, color, and opacity -- enabling nonlinear and continuous-time growth dynamics. To initialize a sufficient set of Gaussian primitives, we reconstruct the mature plant and learn a process of reverse growth, effectively simulating the plant's developmental history in reverse. Our approach achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth, providing a new approach for appearance modeling of growing 3D structures.
中文标题/摘要
标题:随流成长:使用高斯流场的植物生长4D重建
在植物生长过程中建模其随时间变化的3D外观提出了独特挑战:与许多动态场景不同,植物会随着时间的推移生成新的几何形状,扩展、分枝和分化。最近的运动建模技术不适合这种问题设置。例如,变形场无法引入新的几何形状,而4D高斯散点图将运动限制为空间和时间的线性轨迹,并且无法在时间上跟踪相同的高斯集合。在这里,我们引入了一种3D高斯流场表示法,将植物生长建模为高斯参数——位置、尺度、方向、颜色和透明度——的时间变化导数,从而能够模拟非线性和连续时间的生长动力学。为了初始化足够的高斯原始模型,我们重建了成熟的植物并学习了一个逆向生长过程,有效地模拟了植物的发育历史。我们的方法在多视角延时摄影数据集的植物生长上实现了优于先前方法的图像质量和几何精度,提供了一种新的外观建模方法,用于生长的3D结构。
Summary / 总结
This research addresses the challenge of modeling the time-varying 3D appearance of growing plants, introducing a 3D Gaussian flow field representation that allows for nonlinear and continuous-time growth dynamics. By reconstructing the mature plant and simulating its reverse growth, the method initializes a set of Gaussian primitives. The approach outperforms previous methods in image quality and geometric accuracy on multi-view timelapse datasets of plant growth, offering a new method for modeling the appearance of growing 3D structures.
该研究通过引入3D高斯流场表示来解决生长植物的时间变化3D外观建模问题,这种方法可以引入新的几何形状并模拟非线性生长动态。该方法通过重建成熟植物并模拟其逆向生长来初始化高斯基元。实验结果显示,所提出的方法在多视角时间快照数据集中的植物生长建模方面在图像质量和几何精度上优于现有技术。
Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency
Authors: Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei
First: 2024-03-18T15:02:46+00:00 · Latest: 2026-02-09T17:51:38+00:00
Abstract
Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.
中文标题/摘要
标题:考虑随机通信延迟的高速公路入匝道并线延迟感知强化学习
延迟且部分可观测的状态信息对基于强化学习(RL)的现实世界自动驾驶控制构成了重大挑战。在高速公路入匝道并线中,路边单元(RSU)可以感知附近交通、执行边缘感知,并通过车辆到基础设施(V2I)链路将状态估计传输给自主车辆。随着智能交通基础设施和边缘计算的最新进展,这种RSU辅助感知越来越现实并已在现代互联道路系统中部署。然而,边缘处理时间和无线传输可能会引入随机的V2I通信延迟,违反马尔可夫假设并显著降低控制性能。在本文中,我们提出了一种名为DAROM的延迟感知强化学习框架,以应对随机延迟。我们将问题建模为随机延迟马尔可夫决策过程(RDMDP),并开发了一个联合纵向和横向控制的统一RL代理。为了在延迟观测下恢复马尔可夫表示,我们引入了一个延迟感知编码器,该编码器基于延迟观测、掩蔽动作历史和观测到的延迟幅度来推断当前的潜在状态。我们进一步整合了一个基于物理的安全控制器,以降低并线过程中的碰撞风险。使用来自Next Generation Simulation(NGSIM)数据集的真实世界交通数据在Simulation of Urban MObility(SUMO)仿真器中进行的实验表明,DAROM在各种交通密度下始终优于标准RL基线。特别是,基于门控循环单元(GRU)的编码器在随机V2I延迟高达2.0秒的高密度交通中实现了超过99%的成功率。
Summary / 总结
This work addresses the challenges of reinforcement learning in autonomous driving, particularly in highway on-ramp merging where stochastic communication delays affect state information. The proposed DAROM framework models the problem as a random delay Markov decision process and uses a Delay-Aware Encoder to handle delayed observations. Experiments show that DAROM outperforms standard RL baselines, with a GRU-based encoder achieving over 99% success in high-density traffic with up to 2.0 seconds of random V2I delays.
研究针对自动驾驶中高速公路匝道汇入时因无线通信延迟带来的挑战,提出了一种名为DAROM的延迟感知强化学习框架,将问题建模为随机延迟马尔可夫决策过程,并引入延迟感知编码器处理延迟观测。实验结果显示,DAROM在各种交通密度下均优于标准的RL基线,特别是在高密度交通中,即使存在高达2.0秒的随机V2I延迟时,成功率也超过99%。
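The abstract says the Delay-Aware Encoder conditions on delayed observations, masked action histories, and the observed delay magnitude. One plausible way to assemble such an input is sketched below. The function name, the fixed-length history buffer, the zero-masking rule, and the scalar actions are all assumptions made for illustration; the abstract does not specify these details.

```python
def build_encoder_input(delayed_obs, action_history, delay_steps, max_delay):
    """Assemble a flat encoder input under assumed conventions.

    action_history holds the last `max_delay` actions, oldest first.
    Only the most recent `delay_steps` actions were applied after the
    delayed observation was captured, so older entries are masked to 0.
    """
    if len(action_history) != max_delay:
        raise ValueError("action_history must contain exactly max_delay entries")
    if not 0 <= delay_steps <= max_delay:
        raise ValueError("delay_steps must lie in [0, max_delay]")

    first_kept = max_delay - delay_steps
    masked_actions = [
        action if index >= first_kept else 0.0
        for index, action in enumerate(action_history)
    ]
    # Normalized delay magnitude is appended as the final feature.
    return list(delayed_obs) + masked_actions + [delay_steps / max_delay]


encoder_input = build_encoder_input(
    delayed_obs=[1.0, 2.0],
    action_history=[0.1, 0.2, 0.3, 0.4],
    delay_steps=2,
    max_delay=4,
)
```

In this toy call the two oldest actions are masked, giving `[1.0, 2.0, 0.0, 0.0, 0.3, 0.4, 0.5]`. A recurrent encoder such as the GRU mentioned in the abstract would then map inputs of this kind to an estimate of the current latent state.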
Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
Authors: Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato
Venue: ICLR 2026
First: 2026-02-04T22:17:23+00:00 · Latest: 2026-02-09T17:46:50+00:00
Comments: 10 pages main body, ICLR 2026
Abstract
Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited in their strong threat models, assuming the adversary has near full control over an agent's training pipeline, enabling them to both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which is able to reliably and stealthily implant backdoors into RL agents trained for real world tasks without altering or even observing their rewards. We provide formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real, robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.
中文标题/摘要
标题:谨防不可信模拟器——强化学习中的无奖励后门攻击
模拟环境是强化学习(RL)成功的关键组成部分,允许从业者和研究人员在无需在昂贵的硬件上运行实验的情况下训练决策代理。然而,模拟器仍然是一个安全盲点,允许恶意开发者通过修改其发布的模拟器的动力学来实现恶意目的。因此,在这项工作中,我们强调了一种新型威胁,展示了如何利用模拟器的动力学秘密植入行动级后门到RL代理中。后门允许攻击者在观察到预定义的“触发器”时可靠地激活目标行动,可能导致潜在的危险后果。传统的后门攻击在强大的威胁模型下受到限制,假设攻击者几乎完全控制代理的训练管道,能够同时修改和观察代理的奖励。由于这些假设在模拟器中难以实现,我们提出了一种新的攻击“Daze”,能够在不修改或甚至观察代理奖励的情况下可靠且隐蔽地植入后门到为真实世界任务训练的RL代理中。我们提供了Daze在一般RL任务中保证攻击成功效果的正式证明,并在离散和连续动作空间领域进行了广泛的实证评估。此外,我们提供了第一个RL后门攻击转移到真实机器人硬件的示例。这些进展促使进一步研究如何确保RL训练管道的所有组件以防止恶意攻击。
Summary / 总结
This paper addresses the security threat in simulated environments used for Reinforcement Learning (RL), where adversaries can alter simulator dynamics to implant backdoors into RL agents. The proposed attack, named Daze, allows an adversary to trigger specific actions in an RL agent without altering or observing the agent's rewards, making it suitable for real-world tasks. The study provides formal proof of Daze's effectiveness and extensive empirical evaluations, demonstrating that the backdoor can be reliably activated in both discrete and continuous action spaces. Additionally, the attack has been shown to transfer to real robotic hardware, highlighting the need for securing the entire RL training pipeline.
该论文探讨了用于强化学习(RL)的模拟环境中的安全威胁,攻击者可以通过修改模拟器动态来植入后门到RL代理中。提出的攻击方法Daze允许攻击者在不修改或观察代理奖励的情况下触发特定动作,使其适用于实际任务。研究提供了Daze有效性的形式证明,并进行了广泛的实证评估,证明了后门可以在离散和连续动作空间中可靠地激活。此外,该攻击已被证明可以转移到真实的机器人硬件上,强调了需要确保整个RL训练管道的安全性。
Digital Twin and Agentic AI for Wild Fire Disaster Management: Intelligent Virtual Situation Room
Authors: Mohammad Morsali, Siavash H. Khajavi
First: 2026-02-09T17:44:52+00:00 · Latest: 2026-02-09T17:44:52+00:00
Abstract
According to the United Nations, wildfire frequency and intensity are projected to increase by approximately 14% by 2030 and 30% by 2050 due to global warming, posing critical threats to life, infrastructure, and ecosystems. Conventional disaster management frameworks rely on static simulations and passive data acquisition, hindering their ability to adapt to arbitrarily evolving wildfire episodes in real time. To address these limitations, we introduce the Intelligent Virtual Situation Room (IVSR), a bidirectional Digital Twin (DT) platform augmented by autonomous AI agents. The IVSR continuously ingests multisource sensor imagery, weather data, and 3D forest models to create a live virtual replica of the fire environment. A similarity engine powered by AI aligns emerging conditions with a precomputed Disaster Simulation Library, retrieving and calibrating intervention tactics under the watchful eyes of experts. Authorized action, ranging from UAV redeployment to crew reallocation, is cycled back through standardized procedures to the physical layer, completing the loop between response and analysis. We validate IVSR through detailed case-study simulations provided by an industrial partner, demonstrating capabilities in localized incident detection, privacy-preserving playback, collider-based fire-spread projection, and site-specific ML retraining. Our results indicate marked reductions in detection-to-intervention latency and more effective resource coordination versus traditional systems. By uniting real-time bidirectional DTs with agentic AI, IVSR offers a scalable, semi-automated decision-support paradigm for proactive, adaptive wildfire disaster management.
中文标题/摘要
标题:数字孪生与自主AI在野火灾害管理中的应用:智能虚拟情况室
根据联合国的预测,由于全球变暖,到2030年野火的频率和强度预计将增加约14%,到2050年将增加约30%,这对生命、基础设施和生态系统构成了重大威胁。传统的灾害管理框架依赖于静态模拟和被动数据采集,难以实时适应任意演变的野火事件。为解决这些局限性,我们引入了智能虚拟情况室(IVSR),这是一种双向数字孪生(DT)平台,由自主AI代理增强。IVSR持续摄入多源传感器图像、气象数据和3D森林模型,以创建火灾环境的实时虚拟复制品。基于AI的相似性引擎将新出现的条件与预先计算的灾害模拟库进行匹配,并在专家监督下检索和校准干预策略。授权的操作,从无人机重新部署到人员重新分配,通过标准化程序反馈到物理层,完成响应与分析之间的闭环。我们通过工业合作伙伴提供的详细案例研究模拟验证了IVSR,展示了其在局部事件检测、隐私保护回放、基于碰撞体的火势蔓延预测和特定场地的ML重新训练方面的能力。我们的结果表明,与传统系统相比,IVSR显著缩短了从检测到干预的延迟,并实现了更有效的资源协调。通过将实时双向DT与自主AI相结合,IVSR为主动、自适应的野火灾害管理提供了一种可扩展的半自动化决策支持范式。
Summary / 总结
The research addresses the increasing threat of wildfires due to global warming by proposing the Intelligent Virtual Situation Room (IVSR), a Digital Twin platform enhanced with autonomous AI agents. IVSR integrates real-time sensor data, weather information, and 3D forest models to create a live virtual replica of the fire environment. It uses an AI-powered similarity engine to align current conditions with precomputed disaster simulations, enabling expert-calibrated interventions. In case-study simulations provided by an industrial partner, the system shows marked reductions in detection-to-intervention latency and more effective resource coordination than conventional systems.
论文介绍了智能虚拟情况室(IVSR),这是一种增强有自主AI代理的数字孪生平台,旨在应对因全球变暖而加剧的野火威胁。IVSR利用来自各种来源的实时数据来模拟火场环境,并从预先计算的灾难模拟库中检索干预策略,然后在物理世界中应用。研究通过工业合作伙伴的模拟,展示了与传统系统相比,IVSR在检测到干预之间的延迟显著减少以及资源协调能力的提高。
CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute
Authors: Chen Jin, Ryutaro Tanno, Tom Diethe, Philip Teare
First: 2026-02-09T17:44:41+00:00 · Latest: 2026-02-09T17:44:41+00:00
Abstract
Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
中文标题/摘要
标题:CoRefine:基于置信度的自校正方法以实现适应性测试时计算
大型语言模型(LLMs)通常依赖于并行解码(例如,512个样本)的测试时缩放来提升推理准确性,但这种方法会带来大量的计算成本。我们提出了CoRefine,一种基于置信度的自校正方法,通过一个轻量级的211k参数Conv1D控制器在冻结的LLM之上,仅使用少量的token即可实现竞争力的准确性。控制器消耗完整的推理轨迹置信度来决定是否停止、重新检查或尝试不同的方法,从而实现有针对性的自我纠正,平均每道题仅需要2.7次校正步骤,相对于512样本基线,token减少约190倍。在多种推理基准测试和三个开源模型上,当控制器自信地停止时,其准确率达到了92.6%,表明置信度动态可靠地指示正确性,无需真实标签验证。我们将其扩展为CoRefine-Tree,一种混合顺序-并行变体,能够自适应地平衡探索和利用,具有易于部署集成和验证器兼容性。通过将置信度视为控制信号而非正确性保证,CoRefine为可扩展推理和代理设置提供了模块化的基本方法,即使在不完美的验证器情况下也是如此。
Summary / 总结
CoRefine is a confidence-guided self-refinement method that reduces compute by using a lightweight 211k-parameter Conv1D controller to decide whether to halt, re-examine, or try a different approach. This method achieves competitive accuracy with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction compared to 512-sample baselines. Across various reasoning benchmarks and three open-source models, the controller demonstrates 92.6 percent precision when it confidently halts, indicating reliable confidence dynamics without ground-truth verification.
CoRefine 是一种基于置信度的自校正方法,相比 512 样本基线可将 token 用量减少约 190 倍,同时保持有竞争力的准确性。它使用一个 211k 参数的 Conv1D 控制器,根据完整推理轨迹的置信度决定是否停止、重新检查或尝试其他方法,平均每题需要 2.7 次校正步骤。在多种推理基准和三个开源模型上,控制器在自信停止时的精确率达到 92.6%,表明置信度动态无需真实标签验证即可可靠地指示正确性。CoRefine-Tree 是一种混合顺序-并行变体,能够自适应地平衡探索和利用,并易于集成到部署服务中。
From Features to Actions: Explainability in Traditional and Agentic AI Systems
Authors: Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza
First: 2026-02-06T16:34:29+00:00 · Latest: 2026-02-09T17:37:05+00:00
Abstract
Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While such explanations are useful, it remains unclear how approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman $\rho = 0.86$), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is $2.7\times$ more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour.
Resources:
https://github.com/VectorInstitute/unified-xai-evaluation-framework https://vectorinstitute.github.io/unified-xai-evaluation-framework
中文标题/摘要
标题:从特征到行动:传统与主动AI系统的可解释性
在过去十年中,可解释AI主要集中在解释单个模型预测,产生事后解释,将输入与固定决策结构下的输出联系起来。近年来,大型语言模型(LLMs)的进步使主动AI系统能够在其多步骤轨迹中展开行为。在这种情况下,成功与失败由一系列决策决定,而不是单一输出。虽然这些方法很有用,但尚不清楚为静态预测设计的解释方法如何适用于行为随时间演变的主动设置。在本文中,我们通过比较基于归因的解释与基于轨迹的诊断在两种设置中的应用,弥合了静态和主动解释之间的差距。为了使这种区别更加明确,我们实证比较了在静态分类任务中使用的基于归因的解释与在主动基准(TAU-bench航空和AssistantBench)中使用的基于轨迹的诊断。结果显示,虽然归因方法在静态设置中实现了稳定的特征排名(Spearman ρ= 0.86),但它们无法可靠地诊断主动轨迹中的执行级故障。相比之下,在主动设置中基于轨迹的评分标准评估始终能够定位行为故障,并揭示失败运行中状态跟踪不一致的情况是正常运行的2.7倍,并且降低了成功概率49%。这些发现促使我们在评估和诊断自主AI行为时转向轨迹级解释性,特别是对于主动系统。
Summary / 总结
This study addresses the challenge of explainability in agentic AI systems, which operate over multi-step trajectories, contrasting it with traditional static models. The research compares attribution-based explanations with trace-based diagnostics, showing that attribution methods are stable in static settings but fail to diagnose execution-level failures in agentic systems. Trace-based diagnostics, however, effectively localize behavior breakdowns, indicating that state tracking inconsistency is more prevalent in failed runs and significantly reduces success probability. This suggests a need for trajectory-level explainability in agentic AI systems.
本研究通过比较归因解释与轨迹诊断,探讨了在多步骤轨迹上运行的主动AI系统的可解释性问题。研究发现,虽然归因方法在静态设置中提供了稳定的特征排名,但在主动系统中诊断执行级故障时却不可靠。相比之下,基于轨迹的诊断能够有效定位行为故障,并揭示状态跟踪不一致在失败运行中更为普遍,且会降低成功概率。这些发现表明,为了更好地评估和诊断自主AI行为,需要转向轨迹级别的可解释性。
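The "stable feature rankings" result in the abstract above is reported as a Spearman rank correlation. As a minimal sketch of what that statistic measures, the snippet below computes Spearman's ρ between two feature rankings using the standard no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)). The function name and the feature names are hypothetical illustrations, not taken from the paper or its repository.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items.

    Both arguments list the same distinct items, most important first.
    Uses the closed-form formula that is valid when there are no ties.
    """
    if len(ranking_a) != len(ranking_b) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    if len(set(ranking_a)) != len(ranking_a):
        raise ValueError("rankings must not contain duplicates")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Position of each item in the second ranking.
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences across all items.
    d_squared = sum(
        (rank_a - position_b[item]) ** 2
        for rank_a, item in enumerate(ranking_a)
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical attribution rankings from two runs of the same explainer.
run_1 = ["age", "income", "tenure", "region", "device"]
run_2 = ["age", "tenure", "income", "region", "device"]
print(round(spearman_rho(run_1, run_2), 3))
```

Identical rankings give ρ = 1 and fully reversed rankings give ρ = −1, so values near 1 indicate that the explainer orders features consistently across runs.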
CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse
Authors: Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang
First: 2026-02-09T17:36:56+00:00 · Latest: 2026-02-09T17:36:56+00:00
Comments: 17 pages, 20 tables, figures
Abstract
LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
中文标题/摘要
标题:CausalT5K:面向可信因果推理的拒绝诊断与指导,涵盖怀疑、阿谀、检测-纠正和梯级崩溃
大型语言模型在因果推理中的失败,包括阿谀、梯级崩溃和拒绝校准失当,已有充分记录,但改进进展缓慢,因为没有基准可以系统诊断。我们引入了CausalT5K,这是一个包含超过5000个案例、涵盖10个领域的诊断基准,测试了三种关键能力:(1)检测梯级崩溃,即模型用关联证据回答干预查询;(2)在对抗压力下抵制阿谀漂移;(3)生成明智的拒绝,当证据不足时指明缺失信息。与合成基准不同,CausalT5K将因果陷阱嵌入现实叙述中,并将性能分解为效用(灵敏度)和安全性(特异性),揭示了聚合准确率无法察觉的失败模式。CausalT5K通过严格的人机协作管道开发,涉及40位领域专家、迭代交叉验证循环,以及基于规则、大型语言模型和人工评分的综合验证,将Pearl的因果阶梯实现为研究基础设施。初步实验揭示了一个四象限控制景观,其中静态审计策略普遍失效,这一发现证明了CausalT5K对于推进可信推理系统的价值。代码库:https://github.com/genglongling/CausalT5kBench
Summary / 总结
The research aims to address the limitations in causal reasoning of large language models (LLMs), particularly sycophancy, rung collapse, and miscalibrated refusal. CausalT5K, a benchmark of over 5,000 cases across 10 domains, evaluates models on detecting rung collapse, resisting sycophantic drift, and generating Wise Refusals. The study reveals a Four-Quadrant Control Landscape where static audit policies fail, highlighting CausalT5K's utility for advancing trustworthy reasoning systems.
研究旨在解决大型语言模型(LLMs)在因果推理方面的局限性,特别是趋炎附势、阶梯坍塌和拒绝校准不当。CausalT5K 是一个包含超过 5,000 个案例的诊断基准,覆盖 10 个领域,评估模型在检测阶梯坍塌、抵抗趋炎附势以及生成明智拒绝方面的表现。该基准将性能分解为效用和安全性,揭示了通过总体准确率无法看到的失败模式。关键发现包括一个四象限控制景观,其中静态审计策略普遍失效,突显了CausalT5K 对推进可信推理系统的重要性。
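The Utility (sensitivity) and Safety (specificity) decomposition mentioned in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that the positive class is "the case is answerable" and that each case records whether the model answered or refused; the benchmark's actual labeling scheme and scoring code may differ, and the function name is hypothetical.

```python
def utility_and_safety(cases):
    """Split accuracy into sensitivity and specificity.

    cases: iterable of (answerable, model_answered) boolean pairs.
    Assumption: answering an answerable case is a true positive, and
    refusing an underdetermined case is a true negative.
    """
    tp = fn = tn = fp = 0
    for answerable, model_answered in cases:
        if answerable and model_answered:
            tp += 1
        elif answerable:
            fn += 1  # over-refusal: refused although evidence sufficed
        elif model_answered:
            fp += 1  # answered although evidence was underdetermined
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case of each kind")
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "utility": tp / (tp + fn),  # sensitivity
        "safety": tn / (tn + fp),   # specificity
    }


# A model that never refuses: 8 answerable cases, 2 underdetermined ones.
always_answers = [(True, True)] * 8 + [(False, True)] * 2
print(utility_and_safety(always_answers))
```

In this toy input the model reaches 80% aggregate accuracy with perfect utility and zero safety, which is the kind of failure mode that a single accuracy number hides.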
StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
Authors: Suraj Ranganath, Atharv Ramesh
First: 2026-02-09T17:33:46+00:00 · Latest: 2026-02-09T17:33:46+00:00
Comments: Expanded version of a workshop submission. Code available
Abstract
AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
中文标题/摘要
标题:StealthRL:针对多检测器的AI文本检测器逃逸的强化学习改写攻击
AI文本检测器面临一个关键的鲁棒性挑战:能够保持语义同时逃避检测的对抗性改写攻击。我们引入了StealthRL,这是一种强化学习框架,用于在现实的对抗性条件下测试检测器的鲁棒性。StealthRL使用Group Relative Policy Optimization (GRPO)与LoRA适配器在Qwen3-4B上训练一个改写策略,优化一个综合奖励,该奖励平衡了检测器逃逸与语义保持之间的关系。我们针对三个检测器家族(RoBERTa、FastDetectGPT和Binoculars)在安全相关1%假阳性率操作点评估了六个攻击设置(M0-M5)。StealthRL实现了接近零的检测率(1% FPR时0.001的平均TPR),将平均AUROC从0.74降低到0.27,并实现了99.9%的攻击成功率。关键的是,这些攻击在训练期间未见过的检测器家族中也能转移,揭示了共享的架构性漏洞而非特定检测器的脆弱性。我们还通过Likert评分进行基于LLM的质量评估,分析检测器得分分布以解释为何逃逸成功,并提供了每个检测器的AUROC及其置信区间。我们的结果揭示了当前AI文本检测器中的显著鲁棒性差距,并确立了StealthRL作为原理性的对抗性评估协议。代码和评估管道可在https://github.com/suraj-ranganath/StealthRL上公开获取。
Summary / 总结
StealthRL is a reinforcement learning framework designed to test the robustness of AI-text detectors against adversarial paraphrasing attacks. It trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a reward that balances evasion and semantic preservation. Evaluated across six attack settings and three detector families at a 1% false positive rate, it achieves near-zero detection (0.001 mean TPR) and a 99.9% attack success rate. The attacks also transfer to a detector family held out during training, which points to shared architectural vulnerabilities rather than detector-specific brittleness and exposes robustness gaps in current AI-text detection.
StealthRL 是一个基于强化学习的框架,旨在测试 AI 文本检测器在对抗性改写攻击下的鲁棒性。它训练一个改写策略以在不改变语义的情况下躲避多个检测器。实验结果显示,在六个攻击设置和三个检测器家族中,检测率接近于零,攻击成功率高达 99.9%。这些攻击还能转移到未见过的检测器家族,揭示了共享的架构性漏洞而非特定检测器的脆弱性。研究还提供了基于 LLM 的质量评估和检测器分数分布分析,确立了 StealthRL 作为 AI 文本检测系统的原则性对抗评估协议的地位。
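The "TPR@1%FPR" operating point used in the abstract above can be made concrete with a short sketch: pick the detector threshold that flags at most 1% of human-written texts, then measure what fraction of AI-written texts that threshold catches. This is a generic illustration under stated assumptions (higher score means "more likely AI", a strict `>` decision rule, and a hypothetical function name); it is not the paper's evaluation pipeline.

```python
import math


def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Return (threshold, tpr) for a detector at a fixed false positive rate.

    Scores are detector outputs where higher means "more likely AI-written".
    A text is flagged when its score is strictly greater than the threshold.
    """
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag (small epsilon guards
    # against floating point round-down, e.g. 0.01 * 100).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    # Using the (allowed + 1)-th highest human score as the threshold keeps
    # the false positive rate at or below the target, even with ties.
    threshold = ranked[allowed]
    tpr = sum(score > threshold for score in ai_scores) / len(ai_scores)
    return threshold, tpr


# Toy data: 100 human scores spread over [0, 0.99], four AI-text scores.
human = [i / 100 for i in range(100)]
ai = [0.5, 0.985, 0.99, 1.0]
print(tpr_at_fpr(human, ai))
```

A successful evasion attack pushes the AI-text scores below this threshold, so the TPR at the fixed FPR falls toward zero even if the detector's threshold is left untouched.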
RiskAgent: Synergizing Language Models with Validated Tools for Evidence-Based Risk Prediction
Authors: Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Jiayuan Zhu, Jiazhen Pan, Junde Wu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, David A. Clifton
First: 2025-03-05T18:46:51+00:00 · Latest: 2026-02-09T17:28:20+00:00
Comments: Code and Data are available at https://github.com/AI-in-Health/RiskAgent
Abstract
Large Language Models (LLMs) achieve competitive results compared to human experts in medical examinations. However, it remains a challenge to apply LLMs to complex clinical decision-making, which requires a deep understanding of medical knowledge and differs from the standardized, exam-style scenarios commonly used in current efforts. A common approach is to fine-tune LLMs for target tasks, but this not only requires substantial data and computational resources but also remains prone to generating "hallucinations". In this work, we present RiskAgent, which synergizes language models with hundreds of validated clinical decision tools supported by evidence-based medicine, to provide generalizable and faithful recommendations. Our experiments show that RiskAgent not only achieves superior performance on a broad range of clinical risk predictions across diverse scenarios and diseases, but also demonstrates robust generalization in tool learning on the external MedCalc-Bench dataset, as well as in medical reasoning and question answering on three representative benchmarks, MedQA, MedMCQA, and MMLU.
中文标题/摘要
标题:RiskAgent:将语言模型与验证工具结合以实现基于证据的风险预测
大型语言模型(LLMs)在医学考试中取得了可与人类专家相媲美的结果。然而,将LLMs应用于复杂的临床决策仍然是一项挑战,这需要对医学知识有深刻的理解,并且不同于当前研究中常用的标准化、考试风格的场景。一种常见方法是针对目标任务微调LLMs,然而,这种方法不仅需要大量的数据和计算资源,而且仍然容易生成“幻觉”。在本工作中,我们提出了RiskAgent,它将语言模型与由循证医学支持的数百种经过验证的临床决策工具相结合,以提供可泛化且忠实的建议。我们的实验表明,RiskAgent不仅在多种场景和疾病的临床风险预测中取得了优越的性能,而且在外部MedCalc-Bench数据集上的工具学习中表现出强大的泛化能力,在三个代表性基准测试MedQA、MedMCQA和MMLU上的医学推理和问答中也是如此。
Summary / 总结
RiskAgent integrates language models with validated clinical decision tools to enhance evidence-based risk prediction. It addresses the challenge of applying large language models to complex clinical decision-making by leveraging a wide range of clinical tools. The model shows superior performance across various clinical risk predictions and demonstrates robust generalization on external benchmarks and medical reasoning tasks.
RiskAgent 结合语言模型和大量验证过的临床决策工具,以提高复杂医疗场景中的风险预测。它在多种临床风险预测中表现出色,并在外部基准测试和医学推理任务上展示了稳健的泛化能力。该方法借助循证医学支持的工具,旨在提供更可泛化、更忠实的建议。
Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room
Authors: Keqi Chen, Vinkle Srivastav, Armine Vardazaryan, Cindy Rolland, Didier Mutter, Nicolas Padoy
First: 2026-02-02T21:54:57+00:00 · Latest: 2026-02-09T17:27:39+00:00
Abstract
Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations of each new clinical site for high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by "retrieving" false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low-score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach, which achieves over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method's practical applicability. Code will be available at https://github.com/CAMMA-public/OR_anonymization.
中文标题/摘要
标题:手术室中的自监督免标定多视角视频匿名化
隐私保护是手术室(OR)研究中使用视频数据的前提。有效的匿名化依赖于对每个人的彻底定位;即使错过一次检测也需要进行大量的手动修正。然而,现有方法面临两个关键的可扩展性瓶颈:(1)它们通常需要为每个新的临床站点手动标注以获得高精度;(2)尽管多摄像头设置被广泛采用以解决单视角的模糊性,但每当重新定位摄像头时通常都需要进行摄像头标定。为了解决这些问题,我们提出了一种新颖的自监督多视角视频匿名化框架,该框架包括全身人体检测和全身姿态估计,无需标注或摄像头标定。我们的核心策略是利用时间和多视角上下文“检索”假阴性以增强单视角检测器,并进行自监督的领域适应。我们首先在每个视角中使用低分阈值运行现成的全身人体检测器以收集候选检测。然后,我们通过跟踪和自监督的免标定多视角关联检索出与高分检测具有一致性的低分假阴性。这些恢复的检测作为伪标签用于迭代微调全身检测器。最后,我们对每个检测到的人进行全身姿态估计,并使用其自身的高分预测微调姿态模型。在4D-OR模拟手术数据集和我们的真实手术数据集上的实验表明了我们方法的有效性,召回率超过97%。此外,我们使用伪标签训练了一个实时全身检测器,取得了相当的性能,突显了我们方法的实际应用性。代码将在https://github.com/CAMMA-public/OR_anonymization上提供。
Summary / 总结
This paper addresses the challenge of privacy-preserving video data usage in operating rooms by proposing a self-supervised multi-view video anonymization framework. The method enhances a single-view detector using temporal and multi-view context to recover false negatives and conducts self-supervised domain adaptation. Experiments on simulated and real surgeries demonstrate over 97% recall, and a real-time whole-body detector trained with pseudo labels shows comparable performance, highlighting practical applicability.
研究旨在解决在手术室中匿名化视频数据以保护隐私的问题。提出了一种无需手动标注或摄像机标定的自监督多视图视频匿名化框架。该方法利用时间和多视图上下文检索假阴性以增强单视图检测器,并迭代地微调检测器。该方法在模拟手术的 4D-OR 数据集和真实手术数据集上实现了超过 97% 的召回率,并通过使用伪标签训练实时全身检测器展示了其实用性。