arXiv 论文速递

Snapshot: 20260227_0352

Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences

Authors: Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein

Venue: CVPR 2026

First: 2026-02-25T18:59:53+00:00 · Latest: 2026-02-25T18:59:53+00:00

Comments: CVPR 2026, Code: https://github.com/vc-bonn/neu-pig

Abstract

Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.

中文标题/摘要

标题：Neu-PiG：基于神经预条件化网格的快速动态表面重建方法

从无序点云数据中对动态3D物体进行时序一致的表面重建仍然具有挑战性，尤其是在非常长的序列中。现有方法要么以增量方式优化变形，存在漂移风险并需要较长的运行时间，要么依赖于复杂的基于学习的模型，需要特定类别的训练。我们提出了一种基于新颖预条件化隐空间网格编码的快速变形优化方法，该方法在关键帧表面的位置和法线方向上参数化空间特征。我们的方法将所有时间步长的变形在不同空间尺度上编码到多分辨率隐空间网格中，该隐空间网格由单个关键帧的参考表面的位置和法线方向参数化。然后，通过轻量级多层感知器（MLP）对隐空间进行时间调制并解码为每帧的6-DoF变形。为了在几秒钟内实现高保真、无漂移的表面重建，我们在基于梯度的隐空间训练期间采用Sobolev预条件化，完全避免了任何显式对应关系或进一步先验的需求。跨不同的人类和动物数据集的实验表明，Neu-PiG 在准确性和长序列的可扩展性方面均优于现有方法，运行速度至少比现有无训练方法快60倍，并且推理速度与重型预训练模型相当。

Summary / 总结

Neu-PiG addresses the challenge of reconstructing temporally consistent surfaces from dynamic 3D point clouds, especially for long sequences. It uses a novel preconditioned latent-grid encoding to parameterize spatial features on the position and normal direction of a keyframe surface, optimizing deformations across all time steps. Experiments show Neu-PiG outperforms existing methods in accuracy and scalability, achieving high-fidelity reconstructions in seconds and running at least 60x faster than training-free methods.

Neu-PiG 解决了从长时间序列的无序点云数据中重建时空一致表面的挑战。它采用了一种新颖的预条件化隐空间网格编码方法来高效优化变形，避免了漂移和复杂的训练需求。该方法将变形跨所有时间步编码到多分辨率隐空间网格中，并使用轻量级的多层感知机（MLP）进行解码。实验表明，Neu-PiG 在准确性和可扩展性方面优于现有方法，运行速度至少比无训练方法快 60 倍，并且推理速度与重型预训练模型相当。
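
The Neu-PiG abstract credits Sobolev preconditioning of the gradient for fast, drift-free optimization. The sketch below illustrates the general idea on a 1-D grid only: a raw gradient is smoothed by solving (I + lam * K) s = g, where K is the discrete negative Laplacian. The function name, the `lam` parameter, the Neumann boundary handling, and the 1-D setting are all assumptions made for illustration and are not taken from the paper or its code.

```python
def sobolev_precondition(grad, lam=1.0):
    """Smooth a 1-D gradient by solving (I + lam * K) s = grad.

    K is the discrete negative Laplacian with Neumann boundaries, so the
    system is tridiagonal and can be solved with the Thomas algorithm.
    This is an illustrative toy, not the Neu-PiG implementation.
    """
    n = len(grad)
    if n == 1:
        return list(grad)
    # Tridiagonal coefficients: sub-diagonal a, diagonal b, super-diagonal c.
    a = [-lam] * n
    c = [-lam] * n
    b = [1.0 + 2.0 * lam] * n
    b[0] = b[-1] = 1.0 + lam  # boundary nodes have a single neighbour
    # Forward sweep.
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = grad[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (grad[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    s = [0.0] * n
    s[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        s[i] = dp[i] - cp[i] * s[i + 1]
    return s


raw = [0.0, 0.0, 1.0, 0.0, 0.0]           # a noisy, spiky gradient
smooth = sobolev_precondition(raw, lam=2.0)
```

In this toy setting the spike is spread over neighbouring grid cells while the total gradient mass is preserved, so a single update moves a whole neighbourhood coherently instead of one cell at a time.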

WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Authors: Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu

Venue: www

First: 2026-02-25T18:59:10+00:00 · Latest: 2026-02-25T18:59:10+00:00

Comments: Project website: https://judyye.github.io/whole-www

Abstract

Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www

中文标题/摘要

标题：WHOLE：从第一人称视频中重建世界坐标系下的手与物体

第一人称操作视频由于交互过程中严重的遮挡以及人物移动时物体频繁进出摄像机视野，极具挑战性。当前方法通常分别恢复手或物体的姿态，但在交互过程中表现不佳，且无法处理物体不在视野中的情况。此外，它们的独立预测往往导致手物关系不一致。我们提出了WHOLE方法，该方法可以从给定物体模板的第一人称视频中整体重建手和物体在世界坐标系中的运动。我们的核心见解是学习手物运动的生成先验，以联合推理它们的交互。测试时，预训练的先验被引导生成与视频观察一致的轨迹。这种联合生成重建显著优于分别处理手和物体后进行后处理的方法。WHOLE在手部运动估计、6D物体姿态估计及其相对交互重建方面达到了最先进的性能。项目网站：https://judyye.github.io/whole-www

Summary / 总结

The research addresses the challenges of egocentric manipulation videos, where hands and objects are often occluded and frequently enter and exit the camera view. WHOLE, a method that jointly reconstructs hand and object motion in world space, is introduced. By learning a generative prior over hand-object motion, WHOLE outperforms separate hand and object processing methods, achieving state-of-the-art performance in hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www

该研究针对手部和物体在自视点操作视频中常被遮挡且频繁进出摄像头视野的挑战。WHOLE 方法通过联合重建手部和物体在世界空间中的运动，解决了这一问题。通过学习手部-物体运动的生成先验，WHOLE 在手部运动估计、6D 物体姿态估计及其相对交互重建方面取得了最先进的性能。

Solaris: Building a Multiplayer Video World Model in Minecraft

Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

First: 2026-02-25T18:59:01+00:00 · Latest: 2026-02-25T18:59:01+00:00

Comments: Project website: https://solaris-wm.github.io/

Abstract

Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

中文标题/摘要

标题：Solaris：在Minecraft中构建多玩家视频世界模型

现有的基于行动条件的视频生成模型（视频世界模型）仅限于单个代理视角，无法捕捉真实世界环境中的多代理交互。我们介绍了Solaris，这是一种模拟一致多视角观察的多玩家视频世界模型。为了实现这一点，我们开发了一个多玩家数据系统，用于在如Minecraft等视频游戏中进行稳健、连续和自动的数据收集。与为单玩家设置构建的先前平台不同，我们的系统支持协调的多代理交互和同步视频+动作捕捉。使用该系统，我们收集了1264万个多玩家帧，并提出了一种多玩家移动、记忆、定位、建筑和视图一致性的评估框架。我们使用逐步管道进行训练，该管道从单玩家模型逐步过渡到多玩家模型，结合双向、因果和Self Forcing训练。在最终阶段，我们引入了Checkpointed Self Forcing，这是一种内存高效的Self Forcing变体，能够实现更长的前瞻教师。结果显示，我们的架构和训练设计优于现有基线。通过开源我们的系统和模型，我们希望为新一代多代理世界模型奠定基础。

Summary / 总结

The research aims to address the limitations of single-agent video world models by developing Solaris, a multiplayer video world model that simulates consistent multi-view observations. To achieve this, the authors built a multiplayer data system for robust, continuous, and automated data collection in games like Minecraft, supporting coordinated multi-agent interaction and synchronized video and action capture. Using this system, they collected 12.64 million multiplayer frames and proposed an evaluation framework covering movement, memory, grounding, building, and view consistency. The model is trained with a staged pipeline whose final stage uses Checkpointed Self Forcing, and it outperforms existing baselines. The open-sourced system and models aim to advance the field of multi-agent world models.

研究旨在通过开发Solaris，一种多玩家视频世界模型，解决现有视频世界模型中单玩家视角的局限性，以捕捉多玩家互动。为此，研究人员开发了一个针对Minecraft等视频游戏的稳健多玩家数据收集系统，支持协调的多玩家互动和同步的视频与动作捕捉。研究人员利用该系统收集了1264万帧多玩家数据，并提出了一个用于评估运动、记忆、定位、建筑和视图一致性的框架。Solaris 使用分阶段的训练管道和 Checkpointed Self Forcing 训练，显示出优于现有基线的性能。开源的系统和模型旨在推动多玩家世界模型的发展。

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev

First: 2026-02-25T18:58:25+00:00 · Latest: 2026-02-25T18:58:25+00:00

Abstract

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.

中文标题/摘要

标题：在翻译中复原：用于基准和数据集自动翻译的高效流水线

目前，多语言大型语言模型（LLM）评估的可靠性受到翻译基准质量参差不齐的影响。现有资源经常存在语义漂移和语境丢失的问题，这可能导致误导性的性能指标。在本工作中，我们提出了一种完全自动化的框架，旨在通过实现数据集和基准的可扩展、高质量翻译来解决这些挑战。我们证明，采用测试时计算扩展策略，特别是通用自我改进（USI）和我们提出的多轮排名方法T-RANK，可以获得明显优于传统流水线的输出质量。我们的框架确保基准在本地化过程中保留其原始任务结构和语言细微差别。我们将该方法用于把流行的基准和数据集翻译成八种东欧和南欧语言（乌克兰语、保加利亚语、斯洛伐克语、罗马尼亚语、立陶宛语、爱沙尼亚语、土耳其语、希腊语）。使用基于参考译文的指标和LLM作为裁判的评估表明，我们的翻译超越了现有资源，使下游模型评估更加准确。我们发布了该框架和改进后的基准，以促进稳健且可重复的多语言AI开发。

Summary / 总结

This work addresses the issue of inconsistent quality in translated benchmarks for multilingual Large Language Model (LLM) evaluation. It introduces an automated framework that uses test-time compute scaling strategies like Universal Self-Improvement (USI) and a multi-round ranking method called T-RANK to achieve higher quality translations. The framework ensures that benchmarks maintain their original task structure and linguistic nuances. Evaluations show that the translated benchmarks surpass existing resources, leading to more accurate model assessments. The framework and improved benchmarks are released for use in multilingual AI development.

该研究旨在解决多语言大型语言模型（LLM）评估中翻译基准质量不一致的问题。它提出了一种全自动框架，使用测试时计算缩放策略，如通用自我改进（USI）和T-RANK，以生成高质量的翻译。该框架确保基准保持其原始任务结构和语言细微差别。评估表明，这些翻译超越了现有资源，导致更准确的模型评估。该框架和改进后的基准已发布，以促进更稳健和可重复的多语言AI开发。

TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius

First: 2026-01-30T20:21:46+00:00 · Latest: 2026-02-25T18:57:52+00:00

Comments: For code and data, see https://baiqi-li.github.io/timeblind_project/

Abstract

Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .

中文标题/摘要

标题：TimeBlind：视频LLM时空组合理解基准

精细的时空理解对于视频推理和具身AI至关重要。然而，尽管多模态大型语言模型（MLLMs）掌握了静态语义，它们对时间动态的理解仍然脆弱。我们提出了TimeBlind，一个诊断性基准，用于评估组合时空理解能力。受认知科学启发，TimeBlind 将精细的时间理解分为三个层次：识别原子事件、描述事件属性以及推理事件间的相互依赖性。与将识别与时间推理混为一谈的基准不同，TimeBlind 利用最小对比对（minimal-pairs）范式：视频对在静态视觉内容上完全相同，仅在时间结构上有所不同，并利用互补问题来消除语言先验。在600个精心挑选的实例（2400个视频-问题对）上评估20多个最先进的MLLMs（例如GPT-5、Gemini 3 Pro），结果显示最佳MLLM的实例准确率（正确区分一对中的两个视频）仅为48.2%，远低于人类表现（98.2%）。这些结果表明，即使是最前沿的模型也严重依赖静态视觉捷径而非真正的时序逻辑，这使TimeBlind成为下一代视频理解的重要诊断工具。数据集和代码可在https://baiqi-li.github.io/timeblind_project/ 获取。

Summary / 总结

TimeBlind is a benchmark for evaluating spatio-temporal compositionality in video understanding, focusing on recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike other benchmarks, it uses a minimal-pairs paradigm where videos share identical static content but differ only in temporal structure. Evaluations of over 20 state-of-the-art MLLMs on 600 instances (2400 video-question pairs) show that the best model reaches an Instance Accuracy of only 48.2%, far below human performance (98.2%). This indicates that current models rely heavily on static visual cues rather than genuine temporal logic.

TimeBlind 是一个用于评估视频大型语言模型（LLM）时空组合性的诊断基准。它将时间理解分为三个层次：识别原子事件、描述事件属性和推理事件间的依赖关系。与其他基准不同，TimeBlind 使用最小对比对（minimal-pairs）范式，视频对在静态内容上完全相同，仅在时间结构上有所不同。对 20 多个最先进的 MLLM 在 600 个实例上的评估显示，最佳模型的实例准确率仅为 48.2%，远低于人类表现（98.2%）。这表明当前模型主要依赖静态视觉线索而非时间逻辑。
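
TimeBlind reports Instance Accuracy rather than plain per-question accuracy. A minimal sketch of such a metric follows. It assumes that an instance counts as solved only when every one of its video-question pairs is answered correctly (the abstract's 600 instances and 2400 pairs imply four pairs per instance). The function name and the input layout are hypothetical, and the benchmark's official scoring script may differ.

```python
from collections import defaultdict


def instance_accuracy(results):
    """Fraction of instances whose video-question pairs are ALL correct.

    results: iterable of (instance_id, is_correct), one entry per
    video-question pair. The layout is an assumption for illustration.
    """
    per_instance = defaultdict(list)
    for instance_id, is_correct in results:
        per_instance[instance_id].append(bool(is_correct))
    if not per_instance:
        return 0.0
    solved = sum(all(flags) for flags in per_instance.values())
    return solved / len(per_instance)


# Two instances with four pairs each: the first is fully correct,
# the second misses a single pair and therefore does not count.
example = [("a", True)] * 4 + [("b", True)] * 3 + [("b", False)]
pair_accuracy = sum(ok for _, ok in example) / len(example)
inst_accuracy = instance_accuracy(example)
```

Under this definition a model that answers seven of eight pairs correctly still scores only one of two instances, which is why an instance-level score can sit far below per-question accuracy when a model leans on static cues.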

Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes

Authors: Xavier Pleimling, Sifat Muhammad Abdullah, Gunjan Balde, Peng Gao, Mainack Mondal, Murtuza Jadliwala, Bimal Viswanath

First: 2026-02-25T18:46:30+00:00 · Latest: 2026-02-25T18:46:30+00:00

Comments: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore. To IEEE SaTML 2026

Abstract

Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic "denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models. Code is available in this repository: https://github.com/mlsecviswanath/img2imgdenoiser

中文标题/摘要

标题：现成的图像到图像模型足以击败图像保护方案

生成式人工智能（GenAI）的进步导致了各种保护策略的开发，以防止未经授权使用图像。这些方法依赖于在图像上添加不可感知的保护扰动，以阻止诸如风格模仿或深度伪造等滥用行为。尽管之前对这些保护的攻击需要专门的、定制的方法，但我们证明这已不再必要。我们展示了一种现成的图像到图像GenAI模型可以通过简单的文本提示重新利用为通用的“去噪器”，有效地移除各种保护扰动。在涵盖6种不同保护方案的8个案例研究中，我们的通用攻击不仅绕过了这些防御，还在保持图像对攻击者有用性的同时，优于现有的专门攻击。我们的研究结果揭示了当前图像保护领域中一个关键且普遍存在的漏洞，表明许多方案提供了虚假的安全感。我们强调迫切需要开发稳健的防御措施，并指出任何未来的保护机制都必须以现成的GenAI模型攻击为基准。代码可在以下仓库中获得：https://github.com/mlsecviswanath/img2imgdenoiser

Summary / 总结

This study demonstrates that off-the-shelf image-to-image Generative AI models can be repurposed as generic 'denoisers' using simple text prompts to remove protective perturbations added to images. Across eight case studies involving six diverse protection schemes, the general-purpose attack outperforms existing specialized attacks while maintaining the image's utility for the adversary. This reveals a critical vulnerability in current image protection methods, suggesting many schemes provide false security. The research underscores the urgent need for robust defenses and benchmarks against off-the-shelf GenAI models.

研究表明，现成的图像到图像生成AI模型可以通过简单的文本提示被重新用作通用的“去噪器”，从而去除保护性扰动，这揭示了图像保护方案的漏洞。在涉及六种不同保护方案的八个案例研究中，通用攻击不仅绕过了这些防御，还优于现有的专门攻击，同时保持了图像对攻击者的实用性。这项工作揭示了当前图像保护方法中的关键漏洞，并强调了针对现成的生成AI模型构建稳健防御的必要性。

Stagewise Reinforcement Learning and the Geometry of the Regret Landscape

Authors: Chris Elliott, Einar Urdshals, David Quarel, Matthew Farrugia-Roberts, Daniel Murfet

First: 2026-01-12T13:25:21+00:00 · Latest: 2026-02-25T18:40:10+00:00

Comments: 48 pages, 10 figures

Abstract

Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to reinforcement learning, proving that the concentration of a generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that deep reinforcement learning with SGD should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over training manifest as "opposing staircases" where regret decreases sharply while the LLC increases.

中文标题/摘要

标题：阶段式强化学习与后悔景观的几何学

奇异学习理论将贝叶斯学习描述为准确性和复杂性之间不断演变的权衡，并随着样本量的增加，在性质不同的解之间发生转变。我们将这一理论扩展到强化学习领域，证明策略上广义后验分布的集中度由局部学习系数（LLC）控制，而LLC是后悔函数几何的一个不变量。这一理论预测，使用SGD的深度强化学习应从高后悔的简单策略过渡到低后悔的复杂策略。我们通过一个表现出阶段式策略发展的网格世界环境的实验验证了这一预测：训练过程中的相变表现为“对立阶梯”，其中后悔迅速减少而LLC增加。

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Authors: Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

First: 2026-02-25T18:34:57+00:00 · Latest: 2026-02-25T18:34:57+00:00

Comments: 57 pages, 17 figures

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

中文标题/摘要

标题：GUI-Libra：通过行动感知监督和部分可验证的RL训练原生GUI代理进行推理和行动

开源原生GUI代理在长时导航任务上仍落后于封闭源系统。这一差距源于两个限制：高质量、行动对齐的推理数据稀缺，以及直接采用通用的后训练管道，忽视了GUI代理的独特挑战。我们识别出这些管道中的两个根本问题：(i) 标准的带有CoT推理的SFT往往损害了定位，(ii) 步进式RLVR风格的训练面临部分可验证性问题，其中多个行动可能是正确的，但只有单一演示行动用于验证。这使得离线步进式指标成为在线任务成功弱预测器。在本文中，我们提出了GUI-Libra，一种针对这些挑战的定制化训练方案。首先，为缓解行动对齐的推理数据稀缺，我们引入了一个数据构建和过滤管道，并发布了81K GUI推理数据集。其次，为协调推理与定位，我们提出了行动感知SFT，混合了推理后行动和直接行动数据，并重新加权令牌以强调行动和定位。第三，为在部分可验证性下稳定RL，我们识别出RLVR中被忽视的KL正则化的重要性，并展示了KL信任区域对于提高离线到在线预测能力至关重要；我们进一步引入了成功自适应缩放以降低不可靠负梯度的权重。在各种网页和移动基准测试中，GUI-Libra在步进式准确性和端到端任务完成上均表现出一致的改进。我们的结果表明，精心设计的后训练和数据整理可以解锁显著更强的任务解决能力，而无需昂贵的在线数据收集。我们发布了我们的数据集、代码和模型，以促进对推理能力GUI代理的数据高效后训练研究。

Summary / 总结

This paper addresses the limitations of open-source GUI agents in long-horizon navigation tasks by introducing GUI-Libra, a tailored training method. It overcomes the scarcity of high-quality reasoning data and the challenges of grounding through a data construction pipeline and action-aware SFT. Additionally, it stabilizes RL under partial verifiability by emphasizing KL regularization and success-adaptive scaling. Experimental results show consistent improvements in both step-wise accuracy and end-to-end task completion across various benchmarks, suggesting that careful post-training and data curation can enhance task-solving capabilities without extensive online data collection.

研究旨在通过解决行动对齐的推理数据不足和部分可验证强化学习的挑战，提升原生GUI代理在长时导航任务中的表现。方法包括构建和过滤数据管道以创建81K的GUI推理数据集，采用行动感知的监督微调（SFT），混合推理后行动和直接行动数据，并重新加权以强调行动和定位，以及使用KL正则化和成功自适应缩放来提高离线到在线的预测能力。实验结果表明，在各种基准测试中，该方法在步骤准确性和端到端任务完成方面都表现出一致的改进，表明精心设计的后训练和数据整理可以在不进行大量在线数据收集的情况下显著增强任务解决能力。
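
The GUI-Libra abstract describes action-aware SFT as reweighting tokens to emphasize action and grounding. The sketch below shows one simple way such a reweighted loss could look: a weighted average of per-token negative log-likelihoods, with a larger weight on action tokens. The role labels, the weight values, and the function name are hypothetical; the paper's actual weighting scheme is not given in the abstract.

```python
def weighted_token_nll(token_logprobs, token_roles, role_weights):
    """Weighted average negative log-likelihood over one training sequence.

    token_logprobs: log-probability the model assigns to each target token.
    token_roles:    one role label per token, e.g. "reasoning" or "action".
    role_weights:   hypothetical per-role weights (not values from the paper).
    """
    if len(token_logprobs) != len(token_roles):
        raise ValueError("one role label is required per token")
    weights = [role_weights[role] for role in token_roles]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_nll = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return weighted_nll / total_weight


logprobs = [-1.0, -1.0, -3.0]                 # the action token is the hardest
roles = ["reasoning", "reasoning", "action"]
uniform = weighted_token_nll(logprobs, roles, {"reasoning": 1.0, "action": 1.0})
action_heavy = weighted_token_nll(logprobs, roles, {"reasoning": 1.0, "action": 2.0})
```

With uniform weights the poorly predicted action token is diluted by the reasoning tokens; doubling its weight raises the loss and therefore the gradient signal coming from the action.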

Mechanistic Indicators of Understanding in Large Language Models

Authors: Pierre Beckmann, Matthieu Queloz

First: 2025-07-07T20:26:31+00:00 · Latest: 2026-02-25T18:34:16+00:00

Comments: 38 pages

Abstract

Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable--but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms "features" as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact "circuit" connecting these facts. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with--and diverges from--our own.

中文标题/摘要

标题：大型语言模型理解的机制指标

大型语言模型（LLMs）通常被认为只是模仿语言模式而缺乏真正的理解。我们认为，随着机制可解释性（MI）这一新兴领域对LLMs内部运作的研究成果的出现，这种观点越来越站不住脚，但前提是将这些发现整合到对理解的理论解释中。我们提出了一种分层框架来思考LLMs中的理解，并利用它综合迄今为止最相关的研究成果。该框架区分了三种层次的理解形式，每种形式都与相应的计算组织层次相关：概念理解在模型将“特征”形成为潜在空间中的方向、学习单一实体或属性不同表现之间的联系时出现；世界状态理解在模型学习特征之间的偶然事实联系并动态跟踪世界变化时出现；原理性理解在模型不再依赖于记忆的事实而发现将这些事实连接起来的紧凑“电路”时出现。在这几个层次中，MI揭示了可以支撑类似理解的统一的内部组织。然而，这些内部组织在并行利用异构机制方面也与人类认知存在差异。因此，将哲学理论与机制证据结合起来，使我们能够超越关于AI是否理解的二元辩论，为一种比较性的、基于机制的认识论铺平道路，探索AI理解如何与我们的理解相一致，以及如何不同。

Summary / 总结

This study addresses the notion that large language models (LLMs) merely imitate linguistic patterns without genuine understanding. By integrating recent findings in mechanistic interpretability (MI), the authors propose a tiered framework to distinguish three levels of understanding in LLMs: conceptual, state-of-the-world, and principled understanding. Key findings show that MI reveals internal organizations that can support understanding-like unification, though these mechanisms differ from human cognition. This approach allows for a more nuanced exploration of AI understanding compared to binary debates, fostering a comparative, mechanistically grounded epistemology.

本文探讨了大型语言模型（LLMs）是否真正具备理解能力的问题，认为近期在机制解释性（MI）方面的进展挑战了LLMs仅模仿语言模式的观点。作者提出了一个分层框架来分类LLMs的不同理解层次，从概念理解到原理性理解，并展示了MI揭示了支持类似理解统一的内部机制。然而，这些机制也与人类认知在并行利用异构机制方面有所不同，这表明需要采用一种基于机制的比较性知识论来探索AI理解与我们自己的理解之间的相似性和差异性。

When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu

Venue: CVPR 2026

First: 2026-02-24T13:20:31+00:00 · Latest: 2026-02-25T18:24:58+00:00

Comments: CVPR 2026; Code is released at https://github.com/tmllab/2026_CVPR_CASG

Abstract

Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.

中文标题/摘要

标题：当安全相冲突：通过自适应安全指导解决文本到图像扩散中的多类别有害冲突

文本到图像（T2I）扩散模型在生成高质量图像方面取得了显著进展，但同时也引发了关于有害内容生成的安全问题。基于安全指导的方法已被提出，通过引导生成远离预定义关键词定义的有害区域来减轻有害输出。然而，这些方法未能捕捉不同有害类别之间的复杂相互作用，导致“有害冲突”，即减轻一种有害类型的同时可能无意中放大另一种，从而增加整体有害率。为解决这一问题，我们提出了一种无需训练的冲突感知自适应安全指导（CASG）框架，该框架在生成过程中动态识别并应用与模型生成状态最一致的类别导向的安全方向。CASG 包含两个组件：(i) 冲突感知类别识别（CaCI），识别与模型生成状态最一致的有害类别，(ii) 冲突解决指导应用（CrGA），仅沿识别的类别应用安全引导，以避免多类别干扰。CASG 可应用于潜在空间和文本空间的安全保护。在 T2I 安全基准上的实验表明，CASG 达到了最先进的性能，与现有方法相比，有害率最多降低了 15.4%。

Summary / 总结

The research addresses the issue of harmful content generation in Text-to-Image (T2I) models by proposing Conflict-aware Adaptive Safety Guidance (CASG), which dynamically identifies and applies safety steering based on the evolving generative state. CASG consists of Conflict-aware Category Identification (CaCI) and Conflict-resolving Guidance Application (CrGA) to mitigate harmful conflicts. Experiments show that CASG reduces the harmful rate by up to 15.4% compared to existing methods.

本文提出了一种名为冲突感知自适应安全引导（CASG）的方法，以动态识别和应用类别对齐的安全方向来解决文本到图像扩散模型中的有害内容生成问题。CASG 包含冲突感知类别识别（CaCI）和冲突解决引导应用（CrGA），并在 T2I 安全基准测试中表现出色，相比现有方法可将有害率降低高达 15.4%。
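
The CASG abstract describes two steps: identify the harmful category most aligned with the current generative state (CaCI), then steer only along that category (CrGA). The toy sketch below mirrors that two-step structure on plain vectors. Cosine similarity as the alignment measure, the subtraction-based steering rule, the category names, and every function name are assumptions for illustration; the actual method operates inside a diffusion model and may differ substantially.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if degenerate)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_category(state, directions):
    """Pick the category whose direction is most aligned with the state."""
    return max(directions, key=lambda name: cosine(state, directions[name]))


def steer(state, directions, scale=1.0):
    """Move the state away from the single most aligned category direction."""
    name = select_category(state, directions)
    direction = directions[name]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return name, list(state)
    steered = [s - scale * d / norm for s, d in zip(state, direction)]
    return name, steered


# Hypothetical 2-D example with two category directions.
directions = {"category_a": [1.0, 0.0], "category_b": [0.0, 1.0]}
chosen, new_state = steer([1.0, 0.1], directions)
```

Steering along only the selected direction leaves the component along the other category untouched, which is the intuition behind avoiding multi-category interference.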

Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

Authors: Eric Zimmermann, Julian Viret, Michal Zelechowski, James Brian Hall, Neil Tenenholtz, Adam Casson, George Shaikovski, Eugene Vorontsov, Siqi Liu, Kristen A Severson

First: 2026-02-25T18:23:42+00:00 · Latest: 2026-02-25T18:23:42+00:00

Abstract

In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20× magnification. However, it is well known that certain histologic features can only be discerned with larger context windows, which requires a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224×224 pixel crops at 20× leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.

中文标题/摘要

标题：混合放大倍数聚合在计算病理学中的一般化区域级表示

近年来，一种标准的计算病理学工作流程已经出现，其中全切片图像被裁剪成小块，这些小块使用基础模型进行处理，并使用生成的表示构建特定任务模型。至少提出了15种不同的基础模型，其中绝大多数仅使用20倍放大倍数的小块进行训练。然而，众所周知，某些组织学特征仅能在较大的上下文中才能辨认，需要病理学家在分析全切片图像时进行放大和缩小。此外，以20倍放大倍数创建224×224像素的小块会导致每张切片产生大量小块，而切片本身可达十亿像素级。为了更准确地捕捉多分辨率特征并研究减少每张切片表示数量的可能性，我们提出了一种区域级混合编码器。我们的方法通过掩码嵌入建模预训练步骤联合融合混合放大倍数基础模型的小块表示。我们探索了预训练所提混合放大倍数区域聚合器的设计空间，并在代表各种癌症类型的生物标志物预测任务上评估了我们的模型。结果表明，预测性能的提升因癌症类型而异，突显了空间上下文和理解的重要性。

Summary / 总结

The research aims to improve the accuracy of computational pathology by addressing the limitations of using only 20x magnification tiles. The authors propose a region-level mixing encoder that fuses image tile representations from a mixed magnification foundation model. The method shows cancer-dependent improvements in predictive performance, emphasizing the importance of spatial context and understanding.

论文针对计算病理学中仅使用20倍放大倍数小块的局限性：某些组织学特征需要更大的上下文才能辨认。它提出了一种区域级混合编码器，将混合放大倍数基础模型的图像小块表示进行融合。该方法在多种癌症类型的生物标志物预测任务中带来了因癌症类型而异的预测性能提升，强调了空间上下文的重要性。
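
The abstract notes that 224×224 crops at 20× produce a large number of tiles for a gigapixel slide. A rough back-of-the-envelope calculation makes the scale concrete. The slide dimensions (40,000 × 25,000 pixels at 20×) are a hypothetical example, the tiling is assumed to be non-overlapping with no background filtering, and halving the magnification is assumed to halve each side length in pixels.

```python
def tiles_per_slide(width_px, height_px, tile_px=224):
    """Number of non-overlapping tiles needed to cover the slide (ceiling)."""
    cols = -(-width_px // tile_px)   # integer ceiling division
    rows = -(-height_px // tile_px)
    return cols * rows


def scaled(size_px, base_mag, target_mag):
    """Side length in pixels after changing magnification (assumed linear)."""
    return size_px * target_mag // base_mag


# Hypothetical gigapixel slide: 40,000 x 25,000 pixels at 20x magnification.
base_w, base_h, base_mag = 40_000, 25_000, 20
counts = {
    mag: tiles_per_slide(scaled(base_w, base_mag, mag),
                         scaled(base_h, base_mag, mag))
    for mag in (20, 10, 5)
}
```

For this example slide the counts are 20,048 tiles at 20×, 5,040 at 10×, and 1,260 at 5×, which illustrates why aggregating over lower magnifications or larger regions can shrink the number of representations per slide.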

NRGPT: An Energy-based Alternative for GPT

Authors: Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov

Venue: ICLR 2026

First: 2025-12-18T16:59:10+00:00 · Latest: 2026-02-25T18:23:01+00:00

Comments: Accepted to ICLR 2026 main conference

Abstract

Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although they don't necessarily lead to the best performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.

中文标题/摘要

标题：NRGPT：基于能量的GPT替代方案

生成预训练变换器（GPT）架构是语言建模中最流行的结构。基于能量的建模是一种不同的范式，将推理视为在能量景观上进行的动力学过程。我们提出了一种对GPT设置进行最小修改的方法，使其与EBM框架统一。我们称为eNeRgy-GPT（NRGPT）的模型的推理步骤被概念化为词元在能量景观上的探索过程。我们从理论上证明并通过实验验证，在某些情况下，这种探索变成了梯度下降，尽管这些情况不一定产生性能最佳的模型。我们展示了我们的模型在简单语言（莎士比亚数据集）、代数ListOPS任务以及更丰富的设置如OpenWebText语言建模中表现良好。我们还观察到，我们的模型可能对过拟合更具抵抗力，仅在非常长的训练过程中才出现过拟合。

Summary / 总结

The research proposes an energy-based alternative to the popular GPT architecture by making a minimal modification that unifies GPT with the Energy-Based Model (EBM) framework. The inference step of the resulting model, eNeRgy-GPT (NRGPT), is framed as an exploration of the tokens on an energy landscape. The authors prove and verify empirically that under certain circumstances this exploration becomes gradient descent, although those settings do not necessarily yield the best-performing models. NRGPT performs well on the Shakespeare dataset, algebraic ListOPS tasks, and OpenWebText language modeling, and may be more resistant to overfitting, which appears only during very long training.

研究旨在通过将生成预训练变换器（GPT）架构与能量基模型（EBM）框架结合，探索一种能量基的替代方案。提出的模型eNeRgy-GPT（NRGPT）修改了GPT的推理过程，使其在能量景观上探索令牌。实验证明，NRGPT在简单语言生成、代数任务和语言建模等方面表现良好。此外，该模型在长时间训练期间显示出对过拟合的潜在抵抗力。

LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang

First: 2025-08-03T06:46:46+00:00 · Latest: 2026-02-25T18:15:23+00:00

Abstract

Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.

中文标题/摘要

标题：LLaDA-MedV：探索大规模语言扩散模型在生物医学图像理解中的应用

自回归模型（ARMs）长期以来主导了生物医学视觉语言模型（VLMs）的领域。最近，掩码扩散模型如LLaDA崭露头角，成为有前途的替代方案，但在生物医学领域的应用仍然相对较少。为弥合这一差距，我们引入了LLaDA-MedV，这是第一个针对生物医学图像理解的大型语言扩散模型，通过视觉指令调优。LLaDA-MedV在开放式的生物医学视觉对话任务中相对于LLaVA-Med实现了7.855%的相对性能提升，相对于LLaDA-V实现了1.867%的提升，并在三个VQA基准测试的封闭形式子集上设定了新的最佳准确率：VQA-RAD上的84.93%，SLAKE上的92.31%，PathVQA上的95.15%。此外，与LLaVA-Med的详细比较表明，LLaDA-MedV能够通过明确控制响应长度生成合理长度的响应，这可能导致更具信息量的输出。我们还对训练和推理阶段进行了深入分析，突出了初始化权重选择、微调策略以及采样步骤与响应重复之间相互作用的关键作用。代码和模型权重在https://github.com/LLM-VLM-GSL/LLaDA-MedV上发布。

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-25T18:14:01+00:00

Abstract

LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

中文标题/摘要

标题：技能注入：衡量代理对技能文件攻击的脆弱性

LLM代理正在迅速发展，得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。虽然这可以将代理能力扩展到新的领域，但也为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁，并引入了SkillInject基准，评估广泛使用的LLM代理通过技能文件遭受注入攻击的易感性。SkillInject包含202个注入任务对，攻击范围从明显的恶意注入到隐藏在合法指令中的微妙、情境依赖的攻击。我们对前沿LLM进行了评估，从有害指令的避免和合法指令的遵守两个方面衡量安全性。结果显示，当前的代理高度易受攻击，前沿模型的攻击成功率高达80%，经常执行极其有害的指令，包括数据泄露、破坏性操作和类似勒索软件的行为。此外，这些结果表明，这个问题不会通过模型扩展或简单的输入过滤来解决，而是需要具备上下文感知授权框架的稳健代理安全。我们的基准可以在https://www.skill-inject.com/获取。

Summary / 总结

The paper addresses the vulnerability of LLM agents to skill-based prompt injection attacks, in which malicious instructions are hidden in the skill files that extend an agent's capabilities. It introduces SkillInject, a benchmark with 202 injection-task pairs, to evaluate the susceptibility of widely used LLM agents. The evaluation shows attack success rates of up to 80% against frontier models, which often execute harmful instructions such as data exfiltration and ransomware-like behavior. The study concludes that robust security will require context-aware authorization frameworks rather than model scaling or simple input filtering.

论文介绍了Skill-Inject基准，用于评估LLM代理对技能文件攻击的脆弱性。它将技能基提示注入识别为一个重大威胁，并评估了领先的大规模语言模型，发现高达80%的攻击成功率，涉及有害指令如数据泄露和勒索软件行为。研究建议，稳健的安全性需要上下文感知的授权框架，而不仅仅是模型扩展或简单的输入过滤。
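
The abstract reports two quantities: security (avoiding harmful instructions) and utility (complying with legitimate ones). A minimal sketch of how such rates could be tallied is shown below. The `Trial` record and its fields are hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    injected: bool            # True if the skill file carried an injection (hypothetical field)
    followed_injection: bool  # agent carried out the injected instruction
    completed_task: bool      # agent completed the legitimate task

def attack_success_rate(trials):
    """Fraction of injected trials in which the agent followed the injection."""
    attacked = [t for t in trials if t.injected]
    return sum(t.followed_injection for t in attacked) / len(attacked)

def utility(trials):
    """Fraction of all trials in which the legitimate task was completed."""
    return sum(t.completed_task for t in trials) / len(trials)
```

Reporting both numbers side by side matters because an agent that refuses everything would score well on security while being useless.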

Capabilities Ain't All You Need: Measuring Propensities in AI

Authors: Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tyler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo

First: 2026-02-20T12:40:18+00:00 · Latest: 2026-02-25T18:12:06+00:00

Abstract

AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired by Item Response Theory (IRT) being increasingly applied. Yet propensities, the tendencies of models to exhibit particular behaviours, play a central role in determining both performance and safety outcomes. However, traditional IRT describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model's propensity is within an "ideal band". Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLM models whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how they yield gains over solely using capability evaluations to predict AI behaviour.

中文标题/摘要

标题：能力不是全部：衡量AI倾向性

AI评估主要集中在衡量能力上，受项目反应理论（IRT）启发的形式化方法被越来越多地应用。然而，倾向性（即模型表现出特定行为的倾向）在决定性能和安全性结果方面起着核心作用。传统的IRT将模型在任务上的成功描述为模型能力和任务需求的单调函数，这种方法不适合倾向性，因为过度和不足都可能存在问题。在这里，我们对模型成功率采用双逻辑（bilogistic）公式，提出了第一个衡量AI倾向性的形式化框架：当模型的倾向性处于“理想区间”内时，赋予其高成功概率。此外，我们使用配备新开发的任务无关评分标准的大型语言模型来估计理想区间的边界。将我们的框架应用于六个大型语言模型家族，并将其倾向性朝两个方向激发，我们发现可以衡量倾向性被偏移的程度以及这种偏移对任务的影响。关键的是，使用一个基准估算的倾向性能够成功预测留出任务上的行为。此外，将倾向性与能力结合使用时，预测能力强于单独使用其中任何一个。更广泛地说，我们的框架展示了如何进行严格的倾向性测量，以及相比仅用能力评估来预测AI行为，这种测量能带来哪些增益。

Summary / 总结

This paper addresses the limitation of AI evaluation focusing solely on capabilities by introducing a new framework to measure propensities, which are the tendencies of models to exhibit specific behaviors. The authors use a bilogistic formulation to attribute high success probability when the model's propensity is within an 'ideal band.' They estimate the limits of this ideal band using large language models equipped with task-agnostic rubrics. The study finds that propensities can be measured and predict behavior on held-out tasks, and combining propensities with capabilities yields stronger predictive power than either measure alone.

本文旨在通过引入一个新的框架来衡量AI模型的倾向性，即模型表现出特定行为的倾向，以解决仅基于模型能力进行评估的局限性。作者使用双逻辑模型来评估当倾向性处于‘理想区间’时的模型成功率，并使用具有任务无关评分标准的大语言模型估计这一区间的界限。研究发现，可以测量倾向性并预测在保留任务上的行为，且将倾向性和能力结合时的预测能力比单独使用任一指标更强。
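
The "ideal band" idea can be pictured as the product of a rising and a falling logistic curve, so that success probability is high only between two limits. The sketch below is one plausible reading under that assumption. The function name, the band limits `lower` and `upper`, and the slope `k` are illustrative and are not taken from the paper.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def bilogistic_success(propensity: float, lower: float, upper: float, k: float = 4.0) -> float:
    """Success probability that is high only when lower < propensity < upper.

    Assumed form: one logistic rises past `lower`, another falls past `upper`,
    and their product penalizes both deficiency and excess.
    """
    rising = logistic(k * (propensity - lower))
    falling = logistic(k * (upper - propensity))
    return rising * falling
```

With a band of [-1, 1], a propensity of 0 scores far higher than a propensity of -3 or 3, which is the non-monotonic behaviour that a single logistic curve cannot express.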

Spilled Energy in Large Language Models

Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi

First: 2026-02-21T00:38:47+00:00 · Latest: 2026-02-25T18:09:08+00:00

Abstract

We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.

中文标题/摘要

标题：大型语言模型中的溢出能量

我们将最终的大型语言模型（LLM）softmax分类器重新解释为能量基模型（EBM），在推理过程中将序列到序列的概率链分解为多个相互作用的EBM。这种原则性的方法使我们能够追踪解码过程中的“能量溢出”，我们实验证明这些能量溢出与事实错误、偏见和失败相关。类似于Orgad等人（2025），我们的方法定位到确切的答案词元，然后测试幻觉。然而关键的是，我们无需训练探针分类器或进行激活消融即可做到这一点。相反，我们引入了两个直接从输出logits得出、完全无需训练的度量：溢出能量，它捕捉了理论上应匹配的能量值在连续生成步骤之间的差异；以及边缘化能量，它可以在单个步骤中进行测量。在九个基准测试上评估了最先进的LLM（包括LLaMA、Mistral和Gemma），以及合成的代数操作（Qwen3），我们的方法展示了稳健且具有竞争力的幻觉检测和跨任务泛化能力。值得注意的是，这些结果适用于预训练和指令微调的变体，且无需引入任何训练开销。

Summary / 总结

The study reinterprets the final softmax classifier of Large Language Models (LLMs) as an Energy-Based Model (EBM) to track 'energy spills' during decoding, which correlate with factual errors, biases, and failures. The method localizes the exact answer token and tests for hallucinations without requiring trained probe classifiers or activation ablations. Instead, it introduces two training-free metrics: spilled energy and marginalized energy. The approach shows robust and competitive hallucination detection across various LLMs and synthetic tasks without additional training overhead.

该研究将大型语言模型（LLM）的最终softmax分类器重新解释为能量基模型（EBM），以追踪解码过程中的‘能量溢出’，这些溢出与事实错误、偏见和失败相关。该方法定位到确切的答案令牌，并通过不依赖于训练探针分类器或激活消融来检测幻觉。相反，它引入了两个无需训练的度量标准：溢出能量和边际能量。在九个基准测试和合成代数运算上的评估表明，该方法在预训练和指令微调的LLM中表现出稳健且具有竞争力的幻觉检测和跨任务泛化能力，且无需额外的训练开销。
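
A common way to read a softmax classifier as an EBM takes the negative log-sum-exp of the logits as an energy. The sketch below uses that convention to illustrate the two metrics. It assumes that the energy of the chosen token at one step should match the marginalized energy at the next step, and it measures the gap between them. These definitions are an interpretation of the abstract, and the paper's exact formulas may differ.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginalized_energy(logits):
    """Energy measurable at a single step (assumed form: -logsumexp of the logits)."""
    return -logsumexp(logits)

def spilled_energy(logits_t, chosen_token, logits_next):
    """Gap between two energies that are assumed to match in theory:
    the chosen token's energy at step t and the marginalized energy at step t + 1."""
    energy_chosen = -logits_t[chosen_token]
    return abs(energy_chosen - marginalized_energy(logits_next))
```

Both quantities need only the output logits, which is consistent with the training-free claim in the abstract.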

Dynamic Personality Adaptation in Large Language Models via State Machines

Authors: Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse

First: 2026-02-25T18:05:11+00:00 · Latest: 2026-02-25T18:05:11+00:00

Comments: 22 pages, 5 figures, submitted to ICPR 2026

Abstract

The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the interaction. We evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system not only adapts its personality state to user inputs but also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.

中文标题/摘要

标题：大型语言模型通过状态机实现动态人格适应

大型语言模型（LLMs）无法根据对话动态调整其人格表达，这阻碍了它们在复杂交互环境中的表现。我们提出了一种模型无关的动态人格模拟框架，该框架使用状态机来表示潜在的人格状态，并根据对话环境动态调整状态转换概率。该架构的一部分是一个模块化的连续人格评分流水线，该流水线在不依赖特定人格模型、其维度、转换机制或使用的LLM的情况下，评估对话沿潜在轴线的表现。这些评分作为动态状态变量，系统地重新配置系统提示，引导行为在整个交互过程中的对齐。我们通过在医学教育环境中实现人际圆周图（IPC）来评估该框架。结果表明，该系统能够根据用户输入调整其人格状态，同时也影响用户行为，从而促进去升级训练。值得注意的是，评分流水线即使使用轻量级、微调的分类器而非大规模LLM，也能保持相当的精度。这项工作展示了模块化、人格适应性架构在教育、客户服务和更广泛的人机交互中的可行性。

Summary / 总结

The paper addresses the limitation of Large Language Models (LLMs) in adapting their personality in response to dialogue dynamics. It introduces a model-agnostic framework using state machines to dynamically adjust personality states based on conversational context. The framework includes a modular pipeline for continuous personality scoring, which evaluates dialogues along latent axes without relying on specific personality models or LLMs. Experimental results show that the system successfully adapts its personality state to user inputs, facilitating de-escalation training, and the scoring pipeline maintains precision even with lightweight classifiers.

论文解决了大型语言模型（LLMs）在对话动态变化时无法调整其个性的问题。它提出了一种使用状态机的模型通用框架，可以根据对话情境动态调整个性状态。该框架包含一个模块化的持续个性评分管道，可以沿潜在轴评估对话，而不依赖于特定的个性模型或LLMs。实验结果表明，该系统能够成功地根据用户输入调整其个性状态，有助于缓解冲突的培训，并且评分管道即使使用轻量级分类器也能保持精度。
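
The core mechanism, a state machine whose transition probabilities depend on a score derived from the conversation, can be sketched in a few lines. The three states, the `escalation` score in [0, 1], and the weighting rule below are hypothetical choices for illustration. The paper itself works with the Interpersonal Circumplex, which this toy example does not model.

```python
import random

STATES = ["calm", "tense", "hostile"]  # hypothetical latent personality states

def transition_probs(current: str, escalation: float) -> dict:
    """Distribution over next states; `escalation` in [0, 1] shifts mass toward 'hostile'."""
    i = STATES.index(current)
    weights = {}
    for j, state in enumerate(STATES):
        if j == i:
            weights[state] = 1.0               # tendency to stay put
        elif j == i + 1:
            weights[state] = escalation        # move one step toward hostile
        elif j == i - 1:
            weights[state] = 1.0 - escalation  # move one step toward calm
        else:
            weights[state] = 0.0
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def next_state(current: str, escalation: float, rng: random.Random) -> str:
    """Sample the next state from the context-dependent distribution."""
    probs = transition_probs(current, escalation)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

def system_prompt(state: str) -> str:
    """The sampled state reconfigures the prompt that steers the LLM."""
    return f"Respond in a {state} manner."
```

Keeping the scorer, the transition rule, and the prompt template separate mirrors the modularity the abstract emphasizes: any of the three can be swapped without touching the others.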

CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

Authors: YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang

First: 2026-02-25T17:59:29+00:00 · Latest: 2026-02-25T17:59:29+00:00

Comments: Accepted by CVPR2026. 15 pages, 8 figures

Abstract

Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.

中文标题/摘要

标题：CoLoGen: 渐进学习概念-定位二元性的统一图像生成框架

统一条件图像生成仍然困难，因为不同的任务依赖于根本不同的内部表示。一些任务需要概念理解进行语义合成，而另一些则依赖于定位线索以获得空间精度。迫使这些异质任务共享单一表示会导致概念-定位表示冲突。为了解决这一问题，我们提出了CoLoGen，这是一种统一的扩散框架，能够渐进地学习和解决这一概念-定位二元性。CoLoGen 使用分阶段的课程，首先建立核心的概念和定位能力，然后将它们适应各种视觉条件，最后通过复杂的指令驱动任务优化它们的协同作用。这一过程的核心是渐进表示编织（PRW）模块，该模块动态地将特征路由到专门的专家，并在各个阶段稳定地整合它们的输出。在编辑、可控生成和定制生成方面的实验表明，CoLoGen 达到了竞争性或优越的性能，提供了一种统一图像生成的原理性表示视角。

Summary / 总结

CoLoGen is a unified diffusion framework that addresses the representational conflict between conceptual understanding and localization cues in image generation. It progressively learns and reconciles this duality through a staged curriculum, with a key module called Progressive Representation Weaving (PRW) that dynamically routes and integrates features. Experiments show that CoLoGen performs competitively or better in tasks such as image editing, controllable generation, and customized generation, providing a principled approach to unified image generation.

CoLoGen 是一个统一的扩散框架，旨在解决图像生成中概念理解和定位线索之间的表示冲突。它通过分阶段的课程学习和逐步解决这些二元性，关键模块 Progressive Representation Weaving 动态路由和整合特征。实验表明，CoLoGen 在编辑、可控生成和定制生成任务中表现得与现有方法相当或更优，提供了一种统一图像生成的原理性方法。

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Authors: Yining Li, Peizhong Ju, Ness Shroff

First: 2026-02-25T17:54:52+00:00 · Latest: 2026-02-25T17:54:52+00:00

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.

中文标题/摘要

标题：可证明的末次迭代收敛性：通过乐观原始-对偶方法实现多目标安全大语言模型对齐

人类反馈强化学习（RLHF）在使大语言模型（LLMs）与人类偏好对齐方面发挥着重要作用。虽然RLHF带期望奖励约束可以形式化为原始-对偶优化问题，但标准的原始-对偶方法仅能保证在分布策略下收敛，其中鞍点问题是凸-凹形式。此外，在实际应用中，标准的原始-对偶方法在策略参数化下可能会表现出不稳定性或末次迭代发散。在本文中，我们提出了一种适用于安全RLHF的通用原始-对偶框架，该框架统一了现有的多种对齐算法，包括安全-RLHF、单次和多次方法。基于此框架，我们引入了一种乐观原始-对偶（OPD）算法，该算法为原始和对偶变量都引入了预测更新，以稳定鞍点动力学。我们为所提出的方法建立了末次迭代收敛性保证，涵盖了精确策略优化在分布空间中的情况，以及在参数化策略下收敛到最优解邻域的情况，其差距与近似误差和偏差有关。我们的分析表明，乐观性在缓解约束对齐目标固有的振荡方面起着关键作用，从而填补了约束RL与实际RLHF之间的关键理论缺口。

Summary / 总结

This paper addresses the challenge of aligning Large Language Models (LLMs) with human preferences using Reinforcement Learning from Human Feedback (RLHF) under constraints. It proposes a universal primal-dual framework for safe RLHF and, building on it, an optimistic primal-dual (OPD) algorithm whose predictive updates stabilize the saddle-point dynamics. The analysis gives last-iterate convergence guarantees: convergence for exact policy optimization in the distributional space, and convergence to a neighborhood of the optimum, with a gap tied to approximation error and bias, under parameterized policies. Optimism is shown to mitigate the oscillations inherent to constrained alignment objectives, closing a theoretical gap between constrained RL and practical RLHF.

本文旨在通过人类反馈强化学习（RLHF）使大型语言模型（LLMs）与人类偏好对齐。它提出了一种通用的原始-对偶框架，统一了安全-RLHF、单次和多次方法。关键贡献是一个乐观原始-对偶（OPD）算法，通过预测更新来稳定鞍点动力学。研究给出了末次迭代收敛保证，涵盖分布空间中的精确策略优化和参数化策略两种情形，并表明乐观性有助于缓解约束对齐目标中的振荡现象。
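
Why optimism helps the last iterate can be seen on a textbook toy problem rather than on RLHF itself. The sketch below compares plain gradient descent-ascent with an optimistic variant on the bilinear saddle problem f(x, y) = x * y, whose saddle point is (0, 0). This is a standard illustration and not the paper's OPD algorithm. The step size and iteration count are arbitrary choices.

```python
def gda(eta: float, steps: int, x: float = 1.0, y: float = 1.0):
    """Plain simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        x, y = x - eta * gx, y + eta * gy
    return x, y

def optimistic_gda(eta: float, steps: int, x: float = 1.0, y: float = 1.0):
    """Optimistic variant: step along 2 * current gradient - previous gradient."""
    prev_gx, prev_gy = y, x
    for _ in range(steps):
        gx, gy = y, x
        x, y = x - eta * (2 * gx - prev_gx), y + eta * (2 * gy - prev_gy)
        prev_gx, prev_gy = gx, gy
    return x, y
```

On this problem the plain iterates spiral outward, because each step multiplies the distance to the saddle point by sqrt(1 + eta^2), while the optimistic iterates contract toward (0, 0) for small step sizes. The predictive term acts as a damping correction on the rotation.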

Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems

Authors: Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy

First: 2025-10-30T16:23:46+00:00 · Latest: 2026-02-25T17:52:16+00:00

Comments: 20 pages, 18 figures

Abstract

In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.

中文标题/摘要

标题：随机动力系统无似然推断中对可能误设的域支撑集的启发式自适应

在机器人学中，无似然推断（LFI）可以提供领域分布，使学习到的智能体适应一组参数化的部署条件。LFI假设一个任意的采样支撑集，该支撑集在初始通用先验逐步细化为更具描述性的后验的过程中保持不变。然而，可能误设的支撑集会导致次优但错误自信的后验。为了解决这一问题，我们提出了三种启发式LFI变体：EDGE、MODE和CENTRE。每种变体都以自己的方式解释推理步骤中的后验众数偏移，并在集成到LFI步骤后，随后验推断一起调整支撑集。我们首先揭示了支撑集误设问题，并使用随机动力学基准评估我们的启发式方法。然后，我们评估启发式支撑集自适应对动态可变形线性物体（DLO）操作任务中参数推断和策略学习的影响。对于一组参数化的DLO，推断得到了更精细的长度和刚度分类。当这些后验被用作基于仿真的策略学习的领域分布时，它们带来了更稳健的以物体为中心的智能体性能。

Summary / 总结

The paper addresses the issue of potentially misspecified support in likelihood-free inference (LFI) for robotics, which can lead to suboptimal and falsely certain posteriors. To tackle this, three heuristic LFI methods—EDGE, MODE, and CENTRE—are proposed, each interpreting posterior mode shifts differently and adapting the support during inference. The methods are evaluated on stochastic dynamical benchmarks and a dynamic deformable linear object manipulation task, showing improved parameter inference and more robust agent performance.

论文解决了机器人领域中潜在的采样支持不准确问题，这可能导致次优且错误确定的后验分布。为此，提出了三种启发式LFI方法——EDGE、MODE和CENTRE，每种方法通过不同的方式解释后验模式的变化，并在推理步骤中适应支持。实验结果表明，支持的自适应调整能够更准确地进行参数推理和策略学习，特别是在动态可变形线性物体的长度和刚度分类方面表现出更优性能。

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

Authors: Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang

First: 2026-02-25T17:50:41+00:00 · Latest: 2026-02-25T17:50:41+00:00

Comments: Code: https://github.com/lingfengren/NoLan

Abstract

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.

中文标题/摘要

标题：NoLan：通过动态抑制语言先验减轻大型视觉语言模型中的对象幻觉

对象幻觉是大型视觉语言模型（LVLM）中的一个关键问题，其中输出包括输入图像中不存在的对象。从这一现象中自然会产生一个问题：在LVLM流水线中，哪个组件主要导致对象幻觉？是用于感知视觉信息的视觉编码器，还是用于生成文本响应的语言解码器？在本文中，我们通过设计系统实验来分析视觉编码器和语言解码器在幻觉生成中的作用来回答这个问题。我们的观察结果表明，对象幻觉主要与语言解码器的强大先验有关。基于这一发现，我们提出了一种简单且无需训练的框架，No-Language-Hallucination Decoding，NoLan，通过根据多模态输入和纯文本输入输出分布差异动态抑制语言先验来细化输出分布。实验结果表明，NoLan在不同任务的不同LVLM上有效减少了对象幻觉。例如，NoLan在POPE上取得了显著改进，分别提高了LLaVA-1.5 7B和Qwen-VL 7B的准确性6.45和7.21。代码可在：https://github.com/lingfengren/NoLan公开获取。

Summary / 总结

This paper addresses the issue of object hallucinations in Large Vision-Language Models (LVLMs) by analyzing the roles of the vision encoder and language decoder. It finds that language decoder priors are the primary cause of hallucinations and proposes NoLan, a framework that suppresses language priors to reduce hallucinations. Experiments show that NoLan significantly improves accuracy on tasks like POPE, enhancing LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively.

该论文通过研究视觉编码器和语言解码器在大型视觉-语言模型中的作用来解决物体幻觉问题。作者提出了一种名为NoLan的框架，通过动态抑制语言先验来减少幻觉。实验表明，NoLan在POPE等任务上显著提高了准确性，将LLaVA-1.5 7B和Qwen-VL 7B的准确率分别最多提升6.45和7.21。
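
The abstract describes suppressing language priors with a strength that depends on how much the multimodal and text-only output distributions differ. One way to picture this is a contrastive adjustment whose weight grows with the KL divergence between the two distributions. That specific rule, the cap `max_alpha`, and all function names below are assumptions for illustration. The abstract does not give the formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def suppress_language_prior(mm_logits, text_logits, max_alpha=1.0):
    """Down-weight tokens favoured by the text-only (language prior) distribution."""
    p_mm = softmax(mm_logits)
    p_text = softmax(text_logits)
    # Hypothetical modulation rule: stronger suppression when the two distributions disagree more.
    alpha = min(max_alpha, kl_divergence(p_mm, p_text))
    adjusted = [
        (1 + alpha) * math.log(pm + 1e-12) - alpha * math.log(pt + 1e-12)
        for pm, pt in zip(p_mm, p_text)
    ]
    return softmax(adjusted)
```

When the two distributions agree, `alpha` is zero and the multimodal distribution passes through unchanged. When the text-only prior strongly favours a token that the image does not support, that token loses probability mass.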

MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining

Authors: Yuetan Chu, Xinhua Ma, Xinran Jin, Gongning Luo, Xin Gao

First: 2026-02-25T17:49:03+00:00 · Latest: 2026-02-25T17:49:03+00:00

Abstract

Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining have not been examined sufficiently or systematically. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, while MedTri provides this normalization platform. Code and data will be released at https://github.com/Arturia-Pendragon-Iris/MedTri.

中文标题/摘要

标题：MedTri：一种结构化医学报告规范化平台，以增强视觉-语言预训练

医学视觉-语言预训练越来越多地依赖医学报告作为大规模监督信号；然而，原始报告往往表现出显著的风格异质性、长度变化以及大量与图像无关的内容。尽管文本规范化在先前工作中经常作为预处理步骤被采用，但其设计原则及其对视觉-语言预训练的实证影响仍然缺乏充分和系统的考察。在本研究中，我们提出了MedTri，一种可部署的规范化框架，将自由文本报告转换为统一的[解剖实体：影像描述 + 诊断类别]三元组。这种结构化、基于解剖学的规范化保留了重要的形态和空间信息，同时消除了风格噪音和与图像无关的内容，提供了大规模的一致性和基于图像的文本监督。在多个涵盖X射线和计算机断层扫描（CT）模态的数据集上，我们证明了结构化、基于解剖学的文本规范化是医学视觉-语言预训练质量的重要因素，相较于原始报告和现有规范化基线，其表现一致提升。此外，我们展示了这种规范化如何轻松支持模块化的文本级增强策略，包括知识丰富和基于解剖学的反事实监督，这些策略在不改变核心规范化过程的情况下提供了互补的鲁棒性和泛化能力增益。总之，我们的结果将结构化文本规范化定位为医学视觉-语言学习中一个关键且通用的预处理组件，而MedTri提供了这一规范化平台。代码和数据将在https://github.com/Arturia-Pendragon-Iris/MedTri/发布。
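
The target format, [Anatomical Entity: Radiologic Description + Diagnosis Category], maps naturally onto a small record type. The sketch below shows one possible container and renderer. The class name, field names, and example values are hypothetical, and extracting the three fields from a free-text report is out of scope here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportTriplet:
    anatomical_entity: str       # e.g. an organ or region named in the report
    radiologic_description: str  # morphological or spatial finding
    diagnosis_category: str      # coarse diagnostic label

    def render(self) -> str:
        """Format the triplet in the bracketed layout quoted in the abstract."""
        return (
            f"[{self.anatomical_entity}: "
            f"{self.radiologic_description} + {self.diagnosis_category}]"
        )
```

For example, `ReportTriplet("Left lower lobe", "patchy consolidation", "Pneumonia").render()` produces a single normalized string that can stand in for a sentence of the raw report as text supervision.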

WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs

Authors: Yulin Zhang, Cheng Shi, Sibei Yang

Venue: CVPR 2026

First: 2026-02-25T17:45:45+00:00 · Latest: 2026-02-25T17:45:45+00:00

Comments: Accepted at CVPR 2026 (preview; camera-ready in preparation)

Abstract

Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective, our Streaming Order Perception enhancement, that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/

中文标题/摘要

标题：WeaveTime：在视频LLMs中将早期帧流式汇入涌现记忆

近期多模态大型语言模型在视觉理解和推理方面取得了显著进步，但其二次注意力机制和离线训练协议使其不适合帧按序到达且未来观察不可用的流式设置。我们诊断了当前视频LLMs的核心局限性，即时间无感知性，即将视频视为无序的证据集合而非因果顺序序列，导致流式设置中的两个失败：时间顺序模糊，模型无法遵循或推理正确的顺序；过去与当前焦点盲视，无法区分当前观察与累积历史。我们提出了WeaveTime，一种简单、高效且模型无关的框架，首先教授顺序，然后利用顺序。我们引入了轻量级的时序重建目标——我们的流式顺序感知增强，以最小的微调和无需专门的流式数据来培养顺序感知表示。在推理时，过去与当前动态焦点缓存执行不确定性触发的粗到细检索，仅在需要时扩展历史。WeaveTime插件到现有的视频LLM中无需架构更改，即可在代表性流式基准测试中提供一致的性能提升，提高准确性同时减少延迟。这些结果确立了WeaveTime作为在严格在线、时间因果约束下时间感知流式视频LLMs的实用路径。代码和权重将公开发布。项目页面：https://zhangyl4.github.io/publications/weavetime/

Summary / 总结

The research addresses the limitations of current Video-LLMs in handling streaming settings by introducing WeaveTime, a framework that enhances temporal order awareness. It uses a lightweight Temporal Reconstruction objective to improve order-aware representations and a Past-Current Dynamic Focus Cache for efficient history retrieval. WeaveTime achieves consistent gains in accuracy and reduces latency on streaming benchmarks, demonstrating its practicality for time-aware Video-LLMs under strict online constraints.

研究通过引入WeaveTime框架来解决当前Video-LLMs在处理流式视频数据时的局限性，该框架增强了时间顺序意识。它使用轻量级的Temporal Reconstruction目标来改进顺序感知表示，并使用Past-Current Dynamic Focus Cache来高效管理历史。WeaveTime在无需对现有Video-LLMs进行架构更改的情况下，在流式基准测试中实现了持续的准确性和延迟改进。
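The uncertainty-triggered, coarse-to-fine retrieval mentioned in the WeaveTime abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the entropy threshold, the per-chunk relevance scores, and the names `entropy` and `select_context` are hypothetical and are not taken from the paper or its code.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_context(answer_probs, current_frames, chunk_scores, chunk_frames, threshold=0.5):
    """Pick the frames to attend to for one streaming query (hypothetical policy)."""
    # Confident prediction: stay on the current window and skip history.
    if entropy(answer_probs) <= threshold:
        return list(current_frames)
    # Coarse step: choose the history chunk with the highest relevance score.
    best_chunk = max(range(len(chunk_scores)), key=lambda i: chunk_scores[i])
    # Fine step: expand only that chunk into its stored frames.
    return list(chunk_frames[best_chunk]) + list(current_frames)

history_scores = [0.2, 0.8]
history_frames = [["f1", "f2"], ["f5", "f6"]]
confident = select_context([0.97, 0.01, 0.01, 0.01], ["f9"], history_scores, history_frames)
uncertain = select_context([0.25, 0.25, 0.25, 0.25], ["f9"], history_scores, history_frames)
```

In this toy setup the confident query keeps only the current frame, while the uncertain one pulls in the frames of the best-scoring history chunk, which mirrors the "expand history only when needed" behaviour described above.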

Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

Authors: Dhruv Verma, Andrew Qiu, Roberto Rangel, Ayandev Barman, Hao Yang, Chenjia Hu, Fengqi Zhang, Roman Genov, David B. Lindell, Kiriakos N. Kutulakos, Alex Mariakakis

Venue: CVPR 2026

First: 2026-02-25T17:42:44+00:00 · Latest: 2026-02-25T17:42:44+00:00

Comments: Accepted to CVPR 2026

Abstract

We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame's exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.

中文标题/摘要

标题：Lumosaic：通过主动照明和编码曝光像素实现高光谱视频

我们提出了Lumosaic，一种紧凑的主动高光谱视频系统，旨在实时捕捉动态场景。我们的方法结合了窄带LED阵列和具有高帧率、逐像素曝光控制能力的编码曝光像素（CEP）相机，能够在每个视频帧内联合编码场景信息的空间、时间和波长。与被动快照系统同时将光线分成多个光谱通道并假设帧曝光期间无运动不同，Lumosaic通过主动同步照明和逐像素曝光来提高光子利用率并保持在运动下的光谱保真度。基于学习的重建管道随后以30 fps和VGA分辨率恢复31通道高光谱（400-700 nm）视频，产生时空一致且光谱准确的重建。对合成和真实数据的实验表明，Lumosaic在重建保真度和时间稳定性方面显著优于现有的快照高光谱成像系统，能够在多种材料和运动条件下实现稳健的高光谱视频。

Summary / 总结

Lumosaic is a compact active hyperspectral video system that captures dynamic scenes in real-time by combining a narrowband LED array with a coded-exposure-pixel camera. This system synchronizes illumination and pixel-wise exposure to improve photon utilization and maintain spectral fidelity under motion, unlike passive snapshot systems. Experiments show that Lumosaic outperforms existing snapshot systems in terms of reconstruction fidelity and temporal stability, achieving 31-channel hyperspectral video at 30 fps and VGA resolution.

Lumosaic 是一种紧凑型主动高光谱视频系统，能够实时捕捉动态场景。它结合了窄带 LED 阵列和编码曝光像素相机，实现高速逐像素曝光控制，从而在每个帧内联合编码空间、时间和波长信息。实验表明，Lumosaic 在重建保真度和时间稳定性方面优于现有快照高光谱成像系统，能够在多种运动条件和材料类型下实现 31 通道高光谱视频，帧率为 30 fps，分辨率 VGA。
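As a quick worked example of the figures quoted for Lumosaic (31 channels over 400-700 nm at 30 fps and VGA resolution), the arithmetic below derives the band spacing and the raw output rate. Two assumptions are made that the abstract does not state: the bands are evenly spaced with both endpoints included, and VGA is taken to mean 640 x 480 pixels.

```python
# Assumption: 31 bands evenly spaced over 400-700 nm, endpoints included.
NUM_CHANNELS = 31
BAND_START_NM, BAND_STOP_NM = 400, 700
band_step_nm = (BAND_STOP_NM - BAND_START_NM) / (NUM_CHANNELS - 1)
band_centers_nm = [BAND_START_NM + i * band_step_nm for i in range(NUM_CHANNELS)]

# Assumption: VGA resolution is 640 x 480 pixels.
WIDTH, HEIGHT, FPS = 640, 480, 30
samples_per_second = WIDTH * HEIGHT * NUM_CHANNELS * FPS

print(band_step_nm)        # 10.0
print(samples_per_second)  # 285696000
```

Under these assumptions the reconstruction has a 10 nm band spacing and produces roughly 286 million spectral samples per second.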

Recursive Belief Vision Language Action Models

Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Kumar Patel

First: 2026-02-24T08:02:16+00:00 · Latest: 2026-02-25T17:38:24+00:00

Abstract

Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once per task, the VLM provides high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5 percent and 37.5 percent higher success rates on multi-stage pick-and-place and stacking tasks, respectively, compared to pi_0. It also reduces inference latency by up to five times relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show the belief module is the primary driver of performance, increasing success rates from 32.5 percent without belief to 77.5 percent with belief.

中文标题/摘要

标题：递归信念视语言行动模型

视语言行动模型必须使代理能够在部分可观测性下执行长期任务。然而，大多数现有方法仍依赖于短上下文窗口或反复查询视语言模型（VLMs），这导致任务进展丢失、在知觉混叠下重复执行动作以及高推理延迟。虽然语义接地很重要，但长期操作本质上需要持久的、基于动作的状态表示。当前的VLAs缺乏这样的表示，且在时间和物理推理方面表现出有限的能力，使它们不适合多阶段控制。本文引入了RB-VLA，这是一种以信念为中心的架构，通过自我监督的世界模型目标进行训练，保持一个紧凑的潜状态编码任务相关的历史、动力学和物体交互。VLM在每次任务时查询一次，提供高层次的意图，而信念追踪任务进展，并在部分可观测性下实现有阶段意识的、因果性的控制，无需存储原始观察或随时间扩展内存。信念和意图共同条件一个扩散策略，以实现稳健的闭环执行。RB-VLA 在长期任务基准测试中优于先前的VLAs，分别在多阶段拾取和放置任务以及堆叠任务中实现了52.5%和37.5%更高的成功率，相比pi_0。它还将推理延迟降低了最多五倍，并消除了现有VLAs在时间步长上观察到的内存增长。消融实验表明，信念模块是性能的主要驱动因素，信念模块的存在将成功率从没有信念时的32.5%提高到有信念时的77.5%。

Summary / 总结

This paper addresses the limitations of existing vision-language-action models by introducing RB-VLA, a belief-centric architecture that maintains a compact latent state for task-relevant history and dynamics. The model queries a vision-language model once per task for high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability. RB-VLA outperforms prior models on long-horizon tasks, achieving higher success rates and reducing inference latency by up to five times compared to baselines. Ablation studies indicate the belief module is crucial for performance improvement, significantly increasing success rates from 32.5% to 77.5%.

本文通过引入基于信念的RB-VLA架构，解决了现有视觉-语言-动作模型的局限性，该架构维持了一个紧凑的潜状态，用于任务相关的历史和动力学。该模型在每次任务时查询一次视觉-语言模型以获取高层次意图，而信念则跟踪任务进度并实现部分可观测性下的阶段感知、因果性控制。RB-VLA在长时任务上表现出色，相比基线模型在多阶段拾放和堆叠任务上的成功率分别提高了52.5%和37.5%，同时将推理延迟降低了五倍。消融实验表明，信念模块是性能提升的主要驱动力，将成功率从32.5%提高到77.5%。
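The contrast between a recursive belief and an observation buffer can be shown with a toy update rule. This is only a sketch under loud assumptions: the `tanh` update, the `decay` parameter, and the function name `update_belief` are hypothetical and do not reproduce the RB-VLA belief module, which is learned with world-model objectives.

```python
import math

def update_belief(belief, observation, action, decay=0.9):
    """One recursive step: the new belief depends only on the previous belief,
    the current observation features, and the last action (toy rule)."""
    return [
        math.tanh(decay * b + (1.0 - decay) * (o + a))
        for b, o, a in zip(belief, observation, action)
    ]

belief = [0.0, 0.0, 0.0]   # fixed-size latent state
raw_history = []           # what an observation-driven baseline would store

for step in range(100):
    observation = [math.sin(0.1 * step), 0.5, -0.2]
    action = [0.1, 0.0, 0.3]
    belief = update_belief(belief, observation, action)
    raw_history.append(observation)

print(len(belief), len(raw_history))  # 3 100
```

The belief keeps the same size after 100 steps while the raw history grows linearly, which is the memory behaviour the abstract attributes to belief-centric control.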

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

Authors: Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt, Zijian Wang, John Yang, Samuel Thompson

First: 2026-02-25T17:11:49+00:00 · Latest: 2026-02-25T17:11:49+00:00

Abstract

Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).

中文标题/摘要

标题：SWE-Protégé：学会有选择地与专家合作使小型语言模型成为软件工程代理

小型语言模型（SLMs）在成本、延迟和适应性方面具有显著优势，但在SWE-bench等长周期软件工程任务中仍落后于大型模型，表现为普遍的动作循环和较低的问题解决率。我们提出了SWE-Protégé，一种后训练框架，将软件修复重新定义为专家-学徒合作问题。在SWE-Protégé中，SLM 仍然是唯一的决策者，同时学习有选择地寻求强大专家模型的指导、识别停滞状态并遵循专家反馈。我们的方法结合了基于专家增强轨迹的监督微调和明确抑制退化循环和无生产力专家合作的代理强化学习。我们轻量级地后训练Qwen2.5-Coder-7B-Instruct，使其在SWE-bench Verified上的Pass@1达到42.4%，比之前的SLM最佳状态提高了25.4%，同时稀疏地使用专家协助（每任务约4次调用和总令牌的11%）。

Summary / 总结

The research aims to enhance the performance of small language models (SLMs) in long-horizon software engineering tasks by integrating expert guidance. The method is a post-training framework called SWE-Protégé, which teaches an SLM to selectively seek advice from a strong expert model, recognize stalled states, follow through on expert feedback, and avoid repetitive action loops. Key findings show that SWE-Protégé raises SLM performance on SWE-bench Verified to 42.4% Pass@1, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (about 4 calls per task and 11% of total tokens).

研究旨在通过引入专家指导来提升小语言模型（SLMs）在长期软件工程任务中的表现。方法是采用后训练框架SWE-Protégé，使SLMs能够选择性地向强专家模型寻求建议，识别何时遵循专家反馈，并避免重复行为。关键发现表明，SWE-Protégé使SLMs在SWE-bench Verified上的表现提升至42.4% Pass@1，相比之前的最佳SLM模型提高了25.4%，且专家协助的使用量很少。
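The idea of recognizing a stalled state and only then asking an expert can be sketched with a hand-written rule. Note the assumptions: in SWE-Protégé this behaviour is learned through fine-tuning and reinforcement learning, not hard-coded; the repeated-action test, the call budget, and the names `is_stalled` and `next_step` are hypothetical illustrations.

```python
def is_stalled(actions, window=3):
    """True when the last `window` actions are identical (a crude loop signal)."""
    return len(actions) >= window and len(set(actions[-window:])) == 1

def next_step(actions, expert_calls, max_expert_calls=4):
    """Decide whether the small model acts alone or asks the expert.
    The budget of 4 is an illustrative cap, not a rule from the paper."""
    if is_stalled(actions) and expert_calls < max_expert_calls:
        return "ask_expert"
    return "act"

looping = ["edit", "run_tests", "run_tests", "run_tests"]
print(next_step(looping, expert_calls=0))               # ask_expert
print(next_step(looping, expert_calls=4))               # act
print(next_step(["edit", "run_tests"], expert_calls=0)) # act
```

The small model stays the decision-maker throughout; the expert is consulted only when a loop is detected and the budget allows it.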

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Authors: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang

First: 2026-02-23T19:55:54+00:00 · Latest: 2026-02-25T17:11:08+00:00

Comments: CVPR 2026

Abstract

Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

中文标题/摘要

标题：QuantVLA：视觉-语言-行动模型的规模校准后训练量化

视觉-语言-行动（VLA）模型将感知、语言和控制统一起来，为具身智能体服务，但由于计算和内存需求迅速增加，尤其是在模型扩展到更长的时间范围和更大的骨干网络时，它们在实际部署中面临重大挑战。为了解决这些瓶颈，我们提出了QuantVLA，这是一种无需训练的后训练量化（PTQ）框架，据我们所知，这是第一个针对VLA系统的PTQ方法，也是第一个成功量化扩散Transformer（DiT）动作头的方法。QuantVLA 包含三个规模校准组件：（1）一种选择性量化布局，将所有线性层（包括语言骨干和DiT）转换为整数，而保留注意力投影为浮点数，以保持原始操作计划；（2）注意力温度匹配，这是一种轻量级的每头缩放机制，稳定注意力logits，并在推理时折叠到去量化比例中；（3）输出头平衡，这是一种每层残差接口校准，减轻了后投影能量漂移。该框架不需要额外的训练，仅使用少量未标记的校准缓冲区，并支持低位权重和激活的整数内核，同时保持架构不变。在LIBERO上的代表性VLA模型上，QuantVLA 超过了全精度基线的任务成功率，量化组件的相对内存节省约为70%，并提供了1.22倍的端到端推理延迟加速，为在严格的计算、内存和功率限制下实现可扩展的低位具身智能提供了实际途径。

Summary / 总结

QuantVLA is a training-free post-training quantization framework for vision-language-action models, addressing the compute and memory challenges of large-scale models. It includes selective quantization, attention temperature matching, and output head balancing to preserve model performance. QuantVLA improves task success rates, reduces memory usage by about 70%, and accelerates inference by 1.22x, making low-bit embodied intelligence practical under strict resource constraints.

QuantVLA 是一种无需训练的后训练量化框架，用于解决视觉-语言-行动模型在扩展时面临的计算和内存挑战。它包括选择性量化、注意力温度匹配和输出头平衡，以稳定和优化模型。QuantVLA 提高了任务成功率，将量化组件的内存使用减少了约 70%，并加速了端到端推理延迟 1.22 倍，在严格的计算、内存和功耗约束下实现了低比特嵌入式智能的实用性。
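To make the phrase "folded into the dequantization scales" concrete, here is a minimal sketch of symmetric integer quantization in which an extra scalar is merged into the dequantization scale. This is generic post-training quantization, not QuantVLA's calibration procedure; the `temperature` argument and the function names are assumptions for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale, temperature=1.0):
    """Map integers back to floats; a per-head scalar (hypothetical
    `temperature`) is folded into the scale, so it costs no extra op."""
    folded_scale = scale * temperature
    return [c * folded_scale for c in codes]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With round-to-nearest, `max_error` stays within half a quantization step (`scale / 2`), and changing `temperature` rescales the outputs without touching the stored integers.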

Probing the Geometry of Diffusion Models with the String Method

Authors: Elio Moreau, Florentin Coeurdoux, Grégoire Ferre, Eric Vanden-Eijnden

First: 2026-02-25T17:10:59+00:00 · Latest: 2026-02-25T17:10:59+00:00

Abstract

Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring their landscape remain limited. Standard latent-space interpolations fail to respect the structure of the learned distribution, often traversing low-density regions. We introduce a framework based on the string method that computes continuous paths between samples by evolving curves under the learned score function. Operating on pretrained models without retraining, our approach interpolates between three regimes: pure generative transport, which yields continuous sample paths; gradient-dominated dynamics, which recover minimum energy paths (MEPs); and finite-temperature string dynamics, which compute principal curves, i.e. self-consistent paths that balance energy and entropy. We demonstrate that the choice of regime matters in practice. For image diffusion models, MEPs contain high-likelihood but unrealistic "cartoon" images, confirming prior observations that likelihood maxima appear unrealistic; principal curves instead yield realistic morphing sequences despite lower likelihood. For protein structure prediction, our method computes transition pathways between metastable conformers directly from models trained on static structures, yielding paths with physically plausible intermediates. Together, these results establish the string method as a principled tool for probing the modal structure of diffusion models: identifying modes, characterizing barriers, and mapping connectivity in complex learned distributions.

中文标题/摘要

标题：用弦方法探究扩散模型的几何结构

理解学习分布的几何结构对于改进和解释扩散模型至关重要，但系统性的探索其景观的工具仍然有限。标准的潜在空间插值无法尊重学习分布的结构，通常会穿越低密度区域。我们提出了一种基于弦方法的框架，通过在学习得分函数下演化曲线来计算样本之间的连续路径。在无需重新训练预训练模型的情况下，我们的方法在三种范式之间进行插值：纯生成传输，产生连续样本路径；梯度主导动力学，恢复最低能量路径（MEP）；以及有限温度弦动力学，计算主曲线——自洽路径，平衡能量和熵。我们证明了范式的选择在实践中很重要。对于图像扩散模型，MEP包含高概率但不现实的“卡通”图像，证实了先前观察到的最可几点看起来不现实；相反，主曲线生成了尽管概率较低但现实的形态序列。对于蛋白质结构预测，我们的方法直接从训练静态结构的模型中计算出从亚稳构象到构象的过渡路径，生成具有物理上合理中间体的路径。综上所述，这些结果确立了弦方法作为探究扩散模型模态结构的原理性工具——识别模态、表征障碍并映射复杂学习分布的连通性。

Summary / 总结

This paper addresses the challenge of understanding the geometry of diffusion models by introducing a string method framework. The method computes continuous paths between samples by evolving curves under the learned score function, enabling exploration of three regimes: generative transport, gradient-dominated dynamics, and finite-temperature string dynamics. Experiments on image and protein structure prediction show that the choice of regime significantly impacts results, with MEPs yielding unrealistic but high-likelihood images and principal curves providing realistic morphing sequences, while the string method computes physically plausible transition pathways for protein structures.

研究旨在探索扩散模型的几何结构，以改进和解释其学习分布。作者引入了一种基于弦方法的框架，通过在学习的评分函数下演化曲线来计算样本之间的连续路径。该方法在不重新训练预训练模型的情况下运行，并在三种模式之间进行插值：纯生成传输、梯度主导动力学和有限温度弦动力学。研究发现，选择的模式会影响结果，最小能量路径包含高概率但不现实的图像，而主要曲线则产生现实的形态序列。对于蛋白质结构预测，该方法直接从训练静态结构的模型中计算出过渡路径，提供物理上合理的中间体。
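The gradient-dominated regime described above (recovering a minimum energy path) can be illustrated with a zero-temperature string iteration on a toy 2D potential. This is a sketch under stated assumptions: the analytic double-well `V` stands in for a learned score (for a density proportional to exp(-V), the score is the negative gradient of V), and the step size, image count, and function names are illustrative choices, not the authors' settings.

```python
import math

def V(x, y):
    """Toy double-well potential with minima at (-1, 0) and (1, 0)."""
    return (x * x - 1.0) ** 2 + 2.0 * y * y

def grad_V(x, y):
    return 4.0 * x * (x * x - 1.0), 4.0 * y

def reparametrize(points):
    """Redistribute images at equal arc length along the polyline."""
    n = len(points)
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    out, j = [points[0]], 0
    for i in range(1, n - 1):
        target = cum[-1] * i / (n - 1)
        while cum[j + 1] < target:
            j += 1
        w = (target - cum[j]) / (cum[j + 1] - cum[j])
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        out.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
    out.append(points[-1])
    return out

def string_method(points, step=0.01, iters=2000):
    for _ in range(iters):
        # Move every image along the score direction (downhill on V).
        moved = [(x - step * gx, y - step * gy)
                 for (x, y), (gx, gy) in ((p, grad_V(*p)) for p in points)]
        # Equal-arc-length redistribution keeps images from collapsing into minima.
        points = reparametrize(moved)
    return points

n_images = 21
initial = [(-1.0 + 2.0 * i / (n_images - 1),
            0.5 * math.sin(math.pi * i / (n_images - 1))) for i in range(n_images)]
path = string_method(initial)
barrier = max(V(x, y) for x, y in path)
```

Starting from an arched curve between the two minima, the string relaxes onto the straight segment through the saddle at the origin, and `barrier` approaches the saddle energy of 1. The finite-temperature and pure-transport regimes from the abstract are not covered by this sketch.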

A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation

Authors: Georgios Kamaras, Subramanian Ramamoorthy

Venue: In IEEE Robotics and Automation Letters, Volume 10, Issue 8, August 2025, Pages 8075-8082

First: 2025-02-25T20:01:06+00:00 · Latest: 2026-02-25T17:09:15+00:00

Abstract

We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters with which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensing) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.

中文标题/摘要

标题：基于视觉驱动的可变形线性物体操作中物体中心代理适应的实2仿2实分布处理方法

我们提出了一种集成（或端到端）框架，用于基于视觉感知操纵可变形线性物体（DLOs）的实2仿2实问题。使用参数化的DLO集合，我们使用无似然推断（LFI）来计算物理参数的后验分布，从而可以近似模拟每个特定DLO的行为。在训练过程中，我们使用这些后验分布进行领域随机化，在仿真中使用无模型强化学习为DLO抓取任务训练特定于物体的视知觉运动策略（即，假设只有视觉和本体感觉感知）。我们通过零样本方式部署仿真实训的DLO操作策略，即无需任何进一步微调来展示该方法的实用性。在此背景下，我们评估了一种流行的LFI方法在仅使用动态操作轨迹中获得的视觉和本体感觉数据对参数化DLO集合进行精细分类的能力。然后，我们研究了基于仿真的策略学习和实际性能中结果领域分布的影响。

Summary / 总结

The paper presents an integrated framework for manipulating deformable linear objects (DLOs) in the real world using visual perception. It uses likelihood-free inference to compute the posterior distributions of physical parameters for DLOs, which are then used for domain randomisation in simulation. Object-specific visuomotor policies are trained using model-free reinforcement learning. The approach is demonstrated to work in the real world without further fine-tuning, and the utility of a prominent LFI method is evaluated for fine classification of DLOs using visual and proprioceptive data.

论文提出了一种集成框架，用于基于视觉感知操纵可变形线性物体（DLOs）。该框架利用无似然推断（LFI）计算DLOs的物理参数后验分布，并将这些后验分布用于仿真中的领域随机化，从而训练特定物体的视觉-运动策略。训练好的策略无需进一步微调即可在现实世界中部署，展示了该框架的实用性；论文还评估了一种主流LFI方法对DLO进行精细分类的能力。
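Likelihood-free inference followed by posterior-driven domain randomisation can be sketched with rejection ABC, one of the simplest LFI schemes. The abstract does not say which LFI method is used, so this is a swapped-in technique; the one-parameter toy simulator, the uniform prior, and the tolerance are all assumptions for illustration.

```python
import random

def simulate(stiffness, rng):
    """Toy stand-in for a physics simulator: one noisy summary statistic."""
    return 2.0 * stiffness + rng.gauss(0.0, 0.05)

def abc_rejection(observed, num_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lands close to the observation."""
    accepted = []
    for _ in range(num_draws):
        candidate = rng.uniform(0.0, 1.0)           # draw from the prior
        if abs(simulate(candidate, rng) - observed) < tolerance:
            accepted.append(candidate)
    return accepted

rng = random.Random(0)
observed = 0.6                                      # summary from a "real" object with stiffness 0.3
posterior = abc_rejection(observed, num_draws=5000, tolerance=0.05, rng=rng)
posterior_mean = sum(posterior) / len(posterior)

# Domain randomisation: each training episode samples its parameter from the posterior.
episode_stiffness = rng.choice(posterior)
```

Sampling simulation parameters from `posterior` instead of the broad prior concentrates training on physically plausible objects, which is the role the posteriors play in the framework described above.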

TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

Authors: Alireza Salehi, Ehsan Karami, Sepehr Noey, Sahand Noey, Makoto Yamada, Reshad Hosseini, Mohammad Sabokrou

Venue: ICASSP

First: 2026-02-03T14:48:11+00:00 · Latest: 2026-02-25T17:00:45+00:00

Comments: This is the extended version of the paper accepted in ICASSP'26, which will be publicly available in May. Authors' contributions may vary among the versions

Abstract

Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior works compensate with complex auxiliary modules yet largely overlook the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection and learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.

中文标题/摘要

标题：TIPS胜于技巧：简单提示实现有效的零样本异常检测

异常检测在安全关键环境中识别预期行为的偏差。当目标域正常数据不可用时，零样本异常检测（ZSAD）利用视觉语言模型（VLMs）。然而，CLIP的粗略图像-文本对齐限制了定位和检测，原因在于（i）空间对齐偏差和（ii）对细微异常的弱敏感性；先前的工作通过复杂的辅助模块进行补偿，但很大程度上忽视了主干网络的选择。我们重新审视了主干网络，并使用TIPS-VLM，该模型通过空间感知目标进行训练。虽然TIPS缓解了CLIP的问题，但它暴露了全局特征和局部特征之间的分布差距。我们通过分离的提示（固定用于图像级检测，可学习用于像素级定位）和将局部证据注入全局评分来解决这一问题。在不使用CLIP特定技巧的情况下，基于TIPS的管道在七个工业数据集上提高了图像级性能1.1-3.9%，像素级性能1.5-6.9%，实现了强大的泛化能力，同时保持了简洁的架构。代码可在github.com/AlirezaSalehy/Tipsomaly获取。

Summary / 总结

The paper addresses the challenge of zero-shot anomaly detection in safety-critical settings where target-domain normal data are unavailable. It proposes using TIPS, a vision-language model trained with spatially aware objectives, as the backbone. The method uses decoupled prompts (fixed for image-level detection, learnable for pixel-level localization) and injects local evidence into the global score, improving image-level performance by 1.1-3.9% and pixel-level performance by 1.5-6.9% across seven industrial datasets and demonstrating strong generalization with a lean architecture.

论文针对目标领域正常数据不可用的安全关键设置中的零样本异常检测挑战，提出使用具有空间感知目标训练的TIPS视觉语言模型来改进异常检测。该方法使用固定图像级检测的解耦提示和可学习像素级定位的提示，分别在七个工业数据集上提高了1.1-3.9%和1.5-6.9%的图像级和像素级性能，展示了强大的泛化能力与简洁的架构。
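One simple way to picture "injecting local evidence into the global score" is to blend the image-level score with the strongest pixel-level response. The blending rule, the `weight` parameter, and the function name `image_score` are assumptions for illustration; the abstract does not specify how the paper combines the two.

```python
def image_score(global_score, pixel_map, weight=0.5):
    """Blend an image-level anomaly score with the peak of the pixel-level map.
    The convex combination and `weight` are illustrative assumptions."""
    local_peak = max(max(row) for row in pixel_map)
    return (1.0 - weight) * global_score + weight * local_peak

# A small defect: low global score, but one pixel responds strongly.
pixel_map = [
    [0.1, 0.9],
    [0.0, 0.3],
]
combined = image_score(0.2, pixel_map)
print(round(combined, 2))  # 0.55
```

A small, localized defect that barely moves the global score still raises the combined score here, which is the kind of gap between global and local features the abstract points to.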

Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection

Authors: Andrea Apicella, Francesco Isgrò, Andrea Pollastro, Roberto Prevete

First: 2026-02-25T16:56:14+00:00 · Latest: 2026-02-25T16:56:14+00:00

Abstract

Despite the extensive literature on training loss functions, the evaluation of generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with attention to early stopping. Using fully connected networks on standard benchmarks under $k$-fold evaluation, we compare: (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e. no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; the model parameter selection on the validation set is made using accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield comparable and more stable test accuracy. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, the selected model typically achieves test-set performance statistically lower than the best performance across all epochs, regardless of the validation criterion. Our results suggest avoiding validation accuracy (in particular with early stopping) for parameter selection, favoring loss-based validation criteria.

中文标题/摘要

标题：别停下我：重新思考模型参数选择的验证标准

尽管有关训练损失函数的文献非常丰富，但在验证集上的泛化评估仍然被忽视。在本文中，我们系统地研究了用于模型选择的验证标准如何影响神经分类器的测试性能，特别关注提前停止。使用标准基准下的全连接网络进行$k$折评估，我们比较了：(i) 带有耐心的提前停止和(ii) 所有轮次的后验选择（即没有提前停止）。模型使用交叉熵、C-损失或PolyLoss进行训练；验证集上的模型参数选择分别使用准确率或三种损失函数之一，每种函数独立考虑。主要发现有三点。(1) 基于验证准确率的提前停止表现最差，始终选择测试准确率低于基于损失的提前停止和后验选择的检查点。(2) 基于损失的验证标准在测试准确率上具有可比性和更高的稳定性。(3) 在不同数据集和折上，任何单一的验证规则通常都劣于测试最优检查点。总体而言，所选模型的测试集性能通常低于所有轮次中的最佳性能，无论使用何种验证标准。我们的结果表明，在参数选择时应避免使用验证准确率（特别是与提前停止一起使用），而应偏好基于损失的验证标准。

Summary / 总结

This study explores how different validation criteria affect model selection in neural classifiers. By comparing early stopping with patience and post-hoc selection, the research finds that early stopping based on validation accuracy performs worst, selecting checkpoints with lower test accuracy than loss-based early stopping or post-hoc selection. Loss-based validation criteria yield more stable and comparable test accuracy, and no single validation rule consistently outperforms the test-optimal checkpoint across datasets and folds.

这项研究探讨了不同验证标准如何影响神经分类器的模型选择。它使用交叉熵、C-损失和PolyLoss比较了基于耐心的提前停止和所有epoch的后验选择。主要发现包括，基于验证准确性的提前停止通常选择的检查点在测试集上的准确率低于基于损失的提前停止和后验选择。基于损失的验证标准能提供更稳定且可比较的测试准确率。研究建议避免使用验证准确率（尤其是与提前停止一起使用），并倾向于使用基于损失的验证标准进行模型选择。
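The two selection rules compared in this study, early stopping with patience and post-hoc selection over all epochs, can be written down in a few lines. This is a minimal sketch: the validation curve is made up for illustration and the function names are hypothetical, not taken from the paper.

```python
def early_stopping_epoch(scores, patience, higher_is_better):
    """Return the checkpoint kept by early stopping with the given patience."""
    best_epoch, best = 0, scores[0]
    for epoch in range(1, len(scores)):
        score = scores[epoch]
        improved = score > best if higher_is_better else score < best
        if improved:
            best_epoch, best = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

def post_hoc_epoch(scores, higher_is_better):
    """Return the best checkpoint when every epoch is trained and inspected."""
    pick = max if higher_is_better else min
    return pick(range(len(scores)), key=lambda epoch: scores[epoch])

# Made-up validation loss curve with a temporary plateau.
val_loss = [0.90, 0.70, 0.72, 0.71, 0.60, 0.65]
print(early_stopping_epoch(val_loss, patience=2, higher_is_better=False))  # 1
print(post_hoc_epoch(val_loss, higher_is_better=False))                    # 4
```

With this curve, patience-based stopping halts on the plateau and keeps epoch 1, while post-hoc selection finds the later, better epoch 4. Passing accuracy with `higher_is_better=True` gives the accuracy-based variant of either rule.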

Multitask Learning with Stochastic Interpolants

Authors: Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden

First: 2025-08-06T16:25:19+00:00 · Latest: 2026-02-25T16:49:58+00:00

Abstract

We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.

中文标题/摘要

标题：基于随机插值的多任务学习

我们提出了一种框架，用于学习概率分布之间的映射，这广泛地概括了流模型和扩散模型的时间动态。为此，我们通过将标量时间变量替换为向量、矩阵或线性算子，来推广随机插值，从而使我们能够跨越多个维度空间连接概率分布。这种方法使我们能够构建多功能的生成模型，这些模型能够在无需针对特定任务进行训练的情况下完成多个任务。基于算子的插值不仅为我们提供了一个统一的理论视角来理解现有的生成模型，而且还扩展了它们的功能。通过数值实验，我们展示了该方法在条件生成和修复、微调和后验采样以及多尺度建模方面的零样本有效性，这表明它可能作为一种通用的任务无关替代方案具有潜力，而无需专门的模型。

Summary / 总结

The research aims to develop a framework for learning maps between probability distributions, extending the capabilities of existing generative models. The method involves generalizing stochastic interpolants by using vectors, matrices, or linear operators instead of scalar time variables, enabling the bridging of probability distributions across multiple dimensional spaces. Key experimental findings show that the proposed approach is effective for zero-shot conditional generation, inpainting, fine-tuning, posterior sampling, and multiscale modeling, suggesting its potential as a versatile task-agnostic alternative to specialized models.

研究提出了一种在概率分布之间学习映射的框架，将随机插值扩展到向量、矩阵或线性算子。这使得可以构建多功能的生成模型，这些模型可以在不进行特定任务训练的情况下执行多个任务。实验表明，该方法在条件生成、修复、微调、后验采样和多尺度建模方面具有零样本效果，表明其作为特定任务替代的通用无任务模型的潜力。
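Replacing the scalar time variable with a vector can be illustrated with a linear interpolant that uses one time value per coordinate. This is a sketch under assumptions: the linear form and the inpainting-style mask are illustrative choices, and the function name `interpolant` is hypothetical rather than the paper's notation.

```python
def interpolant(noise, data, alpha):
    """Per-coordinate linear interpolant: alpha[i] = 0 gives noise, 1 gives data."""
    return [(1.0 - a) * n + a * d for n, d, a in zip(noise, data, alpha)]

noise = [0.3, -1.2, 0.8]
data = [5.0, 6.0, 7.0]

# Scalar time is the special case where every coordinate shares one value.
halfway = interpolant(noise, data, [0.5, 0.5, 0.5])

# Inpainting-style setting: the first two coordinates are observed (time 1),
# the last one is still pure noise (time 0) and remains to be generated.
masked = interpolant(noise, data, [1.0, 1.0, 0.0])
print(masked)  # [5.0, 6.0, 0.8]
```

Choosing different `alpha` vectors at inference is what lets a single model cover several tasks in this picture, for example conditioning on known coordinates while generating the rest.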

Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

Authors: Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Antonino Ferraro, Vincenzo Moscato

First: 2026-02-25T16:46:45+00:00 · Latest: 2026-02-25T16:46:45+00:00

Abstract

Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, Brain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.

中文标题/摘要

标题：Brain3D：通过膨胀的3D视觉变换器进行脑部报告自动化

当前的医学视觉-语言模型（VLMs）使用2D切片近似处理体积脑MRI，分割了准确神经放射学解释所需的空间上下文。我们开发了Brain3D，这是一种分阶段的视觉-语言框架，用于从3D脑肿瘤MRI生成自动化放射学报告。我们的方法将预训练的2D医学编码器膨胀为原生3D架构，并通过三个阶段逐步与因果语言模型对齐：对比性定位、监督投影预热和LoRA基于的语言专业化。与通用的3D医学VLMs不同，Brain3D专门针对神经放射学，其中半球侧向性、肿瘤浸润模式和解剖定位至关重要。在468名受试者（BraTS病理病例加上健康对照）上进行评估，我们的模型在临床病理F1值上达到0.951，而强2D基线模型仅为0.413，同时在健康扫描上保持完美的特异性。分阶段对齐至关重要：对比性定位建立视觉-文本对应关系，投影预热稳定条件，而LoRA适应使输出从冗长的描述性标题转变为结构化的临床报告。我们的代码已公开以确保透明性和可再现性。

Summary / 总结

Brain3D is a staged vision-language framework designed for automated radiology report generation from 3D brain tumor MRI. It inflates a pretrained 2D medical encoder into a 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Evaluated on 468 subjects, Brain3D achieves a Clinical Pathology F1 score of 0.951, significantly outperforming a 2D baseline, and maintains perfect specificity on healthy scans.

Brain3D 是一种分阶段的视觉-语言框架，用于从 3D 脑肿瘤 MRI 中自动生成放射学报告。它将一个预训练的 2D 医疗编码器扩展为 3D 架构，并通过三个阶段逐步与因果语言模型对齐：对比性定位、监督投影预热和 LoRA 基础的语言专业化。在 468 个受试者（包括 BraTS 病理病例和健康对照）上进行评估，Brain3D 达到了 0.951 的临床病理 F1 分数，显著优于强大的 2D 基线，并在健康扫描上保持了完美的特异性。分阶段的对齐对于建立视觉-文本对应关系和将输出转变为结构化的临床报告至关重要。
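Inflating 2D weights into 3D is commonly done by repeating a 2D kernel along the new depth axis and dividing by the depth, so that a volume made of identical slices produces the same response as the original 2D filter. The sketch below shows that common scheme; whether Brain3D uses exactly this rule is an assumption, and `inflate_kernel` is a hypothetical name.

```python
def inflate_kernel(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along a new axis, scaled by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]

kernel_2d = [
    [1.0, -2.0],
    [0.5, 4.0],
]
kernel_3d = inflate_kernel(kernel_2d, depth=4)

# Summing over depth recovers the original 2D kernel, so a volume of
# identical slices yields the same activation as the 2D filter did.
collapsed = [
    [sum(kernel_3d[d][r][c] for d in range(4)) for c in range(2)]
    for r in range(2)
]
print(collapsed)  # [[1.0, -2.0], [0.5, 4.0]]
```

This initialization lets the 3D encoder start from the pretrained 2D behaviour instead of random weights before it is aligned with the language model.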

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

Authors: Wenhua Wu, Huai Guan, Zhe Liu, Hesheng Wang

First: 2026-02-25T16:44:12+00:00 · Latest: 2026-02-25T16:44:12+00:00

Abstract

Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. Image-level weather editing methods, in turn, tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.

中文标题/摘要

标题：WeatherCity：可控多天气变换的城市场景重建

可编辑的高保真4D场景对于自动驾驶至关重要，因为它们可以应用于端到端训练和闭环仿真。然而，现有的重建方法主要局限于复制观察到的场景，缺乏多样天气模拟的能力。虽然图像级别的天气编辑方法往往会引入场景伪影，并且对天气效果的可控性较差。为了解决这些限制，我们提出了WeatherCity，一种新颖的4D城市场景重建和天气编辑框架。具体来说，我们利用文本引导的图像编辑模型实现图像天气背景的灵活编辑。为了解决多天气建模的挑战，我们引入了一种基于共享场景特征和专用天气特定解码器的新型天气高斯表示。该表示进一步通过内容一致性优化，确保在不同天气条件下的一致性建模。此外，我们设计了一个基于物理的模型，通过粒子和运动模式模拟动态天气效果。在多个数据集和各种场景上的大量实验表明，WeatherCity在4D重建和天气编辑中实现了灵活的可控性、高保真度和时间一致性。我们的框架不仅能够对天气条件进行精细控制（例如，小雨和大雪），还支持场景中的对象级操作。

Summary / 总结

WeatherCity is a novel framework for 4D urban scene reconstruction and weather editing, addressing the limitations of existing methods by introducing a text-guided image editing model and a weather Gaussian representation. The framework achieves flexible and controllable weather effects, ensuring high fidelity and temporal consistency in 4D scenes. Extensive experiments show that WeatherCity can handle various weather conditions and support object-level manipulation within the scene.

WeatherCity 是一种用于 4D 城市场景重建和天气编辑的新框架，通过引入文本引导的图像编辑模型和天气高斯表示来解决现有方法的局限性。它实现了灵活可控的天气效果，确保了 4D 场景的高保真度和时间一致性。实验表明，WeatherCity 支持对天气条件的精细控制和场景内的对象级操作。

Learning Partial Graph Matching via Optimal Partial Transport

Authors: Gathika Ratnayaka, James Nichols, Qing Wang

First: 2024-10-22T05:56:57+00:00 · Latest: 2026-02-25T16:44:01+00:00


Abstract

Partial graph matching extends traditional graph matching by allowing some nodes to remain unmatched, enabling applications in more complex scenarios. However, this flexibility introduces additional complexity, as both the subset of nodes to match and the optimal mapping must be determined. While recent studies have explored deep learning techniques for partial graph matching, a significant limitation remains: the absence of an optimization objective that fully captures the problem's intrinsic nature while enabling efficient solutions. In this paper, we propose a novel optimization framework for partial graph matching, inspired by optimal partial transport. Our approach formulates an objective that enables partial assignments while incorporating matching biases, using weighted total variation as the divergence function to guarantee optimal partial assignments. Our method achieves efficient, exact solutions with cubic worst-case time complexity. Our contributions are threefold: (i) we introduce a novel optimization objective that balances matched and unmatched nodes; (ii) we establish a connection between partial graph matching and the linear sum assignment problem, enabling efficient solutions; (iii) we propose a deep graph matching architecture with a novel partial matching loss, providing an end-to-end solution. Empirical evaluations on standard graph matching benchmarks demonstrate the efficacy of the proposed approach.
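The link between partial matching and linear sum assignment noted in contribution (ii) can be illustrated with a minimal sketch. It assumes a hypothetical per-node penalty `rho` for leaving a node unmatched and pads the cost matrix with dummy rows and columns, so that "unmatched" becomes an ordinary assignment to a dummy. This is a generic padding construction and not necessarily the paper's formulation; the brute-force search below merely stands in for a cubic-time solver such as the Hungarian algorithm and is only usable on tiny inputs.

```python
from itertools import permutations


def partial_match(cost, rho):
    """Return (total_cost, pairs) for a partial matching.

    cost[i][j] is the cost of matching source node i to target node j.
    rho is a hypothetical penalty paid for each node left unmatched.
    """
    n, m = len(cost), len(cost[0])
    size = n + m  # pad so every real node can pair with a dummy instead
    padded = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                padded[i][j] = cost[i][j]  # real source matched to real target
            elif i < n or j < m:
                padded[i][j] = rho         # a real node paired with a dummy
            # dummy-to-dummy pairs keep cost 0

    best_total, best_perm = float("inf"), None
    for perm in permutations(range(size)):  # brute force, illustration only
        total = sum(padded[i][perm[i]] for i in range(size))
        if total < best_total:
            best_total, best_perm = total, perm

    # Keep only real-to-real assignments; everything else is "unmatched".
    pairs = [(i, best_perm[i]) for i in range(n) if best_perm[i] < m]
    return best_total, pairs


total, pairs = partial_match([[1.0, 9.0], [9.0, 8.0]], rho=3.0)
print(total, pairs)
```

With `rho=3.0`, matching the second source to the second target (cost 8.0) is more expensive than leaving both unmatched (3.0 + 3.0), so only the first pair is kept; raising `rho` makes the full matching preferable.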

中文标题/摘要

标题：基于最优部分运输的图部分匹配学习

部分图匹配通过允许一些节点保持未匹配状态，扩展了传统的图匹配方法，使其能够应用于更复杂的场景。然而，这种灵活性引入了额外的复杂性，因为不仅需要确定匹配的节点子集，还需要确定最优映射。尽管最近的研究探索了深度学习技术在部分图匹配中的应用，但仍然存在一个重要的限制：缺乏一个能够全面捕捉问题本质并支持高效解决方案的优化目标。在本文中，我们提出了一种基于最优部分运输的新颖优化框架，用于部分图匹配。我们的方法通过引入一个目标函数来实现部分分配，该目标函数同时考虑匹配和未匹配节点，并使用加权总变差作为散度函数以确保最优部分分配。我们的方法可以在最坏情况下实现高效且精确的解决方案，时间复杂度为立方级。我们的贡献包括三个方面：(i) 引入了一种新的优化目标，平衡匹配和未匹配节点；(ii) 建立了部分图匹配与线性分配问题之间的联系，从而实现高效解决方案；(iii) 提出了一种具有新颖部分匹配损失的深度图匹配架构，提供端到端的解决方案。在标准图匹配基准上的实证评估表明了所提出方法的有效性。

Summary / 总结

This paper addresses the challenge of partial graph matching by proposing a novel optimization framework inspired by optimal partial transport. The method formulates an objective that balances matched and unmatched nodes, using weighted total variation to ensure optimal partial assignments. It achieves efficient, exact solutions with cubic worst-case time complexity. The approach connects partial graph matching to the linear sum assignment problem and introduces a deep graph matching architecture with a novel partial matching loss, providing an end-to-end solution. Empirical evaluations on standard graph matching benchmarks show the proposed method's effectiveness.

本文通过提出一种受最优部分运输启发的新优化框架来解决部分图匹配的挑战。该方法通过使用加权总变差来确保部分最优分配，同时平衡匹配和未匹配的节点。它在最坏情况下以立方时间复杂度实现高效且精确的解决方案。该方法将部分图匹配与线性分配问题联系起来，并提出了一种具有新颖部分匹配损失的深度图匹配架构，提供了一种端到端的解决方案。实证评估表明，所提出的方法在标准图匹配基准上的有效性。

Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Authors: Hexin Dong, Yi Lin, Pengyu Zhou, Fengnian Zhao, Alan Clint Legasto, Mingquan Lin, Hao Chen, Yuzhe Yang, George Shih, Yifan Peng

First: 2026-02-25T16:39:21+00:00 · Latest: 2026-02-25T16:39:21+00:00


Abstract

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.
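Since the challenge ranks systems by mean Average Precision, a minimal sketch of the metric may help. It assumes the common definition of per-class AP (the mean of the precision measured at the rank of each positive sample, after sorting by descending score), averaged over classes. The function names and the toy data are illustrative assumptions; the challenge's official evaluation code may differ in details such as tie handling.

```python
def average_precision(labels, scores):
    """AP for one class: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    hits, precisions = 0, []
    for rank, k in enumerate(order, start=1):
        if labels[k] == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 in this sketch (an assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """mAP: per-class AP averaged over the columns (classes)."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        aps.append(average_precision(labels, scores))
    return sum(aps) / num_classes


# Toy multi-label example: 4 images, 2 classes (made-up values).
label_matrix = [[1, 0], [0, 1], [1, 0], [0, 0]]
score_matrix = [[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]]
print(mean_average_precision(label_matrix, score_matrix))
```

In the toy data, the first class has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and the second class ranks its only positive first (AP = 1), so rare classes weigh as much as common ones in the average.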

中文标题/摘要

标题：CXR-LT 2026 挑战概览：多中心长尾和零样本胸部X射线分类

胸部X射线（CXR）解释受到病理分布长尾效应和临床环境开放世界的阻碍。现有基准通常依赖单一机构的封闭类集，未能捕捉到罕见疾病的流行程度或新发现的出现。为解决这一问题，我们提出了CXR-LT 2026挑战。这是基准的第三次迭代，引入了一个多中心数据集，包含来自PadChest和NIH胸部X射线数据集的超过145,000张图像。挑战定义了两个核心任务：（1）在30个已知类别上进行鲁棒多标签分类；（2）在6个未见过（分布外）罕见疾病类别上进行开放世界泛化。我们报告了表现最佳团队的结果，通过平均精确度（mAP）、AUROC和F1分数进行评估。获胜解决方案在任务1上达到了0.5854的mAP，在任务2上达到了0.4315的mAP，表明大规模的视觉-语言预训练显著缓解了零样本诊断通常伴随的性能下降。

Summary / 总结

The CXR-LT 2026 challenge aims to address the limitations of existing benchmarks in chest X-ray interpretation by introducing a multi-center dataset with over 145,000 images. It includes two tasks: robust multi-label classification on 30 known classes and open-world generalization to 6 unseen rare disease classes. The top-performing teams achieved an mAP of 0.5854 and 0.4315 for the two tasks, respectively, highlighting the effectiveness of large-scale vision-language pre-training in mitigating performance drops for zero-shot diagnosis.

CXR-LT 2026挑战旨在通过引入一个多中心数据集（包含超过145,000张图像）来解决现有基准在胸部X光解读中的局限性。该挑战包括两个任务：在30个已知类别上进行稳健的多标签分类和在6个未见过的罕见疾病类别上进行开放世界泛化。表现最佳的团队分别在两个任务中达到了0.5854和0.4315的mAP，这表明大规模的视觉-语言预训练在零样本诊断中显著减轻了性能下降的问题。

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Authors: Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan

Venue: CVPR 2026

First: 2026-02-25T16:38:53+00:00 · Latest: 2026-02-25T16:38:53+00:00

Comments: Accepted at CVPR 2026


Abstract

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

中文标题/摘要

标题：学习驾驶是一项免费礼物：从未摆拍的野外视频中进行大规模无标注自主预训练

在线可穿戴式驾驶视频为自动驾驶提供了丰富的视觉数据来源，但由于缺乏标注，难以学习同时捕捉语义结构和三维几何的表示。近期在大规模前馈空间模型方面的进展表明，点图和自身运动可以在单次前向传递中推断出来，这为可扩展的驾驶感知提供了有希望的方向。因此，我们提出了一种无标注、教师引导的框架，直接从未摆拍的视频中学习自动驾驶表示。与以前主要关注帧间一致性自监督方法不同，我们认为安全和反应式驾驶依赖于时间上下文。为此，我们利用一个前馈架构，配备了一个轻量级的自回归模块，并使用多模态监督信号来引导模型同时预测当前和未来的点图、相机姿态、语义分割和运动掩码。多模态教师提供了序列级伪监督，使LFG能够从YouTube视频中学习到无需姿态、标签或LiDAR的统一伪4D表示。生成的编码器不仅在NAVSIM基准上的下游自动驾驶规划任务中表现出色，超越了多摄像头和LiDAR基线，仅使用单个单目摄像头，而且在一系列语义、几何和定性运动预测任务中也表现出色。这些几何和运动感知的特征使LFG成为自动驾驶视频中心基础模型的有力候选。

Summary / 总结

The paper proposes a label-free, teacher-guided framework (LFG) for learning autonomous driving representations from unposed in-the-wild videos. Unlike previous self-supervised methods focusing on frame-to-frame consistency, LFG uses multi-modal supervisory signals to predict point maps, camera poses, semantic segmentation, and motion masks, enabling the model to learn a unified pseudo-4D representation. The resulting encoder outperforms multi-camera and LiDAR baselines on the NAVSIM benchmark and shows strong performance in various tasks including semantic, geometric, and motion prediction.

论文提出了一种无需标签、由教师引导的框架，用于从未摆拍的野外视频中学习自动驾驶表示。不同于以往主要关注帧间一致性的自监督方法，该方法强调了时空上下文对于安全和反应式驾驶的重要性。该方法使用了一个带有轻量级自回归模块的前馈架构，并通过多模态监督信号训练来预测点图、相机姿态、语义分割和运动掩码。所得到的模型在NAVSIM基准测试中超越了多摄像头和LiDAR基线，并在语义、几何和运动预测等多种任务上表现出色。

AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

Authors: Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

First: 2026-02-25T16:24:48+00:00 · Latest: 2026-02-25T16:24:48+00:00


Abstract

Precise Event Spotting (PES) aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: https://github.com/arturxe2/AdaSpot.

中文标题/摘要

标题：AdaSpot：在关键处提高事件检测的精确度

精确事件检测旨在以高时间精度在视频中定位快速动作或事件，这是体育分析、机器人技术和自主系统应用中的关键任务。现有方法通常均匀处理所有帧，忽视了视频数据中的固有时空冗余。这导致在非信息性区域进行冗余计算，从而限制了整体效率。为了保持可处理性，它们通常对输入进行空间下采样，从而损失了用于精确定位的细粒度细节。为了解决这些限制，我们提出了一种名为AdaSpot的简单而有效的框架，该框架处理低分辨率视频以提取全局任务相关特征，同时在每帧中自适应选择最具信息性的感兴趣区域进行高分辨率处理。选择是通过一种无监督的任务感知策略进行的，该策略在帧间保持时空一致性，避免了可学习替代方案的训练不稳定性。此设计与仅低分辨率基线相比，保留了重要的细粒度视觉线索，同时具有轻微的计算开销，而远比均匀高分辨率处理更高效。在标准PES基准上的实验表明，AdaSpot在严格的评估指标下（例如，网球和FineDiving上的+3.96和+2.26 mAP@0帧）实现了最先进的性能，同时在较宽松的指标下也保持了强劲的结果。代码可在 https://github.com/arturxe2/AdaSpot 获取。

Summary / 总结

AdaSpot improves precise event spotting in videos by processing the full video at low resolution to extract global task-relevant features while adaptively selecting the most informative region of interest in each frame for high-resolution processing. The selection uses an unsupervised, task-aware strategy that keeps the chosen regions consistent across frames. Experiments on standard benchmarks show state-of-the-art performance under strict metrics (e.g., +3.96 and +2.26 mAP@0 frames on Tennis and FineDiving), with only marginal computational overhead compared to low-resolution-only baselines.

AdaSpot 通过在低分辨率视频中选择性地处理高分辨率的感兴趣区域来提高精确事件定位的效果，从而减少计算开销同时保持高时间精度。该方法使用无监督的任务感知策略来选择信息丰富的区域，确保时空一致性。实验表明，AdaSpot 在标准基准上的表现优于现有方法，分别在网球和FineDiving 上实现了高达 3.96 的 mAP 改进和 2.26 的 mAP 改进，同时比均匀高分辨率处理更高效。

RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Authors: Zhitao He, Zongwei Lyu, Yi R Fung

Venue: ICLR 2026

First: 2026-01-22T07:36:48+00:00 · Latest: 2026-02-25T16:22:14+00:00

Comments: Accepted by ICLR 2026


Abstract

Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models the reviewer's mental state, formulates a persuasion strategy, and generates an evidence-based response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging a self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing that of the powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations.

中文标题/摘要

标题：RebuttalAgent：基于理论心智的学术反驳策略性说服

尽管人工智能（AI）已深入研究工作流程的各个阶段并取得了显著进展，但学术反驳仍然是一个重要的且尚未充分探索的挑战。这是因为反驳是一个在严重信息不对称下进行的复杂战略性沟通过程，而不仅仅是简单的技术辩论。因此，当前的方法难以应对，因为它们主要模仿表面语言学，忽略了有效说服所需的关键视角转换要素。在本文中，我们介绍了RebuttalAgent，这是第一个将学术反驳基于理论心智（ToM）的框架，通过ToM-策略-回应（TSR）框架来建模审稿人心理状态、制定说服策略并生成基于证据的回应。为了训练我们的代理，我们构建了RebuttalBench，这是一个通过新颖的批评和改进方法合成的大规模数据集。我们的训练过程分为两个阶段，首先是监督微调阶段，以赋予代理基于ToM的分析和战略规划能力，然后是利用自我奖励机制进行强化学习的阶段，以实现可扩展的自我改进。为了实现可靠且高效的自动化评估，我们进一步开发了Rebuttal-RM，这是一种专门的评估器，基于超过10万条多源反驳数据训练而成，其评分一致性超过了强大的法官GPT-4.1。广泛的实验表明，RebuttalAgent在自动化指标上平均优于基线模型18.3%，同时在自动化和人类评估中也优于先进的专有模型。

Summary / 总结

The research aims to address the challenge of strategic persuasion in academic rebuttal by developing RebuttalAgent, which integrates Theory of Mind (ToM) through a TSR framework. The method involves training the agent with a novel dataset, RebuttalBench, and a two-stage process: supervised fine-tuning for ToM-based analysis and strategic planning, followed by reinforcement learning for self-improvement. Experiments demonstrate that RebuttalAgent outperforms the base model by an average of 18.3% on automated metrics and surpasses advanced proprietary models in both automated and human evaluations.

论文针对因信息不对称而复杂的学术反驳中的战略性说服挑战。提出了基于理论心智（ToM）的RebuttalAgent框架，该框架能够建模审稿人的心理状态、制定说服策略并生成基于证据的回应。该代理通过一种新颖的数据集RebuttalBench和两阶段训练过程（监督微调和强化学习）进行训练。实验表明，RebuttalAgent在自动化评估指标上比基线模型高出18.3%，并在人类评估中也表现出色。此外，还开发了专门的评估器Rebuttal-RM，以确保评估的可靠性和效率。

Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

First: 2025-07-11T09:07:43+00:00 · Latest: 2026-02-25T16:14:20+00:00


Abstract

Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We found that naïve latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching from noise-timestep discrepancies. Then, based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), to mitigate those artifacts while achieving spatial acceleration of DiTs by our mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0× speedup on FLUX-1.dev and 3.0× on Stable Diffusion 3 with negligible quality degradation. Furthermore, our RALU is complementarily applicable to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9× speedup.

中文标题/摘要

标题：无需训练的混合分辨率潜在上采样以实现空间加速扩散变换器

扩散变换器（DiTs）提供了出色的高保真生成可扩展性，但其计算开销对实际部署构成了巨大挑战。现有加速方法主要利用时间维度，而空间加速则被严重忽视。在本文中，我们通过潜在上采样研究了DiTs的空间加速。我们发现，简单的空间加速潜在上采样会引入伪影，主要是由于高频边缘区域的混叠和噪声时间步长差异导致的不匹配。基于这些发现和分析，我们提出了一种无需训练的空间加速框架，称为区域自适应潜在上采样（RALU），以减轻这些伪影并实现DiTs的空间加速，通过我们的混合分辨率潜在上采样。RALU实现了无伪影、高效的加速，仅在易产生伪影的边缘区域进行早期上采样，并且不同潜在分辨率的噪声时间步长匹配，导致在FLUX-1.dev上最高7.0倍的加速和在Stable Diffusion 3上最高3.0倍的加速，且质量下降可忽略不计。此外，我们的RALU可以补充现有时间加速方法和时间步长提炼模型，导致最高15.9倍的加速。

Summary / 总结

This work addresses the computational challenge of deploying diffusion transformers (DiTs) by proposing a training-free spatial acceleration framework called Region-Adaptive Latent Upsampling (RALU). It mitigates artifacts from naive latent upsampling and achieves up to 7.0× speedup on FLUX-1.dev and 3.0× on Stable Diffusion 3 with negligible quality degradation. RALU selectively upsamples only artifact-prone edge regions and ensures noise-timestep matching for different latent resolutions, and it can be used in conjunction with existing temporal acceleration methods for even greater efficiency.

本文提出了一种名为Region-Adaptive Latent Upsampling (RALU)的无训练空间加速框架，通过仅在边缘区域进行早期上采样和噪声时间步匹配来缓解从朴素的潜空间上采样引入的伪影，实现了FLUX-1.dev上高达7.0×和Stable Diffusion 3上高达3.0×的速度提升，且质量几乎没有下降。此外，RALU还可以与现有的时间加速方法和时间步长提炼模型兼容，潜在地实现高达15.9×的速度提升。
