arXiv Paper Digest

Snapshot: 20260207_0349

Shared LoRA Subspaces for almost Strict Continual Learning

Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille

First: 2026-02-05T18:59:58+00:00 · Latest: 2026-02-05T18:59:58+00:00

Abstract

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration that do not rely on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.

中文标题/摘要

标题：共享LoRA子空间以实现几乎严格的持续学习

高效且持续地将大型预训练模型适应新任务对于实际部署至关重要，但由于灾难性遗忘和重新训练成本高昂，这仍然具有挑战性。尽管参数高效调优方法如低秩适应（LoRA）降低了计算需求，但它们缺乏严格持续学习和知识整合的机制，不依赖于数据重放或多个适配器。我们提出了一种名为Share的新方法，用于参数高效持续微调，它学习并动态更新一个共享的低秩子空间，从而在多个任务和模态之间实现无缝适应。Share构建了一个基础子空间，从中提取过去任务的核心知识，并通过识别关键子空间方向逐步整合新信息。来自每个新任务的知识被整合到这个不断演化的子空间中，促进了前向知识转移，同时最小化灾难性干扰。该方法在传统LoRA方法上的参数减少高达100倍，内存节省高达281倍，同时保持与联合训练模型相当的性能。一个Share模型可以替代数百个任务特定的LoRA适配器，支持可扩展的、异步的持续学习。跨图像分类、自然语言理解、3D姿态估计和文本到图像生成的实验验证了其有效性，使Share成为大规模AI系统中终身学习的实用且可扩展的解决方案。

Summary / 总结

The paper addresses efficient continual learning for large pretrained models by proposing Share, a method that learns and dynamically updates a single shared low-rank subspace, integrating each new task while minimizing catastrophic interference. Relative to traditional LoRA methods, Share reports up to 100x fewer parameters and 281x memory savings while maintaining performance comparable to jointly trained models. It supports scalable, asynchronous continual learning and is validated on image classification, natural language understanding, 3D pose estimation, and text-to-image generation.

论文旨在解决大型预训练模型中高效连续学习的问题，重点关注灾难性遗忘和高计算成本的问题。它提出了一种名为Share的方法，通过动态更新共享的低秩子空间来跨任务积累知识，参数减少高达100倍，内存减少281倍，同时保持与联合训练模型相当的性能。这种方法支持可扩展和异步的连续学习，用单一模型替代了众多任务特定的适配器。
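
To make the shared-subspace idea from the Share abstract concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes, purely for illustration, that each task's weight update lies in a common low-rank subspace, recovers a basis for it with a plain SVD, and stores only small per-task coefficients. The function name `shared_basis` and all sizes are hypothetical.

```python
import numpy as np

def shared_basis(deltas, k):
    """Orthonormal basis (d x k) spanning the top-k directions of all task updates.

    Assumption for illustration: a plain SVD over the stacked updates stands in
    for whatever subspace-extraction step the paper actually uses.
    """
    stacked = np.concatenate(deltas, axis=1)            # d x (num_tasks * n)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :k]

rng = np.random.default_rng(0)
d, n, k, num_tasks = 64, 32, 4, 10                      # hypothetical sizes

# Synthetic task updates that share one rank-k subspace.
true_basis = np.linalg.qr(rng.standard_normal((d, k)))[0]
deltas = [true_basis @ rng.standard_normal((k, n)) for _ in range(num_tasks)]

basis = shared_basis(deltas, k)
coeffs = [basis.T @ delta for delta in deltas]          # k x n per task
max_err = max(np.abs(basis @ c - delta).max() for c, delta in zip(coeffs, deltas))

# Parameter counts: one rank-k adapter pair per task vs. one basis plus coefficients.
separate_params = num_tasks * (d * k + k * n)
shared_params = d * k + num_tasks * k * n
print(max_err, separate_params, shared_params)
```

In this toy setting the shared basis reconstructs every task update to floating-point precision while storing fewer parameters than one adapter pair per task. The savings reported in the abstract come from the paper's own method and experiments, not from this sketch.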

Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Authors: Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji

First: 2026-02-05T18:59:55+00:00 · Latest: 2026-02-05T18:59:55+00:00

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

中文标题/摘要

标题：从视角描述预测相机姿态以进行空间推理

多图像空间推理仍然是当前多模态大型语言模型（MLLMs）面临的挑战。虽然单视角感知本质上是二维的，但多视角推理需要在不同视角之间构建连贯的场景理解。特别是，我们研究了视角转换，其中模型必须从多视角观察中构建连贯的三维理解，并使用它从新的、语言指定的视角进行推理。我们引入了CAMCUE，这是一种姿态感知的多图像框架，使用相机姿态作为跨视图融合和新视图推理的显式几何锚点。CAMCUE 将每视角姿态注入视觉标记，将自然语言视角描述定位到目标相机姿态，并合成姿态条件下的想象目标视图以支持回答。为了支持这一设置，我们收集了CAMCUE-DATA，其中包括27,668个训练实例和508个测试实例，这些实例将多视角图像和姿态与多样化的目标视角描述和视角转换问题配对。我们还在测试分割中包括了人工标注的视角描述，以评估对人类语言的泛化能力。CAMCUE 的整体准确率提高了9.06%，并且能够从自然语言视角描述中预测目标姿态，旋转准确率超过90%（误差在20°以内），平移准确率在0.5误差阈值以内超过90%。这种直接定位避免了昂贵的测试时搜索和匹配，将每个示例的推理时间从256.6秒减少到1.45秒，从而在实际场景中实现快速、交互式使用。

Summary / 总结

The paper addresses the challenge of multi-image spatial reasoning for current multimodal large language models by introducing CAMCUE, a pose-aware framework that uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. The framework improves overall accuracy by 9.06% and predicts target poses with high accuracy, achieving over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. CAMCUE also reduces inference time from 256.6s to 1.45s per example, enabling fast, interactive use in real-world scenarios.

论文通过引入CAMCUE框架解决了多视角空间推理的挑战，该框架使用相机姿态作为几何锚点进行跨视角融合和新颖视角推理。该框架提高了整体准确率9.06%，并且在旋转和平移精度方面预测目标姿态表现良好。此外，该框架通过将推理时间从每例256.6秒减少到1.45秒，支持了在实际场景中的快速交互使用。
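
The pose metrics quoted for CAMCUE (rotation accuracy within 20°, translation accuracy within a 0.5 error threshold) can be illustrated with a small sketch. The rotation error below is the standard geodesic angle between two rotation matrices. Measuring translation error as a Euclidean distance is an assumption, since the abstract does not define it, and `pose_accuracy` is a hypothetical helper name.

```python
import numpy as np

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(r_pred.T @ r_true) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def rot_z(deg):
    """Rotation about the z-axis, used here only to build toy poses."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def pose_accuracy(preds, truths, rot_thresh_deg=20.0, trans_thresh=0.5):
    """Fraction of (R, t) predictions within each threshold.

    Assumption: translation error is the Euclidean distance between vectors.
    """
    rot_hits, trans_hits = [], []
    for (r_p, t_p), (r_t, t_t) in zip(preds, truths):
        rot_hits.append(rotation_error_deg(r_p, r_t) <= rot_thresh_deg)
        trans_hits.append(np.linalg.norm(np.asarray(t_p) - np.asarray(t_t)) <= trans_thresh)
    return float(np.mean(rot_hits)), float(np.mean(trans_hits))

preds = [(rot_z(10.0), np.array([0.0, 0.0, 0.0])),
         (rot_z(50.0), np.array([1.0, 0.0, 0.0]))]
truths = [(rot_z(0.0), np.array([0.1, 0.0, 0.0])),
          (rot_z(0.0), np.zeros(3))]
print(pose_accuracy(preds, truths))
```

Here the first toy prediction falls inside both thresholds and the second falls outside both, so each accuracy is one half.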

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Abstract

Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.

中文标题/摘要

标题：DyTopo：基于语义匹配的多智能体动态拓扑路由

由提示的大语言模型构建的多智能体系统可以提高多轮推理能力，但大多数现有管道依赖于固定、全程的通信模式，这些模式与迭代问题解决过程中阶段性的需求不匹配。我们引入了DyTopo，这是一种由管理者指导的多智能体框架，在每轮中重建一个稀疏的有向通信图。基于管理者的轮次目标，每个智能体输出轻量级的自然语言查询（需求）和关键（提供）描述；DyTopo嵌入这些描述并进行语义匹配，仅沿诱导的边路由私有消息。在代码生成和数学推理基准测试以及四个LLM基础模型中，DyTopo在最强基线之上始终表现出色（平均提高6.2%）。除了准确性之外，DyTopo还通过不断变化的图提供了可解释的协调轨迹，使人们能够定性地检查通信路径如何在轮次之间重新配置。

Summary / 总结

DyTopo is a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round to improve multi-round reasoning. Conditioned on the manager's round goal, agents output lightweight natural-language query (need) and key (offer) descriptors, which are embedded and semantically matched so that private messages travel only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo outperforms the strongest baseline by 6.2 on average (the abstract reports "avg. +6.2" without stating the unit). The evolving graphs provide an interpretable coordination trace, enabling qualitative inspection of how communication pathways reconfigure across rounds.

DyTopo 是一个由管理者引导的多智能体框架，每轮动态重构一个稀疏的有向通信图以提高多轮推理。智能体输出轻量级的自然语言查询和关键描述符，这些描述符被嵌入并进行语义匹配，仅沿诱导的边路由私有消息。DyTopo 在代码生成和数学推理基准测试中均优于最强基线，平均改进幅度为 6.2%。随时间演化的图提供了可解释的协调轨迹，便于对通信路径配置进行定性检查。
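
The query/key matching step described for DyTopo can be sketched as follows. This is an illustrative guess at the mechanics, not the paper's implementation: it assumes descriptors are already embedded as vectors, uses cosine similarity as the matching score, and adds a directed edge from sender to receiver whenever the sender's key (offer) matches the receiver's query (need) above a hypothetical threshold.

```python
import numpy as np

def route(query_emb, key_emb, threshold=0.5):
    """Return directed edges (sender, receiver) induced by key-to-query matching.

    Assumptions: cosine similarity and a fixed threshold; both are stand-ins
    for whatever matching rule the paper uses.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    k = key_emb / np.linalg.norm(key_emb, axis=1, keepdims=True)
    sim = k @ q.T                      # sim[s, r]: sender s's offer vs. receiver r's need
    np.fill_diagonal(sim, -1.0)        # an agent never routes to itself
    senders, receivers = np.nonzero(sim >= threshold)
    return list(zip(senders.tolist(), receivers.tolist()))

# Toy round with three agents and 2-D "embeddings".
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # what each agent needs
keys = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])    # what each agent offers
edges = route(queries, keys)
print(edges)
```

In the toy round, agent 0 offers what agent 1 needs, agent 1 offers what agents 0 and 2 need, and agent 2's offer matches nobody, so the graph stays sparse.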

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Authors: Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Comments: Project Page: https://accio-lab.github.io/SwimBird

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.

中文标题/摘要

标题：SwimBird：在混合自回归MLLM中引发可切换的推理模式

多模态大型语言模型（MLLMs）通过视觉和语言的结合，在多模态感知和推理方面取得了显著进展。然而，大多数现有的MLLMs主要通过文本的逐步推理（CoT）进行推理，这限制了它们在视觉密集型任务上的效果。最近的方法将固定数量的连续隐藏状态作为“视觉思考”注入推理过程，从而提高了视觉性能，但通常会牺牲基于文本的逻辑推理能力。我们认为核心限制在于一种僵化的、预先定义的推理模式，无法根据不同用户查询自适应地选择最合适的思考模态。我们引入了SwimBird，这是一种可切换的MLLM，根据输入动态切换三种推理模式：（1）仅文本推理，（2）仅视觉推理（连续隐藏状态作为视觉思考），（3）视觉-文本交织推理。为了实现这一能力，我们采用了一种混合自回归公式，将文本思考的下一个词预测与视觉思考的下一个嵌入预测统一起来，并设计了一种系统性的推理模式策展策略，构建了SwimBird-SFT-92K，这是一个涵盖所有三种推理模式的多样化监督微调数据集。通过实现灵活、查询自适应的模式选择，SwimBird在保持强大的文本逻辑推理能力的同时，显著提高了视觉密集型任务的性能。跨多种基准测试的实验表明，SwimBird在文本推理和具有挑战性的视觉理解方面均实现了最先进的结果和稳健的改进。

Summary / 总结

SwimBird is designed to address the limitations of existing MLLMs by introducing a switchable reasoning mode that dynamically adapts to different user queries. It employs a hybrid autoregressive formulation and a reasoning-mode curation strategy to support three reasoning modes: text-only, vision-only, and interleaved vision-text. Experimental results show that SwimBird maintains strong text-based logical reasoning while significantly improving performance on vision-intensive tasks, achieving state-of-the-art results across various benchmarks.

SwimBird旨在通过引入可切换的推理模式来解决现有跨模态大型语言模型（MLLMs）的局限性，该模式能够根据不同用户查询动态适应。它采用了一种混合自回归公式和推理模式编排策略，支持三种推理模式：仅文本、仅视觉和视觉-文本交织。实验表明，SwimBird在保持强大文本逻辑的同时，显著提升了在视觉密集任务上的性能，超越了之前的固定模式跨模态推理方法，在各种基准测试中表现出色。
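
The hybrid autoregressive formulation in the SwimBird abstract, next-token prediction for textual thoughts unified with next-embedding prediction for visual thoughts, can be pictured as a per-position loss that switches by modality. The sketch below is an assumption-laden illustration: cross-entropy for text positions is standard, but using mean squared error for embedding positions and averaging the two uniformly are guesses, and `hybrid_loss` is a hypothetical name.

```python
import numpy as np

def hybrid_loss(logits, target_ids, pred_emb, target_emb, is_text):
    """Mean per-position loss: cross-entropy where is_text, MSE elsewhere.

    logits: (T, V); target_ids: (T,); pred_emb, target_emb: (T, D); is_text: (T,) bool.
    Assumption: MSE is used for next-embedding prediction.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(target_ids)), target_ids]       # text positions
    mse = ((pred_emb - target_emb) ** 2).mean(axis=1)             # visual positions
    return float(np.where(is_text, ce, mse).mean())

# Two positions: one text token (uniform logits over 4 words), one visual embedding.
logits = np.zeros((2, 4))
target_ids = np.array([1, 0])
pred_emb = np.ones((2, 3))
target_emb = np.ones((2, 3))
is_text = np.array([True, False])
loss = hybrid_loss(logits, target_ids, pred_emb, target_emb, is_text)
print(loss)
```

With uniform logits the text position contributes log 4 and the perfectly predicted embedding contributes zero, so the mean is half of log 4.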

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li

Venue: ICRA 2026

First: 2026-02-05T18:59:45+00:00 · Latest: 2026-02-05T18:59:45+00:00

Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: https://comm-cp.github.io/

Abstract

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.

中文标题/摘要

标题：CommCP：通过基于LLM的通信与符合预测实现高效的多智能体协调

为了通过自然语言完成人类提供的任务，机器人必须解释命令、生成和回答相关问题以理解场景，并操作目标物体。实际部署中，通常需要不同操作能力的多个异构机器人协同处理不同的任务。除了需要专门的操作技能外，有效的信息收集对于完成这些任务也很重要。为了解决这一问题，我们将信息收集过程在完全协同的环境中形式化为一个未被充分探索的多任务多智能体体态问答（MM-EQA）问题，这是对经典体态问答（EQA）的一种新颖扩展，其中有效的通信对于协调努力且不重复至关重要。为了解决这一问题，我们提出CommCP，一种专为MM-EQA设计的基于LLM的分布式通信框架。我们的框架采用符合预测来校准生成的消息，从而减少接收者的分心并提高通信可靠性。为了评估我们的框架，我们引入了一个包含多样化、逼真家庭场景的MM-EQA基准，其中包含体态问题。实验结果表明，CommCP在任务成功率和探索效率上显著优于基线。实验视频、代码和数据集可在我们的项目网站上获取：https://comm-cp.github.io/

Summary / 总结

The paper addresses the challenge of multiple robots working together to complete tasks given in natural language, emphasizing the importance of effective communication. It proposes CommCP, a communication framework using LLMs and conformal prediction to enhance coordination among robots. Experimental results show that CommCP improves task success and exploration efficiency compared to baseline methods.

该论文提出了CommCP，一种让多机器人能够高效协作完成自然语言指令指定的任务的通信框架。它将问题形式化为一个多代理多任务的体感问答任务，强调有效沟通的重要性。CommCP 使用校准预测来校准消息，减少干扰并提高通信可靠性。实验结果显示，CommCP 在一个包含多种家庭场景的新MM-EQA基准测试中，在任务成功率和探索效率方面均优于基线方法。
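
CommCP's use of conformal prediction to calibrate messages can be illustrated with standard split conformal prediction for a classifier. The calibration step below follows the textbook recipe. The rule for when an agent sends a message (only when the prediction set contains a single answer) is a hypothetical stand-in and is not taken from the paper.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal quantile of the scores 1 - p(true label)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))   # finite-sample correction
    return float(np.sort(scores)[rank - 1])

def prediction_set(probs, qhat):
    """All labels whose score 1 - p(label) is at most the calibrated threshold."""
    return [int(i) for i in np.nonzero(1.0 - probs <= qhat)[0]]

def should_send(probs, qhat):
    """Hypothetical rule: share an answer only when the set is a singleton."""
    return len(prediction_set(probs, qhat)) == 1

# Toy calibration data: 4 examples, 3 candidate answers.
cal_probs = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.3, 0.5, 0.2]])
cal_labels = np.array([0, 1, 2, 0])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident = np.array([0.9, 0.06, 0.04])
ambiguous = np.array([0.5, 0.45, 0.05])
print(prediction_set(confident, qhat), prediction_set(ambiguous, qhat))
```

Under this rule the confident prediction yields a single-answer set and would be shared, while the ambiguous one yields two candidates and would be withheld, which is one way a receiver could be spared distracting messages.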

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Authors: Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

First: 2026-02-05T18:59:32+00:00 · Latest: 2026-02-05T18:59:32+00:00

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

中文标题/摘要

标题：几何思维：基于几何的主动几何整合以促进空间推理

多模态大型语言模型（MLLMs）在空间推理方面的最新进展越来越多地利用3D编码器提供的几何先验。然而，大多数现有的整合策略仍然被动：几何信息以全局流的形式呈现，并以不分青红皂白的方式融合，这往往导致语义-几何错位和冗余信号。我们提出了GeoThinker框架，将范式从被动融合转变为主动感知。GeoThinker 不是通过特征混合，而是使模型能够根据其内部推理需求选择性地检索几何证据。GeoThinker 通过在精心选择的VLM层上应用空间语义融合来实现这一点，其中语义视觉先验通过帧严格的交叉注意力选择性地查询和整合与任务相关的几何信息，并通过重要性门控进一步校准，以偏向于与任务相关的结构的帧间注意力。全面的评估结果表明，GeoThinker 在空间智能方面达到了新的最佳状态，在VSI-Bench上达到峰值得分为72.6。此外，GeoThinker 在复杂下游场景中展示了稳健的泛化能力和显著改进的空间感知能力，包括体感指代和自动驾驶。我们的结果表明，能够主动整合空间结构对于下一代空间智能至关重要。代码可以在 https://github.com/Li-Hao-yuan/GeoThinker 获取。

Summary / 总结

The research aims to improve spatial reasoning by addressing the limitations of passive geometric integration in Multimodal Large Language Models (MLLMs). GeoThinker proposes an active geometry integration framework that allows the model to selectively retrieve geometric evidence based on its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific VLM layers, where semantic visual priors query and integrate task-relevant geometry via frame-strict cross-attention, further refined by Importance Gating. GeoThinker outperforms previous methods, achieving a peak score of 72.6 on the VSI-Bench and demonstrating robust generalization in complex scenarios like embodied referring and autonomous driving.

研究旨在通过解决Multimodal Large Language Models (MLLMs)中被动几何集成的局限性来提升空间推理能力。提出了GeoThinker框架，从被动融合转向主动感知，使模型能够根据其推理需求选择性地检索几何证据。这通过Spatial-Grounded Fusion和Importance Gating实现，增强任务相关几何信息的集成。该框架在VSI-Bench上达到了新的最佳得分72.6，并在包括体感引用和自动驾驶在内的复杂场景中展示了强大的泛化能力。

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui

First: 2026-02-05T18:59:27+00:00 · Latest: 2026-02-05T18:59:27+00:00

Comments: Webpage: https://sirui-xu.github.io/InterPrior/

Abstract

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.

中文标题/摘要

标题：InterPrior：扩展基于物理的人机物交互生成控制

人类很少在整体身体层面上计划与物体的交互，而是通过高层次意图，如功能，来定义目标，而协调的平衡、接触和操作则可以从潜在的物理和运动先验中自然地涌现出来。扩展这些先验对于使类人机器人能够跨多种情境组合和泛化移动操作技能并保持物理上连贯的整体身体协调至关重要。为此，我们引入了InterPrior，这是一种可扩展的框架，通过大规模模仿预训练和后续的强化学习微调来学习一个统一的生成控制器。InterPrior首先将一个完整的参考模仿专家提炼成一个多功能、目标条件化的变分策略，该策略可以从多模态观察和高层次意图中重建运动。虽然提炼出的策略可以重建训练行为，但由于大规模人机物交互的庞大配置空间，它无法可靠地泛化。为了解决这个问题，我们应用了物理扰动的数据增强，并通过强化学习微调来提高对未见过的目标和初始状态的技能。这些步骤共同将重建的潜在技能凝聚成一个有效的流形，产生一个泛化能力超出训练数据的运动先验，例如，它可以包含与未见过的物体的交互行为。我们进一步展示了其在用户交互控制中的有效性及其在实际机器人部署中的潜力。

Summary / 总结

InterPrior is a scalable framework that learns a unified generative controller for physics-based human-object interactions. It combines large-scale imitation pretraining with reinforcement learning post-training: a full-reference imitation expert is distilled into a goal-conditioned variational policy, which is then strengthened with physically perturbed data augmentation and reinforcement learning finetuning. The key finding is that the resulting motion prior generalizes beyond the training data, for example to interactions with unseen objects, while maintaining physically coherent whole-body coordination. The authors also demonstrate user-interactive control and report potential for real robot deployment.

InterPrior 是一个通过模仿预训练和强化学习学习统一生成控制器的可扩展框架，用于人类与物体的交互。它首先将一个全参考模仿专家提炼成一个多功能、基于目标的变分策略，该策略可以从多模态观察和高层意图中重建运动。通过物理扰动的数据增强和强化学习微调，该策略能够更好地处理未见过的目标和初始状态，使其能够超越训练数据泛化并融入新行为。结果表明，InterPrior 可以有效控制类人机器人执行多样化的移动操作任务，并具有实际机器人部署的潜力。

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka

First: 2026-02-05T18:59:21+00:00 · Latest: 2026-02-05T18:59:21+00:00

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.

中文标题/摘要

标题：V-Retrver：基于证据的代理推理以实现通用多模态检索

多模态大型语言模型（MLLMs）最近被应用于通用多模态检索，其中思维链（CoT）推理改善了候选检索结果的重新排序。然而，现有方法仍然主要依赖语言驱动，依赖静态视觉编码，缺乏主动验证细粒度视觉证据的能力，这往往导致在视觉含糊情况下进行推测性推理。我们提出V-Retrver，一种基于证据的检索框架，将多模态检索重新定义为基于视觉检查的代理推理过程。V-Retrver使MLLM能够在推理过程中通过外部视觉工具选择性地获取视觉证据，执行一种多模态交替推理过程，交替进行假设生成和目标视觉验证。为了训练这种证据收集检索代理，我们采用了一种基于课程的学习策略，结合监督推理激活、拒绝基础的细化以及与证据对齐的目标的强化学习。在多个多模态检索基准上的实验表明，检索准确性（平均提高23.0%）、感知驱动的推理可靠性以及泛化能力均得到了一致的提升。

Summary / 总结

V-Retrver is an evidence-driven retrieval framework that enhances multimodal retrieval by incorporating visual inspection and active verification. It reformulates the retrieval process as agentic reasoning, allowing the model to selectively gather visual evidence during reasoning. Experiments show that V-Retrver improves retrieval accuracy by 23.0% on average and enhances reasoning reliability and generalization.

研究旨在通过将主动的视觉证据验证纳入推理过程来提升多模态检索。V-Retrver 是一个基于证据的检索框架，允许 MLLM 在推理过程中收集和验证视觉证据，从而提高准确性和可靠性。实验结果显示，检索准确率平均提高了 23.0%，并且推理的感知驱动可靠性得到了增强。

Can vision language models learn intuitive physics from interaction?

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

中文标题/摘要

标题：视觉语言模型能否通过交互学习直观的物理知识？

预训练的视觉语言模型对物理世界的直觉不够好。最近的研究表明，监督微调可以提高模型在简单物理任务上的表现。然而，微调后的模型似乎没有学会能够泛化的稳健物理规则。基于认知科学的研究，我们假设模型需要与环境进行交互才能正确学习其物理动力学。我们使用强化学习训练通过与环境交互来学习的模型。虽然通过交互学习可以让模型在任务内的表现得到提升，但无法产生具有泛化物理直觉的模型。我们发现，即使任务共享视觉统计和物理原理，针对一个任务训练的模型也不可靠地泛化到相关任务，无论模型是通过交互还是其他方式训练。

Summary / 总结

The study investigates whether vision language models can develop an understanding of physical laws through interaction with an environment. Despite improvements in task performance through reinforcement learning, the models fail to acquire robust, generalizable physical intuitions. Models trained on one task do not generalize well to related tasks, even when the tasks share similar physical principles and visual statistics.

研究探讨了视觉语言模型是否可以通过与环境的交互来理解物理规律。尽管通过强化学习可以提高任务性能，但模型未能获得稳健且可泛化的物理直觉。即使任务具有相似的物理原理和视觉特征，训练于一个任务的模型也无法很好地泛化到相关任务中。

Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

Authors: David Shavin, Sagie Benaim

Venue: ICLR 2026

First: 2026-02-05T18:59:05+00:00 · Latest: 2026-02-05T18:59:05+00:00

Comments: Accepted to ICLR 2026

Abstract

Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/

中文标题/摘要

标题：Splat和Distill：通过前馈3D重建增强教师模型以实现3D感知蒸馏

视觉基础模型(VFMs)在应用于各种下游2D任务时取得了显著的成功。尽管如此，它们通常缺乏3D意识。为此，我们提出了Splat和Distill框架，通过将快速的前馈3D重建管道添加到教师模型中，将稳健的3D意识注入2D VFMs。给定由教师模型生成的2D特征，我们的方法首先以前馈方式将这些特征提升为显式的3D高斯表示。然后，将这些3D特征“平铺”到新的视点上，生成用于监督学生模型的一组新颖的2D特征图，从而“蒸馏”几何上具有根据性的知识。通过用我们的前馈提升方法替换先前工作中的慢速场景优化，我们的框架避免了特征平均伪影，创建了一个动态学习过程，在这个过程中，教师的一致性与学生的改进同步提高。我们在包括单目深度估计、表面法线估计、多视图对应和语义分割等一系列下游任务上进行了全面评估。我们的方法显著优于先前的工作，不仅在3D意识方面取得了重大进展，还增强了2D特征的语义丰富性。项目页面可在https://davidshavin4.github.io/Splat-and-Distill/获取。

Summary / 总结

The research aims to enhance the 3D awareness of Vision Foundation Models (VFMs) by introducing Splat and Distill, a framework that integrates a fast 3D reconstruction pipeline into the teacher model. This method lifts 2D features into 3D Gaussian representations and then projects them onto novel viewpoints to supervise the student model, effectively distilling geometrically grounded knowledge. Experimental results demonstrate significant improvements in 3D-aware tasks such as monocular depth estimation and surface normal estimation, surpassing previous methods and enhancing the semantic richness of 2D features.

研究旨在通过引入Splat和Distill框架，增强Vision Foundation Models (VFMs)的3D感知能力，该框架将快速的3D重建管道集成到教师模型中。该方法将2D特征提升为3D高斯表示，并将其投射到新的视角上以监督学生模型，提取几何上相关的知识。该框架避免了特征平均的缺陷，提高了3D感知和2D特征的语义丰富性，并在单目深度估计等下游任务中优于先前的工作。

PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui

First: 2026-02-05T18:59:01+00:00 · Latest: 2026-02-05T18:59:01+00:00

Abstract

Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.

中文标题/摘要

标题：PhysicsAgentABM：基于物理引导的生成性基于代理的建模

基于大型语言模型（LLM）的多代理系统能够实现富有表现力的代理推理，但难以扩展且不适用于时间步长对齐的状态转换模拟，而经典的基于代理的模型（ABMs）虽然具有可解释性，但在整合丰富的个体级信号和非平稳行为方面存在困难。我们提出了PhysicsAgentABM，该方法将推理转移到行为一致的代理集群中：状态专门化的符号代理编码机制性转换先验，多模态神经转换模型捕捉时间动态和交互动态，不确定性意识的本体融合生成校准的集群级转换分布。个体代理随后在局部约束下随机实现转换，从而解耦群体推理与实体级变异性。我们还引入了基于跨上下文行为响应的LLM代理驱动聚类策略ANCHOR，以及一种新颖的对比损失，最多可减少6-8倍的LLM调用次数。在公共卫生、金融和社会科学领域的实验表明，与机制性、神经网络和LLM基线相比，PhysicsAgentABM在事件时间准确性和校准方面均表现出一致的改进。通过围绕群体级推理重构生成性ABM，并结合不确定性意识的神经符号融合，PhysicsAgentABM确立了使用LLM进行可扩展且校准的模拟的新范式。

Summary / 总结

PhysicsAgentABM is designed to combine the interpretability of classical ABMs with the expressiveness of LLMs by shifting inference to behaviorally coherent agent clusters. It uses state-specialized symbolic agents to encode mechanistic transition priors, a multimodal neural transition model to capture temporal and interaction dynamics, and uncertainty-aware epistemic fusion to yield calibrated cluster-level transition distributions. The method also introduces ANCHOR, an LLM agent-driven clustering strategy that reduces LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences demonstrate consistent improvements in event-time accuracy and calibration over mechanistic, neural, and LLM baselines.

PhysicsAgentABM 通过将推理转移到行为一致的代理集群中，结合了经典ABM的可解释性和LLM的表达能力。它使用状态专门化的符号代理来编码机制性转换先验，使用多模态神经转换模型来捕捉时间和交互动态，并使用不确定性感知的认知融合来生成校准的集群级转换分布。该方法还引入了基于跨上下文行为响应的LLM代理驱动聚类策略ANCHOR，最多可将LLM调用次数减少6-8倍。实验表明，在公共卫生、金融和社会科学领域，该方法在事件时间准确性和校准方面优于机制性、神经网络和LLM基线方法。

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

First: 2026-02-05T18:58:01+00:00 · Latest: 2026-02-05T18:58:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

中文标题/摘要

标题：上下文强制：使用长上下文的一致自回归视频生成

近期的实时长视频生成方法通常采用流式调优策略，试图通过短上下文（无记忆）教师训练一个长上下文学生。在这些框架中，学生进行长时间的展开，但只能从短至5秒的窗口中获得监督。这种结构上的不匹配导致了一个关键的学生-教师不匹配：由于教师无法访问长期历史，它无法引导学生学习全局时间依赖性，从而限制了学生能够使用的上下文长度。为了解决这一问题，我们提出了上下文强制（Context Forcing），这是一种新的框架，通过长上下文教师训练长上下文学生。通过确保教师了解完整的生成历史，我们消除了监督不匹配，使模型能够稳健地训练并实现长期一致性。为了使这种计算在极端持续时间（例如2分钟）下可行，我们引入了一种上下文管理系统，将线性增长的上下文转换为慢速-快速记忆（Slow-Fast Memory）架构，显著减少了视觉冗余。大量实验结果表明，我们的方法使有效的上下文长度超过20秒，比LongLive和Infinite-RoPE等最先进的方法长2到10倍。通过利用这种扩展的上下文，上下文强制在长视频的各种评估指标上超越了最先进的基线，保持了长期的一致性。

Summary / 总结

The paper addresses the student-teacher mismatch in real-time long video generation, where a long-context student is supervised by a teacher limited to short 5-second windows. It proposes Context Forcing, which trains the long-context student with a long-context teacher that sees the full generation history, removing the supervision mismatch and enabling robust training for long-term consistency. A Slow-Fast Memory architecture reduces visual redundancy in the growing context, which keeps long durations computationally feasible. The method reaches effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods such as LongLive and Infinite-RoPE.

论文通过提出Context Forcing方法，解决了实时长视频生成中学生-教师不匹配的问题，该方法使用长历史上下文的教师来训练长历史上下文的学生，确保教师能够访问完整的生成历史，从而消除监督不匹配。为了应对计算需求，引入了Slow-Fast Memory架构，减少了视觉冗余。实验结果表明，Context Forcing能够实现超过20秒的上下文长度，优于LongLive和Infinite-RoPE等最先进的方法，在长视频评估指标上表现出更优的一致性。

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

First: 2026-02-05T18:57:09+00:00 · Latest: 2026-02-05T18:57:09+00:00

Comments: Code is available at https://github.com/ViktorAxelsen/BudgetMem

Abs · PDF · Code1 · Code2 · Code3

Abstract

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

中文标题/摘要

标题：学习面向运行时代理内存的查询感知预算层级路由

内存对于大型语言模型（LLM）代理在单个上下文窗口之外运行变得越来越重要，但大多数现有系统依赖于离线、查询无关的内存构建，这可能效率低下并可能丢弃查询关键信息。尽管运行时内存利用是一个自然的替代方案，但先前的工作往往会产生大量开销，并且对性能成本权衡的控制有限。在本文中，我们提出了BudgetMem，这是一种运行时代理内存框架，用于明确、查询感知的性能成本控制。BudgetMem 将内存处理结构化为一组内存模块，每个模块提供三个预算层级（即 Low/Mid/High）。一个轻量级路由器在模块之间执行预算层级路由，以平衡任务性能和内存构建成本，这通过强化学习训练的紧凑神经策略实现。使用 BudgetMem 作为统一的测试平台，我们研究了三种互补的预算层级实现策略：实现（方法复杂性）、推理（推理行为）和容量（模块模型大小）。在 LoCoMo、LongMemEval 和 HotpotQA 上，当优先考虑性能（即高预算设置）时，BudgetMem 超过了强大的基线，并在更紧的预算下提供了更好的准确度成本前沿。此外，我们的分析将不同层级策略的优势和劣势分离开来，阐明了在不同预算条件下，每个轴在提供最有利权衡时的表现。

Summary / 总结

BudgetMem is a runtime agent memory framework for explicit, query-aware performance-cost control in LLM agents. It structures memory processing as a set of memory modules, each offered in three budget tiers (Low/Mid/High), and uses a lightweight router trained with reinforcement learning to balance task performance against memory construction cost. Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem outperforms strong baselines in the high-budget setting and provides better accuracy-cost frontiers under tighter budgets.

BudgetMem 是一种运行时代理内存框架，通过将内存处理结构化为三个预算层级（低、中、高）并使用轻量级路由器来平衡任务性能和内存构建成本，实现明确的、查询感知的性能-成本控制。通过强化学习训练的紧凑型神经策略，BudgetMem 在高预算设置中优于强基线，并在更紧的预算下提供更好的准确度-成本前沿。研究还分析了不同层级策略的优势和劣势，提供了在不同预算条件下最优权衡的见解。

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar

First: 2026-02-05T18:55:56+00:00 · Latest: 2026-02-05T18:55:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

中文标题/摘要

标题：CORAL：正确性优化的残差激活透镜（Correctness-Optimized Residual Activation Lens）：可移植且校准意识的推理时校正导向

大型语言模型（LLMs）在指令调优和偏好对齐后表现出持续的校准不足。修改后的训练目标可以改善校准，但重新训练成本高昂。推理时校正提供了一种轻量级的替代方案，但大多数现有方法优化的是正确性的代理指标，而不是正确性本身。我们引入了CORAL（正确性优化的残差激活透镜），这是一种正则化推理时校正方法，通过权重衰减MLP探针捕捉模型内部激活中的分布式正确性信号。我们在三个7B参数模型上评估了CORAL，发现它在平均情况下将准确率提高了10%，预期校准误差（ECE）降低了50%。我们还展示了这些增益在无需重新训练的情况下转移到四个保留基准测试的完整发布测试集（ARC-Challenge、HellaSwag、Math-MC、OpenBookQA）上，平均准确率提高了14%，ECE降低了49%。我们的结果支持了这样一个假设：当单个神经元不足时，可以使用正则化探针从模型内部提取分布式信息。因此，CORAL提供了一种计算高效、可移植且校准意识的方法，以提高推理时的多项选择题问答性能。

Summary / 总结

The paper introduces CORAL, a regularized inference-time steering method that enhances the calibration and accuracy of large language models. It captures distributed correctness signals from internal activations using weight-decay MLP probes. Across three 7B-parameter models, CORAL improves accuracy by 10% and expected calibration error by 50% on average. These improvements transfer to four held-out benchmarks without retraining, averaging 14% accuracy and 49% ECE improvements, demonstrating a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

论文提出了CORAL，一种正则化推理时校正方法，利用权重衰减MLP探针从模型内部激活中捕捉分布式的正确性信号。该方法在三个7B参数模型上平均将准确率提高了10%，并将期望校准误差（ECE）改善了50%。此外，这些改进无需重新训练即可迁移到四个保留基准测试上，平均准确率提高14%，ECE改善49%。
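
Expected calibration error (ECE), the calibration metric reported above, is commonly estimated by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The following is a minimal sketch of that standard equal-width-bin estimator. It is not the CORAL authors' evaluation code; the function name, the default of 10 bins, and the sample inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin by its confidence in [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: four answers given with 90% confidence, three of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means confidence tracks accuracy more closely, so a 50% ECE improvement corresponds to halving this quantity.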

Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold

Authors: Ye He, Yitong Qiu, Molei Tao

First: 2026-02-05T18:55:03+00:00 · Latest: 2026-02-05T18:55:03+00:00

Abs · PDF · Code1 · Code2

Abstract

When a diffusion model is not memorizing the training data set, how exactly does it generalize? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model's performance for downstream applications. We thus explicitly characterize what a diffusion model generates, by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as the inference dynamics progress. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. A more detailed understanding of training dynamics leads to a more accurate quantification of the generation inductive bias. As an example, we consider a random feature model, for which we can explicitly illustrate how a diffusion model's inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects in both low and high dimensions.

中文标题/摘要

标题：扩散模型的泛化可以由数据依赖的岭流形上的归纳偏置来表征

当扩散模型不记忆训练数据集时，它如何泛化？对其生成分布的定量理解将有助于例如下游应用中模型性能的评估。因此，我们通过提出对数密度岭流形并量化生成数据与该流形的关系来明确表征扩散模型的生成内容。更具体地说，推理过程围绕岭流形进行一个到达-对齐-滑动的过程：轨迹首先到达流形的邻域，然后在法向方向上被推近或远离流形进行对齐，最后在切向方向上沿着流形滑动。在这一总体行为的范围内，不同的训练误差会导致不同的法向和切向运动，这些运动可以被量化，并且这些详细的运动表征了跨模态生成何时出现。对训练动力学更详细的理解将导致对生成归纳偏置更准确的量化，我们将考虑一个随机特征模型，可以明确展示扩散模型的归纳偏置如何源自架构偏置和训练准确性组成的组合，并且如何随着推理动力学的发展而演变。在合成多模态分布和MNIST潜在扩散上的实验支持了预测的方向性效应，在低维和高维空间中均是如此。

Summary / 总结

This study investigates how diffusion models generalize by proposing a log-density ridge manifold and analyzing the inference dynamics relative to it. Inference is described as a reach-align-slide process centered around the ridge manifold. Different training errors result in distinct normal and tangent motions, which can be quantified to characterize when inter-mode generations emerge. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects.

该论文通过提出一个对数密度岭流形并分析推理动力学，研究了扩散模型的泛化能力。研究表明，推理过程遵循围绕该流形的到达-对齐-滑动过程，不同的训练误差会导致不同的法向和切向运动。这些运动有助于量化模型的生成归纳偏见，特别是在合成多模态分布和MNIST潜在扩散中支持预测的方向效应。

Mechanisms of AI Protein Folding in ESMFold

Authors: Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler

First: 2026-02-05T18:54:54+00:00 · Latest: 2026-02-05T18:54:54+00:00

Comments: Our code, data, and results are available at https://folding.baulab.info

Abs · PDF · Code1 · Code2

Abstract

How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features, such as charge, flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.

中文标题/摘要

标题：ESMFold蛋白质折叠机制

蛋白质结构预测模型是如何折叠蛋白质的？我们通过追踪ESMFold折叠β发夹（一种常见的结构基序）的过程来探讨这一问题。通过对模型潜在变量进行反事实干预，我们识别出折叠主干（folding trunk）中的两个计算阶段。在第一个阶段，早期模块初始化成对生物化学信号：残基身份及其相关的生物化学特征（如电荷）从序列表示流入成对表示。在第二个阶段，晚期模块发展成对空间特征：距离和接触信息在成对表示中积累。我们证明了ESMFold结构决策背后的机制可以被定位、通过可解释的表示进行追踪，并且可以被操控并产生强因果效应。

Summary / 总结

The study investigates how ESMFold folds proteins by tracing its folding of a beta hairpin. It identifies two computational stages in the folding trunk: early blocks move residue identities and biochemical features such as charge from sequence representations into pairwise representations, and late blocks develop spatial features (distance and contact information) in the pairwise representation. The study shows that these mechanisms can be localized, traced through interpretable representations, and manipulated with strong causal effects.

该研究通过分析ESMFold对β发夹结构的折叠过程，探讨其蛋白质折叠机制。研究发现两个计算阶段：第一个阶段从序列表示中初始化生物化学信号，第二个阶段在成对表示中发展空间特征。研究证明这些机制可以被定位、通过可解释的表示进行追踪，并且可以通过强因果效应进行操控。

MambaVF: State Space Model for Efficient Video Fusion

Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler

First: 2026-02-05T18:53:47+00:00 · Latest: 2026-02-05T18:53:47+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. MambaVF is also highly efficient: compared to existing methods, it reduces parameters by up to 92.25% and computational FLOPs by up to 88.79%, and achieves a 2.1x speedup. Project page: https://mambavf.github.io

中文标题/摘要

标题：MambaVF：基于状态空间模型的高效视频融合

视频融合是各种视频处理任务中的基本技术。然而，现有的视频融合方法严重依赖于光流估计和特征扭曲，导致严重的计算开销和有限的可扩展性。本文提出了一种基于状态空间模型（SSM）的高效视频融合框架MambaVF，该框架在无需显式运动估计的情况下进行时间建模。首先，通过将视频融合重新表述为一个顺序状态更新过程，MambaVF以线性复杂度捕获长程时间依赖性，同时显著降低计算和内存成本。其次，MambaVF提出了一种基于SSM的轻量级融合模块，用时空双向扫描机制替代了传统的流引导对齐。该模块使跨帧的信息聚合变得高效。在多个基准测试中的广泛实验表明，我们的MambaVF在多曝光、多焦点、红外可见和医学视频融合任务中达到了最先进的性能。我们强调MambaVF具有高效率，参数减少高达92.25%，计算FLOPs减少88.79%，相比现有方法速度提升2.1倍。项目页面：https://mambavf.github.io

Summary / 总结

MambaVF is an efficient video fusion framework based on state space models (SSMs). It reformulates video fusion as a sequential state update process, capturing long-range temporal dependencies with linear complexity and without explicit motion estimation, which reduces computation and memory costs. It introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with spatio-temporal bidirectional scanning, enabling efficient information aggregation across frames. Experiments show that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks while reducing parameters by up to 92.25% and FLOPs by up to 88.79%, with a 2.1x speedup over existing methods.

MambaVF 是一种高效的视频融合框架，通过使用状态空间模型（SSMs）将视频融合重新表述为一个序列状态更新过程，从而消除显式运动估计的需要。这种方法以线性复杂度捕获长程时间依赖性，显著减少了计算和内存成本。实验结果表明，MambaVF 在各种视频融合任务中表现出色，达到最先进的性能，同时将参数减少高达 92.25%，计算 FLOPs 减少 88.79%，并提供 2.1 倍的速度提升。
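
The linear-complexity claim follows from the recurrent form of a state space model: each step updates a fixed-size state once, so the cost grows linearly with sequence length. The sketch below shows a scalar, time-invariant recurrence h_t = a*h_{t-1} + b*x_t with readout y_t = c*h_t, scanned in both directions. It is a toy illustration only. The scalar parameters a, b, c, the function names, and the choice to sum the two directions are assumptions, and MambaVF's actual fusion module is not reproduced here.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t over xs in one left-to-right pass."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # one constant-cost state update per step
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a, b, c):
    """Sum a forward scan and a backward scan so each step sees both past and future."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# Hypothetical input: an impulse at the first step of a three-step sequence.
forward_only = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
both_ways = bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

In the forward scan the impulse decays by a factor of a at each later step, which is how a single state carries information across frames without any explicit alignment between them.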

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang

First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00

Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

中文标题/摘要

标题：GenArena：我们如何实现视觉生成任务的人类对齐评估？

视觉生成模型的快速发展已经超越了传统的评估方法，迫切需要采用视觉语言模型作为替代的评判者。在本文中，我们系统地研究了当前广泛使用的绝对点对点评分标准在各种视觉生成任务中的可靠性。我们的分析表明，这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制，我们引入了GenArena，这是一种统一的评估框架，利用成对比较范式确保稳定且人类对齐的评估。我们的实验揭示了一个变革性的发现，即简单采用这种成对协议可以使现成的开源模型超越顶级专有模型。值得注意的是，我们的方法将评估准确性提高了超过20%，并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性，远超点对点方法的0.36相关性。基于GenArena，我们对多种视觉生成模型进行了基准测试，为视觉生成提供了一个严格且自动化的评估标准。

Summary / 总结

This paper examines the limitations of absolute pointwise scoring, the prevailing standard when Vision-Language Models are used to judge visual generation models: stochastic inconsistency and poor alignment with human perception. The authors introduce GenArena, a unified evaluation framework using a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Experiments show that adopting this pairwise protocol improves evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, significantly surpassing the 0.36 correlation of pointwise methods. GenArena benchmarks state-of-the-art visual generation models across various tasks, providing a rigorous evaluation standard.

该研究通过引入GenArena框架，使用成对比较范式确保稳定且与人类感知一致的评估，解决了传统评估方法的局限性。实验表明，采用此协议可提高评估准确性超过20%，并与权威的LMArena排行榜达到0.86的Spearman相关性，远超点对点方法的0.36相关性。GenArena跨多种任务对最先进的视觉生成模型进行了基准测试，为视觉生成提供了一个严格且自动化的评估标准。
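
The agreement figures above are Spearman rank correlations between two rankings of the same models. A minimal sketch of how pairwise judgments could be turned into a ranking and compared with a reference leaderboard follows. It aggregates with plain win rates, which is an assumption made for brevity and may differ from GenArena's actual aggregation; the model names and reference scores are hypothetical.

```python
from collections import defaultdict


def win_rates(outcomes):
    """Map each model to wins / games from a list of (winner, loser) pairs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in outcomes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one ranking is constant")
    return cov / (vx * vy) ** 0.5


# Hypothetical pairwise judgments and reference leaderboard scores.
outcomes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
reference = {"model_a": 1200.0, "model_b": 1100.0, "model_c": 1000.0}
rates = win_rates(outcomes)
models = sorted(reference)
rho = spearman([rates[m] for m in models], [reference[m] for m in models])
```

Because only ranks enter the computation, the judge's scores and the leaderboard's scores do not need to be on the same scale.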

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Authors: Xianyang Liu, Shangding Gu, Dawn Song

First: 2026-02-05T18:50:36+00:00 · Latest: 2026-02-05T18:50:36+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.

中文标题/摘要

标题：AgenticPay：多智能体LLM谈判系统，用于买家卖家交易

基于大型语言模型（LLM）的代理越来越多地被期望自主谈判、协调和交易，但现有的基准测试缺乏评估语言中介的多智能体经济互动的规范性设置。我们引入了AgenticPay，这是一种多智能体买家卖家谈判基准和仿真框架，由自然语言驱动。AgenticPay 模拟了买家和卖家拥有私人约束和产品依赖价值的市场，并且必须通过多轮语言谈判达成协议，而不仅仅是数字竞价。该框架支持超过110项任务的多样化套件，从双边讨价还价到多对多市场，具有结构化动作提取和可行性、效率和福利的度量标准。对最先进的专有和开源权重LLM的基准测试揭示了谈判表现的巨大差距，并突显了长期战略推理的挑战，确立了AgenticPay作为研究代理商业和语言驱动的市场互动的基础。代码和数据集可在以下链接获取：https://github.com/SafeRL-Lab/AgenticPay.

Summary / 总结

AgenticPay is a benchmark and simulation framework for evaluating multi-agent buyer-seller negotiations driven by natural language. It models markets with private constraints and product-dependent valuations, requiring agents to reach agreements through multi-round linguistic negotiation. Key findings show significant gaps in negotiation performance among state-of-the-art LLMs, particularly in long-horizon strategic reasoning, highlighting the need for improved agentic commerce systems. Code and dataset are available at https://github.com/SafeRL-Lab/AgenticPay.

AgenticPay 是一个用于评估由自然语言驱动的多代理买家卖家谈判的基准和模拟框架。它模拟了具有私人约束和产品依赖价值的市场，要求代理通过多轮语言谈判达成协议。关键发现表明，最先进的语言模型在谈判性能上存在显著差距，尤其是在长期战略推理方面，强调了需要改进的代理商业系统。代码和数据集可在 https://github.com/SafeRL-Lab/AgenticPay 获取。

On Computation and Reinforcement Learning

Authors: Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach

First: 2026-02-05T18:45:57+00:00 · Latest: 2026-02-05T18:45:57+00:00

Abs · PDF · Code1 · Code2

Abstract

How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture (1) achieves stronger performance simply by using more compute, and (2) generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks that use up to 5 times more parameters.

中文标题/摘要

标题：关于计算与强化学习

可用计算资源的多少如何影响强化学习（RL）策略的学习效果？固定参数数量的策略是否仍能从额外的计算资源中获益？标准的RL框架无法正式回答这些问题。从经验上讲，深度RL策略通常被参数化为具有静态架构的神经网络，混淆了计算资源量和参数数量。在本文中，我们形式化了计算受限策略，并证明使用更多计算资源的策略可以解决计算资源较少的策略无法解决的问题，并且在更长时间范围的任务上表现出更强的泛化能力。基于先前的工作，我们提出了一种可以使用可变计算资源的最小架构。我们的实验补充了我们的理论。在涵盖在线和离线RL的31个不同任务上，我们展示了（1）该架构仅通过使用更多计算资源就能实现更强的性能，（2）在更长时间范围的测试任务上表现出更强的泛化能力，与标准前馈网络或使用多达5倍参数的深度残差网络相比。

Summary / 总结

This paper investigates how the amount of computational resources affects reinforcement learning policies. It formalizes compute-bounded policies and demonstrates that policies with more compute can solve longer-horizon tasks that are beyond the capabilities of policies with less compute. Experiments on 31 tasks show that the proposed architecture performs better with more compute and generalizes better to longer-horizon tasks compared to standard feedforward networks or deep residual networks with up to 5 times more parameters.

该论文研究了计算资源的多少如何影响强化学习策略。它形式化了计算受限的策略，并证明了使用更多计算资源的策略可以解决计算资源较少的策略无法解决的更长时间范围任务。在31个任务上的实验表明，与标准前馈网络或使用多达5倍参数的深度残差网络相比，所提出的架构在使用更多计算资源时性能更强，并且在更长时间范围的任务上泛化能力更好。

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Authors: Jie Deng, Kaichun Yao, Libo Zhang

First: 2026-02-05T18:45:53+00:00 · Latest: 2026-02-05T18:45:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.

Summary / 总结

VisRefiner is a training framework that enables models to learn from visual differences between rendered predictions and reference designs, improving screenshot-to-code generation quality. It constructs difference-aligned supervision to associate visual discrepancies with code edits and introduces a reinforcement learning stage for self-refinement. Experiments show that VisRefiner enhances single-step generation quality and layout fidelity, and provides strong self-refinement ability.

VisRefiner 是一种训练框架，通过将渲染预测与参考设计之间的视觉差异与代码编辑关联起来，提高截图到代码生成的质量。它引入了关联差异的监督，并引入了强化学习阶段进行自我完善。实验表明，VisRefiner 提高了一步生成的质量和布局准确性，并提供了强大的自我完善能力。

Layer-wise LoRA fine-tuning: a similarity metric approach

Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

First: 2026-02-05T18:38:53+00:00 · Latest: 2026-02-05T18:38:53+00:00

Comments: Code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Abs · PDF · Code1 · Code2 · Code3

Abstract

Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

中文标题/摘要

标题：逐层LoRA微调：一种相似度度量方法

在网页规模数据集上预训练大型语言模型（LLMs）已成为推动通用人工智能发展的基础。相比之下，通过微调来增强其在下游任务中的预测性能通常涉及调整其知识。参数高效微调技术，如低秩适应（LoRA），旨在通过冻结预训练模型并更新较少的参数来降低此过程的计算成本。与全微调相比，这些方法的可训练参数数量减少了超过99%，具体取决于配置。不幸的是，随着LLMs的规模不断扩大，这种减少可能变得不足。在本研究中，我们通过系统地选择仅微调少数几层来解决上述问题，使用LoRA或其变体。我们认为，并非所有层都对模型适应贡献相同。利用这一点，我们通过测量它们对内部表示变化的贡献来识别最相关的层进行微调。我们的方法与现有的低秩适应技术是正交的，并且易于兼容。我们通过LoRA技术将可训练参数减少多达50%，同时在不同模型和任务中保持预测性能。具体而言，在仅编码器架构中，这种可训练参数的减少在GLUE基准测试中的预测性能下降可以忽略不计。在仅解码器架构中，我们实现了数学问题解决能力和编程任务中预测性能的小幅下降或甚至改进。最后，这种方法也适用于多模态模型，在这些模型中，我们还观察到与在所有层使用LoRA模块进行微调相比具有竞争力的结果。代码可在：https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Summary / 总结

This paper addresses the challenge of fine-tuning large language models (LLMs) by proposing a layer-wise LoRA fine-tuning method. The authors measure the contribution of each layer to changes in internal representations to identify the most relevant layers for fine-tuning. This approach reduces the number of trainable parameters by up to 50% while maintaining or improving predictive performance across different models and tasks. Specifically, on encoder-only architectures, the performance on the GLUE benchmark remains nearly unchanged, and on decoder-only architectures, there is a small drop or improvement in mathematical problem-solving and coding tasks. The method is compatible with existing low-rank adaptation techniques and is available in the provided code repository.

本文提出了一种分层LoRA微调方法，通过基于内部表示变化选择关键层进行微调，从而将可训练参数减少高达50%，同时在不同模型和任务上保持或提升预测性能。具体来说，在仅编码器架构上，GLUE基准上的性能几乎没有下降；而在仅解码器架构上，数学问题解决和编程任务的性能有所下降或提升。该方法与现有的低秩适应技术兼容，并可在https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA 获取代码。
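
The parameter savings described above can be checked with simple arithmetic. LoRA adapts a frozen d_out x d_in weight matrix with two low-rank factors of shapes d_out x r and r x d_in, so each adapted matrix adds r * (d_in + d_out) trainable parameters. The sketch below counts trainable parameters when LoRA is attached to all layers versus a selected subset. The model shape (12 layers, two adapted 768 x 768 matrices per layer, rank 8) and the chosen layer indices are hypothetical, and the code does not implement the paper's layer-selection criterion.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def trainable_params(selected_layers, matrices_per_layer, d_in, d_out, rank):
    """Total LoRA parameters when only the selected layers receive adapters."""
    per_layer = matrices_per_layer * lora_params(d_in, d_out, rank)
    return len(selected_layers) * per_layer


# Hypothetical configuration: 12 layers, 2 adapted matrices per layer, rank 8.
all_layers = list(range(12))
chosen_layers = [0, 1, 2, 9, 10, 11]  # placeholder selection, not the paper's method

full = trainable_params(all_layers, 2, 768, 768, 8)
subset = trainable_params(chosen_layers, 2, 768, 768, 8)
reduction = 1 - subset / full
```

With equally sized layers, the trainable-parameter count scales with the number of adapted layers, so adapting half of them halves the count relative to LoRA on every layer.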

SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model

Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu

First: 2026-01-12T05:03:12+00:00 · Latest: 2026-02-05T18:37:54+00:00

Comments: 12 pages, 14 figures, accepted in WACVW 2026

Abs · PDF · Code1 · Code2

Abstract

Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.

中文标题/摘要

标题：SIRR-LMM：基于大型多模态模型的单张图像反射去除

玻璃表面会产生复杂的反射和透射光相互作用，使得单张图像反射去除（SIRR）具有挑战性。现有数据集在合成数据中缺乏物理现实性，或在实际捕获中规模不足。我们提出了一种合成数据集生成框架，通过在真实背景图像上路径追踪3D玻璃模型来创建具有多种玻璃属性、相机设置和后处理效果的物理准确反射场景。为了利用大型多模态模型（LMM）的能力，我们将图像层合并为单一复合输入，进行联合描述，并使用针对特定任务的LoRA进行微调，而不是进行全面参数训练。这使我们的方法在反射去除和分离性能方面优于现有最先进的方法。

Summary / 总结

The research addresses the challenge of single-image reflection removal (SIRR) from glass surfaces by introducing a new synthetic dataset generation framework that combines 3D path-traced glass models with real background imagery. The approach uses a Large Multimodal Model (LMM) with a composite input and fine-tuning via Low-Rank Adaptation (LoRA) rather than full-parameter training, leading to better reflection removal and separation performance compared to existing methods.

研究旨在通过引入新的合成数据生成框架和利用大型多模态模型（LMM）来解决玻璃表面的单图像反光去除（SIRR）问题。该框架通过路径追踪3D玻璃模型在真实背景上创建物理上准确的反光场景。通过使用任务特定的低秩适应（LoRA）而非全参数训练对LMM进行微调，从而在反光去除和分离性能上优于现有方法。

RISE-Video: Can Video Generators Decode Implicit World Rules?

Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang

First: 2026-02-05T18:36:10+00:00 · Latest: 2026-02-05T18:36:10+00:00

Comments: 38 pages, 16 figures, 3 tables; Code: https://github.com/VisionXLab/RISE-Video; HuggingFace: https://huggingface.co/datasets/VisionXLab/RISE-Video


Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

中文标题/摘要

标题：RISE-Video：视频生成器能否解码隐含的世界规则？

尽管生成式视频模型在视觉保真度方面取得了显著进展，但它们在内化和推理隐含世界规则方面的能力仍然是一个关键但尚未充分探索的领域。为弥合这一差距，我们提出了RISE-Video，这是一种开创性的基于推理的Text-Image-to-Video (TI2V) 合成基准，将评估重点从表面美学转移到深层次的认知推理。RISE-Video 包含467个精心的人工标注样本，涵盖八个严格的类别，为从常识和空间动态到专业主题领域的模型智能提供了一个结构化的测试平台。我们的框架引入了四个维度的评估协议：推理一致性、时间一致性、物理合理性以及视觉质量。为了进一步支持可扩展的评估，我们提出了一种基于大型多模态模型（LMMs）的自动化流程，以模拟人类评估。在11个最先进的TI2V模型上的广泛实验揭示了在隐含约束下模拟复杂场景的普遍缺陷，为未来世界模拟生成模型的发展提供了关键见解。

Summary / 总结

RISE-Video is a reasoning-oriented benchmark for Text-Image-to-Video synthesis that evaluates models based on their ability to reason over implicit world rules rather than just visual aesthetics. It includes 467 human-annotated samples across eight categories and introduces four evaluation metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Experiments on 11 state-of-the-art models highlight their limitations in handling complex scenarios under implicit constraints, providing valuable insights for future model development.

RISE-Video 是一个针对文本-图像到视频合成的推理导向基准，评估模型在处理隐含世界规则方面的推理能力，而非仅仅视觉保真度。它包含467个人标注样本，涵盖八个类别，并引入了四个评估指标：推理一致性、时间一致性、物理合理性以及视觉质量。对11个最先进的模型的实验揭示了它们在处理隐含约束下的复杂场景时的局限性，为未来模型的发展提供了宝贵的见解。

Geographically-aware Transformer-based Traffic Forecasting for Urban Motorway Digital Twins

Authors: Krešimir Kušić, Vinny Cahill, Ivana Dusparic

First: 2026-02-05T18:33:03+00:00 · Latest: 2026-02-05T18:33:03+00:00

Comments: IEEE IV2026 37th IEEE Intelligent Vehicles Symposium


Abstract

The operational effectiveness of digital-twin technology in motorway traffic management depends on the availability of a continuous flow of high-resolution real-time traffic data. To function as a proactive decision-making support layer within traffic management, a digital twin must also incorporate predicted traffic conditions in addition to real-time observations. Due to the spatio-temporal complexity and the time-variant, non-linear nature of traffic dynamics, predicting motorway traffic remains a difficult problem. Sequence-based deep-learning models offer clear advantages over classical machine learning and statistical models in capturing long-range temporal dependencies in time-series traffic data, yet limitations in forecasting accuracy and model complexity point to the need for further improvements. To improve motorway traffic forecasting, this paper introduces a Geographically-aware Transformer-based Traffic Forecasting (GATTF) model, which exploits the geographical relationships between distributed sensors using their mutual information (MI). The model has been evaluated using real-time data from the Geneva motorway network in Switzerland, and the results confirm that incorporating geographical awareness through MI improves forecasting accuracy over a standard Transformer without increasing model complexity.

中文标题/摘要

标题：面向城市高速公路数字孪生的地理感知Transformer交通预测

数字孪生技术在高速公路交通管理中的运营有效性取决于实时高分辨率交通数据的持续流动。为了作为交通管理中的主动决策支持层，数字孪生还必须包含预测的交通状况，而不仅仅是实时观测。由于交通动态的空间-时间复杂性和随时间变化的非线性性质，预测高速公路交通仍然是一个困难的问题。基于序列的深度学习模型在捕捉时间序列交通数据中的长期、时间依赖性方面明显优于经典机器学习和统计模型，但预测准确性和模型复杂性的局限性表明需要进一步改进。为了改进高速公路交通预测，本文提出了一种具有地理意识的基于Transformer的交通预测（GATTF）模型，该模型利用分布式传感器之间的地理关系及其互信息（MI）。该模型使用来自瑞士日内瓦高速公路网络的实时数据进行了评估，结果证实，通过MI引入地理意识可以提高GATTF预测的准确性，而不会增加模型复杂性。

Summary / 总结

This paper addresses the challenge of accurately forecasting motorway traffic by introducing the Geographically-aware Transformer-based Traffic Forecasting (GATTF) model. The model leverages the geographical relationships between sensors using mutual information to improve traffic prediction accuracy. Evaluation on real-time data from the Geneva motorway network shows that GATTF outperforms a standard Transformer model in terms of forecasting accuracy without increasing model complexity.

本文通过引入地理感知的基于Transformer的交通预测模型（GATTF），解决了准确预测高速公路交通的挑战。该模型利用互信息刻画传感器之间的地理关系，以提高预测准确性。基于瑞士日内瓦高速公路网络实时数据的评估表明，GATTF在不增加模型复杂度的情况下，比标准Transformer模型具有更高的预测准确性。
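
To make the mutual information (MI) ingredient concrete, here is a minimal sketch of a plug-in MI estimate between the readings of two sensors after binning. The abstract does not say how GATTF estimates MI or feeds it into the Transformer, so the binning scheme, function names, and sample values below are illustrative assumptions only.

```python
import math
from collections import Counter

def discretize(series, bins):
    """Map real-valued readings to equal-width bin indices 0..bins-1."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant series
    return [min(int((v - lo) / width), bins - 1) for v in series]

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# Hypothetical flow readings (vehicles per interval) at two sensors.
upstream = [10.0, 12.0, 30.0, 34.0, 11.0, 33.0]
downstream = [9.0, 13.0, 29.0, 35.0, 10.0, 31.0]
mi = mutual_information(discretize(upstream, 2), discretize(downstream, 2))
```

Sensor pairs whose traffic states move together get a high MI score, while unrelated pairs score near zero, which is one plausible way to encode "geographical awareness" as a pairwise weight.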

Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

Authors: Felipe D. Toro-Hernández, Jesuino Vieira Filho, Rodrigo M. Cabral-Carvalho

Venue: ICLR 2026

First: 2026-02-05T18:23:04+00:00 · Latest: 2026-02-05T18:23:04+00:00

Comments: 10 pages, 6 figures (excluding refs/appendix). Accepted to ICLR 2026


Abstract

Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representations, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.

中文标题/摘要

标题：在嵌入空间中表征人类概念生成的语义导航轨迹

语义表示可以构架为一个结构化、动态的知识空间，人类在其中导航以检索和操作意义。为了研究人类如何穿越这一几何结构，我们提出了一种框架，将概念生成视为在嵌入空间中的导航。使用不同的变换器文本嵌入模型，我们基于累积嵌入构建了参与者特定的语义轨迹，并提取了几何和动力学度量，包括到下一个的距离、到质心的距离、熵、速度和加速度。这些度量捕捉了语义导航的标量和方向性方面，提供了语义表示搜索作为几何空间中运动的计算基础观点。我们在四个跨语言数据集上评估了该框架，涵盖不同的属性生成任务：神经退行性疾病、脏话流畅性、意大利语属性列表任务和德语。在这些背景下，我们的方法区分了临床组和概念类型，提供了一个与典型劳动密集型语言预处理方法相比需要最少人工干预的数学框架。与非累积方法的比较表明，累积嵌入对于较长的轨迹效果最佳，而较短的轨迹可能提供太少的上下文，倾向于非累积替代方法。关键的是，不同的嵌入模型产生了类似的结果，突显了尽管训练管道不同，不同学习表示之间的相似性。通过将语义导航构架为嵌入空间中的结构化轨迹，将认知建模与学习表示相结合，从而建立了一个量化语义表示动力学的管道，具有临床研究、跨语言分析和评估人工认知的应用。
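
The trajectory metrics named in the abstract can be illustrated with a small sketch. It assumes that a "cumulative embedding" at step t is the running mean of the item embeddings produced so far, and that velocity is the step length per unit time; both are assumptions made for illustration, since the abstract does not define them. Entropy is omitted because its definition is not given.

```python
import math

def running_mean(vectors):
    """Cumulative trajectory: point t is the mean of item embeddings 0..t (assumed)."""
    total = [0.0] * len(vectors[0])
    out = []
    for t, v in enumerate(vectors, start=1):
        total = [s + x for s, x in zip(total, v)]
        out.append([s / t for s in total])
    return out

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def trajectory_metrics(points):
    """Geometric and dynamical metrics of a trajectory with unit time steps."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    to_next = [dist(p, q) for p, q in zip(points, points[1:])]
    return {
        "distance_to_next": to_next,
        "distance_to_centroid": [dist(p, centroid) for p in points],
        "velocity": to_next,  # step length per unit time
        "acceleration": [b - a for a, b in zip(to_next, to_next[1:])],
    }

# Hypothetical 2-D "embeddings" of three produced words.
items = [[2.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
metrics = trajectory_metrics(running_mean(items))
```

With real data, `items` would hold the vectors returned by a text embedding model for each produced word, and the resulting metric series could be compared across participant groups.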

Inverse Depth Scaling From Most Layers Being Similar

Authors: Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

First: 2026-02-05T18:22:41+00:00 · Latest: 2026-02-05T18:22:41+00:00

Comments: 23 pages, 24 figures


Abstract

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.

中文标题/摘要

标题：大多数层相似时的逆深度缩放

神经网络缩放定律将损失与大型语言模型（LLM）的模型大小相关联，但深度和宽度可能以不同的方式影响性能，需要更详细的研究。在这里，我们通过分析LLM和玩具残差网络来量化深度如何影响损失。我们发现LLM中的损失与深度成反比，这可能是由于功能相似的层通过集成平均而不是组合学习或对平滑动力学的离散化来减少误差。这种机制虽然效率低下但具有鲁棒性，可能源于残差网络的架构偏置和与平滑动力学不兼容的目标函数。研究结果表明，提高LLM效率可能需要架构创新以鼓励深度的组合使用。

Summary / 总结

This study investigates how depth affects loss in large language models (LLMs) by analyzing LLMs and toy residual networks. The research finds that loss scales inversely with depth, likely due to functionally similar layers reducing error through ensemble averaging. This regime is inefficient but robust and may arise from architectural bias and target functions incompatible with smooth dynamics. The findings imply that improving LLM efficiency might necessitate architectural innovations to encourage compositional use of depth.

研究通过分析大型语言模型（LLMs）和玩具残差网络，探讨了深度如何影响LLMs的损失。研究发现，LLMs中的损失与深度成反比，这可能是由于功能相似的层通过集成平均减少误差。这种机制虽然效率低下但很稳健，可能源于残差网络的架构偏见以及与平滑动力学不兼容的目标函数。研究结果表明，提高LLMs的效率可能需要通过架构创新来促进深度的组合使用。
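
The ensemble-averaging explanation can be checked with a toy simulation: if each of L layers contributes an independent noisy estimate of the same quantity and the estimates are averaged, the mean squared error falls roughly as 1/L. This is a generic statistical sketch under an independence assumption, not the paper's experiment; the function name and parameters are hypothetical.

```python
import random

def ensemble_mse(num_layers, trials=4000, noise=1.0, seed=0):
    """Mean squared error of the average of `num_layers` independent noisy estimates."""
    rng = random.Random(seed)
    target = 0.0
    total = 0.0
    for _ in range(trials):
        # Each "layer" sees the target plus independent Gaussian noise.
        estimate = sum(target + rng.gauss(0.0, noise)
                       for _ in range(num_layers)) / num_layers
        total += (estimate - target) ** 2
    return total / trials

for depth in (1, 4, 16):
    print(depth, ensemble_mse(depth))
```

The printed errors shrink in proportion to 1/depth, which is the inverse scaling the abstract attributes to functionally similar layers. A compositional use of depth would, in principle, allow a faster decay than this averaging regime.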

A Hybrid Data-Driven Algorithm for Real-Time Friction Force Estimation in Hydraulic Cylinders

Authors: Mohamad Amin Jamshidi, Mehrbod Zarifi, Zolfa Anvari, Hamed Ghafarirad, Mohammad Zareinejad

First: 2026-02-05T18:21:28+00:00 · Latest: 2026-02-05T18:21:28+00:00

Comments: Published in: 2025 33rd International Conference on Electrical Engineering (ICEE), Publisher IEEE


Abstract

Hydraulic systems are widely utilized in industrial applications due to their high force generation, precise control, and ability to function in harsh environments. Hydraulic cylinders, as actuators in these systems, apply force and position through the displacement of hydraulic fluid, but their operation is significantly influenced by friction force. Achieving precision in hydraulic cylinders requires an accurate friction model under various operating conditions. Existing analytical models, often derived from experimental tests, necessitate the identification or estimation of influencing factors but are limited in adaptability and computational efficiency. This research introduces a data-driven, hybrid algorithm based on Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation. The algorithm effectively combines feature detection and estimation processes using training data acquired from an experimental hydraulic test setup. It achieves a consistent and stable model error of less than 10% across diverse operating conditions and external load variations, ensuring robust performance in complex situations. The computational cost of the algorithm is 1.51 milliseconds per estimation, making it suitable for real-time applications. The proposed method addresses the limitations of analytical models by delivering high precision and computational efficiency. The algorithm's performance is validated through detailed analysis and experimental results, including direct comparisons with the LuGre model. The comparison highlights that while the LuGre model offers a theoretical foundation for friction modeling, its performance is limited by its inability to dynamically adjust to varying operational conditions of the hydraulic cylinder, further emphasizing the advantages of the proposed hybrid approach in real-time applications.

中文标题/摘要

标题：一种用于液压缸实时摩擦力估计的混合数据驱动算法

液压系统因其高力输出、精确控制和在恶劣环境中的功能而广泛应用于工业应用。作为这些系统中的执行器，液压缸通过液压流体的位移施加力和位置，但其操作受到摩擦力的显著影响。要在液压缸中实现精确控制，需要在各种操作条件下具备准确的摩擦模型。现有的分析模型通常是从实验测试中推导出来的，需要识别或估计影响因素，但其适应性和计算效率有限。本研究提出了一种基于长短期记忆（LSTM）网络和随机森林的混合数据驱动算法，用于非线性摩擦力估计。该算法通过从实验液压测试装置中获取的训练数据，有效结合了特征检测和估计过程。该算法在各种操作条件和外部负载变化下实现了小于10%的一致和稳定的模型误差，确保在复杂情况下具有稳健的性能。该算法的计算成本为每次估计1.51毫秒，使其适用于实时应用。该方法通过提供高精度和计算效率解决了分析模型的局限性。算法性能通过详细分析和实验结果得到验证，包括与LuGre模型的直接比较。比较表明，虽然LuGre模型为摩擦建模提供了理论基础，但其性能受限于无法动态适应液压缸的运行条件变化，进一步突显了所提混合方法在实时应用中的优势。

Summary / 总结

This research develops a hybrid data-driven algorithm using LSTM networks and Random Forests for real-time friction force estimation in hydraulic cylinders. The algorithm combines feature detection and estimation processes, achieving consistent errors below 10% across various operating conditions and load variations. It demonstrates robust performance with a computational cost of 1.51 milliseconds per estimation, suitable for real-time applications. Experimental results show that the proposed method outperforms the LuGre model in dynamic operational conditions, highlighting its advantages in precision and computational efficiency.

该研究提出了一种基于LSTM网络和随机森林的混合数据驱动算法，用于液压缸的实时摩擦力估计。该算法结合了特征检测和估计过程，使用实验数据，实现了在各种操作条件和负载变化下模型误差低于10%的一致性。其每估计一次的计算成本为1.51毫秒，适用于实时应用。该方法通过动态适应液压缸的变操作条件，优于LuGre模型，确保在复杂情况下具有稳健的性能。

Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access

Authors: Daniel Ebi, Gaspard Lambrechts, Damien Ernst, Klemens Böhm

First: 2025-09-30T09:32:20+00:00 · Latest: 2026-02-05T18:21:20+00:00

Comments: 11 pages, 26 pages total, 3 figures


Abstract

Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.

中文标题/摘要

标题：知情不对称演员-评论家：利用超越全状态访问的特权信号

不对称演员-评论家方法在部分可观测强化学习中广泛应用，但通常假设在训练过程中评论家可以基于完整状态进行条件化，这在实践中往往不现实。我们引入了知情不对称演员-评论家框架，允许评论家基于任意状态依赖的特权信号进行条件化，而无需访问完整状态。我们证明任何这样的特权信号都能提供无偏的策略梯度估计，极大地扩展了可接受的特权信息集。这提出了选择最合适的特权信息以提高学习的问题。为此，我们提出了两种新的信息性标准：一种基于依赖性的测试，可以在训练前应用；另一种基于价值预测准确性的改进，可以在训练后应用。在部分可观测基准任务和合成环境上的实验证明，精心选择的特权信号可以匹配或超越依赖完整状态的不对称基线，同时依赖严格更少的状态信息。

Summary / 总结

The research aims to address the limitations of asymmetric actor-critic methods in partially observable environments by introducing an informed asymmetric actor-critic framework. This framework allows the critic to condition on arbitrary state-dependent privileged signals without needing full state access. The study shows that any such privileged signal provides unbiased policy gradient estimates, significantly expanding the types of privileged information that can be used. Two new criteria are proposed to select the most useful privileged signals: a dependence-based test before training and a post-hoc criterion based on improvements in value prediction accuracy. Experiments on benchmark tasks and synthetic environments show that carefully chosen privileged signals can match or outperform full-state baselines while using less state information.

研究旨在通过引入一种知情的不对称演员-评论家框架来解决部分可观测环境中不对称演员-评论家方法的限制。该框架允许评论家在无需访问完整状态的情况下，基于任意状态相关的特权信号进行条件化。研究表明，任何这样的特权信号都能提供无偏的策略梯度估计，显著扩展了可使用的特权信息类型。提出了两种新的标准来选择最有用的特权信号：一种是在训练前应用的依赖性测试，另一种是在训练后基于价值预测精度改进的标准。实验结果表明，精心选择的特权信号可以与或超越完整状态基线的表现，同时使用更少的状态信息。

LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

Authors: Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys

First: 2026-02-05T18:21:02+00:00 · Latest: 2026-02-05T18:21:02+00:00

Comments: Accepted to IEEE IV 2026. 8 pages, 3 figures. Code available at https://github.com/mirlanium/LSA


Abstract

Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the outputs of an off-the-shelf feature extraction model on the ground-truth and generated video clips, localized around dynamic objects, which yields a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any computational overhead.

中文标题/摘要

标题：LSA：局部语义对齐以增强交通视频生成中的时间一致性

可控视频生成已成为自主驾驶领域的一种多功能工具，能够实现对交通场景的逼真合成。然而，现有方法依赖于推理时的控制信号来引导生成模型生成动态对象的时间一致性，限制了它们作为可扩展和通用数据引擎的实用性。在本文中，我们提出了一种简单而有效的框架——局部语义对齐（LSA），用于微调预训练的视频生成模型。LSA通过在真实视频和生成视频片段之间对齐语义特征来增强时间一致性。具体而言，我们比较了现成特征提取模型在真实视频和生成视频片段（围绕动态对象局部化）之间的输出，诱导语义特征一致性损失。我们通过将此损失与标准扩散损失结合来微调基础模型。使用我们新颖的损失微调一次迭代后，模型在常见的视频生成评估指标中优于基线。为了进一步测试生成视频的时间一致性，我们从目标检测任务中适应了两个额外的指标，即mAP和mIoU。在nuScenes和KITTI数据集上的大量实验表明，我们的方法在无需推理时外部控制信号和任何计算开销的情况下，能够有效增强视频生成的时间一致性。

Summary / 总结

The research aims to improve the temporal consistency in traffic video generation for autonomous driving applications. It introduces Localized Semantic Alignment (LSA), a method that fine-tunes pre-trained video generation models by aligning semantic features between ground-truth and generated video clips around dynamic objects. The approach uses a semantic feature consistency loss combined with a diffusion loss to enhance temporal consistency. Experiments on nuScenes and KITTI datasets demonstrate that LSA outperforms baseline methods in common video generation metrics and additional object detection metrics, achieving better temporal consistency without requiring external control signals or additional computational overheads.

研究旨在通过改进交通视频生成中的时间一致性，为自动驾驶应用提供支持。提出了一种局部语义对齐（LSA）方法，通过在动态物体周围对生成视频片段和真实视频片段之间的语义特征进行对齐，来微调预训练的视频生成模型。该方法结合了语义特征一致性损失和扩散损失来增强时间一致性。在nuScenes和KITTI数据集上的实验表明，LSA在常见视频生成指标和额外的对象检测指标上优于基线方法，实现了更好的时间一致性，无需在推理过程中使用外部控制信号或增加额外的计算开销。

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

Authors: Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah

First: 2026-02-05T18:20:21+00:00 · Latest: 2026-02-05T18:20:21+00:00


Abstract

Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/

中文标题/摘要

标题：学会分享：选择性记忆以提高并行代理系统的效率

代理系统通过协调多个代理进行迭代推理、调用工具并交换中间结果来解决复杂任务。为了提高鲁棒性和解决方案质量，最近的方法部署了多个并行运行的代理团队，以探索不同的推理路径。然而，这种并行执行带来了显著的计算成本：当不同的团队独立地对相似的子问题进行推理或执行类似步骤时，它们会重复进行大量的重叠计算。为了解决这些限制，本文提出了一种名为学习分享（LTS）的机制，这是一种并行代理框架中的学习共享内存机制，能够选择性地在团队之间重用信息，同时控制上下文的增长。LTS 引入了一个全局内存库，所有团队都可以访问，并且有一个轻量级控制器来决定中间代理步骤是否应添加到内存中。控制器通过带有使用感知的信用分配的逐步强化学习进行训练，使其能够识别在并行执行中具有全局用处的信息。在 AssistantBench 和 GAIA 基准测试上的实验表明，与无内存的并行基线相比，LTS 显著减少了总体运行时间，同时匹配或提高了任务性能，证明了学习记忆准入是提高并行代理系统效率的有效策略。项目页面：https://joefioresi718.github.io/LTS_webpage/

Summary / 总结

This paper addresses the computational inefficiency in parallel agentic systems by proposing Learning to Share (LTS), a mechanism that enables selective cross-team information reuse. LTS introduces a global memory bank and a lightweight controller that decides which intermediate steps should be stored, reducing redundant computation. Experiments on AssistantBench and GAIA benchmarks show that LTS significantly reduces runtime while maintaining or improving task performance compared to memory-free baselines.

本文提出了一种名为Learning to Share (LTS) 的机制，通过选择性地在团队间重用中间信息来解决并行智能体系统中的计算效率问题。LTS 引入了一个全局记忆库和一个轻量级控制器，该控制器决定哪些中间步骤应被存储，从而减少重复计算。实验结果表明，LTS 在 AssistantBench 和 GAIA 基准测试中显著减少了运行时间，同时保持或提高了任务性能，优于无记忆的并行基线系统。

Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces

Authors: Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Nikolay Malkin

First: 2026-02-05T18:16:57+00:00 · Latest: 2026-02-05T18:16:57+00:00

Comments: Code: https://github.com/mmacosha/offpolicy-discrete-diffusion-samplers-and-bridges


Abstract

Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.

Summary / 总结

This paper addresses the challenge of sampling from a distribution known up to a normalising constant in discrete spaces, which is a significant problem in statistics. It introduces off-policy training techniques for discrete diffusion samplers, improving their performance on various benchmarks. Additionally, the authors extend these samplers to bridge between two arbitrary distributions, introducing a new training method called data-to-energy Schrödinger bridge for the discrete domain. The proposed methods are applied to data-free posterior sampling in the latent spaces of image generative models.

该论文解决了在离散空间中从仅已知到归一化常数的分布中采样这一统计学难题。它为离散扩散采样器引入了离策略训练技术，提高了它们在各种基准上的性能。作者还将这些采样器推广到在两个任意分布之间搭桥的任务，首次提出了离散域中的数据到能量薛定谔桥训练方法。所提出的方法被应用于图像生成模型离散潜在空间中的无数据后验采样。

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

Authors: Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

First: 2026-02-05T18:08:20+00:00 · Latest: 2026-02-05T18:08:20+00:00

Comments: Project Page: https://junwankimm.github.io/CSFM


Abstract

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

中文标题/摘要

标题：更好的源分布，更好的流匹配：学习条件依赖的源分布

流匹配最近已成为扩散生成模型的一种有前途的替代方案，特别是在文本到图像生成方面。尽管它允许任意的源分布，但大多数现有方法仍然依赖于标准的高斯分布，这是从扩散模型继承而来的选择，很少将源分布本身作为优化目标。在本文中，我们展示了在现代文本到图像系统中，源分布的合理设计不仅是可行的，而且是有益的。具体来说，我们提出了在流匹配目标下学习条件依赖的源分布，以更好地利用丰富的条件信号。我们识别了直接将条件信号纳入源分布时出现的关键失败模式，包括分布坍塌和不稳定性，并表明适当的方差正则化和源与目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响结构化源的流匹配，揭示了在这种设计中最有效的区域。在多个文本到图像基准上的广泛实验表明，一致且稳健的改进，包括FID收敛速度提高3倍，突显了条件流匹配中合理源分布设计的实际益处。

Summary / 总结

This work addresses the limitations of using a standard Gaussian distribution as the source distribution in flow matching for text-to-image generation. It proposes learning a condition-dependent source distribution to better utilize rich conditioning signals. The study identifies issues like distributional collapse and instability when directly incorporating conditioning into the source and emphasizes the importance of variance regularization and directional alignment. Experiments show consistent improvements, with up to a 3x faster convergence in FID scores, indicating the practical benefits of a principled source distribution design for conditional flow matching.

该研究针对在文本到图像生成的流匹配中使用标准高斯分布作为源分布的局限性，提出学习条件依赖的源分布以更好地利用条件信号。研究指出，直接将条件信息融入源分布会导致分布坍缩和不稳定性等问题，并强调了方差正则化和源与目标的方向对齐的重要性。实验结果表明，这种设计可以实现一致且稳健的改进，FID收敛速度最多可提高3倍，突显了条件流匹配中合理源分布设计的实际益处。

Breaking Symmetry Bottlenecks in GNN Readouts

Authors: Mouad Talhi, Arne Wolf, Anthea Monod

First: 2026-02-05T18:08:13+00:00 · Latest: 2026-02-05T18:08:13+00:00

Comments: 23 pages


Abstract

Graph neural networks (GNNs) are widely used for learning on structured data, yet their ability to distinguish non-isomorphic graphs is fundamentally limited. These limitations are usually attributed to message passing; in this work we show that an independent bottleneck arises at the readout stage. Using finite-dimensional representation theory, we prove that all linear permutation-invariant readouts, including sum and mean pooling, factor through the Reynolds (group-averaging) operator and therefore project node embeddings onto the fixed subspace of the permutation action, erasing all non-trivial symmetry-aware components regardless of encoder expressivity. This yields both a new expressivity barrier and an interpretable characterization of what global pooling preserves or destroys. To overcome this collapse, we introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics, preserving permutation invariance while retaining information provably invisible to averaging. Empirically, swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks, demonstrating that readout design is a decisive and under-appreciated factor in GNN expressivity.

中文标题/摘要

标题：打破GNN读出阶段的对称性瓶颈

图神经网络（GNNs）广泛用于结构化数据的学习，但它们区分非同构图的能力从根本上受到限制。这些限制通常归因于消息传递；在本文中，我们表明独立的瓶颈出现在读出阶段。利用有限维表示论，我们证明所有线性置换不变读出，包括求和和平均池化，都通过Reynolds（群平均）算子，并因此将节点嵌入投影到置换作用的不变子空间上，抹去了所有非平凡的对称性感知成分，无论编码器的表达能力如何。这既产生了一个新的表达能力障碍，也提供了一个可解释的关于全局池化保留或破坏什么的表征。为了克服这种坍塌，我们引入了基于投影的不变读出，将节点表示分解为对称性感知通道，并用非线性不变统计进行汇总，同时保持置换不变性并保留平均化无法捕捉到的信息。实验上，仅交换读出即可使固定编码器区分WL难题图对，并在多个基准测试中提高性能，表明读出设计是GNN表达能力的关键且被低估的因素。

Summary / 总结

The research addresses the limitations of graph neural networks (GNNs) in distinguishing non-isomorphic graphs, attributing these limitations to a bottleneck at the readout stage. By using finite-dimensional representation theory, the authors prove that linear permutation-invariant readouts project node embeddings onto a fixed subspace, erasing symmetry-aware components. To overcome this, they introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics, preserving permutation invariance while retaining information invisible to averaging. Empirical results show that swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks.

论文探讨了图神经网络（GNN）在区分非同构图方面的局限性，这些局限性不仅源于消息传递，还源于读出阶段。研究证明，线性置换不变读出将节点嵌入投影到一个固定子空间，消除了所有非平凡的对称感知成分。为克服这一问题，作者提出了基于投影的不变读出，将节点表示分解为对称感知通道，并使用非线性不变统计进行汇总，同时保持置换不变性并保留平均值无法捕捉的信息。实验表明，仅替换读出就能使固定编码器区分WL-hard图对，并在多个基准测试中提高性能。
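
The readout collapse described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`reynolds`, `mean_readout`, `projector_readout`) and the choice of "energy of the centered component" as the nonlinear invariant statistic are assumptions made for illustration only. The sketch shows that averaging over all node permutations leaves only the column means, so two different embedding sets with equal means are indistinguishable to mean pooling, while a nonlinear statistic of the component discarded by averaging still separates them.

```python
import itertools
import numpy as np

def reynolds(X):
    """Group-average X over every permutation of its rows (nodes)."""
    n = X.shape[0]
    perms = list(itertools.permutations(range(n)))
    return sum(X[list(p)] for p in perms) / len(perms)

def mean_readout(X):
    """Linear permutation-invariant readout: mean pooling over nodes."""
    return X.mean(axis=0)

def projector_readout(X):
    """Hypothetical nonlinear invariant readout (illustrative assumption).

    Splits X into its projection onto the fixed subspace (the column means)
    and the centered remainder, then summarizes the remainder by its
    per-feature squared norm, which is unchanged by reordering nodes.
    """
    mean = X.mean(axis=0)
    centered = X - mean
    energy = (centered ** 2).sum(axis=0)
    return np.concatenate([mean, energy])

# Two toy node-embedding matrices (4 nodes, 2 features) with equal means.
A = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
B = np.array([[2.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0]])

# Averaging over permutations keeps only the mean row, repeated per node.
print(reynolds(B))
# Mean pooling cannot tell A and B apart; the nonlinear readout can.
print(mean_readout(A), mean_readout(B))
print(projector_readout(A), projector_readout(B))
```

The exhaustive sum over permutations is only practical for tiny graphs; it is used here to make the group-averaging operator explicit rather than as a recommended way to compute it.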

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

First: 2026-01-22T18:24:00+00:00 · Latest: 2026-02-05T18:03:03+00:00

Comments: Code: https://github.com/test-time-training/discover

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

中文标题/摘要

标题：在测试时学习发现

我们如何使用AI为科学问题发现新的最优结果？先前的测试时扩展工作，如AlphaEvolve，通过提示冻结的LLM进行搜索。我们在测试时进行强化学习，因此LLM可以继续训练，但此时使用的是针对测试问题的特定经验。这种持续学习的形式非常特殊，因为它的目标是产生一个出色的解决方案，而不是平均意义上的许多较好方案，并且是解决这一具体问题，而不是泛化到其他问题。因此，我们的学习目标和搜索子程序被设计为优先考虑最有前途的解决方案。我们称这种方法为测试时训练以发现（TTT-Discover）。遵循先前的工作，我们专注于具有连续奖励的问题。我们报告了我们尝试的每个问题的结果，涵盖数学、GPU内核工程、算法设计和生物学。TTT-Discover在几乎所有问题上都取得了新的最优结果：(i) 埃尔德什最小重叠问题和一个自相关不等式；(ii) GPUMode内核竞赛（比先前的最佳结果快至$2\times$）；(iii) 过去的AtCoder算法竞赛；以及(iv) 单细胞分析中的去噪问题。我们的解决方案由专家或组织者审核。所有结果均使用开放模型OpenAI gpt-oss-120b取得，并可通过我们公开的代码复现，而此前的最佳结果需要封闭的前沿模型。我们的测试时训练运行使用Thinking Machines的Tinker API，每个问题的成本仅为几百美元。

Summary / 总结

The research aims to use AI to discover new state-of-the-art solutions for scientific problems by performing reinforcement learning at test time. The method, Test-Time Training to Discover (TTT-Discover), lets the LLM continue training on experience specific to the test problem, with a learning objective and search subroutine that prioritize the most promising solutions. It sets a new state of the art on almost all attempted problems across mathematics, GPU kernel engineering, algorithm design, and biology, including GPU kernels up to 2x faster than prior art. All results are reproducible with publicly available code and an open model, OpenAI gpt-oss-120b, at a cost of a few hundred dollars per problem.

研究旨在通过在测试时进行强化学习来使用AI发现科学问题的新前沿解决方案。该方法称为Test-Time Training to Discover (TTT-Discover)，允许LLM继续使用特定于测试问题的经验进行训练，并优先考虑最有前途的解决方案。该方法在数学、GPU内核工程、算法设计和生物学等多个领域设置了新的前沿结果，解决方案由专家或组织者审核。所有结果使用开源模型OpenAI gpt-oss-120b实现，并可通过公开的代码进行复现。

$f$-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

First: 2026-02-05T18:01:52+00:00 · Latest: 2026-02-05T18:01:52+00:00

Abstract

Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy reinforcement learning objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of $f$-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.

中文标题/摘要

标题：$f$-GRPO及其扩展：基于散度的通用大语言模型对齐强化学习算法

近期研究表明，偏好对齐(PA)目标可以作为对齐(选择)和未对齐(拒绝)响应分布之间散度的估计器。在此项工作中，我们将这种基于散度的观点扩展到一般的对齐设置中，例如具有可验证奖励的强化学习(RLVR)，其中仅有环境奖励可用。在这一统一框架中，我们基于$f$-散度的变分表示，提出了用于通用大语言模型对齐的$f$-组相对策略优化($f$-GRPO)这一类在线策略强化学习方法，以及$f$-混合对齐损失($f$-HAL)这一类混合在线/离线策略目标。我们提供了这些类目标在对齐后提高平均奖励的理论保证。实验上，我们在RLVR(数学推理)和PA任务(安全对齐)上验证了我们的框架，展示了与当前方法相比的优越性能和灵活性。

Summary / 总结

This work extends the divergence-based perspective of Preference Alignment objectives to reinforcement learning with verifiable rewards and proposes $f$-GRPO and $f$-HAL for general LLM alignment. Theoretical guarantees show these methods improve average reward after alignment. Empirical validation on RLVR and PA tasks demonstrates superior performance and flexibility compared to existing methods.

该工作将偏好对齐（PA）目标的散度视角扩展到可验证奖励的强化学习（RLVR），并提出了$f$-GRPO和$f$-HAL以实现通用的LLM对齐。理论保证表明这些方法在对齐后可以提高平均奖励。实验结果表明，与现有方法相比，该框架在RLVR和PA任务上具有更好的性能和灵活性。
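
As background for the variational representation of $f$-divergences mentioned in the abstract, the following minimal sketch checks the standard bound $D_f(P \| Q) \ge \mathbb{E}_P[T] - \mathbb{E}_Q[f^*(T)]$ for the KL case, where $f(u) = u \log u$ and $f^*(t) = e^{t-1}$. It is not the $f$-GRPO or $f$-HAL objective from the paper. The distributions `p` and `q` and both critics are made-up toy values, used here as stand-ins for "chosen" and "rejected" response distributions.

```python
import numpy as np

def kl_divergence(p, q):
    """Exact KL(P || Q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def variational_kl_bound(p, q, critic):
    """Lower bound E_P[T] - E_Q[f*(T)] with f*(t) = exp(t - 1) (KL case)."""
    return float(np.sum(p * critic) - np.sum(q * np.exp(critic - 1.0)))

# Toy distributions over three outcomes (illustrative values only).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

# The bound is tight at the critic T*(x) = f'(p(x) / q(x)) = 1 + log(p(x) / q(x)).
optimal_critic = 1.0 + np.log(p / q)
# Any other critic gives a value no larger than the true divergence.
crude_critic = np.array([1.0, 0.0, -1.0])

print(kl_divergence(p, q))
print(variational_kl_bound(p, q, optimal_critic))
print(variational_kl_bound(p, q, crude_critic))
```

In practice the critic would be a learned function rather than a table, and maximizing the bound over the critic is what turns an objective of this shape into a divergence estimator.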

Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments

Authors: Zhao Tong, Chunlin Gong, Yimeng Gu, Haichao Shi, Qiang Liu, Shu Wu, Xiao-Yu Zhang

First: 2025-10-10T04:39:57+00:00 · Latest: 2026-02-05T17:52:31+00:00

Comments: 10 pages, 12 figures

Abstract

Online fake news profoundly distorts public judgment and erodes trust in social platforms. While existing detectors achieve competitive performance on benchmark datasets, they remain notably vulnerable to malicious comments designed specifically to induce misclassification. This evolving threat landscape necessitates detection systems that simultaneously prioritize predictive accuracy and structural robustness. However, current detectors often fail to generalize across diverse and novel comment attack patterns. To bridge this gap, we propose AdComment, an adaptive adversarial training framework for robustness enhancement against diverse malicious comments. Based on cognitive psychology, we categorize adversarial comments into Fact Distortion, Logical Confusion, and Emotional Manipulation, and leverage LLMs to synthesize diverse, category-specific perturbations. Central to our framework is an InfoDirichlet Resampling (IDR) mechanism that dynamically adjusts malicious comment proportions during training, thereby steering optimization toward the model's most susceptible regions. Experimental results demonstrate that our approach achieves state-of-the-art performance on three benchmark datasets, improving the F1 scores by 17.9%, 14.5% and 9.0%, respectively.

中文标题/摘要

标题：针对恶意评论的分组自适应对抗学习以增强鲁棒性假新闻检测

在线假新闻严重扭曲公众判断并侵蚀社交平台的信任。尽管现有检测器在基准数据集上取得了竞争力的表现，但它们仍然明显容易受到专门设计以诱导分类错误的恶意评论的影响。这种不断演变的威胁环境需要同时兼顾预测准确性和结构鲁棒性的检测系统。然而，当前的检测器往往无法在多样且新颖的评论攻击模式中泛化。为弥补这一差距，我们提出了一种AdComment，这是一种针对多种恶意评论的自适应对抗训练框架，以增强鲁棒性。基于认知心理学，我们将对抗性评论分为事实扭曲、逻辑混淆和情感操控三类，并利用大语言模型合成多样化的、类别特定的扰动。我们框架的核心是InfoDirichlet重采样（IDR）机制，该机制在训练过程中动态调整恶意评论的比例，从而引导优化向模型最脆弱的区域。实验结果表明，我们的方法在三个基准数据集上取得了最先进的性能，分别提高了F1分数17.9%、14.5%和9.0%。

Summary / 总结

This paper addresses the vulnerability of existing fake news detectors to malicious comments by proposing AdComment, an adaptive adversarial training framework. It categorizes adversarial comments into three types: Fact Distortion, Logical Confusion, and Emotional Manipulation, and uses LLMs to generate diverse perturbations. The InfoDirichlet Resampling (IDR) mechanism dynamically adjusts the proportions of malicious comments during training to enhance the model's robustness. Experiments show that AdComment improves F1 scores by 17.9%, 14.5%, and 9.0% on three benchmark datasets compared to existing methods.

论文提出了一种适应性对抗训练框架AdComment，以增强对恶意评论的鲁棒性。该框架将对抗性评论分为事实扭曲、逻辑混淆和情感操控三类，并使用大语言模型生成多样化的扰动。框架中的InfoDirichlet重采样机制在训练过程中动态调整恶意评论的比例，以优化模型的脆弱区域。实验结果显示，AdComment在三个基准数据集上的F1分数分别提高了17.9%、14.5%和9.0%。

Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation

Authors: Lingrui Li, Yanfeng Zhou, Nan Pu, Xin Chen, Zhun Zhong

First: 2026-02-05T17:47:35+00:00 · Latest: 2026-02-05T17:47:35+00:00

Comments: 8 pages, BIBM2025

Abstract

Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts across continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters and inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate these two issues by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lack of multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance the scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight, instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.

中文标题/摘要

标题：多尺度全局-实例提示调优以实现医学图像分割中的持续测试时自适应

在不同临床中心获取的医学图像中，分布偏移是一个常见的挑战，显著阻碍了预训练语义分割模型在多领域实际应用中的部署。持续测试时自适应(CTTA)作为一种有前景的方法，旨在解决目标领域不断演变过程中跨域偏移问题。大多数现有的CTTA方法依赖于逐步更新模型参数，这不可避免地会导致错误累积和灾难性遗忘，尤其是在长期自适应过程中。最近基于提示调优的工作表明，通过仅更新视觉提示来缓解上述两个问题具有潜力。尽管这些方法展示了有前景的性能，但仍存在一些局限性：1) 缺乏多尺度提示多样性，2) 实例特定知识整合不足，3) 隐私泄露风险。为克服这些局限性，我们提出了多尺度全局-实例提示调优(MGIPT)，以增强提示的尺度多样性并捕获全局和实例级别的知识，以实现稳健的CTTA。具体而言，MGIPT 包含一个自适应尺度实例提示(AIP) 和一个多尺度全局提示(MGP)。AIP 动态学习轻量级和实例特定的提示，通过自适应最优尺度选择机制来缓解错误累积。MGP 跨不同尺度捕获领域知识，以确保具有抗遗忘能力的稳健自适应。这些互补组件通过加权集成方法结合，实现有效的双尺度自适应，整合全局和局部信息。在医学图像分割基准上的广泛实验表明，我们的MGIPT 在性能上优于最先进的方法，实现了在不断变化的目标领域中的稳健自适应。

Summary / 总结

The paper addresses the challenge of distribution shift in medical images from different clinical centers, proposing Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance continual test-time adaptation. MGIPT introduces Adaptive-scale Instance Prompt (AIP) and Multi-scale Global-level Prompt (MGP) to mitigate error accumulation and catastrophic forgetting. AIP learns lightweight, instance-specific prompts, while MGP captures domain-level knowledge across scales. Experiments show MGIPT outperforms existing methods in robust adaptation across changing target domains.

论文针对不同临床中心获取的医学图像分布变化问题，提出了多尺度全局-实例提示调优（MGIPT）方法，以增强持续测试时适应性在语义分割中的应用。MGIPT 引入了适应尺度实例提示（AIP）和多尺度全局提示（MGP），以增强尺度多样性并捕捉全局和实例级知识，从而缓解错误累积和灾难性遗忘。实验结果表明，MGIPT 在医学图像分割基准测试中优于现有方法，展示了在不断变化的目标域中实现稳健适应的能力。

Tuning Out-of-Distribution (OOD) Detectors Without Given OOD Data

Authors: Sudeepta Mondal, Xinyi Mary Xie, Ruxiao Duan, Alex Wong, Ganesh Sundaramoorthi

First: 2026-02-05T17:46:40+00:00 · Latest: 2026-02-05T17:46:40+00:00

Abstract

Existing out-of-distribution (OOD) detectors are often tuned using a separate dataset deemed OOD with respect to the training distribution of a neural network (NN). OOD detectors process the activations of NN layers and score the output, where parameters of the detectors are determined by fitting to an in-distribution (training) set and the aforementioned dataset, which is chosen ad hoc. At detector training time, this ad hoc dataset may be unavailable or difficult to obtain, and even when it is available, it may not be representative of actual OOD data, which is often "unknown unknowns." Current benchmarks may specify some left-out set from test OOD sets. We show that there can be significant variance in performance of detectors based on the ad hoc dataset chosen in current literature, and thus even if such a dataset can be collected, the performance of the detector may be highly dependent on the choice. In this paper, we introduce and formalize the often neglected problem of tuning OOD detectors without a given "OOD" dataset. To this end, we present strong baselines as an attempt to approach this problem. Furthermore, we propose a new generic approach to OOD detector tuning that does not require any extra data other than those used to train the NN. We show that our approach improves over baseline methods consistently across higher-parameter OOD detector families, while being comparable across lower-parameter families.

中文标题/摘要

标题：无需给定离群分布（OOD）数据调整离群检测器

现有的离群分布（OOD）检测器通常通过一个相对于神经网络（NN）训练分布被视为OOD的单独数据集进行调整。OOD检测器处理NN层的激活并对输出评分，检测器的参数通过拟合分布内（训练）集和上述临时选定的数据集来确定。在检测器训练时，这种临时选定的数据集可能不可用或难以获取，即使可用，也可能无法代表实际的OOD数据，而实际OOD数据往往是“未知的未知”。当前基准可能会从测试OOD数据集中指定某个留出集。我们表明，根据当前文献中所选的临时数据集，检测器的性能可能存在显著差异，因此即使可以收集这样的数据集，检测器的性能也可能高度依赖于这一选择。在本文中，我们引入并形式化了在没有给定“OOD”数据集的情况下调整OOD检测器这一常被忽视的问题。为此，我们提出了强基线作为解决该问题的尝试。此外，我们提出了一种新的通用OOD检测器调整方法，除训练NN所用的数据外不需要任何额外数据。我们表明，我们的方法在高参数OOD检测器族中始终优于基线方法，而在低参数族中表现相当。

Summary / 总结

The paper addresses the challenge of tuning out-of-distribution (OOD) detectors without relying on an ad hoc OOD dataset. It highlights the variability in detector performance based on the chosen dataset and introduces a new approach that does not require any extra data other than the training data. The proposed method improves over baseline methods in higher-parameter OOD detector families and is comparable in lower-parameter families.

本文解决了在没有使用特定的OOD数据集的情况下调校OOD检测器的问题，因为这类数据集往往难以获取或不能代表实际的OOD数据。作者提出了一种不需要额外数据的方法，仅使用神经网络的训练数据。研究结果表明，该方法在较高参数的OOD检测器家族中表现优于基线方法，而在较低参数的家族中表现相当。

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao

First: 2026-02-05T17:44:28+00:00 · Latest: 2026-02-05T17:44:28+00:00

Abstract

Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$χ^2$ regularizer. This additional $χ^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.

中文标题/摘要

标题：策略镜像下降中的对数配分函数近似为LLM后训练引入隐式正则化

策略镜像下降（PMD）通过迭代求解KL正则化的策略改进子问题，为强化学习（RL）提供了一个有原则的框架。尽管这种方法已被用于训练Kimi K1.5/K2等先进LLM，但理想的闭式PMD更新需要可靠的配分函数估计，而在LLM庞大的动作空间中仅有有限采样（rollout）时，这是一个重大挑战。我们研究了一种称为PMD-mean的实用算法，它用采样策略下的平均奖励来近似对数配分项，并在对数策略空间中进行回归。具体而言，我们刻画了PMD-mean的总体解，并证明它隐式地优化了带有自适应混合KL--$χ^2$正则项的镜像下降子问题。这种额外的$χ^2$正则化约束了较大的概率变化，在期望奖励较低时产生更保守的更新，并增强对有限样本估计误差的鲁棒性。在数学推理任务上的实验表明，PMD-mean取得了更优的性能，并具有更好的稳定性和时间效率。这些发现加深了我们对PMD-mean的理解，并为LLM的RL算法的有原则改进指明了途径。代码可在https://github.com/horizon-rl/OpenKimi获取。

Summary / 总结

The research aims to address the challenge of estimating the log-partition function in policy mirror descent (PMD) for reinforcement learning in large language models (LLMs). The proposed PMD-mean algorithm approximates the log-partition term with the mean reward under the sampling policy and performs regression in the log-policy space. Key findings show that PMD-mean implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$χ^2$ regularizer, leading to more conservative updates and enhanced robustness. Experiments on math reasoning tasks demonstrate that PMD-mean achieves superior performance with improved stability and time efficiency.

该研究探讨了PMD-mean，这是一种近似策略镜像下降（PMD）的方法，使用采样策略下的平均奖励来估计对数分区函数。该方法隐式优化了具有自适应混合KL--$χ^2$正则化的镜像下降子问题，增强了对估计误差的鲁棒性。实验表明，PMD-mean在数学推理任务中提高了性能、稳定性和时间效率。
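
The substitution at the heart of PMD-mean can be sketched on a toy discrete action space. This is one plausible reading of the abstract, not the authors' code: it assumes the ideal KL-regularized update has the familiar closed form $\log \pi_{\text{new}}(y) - \log \pi_{\text{old}}(y) = r(y)/\tau - \log Z$ with $Z = \mathbb{E}_{\pi_{\text{old}}}[e^{r/\tau}]$, and that PMD-mean replaces $\log Z$ with the mean reward divided by $\tau$. The temperature `tau`, the toy rewards, and all function names are illustrative assumptions.

```python
import numpy as np

def exact_pmd_log_ratio(rewards, old_probs, tau):
    """Target log(pi_new / pi_old) using the exact log-partition term."""
    log_z = np.log(np.sum(old_probs * np.exp(rewards / tau)))
    return rewards / tau - log_z

def pmd_mean_log_ratio(rewards, old_probs, tau):
    """Target log-ratio with log Z approximated by the mean reward / tau."""
    mean_reward = np.sum(old_probs * rewards)
    return (rewards - mean_reward) / tau

def regression_loss(new_log_probs, old_log_probs, target_log_ratio, old_probs):
    """Squared error in log-policy space, weighted by the sampling policy."""
    residual = new_log_probs - old_log_probs - target_log_ratio
    return float(np.sum(old_probs * residual ** 2))

# Toy problem: four actions, a sampling policy, and their rewards.
old_probs = np.array([0.4, 0.3, 0.2, 0.1])
rewards = np.array([0.0, 1.0, 0.0, 1.0])
tau = 1.0

exact = exact_pmd_log_ratio(rewards, old_probs, tau)
approx = pmd_mean_log_ratio(rewards, old_probs, tau)

# The two targets differ by one constant, log Z - mean_reward / tau,
# which is non-negative by Jensen's inequality.
print(approx - exact)
# A policy that hits the approximate target exactly has zero regression loss.
old_log_probs = np.log(old_probs)
print(regression_loss(old_log_probs + approx, old_log_probs, approx, old_probs))
```

Because the approximate target is shifted upward by a constant, exponentiating it does not give a normalized distribution in general, which is one way to see why the population solution of the regression differs from the exact PMD update.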
