Shared LoRA Subspaces for almost Strict Continual Learning
Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille
First: 2026-02-05T18:59:58+00:00 · Latest: 2026-02-05T18:59:58+00:00
Abstract
Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration that do not rely on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
中文标题/摘要
标题:共享LoRA子空间实现几乎严格的持续学习
高效且持续地将大型预训练模型适应新任务对于实际部署至关重要,但由于灾难性遗忘和重新训练成本高昂,这仍然具有挑战性。尽管参数高效调优方法如低秩适应(LoRA)降低了计算需求,但它们缺乏严格的持续学习和知识整合机制,不依赖于数据重放或多个适配器。我们提出了一种名为Share的新方法,用于参数高效的持续微调,它学习并动态更新一个共享的低秩子空间,从而在多个任务和模态之间实现无缝适应。Share构建了一个基础子空间,从中提取过去任务的核心知识,并通过识别关键子空间方向逐步整合新信息。来自每个新任务的知识被整合到这个不断演化的子空间中,促进前向知识迁移,同时最小化灾难性干扰。该方法在传统LoRA方法上的参数减少高达100倍,内存节省高达281倍,同时保持与联合训练模型相当的性能。一个Share模型可以替代数百个任务特定的LoRA适配器,支持可扩展的、异步的持续学习。跨图像分类、自然语言理解、3D姿态估计和文本到图像生成的实验验证了其有效性,使Share成为大规模AI系统中终身学习的实用且可扩展的解决方案。
Summary / 总结
The paper addresses the challenge of efficient continual learning by proposing Share, a method that uses a shared low-rank subspace to adapt large pretrained models to new tasks without catastrophic forgetting. Share dynamically updates a single subspace to integrate knowledge from past and new tasks, achieving up to 100x parameter reduction and 281x memory savings compared to traditional LoRA methods while maintaining performance similar to joint training. This approach supports scalable, asynchronous continual learning and is validated across various tasks including image classification, natural language understanding, 3D pose estimation, and text-to-image generation.
论文提出了一种名为Share的方法,通过动态更新共享的低秩子空间来高效地适应新任务,同时避免灾难性遗忘。Share将参数和内存需求分别减少了最多100倍和281倍,与联合训练模型相比保持了相当的性能。它支持可扩展的异步连续学习,并已在图像分类、自然语言理解、3D姿态估计和文本到图像生成等多种任务中得到了验证。
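To make the idea of a single shared low-rank subspace concrete, here is a minimal sketch. It is not the Share algorithm itself, which the abstract does not specify. It assumes each task's weight update is available as a dense matrix, that a truncated SVD of the stacked updates gives the shared basis, and that each task is then stored only as coefficients in that basis. All names (`shared_basis`, `project`) and sizes are hypothetical.

```python
import numpy as np

def shared_basis(deltas, k):
    """Top-k left singular vectors of the column-stacked task updates (assumed construction)."""
    stacked = np.concatenate(deltas, axis=1)            # (d_out, num_tasks * d_in)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :k]                                     # (d_out, k), orthonormal columns

def project(delta, basis):
    """Per-task coefficients of an update in the shared basis."""
    return basis.T @ delta                              # (k, d_in)

rng = np.random.default_rng(0)
d_out, d_in, k, num_tasks = 32, 16, 4, 5

# Synthetic updates that really do share a k-dimensional column space.
true_basis = np.linalg.qr(rng.normal(size=(d_out, k)))[0]
deltas = [true_basis @ rng.normal(size=(k, d_in)) for _ in range(num_tasks)]

basis = shared_basis(deltas, k)
errors = [
    np.linalg.norm(d - basis @ project(d, basis)) / np.linalg.norm(d)
    for d in deltas
]
max_error = max(errors)  # relative reconstruction error of the worst task
```

In this toy setting the updates lie exactly in a common subspace, so the shared basis reconstructs every task almost perfectly. Real task updates would only approximately share directions, and the choice of `k` would trade accuracy against storage.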
Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning
Authors: Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji
First: 2026-02-05T18:59:55+00:00 · Latest: 2026-02-05T18:59:55+00:00
Abstract
Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
中文标题/摘要
标题:从视角描述预测相机姿态以进行空间推理
多图像空间推理仍然是当前多模态大型语言模型(MLLMs)面临的挑战。虽然单视角感知本质上是二维的,但多视角推理需要在不同视角之间构建连贯的场景理解。特别是,我们研究了视角转换,其中模型必须从多视角观察中构建连贯的三维理解,并使用它从新的、语言指定的视角进行推理。我们引入了CAMCUE,这是一种姿态感知的多图像框架,使用相机姿态作为跨视图融合和新视图推理的显式几何锚点。CAMCUE 将每视角姿态注入视觉标记,将自然语言视角描述定位到目标相机姿态,并合成姿态条件下的想象目标视图以支持回答。为了支持这一设置,我们收集了CAMCUE-DATA,其中包括27,668个训练实例和508个测试实例,这些实例将多视角图像和姿态与多样化的目标视角描述和视角转换问题配对。我们还在测试分割中包括了人工标注的视角描述,以评估对人类语言的泛化能力。CAMCUE 的整体准确率提高了9.06%,并且能够从自然语言视角描述中预测目标姿态,旋转准确率超过90%(误差在20°以内),平移准确率在0.5误差阈值以内超过90%。这种直接定位避免了昂贵的测试时搜索和匹配,将每个示例的推理时间从256.6秒减少到1.45秒,从而在实际场景中实现快速、交互式使用。
Summary / 总结
The paper addresses the challenge of multi-image spatial reasoning for current multimodal large language models by introducing CAMCUE, a pose-aware framework that uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. The framework improves overall accuracy by 9.06% and predicts target poses with high accuracy, reducing inference time significantly from 256.6s to 1.45s per example. CAMCUE-DATA, a curated dataset, supports this setting with diverse multi-view images and poses paired with target-viewpoint descriptions and perspective-shift questions, including human-annotated descriptions for evaluation.
论文通过引入CAMCUE,一种姿态感知框架,解决了当前多模态大型语言模型在多图像空间推理方面的挑战。CAMCUE 使用相机姿态作为几何锚点进行跨视图融合和新视图推理,整体准确率提高了9.06%,并在指定阈值内实现了超过90%的旋转和平移准确率。该框架通过直接将自然语言描述定位到相机姿态,避免了昂贵的测试时搜索与匹配,将每个示例的推理时间从256.6秒减少到1.45秒,支持快速交互使用。
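The abstract reports rotation accuracy within 20° and translation accuracy within a 0.5 error threshold but does not define the error measures. The sketch below assumes the common choices: geodesic angle between rotation matrices and Euclidean distance between translation vectors. The function names and the unit of the translation threshold are assumptions, not taken from the paper.

```python
import numpy as np

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def rot_z(deg):
    """Rotation about the z-axis, used here only to build test poses."""
    t = np.radians(deg)
    return np.array([
        [np.cos(t), -np.sin(t), 0.0],
        [np.sin(t),  np.cos(t), 0.0],
        [0.0,        0.0,       1.0],
    ])

def pose_within_threshold(r_pred, t_pred, r_gt, t_gt,
                          rot_thresh_deg=20.0, trans_thresh=0.5):
    """Return (rotation_ok, translation_ok) under the assumed error measures."""
    rot_ok = rotation_error_deg(r_pred, r_gt) <= rot_thresh_deg
    trans_ok = float(np.linalg.norm(t_pred - t_gt)) <= trans_thresh
    return rot_ok, trans_ok

result = pose_within_threshold(
    rot_z(15.0), np.zeros(3), rot_z(0.0), np.array([0.3, 0.0, 0.0])
)
```

A pose counts as accurate here only if it passes the relevant threshold; accuracy over a test set would then be the fraction of examples passing each check.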
DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching
Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao
First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00
Abstract
Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
中文标题/摘要
标题:DyTopo:基于语义匹配的多智能体动态拓扑路由
由提示的大语言模型构建的多智能体系统可以提高多轮推理能力,但大多数现有管道依赖于固定的、贯穿整个轨迹的通信模式,这些模式与迭代问题解决过程中阶段特定的需求匹配不佳。我们引入了DyTopo,这是一种由管理者指导的多智能体框架,在每一轮中重建一个稀疏的有向通信图。基于管理者的轮次目标,每个智能体输出轻量级的自然语言查询(query,需求)和键(key,提供)描述符;DyTopo嵌入这些描述符并进行语义匹配,仅沿诱导的边路由私有消息。在代码生成和数学推理基准测试以及四个LLM基础模型中,DyTopo始终优于最强基线(平均 +6.2)。除了准确性之外,DyTopo还通过不断变化的图提供了可解释的协调轨迹,使人们能够定性地检查通信路径如何在轮次之间重新配置。
Summary / 总结
DyTopo is a manager-guided multi-agent framework that dynamically reconstructs a sparse directed communication graph at each round based on the manager's goal. Agents output lightweight natural-language query and key descriptors, which are embedded and matched semantically to route private messages. Across code generation and mathematical reasoning benchmarks, DyTopo outperforms the strongest baseline by an average of +6.2 and provides interpretable coordination traces through evolving graphs.
DyTopo 是一个由管理者引导的多智能体框架,每轮根据管理者的目标动态重构一个稀疏的有向通信图。智能体输出轻量级的自然语言查询(query)和键(key)描述符,这些描述符被嵌入并进行语义匹配以路由私有消息。DyTopo 在代码生成和数学推理基准测试中平均比最强基线高出 6.2,并通过不断变化的图提供可解释的协调轨迹。
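The query/key matching step can be illustrated with a small sketch. It assumes descriptor embeddings are already computed (toy 3-dimensional vectors stand in for them), that similarity is cosine, and that an edge is kept when similarity reaches a fixed threshold. The threshold value, agent names, and `build_edges` helper are hypothetical; the abstract does not state how DyTopo sparsifies the graph.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_edges(queries, keys, threshold=0.8):
    """Directed edges (sender, receiver): the sender's offer matches the receiver's need."""
    edges = []
    for receiver, query_vec in queries.items():
        for sender, key_vec in keys.items():
            if sender != receiver and cosine(query_vec, key_vec) >= threshold:
                edges.append((sender, receiver))
    return sorted(edges)

# Toy embeddings of each agent's "need" (query) and "offer" (key) descriptors.
queries = {
    "coder":   [1.0, 0.0, 0.0],
    "tester":  [0.0, 1.0, 0.0],
    "planner": [0.0, 0.0, 1.0],
}
keys = {
    "coder":   [0.0, 0.9, 0.1],
    "tester":  [0.0, 0.0, 1.0],
    "planner": [0.9, 0.1, 0.0],
}

edges = build_edges(queries, keys)
```

Rebuilding `edges` each round from fresh descriptors is what would let the communication graph change with the stage of problem solving.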
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
Authors: Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00
Comments: Project Page: https://accio-lab.github.io/SwimBird
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
中文标题/摘要
标题:SwimBird:在混合自回归MLLM中引发可切换的推理模式
多模态大型语言模型(MLLMs)通过视觉和语言的结合,在多模态感知和推理方面取得了显著进展。然而,大多数现有的MLLMs主要通过文本的逐步推理(CoT)进行推理,这限制了它们在视觉密集型任务上的效果。最近的方法将固定数量的连续隐藏状态作为“视觉思考”注入推理过程,从而提高了视觉性能,但通常会牺牲基于文本的逻辑推理。我们认为核心限制在于一种僵化的、预先定义的推理模式,无法根据不同用户查询自适应地选择最合适的思考模态。我们引入了SwimBird,这是一种可切换的MLLM,根据输入动态切换三种推理模式:(1)仅文本推理,(2)仅视觉推理(连续隐藏状态作为视觉思考),(3)交替的视觉-文本推理。为了实现这一能力,我们采用了一种混合自回归公式,将文本思考的下一个标记预测与视觉思考的下一个嵌入预测统一起来,并设计了一种系统性的推理模式筛选策略,构建了SwimBird-SFT-92K,这是一个涵盖所有三种推理模式的多样监督微调数据集。通过实现灵活、查询自适应的模式选择,SwimBird在保持强大的文本逻辑的同时,显著提高了视觉密集任务的性能。跨多种涵盖文本推理和挑战性视觉理解的基准实验表明,SwimBird在先前固定模式多模态推理方法上取得了最先进的结果和稳健的提升。
Summary / 总结
SwimBird is designed to address the limitation of fixed reasoning patterns in MLLMs by introducing a reasoning-switchable model that dynamically switches among text-only, vision-only, and interleaved vision-text reasoning modes based on input. This is achieved through a hybrid autoregressive formulation and a systematic reasoning-mode curation strategy. Experiments show that SwimBird maintains strong text-based logical reasoning while significantly improving performance on vision-intensive tasks, achieving state-of-the-art results across various benchmarks.
SwimBird 是一种可根据输入查询在仅文本、仅视觉和视觉-文本交错三种推理模式之间动态切换的 MLLM。它采用混合自回归公式统一文本推理和视觉推理,并通过系统性的推理模式筛选策略构建了一个涵盖所有三种推理模式的多样化监督微调数据集。SwimBird 保持了强大的文本逻辑,同时在视觉密集任务上显著提高了性能,在多种基准测试上实现了最先进的结果。
CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction
Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li
Venue: ICRA 2026
First: 2026-02-05T18:59:45+00:00 · Latest: 2026-02-05T18:59:45+00:00
Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: https://comm-cp.github.io/
Abstract
To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
中文标题/摘要
标题:CommCP:通过基于LLM的通信与符合性预测实现高效的多智能体协调
为了通过自然语言完成人类提供的任务,机器人必须解释命令、生成和回答相关问题以理解场景,并操作目标物体。实际部署中,通常需要不同操作能力的多个异构机器人协同处理不同的任务。除了需要专门的操作技能外,有效的信息收集对于完成这些任务至关重要。为了解决这一问题,我们将信息收集过程在完全合作的环境中形式化为一个未被充分探索的多任务多智能体体态问答(MM-EQA)问题,这是体态问答(EQA)的经典问题的一个新颖扩展,其中有效的通信对于协调努力以避免冗余至关重要。为了解决这一问题,我们提出了CommCP,一种专为MM-EQA设计的基于LLM的分布式通信框架。我们的框架使用符合性预测来校准生成的消息,从而减少接收者的分心并提高通信可靠性。为了评估我们的框架,我们引入了一个包含多种多样的、逼真的家庭场景的MM-EQA基准,其中包含体态问题。实验结果表明,CommCP在任务成功率和探索效率方面显著优于基线。实验视频、代码和数据集可在我们的项目网站上获取:https://comm-cp.github.io/
Summary / 总结
The research aims to enable robots to interpret human commands and collaborate effectively to complete tasks. CommCP, a novel LLM-based communication framework, is proposed to facilitate multi-agent coordination in a fully cooperative setting. The framework uses conformal prediction to calibrate messages, reducing distractions and improving communication reliability. Experiments show that CommCP significantly improves task success rates and exploration efficiency compared to baseline methods.
研究旨在通过有效沟通和信息收集,提高多机器人在自然语言指令下的任务完成能力。CommCP 是一种基于 LLM 的新型去中心化通信框架,使用符合性预测(conformal prediction)来校准生成的消息,减少对接收者的干扰并提高通信可靠性。实验结果显示,CommCP 显著提高了任务成功率和探索效率,优于基线方法。
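For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split conformal procedure for classification. It is not CommCP's pipeline: the abstract does not say how calibrated sets are turned into messages. The nonconformity score (one minus the probability of the true label), the calibration scores, and the room labels are all illustrative assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores (split conformal)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p does not exceed the threshold."""
    return sorted(label for label, p in probs.items() if 1.0 - p <= qhat)

# Hypothetical calibration scores: 1 - p(true label) on held-out examples.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70]
qhat = conformal_threshold(cal_scores, alpha=0.2)

confident = prediction_set({"kitchen": 0.70, "bedroom": 0.20, "bathroom": 0.10}, qhat)
ambiguous = prediction_set({"kitchen": 0.50, "bedroom": 0.46, "bathroom": 0.04}, qhat)
```

One plausible use, stated here only as an assumption, is that an agent shares a finding when its prediction set is small and withholds or qualifies it when the set is large, which would limit distracting messages.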
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
Authors: Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang
First: 2026-02-05T18:59:32+00:00 · Latest: 2026-02-05T18:59:32+00:00
Abstract
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
中文标题/摘要
标题:几何思维:基于几何的主动集成为空间推理
多模态大型语言模型(MLLMs)在空间推理方面的最新进展越来越多地利用3D编码器提供的几何先验。然而,大多数现有的集成策略仍然被动:几何信息作为全局流呈现,并以不分青红皂白的方式融合,这往往导致语义-几何错位和冗余信号。我们提出了GeoThinker框架,将范式从被动融合转变为主动感知。GeoThinker 不是通过特征混合,而是使模型能够根据其内部推理需求选择性地检索几何证据。GeoThinker 通过在精心选择的VLM层上应用空间语义融合来实现这一点,其中语义视觉先验通过帧严格的交叉注意力选择性地查询和整合与任务相关的几何结构,并通过重要性门控进一步校准,以偏向于与任务相关的结构的帧间注意力。全面的评估结果表明,GeoThinker 在空间智能方面达到了新的最佳水平,在VSI-Bench上达到峰值得分为72.6。此外,GeoThinker 在复杂下游场景中的稳健泛化和空间感知能力显著提高,包括体感指示和自动驾驶。我们的结果表明,主动整合空间结构的能力对于下一代空间智能至关重要。代码可以在 https://github.com/Li-Hao-yuan/GeoThinker 获取。
Summary / 总结
The research aims to enhance spatial reasoning in multimodal large language models by integrating geometric information more effectively. GeoThinker, a proposed framework, shifts from passive geometric fusion to active perception, allowing the model to selectively retrieve and integrate geometric evidence based on its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific VLM layers, calibrated by Importance Gating. The framework significantly improves spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench and demonstrating robust generalization in complex scenarios like embodied referring and autonomous driving.
研究旨在通过更主动地整合几何信息来提升多模态大语言模型的空间推理能力。GeoThinker 提出的框架将融合方式从被动转向主动感知,使模型能够根据推理需求选择性地检索几何证据。这通过在特定 VLM 层级上的 Spatial-Grounded 融合、帧严格交叉注意力以及重要性门控实现。GeoThinker 在 VSI-Bench 上达到 72.6 的新最佳分数,并在包括体感引用和自动驾驶在内的复杂场景中展示了强大的泛化能力。
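The phrase "frame-strict cross-attention calibrated by Importance Gating" can be read in several ways. The sketch below shows one possible reading and is an assumption throughout: semantic tokens attend only to geometry tokens from the same frame, and a sigmoid importance score per geometry token is added to the attention logits as a log-bias. The gate weights, token shapes, and function name are hypothetical and do not come from the GeoThinker code.

```python
import numpy as np

def gated_frame_strict_attention(sem, geo, sem_frame, geo_frame, gate_w):
    """Cross-attention from semantic to geometry tokens, restricted to the same frame.

    sem: (n_s, d) semantic tokens; geo: (n_g, d) geometry tokens;
    sem_frame / geo_frame: integer frame id per token; gate_w: (d,) gate weights.
    Every frame with semantic tokens must have at least one geometry token.
    """
    scores = sem @ geo.T / np.sqrt(sem.shape[1])
    # Importance gate in (0, 1) per geometry token, applied as a log-bias on the logits.
    importance = 1.0 / (1.0 + np.exp(-(geo @ gate_w)))
    scores = scores + np.log(importance)[None, :]
    # Frame-strict mask: tokens from other frames get zero attention weight.
    same_frame = sem_frame[:, None] == geo_frame[None, :]
    scores = np.where(same_frame, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights, weights @ geo

rng = np.random.default_rng(0)
dim = 8
sem = rng.normal(size=(4, dim))
geo = rng.normal(size=(6, dim))
sem_frame = np.array([0, 0, 1, 1])
geo_frame = np.array([0, 0, 0, 1, 1, 1])
gate_w = rng.normal(size=dim)

weights, attended = gated_frame_strict_attention(sem, geo, sem_frame, geo_frame, gate_w)
fused = sem + attended  # residual fusion of retrieved geometry into semantic tokens
```

The mask is what makes the attention "frame-strict" in this reading, while the gate only reweights geometry tokens within a frame.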
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui
First: 2026-02-05T18:59:27+00:00 · Latest: 2026-02-05T18:59:27+00:00
Comments: Webpage: https://sirui-xu.github.io/InterPrior/
Abstract
Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
中文标题/摘要
标题:InterPrior:扩展基于物理的人机物交互生成控制
人类很少在整体身体层面上计划与物体的交互,而是通过高层次意图,如功能,来定义目标,而协调的平衡、接触和操作则可以从潜在的物理和运动先验中自然地涌现出来。扩展这些先验对于使类人机器人能够跨不同场景组合和泛化移动操作技能并保持物理上连贯的整体身体协调至关重要。为此,我们引入了InterPrior,这是一种可扩展的框架,通过大规模模仿预训练和后续的强化学习微调来学习统一的生成控制器。InterPrior首先将一个完整的参考模仿专家提炼成一个多功能、目标条件化的变分策略,该策略可以从多模态观察和高层次意图中重建运动。虽然提炼出的策略可以重建训练行为,但由于大规模人机物交互的庞大配置空间,它无法可靠地泛化。为了解决这个问题,我们应用物理扰动的数据增强,并进行强化学习微调以提高对未见过的目标和初始条件的技能。这些步骤共同将重建的潜在技能巩固为一个有效的流形,从而产生一个泛化能力超出训练数据的运动先验,例如,它可以包含与未见过的物体的交互行为。我们进一步展示了其在用户交互控制中的有效性及其在实际机器人部署中的潜力。
Summary / 总结
InterPrior is a scalable framework that learns a unified generative controller for humanoids to perform complex loco-manipulation skills. It uses large-scale imitation pretraining and reinforcement learning for fine-tuning, enabling the humanoid to generalize beyond the training data and handle unseen objects. The method addresses the challenge of generalization in large-scale human-object interactions through data augmentation and reinforcement learning, resulting in a motion prior that can incorporate new behaviors. This framework demonstrates effectiveness in user-interactive control and has potential for real robot deployment.
InterPrior 是一个可扩展的框架,用于使类人机器人能够执行与物体的物理连贯全身交互。它使用大规模模仿预训练和强化学习进行微调。关键发现是,通过应用物理扰动的数据增强和强化学习,InterPrior 可以泛化到未见过的目标和初始状态,使类人机器人能够在多种情境下组合和泛化移动技能。
V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
First: 2026-02-05T18:59:21+00:00 · Latest: 2026-02-05T18:59:21+00:00
Abstract
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
中文标题/摘要
标题:V-Retrver:基于证据的代理推理在通用多模态检索中的应用
多模态大型语言模型(MLLMs)最近被应用于通用多模态检索,其中思维链(CoT)推理改善了候选检索结果的重新排序。然而,现有方法仍然主要依赖语言驱动,依赖静态视觉编码,缺乏主动验证细粒度视觉证据的能力,这往往导致在视觉含糊情况下进行推测性推理。我们提出V-Retrver,一种基于证据的检索框架,将多模态检索重新定义为基于视觉检查的代理推理过程。V-Retrver使MLLM能够在推理过程中通过外部视觉工具选择性地获取视觉证据,执行一种多模态交替推理过程,交替进行假设生成和目标导向的视觉验证。为了训练这种证据收集检索代理,我们采用了一种基于课程的学习策略,结合监督推理激活、拒绝基础的细化和与证据对齐的目标的强化学习。在多个多模态检索基准上的实验表明,检索准确性(平均提高23.0%)、感知驱动的推理可靠性和泛化能力均有所提升。
Summary / 总结
The research aims to enhance multimodal retrieval by integrating visual evidence into reasoning processes. V-Retrver, an evidence-driven retrieval framework, reformulates multimodal retrieval as an agentic reasoning process. It allows an MLLM to selectively gather visual evidence during reasoning, improving candidate reranking and leading to a 23.0% average improvement in retrieval accuracy.
V-Retrver 是一种证据驱动的检索框架,通过使 MLLM 在推理过程中主动验证视觉证据来提升多模态检索。这种方法改进了现有的语言驱动方法,通过引入视觉检查和目标验证,提高了在视觉模糊情况下的表现和推理可靠性。实验结果显示,检索准确率平均提高了 23.0%,并且增强了基于感知的推理可靠性。
Can vision language models learn intuitive physics from interaction?
Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00
Abstract
Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
中文标题/摘要
标题:视觉语言模型能否通过交互学习直观的物理知识?
预训练的视觉语言模型对物理世界的直觉不够好。最近的研究表明,监督微调可以提高模型在简单物理任务上的表现。然而,微调后的模型似乎没有学会能够泛化的物理规则。基于认知科学的研究,我们假设模型需要与环境互动才能正确学习其物理动态。我们使用强化学习训练模型通过与环境的互动来学习。虽然通过互动学习可以让模型提高其任务内的表现,但无法产生具有泛化物理直觉的模型。我们发现,即使任务共享视觉统计和物理原理,针对一个任务训练的模型也不可靠地泛化到相关任务,无论模型是通过互动还是其他方式训练。
Summary / 总结
The study investigates whether vision language models can learn intuitive physics through interaction. Despite improvements in performance with supervised fine-tuning, the models fail to develop robust, generalizable physical intuitions. Models trained through interaction show enhanced task-specific performance but lack the ability to generalize to related tasks, suggesting that interaction alone is insufficient for learning broad physical principles.
研究探讨了视觉语言模型是否可以通过互动来学习直观的物理知识。尽管监督微调可以提高模型的性能,但模型无法发展出稳健且能够泛化的物理直觉。通过互动训练的模型在特定任务上的表现有所提升,但在相关任务上的泛化能力却不足,表明互动本身不足以学习广泛的物理原理。
Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
Authors: David Shavin, Sagie Benaim
Venue: ICLR 2026
First: 2026-02-05T18:59:05+00:00 · Latest: 2026-02-05T18:59:05+00:00
Comments: Accepted to ICLR 2026
Abstract
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
中文标题/摘要
标题:Splat和Distill:通过前馈3D重建增强教师模型以实现3D感知蒸馏
视觉基础模型(VFMs)在应用于各种下游2D任务时取得了显著的成功。尽管它们非常有效,但通常缺乏3D意识。为此,我们提出了Splat和Distill框架,通过将快速的前馈3D重建流水线添加到教师模型中,将坚固的3D意识注入2D VFMs。给定由教师模型生成的2D特征,我们的方法首先以前馈方式将这些特征提升为显式的3D高斯表示。然后,将这些3D特征“splat”到新的视点上,生成一组新的2D特征图,用于监督学生模型,从而“蒸馏”几何上具有根据性的知识。通过用我们的前馈提升方法替换先前工作中的慢速场景优化,我们的框架避免了特征平均化伪影,创建了一个动态学习过程,在此过程中,教师的一致性与学生的改进同步提高。我们在包括单目深度估计、表面法线估计、多视图对应和语义分割等一系列下游任务上进行了全面评估。我们的方法在3D意识方面显著优于先前的工作,不仅实现了显著的提升,还增强了2D特征的语义丰富性。项目页面可在https://davidshavin4.github.io/Splat-and-Distill/获取。
Summary / 总结
Splat and Distill is a framework that enhances 2D Vision Foundation Models (VFMs) with 3D awareness by integrating a fast feed-forward 3D reconstruction pipeline. It lifts 2D features into 3D Gaussian representations and splats them onto novel viewpoints to supervise a student model, distilling geometrically grounded knowledge. This method significantly improves 3D awareness and semantic richness in downstream tasks such as monocular depth estimation and semantic segmentation, outperforming previous works. The framework avoids feature-averaging artifacts and creates a dynamic learning process where both the teacher and student models improve consistency. Project page: https://davidshavin4.github.io/Splat-and-Distill/
Splat and Distill 是一种框架,通过集成快速的前馈 3D 重建管道,增强 2D 视觉基础模型(VFMs)的 3D 意识。它将 2D 特征提升为 3D 高斯表示,并将其投射到新颖视点上以监督学生模型,提取几何上基础的知识。该方法在单目深度估计和语义分割等下游任务中显著提高了 3D 意识和语义丰富性,超越了先前的工作。该框架避免了特征平均的缺陷,并创建了一个动态学习过程,其中教师和学生模型的一致性都得到提高。项目页面:https://davidshavin4.github.io/Splat-and-Distill/
PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling
Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui
First: 2026-02-05T18:59:01+00:00 · Latest: 2026-02-05T18:59:01+00:00
Abstract
Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.
中文标题/摘要
标题:PhysicsAgentABM:基于物理引导的生成性基于代理的建模
基于大型语言模型(LLM)的多代理系统能够实现富有表现力的代理推理,但难以扩展且不适用于时间步长对齐的状态转换模拟,而传统的基于代理的模型(ABM)虽然具有可解释性,但在整合丰富的个体级信号和非平稳行为方面存在困难。我们提出了PhysicsAgentABM,将推理转移到行为一致的代理集群中:状态专门化的符号代理编码机制性转换先验,多模态神经转换模型捕捉时间动态和交互动态,不确定性意识的本体论融合生成校准的集群级转换分布。个体代理随后在局部约束下随机实现转换,从而解耦群体推理与实体级变异性。我们还引入了基于跨上下文行为响应的LLM代理驱动聚类策略ANCHOR,以及一种新颖的对比损失,最多可减少6-8倍的LLM调用次数。在公共卫生、金融和社会科学领域的实验表明,与机制性、神经网络和LLM基线相比,PhysicsAgentABM在事件时间准确性和校准方面均表现出一致的改进。通过围绕不确定性意识神经符号融合重构生成性ABM以实现群体级推理,PhysicsAgentABM确立了LLM支持的可扩展和校准模拟的新范式。
Summary / 总结
PhysicsAgentABM is designed to address the scalability and calibration issues of large language model (LLM)-based multi-agent systems and the interpretability and signal integration challenges of classical ABMs. It uses state-specialized symbolic agents to encode mechanistic transition priors, a multimodal neural transition model to capture temporal and interaction dynamics, and uncertainty-aware epistemic fusion to yield calibrated cluster-level transition distributions. Experiments across various fields show consistent improvements in event-time accuracy and calibration over existing mechanistic, neural, and LLM baselines. Additionally, ANCHOR, an LLM agent-driven clustering strategy, reduces LLM calls by up to 6-8 times.
PhysicsAgentABM 结合了基于物理的生成性基于代理的建模,以解决大型语言模型(LLM)的可扩展性和校准问题以及经典ABM的可解释性和个体级信号整合问题。它使用状态专业化符号代理来编码机械转换先验,多模态神经模型来捕捉时间和交互动力学,并使用不确定性意识融合方法来生成校准的集群级转换分布。进一步引入的 ANCHOR 策略最多可减少 8 倍的 LLM 调用。公共健康、金融和社会科学领域的实验表明,在各种基线模型上,其在事件时间准确性与校准方面具有一致的改进。
Context Forcing: Consistent Autoregressive Video Generation with Long Context
Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
First: 2026-02-05T18:58:01+00:00 · Latest: 2026-02-05T18:58:01+00:00
Abstract
Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
中文标题/摘要
标题:上下文强制:使用长上下文的一致自回归视频生成
近期的实时长视频生成方法通常采用流式调优策略,试图通过短上下文(无记忆)教师训练一个长上下文学生。在这些框架中,学生进行长时间的展开,但只能从短至5秒的窗口中获得监督。这种结构上的不匹配导致了一个关键的学生-教师不匹配:由于教师无法访问长期历史,它无法引导学生学习全局时间依赖性,从而限制了学生能够使用的上下文长度。为了解决这一问题,我们提出了上下文强制,这是一种通过长上下文教师训练长上下文学生的新型框架。通过确保教师了解完整的生成历史,我们消除了监督不匹配,使模型能够稳健地训练并实现长期一致性。为了使这种计算在极端持续时间(例如2分钟)下可行,我们引入了一种上下文管理系统,将线性增长的上下文转换为慢速-快速记忆架构,显著减少了视觉冗余。大量实验结果表明,我们的方法使有效的上下文长度超过20秒,比LongLive和Infinite-RoPE等最先进的方法长2到10倍。通过利用这种扩展的上下文,上下文强制在长时间段内保持了更优的一致性,并在各种长视频评估指标上超越了最先进的基线方法。
Summary / 总结
The paper addresses the issue of student-teacher mismatch in real-time long video generation by proposing Context Forcing, which trains a long-context student using a long-context teacher. This approach ensures the teacher can provide supervision based on the full generation history, thus improving long-term temporal consistency. The method introduces a Slow-Fast Memory architecture to manage the context efficiently, allowing for context lengths exceeding 20 seconds, significantly outperforming existing methods like LongLive and Infinite-RoPE in long video generation tasks.
本文提出了一种Context Forcing框架,通过使用长历史上下文的教师来训练长历史上下文的学生,解决了实时长视频生成中的学生-教师不匹配问题。这种方法确保教师可以访问完整的生成历史,消除监督不匹配,从而实现具有长期一致性的模型的稳健训练。该方法引入了一种慢速-快速记忆架构来管理上下文,使其在长时间内计算上可行。实验结果表明,Context Forcing可以实现超过20秒的有效上下文长度,超越了如LongLive和Infinite-RoPE等最先进的方法在长视频生成一致性方面的表现。
Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
First: 2026-02-05T18:57:09+00:00 · Latest: 2026-02-05T18:57:09+00:00
Comments: Code is available at https://github.com/ViktorAxelsen/BudgetMem
Abstract
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
中文标题/摘要
标题:学习查询感知预算层级路由以运行时代理内存
内存对于大型语言模型(LLM)代理在超出单一上下文窗口操作时变得越来越关键,但大多数现有系统依赖于离线、查询无关的内存构建,这可能效率低下并可能丢弃查询关键信息。尽管运行时内存利用是一个自然的替代方案,但先前的工作往往会产生大量开销,并且对性能成本权衡的控制有限。在本文中,我们提出了**BudgetMem**,这是一种运行时代理内存框架,用于显式、查询感知的性能成本控制。BudgetMem 将内存处理结构化为一组内存模块,每个模块提供三种预算层级(即**低**/**中**/**高**)。一个轻量级路由器在模块之间执行预算层级路由,以平衡任务性能和内存构建成本,这通过强化学习训练的紧凑神经策略实现。使用BudgetMem作为统一的测试平台,我们研究了三种互补的预算层级实现策略:实现(方法复杂度)、推理(推理行为)和容量(模块模型大小)。在LoCoMo、LongMemEval和HotpotQA中,当优先考虑性能(即高预算设置)时,BudgetMem超越了强大的基线,并在更紧的预算下提供了更好的准确度成本前沿。此外,我们的分析将不同层级策略的优势和劣势分离开来,阐明了在不同预算条件下,每个轴在何时提供最有利的权衡。
Summary / 总结
BudgetMem is a runtime agent memory framework designed for explicit, query-aware performance-cost control in LLM agents. It structures memory processing as a set of memory modules, each offered in three budget tiers (Low/Mid/High), and uses a lightweight router trained with reinforcement learning to perform budget-tier routing across modules, balancing task performance against memory construction cost. Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem outperforms strong baselines in high-budget settings and provides better accuracy-cost trade-offs under tighter budgets. The analysis of three tiering strategies (implementation, reasoning, and capacity) clarifies when each delivers the most favorable trade-offs under varying budget regimes.
BudgetMem 是一个用于大型语言模型的运行时代理内存框架,旨在实现明确的、查询感知的性能-成本控制。它将内存处理结构化为三个预算层级,并使用一个轻量级路由器根据性能和成本在内存模块之间路由查询。BudgetMem 在高预算设置下优于强基线,并在更紧的预算下提供更好的准确度-成本前沿。不同层级策略的分析有助于在不同预算条件下澄清最优权衡。
Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering
Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar
First: 2026-02-05T18:55:56+00:00 · Latest: 2026-02-05T18:55:56+00:00
Abstract
Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
中文标题/摘要
标题:CORAL(正确性优化的残差激活透镜):可移植且校准意识的推理时校正导向
大型语言模型(LLMs)在指令调优和偏好对齐后表现出持续的校准不足。修改后的训练目标可以改善校准,但重新训练成本高昂。推理时校正提供了一种轻量级的替代方案,但大多数现有方法优化的是正确性的代理指标而非正确性本身。我们引入了CORAL(正确性优化的残差激活透镜),这是一种正则化推理时校正方法,通过权重衰减MLP探针捕捉模型内部激活中的分布式正确性信号。我们在三个7B参数模型上评估了CORAL,发现它在平均情况下将准确率提高了10%并降低了50%的预期校准误差(ECE)。我们还证明了这些增益在无需重新训练的情况下转移到了四个保留基准测试的完整发布测试集(ARC-Challenge、HellaSwag、Math-MC、OpenBookQA)上,平均准确率提高了14%并降低了49%的ECE。我们的结果支持了这样一个假设,即当单个神经元不足时,可以使用正则化探针从模型内部提取分布式信息。因此,CORAL提供了一种计算高效、可移植且校准意识的方法,以提高推理时的多项选择题(MCQA)性能。
Summary / 总结
The paper introduces CORAL, a regularized inference-time steering method that enhances the accuracy and calibration of large language models. It uses weight-decay MLP probes to capture distributed correctness signals from model internal activations. Across three 7B-parameter models, CORAL improved accuracy by 10% and expected calibration error by 50% on average. These improvements transferred to four held-out benchmarks without retraining, averaging 14% accuracy and 49% ECE improvements, demonstrating a compute-efficient, transferable, and calibration-aware approach to improve multiple-choice question answering performance during inference.
论文提出了CORAL,一种正则化推理时校正方法,通过使用带权重衰减的MLP探针从模型内部激活中捕捉分布式的正确性信号来提升大型语言模型的准确性和校准度。在三个7B参数模型上,CORAL将准确率提高了10%,预期校准误差降低了50%。这些改进在四个未见过的基准测试集上无需重新训练也得到了验证,平均提高了14%的准确率和49%的ECE,展示了在推理时提高多项选择题回答性能的一种计算高效、可迁移且校准意识强的方法。
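The CORAL entry above reports gains in expected calibration error (ECE) without defining the metric. As a reading aid, here is a minimal sketch of the common equal-width-bin ECE estimator. The bin count, the function name, and the toy data are assumptions for illustration, not taken from the paper, and the paper's exact estimator may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data (made up): four answers at 0.9 confidence, only two of them correct.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 0, 0]
ece = expected_calibration_error(confs, hits)  # |0.5 - 0.9| = 0.4
```

If the reported 50% ECE improvement is read as a relative reduction, an overconfident model like this toy one would move from an ECE of 0.4 to about 0.2.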
Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold
Authors: Ye He, Yitong Qiu, Molei Tao
First: 2026-02-05T18:55:03+00:00 · Latest: 2026-02-05T18:55:03+00:00
Abstract
When a diffusion model is not memorizing the training data set, how does it generalize exactly? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model's performance for downstream applications. We thus explicitly characterize what a diffusion model generates, by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as the inference dynamics progress. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors will lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. A more detailed understanding of training dynamics will lead to more accurate quantification of the generation inductive bias, and an example of a random feature model will be considered, for which we can explicitly illustrate how a diffusion model's inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects, in both low and high dimensions.
中文标题/摘要
标题:扩散模型的泛化可以由数据依赖的岭流形上的归纳偏置来表征
当扩散模型不记忆训练数据集时,它如何泛化?对其生成分布的定量理解将有助于例如下游应用中模型性能的评估。因此,我们通过提出对数密度岭流形并量化生成数据与该流形的关系来明确表征扩散模型的生成内容。更具体地说,推理过程围绕岭流形进行拉近-对齐-滑动的过程:轨迹首先接近流形的邻域,然后在法向方向被推近或远离流形,最后沿流形的切向方向滑动。在这一总体行为的范围内,不同的训练误差会导致不同的法向和切向运动,这些运动可以被量化,并且这些详细的运动表征了跨模态生成何时出现。对训练动力学更详细的理解将导致对生成归纳偏置更准确的量化,我们将考虑一个随机特征模型的例子,其中可以明确展示扩散模型的归纳偏置如何源自架构偏置和训练准确性组成的组合,并且如何随着推理动力学的发展而演变。在合成多模态分布和MNIST潜在扩散上的实验支持了预测的方向性效应,在低维和高维空间中均是如此。
Summary / 总结
This study investigates how diffusion models generalize by proposing a log-density ridge manifold and analyzing the inference dynamics. The model's inference process is characterized as a reach-align-slide process around the ridge manifold, where data trajectories first approach the manifold, then align in normal directions, and finally slide along the manifold. Different training errors result in distinct normal and tangent motions, which can be quantified to understand inter-mode generation. Experiments on synthetic and MNIST data support the directional effects predicted by the analysis.
研究通过提出对数密度岭流形并分析推理动力学,探讨了扩散模型的泛化机制。研究表明,推理过程遵循围绕该流形的reach-align-slide过程,不同的训练误差会导致不同的法向和切向运动。这些详细运动有助于表征跨模态生成,并提供关于模型归纳偏好的见解,这些偏好的演变与推理动力学有关。合成数据和MNIST数据的实验支持了这些发现。
Mechanisms of AI Protein Folding in ESMFold
Authors: Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler
First: 2026-02-05T18:54:54+00:00 · Latest: 2026-02-05T18:54:54+00:00
Comments: Our code, data, and results are available at https://folding.baulab.info
Abstract
How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features such as charge flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.
中文标题/摘要
标题:ESMFold中AI蛋白质折叠的机制
蛋白质结构预测模型是如何折叠蛋白质的?我们通过追踪ESMFold折叠一个β发夹结构域的过程来研究这一问题。通过对模型潜在变量进行反事实干预,我们识别出折叠过程中的两个计算阶段。在第一个阶段,早期模块初始化双分子生物化学信号:残基身份及其相关的生物化学特征,如从序列表示到双分子表示的电荷流动。在第二个阶段,晚期模块发展双分子空间特征:距离和接触信息在双分子表示中积累。我们证明ESMFold的结构决策机制可以被局部化、通过可解释的表示进行追踪,并且可以通过强因果效应进行操控。
Summary / 总结
This study explores the mechanisms of protein folding in ESMFold by analyzing how it processes a beta hairpin. The research identifies two computational stages: early blocks initialize biochemical signals from sequence representations, and late blocks develop spatial features in the pairwise representation. The study shows that these mechanisms can be localized, traced through interpretable representations, and manipulated with strong causal effects.
研究通过分析ESMFold对β发夹结构的折叠过程,探讨其蛋白质折叠机制。研究识别了两个计算阶段:早期块从序列表示初始化生物化学信号,晚期块在成对表示中发展空间特征。研究显示这些机制可以被定位、通过可解释的表示进行追踪,并且可以通过强因果效应进行操控。
MambaVF: State Space Model for Efficient Video Fusion
Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler
First: 2026-02-05T18:53:47+00:00 · Latest: 2026-02-05T18:53:47+00:00
Abstract
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF is highly efficient, cutting up to 92.25% of parameters and 88.79% of computational FLOPs while delivering a 2.1x speedup over existing methods. Project page: https://mambavf.github.io
中文标题/摘要
标题:MambaVF:基于状态空间模型的高效视频融合框架
视频融合是各种视频处理任务中的基本技术。然而,现有的视频融合方法严重依赖于光流估计和特征扭曲,导致了巨大的计算开销和有限的可扩展性。本文提出了一种基于状态空间模型(SSM)的高效视频融合框架MambaVF,该框架在无需显式运动估计的情况下进行时间建模。首先,通过将视频融合重新表述为一个顺序状态更新过程,MambaVF以线性复杂度捕获了长程时间依赖性,同时显著减少了计算和内存成本。其次,MambaVF提出了一种轻量级的基于SSM的融合模块,该模块通过时空双向扫描机制替代了传统的流引导对齐,从而实现了跨帧的高效信息聚合。在多个基准上的广泛实验表明,我们的MambaVF在多曝光、多焦点、红外可见和医学视频融合任务中达到了最先进的性能。我们强调MambaVF具有高效率,参数减少了高达92.25%,计算FLOPs减少了88.79%,并且比现有方法快2.1倍。项目页面:https://mambavf.github.io
Summary / 总结
MambaVF is an efficient video fusion framework that reformulates video fusion as a state space model to capture long-range temporal dependencies without explicit motion estimation, reducing computational overhead and memory costs. It introduces a lightweight spatio-temporal bidirectional scanning mechanism for efficient information aggregation, achieving state-of-the-art performance in various video fusion tasks while reducing up to 92.25% of parameters and 88.79% of computational FLOPs, with a 2.1x speedup compared to existing methods.
MambaVF 是一种高效的视频融合框架,通过将视频融合重新表述为状态空间模型(SSM),来捕捉长距离的时间依赖关系,而无需显式的运动估计。这种方法减少了计算开销和内存使用,实现了多种视频融合任务的最先进性能。MambaVF 显著减少了参数和计算 FLOPs,并且比现有方法快 2.1 倍。
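The MambaVF entry above describes video fusion as a sequential state update with linear complexity and a bidirectional scan. The sketch below shows only that general idea with a scalar linear recurrence. The coefficients `a` and `b`, the function names, and the choice to sum the two directions are illustrative assumptions. The actual module operates on spatio-temporal feature maps and is not specified in the abstract.

```python
def ssm_scan(xs, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over a sequence in one linear pass."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x  # constant work per step, so cost is O(len(xs))
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.9, b=0.1):
    """Sum a forward and a backward pass so each step sees past and future."""
    fwd = ssm_scan(xs, a, b)
    bwd = ssm_scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


frames = [1.0, 0.0, 0.0, 0.0]  # an impulse in the first "frame"
fwd = ssm_scan(frames, a=0.5, b=1.0)
both = bidirectional_scan(frames, a=0.5, b=1.0)
```

The impulse decays geometrically through later steps of the forward pass, which is how a recurrence of this kind carries information across frames without computing optical flow.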
GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00
Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
中文标题/摘要
标题:GenArena:我们如何实现视觉生成任务的人类对齐评估?
视觉生成模型的快速发展已经超越了传统的评估方法,迫切需要采用视觉语言模型作为替代的评判者。在本文中,我们系统地研究了当前广泛使用的绝对点对点评分标准在各种视觉生成任务中的可靠性。我们的分析表明,这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制,我们引入了GenArena,这是一种统一的评估框架,利用成对比较范式确保稳定且人类对齐的评估。关键的是,我们的实验揭示了一个变革性的发现,即简单采用这种成对协议可以使现成的开源模型超越顶级专有模型。值得注意的是,我们的方法将评估准确性提高了超过20%,并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性,远远超过了0.36的点对点方法相关性。基于GenArena,我们对多种视觉生成模型进行了基准测试,为视觉生成提供了严格的自动化评估标准。
Summary / 总结
This work addresses the limitations of absolute pointwise scoring for evaluating visual generation models by introducing GenArena, a framework that uses a pairwise comparison paradigm. The study finds that simply adopting the pairwise protocol lets off-the-shelf open-source models outperform top-tier proprietary models and boosts evaluation accuracy by over 20%. GenArena also achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, compared with 0.36 for pointwise methods, demonstrating its effectiveness in providing a human-aligned evaluation standard for visual generation tasks.
论文针对传统绝对点评分在评估视觉生成模型中的局限性,提出了GenArena,这是一种基于成对比较的框架,以提高评估的可靠性和与人类感知的对齐。实验表明,GenArena显著提高了评估准确性,超过20%,并且与权威的LMArena排行榜实现了0.86的Spearman相关性,远超点评分方法的0.36相关性。
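The GenArena entry above reports agreement with a reference leaderboard as a Spearman correlation. The sketch below shows how such a number can be computed from pairwise judgments, assuming a plain win-rate aggregation. The abstract does not state how GenArena aggregates comparisons, and the model names, battles, and reference scores here are made up.

```python
def win_rates(models, battles):
    """Turn (winner, loser) pairs into a per-model win rate."""
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for winner, loser in battles:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return [wins[m] / games[m] if games[m] else 0.0 for m in models]


def ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


models = ["A", "B", "C", "D"]  # hypothetical generators
battles = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
reference = [1200, 1100, 1150, 900]  # made-up reference leaderboard scores
rho = spearman(win_rates(models, battles), reference)
```

In this toy case the judge swaps the order of B and C relative to the reference, which gives a correlation of 0.8.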
AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Authors: Xianyang Liu, Shangding Gu, Dawn Song
First: 2026-02-05T18:50:36+00:00 · Latest: 2026-02-05T18:50:36+00:00
Abstract
Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.
中文标题/摘要
标题:AgenticPay:多智能体LLM谈判系统用于买家卖家交易
基于大型语言模型(LLM)的代理越来越多地被期望自主进行谈判、协调和交易,但现有的基准测试缺乏评估语言中介的多智能体经济互动的规范性设置。我们引入了AgenticPay,这是一种多智能体买家卖家谈判基准和模拟框架,由自然语言驱动。AgenticPay 模拟了买家和卖家拥有私人约束和产品依赖价值的市场,并且必须通过多轮语言谈判达成协议,而不仅仅是通过数字竞价。该框架支持超过110项任务的多样化套件,从双边讨价还价到多对多市场,具有结构化的行动提取和可行性、效率和福利的度量标准。对最先进的专有和开源权重LLM的基准测试揭示了谈判性能的巨大差距,并突显了长期战略推理的挑战,确立了AgenticPay作为研究代理商业和语言驱动的市场互动的基础。代码和数据集可在以下链接获取:https://github.com/SafeRL-Lab/AgenticPay.
Summary / 总结
AgenticPay is a benchmark and simulation framework for evaluating multi-agent buyer-seller negotiations driven by natural language. It models markets with private constraints and product-dependent valuations, requiring agents to reach agreements through multi-round linguistic negotiation. Key findings show significant gaps in negotiation performance among state-of-the-art LLMs, particularly in long-horizon strategic reasoning, establishing AgenticPay as a valuable tool for studying agentic commerce and language-based market interaction.
AgenticPay 是一个用于评估由自然语言驱动的多代理买家卖家谈判的基准和模拟框架。它模拟了具有私人约束和产品依赖价值的市场,要求代理通过多轮语言谈判达成协议。关键发现表明,最先进的语言模型在谈判性能上存在显著差距,尤其是在长期战略推理方面,确立了AgenticPay作为研究代理商业和语言驱动市场互动的基础工具的地位。
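The AgenticPay entry above mentions metrics for feasibility, efficiency, and welfare but does not define them. The sketch below uses textbook bilateral-trade definitions as a stand-in. The `Deal` type, field names, and formulas are assumptions for illustration and may not match the benchmark's own metrics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deal:
    buyer_value: float      # buyer's private maximum willingness to pay
    seller_cost: float      # seller's private minimum acceptable price
    price: Optional[float]  # agreed price, or None if negotiation failed


def is_feasible(deal):
    """A deal is feasible when the price respects both private limits."""
    return deal.price is not None and deal.seller_cost <= deal.price <= deal.buyer_value


def welfare(deal):
    """Total surplus: (value - price) + (price - cost) = value - cost."""
    return deal.buyer_value - deal.seller_cost if is_feasible(deal) else 0.0


def efficiency(deals):
    """Share of the attainable surplus that the negotiations realized."""
    attainable = sum(max(d.buyer_value - d.seller_cost, 0.0) for d in deals)
    return sum(welfare(d) for d in deals) / attainable if attainable else 0.0


# Made-up outcomes: one closed deal, one missed deal, one pair with no overlap.
deals = [Deal(100.0, 60.0, 80.0), Deal(50.0, 30.0, None), Deal(40.0, 45.0, None)]
share = efficiency(deals)
```

Here the negotiations realize 40 of the 60 units of attainable surplus. The third pair does not lower the score because no mutually acceptable price exists for it.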
On Computation and Reinforcement Learning
Authors: Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach
First: 2026-02-05T18:45:57+00:00 · Latest: 2026-02-05T18:45:57+00:00
Abstract
How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture achieves (1) stronger performance simply by using more compute, and (2) stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.
中文标题/摘要
标题:关于计算与强化学习
可用的计算量对强化学习(RL)策略的学习有何影响?固定参数数量的策略是否仍能从额外的计算中受益?标准的RL框架没有提供正式的语言来回答这些问题。从经验上讲,深度RL策略通常被参数化为具有静态架构的神经网络,混淆了计算量和参数数量。在本文中,我们形式化了计算受限策略,并证明使用更多计算的策略可以解决计算量较少的策略无法解决的问题,并且在更长的时序任务上具有更强的泛化能力。基于先前的工作,我们提出了一种可以使用可变计算量的最小架构。我们的实验补充了我们的理论。在涵盖在线和离线RL的31个不同任务上,我们展示了(1)这种架构仅通过使用更多的计算量就能实现更强的性能,(2)在更长时序的测试任务上具有更强的泛化能力,与标准的前馈网络或使用多达5倍参数的深度残差网络相比。
Summary / 总结
This paper investigates how the amount of computational resources affects reinforcement learning policies. It formalizes compute-bounded policies and demonstrates that policies with more compute can solve longer-horizon tasks that are beyond the capabilities of policies with less compute. Experiments on 31 tasks show that the proposed architecture performs better with more compute and generalizes better to longer-horizon tasks compared to standard neural network architectures with more parameters.
本文研究了计算资源的多少如何影响强化学习策略。它形式化了计算受限的策略,并展示了具有更多计算资源的策略能够解决那些少计算资源策略无法解决的长期任务。实验表明,该提出的架构在更多计算资源下表现更好,并且在长期任务上的泛化能力优于标准的具有更多参数的前馈网络或深度残差网络。
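The entry above separates the amount of compute from the number of parameters. The sketch below is only an analogy: one fixed update rule, applied a variable number of times, reaches further along a corridor the more steps it gets. It is not the paper's architecture, which the abstract does not describe, and the graph and function names are made up.

```python
def backup_step(dist, neighbors):
    """One shared update: each node takes 1 + the best neighbor estimate."""
    return [
        0 if d == 0 else min([d] + [dist[j] + 1 for j in neighbors[i]])
        for i, d in enumerate(dist)
    ]


def plan(goal, neighbors, n_steps):
    """Apply the same update n_steps times; compute grows, parameters do not."""
    inf = float("inf")
    dist = [0 if i == goal else inf for i in range(len(neighbors))]
    for _ in range(n_steps):
        dist = backup_step(dist, neighbors)
    return dist


# A 6-node corridor 0-1-2-3-4-5 with the goal at node 5.
corridor = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
short = plan(goal=5, neighbors=corridor, n_steps=2)
full = plan(goal=5, neighbors=corridor, n_steps=5)
```

With two steps, only nodes within distance two of the goal get a finite estimate. With five steps the same rule covers the whole corridor, so extra compute alone extends the horizon that can be solved.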
VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation
Authors: Jie Deng, Kaichun Yao, Libo Zhang
First: 2026-02-05T18:45:53+00:00 · Latest: 2026-02-05T18:45:53+00:00
Abstract
Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
中文标题/摘要
标题:VisRefiner:从视觉差异中学习以实现屏幕截图到代码生成
屏幕截图到代码生成旨在将用户界面屏幕截图转换为能够忠实再现目标布局和样式的可执行前端代码。现有的多模态大型语言模型直接从屏幕截图映射到代码,但它们在生成代码时没有观察到视觉结果。相比之下,人类开发人员会迭代地渲染他们的实现,将其与设计进行比较,并学习视觉差异如何与代码更改相关联。受此过程的启发,我们提出了一种训练框架VisRefiner,使模型能够从渲染预测与参考设计之间的视觉差异中学习。我们构建了差异对齐的监督,将视觉差异与相应的代码编辑关联起来,使模型能够理解外观变化是如何由实现更改引起的。在此基础上,我们引入了一种强化学习阶段进行自我完善,模型通过观察渲染输出和目标设计之间的视觉差异,并相应地更新代码来改进其生成的代码。实验表明,VisRefiner 显著提高了单步生成质量和布局保真度,同时赋予模型强大的自我完善能力。这些结果表明,从视觉差异中学习对于推进屏幕截图到代码生成的有效性。
Summary / 总结
VisRefiner is a training framework that enables models to learn from visual differences between rendered predictions and reference designs, improving the quality and layout fidelity of screenshot-to-code generation. It uses difference-aligned supervision to associate visual discrepancies with corresponding code edits and introduces a reinforcement learning stage for self-refinement. Experiments show that VisRefiner significantly enhances single-step generation quality and layout fidelity, and endows models with strong self-refinement ability.
VisRefiner 是一种训练框架,通过将渲染预测与参考设计之间的视觉差异与相应的代码编辑关联起来,提高截图到代码生成的质量和布局准确性。它引入了差分对齐的监督和自强化学习阶段,前者将视觉差异与代码更改联系起来,后者使模型能够通过观察渲染输出和目标设计之间的视觉差异来改进生成的代码。实验表明,VisRefiner 显著提高了单步生成质量和布局准确性,并赋予模型强大的自强化能力。
Layer-wise LoRA fine-tuning: a similarity metric approach
Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
First: 2026-02-05T18:38:53+00:00 · Latest: 2026-02-05T18:38:53+00:00
Comments: Code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
Abstract
Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
中文标题/摘要
标题:逐层LoRA微调:一种相似度度量方法
在大规模网络数据集上预训练大型语言模型(LLMs)已成为推动通用人工智能发展的基础。相比之下,通过微调来增强其在下游任务中的预测性能通常涉及调整其知识。参数高效的微调技术,如低秩适应(LoRA),旨在通过冻结预训练模型并更新较少的参数来降低此过程的计算成本。与全微调相比,这些方法的可训练参数数量减少了超过99%,具体取决于配置。不幸的是,随着LLMs的规模不断扩大,这种减少可能变得不足。在本文中,我们通过系统地选择仅微调少数几层来解决上述问题,使用LoRA或其变体。我们认为,并非所有层对模型适应的贡献都相等。利用这一点,我们通过测量它们对内部表示变化的贡献来识别最相关的层进行微调。我们的方法与现有的低秩适应技术是正交的,并且易于兼容。我们通过LoRA技术将可训练参数减少多达50%,同时在不同模型和任务上保持预测性能。具体而言,在仅编码器架构中,这种可训练参数的减少在GLUE基准测试上的预测性能下降可以忽略不计。在仅解码器架构中,我们实现了数学问题解决能力和编程任务上的小幅度下降甚至改进。最后,这种方法也适用于多模态模型,在这些模型中,我们还观察到与在所有层使用LoRA模块进行微调相比具有竞争力的结果。代码可在:https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
Summary / 总结
This paper addresses the cost of fine-tuning large language models (LLMs) by proposing a layer-wise Low-Rank Adaptation (LoRA) method. The authors measure the contribution of each layer to internal representation changes to identify the most relevant layers for fine-tuning. This approach reduces the number of trainable parameters by up to 50% while maintaining predictive performance across models and tasks: the drop on the GLUE benchmark is negligible for encoder-only architectures, and decoder-only architectures see a small drop or even improvements on mathematical problem-solving and coding tasks. The method is compatible with existing LoRA techniques, extends to multimodal models, and its code is available in the linked repository.
本文提出了一种分层LoRA微调方法,通过根据层对内部表示变化的贡献选择关键层进行微调,从而将可训练参数减少高达50%,同时在不同模型和任务上保持或提高预测性能。对于编码器架构,GLUE基准上的性能下降可以忽略不计;对于解码器架构,数学问题解决和编程任务的预测性能有所改善或保持不变。该方法与现有的LoRA技术兼容,并适用于多模态模型,显示出与所有层使用LoRA模块微调相当的竞争性结果。
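The entry above selects layers by how much they change internal representations, and the repository name suggests that centered kernel alignment (CKA) is the similarity metric. The sketch below implements linear CKA in plain Python and applies an assumed selection rule: prefer the layer whose output is least similar to its input. The activations, layer names, and selection rule are hypothetical, and the authors' exact procedure may differ.

```python
def center(mat):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(mat)
    means = [sum(col) / n for col in zip(*mat)]
    return [[v - m for v, m in zip(row, means)] for row in mat]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for two example-by-feature matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of activations for the same examples."""
    x, y = center(x), center(y)
    return cross_frobenius_sq(x, y) / (
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)
    ) ** 0.5


# Toy activations (made up): 3 examples, 2 features each.
layer_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]    # same geometry, rescaled
changed = [[1.0, 2.0], [-1.0, 0.0], [0.0, -2.0]]   # a different geometry

# Hypothetical rule: a lower input/output similarity means a larger change.
layer_scores = {
    "layer_0": linear_cka(layer_in, scaled),
    "layer_1": linear_cka(layer_in, changed),
}
selected = min(layer_scores, key=layer_scores.get)
```

Linear CKA is invariant to isotropic rescaling, so `layer_0` scores 1 and the rule picks `layer_1`. Attaching LoRA modules to half of the layers chosen this way would correspond to the reported 50% reduction in trainable parameters, assuming all layers have the same shape.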
SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model
Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu
First: 2026-01-12T05:03:12+00:00 · Latest: 2026-02-05T18:37:54+00:00
Comments: 12 pages, 14 figures, accepted in WACVW 2026
Abstract
Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
中文标题/摘要
标题:SIRR-LMM:基于大型多模态模型的单张图像反射去除
玻璃表面会产生复杂的反射和透射光相互作用,使得单张图像反射去除(SIRR)具有挑战性。现有数据集在合成数据中缺乏物理现实性,或在实际捕获中规模不足。我们提出了一种合成数据集生成框架,通过在真实背景图像上路径追踪3D玻璃模型来创建具有多种玻璃属性、相机设置和后处理效果的物理准确反射场景。为了利用大型多模态模型(LMM)的能力,我们将图像层合并为单一复合输入,进行联合描述,并使用针对特定任务的LoRA进行微调,而不是进行全面参数训练。这使我们的方法在反射去除和分离性能方面优于现有最先进的方法。
Summary / 总结
The research aims to improve single-image reflection removal by addressing the limitations of existing datasets. A new synthetic dataset generation framework is introduced, which path-traces 3D glass models over real backgrounds to create physically accurate reflection scenarios. The method uses a Large Multimodal Model (LMM) by concatenating image layers and fine-tuning with task-specific LoRA, achieving better reflection removal and separation performance than state-of-the-art methods.
研究针对来自玻璃表面的单张图像反光去除(SIRR)的挑战,由于反射和透射光的相互作用使其复杂。为克服现有数据集的限制,作者开发了一种合成数据生成框架,通过路径追踪3D玻璃模型在真实背景上,创建物理上准确的反光场景。然后使用大型多模态模型(LMM),采用新颖的输入拼接和微调方法,实现了比当前最先进的方法更好的反光去除和分离效果。
RISE-Video: Can Video Generators Decode Implicit World Rules?
Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
First: 2026-02-05T18:36:10+00:00 · Latest: 2026-02-05T18:36:10+00:00
Comments: 38 pages, 16 figures, 3 tables; Code: https://github.com/VisionXLab/RISE-Video; HuggingFace: https://huggingface.co/datasets/VisionXLab/RISE-Video
Abstract
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: \textit{Reasoning Alignment}, \textit{Temporal Consistency}, \textit{Physical Rationality}, and \textit{Visual Quality}. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
中文标题/摘要
标题:RISE-Video:视频生成器能否解码隐含的世界规则?
尽管生成式视频模型在视觉保真度方面取得了显著进展,但它们在内化和推理隐含世界规则方面的能力仍然是一个关键但尚未充分探索的领域。为弥合这一差距,我们提出了RISE-Video,这是一种开创性的基于推理的Text-Image-to-Video (TI2V) 合成基准,将评估重点从表面美学转移到深层次的认知推理。RISE-Video 包含467个精心的人工标注样本,涵盖八个严格的类别,为从常识和空间动态到专业主题领域的模型智能提供了一个结构化的测试平台。我们的框架引入了四个维度的评估协议,包括推理一致性、时间一致性、物理合理性以及视觉质量。为了进一步支持可扩展的评估,我们提出了一种基于大型多模态模型(LMMs)的自动化流程,以模拟人类评估。在11个最先进的TI2V模型上的广泛实验揭示了在隐含约束下模拟复杂场景的普遍缺陷,为未来世界模拟生成模型的发展提供了关键见解。
Summary / 总结
RISE-Video is a reasoning-oriented benchmark for evaluating Text-Image-to-Video synthesis models, focusing on their ability to internalize and reason over implicit world rules. The benchmark includes 467 human-annotated samples across eight categories and introduces a multi-dimensional evaluation protocol with four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Experiments on 11 state-of-the-art models highlight their deficiencies in handling complex scenarios under implicit constraints, providing valuable insights for future model development.
RISE-Video 是一个针对文本-图像到视频合成模型的推理导向基准,旨在评估模型理解与推理隐含世界规则的能力。基准包含467个人工标注样本,涵盖八个类别,并引入了四个评估指标:推理一致性、时间一致性、物理合理性与视觉质量。对11个最先进的模型的实验揭示了它们在处理隐含约束下的复杂场景时的局限性,为未来模型的发展提供了宝贵见解。
Geographically-aware Transformer-based Traffic Forecasting for Urban Motorway Digital Twins
Authors: Krešimir Kušić, Vinny Cahill, Ivana Dusparic
First: 2026-02-05T18:33:03+00:00 · Latest: 2026-02-05T18:33:03+00:00
Comments: IEEE IV2026 37th IEEE Intelligent Vehicles Symposium
Abstract
The operational effectiveness of digital-twin technology in motorway traffic management depends on the availability of a continuous flow of high-resolution real-time traffic data. To function as a proactive decision-making support layer within traffic management, a digital twin must also incorporate predicted traffic conditions in addition to real-time observations. Due to the spatio-temporal complexity and the time-variant, non-linear nature of traffic dynamics, predicting motorway traffic remains a difficult problem. Sequence-based deep-learning models offer clear advantages over classical machine learning and statistical models in capturing long-range temporal dependencies in time-series traffic data, yet limitations in forecasting accuracy and model complexity point to the need for further improvements. To improve motorway traffic forecasting, this paper introduces a Geographically-aware Transformer-based Traffic Forecasting (GATTF) model, which exploits the geographical relationships between distributed sensors using their mutual information (MI). The model has been evaluated using real-time data from the Geneva motorway network in Switzerland, and the results confirm that incorporating geographical awareness through MI enhances the accuracy of GATTF forecasting compared to a standard Transformer, without increasing model complexity.
中文标题/摘要
标题:基于地理感知的变压器交通预测模型在城市高速公路数字孪生中的应用
数字孪生技术在高速公路交通管理中的运营有效性取决于实时高分辨率交通数据的持续流动。为了作为交通管理中的主动决策支持层,数字孪生还必须包含预测的交通状况,而不仅仅是实时观测。由于交通动态的时空复杂性和时间变化的非线性性质,预测高速公路交通仍然是一个难题。基于序列的深度学习模型在捕捉时间序列交通数据中的长期时间依赖性方面明显优于经典机器学习和统计模型,但预测准确性和模型复杂性的局限性表明需要进一步改进。为了改进高速公路交通预测,本文提出了一种基于地理感知的变压器交通预测模型(GATTF),该模型利用分布式传感器之间的地理关系及其互信息(MI)。该模型使用来自瑞士日内瓦高速公路网络的实时数据进行了评估,结果表明,通过MI引入地理感知可以提高GATTF预测的准确性,而不会增加模型复杂性。
Summary / 总结
This paper addresses the challenge of accurately forecasting motorway traffic by introducing the Geographically-aware Transformer-based Traffic Forecasting (GATTF) model. The model leverages the geographical relationships between sensors using mutual information to improve traffic prediction accuracy. Evaluation using real-time data from the Geneva motorway network shows that GATTF outperforms a standard Transformer model in terms of accuracy without increasing model complexity.
本文通过引入地理感知的基于变换器的交通预测模型(GATTF),解决了准确预测高速公路交通流量的挑战。该模型利用传感器之间的地理关系和互信息来提高预测准确性。使用瑞士日内瓦高速公路网络的实时数据进行评估表明,GATTF在不增加模型复杂度的情况下,比标准变换器模型具有更高的预测准确性。
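The abstract above does not say how the mutual information (MI) between sensors is estimated. As a minimal sketch, assuming a simple equal-width histogram estimator (the function name `mutual_information` and the `bins` parameter are illustrative and not taken from the paper), MI between two sensor time series could be computed like this:

```python
import math
from collections import Counter

def mutual_information(xs, ys, bins=4):
    """Estimate MI (in nats) between two equal-length series via equal-width binning.

    Assumption: a plain histogram estimator; the GATTF paper may use a different one.
    """
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant series: fall back to a single bin
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        # p_ij / (p_i * p_j), written with raw counts
        mi += p_ij * math.log(p_ij * n * n / (px[i] * py[j]))
    return mi

# Hypothetical flow readings from two sensors and one constant (uninformative) sensor
sensor_a = [0, 1, 2, 3, 0, 1, 2, 3]
sensor_b = list(sensor_a)
sensor_c = [5] * 8
print(mutual_information(sensor_a, sensor_b), mutual_information(sensor_a, sensor_c))
```

A pairwise MI matrix built this way is one plausible way to tell a Transformer which sensors are informative about each other. How GATTF actually injects MI into the model is not specified in the abstract.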
Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space
Authors: Felipe D. Toro-Hernández, Jesuino Vieira Filho, Rodrigo M. Cabral-Carvalho
Venue: ICLR 2026
First: 2026-02-05T18:23:04+00:00 · Latest: 2026-02-05T18:23:04+00:00
Comments: 10 pages, 6 figures (excluding refs/appendix). Accepted to ICLR 2026
Abstract
Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, and Property listing tasks in Italian and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, this work bridges cognitive modeling with learned representations, establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
中文标题/摘要
标题:在概念生成中表征人类语义导航的轨迹
语义表示可以构架为一个结构化、动态的知识空间,人类在其中导航以检索和操作意义。为了研究人类如何穿越这一几何结构,我们提出了一种框架,将概念生成视为在嵌入空间中的导航。使用不同的变换器文本嵌入模型,我们基于累积嵌入构建了参与者特定的语义轨迹,并提取了几何和动力学度量,包括到下一个的距离、到质心的距离、熵、速度和加速度。这些度量捕捉了语义导航的标量和方向性方面,提供了语义表示搜索的计算基础观点,即在几何空间中的运动。我们在四种不同语言的数据集上评估了该框架,涵盖了不同的属性生成任务:神经退行性疾病、脏话流畅性、意大利语属性列表任务和德语。在这些背景下,我们的方法区分了临床组和概念类型,提供了一个比传统劳动密集型语言预处理方法需要更少人工干预的数学框架。与非累积方法的比较表明,累积嵌入在较长的轨迹中效果最佳,而较短的轨迹可能提供太少的上下文,倾向于非累积替代方法。关键的是,不同的嵌入模型产生了类似的结果,突显了尽管训练管道不同,不同学习表示之间的相似性。通过将语义导航构架为嵌入空间中的结构化轨迹,将认知建模与学习表示相结合,从而建立了一个量化语义表示动力学的管道,具有在临床研究、跨语言分析和评估人工认知中的应用。
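The trajectory metrics named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: "cumulative embedding" is taken to mean the running mean of the item embeddings produced so far, distances are Euclidean, and acceleration is the change in step length. The function names are hypothetical.

```python
import math

def cumulative_trajectory(embeddings):
    """Running mean of item embeddings: point t averages items 0..t (assumed definition)."""
    trajectory, total = [], [0.0] * len(embeddings[0])
    for t, emb in enumerate(embeddings, start=1):
        total = [s + x for s, x in zip(total, emb)]
        trajectory.append([s / t for s in total])
    return trajectory

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def trajectory_metrics(trajectory):
    """Scalar metrics of a trajectory: step lengths, spread around centroid, change in speed."""
    centroid = [sum(col) / len(trajectory) for col in zip(*trajectory)]
    to_next = [dist(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return {
        "distance_to_next": to_next,  # also the speed per step
        "distance_to_centroid": [dist(p, centroid) for p in trajectory],
        "acceleration": [b - a for a, b in zip(to_next, to_next[1:])],
    }

# Toy 2-D "embeddings" for three produced words (made-up values)
words = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
metrics = trajectory_metrics(cumulative_trajectory(words))
print(metrics)
```

Because each cumulative point averages all earlier items, short lists give points dominated by one or two words. That is consistent with the abstract's remark that shorter trajectories may favor the non-cumulative alternative.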
Inverse Depth Scaling From Most Layers Being Similar
Authors: Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
First: 2026-02-05T18:22:41+00:00 · Latest: 2026-02-05T18:22:41+00:00
Comments: 23 pages, 24 figures
Abstract
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
中文标题/摘要
标题:从大多数层相似性推导逆深度缩放
神经网络缩放定律将损失与大型语言模型(LLM)的模型大小相关联,但深度和宽度可能以不同的方式影响性能,需要更详细的研究。在这里,我们通过分析LLM和玩具残差网络来量化深度如何影响损失。我们发现损失与深度成反比地变化,这可能是由于功能相似的层通过集合平均而不是组合学习或离散平滑动力学来减少误差。这种机制虽然效率低下但很稳健,可能源于残差网络的架构偏见和与平滑动力学不兼容的目标函数。研究结果表明,提高LLM效率可能需要架构创新以鼓励深度的组合使用。
Summary / 总结
This study investigates how depth affects loss in large language models (LLMs) by analyzing LLMs and toy residual networks. It finds that loss scales inversely with depth, likely due to functionally similar layers reducing error through ensemble averaging rather than compositional learning. This regime is inefficient but robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The research suggests that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
研究通过分析大型语言模型(LLMs)和玩具残差网络,探讨了深度如何影响损失。研究发现,损失与深度成反比,这可能是由于功能相似的层通过集合平均减少误差。这种机制虽然效率低下但很稳健,可能源于残差网络的架构偏见和目标函数与平滑动力学不兼容。研究建议,提高LLM效率可能需要通过架构创新来促进深度的组合使用。
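The ensemble-averaging explanation can be illustrated with a toy simulation. This is a sketch under a strong simplifying assumption that is not taken from the paper: each "layer" contributes an independent, zero-mean noisy estimate of the same quantity, and the network output is their average. The mean squared error then falls roughly as 1/depth.

```python
import random

def averaged_error(num_layers, noise_std=1.0, trials=20000, seed=0):
    """Mean squared error of the average of `num_layers` independent noisy estimates of 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total += estimate ** 2
    return total / trials

# Error shrinks roughly in proportion to 1/depth under the independence assumption
for depth in (1, 2, 4, 8):
    print(depth, round(averaged_error(depth), 3))
```

Real layers are neither independent nor identical, so this only shows why averaging many similar contributions produces an inverse-depth trend. It does not model the LLMs analyzed in the paper.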
A Hybrid Data-Driven Algorithm for Real-Time Friction Force Estimation in Hydraulic Cylinders
Authors: Mohamad Amin Jamshidi, Mehrbod Zarifi, Zolfa Anvari, Hamed Ghafarirad, Mohammad Zareinejad
First: 2026-02-05T18:21:28+00:00 · Latest: 2026-02-05T18:21:28+00:00
Comments: Published in: 2025 33rd International Conference on Electrical Engineering (ICEE), Publisher IEEE
Abstract
Hydraulic systems are widely utilized in industrial applications due to their high force generation, precise control, and ability to function in harsh environments. Hydraulic cylinders, as actuators in these systems, apply force and position through the displacement of hydraulic fluid, but their operation is significantly influenced by friction force. Achieving precision in hydraulic cylinders requires an accurate friction model under various operating conditions. Existing analytical models, often derived from experimental tests, necessitate the identification or estimation of influencing factors but are limited in adaptability and computational efficiency. This research introduces a data-driven, hybrid algorithm based on Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation. The algorithm effectively combines feature detection and estimation processes using training data acquired from an experimental hydraulic test setup. It achieves a consistent and stable model error of less than 10% across diverse operating conditions and external load variations, ensuring robust performance in complex situations. The computational cost of the algorithm is 1.51 milliseconds per estimation, making it suitable for real-time applications. The proposed method addresses the limitations of analytical models by delivering high precision and computational efficiency. The algorithm's performance is validated through detailed analysis and experimental results, including direct comparisons with the LuGre model. The comparison highlights that while the LuGre model offers a theoretical foundation for friction modeling, its performance is limited by its inability to dynamically adjust to varying operational conditions of the hydraulic cylinder, further emphasizing the advantages of the proposed hybrid approach in real-time applications.
中文标题/摘要
标题:一种用于液压缸实时摩擦力估计的混合数据驱动算法
液压系统因其高力输出、精确控制和在恶劣环境下的工作能力而在工业应用中广泛使用。作为这些系统中的执行器,液压缸通过液压流体的位移产生力和位置,但其操作受到摩擦力的显著影响。要在液压缸中实现精确控制,需要在各种工作条件下建立准确的摩擦模型。现有的分析模型通常基于实验测试,需要识别或估计影响因素,但适应性和计算效率有限。本研究提出了一种基于长短期记忆(LSTM)网络和随机森林的混合数据驱动算法,用于非线性摩擦力估计。该算法通过从实验液压测试装置获取的训练数据有效结合了特征检测和估计过程。该算法在各种操作条件和外部负载变化下实现了小于10%的一致和稳定的模型误差,确保在复杂情况下具有稳健的性能。该算法的计算成本为每次估计1.51毫秒,使其适用于实时应用。该方法通过提供高精度和计算效率解决了分析模型的局限性。算法性能通过详细分析和实验结果得到验证,包括与LuGre模型的直接比较。比较表明,虽然LuGre模型为摩擦建模提供了理论基础,但其性能受限于无法动态适应液压缸的运行条件变化,进一步突显了所提混合方法在实时应用中的优势。
Summary / 总结
This research aims to improve the precision of friction force estimation in hydraulic cylinders, which are crucial components in industrial hydraulic systems. The study proposes a hybrid data-driven algorithm combining LSTM networks and Random Forests to estimate friction forces in real-time. The algorithm achieves consistent errors below 10% across various operating conditions and load variations, with a computational cost of 1.51 milliseconds per estimation. Experimental results show that the proposed method outperforms the LuGre model, particularly in adapting to dynamic operational conditions, highlighting its suitability for real-time applications.
该研究提出了一种基于LSTM网络和随机森林的混合数据驱动算法,用于液压缸的实时摩擦力估计。该算法结合了特征检测和估计过程,使用实验数据,实现了在各种操作条件和负载变化下一致的误差低于10%。它具有高精度和计算效率,每估计一次的成本为1.51毫秒,适用于实时应用。该方法在动态操作条件下优于LuGre模型,突显了其在精度和适应性方面的优势。
Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access
Authors: Daniel Ebi, Gaspard Lambrechts, Damien Ernst, Klemens Böhm
First: 2025-09-30T09:32:20+00:00 · Latest: 2026-02-05T18:21:20+00:00
Comments: 11 pages, 26 pages total, 3 figures
Abstract
Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
中文标题/摘要
标题:知情不对称行为-评论家方法:利用超越全状态访问的特权信号
不对称行为-评论家方法在部分可观测强化学习中广泛应用,但通常假设在训练过程中评论家可以基于完整状态进行条件化,这在实践中往往不现实。我们引入了知情不对称行为-评论家框架,允许评论家基于任意状态依赖的特权信号进行条件化,而无需访问完整状态。我们证明任何这样的特权信号都能提供无偏的行为梯度估计,极大地扩展了可接受的特权信息集。这提出了选择最合适的特权信息以提高学习的问题。为此,我们提出了两种新的信息性标准:一种基于依赖性的测试,可以在训练前应用;另一种基于价值预测准确性的改进,可以在训练后应用。在部分可观测基准任务和合成环境上的实验证明,精心选择的特权信号可以匹配或超越依赖完整状态的基线,同时依赖更少的状态信息。
LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation
Authors: Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys
First: 2026-02-05T18:21:02+00:00 · Latest: 2026-02-05T18:21:02+00:00
Comments: Accepted to IEEE IV 2026. 8 pages, 3 figures. Code available at https://github.com/mirlanium/LSA
Abstract
Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the outputs of an off-the-shelf feature extraction model on the ground-truth and generated video clips, localized around dynamic objects, which yields a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any additional computational overhead.
中文标题/摘要
标题:LSA:局部语义对齐以增强交通视频生成中的时间一致性
可控视频生成已成为自主驾驶领域中的一种多功能工具,能够生成逼真的交通场景。然而,现有方法依赖于推理时的控制信号来引导生成模型生成动态对象的时间一致性,限制了它们作为可扩展和通用数据引擎的实用性。在本文中,我们提出了一种简单而有效的框架——局部语义对齐(LSA),用于微调预训练的视频生成模型。LSA通过在真实视频和生成视频片段之间对齐语义特征来增强时间一致性。具体而言,我们比较了现成特征提取模型在真实视频和生成视频片段(围绕动态对象局部化)之间的输出,诱导语义特征一致性损失。我们通过将此损失与标准扩散损失结合来微调基础模型。使用我们新颖的损失微调一次迭代后,模型在常见的视频生成评估指标中优于基线。为了进一步测试生成视频的时间一致性,我们从目标检测任务中适应了两个额外的指标,即mAP和mIoU。在nuScenes和KITTI数据集上的大量实验表明,我们的方法在无需推理时外部控制信号和任何计算开销的情况下,能够有效增强视频生成的时间一致性。
Summary / 总结
The research aims to enhance temporal consistency in traffic video generation for autonomous driving applications. The proposed Localized Semantic Alignment (LSA) framework fine-tunes pre-trained video generation models by aligning semantic features between ground-truth and generated video clips around dynamic objects. This method improves temporal consistency without requiring external control signals during inference and incurs no additional computational overhead. Experiments on nuScenes and KITTI datasets demonstrate that LSA outperforms baseline methods in common video generation metrics and additional object detection metrics like mAP and mIoU.
研究旨在通过提出局部语义对齐(LSA)来增强交通视频生成的时间一致性,以应用于自动驾驶。LSA通过在动态物体周围对齐真实视频和生成视频的语义特征来微调预训练的视频生成模型。该方法结合了语义特征一致性损失和标准扩散损失,提高了视频生成的时间一致性,无需在推理过程中使用外部控制信号。在nuScenes和KITTI数据集上的实验表明,LSA在常见评估指标和额外的目标检测指标上优于基线方法,展示了其在生成时间一致的交通视频方面的有效性。
Learning to Share: Selective Memory for Efficient Parallel Agentic Systems
Authors: Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah
First: 2026-02-05T18:20:21+00:00 · Latest: 2026-02-05T18:20:21+00:00
Abstract
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/
中文标题/摘要
标题:学会分享:选择性记忆以提高并行代理系统的效率
代理系统通过协调多个代理进行迭代推理、调用工具并交换中间结果来解决复杂任务。为了提高鲁棒性和解决方案质量,最近的方法部署多个并行运行的代理团队以探索不同的推理路径。然而,这种并行执行带来了显著的计算成本:当不同的团队独立地对相似的子问题进行推理或执行类似步骤时,它们会重复进行大量的重叠计算。为了解决这些局限性,本文提出了一种名为Learning to Share (LTS) 的学习共享内存机制,该机制允许并行代理框架在控制上下文增长的同时选择性地重用跨团队的信息。LTS 引入了一个全局内存库,所有团队都可以访问,并且有一个轻量级控制器来决定中间代理步骤是否应添加到内存中。控制器通过带有使用感知的信用分配的逐步强化学习进行训练,使其能够识别在并行执行中具有全局用处的信息。在AssistantBench 和 GAIA 基准测试上的实验表明,与无内存的并行基线相比,LTS 显著减少了总体运行时间,同时匹配或提高了任务性能,证明了学习内存准入是提高并行代理系统效率的有效策略。项目页面:https://joefioresi718.github.io/LTS_webpage/
Summary / 总结
This paper addresses the computational inefficiency in parallel agentic systems by proposing Learning to Share (LTS), a mechanism that enables selective cross-team information reuse. LTS introduces a global memory bank and a lightweight controller that decides which intermediate steps should be stored. Experiments on AssistantBench and GAIA benchmarks show that LTS reduces overall runtime while maintaining or improving task performance compared to memory-free baselines, indicating the effectiveness of learned memory admission in enhancing the efficiency of parallel agentic systems.
本文提出了一种名为Learning to Share (LTS) 的机制,通过选择性地在团队间重用中间信息来解决并行智能体系统中的计算效率问题。LTS 引入了一个全局记忆库和一个轻量级控制器,该控制器决定哪些中间步骤应被存储,从而减少重复计算。实验结果表明,LTS 在 AssistantBench 和 GAIA 基准测试中显著减少了运行时间,同时保持或提高了任务性能,优于无记忆的并行基线系统。
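The shared-memory idea can be sketched as a small data structure. This is a minimal illustration, assuming a hand-written admission rule in place of the learned controller (the paper trains its controller with stepwise reinforcement learning). The class `SharedMemoryBank`, its methods, and `simple_admit` are hypothetical names, not the authors' API.

```python
class SharedMemoryBank:
    """Global memory visible to all parallel teams, gated by an admission rule."""

    def __init__(self, admit, capacity=8):
        self.admit = admit        # callable(step_text, entries) -> bool; stands in for the controller
        self.capacity = capacity  # hard cap to keep context growth bounded
        self.entries = []         # list of (team_id, step_text)

    def offer(self, team_id, step_text):
        """Store an intermediate step if the rule admits it and there is room."""
        if len(self.entries) < self.capacity and self.admit(step_text, self.entries):
            self.entries.append((team_id, step_text))
            return True
        return False

    def read(self, team_id):
        """Return steps contributed by other teams (a team already has its own context)."""
        return [text for owner, text in self.entries if owner != team_id]

def simple_admit(step_text, entries):
    """Placeholder rule: skip very short steps and exact duplicates."""
    return len(step_text) >= 10 and all(step_text != text for _, text in entries)

bank = SharedMemoryBank(simple_admit, capacity=2)
first = bank.offer("team_a", "Search result: the venue opened in 1998")
duplicate = bank.offer("team_b", "Search result: the venue opened in 1998")
print(first, duplicate, bank.read("team_b"))
```

The point of the design is that admission is a decision per step. Storing everything would let context grow without bound, while storing nothing reproduces the memory-free baseline.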
Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces
Authors: Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Nikolay Malkin
First: 2026-02-05T18:16:57+00:00 · Latest: 2026-02-05T18:16:57+00:00
Comments: Code: https://github.com/mmacosha/offpolicy-discrete-diffusion-samplers-and-bridges
Abstract
Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.
Summary / 总结
This paper addresses the challenge of sampling from a distribution known up to a normalising constant in discrete spaces, which is a significant problem in statistics. The authors propose off-policy training techniques to enhance the performance of discrete diffusion samplers, which were previously underexplored. They demonstrate the improved performance of these samplers on various benchmarks and introduce a new method for bridging between two arbitrary distributions in the discrete domain. The proposed techniques are applied to data-free posterior sampling in the latent spaces of image generative models, showcasing their practical utility.
该论文解决了统计学中在离散空间中从一个仅已知到归一化常数为止(即未归一化)的分布中采样的挑战性问题。它引入了离散扩散采样器的离策略训练技术,提高了它们在各种基准上的性能。作者还扩展了这些采样器以在两个任意分布之间进行桥梁构建,首次在离散域中引入了数据到能量薛定谔桥梁训练方法。所提出的方法被应用于图像生成模型的离散潜在空间中的无数据后验采样。
Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching
Authors: Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim
First: 2026-02-05T18:08:20+00:00 · Latest: 2026-02-05T18:08:20+00:00
Comments: Project Page: https://junwankimm.github.io/CSFM
Abstract
Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
中文标题/摘要
标题:更好的源分布,更好的流匹配:学习条件依赖的源分布
流匹配最近已成为扩散生成模型的一种有前途的替代方案,特别是在文本到图像生成方面。尽管它允许任意的源分布,但大多数现有方法仍然依赖于标准的高斯分布,这是从扩散模型继承而来的选择,很少将源分布本身作为优化目标。在本文中,我们展示了在现代文本到图像系统中,源分布的合理设计不仅是可行的,而且是有益的。具体来说,我们提出了在流匹配目标下学习条件依赖的源分布,以更好地利用丰富的条件信号。我们识别了直接将条件信号纳入源分布时出现的关键失败模式,包括分布坍塌和不稳定性,并表明适当的方差正则化和源与目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响结构化源的流匹配,揭示了在这种设计中最有效的区域。在多个文本到图像基准上的广泛实验表明,一致且稳健的改进,包括FID加速3倍,突显了条件流匹配中合理源分布设计的实际益处。
Summary / 总结
This work addresses the limitations of using a standard Gaussian distribution as the source distribution in flow matching for text-to-image generation. It proposes learning a condition-dependent source distribution to better utilize conditioning signals. The study identifies issues like distributional collapse and instability when directly incorporating conditioning into the source and emphasizes the importance of variance regularization and directional alignment. Experiments show consistent improvements, with up to a 3x faster convergence in FID scores, indicating the practical benefits of a principled source distribution design for conditional flow matching.
该研究解决了在文本到图像生成的流匹配中使用标准高斯分布作为源分布的局限性。它提出学习一个条件依赖的源分布,以更好地利用条件信号。研究指出,直接将条件信息融入源分布时会出现分布坍塌和不稳定性等问题,并强调了方差正则化和源与目标的方向对齐的重要性。实验结果表明,这种设计可以实现一致且稳健的改进,FID分数的收敛速度最多可提高3倍,突显了条件流匹配中合理设计源分布的实际益处。
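To make the role of the source distribution concrete, here is a minimal sketch of one flow matching training target with a condition-dependent Gaussian source. It assumes the common straight-line interpolation path and a diagonal Gaussian source whose mean and log standard deviation would come from a conditioning network. All function names are illustrative, and the paper's variance regularization and directional alignment terms are not modeled.

```python
import math

def sample_source(mean, log_std, noise):
    """Reparameterized sample x0 = mean(c) + exp(log_std(c)) * noise."""
    return [m + math.exp(s) * e for m, s, e in zip(mean, log_std, noise)]

def interpolate(x0, x1, t):
    """Point on the straight path from source sample x0 to data sample x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, which the model is regressed onto."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)

# Hypothetical values: the mean and log_std would be produced from the text condition
x0 = sample_source(mean=[1.0, -1.0], log_std=[0.0, 0.0], noise=[0.5, 0.5])
x1 = [3.5, 1.5]
x_half = interpolate(x0, x1, 0.5)
print(x0, x_half, flow_matching_loss([2.0, 2.0], x0, x1))
```

With a standard Gaussian source the mean is zero and the log standard deviation is zero for every condition. Learning them per condition moves the starting point toward the target, which also shows why an unregularized variance could collapse.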
Breaking Symmetry Bottlenecks in GNN Readouts
Authors: Mouad Talhi, Arne Wolf, Anthea Monod
First: 2026-02-05T18:08:13+00:00 · Latest: 2026-02-05T18:08:13+00:00
Comments: 23 pages
Abstract
Graph neural networks (GNNs) are widely used for learning on structured data, yet their ability to distinguish non-isomorphic graphs is fundamentally limited. These limitations are usually attributed to message passing; in this work we show that an independent bottleneck arises at the readout stage. Using finite-dimensional representation theory, we prove that all linear permutation-invariant readouts, including sum and mean pooling, factor through the Reynolds (group-averaging) operator and therefore project node embeddings onto the fixed subspace of the permutation action, erasing all non-trivial symmetry-aware components regardless of encoder expressivity. This yields both a new expressivity barrier and an interpretable characterization of what global pooling preserves or destroys. To overcome this collapse, we introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics, preserving permutation invariance while retaining information provably invisible to averaging. Empirically, swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks, demonstrating that readout design is a decisive and under-appreciated factor in GNN expressivity.
中文标题/摘要
标题:打破GNN读出阶段的对称性瓶颈
图神经网络(GNNs)广泛用于结构化数据的学习,但它们区分非同构图的能力从根本上受到限制。这些限制通常归因于消息传递;在本文中,我们表明独立的瓶颈出现在读出阶段。利用有限维表示论,我们证明所有线性置换不变读出,包括求和和平均池化,都通过Reynolds(群平均)算子,并因此将节点嵌入投影到置换作用的不变子空间上,抹去了所有非平凡的对称感知成分,无论编码器的表达能力如何。这既产生了一个新的表达能力障碍,也提供了一个可解释的关于全局池化保留或破坏什么的表征。为了克服这种坍塌,我们引入了投影基不变读出,将节点表示分解为对称感知通道,并用非线性不变统计进行汇总,同时保持置换不变性并保留平均化无法捕捉到的信息。实验上,仅交换读出就能使固定编码器区分WL-难图对,并在多个基准测试中提高性能,表明读出设计是GNN表达能力的关键且被低估的因素。
Summary / 总结
The paper addresses the limitations of graph neural networks (GNNs) in distinguishing non-isomorphic graphs. These limitations are usually attributed to message passing, but the authors show that an independent bottleneck arises at the readout stage. They prove that all linear permutation-invariant readouts, including sum and mean pooling, project node embeddings onto the fixed subspace of the permutation action, erasing non-trivial symmetry-aware components regardless of encoder expressivity. To overcome this, the authors introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics. Empirically, swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks.
论文探讨了图神经网络(GNN)在区分非同构图时的局限性,这些局限性不仅来自消息传递阶段,还来自读出阶段。研究证明,线性置换不变读出将节点嵌入投影到固定的子空间,消除了对称感知的成分。为克服这一问题,作者提出了基于投影的不变读出,将节点表示分解为对称感知的通道,并使用非线性不变统计进行汇总,从而提高GNN的表达能力和在基准测试中的性能。
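The collapse at the readout can be checked on a tiny example. The sketch below implements the Reynolds (group-averaging) operator by brute force over all node permutations and contrasts sum pooling with one nonlinear invariant statistic. The statistic `complement_energy` is an illustrative choice, not the authors' projector-based construction.

```python
from itertools import permutations

def reynolds(X):
    """Average the node-embedding matrix X (list of rows) over all row permutations."""
    n, d = len(X), len(X[0])
    perms = list(permutations(range(n)))
    return [[sum(X[p[i]][k] for p in perms) / len(perms) for k in range(d)]
            for i in range(n)]

def sum_readout(X):
    """Linear permutation-invariant readout: channel-wise sum over nodes."""
    return [sum(col) for col in zip(*X)]

def complement_energy(X):
    """Squared norm of X - reynolds(X): nonlinear, still permutation-invariant."""
    R = reynolds(X)
    return sum((x - r) ** 2 for row, avg in zip(X, R) for x, r in zip(row, avg))

# Two embedding matrices with the same node average but different spread
X1 = [[1.0, 0.0], [3.0, 0.0]]
X2 = [[2.0, 0.0], [2.0, 0.0]]
print(reynolds(X1), sum_readout(X1), sum_readout(X2))
print(complement_energy(X1), complement_energy(X2))
```

Group averaging replaces every row with the node mean, so sum pooling cannot tell `X1` from `X2`. The energy of the component removed by averaging does separate them. Brute-force enumeration is only feasible for very small graphs and is used here purely for illustration.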
Learning to Discover at Test Time
Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
First: 2026-01-22T18:24:00+00:00 · Latest: 2026-02-05T18:03:03+00:00
Comments: Code: https://github.com/test-time-training/discover
Abstract
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
中文标题/摘要
标题:在测试时学习发现
我们如何使用AI在科学问题上取得新的最优结果?先前的测试时缩放工作,如AlphaEvolve,通过提示冻结的LLM进行搜索。我们在测试时进行强化学习,因此LLM可以继续训练,而且使用的是针对该测试问题的经验。这种持续学习的形式非常特殊,因为它的目标是产生一个出色的解决方案,而不是平均而言许多不错的解决方案,并且是解决这一个问题,而不是泛化到其他问题。因此,我们的学习目标和搜索子程序被设计为优先考虑最有前途的解决方案。我们称这种方法为测试时训练以发现(TTT-Discover)。我们遵循先前的工作,专注于具有连续奖励的问题。我们报告了我们尝试的每个问题的结果,涵盖数学、GPU内核工程、算法设计和生物学。TTT-Discover在几乎所有问题上都取得了新的最优结果:(i) Erdős的最小重叠问题和一个自相关不等式;(ii) 一项GPUMode内核竞赛(比先前的最佳结果快至2倍);(iii) 过去的AtCoder算法竞赛;和(iv) 单细胞分析中的去噪问题。我们的解决方案由专家或组织者审核。所有结果都是使用开放模型OpenAI gpt-oss-120b取得的,并且可以通过我们公开的代码复现,而此前的最佳结果则需要封闭的前沿模型。我们的测试时训练运行使用Thinking Machines的Tinker API,每个问题的成本仅为几百美元。
Summary / 总结
The research aims to use AI to discover new state-of-the-art solutions for scientific problems by performing reinforcement learning at test time. The method, Test-Time Training to Discover (TTT-Discover), lets the LLM continue training on experience specific to the test problem, with a learning objective and search subroutine designed to prioritize the most promising solutions. It sets a new state of the art on almost all attempted problems across mathematics, GPU kernel engineering, algorithm design, and biology: Erdős' minimum overlap problem and an autocorrelation inequality, a GPUMode kernel competition (up to 2x faster than prior art), past AtCoder algorithm competitions, and a denoising problem in single-cell analysis. All results use the open model OpenAI gpt-oss-120b and cost only a few hundred dollars per problem.
研究旨在通过在测试时进行强化学习来使用AI发现科学问题的新前沿解决方案。方法Test-Time Training to Discover (TTT-Discover) 允许LLM继续使用特定于测试问题的经验进行训练,并优先考虑最有前途的解决方案。结果涵盖了数学、GPU内核工程、算法设计和生物学等多个领域,在几乎所有情况下都设定了新的前沿解决方案,某些GPU内核竞赛的性能提高了2倍,并解决了单细胞分析中的复杂去噪问题。所有结果均可通过公开代码和开源模型OpenAI gpt-oss-120b进行复现,成本仅为每问题几百美元。
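The abstract says the learning objective favors one great solution over many good ones on average. The sketch below is not the TTT-Discover objective, which the abstract does not spell out. It only illustrates the general idea with an assumed exponential weighting of sampled rewards; `softmax_weights`, `weighted_reward`, `beta`, and the reward values are all hypothetical.

```python
import math

def softmax_weights(rewards, beta):
    """Weights proportional to exp(beta * reward), computed stably."""
    top = max(rewards)
    exps = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_reward(rewards, beta):
    """Reward averaged under the softmax weights."""
    weights = softmax_weights(rewards, beta)
    return sum(w * r for w, r in zip(weights, rewards))

# Hypothetical rewards of four sampled candidate solutions.
rewards = [0.10, 0.20, 0.15, 0.90]

# beta = 0 gives uniform weights, so the objective is the plain average.
mean_objective = weighted_reward(rewards, beta=0.0)

# A large beta concentrates almost all weight on the best sample.
peaked_objective = weighted_reward(rewards, beta=20.0)
```

Raising `beta` moves the objective from "good on average" toward "best single sample", which is the trade-off the abstract describes.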
$f$-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
First: 2026-02-05T18:01:52+00:00 · Latest: 2026-02-05T18:01:52+00:00
Abstract
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy reinforcement learning objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of $f$-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
中文标题/摘要
标题:$f$-GRPO及其扩展:基于散度的通用大语言模型对齐强化学习算法
近期研究表明,偏好对齐(PA)目标可以作为对齐(被选择)和未对齐(被拒绝)响应分布之间散度的估计器。在此项工作中,我们将这种基于散度的观点扩展到一般的对齐设置中,例如具有可验证奖励的强化学习(RLVR),其中只有环境奖励可用。在这一统一框架中,我们基于$f$-散度的变分表示,提出了用于通用大语言模型对齐的$f$-组相对策略优化($f$-GRPO,一类同策略强化学习目标)和$f$-混合对齐损失($f$-HAL,一类混合同策略/异策略目标)。我们提供了这些类目标在对齐后提高平均奖励的理论保证。实验上,我们在RLVR(数学推理)和PA任务(安全对齐)上验证了我们的框架,展示了与当前方法相比的优越性能和灵活性。
Summary / 总结
This work extends the divergence-based perspective of Preference Alignment (PA) objectives to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR). It proposes $f$-GRPO and $f$-HAL, which are on-policy and hybrid on/off policy reinforcement learning objectives, respectively, based on variational representations of $f$-divergences. Theoretical guarantees show that these objectives improve the average reward after alignment. Empirical validation on RLVR and PA tasks demonstrates superior performance and flexibility compared to existing methods.
该研究将偏好对齐(PA)目标的散度视角扩展到一般的对齐设置,如具有可验证奖励的强化学习(RLVR)。提出了$f$-GRPO和$f$-HAL,分别是基于$f$-散度变分表示的同策略目标和混合同策略/异策略目标,用于对齐大型语言模型(LLM)。理论保证表明这些目标在对齐后提高了平均奖励。在RLVR和PA任务上的实验表明,这些方法的性能和灵活性优于现有方法。
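The variational representation of $f$-divergences mentioned in the abstract is the standard identity $D_f(P \| Q) = \sup_T \; \mathbb{E}_P[T] - \mathbb{E}_Q[f^*(T)]$, where $f^*$ is the convex conjugate of $f$. The sketch below checks it numerically for the KL case, $f(u) = u \log u$ with $f^*(t) = e^{t-1}$, on two small discrete distributions. It illustrates the identity only and is not the $f$-GRPO or $f$-HAL objective; the distributions and function names are made up for the example.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def variational_bound(p, q, critic):
    """E_P[T] - E_Q[f*(T)] with f*(t) = exp(t - 1), the conjugate of f(u) = u log u."""
    expect_p = sum(pi * t for pi, t in zip(p, critic))
    expect_q = sum(qi * math.exp(t - 1.0) for qi, t in zip(q, critic))
    return expect_p - expect_q

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

# The supremum is attained at T*(x) = f'(p(x) / q(x)) = 1 + log(p(x) / q(x)).
optimal_critic = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
# Any other critic gives a lower bound; a constant critic of 1 gives exactly 0.
crude_critic = [1.0, 1.0, 1.0]

exact = kl_divergence(p, q)
tight = variational_bound(p, q, optimal_critic)
loose = variational_bound(p, q, crude_critic)
```

Because every critic yields a lower bound, maximizing the bound over a parameterized critic gives a divergence estimator, which is the role the abstract assigns to alignment objectives.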
Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments
Authors: Zhao Tong, Chunlin Gong, Yimeng Gu, Haichao Shi, Qiang Liu, Shu Wu, Xiao-Yu Zhang
First: 2025-10-10T04:39:57+00:00 · Latest: 2026-02-05T17:52:31+00:00
Comments: 10 pages, 12 figures
Abstract
Online fake news profoundly distorts public judgment and erodes trust in social platforms. While existing detectors achieve competitive performance on benchmark datasets, they remain notably vulnerable to malicious comments designed specifically to induce misclassification. This evolving threat landscape necessitates detection systems that simultaneously prioritize predictive accuracy and structural robustness. However, current detectors often fail to generalize across diverse and novel comment attack patterns. To bridge this gap, we propose AdComment, an adaptive adversarial training framework for robustness enhancement against diverse malicious comments. Based on cognitive psychology, we categorize adversarial comments into Fact Distortion, Logical Confusion, and Emotional Manipulation, and leverage LLMs to synthesize diverse, category-specific perturbations. Central to our framework is an InfoDirichlet Resampling (IDR) mechanism that dynamically adjusts malicious comment proportions during training, thereby steering optimization toward the model's most susceptible regions. Experimental results demonstrate that our approach achieves state-of-the-art performance on three benchmark datasets, improving the F1 scores by 17.9%, 14.5% and 9.0%, respectively.
中文标题/摘要
标题:面向抗恶意评论的鲁棒假新闻检测的分组自适应对抗学习
在线假新闻严重扭曲公众判断并侵蚀社交平台的信任。尽管现有检测器在基准数据集上取得了竞争力的表现,但它们仍然明显容易受到专门设计以诱导分类错误的恶意评论的影响。这种不断演变的威胁环境需要同时兼顾预测准确性和结构鲁棒性的检测系统。然而,当前的检测器往往无法在多样且新颖的评论攻击模式中泛化。为弥补这一差距,我们提出了一种AdComment,这是一种针对多种恶意评论的自适应对抗训练框架,以增强鲁棒性。基于认知心理学,我们将对抗性评论分为事实扭曲、逻辑混淆和情感操控三类,并利用大语言模型(LLM)生成多样化的、类别特定的扰动。我们框架的核心是InfoDirichlet重采样(IDR)机制,该机制在训练过程中动态调整恶意评论的比例,从而引导优化向模型最脆弱的区域。实验结果表明,我们的方法在三个基准数据集上取得了最先进的性能,分别提高了F1分数17.9%、14.5%和9.0%。
Summary / 总结
The paper addresses the vulnerability of existing fake news detectors to malicious comments designed to induce misclassification. It proposes AdComment, an adaptive adversarial training framework that enhances robustness against diverse malicious comments by categorizing them into Fact Distortion, Logical Confusion, and Emotional Manipulation. The framework uses LLMs to generate category-specific perturbations and an InfoDirichlet Resampling mechanism to adjust the proportion of malicious comments during training. Experimental results show that AdComment improves F1 scores by 17.9%, 14.5%, and 9.0% on three benchmark datasets compared to existing methods.
论文提出了一种适应性对抗训练框架AdComment,以增强对恶意评论的鲁棒性。该框架将对抗性评论分为三种类型,并使用LLM生成多样化的扰动。InfoDirichlet重采样机制在训练过程中动态调整恶意评论的比例,以优化模型的脆弱区域。该方法在基准数据集上的F1分数分别提高了17.9%、14.5%和9.0%。
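The abstract describes a resampling mechanism that shifts the share of each malicious-comment category toward the categories where the model is weakest. The details of InfoDirichlet Resampling are not given here, so the sketch below is not that mechanism. It shows a generic, assumed variant that sets Dirichlet parameters in proportion to recent per-category loss. The category keys, the `concentration` parameter, and the loss values are hypothetical.

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def resample_proportions(category_losses, concentration, rng):
    """Higher recent loss -> larger Dirichlet parameter -> larger expected share.

    Assumes every loss is strictly positive.
    """
    total_loss = sum(category_losses.values())
    alphas = [concentration * loss / total_loss for loss in category_losses.values()]
    shares = dirichlet_sample(alphas, rng)
    return dict(zip(category_losses, shares))

rng = random.Random(0)
# Hypothetical validation losses per adversarial comment category.
losses = {
    "fact_distortion": 0.9,
    "logical_confusion": 0.3,
    "emotional_manipulation": 0.3,
}
proportions = resample_proportions(losses, concentration=50.0, rng=rng)
```

Sampling the proportions instead of setting them deterministically keeps some exposure to every category. A larger `concentration` makes the draws track the loss ratios more closely.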
Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation
Authors: Lingrui Li, Yanfeng Zhou, Nan Pu, Xin Chen, Zhun Zhong
First: 2026-02-05T17:47:35+00:00 · Latest: 2026-02-05T17:47:35+00:00
Comments: 8 pages, BIBM2025
Abstract
Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts across continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
中文标题/摘要
标题:多尺度全局-实例提示调优以实现医学图像分割中的持续测试时自适应
在不同临床中心获取的医学图像中,分布偏移是一个常见的挑战,显著阻碍了预训练语义分割模型在多领域实际应用中的部署。持续测试时自适应(CTTA)作为一种有前景的方法,旨在解决目标领域不断演变过程中跨域偏移问题。现有的大多数CTTA方法依赖于逐步更新模型参数,这不可避免地会导致错误累积和灾难性遗忘,尤其是在长期自适应过程中。最近基于提示调优的工作表明,通过仅更新视觉提示来缓解上述两个问题具有潜力。尽管这些方法展示了有前景的性能,但仍存在一些局限性:1) 缺乏多尺度提示多样性,2) 实例特定知识整合不足,3) 隐私泄露风险。为克服这些局限性,我们提出了多尺度全局-实例提示调优(MGIPT),以增强提示的尺度多样性并捕获全局和实例级别的知识,以实现稳健的CTTA。具体而言,MGIPT 包含自适应尺度实例提示(AIP)和多尺度全局提示(MGP)。AIP 动态学习轻量级和实例特定的提示,通过自适应最优尺度选择机制来缓解错误累积。MGP 跨不同尺度捕获领域知识,以确保具有抗遗忘能力的稳健自适应。这些互补组件通过加权集成方法结合,实现有效的双尺度自适应,整合全局和局部信息。在医学图像分割基准上的广泛实验表明,我们的MGIPT 在性能上优于现有最佳方法,实现了在不断变化的目标领域中的稳健自适应。
Summary / 总结
The paper addresses the challenge of distribution shift in medical images from different clinical centers, proposing Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance continual test-time adaptation. MGIPT includes an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP) to mitigate error accumulation and catastrophic forgetting. AIP learns instance-specific prompts, while MGP captures domain-level knowledge across scales. Experiments show MGIPT outperforms existing methods in robust adaptation across changing target domains.
论文针对不同临床中心获取的医学图像分布变化问题,提出了多尺度全局-实例提示调优(MGIPT)方法以增强持续测试时的适应性。MGIPT 引入了自适应尺度实例提示(AIP)和多尺度全局提示(MGP),以减轻错误累积和灾难性遗忘。AIP 学习实例特定的提示,而 MGP 捕捉不同尺度的领域知识。该方法在持续变化的目标域中实现了比现有方法更稳健的适应性。
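The abstract states that the global-level and instance-level components are combined through a weighted ensemble. How MGIPT sets the weights is not described here. The sketch below shows only the generic operation for a single pixel, with an assumed fixed scalar `weight` and made-up class probabilities.

```python
def weighted_ensemble(global_probs, instance_probs, weight):
    """Convex combination of two per-class probability vectors for one pixel."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * g + (1.0 - weight) * i
        for g, i in zip(global_probs, instance_probs)
    ]

# Hypothetical class probabilities for one pixel from the two branches.
global_probs = [0.6, 0.3, 0.1]
instance_probs = [0.2, 0.7, 0.1]

# weight is the share given to the global branch (an assumed value).
fused = weighted_ensemble(global_probs, instance_probs, weight=0.25)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

A convex combination keeps the fused vector a valid probability distribution, so the usual argmax decision rule still applies.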
Tuning Out-of-Distribution (OOD) Detectors Without Given OOD Data
Authors: Sudeepta Mondal, Xinyi Mary Xie, Ruxiao Duan, Alex Wong, Ganesh Sundaramoorthi
First: 2026-02-05T17:46:40+00:00 · Latest: 2026-02-05T17:46:40+00:00
Abstract
Existing out-of-distribution (OOD) detectors are often tuned on a separate dataset deemed OOD with respect to the training distribution of a neural network (NN). OOD detectors process the activations of NN layers and score the output, where parameters of the detectors are determined by fitting to an in-distribution (training) set and the aforementioned dataset, which is chosen ad hoc. At detector training time, this ad hoc dataset may be unavailable or difficult to obtain, and even when it is available, it may not be representative of actual OOD data, which is often "unknown unknowns." Current benchmarks may specify some left-out set from test OOD sets. We show that there can be significant variance in performance of detectors based on the ad hoc dataset chosen in current literature, and thus even if such a dataset can be collected, the performance of the detector may be highly dependent on the choice. In this paper, we introduce and formalize the often neglected problem of tuning OOD detectors without a given "OOD" dataset. To this end, we present strong baselines as an attempt to approach this problem. Furthermore, we propose a new generic approach to OOD detector tuning that does not require any extra data other than those used to train the NN. We show that our approach improves over baseline methods consistently across higher-parameter OOD detector families, while being comparable across lower-parameter families.
中文标题/摘要
标题:无需给定分布外(OOD)数据的分布外检测器调优
现有的分布外(OOD)检测器通常在一个被认为相对于神经网络(NN)训练分布属于OOD的独立数据集上进行调优。OOD检测器处理NN层的激活并对输出评分,检测器的参数通过拟合分布内(训练)数据集和上述临时选定的数据集来确定。在检测器训练时,这种临时选定的数据集可能不可用或难以获取,即使可用,也可能不能代表实际的OOD数据,而实际OOD数据往往是“未知的未知”。当前基准可能会从测试OOD数据集中指定某个留出的数据集。我们表明,检测器的性能会因当前文献中所选的临时数据集不同而存在显著差异,因此即使可以收集这样的数据集,检测器的性能也可能高度依赖于这一选择。在本文中,我们引入并形式化了在没有给定“OOD”数据集的情况下调优OOD检测器这一常被忽视的问题。为此,我们提出了强基线方法,作为解决该问题的尝试。此外,我们提出了一种新的通用OOD检测器调优方法,除训练NN所用的数据外不需要任何额外数据。我们表明,我们的方法在高参数OOD检测器家族中始终优于基线方法,而在低参数家族中与基线方法相当。
Summary / 总结
The paper addresses the problem of tuning out-of-distribution (OOD) detectors without relying on an ad hoc OOD dataset, which is often unavailable or unrepresentative. It presents strong baselines and introduces a new approach that requires no data beyond the neural network's training data. The proposed method improves consistently over baseline methods in higher-parameter OOD detector families and is comparable to them in lower-parameter families.
论文解决了在没有给定的OOD数据集的情况下调校OOD检测器的问题,这种数据集往往不可用或不具代表性。提出了一种不需要额外数据的方法,仅使用神经网络的训练数据。该方法在较高参数的OOD检测器家族中表现出一致的性能提升,并且在较低参数的家族中与基线方法相当。
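As background for the setting above, the sketch below shows one common way to set a detector parameter from in-distribution (ID) data alone: score inputs by maximum softmax probability and choose the threshold that accepts a target fraction of ID samples. This is a generic illustration, not the paper's proposed approach or one of its baselines. The logits, the 0.75 target rate, and the function names are assumptions made for the example.

```python
import math

def max_softmax_score(logits):
    """Confidence score: the largest softmax probability (stable computation)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    return max(exps) / sum(exps)

def threshold_at_tpr(id_scores, tpr=0.95):
    """Largest threshold that still accepts at least `tpr` of the ID scores."""
    ordered = sorted(id_scores, reverse=True)
    keep = math.ceil(tpr * len(ordered))
    return ordered[keep - 1]

def is_ood(logits, threshold):
    """Flag an input as OOD when its score falls below the threshold."""
    return max_softmax_score(logits) < threshold

# Hypothetical logits for four in-distribution validation inputs.
id_logits = [[4.0, 0.0, 0.0], [3.0, 0.5, 0.0], [2.5, 0.0, 0.2], [0.4, 0.3, 0.2]]
id_scores = [max_softmax_score(z) for z in id_logits]

# Accept 75% of ID inputs; the least confident one falls below the threshold.
threshold = threshold_at_tpr(id_scores, tpr=0.75)
```

A single threshold can be fixed this way, but detectors with many parameters cannot be fitted by an acceptance rate alone, which is why tuning without OOD data is a harder problem for higher-parameter families.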
Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
First: 2026-02-05T17:44:28+00:00 · Latest: 2026-02-05T17:44:28+00:00
Abstract
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$χ^2$ regularizer. This additional $χ^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
中文标题/摘要
标题:策略镜像下降中的对数分区函数近似诱导LLM后训练时的隐式正则化
策略镜像下降(PMD)通过迭代求解KL正则化的策略改进子问题,为强化学习(RL)提供了一个有原理依据的框架。尽管这种方法已被应用于训练如Kimi K1.5/K2等高级LLM,但理想的闭式PMD更新需要可靠的分区函数估计,而在LLM庞大的动作空间中只能进行有限采样时,这是一项重大挑战。我们研究了一种实用算法PMD-mean,该算法用采样策略下的平均奖励近似对数分区项,并在对数策略空间中进行回归。具体而言,我们刻画了PMD-mean的总体解,并证明它隐式优化了带有自适应混合KL--$χ^2$正则项的镜像下降子问题。这种额外的$χ^2$正则化限制了较大的概率变化,在预期奖励较低时产生更保守的更新,并增强对有限样本估计误差的鲁棒性。在数学推理任务上的实验表明,PMD-mean取得了更优的性能,并具有更好的稳定性和时间效率。这些发现加深了我们对PMD-mean的理解,并为面向LLM的RL算法的原理性改进指明了途径。代码可在 https://github.com/horizon-rl/OpenKimi 获取。
Summary / 总结
The research addresses the challenge of estimating the log-partition function in policy mirror descent (PMD) for reinforcement learning in large language models (LLMs). It proposes PMD-mean, which approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. The study shows that PMD-mean implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$χ^2$ regularizer, leading to more conservative updates and enhanced robustness. Experiments on math reasoning tasks demonstrate that PMD-mean outperforms other methods with improved stability and time efficiency.
研究旨在解决在大型语言模型(LLMs)中使用策略镜像下降(PMD)进行强化学习时,对数分区函数估计的难题。提出的PMD-mean算法用采样策略下的平均奖励来近似对数分区项,并在对数策略空间中进行回归。实验结果表明,PMD-mean在数学推理任务上表现出更优的性能,具有更好的稳定性和时间效率,这得益于隐含的$χ^2$正则化,它限制了较大的概率变化并增强了对估计误差的鲁棒性。
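To make the approximation concrete, the sketch below compares the two regression targets on a three-action toy problem. The closed-form KL-regularized update is $\log \pi^*(y) = \log \pi_{\text{old}}(y) + r(y)/\tau - \log Z$ with $Z = \mathbb{E}_{\pi_{\text{old}}}[e^{r/\tau}]$. Replacing $\log Z$ by $\bar{r}/\tau$, where $\bar{r}$ is the mean reward under $\pi_{\text{old}}$, is our reading of the abstract's description of PMD-mean, not a formula quoted from the paper. The temperature `tau`, the squared-error loss form, and all names are assumptions.

```python
import math

def exact_pmd_log_ratio(old_policy, rewards, tau):
    """log(pi* / pi_old) for the closed-form KL-regularized update."""
    log_z = math.log(sum(p * math.exp(r / tau) for p, r in zip(old_policy, rewards)))
    return [r / tau - log_z for r in rewards]

def mean_baseline_log_ratio(old_policy, rewards, tau):
    """Same target with log Z replaced by (mean reward) / tau (assumed form)."""
    mean_reward = sum(p * r for p, r in zip(old_policy, rewards))
    return [(r - mean_reward) / tau for r in rewards]

def regression_loss(new_log_probs, old_policy, target_log_ratio):
    """Squared error in log-policy space, averaged under the sampling policy."""
    return sum(
        p * (lp - math.log(p) - t) ** 2
        for p, lp, t in zip(old_policy, new_log_probs, target_log_ratio)
    )

old_policy = [0.5, 0.3, 0.2]   # hypothetical sampling policy over three actions
rewards = [1.0, 0.0, 0.0]      # hypothetical rewards
tau = 1.0

exact = exact_pmd_log_ratio(old_policy, rewards, tau)
approx = mean_baseline_log_ratio(old_policy, rewards, tau)

# The exact target defines a normalized policy; the mean-baseline one does not.
exact_policy = [p * math.exp(t) for p, t in zip(old_policy, exact)]
approx_mass = sum(p * math.exp(t) for p, t in zip(old_policy, approx))
```

By Jensen's inequality, $\log Z \ge \bar{r}/\tau$, so the mean-baseline target is never below the exact one and its implied total mass is at least 1. A normalized policy fitted to it by regression therefore cannot match it exactly in general, which gives some intuition for why the approximation changes the effective regularizer.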