AnyView: Synthesizing Any Novel View in Dynamic Scenes
Authors: Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, Swati Gupta, Fangzhou Cheng, Sergey Zakharov, Vitor Campagnolo Guizilini
First: 2026-01-23T18:59:58+00:00 · Latest: 2026-01-23T18:59:58+00:00
Comments: Project webpage: https://tri-ml.github.io/AnyView/
Abstract
Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/
中文标题/摘要
标题:AnyView:动态场景中的任意视图合成
现代生成视频模型在生成逼真、高质量的输出方面表现出色,但在保持高度动态现实环境中的多视角和时空一致性方面存在困难。在本工作中,我们引入了**AnyView**,这是一种基于扩散的视频生成框架,用于**动态视图合成**,几乎没有任何归纳偏见或几何假设。我们利用多种监督程度不同的数据源进行训练,包括单目(2D)、多视角静态(3D)和多视角动态(4D)数据集,以生成零样本的新颖视频,这些视频可以从任意的摄像机位置和轨迹生成。我们在标准基准上评估了AnyView,显示了与当前最先进的技术竞争的结果,并提出了**AnyViewBench**,这是一个针对**极端**动态视图合成的新基准,适用于多种现实场景。在这一更具戏剧性的环境中,我们发现大多数基线在性能上大幅下降,因为它们需要视点之间有显著的重叠,而AnyView在从**任何**视点提示时仍能生成逼真、合理且时空一致的视频。结果、数据、代码和模型可以在:https://tri-ml.github.io/AnyView/查看。
Summary / 总结
AnyView is a diffusion-based video generation framework for dynamic view synthesis that makes minimal geometric assumptions. It trains a generalist spatiotemporal implicit representation on monocular (2D), multi-view static (3D), and multi-view dynamic (4D) datasets. It achieves competitive results on standard benchmarks. On AnyViewBench, a new benchmark for extreme dynamic view synthesis in diverse real-world scenarios, most baselines degrade drastically because they need significant overlap between viewpoints, while AnyView still produces realistic, spatiotemporally consistent videos from any viewpoint. Results, data, code, and models are available at https://tri-ml.github.io/AnyView/.
AnyView 是一个基于扩散的视频生成框架,用于动态视图合成,无需强先验假设或几何假设。它利用单目、多视角静态和多视角动态数据集等多种数据源进行训练,以生成泛化的时空隐式表示。该框架在标准基准测试中表现出竞争力,并在专注于极端动态视图合成的 AnyViewBench 中超越大多数基线,能够从任意视角生成时空一致且逼真的视频。结果、数据、代码和模型可在 https://tri-ml.github.io/AnyView/ 查看。
A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs
Authors: Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman
First: 2026-01-23T18:59:40+00:00 · Latest: 2026-01-23T18:59:40+00:00
Comments: 9 pages, 6 figures
Abstract
Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($λ_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze critical sharpness ($λ_c$), a computationally efficient measure requiring fewer than 10 forward passes given the update direction $Δ\mathbf{θ}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to 7B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce relative critical sharpness ($λ_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
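To make the link between Hessian sharpness and the learning rate concrete, here is a minimal sketch on a toy quadratic loss. It illustrates only the classical result that plain gradient descent on a quadratic is stable when the learning rate is below 2 / λ_max; it is not the critical sharpness measure from the paper, whose definition is not given in the abstract. The names `hessian_sharpness` and `run_gd` and the toy Hessian are assumptions made for illustration.

```python
import numpy as np

def hessian_sharpness(H):
    """Largest eigenvalue of a symmetric Hessian matrix."""
    return float(np.linalg.eigvalsh(H)[-1])  # eigvalsh returns ascending eigenvalues

def run_gd(H, lr, steps=200):
    """Gradient descent on the toy loss 0.5 * x^T H x; returns the final loss."""
    x = np.ones(H.shape[0])
    for _ in range(steps):
        x = x - lr * (H @ x)  # the gradient of the quadratic loss is H x
    return float(0.5 * x @ H @ x)

H = np.diag([1.0, 10.0])                      # toy Hessian (assumption), sharpness 10
lam_max = hessian_sharpness(H)
stable_loss = run_gd(H, lr=1.9 / lam_max)     # below the 2 / lam_max threshold
unstable_loss = run_gd(H, lr=2.1 / lam_max)   # above the threshold, loss blows up
```

On this toy problem the loss shrinks toward zero in the first run and grows without bound in the second, which is the sense in which sharpness "interacts with the learning rate".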
中文标题/摘要
标题:一种可扩展的损失景观曲率度量方法,用于分析大规模语言模型的训练动力学
理解损失景观的曲率演变是分析神经网络训练动力学的基础。最常研究的度量标准是海森尖锐度($λ_{\max}^H$),即损失海森矩阵的最大特征值,它决定了局部训练稳定性,并在整个训练过程中与学习率相互作用。尽管它在分析训练动力学方面具有重要意义,但由于计算成本高昂,直接测量海森尖锐度对大规模语言模型(LLMs)来说仍然是不可行的。我们分析了临界尖锐度($λ_c$),这是一种计算效率高的度量标准,给定更新方向$Δ\mathbf{θ}$,只需要少于10次前向传递即可。重要的是,这种度量标准能够很好地捕捉到已有充分记录的海森尖锐度现象,包括渐进尖锐化和稳定性边缘(Edge of Stability)。利用这种度量标准,我们首次在大规模上展示了这些尖锐度现象,参数量最高达7B,涵盖了OLMo-2模型的预训练和中期训练。我们还引入了相对临界尖锐度($λ_c^{1\to 2}$),它量化了在优化另一个损失景观时一个损失景观的曲率,用于分析从预训练到微调的过渡,并指导数据混合策略。临界尖锐度为从业者提供了一种实用的工具,用于诊断曲率动态并指导大规模的数据组合选择。更广泛地说,我们的工作表明,可扩展的曲率度量可以为大规模训练提供可操作的见解。
Summary / 总结
This paper addresses the challenge of analyzing the loss landscape curvature for Large Language Models (LLMs) by proposing a computationally efficient measure called critical sharpness ($λ_c$), which requires fewer than 10 forward passes. The study demonstrates that critical sharpness captures well-known phenomena such as progressive sharpening and Edge of Stability, providing insights into the training dynamics of LLMs. The authors also introduce relative critical sharpness ($λ_c^{1\to 2}$) to analyze the transition from pre-training to fine-tuning, which helps in guiding data mixing strategies. The findings are validated on models up to 7 billion parameters, spanning both pre-training and mid-training stages of OLMo-2 models.
论文旨在通过理解损失景观的曲率演变来分析神经网络的训练动态,特别是大型语言模型(LLMs)。它引入了一个计算高效的度量标准,称为临界尖锐度($λ_c$),只需要少于10次前向传递,并捕捉了渐进尖锐化和稳定性边缘(Edge of Stability)等现象。研究在7B参数规模上展示了这些现象,并引入了相对临界尖锐度($λ_c^{1\to 2}$)来分析从预训练到微调的过渡,并指导数据混合策略。该度量标准为诊断曲率动态和指导大规模训练中的数据组成选择提供了实用的工具。
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Authors: Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder
First: 2025-09-16T18:00:14+00:00 · Latest: 2026-01-23T18:59:33+00:00
Comments: 3DV 2026. Project Page: https://map-anything.github.io/
Abstract
We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
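One plausible way to read the factored representation is sketched below: per-pixel depth times a local ray direction gives camera-frame points, the camera pose moves them into a shared frame, and a single scale factor makes the result metric. The composition order, the function name `compose_metric_points`, and the array shapes are assumptions for illustration, not the model's confirmed formulation.

```python
import numpy as np

def compose_metric_points(depth, rays, R, t, scale):
    """Combine factored outputs into metric 3D points (illustrative assumption).

    depth: (H, W) depth along each ray, up to scale
    rays:  (H, W, 3) ray directions in the camera frame
    R, t:  (3, 3) rotation and (3,) translation, camera to shared frame
    scale: scalar metric scale factor applied to the whole reconstruction
    """
    local = rays * depth[..., None]   # camera-frame points, up to scale
    world = local @ R.T + t           # rigid transform into the shared frame
    return scale * world              # upgrade to metric units

depth = np.full((2, 2), 2.0)
rays = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))   # all rays look down +z
points = compose_metric_points(depth, rays, np.eye(3), np.array([1.0, 0.0, 0.0]), 0.5)
```

Keeping the scale as a separate scalar is what lets the same depth and ray predictions be reused whether or not metric information is available at input.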
中文标题/摘要
标题:MapAnything: 统一的前馈度量3D重建模型
我们介绍了MapAnything,这是一种统一的基于Transformer的前馈模型,可以接受一张或多张图像以及可选的几何输入,如相机内参、位姿、深度或部分重建,然后直接回归度量3D场景几何和相机。MapAnything利用多视图场景几何的分解表示,即一组深度图、局部射线图、相机位姿和度量尺度因子,有效地将局部重建升级到全局一致的度量坐标系。通过标准化来自不同数据集的监督和训练,以及灵活的输入增强,MapAnything能够在单次前馈传递中解决广泛的3D视觉任务,包括未标定的运动恢复结构、标定的多视图立体、单目深度估计、相机定位、深度补全等。我们提供了详尽的实验分析和模型消融实验,证明MapAnything在性能上优于或匹配专门的前馈模型,同时提供更高效的联合训练行为,从而为通用3D重建骨干网络铺平了道路。
Summary / 总结
MapAnything is a unified transformer-based model that processes images and geometric inputs to directly regress metric 3D scene geometry and cameras. It uses a factored representation of multi-view geometry to achieve global consistency. Experiments show that MapAnything outperforms or matches specialist models across various 3D vision tasks, offering efficient joint training and paving the way for a universal 3D reconstruction backbone.
MapAnything 是一个统一的基于 Transformer 的模型,它可以处理图像和几何输入并直接回归出度量级的 3D 场景几何和相机。它使用多视图几何的分解表示,并采用灵活的输入增强来处理诸如运动恢复结构、多视图立体视觉和单目深度估计等多种 3D 视觉任务。实验表明,MapAnything 在性能上超越或与专门模型持平,同时提供更高效的联合训练行为,使其成为有前景的通用 3D 重建骨干网络。
Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm
Authors: Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debardeleben, Ayan Biswas, Diane Oyen, Earl Lawrence
First: 2025-09-02T21:31:32+00:00 · Latest: 2026-01-23T18:55:13+00:00
Abstract
Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and are inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in "thinking" strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic-based model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC yields improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms for PDE modeling, including building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.
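The general idea of spending inference compute under a reward model can be sketched as a best-of-N auto-regressive rollout: at each step, sample several candidate next states from a stochastic model and keep the one the reward model scores highest. The toy dynamics, the reward function, and all names below are hypothetical stand-ins; the paper's actual reward models and selection rule are not described in the abstract.

```python
import numpy as np

def best_of_n_rollout(step_fn, reward_fn, x0, horizon, n_candidates, rng):
    """Auto-regressive rollout that keeps the highest-reward candidate at each step."""
    x, rewards = x0, []
    for _ in range(horizon):
        candidates = [step_fn(x, rng) for _ in range(n_candidates)]
        scores = [reward_fn(x, c) for c in candidates]
        best = int(np.argmax(scores))
        x = candidates[best]
        rewards.append(scores[best])
    return x, float(np.mean(rewards))

def noisy_decay_step(x, rng):
    """Hypothetical stochastic surrogate: exponential decay plus sampling noise."""
    return 0.9 * x + rng.normal(0.0, 0.1)

def consistency_reward(x_prev, x_next):
    """Hypothetical reward: penalize deviation from the expected decay rule."""
    return -abs(x_next - 0.9 * x_prev)

rng = np.random.default_rng(0)
_, single_score = best_of_n_rollout(noisy_decay_step, consistency_reward, 1.0, 20, 1, rng)
_, scaled_score = best_of_n_rollout(noisy_decay_step, consistency_reward, 1.0, 20, 16, rng)
```

With one candidate per step this reduces to standard non-adaptive rollout; raising `n_candidates` trades extra inference compute for trajectories the reward model rates as more consistent.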
中文标题/摘要
标题:向PDE基础模型推理能力迈进:一种奖励模型驱动的推理时缩放算法
偏微分方程(PDEs)是现代计算科学和工程的基础,但计算上非常昂贵。虽然PDE基础模型在模拟复杂的时空现象方面显示出很大的潜力,但现有模型仍然受到预训练数据集的限制,并且在自回归展开性能上存在挑战,尤其是在分布外(OOD)情况下。此外,它们需要大量的计算资源和训练数据,这阻碍了它们在许多关键应用中的使用。受大型语言模型(LLMs)中“思考”策略最新进展的启发,我们首次引入了一种用于PDE的测试时计算(TTC)策略,该策略在推理过程中利用计算资源以更少的训练样本和更小的模型实现更准确的预测。我们通过两种类型的奖励模型来评估基于随机模型的时空一致性预测来实现这一点。我们在PDEGym基准上的可压缩欧拉方程模拟上展示了这种方法,并表明TTC相对于标准非自适应自回归推理捕获了更好的预测。这种TTC框架标志着向更高级的PDE建模推理算法迈进的基础步骤,包括构建基于强化学习的方法,有可能彻底改变物理和工程中的计算工作流。
Summary / 总结
The paper addresses the computational challenges of Partial Differential Equations (PDEs) in computational sciences and engineering by introducing a test-time computing (TTC) strategy. This strategy uses reward models to evaluate the predictions of a stochastic model for spatio-temporal consistency, thereby achieving more accurate predictions with fewer training samples and smaller models. The method was tested on compressible Euler-equation simulations and showed improved predictions compared to standard inference methods.
论文通过引入测试时计算(TTC)策略来解决偏微分方程(PDEs)在计算科学和工程中的计算挑战。该策略使用奖励模型来评估基于随机模型的空间-时间一致性预测,从而实现更准确的预测,同时使用更少的训练样本和更小的模型。该方法在可压缩欧拉方程模拟上进行了测试,并显示出比标准推理方法更好的预测结果。
Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency
Authors: Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen, Minh Le, Phat K. Huynh, Ulas Bagci
First: 2026-01-21T01:01:01+00:00 · Latest: 2026-01-23T18:54:37+00:00
Abstract
Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.
中文标题/摘要
标题:Scribble-监督医学图像分割中的动态教师切换和层次一致性
Scribble-监督方法已经出现,以减轻医学图像分割中的标注负担。然而,这些标注的固有稀疏性引入了显著的模糊性,导致伪标签传播噪声化并阻碍了对稳健的解剖边界的学习。为了解决这一挑战,我们提出了一种新的双教师单学生框架SDT-Net,旨在最大化这些弱信号的监督质量。该方法包含一个动态教师切换(DTS)模块,以自适应地选择最可靠的教师。该选定的教师通过两种协同机制指导学生:由挑选可靠像素(PRP)机制精炼的高置信度伪标签,以及由层次一致性(HiCo)模块强制执行的多级特征对齐。在ACDC和MSCMRseg数据集上的广泛实验表明,SDT-Net达到了最先进的性能,产生了更准确且解剖上更合理的分割。
Summary / 总结
The research aims to improve medical image segmentation by addressing the issue of sparse annotations, which can lead to noisy pseudo-labels. The proposed SDT-Net uses a dual-teacher, single-student framework with a Dynamic Teacher Switching module to select the most reliable teacher. This teacher provides guidance through high-confidence pseudo-labels, refined by a Pick Reliable Pixels mechanism, and multi-level feature alignment enforced by a Hierarchical Consistency module. Experiments on ACDC and MSCMRseg datasets show that SDT-Net outperforms existing methods, yielding more accurate and anatomically plausible segmentations.
论文提出了一种双教师、单学生框架SDT-Net,以应对医学图像分割中稀疏标注带来的挑战。SDT-Net 包含一个动态教师切换模块来选择最可靠的教师,以及一个层次一致性模块来强制执行多级特征对齐。该方法还使用Pick Reliable Pixels机制来细化伪标签。实验结果表明,SDT-Net 在ACDC 和 MSCMRseg 数据集上的分割准确性和解剖学合理性方面优于现有方法。
LLM Reasoning for Cold-Start Item Recommendation
Authors: Shijun Li, Yu Wang, Jin Wang, Ying Li, Joydeep Ghosh, Anne Cocos
Venue: WWW 2026
First: 2025-11-23T03:22:53+00:00 · Latest: 2026-01-23T18:51:39+00:00
Comments: Published in the Proceedings of the ACM Web Conference 2026 (WWW 2026)
Abstract
Large Language Models (LLMs) have shown significant potential for improving recommendation systems through their inherent reasoning capabilities and extensive knowledge base. Yet, existing studies predominantly address warm-start scenarios with abundant user-item interaction data, leaving the more challenging cold-start scenarios, where sparse interactions hinder traditional collaborative filtering methods, underexplored. To address this limitation, we propose novel reasoning strategies designed for cold-start item recommendations within the Netflix domain. Our method utilizes the advanced reasoning capabilities of LLMs to effectively infer user preferences, particularly for newly introduced or rarely interacted items. We systematically evaluate supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches that combine both methods to optimize recommendation performance. Extensive experiments on real-world data demonstrate significant improvements in both methodological efficacy and practical performance in cold-start recommendation contexts. Remarkably, our reasoning-based fine-tuned models outperform Netflix's production ranking model by up to 8% in certain cases.
中文标题/摘要
标题:冷启动项目推荐中的大语言模型推理
大型语言模型(LLMs)通过其固有的推理能力和广泛的知识库,显示出在推荐系统中改进的巨大潜力。然而,现有研究主要集中在有丰富用户-项目交互数据的温启动场景,而冷启动场景则被忽视,这些场景中稀疏的交互数据阻碍了传统的协同过滤方法。为解决这一局限,我们提出了一种针对Netflix领域冷启动项目推荐的新型推理策略。该方法利用LLMs的高级推理能力,有效推断用户偏好,特别是对于新引入或很少交互的项目。我们系统地评估了监督微调、基于强化学习的微调以及结合两种方法的混合方法,以优化推荐性能。在实际数据上的广泛实验表明,在冷启动推荐场景中,方法的有效性和实际性能都有显著提高。值得注意的是,我们的基于推理的微调模型在某些情况下比Netflix的生产排名模型高出8%。
Summary / 总结
This paper addresses the challenge of cold-start item recommendation by leveraging the reasoning capabilities of Large Language Models (LLMs). It proposes novel reasoning strategies for the Netflix domain and evaluates supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches. Experiments on real-world data show that the reasoning-based fine-tuned models outperform Netflix's production ranking model by up to 8% in certain cases.
该研究利用大型语言模型(LLMs)的推理能力解决冷启动项目推荐的挑战,提出了一种新的推理策略来推断新或很少交互项目的用户偏好。研究评估了监督微调、基于强化学习的微调以及结合两种方法的混合方法,显示出在推荐性能上的显著提升。推理微调模型在某些情况下比Netflix的生产排名模型最多高出8%。
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
First: 2026-01-23T18:43:34+00:00 · Latest: 2026-01-23T18:43:34+00:00
Comments: Project page: https://visgym.github.io/
Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
中文标题/摘要
标题:VisGym:多模态代理的多样化、可定制、可扩展环境
现代视觉-语言模型(VLMs)在多步骤视觉交互中仍然缺乏充分的表征,特别是在它们如何在长时间范围内整合感知、记忆和行动方面。我们引入了VisGym,这是一个包含17个环境的训练场,用于评估和训练VLMs。该套件涵盖了符号谜题、真实图像理解、导航和操作,并提供了对难度、输入表示、规划时间范围和反馈的灵活控制。我们还提供了多步骤求解器,生成结构化的演示,以实现监督微调。我们的评估表明,所有前沿模型在交互设置中都面临挑战,在简单(46.6%)和困难(26.0%)配置中成功率都很低。我们的实验揭示了一些显著的局限性:模型难以有效利用长上下文,在无界历史记录下表现不如在截断窗口下。此外,我们发现,一旦以视觉形式呈现,几种基于文本的符号任务变得显著更难。然而,在部分可观测或未知动力学设置中,通过明确的目标观察、文本反馈和探索性演示进行监督微调可以实现一致的改进,突显了多步骤视觉决策的具体失败模式和改进途径。代码、数据和模型可在:https://visgym.github.io/ 获取。
Summary / 总结
VisGym is a suite of 17 environments designed to evaluate and train Vision-Language Models (VLMs) in multi-step visual interactions. It covers various tasks such as symbolic puzzles, real-image understanding, navigation, and manipulation, and allows for flexible control over difficulty, input representation, planning horizon, and feedback. The study shows that leading VLMs perform poorly in interactive settings, with low success rates even in easy configurations. Key findings include difficulties in leveraging long context and increased challenges when tasks are rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations improve performance in partially observable or unknown-dynamics settings.
VisGym 提供了 17 个多样化的环境来评估和训练视觉-语言模型 (VLM),这些环境涵盖了符号谜题、真实图像理解、导航和操作。这些环境允许灵活地控制难度、输入表示、规划时间和反馈。评估结果显示,当前的 VLM 在交互式设置中表现不佳,即使在简单配置中成功率也很低。关键发现包括模型在利用长上下文方面的困难,以及当文本符号任务被视觉化时变得更为复杂。然而,明确的目标观察和探索性演示在部分可观测或未知动力学设置中提高了性能。
Auto-Regressive Masked Diffusion Models
Authors: Mahdi Karami, Ali Ghodsi
Venue: 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
First: 2026-01-23T18:42:30+00:00 · Latest: 2026-01-23T18:42:30+00:00
Abstract
Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.
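A block-wise causal structure is usually expressed as an attention mask. The sketch below shows one common form, in which a position may attend only to positions in earlier blocks (strict) or in earlier and the same block (non-strict). Which variant ARMD uses, and how blocks are assigned, is an assumption here; the function name `block_causal_mask` is illustrative.

```python
import numpy as np

def block_causal_mask(block_ids, strict=False):
    """Boolean mask where mask[i, j] is True if query i may attend to key j.

    block_ids: block index of each token position.
    strict=True allows only earlier blocks; otherwise the same block is also visible.
    """
    b = np.asarray(block_ids)
    if strict:
        return b[None, :] < b[:, None]
    return b[None, :] <= b[:, None]

block_ids = [0, 0, 1, 1]                 # two blocks of two tokens each (toy example)
mask = block_causal_mask(block_ids)
strict_mask = block_causal_mask(block_ids, strict=True)
```

In the strict variant the first block sees nothing, so a practical model would need something like a start token or prefix for it to condition on.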
中文标题/摘要
标题:自回归掩蔽扩散模型
掩蔽扩散模型(MDMs)已成为语言建模的一种有前途的方法,但与自回归模型(ARMs)相比,它们在性能上存在差距,并且需要更多的训练迭代。在本文中,我们提出了自回归掩蔽扩散(ARMD)模型,这是一种旨在通过结合自回归模型的训练效率和基于扩散模型的并行生成能力来缩小这一差距的架构。我们的核心见解是将掩蔽扩散过程重新构想为块因果模型。这种视角使我们能够设计出严格因果且置换等变的架构,在单个并行前向传递中计算多个去噪步骤的所有条件概率。由此产生的架构支持高效的自回归风格解码和渐进置换训练方案,使模型能够学习标准的从左到右和随机的标记顺序。利用这种灵活性,我们引入了一种新颖的分段并行生成策略,通过并行流生成标记来加速推理,同时保持全局一致性。实验证明,ARMD在标准语言建模基准上达到了最先进的性能,优于现有的扩散基线模型,同时所需训练步骤显著减少。此外,它还为并行文本生成设定了新的基准,有效地弥合了并行和序列解码之间的性能差距。
Summary / 总结
The research aims to improve the performance of masked diffusion models in language modeling by closing the gap with autoregressive models. The Auto-Regressive Masked Diffusion (ARMD) model is introduced, which combines the training efficiency of autoregressive models with the parallel generation capabilities of diffusion models. The key method involves reinterpreting the masked diffusion process as a block-wise causal model, enabling a permutation-equivariant architecture that supports efficient decoding and progressive training. Experimental results show that ARMD outperforms existing diffusion models on standard benchmarks with fewer training steps and establishes a new benchmark for parallel text generation.
自回归掩码扩散(ARMD)模型将自回归模型的训练效率与扩散模型的并行生成能力相结合,以缩小掩码扩散模型与自回归模型之间的性能差距。它将掩码扩散过程重新表述为块因果模型,得到严格因果、置换等变的架构,支持高效的解码和渐进置换训练。ARMD 在语言建模基准测试中优于现有扩散基线,所需训练步骤更少,并引入了一种新颖的跨步并行生成策略,在保持全局一致性的同时加速推理。
BONO-Bench: A Comprehensive Test Suite for Bi-objective Numerical Optimization with Traceable Pareto Sets
Authors: Lennart Schäpermeier, Pascal Kerschke
First: 2026-01-23T18:42:20+00:00 · Latest: 2026-01-23T18:42:20+00:00
Comments: Accepted for publication in the Special Issue on Benchmarking in Multi-Criteria Optimization at ACM TELO
Abstract
The evaluation of heuristic optimizers on test problems, better known as benchmarking, is a cornerstone of research in multi-objective optimization.
However, most test problems used in benchmarking numerical multi-objective black-box optimizers come from one of two flawed approaches: On the one hand, problems are constructed manually, which results in problems with well-understood optimal solutions, but unrealistic properties and biases.
On the other hand, more realistic and complex single-objective problems are composited into multi-objective problems, but with a lack of control and understanding of problem properties.
This paper proposes an extensive problem generation approach for bi-objective numerical optimization problems consisting of the combination of theoretically well-understood convex-quadratic functions into unimodal and multimodal landscapes with and without global structure.
It supports configuration of test problem properties, such as the number of decision variables, local optima, Pareto front shape, plateaus in the objective space, or degree of conditioning, while maintaining theoretical tractability: The optimal front can be approximated to an arbitrary degree of precision regarding Pareto-compliant performance indicators such as the hypervolume or the exact R2 indicator.
To demonstrate the generator's capabilities, a test suite of 20 problem categories, called BONO-Bench, is created and subsequently used as a basis of an illustrative benchmark study.
Finally, the general approach underlying our proposed generator, together with the associated test suite, is publicly released in the Python package bonobench to facilitate reproducible benchmarking.
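For readers unfamiliar with the hypervolume indicator mentioned above, the following is a minimal sketch of how it is computed for two minimized objectives: it is the area dominated by a point set and bounded by a reference point. This is a generic textbook computation, not code from the bonobench package; the function name and the example points are assumptions.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`, both objectives minimized."""
    # Keep only points that are strictly better than the reference in both objectives.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:                 # sweep in ascending order of the first objective
        if f2 < best_f2:               # point is non-dominated among those seen so far
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv_front = hypervolume_2d(front, ref=(4.0, 4.0))
hv_with_dominated = hypervolume_2d(front + [(3.0, 3.0)], ref=(4.0, 4.0))
```

Adding the dominated point (3, 3) leaves the value unchanged, which reflects the Pareto-compliance property the abstract refers to.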
中文标题/摘要
标题:BONO-Bench:用于具有可追溯帕累托集的双目标数值优化的综合测试套件
在多目标优化研究中,启发式优化器在测试问题上的评估,即所谓的基准测试,是其基石。
然而,用于基准测试数值多目标黑盒优化器的大多数测试问题来自两种有缺陷的方法之一:一方面,问题是手工构建的,导致具有已知最优解但不现实且有偏见的问题。
另一方面,更现实和复杂的单目标问题被组合成多目标问题,但缺乏对问题属性的控制和理解。
本文提出了一种广泛的问题生成方法,用于双目标数值优化问题,该方法将理论理解良好的凸二次函数组合成单模态和多模态景观,有和没有全局结构。
该方法支持测试问题属性的配置,如决策变量的数量、局部最优解、帕累托前沿形状、目标空间中的平台或条件程度,同时保持理论可处理性:帕累托合规性能指标(如超体积或精确R2指标)的最优前沿可以任意精度近似。
为了展示生成器的能力,创建了一个包含20个问题类别的测试套件,称为BONO-Bench,并随后用作说明性基准研究的基础。
最后,我们提出的方法的基本原理及其相关测试套件在Python包bonobench中公开发布,以促进可重复的基准测试。
Summary / 总结
This paper addresses the limitations of existing benchmark problems for bi-objective numerical optimization by proposing BONO-Bench, a comprehensive test suite. The method combines convex-quadratic functions to create unimodal and multimodal landscapes, allowing for the configuration of various problem properties such as the number of decision variables and Pareto front shape. Key findings include the ability to generate problems with theoretical tractability, where the Pareto front can be approximated to arbitrary precision using performance indicators like hypervolume and R2. The BONO-Bench suite consists of 20 problem categories and is publicly available in the Python package bonobench to support reproducible benchmarking studies.
该论文提出了BONO-Bench,一个全面的双目标数值优化测试套件,解决了现有基准的局限性。它通过组合理论理解良好的凸二次函数来创建单模态和多模态景观,允许配置各种问题属性,如决策变量数量、局部最优解和帕累托前沿形状,同时保持理论可解性。该套件包含20个问题类别,并以bonobench Python包的形式公开发布,便于可重复的基准测试研究。
Q-learning with Adjoint Matching
Authors: Qiyang Li, Sergey Levine
First: 2026-01-20T18:45:34+00:00 · Latest: 2026-01-23T18:40:14+00:00
Comments: 32 pages, 8 figures, 7 tables
Abstract
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
中文标题/摘要
标题:伴随匹配的Q学习
我们提出了一种新颖的基于时差的强化学习(RL)算法——伴随匹配的Q学习(QAM),该算法解决了连续动作RL中的长期挑战:针对参数化Q函数高效优化表达性强的扩散或流匹配策略。有效的优化需要利用评论者的一阶信息,但由于直接通过其多步去噪过程进行梯度优化的反向传播在数值上不稳定,因此对流或扩散策略进行梯度优化具有挑战性。现有方法通过仅使用价值并丢弃梯度信息,或者依赖牺牲策略表达性或偏置学习策略的近似方法来绕过这一挑战。QAM通过利用生成建模中最近提出的技术——伴随匹配,绕过了这两个挑战,该技术将评论者的动作梯度转换为逐步的目标函数,从而避免了不稳定的反向传播,同时在最优状态下提供了一个无偏且表达性强的策略。结合评论者学习的时差备份,QAM在离线和离线到在线RL中的硬、稀疏奖励任务上始终优于先前的方法。
Summary / 总结
Q-learning with Adjoint Matching (QAM) is a novel TD-based reinforcement learning algorithm designed to efficiently optimize a diffusion or flow-matching policy with respect to a parameterized Q-function. By leveraging adjoint matching, QAM transforms the critic's action gradient into a step-wise objective function, avoiding numerical instability in backpropagation. The method consistently outperforms previous approaches on challenging, sparse reward tasks in both offline and offline-to-online reinforcement learning scenarios.
Q-learning与伴随匹配(QAM)是一种新型的TD基强化学习算法,旨在高效优化表达性强的扩散或流匹配策略。它通过使用伴随匹配将批评家的动作梯度转换为避免直接通过多步去噪过程的数值不稳定梯度回传的方式,来解决这一挑战。QAM在硬的、稀疏奖励任务的离线和离线到在线强化学习设置中,始终优于先前的方法。
Spatial-Agent: Agentic Geo-spatial Reasoning with Scientific Core Concepts
Authors: Riyang Bao, Cheng Yang, Dazhou Yu, Zhexiang Tang, Gengchen Mai, Liang Zhao
First: 2026-01-23T18:33:45+00:00 · Latest: 2026-01-23T18:33:45+00:00
Comments: 15 pages, 4 figures
Abstract
Geospatial reasoning is essential for real-world applications such as urban analytics, transportation planning, and disaster response. However, existing LLM-based agents often fail at genuine geospatial computation, relying instead on web search or pattern matching while hallucinating spatial relationships. We present Spatial-Agent, an AI agent grounded in foundational theories of spatial information science. Our approach formalizes geo-analytical question answering as a concept transformation problem, where natural-language questions are parsed into executable workflows represented as GeoFlow Graphs -- directed acyclic graphs with nodes corresponding to spatial concepts and edges representing transformations. Drawing on spatial information theory, Spatial-Agent extracts spatial concepts, assigns functional roles with principled ordering constraints, and composes transformation sequences through template-based generation. Extensive experiments on MapEval-API and MapQA benchmarks demonstrate that Spatial-Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.
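The idea of an executable workflow expressed as a directed acyclic graph can be sketched with the standard library's topological sorter. The node names, the one-dimensional toy geometry, and the function `run_workflow` are hypothetical; they illustrate DAG execution in general and are not the GeoFlow Graph format or the agent's actual operators.

```python
from graphlib import TopologicalSorter

def run_workflow(graph, operations, inputs):
    """Evaluate a DAG whose nodes are concepts derived from their predecessors.

    graph: maps each derived node to an ordered list of predecessor nodes.
    operations: maps each derived node to a function of its predecessors' values.
    inputs: values for source nodes that have no predecessors.
    """
    values = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in values:             # source node, already provided
            continue
        args = [values[dep] for dep in graph[node]]
        values[node] = operations[node](*args)
    return values

# Toy question: how many schools lie within 1 unit of the river? (1-D positions)
graph = {"buffer": ["river"], "count": ["schools", "buffer"]}
operations = {
    "buffer": lambda river: (river - 1.0, river + 1.0),
    "count": lambda schools, buffer: sum(buffer[0] <= s <= buffer[1] for s in schools),
}
result = run_workflow(graph, operations, {"river": 5.0, "schools": [3.5, 4.8, 7.2, 5.9]})
```

Because every intermediate value stays in `result`, each step of the workflow can be inspected, which is the kind of interpretability the abstract emphasizes.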
中文标题/摘要
标题:Spatial-Agent:基于科学核心概念的智能体式地理空间推理
空间推理对于实际应用如城市分析、交通规划和灾害响应至关重要。然而,现有的基于LLM的代理往往无法进行真实的地理空间计算,而是依赖于网络搜索或模式匹配,同时虚构空间关系。我们提出了空间代理,这是一种基于空间信息科学基础理论的AI代理。我们的方法将地理分析问题回答形式化为概念转换问题,其中自然语言问题被解析为由空间概念节点和表示转换的边组成的可执行工作流GeoFlow图。借助空间信息理论,空间代理提取空间概念,赋予功能角色并结合有原则的顺序约束,通过基于模板的生成组合转换序列。在MapEval-API和MapQA基准上的大量实验表明,空间代理显著优于包括ReAct和Reflexion在内的现有基线,同时生成可解释和可执行的空间工作流。
Summary / 总结
The research aims to improve geospatial reasoning in AI agents for applications like urban analytics and disaster response. Spatial-Agent formalizes geospatial question answering as concept transformation using GeoFlow Graphs, which are directed acyclic graphs representing spatial concepts and their transformations. Experiments show that Spatial-Agent outperforms existing methods like ReAct and Reflexion, generating interpretable and executable geospatial workflows.
研究旨在提升AI代理在城市分析和灾害响应等应用中的地理空间推理能力。Spatial-Agent将地理空间问题解答形式化为概念转换,使用GeoFlow图表示工作流。该代理在基准测试中表现出色,生成可解释且可执行的地理空间工作流,超越了如ReAct和Reflexion等现有方法。
AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
First: 2026-01-23T18:33:41+00:00 · Latest: 2026-01-23T18:33:41+00:00
Comments: 16 pages
Abstract
The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive
中文标题/摘要
标题:AgentDrive:一种基于LLM生成场景的自主系统中代理型AI推理的开放基准数据集
大型语言模型(LLMs)的快速发展激发了将其集成到自主系统中进行推理驱动的感知、规划和决策的兴趣。然而,由于缺乏大规模、结构化和安全关键的基准测试,评估和训练此类代理型AI模型仍然具有挑战性。本文介绍了AgentDrive,一个包含300,000个LLM生成的驾驶场景的开放基准数据集,旨在在各种条件下训练、微调和评估自主代理。AgentDrive 在七个正交轴上形式化了一个因子化的场景空间:场景类型、驾驶员行为、环境、道路布局、目标、难度和交通密度。基于LLM的提示到JSON流水线生成了语义丰富、可直接用于仿真的规范,并依据物理和模式约束进行验证。每个场景都经过仿真推演、替代安全指标计算和基于规则的结果标注。为了补充基于仿真的评估,我们引入了AgentDrive-MCQ,一个涵盖五个推理维度的100,000道选择题基准:物理、策略、混合、场景和比较推理。我们在AgentDrive-MCQ上对50个领先的LLM进行了大规模评估。结果显示,尽管专有前沿模型在上下文和策略推理方面表现最佳,但先进的开源模型在结构化和物理基础推理方面正在迅速缩小差距。我们发布了AgentDrive数据集、AgentDrive-MCQ基准、评估代码及相关材料,网址为https://github.com/maferrag/AgentDrive
Summary / 总结
AgentDrive is an open benchmark dataset containing 300,000 LLM-generated driving scenarios for training and evaluating agentic AI models in autonomous systems. It formalizes a factorized scenario space across seven axes and uses an LLM-driven pipeline to generate semantically rich, simulation-ready specifications. The dataset is complemented by AgentDrive-MCQ, a 100,000-question multiple-choice benchmark covering five reasoning dimensions. Evaluation results show that proprietary models excel in contextual and policy reasoning, while advanced open models are improving in structured and physics-grounded reasoning. The dataset, benchmark, and evaluation code are publicly available.
AgentDrive 是一个包含 300,000 个由 LLM 生成的驾驶场景的开放基准数据集,用于训练和评估自主系统中的 agentic AI 模型。它在七个轴上形式化了一个因子化的场景空间,并使用 LLM 驱动的管道生成语义丰富、可模拟的规范。该数据集由 AgentDrive-MCQ 补充,这是一个包含 100,000 个问题的选择题基准,涵盖五个推理维度。评估结果显示,专有模型在上下文和策略推理方面表现出色,而先进的开源模型在结构化和物理基础推理方面正在迅速提高。数据集、基准和评估代码已公开发布。
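A factorized scenario space over orthogonal axes can be read as a Cartesian product of per-axis value sets. The sketch below illustrates that reading under stated assumptions: the seven axis names are taken from the abstract, but every axis value, and the `is_valid` check, is hypothetical and far simpler than the physical and schema validation the authors describe.

```python
from itertools import product
import json

# Axis names follow the abstract; the values are hypothetical placeholders.
AXES = {
    "scenario_type": ["lane_change", "intersection"],
    "driver_behavior": ["cautious", "aggressive"],
    "environment": ["clear", "rain"],
    "road_layout": ["highway", "urban"],
    "objective": ["reach_goal"],
    "difficulty": ["easy", "hard"],
    "traffic_density": ["low", "high"],
}

def enumerate_scenarios(axes):
    """Yield one dict per point of the factorized space (Cartesian product)."""
    names = list(axes)
    for combo in product(*(axes[name] for name in names)):
        yield dict(zip(names, combo))

def is_valid(spec, axes):
    """Toy schema check: exactly the expected keys, each with an allowed value."""
    return set(spec) == set(axes) and all(spec[k] in axes[k] for k in axes)

scenarios = list(enumerate_scenarios(AXES))
print(len(scenarios))
print(json.dumps(scenarios[0], indent=2))
```

Because the axes are independent, the number of scenario specifications is the product of the axis sizes, which is why a handful of values per axis already yields a large space.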
3D Molecule Generation from Rigid Motifs via SE(3) Flows
Authors: Roman Poletukhin, Marcel Kollovieh, Eike Eberhard, Stephan Günnemann
First: 2026-01-23T18:24:57+00:00 · Latest: 2026-01-23T18:24:57+00:00
Abstract
Three-dimensional molecular structure generation is typically performed at the level of individual atoms, yet molecular graph generation techniques often consider fragments as their structural units. Building on the advances in frame-based protein structure generation, we extend these fragmentation ideas to 3D, treating general molecules as sets of rigid-body motifs. Utilising this representation, we employ SE(3)-equivariant generative modelling for de novo 3D molecule generation from rigid motifs. In our evaluations, we observe comparable or superior results to state-of-the-art across benchmarks, surpassing it in atom stability on GEOM-Drugs, while yielding a 2x to 10x reduction in generation steps and offering 3.5x compression in molecular representations compared to the standard atom-based methods.
中文标题/摘要
标题:通过SE(3)流从刚性基元生成3D分子
三维分子结构生成通常在原子层面进行,而分子图生成技术往往以片段作为结构单元。借鉴基于标架的蛋白质结构生成的进展,我们将这些片段化理念扩展到三维,将一般分子视为刚性体基元的集合。利用这种表示,我们采用SE(3)-等变生成建模从刚性基元进行从头三维分子生成。在我们的评估中,我们在各基准测试中观察到与最先进的技术相当或更优的结果,在GEOM-Drugs上的原子稳定性方面超过了最先进的技术,同时生成步骤减少2到10倍,并且与标准的基于原子的方法相比,分子表示压缩3.5倍。
Summary / 总结
The research aims to improve 3D molecular structure generation by treating molecules as sets of rigid-body motifs, rather than individual atoms. The method uses SE(3)-equivariant generative modeling to generate 3D molecules from these motifs. The study shows that this approach achieves comparable or better results than state-of-the-art methods across various benchmarks, particularly excelling in atom stability on the GEOM-Drugs dataset. Additionally, it reduces the number of generation steps by 2 to 10 times and compresses molecular representations by 3.5 times compared to traditional atom-based methods.
研究旨在通过将分子视为刚性体片段集合来改进3D分子结构生成,并利用SE(3)-等变生成模型。该方法在基准测试中实现了与最先进的技术相当或更优的结果,特别是在GEOM-Drugs数据集上的原子稳定性方面表现更佳。此外,与传统的基于原子的方法相比,它将生成步骤减少了2到10倍,并将分子表示压缩了3.5倍。
Domain-invariant Mixed-domain Semi-supervised Medical Image Segmentation with Clustered Maximum Mean Discrepancy Alignment
Authors: Ba-Thinh Lam, Thanh-Huy Nguyen, Hoang-Thien Nguyen, Quang-Khai Bui-Tran, Nguyen Lan Vi Vu, Phat K. Huynh, Ulas Bagci, Min Xu
Venue: ICASSP 2026
First: 2026-01-23T18:23:03+00:00 · Latest: 2026-01-23T18:23:03+00:00
Comments: accepted in ICASSP 2026
Abstract
Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large-scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed-domain settings with unknown domain labels and severe domain gaps. Existing semi-supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real-world deployment. In this paper, we propose a domain-invariant mixed-domain semi-supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy-Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain-invariant representations. Integrated within a teacher-student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and M&Ms benchmarks demonstrate that our approach consistently surpasses semi-supervised and domain adaptation methods, establishing a potential solution for mixed-domain semi-supervised medical image segmentation.
中文标题/摘要
标题:域不变混合域半监督医学图像分割与聚类最大均值偏差对齐
深度学习在医学图像语义分割方面取得了显著进展,但其成功高度依赖于大规模专家注释和一致的数据分布。实践中,注释稀缺,图像来自多个扫描器或中心,导致存在未知域标签和严重域差距的混合域设置。现有半监督或域适应方法通常假设单一的域转移或可访问显式域索引,这在实际部署中很少成立。本文提出了一种域不变混合域半监督分割框架,该框架联合增强数据多样性并减轻域偏差。复制粘贴机制(CPM)通过在域间转移信息区域来扩充训练集,而聚类最大均值偏差(CMMD)块通过MMD目标聚类未标记特征并将其与标记锚点对齐,促进域不变表示。该方法集成在教师-学生框架中,即使有少量标记示例和多个未知域差距,也能实现稳健和精确的分割。在视网膜和M&Ms基准测试上进行的实验表明,我们的方法在半监督和域适应方法中表现始终更优,为混合域半监督医学图像分割提供了一种潜在解决方案。
Summary / 总结
This paper addresses the challenge of medical image segmentation in mixed-domain settings with limited labeled data and unknown domain labels. It proposes a domain-invariant mixed-domain semi-supervised segmentation framework that includes a Copy-Paste Mechanism to augment the training set and a Cluster Maximum Mean Discrepancy block to align unlabeled features with labeled anchors, promoting domain-invariant representations. Experiments on Fundus and M&Ms benchmarks show that the proposed method outperforms existing semi-supervised and domain adaptation approaches, even with very few labeled examples and multiple domain discrepancies.
本文针对混合域设置下的医学图像分割问题,该设置中存在有限的标注数据和未知的域标签。提出了一种域不变的混合域半监督分割框架,该框架包括一种Copy-Paste机制来扩充训练集,以及一种Cluster Maximum Mean Discrepancy块来通过MMD目标对齐未标注特征与标注锚点,减少域偏差。在Fundus和M&Ms基准上的实验表明,所提出的方法在少量标注样本的情况下,比现有半监督和域适应方法表现更优,实现了稳健且精确的分割。
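The alignment objective named in the abstract is Maximum Mean Discrepancy (MMD). As a minimal sketch, the snippet below computes the standard biased estimate of squared MMD with an RBF kernel between two feature sets. The kernel choice, the `gamma` value, and the random features are assumptions for illustration; the clustering step and the exact CMMD loss of the paper are not reproduced.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) for row-vector sets."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
labeled = rng.normal(size=(64, 4))   # stand-in for labeled anchor features
shifted = labeled + 3.0              # stand-in for features from a shifted domain
print(mmd2(labeled, labeled), mmd2(labeled, shifted))
```

Minimizing a quantity of this form with respect to the feature extractor pulls the two feature distributions together, which is the sense in which an MMD objective encourages domain-invariant representations.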
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors: Junyu Chen, Shuwen Wei, Joel Honkamaa, Pekka Marttinen, Hang Zhang, Min Liu, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao, Lukas Förner, Thomas Wendler, Bailiang Jian, Benedikt Wiestler, Tim Hable, Jin Kim, Dan Ruan, Frederic Madesta, Thilo Sentker, Wiebke Heyer, Lianrui Zuo, Yuwei Dai, Jing Wu, Jerry L. Prince, Harrison Bai, Yong Du, Yihao Liu, Alessa Hering, Reuben Dorent, Lasse Hansen, Mattias P. Heinrich, Aaron Carass
First: 2025-05-30T03:07:58+00:00 · Latest: 2026-01-23T18:17:29+00:00
Abstract
Medical image challenges have played a transformative role in advancing the field, catalyzing innovation and establishing new performance benchmarks. Image registration, a foundational task in neuroimaging, has similarly advanced through the Learn2Reg initiative. Building on this, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark for unsupervised brain MRI registration. Whereas previous challenges relied upon anatomical label maps, LUMIR provides 4,014 unlabeled T1-weighted MRIs for training, encouraging biologically plausible deformation modeling through self-supervision. Evaluation includes 590 in-domain test subjects and extensive zero-shot tasks across disease populations, imaging protocols, and species. Deep learning methods consistently achieved state-of-the-art performance and produced anatomically plausible, diffeomorphic deformation fields. They outperformed several leading optimization-based methods and remained robust to most domain shifts. These findings highlight the growing maturity of deep learning in neuroimaging registration and its potential to serve as a foundation model for general-purpose medical image registration.
中文标题/摘要
标题:超越LUMIR挑战:通往基础配准模型的道路
医学图像挑战在推动领域发展方面发挥了变革性作用,激发创新并建立了新的性能基准。图像配准作为神经影像学中的基础任务,同样通过Learn2Reg倡议得到了推进。在此基础上,我们引入了大规模无监督脑MRI图像配准(LUMIR)挑战,这是一个下一代无监督脑MRI配准基准。以往的挑战依赖于解剖标签图,而LUMIR提供了4,014个未标记的T1加权MRI用于训练,通过自监督鼓励符合生物学规律的变形建模。评估包括590个领域内测试对象以及跨疾病人群、成像协议和物种的广泛零样本任务。深度学习方法始终实现了最先进的性能,并生成了解剖上合理的微分同胚变形场。它们优于几种领先的基于优化的方法,并对大多数领域偏移保持鲁棒。这些发现突显了深度学习在神经影像配准中的日益成熟及其作为通用医学图像配准基础模型的潜力。
Summary / 总结
The paper introduces the LUMIR challenge, a next-generation benchmark for unsupervised brain MRI registration, which uses 4,014 unlabeled T1-weighted MRIs for training. Deep learning methods outperformed optimization-based methods and produced anatomically plausible deformation fields, demonstrating the maturity of deep learning in neuroimaging registration and its potential as a foundation model for general medical image registration. Evaluation included 590 in-domain test subjects and zero-shot tasks across various disease populations, imaging protocols, and species, showing robust performance across different domains.
论文介绍了LUMIR挑战,这是一个使用4,014张未标记的T1加权MRI的无监督脑MRI配准基准。它在590个领域内测试对象和多样化的零样本测试任务上评估了深度学习方法与基于优化的方法的性能。深度学习模型表现出更优的性能,并生成了解剖上合理的变形,超越了传统方法,并在各种成像条件和物种中表现出鲁棒性。
Efficient semantic uncertainty quantification in language models via diversity-steered sampling
Authors: Ji Won Park, Kyunghyun Cho
Venue: NeurIPS 2025
First: 2025-10-24T10:06:21+00:00 · Latest: 2026-01-23T18:02:21+00:00
Comments: 10 pages (+7 appendix), 7 figures. Accepted at NeurIPS 2025
Abstract
Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
中文标题/摘要
标题:通过多样性引导采样在语言模型中高效量化语义不确定性
在自由形式问答(QA)中,准确估计大型语言模型(LLMs)的语义偶然(aleatoric)不确定性和认知(epistemic)不确定性尤其具有挑战性,因为获得稳定估计往往需要多次昂贵的生成。我们引入了一种多样性引导的采样器,在解码过程中抑制语义冗余的输出,适用于自回归和掩码扩散两种范式,并能显著提高样本效率。核心思想是利用一个在部分前缀或中间扩散状态上轻度微调的自然语言推理(NLI)模型,将连续的语义相似性惩罚注入模型的提议分布中。我们通过重要性重加权消除下游不确定性估计的偏差,并通过控制变量缩小其方差。在四个问答基准测试中,我们的方法与基线相当或超越基线,同时在相同样本数量下覆盖了更多的语义簇。该框架模块化且无需访问基础LLM的梯度,有望作为风险敏感模型部署中不确定性估计的即插即用增强。
Summary / 总结
This paper addresses the challenge of quantifying semantic uncertainties in large language models (LLMs) during free-form question answering. It proposes a diversity-steered sampler that reduces semantically redundant outputs, improving sample efficiency. The method uses a lightly fine-tuned natural language inference model to inject a semantic-similarity penalty into the model's proposal distribution. Experiments on four QA benchmarks show that the approach matches or outperforms baselines while covering more semantic clusters with the same number of samples.
研究旨在提高大型语言模型(LLM)在自由形式问答中对语义不确定性估计的能力,因为获得稳定估计的成本很高。方法引入了一种多样性引导采样器,通过在模型的提议分布中注入语义相似性惩罚,涵盖了自回归和掩码扩散两种范式。这种方法在四个问答基准测试中表现出显著的样本效率提升,能够与基线方法匹配或超越基线方法,同时用相同数量的样本覆盖更多的语义簇。
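Sampling from a penalized proposal instead of the model's own distribution biases any downstream estimate unless the samples are reweighted. The sketch below shows generic self-normalized importance reweighting on a toy three-outcome distribution. The exponential tilt, the `penalty` vector, and the function names are assumptions for illustration; the paper's NLI-based penalty and its control variates are not reproduced.

```python
import numpy as np

def tilted_proposal(p, penalty, strength=1.0):
    """Down-weight penalized outcomes: q(x) proportional to p(x) * exp(-strength * penalty(x))."""
    q = p * np.exp(-strength * penalty)
    return q / q.sum()

def reweighted_mean(samples, f, p, q):
    """Self-normalized importance estimate of E_p[f] from samples drawn from q."""
    w = p[samples] / q[samples]          # importance weights p(x) / q(x)
    return float(np.sum(w * f[samples]) / np.sum(w))

p = np.array([0.7, 0.2, 0.1])            # toy base distribution over 3 "answers"
penalty = np.array([2.0, 0.0, 0.0])      # pretend outcome 0 is semantically redundant
f = np.array([1.0, 2.0, 3.0])            # any statistic of interest

q = tilted_proposal(p, penalty)
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=50_000, p=q)
print(reweighted_mean(samples, f, p, q), float(np.dot(p, f)))
```

The proposal visits the rarer outcomes more often, while the weights `p/q` undo the tilt so that the estimate still targets the expectation under the base distribution.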
Reward-Forcing: Autoregressive Video Generation with Reward Feedback
Authors: Jingran Zhang, Ning Li, Yuanhao Ban, Andrew Bai, Justin Cui
First: 2026-01-23T17:47:56+00:00 · Latest: 2026-01-23T17:47:56+00:00
Comments: preprint
Abstract
While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically lags behind their bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. By using reward signals to guide the model, our method simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.
中文标题/摘要
标题:奖励强化:基于奖励反馈的自回归视频生成
尽管大多数视频生成的先前工作依赖于双向架构,但最近的努力试图将这些模型改编为自回归变体以支持接近实时的生成。然而,这些改编通常依赖于教师模型,这可能会限制性能,特别是在缺乏强大的自回归教师时,导致输出质量通常落后于其双向对应物。在本文中,我们探索了一种替代方法,该方法使用奖励信号来引导生成过程,从而实现更高效和可扩展的自回归生成。通过使用奖励信号来引导模型,我们的方法简化了训练过程,同时保持了高视觉保真度和时间一致性。通过在标准基准上的广泛实验,我们发现我们的方法在性能上与现有的自回归模型相当,并且在某些情况下,通过避免教师架构施加的限制,超过了同样大小的双向模型。例如,在VBench上,我们的方法获得了84.92的总分,接近得分84.31的最先进的自回归方法,但后者需要大量的异构蒸馏。
Summary / 总结
This paper introduces Reward-Forcing, an approach to autoregressive video generation that guides the model with reward signals rather than relying heavily on teacher models. The method simplifies training while maintaining high visual fidelity and temporal consistency. Experiments show that Reward-Forcing performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models. On VBench it achieves a total score of 84.92, compared with 84.31 for state-of-the-art autoregressive methods that require significant heterogeneous distillation.
本文提出了一种名为Reward-Forcing的方法,通过使用奖励信号来引导视频生成过程,旨在提高性能和可扩展性,相比双向模型。实验表明,该方法在VBench基准测试中取得了84.92的总分,接近最先进的自回归方法的84.31分,但后者需要更复杂的训练过程。
Pretraining Frame Preservation in Autoregressive Video Memory Compression
Authors: Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
First: 2025-12-29T20:29:21+00:00 · Latest: 2026-01-23T17:47:41+00:00
Comments: Additional Results: https://lllyasviel.github.io/pfp_gitpage/
Abstract
We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
中文标题/摘要
标题:自回归视频记忆压缩中的预训练帧保留
我们提出了一种名为PFP的神经网络结构,用于将长视频压缩为短上下文,并具有明确的预训练目标,以在任意时间位置保留单帧的高频细节。基线模型可以将20秒的视频压缩到约5k长度的上下文中,其中可以随机检索具有感知保真度外观的帧。此类预训练模型可以直接微调为自回归视频模型的记忆编码器,从而实现具有较低上下文成本和相对较低保真度损失的长历史记忆。我们通过消融设置评估了该框架,并讨论了可能的神经架构设计的权衡。
HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments
Authors: Shuijing Liu, Haochen Xia, Fatemeh Cheraghi Pouria, Kaiwen Hong, Neeloy Chakraborty, Zichao Hu, Joydeep Biswas, Katherine Driggs-Campbell
First: 2024-11-19T00:56:35+00:00 · Latest: 2026-01-23T17:24:19+00:00
Comments: Accepted to IEEE Transactions on Automation Science and Engineering (T-ASE)
Abstract
We study the problem of robot navigation in dense and interactive crowds with static constraints such as corridors and furniture. Previous methods fail to consider all types of spatial and temporal interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different inputs and propose a heterogeneous spatio-temporal graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous spatio-temporal graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success, navigation time, and generalization to domain shifts in challenging navigation scenarios. More information is available at https://sites.google.com/view/crowdnav-height/home.
中文标题/摘要
标题:HEIGHT:拥挤和受限环境中异质交互图变换器的机器人导航
我们研究了在静态约束(如走廊和家具)存在的密集和互动人群中机器人的导航问题。先前的方法未能考虑所有类型的时空交互,导致机器人路径不安全且效率低下。在本文中,我们利用拥挤和受限场景的图表示,并提出了一种结构化框架,利用深度强化学习学习机器人的导航策略。我们首先将不同输入的表示拆分,并提出了一种异质时空图来建模人类、机器人和障碍物之间的不同交互。基于异质时空图,我们提出了一种新颖的导航策略网络架构HEIGHT,通过空间和时间捕捉异质交互。HEIGHT利用注意力机制优先处理重要交互,并利用循环网络跟踪动态场景随时间的变化,促使机器人适应性地避免碰撞。通过广泛的仿真和实地实验,我们证明了HEIGHT在具有挑战性的导航场景中在成功率、导航时间和对领域转移的泛化能力方面优于最先进的基线方法。更多信息请参见https://sites.google.com/view/crowdnav-height/home。
Summary / 总结
The research addresses the challenge of robot navigation in crowded and constrained environments by proposing a graph-based approach called HEIGHT. It models heterogeneous interactions among humans, robots, and obstacles using a spatio-temporal graph and a novel navigation policy network. Experimental results show that HEIGHT outperforms existing methods in terms of navigation success, efficiency, and adaptability to different scenarios.
研究聚焦于机器人在拥挤和受限环境中的导航问题,以往方法往往因为未能充分考虑空间和时间上的交互而表现不佳。作者提出了一种基于图的框架,利用深度强化学习来建模人类、机器人和障碍物之间的异质交互。实验结果表明,该方法在导航成功率、效率以及对新环境的适应性方面优于现有方法。
LoL: Longer than Longer, Scaling Video Generation to Hour
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
First: 2026-01-23T17:21:35+00:00 · Latest: 2026-01-23T17:21:35+00:00
Comments: preprint
Abstract
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
中文标题/摘要
标题:LoL:比更长还长,将视频生成扩展到小时级
近期关于长视频生成的研究已从双向模型转向自回归模型,但这些方法通常会遭受错误累积和长期连贯性丧失的问题。虽然已经引入了注意力陷阱帧来缓解这种性能衰减,但它们往往会导致我们称之为陷阱崩溃的关键失败模式:生成的内容反复回到陷阱帧,导致场景突然重置和循环运动模式。我们的分析表明,陷阱崩溃源于旋转位置嵌入(RoPE)的周期结构与当前生成模型中普遍存在的多头注意力机制之间的固有冲突。为了解决这一问题,我们提出了一种轻量级、无需训练的方法,通过引入多头RoPE抖动来有效抑制这种行为,打破跨头注意力的同质化并缓解长期崩溃。大量实验表明,我们的方法成功缓解了陷阱崩溃现象,同时保持了生成质量。据我们所知,这项工作实现了实时、流式和无限长度视频生成的首次演示,几乎没有质量衰减。作为这一鲁棒性的示例,我们生成了长达12小时的连续视频,据我们所知,这是已公开演示的最长流式视频生成结果之一。
Summary / 总结
This research addresses the issue of error accumulation and loss of long-term coherence in long-form video generation, particularly in autoregressive models. The authors propose a lightweight, training-free method called multi-head RoPE jitter to mitigate a critical failure mode known as sink-collapse. Experimental results demonstrate that their approach successfully alleviates sink-collapse while maintaining video quality, enabling real-time, streaming, and infinite-length video generation with minimal quality decay, up to 12 hours in length.
该研究旨在通过解决错误累积和长期一致性丧失的问题,改进长视频生成。作者提出了一种轻量级、无需训练的方法,通过引入多头RoPE抖动来缓解由RoPE与多头注意力机制之间的冲突引起的sink-collapse问题。实验结果表明,该方法有效缓解了sink-collapse现象,同时保持了生成质量,实现了实时、流式和无限长度的视频生成,质量衰减极小。作为示例,作者生成了长达12小时的连续视频,据其所知,这是流式视频生成中已公开演示的最长结果之一。
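The abstract attributes sink-collapse to the periodic structure of RoPE being shared across attention heads, and proposes per-head jitter to break that homogeneity. The sketch below is one possible reading, not the authors' implementation: each head rotates with the standard RoPE frequencies multiplied by its own random scale. The jitter range, the uniform sampling, and all function names are assumptions.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, head_scale=1.0):
    """Standard RoPE angles, with every frequency multiplied by head_scale."""
    inv_freq = head_scale / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)            # shape (seq, dim // 2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def jittered_rope(x, jitter=0.05, seed=0):
    """x: (heads, seq, dim). Hypothetical jitter: head h uses scale 1 + U(-jitter, jitter)."""
    heads, seq, dim = x.shape
    rng = np.random.default_rng(seed)
    scales = 1.0 + rng.uniform(-jitter, jitter, size=heads)
    pos = np.arange(seq)
    out = np.stack([apply_rope(x[h], rope_angles(pos, dim, head_scale=scales[h]))
                    for h in range(heads)])
    return out, scales

x = np.ones((4, 8, 16))
rotated, scales = jittered_rope(x)
print(scales)
```

Because each head now has slightly different rotation periods, the heads no longer return to the same relative phase at the same positions, while the rotation still preserves vector norms and leaves position 0 untouched.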
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Authors: Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev
Venue: AAAI 2026
First: 2025-07-08T13:14:10+00:00 · Latest: 2026-01-23T17:14:49+00:00
Comments: AAAI 2026
Abstract
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. We show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments on SD-XL and FLUX-1.dev show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques, achieving a superior balance between concept fidelity and text alignment. Project page is available at https://controlgenai.github.io/T-LoRA/.
中文标题/摘要
标题:T-LoRA:无需过拟合的单张图像扩散模型定制
尽管扩散模型微调为预训练模型生成特定对象提供了一种强大的方法,但在训练样本有限的情况下,它经常遭受过拟合的困扰,这会损害模型的泛化能力和输出多样性。本文解决了使用单张概念图像适应扩散模型这一具有挑战性但最具影响力的任务,因为单张图像定制具有最大的实际潜力。我们提出了T-LoRA,一种针对扩散模型个性化设计的时间步长依赖低秩适应框架。我们展示了更高的扩散时间步更容易过拟合,因此需要一种时间步长敏感的微调策略。T-LoRA 包含两个关键创新:(1)一种动态微调策略,根据扩散时间步调整基于秩约束的更新,(2)一种权重参数化技术,通过正交初始化确保适配器组件之间的独立性。在 SD-XL 和 FLUX-1.dev 上的广泛实验表明,T-LoRA 及其各个组件均优于标准 LoRA 和其他扩散模型个性化技术,实现了概念保真度和文本对齐之间的更优平衡。项目页面可在 https://controlgenai.github.io/T-LoRA/ 获取。
Summary / 总结
This paper addresses the challenge of customizing diffusion models using a single image without overfitting. It introduces T-LoRA, a timestep-dependent low-rank adaptation framework. The method includes a dynamic fine-tuning strategy and orthogonal initialization to ensure independence between adapter components. Experiments on SD-XL and FLUX-1.dev demonstrate that T-LoRA outperforms standard LoRA and other techniques in balancing concept fidelity and text alignment while avoiding overfitting.
本文解决了仅使用一张图像定制扩散模型的挑战,这对实际应用至关重要但容易过拟合。T-LoRA,一种时间步长依赖的低秩适应框架,通过根据扩散时间步调整秩约束更新并确保适配器组件之间的独立性来缓解这一问题。在SD-XL和FLUX-1.dev上的实验表明,T-LoRA在概念保真度和文本对齐之间实现了更好的平衡,且优于标准LoRA和其他技术,没有过拟合。
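The abstract states that higher diffusion timesteps are more prone to overfitting, so rank-constrained updates are adjusted by timestep. One simple way to realize that idea is a rank mask that shrinks linearly as the timestep grows. The linear schedule, the `min_rank` parameter, and the function names below are assumptions made for illustration; the orthogonal initialization described in the abstract is not shown.

```python
import numpy as np

def active_rank(t, num_timesteps, max_rank, min_rank=1):
    """Hypothetical linear schedule: full rank at t = 0, min_rank at t = num_timesteps."""
    frac = (num_timesteps - t) / num_timesteps
    return int(np.floor(frac * (max_rank - min_rank))) + min_rank

def masked_lora_delta(A, B, t, num_timesteps, min_rank=1):
    """Low-rank update B @ diag(mask) @ A using only the first r(t) components.

    A has shape (r, in_features) and B has shape (out_features, r).
    """
    r = A.shape[0]
    mask = (np.arange(r) < active_rank(t, num_timesteps, r, min_rank)).astype(A.dtype)
    return (B * mask) @ A                 # broadcasting zeroes the masked columns of B

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 32))
B = rng.normal(size=(24, 16))
for t in (0, 500, 1000):
    delta = masked_lora_delta(A, B, t, num_timesteps=1000)
    print(t, active_rank(t, 1000, 16), np.linalg.matrix_rank(delta))
```

Under this schedule the adapter has the least capacity at the noisiest timesteps, which matches the stated motivation of limiting updates where overfitting is most likely.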
The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning
Authors: Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman
First: 2026-01-23T17:13:54+00:00 · Latest: 2026-01-23T17:13:54+00:00
Abstract
The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
中文标题/摘要
标题:两阶段轨迹对齐系数:从奖励调优到奖励学习
强化学习(RL)的成功从根本上依赖于一个能够准确反映任务目标的奖励函数。然而,设计奖励函数通常耗时且容易出错。为解决这一问题,我们的首要目标是理解如何支持RL从业者为奖励函数指定合适的权重。我们利用轨迹对齐系数(TAC),这是一种评估奖励函数诱导的偏好与领域专家偏好匹配程度的度量。为了评估TAC在实际应用中的有效性,我们在RL从业者对Lunar Lander调优奖励权重的人体实验中进行了测试。我们发现,在奖励调优过程中提供TAC可以促使参与者生成更高效的奖励函数,并报告较低的认知负荷,相较于没有TAC的标准调优。然而,该研究也表明,即使有TAC辅助,手动设计奖励仍然劳动密集。这一局限性促使我们的第二个目标:学习一个直接最大化TAC的奖励模型。具体而言,我们提出了软TAC(Soft-TAC),这是一种可微近似TAC,可以用作损失函数来从人类偏好数据中训练奖励模型。在赛车模拟器Gran Turismo 7中验证,使用Soft-TAC训练的奖励模型成功捕捉了偏好特定的目标,导致了与使用标准交叉熵损失训练的模型相比具有更多质性差异的行为策略。这项工作证明了TAC可以作为指导奖励调优的实用工具和复杂领域中的奖励学习目标。
Summary / 总结
The paper aims to improve the process of reward function design in reinforcement learning by leveraging the Trajectory Alignment Coefficient (TAC). In a human-subject study, RL practitioners tuned reward weights for Lunar Lander with TAC, leading to more performant reward functions and reduced cognitive workload. However, manual reward design remains labor-intensive. To address this, the authors propose Soft-TAC, a differentiable approximation of TAC, which was used to train reward models in Gran Turismo 7, resulting in policies with more distinct behaviors compared to models trained with standard Cross-Entropy loss.
本文通过引入轨迹对齐系数(TAC),评估奖励函数与领域专家偏好之间的匹配程度,来解决强化学习中有效设计奖励函数的挑战。在Lunar Lander的人类实验中,使用TAC进行奖励调优提高了奖励函数的性能并减少了认知负担。实验还表明,手动设计奖励仍然是劳动密集型的。为解决这一问题,作者提出了Soft-TAC,这是一种TAC的可微近似,可以用作从人类偏好数据中训练奖励模型的损失函数。Gran Turismo 7中的实验表明,使用Soft-TAC训练的奖励模型比使用标准交叉熵损失训练的模型更能捕捉偏好特定的目标,从而产生更具差异性的策略行为。
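The abstract describes TAC as measuring how closely the preferences induced by a reward function match an expert's, and Soft-TAC as a differentiable approximation of it. The sketch below is a simplified stand-in and not the paper's definition of either quantity: it scores agreement over trajectory pairs in the range [-1, 1], and smooths the sign function with `tanh` so the score varies continuously with the returns. All names and the `temperature` parameter are assumptions.

```python
import numpy as np

def pairwise_agreement(returns_a, returns_b, human_pref):
    """Mean agreement between reward-induced and expert preferences.

    human_pref[i] is +1 if the expert prefers trajectory A_i and -1 if B_i.
    Returns +1 for full agreement and -1 for full disagreement.
    """
    induced = np.sign(np.asarray(returns_a, float) - np.asarray(returns_b, float))
    return float(np.mean(induced * np.asarray(human_pref, float)))

def soft_agreement(returns_a, returns_b, human_pref, temperature=1.0):
    """Smooth surrogate: tanh replaces sign, so gradients with respect to returns exist."""
    diff = (np.asarray(returns_a, float) - np.asarray(returns_b, float)) / temperature
    return float(np.mean(np.tanh(diff) * np.asarray(human_pref, float)))

returns_a = [3.0, 1.0, 2.0]   # returns of trajectory A in each pair under a candidate reward
returns_b = [1.0, 2.0, 0.0]   # returns of trajectory B in each pair
expert = [1, -1, 1]           # expert prefers A, B, A

print(pairwise_agreement(returns_a, returns_b, expert))
print(soft_agreement(returns_a, returns_b, expert))
```

A smooth score of this kind could in principle be maximized by gradient ascent on the parameters of a reward model, which is the role the abstract assigns to Soft-TAC as a training loss.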
Evaluating Large Vision-language Models for Surgical Tool Detection
Authors: Nakul Poudel, Richard Simon, Cristian A. Linte
First: 2026-01-23T17:00:46+00:00 · Latest: 2026-01-23T17:00:46+00:00
Abstract
Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.
中文标题/摘要
标题:评估大型视觉-语言模型在手术工具检测中的效果
手术是一个高度复杂的过程,人工智能已经成为了支持手术指导和决策的变革性力量。然而,大多数当前AI系统的单模态性质限制了它们实现对手术工作流程的全面理解的能力。这突显了需要能够全面建模手术场景中相关组件的一般用途手术AI系统的需求。最近在多模态数据处理方面取得的大型视觉-语言模型的进步为建模手术任务和提供类人的场景推理和理解提供了强大的潜力。尽管它们具有潜力,但在手术应用中的系统性研究仍然有限。在本研究中,我们评估了大型视觉-语言模型在基本的手术视觉任务——检测手术工具中的效果。具体而言,我们在GraSP机器人手术数据集上研究了三种最先进的视觉-语言模型Qwen2.5、LLaVA1.5和InternVL3.5,在零样本和参数高效LoRA微调设置下的表现。我们的结果表明,在评估的视觉-语言模型中,Qwen2.5在两种配置下都实现了更优的检测性能。此外,与开放集检测基准Grounding DINO相比,Qwen2.5在零样本泛化方面表现更强,并且微调性能相当。值得注意的是,Qwen2.5在器械识别方面表现出色,而Grounding DINO在定位方面表现更强。
Summary / 总结
This study evaluates large vision-language models (VLMs) for surgical tool detection, comparing Qwen2.5, LLaVA1.5, and InternVL3.5 on the GraSP robotic surgery dataset under zero-shot and parameter-efficient LoRA fine-tuning settings. Qwen2.5 outperforms the other VLMs in both configurations. Compared with the open-set detection baseline Grounding DINO, Qwen2.5 shows stronger zero-shot generalization and comparable fine-tuned performance; Qwen2.5 is better at instrument recognition, while Grounding DINO localizes more accurately.
研究评估了大型视觉-语言模型(VLMs)在检测手术工具方面的有效性,重点关注Qwen2.5、LLaVA1.5和InternVL3.5。研究使用了GraSP机器人手术数据集上的零样本和参数高效LoRA微调设置。Qwen2.5在两种配置下均表现出色,零样本泛化能力更强,微调性能与开放集检测基线Grounding DINO相当,特别是在器械识别方面表现出色。
LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems
Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
First: 2026-01-23T16:57:16+00:00 · Latest: 2026-01-23T16:57:16+00:00
Abstract
Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
中文标题/摘要
标题:基于LLM的论辩式对抗攻击对事实核查系统的攻击
自动化事实核查(AFC)系统容易受到对抗攻击的影响,使虚假声明得以逃避检测。现有的对抗框架通常依赖于注入噪声或改变语义,但没有一个框架利用论辩技术的对抗潜力,这些技术在信息操纵活动中广泛用于操控受众。在本文中,我们通过使用生成型LLM来重新表述声明,引入了一类新的论辩式对抗攻击,以对AFC进行攻击。我们研究了15种技术,将其分为6个类别,使用解耦评估策略研究论辩对声明验证和证据检索的影响。在FEVER和FEVEROUS基准上的实验表明,论辩攻击可以显著降低验证性能和证据检索。我们的分析表明,论辩技术是一种强大的对抗攻击类别,突显了需要更鲁棒的AFC系统。
Summary / 总结
This paper addresses the vulnerability of automated fact-checking systems to adversarial persuasion attacks. It introduces a novel approach using a generative language model to rephrase claims with persuasion techniques, which are commonly used in disinformation campaigns. Experiments on FEVER and FEVEROUS benchmarks demonstrate that these persuasion attacks significantly reduce the performance of claim verification and evidence retrieval, indicating the need for more robust fact-checking systems.
本文探讨了自动化事实核查系统对劝说式对抗攻击的脆弱性。它提出了一种新的方法,使用生成型语言模型以劝说技巧重新表述声明,这些技巧在信息操纵活动中广泛使用。在FEVER和FEVEROUS基准上的实验表明,这些劝说攻击显著降低了声明验证和证据检索的性能,强调了需要更鲁棒的事实核查系统。
Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems
Authors: Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski
Venue: NeurIPS 2025
First: 2025-09-03T14:18:05+00:00 · Latest: 2026-01-23T16:51:17+00:00
Comments: 9 pages, 7 figures including appendices. Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2025 (https://ml4physicalsciences.github.io/2025/). Repository with corresponding code: https://github.com/FHendriks11/bifurcationML/. Video explanation: https://www.youtube.com/watch?v=wsL3h17KtjY
Abstract
Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen-Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.
中文标题/摘要
标题:对称破缺分岔问题的等变流匹配方法
非线性动力系统中的分岔现象通常会导致多个共存的稳定解,特别是在对称破缺的情况下。确定性的机器学习模型无法捕捉这种多样性,会平均多个解并无法表示低对称性结果。本文中,我们通过将生成式AI,特别是流匹配技术,形式化为一种原理性的方法来建模分岔结果的完整概率分布。我们的方法在现有技术的基础上,结合了等变架构和基于最优传输的耦合机制。我们推广了等变流匹配,使其成为一种对称耦合策略,能够在群作用下对预测输出和目标输出进行对齐,从而在等变设置中实现准确的学习。我们在一系列系统上进行了验证,从简单的概念系统到物理问题,如屈曲梁和Allen-Cahn方程。结果表明,该方法能够准确捕捉多模态分布和对称破缺分岔。此外,我们的结果表明,流匹配方法显著优于非概率性和变分方法。这为高维系统中多稳态的建模提供了一种原理性和可扩展的解决方案。
Summary / 总结
This work addresses the challenge of modeling bifurcation phenomena in nonlinear dynamical systems, where symmetry breaking leads to multiple stable solutions. The authors propose an equivariant flow matching method that uses generative AI to capture the full probability distribution over these solutions. The method combines flow matching with equivariant architectures and an optimal-transport-based coupling mechanism, demonstrating accurate learning and performance in various systems, including physical problems. The results show that this approach outperforms non-probabilistic and variational methods in modeling multistability in high-dimensional systems.
该研究利用生成AI中的流匹配方法来解决非线性动力系统中对称性破坏分岔的建模问题。方法结合了流匹配与对称性架构以及基于最优传输的耦合机制,以准确捕捉分岔结果的完整概率分布。实验表明,该方法能够有效建模多模态分布和对称性破坏分岔,并在高维系统中捕捉多稳态方面优于非概率性和变分方法。
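The symmetric coupling idea in the abstract above can be illustrated with a minimal sketch. It assumes a two-element symmetry group (a sign flip, as for a beam that can buckle left or right) and a linear flow-matching interpolant; the function names and the choice of group are hypothetical illustrations, not taken from the authors' repository.

```python
def flip(x):
    """Action of the non-trivial element of Z2: negate every coordinate."""
    return [-v for v in x]


def identity(x):
    """Action of the trivial group element."""
    return list(x)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def align_target(x0, x1, group=(identity, flip)):
    """Return the image of target x1 under the group element closest to x0."""
    return min((g(x1) for g in group), key=lambda y: sq_dist(x0, y))


def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to the aligned target, plus its velocity."""
    x1_aligned = align_target(x0, x1)
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1_aligned)]
    velocity = [b - a for a, b in zip(x0, x1_aligned)]  # regression target
    return x_t, velocity
```

Aligning the target before interpolating keeps the regression paths short: a noise sample on the negative side is paired with the negative-side copy of the solution instead of being pulled across the origin.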
Multi-scale Graph Autoregressive Modeling: Molecular Property Prediction via Next Token Prediction
Authors: Zhuoyang Jiang, Yaosen Min, Peiran Jin, Lei Chen
First: 2026-01-05T20:06:11+00:00 · Latest: 2026-01-23T16:49:47+00:00
Abstract
We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks disrupting the pivotal chemical details (e.g., activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs. It then serializes motifs via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. It achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
中文标题/摘要
标题:多尺度图自回归建模:通过下一个标记预测分子性质
我们提出了连接感知基序序列化(CamS),这是一种图到序列的表示方法,使仅解码器的Transformer能够通过标准的下一个标记预测(NTP)学习分子图。对于分子性质预测,基于SMILES的NTP扩展良好但缺乏显式的拓扑结构,而基于图的掩码建模可以捕捉连接性但可能会破坏关键的化学细节(例如,活性悬崖)。CamS通过将分子图序列化为结构丰富的因果序列来弥合这一差距。CamS首先挖掘数据驱动的连接感知基序。然后,通过以骨架为中心的广度优先搜索(BFS)序列化基序,以建立稳定的中心到外围顺序。关键的是,CamS通过将从细到粗的基序尺度的序列串联起来,实现了层次建模,使模型能够根据密集且未受污染的局部结构证据来条件化全局骨架。我们通过在CamS序列上预训练一个标准的LLaMA主干来实例化CamS-LLaMA。它在MoleculeNet和活性悬崖基准MoleculeACE上达到了最先进的性能,优于基于SMILES的语言模型和强大的图基线。可解释性分析证实,我们的多尺度因果序列化有效地驱动了对决定悬崖差异的关注。
Summary / 总结
CamS is a graph-to-sequence representation that uses next-token prediction to enable decoder-only Transformers to learn molecular graphs. It addresses the limitations of SMILES-based NTP and graph-native masked modeling by serializing molecular graphs into structure-rich causal sequences. CamS first mines connection-aware motifs and then serializes them using scaffold-rooted BFS to establish a core-to-periphery order. By concatenating sequences from fine to coarse motif scales, CamS allows the model to condition global scaffolds on local structural evidence. CamS-LLaMA, obtained by pre-training a vanilla LLaMA backbone on CamS sequences, achieves state-of-the-art performance on MoleculeNet and MoleculeACE, outperforming both SMILES-based and graph-based models.
CamS 是一种图到序列的表示方法,通过下一个标记预测使解码器仅的变压器能够学习分子图。它通过将分子图序列化为结构丰富的因果序列来解决 SMILES 基础的 NTP 和图原生掩码建模的局限性。CamS 在预训练一个 vanilla LLaMA 主干网络后,在 MoleculeNet 和 MoleculeACE 基准测试中取得了最先进的性能,优于 SMILES 基础和图基线模型。可解释性分析表明,CamS 有效地将注意力集中在决定悬崖差异的因素上。
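The scaffold-rooted BFS ordering and the fine-to-coarse concatenation described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the motif graphs, the motif names, the choice of root, and the `<scale_end>` separator token are all hypothetical, and the abstract does not specify how CamS mines motifs or tokenizes them.

```python
from collections import deque


def bfs_order(adjacency, root):
    """Breadth-first order from the root; neighbors are sorted for determinism."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in sorted(adjacency[node]):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order


def multiscale_sequence(scales):
    """Concatenate per-scale BFS sequences from (adjacency, root) pairs, fine to coarse."""
    tokens = []
    for adjacency, root in scales:
        tokens.extend(bfs_order(adjacency, root))
        tokens.append("<scale_end>")  # hypothetical separator between scales
    return tokens


# Hypothetical motif graphs for one molecule at two scales.
fine = {
    "benzene": ["amide", "methyl"],
    "amide": ["benzene", "ethyl"],
    "methyl": ["benzene"],
    "ethyl": ["amide"],
}
coarse = {"aryl_amide": ["alkyl"], "alkyl": ["aryl_amide"]}

sequence = multiscale_sequence([(fine, "benzene"), (coarse, "aryl_amide")])
```

Because the fine-scale tokens come first, a causal model reading this sequence has already seen the local structure by the time it reaches the coarse scaffold tokens.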
Gradient-Based Neuroplastic Adaptation for Concurrent Optimization of Neuro-Fuzzy Networks
Authors: John Wesley Hostetter, Min Chi
First: 2025-06-26T21:08:11+00:00 · Latest: 2026-01-23T16:49:40+00:00
Comments: 52 pages
Abstract
Neuro-fuzzy networks (NFNs) are transparent, symbolic, and universal function approximators that perform as well as conventional neural architectures, but their knowledge is expressed as linguistic IF-THEN rules. Despite these advantages, their systematic design process remains a challenge. Existing work will often sequentially build NFNs by inefficiently isolating parametric and structural identification, leading to a premature commitment to brittle and subpar architecture. We propose a novel application-independent approach called gradient-based neuroplastic adaptation for the concurrent optimization of NFNs' parameters and structure. By recognizing that NFNs' parameters and structure should be optimized simultaneously as they are deeply conjoined, settings previously unapproachable for NFNs are now accessible, such as the online reinforcement learning of NFNs for vision-based tasks. The effectiveness of concurrently optimizing NFNs is empirically shown as it is trained by online reinforcement learning to proficiently play challenging scenarios from a vision-based video game called DOOM.
中文标题/摘要
标题:基于梯度的神经可塑性适应用于神经模糊网络的并行优化
神经模糊网络(NFNs)是透明的、符号化的和通用的函数近似器,其性能与传统神经架构相当,但其知识以语言IF-THEN规则的形式表达。尽管具有这些优势,其系统的设计过程仍然是一个挑战。现有工作通常会通过无效地分离参数和结构识别来依次构建NFNs,导致过早地承诺一个脆弱且表现不佳的架构。我们提出了一种名为基于梯度的神经可塑性适应的新型应用无关方法,用于同时优化NFNs的参数和结构。通过认识到NFNs的参数和结构应该在它们紧密相连时同时优化,以前对NFNs不可接近的设置现在变得可行,例如基于视觉任务的NFNs的在线强化学习。并发优化NFNs的有效性通过在线强化学习训练来证明,使其能够熟练地玩来自基于视觉的电子游戏DOOM的具有挑战性的场景。
Summary / 总结
The paper addresses the challenge of systematically designing neuro-fuzzy networks (NFNs) by proposing gradient-based neuroplastic adaptation, a method for the concurrent optimization of NFNs' parameters and structure. Optimizing both simultaneously, rather than sequentially as in existing work, avoids a premature commitment to a brittle architecture and makes settings such as online reinforcement learning for vision-based tasks accessible to NFNs. The effectiveness of this method is demonstrated through training NFNs to play challenging scenarios in a vision-based video game called DOOM using online reinforcement learning.
论文提出了一种基于梯度的神经可塑性适应方法,用于同时优化神经模糊网络(NFNs)的参数和结构,以解决NFNs的系统设计难题。该方法认识到NFNs的参数和结构之间存在紧密联系,从而能够在线强化学习NFNs进行基于视觉的任务。通过训练NFNs在基于视觉的电子游戏DOOM中执行具有挑战性的场景,证明了该方法的有效性,显示出与顺序优化方法相比的更好性能。
GPA-VGGT:Adapting VGGT to Large scale Localization by self-Supervised learning with Geometry and Physics Aware loss
Authors: Yangfan Xu, Lilian Zhang, Xiaofeng He, Pengdong Wu, Wenqi Wu, Jun Mao
First: 2026-01-23T16:46:59+00:00 · Latest: 2026-01-23T16:46:59+00:00
Abstract
Transformer-based general visual geometry frameworks have shown promising performance in camera pose estimation and 3D scene understanding. Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
中文标题/摘要
标题:GPA-VGGT:通过几何和物理感知损失的自我监督学习将VGGT适应大规模定位
基于变换器的一般视觉几何框架在相机姿态估计和3D场景理解方面表现出有希望的性能。视觉几何地面变换器(VGGT)模型的最新进展在相机姿态估计和3D重建方面显示出巨大的潜力。然而,这些模型通常依赖于训练时的地面真实标签,这在适应未标记和未见过的场景时提出了挑战。在本文中,我们提出了一种自我监督框架,使用未标记数据训练VGGT,从而增强其在大规模环境中的定位能力。为此,我们将传统的成对关系扩展为序列级的几何约束以进行自我监督学习。具体而言,在每个序列中,我们采样多个源帧并几何投影到不同的目标帧,这提高了时间特征的一致性。我们将物理光度一致性与几何约束形式化为联合优化损失,以避免需要硬标签。通过使用此提出的方法训练模型,不仅局部和全局跨视图注意力层,而且相机和深度头可以有效地捕捉多视图几何。实验表明,该模型在数百次迭代内收敛,并在大规模定位方面取得了显著改进。我们的代码将在https://github.com/X-yangfan/GPA-VGGT上发布。
Summary / 总结
This paper proposes GPA-VGGT, a self-supervised framework that adapts VGGT models to large-scale localization without ground truth labels. By extending pair-wise relations to sequence-wise geometric constraints and formulating physical photometric consistency and geometric constraints as a joint optimization loss, the model can effectively capture multi-view geometry. Experiments show that the model converges quickly and significantly improves localization accuracy in large-scale environments.
本文提出了一种自监督框架GPA-VGGT,通过利用序列级几何约束和物理光度一致性及几何约束的联合优化损失,将VGGT适应到大规模定位任务中。该模型无需硬标签即可有效捕捉多视图几何结构,实验结果表明,该模型在大规模定位任务中取得了显著改进,在几百次迭代内即可收敛。
MACTAS: Self-Attention-Based Inter-Agent Communication in Multi-Agent Reinforcement Learning with Action-Value Function Decomposition
Authors: Maciej Wojtala, Bogusz Stefańczyk, Dominik Bogucki, Łukasz Lepak, Jakub Strykowski, Paweł Wawrzyński
Venue: IJCAI 2026
First: 2025-08-19T09:08:48+00:00 · Latest: 2026-01-23T16:26:59+00:00
Comments: Submitted for IJCAI 2026
Abstract
Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication method that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The method can be seamlessly integrated with any action-value function decomposition algorithm and can be viewed as an orthogonal extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents, which makes it scalable to large systems. Experimental results on the SMACv2 benchmark demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on a number of maps.
中文标题/摘要
标题:MACTAS:基于自注意力的多智能体强化学习中动作-价值函数分解的代理间通信
通信对于人类代理执行复杂任务至关重要,激发了对多智能体强化学习(MARL)中通信机制的兴趣。然而,现有的MARL通信协议通常复杂且非可微。本文提出了一种基于自注意力的通信方法,可以在MARL中交换信息。我们提出的方法是完全可微的,允许智能体在奖励驱动下学习生成消息。该方法可以无缝集成到任何动作-价值函数分解算法中,并可视为此类分解的正交扩展。值得注意的是,它包含固定数量的可训练参数,与智能体数量无关,使其适用于大型系统。在SMACv2基准上的实验结果表明了我们方法的有效性,该方法在多个地图上达到了最先进的性能。
Summary / 总结
The research aims to improve communication among agents in multi-agent reinforcement learning (MARL) to better execute complex tasks. The method uses self-attention to enable agents to exchange information in a fully differentiable manner, allowing them to learn to generate messages for better performance. Experiments on the SMACv2 benchmark show that this approach achieves state-of-the-art performance on several maps, demonstrating its effectiveness and scalability to large systems.
本文提出了MACTAS,一种基于自注意力的多智能体强化学习通信方法,使智能体能够以奖励驱动的方式学习生成消息。该方法完全可微,并且可以与任何动作-价值函数分解算法无缝集成,使其能够扩展到大型系统。实验结果表明,MACTAS在SMACv2基准测试中的多个地图上达到了最先进的性能。
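To make concrete why a self-attention communication layer has a parameter count independent of the number of agents, here is a minimal single-head sketch in plain Python. It is an assumption-laden illustration, not the MACTAS architecture: the projection matrices `Wq`, `Wk`, `Wv` and the single-head, no-residual layout are hypothetical, since the abstract does not give these details.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_messages(hidden, Wq, Wk, Wv):
    """One self-attention pass: each agent receives a weighted mix of all agents' values."""
    queries = [matvec(Wq, h) for h in hidden]
    keys = [matvec(Wk, h) for h in hidden]
    values = [matvec(Wv, h) for h in hidden]
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    messages = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        messages.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return messages
```

The same three matrices are applied to every agent's hidden state, so the function accepts 3 agents or 30 without any change to its parameters; only the attention matrix grows with the team size.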
Boosting Deep Reinforcement Learning with Semantic Knowledge for Robotic Manipulators
Authors: Lucía Güitta-López, Vincenzo Suriani, Jaime Boal, Álvaro J. López-López, Daniele Nardi
Venue: Robotics, published 24 June 2025
First: 2026-01-23T16:14:28+00:00 · Latest: 2026-01-23T16:14:28+00:00
Abstract
Deep Reinforcement Learning (DRL) is a powerful framework for solving complex sequential decision-making problems, particularly in robotic control. However, its practical deployment is often hindered by the substantial amount of experience required for learning, which results in high computational and time costs. In this work, we propose a novel integration of DRL with semantic knowledge in the form of Knowledge Graph Embeddings (KGEs), aiming to enhance learning efficiency by providing contextual information to the agent. Our architecture combines KGEs with visual observations, enabling the agent to exploit environmental knowledge during training. Experimental validation with robotic manipulators in environments featuring both fixed and randomized target attributes demonstrates that our method achieves up to a 60% reduction in learning time and improves task accuracy by approximately 15 percentage points, without increasing training time or computational complexity. These results highlight the potential of semantic knowledge to reduce sample complexity and improve the effectiveness of DRL in robotic applications.
中文标题/摘要
标题:利用语义知识提升机器人操作器的深度强化学习
深度强化学习(DRL)是一种强大的框架,用于解决复杂的序列决策问题,特别是在机器人控制方面。然而,其实际部署往往受到学习所需大量经验的阻碍,导致高计算和时间成本。在本文中,我们提出了一种将DRL与知识图嵌入(KGEs)形式的语义知识进行新颖整合的方法,旨在通过向代理提供上下文信息来提高学习效率。我们的架构将KGEs与视觉观察相结合,使代理在训练期间能够利用环境知识。在包含固定和随机目标属性的机器人操作器环境中进行的实验验证表明,我们的方法在学习时间上最多可减少60%,任务准确性提高约15个百分点,而无需增加训练时间和计算复杂度。这些结果突显了语义知识在减少样本复杂性和提高DRL在机器人应用中的有效性方面的潜力。
Summary / 总结
This paper addresses the challenge of high computational and time costs in Deep Reinforcement Learning (DRL) by integrating semantic knowledge in the form of Knowledge Graph Embeddings (KGEs) to enhance learning efficiency. The proposed method combines KGEs with visual observations to provide contextual information to the agent, which is particularly useful for robotic manipulators. Experiments show up to a 60% reduction in learning time and an improvement in task accuracy of approximately 15 percentage points, without increasing training time or computational complexity.
本文通过将知识图嵌入(KGEs)与深度强化学习(DRL)相结合,解决机器人控制中高计算和时间成本的问题。该方法将KGEs与视觉观察结合,为代理提供上下文信息,从而提高学习效率。实验结果表明,该方法在不增加训练时间和计算复杂度的情况下,实现了60%的学习时间减少和约15个百分点的任务准确率提升。
Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation
Authors: Tims Pecerskis, Aivars Smirnovs
First: 2026-01-23T16:11:54+00:00 · Latest: 2026-01-23T16:11:54+00:00
Abstract
This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a Runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE) which rely on static gating networks, NSED employs a Dynamic Expertise Broker - a runtime optimization engine that treats model selection as a variation of the Knapsack Problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a Macro-Scale Recurrent Neural Network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (less than 20B) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below that of any individual agent.
中文标题/摘要
标题:模型混合体:通过N维自我评估协商统一异质代理
本文介绍了N维自我评估协商(NSED)协议,这是一种运行时模型混合体(MoM)架构,能够从多种不同的专家代理中构建出新兴的复合模型。不同于依赖静态门控网络的传统专家混合体(MoE),NSED采用了一个动态专家经纪人,即一个运行时优化引擎,将模型选择视为背包问题的一种变体,根据实时遥测数据和成本约束将异构检查点绑定到功能角色。在执行层面上,我们将协商形式化为宏尺度循环神经网络(RNN),其中共识状态通过语义遗忘门回环,以实现迭代细化而不需按比例增加VRAM。关键组件包括一个用于无信任N对N同伴评审的编排织物、一个二次投票激活函数以实现非线性共识,以及一个基于反馈的状态更新。在具有挑战性的基准测试(AIME 2025,LiveCodeBench)上的实证验证表明,这种架构使得小规模(少于200亿参数)的消费级模型能够匹配甚至超越当前最先进的1000亿参数以上模型的性能,确立了新的硬件效率前沿。此外,在DarkBench安全套件上的测试揭示了固有的对齐特性,同伴介导的纠正降低了奉承评分,低于任何单一代理的评分。
Summary / 总结
The paper introduces N-Way Self-Evaluating Deliberation (NSED), a runtime Mixture-of-Models architecture that combines multiple expert agents to form emergent composite models. Unlike traditional Mixture-of-Experts, NSED uses a Dynamic Expertise Broker to dynamically select models based on live telemetry and cost constraints, treating model selection as a Knapsack Problem. Key findings show that ensembles of small consumer-grade models can match or exceed the performance of large state-of-the-art models, demonstrating a new hardware efficiency frontier. Additionally, the system reduces sycophancy scores, indicating improved intrinsic alignment through peer-mediated correction.
论文介绍了N-Way Self-Evaluating Deliberation (NSED)协议,这是一种运行时Mixture-of-Models架构,能够结合多个专家代理形成新兴的复合模型。不同于传统的Mixture-of-Experts,NSED使用动态专家经纪人基于实时遥测和成本约束动态选择模型,将模型选择视为背包问题。关键发现表明,小型消费级模型的集合可以匹配或超越大型最先进的模型的性能,展示了新的硬件效率前沿。此外,该系统通过同伴调解纠正降低了顺从性评分,表明了更好的内在对齐性。
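The abstract frames model selection as a variation of the Knapsack Problem but does not say which variant. As a rough illustration only, the sketch below uses the textbook 0/1 knapsack with integer costs: pick the subset of candidate models that maximizes total utility under a cost budget. The candidate names, costs (for example, GB of VRAM), and utility scores are hypothetical.

```python
def select_models(candidates, budget):
    """0/1 knapsack over (name, cost, utility) tuples with integer costs.

    Returns (best total utility, list of chosen names) within the budget.
    """
    # best[b] holds the best (utility, names) achievable with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, utility in candidates:
        # Iterate budgets downward so each candidate is used at most once.
        for b in range(budget, cost - 1, -1):
            with_item = best[b - cost][0] + utility
            if with_item > best[b][0]:
                best[b] = (with_item, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical candidates: (name, cost, utility).
pool = [("small-a", 8, 0.6), ("small-b", 7, 0.5), ("mid-c", 14, 0.9)]
utility, chosen = select_models(pool, budget=16)
```

With this toy pool and a budget of 16, the two small models together beat the single mid-sized one. A live broker would presumably refresh costs and utilities from telemetry, which this static sketch does not attempt.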
Orbitopal Fixing in SAT
Authors: Markus Anders, Cayden Codel, Marijn J. H. Heule
First: 2026-01-23T16:01:48+00:00 · Latest: 2026-01-23T16:01:48+00:00
Comments: to appear at TACAS 2026
Abstract
Despite their sophisticated heuristics, boolean satisfiability (SAT) solvers are still vulnerable to symmetry, causing them to visit search regions that are symmetric to ones already explored. While symmetry handling is routine in other solving paradigms, integrating it into state-of-the-art proof-producing SAT solvers is difficult: added reasoning must be fast, non-interfering with solver heuristics, and compatible with formal proof logging. To address these issues, we present a practical static symmetry breaking approach based on orbitopal fixing, a technique adapted from mixed-integer programming. Our approach adds only unit clauses, which minimizes downstream slowdowns, and it emits succinct proof certificates in the substitution redundancy proof system. Implemented in the satsuma tool, our methods deliver consistent speedups on symmetry-rich benchmarks with negligible regressions elsewhere.
中文标题/摘要
标题:轨道固定在SAT中的应用
尽管布尔可满足性(SAT)求解器具有复杂的启发式算法,但它们仍然容易受到对称性的影响,导致它们访问已经探索过的搜索区域的对称区域。虽然在其他求解范式中处理对称性是常规操作,但在将对称性处理集成到最先进的证明生成SAT求解器中仍然是困难的:增加的推理必须快速,不干扰求解器启发式算法,并且与形式证明日志兼容。为了解决这些问题,我们提出了一种基于轨道固定的技术的实用静态对称性破坏方法,轨道固定是从混合整数规划中借鉴的技术。我们的方法仅添加单位子句,这最大限度地减少了下游的性能下降,并且它在替换冗余证明系统中生成简洁的证明证书。在satsuma工具中实现,我们的方法在对称性丰富的基准测试中提供了持续的加速,而在其他地方几乎没有退步。
Summary / 总结
The research aims to address the vulnerability of SAT solvers to symmetry by integrating static symmetry breaking techniques. The method uses orbitopal fixing, which adds only unit clauses to minimize slowdowns and emits succinct proof certificates. Experiments show consistent speedups on symmetry-rich benchmarks with minimal impact on other benchmarks.
研究旨在通过引入静态对称性打破技术来缓解SAT求解器对称性的脆弱性。方法使用了轨道固定技术,仅添加单位子句以最小化性能下降,并生成简洁的证明证书。该方法在satsuma工具中实现,能够在对称性丰富的基准测试中提供一致的性能提升,且在其他地方没有显著的性能退化。
Reasoning Promotes Robustness in Theory of Mind Tasks
Authors: Ian B. de Haan, Peter van der Putten, Max van Duijn
First: 2026-01-23T16:01:24+00:00 · Latest: 2026-01-23T16:01:24+00:00
Comments: 14 pages, 2 figures
Abstract
Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.
中文标题/摘要
标题:推理促进理论心智任务的稳健性
大型语言模型(LLMs)最近在理论心智(ToM)测试中表现出强大的性能,引发了关于其潜在能力和真正性能的辩论。同时,通过可验证奖励强化学习(RLVR)训练的以推理为导向的LLMs在一系列基准测试中取得了显著进步。本文研究了此类推理模型在ToM任务中的行为,使用了机器心理实验的新颖改编和现有基准的结果。我们观察到,推理模型在提示变化和任务扰动方面表现出一致的稳健性增加。我们的分析表明,观察到的收益更有可能归因于找到正确解的稳健性增加,而不是新的形式的ToM推理。我们讨论了这一解释对评估LLMs的社会认知行为的影响。
Summary / 总结
This paper investigates the performance of reasoning-oriented large language models (LLMs) on Theory of Mind (ToM) tasks, using novel adaptations of machine psychological experiments and established benchmarks. The study finds that reasoning models show increased robustness to prompt variations and task perturbations, suggesting that the gains are due to enhanced robustness in finding the correct solution rather than new forms of ToM reasoning. The research highlights the importance of robustness in evaluating social-cognitive behavior in LLMs.
该论文研究了推理导向的大语言模型(LLM)在理论思维(ToM)任务中的表现。通过使用机器心理实验的新适应方法和已建立的基准测试,研究发现推理模型在提示变化和任务扰动方面表现出更强的稳健性。研究结果表明,这些改进主要是由于在找到正确答案方面增强了稳健性,而不是新的ToM推理形式。论文讨论了这些发现对评估LLM的社会认知行为的影响。
Theoretical Foundations of Scaling Law in Familial Models
Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Xuelong Li
First: 2025-12-29T12:01:58+00:00 · Latest: 2026-01-23T15:36:25+00:00
Abstract
Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "familial models", a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
中文标题/摘要
标题:家族模型的标度定律理论基础
神经网络的标度定律已成为优化大型语言模型(LLM)训练的基础,但它们通常假设单一密集模型输出。这一限制实际上忽视了“家族模型”,这是一种变革性的范式,对于实现跨异构设备-边缘-云层次结构的通用智能至关重要。超越静态架构,家族模型将早期退出与接力式推理集成,从单一共享主干生成G个可部署子模型。在本文中,我们通过引入粒度(G)作为基本标度变量,与模型大小(N)和训练令牌(D)一起,理论和实证上扩展了标度定律,以捕捉这种“一次运行,多个模型”范式。为了严格量化这种关系,我们提出了一种统一的函数形式L(N, D, G),并使用大规模实证运行对其进行参数化。具体而言,我们采用严格的IsoFLOP实验设计,严格隔离架构影响与计算规模。在固定预算下,我们系统地扫掠模型大小(N)和粒度(G),同时动态调整令牌(D)。这种方法有效地将粒度的边际成本与规模的好处解耦,确保我们统一的标度定律的高保真参数化。我们的结果表明,粒度惩罚遵循一个具有极小指数的乘法幂律。理论上,这将固定计算量训练与动态架构联系起来。实际上,它验证了“一次训练,多次部署”的范式,证明了在不牺牲密集基线的计算优化性的情况下,部署灵活性是可行的。
Summary / 总结
This work extends the neural scaling laws to familial models, which integrate early exits and relay-style inference. The authors introduce granularity (G) as a new scaling variable alongside model size (N) and training tokens (D). Using an IsoFLOP experimental design, they systematically vary model sizes and granularities while adjusting tokens to decouple the marginal cost of granularity from the benefits of scale. Key findings include a multiplicative power law for the granularity penalty with a very small exponent, validating the 'train once, deploy many' paradigm without sacrificing compute-optimality.
该研究将神经网络的扩展定律应用到家庭模型中,这些模型结合了早期退出和接力式推理。作者引入了粒度(G)作为新的扩展变量,与模型大小(N)和训练令牌(D)一起。通过使用IsoFLOP实验设计,他们系统地改变模型大小和粒度,同时调整令牌,以解耦粒度的边际成本与规模的好处。关键发现包括粒度惩罚遵循一个具有非常小指数的乘法幂律,验证了“一次训练,多次部署”的范式,同时不牺牲密集基线的计算优化性。
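The abstract does not state the functional form of L(N, D, G), so the following is only a sketch of what "a multiplicative power law with an extremely small exponent" could look like. It assumes a generic dense term of the form E + A/N^alpha + B/D^beta multiplied by G^gamma. Every coefficient below is a made-up placeholder, not a fitted value from the paper.

```python
def familial_loss(n_params, n_tokens, granularity,
                  e=1.0, a=100.0, b=100.0, alpha=0.3, beta=0.3, gamma=0.01):
    """Hypothetical L(N, D, G): a dense-style loss times a power-law granularity penalty.

    All default coefficients are illustrative placeholders.
    """
    dense = e + a / n_params ** alpha + b / n_tokens ** beta
    return dense * granularity ** gamma


# Relative loss overhead of training 8 sub-models instead of 1 (placeholder numbers).
overhead = familial_loss(1e9, 2e10, 8) / familial_loss(1e9, 2e10, 1)
```

Under this assumed form the overhead depends only on G and gamma, so with a small gamma the penalty stays within a few percent even as G grows, which is the qualitative behavior the abstract reports.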
ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models
Authors: Chenxi Ruan, Yu Xiao, Yihan Hou, Guosheng Hu, Wei Zeng
First: 2026-01-23T15:36:02+00:00 · Latest: 2026-01-23T15:36:02+00:00
Abstract
While text-to-image (T2I) models have advanced considerably, their capability to associate colors with implicit concepts remains underexplored. To address the gap, we introduce ColorConceptBench, a new human-annotated benchmark to systematically evaluate color-concept associations through the lens of probabilistic color distributions. ColorConceptBench moves beyond explicit color names or codes by probing how models translate 1,281 implicit color concepts using a foundation of 6,369 human annotations. Our evaluation of seven leading T2I models reveals that current models lack sensitivity to abstract semantics, and crucially, this limitation appears resistant to standard interventions (e.g., scaling and guidance). This demonstrates that achieving human-like color semantics requires more than larger models, but demands a fundamental shift in how models learn and represent implicit meaning.
中文标题/摘要
标题:ColorConceptBench:一种用于文本到图像模型中概率色彩概念理解的基准
尽管文本到图像(T2I)模型取得了显著进步,但它们将颜色与隐含概念关联的能力仍被忽视。为解决这一问题,我们引入了ColorConceptBench,这是一种新的基于人类注释的基准,通过概率色彩分布的视角系统评估颜色概念关联。ColorConceptBench 超越了显式色彩名称或代码,通过6,369个人类注解,探索模型如何将1,281个隐含色彩概念进行转换。我们对七种领先T2I模型的评估表明,当前模型在抽象语义方面缺乏敏感性,而且这种局限性对标准干预(例如缩放和指导)具有抵抗力。这表明,实现人类级别的色彩语义不仅需要更大的模型,还需要模型在学习和表示隐含意义方面发生根本性的转变。
Summary / 总结
The research aims to evaluate text-to-image models' ability to associate colors with implicit concepts, which is underexplored. To achieve this, the study introduces ColorConceptBench, a new benchmark based on 6,369 human annotations of 1,281 implicit color concepts. Evaluating seven leading T2I models, the findings show that these models struggle with abstract semantics and are not significantly improved by standard interventions, indicating a need for a fundamental shift in how models learn and represent implicit meaning.
研究旨在评估文本到图像模型理解并关联颜色与隐含概念的能力,这一直被忽视。为此,研究引入了基于6,369个人类对1,281个隐含颜色概念的标注的ColorConceptBench基准。评估七个领先的T2I模型后,研究发现这些模型在抽象语义方面存在困难,并且标准干预措施并未显著改善,表明需要在模型如何学习和表示隐含意义方面进行根本性的转变。
Calibrated Probabilistic Interpolation for GEDI Biomass
Authors: Robin Young, Srinivasan Keshav
First: 2026-01-23T15:35:33+00:00 · Latest: 2026-01-23T15:35:33+00:00
Abstract
Reliable wall-to-wall biomass mapping from NASA's GEDI mission requires interpolating sparse LiDAR observations across heterogeneous landscapes. While machine learning approaches like Random Forest and XGBoost are standard for this task, they treat spatial predictions of GEDI observations from multispectral or SAR remote sensing data as independent without adapting to the varying difficulty of heterogeneous landscapes. We demonstrate these approaches generally fail to produce calibrated prediction intervals. We identify that this stems from conflating ensemble variance with aleatoric uncertainty and ignoring local spatial context.
To resolve this, we introduce Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function, allowing uncertainty estimates to expand in complex landscapes and contract in homogeneous areas. We validate this approach across five distinct biomes ranging from Tropical Amazonian forests to Boreal and Alpine ecosystems, demonstrating that ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration. We demonstrate the operational utility of the method through few-shot adaptation, where the model recovers most of the performance gap in cross-region transfer using minimal local data. This work provides a scalable, theoretically rigorous alternative to ensemble variance for continental scale earth observation.
中文标题/摘要
标题:GEDI生物量校准概率插值
从NASA的GEDI任务可靠地进行全区域生物量制图需要在异质景观上插补稀疏的LiDAR观测数据。尽管随机森林和XGBoost等机器学习方法常用于此任务,但它们将来自多光谱或SAR遥感数据的GEDI观测空间预测视为独立的,而不适应异质景观的差异性。我们证明这些方法通常无法生成校准的预测区间。我们发现这源于将集合方差与 aleatoric 不确定性混淆,并忽略了局部空间上下文。
为了解决这个问题,我们引入了注意神经过程(ANPs),这是一种概率元学习框架,明确地根据局部观测集和地理空间基础模型嵌入来条件化预测。与静态集合不同,ANPs 学习一个灵活的空间协方差函数,使不确定性估计在复杂景观中扩展,在同质区域中收缩。我们通过五个不同的生物群落进行验证,从热带亚马逊森林到 boreal 和高山生态系统,证明ANPs 在保持接近理想的不确定性校准的同时实现了竞争力的准确性。我们通过少样本适应展示了该方法的操作实用性,其中模型在使用少量本地数据的情况下恢复了大部分跨区域转移的性能差距。这项工作为大陆规模地球观测提供了一种可扩展且理论严谨的替代集合方差方法。
Summary / 总结
This study addresses the challenge of reliably mapping biomass across diverse landscapes using NASA's GEDI mission data. It shows that standard machine learning methods such as Random Forest and XGBoost generally fail to produce calibrated prediction intervals, because they conflate ensemble variance with aleatoric uncertainty and ignore local spatial context. To overcome this, the authors introduce Attentive Neural Processes (ANPs), which explicitly condition predictions on local observation sets and geospatial foundation model embeddings, so that uncertainty estimates expand in complex landscapes and contract in homogeneous areas. Experiments across five biomes show that ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration, and few-shot adaptation recovers most of the performance gap in cross-region transfer using minimal local data.
研究旨在通过解决传统机器学习方法如随机森林和XGBoost无法考虑不同景观复杂性的问题,提高NASA GEDI任务的生物量地图绘制可靠性。研究引入了注意力神经过程(ANPs),这是一种概率元学习框架,能够根据局部观测集和地理空间上下文明确调整预测。ANPs通过提供在复杂景观中尤为准确的不确定性估计,超越了传统方法,并在不同生物群落中表现出色,即使使用少量局部数据也能快速适应。
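What "calibrated prediction intervals" means can be shown with a minimal sketch: for a nominal level such as 90%, about 90% of held-out observations should fall inside the predicted interval. The code below assumes each prediction is summarized by a Gaussian mean and standard deviation, which is an assumption for illustration and not necessarily how the paper evaluates calibration; the function name `interval_coverage` is hypothetical.

```python
from statistics import NormalDist


def interval_coverage(y_true, mean, std, level=0.9):
    """Fraction of observations inside the central `level` Gaussian interval.

    Assumption: each prediction is a Gaussian with the given mean and std.
    A well-calibrated model yields coverage close to `level`.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    if not (len(y_true) == len(mean) == len(std)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Half-width of the central interval in units of standard deviation.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = sum(1 for y, m, s in zip(y_true, mean, std) if abs(y - m) <= z * s)
    return hits / len(y_true)


# Toy values (illustrative only): three observations hit, one misses.
coverage = interval_coverage(
    y_true=[0.0, 1.0, 2.0, 10.0],
    mean=[0.0, 1.0, 2.0, 0.0],
    std=[1.0, 1.0, 1.0, 1.0],
    level=0.9,
)  # 0.75
```

Repeating this for several nominal levels and plotting empirical coverage against the nominal level gives a calibration curve; intervals that are too narrow show up as coverage below the diagonal.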
Privacy in Human-AI Romantic Relationships: Concerns, Boundaries, and Agency
Authors: Rongjun Ma, Shijing He, Jose Luis Martin-Navarro, Xiao Zhan, Jose Such
First: 2026-01-23T15:23:37+00:00 · Latest: 2026-01-23T15:23:37+00:00
Comments: Accepted at CHI 2026
Abstract
An increasing number of LLM-based applications are being developed to facilitate romantic relationships with AI partners, yet the safety and privacy risks in these partnerships remain largely underexplored. In this work, we investigate privacy in human-AI romantic relationships through an interview study (N=17), examining participants' experiences and privacy perceptions across stages of exploration, intimacy, and dissolution, alongside platforms they used. We found that these relationships took varied forms, from one-to-one to one-to-many, and were shaped by multiple actors, including creators, platforms, and moderators. AI partners were perceived as having agency, actively negotiating privacy boundaries with participants and sometimes encouraging disclosure of personal details. As intimacy deepened, these boundaries became more permeable, though some participants voiced concerns such as conversation exposure and sought to preserve anonymity. Overall, platform affordances and diverse romantic dynamics expand the privacy landscape, underscoring the need to rethink how privacy is constructed in human-AI intimacy.
中文标题/摘要
标题:人类与AI浪漫关系中的隐私:关切、边界与自主
越来越多基于LLM的应用程序被开发出来以促进与AI伴侣的浪漫关系,然而这些伙伴关系中的安全和隐私风险仍然很大程度上未被探索。在本研究中,我们通过访谈研究(N=17)探讨人类与AI浪漫关系中的隐私问题,考察参与者在探索、亲密和解体阶段的经历和隐私感知,以及他们使用的平台。我们发现这些关系形式多样,从一对一到一对多,并受到创作者、平台和管理员等多方的影响。AI伴侣被认为具有自主性,与参与者积极协商隐私边界,有时还会鼓励披露个人信息。随着亲密关系加深,这些边界变得更为脆弱,尽管一些参与者表达了诸如对话暴露等方面的担忧,并寻求保持匿名。总体而言,平台功能和多样的浪漫动态扩展了隐私的范围,强调了重新思考人类与AI亲密关系中隐私构建的必要性。
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Authors: Vy Tuong Dang, An Vo, Emilio Villa-Cueva, Quang Tau, Duc Dm, Thamar Solorio, Daeyoung Kim
First: 2025-08-19T09:31:18+00:00 · Latest: 2026-01-23T15:16:58+00:00
Abstract
We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu-bench.github.io/
中文标题/摘要
标题:VMMU:越南多任务多模态理解与推理基准
我们介绍了VMMU,一个越南多任务多模态理解与推理基准,旨在评估视觉语言模型(VLMs)如何超越英语对视觉和文本信息进行解释和推理。VMMU 包含7个任务中的2500个多模态问题,涵盖了从STEM问题解决到数据解释、规则指导的视觉推理和抽象视觉推理等多种问题情境。所有问题都需要真正的多模态整合,而不是依赖于仅基于文本的线索或OCR捷径。我们对VMMU上的一系列最先进的专有和开源VLMs进行了评估。尽管越南OCR表现出色,但专有模型的平均准确率仅为66%。进一步的分析表明,失败的主要原因是多模态定位和文本与视觉证据的推理,而不是OCR。代码和数据可在https://vmmu-bench.github.io/ 获取
Summary / 总结
VMMU is a Vietnamese benchmark for evaluating how vision-language models (VLMs) understand and reason over multimodal inputs beyond English. It includes 2,500 multimodal questions across seven diverse tasks, all requiring genuine multimodal integration. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy, and further analysis attributes the failures primarily to multimodal grounding and reasoning rather than OCR. The benchmark aims to push the boundaries of VLMs in handling Vietnamese multimodal data.
VMMU 是一个越南语多任务多模态理解与推理基准,旨在评估视觉语言模型在处理非英语任务时的能力。它包含2500个多模态问题,覆盖七个不同的任务,需要真正的多模态整合。尽管OCR表现良好,但专用模型的平均准确率仅为66%,表明在多模态定位和推理方面存在挑战。该基准旨在推动视觉语言模型在处理越南语多模态数据方面的进步。
Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
First: 2025-01-22T21:08:30+00:00 · Latest: 2026-01-23T15:12:35+00:00
Comments: Published at TMLR; updated version
Abstract
Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors -- model size, training-data scale, and training-data source -- by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies -- Bias Prompts, Prompt Array, and SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is source- and size-dependent: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most fair. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.
中文标题/摘要
标题:数据最重要:审计对比视觉语言模型中的社会偏见
视觉-语言模型(VLMs)在零样本识别方面表现出色,但经常从训练数据中继承社会偏见。我们系统地拆分了三个设计因素——模型大小、训练数据规模和训练数据来源,通过比较CLIP和OpenCLIP两种模型,这两种模型具有相同的对比目标,但编码器宽度不同,且预训练的数据集不同(4亿个专有图像-文本对 vs. 4亿/20亿个LAION)。在平衡的人脸分析基准测试中,增大编码器减少了CLIP中的性别偏见,但在OpenCLIP中放大了性别和种族偏见;将LAION数据集从4亿增加到20亿进一步增加了OpenCLIP的偏见。在匹配的模型和数据预算下,用LAION数据替换专有数据提高了性别公平性,但增加了种族偏见,突显了数据来源是偏见模式的主要驱动因素。我们还评估了三种事后测试时去偏策略——偏见提示、提示阵列和SANER。去偏减少了但并未消除伤害,其有效性取决于数据来源和模型规模:偏见提示在较小的模型规模下最有效地减少了CLIP中的性别偏见,而提示阵列和SANER更可靠地减少了OpenCLIP中的种族偏见;扩大LAION重新配置了哪种方法最公平。这些发现共同挑战了更大的模型或数据集自动更公平的假设,并将训练数据来源置于偏见和缓解效果的关键决定因素之上。我们发布了代码和评估脚本,以实现未来VLMs的透明和可重复审计。
Summary / 总结
This study investigates the impact of model size, training-data scale, and training-data source on social bias in vision-language models (VLMs) by comparing CLIP and OpenCLIP. Enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP, and growing the LAION corpus from 400M to 2B pairs further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness but increases racial skew. Post-hoc, test-time debiasing strategies reduce but do not eliminate harm, and their effectiveness is source- and size-dependent, highlighting training data source as the key determinant of both bias patterns and mitigation efficacy.
研究探讨了模型大小、训练数据规模和数据来源对视觉语言模型社会偏见的影响。通过比较CLIP和OpenCLIP,这两个模型具有相同的对比目标但编码器大小和训练数据不同,研究发现增加编码器大小可以减少CLIP中的性别偏见,但在OpenCLIP中则会放大偏见。将训练数据从400M扩展到2B会进一步增加OpenCLIP的偏见。用LAION替换专有数据可以改善性别公平性但增加种族偏见。后处理的去偏方法可以减少偏见但依赖于数据来源和模型大小,突显了训练数据来源在决定偏见模式和缓解效果中的重要性。
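As a rough illustration of how a "skew" between demographic groups might be quantified on a balanced benchmark, the sketch below computes, per group, the rate at which a model assigns some label of interest, and then the gap between the highest and lowest rate. The abstract does not define its skew metrics, so this max-min rate gap, the group names, and the function names `group_rates` and `skew_gap` are assumptions for illustration only.

```python
from collections import defaultdict


def group_rates(groups, predictions):
    """Per-group rate at which the model assigns the label of interest.

    `groups` holds a group identifier per sample; `predictions` holds a
    truthy value when the model assigned the label to that sample.
    """
    if len(groups) != len(predictions) or not groups:
        raise ValueError("inputs must be non-empty and of equal length")
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, predicted in zip(groups, predictions):
        totals[group] += 1
        if predicted:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def skew_gap(groups, predictions):
    """Gap between the highest and lowest per-group rates (0 means no gap).

    Assumption: this is an illustrative measure, not the paper's metric.
    """
    rates = group_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())


# Toy data (illustrative only): group "a" gets the label twice as often.
gap = skew_gap(["a", "a", "b", "b"], [1, 1, 1, 0])  # 0.5
```

Computing such a gap for each model size, corpus size, and data source, before and after a debiasing method is applied, is one way to tabulate the kind of comparison the summary describes.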
Incorporating Eye-Tracking Signals Into Multimodal Deep Visual Models For Predicting User Aesthetic Experience In Residential Interiors
Authors: Chen-Ying Chien, Po-Chih Kuo
First: 2026-01-23T15:02:44+00:00 · Latest: 2026-01-23T15:02:44+00:00
Abstract
Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.
中文标题/摘要
标题:将眼动信号融入多模态深度视觉模型以预测住宅室内设计的用户审美体验
理解人们如何感知和评估室内空间对于设计促进福祉的环境至关重要。然而,由于感知的主观性和视觉反应的复杂性,预测审美体验仍然具有挑战性。本研究引入了一种结合视觉特征和眼动信号的双分支CNN-LSTM框架,以预测住宅室内设计的审美评价。我们收集了224个室内设计视频数据集,并与28名参与者同步的眼动数据配对,这些参与者对15个审美维度进行了评分。所提出的模型在客观维度(如光线)上达到了72.2%的准确率,在主观维度(如放松)上达到了66.8%的准确率,优于最先进的视频基线,并在主观评估任务上显示出明显的改进。值得注意的是,使用眼动信号训练的模型在仅使用视觉输入时保留了相当的性能。进一步的消融实验表明,瞳孔反应对客观评估贡献最大,而眼动和视觉线索的结合增强了主观评估。这些发现突显了在训练过程中将眼动信号作为特权信息的重要性,从而为室内设计提供更实用的审美评估工具。
Summary / 总结
This study aims to predict users' aesthetic experiences in residential interiors by integrating eye-tracking signals into a multimodal deep visual model. The researchers developed a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals, using a dataset of 224 interior design videos with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The model achieved 72.2% accuracy on objective dimensions and 66.8% on subjective dimensions, outperforming state-of-the-art video baselines, and models trained with eye-tracking retained comparable performance when deployed with visual input alone. Ablation experiments showed that pupil responses contribute most to objective assessments, while gaze and visual cues together improve subjective evaluations, underscoring the value of eye-tracking as privileged information during training.
本研究旨在通过将眼动追踪信号集成到多模态深度视觉模型中,预测用户在住宅内饰中的审美体验。开发了一种双分支CNN-LSTM框架,将视觉特征与眼动追踪数据融合。该模型在客观维度上达到了72.2%的准确率,在主观维度上达到了66.8%的准确率,优于现有视频基线。消融实验表明,瞳孔反应对于客观评估至关重要,而眼动和视觉线索的结合则提高了主观评估的效果。这项工作强调了在训练中纳入眼动追踪数据的重要性,从而为更准确的室内设计审美评估提供了可能。