arXiv 论文速递

2026-02-18 03:58
Snapshot: 20260218_0358
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Authors: Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou, Carl Marshall, Shubham Tulsiani, Caleb Leak
First: 2026-02-16T18:59:58+00:00 · Latest: 2026-02-16T18:59:58+00:00
Comments: Project page: https://yehonathanlitman.github.io/edit_ctrl
Abstract
High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
中文标题/摘要
标题:EditCtrl:分离的局部和全局控制以实现实时生成视频编辑
通过利用预训练的视频基础模型,高保真生成视频编辑在质量上取得了显著提升。然而,它们的计算成本是一个主要瓶颈,因为它们通常设计为无论修复遮罩的大小如何,都以低效的方式处理整个视频上下文,即使对于稀疏的局部编辑也是如此。在本文中,我们提出了EditCtrl,这是一种高效的视频修复控制框架,仅在需要的地方进行计算。我们的方法包含一个新颖的局部视频上下文模块,仅在遮罩标记上操作,计算成本与编辑大小成正比。然后,该局部优先生成由轻量级的时空全局上下文嵌入器引导,以确保视频范围内的上下文一致性,同时最小化开销。与最先进的生成编辑方法相比,EditCtrl不仅计算效率提高了10倍,甚至在编辑质量上也优于设计时具有全注意力的方法。最后,我们展示了EditCtrl如何解锁新的功能,包括使用文本提示进行多区域编辑和自回归内容传播。
Summary / 总结
EditCtrl is a video editing framework that efficiently handles localized edits by focusing computation only on the masked areas, making it 10 times more compute-efficient than existing methods. It uses a local video context module and a lightweight global context embedder to ensure consistency across the video while improving editing quality. Additionally, it enables new editing capabilities such as multi-region editing with text prompts and autoregressive content propagation.
EditCtrl 是一种高效的视频编辑框架,仅对遮罩区域进行计算,使其比现有方法高效10倍。它使用局部视频上下文模块和轻量级的全局上下文嵌入器来确保视频一致性,同时提高编辑质量。此外,它还支持使用文本提示进行多区域编辑和自回归内容传播等新功能。
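The core idea in the abstract, a local module that operates only on masked tokens so that cost depends on the edit size, can be illustrated with a toy example. The sketch below is not the EditCtrl architecture: it assumes a single self-attention step without learned weights over NumPy arrays, and the names `attention` and `local_edit` are hypothetical.

```python
import numpy as np

def attention(x):
    """Plain single-head self-attention without learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def local_edit(tokens, mask):
    """Run attention on masked tokens only and scatter the result back."""
    out = tokens.copy()
    idx = np.flatnonzero(mask)
    out[idx] = attention(tokens[idx])  # work depends on len(idx), not len(tokens)
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1000, 16))   # 1000 toy video tokens, 16 channels
mask = np.zeros(1000, dtype=bool)
mask[100:150] = True                   # a sparse, localized edit (5% of tokens)

edited = local_edit(tokens, mask)
full_cost = len(tokens) ** 2           # pairwise scores for full attention
local_cost = int(mask.sum()) ** 2      # pairwise scores for the masked subset
print(full_cost // local_cost)         # 400
```

In this toy setting the unmasked tokens are left untouched; the abstract's global context embedder, which keeps the edit consistent with the rest of the video, is not modeled here.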
Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization
Authors: Shangding Gu
First: 2026-02-16T18:59:42+00:00 · Latest: 2026-02-16T18:59:42+00:00
Abstract
Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models -- long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench
中文标题/摘要
标题:长上下文,少关注:通过隐私和个性化揭示的LLM扩展差距
大型语言模型(LLMs)在隐私关键和个人化导向的场景中越来越被部署,但上下文长度在塑造隐私泄露和个人化效果中的作用尚未得到充分探索。我们引入了一个大规模基准PAPerBench,系统研究增加上下文长度如何影响LLMs中的个性化质量和隐私保护。该基准包括约29,000个实例,上下文长度从1K到256K个标记不等,总共生成了377,000个评估问题。它联合评估了在不同场景下的个性化性能和隐私风险,使对长上下文模型行为的可控分析成为可能。对最先进的LLMs进行的广泛评估显示,随着上下文长度的增加,个性化和隐私保护的性能一致下降。我们进一步提供了在上下文扩展下注意力稀释的理论分析,将这种行为解释为固定容量Transformer中软注意力的固有限制。实证和理论发现共同表明,当前模型存在一个普遍的扩展差距——长上下文,少关注。我们发布了该基准以支持可重复评估和未来关于可扩展隐私和个性化的研究。代码和数据可在https://github.com/SafeRL-Lab/PAPerBench获取
Summary / 总结
The paper introduces PAPerBench, a large-scale benchmark to study the impact of context length on personalization and privacy in LLMs. It reveals consistent performance degradation in both areas as context length increases. The authors provide a theoretical analysis of attention dilution, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers, suggesting a general scaling gap in current models: long context, less focus. The benchmark supports reproducible evaluation and future research on scalable privacy and personalization.
论文引入了PAPerBench,一个大规模基准,用于研究上下文长度对LLM个性化和隐私的影响。研究发现,随着上下文长度的增加,两个方面的性能都会一致地下降。作者提供了注意力稀释的理论分析,解释这种行为是固定容量Transformer中软注意力的固有限制所致,表明当前模型存在一个普遍的缩放差距:长上下文,少聚焦。该基准支持可重复评估和未来关于可扩展隐私和个性化的研究。
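The attention dilution argument can be made concrete with a small numeric example. This is a simplified illustration and not the paper's analysis: it assumes one relevant token with a fixed logit margin and treats all other tokens as zero-logit distractors, so the softmax weight on the relevant token is e^m / (e^m + n - 1) for context length n. The function name `relevant_weight` and the margin value are hypothetical.

```python
import numpy as np

def relevant_weight(context_len, margin=5.0):
    """Softmax weight on one relevant token among zero-logit distractors."""
    logits = np.zeros(context_len)
    logits[0] = margin                       # the single relevant token
    weights = np.exp(logits - logits.max())  # stable softmax numerator
    return weights[0] / weights.sum()

lengths = [1_000, 8_000, 64_000, 256_000]    # spans the 1K to 256K range
weights = [relevant_weight(n) for n in lengths]
for n, w in zip(lengths, weights):
    print(f"{n:>7} tokens: weight on relevant token = {w:.4f}")
```

With the margin held fixed, the weight on the relevant token shrinks as the context grows, which is the intuition behind "long context, less focus".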
Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation
Authors: Cai Zhou, Zijie Chen, Zian Li, Jike Wang, Kaiyi Jiang, Pan Li, Rose Yu, Muhan Zhang, Stephen Bates, Tommi Jaakkola
First: 2026-02-16T18:58:55+00:00 · Latest: 2026-02-16T18:58:55+00:00
Comments: 32 pages
Abstract
Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) that canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improve training efficiency. We instantiate the framework for molecular graph generation under $S_n \times SE(3)$ symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even less computation. Moreover, with a novel architecture Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.
中文标题/摘要
标题:通过典范化重新思考具有对称性的扩散模型及其在分子图生成中的应用
化学和科学中的许多生成任务涉及对群对称性(例如,置换和旋转)不变的分布。一种常见的策略是通过架构约束(如等变去噪器和不变先验)来实现不变性和等变性。在本文中,我们通过典范化这一替代视角挑战这一传统:首先将每个样本映射到具有典范姿态或顺序的轨道代表,然后在典范切片上训练一个不受约束(非等变)的扩散或流模型,最后在生成时通过采样随机对称变换恢复不变分布。基于形式化的商空间视角,我们的工作提供了典范扩散的全面理论,证明了:(i) 典范生成模型在不变目标上的正确性、普适性和优越的表达能力;(ii) 典范化通过消除群混合引起的扩散得分复杂性并减少流匹配中的条件方差来加速训练。然后我们展示了对齐的先验和最优传输与典范化互补,可进一步提高训练效率。我们针对 $S_n \times SE(3)$ 对称性下的分子图生成对该框架进行了实例化。通过利用基于几何谱的典范化和轻量的位置编码,典范扩散在3D分子生成任务中显著优于等变基线,计算量相似甚至更少。此外,借助新颖的架构Canon,CanonFlow在具有挑战性的GEOM-DRUG数据集上达到了最先进的性能,并且在少步生成中优势仍然很大。
Summary / 总结
This paper proposes canonicalization as an alternative to architectural symmetry constraints for generative tasks in chemistry and science, with a focus on molecular graph generation. Instead of enforcing invariance and equivariance through equivariant denoisers and invariant priors, the authors map each sample to an orbit representative, train an unconstrained diffusion or flow model on the canonical slice, and recover the invariant distribution by sampling a random symmetry transform at generation time. The method is shown to accelerate training and to outperform equivariant baselines in 3D molecule generation with similar or less computation. With the new Canon architecture, CanonFlow achieves state-of-the-art results on the GEOM-DRUG dataset, and its advantage remains large in few-step generation.
本文从典范化的视角重新思考具有对称性的扩散模型:先将样本映射到具有典范姿态或顺序的轨道代表,在典范切片上训练不受约束的扩散或流模型,再在生成时通过采样随机对称变换恢复不变分布。理论上证明了典范生成模型在不变目标上的正确性、普适性和更强的表达能力。实验表明,典范扩散在3D分子生成中优于等变基线,计算成本相似甚至更低,并在GEOM-DRUG数据集上实现了最先进的性能。
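The three-step recipe in the abstract (canonicalize, train an unconstrained model on the canonical slice, apply a random symmetry transform at generation time) can be sketched for 3D point clouds. The paper uses a geometric spectra-based canonicalization; the sketch below swaps in a simpler, assumed stand-in: centering, principal-axis alignment with sign fixing by third moments, and sorting rows along the first axis. It assumes generic inputs (distinct eigenvalues, non-zero third moments, no ties), and all function names are hypothetical.

```python
import numpy as np

def canonicalize(x):
    """Map an (n, 3) point cloud to a canonical pose and row order."""
    x = x - x.mean(axis=0)                     # remove translation
    _, vecs = np.linalg.eigh(x.T @ x)          # principal axes of the cloud
    y = x @ vecs
    signs = np.sign((y ** 3).sum(axis=0))      # fix axis signs by third moment
    signs[signs == 0] = 1.0
    # choose the last sign so the overall transform is a proper rotation
    signs[2] = np.sign(np.linalg.det(vecs)) * signs[0] * signs[1]
    y = y * signs
    return y[np.argsort(y[:, 0])]              # canonical order along first axis

def random_rotation(rng):
    """Sample a random 3x3 rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def random_symmetry(x, rng):
    """Apply a random permutation and rotation, as done at generation time."""
    return x[rng.permutation(len(x))] @ random_rotation(rng).T

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                    # stand-in for atom coordinates
x_moved = random_symmetry(x, rng) + rng.normal(size=3)
print(np.allclose(canonicalize(x), canonicalize(x_moved), atol=1e-6))
```

The check at the end compares the canonical forms of a point cloud and a permuted, rotated, translated copy of it; a model trained on such canonical forms needs no equivariance constraint, and `random_symmetry` restores the invariant distribution when sampling.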
Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search & Evaluation
Authors: Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev
First: 2026-02-16T18:57:49+00:00 · Latest: 2026-02-16T18:57:49+00:00
Abstract
Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests >85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
中文标题/摘要
标题:全球搜索:药物资产筛选的深度研究AI代理
生物制药创新已发生变化:许多新的药物资产现在起源于美国之外,并主要通过区域性的非英语渠道披露。最近的数据表明,超过85%的专利申请起源于美国之外,其中中国占全球总数的近一半;非美国的学术产出比例也在增加。行业估计显示,中国在全球药物研发中的份额约为30%,涵盖1,200多种新型候选药物。在这种高风险环境中,未能发现“未被注意”的资产会给投资者和业务发展团队带来数十亿美元的风险,使资产筛选成为一场以覆盖率为关键的竞争,速度和完整性决定价值。然而,当前的深度研究AI代理在跨异构、多语言来源实现高召回率且不产生幻觉的发现方面仍落后于人类专家。我们提出了一种药物资产筛选的基准测试方法,以及一个经过调优的、基于树的自学习Bioptic Agent,旨在实现完整的、无幻觉的筛选。我们使用多语言多代理管道构建了一个具有挑战性的完整性基准:复杂用户查询配以大多处于以美国为中心的视野之外的真实资产。为了反映实际交易的复杂性,我们从专家投资者、BD和VC专业人士那里收集了筛选查询,并将其作为先验来有条件地生成基准查询。在评分方面,我们使用按专家意见校准的LLM-as-judge评估。我们将Bioptic Agent与Claude Opus 4.6、OpenAI GPT-5.2 Pro、Perplexity Deep Research、Gemini 3 Pro + Deep Research和Exa Websets进行了比较。Bioptic Agent的F1得分为79.7%,而Claude Opus 4.6为56.2%,Gemini 3 Pro + Deep Research为50.6%,GPT-5.2 Pro为46.6%,Perplexity Deep Research为44.2%,Exa Websets为26.9%。随着计算资源的增加,性能显著提高,支持了更多计算资源会带来更好结果的观点。
Summary / 总结
The research addresses the challenge of surfacing drug assets that are disclosed mainly through regional, non-English channels, a coverage problem that matters to investors and business development teams. It proposes a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent, and compares the agent against other Deep Research systems. Bioptic Agent achieves an F1 score of 79.7%, compared with 56.2% for Claude Opus 4.6, 50.6% for Gemini 3 Pro + Deep Research, 46.6% for GPT-5.2 Pro, 44.2% for Perplexity Deep Research, and 26.9% for Exa Websets. Performance improves steeply with additional compute.
研究旨在识别来自非美国地区的新型药物资产,特别是中国,超过85%的专利申请来自美国以外地区。研究提出了一个基准测试方法,并开发了一个基于树的自学习Bioptic代理,以提高资产筛选能力。Bioptic代理在与Claude Opus 4.6、GPT-5.2 Pro和Gemini 3 Pro + Deep Research等其他AI代理的比较中表现出色,F1得分为79.7%,而其他代理的得分分别为26.9%到56.2%。性能随计算资源的增加而显著提升,强调了计算资源在获得更好结果中的重要性。
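For readers less familiar with the metric, the F1 scores above combine precision (how many returned assets are real matches) and recall (how many ground-truth assets were found). The sketch below assumes exact matching of asset identifiers; the paper grades with an LLM-as-judge calibrated to expert opinions, so this shows only the arithmetic, and the asset names are made up.

```python
def f1_score(predicted, ground_truth):
    """F1 between a set of returned assets and a set of ground-truth assets."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_pos = len(predicted & ground_truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)    # penalizes hallucinated assets
    recall = true_pos / len(ground_truth)    # penalizes missed assets
    return 2 * precision * recall / (precision + recall)

truth = {"asset-A", "asset-B", "asset-C", "asset-D"}
found = {"asset-A", "asset-B", "asset-X"}    # one hallucinated, two missed
print(round(f1_score(found, truth), 3))      # 0.571
```

Because F1 is the harmonic mean of the two quantities, an agent cannot score well by returning everything (low precision) or by returning only a few safe answers (low recall).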
Privileged Information Distillation for Language Models
Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
First: 2026-02-04T18:46:17+00:00 · Latest: 2026-02-16T18:57:38+00:00
Comments: Abstract border should have been purple
Abstract
Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, which typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable, but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. Additionally, we also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill and, in some cases, OPSD, outperform industry standard practices (Supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.
中文标题/摘要
标题:语言模型中的特权信息提炼
训练时的特权信息(PI)可以使语言模型在原本会失败的任务上取得成功,使其成为在困难、长期环境中强化学习的强大工具。然而,在推理时使用PI学习的能力转移到必须在没有PI的情况下行动的策略上仍然是一个基本挑战。我们研究了在多轮代理环境的前沿模型提炼中的这个问题,这些环境通常隐藏其内部推理,仅暴露行动轨迹。这打破了标准的提炼管道,因为成功的行为是可观察的,但推理过程不是。为此,我们引入了π-提炼,这是一种联合教师-学生目标,使用相同的模型同时训练一个PI条件教师和一个未条件的学生。此外,我们还引入了基于强化学习(RL)的反KL惩罚的在线策略自我提炼(OPSD)。我们展示了这两种算法有效地使用仅行动的PI提炼前沿代理。具体来说,我们发现π-提炼,在某些情况下,OPSD,优于假设跨多个代理基准、模型和PI形式的完整思维链监督的行业标准实践(监督微调后进行RL)。我们通过广泛的分析补充了我们的结果,这些分析描述了使PI有效学习的因素,主要集中在π-提炼上,并描述了OPSD具有竞争力的情况。
Summary / 总结
The research aims to transfer capabilities learned with training-time privileged information (PI) to policies that must act without it at inference time, a fundamental challenge when distilling frontier models in multi-turn agentic environments that expose only action trajectories. The study introduces π-Distill, a joint teacher-student objective trained with a single shared model, and On-Policy Self-Distillation (OPSD), which uses RL with a reverse KL penalty between the student and the PI-conditioned teacher. Both effectively distill frontier agents using action-only PI. π-Distill and, in some cases, OPSD outperform the industry-standard pipeline of supervised fine-tuning followed by RL, even though that pipeline assumes access to full Chain-of-Thought supervision, across multiple agentic benchmarks, models, and forms of PI.
研究旨在将利用训练时特权信息(PI)学到的能力转移到推理时无法使用PI的策略中,解决多轮代理环境中前沿模型蒸馏的一项基本挑战。研究引入了π-Distill和On-Policy Self-Distillation(OPSD)方法,表明两者都能有效利用仅含动作的PI来蒸馏前沿代理。关键发现表明,π-Distill以及在某些情况下的OPSD优于监督微调后进行RL的行业标准实践,而后者假设可以获得完整的思维链监督。
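The reverse KL penalty mentioned for OPSD can be illustrated on a single discrete action distribution. This is a minimal sketch under stated assumptions, not the paper's objective: "reverse KL" is taken here to mean KL(student || teacher), the penalty is subtracted from a scalar reward with a hypothetical coefficient `beta`, and the logits are made-up numbers.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over a discrete action distribution."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))

def penalized_reward(reward, student_logits, teacher_logits, beta=0.1):
    """Task reward minus a reverse KL penalty toward the PI-conditioned teacher."""
    return reward - beta * reverse_kl(student_logits, teacher_logits)

student = np.array([1.0, 0.5, -0.5])   # student policy, no privileged information
teacher = np.array([2.0, 0.0, -1.0])   # same model conditioned on PI (assumed values)
print(reverse_kl(student, teacher))
print(penalized_reward(1.0, student, teacher))
```

Because the expectation is taken under the student's own distribution, the penalty can be estimated from on-policy samples, which fits the "on-policy" framing of OPSD.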
Simulating the Real World: A Unified Survey of Multimodal Generative Models
Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
First: 2025-03-06T17:31:43+00:00 · Latest: 2026-02-16T18:57:17+00:00
Comments: Repository for the related papers at https://github.com/ALEEEHU/World-Simulator
Abstract
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this paper, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
中文标题/摘要
标题:模拟现实世界:多模态生成模型综述
理解并复制现实世界是人工智能通用智能(AGI)研究中的关键挑战。为了实现这一目标,许多现有方法,如世界模型,旨在捕捉支配物理世界的根本原则,从而实现更准确的模拟和更有意义的互动。然而,当前的方法通常将不同的模态,包括2D(图像)、视频、3D和4D表示,视为独立的领域,忽视了它们之间的相互依赖性。此外,这些方法通常专注于现实的孤立维度,而没有系统地整合它们之间的联系。在这篇综述中,我们提供了一个统一的多模态生成模型综述,探讨了现实世界模拟中数据维度的发展。具体来说,这篇综述从2D生成(外观)开始,然后转向视频(外观+动力学),再到3D生成(外观+几何),最后达到整合所有维度的4D生成。据我们所知,这是首次尝试在单一框架内系统地统一2D、视频、3D和4D生成的研究。为了指导未来的研究,我们提供了数据集、评估指标和未来方向的全面回顾,为新入门者提供见解。这篇综述为在统一框架内推进多模态生成模型和现实世界模拟的研究架起了桥梁。
Summary / 总结
This survey addresses the challenge of simulating the real world in AGI research by unifying multimodal generative models. It explores the progression from 2D image generation to 4D integration, covering appearance, dynamics, and geometry. The survey provides a comprehensive review of datasets, evaluation metrics, and future directions, aiming to foster a unified framework for multimodal generative models.
本文综述了在AGI研究中模拟真实世界的问题,通过统一多模态生成模型来解决。它涵盖了从2D图像生成到4D集成外观、动力学、几何和时间的进展。综述提供了数据集、评估指标和未来研究方向的全面回顾,旨在促进一个统一的多模态生成模型框架。
Neurosim: A Fast Simulator for Neuromorphic Robot Perception
Authors: Richeek Das, Pratik Chaudhari
First: 2026-02-16T18:57:04+00:00 · Latest: 2026-02-16T18:57:04+00:00
Comments: 13 pages, 6 figures
Abstract
Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim .
中文标题/摘要
标题:Neurosim:一种快速的类脑机器人感知模拟器
Neurosim 是一个快速、实时、高性能的库,用于模拟动态视觉传感器、RGB相机、深度传感器和惯性传感器。它还可以模拟多旋翼车辆在复杂和动态环境中的敏捷动力学。Neurosim 可以在台式机 GPU 上实现高达约 2700 FPS 的帧率。Neurosim 与基于 ZeroMQ 的通信库 Cortex 集成,以促进与机器学习和机器人工作流的无缝集成。Cortex 提供了一个高吞吐量、低延迟的消息传递系统,适用于 Python 和 C++ 应用程序,并原生支持 NumPy 数组和 PyTorch 张量。本文讨论了 Neurosim 和 Cortex 的设计理念。它展示了如何使用这些工具(i)使用时间同步的多模态数据进行自我监督学习来训练类脑感知和控制算法,以及(ii)在闭环中测试这些算法的实时实现。Neurosim 和 Cortex 可在 https://github.com/grasp-lyrl/neurosim 获取。
Summary / 总结
Neurosim is a high-performance library for simulating various sensors and agile dynamics of multi-rotor vehicles in real-time. It achieves up to 2700 FPS on a desktop GPU and integrates with Cortex, a ZeroMQ-based communication library, to facilitate machine learning and robotics workflows. Key findings include the successful use of Neurosim and Cortex for training self-supervised neuromorphic perception and control algorithms and testing real-time implementations in closed-loop systems.
Neurosim 是一个快速模拟神经形态机器人感知的库,能够在桌面 GPU 上达到每秒 2700 帧的速率。它支持模拟各种传感器和多旋翼车辆的敏捷动态。Neurosim 与 Cortex 集成,Cortex 是一个基于 ZeroMQ 的通信库,用于实时机器学习和机器人工作流。主要发现包括使用同步多模态数据进行自我监督学习成功训练神经形态感知和控制算法,并在闭环系统中测试其实时实现。
Scaling Beyond Masked Diffusion Language Models
Authors: Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic
First: 2026-02-16T18:54:47+00:00 · Latest: 2026-02-16T18:54:47+00:00
Comments: code: https://github.com/s-sahoo/scaling-dllms
Abstract
Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
中文标题/摘要
标题:超越掩码扩散语言模型的扩展
扩散语言模型由于其潜在的更快生成能力,是自回归模型的一种有前途的替代方案。在离散扩散方法中,掩码扩散目前占主导地位,主要得益于其在语言建模基准上的强大困惑度。在本研究中,我们首次对均匀状态和插值离散扩散方法进行了扩展律研究。我们还展示了当使用简单的交叉熵目标进行训练时,掩码扩散模型可以大约提高12%的FLOPs效率。我们发现,在扩散家族内部,困惑度是信息性的,但在家族之间,它可能是误导性的,其中具有更差似然扩展的模型可能由于更快和更实用的采样而更可取,这反映在速度-质量帕累托前沿上。这些结果挑战了掩码扩散是扩散语言建模未来且困惑度足以进行跨算法比较的观点。将所有方法扩展到1.7B参数,我们展示了均匀状态扩散在基于似然性的基准上仍然具有竞争力,并在GSM8K上优于自回归和掩码扩散模型,尽管验证困惑度较差。我们在项目页面提供了代码、模型检查点和视频教程:http://s-sahoo.github.io/scaling-dllms
Summary / 总结
This work presents the first scaling law study of uniform-state and interpolating discrete diffusion methods, and shows that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. It finds that perplexity is informative within a diffusion family but can be misleading across families, since models with worse likelihood scaling may be preferable for their faster and more practical sampling, as reflected by the speed-quality Pareto frontier. At 1.7B parameters, uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse (higher) validation perplexity.
这项研究探讨了均匀状态和插值离散扩散方法的扩展规律,挑战了Masked扩散模型的主导地位。研究发现,Masked扩散模型可以通过简单的交叉熵目标变得更高效,而困惑度在不同扩散方法之间比较时并不是一个可靠的指标。在1.7B参数下,均匀状态扩散在GSM8K上的表现优于自回归和Masked扩散模型,尽管验证困惑度更高,这表明更快和更实用的采样可能比更低的困惑度更重要。
Cold-Start Personalization via Training-Free Priors from Structured World Models
Authors: Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov, Maryam Fazel, Lin Xiao, Asli Celikyilmaz
First: 2026-02-16T18:52:13+00:00 · Latest: 2026-02-16T18:52:13+00:00
Comments: 24 pages, 4 figures, 4 tables
Abstract
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
中文标题/摘要
标题:基于结构化世界模型的无训练先验冷启动个性化
冷启动个性化要求在没有用户特定历史数据的情况下,通过交互推断用户偏好。核心挑战是一个路由问题:每个任务包含数十个偏好维度,但每个用户只关心其中的几个,而且哪些维度重要取决于提问的对象。在有限的问题预算下,无结构的提问将错过重要的维度。强化学习是自然的表述方式,但在多轮对话中,其终端奖励未能利用偏好数据按标准分解的事实结构,实践中学习到的策略会退化为静态的问题序列,忽略用户的回应。我们提出将冷启动提取分解为离线结构学习和在线贝叶斯推理。Pep(偏好提取与先验)从完整档案中学习一个结构化的世界模型来描述偏好相关性,然后进行无训练的贝叶斯推理来选择信息性问题并预测完整的偏好档案,包括从未提问过的维度。该框架在下游求解器中是模块化的,只需要简单的信念模型。在医学、数学、社会和常识推理中,Pep 在生成的响应与用户声明的偏好之间的匹配度为 80.8%,而强化学习为 68.5%,交互次数少 3-5 倍。当两个用户对同一问题给出不同答案时,Pep 的后续问题改变比例为 39-62%,而强化学习为 0-28%。它仅使用约 10K 参数,而强化学习为 8B,表明冷启动提取的瓶颈在于利用偏好数据分解结构的能力。
Summary / 总结
This paper addresses the challenge of cold-start personalization by proposing Pep, a framework that decomposes the problem into offline structure learning and online Bayesian inference. Pep learns a structured world model of preference correlations from complete profiles and performs training-free Bayesian inference to select informative questions and predict complete preference profiles. The method achieves higher alignment with users' stated preferences and requires fewer interactions compared to reinforcement learning, demonstrating the effectiveness of exploiting the structured nature of preference data.
该论文通过提出Pep方法,该方法在线下学习结构化的偏好模型,并在线上执行无训练的贝叶斯推理来选择有信息量的问题和预测完整的偏好配置文件。该方法在更少的交互次数下实现了更高的偏好匹配度(80.8% vs. 68.5% 对于RL),并且能够更好地适应不同用户的回答。它使用显著更少的参数(10K vs. 8B 对于RL),通过利用偏好数据的结构化特性来实现这一目标。
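The split into offline structure learning and online training-free Bayesian inference can be sketched with a very small belief model. This is an illustration under simplifying assumptions, not the Pep implementation: the "world model" is assumed to be a table `care_prob[k, d]` giving the probability that a user of latent type k cares about preference dimension d, answers are yes/no, and questions are chosen by expected information gain about the latent type. All names and numbers are hypothetical.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def update(belief, care_prob, dim, answer):
    """Posterior over user types after a yes/no answer about one dimension."""
    likelihood = care_prob[:, dim] if answer else 1.0 - care_prob[:, dim]
    posterior = belief * likelihood
    return posterior / posterior.sum()

def next_question(belief, care_prob, asked=()):
    """Pick the unasked dimension with the largest expected information gain."""
    best_dim, best_gain = None, -np.inf
    for dim in range(care_prob.shape[1]):
        if dim in asked:
            continue
        p_yes = float(belief @ care_prob[:, dim])
        gain = entropy(belief)
        for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
            if p_answer > 0:
                gain -= p_answer * entropy(update(belief, care_prob, dim, answer))
        if gain > best_gain:
            best_dim, best_gain = dim, gain
    return best_dim

# Hypothetical table learned offline: rows are user types, columns are dimensions.
care_prob = np.array([[0.9, 0.2, 0.5],
                      [0.1, 0.7, 0.5]])
belief = np.array([0.5, 0.5])          # uniform prior over user types

dim = next_question(belief, care_prob)                 # most informative question
belief = update(belief, care_prob, dim, answer=True)   # user says "yes"
profile = belief @ care_prob           # predicted profile, including unasked dimensions
print(dim, belief, profile)
```

Nothing is trained online: each answer only updates the belief, the next question depends on that belief (so different answers lead to different follow-ups), and the full profile is predicted for dimensions that were never asked about.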
On the Semantics of Primary Cause in Hybrid Dynamic Domains
Authors: Shakil M. Khan, Asim Mehmood, Sandra Zilles
First: 2026-02-16T18:25:08+00:00 · Latest: 2026-02-16T18:25:08+00:00
Abstract
Reasoning about actual causes of observed effects is fundamental to the study of rationality. This important problem has been studied since the time of Aristotle, with formal mathematical accounts emerging recently. We live in a world where change due to actions can be both discrete and continuous, that is, hybrid. Yet, despite extensive research on actual causation, only few recent studies looked into causation with continuous change. Building on recent progress, in this paper we propose two definitions of primary cause in a hybrid action-theoretic framework, namely the hybrid temporal situation calculus. One of these is foundational in nature while the other formalizes causation through contributions, which can then be verified from a counterfactual perspective using a modified ``but-for'' test. We prove that these two definitions are indeed equivalent. We then show that our definitions of causation have some intuitively justifiable properties.
中文标题/摘要
标题:混合动态领域中首要原因的语义研究
对观察到的结果的实际原因进行推理是理性研究的基础。这一重要问题自亚里士多德时代起就被研究,而形式化的数学描述直到最近才出现。我们生活的世界中,由动作引起的变化既可以是离散的,也可以是连续的,即混合的。然而,尽管对实际因果关系已有大量研究,只有少数近期研究关注了连续变化下的因果关系。基于最近的进展,本文在一种混合动作理论框架,即混合时序情境演算(hybrid temporal situation calculus)中,提出了首要原因的两种定义。其中一种是基础性的,另一种则通过贡献来形式化因果关系,并可使用修改后的“若非”(but-for)测试从反事实的角度加以验证。我们证明了这两种定义确实是等价的。然后我们展示了所提出的因果关系定义具有一些直观上合理的性质。
Summary / 总结
This paper addresses the problem of identifying primary causes in hybrid dynamic domains where changes can be both discrete and continuous. It proposes two definitions of primary cause within a hybrid temporal situation calculus framework: one foundational and the other based on contributions. The authors prove the equivalence of these definitions and demonstrate that they possess intuitively justifiable properties.
论文探讨了在既包含离散变化也包含连续变化的混合动态领域中,识别观察到的效果的主要原因的问题。它在混合时序情境演算(hybrid temporal situation calculus)框架内提出了首要原因的两种定义,一种是基础性的,另一种基于贡献,并可通过修改后的“若非”(but-for)反事实测试进行验证。作者证明了这两种定义是等价的,并展示了它们具有直观合理的性质。
ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
Authors: Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra
First: 2026-02-16T18:16:19+00:00 · Latest: 2026-02-16T18:16:19+00:00
Comments: 8 Pages with 2 figures of main content. 2 pages of References. 10 pages of appendix with 6 figures
Abstract
Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.
中文标题/摘要
标题:ThermEval:热成像视觉语言模型评估基准
视觉语言模型(VLMs)在RGB图像上表现出色,但在热图像上却无法泛化。热成像在可见光失效的环境中起着关键作用,包括夜间监视、搜索与救援、自动驾驶和医学筛查。与RGB图像不同,热图像编码的是物理温度而非颜色或纹理,这需要感知和推理能力,而现有的以RGB为中心的基准测试并未对其进行评估。我们引入了ThermEval-B,这是一个包含约55,000个热视觉问答对的结构化基准,旨在评估热视觉语言理解所需的底层能力。ThermEval-B将公共数据集与我们新收集的ThermEval-D结合在一起,ThermEval-D是首个提供密集的逐像素温度图和语义身体部分注释的数据集,适用于多种室内外环境。评估25个开源和闭源VLMs,我们发现模型在温度相关推理方面表现一致不佳,在颜色映射变换下性能下降,并倾向于依赖语言先验或固定响应,仅通过提示或监督微调获得微小改进。这些结果表明,热理解需要超越RGB中心假设的专门评估,将ThermEval定位为推动热视觉语言建模进展的基准。
Summary / 总结
ThermEval is a benchmark for evaluating vision-language models on thermal imagery, addressing a gap left by existing RGB-centric benchmarks. Its ThermEval-B component comprises approximately 55,000 thermal visual question answering pairs and integrates public datasets with the newly collected ThermEval-D dataset, which provides dense per-pixel temperature maps with semantic body-part annotations. Evaluating 25 open-source and closed-source VLMs, the study finds that they consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning.
ThermEval 是一个用于评估视觉语言模型在热成像上的基准,解决了它们从 RGB 图像上泛化的不足。它包含约 55,000 个热视觉问答对,并且包含带有像素级温度图和语义标注的新数据集。评估 25 个模型后,研究发现它们在温度推理上表现不佳,在颜色映射变化下表现下降,并依赖于语言先验,表明需要专门的热成像评估基准。
Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations
Authors: Carolin Cissee, Raneen Younis, Zahra Ahmadi
First: 2026-02-16T18:06:53+00:00 · Latest: 2026-02-16T18:06:53+00:00
Abstract
Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.
中文标题/摘要
标题:正交化多模态对比学习与非对称掩码用于结构化表示
多模态学习旨在整合来自异构来源的信息,其中信号可能在模态之间共享、特定于个别模态或仅通过它们的交互而出现。虽然自监督多模态对比学习取得了显著进展,但大多数现有方法主要捕捉冗余的跨模态信号,往往忽略了特定于模态的(独特)和交互驱动的(协同)信息。最近的扩展拓宽了这一视角,但它们要么未能明确建模协同交互,要么以纠缠的方式学习不同的信息组件,导致不完整的表示和潜在的信息泄露。我们引入了**COrAL**,这是一种原理性的框架,明确且同时保留了多模态表示中的冗余、独特和协同信息。COrAL 使用具有正交性约束的双路径架构来分离共享和特定于模态的特征,确保信息组件的清晰分离。为了促进协同建模,我们引入了非对称掩码,具有互补的视图特定模式,促使模型推断跨模态依赖关系,而不是仅依赖冗余线索。在合成基准和多样化的MultiBench数据集上的广泛实验表明,COrAL 一致地匹配或超越了最先进的方法,同时表现出较低的性能变异性。这些结果表明,明确建模多模态信息的完整谱系会产生更稳定、可靠和全面的嵌入。
Summary / 总结
The research aims to improve multimodal contrastive learning by explicitly capturing redundant, unique, and synergistic information. COrAL uses a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features and introduces asymmetric masking to promote cross-modal dependency modeling. Experiments show that COrAL outperforms or matches state-of-the-art methods with low performance variance across runs, indicating the effectiveness of explicitly modeling the full spectrum of multimodal information.
研究旨在通过明确捕捉冗余、独特和协同信息来改进多模态对比学习。COrAL 使用具有正交约束的双路径架构来分离共享和模态特定特征,并引入不对称掩码以促进协同建模。实验表明,COrAL 在不同运行中表现出一致的高性能或与最先进的方法相当,并且性能波动较小。
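As a rough illustration of the orthogonality constraint mentioned in the COrAL abstract above, the sketch below penalizes overlap between a shared embedding and a modality-specific embedding. The abstract does not give the loss, so the penalty form (squared Frobenius norm of the batch cross-covariance), the function name `orthogonality_penalty`, and the toy data are all assumptions made for illustration and not the authors' objective.

```python
import numpy as np

def orthogonality_penalty(shared, specific):
    """Squared Frobenius norm of the batch cross-covariance of two embeddings.

    shared, specific: arrays of shape (batch, dim). The value is zero when every
    shared feature is uncorrelated (over the batch) with every specific feature.
    """
    # Center each feature so the product below is a cross-covariance
    s = shared - shared.mean(axis=0, keepdims=True)
    u = specific - specific.mean(axis=0, keepdims=True)
    cross = s.T @ u / shared.shape[0]            # shape (dim_shared, dim_specific)
    return float(np.sum(cross ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 4))                    # hypothetical shared embedding

# Entangled case: the "specific" embedding is almost a copy of the shared one
entangled = orthogonality_penalty(z, z + 0.1 * rng.normal(size=z.shape))

# Disentangled case: project the shared directions out of a random embedding
zc = z - z.mean(axis=0)
raw = rng.normal(size=(256, 4))
raw_c = raw - raw.mean(axis=0)
residual = raw_c - zc @ np.linalg.lstsq(zc, raw_c, rcond=None)[0]
disentangled = orthogonality_penalty(z, residual)
```

In this toy setting the penalty is large when the two embeddings carry the same signal and vanishes (up to floating-point error) once the shared directions are projected out, which is the behaviour such a term is meant to encourage.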
Evolution Strategies at the Hyperscale
Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster
First: 2025-11-20T18:56:05+00:00 · Latest: 2026-02-16T18:01:18+00:00
Comments: 76 pages, 15 figures, Website at https://eshyperscale.github.io/
Abstract
Evolution Strategies (ES) is a class of powerful black-box optimisation methods that are highly parallelisable and can handle non-differentiable and noisy objectives. However, naïve ES becomes prohibitively expensive at scale on GPUs due to the low arithmetic intensity of batched matrix multiplications with unstructured random perturbations. We introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), which improves arithmetic intensity by structuring individual perturbations as rank-$r$ matrices, resulting in a hundredfold increase in training speed for billion-parameter models at large population sizes, achieving up to 91% of the throughput of pure batch inference. We provide a rigorous theoretical analysis of Gaussian ES for high-dimensional parameter objectives, investigating conditions needed for ES updates to converge in high dimensions. Our results reveal a linearising effect and prove consistency between EGGROLL and ES as parameter dimension increases. Our experiments show that EGGROLL: (1) enables the stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes, (2) is competitive with GRPO for post-training LLMs on reasoning tasks, and (3) does not compromise performance compared to ES in tabula rasa RL settings, despite being faster.
中文标题/摘要
标题:超大规模环境下的进化策略
进化策略(ES)是一类强大的黑盒优化方法,高度并行化且能处理非可微和噪声目标。然而,朴素的ES在GPU上大规模运行时由于批量矩阵乘法与无结构随机扰动的低算术强度变得代价高昂。我们引入了进化引导基因般优化通过低秩学习(EGGROLL),通过将个体扰动结构化为秩-$r$矩阵来提高算术强度,从而在大规模种群下使训练速度提高了百倍,达到纯批量推理吞吐量的91%。我们对高维参数目标的高斯ES进行了严格的理论分析,研究了ES更新在高维收敛所需的条件。我们的结果揭示了线性化效应,并证明了参数维度增加时EGGROLL与ES之间的一致性。我们的实验表明EGGROLL:(1)使非线性递归语言模型的稳定预训练成为可能,这些模型仅在整数数据类型中运行;(2)在推理任务中与GRPO竞争;(3)在空白学习设置中不损害性能,尽管速度更快。
Summary / 总结
The paper addresses the scalability issue of Evolution Strategies (ES) on GPUs by introducing EGGROLL, which structures individual perturbations as rank-$r$ matrices to improve arithmetic intensity. The method achieves a hundredfold increase in training speed for billion-parameter models and up to 91% of the throughput of pure batch inference. Key experimental results include stable pretraining of nonlinear recurrent language models, competitiveness with GRPO for post-training language models on reasoning tasks, and equivalent performance to ES in tabula rasa reinforcement learning settings despite being faster.
论文通过引入EGGROLL方法解决了进化策略(ES)在GPU上的可扩展性问题,该方法将个体扰动结构化为秩-$r$矩阵以提高算术强度。该方法实现了对于十亿参数模型的百倍训练速度提升,并且在纯批量推理吞吐量的91%以上。关键实验结果包括:(1)能够稳定预训练非线性递归语言模型;(2)在后训练语言模型的推理任务中与GRPO竞争;(3)在空白状态强化学习设置中与ES相比性能相当,尽管速度更快。
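To make the arithmetic-intensity argument in the EGGROLL abstract above more concrete, the following sketch compares a dense per-member perturbed forward pass with a low-rank one on a single linear layer. It only illustrates the identity (W + sigma * A B^T) x = W x + sigma * A (B^T x). The Gaussian factors, the value of `sigma`, the absence of any rank-dependent scaling, and all shapes are assumptions for illustration, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, batch = 64, 32, 2, 8
sigma = 0.01                                   # perturbation scale (assumed value)

W = rng.normal(size=(d_out, d_in))             # base weights shared by the population
x = rng.normal(size=(batch, d_in))             # one input per population member
A = rng.normal(size=(batch, d_out, rank))      # per-member left factors
B = rng.normal(size=(batch, d_in, rank))       # per-member right factors

# Dense route: materialise every perturbed matrix W + sigma * A_i B_i^T
E = np.einsum("nor,nir->noi", A, B)            # shape (batch, d_out, d_in)
y_dense = np.einsum("noi,ni->no", W[None] + sigma * E, x)

# Low-rank route: one shared matmul plus two thin per-member products
base = x @ W.T                                 # shape (batch, d_out)
proj = np.einsum("nir,ni->nr", B, x)           # B_i^T x_i, shape (batch, rank)
y_lowrank = base + sigma * np.einsum("nor,nr->no", A, proj)
```

Both routes give the same outputs, but the low-rank route never builds a separate d_out × d_in matrix per population member. The large product `x @ W.T` is one ordinary batched matmul, and the per-member work shrinks to products with the thin factors.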
Faster Molecular Dynamics with Neural Network Potentials via Distilled Multiple Time-Stepping and Non-Conservative Forces
Authors: Nicolaï Gouraud, Côme Cattin, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal
First: 2026-02-16T17:59:44+00:00 · Latest: 2026-02-16T17:59:44+00:00
Abstract
Following our previous work (J. Phys. Chem. Lett., 2026, 17, 5, 1288-1295), we propose the DMTS-NC approach, a distilled multi-time-step (DMTS) strategy using non-conservative (NC) forces to further accelerate atomistic molecular dynamics simulations using foundation neural network models. There, a dual-level reversible reference system propagator algorithm (RESPA) formalism couples a target accurate conservative potential to a simplified distilled representation optimized for the production of non-conservative forces. Despite being non-conservative, the distilled architecture is designed to enforce key physical priors, such as equivariance under rotation and cancellation of atomic force components. These choices facilitate the distillation process and therefore drastically improve the robustness of the simulation, significantly limiting the "holes" in the simpler potential and thus achieving excellent agreement with the force data. Overall, the DMTS-NC scheme is found to be more stable and efficient than its conservative counterpart, with additional speedups reaching 15-30% over DMTS. Requiring no fine-tuning steps, it is easier to implement and can be pushed to the limit of the system's physical resonances to maintain accuracy while providing maximum efficiency. Like DMTS, DMTS-NC is applicable to any neural network potential.
中文标题/摘要
标题:利用蒸馏多时间步长和非保守力的神经网络势加速分子动力学
在我们之前的工作(J. Phys. Chem. Lett., 2026, 17, 5, 1288-1295)基础上,我们提出了DMTS-NC方法,这是一种使用非保守力的蒸馏多时间步长(DMTS)策略,以进一步利用基础神经网络模型加速原子分子动力学模拟。该方法采用双级可逆参考系统推进器算法(RESPA)形式主义,将目标准确的保守势与简化优化的蒸馏表示相结合,用于生成非保守力。尽管是非保守的,但蒸馏架构被设计成强制执行关键的物理先验,如旋转下的不变性和原子力分量的抵消。这些选择促进了蒸馏过程,从而极大地提高了模拟的稳健性,显著限制了更简单势中的“空洞”,从而实现了与力数据的极好一致。总体而言,DMTS-NC方案比其保守对应物更稳定、更高效,且额外的速度提升可达15-30%。无需微调步骤,它更易于实现,并能在保持准确性的同时提供最大效率。与DMTS一样,DMTS-NC适用于任何神经网络势。
Summary / 总结
The research aims to enhance the efficiency of molecular dynamics simulations by introducing the DMTS-NC approach, which combines a distilled multi-time-step strategy with non-conservative forces. This method uses a dual-level reversible reference system propagator algorithm to couple a target accurate conservative potential with a simplified distilled representation optimized for producing non-conservative forces. The key findings show that DMTS-NC is more stable and efficient than its conservative counterpart, offering additional speedups of 15-30% over DMTS without requiring fine-tuning steps. The approach maintains accuracy while pushing the limits of system physical resonances for maximum efficiency.
研究旨在通过提出DMTS-NC方法来加速分子动力学模拟,该方法结合了多时间步策略和非保守力。该方法使用双层可逆参考系统传播器算法将目标准确的保守势与简化了的优化非保守力表示耦合。关键发现表明,DMTS-NC比其保守的对应物更稳定和高效,提供了15-30%的额外加速,同时保持了准确性和鲁棒性,无需进行微调步骤。
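The dual-level RESPA coupling described in the DMTS-NC abstract above can be sketched on a one-dimensional toy system. Here a cheap force drives several small inner steps, and the difference between the target force and the cheap force is applied as a correction kick at the outer step. The two force functions, the step sizes, and the unit mass are hypothetical stand-ins. The toy cheap force is also conservative, since any position-only force in one dimension is, so the sketch shows only the multi-time-step structure and not the non-conservative distilled model of the paper.

```python
def target_force(x):
    """Stand-in for the expensive reference model (hypothetical anharmonic oscillator)."""
    return -x - 0.1 * x ** 3

def cheap_force(x):
    """Stand-in for the fast distilled model: only the harmonic part."""
    return -x

def energy(x, v):
    """Total energy of the target system with unit mass."""
    return 0.5 * v ** 2 + 0.5 * x ** 2 + 0.025 * x ** 4

def respa_step(x, v, dt_outer, n_inner):
    """One outer step: correction kicks wrap n_inner velocity-Verlet steps of the cheap force."""
    v += 0.5 * dt_outer * (target_force(x) - cheap_force(x))
    dt_inner = dt_outer / n_inner
    for _ in range(n_inner):
        v += 0.5 * dt_inner * cheap_force(x)
        x += dt_inner * v
        v += 0.5 * dt_inner * cheap_force(x)
    v += 0.5 * dt_outer * (target_force(x) - cheap_force(x))
    return x, v

x, v = 1.0, 0.0
e0 = energy(x, v)
for _ in range(1000):
    x, v = respa_step(x, v, dt_outer=0.1, n_inner=4)
drift = abs(energy(x, v) - e0)
```

For clarity the sketch evaluates the target force at both ends of every outer step. Because the closing evaluation of one step coincides with the opening evaluation of the next, a practical implementation would cache it, so the expensive model runs once per outer step while the cheap model runs `n_inner` times.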
PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement
Authors: Yian Wang, Han Yang, Minghao Guo, Xiaowen Qiu, Tsun-Hsuan Wang, Wojciech Matusik, Joshua B. Tenenbaum, Chuang Gan
Venue: ICLR 2026
First: 2026-02-16T17:55:25+00:00 · Latest: 2026-02-16T17:55:25+00:00
Comments: ICLR 2026
Abstract
Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.
中文标题/摘要
标题:PhyScensis:增强物理的LLM代理用于复杂物理场景布局
自动生成交互式3D环境对于扩大机器人数据收集在模拟中的规模至关重要。尽管先前的工作主要集中在3D资产放置上,但它们往往忽略了物体之间的物理关系(例如接触、支撑、平衡和包含),这些关系对于创建复杂的现实操作场景(如桌面布局、架子整理或盒子打包)至关重要。与传统的3D布局生成相比,生成复杂的物理场景带来了额外的挑战:(a) 更高的物体密度和复杂性(例如,一个小架子可能容纳几十本书),(b) 更丰富的支撑关系和紧凑的空间布局,以及(c) 需要准确建模空间放置和物理属性。为了解决这些挑战,我们提出PhyScensis,这是一种基于LLM代理的框架,由物理引擎驱动,以生成具有高复杂度的物理上合理的场景配置。具体而言,我们的框架由三个主要组件组成:一个LLM代理迭代地提出具有空间和物理谓词的资产;一个配备物理引擎的求解器将这些谓词实现为3D场景;以及求解器的反馈指导代理改进和完善配置。此外,我们的框架通过概率编程保持对细粒度文本描述和数值参数(例如相对位置、场景稳定性)的强大控制,通过互补启发式方法联合调节稳定性和空间关系。实验结果表明,我们的方法在场景复杂性、视觉质量和物理准确性方面优于先前的方法,提供了一种统一的生成机器人操作所需的复杂物理场景布局的管道。
Summary / 总结
PhyScensis is an LLM agent-based framework that uses a physics engine to generate complex and physically plausible 3D scenes for robotic manipulation. It addresses the challenges of higher object density, richer supporting relationships, and accurate modeling of spatial and physical properties. The framework consists of an LLM agent, a solver with a physics engine, and feedback mechanisms to refine scene configurations. Experimental results demonstrate that PhyScensis outperforms previous methods in terms of scene complexity, visual quality, and physical accuracy, providing a unified pipeline for complex physical scene generation.
PhyScensis 是一个基于 LLM 的框架,结合物理引擎生成复杂且物理上合理的 3D 场景,解决高密度物体、丰富支撑关系以及精确建模空间放置和物理属性的挑战。该框架包括一个 LLM 代理、一个带有物理引擎的求解器和反馈机制以优化场景配置。实验结果表明,PhyScensis 在场景复杂性、视觉质量和物理准确性方面优于先前的方法,提供了一个统一的管道用于机器人操作场景的生成。
PAct: Part-Decomposed Single-View Articulated Object Generation
Authors: Qingming Liu, Xinyue Yao, Shuyuan Zhang, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
First: 2026-02-16T17:45:44+00:00 · Latest: 2026-02-16T17:45:44+00:00
Comments: Technical Report (11 figures, 14 pages), Project Page: https://PAct-project.github.io
Abstract
Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.
中文标题/摘要
标题:PAct:分解部件的单视图 articulated 对象生成
articulated 对象是交互式3D应用的核心,包括具身AI、机器人技术和VR/AR,其中功能部件分解和运动学运动至关重要。然而,由于需要可靠的部件分解和运动学布线,生成高质量的articulated资产仍然难以规模化。现有方法主要分为两类:基于优化的重建或蒸馏,这可能很准确,但通常需要数分钟到数小时才能生成一个实例;以及在推理时依赖于模板或部件检索的方法,生成的可能是合理的结果,但可能与输入观察的具体结构和外观不符。我们提出了一种以部件为中心的生成框架,用于articulated对象的创建,该框架在明确的部件感知条件下综合部件几何、组成和运动。我们的表示将对象建模为一组可移动部件,每个部件由嵌入部件身份和运动学线索的潜在令牌编码。在单张图像的条件下,该模型生成的articulated 3D资产保持实例级对应关系,同时保持有效的部件结构和运动。该方法避免了逐实例优化,实现了快速前向推理,并支持可控的组装和运动,这对于具身交互至关重要。在常见articulated类别(例如抽屉和门)上的实验表明,与基于优化和检索驱动的基线相比,该方法在输入一致性、部件准确性和运动学合理性方面有所改进,同时显著减少了推理时间。
Summary / 总结
The research aims to create high-fidelity articulated objects for interactive 3D applications by addressing the challenges of part decomposition and kinematic rigging. The proposed PAct framework uses a part-centric generative model that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. This approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation. Experiments show improved input consistency, part accuracy, and articulation plausibility compared to existing methods, while reducing inference time significantly.
研究旨在通过解决部件分解和运动学装配的挑战,为交互式3D应用生成高保真度的活动部件对象。提出的PAct方法使用一种以部件为中心的生成框架,该框架在明确的部件感知条件下综合部件几何、组成和运动。该方法可以从单张图像生成保留实例级对应关系和有效部件结构与运动的3D活动部件对象,同时比现有的基于优化和检索驱动的方法减少推理时间。实验结果显示了更好的输入一致性、部件准确性和运动合理性,且推理时间显著减少。
Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Authors: Rishikesh Devanathan, Varun Nathan, Ayush Kumar
First: 2025-08-25T17:10:36+00:00 · Latest: 2026-02-16T17:44:09+00:00
Abstract
Synthetic data is increasingly critical for contact centers, where privacy constraints and data scarcity limit the availability of real conversations. However, generating synthetic dialogues that are realistic and useful for downstream applications remains challenging. In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages. To test downstream utility, we evaluate synthetic transcripts on an automated quality assurance (AutoQA) task, finding that prompts optimized on real transcripts consistently outperform those optimized on synthetic transcripts. These results suggest that current synthetic transcripts fall short in capturing the full realism of real agent-customer interactions. To highlight these downstream gaps, we introduce a diagnostic evaluation framework comprising 17 metrics across four dimensions: (1) Emotional and Sentiment Arcs, (2) Linguistic Complexity, (3) Interaction Style, and (4) Conversational Properties. Our analysis shows that even with structured supervision, current generation strategies exhibit measurable deficiencies in sentiment fidelity, disfluency modeling, behavioral variation, and conversational realism. Together, these results highlight the importance of diagnostic, metric-driven evaluation for synthetic conversation generation intended for downstream applications.
中文标题/摘要
标题:为什么合成数据尚未真实:接触中心对话生成的诊断框架
合成数据在接触中心中越来越关键,由于隐私限制和数据稀缺,真实对话的可用性受到限制。然而,生成既现实又对下游应用有用的合成对话仍然具有挑战性。在本研究中,我们基于对通话属性(意图摘要、主题流程和质量保证表单)的结构化监督,对多种生成策略进行了基准测试,跨越了多种语言。为了测试下游应用的实用性,我们在自动质量保证(AutoQA)任务上评估了合成转录,发现针对真实转录优化的提示始终优于针对合成转录优化的提示。这些结果表明,当前的合成转录未能充分捕捉到真实座席-客户互动的全部现实性。为了突出这些下游差距,我们引入了一个诊断评估框架,包括17个指标,涵盖四个维度:(1)情感和情感弧线,(2)语言复杂性,(3)互动风格,(4)对话属性。我们的分析表明,即使有结构化的监督,当前的生成策略在情感保真度、不流畅性建模、行为变化和对话现实性方面仍存在可测量的缺陷。这些结果共同强调了对旨在用于下游应用的合成对话生成进行诊断性、基于指标的评估的重要性。
Summary / 总结
This work addresses the challenge of generating realistic synthetic dialogues for contact centers, where privacy and data scarcity constraints are prevalent. The authors benchmark various generation strategies using structured supervision on call attributes and evaluate their downstream utility through an automated quality assurance task. They find that prompts optimized on real transcripts outperform those optimized on synthetic ones, indicating current synthetic dialogues lack full realism. To diagnose these gaps, the authors introduce a diagnostic evaluation framework with 17 metrics covering emotional arcs, linguistic complexity, interaction style, and conversational properties, revealing deficiencies in sentiment fidelity, disfluency modeling, and behavioral variation even with structured supervision.
该研究旨在解决生成现实合成对话的挑战,因为隐私和数据稀缺限制了对真实对话的访问。研究人员使用跨多种语言的呼叫属性结构化监督来评估各种生成策略,并通过自动化质量保证任务测试其实用性。结果表明,当前的合成对话缺乏真实互动的全部现实性。为了诊断这些差距,作者引入了一个包含17个指标的诊断评估框架,涵盖情感弧线、语言复杂性、互动风格和对话属性,揭示了情感一致性、不流畅建模和行为变化方面的不足。这项工作强调了在实际应用中对合成对话生成进行度量驱动评估的重要性。
AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach
First: 2026-02-06T16:45:02+00:00 · Latest: 2026-02-16T17:43:19+00:00
Comments: 49 pages, 14 figures, 10 tables
Abstract
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle (including idea generation, experiment analysis and iterative refinement) without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
中文标题/摘要
标题:AIRS-Bench:前沿人工智能研究科学代理的一套任务集
大语言模型代理在推进科学研究方面具有巨大潜力。为了加速这一进程,我们引入了AIRS-Bench(人工智能研究科学基准),这是一个包含20项任务的套件,这些任务源自最新的机器学习论文。这些任务涵盖了语言建模、数学、生物信息学和时间序列预测等多个领域。AIRS-Bench任务评估代理在整个研究生命周期中的能力,包括创意生成、实验分析和迭代改进,而不提供基线代码。AIRS-Bench任务格式灵活,便于新任务的集成和不同代理框架之间的严格比较。我们使用前沿模型与顺序和并行支架相结合来建立基线。结果显示,代理在四项任务中超过了人类的SOTA,但在其他十六项任务中未能达到。即使代理超越了人类基准,它们也没有达到底层任务的理论性能上限。这些发现表明,AIRS-Bench远未饱和,提供了巨大的改进空间。我们开源了AIRS-Bench任务定义和评估代码,以促进自主科学研究的发展。
Summary / 总结
AIRS-Bench is a suite of 20 tasks designed to evaluate the capabilities of AI agents in scientific research, covering domains like language modeling, mathematics, bioinformatics, and time series forecasting. The tasks assess agents throughout the research lifecycle, from idea generation to iterative refinement, without providing baseline code. Initial results show that agents exceed human SOTA in four tasks but fall short in sixteen others, indicating significant room for improvement. The benchmark is open-sourced to promote further development in autonomous scientific research.
AIRS-Bench 是一个包含 20 个任务的基准套件,这些任务来自最新的机器学习论文,涵盖了多个科学领域。它评估代理在整个研究生命周期中的能力,从想法生成到迭代改进,不提供基线代码。基准结果显示,代理在四个任务中超越了人类,但在十六个任务中未能达到。这些发现表明,AIRS-Bench 远未饱和,仍有很大的改进空间。作者开源了任务定义和评估代码,以促进自主科学研究的进一步发展。
Gradient Networks for Universal Magnetic Modeling of Synchronous Machines
Authors: Junyi Li, Tim Foissner, Floran Martin, Antti Piippo, Marko Hinkkanen
First: 2026-02-16T17:28:42+00:00 · Latest: 2026-02-16T17:28:42+00:00
Abstract
This paper presents a physics-informed neural network approach for dynamic modeling of saturable synchronous machines, including cases with spatial harmonics. We introduce an architecture that incorporates gradient networks directly into the fundamental machine equations, enabling accurate modeling of the nonlinear and coupled electromagnetic constitutive relationship. By learning the gradient of the magnetic field energy, the model inherently satisfies energy balance (reciprocity conditions). The proposed architecture can universally approximate any physically feasible magnetic behavior and offers several advantages over lookup tables and standard machine learning models: it requires less training data, ensures monotonicity and reliable extrapolation, and produces smooth outputs. These properties further enable robust model inversion and optimal trajectory generation, often needed in control applications. We validate the proposed approach using measured and finite-element method (FEM) datasets from a 5.6-kW permanent-magnet (PM) synchronous reluctance machine. Results demonstrate accurate and physically consistent models, even with limited training data.
中文标题/摘要
标题:梯度网络在同步电机磁特性通用建模中的应用
本文提出了一种基于物理信息的神经网络方法,用于饱和同步电机的动力学建模,包括包含空间谐波的情况。我们提出了一种架构,将梯度网络直接集成到基本电机方程中,从而能够准确建模非线性和耦合的电磁本构关系。通过学习磁场能量的梯度,该模型自然满足能量守恒(互易条件)。所提出的架构可以普遍逼近任何物理上可行的磁行为,并且相对于查找表和标准机器学习模型具有几个优势:需要较少的训练数据,确保单调性和可靠的外推,并产生平滑的输出。这些特性进一步使稳健的模型反演和最优轨迹生成成为可能,这些通常在控制应用中是需要的。我们使用5.6-kW 永磁(PM)同步磁阻电机的测量和有限元方法(FEM)数据集验证了所提出的方法。结果表明,即使使用有限的训练数据,也能获得准确且物理上一致的模型。
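The reciprocity claim in the Gradient Networks abstract above follows from a general fact: if the currents are defined as the gradient of a scalar energy, their Jacobian is the Hessian of that energy and is therefore symmetric. The short check below illustrates this numerically. The polynomial energy, its coefficients, and the choice of flux linkages (`psi_d`, `psi_q`) as inputs are hypothetical and stand in for whatever a trained gradient network would represent.

```python
def energy(psi_d, psi_q):
    """Hypothetical smooth magnetic field energy with a cross-saturation term."""
    return psi_d ** 2 + 2.5 * psi_q ** 2 + 0.3 * psi_d ** 2 * psi_q ** 2 + 0.1 * psi_d ** 4

def currents(psi_d, psi_q):
    """Currents as the analytic gradient of the energy above: i = dW/dpsi."""
    i_d = 2.0 * psi_d + 0.6 * psi_d * psi_q ** 2 + 0.4 * psi_d ** 3
    i_q = 5.0 * psi_q + 0.6 * psi_d ** 2 * psi_q
    return i_d, i_q

def cross_derivatives(psi_d, psi_q, h=1e-6):
    """Central finite differences of d i_d / d psi_q and d i_q / d psi_d."""
    did_dq = (currents(psi_d, psi_q + h)[0] - currents(psi_d, psi_q - h)[0]) / (2 * h)
    diq_dd = (currents(psi_d + h, psi_q)[1] - currents(psi_d - h, psi_q)[1]) / (2 * h)
    return did_dq, diq_dd

# Both cross terms equal 1.2 * psi_d * psi_q, the mixed second derivative of the energy
did_dq, diq_dd = cross_derivatives(0.8, -0.3)
```

A model that instead predicted `i_d` and `i_q` with two unrelated functions would have no reason to satisfy this symmetry, which is the motivation for learning the gradient of a single energy function.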
AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
First: 2026-02-16T17:23:08+00:00 · Latest: 2026-02-16T17:23:08+00:00
Comments: Project website: https://zunwang1.github.io/AnchorWeave
Abstract
Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
中文标题/摘要
标题:AnchorWeave:基于检索局部空间记忆的世界一致视频生成
在长时间范围内保持空间世界一致性仍然是相机可控视频生成中的一个核心挑战。现有的基于记忆的方法通常通过从历史重建的几何体中渲染锚视频来对生成进行全局重建。然而,从多个视角重建全局3D场景不可避免地会引入跨视角对齐错误,因为姿态和深度估计错误导致相同的表面在不同视角中被重建在略有不同的3D位置。当融合时,这些不一致性会累积成噪声几何体,污染条件信号并降低生成质量。我们提出了AnchorWeave,一种增强的记忆视频生成框架,用多个干净的局部几何记忆取代单一的对齐错误的全局记忆,并学习解决它们的跨视角不一致性。为此,AnchorWeave 进行目标轨迹驱动的局部记忆检索,并在生成过程中通过多锚点编织控制器整合所选的局部记忆。广泛的实验表明,AnchorWeave 显著提高了长期场景一致性,同时保持了强大的视觉质量,消融和分析研究进一步验证了局部几何条件、多锚点控制和目标轨迹驱动检索的有效性。
Summary / 总结
AnchorWeave addresses the challenge of maintaining spatial world consistency in long-term camera-controllable video generation. It replaces a single misaligned global memory with multiple clean local geometric memories and reconciles their cross-view inconsistencies through a multi-anchor weaving controller. Experiments show that AnchorWeave enhances long-term scene consistency while preserving visual quality.
AnchorWeave 是一种视频生成框架,旨在解决长时间内保持空间一致性的挑战。它用多个干净的局部几何记忆取代了一个错位的全局记忆,并学习解决它们之间的跨视图不一致性。通过执行覆盖驱动的局部记忆检索并将选定的记忆通过多锚点编织控制器进行整合,AnchorWeave 提高了长期场景的一致性,同时保持了高质量的视觉效果。实验表明,与现有方法相比,它在场景一致性方面取得了显著改进,并且消融研究验证了局部几何条件、多锚点控制和覆盖驱动检索的有效性。
iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Authors: Shuai Wang, Yinan Yu
Venue: ACL 2025
First: 2025-06-02T15:30:02+00:00 · Latest: 2026-02-16T17:22:05+00:00
Comments: Accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Main Track
Abstract
Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs. The code is publicly available at: https://github.com/Wangshuaiia/iQUEST.
中文标题/摘要
标题:iQUEST:一种迭代式问题引导的知识库问答框架
大型语言模型(LLMs)在许多自然语言处理任务中表现出色,但在知识密集型环境中往往会出现事实不一致的情况。通过整合外部知识资源,特别是知识图谱(KGs),可以为更可靠的推理提供透明且可更新的基础。知识库问答(KBQA),即查询和推理知识图谱,是这一努力的核心,尤其是对于复杂的多跳查询。然而,多跳推理提出了两个关键挑战:(1)保持连贯的推理路径,(2)避免过早丢弃关键的多跳连接。为了解决这些挑战,我们引入了iQUEST,这是一种问题引导的知识库问答框架,通过迭代地将复杂查询分解为更简单的子问题,确保结构化的和聚焦的推理轨迹。此外,我们还集成了一个图神经网络(GNN),在每次推理步骤中向前看并整合2跳邻居信息。这种双重方法加强了推理过程,使模型能够更有效地探索可行路径。详细的实验表明,iQUEST在四个基准数据集和四个LLMs上都实现了持续改进。代码已公开发布于:https://github.com/Wangshuaiia/iQUEST。
Summary / 总结
iQUEST is an iterative question-guided framework for Knowledge Base Question Answering (KBQA) that addresses the challenges of multi-hop reasoning by decomposing complex queries into simpler sub-questions and integrating a Graph Neural Network (GNN) to incorporate 2-hop neighbor information. Experiments show consistent improvement across four benchmark datasets and four Large Language Models (LLMs).
iQUEST 是一个迭代问题导向的知识库问答(KBQA)框架,通过将复杂查询分解为更简单的子问题,并结合图神经网络(GNN)来引入两跳邻居信息,来解决多跳推理的挑战。实验结果显示,该框架在四个基准数据集和四个大型语言模型上均表现出一致的改进。
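The iQUEST abstract above describes looking ahead to 2-hop neighbors at each reasoning step. As a minimal sketch of what that neighborhood contains, the code below enumerates 2-hop paths from an entity in a toy knowledge graph. The triples, the function names, and the plain enumeration are hypothetical. The paper uses a GNN to incorporate this information, which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("France", "currency", "Euro"),
    ("Berlin", "capital_of", "Germany"),
]

def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return adjacency

def two_hop_paths(adjacency, entity):
    """Enumerate 2-hop paths from entity, the context a look-ahead scorer could consume."""
    paths = []
    for rel1, mid in adjacency.get(entity, []):
        for rel2, tail in adjacency.get(mid, []):
            paths.append((rel1, mid, rel2, tail))
    return paths

adjacency = build_adjacency(TRIPLES)
paths = two_hop_paths(adjacency, "Paris")
```

Seen from "Paris", the 1-hop edge `capital_of` alone says nothing about currencies, but the 2-hop paths reveal that following it leads to `currency` and `member_of` edges. This is the kind of look-ahead that can keep a useful first hop from being discarded too early.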
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Authors: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
Venue: ICLR 2026 Oral
First: 2025-09-19T16:09:33+00:00 · Latest: 2026-02-16T17:14:06+00:00
Comments: ICLR 2026 Oral
Abstract
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
中文标题/摘要
标题:DiffusionNFT:在线扩散强化学习与前向过程
在线强化学习(RL)一直是后训练语言模型的核心,但将其扩展到扩散模型仍然具有挑战性,因为难以处理似然性。近期工作将反向采样过程离散化以实现类似GRPO的训练,但它们继承了包括求解器限制、前向-反向不一致和与无分类器引导(CFG)复杂集成在内的根本性缺点。我们提出了Diffusion Negative-aware FineTuning(DiffusionNFT),这是一种新的在线RL范式,通过流匹配直接优化扩散模型的前向过程。DiffusionNFT通过对比正向和负向生成来定义隐式策略改进方向,自然地将强化信号纳入监督学习目标中。这种表述允许使用任意黑盒求解器进行训练,消除了似然性估计的需要,并只需干净的图像而非采样轨迹进行策略优化。在直接对比中,DiffusionNFT比FlowGRPO高效25倍以上,同时是无CFG的。例如,DiffusionNFT在1000步内将GenEval得分从0.24提高到0.98,而FlowGRPO在超过5000步和额外使用CFG的情况下仅达到0.95。通过利用多个奖励模型,DiffusionNFT显著提升了SD3.5-Medium在所有测试基准中的性能。
Summary / 总结
DiffusionNFT is an online reinforcement learning method that optimizes diffusion models directly on the forward process via flow matching, contrasting positive and negative generations to define an implicit policy improvement direction. This approach enables training with arbitrary black-box solvers, avoids likelihood estimation, and requires only clean images rather than sampling trajectories. DiffusionNFT is up to 25 times more efficient than FlowGRPO while being CFG-free: it raises the GenEval score from 0.24 to 0.98 within 1k steps, whereas FlowGRPO reaches 0.95 with over 5k steps and additional classifier-free guidance (CFG). With multiple reward models, it also boosts SD3.5-Medium performance on every benchmark tested.
DiffusionNFT 是一种通过流匹配直接在前向过程上优化扩散模型的在线强化学习方法,通过对比正向和负向生成来定义隐式的策略改进方向。这种方法允许使用任意黑盒求解器进行训练,避免了似然估计,并只需要干净的图像。DiffusionNFT 的效率最高可达 FlowGRPO 的 25 倍,在 1k 步内将 GenEval 分数从 0.24 提升到 0.98,而 FlowGRPO 需要超过 5k 步并额外使用无分类器引导(CFG)才达到 0.95。此外,它还显著提升了 SD3.5-Medium 在各个基准测试中的性能。
Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Authors: Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Vishnu Bhaskar, Hardik Prajapati, Cheng Peng, Rama Chellappa, Shivkumar Chandrasekaran, B. S. Manjunath
First: 2026-02-16T17:06:54+00:00 · Latest: 2026-02-16T17:06:54+00:00
Abstract
Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.
中文标题/摘要
标题:Wrivinder:迈向将地面图像地理定位到卫星影像的空间智能
将地面视角图像与地理注册的卫星地图对齐对于制图、导航和态势感知至关重要,但在视角差异较大或GPS不可靠的情况下仍然具有挑战性。我们提出了Wrivinder,这是一种零样本、几何驱动的框架,它聚合多张地面照片来重建一致的3D场景,并将其与俯视的卫星影像对齐。Wrivinder结合了SfM重建、3D高斯泼溅、语义定位和基于单目深度的度量线索,生成稳定的天顶视角渲染,可以直接与卫星上下文匹配,实现度量级准确的相机地理定位。由于这一任务缺乏合适的基准,为支持对其进行系统评估,我们还发布了MC-Sat,这是一个精心构建的数据集,将多视角地面图像与多种户外环境中的地理注册卫星瓦片关联起来。Wrivinder和MC-Sat共同为研究无需配对监督的、以几何为中心的跨视角对齐提供了首个全面的基线和测试平台。在零样本实验中,Wrivinder在密集场景和大面积场景中均实现了低于30米的地理定位精度,突显了基于几何的聚合方法在稳健的地面到卫星定位方面的潜力。
Summary / 总结
Wrivinder is a zero-shot, geometry-driven framework that reconstructs a consistent 3D scene from multiple ground photographs to align with satellite imagery, addressing challenges in geo-localization under large viewpoint gaps. It combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth cues to produce a stable zenith-view rendering. Experiments show that Wrivinder achieves sub-30m geolocation accuracy in various scenes, demonstrating the effectiveness of geometry-based aggregation for robust ground-to-satellite localization. Additionally, MC-Sat, a curated dataset, supports systematic evaluation of this task by linking multi-view ground imagery with geo-registered satellite tiles across diverse environments.
Wrivinder 是一种零样本、基于几何的框架,通过从多张地面照片重建一致的 3D 场景来与卫星图像对齐,解决在大视角差距下地理定位的挑战。它结合了 SfM 重建、3D 高斯泼溅、语义定位和单目深度线索,生成稳定的天顶视角渲染。实验结果显示,Wrivinder 在各种场景中实现了低于 30 米的地理定位精度,展示了基于几何的聚合方法在地面到卫星定位中的鲁棒性。此外,MC-Sat 是一个经过整理的数据集,通过将多视角地面图像与多种户外环境中的地理注册卫星瓦片关联起来,支持该任务的系统评估。
From Classical to Quantum: Extending Prometheus for Unsupervised Discovery of Phase Transitions in Three Dimensions and Quantum Systems
Authors: Brandon Yee, Wilson Collins, Maximilian Rutkowski
First: 2026-02-16T17:06:20+00:00 · Latest: 2026-02-16T17:06:20+00:00
Abstract
We extend the Prometheus framework for unsupervised phase transition discovery from 2D classical systems to 3D classical and quantum many-body systems, addressing scalability in higher dimensions and generalization to quantum fluctuations. For the 3D Ising model ($L \leq 32$), the framework detects the critical temperature within 0.01% of literature values ($T_c/J = 4.511 \pm 0.005$) and extracts critical exponents with $\geq 70\%$ accuracy ($\beta = 0.328 \pm 0.015$, $\gamma = 1.24 \pm 0.06$, $\nu = 0.632 \pm 0.025$), correctly identifying the 3D Ising universality class via $\chi^2$ comparison ($p = 0.72$) without analytical guidance. For quantum systems, we developed quantum-aware VAE (Q-VAE) architectures using complex-valued wavefunctions and fidelity-based loss. Applied to the transverse field Ising model, we achieve 2% accuracy in quantum critical point detection ($h_c/J = 1.00 \pm 0.02$) and successfully discover ground state magnetization as the order parameter ($r = 0.97$). Notably, for the disordered transverse field Ising model, we detect exotic infinite-randomness criticality characterized by activated dynamical scaling $\ln \xi \sim |h - h_c|^{-\psi}$, extracting a tunneling exponent $\psi = 0.48 \pm 0.08$ consistent with theoretical predictions ($\psi = 0.5$). This demonstrates that unsupervised learning can identify qualitatively different types of critical behavior, not just locate critical points. Our systematic validation across classical thermal transitions ($T = 0$ to $T > 0$) and quantum phase transitions ($T = 0$, varying $h$) establishes that VAE-based discovery generalizes across fundamentally different physical domains, providing robust tools for exploring phase diagrams where analytical solutions are unavailable.
中文标题/摘要
标题:从经典到量子:扩展普罗米修斯框架以在三维经典和量子多体系统中无监督发现相变
我们扩展了普罗米修斯框架,用于从二维经典系统到三维经典和量子多体系统的无监督相变发现,解决了高维中的可扩展性问题,并泛化到量子涨落。对于三维伊辛模型($L \leq 32$),该框架检测到的临界温度与文献值相差不到0.01%($T_c/J = 4.511 \pm 0.005$),并以至少70%的准确性提取临界指数($\beta = 0.328 \pm 0.015$,$\gamma = 1.24 \pm 0.06$,$\nu = 0.632 \pm 0.025$),通过$\chi^2$比较正确识别了三维伊辛普适类($p = 0.72$),无需解析指导。对于量子系统,我们开发了量子感知的VAE(Q-VAE)架构,使用复值波函数和基于保真度的损失。应用于横向场伊辛模型,我们实现了2%的量子临界点检测准确性($h_c/J = 1.00 \pm 0.02$),并成功发现基态磁化作为序参量($r = 0.97$)。值得注意的是,对于无序横向场伊辛模型,我们检测到由激活动力学标度$\ln \xi \sim |h - h_c|^{-\psi}$表征的奇异无限随机临界性,提取出隧穿指数$\psi = 0.48 \pm 0.08$,与理论预测一致($\psi = 0.5$)。这表明无监督学习可以识别性质不同的临界行为类型,而不仅仅是定位临界点。我们对经典热相变($T = 0$到$T > 0$)和量子相变($T = 0$,变化$h$)的系统验证表明,基于VAE的发现可以跨根本不同的物理领域泛化,为探索无解析解的相图提供了稳健的工具。
Summary / 总结
The study extends the Prometheus framework for unsupervised phase transition discovery from 2D classical systems to 3D classical and quantum many-body systems. For the 3D Ising model, the framework detects the critical temperature within 0.01% of literature values, extracts critical exponents, and correctly identifies the universality class. In quantum systems, a quantum-aware VAE (Q-VAE) is developed to detect quantum critical points and discover ground state magnetization as the order parameter. The study also identifies infinite-randomness criticality in the disordered transverse field Ising model, demonstrating the framework's ability to generalize across different physical domains.
研究扩展了Prometheus框架,用于发现3D经典和量子系统的相变,提高了在更高维度和量子涨落下的可扩展性和泛化能力。对于3D伊辛模型,它准确地检测了临界温度和临界指数,并识别了3D伊辛普适类。在量子系统中,开发了量子感知的VAE(Q-VAE)架构来检测量子临界点并发现基态磁化这一序参量,并成功识别了无序横向场伊辛模型中的奇异临界行为,提取的隧穿指数与理论预测一致。这表明该框架能够跨不同物理领域进行泛化,并在没有解析指导的情况下识别各种类型的临界行为。
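As a rough illustration of what "consistent with theoretical predictions" means for the tunneling exponent quoted above, the following Python sketch expresses the gap between the extracted value and the predicted one in units of the reported uncertainty. The helper name `sigma_deviation` is hypothetical, and this is a simple back-of-the-envelope check, not the $\chi^2$ comparison the authors describe.

```python
def sigma_deviation(measured, uncertainty, reference):
    """Distance between a measurement and a reference value, in units of the uncertainty."""
    return abs(measured - reference) / uncertainty

# Values quoted in the abstract: psi = 0.48 +/- 0.08 against the prediction psi = 0.5.
psi_deviation = sigma_deviation(measured=0.48, uncertainty=0.08, reference=0.5)

# A deviation well below one standard deviation is what "consistent" refers to here.
is_consistent = psi_deviation < 1.0
```

The extracted exponent sits about a quarter of a standard deviation from the prediction.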
MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation
Authors: Penghui Niu, Jiashuai She, Taotao Cai, Yajuan Zhang, Ping Zhang, Junhua Gu, Jianxin Li
First: 2025-11-12T06:17:49+00:00 · Latest: 2026-02-16T17:02:14+00:00
Abstract
Ground-based cloud image segmentation is a critical research domain for photovoltaic power forecasting. Current deep learning approaches primarily focus on encoder-decoder architectural refinements. However, existing methodologies exhibit several limitations: (1) they rely on dilated convolutions for multi-scale context extraction, lacking the partial feature effectiveness and interoperability of inter-channel; (2) attention-based feature enhancement implementations neglect accuracy-throughput balance; and (3) the decoder modifications fail to establish global interdependencies among hierarchical local features, limiting inference efficiency. To address these challenges, we propose MPCM-Net, a Multi-scale network that integrates Partial attention Convolutions with Mamba architectures to enhance segmentation accuracy and computational efficiency. Specifically, the encoder incorporates MPAC, which comprises: (1) an MPC block with ParCM and ParSM that enables global spatial interaction across multi-scale cloud formations, and (2) an MPA block combining ParAM and ParSM to extract discriminative features with reduced computational complexity. On the decoder side, an M2B is employed to mitigate contextual loss through an SSHD that maintains linear complexity while enabling deep feature aggregation across spatial and scale dimensions. As a key contribution to the community, we also introduce and release the CSRC dataset, a clear-label, fine-grained segmentation benchmark designed to overcome the critical limitations of existing public datasets. Extensive experiments on CSRC demonstrate the superior performance of MPCM-Net over state-of-the-art methods, achieving an optimal balance between segmentation accuracy and inference speed. The dataset and source code will be available at https://github.com/she1110/CSRC.
中文标题/摘要
标题:MPCM-Net:融合部分注意力卷积与Mamba的多尺度网络,用于地基云图像分割
基于地面的云图像分割是光伏电力预测中的关键研究领域。当前的深度学习方法主要集中在编码器-解码器架构的改进上。然而,现有的方法存在几个局限性:(1)它们依赖于膨胀卷积进行多尺度上下文提取,缺乏跨通道的部分特征效果和互操作性;(2)基于注意力的特征增强实现忽略了准确性和吞吐量之间的平衡;(3)解码器修改无法建立层次局部特征之间的全局依赖性,限制了推理效率。为了解决这些挑战,我们提出了MPCM-Net,这是一种结合了部分注意卷积与Mamba架构的多尺度网络,以提高分割准确性和计算效率。具体来说,编码器中包含MPAC,包括:(1)一个包含ParCM和ParSM的MPC块,能够实现多尺度云形成之间的全局空间交互,(2)一个结合ParAM和ParSM的MPA块,用于以减少计算复杂度的方式提取具有区分性的特征。在解码器方面,使用M2B来通过SSHD减轻上下文损失,SSHD保持线性复杂度的同时允许在空间和尺度维度上进行深层特征聚合。作为对社区的重要贡献,我们还引入并发布了CSRC数据集,这是一个清晰标签、细粒度分割基准,旨在克服现有公共数据集的关键局限性。在CSRC上的广泛实验表明,MPCM-Net在分割准确性和推理速度之间实现了最优平衡,其性能优于最先进的方法。数据集和源代码将在https://github.com/she1110/CSRC上提供。
Summary / 总结
MPCM-Net is designed to improve ground-based cloud image segmentation for photovoltaic power forecasting by addressing limitations of existing methods. It is a multi-scale network that integrates Partial attention Convolutions with Mamba architectures. The encoder incorporates MPAC, which comprises an MPC block (ParCM and ParSM) for global spatial interaction across multi-scale cloud formations and an MPA block (ParAM and ParSM) for extracting discriminative features at reduced computational complexity, while the decoder employs an M2B whose SSHD maintains linear complexity and enables deep feature aggregation across spatial and scale dimensions. Experiments on the newly released CSRC dataset show that MPCM-Net outperforms state-of-the-art methods, balancing segmentation accuracy and inference speed.
MPCM-Net通过结合Partial attention Convolutions和Mamba架构来解决现有深度学习方法在地基云图像分割中的局限性。编码器使用MPAC实现全局空间交互,并以较低的计算复杂度提取判别性特征,而解码器采用M2B来保持线性复杂度并实现跨空间和尺度维度的深层特征聚合。在新引入的CSRC数据集上的实验表明,MPCM-Net优于现有最先进的方法,在分割准确性和推理速度之间实现了更好的平衡。数据集和源代码将在https://github.com/she1110/CSRC提供。
MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design
Authors: Gen Zhou, Sugitha Janarthanan, Lianghong Chen, Pingzhao Hu
Venue: ICLR 2026
First: 2026-02-16T17:01:47+00:00 · Latest: 2026-02-16T17:01:47+00:00
Comments: This paper is published in ICLR 2026
Abstract
To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC-AMP, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.
中文标题/摘要
标题:MAC-AMP:面向多目标抗菌肽设计的闭环多智能体协作系统
为应对全球健康威胁的抗生素耐药性问题,抗菌肽(AMP)因其强大的潜力和前景而受到关注。尽管人工智能(AI)被用于推进AMP的发现和设计,但大多数AMP设计模型在平衡活性、毒性、新颖性等关键目标时仍存在问题,使用的是僵硬或不明确的评分方法,使得结果难以解释和优化。随着大型语言模型(LLM)能力的提升和迅速发展,我们转向基于此类模型的AI多智能体合作(多智能体LLM),这在复杂的科学设计场景中显示出快速上升的潜力。基于此,我们介绍了MAC-AMP,这是一种用于多重目标AMP设计的闭环多智能体合作(MAC)系统。该系统实现了一个完全自主的模拟同行评审-自适应强化学习框架,仅需任务描述和示例数据即可设计新型AMP。我们工作的创新之处在于引入了一种具有跨域可转移性的闭环多智能体系统,支持多重目标优化,同时保持可解释性而非“黑盒”。实验表明,MAC-AMP在有效优化多种关键分子性质的AMP生成方面优于其他AMP生成模型,展示了在抗菌活性、AMP可能性、毒性合规性和结构可靠性方面的出色结果。
Summary / 总结
The research aims to address the global health threat of antimicrobial resistance by designing effective antimicrobial peptides (AMPs) using artificial intelligence (AI). MAC-AMP, a closed-loop multi-agent collaboration system, is introduced to optimize AMP design for multiple objectives such as activity, toxicity, and novelty. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for key molecular properties and demonstrating superior results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.
MAC-AMP 是一个闭环多智能体协作系统,旨在通过平衡活性、毒性和新颖性来优化抗菌肽(AMP)设计。它使用具有同行评审和自适应学习的强化学习框架,仅需任务描述和数据集。实验表明,MAC-AMP 在优化多种分子属性方面优于其他模型,包括抗菌活性、AMP 可能性、毒性合规性和结构可靠性。
ReusStdFlow: A Standardized Reusability Framework for Dynamic Workflow Construction in Agentic AI
Authors: Gaoyang Zhang, Shanghong Zou, Yafang Wang, He Zhang, Ruohua Xu, Feng Zhao
First: 2026-02-16T16:56:53+00:00 · Latest: 2026-02-16T16:56:53+00:00
Abstract
To address the "reusability dilemma" and structural hallucinations in enterprise Agentic AI, this paper proposes ReusStdFlow, a framework centered on a novel "Extraction-Storage-Construction" paradigm. The framework deconstructs heterogeneous, platform-specific Domain Specific Languages (DSLs) into standardized, modular workflow segments. It employs a dual knowledge architecture (integrating graph and vector databases) to facilitate synergistic retrieval of both topological structures and functional semantics. Finally, workflows are intelligently assembled using a retrieval-augmented generation (RAG) strategy. Tested on 200 real-world n8n workflows, the system achieves over 90% accuracy in both extraction and construction. This framework provides a standardized solution for the automated reorganization and efficient reuse of enterprise digital assets.
中文标题/摘要
标题:ReusStdFlow:面向智能体AI动态工作流构建的标准化可复用性框架
为了解决“可重用性困境”和结构幻觉问题,本文提出ReusStdFlow框架,该框架以新颖的“提取-存储-构建”范式为中心。该框架将异构的、平台特定的领域特定语言(DSL)拆解为标准化的、模块化的流程片段。它采用结合图数据库和向量数据库的双重知识架构,以促进拓扑结构和功能语义的协同检索。最后,使用检索增强生成(RAG)策略智能组装工作流。在200个实际的n8n工作流上进行测试,系统在提取和构建方面的准确率均超过90%。该框架为企业的数字资产的自动化重组和高效重用提供了一个标准化的解决方案。
Summary / 总结
The paper addresses the reusability dilemma and structural hallucinations in enterprise Agentic AI by proposing ReusStdFlow, a framework based on an Extraction-Storage-Construction paradigm. It standardizes heterogeneous DSLs into modular workflow segments and uses a dual knowledge architecture combining graph and vector databases for retrieval. The system achieves over 90% accuracy in both extraction and construction when tested on 200 real-world n8n workflows, providing a standardized solution for automated reorganization and efficient reuse of enterprise digital assets.
论文提出了一种名为ReusStdFlow的框架,通过将异构DSL分解为标准化的工作流片段,并使用图数据库和向量数据库进行检索,来解决企业Agentic AI中的可重用性难题和结构幻觉问题。工作流通过检索增强生成策略进行智能组装。系统在200个实际n8n工作流的测试中,提取和构建的准确率均超过90%。
BHyGNN+: Unsupervised Representation Learning for Heterophilic Hypergraphs
Authors: Tianyi Ma, Yiyue Qian, Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
First: 2026-02-16T16:55:37+00:00 · Latest: 2026-02-16T16:55:37+00:00
Abstract
Hypergraph Neural Networks (HyGNNs) have demonstrated remarkable success in modeling higher-order relationships among entities. However, their performance often degrades on heterophilic hypergraphs, where nodes connected by the same hyperedge tend to have dissimilar semantic representations or belong to different classes. While several HyGNNs, including our prior work BHyGNN, have been proposed to address heterophily, their reliance on labeled data significantly limits their applicability in real-world scenarios where annotations are scarce or costly. To overcome this limitation, we introduce BHyGNN+, a self-supervised learning framework that extends BHyGNN for representation learning on heterophilic hypergraphs without requiring ground-truth labels. The core idea of BHyGNN+ is hypergraph duality, a structural transformation where the roles of nodes and hyperedges are interchanged. By contrasting augmented views of a hypergraph against its dual using cosine similarity, our framework captures essential structural patterns in a fully unsupervised manner. Notably, this duality-based formulation eliminates the need for negative samples, a common requirement in existing hypergraph contrastive learning methods that is often difficult to satisfy in practice. Extensive experiments on eleven benchmark datasets demonstrate that BHyGNN+ consistently outperforms state-of-the-art supervised and self-supervised baselines on both heterophilic and homophilic hypergraphs. Our results validate the effectiveness of leveraging hypergraph duality for self-supervised learning and establish a new paradigm for representation learning on challenging, unlabeled hypergraphs.
中文标题/摘要
标题:BHyGNN+:异配超图上的无监督表示学习
超图神经网络(HyGNNs)在建模实体之间的高阶关系方面取得了显著的成功。然而,它们在异配超图上的性能往往会下降,在异配超图中,通过同一超边连接的节点往往具有不同的语义表示或属于不同的类别。虽然已经提出了几种HyGNNs,包括我们之前的工作BHyGNN,以解决异配性问题,但它们对标注数据的依赖性极大地限制了它们在标注数据稀缺或昂贵的真实世界场景中的应用。为克服这一限制,我们引入了BHyGNN+,这是一种自监督学习框架,它扩展了BHyGNN,用于在不需要真实标签的情况下进行异配超图上的表示学习。BHyGNN+的核心思想是超图对偶性,这是一种结构转换,其中节点和超边的角色互换。通过使用余弦相似度将超图的增强视图与其对偶进行对比,我们的框架以完全无监督的方式捕捉到关键的结构模式。值得注意的是,这种基于对偶性的表述消除了对负样本的需求,负样本是现有超图对比学习方法中的常见要求,而这一要求在实践中往往难以满足。在11个基准数据集上的广泛实验表明,BHyGNN+在异配和同配超图上始终优于最先进的监督和自监督基线。我们的结果验证了利用超图对偶性进行自监督学习的有效性,并为在具有挑战性的未标注超图上的表示学习建立了新的范式。
Summary / 总结
BHyGNN+ is a self-supervised learning framework that extends BHyGNN to learn representations on heterophilic hypergraphs without requiring labeled data. It relies on hypergraph duality, contrasting augmented views of a hypergraph against its dual using cosine similarity, which captures structural patterns in a fully unsupervised manner and removes the need for negative samples. Experiments on eleven benchmark datasets show that BHyGNN+ outperforms state-of-the-art supervised and self-supervised baselines on both heterophilic and homophilic hypergraphs.
BHyGNN+ 是一种自监督学习框架,扩展了 BHyGNN 以在异配超图上进行无监督表示学习。它利用超图对偶性,使用余弦相似度将超图的增强视图与其对偶进行对比,避免了对负样本的需求。实验表明,BHyGNN+ 在 11 个基准数据集上优于监督和自监督基线,展示了其在无标签数据下捕捉结构模式的有效性。
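The hypergraph duality that BHyGNN+ builds on can be illustrated with a minimal Python sketch, assuming a hypergraph stored as a list of node-index sets. The function names `dual_hypergraph` and `cosine_similarity` and the toy data are hypothetical and not taken from the paper; the sketch only shows the node/hyperedge role swap and the similarity measure, not the contrastive training itself.

```python
import math

def dual_hypergraph(hyperedges, num_nodes):
    """Swap the roles of nodes and hyperedges.

    Each original node becomes a dual hyperedge that contains the indices of
    the original hyperedges the node belongs to.
    """
    dual = [set() for _ in range(num_nodes)]
    for edge_idx, nodes in enumerate(hyperedges):
        for node in nodes:
            dual[node].add(edge_idx)
    return dual

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy hypergraph (illustrative only): 4 nodes, 2 hyperedges.
hyperedges = [{0, 1, 2}, {2, 3}]
dual = dual_hypergraph(hyperedges, num_nodes=4)

# Taking the dual twice recovers the original hyperedge list.
recovered = dual_hypergraph(dual, num_nodes=len(hyperedges))
```

In the toy example, node 2 sits in both hyperedges, so its dual hyperedge contains both dual nodes.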
BFS-PO: Best-First Search for Large Reasoning Models
Authors: Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara
First: 2026-02-16T16:53:41+00:00 · Latest: 2026-02-16T16:53:41+00:00
Abstract
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
中文标题/摘要
标题:BFS-PO:面向大型推理模型的最佳优先搜索
大型推理模型(LRMs)如OpenAI o1和DeepSeek-R1在使用长推理链进行推理任务时表现出色。然而,这也导致了计算成本的显著增加和冗长输出的生成,这种现象被称为过度思考。过度思考的倾向往往会被GRPO/DAPO等强化学习(RL)算法所加剧。在本文中,我们提出了一种名为BFS-PO的RL算法,它使用最佳优先搜索探索策略来缓解这一问题。具体来说,BFS-PO通过基于最大熵节点的回溯机制寻找最短的正确答案。通过在训练过程中生成越来越短的回答,BFS-PO学会了生成简洁的推理链。使用不同的基准和基础LRMs,我们展示了BFS-PO可以同时提高LRM的准确性并缩短其答案。
Summary / 总结
The paper addresses the issue of overthinking in Large Reasoning Models (LRMs) like OpenAI o1 and DeepSeek-R1, which leads to increased computational costs and verbose outputs. To tackle this, BFS-PO, a new RL algorithm, is proposed, utilizing a Best-First Search strategy to find the shortest correct answer through a backtracking mechanism based on maximum entropy nodes. Experimental results demonstrate that BFS-PO can enhance the accuracy of LRMs while also producing more concise reasoning chains.
本文针对大型推理模型(LRMs)如OpenAI o1和DeepSeek-R1在推理任务中产生的冗长输出和高计算成本问题。作者提出了一种名为BFS-PO的RL算法,该算法采用最佳优先搜索策略,通过从最大熵节点回溯来寻找最短的正确答案。通过训练,BFS-PO学会了生成简洁的推理链。实验结果表明,BFS-PO能够同时提高LRM的准确性和答案的简洁性。
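For readers unfamiliar with the search strategy BFS-PO is named after, here is a minimal Python sketch of classical best-first search over a toy state space. It is not the BFS-PO training algorithm: the function `best_first_search`, the toy neighbor rule, and the scoring function are all hypothetical, and the maximum-entropy backtracking described in the abstract is not modeled.

```python
import heapq

def best_first_search(start, is_goal, neighbors, score):
    """Always expand the frontier state with the lowest score first."""
    # The counter breaks ties so heapq never has to compare states or paths.
    frontier = [(score(start), 0, start, [start])]
    seen = {start}
    counter = 1
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), counter, nxt, path + [nxt]))
                counter += 1
    return None  # frontier exhausted without reaching a goal

# Toy problem (illustrative only): reach 10 from 1 using "+1" or "*2" moves.
path = best_first_search(
    start=1,
    is_goal=lambda n: n == 10,
    neighbors=lambda n: [m for m in (n + 1, n * 2) if m <= 20],
    score=lambda n: abs(10 - n),  # lower means closer to the target
)
```

The priority queue is what distinguishes best-first search from breadth-first search: states are expanded in order of estimated promise rather than in order of discovery.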
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Authors: Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz
Venue: IEEE Open Journal of Intelligent Transportation Systems, 03 February 2026
First: 2025-06-13T07:25:59+00:00 · Latest: 2026-02-16T16:49:09+00:00
Comments: IEEE Open Journal of Intelligent Transportation Systems
Abstract
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
中文标题/摘要
标题:自动驾驶中的基础模型:场景生成与分析综述
对于自动驾驶车辆而言,安全导航于复杂环境中依赖于处理广泛多样的罕见驾驶场景。仿真和基于场景的测试已成为开发和验证自动驾驶系统的关键方法。传统场景生成依赖于基于规则的系统、知识驱动模型和数据驱动合成,通常产生有限的多样性和不现实的安全关键案例。随着基础模型的出现,这些代表新一代预训练、通用人工智能模型,开发者可以处理异构输入(例如自然语言、传感器数据、高清地图和控制动作),从而生成和解释复杂的驾驶场景。在本文中,我们对截至2025年5月基础模型在自动驾驶场景生成与分析中的应用进行了综述。综述中提出了一种统一的分类体系,包括大型语言模型、视觉语言模型、多模态大型语言模型、扩散模型和世界模型,用于生成和分析自动驾驶场景。此外,我们还回顾了方法论、开源数据集、仿真平台和基准挑战,并审查了专门针对场景生成与分析的评估指标。最后,综述总结了开放挑战和研究问题,并概述了有前景的未来研究方向。所有审查的论文都列在一个持续维护的仓库中,该仓库包含补充材料,并可在https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis/获取。
Summary / 总结
This paper surveys the application of foundation models to scenario generation and scenario analysis in autonomous driving (as of May 2025). It highlights how foundation models can process heterogeneous inputs to synthesize and interpret complex driving scenarios, addressing the limited diversity and unrealistic safety-critical cases of traditional methods. The survey presents a unified taxonomy covering large language models, vision-language models, multimodal large language models, diffusion models, and world models, and reviews methodologies, open-source datasets, simulation platforms, benchmark challenges, and evaluation metrics tailored to scenario generation and analysis. It closes with open challenges and future research directions.
本文综述了基础模型在自动驾驶场景生成和分析中的应用。它强调了这些模型如何处理多样化的输入以合成和解释复杂的驾驶场景,解决了传统方法的局限性。关键发现包括使用大型语言模型、视觉-语言模型和多模态模型来生成和分析场景,并回顾了方法论、数据集和评估指标。综述指出了开放挑战,并提出了改进自动驾驶系统中场景多样性和真实性的未来研究方向。
The Potential of CoT for Reasoning: A Closer Look at Trace Dynamics
Authors: Gregor Bachmann, Yichen Jiang, Seyed Mohsen Moosavi Dezfooli, Moin Nabi
First: 2026-02-16T16:38:47+00:00 · Latest: 2026-02-16T16:38:47+00:00
Abstract
Chain-of-thought (CoT) prompting is a de facto standard technique to elicit reasoning-like responses from large language models (LLMs), allowing them to spell out individual steps before giving a final answer. While the resemblance to human-like reasoning is undeniable, the driving forces underpinning the success of CoT reasoning still remain largely unclear. In this work, we perform an in-depth analysis of CoT traces originating from competition-level mathematics questions, with the aim of better understanding how, and which parts of CoT actually contribute to the final answer. To this end, we introduce the notion of a potential, quantifying how much a given part of CoT increases the likelihood of a correct completion. Upon examination of reasoning traces through the lens of the potential, we identify surprising patterns including (1) its often strong non-monotonicity (due to reasoning tangents), (2) very sharp but sometimes tough to interpret spikes (reasoning insights and jumps) as well as (3) at times lucky guesses, where the model arrives at the correct answer without providing any relevant justifications before. While some of the behaviours of the potential are readily interpretable and align with human intuition (such as insights and tangents), others remain difficult to understand from a human perspective. To further quantify the reliance of LLMs on reasoning insights, we investigate the notion of CoT transferability, where we measure the potential of a weaker model under the partial CoT from another, stronger model. Indeed, aligning with our previous results, we find that as little as 20% of partial CoT can "unlock" the performance of the weaker model on problems that were previously unsolvable for it, highlighting that a large part of the mechanics underpinning CoT are transferable.
中文标题/摘要
标题:CoT在推理中的潜力:对轨迹动态的深入观察
链式思考(CoT)提示是一种事实上的标准技术,用于从大规模语言模型(LLMs)中引发类似推理的响应,允许它们在给出最终答案之前详细说明每个步骤。尽管其与人类推理的相似性无可否认,但CoT推理成功的驱动因素仍然很大程度上不清楚。在本文中,我们对源自竞赛级数学问题的CoT轨迹进行了深入分析,旨在更好地理解哪些部分以及如何实际贡献于最终答案。为此,我们引入了潜力的概念,量化了CoT中给定部分增加正确完成的可能性。通过潜力的视角审视推理轨迹,我们发现了令人惊讶的模式,包括(1)其经常强烈的非单调性(由于推理旁支),(2)非常尖锐但有时难以解释的突增(推理洞察和跳跃),以及(3)有时的幸运猜测,其中模型在未提供任何相关证明之前就到达了正确答案。虽然潜力的一些行为是易于解释并与人类直觉一致的(如洞察和旁支),但其他行为从人类角度来看仍然难以理解。为了进一步量化LLMs对推理洞察的依赖性,我们研究了CoT可转移性的概念,其中我们测量了较弱模型在另一个更强模型部分CoT下的潜力。确实与我们之前的结果一致,我们发现即使只有20%的部分CoT就能“解锁”较弱模型在之前无法解决的问题上的性能,这表明CoT背后的大部分机制是可转移的。
Summary / 总结
This study investigates the dynamics of chain-of-thought (CoT) reasoning in large language models (LLMs) by analyzing traces from competition-level mathematics questions. The research introduces the concept of potential to quantify how much each part of CoT contributes to the likelihood of a correct answer. Key findings include non-monotonic reasoning patterns, sharp but sometimes difficult-to-interpret spikes, and instances of lucky guesses. The study also explores CoT transferability, showing that as little as 20% of a stronger model's partial CoT can unlock a weaker model's performance on problems it previously could not solve, indicating that much of the reasoning mechanism is transferable.
该研究通过分析竞赛级数学问题的链式思考(CoT)痕迹,探讨了大型语言模型(LLMs)的推理动态。研究引入了潜力的概念,以量化CoT中每个部分对正确答案概率的贡献。主要发现包括非单调推理模式、尖锐但有时难以解释的突增、以及幸运猜测的情况。研究还探索了CoT的可转移性,表明一小部分来自更强模型的CoT可以显著提升较弱模型在之前无法解决的问题上的表现,这表明许多推理机制是可以转移的。
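One way to picture the potential described above is as the probability of a correct final answer given a partial CoT, estimated by sampling completions. The Python sketch below follows that reading under stated assumptions: the abstract does not give the exact definition, so `estimate_potential` is a hypothetical Monte Carlo estimator, and `toy_sampler` is a stand-in for an LLM rather than anything from the paper.

```python
import random

def estimate_potential(prefix, sample_completion, is_correct, n_samples=500, seed=0):
    """Monte Carlo estimate of P(correct final answer | partial CoT prefix)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        answer = sample_completion(prefix, rng)
        if is_correct(answer):
            hits += 1
    return hits / n_samples

# Hypothetical stand-in for an LLM: an "insight" step makes a correct answer more likely.
def toy_sampler(prefix, rng):
    p_correct = 0.9 if "insight" in prefix else 0.2
    return "42" if rng.random() < p_correct else "wrong"

def is_answer_correct(answer):
    return answer == "42"

without_insight = estimate_potential(["setup"], toy_sampler, is_answer_correct)
with_insight = estimate_potential(["setup", "insight"], toy_sampler, is_answer_correct)

# How much the extra reasoning step raised the estimated chance of a correct answer.
gain = with_insight - without_insight
```

Evaluating the estimator after each successive step of a trace would give a curve whose jumps and dips correspond to the insights and tangents the abstract mentions.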
Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems
Authors: Pramit Saha, Joshua Strong, Mohammad Alsharid, Divyanshu Mishra, J. Alison Noble
First: 2026-02-16T16:36:32+00:00 · Latest: 2026-02-16T16:36:32+00:00
Abstract
Task-specialized models form the backbone of agentic healthcare systems, enabling the agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models where different models excel on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we, for the first time, introduce an agentic Chest X-ray environment equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA) and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.
中文标题/摘要
标题:选择合适的专家:基于注意神经过程的任务专业化模型选择方法作为代理型医疗保健系统的工具
任务专业化模型构成了代理型医疗保健系统的骨干,使代理能够跨疾病诊断、定位和报告生成等任务回答临床查询。然而,对于给定的任务,通常并不存在一个“最佳”模型。实际上,每个任务都由多个相互竞争的专业化模型服务,不同的模型在不同的数据样本上表现出色。因此,对于任何给定的查询,代理必须从异质的工具候选池中可靠地选择合适的专家模型。为此,我们引入了ToolSelect,它使用任务条件选择损失的一致替代损失,通过最小化采样专家工具候选者上的总体风险来自适应地学习工具的模型选择。具体而言,我们提出了一种基于注意神经过程的选择器,该选择器以查询和每个模型的行为摘要为条件来选择专家模型。鉴于没有现成的测试平台,我们首次引入了一个代理型胸部X光环境,配备了多样化的任务专业化模型(17种疾病检测、19种报告生成、6种视觉定位和13种VQA),并开发了包含1448个查询的ToolSelectBench基准。我们的结果表明,ToolSelect在四个不同任务家族上始终优于10种最先进的方法。
Summary / 总结
The research addresses the challenge of selecting the most appropriate task-specialized model for clinical queries in agentic healthcare systems. It introduces ToolSelect, which uses an Attentive Neural Process to adaptively choose the best model from a pool of specialists by minimizing a population risk. Experiments on a new agentic Chest X-ray environment with diverse models show that ToolSelect outperforms 10 state-of-the-art methods across different task families.
本文解决了在智能医疗系统中为临床查询选择最合适的专家模型的挑战。它引入了ToolSelect,该方法使用注意力神经过程从专家池中选择最佳模型。该方法通过从样本专家候选者中学习来最小化总体风险。实验表明,ToolSelect在四个任务家族中优于10种最先进的方法。
Drift-Diffusion Matching: Embedding dynamics in latent manifolds of asymmetric neural networks
Authors: Ramón Nartallo-Kaluarachchi, Renaud Lambiotte, Alain Goriely
First: 2026-02-16T16:15:59+00:00 · Latest: 2026-02-16T16:15:59+00:00
Comments: 23 pages, 15 figures
Abstract
Recurrent neural networks (RNNs) provide a theoretical framework for understanding computation in biological neural circuits, yet classical results, such as Hopfield's model of associative memory, rely on symmetric connectivity that restricts network dynamics to gradient-like flows. In contrast, biological networks support rich time-dependent behaviour facilitated by their asymmetry. Here we introduce a general framework, which we term drift-diffusion matching, for training continuous-time RNNs to represent arbitrary stochastic dynamical systems within a low-dimensional latent subspace. Allowing asymmetric connectivity, we show that RNNs can faithfully embed the drift and diffusion of a given stochastic differential equation, including nonlinear and nonequilibrium dynamics such as chaotic attractors. As an application, we construct RNN realisations of stochastic systems that transiently explore various attractors through both input-driven switching and autonomous transitions driven by nonequilibrium currents, which we interpret as models of associative and sequential (episodic) memory. To elucidate how these dynamics are encoded in the network, we introduce decompositions of the RNN based on its asymmetric connectivity and its time-irreversibility. Our results extend attractor neural network theory beyond equilibrium, showing that asymmetric neural populations can implement a broad class of dynamical computations within low-dimensional manifolds, unifying ideas from associative memory, nonequilibrium statistical mechanics, and neural computation.
中文标题/摘要
标题:漂移-扩散匹配:在不对称神经网络的潜在流形中嵌入动力学
循环神经网络(RNNs)为理解生物神经回路中的计算提供了一个理论框架,然而,经典的成果,如霍普菲尔德关于联想记忆的模型,依赖于对称连接,这限制了网络动力学为梯度流。相比之下,生物网络通过其不对称性支持丰富的时变行为。在这里,我们提出了一种通用框架,称为漂移-扩散匹配,用于训练连续时间RNNs以在低维潜在子空间中表示任意随机动力系统。允许不对称连接,我们展示了RNNs可以忠实嵌入给定随机微分方程的漂移和扩散,包括非线性和非平衡动力学,如混沌吸引子。作为应用,我们构建了RNN实现的随机系统,这些系统通过输入驱动切换和由非平衡电流驱动的自主转换暂时探索各种吸引子,我们将其解释为联想记忆和序列(情景)记忆的模型。为了阐明这些动力学如何在网络中编码,我们基于其不对称连接和时间不可逆性引入了RNN的分解。我们的结果将吸引子神经网络理论扩展到非平衡状态,表明不对称神经群体可以在低维流形中实现广泛的动态计算,统一了联想记忆、非平衡统计力学和神经计算的想法。
Summary / 总结
The research introduces drift-diffusion matching, a framework for training continuous-time RNNs with asymmetric connectivity to represent stochastic dynamical systems within a low-dimensional latent subspace. The study shows that such RNNs can faithfully embed the drift and diffusion of a given stochastic differential equation, including nonlinear and nonequilibrium dynamics such as chaotic attractors, and can model associative and sequential memory through input-driven switching and autonomous transitions. It also introduces decompositions of the RNN, based on its asymmetric connectivity and time-irreversibility, to explain how the dynamics are encoded. The results extend attractor neural network theory beyond equilibrium, showing that asymmetric neural populations can perform a broad class of dynamical computations in low-dimensional manifolds.
该研究引入了漂移-扩散匹配框架,用于训练连续时间RNN在低维潜空间中表示随机动力系统,允许不对称连接。这些RNN能够准确嵌入非线性和非平衡动力学,包括混沌吸引子。研究展示了这些网络可以通过输入驱动和自主过渡来模拟关联性和序列记忆,并提供了分解来理解动力学如何在网络中编码。该研究扩展了超越平衡的吸引子神经网络理论,表明不对称神经群体可以在低维流形中实现广泛的动态计算。
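For readers unfamiliar with the terms, the "drift" and "diffusion" that the abstract refers to are the two coefficient functions of a stochastic differential equation dx = f(x) dt + g(x) dW. The sketch below only simulates such an equation with the standard Euler-Maruyama scheme; it is not the authors' training procedure, and the function name, the Ornstein-Uhlenbeck example, and all parameter values are illustrative assumptions.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, seed=0):
    """Simulate dx = drift(x) dt + diffusion(x) dW with the Euler-Maruyama scheme.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Brownian increment over a step of length dt: Normal(0, dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path


# Assumed example: an Ornstein-Uhlenbeck process, dx = -x dt + 0.1 dW.
path = euler_maruyama(lambda x: -x, lambda x: 0.1, x0=1.0, dt=0.01, n_steps=1000)
```

With the diffusion set to zero, the same routine reduces to the forward Euler method for the deterministic flow given by the drift alone.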
CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Authors: Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu
First: 2026-02-16T16:10:19+00:00 · Latest: 2026-02-16T16:10:19+00:00
Abstract
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
中文标题/摘要
标题:CT-Bench:计算机断层扫描中多模态病变理解的基准数据集
人工智能(AI)可以自动勾画计算机断层扫描(CT)中的病变并生成放射学报告内容,但进展受限于可用的带有病变级别注释的CT数据集稀缺。为解决这一问题,我们引入了CT-Bench,这是一个首创的基准数据集,包含两个部分:包含7,795份CT研究中20,335个病变的病变图像和元数据集,其中包含边界框、描述和尺寸信息,以及涵盖病变定位、描述、尺寸估计和属性分类的多任务视觉问答基准,包含2,850个问答对。还包含困难的负例以反映实际诊断挑战。我们通过将多个最先进的多模态模型与放射科医生评估进行比较,评估了CT-Bench的价值,证明了CT-Bench作为病变分析综合基准的价值。此外,对病变图像和元数据集进行微调在两个部分上均取得了显著的性能提升,突显了CT-Bench的临床用途。
On the Learning Dynamics of RLVR at the Edge of Competence
Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen
First: 2026-02-16T16:03:08+00:00 · Latest: 2026-02-16T16:03:08+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
中文标题/摘要
标题:论RLVR在能力边缘的学习动态
可验证奖励的强化学习(RLVR)是近期大型推理模型突破的主要驱动力。然而,基于最终结果的奖励如何帮助克服长期推理障碍仍然是一个谜。为了理解这一点,我们发展了一种关于transformer在组合推理任务中强化学习训练动态的理论。该理论描述了RLVR的有效性如何由难度谱的平滑度来决定。当数据包含难度的突然断点时,学习会经历类似于grokking的相变阶段,在进步重新出现之前会出现长期停滞。相反,平滑的难度谱会导致接力效应:持续的梯度信号在较简单的问题上提升模型的能力,使得更难的问题变得可处理,从而导致持续和稳定的改进。该理论解释了RLVR如何在能力边缘提高性能,并建议适当设计的数据混合可以带来可扩展的收益。作为技术贡献,我们的分析发展并改造了有限群上的傅里叶分析工具以适用于我们的设定。我们通过合成实验验证了预测机制。
Summary / 总结
The paper investigates the learning dynamics of RLVR (Reinforcement Learning with Verifiable Rewards) in overcoming the long-horizon barrier in reasoning tasks. It develops a theory characterizing how the smoothness of the difficulty spectrum affects the effectiveness of RLVR. The theory shows that abrupt difficulty discontinuities lead to prolonged plateaus, while a smooth spectrum enables steady improvement through a relay effect, enhancing performance at the edge of competence. Empirical validation through synthetic experiments supports the proposed mechanisms.
论文研究了RLVR(可验证奖励的强化学习)在克服长时序推理障碍方面的学习动态。它发展了一种理论,描述了难度谱的平滑性如何影响RLVR的有效性。理论表明,难度的突然跳跃会导致长时间的停滞,而平滑的难度谱则通过接力效应实现持续改进,从而在接近能力极限时提升性能。合成实验的实证验证支持了提出的机制。
Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution
Authors: Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine
First: 2026-02-16T16:02:09+00:00 · Latest: 2026-02-16T16:02:09+00:00
Abstract
As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and influence to abstract behavior, we leverage interpretable structures within the model during the attribution. First, we introduce Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order-of-magnitude faster. We empirically validate Concept Influence and approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.
中文标题/摘要
标题:概念影响:利用可解释性提高训练数据归因性能和效率
随着大型语言模型的不断训练和微调,从业者需要方法来识别哪些训练数据驱动特定行为,特别是无意中的行为。训练数据归因(TDA)方法通过估计数据点的影响来解决这一问题。现有的方法如影响函数既计算成本高昂,又基于单个测试示例进行归因,这可能会使结果偏向于语法而非语义相似性。为了解决这些可扩展性和影响到抽象行为的问题,我们在归因过程中利用模型中的可解释结构。首先,我们引入了概念影响,将模型行为归因于语义方向(如线性探针或稀疏自编码器特征)而非单个测试示例。其次,我们展示了基于探针的归因方法是概念影响的一阶近似,能够在性能相当的情况下快一个数量级。我们在新兴的偏差基准和实际的后训练数据集上实证验证了概念影响及其近似方法,并证明它们在可扩展性方面优于经典的影响函数。更广泛地说,我们展示了在传统TDA管道中整合可解释结构可以实现更可扩展、更可解释以及更好地控制模型行为的数据。
Summary / 总结
The paper addresses the need for scalable and interpretable methods to identify which training data drive specific behaviors in large language models. It introduces Concept Influence, which attributes model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples, addressing two limitations of existing influence functions: their computational cost and their bias toward syntactic similarity. It further shows that simple probe-based attribution methods are first-order approximations of Concept Influence that run over an order of magnitude faster at comparable performance. Experiments on emergent misalignment benchmarks and real post-training datasets show that Concept Influence and its approximations match classical influence functions in performance while being substantially more scalable.
论文旨在通过引入Concept Influence方法提高Training Data Attribution (TDA)方法的可扩展性和可解释性,该方法将模型行为归因于语义方向而非单个测试示例。这种方法利用模型中的可解释结构来估计影响,提供了一阶近似,速度显著提高且性能相当。实验证明,Concept Influence及其近似方法在基准测试和真实数据集上的性能与经典影响函数相当,但更具可扩展性。
Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
Authors: Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe
First: 2026-02-16T16:01:27+00:00 · Latest: 2026-02-16T16:01:27+00:00
Comments: 21 pages, 12 figures
Abstract
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
中文标题/摘要
标题:金发姑娘RL:调整任务难度以摆脱稀疏奖励实现推理
强化学习已成为解锁大型语言模型推理能力的强大范式。然而,依赖稀疏奖励使得这一过程变得高度样本不高效,因为模型必须在反馈有限的情况下导航庞大的搜索空间。虽然经典的课程学习试图通过按复杂度排序数据来缓解这一问题,但特定模型的最佳排序往往不清楚。为了解决这一问题,我们提出了一种名为Goldilocks的新颖教师驱动的数据采样策略,旨在预测每个问题对学生模型的难度。教师模型为学生模型选择合适的难度问题,即既不过于简单也不过于困难(金发姑娘原则),同时使用GRPO训练学生。通过利用学生在已见样本上的表现,教师不断适应学生不断发展的能力。在OpenMathReasoning数据集上,Goldilocks数据采样在相同的计算预算下提高了使用标准GRPO训练的模型的性能。
Summary / 总结
The paper addresses the sample inefficiency of reinforcement learning with sparse rewards on reasoning tasks. It introduces Goldilocks, a teacher-driven data sampling strategy in which a teacher model predicts each question's difficulty for the student and selects questions that are neither too easy nor too hard, adapting as the student improves. On the OpenMathReasoning dataset, this sampling improves on standard GRPO training under the same compute budget.
研究旨在通过解决稀疏奖励问题来提高使用强化学习训练大型语言模型的效率。Goldilocks RL 方法使用教师模型为学生模型选择合适的难度问题,确保既不太容易也不太难。这种方法在 OpenMathReasoning 数据集上提高了使用标准 GRPO 训练的模型的性能,同时保持相同的计算资源。
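The selection idea can be illustrated with a small sketch. It assumes the teacher outputs a predicted success rate per question and that "neither too easy nor too hard" means closest to 0.5; both the scoring rule and the function name are hypothetical, since the abstract does not state the exact criterion the teacher uses.

```python
def select_goldilocks(predicted_success, k):
    """Return the indices of the k questions whose predicted success rate
    is closest to 0.5 (an assumed stand-in for 'appropriate difficulty')."""
    ranked = sorted(
        range(len(predicted_success)),
        key=lambda i: abs(predicted_success[i] - 0.5),
    )
    return ranked[:k]


# Assumed teacher predictions for five questions.
rates = [0.02, 0.45, 0.98, 0.60, 0.50]
chosen = select_goldilocks(rates, k=2)
```

One reason a mid-range target is a plausible choice: with a binary outcome reward, questions the student almost always solves or almost always fails produce nearly identical rewards across samples, so they carry little learning signal.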
Fast and accurate quasi-atom method for simultaneous atomistic and continuum simulation of solids
Authors: Artem Chuprov, Egor E. Nuzhin, Alexey A. Tsukanov, Nikolay V. Brilliantov
First: 2026-02-16T16:00:58+00:00 · Latest: 2026-02-16T16:00:58+00:00
Abstract
We report a novel hybrid method of simultaneous atomistic simulation of solids in critical regions (contact surfaces, crack areas, etc.), along with continuum modeling of other parts. The continuum is treated in terms of quasi-atoms of different size, comprising a composite medium. The parameters of the interaction potential between the quasi-atoms are optimized to match the elastic properties of the composite medium to those of the atomic one. The optimization method coincides conceptually with online Machine Learning (ML) methods, making it computationally very efficient. Such an approach allows a straightforward application of standard software packages for molecular dynamics (MD), supplemented by the ML-based optimizer. The new method is applied to model systems with a simple, pairwise Lennard-Jones potential, as well as with a multi-body Tersoff potential, describing covalent bonds. Using LAMMPS software we simulate the collision of particles of different size. Comparing simulation results obtained by the novel method with full-atomic simulations, we demonstrate its accuracy, validity and overwhelming superiority in computational speed. Furthermore, we compare our method with other hybrid methods, specifically with the closest one, the AtC (Atomic to Continuum) method. We demonstrate a significant superiority of our approach in computational speed and implementation convenience. Finally, we discuss a possible extension of the method for modeling other phenomena.
中文标题/摘要
标题:快速而准确的类原子方法用于固体的原子级和连续模拟的同步
我们报告了一种新的混合方法,用于在关键区域(接触表面、裂纹区域等)对固体进行原子级模拟,同时对其他部分进行连续建模。连续部分用不同大小的类原子表示,构成复合介质。类原子间相互作用势的参数被优化,以使复合介质的弹性性质与原子介质的相匹配。优化方法在概念上与在线机器学习(ML)方法一致,使其在计算上非常高效。这种方法允许直接应用标准的分子动力学(MD)软件包,并通过基于ML的优化器进行补充。该新方法应用于使用简单双体Lennard-Jones势和描述共价键的多体Tersoff势的系统。使用LAMMPS软件模拟不同大小粒子的碰撞。通过新方法获得的模拟结果与全原子模拟结果进行比较,展示了其准确性、有效性和在计算速度上的显著优势。此外,我们将我们的方法与其他混合方法进行了比较,特别是最近的一种——原子到连续(AtC)方法。我们展示了在计算速度和实现便利性方面我们方法的显著优势。最后,我们讨论了该方法可能扩展以模拟其他现象的可能性。
Summary / 总结
This paper introduces a hybrid method combining atomistic simulation in critical regions with continuum modeling elsewhere using quasi-atoms. The method optimizes interaction parameters to match elastic properties, leveraging machine learning for efficiency. Simulations with Lennard-Jones and Tersoff potentials show the method’s accuracy and computational speed, outperforming existing hybrid methods like AtC in both speed and implementation. The approach can be applied using standard molecular dynamics software with an ML-based optimizer, making it versatile and efficient for various systems.
本文介绍了一种新的混合方法,结合了在关键区域进行原子级模拟与在其他部分进行连续模型模拟,使用准原子。该方法通过优化准原子之间的相互作用参数来匹配弹性性质,并利用机器学习提高效率。使用Lennard-Jones和Tersoff势能的模拟显示了该方法的准确性和计算速度,其速度和实现便利性均优于现有的混合方法如AtC。该方法可以使用标准分子动力学软件和机器学习优化器进行应用,使其适用于各种系统且具有灵活性和高效性。
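As a point of reference for the pairwise potential named in the abstract, the sketch below evaluates the standard 12-6 Lennard-Jones form V(r) = 4ε[(σ/r)^12 - (σ/r)^6]. It does not reproduce the quasi-atom parameter optimization; the function name and the reduced-unit defaults (ε = σ = 1) are assumptions for illustration.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair energy at separation r.

    epsilon is the well depth and sigma the zero-crossing distance.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# The energy is zero at r = sigma and reaches its minimum, -epsilon,
# at r = 2**(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)
v_min = lennard_jones(r_min)
```

In the hybrid scheme described above, it is the parameters of such an interaction potential between quasi-atoms that would be tuned so that the coarse medium matches the elastic properties of the atomic one.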
C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning
Authors: Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Ka-Veng Yuen
Venue: ICRA 2026
First: 2026-02-11T05:50:17+00:00 · Latest: 2026-02-16T15:58:51+00:00
Comments: Accepted in ICRA 2026
Abstract
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
中文标题/摘要
标题:C^2ROPE:因果连续旋转位置编码在3D大型多模态模型推理中的应用
基于大型语言模型(LLMs)构建的3D大型多模态模型(LMMs)的最新进展已经确立了3D视觉特征与LLM表示的对齐作为主导范式。然而,继承的旋转位置嵌入(RoPE)为多模态处理带来了局限性。具体来说,应用1D时间位置索引破坏了沿列维度视觉特征的连续性,导致空间局部性损失。此外,RoPE 假设时间上更接近的图像标记具有更大的因果关系,导致长期注意力分配衰减,使模型随着序列长度增加而逐渐忽略早期的视觉标记。为了解决这些问题,我们提出了C^2RoPE,这是一种改进的RoPE,明确地为视觉处理建模局部空间连续性和空间因果关系。C^2RoPE 引入了一种时空连续位置嵌入机制,为视觉标记。它首先将1D时间位置与基于笛卡尔坐标的空间坐标结合,构建一个三元混合位置索引,然后采用频率分配策略,对三个索引组件中的时空位置信息进行编码。此外,我们引入了切比雪夫因果掩码,通过计算2D空间中图像标记的切比雪夫距离来确定因果依赖性。在包括3D场景推理和3D视觉问答在内的各种基准上的评估结果表明了C^2RoPE的有效性。代码可在https://github.com/ErikZ719/C2RoPE/ 获取。
Summary / 总结
The paper proposes C^2RoPE, an improved Rotary Positional Encoding method that addresses limitations of traditional RoPE in 3D multimodal models. C^2RoPE explicitly models local spatial continuity and causal relationships, integrating 1D temporal positions with Cartesian spatial coordinates and using a frequency allocation strategy to encode spatio-temporal information. Experiments on 3D scene reasoning and visual question answering benchmarks show C^2RoPE's effectiveness in maintaining spatial locality and causal dependencies, improving model performance. The code is available at https://github.com/ErikZ719/C2RoPE.
该论文提出了C^2RoPE,一种改进的旋转位置编码方法,解决了传统RoPE在3D多模态模型中的局限性。C^2RoPE 显式地建模了局部空间连续性和因果关系,将1D时间位置与笛卡尔空间坐标结合,并使用频率分配策略编码时空位置信息。该方法还引入了Chebyshev因果掩码,通过计算图像标记在二维空间中的Chebyshev距离来确定因果依赖性。实验结果表明,C^2RoPE 在3D场景推理和视觉问答基准测试中有效,能够保持空间局部性和改善长序列中的注意力分配。
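The Chebyshev distance mentioned in the abstract is the maximum of the absolute coordinate differences. The sketch below computes it for tokens laid out on a 2D grid. How C^2RoPE turns these distances into a causal mask is not specified in the abstract, so the choice of a reference token and the helper names here are hypothetical.

```python
def chebyshev(a, b):
    """Chebyshev (L-infinity) distance between two (row, col) positions."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def distance_grid(height, width, ref):
    """Chebyshev distance of every grid position from a reference position.

    The reference position is an assumption made for illustration only.
    """
    return [[chebyshev((r, c), ref) for c in range(width)] for r in range(height)]


# Distances on a 3x3 token grid measured from the centre token.
grid = distance_grid(3, 3, ref=(1, 1))
```

Unlike a 1D raster index, this distance treats the eight neighbours of a token as equally close, which is the kind of spatial locality the abstract says is lost under purely temporal indexing.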
Efficient Test-Time Scaling for Small Vision-Language Models
Authors: Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
Venue: ICLR 2026
First: 2025-10-03T23:49:06+00:00 · Latest: 2026-02-16T15:56:06+00:00
Comments: Accepted at ICLR 2026. Project Page: https://monurcan.github.io/efficient_test_time_scaling
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
中文标题/摘要
标题:小视觉语言模型测试时高效缩放方法
小视觉语言模型(VLMs)提供了一种在计算上更高效的替代方案,但代价是泛化能力和下游任务性能较弱。这些不足可以通过测试时缩放技术来解决,但现有方法通常计算成本较高,这与小模型资源高效的设计目标相矛盾。为了解决这些限制,我们提出了两种新颖且高效的测试时缩放策略,这些策略利用模型内部特征而非外部监督:(i) 测试时增强(TTAug),它生成多个增强输入并在标记级别聚合输出而不更新参数,(ii) 测试时适应(TTAdapt),它在推理过程中通过TTAug提供的基于共识的伪标签来调整模型参数。通过在九个基准上的广泛实验,我们展示了在保持计算效率的同时的一致性能改进,该计算效率适合资源受限的环境。我们的方法的通用性在不同规模的模型内部以及不同VLMs之间得到了验证,无需额外调整。
Summary / 总结
This paper addresses the limitations of small Vision-Language Models (VLMs) by proposing two efficient test-time scaling strategies that rely on model-internal features rather than external supervision. Test-Time Augmentation (TTAug) generates multiple augmented inputs and aggregates the outputs at the token level without parameter updates, while Test-Time Adaptation (TTAdapt) adapts model parameters during inference using consensus-based pseudolabels from TTAug. Experiments across nine benchmarks show consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments.
本文提出两种高效的测试时缩放策略:Test-Time Augmentation (TTAug) 和 Test-Time Adaptation (TTAdapt),以解决小规模 Vision-Language 模型的局限性。TTAug 生成多个增强输入并在标记级别聚合输出,而 TTAdapt 在推理过程中使用 TTAug 的共识伪标签调整模型参数。实验表明,在九个基准上的一致性能改进,同时保持适合资源受限环境的计算效率。
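Token-level aggregation across augmented inputs can be sketched as follows. The example assumes that each augmented view yields a next-token probability distribution and that aggregation is a simple mean followed by an argmax; the abstract does not state the aggregation rule, so this is an illustrative assumption rather than the TTAug procedure itself.

```python
def aggregate_token_probs(view_probs):
    """Average next-token distributions from several augmented views and
    return (index of the most likely token, averaged distribution).

    Mean aggregation is an assumed rule chosen for illustration.
    """
    n_views = len(view_probs)
    vocab_size = len(view_probs[0])
    mean = [sum(p[t] for p in view_probs) / n_views for t in range(vocab_size)]
    best = max(range(vocab_size), key=lambda t: mean[t])
    return best, mean


# Assumed distributions over a 3-token vocabulary from three augmented views.
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
]
best, mean = aggregate_token_probs(views)
```

Here the first view alone would pick token 0, but the averaged distribution favours token 1, which shows how aggregating at the token level can override a single view's choice without any parameter update.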