arXiv Paper Digest

2026-02-13 04:02
Snapshot: 20260213_0402
SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos
Authors: Yue Gao, Hong-Xing Yu, Sanghyeon Chang, Qianxi Fu, Bo Zhu, Yoonjin Won, Juan Carlos Niebles, Jiajun Wu
First: 2026-02-11T18:59:55+00:00 · Latest: 2026-02-11T18:59:55+00:00
Comments: The first two authors contributed equally. Project website: https://yuegao.me/SurfPhase
Abstract
Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: https://yuegao.me/SurfPhase.
中文标题/摘要
标题:SurfPhase:从稀疏视频中重建两相流界面动力学的3D动态
两相流中的界面动力学控制着动量、热量和质量传递,但实验测量仍然困难。经典技术在接近移动界面时存在固有限制,而现有的神经渲染方法针对的是具有模糊边界的单相流,无法处理尖锐且可变形的液-汽界面。我们提出SurfPhase,一种从稀疏摄像机视角重建3D界面动力学的新模型。我们的方法结合了动态高斯表面元与符号距离函数表示法以保持几何一致性,并利用视频扩散模型合成新颖视角视频以从稀疏观测中细化重建。我们在一个新数据集上进行评估,该数据集包含高速池沸腾视频,展示了仅从两个摄像机视角即可实现高质量视图合成和速度估计。项目网站:https://yuegao.me/SurfPhase。
Summary / 总结
SurfPhase is designed to reconstruct 3D interfacial dynamics in two-phase flows from sparse video data, addressing the limitations of classical techniques near moving interfaces and existing neural rendering methods that struggle with sharp, deformable interfaces. The model uses dynamic Gaussian surfels and a signed distance function for geometric consistency, and synthesizes novel-view videos to refine the reconstruction. Experiments on high-speed pool boiling videos show high-quality view synthesis and velocity estimation from just two camera views.
SurfPhase 是一种模型,用于从稀疏视频视角重建两相流中的 3D 表面动态。它结合了动态高斯表面元和符号距离函数以保持几何一致性,并使用视频扩散模型生成新的视角以从稀疏观测中细化重建。该模型能够仅从两个摄像机视角估计速度并生成高质量的视图,克服了经典技术和现有神经渲染方法在处理尖锐、可变形的液-汽界面时的限制,特别是在高速池沸腾视频中的应用。
Diffusion-Pretrained Dense and Contextual Embeddings
Authors: Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov
First: 2026-02-11T18:59:08+00:00 · Latest: 2026-02-11T18:59:08+00:00
Abstract
In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
中文标题/摘要
标题:扩散预训练密集和上下文嵌入
在本报告中,我们介绍了pplx-embed,这是一种多语言嵌入模型家族,通过在扩散预训练语言模型骨干上进行多阶段对比学习,用于大规模网页检索。通过利用基于扩散的预训练中的双向注意力,我们的模型能够在段落中捕捉全面的双向上下文,从而利用均值池化和后期分块策略更好地保留长文档中的全局上下文。我们发布了两种模型类型:pplx-embed-v1 用于标准检索,pplx-embed-context-v1 用于将全局文档上下文融入段落表示的上下文化嵌入。pplx-embed-v1 在 MTEB(多语言,v2)、MTEB(代码)、MIRACL、BERGEN 和 ToolRet 检索基准测试中表现出竞争力,而 pplx-embed-context-v1 在 ConTEB 基准测试中创下了新纪录。除了公共基准测试,pplx-embed-v1 在我们专注于数千万文档的大规模现实场景的内部评估套件中也表现出色。这些结果验证了这些模型在检索质量和效率至关重要的大规模生产环境中的有效性。
Summary / 总结
The report introduces pplx-embed, a family of multilingual embedding models trained with multi-stage contrastive learning on a diffusion-pretrained language model backbone. Bidirectional attention from diffusion-based pretraining lets the models capture full bidirectional context within passages, which enables mean pooling and late chunking to better preserve global context across long documents. pplx-embed-v1 performs competitively on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks and also does well on an internal evaluation suite covering large-scale search over tens of millions of documents, while pplx-embed-context-v1 sets new records on the ConTEB benchmark.
研究引入了pplx-embed,这是一种使用扩散预训练语言模型进行多阶段对比学习的多语言嵌入模型。该模型通过双向注意力捕捉全面的双向上下文,支持均值池化和后期分块策略以更好地保留全局上下文。pplx-embed-v1在多个检索基准测试中表现出色,而pplx-embed-context-v1在ConTEB基准测试中创下了新纪录,展示了在大规模实际搜索场景中的强大性能。
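The late chunking idea mentioned in the abstract can be illustrated with a minimal sketch: the whole document is encoded once, and each chunk embedding is then mean-pooled from the contextual token embeddings inside that chunk's span. The function names, the toy vectors, and the span format below are assumptions made for illustration; this is not the pplx-embed API, and the real models operate on transformer outputs rather than hand-written lists.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(token_embeddings: Sequence[Vector],
               spans: Sequence[Tuple[int, int]]) -> List[Vector]:
    """Pool contextual token embeddings per chunk span [start, end).

    The key point of late chunking: token_embeddings come from ONE pass
    over the full document, so every token has already attended to text
    outside its own chunk before pooling happens.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Hypothetical stand-in for encoder output over a 4-token document.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunks = late_chunk(tokens, [(0, 2), (2, 4)])
print(chunks)
```

By contrast, early chunking would encode each span separately, so a chunk's tokens would never see the rest of the document.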
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Authors: Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
First: 2026-02-11T18:57:29+00:00 · Latest: 2026-02-11T18:57:29+00:00
Comments: Code: https://github.com/HKUST-C4G/diffusion-rm
Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
中文标题/摘要
标题:超越基于VLM的奖励:扩散原生潜在奖励建模
扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又具有计算效率的奖励函数。视觉语言模型(VLMs)已经成为了主要的奖励提供者,利用其丰富的跨模态先验来引导对齐。然而,它们的计算和内存成本可能相当大,通过像素空间奖励优化潜在的扩散生成器会引入领域不匹配,这会复杂化对齐。在本文中,我们提出了DiNa-LRM,一种直接在噪声扩散状态上进行偏好学习的扩散原生潜在奖励模型。我们的方法引入了一种噪声校准的Thurstone似然性,具有扩散噪声依赖的不确定性。DiNa-LRM 利用了一个预训练的潜在扩散主干,并带有时间步条件化的奖励头,支持推理时的噪声成簇,提供了一种扩散原生的测试时扩展机制和稳健的奖励。在图像对齐基准测试中,DiNa-LRM 显著优于现有的基于扩散的奖励基线,并且在计算成本仅为最先进的VLMs的一小部分的情况下,实现了具有竞争力的性能。在偏好优化中,我们展示了DiNa-LRM 改进了偏好优化动力学,使模型对齐更快且更节省资源。
Summary / 总结
This paper addresses the challenge of preference optimization for diffusion and flow-matching models by proposing DiNa-LRM, a diffusion-native latent reward model. It formulates preference learning directly on noisy diffusion states and introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a lower computational cost. In preference optimization, DiNa-LRM improves the dynamics of model alignment, enabling faster and more resource-efficient optimization.
本文提出了一种扩散本征的潜在奖励模型DiNa-LRM,以解决扩散和流匹配模型的偏好优化问题。与计算成本高昂的VLM不同,DiNa-LRM直接在噪声扩散状态上进行偏好学习,减少了领域不匹配并改善了对齐效果。该模型采用噪声校准的Thurstone似然,并支持推理时的噪声成簇,使其更加高效。实验表明,DiNa-LRM在图像对齐基准测试中优于现有的基于扩散的奖励方法,并且在计算成本较低的情况下达到了与VLM相当的性能,同时提高了偏好优化的动力学,实现了更快和更高效的模型对齐。
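To make the "noise-calibrated Thurstone likelihood" more concrete, here is a minimal sketch of a generic Thurstone-style preference probability in which the uncertainty of each reward grows with the diffusion noise level. The linear schedule `toy_sigma` and all function names are assumptions for illustration only; the abstract does not specify how DiNa-LRM parameterizes its noise-dependent uncertainty.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def toy_sigma(noise_level: float, base: float = 0.5, slope: float = 1.0) -> float:
    """Hypothetical uncertainty schedule: noisier states are less certain."""
    return base + slope * noise_level


def thurstone_pref_prob(r_win: float, r_lose: float,
                        sigma_win: float, sigma_lose: float) -> float:
    """P(winner preferred) when each reward has independent Gaussian noise."""
    denom = math.sqrt(sigma_win ** 2 + sigma_lose ** 2)
    return normal_cdf((r_win - r_lose) / denom)


def thurstone_nll(r_win: float, r_lose: float,
                  sigma_win: float, sigma_lose: float) -> float:
    """Negative log-likelihood of the observed preference (clamped for safety)."""
    p = thurstone_pref_prob(r_win, r_lose, sigma_win, sigma_lose)
    return -math.log(max(p, 1e-12))


# Same reward gap, evaluated at a low and a high noise level.
p_equal = thurstone_pref_prob(1.0, 1.0, toy_sigma(0.0), toy_sigma(0.0))
p_clean = thurstone_pref_prob(2.0, 1.0, toy_sigma(0.0), toy_sigma(0.0))
p_noisy = thurstone_pref_prob(2.0, 1.0, toy_sigma(1.0), toy_sigma(1.0))
print(p_equal, p_clean, p_noisy)
```

With a fixed reward gap, a larger sigma pulls the preference probability toward 0.5, so comparisons made on heavily noised states contribute a softer training signal.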
GENIUS: Generative Fluid Intelligence Evaluation Suite
Authors: Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
First: 2026-02-11T18:55:54+00:00 · Latest: 2026-02-11T18:55:54+00:00
Abstract
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.
中文标题/摘要
标题:GENIUS:生成流体智能评估套件
统一多模态模型(UMMs)在视觉生成方面取得了显著进展。然而,现有的基准测试主要评估的是“晶体智力”,这依赖于回忆积累的知识和学习的模式。这种关注忽视了“生成流体智力(GFI)”:即即时诱导模式、通过约束进行推理和适应新场景的能力。为了严格评估这种能力,我们引入了GENIUS(GEN Fluid Intelligence EvalUation Suite,生成流体智力评估套件)。我们将GFI形式化为三个基本要素的综合。这些包括“诱导隐含模式”(例如,推断个性化的视觉偏好)、“执行即兴约束”(例如,可视化抽象的隐喻)和“适应上下文知识”(例如,模拟反直觉的物理现象)。这些基本要素共同挑战模型解决完全基于即时上下文的问题。我们对12个代表性模型的系统评估揭示了这些任务中的显著性能缺陷。至关重要的是,我们的诊断分析将这些失败模式分离出来。它表明缺陷源于对上下文理解的有限性,而不是内在生成能力的不足。为了弥合这一差距,我们提出了一种无需训练的注意力干预策略。最终,GENIUS为GFI确立了一个严格的标准,引导该领域从知识利用转向动态、通用的推理。我们的数据集和代码将在:https://github.com/arctanxarc/GENIUS 发布。
Summary / 总结
The research introduces GENIUS, a suite designed to evaluate Generative Fluid Intelligence (GFI) in models, which is the ability to induce patterns, reason through constraints, and adapt to novel scenarios. By formalizing GFI into three primitives—inducing implicit patterns, executing ad-hoc constraints, and adapting to contextual knowledge—the study evaluates 12 models and finds significant performance deficits. The analysis shows that these deficits arise from limited context comprehension. GENIUS provides a rigorous standard for GFI, promoting dynamic, general-purpose reasoning beyond knowledge utilization.
论文提出了GENIUS,一个新的评估套件,用于评估生成性流体智力(GFI),这涉及到诱导模式、通过约束进行推理以及适应新颖场景。它将GFI形式化为三个基本要素:诱导隐含模式、执行即兴约束和适应上下文知识。评估12个模型后,研究发现显著的性能缺陷,特别是在上下文理解方面。作者提出了一种无需训练的注意力干预策略,旨在引导该领域向动态、通用推理方向发展。
Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows
Authors: Shaswat Garg, Matin Moezzi, Brandon Da Silva
First: 2026-02-11T18:54:48+00:00 · Latest: 2026-02-11T18:54:48+00:00
Comments: 9 pages, 3 figures, IEEE International Conference on Robotics and Automation 2026
Abstract
Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
中文标题/摘要
标题:基于归一化流的数据高效分层目标导向强化学习
分层目标导向强化学习(H-GCRL)提供了一种强大的框架,用于通过将复杂、长期的任务分解为结构化的子目标来应对这些任务。然而,其实际应用受到数据效率差和策略表达能力有限的阻碍,尤其是在离线或数据稀缺的环境中。本文提出了一种新的框架——基于归一化流的分层隐式Q学习(NF-HIQL),该框架在分层的高低层都用表达性强的归一化流策略取代了单模高斯策略。这种设计使得可以进行可处理的对数似然计算、高效采样,并能够建模丰富的多模态行为。推导了新的理论保证,包括实值非体积保持(RealNVP)策略的显式KL散度界和PAC风格的数据效率结果,表明NF-HIQL在保持稳定性的同时提高了泛化能力。实验上,NF-HIQL在OGBench中的行走、带球和多步操作等长期任务中进行了评估。NF-HIQL在所有任务中都优于先前的目标导向和分层基线,展示了在数据有限条件下更强的鲁棒性,并突显了基于流的架构在可扩展、数据高效分层强化学习中的潜力。
Summary / 总结
This work addresses the limitations of hierarchical goal-conditioned reinforcement learning (H-GCRL) by introducing NF-HIQL, which uses normalizing flow policies to enhance data efficiency and policy expressivity. Theoretical guarantees and empirical evaluations show that NF-HIQL improves generalization and robustness under limited data, outperforming previous methods in long-horizon tasks like locomotion and manipulation.
该论文针对层级目标条件强化学习(H-GCRL)在数据效率和策略表达性方面的局限性,提出了NF-HIQL框架,使用归一化流策略提升数据效率和策略表达能力。理论保证和实验证明,NF-HIQL在数据有限的情况下提高了泛化能力和鲁棒性,并在长周期任务如运动和操作中优于先前的方法。
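The claim that flow policies give tractable log-likelihoods and efficient sampling follows from the structure of RealNVP affine coupling layers, which are invertible with a cheap log-determinant. The sketch below shows one such layer in plain Python. The fixed `toy_scale` and `toy_shift` functions stand in for learned networks and are assumptions for illustration; this is not the NF-HIQL implementation, and goal or state conditioning is omitted.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]
Net = Callable[[Vec], Vec]


def coupling_forward(z: Vec, scale_fn: Net, shift_fn: Net) -> Tuple[Vec, float]:
    """Map latent z to action a; return (a, log|det da/dz|)."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    s, t = scale_fn(z1), shift_fn(z1)
    a2 = [v * math.exp(si) + ti for v, si, ti in zip(z2, s, t)]
    return z1 + a2, sum(s)  # the first half passes through unchanged


def coupling_inverse(a: Vec, scale_fn: Net, shift_fn: Net) -> Tuple[Vec, float]:
    """Recover z from a; also return the forward log-determinant."""
    half = len(a) // 2
    a1, a2 = a[:half], a[half:]
    s, t = scale_fn(a1), shift_fn(a1)
    z2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(a2, s, t)]
    return a1 + z2, sum(s)


def log_prob(a: Vec, scale_fn: Net, shift_fn: Net) -> float:
    """Exact log-density of a under a standard normal base distribution."""
    z, log_det = coupling_inverse(a, scale_fn, shift_fn)
    base = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2.0 * math.pi)
    return base - log_det  # change-of-variables formula


def toy_scale(h: Vec) -> Vec:
    """Hypothetical stand-in for a learned scale network."""
    return [math.tanh(v) for v in h]


def toy_shift(h: Vec) -> Vec:
    """Hypothetical stand-in for a learned shift network."""
    return [2.0 * v for v in h]


z = [0.5, -1.0, 2.0, 0.3]
a, log_det = coupling_forward(z, toy_scale, toy_shift)
z_back, _ = coupling_inverse(a, toy_scale, toy_shift)
print(a, log_det, log_prob(a, toy_scale, toy_shift))
```

Sampling is a single forward pass from Gaussian noise, and the density needs only one inverse pass, which is what makes likelihood-based policy objectives practical with this family of flows. Stacking several layers with alternating halves lets every coordinate be transformed.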
AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
Authors: R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
First: 2026-02-10T10:08:51+00:00 · Latest: 2026-02-11T18:51:19+00:00
Comments: Library opensource and available at https://github.com/Lexsi-Labs/aligntune
Abstract
Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
中文标题/摘要
标题:AlignTune:大型语言模型后训练对齐的模块化工具包
后训练对齐是部署大型语言模型(LLMs)的核心,但实际工作流仍然分散在后端特定工具和临时粘合代码之间,使得实验难以复现。我们确定后端干扰、奖励碎片化和不可复现的管道是对齐研究中的关键障碍。我们引入了AlignTune,这是一种模块化工具包,提供了一个统一接口,用于监督微调(SFT)和类似RLHF的优化,支持可互换的TRL和Unsloth后端。AlignTune标准化了配置,提供了可扩展的奖励层(基于规则和学习),并集成了标准基准和自定义任务的评估。通过将后端特定逻辑隔离在单一工厂边界后面,AlignTune使可控比较和可复现的对齐实验成为可能。
Summary / 总结
The work is motivated by backend interference, reward fragmentation, and irreproducible pipelines in post-training alignment of large language models. It introduces AlignTune, a modular toolkit that standardizes configuration, provides an extensible reward layer (rule-based and learned), and exposes a unified interface for supervised fine-tuning and RLHF-style optimization with interchangeable TRL and Unsloth backends. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
研究动机是解决大型语言模型后训练对齐中的可重复性和后端干扰问题。主要方法是开发了AlignTune模块化工具包,标准化配置并提供可互换的后端用于监督微调和RLHF风格优化。关键实验发现包括通过将后端特定逻辑隔离在单一工厂边界后,能够进行可控比较和可重复的对齐实验。
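The "single factory boundary" design can be sketched generically as a registry plus one creation function, so that calling code never imports backend-specific modules directly. Every name below (`Trainer`, `register_backend`, `create_trainer`, `backend_a`, `backend_b`) is hypothetical and chosen for illustration; none of it is taken from the AlignTune codebase, and the stub trainers do no real training.

```python
from typing import Callable, Dict, Protocol


class Trainer(Protocol):
    """Minimal interface every backend must satisfy (hypothetical)."""

    def train(self, config: dict) -> str: ...


_BACKENDS: Dict[str, Callable[[], Trainer]] = {}


def register_backend(name: str):
    """Class decorator that records a backend under a string key."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("backend_a")
class BackendATrainer:
    def train(self, config: dict) -> str:
        # Backend-specific setup would live here, hidden from callers.
        return f"backend_a ran {config['method']}"


@register_backend("backend_b")
class BackendBTrainer:
    def train(self, config: dict) -> str:
        return f"backend_b ran {config['method']}"


def create_trainer(name: str) -> Trainer:
    """The single factory boundary: the only place backends are chosen."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
    return factory()


# The same config runs on either backend, which is what makes
# controlled backend comparisons straightforward.
config = {"method": "sft"}
results = [create_trainer(name).train(config) for name in ("backend_a", "backend_b")]
print(results)
```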
TabICLv2: A better, faster, scalable, and open tabular foundation model
Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
First: 2026-02-11T18:51:02+00:00 · Latest: 2026-02-11T18:51:02+00:00
Abstract
Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda-inria/tabicl, with synthetic data engine and pretraining code to follow.
中文标题/摘要
标题:TabICLv2:更好的、更快的、可扩展的和开放的表格基础模型
表格基础模型,如TabPFNv2和TabICL,最近在预测基准测试中取代了梯度提升树,展示了上下文学习对表格数据的价值。我们介绍了TabICLv2,这是一种新的最先进的回归和分类基础模型,基于三个支柱:(1)一种新型的合成数据生成引擎,旨在实现高预训练多样性;(2)各种架构创新,包括一种新的可扩展的注意力softmax,改进了对大型数据集的一般化,而无需进行代价高昂的长序列预训练;以及(3)优化的预训练协议,显著地用Muon优化器取代了AdamW。在TabArena和TALENT基准测试中,TabICLv2在没有任何调整的情况下超越了当前最先进的模型RealTabPFN-2.5(经过超参数调整、集成和在真实数据上微调)。仅在中等预训练计算下,TabICLv2在50GB GPU内存下有效泛化到百万规模的数据集,同时比RealTabPFN-2.5更快。我们提供了广泛的消融研究来量化这些贡献,并承诺开放研究,首先在https://github.com/soda-inria/tabicl发布推理代码和模型权重,合成数据引擎和预训练代码将随后发布。
Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models
Authors: Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou
Venue: WSDM 2026
First: 2024-08-13T08:22:01+00:00 · Latest: 2026-02-11T18:49:00+00:00
Comments: Accepted at WSDM 2026. Title changed from "Computation-friendly graph neural network design by accumulating knowledge on large language models" to "Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models"
Abstract
High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descriptive inputs, often resulting in generic or even misleading model suggestions. Achieving proficiency in designing data-aware models -- defined as the meta-level capability to systematically accumulate, interpret, and apply data-specific design knowledge -- remains challenging for existing automated approaches, due to their inefficient construction and application of meta-knowledge. To achieve meta-level proficiency, we propose DesiGNN, a knowledge-centered framework that systematically converts past model design experience into structured, fine-grained knowledge priors well-suited for meta-learning with LLMs. To account for the inherent variability and external noise, DesiGNN aligns empirical property filtering from extensive benchmarks with adaptive elicitation of literature insights via LLMs. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds and achieve consistently superior performance with minimal search cost compared to baselines.
中文标题/摘要
标题:通过在大型语言模型上积累知识来精通图神经网络设计
高级自动化在AI中越来越关键,受到大型语言模型(LLMs)和AI代理快速进步的推动。然而,尽管LLMs具有普遍推理能力,但在设计图神经网络(GNNs)等专门的数据敏感任务上仍面临巨大挑战。这种困难源于(1)在建模图属性与合适架构之间复杂多变的关系时的知识缺口,以及(2)来自误导性描述输入的外部噪音,通常导致通用甚至误导性的模型建议。实现数据感知模型的专业设计——定义为元层面的能力,能够系统地积累、解释和应用数据特定的设计知识——对于现有自动化方法来说仍然具有挑战性,因为它们在元知识的构建和应用方面效率低下。为了实现元层面的专业性,我们提出了一种以知识为中心的框架DesiGNN,该框架系统地将过去的模型设计经验转化为结构化、细粒度的知识先验,这些先验非常适合LLMs的元学习。为了应对固有的变异性与外部噪音,DesiGNN将广泛的基准测试中的经验属性过滤与通过LLMs适应性地提取文献见解相结合。通过在未知图理解与已知有效架构模式之间构建坚实的元知识,DesiGNN可以在几秒内为未见过的数据集提供排名前5.77%的初始模型提案,并且与基线相比,具有最小的搜索成本且性能始终更优。
Summary / 总结
The paper addresses the challenge of designing specialized Graph Neural Networks (GNNs) using high-level automation, which is limited by the knowledge gaps and external noise faced by large language models (LLMs). To overcome these issues, the authors propose DesiGNN, a knowledge-centered framework that converts past model design experiences into structured knowledge priors for meta-learning with LLMs. DesiGNN aligns empirical property filtering with adaptive literature insights to achieve top-5.77% initial model proposals for unseen datasets within seconds, outperforming baseline methods with minimal search cost.
论文旨在通过利用大型语言模型(LLM)的知识来解决设计高效图神经网络(GNN)的挑战。提出了一种知识中心化的框架DesiGNN,将过去的模型设计经验转化为结构化的知识先验,并结合广泛的实证属性过滤与LLM驱动的文献洞察。DesiGNN能够在几秒内为未见过的数据集生成顶级5.77%的初始模型提案,并且与基线相比具有最小的搜索成本且性能更优。
FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
Authors: Jiayi Zhou, Yang Sheng, Hantao Lou, Yaodong Yang, Jie Fu
First: 2026-02-11T18:48:11+00:00 · Latest: 2026-02-11T18:48:11+00:00
Comments: 27 pages
Abstract
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 satisfiability modulo theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
中文标题/摘要
标题:FormalJudge:一种代理监督的神经符号范式
随着基于LLM的代理越来越多地在具有实际后果的高风险领域中运行,确保其行为安全变得至关重要。占主导地位的监督范式,LLM作为法官,面临着一个根本性的困境:概率系统如何可靠地监督其他概率系统而不继承其失败模式?我们认为形式化验证提供了一种从困境中解脱出来的原则性方法,但其采用受到了一个关键瓶颈的阻碍:自然语言需求到形式化规范的转换。本文通过提出一种神经符号框架来弥合这一差距,该框架采用双向Formal-of-Thought架构:LLM作为规范编译器,从上而下将高层次的人类意图分解为可验证的原子约束,然后从下而上使用Dafny规范和Z3可满足性模理论求解来证明合规性,从而产生数学保证而非概率分数。我们在三个基准测试中进行了验证,涵盖行为安全、多领域约束遵守以及代理向上欺骗检测。在7个代理模型上的实验表明,相比LLM作为法官的基线,实现了16.6%的平均改进,使7B法官能够从72B代理中检测欺骗的准确率超过90%,并通过迭代细化提供了接近线性的安全改进。
Summary / 总结
FormalJudge is a neuro-symbolic framework for overseeing the behavioral safety of LLM-based agents in high-stakes domains. It addresses the bottleneck of translating natural language requirements into formal specifications with a bidirectional Formal-of-Thought architecture, in which LLMs decompose intent into atomic constraints and compliance is then proved with Dafny and Z3. Experiments on 7 agent models show an average improvement of 16.6% over LLM-as-a-Judge baselines, weak-to-strong generalization (a 7B judge exceeds 90% accuracy at detecting deception from 72B agents), and near-linear safety improvement through iterative refinement.
FormalJudge 是一个神经符号框架,旨在确保 LLM 基础代理在高风险领域的行为安全性。它通过双向 Formal-of-Thought 架构解决自然语言需求向形式规范的转换难题。实验表明,FormalJudge 在各种基准测试中比 LLM-as-a-Judge 基线高出 16.6%,实现了从弱到强的泛化,并通过迭代细化提供了接近线性的安全性改进。
Equivariant symmetry-aware head pose estimation for fetal MRI
Authors: Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot
First: 2025-12-04T15:15:55+00:00 · Latest: 2026-02-11T18:47:39+00:00
Abstract
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, supporting future clinical translation. Our implementation is publicly available at github.com/MedicalVisionGroup/E3-Pose.
中文标题/摘要
标题:面向胎儿MRI的等变对称感知头部姿态估计
我们提出了E(3)-Pose,这是一种新颖的快速姿态估计方法,能够同时和明确地建模旋转等变性和物体对称性。我们的工作旨在解决诊断MRI扫描中胎儿头部运动的挑战性问题。我们旨在通过支持快速获取的3D MRI体积,实现自动适应的2D诊断MRI切片,并提供6-DoF头部姿态估计。现有方法难以在临床体积上泛化,主要是由于由固有解剖对称性引起的姿态歧义,以及低分辨率、噪声和伪影。相比之下,E(3)-Pose通过构造捕捉了解剖对称性和刚性姿态等变性,并提供了胎儿头部姿态的稳健估计。我们在公开可用且具有代表性的临床胎儿MRI数据集上的实验表明,我们的方法在不同领域具有优越的稳健性和泛化能力。至关重要的是,E(3)-Pose在临床MRI体积上达到了最先进的准确性,支持未来的临床转化。我们的实现可在github.com/MedicalVisionGroup/E3-Pose获取。
Summary / 总结
E(3)-Pose is a novel pose estimation method that models rotation equivariance and object symmetry to address the challenge of fetal head motion during MRI scans. It aims to enable automatic 2D slice prescription with 6-DoF head pose estimation using 3D MRI volumes. Experiments show that E(3)-Pose outperforms existing methods in terms of robustness and generalization across different clinical datasets, achieving state-of-the-art accuracy on clinical MRI volumes. The method is designed to capture anatomical symmetries and rigid pose equivariance, making it suitable for clinical translation.
E(3)-Pose 是一种新颖的姿态估计方法,通过建模旋转等变性和物体对称性来解决胎儿头部在诊断 MRI 扫描过程中运动的挑战。该方法旨在利用 3D MRI 体积实现自动 6-DoF 头部姿态估计,以支持 2D 诊断 MRI 切片。实验表明,E(3)-Pose 在不同临床数据集上的鲁棒性和泛化能力优于现有方法,并在临床 MRI 体积上达到了最先进的准确性。该方法设计用于捕捉解剖对称性和刚性姿态等变性,有助于克服姿态歧义和伪影。
Just on Time: Token-Level Early Stopping for Diffusion Language Models
Authors: Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv
First: 2026-02-11T18:44:04+00:00 · Latest: 2026-02-11T18:44:04+00:00
Comments: Under review
Abstract
Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
中文标题/摘要
标题:恰到好处:扩散语言模型的标记级早期停止
扩散语言模型通过迭代细化生成文本,这一过程通常因许多标记在最终去噪步骤之前就已稳定而计算效率低下。我们提出了一种无需训练的标记级早期停止方法,能够在每个位置独立识别收敛。该方法利用模型预测和局部上下文中的轻量级信号,动态确定何时可以最终确定个别标记。这实现了适应性的标记级冻结,无需针对特定任务的微调,大幅减少了所需的扩散步骤总数。在涵盖数学推理、通用问答和科学理解等多种基准测试中,我们的方法在保持生成质量的同时实现了最先进的效率提升。
Summary / 总结
The research aims to improve the efficiency of diffusion language models by identifying when individual tokens have reached stability. The method uses lightweight signals from the model's predictions to determine when to stop refining each token, leading to adaptive per-token freezing. This approach reduces the total number of diffusion steps required while maintaining generation quality across various benchmarks, including mathematical reasoning, general question answering, and scientific understanding, achieving state-of-the-art efficiency gains.
研究旨在通过引入基于令牌级别的早期停止方法来提高扩散语言模型的计算效率。该方法利用模型预测中的轻量级信号来识别何时单个令牌达到稳定状态,从而实现无需特定任务微调的适应性令牌冻结。该方法在数学推理、通用问答和科学理解等多种基准测试中,显著减少了所需的扩散步骤数量,同时保持了生成质量。
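A minimal sketch of per-token freezing is given below. The abstract does not specify the exact convergence signals, so the rule used here (freeze a position once its predicted token has stayed the same for `patience` consecutive steps with confidence at least `min_conf`) is an assumed stand-in, and the per-step predictions are a hand-written toy trace rather than real model output.

```python
from typing import List, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (predicted token, confidence)


def run_with_early_stopping(
    steps: Sequence[Sequence[Prediction]],
    patience: int = 2,
    min_conf: float = 0.9,
) -> Tuple[List[Optional[str]], List[Optional[int]], int]:
    """Freeze each position independently; stop once all are frozen.

    Returns (final tokens, step index at which each position froze,
    number of denoising steps actually consumed).
    """
    n = len(steps[0])
    frozen: List[Optional[str]] = [None] * n
    frozen_at: List[Optional[int]] = [None] * n
    last: List[Optional[str]] = [None] * n
    streak = [0] * n
    used = 0
    for step_idx, preds in enumerate(steps):
        used = step_idx + 1
        for pos, (tok, conf) in enumerate(preds):
            if frozen[pos] is not None:
                continue  # a real sampler would skip updating this token
            streak[pos] = streak[pos] + 1 if tok == last[pos] else 1
            last[pos] = tok
            if streak[pos] >= patience and conf >= min_conf:
                frozen[pos], frozen_at[pos] = tok, step_idx
        if all(t is not None for t in frozen):
            break  # every token converged before the final step
    final = [f if f is not None else l for f, l in zip(frozen, last)]
    return final, frozen_at, used


# Toy trace: position 0 stabilizes early, position 1 changes its mind once.
trace = [
    [("the", 0.95), ("cat", 0.40)],
    [("the", 0.97), ("dog", 0.60)],
    [("the", 0.99), ("dog", 0.92)],
    [("the", 0.99), ("dog", 0.95)],
]
final, frozen_at, used = run_with_early_stopping(trace)
print(final, frozen_at, used)
```

In this trace the loop exits after three of the four scheduled steps, which is the kind of saving the method targets; positions that never satisfy the rule simply keep their last prediction.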
Expanding the Capabilities of Reinforcement Learning via Text Feedback
Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
First: 2026-02-02T18:56:56+00:00 · Latest: 2026-02-11T18:43:26+00:00
Comments: 43 pages, 6 figures
Abstract
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
中文标题/摘要
标题:通过文本反馈扩展强化学习的能力
对于LLM后训练的RL成功,其根源在于一个不合理地不具信息性的来源:每次回放中仅有一比特信息作为二元奖励或偏好标签。在另一极端,蒸馏提供了密集的监督,但需要演示,这既昂贵又难以扩展。我们研究文本反馈作为中间信号:比标量奖励更丰富,但比完整演示更便宜。文本反馈是人类互动的自然方式,并且在许多现实世界环境中已经非常普遍,用户、注释者和自动裁判经常对LLM输出进行评价。为了大规模利用文本反馈,我们形式化了一个多轮RL设置,文本反馈强化学习(RLTF),其中在训练期间可用文本反馈,但在推理时不可用。因此,模型必须学会内化反馈以提高其测试时单轮性能。为此,我们提出了两种方法:自我蒸馏(RLTF-SD),训练单轮策略使其匹配自身反馈条件下的第二轮生成;反馈建模(RLTF-FM),将预测反馈作为辅助目标。我们对这两种方法进行了理论分析,并在推理谜题、竞赛数学和创造性写作任务上进行了实证评估。结果显示,两种方法在基准测试中均优于强基线,突显了大规模使用额外丰富监督源的RL潜力。
Summary / 总结
This paper aims to enhance reinforcement learning (RL) for large language models (LLMs) by utilizing text feedback as a richer yet less costly signal compared to binary rewards or full demonstrations. The authors propose two methods, Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM), to train models to internalize text feedback during training. Empirical evaluations on reasoning puzzles, competition math, and creative writing tasks demonstrate that both methods outperform strong baselines, indicating the potential of using text feedback to improve RL performance at scale.
本文探讨了使用文本反馈来增强大型语言模型(LLM)的强化学习(RL),通过提供比标量奖励更丰富的信号,但需要的劳动量比完整示范少。作者提出了两种方法,Self Distillation(RLTF-SD)和Feedback Modeling(RLTF-FM),以使模型在训练期间能够内化文本反馈。在逻辑谜题、竞赛数学和创造性写作任务上的实证评估表明,这两种方法都优于强基线,表明使用文本反馈来提高RL性能的潜力。
MIND: Benchmarking Memory Consistency and Action Control in World Models
Authors: Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, Alex Jinpeng Wang
First: 2026-02-08T15:57:23+00:00 · Latest: 2026-02-11T18:42:39+00:00
Abstract
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Code: https://github.com/CSU-JPG/MIND.
中文标题/摘要
标题:MIND:世界模型中记忆一致性与动作控制的评测基准
世界模型旨在理解、记忆和预测动态视觉环境,但目前仍缺乏统一的基准来评估其基本能力。为解决这一问题,我们引入了MIND,这是首个开放领域闭环重访基准,用于评估世界模型中的记忆一致性与动作控制。MIND包含250个1080p、24 FPS的高质量视频,包括共享动作空间下的100个(第一人称)+100个(第三人称)视频片段,以及不同动作空间下的25个+25个视频片段,涵盖八个不同场景。我们设计了一个高效的评估框架,以衡量两种核心能力:记忆一致性与动作控制,捕捉不同视角下的时间稳定性和上下文连贯性。此外,我们设计了多种动作空间,包括不同的角色移动速度和相机旋转角度,以评估共享场景下跨不同动作空间的动作泛化能力。为了便于未来在MIND上进行性能基准测试,我们引入了MIND-World,一种新颖的交互式视频到世界(Video-to-World)基线。大量实验表明了MIND的完整性,并揭示了当前世界模型面临的关键挑战,包括难以保持长期记忆一致性以及难以跨动作空间泛化。代码:https://github.com/CSU-JPG/MIND.
Summary / 总结
MIND is a benchmark for evaluating memory consistency and action control in world models, addressing the lack of a unified benchmark for these abilities. It contains 250 high-quality videos at 1080p and 24 FPS, spanning first-person and third-person views and varied action spaces, and its evaluation framework measures temporal stability and contextual coherence across viewpoints. Experiments reveal that current world models struggle to maintain long-term memory consistency and to generalize across action spaces.
MIND 是一个新的基准,用于评估世界模型中的记忆一致性和动作控制能力,解决了缺乏统一评估框架的问题。它包含 250 个高质量视频,具有多种动作空间和场景。评估框架衡量时间稳定性和上下文一致性,揭示了保持长期记忆一致性和在不同动作空间之间泛化的挑战。
From Circuits to Dynamics: Understanding and Stabilizing Failure in 3D Diffusion Transformers
Authors: Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins
First: 2026-02-11T18:42:05+00:00 · Latest: 2026-02-11T18:42:05+00:00
Abstract
Reliable surface completion from sparse point clouds underpins many applications spanning content creation and robotics. While 3D diffusion transformers attain state-of-the-art results on this task, we uncover that they exhibit a catastrophic mode of failure: arbitrarily small on-surface perturbations to the input point cloud can fracture the output into multiple disconnected pieces -- a phenomenon we call Meltdown. Using activation-patching from mechanistic interpretability, we localize Meltdown to a single early denoising cross-attention activation. We find that the singular-value spectrum of this activation provides a scalar proxy: its spectral entropy rises when fragmentation occurs and returns to baseline when patched. Interpreted through diffusion dynamics, we show that this proxy tracks a symmetry-breaking bifurcation of the reverse process. Guided by this insight, we introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning. We demonstrate that Meltdown persists across state-of-the-art architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB) and denoising strategies (DDPM, DDIM), and that PowerRemap effectively counters this failure with stabilization rates of up to 98.3%. Overall, this work is a case study on how diffusion model behavior can be understood and guided based on mechanistic analysis, linking a circuit-level cross-attention mechanism to diffusion-dynamics accounts of trajectory bifurcations.
Summary / 总结
This paper addresses the issue of catastrophic failure in 3D diffusion transformers, known as Meltdown, where small perturbations to input point clouds can cause the output to fragment into disconnected pieces. By using activation-patching from mechanistic interpretability, the authors identify a specific early denoising cross-attention activation responsible for this phenomenon. They introduce PowerRemap, a test-time control, which stabilizes the output and counters Meltdown with high effectiveness, achieving up to 98.3% stabilization rates across various architectures, datasets, and denoising strategies.
该研究探讨了3D扩散变换器中的灾难性故障问题,即Meltdown现象,这种现象会导致输入点云的小幅扰动使输出分裂成多个不连续的部分。通过使用机制可解释性中的激活补丁技术,研究人员将问题定位到一个特定的早期去噪交叉注意力激活,并发现其奇异值谱可以预测分裂的发生。基于这一洞察,他们引入了PowerRemap,一种测试时的控制方法,可以稳定输出,并在不同的架构、数据集和去噪策略下表现出高达98.3%的稳定化率。
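The scalar proxy described in this entry, the spectral entropy of an activation's singular values, can be illustrated with a minimal sketch. It assumes the singular values have already been computed and that the spectrum is normalized by its sum before taking Shannon entropy; the paper may use a different normalization (for example squared singular values), and the two spectra below are hypothetical.

```python
import math

def spectral_entropy(singular_values):
    """Shannon entropy (in nats) of a sum-normalized singular-value spectrum."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical spectra: one dominated by a single direction, one spread evenly.
concentrated = [10.0, 0.1, 0.1, 0.1]
spread = [1.0, 1.0, 1.0, 1.0]

# A flatter spectrum has higher entropy; the uniform case reaches log(rank).
print(spectral_entropy(concentrated), spectral_entropy(spread))
```

Under this reading, a rise in the value signals that energy is spreading across more singular directions, which is the behavior the abstract associates with fragmentation.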
Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards
Authors: Reinhard Heckel, Mahdi Soltanolkotabi, Christos Thramboulidis
First: 2026-02-11T18:39:42+00:00 · Latest: 2026-02-11T18:39:42+00:00
Abstract
Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
中文标题/摘要
标题:具有可验证奖励的强化学习中的非对称提示加权
具有可验证奖励的强化学习推动了LLM后训练的近期进展,特别是在推理方面。策略优化算法为给定的提示生成多个响应,然后根据奖励有效地加权相应的梯度。最受欢迎的算法包括GRPO、DAPO和RLOO,它们侧重于模糊提示,即成功率介于中间的提示,而降低非常容易和非常难的提示的梯度权重。在本文中,我们考虑了非对称提示加权,为具有低的,甚至为零的实证成功率的提示分配更高的权重。我们发现,非对称加权特别有利于从零开始的RL(如R1-Zero),其中训练跨越广泛的准确率范围,而在后SFT RL中,模型已经从高准确率开始,其受益较少。我们还提供了理论,描述了在固定更新预算下,使成功率从初始水平提高到目标准确率所需时间最小化的提示权重。在低成功率区域,由于信息性响应稀少,响应成本占主导地位,这些最优权重变得非对称,加权低成功率,从而加速有效时间收敛。
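To see why group-baseline algorithms concentrate on prompts with intermediate success probability, the sketch below computes the mean absolute advantage within a group of binary rewards. It is an illustration under simplifying assumptions, not the paper's analysis: `normalize_std=True` mimics a GRPO-style advantage (mean-centered, divided by the group standard deviation), and `normalize_std=False` a plain mean-centered baseline. The function name and the group size of 8 are hypothetical.

```python
def mean_abs_advantage(rewards, normalize_std):
    """Average |advantage| in one prompt's group of binary rewards."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    if var == 0.0:
        return 0.0  # all-correct or all-wrong group: no gradient signal
    scale = var ** 0.5 if normalize_std else 1.0
    return sum(abs(r - mean) / scale for r in rewards) / g

GROUP_SIZE = 8  # hypothetical number of responses sampled per prompt
for k in range(GROUP_SIZE + 1):
    rewards = [1.0] * k + [0.0] * (GROUP_SIZE - k)
    p_hat = k / GROUP_SIZE  # empirical success probability
    print(p_hat,
          round(mean_abs_advantage(rewards, normalize_std=True), 3),
          round(mean_abs_advantage(rewards, normalize_std=False), 3))
```

With success probability p, the two columns follow 2·sqrt(p(1-p)) and 2·p(1-p). Both are symmetric around p = 0.5 and vanish at p = 0 and p = 1, which is the symmetric weighting that the asymmetric schemes in this entry depart from.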
PhyCritic: Multimodal Critic Models for Physical AI
Authors: Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
First: 2026-02-11T18:35:39+00:00 · Latest: 2026-02-11T18:35:39+00:00
Abstract
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
中文标题/摘要
标题:PhyCritic:面向物理AI的多模态批评模型
随着大型多模态模型的快速发展,可靠且准确的批评和评判模型对于开放性评估和偏好对齐变得至关重要,它们能够提供成对偏好、数值评分以及评估模型生成响应的解释性说明。然而,现有的批评模型主要是在一般视觉领域如描述或图像问答中进行训练,而涉及感知、因果推理和规划的物理AI任务则被严重忽视。我们提出了PhyCritic,这是一种通过两阶段RLVR管道优化的多模态批评模型:物理技能预热阶段增强物理导向的感知和推理能力,随后是自我参照批评微调,批评模型在生成自己的预测作为内部参考后再进行评判,从而提高评判的稳定性和物理正确性。在物理和通用多模态评判基准测试中,PhyCritic在开源基线之上取得了显著的性能提升,当作为策略模型应用时,进一步提高了物理接地任务中的感知和推理能力。
Summary / 总结
PhyCritic is a multimodal critic model designed to enhance the evaluation of physical AI tasks. It uses a two-stage RLVR pipeline, first warming up physical skills and then fine-tuning a self-referential critic. This approach improves judgment stability and physical correctness. PhyCritic outperforms open-source baselines on both physical and general multimodal judge benchmarks and enhances perception and reasoning in physically grounded tasks.
PhyCritic 是一种针对物理 AI 任务的多模态批评模型。它使用两阶段的 RLVR 管道来增强物理感知和推理,随后进行自我参照批评微调。该模型在物理和通用多模态评判基准测试中均优于开源基线,并且能够进一步提高物理相关任务中的感知和推理能力。
A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Computers
Authors: Jeffrey Joan Sam, Janhavi Sathe, Nikhil Chigali, Naman Gupta, Radhey Ruparel, Yicheng Jiang, Janmajay Singh, James W. Berck, Arko Barman
First: 2025-07-14T20:02:40+00:00 · Latest: 2026-02-11T18:32:56+00:00
Abstract
Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA's TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Our dataset includes images with several real-world challenges, including noise, camera distortions, glare, varying lighting conditions, varying field of view, partial spacecraft visibility, brightly-lit city backgrounds, densely patterned and confounding backgrounds, aurora borealis, and a wide variety of spacecraft geometries. Finally, we finetuned YOLOv8 and YOLOv11 models for spacecraft segmentation to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA's inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
中文标题/摘要
标题:面向星载计算机实时航天器分割的新数据集与性能基准
部署在外太空的航天器由于暴露在危险环境中,经常遭受各种形式的损害。此外,通过宇航员出舱活动或机器人操作进行的空间维修存在重大风险,导致高昂的操作成本。最近在图像分割方面的进展可能使开发可靠的自主检查系统成为可能。尽管这些模型通常需要大量训练数据才能达到满意的效果,但公开的标注航天器分割数据非常稀缺。在此,我们使用真实的航天器模型,并结合使用NASA的TTALOS管道生成的真实和合成背景的混合体,创建了一个近64,000张标注的航天器图像的新数据集。为了模拟真实世界图像获取中的相机畸变和噪声,我们还向图像中添加了不同类型的噪声和畸变。该数据集包括多种真实世界挑战的图像,包括噪声、相机畸变、反光、光照条件变化、视场变化、部分可见的航天器、明亮的城市背景、密集的图案和混淆的背景、极光以及各种航天器几何形状。最后,我们针对NASA的检查员航天器在明确的硬件和推理时间约束下,对YOLOv8和YOLOv11模型进行了微调,以生成数据集的性能基准,模拟实时航天器上的空间图像分割挑战。在这些约束条件下测试后,生成的模型的Dice分数为0.92,Hausdorff距离为0.69,推理时间为约0.5秒。数据集和用于性能基准的模型可在https://github.com/RiceD2KLab/SWiM/获取。
Summary / 总结
This research addresses the need for reliable autonomous inspection systems for damaged spacecraft by developing a new dataset of nearly 64,000 annotated images. The dataset includes various real-world challenges such as noise, camera distortions, and varying lighting conditions. The authors fine-tuned YOLOv8 and YOLOv11 models on this dataset, achieving a Dice score of 0.92 and a Hausdorff distance of 0.69 with an inference time of about 0.5 seconds, suitable for real-time onboard applications.
该研究旨在通过开发包含近64,000张标注图像的新数据集来实现对太空飞船的可靠自主检查系统,这些图像包括各种现实世界挑战。数据集使用真实的太空飞船模型和混合背景创建,加入了噪声和失真以模拟现实条件。作者对YOLOv8和YOLOv11模型进行了微调,实现了在定义的硬件约束下Dice得分为0.92、Hausdorff距离为0.69以及推理时间为约0.5秒的结果。
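The two segmentation metrics reported in this entry can be sketched for small binary masks as follows. This is a minimal illustration using textbook definitions on nested 0/1 lists; how the authors compute or normalize the Hausdorff distance (for example in pixels or relative to image size) is not stated in the abstract, so the pixel-unit version below is an assumption, and the masks are hypothetical.

```python
import math

def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested 0/1 lists."""
    inter = sum(p * t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 1.0 if total == 0 else 2.0 * inter / total

def foreground(mask):
    """Coordinates (row, col) of all foreground pixels."""
    return [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]

def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between foreground pixel sets, in pixels."""
    a, b = foreground(pred), foreground(truth)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))

# Hypothetical 2x3 masks that overlap on two of three foreground pixels.
pred = [[0, 1, 1], [0, 1, 0]]
truth = [[0, 1, 0], [0, 1, 1]]
print(dice_score(pred, truth), hausdorff(pred, truth))
```

Dice rewards overlap (higher is better), while the Hausdorff distance penalizes the worst boundary mismatch (lower is better), so the two numbers complement each other.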
Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations
Authors: Kadircan Aksoy, Protim Bhattacharjee, Peter Jung
First: 2026-01-28T10:46:44+00:00 · Latest: 2026-02-11T18:32:03+00:00
Abstract
We study the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. We model classification as a set of binary tests between class-conditional distributions of representations and empirically show that, along training trajectories, well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules via monotonic improvements in KL divergence that relate to error rate exponents. We finally discuss how this yields an explanation and possible training or regularization strategies for different classes of neural networks.
中文标题/摘要
标题:隐含假设检验与神经网络表示中的散度保持
我们通过二元假设检验的视角研究神经分类器的监督训练动态。我们将分类建模为表示的类条件分布之间的一组二元检验,并通过实验表明,在训练轨迹上,泛化良好的网络通过与错误率指数相关的 KL 散度的单调改进,越来越接近 Neyman-Pearson 最优决策规则。最后,我们讨论了这如何为不同类别的神经网络提供解释以及可能的训练或正则化策略。
Summary / 总结
The study examines the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. It shows empirically that well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules along their training trajectories, via monotonic improvements in KL divergence that relate to error rate exponents. The authors discuss how this offers an explanation and possible training or regularization strategies for different classes of neural networks.
研究通过二元假设检验的角度探讨了神经分类器的学习动态,显示了随着KL散度的提升和与错误率指数的关联,表现良好的网络逐渐与Neyman-Pearson最优决策规则对齐。研究提供了训练动态的见解,并提出了不同神经网络类别的潜在策略。
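The link between KL divergence and error exponents can be made concrete with a small sketch on discrete distributions. It assumes two known class-conditional distributions over a two-symbol alphabet (hypothetical values, not the paper's learned representations) and uses the standard likelihood-ratio test, which is the Neyman-Pearson optimal rule. By the Chernoff-Stein lemma, with the type I error held fixed, the best achievable type II error over n i.i.d. samples decays roughly as exp(-n · D(P0 || P1)), so a larger divergence means a faster error decay.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_likelihood_ratio(samples, p, q):
    """Sum of log p(x)/q(x) over i.i.d. samples given as symbol indices."""
    return sum(math.log(p[x] / q[x]) for x in samples)

def neyman_pearson_decide(samples, p, q, threshold=0.0):
    """Likelihood-ratio test: accept H0 (data ~ p) when the ratio exceeds the threshold."""
    return "H0" if log_likelihood_ratio(samples, p, q) > threshold else "H1"

# Hypothetical class-conditional distributions over two symbols.
p0 = [0.5, 0.5]
p1 = [0.9, 0.1]

print(kl_divergence(p0, p1))                      # error exponent of the test
print(neyman_pearson_decide([1, 1, 0], p0, p1))   # symbol 1 is likelier under p0
print(neyman_pearson_decide([0, 0, 0, 0], p0, p1))
```

The threshold of 0.0 is an illustrative choice; in a Neyman-Pearson setup it would be set from the desired type I error level.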
HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion
Authors: Di Chang, Ji Hou, Aljaz Bozic, Assaf Neuberger, Felix Juefei-Xu, Olivier Maury, Gene Wei-Chin Lin, Tuur Stuyck, Doug Roble, Mohammad Soleymani, Stephane Grabli
First: 2026-02-11T18:31:47+00:00 · Latest: 2026-02-11T18:31:47+00:00
Comments: Website: https://boese0601.github.io/hairweaver/
Abstract
We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject's photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.
中文标题/摘要
标题:HairWeaver:基于模拟到现实引导视频扩散的少量示例逼真头发运动合成
我们提出了HairWeaver,一种基于扩散的流水线,能够以逼真且富有表现力的方式为单个人像动画生成头发动态。现有方法虽然能够成功控制身体姿态,但在头发控制方面却缺乏特定控制,因此无法捕捉到复杂的头发运动,导致动画僵硬且不真实。HairWeaver 通过两个专门模块克服了这一限制:运动-上下文-LoRA 用于整合运动条件,Sim2Real-域-LoRA 用于在不同数据域中保持主题的逼真外观。这些轻量级组件旨在引导视频扩散主干,同时保持其核心生成能力。通过在来自CG模拟器的动态人体运动的专门数据集上进行训练,HairWeaver 能够对头发运动进行精细控制,并最终学会生成高度逼真的头发,能够自然地响应动作。全面的评估表明,我们的方法达到了新的最先进的水平,产生了具有动态细节的逼真人头发动画。
Summary / 总结
HairWeaver is a diffusion-based system that synthesizes photorealistic hair motion from a single human image. It addresses the lack of hair-specific control in existing methods with two lightweight modules: a Motion-Context-LoRA that integrates motion conditions and a Sim2Real-Domain-LoRA that preserves the subject's photoreal appearance across data domains. Evaluations show that HairWeaver produces realistic, expressive hair animations and sets a new state of the art in human hair motion synthesis.
HairWeaver 是一种基于扩散的系统,可以从单个人像生成逼真的头发动态。它通过引入 Motion-Context-LoRA 来整合运动条件,并通过 Sim2Real-Domain-LoRA 来保持照片级的真实感。该系统通过动态人类运动的专门数据集进行训练,并在生成具有详细动态的逼真头发动画方面表现出色,优于现有方法。
Renet: Principled and Efficient Relaxation for the Elastic Net via Dynamic Objective Selection
Authors: Albert Dorador
First: 2026-02-11T18:22:59+00:00 · Latest: 2026-02-11T18:22:59+00:00
Abstract
We introduce Renet, a principled generalization of the Relaxed Lasso to the Elastic Net family of estimators. While, on the one hand, $\ell_1$-regularization is a standard tool for variable selection in high-dimensional regimes and, on the other hand, the $\ell_2$ penalty provides stability and solution uniqueness through strict convexity, the standard Elastic Net nevertheless suffers from shrinkage bias that frequently yields suboptimal prediction accuracy. We propose to address this limitation through a framework called \textit{relaxation}. Existing relaxation implementations rely on naive linear interpolations of penalized and unpenalized solutions, which ignore the non-linear geometry that characterizes the entire regularization path and risk violating the Karush-Kuhn-Tucker conditions. Renet addresses these limitations by enforcing sign consistency through an adaptive relaxation procedure that dynamically dispatches between convex blending and efficient sub-path refitting. Furthermore, we identify and formalize a unique synergy between relaxation and the ``One-Standard-Error'' rule: relaxation serves as a robust debiasing mechanism, allowing practitioners to leverage the parsimony of the 1-SE rule without the traditional loss in predictive fidelity. Our theoretical framework incorporates automated stability safeguards for ultra-high dimensional regimes and is supported by a comprehensive benchmarking suite across 20 synthetic and real-world datasets, demonstrating that Renet consistently outperforms the standard Elastic Net and provides a more robust alternative to the Adaptive Elastic Net in high-dimensional, low signal-to-noise ratio and high-multicollinearity regimes. By leveraging an adaptive solver backend, Renet delivers these statistical gains while offering a computational profile that remains competitive with state-of-the-art coordinate descent implementations.
中文标题/摘要
标题:Renet:通过动态目标选择的弹性网的有原则且高效的松弛方法
我们引入了Renet,这是一种弹性网估计器家族中Relaxed Lasso的有原则的推广。一方面,$\ell_1$正则化是高维环境中进行变量选择的标准工具;另一方面,$\ell_2$惩罚通过严格的凸性提供稳定性和解的唯一性。然而,标准的弹性网仍然存在收缩偏差的问题,这通常会导致预测准确性不佳。我们通过一种称为“松弛”的框架来解决这一局限性。现有的松弛实现依赖于惩罚解和非惩罚解的简单线性插值,这忽略了整个正则化路径的非线性几何特性,并且可能会违反Karush-Kuhn-Tucker条件。Renet通过一种自适应的松弛过程动态地在凸混合和高效子路径重新拟合之间切换,来解决这些局限性。此外,我们识别并形式化了松弛与“单一标准误差”规则之间的独特协同作用:松弛作为稳健的去偏差机制,使实践者能够在不牺牲预测精度的情况下利用1-SE规则的简洁性。我们的理论框架包括了针对超高维环境的自动稳定性保障,并通过跨越20个合成和真实数据集的全面基准测试套件得到了支持,表明Renet在高维、低信噪比和高多重共线性环境中始终优于标准的弹性网,并提供了一种比自适应弹性网更稳健的替代方案。通过利用自适应求解器后端,Renet在提供这些统计增益的同时,保持了与最先进的坐标下降实现相当的计算性能。
Summary / 总结
Renet is a principled generalization of the Relaxed Lasso to the Elastic Net family that addresses the shrinkage bias of the standard Elastic Net. It enforces sign consistency through an adaptive relaxation procedure that dynamically dispatches between convex blending and efficient sub-path refitting. Across 20 synthetic and real-world datasets, Renet consistently outperforms the standard Elastic Net and offers a more robust alternative to the Adaptive Elastic Net in high-dimensional, low signal-to-noise, and high-multicollinearity regimes, while remaining computationally competitive with state-of-the-art coordinate descent implementations.
Renet 将 Relaxed Lasso 有原则地推广到 Elastic Net 估计器家族,旨在解决标准 Elastic Net 的收缩偏差问题。它采用一种自适应松弛过程,在凸混合和高效的子路径重新拟合之间动态切换,以保持符号一致性。在20个合成和真实数据集上的基准测试表明,Renet 在高维、低信噪比和高多重共线性场景中始终优于标准 Elastic Net,并为自适应 Elastic Net 提供了更稳健的替代方案,同时计算开销与最先进的坐标下降实现相当。
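The shrinkage bias and the naive linear-interpolation relaxation mentioned in this entry can be illustrated in the simplest setting. The sketch assumes an orthonormal design, where the Elastic Net objective 0.5·||y − Xβ||² + λ1·||β||₁ + 0.5·λ2·||β||² has the closed-form solution S(z, λ1)/(1 + λ2), with z the least-squares coefficients and S the soft-threshold operator. It shows the naive blend that the abstract criticizes, not Renet's adaptive procedure; the mixing parameter `phi` and all numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def elastic_net_orthonormal(z, lam1, lam2):
    """Closed-form Elastic Net coefficients when the design is orthonormal."""
    return [soft_threshold(zj, lam1) / (1.0 + lam2) for zj in z]

def naive_relaxed(z, lam1, lam2, phi):
    """Linear blend of the penalized fit (phi=1) and the active-set refit (phi=0)."""
    penalized = elastic_net_orthonormal(z, lam1, lam2)
    # With orthonormal columns, the unpenalized refit on the active set is z_j.
    refit = [zj if bj != 0.0 else 0.0 for zj, bj in zip(z, penalized)]
    return [phi * b + (1.0 - phi) * r for b, r in zip(penalized, refit)]

z = [3.0, -0.5, 1.5]  # hypothetical least-squares coefficients
print(elastic_net_orthonormal(z, lam1=1.0, lam2=0.5))  # shrunk toward zero
print(naive_relaxed(z, lam1=1.0, lam2=0.5, phi=0.5))   # partially debiased
```

In this orthonormal case the penalized and refitted coefficients always share signs, so the blend is harmless; with correlated predictors that guarantee disappears, which is the setting the sign-consistency safeguards in this entry target.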
FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference
Authors: Divya Jyoti Bajpai, Dhruv Bhardwaj, Soumya Roy, Tejas Duseja, Harsh Agarwal, Aashay Sandansing, Manjesh Kumar Hanawal
Venue: ICLR
First: 2026-02-11T18:21:11+00:00 · Latest: 2026-02-11T18:21:11+00:00
Comments: Accepted at International Conference on Learning Representations (ICLR) 2026
Abstract
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at https://github.com/Div290/FastFlow.
中文标题/摘要
标题:FastFlow:通过多臂老虎机推理加速生成流匹配模型
流匹配模型在图像和视频生成中提供了最先进的保真度,但其固有的顺序去噪过程使其速度较慢。现有的加速方法如蒸馏、轨迹截断和一致性方法是静态的,需要重新训练,并且往往无法跨任务泛化。我们提出了FastFlow,这是一种即插即用的自适应推理框架,可以加速流匹配模型的生成过程。FastFlow 识别出那些仅对去噪路径产生微小调整的去噪步骤,并在不调用用于速度预测的完整神经网络模型的情况下对其进行近似。该近似利用先前预测得到的有限差分速度估计来高效地外推未来状态,从而以零计算成本沿去噪路径更快地前进。这使得中间步骤的计算可以被跳过。我们将“在需要一次完整模型计算之前可以安全跳过多少步”的决策建模为多臂老虎机问题。老虎机学习最优的跳过步数,以在速度与性能之间取得平衡。FastFlow 可无缝集成到现有流水线中,并可泛化到图像生成、视频生成和编辑任务。实验表明,FastFlow 在保持高质量输出的同时实现了超过2.6倍的加速。此工作的源代码可以在 https://github.com/Div290/FastFlow/ 找到。
Summary / 总结
FastFlow is a plug-and-play adaptive inference framework that accelerates flow-matching models by identifying denoising steps that only slightly adjust the denoising path and approximating them with finite-difference velocity estimates extrapolated from prior predictions, rather than running the full model. The decision of how many steps to skip before the next full model evaluation is modeled as a multi-armed bandit problem that balances speed and quality. Experiments show a speedup of over 2.6x while maintaining high-quality outputs, and the framework generalizes across image generation, video generation, and editing tasks.
FastFlow 是一种自适应推理框架,通过跳过仅带来微小调整的去噪步骤来加速用于图像和视频生成的流匹配模型。它使用有限差分速度估计来近似这些步骤,从而在不调用完整模型的情况下更快地沿去噪路径前进。该框架将跳过多少步骤的决策建模为多臂老虎机问题,以平衡速度和性能。实验显示,在保持高质量输出的同时,速度提升超过2.6倍。
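The idea of skipping model evaluations by extrapolating velocities, with a bandit choosing how many steps to skip, can be sketched in a toy one-dimensional setting. Everything here is an assumption made for illustration and not the authors' implementation: `velocity` stands in for an expensive neural network, the extrapolation is a first-order finite difference, the bandit is epsilon-greedy, and the reward (steps saved minus a penalty on extrapolation error, measured at the next full evaluation) is hypothetical.

```python
import random

def velocity(x, t):
    # Hypothetical stand-in for an expensive neural velocity model (dx/dt = -x).
    return -x

def extrapolate(v_prev, v_prev2):
    # First-order finite-difference extrapolation from the two latest velocities.
    return v_prev + (v_prev - v_prev2)

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)               # each arm is a number of steps to skip
        self.epsilon = epsilon
        self.counts = [0] * len(self.arms)
        self.values = [0.0] * len(self.arms)  # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def sample(x0, n_steps, bandit, penalty=10.0):
    """Euler integration that replaces some model calls with extrapolated velocities."""
    dt = 1.0 / n_steps
    x, step, calls = x0, 0, 0
    v_prev = v_prev2 = None
    pending = None  # (arm, skipped steps, velocity predicted for the next full call)
    while step < n_steps:
        v = velocity(x, step * dt)           # full model evaluation
        calls += 1
        if pending is not None:
            arm, skipped, v_pred = pending
            bandit.update(arm, skipped - penalty * abs(v - v_pred))
            pending = None
        x += dt * v
        step += 1
        v_prev2, v_prev = v_prev, v
        if v_prev2 is None:
            continue                         # need two velocities to extrapolate
        arm = bandit.select()
        skipped = min(bandit.arms[arm], n_steps - step)
        for _ in range(skipped):
            v_hat = extrapolate(v_prev, v_prev2)  # no model call here
            x += dt * v_hat
            step += 1
            v_prev2, v_prev = v_prev, v_hat
        if skipped > 0:
            pending = (arm, skipped, extrapolate(v_prev, v_prev2))
        else:
            bandit.update(arm, 0.0)          # no skip: nothing saved, nothing risked
    return x, calls

x, calls = sample(1.0, 20, EpsilonGreedyBandit([0, 1, 2], seed=1))
print(x, calls)  # calls counts the full model evaluations actually used
```

When the only arm is 0, the loop reduces to plain Euler integration with one model call per step, which gives a reference point for judging how much quality is traded for speed.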
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
First: 2026-02-10T18:55:41+00:00 · Latest: 2026-02-11T18:20:25+00:00
Comments: 41 pages
Abstract
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
中文标题/摘要
标题:代理世界模型:无限合成环境下的自主强化学习
大型语言模型(LLM)的最新进展使自主代理能够执行需要与工具和环境进行多轮交互的复杂任务。然而,这种代理训练的扩展受到缺乏多样性和可靠环境的限制。在本文中,我们提出了代理世界模型(AWM),这是一种完全合成环境生成流水线。使用此流水线,我们扩展到涵盖日常场景的1,000个环境,在这些环境中,代理可以与丰富的工具集(每个环境平均35种工具)进行交互并获得高质量的观察结果。值得注意的是,这些环境是通过代码驱动并依托数据库的,提供了比LLM模拟环境更可靠和一致的状态转换。此外,它们相比从现实环境中收集轨迹,使代理交互更加高效。为了展示此资源的有效性,我们对多轮工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准上的实验表明,仅在合成环境中训练,而不是特定基准环境,能够获得强大的分布外泛化能力。代码可在https://github.com/Snowflake-Labs/agent-world-model/ 获取。
Summary / 总结
This paper addresses the challenge of scaling autonomous agent training by proposing Agent World Model (AWM), a synthetic environment generation pipeline. AWM creates 1,000 diverse environments with rich toolsets, enabling agents to interact in everyday scenarios. These environments are code-driven and database-backed, offering more reliable state transitions than LLM-simulated environments. Experiments show that training agents exclusively in these synthetic environments leads to strong out-of-distribution generalization on three benchmarks. The code is available on GitHub.
本文提出了Agent World Model (AWM) 合成环境生成管道,以解决自主代理训练规模化的挑战。AWM 创建了1,000个包含丰富工具集的多样化环境,使代理能够在日常场景中互动。这些环境是代码驱动和数据库支持的,提供了比LLM模拟环境更可靠的状态转换。实验表明,仅在这些合成环境中训练代理,可以在三个基准测试上实现强大的跨分布泛化。代码可在GitHub上获得。
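A code-driven, database-backed environment with a verifiable reward, as described in this entry, might look like the following toy sketch. It is a hypothetical example built on Python's standard `sqlite3` module and not taken from the AWM codebase: the `TodoEnv` class, its tools, and the task ("mark 'buy milk' as done") are all invented for illustration.

```python
import sqlite3

class TodoEnv:
    """Hypothetical toy environment: state lives in SQLite, tools are plain methods."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER DEFAULT 0)"
        )
        self.db.execute("INSERT INTO todos (title) VALUES ('buy milk')")
        self.db.commit()

    # Tools the agent may call; each returns a structured observation.
    def add_todo(self, title):
        cur = self.db.execute("INSERT INTO todos (title) VALUES (?)", (title,))
        self.db.commit()
        return {"id": cur.lastrowid}

    def complete_todo(self, todo_id):
        cur = self.db.execute("UPDATE todos SET done = 1 WHERE id = ?", (todo_id,))
        self.db.commit()
        return {"updated": cur.rowcount}

    def list_todos(self):
        rows = self.db.execute("SELECT id, title, done FROM todos ORDER BY id").fetchall()
        return [{"id": r[0], "title": r[1], "done": bool(r[2])} for r in rows]

def reward(env):
    """Return 1.0 if the task goal holds in the database state, else 0.0."""
    row = env.db.execute("SELECT done FROM todos WHERE title = 'buy milk'").fetchone()
    return 1.0 if row and row[0] == 1 else 0.0

env = TodoEnv()
print(reward(env))                 # goal not yet satisfied
todo_id = env.list_todos()[0]["id"]
env.complete_todo(todo_id)
print(reward(env))                 # goal verified directly from the database
```

Because the reward reads the final database state rather than parsing the agent's text, state transitions and success checks stay deterministic, which is the reliability argument the abstract makes against LLM-simulated environments.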
GameDevBench: Evaluating Agentic Capabilities Through Game Development
Authors: Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
First: 2026-02-11T18:15:11+00:00 · Latest: 2026-02-11T18:15:11+00:00
Abstract
Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
中文标题/摘要
标题:GameDevBench:通过游戏开发评估代理能力
尽管在编码代理方面取得了快速进展,但其多模态对应物的进展却相对滞后。一个关键挑战是缺乏能够将软件开发的复杂性与深度多模态理解需求结合起来的评估测试平台。游戏开发提供了这样一个测试平台,因为代理必须在大型、密集的代码库中导航,同时在可视化的游戏场景中操作本质上是多模态的资产,如着色器、精灵和动画。我们提出了GameDevBench,这是首个用于评估代理游戏开发任务能力的基准。GameDevBench 包含132项源自网络和视频教程的任务。这些任务需要大量的多模态理解,并且十分复杂:平均解决方案所需的代码行数和文件更改量是以往软件开发基准的三倍以上。代理在游戏开发方面仍然表现吃力,最好的代理也仅解决了54.5%的任务。我们发现感知任务难度与多模态复杂性之间存在强相关性,成功率从游戏玩法类任务的46.9%下降到2D图形任务的31.6%。为了提高多模态能力,我们为代理引入了两种简单的基于图像和视频的反馈机制。尽管方法简单,它们始终能够提高性能,其中最大的变化是Claude Sonnet 4.5的性能从33.3%提高到47.7%。我们公开发布了GameDevBench,以支持对代理式游戏开发的进一步研究。
Summary / 总结
GameDevBench evaluates agentic capabilities through game development tasks, addressing the scarcity of evaluation testbeds that combine software-development complexity with deep multimodal understanding. The benchmark includes 132 tasks derived from web and video tutorials; the average solution requires over three times the lines of code and file changes of prior software development benchmarks. The best agent solves only 54.5% of tasks, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. Two simple image- and video-based feedback mechanisms consistently improve performance, with the largest gain raising Claude Sonnet 4.5 from 33.3% to 47.7%.
GameDevBench 通过结合软件开发复杂性和深度多模态理解来评估代理在游戏开发中的能力。它包含来自网络和视频教程的132个任务,比之前的基准更为复杂。最好的代理仅解决了54.5%的任务,成功率从游戏玩法类任务的46.9%下降到2D图形任务的31.6%。两种简单的图像和视频反馈机制提高了性能,Claude Sonnet 4.5的成功率从33.3%提高到47.7%。GameDevBench 已公开发布,以促进代理游戏开发的研究。
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
First: 2026-02-11T18:09:17+00:00 · Latest: 2026-02-11T18:09:17+00:00
Abstract
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
中文标题/摘要
标题:推理模型的安全恢复仅需几步早期转向即可实现
针对显式思维链的基于强化学习(RL)的后训练(例如GRPO)提高了多模态大规模推理模型(MLRM)的推理能力。但最近的证据表明,它同时会削弱安全对齐并提高越狱成功率。我们提出了SafeThink,一种轻量级的推理时防御方法,将安全恢复视为满意化约束而非最大化目标。SafeThink使用安全奖励模型监控不断演化的推理轨迹,并仅在违反安全阈值时有条件地注入一个经过优化的简短纠正前缀(“Wait, think safely”)。在对六种开源MLRM和四种越狱基准(JailbreakV-28K、Hades、FigStep和MM-SafetyBench)的评估中,SafeThink将攻击成功率降低了30-60%(例如,LlamaV-o1在JailbreakV-28K上从63.33%降至5.74%,R1-Onevision在Hades上从69.07%降至5.65%),同时保持了推理性能(MathVista准确率:从65.20%到65.00%)。我们实验中的一个关键经验发现是,安全恢复往往只需几步引导即可实现:干预前1-3个推理步骤通常足以将整个生成引导至安全的完成结果。
Summary / 总结
The paper addresses the issue of safety degradation in reasoning models enhanced by reinforcement learning. It introduces SafeThink, a lightweight defense mechanism that monitors the reasoning process and injects a corrective prefix only when safety is compromised. SafeThink significantly reduces jailbreak success rates by 30-60% across various models and benchmarks while maintaining reasoning performance. The key finding is that safety recovery is often achievable with minimal intervention, typically within the first few reasoning steps.
论文针对强化学习增强的推理模型中安全性下降的问题,提出了一种轻量级的防御机制SafeThink。SafeThink在推理过程中监控安全状态,仅在安全被破坏时注入纠正前缀。该机制在多种模型和基准测试中将攻击成功率降低了30-60%,同时保持了推理性能。关键发现是,安全性恢复通常只需要少量干预,通常在最初的1-3个推理步骤内即可实现。
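The monitor-and-inject loop described in the SafeThink abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `generate_step`, `safety_score`, the threshold value, and the toy stand-ins are all hypothetical, and only the corrective prefix text is taken from the abstract.

```python
from typing import Callable, List

CORRECTIVE_PREFIX = "Wait, think safely"  # prefix quoted in the abstract


def steer_reasoning(
    generate_step: Callable[[List[str]], str],   # hypothetical: returns the next reasoning step
    safety_score: Callable[[List[str]], float],  # hypothetical: safety reward for a trace
    threshold: float = 0.5,                      # assumed value, not from the paper
    max_steps: int = 8,
) -> List[str]:
    """Generate a reasoning trace, injecting the prefix only on a violation."""
    trace: List[str] = []
    for _ in range(max_steps):
        step = generate_step(trace)
        if safety_score(trace + [step]) < threshold:
            # Satisficing constraint: intervene only when the threshold is violated,
            # then regenerate the step conditioned on the corrective prefix.
            trace.append(CORRECTIVE_PREFIX)
            step = generate_step(trace)
        trace.append(step)
    return trace


# Toy stand-ins so the sketch runs without a model.
def toy_generate(trace: List[str]) -> str:
    return "safe step" if CORRECTIVE_PREFIX in trace else "unsafe step"


def toy_score(trace: List[str]) -> float:
    return 0.0 if trace[-1] == "unsafe step" else 1.0


result = steer_reasoning(toy_generate, toy_score, max_steps=3)
```

In this toy run the prefix is injected once, at the first step, and every later step passes the check without further intervention, which mirrors the abstract's observation that early steering is often enough.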
MerLin: A Discovery Engine for Photonic and Hybrid Quantum Machine Learning
Authors: Cassandre Notton, Benjamin Stott, Philippe Schoeb, Anthony Walsh, Grégoire Leboucher, Vincent Espitalier, Vassilis Apostolou, Louis-Félix Vigneux, Alexia Salavrakos, Jean Senellart
First: 2026-02-11T18:00:01+00:00 · Latest: 2026-02-11T18:00:01+00:00
Comments: This work has been submitted to the 2026 IEEE World Congress on Computational Intelligence
Abstract
Identifying where quantum models may offer practical benefits in near-term quantum machine learning (QML) requires moving beyond isolated algorithmic proposals toward systematic and empirical exploration across models, datasets, and hardware constraints. We introduce MerLin, an open-source framework designed as a discovery engine for photonic and hybrid quantum machine learning. MerLin integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit-learn workflows, enabling end-to-end differentiable training of quantum layers. MerLin is designed around systematic benchmarking and reproducibility. As an initial contribution, we reproduce eighteen state-of-the-art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional and recurrent architectures, generative models, and modern training paradigms. These reproductions are released as reusable, modular experiments that can be directly extended and adapted, establishing a shared experimental baseline consistent with empirical benchmarking methodologies widely adopted in modern artificial intelligence. By embedding photonic quantum models within established machine learning ecosystems, MerLin allows practitioners to leverage existing tooling for ablation studies, cross-modality comparisons, and hybrid classical-quantum workflows. The framework already implements hardware-aware features, allowing tests on available quantum hardware while enabling exploration beyond its current capabilities, positioning MerLin as a future-proof co-design tool linking algorithms, benchmarks, and hardware.
中文标题/摘要
标题:MerLin:光子和混合量子机器学习的发现引擎
识别量子模型在短期内可能提供的实际益处需要超越孤立的算法提案,转向对模型、数据集和硬件限制进行系统和经验探索。我们介绍了MerLin,一个开源框架,旨在作为光子和混合量子机器学习的发现引擎。MerLin将优化的强模拟线性光学电路集成到标准的PyTorch和scikit learn工作流中,使量子层的端到端可微训练成为可能。MerLin围绕系统基准测试和可重复性设计。作为初步贡献,我们重现了十八项前沿的光子和混合量子机器学习工作,涵盖了核方法、水库计算、卷积和递归架构、生成模型以及现代训练范式。这些重现作为可重用、模块化的实验发布,可以直接扩展和适应,建立了一致的实验基准,符合现代人工智能广泛采用的经验基准测试方法。通过将光子量子模型嵌入现有的机器学习生态系统中,MerLin使从业者能够利用现有的工具进行消融研究、跨模态比较和混合经典量子工作流。该框架已经实现了硬件感知功能,允许在现有量子硬件上进行测试,同时促进超越其当前能力的探索,将MerLin定位为一个面向未来的协同设计工具,连接算法、基准测试和硬件。
Summary / 总结
MerLin is an open-source framework for systematically exploring photonic and hybrid quantum machine learning models. It integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit-learn workflows, enabling end-to-end differentiable training of quantum layers. As an initial contribution, the authors reproduce 18 state-of-the-art photonic and hybrid QML works and release them as reusable, modular experiments, establishing a shared experimental baseline. Practitioners can reuse existing tooling for ablation studies and cross-modality comparisons, and the framework's hardware-aware features allow tests on available quantum hardware as well as exploration beyond its current capabilities.
MerLin 是一个开源框架,旨在系统地探索光子和混合量子机器学习模型。它将优化的线性光学电路模拟集成到标准机器学习工作流中,实现端到端的可微训练。该框架重现了18项最先进的光子和混合QML工作,提供了一个可重复和基准测试的共享实验基线。这使得从业者能够进行消融研究和混合经典-量子工作流,并具有针对当前和未来量子硬件的硬件感知功能。
Direct Learning of Calibration-Aware Uncertainty for Neural PDE Surrogates
Authors: Carlos Stein Brito
First: 2026-02-11T17:57:20+00:00 · Latest: 2026-02-11T17:57:20+00:00
Comments: 13 pages, 11 figures
Abstract
Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. Cross-regularized uncertainty learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
中文标题/摘要
标题:直接学习校准感知不确定性以用于神经偏微分方程代理模型
神经偏微分方程代理模型常在数据有限或部分观测的环境中部署,此时下游决策不仅依赖于低预测误差,还依赖于校准的不确定性。现有方法通过集成复制、固定随机噪声(如dropout)或事后校准获得不确定性。交叉正则化不确定性在训练过程中利用经由保留的正则化分割传递的梯度来学习不确定性参数。预测器在训练分割上优化以拟合,而低维不确定性控制在正则化分割上优化以减少训练-测试不匹配,从而获得适应不同环境的不确定性,而无需针对每个环境调整噪声水平。该框架可以在输出头、隐藏特征或特定算子组件(如谱模式)内学习连续噪声水平。我们在Fourier神经算子中实例化该方法,并在APEBench上针对观测比例和训练集大小进行扫描评估。在这些扫描中,学习到的预测分布在保留分割上校准更好,所得的不确定性场在单步空间诊断中集中于高误差区域。
Summary / 总结
The research aims to improve the calibration of uncertainty in neural partial differential equation (PDE) surrogates, which matters for downstream decisions in data-limited or partially observed regimes. The method, cross-regularized uncertainty, optimizes the predictor on a training split and low-dimensional uncertainty controls on a held-out regularization split. This yields regime-adaptive uncertainty without per-regime noise tuning. On APEBench sweeps with Fourier Neural Operators, the learned predictive distributions are better calibrated on held-out splits, and the uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
研究旨在提高神经PDE代理在数据有限场景中的不确定性校准。方法采用交叉正则化不确定性,在训练过程中学习不确定性参数:预测器在训练分割上优化,不确定性控制在正则化分割上优化。这种方法无需针对每种情况调整噪声,即可获得适应不同情况的不确定性。关键发现表明,学习到的预测分布在保留分割上校准更好,并且不确定性场在单步空间诊断中集中于高误差区域。
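The split of responsibilities in cross-regularized uncertainty (predictor fitted on the training split, uncertainty parameter fitted on a held-out regularization split) can be illustrated on a toy problem. The sketch below is an assumption-laden simplification, not the paper's method: it uses a one-parameter linear predictor instead of a Fourier Neural Operator, a single global noise level, and two sequential stages rather than gradients routed during joint training. All function names are hypothetical.

```python
import math
import random


def make_split(rng, n, noise_std):
    """Toy 1-D regression data: y = 2x + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


def fit_predictor(xs, ys):
    """Least-squares slope, fitted on the training split only."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_log_sigma(w, xs, ys, lr=0.1, steps=500):
    """Gradient descent on the Gaussian NLL of the regularization split.

    Per-sample NLL (up to a constant): s + 0.5 * r^2 * exp(-2s), with s = log sigma.
    Only s is updated here; the predictor weight w stays fixed.
    """
    mean_sq = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    s = 0.0
    for _ in range(steps):
        grad = 1.0 - mean_sq * math.exp(-2.0 * s)  # d(mean NLL)/ds
        s -= lr * grad
    return s


rng = random.Random(0)
train_x, train_y = make_split(rng, 200, noise_std=0.2)
reg_x, reg_y = make_split(rng, 200, noise_std=0.2)

w = fit_predictor(train_x, train_y)          # fit: training split
log_sigma = fit_log_sigma(w, reg_x, reg_y)   # uncertainty: regularization split
sigma = math.exp(log_sigma)
```

Because the noise level is fitted on data the predictor never saw, `sigma` tracks held-out error rather than training error, which is the train-test mismatch the abstract refers to.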
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Authors: Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
First: 2026-02-11T17:56:15+00:00 · Latest: 2026-02-11T17:56:15+00:00
Abstract
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
中文标题/摘要
标题:DataChef:通过强化学习烹饪出最优数据食谱以适应大规模语言模型
在当前大规模语言模型(LLM)的背景下,大规模高质量训练数据的策划是决定模型性能的主要因素。关键在于数据食谱,它包括将原始数据转换为训练语料的数据处理管道。尽管使用LLM自动化单个数据处理步骤(如数据合成和过滤)越来越普遍,但整体数据食谱的设计仍然主要依赖手动和劳动密集型,需要大量的人类专业知识和迭代。为了解决这一问题,我们提出了端到端的数据食谱生成方法,以适应LLM。给定一个目标基准和可用数据源池,模型需要输出一个完整的数据食谱,将基础LLM适应到目标任务。我们介绍了DataChef-32B,它使用代理奖励进行在线强化学习,该奖励可以预测候选食谱的下游性能。在六个保留任务中,DataChef-32B生成的食谱达到了与人类专家策划的食谱相当的下游性能。值得注意的是,DataChef-32B生成的食谱将Qwen3-1.7B-Base适应到数学领域,取得了AIME'25 66.7的成绩,超过了Qwen3-1.7B。这项工作为自动化LLM训练和开发自我进化的AI系统提供了新的视角。
Summary / 总结
The research aims to automate the creation of data recipes for adapting Large Language Models (LLMs) to specific tasks. DataChef-32B is trained with online reinforcement learning against a proxy reward that predicts downstream performance, and it generates complete data recipes whose results are comparable to human-curated ones across six held-out tasks. Notably, its recipe adapts Qwen3-1.7B-Base to the math domain, reaching 66.7 on AIME'25 and surpassing Qwen3-1.7B.
论文通过提出DataChef-32B,使用在线强化学习来生成端到端的数据食谱,以解决为大型语言模型(LLMs)手动设计数据食谱的挑战。该方法涉及模型输出一个完整的数据处理管道,将原始数据转换为训练语料,给定目标基准和可用的数据源。关键实验结果表明,DataChef-32B生成的食谱与人工编写的食谱性能相当,并且它将Qwen3-1.7B-Base适应到数学领域,在AIME'25中达到66.7的得分,超过了基模型的性能。
General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies
Authors: Jianxun Wang, Grant C. Forbes, Leonardo Villalobos-Arias, David L. Roberts
First: 2026-02-11T17:53:49+00:00 · Latest: 2026-02-11T17:53:49+00:00
Comments: Extended version of the full paper with the appendix accepted at AAMAS 2026
Abstract
Offline RL algorithms aim to improve upon the behavior policy that produces the collected data while constraining the learned policy to be within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and often come from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm's ability to estimate $Q$ or $V$ values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between $f$-divergence and optimization constraint on the Bellman residual through a more general Linear Programming form for RL and the convex conjugate. Following this, we introduce the general flexible function formulation for the $f$-divergence to incorporate an adaptive constraint on algorithms' learning objectives based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence in improving performance for learning from a challenging dataset when applied to a compatible constrained optimization algorithm.
中文标题/摘要
标题:通用灵活的$f$-散度用于具有低随机性和多样化行为策略的挑战性离线RL数据集
离线RL算法旨在改进产生收集数据的行为策略,同时将学习策略限制在数据集的支持范围内。然而,实际的离线数据集通常包含多样性不足或环境探索有限的例子,并且来自多个具有不同专长水平的行为策略。有限的探索可能会影响离线RL算法估计$Q$或$V$值的能力,而将学习策略限制在多样化的行为策略上可能过于保守。这样的数据集需要在RL目标和行为策略约束之间取得平衡。我们首先通过更一般的线性规划形式和凸共轭,识别了$f$-散度与贝尔曼残差优化约束之间的联系。随后,我们引入了一般灵活的函数形式来表示$f$-散度,以根据离线训练数据集自适应地调整算法的学习目标。实验结果表明,提出的线性规划形式的正确性以及灵活的$f$-散度在应用于兼容的约束优化算法时提高从挑战性数据集学习性能的潜力。
Summary / 总结
The paper addresses the challenge of offline reinforcement learning (RL) with datasets that have low stochasticity and diverse behavior policies. It introduces a general flexible $f$-divergence to balance the RL objective and behavior policy constraints. The method uses a linear programming (LP) form and convex conjugate to adaptively constrain the learning objective. Experiments on MuJoCo, Fetch, and AdroitHand environments demonstrate the effectiveness of the proposed approach in improving performance for challenging offline RL tasks.
该论文针对低随机性和多样行为策略的数据集下的离线强化学习挑战,引入了一种通用灵活的$f$-散度来平衡RL目标和行为策略约束。方法利用更一般的线性规划(LP)形式和凸共轭来适应性地约束学习目标。实验在MuJoCo、Fetch和AdroitHand环境中展示了该方法在处理具有挑战性的离线数据集时的有效性。
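For readers unfamiliar with the two objects the abstract connects, the standard textbook definitions are recalled below. These are general facts about $f$-divergences and convex conjugates; how the paper plugs them into its LP form for RL is not stated in the abstract and is not reproduced here.

```latex
% f-divergence between distributions P and Q with densities p and q,
% for a convex function f with f(1) = 0:
D_f(P \,\|\, Q) = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right]

% Convex (Fenchel) conjugate of f:
f^{*}(y) = \sup_{x} \bigl\{ x y - f(x) \bigr\}

% Variational form obtained from the conjugate:
D_f(P \,\|\, Q) = \sup_{T} \Bigl\{ \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\bigl[f^{*}(T(x))\bigr] \Bigr\}

% Worked example (KL divergence): f(x) = x \log x.
% Setting the derivative of xy - x \log x to zero gives x = e^{y-1}, so
f^{*}(y) = e^{\,y-1}
```

Choosing a different convex $f$ changes how strongly deviations from the behavior data are penalized, which is presumably the degree of freedom the flexible formulation adapts to the dataset.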
SteuerLLM: Local specialized large language model for German tax law analysis
Authors: Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer, Fei Wu, Martin Mayr, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
First: 2026-02-11T17:46:01+00:00 · Latest: 2026-02-11T17:46:01+00:00
Abstract
Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
中文标题/摘要
标题:SteuerLLM:针对德国税法分析的本地专业化大型语言模型
大型语言模型(LLMs)展示了强大的通用推理和语言理解能力,但在受严格形式规则、精确术语和法律约束结构支配的领域中,其性能会下降。税法就是一个典型的例子,因为正确的答案需要精确的法定引用、结构化的法律论证和在严格的评分方案下的数值准确性。我们通过算法生成了SteuerEx,这是第一个基于真实的德国大学税法考试题目的公开基准。SteuerEx 包含了115个专家验证的考试题目,覆盖了六个核心税法领域和多个学术层次,并采用了一种基于陈述的、部分评分的评估框架,该框架与实际考试实践非常接近。我们还介绍了SteuerLLM,这是一种针对德国税法的领域适应型LLM,它是在一个大规模合成数据集上训练的,该数据集是通过使用受控检索增强管道从真实的考试材料中生成的。SteuerLLM(280亿参数)在多个方面始终优于同等规模的一般用途指令调优模型,甚至在某些情况下,其性能也超过了更大规模的系统,这表明领域特定的数据和架构适应比参数规模对于解决现实法律推理任务更为关键。所有基准数据、训练数据集、模型权重和评估代码都已公开,以支持领域特定法律人工智能的可重复研究。SteuerLLM 的网络演示可在 https://steuerllm.i5.ai.fau.de/ 上获得。
Summary / 总结
The research aims to address the limitations of large language models (LLMs) in handling strict formal rules and precise terminology found in tax law. To achieve this, the authors developed SteuerEx, an open benchmark consisting of 115 expert-validated tax law examination questions, and SteuerLLM, a domain-adapted LLM trained on a synthetic dataset generated from authentic examination material. Experimental results show that SteuerLLM outperforms general-purpose instruction-tuned models, indicating that domain-specific data and architectural adaptation are more critical than model size for legal reasoning tasks.
研究旨在解决大型语言模型(LLMs)在处理严格形式规则和精确术语方面的局限性,特别是在税法领域。作者开发了SteuerEx,这是一个包含115个专家验证的税法问题的开放基准,以及SteuerLLM,一个基于真实考试材料生成的合成数据集进行训练的领域适应型LLM。SteuerLLM在性能上优于通用模型,表明领域特定的数据和架构适应比模型规模更为关键。所有资源均已公开发布,以支持可重复的研究。
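A statement-level, partial-credit score of the kind SteuerEx is described as using could be computed along these lines. The data layout, the point values, and the idea that each expected statement carries a fixed number of points are illustrative assumptions; the abstract does not specify the benchmark's actual grading scheme or how statements are judged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    text: str        # expected statement in the reference solution (hypothetical)
    points: float    # points assigned to this statement (hypothetical)
    credited: bool   # whether a grader judged the answer to contain it


def partial_credit(statements: List[Statement]) -> float:
    """Fraction of available points earned across all expected statements."""
    total = sum(s.points for s in statements)
    earned = sum(s.points for s in statements if s.credited)
    return earned / total if total else 0.0


answer = [
    Statement("cites the applicable statute", 2.0, True),
    Statement("applies the correct exemption", 1.0, False),
    Statement("computes the final amount", 1.0, True),
]
score = partial_credit(answer)
```

Scoring per statement rather than per question means an answer with the right citation but a wrong calculation still earns part of the credit.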
In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution
Authors: Frank Xiao, Santiago Aranguri
First: 2026-02-11T17:45:31+00:00 · Latest: 2026-02-11T17:45:31+00:00
Abstract
We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism, which emerged from contaminated preference data rather than deliberate injection, provides a realistic benchmark for safety techniques.
中文标题/摘要
标题:野外模型生物:通过数据归因减轻生产LLM后训练中不良涌现行为
我们提出了一种基于激活的数据归因方法,该方法将后训练语言模型的行为变化追溯到负责的训练数据点。通过为测试提示和偏好对计算激活差异向量,并按余弦相似度进行排序,我们识别出导致特定行为的数据点,并通过使用修改后的数据重新训练来进行因果验证。聚类行为-数据点相似矩阵还能够实现无监督的涌现行为发现。将此方法应用于OLMo 2的生产DPO训练,我们发现了诱饵触发的合规性:一种有害行为,即在危险请求后附加无害的格式化指令时,模型会遵从该请求。过滤排名靠前的数据点可将此行为降低63%,而切换其标签则可实现78%的降低。我们的方法在性能上优于基于梯度的归因和LLM-judge基线,且成本不到两者的十分之一。这种野外模型生物(源自受污染的偏好数据而非故意注入)为安全性技术提供了一个现实基准。
Summary / 总结
The research aims to mitigate undesirable behaviors in production language models by identifying the training data responsible for them. The method uses activation-based data attribution to trace behavioral changes to specific training datapoints and validates these attributions causally through retraining. For a harmful behavior called distractor-triggered compliance, filtering top-ranked datapoints reduces it by 63% and switching their labels reduces it by 78%. The approach outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both.
研究旨在通过识别和过滤负责的训练数据来减轻生产语言模型中的不良行为。方法使用基于激活的数据归因来追踪行为变化到特定的训练数据点,并通过重新训练进行验证。关键发现表明,通过过滤排名靠前的数据点,可以减少63%的有害行为——称为诱饵触发的合规性;通过切换这些数据点的标签,可以实现78%的减少。这种方法在性能上优于基于梯度的归因和LLM-judge基线,并且成本低得多。
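The ranking step of activation-based attribution (cosine similarity between a test prompt's activation-difference vector and each preference pair's vector) can be sketched as below. The vectors are taken as given inputs: how they are extracted from the model is not specified in the abstract, and the names and toy values here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_datapoints(
    test_diff: Sequence[float],
    datapoint_diffs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort training datapoints by similarity to the test behavior, highest first."""
    scored = [(name, cosine(test_diff, diff)) for name, diff in datapoint_diffs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy activation-difference vectors (2-D for readability).
ranking = rank_datapoints(
    [1.0, 0.0],
    {"pair_a": [2.0, 0.0], "pair_b": [0.0, 1.0], "pair_c": [-1.0, 0.0]},
)
```

The top of `ranking` is the candidate set one would filter or relabel before retraining, which is the causal check the abstract describes.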
Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing
Authors: Kavan Fatehi, Mostafa Rahmani Ghourtani, Amir Sonee, Poonam Yadav, Alessandra M Russo, Hamed Ahmadi, Radu Calinescu
First: 2026-02-11T17:44:03+00:00 · Latest: 2026-02-11T17:44:03+00:00
Comments: This work has been accepted to appear in the IEEE International Conference on Communications (ICC)
Abstract
Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO), which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in 18 ms, restores latency to 0.98 ms with 99.9999% reliability, and reduces troubleshooting time by 93% while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
中文标题/摘要
标题:可解释的基于注意力机制的多智能体PPO方法用于解决6G RAN切片中的延迟突增
第六代(6G)无线接入网络(RAN)必须为异构切片强制执行严格的服务水平协议(SLA),但突发延迟突增仍然难以诊断和解决,使用传统的深度强化学习(DRL)或可解释的RL(XRL)也是如此。我们提出了\emph{注意力增强的多智能体近端策略优化(AE-MAPPO)},将六种专门的注意力机制集成到多智能体切片控制中,并以零成本、忠实的解释方式呈现它们。该框架在O-RAN时间尺度上运行,采用三阶段策略:预测、反应和跨切片优化。 一个URLLC案例研究显示,AE-MAPPO在18毫秒内解决了延迟突增,将延迟恢复到0.98毫秒,可靠性为99.9999%,将故障排除时间减少了93%,同时保持了eMBB和mMTC的连续性。这些结果证实了AE-MAPPO结合SLA合规性和内在可解释性的能力,使其能够为6G RAN切片提供可信赖的实时自动化。
Summary / 总结
The paper addresses the challenge of resolving sudden latency spikes in 6G RAN slicing by proposing AE-MAPPO, which integrates six specialized attention mechanisms into multi-agent slice control. The framework uses a three-phase strategy and demonstrates that it can resolve a latency spike in 18ms, restore latency to 0.98ms with 99.9999% reliability, and reduce troubleshooting time by 93% while maintaining service continuity. This confirms AE-MAPPO's effectiveness in combining SLA compliance with interpretability for real-time automation in 6G RAN slicing.
论文提出了AE-MAPPO,一种基于注意力的多智能体强化学习方法,用于解决6G RAN切片中的突发延迟问题。该方法将六个专门的注意力机制整合到预测、反应和跨切片优化的三阶段策略中。AE-MAPPO在18毫秒内解决了URLLC的延迟突增,将延迟恢复到0.98毫秒,可靠性达到99.9999%,同时将故障排查时间减少了93%,并保持了eMBB和mMTC的连续性,证明了其在合规性和可解释性方面的有效性。
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tienie Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-11T17:42:37+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently, the proposal of "thinking with images" attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:基于图像交流的内省视觉思考
当前的大规模视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的纯文本推理,这往往会导致精细视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了有效的跨模态对齐,尤其是在需要跨远距离区域或多个图像推理视觉语义或几何关系时。为了解决这些挑战,我们提出了一种新的框架“基于图像交流”,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而增强了语言推理与视觉状态更新之间的耦合。我们通过ViLaVT这一新型LVLM实例化了这一范式,ViLaVT配备了一个明确设计用于此类交互视觉推理的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程训练,促进有效的推理行为。在八个基准测试上的广泛实验表明,ViLaVT取得了显著且一致的改进,在复杂的多图像和基于视频的空间推理任务上提升尤为明显。
Summary / 总结
This paper addresses the limitations of current large vision-language models (LVLMs) that rely on text-only reasoning and single-pass visual encoding, which can result in the loss of fine-grained visual information. The authors propose "chatting with images", a framework that reframes visual manipulation as language-guided feature modulation. The model, instantiated in ViLaVT, dynamically re-encodes multiple image regions under the guidance of language prompts, tightening the coupling between linguistic reasoning and visual state updates. Experiments across eight benchmarks show strong and consistent improvements, with the largest gains on complex multi-image and video-based spatial reasoning tasks.
论文针对当前大型视觉-语言模型(LVLM)在处理细粒度视觉信息方面的局限性,提出了一种新的框架‘与图像对话’来增强跨模态对齐。该框架将视觉操作重新定义为语言引导的特征调制,使模型在语言提示的指导下动态重新编码多个图像区域。ViLaVT 是一种新型的 LVLM,配备了动态视觉编码器,通过结合监督微调和强化学习的两阶段课程进行训练。实验表明,ViLaVT 在复杂的多图像和视频空间推理任务中表现优异,优于现有模型。
Conversational Behavior Modeling Foundation Model With Multi-Level Perception
Authors: Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli
First: 2026-02-11T17:32:52+00:00 · Latest: 2026-02-11T17:32:52+00:00
Abstract
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
中文标题/摘要
标题:基于多级感知的对话行为建模基础模型
人类对话由一系列隐含的思想链组织,表现为定时的言语行为。捕捉这一感知路径是构建自然双工交互系统的关键。我们提出了一种框架,将这一过程建模为多级感知,并通过思维图(GoT)来推理对话行为。我们的方法通过分层标签方案正式化了意图到行动的路径,预测高层的交际意图和低层的言语行为,以学习它们的因果和时间依赖性。为了训练此系统,我们开发了一个高质量的语料库,将可控的、事件丰富的对话数据与人工标注的标签配对。GoT框架将流式预测结构化为一个不断演化的图,使变换器能够预测下一个言语行为,生成简洁的决策理由,并动态优化其推理。在合成和真实双工对话上的实验表明,该框架提供了稳健的行为检测,生成可解释的推理链,并为全双工语音对话系统中的对话推理基准测试奠定了基础。
Summary / 总结
The research aims to model human conversation by capturing the implicit chain of thoughts that manifests as timed speech acts, which is key to natural full-duplex interactive systems. The method combines a multi-level perception framework with a Graph-of-Thoughts (GoT) to predict high-level communicative intents and low-level speech acts and to learn their causal and temporal dependencies. Experiments on synthetic and real duplex dialogues demonstrate robust behavior detection and interpretable reasoning chains, laying a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.
研究旨在通过捕捉时间化的言语行为来建模人类对话中的隐含思维链,这对于开发自然的全双工交互系统至关重要。方法包括一个多级感知框架和一个Graph-of-Thoughts(GoT),用于预测高层的交际意图和低层的言语行为,并学习它们的因果和时间依赖关系。关键实验发现显示了稳健的行为检测、可解释的推理链以及全双工语音对话系统中对话推理的基准建立的基础。
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Authors: Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang
First: 2026-01-30T04:45:43+00:00 · Latest: 2026-02-11T17:26:00+00:00
Abstract
Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the O-T-A reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.
Summary / 总结
The research aims to enhance the understanding of camera dynamics in videos by moving beyond black-box classification methods. CamReasoner, a new framework, reformulates camera movement understanding as a structured inference process, focusing on the Observation-Thinking-Answer (O-T-A) paradigm. It uses a Large-scale Inference Trajectory Suite with 18k SFT reasoning chains and 38k RL feedback samples to ensure that motion inferences are grounded in physical geometry. This approach leads to state-of-the-art performance across multiple benchmarks and effectively suppresses hallucinations.
研究旨在通过超越黑盒分类方法,提高对视频中摄像机动态的理解。CamReasoner 将此任务重新表述为结构化的推理过程,使用观察-思考-回答(O-T-A)范式来解码时空线索。该框架包含一个大规模推理轨迹套件,包括18k推理链和38k反馈样本,并采用强化学习进行逻辑对齐,从而在多个基准测试中实现了最先进的性能。
Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models
Authors: Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen
Venue: ICLR 2026
First: 2026-02-11T17:24:49+00:00 · Latest: 2026-02-11T17:24:49+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics. The rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML-based method that leverages the reasoning power of multimodal language models (MLMs) to address this trade-off, a pressing need for service providers. As part of our proposal, Pram (i) quickly computes high-quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM-powered "agent", and (ii) ensures global consistency by harmonizing these subproblems via a multi-agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems. Empirically, on real-world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, linear programming solvers (very close to the optimal solution), with substantially lower runtimes (1 to 2 orders of magnitude faster). Moreover, Pram exhibits strong robustness (<10% performance degradation under link failures or flow bursts), demonstrating MLMs' generalization ability to unforeseen events. Pram is objective-agnostic and seamlessly integrates with mainstream allocation systems, providing a practical and scalable solution for future networks.
中文标题/摘要
标题:分而治之:利用多模态语言模型解决多商品流问题
多商品流(MCF)问题是网络流和组合优化中的一个基本主题,广泛应用于交通、通信和物流等领域。随着分配系统的迅速扩展,现有的优化引擎在平衡最优性和可处理性方面面临着挑战。本文提出了Pram,这是第一个利用多模态语言模型(MLM)推理能力来解决这种权衡困境的方法,满足了服务提供商的需求。作为提案的一部分,Pram (i) 通过将原始问题分解为局部子问题并由MLM驱动的“代理”解决这些子问题,快速计算高质量的分配;(ii) 通过多代理强化学习算法确保全局一致性。理论上,我们证明Pram能够在MCF问题家族中通过上下文进行梯度下降学习,并且可以证明其收敛到最优解。实验上,在真实数据集和公共拓扑结构上,Pram的性能与线性规划求解器相当,甚至在某些情况下超越了线性规划求解器(接近最优解),并且运行时间显著降低(快1到2个数量级)。此外,Pram表现出强大的鲁棒性(链路故障或流量突发时性能下降不到10%),展示了MLM对未预见事件的泛化能力。Pram与目标无关,无缝集成到主流分配系统中,为未来的网络提供了实用且可扩展的解决方案。
Summary / 总结
Pram is an ML-based method that uses multimodal language models to address the multi-commodity flow problem by dividing it into local subproblems and harmonizing them through a multi-agent reinforcement learning algorithm. Theoretically, Pram provably converges to the optimum within the family of MCF problems. Empirically, on real-world datasets and public topologies, Pram performs comparably to linear programming solvers, and in some cases surpasses them, while running 1 to 2 orders of magnitude faster. It also shows strong robustness, with less than 10% performance degradation under link failures or flow bursts.
Pram 是一种基于 ML 的方法,使用多模态语言模型将多商品流问题分解为局部子问题,并通过多智能体强化学习算法确保全局一致性。理论上,Pram 在多商品流问题家族中收敛到最优解。实验上,Pram 的性能与线性规划求解器相当,有时甚至更好,运行时间显著缩短,并且在各种条件下表现出很强的鲁棒性。
Goal-Conditioned Reinforcement Learning from Sub-Optimal Data on Metric Spaces
Authors: Alfredo Reichlin, Miguel Vasco, Hang Yin, Danica Kragic
First: 2024-02-16T16:46:53+00:00 · Latest: 2026-02-11T17:20:09+00:00
Abstract
We study the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning under sparse rewards, invertible actions and deterministic transitions. To mitigate the effects of \emph{distribution shift}, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes. We introduce distance monotonicity as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
中文标题/摘要
标题:度量空间上基于次优数据的目标条件强化学习
我们研究在稀疏奖励、可逆动作和确定性转换条件下,从次优数据中学习目标条件化离线强化学习最优行为的问题。为减轻分布偏移的影响,我们提出了一种结合度量学习进行价值函数近似和加权模仿学习进行策略估计的方法——MetricRL。MetricRL 避免了保守或行为克隆的约束,即使在严重次优的环境中也能实现有效的学习。我们引入了距离单调性作为度量表示与最优性之间的重要联系,并设计了一个明确促进该性质的目标。实验上,MetricRL 在从次优离线数据中恢复接近最优行为方面始终优于先前的最优目标条件化 RL 方法。
Summary / 总结
The study addresses the challenge of learning optimal behavior from sub-optimal datasets in goal-conditioned offline reinforcement learning with sparse rewards. It introduces MetricRL, a method combining metric learning for value function approximation with weighted imitation learning for policy estimation. This approach effectively mitigates the distribution shift issue, allowing for better performance even in severely sub-optimal regimes. Empirical results show that MetricRL outperforms existing methods in recovering near-optimal behavior from sub-optimal data.
研究解决了从次优数据中学习目标导向的离线强化学习中最优行为的问题,该问题具有稀疏奖励。提出的MetricRL方法结合了基于度量的学习来近似价值函数和加权模仿学习来估计策略。实验表明,MetricRL在从次优数据中恢复接近最优行为方面优于现有方法,即使在高度次优的环境中也是如此。
GraphSeek: Next-Generation Graph Analytics with LLMs
Authors: Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka, Robert Gerstenberger, Jürgen Müller, Piotr Nyczyk, Hubert Niewiadomski, Torsten Hoefler
First: 2026-02-11T17:20:06+00:00 · Latest: 2026-02-11T17:20:06+00:00
Abstract
Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.
中文标题/摘要
标题:GraphSeek:基于LLM的下一代图分析
图在各个领域都是基础性的,但没有深厚的专业知识就难以使用。LLM有望提供易于使用的自然语言(NL)图分析,然而它们无法有效且高效地处理行业规模的属性图:这些数据集庞大、高度异构、结构复杂且动态变化。为了解决这个问题,我们设计了一种新的抽象,用于处理此类图的复杂多查询分析。其核心思想是用基于语义目录的规划来替代直接从自然语言生成图查询这一脆弱做法,该语义目录同时描述图模式和图操作。具体来说,这带来了语义平面和执行平面之间的清晰分离,前者用于LLM规划和更广泛的推理,后者用于在完整数据集上进行确定性的、数据库级别的查询执行和工具实现。此设计即使在小上下文的LLM中也能显著提高标记效率和任务有效性。我们以此抽象为基础构建了第一个LLM增强的图分析框架GraphSeek。GraphSeek实现了显著更高的成功率(例如,相对于增强的LangChain为86%),并朝着将LLM推理与数据库级别的执行统一起来的下一代可负担得起且易于访问的图分析迈进。
Summary / 总结
GraphSeek is designed to enhance graph analytics by leveraging LLMs, addressing the challenge of processing large, complex, and dynamic property graphs. It introduces a Semantic Catalog, describing both the graph schema and the graph operations, for planning graph queries, and separates the Semantic Plane for LLM planning from the Execution Plane for deterministic, database-grade query execution. This approach improves both token efficiency and task effectiveness even with small-context LLMs, achieving substantially higher success rates (e.g., 86% over enhanced LangChain).
GraphSeek 通过利用大模型来增强图分析,解决处理大规模、复杂且动态变化的属性图的挑战。它引入了语义目录来进行图查询规划,将语义平面用于大模型规划与执行平面用于高效查询执行分离。这种方法提高了标记效率和任务有效性,相比现有方法如增强的 LangChain 达到了更高的成功率。
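The separation between planning and execution described above can be illustrated with a minimal Python sketch. Everything here is hypothetical and not taken from GraphSeek: the toy graph, the catalog layout, the operation names, and the `$prev` placeholder are assumptions, and the plan is hard-coded where an LLM planner would normally produce it from the catalog descriptions.

```python
# Toy property graph: labelled nodes and typed, directed edges.
NODES = {
    1: {"label": "Person", "name": "Ada"},
    2: {"label": "Person", "name": "Bo"},
    3: {"label": "Company", "name": "Acme"},
}
EDGES = [(1, "WORKS_AT", 3), (2, "WORKS_AT", 3), (1, "KNOWS", 2)]


def nodes_by_label(label):
    return sorted(n for n, props in NODES.items() if props["label"] == label)


def neighbors_in(node_ids, edge_type):
    """Sources of `edge_type` edges that point at any node in `node_ids`."""
    targets = set(node_ids)
    return sorted(src for src, etype, dst in EDGES
                  if etype == edge_type and dst in targets)


def names(node_ids):
    return [NODES[n]["name"] for n in node_ids]


# What the planner would see: schema and operation descriptions, never the data.
SEMANTIC_CATALOG = {
    "schema": {"labels": ["Person", "Company"],
               "edge_types": ["WORKS_AT", "KNOWS"]},
    "operations": {
        "nodes_by_label": {"fn": nodes_by_label, "doc": "ids of nodes with a label"},
        "neighbors_in": {"fn": neighbors_in, "doc": "sources of incoming edges"},
        "names": {"fn": names, "doc": "name property of each node id"},
    },
}


def execute_plan(plan):
    """Run plan steps deterministically; '$prev' is the previous step's result."""
    result = None
    for step in plan:
        operation = SEMANTIC_CATALOG["operations"][step["op"]]  # unknown op -> KeyError
        args = [result if arg == "$prev" else arg for arg in step["args"]]
        result = operation["fn"](*args)
    return result


# Stand-in for planner output answering "Who works at a company?".
plan = [
    {"op": "nodes_by_label", "args": ["Company"]},
    {"op": "neighbors_in", "args": ["$prev", "WORKS_AT"]},
    {"op": "names", "args": ["$prev"]},
]
print(execute_plan(plan))
```

In this arrangement the planner only emits operation names and arguments, so the full dataset never has to pass through the model's context, and a plan that names an operation missing from the catalog fails before anything runs.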
Deformation-Recovery Diffusion Model (DRDM): Instance Deformation for Image Manipulation and Synthesis
Authors: Jian-Qing Zheng, Yuanhan Mo, Yang Sun, Jiahua Li, Fuping Wu, Ziyang Wang, Tonia Vincent, Bartłomiej W. Papież
First: 2024-07-10T01:26:48+00:00 · Latest: 2026-02-11T17:09:13+00:00
Comments: accepted by Medical Image Analysis
Abstract
In medical imaging, the diffusion models have shown great potential for synthetic image generation tasks. However, these approaches often lack the interpretable connections between the generated and real images and can create anatomically implausible structures or illusions. To address these limitations, we propose the Deformation-Recovery Diffusion Model (DRDM), a novel diffusion-based generative model that emphasises morphological transformation through deformation fields rather than direct image synthesis. DRDM introduces a topology-preserving deformation field generation strategy, which randomly samples and integrates multi-scale Deformation Velocity Fields (DVFs). DRDM is trained to learn to recover unrealistic deformation components, thus restoring randomly deformed images to a realistic distribution. This formulation enables the generation of diverse yet anatomically plausible deformations that preserve structural integrity, thereby improving data augmentation and synthesis for downstream tasks such as few-shot learning and image registration. Experiments on cardiac Magnetic Resonance Imaging and pulmonary Computed Tomography show that DRDM is capable of creating diverse, large-scale deformations, while maintaining anatomical plausibility of deformation fields. Additional evaluations on 2D image segmentation and 3D image registration tasks indicate notable performance gains, underscoring DRDM's potential to enhance both image manipulation and generative modelling in medical imaging applications. Project page: https://jianqingzheng.github.io/def_diff_rec/
中文标题/摘要
标题:变形-恢复扩散模型(DRDM):实例变形在图像操纵和合成中的应用
在医学成像中,扩散模型在合成图像生成任务中显示出巨大的潜力。然而,这些方法往往缺乏生成图像与真实图像之间的可解释联系,并且可能会生成解剖上不合理的结构或幻觉。为了解决这些限制,我们提出了变形-恢复扩散模型(DRDM),这是一种基于扩散的生成模型,强调通过变形场的形态转换,而不是直接的图像合成。DRDM 引入了一种拓扑保持的变形场生成策略,该策略随机采样并整合多尺度变形速度场(DVFs)。DRDM 通过学习恢复不现实的变形成分,从而将随机变形的图像恢复到现实分布。这种表述使生成多样且解剖上合理的变形成为可能,这些变形能够保持结构完整性,从而提高下游任务(如少样本学习和图像配准)的数据增强和生成。在心脏磁共振成像和肺部计算机断层扫描实验中,DRDM 能够生成多样且大规模的变形,同时保持变形场的解剖合理性。额外的 2D 图像分割和 3D 图像配准任务评估表明,DRDM 在图像操纵和医学成像应用中的生成建模方面具有显著的性能提升。
Summary / 总结
The Deformation-Recovery Diffusion Model (DRDM) addresses the limitations of existing diffusion models in medical imaging by emphasizing morphological transformation through deformation fields rather than direct image synthesis. DRDM is trained to recover unrealistic deformation components, restoring randomly deformed images to a realistic distribution, which allows it to generate diverse yet anatomically plausible deformations. Experiments on cardiac MRI and pulmonary CT show that DRDM can produce large-scale deformations while maintaining anatomical plausibility, improving data augmentation and synthesis for downstream tasks such as few-shot learning and image registration.
变形-恢复扩散模型(DRDM)强调通过变形场进行形态转换,而非直接合成图像,以此解决现有扩散模型在医学成像中的局限性。DRDM 使用拓扑保持的变形场生成策略,并通过恢复不现实的变形来训练,从而能够生成多样且解剖上合理的变形。实验表明,DRDM 可以生成大规模变形并保持解剖上的合理性,并且在 2D 图像分割和 3D 图像配准任务中表现出色。
Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization
Authors: Sai Sindhur Malleni, Raúl Sevilla, Aleksei Vasilevskii, José Castillo Lema, André Bauer
First: 2026-02-03T15:36:08+00:00 · Latest: 2026-02-11T16:56:32+00:00
Comments: Accepted at the 17th International Conference on Performance Engineering
Abstract
As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE working in conjunction with llm-d improved tail Time to First Token latency by up to 90% even under high loads.
中文标题/摘要
标题:评估Kubernetes性能以支持GenAI推理:从自动语音识别到大语言模型总结
随着生成型人工智能(GenAI),尤其是推理,迅速成为主导的工作负载类别,Kubernetes生态系统正积极演变以原生支持其独特需求。本文展示了新兴的Kubernetes原生项目如何结合使用,以提供容器编排的好处,如可扩展性和资源效率,应用于复杂的AI工作流。我们实现并评估了一个示例多阶段用例,包括自动语音识别和总结。首先,我们使用Kueue管理使用Whisper模型转录音频文件的工作负载,并使用动态加速器切片器(DAS)增加并行工作负载执行。其次,我们通过将转录文本输入使用llm-d托管的大语言模型进行总结,解决了一个离散的在线推理场景,llm-d是一种利用Kubernetes网关API推理扩展(GAIE)的最新进展优化推理请求路由的新型解决方案。我们的研究结果表明,这些互补组件(Kueue、DAS和GAIE)形成一个统一、高性能的平台,证明了Kubernetes能够作为支持苛刻的GenAI工作负载的统一基础:Kueue将总完工时间(makespan)最多减少了15%;DAS将平均工作负载完成时间缩短了36%;而GAIE与llm-d结合使用,即使在高负载下也将尾部首个令牌时间(Time to First Token)延迟最多改善了90%。
Summary / 总结
This paper evaluates Kubernetes performance for GenAI inference by implementing an illustrative multi-stage use case involving automatic speech recognition and summarization. The study uses Kueue for batch inference management, DAS to enhance parallel job execution, and GAIE with llm-d for optimized routing of inference requests in discrete online scenarios. Key findings show that Kueue reduces total makespan by up to 15%, DAS shortens mean job completion time by 36%, and GAIE with llm-d improves tail Time to First Token latency by up to 90% under high loads, demonstrating Kubernetes' capability to support complex GenAI workflows efficiently.
该论文评估了Kubernetes在GenAI推理任务中的性能,重点关注自动语音识别和总结。它使用Kueue、动态加速器切片器(DAS)和Kubernetes网关API推理扩展(GAIE)来增强可扩展性和资源效率。研究表明,Kueue将总完工时间(makespan)缩短了最多15%,DAS将平均任务完成时间缩短了36%,而GAIE与llm-d结合使用在高负载下将尾部首个令牌时间(Time to First Token)延迟改善了最多90%,展示了Kubernetes支持复杂AI工作流的能力。
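The three metrics quoted in this entry measure different things, which the short Python sketch below makes concrete. It assumes the usual definitions (makespan as first submission to last finish, job completion time as finish minus submission, tail latency as a high nearest-rank percentile); the paper may measure them differently, and all numbers are made up for illustration rather than taken from its results.

```python
import math


def makespan(jobs):
    """Wall-clock time from the first submission to the last finish."""
    return max(finish for _, finish in jobs) - min(submit for submit, _ in jobs)


def mean_completion_time(jobs):
    """Average of (finish - submit) over all jobs."""
    return sum(finish - submit for submit, finish in jobs) / len(jobs)


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Illustrative (submit, finish) times in minutes for three batch jobs.
baseline_jobs = [(0, 4), (0, 8), (2, 10)]
tuned_jobs = [(0, 3), (0, 5), (2, 8)]

# Illustrative time-to-first-token samples in seconds for an online service.
ttft_seconds = [0.20, 0.30, 0.25, 1.50, 0.22]

print(relative_reduction(makespan(baseline_jobs), makespan(tuned_jobs)))
print(relative_reduction(mean_completion_time(baseline_jobs),
                         mean_completion_time(tuned_jobs)))
print(percentile(ttft_seconds, 50), percentile(ttft_seconds, 99))
```

Makespan reflects how well the whole batch is packed, mean completion time reflects what an individual job experiences, and the tail percentile is dominated by the slowest requests, so the three improvements reported above are not interchangeable.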
Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards
Authors: Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
First: 2026-01-30T13:10:30+00:00 · Latest: 2026-02-11T16:53:48+00:00
Comments: 17 pages, 10 Figures
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
中文标题/摘要
标题:学习探索:参数空间噪声在可验证奖励强化学习中的深入探讨
可验证奖励强化学习(RLVR)提高了LLM的推理能力,但越来越多的证据表明存在探索上限:它通常重新加权现有的解决方案轨迹,而不是发现新的策略,限制了在大量采样预算(例如,pass-at-256)下的收益。我们通过PSN-RLVR解决了这一限制,该方法在生成rollout之前扰动策略参数,以诱导时间上一致的、轨迹级别的探索,这比动作空间噪声更好地保持了长时序链式思维的一致性。为了缓解由此产生的采样-更新不匹配,我们引入了截断重要性采样(TIS)。为了避免基于KL的自适应噪声控制的高昂成本,我们提出了一种由轻量级代理指标驱动的计算效率高的实时自适应噪声调度器,该代理指标结合了语义多样性与归一化的自我确信度。基于广泛使用的RLVR方法GRPO,PSN-GRPO在多个数学推理基准和模型家族中一致地扩展了有效的推理能力边界,产生了在大量采样预算下的更高pass-at-k,并在保持独立性从而可组合性的同时,优于先前的探索导向的RLVR方法(例如,pass-at-k风格训练)。
Summary / 总结
The research aims to enhance exploration in Reinforcement Learning with Verifiable Rewards (RLVR) by addressing the exploration ceiling issue, where the method often reweights existing solutions rather than discovering new strategies. The study introduces PSN-RLVR, which perturbs policy parameters to induce consistent exploration at the trajectory level, using truncated importance sampling to mitigate sampling-update mismatches. The method also proposes a lightweight adaptive noise scheduler to control exploration efficiently. Experimental results show that PSN-GRPO, instantiated on GRPO, consistently expands the reasoning capability boundary across various benchmarks, achieving higher pass-at-k under large sampling budgets and outperforming previous exploration-oriented RLVR methods.
研究旨在通过解决RLVR中的探索天花板问题,即该方法往往重新加权现有解决方案而不是发现新策略,来增强探索。研究引入了PSN-RLVR,通过扰动策略参数来诱导一致的探索,并结合截断的重要性采样来缓解采样-更新不匹配的问题。该方法还提出了一种轻量级的自适应噪声调度器,以避免昂贵的KL基控制,从而在数学推理基准测试中表现出更高的pass-at-k率,并在大规模采样预算下优于先前的方法。
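The two mechanisms named in this entry, parameter-space noise and truncated importance sampling, can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: the "policy" is a context-free softmax over three tokens, the noise scale and cap are arbitrary, and the clean-over-noisy ratio is the standard importance-sampling form, which the abstract does not spell out.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perturb(params, sigma, rng):
    """Parameter-space noise: one Gaussian draw per parameter, held fixed for the rollout."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def rollout(params, length, rng):
    """Sample a token sequence from the (toy, context-free) policy."""
    probs = softmax(params)
    return rng.choices(range(len(params)), weights=probs, k=length)


def tis_weights(clean_params, noisy_params, tokens, cap):
    """Per-token truncated importance weights: min(pi_clean / pi_noisy, cap)."""
    p_clean = softmax(clean_params)
    p_noisy = softmax(noisy_params)
    return [min(p_clean[t] / p_noisy[t], cap) for t in tokens]


rng = random.Random(0)
theta = [0.5, 0.0, -0.5]                          # clean policy logits (illustrative)
theta_noisy = perturb(theta, sigma=0.3, rng=rng)  # perturbed once, before sampling
tokens = rollout(theta_noisy, length=8, rng=rng)  # behaviour policy is the noisy one
weights = tis_weights(theta, theta_noisy, tokens, cap=2.0)
print(tokens)
print(weights)
```

Because the perturbation is drawn once per rollout, every token in the trajectory comes from the same shifted policy, unlike per-step action noise. The capped weights would then scale each token's term in the update of the clean parameters, correcting for the mismatch between the policy that sampled and the policy being updated while bounding the variance.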