Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍然可能与给定的指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了基于指示的物理任务的测试时扩展定律,证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性,通常比独立扩展每个维度更有效地恢复正确的动作。为了利用这些扩展定律,我们提出了CoVer,一种对比验证器,用于视觉-语言-行动对齐,并展示了我们的架构随着额外计算资源和数据的增加而平滑扩展。然后,我们介绍了“启动时计算”和一个分层验证推理流水线,用于VLAs。在部署时,我们的框架从视觉语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高级提示和低级动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进,在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer实现了14%的任务进展和9%的成功率提升。
Summary / 总结
This paper explores test-time verification as a method to improve alignment between actions and natural language instructions in vision-language-action models. It demonstrates that jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity, often recovering correct actions more efficiently than scaling either dimension alone. The proposed CoVer verifier scales gracefully with additional compute and data. At deployment, the framework precomputes diverse rephrased instructions and uses the verifier to select the best instruction and action chunks. Compared to scaling policy pre-training on the same data, this yields gains of 22% in-distribution and 13% out-of-distribution on the SIMPLER benchmark, and a further 45% in real-world experiments. On the PolaRiS benchmark, CoVer shows 14% gains in task progress and 9% in success rate.
本文探讨了测试时验证作为提高视觉-语言-动作模型中动作与自然语言指令之间对齐的方法。研究表明,同时增加重述指令的数量和生成动作的数量可以提高测试时样本多样性,从而更有效地恢复正确的动作。提出的CoVer架构能够随着资源的增加而平滑扩展,并且该框架预计算多样化的重述指令,使用验证器选择最优的动作,从而在SIMPLER基准测试中的分布内和分布外性能上取得了显著的改进。在PolaRiS基准测试中,CoVer在任务进度上取得了14%的提升,在成功率上取得了9%的提升。
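The selection step described in this entry (several rephrased instructions, several sampled actions per instruction, one verifier picking the winner) can be illustrated with a minimal sketch. Everything here is hypothetical: `select_action`, `toy_policy`, and `toy_verifier` are stand-ins invented for illustration, and the flat argmax below ignores the hierarchical prompt-then-chunk structure the abstract mentions. It is not CoVer's implementation.

```python
import random
from typing import Callable, List, Tuple


def select_action(
    instructions: List[str],
    sample_action: Callable[[str, random.Random], List[float]],
    score: Callable[[str, List[float]], float],
    actions_per_instruction: int,
    seed: int = 0,
) -> Tuple[str, List[float], float]:
    """Return the (instruction, action, score) triple with the highest verifier score."""
    rng = random.Random(seed)
    best = None
    for instruction in instructions:                 # scale over rephrasings
        for _ in range(actions_per_instruction):     # scale over sampled actions
            action = sample_action(instruction, rng)
            s = score(instruction, action)
            if best is None or s > best[2]:
                best = (instruction, action, s)
    return best


# Hypothetical toy policy: a noisy one-dimensional action.
def toy_policy(instruction: str, rng: random.Random) -> List[float]:
    return [rng.gauss(0.5, 0.5)]


# Hypothetical toy verifier: prefers actions close to 1.0.
def toy_verifier(instruction: str, action: List[float]) -> float:
    return -abs(action[0] - 1.0)


rephrasings = ["pick up the cup", "grasp the mug", "lift the cup"]
single = select_action(rephrasings[:1], toy_policy, toy_verifier, 1)
joint = select_action(rephrasings, toy_policy, toy_verifier, 8)
print(single[2], joint[2])
```

With the same seed, the joint search sees the single-sample candidate plus 23 more, so its best verifier score can only match or exceed the single-sample score. Whether that translates into task success depends on how well the verifier tracks the true objective.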
Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Authors: Huai-Hsun Cheng, Siang-Ling Zhang, Yu-Lun Liu
First: 2026-02-12T18:59:54+00:00 · Latest: 2026-02-12T18:59:54+00:00
Comments: Project page: https://stroke-of-surprise.github.io/ Code: https://github.com/stroke-of-surprise/Stroke-Of-Surprise
Abstract
Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension. Project page: https://stroke-of-surprise.github.io/
中文标题/摘要
标题:惊喜一击:渐进语义错觉在矢量素描中的应用
视觉错觉传统上依赖于空间操作,如多视角一致性。在本工作中,我们引入了渐进语义错觉,这是一种新颖的矢量素描任务,其中单个素描通过逐步添加线条经历剧烈的语义转变。我们提出了惊喜一击,这是一种生成框架,优化矢量线条以在不同的绘画阶段满足不同的语义解释。核心挑战在于“双重约束”:初始前缀线条必须形成一个连贯的对象(例如,一只鸭子),同时作为添加增量线条后第二个概念(例如,一只绵羊)的结构基础。为了解决这一问题,我们提出了一种基于双重分支评分蒸馏采样(SDS)机制的序列感知联合优化框架。与顺序方法冻结初始状态不同,我们的方法动态调整前缀线条以发现适用于两个目标的“共同结构子空间”。此外,我们引入了一种新颖的叠加损失,以确保空间互补性,而不是遮挡。大量实验表明,我们的方法在可识别性和错觉强度方面显著优于最先进的基线方法,成功地将视觉异文从空间扩展到时间维度。项目页面:https://stroke-of-surprise.github.io/
Summary / 总结
This work introduces Progressive Semantic Illusions in vector sketching, where a single sketch transforms dramatically through the addition of strokes. The Stroke of Surprise framework optimizes vector strokes to satisfy different semantic interpretations at various drawing stages. It addresses a dual constraint: the prefix strokes must form a coherent first object while also serving as the structural foundation for the second concept once the delta strokes are added. Experiments show that this method outperforms existing techniques in recognizability and illusion strength, expanding visual anagrams from spatial to temporal dimensions.
该研究引入了矢量素描中的渐进语义幻象,通过顺序添加线条使单个素描在不同阶段发生剧烈变化。Stroke of Surprise框架优化线条以满足不同语义解释的需求,通过动态调整初始线条来同时满足初始和后续的概念。实验表明,该方法在可识别性和幻象强度上优于现有技术,将视觉异文从空间维度扩展到时间维度。
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Authors: Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu
First: 2026-02-12T18:59:49+00:00 · Latest: 2026-02-12T18:59:49+00:00
Abstract
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
中文标题/摘要
标题:UniT:统一多模态链式思维测试时扩展
统一模型可以在单一架构中处理多模态理解和生成,但通常它们在单次通过过程中运行,而不进行迭代细化输出。许多多模态任务,尤其是涉及复杂空间组合、多个相互作用的对象或不断变化的指令的任务,需要分解指令、验证中间结果并进行迭代修正。虽然测试时扩展(TTS)已经证明,为迭代推理分配额外的推理计算可以显著提高语言模型的性能,但将这一范式扩展到统一的多模态模型仍然是一个开放的挑战。我们引入了UniT,这是一种多模态链式思维测试时扩展的框架,使单一统一模型能够在多轮次中进行推理、验证和细化。UniT 结合了代理数据合成、统一模型训练和灵活的测试时推理,以引发包括验证、子目标分解和内容记忆在内的认知行为。我们的主要发现是:(1) 统一模型在短推理轨迹上的训练在测试时能够泛化到更长的推理链;(2) 顺序链式思维推理比并行采样提供了一种更具扩展性和计算效率的TTS策略;(3) 在生成和编辑轨迹上的训练提高了分布外视觉推理能力。这些结果确立了多模态测试时扩展作为推进统一模型中生成和理解的有效范式。
Summary / 总结
The research aims to improve the performance of unified multimodal models by enabling iterative reasoning at test time. UniT introduces a framework for multimodal chain-of-thought test-time scaling, which allows a single unified model to reason, verify, and refine its outputs across multiple rounds. Key findings include the generalization of unified models to longer inference chains, the scalability and efficiency of sequential chain-of-thought reasoning, and the improvement in out-of-distribution visual reasoning through training on generation and editing trajectories.
研究旨在通过迭代推理和细化增强统一多模态模型。UniT 是一个用于多模态链式思维测试时扩展的框架,允许单一统一模型分解指令、验证中间结果并进行多次细化。关键发现包括统一模型在更长推理链上的泛化能力、顺序链式思维推理的可扩展性和计算效率,以及通过生成和编辑轨迹训练提高离分布视觉推理能力。
Agentic Test-Time Scaling for WebAgents
Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
First: 2026-02-12T18:58:30+00:00 · Latest: 2026-02-12T18:58:30+00:00
Abstract
Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
中文标题/摘要
标题:代理测试时缩放以适应网络代理
测试时缩放已成为提高神经网络模型性能和增强其可靠性的标准方法。然而,其在代理型、多步骤任务中的行为尚不完全清楚:小的每步误差会在长时间范围内累积;我们发现,均匀增加采样的简单策略会显示出递减的回报。在本文中,我们提出了CATTS,一种简单的技术,用于动态为多步骤代理分配计算资源。我们首先对网络代理的推理时缩放进行了实证研究。我们发现,均匀增加每步计算资源在长时间环境中很快达到饱和。然后我们研究了更强的聚合策略,包括基于LLM的仲裁者,它可以超越简单的投票,但可以推翻高度一致的决策。我们展示了代理自身投票分布得出的不确定性统计(熵和top-1/top-2差距)与下游成功相关,并提供了一种实用的动态计算分配信号。基于这些发现,我们引入了基于信心的测试时缩放(CATTS),它仅在决策真正有争议时才使用投票得出的不确定性来分配计算资源。CATTS在WebArena-Lite和GoBrowse上的性能比React提高了最多9.1%,同时使用的令牌数量最多减少了2.3倍,提供了效率提升和可解释的决策规则。
Summary / 总结
This paper addresses the challenge of test-time scaling for agentic, multi-step tasks, where small errors can accumulate over time. The authors introduce CATTS, a technique that dynamically allocates compute based on the uncertainty of the agent's decisions, measured from its own vote distribution (entropy and top-1/top-2 margin). Empirical studies show that uniformly increasing per-step compute quickly saturates, while CATTS improves performance by up to 9.1% over React on WebArena-Lite and GoBrowse, using up to 2.3x fewer tokens than uniform scaling.
该研究针对多步骤任务中累积误差的挑战,引入了CATTS技术,该技术基于决策不确定性动态分配计算资源。通过实证研究,作者表明均匀增加每步计算在长时间环境中很快达到饱和,提出CATTS,仅在决策真正有争议时才分配计算资源,从而在WebArena-Lite和GoBrowse上分别将性能提高至多9.1%,同时使用的令牌数量减少至多2.3倍。
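The two uncertainty statistics named in the abstract, vote entropy and top-1/top-2 margin, are simple to compute from a list of sampled actions. The sketch below shows one way to turn them into a yes/no rule for spending extra compute. The function names and both thresholds are assumptions made for illustration. The abstract does not give the actual decision rule or its parameters.

```python
import math
from collections import Counter
from typing import List, Tuple


def vote_uncertainty(votes: List[str]) -> Tuple[float, float]:
    """Return (entropy in nats, top-1 share minus top-2 share) of the vote distribution."""
    counts = Counter(votes)
    total = len(votes)
    shares = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in shares)
    margin = shares[0] - (shares[1] if len(shares) > 1 else 0.0)
    return entropy, margin


def needs_more_compute(
    votes: List[str],
    entropy_threshold: float = 0.6,  # assumed value, not from the paper
    margin_threshold: float = 0.3,   # assumed value, not from the paper
) -> bool:
    """Flag a step as contentious when the votes are spread out or the lead is thin."""
    entropy, margin = vote_uncertainty(votes)
    return entropy > entropy_threshold or margin < margin_threshold


confident = ["click(12)"] * 5
contentious = ["click(12)", "click(7)", "type(3)", "click(12)", "click(7)"]
print(needs_more_compute(confident), needs_more_compute(contentious))
```

A unanimous vote has zero entropy and a margin of 1, so it is left alone. A split vote has high entropy and a thin margin, so it is flagged for extra sampling or arbitration.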
Function-Space Decoupled Diffusion for Forward and Inverse Modeling in Carbon Capture and Storage
Authors: Xin Ju, Jiachen Yao, Anima Anandkumar, Sally M. Benson, Gege Wen
First: 2026-02-12T18:58:12+00:00 · Latest: 2026-02-12T18:58:12+00:00
Abstract
Accurate characterization of subsurface flow is critical for Carbon Capture and Storage (CCS) but remains challenged by the ill-posed nature of inverse problems with sparse observations. We present Fun-DDPS, a generative framework that combines function-space diffusion models with differentiable neural operator surrogates for both forward and inverse modeling. Our approach learns a prior distribution over geological parameters (geomodel) using a single-channel diffusion model, then leverages a Local Neural Operator (LNO) surrogate to provide physics-consistent guidance for cross-field conditioning on the dynamics field. This decoupling allows the diffusion prior to robustly recover missing information in parameter space, while the surrogate provides efficient gradient-based guidance for data assimilation. We demonstrate Fun-DDPS on synthetic CCS modeling datasets, achieving two key results: (1) For forward modeling with only 25% observations, Fun-DDPS achieves 7.7% relative error compared to 86.9% for standard surrogates (an 11x improvement), proving its capability to handle extreme data sparsity where deterministic methods fail. (2) We provide the first rigorous validation of diffusion-based inverse solvers against asymptotically exact Rejection Sampling (RS) posteriors. Both Fun-DDPS and the joint-state baseline (Fun-DPS) achieve Jensen-Shannon divergence less than 0.06 against the ground truth. Crucially, Fun-DDPS produces physically consistent realizations free from the high-frequency artifacts observed in joint-state baselines, achieving this with 4x improved sample efficiency compared to rejection sampling.
中文标题/摘要
标题:功能空间解耦扩散在碳捕获与储存中的前向和逆向建模
准确表征地下流场对于碳捕获与储存(CCS)至关重要,但稀疏观测导致的逆问题病态性仍是一个挑战。我们提出了一种生成框架Fun-DDPS,该框架结合了功能空间扩散模型和可微神经算子代理,用于前向和逆向建模。我们的方法使用单通道扩散模型学习地质参数(地质模型)的先验分布,然后利用局部神经算子(LNO)代理为跨场条件提供符合物理的指导。这种解耦允许扩散先验在参数空间中稳健地恢复缺失信息,而代理则提供基于梯度的数据同化高效指导。我们在合成的CCS建模数据集上展示了Fun-DDPS,取得了两个关键结果:(1)在只有25%观测的情况下进行前向建模,Fun-DDPS的相对误差为7.7%,而标准代理的相对误差为86.9%(提高了11倍),证明了其在极端数据稀疏性情况下处理能力,此时确定性方法失效。(2)我们首次对基于扩散的逆解算器进行了与渐近精确拒绝采样(RS)后验的严格验证。Fun-DDPS和联合状态基线(Fun-DPS)的Jensen-Shannon散度均小于0.06,与真实值一致。至关重要的是,Fun-DDPS生成了物理上一致的实现,没有联合状态基线中观察到的高频伪影,且样本效率提高了4倍,优于拒绝采样。
Summary / 总结
Fun-DDPS is a generative framework that integrates function-space diffusion models with differentiable neural operators for forward and inverse modeling in CCS. It learns a prior distribution over geological parameters using a single-channel diffusion model and uses a Local Neural Operator surrogate to provide physics-consistent guidance for data assimilation. The method shows significant improvements in handling extreme data sparsity, achieving 7.7% relative error compared to 86.9% for standard surrogates. It also provides physically consistent realizations with 4x improved sample efficiency compared to rejection sampling.
Fun-DDPS 是一个生成框架,结合了函数空间扩散模型和可微神经算子,用于 CCS 的正向和反向建模。该方法学习地质参数的先验分布,并使用局部神经算子代理进行高效指导。该方法在仅有 25% 观测数据的情况下,正向建模的相对误差为 7.7%,而标准代理为 86.9%,证明了其处理极端数据稀疏性的能力。此外,它还提供了物理上一致的实现,并且样本效率提高了 4 倍,优于拒绝采样。
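Two figures in this entry are easy to check or reproduce in spirit: the "11x" ratio between the two relative errors, and the Jensen-Shannon divergence used to compare sampled posteriors against the rejection-sampling reference. The sketch below computes both on toy histograms. The base-2 logarithm is an assumption, since the abstract does not state which base its 0.06 threshold uses, and the helper names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits. Terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Ratio of the two relative errors quoted in the abstract.
improvement = 86.9 / 7.7

# Toy histograms (made-up numbers) standing in for two posterior estimates.
reference = [0.10, 0.40, 0.40, 0.10]
estimate = [0.12, 0.38, 0.39, 0.11]
print(round(improvement, 1), js_divergence(reference, estimate))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which gives a rough sense of scale for a reported value below 0.06.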
MonarchRT: Efficient Attention for Real-Time Video Generation
Authors: Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen
First: 2026-02-12T18:56:53+00:00 · Latest: 2026-02-12T18:56:53+00:00
Abstract
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
中文标题/摘要
标题:MonarchRT:实时视频生成的高效注意力机制
实时视频生成中的扩散变换器受到三维自注意力的二次成本瓶颈限制,尤其是在既为少量步骤又为自回归的实时环境中,错误会随时间累积,每个去噪步骤必须携带更多的信息。在这种环境中,我们发现先前的稀疏注意力近似失效,尽管在双向、多步骤扩散中表现出色。具体来说,我们观察到视频注意力并不是可靠的稀疏,而是由时空位置驱动的显著周期结构与动态稀疏语义对应和密集混合相结合,超过了甚至先验top-k注意力的表示能力。基于这一洞察,我们提出了Monarch-RT,这是一种用于视频扩散模型的结构化注意力参数化方法,通过Monarch矩阵分解注意力。通过适当的块结构对齐和我们扩展的Tiled Monarch参数化,我们实现了高表达性同时保持计算效率。我们进一步通过微调克服了参数化的开销,使用了自定义的Triton内核。我们首先验证了Monarch-RT在现有仅针对双向模型设计的稀疏基线中的高有效性。我们还观察到,当应用于最先进的模型Self-Forcing时,Monarch-RT可以达到95%的注意力稀疏性,而不会损失质量,这使Monarch-RT成为实时视频生成中高能力稀疏注意力参数化的开创性工作。我们的优化实现分别在Nvidia RTX 5090、H100和B200 GPU上优于FlashAttention-2、FlashAttention-3和FlashAttention-4内核,提供了1.4-11.8倍的内核加速。这使我们首次能够在单个RTX 5090上以16 FPS实现真正的实时视频生成。
Summary / 总结
The research aims to address the computational bottleneck in real-time video generation using Diffusion Transformers, particularly the quadratic cost of 3D self-attention. Monarch-RT, a structured attention parameterization, is proposed to factorize attention using Monarch matrices, achieving high expressivity while maintaining computational efficiency. The method outperforms existing sparse baselines and enables true real-time video generation at 16 FPS with the Self-Forcing model on a single RTX 5090 GPU, providing speedups of up to 11.8X compared to FlashAttention kernels.
研究旨在解决使用Diffusion Transformers进行实时视频生成时的计算瓶颈,特别是3D自注意力的二次成本问题。提出了Monarch-RT,一种结构化注意力参数化方法,通过使用Monarch矩阵分解注意力,实现了高表达性的同时保持计算效率。该方法超越了现有的稀疏基线,并使Self-Forcing模型在单个RTX 5090 GPU上以16 FPS的速度实现真正的实时视频生成,相比FlashAttention内核提供了高达11.8倍的加速。
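For readers unfamiliar with Monarch matrices, the sketch below shows the basic structure in one common form: a product of two block-diagonal factors interleaved with a reshape-and-transpose permutation, M = P L P^T R, for a vector of length n = b * b. This is only the generic matrix structure under that assumption. It is not the tiled parameterization or the Triton attention kernels described in the abstract, and the helper names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def block_diag_matvec(blocks: List[Matrix], x: List[float]) -> List[float]:
    """Multiply a block-diagonal matrix (b blocks of size b x b) by x."""
    b = len(blocks)
    out: List[float] = []
    for k, block in enumerate(blocks):
        chunk = x[k * b:(k + 1) * b]
        out.extend(sum(block[i][j] * chunk[j] for j in range(b)) for i in range(b))
    return out


def permute(x: List[float], b: int) -> List[float]:
    """View x as a row-major b x b grid, transpose it, and flatten again."""
    return [x[j * b + i] for i in range(b) for j in range(b)]


def monarch_matvec(left: List[Matrix], right: List[Matrix], x: List[float]) -> List[float]:
    """Compute P L P^T R x. The transpose permutation is its own inverse, so P^T = P."""
    b = len(left)
    y = block_diag_matvec(right, x)
    y = permute(y, b)
    y = block_diag_matvec(left, y)
    return permute(y, b)


right = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
left = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
result = monarch_matvec(left, right, [1.0, 1.0, 1.0, 1.0])

# Parameter counts for n = 1024 (b = 32): structured versus dense.
structured_params = 2 * 32 ** 3
dense_params = 1024 ** 2
print(result, structured_params, dense_params)
```

The two block-diagonal factors hold 2 * b^3 = 2 * n^1.5 parameters instead of n^2, which is where the efficiency of this family of structured matrices comes from.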
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
First: 2026-02-12T18:55:09+00:00 · Latest: 2026-02-12T18:55:09+00:00
Abstract
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
中文标题/摘要
标题:CM2:使用检查表奖励的多轮多步骤代理工具使用强化学习
AI代理越来越多地用于通过推理多轮用户交互和调用外部工具来解决实际任务。然而,在这种设置中应用强化学习仍然很困难:现实目标往往缺乏可验证的奖励,而是强调开放性行为;此外,多轮、多步骤代理工具使用仍然未被充分探索;而且构建和维护可执行的工具环境成本高昂,限制了规模和覆盖面。我们提出了CM2,一种使用检查表奖励代替可验证结果奖励的强化学习框架。CM2将每轮预期行为分解为细粒度的二元标准,具有明确的证据基础和结构化元数据,将开放性判断转化为更稳定的分类决策。为了平衡稳定性和信息量,我们的方法采用稀疏奖励分配但密集评估标准的策略。训练在可扩展的大语言模型模拟工具环境中进行,避免了为大型工具集进行大量工程工作。实验表明,CM2在tau^-Bench上比监督微调提高了8个点,在BFCL-V4上提高了10个点,在ToolSandbox上提高了12个点。结果与或甚至超过了同样规模的开源基线,包括判断模型。因此,CM2提供了一种无需依赖可验证奖励来优化多轮多步骤工具使用代理的可扩展方法。开源社区提供的代码:https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
Summary / 总结
The paper introduces CM2, a reinforcement learning framework that uses checklist rewards to address the challenges of training multi-turn, multi-step tool-using agents. By decomposing each turn's behavior into binary criteria with explicit evidence, CM2 transforms open-ended judgments into more stable classification tasks. Experiments show that CM2 outperforms supervised fine-tuning on multiple benchmarks, improving by 8 points on tau^-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox, demonstrating its effectiveness in optimizing such agents without relying on verifiable rewards.
论文针对将强化学习应用于处理多轮用户交互和外部工具的AI代理时面临的挑战,其中现实目标往往是开放式的且缺乏可验证的奖励。文中提出了一种名为CM2的强化学习框架,使用检查表奖励将每轮行为分解为具有明确证据和结构化元数据的二元标准。实验结果显示,CM2在多个基准测试上优于监督微调,分别在tau^-Bench上提高了8个点,在BFCL-V4上提高了10个点,在ToolSandbox上提高了12个点,证明了其在优化多轮、多步骤工具使用代理方面的有效性,无需依赖可验证的奖励。
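The idea of a checklist reward, many binary criteria with evidence collapsed into a single scalar for the turn, can be sketched in a few lines. The `Criterion` fields and the mean aggregation below are assumptions chosen for illustration. The abstract describes "sparse reward assignment but dense evaluation criteria" without specifying the aggregation, so this should not be read as the paper's formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    description: str   # what the turn was supposed to do
    evidence: str      # quote from the trajectory backing the judgement
    satisfied: bool    # the judge's binary decision


def checklist_reward(criteria: List[Criterion]) -> float:
    """Collapse many binary criteria into one scalar reward (assumed: plain mean)."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)


# Hypothetical checklist for one turn of a tool-using agent.
turn_checklist = [
    Criterion("Calls the flight search tool", "search_flights(origin='SFO')", True),
    Criterion("Passes the requested date", "date='2026-03-01'", True),
    Criterion("Asks before booking", "no confirmation request found", False),
]
reward = checklist_reward(turn_checklist)
print(reward)
```

Each criterion is a yes/no decision tied to a piece of evidence, which is the sense in which open-ended judging becomes a set of classification-style decisions.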
T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
First: 2026-02-12T18:52:35+00:00 · Latest: 2026-02-12T18:52:35+00:00
Abstract
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
中文标题/摘要
标题:T3D:通过轨迹自蒸馏与直接判别优化的少量步骤扩散语言模型
扩散大型语言模型(DLLMs)有可能通过并行解码多个标记来实现快速文本生成。然而,在实践中,它们的推理效率受到需要许多细化步骤的限制,而大幅减少步骤会导致生成质量显著下降。为了解决这个问题,我们提出了一种轨迹自蒸馏框架,通过蒸馏模型自身的生成轨迹来改进少量步骤解码。我们结合了直接判别优化(DDO),这是一种反向KL目标,促进模式寻求蒸馏,并鼓励学生集中于高概率教师模式。在各种基准测试中,我们的方法在紧缩步骤预算下始终优于强大的少量步骤基线和标准训练。尽管全步骤解码仍然更优,但我们显著缩小了差距,为实用的少量步骤DLLMs奠定了坚实的基础。源代码可在https://github.com/Tyrion58/T3D获取。
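The abstract characterizes DDO as a reverse-KL objective that promotes mode-seeking distillation. The toy comparison below illustrates what mode-seeking means for plain discrete distributions: a student that commits to one teacher mode scores better under reverse KL, while a student that spreads its mass scores better under forward KL. The distributions are made-up numbers, and this shows only the general property of the two divergences, not the DDO training loss itself.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.49, 0.01, 0.01, 0.49]        # two sharp modes
mode_seeking = [0.97, 0.01, 0.01, 0.01]   # student commits to one mode
mode_covering = [0.25, 0.25, 0.25, 0.25]  # student spreads over everything

# Reverse KL puts the student first: KL(student || teacher).
reverse_seeking = kl(mode_seeking, teacher)
reverse_covering = kl(mode_covering, teacher)

# Forward KL puts the teacher first: KL(teacher || student).
forward_seeking = kl(teacher, mode_seeking)
forward_covering = kl(teacher, mode_covering)

print(reverse_seeking < reverse_covering, forward_covering < forward_seeking)
```

Reverse KL penalizes the student for placing mass where the teacher has little, so concentrating on one high-probability mode is cheap. Forward KL penalizes missing any teacher mode, so it favors the spread-out student.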
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Authors: Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu
First: 2026-02-12T18:49:27+00:00 · Latest: 2026-02-12T18:49:27+00:00
Abstract
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
中文标题/摘要
标题:像科学家一样思考:基于物理的LLM代理进行方程发现
通过符号、可解释的公式来解释观察到的现象是科学的基本目标。近年来,大型语言模型(LLMs)因其广泛的领域知识和强大的推理能力,已成为符号方程发现的有前途的工具。然而,现有的大多数基于LLM的系统试图直接从数据中猜测方程,而没有建模科学家通常遵循的多步推理过程:首先推断诸如对称性等物理属性,然后利用这些属性作为先验知识来限制候选方程的空间。我们引入了KeplerAgent,这是一种明确遵循这一科学推理过程的代理框架。该代理协调基于物理的工具来提取中间结构,并利用这些结果配置符号回归引擎,如PySINDy和PySR,包括它们的功能库和结构约束。在一系列物理方程基准测试中,KeplerAgent在符号准确性方面显著高于LLM和传统基线,并且对噪声数据具有更强的鲁棒性。
Summary / 总结
The research aims to improve symbolic equation discovery by mimicking the scientific reasoning process. KeplerAgent, an agent-based framework, explicitly models the multi-step reasoning process, starting with inferring physical properties and using them as priors to restrict the space of candidate equations. This approach leads to higher symbolic accuracy and better robustness to noisy data compared to existing LLM and traditional methods across various physical equation benchmarks.
研究旨在通过引入基于物理的方法来增强大型语言模型(LLMs)发现符号方程的能力。KeplerAgent是一个基于代理的框架,明确地模拟了科学家使用的多步推理过程,从推断物理属性开始,并使用这些属性作为先验来限制候选方程的空间。这种方法在各种物理方程基准测试中比LLM和传统基线方法都具有更高的符号准确性和更好的鲁棒性。
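The two-stage reasoning in this entry, first infer a physical property such as a symmetry, then use it to shrink the candidate space, can be shown on a one-variable toy problem. The sketch below detects even or odd parity from samples and filters a small monomial library accordingly. All names are hypothetical, and it deliberately does not call PySINDy or PySR. It only illustrates the kind of constraint an agent could hand to such an engine.

```python
from typing import List, Optional, Tuple


def detect_parity(samples: List[Tuple[float, float]], tol: float = 1e-9) -> Optional[str]:
    """Return 'even', 'odd', or None by comparing y(x) with y(-x) on mirrored samples."""
    lookup = {x: y for x, y in samples}
    pairs = [(y, lookup[-x]) for x, y in samples if x > 0 and -x in lookup]
    if not pairs:
        return None
    if all(abs(a - b) <= tol for a, b in pairs):
        return "even"
    if all(abs(a + b) <= tol for a, b in pairs):
        return "odd"
    return None


# Candidate monomials as (name, degree). A real library would be much richer.
LIBRARY = [("1", 0), ("x", 1), ("x^2", 2), ("x^3", 3), ("x^4", 4)]


def restrict_library(parity: Optional[str]) -> List[str]:
    """Keep only the terms compatible with the detected symmetry."""
    if parity == "even":
        return [name for name, degree in LIBRARY if degree % 2 == 0]
    if parity == "odd":
        return [name for name, degree in LIBRARY if degree % 2 == 1]
    return [name for name, _ in LIBRARY]


# Toy data from y = 3 x^2 + 1, which is symmetric under x -> -x.
data = [(x, 3 * x ** 2 + 1) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
parity = detect_parity(data)
allowed_terms = restrict_library(parity)
print(parity, allowed_terms)
```

Here the symmetry check removes the odd-degree terms before any fitting happens, so the downstream search only has to consider three candidates instead of five.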
Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference
Authors: Nicolas Johansson, Tobias Olsson, Daniel Nilsson, Johan Östman, Fazeleh Hoseini
First: 2025-09-04T12:43:45+00:00 · Latest: 2026-02-12T18:46:20+00:00
Abstract
Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting.
We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
中文标题/摘要
标题:时间序列预测中的隐私风险:用户级和记录级成员推断
成员推断攻击(MIAs)旨在确定特定数据是否被用于训练模型。尽管在分类模型上得到了广泛研究,但它们对时间序列预测的影响仍然很大程度上未被探索。我们通过引入两种新的攻击方法来填补这一空白:(i) 将当前最先进的分类模型MIA方法多变量LiRA进行适应,应用于时间序列预测设置;(ii) 提出一种新的端到端学习方法,称为Deep Time Series (DTS)攻击。我们使用来自分类设置的其他领先攻击的适应版本对这些方法进行了基准测试。
我们在现实环境中对TUH-EEG和ELD数据集上的所有攻击进行了评估,针对两种强大的预测架构LSTM和最先进的N-HiTS,在用户级和记录级威胁模型下进行。我们的结果表明,预测模型存在漏洞,用户级攻击往往能够实现完美的检测。所提出的方法在多种情况下表现出最强的性能,为时间序列预测中的隐私风险评估建立了新的基准。此外,预测时间越长和训练样本越少,漏洞越严重,这与大型语言模型中观察到的趋势一致。
Summary / 总结
This study investigates privacy risks in time series forecasting through membership inference attacks (MIAs). It introduces two new attacks: an adapted multivariate LiRA method and a novel Deep Time Series (DTS) attack. Evaluations on the TUH-EEG and ELD datasets show that forecasting models are vulnerable to these attacks, with user-level attacks achieving perfect detection in many cases. The study highlights that vulnerability increases with longer prediction horizons and smaller training populations.
该研究通过成员推理攻击探讨时间序列预测中的隐私风险,引入了两种新攻击方法:适应后的多变量LiRA方法和新型的Deep Time Series (DTS) 攻击。在TUH-EEG和ELD数据集上的评估显示,预测模型对这些攻击非常脆弱,用户级别的攻击在许多情况下实现了完美的检测。研究还指出,随着预测时间窗口的延长和训练样本数量的减少,脆弱性会增加。
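The core of a LiRA-style attack is a per-example likelihood ratio: fit one Gaussian to a statistic observed when the example was in the training set, another to the same statistic when it was out, and compare the two likelihoods at the value seen on the target model. The sketch below does this for a single scalar loss. It is a univariate simplification with made-up shadow-model losses, not the multivariate forecasting adaptation the abstract describes, and the function names are illustrative.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def gaussian_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution, computed directly to avoid underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def lira_score(target_loss: float, in_losses: Sequence[float], out_losses: Sequence[float]) -> float:
    """Log-likelihood ratio of 'member' versus 'non-member' for one example."""
    log_in = gaussian_logpdf(target_loss, mean(in_losses), stdev(in_losses))
    log_out = gaussian_logpdf(target_loss, mean(out_losses), stdev(out_losses))
    return log_in - log_out


# Made-up forecasting losses from shadow models trained with and without the example.
shadow_in = [0.10, 0.12, 0.09, 0.11]
shadow_out = [0.40, 0.45, 0.38, 0.42]

member_like = lira_score(0.11, shadow_in, shadow_out)
non_member_like = lira_score(0.41, shadow_in, shadow_out)
print(member_like > 0, non_member_like < 0)
```

A positive score means the observed loss is better explained by the "trained on this example" distribution. Thresholding the score gives the membership decision.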
On the implicit regularization of Langevin dynamics with projected noise
Authors: Govind Menon, Austin J. Stromme, Adrien Vacher
First: 2026-02-12T18:45:42+00:00 · Latest: 2026-02-12T18:45:42+00:00
Comments: 30 pages, 1 figure
Abstract
We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new light on the effects of symmetry on stochastic gradient descent for over-parametrized models. Our main result identifies a novel form of implicit regularization: when the initial and target density are both invariant under the group action, Langevin dynamics with projected noise is equivalent in law to Langevin dynamics with isotropic diffusion but with an additional drift term proportional to the negative log volume of the group orbit. We prove this result by constructing a coupling of the two processes via a third process on the group itself, and identify the additional drift as the mean curvature of the orbits.
中文标题/摘要
标题:关于投影噪声下朗之万动力学的隐式正则化
我们研究了噪声投影到等距群作用方向正交方向上的朗之万动力学。这种数学模型被引入以揭示对称性对过参数化模型的随机梯度下降法的影响。我们的主要结果识别了一种新的隐式正则化形式:当初始密度和目标密度都对群作用不变时,投影噪声下的朗之万动力学在概率意义上等同于具有额外正比于群轨道负对数体积的漂移项的各向同性扩散的朗之万动力学。我们通过构造两个过程在群上的第三个过程的耦合来证明这一结果,并将额外的漂移项识别为轨道的平均曲率。
Summary / 总结
This paper investigates Langevin dynamics with noise projected onto directions orthogonal to an isometric group action, aiming to understand the impact of symmetry on stochastic gradient descent in over-parametrized models. The key finding is that when both the initial and target densities are invariant under the group action, the dynamics with projected noise are equivalent in law to Langevin dynamics with isotropic diffusion and an additional drift term. This drift term is proportional to the negative log volume of the group orbit, and the authors identify it as the mean curvature of the orbits.
本文研究了投影噪声下的Langevin动力学,以理解对称性对过参数化模型中随机梯度下降的影响。主要发现是,当初始和目标密度都对群作用不变时,投影噪声下的动力学与具有额外线性漂移项的等向扩散动力学等价,该漂移项与群轨道的体积负对数成正比,并可解释为轨道的平均曲率。
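To make "noise projected onto the directions orthogonal to an isometric group action" concrete, consider rotations about the origin acting on the plane. The orbit through a point is a circle, so the orthogonal direction is radial and the projected noise only moves the radius. The sketch below simulates one step under that setup. The naive Euler-Maruyama discretization and the choice of example are assumptions made for illustration. The abstract does not specify the stochastic calculus convention, which matters for state-dependent noise, so this is not the paper's construction.

```python
import math
import random
from typing import Callable, Tuple

Point = Tuple[float, float]


def project_radial(x: Point, noise: Point) -> Point:
    """Project noise onto the radial direction at x (orthogonal to the circular orbit)."""
    r2 = x[0] ** 2 + x[1] ** 2
    coeff = (x[0] * noise[0] + x[1] * noise[1]) / r2
    return (coeff * x[0], coeff * x[1])


def langevin_step(x: Point, grad_v: Callable[[Point], Point], dt: float, rng: random.Random) -> Point:
    """One naive Euler-Maruyama step with the noise projected off the orbit direction."""
    raw = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    noise = project_radial(x, raw)
    g = grad_v(x)
    scale = math.sqrt(2.0 * dt)
    return (x[0] - g[0] * dt + scale * noise[0], x[1] - g[1] * dt + scale * noise[1])


# Rotation-invariant potential V(x) = |x|^2 / 2, so its gradient is x itself.
def grad_quadratic(x: Point) -> Point:
    return x


rng = random.Random(0)
start = (1.0, 2.0)
after = langevin_step(start, grad_quadratic, 0.01, rng)
cross = start[0] * after[1] - start[1] * after[0]
print(after, cross)
```

Because both the drift and the projected noise are radial here, the step changes only the distance from the origin and leaves the point on the same line through the origin, which the vanishing cross product confirms.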
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Authors: Nan Jiang, Ziyi Wang, Yexiang Xue
Venue: ICLR 2026
First: 2025-11-08T04:39:11+00:00 · Latest: 2026-02-12T18:38:11+00:00
Comments: Camera-ready version accepted for ICLR 2026
Abstract
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expression renders the task computationally challenging. A promising yet underexplored direction for reducing the search space and accelerating training lies in *symbolic equivalence*: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates symbolic equivalence into a class of modern symbolic regression methods, including Monte Carlo Tree Search (MCTS), Deep Reinforcement Learning (DRL), and Large Language Models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module (via equality graphs), accelerating learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalent generated sequences in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Theoretically, we show the benefit of embedding EGG into learning: it tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances a class of symbolic regression models across several benchmarks, discovering more accurate expressions within the same time limit. Project page is at: https://nan-jiang-group.github.io/egg-sr.
Summary / 总结
EGG-SR is a framework that integrates symbolic equivalence into symbolic regression methods like Monte Carlo Tree Search, Deep Reinforcement Learning, and Large Language Models to reduce redundant exploration and accelerate learning. Theoretical analysis shows that embedding EGG improves the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirical results demonstrate that EGG-SR enhances the performance of symbolic regression models across various benchmarks, leading to more accurate expressions within the same time limit.
EGG-SR 是一个框架,将符号等价性整合到现代符号回归方法中,以减少搜索空间并加速学习。它使用 EGG 模块通过等价图表示等价表达式,这有助于在 MCTS 中修剪冗余探索,在 DRL 中聚合奖励,并在 LLM 中丰富反馈提示。实验表明,EGG-SR 在各种基准测试中提高了发现表达式的准确性,同时在相同的时间限制内完成。
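The symbolic-equivalence idea in the abstract can be illustrated with a minimal sketch. It does not implement equality graphs or the EGG module; as a crude stand-in, it groups candidate expressions by comparing their values at a few sample points. The function names (`group_equivalent`, `same_function`), the sample range, and the extra candidate `g` are all assumptions made for illustration.

```python
import math
import random

# The three syntactically different forms quoted in the abstract.
def f1(x1, x2): return math.log(x1**2 * x2**3)
def f2(x1, x2): return math.log(x1**2) + math.log(x2**3)
def f3(x1, x2): return 2 * math.log(x1) + 3 * math.log(x2)
# A hypothetical non-equivalent candidate, added only for contrast.
def g(x1, x2): return 2 * math.log(x1) + 2 * math.log(x2)

CANDIDATES = {"f1": f1, "f2": f2, "f3": f3, "g": g}

def sample_points(n=8, seed=0):
    """Draw positive sample points so every logarithm is defined."""
    rng = random.Random(seed)
    return [(rng.uniform(0.5, 3.0), rng.uniform(0.5, 3.0)) for _ in range(n)]

def same_function(fa, fb, points, tol=1e-9):
    """Numeric agreement on all sample points (evidence, not a proof)."""
    return all(
        math.isclose(fa(a, b), fb(a, b), rel_tol=tol, abs_tol=tol)
        for a, b in points
    )

def group_equivalent(candidates):
    """Partition candidates into classes that agree on the sample points."""
    points = sample_points()
    classes = []
    for name, fn in candidates.items():
        for cls in classes:
            if same_function(candidates[cls[0]], fn, points):
                cls.append(name)
                break
        else:
            classes.append([name])
    return classes

classes = group_equivalent(CANDIDATES)
```

A search procedure that treats each class as a single candidate would explore or reward it once rather than once per syntactic variant, which is the kind of redundancy the abstract says EGG-SR removes.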
Community Concealment from Unsupervised Graph Learning-Based Clustering
Authors: Dalyapraz Manatova, Pablo Moriano, L. Jean Camp
First: 2026-02-12T18:36:19+00:00 · Latest: 2026-02-12T18:36:19+00:00
Abstract
Graph neural networks (GNNs) are designed to use attributed graphs to learn representations. Such representations are beneficial in the unsupervised learning of clusters and community detection. Nonetheless, such inference may reveal sensitive groups, clustered systems, or collective behaviors, raising concerns regarding group-level privacy. Community attribution in social and critical infrastructure networks, for example, can expose coordinated asset groups, operational hierarchies, and system dependencies that could be used for profiling or intelligence gathering. We study a defensive setting in which a data publisher (defender) seeks to conceal a community of interest while making limited, utility-aware changes in the network. Our analysis indicates that community concealment is strongly influenced by two quantifiable factors: connectivity at the community boundary and feature similarity between the protected community and adjacent communities. Informed by these findings, we present a perturbation strategy that rewires a set of selected edges and modifies node features to reduce the distinctiveness leveraged by GNN message passing. The proposed method outperforms DICE in our experiments on synthetic benchmarks and real network graphs under identical perturbation budgets. Overall, it achieves median relative concealment improvements of approximately 20-45% across the evaluated settings. These findings demonstrate a mitigation strategy against GNN-based community learning and highlight group-level privacy risks intrinsic to graph learning.
中文标题/摘要
标题:基于无监督图学习聚类的社区隐藏
图神经网络(GNNs)旨在使用带属性的图来学习表示。这些表示在无监督聚类和社区检测中是有益的。然而,这种推断可能会揭示敏感群体、聚类系统或集体行为,从而引发关于群体级隐私的担忧。例如,在社会和关键基础设施网络中的社区归属可能会暴露协调的资产组、操作层次结构和系统依赖性,这些信息可用于进行画像或情报收集。我们研究了一个防御性场景,在这种场景中,数据发布者(防御者)试图隐藏一个社区,同时在有限的、具有实用性的变化中修改网络。我们的分析表明,社区隐藏受到两个可量化因素的强烈影响:社区边界的连接性和保护社区与相邻社区之间的特征相似性。根据这些发现,我们提出了一种重新布线选定边和修改节点特征的扰动策略,以减少GNN消息传递中利用的独特性。在我们的实验中,该方法在合成基准和真实网络图下的扰动预算相同的情况下,优于DICE。总体而言,它在评估的设置中实现了约20-45%的中位相对隐藏改进。这些发现表明了一种针对基于GNN的社区学习的缓解策略,并突显了图学习中固有的群体级隐私风险。
Summary / 总结
The research studies how a data publisher can conceal a sensitive community from GNN-based unsupervised clustering while making limited, utility-aware changes to the network. The method rewires selected edges and modifies node features to reduce the distinctiveness that GNN message passing exploits. Under identical perturbation budgets, the proposed method outperforms DICE on synthetic benchmarks and real network graphs, achieving median relative concealment improvements of approximately 20-45% across the evaluated settings.
研究探讨了如何通过修改网络来隐藏社区以保护数据发布者的隐私,同时保持网络的实用性。研究发现,边界连接性和特征相似性是影响社区隐藏的关键因素。提出了一种重构选定边和修改节点特征的策略,以减少GNN消息传递中利用的差异性。该方法在各种设置下优于DICE,实现了约20-45%的相对隐藏改进。
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
First: 2026-02-12T18:36:09+00:00 · Latest: 2026-02-12T18:36:09+00:00
Abstract
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
中文标题/摘要
标题:"对不起,我没有听清楚那句话": 为何语音模型忽视了最重要的内容
尽管语音识别系统在标准基准上的单词错误率较低,但在实际部署中,它们往往在短且高风险的口头表达上失败。在这里,我们研究了这种失败模式在一项高风险任务中的表现:美国参与者说出的美国街道名称的转录。我们评估了来自OpenAI、Deepgram、Google和Microsoft的15个模型在来自语言多样化的美国发言者录音上的表现,并发现平均转录错误率为44%。我们通过地理区域量化了失败转录的下游影响,并表明错误转录系统性地影响了所有发言者,但非英语母语发言者的路由距离错误是英语母语发言者的两倍。为了减轻这种危害,我们引入了一种合成数据生成方法,使用开源文本转语音模型生成命名实体的多样化发音。使用不到1000个合成样本进行微调后,非英语母语发言者的街道名称转录准确性提高了近60%(相对于基线模型)。我们的结果突显了语音系统基准性能与实际可靠性之间的关键差距,并证明了一条简单且可扩展的减少高风险转录错误的途径。
Summary / 总结
This study examines the failure of speech recognition systems on short, high-stakes utterances, focusing on the transcription of U.S. street names. Evaluating 15 models from OpenAI, Deepgram, Google, and Microsoft, the research found an average transcription error rate of 44%. Mis-transcriptions cause errors for all speakers, but routing distance errors are twice as large for non-English primary speakers as for English primary speakers. By generating synthetic data with open-source text-to-speech models and fine-tuning on fewer than 1,000 samples, the study improved street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers, highlighting the gap between benchmark performance and real-world reliability in speech systems.
研究探讨了语音识别系统在短时高风险语音片段上的失败情况,集中在对美国街道名称的转录。评估了来自不同公司的15个模型,研究发现平均转录错误率为44%。研究还发现,错误转录会导致所有说话者出错,非英语母语说话者相比英语母语说话者在路由距离上的错误率要高一倍。通过使用开源文本转语音模型生成合成数据,并用不到1000个样本进行微调,研究将非英语母语说话者的街道名称转录准确性提高了近60%,突显了基准性能与实际可靠性之间的差距。
ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Authors: Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen
First: 2026-02-12T18:31:37+00:00 · Latest: 2026-02-12T18:31:37+00:00
Abstract
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
中文标题/摘要
标题:ExtractBench:复杂结构化提取的基准和评估方法
非结构化文档如PDF包含有价值的结构化信息,但下游系统需要这些数据以可靠且标准化的形式存在。随着LLM被越来越多地部署以自动化此提取过程,准确性和可靠性变得至关重要。然而,进展受到两个瓶颈的阻碍。首先,没有端到端的基准可以评估大规模企业级模式下的PDF到JSON提取。其次,没有系统的方法来捕捉嵌套提取的语义,其中字段需要不同的正确性概念(标识符的精确匹配,数量的容忍度,名称的语义等价),数组需要对齐,遗漏必须与幻觉区分开。我们通过ExtractBench解决了这两个问题,这是一个开源的PDF到JSON结构化提取基准和评估框架。基准测试将35份PDF文档与经济价值领域的JSON模式和人工标注的黄金标准标签配对,生成了涵盖从几十到几百个字段的12,867个可评估字段。评估框架将模式视为可执行规范:每个字段声明其评分标准。基线评估显示,前沿模型(GPT-5/5.2,Gemini-3 Flash/Pro,Claude 4.5 Opus/Sonnet)在现实模式下仍然不可靠。随着模式范围的增加,性能急剧下降,所有测试模型在包含369个字段的财务报告模式下均无有效输出。我们已在https://github.com/ContextualAI/extract-bench/发布了ExtractBench。
Summary / 总结
ExtractBench addresses the lack of an end-to-end benchmark for evaluating PDF-to-JSON extraction and introduces a principled methodology for nested extraction semantics. The benchmark includes 35 PDF documents with human-annotated gold labels and 12,867 evaluatable fields across various domains, covering schema complexities from tens to hundreds of fields. Evaluations show that leading models struggle with realistic schemas, with performance deteriorating sharply as schema breadth increases, resulting in no valid output on a 369-field financial reporting schema.
ExtractBench 解决了缺乏端到端的 PDF 到 JSON 提取评估基准的问题,并引入了一种处理嵌套提取语义的方法。基准包括 35 份 PDF 文档、JSON 架构和黄金标签,覆盖了 12,867 个字段,涉及多个领域。评估显示,领先模型在处理真实架构时表现不佳,随着架构复杂性的增加,性能急剧下降,最终在包含 369 个字段的财务报告架构上没有任何有效的输出。
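The "schema as an executable specification" idea, where each field declares its own scoring metric, can be sketched as follows. This is not the ExtractBench API: the schema layout, the metric names (`exact`, `tolerance`, `normalized`), and the functions `score_field` and `evaluate` are hypothetical, and whitespace/case normalization is only a rough stand-in for semantic equivalence of names.

```python
import math

# Hypothetical schema: each field declares how it should be scored.
SCHEMA = {
    "invoice_id": {"metric": "exact"},                       # identifiers
    "total": {"metric": "tolerance", "rel_tol": 0.01},       # quantities
    "vendor": {"metric": "normalized"},                      # names
}

def score_field(spec, gold, pred):
    """Return 'correct', 'mismatch', 'omission', or 'hallucination'."""
    if pred is None:
        # Nothing extracted: fine only if nothing was expected.
        return "correct" if gold is None else "omission"
    if gold is None:
        # A value was produced where the gold label has none.
        return "hallucination"
    metric = spec["metric"]
    if metric == "exact":
        ok = pred == gold
    elif metric == "tolerance":
        ok = math.isclose(pred, gold, rel_tol=spec["rel_tol"])
    elif metric == "normalized":
        # Case- and whitespace-insensitive comparison.
        ok = " ".join(pred.lower().split()) == " ".join(gold.lower().split())
    else:
        raise ValueError(f"unknown metric: {metric}")
    return "correct" if ok else "mismatch"

def evaluate(schema, gold, pred):
    """Score every field in the schema with its declared metric."""
    return {
        name: score_field(spec, gold.get(name), pred.get(name))
        for name, spec in schema.items()
    }

gold = {"invoice_id": "INV-001", "total": 100.0, "vendor": "Acme  Corp"}
pred = {"invoice_id": "INV-001", "total": 100.5, "vendor": "acme corp"}
report = evaluate(SCHEMA, gold, pred)
```

Keeping omission and hallucination as separate outcomes, rather than folding both into "wrong", mirrors the distinction the abstract calls out. Array alignment, which the abstract also mentions, is not covered by this sketch.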
Intrinsic-Energy Joint Embedding Predictive Architectures Induce Quasimetric Spaces
Authors: Anthony Kobanda, Waris Radji
First: 2026-02-12T18:30:27+00:00 · Latest: 2026-02-12T18:30:27+00:00
Abstract
Joint-Embedding Predictive Architectures (JEPAs) aim to learn representations by predicting target embeddings from context embeddings, inducing a scalar compatibility energy in a latent space. In contrast, Quasimetric Reinforcement Learning (QRL) studies goal-conditioned control through directed distance values (cost-to-go) that support reaching goals under asymmetric dynamics. In this short article, we connect these viewpoints by restricting attention to a principled class of JEPA energy functions: intrinsic (least-action) energies, defined as infima of accumulated local effort over admissible trajectories between two states. Under mild closure and additivity assumptions, any intrinsic energy is a quasimetric. In goal-reaching control, optimal cost-to-go functions admit exactly this intrinsic form; conversely, JEPAs trained to model intrinsic energies lie in the quasimetric value class targeted by QRL. Moreover, we observe why symmetric finite energies are structurally mismatched with one-way reachability, motivating asymmetric (quasimetric) energies when directionality matters.
中文标题/摘要
标题:内在能量联合嵌入预测架构诱导拟度量空间
联合嵌入预测架构(JEPAs)旨在通过从上下文嵌入预测目标嵌入来学习表示,在潜在空间中诱导标量兼容能量。相比之下,准度量强化学习(QRL)研究通过支持在非对称动力学下达到目标的定向距离值(成本到终点)的目标条件控制。在本文中,我们通过限制关注JEPA能量函数的规范类:定义为两个状态之间可接受轨迹上累积局部努力的下确界的形式内在(最小作用)能量,将这些观点联系起来。在目标到达控制中,最优的成本到终点函数恰好具有这种内在形式;反过来,训练以模型内在能量的JEPAs位于QRL目标的拟度量价值类中。此外,我们观察到为什么对称的有限能量在方向性重要时与单向可达性在结构上不匹配,从而在方向性重要时推动使用非对称(拟度量)能量。
Summary / 总结
This paper connects Joint-Embedding Predictive Architectures (JEPAs) with Quasimetric Reinforcement Learning (QRL) by focusing on intrinsic (least-action) energies, which are defined as infima of accumulated local effort between two states. Under mild assumptions, intrinsic energies are quasimetrics, aligning with optimal cost-to-go functions in goal-reaching control. The study shows that JEPAs trained to model intrinsic energies are consistent with the quasimetric value class targeted by QRL, highlighting the importance of asymmetric (quasimetric) energies for directionality in control tasks.
本文通过关注内在(最小作用量)能量将联合嵌入预测架构(JEPAs)与准度量强化学习(QRL)联系起来,内在能量定义为两个状态之间累积局部努力的下确界。在轻微假设下,内在能量是准度量,与目标到达控制中的最优成本-去函数相一致。研究显示,JEPAs训练以模型内在能量与QRL目标的准度量值类一致,强调了在方向性任务中使用不对称(准度量)能量的重要性。
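A finite, discrete analogue may make the quasimetric claim concrete. On a small directed graph, the intrinsic energy between two states is the minimum accumulated step cost over directed paths, which the Floyd-Warshall algorithm computes. The graph, the costs, and the function names below are assumptions chosen for illustration; the paper's setting is more general.

```python
import math

def intrinsic_energy(n, local_cost):
    """All-pairs minimum accumulated cost over directed paths (Floyd-Warshall).

    local_cost maps a directed step (i, j) to its nonnegative local effort.
    """
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), c in local_cost.items():
        d[i][j] = min(d[i][j], c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def is_quasimetric(d):
    """Check d(x, x) = 0 and the triangle inequality; symmetry is not required."""
    n = len(d)
    zero_diag = all(d[i][i] == 0.0 for i in range(n))
    triangle = all(
        d[i][j] <= d[i][k] + d[k][j]
        for i in range(n) for j in range(n) for k in range(n)
    )
    return zero_diag and triangle

# Hypothetical 4-state system; state 3 is reachable but has no outgoing steps.
COSTS = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 5.0, (1, 0): 4.0, (2, 3): 1.0}
d = intrinsic_energy(4, COSTS)
```

Here `d[0][3]` is finite while `d[3][0]` is infinite, so no symmetric finite energy can reproduce these values. That is the mismatch with one-way reachability noted in the abstract.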
Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Authors: Manjunath Kudlur, Evan King, James Wang, Pete Warden
First: 2026-02-12T18:20:45+00:00 · Latest: 2026-02-12T18:20:45+00:00
Comments: 7 pages, 5 figures
Abstract
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases, we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state-of-the-art word error rates across standard benchmarks, attaining accuracy on par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
中文标题/摘要
标题:Moonshine v2:面向延迟敏感语音应用的遍历流式编码器ASR
延迟敏感的语音应用(例如实时转录、语音命令和实时翻译)需要低首个词时间(TTFT)和高转录准确性,特别是在资源受限的边缘设备上。全注意Transformer编码器仍然是自动语音识别(ASR)的强准确性基准,因为每一帧可以直接关注每一帧,这可以使用远程词汇上下文解决原本局部模糊的声学。然而,这种全局依赖性导致序列长度上的二次复杂性,产生固有的“编码整个语音片段”的延迟模式。对于流式应用,这会导致TTFT随着语音片段长度线性增长,因为编码器必须处理整个前缀,才能发出任何解码器词元。为了更好地满足边缘设备上流式ASR应用的需求,我们引入了Moonshine v2,这是一种遍历流式编码器ASR模型,采用滑动窗口自注意力机制,以实现有界、低延迟推理,同时保留强大的局部上下文。我们的模型在标准基准测试中达到了最先进的字错误率,其准确性与大小为其6倍的模型相当,但运行速度显著更快。这些结果表明,精心设计的局部注意力在大小和延迟成本仅为全注意力的一小部分的情况下,与全注意力的准确性相当,为边缘设备上的交互式语音界面开辟了新的可能性。
Summary / 总结
Moonshine v2 addresses the latency challenges in streaming ASR for resource-constrained devices by introducing an ergodic streaming-encoder model that uses sliding-window self-attention. This method achieves state-of-the-art word error rates, comparable to much larger models but with significantly reduced latency and faster inference speeds.
Moonshine v2 通过引入滑动窗口自注意力机制来解决实时语音应用中的延迟问题,允许在保持强局部上下文的同时实现有界低延迟。该模型在标准基准测试中实现了最先进的字错误率,与更大规模的模型相比具有相当的准确性,但运行速度更快,证明了局部注意力在资源受限设备上的 ASR 中的有效性。
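The contrast between full attention and sliding-window attention can be sketched with a boolean attention mask. The window sizes below (`left=2`, `right=1`) and the helper names are illustrative assumptions, not Moonshine v2's actual configuration, which the abstract does not specify.

```python
def sliding_window_mask(num_frames, left, right):
    """mask[i][j] is True when frame i may attend to frame j.

    Each frame sees at most `left` past frames, itself, and `right` future frames.
    """
    return [
        [(i - left) <= j <= (i + right) for j in range(num_frames)]
        for i in range(num_frames)
    ]

def attended_per_frame(mask):
    """Number of frames each query frame attends to."""
    return [sum(row) for row in mask]

def max_lookahead(mask):
    """Largest number of future frames any query frame depends on."""
    return max(
        j - i
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed
    )

short = sliding_window_mask(6, left=2, right=1)
long = sliding_window_mask(100, left=2, right=1)
```

The per-frame work is capped at `left + right + 1` regardless of utterance length, and the encoder only waits for `right` future frames before producing a frame's output. With a full-attention mask, both quantities grow with the utterance, which is the source of the linear TTFT growth the abstract describes.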
Hyperparameter Transfer with Mixture-of-Expert Layers
Authors: Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin
First: 2026-01-28T03:02:30+00:00 · Latest: 2026-02-12T18:19:47+00:00
Comments: 25 Pages, 18 Figures
Abstract
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
中文标题/摘要
标题:混合专家层中的超参数转移
混合专家(MoE)层已成为通过解耦总可训练参数与每个令牌前向传递中激活参数来扩大现代神经网络的重要工具。然而,稀疏MoE由于(i) 新的可训练参数(路由器权重),这些参数与其他所有参数组一样需要超参数(HP)调优;(ii) 新的架构规模维度(专家的数量和大小)必须选择并可能变得很大,增加了训练的复杂性。为了使HP选择既便宜又可靠,我们提出了一种新的参数化方法,用于具有MoE层的变压器模型,以扩展模型宽度、深度、专家数量和专家(隐藏)大小。我们的参数化方法通过新颖的动力学均场理论(DMFT)分析得到了证明。当我们以固定令牌预算变化不同的模型维度时,我们发现经验上,我们的参数化方法能够在从51M到超过2B总参数的模型之间实现可靠的HP转移。我们进一步将从小模型上短令牌窗口扫掠中识别出的HP应用到大模型上长令牌窗口的训练中,并报告了表现良好的模型行为。
Summary / 总结
The paper addresses hyperparameter tuning for sparse Mixture-of-Experts (MoE) layers by proposing a new parameterization for transformer models, justified by a dynamical mean-field theory (DMFT) analysis. At a fixed token budget, the parameterization enables reliable hyperparameter transfer as width, depth, number of experts, and expert size vary, across models from 51M to over 2B total parameters. Hyperparameters found by sweeping small models on a short token horizon are then used to train larger models on longer horizons, with performant results.
论文解决了稀疏Mixture-of-Experts (MoE)层中的超参数调优问题,这些层用于扩展神经网络。它提出了一种新的参数化方法,支持动态均场理论分析。主要发现是,这种方法在不同模型大小之间(从51M到超过2B参数)实现了可靠的超参数转移,当改变模型维度同时保持固定令牌预算时。这种方法能够有效地为更大规模的模型选择超参数。
Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
Authors: Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage
Venue: SIGGRAPH
First: 2025-07-24T12:25:12+00:00 · Latest: 2026-02-12T18:17:00+00:00
Comments: Accepted to ACM TOG 2025 (SIGGRAPH journal track); Project page: https://electronicarts.github.io/tiny-voice2face/
Abstract
The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint to up to 3.4 MB and required future audio context to up to 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.
中文标题/摘要
标题:微型模型仍不够小:通过混合知识蒸馏实现高质量低资源面部动画模型
用于语音驱动3D面部动画的高质量、鲁棒机器学习模型的训练需要一个包含高质量音频-动画对的大型多样数据集。为了解决缺乏此类数据集的问题,最近的工作引入了大型预训练语音编码器,这些编码器对输入音频的变异具有鲁棒性,因此使面部动画模型能够在不同说话人、音频质量和语言之间泛化。然而,由此产生的面部动画模型过于庞大,只能在专用机器上进行离线推理。在本文中,我们探索了游戏开发中的设备上实时面部动画模型。我们通过使用混合知识蒸馏和伪标签来克服大型数据集的缺乏。给定一个大型音频数据集,我们使用高性能的教师模型来训练非常小的学生模型。与预训练的语音编码器不同,我们的学生模型仅由卷积层和全连接层组成,消除了注意力上下文或递归更新的需要。在我们的实验中,我们证明了可以将内存占用减少到最多3.4 MB,并将未来音频上下文需求减少到最多81 ms,同时保持高质量的动画。这为设备上推理铺平了道路,这是实现真实、模型驱动的数字角色的重要一步。
Summary / 总结
This work addresses the challenge of creating high-quality, low-resource facial animation models for real-time use in game development. It employs hybrid knowledge distillation with pseudo-labeling to train very small student models using a large audio dataset and a high-performing teacher model. The resulting models are significantly smaller, with a memory footprint of up to 3.4 MB and a future audio context requirement of up to 81 ms, while still maintaining high-quality animations.
该研究旨在解决为3D面部动画生成高质量语音驱动面部动画模型的问题,这些模型通常需要大量数据集。通过使用伪标签和混合知识蒸馏,作者使用大型音频数据集和高性能教师模型训练非常小的学生模型。这些模型的内存占用最多为3.4 MB,并且只需要81 ms的未来音频上下文,同时仍能生成高质量的动画。这使得实时、设备端推理成为可能,是实现游戏开发中逼真、模型驱动的数字角色的重要一步。
Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Authors: Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed Mia
First: 2026-02-12T18:15:32+00:00 · Latest: 2026-02-12T18:15:32+00:00
Abstract
Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom jointly optimize accuracy and energy efficiency, with particularly limited exploration on event-based datasets. We propose an energy-aware spike budgeting framework for continual SNN learning that integrates experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Our approach exhibits modality-dependent behavior: on frame-based datasets (MNIST, CIFAR-10), spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while reducing spike rates by up to 47%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation enables accuracy gains up to 17.45 percentage points with minimal computational overhead. Across five benchmarks spanning both modalities, our method demonstrates consistent performance improvements while minimizing dynamic power consumption, advancing the practical viability of continual learning in neuromorphic vision systems.
中文标题/摘要
标题:基于脉冲神经网络的神经形态视觉连续学习的节能脉冲预算方法
基于脉冲神经网络(SNN)的神经形态视觉系统为事件驱动和帧驱动的相机提供超低功耗感知,但灾难性遗忘仍然是在不断变化的环境中部署的关键障碍。现有的连续学习方法主要针对人工神经网络开发,很少同时优化准确性和能效,特别是在事件驱动的数据集上探索有限。我们提出了一种节能的脉冲预算框架,用于连续SNN学习,该框架结合了经验回放、可学习的漏型积分-放电神经元参数以及自适应脉冲调度器,在训练过程中强制执行数据集特定的能耗约束。我们的方法表现出模态依赖的行为:在帧驱动的数据集(MNIST、CIFAR-10)上,脉冲预算作为稀疏性诱导的正则化器,提高准确率的同时将脉冲率降低高达47%;在事件驱动的数据集(DVS-Gesture、N-MNIST、CIFAR-10-DVS)上,可控的预算放松可实现高达17.45个百分点的准确率提升,同时具有最小的计算开销。在五个涵盖不同模态的基准测试中,我们的方法在提高性能的同时最大限度地减少了动态功耗,推动了神经形态视觉系统中连续学习的实际可行性。
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in continual learning within spiking neural networks (SNNs) for neuromorphic vision systems. It introduces an energy-aware spike budgeting framework that combines experience replay, adaptable neuron parameters, and an adaptive spike scheduler to optimize both accuracy and energy efficiency. The method shows significant improvements in accuracy and energy efficiency on both frame-based and event-based datasets, with up to 47% reduction in spike rates on frame-based datasets and up to 17.45 percentage point gains in accuracy on event-based datasets.
论文针对神经形态视觉系统中基于突触神经网络(SNN)的持续学习中的灾难性遗忘问题,提出了一种能量感知的突触预算框架,该框架结合了经验回放、可调神经元参数和自适应突触调度器,以优化准确性和能效。该方法在帧基和事件基数据集上均显示出显著的性能提升,帧基数据集上突触率最多可减少47%,事件基数据集上准确率最多可提升17.45个百分点。
Evaluating LLM Reasoning Beyond Correctness and CoT
Authors: Soheil Abbasloo
First: 2025-10-20T22:08:59+00:00 · Latest: 2026-02-12T18:07:50+00:00
Abstract
What does it truly mean for a language model to "reason"? Current evaluations reward models' correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights. Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis-antithesis-synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning (robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints), dimensions that conventional correctness-based metrics cannot capture. Empirical results on GSM and MMLU demonstrate substantial gaps in the reasoning abilities of state-of-the-art models: for example, GPT-5-chat loses more than 40 points (out of 100) on GSM when evaluated through SIEV's process-oriented lens. By shifting focus from what answer a model gives to how it arrives there, SIEV enables a more transparent and principled distinction between structured reasoning and surface-level pattern generation, offering a clearer foundation for assessing and understanding the reasoning capabilities of LLMs.
中文标题/摘要
标题:超越正确性和思维过程评估大语言模型的推理能力
语言模型究竟如何“推理”意味着什么?当前的评估奖励模型的正确独立答案,但仅凭正确性并不能揭示其背后的推理过程。我们认为,推理不应被视为静态的步骤链,而应被视为一种动态轨迹,在其中思想相互作用、碰撞并发展成综合见解。基于辩证法的哲学传统,我们引入了SIEV结构化评估框架,通过明确的论题-反题-合题互动来评估推理。SIEV生成可解释的轨迹,突出推理的关键属性——面对挑战的稳健性、冲突下的适应性以及观点竞争中的综合能力——这些维度是传统基于正确性的度量无法捕捉到的。在GSM和MMLU上的实证结果表明,最先进的模型在推理能力上存在显著差距:例如,GPT-5-chat在通过SIEV的过程导向视角评估时,得分下降超过40分(满分100分)。通过将焦点从模型给出的答案转向其如何得出答案,SIEV使结构化推理与表面模式生成之间的区分更加透明和原则化,为评估和理解大语言模型的推理能力提供了更清晰的基础。
Summary / 总结
The paper evaluates language models' reasoning abilities beyond mere correctness, proposing SIEV, a structured framework that assesses reasoning through thesis-antithesis-synthesis interactions. Key findings show significant gaps in reasoning abilities, with models like GPT-5-chat losing over 40 points on GSM when evaluated through SIEV's process-oriented lens, highlighting robustness, adaptability, and synthesis as crucial dimensions of reasoning not captured by conventional metrics.
论文旨在超越正确性评估语言模型的推理能力。它引入了基于辩证法传统的SIEV结构化评估框架,通过明确的论题-反题-合题互动来评估推理。关键发现表明,如GPT-5-chat等最先进的模型在通过SIEV的过程导向评估时,在应对挑战的稳健性和在冲突下的适应性等方面表现出显著差距,这些是传统基于正确性的度量无法捕捉到的方面。
Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Authors: Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo
First: 2026-02-12T17:59:58+00:00 · Latest: 2026-02-12T17:59:58+00:00
Abstract
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT
中文标题/摘要
标题:迈向基于策略的微调:分布判别理论及其在大模型训练中的应用
监督微调(SFT)计算效率高,但通常在泛化能力上不如强化学习(RL)。这一差距主要是由于RL使用了基于策略的数据。我们提出了一种框架来弥合这一差距,使其能够实现基于策略的微调。我们首先提出了**分布判别理论(DDT)**,该理论解释并量化了数据与模型诱导分布之间的对齐程度。利用DDT,我们引入了两种互补的技术:(i)**分布内微调(IDFT)**,一种在损失层面增强SFT泛化能力的方法;(ii)**提示解码**,一种数据层面的技术,可以重新对齐训练语料库以匹配模型的分布。大量实验表明,我们的框架在泛化性能上与著名的离线RL算法(包括DPO和SimPO)相当,同时保持了SFT管道的效率。因此,该提出的框架为在RL不可行的领域提供了一种实用的替代方案。我们在此开源代码:https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT
Summary / 总结
The paper aims to bridge the gap between supervised fine-tuning (SFT) and reinforcement learning (RL) by proposing a framework for on-policy SFT. It introduces Distribution Discriminant Theory (DDT) to quantify the alignment between data and the model-induced distribution, and two techniques: In-Distribution Finetuning (IDFT) and Hinted Decoding. Experiments show that the proposed framework matches the generalization performance of offline RL algorithms like DPO and SimPO while maintaining the efficiency of SFT.
论文通过提出面向策略的微调(On-Policy SFT)框架来弥合监督微调(SFT)与强化学习(RL)之间的差距。它引入了分布判别理论(DDT)来解释数据与模型诱导分布之间的对齐,并提出了两种技术:In-Distribution Finetuning(IDFT)和提示解码。实验表明,该框架在泛化性能上与DPO和SimPO等离线RL算法相当,同时保持了SFT的高效性。
Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Authors: Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou
First: 2026-02-12T17:59:08+00:00 · Latest: 2026-02-12T17:59:08+00:00
Abstract
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
中文标题/摘要
标题:兼而有之:通过统一离散流匹配实现多模态推理与生成
我们提出了一种统一的离散流匹配框架UniDFlow,用于多模态理解、生成和编辑。该框架通过特定任务的低秩适配器解耦理解与生成,避免了目标干扰和表示纠缠,同时一种新颖的基于参考的多模态偏好对齐优化了在相同条件下的相对结果,提高了忠实度和可控性,而无需大规模重新训练。UniDFlow在八个基准测试中取得了SOTA性能,并且在包括图像修复(inpainting)、上下文图像生成、基于参考的编辑和组合生成等任务中表现出强大的零样本泛化能力,尽管没有进行明确的特定任务训练。
Summary / 总结
The research aims to develop a unified framework for multimodal tasks such as understanding, generation, and editing by decoupling these processes and using task-specific adapters. UniDFlow uses a discrete flow-matching approach and a reference-based multimodal preference alignment to enhance faithfulness and controllability. The model achieves state-of-the-art performance across eight benchmarks and demonstrates strong zero-shot generalization to various tasks without extensive task-specific training.
研究旨在通过分离理解和生成过程,并通过新颖的基于参考的多模态偏好对齐来优化相对结果,开发一个统一的多模态框架。UniDFlow 使用任务特定的低秩适配器和离散流匹配方法,在八个基准测试中达到最先进的性能,并在各种任务上展示了强大的零样本泛化能力,而无需大量重新训练。
The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics
Authors: Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, Barbara Hammer
First: 2026-02-12T17:56:07+00:00 · Latest: 2026-02-12T17:56:07+00:00
Abstract
Determining whether neural models internalize physical laws as world models, rather than exploiting statistical shortcuts, remains challenging, especially under out-of-distribution (OOD) shifts. Standard evaluations often test latent capability via downstream adaptation (e.g., fine-tuning or high-capacity probes), but such interventions can change the representations being measured and thus confound what was learned during self-supervised learning (SSL). We propose a non-invasive evaluation protocol, PhyIP. We test whether physical quantities are linearly decodable from frozen representations, motivated by the linear representation hypothesis. Across fluid dynamics and orbital mechanics, we find that when SSL achieves low error, latent structure becomes linearly accessible. PhyIP recovers internal energy and Newtonian inverse-square scaling on OOD tests (e.g., $ρ> 0.90$). In contrast, adaptation-based evaluations can collapse this structure ($ρ\approx 0.05$). These findings suggest that adaptation-based evaluation can obscure latent structures and that low-capacity probes offer a more accurate evaluation of physical world models.
中文标题/摘要
标题:世界模型中的观察者效应:侵入性适应破坏潜在物理定律
确定神经模型是否将物理定律作为世界模型内部化,而不是利用统计捷径,尤其是在分布外(OOD)转移时,仍然具有挑战性。标准评估通常通过下游适应(例如微调或高容量探针)测试潜在能力,但这些干预措施会改变正在测量的表示,从而混淆自监督学习(SSL)期间学到的内容。我们提出了一种非侵入性评估协议,PhyIP。我们测试物理量是否可以从冻结的表示中线性可解码,受线性表示假设的启发。在流体动力学和轨道力学中,我们发现当SSL达到低误差时,潜在结构变得线性可访问。PhyIP在分布外测试中恢复了内部能量和牛顿平方反比缩放(例如,ρ>0.90)。相比之下,基于适应的评估可以压缩这种结构(ρ≈0.05)。这些发现表明,基于适应的评估可能会掩盖潜在结构,而低容量探针提供了对物理世界模型更准确的评估。
Summary / 总结
The study asks whether neural models learn physical laws or rely on statistical shortcuts, especially under out-of-distribution (OOD) shifts. It proposes PhyIP, a non-invasive evaluation protocol that tests whether physical quantities are linearly decodable from frozen representations. When self-supervised learning achieves low error, latent structure becomes linearly accessible: PhyIP recovers internal energy and Newtonian inverse-square scaling on OOD tests (e.g., $ρ > 0.90$). Adaptation-based evaluations such as fine-tuning or high-capacity probes, by contrast, can collapse this structure ($ρ \approx 0.05$).
研究旨在评估神经模型是学习物理定律还是依赖统计捷径,特别是在分布外转移时。提出了一种非侵入性评估方法PhyIP,通过测试冻结表示中物理量的线性可解性来进行评估。该方法表明,当自我监督学习达到低误差时,物理结构会变得线性可解,能够在分布外测试中恢复内部能量和牛顿反平方定律,相关性高达$ρ> 0.90$。相比之下,基于适应性的评估无法恢复这些结构,相关性约为$ρ\approx 0.05$。
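The linear-decodability test described above can be illustrated with a minimal sketch. This is not the PhyIP implementation: the synthetic "frozen features", the target quantity, the ridge constant, and the use of Pearson correlation for $ρ$ are all assumptions made for illustration, and the helper names are hypothetical. The point is only that the encoder is never updated; a low-capacity linear map is fitted on in-distribution data and scored on a shifted range.

```python
import random

def fit_linear_probe(features, targets, ridge=1e-8):
    """Fit linear weights (plus bias) on frozen features by least squares.

    Solves the lightly ridge-regularized normal equations with Gaussian
    elimination, so only the standard library is needed.
    """
    rows = [list(f) + [1.0] for f in features]  # append a bias column
    d = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    for col in range(d):  # forward elimination with partial pivoting
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    w = [0.0] * d
    for i in reversed(range(d)):  # back substitution
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, d))) / a[i][i]
    return w

def probe_predict(w, features):
    """Apply the fitted probe; the features themselves are never modified."""
    return [sum(wi * fi for wi, fi in zip(w, list(f) + [1.0])) for f in features]

def pearson(xs, ys):
    """Pearson correlation (assumed here as the meaning of rho)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def synthetic_split(rng, n, low, high):
    """Hypothetical stand-in for frozen encoder outputs and a physical quantity."""
    feats = [[rng.uniform(low, high) for _ in range(4)] for _ in range(n)]
    energy = [2.0 * f[0] - f[1] + 0.5 * f[3] + rng.gauss(0.0, 0.05) for f in feats]
    return feats, energy

rng = random.Random(0)
train_x, train_y = synthetic_split(rng, 200, -1.0, 1.0)  # in-distribution range
ood_x, ood_y = synthetic_split(rng, 200, 1.0, 3.0)       # shifted (OOD) range
w = fit_linear_probe(train_x, train_y)
rho_ood = pearson(probe_predict(w, ood_x), ood_y)
print(f"OOD correlation of the linear probe: {rho_ood:.3f}")
```

Because the probe has only one weight per feature dimension, a high OOD correlation in such a setup reflects structure already present in the representation rather than capacity added by the evaluator.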
VIRENA: Virtual Arena for Research, Education, and Democratic Innovation
Authors: Emma Hoes, K. Jonathan Klueser, Fabrizio Gilardi
First: 2026-02-12T17:46:52+00:00 · Latest: 2026-02-12T17:46:52+00:00
Comments: VIRENA is under active development and currently in use at the University of Zurich, supported by the DIZH Innovation Program: 2nd Founder-Call. This preprint will be updated as new features are released. For the latest version and to inquire about demos or pilot collaborations, contact the authors
Abstract
Digital platforms shape how people communicate, deliberate, and form opinions. Studying these dynamics has become increasingly difficult due to restricted data access, ethical constraints on real-world experiments, and limitations of existing research tools. VIRENA (Virtual Arena) is a platform that enables controlled experimentation in realistic social media environments. Multiple participants interact simultaneously in realistic replicas of feed-based platforms (Instagram, Facebook, Reddit) and messaging apps (WhatsApp, Messenger). Large language model-powered AI agents participate alongside humans with configurable personas and realistic behavior. Researchers can manipulate content moderation approaches, pre-schedule stimulus content, and run experiments across conditions through a visual interface requiring no programming skills. VIRENA makes possible research designs that were previously impractical: studying human-AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds. Built on open-source technologies that ensure data remain under institutional control and comply with data protection requirements, VIRENA is currently in use at the University of Zurich and available for pilot collaborations. Designed for researchers, educators, and public organizations alike, VIRENA's no-code interface makes controlled social media simulation accessible across disciplines and sectors. This paper documents its design, architecture, and capabilities.
中文标题/摘要
标题:VIRENA:虚拟竞技场,用于研究、教育和民主创新
数字平台塑造了人们的交流、讨论和形成观点的方式。由于数据访问受限、现实世界实验的伦理限制以及现有研究工具的局限性,研究这些动态变得越来越困难。VIRENA(虚拟竞技场)是一个平台,它能够在现实社交媒体环境中进行受控实验。多个参与者同时在基于信息流的平台(Instagram、Facebook、Reddit)和即时通讯应用(WhatsApp、Messenger)的现实复制品中互动。由大型语言模型驱动的AI代理与人类一起参与,具有可配置的人设和现实行为。研究人员可以通过无需编程技能的可视化界面操控内容审核方法、预排定刺激内容,并在不同条件下运行实验。VIRENA 使得以前不切实际的研究设计成为可能:在现实社会环境中研究人类与AI的互动、实验性地比较干预措施的效果以及观察小组讨论的展开过程。VIRENA 建立在开源技术之上,确保数据保留在机构控制之下并符合数据保护要求,目前在苏黎世大学使用,并可供试点合作。VIRENA 的无代码界面使其跨学科和跨领域的受控社交媒体模拟变得可行。本文档记录了其设计、架构和功能。
Summary / 总结
VIRENA is a platform for controlled experimentation in realistic social media environments, built in response to restricted data access, ethical constraints on real-world experiments, and the limitations of existing research tools. Researchers can manipulate content moderation, pre-schedule stimulus content, and run experiments across conditions through a no-code visual interface. The platform makes previously impractical research designs possible: studying human-AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds.
VIRENA 是一个平台,旨在通过现实社交媒体环境中的受控实验来解决数据访问受限和伦理约束的问题。它允许参与者在 Instagram、Facebook、Reddit、WhatsApp 和 Messenger 等流行平台的复制品中互动,并可选择加入由大型语言模型驱动的 AI 代理。研究人员可以通过无代码的可视化界面操纵内容审核并运行实验。主要发现包括能够研究人与 AI 的互动、比较审核干预措施以及观察群体讨论的展开,使以前难以实现的研究设计成为可能。
Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting
Authors: Chutian Ma, Grigorii Pomazkin, Giacinto Paolo Saggese, Paul Smith
First: 2026-01-15T21:26:57+00:00 · Latest: 2026-02-12T17:45:17+00:00
Abstract
Traditional time series forecasting methods optimize for accuracy alone. This objective neglects temporal consistency, in other words, how consistently a model predicts the same future event as the forecast origin changes. We introduce the forecast accuracy and coherence score (forecast AC score for short) for measuring the quality of probabilistic multi-horizon forecasts in a way that accounts for both multi-horizon accuracy and stability. Our score additionally allows user-specified weights to balance accuracy and consistency requirements. As an example application, we implement the score as a differentiable objective function for training seasonal auto-regressive integrated models and evaluate it on the M4 Hourly benchmark dataset. Results demonstrate substantial improvements over traditional maximum likelihood estimation. Regarding stability, the AC-optimized model generated out-of-sample forecasts with 91.1% reduced vertical variance relative to the MLE-fitted model. In terms of accuracy, the AC-optimized model achieved considerable improvements for medium-to-long-horizon forecasts. While one-step-ahead forecasts exhibited a 7.5% increase in MAPE, all subsequent horizons experienced an improved accuracy as measured by MAPE of up to 26%. These results indicate that our metric successfully trains models to produce more stable and accurate multi-step forecasts in exchange for some degradation in one-step-ahead performance.
中文标题/摘要
标题:超越准确性:面向多时距预测的稳定性度量
传统的时序预测方法仅优化准确性。这一目标忽略了时间一致性,即模型如何在预测起始时间变化时一致地预测同一未来事件。我们引入了预测准确性和连贯性评分(简称预测AC评分),以衡量概率多时距预测的质量,这种方式同时考虑了多时距准确性和稳定性。该评分还允许用户指定权重以平衡准确性和一致性要求。作为示例应用,我们将该评分实现为训练季节性自回归整定模型的可微目标函数,并在M4小时基准数据集上对其进行评估。结果表明,与最大似然估计相比,AC优化模型在离样本外预测中垂直方差减少了91.1%。在准确性方面,AC优化模型在中长期预测中取得了显著改进。尽管一步预测的MAPE提高了7.5%,但所有后续时距的准确性均有所提高,MAPE最多提高了26%。这些结果表明,我们的度量成功地训练模型以产生更稳定和准确的多步预测,尽管这在一步预测性能上有所牺牲。
Summary / 总结
The paper addresses a limitation of traditional time series forecasting methods, which optimize for accuracy alone and neglect temporal consistency: how consistently a model predicts the same future event as the forecast origin changes. It introduces the forecast accuracy and coherence score (forecast AC score), which measures probabilistic multi-horizon forecasts on both accuracy and stability with user-specified weights. Used as a differentiable training objective for seasonal auto-regressive integrated models on the M4 Hourly benchmark, the score reduces the vertical variance of out-of-sample forecasts by 91.1% relative to maximum likelihood estimation and improves MAPE by up to 26% beyond the first horizon, at the cost of a 7.5% increase in one-step-ahead MAPE.
本文针对传统时间序列预测方法仅关注准确性的局限性,提出了预测准确性和连贯性评分(预测AC评分)来评估多步预测,同时考虑准确性和稳定性。AC评分被用作训练季节自回归整定模型的可微分目标函数,并在M4小时基准数据集上进行了评估。结果表明,AC优化模型显著减少了91.1%的垂直方差,并且在中长期预测中提高了准确性,最高降低了26%的MAPE,尽管一阶预测的MAPE增加了7.5%。
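To make the accuracy/stability trade-off concrete, here is a rough sketch of a weighted score in the same spirit. It is not the paper's forecast AC score, whose exact definition is not given in the abstract: the function name, the use of mean absolute error for accuracy, the use of per-target variance across forecast origins for stability, and the point-forecast (rather than probabilistic) setting are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance

def ac_style_score(forecasts, actuals, accuracy_weight=0.5):
    """Weighted accuracy-plus-stability score (lower is better).

    forecasts: {origin: [f_1, ..., f_H]}, where f_h targets time origin + h.
    actuals:   {time: observed value}; must cover at least one forecast target.
    """
    abs_errors = []
    by_target = defaultdict(list)  # all forecasts made for the same future time
    for origin, path in forecasts.items():
        for h, value in enumerate(path, start=1):
            target = origin + h
            by_target[target].append(value)
            if target in actuals:
                abs_errors.append(abs(value - actuals[target]))
    accuracy_term = mean(abs_errors)
    # "Vertical" spread: how much forecasts for one target move as the origin changes.
    variances = [pvariance(vals) for vals in by_target.values() if len(vals) > 1]
    stability_term = mean(variances) if variances else 0.0
    return accuracy_weight * accuracy_term + (1.0 - accuracy_weight) * stability_term

actuals = {1: 1.0, 2: 2.0, 3: 3.0}
stable = {0: [1.0, 2.0], 1: [2.0, 3.0]}    # both origins agree on time 2
unstable = {0: [1.0, 4.0], 1: [0.0, 3.0]}  # forecasts for time 2 jump from 4.0 to 0.0
print(ac_style_score(stable, actuals), ac_style_score(unstable, actuals))
```

Raising `accuracy_weight` toward 1 recovers a pure accuracy objective, while lowering it penalizes revisions between successive forecast origins more heavily, which mirrors the user-specified weighting the abstract describes.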
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang
First: 2026-02-12T17:44:24+00:00 · Latest: 2026-02-12T17:44:24+00:00
Abstract
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
中文标题/摘要
标题:DeepGen 1.0:一种轻量级统一多模态模型,用于推进图像生成和编辑
当前用于图像生成和编辑的统一多模态模型通常依赖于庞大的参数规模(例如,>10B),导致高昂的训练成本和部署足迹。在本工作中,我们提出了DeepGen 1.0,这是一种轻量级的5B统一模型,其综合能力与或超越了更大的同类模型。为了克服紧凑模型在语义理解和精细控制方面的局限性,我们引入了堆叠通道桥接(SCB),这是一种深度对齐框架,从多个VLM层中提取层次特征,并与可学习的“思考标记”融合,为生成骨干提供结构化、富含推理的指导。我们还设计了一种以数据为中心的训练策略,分为三个渐进阶段:(1)在大规模图像-文本对和编辑三元组上进行对齐预训练,以同步VLM和DiT表示;(2)在高质量的生成、编辑和推理任务混合集上进行联合监督微调,以培养全方位的能力;(3)使用MR-GRPO的强化学习,利用混合奖励函数和监督信号,显著提高了生成质量和与人类偏好的一致性,同时保持了稳定的训练进展并避免了视觉伪影。尽管仅在约5000万样本上进行训练,DeepGen 1.0在多种基准测试中均表现出领先性能,在WISE上超越了80B的HunyuanImage 28%,在UniREditBench上超越了27B的Qwen-Image-Edit 37%。通过开源我们的训练代码、权重和数据集,我们提供了一种高效、高性能的替代方案,以促进统一多模态研究的民主化。
Summary / 总结
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing that targets the prohibitive training and deployment costs of models above 10B parameters. It introduces Stacked Channel Bridging (SCB), which fuses hierarchical features from multiple VLM layers with learnable 'think tokens' to guide the generative backbone. Training proceeds in three stages: alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO. Trained on only ~50M samples, DeepGen 1.0 surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench.
DeepGen 1.0 是一个轻量级的 5B 统一模型,用于图像生成和编辑,尽管规模较小但仍能超越更大规模的模型。它引入了堆叠通道桥接 (SCB) 来增强语义理解和精细控制。该模型经历了三个逐步的训练阶段:对齐预训练、联合监督微调和强化学习。DeepGen 1.0 在各种基准测试中表现出色,分别在 WISE 和 UniREditBench 上超越了 80B 的 HunyuanImage 和 27B 的 Qwen-Image-Edit,显著的领先优势。它为统一多模态研究提供了高效的替代方案。
Beyond the Loss Curve: Scaling Laws, Active Learning, and the Limits of Learning from Exact Posteriors
Authors: Arian Khorasani, Nathaniel Chen, Yug D Oswal, Akshat Santhana Gopalan, Egemen Kolemen, Ravid Shwartz-Ziv
First: 2026-01-30T21:08:55+00:00 · Latest: 2026-02-12T17:44:22+00:00
Abstract
How close are neural networks to the best they could possibly do? Standard benchmarks cannot answer this because they lack access to the true posterior p(y|x). We use class-conditional normalizing flows as oracles that make exact posteriors tractable on realistic images (AFHQ, ImageNet). This enables five lines of investigation. Scaling laws: Prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error; the epistemic component follows a power law in dataset size, continuing to shrink even when total loss plateaus. Limits of learning: The aleatoric floor is exactly measurable, and architectures differ markedly in how they approach it: ResNets exhibit clean power-law scaling while Vision Transformers stall in low-data regimes. Soft labels: Oracle posteriors contain learnable structure beyond class labels: training with exact posteriors outperforms hard labels and yields near-perfect calibration. Distribution shift: The oracle computes exact KL divergence of controlled perturbations, revealing that shift type matters more than shift magnitude: class imbalance barely affects accuracy at divergence values where input noise causes catastrophic degradation. Active learning: Exact epistemic uncertainty distinguishes genuinely informative samples from inherently ambiguous ones, improving sample efficiency. Our framework reveals that standard metrics hide ongoing learning, mask architectural differences, and cannot diagnose the nature of distribution shift.
中文标题/摘要
标题:超越损失曲线:缩放定律、主动学习与从精确后验学习的极限
神经网络距离它们可能达到的最佳性能有多接近?标准基准无法回答这个问题,因为它们缺乏访问真正的后验 p(y|x) 的能力。我们使用类条件归一化流作为预言机(oracle),在现实图像(AFHQ、ImageNet)上使精确后验变得可处理。这使我们能够进行五项调查。缩放定律:预测误差分解为不可约的偶然不确定性与可约的认知误差;认知部分遵循数据集大小的幂律,在总损失进入平台期后仍然继续缩小。学习的极限:偶然不确定性的下限是可精确测量的,而不同架构在接近它的方式上存在显著差异:ResNets 展现出清晰的幂律缩放,而视觉变换器在低数据域中停滞不前。软标签:Oracle 后验包含超越类别标签的可学习结构:使用精确后验进行训练优于使用硬标签,并且能够实现近乎完美的校准。分布偏移:Oracle 计算受控扰动的精确KL散度,揭示了偏移类型比偏移幅度更重要:在输入噪声已导致灾难性退化的散度值下,类别不平衡几乎不影响准确性。主动学习:精确的认知不确定性区分了真正有信息量的样本与本质上模棱两可的样本,提高了样本效率。我们的框架揭示了标准度量隐藏了持续学习,掩盖了架构差异,并且无法诊断分布偏移的性质。
Summary / 总结
This study investigates the limitations of neural networks by using class-conditional normalizing flows as oracles to access exact posteriors on realistic images. The research finds that prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error, with the latter following a power law in dataset size. The study also reveals that architectures like ResNets and Vision Transformers differ significantly in approaching the aleatoric floor, and that training with exact posteriors provides better calibration than hard labels. Additionally, the research shows that the type of distribution shift is more critical than its magnitude, and that exact epistemic uncertainty can improve sample efficiency in active learning.
该论文通过使用类条件归一化流作为先知来访问精确的后验分布,来研究神经网络性能的极限。研究发现预测误差可以分解为不可约的 aleatoric 不确定性和可约的 epistemic 错误,后者随着数据集大小的增加遵循幂律。此外,研究还表明 ResNets 和 Vision Transformers 在接近 aleatoric 底层方面存在显著差异,并且使用精确的后验分布进行训练比使用硬标签效果更好,还能提供更好的校准。另外,研究还表明分布转移的类型比其幅度更重要,并且精确的 epistemic 不确定性可以提高主动学习中的样本效率。
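One standard way to write an aleatoric/epistemic split under log-loss is the identity H(p, q) = H(p) + KL(p || q), where p is the true posterior and q the model's prediction. The sketch below checks that identity numerically. Mapping H(p) to the "aleatoric floor" and KL(p || q) to the "epistemic error" is an assumption about how the paper's decomposition is formed, and the example distributions are made up.

```python
import math

def entropy(p):
    """H(p): irreducible uncertainty of the true posterior (in nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    """KL(p || q): gap between the true posterior and the model's prediction."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cross_entropy(p, q):
    """H(p, q): expected log-loss of predicting q when labels follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

true_posterior = [0.7, 0.2, 0.1]    # what an exact oracle would report (made up)
model_posterior = [0.5, 0.3, 0.2]   # what a trained classifier predicts (made up)

total = cross_entropy(true_posterior, model_posterior)
floor = entropy(true_posterior)
gap = kl_divergence(true_posterior, model_posterior)
print(f"loss={total:.4f}  floor={floor:.4f}  reducible gap={gap:.4f}")
```

Under this reading, a loss curve can look flat while the KL term keeps shrinking, as long as the entropy floor dominates the total; measuring the two terms separately requires the exact posterior, which is what the oracle supplies.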
Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
First: 2026-02-12T17:40:15+00:00 · Latest: 2026-02-12T17:40:15+00:00
Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8× reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires $Ω(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48-52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($γ = 0.43$ vs. $γ_{\text{human}} \approx 0.4$-$0.5$). Code and benchmarks are available at [anonymized].
中文标题/摘要
标题:学习忘记注意力:记忆巩固以适应计算减少
结合状态空间模型与注意力的混合架构已实现了强大的效率-质量权衡,但现有方法要么均匀应用注意力,要么学习静态稀疏模式。这错过了一个关键机会:随着时间的推移,随着反复出现的模式变得熟悉,注意力需求应该减少。我们通过对GPT-2模型的分析发现:88%的注意力操作检索的信息已经可以从模型的隐藏状态预测,而且这种冗余在训练过程中并未减少。受此观察的启发,我们引入了CRAM(Consolidation-based Routing for Adaptive Memory,基于巩固的自适应记忆路由),这是一种生物启发的记忆巩固机制,逐步将情景检索提炼为参数语义记忆。与之前的稀疏注意力方法不同,CRAM在训练过程中表现出注意力利用率的减少,通过在大约3K步时的尖锐相变实现了37.8×的减少。我们证明了这种能力在没有巩固的情况下是不可能的:对于具有频率为$f$的反复出现模式的任务,任何静态路由方案都需要$Ω(f \cdot n)$的注意力。在我们提出的SRCD基准上,CRAM在1.6%注意力计算下实现了100%的检索准确率(基线为68%),并且巩固后的模式可迁移到未见过的任务,在无需重新训练的情况下将注意力减少48-52%。值得注意的是,学习到的巩固动力学在定量上与认知心理学中人类情景记忆到语义记忆的过渡曲线相匹配($γ = 0.43$,而 $γ_{\text{human}} \approx 0.4$-$0.5$)。代码和基准可在[匿名]获取。
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Venue: Nat Mach Intell 8, 20-31 (2026)
First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00
Comments: Published at Nature Machine Intelligence
Abstract
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
中文标题/摘要
标题:实验室安全台:评估大型语言模型在科学实验室安全问题上的基准
人工智能(AI)正在革新科学研究,但其日益融入实验室环境带来了关键的安全挑战。大型语言模型(LLMs)和视觉语言模型(VLMs)现在协助实验设计和程序指导,但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示当前模型远未达到安全实验室操作所需的可靠性。我们引入了LabSafety Bench,这是一个全面的基准,评估模型在危害识别、风险评估和后果预测方面的表现,涵盖765个多项选择题和404个现实实验室场景,共计3,128个开放式任务。对19个先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%。虽然专有模型在结构化评估中表现良好,但在开放式推理方面并没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前,迫切需要专门的安全评估框架。
Summary / 总结
The paper aims to address the safety challenges posed by the integration of AI in scientific laboratories. It introduces LabSafety Bench, a benchmark that tests models on hazard identification, risk assessment, and consequence prediction. Evaluations on 19 advanced LLMs and VLMs reveal that no model achieves over 70% accuracy in hazard identification, highlighting the need for specialized safety evaluation frameworks before deploying AI in real laboratory settings.
研究旨在解决AI在科学实验室中的安全挑战。引入了LabSafety Bench基准,评估模型在危害识别、风险评估和后果预测方面的表现。对19种先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%,强调了在实际实验室环境中部署AI系统前需要有专门的安全评估框架的迫切性。
Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
First: 2026-02-12T17:29:03+00:00 · Latest: 2026-02-12T17:29:03+00:00
Abstract
AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
中文标题/摘要
标题:视觉推理基准:评估多模态大语言模型在小学课堂真实视觉问题上的推理能力
AI模型在文本推理方面已达到最先进的水平;然而,它们在处理空间和关系结构方面的推理能力仍然是一个关键瓶颈——特别是在依赖大量视觉元素的早期数学中。本文介绍了视觉推理基准(VRB),这是一个新型数据集,旨在评估多模态大语言模型(MLLMs)在解决课堂真实视觉问题方面的能力。该基准基于赞比亚和印度小学考试中的701个问题构建,涵盖了诸如类比推理、模式填充和空间匹配等一系列任务。我们概述了基准的方法和开发过程,故意使用未经编辑的、文字最少的图像来测试模型是否能够满足小学教育的实际需求。我们的研究结果揭示了一个“能力断层”,模型在静态技能如计数和缩放方面表现出更好的熟练度,但在面对动态操作如折叠、反射和旋转时则达到了一个明显的“空间天花板”。这些弱点对课堂中使用视觉推理问题的风险包括错误评分、虚假支持和强化学生的错误概念。因此,像VRB这样的面向教育的基准对于确定用于课堂的多模态工具的功能边界至关重要。
Summary / 总结
The paper introduces the Visual Reasoning Benchmark (VRB) to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from primary education. The benchmark consists of 701 questions from Zambia and India, covering tasks like reasoning by analogy and spatial matching. The study finds that models excel in static skills like counting and scaling but struggle with dynamic operations such as folding and rotation, indicating a 'spatial ceiling' that limits their classroom applicability in visual reasoning tasks.
论文介绍了视觉推理基准(VRB),用于评估多模态大型语言模型(MLLMs)解决来自小学的真实视觉问题的能力。基准包括来自赞比亚和印度的701个问题,涵盖了类比推理和空间匹配等任务。研究发现,模型在计数和缩放等静态技能上表现良好,但在折叠和旋转等动态操作上却遇到困难,表明存在一个‘空间天花板’,限制了它们在教育环境中的有效性。
Beyond Rewards in Reinforcement Learning for Cyber Defence
Authors: Elizabeth Bates, Chris Hicks, Vasilios Mavroudis
First: 2026-02-04T17:55:23+00:00 · Latest: 2026-02-12T17:29:01+00:00
Abstract
Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
中文标题/摘要
标题:超越强化学习中网络防御中的奖励
近年来,使用深度强化学习训练自主网络防御代理以防御计算机网络的兴趣激增。这些代理通常在精心设计的密集奖励函数的网络健身房环境中进行训练,这些奖励函数结合了多种惩罚和激励措施,以应对各种(不)希望的状态和昂贵的操作。密集奖励有助于缓解探索复杂环境的挑战,但也可能使代理偏向次优甚至更危险的解决方案,这是复杂网络环境中一个关键问题。我们使用稀疏和密集奖励函数、两个成熟的网络健身房、不同规模的网络以及策略梯度和值基强化学习算法,全面评估了奖励函数结构对学习和策略行为特征的影响。我们的评估得益于一种新颖的基准评估方法,该方法允许直接比较不同奖励函数之间的差异,揭示了奖励、动作空间和网络环境中次优策略风险之间的复杂关系。我们的结果表明,只要它们与目标对齐并且可以频繁遇到,稀疏奖励可以提供增强的训练可靠性和具有较低风险的更有效的网络防御代理。令人惊讶的是,稀疏奖励还可以产生与网络防御者目标更好的对齐策略,并在没有显式基于奖励的数值惩罚的情况下节省使用昂贵的防御行动。
Summary / 总结
The paper investigates the impact of reward function structure on reinforcement learning for cyber defence, using both dense and sparse reward functions in various cyber gym environments. The study reveals that sparse rewards, when aligned with goals and frequently encountered, enhance training reliability and produce more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards also lead to policies that are better aligned with cyber defender goals and use costly defensive actions sparingly.
研究探讨了奖励函数结构对自主网络防御代理在强化学习中的学习和策略特性的影响。通过在各种网络规模和网络防御环境中比较密集和稀疏的奖励函数,研究发现,当稀疏奖励与目标对齐且频繁出现时,它们能提高训练的可靠性并生成更有效且风险更低的网络防御策略。令人惊讶的是,稀疏奖励还能导致与网络防御者目标更好的对齐,并在不需要明确的奖励数值惩罚的情况下减少昂贵防御行动的使用。
AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning
Authors: Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun
First: 2025-08-28T14:59:39+00:00 · Latest: 2026-02-12T17:24:56+00:00
Abstract
The increasing complexity of software systems and the sophistication of cyber-attacks have underscored the need for reliable automated software vulnerability detection. Data-driven approaches using deep learning models show promise but critically depend on the availability of large, accurately labeled datasets. Yet existing datasets either suffer from noisy labels, limited vulnerability coverage, or fail to reflect vulnerabilities as they occur in real-world software. This also limits large-scale benchmarking of such solutions. Automated vulnerability injection provides a way to address these limitations, but existing techniques remain limited in coverage, contextual fidelity, or injection success. In this paper, we present AVIATOR, the first AI-agentic vulnerability injection framework. AVIATOR decomposes vulnerability injection into a coordinated workflow of specialized AI agents, tool-based analysis, and iterative self-correction, explicitly mirroring expert reasoning. It integrates RAG and lightweight LoRA-based fine-tuning to produce realistic, category-specific vulnerabilities without relying on handcrafted patterns. Across three benchmarks, AVIATOR achieves high injection fidelity (91-95%) surpassing existing injection techniques in both accuracy and vulnerability coverage. When used for data augmentation to train deep learning-based vulnerability detection (DLVD) models, AVIATOR provides the strongest downstream gains in vulnerability detection. Across models and base datasets, AVIATOR improves average F1 scores by +22% over no augmentation, +25% over VGX, holding the prior best injection success rate, and +3% over VulScribeR, the prior state-of-the-art LLM-based injection model, with +7% higher recall and no precision loss. Its augmented data exhibits the lowest distributional distortion and scales efficiently with <2% syntax rejection at 4.3x lower cost than VulScribeR.
中文标题/摘要
标题:AI自主漏洞注入与转换及优化推理
软件系统的复杂性和网络攻击的复杂性突显了可靠自动化软件漏洞检测的必要性。基于数据的方法使用深度学习模型显示出潜力,但关键依赖于大规模、准确标注的数据集。然而,现有数据集要么标签不准确,要么漏洞覆盖范围有限,或者无法反映真实软件中出现的漏洞。这也限制了此类解决方案的大规模基准测试。自动漏洞注入提供了一种解决这些限制的方法,但现有技术在覆盖范围、上下文准确度或注入成功率方面仍有限制。在本文中,我们提出了AVIATOR,这是第一个AI自主漏洞注入框架。AVIATOR将漏洞注入分解为由专门AI代理、工具分析和迭代自我纠正协调组成的流程,明确地模仿专家推理。它结合了RAG和轻量级LoRA微调,以生成现实且类别特定的漏洞,而不依赖于手工设计的模式。在三个基准测试中,AVIATOR实现了高注入准确度(91-95%),在准确性和漏洞覆盖范围方面均超过了现有注入技术。当用于训练基于深度学习的漏洞检测(DLVD)模型的数据增强时,AVIATOR提供了最强的下游增益,在漏洞检测中表现出最强的效果。在所有模型和基础数据集上,AVIATOR将平均F1分数提高了22%(无增强),25%(VGX),保持了先前的最佳注入成功率,并且相对于VulScribeR(先前的最先进的基于LLM的注入模型)提高了3%,召回率提高了7%,且无精度损失。其增强数据的分布失真最低,并且在成本降低4.3倍的情况下,语法拒绝率低于2%。
Summary / 总结
This paper addresses the need for reliable automated software vulnerability detection by presenting AVIATOR, an AI-agentic vulnerability injection framework. AVIATOR uses a coordinated workflow of specialized AI agents, tool-based analysis, and iterative self-correction to inject realistic vulnerabilities, reaching 91-95% injection fidelity across three benchmarks and surpassing existing techniques in accuracy and coverage. When used for data augmentation, AVIATOR improves average F1 scores by 22% over no augmentation, 25% over VGX, and 3% over VulScribeR, with 7% higher recall and no precision loss. It also scales efficiently, with under 2% syntax rejection at 4.3x lower cost than VulScribeR.
论文介绍了AVIATOR,这是一种AI代理型漏洞注入框架,将漏洞注入分解为专门的AI代理、工具分析和迭代自我纠正。它使用RAG和轻量级LoRA微调来生成现实且类别特定的漏洞。AVIATOR实现了高注入保真度(91-95%),并显著提高了基于深度学习的漏洞检测模型的效果:平均F1分数比无数据增强高22%,比此前最先进的基于LLM的注入模型VulScribeR高3%,同时召回率提高7%且无精度损失。
WaveFormer: Wavelet Embedding Transformer for Biomedical Signals
Authors: Habib Irani, Bikram De, Vangelis Metsis
First: 2026-02-12T17:20:43+00:00 · Latest: 2026-02-12T17:20:43+00:00
Abstract
Biomedical signal classification presents unique challenges due to long sequences, complex temporal dynamics, and multi-scale frequency patterns that are poorly captured by standard transformer architectures. We propose WaveFormer, a transformer architecture that integrates wavelet decomposition at two critical stages: embedding construction, where multi-channel Discrete Wavelet Transform (DWT) extracts frequency features to create tokens containing both time-domain and frequency-domain information, and positional encoding, where Dynamic Wavelet Positional Encoding (DyWPE) adapts position embeddings to signal-specific temporal structure through mono-channel DWT analysis. We evaluate WaveFormer on eight diverse datasets spanning human activity recognition and brain signal analysis, with sequence lengths ranging from 50 to 3000 timesteps and channel counts from 1 to 144. Experimental results demonstrate that WaveFormer achieves competitive performance through comprehensive frequency-aware processing. Our approach provides a principled framework for incorporating frequency-domain knowledge into transformer-based time series classification.
中文标题/摘要
标题:WaveFormer:小波嵌入变换器在生物医学信号中的应用
生物医学信号分类由于长序列、复杂的时序动态和多尺度频率模式,标准的变换器架构难以充分捕捉。我们提出WaveFormer,这是一种在两个关键阶段整合小波分解的变换器架构:在嵌入构建阶段,多通道离散小波变换(DWT)提取频率特征,创建包含时域和频域信息的令牌;在位置编码阶段,动态小波位置编码(DyWPE)通过单通道DWT分析适应信号特定的时序结构。我们在八个涵盖人类活动识别和脑信号分析的多样数据集上评估了WaveFormer,序列长度从50到3000个时间步,通道数从1到144。实验结果表明,WaveFormer通过全面的频率感知处理实现了竞争力的性能。我们的方法为将频域知识整合到基于变换器的时间序列分类中提供了原则性的框架。
Summary / 总结
WaveFormer is a transformer architecture designed to address the challenges of biomedical signal classification by integrating wavelet decomposition at embedding construction and positional encoding stages. It uses Discrete Wavelet Transform to extract frequency features and Dynamic Wavelet Positional Encoding to adapt position embeddings to signal-specific temporal structures. The model was evaluated on eight diverse datasets, and the results showed that WaveFormer achieved competitive performance through comprehensive frequency-aware processing.
该研究针对生物医学信号分类中的长序列和复杂时序动态挑战,提出了WaveFormer,一种结合小波分解的变压器架构。它使用多通道离散小波变换进行嵌入构建,并使用动态小波位置编码进行自适应位置编码。该模型在八个数据集上进行了评估,展示了在人类活动识别和脑信号分析任务中的竞争力。
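The embedding idea, tokens that carry both time-domain samples and wavelet coefficients from every channel, can be sketched with a single-level Haar transform. This is an illustrative assumption rather than WaveFormer's actual design: the abstract does not state the wavelet family, decomposition depth, patching scheme, or token layout, so `haar_dwt`, `wavelet_tokens`, and `patch_len` are hypothetical names and choices.

```python
import math

def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even for one Haar level")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def wavelet_tokens(channels, patch_len):
    """Build one token per non-overlapping patch.

    channels: list of equal-length sample lists (one per sensor channel).
    Each token concatenates, for every channel, the raw patch (time domain)
    with its Haar approximation and detail coefficients (frequency domain).
    """
    n = len(channels[0])
    tokens = []
    for start in range(0, n - patch_len + 1, patch_len):
        token = []
        for channel in channels:
            patch = channel[start:start + patch_len]
            approx, detail = haar_dwt(patch)
            token.extend(patch)
            token.extend(approx)
            token.extend(detail)
        tokens.append(token)
    return tokens

# Two hypothetical channels of 8 samples each, split into patches of 4.
ramp = [float(i) for i in range(8)]
flat = [0.0] * 8
tokens = wavelet_tokens([ramp, flat], patch_len=4)
print(len(tokens), len(tokens[0]))
```

In a real model each token would then pass through a learned linear projection to the transformer width; a library such as PyWavelets would normally replace the hand-written transform.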
OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data
Authors: Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Robert Jakob, Ning Wang, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel McDuff, Elgar Fleisch, Oliver Aalami, Filipe Barata, Paul Schmiedmayer
First: 2025-10-02T09:58:23+00:00 · Latest: 2026-02-12T17:19:15+00:00
Abstract
LLMs have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet, a major limitation remains their inability to handle time series. To overcome this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality to pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all, OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms on longer sequences, while maintaining stable memory requirements. By contrast, SoftPrompt grows exponentially in memory with sequence length, requiring around 110 GB compared to 40 GB VRAM when training on ECG-QA with LLaMA-3B. Expert reviews by clinicians find strong reasoning capabilities exhibited by OpenTSLMs on ECG-QA. To facilitate further research, we provide all code, datasets, and models open-source.
中文标题/摘要
标题:OpenTSLM:用于处理多变量医疗文本和时间序列数据的语言模型
大规模语言模型(LLMs)已成为解释多模态数据的强大工具。在医学领域,它们特别有潜力将大量临床信息综合成可操作的见解和数字健康应用。然而,一个主要限制是它们无法处理时间序列数据。为克服这一差距,我们提出了OpenTSLM,这是一个通过将时间序列作为原生模态集成到预训练LLMs中而构建的时间序列语言模型(TSLMs)家族,从而能够对任意长度的多条时间序列进行推理。我们研究了OpenTSLM的两种架构。第一种,OpenTSLM-SoftPrompt,通过软提示将可学习的时间序列标记与文本标记拼接起来,隐式建模时间序列。尽管该方法参数效率较高,但我们假设显式的时间序列建模更具扩展性并优于隐式方法。因此,我们引入了OpenTSLM-Flamingo,它通过交叉注意力将时间序列与文本集成。我们在一系列文本-时间序列思维链(CoT)推理任务上,将两种变体与把时间序列视为文本标记或图表的基线进行了基准测试。我们引入了三个数据集:HAR-CoT、Sleep-CoT和ECG-QA-CoT。在所有任务中,OpenTSLM模型均优于基线,睡眠分期任务的F1值达到69.9,HAR任务的F1值达到65.4,而微调的纯文本模型分别为9.05和52.2。值得注意的是,即使参数量为10亿的OpenTSLM模型也超过了GPT-4o(15.47和2.95)。OpenTSLM-Flamingo在性能上与OpenTSLM-SoftPrompt相当,并在较长序列上表现更优,同时保持稳定的内存需求。相比之下,SoftPrompt的内存需求随序列长度呈指数增长:在使用LLaMA-3B在ECG-QA上训练时,SoftPrompt需要约110 GB VRAM,而OpenTSLM-Flamingo约需40 GB。临床专家评审发现OpenTSLMs在ECG-QA上展示了强大的推理能力。为了促进进一步研究,我们开源了所有代码、数据集和模型。
Summary / 总结
The research aims to enable large language models (LLMs) to reason over time-series data, particularly in medical applications. The study introduces OpenTSLM, a family of Time Series Language Models (TSLMs) that integrate time series as a native modality into pretrained LLMs. Two architectures are explored: OpenTSLM-SoftPrompt, which models time series implicitly by concatenating learnable time-series tokens with text tokens, and OpenTSLM-Flamingo, which models them explicitly via cross-attention. Across three new datasets (HAR-CoT, Sleep-CoT, and ECG-QA-CoT), OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 in HAR versus 9.05 and 52.2 for finetuned text-only models. OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms it on longer sequences while maintaining stable memory requirements.
研究旨在提升语言模型(LMs)处理医学应用中时间序列数据的能力。OpenTSLM 是一个将时间序列作为原生模态集成到预训练语言模型中的时间序列语言模型家族。研究了两种架构:OpenTSLM-SoftPrompt 和 OpenTSLM-Flamingo。OpenTSLM-Flamingo 使用交叉注意力机制,性能与 OpenTSLM-SoftPrompt 相当,并在更长的序列上表现更优。在三个数据集上的实验表明,OpenTSLM 模型优于基线模型,睡眠分期和人体活动识别(HAR)的 F1 分数分别为 69.9 和 65.4,而微调的纯文本模型分别为 9.05 和 52.2。临床专家评审发现 OpenTSLM 在 ECG-QA 上展现了强大的推理能力。
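The difference between the two integration styles described above can be illustrated by counting what the LLM has to attend over. The sketch below is only a token-counting illustration under stated assumptions: the patch size, the fixed number of latents per series, and both function names are hypothetical and are not taken from the OpenTSLM code. It does not model actual memory use.

```python
import math


def soft_prompt_llm_tokens(text_tokens: int, series_lengths: list[int], patch_size: int) -> int:
    """Tokens in the LLM's self-attention when time-series tokens are concatenated with text.

    Assumption: each series is split into patches of `patch_size` samples,
    and every patch becomes one soft-prompt token.
    """
    series_tokens = sum(math.ceil(n / patch_size) for n in series_lengths)
    return text_tokens + series_tokens


def cross_attention_llm_tokens(
    text_tokens: int, series_lengths: list[int], latents_per_series: int
) -> tuple[int, int]:
    """Return (tokens in LLM self-attention, key/value slots seen by cross-attention).

    Assumption: each series is summarized into a fixed number of latents,
    so the text sequence length does not depend on the series length.
    """
    return text_tokens, latents_per_series * len(series_lengths)


# Two series of 1000 samples each, 100 text tokens.
print(soft_prompt_llm_tokens(100, [1000, 1000], patch_size=4))
print(cross_attention_llm_tokens(100, [1000, 1000], latents_per_series=64))
```

Under these assumptions, the concatenated sequence grows with the series length while the cross-attention variant keeps the text sequence fixed, which is consistent with the abstract's remark about stable memory requirements on longer sequences.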
SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization
Authors: Sunghwan Kim, Wooseok Jeong, Serin Kim, Sangam Lee, Dongha Lee
First: 2026-02-12T17:18:00+00:00 · Latest: 2026-02-12T17:18:00+00:00
Comments: Work in Progress
Abstract
Search-Augmented Generative Engines (SAGE) have emerged as a new paradigm for information access, bridging web-scale retrieval with generative capabilities to deliver synthesized answers. This shift has fundamentally reshaped how web content gains exposure online, giving rise to Search-Augmented Generative Engine Optimization (SAGEO), the practice of optimizing web documents to improve their visibility in AI-generated responses. Despite growing interest, no evaluation environment currently supports comprehensive investigation of SAGEO. Specifically, existing benchmarks lack end-to-end visibility evaluation of optimization strategies, operating on pre-determined candidate documents that abstract away retrieval and reranking preceding generation. Moreover, existing benchmarks discard structural information (e.g., schema markup) present in real web documents, overlooking the rich signals that search systems actively leverage in practice. Motivated by these gaps, we introduce SAGEO Arena, a realistic and reproducible environment for stage-level SAGEO analysis. Our objective is to jointly target search-oriented optimization (SEO) and generation-centric optimization (GEO). To achieve this, we integrate a full generative search pipeline over a large-scale corpus of web documents with rich structural information. Our findings reveal that existing approaches remain largely impractical under realistic conditions and often degrade performance in retrieval and reranking. We also find that structural information helps mitigate these limitations, and that effective SAGEO requires tailoring optimization to each pipeline stage. Overall, our benchmark paves the way for realistic SAGEO evaluation and optimization beyond simplified settings.
中文标题/摘要
标题:SAGEO竞技场:评估搜索增强生成引擎优化的现实环境
搜索增强生成引擎(SAGE)作为一种新的信息访问范式,将网络规模检索与生成能力相结合,提供合成答案。这一转变从根本上重塑了网络内容在线曝光的方式,催生了搜索增强生成引擎优化(SAGEO),即优化网络文档以提高其在AI生成响应中的可见性。尽管兴趣日益浓厚,但目前尚无评估环境支持对SAGEO的全面研究。具体而言,现有基准缺乏对优化策略的端到端可见性评估,仅在预先确定的候选文档上操作,从而抽象掉了生成之前的检索和重排序环节。此外,现有基准丢弃了真实网络文档中存在的结构信息(例如,模式标记),忽视了搜索系统在实践中积极利用的丰富信号。鉴于这些差距,我们引入了SAGEO竞技场,这是一种现实且可复现的环境,用于阶段级SAGEO分析。我们的目标是同时针对搜索导向优化(SEO)和生成导向优化(GEO)。为此,我们在一个具有丰富结构信息的大规模网络文档语料库上整合了完整的生成式搜索管道。我们的研究发现,现有方法在现实条件下基本上仍不实用,并且往往会降低检索和重排序阶段的性能。我们还发现,结构信息有助于缓解这些限制,而有效的SAGEO需要针对每个管道阶段定制优化。总体而言,我们的基准为超越简化设置的现实SAGEO评估和优化铺平了道路。
Summary / 总结
The paper introduces SAGEO Arena, a realistic environment for evaluating Search-Augmented Generative Engine Optimization (SAGEO). Motivated by the lack of comprehensive evaluation tools, the authors aim to address gaps in existing benchmarks by integrating a full generative search pipeline with rich structural information from web documents. Key findings include the impracticality of existing approaches under realistic conditions and the importance of tailoring optimization to each pipeline stage, with structural information playing a crucial role in mitigating performance degradation in retrieval and reranking.
论文介绍了SAGEO Arena,这是一个用于评估搜索增强生成引擎优化(SAGEO)的现实环境。作者旨在通过整合包含丰富结构信息的大型网页文档的完整生成搜索管道来弥补现有基准的不足。主要发现包括现有方法在现实条件下的实用性较差,并且优化需要针对每个管道阶段进行定制,结构信息在缓解检索和重排序性能下降方面起着关键作用。
Convex Markov Games and Beyond: New Proof of Existence, Characterization and Learning Algorithms for Nash Equilibria
Authors: Anas Barakat, Ioannis Panageas, Antonios Varvitsiotis
First: 2026-02-12T17:11:20+00:00 · Latest: 2026-02-12T17:11:20+00:00
Comments: AISTATS 2026
Abstract
Convex Markov Games (cMGs) were recently introduced as a broad class of multi-agent learning problems that generalize Markov games to settings where strategic agents optimize general utilities beyond additive rewards. While cMGs expand the modeling frontier, their theoretical foundations, particularly the structure of Nash equilibria (NE) and guarantees for learning algorithms, are not yet well understood. In this work, we address these gaps for an extension of cMGs, which we term General Utility Markov Games (GUMGs), capturing new applications requiring coupling between agents' occupancy measures. We prove that in GUMGs, Nash equilibria coincide with the fixed points of projected pseudo-gradient dynamics (i.e., first-order stationary points), enabled by a novel agent-wise gradient domination property. This insight also yields a simple proof of NE existence using Brouwer's fixed-point theorem. We further show the existence of Markov perfect equilibria. Building on this characterization, we establish a policy gradient theorem for GUMGs and design a model-free policy gradient algorithm. For potential GUMGs, we establish iteration complexity guarantees for computing approximate-NE under exact gradients and provide sample complexity bounds in both the generative model and on-policy settings. Our results extend beyond prior work restricted to zero-sum cMGs, providing the first theoretical analysis of common-interest cMGs.
中文标题/摘要
标题:凸马尔可夫博弈及其扩展:新的纳什均衡存在性、特征化和学习算法证明
凸马尔可夫博弈(cMGs)最近被引入为一类广泛的多智能体学习问题,它将马尔可夫博弈推广到策略性智能体优化超越加性奖励的一般效用的设置中。虽然cMGs扩展了建模的边界,但它们的理论基础,特别是纳什均衡(Nash equilibrium, NE)的结构和学习算法的保证,尚未得到充分理解。在本文中,我们针对cMGs的一个扩展来弥补这些不足,并将其称为广义效用马尔可夫博弈(GUMGs),以刻画需要智能体占用度量之间耦合的新应用。我们证明在GUMGs中,纳什均衡与投影伪梯度动力学的不动点(即一阶驻点)重合,这得益于一种新的逐智能体梯度支配性质。这一见解还通过布劳威尔不动点定理给出了NE存在性的简单证明。我们进一步证明了马尔可夫完美均衡的存在性。基于这一特征化,我们为GUMGs建立了策略梯度定理,并设计了一种无模型策略梯度算法。对于势GUMGs(potential GUMGs),我们建立了在精确梯度下计算近似NE的迭代复杂度保证,并给出了生成模型和同策略(on-policy)设置下的样本复杂度界。我们的结果超越了仅限于零和cMGs的先前工作,提供了对共同利益cMGs的首个理论分析。
Summary / 总结
This paper addresses the theoretical foundations of General Utility Markov Games (GUMGs), an extension of Convex Markov Games (cMGs) that captures coupling between agents' occupancy measures. The authors prove that Nash equilibria (NE) in GUMGs coincide with the fixed points of projected pseudo-gradient dynamics, leveraging a novel agent-wise gradient domination property. This characterization yields a simple proof of NE existence via Brouwer's fixed-point theorem; the authors also show that Markov perfect equilibria exist and establish a policy gradient theorem, leading to a model-free policy gradient algorithm. For potential GUMGs, the study gives iteration complexity guarantees for computing approximate NE under exact gradients and sample complexity bounds in the generative-model and on-policy settings, extending prior work restricted to zero-sum cMGs to common-interest cMGs.
该论文探讨了广义效用马尔可夫博弈(GUMGs)的理论基础,这是凸马尔可夫博弈(cMGs)的一个扩展,可刻画智能体占用度量之间的耦合。论文证明了GUMGs中的纳什均衡与投影伪梯度动力学的不动点重合,利用了一种新的逐智能体梯度支配性质。该工作还使用布劳威尔不动点定理给出了纳什均衡存在性的简单证明,并建立了GUMGs的策略梯度定理,从而设计了一个无模型策略梯度算法。对于势GUMGs,研究给出了在精确梯度下计算近似纳什均衡的迭代复杂度保证,以及在生成模型和同策略(on-policy)设置下的样本复杂度界,超越了之前仅限于零和cMGs的分析。
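To make the "fixed points of projected pseudo-gradient dynamics" characterization concrete, here is a minimal sketch on a much simpler stand-in: a single-state, two-player common-interest matrix game, not a GUMG with occupancy measures. The payoff matrix, step size, and function names are illustrative assumptions; the paper's own algorithm is not reproduced here. Each player takes a gradient step on its own mixed strategy and projects back onto the probability simplex.

```python
def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold of the largest valid index
    return [max(x - theta, 0.0) for x in v]


def pseudo_gradient_step(x, y, payoff, eta):
    """One simultaneous projected gradient-ascent step for both players.

    Both players receive x^T A y (common interest), so player 1's gradient
    is A y and player 2's gradient is A^T x.
    """
    grad_x = [sum(payoff[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    grad_y = [sum(payoff[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    new_x = project_simplex([xi + eta * g for xi, g in zip(x, grad_x)])
    new_y = project_simplex([yj + eta * g for yj, g in zip(y, grad_y)])
    return new_x, new_y


# Hypothetical coordination game: both players prefer to match on action 0.
A = [[2.0, 0.0], [0.0, 1.0]]
x, y = [0.6, 0.4], [0.6, 0.4]
for _ in range(200):
    x, y = pseudo_gradient_step(x, y, A, eta=0.1)
print(x, y)
```

In this toy game the iterates settle at the pure equilibrium where both players choose action 0, and applying one more step at that point leaves it unchanged, which is the fixed-point property in miniature.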
How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics
Authors: Yurong Chen, Yu He, Michael I. Jordan, Fan Yao
First: 2026-02-12T17:11:08+00:00 · Latest: 2026-02-12T17:11:08+00:00
Abstract
Standard methods for aligning large language models with human preferences learn from pairwise comparisons among sampled candidate responses and regularize toward a reference policy. Despite their effectiveness, the effects of sampling and reference choices are poorly understood theoretically. We investigate these effects through Identity Preference Optimization, a widely used preference alignment framework, and show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies, reflecting a common practice of model-generated preference data. We prove that these dynamics can exhibit persistent oscillations or entropy collapse for certain parameter choices, and characterize regimes that guarantee stability. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods. Experiments on real-world preference data validate our findings.
中文标题/摘要
标题:采样如何塑造大语言模型的对齐:从单次优化到迭代动力学
标准的大语言模型与人类偏好对齐方法通过对采样得到的候选响应进行成对比较来学习,并向参考策略正则化。尽管这些方法非常有效,但采样和参考选择的影响在理论上仍缺乏理解。我们通过广泛使用的偏好对齐框架Identity Preference Optimization来研究这些影响,并表明适当的实例依赖采样可以提供更强的排序保证,而偏斜的同策略(on-policy)采样在结构化偏好下会导致过度集中。然后我们分析了学习到的策略反馈到未来采样和参考策略中的迭代对齐动力学,这反映了使用模型生成偏好数据的常见做法。我们证明了在某些参数选择下,这些动力学可以表现出持久的振荡或熵崩溃,并刻画了保证稳定性的区域。我们的理论见解扩展到直接偏好优化(Direct Preference Optimization),表明我们捕捉到的现象适用于更广泛的偏好对齐方法类别。在真实偏好数据上的实验验证了我们的发现。
Summary / 总结
The paper investigates how sampling affects the alignment of large language models with human preferences. It uses Identity Preference Optimization to show that proper instance-dependent sampling can improve ranking guarantees, while skewed on-policy sampling can lead to excessive concentration. The study also analyzes iterative alignment dynamics and proves that these can result in persistent oscillations or entropy collapse under certain conditions, with stability guaranteed in specific regimes. Experiments confirm these theoretical findings.
研究旨在理解采样和参考选择如何影响大型语言模型与人类偏好的对齐。通过使用Identity Preference Optimization,研究显示适当的实例依赖采样可以提高排序保证,而偏斜的同策略(on-policy)采样可能导致过度集中。研究还分析了迭代对齐动态,证明在某些条件下这些动态可以表现出持续振荡或熵崩溃,并界定了保证稳定性的条件。真实世界数据上的实验支持了这些理论发现。
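For readers unfamiliar with the two objectives named above, the following sketch writes out the per-pair losses in their commonly cited forms from the IPO and DPO literature; it is not the paper's analysis. The function names and the example log-probabilities are illustrative assumptions. Both losses depend on the policy only through the margin h, the log-ratio of preferred over dispreferred response relative to the reference policy.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """h = log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l)) for one preference pair."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def ipo_loss(h, tau):
    """IPO per-pair loss: squared distance of the margin from the target 1/(2*tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


def dpo_loss(h, beta):
    """DPO per-pair loss: -log(sigmoid(beta * h)), computed in a numerically stable way."""
    z = beta * h
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities for a preferred (w) and dispreferred (l) response.
h = log_ratio_margin(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(h, ipo_loss(h, tau=0.25), dpo_loss(h, beta=0.1))
```

The IPO loss is minimized at a finite margin, whereas the DPO loss keeps decreasing as the margin grows; which pairs are sampled determines which margins the optimizer actually sees.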
EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Authors: Nils Lehmann, Yi Wang, Zhitong Xiong, Xiaoxiang Zhu
First: 2026-02-12T17:09:14+00:00 · Latest: 2026-02-12T17:09:14+00:00
Abstract
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
中文标题/摘要
标题:EO-VAE:向地球观测数据多传感器分词器迈进
当前最先进的生成图像和视频模型高度依赖于将高维输入压缩为更高效的潜在表示的分词器。虽然这种范式已经彻底改变了RGB生成,但地球观测(EO)数据由于多样化的传感器规格和可变的光谱通道而提出了独特的挑战。我们提出了EO-VAE,这是一种多传感器变分自编码器,旨在作为EO领域的基础分词器。与先前为每种模态分别训练分词器的方法不同,EO-VAE 使用单一模型通过动态超网络编码和重构灵活的通道组合。我们在TerraMesh数据集上的实验表明,EO-VAE 在重构保真度方面优于TerraMind分词器,为遥感领域的潜在生成建模建立了稳健的基线。
Summary / 总结
The research aims to address the unique challenges of Earth observation (EO) data by proposing EO-VAE, a multi-sensor variational autoencoder intended as a foundational tokenizer for the EO domain. Unlike previous methods that train separate tokenizers for each modality, EO-VAE uses a single model with dynamic hypernetworks to encode and reconstruct flexible channel combinations. Experiments on the TerraMesh dataset show that EO-VAE outperforms the TerraMind tokenizers in reconstruction fidelity, setting a strong baseline for latent generative modeling in remote sensing.
研究旨在通过提出EO-VAE,一种多传感器变分自编码器,解决地球观测(EO)数据的独特挑战,作为EO数据的分词器。不同于之前的方法分别训练不同模态的分词器,EO-VAE 使用单个模型和动态超网络来编码和重建各种通道组合。实验表明,EO-VAE 在 TerraMesh 数据集上的重构保真度优于 TerraMind 分词器,为遥感中的潜在生成建模奠定了坚实的基础。
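The abstract does not describe how the dynamic hypernetworks are built, so the following is only a generic sketch of the idea of generating per-channel weights from a channel descriptor, so that one model accepts any channel combination. Everything here is an assumption made for illustration: the use of the central wavelength as the descriptor, the tiny random MLP, and the names `make_hypernet` and `embed_pixel`. It is not the EO-VAE architecture.

```python
import math
import random


def make_hypernet(embed_dim, hidden=8, seed=0):
    """Build a tiny fixed-weight MLP mapping a wavelength (micrometres) to embed_dim weights."""
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]
          for _ in range(embed_dim)]

    def hypernet(wavelength_um):
        hidden_act = [math.tanh(w * wavelength_um + b) for w, b in zip(w1, b1)]
        return [sum(row[k] * hidden_act[k] for k in range(hidden)) for row in w2]

    return hypernet


def embed_pixel(pixel, wavelengths_um, hypernet):
    """Project a pixel with any number of channels to a fixed-size embedding.

    The projection weights for each channel are generated on the fly from
    that channel's wavelength, so no weight is tied to a channel index.
    """
    if not pixel or len(pixel) != len(wavelengths_um):
        raise ValueError("pixel and wavelengths_um must be non-empty and equally long")
    per_channel = [hypernet(wl) for wl in wavelengths_um]
    embed_dim = len(per_channel[0])
    return [
        sum(value * weights[d] for value, weights in zip(pixel, per_channel)) / len(pixel)
        for d in range(embed_dim)
    ]


hypernet = make_hypernet(embed_dim=4)
rgb = embed_pixel([0.1, 0.2, 0.3], [0.665, 0.560, 0.490], hypernet)
five_band = embed_pixel([0.1, 0.2, 0.3, 0.4, 0.5], [0.49, 0.56, 0.665, 0.842, 1.61], hypernet)
print(len(rgb), len(five_band))
```

Both inputs land in the same 4-dimensional space despite having different channel counts, which is the property a shared multi-sensor tokenizer needs from its input layer.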
Deep learning Based Correction Algorithms for 3D Medical Reconstruction in Computed Tomography and Macroscopic Imaging
Authors: Tomasz Les, Tomasz Markiewicz, Malgorzata Lorent, Miroslaw Dziekiewicz, Krzysztof Siwek
First: 2026-01-30T17:16:17+00:00 · Latest: 2026-02-12T17:06:54+00:00
Comments: 23 pages, 9 figures, submitted to Applied Sciences (MDPI)
Abstract
This paper introduces a hybrid two-stage registration framework for reconstructing three-dimensional (3D) kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference standard. The approach addresses the data-scarcity and high-distortion challenges typical of macroscopic imaging, where fully learning-based registration (e.g., VoxelMorph) often fails to generalize due to limited training diversity and large nonrigid deformations that exceed the capture range of unconstrained convolutional filters. In the proposed pipeline, the Optimal Cross-section Matching (OCM) algorithm first performs constrained global alignment: translation, rotation, and uniform scaling to establish anatomically consistent slice initialization. Next, a lightweight deep-learning refinement network, inspired by VoxelMorph, predicts residual local deformations between consecutive slices. The core novelty of this architecture lies in its hierarchical decomposition of the registration manifold. This hybrid OCM+DL design integrates explicit geometric priors with the flexible learning capacity of neural networks, ensuring stable optimization and plausible deformation fields even with few training examples. Experiments on an original dataset of 40 kidneys demonstrated better results compared to single-stage baselines. The pipeline maintains physical calibration via Hough-based grid detection and employs Bezier-based contour smoothing for robust meshing and volume estimation. Although validated on kidney data, the proposed framework generalizes to other soft-tissue organs reconstructed from optical or photographic cross-sections. By decoupling interpretable global optimization from data-efficient deep refinement, the method advances the precision, reproducibility, and anatomical realism of multimodal 3D reconstructions for surgical planning, morphological assessment, and medical education.
中文标题/摘要
标题:基于深度学习的3D医学重建矫正算法在计算机断层扫描和宏观成像中的应用
本文介绍了一种用于从宏观切片重建三维(3D)肾脏解剖结构的混合两阶段配准框架,使用CT衍生模型作为几何参考标准。该方法解决了宏观成像中常见的数据稀缺和高失真挑战,在这类场景中,完全基于学习的配准(例如VoxelMorph)由于训练多样性有限以及大幅非刚性变形超出无约束卷积滤波器的捕捉范围,往往难以泛化。在所提出的流水线中,Optimal Cross-section Matching (OCM) 算法首先执行受约束的全局对齐:平移、旋转和均匀缩放,以建立解剖上一致的切片初始化。接下来,一个受VoxelMorph启发的轻量级深度学习精修网络预测连续切片之间的残余局部变形。该架构的核心新颖之处在于其对配准流形的分层分解。这种混合OCM+DL设计结合了显式的几何先验与神经网络的灵活学习能力,即使在训练样本很少的情况下也能确保稳定的优化和合理的变形场。在包含40个肾脏的原创数据集上进行的实验表明,其结果优于单阶段基线。该流水线通过基于Hough的网格检测保持物理校准,并使用基于Bezier的轮廓平滑实现稳健的网格化和体积估计。尽管该框架在肾脏数据上进行了验证,但它可以推广到从光学或摄影横截面重建的其他软组织器官。通过将可解释的全局优化与数据高效的深度精修解耦,该方法提升了多模态3D重建的精度、可重复性和解剖真实性,可用于手术规划、形态评估和医学教育。
Summary / 总结
This paper presents a hybrid two-stage registration framework for reconstructing 3D kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference. The approach combines an Optimal Cross-section Matching (OCM) algorithm for constrained global alignment (translation, rotation, and uniform scaling) with a lightweight deep-learning refinement network that predicts residual local deformations between consecutive slices. In experiments on an original dataset of 40 kidneys, the method outperformed single-stage baselines. The pipeline preserves physical calibration through Hough-based grid detection and uses Bezier-based contour smoothing for meshing and volume estimation.
本文提出了一种结合Optimal Cross-section Matching (OCM)算法和轻量级深度学习精修网络的两阶段配准框架,用于从宏观切片重建3D肾脏解剖结构,以CT衍生模型作为几何参考标准。实验结果显示,该方法在40个肾脏样本上表现优于单阶段基线,提高了精度和解剖真实性。
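The first stage restricts the alignment to translation, rotation, and uniform scaling, i.e. a similarity transform. The sketch below shows a generic closed-form least-squares fit of such a transform between two 2D point sets with known correspondences, using complex arithmetic. It is not the OCM algorithm, whose details the abstract does not give; the function names and the sample contour are illustrative assumptions.

```python
import cmath


def fit_similarity(src, dst):
    """Least-squares fit of z -> a*z + t (a complex) mapping src points onto dst points.

    Returns (scale, rotation angle in radians, (tx, ty)). Reflections are not modelled.
    """
    n = len(src)
    p = [complex(px, py) for px, py in src]
    q = [complex(qx, qy) for qx, qy in dst]
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    numerator = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    denominator = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = numerator / denominator  # a = scale * exp(i * angle)
    t = q_mean - a * p_mean
    return abs(a), cmath.phase(a), (t.real, t.imag)


def apply_similarity(points, scale, angle, shift):
    """Apply uniform scaling, rotation, and translation to a list of 2D points."""
    a = cmath.rect(scale, angle)
    t = complex(shift[0], shift[1])
    transformed = []
    for px, py in points:
        z = a * complex(px, py) + t
        transformed.append((z.real, z.imag))
    return transformed


# Hypothetical slice contour and a transformed copy of it.
contour = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
moved = apply_similarity(contour, scale=1.5, angle=0.3, shift=(4.0, -2.0))
scale, angle, shift = fit_similarity(contour, moved)
print(scale, angle, shift)
```

With only four degrees of freedom, a fit of this kind cannot produce implausible local warps, which is why a constrained global stage is a reasonable initialization before a learned network predicts the residual nonrigid deformation.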