arXiv Paper Digest

2026-03-18 04:05
Snapshot: 20260318_0405
Towards Generalizable Robotic Manipulation in Dynamic Environments
Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
First: 2026-03-16T17:59:57+00:00 · Latest: 2026-03-16T17:59:57+00:00
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
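Title / Abstract (translated from Chinese)
Title: Towards Generalizable Robotic Manipulation in Dynamic Environments
Vision-Language-Action (VLA) models perform well in static manipulation but struggle in dynamic environments with moving targets. This performance gap stems mainly from the lack of dynamic manipulation datasets and from mainstream VLAs' reliance on single-frame observations, which limits their spatiotemporal reasoning. To address this, we introduce DOMINO, a large-scale dataset and benchmark for dynamic manipulation comprising 35 tasks of hierarchical complexity, more than 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. We also propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow with specialized world queries, PUMA implicitly forecasts object-centric future states, coupling history-aware perception with short-horizon prediction. Results show that PUMA achieves state-of-the-art performance, with a 6.3% absolute improvement in success rate over baselines. We further show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO/.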
Summary
The research aims to improve robotic manipulation in dynamic environments by addressing the limitations of existing Vision-Language-Action (VLA) models. To this end, the authors introduce DOMINO, a large-scale dataset and benchmark for dynamic manipulation, which includes 35 hierarchical tasks, over 110K expert trajectories, and a multi-dimensional evaluation suite. The study evaluates existing VLA models on dynamic tasks, proposes PUMA, a dynamics-aware VLA architecture, and demonstrates a 6.3% improvement in success rate. Additionally, the research shows that training on dynamic data enhances spatiotemporal representations that transfer to static tasks.
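The study aims to improve robotic manipulation in dynamic environments by addressing the limitations of existing Vision-Language-Action (VLA) models. To this end, the authors introduce DOMINO, a large-scale dataset and benchmark for dynamic manipulation with 35 hierarchical tasks, over 110K expert trajectories, and a multi-dimensional evaluation suite. The study evaluates existing VLA models on dynamic tasks, proposes PUMA, a VLA architecture that couples dynamics-aware perception with short-horizon prediction, and reports a 6.3% gain in success rate. It also shows that training on dynamic data strengthens spatiotemporal representations that transfer to static tasks.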
Mixture-of-Depths Attention
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
First: 2026-03-16T17:59:55+00:00 · Latest: 2026-03-16T17:59:55+00:00
Comments: Code is released at https://github.com/hustvl/MoDA
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
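Title / Abstract (translated from Chinese)
Title: Mixture-of-Depths Attention
Scaling depth is a key driver for large language models (LLMs). However, as LLMs become deeper, they tend to suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend both to sequence KV pairs at the current layer and to depth KV pairs from preceding layers. We also describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns and reaches 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models show that MoDA consistently outperforms strong baselines. Notably, it lowers average perplexity by 0.2 across 10 validation benchmarks and raises average performance by 2.11% on 10 downstream tasks, at a computational overhead of only 3.7% FLOPs. We also find that combining MoDA with post-norm works better than combining it with pre-norm. These results suggest that MoDA is a promising building block for depth scaling. Code is released at https://github.com/hustvl/MoDA.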
Summary
The research addresses the issue of signal degradation in deep language models, proposing mixture-of-depths attention (MoDA) to allow attention heads to access information from both current and preceding layers. The method achieves high efficiency in hardware and outperforms strong baselines, reducing average perplexity by 0.2 and improving performance on downstream tasks by 2.11% with minimal computational overhead.
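The paper proposes mixture-of-depths attention (MoDA) to address signal degradation in deep language models. MoDA lets attention heads access information from both the current layer and preceding layers, improving model performance. Experiments show that MoDA lowers average perplexity by 0.2 and improves downstream-task performance by 2.11%, with nearly negligible computational overhead. Combining MoDA with post-norm improves performance further.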
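The mechanism described in the MoDA abstract can be illustrated with a minimal sketch: a single attention head scores its query against the sequence KV pairs of the current layer and against depth KV pairs taken from preceding layers, under one softmax. The function names, the list-based data layout, and the choice to simply concatenate the two KV sets are assumptions made for illustration; causal masking and the paper's hardware-efficient algorithm are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def moda_head(query, seq_kv, depth_kv):
    """Attend jointly over two sources of (key, value) pairs.

    seq_kv:   pairs from the current layer's sequence positions.
    depth_kv: pairs cached from preceding layers (hypothetical layout).
    Both sources share a single softmax, so shallow-layer features
    compete directly with current-layer features for attention mass.
    """
    pairs = list(seq_kv) + list(depth_kv)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key, _ in pairs])
    dim = len(pairs[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, pairs))
        for i in range(dim)
    ]
```

With one sequence pair and one depth pair that have identical keys, the head splits its attention evenly, so the output is the average of the two values.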
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
First: 2026-03-16T17:59:54+00:00 · Latest: 2026-03-16T17:59:54+00:00
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
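Title / Abstract (translated from Chinese)
Title: Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Vision-Language-Action (VLA) models have recently become a promising paradigm for robotic manipulation, in which reliable action prediction depends heavily on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent work has sought to strengthen the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box and offer limited insight into how visual information is grounded in action generation. We therefore systematically analyze multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens gradually decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. The framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which uses shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing key visual cues with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, offering new insights for the design of visually enhanced VLA models.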
Summary
The research aims to improve the visual capabilities of Vision-Language-Action (VLA) models for robotic manipulation by addressing the issue of visual information being less effective in deeper layers. The study proposes DeepVision-VLA, which uses a Vision-Language Mixture-of-Transformers (VL-MoT) framework to enable shared attention between the vision foundation model and the VLA backbone. This approach injects multi-level visual features into deeper layers of the VLA backbone to enhance visual representations. Additionally, Action-Guided Visual Pruning (AGVP) is introduced to prune irrelevant visual tokens while preserving task-relevant ones. Experimental results show that DeepVision-VLA outperforms previous state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively.
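The study aims to improve the visual capabilities of Vision-Language-Action (VLA) models for robotic manipulation by addressing the declining sensitivity to visual information in deeper layers. The authors propose DeepVision-VLA, built on the Vision-Language Mixture-of-Transformers framework, which enables shared attention between the vision foundation model and the VLA backbone, and introduce Action-Guided Visual Pruning to remove irrelevant visual tokens while keeping task-relevant ones. The model outperforms the previous state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively.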
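The pruning step that the DeepVision-VLA abstract attributes to AGVP can be sketched as a top-k selection over per-token attention scores. This is a minimal sketch under stated assumptions: the function name, the `keep_ratio` parameter, and the idea that one scalar score per visual token is already available from a shallow layer are all hypothetical, and the paper's actual selection rule may differ.

```python
def prune_visual_tokens(tokens, attention_scores, keep_ratio=0.5):
    """Keep the visual tokens with the highest attention scores.

    tokens:           list of visual tokens (any objects).
    attention_scores: one float per token, assumed to come from a
                      shallow-layer attention map (hypothetical input).
    keep_ratio:       fraction of tokens to keep (hypothetical knob).
    The surviving tokens keep their original spatial order.
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("need exactly one score per token")
    keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    kept_indices = sorted(ranked[:keep])  # restore original order
    return [tokens[i] for i in kept_indices]
```

Restoring the original order after selection matters when positional information is carried by token order rather than by explicit position embeddings.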
HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Authors: Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate
First: 2026-03-16T17:59:53+00:00 · Latest: 2026-03-16T17:59:53+00:00
Abstract
Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.
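Title / Abstract (translated from Chinese)
Title: HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Can AI make progress on important unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can carry out novel research remains widely debated and underexplored. We introduce HorizonMath, a benchmark of more than 100 predominantly unsolved problems spanning 8 domains of computational and applied mathematics, paired with an open-source evaluation framework for automated verification. The benchmark targets a class of problems where discovery is hard and requires meaningful mathematical insight, but verification is computationally efficient and simple. Because the solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are hard to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known existing results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, in which correct solutions to problems in the unsolved classes could constitute new results in the mathematical literature.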
Summary
The research aims to assess AI's capability in making progress on unsolved mathematical problems. HorizonMath, a benchmark of over 100 predominantly unsolved problems in computational and applied mathematics, is introduced, along with an open-source evaluation framework for automated verification. The study finds that GPT 5.4 Pro proposes solutions for two problems that improve on the best-known published results, indicating potential novel contributions. The benchmark is designed to be immune to data contamination and serves as an open challenge and community resource for mathematical discovery.
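The study aims to assess whether AI can make new mathematical contributions on unsolved problems. The approach is to build HorizonMath, a benchmark of more than 100 predominantly unsolved problems spanning computational and applied mathematics, accompanied by an open-source evaluation framework for automated verification. The key finding is that GPT 5.4 Pro proposes solutions to two problems that improve on the best existing results, indicating potential new mathematical results (pending expert review).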
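The asymmetry the HorizonMath abstract relies on, hard discovery paired with cheap verification, can be illustrated with a toy checker. The example below is hypothetical and is not drawn from the benchmark: it verifies that a candidate set of marks is a Golomb ruler (all pairwise differences distinct), a property that is quick to check even though finding short rulers of a given order is hard.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """Return True if every pair of marks has a distinct difference."""
    if len(set(marks)) != len(marks):
        return False  # repeated marks are not allowed
    diffs = [abs(a - b) for a, b in combinations(marks, 2)]
    return len(diffs) == len(set(diffs))


def ruler_length(marks):
    """Length of the ruler: distance between the outermost marks."""
    return max(marks) - min(marks)


def improves_on(candidate, best_known_length):
    """A candidate counts only if it is valid and strictly shorter."""
    return is_golomb_ruler(candidate) and ruler_length(candidate) < best_known_length
```

A verifier of this shape runs in time quadratic in the number of marks, so a proposed improvement can be accepted or rejected automatically without formal proof checking or manual review.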
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Authors: Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao
Venue: CVPR 2026
First: 2026-03-16T17:59:31+00:00 · Latest: 2026-03-16T17:59:31+00:00
Comments: CVPR 2026, Project Page: https://henghuiding.com/GlyphPrinter/
Abstract
Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
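Title / Abstract (translated from Chinese)
Title: GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Generating accurate glyphs is essential for visual text rendering yet highly challenging. Existing methods typically improve text rendering by training on large amounts of high-quality scene text images, but limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods use reinforcement learning to alleviate this issue, but their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that removes the reliance on explicit reward models. However, the standard DPO objective only models the overall preference between two samples, which is insufficient for visual text rendering, where glyph errors typically occur in localized regions. To address this, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter-sample and intra-sample preferences over annotated regions and substantially improves glyph accuracy. We further introduce Regional Reward Guidance, an inference strategy with controllable glyph accuracy. Extensive experiments show that GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a good balance between stylization and precision.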
Summary
The research aims to improve the accuracy of glyphs in visual text rendering by addressing the limitations of existing methods. GlyphPrinter uses a preference-based approach without explicit reward models, and introduces Region-Grouped Direct Preference Optimization (R-GDPO) to optimize preferences over annotated regions, enhancing glyph accuracy. The method also includes Regional Reward Guidance for controlled glyph accuracy during inference. Experiments show that GlyphPrinter outperforms existing methods in glyph accuracy while balancing stylization and precision.
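The study aims to improve glyph accuracy in visual text rendering by addressing shortcomings of existing methods, such as insufficient coverage of glyph variations and reliance on text recognition systems that are insensitive to fine-grained errors. The authors propose GlyphPrinter, which uses Region-Grouped Direct Preference Optimization (R-GDPO) to optimize glyph accuracy through region-level annotations. The method outperforms existing techniques in glyph accuracy while balancing stylization and precision. Regional Reward Guidance is also introduced to strengthen inference.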
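The contrast the GlyphPrinter abstract draws, between a whole-sample DPO objective and a region-level one, can be made concrete with a small sketch. `dpo_loss` below is the standard DPO loss for a single preferred/rejected pair. `region_averaged_dpo_loss` is only a hypothetical illustration of applying that loss per annotated region and averaging; it is not the paper's R-GDPO objective, whose exact form is not given in the abstract.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l:         policy log-probabilities of the preferred
                             and rejected samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
                             reference model.
    Returns -log(sigmoid(margin)), computed in a numerically stable way.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def region_averaged_dpo_loss(region_pairs, beta=0.1):
    """Hypothetical region-level variant: one DPO term per region.

    region_pairs: list of (logp_w, logp_l, ref_logp_w, ref_logp_l)
    tuples, one per annotated region (an assumed data layout).
    """
    losses = [dpo_loss(*pair, beta=beta) for pair in region_pairs]
    return sum(losses) / len(losses)
```

When the policy and reference agree on both samples the margin is zero and the loss equals log 2; it drops below that as the policy favors the preferred sample more strongly than the reference does.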
Mechanistic Origin of Moral Indifference in Language Models
Authors: Lingyu Li, Yan Teng, Yingchun Wang
First: 2026-03-16T17:59:17+00:00 · Latest: 2026-03-16T17:59:17+00:00
Comments: 24 pages, 11 figures, 5 tables
Abstract
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.
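Title / Abstract (translated from Chinese)
Title: Mechanistic Origin of Moral Indifference in Language Models
Existing behavioral alignment techniques for Large Language Models (LLMs) often overlook the discrepancy between surface compliance and unaligned internal representations, leaving LLMs exposed to long-tail risks. More crucially, we posit that LLMs have an inherent state of moral indifference because they compress distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations using 251k moral vectors constructed from Prototype Theory and the Social-Chemistry-101 dataset. First, our analysis of 23 models shows that current LLMs fail to represent the distinction between opposed moral categories and the fine-grained typicality gradients within those categories; notably, neither model scale, architecture, nor explicit alignment changes this indifference. We then apply Sparse Autoencoders to Qwen3-8B, isolate mono-semantic moral features, and reconstruct their topological relationships in a targeted way so that they align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we discuss the remedial nature of current intervention methods from the perspective of experientialist philosophy, arguing that endogenously aligned AI may require a shift from post-hoc correction to proactive cultivation.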
Summary
This study addresses the issue of moral indifference in Large Language Models (LLMs) by analyzing their latent representations and employing Sparse Autoencoders to align moral vectors. The research finds that current LLMs fail to distinguish between opposed moral categories and fine-grained typicality gradients, and that this indifference is not influenced by model size or explicit alignment techniques. The method involves reconstructing mono-semantic moral features to align with ground-truth vectors, which improves moral reasoning and achieves a 75% pairwise win-rate on the adversarial Flames benchmark.
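The study aims to address moral indifference by analyzing the latent representations of large language models (LLMs). It constructs 251k moral vectors based on Prototype Theory and the Social-Chemistry-101 dataset to verify and mitigate this indifference. The key findings are that current LLMs fail to represent moral distinctions, and that representational alignment via Sparse Autoencoders improves moral reasoning, reaching a 75% pairwise win-rate on the Flames benchmark. The study also discusses the need for aligned AI to shift from post-hoc correction to proactive cultivation.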
Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
Authors: Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo
First: 2026-03-16T17:59:05+00:00 · Latest: 2026-03-16T17:59:05+00:00
Comments: Project page: https://zhouzhenghong-gt.github.io/Tri-Prompting-Page/
Abstract
Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
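Title / Abstract (translated from Chinese)
Title: Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion
Recent video diffusion models have made remarkable progress in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under pose changes. The lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach uses a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports new workflows, including 3D-aware subject insertion into any scene and manipulation of existing subjects in an image. Experimental results show that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.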
Summary
The research aims to enhance the control over scene composition, subject customization, and motion adjustment in video diffusion models. Tri-Prompting introduces a unified framework with a two-stage training paradigm, incorporating a dual-condition motion module and an inference ControlNet scale schedule. The method significantly improves multi-view subject identity, 3D consistency, and motion accuracy compared to existing methods like Phantom and DaS.
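The study aims to improve the precision of video diffusion models in scene composition, subject customization, and motion control. Tri-Prompting proposes a unified framework and two-stage training paradigm that combines a dual-condition motion module with a ControlNet scale schedule at inference. The method significantly outperforms existing methods such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.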
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Authors: Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu
First: 2026-03-16T17:58:33+00:00 · Latest: 2026-03-16T17:58:33+00:00
Comments: https://yukangcao.github.io/HSImul3R/
Abstract
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
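Title / Abstract (translated from Chinese)
Title: HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a gap between perception and simulation: visually plausible reconstructions often violate physical constraints, causing instability in physics engines and failure in embodied AI applications. To close this gap, we introduce a physically grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we use Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which uses simulation feedback on gravitational stability and interaction success to refine scene geometry. We also present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments show that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be deployed directly to real-world humanoid robots.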
Summary
HSImul3R is a unified framework for reconstructing 3D human-scene interactions from casual captures, addressing the perception-simulation gap by integrating a physics simulator. It uses a bi-directional optimization pipeline to refine human dynamics and scene geometry, optimizing motion fidelity and contact stability, and leveraging simulation feedback for gravitational stability and interaction success. HSImul3R produces stable, simulation-ready reconstructions and can be directly applied to real-world humanoid robots.
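HSImul3R is a unified framework for reconstructing 3D human-scene interactions from casual captures, closing the perception-simulation gap by introducing physical constraints. The framework uses a bi-directional optimization pipeline in which the physics simulator acts as an active supervisor to jointly refine human dynamics and scene geometry. The forward pass uses Scene-targeted Reinforcement Learning, and the reverse pass uses Direct Simulation Reward Optimization. Experiments show that HSImul3R produces stable, simulation-ready 3D interaction reconstructions that can be applied directly to real-world humanoid robots.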
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
Authors: Yicheng Wu, Tao Song, Zhonghua Wu, Jin Ye, Zongyuan Ge, Wenjia Bai, Zhaolin Chen, Jianfei Cai
Venue: CVPR 2026
First: 2025-01-30T13:14:40+00:00 · Latest: 2026-03-16T17:56:59+00:00
Comments: Accepted by CVPR 2026
Abstract
Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique, offering a wide spectrum of information about the anatomy by employing different acquisition modalities. However, in the clinical workflow, it is impractical to collect all relevant modalities due to the scan time and cost constraints. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient solution to enhance data completeness and clinical usability. Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various "any-to-any" imputation tasks as a region-level full-stack code prediction problem. CodeBrain adopts a two-stage pipeline: (1) it learns the compact representation of a complete MRI modality set by encoding it into scalar-quantised codes at the region level, enabling high-fidelity image reconstruction after decoding these codes along with modality-agnostic common features; (2) it trains a projection encoder to predict the full-stack code map from incomplete modalities via a grading-based design for diverse imputation scenarios. Extensive experiments on two public brain MRI datasets, i.e., IXI and BraTS 2023, demonstrate that CodeBrain consistently outperforms state-of-the-art methods, establishing a new benchmark for unified brain MRI imputation and enabling virtual full-stack scanning. Our code will be released at https://github.com/ycwu1997/CodeBrain.
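Title / Abstract (translated from Chinese)
Title: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique that provides a wide range of information about anatomy through different acquisition modalities. In the clinical workflow, however, scan time and cost constraints make it impractical to collect all relevant modalities. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient way to improve data completeness and clinical usability. Existing imputation methods often depend on global conditioning or modality-specific designs, which limits their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various "any-to-any" imputation tasks as a region-level full-stack code prediction problem. CodeBrain adopts a two-stage pipeline: (1) it learns a compact representation of a complete MRI modality set by encoding it into scalar-quantised codes at the region level, enabling high-fidelity image reconstruction after decoding these codes together with modality-agnostic common features; (2) it trains a projection encoder with a grading-based design to predict the full-stack code map from incomplete modalities across diverse imputation scenarios. Extensive experiments on two public brain MRI datasets, IXI and BraTS 2023, show that CodeBrain consistently outperforms state-of-the-art methods, establishing a new benchmark for unified brain MRI imputation and enabling virtual full-stack scanning. Our code will be released at https://github.com/ycwu1997/CodeBrain/.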
Summary
CodeBrain is a unified framework for virtual full-stack scanning of brain MRI, which imputes missing modalities from available acquisitions. It uses a two-stage pipeline to encode complete MRI modalities into scalar-quantised codes and then predict full-stack codes from incomplete data. Experiments on IXI and BraTS 2023 datasets show that CodeBrain outperforms existing methods, setting a new benchmark for brain MRI imputation and enabling cost-efficient data enhancement in clinical workflows.
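CodeBrain is a unified framework for virtual full-stack scanning of brain MRI that fills in missing modalities by predicting full-stack codes from incomplete acquisitions. It adopts a two-stage pipeline: it first encodes the complete set of MRI modalities into scalar-quantised codes, then predicts the full-stack code map from incomplete modalities. Experiments on the IXI and BraTS 2023 datasets show that CodeBrain outperforms existing methods, establishes a new benchmark for brain MRI imputation, and enables cost-efficient data enhancement in clinical workflows.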
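The first stage described in the CodeBrain abstract encodes features into scalar-quantised codes. A minimal sketch of scalar quantisation is shown below, assuming latent values already bounded to [-1, 1] and a fixed number of evenly spaced levels per dimension; both the range and the default of 5 levels are illustrative assumptions, not values taken from the paper.

```python
def quantise(values, levels=5):
    """Map each value in [-1, 1] to an integer code in [0, levels - 1].

    Values outside the range are clipped first. Each dimension is
    quantised independently, which is what makes the scheme "scalar".
    """
    codes = []
    for v in values:
        clipped = max(-1.0, min(1.0, v))
        position = (clipped + 1.0) / 2.0 * (levels - 1)
        codes.append(int(round(position)))
    return codes


def dequantise(codes, levels=5):
    """Map integer codes back to evenly spaced values in [-1, 1]."""
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]
```

Under these assumptions, predicting a missing modality reduces to predicting a small integer per dimension and region, which a second-stage model can treat as a classification target before the decoder reconstructs the image.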
Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery
Authors: Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu, Chuhang Zou, Yue Wang
First: 2026-03-16T17:54:40+00:00 · Latest: 2026-03-16T17:54:40+00:00
Abstract
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
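Title / Abstract (translated from Chinese)
Title: Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, but its inference latency of several seconds per image rules out real-time use. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to reach interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. In addition, to extract joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by more than 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.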
Summary
Fast SAM 3D Body is an acceleration framework for SAM 3D Body, which originally had high inference latency. By decoupling spatial dependencies and applying pruning, it achieves interactive rates. The framework also replaces iterative mesh fitting with a direct feedforward mapping, significantly speeding up joint-level kinematics extraction. As a result, Fast SAM 3D Body provides up to a 10.9x speedup while maintaining reconstruction accuracy, surpassing 3DB on benchmarks like LSPET. It is used in a vision-only teleoperation system for real-time humanoid control and manipulation policy collection.
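Fast SAM 3D Body is a framework that accelerates SAM 3D Body: by decoupling spatial dependencies and applying pruning, it raises inference speed from several seconds per image to interactive rates. This allows feature extraction to be parallelized and decoding to be streamlined, and iterative mesh fitting is replaced with a direct feedforward mapping to speed up joint-level kinematics extraction. The framework achieves up to a 10.9x end-to-end speedup while keeping reconstruction accuracy on par, even surpassing SAM 3D Body on benchmarks such as LSPET. It is shown to enable real-time humanoid control and direct collection of manipulation policies from a single RGB stream.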
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu
First: 2026-03-16T17:53:28+00:00 · Latest: 2026-03-16T17:53:28+00:00
Comments: 31 pages
Abstract
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
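Title / Abstract (translated from Chinese)
Title: From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Accurate process supervision remains a key challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained mainly under a Supervised Fine-Tuning (SFT) paradigm, act as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a framework that turns video MLLMs into active "Critics". We use outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. In addition, our architecture builds a structured temporal input by explicitly anchoring the video sequence between the initial-state and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios show that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines and shows significant relative accuracy improvements over 72B-scale general MLLMs. PRIMO R1 also exhibits strong zero-shot generalization on difficult failure detection tasks. We reach 67.0% accuracy on the RoboFail benchmark, surpassing closed-source models such as OpenAI o1 by 6.0%.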
Summary
This paper addresses the challenge of accurate process supervision in long-horizon robotic manipulation by introducing PRIMO R1, which transforms video MLLMs into active 'Critics' through outcome-based Reinforcement Learning. The model generates explicit Chain-of-Thought for progress estimation and shows significant improvements over 72B-scale general MLLMs, achieving a 50% reduction in mean absolute error. PRIMO R1 also demonstrates strong zero-shot generalization on failure detection tasks, surpassing OpenAI o1 by 6.0% on the RoboFail benchmark.
本文通过引入PRIMO R1,利用基于结果的强化学习将视频MLLMs转变为积极的‘批评者’,以解决长时域机器人操作中的准确过程监督问题。该模型生成明确的推理链以估计进度,并在与专门推理基线相比的平均绝对误差上实现了50%的减少,达到最先进的性能。PRIMO R1还在故障检测任务上展示了强大的零样本泛化能力,并在RoboFail基准测试中超越了其他模型,准确率达到67.0%。
SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval
Authors: Jesper Derehag, Carlos Calva, Timmy Ghiurau
First: 2026-03-16T17:53:21+00:00 · Latest: 2026-03-16T17:53:21+00:00
Abstract
Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves from raw, unstructured conversation history using a fully deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage -- the only learned component -- running on CPU in ~650ms. Oracle analysis on two benchmarks identifies a compilation bottleneck: retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation and no per-dataset tuning, SmartSearch achieves 93.5% on LoCoMo and 88.4% on LongMemEval-S, exceeding all known memory systems under the same evaluation protocol on both benchmarks while using 8.5x fewer tokens than full-context baselines.
Summary / 总结
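SmartSearch argues that ranking matters more than structuring for conversational memory retrieval. It retrieves from raw conversation history with a deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage, the only learned component. Retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation, SmartSearch achieves 93.5% on LoCoMo and 88.4% on LongMemEval-S while using 8.5x fewer tokens than full-context baselines.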
SmartSearch 表明,对于对话记忆检索,排序比结构化更有效。它使用确定性流水线从原始对话历史中检索:以NER加权的子串匹配保证召回,以基于规则的实体发现进行多跳扩展,再经过排序融合阶段。无需学习式检索策略,借助分数自适应截断,SmartSearch 在 LoCoMo 和 LongMemEval-S 基准测试上超越了现有记忆系统,同时使用的词元少于全上下文基线。
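The ranking-then-truncation step described in this entry can be illustrated with a minimal sketch. The abstract does not say how the CrossEncoder and ColBERT rankings are fused or how the truncation adapts to scores, so reciprocal rank fusion (RRF) and a relative score cutoff are assumptions here, and every function name, parameter, and passage id is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one score per id (RRF)."""
    scores = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return scores


def truncate_to_budget(scores, token_counts, budget, rel_cutoff=0.97):
    """Keep passages in descending fused-score order within a token budget.

    Score-adaptive part (an assumption, not the paper's rule): stop as soon
    as a passage scores below rel_cutoff times the best fused score.
    """
    ordered = sorted(scores, key=lambda pid: (-scores[pid], pid))
    if not ordered:
        return []
    best = scores[ordered[0]]
    kept, used = [], 0
    for pid in ordered:
        if scores[pid] < rel_cutoff * best:
            break  # remaining passages are too weak relative to the best
        if used + token_counts[pid] > budget:
            continue  # does not fit; a later, shorter passage still might
        kept.append(pid)
        used += token_counts[pid]
    return kept


# Hypothetical rankings from the two learned scorers.
cross_encoder_rank = ["p3", "p1", "p2", "p4"]
colbert_rank = ["p3", "p2", "p1", "p4"]
fused = reciprocal_rank_fusion([cross_encoder_rank, colbert_rank])
token_counts = {"p1": 120, "p2": 80, "p3": 200, "p4": 50}
selected = truncate_to_budget(fused, token_counts, budget=350)
print(selected)
```

With a tight budget the token limit decides what survives; with a loose budget the relative cutoff does, which is the behaviour a score-adaptive rule is meant to provide.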
AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Authors: Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang
Venue: ICLR 2026
First: 2026-03-16T17:53:07+00:00 · Latest: 2026-03-16T17:53:07+00:00
Comments: Accepted at ICLR 2026. 15 pages, 5 figures
Abstract
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.
中文标题/摘要
标题:AC-Foley:参考音频引导的视频到音频合成与声学转移
现有的视频到音频(V2A)生成方法主要依赖于文本提示和视觉信息来合成音频。然而,存在两个关键瓶颈:训练数据中的语义粒度差距,例如将声学上不同的声音归类为粗略标签,以及对微声学特征描述的文本歧义性。这些瓶颈使得使用文本控制模式进行精细声音合成变得困难。为了解决这些限制,我们提出了AC-Foley,这是一种基于音频条件的V2A模型,可以直接利用参考音频来实现对生成声音的精确和精细控制。这种方法使精细声音合成、音色转移、零样本声音生成和提高音频质量成为可能。通过直接基于音频信号进行条件化,我们的方法绕过了文本描述的语义歧义,同时允许对声学属性进行精确操作。实验证明,当基于参考音频条件化时,AC-Foley在福莱声生成方面达到了最先进的性能,即使在没有音频条件化的情况下,其性能也与最先进的视频到音频方法相当。
Summary / 总结
AC-Foley addresses the limitations of existing video-to-audio synthesis methods by directly leveraging reference audio to achieve precise and fine-grained control over generated sounds. This approach overcomes semantic granularity gaps and textual ambiguity, enabling fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. Empirically, AC-Foley outperforms state-of-the-art methods for Foley generation when conditioned on reference audio and remains competitive without audio conditioning.
AC-Foley 是一种基于参考音频的视频到音频合成模型,能够实现精细和粒度化的声音合成,克服了语义粒度和文本描述的模糊性问题。该模型支持精细声音合成、音色转移、零样本声音生成以及提高音频质量。实验证明,AC-Foley 在参考音频条件下生成 Foley 声音时表现出色,即使不使用音频条件,其性能也与现有方法相当。
Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
Authors: Benjamin Reichman, Adar Avsian, Samuel Webster, Larry Heck
First: 2026-03-10T05:23:18+00:00 · Latest: 2026-03-16T17:52:20+00:00
Abstract
Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
中文标题/摘要
标题:情绪不仅仅是标签:LLM处理中的潜在情绪因素
大型语言模型通常部署在情感色彩变化广泛的文本上,但其推理行为通常在评估时并未考虑情感作为表示变化来源的因素。先前的工作大多将情感视为预测目标,例如情感分析或情绪分类。相比之下,我们研究情感作为潜在因素,影响模型如何关注和推理文本。我们分析了情感语气如何系统地改变Transformer模型的注意力几何结构,表明局部性、质心距离和熵等度量指标在不同情绪下有所变化,并与下游问答性能相关。为了便于对这些效应进行受控研究,我们引入了情感平衡阅读问答数据集(AURA-QA),该数据集包含情感平衡的人工撰写的上下文段落。最后,我们提出了一种情感正则化框架,该框架在训练过程中限制情绪条件下的表示漂移。跨多个问答基准的实验表明,这种方法在情绪变化和非情绪变化的数据集上均提高了阅读理解能力,在分布偏移下取得了一致的提升,并在多个基准测试上实现了域内改进。
Summary / 总结
This study investigates how emotional tone in text influences large language models (LLMs) by treating emotion as a latent factor rather than a prediction target. The research analyzes how emotional content alters attention patterns in transformer models and introduces Affect-Uniform ReAding QA (AURA-QA), a dataset with emotionally balanced context passages. The study proposes an emotional regularization framework to constrain emotion-conditioned representational drift during training, showing consistent improvements in reading comprehension across various QA benchmarks under distribution shift and in-domain settings.
研究探讨了情感内容如何影响大型语言模型(LLMs),将其视为潜在因素而非预测目标。分析了情感内容如何改变Transformer模型的注意力模式,并引入了情感平衡阅读问答数据集(AURA-QA)。研究提出了一种情感正则化框架,以在训练过程中约束情感条件下的表示漂移,展示了在各种问答基准测试中的一致性阅读理解改进。
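Two of the attention-geometry metrics named in this entry can be sketched for a single attention row. Shannon entropy is standard; the center-of-mass distance below is only one plausible definition (distance between the query position and the attention-weighted mean key position), since the abstract does not define it. Function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over keys."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def center_of_mass_distance(weights, query_pos):
    """Assumed definition: |query position - weighted mean key position|."""
    total = sum(weights)
    center = sum(i * w for i, w in enumerate(weights)) / total
    return abs(query_pos - center)


uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over 4 keys
peaked = [0.0, 0.0, 0.0, 1.0]       # all attention on the last key

print(attention_entropy(uniform), attention_entropy(peaked))
print(center_of_mass_distance(uniform, query_pos=3))
```

Diffuse attention gives high entropy and a center of mass far from the query; sharply local attention gives low values for both.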
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
First: 2026-03-16T17:52:04+00:00 · Latest: 2026-03-16T17:52:04+00:00
Comments: 15 pages, 6 figures
Abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby helping the teacher LLMs generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single training run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
中文标题/摘要
标题:OpenSeeker:通过全面开源训练数据普及前沿搜索代理
深度搜索能力已成为前沿大型语言模型(LLM)代理不可或缺的技能,但由于缺乏透明且高质量的训练数据,高性能搜索代理的开发仍主要由工业巨头主导。这种持续的数据稀缺性从根本上阻碍了更广泛研究社区在该领域的开发和创新。为解决这一问题,我们引入了OpenSeeker,这是首个全面开源的搜索代理(即模型和数据),通过两项核心技术创新实现了前沿级别的性能:(1)基于事实的可扩展可控问答合成,通过拓扑扩展和实体混淆反向工程网络图,生成具有可控覆盖范围和复杂度的复杂多跳推理任务。(2)去噪轨迹合成,采用回顾性总结机制去噪轨迹,从而促进教师LLM生成高质量行动。实验结果表明,OpenSeeker仅在11.7k合成样本上进行一次训练,就在包括BrowseComp、BrowseComp-ZH、xbench-DeepSearch和WideSearch等多个基准测试中达到了最先进的性能。值得注意的是,使用简单的SFT训练后,OpenSeeker显著优于第二好的全面开源代理DeepDive(例如,在BrowseComp上的表现分别为29.5%和15.3%),甚至在BrowseComp-ZH上超越了工业竞争对手通义DeepResearch(通过广泛的持续预训练、SFT和RL训练,得分为48.4%和46.7%)。我们全面开源了完整的训练数据集和模型权重,以普及前沿搜索代理研究,促进更加透明和协作的生态系统。
Summary / 总结
OpenSeeker is introduced as the first fully open-source search agent (model and data) to reach frontier-level performance, built on two core innovations: fact-grounded scalable controllable QA synthesis and denoised trajectory synthesis. Trained with simple SFT on only 11.7k synthesized samples, it outperforms the second-best fully open-source agent DeepDive on BrowseComp (29.5% vs. 15.3%) and surpasses the industrial Tongyi DeepResearch on BrowseComp-ZH (48.4% vs. 46.7%). The complete training dataset and model weights are released.
OpenSeeker 是首个完全开源的搜索代理,通过事实导向的可扩展可控 QA 合成和去噪轨迹合成两项创新,实现了多项基准测试(包括 BrowseComp、BrowseComp-ZH、xbench-DeepSearch 和 WideSearch)上的最先进性能。仅使用 11,700 个合成样本进行训练的 OpenSeeker 在多个基准测试中超越了开源和工业竞争对手。完整的训练数据和模型权重均已开源,以促进搜索代理研究领域的透明和协作。
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Authors: Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao
Venue: CVPR 2026
First: 2025-12-12T18:51:49+00:00 · Latest: 2026-03-16T17:51:36+00:00
Comments: Accepted to CVPR 2026. Project page: https://pq-yang.github.io/projects/MatAnyone2/
Abstract
Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
中文标题/摘要
标题:MatAnyone 2:通过学习质量评估器扩展视频抠图
视频抠图仍然受限于现有数据集的规模和现实性。虽然利用分割数据可以增强语义稳定性,但缺乏有效的边界监督往往导致分割样式的抠图缺乏精细细节。为此,我们引入了一个学习的抠图质量评估器(MQE),该评估器无需地面真值即可评估alpha抠图的语义和边界质量。它生成一个像素级的评估图,识别可靠和错误的区域,实现精细的质量评估。MQE通过两种方式扩展视频抠图:(1)作为训练期间的在线抠图质量反馈,抑制错误区域,提供全面的监督;(2)作为离线选择模块进行数据整理,通过结合领先视频和图像抠图模型的优势提高注释质量。这一过程使我们能够构建一个大规模的现实世界视频抠图数据集VMReal,包含28K个片段和2.4M帧。为处理长视频中较大的外观变化,我们引入了一种参考帧训练策略,该策略结合了超出局部窗口的长距离帧进行有效的训练。我们的MatAnyone 2在合成和现实世界基准测试中均达到最先进的性能,所有指标均超越了先前的方法。
Summary / 总结
The research aims to improve video matting by addressing the limitations of existing datasets and methods. It introduces a learned Matting Quality Evaluator (MQE) that assesses the quality of alpha mattes without ground truth, providing both online feedback during training and offline data curation. This leads to the creation of a large-scale real-world video matting dataset, VMReal, and a reference-frame training strategy to handle large appearance variations. The method achieves state-of-the-art performance on both synthetic and real-world benchmarks.
研究旨在通过解决现有数据集和方法的限制来提升视频抠图。引入了学习型抠图质量评估器(Matting Quality Evaluator, MQE),无需真值即可评估alpha抠图的质量,实现精细的质量评估。MQE 作为在线反馈机制用于训练过程中抑制错误区域,并作为离线选择模块进行数据筛选,提高标注质量。由此建立了大规模真实世界视频抠图数据集VMReal,并引入了参考帧训练策略以处理长视频中的大范围外观变化。该方法在合成和真实世界基准上均达到了最先进的性能。
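The offline curation role of the MQE can be pictured with a small sketch. The abstract states only that a pixel-wise evaluation map helps combine the strengths of video and image matting models; the per-pixel "take the candidate with the lower error score" rule below is an assumption for illustration, not the authors' procedure, and all names and values are hypothetical.

```python
def fuse_mattes(alpha_video, alpha_image, err_video, err_image):
    """Per pixel, keep the alpha value whose evaluator error score is lower.

    alpha_* are 2D lists of alpha values in [0, 1]; err_* are 2D lists of
    hypothetical evaluator error scores (lower means more reliable).
    Ties go to the video-model candidate.
    """
    fused = []
    for av_row, ai_row, ev_row, ei_row in zip(
        alpha_video, alpha_image, err_video, err_image
    ):
        fused.append(
            [av if ev <= ei else ai
             for av, ai, ev, ei in zip(av_row, ai_row, ev_row, ei_row)]
        )
    return fused


alpha_video = [[0.9, 0.2], [0.5, 0.0]]
alpha_image = [[1.0, 0.1], [0.4, 0.3]]
err_video = [[0.1, 0.8], [0.2, 0.5]]
err_image = [[0.3, 0.2], [0.2, 0.1]]

fused = fuse_mattes(alpha_video, alpha_image, err_video, err_image)
print(fused)
```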
SemBench: A Benchmark for Semantic Query Processing Engines
Authors: Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, Immanuel Trummer
First: 2025-11-03T16:25:19+00:00 · Latest: 2026-03-16T17:51:06+00:00
Comments: Accepted to VLDB 2026; Revised version
Abstract
We present a benchmark targeting a novel class of systems: semantic query processing engines. Those systems rely inherently on generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. Included are scenarios ranging from movie review analysis to car damage detection. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluated our benchmark on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
中文标题/摘要
标题:SemBench:语义查询处理引擎的基准测试
我们提出了一项基准测试,针对一类新型系统:语义查询处理引擎。这些系统依赖于最先进的大型语言模型(LLMs)的生成和推理能力。它们扩展了SQL,加入了由自然语言指令配置的语义操作符,通过LLMs进行评估,使用户能够对多模态数据执行各种操作。 该基准测试在三个关键维度上引入了多样性:场景、模态和操作符。包括从电影评论分析到汽车损伤检测的各种场景。在这些场景中,涵盖了不同类型的模态数据,包括图像、音频和文本。最后,查询涉及多种操作符,包括语义过滤器、连接、映射、排名和分类操作符。 我们在三个学术系统(LOTUS、Palimpzest和ThalamusDB)和一个工业系统Google BigQuery上评估了该基准测试。尽管这些结果反映了系统在持续开发中的一个快照,但我们的研究提供了对其当前优势和劣势的关键见解,揭示了未来研究的有希望的方向。
Summary / 总结
The paper introduces SemBench, a benchmark for evaluating semantic query processing engines that use large language models for generative and reasoning tasks. It covers diverse scenarios such as movie review analysis and car damage detection, involving various data modalities like images, audio, and text, and operators including filters, joins, and classifications. The benchmark was tested on four systems, revealing their strengths and weaknesses in handling semantic queries, providing insights for future research.
论文介绍了SemBench,一个用于评估依赖大型语言模型进行生成和推理的语义查询处理引擎的基准。它涵盖了从电影评论分析到汽车损伤检测的各种场景,涉及图像、音频和文本等多种数据模态,以及过滤、连接和分类等操作。该基准测试了四个系统,揭示了它们在处理语义查询方面的优缺点,为未来研究提供了重要见解。
Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition
Authors: Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan
First: 2025-11-21T17:56:43+00:00 · Latest: 2026-03-16T17:50:26+00:00
Abstract
We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.
中文标题/摘要
标题:Illustrator's Depth:用于图像分解的单目图层索引预测
我们提出了Illustrator的深度,这是一种新颖的深度定义,解决了数字内容创作中的关键挑战:将扁平图像分解为可编辑、有序的图层。该定义受到艺术家构图过程的启发,通过推断每个像素的图层索引,形成一种可解释的图像分解,其中元素以离散且全局一致的顺序排列,优化了可编辑性。我们还提出并使用一个精心策划的分层矢量图形数据集训练了一个神经网络,直接从栅格输入预测图层。我们的图层索引推断解锁了一系列强大的下游应用。特别是,它在图像矢量化方面显著优于最先进的基线方法,同时还能实现高保真度的文本到矢量图形生成、从二维图像自动生成三维浮雕以及直观的深度感知编辑。通过将深度从物理量重新定义为创意抽象,Illustrator的深度预测为可编辑图像分解提供了一个新的基础。
Summary / 总结
The research introduces Illustrator's Depth, a novel depth definition aimed at decomposing flat images into editable layers. The method uses a neural network trained on a dataset of layered vector graphics to predict layer indices for each pixel, enabling various applications such as image vectorization, text-to-vector-graphics generation, and 3D relief creation. The approach outperforms existing methods and supports intuitive depth-aware editing.
论文提出了Illustrator's Depth,这是一种新的深度概念,旨在将平面图像分解为可编辑的图层。该方法为每个像素推断一个图层索引,实现了一种全局一致且可解释的图像分解。通过训练神经网络在层叠矢量图形数据集上直接从像素图预测图层,该方法显著优于现有图像矢量化方法,并支持文本到矢量图形生成、2D图像到3D浮雕生成以及深度感知编辑等功能。
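To make the idea of a per-pixel layer index concrete, the sketch below splits a small index map into ordered binary masks, one per layer. It assumes that a lower index means a layer further back, which the abstract does not state, and it takes the index map as given rather than predicting it; the function name is hypothetical.

```python
def split_into_layers(layer_index):
    """Return {index: binary mask}, ordered from lowest to highest index.

    layer_index is a 2D list of integer layer indices, one per pixel.
    Assumption: lower index = further back in the stacking order.
    """
    indices = sorted({value for row in layer_index for value in row})
    return {
        k: [[1 if value == k else 0 for value in row] for row in layer_index]
        for k in indices
    }


# Toy 2x3 "image": background (0), a shape (1), and a detail on top (2).
layer_index = [
    [0, 0, 1],
    [0, 2, 1],
]
layers = split_into_layers(layer_index)
for index, mask in layers.items():
    print(index, mask)
```

Each mask can then be edited or vectorized on its own and the layers recomposed in index order.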
Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask
Authors: Vasiliy A. Es'kin, Egor V. Ivanov
First: 2026-03-16T17:46:15+00:00 · Latest: 2026-03-16T17:46:15+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2507.04153
Abstract
Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from contemporary lithography masks are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, based on a waveguide method with its most computationally expensive components replaced by a neural network. To evaluate performance, the accuracy and inference time of PINNs and NOs are compared against modern numerical solvers for a series of problems with known exact solutions. Emphasis is placed on the solution accuracy of the considered neural systems at 13.5 nm and 11.2 nm wavelengths. Numerical experiments on realistic 2D and 3D masks demonstrate that PINNs and neural operators achieve competitive accuracy and significantly reduced prediction times, with the proposed WGNO architecture reaching state-of-the-art performance. The presented neural operator has pronounced generalizing properties, meaning that for unseen problem parameters it delivers a solution accuracy close to that for parameters seen in the training dataset. These results provide a highly efficient solution for accelerating the design and optimization workflows of next-generation lithography masks.
中文标题/摘要
标题:基于物理的神经系统在模拟EUV电磁波从光刻掩膜衍射中的应用
介绍了用于解决极端紫外线(EUV)电磁波从当代光刻掩膜衍射问题的物理信息神经网络(PINNs)和神经算子(NOs)。提出了一种基于波导方法的新型混合波导神经算子(WGNO),其最耗计算资源的部分被神经网络替代。通过与现代具有已知精确解的数值求解器进行比较,评估了PINNs和NOs的性能。重点研究了考虑的人工神经系统在13.5 nm和11.2 nm波长下的解的准确性。在现实的2D和3D掩膜上的数值实验表明,PINNs和神经算子实现了可竞争的准确性,并显著减少了预测时间,提出的WGNO架构达到了最先进的性能。所提出的神经算子具有明显的泛化能力,这意味着对于未见过的问题参数,它提供的解的准确性接近训练数据集中参数的准确性。这些结果为加速下一代光刻掩膜的设计和优化工作流程提供了高效的解决方案。
Summary / 总结
The research aims to improve the simulation of EUV electromagnetic wave diffraction from lithography masks using physics-informed neural networks (PINNs) and neural operators (NOs). A novel Waveguide Neural Operator (WGNO) is introduced, which combines waveguide methods with neural networks. The study evaluates the accuracy and inference time of PINNs and NOs against modern numerical solvers for various problems with known solutions. The results show that PINNs and NOs, particularly the WGNO, achieve competitive accuracy and significantly reduced prediction times, making them efficient tools for accelerating lithography mask design and optimization workflows.
研究旨在使用物理信息神经网络(PINNs)和神经算子(NOs)来改进对EUV电磁波从光刻掩模衍射的模拟。引入了一种新的波导神经算子(WGNO),它结合了波导方法和神经网络组件。实验表明,PINNs和NOs在与现代数值求解器相比时,能够实现竞争性的准确性和显著减少的推理时间,而WGNO达到了最先进的性能。神经算子展示了强大的泛化能力,能够为未见过的问题参数提供接近训练数据集参数的准确解。
Grounding World Simulation Models in a Real-World Metropolis
Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim
First: 2026-03-16T17:46:04+00:00 · Latest: 2026-03-16T17:46:04+00:00
Comments: project page: https://seoul-world-model.github.io/
Abstract
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
中文标题/摘要
标题:将世界模拟模型扎根于真实都市
如果世界模拟模型能够渲染一个实际存在的城市,而不是一个想象中的环境会怎样?先前的生成世界模型通过想象所有内容来合成视觉上可信但人工的环境。我们提出了首尔世界模型(SWM),这是一个扎根于真实城市首尔的城市级世界模型。SWM 通过以邻近街景图像为条件的检索增强来锚定自回归视频生成。然而,这种设计引入了几个挑战,包括检索参考与动态目标场景之间的时间错位,以及由于车载拍摄间隔稀疏导致的轨迹多样性有限和数据稀疏问题。我们通过跨时间配对、支持多样相机轨迹的大规模合成数据集以及从稀疏街景图像中合成连贯训练视频的视图插值管道来解决这些挑战。我们还引入了虚拟前瞻汇点(Virtual Lookahead Sink),通过不断将每个片段重新锚定到未来位置的检索图像来稳定长时生成。我们在首尔、釜山和安阿伯三个城市中将SWM 与最近的视频世界模型进行了对比评估。SWM 在生成空间上忠实、时间上一致、长时的视频方面优于现有方法,这些视频扎根于实际的城市环境,轨迹可达数百米,并支持多种摄像机运动和文本提示场景变化。
Summary / 总结
The research aims to create a world simulation model that generates videos of a real city, Seoul, rather than an imagined environment. The method involves using autoregressive video generation with retrieval-augmented conditioning on nearby street-view images. Key findings include SWM outperforming existing methods in generating spatially faithful, temporally consistent, long-horizon videos over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
研究旨在通过使用附近的街景图像作为参考,生成首尔的真实城市视频。方法包括跨时间配对、大规模合成数据集和视图插值管道,以解决时间错位和数据稀疏性等挑战。模型首尔世界模型(SWM)在生成空间上忠实、时间上一致且长达数百米的长时视频方面优于现有方法,支持多种相机运动和文本提示的场景变化。
Benchmarking Machine Learning Approaches for Polarization Mapping in Ferroelectrics Using 4D-STEM
Authors: Matej Martinc, Goran Dražič, Anton Kokalj, Katarina Žiberna, Janina Roknić, Matic Poberžnik, Sašo Džeroski, Andreja Benčan Golob
First: 2026-03-16T17:45:28+00:00 · Latest: 2026-03-16T17:45:28+00:00
Abstract
Four-dimensional scanning transmission electron microscopy (4D-STEM) provides rich, atomic-scale insights into materials structures. However, extracting specific physical properties - such as polarization directions essential for understanding functional properties of ferroelectrics - remains a significant challenge. In this study, we systematically benchmark multiple machine learning models, namely ResNet, VGG, a custom convolutional neural network, and PCA-informed k-Nearest Neighbors, to automate the detection of polarization directions from 4D-STEM diffraction patterns in ferroelectric potassium sodium niobate. While models trained on synthetic data achieve high accuracy on idealized synthetic diffraction patterns of equivalent thickness, the domain gap between simulation and experiment remains a critical barrier to real-world deployment. In this context, a custom-made prototype representation training regime and PCA-based methods, combined with data augmentation and filtering, can better bridge this gap. Error analysis reveals periodic misclassification patterns, indicating that not all diffraction patterns carry enough information for a successful classification. Additionally, our qualitative analysis demonstrates that irregularities in the model's prediction patterns correlate with defects in the crystal structure, suggesting that supervised models could be used for detecting structural defects. These findings guide the development of robust, transferable machine learning tools for electron microscopy analysis.
中文标题/摘要
标题:使用4D-STEM对铁电体极化映射的机器学习方法基准测试
四维扫描透射电子显微镜(4D-STEM)提供了丰富的原子尺度材料结构见解。然而,提取特定物理属性——如对于理解铁电体功能属性至关重要的极化方向——仍然是一个重大挑战。在本研究中,我们系统地基准测试了多种机器学习模型,包括ResNet、VGG、一个自定义卷积神经网络以及基于PCA的k-最近邻方法,以自动化从铁电铌酸钾钠的4D-STEM衍射图中检测极化方向。虽然在合成数据上训练的模型在理想化的合成衍射图上实现了高精度,但模拟与实验之间的领域差距仍然是实际部署中的关键障碍。在此背景下,自定义的原型表示训练制度和基于PCA的方法,结合数据增强和过滤,可以更好地弥合这一差距。误差分析揭示了周期性的分类错误模式,表明并非所有衍射图都携带有足够的信息进行成功的分类。此外,我们的定性分析表明,模型预测模式中的不规则性与晶体结构中的缺陷相关,这表明监督模型可以用于检测结构缺陷。这些发现指导了用于电子显微镜分析的稳健且可转移的机器学习工具的发展。
Summary / 总结
This study benchmarks several machine learning models to automate the detection of polarization directions from 4D-STEM diffraction patterns in ferroelectric potassium sodium niobate. While models trained on synthetic data perform well on idealized patterns, the domain gap between simulation and experiment poses a challenge. A custom training regime and PCA-based methods, combined with data augmentation and filtering, improve performance. Error analysis shows periodic misclassifications, indicating insufficient information in some diffraction patterns. The study also suggests that model predictions can correlate with crystal defects, potentially aiding in defect detection.
本研究对比了几种机器学习模型,以自动化检测铁电铌酸钾钠的4D-STEM衍射图中的极化方向。虽然在合成数据上训练的模型在理想化的合成衍射图上表现良好,但模拟与实验之间的差距仍然是一个挑战。自定义训练方案和基于PCA的方法,结合数据增强和过滤,有助于弥合这一差距。研究发现周期性误分类,并将模型预测的不规则性与晶体缺陷相关联,表明监督模型可用于检测结构缺陷。
Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions
Authors: Quoc Tran-Dinh, Nghia Nguyen-Trung
First: 2026-03-16T17:39:25+00:00 · Latest: 2026-03-16T17:39:25+00:00
Comments: 34 pages and 2 figures
Abstract
This paper develops new variance-reduction techniques for the forward-reflected-backward splitting (FRBS) method to solve a class of possibly nonmonotone stochastic composite inclusions. Unlike unbiased estimators such as mini-batching, developing stochastic biased variants faces a fundamental technical challenge and has not been utilized before for inclusions and fixed-point problems. We fill this gap by designing a new framework that can handle both unbiased and biased estimators. Our main idea is to construct stochastic variance-reduced estimators for the forward-reflected direction and use them to perform iterate updates. First, we propose a class of unbiased variance-reduced estimators and show that increasing mini-batch SGD, loopless-SVRG, and SAGA estimators fall within this class. For these unbiased estimators, we establish a $\mathcal{O}(1/k)$ best-iterate convergence rate for the expected squared residual norm, together with almost-sure convergence of the iterate sequence to a solution. Consequently, we prove that the best oracle complexities for the $n$-finite-sum and expectation settings are $\mathcal{O}(n^{2/3}\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-10/3})$, respectively, when employing loopless-SVRG or SAGA, where $\epsilon$ is a desired accuracy. Second, we introduce a new class of biased variance-reduced estimators for the forward-reflected direction, which includes SARAH, Hybrid SGD, and Hybrid SVRG as special instances. While the convergence rates remain valid for these biased estimators, the resulting oracle complexities are $\mathcal{O}(n^{3/4}\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-5})$ for the $n$-finite-sum and expectation settings, respectively. Finally, we conduct two numerical experiments on AUC optimization for imbalanced classification and policy evaluation in reinforcement learning.
中文标题/摘要
标题:无偏和有偏减方差前反射后向分裂方法用于随机复合包含
本文开发了前反射后向分裂(FRBS)方法的新减方差技术,以解决一类可能非单调的随机复合包含问题。与无偏估计量(如小批量)不同,开发随机有偏变体面临一个基本的技术挑战,此前尚未用于包含问题和不动点问题。我们通过设计一个新框架来填补这一空白,该框架可以处理无偏和有偏估计量。我们的主要思想是为前反射方向构造随机减方差估计量,并使用它们来执行迭代更新。首先,我们提出了一类无偏减方差估计量,并证明了递增小批量SGD、无循环SVRG和SAGA估计量属于此类。对于这些无偏估计量,我们建立了期望平方残差范数的$\mathcal{O}(1/k)$最优迭代收敛速率,并证明了迭代序列几乎必然收敛到一个解。因此,我们证明了当使用无循环SVRG或SAGA时,$n$有限和设置与期望设置的最佳oracle复杂度分别为$\mathcal{O}(n^{2/3}\epsilon^{-2})$和$\mathcal{O}(\epsilon^{-10/3})$,其中$\epsilon$是期望精度。其次,我们引入了一类新的前反射方向有偏减方差估计量,其中包括SARAH、混合SGD和混合SVRG作为特殊情况。虽然这些有偏估计量的收敛速率仍然有效,但它们在$n$有限和设置与期望设置中的oracle复杂度分别为$\mathcal{O}(n^{3/4}\epsilon^{-2})$和$\mathcal{O}(\epsilon^{-5})$。最后,我们在不平衡分类的AUC优化和强化学习中的策略评估上进行了两个数值实验。
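For readers unfamiliar with the base method, here is a minimal sketch of the deterministic forward-reflected-backward iteration, x_{k+1} = J(x_k - eta * (2 G(x_k) - G(x_{k-1}))), on a toy monotone problem. It deliberately omits the variance-reduced estimators that are the paper's contribution. The operator, the box constraint, the step size, and all names are illustrative assumptions.

```python
def frbs(G, resolvent, x0, eta, iters):
    """Deterministic forward-reflected-backward splitting.

    G: single-valued Lipschitz operator; resolvent: backward step (here a
    projection). The reflected direction is 2*G(x_k) - G(x_{k-1}).
    """
    x = list(x0)
    g_prev = G(x)  # treat x_{-1} = x_0 for the first step
    for _ in range(iters):
        g = G(x)
        step = [xi - eta * (2.0 * gi - gpi)
                for xi, gi, gpi in zip(x, g, g_prev)]
        g_prev = g
        x = resolvent(step)
    return x


def G(x):
    """Skew-symmetric linear operator (monotone, Lipschitz constant 1)."""
    return [x[1], -x[0]]


def project_box(x, lo=-5.0, hi=5.0):
    """Resolvent of the normal cone of a box: coordinate-wise clipping."""
    return [min(max(xi, lo), hi) for xi in x]


# Step size below 1/(2L) with L = 1; the unique solution is the origin.
x_final = frbs(G, project_box, [1.0, 1.0], eta=0.3, iters=300)
print(x_final)
```

Plain forward-backward steps do not converge on this rotation-like operator; the reflected term is what makes the iteration contract toward the solution with a single operator evaluation per step.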
Severe Domain Shift in Skeleton-Based Action Recognition: A Study of Uncertainty Failure in Real-World Gym Environments
Authors: Aaditya Khanal, Junxiu Zhou
First: 2026-03-16T17:37:17+00:00 · Latest: 2026-03-16T17:37:17+00:00
Comments: 6 pages, 7 figures
Abstract
The practical deployment gap -- transitioning from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation -- introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high Out-Of-Distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC >= 0.91) and Mahalanobis distance provide reliable distributional detection signals, such high AUROC scores coexist with poor risk-coverage behavior when making decisions. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of both semantic and geometric skeleton recognition deployment.
中文标题/摘要
标题:基于骨架的动作识别中的严重领域偏移:现实健身房环境中的不确定性失败研究
实际部署差距——从受控的多视角3D骨架捕捉过渡到不受约束的单目2D姿态估计——引入了一个复合领域偏移,其安全性影响仍然严重未被探索。我们使用新型Gym2D数据集(风格/视角偏移)和UCF101数据集(语义偏移)进行了系统研究。我们的骨架变换器在NTU-120上实现了63.2%的跨被试准确率,但在零样本转移到健身房领域时降至1.6%,在UCF101上降至1.16%。关键的是,我们证明了高离域(OOD)检测AUROC并不能保证安全的选择性分类。标准不确定性方法无法检测到这种性能下降:即使在50%的覆盖率下,模型仍然以99.6%的风险保持自信的错误。虽然基于能量的评分(AUROC >= 0.91)和马氏距离提供了可靠的分布检测信号,但这些高AUROC分数在做决策时伴随着糟糕的风险覆盖率行为。一个轻量级微调门控机制恢复了校准并使模型能够优雅地避免错误,大幅减少了自信错误预测的频率。我们的工作挑战了标准部署假设,提供了语义和几何骨架识别部署的原理性安全性分析。
Summary / 总结
This study addresses the severe domain shift in skeleton-based action recognition from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation, using the Gym2D dataset and UCF101 dataset. The Skeleton Transformer shows 63.2% accuracy on NTU-120 but fails to transfer to the Gym domain with only 1.6% accuracy and to UCF101 with 1.16% accuracy. High Out-Of-Distribution detection AUROC does not ensure safe selective classification, and standard uncertainty methods fail to detect the performance drop. A lightweight finetuned gating mechanism improves calibration and enables better decision-making, reducing confident wrong predictions. This work highlights the need for a principled safety analysis in deploying skeleton-based action recognition systems.
该研究探讨了从受控的多视角3D骨架捕捉到不受约束的单目2D姿态估计的严重领域偏移问题,使用了Gym2D数据集和UCF101数据集。Skeleton Transformer在NTU-120上的准确率为63.2%,但在Gym领域和UCF101上的准确率分别只有1.6%和1.16%。高Out-Of-Distribution检测AUROC并不能保证安全的选择性分类,标准不确定性方法无法检测到性能下降。一个轻量级的微调门控机制可以改善校准并使决策更加稳健,从而减少自信错误预测的频率。这项工作强调了在部署基于骨架的动作识别系统时需要进行原理性的安全性分析。
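The risk-coverage behaviour described in the entry above can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `risk_at_coverage` and the toy confidences are assumptions made for illustration. It shows how a model that is confidently wrong keeps a high selective risk even after abstaining on its least confident half.

```python
def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of samples with the highest confidence.

    confidences: list of floats, one confidence score per sample
    correct:     list of bools, True where the prediction was right
    coverage:    fraction of samples the classifier answers, in (0, 1]
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n = len(confidences)
    # Rank samples from most to least confident and keep the top fraction.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    keep = max(1, round(coverage * n))
    kept = order[:keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / keep


# Toy data (hypothetical): the two most confident predictions are both wrong.
conf = [0.99, 0.95, 0.60, 0.55]
ok = [False, False, True, True]
print(risk_at_coverage(conf, ok, 0.5))  # 1.0: abstaining on the uncertain half does not help
print(risk_at_coverage(conf, ok, 1.0))  # 0.5
```

Here abstention makes things worse, because the confidence ranking is anti-correlated with correctness. The abstract reports a comparable failure: 99.6% risk at 50% coverage.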
Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding
Authors: Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, Bin Zhao
Venue: AAAI 2025 Oral
First: 2024-08-23T12:27:33+00:00 · Latest: 2026-03-16T17:35:55+00:00
Comments: Accepted by AAAI 2025 (Oral)
Abstract
3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. Besides, the Multi-Image and Point Affordance (MIPA) benchmark is constructed and our method outperforms existing state-of-the-art methods on various experimental comparisons.
中文标题/摘要
标题:学习2D不变性功能知识以实现3D功能定位
3D物体功能定位旨在预测3D物体上的功能区域,并为机器人技术的广泛应用奠定了基础。最近的进步通过学习3D区域与单个人机交互图像之间的映射来解决这个问题。然而,3D物体的几何结构与人机交互图像中的物体并不总是保持一致,导致泛化能力较差。为了解决这一问题,我们提出从同一功能类别内的多个个人机交互图像中学习可泛化的不变功能知识。具体而言,我们引入了多图像引导的不变特征感知3D功能定位(MIFAG)框架。该框架通过识别多个个人机交互图像中的共同交互模式来定位3D物体的功能区域。首先,不变功能知识提取模块(IAM)利用迭代更新策略逐步从多个图像中提取对齐的功能知识,并将其整合到功能字典中。然后,功能字典自适应融合模块(ADM)学习全面的点云表示,考虑多个图像中的所有功能候选。此外,我们构建了多图像和点功能(MIPA)基准,并在各种实验比较中我们的方法优于现有最先进的方法。
Summary / 总结
This paper addresses the challenge of 3D object affordance grounding by proposing the MIFAG framework, which learns invariant affordance knowledge from multiple human-object interaction images. The IAM module extracts aligned affordance knowledge iteratively and stores it in an affordance dictionary, while the ADM module learns comprehensive point cloud representations that consider all affordance candidates across the images. The authors also construct the MIPA benchmark, on which MIFAG outperforms existing state-of-the-art methods in various comparisons.
研究旨在通过解决3D物体几何结构与人类物体交互图像之间的一致性问题,提高3D物体功能区域的定位。提出的MIFAG框架从同一类别内的多个交互图像中学习不变的功能知识。该方法使用迭代更新策略逐步提取对齐的功能知识,并将其整合到功能字典中。此外,该方法还通过考虑所有图像中的功能候选者来学习全面的点云表示。实验结果表明,MIFAG在各种比较中优于现有最先进的方法。
Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges
Authors: Amy Rafferty, Ajitha Rajan
First: 2025-09-18T16:13:11+00:00 · Latest: 2026-03-16T17:35:20+00:00
Abstract
Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of labelled images with pathology annotations. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain shift evaluation across multiple model architectures revealed substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we trained a source-classification model that distinguished datasets with near-perfect accuracy, and performed subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identified significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.
中文标题/摘要
标题:公共胸部X光数据集在人工智能中的局限性:标签质量、领域偏移、偏差和评估挑战
人工智能在胸部X光成像中显示出巨大的潜力,其中深度学习模型可以接近放射科医生的诊断性能。大型公共数据集如MIMIC-CXR、ChestX-ray14、PadChest和CheXpert加速了这一进展,提供了带有病理注释的数十万张标注图像。然而,这些数据集也存在重要局限。自动从放射学报告中提取标签引入了错误,特别是在处理不确定性与否定时尤为明显,放射科医生的审查经常与分配的标签不一致。此外,领域偏移和人群偏差限制了模型的泛化能力,而评估实践往往忽视了临床意义的度量。我们系统分析了这些挑战,重点关注标签质量、数据集偏差和领域偏移。我们在多个模型架构上的跨数据集领域偏移评估揭示了显著的外部性能下降,AUPRC和F1分数相对于内部测试有显著降低。为了评估数据集偏差,我们训练了一个源分类模型,能够以近乎完美的准确度区分数据集,并进行了子组分析,显示少数年龄和性别群体的性能降低。最后,两位认证放射科医生的专家审查发现,公共数据集标签存在显著分歧。我们的研究结果突显了当前基准中的重要临床弱点,并强调了需要临床验证的数据集和更公平的评估框架的必要性。
Summary / 总结
The paper investigates the limitations of public chest radiography datasets for artificial intelligence, focusing on label quality, domain shift, bias, and evaluation challenges. It finds that automated label extraction introduces errors, radiologist disagreements are common, and domain shift and population bias restrict model generalizability. The study also shows substantial external performance degradation and reduced performance for minority groups, highlighting the need for fairer evaluation frameworks.
该研究探讨了公共胸部X光数据集在人工智能中的局限性,重点关注标签质量、领域偏移、偏差和评估挑战。自动从放射学报告中提取标签引入了错误,放射科医生审查经常不同意分配的标签。领域偏移和人口偏差限制了模型的泛化能力,而评估实践往往忽视了临床有意义的指标。研究揭示了多个模型架构在外部测试中的显著性能下降,区分数据集的准确性接近完美,并通过专家审查发现了公共数据集中标签的重大分歧。这些发现突显了当前基准的重要临床弱点,并强调了需要更公平的评估框架。
NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks
Authors: Nandan Kumar Jha, Brandon Reagen
Venue: ICLR 2026
First: 2026-03-06T22:50:43+00:00 · Latest: 2026-03-16T17:30:30+00:00
Comments: Accepted to ICLR 2026. Project page: https://nerve-eigenspectrum.github.io
Abstract
We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales, and diverse architectural and optimizer configurations, each uniquely shaping FFN dynamics: normalization schemes controlling variance flow; FFN weight geometries constraining latent space; positional encoding and activation functions regulating information flow; and optimizer choices redistributing effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. It generalizes beyond transformers to MLP-Mixer architectures and provides actionable insights for architectural and optimizer choices beyond trial-and-error.
中文标题/摘要
标题:NerVE:大规模语言模型前馈网络非线性特征谱动力学
我们引入了NerVE,这是一种统一的特征谱框架,用于理解大规模语言模型(LLMs)中的前馈网络(FFNs)如何在高维潜空间中组织和调节信息流。尽管FFNs占据了大部分参数预算,但它们的高维动力学仍然知之甚少。NerVE 通过四种互补的度量标准(谱熵(分散性)、参与比(有效维度)、特征值早期富集(顶部重性)和Jensen-Shannon散度(分布变化))轻量级、内存高效地跟踪特征谱动力学来填补这一空白。我们的核心见解是,FFN的非线性重新注入了特征模式中的方差,从根本上控制了潜空间的利用,并且优化器几何形状强烈调节了这种方差重新注入的程度。我们在不同规模的模型、多样化的架构和优化器配置中验证了NerVE,每种配置都独特地塑造了FFN的动力学:归一化方案控制方差流动;FFN权重几何形状限制潜空间;位置编码和激活函数调节信息流动;以及优化器选择重新分配深度的有效容量。在这些设置中,NerVE 一致地恢复了稳定的特征谱签名,这些签名与模型的泛化能力相关,并且对设计选择的响应可预测,NerVE 能够超越变压器架构,适用于MLP-Mixer架构,提供了有关架构和优化器选择的可操作见解,超越了试错。
Summary / 总结
NerVE is a framework that uses four metrics—Spectral Entropy, Participation Ratio, Eigenvalue Early Enrichment, and Jensen-Shannon divergence—to track the dynamics of eigenspectrum in feed-forward networks of large language models. It reveals that nonlinearity in FFNs reinjects variance across eigenmodes, which is modulated by optimizer geometry. Experiments across various model scales and configurations show that NerVE can consistently identify spectral signatures that correlate with generalization ability, offering insights into architectural and optimizer choices.
NerVE 是一个框架,通过四个指标(谱熵、参与度比、特征值早期富集和杰森-香农散度)来跟踪大型语言模型(LLM)中前馈网络(FFN)的特征谱动态。它揭示了FFN非线性和优化器几何形状如何显著影响特征值中误差的分布,从而影响潜在维度的利用和模型泛化能力。在各种模型规模和配置下,NerVE 提供了一致的谱特征签名,与泛化能力相关,并且能够预测地响应架构和优化器选择。
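Three of the four metrics named in the entry above have standard textbook definitions, sketched below on a plain list of non-negative eigenvalues. This assumes those textbook forms; the paper's exact normalization may differ. Eigenvalue Early Enrichment is omitted because the abstract does not define it. All function names are illustrative.

```python
import math


def normalize(eigs):
    """Turn non-negative eigenvalues into a probability distribution."""
    total = sum(eigs)
    return [e / total for e in eigs]


def spectral_entropy(eigs):
    """Shannon entropy of the normalized spectrum (higher = more dispersed)."""
    return -sum(p * math.log(p) for p in normalize(eigs) if p > 0)


def participation_ratio(eigs):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues): effective dimensionality."""
    return sum(eigs) ** 2 / sum(e * e for e in eigs)


def js_divergence(eigs_a, eigs_b):
    """Jensen-Shannon divergence between two normalized spectra of equal length."""
    p, q = normalize(eigs_a), normalize(eigs_b)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


flat = [1.0, 1.0, 1.0, 1.0]    # variance spread evenly over 4 modes
peaked = [1.0, 0.0, 0.0, 0.0]  # all variance in one mode
print(participation_ratio(flat), participation_ratio(peaked))  # 4.0 1.0
print(spectral_entropy(flat))                                   # log(4)
```

A flat spectrum uses all four dimensions and a peaked one uses a single dimension. In the abstract's terms, variance reinjected across eigenmodes would move a spectrum from the peaked case toward the flat one.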
LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D
Authors: Lingteng Qiu, Peihao Li, Heyuan Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Rui Peng, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong
First: 2025-06-16T17:59:56+00:00 · Latest: 2026-03-16T17:28:33+00:00
Comments: HomePage: https://lingtengqiu.github.io/LHM++/ Online Demo: https://huggingface.co/spaces/Lingteng/LHMPP
Abstract
Reconstructing animatable 3D humans from casually captured images of articulated subjects without camera or pose information is highly practical but remains challenging due to view misalignment, occlusions, and the absence of structural priors. In this work, we present LHM++, an efficient large-scale human reconstruction model that generates high-quality, animatable 3D avatars within seconds from one or multiple pose-free images. At its core is an Encoder-Decoder Point-Image Transformer architecture that progressively encodes and decodes 3D geometric point features to improve efficiency, while fusing hierarchical 3D point features with image features through multimodal attention. The fused features are decoded into 3D Gaussian splats to recover detailed geometry and appearance. To further enhance visual fidelity, we introduce a lightweight 3D-aware neural animation renderer that refines the rendering quality of reconstructed avatars in real time. Extensive experiments show that our method produces high-fidelity, animatable 3D humans without requiring camera or pose annotations. Our code and project page are available at https://lingtengqiu.github.io/LHM++/
中文标题/摘要
标题:LHM++:一种高效的大型人体重建模型,用于生成无姿态图像的3D动画人体
从包含姿态变化的主体的随意拍摄图像中重建可动画的3D人体,无需相机或姿态信息,尽管具有很高的实用性,但由于视角不一致、遮挡和缺乏结构先验,仍然具有挑战性。本文中,我们提出了一种高效的大型人体重建模型LHM++,能够在几秒钟内从一张或多张无姿态图像中生成高质量、可动画的3D化身。其核心是一种编码器-解码器点-图像变换架构,该架构逐步编码和解码3D几何点特征以提高效率,同时通过多模态注意力融合层次3D点特征和图像特征。融合后的特征被解码为3D高斯斑点以恢复详细的几何形状和外观。为了进一步提高视觉保真度,我们引入了一种轻量级的3D感知神经动画渲染器,该渲染器可以实时优化重建化身的渲染质量。大量实验表明,我们的方法能够在无需相机或姿态注释的情况下生成高保真度、可动画的3D人体。我们的代码和项目页面可在https://lingtengqiu.github.io/LHM++/获取。
Summary / 总结
The research aims to reconstruct high-fidelity 3D humans from pose-free images without camera or pose information, addressing challenges like view misalignment and occlusions. The method uses an Encoder-Decoder Point-Image Transformer architecture to encode and decode 3D geometric point features, and a lightweight 3D-aware neural animation renderer to refine the rendering quality. Experiments demonstrate that LHM++ can generate detailed and animatable 3D avatars within seconds from one or multiple images. The code and project page are available online.
研究旨在从无姿态信息的图像中重建高保真3D人体,无需相机或姿态标注,解决视点错位和遮挡等问题。方法采用编码器-解码器点-图像变换架构来编码和解码3D几何点特征,并引入轻量级的3D感知神经动画渲染器以实时优化渲染质量。实验表明,LHM++可以从一张或多张图像中在几秒内生成详细且可动画化的3D角色。代码和项目页面已上线。
The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin
Venue: NeurIPS 2025
First: 2026-03-16T17:25:42+00:00 · Latest: 2026-03-16T17:25:42+00:00
Comments: 41 pages, 26 figures, 5 tables. NeurIPS 2025 Competition Track
Abstract
We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
中文标题/摘要
标题:PokeAgent挑战:大规模竞争与长时序学习
我们提出了PokeAgent挑战,这是一个基于宝可梦多智能体战斗系统和广阔角色扮演游戏(RPG)环境的大规模决策研究基准。部分可观测性、博弈论推理和长时序规划仍然是前沿AI领域的开放问题,但很少有基准能在现实条件下同时对这三项进行压力测试。PokeAgent通过两个互补的赛道来解决这些局限性:我们的战斗赛道要求在宝可梦战斗中进行部分可观测性的战略推理和泛化;我们的速通赛道则要求在宝可梦RPG中进行长时序规划和顺序决策。我们的战斗赛道提供了一个包含2000万以上战斗轨迹的数据集,以及一系列基于启发式、强化学习和大语言模型的基线,能够实现高水平的竞争性表现。我们的速通赛道提供了第一个标准化的RPG速通评估框架,包括一个开源的多智能体编排系统,用于模块化、可重复的基于缰绳的大语言模型方法比较。我们的NeurIPS 2025竞赛验证了我们资源的质量以及研究社区对宝可梦的兴趣,超过100支队伍在两个赛道中竞争,获胜解决方案在我们的论文中有详细说明。参赛提交和我们的基线揭示了通用型(大语言模型)、专业型(强化学习)和精英人类表现之间的巨大差距。与BenchPress评估矩阵的分析表明,宝可梦战斗几乎与标准的大语言模型基准无关,测量了现有套件未捕捉到的能力,并将宝可梦定位为一个未解决的基准,可以推动强化学习和大语言模型研究的发展。我们将其转换为一个活基准,提供了一个实时排行榜和速通赛道的独立评估,网址为https://pokeagentchallenge.com。
Summary / 总结
The PokeAgent Challenge is a large-scale benchmark for decision-making research based on Pokemon's multi-agent battle system and RPG environment. It addresses open problems in AI such as partial observability, game-theoretic reasoning, and long-horizon planning through two tracks: Battling Track for strategic reasoning and generalization under partial observability, and Speedrunning Track for long-horizon planning and sequential decision-making. The challenge provides a dataset of over 20 million battle trajectories and an open-source multi-agent orchestration system, with over 100 teams competing and highlighting significant gaps between different AI approaches. Analysis shows that Pokemon battling is distinct from standard LLM benchmarks, making it a valuable benchmark for advancing RL and LLM research.
PokeAgent挑战基于Pokemon的多代理战斗系统和RPG环境,是一个大规模的决策研究基准。它通过两个赛道解决AI中的开放问题,即部分可观测性、博弈论推理和长期规划:战斗赛道侧重于在部分可观测性下的策略推理,速度运行赛道侧重于RPG中的长期规划。挑战提供了2000多万场战斗轨迹的数据集和开源编排系统,导致LLM、RL和人类表现之间存在显著差距。NeurIPS 2025的竞赛验证了基准的质量和研究兴趣。分析表明,Pokemon战斗几乎与标准LLM基准无关,推动了RL和LLM研究的发展。
Convergence of Distributionally Robust Q-Learning with Linear Function Approximation
Authors: Saptarshi Mandal, Yashaswini Murthy, R. Srikant
First: 2025-10-02T07:01:41+00:00 · Latest: 2026-03-16T17:23:13+00:00
Comments: Preprint. 53 Pages
Abstract
Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. The goal is to maximize the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for DRRL are limited to tabular MDPs or are dependent on restrictive discount factor assumptions when function approximation is used. We present a convergence result for a robust Q-learning algorithm with linear function approximation without any discount factor restrictions. In this paper, the robustness is measured with respect to the total-variation distance uncertainty set. Our model free algorithm does not require generative access to the MDP and achieves an $\tilde{\mathcal{O}}(1/ε^{4})$ sample complexity for an $ε$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust Temporal-Difference (TD) learning with function approximation. The robust TD learning algorithm is discussed in the Appendix.
中文标题/摘要
标题:分布鲁棒Q学习的收敛性分析与线性函数逼近
分布鲁棒强化学习(DRRL)旨在设计在模型不确定性下能取得良好性能的策略。目标是最大化最坏情况下的长期折现回报,其中强化学习的数据来自名义模型,而部署的环境可以在预设的不确定性集合内偏离名义模型。现有的DRRL收敛性保证仅限于表格MDP,或者在使用函数逼近时依赖于严格的折现因子假设。我们提出了一个在没有折现因子限制的情况下使用线性函数逼近的鲁棒Q学习算法的收敛结果。本文中,鲁棒性是基于总变差距离不确定性集进行衡量的。我们的无模型算法不需要生成MDP的访问权限,并且对于$ε$-准确的价值估计实现了$\tilde{\mathcal{O}}(1/ε^{4})$的样本复杂度。我们的结果填补了鲁棒RL算法的实证成功与非鲁棒算法的非渐近保证之间的关键差距。本文中的关键思想也相对直接地扩展到函数逼近下的鲁棒时序差分(TD)学习。鲁棒TD学习算法在附录中讨论。
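To make the total-variation uncertainty set in the entry above concrete, the sketch below computes a worst-case expected value over all distributions within TV distance `delta` of a known nominal distribution. This is a tabular illustration only, not the paper's algorithm, which is model-free and uses linear function approximation. The function name and toy numbers are assumptions, and TV distance is taken as half the L1 distance.

```python
def worst_case_expectation(probs, values, delta):
    """Minimum of sum_i q_i * values_i over distributions q with TV(q, probs) <= delta.

    The minimizer moves up to `delta` probability mass from the
    highest-value outcomes onto the single lowest-value outcome.
    """
    q = list(probs)
    lowest = min(range(len(values)), key=lambda i: values[i])
    budget = delta
    # Visit outcomes from highest to lowest value, draining mass while budget remains.
    for i in sorted(range(len(values)), key=lambda i: values[i], reverse=True):
        if i == lowest or budget <= 0:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[lowest] += moved
        budget -= moved
    return sum(qi * vi for qi, vi in zip(q, values))


# Toy next-state distribution and next-state values (hypothetical numbers).
nominal = [0.5, 0.3, 0.2]
next_values = [1.0, 2.0, 4.0]
print(worst_case_expectation(nominal, next_values, 0.0))   # nominal expectation, about 1.9
print(worst_case_expectation(nominal, next_values, 0.25))  # about 1.25
```

In a tabular robust Bellman backup, this quantity would replace the ordinary expectation over next states: the target becomes the reward plus the discount factor times the worst-case expectation of the next-state values.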
EvoX: Meta-Evolution for Automated Discovery
Authors: Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, Ion Stoica
First: 2026-02-26T18:54:41+00:00 · Latest: 2026-03-16T17:22:57+00:00
Abstract
Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide the model toward new candidate solutions. Crucially, the effectiveness of this evolution process depends on the search strategy: how prior solutions are selected and varied to generate new candidates. However, most existing methods rely on fixed search strategies with predefined knobs (e.g., explore-exploit ratios) that remain static throughout execution. While effective in some settings, these approaches often fail to adapt across tasks, or even within the same task as the search space changes over time. We introduce EvoX, an adaptive evolution method that optimizes its own evolution process. EvoX jointly evolves candidate solutions and the search strategies used to generate them, continuously updating how prior solutions are selected and varied based on progress. This enables the system to dynamically shift between different search strategies during the optimization process. Across nearly 200 real-world optimization tasks, EvoX outperforms existing AI-driven evolutionary methods including AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on the majority of tasks.
中文标题/摘要
标题:EvoX:元进化以实现自动化发现
近期的工作,如AlphaEvolve表明,将LLM驱动的优化与进化搜索相结合,可以在跨领域的程序、提示和算法中有效提升性能。在此范式中,先前评估过的解决方案被重用以引导模型向新的候选解决方案发展。关键的是,这一进化过程的有效性取决于搜索策略:如何选择和变异先前的解决方案以生成新的候选者。然而,大多数现有方法依赖于固定不变的搜索策略,这些策略在执行过程中保持不变。虽然在某些情况下这些方法是有效的,但它们往往无法跨任务或在搜索空间随时间变化时在同一个任务中进行调整。我们引入了EvoX,一种自适应进化方法,优化其自身的进化过程。EvoX同时进化候选解决方案及其生成策略,根据进展不断更新如何选择和变异先前的解决方案。这使得系统在优化过程中能够动态地在不同的搜索策略之间切换。在近200个实际优化任务中,EvoX在大多数任务上优于现有的基于AI的进化方法,包括AlphaEvolve、OpenEvolve、GEPA和ShinkaEvolve。
Summary / 总结
EvoX is an adaptive evolution method that jointly evolves candidate solutions and the search strategies used to generate them, allowing for dynamic adjustment of search strategies based on progress. This approach outperforms existing AI-driven evolutionary methods on most of nearly 200 real-world optimization tasks.
EvoX 是一种自适应进化方法,通过同时进化候选解决方案及其生成策略来优化自身的进化过程。这使系统能够根据进度动态调整其搜索策略。在近200个实际优化任务中,EvoX 在大多数任务上都优于现有的基于AI的进化方法,如AlphaEvolve、OpenEvolve、GEPA和ShinkaEvolve。
Panoramic Affordance Prediction
Authors: Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen
First: 2026-03-16T17:21:49+00:00 · Latest: 2026-03-16T17:21:49+00:00
Abstract
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
中文标题/摘要
标题:全景功能预测
功能预测是将感知与行动在具身人工智能中联系起来的关键桥梁。然而,现有的研究局限于针孔相机模型,这些模型视野狭窄且观察片段化,经常缺失关键的整体环境背景。在本文中,我们首次探索全景功能预测,利用360度图像捕捉全局空间关系和整体场景理解。为了促进这一新型任务,我们首先引入了PAP-12K,这是一个大规模基准数据集,包含超过1,000张超高分辨率(12k,11904 x 5952)全景图像,以及超过12k个仔细标注的问答对和功能掩码。此外,我们提出了PAP,一种无需训练、从粗到细的管道,灵感来源于人类的中心视觉系统,以应对全景图像中的超高清分辨率和严重的几何失真。PAP 通过递归视觉路由和网格提示逐步定位目标,应用自适应凝视机制校正局部几何失真,并利用级联定位管道提取精确的实例级掩码。在PAP-12K上的实验结果表明,现有的设计用于标准视角图像的功能预测方法在全景视觉的独特挑战面前表现严重下降并失败。相比之下,PAP框架有效地克服了这些障碍,显著优于最先进的基线方法,突显了全景感知在稳健的具身智能中的巨大潜力。
Summary / 总结
This paper addresses the limitation of existing affordance prediction methods by introducing Panoramic Affordance Prediction, which uses 360-degree imagery to capture holistic scene understanding. The authors propose PAP, a coarse-to-fine pipeline that includes recursive visual routing, an adaptive gaze mechanism, and a cascaded grounding pipeline to handle the unique challenges of panoramic images. Experimental results show that PAP outperforms existing methods on the PAP-12K dataset, demonstrating the potential of panoramic perception for embodied AI.
研究旨在解决现有依赖针孔相机模型且视野狭窄的抓取预测方法的局限性。为此,作者引入了PAP-12K,这是一个包含大量全景图像的大规模数据集,并提出了PAP,一种无需训练、从粗到细的管道,使用网格提示和自适应注视机制来处理全景视觉的独特挑战。实验表明,PAP在PAP-12K数据集上优于现有方法,展示了全景感知在稳健的体感智能中的巨大潜力。
Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang
First: 2026-03-16T17:20:38+00:00 · Latest: 2026-03-16T17:20:38+00:00
Abstract
Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
中文标题/摘要
标题:谎言剖析:视觉-语言模型中幻觉的多阶段诊断框架
视觉-语言模型(VLMs)经常“幻觉”——生成看似合理但实际上不正确的陈述,这构成了它们可靠部署的关键障碍。在本文中,我们提出了一种新的诊断范式,将幻觉重新定义为模型计算认知动态病态。我们的框架基于计算理性规范原则,使我们能够将VLM的生成建模为动态认知轨迹。我们设计了一套信息论探针,将此轨迹投影到可解释的低维认知状态空间中。我们的主要发现是一种我们称之为几何-信息二元性的原则:此空间中认知轨迹的几何异常本质上等同于其高信息论惊讶度。幻觉检测被视为几何异常检测问题。在从严格的二元问答(POPE)和全面推理(MME)到不受限制的开放生成(MS-COCO)的各种场景中,我们的框架均实现了最先进的性能。关键的是,它在弱监督下高效运行,并且即使校准数据严重污染,也保持高度鲁棒性。这种方法使我们能够对失败进行因果归因,将可观察的错误映射到不同的病理状态:感知不稳定性(通过感知熵测量)、逻辑因果失败(通过推理冲突测量)和决策模糊性(通过决策熵测量)。最终,这为构建透明、可审计和可诊断的AI系统开辟了道路。
Summary / 总结
The research aims to address the issue of hallucinations in Vision-Language Models (VLMs) by proposing a new diagnostic framework. The method involves modeling VLM generation as a dynamic cognitive trajectory and using information-theoretic probes to project this trajectory into a low-dimensional Cognitive State Space. The key finding is the geometric-information duality, where geometric anomalies in the space correspond to high information-theoretic surprisal, enabling effective hallucination detection. The framework demonstrates state-of-the-art performance across various settings and is robust under weak supervision and contaminated calibration data, providing causal attribution of failures into perceptual, logical-causal, and decisional states.
本文提出了一种多阶段诊断框架,用于追踪视觉-语言模型(VLM)中的幻觉,将幻觉视为动态病理现象。该框架将VLM生成建模为低维空间中的认知轨迹,通过几何异常来识别幻觉。关键发现包括几何-信息二重性,即轨迹几何异常与高信息论意外相对应。该框架在各种任务中表现良好,并在弱监督下具有鲁棒性,能够将观察到的错误归因于感知不稳定性、逻辑因果失败和决策不确定性。这种方法增强了VLM的透明度和可诊断性。
Learning Latent Proxies for Controllable Single-Image Relighting
Authors: Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun Li, Xiaogang Xu, Harry Yang
First: 2026-03-16T17:16:59+00:00 · Latest: 2026-03-16T17:16:59+00:00
Comments: Accepted by CVPR 2026
Abstract
Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object and scene level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
中文标题/摘要
标题:学习隐含代理以实现可控的单图像光照调整
单图像光照调整高度欠约束:微小的光照变化可以产生显著的非线性阴影、高光和反光变化,而几何形状和材料则未被观察到。现有的基于扩散的方法要么依赖于内在或G缓冲区管道,需要密集且脆弱的监督,要么纯粹在潜在空间中操作而缺乏物理基础,使得方向、强度和颜色的精细控制不可靠。我们观察到,完整的内在分解对于准确的光照调整是不必要的且冗余的。相反,稀疏但物理上有意义的线索,指示光照应如何变化以及材料应如何响应,足以引导扩散模型。基于这一洞察,我们引入了LightCtrl,它在两个层次上整合了物理先验:一个少量样本的潜在代理编码器,从有限的PBR监督中提取紧凑的材料-几何线索,以及一个光照感知的掩码,识别敏感的光照区域并引导去噪器朝向与光照相关的像素。为了弥补稀缺的PBR数据,我们使用基于DPO的目标函数细化代理分支,以确保预测线索的物理一致性。我们还提出了ScaLight,这是一个大规模的对象级数据集,具有系统变化的光照和完整的相机-光源元数据,使物理一致且可控的训练成为可能。在对象和场景级别的基准测试中,我们的方法实现了光度学上忠实的光照调整,并实现了准确的连续控制,超越了先前的基于扩散和内在的方法,包括在受控光照变化下的PSNR提高2.4 dB和RMSE降低35%。
Summary / 总结
The paper addresses the challenge of single-image relighting by introducing LightCtrl, which uses a few-shot latent proxy encoder and a lighting-aware mask to guide a diffusion model. This approach leverages sparse physical cues to achieve accurate relighting without the need for dense supervision. The method outperforms previous diffusion and intrinsic-based approaches, achieving higher PSNR and lower RMSE under controlled lighting shifts.
论文通过引入LightCtrl方法,使用少量样本的潜空间代理编码器和照明感知掩码来引导扩散模型,整合物理先验以提取紧凑的材料-几何线索并识别敏感的照明区域,实现了光度忠实的重新照明,并且具有准确的连续控制,相比之前的方法在受控光照变化下的PSNR提高了最多2.4 dB,RMSE降低了35%。
Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Authors: Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji
First: 2025-06-07T02:41:54+00:00 · Latest: 2026-03-16T17:16:37+00:00
Abstract
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
中文标题/摘要
标题:从易到难任务的课程强化学习提高LLM推理能力
我们旨在通过强化学习(RL)提高语言模型的推理能力。最近的RL后训练模型如DeepSeek-R1在数学和编程任务上展示了推理能力。然而,先前的研究表明,仅使用RL来提高难以推理任务的推理能力效果较差。在这里,我们借鉴了课程学习的理念,提出从易到难(E2H)调度任务,使LLMs能够逐步建立推理技能。我们的方法称为E2H推理器。实证上,我们观察到,虽然初始阶段容易的任务很重要,但通过适当的调度逐渐淡化它们对于防止过拟合至关重要。理论上,我们在一个近似策略迭代框架内为E2H推理器建立了收敛保证。我们推导出有限样本复杂性界,并表明当任务适当分解和条件化时,通过课程阶段学习所需的总样本数少于直接学习。跨多个领域的实验表明,E2H推理器显著提高了小规模LLM(1.5B到3B)的推理能力,这些模型仅使用传统的RL训练时会遇到困难,突显了我们方法的有效性。我们的代码可以在https://github.com/divelab/E2H-Reasoning找到。
Summary / 总结
The study aims to improve the reasoning abilities of language models with reinforcement learning (RL), drawing on curriculum learning. The proposed method, E2H Reasoner, schedules tasks from easy to hard so that LLMs build reasoning skills gradually. Empirically, easy tasks matter early in training, but fading them out through appropriate scheduling is essential to prevent overfitting. Across multiple domains, E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone. Theoretical analysis within an approximate policy iteration framework provides convergence guarantees and finite-sample complexity bounds, showing that curriculum stages require fewer total samples than direct learning when tasks are appropriately decomposed and conditioned.
研究旨在通过强化学习(RL)提高语言模型的推理能力,借鉴了课程学习的思想。提出的E2H Reasoner方法按从易到难的顺序安排任务,使LLM逐步发展推理技能。实验表明,简单任务在训练初期很重要,但通过适当的调度逐渐淡出简单任务对防止过拟合至关重要;该方法显著提高了小型LLM(1.5B到3B)的推理能力,而这些模型仅用常规RL训练时表现不佳。理论分析在近似策略迭代框架内给出了收敛保证和有限样本复杂度界,并表明当任务被适当分解和条件化时,课程式学习所需的总样本数少于直接学习。
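The easy-to-hard scheduling idea can be illustrated with a small sketch. The abstract does not specify the scheduler, so everything here is an assumption made for illustration: the function names `e2h_weights` and `sample_level`, the Gaussian-shaped weighting, and the `sharpness` parameter are hypothetical and are not taken from the E2H Reasoner code. The sketch only shows the qualitative behavior described above: easy tasks dominate early and are faded out as training progresses.

```python
import math
import random


def e2h_weights(progress, num_levels, sharpness=4.0):
    """Return sampling weights over difficulty levels 0 (easy) .. num_levels-1 (hard).

    progress is the fraction of training completed, in [0, 1]. The peak of the
    distribution moves from the easiest level to the hardest one, so easy tasks
    are sampled often at the start and rarely at the end.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    center = progress * (num_levels - 1)
    # Hypothetical Gaussian-shaped score around the current target difficulty.
    scores = [
        math.exp(-sharpness * (level - center) ** 2 / num_levels)
        for level in range(num_levels)
    ]
    total = sum(scores)
    return [score / total for score in scores]


def sample_level(progress, num_levels, rng):
    """Draw one difficulty level according to the schedule."""
    weights = e2h_weights(progress, num_levels)
    return rng.choices(range(num_levels), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for progress in (0.0, 0.5, 1.0):
        weights = e2h_weights(progress, num_levels=3)
        print(progress, [round(w, 3) for w in weights], sample_level(progress, 3, rng))
```

In an RL post-training loop, a scheduler of this kind would pick the difficulty bucket for each batch of prompts; a larger `sharpness` fades easy tasks out faster.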
InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems
Authors: Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu
First: 2026-03-16T17:06:37+00:00 · Latest: 2026-03-16T17:06:37+00:00
Comments: 35 pages, 3 figures
Abstract
Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.
中文标题/摘要
标题:InterveneBench:评估大型语言模型在实际社会系统中进行干预推理和因果研究设计的能力
社会科学中的因果推断依赖于基于实际政策干预的端到端、干预中心的研究设计推理,但当前的基准测试未能评估大型语言模型(LLMs)的这种能力。我们提出了InterveneBench,一个旨在评估这种推理能力的基准测试。InterveneBench中的每个实例都源自实证的社会科学研究,并要求模型在没有预先定义的因果图或结构方程的情况下推理政策干预和识别假设。InterveneBench包含来自不同政策领域的744篇同行评审研究。实验结果表明,最先进的LLMs在这种情况下表现不佳。为了解决这一局限性,我们进一步提出了一种多智能体框架STRIDES。它在最先进的推理模型上实现了显著的性能提升。我们的代码和数据可在https://github.com/Sii-yuning/STRIDES获取。
Summary / 总结
InterveneBench is a benchmark for evaluating how well large language models reason about policy interventions and causal study design in real-world social settings. It comprises 744 peer-reviewed studies across diverse policy domains; each instance is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without predefined causal graphs or structural equations. State-of-the-art LLMs struggle in this setting, while the proposed multi-agent framework STRIDES achieves significant improvements over state-of-the-art reasoning models.
InterveneBench 是一个基准,用于评估大型语言模型在实际社会系统中进行干预和因果研究设计的能力。它包含来自不同政策领域的 744 项研究,要求模型在没有预定义因果图的情况下推理政策干预和识别假设。最先进的 LLM 表现不佳,但提出的多智能体框架 STRIDES 显著提高了性能。
DOT: Dynamic Knob Selection and Online Sampling for Automated Database Tuning
Authors: Yifan Wang, Debabrota Basu, Pierre Bourhis, Romain Rouvoy, Patrick Royer
First: 2026-03-16T17:05:34+00:00 · Latest: 2026-03-16T17:05:34+00:00
Abstract
Database Management Systems (DBMS) are crucial for efficient data management and access control, but their administration remains challenging for Database Administrators (DBAs). Tuning, in particular, is known to be difficult. Modern systems have many tuning parameters, but only a subset significantly impacts performance. Focusing on these influential parameters reduces the search space and optimizes performance. Current methods rely on costly warm-up phases and human expertise to identify important tuning parameters. In this paper, we present DOT, a dynamic knob selection and online sampling DBMS tuning algorithm. DOT uses Recursive Feature Elimination with Cross-Validation (RFECV) to prune low-importance tuning parameters and a Likelihood Ratio Test (LRT) strategy to balance exploration and exploitation. For parameter search, DOT uses a Bayesian Optimization (BO) algorithm to optimize configurations on-the-fly, eliminating the need for warm-up phases or prior knowledge (although existing knowledge can be incorporated). Experiments show that DOT achieves matching or outperforming performance compared to state-of-the-art tuners while substantially reducing tuning overhead.
中文标题/摘要
标题:DOT:动态旋钮选择和在线采样以实现自动数据库调优
数据库管理系统(DBMS)对于高效的数据管理和访问控制至关重要,但其管理对于数据库管理员(DBAs)来说仍然具有挑战性。调优,特别是,被认为是困难的。现代系统有许多调优参数,但只有其中一部分显著影响性能。专注于这些有影响力的参数可以减少搜索空间并优化性能。当前的方法依赖于昂贵的预热阶段和人类的专业知识来识别重要的调优参数。在本文中,我们提出了DOT,一种动态旋钮选择和在线采样DBMS调优算法。DOT使用递归特征消除与交叉验证(RFECV)来修剪低重要性的调优参数,并使用似然比检验(LRT)策略来平衡探索和利用。对于参数搜索,DOT使用贝叶斯优化(BO)算法实时优化配置,消除了预热阶段或先验知识的需求(尽管现有的知识可以被纳入)。实验表明,DOT在与最先进的调优器匹配或超越性能的同时,显著减少了调优开销。
Summary / 总结
The paper presents DOT, a dynamic knob selection and online sampling algorithm for automated database tuning. DOT uses Recursive Feature Elimination with Cross-Validation (RFECV) to prune low-importance tuning parameters and a Likelihood Ratio Test (LRT) strategy to balance exploration and exploitation. It searches configurations on the fly with Bayesian Optimization, removing the need for warm-up phases or prior knowledge, although existing knowledge can be incorporated. Experiments show that DOT matches or outperforms state-of-the-art tuners while substantially reducing tuning overhead.
论文提出了DOT,一种动态旋钮选择和在线采样算法,用于自动化数据库调优。DOT 使用 RFECV 剔除不重要的调优参数,并使用 LRT 平衡探索和利用。它采用贝叶斯优化实时优化配置,无需预热阶段或先验知识。实验表明,DOT 在减少调优开销的同时,能够达到或超越现有最佳调优器的性能。
Vib2ECG: A Paired Chest-Lead SCG-ECG Dataset and Benchmark for ECG Reconstruction
Authors: Guorui Lu, Xiaohui Cai, Todor Stefanov, Qinyu Chen
First: 2026-03-16T17:05:27+00:00 · Latest: 2026-03-16T17:05:27+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Twelve-lead electrocardiography (ECG) is essential for cardiovascular diagnosis, but its long-term acquisition in daily life is constrained by complex and costly hardware. Recent efforts have explored reconstructing ECG from low-cost cardiac vibrational signals such as seismocardiography (SCG), however, due to the lack of a dataset, current methods are limited to limb leads, while clinical diagnosis requires multi-lead ECG, including chest leads. In this work, we propose Vib2ECG, the first paired, multi-channel electro-mechanical cardiac signal dataset, which includes complete twelve-lead ECGs and vibrational signals acquired by inertial measurement units (IMUs) at six chest-lead positions from 17 subjects. Based on this dataset, we also provide a benchmark. Experimental results demonstrate the feasibility of reconstructing electrical cardiac signals at variable locations from vibrational signals using a lightweight 364 K-parameter U-Net. Furthermore, we observe a hallucination phenomenon in the model, where ECG waveforms are generated in regions where no corresponding electrical activity is present. We analyze the causes of this phenomenon and propose potential directions for mitigation. This study demonstrates the feasibility of mobile-device-friendly ECG monitoring through chest-lead ECG prediction from low-cost vibrational signals acquired using IMU sensors. It expands the application of cardiac vibrational signals and provides new insights into the spatial relationship between cardiac electrical and mechanical activities with spatial location variation.
中文标题/摘要
标题:Vib2ECG:首个配对的胸导联SCG-ECG数据集及ECG重建基准
十二导联心电图(ECG)对于心血管诊断至关重要,但在日常生活中其长期获取受到复杂且昂贵的硬件限制。近期研究探索了从低成本心脏振动信号(如心震图,SCG)重建ECG的方法,但由于缺乏数据集,当前方法仅限于肢体导联,而临床诊断需要多导联ECG,包括胸导联。在本研究中,我们提出了Vib2ECG,这是首个配对的多通道电-机械心脏信号数据集,包含17名受试者的完整十二导联ECG以及由惯性测量单元(IMU)在六个胸导联位置采集的振动信号。基于此数据集,我们还提供了基准。实验结果表明,使用仅有364K参数的轻量级U-Net从振动信号重建不同位置的心脏电信号是可行的。此外,我们观察到模型中存在幻觉现象,即在没有相应电活动的区域生成ECG波形。我们分析了这一现象的原因,并提出了潜在的缓解方向。本研究证明了利用IMU传感器采集的低成本振动信号预测胸导联ECG、从而实现移动设备友好型ECG监测的可行性,扩展了心脏振动信号的应用,并为心脏电活动与机械活动随空间位置变化的空间关系提供了新见解。
Summary / 总结
This study introduces Vib2ECG, the first paired, multi-channel electro-mechanical cardiac signal dataset: complete twelve-lead ECGs plus vibrational signals recorded by inertial measurement units (IMUs) at six chest-lead positions from 17 subjects. Using a lightweight U-Net with 364 K parameters, the accompanying benchmark demonstrates that electrical cardiac signals at variable locations can be reconstructed from vibrational signals. The authors also identify a hallucination phenomenon, in which ECG waveforms are generated in regions with no corresponding electrical activity, analyze its causes, and propose potential directions for mitigation. The work points toward mobile-device-friendly ECG monitoring with low-cost IMU sensors.
研究旨在通过开发一个数据集和基准,解决长时间12导联ECG获取的限制,以从胸部振动信号重建ECG。Vib2ECG包括17名受试者在六个胸导联位置的完整12导联ECG和振动信号。研究证明,使用具有364 K参数的轻量级U-Net模型可以从振动信号重建心电图信号,但也发现了在没有相应电活动的区域生成心电图波形的现象。该研究提供了心脏电活动和机械活动空间关系的见解,并扩展了使用IMU传感器获取的低成本振动信号进行移动设备友好型ECG监测的应用。
Near-Equilibrium Propagation training in nonlinear wave systems
Authors: Karol Sajnok, Michał Matuszewski
First: 2025-10-17T15:03:07+00:00 · Latest: 2026-03-16T16:56:16+00:00
Comments: 7 figures
Abstract
Backpropagation learning algorithm, the workhorse of modern artificial intelligence, is notoriously difficult to implement in physical neural networks. Equilibrium Propagation (EP) is an alternative with comparable efficiency and strong potential for in-situ training. We extend EP learning to both discrete and continuous complex-valued wave systems. In contrast to previous EP implementations, our scheme is valid in the weakly dissipative regime, and readily applicable to a wide range of physical settings, even without well defined nodes, where trainable inter-node connections can be replaced by trainable local potential. We test the method in driven-dissipative exciton-polariton condensates governed by generalized Gross-Pitaevskii dynamics. Numerical studies on standard benchmarks, including a simple logical task and handwritten-digit recognition, demonstrate stable convergence, establishing a practical route to in-situ learning in physical systems in which system control is restricted to local parameters.
中文标题/摘要
标题:非线性波系统中的近平衡传播训练
反向传播学习算法是现代人工智能的主力工具,但在物理神经网络中却是出了名地难以实现。平衡传播(EP)是一种效率相当、且在原位训练方面潜力巨大的替代方案。我们将EP学习扩展到离散和连续的复值波系统。与之前的EP实现不同,我们的方案在弱耗散区域内有效,并且可以直接应用于各种物理设置,即使在没有明确定义节点的情况下也是如此,此时可训练的节点间连接可以由可训练的局部势来替代。我们在由广义Gross-Pitaevskii动力学描述的驱动耗散激子-极化子凝聚体中测试了该方法。在标准基准(包括一个简单的逻辑任务和手写数字识别)上的数值研究展示了稳定的收敛性,为系统控制仅限于局部参数的物理系统实现原位学习提供了一条实用途径。
Summary / 总结
The research addresses the difficulty of implementing backpropagation in physical neural networks by extending Equilibrium Propagation (EP) learning to discrete and continuous complex-valued wave systems. Unlike previous EP implementations, the scheme is valid in the weakly dissipative regime and applies even to systems without well-defined nodes, where trainable inter-node connections can be replaced by a trainable local potential. Numerical studies of driven-dissipative exciton-polariton condensates governed by generalized Gross-Pitaevskii dynamics, on benchmarks including a simple logical task and handwritten-digit recognition, show stable convergence, indicating a practical route to in-situ learning when system control is restricted to local parameters.
研究旨在通过将Equilibrium Propagation (EP)学习算法扩展到离散和连续的复值波系统,解决在物理神经网络中实现反向传播的难题。该方案在弱耗散区域内有效,并且适用于没有明确节点的系统,此时可训练的节点间连接可由可训练的局部势替代。在驱动耗散激子-极化子凝聚体上针对标准基准(简单逻辑任务和手写数字识别)的数值研究表明,该方法能够稳定收敛,为控制仅限于局部参数的物理系统提供了原位学习的实用途径。
HAMLOCK: HArdware-Model LOgically Combined attacK
Authors: Sanskar Amgain, Daniel Lobo, Atri Chatterjee, Swarup Bhunia, Fnu Suya
First: 2025-10-22T00:31:49+00:00 · Latest: 2026-03-16T16:55:18+00:00
Comments: Accepted to USENIX Security 2026
Abstract
The growing use of third-party hardware accelerators (e.g., FPGAs, ASICs) for deep neural networks (DNNs) introduces new security vulnerabilities. Conventional model-level backdoor attacks, which only poison a model's weights to misclassify inputs with a specific trigger, are often detectable because the entire attack logic is embedded within the model (i.e., software), creating a traceable layer-by-layer activation path. This paper introduces the HArdware-Model Logically Combined Attack (HAMLOCK), a far stealthier threat that distributes the attack logic across the hardware-software boundary. The software (model) is now only minimally altered by tuning the activations of few neurons to produce uniquely high activation values when a trigger is present. A malicious hardware Trojan detects those unique activations by monitoring the corresponding neurons' most significant bit or the 8-bit exponents and triggers another hardware Trojan to directly manipulate the final output logits for misclassification. This decoupled design is highly stealthy, as the model itself contains no complete backdoor activation path as in conventional attacks and hence, appears fully benign. Empirically, across benchmarks like MNIST, CIFAR10, GTSRB, and ImageNet, HAMLOCK achieves a near-perfect attack success rate with a negligible clean accuracy drop. More importantly, HAMLOCK circumvents the state-of-the-art model-level defenses without any adaptive optimization. The hardware Trojan is also undetectable, incurring area and power overheads as low as 0.01%, which is easily masked by process and environmental noise. Our findings expose a critical vulnerability at the hardware-software interface, demanding new cross-layer defenses against this emerging threat.
中文标题/摘要
标题:HAMLOCK:硬件与模型逻辑结合攻击
随着第三方硬件加速器(例如FPGA、ASIC)在深度神经网络(DNN)中的广泛应用,新的安全漏洞也随之增加。传统的基于模型的后门攻击仅通过毒化模型权重来使特定触发器下的输入误分类,这些攻击通常可以被检测到,因为整个攻击逻辑都嵌入在模型(即软件)中,形成可追踪的逐层激活路径。 本文介绍了一种名为HArdware-Model Logically Combined Attack(HAMLOCK)的更为隐蔽的威胁,它将攻击逻辑分布在硬件与软件边界之间。软件(模型)仅通过调整少数神经元的激活值来产生在触发器存在时的独特高激活值。恶意硬件特洛伊木马通过监控相应神经元的最高有效位或8位指数来检测这些独特激活值,并触发另一个硬件特洛伊木马来直接操纵最终输出的logits以实现误分类。 这种解耦设计非常隐蔽,因为模型本身不包含完整的后门激活路径,因此看起来完全无害。在MNIST、CIFAR10、GTSRB和ImageNet等基准测试中,HAMLOCK几乎实现了完美的攻击成功率,同时保持了微小的准确率下降。更重要的是,HAMLOCK能够绕过最先进的基于模型的防御措施,无需任何适应性优化。硬件特洛伊木马也是不可检测的,其面积和功耗开销低至0.01%,这可以通过工艺和环境噪声轻松掩盖。我们的研究揭示了硬件与软件接口处的一个关键漏洞,要求开发新的跨层防御措施以应对这种新兴威胁。
Summary / 总结
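HAMLOCK is a hardware-software attack that distributes backdoor logic across the hardware-software boundary, making it far stealthier than conventional model-level backdoors. The model is only minimally altered: the activations of a few neurons are tuned to produce uniquely high values when a trigger is present. A malicious hardware Trojan detects these activations by monitoring the corresponding neurons' most significant bit or 8-bit exponents, and triggers a second hardware Trojan that directly manipulates the final output logits to cause misclassification. Across MNIST, CIFAR10, GTSRB, and ImageNet, HAMLOCK achieves a near-perfect attack success rate with a negligible drop in clean accuracy and circumvents state-of-the-art model-level defenses without adaptive optimization. The hardware Trojan incurs area and power overheads as low as 0.01%, which process and environmental noise easily mask.
HAMLOCK 是一种新型的硬件-软件攻击,通过在硬件-软件边界上分散后门逻辑,使其极具隐蔽性。软件模型仅被轻微改动:调整少数神经元,使其在触发器存在时产生独特的高激活值。恶意硬件特洛伊木马随后检测这些激活值,并操纵输出logits以实现误分类。HAMLOCK 实现了近乎完美的攻击成功率,干净准确率的下降可以忽略不计,并绕过了最先进的模型级防御。此外,硬件木马难以被检测,其面积和功耗开销低至0.01%。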
Building Trust in PINNs: Error Estimation through Finite Difference Methods
Authors: Aleksander Krasowski, René P. Klausen, Aycan Celik, Sebastian Lapuschkin, Wojciech Samek, Jonas Naujoks
First: 2026-03-16T16:51:42+00:00 · Latest: 2026-03-16T16:51:42+00:00
Abstract
Physics-informed neural networks (PINNs) constitute a flexible deep learning approach for solving partial differential equations (PDEs), which model phenomena ranging from heat conduction to quantum mechanical systems. Despite their flexibility, PINNs offer limited insight into how their predictions deviate from the true solution, hindering trust in their prediction quality. We propose a lightweight post-hoc method that addresses this gap by producing pointwise error estimates for PINN predictions, which offer a natural form of explanation for such models, identifying not just whether a prediction is wrong, but where and by how much. For linear partial differential equations, the error between a PINN approximation and the true solution satisfies the same differential operator as the original problem, but driven by the PINN's PDE residual as its source term. We solve this error equation numerically using finite difference methods requiring no knowledge of the true solution. Evaluated on several benchmark PDEs, our method yields accurate error maps at low computational cost, enabling targeted and interpretable validation of PINNs.
中文标题/摘要
标题:通过有限差分方法估计误差以建立PINNs的信任
物理知情神经网络(PINNs)是一种灵活的深度学习方法,用于解决偏微分方程(PDEs),这些方程可以模拟从热传导到量子力学系统的各种现象。尽管具有灵活性,但PINNs对预测与真实解之间的偏差提供有限的洞察,阻碍了对其预测质量的信任。我们提出了一种轻量级的后处理方法,通过为PINN预测生成点误差估计来填补这一空白,这种估计为这些模型提供了一种自然形式的解释,不仅指出预测是否错误,还指出错误的具体位置和程度。对于线性偏微分方程,PINN近似与真实解之间的误差满足与原始问题相同的微分算子,但以PINN的PDE残差为源项。我们使用有限差分方法数值求解此误差方程,无需知道真实解。在多个基准PDE上评估,我们的方法以较低的计算成本提供准确的误差图,使PINNs的验证更加有针对性和可解释性。
Summary / 总结
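The research addresses the limited insight that physics-informed neural networks (PINNs) give into how far their predictions deviate from the true solution of a partial differential equation (PDE), which hinders trust in them. It proposes a lightweight post-hoc method that produces pointwise error estimates, showing not just whether a prediction is wrong but where and by how much. For linear PDEs, the error between the PINN approximation and the true solution satisfies the same differential operator as the original problem, with the PINN's PDE residual as its source term; the method solves this error equation numerically with finite differences and needs no knowledge of the true solution. On several benchmark PDEs it yields accurate error maps at low computational cost.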
研究旨在通过提供PINN预测的误差估计来增强对其的信任。方法使用有限差分方法求解由PINN的PDE残差驱动的误差方程,无需知道真实解即可获得点误差图。在多种基准PDE上的实验表明,该方法以较低的计算成本提供准确的误差图,有助于PINN的针对性验证。
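The error equation behind this method follows from linearity: if L u = f for the true solution u, and the PINN output û has residual r = L û - f, then the error e = û - u satisfies L e = r (with boundary values given by û minus the boundary data). The sketch below checks this on a 1D Poisson problem, -u'' = f on (0, 1) with zero boundary values. It is a toy under stated assumptions, not the authors' implementation: the "PINN" is replaced by a hypothetical closed-form approximation `u_hat`, its second derivative is written out by hand where a real PINN would use automatic differentiation, and all function names are made up for illustration.

```python
import math


def f(x):
    """Source term of -u'' = f whose true solution is u(x) = sin(pi x)."""
    return math.pi ** 2 * math.sin(math.pi * x)


def u_hat_dd(x):
    """Second derivative of the stand-in approximation
    u_hat(x) = sin(pi x) + 0.05 sin(3 pi x) (hypothetical, not a trained PINN)."""
    return (-math.pi ** 2 * math.sin(math.pi * x)
            - 0.05 * 9 * math.pi ** 2 * math.sin(3 * math.pi * x))


def solve_poisson_fd(rhs, h):
    """Solve -e'' = rhs at interior nodes with e = 0 at both ends.

    Uses the standard 3-point stencil and the Thomas algorithm for the
    resulting tridiagonal system (-1, 2, -1) e = h^2 rhs.
    """
    n = len(rhs)
    a, b, c = -1.0, 2.0, -1.0
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c / b
    dp[0] = h * h * rhs[0] / b
    for i in range(1, n):
        denom = b - a * cp[i - 1]
        cp[i] = c / denom
        dp[i] = (h * h * rhs[i] - a * dp[i - 1]) / denom
    e = [0.0] * n
    e[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        e[i] = dp[i] - cp[i] * e[i + 1]
    return e


def estimate_error(n=200):
    """Return (estimated error, true error) at the n-1 interior grid nodes."""
    h = 1.0 / n
    xs = [i * h for i in range(1, n)]
    # The residual uses only the approximation and the source term.
    residual = [-u_hat_dd(x) - f(x) for x in xs]
    e_est = solve_poisson_fd(residual, h)
    # The true error is known here only because the toy problem is analytic.
    e_true = [0.05 * math.sin(3 * math.pi * x) for x in xs]
    return e_est, e_true


if __name__ == "__main__":
    e_est, e_true = estimate_error()
    gap = max(abs(a - b) for a, b in zip(e_est, e_true))
    print("max |estimated error - true error| =", gap)
```

The estimate uses only the residual, so the same procedure applies when the true solution is unknown; the comparison against `e_true` is there purely to validate the sketch.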