arXiv 论文速递

2026-03-17 04:01
Snapshot: 20260317_0401
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Authors: Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov, Abdul Ahad Butt, Gül Varol, Pascal Fua, Fabio Pizzati, Ivan Laptev
First: 2026-03-13T17:59:59+00:00 · Latest: 2026-03-13T17:59:59+00:00
Abstract
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may expose substantial deviations from original motion. To address this issue, we here propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate WBC into our training pipeline and optimize diffusion model such that the output of WBC becomes compliant both with physics and original text instructions. To train PhysMoDPO we deploy physics-based and task-specific rewards and use them to assign preference to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and for real-world deployment on a G1 humanoid robot.
中文标题/摘要
标题:PhysMoDPO:基于物理的人形运动生成与偏好优化
文本条件下的真人运动生成最近的进展主要得益于在大规模真人运动数据上训练的扩散模型。在此基础上,最近的方法试图通过应用全身控制器(WBC)将扩散生成的运动转换为可执行轨迹,从而将这些模型应用于角色动画和实际机器人控制。虽然WBC轨迹变得符合物理,但它们可能会与原始运动存在显著偏差。为了解决这一问题,我们在此提出PhysMoDPO,一种直接偏好优化框架。与之前依赖手工构建的物理感知启发式方法(如脚滑动惩罚)不同,我们将在训练管道中集成WBC,并优化扩散模型,使其输出的WBC轨迹同时符合物理和原始文本指令。为了训练PhysMoDPO,我们部署了基于物理和任务特定的奖励,并使用它们来为合成轨迹分配偏好。我们在文本到运动和空间控制任务上的广泛实验表明,PhysMoDPO在模拟机器人上的物理现实性和任务相关指标上都表现出一致的改进。此外,我们展示了当将PhysMoDPO应用于模拟中的零样本运动转移以及在G1人形机器人上的实际部署时,它能带来显著的改进。
Summary / 总结
The research targets a specific failure in physically plausible humanoid motion generation: a Whole-Body Controller (WBC) makes diffusion-generated motions executable, but the resulting trajectories can deviate substantially from the original motion. PhysMoDPO is a Direct Preference Optimization framework that integrates the WBC into the training pipeline and optimizes the diffusion model so that the WBC output complies with both physics and the original text instructions. Experiments show consistent improvements in physical realism and task-related metrics on simulated robots, and significant improvements in zero-shot motion transfer in simulation and in real-world deployment on a G1 humanoid robot.
研究旨在解决全身控制器(WBC)将扩散模型生成的运动转换为可执行轨迹时偏离原始运动的问题,从而提高人形运动的物理合理性。PhysMoDPO 是一个直接偏好优化框架,将 WBC 集成到训练管道中,优化扩散模型,使 WBC 的输出同时符合物理规律和原始文本指令。实验表明,该方法在模拟机器人上的物理真实性和任务相关指标上取得了一致的改进,并且在模拟中的零样本动作转移以及 G1 人形机器人的实际部署中取得了显著提升。
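The preference-assignment step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the equal reward weights, and the helper names are assumptions, and `dpo_loss` is the standard DPO objective for likelihood-based models (diffusion variants replace the log-probabilities with denoising-error terms).

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated motion after WBC tracking (all fields are hypothetical)."""
    name: str
    physics_reward: float  # e.g. how closely the WBC trajectory tracks the generated motion
    task_reward: float     # e.g. agreement with the text instruction


def combined_reward(c, w_physics=0.5, w_task=0.5):
    # Weighted sum of the two reward families; the weights are assumptions.
    return w_physics * c.physics_reward + w_task * c.task_reward


def build_preference_pair(candidates):
    # Rank candidates by reward; the best is "chosen", the worst is "rejected".
    ranked = sorted(candidates, key=combined_reward, reverse=True)
    return ranked[0], ranked[-1]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO: -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log(1.0 + math.exp(-beta * margin))


candidates = [
    Candidate("a", physics_reward=0.9, task_reward=0.4),
    Candidate("b", physics_reward=0.7, task_reward=0.8),
    Candidate("c", physics_reward=0.2, task_reward=0.3),
]
chosen, rejected = build_preference_pair(candidates)
```

Here the pair is built from the extremes of a single ranking; other pairing schemes (for example, thresholded reward gaps) would fit the same interface.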
Visual-ERM: Reward Modeling for Visual Equivalence
Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
First: 2026-03-13T17:58:14+00:00 · Latest: 2026-03-13T17:58:14+00:00
Comments: Project: https://github.com/InternLM/Visual-ERM
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
中文标题/摘要
标题:Visual-ERM:视觉等价性奖励建模
视觉到代码任务要求模型将结构化的视觉输入,如图表、表格和SVG,重构为具有高视觉保真度的可执行或结构化表示。虽然最近的大型视觉语言模型(LVLM)通过监督微调取得了出色的结果,但强化学习仍然具有挑战性,因为奖励信号存在未对齐的问题。现有的奖励要么依赖于文本规则,要么依赖于粗略的视觉嵌入相似性,两者都无法捕捉细粒度的视觉差异,并且容易受到奖励作弊的影响。我们提出了视觉等价性奖励模型(Visual-ERM),这是一种多模态生成奖励模型,能够直接在渲染的视觉空间中提供细粒度、可解释且任务无关的反馈,以评估视觉到代码的质量。将Visual-ERM集成到RL中,可以将Qwen3-VL-8B-Instruct在图表到代码任务上提高8.4,在表格和SVG解析方面也取得了稳定的提升(平均分别提高2.7和4.1),并通过反思和修订进一步增强了测试时的扩展性。我们还引入了VisualCritic-RewardBench(VC-RewardBench),这是一个用于判断结构化视觉数据上细粒度图像到图像差异的基准,其中Visual-ERM在8B规模下明显优于Qwen3-VL-235B-Instruct,并接近领先的闭源模型。我们的结果表明,无论任务具体性如何,细粒度的视觉奖励监督对于视觉到代码的强化学习都是必要且充分的。
Summary / 总结
Visual-ERM is a multimodal generative reward model designed to improve the quality of visual-to-code reconstructions by providing fine-grained, interpretable feedback directly in the visual space. It integrates into reinforcement learning to enhance the performance of Qwen3-VL-8B-Instruct, leading to significant improvements in chart-to-code tasks (+8.4) and consistent gains in table and SVG parsing (+2.7, +4.1 on average). Visual-ERM also outperforms larger models and approaches the performance of leading closed-source models on the VisualCritic-RewardBench benchmark, demonstrating the effectiveness of fine-grained visual reward supervision in vision-to-code reinforcement learning.
Visual-ERM 是一个多模态生成奖励模型,旨在为视觉到代码任务提供细粒度、可解释且任务无关的反馈。它使 Qwen3-VL-8B-Instruct 在图表到代码任务上提高了 8.4,并在表格和 SVG 解析上保持一致的改进。Visual-ERM 在 VisualCritic-RewardBench 基准测试中优于 Qwen3-VL-235B-Instruct,并接近领先的闭源模型,表明细粒度的视觉奖励监督对于视觉到代码的强化学习既是必要的也是充分的。
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models
Authors: Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari
First: 2026-03-13T17:51:14+00:00 · Latest: 2026-03-13T17:51:14+00:00
Comments: https://glab-caltech.github.io/STEVOBench/
Abstract
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provide new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/
中文标题/摘要
标题:视而不见,记不住吗?视频世界模型中状态演变的评估
世界的变化,如水的流动或冰的融化,无论是否被观察到都会发生。视频世界模型通过2D帧观察生成“世界”。这些生成的“世界”是否能在不被观察的情况下演变?为了探究这一问题,我们设计了一个基准来评估视频世界模型是否能将状态演变与观察脱钩。我们的基准,STEVO-Bench,通过遮挡物插入指令、关闭灯光或指定摄像机“移开”轨迹来控制观察,对演变过程进行控制。通过评估具有和不具有摄像机控制的视频模型对多种自然演变过程的表现,我们揭示了它们在将状态演变与观察脱钩方面的局限性。STEVO-Bench 提出了一种评估协议,以自动检测和分离视频世界模型在自然状态演变关键方面的失败模式。STEVO-Bench 结果的分析为现有视频世界模型的数据和架构偏差提供了新的见解。项目网站:https://glab-caltech.github.io/STEVOBench/. 博客:https://ziqi-ma.github.io/blog/2026/outofsight/
Summary / 总结
The study investigates whether video world models can simulate the evolution of a system's state independently of observation. A benchmark called STEVO-Bench was developed to control observations through instructions like occluder insertion or camera redirection. Evaluating video models with and without camera control, the research reveals that current models struggle to decouple state evolution from observation, highlighting potential data and architecture biases.
研究探讨视频世界模型是否能在不被观察的情况下模拟状态演变。通过使用遮挡物、灯光控制和摄像机转向来控制观察,设计了一个名为STEVO-Bench的基准测试。研究发现,当前的视频模型难以将状态演变与观察分离,揭示了数据和架构中的潜在偏差。这项工作为视频世界模型的局限性提供了新的见解。
MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation
Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Qi Sun, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
First: 2026-01-27T13:06:47+00:00 · Latest: 2026-03-13T17:48:22+00:00
Abstract
Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives to leverage complementary, multi-level sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, which restructures generation in a coarse-to-fine manner and reduces the combinatorial complexity of unmasking orders by over $10^{41}$ times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse information stored in different part-wise sign tokens through a learnable gate and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while delivering a 40% higher throughput. Code and models will be publicly released.
中文标题/摘要
标题:MaDiS: 控制掩码扩散语言模型进行手语生成
手语生成(SLG)旨在将书面文本转换为富有表现力的手语动作,以消除聋人和听力障碍社区的沟通障碍。最近的研究将SLG建模为语言模型框架内的问题,使用自回归语言模型,但这些模型存在单向上下文建模和逐个标记推理速度慢的问题。为了解决这些限制,我们提出了MaDiS,这是一种基于掩码扩散的语言模型,能够捕捉双向依赖关系并支持高效的并行多标记生成。我们还引入了一种三级跨模态预训练方案,该方案从标记、潜在和三维物理空间目标中联合学习,以利用互补的多级手语表示。为了在微调阶段加速模型收敛,我们设计了一种新颖的去掩码策略,带有时间检查点,该策略以粗到细的方式重新构建生成,并将去掩码顺序的组合复杂度降低了超过$10^{41}$倍。此外,我们开发了一种部件混合嵌入层,通过可学习门控和优化的码本有效融合存储在不同部件手语标记中的信息。在CSL-Daily、Phoenix-2014T和How2Sign上的广泛实验表明,MaDiS在多个指标上,包括DTW误差和两个新引入的指标SiBLEU和SiCLIP,均实现了优越的性能,同时提供40%更高的吞吐量。代码和模型将公开发布。
Summary / 总结
MaDiS is a masked-diffusion-based language model for sign language generation that addresses the limitations of autoregressive models by capturing bidirectional dependencies and supporting efficient parallel multi-token generation. It combines a tri-level cross-modal pretraining scheme, an unmasking strategy with temporal checkpoints that accelerates convergence during fine-tuning, and a mixture-of-parts embedding layer. Experiments on CSL-Daily, Phoenix-2014T, and How2Sign show superior performance across multiple metrics, including DTW error and the two newly introduced metrics SiBLEU and SiCLIP, with 40% higher throughput.
MaDiS 是一种基于掩码扩散的语言模型,用于手语生成,通过捕捉双向依赖关系和实现高效的并行多令牌生成来解决自回归模型的限制。它采用三级跨模态预训练方案和具有时间检查点的新颖去掩码策略,以提高模型性能和收敛性。实验表明,MaDiS 在多个指标上优于现有方法,包括 DTW 误差以及新引入的 SiBLEU 和 SiCLIP,同时吞吐量提高了 40%。
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
Authors: Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Venue: CVPR 2026
First: 2026-03-01T09:03:05+00:00 · Latest: 2026-03-13T17:38:27+00:00
Comments: 15 pages, 11 figures, CVPR 2026, see https://ethan-li123.github.io/FlexiMMT_page/
Abstract
Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer. Our project page is: https://ethan-li123.github.io/FlexiMMT_page/
中文标题/摘要
标题:让您的图像随您的动作移动!-- 隐式多对象多动作转移
动作转移已成为可控视频生成的一个有前途的方向,但现有方法主要集中在单对象场景上,当需要多个对象具有不同的动作模式时,它们会遇到困难。在本文中,我们提出了FlexiMMT,这是第一个显式支持多对象、多动作转移的隐式图像到视频(I2V)动作转移框架。给定一个静态多对象图像和多个参考视频,FlexiMMT 独立提取动作表示并准确地分配给不同的对象,支持灵活重组和任意动作到对象映射。为了解决跨对象动作纠缠的核心挑战,我们引入了一种动作解耦掩码注意机制,使用对象特定的掩码来约束注意力,确保动作和文本标记仅影响其指定区域。我们还提出了一种差异化的掩码传播机制,直接从扩散注意力中推导出对象特定的掩码,并在帧之间高效传播。广泛的实验表明,FlexiMMT 在基于I2V的多对象多动作转移中实现了精确、组合和最先进的性能。我们的项目页面是:https://ethan-li123.github.io/FlexiMMT_page/
Summary / 总结
Existing motion transfer methods largely focus on single objects and struggle when multiple objects require distinct motions. FlexiMMT is presented as the first implicit image-to-video motion transfer framework that handles multi-object, multi-motion scenarios: it extracts motion representations from multiple reference videos and assigns them to different objects in a static image. A Motion Decoupled Mask Attention Mechanism restricts motion and text tokens to their designated regions, and a Differentiated Mask Propagation Mechanism derives object-specific masks from diffusion attention and propagates them across frames. Experiments show precise, compositional, state-of-the-art multi-object multi-motion transfer.
该研究通过引入FlexiMMT框架解决了视频生成中的多对象多运动转移挑战。该框架独立提取并分配运动表示,并使用运动解耦掩码注意机制确保运动和文本令牌仅影响其指定区域。进一步通过从扩散注意中直接推导并高效传播对象特定的掩码来增强此功能。实验结果表明,FlexiMMT在多对象多运动转移方面实现了精确且处于领先水平的表现。
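The idea of constraining attention with object-specific masks can be illustrated with a small sketch. It is a generic masked-softmax example under stated assumptions, not FlexiMMT's actual mechanism: the function names, the per-position object labels, and the toy scores are all hypothetical.

```python
import math


def masked_softmax(logits, allowed):
    # Disallowed entries get -inf, so they receive exactly zero weight.
    masked = [l if a else float("-inf") for l, a in zip(logits, allowed)]
    peak = max(masked)
    if peak == float("-inf"):
        return [0.0] * len(logits)  # nothing allowed: attend to nothing
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def mask_constrained_attention(scores, position_object, token_object):
    # scores[i][j]: raw attention score from spatial position i to motion token j.
    # A position may only attend to tokens that belong to its own object.
    weights = []
    for pos, row in enumerate(scores):
        allowed = [token_object[j] == position_object[pos] for j in range(len(row))]
        weights.append(masked_softmax(row, allowed))
    return weights


position_object = [0, 0, 1, 1]  # which object each spatial position belongs to
token_object = [0, 0, 1, 1]     # which object each motion token describes
scores = [
    [0.2, 1.0, 3.0, 0.5],
    [0.0, 0.0, 2.0, 2.0],
    [1.5, 0.1, 0.3, 0.3],
    [0.9, 0.9, 0.1, 0.7],
]
attention = mask_constrained_attention(scores, position_object, token_object)
```

Even though position 0 has its largest raw score on a token of object 1, the mask forces that weight to zero, which is the kind of cross-object leakage the masking is meant to prevent.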
Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations
Authors: Pramudita Satria Palar, Paul Saves, Rommel G. Regis, Koji Shimoyama, Shigeru Obayashi, Nicolas Verstaevel, Joseph Morlier
First: 2025-12-12T15:28:17+00:00 · Latest: 2026-03-13T17:35:32+00:00
Comments: Published in Aerospace Science and Technology, 2026
Abstract
Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol' indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.
中文标题/摘要
标题:基于个体条件期望的工程设计全局敏感性分析
可解释的机器学习技术在工程应用中越来越受到关注,特别是在航空航天设计和分析中,理解输入变量如何影响数据驱动模型至关重要。部分依赖图(PDP)广泛用于解释黑盒模型,通过展示输入变量对预测的平均影响来实现这一点。然而,当存在强烈交互作用时,它们的全局敏感性度量可能会误导人,因为平均化往往会掩盖交互作用的影响。为了解决这一局限性,我们提出了一种基于个体条件期望(ICE)曲线的全局敏感性度量方法。该方法通过计算ICE曲线上的特征重要性期望及其标准差来更有效地捕捉交互作用的影响。我们提供了一个数学证明,证明基于PDP的敏感性是所提出的基于ICE的度量的下界,前提是截断正交多项式展开。此外,我们引入了一个基于ICE的相关值来量化交互作用如何修改输入与输出之间的关系。我们在三个案例上进行了比较评估:一个5变量的分析函数,一个5变量的风力涡轮机疲劳问题,以及一个9变量的机翼气动学案例,其中基于ICE的敏感性与PDP、SHapley加性解释(SHAP)和Sobol'指数进行了基准测试。结果表明,基于ICE的特征重要性提供了比传统基于PDP的方法更丰富的见解,而PDP、ICE和SHAP的可视化解释则通过提供多种视角相互补充。
Summary / 总结
This paper addresses the limitation of Partial Dependence Plots (PDPs) in capturing interaction effects by proposing a new global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes expected feature importance and its standard deviation across ICE curves to better capture interaction influences. Comparative evaluations on various cases demonstrate that ICE-based feature importance offers richer insights compared to traditional PDP-based approaches, while visual interpretations from PDP, ICE, and SHAP provide complementary perspectives.
该研究提出了一种基于个体条件期望(ICE)曲线的全局敏感性分析方法,以提高工程设计中机器学习模型的可解释性,特别是在航空航天应用中。该方法解决了部分依赖图(PDP)的局限性,更有效地捕捉交互效应。ICE方法通过计算ICE曲线上的期望特征重要性和其标准差来提供更准确的敏感性度量。在各种案例上的比较评估表明,基于ICE的特征重要性提供了比传统PDP方法更丰富的洞察,而PDP、ICE和SHAP的可视化解释则通过提供多种视角相互补充。
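A toy example may help show why averaging in a PDP can hide an interaction that ICE curves still reveal. This sketch assumes a simple importance measure, the standard deviation of a curve over the grid, which is not necessarily the paper's exact metric; the model `x1 * x2` and all variable names are illustrative.

```python
from statistics import mean, pstdev


def model(x1, x2):
    # Pure interaction: x1 has no effect on average when x2 is centered at zero.
    return x1 * x2


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]        # values swept for x1
x2_samples = [-1.0, -0.5, 0.0, 0.5, 1.0]  # observed values of the other input

# One ICE curve per sample: sweep x1 while holding that sample's x2 fixed.
ice_curves = [[model(g, x2) for g in grid] for x2 in x2_samples]

# The PDP is the pointwise average of the ICE curves.
pdp_curve = [mean([curve[k] for curve in ice_curves]) for k in range(len(grid))]

# Assumed importance measure: standard deviation of a curve over the grid.
pdp_importance = pstdev(pdp_curve)
curve_importances = [pstdev(curve) for curve in ice_curves]
ice_importance = mean(curve_importances)  # expected importance across ICE curves
ice_spread = pstdev(curve_importances)    # its standard deviation across curves
```

In this example the PDP is flat, so its importance is zero, while the ICE-based value is positive. With this particular measure the PDP value can never exceed the ICE value, because the standard deviation of an average is at most the average of the standard deviations.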
Large language models show fragile cognitive reasoning about human emotions
Authors: Sree Bhattacharyya, Evgenii Kuriabov, Lucas Craig, Tharun Dilliraj, Reginald B. Adams, Jia Li, James Z. Wang
Venue: WiML Workshop @ NeurIPS 2025
First: 2025-08-07T22:19:15+00:00 · Latest: 2026-03-13T17:27:05+00:00
Comments: Under Review, a version was presented at WiML Workshop @ NeurIPS 2025
Abstract
Affective computing seeks to support the holistic development of artificial intelligence by enabling machines to engage with human emotion. Recent foundation models, particularly large language models (LLMs), have been trained and evaluated on emotion-related tasks, typically using supervised learning with discrete emotion labels. Such evaluations largely focus on surface phenomena, such as recognizing expressed or evoked emotions, leaving open whether these systems reason about emotion in cognitively meaningful ways. Here we ask whether LLMs can reason about emotions through underlying cognitive dimensions rather than labels alone. Drawing on cognitive appraisal theory, we introduce CoRE, a large-scale benchmark designed to probe the implicit cognitive structures LLMs use when interpreting emotionally charged situations. We assess alignment with human appraisal patterns, internal consistency, cross-model generalization, and robustness to contextual variation. We find that LLMs capture systematic relations between cognitive appraisals and emotions but show misalignment with human judgments and instability across contexts.
中文标题/摘要
标题:大型语言模型在人类情感认知推理方面表现脆弱
情感计算旨在通过使机器能够与人类情感互动来支持人工智能的全面发展。最近的基础模型,尤其是大型语言模型(LLMs),已被训练和评估在情感相关任务上,通常使用监督学习和离散情感标签。这些评估主要集中在表面现象上,如识别表达或引发的情感,而未明确这些系统是否以认知有意义的方式进行情感推理。在这里,我们探讨LLMs是否能够通过认知维度而非仅标签来推理情感。基于认知评估理论,我们引入了CoRE,这是一个大规模基准,旨在探究LLMs在解释情感化情境时所使用的隐含认知结构。我们评估了与人类评估模式的一致性、内部一致性、跨模型泛化能力和对上下文变化的稳健性。我们发现,LLMs捕捉了认知评估与情感之间的系统关系,但与人类判断存在偏差,并且在不同情境下表现出不稳定性。
Summary / 总结
This study investigates whether large language models (LLMs) can reason about human emotions through underlying cognitive dimensions rather than just discrete emotion labels. Using CoRE, a new benchmark based on cognitive appraisal theory, the research assesses LLMs' alignment with human judgments, internal consistency, cross-model generalization, and robustness to context. The findings indicate that while LLMs capture systematic relations between cognitive appraisals and emotions, they show misalignment with human judgments and instability across different contexts.
研究探讨了大型语言模型(LLMs)是否能够通过认知维度而非仅标签来推理情感。使用基于认知评估理论的新基准CoRE,研究评估了LLMs与人类判断的一致性、内部一致性、跨模型的一般化能力和对上下文变化的鲁棒性。研究发现,虽然LLMs能够捕捉认知评估与情感之间的系统关系,但它们与人类判断存在偏差,并且在不同情境下不稳定。
From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research
Authors: Haonan Huang
First: 2026-03-13T17:25:47+00:00 · Latest: 2026-03-13T17:25:47+00:00
Abstract
While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge -- learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform closing this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and in dedicated reflection sessions correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and improves accuracy from 47% to 3% deviation from literature -- and when transferred to an unfamiliar material, achieves 1% deviation with zero pipeline failures.
中文标题/摘要
标题:从实验到专长:面向AI驱动计算研究的科学知识整合
虽然大型语言模型(LLMs)已经将AI代理转变为计算材料科学的熟练执行者,但仅仅执行一百次模拟并不能造就一名研究人员。研究与常规执行的区别在于知识的逐步积累——学习哪些方法失败,识别不同系统中的模式,并将理解应用于新问题。然而,在AI驱动的计算科学中,当前的范式将每次执行视为孤立的事件,很大程度上忽略了运行之间的宝贵见解。在这里,我们介绍了QMatSuite这一开源平台,填补了这一空白。代理记录具有完整溯源的研究发现,在新计算前检索知识,并在专门的反思会中纠正错误的发现,将观察综合为跨化合物模式。在六步量子力学模拟工作流的基准测试中,积累的知识将推理开销减少了67%,并将准确性从文献偏差47%提高到3%——当转移到不熟悉的材料时,实现了零管道失败且偏差仅为1%。
Summary / 总结
This paper addresses the gap between AI agents that execute computational materials science tasks proficiently and the progressive accumulation of knowledge that distinguishes research from routine execution. It introduces QMatSuite, an open-source platform in which agents record findings with full provenance, retrieve knowledge before new calculations, and use dedicated reflection sessions to correct erroneous findings and synthesize cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduced reasoning overhead by 67% and improved accuracy from 47% to 3% deviation from literature; when transferred to an unfamiliar material, it reached 1% deviation with zero pipeline failures.
本文探讨了AI代理在执行计算材料科学任务方面的优势与研究人员需要积累知识之间的差距。它介绍了QMatSuite,一个开源平台,允许代理记录具有完整溯源的发现,检索知识以进行新计算,并在专门的反思会中纠正错误并综合观察结果。在基准测试中,积累的知识将推理开销减少了67%,并将准确性提高到与文献47%偏差相比的3%偏差,在应用于不熟悉的材料时,偏差为1%,且没有管道失败。
LLM Constitutional Multi-Agent Governance
Authors: J. de Curtò, I. de Zarzà
First: 2026-03-13T17:21:26+00:00 · Latest: 2026-03-13T17:21:26+00:00
Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer
Abstract
Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.
中文标题/摘要
标题:大型语言模型宪法多智能体治理
大型语言模型(LLMs)可以生成具有说服力的影响策略,改变多智能体群体中的合作行为,但一个关键问题仍然存在:这种合作是否反映了真正的亲社会一致性,还是掩盖了智能体自主权、知识完整性和分配公平性的侵蚀?我们引入了宪法多智能体治理(CMAG),这是一种两阶段框架,介于LLM策略编译器和网络化智能体群体之间,结合了硬约束过滤和软惩罚效用优化,平衡了合作潜力与操纵风险和自主权压力。我们提出了伦理合作评分(ECS),这是一种合作、自主权、完整性和公平性的乘积复合体,惩罚通过操纵手段实现的合作。在70%的候选者违反条件下,对80个智能体的无标度网络进行实验,我们基准测试了三种模式:完整CMAG、简单过滤和无约束优化。虽然无约束优化实现了最高的原始合作(0.873),但由于严重的自主权侵蚀(0.867)和公平性退化(0.888),其ECS最低(0.645)。CMAG实现了0.741的ECS,提高了14.9%,同时保持了0.985的自主权和0.995的完整性,合作减少到0.770。简单的消融实验(ECS = 0.733)证实了仅靠硬约束是不够的。帕累托分析显示,CMAG在合作-自主权权衡空间中占优,治理减少了中心-边缘暴露差异超过60%。这些发现表明,没有治理的合作并非固然是可取的:宪法约束是必要的,以确保LLM介导的影响产生伦理稳定的成果,而不是操纵性均衡。
Summary / 总结
The research addresses the ethical concerns of using Large Language Models (LLMs) to influence cooperative behavior in multi-agent systems, proposing a two-stage framework called Constitutional Multi-Agent Governance (CMAG) to balance cooperation potential with autonomy and fairness. CMAG uses hard constraint filtering and soft penalized-utility optimization to achieve a higher Ethical Cooperation Score (ECS) compared to unconstrained optimization. Experiments show that while unconstrained optimization yields the highest raw cooperation, it severely erodes autonomy and fairness. CMAG, despite reducing cooperation slightly, significantly improves ECS and preserves autonomy and integrity, demonstrating the necessity of constitutional constraints for ethically stable outcomes.
研究探讨了使用大型语言模型(LLMs)影响多智能体系统中合作行为的伦理问题。引入了宪法多智能体治理(CMAG)框架,该框架平衡了合作潜力与自主性和公平性。实验显示,虽然无约束优化最大化了原始合作,但它严重侵蚀了自主性和公平性。CMAG结合了硬约束和软优化,实现了更高的伦理合作评分(ECS)0.741,同时保持了自主性和完整性,虽然合作略有减少,但显著提高了伦理稳定性。
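The multiplicative structure of the Ethical Cooperation Score and the reported 14.9% improvement can be illustrated with a short sketch. The plain product below is an assumption: the abstract says only that ECS is a multiplicative composite, so the paper's formula may include weights or exponents. The toy component values in the comparison are hypothetical; only 0.741 and 0.645 come from the abstract.

```python
def ethical_cooperation_score(cooperation, autonomy, integrity, fairness):
    # Assumed form: an unweighted product, so a low value in any one
    # component drags the whole score down.
    return cooperation * autonomy * integrity * fairness


def relative_improvement(new, baseline):
    return (new - baseline) / baseline


# Hypothetical toy values: high cooperation bought with low autonomy
# versus modest cooperation with autonomy largely preserved.
manipulative = ethical_cooperation_score(0.9, 0.5, 1.0, 1.0)
governed = ethical_cooperation_score(0.7, 0.9, 1.0, 1.0)

# ECS values reported in the abstract: CMAG 0.741 vs. unconstrained 0.645.
gain = relative_improvement(0.741, 0.645)
gain_percent = round(gain * 100, 1)
```

The toy comparison shows the intended penalty: the "manipulative" setting has higher raw cooperation but a lower composite score. The relative gain computed from the two reported ECS values matches the 14.9% stated in the abstract.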
Neural-Quantum-States Impurity Solver for Quantum Embedding Problems
Authors: Yinzhanghao Zhou, Tsung-Han Lee, Ao Chen, Nicola Lanatà, Hong Guo
First: 2025-09-15T20:33:10+00:00 · Latest: 2026-03-13T17:19:06+00:00
Comments: 10 pages main text, and 4 figures. Note that YinZhangHao Zhou and Zhanghao Zhouyin are the same person, I use them both
Abstract
Neural quantum states (NQS) have emerged as a promising approach to solve second-quantized Hamiltonians, because of their scalability and flexibility. In this work, we design and benchmark an NQS impurity solver for the quantum embedding (QE) methods, focusing on the ghost Gutzwiller Approximation (gGA) framework. We introduce a graph transformer-based NQS framework able to represent arbitrarily connected impurity orbitals of the embedding Hamiltonian (EH) and develop an error control mechanism to stabilize iterative updates throughout the QE loops. We validate the accuracy of our approach with benchmark gGA calculations of the Anderson Lattice Model, yielding results in excellent agreement with the exact diagonalisation impurity solver. Finally, our analysis of the computational budget reveals the method's principal bottleneck to be the high-accuracy sampling of physical observables required by the embedding loop, rather than the NQS variational optimization, directly highlighting the critical need for more efficient inference techniques.
中文标题/摘要
标题:神经量子态杂质求解器用于量子嵌入问题
神经量子态(NQS)已成为求解二次量子化哈密顿量的有前途的方法,因为它们具有可扩展性和灵活性。在本文中,我们设计并测试了一种NQS杂质求解器,用于量子嵌入(QE)方法,重点关注鬼Gutzwiller近似(gGA)框架。我们引入了一种基于图变换器的NQS框架,能够表示嵌入哈密顿量(EH)的任意连接杂质轨道,并开发了一种误差控制机制,以在整个QE循环中稳定迭代更新。我们通过基准gGA计算安德森晶格模型,验证了我们方法的准确性,结果与精确对角化杂质求解器的计算结果非常一致。最后,我们的计算预算分析揭示了该方法的主要瓶颈在于嵌入循环所需的物理可观测量的高精度采样,而不是NQS变分优化,直接突显了更高效推理技术的迫切需求。
Summary / 总结
This work introduces a neural quantum states (NQS) impurity solver for quantum embedding (QE) methods, specifically for the ghost Gutzwiller Approximation (gGA) framework. The method uses a graph transformer-based NQS to represent impurity orbitals and includes an error control mechanism to stabilize iterative updates. The approach is validated through benchmark gGA calculations of the Anderson Lattice Model, showing excellent agreement with exact diagonalization results. The analysis indicates that the main computational challenge lies in the high-accuracy sampling of physical observables, rather than the NQS optimization process.
本文提出了一种用于量子嵌入方法的神经量子态(NQS)杂质求解器,特别适用于鬼Gutzwiller近似框架。该方法使用基于图变换器的NQS来表示杂质轨道,并包含一个误差控制机制。通过使用Anderson晶格模型进行验证,结果显示与精确对角化求解器的结果高度一致。计算瓶颈被识别为物理可观测量的高精度采样,这表明需要更高效的推理技术。
Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos
Authors: Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate
First: 2026-03-13T17:18:03+00:00 · Latest: 2026-03-13T17:18:03+00:00
Comments: https://github.com/rohithpeddi/WorldSGG
Abstract
Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
中文标题/摘要
标题:从单目视频生成时空世界场景图的方法
时空场景图提供了一种原理性的表示方法,用于建模不断变化的对象交互,但现有方法仍然主要基于帧:它们仅考虑当前可见的对象,遮挡时丢弃实体,并在二维空间中操作。为解决这一问题,我们首先引入了ActionGenome4D数据集,该数据集通过前馈3D重建、面向世界框架的对象边界框以及密集的关系注释(包括由于遮挡或摄像机运动而暂时未观察到的对象之间的关系),将Action Genome视频升级为4D场景。基于此数据,我们定义了世界场景图生成(WSGG)任务,即在每个时间戳构建一个包含场景中所有交互对象(包括已观察和未观察的对象)的世界场景图。然后,我们提出了三种互补的方法,每种方法探索不同的归纳偏置来处理未观察到的对象:PWG(持久世界图),通过零阶特征缓冲区实现对象持久性;MWAE(掩码世界自编码器),将未观察到的对象推理重新定义为掩码完成与跨视图关联检索;以及4DST(4D场景变换器),用具有3D运动和摄像机姿态特征的可微分的逐对象时空注意力替换静态缓冲区。我们进一步设计并评估了强大的开源视觉-语言模型在WSGG任务上的性能,通过一系列基于Graph RAG的方法建立了未定位关系预测的基线。因此,WSGG推动了视频场景理解向以世界为中心、时间持久和可解释的场景推理方向发展。
Summary / 总结
The paper addresses a limitation of existing scene graph methods: they are frame-centric, reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. It introduces ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes and provides dense relationship annotations, including for temporarily unobserved objects. The authors then propose three methods for World Scene Graph Generation (WSGG): PWG, MWAE, and 4DST, each exploring a different inductive bias for reasoning about unobserved objects. Separately, strong open-source Vision-Language Models are evaluated on WSGG through Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction.
论文通过引入包含3D重建和密集关系标注的ActionGenome4D数据集,解决了现有帧中心方法在场景图生成中的局限性。它定义了世界场景图生成(WSGG)任务,并提出了三种方法:PWG、MWAE和4DST,每种方法都以不同的方式处理未观察到的对象。这些方法实现了时间持久和可解释的场景推理,推动了视频场景理解的发展。还通过图RAG基方法评估了强大的开源视觉-语言模型,为未定位的关系预测提供了基线。
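The zero-order feature buffer that the abstract attributes to PWG can be pictured with a minimal sketch. The class and method names below are hypothetical, and the per-object feature dictionary is an assumption made for illustration; the paper's actual implementation is not described in the abstract.

```python
import numpy as np

class ZeroOrderFeatureBuffer:
    """Hold the last observed feature of each object (zero-order hold)."""

    def __init__(self):
        self._last = {}  # object id -> (feature, timestamp last seen)

    def update(self, t, observed):
        """Store features of objects visible at time t and return the world state.

        The result maps object id -> (feature, is_observed) for every object
        seen so far, so occluded objects stay in the graph.
        """
        for obj_id, feat in observed.items():
            self._last[obj_id] = (np.asarray(feat, dtype=float), t)
        return {
            obj_id: (feat, seen_at == t)
            for obj_id, (feat, seen_at) in self._last.items()
        }

buf = ZeroOrderFeatureBuffer()
buf.update(0, {"cup": [1.0, 0.0], "person": [0.0, 1.0]})
world = buf.update(1, {"person": [0.1, 0.9]})  # the cup is occluded at t=1
```

After the second update, `world` still contains the cup with its last observed feature and an `is_observed` flag of `False`, which is the object-permanence behaviour the abstract describes.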
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
Authors: Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, Hao Wang
First: 2026-02-13T18:30:51+00:00 · Latest: 2026-03-13T17:17:55+00:00
Comments: CVPR2026 accepted
Abstract
Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences because they anchor poses to the first frame, leading to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames under a strictly online, future-invisible setting. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we identify attention bias issues in Transformers, including attention-sink reliance and long-term KV-cache saturation. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention biases and contamination over ultra-long sequences and reduces the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, enabling stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
中文标题/摘要
标题:LongStream:长序列流式自回归视觉几何
长序列流式3D重建仍然是一个重要的开放挑战。现有的自回归模型在处理长序列时经常失败,因为它们将姿态锚定在第一帧,导致注意力衰减、尺度漂移和外推误差。我们提出了LongStream,这是一种在严格在线、未来不可见设置下,用于数千帧的度量尺度场景重建的新型量规解耦流式视觉几何模型。我们的方法分为三部分。首先,我们放弃了第一帧的锚点,预测关键帧相对姿态。这将长距离外推重新定义为一个恒定难度的局部任务。其次,我们引入了正交尺度学习。这种方法完全解耦几何与尺度估计,以抑制漂移。最后,我们识别了Transformer中的注意力偏差问题,包括注意力陷阱依赖和长期KV缓存饱和。我们提出了缓存一致训练结合周期性缓存刷新。这种方法抑制了超长序列中的注意力偏差和污染,并减少了训练与推理之间的差距。实验表明,LongStream实现了最先进的性能,能够在18 FPS下实现千米级序列的稳定、度量尺度重建。
Summary / 总结
LongStream addresses the challenge of long-sequence streaming 3D reconstruction by introducing a gauge-decoupled streaming visual geometry model. It predicts keyframe-relative poses instead of anchoring to the first frame, and introduces orthogonal scale learning to disentangle geometry from scale estimation. Additionally, it proposes cache-consistent training and periodic cache refresh to suppress attention biases. Experiments demonstrate that LongStream achieves state-of-the-art performance, enabling stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.
LongStream通过将长范围外推重新表述为局部任务、分离几何与尺度估计以及缓解Transformer中的注意力偏差,解决了长序列流式3D重建的挑战。它在千米级序列上实现了稳定的、度量尺度的重建,并以18 FPS的速度达到最先进的性能,在未来不可见的在线设置下运行。
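The benefit of keyframe-relative poses can be illustrated with plain rigid-transform composition. This is a sketch under stated assumptions: the helper names and the yaw-only rotation are hypothetical simplifications, not LongStream's actual parameterization.

```python
import numpy as np

def se3(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def to_world(T_world_keyframe, T_keyframe_frame):
    """Compose a keyframe's world pose with a keyframe-relative frame pose."""
    return T_world_keyframe @ T_keyframe_frame

# The keyframe is far from the origin, but the relative pose stays small.
T_world_kf = se3(np.pi / 2, [1000.0, 0.0, 0.0])
T_kf_frame = se3(0.0, [1.0, 0.0, 0.0])
T_world_frame = to_world(T_world_kf, T_kf_frame)
```

The network only has to predict the small `T_kf_frame`, whose magnitude does not grow with sequence length; the large world-frame offset is carried by the keyframe pose.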
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
Authors: Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys, Leonidas Guibas
First: 2025-12-05T00:54:48+00:00 · Latest: 2026-03-13T17:13:29+00:00
Comments: Project page: https://spacecontrol3d.github.io/
Abstract
Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are difficult to manipulate. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D asset generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern generative models without requiring any additional training. A control parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive interface for real-time superquadric editing and direct 3D asset generation, enabling seamless use in creative workflows. Project page: https://spacecontrol3d.github.io/.
中文标题/摘要
标题:SpaceControl:在3D生成建模中引入测试时空间控制
近年来,用于3D资产的生成方法取得了显著进展,但在提供对物体几何形状的直观和精确控制方面仍面临关键挑战。现有方法主要依赖于文本或图像提示,这些提示在几何精确性方面往往不够:语言可能含糊不清,而图像难以操作。在本工作中,我们引入了SpaceControl,这是一种无需训练的测试时方法,用于明确控制3D资产生成的空间。我们的方法接受从粗略的基本体到详细的网格等各种几何输入,并能够无缝集成到现代生成模型中,无需额外训练。一个控制参数让用户可以在几何保真度和输出现实性之间进行权衡。广泛的定量评估和用户研究证明,SpaceControl在几何保真度方面优于基于训练和基于优化的基线方法,同时保持高质量的视觉效果。最后,我们提供了一个交互式界面,用于实时超二次编辑和直接3D资产生成,使其能够无缝地应用于创意工作流程中。项目页面:https://spacecontrol3d.github.io/
Summary / 总结
SpaceControl introduces a training-free test-time method for explicit spatial control in 3D generative modeling, accepting geometric inputs that range from coarse primitives to detailed meshes. The method integrates with modern generative models without additional training and provides a control parameter to balance geometric fidelity and output realism. Quantitative evaluation and user studies show that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while maintaining high visual quality.
SpaceControl 提出了一种无需训练的方法,用于 3D 生成建模中的显式空间控制,允许用户输入几何数据并调节几何保真度与输出真实感之间的权衡。该方法与现代生成模型无缝集成,并在几何保真度方面优于基于训练和基于优化的基线方法,同时保持高质量的视觉效果。用户研究和定量评估支持这些结论,并提供了一个交互式界面,用于实时超二次曲面编辑和直接 3D 资产生成。
Clustering Astronomical Orbital Synthetic Data Using Advanced Feature Extraction and Dimensionality Reduction Techniques
Authors: Eraldo Pereira Marinho, Nelson Callegari Junior, Fabricio Aparecido Breve, Caetano Mazzoni Ranieri
First: 2026-03-13T17:11:52+00:00 · Latest: 2026-03-13T17:11:52+00:00
Comments: This paper has been accepted for publication in Neural Computing and Applications (Springer Nature)
Abstract
The dynamics of Saturn's satellite system offer a rich framework for studying orbital stability and resonance interactions. Traditional methods for analysing such systems, including Fourier analysis and stability metrics, struggle with the scale and complexity of modern datasets. This study introduces a machine learning-based pipeline for clustering approximately 22,300 simulated satellite orbits, addressing these challenges with advanced feature extraction and dimensionality reduction techniques. The key to this approach is using MiniRocket, which efficiently transforms 400 timesteps into a 9,996-dimensional feature space, capturing intricate temporal patterns. Additional automated feature extraction and dimensionality reduction techniques refine the data, enabling robust clustering analysis. This pipeline reveals stability regions, resonance structures, and other key behaviours in Saturn's satellite system, providing new insights into their long-term dynamical evolution. By integrating computational tools with traditional celestial mechanics techniques, this study offers a scalable and interpretable methodology for analysing large-scale orbital datasets and advancing the exploration of planetary dynamics.
中文标题/摘要
标题:使用高级特征提取和降维技术对天体轨道合成数据进行聚类
土星卫星系统的动力学为研究轨道稳定性与共振相互作用提供了丰富的框架。传统方法,包括傅里叶分析和稳定性指标,难以处理现代数据集的规模和复杂性。本研究引入了一种基于机器学习的管道,用于聚类约22,300个模拟卫星轨道,通过使用先进的特征提取和降维技术应对这些挑战。该方法的关键在于使用MiniRocket,它高效地将400个时间步转换为9,996维特征空间,捕捉复杂的时序模式。此外,自动化的特征提取和降维技术进一步精炼数据,使聚类分析更加稳健。该管道揭示了土星卫星系统的稳定性区域、共振结构和其他关键行为,提供了对其长期动力学演化的新的见解。通过将计算工具与传统的天体力学技术相结合,本研究提供了一种可扩展且可解释的方法,用于分析大规模轨道数据集并推进对行星动力学的探索。
Summary / 总结
This study addresses the challenges of analyzing complex orbital data from Saturn's satellite system by developing a machine learning-based pipeline for clustering approximately 22,300 simulated orbits. It uses MiniRocket for feature extraction, transforming 400 timesteps into a 9,996-dimensional feature space to capture intricate temporal patterns; additional automated feature extraction and dimensionality reduction then refine the data before clustering. The pipeline reveals stability regions, resonance structures, and other key behaviors, providing new insights into the long-term dynamical evolution of Saturn's satellite system.
该研究旨在使用机器学习技术分析土星卫星系统的复杂动力学,以克服传统方法的局限性。研究人员开发了一种管道,使用MiniRocket进行特征提取,将400个时间步长转换为9,996维特征空间,并辅以额外的自动特征提取和降维技术。这种方法使研究人员能够聚类约22,300个模拟卫星轨道,揭示了稳定区域、共振结构和其他关键行为,为土星卫星系统的长期动力学演化提供了新的见解。
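The abstract does not say where the figure of 9,996 comes from. One reading, consistent with MiniRocket's published design (a fixed set of length-9 kernels with three weights set to 2 and a feature budget of about 10,000), is that it is the largest multiple of the kernel count not exceeding 10,000. The sketch below checks that arithmetic; the constant names are illustrative.

```python
from itertools import combinations

KERNEL_LENGTH = 9         # MiniRocket kernels have nine weights
NUM_HIGH_WEIGHTS = 3      # three weights take the value 2, the other six take -1
TARGET_FEATURES = 10_000  # the usual default feature budget

# Count the distinct placements of the three high weights.
num_kernels = sum(1 for _ in combinations(range(KERNEL_LENGTH), NUM_HIGH_WEIGHTS))
# Each kernel contributes the same number of features, so round down.
features_per_kernel = TARGET_FEATURES // num_kernels
num_features = num_kernels * features_per_kernel
print(num_kernels, num_features)  # prints: 84 9996
```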
Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning
Authors: Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung
Venue: NeurIPS 2024
First: 2024-02-03T04:17:09+00:00 · Latest: 2026-03-13T17:08:48+00:00
Comments: Accepted to NeurIPS 2024 (reduced file-size version). The project page is available at https://beanie00.com/publications/qcs
Abstract
Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce $Q$-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of $Q$-functions. By analyzing $Q$-function over-generalization, which impairs stable stitching, QCS adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the maximum trajectory returns across diverse offline RL benchmarks.
中文标题/摘要
标题:自适应$Q$-辅助的条件监督学习在离线强化学习中的应用
离线强化学习(RL)通过回报条件监督学习(RCSL)取得了进展,但其缺乏拼接能力仍然是一个限制。我们提出了$Q$-辅助条件监督学习(QCS),它有效地将RCSL的稳定性与$Q$-函数的拼接能力结合起来。通过分析$Q$-函数的过度泛化,这会损害稳定的拼接,QCS根据轨迹回报自适应地将$Q$-辅助整合到RCSL的损失函数中。实验证明,QCS显著优于RCSL和基于值的方法,在多种离线RL基准测试中始终能够达到或超过最大轨迹回报。
Summary / 总结
The research aims to address the limitation of return-conditioned supervised learning (RCSL) in offline reinforcement learning by introducing Q-Aided Conditional Supervised Learning (QCS). QCS combines the stability of RCSL with the stitching capability of Q-functions by adaptively integrating Q-aid into RCSL's loss function based on trajectory return. Experimental results demonstrate that QCS outperforms both RCSL and value-based methods, achieving or exceeding the maximum trajectory returns across various offline RL benchmarks.
论文通过引入Q-Aided Conditional Supervised Learning (QCS),解决了回报条件监督学习(RCSL)在离线强化学习中的拼接能力不足问题。QCS通过基于轨迹回报将Q-辅助自适应地整合到RCSL的损失函数中,结合了RCSL的稳定性和Q函数的拼接能力。实验结果显示,QCS在多种离线RL基准测试中表现优异,超过了RCSL和基于值的方法,达到或超过了最大轨迹回报。
Semantic Invariance in Agentic AI
Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate
First: 2026-03-13T17:08:44+00:00 · Latest: 2026-03-13T17:08:44+00:00
Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer
Abstract
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
中文标题/摘要
标题:代理型人工智能中的语义不变性
大型语言模型(LLMs)越来越多地作为自主推理代理,在决策支持、科学问题解决和多代理协调系统中发挥作用。然而,在关键应用中部署LLM代理需要确保其推理在语义等效输入变化下保持稳定,这一特性我们称之为语义不变性。标准基准评估仅评估固定的标准问题表述的准确性,未能捕捉到这种关键的可靠性维度。为解决这一不足,本文提出了一种蜕变测试框架,系统评估LLM推理代理的鲁棒性,应用了八种语义保持变换(恒等、改写、事实重排、扩展、收缩、学术背景、商业背景和对比表述),跨越了四个不同架构家族的七种基础模型:Hermes(70B,405B)、Qwen3(30B-A3B,235B-A22B)、DeepSeek-R1 和 gpt-oss(20B,120B)。评估涵盖了八个科学领域中的19个多步推理问题。结果表明,模型规模并不能预测鲁棒性:较小的Qwen3-30B-A3B实现了最高的稳定性(79.6%不变响应,语义相似度0.91),而较大的模型则表现出更大的脆弱性。
Summary / 总结
This paper addresses the need for semantic invariance in autonomous reasoning agents, particularly large language models (LLMs), by presenting a metamorphic testing framework. Eight semantic-preserving transformations were applied to seven foundation models across four architectural families, evaluating their robustness on 19 multi-step reasoning problems. The results indicate that model size does not correlate with robustness, with the smaller Qwen3-30B-A3B achieving the highest stability, while larger models were more fragile.
本文通过提出一种蜕变测试框架,评估了七个基础模型在八个科学领域的表现,以解决自主LLM代理需要语义不变性的问题。研究应用了八种语义保持变换,发现模型大小与稳健性无关,较小的Qwen3-30B-A3B实现了最高的稳定性(79.6%不变响应,语义相似度0.91),而较大的模型则表现出更大的脆弱性。
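A metamorphic test of this kind reduces to comparing an agent's answer on the original problem with its answers on transformed versions. The sketch below is a toy illustration: `toy_agent`, the three transforms, and the exact-match comparison are hypothetical stand-ins, not the paper's models, transformations, or similarity metric.

```python
def invariance_rate(agent, problem, transforms, same_answer):
    """Fraction of transformed prompts whose answer matches the original's."""
    reference = agent(problem)
    matches = [same_answer(reference, agent(t(problem))) for t in transforms]
    return sum(matches) / len(matches)

# Hypothetical stand-in for an LLM agent: it only recognises one surface form.
def toy_agent(prompt):
    return "4" if "2 + 2" in prompt else "unknown"

transforms = [
    lambda p: p,                                   # identity
    lambda p: "In a business setting: " + p,       # added context
    lambda p: p.replace("2 + 2", "two plus two"),  # paraphrase the agent fails on
]

rate = invariance_rate(toy_agent, "What is 2 + 2?", transforms,
                       lambda a, b: a == b)
```

Here two of the three transformed prompts keep the answer unchanged, so `rate` is 2/3. A real evaluation would replace exact matching with a semantic comparison of the responses.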
Developing and evaluating a chatbot to support maternal health care
Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder
Venue: IJCAI 2026
First: 2026-03-13T17:02:05+00:00 · Latest: 2026-03-13T17:02:05+00:00
Comments: 17 pages; submitted to IJCAI 2026 AI and Social Good Track
Abstract
The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
中文标题/摘要
标题:开发与评估支持孕产妇保健的聊天机器人
使用基于电话的聊天机器人提供可信的孕产妇健康信息可以在低资源环境中产生重大影响,特别是在用户健康素养低且难以获得护理的情况下。然而,部署此类系统在技术上具有挑战性:用户查询简短、不明确且跨语言混杂,答案需要特定区域背景,不完整的或缺失的症状信息使安全路由决策变得困难。 我们介绍了一个在印度开发的孕产妇健康聊天机器人,该系统由学术研究人员、健康科技公司、公共卫生非营利组织和医院合作开发。该系统结合了(1)阶段感知的分诊,将高风险查询路由到专家模板,(2)混合检索,涵盖孕产妇/新生儿指南,以及(3)基于LLM的证据条件生成。我们的核心贡献是在有限专家监督下进行高风险部署的评估工作流程。我们针对组件级和端到端测试引入了:(i)带有150个标记分诊基准,紧急召回率为86.7%,明确报告了紧急情况漏报与过度升级之间的权衡;(ii)带有片段级证据标签的合成多证据检索基准,数量为100;(iii)使用临床设计标准的LLM作为评判者比较真实查询(数量为781);以及(iv)专家验证。我们的研究结果表明,在多语言和嘈杂的环境中,值得信赖的医疗助手需要多层次设计配以多方法评估,而不是单一模型和评估方法的选择。
Summary / 总结
The research aims to develop a chatbot for maternal health in low-resource settings, addressing challenges such as short and underspecified user queries and the need for regional context. The system combines triage, hybrid retrieval, and evidence-conditioned generation. Key experimental findings include an 86.7% emergency recall rate for triage, a synthetic retrieval benchmark with explicit evidence labels, and an LLM-as-judge comparison on real queries, highlighting the need for a defense-in-depth design and multi-method evaluation.
该研究开发并评估了一个针对印度孕产妇健康护理的聊天机器人,解决了诸如用户查询简短且不明确、语言混杂以及需要区域背景信息等挑战。聊天机器人采用了阶段感知的分流、混合检索和基于证据的生成技术。评估包括标记的分流基准、合成的检索基准以及LLM作为评判者比较,表明在多语言和嘈杂环境中,可信的医疗助手需要多层次设计和多方法评估。
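The missed-emergency versus over-escalation trade-off mentioned in the abstract can be made concrete with two rates computed from binary triage labels. The function name and the labels below are made up for illustration; they are not the paper's benchmark or data.

```python
def triage_tradeoff(y_true, y_pred):
    """Return (emergency recall, over-escalation rate) for binary triage labels.

    1 marks an emergency, 0 a routine query.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed emergencies
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # over-escalations
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn)
    over_escalation = fp / (fp + tn)
    return recall, over_escalation

# Made-up labels for illustration only; not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
recall, over_escalation = triage_tradeoff(y_true, y_pred)
```

Lowering the escalation threshold raises recall at the cost of a higher over-escalation rate, which is why the two numbers are worth reporting together.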
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
Authors: Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma
First: 2026-03-13T16:56:00+00:00 · Latest: 2026-03-13T16:56:00+00:00
Abstract
Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
中文标题/摘要
标题:DiT-IC:对齐的扩散变换器用于高效的图像压缩
基于扩散的图像压缩最近展示了出色的感知保真度,但其实用性受到采样开销巨大和高内存使用率的阻碍。大多数现有的扩散编解码器采用U-Net架构,其中分层下采样迫使扩散在浅层潜在空间(通常只有8倍空间下采样)中运行,导致过多的计算。相比之下,传统的基于VAE的编解码器工作在更深的潜在域(16倍至64倍下采样),这激发了一个关键问题:扩散是否可以在如此紧凑的潜在空间中有效运行而不牺牲重建质量?为了解决这个问题,我们提出了DiT-IC,一种用于图像压缩的对齐扩散变换器,它用能够在32倍下采样分辨率的潜在空间中完全执行扩散的扩散变换器替代了U-Net。DiT-IC 通过三种关键对齐机制将预训练的多步文本到图像DiT 调整为单步重建模型:(1)一种基于方差的重建流程,根据潜在不确定性调整去噪强度以实现高效的重建;(2)一种自我蒸馏对齐,强制一致性以使潜在几何结构与编码器定义的潜在几何结构一致,从而实现一步扩散;(3)一种潜在条件引导,用语义对齐的潜在条件替换文本提示,实现无文本推理。凭借这些设计,DiT-IC 达到了最先进的感知质量,同时提供了比现有基于扩散的编解码器高达30倍更快的解码速度和大幅降低的内存使用率。令人惊讶的是,它可以在16 GB的笔记本GPU上重建2048x2048的图像。
Summary / 总结
DiT-IC is an Aligned Diffusion Transformer for image compression that addresses the high computational cost and memory usage of existing diffusion-based codecs. It uses a Diffusion Transformer to perform diffusion at 32x downscaled resolution, and introduces three alignment mechanisms: variance-guided reconstruction, self-distillation alignment, and latent-conditioned guidance. DiT-IC achieves state-of-the-art perceptual quality, up to 30x faster decoding, and lower memory usage compared to existing diffusion-based codecs, and can reconstruct 2048x2048 images on a 16 GB laptop GPU.
DiT-IC 是一种用于图像压缩的对齐扩散变换器,通过在 32x 下采样分辨率下操作来解决扩散方法的高计算成本问题。它使用了三种关键机制:方差引导重建、自我蒸馏对齐和潜在条件引导。DiT-IC 达到了最先进的感知质量,解码速度最高可提升 30 倍,并且内存使用量低于现有基于扩散的方法,能够在 16 GB 的笔记本 GPU 上重建 2048x2048 的图像。
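A quick calculation shows why the downscaling factor matters for compute. The helper below is illustrative and counts only spatial positions in the latent grid, ignoring channel width and any patching the model may apply.

```python
def latent_positions(height, width, downscale):
    """Number of spatial positions in a latent grid at a given downscale factor."""
    return (height // downscale) * (width // downscale)

shallow = latent_positions(2048, 2048, 8)   # 8x, typical for U-Net diffusion codecs
deep = latent_positions(2048, 2048, 32)     # 32x, the resolution DiT-IC operates at
print(shallow, deep, shallow // deep)  # prints: 65536 4096 16
```

Moving from 8x to 32x downscaling cuts the number of latent positions for a 2048x2048 image by a factor of 16.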
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
Authors: Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song
Venue: AAAI 2026
First: 2026-03-13T16:48:05+00:00 · Latest: 2026-03-13T16:48:05+00:00
Comments: To be published in the AAAI 2026 proceedings
Abstract
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and make reliable automated analysis hard. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
中文标题/摘要
标题:ESG-Bench:长文ESG报告基准测试以减轻幻觉
随着企业责任越来越多地纳入环境、社会和治理(ESG)标准,ESG报告在许多地区已成为一项法律要求,并成为记录可持续实践和评估公司长期和道德表现的关键渠道。然而,ESG披露的长度和复杂性使其难以解读和可靠地自动化分析。为了支持可扩展和可信赖的分析,本文介绍了ESG-Bench,这是一个用于ESG报告理解和大型语言模型(LLM)幻觉减轻的基准数据集。ESG-Bench包含基于真实ESG报告背景的人工标注问答(QA)对,并使用细粒度标签表明模型输出是否得到事实支持或幻觉。将ESG报告分析作为具有可验证性约束的问答任务,使我们能够系统地评估LLM提取和推理ESG内容的能力,并提供一种新的用例:在社会敏感、合规关键的环境中减轻幻觉。我们设计了特定任务的思维链(CoT)提示策略,并在ESG-Bench上使用CoT标注的理由对多个最先进的LLM进行微调。我们的实验表明,这些基于CoT的方法在减少幻觉方面显著优于标准提示和直接微调,并且这些收益可以转移到ESG领域之外的现有问答基准上。
Summary / 总结
This paper addresses the challenge of interpreting and automating the analysis of long and complex ESG reports, which are increasingly important for corporate responsibility. ESG-Bench is introduced as a benchmark dataset for evaluating large language models in understanding and mitigating hallucinations in ESG reports. The dataset includes human-annotated question-answer pairs and fine-grained labels. The authors use Chain-of-Thought prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench, demonstrating that these methods significantly reduce hallucinations compared to standard prompting and direct fine-tuning, with benefits extending to other QA benchmarks.
本文介绍了ESG-Bench,这是一个用于评估大型语言模型在理解和减轻ESG报告中幻觉问题的基准数据集。该数据集包含来自真实ESG报告的人工标注问题-答案对,并标注了事实支持情况。通过将ESG报告分析作为具有验证约束的问答任务,作者实现了对LLM的系统性评估。实验表明,使用链式思维提示策略和带有链式思维注释理由的微调显著减少了幻觉,与标准提示和直接微调方法相比,这些改进也扩展到了ESG领域之外的问答基准任务中。
Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules
Authors: Jonas Landsgesell, Pascal Knoll
First: 2026-03-09T10:38:01+00:00 · Latest: 2026-03-13T16:39:12+00:00
Abstract
Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics (RMSE, $R^2$). This mismatch implicitly rewards models that elicit a good conditional mean while ignoring the quality of the predicted distribution. We make two contributions. First, we propose supplementing standard point metrics with proper scoring rules (CRPS, CRLS, and the Interval Score) and provide a head-to-head comparison of realTabPFNv2.5 and TabICLv2 with regard to some proper scoring rules across 20 OpenML regression datasets. Second, we show analytically and empirically that different proper scoring rules induce different model rankings and different inductive biases during training, even though each rule is individually minimized by the true distribution. Fine-tuning realTabPFNv2.5 with scoring rules not seen during pretraining (CRLS, $β=1.8$ energy score) yields consistent improvements on the corresponding metrics, confirming that the training loss shapes the model beyond what propriety alone guarantees. Together, these findings argue for (i) reporting distributional metrics in tabular regression benchmarks and (ii) making the training objective of foundation models adaptable (via fine-tuning or task-token conditioning) to the scoring rule relevant to the downstream decision problem.
中文标题/摘要
标题:基于表格基础模型的分布回归:通过合适的评分规则评估概率预测
诸如TabPFN和TabICL之类的表格基础模型已经生成了完整的预测分布,但用于评估它们的基准(TabArena、TALENT等)仍然几乎完全依赖于点估计指标(均方根误差、$R^2$)。这种不匹配隐式地奖励了那些产生良好条件均值的模型,而忽略了预测分布的质量。我们做出了两项贡献。首先,我们建议用合适的评分规则(CRPS、CRLS和区间评分)补充标准的点估计指标,并对realTabPFNv2.5和TabICLv2在20个OpenML回归数据集上的某些合适的评分规则进行了直接对比。其次,我们通过分析和实验证明,不同的合适的评分规则会诱导不同的模型排名和训练期间不同的归纳偏置,尽管每个规则都由真实分布单独最小化。通过预训练中未见过的评分规则(CRLS,$β=1.8$能量评分)微调realTabPFNv2.5,可以一致地提高相应的指标,这证实了训练损失不仅限于单一的适当性保证。综上所述,这些发现表明(i)在表格回归基准中报告分布性指标,并且(ii)使基础模型的训练目标(通过微调或任务标记条件)适应下游决策问题相关的评分规则是必要的。
Summary / 总结
This study addresses the issue of evaluating tabular foundation models like TabPFN and TabICL, which produce full predictive distributions, using benchmarks that focus on point-estimate metrics. The authors propose using proper scoring rules such as CRPS, CRLS, and the Interval Score to evaluate model performance. They compare realTabPFNv2.5 and TabICLv2 across 20 OpenML regression datasets and find that different scoring rules can induce different model rankings and biases, even though each rule is individually optimized by the true distribution. Fine-tuning realTabPFNv2.5 with scoring rules not seen during pretraining yields consistent improvements on the corresponding metrics, highlighting the importance of using distributional metrics and adaptable training objectives in tabular regression benchmarks.
论文针对表格式基础模型如TabPFN和TabICL虽然能生成完整的预测分布,但主要使用点估计指标进行评估的问题,引入了合适的评分规则(CRPS、CRLS和区间评分)来评估预测分布的质量,并在20个OpenML回归数据集上比较了realTabPFNv2.5和TabICLv2的表现。研究显示不同的评分规则可以导致不同的模型排名和训练偏见,并且通过未见过的评分规则进行微调可以提高相应的指标性能,这表明在表格式回归基准中需要报告分布性指标,并且基础模型的训练目标需要适应下游决策问题相关的评分规则。
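Two of the scoring rules named in the abstract have standard textbook forms that are easy to sketch. The code below uses the common sample-based estimator of the univariate energy score (which equals the CRPS at beta = 1) and the usual interval score for a central prediction interval. The function names are illustrative, and the paper's exact definitions and estimators may differ.

```python
import numpy as np

def energy_score(samples, y, beta=1.0):
    """Sample-based univariate energy score; beta=1 gives the CRPS estimate."""
    x = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(x - y) ** beta)
    term_spread = np.mean(np.abs(x[:, None] - x[None, :]) ** beta)
    return term_obs - 0.5 * term_spread

def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)  # penalty if y falls below
    above = (2.0 / alpha) * max(y - upper, 0.0)  # penalty if y falls above
    return width + below + above

rng = np.random.default_rng(0)
y = 0.0
sharp = energy_score(rng.normal(0.0, 1.0, 500), y)  # well-calibrated and narrow
vague = energy_score(rng.normal(0.0, 5.0, 500), y)  # same mean, far too wide
```

Both forecasts share the same mean, so a point metric such as RMSE on the mean cannot separate them, while the CRPS estimate is lower (better) for the sharper forecast. Changing `beta` changes how the tails are weighted, which is one way different proper scoring rules can rank models differently.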
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Authors: Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
First: 2025-10-27T17:41:38+00:00 · Latest: 2026-03-13T16:29:34+00:00
Comments: Website: https://robotarenainf.github.io
Abstract
The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
中文标题/摘要
标题:RobotArena $\infty$:通过实到模拟转换实现可扩展的机器人基准测试
机器人通才,即能够在多种环境中执行多种任务的代理,需要严格的可扩展评估。然而,机器人策略的实际测试仍然受到根本限制:它劳动密集、速度慢、在大规模应用时不安全且难以复现。随着策略的范围和复杂性扩大,这些障碍只会加剧,因为机器人成功往往依赖于执行质量的微妙的人类判断。我们提出了RobotArena Infinity,这是一种新的基准测试框架,通过将视觉-语言-动作(VLA)评估转移到增强有人工反馈的大规模模拟环境中来克服这些挑战。利用视觉-语言模型、2D到3D生成建模和可微渲染的最新进展,我们的方法自动将广泛使用的机器人数据集中的视频演示转换为模拟对应物。在这些数字孪生中,我们使用自动化的视觉-语言模型指导评分和从众包工人收集的可扩展的人类偏好判断来评估VLA策略,将人类参与从繁琐的场景设置、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性,我们系统地沿多个轴线扰动模拟环境,包括纹理和物体放置,对策略在受控变化下的泛化能力进行压力测试。结果是一个不断演进、可复现且可扩展的基准测试,用于实际训练的机器人操作策略,解决了当今机器人领域的一个关键缺失能力。
Summary / 总结
RobotArena $\infty$ addresses the need for scalable and rigorous evaluation of robot policies by shifting vision-language-action (VLA) evaluation into large-scale simulated environments with online human feedback. It uses advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering to convert real-world video demonstrations into simulated counterparts. The framework assesses VLA policies through automated scoring and scalable human preference judgments, and systematically perturbs simulated environments to test policy robustness. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies.
RobotArena $\infty$ 旨在通过实况到模拟的转换来评估能够执行多种任务并适应不同环境的机器人策略。该方法利用视觉语言模型、2D 到 3D 生成建模和可微渲染将视频演示转换为模拟环境,在这些环境中,策略通过自动评分和人类偏好判断进行评估。关键发现包括系统地扰动模拟环境以测试策略的鲁棒性和泛化能力,从而形成一个可扩展且可重复的基准,用于训练有素的机器人操作策略。
When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO
Authors: Yu Li, Tian Lan, Zhengling Qi
First: 2026-03-13T16:25:02+00:00 · Latest: 2026-03-13T16:25:02+00:00
Abstract
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on the group mean, GRPO treats each output as an independent sample during optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group. It thus ignores the rich comparative signal that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusting the advantage baseline in GRPO using reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across comprehensive models and algorithms. Code is available at https://github.com/Skylanding/BiCC.
中文标题/摘要
标题:当正确遇到错误:基于奖励-信心校正的双边上下文条件化GRPO
组相对策略优化(GRPO)已成为训练推理模型的有效方法。虽然它基于组均值计算优势,但在优化过程中将每个输出视为独立样本,忽略了同一组内正确和错误解之间自然对比这一重要结构信号,从而忽视了可以利用的丰富比较数据,即明确将成功的推理轨迹与失败的轨迹进行对比。为利用这一点,我们提出了GRPO的对比重构,表明GRPO目标隐式地最大化了正确和错误样本策略比率之间的差距。基于这一洞察,我们提出了双边上下文条件化(BICC)机制,允许模型在优化过程中交叉参考成功的和失败的推理轨迹,实现样本间的直接信息流。我们还引入了奖励-信心校正(RCC),通过从一阶近似方差最小化估计器导出的奖励-信心协方差动态调整GRPO中的优势基线,以稳定训练。这两种机制不需要额外的采样或辅助模型,并且可以适应所有GRPO变体。在数学推理基准测试上的实验表明,这两种机制在综合模型和算法中均能实现一致的改进。代码可在https://github.com/Skylanding/BiCC 获取。
Summary / 总结
The paper addresses the limitation of Group Relative Policy Optimization (GRPO) in ignoring the contrast between correct and incorrect solutions within the same group. It introduces Bilateral Context Conditioning (BICC) to leverage this contrast by allowing the model to cross-reference successful and failed reasoning traces during optimization. Additionally, Reward-Confidence Correction (RCC) is proposed to stabilize training by dynamically adjusting the advantage baseline. Experiments show consistent improvements across various models and algorithms on mathematical reasoning benchmarks.
研究旨在通过利用正确和错误推理痕迹之间的对比来增强Group Relative Policy Optimization (GRPO)。方法引入了Bilateral Context Conditioning (BICC),以在优化过程中实现成功和失败推理痕迹之间的交叉引用,并引入了Reward-Confidence Correction (RCC)来稳定训练。实验结果显示,在各种模型和算法上的一致改进,特别是在数学推理基准测试中。
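As a rough illustration of the group-mean baseline that the GRPO abstract above refers to, the sketch below computes group-relative advantages for the outputs sampled for a single prompt. The function name, the division by the group standard deviation, and the `eps` term are assumptions made for illustration (the abstract only mentions the group mean), and the sketch does not implement BICC or RCC.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on its group mean and scale by the group std.

    `rewards` holds the scalar rewards of all outputs sampled for ONE prompt.
    The std scaling and `eps` are illustrative assumptions, not taken from
    the abstract.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Hypothetical group: two correct outputs (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, which is the within-group contrast the paper proposes to exploit more explicitly.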
Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
Authors: Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Tingyu Wu, Chenglong Li, Vireo Zhang, Kun Wang
First: 2026-03-13T16:23:34+00:00 · Latest: 2026-03-13T16:23:34+00:00
Abstract
Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. In Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. In Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.
中文标题/摘要
标题:Steve-Evolving: 开放世界具身自我进化框架通过细粒度诊断和双轨知识蒸馏
开放世界具身智能体必须解决长期任务,其中主要瓶颈不是单步规划质量,而是交互经验的组织和进化。为此,我们提出了Steve-Evolving,一种非参数化的自我进化框架,该框架将细粒度执行诊断与双轨知识蒸馏紧密耦合在一个闭环中。该方法分为三个阶段:经验锚定、经验蒸馏和知识驱动的闭环控制。具体而言,经验锚定将每个子目标尝试固化为具有固定模式的结构化经验元组(前状态、动作、诊断结果和后状态),并将其组织在具有多维索引(例如,条件签名、空间哈希和语义标签)的三层经验空间中,加上滚动总结,以实现高效和可审计的检索。为了确保足够的信息密度以进行归因,执行层提供了超出二元结果的组合诊断信号,包括状态差异总结、列举的失败原因、连续指标以及停滞/循环检测。此外,经验蒸馏成功的轨迹被泛化为可重用技能,带有明确的前提条件和验证标准,而失败则被蒸馏为可执行的护栏,捕捉根本原因并禁止子目标和任务粒度上的风险操作。此外,知识驱动的闭环控制检索技能和护栏,并将其注入LLM规划器中,诊断触发的局部重规划在线更新活动约束,形成一个持续进化过程,而无需任何模型参数更新。在Minecraft MCU的长期任务套件上的实验表明,与静态检索基线相比,该方法具有一致的改进。
Summary / 总结
Steve-Evolving is a non-parametric self-evolving framework designed to enhance open-world embodied agents' ability to solve long-horizon tasks by integrating fine-grained execution diagnosis and dual-track knowledge distillation. The framework consists of three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. It solidifies each subgoal attempt into structured experience tuples and organizes them in a three-tier experience space. Successful trajectories are generalized into reusable skills, while failures are distilled into guardrails. The framework demonstrates consistent improvements over static-retrieval baselines in Minecraft MCU long-horizon tasks.
Steve-Evolving 是一个非参数化的自我演化框架,旨在通过精细的执行诊断和双轨知识蒸馏来组织和演化开放世界中的交互经验。该框架包含三个阶段:经验锚定、经验蒸馏和知识驱动的闭环控制。它将每个子目标尝试固化为结构化经验元组,并组织在一个三层经验空间中。此外,它还提供了详细的组合诊断信号,并将成功的轨迹泛化为可重用的技能,同时将失败蒸馏为可执行的护栏。实验结果显示,在Minecraft MCU的长期任务中,该框架相对于静态检索基线具有一致的改进。
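The fixed-schema experience tuple described in the Steve-Evolving abstract above can be pictured with a small data structure. This is only a sketch: the class names, field types, and the single tag-based index are hypothetical stand-ins, and the three-tier experience space, spatial hashing, and rolling summarization from the abstract are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ExperienceTuple:
    """One subgoal attempt: (pre-state, action, diagnosis-result, post-state)."""
    pre_state: dict
    action: str
    diagnosis_result: str
    post_state: dict
    tags: tuple = ()  # hypothetical semantic tags used for indexing


class ExperienceStore:
    """Toy store with a single index keyed by semantic tag (an assumption)."""

    def __init__(self):
        self._by_tag = {}

    def add(self, exp):
        # Register the experience under every tag it carries.
        for tag in exp.tags:
            self._by_tag.setdefault(tag, []).append(exp)

    def recall(self, tag):
        # Return a copy so callers cannot mutate the index.
        return list(self._by_tag.get(tag, []))


store = ExperienceStore()
store.add(ExperienceTuple(
    pre_state={"inventory": []},
    action="mine_log",
    diagnosis_result="success",
    post_state={"inventory": ["log"]},
    tags=("wood", "gathering"),
))
print(len(store.recall("wood")), len(store.recall("stone")))
```

Keeping the diagnosis result as its own field is what lets failures and successes be separated later, for example when distilling guardrails from one and skills from the other.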
Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning
Authors: Antoine Moulin, Gergely Neu, Luca Viano
First: 2025-02-19T17:32:35+00:00 · Latest: 2026-03-13T16:18:27+00:00
Abstract
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - γ)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $γ\in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.
中文标题/摘要
标题:乐观探索以实现可证明高效的无限时域强化学习和模仿学习
我们研究了无限时域折扣线性马尔可夫决策过程(MDP)中的强化学习问题,并提出了第一个在该设置下具有最优率遗憾保证的计算高效算法。我们的主要思想是结合两种经典的乐观探索技术:应用于奖励函数的加性探索奖金,以及向具有最大回报的吸收状态制造人工过渡。我们证明,结合正则化近似动态规划方案后,所提出算法的遗憾为 $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - γ)^{- 7 / 2} T})$,其中 $T$ 是样本过渡的总数,$γ\in (0,1)$ 是折扣因子,$d$ 是特征维度。该结果在对抗性奖励序列下仍然成立,使我们的方法能够应用于线性MDP中的模仿学习问题,我们在此问题上取得了最先进的成果。
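To get a feel for the regret rate quoted in the abstract above, the sketch below evaluates the expression inside the $\tilde{\mathcal{O}}(\cdot)$ while ignoring constants and logarithmic factors. The function names and the example values of $d$, $γ$, and $T$ are illustrative assumptions, not numbers from the paper.

```python
import math


def regret_bound_scale(d, gamma, T):
    """sqrt(d^3 * (1 - gamma)^(-7/2) * T), without constants or log factors."""
    return math.sqrt(d ** 3 * (1.0 - gamma) ** (-3.5) * T)


def per_step_regret_scale(d, gamma, T):
    """Bound divided by T: how the average regret per transition shrinks."""
    return regret_bound_scale(d, gamma, T) / T


# Hypothetical setting: d = 10 features, discount factor 0.9.
small = regret_bound_scale(10, 0.9, 1_000)
large = regret_bound_scale(10, 0.9, 4_000)
print(large / small)                          # 4x the transitions, 2x the bound
print(per_step_regret_scale(10, 0.9, 1_000))
print(per_step_regret_scale(10, 0.9, 4_000))
```

Because the bound grows like $\sqrt{T}$, the per-transition average shrinks like $1/\sqrt{T}$, while a discount factor closer to 1 inflates the bound through the $(1 - γ)^{-7/2}$ term.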
FDeID-Toolbox: Face De-Identification Toolbox
Authors: Hui Wei, Hao Yu, Guoying Zhao
First: 2026-03-13T16:15:34+00:00 · Latest: 2026-03-13T16:15:34+00:00
Comments: Technical Report. Codebase: https://github.com/infraface/FDeID-Toolbox
Abstract
Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.
中文标题/摘要
标题:FDeID-工具箱:面部去标识化工具箱
面部去标识化(FDeID)旨在从面部图像中去除个人可识别信息,同时保留与任务相关的有用属性,如年龄、性别和表情。这对于保护隐私的计算机视觉至关重要,但该领域存在碎片化的实现、不一致的评估协议和无法比较的研究结果。这些挑战源于任务本身的固有复杂性:FDeID 涉及多个下游应用(例如年龄估计、性别识别、表情分析),并且需要在三个维度上进行评估(例如隐私保护、有用性保留和视觉质量),使得现有的代码库难以使用和扩展。为了解决这些问题,我们提出了FDeID-工具箱,这是一个旨在实现可重复FDeID研究的综合工具箱。我们的工具箱具有模块化架构,包括四个核心组件:(1)主流基准数据集的标准数据加载器,(2)统一的方法实现,涵盖经典方法到SOTA生成模型,(3)灵活的推理管道,(4)系统化的评估协议,涵盖隐私、有用性和质量指标。通过实验,我们证明FDeID-工具箱能够在一致的条件下实现多种FDeID方法的公平和可重复比较。
Summary / 总结
FDeID-Toolbox is designed to address the fragmented and inconsistent nature of face de-identification research by providing a modular and comprehensive framework. It includes standardized data loaders, unified method implementations, flexible inference pipelines, and systematic evaluation protocols. Experiments show that FDeID-Toolbox allows for fair and reproducible comparisons of different de-identification methods under consistent conditions.
FDeID-Toolbox旨在解决面部去标识化研究中的碎片化实现和不一致的评估协议问题。它提供了一个模块化的架构,包括标准化的数据加载器、统一的方法实现、灵活的推理管道和系统化的评估协议。实验表明,FDeID-Toolbox能够在一致的条件下实现各种面部去标识化方法的公平和可重复比较。
Geometry-Guided Camera Motion Understanding in VideoLLMs
Authors: Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su
First: 2026-03-13T16:13:09+00:00 · Latest: 2026-03-13T16:13:09+00:00
Comments: 10 pages, 7 figures, supplementary included
Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
中文标题/摘要
标题:视频LLMs中的几何引导摄像机运动理解
摄像机运动是塑造视觉感知和电影风格的基本几何信号,但当前的视频能力视觉语言模型(VideoLLMs)很少明确表示它,并且经常在精细的运动基元上出错。我们通过一个框架来解决这一差距,该框架包括基准测试、诊断和注入。我们编纂了一个名为$\textbf{CameraMotionDataset}$的大规模合成数据集,其中明确控制了摄像机运动,并将摄像机运动形式化为约束感知的多标签识别问题,构建了一个VQA基准——$\textbf{CameraMotionVQA}$。在多种现成的VideoLLMs中,我们观察到在识别摄像机运动基元方面存在大量错误。对Qwen2.5-VL视觉编码器的探针实验表明,摄像机运动提示在视觉编码器中弱表示,尤其是在更深层次的ViT块中,这有助于解释观察到的失败模式。为了在不进行昂贵的训练或微调的情况下弥合这一差距,我们提出了一种轻量级、模型无关的管道,该管道从3D基础模型(3DFMs)中提取几何摄像机提示,使用时间分类器预测受约束的运动基元,并通过结构化提示将它们注入到下游VideoLLM推理中。实验表明,运动识别得到了改善,模型的响应也更加关注摄像机,突出了几何驱动提示提取和结构化提示作为实现摄像机感知VideoLLM和VLA系统的实用步骤。数据集和基准可以在https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark/上公开获取。
Summary / 总结
The research aims to improve the understanding of camera motion in video-capable vision-language models (VideoLLMs) by addressing the lack of explicit representation of camera motion. The authors introduce a framework involving benchmarking, diagnosis, and injection, and develop the CameraMotionDataset, a large-scale synthetic dataset with explicit camera control. They observe significant errors in recognizing camera motion primitives across various VideoLLMs and propose a lightweight, model-agnostic pipeline to inject geometric camera cues into VideoLLMs through structured prompting, which improves motion recognition and enhances camera-awareness in model responses.
本文通过引入基准测试、诊断和注入的框架,解决了VideoLLMs中缺乏显式摄像机运动表示的问题。它构建了CameraMotionDataset,并将摄像机运动形式化为约束感知的多标签识别。作者观察到各种VideoLLMs在识别摄像机运动基本元素时存在显著错误,并提出了一种轻量级、模型无关的管道,从3D基础模型中提取几何摄像机线索,使用时间分类器预测约束运动基本元素,并通过结构化提示将它们注入到下游VideoLLM推理中,从而提高了运动识别和模型响应。
NOIR: Neural Operator mapping for Implicit Representations
Authors: Sidaty El Hadramy, Nazim Haouchine, Michael Wehrli, Philippe C. Cattin
First: 2026-03-13T16:13:05+00:00 · Latest: 2026-03-13T16:13:05+00:00
Abstract
This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: https://github.com/Sidaty1/NOIR-io.
中文标题/摘要
标题:NOIR:神经操作符映射用于隐式表示
本文介绍了NOIR框架,该框架将核心医学成像任务重新定义为连续函数空间之间的操作符学习,挑战了现有的基于离散网格的深度学习主导范式。NOIR 不是在固定的像素或体素网格上操作,而是将离散的医学信号嵌入到共享的隐式神经表示中,并学习一个神经操作符,该操作符在它们的潜在调制之间进行映射,从而实现分辨率无关的函数到函数的转换。我们在多个2D和3D下游任务上评估了NOIR,包括分割、形状补全、图像到图像的转换和图像合成,使用了多个公开数据集,如深圳、OASIS-4、SkullBreak、fastMRI,以及一个内部临床数据集。它在原生分辨率下实现了竞争力的性能,同时对未见过的离散化表现出强大的鲁棒性,并且实证上满足了神经操作符的关键理论性质。项目页面在此:https://github.com/Sidaty1/NOIR-io.
Summary / 总结
NOIR is a framework that redefines medical imaging tasks as operator learning between continuous function spaces, avoiding the use of fixed pixel or voxel grids. By embedding discrete medical signals into shared Implicit Neural Representations and learning a Neural Operator, NOIR enables resolution-independent transformations. The framework was evaluated on various 2D and 3D tasks, including segmentation and image synthesis, across multiple datasets. NOIR achieved competitive performance at native resolution and showed robustness to unseen discretizations, satisfying key theoretical properties of neural operators.
NOIR 是一个框架,将医学成像任务重新定义为在连续函数空间之间的操作学习,避免使用固定的像素或体素网格。它将离散的医学信号嵌入到共享的隐式神经表示中,并学习一个神经操作符以实现无分辨率依赖的函数到函数的转换。NOIR 在多种 2D 和 3D 任务上表现出色,显示出对未见过的离散化的鲁棒性,并且满足神经操作符的关键理论性质。
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots
Authors: Guoqiang Zhao, Zhe Yang, Sheng Wu, Fei Teng, Mengfei Duan, Yuanfan Zheng, Kai Luo, Kailun Yang
First: 2026-03-13T16:04:33+00:00 · Latest: 2026-03-13T16:04:33+00:00
Comments: The dataset and code will be publicly released at https://github.com/SXDR/PanoMMOcc
Abstract
Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
中文标题/摘要
标题:全景多模态语义占用预测技术在四足机器人中的应用
全景图像为四足机器人提供了360°的整体视觉覆盖,但现有的占用预测方法主要针对轮式自主驾驶,依赖于RGB信息,限制了其在复杂环境中的鲁棒性。为解决这一问题,(1) 我们提出了PanoMMOcc,这是首个针对四足机器人的全景多模态占用数据集,包含四种传感模态,适用于多种场景。(2) 我们提出了一种全景多模态占用感知框架VoxelHound,专为腿式移动和球形成像设计。具体来说,我们设计了(i) 一种垂直抖动补偿模块(VJC),以减轻移动过程中由身体俯仰和滚动引起的严重视角变化,从而实现更一致的空间推理;(ii) 一种有效的多模态信息提示融合模块(MIPF),该模块联合利用全景视觉线索和辅助模态,以增强体素占用预测。(3) 我们基于PanoMMOcc建立了基准,并提供了详细的数据分析,以在具有挑战性的实体场景中系统评估感知方法。大量实验表明,VoxelHound在PanoMMOcc上的表现优于现有方法(+4.16%的mIoU)。数据集和代码将在https://github.com/SXDR/PanoMMOcc公开发布,以促进未来对全景多模态3D感知在实体机器人系统中的研究,同时发布的还有校准工具https://github.com/losehu/CameraLiDAR-Calib。
Summary / 总结
The research aims to improve occupancy prediction for quadruped robots using panoramic multimodal sensing. The authors introduce PanoMMOcc, a new dataset with four sensing modalities for diverse scenes, and propose VoxelHound, a framework with a VJC module for viewpoint compensation and an MIPF module for multimodal information fusion. Experiments show that VoxelHound outperforms existing methods by 4.16% in mIoU on PanoMMOcc.
研究旨在通过全景图像提高四足机器人的占用率预测,解决现有方法依赖RGB线索的局限性。研究引入了PanoMMOcc,一个新的全景多模态占用率数据集,以及VoxelHound,一个定制的感知框架。VoxelHound包括一个处理视点扰动的VJC模块和一个整合多种传感输入的MIPF模块。实验表明,VoxelHound在PanoMMOcc上的mIoU表现优于现有方法,提高了4.16%。
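The mIoU metric in which the PanoMMOcc gain above is reported can be sketched as follows for flattened voxel labels. This is a generic mean intersection-over-union, not the benchmark's evaluation code; skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here, as are the toy labels.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, three semantic classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(miou)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.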
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Authors: Matteo Ballegeer, Dries F. Benoit
Venue: CVPR 2026
First: 2026-02-27T15:21:52+00:00 · Latest: 2026-03-13T16:04:29+00:00
Comments: Manuscript accepted at CVPR 2026
Abstract
Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary $\mathbf{SO}(3)$ rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
中文标题/摘要
标题:FoV-Net:通过视场射线投射学习旋转不变的CAD B-rep
直接从边界表示(B-reps)学习显著推进了3D CAD分析。然而,最先进的B-rep学习方法依赖绝对坐标和法线来编码全局上下文,使其对旋转非常敏感。我们的实验表明,在对齐基准上达到95%以上准确率的模型,在任意$\mathbf{SO}(3)$旋转下可能会骤降至10%。为了解决这一问题,我们提出了FoV-Net,这是第一个以旋转不变的方式同时捕捉局部表面几何和全局结构上下文的B-rep学习框架。每个面由一个局部参考框架(LRF)UV网格表示,编码其局部表面几何,以及通过投射射线并记录与相邻面的交点来捕捉周围3D上下文的视场(FoV)网格。轻量级CNN提取每个面的特征,通过图注意力网络在B-rep图上进行传播。FoV-Net在B-rep分类和分割基准上达到了最先进的性能,展示了对任意旋转的鲁棒性,同时需要较少的训练数据即可获得良好的结果。
Summary / 总结
The research aims to improve the robustness of 3D CAD analysis by addressing the sensitivity of existing B-rep learning methods to rotations. FoV-Net, a novel B-rep learning framework, captures both local surface geometry and global structural context in a rotation-invariant manner. It uses Local Reference Frame UV-grids for local geometry and Field-of-View grids for global context through ray casting. FoV-Net outperforms existing methods on B-rep classification and segmentation benchmarks, showing robustness to arbitrary rotations and requiring less training data for strong results.
研究旨在通过解决现有B-rep学习方法对旋转敏感的问题,提高3D CAD分析的性能。FoV-Net是一种新颖的旋转不变框架,能够同时捕捉局部表面几何和全局结构上下文。它使用局部参考框架UV网格和视场网格来编码面及其周围的环境。FoV-Net在B-rep分类和分割基准测试中表现出色,显示出对任意旋转的鲁棒性,并且需要较少的训练数据即可达到良好的效果。
BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending
Authors: Matteo Ballegeer, Dries F. Benoit
First: 2026-03-13T15:57:44+00:00 · Latest: 2026-03-13T15:57:44+00:00
Abstract
Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.
中文标题/摘要
标题:BenDFM:一种用于板料弯曲制造可行性的分类体系和合成CAD数据集
早期预测CAD设计的制造可行性及其所需努力程度,是制造设计(DFM)的关键目标。尽管在CAD和制造过程选择中广泛应用了深度学习,但针对特定过程预测制造可行性的基于学习的方法仍然有限。两个关键挑战限制了进展:先前工作中制造可行性的定义不一致,导致学习目标不一致,以及可用数据集的稀缺性。现有标签差异显著:它们可能反映内在设计约束或依赖特定的制造能力(如可用工具),并且范围从离散的可行性检查到连续的复杂性度量。此外,工业数据集通常只包含可制造的部件,对不可行情况提供的信号很少,而现有的合成数据集则专注于简单的几何形状和减材工艺。为了解决这些差距,我们提出了制造可行性的分类体系,沿着配置依赖性和测量类型的轴线进行分类,以更清晰地界定泛化能力和学习目标。接着,我们介绍了BenDFM,这是第一个用于板料弯曲制造可行性的合成数据集。BenDFM 包含20,000个部件,既有可制造的也有不可制造的,通过具有工艺意识的弯曲模拟生成,提供了折叠和展开的几何形状以及分类体系中的多种制造可行性的标签,使对先前未探索的基于学习的DFM挑战进行系统研究成为可能。我们在BenDFM上对两种最先进的3D学习架构进行了基准测试,结果显示,能够捕捉部件表面之间关系的图表示法具有更高的准确性,而预测依赖特定制造设置的指标仍然更具挑战性。
Summary / 总结
The paper addresses the challenge of predicting the manufacturability of CAD designs in sheet metal bending, proposing a taxonomy and a new synthetic dataset named BenDFM. BenDFM includes 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware simulations, providing a wide range of manufacturability labels. The study benchmarks two 3D learning architectures and finds that graph-based representations perform better, while predicting metrics specific to manufacturing setups remains challenging.
论文针对预测板料弯曲中CAD设计的可制造性问题,提出了一种制造性度量的分类体系,并引入了BenDFM合成数据集,包含20,000个可制造和不可制造的零件,通过工艺感知的弯曲模拟生成。数据集提供了折叠和展开的几何形状以及多种标签,有助于研究基于学习的DFM挑战。实验表明,基于图的表示优于其他架构,但预测特定制造设置依赖的度量仍然具有挑战性。
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Authors: Abhinaba Basu, Pavan Chakraborty
First: 2026-03-12T17:13:25+00:00 · Latest: 2026-03-13T15:57:32+00:00
Abstract
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening - a 25% improvement in discovery yield.
中文标题/摘要
标题:证明携带材料:机器学习原子势能的可验证安全证书
机器学习原子势能(MLIPs)被用于高通量材料筛选,但缺乏正式的可靠性保证。我们展示了单一MLIP用作稳定性筛选器时,未能识别出93%的密度泛函理论(DFT)稳定的材料(召回率0.07)在25,000种材料的基准测试中。通过三个阶段的证明携带材料(PCM)来弥补这一差距:针对组成空间的对抗性反证、使用95%置信区间进行的Bootstrap包络细化,以及Lean 4形式化认证。审计CHGNet、TensorNet和MACE揭示了特定架构的盲点,配对误差相关性接近零(r <= 0.13;n = 5,000),并由独立的Quantum ESPRESSO验证所证实(20/20收敛;DFT/CHGNet力的中位数比值12倍)。基于PCM发现特征训练的风险模型在未见材料上预测失败(AUC-ROC = 0.938 ± 0.004),并在不同架构间具有可转移性(跨MLIP AUC-ROC ~ 0.70;特征重要性r = 0.877)。在热电材料筛选案例研究中,PCM审计的协议发现了62种额外的稳定材料,比单一MLIP筛选提高了25%的发现率。
Summary / 总结
The research addresses the lack of reliability guarantees for machine-learned interatomic potentials (MLIPs) used in materials screening. It introduces Proof-Carrying Materials (PCM) that involve adversarial falsification, bootstrap envelope refinement, and formal certification. Key findings include the discovery of 62 additional stable materials in a thermoelectric screening case study, a 25% improvement over single-MLIP screening, and a risk model that predicts failures with high accuracy (AUC-ROC = 0.938).
研究旨在解决机器学习原子势(MLIPs)在材料筛选中缺乏可靠性保证的问题。提出了证明携带材料(PCM)的方法,包括对抗性反驳、Bootstrap边界修正和形式化认证。该方法揭示了CHGNet、TensorNet和MACE在特定架构上的盲点,并基于PCM特征构建的风险模型预测失败具有高准确性。在热电材料筛选案例中,PCM审核的协议相比单一MLIP筛选提高了25%的发现率。
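Two quantities in the Proof-Carrying Materials abstract above lend themselves to a small worked example: the recall of a stability filter and a bootstrap 95% confidence interval. The sketch below uses a generic percentile bootstrap of a mean, which is an assumption (the abstract does not say how the envelopes are built), and every count and error value is made up for illustration.

```python
import random


def recall(true_positives, false_negatives):
    """Fraction of truly stable materials that the filter keeps."""
    return true_positives / (true_positives + false_negatives)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative counts only: keeping 7 of 100 stable materials gives recall 0.07,
# i.e. 93% are missed, matching the rate quoted in the abstract.
r = recall(true_positives=7, false_negatives=93)
print(r)

# Hypothetical per-material error magnitudes (not data from the paper).
errors = [0.1, 0.4, 0.2, 0.3, 0.5, 0.2, 0.3, 0.1, 0.4, 0.3]
lo, hi = bootstrap_ci(errors)
print(lo, hi)
```

A fixed seed keeps the interval reproducible between runs, which matters if the interval is later used as a pass/fail threshold.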
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Authors: Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
Venue: CVPR 2026
First: 2025-11-20T18:59:54+00:00 · Latest: 2026-03-13T15:53:44+00:00
Comments: CVPR 2026 (findings)
Abstract
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
中文标题/摘要
标题:EvoLMM:自我进化的大型多模态模型及其连续奖励
近年来,大型多模态模型(LMMs)的发展使其实现了令人印象深刻的推理和感知能力,但大多数现有的训练管道仍然依赖于人工策划的数据或外部验证的奖励模型,这限制了它们的自主性和可扩展性。在本工作中,我们旨在以完全无监督的方式(无需任何标注数据或奖励蒸馏)提高LMM的推理能力。为此,我们提出了一种自我进化的框架,名为EvoLMM,该框架从单一骨干模型中实例化两个合作的代理:一个提案者,它生成多样化的、基于图像的问题;一个解决者,它通过内部一致性解决这些问题,其中学习过程通过连续的自我奖励过程进行。这种动态反馈鼓励生成具有信息性的查询,并改进结构化推理,而无需依赖于真实数据或人工判断。当使用流行的Qwen2.5-VL作为基础模型时,我们的EvoLMM在仅使用原始训练图像的情况下,在多模态数学推理基准测试中,包括ChartQA、MathVista和MathVision,取得了高达约3%的一致性收益。我们希望我们的简单而有效的方法能够作为坚实的基础,简化未来在完全无监督方式下自我改进的LMMs的研究。我们的代码和模型可在https://github.com/mbzuai-oryx/EvoLMM/获取。
Summary / 总结
EvoLMM is a self-evolving framework for large multimodal models that improves reasoning capabilities without relying on annotated data or external rewards. It uses a single backbone model to create two agents: a Proposer that generates diverse image-grounded questions and a Solver that solves them through internal consistency. This continuous self-rewarding process enhances reasoning and perception, achieving consistent gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision.
EvoLMM 是一种自进化的大型多模态模型(LMM)框架,能够在无需标注数据或外部奖励的情况下提升推理能力。该框架使用单一的骨干模型实例化两个代理:一个生成者(Proposer)生成多样化的图像相关问题,一个解决者(Solver)通过内部一致性解决这些问题。这种持续的自我奖励过程增强了问题生成和结构化推理。当使用 Qwen2.5-VL 作为基础模型时,EvoLMM 在多模态数学推理基准测试 ChartQA、MathVista 和 MathVision 上实现了高达 3% 的一致改进,仅使用原始训练图像。
Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences
Authors: Wenxi Wu, Jingjing Zhang, Martim Brandão
Venue: ICLR 2026
First: 2026-03-13T15:53:42+00:00 · Latest: 2026-03-13T15:53:42+00:00
Comments: Accepted to the First Workshop on Efficient Spatial Reasoning at ICLR 2026
Abstract
Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it remains unclear to what degree they possess the spatial reasoning capability required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.
中文标题/摘要
标题:评估VLMs在机器人运动空间推理中的能力:迈向具有运动偏好的机器人规划
理解用户的指令和周围环境中的物体空间关系对于智能机器人系统在各种任务中协助人类至关重要。视觉-语言模型(VLMs)的自然语言和空间推理能力有可能增强机器人规划者在新任务、新物体和运动规范上的泛化能力。尽管基础模型已被应用于任务规划,但尚不清楚它们在执行用户对运动的偏好或约束(如与物体的距离、拓扑属性或运动风格偏好)所需的空间推理能力方面的能力如何。在本文中,我们使用四种不同的查询方法评估了四种最先进的VLMs在机器人运动空间推理方面的能力。我们的结果显示,使用性能最佳的查询方法,Qwen2.5-VL在零样本情况下达到71.4%的准确率,在较小的模型上微调后达到75%,而GPT-4o的性能较低。我们评估了两种类型的运动偏好(物体接近性和路径风格),并分析了准确率和计算成本(以令牌数量衡量)之间的权衡。这项工作展示了VLM与机器人运动规划管道集成的潜力。
Summary / 总结
This paper evaluates the spatial reasoning capabilities of Vision-Language Models (VLMs) in robot motion planning, focusing on their ability to understand user preferences and constraints. Four state-of-the-art VLMs were tested using four querying methods, with Qwen2.5-VL achieving 71.4% accuracy zero-shot and 75% after fine-tuning. The study also examined two types of motion preferences and analyzed the trade-off between accuracy and computational cost, indicating potential for VLM integration in robot motion planning.
本文评估了视觉-语言模型(VLMs)在机器人运动规划中的空间推理能力,重点在于它们解释用户偏好和约束的能力。使用了四种查询方法测试了四种最先进的VLMs,Qwen2.5-VL在零样本情况下达到了71.4%的准确率,在微调后达到了75%。研究还考察了两种类型的运动偏好以及准确性和计算成本之间的权衡,表明VLMs在机器人运动规划中的潜在集成可能性。
Unsupervised anomaly detection in MeV ultrafast electron diffraction
Authors: Mariana A. Fazio, Manel Martinez-Ramon, Salvador Sosa Güitron, Marcus Babzien, Mikhail Fedurin, Junjie Li, Mark Palmer, Sandra S. Biedron
First: 2025-05-19T20:05:24+00:00 · Latest: 2026-03-13T15:49:45+00:00
Abstract
MeV ultrafast electron diffraction (MUED) is a pump-probe technique used to study the dynamic structural evolution of materials. An ultrashort laser pulse triggers structural changes, which are then probed by an ultrashort relativistic electron beam. To overcome low signal-to-noise ratios, diffraction patterns are averaged over thousands of shots. However, shot-to-shot instabilities in the electron beam can distort individual patterns, introducing uncertainty. Improving MUED accuracy requires detecting and removing these anomalous patterns from large datasets. In this work, we developed a fully unsupervised methodology for the detection of anomalous diffraction patterns. Using a convolutional autoencoder, we calculate the reconstruction mean squared error of the diffraction patterns. Based on the statistical analysis of this error, we provide the user with an estimate of the probability that the pattern is normal, which also allows a posterior visual inspection of the images that are difficult to classify. This method has been trained with only 100 diffraction patterns and tested on 1521 patterns, resulting in a false positive rate between 0.2% and 0.4%, with a training time of 10 seconds per image and a test time of about 1 second per image. The proposed methodology can also be applied to other diffraction techniques in which large datasets are collected that include faulty images due to instrumental instabilities.
中文标题/摘要
标题:MeV超快电子衍射中的无监督异常检测
MeV超快电子衍射(MUED)是一种泵-探针技术,用于研究材料的动态结构演变。超短激光脉冲触发结构变化,随后由超短相对论电子束探测。为了克服低信噪比,衍射图案会在数千个脉冲上进行平均。然而,电子束的脉冲间不稳定性会扭曲个别图案,引入不确定性。提高MUED的准确性需要检测并从大量数据集中移除这些异常的衍射图案。在本研究中,我们开发了一种完全无监督的方法来检测异常的衍射图案。利用卷积自编码器,我们计算衍射图案的重构均方误差。基于该误差的统计分析,我们为用户提供一个估计值,以判断图案是否正常,这还允许用户对难以分类的图像进行后验视觉检查。该方法仅使用100个衍射图案进行训练,并在1521个图案上进行测试,结果的假阳性率在0.2%到0.4%之间,训练时间为每张图像10秒,测试时间为每张图像约1秒。所提出的方法还可以应用于其他包含由于仪器不稳定性导致的故障图像的大数据集的衍射技术。
Summary / 总结
This study addresses the challenge of detecting anomalous diffraction patterns in MeV ultrafast electron diffraction (MUED) by developing an unsupervised methodology using a convolutional autoencoder. The method calculates the reconstruction mean squared error of diffraction patterns and provides a probability estimate for each pattern, enabling visual inspection. With only 100 training patterns, the method achieved a false positive rate of 0.2% to 0.4% and demonstrated fast training and testing times, making it suitable for large datasets with instrumental instabilities.
本研究通过使用卷积自编码器开发了一种无监督方法,以解决MeV超快电子衍射(MUED)中检测异常衍射图案的挑战。该方法计算衍射图案的重构均方误差,并提供正常性的概率,以便对难以分类的图像进行视觉检查。该模型在100个图案上进行了训练,并在1521个图案上进行了测试,实现了0.2%到0.4%的误报率,训练时间为每张图像10秒,测试时间为每张图像约1秒。该方法还可应用于其他包含由于仪器不稳定而产生的故障图像的大数据集衍射技术。
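The scoring step described in this entry can be sketched as follows, assuming the reconstructions already come from some autoencoder (not shown here). The abstract does not say which statistical model is applied to the error, so the normal fit to log-errors and the names `mse`, `fit_error_model`, and `prob_normal` are illustrative assumptions, not the authors' method.

```python
import math
from statistics import NormalDist, mean, stdev


def mse(image, reconstruction):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(image) != len(reconstruction):
        raise ValueError("image and reconstruction must have the same size")
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)


def fit_error_model(train_errors):
    """Fit a normal distribution to the log-errors of training patterns.

    ASSUMPTION: a log-normal error model; the source does not specify one.
    """
    logs = [math.log(e) for e in train_errors]
    return NormalDist(mean(logs), stdev(logs))


def prob_normal(error, model):
    """Upper-tail probability of seeing an error at least this large
    under the fitted model; small values suggest an anomalous pattern."""
    return 1.0 - model.cdf(math.log(error))


# Hypothetical reconstruction errors of patterns assumed to be normal.
train_errors = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010]
model = fit_error_model(train_errors)
print(prob_normal(0.011, model))  # error close to the training errors
print(prob_normal(0.050, model))  # error far above the training errors
```

Patterns whose score falls between clearly normal and clearly anomalous would be the ones set aside for visual inspection; where to place those cut-offs is a design choice this sketch leaves open.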
Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
Authors: Wayner Barrios, SouYoung Jin
First: 2026-03-13T15:48:15+00:00 · Latest: 2026-03-13T15:48:15+00:00
Abstract
We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
中文标题/摘要
标题:超越最终答案:CRYSTAL 基准测试用于透明多模态推理评估
我们引入了CRYSTAL(Clear Reasoning via Yielded Steps, Traceability and Logic,即通过生成步骤、可追溯性和逻辑实现清晰推理),这是一个诊断基准,包含6,372个实例,通过可验证的中间步骤评估多模态推理。我们提出了两个互补的度量标准:匹配F1(Match F1),通过语义相似性匹配评分步骤级别的精确度和召回率;有序匹配F1(Ordered Match F1),进一步惩罚无序的推理链。参考推理步骤通过一个受德尔菲法启发的管道构建,其中四个独立的MLLM生成轨迹,通过语义聚类聚合,并通过人工质量关卡验证。对20个MLLM的评估,包括在基准构建过程中未使用的商业前沿系统,揭示了仅凭准确率无法发现的系统性失败:普遍的选择性挑选(精确度远高于召回率)、非单调的规模权衡以及无序推理,其中没有任何有竞争力的模型能以正确顺序保留超过60%的匹配步骤。除了评估之外,我们提出了因果过程奖励(CPR),这是一种乘法奖励,将答案的正确性与步骤级别的对齐相结合,以及CPR-课程(CPR-Curriculum),在训练过程中逐步增加推理难度。在加性奖励策略失败的情况下,CPR-课程通过GRPO实现了+32%的匹配F1,无需手动步骤标注即可提高推理能力。
Summary / 总结
CRYSTAL is a benchmark for evaluating multimodal reasoning by assessing verifiable intermediate steps. It uses two metrics: Match F1 and Ordered Match F1, and constructs references through a Delphi-inspired pipeline. Evaluations of 20 MLLMs show systematic failures such as cherry-picking and disordered reasoning. CPR and CPR-Curriculum are proposed to improve reasoning, achieving a 32% increase in Match F1.
CRYSTAL基准通过要求可验证的中间步骤来评估多模态推理,并使用两种指标:Match F1和Ordered Match F1。基准包含6,372个实例,并评估了20个MLLMs,揭示了系统性失败,如选择性拾取和推理混乱。提出了因果过程奖励(CPR)和CPR-课程,以提高推理能力,GRPO模型通过这种方法实现了Match F1提高32%。
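To make the two metrics and the multiplicative reward concrete, here is a rough sketch under stated assumptions. The abstract does not give the exact definitions, so the greedy one-to-one matching, the similarity threshold, the use of a longest increasing subsequence for ordering, and every function name below are hypothetical choices. The similarity matrix is assumed to be precomputed and non-empty.

```python
from bisect import bisect_left


def match_steps(sim, threshold=0.5):
    """Greedy one-to-one matching of predicted steps (rows) to
    reference steps (columns), highest similarity first."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(sim)
         for j, s in enumerate(row) if s >= threshold),
        reverse=True,
    )
    used_pred, used_ref, matches = set(), set(), []
    for _, i, j in pairs:
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
            matches.append((i, j))
    return sorted(matches)  # ordered by predicted-step index


def f1(n_matched, n_pred, n_ref):
    """Harmonic mean of step-level precision and recall."""
    if n_matched == 0:
        return 0.0
    precision, recall = n_matched / n_pred, n_matched / n_ref
    return 2 * precision * recall / (precision + recall)


def longest_increasing(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def match_f1(sim, threshold=0.5):
    return f1(len(match_steps(sim, threshold)), len(sim), len(sim[0]))


def ordered_match_f1(sim, threshold=0.5):
    # Only matches that appear in reference order count (assumed penalty).
    ref_order = [j for _, j in match_steps(sim, threshold)]
    return f1(longest_increasing(ref_order), len(sim), len(sim[0]))


def multiplicative_reward(answer_correct, step_score):
    """A wrong answer zeroes the reward regardless of step alignment."""
    return float(answer_correct) * step_score


# Three predicted steps; the first two match the reference in swapped order.
sim = [[0.1, 0.9, 0.0],
       [0.8, 0.2, 0.1],
       [0.0, 0.1, 0.7]]
print(match_f1(sim))          # 1.0: every step is matched
print(ordered_match_f1(sim))  # about 0.667: only two matches are in order
```

The multiplicative form differs from an additive one in that good step alignment cannot compensate for an incorrect final answer.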
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design
Authors: Ruogu Li, Sikai Li, Yao Mu, Mingyu Ding
Venue: ICRA 2026
First: 2026-03-13T15:47:08+00:00 · Latest: 2026-03-13T15:47:08+00:00
Comments: Accepted by ICRA 2026
Abstract
We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. SldprtNet features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
中文标题/摘要
标题:SldprtNet:一种用于语言驱动3D设计的大型多模态数据集
我们介绍了SldprtNet,一个包含超过242,000个工业部件的大规模数据集,旨在用于语义驱动的CAD建模、几何深度学习以及多模态模型的训练和微调。该数据集提供了STEP和SLDPRT格式的3D模型,以支持多样化的训练和测试。为了支持参数化建模并促进数据集的扩展,我们开发了支持工具、编码器和解码器,支持13种CAD命令,并允许3D模型与结构化文本表示之间无损转换。此外,每个样本都配有一个由3D模型不同视角的七个渲染视图合并而成的复合图像,有效减少了输入标记长度并加速了推理。通过将此图像与编码器生成的参数化文本输出结合,我们使用轻量级多模态语言模型Qwen2.5-VL-7B生成每个部件外观和功能的自然语言描述。为了确保准确性,我们手动验证并对生成的描述、渲染图像和3D模型进行了对齐。这些描述、参数化建模脚本、渲染图像和3D模型文件完全对齐,构建了SldprtNet。为了评估其有效性,我们在数据集子集上微调了基线模型,比较了图像加文本输入与仅文本输入的效果。结果证实了多模态数据集对于CAD生成的必要性和价值。它包含精心选择的真实世界工业部件,支持可扩展的数据集扩展工具,具有多种模态,确保了模型复杂性和几何特征的多样性,使其成为一个全面的多模态数据集,用于语义驱动的CAD建模和跨模态学习。
Summary / 总结
SldprtNet is a large-scale dataset of over 242,000 industrial parts for semantic-driven CAD modeling and multimodal training. It includes 3D models in .step and .sldprt formats, and supports parametric modeling through an encoder and decoder. Each part is paired with a composite image and a natural language description generated by a lightweight multimodal language model. Baseline models were fine-tuned on a subset of the dataset to compare image-plus-text inputs with text-only inputs, demonstrating the value of multimodal datasets for CAD generation.
SldprtNet 是一个包含超过 242,000 个工业零件的大规模数据集,旨在支持语义驱动的 CAD 模型设计和多模态模型训练。该数据集包括 .step 和 .sldprt 格式的 3D 模型,并通过编码器和解码器支持参数化建模。每个零件都配有一张由七个不同视角渲染图合成的图像,以及使用 Qwen2.5-VL-7B 生成的自然语言描述。数据集经过人工验证和对齐,以确保准确性。实验结果表明,多模态输入可以提高 CAD 生成模型的效果,突显了 SldprtNet 在语义驱动的 CAD 模型设计和跨模态学习中的价值。
MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts
Authors: Evandro S. Ortigossa, Guy Lutsker, Eran Segal
First: 2026-01-29T15:35:26+00:00 · Latest: 2026-03-13T15:44:19+00:00
Comments: Under review
Abstract
Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate across seven multivariate benchmarks and multiple horizons, with MoHETS consistently achieving state-of-the-art performance, reducing the average MSE by 12% compared to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.
中文标题/摘要
标题:MoHETS:基于异质专家混合的长期时间序列预测
现实世界中的多变量时间序列可能表现出复杂的多尺度结构,包括全局趋势、局部周期性和非平稳区域,这使得长期预测变得具有挑战性。尽管稀疏混合专家(MoE)方法提高了可扩展性和专业化,但它们通常依赖于同质的MLP专家,这不能很好地捕捉时间序列数据的多样时间动态。我们通过MoHETS解决了这些限制,这是一种仅编码器的Transformer,结合了稀疏混合异质专家(MoHE)层。MoHE将时间片段路由到一小部分专家网络,结合了一个用于序列级连续性的共享深度卷积专家和基于傅里叶的专家用于片段级周期结构。MoHETS进一步通过跨注意力机制结合协变量片段嵌入来增强对非平稳动态的鲁棒性。最后,我们用轻量级的卷积片段解码器替代参数密集的线性投影头,提高了参数效率,减少了训练不稳定性,并允许单个模型在任意预测时间范围内泛化。我们在七个多元基准和多个时间范围内进行了验证,MoHETS始终实现了最先进的性能,与强大的近期基线相比,平均MSE降低了12%,展示了有效的异质专业化以进行长期预测。
Summary / 总结
MoHETS addresses the challenges of long-term forecasting in multivariate time series by integrating sparse Mixture-of-Heterogeneous-Experts (MoHE) layers into an encoder-only Transformer. This approach routes temporal patches to specialized experts, combining a depthwise-convolution expert for sequence-level continuity with Fourier-based experts for patch-level periodic structures. MoHETS also incorporates exogenous information via cross-attention and uses a lightweight convolutional patch decoder to improve parameter efficiency. Experiments on seven benchmarks show MoHETS achieves state-of-the-art performance, reducing average MSE by 12% compared to recent strong baselines.
MoHETS通过将稀疏的混合异质专家(MoHE)层集成到编码器中来解决多变量时间序列的长期预测挑战。它结合了一个共享的深度卷积专家用于序列级别的连续性,以及路由的傅里叶基专家用于片段级别的周期结构,并通过交叉注意力引入外生信息。MoHETS还使用轻量级的卷积片段解码器代替了参数密集的线性投影头。在七个多变量基准上的实验表明,MoHETS达到了最先进的性能,与最近的强基线相比,平均MSE降低了12%。
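The routing idea in this entry, one always-active shared expert plus a few routed experts per patch, can be sketched generically as below. This is not the MoHETS implementation: top-k selection with renormalised softmax gates is a common sparse-MoE convention assumed here, and the experts are placeholder callables standing in for the depthwise-convolution and Fourier-based experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(patch, router_logits, shared_expert, routed_experts, k=2):
    """Combine the shared expert with the top-k routed experts for one patch.

    ASSUMPTION: gates are softmax probabilities renormalised over the
    selected experts; the source does not describe the gating rule.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = list(shared_expert(patch))  # shared expert always contributes
    for i in top:
        gate = probs[i] / norm
        expert_out = routed_experts[i](patch)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, top


# Toy experts: identity as the shared path, three simple routed experts.
shared = lambda p: p
experts = [
    lambda p: [0.0 for _ in p],
    lambda p: [10.0 for _ in p],
    lambda p: list(p),
]
output, chosen = route_patch([1.0, 2.0], [0.0, 5.0, 5.0], shared, experts, k=2)
print(chosen, output)
```

Only the selected experts are evaluated for a given patch, which is where the sparsity saving comes from.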
Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Authors: Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, Stefanie Tellex
First: 2025-06-21T03:18:01+00:00 · Latest: 2026-03-13T15:43:45+00:00
Abstract
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer. Paper homepage: lakshitadodeja.github.io/uncertainty-aware-residual-rl/
中文标题/摘要
标题:基于不确定性估计加速残差强化学习
残差强化学习(RL)是一种流行的通过学习一个轻量级的残差策略来适应预训练策略的方法,该策略提供纠正动作。虽然残差RL比整个基策略的微调更具样本效率,但现有方法在稀疏奖励下表现不佳,并且是为确定性基策略设计的。我们提出了两种改进残差RL的方法,进一步提高了其样本效率,并使其适用于随机基策略。首先,我们利用基策略的不确定性估计,将探索集中在基策略不自信的区域。其次,我们提出了一种简单的离策略残差学习修改,使其能够观察基动作并更好地处理随机基策略。我们使用Robosuite和D4RL中的任务评估了我们的方法,并与最先进的微调方法、示例增强的RL方法和其他残差RL方法进行了比较。我们的算法在多种模拟基准环境中显著优于现有基线。我们还在现实世界中部署了我们的学习策略,以证明其在零样本模拟到现实世界的转移中的鲁棒性。论文主页:lakshitadodeja.github.io/uncertainty-aware-residual-rl/
Summary / 总结
The research aims to improve the sample efficiency of Residual RL by addressing its limitations with sparse rewards and stochastic base policies. The method introduces uncertainty estimates to guide exploration and a modification to off-policy residual learning to handle stochastic base policies. Experiments on Robosuite and D4RL tasks show that the proposed approach outperforms existing methods and achieves robust sim-to-real transfer.
论文针对残差强化学习在处理稀疏奖励和随机基策略时的局限性,提出了两种改进:利用不确定性估计进行集中探索和修改离策略残差学习以更好地处理随机基策略。实验结果表明,所提出的方法在Robosuite和D4RL上的表现显著优于现有基线,并实现了稳健的仿真到现实世界的转移。
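The two ideas in this entry can be illustrated with a small sketch: the residual policy sees the sampled base action, and exploration noise is scaled by how uncertain the base policy is. The uncertainty proxy (spread across several base-policy samples), the noise rule, and all names here are assumptions made for illustration; the abstract does not specify how uncertainty is estimated or used.

```python
import random
from statistics import pstdev


def base_uncertainty(sampled_actions):
    """Per-dimension spread of several base-policy action samples.

    ASSUMPTION: sample spread is used as the uncertainty estimate.
    """
    return [pstdev(dim) for dim in zip(*sampled_actions)]


def residual_input(state, base_action):
    """The residual policy observes the state plus the base action."""
    return list(state) + list(base_action)


def combined_action(state, sampled_actions, residual_policy,
                    noise_scale=1.0, rng=random):
    """Executed action = base action + residual correction + exploration noise."""
    base_action = sampled_actions[0]  # the sample that would be executed
    sigma = base_uncertainty(sampled_actions)
    residual = residual_policy(residual_input(state, base_action))
    # Noise grows with base-policy uncertainty and vanishes where it is confident.
    noise = [rng.gauss(0.0, noise_scale * s) for s in sigma]
    return [b + r + n for b, r, n in zip(base_action, residual, noise)]


# Hypothetical residual policy that always applies a small fixed correction.
policy = lambda obs: [0.1, 0.1]
confident = [[0.5, -0.2], [0.5, -0.2], [0.5, -0.2]]  # identical samples
uncertain = [[0.5, -0.2], [0.9, 0.3], [0.1, -0.6]]   # widely spread samples
print(combined_action([1.0], confident, policy))  # no exploration noise added
print(combined_action([1.0], uncertain, policy))  # noisy around base + residual
```

Passing the base action into the residual policy matters for stochastic base policies, since the same state can otherwise map to different base actions that the residual cannot distinguish.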
CCMamba: Topologically-Informed Selective State-Space Networks on Combinatorial Complexes for Higher-Order Graph Learning
Authors: Jiawen Chen, Qi Shao, Mingtong Zhou, Duxin Chen, Wenwu Yu
First: 2026-01-28T11:52:13+00:00 · Latest: 2026-03-13T15:43:11+00:00
Abstract
Topological deep learning has emerged as a powerful paradigm for modeling higher-order relational structures beyond pairwise interactions that standard graph neural networks fail to capture. While combinatorial complexes (CCs) offer a unified topological foundation for higher-order graph learning, existing topological deep learning methods rely heavily on local message passing and attention mechanisms. These suffer from quadratic complexity and local neighborhood constraints, limiting their scalability and capacity for rank-aware, long-range dependency modeling. To overcome these challenges, we propose Combinatorial Complex Mamba (CCMamba), the first unified Mamba-based neural framework for learning on combinatorial complexes. CCMamba reformulates higher-order message passing as a selective state-space modeling problem by linearizing multi-rank incidence relations into structured, rank-aware sequences. This architecture enables adaptive, directional, and long-range information propagation in linear time, bypassing the scalability bottlenecks of self-attention. Theoretically, we further establish that the expressive power of CCMamba is upper-bounded by the 1-dimensional combinatorial complex Weisfeiler-Lehman (1-CCWL) test. Extensive experiments across graph, hypergraph, and simplicial benchmarks demonstrate that CCMamba consistently outperforms existing methods while exhibiting superior scalability and remarkable robustness against over-smoothing in deep architectures.
中文标题/摘要
标题:CCMamba:基于拓扑信息的选择性状态空间网络在组合复形上的高阶图学习
拓扑深度学习已成为一种强大的范式,用于建模标准图神经网络无法捕捉的超越成对交互的高阶关系结构。虽然组合复形(CCs)为高阶图学习提供了一个统一的拓扑基础,但现有的拓扑深度学习方法严重依赖于局部消息传递和注意力机制。这些方法存在二次复杂性和局部邻域约束,限制了它们的可扩展性和对秩感知、长距离依赖建模的能力。为克服这些挑战,我们提出了Combinatorial Complex Mamba(CCMamba),这是第一个基于Mamba的统一神经框架,用于在组合复形上进行学习。CCMamba将高阶消息传递重新表述为选择性状态空间建模问题,将多秩关联关系线性化为结构化、秩感知的序列。该架构能够在线性时间内实现自适应、方向性和长距离信息传播,绕过了自注意力的可扩展性瓶颈。理论上,我们进一步证明CCMamba的表达能力上限为1维组合复形Weisfeiler-Lehman(1-CCWL)测试。在图、超图和单纯复形基准上的广泛实验表明,CCMamba在性能上始终优于现有方法,同时表现出优越的可扩展性和对深度架构中过度平滑的显著鲁棒性。
Summary / 总结
CCMamba is a topologically-informed neural framework designed for learning on combinatorial complexes, addressing the limitations of existing methods by reformulating higher-order message passing as a selective state-space modeling problem. This approach enables adaptive, directional, and long-range information propagation in linear time, overcoming scalability bottlenecks. Experiments show that CCMamba outperforms existing methods across various benchmarks, demonstrating superior scalability and robustness against over-smoothing in deep architectures.
CCMamba 是一种新型神经框架,用于在组合复形上进行学习,通过将高阶消息传递重新表述为选择性状态空间建模问题,解决了现有方法的局限性。这种方法实现了线性时间、自适应和长距离的信息传播,克服了可扩展性问题。实验表明,CCMamba 在各种基准测试中优于现有方法,展示了出色的可扩展性和对深度架构中过度平滑的鲁棒性。
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence
Authors: Seunghwan Bang, Hwanjun Song
First: 2026-03-13T15:40:42+00:00 · Latest: 2026-03-13T15:40:42+00:00
Comments: 35 pages, 8 figures, 21 tables
Abstract
The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.
中文标题/摘要
标题:视频推理:评估MLLMs提取、整合和重构时空证据的能力
随着对具身代理的兴趣日益增长,对时空视频理解的需求也在增加,但现有的基准测试主要侧重于提取性推理,其中答案可以在时空事件中明确呈现。尚不清楚多模态大型语言模型是否能够执行抽象的时空推理,这需要在时间上整合观察结果、结合分散的线索并推断隐含的空间和上下文结构。为解决这一差距,我们通过引入一个结构化的评估分类法,系统地针对其核心维度,提出了一个可控的、基于场景的合成第一人称视频数据集,以评估抽象时空推理能力,涵盖对象级、房间级和楼层平面图级场景。基于此框架,我们提出了VAEX-BENCH基准,包括五个抽象推理任务及其提取性对应任务。我们的大量实验比较了最先进的MLLMs在提取性和抽象性设置下的性能,揭示了它们在抽象任务上的局限性,并提供了对潜在瓶颈的精细分析。数据集将很快发布。
Summary / 总结
This paper addresses the gap in evaluating abstractive spatiotemporal reasoning in multimodal large language models (MLLMs) by introducing a structured evaluation taxonomy and a synthetic egocentric video dataset. The study presents VAEX-BENCH, a benchmark with five abstractive reasoning tasks, to compare MLLMs under extractive and abstractive settings, revealing their limitations on abstractive tasks and identifying underlying bottlenecks. The dataset will be released soon.
该研究通过引入结构化的评估分类法和合成的第一人称视角视频数据集,旨在评估多模态大语言模型(MLLMs)在抽象时空推理方面的表现。研究提出了VAEX-BENCH基准,包含五个抽象推理任务,以比较MLLMs在提取性和抽象性设置下的性能,揭示其在抽象任务中的局限性并分析潜在瓶颈。数据集即将发布。