SimpliHuMoN: Simplifying Human Motion Prediction
Authors: Aadya Agrawal, Alexander Schwing
First: 2026-03-04T18:59:57+00:00 · Latest: 2026-03-04T18:59:57+00:00
Comments: 19 pages, 7 figures. Preprint
Abstract
Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW.
中文标题/摘要
标题:SimpliHuMoN: 简化的人体运动预测
人体运动预测结合了轨迹预测和人体姿态预测的任务。对于这两个任务,已经开发了专门的模型。将这些模型结合起来进行整体人体运动预测并不简单,最近的方法在单独任务的基准测试上难以竞争。为了解决这个问题,我们提出了一种基于变压器的简单而有效的模型来进行人体运动预测。该模型采用堆叠的自注意力模块来有效地捕捉姿态内的空间依赖性和运动序列间的时序关系。这种简单、精简的端到端模型足够灵活,可以处理姿态仅、轨迹仅和结合预测任务,而无需针对特定任务进行修改。通过在广泛基准数据集上的大量实验,我们证明了这种方法在所有任务上都达到了最先进的结果,包括Human3.6M、AMASS、ETH-UCY和3DPW。
Summary / 总结
The paper addresses the challenge of human motion prediction by proposing SimpliHuMoN, a transformer-based model that effectively captures spatial and temporal dependencies. The model is versatile and can handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. Extensive experiments on benchmarks such as Human3.6M, AMASS, ETH-UCY, and 3DPW show that SimpliHuMoN achieves state-of-the-art results across all tasks.
论文提出了一种基于变压器的SimpliHuMoN模型,该模型能够有效捕捉空间和时间依赖性。该模型具有通用性,无需针对特定任务进行修改,即可处理姿态预测、轨迹预测以及结合预测任务。在多个基准数据集上的广泛实验表明,SimpliHuMoN在所有任务上都取得了最先进的结果。
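The self-attention idea in the SimpliHuMoN abstract can be illustrated with a minimal sketch. The abstract does not say how poses are tokenized, so flattening every (frame, joint) pair into one token sequence is an assumption made here, as are the sizes, the random weights, and the helper name `self_attention`. With that layout, a single attention map can relate joints within a frame (spatial) and across frames (temporal).

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
frames, joints, dim = 4, 3, 8  # hypothetical sizes, not from the paper
motion = rng.normal(size=(frames, joints, dim))  # one embedding per joint per frame
tokens = motion.reshape(frames * joints, dim)    # assumed tokenization: flatten time and joints
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over all 12 tokens, so one token can attend both to other joints in its own frame and to any joint in other frames.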
Accurate and Efficient Hybrid-Ensemble Atmospheric Data Assimilation in Latent Space with Uncertainty Quantification
Authors: Hang Fan, Juan Nathaniel, Yi Xiao, Ce Bian, Fenghua Ling, Ben Fei, Lei Bai, Pierre Gentine
First: 2026-03-04T18:58:27+00:00 · Latest: 2026-03-04T18:58:27+00:00
Comments: 23 pages, 12 figures
Abstract
Data assimilation (DA) combines model forecasts and observations to estimate the optimal state of the atmosphere with its uncertainty, providing initial conditions for weather prediction and reanalyses for climate research. Yet, existing traditional and machine-learning DA methods struggle to achieve accuracy, efficiency and uncertainty quantification simultaneously. Here, we propose HLOBA (Hybrid-Ensemble Latent Observation-Background Assimilation), a three-dimensional hybrid-ensemble DA method that operates in an atmospheric latent space learned via an autoencoder (AE). HLOBA maps both model forecasts and observations into a shared latent space via the AE encoder and an end-to-end Observation-to-Latent-space mapping network (O2Lnet), respectively, and fuses them through a Bayesian update with weights inferred from time-lagged ensemble forecasts. Both idealized and real-observation experiments demonstrate that HLOBA matches dynamically constrained four-dimensional DA methods in both analysis and forecast skill, while achieving end-to-end inference-level efficiency and, in theory, the flexibility to be applied to any forecasting model. Moreover, by exploiting the error decorrelation property of latent variables, HLOBA enables element-wise uncertainty estimates for its latent analysis and propagates them to model space via the decoder. Idealized experiments show that this uncertainty highlights large-error regions and captures their seasonal variability.
中文标题/摘要
标题:在潜空间中具有不确定性量化的一种准确高效的混合集成大气数据同化方法
数据同化(DA)结合模型预报和观测数据,以估计大气的最优状态及其不确定性,提供天气预报的初始条件并为气候研究提供再分析。然而,现有的传统和机器学习数据同化方法难以同时实现高精度、高效性和不确定性量化。在此,我们提出了一种名为HLOBA(混合集成潜空间观测-背景同化)的三维混合集成数据同化方法,该方法在通过自编码器(AE)学习的大气潜空间中运行。HLOBA通过AE编码器和端到端的观测到潜空间映射网络(O2Lnet)分别将模型预报和观测数据映射到共享的潜空间,并通过时间滞后集成预报的贝叶斯更新进行融合。理想化和实际观测实验均表明,HLOBA在分析和预报技能方面与动力约束的四维数据同化方法相当,同时实现了端到端的推理级效率和理论上的灵活性,适用于任何预报模型。此外,通过利用潜变量的误差去相关特性,HLOBA能够为潜空间分析提供元素级的不确定性估计,并通过解码器将其传播到模型空间。理想化实验表明,这种不确定性能够突出显示大误差区域并捕捉其季节性变化。
Summary / 总结
The paper addresses the challenge of achieving accuracy, efficiency, and uncertainty quantification simultaneously in atmospheric data assimilation. It introduces HLOBA, a hybrid-ensemble method that operates in an atmospheric latent space learned via an autoencoder. HLOBA combines model forecasts and observations through a Bayesian update and demonstrates comparable analysis and forecast skill to dynamically constrained four-dimensional methods, while offering end-to-end efficiency and flexibility. Additionally, it provides element-wise uncertainty estimates that highlight large-error regions and capture their seasonal variability.
论文旨在解决大气数据同化中准确、高效且可靠地量化不确定性的问题。文中提出了一种名为HLOBA的混合集合方法,该方法通过自编码器学习大气的潜在空间。HLOBA通过贝叶斯更新结合模型预报和观测数据,展示了与动态约束的四维方法相当的分析和预报技能,同时具备端到端的高效性和灵活性。此外,它还能提供元素级别的不确定性估计,突出显示大误差区域并捕捉其季节变化。
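As a rough illustration of the element-wise Bayesian update described for HLOBA, the sketch below fuses a background latent vector and an observation-derived latent vector using independent Gaussian errors per latent element. This is the textbook scalar Gaussian update applied element by element, not the authors' implementation. The function name, the toy ensemble standing in for time-lagged forecasts, and all numbers are assumptions.

```python
import numpy as np

def latent_bayesian_update(z_b, var_b, z_o, var_o):
    """Element-wise Gaussian fusion of background (b) and observation (o) latents."""
    precision = 1.0 / var_b + 1.0 / var_o      # precisions add for independent errors
    var_a = 1.0 / precision                     # analysis variance per latent element
    z_a = var_a * (z_b / var_b + z_o / var_o)   # precision-weighted mean
    return z_a, var_a

rng = np.random.default_rng(0)
# Hypothetical stand-in for time-lagged ensemble forecasts already encoded to latent space.
ensemble = rng.normal(loc=1.0, scale=0.5, size=(8, 4))  # 8 members, 4 latent elements
z_b = ensemble.mean(axis=0)
var_b = ensemble.var(axis=0, ddof=1)  # background error variance estimated from the spread

z_o = np.full(4, 1.2)    # assumed output of an observation-to-latent mapping
var_o = np.full(4, 0.1)  # assumed observation error variance in latent space

z_a, var_a = latent_bayesian_update(z_b, var_b, z_o, var_o)
print(z_a, var_a)
```

Treating the error covariance as diagonal is what makes the update element-wise. The abstract attributes this simplification to the error decorrelation property of the latent variables.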
UMA: A Family of Universal Models for Atoms
Authors: Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick
First: 2025-06-30T15:38:13+00:00 · Latest: 2026-03-04T18:57:47+00:00
Comments: 33 pages, 8 figures
Abstract
The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.
中文标题/摘要
标题:UMA:原子的通用模型家族
从原子模拟快速准确地计算性质的能力对于推进化学和材料科学中的许多应用至关重要,包括药物发现、储能和半导体制造。为了解决这一需求,Meta FAIR 呈现了一种原子的通用模型家族(UMA),旨在推动速度、准确性和泛化的前沿。UMA 模型在超过五亿个独特的三维原子结构上进行了训练(迄今为止最大的训练规模),通过跨多个化学领域(如分子、材料和催化剂)汇总数据。我们开发了经验性缩放定律来帮助理解如何随着数据集大小增加模型容量以获得最佳准确度。UMA 小型和中型模型采用了我们称之为线性专家混合的新型架构设计,这使得在不牺牲速度的情况下增加模型容量成为可能。例如,UMA 中型模型有 14 亿个参数,但每个原子结构只有约 5000 万个活跃参数。我们在多个领域的多种应用上评估了 UMA 模型,发现令人惊讶的是,一个未经任何微调的单一模型可以与专门模型表现得同样好或更好。我们正在发布 UMA 代码、权重及相关数据,以加速计算工作流并使社区能够继续构建越来越强大的 AI 模型。
Summary / 总结
UMA is a family of universal models for atoms aimed at improving the speed, accuracy, and generalization in atomic simulations for applications in chemistry and materials science. These models are trained on half a billion unique 3D atomic structures, in what the authors describe as the largest training runs to date, and utilize a novel architectural design called mixture of linear experts to increase model capacity without sacrificing speed. Experimental results show that a single UMA model can perform similarly or better than specialized models across various applications without any fine-tuning.
UMA 是一种用于原子的通用模型家族,旨在提高化学和材料科学中原子模拟的速度、准确性和泛化能力。这些模型基于迄今为止最大的数据集——半亿个独特的三维原子结构进行训练,并采用混合线性专家架构来增加模型容量而不牺牲速度。UMA 模型,尤其是中型模型,在各种应用中表现出色,无需微调即可超越专门模型。
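One way a mixture of linear experts could keep the active parameter count low is sketched below. The abstract does not describe the mechanism, so this is an assumption: because each expert is a linear map, a weighted sum of expert outputs equals the output of a single merged weight matrix, which could be computed once per atomic structure. The sizes, the routing logits, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out = 4, 6, 5  # hypothetical sizes
experts = rng.normal(size=(n_experts, d_out, d_in))  # one weight matrix per expert

# Assumed routing: one coefficient per expert, shared by the whole structure.
logits = rng.normal(size=n_experts)
alpha = np.exp(logits) / np.exp(logits).sum()

x = rng.normal(size=d_in)

# Naive mixture: evaluate every expert, then blend the outputs.
y_mixture = sum(a * (w @ x) for a, w in zip(alpha, experts))

# Merged form: blend the weights once, then apply a single linear map.
w_merged = np.tensordot(alpha, experts, axes=1)  # shape (d_out, d_in)
y_merged = w_merged @ x

print(np.allclose(y_mixture, y_merged))
```

Under this assumption the total parameter count grows with the number of experts, while the per-structure compute matches that of one expert-sized layer.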
ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training
Authors: Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, Aleksander Holynski
First: 2026-03-04T18:49:37+00:00 · Latest: 2026-03-04T18:49:37+00:00
Comments: Project page: https://haian-jin.github.io/ZipMap
Abstract
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
中文标题/摘要
标题:ZipMap:具有测试时训练的线性时间3D重建
前馈变压器模型在3D视觉方面取得了快速进展,但最先进的方法如VGGT和$π^3$的计算成本与输入图像数量的平方成正比,这使得它们在应用于大型图像集合时效率低下。顺序重建方法可以降低这种成本,但会牺牲重建质量。我们提出了ZipMap,这是一种具有状态的前馈模型,能够在保持或超越二次时间方法准确性的前提下实现线性时间、双向3D重建。ZipMap 使用测试时训练层在单次前向传递中将整个图像集合压缩为紧凑的隐藏场景状态,使其能够在单个H100 GPU上以不到10秒的时间重建超过700帧,比最先进的方法VGGT快20多倍。此外,我们展示了具有状态表示在实时场景状态查询中的优势及其扩展到顺序流式重建。
Summary / 总结
The research aims to address the computational inefficiency of state-of-the-art 3D reconstruction methods that scale quadratically with the number of input images. ZipMap, a stateful feed-forward model, is introduced to achieve linear-time bidirectional 3D reconstruction, matching or surpassing the accuracy of quadratic-time methods. Key experimental findings show that ZipMap can reconstruct over 700 frames in under 10 seconds on a single H100 GPU, more than 20 times faster than existing methods like VGGT.
ZipMap 是一种线性时间的状态化前馈模型,能够高效地进行双向 3D 重建,其准确度与二次时间复杂度的方法相当或更好。该模型利用测试时训练层将整个图像集合压缩成一个紧凑的隐藏场景状态,单次前向传播即可完成,能够在单个 H100 GPU 上以不到 10 秒的时间重建超过 700 帧,比现有方法如 VGGT 快 20 倍以上。此外,该模型还受益于状态化表示,适用于实时场景状态查询和顺序流式重建。
Turning Trust to Transactions: Tracking Affiliate Marketing and FTC Compliance in YouTube's Influencer Economy
Authors: Chen Sun, Yash Vekaria, Zubair Shafiq, Rishab Nithyanand
First: 2026-03-04T18:47:12+00:00 · Latest: 2026-03-04T18:47:12+00:00
Comments: ICWSM 2026
Abstract
YouTube has evolved into a powerful platform where creators monetize their influence through affiliate marketing, raising concerns about transparency and ethics, especially when creators fail to disclose their affiliate relationships. Although regulatory agencies like the US Federal Trade Commission (FTC) have issued guidelines to address these issues, non-compliance and consumer harm persist, and the extent of these problems remains unclear. In this paper, we introduce tools, developed with insights from recent advances in Web measurement and NLP research, to examine the state of the affiliate marketing ecosystem on YouTube. We apply these tools to a 10-year dataset of 2 million videos from nearly 540,000 creators, analyzing the prevalence of affiliate marketing on YouTube and the rates of non-compliant behavior. Our findings reveal that affiliate links are widespread, yet disclosure compliance remains low, with most videos failing to meet FTC standards. Furthermore, we analyze the effects of different stakeholders in improving disclosure behavior. Our study suggests that the platform is highly associated with improved compliance through standardized disclosure features. We recommend that regulators and affiliate partners collaborate with platforms to enhance transparency, accountability, and trust in the influencer economy.
中文标题/摘要
标题:将信任转化为交易:追踪YouTube影响者经济中的附属营销和FTC合规性
YouTube 已发展成为创作者通过附属营销变现的强大平台,引发了关于透明度和伦理的担忧,尤其是当创作者未能披露其附属关系时。尽管美国联邦贸易委员会 (FTC) 等监管机构已发布指南以解决这些问题,但不合规和消费者伤害仍然存在,这些问题的程度仍不清楚。在本文中,我们介绍了与最近的网络测量和自然语言处理研究进展相结合开发的工具,以检查YouTube附属营销生态系统的现状。我们应用这些工具分析了YouTube上附属营销的普遍性和违规行为的频率。我们的研究发现,附属链接普遍存在,但披露合规性仍然很低,大多数视频未能达到FTC标准。此外,我们分析了不同利益相关者在改善披露行为方面的影响。我们的研究建议,平台与标准披露功能的关联性与提高合规性密切相关。我们建议监管机构和附属合作伙伴与平台合作,以增强影响者经济中的透明度、问责制和信任。
Summary / 总结
This paper examines the affiliate marketing ecosystem on YouTube, using tools developed from recent advances in web measurement and NLP research. Analyzing 2 million videos from nearly 540,000 creators over 10 years, the study finds that while affiliate links are common, disclosure compliance is low, with most videos failing to meet FTC standards. The research also explores the impact of different stakeholders on improving disclosure behavior and suggests that standardized disclosure features on the platform can enhance compliance and trust in the influencer economy.
本文探讨了YouTube影响者经济中的透明度问题,特别是关注附属营销和FTC合规性。作者使用网络测量和NLP技术开发了工具,分析了10年间来自近54万创作者的200万条视频。研究发现,尽管附属链接很常见,但合规披露要求的遵守率很低,大多数视频未能达到FTC标准。研究还探讨了不同利益相关者对改善披露行为的影响,并建议平台功能可以增强影响者经济中的透明度、问责制和信任。
Composition-Grounded Data Synthesis for Visual Reasoning
Authors: Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He
First: 2025-10-16T18:00:48+00:00 · Latest: 2026-03-04T18:45:57+00:00
Comments: ICLR2026 camera-ready version. Project page: https://cogsynthesis.github.io
Abstract
Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded data Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
中文标题/摘要
标题:基于组成驱动的数据合成以增强视觉推理能力
预训练的多模态大型语言模型(MLLMs)在多种多模态任务中表现出色,但在难以收集注释的领域中推理能力仍然有限。本文我们关注人工图像领域,如图表、渲染文档和网页,这些领域在实践中丰富但缺乏大规模的人工注释推理数据集。我们引入了COGS(基于组成的数据合成),这是一种数据高效框架,可以从少量种子问题中赋予MLLMs高级推理能力。核心思想是将每个种子问题分解为基本感知和推理因素,然后系统地重新组合新图像以生成大量合成的问答对。每个生成的问题都配以子问题和中间答案,这使得基于因素级过程奖励的强化学习成为可能。在图表推理实验中,COGS在未见过的问题上显著提高了性能,特别是在推理密集和组合性问题上取得了最大的改进。此外,使用不同种子数据的因子级混合进行训练在多个数据集上表现出更好的迁移性,表明COGS诱导了可泛化的功能而非数据集特定的过拟合。我们进一步证明了该框架不仅适用于图表,还可以扩展到其他领域,如网页。
Summary / 总结
This work addresses the limitation of multi-modal large language models in reasoning tasks where annotations are hard to collect, focusing on artificial image domains like charts and webpages. COGS, a data-efficient framework, decomposes seed questions into perception and reasoning factors, which are then recomposed to generate synthetic question-answer pairs. Experiments show that COGS improves performance on unseen questions, especially on reasoning-heavy and compositional ones, and suggest that it induces generalizable reasoning abilities rather than dataset-specific overfitting.
该研究针对多模态大型语言模型在难以获取标注的数据域(如图表、文档和网页)中的推理能力有限的问题,提出了COGS框架,通过将种子问题分解为感知和推理因素,再重新组合生成新的合成问题-答案对。实验表明,COGS在未见过的问题上显著提高了性能,特别是在推理密集和组合性问题上,并且在不同数据集上表现出良好的泛化能力,没有过拟合。该框架还被证明适用于除图表之外的其他领域。
TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Authors: Maximilian von Klinski, Maximilian Schall
Venue: WACV 2026
First: 2026-03-04T18:45:35+00:00 · Latest: 2026-03-04T18:45:35+00:00
Comments: Accepted at WACV 2026
Abstract
Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
中文标题/摘要
标题:TaxonRL:使用中间奖励的强化学习进行可解释的细粒度视觉推理
传统的视觉-语言模型在对比细粒度分类-taxonomic推理方面存在困难,尤其是在区分同一属或同一科中的视觉相似物种时。我们提出了TaxonRL,这是一种使用组相对策略优化的强化学习方法,并使用中间奖励将推理过程分解为分层分类预测。我们的方法激励模型在最终分类之前明确地推理物种级、属级和科级特征。这种结构化方法不仅旨在提高准确性,还旨在产生透明且可验证的决策过程。在具有挑战性的鸟类到词语数据集上,TaxonRL 达到了 91.7% 的平均准确率,超过了人类表现(77.3%),同时生成了可解释的推理轨迹。我们展示了强大的跨域泛化能力,在灵长类和海洋物种验证中取得了显著进步。我们的结果表明,强制执行结构化、分层推理为细粒度视觉区分提供了一个强大且可转移的框架。
Summary / 总结
TaxonRL is a reinforcement learning method that uses intermediate rewards to improve fine-grained visual reasoning, especially for distinguishing similar species within the same genus or family. It decomposes the reasoning process into hierarchical taxonomic predictions, enhancing both accuracy and interpretability. On the Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, surpassing human performance (77.3%) and generating interpretable reasoning traces. It also shows strong generalization across different species domains.
TaxonRL 是一种使用中间奖励的强化学习方法,旨在提高细粒度视觉推理能力,特别是区分视觉上相似的物种。它将推理过程分解为层次化的分类预测,既提升了准确性也增强了可解释性。在 Birds-to-Words 数据集上,TaxonRL 达到了 91.7% 的准确率,超过了人类的表现,并生成了可解释的推理痕迹。此外,它在灵长类和海洋物种验证中也表现出强大的跨域泛化能力。
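The combination of intermediate rewards and Group Relative Policy Optimization mentioned for TaxonRL can be sketched as follows. The reward weights, the taxonomy labels, and both function names are hypothetical and not taken from the paper. The only part that follows the usual GRPO formulation is the last step, where each sampled response's reward is normalized against the mean and standard deviation of its own group.

```python
import numpy as np

# Hypothetical weights for partial credit at each taxonomic level.
LEVEL_WEIGHTS = {"family": 0.2, "genus": 0.3, "species": 0.5}

def hierarchical_reward(pred, gold):
    """Sum the weights of the taxonomic levels that the response got right."""
    return sum(w for level, w in LEVEL_WEIGHTS.items() if pred.get(level) == gold[level])

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

gold = {"family": "family_A", "genus": "genus_A1", "species": "species_A1x"}
group = [  # four sampled responses to the same prompt (placeholder labels)
    {"family": "family_A", "genus": "genus_A1", "species": "species_A1x"},
    {"family": "family_A", "genus": "genus_A1", "species": "species_A1y"},
    {"family": "family_A", "genus": "genus_A2", "species": "species_A2x"},
    {"family": "family_B", "genus": "genus_B1", "species": "species_B1x"},
]

rewards = [hierarchical_reward(p, gold) for p in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

With partial credit, a response that gets the family and genus right still receives a higher advantage than one that is wrong at every level, which is one plausible reading of how intermediate rewards could shape the reasoning trace.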
Helios: Real Real-Time Long Video Generation Model
Authors: Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan
First: 2026-03-04T18:45:21+00:00 · Latest: 2026-03-04T18:45:21+00:00
Comments: Page: pku-yuangroup.github.io/Helios-Page
Abstract
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
中文标题/摘要
标题:Helios:首个在单块NVIDIA H100 GPU上以19.5 FPS运行的14B视频生成模型
我们介绍了Helios,这是首个在单块NVIDIA H100 GPU上以19.5 FPS运行的14B视频生成模型,支持分钟级生成并匹配强基线的质量。我们在三个关键维度上取得了突破:(1)在无需使用自强制、错误银行或关键帧采样等常用抗漂移启发式方法的情况下,实现长视频生成的鲁棒性;(2)在无需使用标准加速技术如KV缓存、稀疏/线性注意力或量化的情况下,实现实时生成;(3)在无需并行或切片框架的情况下进行训练,能够支持图像扩散规模的批量大小,同时在80 GB的GPU内存中容纳四个14B模型。具体而言,Helios是一个14B自回归扩散模型,具有统一的输入表示,支持T2V、I2V和V2V任务。为了缓解长视频生成中的漂移问题,我们描述了典型的失败模式,并提出了一种简单而有效的训练策略,在训练过程中显式模拟漂移,同时从源头消除重复运动。为了提高效率,我们高度压缩了历史和嘈杂的上下文,并减少了采样步骤,计算成本与1.3B视频生成模型相当或更低。此外,我们引入了基础设施级别的优化,加速了推理和训练,同时减少了内存消耗。广泛的实验表明,Helios在短视频和长视频生成方面均优于先前的方法。我们计划发布代码、基础模型和精简模型,以支持社区进一步开发。
Summary / 总结
Helios is a 14B autoregressive diffusion model that generates long videos in real-time at 19.5 FPS on a single NVIDIA H100 GPU, matching the quality of a strong baseline. It achieves this by addressing long-video drifting through training strategies that explicitly simulate drifting, reducing computational costs, and optimizing both inference and training. Experiments show Helios outperforms previous methods in both short- and long-video generation. Key optimizations include compressing historical and noisy context and reducing the number of sampling steps. Training requires no parallelism or sharding frameworks, with up to four 14B models fitting within 80 GB of GPU memory.
Helios 是一个 14B 自回归扩散模型,能够在单个 NVIDIA H100 GPU 上以 19.5 FPS 的速度实时生成长视频,同时与强基线保持质量一致。它通过解决三个关键挑战来实现这一目标:长视频漂移的鲁棒性、不使用标准加速技术的实时生成,以及不使用并行或切片框架的高效训练。Helios 通过在训练中模拟漂移来减轻漂移问题,并通过压缩历史上下文和减少采样步骤来降低计算成本,使其在效率上与较小的模型相当。实验表明,Helios 在短视频和长视频生成方面均优于先前的方法。
A dataset of high-resolution plantar pressures for gait analysis across varying footwear and walking speeds
Authors: Robyn Larracy, Angkoon Phinyomark, Ala Salehi, Eve MacDonald, Saeed Kazemi, Shikder Shafiul Bashar, Aaron Tabor, Erik Scheme
Venue: Scientific Data 12 (2025) 1415
First: 2025-02-24T15:21:02+00:00 · Latest: 2026-03-04T18:35:47+00:00
Abstract
Gait refers to the patterns of limb movement generated during walking, which are unique to each individual due to both physical and behavioral traits. Walking patterns have been widely studied in biometrics, biomechanics, sports, and rehabilitation. While traditional methods rely on video and motion capture, advances in plantar pressure sensing technology now offer deeper insights into gait. However, underfoot pressures during walking remain underexplored due to the lack of large, publicly accessible datasets. To address this, we introduce the UNB StepUP-P150 dataset: a footStep database for gait analysis and recognition using Underfoot Pressure, including data from 150 individuals. This dataset comprises high-resolution plantar pressure data (4 sensors per cm²) collected using a 1.2 m by 3.6 m pressure-sensing walkway. It contains over 200,000 footsteps from participants walking at various speeds (preferred, slow-to-stop, fast, and slow) and under various footwear conditions (barefoot, standard shoes, and two personal shoes), supporting advancements in biometric gait recognition and presenting new research opportunities in biomechanics and deep learning. UNB StepUP-P150 establishes a new benchmark for plantar pressure-based gait analysis and recognition.
中文标题/摘要
标题:一种适用于不同鞋类和行走速度步态分析的高分辨率足底压力数据集
步态是指行走过程中产生的肢体运动模式,由于生理和行为特征的不同,每个人的步态都是独特的。步态模式在生物识别、生物力学、体育和康复领域得到了广泛研究。传统方法依赖于视频和动作捕捉,而足底压力传感技术的进步现在提供了更深入的步态洞察。然而,由于缺乏大型的公开可用数据集,行走时的足底压力仍然未被充分探索。为了解决这一问题,我们引入了UNB StepUP-P150数据集:一种用于步态分析和识别的足底压力数据库,包括150名个体的数据。该数据集包含使用1.2米×3.6米压力传感行走道收集的高分辨率足底压力数据(每平方厘米4个传感器),包含超过200,000个以不同速度(首选、慢至停止、快、慢)和鞋类条件(赤足、标准鞋和两种个人鞋)行走的脚印,支持生物识别步态识别的进步,并为生物力学和深度学习提供了新的研究机会。UNB StepUP-P150为基于足底压力的步态分析和识别设立了新的基准。
Summary / 总结
The study aims to explore underfoot pressures during walking, which have been underexplored due to the lack of large, publicly accessible datasets. The researchers developed the UNB StepUP-P150 dataset, which includes high-resolution plantar pressure data from 150 individuals walking at different speeds and under different footwear conditions. The dataset contains over 200,000 footsteps collected using a 1.2 m by 3.6 m pressure-sensing walkway, supporting advancements in biometric gait recognition and biomechanics research. This dataset establishes a new benchmark for plantar pressure-based gait analysis and recognition.
研究旨在探索行走过程中的足底压力,由于缺乏大型公开数据集,这些压力尚未得到充分研究。研究引入了UNB StepUP-P150数据集,该数据集包含150名个体在不同速度和鞋类条件下的高分辨率足底压力数据。主要发现包括超过20万步,为生物识别和生物力学中的足底压力步态分析和识别提供了新的基准。
$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Authors: Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres
First: 2026-03-04T18:34:47+00:00 · Latest: 2026-03-04T18:34:47+00:00
Comments: 29 pages (10 main + 19 appendix)
Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
中文标题/摘要
标题:$τ$-知识:评估对话代理在非结构化知识上的表现
对话代理在知识密集型环境中越来越被部署,正确的行为依赖于在与用户实时交互过程中从大型、专有且非结构化的语料库中检索和应用特定领域的知识。然而,现有的大多数基准测试独立地评估检索或工具使用,这在长时间交互中创建了一个现实的、全面的评估缺口。我们引入了$τ$-知识,这是$τ$-基准的扩展,用于评估在环境中成功取决于协调外部自然语言知识与工具输出以产生可验证、符合政策的状态变化的代理。我们的新领域$τ$-银行业,模拟了现实的金融科技客户支持工作流程,在此过程中代理必须在执行工具介导的账户更新的同时导航大约700个相互关联的知识文档。在基于嵌入的检索和基于终端的搜索中,即使具有高推理预算的前沿模型也只能达到约25.5%的通过率,可靠性在多次试验中急剧下降。代理难以从紧密关联的知识库中检索正确的文档,并且难以准确地在复杂的内部政策上进行推理。总体而言,$τ$-知识为开发能够整合非结构化知识的人机交互代理提供了现实的测试平台。
Summary / 总结
The research aims to evaluate conversational agents in knowledge-intensive settings where they must retrieve and apply unstructured domain-specific knowledge during live interactions. The method involves extending $τ$-Bench to create $τ$-Knowledge, which evaluates agents in a new domain, $τ$-Banking, where they must navigate roughly 700 interconnected knowledge documents and execute tool-mediated account updates. Key findings show that even frontier models with high reasoning budgets achieve only around 25.5% pass^1, with reliability degrading over repeated trials, indicating significant challenges in retrieving correct documents and reasoning over complex policies.
研究旨在评估对话代理在知识密集型环境中的表现,这些环境要求代理在实时交互中检索和应用未结构化的领域特定知识。研究引入了$τ$-Knowledge,这是$τ$-Bench的扩展,用于评估代理在$τ$-Banking领域中的表现,该领域模拟了金融科技客户支持的工作流程。即使具有高推理预算的先进模型,也只能达到约25.5%的通过率,突显了在复杂的知识基底中导航和推理复杂政策的困难。
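The pass^1 figure and the remark about reliability over repeated trials can be made concrete with a small sketch. It assumes the pass^k definition used by the original $τ$-Bench: the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function name and the trial counts are illustrative, not benchmark results.

```python
from math import comb

def pass_hat_k(results, k):
    """Estimate pass^k from per-task (num_trials, num_successes) pairs.

    Assumes every task has at least k trials.
    """
    per_task = [comb(c, k) / comb(n, k) for n, c in results]
    return sum(per_task) / len(per_task)

# Illustrative data: three tasks, four trials each.
results = [(4, 4), (4, 2), (4, 0)]

for k in (1, 2, 4):
    print(k, pass_hat_k(results, k))
```

Because a task only counts toward pass^k when k sampled trials all succeed, the estimate can only stay flat or fall as k grows, which is why it serves as a reliability measure rather than a best-case one.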
Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Authors: Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
First: 2026-03-04T18:29:54+00:00 · Latest: 2026-03-04T18:29:54+00:00
Abstract
Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
中文标题/摘要
标题:双模态多阶段对抗安全训练:增强多模态网络代理对抗跨模态攻击的鲁棒性
处理屏幕截图和无障碍树的多模态网络代理越来越多地被部署以与网页界面交互,但其双流架构打开了一个未被充分探索的攻击面:攻击者同时向网页DOM注入内容,会以一致的欺骗性叙述同时破坏两个观察通道。我们对MiniWob++的漏洞分析表明,包含视觉成分的攻击远优于仅包含文本的注入,暴露了以文本为中心的VLM安全训练中的关键漏洞。受此发现的启发,我们提出了双模态多阶段对抗安全训练(DMAST)框架,将代理-攻击者交互形式化为一个两玩家零和马尔可夫博弈,并通过三阶段流水线共同训练两个玩家:(1)从强大教师模型中学习模仿,(2)使用新颖的零确认策略的oracle引导监督微调,以在对抗噪声下培养任务导向的推理,(3)通过Group Relative Policy Optimization(GRPO)自博弈的对抗强化学习。在分布外任务中,DMAST显著减轻了对抗风险,同时将任务完成效率翻倍。我们的方法显著优于现有的基于训练和基于提示的防御,展示了真正的共生进步和对复杂、未见过的环境的稳健泛化能力。
Summary / 总结
The paper addresses the vulnerability of multimodal web agents that process both screenshots and accessibility trees, which can be attacked by adversaries injecting content into the webpage DOM. To counter this, the authors propose DMAST, a framework that involves imitation learning, supervised fine-tuning with a zero-acknowledgment strategy, and adversarial reinforcement learning. This method significantly reduces adversarial risks and improves task completion efficiency on out-of-distribution tasks, outperforming existing defenses.
论文针对处理屏幕截图和无障碍树的多模态网络代理的漏洞,这些代理可能受到攻击者同时篡改网页DOM内容并影响两个观察通道的攻击。为应对这一问题,作者提出了多模态多阶段对抗安全训练(DMAST)框架,该框架包括模仿学习、带有零确认策略的监督微调以及对抗强化学习。DMAST在处理未见过的任务时显著降低了对抗风险并提高了任务完成效率,优于现有防御方法。
Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights
Authors: Kenan Majewski, Michał Modzelewski, Marcin Żugaj, Piotr Lichota
First: 2026-03-04T18:27:59+00:00 · Latest: 2026-03-04T18:27:59+00:00
Comments: 8 pages, 3 figures, Submitted to the 29th International Conference on Information Fusion (FUSION 2026)
Abstract
The Unscented Kalman Filter (UKF) is a ubiquitous tool for nonlinear state estimation; however, its performance is limited by the static parameterization of the Unscented Transform (UT). Conventional weighting schemes, governed by fixed scaling parameters, assume implicit Gaussianity and fail to adapt to time-varying dynamics or heavy-tailed measurement noise. This work introduces the Meta-Adaptive UKF (MA-UKF), a framework that reformulates sigma-point weight synthesis as a hyperparameter optimization problem addressed via memory-augmented meta-learning. Unlike standard adaptive filters that rely on instantaneous heuristic corrections, our approach employs a Recurrent Context Encoder to compress the history of measurement innovations into a compact latent embedding. This embedding informs a policy network that dynamically synthesizes the mean and covariance weights of the sigma points at each time step, effectively governing the filter's trust in the prediction versus the measurement. By optimizing the system end-to-end through the filter's recursive logic, the MA-UKF learns to maximize tracking accuracy while maintaining estimation consistency. Numerical benchmarks on maneuvering targets demonstrate that the MA-UKF significantly outperforms standard baselines, exhibiting superior robustness to non-Gaussian glint noise and effective generalization to out-of-distribution (OOD) dynamic regimes unseen during training.
中文标题/摘要
标题:鲁棒的无迹卡尔曼滤波通过递归元自适应sigma点权重
无迹卡尔曼滤波器(UKF)是用于非线性状态估计的通用工具;然而,其性能受限于无迹变换(UT)的静态参数化。传统的加权方案由固定的缩放参数控制,假设隐含的高斯性,并且无法适应时间变化的动力学或重尾测量噪声。本文引入了元自适应UKF(MA-UKF)框架,将sigma点权重合成重新表述为通过记忆增强的元学习解决的超参数优化问题。与依赖于瞬时启发式修正的标准自适应滤波器不同,我们的方法使用递归上下文编码器将测量创新的历史压缩成紧凑的潜在嵌入。该嵌入指导策略网络在每个时间步动态合成sigma点的均值和协方差权重,有效地控制滤波器对预测的信任程度与测量之间的关系。通过优化滤波器递归逻辑中的系统,MA-UKF学习最大化跟踪精度并保持估计一致性。在机动目标的数值基准测试中,MA-UKF显著优于标准基线,表现出对非高斯散射噪声的优越鲁棒性,并且能够有效泛化到训练期间未见过的分布外(OOD)动态环境中。
Summary / 总结
The research addresses the limitations of the Unscented Kalman Filter (UKF) in handling time-varying dynamics and heavy-tailed measurement noise by introducing the Meta-Adaptive UKF (MA-UKF). The method reformulates sigma-point weight synthesis as a hyperparameter optimization problem: a Recurrent Context Encoder compresses the history of measurement innovations into a latent embedding, and a policy network uses that embedding to synthesize the mean and covariance weights of the sigma points at each time step. Numerical benchmarks on maneuvering targets show that MA-UKF outperforms standard baselines, particularly under non-Gaussian glint noise and in dynamic regimes unseen during training.
研究旨在通过解决UKF在处理时间变化动态和重尾噪声方面的局限性,提高其性能。提出的Meta-Adaptive UKF (MA-UKF) 将sigma点权重合成重新表述为一个超参数优化问题,并使用递归上下文编码器压缩历史测量创新,生成一个潜在嵌入,指导策略网络在每个时间步动态调整sigma点权重。实验结果表明,MA-UKF 在非高斯噪声和未见过的动力学环境中优于标准UKF和其他基线方法。
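For context on the MA-UKF entry above, the "conventional weighting schemes, governed by fixed scaling parameters" can be sketched as the textbook scaled unscented transform weights. This is a minimal illustration of the static baseline only, not the MA-UKF itself; the function name and default parameter values are assumptions made for this example.

```python
def unscented_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    """Mean and covariance weights for the 2n+1 sigma points of the scaled UT.

    n is the state dimension; alpha, beta, kappa are the usual fixed scaling
    parameters. The defaults are common textbook choices, not values taken
    from the paper.
    """
    lam = alpha ** 2 * (n + kappa) - n      # composite scaling parameter
    denom = n + lam
    # Central point first, then 2n symmetric points with equal weight.
    w_mean = [lam / denom] + [1.0 / (2.0 * denom)] * (2 * n)
    w_cov = list(w_mean)
    # Only the central covariance weight carries the beta correction.
    w_cov[0] += 1.0 - alpha ** 2 + beta
    return w_mean, w_cov


w_mean, w_cov = unscented_weights(n=2, alpha=1.0, beta=2.0, kappa=1.0)
print(w_mean, w_cov)
```

These weights are fixed once the three parameters are chosen. According to the abstract, MA-UKF instead has a policy network produce the mean and covariance weights at every time step.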
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Authors: Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre
First: 2025-11-05T13:02:06+00:00 · Latest: 2026-03-04T18:27:25+00:00
Comments: Accepted at LREC 2026. To access the dataset, see https://github.com/bonzid/CareMedEval
Abstract
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
中文标题/摘要
标题:CareMedEval 数据集:评估生物医学领域的批判性评估与推理
批判性评估科学文献是生物医学领域的一项基本技能。虽然大型语言模型(LLMs)在这一任务中提供了有希望的支持,但它们的可靠性仍然有限,特别是在专门领域的批判性推理方面。我们介绍了CareMedEval,这是一个原创数据集,旨在评估LLMs在生物医学批判性评估和推理任务中的表现。该数据集源自法国医学生的真实考试,包含基于37篇科学文章的534个问题。与现有的基准不同,CareMedEval明确评估了基于科学论文的批判性阅读和推理。在不同上下文条件下对最先进的通用和生物医学专业化LLMs进行基准测试揭示了任务的难度:开源和商用模型即使生成中间推理令牌也无法超过0.5的精确匹配率。然而,模型在关于研究局限性和统计分析的问题上仍然面临挑战。CareMedEval为基于推理的基准测试提供了挑战,揭示了当前LLM的局限性,并为未来开发自动支持批判性评估铺平了道路。
Summary / 总结
The research evaluates the critical appraisal and reasoning skills of large language models (LLMs) in the biomedical field. The study introduces CareMedEval, a dataset derived from authentic exams taken by French medical students, containing 534 questions based on 37 scientific articles. Benchmarking generalist and biomedical-specialized LLMs shows that open and commercial models fail to exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves results; questions about study limitations and statistical analysis remain especially difficult. The dataset exposes current LLM limitations in this domain and provides a challenging benchmark for future development of automated support for critical appraisal.
研究旨在评估大型语言模型(LLMs)在生物医学领域的批判性评估和推理能力。研究引入了CareMedEval数据集,该数据集来源于法国医学生的真实考试,包含基于37篇科学文章的534个问题。对通用和专门化LLM的基准测试显示,即使使用中间推理令牌,这些模型也难以应对关于研究局限性和统计分析的问题。该数据集揭示了LLMs在这一领域的当前局限性,并为未来开发自动支持批判性评估提供了挑战。
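The Exact Match Rate reported for CareMedEval can be illustrated with a short sketch. It assumes multiple-choice questions where several options may be correct and a question counts only if the predicted option set equals the reference set exactly; the dataset's official scoring may differ, and the function name is hypothetical.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted option set equals the gold set.

    Each prediction and reference is an iterable of option labels. Order and
    duplicates are ignored; any missing or extra option makes the question
    count as wrong (assumed scoring rule).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)


preds = [["A", "C"], ["B"], ["A"]]
golds = [["C", "A"], ["B", "D"], ["A"]]
print(exact_match_rate(preds, golds))
```

Under this rule the second question scores zero even though the predicted option is one of the correct ones, which is part of why an all-or-nothing metric is demanding.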
Implicit U-KAN2.0: Dynamic, Efficient and Interpretable Medical Image Segmentation
Authors: Chun-Wun Cheng, Yining Zhao, Yanqi Cheng, Javier A. Montoya-Zegarra, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Venue: MICCAI 2025
First: 2025-03-05T03:31:05+00:00 · Latest: 2026-03-04T18:27:18+00:00
Comments: Accepted in MICCAI 2025
Abstract
Image segmentation is a fundamental task in both image analysis and medical applications. State-of-the-art methods predominantly rely on encoder-decoder architectures with a U-shaped design, commonly referred to as U-Net. Recent advancements integrating transformers and MLPs improve performance but still face key limitations, such as poor interpretability, difficulty handling intrinsic noise, and constrained expressiveness due to discrete layer structures, often lacking a solid theoretical foundation. In this work, we introduce Implicit U-KAN 2.0, a novel U-Net variant that adopts a two-phase encoder-decoder structure. In the SONO phase, we use second-order neural ordinary differential equations (NODEs), called the SONO block, for a more efficient, expressive, and theoretically grounded modeling approach. In the SONO-MultiKAN phase, we integrate the second-order NODEs and MultiKAN layer as the core computational block to enhance interpretability and representation power. Our contributions are threefold. First, U-KAN 2.0 is an implicit deep neural network incorporating MultiKAN and second-order NODEs, improving interpretability and performance while reducing computational costs. Second, we provide a theoretical analysis demonstrating that the approximation ability of the MultiKAN block is independent of the input dimension. Third, we conduct extensive experiments on a variety of 2D datasets and a single 3D dataset, demonstrating that our model consistently outperforms existing segmentation networks. Project Website: https://math-ml-x.github.io/IUKAN2/
中文标题/摘要
标题:隐式U-KAN2.0:动态、高效且可解释的医学图像分割
图像分割是图像分析和医学应用中的基本任务。最先进的方法主要依赖于具有U形设计的编码器-解码器架构,通常称为U-Net。最近将变压器和MLPs集成进来提高了性能,但仍面临关键限制,如解释性差、难以处理固有噪声以及由于离散层结构导致的表达能力受限,通常缺乏坚实的理论基础。在本工作中,我们引入了隐式U-KAN2.0,这是一种新颖的U-Net变体,采用两阶段编码器-解码器结构。在SONO阶段,我们使用了称为SONO块的二阶神经常微分方程(NODEs),以实现更高效、更具表达性和理论依据的建模方法。在SONO-MultiKAN阶段,我们将二阶NODEs和MultiKAN层作为核心计算块,以增强可解释性和表示能力。我们的贡献有三个方面。首先,U-KAN2.0是一种隐式深度神经网络,结合了MultiKAN和二阶NODEs,提高了可解释性和性能,同时降低了计算成本。其次,我们提供了理论分析,证明了MultiKAN块的逼近能力与输入维度无关。第三,我们在多种2D数据集和一个3D数据集上进行了广泛的实验,证明我们的模型在所有情况下都优于现有的分割网络。项目网站:https://math-ml-x.github.io/IUKAN2/
Summary / 总结
The research addresses the limitations of existing U-Net architectures in medical image segmentation, such as poor interpretability and constrained expressiveness. The authors introduce Implicit U-KAN 2.0, which uses a two-phase encoder-decoder structure with SONO and SONO-MultiKAN phases. The model incorporates second-order neural ordinary differential equations (NODEs) and MultiKAN layers to enhance interpretability and representation power. Experiments on a variety of 2D datasets and a single 3D dataset show that the proposed model consistently outperforms existing segmentation networks while reducing computational costs.
研究旨在通过解决现有U-Net架构的局限性,提高医学图像分割的可解释性和性能。方法引入了Implicit U-KAN 2.0,采用两阶段编码解码结构,包括SONO和SONO-MultiKAN阶段。模型在多种2D和3D数据集上表现出色,优于现有分割网络,提升了可解释性和降低了计算成本。理论分析表明,MultiKAN块的逼近能力与输入维度无关,进一步验证了模型的稳健性和效率。
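The idea behind a second-order neural ODE, as used in the entry above, is that the hidden state follows x'' = f(x, x') and is integrated as a first-order system in (x, v) with v = x'. The sketch below uses explicit Euler steps and a hand-written placeholder for f; in a real model f would be a learned network, and the abstract does not specify the actual form of the SONO block, so everything here is an illustrative assumption.

```python
def integrate_second_order(f, x0, v0, t1, steps):
    """Integrate x'' = f(x, v) from t = 0 to t = t1 with explicit Euler.

    The second-order equation is rewritten as the first-order system
    x' = v, v' = f(x, v). x0 and v0 are lists of floats of equal length.
    """
    dt = t1 / steps
    x, v = list(x0), list(v0)
    for _ in range(steps):
        a = f(x, v)                                      # acceleration from the dynamics function
        x_next = [xi + dt * vi for xi, vi in zip(x, v)]  # position update uses the old velocity
        v_next = [vi + dt * ai for vi, ai in zip(v, a)]  # velocity update uses the acceleration
        x, v = x_next, v_next
    return x, v


# Placeholder dynamics: a damped spring, standing in for a learned network.
def damped_spring(x, v):
    return [-xi - 0.1 * vi for xi, vi in zip(x, v)]


print(integrate_second_order(damped_spring, [1.0], [0.0], t1=1.0, steps=100))
```

Carrying the velocity as extra state is what distinguishes this from a first-order neural ODE, where only x would be propagated.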
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Authors: Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos
First: 2026-02-28T12:10:58+00:00 · Latest: 2026-03-04T18:26:58+00:00
Abstract
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
中文标题/摘要
标题:CMI-RewardBench:基于组合多模态指令评估音乐奖励模型
尽管音乐生成模型已经能够处理混合文本、歌词和参考音频的复杂多模态输入,但评估机制却落后了。本文通过建立基于组合多模态指令(CMI)的音乐奖励建模综合生态系统,填补了这一关键空白,其中生成的音乐可以基于文本描述、歌词和音频提示进行条件化。我们首先介绍了包含110,000个伪标签样本的CMI-Pref-Pseudo大规模偏好数据集,以及一个针对细粒度对齐任务的人工标注高质量语料库CMI-Pref。为了统一评估框架,我们提出了CMI-RewardBench统一基准,该基准在音乐性、文本-音乐对齐和组合指令对齐方面对音乐奖励模型进行评估。利用这些资源,我们开发了CMI奖励模型(CMI-RMs),这是一种参数高效的奖励模型家族,能够处理异构输入。我们评估了它们与人类判断得分在音乐性和对齐方面的相关性,以及与先前数据集的对齐情况。进一步的实验表明,CMI-RM 不仅与人类判断高度相关,还通过 top-k 过滤实现了有效的推理时缩放。训练数据、基准和奖励模型均已公开。
Summary / 总结
This paper addresses the gap in evaluating music generation models by introducing CMI-RewardBench, a unified benchmark for music reward models under Compositional Multimodal Instruction (CMI). The authors develop CMI-Pref-Pseudo and CMI-Pref datasets and propose CMI-RMs, a family of parameter-efficient reward models. Key findings include strong correlation with human judgments on musicality and alignment, and effective inference-time scaling via top-k filtering.
本文通过引入CMI-RewardBench统一基准,解决了音乐生成模型的评估问题,该基准适用于基于Compositional Multimodal Instruction的音乐奖励建模。它包括CMI-Pref-Pseudo和CMI-Pref数据集,分别用于偏好学习和细粒度对齐。作者提出了能够高效处理异构输入的CMI奖励模型(CMI-RMs),并在音乐性和对齐方面与人类判断表现出强烈的相关性。实验还展示了通过top-k过滤实现有效的推理时缩放。
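Inference-time scaling via top-k filtering, mentioned in the entry above, can be sketched as follows: generate several candidates, score each with a reward model, and keep the k best. The reward function here is a stand-in supplied by the caller; it is not the CMI-RM interface, which the abstract does not describe.

```python
def top_k_filter(candidates, reward_fn, k):
    """Return the k candidates with the highest reward, best first.

    reward_fn maps a candidate to a float score. Ties keep their original
    relative order because Python's sort is stable.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return ranked[:k]


# Toy usage: strings stand in for generated clips, length stands in for reward.
print(top_k_filter(["a", "bbb", "cc"], reward_fn=len, k=2))
```

Generating more candidates before filtering trades extra compute for a better chance that the retained samples score well under the reward model.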
Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Authors: Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin, Md Ashikur Rahman
First: 2026-02-25T23:08:31+00:00 · Latest: 2026-03-04T18:21:09+00:00
Abstract
Vision-Language Models (VLMs) often hallucinate objects that are not present in the input image. We identify a contributing cause of this behavior, which we term spatial credit collapse: in early transformer layers, hidden-state activation concentrates on a small number of visual patches, suppressing surrounding contextual evidence and increasing reliance on language priors. Across seven models we observe a strong correlation between visual attention entropy and hallucination rate (r = -0.65, p < 0.001), suggesting that reduced spatial credit diversity contributes to hallucination.
To address this issue we propose Spatial Credit Redistribution (SCR), a training-free inference-time method. SCR uses a lightweight two-pass procedure. A diagnostic pass identifies the top-K high-attention source patches and their spatial neighbors. A redistribution pass then scales each source by 1/lambda (~0.91) and injects a (lambda - 1) weighted copy of its hidden state into neighboring patches, restoring suppressed visual context without modifying model weights. Because the diagnostic pass is performed once per image and reused across the output sequence, the added latency is negligible (<0.5 ms per token for 100-token responses).
We evaluate SCR across seven model configurations from four VLM families (Chameleon, LLaVA-1.5, Qwen-VL/Qwen2-VL, and InternVL2) on five benchmarks: POPE, CHAIR, MME, HallusionBench, and AMBER. SCR reduces POPE-Adversarial hallucination by 4.6-6.0 percentage points and CHAIR-s by 41-51 percent while preserving caption quality (CIDEr drop <=0.8). Compared with prior inference-time methods including OPERA, VCD, OA-VCD, DoLa, VLI, SID, and CRoPS, SCR achieves a better trade-off between hallucination reduction, generation quality, and latency.
中文标题/摘要
标题:超越主导斑块:空间信用重分配以实现基于视觉-语言模型的定位
视觉-语言模型(VLMs)经常虚构输入图像中不存在的对象。我们识别出这种行为的一个促成因素,称为空间信用崩溃:在早期的变压器层中,隐藏状态激活集中在少量的视觉斑块上,抑制了周围的上下文证据,并增加了对语言先验的依赖。在七个模型中,我们观察到视觉注意力熵与虚构率之间存在显著相关性(r = -0.65,p < 0.001),表明空间信用多样性减少会促进虚构。
为解决这一问题,我们提出了一种无需训练的推理时方法——空间信用重分配(SCR)。SCR 使用一种轻量级的两步程序。诊断步骤识别出高注意力的前K个源斑块及其空间邻居。然后,重分配步骤将每个源斑块的权重调整为1/λ(~0.91),并将(λ - 1)加权的隐藏状态副本注入到相邻斑块中,从而恢复被抑制的视觉上下文,而不修改模型权重。由于诊断步骤仅在每张图像上执行一次并在输出序列中重复使用,因此增加的延迟可以忽略不计(对于100个词的响应,每词<0.5毫秒)。
我们在四种VLM家族(Chameleon、LLaVA-1.5、Qwen-VL/Qwen2-VL、InternVL2)的七个模型配置上,对五个基准(POPE、CHAIR、MME、HallusionBench、AMBER)进行了评估。SCR 将POPE-对抗虚构率降低了4.6-6.0个百分点,将CHAIR-s降低了41-51%,同时保持了描述质量(CIDEr下降<=0.8)。与先前的推理时方法(包括OPERA、VCD、OA-VCD、DoLa、VLI、SID和CRoPS)相比,SCR 在减少虚构、生成质量和延迟之间实现了更好的权衡。
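The SCR redistribution pass described in the abstract above can be sketched on a toy patch grid. The scaling by 1/lambda and the (lambda - 1) weighted injection come from the abstract; a value of 1/lambda near 0.91 corresponds to lambda near 1.1. The 4-connected neighborhood, the row-major grid layout, the per-neighbor injection, and the function name are assumptions made for this illustration only.

```python
def redistribute(hidden, attention, grid_w, top_k=1, lam=1.1):
    """Scale the top-k attended patches and inject their states into neighbors.

    hidden: list of per-patch vectors (lists of floats), row-major on a grid.
    attention: one attention score per patch, used to pick source patches.
    """
    n = len(hidden)
    grid_h = n // grid_w
    # Diagnostic step: indices of the k patches with the highest attention.
    sources = sorted(range(n), key=lambda i: attention[i], reverse=True)[:top_k]
    out = [list(h) for h in hidden]
    # Scale every source by 1/lam first, so later injections are not overwritten.
    for s in sources:
        out[s] = [x / lam for x in hidden[s]]
    # Inject a (lam - 1) weighted copy of each original source state into its neighbors.
    for s in sources:
        r, c = divmod(s, grid_w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid_h and 0 <= cc < grid_w:
                j = rr * grid_w + cc
                out[j] = [o + (lam - 1.0) * x for o, x in zip(out[j], hidden[s])]
    return out


print(redistribute([[1.0], [10.0], [2.0]], [0.1, 0.8, 0.1], grid_w=3))
```

No model weights are touched; only hidden states change, which matches the training-free framing in the abstract.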
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Authors: Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, Yuke Zhu
Venue: ICLR 2026
First: 2026-03-04T18:20:03+00:00 · Latest: 2026-03-04T18:20:03+00:00
Comments: ICLR 2026; First three authors contributed equally
Abstract
Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data -- making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.
中文标题/摘要
标题:RoboCasa365:大规模模拟框架,用于训练和基准测试通用机器人
近年来,机器人学习的进步加速了通用机器人在人类环境中执行日常任务的进程。然而,仍难以衡量我们离这一愿景有多近。该领域缺乏一个可重复的大规模基准,用于系统评估。为填补这一空白,我们提出了RoboCasa365,这是一个全面的家庭移动操作模拟基准。基于RoboCasa平台,RoboCasa365引入了2500个不同厨房环境中的365项日常任务,包含超过600小时的人类演示数据和超过1600小时的合成生成演示数据——使其成为研究通用策略最多样和最庞大的资源之一。RoboCasa365旨在支持不同问题设置的系统评估,包括多任务学习、机器人基础模型训练和终身学习。我们使用最先进的方法在该基准上进行了广泛的实验,并分析了任务多样性、数据集规模和环境变化对泛化的影响。我们的结果提供了关于哪些因素最强烈影响通用机器人性能的新见解,并为该领域的未来进展提供了策略指导。
Summary / 总结
RoboCasa365 is a large-scale simulation framework designed to evaluate generalist robots in household settings. It introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data. The framework supports multi-task learning, robot foundation model training, and lifelong learning. Extensive experiments with state-of-the-art methods analyze how task diversity, dataset scale, and environment variation affect generalization.
RoboCasa365 是一个大规模的模拟框架,旨在评估机器人在家庭任务中的表现。它引入了2,500个多样化的厨房环境中的365项日常任务,附带了大量的人类和合成演示数据。该框架支持多任务学习、机器人基础模型训练和终身学习。使用最先进的方法进行的实验表明,任务多样性、数据集规模和环境变化对通用机器人性能有显著影响,为未来的研究提供了新的见解。
Out-of-distribution transfer of PDE foundation models to material dynamics under extreme loading
Authors: Mahindra Rautela, Alexander Most, Siddharth Mansingh, Aleksandra Pachalieva, Bradley Love, Daniel O Malley, Alexander Scheinker, Kyle Hickmann, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas
First: 2026-03-04T18:19:35+00:00 · Latest: 2026-03-04T18:19:35+00:00
Abstract
Most PDE foundation models are pretrained and fine-tuned on fluid-centric benchmarks. Their utility under extreme-loading material dynamics remains unclear. We benchmark out-of-distribution transfer on two discontinuity-dominated regimes in which shocks, evolving interfaces, and fracture produce highly non-smooth fields: shock-driven multi-material interface dynamics (perturbed layered interface or PLI) and dynamic fracture/failure evolution (FRAC). We formulate the downstream task as terminal-state prediction, i.e., learning a long-horizon map that predicts the final state directly from the first snapshot without intermediate supervision. Using a unified training and evaluation protocol, we evaluate two open-source pretrained PDE foundation models, POSEIDON and MORPH, and compare fine-tuning from pretrained weights against training from scratch across training-set sizes to quantify sample efficiency under distribution shift.
中文标题/摘要
标题:极端加载条件下材料动力学中PDE基础模型的离分布转移
大多数PDE基础模型在流体中心基准上进行预训练和微调,它们在极端加载条件下材料动力学中的实用性尚不清楚。我们在两种以不连续性为主导的环境中进行了离分布转移基准测试,其中冲击波、演化界面和断裂产生高度非光滑场:冲击波驱动的多材料界面动力学(扰动层状界面或PLI)和动态断裂/失效演化(FRAC)。我们将下游任务定义为终端状态预测,即学习一个长期预测映射,直接从初始快照预测最终状态,无需中间监督。使用统一的训练和评估协议,我们评估了两个开源预训练PDE基础模型POSEIDON和MORPH,并比较了从预训练权重微调与从零开始训练在不同训练集大小下的样本效率,以量化分布转移下的样本效率。
Summary / 总结
The research aims to evaluate the transferability of PDE foundation models pretrained on fluid benchmarks to material dynamics under extreme loading conditions. The study benchmarks out-of-distribution transfer on two discontinuity-dominated regimes: shock-driven multi-material interface dynamics and dynamic fracture/failure evolution. The downstream task is formulated as terminal-state prediction. Two open-source pretrained PDE foundation models, POSEIDON and MORPH, are evaluated, and the study compares fine-tuning from pretrained weights to training from scratch across different training-set sizes to assess sample efficiency under distribution shift.
研究旨在评估预训练于流体基准上的PDE基础模型在极端加载材料动力学中的迁移能力。研究在冲击驱动的多材料界面动力学和动态断裂/失效演化两个领域进行基准测试。使用统一的协议,评估了两个预训练模型POSEIDON和MORPH在分布偏移下的样本效率,比较了从预训练权重微调与从头训练的方法。
A Constrained RL Approach for Cost-Efficient Delivery of Latency-Sensitive Applications
Authors: Ozan Aygün, Vincenzo Norman Vitale, Antonia M. Tulino, Hao Feng, Elza Erkip, Jaime Llorca
First: 2026-03-04T18:19:35+00:00 · Latest: 2026-03-04T18:19:35+00:00
Comments: 7 pages, 4 figures, accepted for publication in 2025 59th Asilomar Conference on Signals, Systems, and Computers
Abstract
Next-generation networks aim to provide performance guarantees to real-time interactive services that require timely and cost-efficient packet delivery. In this context, the goal is to reliably deliver packets with strict deadlines imposed by the application while minimizing overall resource allocation cost. A large body of work has leveraged stochastic optimization techniques to design efficient dynamic routing and scheduling solutions under average delay constraints; however, these methods fall short when faced with strict per-packet delay requirements. We formulate the minimum-cost delay-constrained network control problem as a constrained Markov decision process and utilize constrained deep reinforcement learning (CDRL) techniques to effectively minimize total resource allocation cost while maintaining timely throughput above a target reliability level. Results indicate that the proposed CDRL-based solution can ensure timely packet delivery even when existing baselines fall short, and it achieves lower cost compared to other throughput-maximizing methods.
中文标题/摘要
标题:一种针对延迟敏感应用成本高效交付的约束强化学习方法
下一代网络旨在为需要及时且成本高效的分组交付的实时交互式服务提供性能保证。在此背景下,目标是在严格的应用程序时间限制下可靠地交付分组,同时尽量减少总体资源分配成本。大量研究工作利用随机优化技术设计了在平均延迟约束下的高效动态路由和调度解决方案;然而,当面对严格的每包延迟要求时,这些方法会失效。我们将最小成本延迟约束网络控制问题形式化为约束马尔可夫决策过程,并利用约束深度强化学习(CDRL)技术有效最小化总资源分配成本,同时保持及时吞吐量高于目标可靠性水平。结果表明,提出的基于CDRL的解决方案即使在现有基线方法失效时也能确保及时分组交付,并且与其它吞吐量最大化方法相比成本更低。
Summary / 总结
The paper addresses the challenge of delivering latency-sensitive applications with strict per-packet delay requirements in next-generation networks. It formulates the problem as a constrained Markov decision process and employs constrained deep reinforcement learning (CDRL) to minimize resource allocation cost while keeping timely throughput above a target reliability level. Results indicate that the proposed method ensures timely packet delivery even when existing baselines fall short, and achieves lower cost than other throughput-maximizing methods.
论文针对下一代网络中严格单包延迟要求的实时交互服务的传输问题,将问题形式化为约束马尔可夫决策过程,并采用约束深度强化学习方法以最小化资源分配成本同时确保及时传输。结果表明,所提出的方法在保持及时吞吐量和实现较低成本方面优于现有基线方法和吞吐量最大化方法。
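The constrained Markov decision process in the entry above can be written schematically as a cost minimization subject to a reliability constraint. The notation below is an assumption introduced for illustration and is not taken from the paper: c(s_t, a_t) is the resource allocation cost at step t, R(pi) is the long-run fraction of packets delivered within their deadline under policy pi, and delta is the target reliability level.

```latex
\begin{aligned}
\min_{\pi} \quad & \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} c(s_t, a_t) \right] \\
\text{s.t.} \quad & R(\pi) \;\ge\; \delta
\end{aligned}
```

One common way to handle such a constraint, though not necessarily the one used by the authors, is Lagrangian relaxation: the policy minimizes the cost plus a multiplier times the constraint violation, and the multiplier is raised whenever R(pi) falls below delta.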
FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
Authors: Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin
First: 2026-03-04T18:14:00+00:00 · Latest: 2026-03-04T18:14:00+00:00
Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries.
In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
中文标题/摘要
标题:FocusGraph:基于图结构框架的选择性关键帧提取用于体感长视频问答
理解长视频的能力对于体感智能代理至关重要,因为它们的效果取决于能否有效地积累、组织和利用长时间感知记忆。最近,由于其理解和利用世界知识的一般能力,多模态LLM因解决长视频理解任务而受到越来越多的关注。然而,随着提供给MLLM的帧数量增加,其响应质量往往会下降,推理时间也会增长。因此,在使用MLLM进行长视频理解时,一个关键步骤是从视频中选择关键帧以回答用户查询。
在这项工作中,我们开发了FocusGraph,这是一种用于长第一人称视角视频问答的关键帧选择框架。它利用一个轻量级可训练的场景-描述LLM选择器,该选择器基于图基描述选择与查询相关的片段,并且使用一种无需训练的方法从这些片段中选择关键帧。与现有方法不同,所提出的场景-描述LLM选择器不依赖于原始的低分辨率帧序列,而是操作于场景的紧凑文本表示。然后,我们设计了一种无需训练的块级稀疏流保留(PSFR)方法,从生成的片段序列中选择关键帧,这些片段被输入到MLLM以生成最终答案。这些组件共同使FocusGraph在具有挑战性的第一人称视角长视频问答基准测试(包括FindingDory和HourVideo)中取得了最先进的结果,同时显著减少了相对于基线方法的推理时间。
Summary / 总结
FocusGraph is a framework for keyframe selection in question answering over long egocentric videos. A lightweight, trainable Scene-Caption LLM Selector selects query-relevant clips based on their graph-based captions, and a training-free Patch-wise Sparse-Flow Retention (PSFR) method selects keyframes from these clips. The approach achieves state-of-the-art results on egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
FocusGraph 是一种用于长视频问答的关键帧选择框架,使用轻量级的 Scene-Caption LLM 选择器和无训练的 PSFR 方法来选择关键帧。它利用基于图的描述来选择与查询相关的片段,然后使用 PSFR 从这些片段中挑选关键帧,这些关键帧随后被输入到 MLLM 中以生成最终答案。FocusGraph 在 FindingDory 和 HourVideo 等基准测试中取得了最先进的结果,同时显著减少了推理时间。
RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation
Authors: Yixin Chen, Ziyu Su, Hikmat Khan, Muhammad Khalid Khan Niazi
First: 2026-03-04T18:12:31+00:00 · Latest: 2026-03-04T18:12:31+00:00
Abstract
Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heterogeneity of Whole Slide Images (WSIs). Existing pathology report generation frameworks typically employ transformer architectures, relying on a homogeneous decoder architecture and static knowledge retrieval integration. Such architectures limit generative specialization and may introduce noisy external guidance during the report generation process. To address these limitations, we propose RANGER, a sparsely-gated Mixture-of-Experts (MoE) framework with adaptive retrieval re-ranking for pathology report generation. Specifically, we integrate a sparsely gated MoE into the decoder, along with noisy top-$k$ routing and load-balancing regularization, to enable dynamic expert specialization across various diagnostic patterns. Additionally, we introduce an adaptive retrieval re-ranking module that selectively refines retrieved memory from a knowledge base before integration, reducing noise and improving semantic alignment based on visual feature representations. We perform extensive experiments on the PathText-BRCA dataset and demonstrate consistent improvements over existing approaches across standard natural language generation metrics. Our full RANGER model achieves optimal performance on PathText dataset, reaching BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883, and ROUGE-L of 0.3038, validating the effectiveness of dynamic expert routing and adaptive knowledge refinement for semantically grounded pathology report generation.
中文标题/摘要
标题:RANGER:稀疏门控混合专家体系结构与自适应检索重排序在病理报告生成中的应用
病理报告生成仍然是相对未被充分探索的下游任务,主要由于全切片图像(WSIs)的吉像素规模和复杂的形态异质性。现有的病理报告生成框架通常采用变压器架构,依赖于同质解码器架构和静态知识检索集成。这些架构限制了生成的专业化,并可能在报告生成过程中引入噪声外部指导。为了解决这些限制,我们提出了一种稀疏门控混合专家(MoE)框架RANGER,该框架结合了自适应检索重排序,以实现病理报告生成。具体而言,我们将稀疏门控MoE集成到解码器中,结合嘈杂的top-$k$路由和负载均衡正则化,以实现各种诊断模式下的动态专家专业化。此外,我们引入了一个自适应检索重排序模块,在集成前选择性地细化知识库检索的记忆,减少噪声并基于视觉特征表示提高语义对齐。我们在PathText-BRCA数据集上进行了广泛的实验,并在标准自然语言生成指标上展示了相对于现有方法的一致改进。我们的完整RANGER模型在PathText数据集上达到了最优性能,BLEU-1到BLEU-4得分为0.4598、0.3044、0.2036和0.1435,METEOR得分为0.1883,ROUGE-L得分为0.3038,验证了动态专家路由和自适应知识细化在语义导向病理报告生成中的有效性。
Summary / 总结
RANGER is a sparsely-gated Mixture-of-Experts framework with adaptive retrieval re-ranking designed for pathology report generation. It integrates a sparsely gated MoE into the decoder, with noisy top-k routing and load-balancing regularization, and adds an adaptive retrieval re-ranking module that refines memory retrieved from a knowledge base before integration. Experiments on the PathText-BRCA dataset show consistent improvements over existing methods on standard natural language generation metrics; the full RANGER model reaches BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883 and ROUGE-L of 0.3038.
RANGER 是一种用于病理报告生成的稀疏门控 Mixture-of-Experts 框架,结合了自适应检索重排序模块以从知识库中精炼知识,增强语义对齐。在 PathText-BRCA 数据集上的实验显示,该方法在 PathText 数据集上表现最优,BLEU-1 到 BLEU-4 分别达到 0.4598、0.3044、0.2036 和 0.1435,METEOR 为 0.1883,ROUGE-L 为 0.3038,显示出动态专家路由和自适应知识精炼的有效性。
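Noisy top-k routing, named in the RANGER entry above, can be sketched in its generic sparsely-gated MoE form: add input-dependent Gaussian noise to the gate logits, keep the k largest, and apply a softmax over those. This is an illustration of the general technique, not RANGER's implementation; the function and argument names are assumptions, and the logits are passed in directly rather than computed from learned projections.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def noisy_top_k_gate(clean_logits, noise_logits, k, rng=None, add_noise=True):
    """Return one gate weight per expert; all but k of them are exactly zero."""
    rng = rng or random.Random()
    if add_noise:
        # Noise scale per expert is softplus(noise_logit), so it is always positive.
        logits = [c + rng.gauss(0.0, 1.0) * softplus(n)
                  for c, n in zip(clean_logits, noise_logits)]
    else:
        logits = list(clean_logits)
    # Keep the k experts with the largest (noisy) logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


print(noisy_top_k_gate([2.0, 1.0, 0.0, -1.0], [0.0] * 4, k=2, add_noise=False))
```

The noise is typically used during training to spread tokens across experts; a load-balancing penalty, not shown here, is usually added to the loss for the same purpose.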
Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations
Authors: Pedro Antonio Alarcon Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang
First: 2025-10-30T18:11:32+00:00 · Latest: 2026-03-04T18:07:51+00:00
Comments: 12 pages, 9 figures
Abstract
Cyber-physical systems increasingly rely on foundational models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, over-generalizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance. In this paper we describe an LLM/VLM-supported pipeline for dynamic clue analysis within the domain of small autonomous Uncrewed Aerial Systems deployed on Search and Rescue (SAR) missions, and a Cognition Envelope based on probabilistic reasoning and resource analysis. We evaluate the approach through assessing decisions made by our Clue Analysis Pipeline in a series of SAR missions. Finally, we identify key software engineering challenges for systematically designing, implementing, and validating Cognition Envelopes for AI-supported decisions in cyber-physical systems.
中文标题/摘要
标题:认知边界在自主无人航空系统受限决策中的应用
网络物理系统越来越多地依赖于大型语言模型(LLMs)和视觉语言模型(VLMs)等基础模型,通过增强感知、推理和规划来提高自主性。然而,这些模型也会引入新的错误类型,如幻觉、过度概括和上下文错位,导致错误和有缺陷的决策。为了解决这一问题,我们提出了认知边界的概念,旨在通过限制AI生成的决策来建立推理边界,同时补充元认知和传统安全边界的使用。与安全边界类似,认知边界需要实用的指导方针和系统的过程来定义、验证和保证。在本文中,我们描述了一个基于LLM/VLM的线索分析管道,用于搜救(SAR)任务中部署的小型自主无人航空系统,并基于概率推理和资源分析构建了认知边界。我们通过评估线索分析管道在一系列SAR任务中做出的决策来评估该方法。最后,我们确定了系统设计、实现和验证支持AI决策的网络物理系统中认知边界的软件工程挑战。
Summary / 总结
This paper introduces Cognition Envelopes to constrain errors, such as hallucinations, over-generalizations, and context misalignments, in decisions made by LLMs and VLMs on small Uncrewed Aerial Systems (UAS) in Search and Rescue (SAR) missions. The method comprises an LLM/VLM-supported pipeline for dynamic clue analysis and a Cognition Envelope based on probabilistic reasoning and resource analysis. The approach is evaluated by assessing decisions made by the Clue Analysis Pipeline in a series of SAR missions, and the paper identifies key software engineering challenges in systematically designing, implementing, and validating these envelopes.
本文旨在通过引入认知包络来解决自主决策中的错误问题,该认知包络为AI生成的决策设定推理边界。作者描述了一个基于LLM/VLM的线索分析管道以及基于概率推理和资源分析的认知包络。该方法通过在搜救任务中的决策评估进行了验证,强调了在计算物理系统中系统设计、实现和验证认知包络的必要性。
Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe
Authors: Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy
First: 2026-03-04T18:07:23+00:00 · Latest: 2026-03-04T18:07:23+00:00
Abstract
Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.
中文标题/摘要
标题:基础模型预训练数据中代表性不足?一次探针
大规模视觉-语言基础模型(VLFMs),如CLIP,现在支撑着广泛的计算机视觉研究和应用。VLFMs通常被调整以适应各种特定领域的任务。然而,VLFMs在新颖、专门或代表性不足的领域中的表现仍然不一致。评估VLFMs通常需要带有标签的测试集,而这些测试集对于特定的、尤其是来自全球南方的领域来说往往不可用。我们通过提出一种仅使用每个类别一个带有标签的图像来预测VLFM在目标领域零样本准确性的高效方法来填补这一空白。我们的方法使用大型语言模型生成给定图像的合乎情理的反事实描述。通过测量VLFM区分正确描述与这些困难负样本的能力,我们设计了能够捕捉VLFM在共享嵌入空间中判别能力的特征。基于这些相似度分数训练的线性回归器在各种视觉领域中估计VLFM的零样本测试准确率,皮尔逊相关系数为0.96。我们在五个不同的数据集中展示了该方法的性能,包括标准基准数据集和来自非洲的代表性不足的数据集。我们的工作提供了一种低成本、可靠的工具来探针VLFMs,使研究人员和实践者能够在投入大量资源之前做出知情的数据注释努力决策。模型训练代码、生成的描述和反事实在这里发布:https://github.com/chris-vorster/PreLabellingProbe.
Summary / 总结
The research aims to evaluate Vision-Language Foundation Models (VLFMs) on underrepresented domains where labeled test sets are scarce. The method involves using a Large Language Model to generate counterfactual descriptions of images, which are then used to measure the VLFM's discriminative power. This approach achieves a Pearson-r correlation of 0.96 in predicting zero-shot test accuracy across various visual domains, including underrepresented datasets from Africa. This provides a cost-effective tool for assessing VLFMs before extensive data annotation efforts are made.
研究旨在评估在缺乏标签测试集的未充分代表领域中Vision-Language基础模型(VLFMs)的表现。方法是使用大型语言模型生成图像的反事实描述,然后测量VLFM的区分能力。该方法在各种视觉领域,包括来自非洲的未充分代表的数据集上,实现了0.96的皮尔逊相关性预测零样本测试准确性。这提供了一种成本效益高的工具,在进行大量数据注释工作之前评估VLFMs。
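The probing recipe in the abstract above (score each labelled image against its true caption and LLM-written counterfactuals, then regress accuracy on the resulting similarity features) can be illustrated with a minimal Python sketch. The single margin feature, the function names, and every number below are assumptions made for illustration; the paper's actual features and regressor may differ.

```python
from math import sqrt

def margin_feature(sim_correct, sim_counterfactuals):
    """True-caption similarity minus the hardest counterfactual's similarity."""
    return sim_correct - max(sim_counterfactuals)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-dataset values: mean margin feature vs. known zero-shot accuracy.
mean_margins = [0.02, 0.05, 0.08, 0.11]
accuracies = [0.35, 0.52, 0.66, 0.81]

slope, intercept = fit_line(mean_margins, accuracies)
predicted = [slope * m + intercept for m in mean_margins]
r = pearson_r(predicted, accuracies)  # agreement between predicted and true accuracy
```

In this toy setup the regressor is fitted and scored on the same four points, so `r` only shows the mechanics; a real evaluation would hold out datasets when reporting correlation.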
Benchmarking ECG FMs: A Reality Check Across Clinical Tasks
Authors: M A Al-Masud, Juan Miguel Lopez Alcaraz, Nils Strodthoff
Venue: ICLR 2026
First: 2025-09-29T17:29:48+00:00 · Latest: 2026-03-04T18:06:32+00:00
Comments: Accepted at ICLR 2026. OpenReview: https://openreview.net/forum?id=xXRqWpt3Xr
Abstract
The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. FMs promise broader adaptability, but fundamental questions remain: Which architectures generalize best? How do models scale with limited labels? What explains performance differences across model families? We benchmarked eight ECG FMs on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in adult ECG interpretation, three FMs consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model, dominated 5 of 7 task categories, demonstrating that architecture matters more than scale. FMs improved label efficiency 3.3-9x over supervised baselines, though scaling behaviors varied across architectures. Representation analysis reveals that models with similar performance learn markedly different internal structures, suggesting multiple viable paths to effective ECG representation. Overall, while FMs show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. ECG-CPC's strong performance despite being orders of magnitude smaller challenges the assumption that FM quality requires massive scale, highlighting architectural inductive biases as an untapped opportunity.
中文标题/摘要
标题:心电图FMs基准测试:跨临床任务的现实检查
12导联心电图(ECG)是一种长期的诊断工具。然而,ECG解释的机器学习仍然支离破碎,通常局限于狭窄的任务或数据集。FMs承诺具有更广泛的适应性,但基本问题仍然存在:哪种架构泛化最好?模型在有限标签下如何扩展?模型家族之间性能差异的原因是什么?我们使用12个公共数据集中的1,650个回归和分类目标,对26个临床相关任务进行了8种ECG FMs的基准测试。模型在微调和冻结设置下进行了评估,并进行了跨数据集规模的扩展分析。结果显示,不同领域间性能异质性:在成人ECG解释中,三种FMs始终优于强大的监督基线。相反,ECG-CPC,一种紧凑的结构化状态空间模型,在7个任务类别中的5个中占主导地位,表明架构比规模更重要。FMs在标签效率上提高了3.3-9倍,尽管不同架构的扩展行为有所不同。表示分析表明,具有类似性能的模型学习了截然不同的内部结构,暗示了多种有效ECG表示的有效途径。总体而言,虽然FMs在成人ECG分析中显示出前景,但在心脏结构、结果预测和患者特征方面仍存在巨大差距。尽管ECG-CPC在规模小得多的情况下表现出色,挑战了FMs质量需要大规模的假设,突显了架构归纳偏见作为未开发的机会。
Summary / 总结
This study benchmarks eight ECG foundation models (FMs) on 26 clinically relevant tasks using 12 public datasets, evaluating them under fine-tuning and frozen settings. Performance is heterogeneous across domains: in adult ECG interpretation, three FMs consistently outperform strong supervised baselines, while ECG-CPC, a compact structured state-space model, dominates 5 of 7 task categories despite being orders of magnitude smaller. FMs improve label efficiency by 3.3 to 9 times over supervised baselines, though scaling behaviors vary across architectures. Representation analysis suggests that models with similar performance learn markedly different internal structures, indicating multiple viable paths to effective ECG representation. Substantial gaps remain in cardiac structure, outcome prediction, and patient characterization.
该研究对8种ECG基础模型(FMs)在26个临床任务上进行了基准测试,使用了12个公开数据集,评估了它们在微调和冻结设置下的性能。结果显示,不同FMs在不同领域表现出色,其中如ECG-CPC等模型即使参数量较少也能取得优异表现。研究强调了架构选择的重要性而非单纯规模,并指出存在多种有效的ECG表示路径。
Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models
Authors: Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman
First: 2025-05-27T07:27:03+00:00 · Latest: 2026-03-04T18:05:45+00:00
Abstract
Synthetically augmenting training datasets with diffusion models has become an effective strategy for improving the generalization of image classifiers. However, existing approaches typically increase dataset size by 10-30x and struggle to ensure generation diversity, leading to substantial computational overhead. In this work, we introduce TADA (TArgeted Diffusion Augmentation), a principled framework that selectively augments examples that are not learned early in training using faithful synthetic images that preserve semantic features while varying noise. We show that augmenting only this targeted subset consistently outperforms augmenting the entire dataset. Through theoretical analysis on a two-layer CNN, we prove that TADA improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Extensive experiments demonstrate that by augmenting only 30-40% of the training data, TADA improves generalization by up to 2.8% across diverse architectures including ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, TinyImageNet, and ImageNet, using optimizers such as SGD and SAM. Notably, TADA combined with SGD outperforms the state-of-the-art optimizer SAM on CIFAR-100 and TinyImageNet. Furthermore, TADA shows promising improvements on object detection benchmarks, demonstrating its applicability beyond image classification. Our code is available at https://github.com/BigML-CS-UCLA/TADA.
中文标题/摘要
标题:我们需要所有合成数据吗?通过扩散模型的目标图像增强
使用扩散模型合成增强训练数据集已成为提高图像分类器泛化能力的有效策略。然而,现有方法通常将数据集大小增加10-30倍,并且难以确保生成多样性,导致大量计算开销。在本文中,我们引入了TADA(目标扩散增强),这是一种原理性的框架,仅在训练早期未学习到的示例上进行选择性增强,使用忠实的合成图像保留语义特征同时变化噪声。我们证明,仅增强这一目标子集的一致性优于增强整个数据集。通过对两层CNN的理论分析,我们证明TADA通过促进特征学习速度的同质性来提高泛化能力,而不放大噪声。广泛的实验表明,通过仅增强30-40%的训练数据,TADA在包括ResNet、ViT、ConvNeXt和Swin Transformer在内的各种架构上,如CIFAR-10/100、TinyImageNet和ImageNet上,使用SGD和SAM等优化器,泛化能力提高高达2.8%。值得注意的是,TADA与SGD结合在CIFAR-100和TinyImageNet上优于最先进的优化器SAM。此外,TADA在对象检测基准测试中显示出有希望的改进,证明了其在图像分类之外的应用潜力。我们的代码可在https://github.com/BigML-CS-UCLA/TADA/获取。
Summary / 总结
This work introduces TADA (TArgeted Diffusion Augmentation), a framework that selectively augments training data using diffusion models to improve the generalization of image classifiers. TADA augments only the examples that are not learned early in training, and augmenting this targeted subset (30-40% of the training data) consistently outperforms augmenting the entire dataset. Experiments show that TADA improves generalization by up to 2.8% across various architectures and datasets, and that TADA combined with SGD outperforms the SAM optimizer on CIFAR-100 and TinyImageNet. Theoretical analysis on a two-layer CNN supports that TADA promotes homogeneity in feature learning speed without amplifying noise.
该研究提出了TADA(TArgeted Diffusion Augmentation),一种框架,通过选择性地使用扩散模型对训练数据进行增强,以提高图像分类器的泛化能力。TADA专注于增强在训练早期未被学习到的示例,从而仅使用30-40%的训练数据进行增强即可获得更好的性能。实验表明,TADA在各种架构和数据集上将泛化能力提高了高达2.8%,并且与SGD结合时在CIFAR-100和TinyImageNet上优于最先进的优化器SAM。理论分析表明,TADA促进了特征学习速度的同质性,而不会放大噪声。
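The selection step described in the TADA abstract (augment only the examples that are not learned early in training) can be sketched as follows. The abstract does not state the exact criterion, so the rule used here (an example counts as learned early only if it is classified correctly in a minimum fraction of early epochs), the function name, and the toy records are all assumptions.

```python
def select_not_learned_early(early_correct, min_correct_fraction=1.0):
    """Return ids of examples to augment.

    early_correct maps example id -> list of booleans, one per early epoch,
    telling whether the model classified that example correctly.
    An example is kept for augmentation when its fraction of correct early
    predictions falls below min_correct_fraction (a hypothetical criterion).
    """
    selected = []
    for example_id, history in early_correct.items():
        correct_fraction = sum(history) / len(history)
        if correct_fraction < min_correct_fraction:
            selected.append(example_id)
    return selected

# Toy correctness records over the first three epochs (made-up data).
records = {
    "img_a": [True, True, True],     # learned early: left un-augmented
    "img_b": [False, False, True],   # learned late: candidate for augmentation
    "img_c": [False, False, False],  # not learned: candidate for augmentation
}
to_augment = select_not_learned_early(records)
```

Only the ids in `to_augment` would then be passed to the diffusion model for synthetic variants, which is what keeps the augmented set to a fraction of the full dataset.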
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Venue: ICLR 2026
First: 2025-09-16T02:15:06+00:00 · Latest: 2026-03-04T18:04:43+00:00
Comments: Accepted to ICLR 2026
Abstract
Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% -> 56.3%) and improves extra note detection by 14.4 points (72.0% -> 86.4%). Similar gains are observed on CocoChorales-E. Furthermore, we also evaluate our models on real data we curated. This work introduces insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation. Code: https://github.com/ben2002chou/LadderSYM
中文标题/摘要
标题:LadderSym:一种多模态交织变换器,用于音乐练习错误检测
音乐学习者可以从能够准确检测练习中错误的工具中受益。现有方法通常使用启发式方法或可学习模型将音频录音与乐谱进行比较。本文介绍了一种新颖的基于Transformer的方法LadderSym,用于音乐错误检测。LadderSym基于对现有方法的两个关键观察:(1)晚期融合限制了跨流对齐和跨模态比较的能力;(2)依赖乐谱音频引入了频率谱中的模糊性,降低了同时音符音乐的性能。为了解决这些限制,LadderSym引入了(1)一种具有跨流对齐模块的双流编码器,以提高音频比较能力和错误检测F1分数,以及(2)一种多模态策略,通过将符号表示作为解码器提示来利用音频和符号乐谱,减少模糊性并提高F1分数。我们通过测量每个音符类别的F1分数,在MAESTRO-E和CocoChorales-E数据集上评估了该方法。与之前的最新技术相比,LadderSym在MAESTRO-E上将遗漏音符的F1分数提高了两倍多(26.8% -> 56.3%),并且在额外音符检测上提高了14.4个百分点(72.0% -> 86.4%)。在CocoChorales-E上也观察到类似收益。此外,我们还使用我们收集的真实数据评估了我们的模型。这项工作引入了关于比较模型的见解,这些见解可以指导强化学习序列评估任务、人类技能评估和模型评估。代码:https://github.com/ben2002chou/LadderSYM
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Authors: Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib
First: 2026-02-24T17:02:11+00:00 · Latest: 2026-03-04T18:04:31+00:00
Comments: For our project page, see https://ubisoft-laforge.github.io/character/skullptor/
Abstract
Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.
中文标题/摘要
标题:Skullptor:秒级多视角表面法线预测的高保真3D头像重建
从图像中重建高保真3D头像几何结构对于广泛的应用至关重要,但现有方法面临根本性的限制。传统的摄影测量能够实现极高的细节,但需要大量的相机阵列(25-200+视角)、大量的计算,并且在面部毛发等复杂区域需要手动清理。最近的替代方案存在根本性的权衡:基础模型能够高效地从单张图像中重建,但缺乏精细的几何细节,而基于优化的方法能够实现更高的保真度,但需要密集的视角和昂贵的计算。我们通过结合两种范式的优点来弥合这一差距。我们的方法引入了一种多视角表面法线预测模型,该模型将单目基础模型与跨视角注意力相结合,在前向传递中生成几何上一致的法线。然后,我们利用这些预测作为逆渲染优化框架中的强几何先验,以恢复高频表面细节。我们的方法在单张图像和多视角方法中表现出色,实现了与密集视角摄影测量相媲美的高保真重建,同时减少了相机需求和计算成本。代码和模型将被发布。
Summary / 总结
The research aims to improve the efficiency and accuracy of 3D head reconstruction from images. The method combines monocular foundation models with multi-view surface normal prediction and inverse rendering optimization to achieve high-fidelity reconstruction with fewer camera views and reduced computational cost. Key findings show that the approach matches the detail of dense-view photogrammetry but with significantly fewer views and lower computational demands.
研究旨在改进从图像中进行高保真3D头像重建,解决传统摄影测量和近期基础模型的局限性。该方法结合了多视角表面法线预测与逆向渲染优化,减少了对大量摄像头阵列和计算成本的需求。它在几何细节和效率上与密集视角摄影测量相当,但只需要较少的视角和更少的计算资源,优于单图像和多视角方法。
Balancing Fidelity, Utility, and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study
Authors: Madhura Edirisooriya, Dasuni Kawya, Ishan Kumarasinghe, Isuri Devindi, Mary M. Maleckar, Roshan Ragel, Isuru Nawinne, Vajira Thambawita
First: 2026-03-04T17:59:00+00:00 · Latest: 2026-03-04T17:59:00+00:00
Comments: 7 pages, 4 figures, Preprint
Abstract
Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks three generative architectures: Denoising Diffusion Probabilistic Models (DDPM), Latent Diffusion Models (LDM), and Flow Matching (FM) for synthetic CMR generation. Utilizing a two-stage pipeline where anatomical masks condition image synthesis, we evaluate generated data across three critical axes: fidelity, utility, and privacy. Our results show that diffusion-based models, particularly DDPM, provide the most effective balance between downstream segmentation utility, image fidelity, and privacy preservation under limited-data conditions, while FM demonstrates promising privacy characteristics with slightly lower task-level performance. These findings quantify the trade-offs between cross-domain generalization and patient confidentiality, establishing a framework for safe and effective synthetic data augmentation in medical imaging.
中文标题/摘要
标题:在合成心脏MRI生成中平衡保真度、实用性和隐私性:一项比较研究
心脏MRI (CMR) 中的深度学习从根本上受到数据稀缺性和隐私法规的限制。本研究系统地评估了三种生成架构:去噪扩散概率模型 (DDPM)、潜在扩散模型 (LDM) 和流匹配 (FM) 在合成CMR生成中的表现。利用两阶段管道,其中解剖学掩码条件化图像合成,我们从保真度、实用性和隐私性三个关键维度评估生成数据。结果显示,基于扩散的模型,尤其是DDPM,在数据稀缺条件下提供了最佳的下游分割实用性、图像保真度和隐私保护平衡,而FM展示了有希望的隐私特性,但任务级性能略低。这些发现量化了跨域泛化和患者保密性之间的权衡,为医学成像中的安全有效的合成数据增强建立了框架。
Summary / 总结
This study evaluates three generative architectures (DDPM, LDM, and FM) for synthetic cardiac MRI generation, focusing on balancing fidelity, utility, and privacy. Using a two-stage pipeline with anatomical masks, the research finds that DDPM offers the best balance between segmentation utility, image fidelity, and privacy preservation under limited data conditions, while FM shows promising privacy characteristics with slightly lower task-level performance.
该研究评估了三种生成架构——去噪扩散概率模型(DDPM)、潜在扩散模型(LDM)和流匹配(FM)——用于合成心脏MRI生成。通过基于解剖掩模的条件生成,并从保真度、实用性和隐私性三个方面评估生成的数据,研究发现DDPM在有限数据条件下提供了最佳的分割实用性、图像质量和隐私保护之间的平衡。FM在隐私特性方面表现出色,但在任务性能上略低。
ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
Authors: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
First: 2026-03-04T17:58:04+00:00 · Latest: 2026-03-04T17:58:04+00:00
Comments: Project Page: https://arthoi.github.io/
Abstract
Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.
中文标题/摘要
标题:ArtHOI:基于视频先验的4D重建合成 articulated 人类-物体交互
在没有3D/4D监督的情况下合成物理上合理的articulated人类-物体交互(HOI)仍然是一个基本挑战。虽然最近的零样本方法利用视频扩散模型来合成人类-物体交互,但它们主要局限于刚体操作,缺乏明确的4D几何推理。为了解决这一差距,我们将articulated HOI合成建模为从单目视频先验进行4D重建的问题:仅给定由扩散模型生成的视频,我们无需任何3D监督即可重建完整的4D articulated场景。基于重建的方法将生成的2D视频视为逆渲染问题的监督,恢复几何上一致且物理上合理的4D场景,这些场景自然地尊重接触、articulation和时间连贯性。我们引入了ArtHOI,这是第一个通过视频先验进行4D重建合成articulated人类-物体交互的零样本框架。我们的关键设计包括:1) 流动基于的部分分割:利用光学流作为几何线索来分离单目视频中的动态和静态区域;2) 分解的重建流水线:在单目模糊下,人类运动和物体articulation的联合优化不稳定,因此我们首先恢复物体articulation,然后在重建的物体状态上合成人类运动。ArtHOI将基于视频的生成与几何感知的重建结合起来,产生既在语义上对齐又在物理上扎根的交互。在各种articulated场景(例如,打开冰箱、橱柜、微波炉)中,ArtHOI在接触准确性、穿透减少和articulation保真度方面显著优于先前的方法,通过重建指导的合成将零样本交互合成扩展到刚体操作之外。
Summary / 总结
The research aims to synthesize physically plausible articulated human-object interactions (HOI) without 3D/4D supervision. The method formulates HOI synthesis as a 4D reconstruction problem from monocular video priors, using a flow-based part segmentation and a decoupled reconstruction pipeline. Key findings show that ArtHOI, the proposed zero-shot framework, significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity across various scenes, extending zero-shot interaction synthesis beyond rigid manipulation.
研究解决了在没有3D/4D监督的情况下合成物理上合理的 articulated 人-物交互的挑战。提出了ArtHOI框架,该框架从由扩散模型生成的单目视频先验中重建4D场景。关键方法包括基于流的部分分割和解耦的重建流水线。实验结果表明,ArtHOI在接触精度、穿透减少和关节保真度方面优于先前的方法,扩展了零样本交互合成的应用范围,超越了刚体操作的限制。
What Does Flow Matching Bring To TD Learning?
Authors: Bhavya Agrawalla, Michal Nauman, Aviral Kumar
First: 2026-03-04T17:51:30+00:00 · Latest: 2026-03-04T17:51:30+00:00
Abstract
Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2x in final performance and around 5x in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.
中文标题/摘要
标题:流匹配为TD学习带来了什么?
近期研究表明,流匹配在强化学习(RL)中的标量Q值函数估计中可能是有效的,但尚不清楚这种方法与标准批评家有何不同。与传统观点相反,我们表明,它们的成功并非由分布性RL解释,因为明确建模回报分布可能会降低性能。相反,我们认为,通过积分读取值以及在积分过程中的每个步骤监督密集的速度场,使用积分提高了TD学习的两种机制。首先,它通过在多次迭代计算中通过积分来缓解早期值估计中的错误,从而实现稳健的值预测,即通过“测试时恢复”机制。这种恢复机制在单一的批评家中不存在。其次,监督多个插值值的速度场会诱导网络中更“可塑”的特征学习,使批评家能够在不丢弃先前学习的特征或过度拟合训练过程中遇到的个别TD目标的情况下表示非平稳的TD目标。我们形式化了这些效果并进行了实证验证,表明流匹配批评家在高-UTD在线RL问题等场景中显著优于单一批评家(最终性能提高2倍,样本效率提高约5倍),同时在学习过程中保持稳定。
Summary / 总结
This paper investigates why flow matching is effective for scalar Q-value function estimation in reinforcement learning. It shows that the success of flow-matching critics is not explained by distributional RL and instead attributes it to two mechanisms: test-time recovery through iterative integration, and more plastic feature learning induced by dense velocity supervision. Experiments show that flow-matching critics outperform monolithic critics by 2 times in final performance and around 5 times in sample efficiency in settings where loss of plasticity poses a challenge, such as high-UTD online RL problems.
该研究探讨了流匹配在强化学习中对时间差分(TD)学习中标量Q值函数估计的有效性。研究挑战了流匹配通过分布性强化学习提升性能的常规观点,而是将其成功归因于两个机制:通过迭代积分实现的测试时恢复以及由密集速度监督诱导的可塑特征学习。实验表明,流匹配批评家在最终性能上比单一批评家高出2倍,在样本效率上高出5倍,特别是在高UTD在线RL问题中表现出色。
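The test-time recovery idea in the flow-matching abstract above (a value read out by integrating a velocity field, so that later steps can correct earlier errors) can be illustrated with a toy Euler integrator. This is only a sketch under strong assumptions: the velocity field here is an oracle for a straight-line interpolant toward a known target value, not a learned network, and all names are hypothetical.

```python
def oracle_velocity(z, t, target):
    """Velocity that points from the current state z to the target at time t < 1."""
    return (target - z) / (1.0 - t)

def integrate_value(z0, target, steps, perturb_step=None, perturb=0.0):
    """Read out a value with `steps` Euler steps from t = 0 to t = 1.

    If perturb_step is given, an error of size `perturb` is injected right
    after that step to mimic a bad early estimate. An error injected after
    the final step cannot be corrected, since no integration steps remain.
    """
    z = z0
    for k in range(steps):
        t = k / steps
        z = z + oracle_velocity(z, t, target) / steps
        if k == perturb_step:
            z += perturb
    return z

clean = integrate_value(z0=0.0, target=3.0, steps=8)
recovered = integrate_value(z0=0.0, target=3.0, steps=8, perturb_step=0, perturb=5.0)
```

Because the velocity is re-evaluated at the perturbed state, the remaining steps pull the estimate back toward the target, so `recovered` ends up at the same value as `clean`. A single-shot (monolithic) prediction has no comparable later step in which an error could be corrected.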
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Venue: ICLR 2026
First: 2025-02-03T17:13:03+00:00 · Latest: 2026-03-04T17:50:58+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common types of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.
中文标题/摘要
标题:偏好泄露:LLM作为法官中的污染问题
大型语言模型(LLMs)作为法官和基于LLM的数据合成已经成为了两种基本的LLM驱动的数据注释方法,在模型开发中得到了广泛应用。虽然它们的结合显著提高了模型训练和评估的效率,但对这种新模型开发范式带来的潜在污染却很少受到关注。在本文中,我们揭示了偏好泄露,这是一种由数据生成器LLM与法官LLM之间的相关性引起的污染问题。为了研究这一问题,我们首先定义了数据生成器LLM和法官LLM之间的三种常见相关性:同一模型、继承关系以及同一模型家族。通过广泛的实验,我们实证确认了偏好泄露导致法官倾向于其相关的学生模型的偏差,这一现象在多个LLM基线和基准中得到了验证。进一步的分析表明,偏好泄露是一个普遍且难以检测的现实问题,比之前在LLM作为法官场景中识别出的偏差更为棘手。所有这些发现表明,偏好泄露是LLM作为法官领域中一个普遍且具有挑战性的问题。我们已在以下链接发布了所有代码和数据:https://github.com/David-Li0406/Preference-Leakage。
Summary / 总结
This work investigates preference leakage, a contamination issue in LLM-as-a-judge, where the relatedness between synthetic data generators and evaluators introduces bias. The study defines three relatedness types and empirically confirms the bias across multiple LLM baselines and benchmarks. The findings suggest that preference leakage is a widespread and challenging problem in LLM-as-a-judge scenarios, harder to detect than previously identified biases.
研究探讨了LLM-as-a-judge中的偏好泄露问题,这种问题由于合成数据生成器和评估器之间的相关性引入了偏差。研究定义了三种相关性类型,并在多个LLM基线和基准上实证确认了这种偏差。研究结果表明,偏好泄露是一个在LLM-as-a-judge场景中广泛且具有挑战性的问题,比之前识别的偏差更难检测。
Algorithmic Compliance and Regulatory Loss in Digital Assets
Authors: Khem Raj Bhatt, Krishna Sharma
First: 2026-03-04T17:48:17+00:00 · Latest: 2026-03-04T17:48:17+00:00
Abstract
We study the deployment performance of machine-learning-based enforcement systems used in cryptocurrency anti-money-laundering (AML). Using forward-looking and rolling evaluations on Bitcoin transaction data, we show that strong static classification metrics substantially overstate real-world regulatory effectiveness. Temporal nonstationarity induces pronounced instability in cost-sensitive enforcement thresholds, generating large and persistent excess regulatory losses relative to dynamically optimal benchmarks. The core failure arises from miscalibration of decision rules rather than from declining predictive accuracy per se. These findings underscore the fragility of fixed AML enforcement policies in evolving digital asset markets and motivate loss-based evaluation frameworks for regulatory oversight.
中文标题/摘要
标题:算法合规与数字资产监管损失
我们研究了基于机器学习的执法系统在加密货币反洗钱(AML)中的部署性能。通过对比特币交易数据进行前瞻性和滚动评估,我们表明,强大的静态分类指标严重高估了实际世界的监管有效性。时间非平稳性导致成本敏感的执法阈值产生显著的不稳定性,相对于动态最优基准,产生了大量且持久的超额监管损失。核心失败源于决策规则的误校准,而不是预测准确性本身下降。这些发现强调了固定AML执法政策在不断变化的数字资产市场中的脆弱性,并促使监管审查基于损失的评估框架。
Summary / 总结
This study evaluates the deployment performance of machine-learning-based enforcement systems in cryptocurrency anti-money-laundering (AML). Using forward-looking and rolling evaluations on Bitcoin transaction data, it shows that strong static classification metrics substantially overstate real-world regulatory effectiveness. Temporal nonstationarity destabilizes cost-sensitive enforcement thresholds, producing large and persistent excess regulatory losses relative to dynamically optimal benchmarks; the failure stems from miscalibrated decision rules rather than from declining predictive accuracy. The findings motivate loss-based evaluation frameworks for regulatory oversight of evolving digital asset markets.
研究考察了基于机器学习的执法系统在加密货币反洗钱(AML)中的有效性。通过对比特币交易数据进行前瞻性和滚动评估,研究发现,静态分类指标夸大了实际的监管效果。时间非平稳性导致执法阈值的不稳定性,相对于动态最优基准产生了显著且持续的监管损失。核心问题是决策规则的校准不当,而不是预测准确性的下降。这突显了固定AML政策在动态数字资产市场中的脆弱性,并建议采用基于损失的评估框架来加强监管监督。
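The notion of excess regulatory loss in the abstract above (the loss of a fixed enforcement threshold minus the loss of the threshold that would have been optimal in the evaluation period) can be made concrete with a small sketch. The cost values, candidate thresholds, and score data are invented for illustration, and the paper's actual loss specification may differ.

```python
def enforcement_loss(scores, labels, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total cost of missed illicit cases (FN) and wrongly flagged licit ones (FP)."""
    loss = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1 and not flagged:
            loss += cost_fn
        elif label == 0 and flagged:
            loss += cost_fp
    return loss

def best_threshold(scores, labels, candidates):
    """Candidate threshold with the lowest loss (first one wins on ties)."""
    return min(candidates, key=lambda th: enforcement_loss(scores, labels, th))

def excess_loss(past, present, candidates):
    """Loss of the threshold tuned on `past`, minus the best achievable on `present`."""
    fixed = best_threshold(past[0], past[1], candidates)
    optimal = best_threshold(present[0], present[1], candidates)
    return (enforcement_loss(present[0], present[1], fixed)
            - enforcement_loss(present[0], present[1], optimal))

candidates = [0.3, 0.5, 0.7]
# Made-up (scores, labels) pairs; label 1 marks an illicit transaction.
past = ([0.1, 0.2, 0.35, 0.9], [0, 0, 1, 1])
present = ([0.4, 0.45, 0.5, 0.6], [0, 0, 1, 1])  # score distribution has drifted

gap = excess_loss(past, present, candidates)
```

In this toy data the classifier still ranks illicit above licit transactions perfectly in the later period, yet the stale threshold flags every licit transaction. The loss comes from the decision rule, not from the ranking quality, which is the distinction the abstract draws.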
Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images
Authors: Damian J. Ruck, Paul Vautravers, Oliver Chalkley, Jake Thomas
First: 2026-03-04T17:46:08+00:00 · Latest: 2026-03-04T17:46:08+00:00
Abstract
Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation.
We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery.
Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions.
These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.
中文标题/摘要
标题:合成环境增强在图像中的现实性可扩展评估
AI系统的评估通常需要合成测试案例,尤其是对于在操作数据中难以观察的罕见或安全关键条件。生成式AI通过可控的图像编辑提供了一种有前景的数据生成方法,但其有用性取决于生成的图像是否足够现实,以支持有意义的评估。
我们提出了一种可扩展的框架来评估合成图像编辑方法的现实性,并将其应用于向汽车车载摄像头图像添加环境条件(雾、雨、雪和夜间)的任务。使用40张晴天图像,我们将基于规则的增强库与生成式AI图像编辑模型进行了比较。现实性通过两种互补的自动化度量标准进行评估:基于视觉-语言模型(VLM)的陪审团进行感知现实性评估,以及基于嵌入的分布分析来衡量与真实不良条件图像的相似性。
生成式AI方法显著优于基于规则的方法,最佳生成式方法的接受率大约是最佳基于规则方法的3.6倍。性能在不同条件下有所不同:雾是最容易模拟的,而夜间变换仍然具有挑战性。值得注意的是,VLM陪审团即使对真实不良条件图像也给予了不完美的接受,这为合成方法设定了实际的上限。按照这一标准,领先的生成式方法在大多数条件下与真实图像的性能相当或超过。
这些结果表明,现代生成式图像编辑模型可以实现对评估管道中现实不良条件图像的可扩展生成。因此,我们的框架提供了一种实用的方法来进行可扩展的现实性评估,尽管未来工作仍需通过人类研究进行验证。
Summary / 总结
The research aims to evaluate the realism of synthetic environmental augmentations in images, which is crucial for testing AI systems under rare or safety-critical conditions. The study uses a scalable framework combining a vision-language model (VLM) jury and embedding-based distributional analysis to assess the realism of synthetic images with fog, rain, snow, and nighttime conditions. Generative AI methods outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Fog is easiest to simulate, while nighttime transformations remain challenging. Because the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, real images set a practical ceiling, and leading generative methods match or exceed it for most conditions.
论文提出了一种可扩展的框架来评估合成图像增强的真实感,特别是为汽车摄像头图像添加雾、雨、雪和夜间条件。研究将基于规则的增强库与生成式AI模型进行了比较,使用了两种指标:视觉语言模型进行感知真实感评估和嵌入式分析衡量与真实图像的相似度。研究发现,生成式AI方法优于基于规则的方法,最佳生成式方法的接受率大约是最佳基于规则方法的3.6倍。视觉语言模型也对真实不良条件图像给予了不完美的接受,表明领先的生成式方法可以匹配或超过真实图像的性能,尤其是在大多数条件下。
PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology
Authors: Kelly L Vomo-Donfack, Adryel Hoszu, Grégory Ginot, Ian Morilla
First: 2026-03-04T17:44:39+00:00 · Latest: 2026-03-04T17:44:39+00:00
Comments: 22 pages, 6 Figures
Abstract
Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48-dimensional PH feature vectors (compact shape summaries whose many-to-one structure makes inversion provably ill-posed) rather than model gradients. The server performs topology-guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra-cluster models are topology-weighted, and clusters are blended with a global consensus. We prove an information-contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein-weighted aggregation scheme with an error floor strictly smaller than FedAvg. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non-IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively, the highest in both settings, while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing. Code is publicly available at https://github.com/MorillaLab/TopoFederatedL and data at https://doi.org/10.5281/zenodo.18827595.
中文标题/摘要
标题:PTOPOFL:通过持久同调实现的隐私保护个性化联邦学习
联邦学习(FL)面临两个结构上的矛盾:梯度共享使得数据重建攻击成为可能,而非IID客户端分布降低了聚合质量。我们引入PTOPOFL框架,通过使用从持久同调(PH)导出的拓扑描述符来同时解决这两个问题,从而替代梯度通信。客户端仅传输48维的PH特征向量(紧凑的形状摘要,其多对一的结构使得逆向还原可证明是不适定的),而非模型梯度。服务器执行拓扑引导的个性化聚合:客户端根据其PH图之间的Wasserstein相似度进行聚类,同一簇内的模型按拓扑加权,并与全局共识混合。我们证明了一个信息收缩定理,表明在强凸损失函数下,PH描述符泄露的每样本互信息严格少于梯度。我们还证明了Wasserstein加权聚合方案的线性收敛性,其误差下限严格小于FedAvg。在非IID医疗保健场景(8家医院,2个对手)和病理基准测试(10个客户端)中,与FedAvg、FedProx、SCAFFOLD和pFedMe相比,PTOPOFL分别实现了AUC 0.841和0.910(在两种设置中均为最高值),同时相对于梯度共享将重建风险降低了4.5倍。代码可在https://github.com/MorillaLab/TopoFederatedL 公开获取,数据可在https://doi.org/10.5281/zenodo.18827595 获取。
Summary / 总结
PTOPOFL addresses the challenges of data-reconstruction attacks and non-IID client distributions in federated learning by using topological descriptors derived from persistent homology. Clients send compact 48-dimensional PH feature vectors instead of gradients, and the server performs topology-guided personalized aggregation. PTOPOFL achieves the highest AUC in both a non-IID healthcare scenario and a pathological benchmark, while reducing reconstruction risk by a factor of 4.5 compared to gradient sharing.
PTOPOFL通过使用持久同调来替代梯度通信,以拓扑描述符来解决联邦学习中的数据重建攻击和非IID客户端分布问题。该框架基于 Wasserstein 相似性对客户端进行聚类,并进行拓扑加权聚合。实验结果显示,PTOPOFL在非IID医疗场景(8家医院,2个对手)和病理基准测试(10个客户端)中分别实现了0.841和0.910的AUC值,高于其他方法如FedAvg、FedProx、SCAFFOLD和pFedMe,并将重建风险降低了4.5倍。信息收缩定理证明,在强凸损失函数下,拓扑描述符比梯度泄露更少的互信息,且聚合方案的误差下限比FedAvg更小。
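The PTOPOFL abstract above describes weighting intra-cluster models by topological similarity and blending each cluster with a global consensus. The following is only a minimal sketch of that weighting-and-blending idea, not the authors' implementation. It assumes each client is summarised by a short list of persistence lifetimes and uses the 1-D Wasserstein distance between equal-size samples as a stand-in for the Wasserstein distance between full PH diagrams. The names `wasserstein_1d`, `similarity_weights`, `blend`, and the parameters `temperature` and `alpha` are hypothetical, and the clustering step is omitted.

```python
import math

def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples (mean gap after sorting)."""
    assert len(a) == len(b), "this shortcut needs equal-size samples"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def similarity_weights(reference, descriptors, temperature=1.0):
    """Normalised weights that decay with distance from a reference descriptor.

    The exponential kernel and `temperature` are illustrative assumptions.
    """
    raw = [math.exp(-wasserstein_1d(reference, d) / temperature) for d in descriptors]
    total = sum(raw)
    return [r / total for r in raw]

def blend(cluster_models, weights, global_model, alpha=0.5):
    """Weighted average of cluster models, then a convex blend with the global model."""
    dim = len(global_model)
    cluster_avg = [sum(w * m[i] for w, m in zip(weights, cluster_models)) for i in range(dim)]
    return [alpha * c + (1 - alpha) * g for c, g in zip(cluster_avg, global_model)]

# Toy data: clients A and B share a descriptor, client C is shifted by 1.0.
lifetimes = [[0.1, 0.5, 0.9], [0.1, 0.5, 0.9], [1.1, 1.5, 1.9]]
models = [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0]]   # flattened parameter vectors
weights = similarity_weights(lifetimes[0], lifetimes)
personalised = blend(models, weights, global_model=[0.0, 0.0], alpha=0.5)
```

In this toy setup the dissimilar client C receives a smaller weight than A and B, so the personalised model stays closer to the models of topologically similar clients.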
SPRINT: Semi-supervised Prototypical Representation for Few-Shot Class-Incremental Tabular Learning
Authors: Umid Suleymanov, Murat Kantarcioglu, Kevin S Chan, Michael De Lucia, Kevin Hamlen, Latifur Khan, Sharad Mehrotra, Ananthram Swami, Bhavani Thuraisingham
First: 2026-03-04T17:39:52+00:00 · Latest: 2026-03-04T17:39:52+00:00
Comments: Under Review
Abstract
Real-world systems must continuously adapt to novel concepts from limited data without forgetting previously acquired knowledge. While Few-Shot Class-Incremental Learning (FSCIL) is established in computer vision, its application to tabular domains remains largely unexplored. Unlike images, tabular streams (e.g., logs, sensors) offer abundant unlabeled data, scarce expert annotations, and negligible storage costs. Existing vision-based methods, which rely on restrictive buffers, ignore these features. We introduce SPRINT, the first FSCIL framework tailored for tabular distributions. SPRINT introduces a mixed episodic training strategy that leverages confidence-based pseudo-labeling to enrich novel class representations and exploits low storage costs to retain base class history. Extensive evaluation across six diverse benchmarks spanning cybersecurity, healthcare, and ecological domains demonstrates SPRINT's cross-domain robustness. It achieves a state-of-the-art average accuracy of 77.37% (5-shot), outperforming the strongest incremental baseline by 4.45%.
中文标题/摘要
标题:SPRINT:半监督原型表示在少量数据下的少样本类增量表格学习
现实世界系统必须从有限数据中不断适应新的概念,同时不忘记之前获得的知识。尽管少样本类增量学习(FSCIL)在计算机视觉中已经建立,但其在表格领域的应用仍然鲜有探索。与图像不同,表格流(例如日志、传感器)提供了丰富的未标记数据,缺乏专家注释且存储成本极低,这些特征被现有的基于视觉的方法所忽视,这些方法依赖于限制性的缓冲区。我们提出了SPRINT,这是第一个针对表格分布的FSCIL框架。SPRINT 引入了一种混合的分段训练策略,利用基于置信度的伪标签来丰富新的类表示,并利用低存储成本来保留基础类的历史。在跨越网络安全、医疗保健和生态学领域的六个不同基准上的广泛评估表明,SPRINT 具有跨域鲁棒性。它在5-shot下的平均准确率达到77.37%,比最强的增量基线高出4.45%。
Summary / 总结
SPRINT is a semi-supervised prototypical representation framework for Few-Shot Class-Incremental Learning in tabular data, addressing the challenge of continuous adaptation with limited labeled data. It uses a mixed episodic training strategy with confidence-based pseudo-labeling to enhance novel class representations, and it exploits low storage costs to retain base class history. SPRINT achieves an average 5-shot accuracy of 77.37% across six diverse benchmarks, outperforming the strongest incremental baseline by 4.45%.
SPRINT 是一种针对表格数据的Few-Shot Class-Incremental Learning的半监督原型表示框架,旨在解决在有限标注数据下持续适应的挑战。它采用了一种混合 episodic 训练策略,结合置信度为基础的伪标签来增强新类的表示,并由于存储成本低而保留了基础类的历史。SPRINT 在六个不同的基准测试中实现了 77.37% 的平均准确率,比现有方法高出 4.45%。
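The SPRINT summary above mentions confidence-based pseudo-labeling used to enrich novel class representations. The following is a minimal sketch of that general idea with class prototypes, not SPRINT's actual training procedure. It assumes prototypes are per-class feature means, that confidence is a softmax over negative squared distances, and that a fixed `threshold` of 0.9 gates acceptance. All function names and the toy data are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probs(x, prototypes):
    """Softmax over negative squared distances to each class prototype."""
    logits = {c: -sq_dist(x, p) for c, p in prototypes.items()}
    top = max(logits.values())                      # subtract max for stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def pseudo_label(support, unlabeled, threshold=0.9):
    """Accept confident unlabeled points and recompute prototypes with them.

    support: {class_name: [feature vectors]} from the few labeled shots.
    Returns (refined_prototypes, accepted), where accepted is [(vector, label)].
    """
    prototypes = {c: mean_vector(rows) for c, rows in support.items()}
    enriched = {c: list(rows) for c, rows in support.items()}
    accepted = []
    for x in unlabeled:
        probs = class_probs(x, prototypes)
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:               # keep confident predictions only
            enriched[label].append(x)
            accepted.append((x, label))
    refined = {c: mean_vector(rows) for c, rows in enriched.items()}
    return refined, accepted

# Toy example: two labeled classes, three unlabeled points.
support = {"benign": [[0.0, 0.0], [0.2, 0.0]], "attack": [[4.0, 4.0]]}
unlabeled = [[0.1, 0.2], [3.8, 4.1], [2.0, 2.0]]
prototypes, accepted = pseudo_label(support, unlabeled)
```

Here the ambiguous point `[2.0, 2.0]` lies roughly midway between the two prototypes, so its confidence falls below the threshold and it is left unlabeled.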
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
Authors: Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov
First: 2024-12-09T14:34:31+00:00 · Latest: 2026-03-04T17:39:28+00:00
Comments: 20 pages, 6 figures, 9 tables
Abstract
The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the use of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term vs. short-term memory and declarative vs. procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, by conducting experiments with different RL agents, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory, and we show what violating it leads to.
中文标题/摘要
标题:解析强化学习代理中的记忆复杂性:一种分类与评估的方法
将记忆融入代理对于强化学习(RL)领域中的许多任务至关重要。特别是,记忆对于需要使用过去信息、适应新环境和提高样本效率的任务至关重要。然而,“记忆”这一术语涵盖了广泛的概念,加之缺乏统一的方法来验证代理的记忆能力,导致了对代理记忆能力的错误判断,并阻碍了与其他增强记忆的代理进行客观比较。本文旨在通过提供基于认知科学的代理记忆类型的实际精确定义,简化RL中的记忆概念,从而提供不同的代理记忆类别,提出评估RL代理记忆能力的稳健实验方法,并标准化评估。此外,通过使用不同的RL代理进行实验,我们实证展示了在评估不同类型的代理记忆时遵循提议方法的重要性,以及违反该方法会导致的问题。
Summary / 总结
This paper addresses the complexity of memory in Reinforcement Learning (RL) agents by defining different types of memory such as long-term vs. short-term and declarative vs. procedural memory. It proposes a standardized methodology for evaluating memory capabilities and demonstrates through experiments the importance of adhering to this methodology for accurate evaluations of memory types in RL agents.
本文通过借鉴认知科学定义了不同类型的记忆,如长期记忆与短期记忆、陈述性记忆与程序性记忆,来解决强化学习(RL)代理中的记忆复杂性问题。它提出了一种标准化的评估方法,并通过实验展示了严格遵循此方法对于准确评估不同类型记忆在RL代理中的重要性是必不可少的。
SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care
Authors: Dongshen Peng, Yi Wang, Austin Schoeffler, Carl Preiksaitis, Christian Rose
First: 2026-01-23T08:01:39+00:00 · Latest: 2026-03-04T17:38:45+00:00
Comments: 11 pages, 5 figures
Abstract
Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0% to 100%. Models showed higher vulnerability to imaging requests (38.8%) than opioid prescriptions (25.0%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0-36.0%), indicating general susceptibility rather than tactic-specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification.
中文标题/摘要
标题:SycoEval-EM:模拟急诊护理中患者施压的大型语言模型奉承性评估
大型语言模型(LLMs)在临床决策支持方面显示出潜力,但存在因患者压力而接受不适当护理的风险。我们引入了SycoEval-EM,这是一种多智能体模拟框架,通过急诊医学中的敌对患者说服来评估LLMs的鲁棒性。在20个LLMs和1,875次涉及三个“选择明智”场景的互动中,奉承率从0%-100%不等。模型在影像学请求(38.8%)方面的脆弱性高于阿片类药物处方(25.0%),模型能力与鲁棒性之间关系不佳。所有说服策略的有效性相同(30.0%-36.0%),表明普遍易感而非策略特定的弱点。我们的研究结果表明,静态基准无法准确预测在社会压力下的安全性,需要进行多轮对抗性测试以认证临床AI。
Summary / 总结
The study introduces SycoEval-EM, a simulation framework to evaluate the robustness of large language models (LLMs) in emergency medicine by simulating adversarial patient persuasion. Across 20 LLMs and 1,875 simulated clinical encounters, the study found that acquiescence rates varied widely, with models more likely to comply with imaging requests than opioid prescriptions. The study also found that all persuasion tactics were equally effective, indicating a general susceptibility to social pressure rather than specific weaknesses. The findings suggest that static benchmarks are insufficient for ensuring safety under social pressure and advocate for multi-turn adversarial testing in clinical AI certification.
研究引入了SycoEval-EM,通过模拟对抗性患者说服来评估大型语言模型(LLMs)在急诊医学中的稳健性。在20个LLM和1,875次模拟对话中,服从率差异很大,模型对影像学检查请求的脆弱性高于对阿片类药物处方的脆弱性。研究发现,所有说服策略的效果相当,表明普遍的易感性而非特定策略的弱点。研究结果表明,静态基准不足以评估在社会压力下的安全性,强调了临床AI认证中需要进行多轮对抗性测试的必要性。
World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
Authors: Elan Barenholtz
First: 2026-03-04T17:37:05+00:00 · Latest: 2026-03-04T17:37:05+00:00
Comments: 12 pages, 3 figures, 3 tables
Abstract
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
中文标题/摘要
标题:没有世界模型的世界属性:从静态词嵌入共现统计中恢复空间和时间结构
近期的研究将大型语言模型(LLM)隐藏状态中地理和时间变量的线性可恢复性解释为世界内部表示的证据。我们测试了一个更简单的可能性:许多相关结构已经隐含在文本本身中。应用相同的岭回归探针到静态共现基于的嵌入(GloVe和Word2Vec),我们发现显著的可恢复地理信号和较弱但可靠的时序信号,外留的R²值分别为城市坐标0.71-0.87和历史出生年份0.48-0.52。语义邻域分析和目标子空间消融表明,这些信号强烈依赖于可解释的词汇梯度,尤其是国家名称和气候相关词汇。这些发现表明,普通的词共现保留了比通常假设的更丰富的空间、时间和环境结构,揭示了简单的静态嵌入从文本中保留世界形状结构的惊人且未被充分认识的能力。因此,仅线性探针可恢复性并不能证明超越文本的表示迁移。
Summary / 总结
The study investigates whether the recoverable geographic and temporal information from large language model hidden states can be attributed to the inherent structure in text itself, using static word embeddings (GloVe and Word2Vec). The research finds substantial recoverable geographic signals with R^2 values of 0.71-0.87 for city coordinates and weaker but reliable temporal signals with R^2 values of 0.48-0.52 for historical birth years. The signals are shown to depend on lexical gradients, particularly country names and climate-related vocabulary, suggesting that simple static embeddings can preserve rich spatial, temporal, and environmental structure from text alone.
研究探讨了从大型语言模型隐藏状态中恢复地理和时间信息的原因是由于内部的世界模型还是文本本身固有的。通过使用基于共现的静态嵌入(GloVe和Word2Vec)和应用岭回归探针,研究发现存在显著可恢复的地理信号(R^2值为0.71-0.87)和较弱但可靠的时序信号(R^2值为0.48-0.52)。研究结果表明,简单的静态嵌入可以保存丰富的空间、时间和环境结构,暗示线性探针恢复能力本身并不能表明超越文本的表征能力。
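The entry above reports held-out R^2 values for ridge regression probes on static embeddings. As a rough illustration of what that measurement involves, the sketch below fits a closed-form ridge probe and scores it on held-out items. It is not the paper's code: the tiny 2-D "embeddings", the synthetic target, and the names `solve`, `fit_ridge`, and `r_squared` are all assumptions for illustration. For brevity the bias column is penalised along with the other weights.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_ridge(X, y, lam):
    """Closed-form ridge weights: w = (X^T X + lam * I)^{-1} X^T y."""
    d = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)

def r_squared(X, y, w):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def target(row):
    """Synthetic 'latitude' that is exactly linear in the toy embedding."""
    return 3.0 * row[0] - 2.0 * row[1] + 10.0

# Each row is [e0, e1, 1.0]; the trailing 1.0 is a bias feature.
train_X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0],
           [2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
test_X = [[3.0, 1.0, 1.0], [1.0, 3.0, 1.0], [2.0, 2.0, 1.0]]
train_y = [target(r) for r in train_X]
test_y = [target(r) for r in test_X]

weights = fit_ridge(train_X, train_y, lam=1e-6)
r2_held_out = r_squared(test_X, test_y, weights)
```

Because R^2 is computed only on items the probe never saw during fitting, a high value indicates that the target is linearly decodable from the embedding rather than memorised.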
ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems
Authors: Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-10-08T15:50:34+00:00 · Latest: 2026-03-04T17:31:48+00:00
Comments: 31 pages, 15 figures, 8 tables
Abstract
Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through a Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines, achieving the best success rate on 21 out of 23 tasks and improving the aggregate success rate across all tasks by about 70% over the previous best baseline. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability. Code and project page: https://elmur-paper.github.io/.
中文标题/摘要
标题:ELMUR:外部层记忆与更新/重写以应对长期 horizon 的 RL 问题
现实世界中的机器人代理必须在部分可观测性和长期 horizon 下行动,其中关键线索可能在影响决策之前很久就会出现。然而,大多数现代方法仅依赖瞬时信息,而不结合过去的见解。标准递归或变换模型难以保留和利用长期依赖关系:上下文窗口截断历史,而简单的记忆扩展在规模和稀疏性下失效。我们提出了 ELMUR(外部层记忆与更新/重写),这是一种具有结构化外部记忆的变换架构。每一层都维护着记忆嵌入,通过双向交叉注意力与它们交互,并通过一个基于最近最少使用(LRU)的记忆模块进行更新,使用替换或凸融合。ELMUR 将有效 horizon 延长了 100,000 倍以上,并在合成 T-Maze 任务中实现了 100% 的成功率,该任务的走廊长度可达一百万步。在 POPGym 中,它在超过一半的任务中优于基线。在 MIKASA-Robo 稀疏奖励操作任务中,它几乎将强基线的性能翻倍,实现了 23 个任务中的 21 个任务的最佳成功率,并将所有任务的综合成功率提高了约 70% 以上,超过了之前的最佳基线。这些结果表明,结构化、层局部外部记忆提供了一种简单且可扩展的方法来应对部分可观测性下的决策。
Summary / 总结
ELMUR is designed to address long-horizon reinforcement learning problems by incorporating external memory with update/rewrite mechanisms. It extends the effective horizon up to 100,000 times beyond the attention window and achieves high success rates on various tasks, including synthetic T-Maze, POPGym, and MIKASA-Robo. ELMUR outperforms baseline methods in these tasks, particularly in sparse-reward manipulation tasks with visual observations, where it nearly doubles the performance and improves the aggregate success rate by about 70% over the previous best baseline.
ELMUR 通过引入具有更新/重写机制的外部记忆来解决长期 horizons 的强化学习问题。它将有效 horizon 延长了 100,000 倍以上,并在合成 T-Maze、POPGym 和 MIKASA-Robo 等任务中实现了高成功率,特别是在具有视觉观察的稀疏奖励操作任务中,ELMUR 的性能几乎提高了两倍,并将整体成功率提高了约 70% 以上,超过了之前的最佳基线。
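The ELMUR abstract above describes a Least Recently Used (LRU) memory module that updates slots by replacement or convex blending. The sketch below illustrates that update rule on plain Python lists. It is a simplified assumption-laden toy, not the ELMUR architecture: recency is tracked by last write step, empty slots are filled first, and the class name `LRUMemory` and the `blend` parameter are hypothetical. Cross-attention reads are omitted entirely.

```python
class LRUMemory:
    """Fixed-size slot memory with least-recently-used overwrite."""

    def __init__(self, num_slots, blend=None):
        self.slots = [None] * num_slots        # None marks an unused slot
        self.last_written = [-1] * num_slots   # step of the most recent write
        self.blend = blend                     # None: replace; else lambda in (0, 1)

    def write(self, vector, step):
        """Store `vector`, returning the index of the slot that was updated."""
        if None in self.slots:
            # Fill empty slots before overwriting anything.
            idx = self.slots.index(None)
            self.slots[idx] = list(vector)
        else:
            # Pick the slot whose last write is oldest.
            idx = min(range(len(self.slots)), key=lambda i: self.last_written[i])
            if self.blend is None:
                self.slots[idx] = list(vector)             # replacement
            else:
                lam = self.blend
                old = self.slots[idx]
                # Convex blend: lam * new + (1 - lam) * old
                self.slots[idx] = [lam * n + (1 - lam) * o
                                   for n, o in zip(vector, old)]
        self.last_written[idx] = step
        return idx

memory = LRUMemory(num_slots=2, blend=0.5)
memory.write([1.0, 0.0], step=0)               # fills slot 0
memory.write([0.0, 1.0], step=1)               # fills slot 1
updated = memory.write([3.0, 2.0], step=2)     # slot 0 is oldest, gets blended
# memory.slots[0] is now [2.0, 1.0]
```

Replacement discards the old slot content outright, while blending keeps a fraction of it, which trades freshness against retention of older information.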
GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data
Authors: Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha
First: 2025-10-10T17:36:14+00:00 · Latest: 2026-03-04T17:26:02+00:00
Comments: Camera-ready version. Published in Transactions on Machine Learning Research (TMLR), 2026. Reviewed on OpenReview: https://openreview.net/forum?id=tnXSdDhvqc
Abstract
Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades. A marriage of the neural and symbolic components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side of the problem. However, automatically deriving reliable KGs from text corpora remains an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline.
中文标题/摘要
标题:GraphMERT:从非结构化数据中高效且可扩展地提炼可靠的知识图谱
研究人员近三十年来一直在追求神经符号人工智能(AI)应用。神经和符号组件的结合可以推动AI的快速进步。然而,由于大多数神经符号AI框架无法扩展,该领域尚未实现这一承诺。此外,纯神经方法的隐式表示和近似推理限制了可解释性和可信度。知识图谱(KGs),一种标准的显式语义知识表示,可以解决符号问题的一侧。然而,从文本语料库中自动推导出可靠的KGs仍然是一个开放问题。我们通过引入GraphMERT,一种小型图形编码器模型,解决了这些挑战,该模型可以从非结构化文本语料库及其内部表示中提炼高质量的KGs。GraphMERT及其等效KG形式一个模块化的神经符号堆栈:神经学习抽象;符号KGs进行可验证推理。GraphMERT + KG是第一个高效且可扩展的神经符号模型,不仅在基准测试中达到最先进的准确率,而且在符号表示方面也优于基线。具体而言,我们针对的是可靠的专业领域KGs,它们既是(1)事实性的(有来源)又是(2)有效的(与领域适当语义一致的关系)。当大型语言模型(LLM),例如Qwen3-32B,生成专业领域KGs时,由于提示敏感性、浅薄的专业知识和虚构的关系,其可靠性不足。在来自PubMed糖尿病论文的文本上,我们80M参数的GraphMERT生成了一个KG,其FActScore为69.8%;而32B参数的基线LLM生成的KG仅达到40.2%的FActScore。GraphMERT KG的有效性得分也更高,为68.8%,而基线LLM的得分仅为43.0%。
Summary / 总结
GraphMERT is a small graphical encoder-only model designed to distill high-quality knowledge graphs (KGs) from unstructured text corpora and its own internal representations. It addresses the challenge of deriving reliable KGs and forms a modular neurosymbolic stack with symbolic KGs for verifiable reasoning. On domain-specific KGs derived from PubMed papers on diabetes, the 80M-parameter GraphMERT outperforms a 32B-parameter baseline large language model (LLM) on both FActScore (69.8% vs. 40.2%) and ValidityScore (68.8% vs. 43.0%).
GraphMERT 是一种小型图形编码器模型,旨在从非结构化文本中提炼高质量的知识图谱(KGs)。它通过结合神经学习和符号KGs进行可验证推理来解决神经符号AI中的可扩展性和解释性挑战。实验结果表明,GraphMERT 在生成糖尿病相关的可靠领域特定 KGs 方面优于一个 32B 参数的基线大型语言模型(LLM),在 FActScore 和 ValidityScore 上分别达到了 69.8% 和 68.8%,而 LLM 分别仅为 40.2% 和 43.0%。