arXiv 论文速递

2026-02-22 03:33
Snapshot: 20260222_0333
OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Authors: Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan
First: 2026-02-19T18:59:54+00:00 · Latest: 2026-02-19T18:59:54+00:00
Abstract
Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
中文标题/摘要
标题:OpenEarthAgent:统一的工具增强地理空间代理框架
近期在多模态推理方面的进展使代理能够解释图像、将其与语言关联起来并执行结构化分析任务。将这些能力扩展到遥感领域仍然具有挑战性,因为模型必须在保持连贯的多步逻辑的同时,在空间尺度、地理结构和多光谱指数之间进行推理。为弥合这一差距,OpenEarthAgent 引入了一个统一框架,用于开发基于卫星图像、自然语言查询和详细推理轨迹训练的工具增强地理空间代理。训练管道依赖于结构化推理轨迹的监督微调,使模型与跨多种分析上下文的验证多步工具交互保持一致。伴随的语料库包括14,538个训练实例和1,169个评估实例,训练集包含超过100K个推理步骤,评估集包含超过7K个推理步骤。它涵盖了城市、环境、灾害和基础设施领域,并结合了基于GIS的操作和如NDVI、NBR和NDBI等指数分析。基于显式的推理轨迹,学习到的代理展示了结构化的推理、稳定的空间理解和通过工具驱动的地理空间交互在多种条件下的可解释行为。我们报告了相对于强大基线的一致改进,并且在与最近的开源和闭源模型相比时表现出竞争力。
Summary / 总结
OpenEarthAgent is a unified framework for developing tool-augmented geospatial agents that can interpret satellite imagery and natural-language queries. It uses a supervised fine-tuning approach on structured reasoning trajectories to align models with verified multi-step tool interactions. The framework includes a large dataset with over 100K reasoning steps for training and more than 7K for evaluation, covering various domains. The learned agent shows structured reasoning, stable spatial understanding, and interpretable behavior, with consistent improvements over a strong baseline and competitive performance compared to recent models.
OpenEarthAgent 是一个统一框架,用于开发工具增强的地理空间代理,使其能够解释卫星图像和自然语言查询,并执行结构化的分析任务。该框架在带有详细推理轨迹的大规模数据集上进行监督微调,使模型能够处理空间尺度、地理结构和多光谱指数。所得代理展示了结构化的推理、稳定的空间理解和可解释的行为,覆盖多种领域和条件,相对强基线取得一致改进,并与近期模型相比具有竞争力。
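The index analyses named in the OpenEarthAgent abstract (NDVI, NBR, NDBI) are all normalized band differences. The following is a minimal sketch using their standard textbook definitions; the function names and the sample reflectance values are illustrative assumptions and are not taken from the paper or its toolset.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def spectral_indices(red, nir, swir1, swir2):
    """Compute three common indices from surface-reflectance band values."""
    return {
        # Vegetation: high when NIR reflectance dominates red.
        "NDVI": normalized_difference(nir, red),
        # Burn ratio: drops sharply over burned areas.
        "NBR": normalized_difference(nir, swir2),
        # Built-up index: high when SWIR1 exceeds NIR.
        "NDBI": normalized_difference(swir1, nir),
    }


# Hypothetical reflectance values for a single vegetated pixel.
indices = spectral_indices(red=0.1, nir=0.5, swir1=0.3, swir2=0.2)
```

In a real pipeline the same arithmetic would be applied per pixel over raster arrays; scalars are used here only to keep the sketch dependency-free.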
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Authors: Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide
First: 2026-02-19T18:59:44+00:00 · Latest: 2026-02-19T18:59:44+00:00
Comments: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at https://hipe-eval.github.io/HIPE-2026/
Abstract
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations across multiple languages and time periods. Systems are asked to classify relations of two types, $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?"), which requires reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
中文标题/摘要
标题:CLEF HIPE-2026:从多语言历史文本中准确高效地提取人物地点关系
HIPE-2026 是一个CLEF评估实验室,专注于从嘈杂的多语言历史文本中提取人物地点关系。在HIPE-2020和HIPE-2022活动的基础上,它将系列扩展到语义关系提取,目标是在多个语言和时期识别人物与地点的关联。系统被要求分类两种类型的关系——$at$(“这个人是否曾经在该地点?”)和$isAt$(“在出版时间前后,这个人是否位于该地点?”),需要对时间和地理线索进行推理。该实验室引入了三方面的评估标准,共同评估准确性、计算效率和领域泛化能力。通过将关系提取与大规模历史数据处理联系起来,HIPE-2026旨在支持知识图谱构建、历史传记重建和数字人文中的空间分析等下游应用。
Summary / 总结
HIPE-2026 is an evaluation lab under CLEF that focuses on extracting person-place relations from multilingual historical texts. Systems are evaluated on their ability to classify two types of relations, $at$ and $isAt$, which require understanding temporal and geographical contexts. The lab introduces a three-fold evaluation profile that measures accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support applications such as knowledge-graph construction and historical biography reconstruction.
HIPE-2026 是 CLEF 下的一个评估实验室,专注于从多语言历史文本中提取人物地点关系。系统需要分类两种类型的关系,$at$ 和 $isAt$,这需要理解时间与地理背景。该实验室引入了一个综合评估框架,衡量准确率、计算效率和领域泛化能力。通过处理大规模历史数据,HIPE-2026 支持知识图谱构建和历史传记重建等应用。
When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this problem systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
中文标题/摘要
标题:视觉优先于语言:评估和缓解VLAs中的反事实失败
视觉-语言-行动模型(VLAs)承诺将语言指令应用于机器人控制,但在实践中往往未能忠实执行语言指令。当面对缺乏强烈场景特定监督的指令时,VLAs会遭受反事实失败:它们基于由数据集偏差引起的视觉捷径行动,反复执行在训练中频繁出现的行为,并选择在训练期间频繁出现的对象,而不管语言意图如何。为了系统地研究这一问题,我们引入了LIBERO-CF,这是第一个用于VLAs的反事实基准,通过在视觉上合理的LIBERO布局下分配替代指令来评估语言跟随能力。我们的评估表明,反事实失败在最先进的VLAs中普遍存在但尚未得到充分探索。我们提出了反事实行动指导(CAG),这是一种简单而有效的双分支推理方案,明确地在VLAs中正则化语言条件。CAG结合了一个标准的VLA策略和一个未受语言条件的视觉-行动(VA)模块,在行动选择期间进行反事实比较。这种设计减少了对视觉捷径的依赖,提高了对未观察任务的鲁棒性,并且无需额外演示或修改现有架构或预训练模型。广泛的实验表明,它可以在各种VLAs中实现即插即用集成,并且具有持续改进。例如,在LIBERO-CF中,CAG在语言跟随准确性上提高了9.7%,在未观察任务上的任务成功率提高了3.6%,使用无训练策略,配以VA模型时,进一步提高了15.5%和8.5%。在实际应用中,CAG将反事实失败减少了9.4%,平均提高了任务成功率17.2%。
Summary / 总结
The paper addresses the issue of counterfactual failures in Vision-Language-Action models (VLAs), where models act based on visual biases rather than language instructions. It introduces LIBERO-CF, a benchmark for evaluating language following capability, and proposes Counterfactual Action Guidance (CAG), a dual-branch inference scheme that improves robustness and reduces visual shortcut reliance. Experiments show CAG enhances language following accuracy and task success, particularly on under-observed tasks, with consistent improvements across various VLAs in both simulated and real-world settings.
论文针对Vision-Language-Action模型(VLAs)中的反事实失败问题,即模型基于视觉偏见而非语言指令行动。作者引入了LIBERO-CF基准,通过在视觉上合理的场景下提供替代指令来评估语言跟随能力。作者提出了一种名为Counterfactual Action Guidance(CAG)的双分支推理方案,该方案增强了语言条件性并减少了视觉捷径,从而提高了在未观察任务上的鲁棒性。实验表明,CAG在语言跟随准确性和任务成功率上取得了显著改进,特别是在未观察任务上,且在各种VLAs的模拟和实际环境中都表现出一致的提升。
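The CAG abstract describes comparing a language-conditioned VLA branch against a language-unconditioned VA branch during action selection, but it does not give the combination rule. The sketch below assumes a rule in the style of classifier-free guidance, which pushes the action away from what vision alone would suggest; the function name, the `guidance_weight` parameter, and the formula itself are assumptions for illustration, not the authors' method.

```python
def guided_action(vla_action, va_action, guidance_weight=1.0):
    """Extrapolate from the vision-only action toward the language-conditioned one.

    ASSUMPTION: guided = vla + w * (vla - va), borrowed from classifier-free
    guidance. The paper's actual rule may differ.
    """
    if len(vla_action) != len(va_action):
        raise ValueError("action vectors must have the same length")
    return [
        conditioned + guidance_weight * (conditioned - unconditioned)
        for conditioned, unconditioned in zip(vla_action, va_action)
    ]


# Hypothetical 2-D actions: the VA branch is biased toward a familiar object.
action = guided_action([1.0, 0.0], [0.5, 0.0], guidance_weight=2.0)
```

With `guidance_weight=0.0` the function returns the plain VLA action, so under this assumed rule the weight controls how strongly the vision-only shortcut is subtracted out.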
Human-level 3D shape perception emerges from multi-view learning
Authors: Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa
First: 2026-02-19T18:56:05+00:00 · Latest: 2026-02-19T18:56:05+00:00
Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
中文标题/摘要
标题:多视角学习中的人类级3D形状感知
人类可以从二维视觉输入中推断出物体的三维结构。模拟这种能力一直是视觉智能科学与工程的长期目标,但几十年来,计算方法仍未达到人类的性能。我们开发了一种建模框架,可以直接从实验刺激中预测任意物体的人类3D形状推断。我们使用一种新颖的神经网络类,通过自然感官数据的空间视觉目标进行训练;给定自然场景中不同位置拍摄的一组图像,这些模型能够学习预测与这些图像相关的空间信息,如相机位置和视觉深度,而无需依赖任何与物体相关的归纳偏置。值得注意的是,这些视觉-空间信号类似于人类可轻易获得的感官线索。我们设计了一种零样本评估方法来确定这些“多视角”模型在一项成熟的3D感知任务中的性能,然后将模型行为与人类行为进行比较。我们的建模框架是第一个在无需特定任务训练或微调的情况下达到人类3D形状推断准确性的框架。令人惊讶的是,模型响应的独立读数可以预测人类行为的细微差别,包括错误模式和反应时间,揭示了模型动态与人类感知之间的自然对应关系。综上所述,我们的研究结果表明,人类级的3D感知可以从自然视觉-空间数据上的简单可扩展学习目标中涌现出来。所有用于重现我们研究结果的代码、人类行为数据和实验刺激都可以在我们的项目页面上找到。
Summary / 总结
The research aims to model human ability to infer 3D shapes from 2D images, a longstanding challenge in visual intelligence. The method involves training neural networks on multi-view images from natural scenes to predict spatial information like camera location and visual depth. Key findings show that these models match human accuracy in 3D shape inference without task-specific training, and their responses correlate with human error patterns and reaction times, suggesting a natural correspondence between model dynamics and human perception.
该研究旨在模拟人类从2D图像中推断3D形状的能力,这是一个视觉智能领域的长期挑战。研究人员开发了一种神经网络,该网络基于自然场景中的多视角图像进行训练,学习空间信息而不依赖于对象特定的先验知识。该模型在3D形状推断方面的准确度与人类相当,并预测了人类的错误模式和反应时间,表明模型动态与人类感知之间存在自然对应关系。这表明,人类级别的3D感知可以从自然视觉-空间数据的学习中自然地涌现出来。
Multi-Round Human-AI Collaboration with User-Specified Requirements
Authors: Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas
First: 2026-02-19T18:54:34+00:00 · Latest: 2026-02-19T18:54:34+00:00
Abstract
As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
中文标题/摘要
标题:多轮人机协作与用户指定要求
随着人类越来越多地依赖多轮对话AI进行高风险决策,需要有原则性的框架来确保此类交互能够可靠地提高决策质量。我们采取以人为中心的观点,遵循两个原则:反事实伤害,确保AI不削弱人类的优势;互补性,确保AI在人类容易出错的地方增加价值。我们通过用户定义的规则形式化这些概念,允许用户明确指定其特定任务中的伤害和互补性含义。然后,我们引入了一个在线的、无分布假设的算法,具有有限样本保证,该算法在协作动态中强制执行用户指定的约束。我们在两个交互设置中评估了我们的框架:模拟大型语言模型在医疗诊断任务上的合作和人类众包研究在图像推理任务上的合作。我们证明,即使在非平稳交互动态下,我们的在线程序也能保持规定的反事实伤害和互补性违反率。此外,收紧或放松这些约束会产生可预测的人类下游准确性变化,证实了这两个原则作为实用杠杆的作用,可以引导多轮合作向更好的决策质量发展,而无需建模或约束人类行为。
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
中文标题/摘要
标题:IntRec:基于意图的对比精炼检索
从复杂场景中检索用户指定的对象仍然是一个具有挑战性的任务,尤其是在查询模糊或涉及多个相似对象的情况下。现有的开放式词汇检测器以单次操作的方式工作,缺乏根据用户反馈精炼预测的能力。为了解决这个问题,我们提出了IntRec,这是一种基于用户反馈进行预测精炼的交互式对象检索框架。其核心是一个意图状态(IS),它维护了正锚点(确认的线索)和负约束(拒绝的假设)的双重记忆集。对比对齐函数通过最大化与正线索的相似性并惩罚被拒绝的对象来对候选对象进行排名,从而在杂乱的场景中实现细粒度的消歧。我们的交互式框架在不增加额外监督的情况下显著提高了检索准确性。在LVIS上,IntRec达到了35.4 AP,分别优于OVMR、CoDet和CAKE 2.3、3.7和0.5。在具有挑战性的LVIS-模糊基准上,它在单次纠正反馈后提高了7.9 AP的性能,每次交互的额外延迟少于30毫秒。
Summary / 总结
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. It uses an Intent State maintaining positive anchors and negative constraints to rank candidate objects through a contrastive alignment function. On LVIS, IntRec achieves 35.4 AP, outperforming existing methods by up to +3.7. It also shows significant improvement on the LVIS-Ambiguous benchmark with a single feedback, adding less than 30 ms of latency per interaction.
IntRec 是一个交互式对象检索框架,旨在处理复杂场景中的模糊查询和多个相似对象。它通过意图状态维护正锚和负约束,并使用对比对齐函数根据用户反馈进行预测细化。在 LVIS 上,IntRec 达到 35.4 AP,比现有方法最多高出 3.7 AP;在 LVIS-Ambiguous 基准测试中,单次反馈后性能比其单次基线提高了 7.9 AP,每次交互的额外延迟不到 30 毫秒。
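The IntRec abstract describes ranking candidates by similarity to confirmed positive cues while penalizing similarity to rejected hypotheses. A minimal sketch of one such scoring rule follows; the max-over-memory aggregation, the `penalty` weight, and all names are assumptions made for illustration, since the abstract does not specify the exact alignment function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intent_score(candidate, positives, negatives, penalty=1.0):
    """ASSUMED rule: best match to a positive anchor minus penalized best match to a negative."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank_candidates(candidates, positives, negatives, penalty=1.0):
    """Return candidate names sorted from best to worst score."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives, penalty),
        reverse=True,
    )


# Hypothetical 2-D embeddings: the user rejected the right-hand mug.
candidates = {"mug_right": [0.0, 1.0], "mug_left": [1.0, 0.0]}
ranking = rank_candidates(candidates, positives=[[0.9, 0.1]], negatives=[[0.0, 1.0]])
```

Here the rejected embedding coincides with `mug_right`, so its score is pushed below that of `mug_left` after one round of feedback.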
CORAL: Correspondence Alignment for Improved Virtual Try-On
Authors: Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam, Tongmin Kim, Jongjae Park, Hyeonwoo Kang, Seungryong Kim
First: 2026-02-19T18:50:12+00:00 · Latest: 2026-02-19T18:50:12+00:00
Comments: 32 pages, 25 figures
Abstract
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
中文标题/摘要
标题:CORAL: 对应对齐以改进虚拟试穿
现有的虚拟试穿(VTON)方法往往难以保留细部服装细节,特别是在需要准确的人-服装对应关系的非配对设置中。这些方法没有明确地强制执行人-服装对齐,并且无法解释对应关系如何在扩散变换器(DiTs)中出现。在本文中,我们首先分析了基于DiT架构的全3D注意力,并揭示出人-服装对应关系的关键依赖于全3D注意力中精确的人-服装查询-键匹配。基于这一洞察,我们随后引入了CORrespondence ALignment(CORAL),这是一种基于DiT的框架,明确地将查询-键匹配与稳健的外部对应关系对齐。CORAL结合了两个互补的组件:一个对应关系蒸馏损失,将可靠的匹配与人-服装注意力对齐,以及一个熵最小化损失,使注意力分布更加清晰。我们还提出了一种基于VLM的评估协议,以更好地反映人类偏好。CORAL在基准之上始终表现出改进,增强了全局形状转移和局部细节保留。广泛的消融实验验证了我们的设计选择。
Summary / 总结
This paper addresses the challenge of preserving fine garment details in Virtual Try-On (VTON) by introducing CORAL, a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. The method includes a correspondence distillation loss and an entropy minimization loss to enhance attention distribution. Experimental results show that CORAL improves both global shape transfer and local detail preservation over the baseline methods.
论文通过分析Diffusion Transformers (DiTs)中人物-服装查询-键匹配的关键作用,解决了虚拟试穿(VTON)中精细服装细节保留的问题。提出了CORAL框架,通过对应关系蒸馏损失和熵最小化损失明确对齐查询-键匹配与稳健的外部对应关系。实验结果表明,CORAL在全局形状转移和局部细节保留方面均优于基线方法。
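CORAL's entropy minimization loss is described as sharpening the person-garment attention distribution. The sketch below shows only the quantity such a loss would reduce, the Shannon entropy of a softmax attention row; how the paper weights or masks this term is not stated in the abstract, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits):
    """Shannon entropy (in nats) of the attention distribution for one query."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A diffuse attention row versus a sharply peaked one (hypothetical logits).
diffuse = attention_entropy([0.0, 0.0, 0.0, 0.0])
peaked = attention_entropy([10.0, 0.0, 0.0, 0.0])
```

A uniform row over four keys has the maximum entropy log(4); driving the entropy down concentrates each query on fewer garment keys, which is the sharpening effect the abstract refers to.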
SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Authors: Nathan S. de Lara, Florian Shkurti
First: 2026-02-19T18:47:31+00:00 · Latest: 2026-02-19T18:47:31+00:00
Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
中文标题/摘要
标题:SMAC:分数匹配的演员评论家算法以实现稳健的离线到在线转移
现代离线强化学习(RL)方法能够找到表现良好的演员评论家,然而,使用基于值的RL算法在线微调这些演员评论家通常会导致性能立即下降。我们提供了证据支持以下假设:在损失景观中,先前算法的离线最大值与在线最大值之间被低性能的山谷隔开,基于梯度的微调会穿越这些山谷。基于此,我们提出了分数匹配的演员评论家(SMAC),这是一种离线RL方法,旨在学习在不降低性能的情况下过渡到在线基于值的RL算法的演员评论家。SMAC通过在离线阶段正则化Q函数来避免离线和在线最大值之间的山谷,使其满足策略得分(score)与Q函数动作梯度之间的一阶导数等式。实验表明,SMAC收敛到的离线最大值可经由一阶优化找到的奖励单调递增路径,连接到更优的在线最大值。SMAC在全部6个D4RL任务中平滑过渡到Soft Actor-Critic和TD3。在6个环境中的4个上,其遗憾值(regret)比最佳基线降低34-58%。
Summary / 总结
The research aims to address the issue of performance drops when fine-tuning offline-trained actor-critics with online value-based RL algorithms. The method, Score Matched Actor-Critic (SMAC), regularizes the Q-function during the offline phase to ensure a connection between offline and online maxima, avoiding low-performance valleys. Experiments show that SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in all six D4RL tasks and reduces regret by 34-58% in four out of six environments compared to the best baseline.
研究旨在解决使用在线基于值的RL算法微调离线训练的actor-critic时性能下降的问题。提出的Score Matched Actor-Critic (SMAC)方法在离线阶段正则化Q函数,以避开损失景观中的低性能山谷,从而平滑过渡到在线RL算法。实验表明,SMAC在全部六个D4RL任务中实现了平滑过渡,并在其中四个环境中将遗憾值(regret)比最佳基线降低34-58%。
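The SMAC abstract refers to a first-order equality between the score of the policy and the action-gradient of the Q-function. One setting in which such an equality holds exactly is a Boltzmann (maximum-entropy) policy. The derivation below is a sketch under that assumption; the temperature $\alpha$, the normalizer $Z(s)$, and the penalty $\mathcal{L}_{\text{match}}$ are illustrative and are not taken from the paper.

```latex
% ASSUMPTION: Boltzmann policy with temperature \alpha > 0.
\pi(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)},
\qquad
Z(s) = \int \exp\big(Q(s,a')/\alpha\big)\, da'

% Z(s) does not depend on a, so differentiating \log \pi with respect to a gives
\nabla_a \log \pi(a \mid s) = \frac{1}{\alpha}\, \nabla_a Q(s,a)

% A hypothetical penalty that encourages this equality on dataset pairs (s,a):
\mathcal{L}_{\text{match}}
  = \mathbb{E}_{(s,a) \sim \mathcal{D}}
    \Big[ \big\| \nabla_a Q(s,a) - \alpha\, \nabla_a \log \pi(a \mid s) \big\|^2 \Big]
```

Under this assumption, a Q-function whose action-gradient agrees with the policy score is already consistent with the kind of entropy-regularized update used by Soft Actor-Critic, which gives one intuition for why online fine-tuning would not need to move far from the offline solution.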
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, geographically dispersed, and privacy-sensitive data streams, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, produced by a frozen vision-language model (VLM) at each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) incrementally arriving tasks require retraining the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
中文标题/摘要
标题:具有灾难性遗忘鲁棒性的单次增量联邦学习
现代大数据系统生成大量异构且地理上分散的流数据,规模庞大且隐私敏感,使得集中化变得困难。虽然联邦学习(FL)提供了一种增强隐私的训练机制,但它假设静态数据流,并在多轮次中学习协作模型,这使得在通信受限场景中处理增量数据的学习变得具有挑战性。本文提出了单次增量联邦学习(OSI-FL),这是第一个解决通信开销和灾难性遗忘双重挑战的FL框架。OSI-FL通过在单次通信轮次中由每个客户端的冻结视觉-语言模型(VLM)生成类别特定的嵌入,然后由服务器端预训练的扩散模型合成与客户端数据分布相似的新数据样本,这些合成样本在服务器端用于训练。然而,仍存在两个挑战:i) 任务以增量方式到达需要重新训练全局模型,ii) 随着未来任务的到达,重新训练模型会导致灾难性遗忘。为此,我们通过选择性样本保留(SSR)增强训练,该方法基于样本损失识别并保留每个类别和任务对中最信息丰富的top-p个样本。SSR通过确保代表性保留样本在后续迭代中被纳入训练来限制遗忘。实验结果表明,OSI-FL在三个基准数据集上的类增量和领域增量场景中均优于基线方法,包括传统的和单次的FL方法。
Summary / 总结
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), which communicates category-specific embeddings from clients to the server in a single round. The server uses a pre-trained diffusion model to synthesize new data and trains the global model. To mitigate catastrophic forgetting, the paper proposes Selective Sample Retention (SSR), which retains the most informative samples per category and task. Experiments show that OSI-FL outperforms traditional and one-shot FL approaches in class-incremental and domain-incremental scenarios across three benchmark datasets.
本文解决了增量数据下联邦学习中通信开销和灾难性遗忘的挑战。提出了单次增量联邦学习(OSI-FL),该方法在单个通信轮次中从每个客户端传输类别特定的嵌入,服务器据此合成新数据进行训练。为了缓解灾难性遗忘,作者提出了选择性样本保留(SSR),该方法根据样本损失保留每个类别和任务对中最信息性的样本。实验结果显示,OSI-FL 在三个基准数据集上的类增量和域增量场景中均优于传统和单次联邦学习方法。
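Selective Sample Retention is described as keeping the top-p most informative samples per category and task pair, ranked by sample loss. The sketch below makes two assumptions that the abstract leaves open: `p` is treated as a count rather than a fraction, and a higher loss is treated as more informative. The tuple layout and function name are illustrative.

```python
from collections import defaultdict


def select_retained(samples, p):
    """Keep up to p sample ids per (category, task) pair.

    samples: iterable of (sample_id, category, task, loss) tuples.
    ASSUMPTIONS: p is a count, and higher loss means more informative.
    """
    groups = defaultdict(list)
    for sample_id, category, task, loss in samples:
        groups[(category, task)].append((loss, sample_id))

    retained = {}
    for key, items in groups.items():
        # Sort by loss only, highest first; ties keep their input order.
        items.sort(key=lambda item: item[0], reverse=True)
        retained[key] = [sample_id for _, sample_id in items[:p]]
    return retained


# Hypothetical synthesized samples from two tasks.
samples = [
    ("s1", "cat", 0, 0.2),
    ("s2", "cat", 0, 0.9),
    ("s3", "cat", 0, 0.5),
    ("s4", "dog", 1, 0.1),
]
buffer = select_retained(samples, p=2)
```

The retained ids would then be mixed into training when later tasks arrive, which is the replay role the abstract assigns to SSR.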
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
First: 2026-02-19T18:40:51+00:00 · Latest: 2026-02-19T18:40:51+00:00
Abstract
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5x while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
中文标题/摘要
标题:稳定异步性:基于方差控制的离策RL方法用于大语言模型
强化学习(RL)广泛用于提高大语言模型在推理任务上的表现,而异步RL训练因其能提高端到端吞吐量而具有吸引力。然而,对于广泛采用的无批评家的策略梯度方法如REINFORCE和GRPO,高异步性使得策略梯度估计器的方差显著增加:基于过时回放训练会产生重尾的重要性比率,导致一小部分样本主导更新。这种放大效应使得梯度变得嘈杂,学习变得不稳定,相对于匹配的在策训练而言更是如此。在数学和通用推理基准测试中,我们发现,有效样本量(ESS)和不稳定的梯度范数可以可靠地预测崩溃。基于这一诊断,我们提出了方差控制策略优化(VCPO),这是一种适用于REINFORCE/GRPO风格算法的通用稳定化方法,(i)基于有效样本量调整学习率以抑制不可靠的更新,(ii)为离策设置应用闭式最小方差基线,避免使用辅助价值模型并增加最少的开销。实验证明,VCPO显著提高了数学、通用推理和工具使用任务中异步训练的鲁棒性,优于广泛基线方法,包括掩码/剪切稳定器和算法变体。这减少了长上下文、多轮训练时间2.5倍,同时匹配同步性能,表明显式控制策略梯度方差是大规模可靠异步RL的关键。
Summary / 总结
The paper addresses the issue of high variance in asynchronous reinforcement learning (RL) training for large language models (LLMs), particularly for critic-free methods like REINFORCE and GRPO. It proposes VCPO, a variance-controlled policy optimization method that adjusts the learning rate based on effective sample size and uses a closed-form minimum-variance baseline to stabilize off-policy updates. Experiments show that VCPO enhances robustness in asynchronous training across various reasoning tasks, reducing training time by 2.5 times while matching synchronous performance.
论文针对critic-free方法如REINFORCE和GRPO在大型语言模型(LLMs)异步强化学习(RL)训练中的高方差问题,提出了一种方差控制的策略优化方法VCPO,该方法基于有效样本大小调整学习率,并应用了一种闭式最小方差基线。实验表明,VCPO在各种基准测试中显著提高了异步训练的鲁棒性,优于其他稳定器和算法变体,并将训练时间减少了2.5倍,同时匹配同步性能。
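The VCPO abstract ties training collapse to a low effective sample size (ESS) of the importance ratios and scales the learning rate accordingly. The sketch below uses the standard Kish ESS formula, (sum of w) squared over (sum of w squared); the linear scaling `base_lr * ESS / N` is an assumption chosen for illustration, since the abstract does not state the paper's exact scaling rule.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return total * total / squares if squares > 0 else 0.0


def scaled_learning_rate(base_lr, weights):
    """ASSUMED rule: shrink the step in proportion to ESS / N."""
    n = len(weights)
    if n == 0:
        return 0.0
    return base_lr * effective_sample_size(weights) / n


# On-policy batch: uniform ratios, so ESS equals the batch size.
fresh = effective_sample_size([1.0, 1.0, 1.0, 1.0])
# Stale batch: one heavy-tailed ratio dominates, so ESS collapses to 1.
stale = effective_sample_size([1.0, 0.0, 0.0, 0.0])
lr = scaled_learning_rate(1e-3, [1.0, 0.0, 0.0, 0.0])
```

When a single sample carries all the weight, the update is effectively based on one rollout, and the assumed rule cuts the step to a quarter of the base rate for this batch of four.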
Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning
Authors: Obaidullah Zaland, Sajib Mistry, Monowar Bhuyan
First: 2026-02-19T18:40:12+00:00 · Latest: 2026-02-19T18:40:12+00:00
Comments: Accepted for Publication in IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Big data scenarios, where massive, heterogeneous datasets are distributed across clients, demand scalable, privacy-preserving learning methods. Federated learning (FL) enables decentralized training of machine learning (ML) models across clients without data centralization. Decentralized training, however, introduces a computational burden on client devices. U-shaped federated split learning (UFSL) offloads a fraction of the client computation to the server while keeping both data and labels on the clients' side. However, the intermediate representations (i.e., smashed data) shared by clients with the server are prone to exposing clients' private data. To reduce exposure of client data through intermediate data representations, this work proposes k-anonymous differentially private UFSL (KD-UFSL), which leverages privacy-enhancing techniques such as microaggregation and differential privacy to minimize data leakage from the smashed data transferred to the server. We first demonstrate that an adversary can access private client data from intermediate representations via a data-reconstruction attack, and then present a privacy-enhancing solution, KD-UFSL, to mitigate this risk. Our experiments indicate that, alongside increasing the mean squared error between the actual and reconstructed images by up to 50% in some cases, KD-UFSL also decreases the structural similarity between them by up to 40% on four benchmarking datasets. More importantly, KD-UFSL improves privacy while preserving the utility of the global model. This highlights its suitability for large-scale big data applications where privacy and utility must be balanced.
中文标题/摘要
标题:守护中间:保护联邦分割学习中的中间表示
在大规模、异构数据集分布在客户端的场景中,需要可扩展且保护隐私的机器学习方法。联邦学习(FL)允许在不集中数据的情况下,通过客户端分散训练机器学习模型。然而,分散训练会增加客户端设备的计算负担。U形联邦分割学习(UFSL)将部分客户端计算卸载到服务器上,同时将数据和标签保留在客户端。然而,客户端与服务器共享的中间表示(即被打碎的数据)容易泄露客户端的私有数据。为了减少通过中间数据表示泄露客户端数据的风险,本文提出了一种基于k匿名和差分隐私的UFSL(KD-UFSL),利用微聚合和差分隐私等隐私增强技术,最小化传输到服务器的被打碎数据的数据泄露。我们首先证明了攻击者可以通过数据重建攻击访问中间表示中的客户端私有数据,然后提出了一种隐私增强解决方案KD-UFSL来缓解这一风险。实验表明,在四个基准数据集上,KD-UFSL在某些情况下将实际图像与重建图像之间的均方误差最多提高50%,并将二者的结构相似性最多降低40%。更重要的是,KD-UFSL在提高隐私的同时保持了全局模型的实用性。这突显了其在需要平衡隐私和实用性的大规模大数据应用中的适用性。
Summary / 总结
This work addresses the challenge of protecting client data in U-shaped federated split learning (UFSL) by proposing k-anonymous differentially private UFSL (KD-UFSL). The method uses privacy-enhancing techniques like microaggregation and differential privacy to minimize data leakage from intermediate representations shared with the server. Experiments on four benchmark datasets show that KD-UFSL increases the mean squared error of reconstructed images by up to 50% and reduces structural similarity by up to 40%, while still preserving the utility of the global model, thus balancing privacy and utility in large-scale big data applications.
该研究旨在保护U形联邦分割学习(UFSL)中的中间表示,防止泄露客户端数据。它提出了k-匿名差分隐私UFSL(KD-UFSL),使用微聚合和差分隐私来减少数据泄露。实验表明,KD-UFSL可以将重建图像的均方误差最多提高50%,将结构相似性最多降低40%,同时仍然保持全局模型的实用性,从而在大规模大数据应用中平衡隐私和实用性。
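KD-UFSL is described as combining microaggregation (for k-anonymity) with differential privacy on the smashed data sent to the server. The sketch below illustrates the two generic building blocks only: replacing each vector with the centroid of a group of at least k vectors, then adding Laplace noise. The grouping heuristic (sorting by component sum), the noise scale, and every name are assumptions for illustration; this is not the paper's mechanism and no privacy budget is calibrated here.

```python
import random


def microaggregate(vectors, k):
    """Replace each vector with the centroid of its group of at least k vectors.

    ASSUMPTION: groups are formed by sorting on the component sum, a crude
    stand-in for a proper microaggregation heuristic. If fewer than k vectors
    are given, they form a single undersized group.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(vectors)), key=lambda i: sum(vectors[i]))
    result = [None] * len(vectors)
    start, n = 0, len(order)
    while start < n:
        end = start + k
        if n - end < k:
            end = n  # fold a short tail into the final group
        group = order[start:end]
        dim = len(vectors[group[0]])
        centroid = [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
        for i in group:
            result[i] = list(centroid)
        start = end
    return result


def add_laplace_noise(vectors, scale, rng=random):
    """Add zero-mean Laplace noise; the difference of two exponentials is Laplace."""
    if scale <= 0:
        return [list(v) for v in vectors]
    rate = 1.0 / scale
    return [[x + rng.expovariate(rate) - rng.expovariate(rate) for x in v] for v in vectors]


# Hypothetical 1-D smashed activations from four clients' samples.
grouped = microaggregate([[0.0], [1.0], [10.0], [11.0]], k=2)
noisy = add_laplace_noise(grouped, scale=0.1, rng=random.Random(0))
```

After microaggregation, every transmitted vector is shared by at least k samples, so a reconstruction attack can at best recover a group average; the added noise further blurs that average.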
Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
Authors: Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick
First: 2025-10-16T00:48:05+00:00 · Latest: 2026-02-19T18:33:22+00:00
Abstract
Diffusion models excel at generation, but their latent spaces are high dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plug-and-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing while rendering remains in the original diffusion space. ConDA separates editing and rendering by lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder and is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer.
中文标题/摘要
标题:对比扩散对齐:学习结构化潜在变量以实现可控生成
扩散模型在生成方面表现出色,但其潜在空间是高维的,并未明确组织以供解释或控制。我们引入了ConDA(对比扩散对齐),这是一种即插即用的几何层,通过辅助变量(例如时间、刺激参数、面部动作单元)对预训练的扩散潜在变量应用对比学习。ConDA 学习一个低维嵌入,其方向与潜在动力学因素对齐,这与最近关于结构化和分离表示的对比学习结果一致。在该嵌入中,简单的非线性轨迹支持平滑的插值、外推和反事实编辑,同时渲染保持在原始扩散空间中。ConDA 通过保留邻域的kNN解码器将嵌入轨迹提升回扩散潜在变量,从而实现编辑和渲染的分离,并且在各种反演求解器中表现出鲁棒性。在流体动力学、神经钙成像、治疗性神经刺激、面部表情动力学和猴子运动皮层活动等领域,ConDA 比线性遍历和基于条件的基线提供了更可解释和可控的潜在结构,表明扩散潜在变量编码了与动力学相关的结构,可以通过显式的对比几何层加以利用。
Summary / 总结
The research aims to enhance the interpretability and controllability of diffusion models by introducing ConDA (Contrastive Diffusion Alignment), a plug-and-play layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables. ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, enabling smooth interpolation, extrapolation, and counterfactual editing in the embedding while rendering stays in the original diffusion space. Experiments across five domains, from fluid dynamics to monkey motor cortex activity, show that ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines.
该研究引入了ConDA(对比扩散对齐)方法,通过对比学习对预训练的扩散模型的潜在空间进行处理,学习一个与潜在动态因素对齐的低维嵌入。这种嵌入支持平滑的插值、外推和反事实编辑,同时保持原始的扩散空间。ConDA 在多个应用中优于线性遍历和基于条件的方法,展示了更可解释和可控的潜在结构。
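The abstract describes lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder. A minimal sketch of one such decoder follows; the inverse-distance weighting and the name `knn_decode` are assumptions, since the abstract does not specify how neighbours are combined.

```python
import numpy as np

def knn_decode(z_query, z_train, latents_train, k=5, eps=1e-8):
    """Hypothetical kNN decoder: map one embedding point back to latent
    space as the inverse-distance-weighted mean of the latents belonging
    to its k nearest training embeddings.

    z_query:       (e,)   point in the low-dimensional embedding
    z_train:       (n, e) training embeddings
    latents_train: (n, d) diffusion latents paired with z_train
    """
    dist = np.linalg.norm(z_train - z_query, axis=1)
    idx = np.argsort(dist)[:k]          # indices of the k nearest neighbours
    weights = 1.0 / (dist[idx] + eps)   # closer neighbours weigh more
    weights /= weights.sum()
    return weights @ latents_train[idx]
```

Editing would then happen along a trajectory in the embedding, with each trajectory point decoded this way before rendering in the original diffusion space.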
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25\% pruning while retaining approximately 90\% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
中文标题/摘要
标题:ReplaceMe:通过深度剪枝和Transformer块线性化实现网络简化
我们引入了ReplaceMe,这是一种通用的无需训练的深度剪枝方法,能够有效地将Transformer块替换为线性操作,同时在低压缩比下保持高性能。与需要额外训练或微调的传统剪枝方法不同,我们的方法仅需要一个小规模的校准数据集来估计线性变换,该变换用于近似被剪除的块。估计出的线性映射可以无缝地与剩余的Transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe始终优于其他无需训练的方法,并且与涉及大量重新训练/微调和架构修改的最新剪枝方法相比仍具有很强的竞争力。应用于多个大型语言模型(LLMs)时,ReplaceMe实现了高达25%的剪枝,同时在开放基准测试中保留了原始模型约90%的性能,且无需任何训练或修复步骤,计算开销极小。我们提供了一个开源库,实现了ReplaceMe以及几种最新的深度剪枝技术,可在 https://github.com/mts-ai/ReplaceMe 获取。
Summary / 总结
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios. Unlike conventional pruning methods that require additional training or fine-tuning, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks; this mapping can be merged into the remaining transformer blocks, so no extra parameters are added. Experiments show that ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that involve extensive retraining. Applied to several large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original performance on open benchmarks, with minimal computational overhead and no training or healing steps.
ReplaceMe 是一种无需训练的深度剪枝方法,将Transformer块替换为线性操作,并在低压缩比下保持高性能。与需要额外训练的传统剪枝方法不同,ReplaceMe 使用一个小的校准数据集来估计一个线性变换,用于近似被剪除的块。实验表明,ReplaceMe 优于其他无需训练的剪枝方法,并且与涉及大量重新训练和架构修改的最新剪枝方法相比仍具有竞争力。应用于大型语言模型时,ReplaceMe 可以实现高达 25% 的剪枝,同时保留原始模型约 90% 的性能,无需任何训练或修复步骤,计算开销极小。
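The core idea, estimating a linear map from calibration activations and folding it into existing weights, can be sketched as follows. This is an illustration under assumptions, not the ReplaceMe library's API: it uses a plain least-squares objective, hypothetical function names, and a bare linear layer, and it ignores how residual connections and normalization are handled in a real transformer.

```python
import numpy as np

def fit_linear_replacement(x_in, x_out):
    """Least-squares estimate of T such that x_in @ T approximates x_out.

    x_in:  (n_tokens, d) calibration activations entering the pruned span
    x_out: (n_tokens, d) calibration activations leaving the pruned span
    """
    T, *_ = np.linalg.lstsq(x_in, x_out, rcond=None)
    return T

def merge_into_previous_weight(W_prev, T):
    """Fold T into a preceding linear layer h = a @ W_prev, so that
    (a @ W_prev) @ T == a @ (W_prev @ T) and no new parameters remain."""
    return W_prev @ T
```

Because the product of two linear maps is again a single matrix, the replacement adds no parameters once merged, which matches the claim in the abstract.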
Towards Anytime-Valid Statistical Watermarking
Authors: Baihe Huang, Eric Xu, Kannan Ramchandran, Jiantao Jiao, Michael I. Jordan
First: 2026-02-19T18:32:26+00:00 · Latest: 2026-02-19T18:32:26+00:00
Abstract
The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid, anytime-inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
中文标题/摘要
标题:迈向任意时点有效的统计水印
大型语言模型(LLMs)的普及需要有效的机制来区分机器生成的内容和人类文本。虽然统计水印已经证明是一种有前景的解决方案,但现有方法存在两个关键限制:缺乏选择抽样分布的原理性方法以及依赖固定时间窗假设检验,这限制了早期停止的有效性。在本文中,我们通过开发第一个基于e值的水印框架——锚定e水印,填补了这一空白,该框架将最优采样与任意时点的有效推断统一起来。与传统方法不同,传统方法中的可选停止会破坏第一类错误保证,而我们的框架通过为检测过程构建测试超鞅,实现了有效的任意时点推断。通过利用锚定分布来近似目标模型,我们以最坏情况下的对数增长率为基准来表征最优e值,并推导出最优的预期停止时间。我们的理论主张通过模拟和在现有基准上的评估得到了验证,表明我们的框架可以显著提高样本效率,与最先进的基线相比,检测所需的平均令牌预算减少了13-15%。
Summary / 总结
This paper addresses the challenge of distinguishing machine-generated content from human text by developing Anchored E-Watermarking, which it presents as the first e-value-based statistical watermarking framework. It overcomes two limitations of existing methods, the lack of a principled way to select sampling distributions and the reliance on fixed-horizon hypothesis testing, by unifying optimal sampling with anytime-valid inference. The framework uses an anchor distribution to approximate the target model and constructs a test supermartingale for the detection process, which allows valid early stopping while maintaining Type-I error guarantees. Simulations and evaluations on established benchmarks show a 13-15% reduction in the average token budget required for detection relative to state-of-the-art baselines.
本文提出了一种新的统计水印框架Anchored E-Watermarking,以解决区分机器生成内容和人类文本的问题。该框架通过将最优采样与随时有效的推断相结合,利用锚定分布近似目标模型并构建测试超鞅来克服现有方法的局限性。该框架显著提高了样本效率,与最先进的方法相比,将检测所需的平均令牌预算减少了13-15%。
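Anytime-valid detection with a test supermartingale can be sketched generically. Under the null (human text), each per-token e-value has conditional expectation at most 1, so their running product is a nonnegative supermartingale, and Ville's inequality bounds the probability that it ever reaches 1/alpha by alpha. The sketch below shows only this stopping rule; how the paper builds e-values from the anchor distribution is not reproduced, and the function name is hypothetical.

```python
def anytime_valid_detect(e_values, alpha=0.05):
    """Multiply per-token e-values into a wealth process and stop as soon
    as it reaches 1/alpha.

    Returns (detected, tokens_used).
    """
    wealth = 1.0
    threshold = 1.0 / alpha
    used = 0
    for e in e_values:
        if e < 0:
            raise ValueError("e-values must be nonnegative")
        used += 1
        wealth *= e
        if wealth >= threshold:
            # Stopping here is valid at any time: no fixed horizon needed.
            return True, used
    return False, used
```

For example, a stream of e-values all equal to 2 crosses the threshold 20 at the fifth token, while a stream of 1s never triggers detection.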
Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
Authors: Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly
First: 2026-02-19T18:30:18+00:00 · Latest: 2026-02-19T18:30:18+00:00
Abstract
In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
中文标题/摘要
标题:即时主动适应:基于潜在概念的相关性引导在线元学习用于地理空间发现
在环境监测、灾害响应或公共卫生等许多现实场景中,由于数据收集成本高且环境动态变化,从未观测区域进行策略性采样对于在资源受限的情况下高效发现隐藏目标至关重要。然而,稀疏且有偏的地理空间真值数据限制了现有基于学习的方法(如强化学习)的应用。为解决这一问题,我们提出了一种统一的地理空间发现框架,该框架结合了主动学习、在线元学习和概念引导的推理。我们的方法引入了两个关键创新,二者都建立在*概念相关性*这一共同概念之上,该概念刻画了领域特定因素如何影响目标的存在:一种*概念加权不确定性采样策略*,其中不确定性根据易于获取的领域特定概念(如土地覆盖、与污染源的距离)学习到的相关性进行调整;以及一种*相关性感知的元批次形成策略*,该策略在在线元更新过程中促进语义多样性,从而在动态环境中提高泛化能力。我们的实验包括在真实世界的数据集(涉及致癌的PFAS(全氟和多氟烷基物质)污染)上测试,展示了在有限数据和变化环境中该方法发现目标的可靠性。
Summary / 总结
This paper proposes a unified geospatial discovery framework that combines active learning, online meta-learning, and concept-guided reasoning to efficiently uncover hidden targets in dynamic environments with limited data. The key innovations, both built on a shared notion of concept relevance, are a concept-weighted uncertainty sampling strategy and a relevance-aware meta-batch formation strategy. The method is tested on a real-world dataset of PFAS contamination, where it uncovers targets reliably with limited data and a varying environment.
论文提出了一种结合主动学习、在线元学习和概念引导推理的统一地理空间发现框架,以在动态环境中高效地发现隐藏目标,同时数据有限。该方法引入了基于领域特定概念的概念加权不确定性采样策略和相关性感知元批次形成策略,以提高目标发现效果。实验在真实世界的PFAS污染数据集上展示了该方法在资源受限条件下的可靠性。
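Concept-weighted uncertainty sampling can be pictured as an acquisition score in which predictive uncertainty is scaled by a relevance term derived from domain concepts. The sketch below is one plausible form, not the paper's formula: the multiplicative combination, the linear relevance weighting, and all names are assumptions.

```python
import numpy as np

def concept_weighted_scores(p_target, concepts, relevance):
    """Hypothetical acquisition score: binary entropy times concept relevance.

    p_target:  (n,)   predicted probability of target presence per region
    concepts:  (n, c) concept features per region (e.g. land cover), in [0, 1]
    relevance: (c,)   relevance weight per concept (taken as given here)
    """
    p = np.clip(p_target, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    weight = concepts @ relevance  # how relevant each region's concepts are
    return entropy * weight

def pick_next_region(p_target, concepts, relevance):
    """Index of the unobserved region with the highest weighted score."""
    return int(np.argmax(concept_weighted_scores(p_target, concepts, relevance)))
```

With this form, two equally uncertain regions are separated by their concept relevance, and a confidently predicted region scores low even when its concepts are highly relevant.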
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Authors: Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Venue: ICLR 2026
First: 2025-10-16T17:59:51+00:00 · Latest: 2026-02-19T18:30:05+00:00
Comments: ICLR 2026. Code: https://github.com/Lakonik/piFlow Demos: https://huggingface.co/spaces/Lakonik/pi-Qwen | https://huggingface.co/spaces/Lakonik/pi-FLUX.1 | https://huggingface.co/spaces/Lakonik/pi-FLUX.2
Abstract
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($π$-Flow). $π$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $π$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $π$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.
中文标题/摘要
标题:pi-Flow:通过模仿蒸馏实现基于策略的少步生成
少步扩散或基于流的生成模型通常将一个预测速度的教师蒸馏为一个学生,该学生预测通向去噪数据的捷径。这种格式不匹配导致了复杂的蒸馏过程,并且往往受困于质量与多样性的权衡。为了解决这个问题,我们提出了基于策略的流模型($π$-Flow)。$π$-Flow 将学生流模型的输出层修改为在单个时间步预测一个无网络策略。该策略随后在未来的子步骤中生成动态流速度,几乎不增加额外开销,从而在这些子步骤中无需额外网络评估即可快速准确地进行 ODE 积分。为了使策略的 ODE 轨迹与教师的相匹配,我们引入了一种新颖的模仿蒸馏方法,该方法使用标准的 $\ell_2$ 流匹配损失,在策略轨迹上匹配策略的速度与教师的速度。通过简单模仿教师的行为,$π$-Flow 使训练变得稳定且可扩展,并避免了质量与多样性的权衡。在 ImageNet 256$^2$ 上,它达到了 1-NFE FID 为 2.85,优于相同 DiT 架构的先前 1-NFE 模型。在 FLUX.1-12B 和 Qwen-Image-20B 上,$π$-Flow 在 4 NFEs 下实现了比最先进的 DMD 模型显著更好的多样性,同时保持了教师级别的质量。
Summary / 总结
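The paper proposes $π$-Flow, a policy-based few-step generative model that avoids the quality-diversity trade-off of existing few-step distillation methods. The student's output layer is modified to predict a network-free policy at one timestep; this policy generates dynamic flow velocities at future substeps, enabling fast and accurate ODE integration without extra network evaluations. Imitation distillation matches the policy's velocity to the teacher's along the policy's own trajectory with a standard $\ell_2$ flow matching loss. On ImageNet 256$^2$, $π$-Flow attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture; on FLUX.1-12B and Qwen-Image-20B at 4 NFEs, it achieves substantially better diversity than state-of-the-art DMD models while maintaining teacher-level quality.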
该论文提出了$π$-Flow,一种基于策略的少量步骤生成方法,通过将教师模型中的信息提炼到学生模型中,该模型在某一时间步预测策略,该策略在后续子步骤生成动态速度,从而实现高效的ODE积分。该方法使用模仿蒸馏来匹配策略的速度和教师的速度,从而实现稳定和可扩展的训练。在ImageNet 256$^2$上,$π$-Flow达到1-NFE FID为2.85,优于之前的模型。在FLUX.1-12B和Qwen-Image-20B上,$π$-Flow在4 NFEs时显示出比最先进的DMD模型更好的多样性,同时保持教师级别的质量。
Asymptotic Smoothing of the Lipschitz Loss Landscape in Overparameterized One-Hidden-Layer ReLU Networks
Authors: Saveliy Baturin
First: 2026-02-19T18:20:21+00:00 · Latest: 2026-02-19T18:20:21+00:00
Abstract
We study the topology of the loss landscape of one-hidden-layer ReLU networks under overparameterization. On the theory side, we (i) prove that for convex $L$-Lipschitz losses with an $\ell_1$-regularized second layer, every pair of models at the same loss level can be connected by a continuous path within an arbitrarily small loss increase $ε$ (extending a known result for the quadratic loss); (ii) obtain an asymptotic upper bound on the energy gap $ε$ between local and global minima that vanishes as the width $m$ grows, implying that the landscape flattens and sublevel sets become connected in the limit. Empirically, on a synthetic Moons dataset and on the Wisconsin Breast Cancer dataset, we measure pairwise energy gaps via Dynamic String Sampling (DSS) and find that wider networks exhibit smaller gaps; in particular, a permutation test on the maximum gap yields $p_{perm}=0$, indicating a clear reduction in the barrier height.
中文标题/摘要
标题:过参数化单隐藏层ReLU网络的Lipschitz损失景观的渐近光滑化
我们研究了过参数化单隐藏层ReLU网络的损失景观拓扑。在理论方面,我们(i) 证明对于具有$\ell_1$正则化第二层的凸$L$-Lipschitz损失,任意两个在相同损失水平的模型可以在损失增加$ε$(任意小)的范围内通过连续路径连接(扩展了对二次损失已知的结果);(ii) 得到了局部和全局最小值之间能量差距$ε$的渐近上界,该差距随着宽度$m$的增长而消失,表明景观趋于平坦,子水平集在极限下变得连通。在合成的月亮数据集和威斯康星州乳腺癌数据集上,我们通过动态字符串采样(DSS)测量了两两能量差距,并发现更宽的网络表现出更小的差距;特别是,最大差距的置换检验得到$p_{perm}=0$,表明明显的势垒高度降低。
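The reported permutation test on the maximum gap can be read as follows: pool the pairwise energy gaps from narrow and wide networks, shuffle the group labels, and count how often the shuffled difference in maxima is at least as large as the observed one. The sketch below is one plausible reading, not the paper's procedure; the function name and the choice of statistic are assumptions.

```python
import numpy as np

def perm_test_max_gap(gaps_narrow, gaps_wide, n_perm=2000, seed=0):
    """Hypothetical one-sided permutation test on max(narrow) - max(wide).

    Returns the fraction of label shuffles whose statistic is at least
    as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(gaps_narrow, dtype=float)
    b = np.asarray(gaps_wide, dtype=float)
    observed = a.max() - b.max()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = perm[:a.size].max() - perm[a.size:].max()
        hits += int(stat >= observed)
    return hits / n_perm
```

With this plain estimator a value of exactly 0 is possible when no sampled shuffle reaches the observed statistic; a common alternative is `(hits + 1) / (n_perm + 1)`, which is never zero.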
CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Authors: Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu
First: 2026-02-16T16:10:19+00:00 · Latest: 2026-02-19T18:19:25+00:00
Abstract
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
中文标题/摘要
标题:CT-Bench:计算机断层扫描中多模态病变理解的基准
人工智能(AI)可以自动在计算机断层扫描(CT)上勾画出病变并生成放射学报告内容,但进展受限于可用的带有病变级别注释的CT数据集稀缺。为解决这一问题,我们引入了CT-Bench,这是一个首创的基准数据集,包含两个部分:包含20,335个病变的病变图像和元数据集,来自7,795个CT研究,附带边界框、描述和尺寸信息,以及一个涵盖病变定位、描述、尺寸估计和属性分类的多任务视觉问答基准,包含2,850个问答对。还包含困难的负例以反映实际诊断挑战。我们通过将多个最先进的多模态模型与放射科医生评估进行比较,评估了CT-Bench的价值,证明了CT-Bench作为病变分析综合基准的价值。此外,对病变图像和元数据集进行微调在两个部分上均取得了显著的性能提升,突显了CT-Bench的临床用途。
Summary / 总结
CT-Bench is a benchmark dataset for multimodal lesion understanding in CT. It has two components: a Lesion Image and Metadata Set of 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization, including hard negative examples. Multiple state-of-the-art multimodal models are evaluated against radiologist assessments, and fine-tuning on the Lesion Image and Metadata Set yields significant performance gains across both components.
CT-Bench 是一个用于 CT 扫描中多模态病灶理解的新基准数据集,包含来自 7,795 个 CT 研究的 20,335 个病灶,附有边界框、描述和大小信息,以及包含 2,850 个 QA 对的多任务视觉问答基准。该数据集包含困难的负例以反映实际诊断挑战。评估将多个最先进的多模态模型与放射科医生的评估进行比较;在病灶图像和元数据集上进行微调后,模型在两个基准组件上的性能均显著提升,表明其在病灶分析中的价值。
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play \textbf{all conceivable human games}, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10\% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
中文标题/摘要
标题:AI游戏库:通过人类游戏评估机器通用智能的可扩展、开放性方法
在技术飞速发展的时代,严格评估机器智能与人类广泛智能的全面谱系变得越来越重要且具有挑战性。传统的AI基准测试通常仅评估人类活动有限范围内的狭窄能力。大多数基准测试也是静态的,随着开发人员显式或隐式地对其进行优化,它们很快就会饱和。我们提出了一种评估AI系统中类人通用智能的更有前途的方法:通过一种特别强大的通用游戏玩法形式:研究它们如何以及如何有效地玩和学习玩所有可能的人类游戏,与具有相同经验、时间或其他资源的人类玩家进行比较。我们定义“人类游戏”为人类设计供人类玩的游戏,并为这一所有此类游戏的空间辩护——“人类游戏多元宇宙”。为实现这一愿景的第一步,我们引入了AI游戏库,这是一个使用人类在环的LLM构建的可扩展和开放性平台,通过自动获取和适应来自流行的人类数字游戏平台的标准和容器化游戏环境变体来合成新的代表性人类游戏。作为概念验证,我们基于Apple App Store和Steam的热门排行榜生成了100个此类游戏,并在短游戏片段上评估了七个前沿的视觉-语言模型(VLMs)。最好的模型在大多数游戏中的人类平均得分中仅达到了不到10%,尤其是在挑战世界模型学习、记忆和规划的游戏方面表现尤为困难。最后,我们提出了构建AI游戏库的下一步,作为一种实际的测量和推动机器向类人通用智能发展的方法。
Summary / 总结
The research aims to evaluate machine intelligence more comprehensively by comparing it to human general intelligence across the space of games designed by humans for humans. The method is AI GameStore, a scalable, open-ended platform that uses LLMs with humans in the loop to synthesize new representative games. As a proof of concept, 100 games based on the top charts of the Apple App Store and Steam were generated and seven frontier vision-language models were evaluated on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games and struggled especially with games that challenge world-model learning, memory, and planning.
研究旨在通过广泛的人类设计游戏来全面评估机器智能,并与人类通用智能进行比较。方法是创建一个可扩展的平台AI GameStore,使用LLM和人类输入生成新游戏。关键发现表明,最好的模型在大多数游戏中的得分不到人类平均得分的10%,尤其是在考验世界模型学习、记忆和规划的游戏中表现不佳。
Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization
Authors: Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu
First: 2026-02-17T00:27:58+00:00 · Latest: 2026-02-19T18:14:28+00:00
Abstract
Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration--Exploitation Distillation (E$^2$D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E$^2$D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being $18\times$ faster, and on ImageNet-21K, our method substantially improves accuracy while remaining $4.3\times$ faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab/E2D.
中文标题/摘要
标题:通过探索-利用优化加速大规模数据集蒸馏
数据集蒸馏将原始数据压缩成紧凑的合成数据集,减少训练时间和存储空间同时保持模型性能,使在资源有限的情况下部署成为可能。尽管最近的解耦蒸馏方法能够实现大规模的数据集蒸馏,但它们仍然面临效率差距:基于优化的解耦方法能够获得更高的准确率,但需要大量的计算,而无需优化的解耦方法则更高效但牺牲了准确率。为了克服这种权衡,我们提出了探索-利用蒸馏(E$^2$D),这是一种简单实用的方法,通过高效的流水线减少冗余计算,该流水线从全图像初始化开始以保持语义完整性和特征多样性。然后使用两阶段优化策略:探索阶段进行均匀更新并识别高损失区域,以及利用阶段专注于这些区域的更新以加速收敛。我们在大规模基准上评估了E$^2$D,在ImageNet-1K上超越了最先进的方法,同时快了18倍,在ImageNet-21K上我们的方法显著提高了准确率同时保持了4.3倍的加速。这些结果表明,目标导向、减少冗余的更新,而不是暴力优化,能够弥合大规模数据集蒸馏中的准确率和效率之间的差距。代码可在https://github.com/ncsu-dk-lab/E2D/ 获取。
Summary / 总结
The research aims to improve the efficiency of large-scale dataset distillation by addressing the trade-off between accuracy and computation. The proposed Exploration--Exploitation Distillation (E$^2$D) method uses an efficient pipeline for initialization and a two-phase optimization strategy to minimize redundant computation. E$^2$D surpasses the state-of-the-art on ImageNet-1K while being 18 times faster, and on ImageNet-21K, it improves accuracy while remaining 4.3 times faster, demonstrating the effectiveness of targeted updates in bridging the accuracy-efficiency gap.
研究旨在通过解决准确性和计算效率之间的权衡问题,提高大规模数据集蒸馏的效率。提出的Exploration--Exploitation Distillation (E$^2$D)方法使用高效的初始化管道和两阶段优化策略,以减少冗余计算。E$^2$D在ImageNet-1K上超越了最先进的方法,速度快了18倍,而在ImageNet-21K上,它提高了准确率,同时保持了4.3倍的加速,证明了有针对性的更新在准确性和效率之间的权衡中具有有效性。
Asymptotically Optimal Sequential Testing with Markovian Data
Authors: Alhad Sethi, Kavali Sofia Sagar, Shubhada Agrawal, Debabrota Basu, P. N. Karthik
First: 2026-02-19T18:11:02+00:00 · Latest: 2026-02-19T18:11:02+00:00
Abstract
We study one-sided and $α$-correct sequential hypothesis testing for data generated by an ergodic Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set $P$ of stochastic matrices, and the alternative corresponds to a disjoint set $Q$. We establish a tight non-asymptotic instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as $α\to 0$. We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.
中文标题/摘要
标题:马尔可夫数据的渐近最优序贯检验
我们研究了由遍历马尔可夫链生成的数据的单边且$α$-正确的序贯假设检验。零假设是未知的转移矩阵属于给定的随机矩阵集合$P$,备择假设对应于一个不相交的集合$Q$。我们建立了在备择假设下任何有效序贯检验的期望停止时间的紧致非渐近实例依赖下界。我们的新颖分析改进了现有的下界,这些下界要么是渐近的,要么在这个设置中是可证明的次优的。我们的下界同时包含了平稳分布和由未知马尔可夫链诱导的转移结构。我们进一步提出了一种最优检验,其期望停止时间在$α\to 0$时与该下界渐近匹配。我们通过马尔可夫链蒙特卡洛中的模型误设序贯检测应用以及马尔可夫决策过程中的结构属性检验(如转移动力学的线性性)来说明我们框架的有效性。我们的发现为马尔可夫依赖下的最优序贯检验程序提供了精确而通用的表征。
Summary / 总结
This paper investigates one-sided, $α$-correct sequential hypothesis testing for data generated by an ergodic Markov chain. It establishes a tight, non-asymptotic, instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative hypothesis, and proposes a test whose expected stopping time matches this bound asymptotically as $α \to 0$. The findings give a sharp and general characterization of optimal sequential testing under Markovian dependence, with applications to detecting model misspecification in Markov Chain Monte Carlo and to testing structural properties in Markov decision processes.
本文研究了由遍历马尔可夫链生成的数据的单边且$α$-正确的序贯假设检验。它建立了任何有效序贯检验在备择假设下的期望停止时间的紧致非渐近下界。提出的检验在$α$趋于零时渐近匹配这一下界。研究结果为马尔可夫依赖下的最优序贯检验程序提供了精确的一般表征,适用于模型误设检测和马尔可夫决策过程中的结构属性检验等应用。
Nonlinear Model Order Reduction of Dynamical Systems in Process Engineering: Review and Comparison
Authors: Jan C. Schulze, Alexander Mitsos
First: 2025-06-15T11:39:12+00:00 · Latest: 2026-02-19T17:29:34+00:00
Abstract
Computationally cheap yet accurate dynamical models are a key requirement for real-time capable nonlinear optimization and model-based control. When given a computationally expensive high-order prediction model, a reduction to a lower-order simplified model can enable such real-time applications. Herein, we review nonlinear model order reduction methods and provide a comparison of method characteristics. Additionally, we discuss both general-purpose methods and tailored approaches for chemical process systems and we identify similarities and differences between these methods. As machine learning manifold-Galerkin approaches currently do not account for inputs in the construction of the reduced state subspace, we extend these methods to dynamical systems with inputs. In a comparative case study, we apply eight established model order reduction methods to an air separation process model: POD-Galerkin, nonlinear-POD-Galerkin, manifold-Galerkin, dynamic mode decomposition, Koopman theory, manifold learning with latent predictor, compartment modeling, and model aggregation. Herein, we do not investigate hyperreduction, i.e., reduction of floating point operations. Based on our findings, we discuss strengths and weaknesses of the model order reduction methods.
中文标题/摘要
标题:过程工程中动力学系统的非线性模型降阶:综述与比较
计算成本低廉且准确的动力学模型是具备实时能力的非线性优化和基于模型的控制的关键需求。当给定一个计算成本高昂的高阶预测模型时,将其降阶为较低阶的简化模型可以实现此类实时应用。本文综述了非线性模型降阶方法,并提供了方法特性的比较。此外,我们讨论了通用方法和针对化学过程系统的定制方法,并指出了这些方法之间的相似性和差异性。由于当前的机器学习流形-伽辽金(manifold-Galerkin)方法在构建降阶状态子空间时不考虑输入,我们将这些方法扩展到具有输入的动力学系统。在比较案例研究中,我们将八种成熟的模型降阶方法应用于空气分离过程模型:POD-伽辽金、非线性POD-伽辽金、流形-伽辽金、动态模式分解、Koopman理论、带潜在预测器的流形学习、隔室建模和模型聚合。本文不研究超降阶(hyperreduction),即浮点运算量的缩减。基于我们的发现,我们讨论了各模型降阶方法的优势和劣势。
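Of the eight methods compared, POD-Galerkin is the simplest to show in code. The sketch below builds a POD basis from state snapshots by SVD and projects the full dynamics onto it. It is a textbook-style illustration with hypothetical function names, it ignores inputs, and it performs no hyperreduction, so the full right-hand side is still evaluated at every step.

```python
import numpy as np

def pod_basis(snapshots, r):
    """POD basis from state snapshots.

    snapshots: (n_states, n_snapshots) matrix of simulated states
    returns:   V of shape (n_states, r) with orthonormal columns
    """
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]  # leading left singular vectors

def galerkin_rhs(f, V):
    """Reduced dynamics dz/dt = V^T f(V z) for full dynamics dx/dt = f(x)."""
    return lambda z: V.T @ f(V @ z)
```

The reduced state `z` has only `r` components; the full state is recovered approximately as `V @ z`.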
Be Wary of Your Time Series Preprocessing
Authors: Sofiane Ennadir, Tianze Wang, Oleg Smirnov, Sahar Asadi, Lele Cao
Venue: AAAI
First: 2026-02-19T17:23:56+00:00 · Latest: 2026-02-19T17:23:56+00:00
Comments: Accepted at the AI4TS workshop at AAAI-26
Abstract
Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model's ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model's representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
中文标题/摘要
标题:警惕您的时间序列预处理
归一化和缩放是时间序列建模中基本的预处理步骤,但它们在基于Transformer的模型中的作用从理论角度来看仍然未被充分探索。在本文中,我们首次对不同归一化策略,特别是实例基和全局缩放,如何影响基于Transformer的时间序列表示学习架构的表达能力进行了形式化分析。我们提出了一种针对时间序列定制的新型表达能力框架,该框架量化了模型在表示空间中区分相似和不相似输入的能力。利用该框架,我们为两种广泛使用的归一化方法:标准和最小-最大缩放,推导出了理论界值。我们的分析表明,归一化策略的选择可能会显著影响模型的表示能力,这取决于任务和数据特性。我们通过在分类和预测基准上使用多个基于Transformer的模型进行实证验证来补充我们的理论。结果显示,并非单一的归一化方法始终优于其他方法,在某些情况下,完全省略归一化可能会获得更好的性能。这些发现突显了预处理在时间序列学习中的关键作用,并促使需要为特定任务和数据集量身定制的更原则性的归一化策略。
Summary / 总结
This work analyzes the impact of normalization strategies on Transformer-based models for time series representation learning. It introduces a novel expressivity framework to quantify a model's ability to distinguish between similar and dissimilar inputs. The study derives theoretical bounds for Standard and Min-Max scaling and empirically validates these findings on classification and forecasting benchmarks. The results indicate that no single normalization method is universally superior, and in some cases, omitting normalization can lead to better performance, underscoring the importance of principled preprocessing in time series learning.
这项研究分析了归一化策略对基于Transformer的时间序列表示学习模型的影响。引入了一种新的表达能力框架来量化模型区分相似和不相似输入的能力。研究推导了标准和最小-最大归一化的理论界限,并在分类和预测基准上进行了实证验证。结果表明,并没有一种归一化方法在所有情况下都更优,在某些情况下,完全省略归一化可以取得更好的性能,突显了预处理在时间序列学习中的关键作用。
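As a minimal illustration of the two scaling methods named in the abstract, the sketch below applies Standard (z-score) and Min-Max scaling with statistics taken either per series (instance-based) or pooled over the whole dataset (global). The function names and the toy data are hypothetical and do not come from the paper; the example only shows why the choice can change which inputs a model is able to tell apart.

```python
from statistics import mean, pstdev

def standard_scale(x, mu, sigma, eps=1e-8):
    # z-score: subtract the mean, divide by the standard deviation
    return [(v - mu) / (sigma + eps) for v in x]

def minmax_scale(x, lo, hi, eps=1e-8):
    # map the range [lo, hi] onto roughly [0, 1]
    return [(v - lo) / (hi - lo + eps) for v in x]

def instance_standard(series):
    # statistics come from the single series being scaled
    return standard_scale(series, mean(series), pstdev(series))

def global_standard(dataset):
    # statistics are pooled over every series in the dataset
    pooled = [v for s in dataset for v in s]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [standard_scale(s, mu, sigma) for s in dataset]

def instance_minmax(series):
    return minmax_scale(series, min(series), max(series))

def global_minmax(dataset):
    pooled = [v for s in dataset for v in s]
    lo, hi = min(pooled), max(pooled)
    return [minmax_scale(s, lo, hi) for s in dataset]

# Toy data (hypothetical): same shape, very different level.
a = [1.0, 2.0, 3.0]
b = [101.0, 102.0, 103.0]
inst_a, inst_b = instance_standard(a), instance_standard(b)
glob_a, glob_b = global_standard([a, b])
```

With these toy series, instance-based scaling maps `a` and `b` to essentially the same vector, so level information is lost, while global scaling keeps them apart. Whether that loss helps or hurts depends on the task, which is consistent with the abstract's finding that no single method wins everywhere.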
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
Authors: Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao
Venue: ICLR 2026
First: 2026-02-19T17:13:44+00:00 · Latest: 2026-02-19T17:13:44+00:00
Comments: Accepted by ICLR 2026
Abstract
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equation (ODE)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
中文标题/摘要
标题:ODESteer:基于ODE的统一大型语言模型对齐引导框架
激活引导或表示工程提供了一种轻量级的方法,在推理时通过操控大型语言模型(LLMs)的内部激活来对其对齐。然而,当前的方法存在两个关键限制:(i) 缺乏一个统一的理论框架来指导引导方向的设计,(ii) 过度依赖于一步引导,未能捕捉激活分布的复杂模式。在本文中,我们提出了一种基于常微分方程(ODEs)的统一激活引导理论框架,用于LLM对齐。我们表明,传统的激活添加可以被解释为ODE解的一阶近似。基于这种ODE视角,确定一个引导方向等同于设计一个控制理论中的障碍函数。基于此框架,我们引入了ODESteer,这是一种由障碍函数引导的基于ODE的引导方法,展示了在LLM对齐中的实证进步。ODESteer通过将障碍函数定义为正激活和负激活的对数密度比来确定引导方向,并利用它构建一个ODE进行多步和自适应引导。与最先进的激活引导方法相比,ODESteer在各种LLM对齐基准测试中实现了一致的实证改进,在TruthfulQA上提升5.7%,在UltraFeedback上提升2.5%,在RealToxicityPrompts上提升2.4%。我们的工作通过使用ODE统一激活引导的理论基础,建立了LLM对齐中激活引导的原则性新视角,并通过提出的ODESteer方法进行了实证验证。
Summary / 总结
The paper proposes ODESteer, an ODE-based framework for aligning large language models (LLMs) through activation steering. It addresses the limitations of current methods by providing a unified theoretical framework and enabling multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer shows empirical improvements on various LLM alignment benchmarks, including a 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.
论文提出了一种基于ODE的统一框架来解决当前激活引导方法在大型语言模型(LLM)对齐中的局限性。该框架将传统的激活添加解释为ODE的一阶近似,并引入了由屏障函数引导的ODESteer,用于实现多步和自适应的引导。实验结果显示,ODESteer在TruthfulQA、UltraFeedback和RealToxicityPrompts等基准测试中分别取得了5.7%、2.5%和2.4%的显著改进。
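The claim that activation addition is a first-order approximation to an ODE solution can be made concrete with a small sketch. It integrates dh/dt = v(h) with explicit Euler steps: one step reproduces plain activation addition h + T·v(h), while several steps re-evaluate the direction along the way. Everything specific here is an assumption for illustration and not the paper's implementation: the steering field is taken to be the gradient of a log-density ratio between two isotropic Gaussians with hypothetical means and variances.

```python
def grad_log_density_ratio(h, mu_pos, var_pos, mu_neg, var_neg):
    # Gradient of log N(h; mu_pos, var_pos*I) - log N(h; mu_neg, var_neg*I).
    # Assumption: isotropic Gaussians stand in for the activation densities.
    return [-(x - p) / var_pos + (x - n) / var_neg
            for x, p, n in zip(h, mu_pos, mu_neg)]

def steer(h, field, total_time=1.0, steps=1):
    # Explicit Euler integration of dh/dt = field(h).
    # steps=1 is ordinary activation addition: h + total_time * field(h).
    dt = total_time / steps
    h = list(h)
    for _ in range(steps):
        v = field(h)  # direction is recomputed at the current state
        h = [x + dt * d for x, d in zip(h, v)]
    return h

# Hypothetical 2-D "activations".
mu_pos, mu_neg = [1.0, 0.0], [-1.0, 0.0]
field = lambda h: grad_log_density_ratio(h, mu_pos, 1.0, mu_neg, 4.0)

h0 = [0.0, 0.0]
one_step = steer(h0, field, total_time=1.0, steps=1)
multi_step = steer(h0, field, total_time=1.0, steps=10)
```

Because the field depends on the current activation, the ten-step result differs from the single-step one; with a constant direction the two would coincide, which is the sense in which one-step addition is the first-order special case.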
Revisiting Weight Regularization for Low-Rank Continual Learning
Authors: Yaoyue Zheng, Yin Zhang, Joost van de Weijer, Gido M van de Ven, Shaoyi Du, Xuetao Zhang, Zhiqiang Tian
Venue: ICLR 2026
First: 2026-02-19T17:13:00+00:00 · Latest: 2026-02-19T17:13:00+00:00
Comments: Accepted by ICLR 2026
Abstract
Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method EWC-LoRA leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computation- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA, achieving a stability-plasticity trade-off superior to existing low-rank CL approaches. These results indicate that, even under low-rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: https://github.com/yaoyz96/low-rank-cl.
中文标题/摘要
标题:重新审视低秩持续学习中的权重正则化
大规模预训练模型(PTMs)驱动的持续学习(CL)最近引起了广泛关注,焦点从从头训练转向持续适应PTMs。这催生了一个有前景的范式:参数高效持续学习(PECL),其中任务干扰通常通过在训练期间分配任务特定模块(例如低秩适配器)来缓解。然而,权重正则化技术,如作为CL关键策略的弹性权重巩固(EWC),在这一新范式中仍未得到充分探索。本文重新审视了低秩CL中的权重正则化,作为PECL中缓解任务干扰的新视角。与现有低秩CL方法不同,我们通过EWC正则化共享的低秩更新来缓解任务干扰,从而使存储需求和推理成本保持不变,与任务数量无关。我们提出的方法EWC-LoRA利用低秩表示来估计参数在全维空间中的重要性。该设计为使用PTMs的持续学习提供了一种实用且计算和内存高效的解决方案,并为PECL中正则化技术的更广泛应用提供了见解。在各种基准上的大量实验证明了EWC-LoRA的有效性,其稳定性-可塑性权衡优于现有低秩CL方法。这些结果表明,即使在低秩参数化下,权重正则化仍然是缓解任务干扰的有效机制。代码可在 https://github.com/yaoyz96/low-rank-cl 获取。
Summary / 总结
This paper revisits weight regularization techniques, particularly Elastic Weight Consolidation (EWC), in the context of parameter-efficient continual learning (PECL) with large-scale pre-trained models (PTMs). The authors propose EWC-LoRA, which uses a low-rank representation to regularize shared low-rank updates, thereby maintaining constant storage and inference costs. Experiments show that EWC-LoRA achieves a better stability-plasticity trade-off compared to existing low-rank CL approaches, indicating the effectiveness of weight regularization in mitigating task interference even under low-rank parameterizations.
本文重新审视了低秩连续学习(CL)中的权重正则化技术,以减轻参数高效连续学习(PECL)中的任务干扰。提出的EWC-LoRA方法使用弹性权重巩固(EWC)来正则化共享的低秩更新,从而保持存储和推理成本的恒定。在各种基准上的实验表明,EWC-LoRA在稳定性-可塑性权衡方面优于现有的低秩CL方法,表明即使在低秩参数化下,权重正则化仍然是减轻任务干扰的有效机制。
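To make the idea of regularizing a shared low-rank update with EWC more tangible, here is a minimal sketch under stated assumptions. It forms the full-dimensional update ΔW = B·A from two small factors and applies the standard EWC quadratic penalty, (λ/2)·Σ F·(ΔW − ΔW_prev)², entry by entry. The helper names, the toy matrices, and the way the importance weights `fisher` are supplied are all hypothetical; how EWC-LoRA actually estimates importance is not described in the abstract.

```python
def matmul(B, A):
    # Plain-Python matrix product: (m x r) times (r x n) -> (m x n)
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]

def ewc_penalty(delta_w, delta_w_prev, fisher, lam=1.0):
    # Standard EWC form: importance-weighted squared drift from the
    # solution found after earlier tasks.
    total = 0.0
    for row, row_prev, row_f in zip(delta_w, delta_w_prev, fisher):
        for w, w0, f in zip(row, row_prev, row_f):
            total += f * (w - w0) ** 2
    return 0.5 * lam * total

# Hypothetical rank-1 factors of a 2x2 update.
B = [[1.0], [2.0]]
A = [[1.0, 0.0]]
delta_w = matmul(B, A)                   # full-dimensional update
delta_w_prev = [[0.0, 0.0], [0.0, 0.0]]  # update kept from earlier tasks
fisher = [[1.0, 1.0], [0.5, 1.0]]        # assumed per-entry importance
penalty = ewc_penalty(delta_w, delta_w_prev, fisher)
```

In training, a term like `penalty` would be added to the new task's loss. Only the two factors and one importance estimate need to be kept, which is in line with the abstract's point that storage and inference cost stay constant as tasks accumulate.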
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
First: 2026-02-19T17:11:59+00:00 · Latest: 2026-02-19T17:11:59+00:00
Comments: 10 pages, 6 figures
Abstract
Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
中文标题/摘要
标题:RetouchIQ:基于指令的图像润色MLLM代理及其通用奖励
近期多模态大型语言模型(MLLM)的发展为将视觉-语言推理扩展到专业工具图像编辑提供了巨大潜力,使其能够进行直观和创造性的编辑。一个有前景的方向是使用强化学习(RL)使MLLM能够理解和执行在专业图像编辑软件中的最佳工具使用计划。然而,由于缺乏可靠的、可验证的奖励信号来反映创意编辑的主观性,训练仍然具有挑战性。在本文中,我们介绍了RetouchIQ框架,该框架通过由通用奖励模型引导的MLLM代理执行基于指令的可执行图像编辑。RetouchIQ解释用户指定的编辑意图并生成相应的可执行图像调整,将高层次的美学目标与精确的参数控制相结合。为了超越传统的基于规则的奖励,这些奖励使用手工制作的度量标准计算与固定参考图像的相似性,我们提出了一种通用奖励模型,这是一种针对编辑结果通过一组生成的度量标准进行评估的RL微调MLLM。然后,奖励模型通过多模态推理提供标量反馈,使强化学习能够获得高质量、指令一致的梯度。我们收集了一个扩展的数据集,包含19万个指令-推理对,并建立了基于指令的图像编辑的新基准。实验表明,RetouchIQ在语义一致性和感知质量方面显著优于之前的基于MLLM和扩散的编辑系统。我们的研究结果表明,通用奖励驱动的MLLM代理具有作为灵活、可解释和可执行的专业图像编辑助手的潜力。
Summary / 总结
RetouchIQ is a framework that uses MLLM agents guided by a generalist reward model to perform instruction-based image retouching. It interprets user intentions and generates executable image adjustments, improving semantic consistency and perceptual quality compared to previous systems. The framework introduces a generalist reward model that provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality gradients.
RetouchIQ 是一个框架,通过由通用奖励模型引导的 MLLM 代理执行基于指令的图像润色。它解释用户意图并生成可执行的图像调整。通用奖励模型通过一组生成的指标评估编辑结果,并通过多模态推理为强化学习提供标量反馈。实验表明,与现有的基于 MLLM 和扩散模型的编辑系统相比,RetouchIQ 在语义一致性和感知质量上都有显著改进。
Neural Implicit Representations for 3D Synthetic Aperture Radar Imaging
Authors: Nithin Sugavanam, Emre Ertin
First: 2026-02-19T17:10:37+00:00 · Latest: 2026-02-19T17:10:37+00:00
Abstract
Synthetic aperture radar (SAR) is a tomographic sensor that measures 2D slices of the 3D spatial Fourier transform of the scene. In many operational scenarios, the measured set of 2D slices does not fill the 3D space in the Fourier domain, resulting in significant artifacts in the reconstructed imagery. Traditionally, simple priors, such as sparsity in the image domain, are used to regularize the inverse problem. In this paper, we review our recent work that achieves state-of-the-art results in 3D SAR imaging employing neural structures to model the surface scattering that dominates SAR returns. These neural structures encode the surface of the objects in the form of a signed distance function learned from the sparse scattering data. Since estimating a smooth surface from a sparse and noisy point cloud is an ill-posed problem, we regularize the surface estimation by sampling points from the implicit surface representation during the training step. We demonstrate the model's ability to represent target scattering using measured and simulated data from single vehicles and a larger scene with a large number of vehicles. We conclude with future research directions calling for methods to learn complex-valued neural representations to enable synthesizing new collections from the volumetric neural implicit representation.
中文标题/摘要
标题:神经隐式表示在3D合成孔径雷达成像中的应用
合成孔径雷达(SAR)是一种层析传感器,测量场景三维空间傅里叶变换的二维切片。在许多实际应用场景中,测量到的二维切片集并不能填满傅里叶域中的三维空间,导致重建图像中存在显著的伪影。传统上,使用简单的先验知识(如图像域中的稀疏性)来正则化逆问题。在本文中,我们回顾了我们最近的工作,该工作使用神经结构来建模主导SAR回波的表面散射,从而在3D SAR成像中达到最先进的结果。这些神经结构以从稀疏散射数据中学习到的符号距离函数的形式编码物体的表面。由于从稀疏且含噪声的点云中估计光滑表面是一个病态问题,我们在训练步骤中通过从隐式表面表示中采样点来正则化表面估计。我们使用单辆车和包含大量车辆的大型场景的实测和模拟数据,展示了该模型表示目标散射的能力。最后,我们提出了未来研究方向,呼吁研究学习复值神经表示的方法,以便从体积神经隐式表示中合成新的采集数据。
Summary / 总结
This paper addresses the challenge of reconstructing 3D imagery from sparse 2D Synthetic Aperture Radar (SAR) data by employing neural implicit representations. The method models the surface scattering using a signed distance function learned from sparse data, and regularizes the surface estimation by sampling points during training. Key experimental findings show that the model can effectively represent target scattering from both single vehicles and a complex scene with multiple vehicles, achieving state-of-the-art results in 3D SAR imaging.
本文通过使用神经隐式表示来解决从稀疏的2D合成孔径雷达(SAR)数据重建3D图像的挑战。该方法使用从稀疏数据中学习的符号距离函数来建模表面散射,并在训练过程中通过采样点来正则化表面估计。实验结果表明,该模型能够有效地从单个车辆和包含多个车辆的复杂场景中表示目标散射,实现了3D SAR成像的最新成果。
GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
Authors: Zixu Cheng, Da Li, Jian Hu, Ziquan Liu, Wei Li, Shaogang Gong
First: 2026-02-19T17:09:30+00:00 · Latest: 2026-02-19T17:09:30+00:00
Comments: Under review
Abstract
Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
中文标题/摘要
标题:GraphThinker:通过事件图思考强化视频推理
视频推理需要理解视频中事件之间的因果关系。然而,这些关系往往是隐含的,手动标注成本高昂。虽然现有的多模态大型语言模型(MLLM)通常通过密集字幕或视频摘要来推断事件关系,但这种建模仍然缺乏因果理解。在视频事件内部和跨事件之间没有明确的因果结构建模的情况下,这些模型在视频推理过程中会表现出幻觉。在本文中,我们提出了一种基于强化微调的方法——GraphThinker,该方法构建结构化的事件级场景图并增强视觉定位,以联合减少视频推理中的幻觉。具体而言,我们首先使用MLLM构建基于事件的视频场景图(EVSG),明确建模事件内的和事件间的联系,并将这些形成的场景图作为中间思考过程集成到MLLM中。我们还在强化微调过程中引入了视觉注意力奖励,这加强了视频定位并进一步减轻了幻觉。我们在RexTime和VidHalluc两个数据集上评估了GraphThinker,结果显示它在捕捉对象和事件关系方面具有更强的能力,能够更精确地定位事件,从而在视频推理中减少幻觉,优于先前的方法。
Summary / 总结
GraphThinker is a reinforcement finetuning-based method that constructs event-level scene graphs to improve causal understanding in video reasoning. It uses an MLLM to create an event-based video scene graph (EVSG) that models both intra- and inter-event relations, and incorporates these graphs into the MLLM for better visual grounding. The method also introduces a visual attention reward during reinforcement finetuning to further reduce hallucinations. On the RexTime and VidHalluc datasets, GraphThinker demonstrates superior performance in capturing object and event relations with more precise event localization and reduced hallucinations compared to previous methods.
GraphThinker 是一种基于强化微调的方法,通过构建事件级场景图来增强视觉定位并减少视频推理中的幻觉。它使用 MLLM 创建一个基于事件的视频场景图 (EVSG),以建模事件内的和事件间的联系,并将这些图融入 MLLM 中进行中间思考。此外,在强化微调期间引入视觉注意力奖励,进一步增强视频定位。GraphThinker 在 RexTime 和 VidHalluc 数据集上的表现优于先前的方法,提供了更精确的事件定位并减少了幻觉。
MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
First: 2026-02-19T17:05:20+00:00 · Latest: 2026-02-19T17:05:20+00:00
Abstract
Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: https://anonymous.4open.science/r/ma1/README.md.
中文标题/摘要
标题:MASPO:统一梯度利用、概率质量与信号可靠性以实现稳健且样本高效的LLM推理
现有的可验证奖励强化学习(RLVR)算法,如GRPO,依赖于刚性、均匀和对称的信任区域机制,这些机制与大型语言模型(LLMs)复杂的优化动态从根本上不一致。在本文中,我们识别了这些方法中的三个关键挑战:(1)由于硬裁剪的二元截止导致的梯度利用效率低下,(2)由于均匀比率约束忽略词元分布而导致的对概率质量不敏感,(3)由于正负样本之间信用分配模糊程度不同而导致的信号可靠性不对称。为了弥合这些差距,我们提出了质量自适应软策略优化(MASPO),这是一个旨在协调这三个维度的统一框架。MASPO结合了可微的软高斯门控以最大化梯度利用,质量自适应限制器以平衡概率谱上的探索,以及不对称风险控制器以使更新幅度与信号置信度对齐。广泛的评估表明,MASPO是一个稳健的一体化RLVR解决方案,显著优于强大的基线。我们的代码可在 https://anonymous.4open.science/r/ma1/README.md 获取。
Summary / 总结
This paper addresses the limitations of existing RLVR algorithms like GRPO by proposing MASPO, which unifies gradient utilization, probability mass, and signal reliability. MASPO introduces a differentiable soft Gaussian gating, a mass-adaptive limiter, and an asymmetric risk controller to optimize these aspects. The extensive evaluations show that MASPO outperforms strong baselines and provides a robust RLVR solution for LLMs.
本文针对现有RLVR算法如GRPO的局限性,提出了MASPO,该算法统一了梯度利用、概率质量以及信号可靠性。MASPO引入了可微软高斯门控、质量自适应限制器和不对称风险控制器来优化这些方面。广泛的评估显示,MASPO在性能上显著优于强基线,并为LLMs提供了一个稳健的RLVR解决方案。
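The contrast between the binary cutoff of hard clipping and a differentiable soft gate can be sketched as follows. The hard gate follows the usual PPO-style clipped surrogate, where a sample contributes no gradient once its probability ratio leaves the trust region in the direction favoured by its advantage. The Gaussian gate is only an assumed illustrative form, exp(-(r-1)²/(2σ²)); the abstract does not give MASPO's actual gating function, and `sigma` and `eps` are hypothetical parameters.

```python
import math

def hard_clip_gate(ratio, advantage, eps=0.2):
    # Gradient mask implied by min(r*A, clip(r, 1-eps, 1+eps)*A):
    # the sample is dropped (0.0) when clipping is active, kept (1.0) otherwise.
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def soft_gaussian_gate(ratio, sigma=0.2):
    # Assumed soft alternative: weight decays smoothly with distance
    # from ratio = 1 instead of switching off abruptly.
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))

# A sample slightly outside the clip range with positive advantage.
ratio, advantage = 1.3, 1.0
hard_weight = hard_clip_gate(ratio, advantage)
soft_weight = soft_gaussian_gate(ratio)
```

Under these assumptions the hard gate discards the sample entirely while the soft gate still passes a reduced gradient, which is the kind of inefficiency the first of the three listed challenges refers to.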
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Authors: Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi
First: 2026-02-19T17:01:08+00:00 · Latest: 2026-02-19T17:01:08+00:00
Abstract
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
中文标题/摘要
标题:KLong:训练用于极长时域任务的LLM代理
本文介绍了KLong,一个开源的LLM代理,经训练用于解决极长时域任务。其原则是首先通过轨迹分割监督微调(SFT)冷启动模型,然后通过渐进式强化学习(RL)训练进行扩展。具体来说,我们首先使用全面的SFT方案激活基础模型的基本代理能力。然后,我们引入了Research-Factory,这是一个自动化流水线,通过收集研究论文并构建评估标准来生成高质量的训练数据。使用此流水线,我们构建了数千条从Claude 4.5 Sonnet(Thinking)蒸馏得到的长时域轨迹。为了使用这些极长的轨迹进行训练,我们提出了一种新的轨迹分割SFT方法,该方法保留早期上下文,逐步截断后期上下文,并保持子轨迹之间的重叠。此外,为了进一步提高解决长时域任务的能力,我们提出了一种新的渐进式RL,该方法将训练分为多个阶段,并逐步延长超时时间。实验证明了KLong的优越性和泛化能力,如图1所示。值得注意的是,我们提出的KLong(106B)在PaperBench上的表现比Kimi K2 Thinking(1T)高出11.28%,并且性能改进也能泛化到其他编程基准,如SWE-bench Verified和MLE-bench。
Summary / 总结
KLong is an open-source LLM agent designed for solving extremely long-horizon tasks. It is trained using a two-step process: initial cold-start via trajectory-splitting SFT and subsequent scaling through progressive RL training. Key experimental results show that KLong outperforms Kimi K2 Thinking by 11.28% on PaperBench and demonstrates generalization to other coding benchmarks like SWE-bench Verified and MLE-bench.
KLong 是一个开源的 LLM 代理,用于处理极长时域任务。它通过轨迹分割 SFT 冷启动模型,并通过渐进式 RL 训练进行扩展。关键发现表明,KLong 在 PaperBench 上比 Kimi K2 Thinking 高出 11.28%,并且在其他编码基准如 SWE-bench Verified 和 MLE-bench 上也表现出良好的泛化能力。
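One way to picture trajectory-splitting SFT is the sketch below. It keeps a fixed prefix of early steps in every sub-trajectory and slides an overlapping window over the remaining steps, so each training sample fits a bounded length. The parameters `head`, `window`, and `overlap`, and the exact truncation rule, are assumptions made for illustration; the abstract states only that early context is preserved, later context is progressively truncated, and sub-trajectories overlap.

```python
def split_trajectory(steps, head=2, window=4, overlap=1):
    # Every sub-trajectory = preserved early context + one window of later steps.
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    prefix, rest = steps[:head], steps[head:]
    stride = window - overlap  # consecutive windows share `overlap` steps
    chunks, start = [], 0
    while True:
        chunks.append(prefix + rest[start:start + window])
        if start + window >= len(rest):
            break  # the last window reached the end of the trajectory
        start += stride
    return chunks

# A toy trajectory of ten steps, labelled 0..9.
chunks = split_trajectory(list(range(10)), head=2, window=4, overlap=1)
```

For the toy input this yields three sub-trajectories that all start with steps 0 and 1, and each later window repeats the final step of the previous one, so no boundary step is seen without its neighbour.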
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Authors: Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar
First: 2026-02-19T16:59:11+00:00 · Latest: 2026-02-19T16:59:11+00:00
Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
中文标题/摘要
标题:通过可重用性和可验证性评估链式思考推理
在诸如搜索和排名等任务的多智能体信息检索流水线中,基于LLM的智能体以链式思考(CoT)的形式相互交换中间推理。当前的CoT评估仅狭隘地关注目标任务的准确性。然而,这一指标未能评估推理过程本身的质量或实用性。为解决这一局限,我们引入了两个新的度量标准:可重用性和可验证性。我们使用思考者-执行者框架将CoT的生成与执行分离。可重用性衡量执行者重新使用思考者CoT的难易程度。可验证性衡量执行者使用CoT匹配思考者答案的频率。我们在五个基准上对四种思考者模型与十个执行者模型组成的委员会进行了评估。我们的结果表明,可重用性和可验证性与标准准确性无关,揭示了当前基于准确性的推理能力排行榜中的盲点。令人惊讶的是,我们发现专门的推理模型的CoT并不比通用的LLM(如Llama和Gemma)更一致地具有更高的可重用性和可验证性。
Summary / 总结
The study evaluates Chain-of-Thought (CoT) reasoning in multi-agent information retrieval pipelines by introducing two new metrics: reusability and verifiability. These metrics assess the quality and utility of CoT rather than just task accuracy. Using a Thinker-Executor framework, the researchers found that CoTs from specialized reasoning models are not necessarily more reusable or verifiable than those from general-purpose models like Llama and Gemma, highlighting a limitation in current accuracy-based evaluations of reasoning capabilities.
研究通过引入可重用性和可验证性作为新指标,评估多代理信息检索管道中的链式思考(CoT)推理。这些指标评估CoT的质量和实用性,而非仅仅任务准确性。使用思考者-执行者框架,研究发现,专门化模型的CoT并不比通用模型(如Llama和Gemma)的CoT更具有可重用性和可验证性,这揭示了当前基于准确性的推理能力排行榜的局限性。
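The verifiability measure, described as how frequently an Executor matches the Thinker's answer when given the CoT, can be sketched as a simple match rate over a committee of Executors. The data layout (a list of Thinker answers and a dictionary of per-Executor answer lists) and the exact-match comparison are assumptions for illustration; the paper may aggregate or compare answers differently.

```python
def verifiability(thinker_answers, executor_answers):
    # Fraction of (executor, item) pairs where the Executor, given the
    # Thinker's CoT, arrived at the same answer as the Thinker.
    matches = total = 0
    for answers in executor_answers.values():
        for t, e in zip(thinker_answers, answers):
            matches += int(t == e)
            total += 1
    return matches / total if total else 0.0

# Hypothetical answers on three benchmark items.
thinker = ["A", "B", "C"]
committee = {
    "executor_1": ["A", "B", "D"],
    "executor_2": ["A", "C", "C"],
}
score = verifiability(thinker, committee)
```

Note that this score never consults the gold label, so a Thinker can be highly verifiable while being wrong, which is consistent with the reported lack of correlation with standard accuracy.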
IRIS: Learning-Driven Task-Specific Cinema Robot Arm for Visuomotor Motion Control
Authors: Qilong Cheng, Matthew Mackay, Ali Bereyhi
First: 2026-02-19T16:50:31+00:00 · Latest: 2026-02-19T16:50:31+00:00
Abstract
Robotic camera systems enable dynamic, repeatable motion beyond human capabilities, yet their adoption remains limited by the high cost and operational complexity of industrial-grade platforms. We present the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS integrates a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns object-aware and perceptually smooth camera trajectories directly from human demonstrations, eliminating the need for explicit geometric programming. The complete platform costs under $1,000 USD, supports a 1.5 kg payload, and achieves approximately 1 mm repeatability. Real-world experiments demonstrate accurate trajectory tracking, reliable autonomous execution, and generalization across diverse cinematic motions.
中文标题/摘要
标题:IRIS:基于学习的任务特定电影机器人手臂用于视动运动控制
机器人摄像系统能够实现超越人类能力的动态、可重复运动,但其采用受限于工业级平台的高成本和操作复杂性。我们介绍了智能机器人成像系统(IRIS),这是一种专为自主、基于学习的电影运动控制设计的任务特定6自由度机械臂。IRIS 结合了轻量级的全3D打印硬件设计和基于 Action Chunking with Transformers(ACT)的目标条件视觉运动模仿学习框架。该系统直接从人类示范中学习具有对象感知且感知上平滑的摄像机轨迹,消除了显式几何编程的需要。整个平台成本低于1000美元,支持1.5公斤负载,并实现约1毫米的重复精度。真实世界实验表明,该系统能够准确跟踪轨迹、可靠自主执行,并在多种电影运动中泛化。
Summary / 总结
The research aims to develop a cost-effective and easy-to-use robotic camera system for cinematic motion control. IRIS, a 6-DOF manipulator, integrates a lightweight 3D-printed hardware with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). Key findings include the ability to learn object-aware and smooth camera trajectories from human demonstrations, achieving 1 mm repeatability and successful autonomous execution in diverse cinematic motions.
研究旨在开发一种低成本且易于使用的机器人摄像系统,用于电影运动控制。IRIS 是一个6自由度的机械臂,集成了轻量级的3D打印硬件设计和基于Action Chunking with Transformers的有目标条件的视觉运动模仿学习框架。系统可以从人类演示中学习摄像机轨迹,并能够准确跟踪轨迹、自主执行并跨多种电影运动进行泛化。该平台成本低于1000美元,承载能力为1.5公斤,重复精度为1毫米。
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage becomes unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
中文标题/摘要
标题:LATA:用于医疗VLM共形不确定性的拉普拉斯辅助直推式适应
医疗视觉-语言模型(VLMs)在医疗成像中具有强大的零样本识别能力,但其在领域偏移下的可靠性依赖于有保证的校准不确定性。分割共形预测(SCP)提供了有限样本覆盖保证,但预测集通常变得很大(效率低),并且类别间的覆盖不平衡(类别条件覆盖差距CCV较高),尤其是在少样本、类别不平衡的情况下;此外,直接适应校准标签会破坏可交换性并使保证失效。我们提出了LATA(拉普拉斯辅助直推式适应),这是一种无需训练和标签的改进方法,它作用于校准集与测试集的联合池,使用少量CCCP均场更新在图像-图像k-NN图上平滑零样本概率,并通过确定性变换保持SCP的有效性。我们还引入了一种失败感知的共形分数,将其插入视觉-语言不确定性(ViLU)框架中,提供实例级的难度和标签合理性,以在固定覆盖率下提高预测集效率和类别间的平衡。LATA是黑盒的(不更新VLM),计算量小(窗口化直推,无反向传播),并包括一个可选的先验旋钮,可以完全不依赖标签运行,或者在需要时以一次性使用校准边缘分布的标签知情变体运行。在三个医疗VLM和九个下游任务上,LATA始终减少预测集大小和CCV,同时匹配或收紧目标覆盖,优于先前的直推式基线,并缩小与使用标签方法的差距,同时使用少得多的计算资源。全面的消融实验和定性分析表明,LATA在不牺牲可交换性的情况下锐化了零样本预测。
Summary / 总结
LATA (Laplacian-Assisted Transductive Adaptation) is a training- and label-free method for improving the reliability of medical vision-language models under domain shift. It operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, and introduces a failure-aware conformal score to enhance prediction set efficiency and class-wise balance. LATA consistently reduces set size and the class-conditioned coverage gap (CCV) while matching or tightening target coverage across three medical VLMs and nine downstream tasks, outperforming previous transductive baselines and using less compute.
LATA(Laplacian-Assisted Transductive Adaptation)是一种无需训练和标签的方法,旨在提高医疗视觉语言模型(VLMs)在领域偏移下的可靠性。它作用于校准集与测试集的联合池,使用少量的CCCP均场更新在图像-图像k-NN图上平滑零样本概率,同时保持分割共形预测的有效性。LATA还引入了一种失败感知的共形分数,以增强预测集的效率和类别间的平衡。实验表明,LATA在三个医疗VLMs和九个下游任务上减少了预测集大小和类别条件覆盖差距,同时匹配或收紧目标覆盖,且计算成本更低,优于之前的直推式基线方法。
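For readers unfamiliar with the baseline that LATA refines, the following is a minimal sketch of standard split conformal prediction for classification, not of LATA itself. It uses the common nonconformity score 1 − p(true label), takes the ⌈(n+1)(1−α)⌉-th smallest calibration score as the threshold, and returns every class whose score falls under it. The calibration scores and probabilities are made-up toy values.

```python
import math

def conformal_quantile(cal_scores, alpha):
    # Finite-sample corrected quantile of the calibration scores.
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    # Keep each class whose nonconformity score 1 - p is within the threshold.
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration scores: 1 - (probability assigned to the true class).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
qhat = conformal_quantile(cal_scores, alpha=0.25)

# Toy zero-shot probabilities for one test image over three classes.
pred = prediction_set([0.5, 0.45, 0.05], qhat)
```

The guarantee rests on calibration and test points being exchangeable, which is why the abstract stresses that any refinement of the probabilities must be applied to both pools through the same deterministic transform.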
Position: Evaluation of ECG Representations Must Be Fixed
Authors: Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz
First: 2026-02-19T16:42:46+00:00 · Latest: 2026-02-19T16:42:46+00:00
Comments: Project website at https://ecgfix.csail.mit.edu/
Abstract
This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
中文标题/摘要
标题:立场:必须修正12导联ECG表示的评估方法
这篇立场论文认为,必须修正当前12导联ECG表示学习的基准测试实践,以确保进展可靠且与临床有意义的目标一致。该领域已基本达成共识,主要依赖三个公共多标签基准(PTB-XL、CPSC2018、CSN),这些基准主要关注心律失常和波形形态标签,尽管已知ECG还包含更广泛临床信息。我们主张,下游评估应扩展到包括对结构性心脏病和患者水平预测的评估,以及其他不断发展的ECG相关终点,作为相关临床目标。其次,我们概述了多标签、不平衡设置下的评估最佳实践,并表明当这些实践被应用时,文献中关于哪种表示方法表现最佳的结论发生了改变。此外,我们展示了令人惊讶的结果,即随机初始化的编码器与线性评估匹配许多任务上的最新预训练模型。这促使使用随机编码器作为合理的基线模型。我们通过在六个评估设置中对三种代表性ECG预训练方法进行实证评估,证实了我们的观察:三个标准基准、结构性疾病数据集、血流动力学推断和患者预测。
Summary / 总结
This paper argues for reforming the benchmarking practices in 12-lead ECG representation learning to better align with clinical objectives. It suggests expanding evaluation to include structural heart disease and patient-level forecasting, and outlines best practices for evaluating multi-label, imbalanced data. The study shows that applying these practices changes the conclusions about which representations perform best and demonstrates that a randomly initialized encoder can match state-of-the-art pre-training on many tasks, suggesting its use as a baseline model. Empirical evaluations across various settings support these claims.
本文主张改进12导联ECG表示学习的基准测试实践,以更好地与临床意义的目标对齐。建议扩展评估范围,包括结构性心脏病和患者级预测,并概述了多标签、不平衡设置的最佳实践。研究显示,随机初始化的编码器在许多任务上可以匹配最先进的预训练效果,这促使将其作为基准模型使用。通过各种设置的实证评估揭示了当前关于表示性能结论可能需要重新评估。
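The abstract does not say which metrics the authors recommend, so the following is only a minimal sketch of why the averaging choice matters in multi-label, imbalanced evaluation. It compares macro-averaged AUROC (each label weighted equally) with micro-averaged AUROC (all label decisions pooled) on hypothetical toy data. The function names and the data are assumptions made for illustration and do not come from the paper.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined unless both classes are present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Average the per-label AUROC so that rare labels weigh as much as common ones."""
    n_labels = len(y_true[0])
    per_label = [
        auroc([row[j] for row in y_true], [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def micro_auroc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Pool every (sample, label) decision into one ranking problem."""
    flat_y = [v for row in y_true for v in row]
    flat_s = [v for row in y_score for v in row]
    return auroc(flat_y, flat_s)


# Hypothetical toy data: label 0 is common and well ranked, label 1 is rare and poorly ranked.
Y_TRUE = [[1, 0], [1, 0], [0, 0], [0, 1]]
Y_SCORE = [[0.9, 0.6], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]

MACRO = macro_auroc(Y_TRUE, Y_SCORE)
MICRO = micro_auroc(Y_TRUE, Y_SCORE)
```

On this toy data the pooled (micro) score is noticeably higher than the macro score, because the poorly ranked rare label contributes few decisions to the pool. A model ranking built on one average need not hold under the other.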
Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation
Authors: Dun Yuan, Hao Zhou, Xue Liu, Hao Chen, Yan Xin, Jianzhong Zhang
First: 2026-02-19T16:40:17+00:00 · Latest: 2026-02-19T16:40:17+00:00
Abstract
Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased hallucinations and reduced utility in telecom operations. To address these limitations, this work introduces KG-RAG, a novel framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG) to enhance LLMs for telecom-specific tasks. In particular, the KG provides a structured representation of domain knowledge derived from telecom standards and technical documents, while RAG enables dynamic retrieval of relevant facts to ground the model's outputs. Such a combination improves factual accuracy, reduces hallucination, and ensures compliance with telecom specifications. Experimental results across benchmark datasets demonstrate that KG-RAG outperforms both LLM-only and standard RAG baselines, e.g., KG-RAG achieves an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models. These results highlight KG-RAG's effectiveness in producing accurate, reliable, and explainable outputs in complex telecom scenarios.
中文标题/摘要
标题:利用动态知识图谱和可解释检索增强生成提升电信领域的大型语言模型
大型语言模型(LLMs)在多种任务中展现了强大的潜力,但在电信领域的应用仍面临挑战,这主要是由于领域复杂性、不断演化的标准以及专业术语。因此,通用领域的LLMs在电信场景中可能难以提供准确和可靠的结果,导致增加幻觉并降低电信操作的实用性。为解决这些限制,本研究引入了KG-RAG——一种将知识图谱(KGs)与检索增强生成(RAG)相结合的新框架,以提升LLMs在电信特定任务中的表现。特别是,知识图谱提供了从电信标准和技术文档中推导出的领域知识的结构化表示,而RAG则允许动态检索相关事实以支撑模型的输出。这种结合提高了事实准确性,减少了幻觉,并确保符合电信规范。基准数据集上的实验结果表明,KG-RAG在准确度上优于仅使用LLM和标准RAG基线,例如,KG-RAG在准确度上分别比RAG和仅使用LLM的模型提高了14.3%和21.6%。这些结果突显了KG-RAG在复杂电信场景中生成准确、可靠和可解释输出的有效性。
Summary / 总结
This work aims to enhance large language models (LLMs) for telecom applications by introducing KG-RAG, a framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG). The KG provides structured domain knowledge from telecom standards, while RAG dynamically retrieves relevant facts to ground the model's outputs. Experimental results show that KG-RAG outperforms both LLM-only and standard RAG baselines, achieving an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models.
本文通过引入KG-RAG框架,将知识图谱与检索增强生成相结合,以解决通用领域的大语言模型在电信领域中的局限性。该框架通过提供结构化的领域知识表示和动态检索相关事实来增强LLMs,以适应电信特定任务。实验结果表明,KG-RAG在准确性和可靠性方面均优于LLM-only和标准RAG基线,分别提高了14.3%和21.6%的准确率。
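The general idea of grounding an answer in knowledge-graph facts can be sketched as follows. This is not the authors' KG-RAG implementation: the toy triples, the keyword-overlap scoring, and the names `KG`, `retrieve`, and `build_prompt` are all assumptions made for illustration, and a real system would use a far larger graph and a learned retriever. The sketch only shows how retrieved triples can be placed in a prompt so that the facts behind an answer remain traceable.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical toy knowledge graph; a real one would be built from telecom standards.
KG: List[Triple] = [
    ("gNB", "connects_to", "5G core"),
    ("gNB", "serves", "UE"),
    ("AMF", "part_of", "5G core"),
    ("AMF", "handles", "registration"),
]


def retrieve(question: str, kg: List[Triple], k: int = 2) -> List[Triple]:
    """Return up to k triples ranked by word overlap with the question."""
    tokens = set(question.lower().replace("?", " ").split())
    scored = []
    for triple in kg:
        words = set(" ".join(triple).lower().replace("_", " ").split())
        overlap = len(tokens & words)
        if overlap:
            scored.append((overlap, triple))
    # Python's sort is stable, so equally scored triples keep their KG order.
    scored.sort(key=lambda item: -item[0])
    return [triple for _, triple in scored[:k]]


def build_prompt(question: str, facts: List[Triple]) -> str:
    """Place the retrieved facts ahead of the question so the answer can cite them."""
    lines = [f"- {h} {r.replace('_', ' ')} {t}" for h, r, t in facts]
    return "Answer using only these facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"


QUESTION = "What does the AMF handle?"
FACTS = retrieve(QUESTION, KG)
PROMPT = build_prompt(QUESTION, FACTS)
```

Because the prompt lists the exact triples that were retrieved, the same list can be shown to the user as the evidence for the generated answer.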
FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality
Authors: Hanyuan Zhang, Lucas He, Runlong He, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson
First: 2026-02-19T16:31:14+00:00 · Latest: 2026-02-19T16:31:14+00:00
Abstract
Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
中文标题/摘要
标题:FoundationPose-初始化的3D-2D肝脏注册用于外科增强现实
增强现实可以提高腹腔镜肝脏手术中肿瘤定位的准确性。现有的注册管道通常依赖于器官轮廓;非刚性对齐通常使用有限元(FE)模型结合降维或机器学习组件来处理。我们结合腹腔镜深度图与基础姿态估计器进行相机-肝脏姿态估计,并用非刚性迭代最近点(NICP)替换基于FE的变形,以降低工程/建模复杂性和专业知识要求。在真实患者数据上,深度增强的基础姿态方法在3个案例中实现了9.91毫米的平均注册误差。结合刚性-NICP注册优于仅刚性注册,证明了NICP作为有限元非刚性模型的有效替代品。该管道实现了临床相关的准确性,同时提供了一种轻量级、工程友好的替代FE变形的方法。
Summary / 总结
The research aims to improve tumor localization in laparoscopic liver surgery using augmented reality. The method integrates laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and uses non-rigid iterative closest point (NICP) for deformable alignment, reducing engineering complexity. The depth-augmented approach achieved a mean registration error of 9.91 mm across three patient cases, and combined rigid-NICP registration outperformed rigid-only registration, showing NICP to be an efficient alternative to finite-element deformable models.
研究旨在通过增强现实提高腹腔镜肝脏手术中的肿瘤定位。方法结合了腹腔镜深度图和基础姿态估计器来估计相机-肝脏姿态,并使用非刚性最近点迭代(NICP)进行非刚性对齐,减少了复杂有限元模型的需求。该方法在三个案例中实现了9.91 mm的平均注册误差,并且与刚性对齐结合使用NICP时表现更优,显示了NICP作为有限元变形模型的高效替代方案的有效性。
Defining and Evaluating Physical Safety for Large Language Models
Authors: Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho
First: 2024-11-04T17:41:25+00:00 · Latest: 2026-02-19T16:30:29+00:00
Abstract
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at huggingface.co/spaces/TrustSafeAI/LLM-physical-safety.
中文标题/摘要
标题:定义和评估大型语言模型的物理安全性
大型语言模型(LLMs)越来越多地用于控制无人机等机器人系统,但它们在实际应用中造成物理威胁和伤害的风险尚未得到探索。我们的研究通过开发无人机控制的全面基准来填补评估LLM物理安全性的关键空白。我们将无人机的物理安全风险分为四类:(1)针对人类的目标威胁,(2)针对物体的目标威胁,(3)基础设施攻击,以及(4)法规违规。我们对主流LLM的评估揭示了实用性与安全性之间的不理想权衡,擅长代码生成的模型在关键的安全方面往往表现不佳。此外,虽然通过引入先进的提示工程技术如上下文学习和思维链可以提高安全性,但这些方法仍然难以识别无意中的攻击。此外,更大的模型在拒绝危险命令方面表现出更好的安全性。我们的发现和基准可以促进LLM物理安全性的设计和评估。项目页面可在huggingface.co/spaces/TrustSafeAI/LLM-physical-safety获取。
Summary / 总结
This study aims to address the lack of evaluation methods for the physical safety of Large Language Models (LLMs) used in controlling robotic systems like drones. The researchers developed a comprehensive benchmark for drone control and classified physical safety risks into four categories: human-targeted threats, object-targeted threats, infrastructure attacks, and regulatory violations. Evaluations of mainstream LLMs showed a trade-off between utility and safety, with models excelling in code generation often performing poorly in safety aspects. Prompt engineering techniques such as In-Context Learning and Chain-of-Thought improved safety but still struggled to identify unintentional attacks. Larger models demonstrated better safety capabilities, especially in refusing dangerous commands. The findings and benchmark can help in designing and evaluating physical safety for LLMs.
本研究解决了在控制无人机时大型语言模型(LLM)缺乏物理安全评估的问题。它将物理安全风险分为四类,并开发了一个全面的基准。评估结果显示,擅长代码生成的模型在安全方面往往表现较差,尽管先进的技术如上下文学习和链式思考可以提高安全性,但它们仍然难以识别无意中的攻击。较大的模型在拒绝危险命令方面表现出更好的安全性。这些发现和基准可以用于设计和评估LLM的物理安全性。
Capturing Individual Human Preferences with Reward Features
Authors: André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle
Venue: NeurIPS 2025
First: 2025-03-21T17:39:33+00:00 · Latest: 2026-02-19T16:23:22+00:00
Comments: Published at NeurIPS 2025
Abstract
Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
中文标题/摘要
标题:利用奖励特征捕捉个体人类偏好
从人类反馈中进行强化学习通常使用一个不区分个体的奖励函数。我们提出,在大型语言模型训练等存在高度分歧的背景下,这可能不是一个好的设计选择。我们形式化并分析了学习一个可以专门针对用户的奖励模型的问题。基于经验风险最小化的原则,我们推导出一个大概率近似正确(PAC)的界,显示了近似误差依赖于训练样本数量,以及提供反馈的人类评判者的数量。基于我们的理论发现,我们讨论了如何最好地收集两两偏好数据,并认为当用户之间存在显著分歧时,适应性奖励模型是有益的。我们还提出了一种适应性奖励模型的具体架构。我们的方法利用了个体偏好可以表示为一组通用奖励特征线性组合的观察。我们展示了如何学习这些特征,并随后使用它们快速适应特定个体的奖励模型,即使他们的偏好未反映在训练数据中。我们展示了使用大型语言模型的实验,以说明我们的理论结果,并将提出的架构与非适应性基线进行比较。与我们的分析一致,我们的模型提供的益处随着评判者的数量和他们偏好的异质性增加而增加。我们还展示了我们的模型与适应性对应模型相比具有竞争力,包括那些进行上下文个性化的方法。
Summary / 总结
This paper addresses the issue of modeling individual human preferences in reinforcement learning, particularly in contexts where there is high potential for disagreement, such as training large language models. The authors propose a method to learn a reward model that can be specialized to a user by leveraging a set of general reward features. They derive a PAC bound to show the dependency of the approximation error on the number of training examples and human raters. Experiments with large language models demonstrate that their adaptive reward model, which can quickly adapt to individual preferences, outperforms a non-adaptive baseline, especially when there is considerable disagreement among users.
本文探讨了在偏好可能广泛不同的背景下,如训练大型语言模型时,捕捉个体人类偏好在强化学习中的挑战。作者提出了一种方法,通过使用一组通用奖励特征来学习一个可以专门针对用户的奖励模型。他们推导出一个PAC边界,以展示近似误差对训练示例数量和提供反馈的人类评判者的数量的依赖性。实验表明,提出的自适应奖励模型在用户偏好存在显著分歧时,优于非自适应基线,尤其是在有更多评判者和更广泛的偏好异质性的情况下。
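The idea that a user's reward is a linear combination of shared reward features can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's architecture: the features are fixed hand-written vectors rather than learned ones, preferences are modelled with a Bradley-Terry (logistic) likelihood, and the names `reward`, `pref_prob`, and `adapt_user_weights` are hypothetical. Only the per-user weight vector is fitted, which is what makes adaptation to a new individual cheap.

```python
import math
from typing import List, Sequence, Tuple


def reward(w: Sequence[float], phi: Sequence[float]) -> float:
    """User-specific reward: a linear combination of shared reward features."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def pref_prob(w: Sequence[float], phi_a: Sequence[float], phi_b: Sequence[float]) -> float:
    """Bradley-Terry probability that this user prefers response a over response b."""
    return 1.0 / (1.0 + math.exp(-(reward(w, phi_a) - reward(w, phi_b))))


def adapt_user_weights(
    pairs: List[Tuple[Sequence[float], Sequence[float]]],
    dim: int,
    lr: float = 0.5,
    steps: int = 200,
) -> List[float]:
    """Fit only the user's weights by gradient ascent on the preference log-likelihood.

    Each pair is (features of the preferred response, features of the rejected one).
    The features themselves stay frozen.
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_win, phi_lose in pairs:
            p = pref_prob(w, phi_win, phi_lose)
            for i in range(dim):
                # d/dw log sigmoid(w . d) = (1 - sigmoid(w . d)) * d
                grad[i] += (1.0 - p) * (phi_win[i] - phi_lose[i])
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


# Hypothetical 2-d features: (level of detail, brevity). This user prefers brief answers.
USER_PAIRS = [
    ([0.2, 0.9], [0.9, 0.1]),
    ([0.1, 0.8], [0.7, 0.3]),
]
W_USER = adapt_user_weights(USER_PAIRS, dim=2)
P_NEW = pref_prob(W_USER, [0.3, 0.7], [0.8, 0.2])
```

With only two comparisons the fitted weights already favour brevity, so the model predicts that this user would prefer an unseen brief response over a detailed one.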
LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights
Authors: Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal
First: 2026-02-19T16:22:22+00:00 · Latest: 2026-02-19T16:22:22+00:00
Abstract
We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $\Delta W$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
中文标题/摘要
标题:LORA-CRAFT:通过冻结Tucker分解预训练注意力权重的跨层秩适应
我们引入了CRAFT(跨层秩适应通过冻结Tucker分解),这是一种参数高效微调(PEFT)方法,它将Tucker张量分解应用于堆叠在变压器层上的预训练注意力权重矩阵,并仅在冻结的Tucker因子上训练小型方形适应矩阵。现有的基于张量的PEFT方法分解梯度更新:LoTR使用共享因子矩阵的Tucker分解,而SuperLoRA在应用Tucker分解之前将$ΔW$按层分组和重塑。另外,像PiSSA这样的方法对预训练权重应用SVD,但每层独立操作。CRAFT将这两条研究路线结合起来:它直接对组织成跨层3D张量的预训练权重进行完整的Tucker分解,通过Higher-Order SVD(HOSVD)进行,冻结所有结果因子,并通过应用于每个因子矩阵的轻量级可训练变换来适应模型。使用RoBERTa-base和RoBERTa-large在GLUE基准上的实验表明,CRAFT在性能上与现有方法相当,同时只需要41K个Tucker适应参数——在固定Tucker秩的情况下,该计数与模型维度和深度无关。
Summary / 总结
CRAFT is a parameter-efficient fine-tuning method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices. It bridges tensor-based PEFT methods, which decompose gradient updates, and SVD-based methods such as PiSSA, which decompose pre-trained weights independently per layer: it performs full Tucker decomposition via Higher-Order SVD on cross-layer 3D tensors, freezes the resulting factors, and applies lightweight trainable transformations to each factor matrix. Experiments on the GLUE benchmark with RoBERTa-base and RoBERTa-large show that CRAFT achieves competitive performance with only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
CRAFT 是一种参数高效的微调方法,它使用 Tucker 张量分解跨 transformer 层的预训练注意力权重矩阵,并仅训练小的适应矩阵。实验表明,CRAFT 在 GLUE 基准上实现了与现有方法相当的性能,仅需 41K 个 Tucker 调整参数,且与模型大小和深度无关,固定 Tucker 秩时独立于模型尺寸和深度。
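The claim that the trainable parameter count does not depend on model dimension or depth at fixed Tucker ranks can be illustrated with simple counting. The sketch below assumes one 3D tensor of shape (d_out, d_in, n_layers) with Tucker ranks (r1, r2, r3) and one trainable r x r matrix per factor. The ranks, the function names, and the LoRA comparison setup are hypothetical, and the sketch does not attempt to reproduce the 41K figure, since the abstract does not give the ranks or say which weight matrices are stacked.

```python
from typing import Dict, Tuple


def tucker_counts(d_out: int, d_in: int, n_layers: int, ranks: Tuple[int, int, int]) -> Dict[str, int]:
    """Count frozen and trainable parameters for one cross-layer Tucker-decomposed tensor."""
    r1, r2, r3 = ranks
    # Frozen: three factor matrices plus the core tensor.
    frozen = d_out * r1 + d_in * r2 + n_layers * r3 + r1 * r2 * r3
    # Trainable: one small square adaptation matrix per factor (an assumption of this sketch).
    trainable = r1 * r1 + r2 * r2 + r3 * r3
    return {"frozen": frozen, "trainable": trainable}


def lora_params(d_out: int, d_in: int, n_layers: int, rank: int) -> int:
    """Standard LoRA count for one adapted matrix per layer: B is d_out x r, A is r x d_in."""
    return n_layers * rank * (d_out + d_in)


RANKS = (32, 32, 8)  # hypothetical Tucker ranks
BASE = tucker_counts(768, 768, 12, RANKS)     # RoBERTa-base-like dimensions
LARGE = tucker_counts(1024, 1024, 24, RANKS)  # RoBERTa-large-like dimensions
LORA_BASE = lora_params(768, 768, 12, rank=8)
LORA_LARGE = lora_params(1024, 1024, 24, rank=8)
```

Under these assumptions the trainable count is identical for both model sizes, while the frozen factor sizes and the LoRA count both grow with width and depth.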
Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
Authors: Pranay Jain, Maximilian Kasper, Göran Köber, Axel Plinge, Dominik Seuß
Venue: EEAI 2025
First: 2026-02-19T16:21:47+00:00 · Latest: 2026-02-19T16:21:47+00:00
Comments: 11 pages, 7 figures, Funding: GreenICT@FMD (BMFTR grant 16ME0491K)
Abstract
This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluate across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.
中文标题/摘要
标题:ARM Cortex处理器上基于帕累托最优的AI模型基准测试方法及其在可持续嵌入式系统中的应用
本研究提出了一种实用的基准测试框架,用于在ARM Cortex处理器(M0+,M4,M7)上优化人工智能(AI)模型,重点关注嵌入式系统中的能效、准确性和资源利用率。通过设计自动化测试平台,我们提供了一种系统的方法来评估关键性能指标(KPIs),并确定处理器和AI模型的最佳组合。研究强调了浮点运算(FLOPs)与推理时间之间的近线性关系,提供了一种可靠的计算需求估算指标。利用帕累托分析,我们展示了如何在能耗和模型准确性之间取得平衡,确保AI应用满足性能要求同时不牺牲可持续性。关键发现表明,M7处理器适用于短推理周期,而M4处理器对于较长推理任务具有更好的能效。M0+处理器虽然对于复杂AI模型效率较低,但对于简单任务仍然适用。本研究为开发人员提供了见解,指导他们设计高效的AI系统,在实际应用中实现高性能。
Summary / 总结
This work introduces a benchmarking framework to optimize AI models on ARM Cortex processors (M0+, M4, M7) by evaluating energy efficiency, accuracy, and resource utilization. An automated test bench reveals a near-linear relationship between floating-point operations (FLOPs) and inference time, and Pareto analysis is used to balance energy consumption against model accuracy when choosing combinations of processor and AI model. Key findings suggest that the M7 processor is suitable for short inference cycles, the M4 processor is more energy-efficient for longer tasks, and the M0+ is adequate for simpler tasks. This research offers developers guidance for designing energy-efficient AI systems.
这项研究提出了一种基准测试框架,用于在ARM Cortex处理器(M0+、M4、M7)上优化AI模型,评估能源效率、准确性和资源利用率。通过使用帕累托分析来平衡能源消耗和模型准确性,研究显示M7处理器最适合短推理周期,M4处理器在较长推理任务中更具能源效率,而M0+处理器适用于更简单的任务。关键发现表明浮点运算与推理时间之间存在近线性关系,提供了一个可靠的计算需求评估指标。
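Pareto analysis over energy consumption and accuracy can be sketched as follows. The candidate names and numbers are hypothetical placeholders, not measurements from the paper, and the `Candidate`, `dominates`, and `pareto_front` names are assumptions of this sketch. A configuration is kept if no other configuration is at least as good on both criteria and strictly better on one.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str         # a processor/model combination
    energy_mj: float  # energy per inference, lower is better
    accuracy: float   # model accuracy, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both criteria and strictly better on at least one."""
    no_worse = a.energy_mj <= b.energy_mj and a.accuracy >= b.accuracy
    strictly_better = a.energy_mj < b.energy_mj or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical placeholder values, for illustration only.
CANDIDATES = [
    Candidate("A", 1.0, 0.80),
    Candidate("B", 2.0, 0.90),
    Candidate("C", 3.0, 0.90),  # same accuracy as B but more energy
    Candidate("D", 5.0, 0.95),
    Candidate("E", 6.0, 0.94),  # more energy and less accuracy than D
]
FRONT = pareto_front(CANDIDATES)
```

The remaining candidates form the trade-off curve: moving along it buys accuracy only at the cost of energy, so the final choice depends on the application's energy budget and accuracy requirement.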