OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Authors: Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan
First: 2026-02-19T18:59:54+00:00 · Latest: 2026-02-19T18:59:54+00:00
Abstract
Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
中文标题/摘要
标题:OpenEarthAgent:统一的工具增强地理空间代理框架
近期多模态推理的进步使代理能够解释图像、将其与语言连接起来并执行结构化分析任务。将此类能力扩展到遥感领域仍然具有挑战性,因为模型必须在保持连贯的多步逻辑的同时,在空间尺度、地理结构和多光谱指数上进行推理。为弥合这一差距,OpenEarthAgent 引入了一个统一框架,用于开发基于卫星图像、自然语言查询和详细推理轨迹训练的工具增强地理空间代理。训练管道依赖于结构化推理轨迹的监督微调,使模型与跨多种分析上下文的验证多步工具交互对齐。伴随的语料库包括14,538个训练实例和1,169个评估实例,训练集中有超过100,000个推理步骤,评估集中有超过7,000个推理步骤。它涵盖了城市、环境、灾害和基础设施领域,并结合了GIS操作和NDVI、NBR和NDBI等指数分析。基于明确的推理轨迹,学习到的代理展示了结构化的推理、稳定的地理空间理解以及通过工具驱动的地理空间交互实现的可解释行为。我们报告了相对于强大基线的一致改进,并且在与最近的开源和闭源模型的性能上具有竞争力。
Summary / 总结
The research aims to develop geospatial agents that handle complex remote-sensing tasks by combining multimodal reasoning with GIS operations and spectral-index analyses. OpenEarthAgent trains such agents by supervised fine-tuning on satellite imagery, natural-language queries, and detailed reasoning traces. The resulting agent improves consistently over a strong baseline and is competitive with recent open and closed-source models, showing structured reasoning and stable spatial understanding across geospatial contexts.
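OpenEarthAgent is a unified framework for tool-augmented geospatial agents that interpret satellite imagery and natural-language queries, trained on 14,538 instances containing more than 100K reasoning steps. Supervised fine-tuning on structured reasoning trajectories aligns the model with verified multi-step tool interactions. The model improves consistently over the baseline and is competitive with recent open and closed-source models, showing structured reasoning and stable geospatial understanding across diverse conditions.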
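As a rough illustration of the index analyses named in the abstract (NDVI, NBR, NDBI), the sketch below uses the standard normalized-difference definitions. The band names and the reflectance values are hypothetical, and nothing here is taken from the OpenEarthAgent tool implementations.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return normalized_difference(nir, swir2)


def ndbi(swir1, nir):
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    return normalized_difference(swir1, nir)


# Hypothetical surface reflectances for one vegetated pixel.
pixel = {"red": 0.05, "nir": 0.45, "swir1": 0.20, "swir2": 0.10}
indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NBR": nbr(pixel["nir"], pixel["swir2"]),
    "NDBI": ndbi(pixel["swir1"], pixel["nir"]),
}
print(indices)
```

All three indices lie in [-1, 1]; a tool-augmented agent would typically call such a function per pixel or per raster and then reason over the result.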
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Authors: Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide
First: 2026-02-19T18:59:44+00:00 · Latest: 2026-02-19T18:59:44+00:00
Comments: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at https://hipe-eval.github.io/HIPE-2026/
Abstract
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations in multiple languages and time periods. Systems are asked to classify relations of two types, $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?"), which requires reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
中文标题/摘要
标题:CLEF HIPE-2026:从多语言历史文本中准确高效地提取人物地点关系
HIPE-2026 是一个CLEF评估实验室,专注于从嘈杂的多语言历史文本中提取人物地点关系。在HIPE-2020和HIPE-2022活动的基础上,它将系列扩展到语义关系提取,通过在多个语言和时期内识别人物与地点的关联来完成任务。系统被要求对两种类型的关系进行分类——$at$(“这个人是否曾经在过这个地方?”)和$isAt$(“这个人是否在发布时间附近位于这个地方?”),这需要对时间与地理线索进行推理。该实验室引入了三方面的评估标准,共同评估准确性、计算效率和领域泛化能力。通过将关系提取与大规模历史数据处理联系起来,HIPE-2026旨在支持知识图谱构建、历史传记重建和数字人文中的空间分析等下游应用。
When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this failure mode systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
中文标题/摘要
标题:视觉优先于语言:评估和缓解VLAs中的反事实失败
视觉-语言-行动模型(VLAs)承诺将语言指令应用于机器人控制,但在实践中往往未能忠实执行语言指令。当面对缺乏强烈场景特定监督的指令时,VLAs会遭受反事实失败:它们基于数据集偏差诱导的视觉捷径行动,反复执行已学得的行为,并选择在训练期间频繁出现的对象,而不考虑语言意图。为了系统地研究这一问题,我们引入了LIBERO-CF,这是第一个用于VLAs的反事实基准,通过在视觉上合理的LIBERO布局下分配替代指令来评估语言遵循能力。我们的评估表明,反事实失败在最先进的VLAs中普遍存在但尚未得到充分探索。我们提出了反事实行动指导(CAG),这是一种简单而有效的双分支推理方案,明确地在VLAs中正则化语言条件。CAG结合了一个标准的VLA策略和一个未受语言条件的视觉-行动(VA)模块,在行动选择过程中进行反事实比较。这种设计减少了对视觉捷径的依赖,提高了对未观察任务的鲁棒性,并不需要额外的演示或对现有架构或预训练模型进行修改。广泛的实验表明,它可以在各种VLAs中实现即插即用集成,并且具有持续改进。例如,在LIBERO-CF中,CAG在语言遵循准确性上提高了9.7%,在未观察任务上的任务成功率提高了3.6%,使用无训练策略,配以VA模型时,进一步提高了15.5%和8.5%。在实际应用评估中,CAG将反事实失败减少了9.4%,平均提高了17.2%的任务成功率。
Summary / 总结
This paper addresses counterfactual failures in Vision-Language-Action models (VLAs), where models act on visual shortcuts rather than language instructions. The authors introduce LIBERO-CF, a benchmark that measures this failure mode, and propose Counterfactual Action Guidance (CAG), a simple dual-branch inference scheme that improves language-following accuracy and task success, especially on under-observed tasks. Experiments show that CAG reduces counterfactual failures and improves robustness without additional demonstrations or modifications to existing architectures or pretrained models.
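The study examines counterfactual failures in Vision-Language-Action models (VLAs), where models act on visual biases rather than language instructions. It introduces the LIBERO-CF benchmark, which evaluates language following by assigning alternative instructions under visually plausible layouts. It proposes Counterfactual Action Guidance (CAG), a dual-branch inference scheme that strengthens language conditioning and reduces reliance on visual shortcuts, improving performance on under-observed tasks. On LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success with a training-free strategy, with further gains of 15.5% and 8.5% when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.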
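The abstract does not give the rule CAG uses to combine its two branches. As one plausible reading, the sketch below borrows the extrapolation form of classifier-free guidance: the language-unconditioned VA prediction is treated as the "shortcut" action and the final action is pushed away from it toward the language-conditioned VLA prediction. The function name, the `guidance_scale` parameter, and the combination rule are all assumptions, not the authors' method.

```python
def guided_action(vla_action, va_action, guidance_scale=1.5):
    """Combine two action vectors, guidance style (assumed rule).

    vla_action: action predicted with the language instruction.
    va_action:  action predicted from vision alone (no language).
    guidance_scale: 0 returns the vision-only action, 1 the plain VLA
        action, and values above 1 amplify the language-specific part.
    """
    if len(vla_action) != len(va_action):
        raise ValueError("action vectors must have the same length")
    return [
        va + guidance_scale * (vla - va)
        for vla, va in zip(vla_action, va_action)
    ]


# Hypothetical 2-D end-effector deltas: vision alone drifts toward a
# frequently seen object, the instruction points elsewhere.
vision_only = [0.0, 0.0]
with_language = [1.0, 0.0]
print(guided_action(with_language, vision_only, guidance_scale=2.0))
```

With `guidance_scale=1.0` the function reduces to the unmodified VLA action, which makes the scale easy to sweep when checking whether a policy is ignoring its instruction.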
Human-level 3D shape perception emerges from multi-view learning
Authors: Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa
First: 2026-02-19T18:56:05+00:00 · Latest: 2026-02-19T18:56:05+00:00
Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
中文标题/摘要
标题:多视角学习中的人类级3D形状感知
人类可以从二维视觉输入中推断出物体的三维结构。模拟这种能力一直是视觉智能科学与工程的长期目标,但几十年来,计算方法仍未达到人类的性能。我们开发了一种建模框架,可以直接从实验刺激中预测任意物体的人类3D形状推断。我们使用一种新颖的神经网络类,通过自然感官数据的空间视觉目标进行训练;给定自然场景中不同位置拍摄的一组图像,这些模型能够学习预测与这些图像相关的空间信息,如相机位置和视觉深度,而无需依赖任何与物体相关的归纳偏置。值得注意的是,这些视觉-空间信号类似于人类可轻易获得的感官线索。我们设计了一种零样本评估方法来确定这些“多视角”模型在一项成熟的3D感知任务中的性能,然后将模型行为与人类行为进行比较。我们的建模框架是首个在无需特定任务训练或微调的情况下达到人类3D形状推断准确性的框架。令人惊讶的是,模型响应的独立读数可以预测人类行为的细微差异,包括错误模式和反应时间,揭示了模型动态与人类感知之间的自然对应关系。综上所述,我们的研究结果表明,人类级的3D感知可以从自然视觉-空间数据的简单、可扩展的学习目标中涌现出来。所有用于重现我们研究结果的代码、人类行为数据和实验刺激都可以在我们的项目页面上找到。
Summary / 总结
This study aims to model human ability to infer 3D object structure from 2D images, a longstanding challenge in visual intelligence. The researchers developed a neural network trained on naturalistic multi-view images to predict spatial information like camera location and visual depth. Notably, the model matches human accuracy on 3D shape inference without task-specific training, and its responses correlate with human error patterns and reaction times, suggesting a natural correspondence between model dynamics and human perception.
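The study models the human ability to infer 3D object structure from 2D images, a long-standing challenge in visual intelligence. The researchers developed a new class of neural networks trained on multi-view images of natural scenes to predict spatial information such as camera location and visual depth. These models match human accuracy on a 3D shape inference task and predict human error patterns and reaction times, indicating a natural correspondence between model dynamics and human perception.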
Multi-Round Human-AI Collaboration with User-Specified Requirements
Authors: Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas
First: 2026-02-19T18:54:34+00:00 · Latest: 2026-02-19T18:54:34+00:00
Abstract
As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
中文标题/摘要
标题:多轮人机协作与用户指定要求
随着人类越来越多地依赖多轮对话AI进行高风险决策,需要有原则性的框架来确保此类交互能够可靠地提高决策质量。我们采取以人为中心的观点,遵循两个原则:反事实伤害,确保AI不削弱人类的优势;互补性,确保AI在人类容易出错的地方增加价值。我们通过用户定义的规则形式化这些概念,允许用户明确指定特定任务中的伤害和互补性含义。然后,我们引入了一个在线的、无分布假设的算法,具有有限样本保证,该算法在协作动态中强制执行用户指定的约束。我们通过两个交互设置评估了我们的框架:LLM模拟协作在医学诊断任务上的表现和人类众包研究在图像推理任务上的表现。我们展示了我们的在线程序即使在非平稳交互动态下也能保持规定的反事实伤害和互补性违反率。此外,收紧或放松这些约束会产生可预测的人类下游准确性变化,证实了这两个原则作为实用杠杆的作用,可以引导多轮协作向更好的决策质量发展,而无需建模或约束人类行为。
Summary / 总结
This study addresses the need for reliable human-AI collaboration in high-stakes decision-making. It introduces a framework in which user-defined rules specify counterfactual harm (the AI must not undermine human strengths) and complementarity (the AI should add value where the human is prone to err). An online, distribution-free algorithm with finite-sample guarantees enforces these rules during multi-round interactions. Evaluations on medical diagnosis and pictorial reasoning tasks show that the procedure keeps violation rates at their prescribed levels, even under nonstationary dynamics, and that tightening or loosening the constraints shifts human accuracy predictably.
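The study addresses the need for reliable human-AI collaboration in high-stakes decision-making. It proposes a framework based on user-defined rules that ensures the AI complements human strengths without undermining them. An online algorithm with finite-sample guarantees enforces these rules across multi-round interactions. Evaluations on medical diagnosis and pictorial reasoning tasks show that the framework maintains the prescribed violation rates and that adjusting the rules shifts human accuracy predictably, demonstrating the practical value of the proposed principles.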
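The abstract does not spell out the online algorithm. To give a feel for how a distribution-free online procedure can hold a violation rate at a prescribed level, the sketch below uses a generic quantile-tracking update in the style of adaptive conformal inference: a conservativeness threshold rises after each violation and falls otherwise. The threshold, the random "risk" score, and the violation rule are hypothetical stand-ins, not the paper's construction.

```python
import random


def update_threshold(threshold, violated, target_rate, step=0.05):
    """Move the threshold up after a violation and down otherwise.

    The expected drift is zero exactly when violations occur at
    `target_rate`, so the long-run rate is pulled toward the target.
    """
    return threshold + step * ((1.0 if violated else 0.0) - target_rate)


def simulate(rounds=5000, target_rate=0.1, step=0.05, seed=0):
    """Return the empirical violation rate over a simulated interaction."""
    rng = random.Random(seed)
    threshold, violations = 0.5, 0
    for _ in range(rounds):
        risk = rng.random()          # hypothetical per-round risk score in [0, 1)
        violated = risk > threshold  # stand-in for a user-defined rule being broken
        violations += violated
        threshold = update_threshold(threshold, violated, target_rate, step)
    return violations / rounds


print(simulate())
```

Because the threshold stays in a bounded range, the gap between the empirical and target rates shrinks on the order of 1/(step × rounds) regardless of how the risk scores are distributed, which is the sense in which such updates are distribution-free.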
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
中文标题/摘要
标题:IntRec:基于意图的对比精炼检索
从复杂场景中检索用户指定的对象仍然是一个具有挑战性的任务,尤其是在查询模糊或涉及多个相似对象的情况下。现有的开放式词汇检测器以单次操作的方式工作,缺乏根据用户反馈精炼预测的能力。为了解决这个问题,我们提出了IntRec,这是一种基于用户反馈进行预测精炼的交互式对象检索框架。其核心是一个意图状态(IS),它维护了正锚点(确认的线索)和负约束(被拒绝的假设)的双重记忆集。对比对齐函数通过最大化与正线索的相似性并惩罚被拒绝的对象来对候选对象进行排名,从而在杂乱的场景中实现细粒度的消歧。我们的交互式框架在不增加额外监督的情况下显著提高了检索准确性。在LVIS数据集上,IntRec达到了35.4 AP,分别比OVMR、CoDet和CAKE高出+2.3、+3.7和+0.5。在具有挑战性的LVIS-模糊基准上,它在单次纠正反馈后提高了7.9 AP的性能,每次交互的额外延迟少于30毫秒。
Summary / 总结
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing ambiguous queries in complex scenes. It uses an Intent State to maintain positive anchors and negative constraints, and a contrastive alignment function to rank candidates. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP, respectively. On the LVIS-Ambiguous benchmark, it improves by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
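IntRec is an interactive object retrieval framework driven by user feedback; it uses an Intent State to maintain positive anchors and negative constraints. A contrastive alignment function ranks candidate objects, improving retrieval accuracy in cluttered scenes. On LVIS, IntRec outperforms OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP, and on LVIS-Ambiguous it gains +7.9 AP after a single corrective feedback, with less than 30 ms of added latency per interaction.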
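The ranking idea described in the abstract (reward similarity to confirmed cues, penalize similarity to rejected hypotheses) can be sketched as follows. The scoring formula, the `penalty` weight, and the toy 2-D embeddings are assumptions chosen for illustration; the paper's actual contrastive alignment function is not given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intent_score(candidate, positives, negatives, penalty=1.0):
    """Best match to a positive anchor minus a penalty for the best
    match to a negative constraint (assumed form)."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank_candidates(candidates, positives, negatives, penalty=1.0):
    """Return candidate names sorted from best to worst score."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives, penalty),
        reverse=True,
    )


# Hypothetical embeddings: the user confirmed a "red" cue and rejected
# a detection that looked like the blue mug.
candidates = {"red_mug": [1.0, 0.1], "blue_mug": [0.1, 1.0]}
positives = [[0.9, 0.2]]
negatives = [[0.0, 1.0]]
print(rank_candidates(candidates, positives, negatives))
```

Each round of feedback would append one vector to `positives` or `negatives`, so the added cost per interaction is a handful of similarity computations.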
CORAL: Correspondence Alignment for Improved Virtual Try-On
Authors: Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam, Tongmin Kim, Jongjae Park, Hyeonwoo Kang, Seungryong Kim
First: 2026-02-19T18:50:12+00:00 · Latest: 2026-02-19T18:50:12+00:00
Comments: 32 pages, 25 figures
Abstract
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
中文标题/摘要
标题:CORAL: 对应对齐以改进虚拟试穿
现有的虚拟试穿(VTON)方法往往难以保留细部服装细节,尤其是在需要准确的人-服装对应关系的非配对设置中。这些方法没有明确地强制执行人-服装对齐,并且无法解释对应关系如何在扩散变换器(DiTs)中出现。在本文中,我们首先分析了基于DiT架构的全3D注意力,并揭示出人-服装对应关系的关键依赖于全3D注意力中精确的人-服装查询-键匹配。基于这一洞察,我们随后引入了CORrespondence ALignment(CORAL),这是一种基于DiT的框架,明确地将查询-键匹配与稳健的外部对应关系对齐。CORAL结合了两个互补的组件:一种对应关系蒸馏损失,将可靠的匹配与人-服装注意力对齐,以及一种熵最小化损失,使注意力分布更加清晰。我们还提出了一种基于VLM的评估协议,以更好地反映人类偏好。CORAL在基准之上始终表现出改进,增强了全局形状转移和局部细节保留。广泛的消融实验验证了我们的设计选择。
Summary / 总结
This paper addresses the challenge of preserving fine garment details in Virtual Try-On (VTON) by introducing CORAL, a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. The method combines a correspondence distillation loss, which aligns reliable matches with person-garment attention, and an entropy minimization loss, which sharpens the attention distribution. Experimental results show that CORAL improves both global shape transfer and local detail preservation over the baseline.
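The paper tackles the preservation of fine garment details in virtual try-on by introducing CORAL, a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. It combines a correspondence distillation loss with an entropy minimization loss that sharpens the attention distribution. Experiments show that CORAL improves on the baseline in both global shape transfer and local detail preservation.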
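To make the two loss terms concrete, the sketch below computes, for a single attention row (one person query over garment keys), the Shannon entropy that an entropy-minimization loss would push down and a negative log-likelihood against an externally supplied match that a distillation loss could use. The single-row setup, the function names, and the NLL form of the distillation term are assumptions; the abstract does not give the exact losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs, eps=1e-12):
    """Shannon entropy of one attention row; lower means sharper."""
    return -sum(p * math.log(p + eps) for p in probs)


def correspondence_nll(probs, target_index, eps=1e-12):
    """Negative log-probability the row assigns to the key that an
    external matcher marked as the true correspondence (assumed form)."""
    return -math.log(probs[target_index] + eps)


uniform = softmax([0.0, 0.0, 0.0, 0.0])  # diffuse attention
peaked = softmax([8.0, 0.0, 0.0, 0.0])   # attention focused on key 0
print(attention_entropy(uniform), attention_entropy(peaked))
print(correspondence_nll(peaked, 0), correspondence_nll(peaked, 1))
```

The two terms are complementary: the entropy term only asks for a sharp row, while the correspondence term says where the peak should be.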
SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Authors: Nathan S. de Lara, Florian Shkurti
First: 2026-02-19T18:47:31+00:00 · Latest: 2026-02-19T18:47:31+00:00
Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
中文标题/摘要
标题:SMAC:分数匹配的演员-评论家算法以实现稳健的离线到在线转移
现代离线强化学习(RL)方法能够找到表现良好的演员-评论家,然而,使用基于值的RL算法在线微调这些演员-评论家通常会导致性能立即下降。我们提供了证据支持假设,在损失景观中,先前算法的离线最大值和在线最大值之间被低性能的山谷隔开,基于梯度的微调会穿越这些山谷。基于此,我们提出了分数匹配的演员-评论家(SMAC),这是一种离线RL方法,旨在学习在不降低性能的情况下过渡到在线基于值的RL算法的演员-评论家。SMAC通过在离线阶段正则化Q函数来避免离线和在线最大值之间的山谷,使其尊重策略得分与Q函数动作梯度的一阶导数相等。实验结果表明,SMAC收敛到通过一阶优化找到的奖励单调增加的路径连接到更好在线最大值的离线最大值。SMAC在6/6个D4RL任务中实现了平滑过渡到Soft Actor-Critic和TD3。在4/6个环境中,它将后悔减少34-58%超过最佳基线。
Summary / 总结
The research aims to address the issue of performance drops when fine-tuning offline-trained actor-critics with online value-based RL algorithms. The proposed method, Score Matched Actor-Critic (SMAC), regularizes the Q-function during the offline phase to avoid low-performance valleys between offline and online maxima. Experiments show that SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6 out of 6 D4RL tasks and reduces regret by 34-58% in 4 out of 6 environments compared to the best baseline.
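The study addresses the performance drop that occurs when offline-trained actor-critics are fine-tuned with online value-based RL algorithms. The proposed Score Matched Actor-Critic (SMAC) regularizes the Q-function during the offline phase so that the transition to online RL is smooth and loses no performance. Experiments show that SMAC transfers smoothly in all six D4RL tasks and, in four environments, reduces regret by 34-58% relative to the best baseline.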
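The "first-order derivative equality" mentioned in the abstract can be motivated with a short derivation. The version below assumes a maximum-entropy policy with temperature $\alpha$, as in Soft Actor-Critic; the temperature, the dataset symbol $\mathcal{D}$, and the squared-error penalty $\mathcal{R}(Q)$ are illustrative assumptions rather than the paper's stated objective.

```latex
% Assumption: a maximum-entropy (Boltzmann) policy with temperature \alpha > 0.
\[
  \pi(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)},
  \qquad
  Z(s) = \int \exp\big(Q(s,a')/\alpha\big)\, da'
\]
% Z(s) does not depend on a, so taking \nabla_a of \log \pi leaves only the Q term:
\[
  \nabla_a \log \pi(a \mid s) = \frac{1}{\alpha}\, \nabla_a Q(s,a)
\]
% A hypothetical offline regularizer that penalizes violations of this equality:
\[
  \mathcal{R}(Q) = \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \Big[ \big\lVert \nabla_a Q(s,a) - \alpha\, \nabla_a \log \pi(a \mid s) \big\rVert_2^2 \Big]
\]
```

Under this reading, a Q-function whose action-gradient already agrees with the policy score is consistent with the policy that online value-based methods such as Soft Actor-Critic would derive from it, which is one way to see why the transfer need not begin with a performance drop.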
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed data streams that are privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks that arrive incrementally require retraining the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
中文标题/摘要
标题:具有灾难性遗忘鲁棒性的单次增量联邦学习
现代大数据系统生成大量异构且地理上分散的流数据,规模庞大且隐私敏感,使得集中化变得困难。虽然联邦学习(FL)提供了一种增强隐私的训练机制,但它假设静态数据流,并在多轮次中学习协作模型,这使得在通信受限场景中处理增量数据的学习变得具有挑战性。本文提出了单次增量联邦学习(OSI-FL),这是第一个解决通信开销和灾难性遗忘双重挑战的FL框架。OSI-FL通过在单次通信轮次中由每个客户端的冻结视觉-语言模型(VLM)生成类别特定的嵌入,然后由服务器端预训练的扩散模型合成与客户端数据分布相似的新数据样本,这些合成样本在服务器端用于训练。然而,仍存在两个挑战:i) 任务以增量方式到达需要重新训练全局模型,ii) 随着未来任务的到达,重新训练模型会导致灾难性遗忘。为此,我们通过选择性样本保留(SSR)增强训练,该方法基于样本损失识别并保留每个类别和任务对中最信息丰富的top-p个样本。SSR通过确保代表性保留样本在后续迭代中被纳入训练来限制遗忘。实验结果表明,OSI-FL在三个基准数据集上的类增量和领域增量场景中均优于基线方法,包括传统的和单次的FL方法。
Summary / 总结
This paper addresses communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), in which each client sends category-specific embeddings, produced by a frozen vision-language model, to the server in a single communication round. A pre-trained diffusion model at the server synthesizes data resembling each client's distribution, and the server trains on these samples. To mitigate catastrophic forgetting, the authors propose Selective Sample Retention (SSR), which retains the top-p most informative samples per category and task pair based on sample loss. Experiments show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
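The paper addresses communication overhead and catastrophic forgetting in federated learning with incremental data. It proposes One-Shot Incremental Federated Learning (OSI-FL), in which clients transmit category-specific embeddings in a single round and a pre-trained diffusion model generates new data for server-side training. To mitigate catastrophic forgetting, the authors propose Selective Sample Retention (SSR), which retains the most informative samples for each category and task pair according to sample loss. Experiments show that OSI-FL outperforms traditional and one-shot federated learning methods in class-incremental and domain-incremental scenarios on three benchmark datasets.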
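Selective Sample Retention, as described in the abstract, keeps the top-p samples per category and task pair based on sample loss. The sketch below makes two assumptions that the abstract leaves open: "most informative" is read as "highest loss", and `p` is a per-group count rather than a fraction. The record layout (`task`, `category`, `loss`) is hypothetical.

```python
from collections import defaultdict


def select_retained(samples, p):
    """Keep the p highest-loss samples in every (task, category) group.

    samples: list of dicts with keys "task", "category", "loss".
    p: number of samples to retain per group (assumed to be a count).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["task"], sample["category"])].append(sample)

    retained = []
    for key in sorted(groups):  # deterministic group order
        ranked = sorted(groups[key], key=lambda s: s["loss"], reverse=True)
        retained.extend(ranked[:p])
    return retained


# Hypothetical synthesized samples with their training losses.
samples = [
    {"id": "a", "task": 1, "category": "cat", "loss": 0.9},
    {"id": "b", "task": 1, "category": "cat", "loss": 0.2},
    {"id": "c", "task": 1, "category": "cat", "loss": 0.5},
    {"id": "d", "task": 1, "category": "dog", "loss": 0.1},
    {"id": "e", "task": 2, "category": "cat", "loss": 0.3},
]
print([s["id"] for s in select_retained(samples, p=2)])
```

The retained set would then be mixed into the training data for later tasks, so memory grows with p times the number of (task, category) pairs rather than with the full synthesized dataset.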
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
First: 2026-02-19T18:40:51+00:00 · Latest: 2026-02-19T18:40:51+00:00
Abstract
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. VCPO reduces long-context, multi-turn training time by 2.5$\times$ while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
中文标题/摘要
标题:稳定异步性:基于方差控制的离策RL方法用于LLMs
强化学习(RL)广泛用于提高大型语言模型在推理任务上的表现,而异步RL训练因其能提高端到端吞吐量而具有吸引力。然而,对于广泛采用的无批评的策略梯度方法如REINFORCE和GRPO,高异步性使得策略梯度估计器的方差显著增加:基于过时的回放训练会产生重尾的重要性比率,导致一小部分样本主导更新。这种放大效应使得梯度变得嘈杂,学习变得不稳定,相对于匹配的在线训练而言更是如此。在数学和通用推理基准测试中,我们发现,有效样本量(ESS)和不稳定的梯度范数可以可靠地预测崩溃。基于这一诊断,我们提出了方差控制策略优化(VCPO),这是一种适用于REINFORCE/GRPO风格算法的通用稳定化方法,(i)基于有效样本量调整学习率以抑制不可靠的更新,(ii)为离策设置应用闭式最小方差基线,避免使用辅助价值模型,并且增加的开销很小。实验证明,VCPO显著提高了数学、通用推理和工具使用任务中异步训练的鲁棒性,优于广泛基线方法,包括掩码/剪切稳定器和算法变体。这表明,在大规模可靠异步RL中,显式控制策略梯度方差至关重要,可以将长上下文、多轮训练时间减少2.5倍,同时匹配同步性能。
Summary / 总结
The paper addresses the high variance of policy-gradient estimates in asynchronous reinforcement learning (RL) for large language models (LLMs), which destabilizes learning. It proposes VCPO, which scales the learning rate by effective sample size and uses a closed-form minimum-variance baseline to stabilize off-policy updates. Experiments show that VCPO improves the robustness of asynchronous training across math, general reasoning, and tool-use tasks, outperforms other stabilization techniques, and cuts long-context, multi-turn training time by 2.5 times while matching synchronous performance.
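The paper studies the high variance that arises when critic-free methods such as REINFORCE and GRPO are used for asynchronous RL training of large language models (LLMs). It proposes VCPO, which scales the learning rate by the effective sample size and uses a closed-form minimum-variance baseline to stabilize off-policy updates. Experiments show that VCPO substantially improves the robustness of asynchronous training across reasoning tasks, cutting training time by a factor of 2.5 while matching synchronous performance.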
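The effective sample size (ESS) diagnostic can be illustrated with the standard Kish formula, ESS = (Σw)² / Σw², applied to importance ratios. The learning-rate rule below, which multiplies the base rate by ESS / N, is only one simple way to "scale learning rate based on effective sample size"; the abstract does not give VCPO's exact rule, and the minimum-variance baseline is not covered here.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return (total * total) / squares if squares > 0 else 0.0


def scaled_learning_rate(base_lr, weights):
    """Shrink the step in proportion to ESS / N (assumed rule).

    With uniform weights ESS equals N and the base rate is unchanged;
    when a few stale samples dominate, ESS and the step both collapse.
    """
    n = len(weights)
    if n == 0:
        return 0.0
    return base_lr * effective_sample_size(weights) / n


# Hypothetical importance ratios pi_new / pi_old for four rollouts.
fresh = [1.0, 1.0, 1.0, 1.0]   # on-policy: all ratios equal
stale = [10.0, 0.1, 0.1, 0.1]  # heavy-tailed: one sample dominates
print(effective_sample_size(fresh), effective_sample_size(stale))
print(scaled_learning_rate(1e-5, fresh), scaled_learning_rate(1e-5, stale))
```

In the stale batch the ESS is barely above one, so nearly the whole update would come from a single rollout; damping the step in that case is the intuition behind the first component of VCPO.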
Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning
Authors: Obaidullah Zaland, Sajib Mistry, Monowar Bhuyan
First: 2026-02-19T18:40:12+00:00 · Latest: 2026-02-19T18:40:12+00:00
Comments: Accepted for Publication in IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Big data scenarios, where massive, heterogeneous datasets are distributed across clients, demand scalable, privacy-preserving learning methods. Federated learning (FL) enables decentralized training of machine learning (ML) models across clients without data centralization. Decentralized training, however, introduces a computational burden on client devices. U-shaped federated split learning (UFSL) offloads a fraction of the client computation to the server while keeping both data and labels on the clients' side. However, the intermediate representations (i.e., smashed data) shared by clients with the server are prone to exposing clients' private data. To reduce exposure of client data through intermediate data representations, this work proposes k-anonymous differentially private UFSL (KD-UFSL), which leverages privacy-enhancing techniques such as microaggregation and differential privacy to minimize data leakage from the smashed data transferred to the server. We first demonstrate that an adversary can access private client data from intermediate representations via a data-reconstruction attack, and then present a privacy-enhancing solution, KD-UFSL, to mitigate this risk. Our experiments indicate that, alongside increasing the mean squared error between the actual and reconstructed images by up to 50% in some cases, KD-UFSL also decreases the structural similarity between them by up to 40% on four benchmarking datasets. More importantly, KD-UFSL improves privacy while preserving the utility of the global model. This highlights its suitability for large-scale big data applications where privacy and utility must be balanced.
中文标题/摘要
标题:守护中间:保护联邦分割学习中的中间表示
在大规模、异构数据集分布在客户端的场景中,需要可扩展且保护隐私的机器学习方法。联邦学习(FL)允许在不集中数据的情况下,通过客户端分散训练机器学习模型。然而,分散训练会增加客户端设备的计算负担。U形联邦分割学习(UFSL)将部分客户端计算卸载到服务器上,同时将数据和标签保留在客户端。然而,客户端与服务器共享的中间表示(即被打碎的数据)容易泄露客户端的私有数据。为了减少通过中间数据表示泄露客户端数据的风险,本文提出了一种基于k匿名和差分隐私的UFSL(KD-UFSL),利用微聚合和差分隐私等隐私增强技术,最小化传输到服务器的被打碎数据的数据泄露。我们首先证明了攻击者可以通过数据重建攻击访问中间表示中的客户端私有数据,然后提出了一种隐私增强解决方案KD-UFSL来缓解这一风险。实验表明,KD-UFSL在某些情况下将实际图像与重建图像之间的均方误差提高了50%,结构相似性降低了40%。更重要的是,KD-UFSL在提高隐私的同时保持了全局模型的实用性。这突显了其在需要平衡隐私和实用性的大规模大数据应用中的适用性。
Summary / 总结
This work addresses the challenge of protecting intermediate representations in U-shaped federated split learning (UFSL) to prevent exposure of client data. It introduces k-anonymous differentially private UFSL (KD-UFSL), which uses microaggregation and differential privacy to reduce leakage from the smashed data sent to the server. Experiments on four benchmark datasets show that KD-UFSL increases the mean squared error between actual and reconstructed images by up to 50% and decreases their structural similarity by up to 40%, while preserving the utility of the global model.
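The paper aims to protect intermediate representations in U-shaped federated split learning (UFSL) so that client data is not leaked. It proposes k-anonymous differentially private UFSL (KD-UFSL), which uses microaggregation and differential privacy to reduce data leakage. Experiments show that KD-UFSL raises the error of reconstructed images by up to 50% and lowers structural similarity by up to 40% while preserving the utility of the global model, balancing privacy and utility in large-scale big-data applications.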
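The two privacy techniques named in the abstract can be sketched on toy "smashed data" vectors. Microaggregation here sorts vectors by their coordinate sum, groups them into blocks of at least k, and replaces each vector with its block centroid; Laplace noise is then added per coordinate. The sort key, the noise scale, and the function names are assumptions, and the scale is not calibrated to any privacy budget, so this is not a differentially private mechanism as written.

```python
import random


def microaggregate(vectors, k):
    """Replace each vector with the centroid of a group of >= k vectors."""
    if k < 1 or len(vectors) < k:
        raise ValueError("need at least k vectors and k >= 1")
    order = sorted(range(len(vectors)), key=lambda i: sum(vectors[i]))
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()      # fold an undersized last group
        groups[-1].extend(tail)  # into its neighbour
    result = [None] * len(vectors)
    for group in groups:
        dim = len(vectors[group[0]])
        centroid = [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
        for i in group:
            result[i] = list(centroid)
    return result


def add_laplace_noise(vectors, scale, rng):
    """Add Laplace(0, scale) noise to every coordinate.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return [
        [x + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale) for x in v]
        for v in vectors
    ]


# Hypothetical 2-D intermediate activations from five client samples.
smashed = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1], [5.2, 5.2]]
aggregated = microaggregate(smashed, k=2)
protected = add_laplace_noise(aggregated, scale=0.1, rng=random.Random(0))
print(aggregated)
```

After microaggregation every transmitted vector is shared by at least k samples, which is the k-anonymity property; the noise step then perturbs the centroids themselves.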
Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
Authors: Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick
First: 2025-10-16T00:48:05+00:00 · Latest: 2026-02-19T18:33:22+00:00
Abstract
Diffusion models excel at generation, but their latent spaces are high dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plug-and-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing while rendering remains in the original diffusion space. ConDA separates editing and rendering by lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder and is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer.
Summary / 总结
The research aims to improve the interpretability and controllability of diffusion models by learning structured latent representations. ConDA (Contrastive Diffusion Alignment) is introduced, which uses contrastive learning on pretrained diffusion latents to align with underlying dynamical factors. Key findings show that ConDA enables smooth interpolation, extrapolation, and counterfactual editing while maintaining rendering in the original diffusion space, outperforming linear traversals and conditioning-based methods across various applications.
研究旨在通过学习结构化的潜在表示来提高扩散模型的可解释性和可控性。引入了ConDA(对比扩散对齐),该方法通过对比学习对预训练的扩散潜在变量进行几何调整,使其与潜在的动力学因素对齐。关键发现表明,ConDA 能够实现平滑的插值、外推和反事实编辑,同时保持渲染在原始扩散空间中,优于线性遍历和基于条件的方法,适用于多种应用。
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
中文标题/摘要
标题:ReplaceMe:通过深度剪枝和Transformer块线性化实现网络简化
我们引入了ReplaceMe,这是一种通用的无需训练的深度剪枝方法,能够有效将Transformer块替换为线性操作,同时在低压缩比下保持高性能。与需要额外训练或微调的传统剪枝方法不同,我们的方法仅需要一个小规模的校准数据集来估计线性变换,该变换近似于剪枝后的块。估计出的线性映射可以无缝地与剩余的Transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe在所有无需训练的方法中表现最佳,并且在涉及大量重新训练/微调和架构修改的最新剪枝方法中保持了高度竞争力。应用于多个大型语言模型(LLMs),ReplaceMe在开放基准测试中实现了高达25%的剪枝,同时保留了原始模型约90%的性能,无需任何训练或修复步骤,从而减少了计算开销。我们提供了一个开源库,实现了ReplaceMe以及几种最新的深度剪枝技术,可在https://github.com/mts-ai/ReplaceMe 获取。
Summary / 总结
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios. Unlike conventional pruning methods that require additional training, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks and can be merged into the remaining blocks without adding parameters. Experiments show ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that involve extensive retraining. Applied to large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, with no training or healing steps and minimal computational overhead.
ReplaceMe 是一种无需训练的深度剪枝方法,通过将变压器块替换为线性操作来保持高性能并实现最小压缩。与需要额外训练的常规方法不同,ReplaceMe 使用一个小的校准数据集来估计一个线性变换,该变换近似于剪枝后的块。实验表明,ReplaceMe 在无需额外训练和计算开销的情况下,能够实现高达 25% 的剪枝,同时保持原始模型约 90% 的性能。
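The calibration step described in this entry can be sketched as an ordinary least-squares fit. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_linear_map`, the toy activations, and the plain least-squares objective are all hypothetical, and the paper may use a different objective or handle residual connections differently.

```python
import numpy as np

def estimate_linear_map(x_in, y_out):
    """Least-squares matrix T such that x_in @ T approximates y_out."""
    t, *_ = np.linalg.lstsq(x_in, y_out, rcond=None)
    return t

rng = np.random.default_rng(0)
hidden = 8
# Hypothetical calibration activations entering the pruned span of blocks.
x_in = rng.normal(size=(256, hidden))
# Hypothetical activations leaving the span: a nearly linear function of x_in.
true_map = rng.normal(size=(hidden, hidden))
y_out = x_in @ true_map + 0.01 * rng.normal(size=(256, hidden))

t_hat = estimate_linear_map(x_in, y_out)
rel_err = np.linalg.norm(x_in @ t_hat - y_out) / np.linalg.norm(y_out)

# Folding the estimated map into the next linear layer adds no parameters:
# (x @ T) @ W equals x @ (T @ W).
w_next = rng.normal(size=(hidden, hidden))
w_merged = t_hat @ w_next
```

Because the estimated map is itself linear, it can be absorbed into an adjacent weight matrix, which is one way to read the abstract's statement that no additional network parameters are needed.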
Towards Anytime-Valid Statistical Watermarking
Authors: Baihe Huang, Eric Xu, Kannan Ramchandran, Jiantao Jiao, Michael I. Jordan
First: 2026-02-19T18:32:26+00:00 · Latest: 2026-02-19T18:32:26+00:00
Abstract
The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid anytime inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
中文标题/摘要
标题:迈向任意时点有效的统计水印
大型语言模型(LLMs)的普及需要有效的机制来区分机器生成的内容和人类文本。虽然统计水印已经作为一种有前景的解决方案出现,但现有方法存在两个关键限制:缺乏选择抽样分布的原理性方法以及依赖固定时间窗假设检验,这限制了早期停止的有效性。在本文中,我们通过开发第一个基于e值的水印框架——锚定e水印,填补了这一空白,该框架将最优采样与任意时点的有效推断统一起来。与传统方法不同,传统方法中的可选停止会破坏第一类错误保证,而我们的框架通过为检测过程构建测试超鞅,实现了有效的任意时点推断。通过利用锚定分布来近似目标模型,我们以最坏情况下的对数增长率为基准,表征最优e值,并推导出最优的期望停止时间。我们的理论主张通过模拟和在现有基准上的评估得到了验证,表明我们的框架可以显著提高样本效率,相对于最先进的基线,平均减少13-15%的令牌预算用于检测。
Summary / 总结
This paper addresses the need for efficient mechanisms to distinguish machine-generated content from human text, particularly with the rise of Large Language Models. It introduces Anchored E-Watermarking, an e-value-based framework that combines optimal sampling with anytime-valid inference. Unlike fixed-horizon methods, this framework allows valid early stopping without compromising Type-I error guarantees. Simulations and evaluations on established benchmarks show that the proposed method reduces the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
本文提出了一种新的统计水印框架Anchored E-Watermarking,以解决区分机器生成内容和人类文本的问题。该框架将最优采样与任意时点的有效推断统一起来,克服了现有方法缺乏采样分布的原理性方法和依赖固定时间窗口假设检验的局限。主要发现是,所提出的方法可以显著提高样本效率,与最先进的基线相比,检测所需的平均令牌预算减少了13-15%。
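The idea of anytime-valid detection can be illustrated with a generic likelihood-ratio e-process. This is a sketch under loud assumptions: the binary per-token statistic, the probabilities `p_null` and `q_alt`, and the function name `sequential_detect` are hypothetical and are not the Anchored E-Watermarking construction from the paper. Each factor has expectation 1 under the null, so the running product is a test martingale, and by Ville's inequality rejecting once it reaches 1/alpha keeps the Type-I error at most alpha at any stopping time.

```python
def sequential_detect(bits, p_null=0.5, q_alt=0.75, alpha=0.01):
    """Return the 1-based token index at which the null is rejected, or None.

    bits: per-token binary statistic (hypothetical), equal to 1 with
    probability p_null for human text and q_alt for watermarked text.
    """
    wealth = 1.0
    for i, b in enumerate(bits, start=1):
        # Likelihood-ratio e-value; its expectation under the null is 1.
        wealth *= (q_alt / p_null) if b else ((1 - q_alt) / (1 - p_null))
        if wealth >= 1 / alpha:
            return i  # stopping early is valid for this kind of test
    return None

stop_strong = sequential_detect([1] * 50)    # every token supports the alternative
stop_mixed = sequential_detect([1, 0] * 50)  # evidence never accumulates
```

With a fixed-horizon test the sample size must be chosen in advance; here the detector can stop as soon as the accumulated evidence crosses the threshold, which is the property the abstract refers to as valid early stopping.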
Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
Authors: Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly
First: 2026-02-19T18:30:18+00:00 · Latest: 2026-02-19T18:30:18+00:00
Abstract
In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
中文标题/摘要
标题:动态适应:基于相关性的在线元学习与潜在概念引导的空间发现
在环境监测、灾害响应或公共卫生等许多现实场景中,由于数据收集成本高且环境动态变化,从未观察区域有选择地采样对于在资源受限的情况下高效发现隐藏目标至关重要。然而,稀疏且有偏的空间真实情况限制了现有基于学习的方法(如强化学习)的应用。为解决这一问题,我们提出了一种统一的空间发现框架,该框架结合了主动学习、在线元学习和概念引导推理。我们的方法引入了两个关键创新,基于*概念相关性*这一共同概念:一种*概念加权不确定性采样策略*,其中不确定性根据基于现成领域特定概念(如土地覆盖、源距离)学习到的相关性进行调整;以及一种*相关性感知的元批次形成策略*,该策略在在线元更新过程中促进语义多样性,从而在动态环境中提高泛化能力。我们的实验包括在真实世界数据集(含致癌的PFAS(全氟和多氟烷基物质)污染)上的测试,展示了在有限数据和变化环境中,该方法在发现目标方面的可靠性。
Summary / 总结
The paper proposes a unified geospatial discovery framework that combines active learning, online meta-learning, and concept-guided reasoning to efficiently uncover hidden targets in dynamic environments with limited data. It introduces a concept-weighted uncertainty sampling strategy and a relevance-aware meta-batch formation strategy, both leveraging domain-specific concepts to improve target discovery. The method was tested on a real-world dataset of PFAS contamination, demonstrating its effectiveness in uncovering targets under resource constraints.
论文针对动态环境中有限数据和资源条件下高效发现隐藏目标的挑战,特别是在环境监测等地理空间场景中。提出了一种结合主动学习、在线元学习和概念导向推理的统一框架。关键创新包括基于概念权重的不确定性采样策略和相关性感知的元批次形成策略。该方法在PFAS(全氟和多氟烷基物质)污染等真实数据集上展示了在稀疏和偏差数据以及动态环境中发现目标的有效性。
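A concept-weighted uncertainty score of the kind described in this entry can be sketched as predictive entropy scaled by a relevance weight. The multiplicative form, the function names, and the toy numbers are assumptions for illustration; in the paper the relevance is learned from domain concepts such as land cover or source proximity.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_sites(pred_probs, relevance, budget):
    """Rank candidate sites by entropy times relevance and keep the top `budget`."""
    scores = [binary_entropy(p) * r for p, r in zip(pred_probs, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

# Hypothetical model outputs and relevance weights for four unobserved sites.
pred_probs = [0.5, 0.9, 0.45, 0.1]
relevance = [0.2, 1.0, 0.9, 0.5]
chosen = select_sites(pred_probs, relevance, budget=2)
```

In this toy case pure uncertainty sampling would pick site 0 first (its prediction is 0.5), but its low relevance pushes it to the bottom of the ranking.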
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Authors: Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Venue: ICLR 2026
First: 2025-10-16T17:59:51+00:00 · Latest: 2026-02-19T18:30:05+00:00
Comments: ICLR 2026. Code: https://github.com/Lakonik/piFlow Demos: https://huggingface.co/spaces/Lakonik/pi-Qwen | https://huggingface.co/spaces/Lakonik/pi-FLUX.1 | https://huggingface.co/spaces/Lakonik/pi-FLUX.2
Abstract
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($π$-Flow). $π$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $π$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $π$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.
中文标题/摘要
标题:pi-Flow:基于策略的少步生成通过蒸馏模仿
少步扩散或基于流的生成模型通常将一个预测速度的教师蒸馏为一个学生,该学生预测通向去噪数据的捷径。这种格式不匹配导致了复杂的蒸馏过程,往往在质量与多样性之间存在权衡。为了解决这个问题,我们提出了基于策略的流模型($π$-Flow)。$π$-Flow 将学生流模型的输出层修改为在单个时间步预测一个无网络策略。该策略随后在未来的子步骤中生成动态流速度,几乎不增加额外开销,从而在这些子步骤中无需额外网络评估即可快速准确地进行 ODE 整合。为了使策略的 ODE 轨迹与教师的相匹配,我们引入了一种新颖的模仿蒸馏方法,该方法使用标准的 $\ell_2$ 流匹配损失,在策略轨迹上将策略的速度与教师的速度进行匹配。通过简单模仿教师的行为,$π$-Flow 使训练变得稳定且可扩展,并避免了质量与多样性的权衡。在 ImageNet 256$^2$ 上,它达到了 1-NFE 的 FID 为 2.85,优于相同 DiT 架构的先前 1-NFE 模型。在 FLUX.1-12B 和 Qwen-Image-20B 上,$π$-Flow 在 4 NFEs 下实现了比最先进的 DMD 模型更好的多样性,同时保持了教师级别的质量。
Summary / 总结
The paper proposes pi-Flow, a policy-based few-step generative model that avoids the quality-diversity trade-off of existing distillation methods through imitation distillation from a velocity-predicting teacher. The student predicts a network-free policy at one timestep, which produces dynamic flow velocities at future substeps and enables fast, accurate ODE integration without extra network evaluations. On ImageNet 256^2 it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture, and on the FLUX.1-12B and Qwen-Image-20B models at 4 NFEs it achieves better diversity than state-of-the-art DMD models while maintaining teacher-level quality.
pi-Flow通过提出基于策略的流模型来解决少步生成模型中的质量多样性权衡问题。该模型在学生模型中预测一个策略,该策略在未来的子步骤中生成动态速度,而无需额外的网络评估。一种新的模仿蒸馏方法确保策略与教师的轨迹匹配。在ImageNet 256^2上,pi-Flow达到1-NFE FID为2.85,优于之前的模型。在FLUX.1-12B和Qwen-Image-20B上,pi-Flow在保持教师级质量的同时,显示出比最先进的DMD模型更好的多样性。
Asymptotic Smoothing of the Lipschitz Loss Landscape in Overparameterized One-Hidden-Layer ReLU Networks
Authors: Saveliy Baturin
First: 2026-02-19T18:20:21+00:00 · Latest: 2026-02-19T18:20:21+00:00
Abstract
We study the topology of the loss landscape of one-hidden-layer ReLU networks under overparameterization. On the theory side, we (i) prove that for convex $L$-Lipschitz losses with an $\ell_1$-regularized second layer, every pair of models at the same loss level can be connected by a continuous path within an arbitrarily small loss increase $ε$ (extending a known result for the quadratic loss); (ii) obtain an asymptotic upper bound on the energy gap $ε$ between local and global minima that vanishes as the width $m$ grows, implying that the landscape flattens and sublevel sets become connected in the limit. Empirically, on a synthetic Moons dataset and on the Wisconsin Breast Cancer dataset, we measure pairwise energy gaps via Dynamic String Sampling (DSS) and find that wider networks exhibit smaller gaps; in particular, a permutation test on the maximum gap yields $p_{perm}=0$, indicating a clear reduction in the barrier height.
中文标题/摘要
标题:过参数化单隐藏层ReLU网络的Lipschitz损失景观的渐近光滑化
我们研究了过参数化单隐藏层ReLU网络的损失景观拓扑。在理论方面,我们(i) 证明对于具有$\ell_1$正则化第二层的凸$L$-Lipschitz损失,任意两个在相同损失水平的模型可以在不超过任意小损失增加$ε$的连续路径内相互连接(扩展了对二次损失已知的结果);(ii) 得到了局部和全局最小值之间能量差距$ε$的渐近上界,该差距随着宽度$m$的增长而消失,表明景观趋于平坦,子水平集变得连通。在实验中,我们使用动态字符串采样(DSS)在合成的月亮数据集和威斯康星乳腺癌数据集上测量了两两能量差距,发现更宽的网络表现出更小的差距;特别是,最大差距的置换检验得到$p_{perm}=0$,表明明显的势垒高度降低。
CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Authors: Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu
First: 2026-02-16T16:10:19+00:00 · Latest: 2026-02-19T18:19:25+00:00
Abstract
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
中文标题/摘要
标题:CT-Bench: 计算机断层扫描中多模态病变理解的基准数据集
人工智能(AI)可以自动勾画计算机断层扫描(CT)中的病变并生成放射学报告内容,但进展受限于可用的带有病变级别注释的CT数据集稀缺。为解决这一问题,我们引入了CT-Bench,这是一个首创的基准数据集,包含两个部分:包含20,335个病变的病变图像和元数据集,来自7,795个CT研究,包含边界框、描述和尺寸信息,以及一个涵盖病变定位、描述、尺寸估计和属性分类的多任务视觉问答基准,包含2,850个问答对。还包含困难的负例以反映实际诊断挑战。我们通过将多个最先进的多模态模型与放射科医生评估进行比较,评估了CT-Bench的价值,证明了CT-Bench作为病变分析综合基准的价值。此外,对病变图像和元数据集进行微调在两个部分上均取得了显著的性能提升,突显了CT-Bench的临床用途。
Summary / 总结
CT-Bench is a new benchmark dataset for lesion understanding in CT scans, consisting of 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information. It also includes a multitask visual question answering benchmark with 2,850 QA pairs. The authors evaluate state-of-the-art multimodal models against radiologist assessments and show that fine-tuning on the Lesion Image and Metadata Set improves performance on both components. The benchmark addresses the scarcity of publicly available CT datasets with lesion-level annotations.
CT-Bench 是一个新的基准数据集,包含来自 7,795 个 CT 研究的 20,335 个病灶,附有边界框、描述和大小信息。它还包含一个包含 2,850 个 QA 对的多任务视觉问答基准。该数据集评估了最先进的多模态模型,显示通过在病灶图像和元数据集上进行微调可以提高性能。CT-Bench 有助于填补公开可用的带有病灶级注释的 CT 数据集的空白,并突显了 CT-Bench 在病灶分析中的临床用途。
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
中文标题/摘要
标题:AI游戏库:通过人类游戏评估机器通用智能的可扩展、开放性方法
在技术飞速发展的时代,严格评估机器智能与人类通用智能的广泛谱系变得越来越重要且具有挑战性。传统的AI基准测试通常仅评估人类活动有限范围内的狭窄能力。大多数基准测试也是静态的,随着开发人员显式或隐式地对其进行优化,它们很快就会饱和。我们提出了一种更可行的方法来评估AI系统中的人类般通用智能:通过一种特别强大的通用游戏玩法形式:研究它们如何以及如何很好地玩和学习玩所有可能的人类游戏,与具有相同经验、时间或其他资源的人类玩家进行比较。我们定义“人类游戏”为人类设计供人类玩的游戏,并认为这个所有人类可以想象和享受的游戏空间是评估的理想领域——“人类游戏多元宇宙”。为实现这一愿景的第一步,我们引入了AI游戏库,这是一个使用人类在环的LLM构建的可扩展和开放性平台,通过自动获取和适应来自流行人类数字游戏平台的标准和容器化游戏环境变体来合成新的代表性人类游戏。作为概念验证,我们基于Apple App Store和Steam的热门排行榜生成了100个此类游戏,并在短游戏片段上评估了七个前沿的视觉语言模型(VLMs)。最好的模型在大多数游戏中的人类平均得分中仅达到了不到10%,尤其是在挑战世界模型学习、记忆和规划的游戏方面表现尤为困难。最后,我们提出了构建AI游戏库的下一步,作为一种实际的测量和推动机器向人类般通用智能发展的方法。
Summary / 总结
The paper aims to evaluate machine general intelligence by comparing it to human general intelligence through a scalable and open-ended platform called AI GameStore. This platform uses large language models with human oversight to synthesize new human games and evaluates vision-language models on these games. The key finding is that the best models achieved less than 10% of the human average score on most games, particularly struggling with games that test world-model learning, memory, and planning capabilities.
论文旨在通过一种新颖的方法全面评估机器智能与人类一般智能的对比:通过让机器玩所有可能的人类游戏。方法是创建一个名为AI GameStore的可扩展平台,该平台使用LLM和人类辅助生成新的代表性人类游戏。对这七种前沿的视觉语言模型进行了评估,最好的模型在大多数游戏中仅达到了人类平均得分的不到10%,尤其是在挑战世界模型学习、记忆和规划的游戏方面表现尤为糟糕。
Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization
Authors: Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu
First: 2026-02-17T00:27:58+00:00 · Latest: 2026-02-19T18:14:28+00:00
Abstract
Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration-Exploitation Distillation (E$^2$D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E$^2$D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being $18\times$ faster, and on ImageNet-21K, our method substantially improves accuracy while remaining $4.3\times$ faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab/E2D.
中文标题/摘要
标题:通过探索-利用优化加速大规模数据集蒸馏
数据集蒸馏将原始数据压缩成紧凑的合成数据集,减少训练时间和存储空间同时保持模型性能,使在资源有限的情况下部署成为可能。尽管最近的解耦蒸馏方法能够实现大规模的数据集蒸馏,但它们仍然面临效率差距:基于优化的解耦方法能够获得更高的准确率,但需要大量的计算,而无需优化的解耦方法则更高效但牺牲了准确率。为了克服这种权衡,我们提出了探索-利用蒸馏(E$^2$D),这是一种简单实用的方法,通过高效的流水线减少冗余计算,该流水线从全图像初始化开始以保持语义完整性和特征多样性。然后使用两阶段优化策略:探索阶段进行均匀更新并识别高损失区域,以及利用阶段专注于这些区域的更新以加速收敛。我们在大规模基准上评估了E$^2$D,在ImageNet-1K上超越了最先进的方法,同时快了18倍,在ImageNet-21K上我们的方法显著提高了准确率同时保持了4.3倍的加速。这些结果表明,目标导向、减少冗余的更新,而不是暴力优化,能够弥合大规模数据集蒸馏中的准确率和效率之间的差距。代码可在https://github.com/ncsu-dk-lab/E2D/ 获取。
Summary / 总结
The paper addresses the challenge of efficiently compressing large-scale datasets for model training while maintaining performance. It introduces Exploration-Exploitation Distillation (E$^2$D), which minimizes redundant computation. The method starts from full-image initialization to preserve semantic integrity and feature diversity, then runs a two-phase optimization: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on those regions to accelerate convergence. E$^2$D surpasses the state of the art on ImageNet-1K while being 18x faster, and substantially improves accuracy on ImageNet-21K while remaining 4.3x faster. This indicates that targeted updates can bridge the gap between accuracy and efficiency in large-scale dataset distillation.
论文旨在高效压缩大规模数据集以进行模型训练,同时保持性能。提出了一种名为Exploration--Exploitation Distillation (E$^2$D)的方法,采用两阶段优化策略以减少冗余计算。该方法首先用全图像更新来保持语义完整性和特征多样性,然后在探索阶段识别高损失区域,在利用阶段专注于这些区域进行更新。实验结果表明,E$^2$D在ImageNet-1K和ImageNet-21K上优于现有方法,实现了更高的准确性和显著的加速效果。
Asymptotically Optimal Sequential Testing with Markovian Data
Authors: Alhad Sethi, Kavali Sofia Sagar, Shubhada Agrawal, Debabrota Basu, P. N. Karthik
First: 2026-02-19T18:11:02+00:00 · Latest: 2026-02-19T18:11:02+00:00
Abstract
We study one-sided and $α$-correct sequential hypothesis testing for data generated by an ergodic Markov chain. The null hypothesis is that the unknown transition matrix belongs to a prescribed set $P$ of stochastic matrices, and the alternative corresponds to a disjoint set $Q$. We establish a tight non-asymptotic instance-dependent lower bound on the expected stopping time of any valid sequential test under the alternative. Our novel analysis improves the existing lower bounds, which are either asymptotic or provably sub-optimal in this setting. Our lower bound incorporates both the stationary distribution and the transition structure induced by the unknown Markov chain. We further propose an optimal test whose expected stopping time matches this lower bound asymptotically as $α\to 0$. We illustrate the usefulness of our framework through applications to sequential detection of model misspecification in Markov Chain Monte Carlo and to testing structural properties, such as the linearity of transition dynamics, in Markov decision processes. Our findings yield a sharp and general characterization of optimal sequential testing procedures under Markovian dependence.
中文标题/摘要
标题:马尔可夫数据的渐近最优序贯检验
我们研究了由遍历马尔可夫链生成的数据的一边和$α$-正确的序贯假设检验。零假设是未知的转移矩阵属于给定的随机矩阵集合$P$,备择假设对应于一个不相交的集合$Q$。我们建立了在备择假设下任何有效序贯检验的期望停止时间的紧致非渐近实例依赖下界。我们的新颖分析改进了现有的下界,这些下界要么是渐近的,要么在这个设置中是可证明的次优的。我们的下界同时包含了平稳分布和由未知马尔可夫链诱导的转移结构。我们进一步提出了一种最优检验,其期望停止时间在$α\to 0$时与该下界渐近匹配。我们通过马尔可夫链蒙特卡洛中的模型误设序贯检测应用以及马尔可夫决策过程中的结构属性检验(如转移动力学的线性性)来说明我们框架的有效性。我们的发现为马尔可夫依赖下的最优序贯检验程序提供了精确而通用的表征。
Nonlinear Model Order Reduction of Dynamical Systems in Process Engineering: Review and Comparison
Authors: Jan C. Schulze, Alexander Mitsos
First: 2025-06-15T11:39:12+00:00 · Latest: 2026-02-19T17:29:34+00:00
Abstract
Computationally cheap yet accurate dynamical models are a key requirement for real-time capable nonlinear optimization and model-based control. When given a computationally expensive high-order prediction model, a reduction to a lower-order simplified model can enable such real-time applications. Herein, we review nonlinear model order reduction methods and provide a comparison of method characteristics. Additionally, we discuss both general-purpose methods and tailored approaches for chemical process systems and we identify similarities and differences between these methods. As machine learning manifold-Galerkin approaches currently do not account for inputs in the construction of the reduced state subspace, we extend these methods to dynamical systems with inputs. In a comparative case study, we apply eight established model order reduction methods to an air separation process model: POD-Galerkin, nonlinear-POD-Galerkin, manifold-Galerkin, dynamic mode decomposition, Koopman theory, manifold learning with latent predictor, compartment modeling, and model aggregation. Herein, we do not investigate hyperreduction, i.e., reduction of floating point operations. Based on our findings, we discuss strengths and weaknesses of the model order reduction methods.
中文标题/摘要
标题:过程工程中动力学系统的非线性模型降阶:综述与比较
计算成本低廉但准确的动力学模型是实时非线性优化和模型导向控制的关键需求。当给定一个计算成本高昂的高阶预测模型时,将其降阶为一个较低阶的简化模型可以实现此类实时应用。本文综述了非线性模型降阶方法,并提供了方法特性的比较。此外,我们讨论了通用方法和针对化学过程系统的定制方法,并指出了这些方法之间的相似性和差异性。由于当前的机器学习流形-伽罗金方法在构建降阶状态子空间时不考虑输入,我们将其扩展到具有输入的动力学系统。在比较案例研究中,我们应用了八种已建立的模型降阶方法到空气分离过程模型:POD-伽罗金、非线性-POD-伽罗金、流形-伽罗金、动态模式分解、Koopman理论、基于潜在预测的流形学习、隔室建模和模型聚合。本文未研究超缩减,即浮点运算的缩减。基于我们的发现,我们讨论了模型降阶方法的优势和劣势。
Summary / 总结
The research targets computationally cheap yet accurate dynamical models for real-time nonlinear optimization and model-based control in process engineering. The study reviews and compares nonlinear model order reduction methods, covering both general-purpose and process-specific approaches, and extends machine learning manifold-Galerkin methods to dynamical systems with inputs. In a comparative case study, eight established methods are applied to an air separation process model; hyperreduction is not investigated. Based on the results, the authors discuss the strengths and weaknesses of each method.
研究旨在开发在过程工程中用于实时非线性优化和控制的计算效率高且准确的动力学模型。研究回顾并比较了各种非线性模型降阶方法,包括通用方法和针对化工过程的定制方法。主要发现表明,当扩展到包含输入时,流形-Galerkin方法在减少空气分离过程模型的复杂性方面表现良好,同时保持了准确性。研究还强调了每种方法的优点和缺点,为实际应用提供了见解。
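Of the eight methods listed in this entry, POD-Galerkin is the simplest to sketch: a reduced basis is taken from the leading left singular vectors of a snapshot matrix, and the dynamics are projected onto it. The example below is a toy linear illustration only; the snapshot data, the matrix `a_full`, and the function name `pod_basis` are hypothetical, and the paper's case study concerns a nonlinear air separation model.

```python
import numpy as np

def pod_basis(snapshots, rank):
    """Leading `rank` left singular vectors of a (states x time) snapshot matrix."""
    u, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return u[:, :rank]

rng = np.random.default_rng(0)
n_states, n_snapshots, rank = 20, 50, 2
# Toy snapshots that lie exactly in a 2-dimensional subspace (hypothetical data).
modes = np.linalg.qr(rng.normal(size=(n_states, rank)))[0]
snapshots = modes @ rng.normal(size=(rank, n_snapshots))

v = pod_basis(snapshots, rank)
a_full = -np.eye(n_states)     # hypothetical full-order linear dynamics x' = A x
a_reduced = v.T @ a_full @ v   # Galerkin projection onto the POD subspace
proj_err = np.linalg.norm(snapshots - v @ (v.T @ snapshots)) / np.linalg.norm(snapshots)
```

The reduced model evolves a 2-dimensional state `z` with `z' = a_reduced @ z`, and the full state is recovered as `v @ z`.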
Be Wary of Your Time Series Preprocessing
Authors: Sofiane Ennadir, Tianze Wang, Oleg Smirnov, Sahar Asadi, Lele Cao
Venue: AAAI
First: 2026-02-19T17:23:56+00:00 · Latest: 2026-02-19T17:23:56+00:00
Comments: Accepted at the AI4TS workshop at AAAI-26
Abstract
Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model's ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model's representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
中文标题/摘要
标题:警惕您的时间序列预处理
归一化和缩放是时间序列建模中基本的预处理步骤,但在理论上,它们在基于Transformer的模型中的作用尚未得到充分探索。本文首次对不同归一化策略,特别是实例基和全局缩放,如何影响基于Transformer的时间序列表示学习架构的表达能力进行了形式化分析。我们提出了一种针对时间序列定制的新型表达能力框架,该框架量化了模型在表示空间中区分相似和不相似输入的能力。利用该框架,我们为两种广泛使用的归一化方法——标准和最小-最大缩放——推导出了理论界线。我们的分析表明,归一化策略的选择可能会显著影响模型的表示能力,这取决于任务和数据特性。我们通过在分类和预测基准上使用多种基于Transformer的模型进行实证验证,补充了我们的理论。结果显示,并非所有归一化方法都能始终优于其他方法,在某些情况下,完全省略归一化可能会获得更好的性能。这些发现突显了预处理在时间序列学习中的关键作用,并促使需要为特定任务和数据集定制更原则性的归一化策略。
Summary / 总结
This work explores the impact of normalization strategies on Transformer-based models for time series analysis. It introduces a novel expressivity framework to quantify a model's ability to distinguish between similar and dissimilar inputs. The study derives theoretical bounds for Standard and Min-Max scaling and finds that the choice of normalization can significantly affect the model's representational capacity. Empirical validation on classification and forecasting benchmarks shows that no single normalization method consistently outperforms others, and in some cases, omitting normalization can lead to better performance.
该研究探讨了Transformer模型在时间序列分析中归一化和缩放的作用,引入了一种新的表达能力框架来量化模型区分相似和不相似输入的能力。研究推导了标准和最小-最大缩放的理论边界,并在分类和预测基准上进行了实证验证。结果表明,归一化策略的选择显著影响模型性能,在某些情况下,不进行归一化效果最佳。
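The difference between instance-based and global scaling can be seen on two toy series that differ only in scale. This is a small sketch with hypothetical data and helper names; it illustrates why the choice of normalization affects whether inputs remain distinguishable, and it does not reproduce the paper's expressivity framework or bounds.

```python
from statistics import mean, pstdev

def standard_scale(series, mu=None, sigma=None):
    """Standard scaling; uses the series' own statistics unless given."""
    mu = mean(series) if mu is None else mu
    sigma = pstdev(series) if sigma is None else sigma
    return [(x - mu) / sigma for x in series]  # assumes sigma > 0

def min_max_scale(series, lo=None, hi=None):
    """Min-Max scaling; uses the series' own range unless given."""
    lo = min(series) if lo is None else lo
    hi = max(series) if hi is None else hi
    return [(x - lo) / (hi - lo) for x in series]  # assumes hi > lo

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

# Instance-based: each series is scaled with its own statistics.
inst_a, inst_b = min_max_scale(a), min_max_scale(b)

# Global: statistics are shared across the whole dataset.
lo, hi = min(a + b), max(a + b)
glob_a, glob_b = min_max_scale(a, lo, hi), min_max_scale(b, lo, hi)
```

After instance-based scaling the two series become identical, so no downstream model can tell them apart; global scaling preserves the difference in magnitude. Which behaviour is preferable depends on whether scale carries task-relevant information.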
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
Authors: Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao
Venue: ICLR 2026
First: 2026-02-19T17:13:44+00:00 · Latest: 2026-02-19T17:13:44+00:00
Comments: Accepted by ICLR 2026
Abstract
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering, which fails to capture complex patterns of activation distributions. In this work, we propose a unified theoretical framework for activation steering in LLM alignment based on ordinary differential equations (ODEs). We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which yields empirical gains in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
中文标题/摘要
标题:ODESteer:基于ODE的统一大型语言模型对齐引导框架
激活引导或表示工程提供了一种轻量级的方法,在推理时通过操控大型语言模型(LLMs)的内部激活来使其对齐。然而,当前的方法存在两个关键限制:(i) 缺乏一个统一的理论框架来指导引导方向的设计,(ii) 过度依赖于一步引导,未能捕捉激活分布的复杂模式。在本文中,我们提出了一种基于常微分方程(ODEs)的统一激活引导理论框架,用于LLM对齐。我们表明,传统的激活添加可以被解释为ODE解的一阶近似。基于这种ODE视角,确定一个引导方向等同于设计一个控制理论中的障碍函数。基于此框架,我们引入了ODESteer,这是一种由障碍函数引导的基于ODE的引导方法,展示了在LLM对齐中的实证进步。ODESteer通过将障碍函数定义为正激活和负激活的对数密度比来确定引导方向,并利用它构建一个ODE进行多步和自适应引导。与最先进的激活引导方法相比,ODESteer在各种LLM对齐基准测试中实现了一致的实证改进,在TruthfulQA上提升5.7%,在UltraFeedback上提升2.5%,在RealToxicityPrompts上提升2.4%。我们的工作通过使用ODE统一激活引导的理论基础,建立了LLM对齐中激活引导的新原理,并通过提出的ODESteer方法进行了实证验证。
Summary / 总结
The paper addresses the limitations of current activation steering methods for large language models (LLMs) by proposing a unified ODE-based theoretical framework and, derived from it, the ODESteer method. The framework interprets conventional activation addition as a first-order approximation to an ODE solution and uses barrier functions from control theory to guide steering directions. ODESteer shows empirical improvements over existing methods on various LLM alignment benchmarks, including a 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.
论文提出了一种基于ODE的统一框架ODESteer,以解决当前用于大型语言模型(LLMs)的激活引导方法的局限性。该框架将传统的激活添加解释为ODE解的一阶近似,并使用控制理论中的障碍函数来引导引导方向。ODESteer在各种LLM对齐基准测试中表现出明显的改进,相对于TruthfulQA提高了5.7%,相对于UltraFeedback提高了2.5%,相对于RealToxicityPrompts提高了2.4%。
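The claim that activation addition is a first-order approximation to an ODE solution can be illustrated with a small sketch. It integrates da/dt = ∇h(a) with forward Euler, where h is a log-density ratio between positive and negative activations. The isotropic Gaussian densities, the function names, and all parameter values are assumptions made for illustration only; they are not taken from the paper and are not ODESteer's actual estimator.

```python
def grad_log_density_ratio(a, mu_pos, mu_neg, var_pos, var_neg):
    """Gradient of h(a) = log N(a; mu_pos, var_pos*I) - log N(a; mu_neg, var_neg*I).

    Isotropic Gaussians are an illustrative assumption, not the paper's model.
    """
    return [-(x - p) / var_pos + (x - n) / var_neg
            for x, p, n in zip(a, mu_pos, mu_neg)]


def steer(a, mu_pos, mu_neg, var_pos, var_neg, total_time=1.0, num_steps=1):
    """Integrate da/dt = grad h(a) with forward Euler.

    num_steps=1 is a single additive update (one-step steering);
    num_steps>1 re-evaluates the direction at every intermediate activation.
    """
    dt = total_time / num_steps
    a = list(a)
    for _ in range(num_steps):
        v = grad_log_density_ratio(a, mu_pos, mu_neg, var_pos, var_neg)
        a = [x + dt * vi for x, vi in zip(a, v)]
    return a


# With equal variances the gradient is the constant vector (mu_pos - mu_neg) / var,
# so one Euler step is plain mean-difference activation addition.
one_step = steer([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], 1.0, 1.0, num_steps=1)
# With unequal variances the direction depends on the current activation,
# so multi-step integration follows a different path than a single step.
multi_step = steer([0.0], [1.0], [-1.0], 1.0, 4.0, num_steps=10)
```

In this toy setting the one-step and multi-step results coincide only when the vector field is constant, which is the sense in which a single additive update is a first-order approximation.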
Revisiting Weight Regularization for Low-Rank Continual Learning
Authors: Yaoyue Zheng, Yin Zhang, Joost van de Weijer, Gido M van de Ven, Shaoyi Du, Xuetao Zhang, Zhiqiang Tian
Venue: ICLR 2026
First: 2026-02-19T17:13:00+00:00 · Latest: 2026-02-19T17:13:00+00:00
Comments: Accepted by ICLR 2026
Abstract
Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning a task-specific module during training, such as low-rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method EWC-LoRA leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computation- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA, achieving a stability-plasticity trade-off superior to existing low-rank CL approaches. These results indicate that, even under low-rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: https://github.com/yaoyz96/low-rank-cl.
中文标题/摘要
标题:重新审视低秩持续学习中的权重正则化
大规模预训练模型(PTMs)驱动的持续学习(CL)最近引起了广泛关注,焦点从从头训练转向持续适应PTMs。这催生了一个有前景的范式:参数高效持续学习(PECL),其中任务干扰通常通过在训练期间分配任务特定模块来缓解,例如低秩适配器。然而,弹性权重巩固(EWC)等权重正则化技术在这一新范式中仍被忽视。本文重新审视了低秩CL中的权重正则化,作为PECL中缓解任务干扰的新视角。与现有低秩CL方法不同,我们通过EWC正则化共享的低秩更新来缓解任务干扰,从而保持存储需求和推理成本不变,与任务数量无关。我们提出的方法EWC-LoRA利用低秩表示估计参数在全维空间中的重要性。该设计为使用PTMs的持续学习提供了一种实用、计算和内存高效的解决方案,并为PECL中正则化技术的更广泛应用提供了见解。在各种基准上的广泛实验表明,EWC-LoRA的有效性优于现有低秩CL方法,实现了优于现有方法的稳定性和灵活性权衡。这些结果表明,即使在低秩参数化下,权重正则化仍然是缓解任务干扰的有效机制。代码可在:https://github.com/yaoyz96/low-rank-cl/ 获取。
Summary / 总结
This paper revisits weight regularization techniques in low-rank continual learning (CL) to mitigate task interference in parameter-efficient continual learning (PECL). The proposed method, EWC-LoRA, uses Elastic Weight Consolidation (EWC) to regularize a shared low-rank update, keeping storage and inference costs constant. Experiments show that EWC-LoRA achieves a better stability-plasticity trade-off compared to existing low-rank CL approaches, indicating the effectiveness of weight regularization even under low-rank parameterizations.
本文重新审视了在大规模预训练模型(PTMs)的参数高效连续学习(PECL)中使用弹性权重巩固(EWC)等权重正则化技术。作者提出了EWC-LoRA,通过正则化共享的低秩更新来缓解任务干扰,保持存储和推理成本的恒定。实验表明,EWC-LoRA 在稳定性-可塑性权衡方面优于现有低秩CL方法,表明即使在低秩参数化下,权重正则化仍然是缓解任务干扰的有效机制。
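As a rough illustration of regularizing a shared low-rank update with EWC, the sketch below applies the standard EWC quadratic penalty, (λ/2) Σ F_ij (ΔW_ij − ΔW_ij^old)², to the full-dimensional update ΔW = B·A. Applying the penalty to the product B·A, the diagonal Fisher matrix `fisher`, and every name here are assumptions for illustration; the abstract does not specify how EWC-LoRA estimates importance, so this is not the authors' formulation.

```python
def matmul(B, A):
    """Plain-Python matrix product of B (m x r) and A (r x n)."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]


def ewc_lowrank_penalty(B, A, B_old, A_old, fisher, lam=1.0):
    """EWC-style penalty on the full-dimensional update delta_W = B @ A.

    fisher[i][j] is an assumed per-weight importance (diagonal Fisher estimate)
    for the previous tasks; (B_old, A_old) is the update learned on those tasks.
    """
    delta = matmul(B, A)
    delta_old = matmul(B_old, A_old)
    total = 0.0
    for i in range(len(delta)):
        for j in range(len(delta[0])):
            total += fisher[i][j] * (delta[i][j] - delta_old[i][j]) ** 2
    return 0.5 * lam * total


# A rank-1 update on a 2x2 weight matrix.
B, A = [[1.0], [2.0]], [[1.0, 0.0]]
B_old, A_old = [[0.0], [0.0]], [[1.0, 0.0]]
fisher = [[1.0, 1.0], [0.5, 1.0]]
penalty = ewc_lowrank_penalty(B, A, B_old, A_old, fisher)
```

Because a single pair (B, A) is shared across tasks in this sketch, the stored parameters do not grow with the number of tasks, which matches the constant storage and inference cost described in the abstract.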
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
First: 2026-02-19T17:11:59+00:00 · Latest: 2026-02-19T17:11:59+00:00
Comments: 10 pages, 6 figures
Abstract
Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
中文标题/摘要
标题:RetouchIQ:基于指令的图像润色通用奖励模型代理
近期多模态大型语言模型(MLLM)的发展为将视觉-语言推理扩展到专业工具图像编辑提供了巨大潜力,使直观和创意编辑成为可能。一个有前景的方向是使用强化学习(RL)使MLLM能够理解和在专业图像编辑软件中执行最佳工具使用计划。然而,由于缺乏可靠的、可验证的奖励信号来反映创意编辑的主观性,训练仍然具有挑战性。在本文中,我们介绍了RetouchIQ框架,该框架通过由通用奖励模型引导的MLLM代理执行基于指令的可执行图像编辑。RetouchIQ解释用户指定的编辑意图并生成相应的可执行图像调整,将高层次的美学目标与精确的参数控制相结合。为了超越传统的基于规则的奖励,这些奖励使用手工制作的度量标准计算与固定参考图像的相似性,我们提出了一种通用奖励模型,这是一种经过RL微调的MLLM,它通过一组生成的度量标准逐案评估编辑结果。然后,奖励模型通过多模态推理提供标量反馈,使强化学习能够获得高质量、指令一致的梯度。我们收集了一个扩展的数据集,包含19万个指令-推理对,并建立了基于指令的图像编辑的新基准。实验表明,RetouchIQ在语义一致性和感知质量方面显著优于基于MLLM和扩散模型的编辑系统。我们的研究结果表明,通用奖励驱动的MLLM代理具有作为灵活、可解释和可执行的专业图像编辑助手的潜力。
Summary / 总结
RetouchIQ is a framework that uses MLLM agents guided by a generalist reward model to perform instruction-based image editing. It interprets user intentions and generates executable image adjustments, improving semantic consistency and perceptual quality over previous MLLM-based and diffusion-based systems. The authors curate an extended dataset of 190k instruction-reasoning pairs, establish a new benchmark for instruction-based image editing, and demonstrate the potential of generalist reward-driven MLLM agents in professional image editing.
RetouchIQ 是一个框架,通过由通用奖励模型引导的 MLLM 代理执行基于指令的图像编辑。它解释用户意图并生成可执行的图像调整,将高层次的美学目标与精确的参数控制相结合。实验表明,RetouchIQ 在语义一致性和感知质量方面优于之前的基于 MLLM 和扩散模型的编辑系统。
Neural Implicit Representations for 3D Synthetic Aperture Radar Imaging
Authors: Nithin Sugavanam, Emre Ertin
First: 2026-02-19T17:10:37+00:00 · Latest: 2026-02-19T17:10:37+00:00
Abstract
Synthetic aperture radar (SAR) is a tomographic sensor that measures 2D slices of the 3D spatial Fourier transform of the scene. In many operational scenarios, the measured set of 2D slices does not fill the 3D space in the Fourier domain, resulting in significant artifacts in the reconstructed imagery. Traditionally, simple priors, such as sparsity in the image domain, are used to regularize the inverse problem. In this paper, we review our recent work that achieves state-of-the-art results in 3D SAR imaging employing neural structures to model the surface scattering that dominates SAR returns. These neural structures encode the surface of the objects in the form of a signed distance function learned from the sparse scattering data. Since estimating a smooth surface from a sparse and noisy point cloud is an ill-posed problem, we regularize the surface estimation by sampling points from the implicit surface representation during the training step. We demonstrate the model's ability to represent target scattering using measured and simulated data from single vehicles and a larger scene with a large number of vehicles. We conclude with future research directions calling for methods to learn complex-valued neural representations to enable synthesizing new collections from the volumetric neural implicit representation.
中文标题/摘要
标题:神经隐式表示在3D合成孔径雷达成像中的应用
合成孔径雷达(SAR)是一种层析传感器,测量场景三维空间傅里叶变换的二维切片。在许多操作场景中,测量到的二维切片集并不填充傅里叶域中的3D空间,导致重建图像中存在显著的伪影。传统上,使用简单的先验知识,如图像域中的稀疏性来正则化逆问题。在本文中,我们回顾了我们最近的工作,通过使用神经结构来建模主导SAR回波的表面散射,从而在3D SAR成像中达到最先进的结果。这些神经结构以学习自稀疏散射数据的符号距离函数的形式编码物体的表面。由于从稀疏且含噪声的点云中估计光滑表面是一个病态问题,我们在训练步骤中通过从隐式表面表示中采样点来正则化表面估计。我们使用单辆车和包含大量车辆的大型场景的测量和模拟数据,展示了该模型表示目标散射的能力。最后,我们提出了未来的研究方向,呼吁研究学习复值神经表示的方法,以便从体积神经隐式表示中合成新的采集数据。
Summary / 总结
This paper addresses the challenge of reconstructing 3D imagery from 2D Synthetic Aperture Radar (SAR) data, which often results in artifacts due to incomplete Fourier domain coverage. The authors propose using neural implicit representations to model the surface scattering, which is crucial for SAR returns. By learning a signed distance function from sparse scattering data, the model can effectively represent the 3D surface of objects. The method regularizes the surface estimation by sampling points during training, and it demonstrates superior performance in reconstructing 3D SAR imagery from both measured and simulated data, including scenes with multiple vehicles.
本文解决了从2D合成孔径雷达(SAR)数据重建3D图像时由于频域覆盖不完整而产生的图像伪影问题。作者提出使用神经隐式表示来建模表面散射,这对于SAR返回至关重要。通过从稀疏散射数据中学习符号距离函数,该模型能够有效表示物体的3D表面。通过在训练过程中采样点进行正则化,该方法在从实际测量和模拟数据中重建包含多辆车辆的场景的3D SAR图像方面表现出色。
GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
Authors: Zixu Cheng, Da Li, Jian Hu, Ziquan Liu, Wei Li, Shaogang Gong
First: 2026-02-19T17:09:30+00:00 · Latest: 2026-02-19T17:09:30+00:00
Comments: Under review
Abstract
Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate the resulting scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
中文标题/摘要
标题:GraphThinker:通过事件图思考强化视频推理
视频推理需要理解视频中事件之间的因果关系。然而,这些关系往往是隐含的,手动标注成本高昂。虽然现有的多模态大型语言模型(MLLM)通常通过密集字幕或视频摘要来推断事件关系,但这种建模仍然缺乏因果理解。在视频事件内部和跨事件之间没有明确的因果结构建模的情况下,这些模型在视频推理过程中会表现出幻觉。在本文中,我们提出了一种基于强化微调的方法——GraphThinker,该方法构建结构化的事件级场景图并增强视觉定位,以联合减少视频推理中的幻觉。具体来说,我们首先使用MLLM构建基于事件的视频场景图(EVSG),明确建模事件内的和事件间的联系,并将这些形成的场景图作为中间思考过程整合到MLLM中。我们还在强化微调过程中引入了视觉注意力奖励,这加强了视频定位并进一步减轻了幻觉。我们在RexTime和VidHalluc两个数据集上评估了GraphThinker,结果显示它在捕捉对象和事件关系方面具有更强的能力,能够更精确地定位事件,从而在视频推理中减少幻觉,优于先前的方法。
Summary / 总结
GraphThinker is a reinforcement finetuning-based method that constructs event-level scene graphs to enhance visual grounding and reduce hallucinations in video reasoning. It uses an MLLM to create an event-based video scene graph (EVSG) that models both intra- and inter-event relations, and incorporates these graphs into the MLLM for intermediate thinking. Additionally, a visual attention reward during reinforcement finetuning strengthens video grounding. GraphThinker outperforms previous methods on the RexTime and VidHalluc datasets by improving event localization and reducing hallucinations.
GraphThinker 是一种基于强化微调的方法,通过构建事件级场景图来提高视频推理中的因果理解。它使用 MLLM 创建一个基于事件的视频场景图(EVSG),以建模事件内的和事件间的关联,并将这些图融入 MLLM 中以改善视觉定位。GraphThinker 还在强化微调过程中引入了视觉注意力奖励,以进一步减少幻觉。在 RexTime 和 VidHalluc 数据集上,GraphThinker 在捕捉对象和事件关系方面表现出更精确的事件定位和减少幻觉的优越性能,优于先前的方法。
MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
First: 2026-02-19T17:05:20+00:00 · Latest: 2026-02-19T17:05:20+00:00
Abstract
Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: https://anonymous.4open.science/r/ma1/README.md.
中文标题/摘要
标题:MASPO:统一梯度利用、概率质量与信号可靠性以实现稳健且样本高效的LLM推理
现有的可验证奖励强化学习(RLVR)算法,如GRPO,依赖于刚性、均匀和对称的信任区域机制,这些机制与大型语言模型(LLMs)复杂的优化动态从根本上不一致。在本文中,我们识别出这些方法中的三个关键挑战:(1)由于硬剪裁的二元截止导致的梯度利用效率低下,(2)由于均匀比率约束忽略了令牌分布而导致的对概率质量不敏感,(3)由于正负样本之间不同的信用分配模糊性而导致的信号可靠性不对称。为了弥合这些差距,我们提出了质量自适应软策略优化(MASPO),这是一种统一框架,旨在协调这三个维度。MASPO结合了可微的软高斯门控以最大化梯度利用,质量自适应限制器以平衡概率谱上的探索,以及不对称风险控制器以使更新幅度与信号置信度对齐。广泛的评估表明,MASPO作为稳健且全能的RLVR解决方案,显著优于强大的基线。我们的代码可在:https://anonymous.4open.science/r/ma1/README.md 获取。
Summary / 总结
This paper addresses the limitations of existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms like GRPO, which use rigid trust region mechanisms that are not well-suited for the complex optimization dynamics of Large Language Models (LLMs). The authors propose MASPO, a unified framework that addresses three key challenges: inefficient gradient utilization, insensitive probability mass, and asymmetric signal reliability. MASPO uses a differentiable soft Gaussian gating to optimize gradient utility, a mass-adaptive limiter to balance exploration, and an asymmetric risk controller to align updates with signal confidence. Experimental results show that MASPO outperforms strong baselines and provides a robust RLVR solution.
本文针对现有RLVR算法如GRPO的局限性,提出了MASPO,旨在统一梯度利用、概率质量与信号可靠性。MASPO引入了可微软高斯门控、质量自适应限制器和不对称风险控制器来优化这些方面。广泛的评估表明,MASPO在性能上优于强基线,并为LLM推理提供了一个稳健的解决方案。
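The first challenge, the binary cutoff of hard clipping, can be made concrete with a small contrast. The hard gate below is the gradient mask implied by the standard PPO-style clipped surrogate min(r·A, clip(r, 1−ε, 1+ε)·A). The Gaussian-shaped soft gate, its `sigma` parameter, and both function names are assumed forms chosen only to illustrate the idea of a differentiable gate; they are not MASPO's actual formula, which the abstract does not give.

```python
import math


def hard_clip_gate(ratio, advantage, eps=0.2):
    """Gradient mask of the clipped surrogate min(r*A, clip(r, 1-eps, 1+eps)*A).

    Returns 1.0 when the unclipped branch is active (gradient flows)
    and 0.0 when the clipped branch is active (gradient is cut off).
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return 1.0


def soft_gaussian_gate(ratio, sigma=0.2):
    """Hypothetical smooth gate: weight decays with distance of the ratio from 1."""
    return math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))


# A token whose ratio has drifted to 1.3 with a positive advantage:
# the hard gate discards its gradient entirely, the soft gate down-weights it.
hard = hard_clip_gate(1.3, advantage=1.0)
soft = soft_gaussian_gate(1.3)
```

The design point is only that a smooth gate keeps some gradient signal from samples that hard clipping would drop; how MASPO shapes and adapts its gate is left to the paper.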
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Authors: Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi
First: 2026-02-19T17:01:08+00:00 · Latest: 2026-02-19T17:01:08+00:00
Abstract
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
中文标题/摘要
标题:KLong:训练用于极长时域任务的LLM代理
本文介绍了KLong,一个开源的LLM代理,用于解决极长时域任务。原理是首先通过轨迹分割精调(SFT)冷启动模型,然后通过渐进式强化学习(RL)训练进行扩展。具体来说,我们首先使用全面的SFT配方激活基础模型的基本代理能力。然后,我们引入了Research-Factory,这是一个自动化流水线,通过收集研究论文并构建评估标准来生成高质量的训练数据。使用此流水线,我们从Claude 4.5 Sonnet(思考)中构建了数千条长时域轨迹。为了使用这些极其长的轨迹进行训练,我们提出了新的轨迹分割精调方法,该方法保留早期上下文,逐步截断后期上下文,并保持子轨迹之间的重叠。此外,为了进一步提高解决极长时域任务的能力,我们提出了新的渐进式RL,该方法将训练分为多个阶段,每个阶段的超时时间逐渐延长。实验表明KLong的优越性和泛化能力,如图1所示。值得注意的是,我们提出的KLong(106B)在PaperBench上比Kimi K2 Thinking(1T)高出11.28%,并且性能改进也适用于其他编程基准,如SWE-bench Verified和MLE-bench。
Summary / 总结
KLong is an open-source LLM agent designed to handle extremely long-horizon tasks. It is trained using a two-step process: initial cold-start via trajectory-splitting SFT and subsequent scaling through progressive RL training. Key methods include a comprehensive SFT recipe to activate basic agentic abilities, an automated pipeline called Research-Factory for generating high-quality training data, and a novel trajectory-splitting SFT and progressive RL that enhance long-horizon task-solving capability. Experimental results show that KLong outperforms Kimi K2 Thinking by 11.28% on PaperBench and generalizes well to other coding benchmarks.
KLong 是一个开源的 LLM 代理,旨在处理极长时域任务。其训练过程分为两步:初始冷启动通过轨迹分割 SFT 和后续通过渐进式 RL 训练进行扩展。关键发现表明,KLong 在 PaperBench 上比 Kimi K2 Thinking 高出 11.28%,并且在其他编码基准如 SWE-bench Verified 和 MLE-bench 上也表现出泛化能力。
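The trajectory-splitting idea (preserve early context, truncate later context, keep overlap between sub-trajectories) can be sketched as a sliding window with a fixed prefix. This is one possible reading of the abstract; the function name, the `head`, `window`, and `overlap` parameters, and the choice to repeat the prefix in every chunk are assumptions, not KLong's actual recipe.

```python
def split_trajectory(steps, head=2, window=4, overlap=1):
    """Split a long trajectory into sub-trajectories for SFT.

    Each chunk keeps the first `head` steps (early context, e.g. the task
    description) followed by a window over the remaining steps. Consecutive
    windows share `overlap` steps so no transition is seen without context.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    prefix = list(steps[:head])
    body = list(steps[head:])
    if not body:
        return [prefix]
    stride = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(prefix + body[start:start + window])
        if start + window >= len(body):
            break
        start += stride
    return chunks


# Ten steps, numbered 0..9, with steps 0 and 1 treated as early context.
chunks = split_trajectory(list(range(10)), head=2, window=4, overlap=1)
# chunks == [[0, 1, 2, 3, 4, 5], [0, 1, 5, 6, 7, 8], [0, 1, 8, 9]]
```

Each chunk fits a bounded context length while still starting from the same early context, which is the property the abstract attributes to trajectory-splitting SFT.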
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Authors: Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar
First: 2026-02-19T16:59:11+00:00 · Latest: 2026-02-19T16:59:11+00:00
Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
中文标题/摘要
标题:通过可重用性和可验证性评估链式思维推理
在诸如搜索和排名等任务的多智能体信息检索流水线中,基于LLM的智能体以链式思维(CoT)的形式相互交换中间推理。当前的CoT评估仅狭隘地关注目标任务的准确性。然而,这一指标未能评估推理过程本身的质量或实用性。为解决这一局限,我们引入了两个新的度量标准:可重用性和可验证性。我们使用思考者-执行者框架将CoT的生成与执行分离。可重用性衡量执行者可以多容易地重用思考者的CoT。可验证性衡量执行者使用CoT匹配思考者答案的频率。我们在五个基准上对四种思考者模型与十个执行者模型组成的委员会进行了评估。我们的结果表明,可重用性和可验证性与标准准确性无关,揭示了当前基于准确性的推理能力排行榜中的盲点。令人惊讶的是,我们发现专门的推理模型的CoT并不比通用的LLM(如Llama和Gemma)更一致地具有更高的可重用性和可验证性。
Summary / 总结
The study evaluates Chain-of-Thought (CoT) reasoning in multi-agent information retrieval pipelines by introducing reusability and verifiability as new metrics, decoupling CoT generation from execution. Across five benchmarks, four Thinker models were assessed against ten Executor models, revealing that CoT quality, as measured by reusability and verifiability, does not correlate with standard accuracy, highlighting a limitation in current accuracy-based evaluations of reasoning capabilities. Specialized reasoning models do not necessarily produce more reusable or verifiable CoTs than general-purpose models like Llama and Gemma.
研究通过引入可重用性和可验证性作为新指标,评估多代理信息检索管道中的链式思考(CoT)推理,将CoT生成与执行分离。在五个基准上,四个Thinker模型与十个Executor模型进行了评估,结果显示,CoT质量,用可重用性和可验证性衡量,与标准准确性无关,揭示了当前基于准确性的推理能力排行榜的局限性。专门的推理模型并不一定比通用的LLM(如Llama和Gemma)产生更可重用或可验证的CoT。
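Verifiability, described above as how frequently an Executor matches the Thinker's answer when given the CoT, can be computed as a simple match rate over a committee of Executors. The data layout (dictionaries keyed by question and executor identifiers), the exact-match comparison, and the function name are assumptions for illustration; the paper may aggregate or compare answers differently.

```python
def verifiability(thinker_answers, executor_answers):
    """Fraction of (executor, question) pairs where the executor's answer,
    produced from the Thinker's CoT, equals the Thinker's own answer.

    thinker_answers:  {question_id: answer}
    executor_answers: {executor_name: {question_id: answer}}
    """
    matches = 0
    total = 0
    for answers in executor_answers.values():
        for qid, thinker_answer in thinker_answers.items():
            total += 1
            if qid in answers and answers[qid] == thinker_answer:
                matches += 1
    return matches / total if total else 0.0


# Hypothetical committee of two Executors on two questions.
thinker = {"q1": "A", "q2": "B"}
committee = {
    "executor_1": {"q1": "A", "q2": "B"},
    "executor_2": {"q1": "A", "q2": "C"},
}
score = verifiability(thinker, committee)  # 3 of 4 pairs match
```

Note that this score compares against the Thinker's answer rather than the ground truth, so it measures how well the CoT transmits a conclusion, not whether the conclusion is correct.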
IRIS: Learning-Driven Task-Specific Cinema Robot Arm for Visuomotor Motion Control
Authors: Qilong Cheng, Matthew Mackay, Ali Bereyhi
First: 2026-02-19T16:50:31+00:00 · Latest: 2026-02-19T16:50:31+00:00
Abstract
Robotic camera systems enable dynamic, repeatable motion beyond human capabilities, yet their adoption remains limited by the high cost and operational complexity of industrial-grade platforms. We present the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS integrates a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns object-aware and perceptually smooth camera trajectories directly from human demonstrations, eliminating the need for explicit geometric programming. The complete platform costs under $1,000 USD, supports a 1.5 kg payload, and achieves approximately 1 mm repeatability. Real-world experiments demonstrate accurate trajectory tracking, reliable autonomous execution, and generalization across diverse cinematic motions.
中文标题/摘要
标题:IRIS:基于学习的任务特定电影机器人手臂用于视动运动控制
机器人摄像系统能够实现超越人类能力的动态、可重复运动,但其采用受限于工业级平台的高成本和操作复杂性。我们介绍了智能机器人成像系统(IRIS),这是一种专为自主、基于学习的电影运动控制设计的6-DOF操作臂。IRIS 结合了轻量级的全3D打印硬件设计和基于动作分块与变换器(ACT)的目标条件视动模仿学习框架。该系统直接从人类示范中学习对象感知和感知平滑的摄像机轨迹,无需显式的几何编程。整个平台成本低于1000美元,支持1.5公斤负载,并实现约1毫米的重复性。实际实验表明,该系统能够准确跟踪轨迹、可靠自主执行,并在多种电影运动中泛化。
Summary / 总结
The research aims to develop a cost-effective and easy-to-operate robotic camera system for cinematic motion control. IRIS, a 6-DOF manipulator, integrates lightweight, fully 3D-printed hardware with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns camera trajectories from human demonstrations; in real-world experiments it tracks trajectories accurately, executes autonomously, and generalizes to various cinematic motions. The complete platform costs less than $1,000 USD and supports a 1.5 kg payload with approximately 1 mm repeatability.
研究动机是开发一种低成本且用户友好的机器人摄像系统,用于电影运动控制。主要方法是使用轻量级的3D打印6-DOF manipulator和基于Action Chunking with Transformers (ACT)的模仿学习框架。关键实验发现包括准确的轨迹跟踪、可靠的自主执行以及在各种电影运动中的泛化能力,该系统成本低于1000美元,承载能力为1.5公斤,重复精度约为1毫米。
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage becomes unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
中文标题/摘要
标题:LATA:面向医学VLM共形不确定性的拉普拉斯辅助直推式适应
医学视觉-语言模型(VLMs)在医学成像中具有强大的零样本识别能力,但它们在领域转移下的可靠性依赖于有保证的校准不确定性。分割共形预测(SCP)提供了有限样本覆盖率,但预测集往往变得很大(低效率),并且类别间的覆盖率不平衡(高类别条件覆盖率差距,CCV),尤其是在少量样本、类别不平衡的情况下;此外,直接适应校准标签会破坏可交换性并使保证失效。我们提出了LATA(拉普拉斯辅助直推式适应),这是一种无需训练和标签的改进方法,它作用于联合的校准与测试池,使用少量CCCP平均场更新在图像-图像k-NN图上平滑零样本概率,并通过确定性变换保持SCP的有效性。我们还引入了一种失败感知的共形分数,将其插入视觉-语言不确定性(ViLU)框架中,提供实例级的难度和标签合理性,以提高固定覆盖率下的预测集效率和类别间的平衡。LATA是黑盒的(不更新VLM),计算量轻(窗口化直推,无反向传播),并包括一个可选的先验旋钮,可以完全不使用标签运行,或者如果需要,可以一次性使用校准边缘信息以标签感知的方式运行。在三个医学VLM和九个下游任务上,LATA始终减少了集合大小和CCV,同时匹配或收紧目标覆盖率,优于先前的直推基线,并缩小了与使用标签方法的差距,同时使用了更少的计算资源。全面的消融分析和定性分析表明,LATA在不牺牲可交换性的情况下锐化了零样本预测。
Summary / 总结
LATA (Laplacian-Assisted Transductive Adaptation) is proposed to improve the reliability of medical vision-language models (VLMs) under domain shift by refining zero-shot probabilities over the joint calibration and test pool, without updating the VLM or using labels. It applies a small number of CCCP mean-field updates to smooth these probabilities over an image-image k-NN graph, preserving the validity of split conformal prediction. LATA also introduces a failure-aware conformal score to enhance prediction set efficiency and class-wise balance. Experiments on three medical VLMs and nine downstream tasks show that LATA reduces set size and the class-conditioned coverage gap while matching or tightening target coverage, outperforming previous transductive baselines while using far less compute than label-using methods.
LATA(Laplacian-Assisted Transductive Adaptation)是一种无需训练和标签的方法,通过在图像-图像k-NN图上平滑零样本概率来精炼医疗视觉-语言模型的联合校准和测试池。这种方法减少了预测集大小和类别间覆盖率差异,同时保持了有限样本覆盖率保证。LATA在最小计算开销下优于之前的归纳基线,并减少了与使用标签方法之间的差距。
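For readers unfamiliar with split conformal prediction (SCP), the baseline that LATA refines, the following sketch shows the standard procedure: compute a nonconformity score on held-out calibration examples, take the ceil((n+1)(1−α))-th smallest score as a threshold, and include every class whose score falls under it. The score 1 − p(class) is one common choice and the function names are assumptions; LATA's graph smoothing and failure-aware score are not reproduced here.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from calibration data.

    cal_probs:  list of per-class probability lists, one per calibration example
    cal_labels: list of true class indices
    Uses the nonconformity score 1 - p(true class).
    """
    scores = sorted(1.0 - probs[label]
                    for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: fall back to the full label set.
        return float("inf")
    return scores[k - 1]


def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy two-class calibration set (hypothetical numbers).
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
cal_labels = [0, 0, 1, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
labels = prediction_set([0.75, 0.25], threshold)
```

The coverage guarantee relies on calibration and test scores being exchangeable, which is why the abstract stresses that LATA's refinement is a deterministic, label-free transform applied to the joint pool.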
Position: Evaluation of ECG Representations Must Be Fixed
Authors: Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz
First: 2026-02-19T16:42:46+00:00 · Latest: 2026-02-19T16:42:46+00:00
Comments: Project website at https://ecgfix.csail.mit.edu/
Abstract
This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
中文标题/摘要
标题:立场:必须修正12导联ECG表示的评估方法
这篇立场论文认为,必须修正当前12导联ECG表示学习的基准测试实践,以确保进展可靠且与临床有意义的目标一致。该领域已基本达成共识,主要依赖三个公共多标签基准(PTB-XL、CPSC2018、CSN),这些基准主要关注心律失常和波形形态标签,尽管已知ECG还包含更广泛临床信息。我们主张,下游评估应扩展到包括对结构性心脏病和患者水平预测的评估,以及其他不断发展的ECG相关终点,作为相关临床目标。其次,我们概述了多标签、不平衡设置下的评估最佳实践,并表明当这些实践被应用时,文献中关于哪种表示方法表现最佳的结论发生了改变。此外,我们展示了令人惊讶的结果,即随机初始化的编码器与线性评估匹配许多任务上的最新预训练模型。这促使使用随机编码器作为合理的基线模型。我们通过在六个评估设置中对三种代表性ECG预训练方法进行实证评估,证实了我们的观察:三个标准基准、结构性疾病数据集、血流动力学推断和患者预测。
Summary / 总结
This paper argues for fixing benchmarking practices in 12-lead ECG representation learning to better align with clinical objectives. It suggests expanding evaluation to include structural heart disease and patient-level forecasting, and outlines best practices for multi-label and imbalanced settings. The study shows that a randomly initialized encoder can match state-of-the-art performance on many tasks, motivating its use as a baseline. Empirical evaluations across various settings reveal that current conclusions about best representations may need re-evaluation.
本文主张修复当前的12导联ECG表示学习基准测试实践,以更好地与临床有意义的目标对齐。建议将评估扩展到包括结构性心脏病和患者水平的预测,并概述了多标签、不平衡设置的最佳实践。研究显示,应用这些实践会改变关于哪种表示方法表现最佳的结论,并展示了随机初始化的编码器在许多任务上可以匹配最先进的预训练效果,建议将其作为基准模型。各种设置下的实证评估支持了这些发现。
Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation
Authors: Dun Yuan, Hao Zhou, Xue Liu, Hao Chen, Yan Xin, Jianzhong Zhang
First: 2026-02-19T16:40:17+00:00 · Latest: 2026-02-19T16:40:17+00:00
Abstract
Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased hallucinations and reduced utility in telecom operations. To address these limitations, this work introduces KG-RAG, a novel framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG) to enhance LLMs for telecom-specific tasks. In particular, the KG provides a structured representation of domain knowledge derived from telecom standards and technical documents, while RAG enables dynamic retrieval of relevant facts to ground the model's outputs. Such a combination improves factual accuracy, reduces hallucination, and ensures compliance with telecom specifications. Experimental results across benchmark datasets demonstrate that KG-RAG outperforms both LLM-only and standard RAG baselines; for example, KG-RAG achieves an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models. These results highlight KG-RAG's effectiveness in producing accurate, reliable, and explainable outputs in complex telecom scenarios.
中文标题/摘要
标题:利用动态知识图谱和可解释检索增强生成提升电信领域的大型语言模型
大型语言模型(LLMs)在多种任务中展现了强大的潜力,但在电信领域的应用仍面临挑战,由于领域复杂性、不断演化的标准和专业术语。因此,通用领域的LLMs在电信场景中可能难以提供准确可靠的输出,导致增加幻觉并降低电信操作的实用性。为解决这些限制,本研究引入了KG-RAG——一种将知识图谱(KGs)与检索增强生成(RAG)相结合的新框架,以提升LLMs在电信特定任务中的表现。特别是,知识图谱提供了从电信标准和技术文档中推导出的领域知识的结构化表示,而RAG则允许动态检索相关事实以支撑模型的输出。这种结合提高了事实准确性,减少了幻觉,并确保符合电信规范。基准数据集上的实验结果表明,KG-RAG在准确度上优于仅使用LLM和标准RAG基线模型,例如,KG-RAG在准确度上分别比RAG和仅使用LLM的模型提高了14.3%和21.6%。这些结果突显了KG-RAG在复杂电信场景中生成准确、可靠和可解释输出的有效性。
Summary / 总结
This work addresses the challenges of applying general-domain large language models (LLMs) in the telecom field by introducing KG-RAG, a framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG). The framework enhances LLMs for telecom-specific tasks by providing a structured representation of domain knowledge and enabling dynamic retrieval of relevant facts. Experimental results show that KG-RAG outperforms both LLM-only and standard RAG baselines, with an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models.
这项工作通过引入KG-RAG框架,将知识图谱(KGs)与检索增强生成(RAG)相结合,来解决通用领域大型语言模型(LLMs)在电信领域的局限性。该框架通过提供领域知识的结构化表示和动态检索相关事实来增强LLMs的电信特定任务能力。实验结果表明,KG-RAG在准确性和可靠性方面优于LLM-only和标准RAG基线模型,分别提高了14.3%和21.6%的准确率。
FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality
Authors: Hanyuan Zhang, Lucas He, Runlong He, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson
First: 2026-02-19T16:31:14+00:00 · Latest: 2026-02-19T16:31:14+00:00
Abstract
Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
中文标题/摘要
标题:FoundationPose-初始化的3D-2D肝脏注册用于外科增强现实
增强现实可以提高腹腔镜肝脏手术中肿瘤定位的准确性。现有的注册管道通常依赖于器官轮廓;非刚性对齐通常使用有限元(FE)模型结合降维或机器学习组件来处理。我们结合腹腔镜深度图与基础姿态估计器进行相机-肝脏姿态估计,并用非刚性迭代最近点(NICP)替换基于FE的变形,以降低工程/建模复杂性和专业知识要求。在真实患者数据上,深度增强的基础姿态方法在3个病例中实现了9.91毫米的平均注册误差。结合刚性-NICP注册优于仅刚性注册,证明了NICP作为有限元非刚性模型的有效替代品。该管道实现了临床相关的准确性,同时提供了一种轻量级且工程友好的替代FE变形的方法。
Summary / 总结
The research aims to improve tumor localization in laparoscopic liver surgery using augmented reality. The method integrates laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and uses non-rigid iterative closest point (NICP) instead of finite-element (FE) models for deformable alignment. The approach achieved a mean registration error of 9.91 mm in three cases and outperformed rigid-only registration, showing NICP as an efficient alternative to FE-based models for deformable alignment.
研究旨在通过增强现实改善腹腔镜肝脏手术中的肿瘤定位。方法结合了腹腔镜深度图和基础姿态估计器以估计相机-肝脏姿态,并使用非刚性迭代最近点(NICP)进行非刚性对齐,从而降低工程复杂性。该方法在三个案例中实现了9.91毫米的平均注册误差,并且在与刚性对齐结合使用时,NICP优于仅刚性对齐,显示出NICP作为有限元变形模型的高效替代方案的有效性。
Defining and Evaluating Physical Safety for Large Language Models
Authors: Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho
First: 2024-11-04T17:41:25+00:00 · Latest: 2026-02-19T16:30:29+00:00
Abstract
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at huggingface.co/spaces/TrustSafeAI/LLM-physical-safety.
中文标题/摘要
标题:定义和评估大型语言模型的物理安全性
大型语言模型(LLMs)越来越多地用于控制无人机等机器人系统,但它们在实际应用中造成物理威胁和伤害的风险尚未得到探索。我们的研究通过开发无人机控制的全面基准来填补评估LLM物理安全性的关键空白。我们将无人机的物理安全风险分为四类:(1)针对人类的威胁,(2)针对物体的威胁,(3)基础设施攻击,以及(4)法规违规。我们对主流LLM的评估揭示了实用性与安全性之间的不利权衡,即在代码生成方面表现优异的模型往往在关键的安全方面表现不佳。此外,虽然通过引入先进的提示工程技术如上下文学习和思维链可以提高安全性,但这些方法仍然难以识别无意中的攻击。此外,更大的模型在拒绝危险命令方面表现出更好的安全性。我们的发现和基准可以促进LLM物理安全性的设计和评估。项目页面可在huggingface.co/spaces/TrustSafeAI/LLM-physical-safety获取。
Summary / 总结
This study addresses the lack of evaluation for physical safety in Large Language Models (LLMs) used in drone control. It classifies physical safety risks into four categories and develops a benchmark to evaluate LLMs. The evaluation shows a trade-off between utility and safety, with models excelling in code generation often performing poorly in safety. Advanced prompt engineering techniques can improve safety but still struggle with identifying unintentional attacks. Larger models show better safety in refusing dangerous commands. The findings and benchmark can aid in designing and evaluating physical safety for LLMs.
该研究解决了大型语言模型(LLM)在控制无人机等机器人系统时物理安全性评估方法的缺失问题。它将物理安全风险分为四类,并开发了一个全面的基准。评估结果显示,擅长代码生成的模型在安全性方面往往表现较差,尽管先进的技术可以提高安全性,但它们仍然难以识别无意中的攻击。较大的模型通常在拒绝危险命令方面表现出更好的安全性。研究结果有助于设计和评估LLM的物理安全性。
Capturing Individual Human Preferences with Reward Features
Authors: André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle
Venue: NeurIPS 2025
First: 2025-03-21T17:39:33+00:00 · Latest: 2026-02-19T16:23:22+00:00
Comments: Published at NeurIPS 2025
Abstract
Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
中文标题/摘要
标题:利用奖励特征捕捉个体人类偏好
从人类反馈中进行强化学习通常使用一个不区分个体的奖励函数。我们提出,在大型语言模型训练等存在高度分歧的背景下,这可能不是一个好的设计选择。我们形式化并分析了学习一个可以专门针对用户的奖励模型的问题。基于经验风险最小化的原则,我们推导出一个大概率近似正确(PAC)的界,显示了近似误差依赖于训练样本数量,以及提供反馈的人类评判者的数量。基于我们的理论发现,我们讨论了如何最好地收集两两偏好数据,并认为当用户之间存在显著分歧时,适应性奖励模型是有益的。我们还提出了一种适应性奖励模型的具体架构。我们的方法利用了个体偏好可以表示为一组通用奖励特征线性组合的观察。我们展示了如何学习这些特征,并随后使用它们快速适应特定个体的奖励模型,即使他们的偏好未反映在训练数据中。我们展示了使用大型语言模型的实验,以说明我们的理论结果,并将提出的架构与非适应性基线进行比较。与我们的分析一致,我们的模型提供的益处随着评判者的数量和他们偏好的异质性增加而增加。我们还展示了我们的模型与适应性对应模型(包括那些进行上下文个性化处理的模型)相比具有优势。
Summary / 总结
This paper addresses the issue of capturing individual human preferences in reinforcement learning, particularly in contexts where user disagreement is likely, such as training large language models. It proposes a method to learn a reward model that can be specialized to a user by using a set of general reward features. The approach leverages empirical risk minimization and shows that the approximation error depends on the number of training examples and human raters. Experiments demonstrate that the proposed adaptive reward model outperforms a non-adaptive baseline, especially when there is significant user disagreement and a diverse set of raters are involved.
本文探讨了在大型语言模型等偏好差异较大的场景中,如何通过强化学习捕捉个体人类偏好。提出了一种利用通用奖励特征来学习可针对用户定制的奖励模型的方法,并使用经验风险最小化推导出PAC边界。实验表明,所提出的自适应模型在用户偏好存在显著分歧和多种偏好时,比非自适应基线模型表现更好。
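The abstract's central idea, that an individual's preferences can be written as a linear combination of shared reward features, can be illustrated with a small sketch. This is not the authors' architecture: the Bradley-Terry preference likelihood, the plain gradient-ascent fit, the two-dimensional feature vectors, and the names `reward` and `adapt_user_weights` are all assumptions made for illustration. The features are treated as already learned and frozen, and only the per-user weight vector is fitted.

```python
import math

def reward(weights, features):
    """User-specific reward: a linear combination of shared reward features."""
    return sum(w * f for w, f in zip(weights, features))

def adapt_user_weights(pairs, dim, lr=0.5, steps=200):
    """Fit one user's weights from (preferred, rejected) feature pairs.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(w . (phi_win - phi_lose)).
    The reward features themselves stay frozen; only w is updated.
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_win, phi_lose in pairs:
            diff = [a - b for a, b in zip(phi_win, phi_lose)]
            p = 1.0 / (1.0 + math.exp(-reward(w, diff)))
            # Gradient of the log-likelihood with respect to w
            for i in range(dim):
                grad[i] += (1.0 - p) * diff[i]
        w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

# Hypothetical feature vectors (feature 0: level of detail, feature 1: brevity).
# This illustrative user consistently prefers the briefer response.
pairs = [
    ([0.2, 0.9], [0.8, 0.1]),
    ([0.5, 0.7], [0.5, 0.2]),
]
user_w = adapt_user_weights(pairs, dim=2)
```

Because only a handful of weights are fitted per user, a few pairwise comparisons can be enough to specialise the reward, which is the intuition behind adapting quickly to a user who is absent from the training data.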
LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights
Authors: Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal
First: 2026-02-19T16:22:22+00:00 · Latest: 2026-02-19T16:22:22+00:00
Abstract
We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $ΔW$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters--a count independent of model dimension and depth at fixed Tucker ranks.
中文标题/摘要
标题:LORA-CRAFT:通过冻结Tucker分解预训练注意力权重的跨层秩适应
我们引入了CRAFT(跨层秩适应通过冻结Tucker分解),这是一种参数高效微调(PEFT)方法,它将Tucker张量分解应用于堆叠在变压器层上的预训练注意力权重矩阵,并仅在冻结的Tucker因子上训练小型方适应矩阵。现有的基于张量的PEFT方法分解梯度更新:LoTR使用共享因子矩阵的Tucker分解,而SuperLoRA在应用Tucker分解之前将$ΔW$按层分组和重塑。另外,像PiSSA这样的方法对预训练权重应用SVD,但每层独立操作。CRAFT将这两条研究路线结合起来:它直接对组织成跨层3D张量的预训练权重进行完整的Tucker分解,通过Higher-Order SVD(HOSVD)进行,冻结所有结果因子,并通过应用于每个因子矩阵的轻量级可训练变换来适应模型。使用RoBERTa-base和RoBERTa-large在GLUE基准上的实验表明,CRAFT在性能上与现有方法相当,同时只需要41K个Tucker适应参数——在固定Tucker秩的情况下,该计数与模型维度和深度无关。
Summary / 总结
CRAFT is a parameter-efficient fine-tuning method that applies Tucker tensor decomposition to pre-trained attention weight matrices across transformer layers, training only small adaptation matrices on the resulting frozen Tucker factors. Experiments on the GLUE benchmark show that CRAFT achieves competitive performance with existing methods using only 41K Tucker adaptation parameters, independent of model dimension and depth at fixed Tucker ranks.
CRAFT 是一种参数高效微调方法,它使用 Tucker 张量分解跨变压器层的预训练注意力权重矩阵,并仅训练小的适应矩阵。实验表明,CRAFT 在 GLUE 基准上达到竞争力的表现,使用 41K 个 Tucker 适应参数,与模型维度和深度无关,固定 Tucker 秩时。
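The claim that the adapter parameter count is independent of model dimension and depth at fixed Tucker ranks can be checked with simple arithmetic. The sketch below assumes one trainable square matrix of size r_i x r_i per Tucker factor of a 3D tensor; the exact parameterisation, the ranks used here, and the function names are hypothetical, and the sketch does not attempt to reproduce the reported 41K figure. A standard LoRA count is included only as a point of comparison.

```python
def craft_adapter_params(tucker_ranks, num_tensors=1):
    """Trainable parameters if each Tucker factor gets one r_i x r_i matrix.

    Assumption: one square adaptation matrix per factor, per decomposed tensor.
    The count depends only on the ranks, not on hidden size or layer count.
    """
    return num_tensors * sum(r * r for r in tucker_ranks)

def lora_params(d_model, rank, num_layers, matrices_per_layer):
    """Standard LoRA on square d x d weights: A (rank x d) plus B (d x rank)."""
    return num_layers * matrices_per_layer * 2 * d_model * rank

# Hypothetical ranks for the three modes of a cross-layer 3D tensor.
ranks = (8, 8, 4)
craft_count = craft_adapter_params(ranks)      # same for any model size

# LoRA grows with both hidden size and depth (RoBERTa-base vs RoBERTa-large shapes).
lora_base = lora_params(d_model=768, rank=8, num_layers=12, matrices_per_layer=2)
lora_large = lora_params(d_model=1024, rank=8, num_layers=24, matrices_per_layer=2)
```

Under these assumptions the CRAFT-style count stays fixed when moving from the smaller to the larger model, while the LoRA count scales with hidden size and number of layers.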
Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
Authors: Pranay Jain, Maximilian Kasper, Göran Köber, Axel Plinge, Dominik Seuß
Venue: EEAI 2025
First: 2026-02-19T16:21:47+00:00 · Latest: 2026-02-19T16:21:47+00:00
Comments: 11 pages, 7 figures, Funding: GreenICT@FMD (BMFTR grant 16ME0491K)
Abstract
This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluate across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.
中文标题/摘要
标题:ARM Cortex处理器上基于帕累托最优的AI模型基准测试及其在可持续嵌入式系统中的应用
本研究提出了一种实用的基准测试框架,用于在ARM Cortex处理器(M0+,M4,M7)上优化人工智能(AI)模型,重点关注嵌入式系统中的能效、准确性和资源利用率。通过设计自动化测试平台,我们提供了一种系统的方法来评估关键性能指标(KPIs),并确定处理器和AI模型的最佳组合。研究强调了浮点运算(FLOPs)与推理时间之间的近线性关系,提供了一种可靠的计算需求估算指标。利用帕累托分析,我们展示了如何在能耗和模型准确性之间取得平衡,确保AI应用满足性能要求同时不牺牲可持续性。关键发现表明,M7处理器适用于短推理周期,而M4处理器对于较长推理任务具有更好的能效。M0+处理器虽然对于复杂AI模型效率较低,但对于简单任务仍然适用。本研究为开发人员提供了见解,指导他们设计高效的AI系统,以在实际应用中实现高性能。
Summary / 总结
This work introduces a benchmarking framework for optimizing AI models on ARM Cortex processors (M0+, M4, M7) by evaluating energy efficiency, accuracy, and resource utilization. The study uses Pareto analysis to balance energy consumption and model accuracy, showing that the M7 processor is best for short inference cycles, the M4 processor is more energy-efficient for longer tasks, and the M0+ is suitable for simpler tasks. It also reports a near-linear correlation between FLOPs and inference time, which offers a reliable metric for estimating computational demands.
这项研究介绍了一种针对ARM Cortex处理器(M0+、M4、M7)优化AI模型的基准测试框架,通过评估能源效率、准确性和资源利用率。研究使用帕累托分析来平衡能源消耗和模型准确性,结果显示M7处理器最适合短推理周期,M4处理器在长时间推理任务中更节能,而M0+处理器适用于更简单的任务。研究还指出浮点运算与推理时间之间存在近线性关系,提供了一个可靠的计算需求评估指标。
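The Pareto analysis described in this entry can be sketched as a filter that keeps only the processor/model combinations not dominated on both energy and accuracy. The function name `pareto_front`, the configuration labels, and every number below are illustrative placeholders, not measurements from the paper.

```python
def pareto_front(candidates):
    """Return the non-dominated (name, energy, accuracy) tuples.

    Lower energy and higher accuracy are better. A candidate is dominated if
    another one is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, energy, acc in candidates:
        dominated = any(
            (e <= energy and a >= acc) and (e < energy or a > acc)
            for _, e, a in candidates
        )
        if not dominated:
            front.append((name, energy, acc))
    return sorted(front, key=lambda c: c[1])

# Placeholder values for illustration only: (label, energy per inference, accuracy).
configs = [
    ("M0+/tiny", 1.0, 0.80),
    ("M4/small", 2.0, 0.90),
    ("M7/small", 3.0, 0.90),
    ("M7/large", 5.0, 0.95),
    ("M4/large", 6.0, 0.93),
]
front = pareto_front(configs)
```

With these placeholder numbers, "M7/small" and "M4/large" drop out because another configuration matches or beats them on both axes, and the remaining three form the trade-off curve a developer would choose from.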