TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
Authors: Namitha Padmanabhan, Matthew Gwilliam, Abhinav Shrivastava
First: 2026-02-18T18:59:55+00:00 · Latest: 2026-02-18T18:59:55+00:00
Abstract
Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20×; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/ .
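Title/abstract (translated from the Chinese)
Title: TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
Implicit neural representations (INRs) have recently shown impressive performance in video compression. However, because a separate INR must be fit to each video, scaling to high-resolution video while keeping encoding efficient remains a major challenge. Hypernetwork-based methods predict the INR weights (hyponetworks) of unseen videos at high speed, but with lower quality, larger compressed size, and very large memory requirements at higher resolutions. We address these basic limitations with three key contributions: (1) a method that decomposes the weight-prediction task spatially and temporally by splitting short video segments into patch tubelets, reducing pretraining memory overhead by 20×; (2) a residual-based storage scheme that captures only the differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in weight space to be correlated with video content. Our proposed method, TeCoNeRV, improves PSNR over the baseline by 2.47 dB and 5.35 dB at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding. Thanks to its low memory usage, ours is the first hypernetwork method to show results at 480p, 720p, and 1080p on UVG, HEVC, and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/ .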
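The residual-based storage idea in contribution (2) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each segment's predicted weights are already available as a flat NumPy array, uses a hypothetical uniform quantizer with step size `step`, and leaves out entropy coding entirely. The function names and the synthetic "weights" are made up for illustration.

```python
import numpy as np

def encode_residuals(segment_weights, step=0.01):
    """Quantize each segment as a difference from the previous reconstruction."""
    codes = []
    prev = np.zeros_like(segment_weights[0])
    for w in segment_weights:
        # Integer symbols for the residual; the first segment is coded against zeros.
        q = np.round((w - prev) / step).astype(np.int64)
        codes.append(q)
        # Track what the decoder will reconstruct so quantization error does not drift.
        prev = prev + q * step
    return codes

def decode_residuals(codes, step=0.01):
    """Rebuild every segment by accumulating the dequantized residuals."""
    recon = []
    prev = np.zeros(codes[0].shape)
    for q in codes:
        prev = prev + q * step
        recon.append(prev.copy())
    return recon

# Hypothetical temporally coherent weights: each segment drifts slightly from the last.
rng = np.random.default_rng(0)
segments = [rng.normal(size=1000)]
for _ in range(3):
    segments.append(segments[-1] + 0.02 * rng.normal(size=1000))

codes = encode_residuals(segments)
decoded = decode_residuals(codes)
max_error = max(np.abs(d - s).max() for d, s in zip(decoded, segments))
first_mag = np.abs(codes[0]).mean()
later_mag = np.mean([np.abs(c).mean() for c in codes[1:]])
print(max_error, first_mag, later_mag)
```

When consecutive segments are similar, the later residual symbols are much smaller in magnitude than the first segment's symbols, which is what would make them cheaper to store after entropy coding.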
Semantic Chunking and the Entropy of Natural Language
Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
First: 2026-02-13T18:58:10+00:00 · Latest: 2026-02-18T18:59:22+00:00
Comments: 29 pages, 9 figures; typos fixed
Abstract
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
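Title/abstract (translated from the Chinese)
Title: Semantic Chunking and the Entropy of Natural Language
The entropy rate of printed English is famously estimated at about one bit per character, a benchmark that modern large language models (LLMs) have approached only recently. This entropy rate means that English is nearly 80% redundant relative to the five bits per character expected of random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles explanation of this level of redundancy. The model describes a procedure that self-similarly segments text into semantically coherent chunks, down to the level of single words. The semantic structure of the text can then be decomposed level by level, which permits analytical treatment. Numerical experiments with modern LLMs and open datasets indicate that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. In addition, our theory reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of the corpus, which is captured by the only free parameter of our model.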
Summary
The paper seeks a first-principles account of the redundancy of natural language, whose entropy rate is estimated at about one bit per character. The authors propose a statistical model that segments text into semantically coherent chunks, allowing a hierarchical decomposition of the text's structure. Experiments with modern large language models and open datasets suggest that the model captures the semantic structure of real texts and predicts an entropy rate consistent with the estimated entropy rate of printed English. The model also suggests that the entropy rate increases systematically with the semantic complexity of the corpus.
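The "nearly 80 percent redundancy" figure follows from simple arithmetic: redundancy is one minus the ratio of the entropy rate to the maximum per-character information. The short sketch below reproduces it. The 27-symbol alphabet (26 letters plus space) in the second call is an assumption added here for comparison; the abstract itself quotes only the rounded five bits per character.

```python
import math

def redundancy(entropy_rate_bits, max_bits_per_char):
    """Fraction of the maximum per-character information that is redundant."""
    return 1.0 - entropy_rate_bits / max_bits_per_char

# Values quoted in the abstract: about 1 bit/char against 5 bits/char for random text.
print(redundancy(1.0, 5.0))

# Assumption: uniformly random text over 27 symbols carries log2(27) bits per character.
print(round(redundancy(1.0, math.log2(27)), 3))
```

Both baselines give a redundancy close to 0.8, so the conclusion does not depend on whether five bits or log2(27) bits is used for random text.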
Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology
Authors: Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres
First: 2026-02-18T18:51:28+00:00 · Latest: 2026-02-18T18:51:28+00:00
Abstract
Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modelling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
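Title/abstract (translated from the Chinese)
Title: Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology
Large language models (LLMs) perform well on biology benchmarks, raising concerns that they could help novices acquire dual-use laboratory skills. However, it remains unclear whether this capability translates into better human performance in the physical laboratory. To address this question, we ran a pre-registered, investigator-blinded, randomized controlled trial from June to August 2025 (n = 153) to evaluate whether LLMs improve novice performance on tasks that together model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rates of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of the pooled data estimates that success on a "typical" reverse genetics task increases roughly 1.4-fold under LLM assistance (95% CrI 0.74-2.62). Ordinal regression modeling indicates that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures, but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, and underscore the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.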
Summary
The study tested whether large language models (LLMs) improve novice performance in a biological laboratory setting. A randomized controlled trial with 153 participants found no significant difference in overall workflow completion between the LLM arm and the Internet control arm. The LLM arm had numerically higher success rates in four of the five tasks, most notably cell culture, and post-hoc Bayesian modeling estimated an approximate 1.4-fold increase in success, with a 95% credible interval (0.74-2.62) that includes no effect. Ordinal regression indicated that LLM-arm participants were more likely to progress through intermediate task steps. Overall, LLMs did not substantially increase novice completion of complex laboratory procedures, which points to a gap between in silico benchmarks and real-world utility and to the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
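Title/abstract (translated from the Chinese)
Title: Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Vision-language models (VLMs) aim to reason by jointly using the visual and textual modalities. Although allocating extra inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that the visual input is usually provided only once, at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively; reasoning therefore becomes increasingly text-dominated, and early visual grounding errors accumulate. In addition, guidance for visual grounding during inference is usually coarse and noisy, which makes it hard to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which makes stable control of discrete generation possible under noisy feedback while letting later reasoning steps re-consult the visual evidence when renewed grounding is needed. SAP also supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and requires neither additional data nor additional training. Experiments show that SAP achieves competitive performance under comparable token-generation budgets, particularly in reducing object hallucination, while providing more stable reasoning and lower response latency than CoT-style long sequential reasoning.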
Summary
The paper addresses the difficulty of scaling inference-time computation in vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles to keep discrete generation stable under noisy feedback. SAP also supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors. Experiments show that SAP reduces object hallucination and yields more stable reasoning and lower response latency than CoT-style long sequential reasoning.
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Authors: Wenxuan Ding, Nicholas Tomlin, Greg Durrett
First: 2026-02-18T18:46:14+00:00 · Latest: 2026-02-18T18:46:14+00:00
Abstract
LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
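Title/abstract (translated from the Chinese)
Title: Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
LLMs are increasingly used for complex problems that require interacting with an environment to obtain information. In these settings, an LLM must weigh cost against uncertainty when deciding when to stop exploring and commit to an answer. For example, in a programming task, an LLM that is unsure whether a generated code snippet is correct should test it; the cost of writing a test is nonzero but usually lower than the cost of a mistake. In this work, we show how to induce LLMs to weigh these cost-uncertainty tradeoffs explicitly and thereby explore the environment more effectively. We formalize several tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has a latent environment state that can be reasoned about through a prior, which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), in which we pass this additional context to the LLM so that it can act more effectively. The improvement persists even when both the baseline and CTA are trained with RL. Our results on information-seeking QA and a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover better decision-making strategies.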
Summary
This work studies cost-aware exploration in LLM agents that must interact with an environment to solve complex problems. It formalizes tasks as sequential decision-making problems under uncertainty with a latent environment state, and introduces the Calibrate-Then-Act (CTA) framework, which gives the LLM additional context (a prior over that state) so it can reason about cost-uncertainty tradeoffs when deciding whether to keep exploring or commit to an answer. Experiments on information-seeking QA and a simplified coding task indicate that CTA helps agents find better decision-making strategies than the baseline, and that the improvement persists when both are trained with reinforcement learning.
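The coding example in the abstract (test a snippet when unsure, because a test is cheaper than a mistake) amounts to an expected-cost comparison. The sketch below is a toy decision rule, not the CTA framework itself. The function name, the cost units, and the assumption that a test always catches an incorrect snippet are all hypothetical.

```python
def should_test(p_correct, cost_test, cost_mistake):
    """Return True when testing has lower expected cost than committing now.

    Illustrative assumption: the test always detects an incorrect snippet,
    so paying for the test removes the risk of an undetected mistake.
    """
    expected_cost_commit = (1.0 - p_correct) * cost_mistake
    expected_cost_test = cost_test
    return expected_cost_test < expected_cost_commit

# Uncertain about the code: a 30% chance of a costly mistake outweighs the test cost.
print(should_test(p_correct=0.70, cost_test=1.0, cost_mistake=10.0))

# Confident about the code: the test costs more than the expected loss from skipping it.
print(should_test(p_correct=0.95, cost_test=1.0, cost_mistake=10.0))
```

The rule shows why a calibrated estimate of `p_correct` matters: the same costs lead to opposite decisions depending on how uncertain the agent is.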
Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Authors: Huan Souza, Pankaj Mehta
First: 2026-02-18T18:42:29+00:00 · Latest: 2026-02-18T18:42:29+00:00
Abstract
Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single cell gene expression data.
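Title/abstract (translated from the Chinese)
Title: Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models such as TranscriptFormer, which use transformer-based architectures to learn a generative model of gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without computationally intensive deep-learning representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.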
Summary
This study asks whether simple, interpretable pipelines can match state-of-the-art performance on single-cell RNA sequencing tasks without computationally intensive deep-learning representations. Using careful normalization and linear methods, the authors obtain SOTA or near-SOTA results on multiple benchmarks commonly used to evaluate single-cell foundation models, and outperform foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. This suggests that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.
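The abstract does not spell out the pipeline, so the sketch below only illustrates what "normalization plus linear methods" could look like. Library-size scaling, `log1p`, PCA via SVD, and a nearest-centroid classifier are choices made here, not steps confirmed by the paper, and the count matrix is synthetic.

```python
import numpy as np

def log_normalize(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / np.maximum(totals, 1) * target_sum)

def pca_embed(x, n_components=2):
    """Project centered data onto its top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test cell the label of the closest class centroid."""
    labels = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in labels])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[dists.argmin(axis=1)]

# Synthetic counts for two hypothetical cell types with different marker genes.
rng = np.random.default_rng(0)
rates_a = np.array([20.0, 20.0, 1.0, 1.0, 5.0])
rates_b = np.array([1.0, 1.0, 20.0, 20.0, 5.0])
counts = np.vstack([
    rng.poisson(rates_a, size=(50, 5)),
    rng.poisson(rates_b, size=(50, 5)),
])
labels = np.array([0] * 50 + [1] * 50)

# The embedding is unsupervised and computed on all cells; labels are used only for centroids.
emb = pca_embed(log_normalize(counts))
train = np.arange(100) % 2 == 0
pred = nearest_centroid_predict(emb[train], labels[train], emb[~train])
accuracy = (pred == labels[~train]).mean()
print(accuracy)
```

Every step is a fixed transformation or a closed-form linear projection, so the pipeline has no learned deep-network parameters and each stage can be inspected directly.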
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Authors: Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski
First: 2025-03-24T16:06:04+00:00 · Latest: 2026-02-18T18:37:52+00:00
Comments: v3 was a major revision with updated experiments and analysis; v4 consists of minor edits
Abstract
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thought, (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision-making.
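Title/abstract (translated from the Chinese)
Title: EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
We develop methods for evaluating the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics (procurement, scheduling, and pricing) that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies the LLM's tradeoff response; a reliability score, which measures the coherence of the LLM's choice behavior; and a competency score, which measures the LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad range of frontier LLMs, we (1) investigate how LLM capabilities and tendencies change over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thought, and (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work lays a foundation for evaluating LLM agents as they are further integrated into economic decision-making.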
Summary
This paper develops methods for evaluating the economic decision-making capabilities and tendencies of LLMs. It introduces benchmarks based on key economic problems (procurement, scheduling, and pricing) and a litmus test framework that quantifies LLM choice behavior on stylized tasks with multiple conflicting objectives. The study investigates how LLM capabilities and tendencies change over time, draws insights from the models' choice behavior and chain-of-thought, and validates the litmus test framework through self-consistency, robustness, and generalizability tests.
Synthetic-Powered Multiple Testing with FDR Control
Authors: Yonghoon Lee, Meshi Bashari, Edgar Dobriban, Yaniv Romano
First: 2026-02-18T18:36:24+00:00 · Latest: 2026-02-18T18:36:24+00:00
Abstract
Multiple hypothesis testing with false discovery rate (FDR) control is a fundamental problem in statistical inference, with broad applications in genomics, drug screening, and outlier detection. In many such settings, researchers may have access not only to real experimental observations but also to auxiliary or synthetic data -- from past, related experiments or generated by generative models -- that can provide additional evidence about the hypotheses of interest. We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages such synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition, without requiring the pooled-data p-values to be valid under the null. The proposed method adapts to the (unknown) quality of the synthetic data: it enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality. We demonstrate the empirical performance of SynthBH on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations, and further study its properties through controlled experiments on simulated data.
Summary
The paper introduces SynthBH, a multiple hypothesis testing procedure with false discovery rate (FDR) control that uses synthetic data to improve sample efficiency and, potentially, power. It guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition and adapts to the unknown quality of the synthetic data, controlling the FDR at the user-specified level regardless of that quality. The method is evaluated on tabular outlier detection benchmarks and genomic analyses of drug-cancer sensitivity associations, and its properties are studied further in controlled experiments on simulated data.
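For readers unfamiliar with FDR control, the sketch below implements the classical Benjamini-Hochberg (BH) step-up procedure, which the name SynthBH appears to reference. It is not SynthBH: it uses no synthetic data, and the abstract does not describe how SynthBH combines real and synthetic evidence. The p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return a boolean mask of hypotheses rejected by the standard BH procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank whose p-value passes
        rejected[order[: k + 1]] = True   # reject everything up to that rank
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_vals, alpha=0.1))
print(benjamini_hochberg(p_vals, alpha=0.05))
```

With these illustrative p-values, the procedure rejects the four smallest at level 0.1 and only the two smallest at level 0.05.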
Are Object-Centric Representations Better At Compositional Generalization?
Authors: Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi
First: 2026-02-18T18:34:07+00:00 · Latest: 2026-02-18T18:34:07+00:00
Abstract
Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
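Title/abstract (translated from the Chinese)
Title: Are Object-Centric Representations Better At Compositional Generalization?
Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a key challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often said to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark spanning three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders with and without object-centric biases generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2 as the foundation models, together with their OC counterparts. Our main findings are that (1) OC approaches are superior in harder compositional generalization settings; (2) the original dense representations surpass OC only in easier settings and typically require more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up with or surpass them only given sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when dataset size, training data diversity, or downstream compute is constrained.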
Summary
The study examines whether object-centric (OC) representations improve compositional generalization in visually rich settings, comparing them with dense representations on a Visual Question Answering benchmark across three controlled visual worlds while accounting for factors such as training data diversity. OC approaches outperform dense representations in harder generalization settings, whereas dense representations lead only in easier settings and typically require substantially more downstream compute. OC models are also more sample efficient, generalizing better from fewer images; dense encoders catch up with or surpass them only given sufficient data and diversity.
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss, better enhancing the proposed personalized prompts. To enrich VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA .
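Title/abstract (translated from the Chinese)
Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Current vision-language models (VLMs) perform well across a variety of tasks, such as visual question answering. To improve user experience, recent studies have explored VLM personalization for understanding user-provided concepts. However, they focus mainly on single concepts and neglect the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA adopts a multi-concept instruction tuning strategy that effectively integrates multiple concepts in a single training step. To reduce training cost, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. In addition, we introduce a personalized visual prompt during inference, which aggregates location maps to strengthen recognition and grounding. To further raise the performance upper bound, we incorporate an optional auxiliary loss that further enhances the proposed personalized prompts. To enrich VLM personalization research, we contribute a high-quality dataset: we carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, with greater diversity. Comprehensive experiments show that MC-LLaVA produces impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA .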
Summary
This paper introduces MC-LLaVA, a multi-concept personalization paradigm for vision-language models that addresses the limits of single-concept personalization. It uses a multi-concept instruction tuning strategy together with personalized textual and visual prompts to strengthen recognition and grounding, and contributes a high-quality dataset to support research. Experiments show that MC-LLaVA generates impressive multi-concept personalized responses, improving the practical usefulness of VLMs as user assistants.
Learning Situated Awareness in the Real World
Authors: Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
First: 2026-02-18T18:22:52+00:00 · Latest: 2026-02-18T18:22:52+00:00
Abstract
A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
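Title/abstract (translated from the Chinese)
Title: Learning Situated Awareness in the Real World
A core aspect of human perception is situated awareness: the ability to relate ourselves to the surrounding physical environment and to reason about possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene) and overlook observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a new benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses across diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding through six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66% even for the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, although models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, which leads to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation toward understanding physically grounded, observer-centric dynamics.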
Summary
The research evaluates how well multimodal foundation models understand observer-centric spatial relationships in real-world scenarios, which existing benchmarks largely overlook. It introduces SAW-Bench, a benchmark for egocentric situated awareness built from 786 self-recorded real-world videos and over 2,071 human-annotated question-answer pairs covering diverse environments. The evaluation shows a human-model performance gap of 37.66% even for the best-performing model, Gemini 3 Flash; models can exploit partial geometric cues but often fail to infer a coherent camera geometry from egocentric video.
VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
Authors: Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
First: 2026-02-18T18:22:22+00:00 · Latest: 2026-02-18T18:22:22+00:00
Abstract
Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
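Title/abstract (translated from the Chinese)
Title: VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
Time-series anomaly detection (TSAD) requires identifying both immediate point anomalies and long-range context anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual view, while 2D vision-based models capture global patterns but suffer from information bottlenecks caused by a lack of temporal alignment and by coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies the temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. In addition, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of the two modalities. Extensive experiments show that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, with higher localization precision and lower computational overhead than current vision-based approaches. Code is available at https://github.com/yyyangcoder/VETime .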
Summary / 总结
VETime is a novel framework for time-series anomaly detection that integrates temporal and visual modalities to address the limitations of existing models. It uses fine-grained visual-temporal alignment and dynamic fusion, including a Reversible Image Conversion and a Patch-Level Temporal Alignment module, to establish a shared timeline. Additionally, it employs Anomaly Window Contrastive Learning and Task-Adaptive Multi-Modal Fusion to adaptively integrate the strengths of both modalities. Experimental results show that VETime outperforms state-of-the-art models in zero-shot scenarios with higher localization precision and lower computational overhead compared to current vision-based approaches.
VETime 是一种将时间序列和视觉模态相结合的新框架,以解决现有模型的局限性。它通过细粒度的视觉-时间对齐和动态融合,包括可逆图像转换和补丁级时间对齐模块,来保持时间和细节的敏感性。该框架还采用了异常窗口对比学习和任务自适应多模态融合,以适应性地结合两种模态的优势。实验结果表明,VETime 在零样本场景中比现有的视觉方法具有更高的精度和更低的计算成本。
Modeling Human Behavior in a Strategic Network Game with Complex Group Dynamics
Authors: Jonathan Skaggs, Jacob W. Crandall
First: 2025-05-01T18:13:20+00:00 · Latest: 2026-02-18T18:09:33+00:00
Comments: In Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems, Paphos, Cyprus, 2026
Abstract
Human networks greatly impact important societal outcomes, including wealth and health inequality, poverty, and bullying. As such, understanding human networks is critical to learning how to promote favorable societal outcomes. As a step toward better understanding human networks, we compare and contrast several methods for learning models of human behavior in a strategic network game called the Junior High Game (JHG) [39]. These modeling methods differ with respect to the assumptions they use to parameterize human behavior (behavior matching vs. community-aware behavior) and the moments they model (mean vs. distribution). Results show that the highest-performing method, called hCAB, models the distribution of human behavior rather than the mean and assumes humans use community-aware behavior rather than behavior matching. When applied to small societies, the hCAB model closely mirrors the population dynamics of human groups (with notable differences). Additionally, in a user study, human participants had difficulty distinguishing hCAB agents from other humans, thus illustrating that the hCAB model also produces plausible (individual) behavior in this strategic network game.
中文标题/摘要
标题:在具有复杂群体动态的战略网络游戏中建模人类行为
人类网络极大地影响着重要的社会结果,包括财富和健康不平等、贫困和欺凌。因此,理解人类网络对于学习如何促进有利的社会结果至关重要。为了更好地理解人类网络,我们比较了几种在名为初中游戏(JHG)的战略网络游戏中学习人类行为模型的方法。这些建模方法在参数化人类行为(行为匹配 vs. 社区感知行为)以及建模的时刻(均值 vs. 分布)方面有所不同。结果显示,表现最佳的方法称为hCAB,它建模的是人类行为的分布而非均值,并假设人类使用的是社区感知行为而非行为匹配。当应用于小型社会时,hCAB模型与人类群体的人口动态非常相似(有显著差异)。此外,在一项用户研究中,人类参与者难以区分hCAB代理与其他人类,从而表明hCAB模型在这一战略网络游戏中也产生了可信的(个体)行为。
Summary / 总结
The research aims to understand human behavior in strategic networks by comparing several methods for learning models of human behavior in a game called the Junior High Game (JHG). The best-performing method, hCAB, assumes community-aware behavior rather than behavior matching and models the distribution of human behavior rather than the mean. When applied to small societies, hCAB closely mirrors the population dynamics of human groups, with notable differences. In a user study, participants had difficulty distinguishing hCAB agents from other humans.
研究旨在通过比较不同方法在Junior High Game中的应用,理解人类在网络中的行为。hCAB模型假设社区意识行为并模拟人类行为的分布,表现最佳。它能很好地模仿人类群体动态,并且在用户研究中,人类参与者难以区分hCAB代理与其他人类玩家,表明其合理性和有效性。
PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
Authors: Bo Lang, Nirav Savaliya, Zhihao Zheng, Jinglun Feng, Zheng-Hang Yeh, Mooi Choo Chuah
Venue: WACV 2026
First: 2026-02-18T18:08:26+00:00 · Latest: 2026-02-18T18:08:26+00:00
Comments: WACV 2026
Abstract
High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.
中文标题/摘要
标题:PredMapNet:未来和历史推理在一致的在线高清矢量地图构建中的应用
高分辨率(HD)地图对于自动驾驶至关重要,它们提供了道路元素的结构化表示,以支持导航和规划。然而,现有的基于查询的方法通常采用随机查询初始化,并依赖于隐式的时序建模,这导致在构建全局地图时出现时序不一致和不稳定现象。为了解决这些挑战,我们提出了一种新的端到端框架,用于一致的在线高清矢量地图构建,该框架同时执行地图实例跟踪和短期预测。首先,我们提出了一种语义感知查询生成器,使用空间对齐的语义掩码初始化查询,以捕捉全局场景级上下文。接下来,我们设计了一种历史栅格化地图记忆,用于存储每个跟踪实例的细粒度实例级地图,从而启用显式的历史先验。然后,历史地图引导模块将栅格化地图信息整合到跟踪查询中,提高时序连续性。最后,我们提出了一种短期未来引导模块,根据存储的历史轨迹预测地图实例的即时运动。这些预测的未来位置作为跟踪实例的提示,进一步避免不合理的预测并保持时序一致性。在nuScenes和Argoverse2数据集上的广泛实验表明,我们提出的方法在效率良好的情况下优于最先进的(SOTA)方法。
Summary / 总结
The research aims to address the temporal inconsistencies and instabilities in the construction of global HD vectorized maps for autonomous driving. The proposed PredMapNet framework integrates semantic-aware query generation, history rasterized map memory, history-map guidance, and short-term future guidance modules to achieve consistent online map construction. Experiments show that PredMapNet outperforms existing methods on nuScenes and Argoverse2 datasets while maintaining efficiency.
研究旨在解决用于自动驾驶的全局高清矢量地图构建中的时间不一致性和不稳定性问题。PredMapNet框架结合了语义感知查询生成、历史栅格化地图记忆、历史地图引导和短期未来引导模块。实验结果表明,PredMapNet在nuScenes和Argoverse2数据集上优于现有方法,同时保持了良好的效率。
Towards a Science of AI Agent Reliability
Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
First: 2026-02-18T18:05:44+00:00 · Latest: 2026-02-18T18:05:44+00:00
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
中文标题/摘要
标题:迈向AI代理可靠性的科学
AI代理正越来越多地被部署以执行重要任务。尽管在标准基准上的准确率得分不断提高,表明快速进步,但许多代理在实践中仍然继续失败。这种差异突显了当前评估的基本局限性:将代理行为压缩为单一成功指标掩盖了关键的操作缺陷。值得注意的是,它忽略了代理是否在多次运行中表现一致、能否抵御干扰、能否可预测地失败或具有有限的误差严重性。基于安全关键工程,我们通过提出十二个具体的指标,从四个关键维度分解代理可靠性:一致性、鲁棒性、可预测性和安全性,提供了一个全面的性能概况。在两个互补基准上评估14个代理模型,我们发现最近的能力提升仅在可靠性方面带来了微小的改进。通过揭示这些持续的局限性,我们的指标补充了传统的评估,同时提供了关于代理如何表现、退化和失败的推理工具。
Summary / 总结
The research aims to address the gap between AI agent performance on benchmarks and their practical reliability. It proposes twelve concrete metrics that decompose agent reliability along four dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, the study finds that recent capability gains have yielded only small improvements in reliability. The metrics complement traditional evaluations and offer tools for reasoning about how agents perform, degrade, and fail.
研究旨在解决AI代理在基准测试中的表现与其实际可靠性之间的差距。它提出了十二个指标来评估代理可靠性的四个维度:一致性、稳健性、可预测性和安全性。通过对两个基准测试中的14个模型进行评估,研究发现最近的进步仅在可靠性方面带来了微小的改进,突显了确保一致和稳健的AI行为的持续挑战。
Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge
Authors: Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis
First: 2026-02-18T18:05:00+00:00 · Latest: 2026-02-18T18:05:00+00:00
Comments: 36 pages
Abstract
Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
中文标题/摘要
标题:无配对图像到图像翻译的自我监督语义桥梁
对抗扩散和扩散反向方法已推动无配对图像到图像翻译的发展,但每种方法都面临关键限制。对抗方法在训练过程中需要目标域对抗损失,这可能会限制其对未见数据的泛化能力,而扩散反向方法通常会产生低保真度的翻译,因为其向噪声潜在表示的反向过程不完美。在本文中,我们提出了一种名为自我监督语义桥梁(SSB)的通用框架,该框架将外部语义先验整合到扩散桥梁模型中,以在无需跨域监督的情况下实现空间保真的翻译。我们的主要想法是利用自我监督视觉编码器学习不变于外观变化但能捕捉几何结构的表示,从而形成一个共享的潜在空间,该空间条件了扩散桥梁。大量实验表明,SSB 在域内和域外设置中均优于强先验方法,用于具有挑战性的医学图像合成,并且可以轻松扩展到高质量的文本引导编辑。
Summary / 总结
This work addresses the limitations of adversarial and diffusion-inversion methods in unpaired image-to-image translation by proposing the Self-Supervised Semantic Bridge (SSB). SSB integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. The key method involves using self-supervised visual encoders to learn invariant representations that capture geometric structure, forming a shared latent space. Experiments demonstrate that SSB outperforms prior methods for medical image synthesis and extends well to high-quality text-guided editing.
本文通过提出自监督语义桥梁(SSB)来解决无配对图像到图像转换中对抗性和扩散反向方法的局限性。SSB 将外部语义先验整合到扩散桥梁模型中,以实现无需跨域监督的空间忠实转换。该方法使用自监督视觉编码器学习不变表示,捕捉几何结构,形成共享的潜在空间。实验表明,SSB 在域内和域外设置中均优于先前的方法,并支持高质量的文本引导编辑。
Optimizer choice matters for the emergence of Neural Collapse
Authors: Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, Aurelien Lucchi
Venue: ICLR 2026
First: 2026-02-18T17:32:43+00:00 · Latest: 2026-02-18T17:32:43+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
中文标题/摘要
标题:优化器选择对神经塌缩的出现至关重要
神经塌缩(NC)是指在深度神经网络训练的终端阶段,其表示中出现的高度对称几何结构。尽管NC普遍存在,但对其理论理解仍然有限。现有分析大多忽略了优化器的作用,因此暗示NC在各种优化方法中是普遍存在的。在本文中,我们挑战了这一假设,并证明了优化器的选择在NC的出现中起着关键作用。通常通过NC指标来量化这一现象,但这些指标难以进行理论上的跟踪和分析。为克服这一限制,我们引入了一个新的诊断指标NC0,其收敛于零是NC的必要条件。使用NC0,我们提供了理论证据,证明在自适应优化器中实现的AdamW中,解耦权重衰减不能导致NC的出现。具体而言,我们证明了SGD、耦合权重衰减的SignGD(Adam的一个特殊情况)和解耦权重衰减的SignGD(AdamW的一个特殊情况)在NC0动态上表现出质的不同。此外,我们展示了动量在使用SGD训练时对NC的加速效应(超越训练损失的收敛),这是关于NC的第一个关于动量的结果。最后,我们进行了广泛的实证实验,包括在各种数据集、架构、优化器和超参数上的3,900次训练运行,证实了我们的理论结果。本文提供了优化器依赖的NC出现的第一个理论解释,并强调了权重衰减耦合在塑造优化器的隐式偏见中的被忽视的作用。
Summary / 总结
This work investigates the role of optimizers in the emergence of Neural Collapse (NC), a phenomenon in which deep neural networks develop highly symmetric representations during the terminal phase of training. The authors introduce a diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC, and use it to give theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. They prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics, and that momentum accelerates NC under SGD beyond its effect on the convergence of training loss. Experiments comprising 3,900 training runs across datasets, architectures, optimizers, and hyperparameters confirm the theory and highlight the role of weight-decay coupling in shaping the implicit biases of optimizers.
该研究探讨了优化器选择对神经塌缩(NC)的影响,NC是指深度神经网络在训练后期形成高度对称的表示。作者引入了一种新的诊断指标NC0,以更有效地追踪NC。研究表明,优化器的选择显著影响NC,SGD和某些Adam变体表现出不同的NC0动态。研究还揭示了动量可以加速NC,而不仅仅是减少训练损失。广泛的实验验证了这些发现,并且跨越了各种数据集和架构,提供了优化器依赖的NC的理论解释。
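The coupled versus decoupled distinction in the entry above can be made concrete with a minimal sketch. This is not the paper's code and does not compute NC0; the function names, the scalar toy weight, and the hyperparameter values are assumptions chosen for illustration. In the coupled form the decay term is added to the gradient before the sign is taken, while in the decoupled form (the AdamW style) the decay is applied to the weight directly.

```python
def sign(x):
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def signgd_coupled(w, grad, lr, wd):
    # Coupled weight decay: the L2 term wd * w is folded into the gradient,
    # so it only influences the update through the sign.
    return w - lr * sign(grad + wd * w)


def signgd_decoupled(w, grad, lr, wd):
    # Decoupled weight decay: the sign step uses the raw gradient and the
    # weight is shrunk separately, proportionally to its magnitude.
    return w - lr * sign(grad) - lr * wd * w


if __name__ == "__main__":
    w, grad, lr, wd = 2.0, -0.5, 0.1, 0.5  # hypothetical toy values
    print("coupled  :", signgd_coupled(w, grad, lr, wd))
    print("decoupled:", signgd_decoupled(w, grad, lr, wd))
```

With these toy values the two rules move the weight in different ways from the same gradient: the coupled step is dominated by the decay term inside the sign, whereas the decoupled step follows the raw gradient and then shrinks the weight.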
Enhanced Diffusion Sampling: Efficient Rare Event Sampling and Free Energy Calculation with Diffusion Models
Authors: Yu Xie, Ludwig Winkler, Lixin Sun, Sarah Lewis, Adam E. Foster, José Jiménez Luna, Tim Hempel, Michael Gastegger, Yaoyi Chen, Iryna Zaporozhets, Cecilia Clementi, Christopher M. Bishop, Frank Noé
First: 2026-02-18T17:26:15+00:00 · Latest: 2026-02-18T17:26:15+00:00
Abstract
The rare-event sampling problem has long been the central limiting factor in molecular dynamics (MD), especially in biomolecular simulation. Recently, diffusion models such as BioEmu have emerged as powerful equilibrium samplers that generate independent samples from complex molecular distributions, eliminating the cost of sampling rare transition events. However, a sampling problem remains when computing observables that rely on states which are rare in equilibrium, for example folding free energies. Here, we introduce enhanced diffusion sampling, enabling efficient exploration of rare-event regions while preserving unbiased thermodynamic estimators. The key idea is to perform quantitatively accurate steering protocols to generate biased ensembles and subsequently recover equilibrium statistics via exact reweighting. We instantiate our framework in three algorithms: UmbrellaDiff (umbrella sampling with diffusion models), $Δ$G-Diff (free-energy differences via tilted ensembles), and MetaDiff (a batchwise analogue for metadynamics). Across toy systems, protein folding landscapes and folding free energies, our methods achieve fast, accurate, and scalable estimation of equilibrium properties within GPU-minutes to hours per system -- closing the rare-event sampling gap that remained after the advent of diffusion-model equilibrium samplers.
中文标题/摘要
标题:增强扩散采样:使用扩散模型进行高效罕见事件采样和自由能计算
罕见事件采样问题长期以来一直是分子动力学(MD)中的主要限制因素,尤其是在生物分子模拟中。最近,诸如BioEmu之类的扩散模型作为强大的平衡采样器出现,能够从复杂的分子分布中生成独立样本,从而消除采样罕见过渡事件的成本。然而,在计算依赖于在平衡状态下罕见状态的可观测量时,仍然存在采样问题,例如折叠自由能。在这里,我们引入了增强扩散采样,使其能够在保持无偏热力学估计的同时高效探索罕见事件区域。关键思想是执行定量准确的引导协议以生成有偏的集合,然后通过精确重加权恢复平衡统计。我们在三种算法中实现了我们的框架:UmbrellaDiff(伞形采样与扩散模型),ΔG-Diff(通过倾斜集合计算自由能差)和MetaDiff(用于元动力学的批处理类算法)。在玩具系统、蛋白质折叠景观和折叠自由能中,我们的方法在几分钟到几小时内实现了平衡性质的快速、准确和可扩展估计——在扩散模型平衡采样器出现之后,填补了罕见事件采样缺口。
Summary / 总结
The paper addresses the sampling problem that remains for diffusion-model equilibrium samplers when computing observables that depend on states rare in equilibrium, such as folding free energies. It introduces enhanced diffusion sampling, which uses steering protocols to generate biased ensembles and then recovers equilibrium statistics via exact reweighting, preserving unbiased thermodynamic estimators. The framework is instantiated in three algorithms, UmbrellaDiff, ΔG-Diff, and MetaDiff, which achieve fast, accurate, and scalable estimation of equilibrium properties within GPU-minutes to hours per system.
论文解决了分子动力学中采样稀有事件的问题,特别是生物分子模拟。它引入了增强扩散采样方法,结合了引导协议和扩散模型,以高效地探索稀有区域并保持无偏的热力学估计。该方法包括UmbrellaDiff、ΔG-Diff和MetaDiff,能够快速准确地估计平衡性质,将计算时间缩短到每系统数小时。
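The bias-then-reweight idea in the entry above can be illustrated with a one-dimensional toy. This is a sketch under stated assumptions, not UmbrellaDiff: the "equilibrium" density is a standard normal, the bias is a harmonic umbrella potential, the biased ensemble is sampled in closed form instead of by a steered diffusion model, and the estimator is self-normalized importance reweighting. All names and parameter values are hypothetical.

```python
import math
import random


def reweighted_tail_probability(n_samples=50_000, k=0.5, center=2.0,
                                threshold=2.0, seed=0):
    """Estimate P(x > threshold) under N(0, 1) from a biased ensemble.

    The biased density is proportional to N(0, 1) * exp(-bias(x)) with
    bias(x) = 0.5 * k * (x - center)**2, which is again Gaussian.
    """
    rng = random.Random(seed)
    var_b = 1.0 / (1.0 + k)          # variance of the biased Gaussian
    mean_b = k * center * var_b      # mean of the biased Gaussian
    std_b = math.sqrt(var_b)

    weighted_hits = 0.0
    weight_sum = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean_b, std_b)
        # Undo the bias: weight each sample by exp(+bias(x)).
        w = math.exp(0.5 * k * (x - center) ** 2)
        weight_sum += w
        if x > threshold:
            weighted_hits += w
    return weighted_hits / weight_sum


if __name__ == "__main__":
    exact = 0.5 * math.erfc(2.0 / math.sqrt(2.0))  # exact Gaussian tail mass
    print("reweighted estimate:", reweighted_tail_probability())
    print("exact value        :", exact)
```

The bias pulls samples toward the rare region near `center`, and the weights remove its effect so the estimate targets the unbiased distribution. The stiffness `k` is kept below 1 here so that the importance weights have finite variance for this Gaussian toy.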
Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Authors: Ethan Blaser, Jiuqi Wang, Shangtong Zhang
First: 2026-02-18T17:24:27+00:00 · Latest: 2026-02-18T17:24:27+00:00
Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
中文标题/摘要
标题:平均奖励马尔可夫决策过程的微分时差学习几乎必然收敛
平均奖励是强化学习(RL)中一个基本的性能指标,关注代理的长期性能。微分时差(TD)学习算法是平均奖励RL的一个重大进步,它们提供了一种在有策略和无策略设置中高效在线学习与平均奖励相关的价值函数的方法。然而,现有的收敛性保证需要与状态访问次数相关的局部时钟来调整学习率,而实践者并不使用这种方法,且无法扩展到非表格设置。我们通过证明使用标准递减学习率的有策略$n$步微分TD在没有局部时钟的情况下几乎必然收敛来解决这一限制。然后,我们推导出三个充分条件,在这些条件下,无策略$n$步微分TD在没有局部时钟的情况下也收敛。这些结果加强了微分TD的理论基础,并使其收敛分析更接近实际实现。
Summary / 总结
This paper addresses the convergence of differential temporal difference (TD) learning algorithms for average reward Markov decision processes. It proves the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock, and derives three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. This work strengthens the theoretical foundations of differential TD and brings its convergence analysis closer to practical implementations.
本文解决了差分时差(TD)学习算法在平均奖励马尔可夫决策过程中的收敛性问题。证明了使用标准递减学习率的on-policy $n$-步差分TD几乎必然收敛,并推导出三个充分条件,使得off-policy $n$-步差分TD在没有局部时钟的情况下也能收敛。这项工作增强了差分TD的理论基础,并使其收敛分析更接近实际应用。
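To make the "no local clock" setting in the entry above concrete, here is a minimal tabular sketch of one-step (n = 1) on-policy differential TD on a two-state Markov reward process. It is an illustration, not the paper's algorithm or proof setting: the chain, the rewards, the step-size exponent 0.75, and the factor `eta` are assumptions. The point is that the learning rate depends only on the global step counter `t`, not on per-state visit counts.

```python
import random


def differential_td(num_steps=200_000, eta=1.0, seed=0):
    """One-step differential TD with a single global diminishing step size."""
    rng = random.Random(seed)
    rewards = [1.0, 3.0]   # reward received when leaving state 0 or state 1
    v = [0.0, 0.0]         # differential value estimates
    r_bar = 0.0            # running estimate of the average reward
    s = 0
    for t in range(num_steps):
        # Stay or switch with probability 0.5 each, so both states are
        # visited equally often in the long run (true average reward 2.0).
        s_next = s if rng.random() < 0.5 else 1 - s
        # Global schedule: sum of alphas diverges, sum of squares converges.
        alpha = 1.0 / (t + 10) ** 0.75
        delta = rewards[s] - r_bar + v[s_next] - v[s]
        v[s] += alpha * delta
        r_bar += eta * alpha * delta
        s = s_next
    return r_bar, v


if __name__ == "__main__":
    r_bar, v = differential_td()
    print("estimated average reward:", r_bar)
    print("differential value gap v[1] - v[0]:", v[1] - v[0])
```

For this chain the true average reward is 2.0 and the differential values satisfy v[1] - v[0] = 2, which gives a quick sanity check on the estimates.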
View Invariant Learning for Vision-Language Navigation in Continuous Environments
Authors: Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley
First: 2025-07-05T18:04:35+00:00 · Latest: 2026-02-18T17:20:08+00:00
Comments: This paper is accepted to RA-L 2026
Abstract
Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
中文标题/摘要
标题:连续环境中的视点不变学习在视觉-语言导航中的应用
连续环境中的视觉-语言导航(VLNCE),其中智能体遵循指令自由移动以到达目的地,是嵌入式人工智能中的关键研究问题。然而,大多数导航策略对视点变化敏感,即相机高度和视角变化导致的智能体观察变化。本文引入了一种通用场景,即V2-VLNCE(具有变化视点的VLNCE),并提出了一种视点不变后训练策略VIL(视点不变学习),以增强现有导航策略对相机视点变化的鲁棒性。VIL采用对比学习框架学习稀疏且视点不变的特征。此外,我们还引入了一种教师-学生框架用于路径点预测模块,这是大多数VLNCE基线的核心组件,其中视点依赖的教师模型将知识提炼到视点不变的学生模型中。我们采用端到端训练范式联合优化这些组件,从而消除了单独模块训练的成本。实验结果表明,我们的方法在两个标准基准数据集R2R-CE和RxR-CE上的成功率为8-15%的提升。此外,我们在标准VLNCE设置下评估了VIL,发现尽管它被训练用于变化视点,但在许多情况下仍然提高了性能。在更具挑战性的RxR-CE数据集上,我们的方法在所有指标上也达到了其他无地图方法的最先进水平。这表明添加VIL不会削弱标准视点性能,并且可以作为即插即用的后训练方法。
Summary / 总结
This paper addresses the viewpoint sensitivity of navigation policies in Vision-Language Navigation in Continuous Environments (VLNCE) by introducing a generalized scenario, V2-VLNCE, and proposing VIL (View Invariant Learning), a view-invariant post-training strategy. VIL uses a contrastive learning framework to learn sparse and view-invariant features, and a teacher-student framework for the Waypoint Predictor Module in which a view-dependent teacher distills knowledge into a view-invariant student. On V2-VLNCE, the method outperforms state-of-the-art approaches by 8-15% in Success Rate on the R2R-CE and RxR-CE datasets. Under the standard VLNCE setting it often still improves performance, and on RxR-CE it achieves state-of-the-art results across all metrics among map-free methods.
论文通过引入V2-VLNCE和提出VIL(视点不变学习)来解决视觉-语言导航在连续环境中的视点敏感性问题。VIL通过对比学习和教师-学生框架增强导航策略对相机视点变化的鲁棒性。该方法在R2R-CE和RxR-CE数据集上的成功率上优于现有最佳方法8-15%,并在RxR-CE数据集上实现了所有指标的最佳性能,表明即使在标准VLNCE设置下,VIL也能提高性能,且可以作为即插即用的后训练方法。
Still Competitive: Revisiting Recurrent Models for Irregular Time Series Prediction
Authors: Ankitkumar Joshi, Milos Hauskrecht
First: 2025-10-17T19:04:16+00:00 · Latest: 2026-02-18T17:10:14+00:00
Comments: Published in Transactions on Machine Learning Research, 2026
Abstract
Modeling irregularly sampled multivariate time series is a persistent challenge in domains like healthcare and sensor networks. While recent works have explored a variety of complex learning architectures to solve the prediction problems for irregularly sampled time series, it remains unclear what the true benefits of some of these architectures are, and whether clever modifications of simpler and more efficient RNN-based algorithms are still competitive, i.e. they are on par with or even superior to these methods. In this work, we propose and study GRUwE: Gated Recurrent Unit with Exponential basis functions, that builds upon RNN-based architectures for observations made at irregular times. GRUwE supports both regression-based and event-based predictions in continuous time. GRUwE works by maintaining a Markov state representation of the time series that updates with the arrival of irregular observations. The Markov state update relies on two reset mechanisms: (i) observation-triggered reset to account for the new observation, and (ii) time-triggered reset that relies on learnable exponential decays, to support the predictions in continuous time. Our empirical evaluations across several real-world benchmarks on next-observation and next-event prediction tasks demonstrate that GRUwE can indeed achieve competitive or superior performance compared to the recent state-of-the-art (SOTA) methods. Thanks to its simplicity, GRUwE offers compelling advantages: it is easy to implement, requires minimal hyper-parameter tuning efforts, and significantly reduces the computational overhead in the online deployment.
中文标题/摘要
标题:仍然具有竞争力:重访用于不规则时间序列预测的递归模型
建模不规则采样多变量时间序列在医疗保健和传感器网络等领域一直是一个持续的挑战。虽然最近的研究探索了各种复杂的学习架构来解决不规则采样时间序列的预测问题,但尚不清楚这些架构的实际优势是什么,以及是否简单的基于RNN的算法经过巧妙的修改后仍然具有竞争力,即它们与这些方法相当甚至更优。在本文中,我们提出了GRUwE:带有指数基函数的门控递归单元,该模型基于不规则时间观察的递归架构。GRUwE 支持连续时间中的回归预测和事件预测。GRUwE 通过维护随不规则观察到达而更新的时间序列马尔可夫状态表示来工作。马尔可夫状态更新依赖于两种重置机制:(i) 观察触发重置以应对新观察,(ii) 时间触发重置,依赖于可学习的指数衰减,以支持连续时间中的预测。我们在几个真实世界的基准上的实证评估表明,GRUwE 确实可以实现与最近的最先进方法相当或更优的性能。由于其简洁性,GRUwE 提供了显著的优势:易于实现,需要最少的超参数调整工作,并且在在线部署中显著减少了计算开销。
Summary / 总结
This paper revisits recurrent models for predicting irregularly sampled multivariate time series, focusing on the benefits of complex architectures versus simpler RNN-based methods. The authors propose GRUwE, which integrates Gated Recurrent Units with exponential basis functions to handle both regression and event-based predictions in continuous time. Empirical evaluations show that GRUwE performs competitively or even better than recent state-of-the-art methods across various real-world benchmarks, while offering simplicity and reduced computational overhead.
该研究针对不规则采样多变量时间序列的建模挑战,特别是在医疗保健和传感器网络领域。它提出了GRUwE,这是一种基于指数基函数的门控循环单元,建立在不规则时间序列的RNN架构之上。GRUwE 支持连续时间内的回归和事件预测。实证评估表明,GRUwE 在各种实际基准测试中的表现与最新的先进方法相当或更优。
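The time-triggered reset described in the entry above relies on learnable exponential decays. The sketch below shows one simple reading of that idea: each state component shrinks toward zero at its own rate over the elapsed time between observations. The function name, the per-component rates, and the decay-to-zero target are assumptions for illustration. The paper's actual parameterization may differ.

```python
import math


def time_triggered_decay(state, rates, dt):
    """Decay each state component by exp(-rate * dt) over a time gap dt.

    `rates` stands in for learnable, non-negative decay parameters.
    """
    return [h * math.exp(-lam * dt) for h, lam in zip(state, rates)]


if __name__ == "__main__":
    state = [2.0, -1.0]
    rates = [math.log(2.0), 0.0]   # first component halves per unit time
    print(time_triggered_decay(state, rates, dt=1.0))
```

One convenient property of this form is that decaying over a gap `dt1` and then `dt2` gives the same result as decaying once over `dt1 + dt2`, so the state can be evaluated at any continuous time between observations.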
Who can we trust? LLM-as-a-jury for Comparative Assessment
Authors: Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill
First: 2026-02-18T17:04:02+00:00 · Latest: 2026-02-18T17:04:02+00:00
Abstract
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that it limits the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
中文标题/摘要
标题:我们能信任谁?作为比较评估陪审团的LLM
大型语言模型(LLM)越来越多地被用作自然语言生成评估的自动评估者,通常使用成对比较判断。现有方法通常依赖单一评估者或汇总多个评估者,假设它们具有相同的可靠性。实际上,LLM评估者在不同任务和方面表现差异显著,其判断概率可能存在偏差和不一致性。此外,用于评估者校准的人工标注监督可能不可用。我们首先实证证明LLM比较概率存在不一致性,并表明这限制了直接基于概率的排名的有效性。为了解决这一问题,我们研究了LLM作为陪审团的设置,并提出BT-sigma,这是一种基于布雷德利-特里模型的评估者感知扩展,为每个评估者引入了一个判别参数,仅从成对比较中联合推断项目排名和评估者可靠性。在基准NLG评估数据集上的实验表明,BT-sigma 一致优于基于平均值的聚合方法,且学习到的判别参数与LLM判断循环一致性独立度量高度相关。进一步分析表明,BT-sigma 可以解释为一种无监督校准机制,通过建模评估者可靠性来改进聚合。
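The judge-aware Bradley-Terry idea in the entry above can be sketched as a likelihood in which each judge has its own scale on the score difference. The abstract does not give BT-sigma's exact parameterization, so the multiplicative `discriminator`, the soft-label cross-entropy, and all names below are assumptions. The sketch only evaluates the likelihood and does not include a fitting procedure.

```python
import math


def preference_prob(score_i, score_j, discriminator):
    """P(judge prefers item i over item j) under the assumed model.

    A discriminator of 0 makes the judge uninformative (always 0.5);
    larger values make the judge's preferences sharper.
    """
    return 1.0 / (1.0 + math.exp(-discriminator * (score_i - score_j)))


def negative_log_likelihood(scores, discriminators, comparisons):
    """Cross-entropy between judges' stated probabilities and the model.

    comparisons: iterable of (judge, i, j, p_ij), where p_ij in [0, 1] is
    the judge's stated probability that item i beats item j.
    """
    nll = 0.0
    for judge, i, j, p_ij in comparisons:
        q = preference_prob(scores[i], scores[j], discriminators[judge])
        nll -= p_ij * math.log(q) + (1.0 - p_ij) * math.log(1.0 - q)
    return nll


if __name__ == "__main__":
    scores = [1.0, 0.0]
    discriminators = [2.0, 0.5]   # hypothetical sharp and noisy judges
    comparisons = [(0, 0, 1, 0.9), (1, 0, 1, 0.6)]
    print(negative_log_likelihood(scores, discriminators, comparisons))
```

Minimizing this quantity jointly over `scores` and `discriminators` would infer rankings and judge sharpness from pairwise comparisons alone. In practice a scale constraint is needed, because multiplying all scores by a constant and dividing all discriminators by it leaves the likelihood unchanged.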
FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen
First: 2026-02-18T16:57:45+00:00 · Latest: 2026-02-18T16:57:45+00:00
Comments: 13 pages
Abstract
The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge.
In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6$\times$ compared to state-of-the-art systems while satisfying heterogeneous SLOs.
中文标题/摘要
标题:FlowPrefill:解耦预占与预填充调度粒度以缓解LLM服务中的队头阻塞
大型语言模型(LLMs)日益增长的需求要求服务系统能够处理大量并发请求并满足多样化的服务水平目标(SLOs)。这加剧了计算密集型预填充阶段的队头(HoL)阻塞问题,长时间运行的请求独占资源并延迟优先级更高的请求,导致广泛的首个令牌时间(TTFT)SLO违规。虽然分块预填充使执行可被中断,但它引入了响应性和吞吐量之间的固有权衡:减少分块大小可以改善响应延迟,但会降低计算效率;而增加分块大小可以最大化吞吐量,但会加剧阻塞。这需要一种自适应预占机制。然而,动态平衡执行粒度与调度开销之间的关系仍然是一个关键挑战。
在本文中,我们提出了一种TTFT-吞吐量优化的服务系统FlowPrefill,通过解耦预占粒度与调度频率来解决这一冲突。为了实现自适应预填充调度,FlowPrefill引入了两项关键创新:1)操作级预占,利用操作边界实现细粒度执行中断,而不损失固定小分块的效率;2)事件驱动调度,仅在请求到达或完成事件时触发调度决策,从而支持高效的预占响应性,同时最小化控制平面开销。在真实生产跟踪上的评估表明,与最先进的系统相比,FlowPrefill将最大吞吐量提高了最多5.6倍,同时满足了异构SLO。
Summary / 总结
FlowPrefill is designed to address head-of-line blocking in large language model serving systems by decoupling preemption granularity from scheduling frequency. It introduces Operator-Level Preemption and Event-Driven Scheduling to enhance responsiveness and throughput. Evaluation shows that FlowPrefill can improve maximum goodput by up to 5.6 times compared to existing systems while meeting diverse service level objectives.
FlowPrefill 是一种服务系统,通过解耦预占粒度与调度频率来缓解大型语言模型 (LLM) 服务中的队头阻塞。它引入了操作级预占和事件驱动调度,以提升响应性和吞吐量。评估结果显示,与现有系统相比,FlowPrefill 可以将最大有效吞吐量提高多达 5.6 倍,同时满足不同的服务级别目标。
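The chunk-size trade-off that motivates the entry above can be shown with simple arithmetic. The sketch below models plain chunked prefill, not FlowPrefill's operator-level preemption. The linear per-token cost and the fixed per-chunk overhead are simplifying assumptions, and all names and default values are hypothetical.

```python
import math


def chunked_prefill_costs(prompt_tokens, chunk_tokens,
                          us_per_token=10.0, overhead_us_per_chunk=200.0):
    """Return (worst_case_blocking_us, total_prefill_us) for one long prompt.

    Assumes the scheduler can only regain control at chunk boundaries.
    """
    num_chunks = math.ceil(prompt_tokens / chunk_tokens)
    # A newly arrived high-priority request waits at most one full chunk.
    longest_chunk = min(chunk_tokens, prompt_tokens)
    worst_case_blocking_us = longest_chunk * us_per_token + overhead_us_per_chunk
    # Every chunk pays the fixed overhead, so small chunks cost throughput.
    total_prefill_us = (prompt_tokens * us_per_token
                        + num_chunks * overhead_us_per_chunk)
    return worst_case_blocking_us, total_prefill_us


if __name__ == "__main__":
    for chunk in (512, 4096):
        blocking, total = chunked_prefill_costs(8192, chunk)
        print(f"chunk={chunk}: blocking<={blocking} us, total={total} us")
```

Under this toy model, shrinking the chunk lowers the worst-case wait for an arriving request but raises the total prefill time, which is the tension the abstract describes between responsiveness and throughput.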
Sequential Membership Inference Attacks
Authors: Thomas Michel, Debabrota Basu, Emilie Kaufmann
First: 2026-02-18T16:51:13+00:00 · Latest: 2026-02-18T16:51:13+00:00
Comments: 27 pages, 10 figures
Abstract
Modern AI models are not static. They go through multiple updates in their lifecycles. Thus, exploiting the model dynamics to create stronger Membership Inference (MI) attacks and tighter privacy audits are timely questions. Though the literature empirically shows that using a sequence of model updates can increase the power of MI attacks, rigorous analysis of the 'optimal' MI attacks is limited to static models with infinite samples. Hence, we develop an 'optimal' MI attack, SeMI*, that uses the sequence of model updates to identify the presence of a target inserted at a certain update step. For the empirical mean computation, we derive the optimal power of SeMI*, while accessing a finite number of samples with or without privacy. Our results retrieve the existing asymptotic analysis. We observe that having access to the model sequence avoids the dilution of MI signals unlike the existing attacks on the final model, where the MI signal vanishes as training data accumulates. Furthermore, an adversary can use SeMI* to tune both the insertion time and the canary to yield tighter privacy audits. Finally, we conduct experiments across data distributions and models trained or fine-tuned with DP-SGD demonstrating that practical variants of SeMI* lead to tighter privacy audits than the baselines.
中文标题/摘要
标题:序列成员推断攻击
现代AI模型并非静态的,它们在其生命周期中会经历多次更新。因此,利用模型动态来构造更强的成员推断(MI)攻击和更严格的隐私审计是当下值得研究的问题。尽管文献实证表明使用模型更新序列可以增强MI攻击的效力,但对“最优”MI攻击的严格分析仅限于样本无限的静态模型。因此,我们提出了一种“最优”MI攻击SeMI*,它利用模型更新序列来识别在某一更新步骤插入的目标是否存在。针对经验均值计算,我们推导了在仅能访问有限样本、有或无隐私保护的情况下SeMI*的最优功效。我们的结果可还原现有的渐近分析。我们观察到,访问模型序列可以避免MI信号的稀释,这与现有针对最终模型的攻击不同,后者的MI信号会随着训练数据的积累而消失。此外,攻击者可以使用SeMI*来同时调整插入时间和金丝雀样本(canary),以得到更严格的隐私审计。最后,我们在多种数据分布以及使用DP-SGD训练或微调的模型上进行实验,证明SeMI*的实用变体比基线方法给出更严格的隐私审计。
Summary / 总结
This paper addresses the need for stronger Membership Inference (MI) attacks by developing an optimal MI attack, SeMI*, that utilizes a sequence of model updates. The authors derive the optimal power of SeMI* for finite samples and observe that accessing the model sequence prevents the dilution of MI signals, which is not the case for attacks on the final model. The experiments show that practical variants of SeMI* provide tighter privacy audits compared to baseline methods across various data distributions and models trained with DP-SGD.
研究旨在通过利用现代AI模型的动态特性,即模型在生命周期中会经历多次更新,来增强成员推断(MI)攻击。研究开发了一种名为SeMI*的‘最优’MI攻击,该攻击利用模型序列来识别特定更新步骤中插入的目标。关键发现表明,访问模型序列可以防止MI信号的稀释,而在针对最终模型的攻击中,随着训练数据的增加,MI信号会消失。此外,SeMI*允许攻击者调整插入时间和金丝雀样本(canary)以进行更严格的隐私审计,而基于SeMI*的实际变体在各种数据分布和使用DP-SGD训练或微调的模型中,比基线方法提供了更严格的隐私审计。
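The dilution effect described in this entry can be illustrated on the empirical-mean setting with a small sketch. This is not SeMI* itself; it only shows, under simplifying assumptions (noise-free data, one batch added per update, a hypothetical target value of 10.0), why a released sequence of means exposes the insertion step more strongly than the final mean alone. The helper names `running_means` and `batch_mean_from_sequence` are illustrative and do not come from the paper.

```python
def running_means(batches):
    """Return [(sample_count, empirical_mean)] after each update (one batch per update)."""
    total, count, released = 0.0, 0, []
    for batch in batches:
        total += sum(batch)
        count += len(batch)
        released.append((count, total / count))
    return released


def batch_mean_from_sequence(released, t):
    """Recover the mean of the batch added at step t from two consecutive releases."""
    n_t, mean_t = released[t]
    n_prev, mean_prev = released[t - 1] if t > 0 else (0, 0.0)
    # n * mean is the running sum, so the difference isolates the batch added at step t.
    return (n_t * mean_t - n_prev * mean_prev) / (n_t - n_prev)


# Two hypothetical worlds: the target (value 10.0) is either absent or inserted at step 2.
world_out = [[0.0, 0.0] for _ in range(5)]
world_in = [list(batch) for batch in world_out]
world_in[2][0] = 10.0

seq_out, seq_in = running_means(world_out), running_means(world_in)

# Gap visible to an attacker who sees only the final mean: target / total sample count.
final_gap = seq_in[-1][1] - seq_out[-1][1]
# Gap visible to an attacker who sees the whole sequence: target / batch size.
step_gap = batch_mean_from_sequence(seq_in, 2) - batch_mean_from_sequence(seq_out, 2)
print(final_gap, step_gap)
```

In this toy setup the final-model gap shrinks as more batches are appended, while the gap recovered at the insertion step depends only on the batch size, which matches the qualitative claim in the abstract.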
ReaCritic: Reasoning Transformer-based DRL Critic-model Scaling For Wireless Networks
Authors: Feiran You, Hongyang Du
First: 2025-05-16T08:42:08+00:00 · Latest: 2026-02-18T16:45:13+00:00
Abstract
Heterogeneous Networks (HetNets) pose critical challenges for intelligent management due to the diverse user requirements and time-varying wireless conditions. These factors introduce significant decision complexity, which limits the adaptability of existing Deep Reinforcement Learning (DRL) methods. In many DRL algorithms, especially those involving value-based or actor-critic structures, the critic component plays a key role in guiding policy learning by estimating value functions. However, conventional critic models often use shallow architectures that map observations directly to scalar estimates, limiting their ability to handle multi-task complexity. In contrast, recent progress in inference-time scaling of Large Language Models (LLMs) has shown that generating intermediate reasoning steps can significantly improve decision quality. Motivated by this, we propose ReaCritic, a reasoning transformer-based critic-model scaling scheme that brings reasoning-like ability into DRL. ReaCritic performs horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It is compatible with a broad range of value-based and actor-critic DRL algorithms and enhances generalization in dynamic wireless environments. Extensive experiments demonstrate that ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks. The code of ReaCritic is available at https://github.com/NICE-HKU/ReaCritic.
中文标题/摘要
标题:ReaCritic: 基于推理变换器的DRL批评模型扩展方案用于无线网络
异构网络(HetNets)由于多样化的用户需求和时间变化的无线条件,对智能管理提出了关键挑战。这些因素引入了显著的决策复杂性,限制了现有深度强化学习(DRL)方法的适应性。在许多DRL算法中,尤其是涉及价值基础或演员-批评结构的算法中,批评组件通过估计价值函数在引导策略学习中起着关键作用。然而,传统的批评模型通常使用浅层架构,直接将观察映射到标量估计,限制了它们处理多任务复杂性的能力。相比之下,最近在大型语言模型(LLMs)推理时扩展方面的进展表明,生成中间推理步骤可以显著提高决策质量。受此启发,我们提出了一种基于推理变换器的批评模型扩展方案ReaCritic,将推理能力引入DRL。ReaCritic在并行状态-动作输入上进行水平推理,并通过深层变换器堆栈进行垂直推理。它与广泛的基于价值和演员-批评DRL算法兼容,并增强了动态无线环境中的泛化能力。广泛的实验表明,ReaCritic在各种HetNet设置和标准OpenAI Gym控制任务中提高了收敛速度和最终性能。ReaCritic的代码可在https://github.com/NICE-HKU/ReaCritic/获得。
Summary / 总结
ReaCritic is a reasoning transformer-based critic-model scaling scheme designed to enhance the adaptability of Deep Reinforcement Learning (DRL) in managing Heterogeneous Networks (HetNets). It introduces reasoning-like ability by performing horizontal and vertical reasoning over state-action inputs and through deep transformer stacks, respectively. Experiments show that ReaCritic improves convergence speed and final performance in various HetNet settings and standard OpenAI Gym control tasks.
ReaCritic 是一种基于推理变换器的批评模型扩展方案,旨在增强深度强化学习(DRL)在管理异构网络(HetNets)中的适应性。它通过水平和垂直推理引入了推理能力,提高了多任务复杂性的处理能力。实验表明,ReaCritic 加快了收敛速度并提高了各种 HetNet 设置和标准控制任务中的性能。
Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models
Authors: Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Venue: IEEE Transactions on Computers, Jan 2026
First: 2025-01-24T11:19:07+00:00 · Latest: 2026-02-18T16:41:21+00:00
Abstract
Pre-trained Language Models (PLMs) have demonstrated their superiority and versatility in modern Natural Language Processing (NLP), effectively adapting to various downstream tasks through further fine-tuning. Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising solution to address privacy and efficiency challenges in distributed training for PLMs on resource-constrained local devices. However, our measurements reveal two key limitations of FedPEFT: heterogeneous data across devices exacerbates performance degradation of low-rank adaptation, and a fixed parameter configuration results in communication inefficiency. To overcome these limitations, we propose FedARA, a novel adaptive rank allocation framework for federated parameter-efficient fine-tuning of language models. Specifically, FedARA employs truncated Singular Value Decomposition (SVD) adaptation to enhance similar feature representation across clients, significantly mitigating the adverse effects of data heterogeneity. Subsequently, it utilizes dynamic rank allocation to progressively identify critical ranks, effectively improving communication efficiency. Lastly, it leverages rank-based module pruning to automatically remove inactive modules, steadily reducing local computational cost and memory usage in each federated learning round. Extensive experiments show that FedARA consistently outperforms baselines by an average of 6.95% to 8.49% across various datasets and models under heterogeneous data while significantly improving communication efficiency by 2.40$\times$. Moreover, experiments on various edge devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.
中文标题/摘要
标题:联邦参数高效微调语言模型的自适应秩分配
预训练语言模型(PLMs)在现代自然语言处理(NLP)中表现出其优越性和多功能性,通过进一步微调有效适应各种下游任务。联邦参数高效微调(FedPEFT)作为一种有前景的解决方案,旨在解决在资源受限的本地设备上分布式训练PLMs时的隐私和效率挑战。然而,我们的测量结果显示FedPEFT存在两个关键限制:设备间数据异质性加剧了低秩适应的性能下降,固定参数配置导致通信效率低下。为克服这些限制,我们提出FedARA,一种新颖的自适应秩分配框架,用于联邦参数高效微调语言模型。具体而言,FedARA 使用截断奇异值分解(SVD)适应来增强客户端之间的相似特征表示,显著减轻了数据异质性带来的负面影响。随后,它利用动态秩分配逐步识别关键秩,有效提高通信效率。最后,它利用基于秩的模块剪枝自动移除不活跃模块,逐步降低每次联邦学习轮次中的本地计算成本和内存使用。广泛实验表明,在异质数据下,FedARA 在各种数据集和模型上平均优于基线6.95%到8.49%,同时通信效率显著提高2.40倍。此外,针对各种边缘设备的实验表明,总训练时间和能量消耗分别降低了48.90%和46.95%。
Summary / 总结
The paper addresses the limitations of Federated Parameter-Efficient Fine-Tuning (FedPEFT) by proposing FedARA, an adaptive rank allocation framework. FedARA uses truncated SVD adaptation to handle data heterogeneity and dynamic rank allocation to improve communication efficiency. It also employs rank-based module pruning to reduce local computational cost. Experiments show that FedARA outperforms baselines by 6.95% to 8.49% across various datasets and models, and improves communication efficiency by 2.40 times. Additionally, it reduces training time and energy consumption by up to 48.90% and 46.95% on edge devices.
论文提出了一种新的自适应秩分配框架FedARA,以解决FedPEFT的局限性。FedARA通过截断SVD适应来处理数据异质性,并通过动态秩分配提高通信效率。此外,它还采用基于秩的模块剪枝来降低本地计算成本。实验表明,FedARA在各种数据集和模型下比基线高出6.95%到8.49%,并且将通信效率提高了2.40倍。同时,它在边缘设备上的总训练时间和能耗分别降低了48.90%和46.95%。
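The rank allocation and module pruning steps mentioned in this entry can be sketched as a global top-k selection over per-module singular values. This is a hedged illustration only: the abstract does not state how FedARA scores ranks, so using the singular-value magnitude as the importance score, the function name `allocate_ranks`, and the module names `q`, `v`, `o` are all assumptions.

```python
def allocate_ranks(singular_values, budget):
    """Keep the `budget` highest-magnitude ranks across all modules.

    singular_values: dict mapping module name -> list of singular values
                     from that module's SVD-style adapter (assumed input).
    Returns (active, pruned): the rank indices kept per module, and the
    names of modules left with no ranks, which could then be skipped.
    """
    scored = [
        (abs(value), name, index)
        for name, values in singular_values.items()
        for index, value in enumerate(values)
    ]
    # Assumed importance score: larger magnitude means a more critical rank.
    scored.sort(key=lambda item: item[0], reverse=True)

    kept = {name: [] for name in singular_values}
    for _, name, index in scored[:budget]:
        kept[name].append(index)

    active = {name: sorted(indices) for name, indices in kept.items() if indices}
    pruned = sorted(name for name, indices in kept.items() if not indices)
    return active, pruned


# Hypothetical adapters for three modules, with a total budget of three ranks.
example = {"q": [0.9, 0.5, 0.01], "v": [0.7, 0.02], "o": [0.03, 0.001]}
active, pruned = allocate_ranks(example, budget=3)
print(active, pruned)
```

Only the kept ranks would need to be communicated each round, and a module that ends up in `pruned` contributes no adapter computation, which is the intuition behind the communication and compute savings the abstract reports.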
AIFL: A Global Daily Streamflow Forecasting Model Using Deterministic LSTM Pre-trained on ERA5-Land and Fine-tuned on IFS
Authors: Maria Luisa Taccari, Kenza Tazi, Oisín M. Morrison, Andreas Grafberger, Juan Colonese, Corentin Carton de Wiart, Christel Prudhomme, Cinzia Mazzetti, Matthew Chantry, Florian Pappenberger
First: 2026-02-18T16:26:36+00:00 · Latest: 2026-02-18T16:26:36+00:00
Abstract
Reliable global streamflow forecasting is essential for flood preparedness and water resource management, yet data-driven models often suffer from a performance gap when transitioning from historical reanalysis to operational forecast products. This paper introduces AIFL (Artificial Intelligence for Floods), a deterministic LSTM-based model designed for global daily streamflow forecasting. Trained on 18,588 basins curated from the CARAVAN dataset, AIFL utilises a novel two-stage training strategy to bridge the reanalysis-to-forecast domain shift. The model is first pre-trained on 40 years of ERA5-Land reanalysis (1980-2019) to capture robust hydrological processes, then fine-tuned on operational Integrated Forecasting System (IFS) control forecasts (2016-2019) to adapt to the specific error structures and biases of operational numerical weather prediction. To our knowledge, this is the first global model trained end-to-end within the CARAVAN ecosystem. On an independent temporal test set (2021-2024), AIFL achieves high predictive skill with a median modified Kling-Gupta Efficiency (KGE') of 0.66 and a median Nash-Sutcliffe Efficiency (NSE) of 0.53. Benchmarking results show that AIFL is highly competitive with current state-of-the-art global systems, achieving comparable accuracy while maintaining a transparent and reproducible forcing pipeline. The model demonstrates exceptional reliability in extreme-event detection, providing a streamlined and operationally robust baseline for the global hydrological community.
中文标题/摘要
标题:AIFL:基于确定性LSTM的ERA5-Land预训练和IFS微调的全球每日径流预报模型
可靠的全球径流预报对于洪水防范和水资源管理至关重要,但数据驱动模型在从历史再分析过渡到业务预报产品时往往存在性能差距。本文介绍了AIFL(Artificial Intelligence for Floods,面向洪水的人工智能),这是一种基于确定性LSTM的全球每日径流预报模型。该模型在来自CARAVAN数据集的18,588个流域上进行训练,并采用一种新颖的两阶段训练策略来弥合从再分析到预报的域偏移。模型首先在1980-2019年的ERA5-Land再分析数据(40年)上进行预训练,以捕捉稳健的水文过程,然后在2016-2019年的业务集成预报系统(IFS)控制预报上进行微调,以适应业务数值天气预报特有的误差结构和偏差。据我们所知,这是CARAVAN生态系统中第一个端到端训练的全球模型。在独立的时间测试集(2021-2024年)上,AIFL表现出较高的预测能力,修正Kling-Gupta效率(KGE')的中位数为0.66,纳什-萨特克利夫效率(NSE)的中位数为0.53。基准测试结果表明,AIFL与当前最先进的全球系统相比具有很强的竞争力,在达到相当精度的同时保持透明且可重复的驱动数据管线。该模型在极端事件检测方面表现出色,为全球水文学界提供了一个简洁且在业务上稳健的基线。
Summary / 总结
AIFL is a deterministic LSTM-based model for global daily streamflow forecasting, pre-trained on ERA5-Land reanalysis data and fine-tuned on IFS operational forecasts. It achieves high predictive skill with median KGE' of 0.66 and NSE of 0.53 on an independent test set from 2021 to 2024, performing competitively with current state-of-the-art global systems while maintaining a transparent and reproducible forcing pipeline. The model is designed to bridge the gap between reanalysis and operational forecasts, providing reliable predictions for flood preparedness and water resource management.
AIFL 是一种基于确定性 LSTM 的全球每日径流预报模型,旨在弥合历史再分析数据和业务预报产品之间的差距。该模型首先在 40 年的 ERA5-Land 再分析数据上进行预训练,然后在 IFS 控制预报上进行微调以适应业务预报中的偏差。在独立测试集上,AIFL 展现出高预测技能,中位数 KGE' 为 0.66,NSE 为 0.53,其精度与当前最先进的系统相当,同时保持透明性和可重复性。
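For readers unfamiliar with the two skill scores quoted in this entry, the sketch below computes them from their standard definitions: NSE as one minus the ratio of squared error to the variance of the observations, and the modified KGE' from correlation, bias ratio, and the ratio of coefficients of variation. Whether AIFL's evaluation code uses exactly these conventions (for example population rather than sample standard deviation) is an assumption, and the function names are illustrative.

```python
import math
from statistics import fmean, pstdev


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches predicting the observed mean."""
    mean_obs = fmean(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variance_sum


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kge_prime(obs, sim):
    """Modified Kling-Gupta Efficiency: 1 is perfect agreement."""
    r = pearson_r(obs, sim)                      # timing and shape agreement
    beta = fmean(sim) / fmean(obs)               # bias ratio
    gamma = (pstdev(sim) / fmean(sim)) / (pstdev(obs) / fmean(obs))  # variability ratio (CVs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


# Hypothetical daily discharge values, for illustration only.
observed = [1.0, 2.0, 4.0, 3.0]
simulated = [1.2, 1.8, 3.5, 3.1]
print(nse(observed, simulated), kge_prime(observed, simulated))
```

A simulation that doubles every observed value keeps perfect correlation and the same coefficient of variation but has a bias ratio of 2, so its KGE' is 0; this shows how the score separates bias from timing errors.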
The Quantification Horizon Theory of Consciousness
Authors: T. R. Lima
First: 2017-04-04T18:32:58+00:00 · Latest: 2026-02-18T16:23:19+00:00
Abstract
The scientific revolution began with an exclusion. To make nature mathematically tractable, Galileo stripped the scientific model of the world of its qualities -- colors, sounds, tastes, feels -- leaving only what admits of numerical characterization. Four centuries later, the qualities remain unexplained. They are the "hard problem" of consciousness: the enigma of why and how physical processing gives rise to felt experience. The Quantification Horizon Theory of Consciousness (QHT) proposes that this enigma arises from a structural necessity of mathematical description itself. Quantitative models can only capture quantifiable features of reality. Where there is nothing, a model assigns zero; where there is something quantifiable, it assigns a value; but where there is something unquantifiable -- a quale -- the model degenerates: it produces a singularity. QHT identifies singularities in the information geometry of neural dynamics as the mathematical fingerprint of phenomenal experience: a quantification horizon beyond which quantitative description cannot reach. From this basis, QHT derives the hallmark properties of consciousness -- ineffability, privacy, subjectivity, unity, and causal efficacy -- and provides substrate-independent criteria for determining which systems are conscious. The theory avoids panpsychism, makes testable predictions, and offers concrete implications for artificial intelligence and artificial consciousness. Its core intuition -- that singularities correspond to felt experience -- may have been foreshadowed by Srinivasa Ramanujan.
中文标题/摘要
标题:意识的量化界限理论
科学革命始于一次排除。为了使自然在数学上可处理,伽利略剥离了世界科学模型中的品质(颜色、声音、味道、感觉),只保留了可用数值刻画的部分。四百年后,这些品质仍然无法解释。它们是意识的“硬问题”:物理处理为何以及如何产生主观体验的谜团。意识的量化界限理论(QHT)提出,这一谜团源于数学描述本身的结构性必然。定量模型只能捕捉现实中的可量化特征。在无物之处,模型赋予零值;在有可量化之物之处,它赋予数值;但在有不可量化之物(感质)之处,模型退化:它产生奇点。QHT将神经动力学信息几何中的奇点识别为现象性体验的数学指纹:一条定量描述无法逾越的量化界限。基于此,QHT推导出意识的标志性属性(不可言说性、私密性、主观性、统一性和因果效力),并提供了与基质无关的意识判定标准。该理论避免了泛心论,提出了可检验的预测,并为人工智能和人工意识提供了具体启示。其核心直觉(即奇点对应于主观体验)可能早已被斯里尼瓦萨·拉马努金所预示。
MoDE-Boost: Boosting Shared Mobility Demand with Edge-Ready Prediction Models
Authors: Antonios Tziorvas, George S. Theodoropoulos, Yannis Theodoridis
First: 2026-02-18T16:18:13+00:00 · Latest: 2026-02-18T16:18:13+00:00
Comments: 25 pages
Abstract
Urban demand forecasting plays a critical role in optimizing routing, dispatching, and congestion management within Intelligent Transportation Systems. By leveraging data fusion and analytics techniques, traffic demand forecasting serves as a key intermediate measure for identifying emerging spatial and temporal demand patterns. In this paper, we tackle this challenge by proposing two gradient boosting model variations, one for classification and one for regression, both capable of generating demand forecasts at various temporal horizons, from 5 minutes up to one hour. Our overall approach effectively integrates temporal and contextual features, enabling accurate predictions that are essential for improving the efficiency of shared (micro-) mobility services. To evaluate its effectiveness, we utilize open shared mobility data derived from e-scooter and e-bike networks in five metropolitan areas. These real-world datasets allow us to compare our approach with state-of-the-art methods as well as a Generative AI-based model, demonstrating its effectiveness in capturing the complexities of modern urban mobility. Ultimately, our methodology offers novel insights on urban micro-mobility management, helping to tackle the challenges arising from rapid urbanization and thus, contributing to more sustainable, efficient, and livable cities.
中文标题/摘要
标题:MoDE-Boost:利用边缘就绪预测模型提升共享出行需求
城市需求预测在优化智能交通系统中的路径规划、调度和拥堵管理中起着关键作用。通过利用数据融合和分析技术,交通需求预测成为识别新兴的空间和时间需求模式的关键中间指标。在本文中,我们通过提出两种梯度提升模型变体来应对这一挑战,一种用于分类,一种用于回归,两者都能生成从5分钟到一小时的各种时间范围内的需求预测。我们的整体方法有效地整合了时间和上下文特征,使准确预测成为提高共享(微)出行服务效率的关键。为了评估其有效性,我们使用来自五个大都市地区电动滑板车和电动自行车网络的开放共享出行数据。这些实际数据集使我们能够将我们的方法与最先进的方法以及基于生成AI的模型进行比较,展示了其在捕捉现代城市出行复杂性方面的有效性。最终,我们的方法提供了关于城市微出行管理的新颖见解,有助于应对快速城市化带来的挑战,从而促进更可持续、更高效和宜居的城市。
Summary / 总结
This paper addresses the challenge of urban demand forecasting for shared mobility services by proposing two gradient boosting model variations for classification and regression. These models can generate accurate demand forecasts from 5 minutes to one hour, integrating temporal and contextual features. The effectiveness of the approach is evaluated using real-world data from e-scooter and e-bike networks in five metropolitan areas, showing its capability to capture the complexities of modern urban mobility and improve the efficiency of shared mobility services.
论文提出了MoDE-Boost方法,使用梯度提升模型进行共享出行服务的城市需求预测。该方法整合了时间与上下文特征,能够预测从5分钟到一小时的不同时间范围内的需求。该方法使用来自五个大都市区域的电动滑板车和电动自行车网络的真实数据进行评估,并与最先进的方法和一个基于生成式AI的模型进行了比较,展示了其在捕捉城市出行复杂性方面的有效性。
DIAGPaper: Diagnosing Valid and Specific Weaknesses in Scientific Papers via Multi-Agent Reasoning
Authors: Zhuoyang Zou, Abolfazl Ansari, Delvin Ce Zhang, Dongwon Lee, Wenpeng Yin
First: 2026-01-12T14:59:00+00:00 · Latest: 2026-02-18T16:09:49+00:00
Abstract
Paper weakness identification using single-agent or multi-agent LLMs has attracted increasing attention, yet existing approaches exhibit key limitations. Many multi-agent systems simulate human roles at a surface level, missing the underlying criteria that lead experts to assess complementary intellectual aspects of a paper. Moreover, prior methods implicitly assume identified weaknesses are valid, ignoring reviewer bias, misunderstanding, and the critical role of author rebuttals in validating review quality. Finally, most systems output unranked weakness lists, rather than prioritizing the most consequential issues for users. In this work, we propose DIAGPaper, a novel multi-agent framework that addresses these challenges through three tightly integrated modules. The customizer module simulates human-defined review criteria and instantiates multiple reviewer agents with criterion-specific expertise. The rebuttal module introduces author agents that engage in structured debate with reviewer agents to validate and refine proposed weaknesses. The prioritizer module learns from large-scale human review practices to assess the severity of validated weaknesses and surfaces the top-K severest ones to users. Experiments on two benchmarks, AAAR and ReviewCritique, demonstrate that DIAGPaper substantially outperforms existing methods by producing more valid and more paper-specific weaknesses, while presenting them in a user-oriented, prioritized manner.
中文标题/摘要
标题:DIAGPaper: 通过多智能体推理诊断科学论文中的有效和具体弱点
使用单智能体或多智能体LLM识别论文弱点引起了越来越多的关注,但现有方法存在关键局限性。许多多智能体系统在表面层次上模拟人类角色,忽略了专家评估论文互补智力方面所依据的标准。此外,先前的方法隐含地假设识别出的弱点是有效的,忽视了审稿人的偏见、误解以及作者反驳在验证审稿质量中的关键作用。最后,大多数系统输出未排序的弱点列表,而不是优先处理对用户最具有影响力的议题。在本工作中,我们提出了一种名为DIAGPaper的新颖多智能体框架,通过三个紧密集成的模块来解决这些挑战。定制模块模拟人类定义的评审标准,并实例化具有特定标准专业知识的多个评审智能体。反驳模块引入作者智能体,与评审智能体进行结构化辩论以验证和细化提出的弱点。优先级模块从大规模的人类评审实践中学习,评估验证后的弱点的严重性,并向用户展示最严重的前K个弱点。在AAAR和ReviewCritique两个基准上的实验表明,DIAGPaper在生成更有效、更具体于论文的弱点方面显著优于现有方法,同时以用户导向的方式优先展示这些弱点。
Summary / 总结
DIAGPaper proposes a novel multi-agent framework to diagnose valid and specific weaknesses in scientific papers. It addresses limitations of existing methods by simulating human-defined review criteria, incorporating structured debates between reviewer and author agents, and prioritizing the most severe weaknesses. Experiments show DIAGPaper outperforms existing methods in producing more valid and paper-specific weaknesses, and presenting them in a user-oriented, prioritized manner.
本文提出了一种多智能体框架DIAGPaper,以解决现有方法在识别科学论文弱点时的局限性。该框架包括模拟人类评审标准的定制器模块、涉及审稿人和作者智能体之间结构化辩论的反驳模块以及评估验证后弱点严重性的优先级模块。实验表明,DIAGPaper生成的弱点更具有有效性和特定性,并以用户导向的方式按优先级呈现。
Learning Degenerate Manifolds of Frustrated Magnets with Boltzmann Machines
Authors: Ho Jang, Jackson C. Glass, Gia-Wei Chern
First: 2025-11-25T03:40:01+00:00 · Latest: 2026-02-18T16:07:15+00:00
Comments: 13 pages, 10 figures
Abstract
We show that Restricted Boltzmann Machines (RBMs) provide a flexible generative framework for modeling spin configurations in disordered yet strongly correlated phases of frustrated magnets. As a benchmark, we first demonstrate that an RBM can learn the zero-temperature ground-state manifold of the one-dimensional ANNNI model at its multiphase point, accurately reproducing its characteristic oscillatory and exponentially decaying correlations. We then apply RBMs to kagome spin ice and show that they successfully learn the local ice rules and short-range correlations of the extensively degenerate ice-I manifold. Correlation functions computed from RBM-generated configurations closely match those from direct Monte Carlo simulations. For the partially ordered ice-II phase -- featuring long-range charge order and broken time-reversal symmetry -- accurate modeling requires RBMs with uniform-sign bias fields, mirroring the underlying symmetry breaking. These results highlight the utility of RBMs as generative models for learning constrained and highly frustrated magnetic states.
中文标题/摘要
标题:使用玻尔兹曼机学习受挫磁体的简并流形
我们展示了受限玻尔兹曼机(RBMs)为建模受挫磁体无序但强关联相的自旋构型提供了一个灵活的生成框架。作为基准,我们首先证明了RBM可以学习一维ANNNI模型在多相点的零温基态流形,准确地再现了其特征性的振荡且指数衰减的关联。然后,我们将RBM应用于kagome(笼目)自旋冰,并表明它们成功地学习了广延简并的冰-I流形的局部冰规则和短程关联。从RBM生成的构型计算的关联函数与直接蒙特卡洛模拟的结果非常接近。对于部分有序的冰-II相(具有长程电荷序和时间反演对称性破缺),准确建模需要具有同号偏置场的RBM,这反映了其背后的对称性破缺。这些结果突显了RBM作为生成模型学习受约束和高度受挫磁性状态的实用性。
Summary / 总结
This study demonstrates that Restricted Boltzmann Machines (RBMs) can effectively model the spin configurations in frustrated magnets. RBMs accurately learned the ground-state manifold of the ANNNI model and the local ice rules of the kagome spin ice. The correlation functions from RBM-generated configurations matched those from direct Monte Carlo simulations. For the partially ordered ice-II phase, RBMs with uniform-sign bias fields were necessary to capture the long-range charge order and symmetry breaking, underscoring the utility of RBMs in modeling constrained and highly frustrated magnetic states.
研究展示了受限玻尔兹曼机(RBMs)能够有效模拟受挫磁体中的自旋构型。RBMs准确地学习了ANNNI模型的零温基态流形和kagome(笼目)自旋冰的局部冰规则。对于部分有序的冰-II 相,需要使用具有同号偏置场的 RBMs 来捕捉长程电荷序和对称性破缺。这些结果强调了 RBMs 作为学习受约束且高度受挫磁态的生成模型的潜力。
A Scalable Approach to Solving Simulation-Based Network Security Games
Authors: Michael Lanier, Yevgeniy Vorobeychik
First: 2026-02-18T16:07:01+00:00 · Latest: 2026-02-18T16:07:01+00:00
Abstract
We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
中文标题/摘要
标题:一种解决基于仿真网络安全性博弈的可扩展方法
我们引入了MetaDOAR,这是一种轻量级的元控制器,它通过引入一个学习到的、分区感知的过滤层和Q值缓存,将Double Oracle / PSRO范式与之结合,以实现非常大的网络环境中的多智能体强化学习的可扩展性。MetaDOAR 从每个节点的结构嵌入中学习一个紧凑的状态投影,以快速评分并选择一小部分设备(前k个分区),然后由一个常规的低级行为者在批评家代理的帮助下进行聚焦的束搜索。选定的候选行动通过批处理批评家前向传播并存储在一个基于量化状态投影和局部行动标识符的LRU缓存中,从而大幅减少冗余批评家计算,同时通过保守的k跳缓存失效机制保持决策质量。实验表明,MetaDOAR 在大型网络拓扑结构上获得了比当前最佳基线更高的玩家收益,且在内存使用和训练时间方面没有显著的扩展问题。本贡献提供了一条实用且理论导向的路径,用于解决大规模网络决策问题中的高效层次策略学习。
Summary / 总结
The research aims to address the scalability challenges in solving network security games through simulation-based methods. MetaDOAR, a lightweight meta-controller, is introduced, which enhances the Double Oracle / PSRO framework with a learned filtering layer and Q-value caching. This approach enables efficient multi-agent reinforcement learning on large cyber-network environments by projecting state space, selecting a subset of devices, and utilizing a critic agent for evaluation. The method significantly improves player payoffs compared to state-of-the-art baselines without increasing memory usage or training time substantially.
研究旨在通过仿真方法解决网络安全性博弈中的可扩展性问题。提出了MetaDOAR,一种轻量级元控制器,它通过学习过滤层和Q值缓存来增强Double Oracle / PSRO框架。该方法通过投影状态空间、选择设备子集并利用批评家代理进行评估,使大规模网络决策问题中的多智能体强化学习变得高效。该方法显著提高了玩家收益,同时没有显著增加内存使用或训练时间。
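The Q-value caching idea in this entry, an LRU cache keyed by a quantized state projection plus a local action identifier, can be sketched with the standard library. This is a minimal sketch under stated assumptions: the class name `CriticCache`, the quantization step, and the `invalidate` interface are hypothetical, and the k-hop neighbourhood computation that would decide which actions to invalidate is left out.

```python
from collections import OrderedDict


class CriticCache:
    """LRU cache for critic values keyed by (quantized projection, action id)."""

    def __init__(self, capacity, step=0.1):
        self.capacity = capacity
        self.step = step              # assumed quantization bin width
        self._store = OrderedDict()
        self.critic_calls = 0         # counts real critic evaluations (cache misses)

    def _key(self, projection, action_id):
        # Nearby projections fall into the same bin and share one cache entry.
        quantized = tuple(round(x / self.step) for x in projection)
        return quantized, action_id

    def get_or_compute(self, projection, action_id, critic):
        key = self._key(projection, action_id)
        if key in self._store:
            self._store.move_to_end(key)      # mark as most recently used
            return self._store[key]
        value = critic(projection, action_id)
        self.critic_calls += 1
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the least recently used entry
        return value

    def invalidate(self, action_ids):
        """Drop entries for actions judged stale, e.g. those near a changed node."""
        stale = [key for key in self._store if key[1] in action_ids]
        for key in stale:
            del self._store[key]


# Hypothetical critic: any callable taking (projection, action_id) would do.
cache = CriticCache(capacity=2)
toy_critic = lambda projection, action_id: sum(projection) + action_id
print(cache.get_or_compute([0.11, 0.52], 3, toy_critic))
print(cache.get_or_compute([0.12, 0.49], 3, toy_critic))  # same bin, served from cache
```

Coarser quantization raises the hit rate at the cost of reusing values for states that differ more, which is presumably why the abstract pairs the cache with conservative invalidation.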
SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
Authors: Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh
Venue: ICLR 2026
First: 2025-08-18T13:14:20+00:00 · Latest: 2026-02-18T15:56:15+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (deep ensembles, MC dropout, early exits, temporal buffering) typically require multiple passes, extra branches, or state that is impractical on milliwatt hardware. This paper proposes a novel and practical method, SNAP-UQ, for single-pass, label-free uncertainty estimation based on depth-wise next-activation prediction. SNAP-UQ taps a small set of backbone layers and uses tiny int8 heads to predict the mean and scale of the next activation from a low-rank projection of the previous one; the resulting standardized prediction error forms a depth-wise surprisal signal that is aggregated and mapped through a lightweight monotone calibrator into an actionable uncertainty score. The design introduces no temporal buffers or auxiliary exits and preserves state-free inference, while increasing deployment footprint by only a few tens of kilobytes. Across vision and audio backbones, SNAP-UQ reduces flash and latency relative to early-exit and deep-ensemble baselines (typically $\sim$40--60% smaller and $\sim$25--35% faster), with several competing methods at similar accuracy often exceeding MCU memory limits. On corrupted streams, it improves accuracy-drop event detection by multiple AUPRC points and maintains strong failure detection (AUROC $\approx 0.9$) in a single forward pass. By grounding uncertainty in layer-to-layer dynamics rather than solely in output confidence, SNAP-UQ offers a novel, resource-efficient basis for robust TinyML monitoring. Our code is available at: https://github.com/Ism-ail11/SNAP-UQ
中文标题/摘要
标题:SNAP-UQ:用于TinyML单次前向不确定性估计的自监督下一层激活预测
可靠的不确定性估计是TinyML设备端监测中的关键缺失环节:微控制器必须在严格的闪存/延迟预算下检测故障、分布偏移或准确性下降,但常见的不确定性方法(深度集成、MC Dropout、提前退出、时间缓冲)通常需要多次前向传递、额外分支或在毫瓦级硬件上不切实际的状态。本文提出了一种新颖且实用的方法SNAP-UQ,基于沿深度方向的下一层激活预测,实现单次前向、无标签的不确定性估计。SNAP-UQ 在少量骨干层上设置抽头,并使用小型 int8 头,根据上一层激活的低秩投影预测下一层激活的均值和尺度;由此产生的标准化预测误差形成沿深度方向的惊异度(surprisal)信号,该信号经聚合后通过轻量级单调校准器映射为可操作的不确定性评分。该设计不引入时间缓冲或辅助出口,并保持无状态推理,同时仅增加数十千字节的部署占用。在视觉和音频骨干网络上,SNAP-UQ 相对于提前退出和深度集成基线减少了闪存和延迟(通常约小40-60%,约快25-35%),而一些具有相似准确性的竞争方法往往超出MCU内存限制。在受损数据流上,它将准确性下降事件检测提高了多个AUPRC点,并在单次前向传递中保持了强大的故障检测(AUROC ≈ 0.9)。通过将不确定性建立在层间动态而非仅依赖输出置信度之上,SNAP-UQ 为稳健的TinyML监测提供了一种新颖且资源高效的基础。我们的代码可在以下地址获取:https://github.com/Ism-ail11/SNAP-UQ
Summary / 总结
SNAP-UQ is a self-supervised method for single-pass uncertainty estimation in TinyML, addressing the need for reliable on-device monitoring without additional passes or state. It uses a small set of backbone layers to predict the mean and scale of the next activation from a low-rank projection of the previous one, forming a surprisal signal that is calibrated into an uncertainty score. This method reduces flash and latency compared to baselines and maintains strong failure detection. On corrupted streams, SNAP-UQ improves accuracy-drop event detection and operates with minimal additional deployment footprint.
SNAP-UQ 是一种自监督的方法,用于 TinyML 中的单次通过不确定性估计,解决了在不增加额外遍历或状态的情况下进行可靠设备内监控的需求。它使用一小部分骨干层来预测下一个激活的均值和尺度,从前一个激活的低秩投影中形成一个惊讶信号,并校准成不确定性评分。该方法相比基线减少了闪存和延迟,并保持了强大的故障检测能力。在受污染的数据流上,SNAP-UQ 提高了准确度下降事件的检测,并且仅具有最小的额外部署开销。
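The surprisal pipeline described in this entry (low-rank projection, predicted mean and scale, standardized error, monotone calibration) can be sketched in plain Python. This is an illustration under assumptions, not the released implementation: it uses floats rather than int8 heads, a log-scale head to keep the scale positive, a simple average for aggregation, and a logistic function as the monotone calibrator. All weights and function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def layer_surprisal(a_prev, a_next, proj, w_mean, w_log_scale):
    """Mean squared standardized error of the predicted next activation."""
    z = matvec(proj, a_prev)                                # low-rank projection of previous activation
    mean = matvec(w_mean, z)                                # predicted mean of next activation
    scale = [math.exp(s) for s in matvec(w_log_scale, z)]   # predicted scale, kept positive
    errors = [((a - m) / s) ** 2 for a, m, s in zip(a_next, mean, scale)]
    return sum(errors) / len(errors)


def uncertainty_score(surprisals, slope=1.0, offset=0.0):
    """Aggregate per-layer surprisals and map them through a monotone (logistic) calibrator."""
    aggregated = sum(surprisals) / len(surprisals)
    return 1.0 / (1.0 + math.exp(-(slope * aggregated - offset)))


# Hypothetical rank-1 head on a 2-dimensional activation.
proj = [[1.0, 0.0]]
w_mean = [[1.0], [2.0]]
w_log_scale = [[0.0], [0.0]]

expected = layer_surprisal([1.0, 5.0], [1.0, 2.0], proj, w_mean, w_log_scale)
shifted = layer_surprisal([1.0, 5.0], [2.0, 4.0], proj, w_mean, w_log_scale)
print(uncertainty_score([expected]), uncertainty_score([shifted]))
```

Because the score is computed from activations already produced during the forward pass, no second pass or stored state is needed, which is the property the abstract emphasizes for microcontroller deployment.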
RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion
Authors: Tianmeng Hu, Yongzheng Cui, Biao Luo, Ke Li
Venue: ICLR 2026
First: 2026-02-18T15:52:26+00:00 · Latest: 2026-02-18T15:52:26+00:00
Comments: Accepted as a conference paper at ICLR 2026
Abstract
The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using native sequence recovery, which is a limited surrogate for structural fidelity, since different sequences can fold into similar 3D structures and high recovery does not necessarily indicate correct folding. To address this limitation, we propose RIDER, an RNA Inverse DEsign framework with Reinforcement learning that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in native sequence recovery over state-of-the-art methods. Then, we fine-tune the model with an improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.
中文标题/摘要
标题:RIDER:基于强化学习引导扩散的3D RNA逆设计
工程化合成生物学和治疗中的功能性RNA的3D结构逆设计至关重要。尽管最近的深度学习方法在这一领域取得了进展,但它们通常通过原生序列恢复进行优化和评估,这仅是结构忠实度的有限替代,因为不同的序列可以折叠成相似的3D结构,高恢复率并不一定意味着正确的折叠。为了解决这一局限性,我们提出了RIDER,一种基于强化学习的RNA逆设计框架,直接优化3D结构相似性。首先,我们开发并预训练了一个基于GNN的生成扩散模型,该模型根据目标3D结构进行条件化,实现了比最先进的方法高出9%的原生序列恢复率。然后,我们使用基于3D自一致性度量的四个任务特定奖励函数改进的策略梯度算法对模型进行微调。实验结果表明,RIDER在所有指标上将结构相似性提高了超过100%,并发现了与原生序列不同的设计。
Summary / 总结
RIDER is an RNA inverse design framework that directly optimizes for 3D structural similarity using reinforcement learning. It first pre-trains a GNN-based generative diffusion model and then fine-tunes it with a policy gradient algorithm. The results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs distinct from native sequences.
RIDER 是一个使用强化学习直接优化 3D 结构相似性的 RNA 逆向设计框架。它首先预训练一个基于 GNN 的生成扩散模型以获得更好的原生序列恢复,然后使用四个基于 3D 自洽度量的任务特定奖励函数进行微调。实验结果表明,RIDER 显著提高了结构相似性并发现了与原生序列不同的设计。
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-02-18T15:52:04+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
中文标题/摘要
标题:FindAnything: 面向任意环境中机器人探索的开放词汇与以对象为中心的建图
几何精确且语义丰富的地图表示已被证明对于在未知环境中部署机器人和任务规划至关重要。然而,实时、开放词汇的大型未知环境语义理解仍然存在挑战,主要由于计算需求。本文提出了一种名为FindAnything的开放世界映射框架,将视觉-语言信息整合到密集的体积子地图中。通过使用视觉-语言特征,FindAnything结合了纯几何和开放词汇的语义信息,提高了理解水平。它通过在对象级别聚合特征来高效存储开放词汇信息。基于eSAM片段的像素级视觉-语言特征被聚合,并整合到对象中心的体积子地图中,提供了一种从开放词汇查询到三维几何的映射,同时在内存使用方面也具有可扩展性。我们证明FindAnything在语义准确性方面与最先进的技术相当,但速度更快且更节省内存,使其能够在大型环境中部署,并在资源受限的设备上运行,如MAVs。我们展示了FindAnything的实时能力使其在下游任务中具有实用性,例如在模拟的搜索和救援场景中自主MAV探索。项目页面: https://ethz-mrl.github.io/findanything/
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to achieve both geometric and semantic understanding. It uses object-level feature aggregation and eSAM segments to efficiently store open-vocabulary information, providing scalable memory usage. FindAnything matches state-of-the-art semantic accuracy but is faster and more memory-efficient, suitable for large-scale environments and resource-constrained devices like MAVs. It demonstrates real-time capabilities useful for tasks such as autonomous MAV exploration in simulated Search and Rescue scenarios.
FindAnything 是一种将视觉-语言信息集成到密集体素子地图中的开放世界映射框架,以实现几何和语义理解。它使用对象级特征聚合和 eSAM 区段高效存储开放词汇信息,提供从查询到 3D 几何的可扩展映射。FindAnything 在语义准确性上与最先进的技术相当,但速度更快且占用内存更少,适用于大规模环境和资源受限设备如 MAV 的部署。它展示了实时能力,适用于模拟搜索和救援场景中的自主 MAV 探索等任务。
FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Authors: Haorui Chen, Chengze Li, Jia Li
First: 2025-09-26T11:47:50+00:00 · Latest: 2026-02-18T15:49:02+00:00
Abstract
Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge. Existing feature-level benchmarks generally suffer from two primary limitations: unrealistic task inputs enriched with code hints and significant data leakage risks due to their static nature. To address these limitations, we propose a new benchmark - FeatBench, which introduces the following advances: (1) Realistic Task Inputs. Task inputs consist solely of natural language requirements, strictly devoid of code hints (e.g., function signatures). This format mirrors realistic software development by requiring agents to independently bridge the gap between abstract user intent and concrete code changes. (2) Evolving Data. FeatBench employs a fully automated pipeline to construct new benchmark versions from the latest repositories, effectively mitigating data contamination. The initial release comprises 157 tasks sourced from 27 actively maintained repositories. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. The results reveal that FeatBench poses a significant challenge, with the highest resolved rate reaching only 29.94%. Crucially, our analysis uncovers a prevalent behavioral pattern of aggressive implementation, which leads to "scope creep" and widespread regressions where agents break existing features by diverging from the user's explicit intent. We release FeatBench, our automated pipeline, and all experimental results to facilitate further community research.
中文标题/摘要
标题:FeatBench:朝着更现实的特征级代码生成评估迈进
在软件工程领域,通过代码库级别的特征实现来评估大型语言模型(LLMs)是一个关键前沿。然而,建立一个能够忠实反映现实开发场景的基准仍然是一项重大挑战。现有的特征级基准通常存在两个主要局限性:不切实际的任务输入,其中包含代码提示,以及由于其静态性质而导致的数据泄露风险。为了解决这些局限性,我们提出了一种新的基准——FeatBench,它引入了以下进步:(1)现实的任务输入。任务输入仅由自然语言需求组成,严格不包含代码提示(例如,函数签名)。这种格式通过要求代理独立地弥合抽象用户意图与具体代码更改之间的差距,反映了现实的软件开发。(2)不断演化的数据。FeatBench 使用一个完全自动化的管道从最新的代码库中构建新的基准版本,有效地减少了数据污染。初始发布包含来自27个活跃维护的代码库的157个任务。我们在FeatBench上评估了两个最先进的代理框架和四个领先的LLM。结果表明,FeatBench 提出了重大挑战,最高解决率为29.94%。我们的分析揭示了一种普遍的行为模式,即激进的实现,导致“范围蔓延”和广泛的倒退,其中代理通过偏离用户的明确意图来破坏现有功能。我们发布了FeatBench、我们的自动化管道以及所有实验结果,以促进进一步的社区研究。
Summary / 总结
FeatBench is a new benchmark designed to evaluate Large Language Models (LLMs) more realistically in feature-level code generation. It addresses the limitations of existing benchmarks by providing realistic task inputs in natural language without code hints and using an evolving data pipeline to avoid data leakage. Initial evaluation shows that even the best models resolve only 29.94% of tasks, highlighting the difficulty of the benchmark. The study also identifies a common issue where models implement features aggressively, leading to unintended regressions. The benchmark, pipeline, and results are publicly available for further research.
FeatBench 是一个新的基准,旨在更真实地评估大型语言模型在功能级代码生成中的表现。它通过提供不含代码提示的自然语言任务输入,并利用自动化管道持续构建新的基准版本,来解决现有基准的局限性。研究对两个最先进的代理框架和四个领先的 LLM 进行了评估,最高解决率仅为 29.94%,表明该基准具有相当的难度。分析还发现,代理普遍倾向于激进实现,导致范围蔓延和现有功能的广泛退化。
Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation
Authors: Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie
First: 2025-10-21T09:57:44+00:00 · Latest: 2026-02-18T15:47:54+00:00
Comments: Accepted into AAMAS '26
Abstract
Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
中文标题/摘要
标题:安全但不保守:通过不确定性感知调制减少安全批评中的过度保守
确保强化学习(RL)代理的安全探索对于在实际系统中部署至关重要。然而,现有方法难以找到合适的平衡:严格遵守安全性的方法往往会削弱任务性能,而优先考虑奖励的方法则会频繁违反安全约束,产生模糊的成本景观,从而压平梯度并阻碍策略改进。我们提出了不确定性安全批评者(USC),这是一种新颖的方法,将不确定性感知调制和细化整合到批评者训练中。通过在不确定和昂贵的区域集中保守性,而在安全区域保持锐利的梯度,USC使策略能够实现有效的奖励-安全权衡。大量实验表明,USC将安全违规减少约40%,同时保持竞争力或更高的奖励,并将预测和真实成本梯度之间的误差减少约83%,打破了安全性和性能之间的传统权衡,为可扩展的安全RL铺平了道路。
Summary / 总结
The paper addresses the challenge of balancing safety and performance in reinforcement learning agents by introducing the Uncertain Safety Critic (USC). USC modulates safety critics to be more conservative in uncertain and costly regions while maintaining sharp gradients in safe areas. This approach reduces safety violations by about 40% and decreases the error between predicted and true cost gradients by around 83%, leading to better reward-safety trade-offs compared to existing methods.
论文通过引入不确定安全批评者(USC)来解决强化学习代理在安全性和性能之间平衡的问题。USC 调节安全批评者,在不确定和高成本区域更加保守,而在安全区域保持梯度的锐度。这种方法将安全违规减少约40%,并将预测和真实成本梯度之间的误差降低约83%,从而在安全性和性能之间实现更好的权衡,优于现有方法。
Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Authors: Kaiting Liu, Hazel Doughty
Venue: ICLR 2026
First: 2026-02-18T15:46:36+00:00 · Latest: 2026-02-18T15:46:36+00:00
Comments: ICLR 2026
Abstract
Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.
中文标题/摘要
标题:让我们拆分:零样本分类器编辑以实现细粒度视频理解
视频识别模型通常在固定的分类体系上进行训练,这些分类体系往往过于粗略,无法区分物体、方式或结果。随着任务和定义的演变,这些模型无法适应新的区分,收集新注释并重新训练以适应这些变化成本高昂。为了解决这些挑战,我们引入了类别拆分这一新任务,即对现有的分类器进行编辑,以将粗分类细化为更细的子类别,同时在其他地方保持准确性。我们提出了一种零样本编辑方法,利用视频分类器的潜在组合结构来揭示细粒度的区分,无需额外数据。我们还展示了低样本微调虽然简单但非常有效,并且从我们的零样本初始化中受益。在我们新的视频基准测试中的类别拆分实验表明,我们的方法显著优于视觉-语言基线,在新拆分的类别上提高了准确性,而不会牺牲其他部分的性能。
Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
Authors: Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
First: 2026-02-18T15:43:36+00:00 · Latest: 2026-02-18T15:43:36+00:00
Comments: 12 pages, 6 figures, supplementary material included
Abstract
Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
中文标题/摘要
标题:基于逆约束强化学习的安全强化学习脆弱性分析
安全强化学习(Safe RL)旨在确保策略性能的同时满足安全约束。然而,现有的大多数Safe RL方法假设环境是良性的,这使得它们在现实世界中常见的对抗性扰动面前变得脆弱。此外,现有的基于梯度的对抗性攻击通常需要访问策略的梯度信息,而在现实世界场景中这往往是不切实际的。为了解决这些挑战,我们提出了一种对抗性攻击框架,以揭示Safe RL策略的脆弱性。利用专家演示和黑盒环境交互,我们的框架学习了一个约束模型和一个代理(学习者)策略,从而可以在不需要受害策略的内部梯度或真实安全约束的情况下进行基于梯度的攻击优化。我们还提供了理论分析,证明了可行性和推导了扰动边界。在多个Safe RL基准上的实验表明,在有限的特权访问下,我们的方法是有效的。
Summary / 总结
The paper addresses the vulnerability of Safe Reinforcement Learning (Safe RL) policies to adversarial perturbations by proposing an adversarial attack framework. This framework uses expert demonstrations and black-box interaction to learn a constraint model and a surrogate policy, allowing for gradient-based attacks without needing the victim policy's gradients or the true safety constraints. Theoretical analysis supports the feasibility of the approach, and experiments show its effectiveness in multiple Safe RL benchmarks under limited access conditions.
该论文通过提出一种对抗攻击框架来解决安全强化学习(Safe RL)策略对对抗性扰动的脆弱性问题。方法利用专家演示和黑盒环境交互来学习约束模型和代理策略,允许在不需要受害策略的梯度信息或真实安全约束的情况下进行基于梯度的攻击优化。实验表明,在有限特权访问条件下,所提出的方法可以有效识别 Safe RL 策略的漏洞。
Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks
Authors: Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček
Venue: 14th International Conference on Learning Representations, 2026
First: 2025-07-05T13:01:18+00:00 · Latest: 2026-02-18T15:37:58+00:00
Abstract
The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at https://github.com/humancompatible/train.
中文标题/摘要
标题:公平约束下深度神经网络训练的随机逼近算法基准测试
能够对深度神经网络(DNNs)施加约束的训练能力对于提高现代机器学习模型的公平性至关重要。近年来,已经分析了许多算法,但仍然没有广泛接受的标准方法来约束训练DNNs。在本文中,我们提供了一个基于美国人口普查(Folktables)的具有挑战性的公平约束下大规模现实世界学习任务基准。我们指出了此类任务的理论挑战,并回顾了随机逼近算法的主要方法。最后,我们通过实现和比较三种最近提出的但尚未实现的算法来展示基准的使用,从优化性能和公平性改进两个方面进行比较。我们以 Python 包的形式发布了基准代码,地址为 https://github.com/humancompatible/train 。
Summary / 总结
This paper aims to improve the fairness of deep neural networks by benchmarking stochastic approximation algorithms for fairness-constrained training. The authors present a challenging benchmark using real-world data from the US Census, highlighting the theoretical challenges and reviewing existing approaches. They implement and compare three recently proposed algorithms, evaluating their optimization performance and fairness improvement. The benchmark is released as a Python package.
该论文旨在通过基准测试随机逼近算法来提高深度神经网络的公平性,基于美国人口普查数据构建了一个具有挑战性的基准,并评估了三种最近提出的算法在优化性能和公平性改进方面的表现。研究指出了公平约束学习中的理论挑战,并提供了一个Python包用于基准测试。