DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
Authors: Shubham Patle, Sara Ghaboura, Hania Tariq, Mohammad Usman Khan, Omkar Thawakar, Rao Muhammad Anwer, Salman Khan
First: 2026-01-27T18:59:19+00:00 · Latest: 2026-01-27T18:59:19+00:00
Comments: Accepted to EACL-2026 (Main Track)
Abstract
Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across six classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluated 13 leading Arabic and multilingual multimodal models and showed that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual-text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of the Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset (https://huggingface.co/datasets/MBZUAI/DuwatBench) and evaluation suite (https://github.com/mbzuai-oryx/DuwatBench) are publicly available.
中文标题/摘要
标题:DuwatBench:通过阿拉伯书法基准促进语言与视觉遗产的融合以实现跨模态理解
阿拉伯书法是阿拉伯语言最丰富的视觉传统之一,将语言意义与艺术形式融为一体。尽管跨语言的多模态模型已经取得了进展,但它们处理阿拉伯书法的能力,尤其是艺术性和风格化的书法形式,仍然鲜有探索。为了解决这一差距,我们提出了DuwatBench,这是一个包含1,272个精心挑选的样本的基准,涵盖六种古典和现代书法风格中的约1,475个独特词汇,每个样本都配有句子级别的检测注释。该数据集反映了阿拉伯书写中的现实挑战,如复杂的笔画模式、密集的连字以及风格上的变化,这些往往给标准的文本识别系统带来了挑战。使用DuwatBench,我们评估了13个领先的阿拉伯语和多语言多模态模型,并展示了尽管它们在干净的文本上表现良好,但在书法变化、艺术变形和精确的视觉-文本对齐方面却表现不佳。通过公开发布DuwatBench及其注释,我们旨在推动文化背景下的多模态研究,促进阿拉伯语言和视觉遗产在人工智能系统中的公平包容,并支持该领域的持续进步。我们的数据集(https://huggingface.co/datasets/MBZUAI/DuwatBench)和评估套件(https://github.com/mbzuai-oryx/DuwatBench)已公开发布。
Summary / 总结
DuwatBench is a benchmark dataset for Arabic calligraphy that bridges language and visual heritage, containing 1,272 samples with sentence-level annotations across six calligraphic styles. It evaluates 13 leading Arabic and multilingual multimodal models, revealing their limitations in handling calligraphic variation and artistic distortions. The dataset aims to advance culturally grounded multimodal research and support the inclusion of Arabic visual heritage in AI systems.
DuwatBench 是一个包含 1,272 个样本和 1,475 个独特词汇的阿拉伯书法基准数据集,涵盖了六种书法风格,并且每个样本都配有句子级别的注释。该数据集旨在解决处理艺术性和风格化的阿拉伯书法的挑战,这是当前多模态模型尚未充分探索的领域。对 13 个领先模型的评估表明,它们在书法变体和精确的视觉-文本对齐方面表现不佳,突显了需要改进对阿拉伯书法的多模态理解。通过发布此数据集和评估工具,作者旨在推动文化背景下的多模态研究,并支持将阿拉伯视觉遗产纳入 AI 系统中。
Self-Distillation Enables Continual Learning
Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
First: 2026-01-27T18:59:08+00:00 · Latest: 2026-01-27T18:59:08+00:00
Abstract
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
中文标题/摘要
标题:自我蒸馏使连续学习成为可能
连续学习,使模型能够获取新技能和知识而不损害现有能力,仍然是基础模型面临的基本挑战。尽管在线策略强化学习可以减少遗忘,但它需要明确的奖励函数,这些函数往往不可用。从专家演示学习的主要替代方法是监督微调(SFT),这是固有的离策略。我们引入了自我蒸馏微调(SDFT),这是一种简单的方法,可以直接从演示中进行在线策略学习。SDFT 利用上下文学习,通过使用演示条件下的模型作为自己的教师,生成保留先前能力的同时获取新技能的在线策略训练信号。在技能学习和知识获取任务中,SDFT 一贯优于 SFT,实现更高的新任务准确率,同时显著减少灾难性遗忘。在顺序学习实验中,SDFT 使单个模型能够在不出现性能退化的情况下随着时间的推移积累多种技能,确立了在线策略蒸馏作为从演示中实现连续学习的实用途径。
Summary / 总结
The paper addresses the challenge of continual learning in foundation models, where models need to acquire new skills without forgetting existing ones. It introduces Self-Distillation Fine-Tuning (SDFT), a method that uses a demonstration-conditioned model as its own teacher to generate on-policy training signals. SDFT outperforms supervised fine-tuning in both skill learning and knowledge acquisition tasks, achieving better new-task accuracy and reducing catastrophic forgetting. In sequential learning, SDFT allows a single model to learn multiple skills over time without performance degradation.
论文针对基础模型在获取新技能时不忘记已有能力的持续学习挑战,提出了一种名为Self-Distillation Fine-Tuning (SDFT)的方法,该方法利用示范条件下的模型作为自己的教师生成基于策略的训练信号。SDFT在技能学习和知识获取任务中优于监督微调,显示出更高的新任务准确率和减少灾难性遗忘。在顺序学习实验中,SDFT使单个模型能够在不降低性能的情况下逐步学习多个技能,证明了基于策略蒸馏在示范驱动持续学习中的实用性。
M-SGWR: Multiscale Similarity and Geographically Weighted Regression
Authors: M. Naser Lessani, Zhenlong Li, Manzhu Yu, Helen Greatrex, Chan Shen
First: 2026-01-27T18:55:12+00:00 · Latest: 2026-01-27T18:55:12+00:00
Abstract
The first law of geography is a cornerstone of spatial analysis, emphasizing that nearby and related locations tend to be more similar. However, defining what constitutes "near" and "related" remains challenging, as different phenomena exhibit distinct spatial patterns. Traditional local regression models, such as Geographically Weighted Regression (GWR) and Multiscale GWR (MGWR), quantify spatial relationships solely through geographic proximity. In an era of globalization and digital connectivity, however, geographic proximity alone may be insufficient to capture how locations are interconnected. To address this limitation, we propose a new multiscale local regression framework, termed M-SGWR, which characterizes spatial interaction across two dimensions: geographic proximity and attribute (variable) similarity. For each predictor, geographic and attribute-based weight matrices are constructed separately and then combined using an optimized parameter, alpha, which governs their relative contribution to local model fitting. Analogous to variable-specific bandwidths in MGWR, the optimal alpha varies by predictor, allowing the model to flexibly account for geographic, mixed, or non-spatial (remote similarity) effects. Results from two simulation experiments and one empirical application demonstrate that M-SGWR consistently outperforms GWR, SGWR, and MGWR across all goodness-of-fit metrics.
中文标题/摘要
标题:M-SGWR:多尺度相似性和地理加权回归
地理学的第一定律是空间分析的基石,强调临近和相关的位置往往更相似,然而,定义什么是“临近”和“相关”仍然具有挑战性,因为不同的现象表现出不同的空间模式。传统的局部回归模型,如地理加权回归(GWR)和多尺度GWR(MGWR),仅通过地理邻近性来量化空间关系。然而,在全球化和数字互联的时代,仅靠地理邻近性可能不足以捕捉位置之间的相互联系。为了解决这一局限性,我们提出了一种新的多尺度局部回归框架,称为M-SGWR,该框架在两个维度上表征空间交互:地理邻近性和属性(变量)相似性。对于每个预测变量,分别构建地理和属性权重矩阵,然后使用优化参数alpha将其结合,该参数决定了它们在局部模型拟合中的相对贡献。类似于MGWR中的变量特定带宽,最优alpha因预测变量而异,使模型能够灵活地考虑地理、混合或非空间(远程相似性)效应。来自两个模拟实验和一个实证应用的结果表明,M-SGWR在所有拟合度指标上都优于GWR、SGWR和MGWR。
Summary / 总结
The research aims to improve spatial analysis by addressing the limitations of traditional local regression models that rely solely on geographic proximity. M-SGWR, a new multiscale local regression framework, considers both geographic proximity and attribute similarity to better capture spatial interactions. The method constructs separate geographic and attribute-based weight matrices for each predictor and combines them using an optimized, predictor-specific parameter, alpha. Results from two simulation experiments and one empirical application show that M-SGWR outperforms GWR, SGWR, and MGWR on all goodness-of-fit metrics.
研究旨在通过结合地理接近性和属性相似性来改进局部回归模型。M-SGWR 是一种新的多尺度局部回归框架,为每个预测因子分别构建地理和属性权重矩阵,并通过优化参数 alpha 进行组合。研究结果显示,M-SGWR 在模拟和实证应用中的拟合度指标上优于 GWR、SGWR 和 MGWR。
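The weight combination described in the M-SGWR abstract can be illustrated with a minimal sketch. The abstract only states that geographic and attribute-based weights are "combined using an optimized parameter, alpha"; the Gaussian kernels, the convex combination, and the names `gaussian_kernel` and `combined_weights` below are assumptions made for illustration, not the authors' implementation.

```python
import math

def gaussian_kernel(d, bandwidth):
    """Gaussian decay: weight 1 at distance 0, falling smoothly as d grows."""
    return math.exp(-0.5 * (d / bandwidth) ** 2)

def combined_weights(target, points, alpha, geo_bw, attr_bw):
    """Weights of every observation relative to `target` for one predictor.

    Each point is (x, y, attribute_value). alpha=1 uses geography only and
    alpha=0 uses attribute similarity only (assumed convex combination).
    """
    tx, ty, ta = target
    weights = []
    for x, y, a in points:
        w_geo = gaussian_kernel(math.hypot(x - tx, y - ty), geo_bw)
        w_attr = gaussian_kernel(abs(a - ta), attr_bw)
        weights.append(alpha * w_geo + (1.0 - alpha) * w_attr)
    return weights

# Three locations: the target, a near but dissimilar one, a remote but similar one.
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 50.0), (100.0, 100.0, 10.5)]
target = points[0]
geo_only = combined_weights(target, points, alpha=1.0, geo_bw=2.0, attr_bw=2.0)
mixed = combined_weights(target, points, alpha=0.5, geo_bw=2.0, attr_bw=2.0)
```

With `alpha=1.0` the remote location receives essentially no weight, while with `alpha=0.5` its attribute similarity lets it outweigh the nearby but dissimilar location, which is the "remote similarity" effect the abstract mentions.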
SONIC: Spectral Oriented Neural Invariant Convolutions
Authors: Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
Venue: ICLR 2026
First: 2026-01-27T18:51:11+00:00 · Latest: 2026-01-27T18:51:11+00:00
Comments: 10 pages, 4 figures. Accepted at ICLR 2026
Abstract
Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
中文标题/摘要
标题:SONIC:谱导向神经不变卷积
卷积神经网络(CNNs)依赖于固定大小的核扫描局部区域,这限制了它们在不使用非常深的架构的情况下捕捉全局上下文或长距离依赖的能力。视觉变换器(ViTs)则提供了全局连接性,但缺乏空间归纳偏差,依赖于显式的位置编码,并且仍然受限于初始的切片大小。要弥合这些限制,需要一种既是结构化的又是全局的表示。我们引入了SONIC(Spectral Oriented Neural Invariant Convolutions),这是一种连续的谱参数化,使用一组共享的、方向选择性的组件来建模卷积算子。这些组件在完整的频率域内定义了平滑的响应,从而产生全局感受野和滤波器,这些滤波器在不同分辨率下自然适应。在合成基准测试、大规模图像分类和3D医学数据集中,SONIC展示了对几何变换、噪声和分辨率变化的改进鲁棒性,并且在参数量减少一个数量级的情况下,匹配或超过了卷积、基于注意力和先前的谱架构。这些结果表明,连续的方向感知谱参数化提供了一种原理上和可扩展的替代传统空间和谱算子的方法。
Summary / 总结
The paper introduces SONIC, a spectral parameterization that models convolutional operators using orientation-selective components, addressing the limitations of fixed-size kernels in CNNs and the lack of spatial inductive bias in ViTs. SONIC achieves global receptive fields and adaptable filters, showing improved robustness to geometric transformations, noise, and resolution shifts in various benchmarks, with significantly fewer parameters compared to convolutional, attention-based, and spectral architectures.
研究旨在解决卷积神经网络(CNNs)在捕捉全局上下文方面的局限性以及视觉变换器(ViTs)在保持空间归纳偏见方面的局限性。SONIC 是一种新的谱参数化方法,使用一组方向选择性组件来建模卷积操作,提供全局感受野和自适应滤波器。实验表明,SONIC 在各种基准测试中表现出色,参数量比 CNNs、ViTs 和先前的谱架构少得多,显示出对几何变换、噪声和分辨率变化的鲁棒性增强。
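The idea of defining a filter as a smooth function over the frequency domain, so that it can be sampled at whatever resolution the input has, can be sketched in one dimension. This is only an illustration of spectral filtering with a continuous response: the Gaussian `response` function and the `cutoff` parameter are hypothetical, and the sketch does not reproduce SONIC's orientation-selective components.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), adequate for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def response(freq, cutoff=0.25):
    """Hypothetical smooth low-pass response over continuous frequency."""
    return math.exp(-((freq / cutoff) ** 2))

def spectral_filter(signal, cutoff=0.25):
    """Apply the continuous response by sampling it on the signal's own grid."""
    n = len(signal)
    spectrum = dft(signal)
    # Signed frequencies in cycles per sample; bins above n/2 are negative.
    freqs = [(k if k <= n // 2 else k - n) / n for k in range(n)]
    filtered = [s * response(f, cutoff) for s, f in zip(spectrum, freqs)]
    return [v.real for v in dft(filtered, inverse=True)]

constant_out = spectral_filter([2.0] * 8)         # zero frequency passes unchanged
nyquist_out = spectral_filter([1.0, -1.0] * 4)    # highest frequency is damped
longer_out = spectral_filter([1.0] * 16)          # same response, finer grid
```

Because the response is a function of frequency rather than a fixed-size kernel, the same definition applies to the 8-sample and 16-sample inputs, and every output sample depends on the whole input.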
RHSIA: Real-time Hemodynamics Surrogation for Non-idealized Intracranial Aneurysms
Authors: Yiying Sheng, Wenhao Ding, Dylan Roi, Leonard Leong Litt Yeo, Hwa Liang Leo, Choon Hwai Yap
First: 2026-01-27T18:39:58+00:00 · Latest: 2026-01-27T18:39:58+00:00
Abstract
Extensive studies have suggested that fluid mechanical markers of intracranial aneurysms (IAs) derived from Computational Fluid Dynamics (CFD) can indicate disease progression risks, but to date this has not been translated clinically. This is because CFD requires specialized expertise and is time-consuming and low-throughput, making it difficult to support clinical trials. A deep learning model that maps IA morphology to biomechanical markers can address this, enabling physicians to obtain these markers in real time without performing CFD. Here, we show that a Graph Transformer model that incorporates temporal information and is supervised by large CFD data can accurately predict Wall Shear Stress (WSS) across the cardiac cycle from IA surface meshes. The model effectively captures the temporal variations of the WSS pattern, achieving a Structural Similarity Index (SSIM) of up to 0.981 and a maximum-based relative L2 error of 2.8%. Ablation studies and SOTA comparison confirmed its optimality. Further, as pulsatile CFD data are computationally expensive to generate and sample sizes are limited, we adopted a strategy of injecting a large amount of steady-state CFD data, which are extremely low-cost to generate, as augmentation. This approach enhances network performance substantially when the pulsatile CFD sample size is small. Our study provides a proof of concept that temporal sequences of cardiovascular fluid mechanical parameters can be computed in real time from the geometric mesh using a deep learning model, and that this is achievable even with a small pulsatile CFD sample size. Our approach is likely applicable to other cardiovascular scenarios.
中文标题/摘要
标题:RHSIA: 非理想化颅内动脉瘤的实时血流动力学替代
大量研究表明,从计算流体动力学(CFD)获得的颅内动脉瘤(IAs)的流体机械标志物可以预测疾病进展风险,但至今尚未在临床中实现。这是因为CFD需要专门的技术知识,耗时且低通量,难以支持临床试验。一个将动脉瘤形态映射到生物力学标志物的深度学习模型可以解决这一问题,使医生能够无需执行CFD即可实时获得这些标志物。在这里,我们展示了结合时间信息的图变换器模型,通过大型CFD数据监督,可以从动脉瘤表面网格准确预测整个心动周期的壁剪切应力(WSS)。该模型有效地捕捉了WSS模式的时间变化,实现了高达0.981的结构相似性指数(SSIM)和最大基于相对L2误差的2.8%。消融研究和SOTA比较证实了其最优性。此外,由于脉动CFD数据的生成和采样成本高昂且样本量有限,我们采用了一种策略,即注入大量低成本生成的稳态CFD数据作为增强。当脉动CFD数据样本量较小时,这种方法显著提高了网络性能。我们的研究证明了从几何网格使用深度学习模型计算心血管流体机械参数的时间序列是可行的,即使脉动CFD样本量较小也是如此。我们的方法可能适用于其他心血管场景。
Summary / 总结
The study aims to address the clinical translation gap of using Computational Fluid Dynamics (CFD) for intracranial aneurysms by developing a Graph Transformer model that predicts Wall Shear Stress (WSS) in real time. The model, supervised by large CFD data, achieves high accuracy with an SSIM of up to 0.981 and a maximum-based relative L2 error of 2.8%. Augmenting pulsatile CFD data with steady-state CFD data substantially improves the model's performance when pulsatile data are limited. This approach demonstrates the potential for real-time computation of cardiovascular fluid mechanical parameters from geometric meshes.
研究旨在解决将计算流体力学(CFD)用于预测颅内动脉瘤血流动力学标志物的临床转化难题,这些方法耗时且需要专门的技术。研究使用一个由大量CFD数据监督的图变换器模型来实现实时预测壁剪切应力(WSS),达到了高达0.981的结构相似性指数(SSIM)和最大基于的相对L2误差2.8%。该方法还利用稳态CFD数据作为增强,以提高在脉动CFD数据样本量有限时的性能,展示了其在其他心血管场景中实现实时预测的潜力。
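For readers unfamiliar with the error metric quoted for RHSIA, the following sketch computes a standard relative L2 error between predicted and reference values. The abstract does not define its "maximum-based" normalization, so this plain norm-ratio form and the function name `relative_l2_error` are assumptions and may differ from the paper's exact metric.

```python
import math

def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over matching sample points."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

# Toy wall-shear-stress values at two mesh nodes (illustrative numbers only).
reference = [3.0, 4.0]
predicted = [3.0, 4.5]
error = relative_l2_error(predicted, reference)
```

Here the numerator is 0.5 and the denominator is 5.0, giving a relative error of 0.1, that is, 10%.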
LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
Authors: Obed Junias, Maria Leonor Pacheco
First: 2026-01-23T07:07:19+00:00 · Latest: 2026-01-27T18:33:20+00:00
Abstract
Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
中文标题/摘要
标题:LOGICAL-COMMONSENSEQA:逻辑常识推理基准
常识推理通常涉及评估多个合理的解释,而不是选择单一的原子答案,然而大多数基准依赖于单标签评估,掩盖了陈述是否联合合理、相互排斥或联合不合理。我们引入了LOGICAL-COMMONSENSEQA,这是一个将常识推理重新定义为使用合理性级别操作符(AND,OR,NEITHER/NOR)对原子陈述进行逻辑组合的基准。在零样本、少量样本和链式思考提示下评估指令调优、推理专业化和微调模型,我们发现模型在联合推理方面表现合理,在析取推理方面表现适度,但在基于否定的问题上表现急剧下降。LOGICAL-COMMONSENSEQA 暴露了基本的推理限制,并提供了一个可控框架以推进组合常识推理。
Summary / 总结
The research aims to evaluate the ability of models to handle logical commonsense reasoning by introducing LOGICAL-COMMONSENSEQA, which assesses reasoning over pairs of atomic statements using AND, OR, and NEITHER/NOR operators. The study finds that models perform well on conjunctive reasoning and moderately on disjunctive reasoning, but struggle with negation-based questions, highlighting fundamental reasoning limitations and providing a framework for advancing compositional commonsense reasoning.
论文提出了LOGICAL-COMMONSENSEQA基准,该基准通过使用合理性级别运算符(AND, OR, NEITHER/NOR)对成对的原子陈述进行逻辑组合来评估模型,解决了单一标签评估的局限性。研究发现,模型在合取推理方面表现良好,在析取推理方面表现适度,但在否定推理方面存在问题,突显了需要更好的组合常识推理能力。
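The three operators named in the LOGICAL-COMMONSENSEQA abstract can be read as truth functions over the plausibility of two atomic statements. The sketch below assumes a purely truth-functional reading with boolean plausibility labels; the benchmark's actual labeling scheme is not given in the abstract, and the function name `holds` is hypothetical.

```python
def holds(operator, a_plausible, b_plausible):
    """Return True if the composed statement holds under the given operator."""
    if operator == "AND":           # both statements are plausible
        return a_plausible and b_plausible
    if operator == "OR":            # at least one statement is plausible
        return a_plausible or b_plausible
    if operator == "NEITHER/NOR":   # neither statement is plausible
        return not (a_plausible or b_plausible)
    raise ValueError(f"unknown operator: {operator}")

# Truth table over all plausibility pairs for each operator.
table = {
    op: [holds(op, a, b) for a in (True, False) for b in (True, False)]
    for op in ("AND", "OR", "NEITHER/NOR")
}
```

Under this reading, NEITHER/NOR is the negation of OR, which is the negation-based case on which the abstract reports the sharpest performance drop.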
Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
Authors: Cong Cao, Yujie Xu, Xiaodong Xu
First: 2025-11-14T12:40:21+00:00 · Latest: 2026-01-27T18:27:31+00:00
Comments: Technical report
Abstract
In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters. Our code and dataset are available at https://github.com/cao-cong/FSMSE.
中文标题/摘要
标题:参数高效MoE LoRA在少量示例多风格编辑中的应用
近年来,图像编辑引起了越来越多的关注。然而,通用图像编辑模型在面对新风格时往往无法产生令人满意的结果。挑战在于如何仅使用少量配对数据有效地微调通用图像编辑模型以适应新风格。为了解决这一问题,本文提出了一种新颖的少量示例风格编辑框架。为此任务,我们构建了一个包含五种不同风格的基准数据集。相应地,我们提出了一种参数高效的多风格Mixture-of-Experts Low-Rank Adaptation (MoE LoRA),并采用风格特定和风格共享的路由机制共同微调多种风格。风格特定的路由机制确保不同风格之间不会相互干扰,而风格共享的路由机制则能够自适应地分配共享的MoE LoRAs以学习共性模式。我们的MoE LoRA可以通过一种新颖的基于度量的方法自动确定每一层的最佳秩,该方法估计了每个单一秩组件的重要性得分。此外,我们探索了在Transformer中的扩散(DiT)模型中插入LoRA的最佳位置,并结合对抗学习和流匹配来引导扩散训练过程。实验结果表明,与现有最先进的方法相比,我们的方法在显著减少LoRA参数的情况下表现出更优的效果。我们的代码和数据集可在https://github.com/cao-cong/FSMSE上获取。
Summary / 总结
This paper addresses the challenge of fine-tuning general image editing models to new styles with limited paired data. It introduces a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) framework with style-specific and style-shared routing mechanisms. The method automatically determines the optimal ranks for each layer and integrates adversarial learning and flow matching to enhance the diffusion training process. Experimental results show that the proposed method outperforms existing approaches with fewer LoRA parameters.
该论文旨在解决使用有限配对数据将通用图像编辑模型调整到新风格的挑战。它提出了一种参数高效的多风格Mixture-of-Experts Low-Rank Adaptation (MoE LoRA)框架,该框架使用风格特定和风格共享路由机制联合调整多种风格。该方法能够自动为每一层确定最优的秩,并结合对抗学习和流匹配来指导扩散训练过程。实验结果表明,所提出的方法在更少的LoRA参数下优于现有方法。
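The core arithmetic behind a mixture of LoRA experts can be sketched as a frozen weight matrix plus a gate-weighted sum of low-rank updates. This is a generic illustration: the gate values are supplied by hand, and the paper's style-specific and style-shared routing and its metric-guided rank selection are not modeled. The names `lora_delta` and `moe_lora_weight` are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(B, A, scale=1.0):
    """Low-rank update scale * (B @ A); B is d_out x r, A is r x d_in."""
    return [[scale * v for v in row] for row in matmul(B, A)]

def moe_lora_weight(W, experts, gates):
    """Frozen weight W plus the gate-weighted sum of expert low-rank updates."""
    out = [row[:] for row in W]
    for (B, A), g in zip(experts, gates):
        for i, row in enumerate(lora_delta(B, A)):
            for j, v in enumerate(row):
                out[i][j] += g * v
    return out

W = [[1.0, 0.0], [0.0, 1.0]]                 # frozen 2x2 base weight
expert_a = ([[1.0], [0.0]], [[0.0, 2.0]])    # rank-1 expert: B is 2x1, A is 1x2
expert_b = ([[0.0], [1.0]], [[4.0, 0.0]])
adapted = moe_lora_weight(W, [expert_a, expert_b], gates=[0.5, 0.25])
```

Each rank-1 expert stores four numbers in this toy case; for large layers the saving over a full update grows with the layer size, which is why the total rank across experts drives the parameter count.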
Bandits in Flux: Adversarial Constraints in Dynamic Environments
Authors: Tareq Si Salem
First: 2026-01-27T18:26:07+00:00 · Latest: 2026-01-27T18:26:07+00:00
Comments: Accepted to AISTATS 2026
Abstract
We investigate the challenging problem of adversarial multi-armed bandits operating under time-varying constraints, a scenario motivated by numerous real-world applications. To address this complex setting, we propose a novel primal-dual algorithm that extends online mirror descent through the incorporation of suitable gradient estimators and effective constraint handling. We provide theoretical guarantees establishing sublinear dynamic regret and sublinear constraint violation for our proposed policy. Our algorithm achieves state-of-the-art performance in terms of both regret and constraint violation. Empirical evaluations demonstrate the superiority of our approach.
中文标题/摘要
标题:变动中的老虎机:动态环境中的对抗性约束
我们研究了在时间变化约束下运作的对抗多臂老虎机这一具有挑战性的问题,这一场景由许多实际应用所激发。为应对这一复杂环境,我们提出了一种新颖的 primal-dual 算法,通过引入合适的梯度估计器和有效的约束处理来扩展在线镜像下降方法。我们提供了理论保证,证明了我们提出的策略在动态遗憾和约束违反方面均具有亚线性表现。我们的算法在遗憾和约束违反方面均达到了最先进的性能。实证评估表明了我们方法的优越性。
Summary / 总结
This paper addresses the problem of adversarial multi-armed bandits with time-varying constraints, motivated by real-world applications. It introduces a novel primal-dual algorithm that enhances online mirror descent with gradient estimators and constraint management techniques. The algorithm ensures sublinear dynamic regret and constraint violation, and outperforms existing methods in both metrics. Empirical results confirm its effectiveness.
该论文研究了时间变化环境下的对抗多臂老虎机问题,提出了一种新的 primal-dual 算法,该算法结合了梯度估计器和有效的约束处理。该算法提供了亚线性动态遗憾和约束违反的理论保证。实证结果表明,它在遗憾和约束违反方面都优于现有方法。
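A generic primal-dual scheme for constrained bandits can be sketched as exponential weights (online mirror descent with an entropic regularizer) on a Lagrangian loss, combined with projected ascent on a dual variable. This is not the paper's algorithm: the importance-weighted estimator, the step sizes `eta` and `eta_dual`, the absence of explicit exploration, and the convention that a cost at or below zero means the constraint is satisfied are all assumptions made for illustration.

```python
import math
import random

def primal_dual_bandit(losses, costs, eta=0.1, eta_dual=0.1, seed=0):
    """Run the sketch; losses[t][a] and costs[t][a] are per-round, per-arm values."""
    rng = random.Random(seed)
    k = len(losses[0])
    log_w = [0.0] * k          # primal variable kept in log space
    lam = 0.0                  # dual variable (price of constraint violation)
    history = []
    for loss_t, cost_t in zip(losses, costs):
        m = max(log_w)         # subtract the max for numerical stability
        w = [math.exp(v - m) for v in log_w]
        total = sum(w)
        probs = [v / total for v in w]
        arm = rng.choices(range(k), weights=probs)[0]
        # Bandit feedback: only the pulled arm's loss and cost are observed.
        lagrangian = loss_t[arm] + lam * cost_t[arm]
        estimate = lagrangian / probs[arm]   # importance-weighted estimate
        log_w[arm] -= eta * estimate         # primal step (mirror descent)
        lam = max(0.0, lam + eta_dual * cost_t[arm])  # projected dual ascent
        history.append((arm, probs, lam))
    return history

# Arm 0 is lossy and violates the constraint; arm 1 is lossless and feasible.
rounds = 200
history = primal_dual_bandit([[1.0, 0.0]] * rounds, [[1.0, -1.0]] * rounds)
final_arm, final_probs, final_lam = history[-1]
```

In this toy instance the weight of arm 0 can only decrease and that of arm 1 can only increase, so the sampling distribution drifts toward the feasible arm while the dual variable stays nonnegative.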
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Authors: Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
Venue: NeurIPS 2025
First: 2025-03-13T18:59:12+00:00 · Latest: 2026-01-27T18:10:17+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
中文标题/摘要
标题:MIP对抗代理:恶意图像补丁劫持多模态OS代理
近年来操作系统(OS)代理的进步使视觉语言模型(VLMs)能够直接控制用户的计算机。与传统的被动输出文本的VLMs不同,OS代理能够自主执行基于计算机的任务,仅需一个用户提示。OS代理通过捕获、解析和分析屏幕截图,并通过应用程序编程接口(APIs)执行低级操作(如鼠标点击和键盘输入)来实现这一目标。这种直接与OS的交互显著提高了风险,因为失败或操纵可能会立即产生实际后果。在本研究中,我们发现了一种针对这些OS代理的新攻击向量:恶意图像补丁(MIPs),这些对抗性扰动的屏幕区域在被OS代理捕获时,会通过利用特定的APIs诱导其执行有害操作。例如,MIP可以嵌入在桌面上的壁纸或在社交媒体上分享,以导致OS代理泄露敏感用户数据。我们展示了MIPs在用户提示和屏幕配置方面具有泛化能力,并且即使在执行良性指令期间也能劫持多个OS代理。这些发现揭示了OS代理中关键的安全漏洞,这些漏洞在广泛部署之前必须仔细解决。
Summary / 总结
This research investigates a new attack vector called Malicious Image Patches (MIPs) that can hijack OS agents by exploiting specific APIs. The study demonstrates that MIPs can be embedded in images and cause OS agents to perform harmful actions, such as exfiltrating sensitive user data. The findings highlight critical security vulnerabilities in OS agents and suggest that these vulnerabilities need to be addressed before widespread deployment.
研究探讨了一种新的攻击向量——恶意图像补丁(MIPs),通过利用特定的API,这些恶意图像补丁可以在OS代理中引发有害行为。研究指出,OS代理在执行用户指令时可能会受到这些恶意图像的影响,从而暴露了严重的安全漏洞。研究还表明,MIPs可以在多种背景下嵌入,并且可以操控多个OS代理,即使是在执行良性操作时。
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Authors: Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang
First: 2025-06-10T07:20:12+00:00 · Latest: 2026-01-27T18:07:12+00:00
Abstract
Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.
中文标题/摘要
标题:MLVTG:基于Mamba的特征对齐和LLM驱动的多模态视频时间定位净化
视频时间定位(VTG),旨在定位与自然语言查询对应的视频片段,是视频理解中的一个基本但具有挑战性的任务。现有的基于Transformer的方法往往受到冗余注意力和次优多模态对齐的困扰。为了解决这些限制,我们提出了一种名为MLVTG的新框架,该框架集成了两个关键模块:MambaAligner和LLMRefiner。MambaAligner使用堆叠的Vision Mamba块作为骨干,而不是Transformer,以建模时间依赖关系并提取用于多模态对齐的稳健视频表示。LLMRefiner利用预训练大型语言模型(LLM)的特定冻结层来隐式转移语义先验,增强多模态对齐而不进行微调。这种双重对齐策略,通过结构化的状态空间动力学建模时间,通过文本先验进行语义净化,能够实现更精确的定位。在QVHighlights、Charades-STA和TVSum上的广泛实验表明,MLVTG达到了最先进的性能,并显著优于现有基线。
Summary / 总结
MLVTG is a novel framework for Video Temporal Grounding that addresses the limitations of existing Transformer-based methods by using MambaAligner and LLMRefiner. MambaAligner employs Vision Mamba blocks to model temporal dependencies and enhance multi-modal alignment, while LLMRefiner uses a pre-trained Large Language Model to transfer semantic priors without fine-tuning. MLVTG outperforms existing methods on QVHighlights, Charades-STA, and TVSum, demonstrating more precise localization capabilities.
研究旨在通过解决现有Transformer方法的局限性来提升视频时间定位。MLVTG提出了一种新的框架,包含MambaAligner和LLMRefiner模块。MambaAligner使用Vision Mamba块来建模时间依赖关系并增强多模态对齐,而LLMRefiner利用预训练的LLM隐式转移语义先验,无需微调。实验结果表明,MLVTG在QVHighlights、Charades-STA和TVSum上的定位精度优于现有方法。
Generative Latent Alignment for Interpretable Radar Based Occupancy Detection in Ambient Assisted Living
Authors: Huy Trinh
First: 2026-01-27T18:06:51+00:00 · Latest: 2026-01-27T18:06:51+00:00
Abstract
In this work, we study how to make mmWave radar presence detection more interpretable for Ambient Assisted Living (AAL) settings, where camera-based sensing raises privacy concerns. We propose a Generative Latent Alignment (GLA) framework that combines a lightweight convolutional variational autoencoder with a frozen CLIP text encoder to learn a low-dimensional latent representation of radar Range-Angle (RA) heatmaps. The latent space is softly aligned with two semantic anchors corresponding to "empty room" and "person present", and Grad-CAM is applied in this aligned latent space to visualize which spatial regions support each presence decision. On our mmWave radar dataset, we qualitatively observe that the "person present" class produces compact Grad-CAM blobs that coincide with strong RA returns, whereas "empty room" samples yield diffuse or no evidence. We also conduct an ablation study using unrelated text prompts, which degrades both reconstruction and localization, suggesting that radar-specific anchors are important for meaningful explanations in this setting.
中文标题/摘要
标题:基于生成潜在对齐的可解释雷达占用检测在辅助生活中的应用
在本工作中,我们研究如何使毫米波雷达存在检测在辅助生活(AAL)环境中更具可解释性,其中基于摄像头的传感会引发隐私问题。我们提出了一种生成潜在对齐(GLA)框架,该框架结合了轻量级卷积变分自编码器和冻结的CLIP文本编码器,以学习雷达距离-角度(RA)热图的低维潜在表示。潜在空间通过两个语义锚点“空房间”和“有人存在”进行软对齐,并在对齐的潜在空间中应用Grad-CAM以可视化哪些空间区域支持每个存在决策。在我们的毫米波雷达数据集中,我们观察到“有人存在”类别的Grad-CAM斑块是紧凑的,并且与强烈的RA返回重合,而“空房间”样本则产生模糊的或没有证据。我们还使用不相关的文本提示进行了消融研究,这降低了重建和定位性能,表明雷达特定的锚点对于此设置中的有意义解释很重要。
Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries
Authors: Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless
First: 2026-01-24T17:30:23+00:00 · Latest: 2026-01-27T18:04:35+00:00
Abstract
Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives the user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
中文标题/摘要
标题:它能否零样本分类?:预测任意查询的零样本分类性能
像CLIP这样的视觉-语言模型创建了文本和图像对齐的嵌入空间,使得任何人都可以通过简单地命名他们想要区分的类别来构建视觉分类器。然而,一个在某一领域表现良好的模型可能在另一个领域失败,非专家用户没有直接的方法来评估他们选择的VLM是否适用于他们的问题。我们在此前仅使用文本比较的工作基础上,评估模型在给定自然语言任务中的表现,并探索生成与该任务相关的合成图像来评估和改进零样本准确性的预测方法。我们展示了生成的图像相对于基线文本仅比较分数显著提高了这些预测的质量。此外,它还为用户提供反馈,说明了用于评估的图像类型。标准CLIP基准数据集上的实验表明,基于图像的方法帮助用户在没有任何标注示例的情况下预测VLM是否适用于他们的应用。
Summary / 总结
The research aims to predict the zero-shot classification performance for arbitrary queries using Vision-Language Models like CLIP. The method involves evaluating the model's performance through text-only comparisons and further enhancing it by generating synthetic images relevant to the task. The key experimental findings show that using generated imagery improves the prediction quality significantly and provides users with feedback on the types of images used for the assessment. Experiments on standard CLIP benchmark datasets demonstrate that this image-based approach helps users predict the effectiveness of a VLM for their application without any labeled examples.
研究旨在使用如CLIP的Vision-Language模型预测任意查询的零样本分类性能。方法包括仅比较文本和生成与任务相关的合成图像来评估模型性能。关键实验发现表明,使用生成的图像可以提高零样本准确性的预测质量,并为用户提供有关用于评估的图像类型的反馈。标准CLIP基准数据集上的实验证实,基于图像的方法增强了用户在没有标注示例的情况下预测VLM是否适用于其应用的能力。
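The zero-shot classification setup that this paper tries to predict performance for can be sketched as a nearest-text-embedding rule: an image is assigned to the class whose text embedding is most similar to the image embedding. The three-dimensional vectors below are made-up stand-ins for real CLIP embeddings, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class name whose text embedding best matches the image."""
    scores = {
        name: cosine(image_embedding, emb)
        for name, emb in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings standing in for encoded prompts such as "a photo of a cat".
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]
label, scores = zero_shot_classify(image_embedding, text_embeddings)
```

The classifier is built purely from class names, so its accuracy depends on how well the model's text and image embeddings align for the chosen domain, which is the quantity the paper aims to estimate without labeled examples.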
Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning
Authors: KaiHui Huang, RunQing Wu, JinHui Sheng, HanYi Zhang, Ling Ge, JinYu Guo, Fei Ye
First: 2025-01-21T13:33:45+00:00 · Latest: 2026-01-27T18:04:09+00:00
Abstract
Continual learning has emerged as a pivotal area of research, primarily because it allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. In this study, we address network forgetting by introducing a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations via a Multi-Level Feature Matching Mechanism (MLFMM). Furthermore, we propose an Adaptive Regularization Optimization (ARO) strategy to refine the adaptive weight vectors, which autonomously assess the significance of each feature layer throughout the optimization process. The proposed ARO approach can relieve the over-regularization problem and promote future task learning. We conduct a comprehensive series of experiments, benchmarking our proposed method against several established baselines. The empirical findings indicate that our approach achieves state-of-the-art performance.
中文标题/摘要
标题:通过最优加权最大均值偏差优化框架学习动态表示以应对持续学习中的网络遗忘
持续学习已成为一个关键的研究领域,主要是因为它能够使模型持续获取和保留信息。然而,灾难性遗忘会严重影响模型性能。在本研究中,我们通过引入一种称为最优加权最大均值偏差(OWMMD)的新框架来解决网络遗忘问题,该框架通过多级特征匹配机制(MLFMM)对表示变化施加惩罚。此外,我们提出了一种自适应正则化优化(ARO)策略来细化自适应权重向量,该策略在优化过程中自主评估每一层特征的重要性。提出的ARO方法可以缓解过度正则化问题并促进未来任务的学习。我们进行了全面的实验,将我们提出的方法与几个现有的基线方法进行了比较。实证结果表明,我们的方法达到了最先进的性能。
Summary / 总结
This study addresses catastrophic forgetting in continual learning by proposing a novel framework called Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which uses a Multi-Level Feature Matching Mechanism to penalize representation changes. An Adaptive Regularization Optimization (ARO) strategy is also introduced to refine the adaptive weight vectors that assess the significance of each feature layer, relieving over-regularization and promoting the learning of future tasks. Experimental results indicate that the proposed method achieves state-of-the-art performance against several established baselines.
该研究通过提出一种名为优化加权最大均值偏差(OWMMD)的新框架,解决了持续学习中的灾难性遗忘问题,该框架使用多级特征匹配机制来惩罚表示变化。还引入了一种自适应正则化优化(ARO)策略来细化自适应权重向量,增强模型在学习新任务时不忘记旧任务的能力。实验表明,所提出的方法在性能和稳定性方面优于现有基线方法。
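The idea of penalizing representation changes at several feature levels, each with its own weight, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, the biased MMD estimator, the fixed bandwidth, and the function names are all assumptions, and the ARO procedure that would learn `layer_weights` is not shown.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples of feature vectors."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def weighted_feature_penalty(old_layers, new_layers, layer_weights):
    """Sum of per-layer squared-MMD terms, each scaled by its layer weight.

    old_layers[i] and new_layers[i] hold the features that the previous and
    current model produce at layer i (hypothetical inputs for this sketch).
    """
    return sum(
        w * mmd_squared(old, new)
        for w, old, new in zip(layer_weights, old_layers, new_layers)
    )
```

A layer whose features did not move contributes nothing to the penalty, and a weight of zero switches a layer's term off entirely, which is one way to read how adaptive weights could ease over-regularization.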
EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
Authors: Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu, Muzammal Naseer, Chi-Wing Fu, Pheng-Ann Heng
Venue: ICLR 2026
First: 2026-01-27T17:58:12+00:00 · Latest: 2026-01-27T17:58:12+00:00
Comments: Accepted in ICLR 2026, Codebase: https://github.com/Nicous20/EgoHandICL
Abstract
Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL
中文标题/摘要
标题:EgoHandICL:基于上下文学习的主观视角3D手部重建
由于深度模糊、自遮挡以及复杂的手部-物体交互,主观视角下的稳健3D手部重建具有挑战性。先前的方法通过扩大训练数据或添加辅助提示来缓解这些问题,但它们在未见过的场景中往往表现不佳。我们提出了EgoHandICL,这是首个用于3D手部重建的上下文学习(ICL)框架,它提高了语义对齐、视觉一致性和在具有挑战性的主观视角条件下的鲁棒性。EgoHandICL引入了由视觉语言模型(VLMs)引导的补充示例检索、针对多模态上下文的ICL定制分词器以及基于掩码自编码器(MAE)的架构,该架构通过手部引导的几何和感知目标进行训练。在ARCTIC和EgoExo4D上的实验显示,EgoHandICL在最先进的方法上具有持续的改进。我们还展示了其在现实世界中的泛化能力,并通过使用重建的手部作为视觉提示来改进EgoVLM对手部-物体交互的推理。
HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs
Authors: Jeanne Malécot, Hamed Rahimi, Jeanne Cattoni, Marie Samson, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
First: 2026-01-27T17:45:04+00:00 · Latest: 2026-01-27T17:45:04+00:00
Abstract
Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user-study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.
中文标题/摘要
标题:HARMONI:利用大语言模型实现多用户人机交互的多模态个性化
现有的人机交互系统往往缺乏在多用户环境中持续个性化和动态适应的机制,限制了其在实际部署中的有效性。我们提出了HARMONI,一种利用大语言模型的多模态个性化框架,使社会辅助机器人能够管理长期的多用户交互。该框架整合了四个关键模块:(i) 感知模块,识别活跃说话者并提取多模态输入;(ii) 世界建模模块,维护环境和短期对话上下文的表示;(iii) 用户建模模块,更新长期特定说话者的个人资料;以及(iv) 生成模块,生成上下文相关且伦理导向的响应。通过在四个数据集上的广泛评估和消融研究,以及在养老院环境中基于实际场景的用户研究,我们证明HARMONI支持稳健的说话者识别、在线记忆更新和伦理对齐的个性化,其用户建模准确性、个性化质量和用户满意度均优于基线的大语言模型驱动方法。
Summary / 总结
HARMONI is a multimodal personalization framework for human-robot interactions that uses large language models to enable socially assistive robots to handle long-term multi-user interactions. It includes modules for perception, world modeling, user modeling, and generation. Extensive evaluations show that HARMONI improves speaker identification, memory updating, and personalization quality, surpassing baseline approaches in user modeling accuracy and satisfaction.
HARMONI 是一种多模态个性化框架,用于人类-机器人交互,利用大型语言模型使社会辅助机器人能够处理长期的多用户交互。该框架包含四个模块:感知、世界建模、用户建模和生成。广泛的评估表明,HARMONI 在提高说话人识别、在线记忆更新和伦理个性化方面表现出色,优于基线方法,在用户建模准确性、个性化质量和用户满意度方面表现更佳。
Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Authors: Jinyeop Song, Song Wang, Julian Shun, Yada Zhu
First: 2025-09-30T15:14:24+00:00 · Latest: 2026-01-27T17:44:43+00:00
Comments: Wrong numbers are reported for main results
Abstract
Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.
中文标题/摘要
标题:通过强化学习实现高效且可迁移的代理知识图谱RAG
知识图谱检索增强生成(KG-RAG)将大型语言模型(LLMs)与结构化、可验证的知识图谱(KGs)结合,以减少幻觉并暴露推理痕迹。然而,许多KG-RAG系统组合了多个LLM模块(如规划、推理和响应),增加了推理成本并将其行为绑定到特定的目标KG。为了解决这个问题,我们引入了KG-R1,这是一种通过强化学习(RL)实现的代理KG检索增强生成(KG-RAG)框架。KG-R1 使用一个单一的代理与KGs 交互作为其环境,在每一步中学习检索并将其检索到的信息融入其推理和生成中。该过程通过端到端的RL进行优化。在Knowledge-Graph Question Answering(KGQA)基准测试中的受控实验中,我们的方法展示了高效性和可迁移性:使用Qwen-2.5-3B,KG-R1 以比使用更大基础模型或微调模型的多模块工作流程方法更少的生成标记提高了答案准确性。此外,KG-R1 具有即插即用功能:在训练后,它在新的KG上保持了强大的准确性而无需修改。这些特性使KG-R1 成为实际部署中具有前景的KG-RAG框架。我们的代码可在 https://github.com/Jinyeop3110/KG-R1 公开获取。
Summary / 总结
The research aims to improve the efficiency and transferability of Knowledge-Graph retrieval-augmented generation (KG-RAG) systems by using reinforcement learning (RL) to create a single-agent framework, KG-R1. This framework reduces the need for multiple LLM modules, thereby lowering inference costs and enhancing transferability across different knowledge graphs. Experimental results show that KG-R1, using Qwen-2.5-3B, achieves higher answer accuracy with fewer generation tokens compared to previous multi-module methods, and it can maintain strong accuracy on new knowledge graphs without further training, making it a promising solution for real-world deployment.
研究旨在通过使用强化学习(RL)创建单代理框架KG-R1来提高知识图谱检索增强生成(KG-RAG)系统的效率和可移植性。该框架减少了对多个LLM模块的需求,并通过端到端的RL优化了过程。实验结果表明,与使用更大模型的先前多模块方法相比,KG-R1使用更少的生成令牌提高了答案准确性,并且可以在无需重新训练的情况下轻松适应新的知识图谱,展示了其高效性和可移植性。
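The single-agent loop described above, in which one policy decides at every step whether to retrieve from the KG or to answer, can be sketched roughly as below. Everything here is a toy assumption: `ToyKG`, `run_episode`, and `scripted_policy` are hypothetical names, the action format is invented for illustration, and the hand-written policy stands in for the RL-trained LLM, whose training is not shown.

```python
class ToyKG:
    """A tiny in-memory knowledge graph that plays the role of the environment."""

    def __init__(self, triples):
        self.index = {}
        for head, relation, tail in triples:
            self.index.setdefault((head, relation), []).append(tail)

    def retrieve(self, head, relation):
        """Return all tails linked to `head` by `relation` (empty list if none)."""
        return list(self.index.get((head, relation), []))

def run_episode(kg, policy, question, max_steps=4):
    """Alternate policy decisions and KG retrievals until the policy answers."""
    context = []
    for _ in range(max_steps):
        action = policy(question, context)
        if action[0] == "answer":
            return action[1], context
        _, head, relation = action
        context.append((head, relation, kg.retrieve(head, relation)))
    return None, context  # step budget exhausted without an answer

def scripted_policy(question, context):
    """Hand-written stand-in for the learned policy; assumes at least one relation."""
    start, relations = question
    if context and not context[-1][2]:
        return ("answer", [])  # dead end: the last retrieval returned nothing
    if len(context) == len(relations):
        return ("answer", context[-1][2])
    head = start if not context else context[-1][2][0]
    return ("retrieve", head, relations[len(context)])
```

Because the policy only sees the question and the retrieved context, the same `scripted_policy` runs unchanged against a different `ToyKG` instance, which mirrors the plug-and-play property in spirit.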
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
Authors: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long
First: 2026-01-27T17:40:07+00:00 · Latest: 2026-01-27T17:40:07+00:00
Comments: Project page: https://thuml.github.io/Reasoning-Visual-World
Abstract
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.
中文标题/摘要
标题:视觉生成解锁多模态世界模型中的类人推理
人类构建内部世界模型并通过操作这些模型中的概念进行推理。近年来,特别是在链式思考(CoT)推理方面取得的AI进展,近似了人类的认知能力,其中世界模型被认为嵌入在大型语言模型中。当前系统在数学和编程等正式和抽象领域中已通过主要依赖于语言推理实现了专家级表现。然而,在物理和空间智能等领域,它们仍然远远落后于人类,这些领域需要更丰富的表示和先验知识。因此,能够同时进行语言和视觉生成的统一多模态模型(UMMs)的出现引发了对基于互补多模态路径的更类人推理的兴趣,尽管其优势尚不明确。从世界模型的角度来看,本文首次系统研究了视觉生成何时以及如何促进推理。我们的核心观点是视觉优越性假设:对于某些任务——特别是那些基于物理世界的任务——视觉生成更自然地充当世界模型,而纯粹的语言世界模型则会遇到由于表示限制或缺乏先验知识而产生的瓶颈。理论上,我们将内部世界建模作为CoT推理的核心组成部分进行形式化,并分析不同形式世界模型之间的区别。实验上,我们确定了需要交错进行视觉-语言CoT推理的任务,构建了一个新的评估套件VisWorld-Eval。在最先进的UMM上的受控实验表明,交错CoT在有利于视觉世界建模的任务中显著优于纯粹的语言CoT,但在其他情况下没有明显优势。综上所述,这项工作阐明了多模态世界建模在更强大、更类人的多模态AI中的潜力。
Summary / 总结
This paper explores how visual generation enhances reasoning capabilities in AI systems, particularly in tasks grounded in the physical world. It introduces the visual superiority hypothesis, suggesting that visual generation is more effective for certain tasks compared to purely verbal reasoning. The study develops a new evaluation suite, VisWorld-Eval, and demonstrates that interleaved visual-verbal chain-of-thought reasoning significantly outperforms purely verbal reasoning on tasks requiring visual world modeling, while offering no clear advantage in other tasks. This work highlights the potential of multimodal world modeling for more human-like AI reasoning.
本文探讨了视觉生成如何增强多模态世界模型中的推理能力,解决了纯粹语言推理在需要物理和空间智能的领域中的局限性。研究提出了视觉优越性假设,认为视觉生成对于与物理世界相关的任务更为有效。通过开发一个新的评估套件VisWorld-Eval,作者发现,在有利于视觉世界建模的任务上,结合视觉和语言的链式思考显著优于纯粹语言的链式思考,而在其他任务上则没有明显优势。
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
First: 2026-01-27T17:35:05+00:00 · Latest: 2026-01-27T17:35:05+00:00
Comments: 27 pages, 15 figures
Abstract
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
中文标题/摘要
标题:当迭代RAG超越理想证据时:科学多跳问答中的诊断研究
检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但尚不清楚何时迭代检索-推理循环在意义上优于静态RAG,特别是在具有多跳推理、稀疏领域知识和异构证据的科学领域。我们提供了第一个受控的、机制层面的诊断研究,探讨同步迭代检索和推理是否能超越理想化的静态上限(黄金上下文)RAG。我们基于三种模式基准测试了十一个最先进的LLMs:(i)无上下文,衡量对参数化记忆的依赖;(ii)黄金上下文,所有先验证据一次性提供;(iii)迭代RAG,一个无需训练的控制器,交替进行检索、假设精炼和证据驱动的停止。使用化学重点的ChemKGMultiHopQA数据集,我们隔离了需要真正检索的问题,并通过检索覆盖率差距、锚点携带丢失、查询质量、组合保真度和控制校准等诊断分析了行为。在所有模型中,迭代RAG始终优于黄金上下文,增幅高达25.6个百分点,尤其是对于非推理微调模型。分阶段检索减少了晚期跳失败,缓解了上下文过载,并允许动态纠正早期假设漂移,但剩余的失败模式包括不完整的跳覆盖、干扰物锁定轨迹、早期停止校准不当以及即使在完美检索的情况下也有较高的组合失败率。总体而言,分阶段检索往往比理想证据的存在本身更具影响力;我们提供了在专门的科学环境中部署和诊断RAG系统的实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。
Summary / 总结
This study investigates when iterative retrieval-reasoning in RAG outperforms static RAG, especially in scientific domains. Using the ChemKGMultiHopQA dataset, eleven state-of-the-art LLMs were benchmarked under three regimes: no context, gold context, and iterative RAG. Iterative RAG consistently outperformed the gold context, with gains up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Staged retrieval reduced late-hop failures and context overload but still faced issues like incomplete hop coverage and early stopping miscalibration.
该研究探讨了在科学领域需要多跳推理时,迭代检索增强生成(RAG)何时优于静态RAG。使用ChemKGMultiHopQA数据集,研究比较了十一个最先进的LLM在不同条件下的表现:无上下文、黄金上下文和迭代RAG。迭代RAG在所有模型中都优于黄金上下文,增幅最高可达25.6个百分点,特别是对于非推理微调模型。分阶段检索减少了晚期跳失败和上下文过载,但仍面临诸如不完整跳覆盖和早期停止校准不当等问题。
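The Iterative RAG regime, which alternates retrieval, hypothesis refinement, and evidence-aware stopping, might look like the sketch below. It is a toy under stated assumptions rather than the paper's controller: the dictionary `corpus` stands in for a retriever, `hop_templates` is a hypothetical way to express a multi-hop question, and an LLM would normally write the queries and judge when to stop.

```python
def iterative_rag(hop_templates, corpus, max_rounds=5):
    """Retrieve one hop at a time, refining the hypothesis after each hop.

    hop_templates: one query string per hop; "{}" is filled with the current
        hypothesis (the first template needs no placeholder).
    corpus: dict mapping a query string to a retrieved fact.
    """
    evidence = []
    hypothesis = None
    for _ in range(max_rounds):
        if len(evidence) == len(hop_templates):
            break  # evidence-aware stopping: every hop is covered
        query = hop_templates[len(evidence)].format(hypothesis)
        fact = corpus.get(query)
        if fact is None:
            break  # retrieval coverage gap: no evidence for this hop
        evidence.append((query, fact))
        hypothesis = fact  # hypothesis refinement feeds the next query
    complete = len(evidence) == len(hop_templates)
    return (hypothesis if complete else None), evidence
```

A Gold Context baseline would instead hand every fact to the model in a single call; the staged version only ever conditions the next query on the hypothesis established so far.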
APEX-Agents
Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski
First: 2026-01-20T18:53:44+00:00 · Latest: 2026-01-27T17:31:16+00:00
Abstract
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.
中文标题/摘要
标题:APEX-Agents
我们介绍了代理人工智能生产力指数(APEX-Agents),这是一个基准,用于评估AI代理是否能够执行由投资银行分析师、管理咨询顾问和公司律师创建的长期跨应用任务。APEX-Agents 要求代理在包含文件和工具的现实工作环境中导航。我们使用 Pass@1 测试了八种代理以确定排行榜。Gemini 3 Flash(思考=高)获得最高分为 24.0%,其次是 GPT-5.2(思考=高)、Claude Opus 4.5(思考=高)和 Gemini 3 Pro(思考=高)。我们开源了包含 480 个提示、评分标准、黄金输出、文件和元数据的 APEX-Agents 基准。我们还开源了我们的代理执行和评估基础设施 Archipelago。
Summary / 总结
The study introduces APEX-Agents, a benchmark that evaluates whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. Agents must navigate realistic work environments with files and tools. Eight agents were tested, with Gemini 3 Flash achieving the highest Pass@1 score of 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. The benchmark (n=480) is open-sourced with all prompts, rubrics, gold outputs, files, and metadata, along with Archipelago, the infrastructure for agent execution and evaluation.
研究引入了APEX-Agents基准,用于评估AI代理执行投资银行、管理咨询和公司律师等专业人士所进行的长期跨应用任务的能力。基准测试涉及在包含文件和工具的现实工作环境中导航。八种代理被测试,Gemini 3 Flash以24.0%的Pass@1得分最高,其次是GPT-5.2、Claude Opus 4.5和Gemini 3 Pro。数据集包括480个任务,包含提示、评分标准、黄金输出和元数据,并且是开源的,同时开源了用于评估的基础设施Archipelago。
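For readers unfamiliar with the metric, Pass@1 can be computed as in this minimal sketch. It assumes one attempt per task and a boolean pass flag per task; the source does not say how a pass is derived from the rubrics or whether scores are averaged over repeated runs, and the function name is hypothetical.

```python
def pass_at_1(task_results):
    """Fraction of tasks whose single attempt met its pass criterion.

    task_results: iterable of booleans, one per task (True means passed).
    """
    results = list(task_results)
    if not results:
        raise ValueError("need at least one task result")
    return sum(1 for passed in results if passed) / len(results)
```

For example, `pass_at_1([True, False, False, True])` evaluates to `0.5`.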
Routing End User Queries to Enterprise Databases
Authors: Saikrishna Sudarshan, Tanay Kulkarni, Manasi Patwardhan, Lovekesh Vig, Ashwin Srinivasan, Tanmay Tulsidas Verlekar
First: 2026-01-27T17:30:19+00:00 · Latest: 2026-01-27T17:30:19+00:00
Comments: 6 pages, 2 figures
Abstract
We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven reranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.
中文标题/摘要
标题:将用户查询路由至企业数据库
我们解决了在多数据库企业环境中路由自然语言查询的任务。我们通过扩展现有的NL-to-SQL数据集构建了现实基准。研究表明,随着数据库仓库规模的增大和领域重叠以及查询的模糊性,路由变得越来越具有挑战性,这促使需要更结构化和稳健的基于推理的解决方案。通过明确建模模式覆盖、结构连接性和细粒度语义对齐,所提出的模块化、基于推理的重排序策略在所有指标上都优于仅基于嵌入和直接LLM提示的基线。
Summary / 总结
The paper addresses the challenge of routing natural language queries to appropriate enterprise databases. It constructs realistic benchmarks by extending existing datasets and demonstrates that routing becomes more difficult with larger, domain-overlapping databases and ambiguous queries. The authors propose a modular, reasoning-driven reranking strategy that explicitly models schema coverage, structural connectivity, and fine-grained semantic alignment, which outperforms embedding-only and direct LLM-prompting baselines across various metrics.
论文解决了将自然语言查询路由到适当的企业数据库的挑战。通过扩展现有数据集构建了现实基准,并展示了当数据库更大、领域重叠以及查询更模糊时,路由变得更为困难。作者提出了一种模块化、基于推理的重排序策略,该策略考虑了模式覆盖、结构连接性和细粒度语义对齐,该方法在各种指标上优于仅基于嵌入和直接LLM提示的方法。
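Two of the signals named above, schema coverage and structural connectivity, can be illustrated with a small reranker. This is a sketch under assumptions, not the paper's method: the schema dictionary layout, the exact-match term lookup, the equal weights, and all function names are hypothetical, and the fine-grained semantic alignment component (which would need an embedding model or LLM) is left out.

```python
from itertools import combinations

def matched_tables(query_terms, schema):
    """Tables whose name or any column name appears among the query terms."""
    terms = set(query_terms)
    return {t for t, cols in schema["tables"].items() if t in terms or terms & set(cols)}

def schema_coverage(query_terms, schema):
    """Fraction of query terms found among the schema's table and column names."""
    names = set(schema["tables"])
    for cols in schema["tables"].values():
        names.update(cols)
    terms = set(query_terms)
    return len(terms & names) / len(terms) if terms else 0.0

def structural_connectivity(query_terms, schema):
    """Fraction of matched-table pairs directly linked by a foreign key."""
    tables = sorted(matched_tables(query_terms, schema))
    if len(tables) < 2:
        return float(len(tables))  # 1.0 for a single matched table, 0.0 for none
    links = {frozenset(fk) for fk in schema["foreign_keys"]}
    pairs = list(combinations(tables, 2))
    return sum(frozenset(p) in links for p in pairs) / len(pairs)

def rerank(query_terms, candidates, w_cov=0.5, w_conn=0.5):
    """Sort candidate database names by a weighted sum of the two signals."""
    def score(name):
        schema = candidates[name]
        return (w_cov * schema_coverage(query_terms, schema)
                + w_conn * structural_connectivity(query_terms, schema))
    return sorted(candidates, key=score, reverse=True)
```

Keeping each signal in its own function makes the ranking inspectable: when two databases overlap in domain, one can see whether coverage or connectivity broke the tie.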
Assessing the Effectiveness of Deep Embeddings for Tree Species Classification in the Dutch Forest Inventory
Authors: Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm
First: 2025-08-26T09:06:14+00:00 · Latest: 2026-01-27T17:25:21+00:00
Abstract
National Forest Inventories (NFIs) serve as the primary source of forest information; however, maintaining these inventories requires labor-intensive on-site campaigns by forestry experts to identify and document tree species. Embeddings from deep pre-trained remote sensing models offer new opportunities to update NFIs more frequently and at larger scales. While training new deep learning models on few data points remains challenging, we show that using pre-computed embeddings can prove effective for distinguishing tree species through seasonal canopy reflectance patterns in combination with Random Forest. This work systematically investigates how deep embeddings improve tree species classification accuracy in the Netherlands with few annotated data. We evaluate this question on three embedding models: Presto, Alpha Earth, and Tessera, using three tree species datasets of varying difficulty. Data-wise, we compare the available embeddings from Alpha Earth and Tessera with dynamically calculated embeddings from a pre-trained Presto model. Our results demonstrate that fine-tuning a publicly available remote sensing time series pre-trained model outperforms the current state-of-the-art in NFI classification in the Netherlands, yielding performance gains of approximately 2-9 percentage points across datasets and evaluation metrics. This indicates that classic hand-defined features are too simple for this task and highlights the potential of using deep embeddings for data-limited applications such as NFI classification. By leveraging openly available satellite data and deep embeddings from pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.
中文标题/摘要
标题:评估深度嵌入在荷兰森林资源清查中树种分类有效性
国家森林清查是主要的森林信息来源,然而,维护这些清查需要林业专家进行劳动密集型的现场活动来识别和记录树种。来自深度预训练遥感模型的嵌入为更频繁和更大规模地更新NFIs提供了新机会。虽然在少量数据点上训练新的深度学习模型仍然具有挑战性,但我们展示了使用预计算嵌入通过季节性冠层反射模式与随机森林结合区分树种的有效性。本研究系统地探讨了在荷兰使用深度嵌入如何提高树种分类准确性,尤其是在少量标注数据的情况下。我们使用三种嵌入模型:Presto、Alpha Earth和Tessera,以及三种不同难度的树种数据集来评估这个问题。数据上,我们将Alpha Earth和Tessera提供的可用嵌入与使用预训练Presto模型动态计算的嵌入进行了比较。我们的结果表明,微调一个公开可用的遥感时间序列预训练模型在荷兰的NFI分类中优于当前最先进的技术,各数据集和评估指标的性能提升约为2-9个百分点。这表明经典的手动定义特征过于简单,突显了使用深度嵌入在数据受限应用如NFI分类中的潜力。通过利用公开可用的卫星数据和预训练模型的深度嵌入,这种方法在分类准确性上显著优于传统方法,并且可以有效补充现有的森林清查过程。
Summary / 总结
This study evaluates the effectiveness of deep embeddings for tree species classification in the Dutch Forest Inventory, using seasonal canopy reflectance patterns and Random Forest. Three embedding models—Presto, Alpha Earth, and Tessera—were tested on three datasets of varying difficulty. The study found that fine-tuning a publicly available remote sensing time series pre-trained model outperformed existing methods, achieving performance gains of approximately 2-9 percentage points. This suggests that deep embeddings can enhance classification accuracy in data-limited scenarios, such as NFI updates, by leveraging satellite data and pre-trained models.
该研究评估了深度嵌入模型在荷兰森林资源清查中识别树种的有效性,使用预训练的遥感模型和随机森林。测试了三种嵌入模型——Presto、Alpha Earth 和 Tessera,以及三个难度不同的数据集。结果显示,微调一个公开可用的遥感时间序列预训练模型优于现有方法,性能提升幅度为2-9个百分点。这表明深度嵌入模型可以显著提高分类准确性,特别是在数据有限的应用场景如森林资源清查中。
Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering
Authors: Kun Li, Michael Ying Yang, Sami Sebastian Brandt
First: 2026-01-27T17:24:32+00:00 · Latest: 2026-01-27T17:24:32+00:00
Abstract
Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio-visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.
中文标题/摘要
标题:基于查询引导的空间-时间-频率交互的音乐音视频问答
音视频问答(AVQA)是一项具有挑战性的多模态任务,需要在给定的视频中联合推理音频、视觉和文本信息以回答自然语言问题。受视频问答(Video QA)近期进展的启发,许多现有的AVQA方法主要集中在视觉信息处理上,利用预训练模型提取对象级和运动级表示。然而,在这些方法中,音频输入主要被视为视频分析的补充,文本问题信息对音视频理解的贡献很少,通常仅在推理的最后阶段进行整合。为了解决这些局限性,我们提出了一种新颖的查询引导的空间-时间-频率(QSTar)交互方法,该方法有效地结合了问题引导的线索,并利用音频信号的独特频域特征,以及空间和时间感知,以增强音视频理解。此外,我们引入了一个灵感来源于提示的查询上下文推理(QCR)模块,该模块引导模型更精确地关注语义相关的音频和视觉特征。在多个AVQA基准上的广泛实验表明,我们提出的方法具有显著的效果,相对于现有的音频问答(Audio QA)、视觉问答(Visual QA)、视频问答(Video QA)和AVQA方法,实现了显著的性能提升。代码和预训练模型将在发表后发布。
Summary / 总结
This paper addresses the limitations of existing AVQA methods by proposing a Query-guided Spatial-Temporal-Frequency (QSTar) interaction method. The method incorporates question-guided clues and utilizes the frequency-domain characteristics of audio signals, along with spatial and temporal perception, to enhance audio-visual understanding. The authors introduce a Query Context Reasoning (QCR) block to guide the model to focus on semantically relevant features. Experiments on several AVQA benchmarks show that the proposed method outperforms existing approaches in terms of performance.
论文针对现有AVQA方法主要侧重于视觉信息而忽视了音频和文本信息的问题,提出了一种QSTar方法,该方法结合了问题引导的线索,并利用音频信号的频域特征以及空间和时间感知来增强音频-视觉理解。QCR模块引导模型关注语义相关的特征。在AVQA基准上的实验显示,该方法显著优于现有方法。
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
First: 2025-02-20T18:01:41+00:00 · Latest: 2026-01-27T17:16:10+00:00
Comments: In Proceedings of the IJCNLP-AACL 2025
Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
中文标题/摘要
标题:ReVision:一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM
随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口,高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型(VLMs)支持多模态交互,通常依赖于基于云的处理,这引发了(1)视觉隐私问题,即传输敏感的视觉数据到服务器,以及(2)其有限的实时、设备端可用性。本文探讨了视觉指令重写这一新颖的方法,即将多模态指令转换为纯文本命令,允许轻量级的设备端指令重写VLM(参数量250M)与现有的对话AI系统无缝集成,增强视觉数据隐私。为此,我们提供了一个涵盖14个领域的超过39,000个示例的数据集,并开发了一个紧凑的VLM,该模型在图像描述数据集上进行预训练,并针对指令重写进行了微调。实验结果通过自然语言生成指标(如BLEU、METEOR和ROUGE)以及语义解析分析评估,表明即使是最小量化版本的模型(存储占用<500MB)也能实现有效的指令重写,从而实现以隐私为重点的多模态AI应用。
Summary / 总结
This paper addresses the need for efficient and privacy-preserving multimodal interaction by introducing ReVision, a dataset and baseline vision-language model for visual instruction rewriting. The model transforms multimodal instructions into text-only commands, enhancing privacy and on-device usability. Key experimental results show that even a quantized version of the model can effectively rewrite instructions, achieving good performance on NLG metrics and semantic parsing analysis.
该论文通过引入ReVision数据集和视觉指令重写基准模型,解决了高效且隐私保护的多模态交互需求。该模型将视觉指令转换为纯文本命令,增强隐私性和设备端使用性。实验结果显示,即使量化后的模型也能有效重写指令,并在自然语言生成指标上表现良好,从而支持隐私导向的多模态AI应用。
Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals
Authors: Octavio Pappalardo
Venue: ICLR 2026
First: 2026-01-27T17:10:29+00:00 · Latest: 2026-01-27T17:10:29+00:00
Comments: To appear at ICLR 2026
Abstract
Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.
中文标题/摘要
标题:无监督学习高效探索:通过自我设定目标进行先验训练
无监督预训练可以为强化学习代理提供先验知识,加速下游任务的学习。基于人类发展的研究方向表明,代理可以通过设定并追求自己的目标来学习。核心挑战在于如何有效地生成、选择和从这些目标中学习。我们的重点是在解决每个任务零样本不可行的广泛下游任务分布中。当目标任务位于预训练分布之外或代理对其身份一无所知时,这种设置自然会出现。在本文中,我们(i) 在元学习框架内优化多集高效探索和适应,(ii) 使用代理适应后性能的不断演变估计来指导训练课程。我们提出了ULEE,一种结合上下文学习者和对抗性目标生成策略的无监督元学习方法,保持训练在代理能力的前沿。在XLand-MiniGrid基准测试中,ULEE预训练提高了探索和适应能力,能够泛化到新的目标、环境动力学和地图结构。生成的策略在零样本和少量样本性能上有所提高,并为更长时间的微调过程提供了强大的初始化。它优于从头开始学习、DIAYN预训练和替代课程。
Summary / 总结
This work addresses the challenge of generating and learning from self-imposed goals to enhance exploration and adaptation in reinforcement learning. The method, ULEE, optimizes for efficient multi-episode exploration within a meta-learning framework and uses evolving performance estimates to guide training. On XLand-MiniGrid benchmarks, ULEE pre-training improves exploration and adaptation, leading to better zero-shot and few-shot performance compared to learning from scratch and other pre-training methods.
该论文提出了ULEE,一种无监督元学习方法,用于预训练强化学习代理设置和追求自己的目标,优化多回合探索和适应的效率。该方法使用上下文学习器和对抗性目标生成策略,保持训练在代理能力的前沿。实验表明,ULEE在XLand-MiniGrid基准测试中提高了探索和适应能力,实现了更好的零样本和少量样本性能,并为更长时间的微调过程提供了强大的初始化,优于从头学习、DIAYN预训练和其他方法。
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Authors: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li
First: 2026-01-27T17:01:16+00:00 · Latest: 2026-01-27T17:01:16+00:00
Abstract
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
中文标题/摘要
标题:Youtu-VL:通过统一的视觉语言监督释放视觉潜力
尽管视觉语言模型(VLMs)取得了显著进展,但当前架构在保留细微视觉信息方面仍存在局限性,导致粗粒度的多模态理解。我们将其归因于现有VLMs中固有的次优训练范式,这种范式表现出以文本为主导的优化偏见,将视觉信号仅视为被动的条件输入而非监督目标。为解决这一问题,我们提出了Youtu-VL框架,该框架利用视觉语言统一自回归监督(VLUAS)范式,从根本上将优化目标从“视觉作为输入”转变为“视觉作为目标”。通过直接将视觉标记集成到预测流中,Youtu-VL 对视觉细节和语言内容应用统一的自回归监督。此外,我们还将这一范式扩展到视觉中心任务,使标准VLM能够在无需特定任务添加的情况下执行视觉中心任务。广泛的实证评估表明,Youtu-VL 在通用多模态任务和视觉中心任务上均取得了竞争力的表现,为全面通用视觉代理的发展奠定了坚实基础。
Summary / 总结
The research aims to improve the fine-grained visual information retention in Vision-Language Models (VLMs) by introducing Youtu-VL, which uses a Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. This method shifts the optimization objective from treating vision as input to treating it as a target, integrating visual tokens into the prediction stream. The study shows that Youtu-VL performs competitively on both general multimodal tasks and vision-centric tasks, providing a strong foundation for developing comprehensive visual agents.
研究旨在通过引入使用视觉-语言统一自回归监督(VLUAS)范式的Youtu-VL,来改善视觉语言模型(VLM)中的细粒度视觉信息保留。该方法将优化目标从将视觉视为输入转变为将其视为目标,将视觉标记直接集成到预测流中。研究显示,Youtu-VL 在通用多模态任务和视觉中心任务上均表现出色,为开发全面的视觉代理奠定了坚实的基础。
Diffusion for De-Occlusion: Accessory-Aware Diffusion Inpainting for Robust Ear Biometric Recognition
Authors: Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
First: 2026-01-27T16:55:35+00:00 · Latest: 2026-01-27T16:55:35+00:00
Abstract
Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.
中文标题/摘要
标题:去遮挡的扩散:配件感知的扩散修复以增强稳健的耳部生物识别
耳部遮挡(由于存在耳饰如耳环和耳机)会负面影响基于耳部的生物识别系统的表现,尤其是在不受约束的成像环境中。在本研究中,我们评估了一种基于扩散的耳部修复技术作为预处理辅助手段,以减轻基于Transformer的耳部识别系统中耳部配件遮挡的问题。给定输入耳部图像和自动提取的配件掩码,修复模型通过合成缺失像素来重建干净且解剖上合理的耳部区域,同时保持关键耳部结构(包括耳轮、对耳轮、耳甲腔和耳垂)的局部几何一致性。我们评估了这种预处理辅助手段在几种视觉Transformer模型和不同图像块大小下的有效性,针对一系列基准数据集。实验表明,基于扩散的修复可以作为一种有用的预处理辅助手段,以减轻耳部配件遮挡,从而提高整体识别性能。
Summary / 总结
The study addresses the challenge of ear occlusions caused by accessories like earrings and earphones, which can degrade the performance of ear-based biometric recognition systems. It introduces a diffusion-based inpainting technique that uses an accessory mask to reconstruct clean ear regions, preserving local geometric coherence. Experiments demonstrate that this pre-processing method enhances recognition performance across various transformer-based models and datasets.
该研究解决了耳饰导致的耳部遮挡问题,这类遮挡会影响基于耳朵的生物识别系统。研究评估了一种基于扩散的修复技术,该技术通过合成缺失的像素来重建干净的耳部区域,同时保持局部几何连贯性。该方法使用自动提取的耳饰掩码,并在不同的Transformer模型和不同的图像块尺寸下对多个基准数据集进行了测试,显示出通过减轻耳饰遮挡对识别性能的改善。
Component-Aware Pruning Framework for Neural Network Controllers via Gradient-Based Importance Estimation
Authors: Ganesh Sundaram, Jonas Ulmen, Daniel Görges
First: 2026-01-27T16:53:19+00:00 · Latest: 2026-01-27T16:53:19+00:00
Comments: 8 pages, Submitted to the 2026 IFAC World Congress
Abstract
The transition from monolithic to multi-component neural architectures in advanced neural network controllers poses substantial challenges due to the high computational complexity of the latter. Conventional model compression techniques for complexity reduction, such as structured pruning based on norm-based metrics to estimate the relative importance of distinct parameter groups, often fail to capture functional significance. This paper introduces a component-aware pruning framework that utilizes gradient information to compute three distinct importance metrics during training: Gradient Accumulation, Fisher Information, and Bayesian Uncertainty. Experimental results with an autoencoder and a TD-MPC agent demonstrate that the proposed framework reveals critical structural dependencies and dynamic shifts in importance that static heuristics often miss, supporting more informed compression decisions.
中文标题/摘要
标题:基于梯度基重要性估计的组件感知神经网络控制器剪枝框架
从单一组件到多组件神经架构的转变在高级神经网络控制器中带来了巨大挑战,因为后者具有较高的计算复杂性。传统的基于范数度量的结构剪枝等模型压缩技术常用于复杂度降低,但往往无法捕捉功能重要性。本文提出了一种组件感知剪枝框架,利用梯度信息在训练过程中计算三种不同的重要性度量:梯度累积、费舍尔信息和贝叶斯不确定性。实验结果表明,所提出的框架揭示了关键的结构依赖性和重要性的动态变化,而静态启发式方法往往无法捕捉到这些变化,从而支持更明智的压缩决策。
Summary / 总结
This paper addresses the challenge of pruning multi-component neural architectures in advanced neural network controllers, which are more computationally complex than monolithic architectures. It proposes a component-aware pruning framework that uses gradient information to estimate the importance of parameters during training. The framework calculates three metrics: Gradient Accumulation, Fisher Information, and Bayesian Uncertainty. Experiments with an autoencoder and a TD-MPC agent show that this approach uncovers important structural dependencies and dynamic shifts in importance that static methods often overlook, leading to more effective compression decisions.
本文解决了在先进神经网络控制器中修剪多组件神经架构的挑战,这些架构比单一组件架构更具计算复杂性。该文提出了一种基于梯度信息的组件感知剪枝框架,用于训练期间估计参数的重要性,具体通过梯度累积、费舍尔信息和贝叶斯不确定性指标。实验表明,该框架可以揭示静态启发式方法可能忽略的重要结构依赖性和动态重要性变化,从而实现更有效的压缩决策。
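As a rough illustration of two of the three metrics named in the entry above, the sketch below accumulates per-component gradient magnitude (Gradient Accumulation) and mean squared gradient (a common empirical stand-in for diagonal Fisher Information) over training steps. The function name, the dict-of-lists gradient format, and the averaging scheme are assumptions made for this example, not the authors' implementation; Bayesian Uncertainty is omitted because the entry does not say how it is computed.

```python
from collections import defaultdict


def accumulate_importance(grad_stream):
    """Average two gradient-based importance scores per component.

    grad_stream: iterable of dicts {component_name: [gradient values]},
    one dict per training step (a hypothetical format for this sketch).
    Returns (grad_accumulation, empirical_fisher), both dicts keyed by
    component name.
    """
    grad_acc = defaultdict(float)  # running sum of mean |g|
    fisher = defaultdict(float)    # running sum of mean g^2
    steps = 0
    for grads in grad_stream:
        steps += 1
        for name, g in grads.items():
            grad_acc[name] += sum(abs(x) for x in g) / len(g)
            fisher[name] += sum(x * x for x in g) / len(g)
    # Normalise by step count so scores are comparable across run lengths.
    return (
        {k: v / steps for k, v in grad_acc.items()},
        {k: v / steps for k, v in fisher.items()},
    )


# Toy gradients for two hypothetical components over two steps.
stream = [
    {"encoder": [1.0, -1.0], "decoder": [0.1, 0.1]},
    {"encoder": [2.0, 0.0], "decoder": [0.0, 0.2]},
]
acc, fis = accumulate_importance(stream)
ranking = sorted(fis, key=fis.get, reverse=True)
```

Tracking the scores per step, rather than only the final average, would be one way to observe the dynamic shifts in importance that the entry mentions.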
CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing
Authors: Shanyv Liu, Xuyang Yuan, Tao Chen, Zijun Zhan, Zhu Han, Danyang Zheng, Weishan Zhang, Shaohua Cao
First: 2026-01-27T16:52:47+00:00 · Latest: 2026-01-27T16:52:47+00:00
Abstract
Graph-based Multi-Agent Systems (MAS) enable complex cyclic workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub-tasks. We propose CASTER (Context-Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection in graph-based MAS. CASTER employs a Dual-Signal Router that combines semantic embeddings with structural meta-features to estimate task difficulty. During training, the router self-optimizes through a Cold Start to Iterative Evolution paradigm, learning from its own routing failures via on-policy negative feedback. Experiments using LLM-as-a-Judge evaluation across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity demonstrate that CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while matching their success rates, and consistently outperforms both heuristic routing and FrugalGPT across all domains.
中文标题/摘要
标题:CASTER:通过上下文感知策略实现任务高效路由以打破多代理编排的成本-性能障碍
基于图的多代理系统(MAS)能够实现复杂的循环工作流,但静态模型分配效率低下,导致部署强大模型时在简单子任务上浪费计算资源。我们提出了一种轻量级路由器CASTER(上下文感知策略),用于基于图的MAS中的动态模型选择。CASTER采用了一种双信号路由器,结合语义嵌入和结构元特征来估计任务难度。在训练过程中,路由器通过冷启动到迭代进化的方式自我优化,通过基于策略的负反馈从自身的路由失败中学习。使用LLM作为裁判在软件工程、数据分析、科学发现和网络安全领域的实验表明,与强大的模型基线相比,CASTER将推理成本降低了高达72.4%,同时保持了相同的成功率,并且在所有领域中始终优于启发式路由和FrugalGPT。
Summary / 总结
The research aims to improve the efficiency of multi-agent systems by addressing the issue of inefficient static model allocation. CASTER, a context-aware strategy for task efficient routing, is proposed to dynamically select models based on semantic embeddings and structural meta-features. Experiments show that CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while maintaining success rates, and outperforms heuristic routing and FrugalGPT across various domains including Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity.
研究旨在通过解决静态模型分配效率低的问题来提高多智能体系统的效率。CASTER,一种基于上下文的任务高效路由策略,使用双信号路由器结合语义嵌入和结构元特征来估计任务难度。实验表明,与强模型基线相比,CASTER将推理成本最多减少72.4%,同时保持成功率,并在软件工程、数据分析、科学发现和网络安全等多个领域优于启发式路由和FrugalGPT。
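The dual-signal idea in the CASTER entry above (semantic embedding plus structural meta-features feeding a difficulty estimate) might be sketched as follows. Everything concrete here is an assumption for illustration: the logistic scoring form, the weight vectors, the 0.5 threshold, and the "weak"/"strong" labels are hypothetical and do not reproduce the paper's router or its training procedure.

```python
import math


def estimate_difficulty(embedding, meta, w_sem, w_struct, bias=0.0):
    """Logistic difficulty score from two signals (hypothetical form).

    embedding: semantic embedding of the sub-task text.
    meta: structural meta-features (e.g. graph depth, retry count).
    """
    z = bias
    z += sum(w * x for w, x in zip(w_sem, embedding))
    z += sum(w * x for w, x in zip(w_struct, meta))
    return 1.0 / (1.0 + math.exp(-z))


def route(embedding, meta, w_sem, w_struct, bias=0.0, threshold=0.5):
    """Send hard sub-tasks to the strong model, easy ones to the weak model."""
    difficulty = estimate_difficulty(embedding, meta, w_sem, w_struct, bias)
    return ("strong" if difficulty >= threshold else "weak"), difficulty


# Illustrative weights; a trained router would learn these.
W_SEM, W_STRUCT = [1.0, 1.0], [2.0]
hard_choice, hard_score = route([0.1, 0.1], [1.0], W_SEM, W_STRUCT)
easy_choice, easy_score = route([-1.0, -1.0], [0.0], W_SEM, W_STRUCT)
```

In a setup like this, the inference savings would come from how often the cheaper branch is taken without lowering the success rate, which is the trade-off the entry reports on.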
GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance
Authors: Haozhi Zhu, Miaomiao Zhao, Dingyao Liu, Runze Tian, Yan Zhang, Jie Guo, Fenggen Yu
First: 2026-01-27T16:47:35+00:00 · Latest: 2026-01-27T16:47:35+00:00
Abstract
3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
中文标题/摘要
标题:GeoDiff3D:基于几何约束的2D扩散指导自监督3D场景生成
3D场景生成是游戏、电影/特效和VR/AR的核心技术。不断增长的快速迭代需求、高保真细节和易于访问的内容创作进一步增加了对该领域的兴趣。现有方法大致遵循两种范式——间接的2D到3D重建和直接的3D生成,但两者都受限于薄弱的结构建模和对大规模真实监督的强烈依赖,经常产生结构伪影、几何不一致和复杂场景中的高频细节退化。我们提出了GeoDiff3D,这是一种高效的自监督框架,使用粗略的几何结构作为结构锚点,并通过几何约束的2D扩散模型提供纹理丰富的参考图像。重要的是,GeoDiff3D 不需要扩散生成的参考严格多视角一致性,并且对由此产生的嘈杂、不一致的指导具有鲁棒性。我们进一步引入了体素对齐的3D特征聚合和双重自监督,以保持场景连贯性和精细细节,同时大幅减少对标注数据的依赖。GeoDiff3D 也以较低的计算成本进行训练,并能够快速生成高质量的3D场景。在具有挑战性的场景上的广泛实验表明,GeoDiff3D 在泛化能力和生成质量上优于现有基线,提供了一种实用的解决方案,以实现易于访问和高效的3D场景构建。
Summary / 总结
GeoDiff3D is a self-supervised framework for 3D scene generation that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to generate texture-rich images. It does not require strict multi-view consistency of the diffusion-generated references, making it robust to noisy guidance. The method introduces voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details, reducing the need for labeled data. Experiments show that GeoDiff3D improves generalization and generation quality compared to existing methods, enabling fast and high-quality 3D scene generation.
GeoDiff3D 是一个自监督的 3D 场景生成框架,使用粗略的几何结构作为结构锚点,并结合几何约束的 2D 扩散模型生成纹理丰富的图像。它不需要扩散生成参考的多视图一致性,并保持场景连贯性和精细细节。实验表明,GeoDiff3D 在泛化能力和生成质量上优于现有方法,能够实现快速和高质量的 3D 场景生成。
Reimagining Peer Review Process Through Multi-Agent Mechanism Design
Authors: Ahmad Farooq, Kamran Iqbal
First: 2026-01-27T16:43:11+00:00 · Latest: 2026-01-27T16:43:11+00:00
Comments: To appear in the Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 4 pages, 1 figure, 1 table
Abstract
The software engineering research community faces a systemic crisis: peer review is failing under growing submissions, misaligned incentives, and reviewer fatigue. Community surveys reveal that researchers perceive the process as "broken." This position paper argues that these dysfunctions are mechanism design failures amenable to computational solutions. We propose modeling the research community as a stochastic multi-agent system and applying multi-agent reinforcement learning to design incentive-compatible protocols. We outline three interventions: a credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency. We present threat models, equity considerations, and phased pilot metrics. This vision charts a research agenda toward sustainable peer review.
中文标题/摘要
标题:通过多智能体机制设计重塑同行评审过程
软件工程研究领域面临系统性危机:随着提交量的增长、激励不一致以及评审员疲劳,同行评审正在失效。社区调查表明,研究人员认为这一过程是“失效”的。本文认为,这些功能障碍是机制设计失败,可以通过计算解决方案来解决。我们建议将研究社区建模为随机多智能体系统,并应用多智能体强化学习来设计激励相容协议。我们提出了三项干预措施:基于信用的提交经济、MARL优化的评审员分配以及审查一致性混合验证。我们提出了威胁模型、公平性考虑以及分阶段试点指标。本文勾勒出一条通向可持续同行评审的研究议程。
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Authors: Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
First: 2025-10-22T17:47:12+00:00 · Latest: 2026-01-27T16:39:04+00:00
Abstract
Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to bridge the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
中文标题/摘要
标题:SCoPE VLM:选择性上下文处理以提高视觉语言模型的文档导航效率
理解长上下文视觉信息仍然是视觉语言模型的基本挑战,尤其是在如GUI控制和网页导航等代理任务中。虽然网页和GUI环境本质上是结构化的文档,但当前的VLMs在训练目标中通常忽略了决策导向的文档理解。现有方法主要通过扩展视觉嵌入来处理长的高分辨率输入,但这些方法内存密集且不适用于本地部署解决方案。为了解决这些问题,我们提出了SCoPE VLM,这是一种文档导航专家,利用新颖的滚动链机制选择性和递归地导航文档,专注于相关段落。我们引入了一种专门的数据生成管道来构建有信息量的滚动链轨迹,并提出了一种定制的强化学习方法——阶段性组相对策略优化,以弥合训练和推理之间的差距。我们的方法显著减少了内存使用,并有效地模拟了人类的阅读行为。据我们所知,SCoPE VLM是第一个明确建模多页文档问答中代理阅读模式的框架,推动了多模态代理的能力。
Summary / 总结
The research aims to improve vision-language models' ability to understand long-context visual information, especially for agentic tasks like GUI control and web navigation. SCoPE VLM introduces a Chain of Scroll mechanism that selectively navigates through documents, focusing on relevant segments. This approach reduces memory usage and models human-like reading behaviors, making it suitable for locally deployable solutions. The key finding is that SCoPE VLM effectively addresses the limitations of existing VLMs by explicitly modeling agentic reading patterns in multi-page document question answering.
研究解决了视觉语言模型在GUI控制和网页导航等代理任务中理解长上下文视觉信息的挑战。提出了SCoPE VLM,该模型使用链式滚动机制选择性地导航文档,专注于相关段落。该方法减少了内存使用,并模拟了人类的阅读行为,使其适用于本地部署解决方案。据作者所知,SCoPE VLM 是第一个明确建模多页文档问答中代理阅读模式的框架,提升了多模态代理的能力。
DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
Authors: Francisco Caetano, Christiaan Viviers, Luis A. Zavala-Mondragón, Peter H. N. de With, Fons van der Sommen
Venue: ICCV 2025
First: 2025-01-14T10:49:26+00:00 · Latest: 2026-01-27T16:37:43+00:00
Comments: ICCV 2025
Abstract
Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results on public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public Near-OOD benchmarks (95.0%). With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code is publicly available.
中文标题/摘要
标题:DisCoPatch:驾驭由对抗驱动的批量统计以提高异常分布检测
异常分布(OOD)检测在许多应用中具有重要意义。虽然语义和领域转移OOD问题已经得到了充分研究,但本工作关注的是协变量偏移——数据分布中的细微变化,这会降低机器学习性能。我们假设检测这些细微变化可以提高我们对分布内边界的理解,从而最终提高OOD检测。在使用批量标准化(BN)训练的对抗判别器中,真实样本和对抗样本形成了具有独特批量统计的独立领域——我们利用这一特性进行OOD检测。我们引入了DisCoPatch,这是一种无监督的对抗变分自编码器(VAE)框架,利用了这一机制。在推理过程中,批次由同一图像的补丁组成,确保了数据分布的一致性,从而使模型能够依赖于批量统计。DisCoPatch 使用VAE的次优输出(生成和重构)作为负样本来训练判别器,从而提高其区分分布内样本和协变量偏移的能力。通过收紧这一边界,DisCoPatch 在公共OOD检测基准测试中达到了最先进的结果。所提出的模型不仅在检测协变量偏移方面表现出色,实现了ImageNet-1K(-C)上的95.5% AUROC,还在公共Near-OOD基准测试中超越了所有先前的方法(95.0%)。凭借25MB的紧凑模型大小,它在显著降低现有方法的延迟的同时实现了高OOD检测性能,使其成为现实世界OOD检测应用的高效且实用的解决方案。代码已公开。
Summary / 总结
This work addresses the challenge of detecting covariate shifts, which are subtle variations in data distribution that can degrade machine learning performance. It introduces DisCoPatch, an unsupervised Adversarial Variational Autoencoder framework that uses batch statistics from real and adversarial samples to improve OOD detection. DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks, with 95.5% AUROC on ImageNet-1K(-C) and 95.0% on Near-OOD benchmarks, outperforming all prior methods. The model is compact (25MB) and efficient, offering high OOD detection performance with lower latency compared to existing methods.
该研究旨在检测数据分布中的细微变化(covariate shifts),这些变化会降低机器学习性能。提出了一种名为DisCoPatch的无监督Adversarial VAE框架,利用真实样本和对抗样本的批统计信息来提升OOD检测效果。DisCoPatch在公共OOD检测基准测试中取得了最佳结果,包括95.5%的AUROC在ImageNet-1K(-C)和95.0%在Near-OOD基准上,超越了所有先前的方法。该模型体积小巧(25MB),性能高效,在保持高OOD检测性能的同时,具有较低的延迟,适用于实际应用。
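The inference-time batching described in the DisCoPatch entry above (every batch is made of patches from a single image, so batch statistics describe that image alone) can be illustrated with a small sketch. The non-overlapping tiling, the patch size, and the plain mean/variance summary are assumptions chosen for brevity; the entry does not specify how patches are sampled or how the discriminator consumes the statistics.

```python
def patch_batch(image, patch):
    """Tile a 2-D image (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    batch = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            batch.append([row[left:left + patch] for row in image[top:top + patch]])
    return batch


def batch_stats(batch):
    """Mean and (biased) variance over all pixels in the batch,
    i.e. the kind of statistic a BatchNorm layer would compute."""
    values = [v for p in batch for row in p for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# A 4x4 toy image with pixel values 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
batch = patch_batch(image, 2)
mean, var = batch_stats(batch)
```

Because all patches share one source image, a covariate shift affecting that image moves the statistics of the whole batch together, which is the signal the entry says the model relies on.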
Optimal Scaling Needs Optimal Norm
Authors: Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
First: 2025-10-04T16:48:36+00:00 · Latest: 2026-01-27T16:32:23+00:00
Abstract
Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(η^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(η, B)$ reach the optimal norm, only a unique $(η^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(η^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
中文标题/摘要
标题:最优缩放需要最优范数
尽管在模型和数据集缩放下的最优超参数转移方面取得了近期进展,但尚未建立统一的解释性原则。对于Adam和Scion优化器,我们发现模型和数据集大小的联合最优缩放取决于一个不变量:输出层的算子范数。在最多13亿参数的模型和最多1380亿个标记的训练中,最优的学习率/批量大小对$(η^{\ast}, B^{\ast})$始终具有相同的算子范数值——我们称之为范数转移。这一恒定范数条件是必要的但不充分的:虽然对于每个数据集大小,多个$(η, B)$可以达到最优范数,但只有唯一的$(η^{\ast}, B^{\ast})$能够实现最佳损失。作为充分条件,我们首次测量了Scion的$(η^{\ast}, B^{\ast})$随数据集大小的缩放规则,并发现这些规则与Adam的一致。按层组调整学习率也提高了模型性能,输出层最为敏感,而隐藏层则受益于较低的学习率。我们提供了基于范数指导的最优缩放的实用见解,并发布了Distributed Scion(Disco)实现,其中包含超过两千次运行的日志,以支持大规模LLM训练动态的研究。
Summary / 总结
The research aims to establish a unifying principle for optimal hyperparameter transfer under model and dataset scaling. By studying Adam and Scion optimizers, the authors find that the optimal learning rate and batch size pair for each dataset size share a consistent operator norm, termed norm transfer. This condition is necessary but not sufficient; while multiple pairs can achieve the optimal norm, only a unique pair yields the best loss. The study also shows that tuning per-layer-group learning rates, with the output layer being the most sensitive, improves model performance. The findings provide practical insights into norm-guided optimal scaling and are supported by a large-scale implementation of Scion.
研究旨在建立一种统一的原则,以应对模型和数据集规模变化时的最优超参数转移问题。通过研究Adam和Scion优化器,作者发现每个数据集大小的最优学习率和批量大小配对共享一个一致的算子范数,称为范数转移。这一条件是必要的但不充分的;虽然多个配对可以达到最优范数,但只有唯一的配对能获得最佳损失。研究还表明,按层组调整学习率可以提高模型性能,输出层最为敏感,而隐藏层则受益于较低的学习率。研究提供了基于范数的最优缩放的实用见解,并通过大规模的Scion(Disco)实现及其超过两千次的运行日志支持了大规模训练动态的研究。
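For readers who want to monitor an output-layer norm of the kind discussed in the entry above, here is a minimal power-iteration sketch. It computes the spectral norm (largest singular value) as a stand-in; this entry does not state which operator norm the authors track, so the choice of norm, the function name, and the iteration count are all assumptions.

```python
import math


def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W.

    W is a list of rows. The uniform starting vector is an assumption; it
    must not be orthogonal to the top right-singular vector.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols  # unit-length start vector
    sigma = 0.0
    for _ in range(iters):
        # u = W v; since ||v|| = 1, ||u|| is the current norm estimate.
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            return 0.0
        # v = W^T u, renormalised to unit length.
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:
            return 0.0
        v = [x / norm_v for x in v]
    return sigma


# Toy "output layer" weights with known top singular values 3 and 2.
norm_diag = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
norm_shift = spectral_norm([[0.0, 2.0], [0.0, 0.0]])
```

Logging a quantity like this across a learning-rate and batch-size sweep is one way to check whether the best-performing runs share a common norm value, as the entry's norm-transfer claim suggests.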
The Effect of Architecture During Continual Learning
Authors: Allyson Hahn, Krishnan Raghavan
First: 2026-01-27T16:29:42+00:00 · Latest: 2026-01-27T16:29:42+00:00
Abstract
Continual learning is a challenge for models with static architecture, as they fail to adapt when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation into the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. Consequently, we prove that by learning the architecture and weights simultaneously at each task, we can reduce catastrophic forgetting.
To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper-level problem, we introduce a derivative-free direct search algorithm to determine the optimal architecture. Once it is found, we must transfer knowledge from the current architecture to the optimal one. However, the optimal architecture will have a weight parameter space different from that of the current architecture (i.e., the dimensions of the weight matrices will not match). To bridge the dimensionality gap, we develop a low-rank transfer mechanism to map knowledge across architectures of mismatched dimensions. Empirical studies across regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static architecture approaches.
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in continual learning by proposing a mathematical framework that jointly models architecture and weights. It formulates continual learning as a bilevel optimization problem and introduces a derivative-free direct search algorithm to learn the optimal architecture. The study demonstrates that learning both architecture and weights simultaneously reduces catastrophic forgetting and improves performance, robustness, and noise tolerance compared to static architecture approaches. Empirical results across various neural network types show substantial improvements.
研究通过提出一个同时建模架构和权重的数学框架,解决了持续学习中的灾难性遗忘问题。方法将持续学习表述为一个双层优化问题,上层选择最优架构,下层计算最优权重。研究证明仅学习权重不足以减轻遗忘,并展示了同时学习架构和权重可以减少灾难性遗忘。实验表明,与静态架构方法相比,这种方法在各种神经网络类型中显著提高了性能和鲁棒性。
Activation Function Design Sustains Plasticity in Continual Learning
Authors: Lute Lillo, Nick Cheney
First: 2025-09-26T16:41:47+00:00 · Latest: 2026-01-27T16:19:30+00:00
Abstract
In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
中文标题/摘要
标题:激活函数设计维持连续学习中的可塑性
在独立同分布(i.i.d.)训练制度中,激活函数已经被广泛测试,一旦调整了模型大小和优化方法,它们之间的差异往往会缩小。然而,在连续学习中,情况有所不同:除了灾难性遗忘之外,模型还可能逐渐失去适应能力(称为可塑性丧失),而这种失败模式中非线性的角色尚未得到充分探索。我们表明,激活函数的选择是减轻可塑性丧失的主要、架构无关的杠杆。基于负分支形状和饱和行为的属性级分析,我们引入了两种即插即用的非线性(Smooth-Leaky和Randomized Smooth-Leaky),并在两种互补的设置中进行了评估:(i)监督类增量基准测试和(ii)强化学习中的非平稳MuJoCo环境,这些环境旨在诱导可控的数据分布和动力学变化。我们还提供了一个简单的压力测试协议和诊断工具,将激活函数的形状与变化下的适应性联系起来。结论是明确的:精心设计的激活函数提供了一种轻量级、领域通用的方法,可以在不增加容量或特定任务调整的情况下维持连续学习中的可塑性。
Summary / 总结
The paper investigates the role of activation functions in continual learning, where models face challenges like catastrophic forgetting and loss of plasticity. It introduces Smooth-Leaky and Randomized Smooth-Leaky as new activation functions and evaluates them in supervised class-incremental benchmarks and reinforcement learning environments. The study finds that thoughtful activation design can sustain plasticity without increasing model capacity or requiring task-specific tuning.
论文研究了激活函数在持续学习中的作用,特别是在模型随时间推移会失去适应能力的情况下。研究引入了Smooth-Leaky和Randomized Smooth-Leaky两种新的激活函数,并在监督类增量基准和强化学习的非平稳MuJoCo环境中进行了评估。研究发现,精心设计的激活函数可以在不增加模型容量或进行任务特定调整的情况下维持适应性。
WaterClear-GS: Optical-Aware Gaussian Splatting for Underwater Reconstruction and Restoration
Authors: Xinrui Zhang, Yufeng Wang, Shuangkang Fang, Zesheng Wang, Dacheng Qi, Wenrui Ding
First: 2026-01-27T16:14:34+00:00 · Latest: 2026-01-27T16:14:34+00:00
Abstract
Underwater 3D reconstruction and appearance restoration are hindered by the complex optical properties of water, such as wavelength-dependent attenuation and scattering. Existing Neural Radiance Fields (NeRF)-based methods struggle with slow rendering speeds and suboptimal color restoration, while 3D Gaussian Splatting (3DGS) inherently lacks the capability to model complex volumetric scattering effects. To address these issues, we introduce WaterClear-GS, the first pure 3DGS-based framework that explicitly integrates underwater optical properties of local attenuation and scattering into Gaussian primitives, eliminating the need for an auxiliary medium network. Our method employs a dual-branch optimization strategy to ensure underwater photometric consistency while naturally recovering water-free appearances. This strategy is enhanced by depth-guided geometry regularization and perception-driven image loss, together with exposure constraints, spatially-adaptive regularization, and physically guided spectral regularization, which collectively enforce local 3D coherence and maintain natural visual perception. Experiments on standard benchmarks and our newly collected dataset demonstrate that WaterClear-GS achieves outstanding performance on both novel view synthesis (NVS) and underwater image restoration (UIR) tasks, while maintaining real-time rendering. The code will be available at https://buaaxrzhang.github.io/WaterClear-GS/.
中文标题/摘要
标题:WaterClear-GS:基于光学感知的高斯点云渲染以实现水下重建与恢复
水下3D重建和外观恢复受到水的复杂光学特性(如波长依赖性衰减和散射)的阻碍。现有的基于神经辐射场(NeRF)的方法在渲染速度和颜色恢复方面存在不足,而3D高斯点云渲染(3DGS)本身无法建模复杂的体积散射效应。为了解决这些问题,我们提出了WaterClear-GS,这是第一个完全基于3DGS的框架,它明确地将局部衰减和散射的水下光学特性整合到高斯原语中,消除了辅助介质网络的需要。我们的方法采用双分支优化策略以确保水下光度一致性,同时自然恢复无水外观。该策略通过深度引导几何正则化和感知驱动的图像损失,结合曝光约束、空间自适应正则化和物理引导的光谱正则化,共同确保局部3D一致性并保持自然视觉感知。在标准基准和我们新收集的数据集上的实验表明,WaterClear-GS在新颖视图合成(NVS)和水下图像恢复(UIR)任务上均表现出色,同时保持实时渲染。代码将在https://buaaxrzhang.github.io/WaterClear-GS/上提供。
Summary / 总结
WaterClear-GS is a novel framework that integrates underwater optical properties into 3D Gaussian splatting to address the challenges of slow rendering and suboptimal color restoration in underwater reconstruction and restoration. It uses a dual-branch optimization strategy with depth-guided geometry regularization and perception-driven image loss to ensure photometric consistency and natural visual perception. Experiments show that WaterClear-GS outperforms existing methods on both novel view synthesis and underwater image restoration tasks while maintaining real-time rendering capabilities.
WaterClear-GS 是一种将水下光学特性集成到 3D 高斯点绘中的新框架,以解决水下重建和恢复中渲染速度慢和颜色恢复不佳的问题。它使用深度引导的几何正则化和感知驱动的图像损失的双分支优化策略,以确保光度一致性并保持自然视觉感知。实验表明,WaterClear-GS 在新颖视图合成和水下图像恢复任务中均优于现有方法,同时保持实时渲染能力。
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
Authors: Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic
Venue: ICASSP 2026
First: 2025-10-26T09:44:20+00:00 · Latest: 2026-01-27T16:14:08+00:00
Comments: IEEE ICASSP 2026. The code is available at https://github.com/umbertocappellazzo/Llama-AVSR
Abstract
Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
中文标题/摘要
标题:利用LLM在视听语音识别中缓解注意力陷阱和大规模激活
大型语言模型(LLMs)最近在听觉语音识别(ASR)、视觉语音识别(VSR)和视听语音识别(AVSR)方面取得了进展。然而,对其在微调过程中的内部动态理解仍然有限。在自然语言处理领域,最近的研究揭示了注意力陷阱,即吸引不成比例高注意力的标记,以及与之相关的巨大激活,在LLMs中,某些陷阱标记的特征表现出巨大的激活。在本研究中,我们首次研究了这些现象在多模态语音识别中的情况。通过对视听LLMs的详细分析,我们不仅在BOS标记,还在ASR、VSR和AVSR的中间低语义标记中识别出注意力陷阱和大规模激活。我们表明,大规模激活起源于MLP层,并且对应于所有陷阱标记的固定特征索引。我们进一步表明,中间陷阱标记与BOS标记具有高余弦相似性,从而放大了注意力和激活。基于这些见解,我们引入了一种简单的去相关损失,减少了BOS与其他标记之间的余弦相似性,有效地缓解了中间陷阱和大规模激活。此外,我们的方法在高视听特征下采样时提高了单词错误率(WER),而在较低的下采样率下保持稳定。
Summary / 总结
This work investigates attention sinks and massive activations in LLM-based audio-visual speech recognition. Analyzing audio-visual LLMs, the authors find both phenomena not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. They show that massive activations originate in the MLP layers and occur at fixed feature indices across sink tokens, and that intermediate sink tokens have high cosine similarity to the BOS token. A decorrelation loss that reduces the cosine similarity between BOS and other tokens mitigates intermediate sinks and massive activations. It improves word error rate (WER) under high audio-visual feature downsampling and remains stable at lower downsampling rates.
该研究探讨了大型语言模型在音频-视觉语音识别微调过程中出现的注意力陷阱和大规模激活问题。通过对音频-视觉LLM的分析,作者在ASR、VSR和AVSR中的BOS和中间token中发现了这些现象。他们表明,大规模激活起源于MLP层,并与固定特征索引相关。作者引入了一种去相关损失,以降低BOS与其他token之间的相似性,从而缓解这些问题并在高特征下采样时改善了词错误率。
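As a rough illustration of the decorrelation idea, the sketch below penalizes cosine similarity between the BOS hidden state and every other token's hidden state. The function names, the squared-cosine penalty, and the mean reduction are assumptions made for this example. The abstract only says that the loss reduces cosine similarity between BOS and other tokens, so the authors' exact formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small constant avoids division by zero

def decorrelation_loss(hidden_states):
    """Mean squared cosine similarity between BOS (index 0) and all other tokens.

    Hypothetical form: the loss is 0 when every token is orthogonal to BOS
    and grows toward 1 as tokens align with it.
    """
    bos, others = hidden_states[0], hidden_states[1:]
    if not others:
        return 0.0
    return sum(cosine(bos, h) ** 2 for h in others) / len(others)

# Toy hidden states: the second token mimics BOS (a "sink"), the third does not.
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(decorrelation_loss(states))
```

In training, a term like this would presumably be added to the recognition loss with a weighting factor; that weighting is not given in the abstract.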
Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues
Authors: Junchen Fu, Wenhao Deng, Kaiwen Zheng, Alexandros Karatzoglou, Ioannis Arapakis, Yu Ye, Yongxin Ni, Joemon M. Jose, Xuri Ge
First: 2026-01-27T16:13:26+00:00 · Latest: 2026-01-27T16:13:26+00:00
Abstract
Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark.
We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.
中文标题/摘要
标题:多模态大型语言模型在电子商务产品目录中缺失模态完成的基准测试
电子商务平台上缺失模态信息,如缺少产品图片或文本描述,通常源于标注错误或不完整的元数据,影响了产品展示和推荐系统等下游应用。受近期多模态大型语言模型(MLLMs)的生成能力启发,本文探讨了一个基础但尚未充分探索的问题:MLLMs能否生成电子商务场景中产品的缺失模态?我们提出了缺失模态产品完成基准(MMPCBench),包括内容质量完成基准和推荐基准。我们进一步评估了来自Qwen2.5-VL和Gemma-3模型家族的六种最先进的MLLMs在九个实际电子商务类别中的表现,重点关注图像到文本和文本到图像的完成任务。实验结果表明,尽管MLLMs能够捕捉高层次语义,但在细粒度的词级和像素级或补丁级对齐方面存在困难。此外,不同产品类别和模型规模的性能差异显著,我们观察到模型大小与性能之间没有简单的相关性,这与主流基准中通常报告的趋势相反。我们还探索了组相对策略优化(GRPO)以更好地使MLLMs与该任务对齐。GRPO提高了图像到文本的完成,但对文本到图像的完成没有增益。总体而言,这些发现揭示了当前MLLMs在实际跨模态生成中的局限性,并代表了更有效缺失模态产品完成的早期步骤。
Summary / 总结
This work investigates whether Multimodal Large Language Models (MLLMs) can generate missing modalities in e-commerce product catalogues, and proposes the Missing Modality Product Completion Benchmark (MMPCBench), made up of a Content Quality Completion Benchmark and a Recommendation Benchmark. Six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 families were evaluated on nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion. Results indicate that MLLMs capture high-level semantics but struggle with fine-grained word-level and pixel- or patch-level alignment. Performance varies substantially across categories and model scales, with no simple correlation between model size and performance. The study also explores Group Relative Policy Optimization (GRPO), which improves image-to-text completion but yields no gains for text-to-image completion.
这项研究探讨了多模态大型语言模型(MLLMs)是否能够生成电子商务产品目录中的缺失模态。研究提出了缺失模态产品完成基准(MMPCBench),并在九个电子商务类别中评估了六种最先进的MLLMs。结果显示,尽管MLLMs能够捕捉高层次的语义,但在单词和像素级别的细粒度对齐方面存在困难。性能在类别和模型大小之间差异显著,而组相对策略优化(GRPO)方法可以提高图像到文本的完成,但对文本到图像的完成没有提升。
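For readers unfamiliar with GRPO, its central step is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-normalization step. The reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and the abstract does not describe the reward function used for product completion.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four generated descriptions of the same product image.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Completions that score above the group average receive positive advantages and are reinforced; when all completions in a group score the same, every advantage is zero and the group contributes no learning signal.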
Investigating Test Overfitting on SWE-bench
Authors: Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel
First: 2025-11-20T23:55:56+00:00 · Latest: 2026-01-27T16:12:38+00:00
Abstract
Tests can be useful towards resolving issues on code repositories. However, relying too much on tests for issue resolution can lead to code that technically passes observed tests but actually misses important cases or even breaks functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue resolution systems use tests auto-generated from issues, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.
中文标题/摘要
标题:探究SWE-bench上的测试过拟合
测试可以在解决代码仓库中的问题时发挥作用。然而,过于依赖测试进行问题解决可能会导致代码虽然通过了观察到的测试,但实际上却遗漏了重要的情况甚至破坏了功能。这个问题被称为测试过拟合,由于问题通常缺乏可以直接执行的测试,因此这个问题被放大了。相反,一些问题解决系统使用从问题自动生成的测试,这些测试可能是不完美的。甚至有些系统还会联合迭代地改进代码和测试。本文提出了第一个针对此场景的测试过拟合的实证研究。
Summary / 总结
This paper presents what the authors describe as the first empirical study of test overfitting in issue resolution on SWE-bench. Over-reliance on tests can yield code that passes the observed tests yet misses important cases or breaks functionality. The risk is greater because issues usually lack readily executable tests, so several systems rely on tests auto-generated from the issue, which may be imperfect, and some refine code and tests jointly.
该论文研究了在使用自动生成测试的issue解决系统中出现的测试过拟合问题。研究发现,即使代码通过了这些测试,也可能仍然遗漏重要情况或破坏功能。该研究强调了在联合优化代码和测试的系统中尤其需要解决这一问题。
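The notion of test overfitting can be made concrete with a toy example: a patch counts as overfit if it passes the tests the system could see but fails tests that were held out. The helper names, the toy issue, and the test cases below are all hypothetical. The abstract does not specify how the study measures overfitting.

```python
def passes(candidate, tests):
    """True if candidate returns the expected value for every (args, expected) pair."""
    return all(candidate(*args) == expected for args, expected in tests)

def overfits(candidate, observed_tests, held_out_tests):
    """A patch overfits if it passes the observed tests but fails the held-out ones."""
    return passes(candidate, observed_tests) and not passes(candidate, held_out_tests)

# Toy issue: "abs_value should return the magnitude of x".
def patched_abs(x):
    # Hypothetical overfit patch: it special-cases the one negative input seen in the tests.
    return 3 if x == -3 else x

observed = [((-3,), 3), ((2,), 2)]   # tests auto-generated from the issue
held_out = [((-5,), 5), ((0,), 0)]   # tests the resolution system never saw

print(overfits(patched_abs, observed, held_out))
print(overfits(abs, observed, held_out))
```

The first call reports an overfit patch, since `patched_abs(-5)` returns -5, while the built-in `abs` passes both test sets.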
Veri-Sure: A Contract-Aware Multi-Agent Framework with Temporal Tracing and Formal Verification for Correct RTL Code Generation
Authors: Jiale Liu, Taiyu Zhou, Tianqi Jiang
First: 2026-01-27T16:10:23+00:00 · Latest: 2026-01-27T16:10:23+00:00
Abstract
In the rapidly evolving field of Electronic Design Automation (EDA), the deployment of Large Language Models (LLMs) for Register-Transfer Level (RTL) design has emerged as a promising direction. However, silicon-grade correctness remains bottlenecked by: (i) limited test coverage and reliability of simulation-centric evaluation, (ii) regressions and repair hallucinations introduced by iterative debugging, and (iii) semantic drift as intent is reinterpreted across agent handoffs. In this work, we propose Veri-Sure, a multi-agent framework that establishes a design contract to align agents' intent and uses a patching mechanism guided by static dependency slicing to perform precise, localized repairs. By integrating a multi-branch verification pipeline that combines trace-driven temporal analysis with formal verification consisting of assertion-based checking and boolean equivalence proofs, Veri-Sure enables functional correctness beyond pure simulations. We also introduce VerilogEval-v2-EXT, extending the original benchmark with 53 more industrial-grade design tasks and stratified difficulty levels, and show that Veri-Sure achieves state-of-the-art verified-correct RTL code generation performance, surpassing standalone LLMs and prior agentic systems.
中文标题/摘要
标题:Veri-Sure:一种基于合约的多智能体框架,具有时间追溯和形式验证,用于正确的RTL代码生成
在快速发展的电子设计自动化(EDA)领域,大型语言模型(LLMs)在寄存器传输级(RTL)设计中的应用已成为一个有前景的方向。然而,硅级正确性仍受到以下瓶颈的制约:(i)以仿真为中心的评估的有限测试覆盖率和可靠性,(ii)迭代调试引入的回归和修复幻觉,以及(iii)意图在智能体交接过程中重新解释时出现的语义漂移。在本文中,我们提出Veri-Sure,这是一种多智能体框架,通过建立设计合约来对齐智能体的意图,并使用由静态依赖切片引导的补丁机制进行精确、局部的修复。通过结合基于轨迹的时间分析和形式验证(包括基于断言的检查和布尔等价证明)的多分支验证流水线,Veri-Sure 能够超越单纯的仿真实现功能正确性。我们还引入了VerilogEval-v2-EXT,扩展了原始基准测试,增加了53个工业级设计任务,并分层了难度级别,并展示了Veri-Sure 在验证正确的RTL代码生成性能上达到了最先进的水平,超越了独立的LLMs和先前的智能体系统。