arXiv Paper Digest

2025-12-30 03:28
Snapshot: 20251230_0328
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
First: 2025-12-26T18:59:47+00:00 · Latest: 2025-12-26T18:59:47+00:00
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
中文标题/摘要
标题:见少而明:双向感知塑造多模态推理
大型视觉-语言模型(VLMs)通常从中间视觉提示中受益,这些提示要么通过外部工具注入,要么在推理过程中作为潜在视觉标记生成,但这些机制仍然忽略了细粒度的视觉证据(例如图表中的折线),跨领域泛化能力差,并且推理时开销高。在本文中,我们提出了双向感知塑造(BiPS),它将问题条件下的掩码视图转换为双向的“看哪里”信号,在训练过程中塑造感知。BiPS 首先在原始图像与仅保留问题相关区域的证据保留视图之间施加 KL 一致性约束,鼓励对支持性像素进行粗略但完整的覆盖。随后,它在原始图像与证据消除视图之间施加 KL 分离约束;该视图遮蔽了关键像素,使图像不再支持原始答案,从而抑制纯文本捷径(即仅凭文本作答)并强化细粒度的视觉依赖。在八个基准测试中,BiPS 将 Qwen2.5-VL-7B 的性能平均提高了 8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
Summary / 总结
This paper addresses the limitations of existing vision-language models that rely on intermediate visual cues, which often overlook fine-grained visual evidence and generalize poorly. The proposed Bi-directional Perceptual Shaping (BiPS) method transforms question-conditioned masked views into bidirectional where-to-look signals to shape perception during training. BiPS improves Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and demonstrates strong out-of-domain generalization to unseen datasets and image types.
研究旨在通过解决现有方法依赖外部工具或潜在视觉标记的局限性,提高大型视觉-语言模型的性能。提出的双向感知塑造(BiPS)方法将遮蔽视图转换为双向的看哪里信号,在训练期间引导感知。BiPS增强了模型对细粒度视觉证据的依赖能力,并在不同领域表现出强大的泛化能力,八个基准测试的平均改进幅度为8.2%。
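The two constraints described in the BiPS abstract can be illustrated with a small sketch. This is not the authors' loss: the KL direction, the hinge with a `margin`, the `weight` parameter, and all function names are assumptions chosen only to show how a consistency term (original vs. evidence-preserving view) and a separation term (original vs. evidence-ablated view) might be combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def shaping_loss(logits_orig, logits_keep, logits_ablate, margin=1.0, weight=1.0):
    """Hypothetical combination of the two KL constraints.

    logits_keep:   answer logits given the evidence-preserving view
    logits_ablate: answer logits given the evidence-ablated view
    """
    p_orig = softmax(logits_orig)
    # Consistency: predictions should match when only relevant regions are kept.
    consistency = kl_divergence(p_orig, softmax(logits_keep))
    # Separation: predictions should differ once the evidence is masked out.
    separation = kl_divergence(p_orig, softmax(logits_ablate))
    # Assumed hinge: penalize only while the ablated view is closer than `margin`.
    return consistency + weight * max(0.0, margin - separation)

# A model that relies on the image: the ablated view changes its answer.
loss_grounded = shaping_loss([4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0])
# A model taking a text-only shortcut: masking the evidence changes nothing.
loss_shortcut = shaping_loss([4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0])
```

Under these assumptions, the shortcut case is penalized by the full margin while the grounded case incurs no loss.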
ProEdit: Inversion-based Editing From Prompts Done Right
Authors: Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng
First: 2025-12-26T18:59:14+00:00 · Latest: 2025-12-26T18:59:14+00:00
Comments: Equal contributions from first two authors. Project page: https://isee-laboratory.github.io/ProEdit/ Code: https://github.com/iSEE-Laboratory/ProEdit
Abstract
Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue both in the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play, which can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
中文标题/摘要
标题:ProEdit:从指令正确进行基于反转的编辑
基于反转的视觉编辑提供了一种有效且无需训练的方法,可以根据用户指令编辑图像或视频。现有方法通常在采样过程中注入源图像信息以保持编辑一致性。然而,这种采样策略过度依赖源信息,这会负面影响目标图像中的编辑效果(例如,无法按照指令改变主体的姿态、数量或颜色)。在本文中,我们提出ProEdit以在注意力和潜在方面解决这一问题。在注意力方面,我们引入了KV-mix,它在编辑区域混合源和目标的KV特征,减轻了源图像对编辑区域的影响,同时保持背景一致性。在潜在方面,我们提出了Latents-Shift,它扰动源潜在的编辑区域,消除了反转潜在对采样的影响。在几个图像和视频编辑基准上的广泛实验表明,我们的方法达到了SOTA性能。此外,我们的设计是即插即用的,可以无缝集成到现有的反转和编辑方法中,如RF-Solver、FireFlow和UniEdit。
Summary / 总结
ProEdit addresses the issue of overly relying on source image information in inversion-based visual editing, which negatively affects the edits in the target image. It introduces KV-mix to mix KV features of the source and target in the edited region, and Latents-Shift to perturb the edited region of the source latent. Experiments show that ProEdit achieves state-of-the-art performance on various benchmarks and is plug-and-play, compatible with existing methods like RF-Solver, FireFlow, and UniEdit.
ProEdit 解决了基于反转的视觉编辑中过度依赖源图像信息的问题,这会负面影响目标图像中的编辑效果。它引入了 KV-mix 来混合编辑区域中源图像和目标图像的 KV 特征,以及 Latents-Shift 来扰动源图像的编辑区域的潜在特征。实验表明,ProEdit 在各种基准测试中达到了最先进的性能,并且是即插即用的,可以无缝集成到现有的方法如 RF-Solver、FireFlow 和 UniEdit 中。
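The KV-mix idea from the ProEdit abstract (mix source and target key/value features inside the edited region, keep the background consistent) can be sketched as follows. The linear blend, the `alpha` weight, the token-level boolean mask, and the function name `kv_mix` are all assumptions for illustration; the abstract does not specify the mixing rule, and the project repository should be consulted for the actual implementation.

```python
import numpy as np

def kv_mix(k_src, v_src, k_tgt, v_tgt, edit_mask, alpha=0.8):
    """Blend source/target K and V on edited tokens; keep source K, V elsewhere.

    k_*, v_*:  arrays of shape (tokens, dim)
    edit_mask: boolean array of shape (tokens,), True inside the edited region
    alpha:     assumed weight of the target features inside the edited region
    """
    m = edit_mask.astype(k_src.dtype)[:, None]  # (tokens, 1) for broadcasting
    k_edit = alpha * k_tgt + (1.0 - alpha) * k_src
    v_edit = alpha * v_tgt + (1.0 - alpha) * v_src
    # Background tokens reuse the source features unchanged.
    k = m * k_edit + (1.0 - m) * k_src
    v = m * v_edit + (1.0 - m) * v_src
    return k, v

# Toy features: 5 tokens with 8 channels each; tokens 1 and 2 are edited.
rng = np.random.default_rng(0)
k_src, v_src, k_tgt, v_tgt = (rng.normal(size=(5, 8)) for _ in range(4))
edit_mask = np.array([False, True, True, False, False])
k_mix, v_mix = kv_mix(k_src, v_src, k_tgt, v_tgt, edit_mask)
```

In this sketch, lowering `alpha` pulls the edited region back toward the source, which is the over-reliance on source information that the abstract identifies as the failure mode.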
Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications
Authors: Shengkun Cui, Rahul Krishna, Saurabh Jha, Ravishankar K. Iyer
First: 2025-12-26T18:56:18+00:00 · Latest: 2025-12-26T18:56:18+00:00
Abstract
Cloud incidents pose major operational challenges in production, with unresolved production cloud incidents costing on average over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies and the LLM acts as a traversal policy over these graphs, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
中文标题/摘要
标题:代理结构化图遍历在云应用代码相关事故根本原因分析中的应用
云事故在生产中带来了重大的运营挑战,未解决的生产云事故平均每小时成本超过200万美元。先前的研究指出,代码和配置相关问题是云事故中主要的根本原因类别。本文介绍了PRAXIS,一种编排器(orchestrator),用于管理和部署代理工作流以诊断由代码和配置引起的云事故。PRAXIS 采用由LLM驱动的结构化遍历,作用于两种类型的图:(1) 服务依赖图(SDG),捕捉微服务级别的依赖关系;(2) hammock-block 程序依赖图(PDG),捕捉每个微服务的代码级别依赖关系。这些图共同编码了微服务和代码级别的依赖关系,而LLM作为这些图上的遍历策略,通过在服务和代码依赖关系之间移动来定位和解释故障。与最先进的ReAct基线相比,PRAXIS 的根本原因分析准确率最高提升至3.1倍,同时将令牌消耗降低至原来的1/3.8。PRAXIS 在一个包含30个全面的现实世界事故的集合上进行了演示,这些事故正在被整理成一个根本原因分析基准。
Summary / 总结
This paper addresses the challenge of diagnosing cloud incidents caused by code and configuration issues, which are costly and frequent. PRAXIS, an orchestrator, uses an LLM-driven structured traversal over a service dependency graph and a hammock-block program dependence graph to diagnose these issues. Compared to existing methods, PRAXIS achieves up to 3.1 times higher root cause analysis accuracy while consuming 3.8 times fewer tokens. The method is validated on 30 real-world incidents.
本文旨在诊断由代码和配置问题引起的云故障,这些问题既频繁又昂贵。PRAXIS作为一种 orchestrator,使用LLM驱动的结构化遍历服务依赖图和hammock-block程序依赖图来诊断这些问题。与现有方法相比,PRAXIS的根因分析准确率最高可提高3.1倍,同时消耗的token数量减少3.8倍。该方法已在30个真实世界案例上进行了验证。
Experimental End-to-End Optimization of Directly Modulated Laser-based IM/DD Transmission
Authors: Sergio Hernandez, Christophe Peucheret, Francesco Da Ros, Darko Zibar
First: 2025-08-27T14:13:59+00:00 · Latest: 2025-12-26T18:55:41+00:00
Comments: 10 pages, 10 figures, published in Journal of Lightwave Technology
Abstract
Directly modulated lasers (DMLs) are an attractive technology for short-reach intensity modulation and direct detection communication systems. However, their complex nonlinear dynamics make the modeling and optimization of DML-based systems challenging. In this paper, we study the end-to-end optimization of DML-based systems based on a data-driven surrogate model trained on experimental data. The end-to-end optimization includes the pulse shaping and equalizer filters, the bias current and the modulation radio-frequency (RF) power applied to the laser. The performance of the end-to-end optimization scheme is tested on the experimental setup and compared to 4 different benchmark schemes based on linear and nonlinear receiver-side equalization. The results show that the proposed end-to-end scheme is able to deliver better performance throughout the studied symbol rates and transmission distances while employing lower modulation RF power, fewer filter taps and utilizing a smaller signal bandwidth.
中文标题/摘要
标题:基于直接调制激光器的强度调制/直接检测(IM/DD)传输的端到端优化实验研究
直接调制激光器(DML)是短距强度调制和直接检测通信系统的有吸引力的技术。然而,其复杂的非线性动力学使得基于DML的系统建模和优化具有挑战性。本文基于实验数据训练的数据驱动代理模型研究了基于DML的系统的端到端优化。端到端优化包括脉冲整形和均衡滤波器、偏置电流以及应用于激光的调制射频(RF)功率。端到端优化方案的性能在实验设置中进行了测试,并与基于线性和非线性接收端均衡的4种基准方案进行了比较。结果表明,所提出的端到端方案能够在研究的符号率和传输距离范围内提供更好的性能,同时使用较低的调制RF功率、较少的滤波器抽头并利用更小的信号带宽。
Summary / 总结
This paper addresses the challenge of optimizing directly modulated laser-based systems by using a data-driven surrogate model for end-to-end optimization, which includes pulse shaping, equalizer filters, bias current, and modulation RF power. The optimized scheme outperforms four benchmark schemes across various symbol rates and transmission distances, achieving better performance with lower modulation RF power, fewer filter taps, and a smaller signal bandwidth.
本文通过使用基于实验数据的数据驱动代理模型进行端到端优化,涵盖了脉冲整形、均衡器滤波器、偏置电流和调制射频功率。所提出的方案在各种符号率和传输距离下优于四种基准方案,同时需要更低的调制射频功率、更少的滤波器系数和更窄的信号带宽。
Pruning as a Game: Equilibrium-Driven Sparsification of Neural Networks
Authors: Zubair Shah, Noaman Khan
First: 2025-12-26T18:25:38+00:00 · Latest: 2025-12-26T18:25:38+00:00
Comments: Preprint. Under review / to be submitted to a conference
Abstract
Neural network pruning is widely used to reduce model size and computational cost. Yet, most existing methods treat sparsity as an externally imposed constraint, enforced through heuristic importance scores or training-time regularization. In this work, we propose a fundamentally different perspective: pruning as an equilibrium outcome of strategic interaction among model components. We model parameter groups such as weights, neurons, or filters as players in a continuous non-cooperative game, where each player selects its level of participation in the network to balance contribution against redundancy and competition. Within this formulation, sparsity emerges naturally when continued participation becomes a dominated strategy at equilibrium. We analyze the resulting game and show that dominated players collapse to zero participation under mild conditions, providing a principled explanation for pruning behavior. Building on this insight, we derive a simple equilibrium-driven pruning algorithm that jointly updates network parameters and participation variables without relying on explicit importance scores. This work focuses on establishing a principled formulation and empirical validation of pruning as an equilibrium phenomenon, rather than exhaustive architectural or large-scale benchmarking. Experiments on standard benchmarks demonstrate that the proposed approach achieves competitive sparsity-accuracy trade-offs while offering an interpretable, theory-grounded alternative to existing pruning methods.
中文标题/摘要
标题:剪枝即博弈:神经网络的均衡驱动稀疏化
神经网络剪枝被广泛用于减小模型规模和计算成本。然而,大多数现有方法将稀疏性视为外部施加的约束,通过启发式重要性评分或训练时正则化来实现。在本文中,我们提出了一种根本不同的视角:剪枝是模型组件之间策略性互动的均衡结果。我们将权重、神经元或滤波器等参数组建模为连续非合作博弈中的参与者,每个参与者选择其在网络中的参与程度,以在贡献与冗余和竞争之间取得平衡。在这种表述中,当继续参与在均衡处成为被占优策略时,稀疏性自然出现。我们分析了由此产生的博弈,并证明在温和条件下被占优的参与者的参与度会坍缩为零,从而为剪枝行为提供了一个有原则的解释。基于这一洞察,我们推导出一个简单的均衡驱动剪枝算法,该算法联合更新网络参数和参与变量,而不依赖于显式的重要性评分。本文的重点是建立剪枝作为均衡现象的原则性表述并进行实证验证,而不是详尽的架构或大规模基准测试。在标准基准上的实验表明,所提出的方法在稀疏性-准确率权衡上具有竞争力,同时为现有剪枝方法提供了一种可解释的、有理论依据的替代方案。
Summary / 总结
This paper proposes a new perspective on neural network pruning by framing it as a strategic game among model components. The method models parameters as players in a continuous non-cooperative game, where each parameter decides its level of participation based on its contribution and competition. The approach shows that sparsity emerges naturally as dominated strategies collapse to zero participation. Experiments on standard benchmarks demonstrate competitive sparsity-accuracy trade-offs compared to existing methods, providing a theory-grounded alternative to heuristic-based pruning techniques.
本文提出了将神经网络剪枝视为模型组件之间战略互动的均衡结果的新视角。作者将参数建模为一个连续的非合作博弈中的玩家,每个参数根据其贡献和竞争决定其参与程度。所提出的均衡驱动剪枝算法同时更新网络参数和参与变量,实验证实在标准基准上实现了有竞争力的稀疏性-准确率权衡。该方法提供了一种基于原理且可解释的替代现有剪枝技术的方法,而不依赖于启发式的重要性评分。
Learning Association via Track-Detection Matching for Multi-Object Tracking
Authors: Momir Adžemović
First: 2025-12-26T18:19:39+00:00 · Latest: 2025-12-26T18:19:39+00:00
Comments: 14 pages (+4 for references), 8 tables, 4 figures
Abstract
Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.
中文标题/摘要
标题:通过轨迹-检测匹配学习关联的多目标跟踪
多目标跟踪旨在通过跨视频帧关联检测来保持对象身份。文献中存在两种主要范式:检测驱动跟踪方法,这类方法计算效率高但依赖手工设计的关联启发式规则,以及端到端方法,这类方法从数据中学习关联但计算复杂度较高。我们提出了一种检测驱动跟踪方法——轨迹-检测链接预测(TDLP),该方法通过在每帧中预测轨迹和检测之间的链接来执行关联,即通过预测每帧中每个轨迹的正确延续。TDLP主要针对几何特征如边界框进行架构设计,同时可选地结合姿态和外观等额外线索。与基于启发式规则的方法不同,TDLP直接从数据中学习关联而无需手工设计规则,同时与端到端跟踪器相比仍保持模块化和计算高效。在多个基准上的广泛实验表明,TDLP在检测驱动跟踪和端到端方法中均能持续超越现有最佳性能。最后,我们详细对比分析了链接预测与基于度量学习的关联,并表明链接预测更为有效,尤其是在处理检测边界框等异构特征时。我们的代码可在https://github.com/Robotmurlock/TDLP 获取。
Summary / 总结
The paper proposes Track-Detection Link Prediction (TDLP), a tracking-by-detection method that uses link prediction to associate tracks and detections, aiming to maintain object identities over time. TDLP is designed to be computationally efficient and modular, relying primarily on geometric features such as bounding boxes while optionally incorporating pose and appearance cues. Experiments show that TDLP outperforms both tracking-by-detection and end-to-end methods on multiple benchmarks. Link prediction is also found to be more effective than metric learning-based association, especially for handling heterogeneous features like bounding boxes.
论文提出了一种名为Track-Detection Link Prediction (TDLP)的方法,通过链接预测来关联轨迹和检测,以维持视频帧间物体身份的一致性。TDLP设计为高效且模块化,主要使用几何特征如边界框,并可选地结合姿态和外观等额外线索。实验表明,TDLP在多个基准测试中优于基于检测的跟踪和端到端方法。分析还显示,链接预测在处理检测边界框等异构特征时比基于度量学习的关联更有效。
Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis
Authors: Duygu Altinok
First: 2025-12-26T18:02:09+00:00 · Latest: 2025-12-26T18:02:09+00:00
Comments: under review by Springer
Abstract
Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
中文标题/摘要
标题:介绍TrGLUE和SentiTurca:面向土耳其语通用语言理解和情感分析的综合基准
评估各种模型架构(如变换器、大型语言模型(LLMs)和其他NLP系统)的表现需要跨多个维度进行全面基准测试。其中,自然语言理解(NLU)的评估尤为重要,因为它是评估模型能力的基本标准。因此,建立能够从多角度进行彻底评估和分析的基准测试是必要的。虽然GLUE基准为英语NLU评估设定了标准,但其他语言也开发了类似的基准,如CLUE(中文)、FLUE(法语)和JGLUE(日语)。然而,目前尚无适用于土耳其语的类似基准。为填补这一空白,我们引入了TrGLUE,这是一个涵盖多种土耳其语NLU任务的综合基准。此外,我们还提出了SentiTurca,一个专门用于情感分析的基准。为了支持研究人员,我们还提供了针对变换器模型的微调和评估代码,便于有效使用这些基准。TrGLUE包括经过精心策划的土耳其本土语料库,旨在模仿GLUE风格评估的领域和任务形式,标签通过结合强LLM注释、跨模型一致性检查和后续的人工验证的半自动化管道获得。这种设计优先考虑语言自然性,减少直接翻译的痕迹,并提供可扩展、可重复的工作流程。通过TrGLUE,我们的目标是建立一个稳健的土耳其语NLU评估框架,为研究人员提供有价值的资源,并提供生成高质量半自动化数据集的见解。
Summary / 总结
The paper introduces TrGLUE and SentiTurca, benchmarks for evaluating Turkish natural language understanding and sentiment analysis. TrGLUE includes a variety of NLU tasks with Turkish-native corpora, while SentiTurca focuses on sentiment analysis. The benchmarks are designed to mirror GLUE-style evaluations and use a semi-automated annotation pipeline (LLM-based annotation, cross-model agreement checks, and human validation) that prioritizes linguistic naturalness and reproducibility. The author also provides fine-tuning and evaluation code for transformer-based models to support use of the benchmarks.
该论文引入了TrGLUE和SentiTurca,分别用于评估土耳其自然语言理解和情感分析的基准,填补了土耳其语言缺乏此类基准的空白。TrGLUE包括多种NLU任务,使用土耳其本土语料库,而SentiTurca专注于情感分析。语料库通过LLM注释、跨模型检查和人工验证的半自动化流程进行整理,以确保语言自然性和可扩展性。关键发现表明,TrGLUE提供了一个全面的框架来评估土耳其语NLU模型,有助于深入分析和比较。
Yume-1.5: A Text-Controlled Interactive World Generation Model
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
First: 2025-12-26T17:52:49+00:00 · Latest: 2025-12-26T17:52:49+00:00
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance, and they lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
中文标题/摘要
标题:Yume-1.5:一种文本控制的交互世界生成模型
近期的方法表明,使用扩散模型生成交互和可探索的世界具有很大的潜力。然而,这些方法大多面临着参数量过大、依赖于长时间推理步骤以及历史上下文迅速增长等关键挑战,这严重限制了实时性能,并且缺乏文本控制生成能力。为了解决这些挑战,我们提出了一种名为 Yume-1.5 的新框架,该框架旨在从单张图片或文本提示生成逼真、交互和连续的世界。Yume-1.5 通过一个精心设计的框架实现这一点,该框架支持基于键盘的生成世界探索。该框架包括三个核心组件:(1)结合统一上下文压缩和线性注意力的长视频生成框架;(2)由双向注意力蒸馏和增强的文本嵌入方案驱动的实时流式加速策略;(3)一种文本控制的世界事件生成方法。我们已在补充材料中提供了代码库。
Summary / 总结
The research aims to improve the generation of interactive and explorable worlds using diffusion models, addressing issues like large parameter sizes and lack of real-time performance. The proposed method, Yume-1.5, introduces a framework with three core components: a long-video generation framework, a real-time streaming acceleration strategy, and a text-controlled method for generating world events. Key findings include the ability to generate realistic and interactive worlds from text prompts with improved real-time performance and keyboard-based exploration capabilities.
研究旨在通过使用扩散模型生成交互式和可探索的世界,并解决参数量大、实时性能差等问题。提出的Yume-1.5方法引入了三个核心组件:长视频生成框架、实时流式加速策略以及基于文本生成世界事件的方法。主要发现包括能够从文本提示生成真实且互动的世界,并且具有改进的实时性能和键盘探索功能。
Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
Authors: Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider
First: 2025-06-09T14:48:19+00:00 · Latest: 2025-12-26T17:50:58+00:00
Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.
中文标题/摘要
标题:通过奖励感知一致性轨迹蒸馏加速离线强化学习中的扩散规划器
尽管扩散模型在决策任务中取得了显著成果,但其缓慢的推理速度仍然是一个关键限制。虽然一致性模型提供了一种潜在的解决方案,但现有的决策任务应用要么在行为克隆下受困于次优演示数据,要么依赖于在演员-评论家框架下多个网络的复杂并发训练。在本工作中,我们提出了一种新的离线强化学习中的一致性蒸馏方法,该方法直接将奖励优化纳入蒸馏过程。我们的方法通过解耦训练和无噪声奖励信号实现了单步采样,同时生成更高奖励的动作轨迹。在Gym MuJoCo、FrankaKitchen和长时规划基准上的实证评估表明,我们的方法可以比之前最先进的方法提高9.7%的性能,同时在推理时间上比扩散模型快多达142倍。
Summary / 总结
This work addresses the slow inference speed of diffusion models in decision-making tasks by proposing a novel reward-aware consistency trajectory distillation method for offline reinforcement learning. The method incorporates reward optimization directly into the distillation process, enabling single-step sampling and higher-reward action trajectories through decoupled training and noise-free reward signals. Experiments on various benchmarks show a 9.7% improvement over previous state-of-the-art methods and up to 142x speedup in inference time compared to diffusion models.
该研究通过提出一种新的奖励感知一致性轨迹蒸馏方法,解决扩散模型在决策任务中的推理速度慢问题,应用于离线强化学习。该方法将奖励优化集成到蒸馏过程中,通过解耦训练和无噪声奖励信号实现单步采样和更高奖励的动作轨迹。实验在多个基准上显示,相比之前最先进的方法,改进了9.7%,并且推理时间快了最多142倍。
Modeling Microenvironment Trajectories on Spatial Transcriptomics with NicheFlow
Authors: Kristiyan Sakalyan, Alessandro Palma, Filippo Guerranti, Fabian J. Theis, Stephan Günnemann
Venue: NeurIPS 2025
First: 2025-11-02T15:41:38+00:00 · Latest: 2025-12-26T17:29:14+00:00
Comments: 37 pages, 15 figures, to appear in NeurIPS 2025
Abstract
Understanding the evolution of cellular microenvironments in spatiotemporal data is essential for deciphering tissue development and disease progression. While experimental techniques like spatial transcriptomics now enable high-resolution mapping of tissue organization across space and time, current methods that model cellular evolution operate at the single-cell level, overlooking the coordinated development of cellular states in a tissue. We introduce NicheFlow, a flow-based generative model that infers the temporal trajectory of cellular microenvironments across sequential spatial slides. By representing local cell neighborhoods as point clouds, NicheFlow jointly models the evolution of cell states and spatial coordinates using optimal transport and Variational Flow Matching. Our approach successfully recovers both global spatial architecture and local microenvironment composition across diverse spatiotemporal datasets, from embryonic to brain development.
中文标题/摘要
标题:使用NicheFlow在空间转录组学中建模微环境轨迹
理解时空数据中细胞微环境的演变对于解读组织发育和疾病进展至关重要。虽然空间转录组学等实验技术现在能够实现高分辨率的组织组织化空间和时间映射,但当前用于建模细胞演化的方法仅在单细胞层面进行,忽视了组织中细胞状态的协调发展。我们引入了NicheFlow,这是一种基于流的生成模型,用于推断连续空间切片中细胞微环境的时序轨迹。通过将局部细胞邻域表示为点云,NicheFlow 使用最优传输和变分流匹配联合建模细胞状态和空间坐标的演变。我们的方法成功地在多种时空数据集中恢复了全局空间结构和局部微环境组成,从胚胎到大脑发育。
Summary / 总结
The research aims to understand the evolution of cellular microenvironments in spatiotemporal data to better comprehend tissue development and disease progression. NicheFlow, a flow-based generative model, is introduced to infer the temporal trajectory of cellular microenvironments across sequential spatial slides. By representing local cell neighborhoods as point clouds and using optimal transport and Variational Flow Matching, NicheFlow jointly models the evolution of cell states and spatial coordinates. The model successfully recovers both global spatial architecture and local microenvironment composition across various spatiotemporal datasets, including embryonic and brain development.
研究旨在通过理解细胞微环境在时空数据中的演变,更好地解析组织发育和疾病进展。引入了基于流的生成模型NicheFlow,以推断连续空间切片中细胞微环境的时间轨迹。通过将局部细胞邻域表示为点云,并使用最优传输和变分流匹配,NicheFlow同时建模了细胞状态和空间坐标的演变。该模型成功地恢复了来自不同时空数据集的全局空间架构和局部微环境组成,包括胚胎和大脑发育。
Unifying Learning Dynamics and Generalization in Transformers Scaling Law
Authors: Chiwun Yang
First: 2025-12-26T17:20:09+00:00 · Latest: 2025-12-26T17:20:09+00:00
Abstract
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.
Summary / 总结
This work aims to understand the theoretical underpinnings of the scaling law in Large Language Models (LLMs), which predicts improved performance with more computational resources. By formulating the learning dynamics of transformer-based models as an ODE system and approximating it to kernel behaviors, the study rigorously analyzes SGD training for multi-layer transformers on sequence-to-sequence data. Key findings include an initial exponential decay of excess risk with computational cost, followed by a power-law decay once a resource threshold is crossed, indicating a phase transition in the optimization process.
该研究旨在通过将基于 Transformer 的模型的学习动态形式化为常微分方程(ODE)系统并近似为核行为,来理解大型语言模型(LLM)中缩放定律的理论基础。研究对多层 Transformer 在序列到序列数据上的SGD训练进行了严格分析,表明在初始优化阶段,超额风险随计算成本呈指数衰减,但一旦越过特定的资源分配阈值,系统进入统计阶段,泛化误差遵循C^{-1/6}的幂律衰减。理论还推导出模型大小、训练时间和数据集大小的独立缩放定律,解释了每个变量如何独立影响泛化上界。
A Frobenius-Optimal Projection for Enforcing Linear Conservation in Learned Dynamical Models
Authors: John M. Mango, Ronald Katende
First: 2025-12-26T17:11:16+00:00 · Latest: 2025-12-26T17:11:16+00:00
Abstract
We consider the problem of restoring linear conservation laws in data-driven linear dynamical models. Given a learned operator $\widehat{A}$ and a full-rank constraint matrix $C$ encoding one or more invariants, we show that the matrix closest to $\widehat{A}$ in the Frobenius norm and satisfying $C^\top A = 0$ is the orthogonal projection $A^\star = \widehat{A} - C(C^\top C)^{-1}C^\top \widehat{A}$. This correction is uniquely defined, low rank and fully determined by the violation $C^\top \widehat{A}$. In the single-invariant case it reduces to a rank-one update. We prove that $A^\star$ enforces exact conservation while minimally perturbing the dynamics, and we verify these properties numerically on a Markov-type example. The projection provides an elementary and general mechanism for embedding exact invariants into any learned linear model.
中文标题/摘要
标题:最优弗罗贝尼乌斯投影以强制执行学习动力学模型中的线性守恒
我们考虑在数据驱动的线性动力学模型中恢复线性守恒定律的问题。给定一个学习到的算子$\widehat{A}$和一个满秩约束矩阵$C$,其中编码了一个或多个不变量,我们证明了在弗罗贝尼乌斯范数下最接近$\widehat{A}$且满足$C^\top A = 0$的矩阵是正交投影$A^\star = \widehat{A} - C(C^\top C)^{-1}C^\top \widehat{A}$。这种修正唯一定义,低秩且完全由违反$C^\top \widehat{A}$确定。在单不变量情况下,它简化为秩一更新。我们证明$A^\star$在最小扰动动力学的同时强制执行精确守恒,并通过马尔可夫类型示例进行数值验证。投影提供了一种基本且通用的方法,将精确不变量嵌入到任何学习到的线性模型中。
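The projection stated in the abstract, $A^\star = \widehat{A} - C(C^\top C)^{-1}C^\top \widehat{A}$, is short enough to check numerically. The sketch below uses NumPy; the function name, the random test matrix, and the choice of a single all-ones invariant (column sums equal to zero, as for a continuous-time Markov generator acting on column vectors) are assumptions for illustration, not the authors' experimental setup.

```python
import numpy as np

def project_to_conservation(A_hat, C):
    """Frobenius-closest matrix to A_hat that satisfies C.T @ A = 0.

    A_hat: learned operator, shape (n, n)
    C:     full-column-rank constraint matrix, shape (n, k)
    """
    violation = C.T @ A_hat  # (k, n); zero iff A_hat already conserves
    # Solve (C^T C) X = C^T A_hat rather than forming the inverse explicitly.
    correction = C @ np.linalg.solve(C.T @ C, violation)
    return A_hat - correction

# Toy single-invariant example with made-up values: conserve the sum of the state.
rng = np.random.default_rng(0)
n = 4
C = np.ones((n, 1))
A_hat = rng.normal(size=(n, n))
A_star = project_to_conservation(A_hat, C)

residual = np.abs(C.T @ A_star).max()                    # conservation violation
correction_rank = np.linalg.matrix_rank(A_hat - A_star)  # rank of the update
```

With one invariant the update has rank one, matching the abstract's remark about the single-invariant case, and applying the projection a second time leaves `A_star` unchanged.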
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
First: 2025-11-24T20:26:59+00:00 · Latest: 2025-12-26T15:54:52+00:00
Comments: Code is available: https://github.com/yuxiangwei0808/fMRI-LM
Abstract
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
中文标题/摘要
标题:fMRI-LM:迈向语言对齐的fMRI理解的通用基础模型
近期多模态大型语言模型(LLMs)的发展使图像、音频和视频的统一推理成为可能,但将这种能力扩展到脑成像领域仍鲜有探索。弥合这一差距对于将神经活动与语义认知联系起来以及开发跨模态脑表示至关重要。为此,我们提出了fMRI-LM,这是一种通过三阶段框架将功能性磁共振成像(fMRI)与语言联系起来的基础模型。在第一阶段,我们学习了一个神经分词器,将fMRI映射到嵌入语言一致空间的离散标记中。在第二阶段,我们对预训练的LLM进行调整,使其能够同时建模fMRI标记和文本,将脑活动视为可以进行时间预测和语言描述的序列。为了解决自然fMRI-文本对的缺乏,我们构建了一个大型描述性语料库,将多种成像特征翻译成结构化的文本描述,捕捉fMRI信号的低级组织。在第三阶段,我们进行多任务、多范式指令微调,赋予fMRI-LM高层次的语义理解,支持多种下游应用。在各种基准测试中,fMRI-LM实现了强大的零样本和少样本性能,并通过参数高效微调(LoRA)高效适应,建立了语言对齐的、通用的fMRI结构和语义理解模型的可扩展途径。
Summary / 总结
fMRI-LM is a foundational model that bridges fMRI and language by learning a neural tokenizer to map fMRI into discrete tokens, adapting a pretrained LLM to model fMRI tokens and text, and performing multi-task instruction tuning. It achieves strong zero-shot and few-shot performance across various benchmarks and can be efficiently tuned with parameter-efficient methods, establishing a scalable pathway for language-aligned fMRI understanding.
研究旨在通过开发基础模型fMRI-LM来弥合功能性磁共振成像(fMRI)与语言之间的差距。该模型采用三阶段框架:首先,学习一个神经分词器,将fMRI映射为语言一致空间中的离散标记;其次,调整预训练的LLM,使其联合建模fMRI标记和文本;最后,进行多任务、多范式指令微调,以实现高层次语义理解。该模型在各种基准测试中表现出强大的零样本和少样本性能,并可通过参数高效微调(LoRA)高效适应。
Periodic Asynchrony: An Effective Method for Accelerating Reinforcement Learning for Large Language Models
Authors: Jian Lu
First: 2025-11-24T08:22:50+00:00 · Latest: 2025-12-26T15:48:38+00:00
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
中文标题/摘要
标题:周期异步性:一种加速大型语言模型强化学习的有效方法
自GRPO算法问世以来,强化学习(RL)引起了越来越多的关注,人们不断尝试重现和应用它。然而,训练效率仍然是一个关键挑战。在主流的RL框架中,推理和训练通常部署在同一设备上。虽然这种方法通过资源整合降低了成本,但其同步执行方式导致了计算耦合,阻碍了推理和训练的同时进行。在本研究中,我们重新采用了分离推理和训练部署的策略,并通过改进数据加载器,将传统的同步架构转变为周期异步框架,从而实现了需求驱动、独立和弹性扩展每个组件的能力,同时算法的准确性与同步方法完全等价,两者都属于在线策略。值得注意的是,在训练阶段,我们应用了一致的三模型架构,并提出了共享提示注意掩码以减少重复计算。在实践中,这些工作在NPU平台上实现了至少三倍的整体训练性能提升,表明其具有广泛的应用潜力。
Summary / 总结
This study addresses training efficiency in reinforcement learning for large language models by proposing a periodically asynchronous framework. It deploys inference and training separately, modifies the data loader to turn the conventional synchronous pipeline into a periodically asynchronous one, and uses a unified tri-model architecture together with a shared-prompt attention mask to reduce repeated computation. The authors report at least a threefold overall speedup in RL training on NPU platforms, with accuracy equivalent to the synchronous, on-policy baseline.
该研究通过提出一种周期性异步方法来解决大型语言模型强化学习中的训练效率问题,该方法分离了推理和训练的部署,并使用统一的三模型架构和共享提示注意力掩码来减少计算量。实验在NPU平台上显示了至少三倍的整体性能提升,验证了该方法的有效性。
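The abstract does not spell out how the shared-prompt attention mask is built. One plausible reading, sketched below as an assumption rather than the author's implementation, is that several sampled responses to the same prompt are packed into one sequence: every response token attends to the shared prompt and causally to its own response, but never to sibling responses. The function name `shared_prompt_mask` and the token counts are hypothetical.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Build mask[i][j] = True when query token i may attend to key token j.

    Assumed packing layout: [prompt][response 1][response 2]...
    """
    total = prompt_len + sum(response_lens)
    # Segment id 0 marks the shared prompt, 1..k mark the individual responses.
    segment = [0] * prompt_len
    for k, length in enumerate(response_lens, start=1):
        segment += [k] * length

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only the same or earlier positions
            if segment[j] == 0 or segment[j] == segment[i]:
                mask[i][j] = True
    return mask


# Hypothetical sizes: a 3-token prompt with two 2-token responses.
mask = shared_prompt_mask(prompt_len=3, response_lens=[2, 2])

# Tokens processed when the prompt is stored once versus once per response.
packed_tokens = 3 + (2 + 2)
unpacked_tokens = 2 * 3 + (2 + 2)
```

Under this layout the prompt is encoded once per group instead of once per response, which is where the saving in repeated computation would come from. Position indices for each response would also need to continue from the end of the prompt; that detail is omitted here.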
Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
Authors: Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras
First: 2025-12-26T15:42:29+00:00 · Latest: 2025-12-26T15:42:29+00:00
Abstract
Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill and memory-bound decode phases. Our simulation methodology combines OpenRAM for energy modeling, LLMCompass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce prefill latency, their positive impact on memory-bound decode latency is capped by the external memory bandwidth. Counter-intuitively, high compute frequency can reduce total energy by reducing execution time and consequently decreasing static energy consumption more than the resulting dynamic power increase. We identify an optimal hardware configuration for the simulated workload: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB. This combination achieves the best energy-delay product, balancing low latency with high energy efficiency. Furthermore, we demonstrate how memory bandwidth acts as a performance ceiling, and that increasing compute frequency only yields performance gains up to the point where the workload becomes memory-bound. This analysis provides concrete architectural insights for designing energy-efficient LLM accelerators, especially for datacenters aiming to minimize their energy overhead.
中文标题/摘要
标题:预填充 vs. 解码瓶颈:SRAM-频率权衡与内存带宽上限
能耗决定了部署大型语言模型的成本和环境影响。本文研究了片上SRAM大小和工作频率对LLM推理的能量效率和性能的影响,重点关注计算受限的预填充阶段和内存受限的解码阶段的差异行为。我们的仿真方法结合了OpenRAM进行能量建模、LLMCompass进行延迟仿真和ScaleSIM进行阵列操作强度仿真。我们的研究结果表明,总能耗主要由两个阶段的SRAM大小决定,较大的缓冲区显著增加了由于泄漏导致的静态能耗,而这种静态能耗并未因相应的延迟减少而得到补偿。我们定量探讨了内存带宽瓶颈,表明虽然高工作频率可以减少预填充延迟,但其对内存受限解码延迟的积极影响受到外部内存带宽的限制。出乎意料的是,高计算频率可以通过减少执行时间从而降低静态能耗,从而在动态功率增加的情况下减少总能耗。我们确定了模拟工作负载的最佳硬件配置:高工作频率(1200MHz-1400MHz)和较小的本地缓冲区大小32KB到64KB。这种组合实现了最佳的能量延迟积,平衡了低延迟与高能量效率。此外,我们展示了内存带宽作为性能天花板的作用,并表明增加计算频率只能在工作负载变为内存受限之前提供性能增益。此分析为设计节能LLM加速器提供了具体的架构见解,特别是对于希望最小化其能耗的数据中心而言。
Summary / 总结
This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of Large Language Model inference. The study uses a combination of OpenRAM, LLMCompass, and ScaleSIM for simulation. Key findings show that SRAM size is the primary factor in total energy use, with larger buffers increasing static energy due to leakage. High operating frequencies reduce prefill latency but have a limited impact on memory-bound decode latency due to external memory bandwidth constraints. Surprisingly, high compute frequency can reduce total energy by decreasing static energy consumption more than dynamic power increases. An optimal configuration is identified: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB, which balances low latency with high energy efficiency. Memory bandwidth acts as a performance ceiling, and increasing compute frequency only yields gains up to the memory-bound point.
该研究探讨了片上SRAM大小和操作频率对大型语言模型推理的能效和性能的影响。研究使用OpenRAM、LLMCompass和ScaleSIM进行仿真,重点关注prefill和decode阶段。主要发现包括SRAM大小是能量使用的主要决定因素,较大的缓冲区会因泄漏增加静态能量。分析还表明,虽然高操作频率可以减少prefill延迟,但其对memory-bound decode延迟的益处受到外部内存带宽的限制。最优配置被发现为高操作频率(1200MHz-1400MHz)和32KB到64KB的小本地缓冲区,这可以平衡低延迟和高能效。此外,研究还指出,内存带宽是性能的天花板,增加计算频率只能在工作负载变为内存限制之前提供性能增益。
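The memory-bandwidth ceiling described above follows the familiar roofline argument. The sketch below is a toy model with made-up hardware numbers (FLOPs per cycle, bandwidth, operational intensities, power figures); it does not reproduce the paper's OpenRAM, LLMCompass, or ScaleSIM setup, and it assumes a constant dynamic energy per FLOP, ignoring voltage scaling.

```python
def attainable_flops(freq_hz, flops_per_cycle, bandwidth_bytes_s, intensity):
    """Roofline: throughput is the lower of the compute peak and memory ceiling."""
    compute_peak = freq_hz * flops_per_cycle
    memory_ceiling = bandwidth_bytes_s * intensity  # intensity in FLOPs per byte
    return min(compute_peak, memory_ceiling)


def total_energy(work_flops, perf_flops_s, static_power_w, dyn_energy_per_flop_j):
    """Static energy scales with runtime; dynamic energy scales with work done."""
    latency_s = work_flops / perf_flops_s
    return static_power_w * latency_s + dyn_energy_per_flop_j * work_flops


# Hypothetical accelerator and workload parameters (not taken from the paper).
FLOPS_PER_CYCLE = 2048
BANDWIDTH = 100e9          # bytes per second
PREFILL_INTENSITY = 200.0  # FLOPs per byte: compute-bound
DECODE_INTENSITY = 1.0     # FLOPs per byte: memory-bound

prefill_lo = attainable_flops(700e6, FLOPS_PER_CYCLE, BANDWIDTH, PREFILL_INTENSITY)
prefill_hi = attainable_flops(1400e6, FLOPS_PER_CYCLE, BANDWIDTH, PREFILL_INTENSITY)
decode_lo = attainable_flops(700e6, FLOPS_PER_CYCLE, BANDWIDTH, DECODE_INTENSITY)
decode_hi = attainable_flops(1400e6, FLOPS_PER_CYCLE, BANDWIDTH, DECODE_INTENSITY)

# Same prefill work at both frequencies: shorter runtime means less leakage.
energy_lo = total_energy(1e12, prefill_lo, static_power_w=1.0, dyn_energy_per_flop_j=1e-12)
energy_hi = total_energy(1e12, prefill_hi, static_power_w=1.0, dyn_energy_per_flop_j=1e-12)
```

In this toy setting, doubling the frequency doubles prefill throughput and lowers total prefill energy, while decode throughput stays pinned at the bandwidth ceiling. That is the qualitative pattern the abstract reports.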
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
Authors: Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, Yong-Jin Liu
First: 2025-12-26T15:41:24+00:00 · Latest: 2025-12-26T15:41:24+00:00
Comments: Project page: https://streamavatar.github.io
Abstract
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io .
中文标题/摘要
标题:StreamAvatar:用于实时互动人体avatar的流式扩散模型
实时、流式互动avatar是数字人研究中一个关键但具有挑战性的目标。尽管基于扩散的人体avatar生成方法取得了显著的成功,但其非因果结构和高计算成本使其不适合流式传输。此外,现有的互动方法通常仅限于头部和肩部区域,限制了其产生手势和身体动作的能力。为了解决这些挑战,我们提出了一种两阶段自回归适应和加速框架,该框架应用自回归蒸馏和对抗性细化来适应高保真度的人体视频扩散模型,以实现实时、互动的流式传输。为了确保长期稳定性和一致性,我们引入了三个关键组件:参考汇流池(Reference Sink)、参考锚定位置重编码(RAPR)策略和一致性感知判别器。在此框架的基础上,我们开发了一种单样本(one-shot)的互动人体avatar模型,能够生成自然的说话和聆听行为并伴有连贯的手势。大量实验表明,我们的方法在生成质量、实时效率和互动自然度方面均达到了最先进的性能,超越了现有方法。项目页面:https://streamavatar.github.io
Summary / 总结
The research aims to develop real-time, streaming interactive human avatars, addressing the limitations of existing diffusion-based methods in terms of causality and computational efficiency. The proposed method uses a two-stage autoregressive adaptation and acceleration framework with autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model. Key components include a Reference Sink, a Reference-Anchored Positional Re-encoding strategy, and a Consistency-Aware Discriminator. The experiments show that the method outperforms existing approaches in generation quality, real-time efficiency, and interaction naturalness.
论文提出了一种两阶段自回归适应和加速框架,通过自回归蒸馏和对抗精炼来适应高保真度的人体视频扩散模型以实现实时流式交互。关键组件包括参考汇流池(Reference Sink)、参考锚定位置重编码(RAPR)策略和一致性感知判别器。该方法能够生成自然的说话和倾听行为,并具有连贯的手势,其生成质量、实时效率和交互自然度均优于现有方法。
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Authors: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
First: 2025-12-26T14:51:52+00:00 · Latest: 2025-12-26T14:51:52+00:00
Abstract
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
中文标题/摘要
标题:MAI-UI 技术报告:以现实为中心的 GUI 基础智能代理
GUI 代理的发展有可能革新下一代人机交互。受此愿景的驱动,我们提出了 MAI-UI,这是一个覆盖全尺寸范围的基础 GUI 代理家族,包括 2B、8B、32B 和 235B-A22B 等版本。我们确定了实现现实部署的四个关键挑战:缺乏原生的代理-用户交互、仅依赖 UI 操作的局限、缺乏实用的部署架构以及在动态环境中的脆弱性。MAI-UI 通过统一的方法解决了这些问题:一个自进化的数据管道,扩展导航数据以包括用户交互和 MCP 工具调用;一个根据任务状态路由执行的原生设备-云协作系统;以及一个具有高级优化的在线 RL 框架,用于扩展并行环境和上下文长度。MAI-UI 在 GUI 定位(grounding)和移动导航方面均取得了新的最先进水平。在定位基准测试中,它在 ScreenSpot-Pro 上达到 73.5%,在 MMBench GUI L2 上达到 91.3%,在 OSWorld-G 上达到 70.9%,在 UI-Vision 上达到 49.2%,并在 ScreenSpot-Pro 上超过了 Gemini-3-Pro 和 Seed1.8。在移动 GUI 导航方面,它在 AndroidWorld 上创下 76.7% 的新 SOTA,超过了 UI-Tars-2、Gemini-2.5-Pro 和 Seed1.8。在 MobileWorld 上,MAI-UI 获得了 41.7% 的成功率,显著优于端到端 GUI 模型,并可与基于 Gemini-3-Pro 的智能体框架相媲美。我们的在线 RL 实验表明,将并行环境从 32 个扩展到 512 个带来了 5.2 个百分点的提升,将环境步数预算从 15 增加到 50 带来了 4.3 个百分点的提升。最后,原生设备-云协作系统将设备端性能提高了 33%,减少了超过 40% 的云模型调用,并保护了用户隐私。
Summary / 总结
The research aims to advance human-computer interaction through GUI agents. MAI-UI, a family of foundation GUI agents ranging from 2B to 235B-A22B, targets four deployment challenges: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. It addresses them with a self-evolving data pipeline, a native device-cloud collaboration system, and an online RL framework. MAI-UI reports 73.5% on ScreenSpot-Pro for grounding and 76.7% on AndroidWorld for mobile navigation, and scaling parallel environments from 32 to 512 adds 5.2 points.
研究旨在通过开发GUI代理来提升人机交互。MAI-UI是一系列基础GUI代理,旨在解决缺乏原生代理-用户交互、仅依赖UI操作的局限、缺乏实用部署架构以及动态环境中的脆弱性等关键问题。它采用统一的方法,包括自进化数据管道、原生设备-云协作系统和在线RL框架。实验结果显示,MAI-UI在GUI定位基准测试和移动GUI导航任务中超越了现有模型,并且扩展并行环境规模带来了显著的性能提升。
Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Authors: Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang
First: 2025-12-26T14:48:58+00:00 · Latest: 2025-12-26T14:48:58+00:00
Abstract
Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.
中文标题/摘要
标题:面向提示驱动视频分割基础模型的后门攻击
提示驱动视频分割基础模型(VSFMs)如SAM2在自动驾驶和数字病理学等应用中越来越广泛部署,引发了后门威胁的担忧。令人惊讶的是,我们发现直接将经典后门攻击(如BadNet)转移到VSFMs几乎无效,ASR低于5%。为了理解这一现象,我们研究了编码器梯度和注意力图,并观察到常规训练保持干净样本和触发样本的梯度几乎对齐,同时注意力仍然集中在真实物体上,防止编码器学习与触发相关的独特表示。为了解决这一挑战,我们提出了BadVSFM,这是第一个针对提示驱动VSFMs的后门框架。BadVSFM采用两阶段策略:(1)引导图像编码器,使触发帧映射到指定的目标嵌入,同时保持干净帧与干净参考编码器对齐;(2)训练掩码解码器,使不同提示类型下的触发帧-提示对生成共享的目标掩码,而干净输出保持接近参考解码器。在两个数据集和五种VSFMs上的广泛实验表明,BadVSFM在多种触发和提示下实现了强大的可控后门效果,同时保持了干净分割的质量。损失、阶段、目标、触发设置和污染率的消融实验表明,该框架对合理的超参数变化具有鲁棒性,并证实了两阶段设计的必要性。最后,梯度冲突分析和注意力可视化表明,BadVSFM将触发和干净表示分离,并将注意力转移到触发区域,而四种代表性防御措施基本无效,揭示了当前VSFMs中未被充分探索的漏洞。
Summary / 总结
This study addresses backdoor attacks on Prompt-driven Video Segmentation Foundation Models (VSFMs) like SAM2, which are used in applications such as autonomous driving and digital pathology. The research finds that traditional backdoor attacks are ineffective on VSFMs. To overcome this, the authors propose BadVSFM, a two-stage framework that steers the image encoder and trains the mask decoder to achieve strong, controllable backdoor effects while maintaining clean segmentation quality. Experiments show that BadVSFM works effectively across various triggers and prompts and is robust to reasonable hyperparameter changes.
论文研究了提示驱动的视频分割基础模型(VSFMs)面对后门攻击的脆弱性。由于传统后门攻击在此类模型上几乎无效,作者提出了BadVSFM框架,分两阶段引导图像编码器并训练掩码解码器,从而实现强大且可控的后门效果。实验结果显示,BadVSFM在保持干净分割质量的同时,在多种触发器和提示下均能有效实施后门攻击。消融实验证实了该方法对合理超参数变化的鲁棒性以及两阶段设计的必要性,分析还表明该方法能够分离触发表示与干净表示。
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Authors: Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong
First: 2025-06-16T13:24:50+00:00 · Latest: 2025-12-26T14:33:32+00:00
Abstract
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
中文标题/摘要
标题:揭开语言模型的学习心智:认知框架与实证研究
大型语言模型(LLMs)在数学、编程和推理等任务上展现了令人印象深刻的性能,然而它们的学习能力——这对于适应动态环境和获取新知识至关重要——仍然未被充分探索。在本研究中,我们通过引入受认知心理学和教育启发的框架来填补这一空白。具体而言,我们将一般的学习能力分解为三个相互补充的维度:从导师学习(通过明确指导获取知识)、从概念学习(内化抽象结构并在新情境中泛化)和从经验学习(通过累积探索和反馈进行适应)。我们对这三个学习维度进行了全面的实证研究,并发现了几个有价值的发现,例如(i)互动可以提高学习效果;(ii)概念理解是规模涌现的,并且有利于更大的模型;(iii)LLMs 是有效的少样本学习者但不是多样本学习者。基于我们的框架和实证发现,我们引入了一个基准,该基准提供了对LLMs在三个认知学习维度上一般学习能力的统一和现实的评估。它提供了诊断性的见解,并支持对更适应性和类人模型的评估和开发。
Summary / 总结
This study addresses the underexplored learning ability of large language models (LLMs) by proposing a framework inspired by cognitive psychology. The framework decomposes learning into three dimensions: Learning from Instructor, Learning from Concept, and Learning from Experience. The empirical study across these dimensions reveals that interaction enhances learning, conceptual understanding scales with model size, and LLMs excel in few-shot learning but not many-shot learning. The study introduces a benchmark to evaluate LLMs' general learning abilities comprehensively.
该研究旨在通过借鉴认知心理学的方法,探索大型语言模型(LLMs)的学习机制。研究将学习分解为三个维度:从指导者学习、从概念学习和从经验学习。跨这些维度的实证研究发现,互动可以提高学习效果,概念理解随模型规模增加而增强,而LLMs在少量示例学习中表现出色但在大量示例学习中表现不佳。研究还引入了一个基准,用于评估LLMs在这些维度上的通用学习能力。
Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding
Authors: S M A Sharif, Abdur Rehman, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
First: 2025-09-22T13:51:09+00:00 · Latest: 2025-12-26T14:28:16+00:00
Abstract
Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
中文标题/摘要
标题:基于潜在先验编码的综合降解感知图像恢复
现实世界中的图像往往遭受空间上多样的降解,如雾霾、雨、雪和低光照,严重影响了视觉质量和下游视觉任务。现有的综合降解恢复(AIR)方法要么依赖外部文本提示,要么嵌入手工构建的架构先验(例如,频率启发式方法);这两种方法都施加了离散且脆弱的假设,削弱了对未见过或混合降解的泛化能力。为了解决这一局限性,我们提出将AIR重新定义为学习潜在先验推理,其中降解感知的表示可以从输入中自动推断,无需显式的任务提示。基于潜在先验,我们将AIR形式化为一种结构化推理范式:(1)哪些特征进行路由(自适应特征选择),(2)在哪里恢复(空间定位),(3)恢复什么(降解语义)。我们设计了一个轻量级解码模块,有效地利用这些潜在编码线索进行空间自适应恢复。在六种常见降解任务、五种复合设置以及未见过的降解中进行的广泛实验表明,我们的方法优于现有最佳方法(SOTA),平均PSNR提高了1.68 dB,同时效率提高了三倍。
Summary / 总结
The paper addresses the challenge of restoring real-world images with diverse degradations such as haze, rain, snow, and low-light conditions. It proposes a degradation-aware all-in-one image restoration method that infers latent priors from the input image without external prompts or hand-crafted priors. The method formulates the restoration process as structured reasoning, focusing on adaptive feature selection, spatial localization, and degradation semantics. Experimental results show that the proposed method outperforms state-of-the-art approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
论文针对雾、雨和低光照等多种因素导致的现实世界图像退化问题,提出了一种基于学习到的潜在先验的一体化图像恢复(AIR)方法,该方法能够自动从输入中推断出退化感知的表示,无需外部提示或手工构建的先验。该方法将AIR建模为一个结构化的推理过程,关注自适应特征选择、空间定位和退化语义。实验结果表明,该方法的平均PSNR比现有最先进的方法高1.68 dB,并且效率提高了三倍。
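To put the reported 1.68 dB gain in perspective, the standard PSNR definition, PSNR = 10 · log10(MAX² / MSE), can be inverted to find the corresponding reduction in mean squared error. The sketch below is a back-of-the-envelope conversion for a single image; an average PSNR gain over a test set maps to an MSE ratio only approximately. The baseline MSE of 100 is an arbitrary illustrative value, not a number from the paper.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(1.68)   # roughly 1.47
mse_reduction = 1.0 - 1.0 / ratio  # roughly a 32% lower MSE

# Sanity check with an arbitrary baseline MSE of 100 on 8-bit images.
baseline_mse = 100.0
gain = psnr(baseline_mse / ratio) - psnr(baseline_mse)
```

Because PSNR is logarithmic, the gain is independent of the baseline: any image whose MSE drops by this factor gains the same 1.68 dB.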
Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model
Authors: Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing
First: 2025-12-23T17:42:16+00:00 · Latest: 2025-12-26T14:11:13+00:00
Abstract
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
中文标题/摘要
标题:推进多模态教师情感分析:T-MED数据集与有效的AAM-TSA模型
教师的情感状态在教育场景中至关重要,深刻影响着教学效果、学生参与度和学习成就。然而,现有研究往往由于表演性特征而未能准确捕捉教师的情感,忽视了教学信息对情感表达的关键影响。本文系统地研究了教师情感分析,构建了相应的数据集和模型。我们构建了首个大规模教师多模态情感分析数据集T-MED。为确保标注准确性和效率,我们采用了人机协作标注过程。T-MED数据集包含来自11个学科的250个真实教室的14,938个教师情感数据实例,涵盖从小学到高等教育,整合了多模态文本、音频、视频和教学信息。此外,我们提出了一种新颖的非对称注意力机制多模态教师情感分析模型AAM-TSA。AAM-TSA引入了非对称注意力机制和分层门控单元,以实现跨模态特征的差异化融合和精确的情感分类。实验结果表明,AAM-TSA在T-MED数据集上的准确性和可解释性显著优于现有最先进的方法。
Summary / 总结
This paper addresses the importance of teachers' emotional states in education by developing the T-MED dataset and the AAM-TSA model. T-MED is a large-scale multimodal dataset that includes 14,938 instances of teacher emotional data from 250 classrooms, integrating text, audio, video, and instructional information. AAM-TSA, a novel asymmetric attention-based model, effectively fuses cross-modal features and achieves superior accuracy and interpretability compared to existing methods on the T-MED dataset.
本文通过开发大规模多模态数据集T-MED和基于非对称注意力机制的模型AAM-TSA,研究教师情绪状态在教育中的重要作用。T-MED包含来自250个教室的14,938个教师情绪数据实例,整合了文本、音频、视频和教学信息。AAM-TSA利用非对称注意力机制和分层门控单元实现跨模态特征融合和情绪分类,在T-MED数据集上的准确性和可解释性均优于现有方法。
From In Silico to In Vitro: Evaluating Molecule Generative Models for Hit Generation
Authors: Nagham Osman, Vittorio Lembo, Giovanni Bottegoni, Laura Toni
Venue: NeurIPS 2025
First: 2025-12-26T14:02:59+00:00 · Latest: 2025-12-26T14:02:59+00:00
Abstract
Hit identification is a critical yet resource-intensive step in the drug discovery pipeline, traditionally relying on high-throughput screening of large compound libraries. Despite advancements in virtual screening, these methods remain time-consuming and costly. Recent progress in deep learning has enabled the development of generative models capable of learning complex molecular representations and generating novel compounds de novo. However, using ML to replace the entire drug-discovery pipeline is highly challenging. In this work, we rather investigate whether generative models can replace one step of the pipeline: hit-like molecule generation. To the best of our knowledge, this is the first study to explicitly frame hit-like molecule generation as a standalone task and empirically test whether generative models can directly support this stage of the drug discovery pipeline. Specifically, we investigate if such models can be trained to generate hit-like molecules, enabling direct incorporation into, or even substitution of, traditional hit identification workflows. We propose an evaluation framework tailored to this task, integrating physicochemical, structural, and bioactivity-related criteria within a multi-stage filtering pipeline that defines the hit-like chemical space. Two autoregressive and one diffusion-based generative models were benchmarked across various datasets and training settings, with outputs assessed using standard metrics and target-specific docking scores. Our results show that these models can generate valid, diverse, and biologically relevant compounds across multiple targets, with a few selected GSK-3β hits synthesized and confirmed active in vitro. We also identify key limitations in current evaluation metrics and available training data.
中文标题/摘要
标题:从虚拟到体外:评估分子生成模型在先导化合物发现中的应用
先导化合物的识别是药物发现管道中一个关键但资源密集的步骤,传统上依赖于对大型化合物库进行高通量筛选。尽管虚拟筛选技术取得了进展,但这些方法仍然耗时且成本高昂。最近深度学习的进步使得能够开发出能够学习复杂分子表示并生成全新化合物的生成模型。然而,使用机器学习完全替代药物发现管道是极具挑战性的。在这项工作中,我们转而探讨生成模型是否可以替代药物发现管道中的一个步骤:先导化合物样分子的生成。据我们所知,这是首次将先导化合物样分子的生成明确地作为独立任务进行研究,并实证测试生成模型是否可以直接支持药物发现管道的这一阶段。具体来说,我们研究了这些模型是否可以被训练以生成先导化合物样分子,从而直接纳入甚至替代传统的先导化合物识别工作流程。我们提出了一种针对该任务的评估框架,将物理化学、结构和生物活性相关标准整合到多阶段筛选管道中,以定义先导化合物样化学空间。两种自回归和一种基于扩散的生成模型在各种数据集和训练设置下进行了基准测试,输出使用标准指标和靶点特异性对接评分进行评估。我们的结果显示,这些模型可以针对多个靶点生成有效、多样且具有生物学相关性的化合物,其中少数经筛选的GSK-3β先导化合物已被合成,并在体外实验中确认具有活性。我们还指出了当前评估指标和可用训练数据中的关键局限性。
Summary / 总结
This study evaluates whether generative models can handle hit-like molecule generation, one step of the drug discovery pipeline that traditionally relies on high-throughput screening. The authors propose an evaluation framework that integrates physicochemical, structural, and bioactivity-related criteria in a multi-stage filtering pipeline. Two autoregressive models and one diffusion-based model were benchmarked and generated valid, diverse, and biologically relevant compounds across multiple targets. A few selected GSK-3β hits were synthesized and confirmed active in vitro. The study also identifies limitations in current evaluation metrics and available training data.
该研究评估了生成模型在药物发现中用于生成类似候选药物的分子的能力,解决了传统候选药物识别过程中的资源密集问题。研究人员提出了一种评估框架,结合了物理化学、结构和生物活性标准。他们对两种自回归和一种扩散生成模型进行了基准测试,发现这些模型能够生成有效的、多样化的和生物相关的化合物。值得注意的是,一些合成的候选药物在GSK-3β靶点上被证实具有活性,这表明生成模型在药物发现工作流程中的潜力。
LibContinual: A Comprehensive Library towards Realistic Continual Learning
Authors: Wenbin Li, Shangge Liu, Borui Kang, Yiyang Chen, KaXuan Lew, Yang Chen, Yinghuan Shi, Lei Wang, Yang Gao, Jiebo Luo
First: 2025-12-26T13:59:13+00:00 · Latest: 2025-12-26T13:59:13+00:00
Abstract
A fundamental challenge in Continual Learning (CL) is catastrophic forgetting, where adapting to new tasks degrades the performance on previous ones. While the field has evolved with diverse methods, this rapid surge in diverse methodologies has culminated in a fragmented research landscape. The lack of a unified framework, including inconsistent implementations, conflicting dependencies, and varying evaluation protocols, makes fair comparison and reproducible research increasingly difficult. To address this challenge, we propose LibContinual, a comprehensive and reproducible library designed to serve as a foundational platform for realistic CL. Built upon a high-cohesion, low-coupling modular architecture, LibContinual integrates 19 representative algorithms across five major methodological categories, providing a standardized execution environment. Meanwhile, leveraging this unified framework, we systematically identify and investigate three implicit assumptions prevalent in mainstream evaluation: (1) offline data accessibility, (2) unregulated memory resources, and (3) intra-task semantic homogeneity. We argue that these assumptions often overestimate the real-world applicability of CL methods. Through our comprehensive analysis using strict online CL settings, a novel unified memory budget protocol, and a proposed category-randomized setting, we reveal significant performance drops in many representative CL methods when subjected to these real-world constraints. Our study underscores the necessity of resource-aware and semantically robust CL strategies, and offers LibContinual as a foundational toolkit for future research in realistic continual learning. The source code is available at https://github.com/RL-VIG/LibContinual.
中文标题/摘要
标题:LibContinual:面向现实连续学习的综合库
连续学习(CL)中的一个基本挑战是灾难性遗忘,即适应新任务会降低之前任务的性能。尽管该领域已经发展出了多种多样的方法,但这些快速涌现的方法论多样性导致了研究景观的碎片化。缺乏统一框架,包括不一致的实现、冲突的依赖关系和不同的评估协议,使得公平比较和可再现研究变得越来越困难。为了解决这一挑战,我们提出了LibContinual,这是一个全面且可再现的库,旨在作为现实CL的基础平台。基于高内聚、低耦合的模块化架构,LibContinual集成了五个主要方法论类别中的19种代表性算法,提供了一个标准化的执行环境。同时,利用这一统一框架,我们系统地识别并研究了主流评估中普遍存在的三种隐含假设:(1)离线数据可访问性,(2)未加限制的内存资源,(3)任务内语义同质性。我们认为这些假设往往高估了CL方法在现实世界中的适用性。通过使用严格的在线CL设置、一种新的统一内存预算协议和一个提出的类别随机化设置进行全面分析,我们揭示了许多代表性CL方法在面对这些现实约束时的显著性能下降。我们的研究强调了资源感知和语义稳健的CL策略的必要性,并为未来在现实连续学习中的研究提供LibContinual作为基础工具包。源代码可在https://github.com/RL-VIG/LibContinual获取。
Summary / 总结
LibContinual is a comprehensive library designed to address the challenge of catastrophic forgetting in Continual Learning (CL) by providing a unified and reproducible platform. It integrates 19 representative algorithms across five categories and systematically investigates three implicit assumptions in mainstream evaluation, revealing significant performance drops under real-world constraints. This study highlights the need for resource-aware and semantically robust CL strategies and offers LibContinual as a foundational toolkit for future research.
LibContinual 是一个综合性的库,旨在通过提供统一的框架、一致的实现和评估协议来解决持续学习研究中的碎片化问题。它整合了五个类别中的19种代表性算法,并系统地研究了主流评估中的三个隐含假设,揭示了在实际约束下许多代表性持续学习方法的性能显著下降。这项研究强调了资源感知和语义稳健的持续学习策略的必要性。
Accelerating Training Speed of Tiny Recursive Models with Curriculum Guided Adaptive Recursion
Authors: Kaleem Ullah Qasim, Jiashu Zhang
First: 2025-11-11T08:17:23+00:00 · Latest: 2025-12-26T13:40:53+00:00
Abstract
Background: Recursive reasoning models achieve strong performance through iterative refinement, allowing small networks to match large language models. However, training is computationally expensive, often requiring 36 GPU-hours for Sudoku extreme. Existing models use fixed recursion depth and uniform supervision weighting, leading to inefficient training. Objectives: We propose CGAR (Curriculum-Guided Adaptive Recursion), applying curriculum learning to architectural depth. CGAR introduces Progressive Depth Curriculum (PDC) to dynamically adjust recursion depth and Hierarchical Supervision Weighting (HSW) to apply exponentially decaying importance to supervision steps. Methods: PDC implements a three-stage schedule transitioning from shallow (2, 1) to full depth (6, 3) configurations, providing 41.4% FLOPs reduction. HSW applies exponential decay to supervision steps, achieving 40% gradient variance reduction and accelerated convergence. Results: On Sudoku-Extreme, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours) with only a 0.63% accuracy drop (86.65% to 86.02%). PDC alone achieves 2.26x speedup with 85.47% accuracy, showing a Pareto improvement in efficiency and quality. HSW provides 1.61x speedup. CGAR-trained models show superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Conclusions: CGAR enables efficient training of recursive models on modest hardware. By treating depth as a scheduled parameter, it achieves substantial savings and prevents overfitting, making these models practical for neurosymbolic AI and program synthesis. https://github.com/Kaleemullahqasim/CGAR and huggingface.co/Kaleemullah/trm-cgar-sudoku.
中文标题/摘要
标题:使用课程引导自适应递归加速小型递归模型的训练速度
背景:递归推理模型通过迭代细化实现强大的性能,使小型网络能够匹配大型语言模型。然而,训练计算成本高昂,通常需要36个GPU小时来完成数独极端任务。现有模型使用固定的递归深度和均匀的监督权重,导致训练效率低下。目标:我们提出了CGAR(课程引导自适应递归),将课程学习应用于架构深度。CGAR引入了渐进深度课程(PDC)来动态调整递归深度,并引入了层次监督权重(HSW)来对监督步骤的重要性进行指数衰减。方法:PDC采用三阶段计划,从浅层(2, 1)过渡到全深度(6, 3)配置,提供41.4%的FLOPs减少。HSW对监督步骤应用指数衰减,实现40%的梯度方差减少和加速收敛。结果:在数独极端任务上,CGAR实现了1.71倍的训练加速(从10.93小时到6.38小时),准确率仅下降0.63%(从86.65%到86.02%)。PDC单独实现了2.26倍的加速,准确率为85.47%,显示了效率和质量的帕累托改进。HSW提供了1.61倍的加速。CGAR训练的模型在推理效率上表现出色,具有100%的停止准确率和11%更少的推理步骤。结论:CGAR使递归模型在有限硬件上高效训练成为可能。通过将深度视为计划参数,它实现了显著的节省并防止过拟合,使这些模型适用于神经符号AI和程序合成。https://github.com/Kaleemullahqasim/CGAR 和 huggingface.co/Kaleemullah/trm-cgar-sudoku。
Summary / 总结
The paper proposes CGAR (Curriculum-Guided Adaptive Recursion) to accelerate the training of recursive models, particularly for tasks like Sudoku. It introduces PDC (Progressive Depth Curriculum) to dynamically adjust recursion depth and HSW (Hierarchical Supervision Weighting) to weight supervision steps exponentially. On Sudoku-Extreme, CGAR achieves a 1.71x training speedup with minimal accuracy drop, while PDC alone provides a 2.26x speedup. HSW offers a 1.61x speedup. CGAR-trained models show better inference efficiency with higher halting accuracy and fewer reasoning steps. This method enables efficient training on modest hardware, making recursive models practical for neurosymbolic AI and program synthesis.
论文提出了CGAR(Curriculum-Guided Adaptive Recursion),以加速小型递归模型的训练,如用于数独解决的模型。它引入了渐进深度课程(PDC)来动态调整递归深度,以及层次监督权重(HSW)来对监督步骤应用指数衰减的重要性。在Sudoku-Extreme上,CGAR实现了1.71倍的训练加速,仅略有精度下降,而PDC单独使用则提供了2.26倍的加速,精度略有损失。HSW也提供了1.61倍的加速。CGAR训练的模型在推理效率上表现出色,具有100%的停止准确性和更少的推理步骤。
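The HSW idea and the headline ratios above can be sketched roughly as follows. This is only an illustration: the decay rate `0.7`, the number of supervision steps, the normalization to a sum of 1, and the helper names `hsw_weights` and `weighted_loss` are assumptions, since the abstract does not specify them. The timing and accuracy figures are the ones quoted in the abstract.

```python
def hsw_weights(num_steps, decay=0.7):
    """Exponentially decaying weights over supervision steps, normalized to sum to 1.

    `decay` and the normalization are illustrative assumptions, not values from the paper.
    """
    raw = [decay ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_loss(step_losses, weights):
    """Combine per-step losses into one scalar using the given weights."""
    return sum(w * loss for w, loss in zip(weights, step_losses))


# Headline numbers quoted in the abstract (Sudoku-Extreme).
baseline_hours, cgar_hours = 10.93, 6.38
baseline_acc, cgar_acc = 86.65, 86.02

speedup = baseline_hours / cgar_hours   # ratio of training times
accuracy_drop = baseline_acc - cgar_acc  # in percentage points

weights = hsw_weights(4)
print([round(w, 3) for w in weights])
print(round(speedup, 2), round(accuracy_drop, 2))
```

Recomputing the ratios this way is a quick consistency check on the reported 1.71x speedup and 0.63-point accuracy drop.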
Direction Finding with Sparse Arrays Based on Variable Window Size Spatial Smoothing
Authors: Wesley S. Leite, Rodrigo C. de Lamare, Yuriy Zakharov, Wei Liu, Martin Haardt
First: 2025-12-26T13:08:03+00:00 · Latest: 2025-12-26T13:08:03+00:00
Comments: 2 figures, 5 pages
Abstract
In this work, we introduce a variable window size (VWS) spatial smoothing framework that enhances coarray-based direction of arrival (DOA) estimation for sparse linear arrays. By compressing the smoothing aperture, the proposed VWS Coarray MUSIC (VWS-CA-MUSIC) and VWS Coarray root-MUSIC (VWS-CA-rMUSIC) algorithms replace part of the perturbed rank-one outer products in the smoothed coarray data with unperturbed low-rank additional terms, increasing the separation between signal and noise subspaces while preserving the signal subspace span. We also derive bounds that guarantee identifiability by limiting the values that the compression parameter can assume. Simulations with sparse geometries reveal significant performance improvements and complexity savings relative to the fixed-window coarray MUSIC method.
中文标题/摘要
标题:基于可变窗宽空间平滑的稀疏阵列方向寻找
在本文中,我们提出了一种可变窗宽(VWS)空间平滑框架,以增强基于共阵列的方向到达(DOA)估计,适用于稀疏线性阵列。通过压缩平滑孔径,所提出的VWS共阵列MUSIC(VWS-CA-MUSIC)和VWS共阵列根MUSIC(VWS-CA-rMUSIC)算法用未受扰动的低秩附加项替换部分扰动的秩一外积,从而增加信号子空间和噪声子空间之间的分离度,同时保持信号子空间的跨度。我们还推导了保证可识别性的界限,通过限制压缩参数可以取的值。仿真结果表明,与固定窗宽共阵列MUSIC方法相比,该方法在性能和复杂度上均有显著改进。
Summary / 总结
The study introduces a variable window size (VWS) spatial smoothing framework to improve DOA estimation for sparse linear arrays using coarray-based methods. The VWS-CA-MUSIC and VWS-CA-rMUSIC algorithms replace part of the perturbed rank-one outer products with unperturbed low-rank terms, enhancing the separation between signal and noise subspaces. Simulations show that this approach offers better performance and lower complexity compared to the fixed-window coarray MUSIC method.
研究引入了一种可变窗口大小(VWS)空间平滑框架,以提高稀疏线性阵列的方向到达(DOA)估计。VWS-CA-MUSIC和VWS-CA-rMUSIC算法通过压缩平滑孔径,用未受扰的低秩附加项替换部分扰动的秩一外积,从而增强信号子空间和噪声子空间之间的分离。仿真结果显示,该方法在稀疏几何结构下相对于固定窗口大小的共阵MUSIC方法在性能和复杂度上都有显著改进。
Meta-Learning-Based Handover Management in NextG O-RAN
Authors: Michail Kalntis, George Iosifidis, José Suárez-Varela, Andra Lutu, Fernando A. Kuipers
First: 2025-12-26T13:01:46+00:00 · Latest: 2025-12-26T13:01:46+00:00
Abstract
While traditional handovers (THOs) have served as a backbone for mobile connectivity, they increasingly suffer from failures and delays, especially in dense deployments and high-frequency bands. To address these limitations, 3GPP introduced Conditional Handovers (CHOs) that enable proactive cell reservations and user-driven execution. However, both handover (HO) types present intricate trade-offs in signaling, resource usage, and reliability. This paper presents unique, countrywide mobility management datasets from a top-tier mobile network operator (MNO) that offer fresh insights into these issues and call for adaptive and robust HO control in next-generation networks. Motivated by these findings, we propose CONTRA, a framework that, for the first time, jointly optimizes THOs and CHOs within the O-RAN architecture. We study two variants of CONTRA: one where users are a priori assigned to one of the HO types, reflecting distinct service or user-specific requirements, as well as a more dynamic formulation where the controller decides on-the-fly the HO type, based on system conditions and needs. To this end, it relies on a practical meta-learning algorithm that adapts to runtime observations and guarantees performance comparable to an oracle with perfect future information (universal no-regret). CONTRA is specifically designed for near-real-time deployment as an O-RAN xApp and aligns with the 6G goals of flexible and intelligent control. Extensive evaluations leveraging crowdsourced datasets show that CONTRA improves user throughput and reduces both THO and CHO switching costs, outperforming 3GPP-compliant and Reinforcement Learning (RL) baselines in dynamic and real-world scenarios.
中文标题/摘要
标题:基于元学习的NextG O-RAN切换管理
尽管传统的切换(THOs)一直是移动连接的基础,但在密集部署和高频段中,它们越来越容易出现故障和延迟。为解决这些问题,3GPP引入了条件切换(CHOs),能够实现主动小区预留和用户驱动的执行。然而,这两种切换类型在信号、资源使用和可靠性方面都存在复杂的权衡。本文提供了来自顶级移动网络运营商(MNO)的全国范围内的移动管理数据集,揭示了这些问题的新见解,并呼吁在下一代网络中实现适应性和鲁棒的切换控制。受这些发现的启发,我们提出了CONTRA框架,这是首次在O-RAN架构中同时优化THOs和CHOs。我们研究了CONTRA的两种变体:一种是用户事先被分配到一种切换类型,反映不同的服务或用户特定需求,另一种是更动态的变体,控制器根据系统条件和需求实时决定切换类型。为此,它依赖于一种实用的元学习算法,该算法能够根据运行时观察进行调整,并保证性能与具有完美未来信息的先验无悔的oracle相当。CONTRA特别设计用于近实时部署为O-RAN xApp,并符合6G灵活和智能控制的目标。利用众包数据集的广泛评估表明,CONTRA提高了用户吞吐量,并减少了THO和CHO切换成本,在动态和现实场景中优于3GPP合规和强化学习(RL)基线。
Summary / 总结
This paper addresses the limitations of traditional handovers (THOs) and conditional handovers (CHOs) in dense deployments and high-frequency bands by proposing CONTRA, a framework that jointly optimizes THOs and CHOs within the O-RAN architecture. CONTRA uses a meta-learning algorithm to adapt to runtime observations and achieve performance comparable to an oracle with perfect future information. Evaluations show that CONTRA improves user throughput and reduces switching costs, outperforming 3GPP-compliant and RL baselines in dynamic scenarios.
针对传统切换(THOs)和条件切换(CHOs)在密集部署和高频段中的局限性,本文提出了CONTRA框架,在O-RAN架构中联合优化THOs和CHOs。CONTRA使用元学习算法适应运行时观察,并实现与具有完美未来信息的先知相当的性能。评估显示,CONTRA提高了用户吞吐量并减少了切换成本,在动态场景中优于符合3GPP标准的基线和强化学习(RL)基线。
Non-Resolution Reasoning (NRR): A Computational Framework for Contextual Identity and Ambiguity Preservation
Authors: Kei Saito
First: 2025-12-15T16:14:32+00:00 · Latest: 2025-12-26T12:48:34+00:00
Comments: v5: Major revision to Section 5. Replaced accuracy-based OOD evaluation with entropy-based functional verification (proof-of-concept). Clarified scope as architectural demonstration rather than comparative benchmark
Abstract
Current AI systems exhibit a fundamental limitation: they resolve ambiguity prematurely. This premature semantic collapse--collapsing multiple valid interpretations into single outputs--stems from classical identity assumptions in neural architectures. We propose Non-Resolution Reasoning (NRR), treating ambiguity retention as a valid reasoning mode. NRR introduces three principles: (1) Non-Identity ($A \neq A$)--the same symbol refers to different entities across contexts; (2) Approximate Identity ($A \approx A$)--entities share partial overlap without being identical; (3) Non-Resolution--conflicting interpretations coexist without forced convergence. We formalize these through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking (CIT). Functional verification via Turn 1 Entropy measurement shows NRR-lite maintains high entropy ($H = 0.63$) at ambiguous turns while standard architectures collapse early ($H = 0.10$), demonstrating that NRR preserves interpretive flexibility until context arrives. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
中文标题/摘要
标题:非解决性推理(NRR):一种面向上下文身份和歧义保留的计算框架
当前的人工智能系统存在一个根本性的局限性:它们会过早地解决歧义。这种过早的语义坍缩(将多个有效解释合并为单一输出)源自神经架构中的经典身份假设。我们提出了非解决性推理(NRR),将保留歧义视为一种有效的推理模式。NRR 引入了三个原则:(1)非同一性($A \neq A$):同一个符号在不同上下文中指代不同的实体;(2)近似同一性($A \approx A$):实体之间存在部分重叠但不完全相同;(3)非解决性:冲突的解释共存而不强制收敛。我们通过多向量嵌入、非坍缩注意力和上下文身份跟踪(CIT)来形式化这些原则。通过第1轮熵(Turn 1 Entropy)测量进行的功能验证显示,NRR-lite 在存在歧义的轮次中保持高熵($H = 0.63$),而标准架构则过早坍缩($H = 0.10$),表明NRR在上下文到来之前保留了解释的灵活性。问题不在于AI是否应该解决歧义,而在于何时、如何以及在谁的控制下解决。
Summary / 总结
The paper addresses the issue of AI systems resolving ambiguity prematurely by proposing Non-Resolution Reasoning (NRR), which treats ambiguity retention as a valid reasoning mode. NRR introduces three principles: Non-Identity, Approximate Identity, and Non-Resolution. It formalizes them through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking. A proof-of-concept functional verification based on Turn 1 Entropy shows that NRR-lite maintains high entropy (H = 0.63) at ambiguous turns, whereas standard architectures collapse early (H = 0.10), indicating that NRR preserves interpretive flexibility until context arrives.
论文针对AI系统过早解决歧义的问题,限制了其解释灵活性。提出了一种非解决推理(NRR)框架,包含三个原则:非同一性、近似同一性和非解决性。NRR通过多向量嵌入、非坍塌注意力和上下文同一性跟踪实现。实验结果显示,NRR在歧义回合中保持较高的熵(0.63),而标准架构则过早坍塌(0.10),表明NRR在充分理解上下文前保留了解释灵活性。
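To make the entropy comparison above concrete, here is a minimal sketch of Shannon entropy over a distribution of candidate interpretations. The abstract does not state the logarithm base, any normalization, or how many interpretations are tracked, so the base-2 default and the two example distributions below are hypothetical and are not meant to reproduce the reported H = 0.63 or H = 0.10.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Hypothetical distributions over three readings of an ambiguous word.
retained = [0.40, 0.35, 0.25]   # several interpretations kept alive
collapsed = [0.98, 0.01, 0.01]  # one interpretation dominates early

print(entropy(retained) > entropy(collapsed))
```

A higher value means probability mass is still spread across interpretations, which is the property the abstract describes as preserved interpretive flexibility.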
SketchPlay: Intuitive Creation of Physically Realistic VR Content with Gesture-Driven Sketching
Authors: Xiangwen Zhang, Xiaowei Dai, Runnan Chen, Xiaoming Chen, Zeke Zexi Hu
First: 2025-12-26T12:32:39+00:00 · Latest: 2025-12-26T12:32:39+00:00
Abstract
Creating physically realistic content in VR often requires complex modeling tools or predefined 3D models, textures, and animations, which present significant barriers for non-expert users. In this paper, we propose SketchPlay, a novel VR interaction framework that transforms humans' air-drawn sketches and gestures into dynamic, physically realistic scenes, making content creation intuitive and playful like drawing. Specifically, sketches capture the structure and spatial arrangement of objects and scenes, while gestures convey physical cues such as velocity, direction, and force that define movement and behavior. By combining these complementary forms of input, SketchPlay captures both the structure and dynamics of user-created content, enabling the generation of a wide range of complex physical phenomena, such as rigid body motion, elastic deformation, and cloth dynamics. Experimental results demonstrate that, compared to traditional text-driven methods, SketchPlay offers significant advantages in expressiveness and user experience. By providing an intuitive and engaging creation process, SketchPlay lowers the entry barrier for non-expert users and shows strong potential for applications in education, art, and immersive storytelling.
中文标题/摘要
标题:SketchPlay:基于手势绘制的直观创建物理真实感VR内容
在VR中创建物理真实感的内容通常需要复杂的建模工具或预先定义的3D模型、纹理和动画,这对非专家用户来说构成了显著的障碍。本文提出了一种名为SketchPlay的新型VR交互框架,该框架能够将人类在空中绘制的草图和手势转化为动态的、物理真实感的场景,使内容创作变得直观且充满乐趣,如同绘画一般。具体而言,草图捕捉物体和场景的结构和空间布局,而手势则传达诸如速度、方向和力等物理线索,定义了运动和行为。通过结合这两种输入形式,SketchPlay能够捕捉用户创建内容的结构和动态,从而生成一系列复杂的物理现象,如刚体运动、弹性变形和布料动力学。实验结果表明,与传统的文本驱动方法相比,SketchPlay在表达能力和用户体验方面具有显著优势。通过提供一个直观且引人入胜的创作过程,SketchPlay降低了非专家用户的入门门槛,并展示了在教育、艺术和沉浸式叙事方面的强大应用潜力。
Summary / 总结
The paper introduces SketchPlay, a VR interaction framework that allows users to create physically realistic content through gesture-driven sketching. It captures the structure and spatial arrangement of objects with sketches and conveys physical cues with gestures. Experimental results show that SketchPlay offers better expressiveness and user experience compared to traditional text-driven methods, making content creation more intuitive and engaging for non-expert users. This approach has potential applications in education, art, and immersive storytelling.
论文介绍了SketchPlay,这是一种使用手势驱动绘图来创建物理现实内容的VR交互框架,解决了传统建模工具的复杂性问题。通过结合结构绘图和物理手势,SketchPlay使用户能够直观地生成动态场景,包括刚体运动、弹性变形和布料动力学。实验表明,与基于文本的方法相比,SketchPlay在表达能力和用户体验方面更具优势,使非专家用户能够更轻松地进行内容创作,并且在教育和艺术应用方面具有巨大潜力。
Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification
Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
First: 2025-03-26T15:47:50+00:00 · Latest: 2025-12-26T12:31:58+00:00
Comments: 13 pages, 4 figures. Accepted for publication at MIDL 2025
Abstract
The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.
中文标题/摘要
标题:模仿放射学滚动:一种用于3D胸部CT体积多标签异常分类的全局-局部注意力模型
随着计算机断层扫描(CT)检查数量的迅速增加,迫切需要自动化工具,如器官分割、异常分类和报告生成,以帮助放射科医生应对日益增长的工作量。由于数据的体数据性质和需要检测的多种异常,三维(3D)CT扫描的多标签分类是一项具有挑战性的任务。现有的基于卷积神经网络(CNNs)的深度学习方法难以有效捕捉长距离依赖关系,而视觉变换器需要大量的预训练,这为实际应用带来了挑战。此外,这些现有方法没有明确建模放射科医生在浏览CT扫描切片时的导航行为,这需要全局上下文理解和局部细节意识。在本研究中,我们提出了一种名为CT-Scroll的新型全局-局部注意力模型,专门设计用于模拟放射科医生在分析3D CT扫描时的滚动行为。我们的方法在两个公开数据集上进行了评估,并通过全面的实验和消融研究证明了其有效性,突显了每个模型组件的贡献。
Summary / 总结
This study addresses the challenge of multi-label classification of 3D CT scans by proposing CT-Scroll, a global-local attention model that mimics radiologists' scrolling behavior. The model combines global context with local detail, targeting limitations of existing CNNs (weak long-range dependency modeling) and Vision Transformers (heavy pre-training requirements). Experiments and an ablation study on two public datasets demonstrate its efficacy and the contribution of each model component.
该研究通过提出CT-Scroll模型,模仿放射科医生在分析3D CT扫描时的滚动行为,以应对多标签分类的挑战。该模型结合全局上下文理解和局部细节意识。在两个公开数据集上的实验和消融研究证明了其有效性,并显示了各模型组件的贡献。
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Authors: Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, Xipeng Qiu
First: 2025-10-15T14:51:36+00:00 · Latest: 2025-12-26T12:19:56+00:00
Abstract
Vision-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoints, robot initial states, language instructions, light conditions, background textures, and sensor noise. We comprehensively analyzed multiple state-of-the-art models and revealed consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
中文标题/摘要
标题:LIBERO-Plus: 视觉-语言-行动模型的深入鲁棒性分析
视觉-语言-行动(VLA)模型在机器人操作基准测试中取得了令人印象深刻的成功率,但这些结果可能掩盖了鲁棒性中的根本弱点。我们通过在七个维度上引入可控的扰动进行系统性的脆弱性分析:物体布局、相机视角、机器人初始状态、语言指令、光照条件、背景纹理和传感器噪声。我们全面分析了多个最先进的模型,并揭示了在表面上看似胜任的情况下存在一致的脆弱性。我们的分析揭示了关键的弱点:模型对扰动因素表现出极大的敏感性,包括相机视角和机器人初始状态,在适度扰动下性能从95%下降到低于30%。令人惊讶的是,模型对语言变化几乎不敏感,进一步的实验表明,模型倾向于完全忽略语言指令。我们的研究结果挑战了高基准分数等同于真正胜任的假设,并强调了评估实践需要在现实变异性下评估可靠性的必要性。
Summary / 总结
The research aims to evaluate the robustness of Vision-Language-Action (VLA) models used in robotic manipulation by introducing controlled perturbations across seven dimensions. The study reveals that these models are highly sensitive to factors like camera viewpoints and robot initial states, with performance dropping significantly under modest perturbations. Surprisingly, the models are largely insensitive to language variations, often ignoring language instructions. This suggests that high benchmark scores do not necessarily indicate true competency and highlights the need for more reliable evaluation methods.
研究旨在通过在七个维度上引入可控的扰动来揭示Vision-Language-Action (VLA)模型的鲁棒性问题。研究分析了多个最先进的模型,发现这些模型对相机视角和机器人初始状态等扰动极为敏感,性能会显著下降。令人惊讶的是,模型对语言变化几乎没有敏感性,往往完全忽略语言指令。这表明高基准分数并不一定意味着真正的能力,并强调了在现实条件下需要更可靠的评估方法的重要性。
When Unsupervised Domain Adaptation meets One-class Anomaly Detection: Addressing the Two-fold Unsupervised Curse by Leveraging Anomaly Scarcity
Authors: Nesryne Mejri, Enjie Ghorbel, Anis Kacem, Pavel Chernakov, Niki Foteinopoulou, Djamila Aouada
First: 2025-02-28T13:05:47+00:00 · Latest: 2025-12-26T12:15:40+00:00
Comments: Added acknowledgments
Abstract
This paper introduces the first fully unsupervised domain adaptation (UDA) framework for unsupervised anomaly detection (UAD). The performance of UAD techniques degrades significantly in the presence of a domain shift, difficult to avoid in a real-world setting. While UDA has contributed to solving this issue in binary and multi-class classification, such a strategy is ill-posed in UAD. This might be explained by the unsupervised nature of the two tasks, namely, domain adaptation and anomaly detection. Herein, we first formulate this problem that we call the two-fold unsupervised curse. Then, we propose a pioneering solution to this curse, considered intractable so far, by assuming that anomalies are rare. Specifically, we leverage clustering techniques to identify a dominant cluster in the target feature space. Posed as the normal cluster, the latter is aligned with the source normal features. Concretely, given a one-class source set and an unlabeled target set composed mostly of normal data and some anomalies, we fit the source features within a hypersphere while jointly aligning them with the features of the dominant cluster from the target set. The paper provides extensive experiments and analysis on common adaptation benchmarks for anomaly detection, demonstrating the relevance of both the newly introduced paradigm and the proposed approach. The code will be made publicly available.
中文标题/摘要
标题:当无监督领域适应遇到一类异常检测时:通过利用异常稀有性解决两重无监督诅咒
本文介绍了首个用于无监督异常检测(UAD)的完全无监督领域适应(UDA)框架。UAD技术在领域偏移存在的情况下性能显著下降,而在现实环境中难以避免。尽管UDA在二分类和多分类分类中解决了这一问题,但在UAD中采用这种策略是不合适的。这可能由两个任务的无监督性质解释,即领域适应和异常检测。本文首先提出了这一问题,称为两重无监督诅咒。然后,我们提出了一种开创性的解决方案,假设异常是稀有的。具体来说,我们利用聚类技术在目标特征空间中识别出一个主导簇,并将其视为正常簇,然后将其与源正常特征对齐。给定一个一类源集和一个主要由正常数据和一些异常组成的未标记目标集,我们拟合源特征在一个超球体内,并同时与目标集中主导簇的特征对齐。本文在常见的异常检测适应基准上进行了广泛的实验和分析,证明了新引入的范式和所提方法的相关性。代码将公开提供。
Summary / 总结
This paper addresses the challenge of unsupervised domain adaptation (UDA) in unsupervised anomaly detection (UAD), where performance drops due to domain shift. It introduces a novel framework that assumes anomalies are rare, using clustering to identify a dominant cluster in the target feature space as the normal cluster. The source features are then aligned with this cluster while fitting within a hypersphere. Experiments on common benchmarks show the effectiveness of this approach in addressing the two-fold unsupervised curse.
本文解决了无监督领域适应(UDA)在无监督异常检测(UAD)中的挑战,由于领域偏移,UAD技术的性能会显著下降。作者提出了一种新颖的框架,假设异常样本稀少,并使用聚类技术识别目标特征空间中的主导簇,将其与源正常特征对齐。在常见的适应基准上的实验表明,该方法能够有效处理两重无监督诅咒问题。
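The "anomalies are rare, so the dominant target cluster is normal" assumption above can be illustrated with a deliberately simplified sketch. It is not the authors' method: the cluster labels are assumed to come from some unspecified clustering step, the toy 2-D features are made up, and scoring by distance to the dominant cluster's centroid is a stand-in for the hypersphere fitting and source-target alignment the abstract describes. All function names are hypothetical.

```python
import math
from collections import Counter


def dominant_cluster(labels):
    """Return the label of the largest cluster (assumed to hold the normal samples)."""
    return Counter(labels).most_common(1)[0][0]


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def anomaly_scores(points, center):
    """Distance to the center; larger distance means more anomalous."""
    return [math.dist(p, center) for p in points]


# Toy target features: four near the origin, one far away.
target = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (0.0, -0.1), (5.0, 5.0)]
labels = [0, 0, 0, 0, 1]  # assumed output of a clustering step

normal_label = dominant_cluster(labels)
normal_points = [p for p, l in zip(target, labels) if l == normal_label]
center = centroid(normal_points)
scores = anomaly_scores(target, center)
print(normal_label, [round(s, 2) for s in scores])
```

The sketch only works when the rarity assumption holds; if anomalies formed the largest cluster in the target set, the roles would be inverted.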
LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration
Authors: Wen Jiang, Li Wang, Kangyao Huang, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hongwei Duan, Bin Xu, Xiangyang Ji
First: 2025-12-26T12:09:40+00:00 · Latest: 2025-12-26T12:09:40+00:00
Abstract
Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
中文标题/摘要
标题:长飞:长时距无人机视觉-语言导航中的时空上下文整合
无人驾驶飞行器(UAV)是灾后搜索与救援的关键工具,面临高信息密度、快速视角变化和动态结构等挑战,尤其是在长时距导航中。然而,当前的UAV视觉-语言导航(VLN)方法难以在复杂环境中建模长时距时空上下文,导致语义对齐不准确和路径规划不稳定。为此,我们提出长飞(LongFly),一种用于长时距无人机VLN的时空上下文建模框架。长飞提出了一种历史感知的时空建模策略,将碎片化和冗余的历史数据转化为结构化、紧凑且富有表现力的表示。首先,我们提出了基于槽的历史图像压缩模块,该模块动态地将多视角历史观察压缩为固定长度的上下文表示。然后,引入了时空轨迹编码模块以捕捉无人机轨迹的时间动态和空间结构。最后,为了整合现有时空上下文与当前观察,我们设计了提示引导的多模态集成模块,以支持基于时间的推理和稳健的航点预测。实验结果表明,长飞的成功率比最先进的UAV VLN基线高出7.89%,按路径长度加权的成功率高出6.33%,且在已见和未见环境中表现一致。
Summary / 总结
LongFly is a spatiotemporal context modeling framework designed for long-horizon UAV vision-and-language navigation, addressing challenges like high information density and dynamic structures. It uses a history-aware spatiotemporal modeling strategy to transform historical data into structured representations, captures temporal dynamics and spatial structure, and integrates this context with current observations for robust waypoint prediction. LongFly outperforms existing methods by 7.89% in success rate and 6.33% in success weighted by path length across both seen and unseen environments.
LongFly 是一个用于改进长航程无人机视觉-语言导航的框架,通过整合时空上下文。它使用时空建模策略将历史数据转换为结构化表示,捕捉时间动态和空间结构,并将现有时空上下文与当前观察结果结合以实现稳健的航点预测。实验表明,LongFly 在各种环境中的成功率提高了 7.89%,路径长度加权成功率提高了 6.33%。
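For readers unfamiliar with the second metric, "success weighted by path length" (SPL) is commonly defined in the navigation literature as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length, and p_i the length of the path actually taken. The sketch below assumes LongFly uses this common definition, which the abstract does not confirm; the episode data are made up.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by path length.

    Each episode is (success, shortest_path_length, taken_path_length).
    Failed episodes contribute 0; successful ones are discounted by detours.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Made-up episodes: an optimal success, a success with a 2x detour, and a failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(round(success_rate(episodes), 3), spl(episodes))
```

Because detours are penalized, SPL can never exceed the success rate, which is why the two numbers are usually reported together.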
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
Authors: Sarthak Mehrotra, Sairam V C Rebbapragada, Mani Hemanth Reddy Bonthu, Vineeth N Balasubramanian
First: 2025-12-26T12:09:15+00:00 · Latest: 2025-12-26T12:09:15+00:00
Abstract
Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
Summary / 总结
iSHIFT is a lightweight GUI agent that combines implicit reasoning with a perception control module to switch between a slow mode for precise interactions and a fast mode for efficiency. It uses special perception tokens to guide attention to relevant screen regions, enabling the model to decide both how to reason and where to focus. Despite its small size of 2.5B parameters, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
iSHIFT 是一个轻量级的 GUI 代理,结合了隐式推理和感知控制模块,在需要高精度的慢模式和追求效率的快模式之间切换。它使用特殊的感知标记来引导对相关屏幕区域的注意力,使模型能够决定推理方式和关注点。尽管其参数量仅为 25 亿,但 iSHIFT 在多个基准数据集上的性能与最先进的方法相当。
DuaDeep-SeqAffinity: Dual-Stream Deep Learning Framework for Sequence-Only Antigen-Antibody Affinity Prediction
Authors: Aicha Boutorh, Soumia Bouyahiaoui, Sara Belhadj, Nour El Yakine Guendouz, Manel Kara Laouar
First: 2025-12-26T12:06:59+00:00 · Latest: 2025-12-26T12:06:59+00:00
Abstract
Predicting the binding affinity between antigens and antibodies is fundamental to drug discovery and vaccine development. Traditional computational approaches often rely on experimentally determined 3D structures, which are scarce and computationally expensive to obtain. This paper introduces DuaDeep-SeqAffinity, a novel sequence-only deep learning framework that predicts affinity scores solely from their amino acid sequences using a dual-stream hybrid architecture. Our approach leverages pre-trained ESM-2 protein language model embeddings, combining 1D Convolutional Neural Networks (CNNs) for local motif detection with Transformer encoders for global contextual representation. A subsequent fusion module integrates these multi-faceted features, which are then passed to a fully connected network for final score regression. Experimental results demonstrate that DuaDeep-SeqAffinity significantly outperforms individual architectural components and existing state-of-the-art (SOTA) methods. DuaDeep achieved a superior Pearson correlation of 0.688, an R^2 of 0.460, and a Root Mean Square Error (RMSE) of 0.737, surpassing single-branch variants ESM-CNN and ESM-Transformer. Notably, the model achieved an Area Under the Curve (AUC) of 0.890, outperforming sequence-only benchmarks and even surpassing structure-sequence hybrid models. These findings prove that high-fidelity sequence embeddings can capture essential binding patterns typically reserved for structural modeling. By eliminating the reliance on 3D structures, DuaDeep-SeqAffinity provides a highly scalable and efficient solution for high-throughput screening of vast sequence libraries, significantly accelerating the therapeutic discovery pipeline.
中文标题/摘要
标题:DuaDeep-SeqAffinity:基于双流深度学习框架的仅序列抗原-抗体亲和力预测
抗原与抗体之间的结合亲和力预测是药物发现和疫苗开发的基础。传统的计算方法通常依赖于实验确定的三维结构,这些结构稀缺且计算成本高昂。本文介绍了一种名为DuaDeep-SeqAffinity的新颖仅序列深度学习框架,该框架仅从氨基酸序列中预测亲和力评分,采用双流混合架构。我们的方法利用预训练的ESM-2蛋白质语言模型嵌入,结合1D卷积神经网络(CNN)进行局部基序检测和Transformer编码器进行全局上下文表示。随后的融合模块整合了这些多方面的特征,然后传递给全连接网络进行最终评分回归。实验结果表明,DuaDeep-SeqAffinity显著优于各个架构组件和现有最先进的(SOTA)方法。DuaDeep实现了0.688的皮尔逊相关系数、0.460的R²和0.737的均方根误差(RMSE),超过了单分支变体ESM-CNN和ESM-Transformer。值得注意的是,该模型实现了0.890的曲线下面积(AUC),优于仅序列基准模型,甚至超过了结构-序列混合模型。这些发现证明了高保真度序列嵌入可以捕捉到通常由结构建模保留的关键结合模式。通过消除对三维结构的依赖,DuaDeep-SeqAffinity为大规模序列库的高通量筛选提供了高度可扩展和高效的解决方案,显著加速了治疗性发现流程。
Summary / 总结
DuaDeep-SeqAffinity is a sequence-only deep learning framework that predicts antigen-antibody binding affinity using a dual-stream hybrid architecture. It combines 1D CNNs for local motif detection and Transformer encoders for global contextual representation, integrating these features through a fusion module before regression. The model outperforms existing methods, achieving a Pearson correlation of 0.688, an R^2 of 0.460, and an RMSE of 0.737, with an AUC of 0.890, surpassing both sequence-only and structure-sequence hybrid models.
DuaDeep-SeqAffinity 是一种仅使用氨基酸序列来预测抗原抗体结合亲和力的深度学习框架。它采用双流架构,结合了1D 卷积神经网络进行局部模式检测和Transformer编码器进行全局上下文表示。实验结果显示,DuaDeep-SeqAffinity 在多个指标上均优于现有方法,包括皮尔逊相关系数0.688、R^2 0.460 和均方根误差0.737,AUC 达到0.890,超过了序列仅模型和结构-序列混合模型。
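The three regression metrics quoted above can be computed as in the following sketch, which uses the standard textbook definitions. The toy values are made up and have nothing to do with the paper's data; they are chosen to show that a perfect Pearson correlation can coexist with a low R^2 when predictions carry a constant offset, which is one reason the two figures are reported separately.

```python
import math


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy data: predictions are the targets shifted by a constant offset of 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
print(pearson(y_true, y_pred), r_squared(y_true, y_pred), rmse(y_true, y_pred))
```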
Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs
Authors: Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He
First: 2025-12-26T11:56:45+00:00 · Latest: 2025-12-26T11:56:45+00:00
Abstract
While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for Hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
中文标题/摘要
标题:仔细看看!一种对抗参数编辑框架以减轻VLM中的幻觉
视觉-语言模型(VLMs)因其有前景的实际应用而在AI社区中引起了越来越多的关注,但它们仍然存在持续的幻觉问题,生成的输出与视觉输入不一致。最近的研究将这些幻觉归因于VLMs过度依赖于语言先验和视觉特征整合不足,提出了启发式解码校准策略来减轻这些问题。然而,这些策略的不可训练性质固有限制了它们的优化潜力。为此,我们提出了一种对抗参数编辑框架,以减轻VLM中的幻觉,遵循“激活-定位-编辑-对抗”范式。具体来说,我们首先构建了一个激活数据集,其中包括与视觉特征紧密关联的响应(正样本)和反映LLM先验偏见和内部知识缺陷的幻觉响应(负样本)。然后,通过分析响应对的差异隐藏状态,我们确定了关键的幻觉易发参数簇。接着,使用注入对抗调优前缀的提示对这些簇进行微调,这些前缀旨在最大化视觉忽视,从而迫使模型优先考虑视觉证据而非固有的参数偏见。在生成性和判别性VLM任务上的评估表明,ALEAHallu在减轻幻觉方面具有显著效果。我们的代码可在https://github.com/hujiayu1223/ALEAHallu获取。
Summary / 总结
The paper addresses the hallucination issue in Vision-Language Models (VLMs) by proposing an adversarial parametric editing framework called ALEAHallu. It constructs an activation dataset with positive and negative samples to identify critical parameter clusters prone to hallucinations. These clusters are then fine-tuned using adversarial prefixes to prioritize visual evidence. Experiments show ALEAHallu effectively mitigates hallucinations in both generative and discriminative VLM tasks.
论文提出了一种对抗参数编辑框架ALEAHallu来解决Vision-Language模型(VLM)中的幻觉问题。该框架构建了一个包含正负样本的激活数据集,以识别易产生幻觉的关键参数簇。然后通过对抗前缀进行微调,使模型优先考虑视觉证据而非语言偏见。实验表明,ALEAHallu在生成性和判别性VLM任务中都能有效减轻幻觉现象。
A Lightweight Multi-Scale Attention Framework for Real-Time Spinal Endoscopic Instance Segmentation
Authors: Qi Lai, JunYan Li, Qiang Cai, Lei Wang, Tao Yan, XiaoKun Liang
First: 2025-12-26T11:07:06+00:00 · Latest: 2025-12-26T11:07:06+00:00
Abstract
Real-time instance segmentation for spinal endoscopy is important for identifying and protecting critical anatomy during surgery, but it is difficult because of the narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is also constrained by limited surgical hardware, so the model must balance accuracy and speed and remain stable under small-batch (even batch-1) training. We propose LMSF-A, a lightweight multi-scale attention framework co-designed across backbone, neck, and head. The backbone uses a C2f-Pro module that combines RepViT-style re-parameterized convolution (RVB) with efficient multi-scale attention (EMA), enabling multi-branch training while collapsing into a single fast path for inference. The neck improves cross-scale consistency and boundary detail using Scale-Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE), which strengthens high-resolution features. The head adopts a Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm to reduce parameters and support batch-1 stability. We also release the clinically reviewed PELD dataset (61 patients, 610 images) with instance masks for adipose tissue, bone, ligamentum flavum, and nerve. Experiments show that LMSF-A is highly competitive with, or better than, existing methods on all evaluation metrics while being much lighter than most instance segmentation methods, requiring only 1.8M parameters and 8.8 GFLOPs; it also generalizes well to a public teeth benchmark. Code and dataset: https://github.com/hhwmortal/PELD-Instance-segmentation.
中文标题/摘要
标题:一种用于实时脊柱内窥镜实例分割的轻量级多尺度注意力框架
脊柱内窥镜的实时实例分割对于识别和保护手术中的关键解剖结构非常重要,但由于视野狭窄、镜面反射、烟雾/出血、边界不清晰和尺度变化大等原因,这是一项挑战。部署还受到有限的手术硬件的限制,因此模型必须兼顾准确性和速度,并在小批次(甚至批次1)训练下保持稳定。我们提出了一种名为LMSF-A的轻量级多尺度注意力框架,该框架在骨干、颈部和头部上进行了协同设计。骨干使用C2f-Pro模块,结合了RepViT风格的重参数化卷积(RVB)和高效多尺度注意力(EMA),支持多分支训练,并在推理时合并为单一快速路径。颈部使用Scale-Sequence特征融合(SSFF)和三重特征编码(TFE)来提高跨尺度一致性和边界细节,从而增强高分辨率特征。头部采用轻量级多任务共享头(LMSH),使用共享卷积和GroupNorm来减少参数并支持批次1稳定性。我们还发布了经过临床审查的PELD数据集(61名患者,610张图像),其中包含脂肪组织、骨、黄韧带和神经的实例掩码。实验表明,LMSF-A在所有评估指标上都极具竞争力(甚至更优),并且仅需1.8M参数和8.8 GFLOPs,比大多数实例分割方法轻量得多,同时在公共牙齿基准上具有良好的泛化能力。代码和数据集:https://github.com/hhwmortal/PELD-Instance-segmentation.
Summary / 总结
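LMSF-A is a lightweight multi-scale attention framework for real-time spinal endoscopic instance segmentation, co-designed across backbone, neck, and head. Its C2f-Pro backbone combines RepViT-style re-parameterized convolution with efficient multi-scale attention, the neck uses Scale-Sequence Feature Fusion and Triple Feature Encoding to improve cross-scale consistency, and a lightweight shared head with GroupNorm supports batch-1 stability. The authors also release the PELD dataset (61 patients, 610 images). Experiments show that LMSF-A is competitive on all evaluation metrics while requiring only 1.8M parameters and 8.8 GFLOPs, and that it generalizes well to a public teeth benchmark.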
论文针对脊柱内镜手术中的实时实例分割难题,提出了LMSF-A轻量级多尺度注意力框架。该框架包含C2f-Pro模块,实现高效多尺度注意力和多分支训练,颈部采用Scale-Sequence Feature Fusion和Triple Feature Encoding增强跨尺度一致性,头部采用轻量级多任务共享头以减少参数并支持小批量稳定性。实验结果显示,LMSF-A在所有评估指标上表现优异或与之相当,且比大多数实例分割方法更轻量,仅包含1.8M参数和8.8 GFLOPs,并且在公共牙齿基准上表现出良好的泛化能力。
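The abstract's "multi-branch training while collapsing into a single fast path for inference" refers to structural re-parameterization. The following is a minimal sketch of that general idea in one dimension, not the LMSF-A implementation: it assumes a 3-tap branch, a 1-tap branch, and an identity shortcut, and it omits the normalization layers that real re-parameterized blocks also fold in. All function names are hypothetical.

```python
def conv1d_same(x, kernel):
    """Cross-correlate x with an odd-length kernel, zero-padded to keep the length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def multi_branch(x, k3, k1):
    """Training-time form: 3-tap branch + 1-tap branch + identity shortcut."""
    y3 = conv1d_same(x, k3)
    y1 = conv1d_same(x, [k1])
    return [a + b + c for a, b, c in zip(y3, y1, x)]


def merge_branches(k3, k1):
    """Fold the 1-tap branch and the identity into the centre tap of the 3-tap kernel."""
    merged = list(k3)
    merged[1] += k1 + 1.0  # identity is a 1-tap kernel with weight 1
    return merged


x = [1.0, 2.0, -1.0, 0.5]
k3 = [0.5, -1.0, 0.25]
k1 = 2.0

train_out = multi_branch(x, k3, k1)                 # three branches
infer_out = conv1d_same(x, merge_branches(k3, k1))  # one merged kernel
```

Because convolution is linear, the merged kernel reproduces the multi-branch output with a single pass, which is why the inference-time model can be faster without changing its predictions.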
Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models
Authors: Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng
First: 2025-12-26T10:23:30+00:00 · Latest: 2025-12-26T10:23:30+00:00
Abstract
Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
中文标题/摘要
标题:感知与校准:分析和增强医疗多模态大型语言模型的鲁棒性
医疗多模态大型语言模型(MLLMs)在临床应用中表现出有希望的性能。然而,它们对现实世界输入扰动的敏感性,如成像伪影和文本错误,严重削弱了其临床应用性。对医疗MLLMs中此类噪声影响的系统分析仍鲜有探索。此外,尽管有几项研究探讨了MLLMs在一般领域的鲁棒性,但它们主要集中在文本模态上,并依赖于昂贵的微调。这些方法不足以应对复杂的噪声模式并满足医学中的严格安全标准。为解决这一问题,本研究系统分析了各种扰动对医疗MLLMs在视觉和文本模态上的影响。基于我们的发现,我们提出了一种无需训练的固有增强多模态校准(IMC)框架,该框架遵循感知与校准原则,利用MLLMs的固有去噪能力来增强跨模态鲁棒性。对于视觉模态,我们提出了一种扰动感知去噪校准(PDC),利用MLLMs自身的视觉编码器识别噪声模式并进行原型引导的特征校准。对于文本去噪,我们设计了一种自我实例化多智能体系统(SMS),利用MLLMs的自我评估能力通过智能体合作层次结构来精炼噪声文本。我们构建了一个基准,包含2个数据集中的图像和文本模态11种类型的噪声。实验结果表明,我们的方法在多个模态上达到了最先进的性能,显示出增强MLLMs在实际临床场景中鲁棒性的潜力。
Summary / 总结
This work addresses the robustness issues of Medical Multi-modal Large Language Models (MLLMs) by analyzing their sensitivity to real-world perturbations and introducing an Inherent-enhanced Multi-modal Calibration (IMC) framework. The IMC framework includes a Perturbation-aware Denoising Calibration (PDC) for visual modality and a Self-instantiated Multi-agent System (SMS) for text denoising. The study constructs a benchmark with 11 types of noise across image and text modalities and shows that the proposed method outperforms existing approaches in enhancing MLLMs' robustness.
该研究分析了医疗多模态大型语言模型(MLLMs)在视觉和文本模态中对现实世界干扰的敏感性问题,并引入了基于模型固有去噪能力的跨模态鲁棒性增强框架Inherent-enhanced Multi-modal Calibration (IMC)。对于视觉模态,提出了一个扰动感知去噪校准(PDC)来识别和校准噪声模式;对于文本去噪,设计了一个自实例化多代理系统(SMS),通过代理的协作层次结构来精炼噪声文本。实验结果表明,该方法在多种模态下显著增强了MLLMs的鲁棒性,有望在临床场景中更好地应用。
CP-Agent: Agentic Constraint Programming
Authors: Stefan Szeider
First: 2025-08-10T19:59:01+00:00 · Latest: 2025-12-26T10:12:55+00:00
Abstract
Translating natural language into formal constraint models requires expertise in the problem domain and modeling frameworks. To investigate whether constraint modeling benefits from agentic workflows, we introduce CP-Agent, a Python coding agent using the ReAct framework with a persistent IPython kernel. Domain knowledge is provided through a project prompt of under 50 lines. The agent iteratively executes code, observes the solver's feedback, and refines models based on the execution results. We evaluate CP-Agent on CP-Bench's 101 constraint programming problems. We clarified the benchmark to address systematic ambiguities in problem specifications and errors in ground-truth models. On the clarified benchmark, CP-Agent solves all 101 problems. Ablation studies indicate that minimal guidance outperforms detailed procedural scaffolding, and that explicit task management tools have mixed effects on focused modeling tasks.
中文标题/摘要
标题:CP-Agent: 代理约束编程
将自然语言转换为形式化的约束模型需要在问题领域和建模框架方面具备专业知识。为了调查约束建模是否受益于代理工作流程,我们引入了CP-Agent,这是一种使用ReAct框架和持久IPython内核的Python编码代理。领域知识通过不到50行的项目提示提供。代理迭代执行代码,观察求解器的反馈,并根据执行结果改进模型。我们在CP-Bench的101个约束编程问题上评估了CP-Agent。我们澄清了基准测试以解决问题规范中的系统性歧义和地面真实模型中的错误。在澄清后的基准测试上,CP-Agent解决了所有101个问题。消融研究表明,最少的指导比详细的程序性支架更优,而明确的任务管理工具对集中建模任务的效果参差不齐。
Summary / 总结
The research aims to explore if constraint modeling can benefit from agentic workflows. CP-Agent, a Python coding agent using the ReAct framework, was developed to iteratively refine constraint models based on solver feedback. Evaluating CP-Agent on 101 constraint programming problems from CP-Bench, the agent successfully solved all problems after clarifying the benchmark. Ablation studies suggest that minimal guidance is more effective than detailed procedural scaffolding, and explicit task management tools have mixed effects on focused modeling tasks.
研究旨在通过使用带有ReAct框架和持久IPython内核的Python编码代理CP-Agent,探索代理工作流在约束建模中的优势。代理通过迭代根据求解器反馈来细化模型。在澄清基准后,CP-Agent成功解决了CP-Bench的101个约束编程问题。消融研究显示,最少指导比详细程序化支架更有效,而明确的任务管理工具对集中建模任务的效果参差不齐。
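The execute-observe-refine loop described for CP-Agent can be illustrated with a small sketch. This is not the CP-Agent code: the language model is replaced by a hypothetical fixed list of attempts, the persistent IPython kernel is approximated by a shared namespace dictionary passed to `exec`, and a brute-force enumeration stands in for a constraint solver.

```python
import io
import traceback
from contextlib import redirect_stdout


def run_cell(code, namespace):
    """Run code in a persistent namespace; return (ok, printed output or traceback)."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


# Hypothetical stand-in for the model's successive proposals.
SCRIPTED_ATTEMPTS = [
    # Attempt 1 is buggy: itertools is never imported.
    (
        "sols = [p for p in itertools.permutations(range(1, 4)) if p[0] < p[1]]\n"
        "print(len(sols))"
    ),
    # Attempt 2 reacts to the NameError observed in attempt 1.
    (
        "import itertools\n"
        "sols = [p for p in itertools.permutations(range(1, 4)) if p[0] < p[1]]\n"
        "print(len(sols))"
    ),
]


def agent_loop(attempts, max_steps=5):
    """Execute attempts in order until one succeeds, recording each observation."""
    namespace = {}  # state survives across steps, like a persistent kernel
    history = []
    for step, code in enumerate(attempts[:max_steps]):
        ok, observation = run_cell(code, namespace)
        history.append((step, ok, observation))
        if ok:
            break
    return namespace, history


namespace, history = agent_loop(SCRIPTED_ATTEMPTS)
```

In a real agent the next attempt would be generated from the recorded observation rather than read from a list; the sketch only shows how execution feedback and persistent state fit together.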
Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning
Authors: Haiyang Zheng, Nan Pu, Wenjing Li, Teng Long, Nicu Sebe, Zhun Zhong
First: 2025-12-14T12:31:28+00:00 · Latest: 2025-12-26T09:55:50+00:00
Comments: Accepted by AAAI2026
Abstract
The proliferation of synthetic facial imagery has intensified the need for robust Open-World DeepFake Attribution (OW-DFA), which aims to attribute both known and unknown forgeries using labeled data for known types and unlabeled data containing a mixture of known and novel types. However, existing OW-DFA methods face two critical limitations: 1) A confidence skew that leads to unreliable pseudo-labels for novel forgeries, resulting in biased training. 2) An unrealistic assumption that the number of unknown forgery types is known *a priori*. To address these challenges, we propose a Confidence-Aware Asymmetric Learning (CAL) framework, which adaptively balances model confidence across known and novel forgery types. CAL mainly consists of two components: Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR). CCR mitigates pseudo-label bias by dynamically scaling sample losses based on normalized confidence, gradually shifting the training focus from high- to low-confidence samples. ACR complements this by separately calibrating confidence for known and novel classes through selective learning on high-confidence samples, guided by their confidence gap. Together, CCR and ACR form a mutually reinforcing loop that significantly improves the model's OW-DFA performance. Moreover, we introduce a Dynamic Prototype Pruning (DPP) strategy that automatically estimates the number of novel forgery types in a coarse-to-fine manner, removing the need for unrealistic prior assumptions and enhancing the scalability of our methods to real-world OW-DFA scenarios. Extensive experiments on the standard OW-DFA benchmark and a newly extended benchmark incorporating advanced manipulations demonstrate that CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
中文标题/摘要
标题:基于置信感知不对称学习的开放世界深度伪造溯源
合成面部图像的泛滥加剧了对稳健的开放世界深度伪造溯源(OW-DFA)的需求。OW-DFA旨在利用已知类型的有标注数据,以及混有已知和新型类型的无标注数据,对已知和未知伪造进行溯源。然而,现有的OW-DFA方法面临两个关键限制:1)置信度偏差,导致新型伪造的伪标签不可靠,从而导致训练偏差。2)不现实地假设未知伪造类型的数量已知。为了解决这些挑战,我们提出了一种置信感知不对称学习(CAL)框架,该框架能够自适应地平衡已知和新型伪造类型之间的模型置信度。CAL主要由两个组件组成:置信感知一致性正则化(CCR)和不对称置信强化(ACR)。CCR通过基于归一化置信度动态调整样本损失来减轻伪标签偏差,逐渐将训练重点从高置信度样本转移到低置信度样本。ACR通过在高置信度样本上选择性地学习来分别校准已知和新型类别的置信度,通过它们的置信度差距进行引导。CCR和ACR共同形成一个相互强化的循环,显著提高了模型的OW-DFA性能。此外,我们引入了一种动态原型修剪(DPP)策略,以粗到细的方式自动估计新型伪造类型的数量,消除了不现实的先验假设,增强了我们方法在实际OW-DFA场景中的可扩展性。在标准OW-DFA基准和一个新扩展的、包含先进篡改手法的基准上的广泛实验表明,CAL在已知和新型伪造溯源方面始终优于先前的方法,实现了新的最佳性能。
Summary / 总结
The paper addresses the need for robust Open-World DeepFake Attribution (OW-DFA) by proposing a Confidence-Aware Asymmetric Learning (CAL) framework. This framework includes Confidence-Aware Consistency Regularization (CCR) and Asymmetric Confidence Reinforcement (ACR) to mitigate pseudo-label bias and improve model performance. Additionally, a Dynamic Prototype Pruning (DPP) strategy is introduced to estimate the number of novel forgery types. Experiments show that CAL outperforms previous methods on both known and novel forgery attribution, achieving new state-of-the-art results.
本文提出了一种名为Confidence-Aware Asymmetric Learning (CAL)的框架,以解决Open-World DeepFake Attribution (OW-DFA)中的挑战。该框架包括Confidence-Aware Consistency Regularization (CCR)和Asymmetric Confidence Reinforcement (ACR),以减轻伪标签偏差并提高模型性能。此外,还引入了一种Dynamic Prototype Pruning (DPP)策略,以在无需先验假设的情况下估计新型伪造类型的数量。实验表明,CAL在已知和新型伪造归因方面均优于先前的方法,实现了新的最佳性能。
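The abstract describes CCR as scaling sample losses by normalized confidence while gradually shifting focus from high- to low-confidence samples. The sketch below shows one way such a schedule could look. The min-max normalization, the linear interpolation, and the `progress` parameter are all assumptions made for illustration; the abstract does not specify the actual weighting function.

```python
def normalize(confidences):
    """Min-max normalize confidences within a batch (assumed normalization)."""
    lo, hi = min(confidences), max(confidences)
    if hi == lo:
        return [0.5] * len(confidences)
    return [(c - lo) / (hi - lo) for c in confidences]


def confidence_weights(confidences, progress):
    """Interpolate from favouring high confidence (progress=0) to low (progress=1)."""
    normalized = normalize(confidences)
    return [(1.0 - progress) * c + progress * (1.0 - c) for c in normalized]


def weighted_loss(losses, confidences, progress):
    """Mean of per-sample losses scaled by the confidence-dependent weights."""
    weights = confidence_weights(confidences, progress)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


confidences = [0.9, 0.5, 0.1]
losses = [0.2, 0.7, 1.5]

early = confidence_weights(confidences, progress=0.0)  # start of training
late = confidence_weights(confidences, progress=1.0)   # end of training
early_loss = weighted_loss(losses, confidences, progress=0.0)
late_loss = weighted_loss(losses, confidences, progress=1.0)
```

Early in training the high-confidence sample dominates the loss; late in training the low-confidence sample does, which matches the direction of the shift the abstract describes.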
Data relativistic uncertainty framework for low-illumination anime scenery image enhancement
Authors: Yiquan Gao, John See
First: 2025-12-26T09:43:24+00:00 · Latest: 2025-12-26T09:43:24+00:00
Comments: Preprint, awaiting submission to the appropriate conference or journal
Abstract
In contrast to prevailing work on low-light enhancement for natural images and videos, this study addresses low-illumination quality degradation in anime scenery images to bridge the domain gap. For this underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address data scarcity. To exploit the uncertainty information inherent in the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions and recalibrate model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of the DRU framework by training several versions of EnlightenGAN, yielding perceptual and aesthetic quality superior to state-of-the-art methods that cannot learn from a data-uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.
中文标题/摘要
标题:相对论不确定性数据框架在低光照动漫场景图像增强中的应用
与自然图像和视频中的低光照增强研究相比,本研究致力于解决动漫场景图像在低光照条件下的质量退化问题,以缩小领域差距。针对这一尚未充分探索的增强任务,我们首先从多个来源收集图像,并构建了一个包含多种环境和光照条件的未配对动漫场景数据集,以解决数据稀缺问题。为了利用不同光照条件固有的不确定性信息,我们提出了一个数据相对论不确定性(DRU)框架,该框架受到相对论生成对抗网络(GAN)思想的启发。通过类比光的波粒二象性,我们的框架解释性地定义并量化了暗/亮样本的光照不确定性,并利用这些信息动态调整目标函数,以在数据不确定性下重新校准模型学习。大量实验表明,DRU框架通过训练多个版本的EnlightenGANs,能够超越现有方法,获得更优越的感知和美学质量。我们希望我们的框架能够为潜在的视觉和语言领域提供一种以数据为中心的学习新范式。代码已开源。
Summary / 总结
This study addresses the low-illumination quality degradation in anime scenery images by proposing a Data Relativistic Uncertainty (DRU) framework. Motivated by Relativistic GAN, the framework quantifies illumination uncertainty and dynamically adjusts the objective functions to improve model learning under data uncertainty. Experiments show that DRU enhances perceptual and aesthetic qualities of anime images better than existing methods, which fail to consider data uncertainty. The framework aims to expose a new paradigm of data-centric learning for visual domains.
该研究提出了一种数据相对不确定性(DRU)框架,以解决动漫风景图像在低光照条件下的质量退化问题。受相对生成对抗网络的启发,该框架量化了光照不确定性,并动态调整目标函数以在数据不确定性下改进模型学习。实验表明,DRU在感知和美学质量上优于现有方法,这些方法未能从数据不确定性角度进行学习。该框架旨在为视觉领域揭示一种新的数据为中心的学习范式。
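The abstract cites Relativistic GAN as the motivation for DRU. As background, the sketch below computes the relativistic standard GAN losses, in which the discriminator judges whether a real sample looks more realistic than a fake one instead of scoring each in isolation. It illustrates only that cited idea and does not include DRU's uncertainty-based adjustment; the critic scores are made-up numbers.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def rsgan_losses(critic_real, critic_fake):
    """Relativistic standard GAN losses from paired critic scores C(x_real), C(x_fake)."""
    pairs = list(zip(critic_real, critic_fake))
    # Discriminator: real should score higher than fake.
    d_loss = -sum(math.log(sigmoid(r - f)) for r, f in pairs) / len(pairs)
    # Generator: fake should score higher than real.
    g_loss = -sum(math.log(sigmoid(f - r)) for r, f in pairs) / len(pairs)
    return d_loss, g_loss


# Made-up critic outputs for two real/fake pairs.
d_loss, g_loss = rsgan_losses([2.0, 1.5], [-1.0, 0.0])

# When the critic cannot tell the samples apart, both losses equal log(2).
tie_d, tie_g = rsgan_losses([0.0], [0.0])
```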