arXiv 论文速递

2026-01-13 03:30
Snapshot: 20260113_0330
Manifold limit for the training of shallow graph convolutional neural networks
Authors: Johanna Tengler, Christoph Brune, José A. Iglesias
First: 2026-01-09T18:59:20+00:00 · Latest: 2026-01-09T18:59:20+00:00
Comments: 44 pages, 0 figures, 1 table
Abstract
We study the discrete-to-continuum consistency of the training of shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. Graph convolution is defined spectrally via the graph Laplacian, whose low-frequency spectrum approximates that of the Laplace-Beltrami operator of the underlying smooth manifold, and shallow GCNNs of possibly infinite width are linear functionals on the space of measures on the parameter space. From this functional-analytic perspective, graph signals are seen as spatial discretizations of functions on the manifold, which leads to a natural notion of training data consistent across graph resolutions. To enable convergence results, the continuum parameter space is chosen as a weakly compact product of unit balls, with Sobolev regularity imposed on the output weight and bias, but not on the convolutional parameter. The corresponding discrete parameter spaces inherit the corresponding spectral decay, and are additionally restricted by a frequency cutoff adapted to the informative spectral window of the graph Laplacians. Under these assumptions, we prove $\Gamma$-convergence of regularized empirical risk minimization functionals and corresponding convergence of their global minimizers, in the sense of weak convergence of the parameter measures and uniform convergence of the functions over compact sets. This provides a formalization of mesh and sample independence for the training of such networks.
中文标题/摘要
标题:浅层图卷积神经网络训练的流形极限
我们研究了在采样点云的近邻图上,基于流形假设浅层图卷积神经网络(GCNNs)训练的离散到连续一致性。图卷积通过图拉普拉斯算子的谱定义,其低频谱近似于底层光滑流形的拉普拉斯-贝尔特拉米算子,而浅层GCNNs可能是无限宽的,它们是参数空间上测度的线性泛函。从泛函分析的角度来看,图信号被视为流形上函数的空间离散化,这导致了一种自然的训练数据概念,这种概念在不同的图分辨率下是一致的。为了使收敛结果成立,连续参数空间被选择为弱紧的单位球乘积,对输出权重和偏置施加Sobolev正则性,但不对卷积参数施加正则性。相应的离散参数空间继承了相应的谱衰减,并且还受到适应图拉普拉斯算子的信息谱窗口的频率截止的限制。在这些假设下,我们证明了正则化经验风险最小化泛函的Γ-收敛及其全局最小值的相应收敛,在参数测度的弱收敛和函数在紧集上的均匀收敛意义上。这为这种网络的训练提供了网格和样本独立性的形式化。
Summary / 总结
The study investigates the consistency of training shallow graph convolutional neural networks (GCNNs) on proximity graphs of sampled point clouds under a manifold assumption. The research defines graph convolution spectrally via the graph Laplacian and considers shallow GCNNs as linear functionals on the space of measures. It proves Γ-convergence of regularized empirical risk minimization functionals and convergence of their global minimizers, ensuring mesh and sample independence in the training process.
该研究探讨了在点云近邻图上训练浅层图卷积神经网络(GCNN)在流形假设下的连续一致性。研究集中在通过图拉普拉斯定义的谱图卷积以及参数空间上的测度线性函数。主要发现包括证明了正则化经验风险最小化泛函的Γ收敛及其全局最小值的收敛,确保了此类网络训练的网格和样本独立性。
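The spectral graph convolution with a frequency cutoff mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, an unnormalized Laplacian L = D - W, and a hypothetical function name `spectral_graph_conv`; the paper's actual normalization and parameterization may differ.

```python
import numpy as np

def spectral_graph_conv(weights, signal, filter_coeffs):
    """Filter a graph signal in the eigenbasis of the graph Laplacian.

    weights:       (n, n) symmetric nonnegative adjacency matrix (assumed input).
    signal:        (n,) graph signal.
    filter_coeffs: (k,) spectral multipliers for the k lowest frequencies;
                   all higher frequencies are cut off (set to zero).
    """
    weights = np.asarray(weights, dtype=float)
    signal = np.asarray(signal, dtype=float)
    filter_coeffs = np.asarray(filter_coeffs, dtype=float)
    laplacian = np.diag(weights.sum(axis=1)) - weights  # L = D - W (assumed form)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(laplacian)
    k = len(filter_coeffs)
    low = eigvecs[:, :k]                     # k lowest-frequency eigenvectors
    spectrum = low.T @ signal                # graph Fourier coefficients
    return low @ (filter_coeffs * spectrum)  # multiply, then transform back
```

On a connected graph the lowest-frequency eigenvector is constant, so a constant signal passes through unchanged when the first multiplier is 1.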
AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs
Authors: Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He
First: 2026-01-09T18:58:22+00:00 · Latest: 2026-01-09T18:58:22+00:00
Abstract
Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
中文标题/摘要
标题:AdaFuse:适应性集成解码与测试时缩放的LLM解码框架
大型语言模型(LLMs)由于预训练数据、模型架构和解码行为的不同而表现出互补的优势。推理时的集成提供了一种实用的方法来结合这些能力,而无需重新训练。然而,现有的集成方法存在根本性的局限性。大多数方法依赖于固定的融合粒度,缺乏在生成过程中进行中期调整的灵活性,也无法适应不同任务的生成特性。为了解决这些挑战,我们提出了AdaFuse,这是一种适应性集成解码框架,在生成过程中动态选择语义上合适的融合单元。AdaFuse 不是固定粒度地进行融合,而是根据解码上下文实时调整融合行为,以单词作为基本对齐单元。具体来说,我们引入了一种基于不确定性的标准来决定在每个解码步骤是否应用集成。在自信的解码状态下,模型直接继续生成。在不确定的状态下,AdaFuse 调用一种多样性感知的缩放策略来探索替代候选续写,并指导集成决策。这种设计建立了适应性集成和测试时缩放之间的协同作用,其中集成决策引导有针对性的探索,而产生的多样性反过来增强了集成质量。在开放域问答、算术推理和机器翻译上的实验表明,AdaFuse 一致地优于强大的集成基线,平均相对改进为 6.88%。代码可在 https://github.com/CCM0111/AdaFuse 获取。
Summary / 总结
AdaFuse is an adaptive ensemble decoding framework designed to dynamically select fusion units during generation, addressing the limitations of fixed-granularity ensembling. It uses an uncertainty-based criterion to decide whether to apply ensembling at each step, and employs a diversity-aware scaling strategy in uncertain states to explore alternative continuations. Experiments show that AdaFuse outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%.
AdaFuse 是一种自适应集成解码框架,旨在生成过程中动态选择融合单元,解决固定粒度集成的局限性。它使用不确定性准则在每一步决定是否应用集成,并在不确定性较高的状态下采用多样性感知的缩放策略来探索替代候选延续,指导集成决策。实验表明,AdaFuse 在开放领域问答、算术推理和机器翻译任务中均优于强集成基线,平均相对改进率为 6.88%。
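A minimal sketch of an uncertainty-gated ensembling step in the spirit of the description above. The abstract does not specify the uncertainty measure or the fusion rule, so the entropy criterion, the `threshold` value, the plain averaging, and all function names here are assumptions for illustration, not AdaFuse's implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_ensemble(next_token_probs, threshold=1.0):
    """Return True when the decoding state is uncertain enough to ensemble.

    threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return entropy(next_token_probs) > threshold

def average_distributions(distributions):
    """Fuse several models' distributions over the same candidates by averaging."""
    n = len(distributions)
    return [sum(col) / n for col in zip(*distributions)]
```

In a confident state the gate returns False and decoding would continue with a single model; only uncertain states pay the cost of querying the other models.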
LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection
Authors: Þór Sverrisson, Steinn Guðmundsson
First: 2026-01-09T18:52:24+00:00 · Latest: 2026-01-09T18:52:24+00:00
Abstract
Automated seizure detection from electroencephalography (EEG) remains difficult due to the large variability of seizure dynamics across patients, recording conditions, and clinical settings. We introduce LookAroundNet, a transformer-based seizure detector that uses a wider temporal window of EEG data to model seizure activity. The seizure detector incorporates EEG signals before and after the segment of interest, reflecting how clinicians use surrounding context when interpreting EEG recordings. We evaluate the proposed method on multiple EEG datasets spanning diverse clinical environments, patient populations, and recording modalities, including routine clinical EEG and long-term ambulatory recordings, in order to study performance across varying data distributions. The evaluation includes publicly available datasets as well as a large proprietary collection of home EEG recordings, providing complementary views of controlled clinical data and unconstrained home-monitoring conditions. Our results show that LookAroundNet achieves strong performance across datasets, generalizes well to previously unseen recording conditions, and operates with computational costs compatible with real-world clinical deployment. The results indicate that extended temporal context, increased training data diversity, and model ensembling are key factors for improving performance. This work contributes to moving automatic seizure detection models toward clinically viable solutions.
中文标题/摘要
标题:LookAroundNet:通过变换器扩展时间上下文以实现临床可行的EEG癫痫发作检测
由于癫痫发作动力学在患者、记录条件和临床环境之间存在巨大差异,从脑电图(EEG)中自动检测癫痫发作仍然具有挑战性。我们引入了LookAroundNet,这是一种基于变换器的癫痫发作检测器,它使用更宽的时间窗口的EEG数据来建模癫痫活动。该检测器结合了感兴趣段落前后EEG信号,反映了临床医生在解释EEG记录时如何利用周围上下文。我们通过多个EEG数据集评估了该方法,这些数据集涵盖了不同的临床环境、患者群体和记录模式,包括常规临床EEG和长期便携式记录,以研究在不同数据分布下的性能。评估包括公开可用的数据集以及一个大型的私人收藏的家庭EEG记录,提供了受控临床数据和非约束家庭监测条件的互补视角。我们的结果表明,LookAroundNet在数据集上表现出色,能够很好地泛化到以前未见过的记录条件,并且计算成本与实际临床部署兼容。结果表明,扩展时间上下文、增加训练数据多样性以及模型集成是提高性能的关键因素。这项工作为将自动癫痫发作检测模型推向临床可行的解决方案做出了贡献。
Summary / 总结
The research aims to improve automated seizure detection from EEG by addressing the variability in seizure dynamics. LookAroundNet, a transformer-based model, uses a wider temporal window of EEG data to model seizure activity, incorporating context before and after the segment of interest. The model was evaluated on diverse EEG datasets, including routine clinical and long-term ambulatory recordings, showing strong performance and good generalization to unseen conditions. Extended temporal context, increased data diversity, and model ensembling were found to be crucial for performance improvement.
LookAroundNet 是一种基于变压器的癫痫检测器,通过扩展时间上下文来提高自动 EEG 癫痫检测的准确性。它使用了包含感兴趣段落前后 EEG 数据的更宽时间窗口,类似于临床医生如何解读 EEG 录音。该方法在多个 EEG 数据集上进行了评估,包括常规临床 EEG 和长期随访记录,显示出强大的性能和良好的对未见过的记录条件的泛化能力。扩展的时间上下文、增加的训练数据多样性和模型集成是提高性能的关键因素。
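The idea of feeding the detector EEG from before and after the segment of interest can be illustrated with a simple windowing helper. This is a sketch only: the function name, the single-channel list input, and the zero padding at recording boundaries are assumptions, not details from the paper.

```python
def window_with_context(samples, start, length, context):
    """Return the segment [start, start+length) plus `context` samples on each side.

    Positions outside the recording are zero-padded (an assumed choice) so
    every window has the same size: length + 2 * context.
    """
    window = []
    for i in range(start - context, start + length + context):
        window.append(samples[i] if 0 <= i < len(samples) else 0.0)
    return window
```

Because the window includes samples after the segment, a detector built this way has to wait for `context` future samples before scoring the segment.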
Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem
Authors: Sunia Tanweer, Firas A. Khasawneh
First: 2026-01-09T18:47:57+00:00 · Latest: 2026-01-09T18:47:57+00:00
Abstract
We develop a practical framework for distinguishing diffusive stochastic processes from deterministic signals using only a single discrete time series. Our approach is based on classical excursion and crossing theorems for continuous semimartingales, which relate the number $N_\varepsilon$ of excursions of magnitude at least $\varepsilon$ to the quadratic variation $[X]_T$ of the process. The scaling law holds universally for all continuous semimartingales with finite quadratic variation, including general Ito diffusions with nonlinear or state-dependent volatility, but fails sharply for deterministic systems. It thereby provides a theoretically certified method of distinguishing between these dynamics, as opposed to the subjective entropy-based or recurrence-based state-of-the-art methods. We construct a robust data-driven diffusion test. The method compares the empirical excursion counts against the theoretical expectation. The resulting ratio $K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$ is then summarized by a log-log slope deviation that measures departure from the $\varepsilon^{-2}$ law and provides a classification into diffusion-like or not. We demonstrate the method on canonical stochastic systems, some periodic and chaotic maps and systems with additive white noise, as well as the stochastic Duffing system. The approach is nonparametric, model-free, and relies only on the universal small-scale structure of continuous semimartingales.
中文标题/摘要
标题:通过非参数 excursion 定理检测离散信号中的随机性
我们开发了一种实用框架,仅使用单个离散时间序列来区分扩散随机过程和确定性信号。我们的方法基于连续半鞅的经典 excursion 和穿越定理,将幅度至少为 $\varepsilon$ 的 excursion 数量 $N_\varepsilon$ 与过程的二次变差 $[X]_T$ 相关联。这种标度律适用于所有具有有限二次变差的连续半鞅,包括一般伊藤扩散,无论其波动率是非线性的还是状态依赖的,但在确定性系统中却失败得非常尖锐,从而提供了一种理论认证的方法来区分这些动力学,而不是基于主观熵或复发的现有方法。我们构建了一种稳健的数据驱动扩散检验。该方法将经验 excursion 计数与理论期望进行比较。然后通过 log-log 斜率偏差总结得到的比率 $K(\varepsilon)=N_{\varepsilon}^{\mathrm{emp}}/N_{\varepsilon}^{\mathrm{theory}}$,该比率衡量 $\varepsilon^{-2}$ 规律,提供了一种分类方法,区分扩散型或非扩散型。我们通过一些典型随机系统、周期性和混沌映射以及具有加性白噪声的系统,以及随机 Duffing 系统,演示了该方法。该方法是非参数的、无模型的,并且仅依赖于连续半鞅的普遍小尺度结构。
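A simplified sketch of the excursion-counting idea: count moves of size at least $\varepsilon$ from a running reference level, then fit the log-log slope of the count against $\varepsilon$. For a diffusion the count is expected to scale like $\varepsilon^{-2}$, whereas a smooth deterministic signal gives roughly $\varepsilon^{-1}$. The counting rule and the function names are assumptions for illustration and are not the paper's exact estimator.

```python
import math

def count_excursions(series, eps):
    """Count successive moves of size at least eps from a running reference level."""
    count = 0
    ref = series[0]
    for x in series[1:]:
        if abs(x - ref) >= eps:
            count += 1
            ref = x  # restart the excursion from the current value
    return count

def loglog_slope(series, eps_values):
    """Least-squares slope of log N_eps against log eps.

    Every eps in eps_values must yield at least one excursion,
    otherwise log(0) raises a ValueError.
    """
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(count_excursions(series, e)) for e in eps_values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a deterministic linear ramp the fitted slope is -1 rather than -2, which is the kind of deviation the test uses to reject the diffusion hypothesis.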
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Authors: Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang
First: 2025-11-26T10:55:07+00:00 · Latest: 2026-01-09T18:43:00+00:00
Comments: 14 pages, 6 figures
Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we create a data curation engine, covering data acquisition, offline processing and integration, as well as online loading and weighting. This data engine effectively handles the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
中文标题/摘要
标题:遥感多任务学习中联合训练视觉语言模型
随着Transformer在遥感(RS)单一任务中取得卓越表现,我们正接近通过多任务学习(MTL)实现统一模型在多个任务上的卓越表现。与单一任务方法相比,MTL方法提供了更好的泛化能力、更强的可扩展性和更高的实际应用价值。最近,视觉语言模型(VLMs)在RS图像理解、语义分割和超高清(UHR)图像推理方面取得了令人鼓舞的结果。此外,统一的文本界面展示了MTL的巨大潜力。因此,在这项工作中,我们提出了RSCoVLM,这是一种简单而灵活的VLM基线模型,用于RS MTL。首先,我们创建了数据编纂引擎,包括数据获取、离线处理和集成,以及在线加载和加权。该数据引擎有效地解决了复杂的RS数据环境问题,并生成了灵活的视觉-语言对话。此外,我们提出了一种统一的动态分辨率策略,以解决RS图像中固有的不同图像尺度问题。对于UHR图像,我们引入了缩放链机制及其相应的数据集LRS-VQA-Zoom。这些策略灵活且有效地减轻了计算负担。此外,我们显著增强了模型的物体检测能力,并提出了一种新的评估协议,以确保VLMs和传统检测模型之间的公平比较。广泛的实验表明,RSCoVLM在多种任务上都达到了最先进的性能,优于现有的RS VLMs,甚至与专门的专家模型相媲美。所有训练和评估工具、模型权重和数据集均已完全开源,以支持可再现性。我们期望这一基线将促进通用RS模型的进一步发展。
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
First: 2026-01-09T18:39:01+00:00 · Latest: 2026-01-09T18:39:01+00:00
Comments: Preprint
Abstract
Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that, in a unified view, effective and learnable Long CoT trajectories feature stable molecular-like structures formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
中文标题/摘要
标题:思维的分子结构:长链推理拓扑映射
大型语言模型(LLMs)往往难以从人类或非长链推理(Long CoT)LLMs模仿中学习有效的长链推理。为了理解这一点,我们提出,有效的可学习的长链推理轨迹具有在统一视图中形成的稳定分子状结构,这些结构由三种交互类型组成:深层推理(共价型)、自我反思(氢键型)和自我探索(范德华力型)。对精简轨迹的分析表明,这些结构源自长链推理微调,而非关键词模仿。我们引入了有效语义异构体,并表明仅促进快速熵收敛的键支持稳定的长链推理学习,而结构竞争会损害训练。基于这些发现,我们提出了Mole-Syn方法,这是一种分布转移图方法,用于引导有效长链推理结构的合成,从而在基准测试中提升性能和强化学习稳定性。
Summary / 总结
The study aims to understand why large language models struggle with learning effective long chain-of-thought reasoning. It proposes that effective Long CoT trajectories have stable molecular-like structures formed by three types of interactions: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). The research finds that these structures emerge from Long CoT fine-tuning rather than keyword imitation. The study also identifies that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition hinders training. Based on these findings, Mole-Syn, a distribution-transfer-graph method, is introduced to guide the synthesis of effective Long CoT structures, improving performance and reinforcement learning stability across benchmarks.
研究旨在理解大型语言模型为何难以学习有效的长链推理。研究提出,这种推理涉及由三种类型的相互作用形成的稳定分子结构:深度推理(共价键样)、自我反思(氢键样)和自我探索(范德华力样)。研究发现,这些结构源自长链推理微调而非关键词模仿。研究引入了Mole-Syn方法,该方法指导有效长链推理结构的合成,并展示了其在基准测试中的性能和强化学习稳定性方面的提升。
There are no Champions in Supervised Long-Term Time Series Forecasting
Authors: Lorenzo Brigato, Rafael Morand, Knut Strømmen, Maria Panagiotou, Markus Schmidt, Stavroula Mougiakakou
First: 2025-02-19T19:08:37+00:00 · Latest: 2026-01-09T18:37:55+00:00
Comments: Accepted at TMLR
Abstract
Recent advances in long-term time series forecasting have introduced numerous complex supervised prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. In this study, we first perform a broad, thorough, and reproducible evaluation of the top-performing supervised models on the most popular benchmark and additional baselines representing the most active architecture families. This extensive evaluation assesses eight models on 14 datasets, encompassing $\sim$5,000 trained networks for the hyperparameter (HP) searches. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings emphasize the need to shift focus away from pursuing ever-more complex models, towards enhancing benchmarking practices through rigorous and standardized evaluations that enable more substantiated claims, including reproducible HP setups and statistical testing. We offer recommendations for future research.
中文标题/摘要
标题:监督长期时间序列预测中不存在冠军模型
近期在长期时间序列预测方面取得的进展引入了大量复杂的监督预测模型,这些模型持续超越之前发表的架构。然而,这种快速进步引发了关于不一致基准测试和报告实践的担忧,这可能削弱这些比较的可靠性。在本研究中,我们首先对最流行的基准和代表最活跃架构家族的其他基线进行了广泛、彻底且可重复的评估,评估了八种模型在14个数据集上的表现,涵盖了约5000个用于超参数(HP)搜索的训练网络。然后,通过全面分析,我们发现实验设置或当前评估指标的微小变化极大地改变了人们普遍认为新发表的结果正在推动前沿技术的信念。我们的研究结果强调了从追求越来越复杂的模型转向通过严格的标准化评估改进基准测试实践的必要性,以使更有力的声明成为可能,包括可重复的HP设置和统计测试。我们为未来的研究提供了建议。
Summary / 总结
This study evaluates eight top-performing supervised models for long-term time series forecasting across 14 datasets, involving approximately 5,000 trained networks for the hyperparameter searches. Despite recent advancements, the study reveals that slight changes in experimental setups or evaluation metrics can significantly alter the perceived superiority of new models, suggesting a need for more rigorous benchmarking practices. The findings highlight the importance of reproducible hyperparameter setups and statistical testing to substantiate claims in this field.
研究对14个数据集上的顶级监督模型进行了评估,涉及超过5,000个训练网络。尽管最近有所进步,但研究发现,实验设置或评估指标的小变化可以显著改变新模型的优越性,表明需要更严格的基准测试实践。研究强调了可重复的超参数设置和统计测试的重要性,以支持该领域的主张。
ACDZero: MCTS Agent for Mastering Automated Cyber Defense
Authors: Yu Li, Sizhe Tang, Rongqian Chen, Fei Xu Yu, Guangyu Jiang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan
First: 2026-01-05T15:18:54+00:00 · Latest: 2026-01-09T18:28:29+00:00
Abstract
Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require a large number of samples. Motivated by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.
中文标题/摘要
标题:ACDZero:掌握自动化网络防御的MCTS代理
自动化网络防御(ACD)旨在通过最少或无需人类干预来保护计算机网络,并通过采取纠正措施(如隔离主机、重置服务、部署诱饵或更新访问控制)来应对入侵。然而,现有的ACD方法,如深度强化学习(RL),在复杂网络的大决策/状态空间中难以探索,因此需要大量的样本。受学习高效防御策略的启发,我们将ACD在CAGE挑战4(CAGE-4 / CC4)中重新定义为基于上下文的部分可观测马尔可夫决策问题,并提出基于蒙特卡洛树搜索(MCTS)的规划中心防御策略。该策略明确地在ACD中建模了探索与利用之间的权衡,并使用统计抽样来指导探索和决策。我们创新地使用图神经网络(GNN)将网络观察嵌入为带属性的图中,以实现对主机及其关系的不变推理。为了使我们的解决方案在复杂的搜索空间中实用,我们用学习到的图嵌入和图编辑操作的先验来引导MCTS,结合无模型泛化、策略蒸馏和前瞻规划。我们在涉及多种网络结构和对手行为的CAGE-4场景中评估了该代理,并展示了与最先进的RL基线相比,我们的搜索引导、基于图嵌入的规划提高了防御奖励和鲁棒性。
Summary / 总结
The research aims to develop an efficient automated cyber defense system that can operate with minimal human intervention. The method involves framing the problem as a context-based partially observable Markov decision process and using Monte Carlo Tree Search (MCTS) with graph neural networks (GNNs) to embed network observations. Key findings show that the proposed agent outperforms existing reinforcement learning baselines in terms of defense reward and robustness across various network scenarios and adversary behaviors.
ACDZero 使用蒙特卡洛树搜索(MCTS)来学习高效的自动化防御策略,通过图神经网络嵌入网络观察,实现不变性推理。实验结果表明,ACDZero 在各种网络结构和对手行为的场景下,相较于最先进的强化学习方法,在防御奖励和鲁棒性方面表现更优。
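One common way to guide MCTS with learned priors over actions is the PUCT selection rule popularized by AlphaZero-style agents. The sketch below assumes that rule; the abstract does not state which selection formula ACDZero uses, and the function names, the `c_puct` constant, and the action labels are hypothetical.

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Exploitation term plus a prior-weighted exploration bonus."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + bonus

def select_action(children, c_puct=1.5):
    """Pick the action with the highest PUCT score.

    children maps action -> dict with keys "q", "prior", "visits".
    """
    parent_visits = sum(ch["visits"] for ch in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a]["q"], children[a]["prior"],
            parent_visits, children[a]["visits"], c_puct,
        ),
    )
```

An unvisited action with a high prior receives a large bonus, so the search tries it before returning to a well-explored action with a moderately better value estimate.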
Open-Vocabulary 3D Instruction Ambiguity Detection
Authors: Jiayu Ding, Haoran Tang, Ge Li
First: 2026-01-09T18:17:11+00:00 · Latest: 2026-01-09T18:17:11+00:00
Abstract
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.
中文标题/摘要
标题:开放词汇3D指令歧义检测
在安全关键领域,语言歧义可能导致严重后果;例如,在手术环境中,模糊的指令“递给我那个药瓶”可能导致灾难性错误。然而,大多数具身AI研究忽视了这一点,假设指令是清晰的,而专注于执行而不是确认。为解决这一关键安全缺口,我们首次定义了开放词汇3D指令歧义检测这一基本新任务,要求模型确定在一个给定的3D场景中,一个指令是否具有单一且明确的意义。为了支持这一研究,我们构建了Ambi3D,这是该任务的大规模基准数据集,包含超过700个多样化的3D场景和约22000条指令。我们的分析揭示了一个令人惊讶的局限性:最先进的3D大型语言模型(LLMs)难以可靠地判断指令是否具有歧义性。为应对这一挑战,我们提出了AmbiVer,这是一种两阶段框架,通过从多个视角收集明确的视觉证据,并利用这些证据来指导视觉-语言模型(VLM)判断指令的歧义性。广泛的实验表明了我们任务的挑战性以及AmbiVer的有效性,为更安全和更可信赖的具身AI铺平了道路。代码和数据集可在https://jiayuding031020.github.io/ambi3d/获取。
Summary / 总结
The research aims to address the critical safety gap in embodied AI by defining Open-Vocabulary 3D Instruction Ambiguity Detection, where models must determine if a command has a single, unambiguous meaning within a 3D scene. The study introduces Ambi3D, a large-scale benchmark with over 700 diverse 3D scenes and around 22k instructions, and finds that state-of-the-art 3D LLMs struggle with this task. In response, the authors propose AmbiVer, a two-stage framework that collects visual evidence from multiple views to guide a VLM in judging instruction ambiguity, demonstrating its effectiveness through extensive experiments.
本文通过定义Open-Vocabulary 3D指令歧义检测任务,解决了安全关键领域中的语言歧义问题,该任务要求模型在给定的3D场景中判断一个命令是否有单一且明确的意义。作者构建了Ambi3D基准,包含超过700个多样化的3D场景和22k指令,并发现最先进的3D大型语言模型在这一任务上表现不佳。为了改进这一问题,他们提出了AmbiVer框架,该框架利用多视角的视觉证据来指导视觉语言模型判断指令的歧义性,并通过广泛的实验展示了其有效性。
Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints
Authors: Adrian Serrano, Erwan Umlil, Ronan Thomas
First: 2026-01-09T18:06:19+00:00 · Latest: 2026-01-09T18:06:19+00:00
Comments: 10 pages, four tables, one figure
Abstract
Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture and Balance) and DUMBer methodologies to deepfake detection. We evaluate detectors' robustness against adversarial attacks under transferability constraints and cross-dataset configurations to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in the in-distribution cases but can also degrade it under cross-dataset configurations, depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.
中文标题/摘要
标题:Deepfake检测器是愚蠢的:评估在转移性约束下的对抗训练鲁棒性基准
部署在真实环境中的Deepfake检测系统可能受到能够制造不可感知扰动的对手的攻击,这些扰动会降低模型的性能。虽然对抗训练是一种广泛采用的防御手段,但在现实条件下,攻击者知识有限且数据分布不匹配的情况下,其有效性仍然未被充分探索。在本文中,我们扩展了DUMB——数据源、模型架构和平衡——以及DUMBer方法论,应用于Deepfake检测。我们评估了检测器在转移性约束和跨数据集配置下的鲁棒性,以提取现实世界的见解。我们的研究涵盖了五种最先进的检测器(RECCE、SRM、XCeption、UCF、SPSL)、三种攻击(PGD、FGSM、FPBA)和两个数据集(FaceForensics++和Celeb-DF-V2)。我们从攻击者和防御者的视角分析了结果,映射到不匹配场景。实验表明,对抗训练策略在同分布情况下增强了鲁棒性,但在跨数据集配置下,根据所采用的策略,也可能降低鲁棒性。这些发现强调了在面临对抗攻击的真实世界应用中需要针对具体情况的防御策略。
Summary / 总结
This study investigates the robustness of deepfake detectors against adversarial attacks under transferability constraints. It extends the DUMB methodology to evaluate five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL) against three types of attacks (PGD, FGSM, FPBA) across two datasets (FaceForensics++ and Celeb-DF-V2). The findings reveal that adversarial training enhances robustness within the same dataset but can degrade performance when applied to different datasets, underscoring the need for case-aware defense strategies in real-world applications.
该研究考察了在迁移性限制条件下,对抗攻击对深度假脸检测器的鲁棒性影响。研究扩展了DUMB方法,评估了五种最先进的检测器(RECCE、SRM、XCeption、UCF、SPSL)在三种攻击类型(PGD、FGSM、FPBA)和两个数据集(FaceForensics++和Celeb-DF-V2)上的表现。实验结果显示,虽然对抗训练在相同数据集内提高了鲁棒性,但在不同数据集上应用时可能会降低性能,这强调了在面临对抗攻击的实际应用中需要采取案例特定的防御策略。
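Of the three attacks listed, FGSM is the simplest to illustrate: it perturbs the input by a step of size eps in the direction of the sign of the loss gradient. The sketch below applies it to a toy logistic-regression classifier in pure Python; the model, the names, and the values are assumptions for illustration and have nothing to do with the detectors evaluated in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(dLoss/dx)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]  # gradient of BCE w.r.t. the input
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

In a transferability setting of the kind the paper studies, the gradient would come from a surrogate model rather than from the target detector itself.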
Community-Based Model Sharing and Generalisation: Anomaly Detection in IoT Temperature Sensor Networks
Authors: Sahibzada Saadoon Hammad, Joaquín Huerta Guijarro, Francisco Ramos, Michael Gould Carlson, Sergio Trilles Oliver
First: 2026-01-09T18:05:57+00:00 · Latest: 2026-01-09T18:05:57+00:00
Comments: 20 pages, 9 figures, Journal submission
Abstract
The rapid deployment of Internet of Things (IoT) devices has led to large-scale sensor networks that monitor environmental and urban phenomena in real time. Communities of Interest (CoIs) provide a promising paradigm for organising heterogeneous IoT sensor networks by grouping devices with similar operational and environmental characteristics. This work presents an anomaly detection framework based on the CoI paradigm by grouping sensors into communities using a fused similarity matrix that incorporates temporal correlations via Spearman coefficients, spatial proximity using Gaussian distance decay, and elevation similarities. For each community, representative stations are selected based on the best silhouette score, and three autoencoder architectures (BiLSTM, LSTM, and MLP) are trained using Bayesian hyperparameter optimization with expanding window cross-validation and tested on stations from the same cluster and the best representative stations of other clusters. The models are trained on normal temperature patterns of the data and anomalies are detected through reconstruction error analysis. Experimental results show robust within-community performance across the evaluated configurations, while variations across communities are observed. Overall, the results support the applicability of community-based model sharing for reducing computational overhead and for analysing model generalisability across IoT sensor networks.
中文标题/摘要
标题:基于社区的模型共享与泛化:物联网温度传感器网络中的异常检测
物联网(IoT)设备的快速部署导致了大规模的传感器网络,这些网络可以实时监控环境和城市现象。利益相关者社区(CoIs)为组织异构的IoT传感器网络提供了一种有前景的范式,通过将具有相似操作和环境特性的设备分组。本文基于CoI范式提出了一种异常检测框架,通过融合相似矩阵将传感器分组,该矩阵通过斯皮尔曼系数引入了时间相关性,通过高斯距离衰减引入了空间接近性,并引入了海拔相似性。对于每个社区,基于最佳轮廓选择代表站,并使用贝叶斯超参数优化和扩展窗口交叉验证训练三种自编码器架构(BiLSTM、LSTM和MLP),并在同一簇的站点和来自其他簇的最佳代表站点上进行测试。模型在正常温度模式的数据上进行训练,并通过重构误差分析检测异常。实验结果表明,在评估的配置中,社区内的性能表现出稳健性,而社区间存在变化。总体而言,结果支持基于社区的模型共享在减少计算开销和分析IoT传感器网络中的模型泛化能力方面的适用性。
Summary / 总结
This work introduces an anomaly detection framework for IoT temperature sensor networks using a community-based model sharing approach. Sensors are grouped into communities based on a fused similarity matrix that considers temporal, spatial, and elevation factors. Autoencoders are trained on representative stations within each community and tested on other stations. The study demonstrates robust performance within communities and varying performance across communities, supporting the use of community-based model sharing to reduce computational overhead and enhance model generalisability in IoT networks.
该研究提出了一种基于社区的物联网温度传感器网络异常检测框架。传感器根据融合相似矩阵(考虑了时间、空间和海拔因素)被分组到不同的社区中。在每个社区中,基于代表站训练自编码器并在同一社区和最佳代表站的其他社区中进行测试。结果表明,社区内的性能表现稳健,而不同社区之间存在差异,这支持了通过减少计算开销和增强模型在物联网网络中的泛化能力来使用基于社区的模型共享方法的有效性。
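A rough sketch of how a fused similarity between two stations could combine the three ingredients named in the abstract: Spearman correlation, Gaussian distance decay, and elevation similarity. The weights, the scale parameters `sigma_km` and `elev_scale_m`, the exponential form of the elevation term, and the linear fusion are all assumptions; the abstract does not give the actual formula.

```python
import math

def ranks(values):
    """Rank values from 1..n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation via the no-ties formula."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def fused_similarity(series_a, series_b, dist_km, elev_diff_m,
                     sigma_km=10.0, elev_scale_m=100.0,
                     weights=(0.5, 0.3, 0.2)):
    """Weighted sum of temporal, spatial, and elevation similarity in [0, 1]."""
    temporal = (spearman(series_a, series_b) + 1.0) / 2.0      # map [-1, 1] to [0, 1]
    spatial = math.exp(-dist_km ** 2 / (2.0 * sigma_km ** 2))  # Gaussian distance decay
    elevation = math.exp(-abs(elev_diff_m) / elev_scale_m)     # assumed functional form
    w_t, w_s, w_e = weights
    return w_t * temporal + w_s * spatial + w_e * elevation
```

Evaluating this for every pair of stations would give the matrix that a clustering step could then partition into communities.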
Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems
Authors: Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal
First: 2025-09-09T12:43:59+00:00 · Latest: 2026-01-09T17:56:08+00:00
Abstract
Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in zones imperceptible to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.
中文标题/摘要
标题:频谱掩蔽与插值攻击(SMIA):针对语音认证和反欺骗系统的黑盒对抗攻击
语音认证系统(VAS)利用独特的声音特征进行身份验证,并越来越多地被集成到银行和医疗等高安全领域。尽管深度学习带来了改进,这些系统仍面临深度伪造和对抗攻击等复杂威胁造成的严重漏洞。逼真的语音克隆技术的出现使检测变得更加困难,因为系统难以区分真实音频和合成音频。虽然已有反欺骗对策(CMs)来减轻这些风险,但其中许多依赖静态检测模型,可能被新型对抗方法绕过,从而留下关键的安全缺口。为了展示这一漏洞,我们提出了频谱掩蔽与插值攻击(SMIA),这是一种新方法,有策略地操纵 AI 生成音频中不可闻的频率区域。通过在人耳无法感知的区域改变语音,SMIA 生成听起来真实却能欺骗 CMs 的对抗样本。我们在模拟的真实世界条件下,针对多种任务中最先进的(SOTA)模型对该攻击进行了全面评估。SMIA 对 VAS/CM 联合系统的攻击成功率(ASR)至少为 82%,对独立说话人验证系统至少为 97.5%,对反欺骗对策为 100%。这些发现确凿地表明,当前的安全态势不足以抵御自适应对抗攻击。这项工作凸显了迫切需要向下一代防御进行范式转变,这类防御采用动态、上下文感知的框架,能够随威胁形势的变化而演进。
Summary / 总结
The paper introduces the Spectral Masking and Interpolation Attack (SMIA), a black-box adversarial attack that manipulates inaudible frequency regions of AI-generated audio to deceive voice authentication and anti-spoofing systems. SMIA achieves an attack success rate of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures, highlighting the vulnerability of current security measures to adaptive adversarial attacks.
研究旨在揭示语音认证系统(VAS)及其反欺骗对策(CMs)在面对复杂对抗攻击时的脆弱性。SMIA 通过操纵 AI 生成音频中人耳不可闻的频率区域,生成听起来真实却能欺骗 CMs 的对抗样本。实验表明,SMIA 对 VAS/CM 联合系统的攻击成功率至少为 82%,对独立说话人验证系统至少为 97.5%,对反欺骗对策为 100%,说明现有安全措施不足以抵御自适应对抗攻击。
AWaRe-SAC: Proactive Slice Admission Control under Weather-Induced Capacity Uncertainty
Authors: Dror Jacoby, Yanzhi Li, Shuyue Yu, Nicola Di Cicco, Hagit Messer, Gil Zussman, Igor Kadota
First: 2026-01-09T17:53:09+00:00 · Latest: 2026-01-09T17:53:09+00:00
Abstract
As emerging applications demand higher throughput and lower latencies, operators are increasingly deploying millimeter-wave (mmWave) links within x-haul transport networks, spanning fronthaul, midhaul, and backhaul segments. However, the inherent susceptibility of mmWave frequencies to weather-related attenuation, particularly rain fading, complicates the maintenance of stringent Quality of Service (QoS) requirements. This creates a critical challenge: making admission decisions under uncertainty regarding future network capacity. To address this, we develop a proactive slice admission control framework for mmWave x-haul networks subject to rain-induced fluctuations. Our objective is to improve network performance, ensure QoS, and optimize revenue, thereby surpassing the limitations of standard reactive approaches. The proposed framework integrates a deep learning predictor of future network conditions with a proactive Q-learning-based slice admission control mechanism. We validate our solution using real-world data from a mmWave x-haul deployment in a dense urban area, incorporating realistic models of link capacity attenuation and dynamic slice demands. Extensive evaluations demonstrate that our proactive solution achieves 2-3x higher long-term average revenue under dynamic link conditions, providing a scalable and resilient framework for adaptive admission control.
中文标题/摘要
标题:AWaRe-SAC:受天气引起的容量不确定性影响下的前瞻性切片准入控制
随着新兴应用要求更高的吞吐量和更低的延迟,运营商正越来越多地在涵盖前传、中传和回传各段的 x-haul 传输网络中部署毫米波(mmWave)链路。然而,毫米波频率天然易受天气相关衰减(尤其是雨衰)的影响,这使得维持严格的服务质量(QoS)要求变得复杂。由此产生了一个关键挑战:在未来网络容量不确定的条件下做出准入决策。为了解决这一问题,我们为受降雨引起波动影响的毫米波 x-haul 网络开发了一种前瞻性切片准入控制框架。我们的目标是提高网络性能、确保 QoS 并优化收入,从而突破标准被动式方法的局限。所提出的框架将预测未来网络状况的深度学习预测器与基于 Q 学习的前瞻性切片准入控制机制相结合。我们使用来自密集城区毫米波 x-haul 部署的真实数据验证该方案,并纳入了链路容量衰减和动态切片需求的现实模型。大量评估表明,在动态链路条件下,我们的前瞻性方案的长期平均收入达到原来的 2-3 倍,为自适应准入控制提供了一个可扩展且有弹性的框架。
Summary / 总结
The paper addresses the challenge of maintaining Quality of Service (QoS) in millimeter-wave (mmWave) x-haul networks under weather-induced capacity uncertainty. It proposes a proactive slice admission control framework that uses a deep learning predictor for future network conditions and a Q-learning-based mechanism for slice admission. Evaluations with real-world data show that this proactive approach can achieve 2-3 times higher long-term average revenue compared to reactive methods.
论文针对毫米波(mmWave)x-haul 网络在天气引起的容量不确定性下难以维持服务质量(QoS)的问题,提出了一种前瞻性的切片准入控制框架,该框架结合了预测未来网络状况的深度学习预测器和基于 Q-learning 的切片准入控制机制。基于真实数据的评估显示,该方法在动态链路条件下的长期平均收入可达被动式方法的 2-3 倍。
Monadic Context Engineering
Authors: Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao
First: 2025-12-27T01:52:06+00:00 · Latest: 2026-01-09T17:48:20+00:00
Comments: The authors have decided to withdraw this manuscript, as the ideas presented in the paper are not yet sufficiently mature and require further development and refinement
Abstract
The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.
中文标题/摘要
标题:单子上下文工程
大型语言模型(LLMs)的普及催生了向具备复杂推理和工具使用能力的自主代理的转变。然而,当前的代理架构往往采用命令式的、临时拼凑的模式构建,导致系统脆弱,在状态管理、错误处理和并发方面困难重重。本文介绍了单子上下文工程(MCE),这是一种新的架构范式,利用函子(Functor)、应用函子(Applicative Functor)和单子(Monad)的代数结构为代理设计提供形式化基础。MCE 将代理工作流视为计算上下文,其中状态传播、短路式错误处理和异步执行等横切关注点由抽象本身的代数性质内在地管理。我们展示了单子如何实现稳健的顺序组合,应用函子如何为并行执行提供有原则的结构,以及关键的一点:单子变换器(Monad Transformer)如何系统地组合这些能力。这种分层方法使开发人员能够用简单的、可独立验证的组件构建复杂、有韧性且高效的 AI 代理。我们进一步扩展了这一框架来描述元代理(Meta-Agents),它们利用 MCE 进行生成式编排,通过元编程动态创建和管理子代理工作流。
Summary / 总结
This paper addresses the limitations of current agent architectures by introducing Monadic Context Engineering (MCE), which uses algebraic structures to manage state, error handling, and concurrency. MCE enables robust sequential and parallel execution through Functors, Applicative Functors, and Monads, and Monad Transformers allow for the systematic composition of these capabilities. The authors demonstrate how MCE can be used to construct resilient AI agents and extend the framework to Meta-Agents for generative orchestration. However, the manuscript has been withdrawn as the ideas require further development and refinement.
论文提出了Monadic Context Engineering (MCE),这是一种使用Functors、Applicative Functors和Monads的代数结构来设计自主代理的新架构范式。MCE旨在通过内在管理状态传播、错误处理和并发来解决当前代理架构的脆弱性。作者展示了如何使用Monads实现稳健的顺序组合,如何使用Applicatives提供并行执行的原理性结构,以及如何使用Monad Transformers系统地组合这些能力。该框架进一步扩展以描述Meta-Agents,这些Meta-Agents通过元编程动态创建和管理子代理的工作流。然而,作者决定撤回手稿,因为这些想法尚未成熟,需要进一步的发展和完善。
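To make the short-circuiting error handling mentioned in the abstract concrete, here is a minimal Python sketch of a Result-style monad. It is not taken from the paper: the names `Ok`, `Err`, `bind`, and the three agent steps are hypothetical, chosen only to show how sequential composition can stop at the first failure without explicit `if` checks between steps.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    """Successful computation carrying a value."""
    value: T


@dataclass(frozen=True)
class Err:
    """Failed computation carrying an error message."""
    message: str


def bind(result, step):
    """Apply `step` to a successful value; pass an error through untouched."""
    if isinstance(result, Err):
        return result
    return step(result.value)


# Hypothetical agent steps (illustrative only, not from the paper).
def parse_request(text):
    cleaned = text.strip()
    return Ok(cleaned) if cleaned else Err("empty request")


def call_tool(query):
    if query == "fail":
        return Err("tool unavailable")
    return Ok(f"result for {query}")


def format_answer(payload):
    return Ok(payload.upper())


def run_agent(text):
    """Chain the steps; the first Err short-circuits the rest."""
    result = Ok(text)
    for step in (parse_request, call_tool, format_answer):
        result = bind(result, step)
    return result
```

With these definitions, `run_agent("fail")` stops at `call_tool` and never reaches `format_answer`, because `bind` forwards the `Err` unchanged.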
QueryGym: Step-by-Step Interaction with Relational Databases
Authors: Haritha Ananthakrishnan, Harsha Kokel, Kelsey Sikes, Debarun Bhattacharjya, Michael Katz, Shirin Sohrabi, Kavitha Srinivas
First: 2025-09-25T22:48:49+00:00 · Latest: 2026-01-09T17:48:08+00:00
Abstract
We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations -- including schema details, intermediate results, and execution feedback -- and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation. For the associated demo, see https://ibm.biz/QueryGym.
中文标题/摘要
标题:QueryGym:与关系数据库逐步交互
我们介绍了QueryGym,一个用于构建、测试和评估基于LLM的查询规划代理的交互式环境。现有框架往往将代理绑定到特定的查询语言方言或使其推理过程晦涩难懂;QueryGym 要求代理构建明确的关系代数操作序列,确保跨引擎评估,并透明地进行逐步规划。该环境以Gymnasium接口的形式实现,提供观察结果——包括模式细节、中间结果和执行反馈——并接收代表数据库探索(例如,预览表、采样列值、检索唯一值)以及关系代数操作(例如,过滤、投影、连接)的动作。我们详细阐述了环境的设计动机。在演示中,我们通过将其与当前查询数据库的LLM进行对比,展示了该环境的实用性。QueryGym 作为研究查询生成中的错误修正、透明性和强化学习的实用试验平台。有关相关演示,请参见 https://ibm.biz/QueryGym。
Summary / 总结
QueryGym is an interactive environment for building, testing, and evaluating LLM-based query planning agents. It requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is exposed through a Gymnasium interface, and the accompanying demo contrasts it with contemporary LLMs that query databases. The authors position QueryGym as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation.
QueryGym 是一个交互式环境,用于构建、测试和评估基于 LLM 的查询规划代理。它要求代理构建明确的关系代数操作序列,从而实现与数据库引擎无关的评估和透明的逐步规划。该环境通过 Gymnasium 接口提供,配套演示将其与当前查询数据库的 LLM 进行了对比。作者将 QueryGym 定位为错误修复、透明性以及查询生成强化学习研究的实用试验平台。
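The idea of an explicit, step-by-step relational algebra plan can be sketched in plain Python. This is a toy illustration only: it does not use the QueryGym or Gymnasium APIs, and the helper names (`filter_rows`, `project`, `join`) and the sample tables are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]


def filter_rows(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Selection: keep rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table: Table, columns: List[str]) -> Table:
    """Projection: keep only the named columns."""
    return [{col: row[col] for col in columns} for row in table]


def join(left: Table, right: Table, key: str) -> Table:
    """Equi-join on a shared column name."""
    index: Dict[Any, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data.
customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bo"}]
orders = [
    {"oid": 10, "cid": 1, "total": 30},
    {"oid": 11, "cid": 2, "total": 5},
    {"oid": 12, "cid": 1, "total": 8},
]

# Each step is a separate, inspectable intermediate result.
step1 = join(orders, customers, "cid")
step2 = filter_rows(step1, lambda row: row["total"] > 7)
step3 = project(step2, ["name", "total"])
```

Because every step yields an intermediate table, an agent (or a person debugging it) can inspect row counts and values after each operation rather than only the final answer.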
DeePM: Regime-Robust Deep Learning for Systematic Macro Portfolio Management
Authors: Kieran Wood, Stephen J. Roberts, Stefan Zohren
First: 2026-01-09T17:47:32+00:00 · Latest: 2026-01-09T17:47:32+00:00
Abstract
We propose DeePM (Deep Portfolio Manager), a structured deep-learning macro portfolio manager trained end-to-end to maximize a robust, risk-adjusted utility. DeePM addresses three fundamental challenges in financial learning: (1) it resolves the asynchronous "ragged filtration" problem via a Directed Delay (Causal Sieve) mechanism that prioritizes causal impulse-response learning over information freshness; (2) it combats low signal-to-noise ratios via a Macroeconomic Graph Prior, regularizing cross-asset dependence according to economic first principles; and (3) it optimizes a distributionally robust objective where a smooth worst-window penalty serves as a differentiable proxy for Entropic Value-at-Risk (EVaR) - a window-robust utility encouraging strong performance in the most adverse historical subperiods. In large-scale backtests from 2010-2025 on 50 diversified futures with highly realistic transaction costs, DeePM attains net risk-adjusted returns that are roughly twice those of classical trend-following strategies and passive benchmarks, solely using daily closing prices. Furthermore, DeePM improves upon the state-of-the-art Momentum Transformer architecture by roughly fifty percent. The model demonstrates structural resilience across the 2010s "CTA (Commodity Trading Advisor) Winter" and the post-2020 volatility regime shift, maintaining consistent performance through the pandemic, inflation shocks, and the subsequent higher-for-longer environment. Ablation studies confirm that strictly lagged cross-sectional attention, graph prior, principled treatment of transaction costs, and robust minimax optimization are the primary drivers of this generalization capability.
中文标题/摘要
标题:DeePM:面向系统化宏观投资组合管理的市场状态稳健深度学习
我们提出了 DeePM(Deep Portfolio Manager),这是一种端到端训练的结构化深度学习宏观投资组合管理器,旨在最大化稳健的风险调整效用。DeePM 解决了金融学习中的三个基本挑战:(1)通过有向延迟(Directed Delay,又称因果筛 Causal Sieve)机制解决异步的“参差信息流”(ragged filtration)问题,该机制优先考虑因果脉冲响应学习而非信息新鲜度;(2)通过宏观经济图先验应对低信噪比,依据经济学第一性原理对跨资产依赖关系进行正则化;(3)优化一个分布稳健的目标,其中平滑的最差窗口惩罚作为熵风险价值(EVaR)的可微代理,这是一种窗口稳健的效用,鼓励模型在最不利的历史子时期也有良好表现。在 2010-2025 年针对 50 种多样化期货、采用高度贴近现实的交易成本的大规模回测中,DeePM 仅使用每日收盘价,其净风险调整回报约为经典趋势跟踪策略和被动基准的两倍。此外,DeePM 比当前最先进的 Momentum Transformer 架构提升约百分之五十。该模型在 2010 年代的“CTA(商品交易顾问)寒冬”和 2020 年后的波动率机制转变中表现出结构性韧性,在疫情、通胀冲击以及随后的利率长期维持高位(higher-for-longer)环境中保持了稳定的表现。消融研究证实,严格滞后的横截面注意力、图先验、对交易成本的原则性处理以及稳健的极小极大优化是这种泛化能力的主要驱动因素。
Summary / 总结
DeePM is a deep-learning macro portfolio manager that addresses the challenges of financial learning by using a Directed Delay mechanism, a Macroeconomic Graph Prior, and a distributionally robust objective. In backtests from 2010-2025, DeePM achieved net risk-adjusted returns roughly twice those of classical trend-following strategies and passive benchmarks, and improved upon the Momentum Transformer by about fifty percent. DeePM maintained consistent performance through various economic regimes, including the CTA Winter and post-2020 volatility shifts.
DeePM 是一种深度学习宏观投资组合管理器,通过有向延迟(Directed Delay)机制、宏观经济图先验(Macroeconomic Graph Prior)和分布稳健目标来应对金融学习中的挑战。在 2010-2025 年的回测中,DeePM 的净风险调整回报约为经典趋势跟踪策略和被动基准的两倍,并比 Momentum Transformer 提升约百分之五十。在包括 CTA 寒冬和 2020 年后波动率变化在内的多种经济环境中,DeePM 都保持了稳定的表现。
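The abstract describes a "smooth worst-window penalty" used as a differentiable proxy for a worst-case objective. One generic way to smooth a minimum over windows is a log-mean-exp soft-min, sketched below with NumPy. Whether DeePM uses this exact form is an assumption; the function names, the window length, and the temperature values are illustrative only.

```python
import numpy as np


def window_scores(returns: np.ndarray, window: int) -> np.ndarray:
    """Mean return of every contiguous window of the given length."""
    kernel = np.ones(window) / window
    return np.convolve(returns, kernel, mode="valid")


def smooth_worst_window(scores: np.ndarray, temperature: float) -> float:
    """Soft-min via -T * log(mean(exp(-scores / T))).

    Approaches min(scores) as T -> 0 and mean(scores) as T -> infinity.
    The maximum is subtracted before exponentiating for numerical stability.
    """
    z = -scores / temperature
    z_max = z.max()
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    return float(-temperature * log_mean_exp)


# Hypothetical daily returns, for illustration.
returns = np.array([0.01, -0.02, 0.03, -0.05, 0.02, 0.01])
scores = window_scores(returns, window=3)
soft = smooth_worst_window(scores, temperature=0.01)
sharp = smooth_worst_window(scores, temperature=1e-4)
```

The soft-min always lies between the worst window and the average window, so lowering the temperature shifts emphasis toward the most adverse subperiod while keeping the objective differentiable.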
Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection
Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau
Venue: KDD 2024
First: 2024-10-16T06:31:59+00:00 · Latest: 2026-01-09T17:41:42+00:00
Comments: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (ACM KDD 2024). Accepted by Workshop on Evaluation and Trustworthiness of Generative AI Models
Abstract
We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and a language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text versus baselines, to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.
中文标题/摘要
标题:受控自动任务特定合成数据生成用于幻觉检测
我们提出了一种新方法,用于自动生成非平凡的、面向特定任务的合成数据集以进行幻觉检测。该方法采用两步式“生成-选择”流水线,在生成过程中使用幻觉模式引导和语言风格对齐。幻觉模式引导利用最重要的任务特定幻觉模式,而语言风格对齐则使合成数据集的风格与基准文本保持一致。为了从合成数据集中获得稳健的有监督检测器,我们还采用了数据混合策略,以提高性能的稳健性和泛化能力。在三个数据集上的结果表明,与基线相比,我们生成的幻觉文本与非幻觉文本更为接近,可用于训练泛化能力更好的幻觉检测器。我们在合成数据集上训练的幻觉检测器以 32% 的大幅优势超过了基于上下文学习(ICL)的检测器。大量实验验证了该方法在跨任务和跨生成器泛化方面的优势。基于数据混合的训练进一步提高了幻觉检测的泛化能力和稳健性。
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
Authors: Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
First: 2026-01-09T17:34:59+00:00 · Latest: 2026-01-09T17:34:59+00:00
Abstract
Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
中文标题/摘要
标题:VideoAR:通过下一帧和尺度预测的自回归视频生成
近期视频生成的进展主要由扩散模型和流匹配模型主导,这些模型虽然能生成高质量的结果,但计算开销大且难以扩展。本文中,我们提出了 VideoAR,这是首个用于视频生成的大规模视觉自回归(VAR)框架,它将多尺度下一帧预测与自回归建模相结合。VideoAR 通过将帧内 VAR 建模与因果下一帧预测相结合来解耦空间和时间依赖,并由一个能高效编码时空动态的 3D 多尺度分词器提供支持。为了提高长期一致性,我们提出了多尺度时间 RoPE、跨帧误差校正和随机帧掩码,这些方法共同减轻了误差传播并稳定了时间连贯性。我们的多阶段预训练流水线在逐步提高的分辨率和时长上对空间和时间学习进行渐进式对齐。实验结果表明,VideoAR 在自回归模型中取得了新的最佳结果,将 UCF-101 上的 FVD 从 99.5 降低到 88.6,同时将推理步数减少了 10 倍以上,并达到了 81.74 的 VBench 得分,可与规模大一个数量级的基于扩散的模型相媲美。这些结果表明,VideoAR 缩小了自回归范式与扩散范式之间的性能差距,为未来的视频生成研究提供了一个可扩展、高效且时间一致的基础。
Summary / 总结
VideoAR is a Visual Autoregressive framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. It uses a 3D multi-scale tokenizer to encode spatio-temporal dynamics and adds Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask to improve long-term consistency. Among autoregressive models, VideoAR sets a new state of the art, lowering FVD on UCF-101 from 99.5 to 88.6 while cutting inference steps by over 10x, and it reaches a VBench score of 81.74, competitive with much larger diffusion-based models.
VideoAR 是一种视觉自回归视频生成框架,结合了多尺度下一帧预测和自回归建模。它使用 3D 多尺度分词器编码时空动态,并引入多尺度时间 RoPE、跨帧误差校正和随机帧掩码来增强长期一致性。在自回归模型中,VideoAR 取得了新的最佳结果:将 UCF-101 上的 FVD 从 99.5 降低到 88.6,同时将推理步数减少 10 倍以上,并达到 81.74 的 VBench 得分,可与规模大得多的扩散模型相媲美。
Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections
Authors: Jing Wu, Zirui Wang, Iro Laina, Victor Adrian Prisacariu
First: 2025-09-24T23:00:22+00:00 · Latest: 2026-01-09T17:24:32+00:00
Comments: 3DV 2026. Code and Data Available at https://jingwu2121.github.io/reflect3r/
Abstract
Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process, making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
中文标题/摘要
标题:Reflect3r:利用镜面反射的单视角三维立体重建
镜面反射在日常环境中很常见,并且可以在单次拍摄中提供立体信息,因为真实视图和反射视图可以同时可见。我们通过将反射视为辅助视图并设计一种构造物理上有效的虚拟相机的方法来利用这一特性,从而在保持真实世界成像过程的同时直接在像素域生成虚拟视图。这使得可以从单张图像构建多视角立体设置,简化成像过程,使其与强大的前馈重建模型兼容,从而实现通用且稳健的三维重建。为了进一步利用镜子引入的几何对称性,我们提出了一种对称感知损失来细化姿态估计。我们的框架还自然扩展到动态场景,在这些场景中,每一帧都包含镜面反射,从而实现高效的逐帧几何恢复。为了进行定量评估,我们提供了一个完全可定制的合成数据集,包含16个Blender场景,每个场景都有地面实况点云和相机姿态。在真实世界数据和合成数据上进行了广泛的实验,以说明我们方法的有效性。
Summary / 总结
The research aims to leverage mirror reflections for single-view 3D stereo reconstruction. The method involves treating the reflection as an auxiliary view and designing a transformation to construct a virtual camera, enabling the generation of a virtual view while adhering to the real-world imaging process. Key findings include the effectiveness of the symmetric-aware loss in refining pose estimation and the capability of the framework to handle dynamic scenes, demonstrating robust and generalizable 3D reconstruction. Experiments on both real-world and synthetic data confirm the method's effectiveness.
研究旨在利用镜面反射进行单视角3D立体重建。方法包括将反射视为辅助视图,并设计一种转换来构建虚拟相机,从而在遵循真实世界成像过程的同时生成虚拟视图。关键发现包括对称感知损失在姿态估计细化中的有效性,以及框架能够处理动态场景,展示出稳健且通用的3D重建能力。在真实世界和合成数据上的实验验证了该方法的有效性。
On the Robustness of Age for Learning-Based Wireless Scheduling in Unknown Environments
Authors: Juaren Steiger, Bin Li
First: 2026-01-09T17:15:17+00:00 · Latest: 2026-01-09T17:15:17+00:00
Comments: technical report of conference paper
Abstract
The constrained combinatorial multi-armed bandit model has been widely employed to solve problems in wireless networking and related areas, including the problem of wireless scheduling for throughput optimization under unknown channel conditions. Most work in this area uses an algorithm design strategy that combines a bandit learning algorithm with the virtual queue technique to track the throughput constraint violation. These algorithms seek to minimize the virtual queue length in their algorithm design. However, in networks where channel conditions change abruptly, the resulting constraints may become infeasible, leading to unbounded growth in virtual queue lengths. In this paper, we make the key observation that the dynamics of the head-of-line age, i.e. the age of the oldest packet in the virtual queue, make it more robust when used in algorithm design compared to the virtual queue length. We therefore design a learning-based scheduling policy that uses the head-of-line age in place of the virtual queue length. We show that our policy matches state-of-the-art performance under i.i.d. network conditions. Crucially, we also show that the system remains stable even under abrupt changes in channel conditions and can rapidly recover from periods of constraint infeasibility.
中文标题/摘要
标题:基于学习的无线调度在未知环境中的年龄鲁棒性研究
受限组合多臂老虎机模型已被广泛应用于无线网络及相关领域的诸多问题中,包括在未知信道条件下进行无线调度以优化吞吐量的问题。该领域大多数工作采用了一种结合老虎机学习算法和虚拟队列技术的算法设计策略,以追踪吞吐量约束的违反情况。这些算法在算法设计中寻求最小化虚拟队列长度。然而,在信道条件突然变化的网络中,这些约束可能变得不可行,导致虚拟队列长度无界增长。在本文中,我们提出了一个关键观察,即头包年龄,即虚拟队列中最老的包的年龄,在算法设计中比虚拟队列长度更具鲁棒性。因此,我们设计了一种基于学习的调度策略,该策略使用头包年龄代替虚拟队列长度。我们证明,在独立同分布的网络条件下,我们的策略可以达到最先进的性能。至关重要的是,我们还证明,在信道条件突然变化的情况下,系统仍然保持稳定,并且可以从约束不可行的时期迅速恢复。
Summary / 总结
This paper studies the robustness of age-based scheduling in wireless networks under unknown channel conditions. It proposes a learning-based scheduling policy that uses the head-of-line age, i.e. the age of the oldest packet in the virtual queue, instead of the virtual queue length to manage throughput constraints. The policy is shown to match state-of-the-art performance under i.i.d. network conditions, to remain stable under abrupt channel changes, and to recover rapidly from periods of constraint infeasibility.
本文探讨了在未知信道条件下无线网络中基于年龄的调度的鲁棒性。它提出了一种基于学习的调度策略,使用头包年龄(即虚拟队列中最老数据包的年龄)而非虚拟队列长度来处理吞吐量约束。该策略在独立同分布的网络条件下达到了与最先进方法相当的性能,在信道条件突然变化时仍保持稳定,并能从约束不可行的时期迅速恢复。
Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation
Authors: Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Yixuan Li, Jiwei Zhao
First: 2025-09-24T22:00:49+00:00 · Latest: 2026-01-09T17:01:50+00:00
Abstract
We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.
中文标题/摘要
标题:无监督领域适应在二元分类中的应用及不可观测源子群体
我们研究了一个无监督领域适应问题,其中源领域由由二元标签$Y$和二元背景(或环境)$A$定义的子群体组成。我们关注一个具有挑战性的场景,在该场景中,源领域中的一个子群体是不可观测的。简单忽略这个未观察到的群体会导致估计偏差和预测性能下降。尽管存在这种结构化的缺失性,我们证明了目标领域的预测仍然可以恢复。具体来说,我们严格推导了目标领域的背景特定和总体预测模型。为了实际实施,我们提出了分布匹配方法来估计子群体比例。我们为我们的估计器的渐近行为提供了理论保证,并建立了预测误差的上界。在合成数据集和真实世界数据集上的实验表明,我们的方法优于不考虑这个不可观测源子群体的朴素基准方法。
Summary / 总结
This study addresses an unsupervised domain adaptation problem where the source domain is divided into subpopulations based on binary label $Y$ and background $A$, with one subpopulation being unobservable. The research aims to recover prediction in the target domain despite this structured missingness. The authors derive background-specific and overall prediction models for the target domain and propose a distribution matching method to estimate subpopulation proportions. Theoretical guarantees and an upper bound on prediction error are provided. Experiments on synthetic and real-world datasets demonstrate that the proposed method outperforms the naive approach that ignores the unobservable subpopulation.
研究探讨了一个源域根据二元标签$Y$和背景$A$划分为子群体的无监督领域适应问题,其中一个子群体是不可见的。研究旨在尽管存在缺失数据,仍能恢复目标域的预测。作者推导了背景特定和总体预测模型,并提出了一种分布匹配方法来估计子群体比例。提供了理论保证和预测误差的上界。在合成和真实世界数据集上的实验表明,所提出的方法优于忽略不可见子群体的朴素方法。
Higher-Order Domain Generalization in Magnetic Resonance-Based Assessment of Alzheimer's Disease
Authors: Zobia Batool, Diala Lteif, Vijaya B. Kolachalama, Huseyin Ozkan, Erchan Aptoula
First: 2026-01-04T11:25:36+00:00 · Latest: 2026-01-09T17:01:49+00:00
Abstract
Despite progress in deep learning for Alzheimer's disease (AD) diagnostics, models trained on structural magnetic resonance imaging (sMRI) often do not perform well when applied to new cohorts due to domain shifts from varying scanners, protocols and patient demographics. AD, the primary driver of dementia, manifests through progressive cognitive and neuroanatomical changes like atrophy and ventricular expansion, making robust, generalizable classification essential for real-world use. While convolutional neural networks and transformers have advanced feature extraction via attention and fusion techniques, single-domain generalization (SDG) remains underexplored yet critical, given the fragmented nature of AD datasets. To bridge this gap, we introduce Extended MixStyle (EM), a framework for blending higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations. Trained on sMRI data from the National Alzheimer's Coordinating Center (NACC; n=4,647) to differentiate persons with normal cognition (NC) from those with mild cognitive impairment (MCI) or AD and tested on three unseen cohorts (total n=3,126), EM yields enhanced cross-domain performance, improving macro-F1 on average by 2.4 percentage points over state-of-the-art SDG benchmarks, underscoring its promise for invariant, reliable AD detection in heterogeneous real-world settings. The source code will be made available upon acceptance at https://github.com/zobia111/Extended-Mixstyle.
中文标题/摘要
标题:基于磁共振成像的阿尔茨海默病评估中的高阶领域泛化
尽管深度学习在阿尔茨海默病(AD)诊断方面取得了进展,但由于扫描仪、采集协议和患者人口统计特征不同所造成的领域偏移,在结构磁共振成像(sMRI)上训练的模型应用于新队列时往往表现不佳。AD 是痴呆的主要病因,表现为渐进性的认知和神经解剖学变化,如脑萎缩和脑室扩大,因此稳健且可泛化的分类对于实际应用至关重要。虽然卷积神经网络和 Transformer 通过注意力和融合技术推进了特征提取,但鉴于 AD 数据集的碎片化特性,单域泛化(SDG)虽然至关重要,却仍未得到充分研究。为了弥合这一差距,我们提出了 Extended MixStyle(EM),这是一个通过混合高阶特征矩(偏度和峰度)来模拟多样分布变化的框架。EM 在美国国家阿尔茨海默病协调中心(NACC;n=4,647)的 sMRI 数据上训练,用于区分认知正常(NC)者与轻度认知障碍(MCI)或 AD 患者,并在三个未见过的队列(总计 n=3,126)上测试。EM 提升了跨域性能,宏 F1 平均比最先进的 SDG 基准高 2.4 个百分点,凸显了其在异质的真实世界环境中实现不变、可靠的 AD 检测的潜力。源代码将在论文被接收后发布于 https://github.com/zobia111/Extended-Mixstyle。
Summary / 总结
The research aims to improve the robustness and generalizability of deep learning models for Alzheimer's disease (AD) diagnosis using structural magnetic resonance imaging (sMRI). To address domain shifts, the study introduces Extended MixStyle (EM), which blends higher-order feature moments (skewness and kurtosis) to simulate diverse distributional variations. Trained on NACC data and tested on three unseen cohorts, EM improves macro-F1 by 2.4 percentage points on average over state-of-the-art single-domain generalization benchmarks, demonstrating its potential for invariant AD detection in heterogeneous settings. The authors state that the source code will be released upon acceptance.
研究旨在解决使用结构磁共振成像(sMRI)数据进行阿尔茨海默病(AD)诊断时出现的领域偏移问题。研究引入了Extended MixStyle(EM)框架,通过融合高阶特征矩来提升跨域性能。EM在三个未见过的队列中测试时,平均将宏F1分数提高了2.4个百分点,超过了现有的单域泛化基准,展示了其在异质环境中进行AD检测的潜力。
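For readers unfamiliar with the statistics involved, the NumPy sketch below computes the four per-channel moments named in the abstract and applies the standard MixStyle idea of interpolating mean and standard deviation between two feature maps. It is not the authors' implementation: how Extended MixStyle re-injects the mixed skewness and kurtosis is not described in the abstract and is deliberately left out, and all function names and shapes are assumptions.

```python
import numpy as np


def channel_moments(features: np.ndarray):
    """Per-channel mean, std, skewness, and excess kurtosis.

    `features` has shape (channels, positions).
    """
    mean = features.mean(axis=1)
    std = features.std(axis=1) + 1e-6  # small epsilon avoids division by zero
    z = (features - mean[:, None]) / std[:, None]
    skewness = (z ** 3).mean(axis=1)
    kurtosis = (z ** 4).mean(axis=1) - 3.0
    return mean, std, skewness, kurtosis


def mix_mean_std(features_a: np.ndarray, features_b: np.ndarray, lam: float) -> np.ndarray:
    """Standard MixStyle-like mixing of first- and second-order statistics.

    Normalizes `features_a`, then rescales it with statistics interpolated
    between A and B (lam = 1 keeps A's style, lam = 0 adopts B's).
    """
    mean_a, std_a, _, _ = channel_moments(features_a)
    mean_b, std_b, _, _ = channel_moments(features_b)
    mixed_mean = lam * mean_a + (1.0 - lam) * mean_b
    mixed_std = lam * std_a + (1.0 - lam) * std_b
    normalized = (features_a - mean_a[:, None]) / std_a[:, None]
    return normalized * mixed_std[:, None] + mixed_mean[:, None]
```

In MixStyle-type methods the weight `lam` is typically sampled from a Beta distribution during training; a fixed value is used here only to keep the example deterministic.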
Context-Aware Decoding for Faithful Vision-Language Generation
Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
First: 2026-01-09T16:50:57+00:00 · Latest: 2026-01-09T16:50:57+00:00
Abstract
Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
中文标题/摘要
标题:面向忠实视觉-语言生成的上下文感知解码
幻觉,即生成与视觉输入不一致的响应,仍然是大型视觉语言模型(LVLMs)的关键局限,在图像描述和视觉推理等开放式任务中尤为突出。在本文中,我们探究了导致幻觉的逐层生成动态,并提出了一种无需训练的缓解策略。借助 Logit Lens,我们考察了 LVLMs 如何在解码器各层构建下一词元的分布,发现了显著的承诺深度差距(commitment-depth gap):与幻觉词元相比,真实词元更早地将概率质量集中到其最终候选上。基于这一发现,我们提出了上下文嵌入注入(CEI),这是一种轻量级方法,利用最后一个输入词元的隐藏状态(即上下文嵌入)作为接地信号,在整个解码过程中保持视觉保真度并抑制幻觉。在 CHAIR、AMBER 和 MMHal-Bench 基准(最大词元长度为 512)上的评估表明,CEI 在三种 LVLM 上均优于最先进的基线,其动态变体取得了最低的总体幻觉率。通过将新的机制性见解与可扩展的干预手段相结合,本文推进了 LVLMs 中幻觉的缓解。
Summary / 总结
This work addresses the issue of hallucinations in large vision-language models (LVLMs) by analyzing the layer-wise generation dynamics and proposing a training-free mitigation strategy called Context Embedding Injection (CEI). CEI uses the hidden state of the last input token as a grounding signal to maintain visual fidelity during decoding. The method outperforms state-of-the-art baselines on the CHAIR, AMBER, and MMHal-Bench benchmarks, with the dynamic variant achieving the lowest hallucination rates across three LVLMs.
该研究通过分析层间生成动态并提出Context Embedding Injection (CEI) 方法来解决大型视觉语言模型(LVLMs)中的幻觉问题。CEI 使用最后一个输入令牌的隐藏状态作为接地信号,以在解码过程中保持视觉保真度。该方法在CHAIR、AMBER 和 MMHal-Bench 基准测试中优于最先进的基线,动态变体在三个 LVLMs 上实现了最低的幻觉率。
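The Logit Lens analysis mentioned above can be illustrated with a toy NumPy sketch: each layer's hidden state is projected through the unembedding matrix to obtain a per-layer next-token distribution, and a "commitment depth" is read off as the first layer from which the final token stays above a probability threshold. That definition of commitment depth, the threshold, and the toy matrices are assumptions for illustration; real Logit Lens implementations usually also apply the model's final normalization layer, which is omitted here.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def logit_lens(hidden_states: np.ndarray, unembedding: np.ndarray) -> np.ndarray:
    """Project per-layer hidden states to vocabulary probabilities.

    hidden_states: (num_layers, d_model); unembedding: (d_model, vocab_size).
    """
    return softmax(hidden_states @ unembedding)


def commitment_depth(layer_probs: np.ndarray, threshold: float = 0.5) -> int:
    """First layer from which the final top token stays above `threshold`.

    Returns num_layers if the token never stays above the threshold.
    """
    final_token = int(layer_probs[-1].argmax())
    above = layer_probs[:, final_token] >= threshold
    for layer in range(len(above)):
        if above[layer:].all():
            return layer
    return len(above)


# Toy setup: 4 layers, identity unembedding over a 4-token vocabulary.
unembedding = np.eye(4)
early = np.array([[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 3, 0], [0, 0, 5, 0]], dtype=float)
late = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0.5, 0], [0, 0, 5, 0]], dtype=float)

early_depth = commitment_depth(logit_lens(early, unembedding))
late_depth = commitment_depth(logit_lens(late, unembedding))
```

In this toy case the `early` trajectory commits to its final token one layer sooner than the `late` one, which is the kind of gap the abstract reports between truthful and hallucinatory tokens.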
Low-Latency Event-Based Velocimetry for Quadrotor Control in a Narrow Pipe
Authors: Leonard Bauersfeld, Davide Scaramuzza
First: 2025-07-21T09:53:42+00:00 · Latest: 2026-01-09T16:49:03+00:00
Comments: 19 pages
Abstract
Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotors for hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. The flow-feedback control proves particularly effective during lateral translation maneuvers in the pipe cross-section. There, the real-time disturbance information enables the controller to effectively counteract transient aerodynamic effects, thereby preventing collisions with the pipe wall. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements. This opens new directions for research on flight in aerodynamically complex environments. In addition, our work also sheds light on the characteristic flow structures that emerge during flight in narrow, circular pipes, providing new insights at the intersection of robotics and fluid dynamics.
中文标题/摘要
标题:用于窄管内四旋翼控制的低延迟事件驱动测速
在管道和隧道等受限空间内自主四旋翼飞行面临着显著挑战,由于不稳定的自诱导气动干扰。最近的进展使得在这些条件下飞行成为可能,但要么依赖于管道内的持续运动以减轻气流再循环效应,要么在悬停时稳定性有限。在本研究中,我们提出了第一个利用实时流场测量的四旋翼在狭窄管道中悬停的闭环控制系统。我们开发了一种低延迟、事件驱动的烟雾流速测量方法,以高时间分辨率估计局部气流。这些流场信息被用于基于循环卷积神经网络的干扰估计器,实时推断力和力矩干扰。估计的干扰被集成到通过强化学习训练的基于学习的控制器中。流反馈控制在管道横截面内的横向平移机动中特别有效。在那里,实时干扰信息使控制器能够有效抵消瞬态气动效应,从而防止与管道壁发生碰撞。据我们所知,这项工作代表了第一个利用实时流场测量进行闭环控制的空中机器人演示。这为研究在气动复杂环境中飞行开辟了新的研究方向。此外,我们的工作还揭示了在狭窄圆形管道中飞行时出现的特征流场结构,为机器人学和流体力学的交叉领域提供了新的见解。
Summary / 总结
This research addresses the challenge of autonomous quadrotor flight in narrow pipes, where unsteady aerodynamic disturbances pose significant difficulties. The authors developed a low-latency event-based smoke velocimetry method to measure local airflow in real time. This data is used by a disturbance estimator based on a recurrent convolutional neural network to infer force and torque disturbances, which are then integrated into a reinforcement learning-based controller. The system effectively prevents collisions with the pipe wall during lateral translation maneuvers by counteracting transient aerodynamic effects in real time. This work is the first to demonstrate closed-loop control using real-time flow field measurements for aerial robots in such environments, opening new avenues for research in complex aerodynamic conditions.
本文通过开发低延迟的事件驱动烟流测速方法来测量局部气流,解决自主四旋翼在狭窄管道中飞行的挑战。气流信息被用于干扰估计器,并集成到基于强化学习的控制器中。该控制系统在横向机动时有效抵消了瞬态气动效应,防止与管道壁发生碰撞。这是首次使用实时流场测量信息进行闭环控制的四旋翼飞行演示,为机器人学和流体力学交叉领域的研究开辟了新方向。
Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
Authors: Yohann Perron, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu
First: 2026-01-09T16:41:08+00:00 · Latest: 2026-01-09T16:41:08+00:00
Comments: 13 pages + 3 pages of supplementary material
Abstract
Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/ .
Summary / 总结
The paper addresses ultra-high-resolution semantic segmentation with a method that processes each image at a local and a global scale in parallel and exchanges features between the two branches through a small set of learnable relay tokens. This preserves both local detail and global context while adding fewer than 2% parameters to standard transformer backbones such as ViT and Swin. Experiments on Archaeoscape, URUR, Gleason, and Cityscapes show consistent gains, with a relative mIoU improvement of up to 15%.
论文提出了一种结合局部和全局推理的方法,使用relay tokens处理超高分辨率图像的语义分割问题。该方法在高分辨率和低分辨率下并行处理图像,并在两个尺度之间聚合特征,从而同时保留局部细节和全局上下文。实验结果显示,该方法在多个基准数据集上取得了一致的提升,相对mIoU最高提升15%。
From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
Authors: Jia Li, Yuxin Su, Michael R. Lyu
First: 2026-01-07T09:22:28+00:00 · Latest: 2026-01-09T16:30:25+00:00
Abstract
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.
中文标题/摘要
标题:从实验室到实际应用:在仓库级别基准测试代理代码推理
随着大型语言模型(LLMs)演变为自主代理,评估仓库级别的推理能力,即在庞大、真实且相互依赖的文件系统中保持逻辑一致性的能力,变得至关重要。当前的基准测试通常在孤立的代码片段与黑盒评估之间摇摆。我们提出了RepoReason,这是一种以溯因断言验证为中心的白盒诊断基准测试。为了消除记忆效应同时保持真实的逻辑深度,我们实现了一个执行驱动的变异框架,利用环境作为语义预言机来重新生成真值状态。此外,我们使用动态程序切片建立了一个细粒度的诊断系统,通过三个正交度量:$ESV$(阅读负载)、$MCL$(模拟深度)和$DFI$(集成宽度)来量化推理。对前沿模型(例如Claude-4.5-Sonnet、DeepSeek-v3.1-Terminus)的全面评估揭示了普遍存在的聚合缺陷,其中集成宽度是主要的认知瓶颈。我们的研究结果为优化下一代代理式软件工程提供了细粒度的白盒洞察。
Summary / 总结
The research aims to evaluate the ability of large language models to maintain logical consistency across large, real-world code repositories. RepoReason, a white-box benchmark, focuses on abductive assertion verification and uses an execution-driven mutation framework to assess models. Key findings show that current models struggle with integration width, indicating a primary cognitive bottleneck in reasoning across the entire codebase.
研究旨在评估大型语言模型在大规模真实代码仓库中保持逻辑一致性的能力。RepoReason 是一个白盒基准,专注于溯因断言验证,并使用执行驱动的变异框架来评估模型。主要发现表明,当前模型在集成宽度方面存在困难,这说明集成宽度是跨整个代码库推理的主要认知瓶颈。
Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
Authors: Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon
First: 2026-01-09T16:28:55+00:00 · Latest: 2026-01-09T16:28:55+00:00
Comments: 20 pages + supplement
Abstract
Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce 'SubDistill', a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
中文标题/摘要
标题:通过识别相关子空间从大型机器学习模型中蒸馏轻量级领域专家
知识蒸馏涉及将大型高性能人工智能模型(教师)的预测能力转移到可以在计算能力有限的环境中运行的小型模型(学生)中。在本文中,我们解决了一个只有少数几类及其相关中间概念才需要蒸馏的场景。这种场景在实践中很常见,但现有的大多数蒸馏方法并未明确关注相关子任务。为了解决这一差距,我们引入了“SubDistill”,一种具有改进数值特性的新蒸馏算法,该算法在每一层仅蒸馏教师模型的相关组件。在CIFAR-100和ImageNet上的卷积和Transformer模型实验表明,SubDistill在代表性子任务上优于现有的逐层蒸馏技术。我们的基准评估得到了可解释人工智能分析的支持,表明我们的蒸馏学生模型更接近原始教师模型的决策结构。
Summary / 总结
The research aims to improve knowledge distillation for scenarios where only a few classes and their associated intermediate concepts are relevant. The method, SubDistill, distills only the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with convolutional and Transformer models show that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks, and Explainable AI analyses show that the distilled student models more closely match the decision structure of the original teacher model.
研究旨在改进仅有少数类别及其相关中间概念需要蒸馏的场景下的知识蒸馏。作者提出了SubDistill,一种新的蒸馏算法,在每一层仅蒸馏教师模型的相关组件。实验表明,SubDistill在CIFAR-100和ImageNet上的代表性子任务上优于现有逐层蒸馏方法,并且可解释AI分析显示,蒸馏后的学生模型更接近原始教师模型的决策结构。
Multi-task Modeling for Engineering Applications with Sparse Data
Authors: Yigitcan Comlek, R. Murali Krishnan, Sandipp Krishnan Ravi, Amin Moghaddas, Rafael Giorjao, Michael Eff, Anirban Samaddar, Nesar S. Ramachandra, Sandeep Madireddy, Liping Wang
First: 2026-01-09T16:28:19+00:00 · Latest: 2026-01-09T16:28:19+00:00
Comments: 15 pages, 5 figures, 6 tables
Abstract
Modern engineering and scientific workflows often require simultaneous predictions across related tasks and fidelity levels, where high-fidelity data is scarce and expensive, while low-fidelity data is more abundant. This paper introduces a Multi-Task Gaussian Processes (MTGP) framework tailored for engineering systems characterized by multi-source, multi-fidelity data, addressing challenges of data sparsity and varying task correlations. The proposed framework leverages inter-task relationships across outputs and fidelity levels to improve predictive performance and reduce computational costs. The framework is validated across three representative scenarios: Forrester function benchmark, 3D ellipsoidal void modeling, and friction-stir welding. By quantifying and leveraging inter-task relationships, the proposed MTGP framework offers a robust and scalable solution for predictive modeling in domains with significant computational and experimental costs, supporting informed decision-making and efficient resource utilization.
中文标题/摘要
标题:工程应用中稀疏数据的多任务建模
现代工程和科学工作流通常需要同时对相关任务和保真度级别进行预测,其中高保真数据稀缺且昂贵,而低保真数据则更为丰富。本文介绍了一种多任务高斯过程(MTGP)框架,专门针对以多源、多保真度数据为特征的工程系统,以应对数据稀疏性和任务相关性变化的挑战。该框架利用跨输出和保真度级别的任务间关系来提高预测性能并降低计算成本。该框架在Forrester函数基准、3D椭球空洞建模和搅拌摩擦焊接三个代表性场景中进行了验证。通过量化和利用任务间关系,所提出的MTGP框架为计算和实验成本高昂的领域提供了稳健且可扩展的预测建模解决方案,支持知情决策和高效资源利用。
Summary / 总结
The paper introduces a Multi-Task Gaussian Processes (MTGP) framework to address the challenges of data sparsity and varying task correlations in engineering systems with multi-source, multi-fidelity data. The method leverages inter-task relationships to improve predictive performance and reduce computational costs. The framework is validated through three scenarios: Forrester function benchmark, 3D ellipsoidal void modeling, and friction-stir welding, demonstrating its robustness and scalability in domains with high computational and experimental costs.
论文提出了一种多任务高斯过程(MTGP)框架,以应对工程系统中多源、多保真度数据的数据稀疏性和任务相关性变化的挑战。该方法利用任务间的关系来提高预测性能并减少计算成本。框架通过Forrester函数基准、3D椭球空洞建模和搅拌摩擦焊接三个场景进行验证,展示了其在高计算和实验成本领域中的稳健性和可扩展性。
SCOPE: Sequential Causal Optimization of Process Interventions
Authors: Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt
First: 2025-12-19T14:33:02+00:00 · Latest: 2026-01-09T16:21:13+00:00
Abstract
Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
中文标题/摘要
标题:SCOPE:过程干预的顺序因果优化
规范性过程监控(PresPM)在业务流程执行期间推荐干预措施,以优化关键绩效指标(KPI)。在现实环境中,干预措施很少是孤立的:组织需要协调一系列干预措施,以共同引导案例的结果。现有的PresPM方法在这方面存在不足。许多方法仅关注单一的干预决策,而其他方法则独立处理多个干预措施,忽略了它们随时间的相互作用。确实处理这些依赖关系的方法要么依赖模拟,要么依赖数据增强来近似过程以训练强化学习(RL)代理,这可能造成现实差距并引入偏差。我们提出了SCOPE,这是一种能够学习协调一致的顺序干预建议的PresPM方法。SCOPE 使用逆向归纳来估计每个候选干预行动的效果,将其影响从最终决策点反向传播到第一个决策点。通过利用因果学习器,我们的方法可以直接使用观察数据,而不像那些需要为强化学习构建过程近似的方法。在一个现有合成数据集和一个新的半合成数据集上的实验表明,SCOPE 在优化KPI方面始终优于最先进的PresPM技术。这一基于真实事件日志的新颖半合成设置作为可复用的基准提供,供未来的顺序PresPM研究使用。
Summary / 总结
SCOPE is a Prescriptive Process Monitoring approach that addresses the limitations of existing methods by learning aligned sequential intervention recommendations. It uses backward induction and causal learners to estimate the impact of each intervention action, propagating the effects from the final decision point back to the first. Experiments show that SCOPE outperforms state-of-the-art techniques in optimizing key performance indicators across both synthetic and semi-synthetic datasets.
SCOPE 是一种规范性过程监控(Prescriptive Process Monitoring)方法,在业务流程中提供协调一致的顺序干预建议。它使用逆向归纳和因果学习器来估计每个候选干预的影响,并将其从最终决策点反向传播回第一个决策点。实验表明,SCOPE 在优化关键绩效指标方面优于现有技术。该方法不依赖于模拟或数据增强,从而避免了现实差距和偏差。论文还提供了一个基于真实事件日志的新颖半合成数据集,作为未来研究的基准。
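The backward-induction idea in the SCOPE entry above can be illustrated with a toy two-stage decision problem. This is a minimal sketch and not the authors' implementation: the table `expected_kpi` stands in for the estimates a fitted causal learner might produce, and every name and number in it is hypothetical.

```python
ACTIONS = ("wait", "intervene")

# Hypothetical expected KPI for each (first action, second action) pair.
# In a real setting these values would be estimated from observational data.
expected_kpi = {
    ("wait", "wait"): 1.0,
    ("wait", "intervene"): 3.0,
    ("intervene", "wait"): 2.0,
    ("intervene", "intervene"): 2.5,
}

def backward_induction(kpi):
    """Solve the final decision point first, then propagate its value back."""
    # Final decision point: best second action for each possible first action.
    second_policy = {a1: max(ACTIONS, key=lambda a2: kpi[(a1, a2)]) for a1 in ACTIONS}
    # First decision point: value of each action given an optimal continuation.
    first_value = {a1: kpi[(a1, second_policy[a1])] for a1 in ACTIONS}
    best_first = max(ACTIONS, key=first_value.get)
    return best_first, second_policy, first_value[best_first]

def greedy_independent(kpi):
    """Baseline: choose each action by its average effect, ignoring interactions."""
    avg1 = {a: sum(kpi[(a, b)] for b in ACTIONS) / len(ACTIONS) for a in ACTIONS}
    avg2 = {b: sum(kpi[(a, b)] for a in ACTIONS) / len(ACTIONS) for b in ACTIONS}
    a1 = max(ACTIONS, key=avg1.get)
    a2 = max(ACTIONS, key=avg2.get)
    return a1, a2, kpi[(a1, a2)]

print(backward_induction(expected_kpi))
print(greedy_independent(expected_kpi))
```

In this toy table, treating the two decisions independently selects "intervene" twice (KPI 2.5), whereas backward induction finds that waiting first and intervening later yields 3.0, which is the kind of interaction between interventions the abstract describes.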
TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents
Authors: Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison
Venue: AAAI 2026 Oral
First: 2026-01-09T16:18:08+00:00 · Latest: 2026-01-09T16:18:08+00:00
Comments: AAAI 2026 Oral
Abstract
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).
中文标题/摘要
标题:TowerMind:一种基于塔防游戏的LLM代理学习环境和基准
大型语言模型(LLMs)的最新突破使其成为代理的有前途的范式,长期规划和决策能力成为适应各种场景和任务的核心通用能力。即时战略(RTS)游戏为评估这两种能力提供了理想的测试平台,因为其固有的游戏玩法需要宏观层面的战略规划和微观层面的战术适应和行动执行。现有的基于RTS游戏的环境要么计算需求相对较高,要么缺乏对文本观察的支持,这限制了RTS游戏在LLM评估中的应用。受此启发,我们提出了TowerMind,一种基于即时战略游戏塔防(TD)子类的新颖环境。TowerMind保留了RTS游戏评估LLMs的关键优势,同时具有较低的计算需求和多模态观察空间,包括基于像素、文本和结构化游戏状态的表示。此外,TowerMind支持模型幻觉的评估,并提供了高度的可定制性。我们设计了五个基准级别,在不同的多模态输入设置下评估了多种广泛使用的LLMs。结果表明,LLMs在能力和幻觉维度上与人类专家之间存在明显的性能差距。实验进一步突显了LLMs行为的关键局限性,如规划验证不足、决策缺乏多目标性以及行动使用效率低下。我们还评估了两个经典的强化学习算法:Ape-X DQN和PPO。通过提供轻量级和多模态设计,TowerMind补充了现有的基于RTS游戏的环境景观,并为AI代理领域引入了一个新的基准。源代码已公开发布在GitHub(https://github.com/tb6147877/TowerMind)。
Summary / 总结
TowerMind is a new learning environment and benchmark for evaluating Large Language Models (LLMs) as agents in tower defense games, which are a subgenre of Real-Time Strategy (RTS) games. It addresses the limitations of existing RTS game environments by offering low computational demands and a multimodal observation space. The benchmark includes five levels to evaluate various LLMs under different input settings, revealing a significant performance gap between LLMs and human experts. The experiments highlight key limitations in LLM behavior, such as inadequate planning validation and inefficient action use. TowerMind also supports the evaluation of model hallucination and provides high customizability. It complements existing RTS game-based environments and introduces a new benchmark for AI agents. The source code is available on GitHub.
TowerMind 是一个新型的塔防游戏环境,旨在评估大型语言模型(LLMs)作为代理的能力。它结合了即时战略游戏中所需的策略规划和战术适应性,同时具有低计算需求和多模态观察。研究对几种LLM和两种强化学习算法进行了基准测试,结果显示LLMs在能力和幻觉方面与人类专家之间存在显著差距。LLM行为的关键局限包括规划验证不足、决策缺乏多目标性以及行动使用效率低下。
RobustFormer: Noise-Robust Pre-training for images and videos
Authors: Ashish Bastola, Nishant Luitel, Hao Wang, Danda Pani Paudel, Roshani Poudel, Abolfazl Razi
First: 2024-11-20T05:10:48+00:00 · Latest: 2026-01-09T16:12:54+00:00
Comments: 13 pages
Abstract
While deep learning-based models like transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit on noisy patterns rather than robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that can easily be corrupted. To address this, we leverage the discrete wavelet transform (DWT) for its ability to decompose into multi-resolution layers, isolating noise primarily in the high-frequency domain while preserving essential low-frequency information for resilient feature learning. Conventional DWT-based methods, however, struggle with computational inefficiencies due to the requirement for a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets demonstrate that our approach achieves up to 8% increase in Top-1 classification accuracy under severe noise conditions in ImageNet-C and up to 2.7% in ImageNet-P standard benchmarks compared to the baseline and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations while maintaining similar accuracy scores for clean datasets. We also observe the reduction of computation complexity by up to 4.4% through IDWT removal compared to VideoMAE baseline without any performance drop.
中文标题/摘要
标题:RobustFormer:图像和视频的噪声鲁棒预训练
虽然Transformer等基于深度学习的模型已经革新了时间序列和视觉任务,但它们仍然极易受噪声影响,并且往往过拟合噪声模式而不是学习鲁棒特征。这一问题在依赖像素级细节的视觉Transformer中尤为严重,因为这些细节很容易被破坏。为了解决这一问题,我们利用离散小波变换(DWT)分解为多分辨率层的能力,将噪声主要隔离在高频域,同时保留必要的低频信息以实现鲁棒特征学习。然而,传统的基于DWT的方法由于需要后续的逆离散小波变换(IDWT)步骤而面临计算效率低的问题。在本工作中,我们引入了RobustFormer,这是一种新颖的框架,通过使用DWT进行高效下采样,消除了昂贵的IDWT重建步骤,并简化注意力机制以专注于噪声鲁棒的多尺度表示,从而实现图像和视频的噪声鲁棒掩码自编码器(MAE)预训练。据我们所知,RobustFormer是第一个完全兼容视频输入和MAE风格预训练的基于DWT的方法。在有噪声的图像和视频数据集上的广泛实验表明,与基线相比,在严重噪声条件下,我们的方法在ImageNet-C和ImageNet-P标准基准上的Top-1分类准确率最高分别提高8%和2.7%,在UCF-101的严重自定义噪声扰动下Top-1准确率最高提高13%,同时在干净数据集上保持相近的准确率。我们还观察到,与VideoMAE基线相比,去除IDWT使计算复杂度最多降低4.4%,且没有任何性能下降。
Summary / 总结
RobustFormer addresses the noise susceptibility of transformer-based models by using the discrete wavelet transform (DWT) for efficient downsampling, which isolates noise primarily in the high-frequency domain. The method eliminates the inverse DWT step and simplifies the attention mechanism to focus on noise-resilient multi-scale representations. Compared to the baseline, RobustFormer improves Top-1 classification accuracy under severe noise by up to 8% on ImageNet-C, up to 2.7% on ImageNet-P, and up to 13% on UCF-101 with custom noise perturbations, while maintaining similar accuracy on clean datasets. Removing the IDWT also reduces computational complexity by up to 4.4% relative to the VideoMAE baseline.
RobustFormer通过利用离散小波变换(DWT)进行高效的降采样和噪声隔离,解决了视觉Transformer等深度学习模型的噪声敏感性问题。这种方法消除了逆DWT步骤的需要,简化了注意力机制,并专注于噪声鲁棒的多尺度表示。实验表明,RobustFormer在严重噪声条件下的ImageNet-C和ImageNet-P基准测试中Top-1分类准确率最高分别提高8%和2.7%,在UCF-101的严重自定义噪声条件下准确率最高提高13%,同时在干净数据集上保持相近的准确率;计算复杂度最多降低4.4%,且无性能下降。
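The claim in the RobustFormer entry above, that a DWT separates high-frequency noise from low-frequency content while downsampling, can be seen in a one-level Haar transform on a short 1D signal. This is a minimal sketch under stated assumptions: the abstract does not say which wavelet is used, so the Haar wavelet, the helper name `haar_dwt_1d`, and the toy signal are all hypothetical choices for illustration.

```python
import math

def haar_dwt_1d(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums (half the original length).
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # High-pass branch: scaled pairwise differences.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

clean = [4.0, 4.0, 6.0, 6.0]
noise = [1.0, -1.0, 1.0, -1.0]  # alternating, i.e. purely high-frequency, perturbation
noisy = [c + n for c, n in zip(clean, noise)]

clean_low, clean_high = haar_dwt_1d(clean)
noisy_low, noisy_high = haar_dwt_1d(noisy)

print("low band, clean vs noisy: ", clean_low, noisy_low)
print("high band, clean vs noisy:", clean_high, noisy_high)
```

For this particular alternating perturbation the low-frequency band of the noisy signal matches that of the clean signal, and the perturbation appears only in the detail coefficients. Real noise is not confined to one band so cleanly, which is why the abstract says noise is isolated "primarily" in the high-frequency domain.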
Improving Matrix Exponential for Generative AI Flows: A Taylor-Based Approach Beyond Paterson--Stockmeyer
Authors: Jorge Sastre, Daniel Faronbi, José Miguel Alonso, Peter Traver, Javier Ibáñez, Nuria Lloret
First: 2025-12-23T21:25:40+00:00 · Latest: 2026-01-09T16:10:31+00:00
Comments: 42 pages, 35 figures
Abstract
The matrix exponential is a fundamental operator in scientific computing and system simulation, with applications ranging from control theory and quantum mechanics to modern generative machine learning. While Padé approximants combined with scaling and squaring have long served as the standard, recent Taylor-based methods, which utilize polynomial evaluation schemes that surpass the classical Paterson--Stockmeyer technique, offer superior accuracy and reduced computational complexity. This paper presents an optimized Taylor-based algorithm for the matrix exponential, specifically designed for the high-throughput requirements of generative AI flows. We provide a rigorous error analysis and develop a dynamic selection strategy for the Taylor order and scaling factor to minimize computational effort under a prescribed error tolerance. Extensive numerical experiments demonstrate that our approach provides significant acceleration and maintains high numerical stability compared to existing state-of-the-art implementations. These results establish the proposed method as a highly efficient tool for large-scale generative modeling.
中文标题/摘要
标题:改进生成式AI流中的矩阵指数:一种超越Paterson-Stockmeyer的基于泰勒的方法
矩阵指数是科学计算和系统仿真中的基本算子,应用范围从控制理论和量子力学到现代生成式机器学习。虽然帕德逼近与缩放和平方(scaling and squaring)相结合长期以来一直是标准方法,但最近基于泰勒的方法利用超越经典Paterson-Stockmeyer技术的多项式求值方案,提供了更高的精度和更低的计算复杂度。本文提出了一种经过优化的基于泰勒的矩阵指数算法,专门针对生成式AI流的高吞吐量需求而设计。我们提供了严格的误差分析,并开发了一种动态选择泰勒阶数和缩放因子的策略,以在给定的误差容限下最小化计算量。广泛的数值实验表明,与现有最先进的实现相比,我们的方法提供了显著的加速并保持了高数值稳定性。这些结果确立了所提出方法作为大规模生成建模高效工具的地位。
Summary / 总结
This paper aims to improve the computation of the matrix exponential for generative AI flows by proposing an optimized Taylor-based algorithm. The method uses polynomial evaluation schemes that outperform the classical Paterson--Stockmeyer technique, offering better accuracy and reduced computational complexity. The authors provide a rigorous error analysis and develop a dynamic selection strategy for the Taylor order and scaling factor to minimize computational effort under a prescribed error tolerance. Extensive experiments show that the proposed approach significantly accelerates computations while maintaining high numerical stability compared to existing state-of-the-art implementations.
本文旨在提高生成式AI流中矩阵指数计算的效率和精度。提出了一种优化的基于泰勒的算法,其采用的多项式求值方案优于经典的Paterson-Stockmeyer技术。该方法包含一种动态选择泰勒阶数和缩放因子的策略,以在给定的误差容限下最小化计算量。实验结果表明,与现有最先进的实现相比,所提出的方法在保持高数值稳定性的同时显著加速了计算过程。
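The general scaling-and-squaring pattern with a truncated Taylor series, which the matrix-exponential entry above builds on, can be sketched in plain Python. This is an illustrative baseline only and not the paper's algorithm: the polynomial is evaluated with Horner's rule rather than a Paterson--Stockmeyer-style or faster scheme, and the fixed `order` and `target_norm` values are hypothetical defaults, whereas the paper selects the Taylor order and scaling factor dynamically.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    """Return the n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def one_norm(a):
    """Matrix 1-norm: the maximum absolute column sum."""
    n = len(a)
    return max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))

def expm_taylor(a, order=12, target_norm=0.5):
    """Approximate exp(a) by scaling, a truncated Taylor series, and repeated squaring."""
    n = len(a)
    # Scaling: choose s so that the 1-norm of a / 2**s is at most target_norm.
    s = 0
    norm = one_norm(a)
    while norm / (2 ** s) > target_norm:
        s += 1
    scaled = [[x / (2 ** s) for x in row] for row in a]
    # Degree-`order` Taylor polynomial via Horner's rule:
    # I + B/1 * (I + B/2 * (I + ... (I + B/order))).
    result = identity(n)
    for k in range(order, 0, -1):
        prod = mat_mul(scaled, result)
        result = [[(1.0 if i == j else 0.0) + prod[i][j] / k for j in range(n)]
                  for i in range(n)]
    # Squaring: exp(a) = exp(a / 2**s) ** (2**s).
    for _ in range(s):
        result = mat_mul(result, result)
    return result

print(expm_taylor([[0.0, 1.0], [0.0, 0.0]]))
print(expm_taylor([[1.0, 0.0], [0.0, 2.0]]))
```

Horner's rule needs one matrix product per Taylor term, so this sketch uses `order` products plus `s` squarings. Reducing the number of products needed for a given polynomial degree is exactly where schemes such as Paterson--Stockmeyer, and the methods the abstract says surpass it, save work.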
StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management
Authors: Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang
First: 2026-01-09T16:09:48+00:00 · Latest: 2026-01-09T16:09:48+00:00
Abstract
Multi-agent systems based on large language models, particularly centralized architectures, have recently shown strong potential for complex and knowledge-intensive tasks. However, central agents often suffer from unstable long-horizon collaboration due to the lack of memory management, leading to context bloat, error accumulation, and poor cross-task generalization. To address both task-level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi-agent framework with explicit memory control. StackPlanner addresses these challenges by decoupling high-level coordination from subtask execution with active task-level memory control, and by learning to retrieve and exploit reusable coordination experience via structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long-horizon multi-agent collaboration.
中文标题/摘要
标题:StackPlanner:一种具有任务经验记忆管理的集中式分层多智能体系统
基于大型语言模型的多智能体系统,尤其是集中式架构,最近在复杂和知识密集型任务中显示出强大的潜力。然而,中央智能体由于缺乏记忆管理经常遭受长期合作不稳定的问题,导致上下文膨胀、错误累积和跨任务泛化能力差。为了解决任务层面的记忆低效和协调经验重用能力不足的问题,我们提出了一种具有显式记忆控制的分层多智能体框架——StackPlanner。StackPlanner通过将高层协调与子任务执行解耦,并通过结构化经验记忆和强化学习学习检索和利用可重用的协调经验来解决这些挑战。在多个深度搜索和智能体系统基准测试中的实验表明,我们的方法能够实现可靠的长期多智能体合作。
Summary / 总结
StackPlanner is a hierarchical multi-agent framework designed to improve the stability and efficiency of long-horizon collaboration in centralized multi-agent systems. It introduces explicit memory control to manage task-level memory and learns to reuse coordination experience through structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks indicate that StackPlanner enables reliable long-horizon multi-agent collaboration.
研究旨在通过解决记忆管理问题来提高基于大型语言模型的多智能体系统的稳定性和效率。StackPlanner 是一个分层多智能体框架,通过显式记忆控制将高层次协调与子任务执行解耦,并利用结构化经验记忆和强化学习来重用协调经验。实验表明,StackPlanner 提高了多智能体的长时协作可靠性。
TDHook: A Lightweight Framework for Interpretability
Authors: Yoann Poupart
First: 2025-09-29T20:28:43+00:00 · Latest: 2026-01-09T16:00:00+00:00
Abstract
Interpretability of Deep Neural Networks (DNNs) is a growing field driven by the study of vision and language models. Yet, some use cases, like image captioning, or domains like Deep Reinforcement Learning (DRL), require complex modelling, with multiple inputs and outputs, or use composable and separated networks. As a consequence, they rarely fit natively into the API of popular interpretability frameworks. We thus present TDHook, an open-source, lightweight, generic interpretability framework based on $\texttt{tensordict}$ and applicable to any $\texttt{torch}$ model. It focuses on handling complex composed models which can be trained for Computer Vision, Natural Language Processing, Reinforcement Learning or any other domain. This library features ready-to-use methods for attribution and probing and a flexible get-set API for interventions, and aims to bridge the gap between these method classes to make modern interpretability pipelines more accessible. TDHook is designed with minimal dependencies, requiring roughly half as much disk space as $\texttt{transformer_lens}$, and, in our controlled benchmark, achieves up to a $\times$2 speed-up over $\texttt{captum}$ when running integrated gradients for multi-target pipelines on both CPU and GPU. In addition, to demonstrate the value of our work, we showcase concrete use cases of our library with composed interpretability pipelines in Computer Vision (CV) and Natural Language Processing (NLP), as well as with complex models in DRL.
中文标题/摘要
标题:TDHook:一种轻量级的可解释性框架
深度神经网络(DNNs)的可解释性是一个不断发展的领域,由视觉和语言模型的研究推动。然而,一些应用场景(如图像字幕)或领域(如深度强化学习,DRL)需要复杂的建模,具有多个输入和输出,或使用可组合且相互分离的网络。因此,它们很少能原生地适配流行可解释性框架的API。为此,我们提出了TDHook,一个基于`tensordict`的开源、轻量级、通用的可解释性框架,适用于任何`torch`模型。它专注于处理复杂的组合模型,这些模型可以为计算机视觉、自然语言处理、强化学习或任何其他领域而训练。该库提供了现成的归因和探测(probing)方法,以及用于干预的灵活get-set API,旨在弥合这些方法类别之间的差距,使现代可解释性流水线更易于使用。TDHook在设计上依赖极少,占用的磁盘空间约为`transformer_lens`的一半,并且在我们的受控基准测试中,在CPU和GPU上对多目标流水线运行积分梯度(integrated gradients)时,相对`captum`最高可实现2倍的加速。此外,为了展示我们工作的价值,我们展示了该库的具体用例,包括计算机视觉(CV)和自然语言处理(NLP)中的组合可解释性流水线,以及DRL中的复杂模型。
Summary / 总结
TDHook is a lightweight open-source interpretability framework designed for complex composed models in domains such as computer vision, natural language processing, and reinforcement learning. It is built on $\texttt{tensordict}$, applies to any $\texttt{torch}$ model, and offers methods for attribution, probing, and interventions. It requires roughly half as much disk space as $\texttt{transformer_lens}$ and, in the authors' controlled benchmark, achieves up to a $\times$2 speed-up over $\texttt{captum}$ when running integrated gradients for multi-target pipelines on both CPU and GPU. The authors also showcase concrete use cases in CV, NLP, and DRL, aiming to bridge the gap between different classes of interpretability methods and make modern pipelines more accessible.
TDHook 是一个轻量级的开源可解释性框架,用于处理计算机视觉、自然语言处理和强化学习等不同领域的复杂组合模型。它基于 $\texttt{tensordict}$,适用于任何 $\texttt{torch}$ 模型,并提供了归因、探测和干预的方法。在作者的受控基准测试中,TDHook 在多目标流水线的积分梯度上相对 $\texttt{captum}$ 最高可实现 $\times$2 的加速,并且依赖较少,占用的磁盘空间约为 $\texttt{transformer_lens}$ 的一半。作者通过计算机视觉、自然语言处理和强化学习的具体用例展示了该框架,旨在弥合不同可解释性方法之间的差距,使现代可解释性流水线更加易于使用。
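Integrated gradients, the attribution method benchmarked in the TDHook entry above, can be sketched without any framework. The code below is a minimal, library-free illustration and does not use or imitate the TDHook or captum APIs: the toy model `f`, its analytic gradient `grad_f`, and the helper `integrated_gradients` are all hypothetical names introduced for this example.

```python
def f(x):
    """Toy differentiable model with two inputs: f(x) = x0**2 + 3*x1."""
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2.0 * x[0], 3.0]

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients.

    IG_i = (x_i - baseline_i) * integral over alpha in [0, 1] of
           d f / d x_i evaluated at baseline + alpha * (x - baseline).
    """
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight-line path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(len(x)):
            totals[i] += grad[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions)
print(sum(attributions), f(x) - f(baseline))
```

The two printed sums illustrate the completeness property of integrated gradients: the attributions add up (to within numerical error) to the difference between the model output at the input and at the baseline. In a multi-target pipeline this computation is repeated per output, which is the setting the abstract's speed comparison refers to.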
A Novel Patch-Based TDA Approach for Computed Tomography
Authors: Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Mohammad Hamghalam, Alex Robins, Mane Piliposyan, Richard K. G. Do, Natalie Gangai, Yun S. Chun, Ahmad Bashir Barekzai, Jayasree Chakraborty, Hala Khasawneh, Camila Vilela, Natally Horvat, João Miranda, Alice C. Wei, Amber L. Simpson
First: 2025-12-13T00:51:03+00:00 · Latest: 2026-01-09T15:47:58+00:00
Abstract
The development of machine learning (ML) models based on the computed tomography (CT) imaging modality has been a major focus of recent research in the medical imaging domain. Incorporating a robust feature engineering approach can greatly improve the performance of these models. Topological data analysis (TDA), a recent development based on the mathematical field of algebraic topology, mainly focuses on the data from a topological perspective, extracting deeper insight and higher dimensional structures from the data. Persistent homology (PH), a fundamental tool in the area of TDA, can extract topological features such as connected components, cycles and voids from the data. A popular approach to construct PH from 3D CT images is to utilize the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach may not always yield the best performance and can suffer from computational complexity with higher resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric medical imaging data, in particular the CT modality. A wide range of experiments has been conducted on several datasets of 3D CT images to comprehensively analyze the performance of the proposed method with various parameters and benchmark it against the 3D cubical complex algorithm. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and time-efficiency. The proposed approach outperformed the cubical complex method, achieving an average improvement of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient Python package, Patch-TDA, to facilitate the utilization of the proposed approach.
中文标题/摘要
标题:一种用于计算机断层扫描的新型基于图像块的TDA方法
基于计算机断层扫描(CT)成像模态的机器学习(ML)模型开发一直是医学影像领域近期研究的主要焦点。引入稳健的特征工程方法可以显著提高这些模型的性能。拓扑数据分析(TDA)是基于代数拓扑这一数学领域的较新发展,主要从拓扑视角考察数据,从中提取更深层次的洞察和更高维度的结构。持久同调(PH)是TDA领域的一个基本工具,可以从数据中提取连通分支、环和空洞等拓扑特征。从3D CT图像构建PH的一种流行方法是利用3D立方复形滤过(cubical complex filtration),这是一种适用于网格结构数据的方法。然而,这种方法并不总能获得最佳性能,并且在处理更高分辨率的CT图像时可能面临较高的计算复杂度。本研究介绍了一种专为体数据医学影像(特别是CT模态)设计的新型基于图像块(patch)的PH构建方法。我们在多个3D CT图像数据集上进行了广泛的实验,以全面分析所提方法在不同参数下的性能,并将其与3D立方复形算法进行对比。我们的结果凸显了基于图像块的TDA方法在分类性能和时间效率两方面的优势。所提方法优于立方复形方法,在所有数据集上的准确率、AUC、敏感性、特异性和F1分数分别平均提高了10.38%、6.94%、2.06%、11.58%和8.51%。最后,我们提供了一个方便的Python包Patch-TDA,以便于使用所提方法。
Summary / 总结
This study introduces a novel patch-based topological data analysis (TDA) approach for computed tomography (CT) imaging, aiming to improve feature extraction and model performance. The method constructs persistent homology (PH) from 3D CT images using a patch-based approach, which is more efficient and effective than the traditional 3D cubical complex filtration. Comprehensive experiments on multiple datasets show that the patch-based TDA approach outperforms the cubical complex method in terms of classification accuracy, AUC, sensitivity, specificity, and F1 score, with average improvements of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51%, respectively. A Python package, Patch-TDA, is provided to facilitate the implementation of this approach.
该研究提出了一种针对CT成像的新型基于补丁的拓扑数据分析(TDA)方法,解决了3D立方体复杂滤波方法的局限性。通过在多个3D CT数据集上的广泛实验,基于补丁的TDA方法展示了优越的性能,平均提高了10.38%的准确率、6.94%的AUC、2.06%的敏感性、11.58%的特异性和8.51%的F1分数,相比立方体复杂方法。该方法还显示了更好的时间效率。
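As a rough illustration of the patch-based idea summarized above, the sketch below splits a volume into non-overlapping cubic patches and computes 0-dimensional sublevel-set persistence pairs for each patch with a union-find. This is not the Patch-TDA package's API: the function names `extract_patches` and `h0_persistence`, the non-overlapping tiling, and the 6-connectivity are assumptions chosen only for illustration.

```python
import numpy as np

def extract_patches(volume, size):
    """Split a 3D array into non-overlapping cubic patches of edge `size`.

    Assumption: trailing voxels that do not fill a whole patch are dropped.
    """
    d, h, w = (s // size for s in volume.shape)
    v = volume[:d * size, :h * size, :w * size]
    v = v.reshape(d, size, h, size, w, size)
    # Bring the three patch-grid axes to the front, then flatten them.
    return v.transpose(0, 2, 4, 1, 3, 5).reshape(-1, size, size, size)

def h0_persistence(patch):
    """0-dimensional sublevel-set persistence pairs (birth, death).

    Voxels are added in order of increasing intensity; when two components
    meet, the one with the larger birth value dies (elder rule). The
    component of the global minimum never dies and is not reported.
    """
    shape = patch.shape
    flat = patch.ravel()
    parent, birth, pairs = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    offsets = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    for idx in np.argsort(flat, kind="stable"):
        idx = int(idx)
        parent[idx] = idx
        birth[idx] = flat[idx]
        z, y, x = np.unravel_index(idx, shape)
        for dz, dy, dx in offsets:
            nz, ny, nx = z + dz, y + dy, x + dx
            if not (0 <= nz < shape[0] and 0 <= ny < shape[1] and 0 <= nx < shape[2]):
                continue
            n = int(np.ravel_multi_index((nz, ny, nx), shape))
            if n not in parent:
                continue  # neighbor not yet in the sublevel set
            ra, rb = find(idx), find(n)
            if ra == rb:
                continue
            young, old = (ra, rb) if birth[ra] > birth[rb] else (rb, ra)
            if birth[young] < flat[idx]:  # skip zero-persistence pairs
                pairs.append((float(birth[young]), float(flat[idx])))
            parent[young] = old
    return pairs
```

Per-patch pairs could then be summarized (for example by count or total persistence) into a feature vector for a classifier; which summary the authors use is not stated in the abstract.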
IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck
Authors: Huilin Deng, Hongchen Luo, Yue Zhu, Long Li, Zhuoyue Chen, Xinghao Zhao, Ming Li, Jihai Zhang, Mengchang Wang, Yang Cao, Yu Kang
First: 2026-01-09T15:46:40+00:00 · Latest: 2026-01-09T15:46:40+00:00
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
中文标题/摘要
标题:IIB-LPO:迭代信息瓶颈下的潜在策略优化
可验证奖励(RLVR)在大型语言模型(LLM)推理中的强化学习最近受到了持续挑战:探索崩溃。随机滚动的语义同质性经常将模型困在狭窄且过度优化的行为中。虽然现有方法利用策略熵来鼓励探索,但它们面临固有的局限性。全局熵正则化容易导致奖励作弊,这可能会引起无意义的冗余,而局部令牌选择性更新则难以应对预训练模型的强归纳偏见。为了解决这个问题,我们提出了迭代信息瓶颈下的潜在策略优化(IIB-LPO),这是一种新颖的方法,将探索从令牌分布的统计扰动转移到推理轨迹的拓扑分支。IIB-LPO 在高熵状态下触发潜在分支以多样化推理路径,并利用信息瓶颈原则作为轨迹过滤器和自我奖励机制,确保简洁且信息丰富的探索。在四个数学推理基准上的实证结果表明,IIB-LPO 达到了最先进的性能,在准确性和多样性指标上分别超越了先前方法 5.3% 和 7.4%。
Summary / 总结
The paper addresses the challenge of exploration collapse in Reinforcement Learning with Verifiable Rewards for Large Language Models by proposing IIB-LPO, a method that uses iterative Information Bottleneck to diversify reasoning paths. IIB-LPO triggers latent branching at high-entropy states to ensure concise and informative exploration, and it employs the Information Bottleneck principle as both a trajectory filter and a self-reward mechanism. Experiments show that IIB-LPO outperforms previous methods by up to 5.3% in accuracy and 7.4% in diversity metrics across four mathematical reasoning benchmarks.
论文解决了大型语言模型在可验证奖励强化学习中探索坍塌的问题。提出了一种名为IIB-LPO的方法,通过迭代信息瓶颈将探索从token分布扰动转移到推理轨迹分支。IIB-LPO提高了推理路径的多样性和简洁性,使其在四个数学推理基准测试中表现出色,相比之前的方法,准确率最高提高了5.3%,多样性指标提高了7.4%。
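The entropy-triggered branching described above can be sketched in a few lines. This is only a minimal illustration of "branch where the next-token distribution has high entropy"; the names `token_entropy` and `branch_points` and the fixed threshold are hypothetical, and the paper's latent branching and Information Bottleneck filtering are not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branch_points(step_probs, threshold):
    """Indices of decoding steps whose entropy exceeds `threshold`.

    `step_probs` is a list of per-step probability vectors. The fixed
    threshold is an assumption; a real system might adapt it per prompt.
    """
    return [t for t, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold]
```

For instance, a one-hot step has entropy 0 and would never trigger a branch, whereas a uniform distribution over four tokens has entropy ln 4 and would trigger one for any threshold below that value.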
Sequential Bayesian Optimal Experimental Design in Infinite Dimensions via Policy Gradient Reinforcement Learning
Authors: Kaichen Shen, Peng Chen
First: 2026-01-09T15:44:49+00:00 · Latest: 2026-01-09T15:44:49+00:00
Abstract
Sequential Bayesian optimal experimental design (SBOED) for PDE-governed inverse problems is computationally challenging, especially for infinite-dimensional random field parameters. High-fidelity approaches require repeated forward and adjoint PDE solves inside nested Bayesian inversion and design loops. We formulate SBOED as a finite-horizon Markov decision process and learn an amortized design policy via policy-gradient reinforcement learning (PGRL), enabling online design selection from the experiment history without repeatedly solving an SBOED optimization problem. To make policy training and reward evaluation scalable, we combine dual dimension reduction (active subspace projection for the parameter and principal component analysis for the state) with an adjusted derivative-informed latent attention neural operator (LANO) surrogate that predicts both the parameter-to-solution map and its Jacobian. We use a Laplace-based D-optimality reward while noting that, in general, other expected-information-gain utilities such as KL divergence can also be used within the same framework. We further introduce an eigenvalue-based evaluation strategy that uses prior samples as proxies for maximum a posteriori (MAP) points, avoiding repeated MAP solves while retaining accurate information-gain estimates. Numerical experiments on sequential multi-sensor placement for contaminant source tracking demonstrate approximately $100\times$ speedup over high-fidelity finite element methods, improved performance over random sensor placements, and physically interpretable policies that discover an "upstream" tracking strategy.
中文标题/摘要
标题:基于策略梯度强化学习的无限维序贯贝叶斯最优实验设计
无限维随机场参数下的偏微分方程支配逆问题的序贯贝叶斯最优实验设计(SBOED)在计算上具有挑战性。高保真方法需要在嵌套的贝叶斯反演和设计循环中反复进行前向和伴随偏微分方程求解。我们将SBOED形式化为有限时间马尔可夫决策过程,并通过策略梯度强化学习(PGRL)学习一个可延展的设计策略,从而可以在实验历史中在线选择设计而不必反复求解SBOED优化问题。为了使策略训练和奖励评估可扩展,我们结合了双维降维——参数的主动子空间投影和状态的主成分分析——以及调整的导数感知潜在注意力神经算子(LANO)近似器,该近似器可以预测参数到解的映射及其雅可比。我们使用基于拉普拉斯的D-最优性奖励,同时指出,一般来说,其他期望信息增益度量,如KL散度,也可以在相同框架中使用。我们进一步引入了一种基于特征值的评估策略,使用先验样本作为最大后验(MAP)点的代理,避免反复求解MAP,同时保留准确的信息增益估计。针对污染物源追踪的顺序多传感器布局的数值实验表明,与高保真有限元方法相比,大约有100倍的加速,优于随机传感器布局,并且具有物理可解释性的策略发现了“上游”追踪策略。
Summary / 总结
This paper addresses the computational challenges of sequential Bayesian optimal experimental design (SBOED) for infinite-dimensional random field parameters in PDE-governed inverse problems. It formulates SBOED as a Markov decision process and uses policy-gradient reinforcement learning to learn an amortized design policy. The method combines dual dimension reduction techniques and a latent attention neural operator to predict the parameter-to-solution map and its Jacobian efficiently. Numerical experiments show a significant speedup over high-fidelity finite element methods and improved performance compared to random sensor placements, with the policy discovering an 'upstream' tracking strategy.
本文解决了无限维随机场参数在PDE控制反问题中顺序贝叶斯最优实验设计(SBOED)的计算挑战。它将SBOED形式化为马尔可夫决策过程,并使用策略梯度强化学习来学习一个可延展的设计策略。该方法结合了双重降维技术和一个潜在注意力神经操作符来高效预测参数到解的映射及其雅可比。数值实验显示,该方法相对于高保真有限元方法有显著的加速效果,并且与随机传感器放置相比具有更好的性能,策略发现了一种‘上游’跟踪策略。
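To make the eigenvalue-based D-optimality reward more concrete, here is a minimal finite-dimensional sketch for a linear Gaussian model. It assumes a Jacobian `J`, a prior covariance `prior_cov`, and i.i.d. observation noise with variance `noise_var`; these names and the function `d_optimality_reward` are illustrative and not taken from the paper, whose infinite-dimensional and surrogate-based machinery is not reproduced. Under these assumptions the information gain is half the sum of log(1 + λ_i) over the eigenvalues λ_i of the prior-preconditioned data-misfit Hessian.

```python
import numpy as np

def d_optimality_reward(J, prior_cov, noise_var):
    """Half the log-determinant ratio of prior to posterior covariance.

    Linear Gaussian assumption: posterior precision is
    prior_cov^{-1} + J^T J / noise_var.
    """
    L = np.linalg.cholesky(prior_cov)           # prior_cov = L @ L.T
    H_tilde = L.T @ (J.T @ J) @ L / noise_var   # prior-preconditioned Hessian
    eigvals = np.linalg.eigvalsh(H_tilde)
    eigvals = np.clip(eigvals, 0.0, None)       # guard against round-off
    return 0.5 * np.sum(np.log1p(eigvals))
```

Because only the eigenvalues enter, a low-rank approximation that keeps the dominant eigenvalues gives a cheap estimate of the same quantity, which is one common motivation for eigenvalue-based evaluation.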
Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection
Authors: Zhen-Xin Lin, Shang-Kuan Chen
First: 2026-01-09T15:37:03+00:00 · Latest: 2026-01-09T15:37:03+00:00
Comments: 15 pages, 3 figures, conference
Abstract
Recent deepfake detection methods have increasingly explored frequency-domain representations to reveal manipulation artifacts that are difficult to detect in the spatial domain. However, most existing approaches rely primarily on spectral magnitude, implicitly underexploring the role of phase information. In this work, we propose Phase4DFD, a phase-aware frequency-domain deepfake detection framework that explicitly models phase-magnitude interactions via a learnable attention mechanism. Our approach augments standard RGB input with Fast Fourier Transform (FFT) magnitude and local binary pattern (LBP) representations to expose subtle synthesis artifacts that remain indistinguishable under spatial analysis alone. Crucially, we introduce an input-level phase-aware attention module that uses phase discontinuities commonly introduced by synthetic generation to guide the model toward frequency patterns that are most indicative of manipulation before backbone feature extraction. The attended multi-domain representation is processed by an efficient BNext M backbone, with optional channel-spatial attention applied for semantic feature refinement. Extensive experiments on the CIFAKE and DFFD datasets demonstrate that our proposed model Phase4DFD outperforms state-of-the-art spatial- and frequency-based detectors while maintaining low computational overhead. Comprehensive ablation studies further confirm that explicit phase modeling provides complementary and non-redundant information beyond magnitude-only frequency representations.
中文标题/摘要
标题:Phase4DFD:多域相位感知注意力深度假信息检测
最近的深度假信息检测方法越来越多地探索频域表示,以揭示在空间域中难以检测的篡改特征。然而,大多数现有方法主要依赖于频谱幅度,隐式地忽略了相位信息的作用。在本文中,我们提出了一种相位感知频域深度假信息检测框架Phase4DFD,通过可学习的注意力机制明确建模相位幅度交互。我们的方法通过快速傅里叶变换(FFT)幅度和局部二值模式(LBP)表示增强标准RGB输入,以揭示仅通过空间分析无法区分的细微合成特征。关键的是,我们引入了一种输入级别相位感知注意力模块,该模块利用合成生成中常见的相位不连续性来引导模型在主干特征提取之前朝向最能指示篡改的频域模式。经过注意力机制处理的多域表示由高效的BNext M主干处理,可选地应用通道空间注意力以进行语义特征细化。在CIFAKE和DFFD数据集上的广泛实验表明,我们提出的模型Phase4DFD在保持低计算开销的同时优于最先进的空间和频域检测器。全面的消融研究进一步证实,显式的相位建模提供了超越幅度仅频域表示的互补和非冗余信息。
Summary / 总结
Phase4DFD is a phase-aware frequency-domain deepfake detection framework that uses a learnable attention mechanism to model phase-magnitude interactions. It combines standard RGB input with FFT magnitude and LBP representations to detect subtle synthesis artifacts. The model introduces an input-level phase-aware attention module to guide the model towards frequency patterns indicative of manipulation. Experiments show that Phase4DFD outperforms existing spatial and frequency-based detectors while maintaining low computational overhead.
Phase4DFD 是一种基于频域的深伪检测框架,通过可学习的注意力机制建模相位和幅度的交互。它结合了标准的 RGB 输入、FFT 幅度和 LBP 表示,以检测仅通过空间分析难以识别的合成伪影。模型引入了输入级的相位感知注意力模块,以引导模型朝向最能指示篡改的频率模式。实验表明,Phase4DFD 在保持低计算开销的同时,优于现有的空间和频域检测器。
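The magnitude and phase inputs mentioned above can be obtained from a 2D FFT. The sketch below is an assumption-laden illustration using NumPy only: the grayscale conversion by channel averaging, the log scaling of the magnitude, and the function names `fft_magnitude_phase` and `augment_rgb` are not from the paper, and the LBP channel and the attention module are omitted.

```python
import numpy as np

def fft_magnitude_phase(gray):
    """Centered 2D FFT magnitude and phase (radians) of a grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.abs(spectrum), np.angle(spectrum)

def augment_rgb(rgb):
    """Append a log-magnitude channel to an (H, W, 3) image.

    Returns the (H, W, 4) stacked input and the separate (H, W) phase map,
    which a phase-aware module could consume. Both choices are assumptions.
    """
    gray = rgb.mean(axis=-1)
    magnitude, phase = fft_magnitude_phase(gray)
    log_magnitude = np.log1p(magnitude)  # compress the dynamic range
    stacked = np.concatenate([rgb, log_magnitude[..., None]], axis=-1)
    return stacked, phase
```

Magnitude and phase together determine the image exactly, so a detector that sees only the magnitude discards information; this is the gap that phase-aware modeling aims to close.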
DYRECT Computed Tomography: DYnamic Reconstruction of Events on a Continuous Timescale
Authors: Wannes Goethals, Tom Bultreys, Steffen Berg, Matthieu N. Boone, Jan Aelterman
Venue: IEEE Transactions on Computational Imaging 11 (2025) 638-649
First: 2024-11-15T14:21:46+00:00 · Latest: 2026-01-09T15:35:49+00:00
Comments: 13 pages, 10 figures, article. Submitted to IEEE Transactions on Computational Imaging 23/10/2024 - Accepted 18/04/2025 - Published 01/05/2025
Abstract
Time-resolved high-resolution X-ray Computed Tomography (4D $μ$CT) is an imaging technique that offers insight into the evolution of dynamic processes inside materials that are opaque to visible light. Conventional tomographic reconstruction techniques are based on recording a sequence of 3D images that represent the sample state at different moments in time. This frame-based approach limits the temporal resolution compared to dynamic radiography experiments due to the time needed to make CT scans. Moreover, it inflates the amount of data and thus leads to costly post-processing computations to quantify the dynamic behaviour from the sequence of time frames, thereby often ignoring the temporal correlations of the sample structure. Our proposed 4D $μ$CT reconstruction technique, named DYRECT, estimates individual attenuation evolution profiles for each position in the sample. This leads to a novel memory-efficient event-based representation of the sample, using as few as three image volumes: its initial attenuation, its final attenuation and the transition times. This third volume represents local events on a continuous timescale instead of the discrete global time frames. We propose a method to iteratively reconstruct the transition times and the attenuation volumes. The dynamic reconstruction technique was validated on synthetic ground truth data and experimental data, and was found to effectively pinpoint the transition times in the synthetic dataset with a time resolution corresponding to less than a tenth of the number of projections required to reconstruct traditional $μ$CT time frames.
中文标题/摘要
标题:DYRECT计算机断层扫描:连续时间尺度上事件的动态重建
时间分辨高分辨率X射线计算机断层扫描(4D $μ$CT)是一种成像技术,可以洞察材料内部动态过程随时间演变的情况,这些材料对可见光是不透明的。传统的断层重建技术基于记录表示样品在不同时间点状态的一系列3D图像。这种基于帧的方法由于需要进行CT扫描的时间限制,限制了与动态放射摄影实验相比的时间分辨率。此外,它导致数据量膨胀,从而增加了从时间帧序列中量化动态行为的后处理计算成本,通常忽略了样品结构的时间相关性。我们提出的4D $μ$CT重建技术,称为DYRECT,估计样品中每个位置的衰减演变曲线。这导致了一种新颖的、内存高效的基于事件的样品表示,仅使用三个图像体积:初始衰减、最终衰减和转换时间。第三体积代表连续时间尺度上的局部事件,而不是离散的全局时间帧。我们提出了一种迭代重建转换时间和衰减体积的方法。动态重建技术在合成真实数据和实验数据上进行了验证,并发现能够有效确定合成数据集中的转换时间,时间分辨率对应于传统$μ$CT时间帧所需投影数量的十分之一。
Summary / 总结
The research aims to improve the temporal resolution and efficiency of time-resolved high-resolution X-ray Computed Tomography (4D μCT) by proposing a novel reconstruction technique called DYRECT. DYRECT estimates individual attenuation evolution profiles for each position in the sample, using only three image volumes: the initial and final attenuations and the transition times. This event-based approach reduces data inflation and computational costs, and was validated on both synthetic and experimental data, effectively pinpointing transition times with high temporal resolution.
研究旨在通过提出DYRECT动态重建技术提高4D μCT的时间分辨率和效率。DYRECT方法估计每个位置的衰减演变曲线,仅使用三个体积:初始衰减、最终衰减和转换时间。该方法减少了数据膨胀和计算成本,时间分辨率低于传统μCT时间帧的十分之一。该技术已在合成和实验数据上进行了验证,有效指出了转换时间。
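The three-volume event-based representation can be illustrated with a simple forward model: given the initial attenuation, the final attenuation, and a per-voxel transition time, the sample state at any continuous time follows directly. The sketch assumes an instantaneous step change at the transition time; the actual attenuation evolution profile used by DYRECT may differ, and `render_frame` is a hypothetical name.

```python
import numpy as np

def render_frame(mu_initial, mu_final, t_transition, t):
    """Attenuation volume at time `t` under a step-change assumption.

    Each voxel holds `mu_initial` until its own transition time and
    `mu_final` from then on.
    """
    return np.where(t >= t_transition, mu_final, mu_initial)
```

Any number of time frames can be rendered from the same three volumes, which is why the storage cost does not grow with the temporal sampling.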
Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation
Authors: Kaiwen Huang, Yizhe Zhang, Yi Zhou, Tianyang Xu, Tao Zhou
Venue: AAAI 2026
First: 2026-01-09T15:32:57+00:00 · Latest: 2026-01-09T15:32:57+00:00
Comments: Accepted to AAAI 2026. Code at: https://github.com/taozh2017/BCSI
Abstract
Semi-supervised medical image segmentation is an effective method for addressing scenarios with limited labeled data. Existing methods mainly rely on frameworks such as mean teacher and dual-stream consistency learning. These approaches often face issues like error accumulation and model structural complexity, while also neglecting the interaction between labeled and unlabeled data streams. To overcome these challenges, we propose a Bidirectional Channel-selective Semantic Interaction (BCSI) framework for semi-supervised medical image segmentation. First, we propose a Semantic-Spatial Perturbation (SSP) mechanism, which disturbs the data using two strong augmentation operations and leverages unsupervised learning with pseudo-labels from weak augmentations. Additionally, we employ consistency on the predictions from the two strong augmentations to further improve model stability and robustness. Second, to reduce noise during the interaction between labeled and unlabeled data, we propose a Channel-selective Router (CR) component, which dynamically selects the most relevant channels for information exchange. This mechanism ensures that only highly relevant features are activated, minimizing unnecessary interference. Finally, the Bidirectional Channel-wise Interaction (BCI) strategy is employed to supplement additional semantic information and enhance the representation of important channels. Experimental results on multiple benchmarking 3D medical datasets demonstrate that the proposed method outperforms existing semi-supervised approaches.
中文标题/摘要
标题:半监督医学图像分割中的双向通道选择语义交互
半监督医学图像分割是处理有限标注数据场景的有效方法。现有方法主要依赖于均值教师和双流一致性学习等框架。这些方法常常面临错误累积和模型结构复杂性的问题,同时也忽视了标记数据流和未标记数据流之间的交互。为克服这些挑战,我们提出了一种用于半监督医学图像分割的双向通道选择语义交互(BCSI)框架。首先,我们提出了一种语义-空间扰动(SSP)机制,该机制使用两种强增强操作扰乱数据,并利用弱增强生成的伪标签进行无监督学习。此外,我们利用两种强增强预测的一致性进一步提高模型的稳定性和鲁棒性。其次,为了减少标记数据流和未标记数据流之间交互过程中的噪声,我们提出了一种通道选择路由器(CR)组件,该组件动态选择信息交换中最相关的通道。该机制确保仅激活高度相关的特征,从而减少不必要的干扰。最后,我们采用了双向通道交互(BCI)策略,以补充额外的语义信息并增强重要通道的表示能力。在多个基准3D医学数据集上的实验结果表明,所提出的方法优于现有的半监督方法。
Summary / 总结
The research aims to address the challenges of semi-supervised medical image segmentation, particularly error accumulation and model complexity. The proposed BCSI framework includes an SSP mechanism for data augmentation and pseudo-labeling, a CR component for channel selection, and a BCI strategy for semantic interaction. Experiments show that the method outperforms existing approaches on multiple 3D medical datasets.
论文提出了一种BCSI框架来解决半监督医学图像分割的挑战。引入了SSP机制,通过数据增强和伪标签增强模型的稳定性和鲁棒性。CR组件选择性地在标记和未标记数据之间交换相关通道以减少噪声,而BCI增强了重要通道的表示。实验结果显示,该方法在3D医学数据集上优于现有方法。
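The SSP training signal described above (pseudo-labels from a weak augmentation supervising two strongly augmented views, plus a consistency term between the two strong predictions) might be sketched as follows. The loss form is an assumption: cross-entropy for the pseudo-label terms, mean squared error for consistency, and equal weights. `ssp_loss` is a hypothetical name, and the authors' actual formulation is in their repository, not here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ssp_loss(logits_weak, logits_strong1, logits_strong2):
    """Unsupervised loss for one batch of N voxels with C classes.

    All inputs have shape (N, C). Hard pseudo-labels come from the weak
    view; both strong views are trained against them and against each other.
    """
    pseudo = logits_weak.argmax(axis=-1)
    p1, p2 = softmax(logits_strong1), softmax(logits_strong2)
    rows = np.arange(len(pseudo))
    ce1 = -np.mean(np.log(p1[rows, pseudo] + 1e-12))
    ce2 = -np.mean(np.log(p2[rows, pseudo] + 1e-12))
    consistency = np.mean((p1 - p2) ** 2)
    return ce1 + ce2 + consistency
```

In practice, low-confidence pseudo-labels are often masked out before the cross-entropy terms; that refinement is left out to keep the sketch short.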