ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors
Authors: Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik
First: 2026-01-05T18:59:54+00:00 · Latest: 2026-01-05T18:59:54+00:00
Comments: 17 pages, 8 figures, 11 tables; project page: https://mapooon.github.io/ExposeAnyonePage/
Abstract
Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting the applicability in real-world face forgery detection.
中文标题/摘要
标题:ExposeAnyone:个性化音频到表情扩散模型是稳健的零样本面部伪造检测器
检测未知的深度伪造仍然是面部伪造检测中最具挑战性的问题之一。当前最先进的方法无法泛化到未见过的伪造,因为它们主要依赖于使用现有深度伪造或伪伪造的监督训练,这会导致对特定伪造模式的过拟合。相比之下,自监督方法具有更大的泛化潜力,但现有工作难以仅从自监督中学习判别性表示。在本文中,我们提出了一种基于扩散模型的完全自监督方法ExposeAnyone,该方法从音频生成表情序列。关键思想是,一旦模型使用参考集个性化到特定主体,它可以通过扩散重建误差计算疑似视频与个性化主体之间的身份距离,从而实现针对特定关注对象的面部伪造检测。广泛的实验表明,1) 在DF-TIMIT、DFDCP、KoDF和IDForge数据集上,我们的方法在平均AUC上比之前最先进的方法高出4.22个百分点;2) 我们的模型还能够检测Sora2生成的视频,而之前的方法在这方面表现不佳;3) 我们的方法对模糊和压缩等退化具有高度鲁棒性,突显了其在实际面部伪造检测中的适用性。
Summary / 总结
This paper addresses the challenge of detecting unknown deepfake manipulations by proposing ExposeAnyone, a fully self-supervised approach using a diffusion model. The model generates expression sequences from audio and, once personalized to specific subjects, can compute identity distances to detect face forgery. Experiments show that ExposeAnyone outperforms previous methods by 4.22 percentage points in average AUC across multiple datasets, effectively detects Sora2-generated videos, and is robust to corruptions like blur and compression.
本文提出了一种基于自监督扩散模型的ExposeAnyone方法,用于检测未知的深度伪造。该模型通过音频生成表情序列,并可以针对特定个体进行个性化。通过扩散重建误差计算身份距离,该方法能够有效检测面部伪造。实验表明,ExposeAnyone在多个数据集上的平均AUC比之前最先进的方法高出4.22个百分点,能够检测Sora2生成的视频,并且对模糊和压缩等干扰具有高度鲁棒性。
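To make the scoring idea in the ExposeAnyone entry concrete, the sketch below treats a per-video reconstruction error as an identity distance and computes a rank-based AUC over labelled real and fake videos. The function name `auc_from_distances` and the toy numbers are assumptions for illustration only; the digest does not show the paper's actual distance computation or evaluation code.

```python
def auc_from_distances(real_distances, fake_distances):
    """Rank-based AUC: the probability that a randomly chosen fake video has a
    larger distance than a randomly chosen real video (ties count as half)."""
    if not real_distances or not fake_distances:
        raise ValueError("need at least one real and one fake score")
    wins = 0.0
    for f in fake_distances:
        for r in real_distances:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_distances) * len(fake_distances))


# Toy, made-up distances (hypothetical values, not results from the paper).
real = [0.10, 0.15, 0.30]
fake = [0.25, 0.40, 0.55]
print(auc_from_distances(real, fake))
```

An AUC of 1.0 would mean every fake video sits farther from the personalized subject than every real one; 0.5 corresponds to chance.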
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
First: 2025-06-11T15:06:59+00:00 · Latest: 2026-01-05T18:59:29+00:00
Abstract
Speech emotion recognition (SER) systems are constrained by existing datasets that typically cover only 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states. We introduce EMONET-VOICE, a comprehensive resource addressing these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4.7k samples with unanimous expert consensus on emotion presence and intensity levels. Using state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample underwent validation by three psychology experts. We demonstrate that our Empathic Insight models trained on our synthetic data achieve strong real-world dataset generalization, as tested on EmoDB and RAVDESS. Furthermore, our comprehensive evaluation reveals that while high-arousal emotions (e.g., anger: 95% accuracy) are readily detected, the benchmark successfully exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EMONET-VOICE establishes a new paradigm for large-scale, ethically-sourced, fine-grained SER research.
中文标题/摘要
标题:EmoNet-Voice:一种细粒度、专家验证的语音情绪识别基准
语音情绪识别(SER)系统受限于现有数据集,这些数据集通常仅涵盖6-10种基本情绪,缺乏规模和多样性,并且在收集敏感情绪状态时面临伦理挑战。我们引入了EMONET-VOICE,这是一种通过两个组成部分解决这些限制的综合资源:(1)EmoNet-Voice Big,一个包含40种细粒度情绪类别、跨越11种声音和4种语言的5000小时多语言预训练数据集,(2)EmoNet-Voice Bench,一个包含4700个样本的严格验证基准,这些样本在情绪存在和强度水平上获得了专家的一致共识。利用最先进的合成语音生成技术,我们的隐私保护方法使敏感情绪(如疼痛、羞愧)的伦理纳入成为可能,同时保持了受控的实验条件。每个样本都经过了三位心理学专家的验证。我们证明,我们基于合成数据训练的Empathic Insight模型在EmoDB和RAVDESS上的实际数据集泛化能力很强。此外,我们的全面评估表明,虽然高唤醒情绪(如愤怒:95%的准确率)很容易被检测到,但基准成功地揭示了区分感知上相似的情绪(如悲伤 vs. 痛苦:63%的区分度)的难度,为推进细腻的情绪AI提供了可量化的指标。EMONET-VOICE为大规模、伦理来源的细粒度SER研究树立了新的范式。
Summary / 总结
The research introduces EmoNet-Voice, a comprehensive benchmark for speech emotion detection that addresses limitations of existing datasets by providing a 5,000-hour multilingual pre-training dataset with 40 fine-grained emotion categories and a rigorously validated benchmark of 4,700 samples. The dataset includes sensitive emotions through a privacy-preserving approach using synthetic voice generation and validation by psychology experts. Empathic Insight models trained on synthetic data show strong generalization to real-world datasets, and the benchmark highlights the difficulty in distinguishing similar emotions, providing new metrics for advancing emotion AI research.
研究引入了EmoNet-Voice,这是一个全面的语音情绪检测基准,通过提供一个包含40个细粒度情绪类别的5,000小时多语言预训练数据集和一个经过严格验证的4,700样本基准,解决了现有数据集的局限性。该数据集通过基于合成语音生成的隐私保护方法纳入敏感情绪,每个样本均由三位心理学专家验证。基于合成数据训练的Empathic Insight模型在真实世界数据集上表现出强大的泛化能力,而基准则揭示了区分相似情绪的难度,提供了推动细腻情绪AI发展的量化指标。
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye
First: 2026-01-05T18:56:34+00:00 · Latest: 2026-01-05T18:56:34+00:00
Comments: Project page: https://sotamak1r.github.io/VINO-web/
Abstract
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
中文标题/摘要
标题:VINO:统一视觉生成器,具有交错的全模态上下文
我们提出了VINO,一种统一的视觉生成器,可以在单一框架内进行图像和视频生成与编辑。VINO 不依赖于特定任务的模型或独立的模块,而是使用一个共享的扩散骨干网络,该网络根据文本、图像和视频进行条件化,从而在一个模型中实现广泛的视觉创作和编辑任务。具体来说,VINO 将视觉语言模型(VLM)与多模态扩散变换器(MMDiT)耦合,其中多模态输入被编码为交错的条件化标记,然后用于引导扩散过程。这种设计支持多参考定位、长格式指令跟随以及在静态和动态内容中保持一致的身份,同时避免了特定模态的架构组件。为了训练这样一个统一系统,我们引入了一种多阶段训练管道,逐步扩展视频生成基础模型,使其成为一个能够处理图像和视频输入输出的统一、多任务生成器。在各种生成和编辑基准测试中,VINO 展现出强大的视觉质量、忠实的指令跟随、改进的参考和属性保留以及更可控的多身份编辑。我们的结果突显了可扩展统一视觉生成的实用路径,并展示了交错的上下文计算作为通用视觉创作基础的潜力。
Summary / 总结
VINO is a unified visual generator that integrates image and video generation and editing within a single framework. It uses a shared diffusion backbone conditioned on text, images, and videos, coupled with a Multimodal Diffusion Transformer (MMDiT) and a vision-language model (VLM). This design supports various visual tasks, including multi-reference grounding and coherent identity preservation. VINO demonstrates strong visual quality and improved reference and attribute preservation across different benchmarks, showcasing a practical approach to scalable unified visual generation.
VINO 是一个统一的视觉生成器,将图像和视频生成与编辑整合在一个框架中。它使用一个共享的扩散骨干网络,并结合了视觉语言模型和多模态扩散变换器,条件化于文本、图像和视频。这种设计支持多种视觉任务,并在静态和动态内容中保持一致的身份。VINO 在不同基准测试中展示了强大的视觉质量、忠实的指令跟随以及改进的参考和属性保留,展示了统一视觉生成的实用途径。
SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?
Authors: Kenny Workman, Zhen Yang, Harihara Muralidharan, Hannah Le
Venue: NeurIPS 2024
First: 2025-12-26T07:40:11+00:00 · Latest: 2026-01-05T18:55:51+00:00
Comments: 10 pages, 9 figures, 4 tables; NeurIPS 2024 format
Abstract
Spatial transcriptomics assays are rapidly increasing in scale and complexity, making computational analysis a major bottleneck in biological discovery. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world spatial datasets. We introduce SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on frontier models shows that base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, indicating that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects. SpatialBench serves both as a measurement tool and a diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly.
中文标题/摘要
标题:SpatialBench:智能体能否分析现实世界的空间生物学数据?
空间转录组学检测正在迅速扩大规模和复杂性,使得计算分析成为生物发现的主要瓶颈。尽管前沿的人工智能智能体在软件工程和通用数据分析方面取得了显著进步,但尚不清楚它们是否能够从杂乱的现实世界空间数据集中提取生物学见解。我们引入了SpatialBench,这是一个包含146个可验证问题的基准,这些问题源自跨越五种空间技术和七个任务类别的实际空间分析工作流。每个问题提供了某一分析步骤之前的实验数据快照,并提供了一个确定性的评分器来评估关键生物学结果的恢复情况。前沿模型的基准数据显示,基础模型的准确性仍然很低(各模型家族为20-38%),存在明显的模型-任务和模型-平台交互作用。智能体运行框架(harness)的设计对性能有显著影响,表明工具、提示、控制流和执行环境应作为一等对象进行评估和改进。SpatialBench 既作为测量工具,也作为诊断视角,用于开发能够忠实、透明和可重复地与真实空间数据集交互的智能体。
Summary / 总结
The research evaluates whether AI agents can analyze complex, real-world spatial transcriptomics data, where computational analysis is a major bottleneck in biological discovery. The study introduces SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design also has a large effect on performance, so tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects.
研究旨在评估先进AI代理是否能够有效分析复杂的空间生物学数据,这对于生物发现至关重要。研究引入了SpatialBench,这是一个包含146个问题的基准,这些问题来自五种技术的空间分析工作流。每个问题包括实验数据和一个确定性的评分器来评估生物洞察的恢复。结果显示,当前模型的准确性较低(20-38%),并且模型与任务之间存在显著的交互作用,这表明需要更好地评估和改进工具、提示、控制流和执行环境。
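The SpatialBench entry describes a deterministic grader that checks whether an agent recovered a key biological result. The sketch below shows one possible shape such a grader could take, assuming numeric answers are compared within a relative tolerance and list answers (for example, marker genes) by Jaccard overlap. The function names, thresholds, and gene symbols are hypothetical and are not taken from the benchmark.

```python
import math


def grade_numeric(submitted, expected, rel_tol=0.05):
    """Pass if the submitted number is within rel_tol of the expected value."""
    return math.isclose(submitted, expected, rel_tol=rel_tol)


def grade_set(submitted, expected, min_jaccard=0.8):
    """Pass if the Jaccard overlap between two collections meets a threshold."""
    s, e = set(submitted), set(expected)
    if not s and not e:
        return True  # two empty answers agree trivially
    return len(s & e) / len(s | e) >= min_jaccard


# Hypothetical answers: a cell count and a short marker-gene list.
print(grade_numeric(102, 100))
print(grade_set(["CD3E", "CD8A", "GZMB"], ["CD3E", "CD8A", "GZMB", "PRF1"]))
```

Because both checks are pure functions of the submitted and expected values, the same answer always receives the same grade, which is the property the abstract calls deterministic.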
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
First: 2026-01-05T18:55:32+00:00 · Latest: 2026-01-05T18:55:32+00:00
Comments: Project page: https://sparkstj.github.io/talk2move
Abstract
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
中文标题/摘要
标题:Talk2Move:场景中基于文本指令的对象级几何变换的强化学习方法
我们介绍了Talk2Move,一种基于强化学习(RL)的扩散框架,用于场景中基于文本的空间对象变换。通过自然语言在场景中操纵对象对多模态生成系统提出了挑战。现有的基于文本的操纵方法可以调整外观或风格,但在执行对象级几何变换(如平移、旋转或缩放对象)方面存在困难,这主要是由于缺乏配对监督和像素级优化的限制。Talk2Move 使用组相对策略优化(GRPO)通过从输入图像和轻量级文本变化生成的多样化回放探索几何动作,从而消除昂贵的配对数据需求。空间奖励引导模型将几何变换与语言描述对齐,而离策略步骤评估和主动步骤采样通过关注信息性变换阶段提高学习效率。此外,我们设计了以对象为中心的空间奖励,直接评估位移、旋转和缩放行为,使变换具有可解释性和连贯性。在精心策划的基准测试上进行的实验表明,Talk2Move 实现了精确、一致且语义忠实的对象变换,优于现有基于文本的编辑方法在空间准确性和场景连贯性方面的表现。
Summary / 总结
Talk2Move is a reinforcement learning-based framework that uses natural language to instruct geometric transformations of objects within scenes. It addresses the challenge of spatial manipulation through diverse rollouts and lightweight textual variations, avoiding the need for paired data. The method employs Group Relative Policy Optimization and a spatial reward model to align transformations with linguistic descriptions, improving learning efficiency and achieving precise, consistent, and semantically faithful object transformations compared to existing approaches.
Talk2Move 是一个基于强化学习的框架,通过自然语言在场景中执行对象级别的几何变换。它使用 Group Relative Policy Optimization(GRPO),基于输入图像和轻量级文本变化生成多样化的 rollout(采样轨迹)来探索几何动作,从而应对通过文本进行空间操作的挑战,避免了对配对数据的需求。该框架使用空间奖励模型将变换与语言描述对齐,并通过离策略步骤评估和主动步骤采样提高学习效率。实验表明,Talk2Move 实现了精确、一致且语义忠实的变换,在空间准确性和场景连贯性方面均优于现有的文本引导编辑方法。
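The Talk2Move entry relies on Group Relative Policy Optimization (GRPO), whose core step is to score each rollout relative to the other rollouts sampled for the same input. A minimal sketch of that normalization follows, assuming scalar rewards per rollout and the population standard deviation; the function name, the `eps` constant, and the reward values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are judged against their siblings rather than against a learned critic.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical spatial rewards for four rollouts of the same edit instruction.
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason diverse rollouts matter.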
Scaling Open-Ended Reasoning to Predict the Future
Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
First: 2025-12-31T18:59:51+00:00 · Latest: 2026-01-05T18:45:47+00:00
Comments: 45 pages
Abstract
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
中文标题/摘要
标题:将开放性推理扩展以预测未来
高风险决策涉及对未来不确定性的推理。在本研究中,我们训练语言模型对开放性预测问题进行预测。为了扩大训练数据,我们从每日新闻中报道的全球事件中合成新型预测问题,采用完全自动化的仔细编纂配方。我们在OpenForesight数据集上训练Qwen3思考模型。为了防止训练和评估期间出现未来信息泄露,我们在数据生成和检索中使用离线新闻语料库。在一小部分验证集的引导下,我们展示了检索的好处以及强化学习(RL)中改进的奖励函数。一旦我们获得最终的预测系统,我们将在2025年5月至8月之间进行保留测试。我们的专门模型OpenForecaster 8B与更大规模的专有模型相当,我们的训练提高了预测的准确性、校准性和一致性。我们发现预测训练带来的校准改进在流行基准上具有普遍性。我们开源了所有模型、代码和数据,以使语言模型预测研究广泛可及。
Summary / 总结
This work aims to improve language models for open-ended forecasting by synthesizing novel questions from daily news. The Qwen3 models were trained on the OpenForesight dataset, using an offline news corpus to avoid future information leakage. The model, OpenForecaster 8B, showed improved accuracy, calibration, and consistency in predictions, matching larger proprietary models. Calibration improvements generalized across benchmarks, and all resources were open-sourced for broader research access.
该研究旨在提高语言模型在高风险决策中对开放性预测问题的推理能力。作者从每日新闻中合成新型预测问题,并在名为OpenForesight的数据集上训练Qwen3模型。他们使用离线新闻语料库进行数据生成和检索,并展示了在准确度、校准性和一致性方面的改进。专门的模型OpenForecaster 8B在2025年5月至8月的保留测试中与规模大得多的专有模型表现相当,且校准改进在流行基准上具有普适性。所有模型、代码和数据均已开源,以促进更广泛的科研访问。
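The forecasting entry reports gains in calibration without naming a metric in this digest. One standard way to quantify calibration for probabilistic predictions is the Brier score, sketched below under the assumption that each forecast comes with a stated probability and a binary resolved outcome; whether the paper uses this exact metric is not stated here.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always answering 0.5
    scores 0.25 regardless of the outcomes.
    """
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


# Hypothetical forecasts and resolutions (1 = the prediction came true).
print(brier_score([0.9, 0.2, 0.6], [1, 0, 0]))
```

Unlike plain accuracy, the score penalizes confident wrong answers more heavily than hedged ones, so it rewards a model for knowing how uncertain it should be.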
Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling
Authors: Falcon LLM Team, Iheb Chaabane, Puneesh Khanna, Suhail Mohmad, Slim Frikha, Shi Hu, Abdalgader Abubaker, Reda Alami, Mikhail Lubinets, Mohamed El Amine Seddik, Hakim Hacid
First: 2026-01-05T18:44:27+00:00 · Latest: 2026-01-05T18:44:27+00:00
Abstract
This work introduces Falcon-H1R, a 7B-parameter reasoning-optimized model that establishes the feasibility of achieving competitive reasoning performance with small language models (SLMs). Falcon-H1R stands out for its parameter efficiency, consistently matching or outperforming SOTA reasoning models that are $2\times$ to $7\times$ larger across a variety of reasoning-intensive benchmarks. These results underscore the importance of careful data curation and targeted training strategies (via both efficient SFT and RL scaling) in delivering significant performance gains without increasing model size. Furthermore, Falcon-H1R advances the 3D limits of reasoning efficiency by combining faster inference (through its hybrid-parallel architecture design), token efficiency, and higher accuracy. This unique blend makes Falcon-H1R-7B a practical backbone for scaling advanced reasoning systems, particularly in scenarios requiring extensive chain-of-thoughts generation and parallel test-time scaling. Leveraging the recently introduced DeepConf approach, Falcon-H1R achieves state-of-the-art test-time scaling efficiency, offering substantial improvements in both accuracy and computational cost. As a result, Falcon-H1R demonstrates that compact models, through targeted model training and architectural choices, can deliver robust and scalable reasoning performance.
中文标题/摘要
标题:Falcon-H1R:通过混合模型实现高效测试时扩展的推理前沿探索
本研究介绍了Falcon-H1R,这是一种70亿参数的推理优化模型,证明了使用小型语言模型(SLMs)实现具有竞争力的推理性能的可行性。Falcon-H1R 以其参数效率著称,在多种推理密集型基准测试中,其性能与比其大2到7倍的SOTA推理模型保持一致或超越。这些结果强调了精心的数据整理和有针对性的训练策略(包括高效的SFT和RL扩展)的重要性,以在不增加模型大小的情况下实现显著的性能提升。此外,Falcon-H1R 通过结合更快的推理(通过其混合并行架构设计)、标记效率和更高的准确性,突破了推理效率的三维极限。这种独特的结合使Falcon-H1R-7B 成为扩展高级推理系统的实用基础架构,特别是在需要大量链式思考生成和并行测试时扩展的场景中。利用最近引入的DeepConf方法,Falcon-H1R 达到了测试时扩展效率的SOTA,显著提高了准确性和计算成本。因此,Falcon-H1R 证明了通过有针对性的模型训练和架构选择,紧凑型模型可以提供稳健且可扩展的推理性能。
Summary / 总结
Falcon-H1R is a 7B-parameter reasoning-optimized model that matches or outperforms SOTA reasoning models 2 to 7 times larger across a variety of reasoning-intensive benchmarks. It relies on careful data curation and targeted training strategies, including efficient SFT and RL scaling, to improve performance without increasing model size. Falcon-H1R also combines faster inference, token efficiency, and higher accuracy, making it suitable for scenarios requiring extensive chain-of-thought generation and parallel test-time scaling. Using the DeepConf approach, it achieves state-of-the-art test-time scaling efficiency, with higher accuracy at lower computational cost.
Falcon-H1R 是一个 70 亿参数的推理优化模型,能够在各种推理基准测试中与比其大 2 到 7 倍的 SOTA 模型持平,甚至超越它们。它结合了更快的推理、更高效的 token 使用和更高的准确性,在不增加模型大小的情况下实现显著的性能提升。Falcon-H1R 还利用 DeepConf 方法实现最先进的测试时扩展效率,在提高准确性的同时降低计算成本,使其成为高级推理系统的实用基础架构。
Causal Multi-fidelity Surrogate Forward and Inverse Models for ICF Implosions
Authors: Tyler E. Maltba, Ben S. Southworth, Jeffrey R. Haack, Marc L. Klasky
First: 2025-09-05T21:39:53+00:00 · Latest: 2026-01-05T18:34:15+00:00
Abstract
Continued progress in inertial confinement fusion (ICF) requires solving inverse problems relating experimental observations to simulation input parameters, followed by design optimization. However, such high-dimensional dynamic PDE-constrained optimization problems are extremely challenging or even intractable. It has been recently shown that inverse problems can be solved by only considering certain robust features. Here we consider the ICF capsule's deuterium-tritium (DT) interface, and construct a causal, dynamic, multifidelity reduced-order surrogate that maps from a time-dependent radiation temperature drive to the interface's radius and velocity dynamics. The surrogate targets an ODE embedding of DT interface dynamics, and is constructed by learning a controller for a base analytical model using low- and high-fidelity simulation training data with respect to radiation energy group structure. After demonstrating excellent accuracy of the surrogate interface model, we use machine learning (ML) models with surrogate-generated data to solve inverse problems optimizing radiation temperature drive to reproduce observed interface dynamics. For sparse snapshots in time, the ML model further characterizes the most informative times at which to sample dynamics. Altogether we demonstrate how operator learning, causal architectures, and physical inductive bias can be integrated to accelerate discovery, design, and diagnostics in high-energy-density systems.
中文标题/摘要
标题:因果多保真度代理前向和逆向模型在ICF压缩中的应用
惯性约束聚变(ICF)领域的持续进步需要解决将实验观测结果与模拟输入参数关联的逆问题,随后进行设计优化。然而,这类高维动态偏微分方程约束的优化问题极其具有挑战性,甚至无法解决。最近的研究表明,可以通过考虑某些稳健特征来解决逆问题。在这里,我们考虑ICF胶囊的氘-氚(DT)界面,并构建了一个因果、动态、多保真度的降阶代理模型,该模型将时间依赖的辐射温度驱动映射到界面的半径和速度动力学。该代理模型针对DT界面动力学的常微分方程嵌入进行目标定位,并通过学习基于辐射能量组结构的低保真度和高保真度模拟训练数据的基分析模型控制器进行构建。在展示了代理界面模型的卓越准确性之后,我们使用基于代理生成的数据的机器学习(ML)模型来解决优化辐射温度驱动以重现观测界面动力学的逆问题。对于时间上的稀疏快照,ML模型进一步表征了需要采样的最具有信息量的时间点。总体而言,我们展示了如何将操作学习、因果架构和物理归纳偏置结合起来以加速高能密度系统中的发现、设计和诊断。
Summary / 总结
This study addresses the challenge of solving high-dimensional inverse problems in inertial confinement fusion (ICF) by developing a causal, dynamic, multifidelity reduced-order surrogate model. The model maps from a time-dependent radiation temperature drive to the interface's radius and velocity dynamics, using low- and high-fidelity simulation data. The surrogate model is then used to solve inverse problems, optimizing the radiation temperature drive to match observed interface dynamics. The research demonstrates the effectiveness of this approach, particularly in characterizing the most informative times for sampling dynamics with sparse data.
该研究通过开发因果、动态、多保真度降阶代理模型来解决惯性约束聚变(ICF)中的高维逆问题。该模型将时间依赖的辐射温度驱动映射到氘-氚界面的半径和速度动力学。通过学习低保真度和高保真度模拟数据,代理模型实现了高精度。研究人员随后使用代理生成的数据中的机器学习模型来解决逆问题,通过优化辐射温度驱动来匹配观察到的界面动力学。对于稀疏数据,机器学习模型确定了最需要采样的动力学时间,展示了将操作学习、因果架构和物理归纳偏见集成到ICF研究中的有效性。
Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding
Authors: Jingming He, Chongyi Li, Shiqi Wang, Sam Kwong
Venue: ICCV 2025
First: 2026-01-05T18:33:50+00:00 · Latest: 2026-01-05T18:33:50+00:00
Comments: Accepted by ICCV 2025
Abstract
Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace-Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.
中文标题/摘要
标题:3D高斯建模中各向异性局部编码的语义和渲染增强
近期工作提出将语义特征向量扩展到3DGS中,以实现同时的语义分割和图像渲染。然而,这些方法通常将语义和渲染分支分开处理,仅依赖2D监督,而忽视了3D高斯几何。此外,当前的自适应策略仅根据渲染梯度调整高斯集合,这在细微或无纹理区域可能不够充分。在本文中,我们提出了一种联合增强框架,用于3D语义高斯建模,该框架结合了语义和渲染分支。首先,不同于传统的点云形状编码,我们引入了使用拉普拉斯-贝尔特拉米算子的各向异性3D高斯切比雪夫描述符,以捕捉精细的3D形状细节,从而区分具有相似外观的对象,并减少对潜在噪声2D指导的依赖。此外,我们不完全依赖渲染梯度,而是根据局部语义和形状信号自适应调整高斯分配和球谐函数,通过选择性资源分配提高渲染效率。最后,我们采用跨场景知识转移模块持续更新学习的形状模式,从而在无需从头开始重新学习每个新场景的形状信息的情况下实现快速收敛和稳健表示。在多个数据集上的实验表明,该方法在提高分割准确性和渲染质量的同时,保持了高渲染帧率。
Summary / 总结
This work addresses the limitations of existing 3D Gaussian modeling methods by proposing a joint enhancement framework that integrates semantic and rendering branches. It introduces an anisotropic 3D Gaussian Chebyshev descriptor to capture fine-grained 3D shape details and adaptively adjusts Gaussian allocation and spherical harmonics using local semantic and shape signals. The cross-scene knowledge transfer module further enhances the model's efficiency and robustness. Experiments show improved segmentation accuracy and rendering quality with high frame rates.
该研究提出了一种联合增强框架,将语义和渲染分支结合起来,解决了现有3D高斯建模方法的局限性。引入了各向异性3D高斯切比雪夫描述符来捕捉细粒度的3D形状细节,并使用局部语义和形状信号自适应调整高斯分配和球谐函数。通过跨场景知识转移模块更新学习到的形状模式,从而提高了分割准确性和渲染质量,同时保持了高帧率。
Diminishing Returns in Self-Supervised Learning
Authors: Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu
First: 2025-12-03T15:11:44+00:00 · Latest: 2026-01-05T18:17:53+00:00
Abstract
Transformer-based architectures have become a dominant paradigm in vision and language, but their success is often attributed to large model capacity and massive training data. In this work, we examine how self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning interact in a low-capacity regime, using a 5M-parameter Vision Transformer for semantic segmentation. Across multiple data scales, we find that masked image modeling pre-training and downstream fine-tuning reliably improve performance, but with clear diminishing returns as supervision increases. In contrast, inserting an intermediate classification fine-tuning stage consistently degrades downstream performance, with the largest drops occurring precisely where pre-training is most effective. Through an analysis of patch-level representation geometry, we show that classification-based intermediate supervision actively interferes with representations learned during pre-training by collapsing spatial structure critical for dense prediction. These results indicate that, in small models, the geometry of supervision matters more than the number of training stages: misaligned intermediate objectives can negate the benefits of pre-training rather than amplify them.
中文标题/摘要
标题:自我监督学习中的递减回报
基于变换器的架构已成为视觉和语言领域的主导范式,但其成功往往归因于大模型容量和大量训练数据。在本工作中,我们研究了在低容量范围内自我监督预训练、中间微调和下游微调之间的相互作用,使用一个500万参数的视觉变换器进行语义分割。在多个数据尺度上,我们发现掩码图像建模预训练和下游微调可以可靠地提高性能,但随着监督的增加,回报逐渐递减。相反,插入中间分类微调阶段始终会降低下游性能,最大的下降发生在预训练最有效的区域。通过分析块级表示几何结构,我们表明基于分类的中间监督会通过压缩对密集预测至关重要的空间结构,主动干扰在预训练中学习到的表示。这些结果表明,在小模型中,监督的几何结构比训练阶段的数量更重要:不一致的中间目标可能会抵消预训练的好处,而不是放大它们。
Summary / 总结
This study investigates the impact of self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning on a 5M-parameter Vision Transformer for semantic segmentation. The research finds diminishing returns in performance improvements as supervision increases, while intermediate classification fine-tuning consistently degrades downstream performance, especially where pre-training is most effective. The authors attribute this to the collapse of spatial structure critical for dense prediction during intermediate supervision, suggesting that the geometry of supervision is more critical than the number of training stages in small models.
研究考察了低容量Vision Transformer在语义分割中的自监督预训练、中间微调和下游微调之间的相互作用。研究发现,随着监督增加,性能提升的效果逐渐减弱,而中间分类微调会始终降低下游性能。分析表明,基于分类的中间监督会干扰预训练得到的表示,特别是会破坏对密集预测至关重要的空间结构。
Hunting for "Oddballs" with Machine Learning: Detecting Anomalous Exoplanets Using a Deep-Learned Low-Dimensional Representation of Transit Spectra with Autoencoders
Authors: Alexander Roman, Emilie Panek, Roy T. Forestano, Eyup B. Unlu, Katia Matcheva, Konstantin T. Matchev
First: 2026-01-05T18:15:53+00:00 · Latest: 2026-01-05T18:15:53+00:00
Comments: 14 pages, 12 figures
Abstract
This study explores the application of autoencoder-based machine learning techniques for anomaly detection to identify exoplanet atmospheres with unconventional chemical signatures using a low-dimensional data representation. We use the Atmospheric Big Challenge (ABC) database, a publicly available dataset with over 100,000 simulated exoplanet spectra, to construct an anomaly detection scenario by defining CO2-rich atmospheres as anomalies and CO2-poor atmospheres as the normal class. We benchmarked four different anomaly detection strategies: Autoencoder Reconstruction Loss, One-Class Support Vector Machine (One-Class SVM), K-means Clustering, and Local Outlier Factor (LOF). Each method was evaluated in both the original spectral space and the autoencoder's latent space using Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) metrics. To test the performance of the different methods under realistic conditions, we introduced Gaussian noise levels ranging from 10 to 50 ppm. Our results indicate that anomaly detection is consistently more effective when performed within the latent space across all noise levels. Specifically, K-means clustering in the latent space emerged as a stable and high-performing method. We demonstrate that this anomaly detection approach is robust to noise levels up to 30 ppm (consistent with realistic space-based observations) and remains viable even at 50 ppm when leveraging latent space representations. On the other hand, the performance of the anomaly detection methods applied directly in the raw spectral space degrades significantly as the noise level increases. This suggests that autoencoder-driven dimensionality reduction offers a robust methodology for flagging chemically anomalous targets in large-scale surveys where exhaustive retrievals are computationally prohibitive.
中文标题/摘要
标题:用机器学习寻找“异类”:利用自动编码器学习的凌星光谱低维表示检测异常系外行星
本研究探讨了使用基于自动编码器的机器学习技术进行异常检测,以识别具有非传统化学特征的外行星大气。我们使用了包含超过10万个模拟外行星光谱的Atmospheric Big Challenge (ABC)数据库,将CO2丰富的大气定义为异常,CO2贫乏的大气定义为正常类,构建了异常检测场景。我们评估了四种不同的异常检测策略:自动编码器重构损失、一类支持向量机(1类-SVM)、K均值聚类和局部异常因子(LOF)。每种方法分别在原始光谱空间和自动编码器的潜在空间中使用受试者操作特征(ROC)曲线和曲线下面积(AUC)指标进行了评估。为了在现实条件下测试不同方法的性能,我们引入了从10到50 ppm的高斯噪声水平。结果显示,无论在何种噪声水平下,潜在空间中的异常检测始终更为有效。具体而言,潜在空间中的K均值聚类方法表现出稳定且高效的性能。我们证明了这种异常检测方法在噪声水平高达30 ppm(与现实中的空间观测一致)时是稳健的,即使在50 ppm时,通过利用潜在空间表示,该方法仍然可行。另一方面,直接在原始光谱空间中应用的异常检测方法随着噪声水平的增加而显著下降。这表明自动编码器驱动的降维方法为在大规模调查中标记化学异常目标提供了一种稳健的方法,尤其是在全面检索计算上不可行的情况下。
Summary / 总结
This study applies autoencoder-based machine learning to detect exoplanets with unusual atmospheric compositions by using a low-dimensional data representation. Four anomaly detection methods—Autoencoder Reconstruction Loss, One-Class SVM, K-means Clustering, and LOF—were evaluated in both the original spectral space and the autoencoder's latent space. Results show that anomaly detection is more effective in the latent space, with K-means clustering in the latent space performing particularly well and maintaining robustness up to 30 ppm noise levels. This approach is suggested to be a reliable method for identifying chemically anomalous exoplanets in large-scale surveys.
该研究利用自编码器机器学习技术,通过低维数据表示来检测具有异常大气组成的系外行星。四种异常检测方法——自编码器重构损失、一类支持向量机、K均值聚类和局部异常因子——分别在原始光谱空间和自编码器的潜在空间中进行了评估。结果显示,潜在空间中的异常检测效果更佳,特别是在潜在空间中使用K均值聚类方法表现尤为稳定。该方法在噪声水平达到30 ppm时仍能保持稳健性,并且在噪声水平达到50 ppm时仍具有可行性,表明自编码器驱动的降维方法在大规模调查中可用于标记化学异常目标。
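The latent-space K-means strategy described in the entry above can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' pipeline: a PCA projection stands in for the trained autoencoder's encoder, the "spectra" are synthetic, and the names `make_spectra`, `encode`, `anomaly_score`, and `roc_auc` are hypothetical. The anomaly score is the distance from a latent code to its nearest K-means centroid, and AUC is computed with the rank (Mann-Whitney) formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for spectra: 3 latent factors mixed into 50 "wavelength bins".
mixing = rng.normal(size=(3, 50))

def make_spectra(n, latent_shift):
    """Draw n synthetic spectra; latent_shift moves the class in latent space."""
    z = rng.normal(size=(n, 3)) + latent_shift
    return z @ mixing + 0.1 * rng.normal(size=(n, 50))

train_normal = make_spectra(500, 0.0)
test_normal = make_spectra(200, 0.0)
test_anomalous = make_spectra(200, 4.0)

# Linear "encoder": PCA fitted on the normal class (assumption: stands in for an autoencoder).
mean = train_normal.mean(axis=0)
_, _, vt = np.linalg.svd(train_normal - mean, full_matrices=False)
components = vt[:3]

def encode(x):
    return (x - mean) @ components.T

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; empty clusters keep their previous centroid."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def anomaly_score(x, centroids):
    """Distance from each latent code to its nearest centroid."""
    latent = encode(x)
    dists = np.linalg.norm(latent[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

def roc_auc(scores_neg, scores_pos):
    """AUC via the rank-sum formula (assumes no tied scores)."""
    scores = np.concatenate([scores_neg, scores_pos])
    ranks = scores.argsort().argsort() + 1
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    pos_rank_sum = ranks[n_neg:].sum()
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

centroids = kmeans(encode(train_normal), k=4)
auc = roc_auc(anomaly_score(test_normal, centroids),
              anomaly_score(test_anomalous, centroids))
```

In this toy setup the anomalous class is shifted inside the latent subspace, which is why a latent distance separates it; real spectra and a trained autoencoder may behave differently.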
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Authors: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
First: 2026-01-05T18:07:51+00:00 · Latest: 2026-01-05T18:07:51+00:00
Abstract
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
中文标题/摘要
标题:DatBench:区分性、忠实性和高效性的VLM评估
经验性评估是指导基础模型研究进展的主要指南。尽管有大量的工作集中在训练前沿的视觉-语言模型(VLMs)上,但对其评估的方法仍处于初级阶段。为了促进其成熟,我们提出了评估应满足的三个标准:(1)忠实于模态和应用,(2)能够区分不同质量的模型,(3)计算效率。通过这一视角,我们识别出一些关键的失败模式,这些模式违反了忠实性和区分性,错误地代表了模型的能力:(i)多项选择题奖励猜测,不能很好地反映下游使用场景,并且随着模型的改进而饱和;(ii)一些可以不使用图像直接回答的问题占到了某些评估的70%以上;(iii)错误标记或模棱两可的样本在某些数据集中占到了42%。关于效率,评估前沿模型的计算负担已经变得难以承受:据一些说法,近20%的开发计算资源被用于评估。我们没有抛弃现有的基准,而是通过转换和筛选来优化它们,以最大化忠实性和区分性。我们发现,将多项选择题转换为生成任务可以揭示出高达35%的能力下降。此外,过滤掉可以不使用图像直接回答的问题和错误标记的样本可以提高区分能力,同时降低计算成本。我们发布了DatBench-Full,这是一个包含33个数据集的清理评估套件,涵盖了九种VLM能力,以及DatBench,这是一个区分性子集,实现了13倍的平均加速(最高可达50倍),同时与原始数据集的区分能力非常接近。我们的工作概述了一条通向评估实践的道路,这些实践既严格又可持续,随着VLMs的不断扩展。
Summary / 总结
The paper argues that evaluation of vision-language models (VLMs) remains immature and identifies failure modes in existing benchmarks: multiple-choice formats that reward guessing, questions that can be answered without the image, and mislabeled or ambiguous samples. Rather than discarding existing benchmarks, the authors curate them through transformation and filtering. They release DatBench-Full, a cleaned suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves a 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets.
论文提出了DatBench,一个新的视觉-语言模型(VLM)评估套件,通过确保忠实性、可区分性和效率来解决现有基准的不足。研究指出了诸如选择题格式奖励猜测和标记错误样本等问题,这些问题会损害模型评估。研究发现,将选择题转换为生成任务,并过滤掉盲目可解和标记错误的样本,可以显著提高VLM评估的可区分性和计算效率。DatBench-Full套件包括33个数据集,而DatBench是一个更高效的子集,可以实现高达50倍的评估时间加速,同时保持相似的可区分性。
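The "blindly solvable" filter mentioned in the entry above can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Sample` type, the `answer_fn(question, image)` model interface, the exact-match comparison, and the toy model are hypothetical and not taken from the DatBench release.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Sample:
    question: str
    image: Optional[str]  # placeholder for image data (hypothetical)
    answer: str

def filter_blind_solvable(
    samples: List[Sample],
    answer_fn: Callable[[str, Optional[str]], str],
) -> List[Sample]:
    """Keep only samples the model cannot answer when the image is withheld."""
    kept = []
    for sample in samples:
        blind_prediction = answer_fn(sample.question, None)
        # Assumption: case-insensitive exact match decides correctness.
        if blind_prediction.strip().lower() != sample.answer.strip().lower():
            kept.append(sample)
    return kept

def toy_model(question: str, image: Optional[str]) -> str:
    """Toy stand-in: answers one question from text alone, otherwise needs the image."""
    if "capital of France" in question:
        return "Paris"
    return "unknown" if image is None else "red"

samples = [
    Sample("What is the capital of France?", "img0.png", "Paris"),
    Sample("What color is the car in the image?", "img1.png", "red"),
]
kept = filter_blind_solvable(samples, toy_model)
```

Only the second question survives, because the first one is answered correctly without any image.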
Non-omniscient backdoor injection with one poison sample: Proving the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks
Authors: Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi
First: 2025-08-07T17:41:33+00:00 · Latest: 2026-01-05T18:07:06+00:00
Comments: Added generalization to 2-layer ReLU neural networks
Abstract
Backdoor poisoning attacks are a threat to machine learning models trained on large data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task; however, an open question is how much poison data is needed for a successful backdoor attack. Typical attacks either use few samples but need much information about the data points, or need to poison many data points.
In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. For adversaries that utilize a direction unused by the clean data distribution for the poison sample, we prove for linear classification and linear regression that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We validate our theoretical results experimentally with realistic benchmark data sets.
中文标题/摘要
标题:非全知后门注入:基于单个毒样本的一次性后门假设证明,适用于线性回归、线性分类和2层ReLU神经网络
后门投毒攻击是对从不可信来源收集的大数据中训练的机器学习模型的一种威胁;这些攻击使攻击者能够在模型中注入恶意行为,这种行为可以通过特制的输入被触发。先前的工作已经建立了后门攻击成功的边界及其对良性学习任务的影响,然而,一个开放的问题是成功实施后门攻击需要多少毒数据。典型的攻击要么使用少量样本但需要大量关于数据点的信息,要么需要污染大量数据点。
在本文中,我们提出了“一次性后门假设”:拥有一个毒样本和有限背景知识的对手可以注入一个后门,且后门注入错误率为零,同时不会显著影响良性学习任务的性能。此外,我们证明了该假设适用于线性回归、线性分类和2层ReLU神经网络。对于利用未被干净数据分布使用的方向作为毒样本的对手,我们证明了对于线性分类和线性回归,最终模型在功能上等同于未包含毒样本进行训练的模型。我们基于先前关于统计后门学习的工作,展示了在所有其他情况下,对良性学习任务的影响仍然有限。我们通过现实基准数据集的实验验证了我们的理论结果。
Summary / 总结
This paper investigates the feasibility of backdoor injection using a single poisoned sample with limited knowledge. It proves the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. The study shows that an adversary with one poison sample can inject a backdoor with no backdooring error and minimal impact on the model's performance on benign tasks. Experiments confirm these theoretical findings using benchmark datasets.
该论文研究使用单个中毒样本进行后门注入的可行性,证明了线性回归、线性分类和2层ReLU神经网络中的单后门假设。方法是使用未被干净数据分布利用的方向作为中毒样本的方向,确保无后门错误且对模型在良性任务上的性能影响最小。理论证明和基准数据集上的实验验证支持了这些发现。
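For intuition about the linear regression case in the entry above, the following is a toy numerical sketch, not the paper's proof setting. It assumes unregularized least squares (minimum-norm solution via `numpy.linalg.lstsq`), clean data that is exactly zero along one feature direction, and a single poison sample placed on that unused direction. The variable names and the target value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Clean data never uses the last feature direction (assumption for this toy example).
X_clean = rng.normal(size=(n, d))
X_clean[:, -1] = 0.0
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y_clean = X_clean @ w_true + 0.01 * rng.normal(size=n)

# One poison sample: a trigger along the unused direction with an attacker-chosen label.
trigger = np.zeros(d)
trigger[-1] = 1.0
target = 10.0
X_poisoned = np.vstack([X_clean, trigger])
y_poisoned = np.append(y_clean, target)

def fit(X, y):
    """Least-squares fit; lstsq returns the minimum-norm solution if X is rank deficient."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

w_clean = fit(X_clean, y_clean)
w_poisoned = fit(X_poisoned, y_poisoned)

# A benign input lives in the clean subspace; the triggered input adds the trigger.
x_benign = rng.normal(size=d)
x_benign[-1] = 0.0
x_triggered = x_benign + trigger

benign_gap = abs(x_benign @ w_clean - x_benign @ w_poisoned)
backdoor_shift = x_triggered @ w_poisoned - x_benign @ w_poisoned
```

Because the poison direction is orthogonal to all clean inputs, the normal equations decouple: the weights on the clean directions are unchanged, so `benign_gap` is numerically zero, while the trigger shifts the prediction by `target`.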
Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents
Authors: Sourena Khanzadeh
First: 2026-01-05T18:05:29+00:00 · Latest: 2026-01-05T18:05:29+00:00
Abstract
As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While Chain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post-hoc rationalizations. We introduce Project Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions (do-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the Causal Sensitivity (φ) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode termed Causal Decoupling, where agents exhibit a violation density (ρ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.
中文标题/摘要
标题:项目阿里阿德涅:一种结构因果框架,用于审计大型语言模型代理的忠实性
随着大型语言模型(LLM)代理在高风险自主决策任务中的应用日益增多,其推理过程的透明度已成为一个关键的安全问题。虽然“思维链”(CoT)提示允许代理生成可读的推理轨迹,但尚不清楚这些轨迹是否是模型输出的忠实生成驱动因素,还是仅仅事后合理化。我们提出了“项目阿里阿德涅”,这是一种新颖的可解释性人工智能(XAI)框架,利用结构因果模型(SCMs)和反事实逻辑来审计代理推理的因果完整性。与依赖表面文本相似性的现有可解释性方法不同,项目阿里阿德涅通过对中间推理节点进行硬干预($do$-计算)——系统地反转逻辑、否定前提和逆转事实声明——来衡量终端答案的因果敏感性($φ$)。我们对最先进的模型的实证评估揭示了一个持续存在的“忠实性差距”。我们定义并检测了一种普遍存在的失败模式,称为“因果脱耦”,其中代理在事实和科学领域中的因果脱耦密度($ρ$)高达0.77。在这些情况下,尽管内部逻辑矛盾,代理仍得出相同的结论,证明其推理轨迹充当“推理剧场”,而决策则由潜在参数先验控制。我们的研究结果表明,当前的代理架构本质上容易产生不忠实的解释,并提出阿里阿德涅分数作为新的基准,以使声明的逻辑与模型行为保持一致。
Summary / 总结
Project Ariadne introduces a novel framework to audit the faithfulness of Large Language Model (LLM) agents' reasoning processes by using Structural Causal Models (SCMs) and counterfactual logic. It measures the causal sensitivity of the model's output through hard interventions and defines a new benchmark, the Ariadne Score, to align stated logic with model actions. The evaluation reveals a significant Faithfulness Gap and a widespread failure mode called Causal Decoupling, indicating that agents often produce reasoning traces that are post-hoc rationalizations rather than faithful drivers of their outputs.
Project Ariadne 提出了一种新的框架来审计大型语言模型(LLM)代理推理过程的忠实性。通过使用结构因果模型和反事实逻辑,它执行硬干预以测量因果敏感性。评估显示存在显著的忠实性缺口,在事实和科学领域中的违反密度高达 77%,表明代理经常提供事后合理化而非忠实的推理驱动其输出。
360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
Authors: Xiaopeng Guo, Yinzhe Xu, Huajian Huang, Sai-Kit Yeung
First: 2026-01-05T17:52:50+00:00 · Latest: 2026-01-05T17:52:50+00:00
Comments: 12 pages. Received by RA-L
Abstract
Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://chris1004336379.github.io/360DVO-homepage
中文标题/摘要
标题:360DVO:单目360度相机的深度视觉里程计
单目全景视觉里程计(OVO)系统利用360度相机克服了视角视觉里程计(VO)系统的视野限制。然而,现有的方法依赖于手工设计的特征或光度目标,往往在挑战性场景(如剧烈运动和光照变化)中缺乏鲁棒性。为了解决这个问题,我们提出了360DVO,这是第一个基于深度学习的OVO框架。我们的方法引入了一种感知畸变的球面特征提取器(DAS-Feat),能够从360度图像中自适应地学习抗畸变特征。这些稀疏特征补丁随后用于在新颖的全景可微束调整(ODBA)模块中建立有效的姿态估计约束。为了在现实环境中进行评估,我们还贡献了一个新的真实世界OVO基准。在该基准和公开的合成数据集(TartanAir V2和360VO)上的广泛实验表明,360DVO超越了最先进的基线(包括360VO和OpenVSLAM),在鲁棒性上提高了50%,在准确性上提高了37.5%。主页:https://chris1004336379.github.io/360DVO-homepage
Summary / 总结
360DVO is a deep learning-based framework for monocular omnidirectional visual odometry (OVO) that addresses the limitations of existing methods by introducing a distortion-aware spherical feature extractor (DAS-Feat) and a novel omnidirectional differentiable bundle adjustment (ODBA) module. The approach significantly improves robustness and accuracy in challenging scenarios, outperforming state-of-the-art baselines by 50% in robustness and 37.5% in accuracy on various datasets.
360DVO是一种基于深度学习的单目全景视觉里程计系统,通过使用畸变感知的球面特征提取器(DAS-Feat)和新颖的全景可微束调整(ODBA)模块来解决现有方法的局限性。该系统在各种基准测试中优于最先进的基线,提高了50%的鲁棒性和37.5%的准确性。
Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck
Authors: Dina El Zein, James Henderson
First: 2026-01-05T17:49:39+00:00 · Latest: 2026-01-05T17:49:39+00:00
Comments: 11 pages, 2 figures
Abstract
We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings. It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy approach, integrating a Nonparametric Variational Information Bottleneck (NVIB) layer into the transformer architecture to inject noise into its multi-vector embeddings and thereby hide information, and measuring privacy protection with Rényi divergence and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to utility. We test NVDP on the GLUE benchmark and show that varying the noise level gives us a useful tradeoff between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.
中文标题/摘要
标题:Transformer文本嵌入的非参数变分信息瓶颈差分隐私
我们提出了一种通过共享其变换器嵌入的噪声版本来保护隐私的方法,以共享文本数据。已证明,深度模型学习到的隐藏表示可以包含输入中的敏感信息,使对手能够以相当高的准确性恢复输入数据。在变换器嵌入中,这一问题更为严重,因为它们由多个向量组成,每个词一个。为缓解这一风险,我们提出了非参数变分差分隐私(NVDP),以确保有用的数据共享和强大的隐私保护。我们采用差分隐私方法,在变换器架构中集成非参数变分信息瓶颈(NVIB)层,向其多向量嵌入中注入噪声以隐藏信息,并使用Rényi散度及其相应的贝叶斯差分隐私(BDP)保证来衡量隐私保护。训练NVIB层根据效用校准噪声水平。我们在GLUE基准上测试了NVDP,并表明噪声水平的变化在隐私和准确性之间提供了有用的权衡。在较低的噪声水平下,我们的模型保持了高准确性的同时提供了强大的隐私保证,有效地平衡了隐私和效用。
Summary / 总结
The research aims to protect sensitive information in text data by sharing noisy versions of transformer embeddings. It introduces Nonparametric Variational Differential Privacy (NVDP), which uses a Nonparametric Variational Information Bottleneck (NVIB) layer to inject noise into the multi-vector transformer embeddings and measures privacy protection with Rényi divergence and a corresponding Bayesian Differential Privacy (BDP) guarantee. Experimental results on the GLUE benchmark show that varying the noise level gives a useful tradeoff between privacy and accuracy; at lower noise levels the model maintains high accuracy while offering strong privacy guarantees.
研究旨在通过向Transformer嵌入中添加噪声来保护文本数据的隐私。它引入了非参数变分差分隐私(NVDP),使用非参数变分信息瓶颈(NVIB)层向Transformer嵌入中注入噪声,从而保护敏感信息。研究表明,调整噪声水平可以在隐私和准确性之间进行权衡;在较低的噪声水平下,模型在提供强隐私保证的同时保持高准确性。
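To make the role of Rényi divergence in the entry above concrete, here is a small sketch of how added noise lowers the divergence between the noisy versions of two embeddings. It assumes isotropic Gaussian noise with a shared standard deviation, for which the order-alpha Rényi divergence has the closed form alpha * ||mu_p - mu_q||^2 / (2 * sigma^2). This is a simplification: the NVIB layer's actual noise mechanism is not reproduced here, and the function name and example vectors are hypothetical.

```python
def renyi_divergence_gaussians(mu_p, mu_q, sigma, alpha):
    """Order-alpha Renyi divergence between N(mu_p, sigma^2 I) and N(mu_q, sigma^2 I)."""
    if alpha <= 1.0:
        raise ValueError("this closed form is used here for alpha > 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return alpha * squared_distance / (2.0 * sigma ** 2)

# Two hypothetical token embeddings that an adversary would like to tell apart.
embedding_a = [0.2, -1.0, 0.5]
embedding_b = [0.0, -0.5, 1.0]

# More noise (larger sigma) means a smaller divergence, i.e. the two inputs
# are harder to distinguish from their noisy embeddings.
divergences = {
    sigma: renyi_divergence_gaussians(embedding_a, embedding_b, sigma, alpha=2.0)
    for sigma in (0.5, 1.0, 2.0)
}
```

Doubling the noise scale cuts the divergence by a factor of four, which is the privacy side of the privacy/accuracy tradeoff the abstract describes.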
Grounded Test-Time Adaptation for LLM Agents
Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong
First: 2025-11-06T22:24:35+00:00 · Latest: 2026-01-05T17:43:48+00:00
Comments: Our code is available here: https://github.com/r2llab/GTTA
Abstract
Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent's success rate from 2% to 23%.
中文标题/摘要
标题:面向LLM代理的接地式测试时适应
基于大语言模型(LLM)的代理在处理新颖和复杂的环境时难以泛化,例如未见过的网站或新的功能集,这是因为它们的预训练和测试条件之间存在根本性的不匹配。这一挑战源自两种不同的失败模式:对环境特定组件(如观察格式)的句法误解,以及对状态转换动力学的语义误解,这些动力学仅在测试时才显现。为了解决这些问题,我们提出了两种不同的且互补的策略,通过利用部署期间可用的环境特定信息来适应LLM代理。首先,一种在线分布适应方法通过学习一个轻量级的适应向量来参数化环境的细微差别,该向量偏置模型的输出分布,从而实现快速与环境响应格式的对齐。其次,一种部署时动力学接地方法采用基于人设的探索阶段系统地探查和学习环境的因果动力学,从而在执行任务前为代理提供一个非参数化的世界模型。我们在多种代理基准测试中评估了这些策略,包括函数调用和网页导航。我们的实验证明了这两种策略在所有基准测试中的有效性,且计算成本较低。我们发现,在复杂环境中,动力学接地特别有效,这些环境中的不可预测动力学构成了重大障碍,这表明了一条通往更泛化和能力更强的LLM代理的稳健路径。例如,在WebArena多站点分割中,这种方法将代理的成功率从2%提高到23%。
Summary / 总结
This paper addresses the challenge of large language model (LLM) agents generalizing to novel environments by proposing two strategies: an online distributional adaptation method that learns a lightweight adaptation vector to align the model's output with environment-specific formats, and a deployment-time dynamics grounding method that uses persona-driven exploration to learn the environment's causal dynamics. The methods are evaluated across various benchmarks, showing effectiveness with minimal computational cost, with dynamics grounding proving especially effective in complex environments.
本文提出了两种策略来解决大型语言模型(LLM)代理在新环境中泛化的挑战:一种在线分布适应方法,通过学习轻量级的适应向量来使模型输出与环境特定格式对齐;以及一种部署时动力学接地方法,通过个性化的探索学习环境的动力学。这些方法在各种基准测试中进行了评估,显示出在最小计算成本下有效,特别是在复杂环境中,动力学接地特别有效。
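The idea of a lightweight adaptation vector that biases the output distribution, as described in the entry above, can be sketched with an additive logit bias updated online. This is an illustrative assumption rather than the authors' method: the three-token vocabulary, the cross-entropy update rule, the learning rate, and the names `softmax` and `adapt_step` are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_step(bias, base_logits, observed_token, lr=0.5):
    """One online gradient step on cross-entropy with respect to the bias only.

    The frozen model's logits are untouched; only the additive bias moves.
    The gradient of the loss with respect to the bias is (probs - one_hot).
    """
    probs = softmax([logit + b for logit, b in zip(base_logits, bias)])
    return [
        b - lr * (p - (1.0 if index == observed_token else 0.0))
        for index, (b, p) in enumerate(zip(bias, probs))
    ]

# The frozen model prefers token 0, but the environment keeps expecting token 2
# (for example, a required output format).
base_logits = [2.0, 0.0, 0.0]
bias = [0.0, 0.0, 0.0]
prob_before = softmax(base_logits)[2]

for _ in range(20):
    bias = adapt_step(bias, base_logits, observed_token=2)

prob_after = softmax([logit + b for logit, b in zip(base_logits, bias)])[2]
```

After a handful of environment observations the biased distribution favors the format token, without changing any model weights.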
SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting
Authors: Sara Inácio, Hugo Proença, João C. Neves
First: 2026-01-05T17:34:50+00:00 · Latest: 2026-01-05T17:34:50+00:00
Comments: 9 pages
Abstract
The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene's hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.
中文标题/摘要
标题:SortWaste:工业废弃物分类中的密集标注数据集
随着人口增长导致的废弃物产量增加,有效管理和回收材料的挑战也随之而来。手工废弃物分类是一种常见做法,但处理大规模废弃物流仍效率低下,并且对工人健康构成风险。另一方面,现有的自动化分类方法仍然难以应对实际废弃物流中的高变异性、杂乱和视觉复杂性。缺乏实际的废弃物分类数据集是导致此类问题的自动化系统发展不足的主要原因之一。因此,我们介绍了SortWaste,这是一个从材料回收设施收集的密集标注的目标检测数据集。此外,我们通过提出ClutterScore,一种使用影响视觉复杂度的代理(例如物体数量、类别和大小熵以及空间重叠)来衡量场景难度水平的客观指标,为分类线上的废弃物检测标准化做出了贡献。除了这些贡献,我们还提供了最先进的目标检测模型的广泛基准测试,详细说明了它们在所提出的指标评估的难度水平下的结果。尽管在仅塑料检测任务中取得了令人鼓舞的结果(mAP为59.7%),但在高度杂乱的场景中性能显著下降。这突显了在该主题上需要新颖且更具挑战性的数据集的需求。
Summary / 总结
SortWaste is a densely annotated object detection dataset collected from a Material Recovery Facility, motivated by the inefficiency and health risks of manual waste sorting. The authors also propose ClutterScore, a metric that gauges scene hardness from proxies of visual complexity such as object count, class and size entropy, and spatial overlap. State-of-the-art object detection models reach an mAP of 59.7% in the plastic-only detection task, but performance decreases significantly in highly cluttered scenes, underscoring the need for more challenging datasets.
该论文介绍了SortWaste,一个用于工业废物分类的密集标注数据集,旨在解决手工分类的低效性和健康风险问题。数据集中包含一个名为ClutterScore的指标,用于评估废物场景的复杂性。实验结果显示,在仅塑料检测任务中,最先进的模型达到了59.7%的mAP,但在高度杂乱的场景中表现较差,表明该领域需要更具挑战性的数据集。
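The proxies that ClutterScore is said to draw on (object count, class and size entropy, spatial overlap) can be computed as in the sketch below. How the paper combines them into a single score is not stated in the abstract, so this sketch stops at the individual proxies. The size bins (32x32 and 96x96 pixel areas, borrowed from the common COCO convention), the use of mean pairwise IoU for overlap, the class names, and all function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels; 0 for an empty list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def size_bin(box):
    """Coarse size category (assumed thresholds, COCO-style)."""
    a = area(box)
    if a < 32 ** 2:
        return "small"
    return "medium" if a < 96 ** 2 else "large"

def clutter_proxies(boxes, classes):
    """Return the individual complexity proxies for one annotated image."""
    pairs = list(combinations(boxes, 2))
    mean_overlap = sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return {
        "object_count": len(boxes),
        "class_entropy": entropy(classes),
        "size_entropy": entropy([size_bin(b) for b in boxes]),
        "mean_pairwise_iou": mean_overlap,
    }

# Hypothetical annotations: two overlapping small items and one large item.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 300, 300)]
classes = ["PET", "PET", "HDPE"]
proxies = clutter_proxies(boxes, classes)
```

Any weighting or normalization of these proxies into a single hardness level would be a further design choice.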
Anytime-Valid Answer Sufficiency Certificates for LLM Generation via Sequential Information Lift
Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
First: 2025-10-07T21:28:53+00:00 · Latest: 2026-01-05T17:33:34+00:00
Abstract
We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), which applies anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift, defined as the log-likelihood ratio between the full model and deliberately weakened "skeleton" baselines, using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. This delta guarantee controls premature stopping when information lift is insufficient relative to the skeleton, and it does not imply delta control of factual incorrectness or hallucinations. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation length by 22 to 28 percent relative to sequential baselines while maintaining delta-level control with 12 percent computational overhead. We introduce automated skeletons (distilled submodels and randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries plus a verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness. Specifically, 10.9 percent of stopped sequences remain incorrect even with the gate (13.2 to 22.7 percent without it). EDFL serves as a first-stage filter that can reduce verification burden: when applied to stopped sequences, the gate validates 83 percent of stops, requiring full verification only for the remaining 17 percent, plus all non-stopped sequences. EDFL is not a standalone solution for safety-critical domains.
中文标题/摘要
标题:通过序贯信息提升为LLM生成提供任意时刻有效的答案充分性证书
我们引入了Sequential-EDFL(经验动态形式提升),它将任意时刻有效的序贯检验应用于语言模型生成的停止决策。我们的方法跟踪信息提升(定义为完整模型与故意削弱的“骨架”基线之间的对数似然比),使用自归一化经验伯恩斯坦e过程,无论停止时间如何,都能提供形式化的delta级误差控制。该delta保证控制的是信息提升相对于骨架不足时的过早停止,并不意味着对事实错误或幻觉的delta级控制。我们通过在线均值估计处理未知中心化,通过混合e过程结合多个参数,并在分布漂移下支持自适应重置。在六个基准测试中,Sequential-EDFL相对于序贯基线将生成长度减少了22%到28%,同时以12%的计算开销维持delta级控制。我们引入了自动化骨架(蒸馏子模型和随机化logits),并展示了在不同骨架族上的鲁棒性。将EDFL与轻量级正确性门控(句子边界加验证器)组合,可以提高最终任务的正确性,并且由于只会延迟停止,任意时刻有效的保证得以保留。我们的证书控制的是信息充分性,而不是事实正确性。具体来说,即使有门控,仍有10.9%的已停止序列是错误的(没有门控时为13.2%到22.7%)。EDFL可作为第一阶段过滤器以减少验证负担:应用于已停止序列时,门控验证了83%的停止,只需对其余17%以及所有未停止的序列进行完整验证。EDFL不是安全关键领域的独立解决方案。
Summary / 总结
Sequential-EDFL (Empirical Dynamic Formal Lift) applies anytime-valid sequential testing to language model generation stopping, tracking information lift with self-normalized empirical-Bernstein e-processes that give formal delta-level error control regardless of stopping time. On six benchmarks, it reduces generation length by 22 to 28 percent relative to sequential baselines with 12 percent computational overhead while maintaining delta-level control. A lightweight correctness gate improves end-task correctness: it validates 83 percent of stops, so full verification is needed only for the remaining 17 percent plus all non-stopped sequences. The certificates control information sufficiency, not factual correctness.
Sequential-EDFL 使用自归一化经验伯恩斯坦e过程,对语言模型生成的停止进行任意时刻有效的序贯检验,通过跟踪信息提升来决定何时停止。相对于序贯基线,它将生成长度减少了22%到28%,计算开销为12%,同时保持形式化的delta级控制。自动化骨架和轻量级正确性门控进一步提高了最终任务的正确性;门控验证了83%的停止,从而减少了验证负担。
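The anytime-valid stopping idea in the entry above can be illustrated with a much simpler e-process than the one the paper uses. The sketch below is a fixed-bet "betting" e-process for observations in [0, 1], not the self-normalized empirical-Bernstein construction of Sequential-EDFL. It assumes each observation is a per-step information-lift value already rescaled to [0, 1], and the names `e_process_stop`, `null_mean`, and `bet` are hypothetical. Under the null hypothesis that the mean lift is at most `null_mean`, the wealth is a nonnegative supermartingale, so by Ville's inequality the chance of it ever reaching 1/delta is at most delta.

```python
def e_process_stop(observations, null_mean, bet, delta):
    """Multiply wealth by 1 + bet * (x - null_mean) and stop once it reaches 1/delta.

    Returns (stop_step, final_wealth); stop_step is None if the threshold is never hit.
    """
    if not 0.0 < null_mean < 1.0:
        raise ValueError("null_mean must lie strictly between 0 and 1")
    if not 0.0 <= bet <= 1.0 / null_mean:
        raise ValueError("bet must keep every wealth factor nonnegative")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")

    wealth = 1.0
    for step, x in enumerate(observations, start=1):
        if not 0.0 <= x <= 1.0:
            raise ValueError("observations must be rescaled to [0, 1]")
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / delta:
            return step, wealth
    return None, wealth

# Consistently high lift: evidence accumulates and generation may stop early.
stop_high, wealth_high = e_process_stop([0.9] * 50, null_mean=0.5, bet=1.0, delta=0.05)

# Lift exactly at the null mean: wealth never grows, so no stop is certified.
stop_null, wealth_null = e_process_stop([0.5] * 50, null_mean=0.5, bet=1.0, delta=0.05)
```

With a per-step factor of 1.4, the wealth first exceeds 20 at step 9, so the first run stops there while the second never stops. As in the paper's framing, such a certificate concerns information sufficiency relative to a baseline, not factual correctness.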
Language as a Wave Phenomenon: Iso-Energetic Phase-Locking and Semantic Interference in Neural Networks
Authors: Alper Yıldırım, İbrahim Yücedağ
First: 2025-12-01T02:46:15+00:00 · Latest: 2026-01-05T17:26:51+00:00
Comments: Major Revision. Title changed to reflect the new theoretical framework. Complete narrative shift from "Optimization Efficiency" to "Iso-Energetic Phase Coding" and "Optical Hardware Compatibility". Replaced ISMR diagnostics with Holographic Optical Learning simulations and mechanistic "Dual-Regime" phase analysis. Comparison with spectral baselines (FNet) added
Abstract
Conventional deep learning paradigms rely on metabolically expensive magnitude-based representations, rendering them fundamentally incompatible with passive photonic hardware. We introduce PRISM, a sequence modeling architecture that bridges high-level reasoning and physical constraints by enforcing an Iso-Energetic (Unity Gain) principle, compelling the network to encode semantic information exclusively in the phase angle. Validated on the WMT14 translation benchmark, PRISM achieves a 0.799 COMET score, demonstrating that phase-based reasoning competes with standard Transformers (0.821) and functionally matches unconstrained spectral baselines like FNet (0.805), despite enforcing strict energy constraints and requiring 11.5% fewer parameters. Furthermore, to verify hardware feasibility, we simulate a Holographic Backpropagation mechanism on a noisy, 4-bit optical correlator. Ablation studies reveal a substantial performance gain (48.4% vs. 62.4%) over a frozen baseline, proving that the proposed phase-steering mechanism actively optimizes physical parameters under strict energy constraints. These results establish an existence proof that ultra-low-power, passive optical hardware can support high-level linguistic intelligence without sacrificing representational capacity.
中文标题/摘要
标题:语言作为一种波现象:等能相位锁定与神经网络中的语义干扰
传统的深度学习范式依赖于代谢代价高昂的基于幅度的表示,使其与无源光子硬件从根本上不兼容。我们引入了PRISM,这是一种序列建模架构,它通过强制执行等能(单位增益)原则,迫使网络仅在相位角中编码语义信息,从而在高层推理与物理约束之间架起桥梁。在WMT14翻译基准上,PRISM获得了0.799的COMET分数,表明基于相位的推理可与标准Transformer(0.821)竞争,并在功能上与FNet(0.805)等无约束的频谱基线相当,尽管它施加了严格的能量约束且参数减少了11.5%。此外,为了验证硬件可行性,我们在含噪声的4位光学相关器上模拟了全息反向传播机制。消融研究显示,相对于冻结基线有显著的性能提升(48.4%对62.4%),证明了所提出的相位控制机制在严格能量约束下主动优化了物理参数。这些结果构成了一个存在性证明:超低功耗的无源光学硬件可以在不牺牲表示能力的情况下支持高级语言智能。
Summary / 总结
The paper introduces PRISM, a sequence modeling architecture that enforces an Iso-Energetic (Unity Gain) principle so that semantic information is encoded exclusively in the phase angle, targeting compatibility with passive photonic hardware. PRISM achieves a 0.799 COMET score on the WMT14 translation benchmark, competitive with standard Transformers (0.821) and spectral baselines like FNet (0.805), while requiring 11.5% fewer parameters. A simulated Holographic Backpropagation mechanism on a noisy 4-bit optical correlator supports hardware feasibility, with ablations reporting a substantial gain over a frozen baseline (48.4% vs. 62.4%).
论文提出了PRISM,这是一种遵循等能(单位增益)原则的序列建模架构,仅在相位角中编码语义信息。PRISM在WMT14翻译基准上取得了0.799的COMET分数,可与标准Transformer(0.821)和FNet(0.805)等频谱基线竞争,同时参数减少了11.5%,并通过光学仿真展示了硬件可行性。消融研究表明,相对于冻结基线有显著的性能提升(48.4%对62.4%),支持了相位控制机制在严格能量约束下的有效性。
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Authors: Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
First: 2026-01-05T17:11:00+00:00 · Latest: 2026-01-05T17:11:00+00:00
Abstract
The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
中文标题/摘要
标题:InfiniteVGGT:视觉几何导向变换器,用于无尽流
持久的大规模3D视觉几何理解的宏伟愿景受到可扩展性和长期稳定性不可调和需求的束缚。虽然离线模型如VGGT实现了令人鼓舞的几何能力,但它们基于批次的性质使它们不适合实时系统。流式架构虽然是为实时操作设计的解决方案,但已被证明是不充分的。现有方法要么无法支持真正无限的输入,要么在长时间序列中遭受灾难性漂移。我们通过InfiniteVGGT打破了这一长期困境,这是一种因果视觉几何变换器,通过有界但适应性强且持续表达的KV缓存实现滚动记忆的概念。利用这一点,我们设计了一种无需训练、不依赖注意力的剪枝策略,能够智能地丢弃过时信息,有效地“滚动”记忆向前推进每一帧。InfiniteVGGT完全兼容FlashAttention,最终解决了这一妥协,使无限时长的流式传输成为可能,同时在长期稳定性方面优于现有流式方法。对于此类系统的最终考验是其在真正无限时长上的性能,由于缺乏长期连续基准,这种能力一直难以严格验证。为解决这一关键缺口,我们引入了Long3D基准,这是首次能够对序列长度约10,000帧的连续3D几何估计进行严格评估的基准。这为未来在长期3D几何理解方面的研究提供了决定性的评估平台。代码可在:https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT 获取
Summary / 总结
InfiniteVGGT addresses the challenge of persistent 3D visual geometry understanding by introducing a causal transformer with a bounded yet adaptive KV cache, enabling infinite-horizon streaming while maintaining long-term stability. It outperforms existing methods and introduces the Long3D benchmark for rigorous evaluation of 3D geometry estimation over extremely long sequences, providing a definitive platform for future research.
InfiniteVGGT通过引入具有有界且自适应的KV缓存的因果Transformer,解决了持续3D视觉几何理解的挑战,实现了无限时域的流式处理并保持长期稳定性。它超越了现有的流式方法,并引入了Long3D基准,首次实现了对极长序列上3D几何估计的严格评估,为未来研究提供了权威的评估平台。
TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation
Authors: Salim Khazem
First: 2026-01-05T17:03:45+00:00 · Latest: 2026-01-05T17:03:45+00:00
Abstract
Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA-SAM, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git
中文标题/摘要
标题:TopoLoRA-SAM:拓扑感知参数高效适应基础分割器以实现薄结构和跨域二元语义分割
基础分割模型如分割一切模型(SAM)通过大规模预训练表现出强大的零样本泛化能力,但将其适应特定领域的语义分割仍然具有挑战性,尤其是对于薄结构(例如视网膜血管)和嘈杂的模态(例如SAR影像)。全量微调计算成本高昂且存在灾难性遗忘的风险。我们提出了一种拓扑感知和参数高效的适应框架——TopoLoRA-SAM,该框架将低秩适应(LoRA)注入冻结的ViT编码器,并增加了轻量级的空间卷积适配器和可选的拓扑感知监督(通过可微分的clDice实现)。我们在五个基准上评估了该方法,涵盖了视网膜血管分割(DRIVE、STARE、CHASE_DB1)、息肉分割(Kvasir-SEG)和SAR海/陆分割(SL-SSDD),并与U-Net、DeepLabV3+、SegFormer和Mask2Former进行比较。TopoLoRA-SAM在视网膜平均Dice和数据集总体平均Dice上均取得最佳成绩,同时仅训练了模型参数的5.2%(约4.9M)。在具有挑战性的CHASE_DB1数据集上,我们的方法显著提高了分割准确性和鲁棒性,证明了拓扑感知参数高效的适应可以与或超越完全微调的专业模型。代码可在:https://github.com/salimkhazem/Seglab.git
Summary / 总结
TopoLoRA-SAM is a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation, particularly for thin structures and noisy modalities. It injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder and adds a lightweight spatial convolutional adapter, along with optional topology-aware supervision. TopoLoRA-SAM outperforms U-Net, DeepLabV3+, SegFormer, and Mask2Former on five benchmarks, achieving the best retina-average Dice and overall average Dice while training only 5.2% of the model parameters. The method demonstrates improved segmentation accuracy and robustness on the challenging CHASE_DB1 dataset.
TopoLoRA-SAM 是一种针对细小结构和噪声模态的二元语义分割的拓扑感知和参数高效适应框架。它将低秩适应(LoRA)注入冻结的 ViT 编码器,增加一个轻量级的空间卷积适配器,并可选地使用拓扑感知监督。该方法在五个基准测试中实现了最佳的视网膜平均 Dice 值和总体平均 Dice 值,仅训练了模型参数的 5.2%。在 CHASE_DB1 数据集上,该方法显著提高了分割准确性和鲁棒性,表明其可以媲美甚至超越完全微调的专业模型。
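The LoRA component named in the TopoLoRA-SAM entry can be sketched in a few lines. This is a generic illustration of low-rank adaptation, assuming the usual formulation y = xW + (alpha / r) · xAB with W frozen and B initialized to zero; it is not the paper's code, and the class name `LoRALinear`, the toy sizes, and the initialization scale are assumptions. It does not model the convolutional adapter or the clDice loss, so the trainable fraction it reports is unrelated to the 5.2% figure above.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update A @ B."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = len(weight), len(weight[0]), rank
        self.weight = weight  # frozen: never updated during adaptation
        # A starts small and random, B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(rank)]
                  for _ in range(self.d_in)]
        self.B = [[0.0] * self.d_out for _ in range(rank)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matmul(x, self.weight)
        delta = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * d for b, d in zip(row_b, row_d)]
                for row_b, row_d in zip(base, delta)]

    def trainable_fraction(self):
        lora_params = self.rank * (self.d_in + self.d_out)
        return lora_params / (self.d_in * self.d_out + lora_params)


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
print(layer.forward([[1.0, 2.0, 3.0, 4.0]]))
print(layer.trainable_fraction())
```

The fraction shrinks quickly as the layer grows, because the frozen weight scales with d_in · d_out while the update scales with r · (d_in + d_out).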
DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
Authors: Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang
First: 2026-01-05T16:51:45+00:00 · Latest: 2026-01-05T16:51:45+00:00
Comments: Page: https://wrk226.github.io/DiffProxy.html, Code: https://github.com/wrk226/DiffProxy
Abstract
Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models' training, while synthetic data with precise supervision suffers from a domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging diffusion-based generative priors to bridge synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization, particularly in challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html
中文标题/摘要
标题:DiffProxy:通过扩散生成密集代理的多视角人体网格恢复
从多视角图像恢复人体网格面临着一个根本性的挑战:现实世界的数据集包含有偏的不完美标注,影响模型的训练,而带有精确监督的合成数据则存在领域差距。本文提出了一种新颖的框架DiffProxy,用于生成多视角一致的人体代理以进行网格恢复。DiffProxy的核心在于利用基于扩散的生成先验来弥合合成训练与现实世界泛化的差距。其关键创新包括:(1) 多条件机制,用于生成多视角一致、像素对齐的人体代理;(2) 手部细化模块,结合灵活的视觉提示以增强局部细节;(3) 一种基于不确定性测试时的缩放方法,以在优化过程中提高对具有遮挡和部分视角的挑战性情况的鲁棒性。这些设计确保了网格恢复过程能够有效利用精确的合成标注和基于扩散的生成优势。DiffProxy完全在合成数据上训练,实现了五个现实世界基准上的最佳性能,特别是在具有遮挡和部分视角的挑战性场景中表现出强大的零样本泛化能力。
Summary / 总结
DiffProxy is a novel framework for human mesh recovery from multi-view images, addressing the challenge of training with imperfect ground-truth annotations and synthetic data domain gap. It generates multi-view consistent human proxies using diffusion-based generative priors and includes a hand refinement module and uncertainty-aware scaling method. DiffProxy, trained solely on synthetic data, outperforms existing methods on five real-world benchmarks, especially in challenging scenarios with occlusions and partial views.
DiffProxy 是一种用于从多视角图像恢复人体网格的新框架,解决了训练数据偏差和合成数据与真实世界数据之间领域差距的问题。它使用基于扩散的生成先验生成多视角一致的人体代理,并包含手部细化模块和不确定性感知的测试时缩放方法。DiffProxy 仅在合成数据上训练,实现了五个真实世界基准上的最先进性能,特别是在具有遮挡和部分视角的挑战性场景中表现出色。
Predicting Early and Complete Drug Release from Long-Acting Injectables Using Explainable Machine Learning
Authors: Karla N. Robles, Manar D. Samad
First: 2026-01-05T16:49:17+00:00 · Latest: 2026-01-05T16:49:17+00:00
Abstract
Polymer-based long-acting injectables (LAIs) have transformed the treatment of chronic diseases by enabling controlled drug delivery, thus reducing dosing frequency and extending therapeutic duration. Achieving controlled drug release from LAIs requires extensive optimization of the complex underlying physicochemical properties. Machine learning (ML) can accelerate LAI development by modeling the complex relationships between LAI properties and drug release. However, recent ML studies have provided limited information on key properties that modulate drug release, due to the lack of custom modeling and analysis tailored to LAI data. This paper presents a novel data transformation and explainable ML approach to synthesize actionable information from 321 LAI formulations by predicting early drug release at 24, 48, and 72 hours, classifying release profile types, and predicting complete release profiles. These three experiments investigate the contribution and control of LAI material characteristics in early and complete drug release profiles. A strong correlation (>0.65) is observed between the true and predicted drug release at 72 hours, while a 0.87 F1-score is obtained in classifying release profile types. A time-independent ML framework predicts delayed biphasic and triphasic curves with better performance than current time-dependent approaches. Shapley additive explanations reveal the relative influence of material characteristics during early and complete release, which fills several gaps in previous in-vitro and ML-based studies. The novel approach and findings can provide a quantitative strategy and recommendations for scientists to optimize the drug-release dynamics of LAIs. The source code for the model implementation is publicly available.
中文标题/摘要
标题:使用可解释的机器学习预测长效注射剂早期和完全药物释放
基于聚合物的长效注射剂(LAIs)通过实现受控药物释放,减少了给药频率并延长了治疗时间,从而彻底改变了慢性疾病的治疗方式。实现LAIs中的受控药物释放需要对复杂的物理化学性质进行广泛的优化。机器学习(ML)可以通过建模LAIs属性与药物释放之间的复杂关系来加速LAIs的发展。然而,最近的ML研究由于缺乏针对LAIs数据的定制建模和分析,提供的关于调节药物释放的关键属性的信息有限。本文提出了一种新颖的数据转换和可解释的ML方法,通过预测24、48和72小时的早期药物释放、释放模式类型的分类以及完全释放模式的预测,从321种LAI配方中综合出可操作的信息。这三个实验探讨了LAI材料特性在早期和完全药物释放模式中的贡献和控制。在72小时的真实和预测药物释放之间观察到强相关性(>0.65),而在分类释放模式类型方面获得了0.87的F1分数。时间独立的ML框架预测延迟的双相和三相曲线的性能优于当前的时间依赖方法。Shapley加性解释揭示了材料特性在早期和完全释放中的相对影响,填补了先前体外和基于ML研究中的多个空白。该新颖的方法和发现可以为科学家提供一种定量策略和建议,以优化LAIs的药物释放动力学。模型实现的源代码已公开。
Summary / 总结
This study aims to optimize the drug release from polymer-based long-acting injectables (LAIs) using explainable machine learning. The researchers developed a novel data transformation and explainable ML approach to predict early drug release at 24, 48, and 72 hours, classify release profile types, and predict complete release profiles. They observed a strong correlation (>0.65) between true and predicted drug release at 72 hours and achieved a 0.87 F1-score in classifying release profile types. The time-independent ML framework outperformed current time-dependent approaches in predicting delayed biphasic and triphasic curves. Shapley additive explanations provided insights into the relative influence of material characteristics on drug release, filling gaps in previous studies. The findings offer a quantitative strategy for optimizing LAI drug-release dynamics.
该研究旨在使用可解释的机器学习优化聚合物基长效注射剂(LAIs)的药物释放。研究人员开发了一种新型数据转换和可解释的ML方法,以预测24、48和72小时的早期药物释放、分类释放模式类型以及预测完全释放模式。他们观察到,在72小时的真实和预测药物释放之间存在较强的相关性(>0.65),并在分类释放模式类型方面获得了0.87的F1分数。时间独立的ML框架在预测延迟的双相和三相曲线方面优于当前的时间依赖性方法。Shapley加性解释突出了材料特性在早期和完全释放过程中的相对影响,为LAI优化提供了可操作的见解。研究结果为科学家提供了增强LAIs药物释放动力学的定量策略。
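The Shapley attribution mentioned in the long-acting injectables entry can be made concrete with an exact computation on a toy model. This is a sketch only: the feature names (`polymer_mw`, `drug_loading`, `particle_size`), the weights, and the interaction term are invented for illustration and are not taken from the study. It enumerates every feature ordering, which is exact but only feasible for a handful of features; practical explainers approximate this average.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of present features to a model output.
    """
    totals = {f: 0.0 for f in features}
    n_orderings = 0
    for order in permutations(features):
        n_orderings += 1
        present = frozenset()
        previous = value(present)
        for f in order:
            present = present | {f}
            current = value(present)
            totals[f] += current - previous
            previous = current
    return {f: total / n_orderings for f, total in totals.items()}


# Hypothetical toy "release" model: additive effects plus one interaction.
WEIGHTS = {"polymer_mw": 0.3, "drug_loading": 0.5, "particle_size": -0.1}


def toy_release(present):
    out = sum(WEIGHTS[f] for f in present)
    if {"polymer_mw", "drug_loading"} <= present:
        out += 0.2  # interaction, shared equally by the two features
    return out


print(shapley_values(list(WEIGHTS), toy_release))
```

The attributions sum to the difference between the full model output and the empty baseline, which is the property that makes them usable as a feature-influence ranking.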
Development of a high-resolution indoor radon map using a new machine learning-based probabilistic model and German radon survey data
Authors: Eric Petermann, Peter Bossew, Joachim Kemski, Valeria Gruber, Nils Suhr, Bernd Hoffmann
Venue: Environmental Health Perspectives 132 (9), 097009 (2024)
First: 2023-10-17T10:51:05+00:00 · Latest: 2026-01-05T16:47:03+00:00
Abstract
Accurate knowledge of indoor radon concentration is crucial for assessing radon-related health effects or identifying radon-prone areas. Indoor radon concentration at the national scale is usually estimated on the basis of extensive measurement campaigns. However, characteristics of the sampled households often differ from the characteristics of the target population owing to the large number of relevant factors that control the indoor radon concentration, such as the availability of geogenic radon or floor level. We propose a model-based approach that allows a more realistic estimation of indoor radon distribution with a higher spatial resolution than a purely data-based approach. We apply a quantile regression forest to estimate the probability distribution function of indoor radon for each floor level of each residential building in Germany. Based on the estimated probability distribution function, a probabilistic Monte Carlo sampling technique was applied, enabling the combination and population weighting of floor-level predictions. In this way, the uncertainty of the individual predictions is effectively propagated into the estimate of variability at the aggregated level. The results show an approximate lognormal distribution of indoor radon in dwellings in Germany with an arithmetic mean of 63 Bq/m³, a geometric mean of 41 Bq/m³, and a 95th percentile of 180 Bq/m³. The exceedance probabilities for 100 and 300 Bq/m³ are 12.5% (10.5 million people affected) and 2.2% (1.9 million people affected), respectively. The advantages of our approach are that it yields a) an accurate estimation of indoor radon concentration even if the survey is not fully representative with respect to floor level and radon concentration in soil, and b) an estimate of the indoor radon distribution with a much higher spatial resolution than basic descriptive statistics.
中文标题/摘要
标题:使用新型基于机器学习的概率模型和德国氡调查数据开发高分辨率室内氡图
准确了解室内氡浓度对于评估氡相关健康影响或识别氡易发地区至关重要。通常,国家规模的室内氡浓度是基于广泛的测量活动估算的。然而,由于控制室内氡浓度的相关因素众多(如地源氡的可获得性或楼层),采样家庭的特征往往与目标人群的特征不同。我们提出了一种基于模型的方法,该方法允许更现实地估计室内氡分布,并具有比纯数据方法更高的空间分辨率。通过应用分位数回归森林来估计德国每栋住宅建筑每楼层室内氡的概率分布函数。基于估计的概率分布函数,应用概率蒙特卡洛采样技术,使楼层预测的组合和人口加权成为可能。这样,个体预测的不确定性有效地传播到聚合水平的变异估计中。结果显示,德国住宅中的室内氡分布近似对数正态分布,算术平均值为63 Bq/m³,几何平均值为41 Bq/m³,第95百分位数为180 Bq/m³。超过100和300 Bq/m³的概率分别为12.5%(影响1050万人)和2.2%(影响190万人)。我们方法的优点是即使调查在楼层水平和土壤氡浓度方面不完全具有代表性,也能准确估计室内氡浓度,并且能以比基本描述性统计高得多的空间分辨率估计室内氡分布。
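The sampling step in the radon entry (per-floor predictive distributions combined by population-weighted Monte Carlo draws) can be sketched as follows. Everything concrete here is an assumption for illustration: the quantile levels, the Bq/m³ values, the populations, and the helper names. A real pipeline would take the quantiles from a fitted quantile regression forest; this sketch hard-codes them and interpolates linearly between the given levels, so the tails beyond the outermost quantiles are ignored.

```python
import random


def sample_from_quantiles(probs, values, rng):
    """Inverse-transform draw with linear interpolation between quantiles."""
    u = rng.uniform(probs[0], probs[-1])
    for i in range(len(probs) - 1):
        p0, p1 = probs[i], probs[i + 1]
        if u <= p1:
            frac = (u - p0) / (p1 - p0)
            return values[i] + frac * (values[i + 1] - values[i])
    return values[-1]


def exceedance_probability(units, threshold, draws=2000, seed=0):
    """Population-weighted share of draws above `threshold`."""
    rng = random.Random(seed)
    weighted_hits = 0.0
    total_population = 0.0
    for unit in units:
        hits = sum(
            sample_from_quantiles(unit["probs"], unit["values"], rng) > threshold
            for _ in range(draws)
        )
        weighted_hits += unit["population"] * hits / draws
        total_population += unit["population"]
    return weighted_hits / total_population


# Hypothetical floor-level units: (quantile levels, Bq/m3 values, residents).
units = [
    {"probs": [0.05, 0.5, 0.95], "values": [20.0, 60.0, 250.0], "population": 4},
    {"probs": [0.05, 0.5, 0.95], "values": [10.0, 30.0, 90.0], "population": 12},
]
print(exceedance_probability(units, threshold=100.0))
```

Because each unit contributes draws from its whole predictive distribution rather than a single point estimate, the spread of individual predictions carries through to the aggregate figure.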
POSEIDON: Physics-Optimized Seismic Energy Inference and Detection Operating Network
Authors: Boris Kriuk, Fedor Kriuk
First: 2026-01-05T16:46:34+00:00 · Latest: 2026-01-05T16:46:34+00:00
Comments: 8 pages, 14 figures
Abstract
Earthquake prediction and seismic hazard assessment remain fundamental challenges in geophysics, with existing machine learning approaches often operating as black boxes that ignore established physical laws. We introduce POSEIDON (Physics-Optimized Seismic Energy Inference and Detection Operating Network), a physics-informed energy-based model for unified multi-task seismic event prediction, alongside the Poseidon dataset -- the largest open-source global earthquake catalog comprising 2.8 million events spanning 30 years. POSEIDON embeds fundamental seismological principles, including the Gutenberg-Richter magnitude-frequency relationship and Omori-Utsu aftershock decay law, as learnable constraints within an energy-based modeling framework. The architecture simultaneously addresses three interconnected prediction tasks: aftershock sequence identification, tsunami generation potential, and foreshock detection. Extensive experiments demonstrate that POSEIDON achieves state-of-the-art performance across all tasks, outperforming gradient boosting, random forest, and CNN baselines with the highest average F1 score among all compared methods. Crucially, the learned physics parameters converge to scientifically interpretable values -- Gutenberg-Richter b-value of 0.752 and Omori-Utsu parameters p=0.835, c=0.1948 days -- falling within established seismological ranges while enhancing rather than compromising predictive accuracy. The Poseidon dataset is publicly available at https://huggingface.co/datasets/BorisKriuk/Poseidon, providing pre-computed energy features, spatial grid indices, and standardized quality metrics to advance physics-informed seismic research.
中文标题/摘要
标题:POSEIDON:基于物理优化的地震能量推断与检测操作网络
地震预测和地震灾害评估仍然是地球物理学中的基本挑战,现有的机器学习方法往往作为黑箱操作,忽视了已有的物理定律。我们引入了POSEIDON(基于物理优化的地震能量推断与检测操作网络),这是一种物理信息驱动的基于能量的模型,用于统一的多任务地震事件预测,同时发布了Poseidon数据集,这是包含280万次地震事件、跨越30年的最大开源全球地震目录。POSEIDON将古登堡-里克特震级-频率关系和大森-宇津(Omori-Utsu)余震衰减定律等基本地震学原理作为可学习的约束条件,嵌入在基于能量的建模框架中。该架构同时解决了三个相互关联的预测任务:余震序列识别、海啸生成潜力和前震检测。广泛的实验表明,POSEIDON在所有任务上都达到了最先进的性能,优于梯度提升、随机森林和CNN基线,所有比较方法中平均F1分数最高。最关键的是,学习到的物理参数收敛到可解释的科学值(古登堡-里克特b值为0.752,大森-宇津参数p=0.835,c=0.1948天),这些值落在已建立的地震学范围内,同时提高了预测准确性而没有削弱。Poseidon数据集可在https://huggingface.co/datasets/BorisKriuk/Poseidon 公开获取,提供预计算的能量特征、空间网格索引和标准化质量指标,以促进基于物理的地震研究。
Summary / 总结
POSEIDON is a physics-informed energy-based model designed for unified multi-task seismic event prediction, incorporating seismological principles such as the Gutenberg-Richter magnitude-frequency relationship and Omori-Utsu aftershock decay law. It addresses three interconnected tasks: aftershock sequence identification, tsunami generation potential, and foreshock detection. Experiments show that POSEIDON outperforms gradient boosting, random forest, and CNN baselines with the highest average F1 score. The learned physics parameters converge to scientifically interpretable values within established seismological ranges, enhancing predictive accuracy without compromising it.
POSEIDON 是一个物理导向的能量模型,用于地震预测和地震灾害评估。它将 Gutenberg-Richter 震级-频率关系和 Omori-Utsu 余震衰减定律等地震学原理整合到能量模型框架中,以预测余震序列、海啸生成潜力和前震。实验表明,POSEIDON 在所有任务上的平均 F1 分数最高,优于梯度提升、随机森林和 CNN 基线,并且学习到的物理参数与已知的地震学范围相符,同时提高了预测准确性。
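The two seismological laws named in the POSEIDON entry have simple closed forms, and evaluating them with the reported parameters shows what those values imply. The sketch below assumes the standard statements of the laws: Gutenberg-Richter as log10 N(≥M) = a − bM, and Omori-Utsu as n(t) = K / (t + c)^p. The values b = 0.752, p = 0.835, and c = 0.1948 days come from the abstract; the productivity constants `a` and `K` are placeholders, and none of this reproduces how the model embeds the laws as constraints.

```python
def gutenberg_richter_count(magnitude, a, b=0.752):
    """Expected number of events with magnitude >= `magnitude`."""
    return 10 ** (a - b * magnitude)


def omori_utsu_rate(t_days, K, c=0.1948, p=0.835):
    """Aftershock rate (events per day) `t_days` after the mainshock."""
    return K / (t_days + c) ** p


# Placeholder constants, chosen only to make the printout readable.
a, K = 6.0, 100.0

# Each unit of magnitude reduces the expected count by a factor of 10**b.
ratio = gutenberg_richter_count(5.0, a) / gutenberg_richter_count(6.0, a)
print(f"M>=5 events per M>=6 event: {ratio:.2f}")

# The aftershock rate decays as a power law in time since the mainshock.
for t in (0.0, 1.0, 10.0, 100.0):
    print(f"day {t:>5}: {omori_utsu_rate(t, K):.2f} events/day")
```

With b below 1, large events are relatively more frequent than under the often-quoted b ≈ 1, and with p below 1 the aftershock rate decays more slowly than 1/t.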
Towards Fair In-Context Learning with Tabular Foundation Models
Authors: Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji
First: 2025-05-14T15:53:14+00:00 · Latest: 2026-01-05T16:39:29+00:00
Comments: Published in Transactions on Machine Learning Research (TMLR)
Abstract
Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models--TabPFNv2, TabICL, and TabDPT--on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility: https://github.com/patrikken/Fair-TabICL.
中文标题/摘要
标题:迈向基于表格基础模型的公平上下文学习
基于Transformer的表格基础模型在结构化数据上最近展示了有前景的上下文学习(ICL)性能,成为梯度提升树的有竞争力的替代方案。然而,这一新范式的公平性影响尚未得到充分探索。我们首次对表格ICL中的公平性进行了研究,评估了三种最近提出的基础模型(TabPFNv2、TabICL和TabDPT)在多个基准数据集上的表现。为了减轻偏差,我们探索了三种预处理公平性增强方法:相关性去除(使输入特征与敏感属性去相关)、群体平衡样本选择(确保受保护群体在上下文示例中的平等代表性)和基于不确定性的样本选择(优先选择敏感属性预测不确定性高的上下文示例)。我们的实验表明,基于不确定性的策略在对预测准确性影响最小的情况下,始终能提高群体公平性指标(如人口统计均等、均等化几率和机会均等)。我们发布了代码以促进可重复性:https://github.com/patrikken/Fair-TabICL。
Summary / 总结
This study investigates fairness in in-context learning (ICL) for tabular data using transformer-based foundation models. It evaluates three models—TabPFNv2, TabICL, and TabDPT—on multiple benchmarks and explores three pre-processing methods to mitigate biases. The uncertainty-based sample selection strategy is found to improve group fairness metrics with minimal loss in predictive accuracy.
该研究探讨了基于变换器的表结构数据在上下文学习(ICL)中的公平性问题,评估了三种模型—TabPFNv2、TabICL 和 TabDPT—在多个基准数据集上的表现,并探索了三种预处理方法以减轻偏差。研究发现,基于不确定性选择样本的策略能够提高群体公平性指标,同时对预测准确性影响较小。
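Two of the group fairness metrics named in the tabular ICL entry, plus the general shape of uncertainty-based context selection, can be written down compactly. This is a minimal sketch assuming binary predictions and a binary sensitive attribute, with both groups present; the function names and the toy data are illustrative. In particular, `select_uncertain` is only one plausible reading of "high sensitive-attribute prediction uncertainty" (probability closest to 0.5) and is not claimed to match the released code.

```python
def positive_rate(preds):
    """Share of positive (1) predictions in a non-empty list."""
    return sum(preds) / len(preds)


def demographic_parity_difference(y_pred, sensitive):
    """|P(pred=1 | s=0) - P(pred=1 | s=1)|."""
    g0 = [p for p, s in zip(y_pred, sensitive) if s == 0]
    g1 = [p for p, s in zip(y_pred, sensitive) if s == 1]
    return abs(positive_rate(g0) - positive_rate(g1))


def equal_opportunity_difference(y_true, y_pred, sensitive):
    """Absolute gap in true-positive rate between the two groups."""
    tprs = []
    for group in (0, 1):
        preds = [p for t, p, s in zip(y_true, y_pred, sensitive)
                 if s == group and t == 1]
        tprs.append(positive_rate(preds))
    return abs(tprs[0] - tprs[1])


def select_uncertain(sensitive_probs, k):
    """Indices of the k examples whose predicted P(s=1) is nearest 0.5."""
    order = sorted(range(len(sensitive_probs)),
                   key=lambda i: abs(sensitive_probs[i] - 0.5))
    return order[:k]


y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0]
group = [0, 0, 0, 1, 1, 1]
print(demographic_parity_difference(y_pred, group))
print(equal_opportunity_difference(y_true, y_pred, group))
print(select_uncertain([0.9, 0.52, 0.1, 0.45], k=2))
```

The toy data shows why the metrics are reported together: the predictor has equal true-positive rates across groups yet still assigns positives to one group more often.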
Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection
Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone
First: 2024-05-28T18:10:45+00:00 · Latest: 2026-01-05T16:38:03+00:00
Comments: Published in Transactions on Machine Learning Research (TMLR)
Abstract
Robustness of deep neural networks to input noise remains a critical challenge, as naive noise injection often degrades accuracy on clean (uncorrupted) data. We propose a novel training framework that addresses this trade-off through two complementary objectives. First, we introduce a loss function applied at the penultimate layer that explicitly enforces intra-class compactness and increases the margin to analytically defined decision boundaries. This enhances feature discriminativeness and class separability for clean data. Second, we propose a class-wise feature alignment mechanism that brings noisy data clusters closer to their clean counterparts. Furthermore, we provide a theoretical analysis demonstrating that improving feature stability under additive Gaussian noise implicitly reduces the curvature of the softmax loss landscape in input space, as measured by Hessian eigenvalues. This naturally enhances robustness without explicit curvature penalties. Conversely, we also theoretically show that lower curvatures lead to more robust models. We validate the effectiveness of our method on standard benchmarks and our custom dataset. Our approach significantly reinforces model robustness to various perturbations while maintaining high accuracy on clean data, advancing the understanding and practice of noise-robust deep learning.
中文标题/摘要
标题:通过判别性损失和高斯噪声注入训练更具鲁棒性的分类模型
深度神经网络对输入噪声的鲁棒性仍然是一个关键挑战,因为简单的噪声注入往往会降低干净(未受污染)数据上的准确率。我们提出了一种新的训练框架,通过两个互补的目标来解决这种权衡。首先,我们在倒数第二层引入了一个损失函数,该函数明确地促进了类内紧凑性并增加了到分析定义的决策边界的余量,从而增强了特征的判别性和类间可分性。其次,我们提出了一种类内特征对齐机制,将噪声数据簇拉近其干净的对应物。此外,我们还提供了一种理论分析,证明在加性高斯噪声下提高特征稳定性隐式地减少了softmax损失景观在输入空间中的曲率,这可以通过海森矩阵特征值来衡量。因此,这自然增强了鲁棒性,而无需显式的曲率惩罚。相反,我们还从理论上证明了较低的曲率会导致更鲁棒的模型。我们在标准基准和自定义数据集上验证了我们方法的有效性。我们的方法在各种扰动下显著增强了模型的鲁棒性,同时在干净数据上保持了高准确率,推进了噪声鲁棒深度学习的理解和实践。
Summary / 总结
The paper addresses the challenge of training deep neural networks to be robust against input noise while maintaining accuracy on clean data. It proposes a training framework with two objectives: a discriminative loss function at the penultimate layer to enhance feature discriminativeness and class separability, and a class-wise feature alignment mechanism to bring noisy data closer to clean counterparts. Theoretical analysis shows that improving feature stability under noise reduces the curvature of the softmax loss landscape, enhancing robustness. Experiments on standard benchmarks and a custom dataset demonstrate that the proposed method significantly improves model robustness to various perturbations while maintaining high accuracy on clean data.
论文旨在解决训练深度神经网络以抵抗输入噪声的同时保持在干净数据上的准确性。提出了一种训练框架,包含两个目标:在倒数第二层使用区分性损失函数以增强特征的区分性和类别可分性,以及一种类别特征对齐机制以使噪声数据更接近干净数据。理论分析表明,通过噪声提高特征稳定性可以减少softmax损失景观的曲率,从而增强鲁棒性。实验结果表明,该方法在标准基准和自定义数据集上显著提高了模型对各种扰动的鲁棒性,同时保持了在干净数据上的高准确性。
Improved Accuracy for Private Continual Cardinality Estimation in Fully Dynamic Streams via Matrix Factorization
Authors: Joel Daniel Andersson, Palak Jain, Satchit Sivakumar
First: 2026-01-05T16:36:59+00:00 · Latest: 2026-01-05T16:36:59+00:00
Abstract
We study differentially-private statistics in the fully dynamic continual observation model, where many updates can arrive at each time step and updates to a stream can involve both insertions and deletions of an item. Earlier work (e.g., Jain et al., NeurIPS 2023 for counting distinct elements; Raskhodnikova & Steiner, PODS 2025 for triangle counting with edge updates) reduced the respective cardinality estimation problem to continual counting on the difference stream associated with the true function values on the input stream. In such reductions, a change in the original stream can cause many changes in the difference stream, which poses a challenge for applying private continual counting algorithms to obtain optimal error bounds. We improve the accuracy of several such reductions by studying the associated $\ell_p$-sensitivity vectors of the resulting difference streams and isolating their properties.
We demonstrate that our framework gives improved bounds for counting distinct elements, estimating degree histograms, and estimating triangle counts (under a slightly relaxed privacy model), thus offering a general approach to private continual cardinality estimation in streaming settings. Our improved accuracy stems from tight analysis of known factorization mechanisms for the counting matrix in this setting; the key technical challenge is arguing that one can use state-of-the-art factorizations for sensitivity vector sets with the properties we isolate. Empirically and analytically, we demonstrate that our improved error bounds offer a substantial improvement in accuracy for cardinality estimation problems over a large range of parameters.
中文标题/摘要
标题:通过矩阵分解提高全动态流中私有持续基数估计的准确性
我们研究了在完全动态持续观测模型中的差分隐私统计,其中每个时间步可以接收到许多更新,流中的更新可以涉及项目插入和删除。早期的工作(例如,Jain等人,NeurIPS 2023用于计数不同元素;Raskhodnikova & Steiner,PODS 2025用于边更新的三角计数)将相应的基数估计问题归结为与输入流的真实函数值相关的差分流上的持续计数。在这些归约中,原始流中的变化可能导致差分流中的许多变化,这为应用私有持续计数算法以获得最优误差界带来了挑战。我们通过研究所得差分流相关的$\ell_p$敏感度向量并提炼其性质,提高了几种此类归约的准确性。我们展示了我们的框架为计数不同元素、估计度直方图和估计三角计数(在稍微放宽的隐私模型下)提供了改进的界,从而提供了一种在流环境中进行私有持续基数估计的一般方法。我们改进的准确性源自对这种设置下计数矩阵的已知分解机制的精确分析;关键的技术挑战在于论证:对于具有我们所提炼性质的敏感度向量集,可以使用最先进的分解方法。从理论上和实验上,我们证明了我们改进的误差界在参数的广泛范围内提供了显著的准确性改进。
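The "factorization mechanisms for the counting matrix" in the continual cardinality estimation entry can be illustrated on a small case. For a stream of length n, continual counting releases every prefix sum, that is Ax with A the lower-triangular all-ones matrix. A factorization mechanism writes A = LR, adds noise to Rx, and multiplies by L, so the noise needed depends on the column norms of R and the error depends on the row norms of L. The sketch below uses one well-known choice, the square-root factorization L = R = A^{1/2}, whose entries are the power-series coefficients of (1 − x)^{-1/2}. Whether this is the factorization analyzed in the paper is an assumption, and no noise is added here; the code only checks the algebra exactly with rationals.

```python
from fractions import Fraction
from math import comb


def sqrt_coefficients(n):
    """First n power-series coefficients of (1 - x)^(-1/2): C(2k, k) / 4^k."""
    return [Fraction(comb(2 * k, k), 4 ** k) for k in range(n)]


def lower_toeplitz(coeffs):
    """Lower-triangular Toeplitz matrix with `coeffs` down the first column."""
    n = len(coeffs)
    return [[coeffs[i - j] if i >= j else Fraction(0) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def counting_matrix(n):
    """Lower-triangular all-ones matrix: row t sums the first t+1 updates."""
    return [[Fraction(1) if i >= j else Fraction(0) for j in range(n)]
            for i in range(n)]


def max_column_norm_squared(matrix):
    n = len(matrix)
    return max(sum(matrix[i][j] ** 2 for i in range(n)) for j in range(n))


n = 6
R = lower_toeplitz(sqrt_coefficients(n))
print(matmul(R, R) == counting_matrix(n))
print(max_column_norm_squared(R))
```

The squared column norm of R grows slowly with n, which is why factorizations of this kind give much smaller error than adding independent noise to every prefix sum.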
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Authors: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
First: 2026-01-05T16:36:40+00:00 · Latest: 2026-01-05T16:36:40+00:00
Comments: Project page: https://github.com/ByteVisionLab/NextFlow
Abstract
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
中文标题/摘要
标题:VAR RL 正确实现:解决视觉自回归生成中的异步策略冲突
视觉生成主要由三种范式主导:自回归(AR)、扩散和视觉自回归(VAR)模型。与AR和扩散不同,VAR在生成步骤中处理异构输入结构,这导致了严重的异步策略冲突。在强化学习(RL)场景中,这一问题尤为严重,导致训练不稳定和次优对齐。为了解决这一问题,我们提出了一种新的框架来增强组相对策略优化(GRPO),通过明确管理这些冲突。我们的方法整合了三个协同作用的组件:1)一个稳定化的中间奖励,以引导早期生成;2)一种动态时间步重权方案,以精确分配信用;3)一种源自奖励反馈学习(ReFL)原则的新颖掩码传播算法,旨在在空间和时间上隔离优化效果。我们的方法在样本质量和目标对齐方面显著优于vanilla GRPO基线,使VAR模型的稳健和有效优化成为可能。
Summary / 总结
The paper addresses the challenge of asynchronous policy conflicts in Visual AutoRegressive (VAR) models, which are crucial for visual generation tasks. It proposes a framework to enhance Group Relative Policy Optimization (GRPO) by introducing a stabilizing intermediate reward, a dynamic time-step reweighting scheme, and a mask propagation algorithm. The method improves sample quality and objective alignment compared to the vanilla GRPO baseline, making VAR models more robust and effective for reinforcement learning scenarios.
论文解决了Visual AutoRegressive (VAR)模型中异步策略冲突的问题,特别是在强化学习场景中更为突出。提出了一种增强Group Relative Policy Optimization (GRPO)的方法,通过引入稳定中间奖励、动态时间步重权方案和基于Reward Feedback Learning (ReFL)原理的掩码传播算法。该方法在样本质量和目标对齐方面显著优于vanilla GRPO基线,使VAR模型在视觉生成任务中更加稳健和有效。
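The vanilla GRPO baseline that the VAR RL entry builds on scores each sample relative to the other samples generated for the same prompt. A minimal sketch of that group-relative advantage is given below, assuming the common form A_i = (r_i − mean(r)) / (std(r) + eps). The population standard deviation and the `eps` value are choices made here, and implementations differ on both. The paper's intermediate reward, time-step reweighting, and mask propagation are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four images generated from one prompt (made-up numbers).
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))

# A group with identical rewards carries no learning signal.
print(group_relative_advantages([0.7, 0.7, 0.7]))
```

Because the advantage is a single scalar per sample, every generation step of that sample receives the same credit, which is the setting in which step-dependent reweighting becomes relevant.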
SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection
Authors: Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuhan Liu, Zongxu Pan, Yuxin Hu
First: 2026-01-05T16:31:41+00:00 · Latest: 2026-01-05T16:31:41+00:00
Abstract
Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.
中文标题/摘要
标题:SLGNet:结合结构先验和语言引导调制的多模态物体检测
利用RGB和红外(IR)图像进行多模态物体检测对于全天候场景中的稳健感知至关重要。虽然基于适配器的方法能够高效地将RGB预训练的基础模型迁移到该任务上,但它们往往以牺牲跨模态结构一致性为代价来追求模型效率。因此,在高对比度或夜间环境中出现显著领域差距时,关键的结构线索经常被丢失。此外,传统的静态多模态融合机制通常缺乏环境意识,导致在复杂、动态场景变化下适应不足,检测性能受限。为了解决这些局限性,我们提出了一种参数高效的SLGNet框架,该框架在冻结的Vision Transformer(ViT)基础模型中结合了层次结构先验和语言引导调制。具体而言,我们设计了一种结构感知适配器,从两种模态中提取层次结构表示,并动态注入到ViT中,以补偿ViT基础架构固有的结构退化。此外,我们还提出了一种语言引导调制模块,利用VLM驱动的结构化描述来动态重新校准视觉特征,从而赋予模型强大的环境意识。在LLVIP、FLIR、KAIST和DroneVehicle数据集上的广泛实验表明,SLGNet建立了新的性能基准。值得注意的是,在LLVIP基准测试中,我们的方法实现了66.1的mAP,同时将可训练参数减少了约87%与传统的全量微调相比。这证实了SLGNet作为多模态感知的稳健和高效解决方案的有效性。
Summary / 总结
SLGNet is designed to improve multimodal object detection using RGB and IR images by addressing the limitations of adapter-based approaches. It introduces a Structure-Aware Adapter to extract hierarchical structural representations and inject them into a frozen ViT, and a Language-Guided Modulation module to dynamically recalibrate visual features based on structured captions. Experiments show that SLGNet outperforms existing methods, achieving an mAP of 66.1 on the LLVIP benchmark with significantly reduced parameters.
SLGNet 旨在通过解决适配器方法优先考虑模型效率而非跨模态结构一致性的问题,来提升利用 RGB 和 IR 图像的多模态目标检测。它引入了结构感知适配器和语言引导调制模块,以增强结构表示和环境意识。实验表明,SLGNet 在 LLVIP 基准上的 mAP 达到 66.1,且与全量微调相比,可减少约 87% 的可训练参数,证明其是一个稳健且高效的多模态感知解决方案。
TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos
Authors: Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
First: 2024-11-04T20:03:06+00:00 · Latest: 2026-01-05T16:20:54+00:00
Abstract
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online.
We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones.
Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios.
Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications.
中文标题/摘要
标题:TI-PREGO:用于程序性自我中心视频在线错误检测的思维链与上下文学习
从自我中心视频中在线识别程序性错误是一项在制造、医疗保健和技能训练等多个领域都至关重要的但极具挑战性的任务。这类错误的性质是开放集的,因为可能会出现未预见或新颖的错误,因此需要不依赖于失败先例的稳健检测系统。然而,目前没有任何技术能够有效地在线检测开放集的程序性错误。
我们提出了一种双分支架构来在线解决这一问题:一个分支持续从输入的自我中心视频中进行步骤识别,而另一个分支则基于识别模块的输出预测未来步骤。错误被检测为当前识别的动作与预测模块预测的动作之间的不匹配。识别分支接收输入帧,预测当前动作,并将帧级结果聚合为动作令牌。预测分支特别利用大型语言模型(LLMs)的强大模式匹配能力,基于先前预测的动作来预测动作令牌。
鉴于任务的在线性质,我们还详细评估了每帧评估所面临的困难,特别是动态在线场景中对准确和及时预测的需求。
在两个程序性数据集上的广泛实验展示了利用双分支架构进行错误检测的挑战和机遇,展示了我们提出方法的有效性。在包括识别和预测变体及最新模型的全面评估中,我们的方法在在线应用中的稳健性和有效性得到了验证。
Summary / 总结
The paper proposes TI-PREGO, a dual-branch architecture for online detection of procedural errors in egocentric videos. The system continuously recognizes current actions and predicts future actions, detecting errors as mismatches. Experiments on two datasets show the method's effectiveness and robustness in online applications, addressing the challenge of open-set errors without relying on prior examples of failure.
该论文提出了TI-PREGO,一种用于在线检测egocentric视频中程序性错误的双分支架构。它使用识别分支连续识别步骤,并使用预测分支基于先前预测的步骤进行预测,错误通过识别当前步骤与预测步骤之间的不匹配来检测。在两个数据集上的实验显示了该方法在各种领域如制造和医疗保健中的有效性和鲁棒性,解决了开放集错误的挑战。
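As an illustration of the TI-PREGO entry above, the mismatch rule (recognized step versus anticipated step) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper uses a video recognition model and an LLM for anticipation, whereas here `toy_anticipate` is a hypothetical lookup table, and the step names and the helpers `aggregate_frames` and `detect_mistakes` are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def aggregate_frames(frame_labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame-level labels into action tokens."""
    tokens: List[str] = []
    for label in frame_labels:
        if not tokens or tokens[-1] != label:
            tokens.append(label)
    return tokens


def detect_mistakes(
    frame_labels: Sequence[str],
    anticipate: Callable[[List[str]], str],
) -> List[Tuple[str, str, bool]]:
    """Compare each recognized token with the step anticipated from its history.

    Returns (recognized, anticipated, is_mistake) for every token after the first.
    """
    tokens = aggregate_frames(frame_labels)
    results = []
    for i in range(1, len(tokens)):
        expected = anticipate(tokens[:i])  # uses only past tokens (online setting)
        results.append((tokens[i], expected, tokens[i] != expected))
    return results


# Hypothetical stand-in for the anticipation branch (an LLM in the paper).
TOY_NEXT_STEP = {"pick_screw": "insert_screw", "insert_screw": "tighten_screw"}


def toy_anticipate(history: List[str]) -> str:
    return TOY_NEXT_STEP.get(history[-1], "unknown")


frames = ["pick_screw", "pick_screw", "insert_screw", "insert_screw", "attach_wheel"]
report = detect_mistakes(frames, toy_anticipate)
for recognized, anticipated, is_mistake in report:
    print(recognized, anticipated, is_mistake)
```

Because the check needs no examples of failure, any step that deviates from the anticipated one is flagged, which is what makes the setting open-set.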
VIBE: Visual Instruction Based Editor
Authors: Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich
First: 2026-01-05T16:17:20+00:00 · Latest: 2026-01-05T16:17:20+00:00
Abstract
Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
中文标题/摘要
标题:VIBE:基于视觉指令的编辑器
基于指令的图像编辑是生成式AI发展最快的领域之一。在过去一年中,该领域达到了新的水平,发布了数十个开源模型,同时伴随高度能力的商业系统。然而,目前只有有限数量的开源方法能够达到实际应用的质量。此外,这些管道中的主导选择——扩散骨干网络,通常很大且在许多部署和研究设置中计算成本高昂,广泛使用的变体通常包含60亿到200亿个参数。本文提出了一种紧凑、高通量的基于指令的图像编辑管道,使用一个现代的20亿参数Qwen3-VL模型来指导编辑过程,并使用一个16亿参数的扩散模型Sana1.5进行图像生成。我们的设计决策涵盖了架构、数据处理、训练配置和评估,旨在实现低成本推理和严格的源一致性,同时在这一规模下保持高质量。在ImgEdit和GEdit基准测试上,所提出的方法在性能上与更重的基线模型相当或超过,包括具有数倍参数和更高推理成本的模型,特别是在需要保留输入图像的编辑方面表现尤为出色,如属性调整、对象移除、背景编辑和目标替换。该模型可以容纳在24 GB的GPU内存中,并在NVIDIA H100上以BF16格式生成最高2K分辨率的编辑图像,大约需要4秒,无需额外的推理优化或蒸馏。
Summary / 总结
The paper introduces VIBE, a compact instruction-based image editing pipeline using a 2B-parameter Qwen3-VL model and a 1.6B-parameter Sana1.5 diffusion model. It achieves high-quality edits across major categories while maintaining low computational cost and strict source consistency. On benchmark tests, VIBE matches or outperforms heavier models, especially in preserving input image details during edits.
该论文介绍了VIBE,一种使用2B参数的Qwen3-VL模型和1.6B参数的Sana1.5扩散模型的紧凑型指令驱动图像编辑管道。该方法在ImgEdit和GEdit等基准测试中达到了与更大模型相当的性能,特别是在保留输入图像的编辑(如属性调整和对象移除)方面表现尤为出色。该系统旨在实现低成本推理,并在有限的GPU内存和计算资源下保持高质量。
Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning
Authors: Wenlong Tang
First: 2025-11-28T23:36:45+00:00 · Latest: 2026-01-05T16:14:39+00:00
Comments: 17 pages, 5 figures. Code available at https://github.com/wltang-dev/Latent-Strategy-RL-Agent
Abstract
This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model's parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual-loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text.
Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long-horizon multi-round interactions. Experiments show that agents' latent spaces exhibit clear convergence trajectories under reflection-driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low-cost, scalable, and interpretable form of abstract strategic representation.
中文标题/摘要
标题:学习多智能体语言系统中的演变潜在策略无需模型微调
本研究提出了一种多智能体语言框架,能够在不微调语言模型参数的情况下实现持续策略演变。核心思想是使抽象概念的潜在向量脱离传统的静态语义表示,通过环境交互和强化反馈不断更新。我们构建了一个双环架构:行为环根据环境奖励调整行动偏好,而语言环通过反思生成文本的语义嵌入来更新外部潜在向量。这些机制共同使智能体能够在长时间多轮交互中发展出稳定且分离的策略风格。实验表明,在反思驱动的更新下,智能体的潜在空间表现出清晰的收敛轨迹,并在关键时刻出现结构化的转变。此外,该系统展示了隐式推断和持续适应情感智能体的新兴能力,即使没有共享奖励也是如此。这些结果表明,在不修改模型参数的情况下,外部潜在空间可以为语言智能体提供一种低成本、可扩展且可解释的抽象策略表示。
Summary / 总结
This study introduces a multi-agent language framework that enables continuous strategy evolution without fine-tuning the model's parameters. The framework uses a dual-loop architecture where the behavior loop adjusts actions based on environmental rewards, and the language loop updates latent vectors by reflecting on semantic embeddings of generated text. Experiments show clear convergence trajectories in agents' latent spaces and structured shifts at critical moments, indicating the system's ability to develop stable strategic styles and adapt to emotional agents without shared rewards.
该研究提出了一种多智能体语言框架,能够在不微调模型参数的情况下使智能体持续进化其策略。框架通过交互和强化反馈更新潜在向量,导致潜在空间的清晰收敛轨迹和结构化转变。智能体还能够隐式推断并适应情绪智能体,而无需共享奖励,展示了该方法的有效性。
Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training
Authors: Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
First: 2025-07-02T14:29:30+00:00 · Latest: 2026-01-05T16:10:09+00:00
Abstract
Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for privacy, robustness to data poisoning attacks, and extraction attacks. In experiments with LLMs, we demonstrate empirically that black-box optimization methods, despite the scalability and computational challenges inherent to black-box approaches, are able to learn, showing how a few iterations of BBoxER improve performance, generalize well on a benchmark of reasoning datasets, and are robust to membership inference attacks. This positions BBoxER as an attractive add-on on top of gradient-based optimization, offering suitability for deployment in restricted or privacy-sensitive environments while also providing non-vacuous generalization guarantees.
中文标题/摘要
标题:无需窥探的调优:可证明的泛化界与鲁棒LLM后训练
基于梯度的优化是深度学习的核心,通过反向传播提供高效可扩展的训练。然而,在训练过程中暴露梯度可能会泄露底层数据的敏感信息,引发隐私和安全问题,如数据投毒攻击。相比之下,黑盒优化方法将模型视为不透明的函数,仅依赖于函数评估来引导优化,这在数据访问受限、对抗风险高或过拟合是问题的情况下提供了有希望的替代方案。本文介绍了一种名为BBoxER的进化黑盒后训练方法,通过隐式压缩训练数据来诱导信息瓶颈。利用信息流的可计算性,我们提供了非空泛化界,并为隐私、数据投毒攻击和提取攻击提供了强大的理论保证。在LLM实验中,我们实证展示了黑盒优化方法,尽管黑盒方法固有的可扩展性和计算挑战,仍能够学习,表明BBoxER的几次迭代如何提高性能、在推理数据集基准上泛化良好,并对成员推理攻击具有鲁棒性。这将BBoxER定位为梯度优化的有吸引力的补充,适用于受限或隐私敏感环境,同时提供非空泛化保证。
Summary / 总结
This paper addresses the privacy and security concerns of gradient-based optimization by introducing BBoxER, an evolutionary black-box method for LLM post-training. BBoxER compresses training data implicitly to induce an information bottleneck, providing non-vacuous generalization bounds and strong theoretical guarantees. Experiments show that BBoxER improves performance, generalizes well on reasoning datasets, and is robust to membership inference attacks, making it suitable for restricted or privacy-sensitive environments.
该论文通过提出BBoxER,一种用于大型语言模型后训练的进化黑盒优化方法,解决了深度学习中的数据隐私问题。它提供了非空泛化的泛化界和隐私及鲁棒性的理论保证。实验表明,BBoxER在性能上有所提升,在推理数据集上泛化良好,并且对成员推断攻击具有鲁棒性,使其成为在受限或隐私敏感环境中替代梯度优化的合适替代方案。
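To make the black-box idea in the BBoxER entry above concrete, here is a minimal sketch of an evolutionary optimizer that only ever sees function evaluations, never gradients. It is a generic (1+1) evolution strategy, swapped in for illustration; the abstract does not specify BBoxER's actual algorithm, and `one_plus_one_es`, `sphere`, and all parameter values are assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple


def one_plus_one_es(
    loss: Callable[[List[float]], float],
    x0: List[float],
    steps: int = 200,
    sigma: float = 0.3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Generic (1+1) evolution strategy: mutate, evaluate, keep the better point."""
    rng = random.Random(seed)
    best = list(x0)
    best_loss = loss(best)
    for _ in range(steps):
        # Gaussian mutation of the current best parameters.
        candidate = [v + rng.gauss(0.0, sigma) for v in best]
        candidate_loss = loss(candidate)
        # Elitist selection: only one comparison result flows back per step.
        if candidate_loss <= best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss


def sphere(x: List[float]) -> float:
    """Toy objective standing in for a model's evaluation score."""
    return sum(v * v for v in x)


start = [2.0, -1.5, 0.5]
solution, final_loss = one_plus_one_es(sphere, start)
print(final_loss <= sphere(start))
```

Each step passes back only the outcome of a single comparison, which gives an intuition for why a small number of iterations can act as an information bottleneck on the training data.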
ELLA: Efficient Lifelong Learning for Adapters in Large Language Models
Authors: Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy
First: 2026-01-05T15:58:08+00:00 · Latest: 2026-01-05T15:58:08+00:00
Abstract
Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model's zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.
中文标题/摘要
标题:ELLA:大型语言模型适配器的高效终身学习框架
在连续学习(CL)设置中,大型语言模型(LLMs)在顺序适应新任务时遭受严重的灾难性遗忘。现有方法本质上是有限制的:基于重放的方法不切实际且侵犯隐私,而严格的正交性方法在规模下失效:每个新任务都被投影到正交补空间中,逐渐减少剩余自由度并禁止共享表示的重叠,从而阻止前向迁移。在本文中,我们提出了ELLA,一种基于选择性子空间去相关原理的训练框架。ELLA 不是完全禁止重叠,而是明确地表征了过去更新的结构,并惩罚沿其高能、任务特定方向的对齐,同时保留低能剩余子空间的自由度以实现迁移。形式上,这通过单个聚合更新矩阵上的轻量级正则化器实现。我们证明这种机制对应于一个各向异性收缩算子,该算子限制干扰,产生一个与任务序列长度无关的内存和计算常数的惩罚。ELLA 不需要数据重放,不需要架构扩展,且存储需求微乎其微。实验上,它在三个流行的基准测试中实现了最先进的CL性能,相对准确度增益高达9.6%,且内存占用小35倍。此外,ELLA 能够稳健地跨架构扩展,并积极增强模型在未见任务上的零样本泛化性能,从而建立了一个原理上和可扩展的终身LLM适配解决方案。
Summary / 总结
ELLA is a training framework designed to address catastrophic forgetting in large language models (LLMs) during continual learning. It introduces selective subspace de-correlation, penalizing alignments along high-energy task-specific directions while preserving freedom in low-energy subspaces. Empirically, ELLA outperforms existing methods on three benchmarks, achieving up to 9.6% relative accuracy gains and a 35x smaller memory footprint without requiring data replay or architectural expansion.
ELLA 是一种训练框架,旨在解决大型语言模型(LLMs)在连续学习过程中出现的灾难性遗忘问题。它引入了选择性子空间去相关,对过去更新的高能方向进行惩罚,同时在低能子空间中保留自由度,以促进迁移。实验表明,ELLA 在三个基准测试中优于现有方法,相对准确度提升高达 9.6%,且内存占用仅为原来的 35 分之一。它不需要数据重放或架构扩展,并且能够增强模型在未见过的任务上的零样本泛化能力。
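The regularizer described in the ELLA entry above can be pictured with a small sketch. The exact form of the penalty is not given in the abstract, so the element-wise product used here is an assumption chosen only to show the behavior described: updates aligned with high-energy entries of the aggregated past update are penalized strongly, while low-energy directions stay almost free. The names `alignment_penalty` and `accumulate` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def alignment_penalty(past_update: Matrix, new_update: Matrix, lam: float = 1.0) -> float:
    """Assumed penalty: lam * sum_ij (past_ij * new_ij)^2.

    Entries where the aggregated past update is large (high energy) make
    overlapping new updates expensive; near-zero entries cost almost nothing.
    """
    total = 0.0
    for past_row, new_row in zip(past_update, new_update):
        for p, d in zip(past_row, new_row):
            total += (p * d) ** 2
    return lam * total


def accumulate(past_update: Matrix, new_update: Matrix) -> Matrix:
    """Fold a finished task's update into the single aggregated matrix (assumed: plain sum)."""
    return [[p + d for p, d in zip(pr, dr)] for pr, dr in zip(past_update, new_update)]


past = [[2.0, 0.0], [0.0, 0.1]]              # aggregated update from earlier tasks
high_energy_step = [[1.0, 0.0], [0.0, 0.0]]  # overlaps the dominant past direction
low_energy_step = [[0.0, 0.0], [0.0, 1.0]]   # lies in a low-energy residual entry

penalty_high = alignment_penalty(past, high_energy_step)
penalty_low = alignment_penalty(past, low_energy_step)
print(penalty_high > penalty_low)
```

Since only one aggregated matrix is stored, the cost of evaluating the penalty does not grow with the number of past tasks, which matches the constant memory and compute claim in the abstract.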
FMVP: Masked Flow Matching for Adversarial Video Purification
Authors: Duoxun Tang, Xueyi Zhang, Chak Hin Wang, Xi Xiao, Dasen Dai, Xinhang Jiang, Wentao Shi, Rui Li, Qing Li
First: 2026-01-05T15:55:46+00:00 · Latest: 2026-01-05T15:55:46+00:00
Abstract
Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification (FMVP). FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.
中文标题/摘要
标题:FMVP:蒙面流匹配对抗视频净化
视频识别模型仍然容易受到对抗攻击的影响,而现有的基于扩散的过程净化方法则面临采样效率低和轨迹弯曲的问题。直接从对抗输入中回归干净的视频往往由于扰动的微妙性质而无法恢复忠实的内容;因此,需要物理上破坏对抗结构。因此,我们提出了对抗视频净化的流匹配方法(FMVP)。FMVP通过掩码策略物理上破坏全局对抗结构,并使用条件流匹配(CFM)和修复目标重建干净的视频动态。为了进一步将语义内容与对抗噪声解耦,我们设计了一种频域门控损失(FGL),该损失明确抑制高频对抗残差同时保留低频保真度。我们设计了对抗感知和通用训练范式分别处理已知和未知威胁。在UCF-101和HMDB-51上的广泛实验表明,FMVP在对抗投影梯度下降(PGD)和对抗连续性(CW)攻击中均表现出色,其稳健准确率分别超过87%和89%。此外,FMVP在适应性攻击(DiffHammer)中表现出优越的稳健性,并作为零样本对抗检测器运行,对PGD攻击的检测准确率为98%,对高度不可感知的CW攻击的检测准确率为79%。
Summary / 总结
FMVP is designed to purify adversarial videos by physically shattering global adversarial structures using a masking strategy and reconstructing clean video dynamics with Conditional Flow Matching and an inpainting objective. It also employs a Frequency-Gated Loss to suppress high-frequency adversarial residuals. Experimental results show that FMVP outperforms existing methods such as DiffPure and FlowPure, achieving robust accuracy over 87% against PGD and 89% against CW attacks. It is also more robust against adaptive attacks and can serve as a zero-shot adversarial detector.
FMVP 通过使用掩码策略物理地破坏全局对抗结构,并结合条件流匹配重建干净的视频动态来净化对抗视频。它还包含一个频率门控损失,以抑制高频对抗残差。实验结果表明,FMVP 在对抗 PGD 和 CW 攻击中的鲁棒准确性分别超过 87% 和 89%,优于现有方法如 DiffPure 和 FlowPure。此外,FMVP 对抗适应性攻击具有优越的鲁棒性,并且可以作为零样本对抗检测器,对 PGD 攻击的检测准确率为 98%,对高度不可感知的 CW 攻击的检测准确率为 79%。
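The Frequency-Gated Loss mentioned in the FMVP entry above is not specified in the abstract. The sketch below shows one plausible reading, under loudly stated assumptions: the residual between prediction and target is moved to the frequency domain, and high-frequency bins are weighted more heavily than low-frequency ones. The 1-D signal, the hard gate, the `cutoff` and weight values, and the function names are all hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_gated_loss(
    pred: Sequence[float],
    target: Sequence[float],
    cutoff: float = 0.5,
    low_weight: float = 1.0,
    high_weight: float = 4.0,
) -> float:
    """Assumed form: weighted spectral energy of the residual, gated by frequency."""
    n = len(pred)
    residual = [p - t for p, t in zip(pred, target)]
    loss = 0.0
    for k, coeff in enumerate(dft(residual)):
        # Normalized frequency in [0, 1]; bins k and n - k share a frequency.
        freq = min(k, n - k) / (n / 2)
        weight = high_weight if freq > cutoff else low_weight
        loss += weight * abs(coeff) ** 2
    return loss / (n * n)


target = [0.0, 0.0, 0.0, 0.0]
smooth_error = [1.0, 1.0, 1.0, 1.0]        # pure low-frequency (DC) residual
flicker_error = [1.0, -1.0, 1.0, -1.0]     # pure high-frequency residual

loss_smooth = frequency_gated_loss(smooth_error, target)
loss_flicker = frequency_gated_loss(flicker_error, target)
print(loss_flicker > loss_smooth)
```

Both residuals have the same energy, but the gate makes the high-frequency one cost more, which is the qualitative effect the abstract attributes to the loss.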
CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications
Authors: Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang
First: 2025-12-17T09:31:57+00:00 · Latest: 2026-01-05T15:48:10+00:00
Abstract
The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows--from data preprocessing to advanced interpretation--across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first and most comprehensive validation in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by turning expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).
中文标题/摘要
标题:沧灵-KnowFlow:统一的知识与流程融合智能代理在综合遥感应用中的应用
地球观测(EO)中,大规模遥感(RS)数据集的自动化和智能化处理至关重要。现有的自动化系统通常针对特定任务,缺乏一个统一的框架来管理从数据预处理到高级解释的多样化、端到端的工作流。为解决这一问题,本文介绍了沧灵-KnowFlow,这是一种统一的智能代理框架,集成了过程知识库(PKB)、动态工作流调整和进化记忆模块。PKB 包含了 1,008 个专家验证的工作流案例,覆盖 162 项实际 RS 任务,指导规划并显著减少了通用代理中常见的幻觉。在运行时失败时,动态工作流调整能够自主诊断并重新规划恢复策略,而进化记忆模块则不断从这些事件中学习,逐步提升代理的知识和性能。这种协同作用使沧灵-KnowFlow 能够在多样且复杂的任务中适应、学习并可靠地运行。我们使用 KnowFlow-Bench 这一新型基准测试了沧灵-KnowFlow,该基准包含 324 个工作流,灵感来源于实际应用,并在 13 个顶级大型语言模型(LLM)的后端上测试了其性能,从开源到商用。在所有复杂任务中,沧灵-KnowFlow 在任务成功率上至少比 Reflexion 基准高出 4%。作为该新兴领域中第一个最全面的验证,这项研究展示了沧灵-KnowFlow 作为一种利用专家知识(Knowledge)转化为适应性和可验证流程(Flow)的稳健、高效和可扩展的自动化解决方案的巨大潜力。
Summary / 总结
CangLing-KnowFlow is a unified intelligent agent framework designed to handle diverse remote sensing tasks from data preprocessing to advanced interpretation. It integrates a Procedural Knowledge Base, Dynamic Workflow Adjustment, and an Evolutionary Memory Module to reduce hallucinations and improve performance. Evaluated on the KnowFlow-Bench, CangLing-KnowFlow outperformed the Reflexion baseline by at least 4% in Task Success Rate across all complex tasks, demonstrating its robustness and efficiency in Earth observation challenges.
CangLing-KnowFlow 是一个统一的智能代理框架,旨在自动化和提升遥感数据集的处理。它整合了过程知识库、动态工作流调整和进化记忆模块,以管理从数据预处理到高级解释的多样化工作流。该框架在包含324个工作流的新型基准测试中进行了评估,并在所有复杂任务中的任务成功率上至少比 Reflexion 基准提高了4%。这表明 CangLing-KnowFlow 作为地球观测挑战的稳健、高效和可扩展解决方案的巨大潜力。
PrevMatch: Revisiting and Maximizing Temporal Knowledge in Semi-Supervised Semantic Segmentation
Authors: Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Juan Yun, Se Hong Park, Sung Won Han
Venue: WACV 2026
First: 2024-05-31T03:54:59+00:00 · Latest: 2026-01-05T15:45:20+00:00
Comments: To appear in WACV 2026. Code: https://github.com/wooseok-shin/PrevMatch
Abstract
In semi-supervised semantic segmentation, the Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. However, despite their high performance, these approaches frequently involve complex training pipelines and a substantial computational burden, limiting the scalability and compatibility of these methods. In this paper, we propose a PrevMatch framework that effectively mitigates the aforementioned limitations by maximizing the utilization of the temporal knowledge obtained during the training process. The PrevMatch framework relies on two core strategies: (1) we reconsider the use of temporal knowledge and thus directly utilize previous models obtained during training to generate additional pseudo-label guidance, referred to as previous guidance; and (2) we design a highly randomized ensemble strategy to maximize the effectiveness of the previous guidance. PrevMatch, a simple yet effective plug-in method, can be seamlessly integrated into existing semi-supervised learning frameworks with minimal computational overhead. Experimental results on three benchmark semantic segmentation datasets show that incorporating PrevMatch into existing methods significantly improves their performance. Furthermore, our analysis indicates that PrevMatch facilitates stable optimization during training, resulting in improved generalization performance.
中文标题/摘要
标题:PrevMatch:重新审视并最大化半监督语义分割中的时间知识
在半监督语义分割中,采用Mean Teacher-和协同训练的方法来缓解确认偏差和耦合问题。然而,尽管这些方法性能很高,但它们通常涉及复杂的训练管道和大量的计算负担,限制了这些方法的可扩展性和兼容性。本文提出了一种PrevMatch框架,通过最大化利用训练过程中获得的时间知识来有效缓解上述限制。PrevMatch框架依赖于两种核心策略:(1) 我们重新考虑时间知识的使用,直接利用训练过程中获得的先前模型生成额外的伪标签指导,称为先前指导。(2) 我们设计了一种高度随机化的集成策略,以最大化先前指导的有效性。PrevMatch是一种简单而有效的插件方法,可以无缝集成到现有的半监督学习框架中,且计算开销较小。在三个基准语义分割数据集上的实验结果表明,将PrevMatch集成到现有方法中可以显著提高其性能。此外,我们的分析表明,PrevMatch有助于训练过程中的稳定优化,从而提高泛化性能。
Summary / 总结
The paper proposes PrevMatch, a framework that maximizes the use of temporal knowledge in semi-supervised semantic segmentation by reusing previous models to generate pseudo-label guidance and employing a randomized ensemble strategy. This approach simplifies the training process and enhances the performance of existing methods on three benchmark datasets, leading to better generalization and optimization stability.
论文针对Mean Teacher-和co-training方法在半监督语义分割中的复杂训练流程和高计算成本问题,提出了PrevMatch框架,通过利用训练过程中获得的先前模型生成伪标签来最大化时间知识的使用,并采用随机化集成策略增强这些标签的效果。实验结果表明,将PrevMatch应用于现有方法可以显著提升性能,并促进训练过程中的稳定优化,从而提高泛化能力。
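The two strategies in the PrevMatch entry above (reusing earlier checkpoints for pseudo-label guidance, and combining them with a randomized ensemble) can be sketched as follows. This is an illustrative sketch, not the released code: the way checkpoints are sampled and weighted is an assumption, the "models" are hypothetical functions returning class probabilities for one pixel, and `previous_guidance` and `pseudo_label` are names invented here.

```python
import random
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]


def previous_guidance(prev_models: Sequence[Model], x: Sequence[float],
                      rng: random.Random) -> List[float]:
    """Randomized ensemble over previously saved models (assumed scheme).

    Picks a random subset of checkpoints and averages their class
    probabilities with random positive weights.
    """
    k = rng.randint(1, len(prev_models))
    chosen = rng.sample(list(prev_models), k)
    weights = [rng.uniform(0.1, 1.0) for _ in chosen]
    total = sum(weights)
    probs_list = [model(x) for model in chosen]
    n_classes = len(probs_list[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, probs_list)) / total
        for c in range(n_classes)
    ]


def pseudo_label(probs: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical checkpoints saved at different points of training.
checkpoints: List[Model] = [
    lambda x: [0.2, 0.8],
    lambda x: [0.4, 0.6],
    lambda x: [0.3, 0.7],
]

guidance = previous_guidance(checkpoints, x=[0.0], rng=random.Random(0))
label = pseudo_label(guidance)
print(label)
```

Because the checkpoints already exist as a by-product of training, the extra cost is limited to a few forward passes, which is consistent with the low overhead claimed in the abstract.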
FaithLens: Detecting and Explaining Faithfulness Hallucination
Authors: Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
First: 2025-12-23T09:20:32+00:00 · Latest: 2026-01-05T15:43:00+00:00
Abstract
Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
中文标题/摘要
标题:FaithLens:检测和解释忠实性幻觉
识别大型语言模型(LLMs)输出中是否包含忠实性幻觉对于实际应用至关重要,例如检索增强生成和总结。在本文中,我们介绍了FaithLens,一种成本效益高且有效的忠实性幻觉检测模型,能够同时提供二元预测和相应的解释以提高可信度。为此,我们首先通过高级LLM合成带有解释的训练数据,并应用明确的数据过滤策略以确保标签准确性、解释质量和数据多样性。随后,我们在此精心整理的训练数据上进行模型微调作为冷启动,并进一步使用基于规则的强化学习对其进行优化,使用预测正确性和解释质量的奖励。在12个不同任务上的结果显示,8B参数的FaithLens优于GPT-4.1和o3等先进模型。此外,FaithLens能够生成高质量的解释,提供可信度、效率和有效性之间的独特平衡。
Summary / 总结
FaithLens is a cost-efficient model designed to detect and explain faithfulness hallucination in large language model outputs. It uses advanced LLMs to synthesize training data with explanations and applies a filtering strategy to ensure label correctness and data diversity. The model is fine-tuned and further optimized with rule-based reinforcement learning. Experiments on 12 diverse tasks demonstrate that FaithLens outperforms models like GPT-4.1 and o3, providing high-quality explanations that enhance trustworthiness, efficiency, and effectiveness.
FaithLens 是一种成本效益高的模型,用于检测和解释大型语言模型输出中的忠实性幻觉。它使用高级 LLM 合成带有解释的训练数据,并应用过滤策略以确保数据质量。该模型通过基于规则的强化学习进行微调,以提高预测准确性和解释质量。在 12 个不同任务上的实验表明,FaithLens 的表现优于 GPT-4.1 和 o3 等模型,提供高质量的解释,从而增强可信度和效率。