arXiv Paper Digest / arXiv 论文速递

2026-01-30 03:39
Snapshot: 20260130_0339
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs
Authors: Piyush Jha, Arnav Arora, Vijay Ganesh
Venue: AAAI 2025
First: 2024-11-13T18:44:30+00:00 · Latest: 2026-01-28T18:58:57+00:00
Comments: Accepted at AAAI 2025
Abstract
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.
中文标题/摘要
标题:LLMStinger:使用RL微调的LLM对大型语言模型进行越狱攻击
我们介绍了LLMStinger,这是一种新颖的方法,利用大型语言模型(LLMs)自动生成用于越狱攻击的对抗后缀。与需要复杂提示工程或白盒访问的传统方法不同,LLMStinger使用强化学习(RL)循环微调攻击者LLM,针对HarmBench基准中的有害问题,基于现有攻击生成新的后缀。我们的方法显著优于现有红队方法(我们与15种最新方法进行了比较),在LLaMA2-7B-chat上攻击成功率(ASR)提高了57.2%,在Claude 2上ASR提高了50.3%,这两款模型均以其广泛的安全措施而闻名。此外,我们在GPT-3.5上实现了94.97%的ASR,在Gemma-2B-it上实现了99.4%,这表明LLMStinger在开源和闭源模型上均具有稳健性和适应性。
Summary / 总结
LLMStinger is a novel approach that uses Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. It employs a reinforcement learning (RL) loop to fine-tune an attacker LLM, which generates new suffixes based on existing attacks. LLMStinger significantly outperforms 15 of the latest red-teaming methods, achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2. It also demonstrates high ASR on GPT-3.5 and Gemma-2B-it, showing its robustness and adaptability across different models.
LLMStinger 是一种利用大型语言模型(LLMs)自动生成越狱攻击中对抗性后缀的新方法。它通过强化学习(RL)循环来微调攻击者 LLM,基于现有攻击生成新的后缀。LLMStinger 显著优于 15 种最新红队方法,分别在 LLaMA2-7B-chat 和 Claude 2 上实现了 +57.2% 和 +50.3% 的攻击成功率(ASR)提升。此外,它在 GPT-3.5 和 Gemma-2B-it 上也表现出高 ASR,展示了其在不同模型上的鲁棒性和适应性。
ArchesClimate: Probabilistic Decadal Ensemble Generation With Flow Matching
Authors: Graham Clyne, Guillaume Couairon, Guillaume Gastineau, Claire Monteleoni, Anastase Charantonis
First: 2025-09-19T12:53:24+00:00 · Latest: 2026-01-28T18:58:50+00:00
Abstract
Climate projections have uncertainties related to components of the climate system and their interactions. A typical approach to quantifying these uncertainties is to use climate models to create ensembles of repeated simulations under different initial conditions. Due to the complexity of these simulations, generating such ensembles of projections is computationally expensive. In this work, we present ArchesClimate, a deep learning-based climate model emulator that aims to reduce this cost. ArchesClimate is trained on decadal hindcasts of the IPSL-CM6A-LR climate model at a spatial resolution of approximately 2.5x1.25 degrees. We train a flow matching model following ArchesWeatherGen, which we adapt to predict near-term climate. Once trained, the model generates states at a one-month lead time and can be used to auto-regressively emulate climate model simulations of any length. We show that for up to 10 years, these generations are stable and physically consistent. We also show that for several important climate variables, ArchesClimate generates simulations that are interchangeable with the IPSL model. This work suggests that climate model emulators could significantly reduce the cost of climate model simulations.
中文标题/摘要
标题:ArchesClimate:基于流匹配的概率性年代际集合生成
气候预测存在与气候系统各组成部分及其相互作用相关的不确定性。量化这些不确定性的典型方法是使用气候模型在不同初始条件下重复模拟,生成集合(ensemble)。由于这些模拟的复杂性,生成此类预测集合的计算成本很高。在本文中,我们介绍了ArchesClimate,这是一种基于深度学习的气候模型模拟器,旨在降低这种成本。ArchesClimate在大约2.5x1.25度的空间分辨率下,基于IPSL-CM6A-LR气候模型的年代际回报(hindcast)数据进行训练。我们遵循ArchesWeatherGen的方法训练了一个流匹配模型,并对其进行调整以预测近期气候。训练完成后,该模型以一个月的预报时效生成状态,并可用于自回归地模拟任意长度的气候模型模拟。我们展示了在最长10年的范围内,这些生成结果是稳定的且物理上一致的。我们还展示了对于几个重要的气候变量,ArchesClimate生成的模拟可以与IPSL模型互换。这项工作表明,气候模型模拟器可以显著降低气候模型模拟的成本。
Summary / 总结
ArchesClimate is a deep learning-based emulator designed to reduce the computational cost of generating climate model ensembles. It is trained on decadal hindcasts from the IPSL-CM6A-LR model and uses a flow matching approach to predict near-term climate states. The model generates stable and physically consistent simulations up to 10 years and produces results that are interchangeable with the original IPSL model for several key climate variables.
ArchesClimate 是一个基于深度学习的气候模型模拟器,旨在降低生成气候模型集合的计算成本。它基于 IPSL-CM6A-LR 模型的年代际回报数据进行训练,并使用流匹配方法预测近期气候状态。该模型能够生成长达10年的稳定且物理上一致的模拟,并且对于若干关键气候变量,其生成结果可与原始 IPSL 模型的输出互换。
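The auto-regressive emulation described above (one-month steps chained into a simulation of arbitrary length, repeated to form an ensemble) can be sketched as follows. This is a minimal illustration, not the ArchesClimate code: `step_one_month` is a hypothetical stand-in for one sampled generation from the trained flow matching model, implemented here as a toy damped random walk, and ensemble members differ only by random seed.

```python
import random

def step_one_month(state, rng):
    # Hypothetical stand-in for one sampled one-month-ahead generation.
    # A real emulator would call the trained flow matching model here.
    return [0.9 * x + rng.gauss(0.0, 0.1) for x in state]

def rollout(initial_state, n_months, seed):
    # Chain one-month steps auto-regressively: each output feeds the next step.
    rng = random.Random(seed)
    states = [list(initial_state)]
    for _ in range(n_months):
        states.append(step_one_month(states[-1], rng))
    return states

def ensemble(initial_state, n_months, n_members):
    # Each member is an independent stochastic rollout from the same start.
    return [rollout(initial_state, n_months, seed) for seed in range(n_members)]

# Four members, each covering 10 years (120 monthly steps).
members = ensemble([1.0, -0.5], n_months=120, n_members=4)
```

The spread between members at a given month is what an ensemble uses to quantify uncertainty; in this toy version the spread comes only from the seeded noise.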
DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems
Authors: Kostis Michailidis, Dimos Tsouros, Tias Guns
First: 2025-06-06T12:56:02+00:00 · Latest: 2026-01-28T18:58:23+00:00
Comments: This version is currently submitted and it is under review. For CP-Bench (the paper accepted at ECAI25), please refer to the previous version of this entry (v2)
Abstract
Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
中文标题/摘要
标题:DCP-Bench-Open:评估大型语言模型在离散组合问题约束建模中的应用
离散组合问题(DCPs)在工业决策和优化中普遍存在。然而,尽管DCPs的约束求解技术取得了显著进步,但将它们形式化的核心过程,即约束建模,仍需要大量专业知识,成为更广泛应用的瓶颈。为缓解这一瓶颈,近期研究探索了使用大型语言模型(LLMs)将组合问题描述转换为可执行的约束模型。然而,现有的离散约束建模评估数据集通常局限于小规模、同质或特定领域的例子,无法捕捉到现实世界的多样性。本研究通过引入DCP-Bench-Open,一个包含来自约束编程(CP)和运筹学(OR)社区的多种知名离散组合问题的新基准,填补了这一空白,这些问题明确结构化以评估LLM驱动的约束建模。借助此数据集和不同的建模框架,我们比较和评估了三种不同抽象级别和底层语法的约束建模系统的建模能力。值得注意的是,使用基于Python的高级框架进行建模时,性能更高。此外,我们系统地评估了提示方法和推理时计算方法在不同LLM中的使用情况,这进一步提高了准确性,最高达到91%。DCP-Bench-Open已公开发布。
Summary / 总结
This work introduces DCP-Bench-Open, a benchmark for evaluating Large Language Models (LLMs) in transforming combinatorial problem descriptions into executable constraint models. The dataset includes a diverse set of problems from Constraint Programming and Operations Research communities. The study compares LLMs across three constraint modelling systems and finds higher performance with a high-level Python-based framework. Prompt-based and inference-time compute methods improve accuracy, reaching up to 91% on the benchmark.
这项研究引入了DCP-Bench-Open,一个用于评估大型语言模型(LLMs)将组合问题描述转换为可执行约束模型的能力的基准。它通过包含来自约束编程和运筹学社区的多样化问题来解决现有数据集的限制。研究比较了LLMs在三种约束建模系统中的表现,并发现使用高级Python框架时准确性更高。通过使用提示方法和推理时的计算方法,进一步提高了准确性,最高达到91%。
FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models
Authors: Hongyu Zhou, Zisen Shao, Sheng Miao, Pan Wang, Dongfeng Bai, Bingbing Liu, Yiyi Liao
First: 2026-01-28T18:56:03+00:00 · Latest: 2026-01-28T18:56:03+00:00
Comments: Our project page is at https://xdimlab.github.io/freefix
Abstract
Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.
中文标题/摘要
标题:FreeFix:通过无需微调的扩散模型增强3D高斯点绘
神经辐射场和3D高斯点绘在新颖视图合成方面取得了进展,但仍依赖于密集输入,并且在外推视图中往往会退化。最近的方法利用生成模型,如扩散模型,提供额外的监督,但面临泛化和保真度之间的权衡:为去除伪影而微调扩散模型可以提高保真度,但存在过拟合的风险,而无需微调的方法则保持了泛化能力,但通常保真度较低。我们提出了FreeFix,这是一种无需微调的方法,通过使用预训练的图像扩散模型增强外推渲染,从而推进这一权衡的边界。我们提出了一种交错的2D-3D细化策略,表明图像扩散模型可以用于一致的细化,而无需依赖昂贵的视频扩散模型。此外,我们更深入地研究了2D细化的引导信号,并提出了一种逐像素置信度掩码,以识别需要针对性改进的不确定区域。在多个数据集上的实验表明,FreeFix提高了多帧一致性,其性能与基于微调的方法相当甚至更优,同时保持了强大的泛化能力。
Summary / 总结
FreeFix is a fine-tuning-free approach that uses pretrained image diffusion models to enhance 3D Gaussian splatting for novel view synthesis, addressing the trade-off between generalization and fidelity. It introduces an interleaved 2D-3D refinement strategy and a per-pixel confidence mask to improve multi-frame consistency and achieve performance comparable to or surpassing fine-tuning-based methods while maintaining strong generalization ability.
FreeFix 是一种无需微调的方法,通过使用预训练的图像扩散模型来增强 3D 高斯点绘,以改善外推渲染。它引入了一种交错的 2D-3D 细化策略和一个逐像素置信度掩码,以针对不确定区域进行改进,从而实现与基于微调的方法相当甚至更优的性能,同时保持强大的泛化能力。
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
Authors: Sebastiano Monti, Carlo Nicolini, Gianni Pellegrini, Jacopo Staiano, Bruno Lepri
First: 2026-01-28T18:56:00+00:00 · Latest: 2026-01-28T18:56:00+00:00
Abstract
Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
中文标题/摘要
标题:SokoBench:评估大型语言模型的长期规划和推理能力
尽管大型语言模型的能力在复杂的推理任务中得到了越来越多的测试,但它们的长期规划能力尚未得到广泛研究。在本文中,我们系统地评估了最先进的大型推理模型(LRMs)的规划和长期推理能力。我们提出了一种基于推箱子谜题的新基准,故意简化以隔离长期规划与状态持久性。我们的研究结果表明,当需要超过25步才能达到解决方案时,规划性能会出现一致下降,这表明存在基本的前向规划能力限制。我们展示了将LRMs与规划领域定义语言(PDDL)解析、验证和求解工具结合使用可以带来适度的改进,这表明固有的架构限制可能无法仅通过测试时的扩展方法来克服。
Summary / 总结
This study evaluates the long-horizon planning and reasoning capabilities of Large Reasoning Models (LRMs) using a novel Sokoban-based benchmark. Planning performance degrades consistently when more than 25 moves are needed to reach the solution, suggesting a fundamental constraint on forward planning capacity. Equipping LRMs with PDDL parsing, validation, and solving tools yields only modest improvements, suggesting inherent architectural limitations that test-time scaling alone might not overcome.
该研究使用基于 Sokoban 的新基准评估了大型推理模型(LRMs)的长期规划和推理能力。研究发现,当需要超过25步才能解决问题时,LRMs 的性能会出现一致下降,表明它们在前向规划能力方面存在根本限制。为 LRMs 配备 PDDL 解析、验证和求解工具只能带来适度的改进,表明存在固有的架构限制,仅通过测试时的扩展方法可能无法克服这一问题。
From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images
Authors: Vi Vu, Thanh-Huy Nguyen, Tien-Thinh Nguyen, Ba-Thinh Lam, Hoang-Thien Nguyen, Tianyang Wang, Xingjian Li, Min Xu
Venue: ISBI 2026
First: 2026-01-25T18:13:48+00:00 · Latest: 2026-01-28T18:55:46+00:00
Comments: Accepted to ISBI 2026
Abstract
Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM's adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC-SAM.
中文标题/摘要
标题:从专家到通才:解锁SAM在未标注医学图像上的学习潜力
基础模型如分割一切模型(SAM)表现出强大的泛化能力,但将其适应医学图像仍然困难重重,由于领域转移、稀缺的标签以及参数高效微调(PEFT)无法利用未标注数据。尽管像U-Net这样的传统模型在半监督医学学习中表现出色,但它们协助PEFT SAM的潜力却未被充分重视。我们提出了SC-SAM,这是一种专家-通才框架,其中U-Net提供基于点的提示和伪标签来引导SAM的适应,而SAM则作为强大的通才监督者来正则化U-Net。这种相互指导形成了双向的协同训练循环,使两个模型都能有效利用未标注数据。在前列腺MRI和息肉分割基准测试中,我们的方法取得了最先进的成果,优于其他现有的半监督SAM变体,甚至优于医学基础模型MedSAM,突显了专家-通才合作在标签高效医学图像分割中的价值。我们的代码可在https://github.com/vnlvi2k3/SC-SAM获取。
Summary / 总结
This study addresses the challenge of adapting the Segment Anything Model (SAM) to medical images by introducing SC-SAM, a specialist-generalist framework. U-Net provides point-based prompts and pseudo-labels to guide SAM's adaptation, while SAM acts as a generalist supervisor to regularize U-Net. The method achieves state-of-the-art results in prostate MRI and polyp segmentation benchmarks, outperforming other semi-supervised SAM variants and medical foundation models like MedSAM, demonstrating the effectiveness of specialist-generalist cooperation for label-efficient medical image segmentation.
该研究通过引入SC-SAM框架,解决将Segment Anything Model (SAM) 调适到医学图像的挑战。U-Net提供点基提示和伪标签来引导SAM的适应,而SAM作为通用模型监督者来规范U-Net。该方法在前列腺MRI和息肉分割基准测试中取得了最先进的结果,超越了其他半监督SAM变体和医学基础模型如MedSAM,展示了专家-通用模型合作在标签高效医学图像分割中的有效性。
HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Authors: Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang
Venue: ICLR 2026
First: 2025-06-09T17:46:47+00:00 · Latest: 2026-01-28T18:52:54+00:00
Comments: Accepted to ICLR'26
Abstract
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
中文标题/摘要
标题:HeuriGym:LLM生成启发式算法在组合优化中的代理基准
虽然大型语言模型(LLMs)在推理和基于代理的问题解决方面取得了显著进展,但当前的评估方法未能充分评估其能力:现有基准要么依赖于容易饱和和记忆的封闭式问题,要么依赖于缺乏一致性和严谨性的主观比较。在本文中,我们引入了HeuriGym,这是一种代理框架,用于评估LLM生成的启发式算法在组合优化问题中的表现,这些问题具有明确的目标和广阔的解空间。HeuriGym使LLM能够提出启发式算法,通过代码执行接收评估反馈,并逐步改进其解决方案。我们对九个最先进的模型在计算机系统、物流和生物学等领域中的九个问题进行了评估,揭示了工具使用、规划和适应性推理方面的持续局限性。为了量化性能,我们提出了质量产量指数(QYI),这是一个同时捕捉解通过率和质量的指标。即使是顶级模型如GPT-o4-mini-high和Gemini-2.5-Pro的QYI得分也只有0.6,远低于专家基准的1。我们的开源基准旨在引导LLM向更有效的科学和工程领域问题解决发展。
Summary / 总结
HeuriGym is an agentic benchmark for evaluating LLM-generated heuristics in combinatorial optimization, addressing limitations of existing benchmarks. It allows LLMs to propose heuristics, receive feedback through code execution, and iteratively refine solutions. Evaluating nine state-of-the-art models on nine problems, the study reveals persistent limitations in tool use, planning, and adaptive reasoning. On the proposed Quality-Yield Index (QYI), even top models like GPT-o4-mini-high and Gemini-2.5-Pro score only 0.6, well below the expert baseline of 1.
HeuriGym 是一个旨在评估 LLM 生成的组合优化问题启发式算法的代理框架。它通过让 LLM 提出启发式算法、通过代码执行接收反馈并迭代改进解决方案来弥补现有基准的不足。对九个最先进的模型在九个不同领域问题上的评估显示,这些模型在工具使用、规划和自适应推理方面存在持续的局限性。引入了质量产量指数(QYI)来衡量性能,结果显示即使是顶级模型的 QYI 分数也只有 0.6,远低于专家基准的 1。
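The abstract says only that the Quality-Yield Index captures both solution pass rate and quality; it does not state the formula. As a hedged illustration, the sketch below assumes a harmonic mean of two ratios in [0, 1], which is one common way to combine such quantities. The function name and the example inputs are hypothetical and are not results from the paper.

```python
def quality_yield_index(quality, yield_rate):
    """Combine quality and pass rate into one score.

    Assumption: a harmonic mean, chosen for illustration only.
    quality    -- solution quality relative to an expert baseline (1.0 = expert)
    yield_rate -- fraction of instances with a valid solution (pass rate)
    """
    if quality + yield_rate == 0:
        return 0.0
    return 2 * quality * yield_rate / (quality + yield_rate)

# An expert-level solver that always passes scores exactly 1.
expert = quality_yield_index(1.0, 1.0)

# Hypothetical model: valid on 75% of instances at half the expert quality.
model = quality_yield_index(0.5, 0.75)
```

A harmonic mean penalizes imbalance: a model cannot reach a high score by passing many instances with poor solutions, or by solving a few instances very well.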
C3Box: A CLIP-based Class-Incremental Learning Toolbox
Authors: Hao Sun, Da-Wei Zhou
First: 2026-01-28T18:52:36+00:00 · Latest: 2026-01-28T18:52:36+00:00
Comments: The code is available at https://github.com/LAMDA-CL/C3Box
Abstract
Traditional machine learning systems are typically designed for static data distributions and suffer from catastrophic forgetting when learning from evolving data streams. Class-Incremental Learning (CIL) addresses this challenge by enabling learning systems to continuously learn new classes while preserving prior knowledge. With the rise of pre-trained models (PTMs) such as CLIP, leveraging their strong generalization and semantic alignment capabilities has become a promising direction in CIL. However, existing CLIP-based CIL methods are often scattered across disparate codebases and rely on inconsistent configurations, hindering fair comparisons, reproducibility, and practical adoption. Therefore, we propose C3Box (CLIP-based Class-inCremental learning toolBOX), a modular and comprehensive Python toolbox. C3Box integrates representative traditional CIL methods, ViT-based CIL methods, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework. By inheriting the streamlined design of PyCIL, C3Box provides a JSON-based configuration and standardized execution pipeline. This design enables reproducible experimentation with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research. Designed to be user-friendly, C3Box relies only on widely used open-source libraries and supports major operating systems. The code is available at https://github.com/LAMDA-CL/C3Box.
中文标题/摘要
标题:C3Box:基于CLIP的类增量学习工具箱
传统的机器学习系统通常针对静态数据分布设计,在从不断演变的数据流中学习时会遭受灾难性遗忘。类增量学习(CIL)通过使学习系统能够持续学习新类并保留先前知识来解决这一挑战。随着CLIP等预训练模型(PTMs)的兴起,利用其强大的泛化能力和语义对齐能力已成为CIL中一个有前景的方向。然而,现有的基于CLIP的CIL方法往往分散在不同的代码库中,并依赖于不一致的配置,阻碍了公平比较、可重复性和实际应用。因此,我们提出了C3Box(基于CLIP的类增量学习工具箱),一个模块化且全面的Python工具箱。C3Box将具有代表性的传统CIL方法、基于ViT的CIL方法和最先进的基于CLIP的CIL方法整合到一个统一的基于CLIP的框架中。通过继承PyCIL简洁的设计,C3Box提供了基于JSON的配置和标准化的执行管道。这种设计使得以较低的工程开销进行可重复实验成为可能,并使C3Box成为持续学习研究的可靠基准平台。C3Box设计为用户友好,仅依赖广泛使用的开源库,并支持主要的操作系统。代码可在https://github.com/LAMDA-CL/C3Box获取。
Summary / 总结
C3Box is a modular Python toolbox for class-incremental learning (CIL) with CLIP. It integrates representative traditional, ViT-based, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework, with a JSON-based configuration and a standardized execution pipeline inherited from PyCIL. This design enables reproducible experiments with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research.
C3Box 是一个模块化的 Python 工具箱,旨在利用 CLIP 方法促进类增量学习 (CIL)。它将各种传统和最先进的 CIL 方法整合到一个统一的框架中,提供标准化的执行管道和基于 JSON 的配置以实现可重复的实验。主要发现包括提高了 CIL 方法的可重复性和实际应用,使 C3Box 成为持续学习研究的可靠基准平台。
Splat Feature Solver
Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
Venue: ICLR 2026
First: 2025-08-17T03:13:06+00:00 · Latest: 2026-01-28T18:51:46+00:00
Comments: ICLR 2026 Accepted
Abstract
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes. Our code is available on GitHub at https://github.com/saliteta/splat-distiller/tree/main. We provide an additional website (https://splat-distiller.pages.dev/) for more visualization, as well as a video (https://www.youtube.com/watch?v=CH-G5hbvArM).
中文标题/摘要
标题:Splat特征求解器
特征提升已成为3D场景理解中的关键组成部分,能够将丰富的图像特征描述符(例如,DINO,CLIP)附着到基于splat的3D表示上。核心挑战在于如何在解决多视图图像不一致性问题的同时,最优地将丰富的一般属性分配给3D基本体。我们提出了一种统一的、核和特征无关的特征提升问题的稀疏线性逆问题形式化方法,可以高效地以闭式形式求解。我们的方法在凸损失下提供了全局最优误差的可证明上界,以提供高质量的提升特征。为了解决多视图观测中的不一致性和噪声,我们引入了两种互补的正则化策略来稳定解并增强语义保真度。Tikhonov引导通过软对角占优确保数值稳定性,而后提升聚合通过特征聚类过滤掉噪声输入。大量实验表明,我们的方法在开放词汇3D分割基准测试中达到了最先进的性能,优于基于训练、基于分组和启发式前向的基线方法,同时在几分钟内生成提升特征。我们的代码可在GitHub(https://github.com/saliteta/splat-distiller/tree/main)获取。我们还提供了额外的网站(https://splat-distiller.pages.dev/)进行更多可视化展示,以及视频(https://www.youtube.com/watch?v=CH-G5hbvArM)。
Summary / 总结
The paper addresses the challenge of optimally assigning rich image feature descriptors to 3D primitives in splat-based representations, presenting a unified formulation as a sparse linear inverse problem. It introduces two regularization strategies, Tikhonov Guidance and Post-Lifting Aggregation, to stabilize the solution and enhance semantic fidelity. Experiments show that the proposed method outperforms existing baselines on open-vocabulary 3D segmentation benchmarks and produces lifted features in minutes.
研究旨在通过将丰富的图像特征描述符最优地分配给3D原语来提升3D场景理解。方法将特征提升问题表述为稀疏线性逆问题,并提供了一个具有全局最优误差上界的确切解。引入了两种正则化策略,Tikhonov Guidance和Post-Lifting Aggregation,以解决多视图观测中的不一致性和噪声问题。实验表明,该方法在开放词汇3D分割基准测试中优于现有基线,并能快速生成提升特征。
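To make the "linear inverse problem solved in closed form" idea concrete, here is a minimal sketch using generic Tikhonov (ridge) regularization: given rendering weights A (pixels x splats) and observed pixel features Y (pixels x feature dims), it solves (AᵀA + λI) F = AᵀY for per-splat features F. This is a textbook regularized least-squares solve in plain Python, not the paper's exact formulation; the names `lift_features`, `solve`, and `lam`, and the tiny dense example, are assumptions for illustration.

```python
def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def lift_features(A, Y, lam=1e-3):
    """Closed-form ridge solution F = (A^T A + lam I)^-1 A^T Y."""
    n_pix, n_splat, d = len(A), len(A[0]), len(Y[0])
    AtA = [[sum(A[p][i] * A[p][j] for p in range(n_pix)) + (lam if i == j else 0.0)
            for j in range(n_splat)] for i in range(n_splat)]
    AtY = [[sum(A[p][i] * Y[p][k] for p in range(n_pix)) for k in range(d)]
           for i in range(n_splat)]
    return solve(AtA, AtY)

# Three pixels observing two splats; the middle pixel blends both equally.
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # generated from splat features [[1,0],[0,1]]
F = lift_features(A, Y, lam=1e-9)
```

The λ term adds to the diagonal of AᵀA, which keeps the system well conditioned when some splats are barely observed. A real scene would use sparse matrices, since each pixel touches only a few splats.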
Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation
Authors: Weixin Chen, Li Chen, Yuhan Zhao
Venue: WWW 2026 Workshop on HCRS (Oral Presentation)
First: 2026-01-28T18:48:43+00:00 · Latest: 2026-01-28T18:48:43+00:00
Comments: Accepted to WWW 2026 Workshop on HCRS (Oral Presentation)
Abstract
Despite growing efforts to mitigate unfairness in recommender systems, existing fairness-aware methods typically fix the fairness requirement at training time and provide limited post-training flexibility. However, in real-world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single-train framework that enables post-training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user-level regularization term that guarantees user-wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness-accuracy curves than state-of-the-art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.
中文标题/摘要
标题:训练后公平性控制:推荐中的动态公平性单训练框架
尽管在推荐系统中减轻不公平性的努力不断增加,现有的公平性感知方法通常在训练时固定公平性要求,提供有限的训练后灵活性。然而,在现实场景中,不同的利益相关者可能会在不同时间提出不同的公平性要求,因此为不同的公平性要求重新训练变得不可行。为了解决这一限制,我们提出了一种名为Cofair的单训练框架,以实现推荐中的训练后公平性控制。具体而言,Cofair 引入了一个共享表示层,其中包含公平性条件下的适配模块,以生成针对不同公平性水平专门化的用户嵌入,同时包含一个用户级别的正则化项,以确保这些水平上的用户公平性改进是单调的。我们理论证明了Cofair的对抗目标上界了人口平等性,而正则化项在用户级别强制执行逐步公平性。在多个数据集和基础模型上的全面实验表明,我们的框架在不同水平上提供了动态公平性,其公平性-准确度曲线与最先进的基线相当或更好,而无需为每个新的公平性要求重新训练。我们的代码已公开发布在https://github.com/weixinchen98/Cofair。
Summary / 总结
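Cofair is a single-train framework that enables post-training fairness control in recommendation, avoiding retraining for each new fairness requirement. It combines a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for different fairness levels, plus a user-level regularization term that enforces monotonic per-user fairness improvements across those levels. Experiments on multiple datasets and backbone models show fairness-accuracy curves comparable to or better than state-of-the-art baselines.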
论文提出了一种单训练框架Cofair,以解决现有公平性感知推荐系统中固定公平性要求的局限性。Cofair通过共享表示层和公平性条件适配模块生成不同公平性水平的用户嵌入,并包含用户级正则化项以确保公平性改进。实验表明,Cofair可以在不重新训练的情况下实现动态公平性,提供比现有方法更好的或相当的公平性-准确性折衷。
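The abstract refers to demographic parity and to fairness that improves monotonically across levels. The sketch below illustrates both notions on made-up numbers: it measures a simple demographic parity gap (the largest difference in mean score between groups) and checks that the gap does not grow as the fairness level increases. The function names and the data are hypothetical, and this simplified gap is not necessarily the exact quantity bounded in the paper.

```python
def demographic_parity_gap(scores, groups):
    """Largest difference in mean score between any two groups."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    means = [sum(v) / len(v) for v in by_group.values()]
    return max(means) - min(means)

def is_monotone_improving(gaps):
    """True if the gap never increases from one fairness level to the next."""
    return all(later <= earlier for earlier, later in zip(gaps, gaps[1:]))

groups = ["a", "a", "b", "b"]
# Hypothetical scores produced at three increasing fairness levels.
scores_by_level = [
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 0.5, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
gaps = [demographic_parity_gap(s, groups) for s in scores_by_level]
```

In a post-training control setting, the level would be chosen at inference time and the same trained model would produce the corresponding scores.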
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
Authors: Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali
First: 2025-12-23T18:16:58+00:00 · Latest: 2026-01-28T18:48:35+00:00
Abstract
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9× speedup over vanilla decoding, 1.7× over the best naive dLLM drafter, and 1.7× over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
中文标题/摘要
标题:快速失败,赢得更大:通过扩散大语言模型重新思考推测性解码的起草策略
扩散大语言模型(dLLMs)提供快速并行的标记生成,但它们的独立使用受困于固有的效率与质量权衡。我们表明,如果谨慎应用,在以自回归(AR)模型作为验证器的推测性解码中,dLLMs 的属性实际上可以成为起草器的优势。我们的核心见解是,dLLM 的并行解码速度大大降低了昂贵拒绝的风险,提供了一种实用机制,以有效实现(难以捉摸的)长草稿,而长草稿正是推测性解码获得大幅加速的关键。我们提出了 FailFast,一种基于 dLLM 的推测性解码框架,通过动态调整其推测长度来实现这一方法。它通过在难以推测的区域花费最少的计算资源来缩短推测延迟,从而“快速失败”;并通过在容易推测的区域积极扩展草稿长度来减少验证延迟,从而“赢得更大”(在许多情况下,一次推测并接受 70 个标记!)。无需任何微调,FailFast 为 AR LLM 提供无损加速,在多种模型和工作负载上,相比原始解码实现了高达 4.9 倍的加速,相比最佳的朴素 dLLM 起草器实现了 1.7 倍的加速,相比 EAGLE-3 实现了 1.7 倍的加速。我们已在 https://github.com/ruipeterpan/failfast 开源了 FailFast。
Summary / 总结
The paper addresses the efficiency-quality tradeoff in using diffusion large language models (dLLMs) for speculative decoding. It introduces FailFast, a framework that leverages the speed of dLLMs to reduce the risk of costly rejections, allowing for longer drafts that speed up speculative decoding. FailFast dynamically adjusts speculation length, spending minimal compute in hard-to-speculate regions and extending draft lengths in easier regions. The framework achieves up to 4.9 times speedup over vanilla decoding and 1.7 times over the best naive dLLM drafter and EAGLE-3 across various models and workloads.
论文解决了在使用扩散大语言模型(dLLM)进行推测性解码时的效率与质量权衡问题。它提出了FailFast框架,该框架动态调整推测长度以减少昂贵的拒绝并最大化草稿长度。FailFast在各种模型和工作负载上分别实现了高达4.9倍、1.7倍和1.7倍的加速,超过传统的解码、最佳的简单dLLM推测者和EAGLE-3。
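The idea of dynamically adapting the speculation length can be illustrated with a toy loop. The abstract does not specify FailFast's actual policy, so this sketch assumes a simple rule: double the draft length after a fully accepted draft, halve it after a rejection. `target` stands for the tokens the AR verifier would produce on its own, and `toy_drafter` is a hypothetical drafter that is wrong at every position ending a block of 40.

```python
def speculative_decode(target, draft_fn, k_min=1, k_max=64):
    """Toy lossless speculative decoding with an adaptive draft length."""
    out, k, rounds = [], k_min, 0
    while len(out) < len(target):
        start = len(out)
        draft = draft_fn(start, min(k, len(target) - start))
        # Verifier accepts the longest prefix of the draft that matches.
        accepted = 0
        for i, tok in enumerate(draft):
            if tok != target[start + i]:
                break
            accepted += 1
        out.extend(draft[:accepted])
        if accepted < len(draft):
            # Rejection: take the verifier's own token and shrink the draft.
            out.append(target[start + accepted])
            k = max(k_min, k // 2)
        else:
            # Full acceptance: speculate more aggressively next round.
            k = min(k_max, k * 2)
        rounds += 1
    return out, rounds

target = list(range(200))

def toy_drafter(start, k):
    # Correct everywhere except positions 39, 79, 119, ...
    return [p if p % 40 != 39 else -1 for p in range(start, start + k)]

out, rounds = speculative_decode(target, toy_drafter)
```

The output always equals `target`, which is what "lossless" means here; the benefit shows up as far fewer verification rounds than tokens.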
A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion
Authors: Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova, Leonardo Silvino Brito, Rafael Raider Leoni, Juliano Silva, Valter Ferreira, Sibele Miguel Soares Neto, Samantha Uehara, Daniel Giacomo, João Marcelo Teixeira, Veronica Teichrieb, Cristiano Coelho de Araújo
First: 2026-01-28T18:46:29+00:00 · Latest: 2026-01-28T18:46:29+00:00
Abstract
Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
中文标题/摘要
标题:一种基于相机-IMU融合的新数据集和鲁棒路面分类框架
路面分类(RSC)是环境感知预测性维护系统的关键使能技术。然而,现有的RSC技术往往由于传感模态有限以及数据集缺乏环境多样性,难以泛化到狭窄的操作条件之外。本研究通过引入一种多模态框架来解决这些限制,该框架使用轻量级双向交叉注意力模块融合图像和惯性测量,并通过自适应门控层在领域偏移下调整各模态的贡献。鉴于当前基准的局限性,尤其是缺乏多样性,我们引入了ROAD,这是一个由三个互补子集组成的新数据集:(i)真实世界的多模态记录,其RGB-IMU数据流通过黄金标准的行业数据记录仪同步,涵盖多种光照、天气和路面条件;(ii)一个大型仅视觉子集,旨在评估在不良光照和异构采集设置下的鲁棒性;(iii)一个合成子集,用于研究在实践中难以获得的场景中的分布外泛化。实验表明,我们的方法在PVS基准上比之前最先进的方法提高了1.4个百分点,在我们的多模态ROAD子集上提高了11.6个百分点,且在少数类别上的F1分数始终更高。该框架还在具有挑战性的视觉条件下(包括夜间、大雨和混合路面过渡)表现出稳定的性能。这些发现表明,将经济实惠的相机和IMU传感器与多模态注意力机制相结合,为路面理解提供了一个可扩展且鲁棒的基础,这对环境多变且成本约束限制了高端传感套件采用的地区尤为重要。
Summary / 总结
This work introduces a new multimodal framework for road surface classification that fuses images and inertial measurements using a lightweight bidirectional cross-attention module and an adaptive gating layer. The framework is evaluated on a new dataset called ROAD, which includes real-world multimodal recordings, a large vision-only subset, and a synthetic subset. The method achieves a 1.4-percentage-point improvement over the previous state-of-the-art on the PVS benchmark and an 11.6-percentage-point improvement on the multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also shows stable performance under challenging visual conditions such as nighttime, heavy rain, and mixed-surface transitions.
这项工作提出了一种新的多模态框架用于道路表面分类(RSC),该框架使用轻量级双向交叉注意力模块和自适应门控层融合图像和惯性测量。该框架在包含真实世界多模态记录、大量仅视觉子集和合成子集的新数据集ROAD上进行了评估。该方法在PVS基准上实现了1.4个百分点的改进,在多模态ROAD子集上实现了11.6个百分点的改进,且在少数类别的F1分数上表现更高。该框架还在夜间、大雨和混合表面过渡等具有挑战性的视觉条件下表现出稳定的性能。
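The adaptive gating idea in the abstract above (re-weighting the camera and IMU branches when one of them becomes unreliable) can be pictured with a very small sketch. This is not the paper's layer: the function names, the use of a two-way softmax, and the scalar `gate_scores` input are all assumptions made for illustration, and the cross-attention stage is left out entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(image_feat, imu_feat, gate_scores):
    """Blend two equal-length feature vectors with softmax gate weights.

    gate_scores is a hypothetical pair (image_score, imu_score); in a real
    model it would be predicted from the features themselves.
    """
    if len(image_feat) != len(imu_feat):
        raise ValueError("feature vectors must have the same length")
    w_img, w_imu = softmax(gate_scores)
    return [w_img * a + w_imu * b for a, b in zip(image_feat, imu_feat)]
```

With equal gate scores the result is the plain average of the two vectors; a much larger image score makes the output follow the image branch almost exclusively, which is the behaviour one would want when the IMU stream is degraded.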
PatchFormer: A Patch-Based Time Series Foundation Model with Hierarchical Masked Reconstruction and Cross-Domain Transfer Learning for Zero-Shot Multi-Horizon Forecasting
Authors: Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz
First: 2026-01-28T18:45:45+00:00 · Latest: 2026-01-28T18:45:45+00:00
Comments: 5 pages; 2 figures; 7 tables
Abstract
Time series forecasting is a fundamental problem with applications in climate, energy, healthcare, and finance. Many existing approaches require domain-specific feature engineering and substantial labeled data for each task. We introduce PatchFormer, a patch-based time series foundation model that uses hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer. PatchFormer segments time series into patches and learns multiscale temporal representations with learnable aggregation across temporal scales. Pretraining uses masked patch reconstruction with dynamic masking and objectives that encourage both local accuracy and global consistency, followed by cross-domain knowledge distillation. Experiments on 24 benchmark datasets spanning weather, energy, traffic, finance, and healthcare demonstrate state-of-the-art zero-shot multi-horizon forecasting, reducing mean squared error by 27.3 percent relative to strong baselines while requiring 94 percent less task-specific training data. The model exhibits near log-linear scaling with more pretraining data up to 100 billion points and processes length-512 sequences 3.8x faster than full-sequence transformers.
中文标题/摘要
标题:PatchFormer:基于块的时间序列基础模型,具有分层掩蔽重建和跨域迁移学习的零样本多步预测
时间序列预测是一个基础性问题,在气候、能源、医疗保健和金融领域都有应用。许多现有方法需要特定领域的特征工程,且每个任务都需要大量标注数据。我们介绍了PatchFormer,一种基于块的时间序列基础模型,使用分层掩蔽重建进行自监督预训练,并使用轻量级适配器进行高效迁移。PatchFormer 将时间序列分割成块,并通过可学习的跨时间尺度聚合学习多尺度时间表示。预训练使用带动态掩蔽的掩蔽块重建,以及同时鼓励局部准确性和全局一致性的目标,随后进行跨域知识蒸馏。在涵盖天气、能源、交通、金融和医疗保健的24个基准数据集上进行的实验表明,PatchFormer 在零样本多步预测方面达到了最先进的性能,相对强基线将均方误差降低了27.3%,同时所需的任务特定训练数据减少了94%。随着预训练数据增加到最多1000亿个点,该模型表现出接近对数线性的扩展规律,并且处理长度为512的序列比全序列Transformer快3.8倍。
Summary / 总结
PatchFormer is a patch-based time series foundation model that uses hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer. It segments time series into patches and learns multiscale temporal representations. Experiments on 24 benchmark datasets show that PatchFormer achieves state-of-the-art zero-shot multi-horizon forecasting, reducing mean squared error by 27.3 percent relative to strong baselines while requiring 94 percent less task-specific training data.
PatchFormer 是一种基于补丁的时间序列基础模型,使用分层掩码重构进行自监督预训练,并使用轻量级适配器进行高效迁移。它将时间序列分割成补丁,并学习多尺度时间表示。实验表明,PatchFormer 在 24 个基准数据集上的零样本多步预测中达到最先进的性能,与强基线相比将均方误差降低了 27.3%,同时所需的任务特定训练数据减少了 94%。
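The patching and masking steps named in the PatchFormer abstract can be sketched as follows. This is only an illustration under stated assumptions: non-overlapping patches, zero-filling of masked patches, and a fixed `mask_ratio` are choices made here, not details confirmed by the abstract, and the hierarchical and dynamic-masking parts are not modelled.

```python
import random


def make_patches(series, patch_len):
    """Split a series into non-overlapping patches, dropping any ragged tail."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]


def mask_patches(patches, mask_ratio, seed=0):
    """Zero out a random subset of patches.

    Returns (masked_patches, masked_indices); a reconstruction objective
    would then be computed only on the masked indices.
    """
    rng = random.Random(seed)
    n_mask = round(len(patches) * mask_ratio)
    idx = sorted(rng.sample(range(len(patches)), n_mask))
    chosen = set(idx)
    masked = [
        [0.0] * len(p) if i in chosen else list(p)
        for i, p in enumerate(patches)
    ]
    return masked, idx
```

A model trained this way sees `masked` as input and is asked to predict the original values at `idx`, which is the general shape of masked-reconstruction pretraining.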
Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)
Authors: Saurav Prateek
First: 2026-01-28T18:45:39+00:00 · Latest: 2026-01-28T18:45:39+00:00
Comments: 11 pages, 6 figures, 2 tables, source code: https://github.com/SauravP97/deep-researcher-reflect-evolve/
Abstract
This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral level research tasks. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher and Grok Deeper Search present on the DeepResearch Bench actively running leaderboard. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self consistency paradigm.
中文标题/摘要
标题:具有序列计划反思和候选交叉的深度研究员(深度研究员反思进化)
本文介绍了一种新型的深度研究员架构,旨在通过解决并行扩展范式的固有限制来生成复杂博士级主题的详细研究报告。我们的系统利用了两项关键创新:序列研究计划反思和候选交叉算法。序列精炼过程被证明是一种高效的方法,使代理能够保持一个集中的全局研究上下文,使其能够回顾当前进展,对研究计划进行推理,并在运行时智能地进行修改。这种动态适应与并行方法中常见的知识孤岛形成对比。候选交叉算法进一步提高了搜索效率,通过部署具有不同参数的多个LLM候选者来探索更大的搜索空间,并综合其发现以编撰全面的最终研究回应。该过程以一次生成报告结束,确保最终文档由统一的叙述和高密度的事实支持。由Gemini 2.5 Pro模型驱动,我们的深度研究员在DeepResearch Bench上进行了评估,这是一个全球公认的100项博士级研究任务基准。我们的架构获得了46.21的总体评分,展示了优于Claude Researcher、Nvidia AIQ Research Assistant、Perplexity Research、Kimi Researcher和Grok Deeper Search等领先深度研究代理的性能。这一性能略高于我们之前的工作Static DRA,并进一步证实了序列扩展始终优于并行自我一致性范式。
Summary / 总结
This paper presents a Deep Researcher architecture that addresses the limitations of the Parallel Scaling paradigm by introducing Sequential Plan Reflection and Candidates Crossover. The sequential refinement process allows the agent to maintain a centralized Global Research Context, enabling dynamic adaptation. The Candidates Crossover algorithm enhances search efficiency through multiple LLM candidates with varied parameters. Evaluated on the DeepResearch Bench, the architecture achieved a score of 46.21, surpassing other leading research agents and reinforcing the superiority of sequential scaling over parallel approaches.
该论文提出了一种Deep Researcher架构,旨在生成复杂博士级课题的详细研究报告。它引入了顺序研究计划反思和候选者交叉算法来提高搜索效率。该系统在DeepResearch Bench上的整体得分为46.21,超过了其他领先的助手,证实了顺序扩展优于并行自我一致性范式。
BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
Authors: Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hasssan, Stephen Youn, Steven K. Reinhardt
First: 2025-07-11T23:15:30+00:00 · Latest: 2026-01-28T18:45:01+00:00
Abstract
Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.
中文标题/摘要
标题:BlindSight:利用稀疏性提高视觉语言模型效率
大型视觉语言模型(VLMs)能够同时处理文本和图像。然而,引入视觉数据显著增加了提示长度,导致第一个标记生成时间(TTFT)变长。通过利用注意力计算中的固有稀疏性,可以缓解这一瓶颈。在处理一系列图像时分析VLMs的这些注意力模式,我们观察到在大量层中不存在跨图像注意力。基于此,我们提出BlindSight:一种利用输入模板感知的注意力稀疏性掩码优化多图像VLM推理的方法,且无运行时开销。我们利用数据集为注意力头开发了一种提示无关的分类:密集型、汇流型、图像内型和图像内+汇流型。我们开发了一个基于Triton的GPU内核来利用这种稀疏性。BlindSight在注意力计算中实现了1.8-3.2倍的加速(提示长度36K-300K)。BlindSight在不同VLMs(Qwen2-VL、Qwen2.5-VL、Gemma 3)上具有良好的泛化能力,平均在多图像理解基准测试上的绝对准确率下降仅为0.78%。最后,我们提倡设计结合BlindSight启发式稀疏和密集层的高效VLMs。
Summary / 总结
BlindSight optimizes multi-image vision-language model inference by utilizing the inherent sparsity in attention computation, achieving a 1.8-3.2x speedup in attention computation without runtime overhead. It categorizes attention heads into Dense, Sink, Intra-Image, and Intra-Image+Sink and leverages this sparsity through a Triton-based GPU kernel. The approach maintains average accuracy degradation of only 0.78% on multi-image comprehension benchmarks across different VLMs.
BlindSight 通过利用注意力计算中的固有稀疏性来优化多图像视觉语言模型的推理,实现1.8-3.2倍的注意力计算加速,且不增加运行时开销。它将注意力头分类为密集型、汇流型、图像内型和图像内+汇流型,并通过基于Triton的GPU内核利用这种稀疏性。该方法在不同VLM上的多图像理解基准测试中平均准确率下降仅为0.78%。
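To make the "Intra-Image" and "Intra-Image+Sink" head categories more concrete, the sketch below builds a boolean attention mask from a per-token segment list. The rules encoded here are assumptions for illustration only: image-token queries attend causally to keys of the same image (plus token 0 when `sink` is set), and text-token queries keep ordinary causal attention. The abstract does not spell out these rules, and the paper's Triton kernel is not reproduced.

```python
def intra_image_mask(segments, sink=False):
    """Build mask[q][k] == True when query q may attend to key k.

    segments[t] is an image id (int) for image tokens or None for text
    tokens. This layout is a hypothetical stand-in for an input template.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            if segments[q] is None:
                # Assumption: text queries use plain causal attention.
                mask[q][k] = True
            else:
                same_image = segments[k] == segments[q]
                mask[q][k] = same_image or (sink and k == 0)
    return mask
```

Because the mask depends only on the template (where images start and end), it can be built once per prompt layout rather than recomputed from attention scores, which is consistent with the "no runtime overhead" framing in the abstract.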
Open-Vocabulary Functional 3D Human-Scene Interaction Generation
Authors: Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron, Michael J. Black, Yan Zhang
First: 2026-01-28T18:34:25+00:00 · Latest: 2026-01-28T18:34:25+00:00
Comments: 18 pages
Abstract
Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
中文标题/摘要
标题:开放词汇功能性的三维人类-场景交互生成
生成能够功能性地与三维场景交互的三维人类仍然是一个开放问题,具有在具身人工智能、机器人技术和交互内容创作中的应用。关键挑战在于既要推理三维场景中功能元素的语义,又要推断出实现功能感知交互所需的三维人类姿态。不幸的是,现有方法通常缺乏对物体功能及其与场景接触的显式推理,导致交互不合理或功能不正确。在本工作中,我们提出了一种无需训练的功能驱动框架FunHSI,能够从开放词汇的任务提示中生成功能性正确的交互。给定一个任务提示,FunHSI执行功能感知的接触推理,以识别功能性的场景元素,重建它们的三维几何结构,并通过接触图建模高层次的交互。然后利用视觉语言模型在图像中合成执行任务的人类,并估计提出的三维身体和手部姿态。最后,通过阶段优化来细化提出的三维身体配置,以确保物理合理性与功能性正确。与现有方法相比,FunHSI不仅合成了更合理的通用三维交互,如“坐在沙发上”,还支持细粒度的功能性人类-场景交互,例如“增加房间温度”。大量实验表明,FunHSI能够一致地生成功能正确且物理合理的三维人类-场景交互,适用于各种室内外场景。
Summary / 总结
The research aims to generate 3D humans that functionally interact with 3D scenes, addressing the challenge of reasoning about object functionality and human-scene contact. FunHSI, a training-free framework, identifies functional scene elements, reconstructs their 3D geometry, and models interactions via a contact graph. The method leverages vision-language models to synthesize human poses and refines them for physical plausibility. Experiments show that FunHSI generates functionally correct and physically plausible interactions, including both general and fine-grained tasks.
该研究解决了生成能够与3D场景功能性互动的3D人类这一关键问题,适用于具身AI、机器人技术和交互内容创作。提出的FunHSI框架通过功能感知的接触推理来识别和重建功能场景元素,并通过接触图建模高层次的互动。然后,它合成执行任务的人类,并通过逐阶段优化来确保物理合理性和功能性正确性。实验表明,FunHSI能够在各种场景中生成功能性正确且物理合理的互动,优于现有方法在一般和细粒度功能性互动方面的表现。
Linear representations in language models can change dramatically over a conversation
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
First: 2026-01-28T18:33:17+00:00 · Latest: 2026-01-28T18:33:17+00:00
Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
中文标题/摘要
标题:语言模型中的线性表示在对话中可能会发生显著变化
语言模型的表示经常包含与高级概念相对应的线性方向。在这里,我们研究这些表示的动力学:在模拟对话的背景下,这些表示如何沿这些维度演变。我们发现,线性表示在对话中可能会发生显著变化;例如,对话开始时作为事实的信息在对话结束时可能被表示为非事实,反之亦然。这些变化是内容相关的;虽然对话相关的信息的表示可能会变化,但通用信息通常会得到保留。即使对于那些将事实性与更表面的响应模式分离的维度,这些变化也具有鲁棒性,并且在不同的模型家族和模型的不同层中都会发生。这些表示的变化不需要在策略中进行对话;即使重播由完全不同模型编写的对话脚本,也可以产生类似的变化。然而,仅仅在上下文中有一个更明确地被描述为科幻故事的情节,适应性会弱得多。我们还展示了沿着表示方向进行引导可以在对话的不同点产生截然不同的效果。这些结果与表示可能因模型在对话中扮演特定角色而演变的想法是一致的。我们的发现可能对可解释性和引导提出挑战——特别是,它们暗示使用静态特征或方向的解释可能是误导性的,或者假设一组特征始终对应于特定的真实值的探针可能是不准确的。然而,这些类型的表示动力学也指出了理解模型如何适应上下文的新研究方向。
Summary / 总结
The study investigates how linear representations in language models evolve during simulated conversations, finding that representations can change dramatically, such as factual information becoming non-factual or vice versa. These changes are content-dependent and occur across different model types and layers, even when replaying scripts from different models. The results suggest that representations adapt based on the model's role in the conversation, posing challenges for interpretability and steering but also opening new research avenues.
研究探讨了语言模型在模拟对话中线性表示的变化,发现表示可以发生显著变化,例如事实信息变为非事实或反之亦然。这些变化取决于内容,并在不同模型类型和层中发生,即使重播来自不同模型的脚本也是如此。结果表明,表示会根据模型在对话中的角色进行调整,这为可解释性和控制带来了挑战,但也指出了新的研究方向。
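The two operations this abstract relies on, reading a concept off a linear direction and steering along it, reduce to a projection and a vector addition. The sketch below shows both on plain Python lists; the function names and the choice to normalise the direction are assumptions for illustration and are not taken from the paper.

```python
import math


def _unit(direction):
    """Return the direction scaled to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [d / norm for d in direction]


def project(hidden, direction):
    """Scalar read-out: component of a hidden state along a concept direction."""
    u = _unit(direction)
    return sum(h * d for h, d in zip(hidden, u))


def steer(hidden, direction, strength):
    """Shift a hidden state by `strength` units along the concept direction."""
    u = _unit(direction)
    return [h + strength * d for h, d in zip(hidden, u)]
```

The abstract's caution applies directly to `project`: if the same direction reads "factual" early in a conversation and "non-factual" later, a fixed threshold on this scalar would be misleading.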
FLOL: Fast Baselines for Real-World Low-Light Enhancement
Authors: Juan C. Benito, Daniel Feijoo, Alvaro Garcia, Marcos V. Conde
First: 2025-01-16T18:06:09+00:00 · Latest: 2026-01-28T18:31:35+00:00
Comments: Journal Preprint
Abstract
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the computer vision literature. However, current deep learning-based solutions struggle with efficiency and robustness for real-world scenarios (e.g., scenes with noise, saturated pixels). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our baseline method, FLOL, is one of the fastest models for this task, achieving results comparable to the state-of-the-art on popular real-world benchmarks such as LOLv2, LSRW, MIT-5K and UHD-LL. Moreover, we are able to process 1080p images in real-time under 12ms. Code and models at https://github.com/cidautai/FLOL
中文标题/摘要
标题:FLOL:面向真实世界低光照增强的快速基线
低光照图像增强(LLIE)是计算摄影和成像中的关键任务。计算机视觉文献中已经很好地研究了在夜间或暗环境下拍摄图像的增强问题。然而,当前基于深度学习的解决方案在真实场景中(例如,有噪声和饱和像素的场景)在效率和鲁棒性方面存在困难。我们提出了一种结合频域和空域图像处理的轻量级神经网络。我们的基线方法FLOL是该任务中最快的模型之一,在流行的真实世界基准(如LOLv2、LSRW、MIT-5K和UHD-LL)上达到了与最先进方法相当的结果。此外,我们能够在12毫秒内实时处理1080p图像。代码和模型见https://github.com/cidautai/FLOL
Summary / 总结
The paper addresses the challenge of enhancing low-light images, a critical task in computational photography. It introduces FLOL, a lightweight neural network that processes images in both the frequency and spatial domains, offering fast and robust performance. FLOL achieves results comparable to state-of-the-art models on popular benchmarks and can process 1080p images in real time, in under 12 ms.
研究旨在解决在真实场景中增强低光图像的挑战,当前的深度学习方法在效率和鲁棒性方面往往表现不佳。提出的FLOL方法结合了频域和空域处理,创建了一个轻量级的神经网络。实验结果表明,FLOL在流行基准上的表现与最先进的模型相当,并且可以实时处理1080p图像,耗时不到12ms。
MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents
Authors: Vishnu Sashank Dorbala, Dinesh Manocha
First: 2026-01-28T18:31:17+00:00 · Latest: 2026-01-28T18:31:17+00:00
Abstract
Foundation models rely on in-context learning for personalized decision making. The limited size of this context window necessitates memory compression and retrieval systems like RAG. These systems, however, often treat memory as large offline storage spaces, which is unfavorable for embodied agents that are expected to operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for pruning memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ, (1) via an offline expert and (2) via online RL, and observe significant improvement in overall embodied task completion ability on μ-augmented MLLMs. In particular, on augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that μ-augmented MLLMs show an improvement of around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.
中文标题/摘要
标题:MemCtrl:使用MLLM作为具身代理的主动内存控制器
基础模型依赖于上下文学习进行个性化的决策。由于上下文窗口的限制,需要使用记忆压缩和检索系统,如RAG。然而,这些系统通常将记忆视为大型离线存储空间,这对于需要在严格的记忆和计算约束下在线操作的具身代理来说是不利的。在本文中,我们提出了一种名为MemCtrl的新框架,该框架利用多模态大型语言模型(MLLMs)进行在线记忆修剪。MemCtrl通过一个可训练的记忆头μ来增强MLLMs,μ作为门控机制,决定在探索过程中保留、更新或丢弃哪些观察或反思。我们通过两种方式训练μ:1)通过离线专家,2)通过在线强化学习,并观察到在μ增强的MLLMs上整体具身任务完成能力有了显著提高。特别是在对EmbodiedBench基准的不同子集使用MemCtrl增强两个低性能的MLLMs时,我们观察到μ增强的MLLMs在平均上提高了约16%,在特定指令子集上提高了超过20%。最后,我们对μ收集的记忆片段进行了定性分析,指出μ增强的MLLMs在长且复杂的指令类型上表现出更优的性能。
Summary / 总结
MemCtrl is a framework that uses Multimodal Large Language Models (MLLMs) to prune memory online for embodied agents. It introduces a trainable memory head μ that acts as a gate to decide which observations or reflections to retain, update, or discard. Evaluations show that μ-augmented MLLMs significantly improve embodied task completion ability, with an average improvement of around 16% on the EmbodiedBench benchmark, and over 20% on specific instruction subsets.
MemCtrl 是一个框架,使用多模态大型语言模型(MLLMs)为具身代理在线修剪记忆。它引入了一个可训练的记忆头 μ,作为门控机制来决定保留、更新或丢弃哪些观察或反思。评估结果显示,μ 增强的 MLLMs 的具身任务完成能力显著提高,在 EmbodiedBench 基准上平均提升约 16%,在特定指令子集上提升超过 20%。
VSCOUT: A Hybrid Variational Autoencoder Approach to Outlier Detection in High-Dimensional Retrospective Monitoring
Authors: Waldyn G. Martinez
First: 2026-01-28T18:30:48+00:00 · Latest: 2026-01-28T18:30:48+00:00
Abstract
Modern industrial and service processes generate high-dimensional, non-Gaussian, and contamination-prone data that challenge the foundational assumptions of classical Statistical Process Control (SPC). Heavy tails, multimodality, nonlinear dependencies, and sparse special-cause observations can distort baseline estimation, mask true anomalies, and prevent reliable identification of an in-control (IC) reference set. To address these challenges, we introduce VSCOUT, a distribution-free framework designed specifically for retrospective (Phase I) monitoring in high-dimensional settings. VSCOUT combines an Automatic Relevance Determination Variational Autoencoder (ARD-VAE) architecture with ensemble-based latent outlier filtering and changepoint detection. The ARD prior isolates the most informative latent dimensions, while the ensemble and changepoint filters identify pointwise and structural contamination within the determined latent space. A second-stage retraining step removes flagged observations and re-estimates the latent structure using only the retained inliers, mitigating masking and stabilizing the IC latent manifold. This two-stage refinement produces a clean and reliable IC baseline suitable for subsequent Phase II deployment. Extensive experiments across benchmark datasets demonstrate that VSCOUT achieves superior sensitivity to special-cause structure while maintaining controlled false alarms, outperforming classical SPC procedures, robust estimators, and modern machine-learning baselines. Its scalability, distributional flexibility, and resilience to complex contamination patterns position VSCOUT as a practical and effective method for retrospective modeling and anomaly detection in AI-enabled environments.
中文标题/摘要
标题:VSCOUT:高维回顾性监控中的混合变分自编码器异常检测方法
现代工业和服务过程生成高维、非正态和受污染的数据,挑战了经典统计过程控制(SPC)的基本假设。重尾、多模态、非线性依赖关系和稀疏的特殊原因观测可以扭曲基线估计,掩盖真正的异常,并阻止可靠地识别处于控制状态(IC)的参考集。为了解决这些挑战,我们引入了VSCOUT,这是一种分布无关的框架,专门用于高维环境中的回顾性(第一阶段)监控。VSCOUT结合了自动相关性确定变分自编码器(ARD-VAE)架构、基于集成的潜在异常过滤和变化点检测。ARD先验隔离了最具信息量的潜在维度,而集成和变化点过滤器在确定的潜在空间内识别点和结构异常。第二阶段重新训练步骤移除标记的观测值,并仅使用保留的内点重新估计潜在结构,从而减轻掩盖并稳定IC潜在流形。这种两阶段细化产生了一个干净且可靠的IC基线,适用于后续第二阶段部署。在基准数据集上的广泛实验表明,VSCOUT在检测特殊原因结构方面具有更高的灵敏度,同时保持了受控的误报率,优于经典SPC程序、稳健估计和现代机器学习基线。其可扩展性、分布灵活性以及对复杂污染模式的鲁棒性使VSCOUT成为AI驱动环境中回顾性建模和异常检测的一种实用且有效的方法。
Summary / 总结
VSCOUT is a hybrid variational autoencoder approach designed for outlier detection in high-dimensional, non-Gaussian data. It combines an ARD-VAE with ensemble-based latent outlier filtering and changepoint detection to isolate informative latent dimensions and identify contamination. The method re-trains using only inliers to refine the latent structure, producing a clean baseline for anomaly detection. Experiments show VSCOUT outperforms classical SPC, robust estimators, and machine-learning baselines in detecting special-cause structure with controlled false alarms.
VSCOUT 是一种混合变分自编码器方法,用于高维、非高斯数据中的异常检测。它使用 ARD-VAE 来隔离信息性的潜在维度,并结合基于集成的过滤和变化点检测来识别和移除异常值。该方法在内点上重新训练以细化潜在结构,从而生成用于异常检测的干净基线。实验表明,VSCOUT 在检测特殊原因异常的同时控制误报,优于经典 SPC、鲁棒估计和机器学习基线方法。
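The two-stage refinement described for VSCOUT (flag suspect observations, then re-estimate the baseline from the retained inliers) can be illustrated on one-dimensional data. Note the swap: the sketch uses a median/MAD robust z-score in place of the ARD-VAE latent space, ensemble filter, and changepoint detector, so it shows only the "flag, drop, refit" pattern and not the paper's method. The cutoff `z_cut=3.5` and the 0.6745 scaling constant are conventional choices assumed here.

```python
import statistics


def two_stage_baseline(values, z_cut=3.5):
    """Stage 1: flag points by robust z-score. Stage 2: refit on inliers only.

    Returns (mean, stdev, flagged_indices), where mean and stdev are
    estimated after removing the flagged observations.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        flagged = []  # no spread to scale by; flag nothing in this sketch
    else:
        flagged = [
            i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > z_cut
        ]
    dropped = set(flagged)
    inliers = [v for i, v in enumerate(values) if i not in dropped]
    return statistics.mean(inliers), statistics.stdev(inliers), flagged
```

Refitting on inliers matters because a single extreme value inflates the naive standard deviation, which in turn can hide other anomalies (the masking effect the abstract mentions).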
Online Conformal Model Selection for Nonstationary Time Series
Authors: Shibo Li, Yao Zheng
First: 2025-06-05T19:45:52+00:00 · Latest: 2026-01-28T18:29:54+00:00
Abstract
This paper introduces the MPS (Model Prediction Set), a novel framework for online model selection for nonstationary time series. Classical model selection methods, such as information criteria and cross-validation, rely heavily on the stationarity assumption and often fail in dynamic environments which undergo gradual or abrupt changes over time. Yet real-world data are rarely stationary, and model selection under nonstationarity remains a largely open problem. To tackle this challenge, we combine conformal inference with model confidence sets to develop a procedure that adaptively selects models best suited to the evolving dynamics at any given time. Concretely, the MPS updates in real time a confidence set of candidate models that covers the best model for the next time period with a specified long-run probability, while adapting to nonstationarity of unknown forms. Through simulations and real-world data analysis, we demonstrate that MPS reliably and efficiently identifies optimal models under nonstationarity, an essential capability lacking in offline methods. Moreover, MPS frequently produces high-quality sets with small cardinality, whose evolution offers deeper insights into changing dynamics. As a generic framework, MPS accommodates any data-generating process, data structure, model class, training method, and evaluation metric, making it broadly applicable across diverse problem settings.
中文标题/摘要
标题:非平稳时间序列的在线共形模型选择
本文介绍了MPS(模型预测集),这是一种用于非平稳时间序列在线模型选择的新框架。经典的模型选择方法,如信息准则和交叉验证,严重依赖于平稳性假设,在随时间发生渐进或突然变化的动态环境中往往失效。然而,现实世界的数据很少是平稳的,非平稳条件下的模型选择在很大程度上仍是一个未解决的问题。为了解决这一挑战,我们将共形推断与模型置信集相结合,开发了一种能够自适应地选择最适合任意给定时刻演变动态的模型的程序。具体而言,MPS 实时更新候选模型的置信集,使其以指定的长期概率覆盖下一个时间周期的最佳模型,同时适应未知形式的非平稳性。通过模拟和实际数据分析,我们证明了MPS 能够可靠且高效地在非平稳条件下识别最优模型,这是离线方法所缺乏的关键能力。此外,MPS 经常生成高质量且基数较小的集合,其演变提供了对动态变化更深入的洞察。作为一种通用框架,MPS 可容纳任何数据生成过程、数据结构、模型类别、训练方法和评估指标,使其在各种问题设置中具有广泛的适用性。
Summary / 总结
This paper introduces the MPS (Model Prediction Set), a framework for online model selection for nonstationary time series. It addresses the limitations of classical methods like information criteria and cross-validation, which assume stationarity and often fail in dynamic environments. The MPS combines conformal inference with model confidence sets to adaptively select the best models for evolving dynamics. Experiments show that MPS reliably identifies optimal models under nonstationarity and frequently produces high-quality sets with small cardinality, offering deeper insights into changing dynamics.
本文提出了MPS(模型预测集)框架,用于非平稳时间序列的在线模型选择。它解决了信息准则和交叉验证等经典方法因假设平稳性而在动态环境中往往失效的问题。MPS 将共形推断与模型置信集相结合,能够自适应地选择最适合当前演变动态的模型。实验表明,MPS 能够可靠地在非平稳条件下识别最优模型,并且经常生成高质量且基数较小的模型集,从而提供对变化动态的更深入洞察。
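The abstract does not give the MPS update rule, so the sketch below substitutes a generic online quantile-tracking step of the kind used in adaptive conformal methods: the set is widened after a miss and shrunk after a hit, so that the miss rate drifts toward a target `alpha`. Everything here (the function names, the "within `tau` of the best score" set rule, the step size `gamma`, and the clamp at zero) is an assumption for illustration and should not be read as the authors' procedure or as carrying their coverage guarantee.

```python
def model_set(scores, tau):
    """Keep every model whose running loss is within tau of the current best.

    scores maps a (hypothetical) model name to its recent loss; lower is better.
    """
    best = min(scores.values())
    return {name for name, s in scores.items() if s <= best + tau}


def update_threshold(tau, covered, alpha, gamma):
    """One quantile-tracking step on the set width.

    covered: whether the model that turned out best for the new period
             was inside the set that had been announced for it.
    alpha:   target long-run miss rate.
    gamma:   step size.
    """
    err = 0.0 if covered else 1.0
    return max(0.0, tau + gamma * (err - alpha))
```

In an online loop one would call `model_set` before each period, observe which model actually performed best, and then call `update_threshold` with the resulting hit or miss. Small sets then signal periods in which one model clearly dominates.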
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
Authors: Minwu Kim, Safal Shrestha, Keith Ross
First: 2026-01-28T18:29:21+00:00 · Latest: 2026-01-28T18:29:21+00:00
Comments: 16 pages
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
中文标题/摘要
标题:通过失败前缀条件化训练饱和问题的推理模型
可验证奖励的强化学习(RLVR)显著提升了大型语言模型(LLMs)的推理能力,但随着问题变得饱和,训练往往停滞不前。我们发现核心挑战在于有信息量的失败难以获得:学习信号存在,但在标准rollout中很少被遇到。为了解决这一问题,我们提出了一种简单而有效的方法——失败前缀条件化,用于从饱和问题中学习。我们的方法不是从原始问题开始,而是以罕见的错误推理轨迹的前缀为条件进行训练,从而重新分配探索,使模型接触到容易失败的状态。我们观察到,失败前缀条件化带来的性能提升与在中等难度问题上训练相当,同时保持了token效率。此外,我们分析了模型的鲁棒性,发现我们的方法减少了在误导性失败前缀下的性能下降,尽管在遵循正确的早期推理方面略有折衷。最后,我们展示了一种在训练过程中刷新失败前缀的迭代方法,可以在性能进入平台期后解锁额外的提升。总体而言,我们的结果表明,失败前缀条件化为扩展RLVR在饱和问题上的训练提供了一条有效途径。
Summary / 总结
The paper addresses the challenge of training large language models (LLMs) on saturated problems using reinforcement learning with verifiable rewards (RLVR), where the models often stall due to poor accessibility of informative failures. The authors propose failure-prefix conditioning, which conditions training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. This method yields performance gains similar to training on medium-difficulty problems while maintaining token efficiency. Additionally, the iterative approach of refreshing failure prefixes during training further enhances performance after initial plateaus.
论文针对使用可验证奖励的强化学习(RLVR)在饱和问题上训练大型语言模型(LLMs)时遇到的挑战,即标准rollout很少遇到有信息量的失败。作者提出了一种失败前缀条件化的方法,该方法以罕见的错误推理轨迹的前缀为条件进行训练,使模型暴露于容易失败的状态。这种方法在保持token效率的同时,带来了与中等难度问题训练相当的性能提升。迭代刷新失败前缀可以在初始平台期之后进一步提升性能。研究表明,失败前缀条件化是扩展饱和问题上RLVR训练的有效方法。
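The data-preparation side of failure-prefix conditioning can be sketched in a few lines: take the rare incorrect rollouts for a saturated problem, cut each one at some point, and use "question + truncated failure" as the new starting state for training. The word-level truncation, the fixed `fraction`, and the newline separator are assumptions chosen for illustration; the abstract does not say how prefixes are cut or formatted.

```python
def failure_prefix_prompts(question, rollouts, fraction=0.5):
    """Build training prompts from incorrect rollouts.

    rollouts: list of (reasoning_text, is_correct) pairs.
    Returns one prompt per incorrect rollout, consisting of the question
    followed by the first `fraction` of that rollout's words.
    """
    prompts = []
    for text, is_correct in rollouts:
        if is_correct:
            continue  # only failures are used as conditioning prefixes
        words = text.split()
        keep = max(1, int(len(words) * fraction))
        prompts.append(question + "\n" + " ".join(words[:keep]))
    return prompts
```

The iterative variant mentioned in the abstract would correspond to regenerating `rollouts` with the current policy every so often and rebuilding the prompt list from the new failures.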
Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
Authors: Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt
Venue: ICLR
First: 2026-01-26T20:37:11+00:00 · Latest: 2026-01-28T18:25:21+00:00
Comments: Accepted in ICLR'26
Abstract
Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification--particularly VC proving--remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as Linux and Contiki-OS kernel, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.
中文标题/摘要
标题:神经定理证明在验证条件中的应用:一个实际基准
定理证明是程序验证的基础,其中验证条件(VCs)的自动证明仍然是主要瓶颈。实际程序验证经常遇到现有的自动定理证明器(ATPs)无法证明的难题VCs,导致需要大量的手动证明,这严重阻碍了实际应用。尽管神经定理证明(NTP)在数学竞赛中取得了显著成功,展示了机器学习方法在形式推理中的潜力,但其在程序验证中的应用,特别是VC证明,仍然鲜有探索。尽管已有工作集中在注释合成和验证相关的定理证明,但没有基准专门针对这一根本瓶颈:自动VC证明。本研究引入了神经定理证明用于验证条件(NTP4VC),提出了第一个用于此任务的实际多语言基准。从如Linux和Contiki-OS内核的实际项目中,我们的基准利用工业管道(Why3和Frama-C)生成形式语言(Isabelle、Lean和Rocq)之间的语义等价测试用例。我们评估了大型语言模型(LLMs),包括通用型和专门针对定理证明微调的模型,对NTP4VC的性能。结果表明,尽管LLMs在VC证明中显示出潜力,但程序验证仍面临重大挑战,突显了未来研究的巨大差距和机会。
Summary / 总结
This work addresses the bottleneck of automated proof of Verification Conditions (VCs) in program verification by introducing NTP4VC, a real-world multi-language benchmark. It evaluates large language models, both general-purpose and fine-tuned for theorem proving, on this benchmark. Results show potential in VC proving but also highlight significant challenges for practical application in program verification.
该研究通过引入NTP4VC,一个面向实际的多语言基准,解决了程序验证中VC自动证明的瓶颈问题。它评估了大型语言模型,包括通用模型和专门针对定理证明优化的模型,在此基准上的表现。结果表明,尽管这些模型显示出潜力,但在VC证明方面仍面临重大挑战,这表明未来需要进一步的研究。
ReactionMamba: Generating Short & Long Human Reaction Sequences
Authors: Hajra Anwar Beg, Baptiste Chopin, Hao Tang, Mohamed Daoudi
First: 2025-11-28T21:19:45+00:00 · Latest: 2026-01-28T18:22:45+00:00
Abstract
We present ReactionMamba, a novel framework for generating long 3D human reaction motions. ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This design enables ReactionMamba to generate both short sequences of simple motions and long sequences of complex motions, such as dance and martial arts. We evaluate ReactionMamba on three datasets (NTU120-AS, Lindy Hop, and InterX) and demonstrate competitive performance in terms of realism, diversity, and long-sequence generation compared to previous methods, including InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.
中文标题/摘要
标题:ReactionMamba: 生成简短与长人类反应序列
我们提出了ReactionMamba,一种用于生成长3D人类反应动作的新颖框架。ReactionMamba结合了用于高效动作编码的运动VAE与基于Mamba的状态空间模型来解码时间上一致的反应。这种设计使ReactionMamba能够生成简短的简单动作序列和长的复杂动作序列,如舞蹈和武术。我们在NTU120-AS、Lindy Hop和InterX三个数据集上评估了ReactionMamba,并在现实感、多样性和长序列生成方面与先前的方法(包括InterFormer、ReMoS和Ready-to-React)相比展示了竞争力,同时实现了显著的推理速度改进。
Summary / 总结
ReactionMamba is a framework for generating both short and long 3D human reaction motions. It combines a motion VAE for efficient encoding with Mamba-based state-space models for decoding temporally consistent reactions. The system can generate simple motions as well as complex ones such as dance and martial arts. Evaluation on NTU120-AS, Lindy Hop, and InterX shows that ReactionMamba is competitive with previous methods such as InterFormer, ReMoS, and Ready-to-React in realism, diversity, and long-sequence generation, while achieving substantially faster inference.
ReactionMamba 是一个框架,用于生成3D人类反应动作,包括短序列和长序列。它结合了运动VAE进行高效编码和基于Mamba的状态空间模型进行解码,以生成一致的反应。该系统可以产生简单的和复杂的动作序列,如舞蹈和武术。在NTU120-AS、Lindy Hop和InterX上的评估显示,ReactionMamba在现实感、多样性和长序列生成方面表现出色,同时比InterFormer、ReMoS和Ready-to-React等先前方法具有更快的推理速度。
AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics
Authors: Bakhtawar Ahtisham, Kirk Vanacore, Jinsook Lee, Zhuqian Zhou, Doug Pietrzak, Rene F. Kizilcec
First: 2025-11-12T22:35:36+00:00 · Latest: 2026-01-28T18:09:36+00:00
Abstract
Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, that is, prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
中文标题/摘要
标题:AI注解编排:评估LLM验证器以提高LLM注解在学习分析中的质量
大型语言模型(LLMs)越来越多地用于标注学习互动,但对其可靠性的担忧限制了其应用。我们测试了验证导向的编排提示模型是否能通过自我验证或交叉验证来提高辅导对话的定性编码质量。使用30个一对一数学会话的转录,我们比较了三种生产LLM(GPT、Claude、Gemini)在三种条件下的表现:未验证标注、自我验证和交叉验证。输出与盲法、分歧重点的人类仲裁进行基准比较,使用科恩κ值。总体而言,编排提高了κ值58%。自我验证几乎将一致性提高了一倍,相对于未验证的基线,难度较大的辅导动作获得最大收益。交叉验证平均提高了37%,但存在配对和构建依赖的效果:一些验证者-标注者配对超过了自我验证,而另一些则降低了对齐度,反映了验证者严格程度的差异。我们贡献了:(1)一个灵活的编排框架,实现控制、自我和交叉验证的实例化;(2)在盲法人类“黄金”标签的前沿LLM上对真实辅导数据进行的实证比较;(3)一种简洁的表示法,验证者(标注者)(例如,Gemini(GPT)或Claude(Claude)),以标准化报告并明确指示方向性效果,便于复制。结果将验证定位为可靠、可扩展的LLM辅助注解在学习分析中的一个原则性设计杠杆。
Summary / 总结
This study evaluates whether verification-oriented orchestration, in which models are prompted to check their own labels (self-verification) or audit one another (cross-verification), improves the quality of Large Language Model (LLM) annotations in learning analytics. Using transcripts from 30 one-to-one math sessions, the research compares unverified annotation, self-verification, and cross-verification with three LLMs (GPT, Claude, Gemini). Measured by Cohen's kappa against blinded human adjudication, orchestration yields a 58 percent overall improvement in agreement. Self-verification nearly doubles agreement relative to unverified baselines, and cross-verification achieves a 37 percent improvement on average, with effects that vary across verifier-annotator pairs.
本研究评估了验证导向的编排(即提示模型检查自身标签的自我验证,或相互审核的交叉验证)能否在学习分析中提高大型语言模型(LLM)标注的质量。使用30个一对一数学会话的转录,对三种LLM(GPT、Claude、Gemini)在未验证标注、自我验证和交叉验证三种条件下进行了比较。结果显示,以Cohen's kappa衡量,编排使整体一致性提高了58%。与未验证基线相比,自我验证使一致性几乎翻倍,在具有挑战性的辅导动作上增益最大。交叉验证平均提高了37%,但不同验证者-标注者配对的效果存在差异,反映了验证者严格程度的不同。
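The agreement figures above are reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The label names and the two label lists are hypothetical illustrations, not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical annotations (assumption, for illustration only).
human = ["hint", "hint", "praise", "other", "praise", "hint"]
model = ["hint", "other", "praise", "other", "praise", "praise"]
kappa = cohens_kappa(human, model)  # p_o = 4/6, p_e = 11/36
```

Because kappa discounts chance agreement, a relative improvement in kappa is not the same as an equal relative improvement in raw percent agreement.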
Dense-SfM: Structure from Motion with Dense Consistent Matching
Authors: JongMin Lee, Sungjoo Yoo
First: 2025-01-24T06:45:12+00:00 · Latest: 2026-01-28T17:55:29+00:00
Abstract
We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods. Project page: https://icetea-cv.github.io/densesfm/.
中文标题/摘要
标题:Dense-SfM:基于密集一致匹配的运动恢复结构
我们提出了Dense-SfM,这是一种新型的运动恢复结构(SfM)框架,旨在从多视角图像中进行密集且准确的三维重建。传统的SfM方法通常依赖稀疏关键点匹配,这限制了准确性和点密度,尤其是在无纹理区域。Dense-SfM通过将密集匹配与基于高斯泼溅(GS)的轨迹扩展相结合来解决这一限制,从而提供更一致、更长的特征轨迹。为了进一步提高重建精度,Dense-SfM配备了一个利用Transformer和高斯过程架构的多视角核化匹配模块,以在多视角间实现稳健的轨迹细化。在ETH3D和Texture-Poor SfM数据集上的评估表明,Dense-SfM在准确性和密度方面显著优于现有最先进的方法。项目页面:https://icetea-cv.github.io/densesfm/
Summary / 总结
Dense-SfM is a novel Structure from Motion framework that improves 3D reconstruction accuracy and density by integrating dense matching with a Gaussian Splatting-based track extension. It uses a multi-view kernelized matching module with transformer and Gaussian Process architectures for robust track refinement. Experiments on ETH3D and Texture-Poor SfM datasets demonstrate that Dense-SfM outperforms existing methods in both accuracy and density.
Dense-SfM 是一种新颖的运动恢复结构(SfM)框架,通过结合密集匹配和基于高斯泼溅的轨迹扩展来提高三维重建的准确性和密度。它使用具有Transformer和高斯过程架构的多视图核化匹配模块进行稳健的轨迹细化。在 ETH3D 和 Texture-Poor SfM 数据集上的实验表明,Dense-SfM 在准确性和密度方面均优于现有方法。
Reinforcement Learning via Self-Distillation
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
First: 2026-01-28T17:45:12+00:00 · Latest: 2026-01-28T17:45:12+00:00
Abstract
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
中文标题/摘要
标题:通过自我蒸馏进行强化学习
大型语言模型越来越多地在代码和数学等可验证领域中通过强化学习进行后训练。然而,当前的可验证奖励强化学习(RLVR)方法仅从每次尝试的标量结果奖励中学习,这造成了严重的信用分配瓶颈。许多可验证环境实际上提供了丰富的文本反馈,如运行时错误或评判结果,这些反馈解释了尝试失败的原因。我们将此设置形式化为具有丰富反馈的强化学习,并引入了自我蒸馏策略优化(SDPO),该方法将标记化的反馈转换为密集的学习信号,无需任何外部教师或显式的奖励模型。SDPO 将以反馈为条件的当前模型视为自我教师,并将其结合反馈得到的下一个标记预测蒸馏回策略。这样,SDPO 利用了模型在上下文中回顾性地识别自身错误的能力。在科学推理、工具使用和 LiveCodeBench v6 的竞技编程中,SDPO 相比强大的 RLVR 基线提高了样本效率和最终准确性。值得注意的是,在仅返回标量反馈的标准 RLVR 环境中,SDPO 通过将成功的采样轨迹(rollout)用作失败尝试的隐式反馈,同样优于基线。最后,在测试时将 SDPO 应用于单个问题可以加速困难二元奖励任务上的发现,以少 3 倍的尝试次数达到与 best-of-k 采样或多轮对话相同的发现概率。
Summary / 总结
The paper addresses the challenge of reinforcement learning with verifiable rewards, where the current methods only use scalar outcome rewards, leading to a credit-assignment bottleneck. It introduces Self-Distillation Policy Optimization (SDPO), which converts rich textual feedback into a dense learning signal. SDPO improves sample efficiency and final accuracy in scientific reasoning, tool use, and competitive programming tasks compared to strong RLVR baselines. It also outperforms baselines in scalar feedback environments by using successful rollouts as implicit feedback for failed attempts and accelerates discovery on binary-reward tasks with fewer attempts.
论文针对当前可验证奖励强化学习方法仅使用标量结果奖励所导致的信用分配瓶颈问题,提出了Self-Distillation Policy Optimization (SDPO) 方法,该方法将丰富的文本反馈转化为密集的学习信号。SDPO 在科学推理、工具使用和竞技编程任务中提高了样本效率和最终准确性,优于强 RLVR 基线。此外,在仅返回标量反馈的环境中,它通过将成功的采样轨迹(rollout)用作失败尝试的隐式反馈,同样优于基线,并在二元奖励任务中以更少的尝试加速了发现过程。
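The self-teacher idea described above can be illustrated with a per-token distillation loss. The abstract does not give the objective, so the sketch below rests on assumptions: the loss is taken to be the mean KL divergence from the feedback-conditioned distribution (teacher) to the prompt-only distribution (student), and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one attempt.

    student_logits[t]: logits at token t with only the prompt in context.
    teacher_logits[t]: logits from the same model at token t when the
    textual feedback is also in context (treated as a fixed target).
    The KL direction is an assumption, not taken from the paper.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy 3-token vocabulary, 2 positions (hypothetical numbers).
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]  # feedback changes token 0 only
loss = self_distillation_loss(student, teacher)
```

In this toy case only the first position contributes to the loss, which shows how such a signal is denser than a single scalar reward per attempt: it localizes the disagreement to specific tokens.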
Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces
Authors: Kaito Baba, Yoshihiko Ozaki, Shuhei Watanabe
First: 2026-01-28T17:44:36+00:00 · Latest: 2026-01-28T17:44:36+00:00
Comments: 16 pages, 9 figures
Abstract
We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importance estimates in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure.
中文标题/摘要
标题:条件PED-ANOVA:在层次化和动态搜索空间中的超参数重要性
我们提出了条件PED-ANOVA(condPED-ANOVA),这是一种在条件搜索空间中估计超参数重要性(HPI)的原理性框架,其中超参数的存在或域可以依赖于其他超参数。尽管原始的PED-ANOVA提供了一种快速而有效的方法来估计搜索空间中表现最佳区域内的HPI,但它假设了一个固定且无条件的搜索空间,因此无法正确处理条件超参数。为了解决这个问题,我们引入了表现最佳区域的条件HPI,并推导出一个闭式估计器,该估计器准确地反映了条件激活和域的变化。实验表明,在条件设置中,现有HPI估计器的简单改编会产生误导性或不可解释的重要性估计,而condPED-ANOVA始终提供有意义的重要性估计,能够反映潜在的条件结构。
Summary / 总结
The research proposes condPED-ANOVA, a framework to estimate hyperparameter importance in conditional search spaces where hyperparameters can depend on others. It addresses the limitations of the original PED-ANOVA by introducing conditional hyperparameter importance and a closed-form estimator. Experiments demonstrate that condPED-ANOVA provides more meaningful and interpretable importance estimates compared to naive adaptations of existing methods in conditional settings.
研究提出了condPED-ANOVA框架,用于估计条件搜索空间中依赖于其他超参数的超参数的重要性。它通过引入条件超参数重要性和闭合形式估计器解决了原始PED-ANOVA的局限性。实验表明,condPED-ANOVA在条件设置中提供了更具有意义和可解释性的重要性估计,而现有的方法则不然。
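To make "conditional search space" concrete, here is a small sketch of a space in which both the presence and the domain of a hyperparameter depend on another one. The hyperparameter names and ranges are hypothetical, and the sketch only shows the structure being analyzed; it does not implement condPED-ANOVA or its estimator.

```python
import random

def sample_config(rng):
    """Draw one configuration from a toy conditional search space."""
    config = {"optimizer": rng.choice(["sgd", "adam"])}
    if config["optimizer"] == "sgd":
        # Presence condition: momentum exists only when SGD is chosen.
        config["momentum"] = rng.uniform(0.0, 0.99)
        # Domain condition: SGD searches a wider learning-rate range.
        config["lr"] = 10 ** rng.uniform(-4, 0)
    else:
        config["lr"] = 10 ** rng.uniform(-5, -2)
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(1000)]
# Only a subset of trials has a value for momentum at all.
active = [c for c in configs if "momentum" in c]
```

An importance estimator that assumes a fixed space has to either drop `momentum` or impute it for the Adam trials, which is the kind of naive adaptation the abstract reports as misleading.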
EVEREST: An Evidential, Tail-Aware Transformer for Rare-Event Time-Series Forecasting
Authors: Antanas Zilinskas, Robert N. Shorten, Jakub Marecek
Venue: 14th International Conference on Learning Representations, 2026
First: 2026-01-26T23:15:20+00:00 · Latest: 2026-01-28T17:40:06+00:00
Comments: Updated author affiliation. No changes to technical content
Abstract
Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal-Inverse-Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
中文标题/摘要
标题:EVEREST:一种证据导向、尾部意识的变换器模型用于稀有事件时间序列预测
在多变量时间序列数据中预测稀有事件具有挑战性,因为存在严重的类别不平衡、长距离依赖性和分布不确定性。我们引入了EVEREST,一种基于变换器的架构,用于概率稀有事件预测,能够提供校准预测和尾部意识风险估计,并通过基于注意力的信号归因提供辅助可解释性。EVEREST 整合了四个组件:(i) 可学习的注意力瓶颈,用于软聚合时间动态;(ii) 证据头部,通过正态-逆伽玛分布估计 aleatoric 和 epistemic 不确定性;(iii) 极值头部,使用广义帕累托分布建模尾部风险;(iv) 轻量级先兆头部,用于早期事件检测。这些模块在训练时通过复合损失(焦点损失、证据 NLL 和尾部敏感的 EVT 罚则)联合优化;部署时使用单个分类头,无推理开销(约 0.81M 参数)。在十年的太阳风暴数据上,EVEREST 在 24/48/72 小时预测 C 类耀斑时达到最先进的真技能统计量 (TSS) 0.973/0.970/0.966。该模型紧凑,可在普通硬件上高效训练,并适用于工业监控、天气和卫星诊断等高风险领域。局限性包括依赖固定长度输入和排除基于图像的模态,这激励未来对流式和多模态预测的扩展。
Summary / 总结
EVEREST is a transformer-based architecture designed for probabilistic forecasting of rare events in multivariate time-series data. It integrates a learnable attention bottleneck, an evidential head for uncertainty estimation, an extreme-value head for tail risk modeling, and a lightweight precursor head for early-event detection. On space-weather data, EVEREST achieves a state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares, with calibrated predictions and tail-aware risk estimation. The model is compact and efficient to train on commodity hardware, and is suitable for high-stakes domains such as industrial monitoring, weather, and satellite diagnostics.
EVEREST 是一种基于 Transformer 的架构,用于多变量时间序列数据中罕见事件的概率预测。它结合了可学习的注意力瓶颈、用于不确定性估计的证据头、用于尾部风险建模的极值头以及用于早期事件检测的轻量级先兆头。EVEREST 在空间天气数据上实现了 C 类耀斑 24/48/72 小时预测的最先进 True Skill Statistic(TSS)值 0.973/0.970/0.966,展示了校准的预测和尾部风险估计。该模型紧凑且在普通硬件上高效训练,适用于工业监控、天气和卫星诊断等高风险领域。
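The True Skill Statistic quoted above is commonly defined as recall minus false-alarm rate, TSS = TP/(TP+FN) - FP/(FP+TN). The sketch below computes it from a confusion matrix. The counts are hypothetical and are not results from the paper.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = recall - false-alarm rate; ranges from -1 to 1."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative sample")
    recall = tp / (tp + fn)            # fraction of events that were caught
    false_alarm_rate = fp / (fp + tn)  # fraction of non-events flagged
    return recall - false_alarm_rate

# Hypothetical imbalanced test set: 50 events, 950 non-events.
tss = true_skill_statistic(tp=45, fn=5, fp=20, tn=930)
# A forecaster that always predicts "event" scores exactly zero.
always_positive = true_skill_statistic(tp=50, fn=0, fp=950, tn=0)
```

Because each term is normalized within its own class, TSS does not depend on the event/non-event ratio, which is why it is often preferred over accuracy for rare-event forecasting.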
JAFAR: Jack up Any Feature at Any Resolution
Authors: Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
First: 2025-06-10T20:53:12+00:00 · Latest: 2026-01-28T17:39:25+00:00
Comments: Code available at https://github.com/PaulCouairon/JAFAR
Abstract
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
中文标题/摘要
标题:JAFAR:任意特征任意分辨率提升
基础视觉编码器已成为广泛密集视觉任务中的重要组成部分。然而,它们的低分辨率空间特征输出需要进行特征上采样,以生成下游任务所需的高分辨率模态。在本文中,我们引入了JAFAR,这是一种轻量级且灵活的特征上采样器,能够将任何基础视觉编码器的视觉特征的空间分辨率提升到任意目标分辨率。JAFAR采用了一种基于注意力的模块,旨在通过空间特征变换(SFT)调制促进高分辨率查询与从低级图像特征派生的语义丰富低分辨率键之间的语义对齐。值得注意的是,尽管缺乏高分辨率监督,我们证明了在低上采样比和分辨率下学习能够很好地泛化到显著更高的输出尺度。广泛的实验表明,JAFAR能够有效恢复细粒度的空间细节,并且在多种下游任务中始终优于现有的特征上采样方法。项目页面见https://jafar-upsampler.github.io
REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
Authors: Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu
First: 2025-12-12T02:28:52+00:00 · Latest: 2026-01-28T17:37:46+00:00
Comments: 27 pages, 10 figures
Abstract
Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in streaming generation, leveraging a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
中文标题/摘要
标题:REST:基于ID-上下文缓存和异步流式蒸馏的实时端到端扩散驱动的说话人头生成
扩散模型在说话人头生成(THG)领域取得了显著进展。然而,缓慢的推理速度和普遍的非自回归范式严重限制了基于扩散的THG模型的应用。在本研究中,我们提出REST,一种开创性的基于扩散的、实时的端到端流式音频驱动说话人头生成框架。为了支持实时端到端生成,首先通过高压缩比的空间时间变分自编码器学习一个紧凑的视频潜在空间。此外,为了在紧凑的视频潜在空间中实现半自回归流式生成,我们引入了ID-上下文缓存机制,该机制将ID-Sink和上下文缓存原则整合到键值缓存中,以在长期流式生成过程中保持身份一致性和时间连贯性。此外,我们提出了异步流式蒸馏(ASD)策略,以减轻流式生成中的误差累积并增强时间连贯性,利用非流式教师和异步噪声调度来监督流式学生。REST弥合了自回归和基于扩散的方法之间的差距,实现了对需要实时THG的应用程序的效率突破。实验结果表明,REST在生成速度和整体性能上均优于最先进的方法。
Summary / 总结
REST is a real-time, end-to-end streaming talking head generation framework that uses diffusion models. It learns a compact video latent space with a spatiotemporal variational autoencoder and introduces an ID-Context Cache mechanism to maintain identity consistency and temporal coherence. Additionally, an Asynchronous Streaming Distillation strategy is employed to enhance temporal consistency. Experimental results show that REST outperforms existing methods in both generation speed and overall performance.
REST 是一种实时端到端的流式语音驱动头部生成框架,使用了扩散模型。它通过时空变分自编码器学习一个紧凑的视频潜在空间,并引入了ID-上下文缓存机制以保持身份一致性和时间连贯性。此外,还提出了异步流式蒸馏策略以增强时间连贯性。实验结果表明,REST 在生成速度和整体性能上均优于现有方法。
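The abstract describes a key-value cache that combines an identity "sink" with recent context but does not specify its layout. One plausible reading, offered purely as an assumption, is a cache that keeps the earliest entries permanently and a sliding window of the latest ones. The class name, parameters, and string entries below are hypothetical stand-ins for real key-value tensors.

```python
from collections import deque

class SinkAndWindowCache:
    """Keeps the first `num_sink` entries forever plus a recent window."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                      # earliest entries, never evicted
        self.recent = deque(maxlen=window)  # most recent entries only

    def append(self, entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)  # deque drops its oldest item when full

    def entries(self):
        # What attention would see at the next streaming step.
        return self.sink + list(self.recent)

cache = SinkAndWindowCache(num_sink=2, window=3)
for step in range(10):
    cache.append(f"kv{step}")
visible = cache.entries()
```

With this layout the cache size stays bounded at `num_sink + window` no matter how long the stream runs, while the earliest entries remain available as a stable reference.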
Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers
Authors: Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata
First: 2026-01-28T17:37:28+00:00 · Latest: 2026-01-28T17:37:28+00:00
Abstract
Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
中文标题/摘要
标题:剖析多模态上下文学习:模态不对称性和现代Transformer中的电路动态
基于Transformer的多模态大型语言模型通常表现出上下文学习(ICL)能力。受此现象启发,我们提出:Transformer是如何通过上下文示例学习跨模态关联信息的?我们通过在合成分类任务上训练小型Transformer进行受控实验来探讨这一问题,这使得可以精确操控数据统计和模型架构。我们首先回顾现代Transformer中单模态ICL的核心原则。尽管有若干先前发现得到重复,我们发现旋转位置嵌入(RoPE)提高了ICL的数据复杂度阈值。扩展到多模态设置揭示了一个基本的学习不对称性:当预训练于主要模态的高多样性数据上时,令人惊讶的是,次级模态中极低的数据复杂度足以使多模态ICL出现。机制分析表明,两种设置都依赖于一种归纳式的机制,即从匹配的上下文示例中复制标签;多模态训练则在模态间细化并扩展这些电路。我们的发现为理解现代Transformer中的多模态ICL提供了机制基础,并引入了一个受控测试床以供未来研究。
Summary / 总结
The study investigates how transformers learn to associate information across modalities using in-context examples. Through controlled experiments on small transformers trained on synthetic tasks, the research finds that Rotary Position Embeddings increase the data complexity threshold for in-context learning. In the multimodal setting, transformers can learn from low data complexity in the secondary modality when pretrained on high-diversity data from a primary modality. This learning relies on an induction mechanism that copies labels from matching examples, with multimodal training refining these circuits across modalities.
研究探讨了Transformer如何通过上下文示例学习跨模态信息关联。通过在合成任务上训练小型Transformer进行受控实验,研究发现旋转位置嵌入(RoPE)提高了上下文学习的数据复杂度阈值。在多模态设置中,当在主要模态的高多样性数据上预训练时,次级模态即使数据复杂度较低也能出现多模态上下文学习。这依赖于一种从匹配的上下文示例中复制标签的归纳式机制,多模态训练则将这些电路细化并扩展到其他模态。
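For readers unfamiliar with Rotary Position Embeddings, the sketch below shows the standard construction: each consecutive pair of vector components is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative offset. The vectors are arbitrary examples, and the code illustrates RoPE in general rather than the paper's experimental setup.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive component pairs by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, which is the relative-position property that distinguishes RoPE from absolute position embeddings.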
LLMBind: A Unified Modality-Task Integration Framework
Authors: Bin Zhu, Munan Ning, Peng Jin, Bin Lin, Jinfa Huang, Qi Song, Junwu Zhang, Zhenyu Tang, Mingjun Pan, Li Yuan
First: 2024-02-22T12:36:31+00:00 · Latest: 2026-01-28T17:35:06+00:00
Abstract
Despite recent progress in Multi-Modal Large Language Models (MLLMs), it remains challenging to integrate diverse tasks ranging from pixel-level perception to high-fidelity generation. Existing approaches often suffer from either restricted task extensibility or severe performance degradation due to modality interference. In this paper, we present LLMBind, an extensible framework that unifies multimodal tasks through a dual-pathway mechanism: In-Situ semantic embeddings for localization-sensitive tasks like semantic segmentation and Ex-Situ task-prompts for generation across image, video, and audio modalities. Additionally, we employ a Mixture-of-Experts (MoE) architecture to route task-specific tokens, thereby achieving modality disentanglement and mitigating negative transfer. We also curate a 400k multi-turn interactive dataset focused on iterative visual refinement to enable human-like interaction. Extensive experiments demonstrate that LLMBind achieves excellent performance across multiple perception and generation benchmarks while maintaining superior expandability.
中文标题/摘要
标题:LLMBind:统一多模态任务集成框架
尽管在多模态大型语言模型(MLLMs)方面取得了进展,但在从像素级感知到高保真生成的广泛任务集成方面仍然具有挑战性。现有方法往往要么任务扩展性受限,要么由于模态干扰导致性能严重下降。在本文中,我们提出了LLMBind,这是一种通过双重路径机制统一多模态任务的可扩展框架:原位语义嵌入用于定位敏感任务,如语义分割,以及离位任务提示用于跨图像、视频和音频模态的生成。此外,我们采用专家混合(MoE)架构路由任务特定的标记,从而实现模态解耦并减轻负迁移。我们还整理了一个专注于迭代视觉细化的40万轮交互数据集,以实现类似人类的交互。广泛的实验表明,LLMBind在多个感知和生成基准测试中表现出色,同时保持了出色的扩展性。
Summary / 总结
LLMBind is a framework designed to integrate various tasks in multi-modal large language models, addressing the challenges of task extensibility and modality interference. It uses a dual-pathway mechanism for semantic embeddings and task-prompts, and employs a Mixture-of-Experts architecture to achieve modality disentanglement. Extensive experiments show that LLMBind performs well across multiple perception and generation benchmarks and maintains strong expandability.
LLMBind 是一个框架,旨在整合多模态大型语言模型中的各种任务,解决任务扩展性和模态干扰的挑战。它使用语义嵌入和任务提示的双路径机制,并采用专家混合(MoE)架构来实现模态解耦。广泛的实验表明,LLMBind 在多个感知和生成基准测试中表现出色,并保持良好的扩展性。
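Mixture-of-Experts routing, mentioned above, can be sketched generically: a gate scores the experts for each token, the top-k experts are selected, and their outputs are mixed with renormalized gate weights. This is a generic top-k router under stated assumptions, not LLMBind's actual design; the scalar "token", the toy experts, and the gate logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k best-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](token) for i, w in route_top_k(gate_logits, k))

# Three toy experts acting on a scalar token (assumption for illustration).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
routed = route_top_k([2.0, 0.0, 1.0], k=2)
output = moe_forward(3.0, [2.0, 0.0, 1.0], experts, k=2)
```

Since only the selected experts run for a given token, tokens from different tasks can be processed by largely disjoint parameters, which is the usual argument for reduced interference.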
FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models
Authors: Haonan Zhong, Wei Song, Tingxu Han, Maurice Pagnucco, Jingling Xue, Yang Song
First: 2026-01-28T17:29:53+00:00 · Latest: 2026-01-28T17:29:53+00:00
Abstract
Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
中文标题/摘要
标题:FAIRT2V:无需训练的文本到视频扩散模型去偏方法
文本到视频(T2V)扩散模型取得了快速进展,但它们的人口统计学偏差,尤其是性别偏差,仍然未被充分探索。我们提出了FairT2V,这是一种无需训练的去偏框架,用于文本到视频生成,可以在不微调的情况下减轻编码器引起的偏差。我们首先分析了T2V模型中的人口统计学偏差,并表明它主要源自预训练的文本编码器,即使对于中性的提示,它们也会编码隐含的性别关联。我们通过与生成视频中的偏差相关联的性别倾向评分来量化这种效应。基于这一洞察,FairT2V 通过基于锚点的球面测地线变换来中和提示嵌入,同时保留语义,从而减轻人口统计学偏差。为了保持时间连贯性,我们通过动态去噪调度,仅在早期身份形成步骤中应用去偏。我们还提出了一种结合VideoLLM推理和人工验证的视频级公平性评估协议。在现代T2V模型Open-Sora上的实验表明,FairT2V 在对视频质量影响极小的情况下,显著减少了各职业上的人口统计学偏差。
Summary / 总结
The research aims to address gender bias in text-to-video (T2V) diffusion models by presenting FairT2V, a training-free debiasing framework. It mitigates bias by neutralizing prompt embeddings through anchor-based spherical geodesic transformations while preserving semantics. Experiments on Open-Sora show that FairT2V significantly reduces demographic bias with minimal effect on video quality.
研究旨在通过提出FairT2V,一种无需训练的去偏见框架,来解决文本到视频(T2V)扩散模型中的性别偏见问题。该框架通过锚点基于的球面测地变换来中和提示嵌入,同时保留语义。实验表明,FairT2V在减少人口统计学偏见的同时,对视频质量影响很小。
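A spherical geodesic transformation moves a unit vector along the great-circle arc toward another unit vector. The sketch below implements spherical linear interpolation (slerp) as one generic instance. How FairT2V selects anchors and step sizes is not stated in the abstract, so `neutral_anchor`, `strength`, and the 3-dimensional vectors are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def slerp(a, b, t):
    """Point at fraction t along the great-circle arc from a to b."""
    a, b = normalize(a), normalize(b)
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)  # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        if cos_omega > 0:
            return a  # vectors already coincide
        raise ValueError("antipodal vectors: the geodesic is not unique")
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

# Hypothetical embedding and anchor (assumptions, for illustration only).
prompt_embedding = [1.0, 0.0, 0.0]
neutral_anchor = [0.0, 1.0, 0.0]
strength = 0.5
debiased = slerp(prompt_embedding, neutral_anchor, strength)
```

Unlike a straight-line blend of the two vectors, the interpolated point stays on the unit sphere, so the embedding's norm is unchanged while its direction shifts.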
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Authors: Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick
Venue: ACL 2024 long
First: 2024-08-08T17:58:06+00:00 · Latest: 2026-01-28T17:27:29+00:00
Comments: correct wrong refs, typos
Abstract
Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor-intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription. This poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages, because most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
中文标题/摘要
标题:LogogramNLP:比较古代表意书写系统的视觉和文本表示以进行NLP
标准自然语言处理(NLP)管道基于符号语言表示,通常由一系列离散标记组成。然而,为古代表意书写系统创建类似的表示形式是一个极其劳动密集型的过程,需要专家知识。目前,由于缺乏转录,大量表意数据以纯视觉形式存在——这一问题成为研究人员将NLP工具包应用于研究古代表意语言的瓶颈:相关数据大多为书写图像。 本文探讨了直接处理语言的视觉表示是否可能提供一种解决方案。我们介绍了LogogramNLP,这是第一个使NLP分析古代表意语言成为可能的基准,其中包括四种书写系统的转录和视觉数据集,以及分类、翻译和解析等任务的注释。我们的实验比较了使用最近的视觉和文本编码策略作为基础的系统。结果表明,对于某些研究任务,视觉表示优于文本表示,这表明视觉处理管道可能为基于NLP的分析解锁大量表意语言的文化遗产数据。
Summary / 总结
This paper addresses the challenge of applying NLP to ancient logographic writing systems by introducing LogogramNLP, a benchmark with both transcribed and visual datasets for four writing systems, annotated for tasks such as classification, translation, and parsing. Comparing visual and textual encoders, the study finds that visual representations outperform textual ones on some tasks, suggesting that visual processing pipelines may open up untranscribed logographic data to NLP-based analysis.
本文通过引入LogogramNLP基准来应对将NLP应用于古代表意文字系统的挑战,该基准包含四种书写系统的转录和视觉数据集,并带有分类、翻译和解析等任务的标注。研究比较了视觉和文本编码方法,发现视觉表示在部分任务上优于文本表示,这表明视觉处理管道有望让尚未转录的表意文字数据也能用于基于NLP的分析。
SERA: Soft-Verified Efficient Repository Agents
Authors: Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers
First: 2026-01-28T17:27:08+00:00 · Latest: 2026-01-28T17:27:08+00:00
Comments: 21 main pages, 7 pages appendix
Abstract
Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.
中文标题/摘要
标题:SERA:软验证高效仓库代理
开放权重的代码代理本应比闭源系统具有根本优势:它们可以针对私有代码库进行专门化,将仓库特定的信息直接编码到权重中。然而,训练的成本和复杂性使这一优势一直停留在理论层面。我们证明这一优势如今已切实可行。我们提出了软验证高效仓库代理(SERA),这是一种高效的代码代理训练方法,能够快速且低成本地创建针对私有代码库专门化的代理。仅使用监督微调(SFT),SERA 就在完全开源(开放数据、方法、代码)的模型中取得了最先进的结果,同时与 Devstral-Small-2 等前沿开放权重模型的性能相当。在达到同等性能的前提下,创建 SERA 模型的成本比强化学习低 26 倍,比此前的合成数据方法低 57 倍。我们的方法软验证生成(SVG)可以从单个代码仓库生成数千条轨迹。结合其成本效益,这使得针对私有代码库的专门化成为可能。除了仓库专门化之外,我们还将 SVG 应用于更大规模的代码库语料,生成了超过 200,000 条合成轨迹。我们利用该数据集对训练代码代理的缩放定律、消融实验和混淆因素进行了详细分析。总体而言,我们相信这项工作将极大地加速开放代码代理的研究,并展示能够针对私有代码库专门化的开源模型的优势。我们将 SERA 作为 Ai2 开放代码代理(Open Coding Agents)系列的首个模型发布,同时公开全部代码、数据和 Claude Code 集成,以支持研究社区。
Summary / 总结
SERA is an efficient method for training coding agents that can be specialized to private codebases. Using only supervised finetuning, it achieves state-of-the-art results among fully open-source models while being 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods at equivalent performance. Its data method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository, which enables repository specialization and, at a scale of over 200,000 trajectories, a detailed analysis of scaling laws, ablations, and confounding factors.
SERA是一种高效的代码代理训练方法,能够快速且低成本地创建针对私有代码库专门化的代理。仅通过监督微调,SERA就在完全开源的模型中达到了最先进的效果;在达到同等性能的前提下,其成本比强化学习低26倍,比此前的合成数据方法低57倍。其数据方法Soft Verified Generation (SVG)可以从单个代码库生成数千条轨迹,既支持仓库专门化,也支持对缩放定律、消融实验和混淆因素的详细分析。
TIPO: Text to Image with Text Presampling for Prompt Optimization
Authors: Shih-Ying Yeh, Sang-Hyun Park, Yi Li, Giyeong Oh, Xuehai Wang, Min Song, Youngjae Yu
First: 2024-11-12T19:09:45+00:00 · Latest: 2026-01-28T17:24:46+00:00
Comments: 50 pages, 28 figures
Abstract
TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer and more detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO offers strong computational efficiency and scalability, opening new possibilities for effective automated prompt engineering in T2I tasks. Extensive experiments across multiple domains demonstrate that TIPO achieves stronger text alignment, reduced visual artifacts, and consistently higher human preference rates, while maintaining competitive aesthetic quality. These results highlight the effectiveness of distribution-aligned prompt engineering and point toward broader opportunities for scalable, automated refinement in text-to-image generation.
中文标题/摘要
标题:TIPO: 文本预采样以优化提示的文本到图像生成
TIPO(文本到图像提示优化)为文本到图像(T2I)生成中的自动提示精炼提出了一种高效的方法。从简单的用户提示出发,TIPO 利用一个轻量级的预训练模型将这些提示扩展为更丰富、更详细的版本。从概念上讲,TIPO 从更广阔语义空间内的目标子分布中采样精炼后的提示,在保留原始意图的同时显著提升视觉质量、连贯性和细节。与基于大型语言模型(LLMs)或强化学习(RL)的资源密集型方法不同,TIPO 具有出色的计算效率和可扩展性,为 T2I 任务中有效的自动化提示工程开辟了新的可能性。在多个领域的大量实验表明,TIPO 实现了更强的文本对齐、更少的视觉伪影和持续更高的人类偏好率,同时保持了有竞争力的美学质量。这些结果凸显了分布对齐提示工程的有效性,并为文本到图像生成中可扩展的自动化精炼指出了更广阔的机会。
Summary / 总结
TIPO (Text-to-Image Prompt Optimization) presents an efficient method for automatic prompt refinement in text-to-image generation. It starts with simple user prompts and uses a lightweight pre-trained model to expand them into richer, more detailed versions while preserving the original intent. Experiments across multiple domains show stronger text alignment, fewer visual artifacts, and consistently higher human preference rates, with competitive aesthetic quality. Unlike resource-intensive approaches based on large language models or reinforcement learning, TIPO remains computationally efficient and scalable.
TIPO (文本到图像提示优化)旨在通过自动细化简单的用户提示来提高文本到图像生成的质量。它使用轻量级预训练模型生成更详细和连贯的提示。实验表明,TIPO提高了文本对齐度,减少了视觉伪影,并增加了人类偏好率,同时保持了美学质量,展示了分布对齐提示工程的有效性。
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Authors: Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen, An-Zi Yen
First: 2025-06-13T08:13:05+00:00 · Latest: 2026-01-28T17:24:42+00:00
Abstract
Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
中文标题/摘要
标题:DaMO:一种用于视频LLM时间推理的数据高效多模态编排器
大型语言模型(LLMs)最近已被扩展到视频领域,实现了复杂精细的视频-语言理解。然而,现有的视频LLM在细粒度时间推理方面往往存在局限,这限制了它们将回答精确对应到特定视频时刻的能力,在监督受限的情况下尤其如此。我们提出了DaMO,一种专为准确的时间推理和多模态理解而设计的数据高效视频LLM。其核心是所提出的时间感知Fuseformer(Temporal-aware Fuseformer),它采用分层双流架构,逐步捕捉每个模态内的时间动态,并有效融合互补的视觉和音频信息。为了进一步提高计算效率,DaMO集成了全局残差,在保留关键语义细节的同时减少空间冗余。我们通过结构化的四阶段渐进式训练范式训练DaMO,逐步赋予模型多模态对齐、语义定位和时间推理能力。本工作还贡献了多个数据集,它们在现有数据集的基础上加入了由LLM生成的带时间定位的问答对,用于需要时间监督的任务。在时间定位和视频问答基准上的全面实验表明,DaMO始终优于先前的方法,在需要精确时间对齐和推理的任务中尤为明显。我们的工作为数据高效的视频-语言建模确立了一个有前景的方向。
Summary / 总结
DaMO is a data-efficient Video LLM designed for accurate temporal reasoning and multimodal understanding. Its Temporal-aware Fuseformer uses a hierarchical dual-stream architecture to capture temporal dynamics within each modality and fuse visual and audio information, and a global residual reduces spatial redundancy. DaMO is trained with a four-stage progressive paradigm that incrementally adds multimodal alignment, semantic grounding, and temporal reasoning. On temporal grounding and video QA benchmarks, DaMO outperforms previous methods, especially on tasks requiring precise temporal alignment and reasoning.
DaMO 是一种数据高效的视频 LLM,专为准确的时间推理和多模态理解而设计。其时间感知 Fuseformer 采用分层双流架构来捕捉各模态内的时间动态并融合视觉和音频信息,并通过全局残差减少空间冗余。DaMO 采用四阶段渐进式训练范式,逐步获得多模态对齐、语义定位和时间推理能力。在时间定位和视频问答基准测试中,DaMO 优于先前的方法,在需要精确时间对齐和推理的任务中尤为明显。
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Authors: Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang
First: 2025-06-12T17:25:53+00:00 · Latest: 2026-01-28T17:20:25+00:00
Abstract
Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.
中文标题/摘要
标题:Breaking Bad Molecules:MLLMs 是否已为结构级分子解毒做好准备?
毒性仍然是早期药物开发失败的主要原因之一。尽管分子设计和性质预测已取得进展,但分子毒性修复任务,即生成结构有效且毒性降低的替代分子,尚未得到系统的定义或基准评测。为填补这一空白,我们提出了ToxiMol,这是首个面向通用多模态大型语言模型(MLLMs)、专注于分子毒性修复的基准任务。我们构建了一个标准化数据集,涵盖11个主要任务和660个代表性有毒分子,覆盖多种机制和粒度。我们基于专家毒理学知识,设计了一条具备机制感知和任务自适应能力的提示标注流水线。同时,我们提出了自动评估框架ToxiEval,它将毒性终点预测、合成可及性、类药性和结构相似性整合为一条高通量评估链,用于判定修复是否成功。我们系统评估了43个主流通用MLLM,并开展了多项消融研究,以分析评估指标、候选多样性和失败归因等关键问题。实验结果表明,尽管当前的MLLM在这一任务上仍面临重大挑战,但它们已开始在毒性理解、语义约束遵守和结构感知编辑方面展现出有前景的能力。
Summary / 总结
The paper addresses the challenge of molecular toxicity repair, introducing ToxiMol as the first benchmark task for Multimodal Large Language Models (MLLMs). It constructs a standardized dataset of 660 toxic molecules and evaluates 43 MLLMs using an automated evaluation framework, ToxiEval. The results indicate that while current MLLMs face significant challenges, they show promising capabilities in understanding toxicity, adhering to semantic constraints, and performing structure-aware editing.
论文针对分子毒性修复的挑战,引入了ToxiMol作为首个针对多模态大型语言模型(MLLMs)的基准任务。构建了一个包含660种有毒分子的标准数据集,并使用自动化评估框架ToxiEval评估了43种主流MLLMs。实验结果表明,尽管当前MLLMs仍面临重大挑战,但在毒性理解、语义约束遵守和结构感知编辑方面显示出有前景的能力。
Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Authors: Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis
First: 2025-06-12T21:06:57+00:00 · Latest: 2026-01-28T17:19:05+00:00
Abstract
Curriculum learning, which organizes training data from easy to hard, has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18-45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, which is orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
中文标题/摘要
标题:超越随机采样:通过课程学习高效语言模型预训练
课程学习(按从易到难的顺序组织训练数据)已在机器学习的多个领域提高了效率,但在语言模型预训练中仍未得到充分探索。我们首次对LLM预训练中的课程学习进行了系统研究,在多达1000亿个标记上训练了200多个模型,涵盖三种策略:普通课程学习、基于节奏的采样和交错课程,并以六种涵盖语言学和信息论属性的难度指标为指导。我们在有限数据、无限数据和持续训练三种现实场景下,在八个基准上评估了性能。实验表明,课程学习在训练早期和中期始终能够加速收敛,将达到基线性能所需的训练步数减少18-45%。当在标准随机采样之前作为预热策略使用时,课程学习可带来最高3.5%的持续提升。我们发现压缩比、词汇多样性(MTLD)和可读性(Flesch Reading Ease)是最有效的难度信号。我们的研究结果表明,数据排序与现有的数据选择方法正交,为更高效的LLM预训练提供了一种实用机制。
Summary / 总结
This study explores curriculum learning, which organizes training data from easy to hard, for language model pretraining. The researchers trained over 200 models using three strategies and six difficulty metrics. Curriculum learning accelerates convergence in early and mid-training phases, cutting the training steps needed to reach baseline performance by 18-45%, and yields sustained improvements of up to 3.5% when used as a warmup strategy before random sampling. Compression ratio, lexical diversity, and readability were found to be the most effective difficulty signals.
该研究探索了在语言模型预训练中使用课程学习,即将训练数据按从易到难的顺序组织以提高效率。研究人员使用三种策略和六种难度指标训练了200多个模型。结果表明,课程学习在训练早期和中期加速收敛,将达到基线性能所需的训练步数减少18-45%,并在作为预热策略使用时带来最高3.5%的持续提升。压缩比、词汇多样性和可读性被确认为最有效的难度信号。
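The easy-to-hard ordering described in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a zlib-based compression ratio as the difficulty score and assumes that more compressible (more redundant) text counts as easier. The function names `compression_ratio` and `curriculum_order` are hypothetical.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (assumed difficulty score)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # treat empty documents as trivially easy
    return len(zlib.compress(raw)) / len(raw)


def curriculum_order(docs: list[str]) -> list[str]:
    """Sort documents from 'easy' to 'hard' under the assumed score."""
    return sorted(docs, key=compression_ratio)


docs = [
    "Quantum chromodynamics constrains hadron spectra via asymptotic freedom.",
    "the cat sat on the mat. " * 20,  # highly repetitive, compresses well
]
ordered = curriculum_order(docs)
```

In this toy example the repetitive document is placed first. Whether a low or high compression ratio should count as easy is a design choice that this sketch fixes arbitrarily.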