arXiv Paper Digest

2025-12-11 03:26
Snapshot: 20251211_0326
Astra: General Interactive World Model with Autoregressive Denoising
Authors: Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
First: 2025-12-09T18:59:57+00:00 · Latest: 2025-12-09T18:59:57+00:00
Comments: Code is available at: https://github.com/EternalEvan/Astra
Abstract
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
中文标题/摘要
标题:Astra:具有自回归去噪的通用交互式世界模型
最近在扩散变换器方面的进展使视频生成模型能够从文本或图像生成高质量的视频片段。然而,能够从过去的观察和行动预测长时未来的世界模型仍然很少被探索,尤其是在通用场景和各种形式的行动方面。为了弥合这一差距,我们引入了Astra,这是一种交互式的通用世界模型,能够为多种场景(例如自动驾驶、机器人抓取)生成真实世界的未来,并且具有精确的动作交互(例如相机运动、机器人动作)。我们提出了一种自回归去噪架构,并使用时间因果注意力来聚合过去的观察并支持流式输出。我们使用噪声增强的历史记忆来避免过度依赖过去的帧,以平衡响应性和时间一致性。为了实现精确的动作控制,我们引入了一个动作感知适配器,可以直接将动作信号注入去噪过程。我们进一步开发了一种动作专家混合模型,能够动态路由异构动作模态,从而增强在各种真实世界任务(如探索、操作和相机控制)中的灵活性。Astra实现了交互、一致和通用的长期视频预测,并支持各种形式的交互。在多个数据集上的实验表明,Astra在保真度、长距离预测和动作对齐方面优于现有的世界模型。
Summary / 总结
Astra is an interactive general world model that uses an autoregressive denoising architecture with temporal causal attention to predict long-term futures in various scenarios. It incorporates an action-aware adapter and a mixture of action experts to handle diverse actions, improving action alignment and versatility. Experiments show Astra outperforms existing models in terms of fidelity, long-range prediction, and action alignment.
Astra 是一种交互式的通用世界模型,利用自回归去噪和时间因果注意力来预测多种场景下的长期未来。它包含一个动作感知适配器以实现精确的动作控制,并采用动作专家混合体来处理不同的动作模态。实验表明,Astra 在保真度、长期预测和动作对齐方面优于现有模型。
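The temporal causal attention mentioned in the Astra abstract can be pictured with a small mask-building sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `temporal_causal_mask`, the frame-level (block-causal) granularity, and the fixed number of tokens per frame are all hypothetical choices made for the example.

```python
def temporal_causal_mask(num_frames, tokens_per_frame):
    """Build a block-causal attention mask (hypothetical helper).

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    n = num_frames * tokens_per_frame
    # Frame index of every token, assuming frames are laid out contiguously.
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


mask = temporal_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no token ever attends to a later frame, frames can be produced one at a time, which is the property that makes streaming output possible in a setup like this.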
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Authors: Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai
First: 2025-12-09T18:59:52+00:00 · Latest: 2025-12-09T18:59:52+00:00
Comments: Project Page: https://denghilbert.github.io/selfi/
Abstract
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
中文标题/摘要
标题:Selfi:通过三维几何特征对齐实现自我改进的重建引擎
传统上,新型视角合成(NVS)依赖于具有明确三维先验偏置的模型,并且需要结构从运动(SfM)的已知相机参数。最近的视觉基础模型如VGGT采取了不同的方法——通过训练数据和损失目标隐式获得三维知识,从而能够直接从一组未校准的图像中预测相机参数和三维表示。虽然灵活,但VGGT特征缺乏明确的多视角几何一致性,我们发现提高这种三维特征一致性对NVS和姿态估计任务都有益处。我们引入了Selfi,一种通过特征对齐实现自我改进的三维重建管道,通过利用其自身输出作为伪地面真值,将VGGT主干转换为高保真的三维重建引擎。具体来说,我们使用基于重投影的一致性损失训练了一个轻量级的特征适配器,将VGGT输出提炼到一个新的几何对齐特征空间中,捕捉三维中的空间邻近性。这在NVS和相机姿态估计中实现了最先进的性能,证明了特征对齐是下游三维推理中非常有益的步骤。
Summary / 总结
The research aims to improve the geometric consistency of 3D features in vision foundation models for better novel view synthesis and camera pose estimation. The method involves training a lightweight feature adapter to align VGGT outputs into a geometrically consistent feature space using a reprojection-based consistency loss. Key findings show that this approach enhances performance in both NVS and camera pose estimation tasks, highlighting the benefits of feature alignment for 3D reasoning.
Selfi 是一个自我改进的 3D 重建管道,通过特征对齐增强 VGGT 的几何一致性。它使用基于重投影的一致性损失训练一个轻量级特征适配器,将 VGGT 输出转换为几何对齐的特征空间。这在新型视图合成和相机姿态估计中实现了最先进的性能,展示了特征对齐对 3D 推理的益处。
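To make the idea of reprojection-based consistency concrete, here is a minimal geometric sketch. It is not Selfi's loss, which operates on learned features: it only measures the pixel residual after lifting a pixel from view A with its depth and projecting it into view B. The pinhole intrinsics (`focal`, `cx`, `cy`), the identity rotation between views, and all function names are assumptions made for the example.

```python
def unproject(pixel, depth, focal, cx, cy):
    """Lift a pixel with known depth to a 3D point in camera coordinates."""
    u, v = pixel
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def project(point, focal, cx, cy):
    """Project a 3D camera-space point onto the image plane (pinhole model)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def reprojection_error(pixel_a, depth_a, pixel_b, t_ab,
                       focal=100.0, cx=64.0, cy=64.0):
    """Pixel distance between the reprojection of A's point and B's pixel.

    Assumes the two cameras differ by a pure translation t_ab (no rotation).
    """
    point_a = unproject(pixel_a, depth_a, focal, cx, cy)
    point_b = tuple(p + t for p, t in zip(point_a, t_ab))
    u, v = project(point_b, focal, cx, cy)
    return ((u - pixel_b[0]) ** 2 + (v - pixel_b[1]) ** 2) ** 0.5


consistent = reprojection_error((64.0, 64.0), 2.0, (89.0, 64.0), (0.5, 0.0, 0.0))
inconsistent = reprojection_error((64.0, 64.0), 2.0, (92.0, 68.0), (0.5, 0.0, 0.0))
print(consistent, inconsistent)
```

A consistency loss in this spirit would penalize large residuals, so that predictions for corresponding points in different views agree once they are brought into a common frame.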
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Authors: Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi SM Sajjadi
First: 2025-12-09T18:57:21+00:00 · Latest: 2025-12-09T18:57:21+00:00
Comments: Project Page: https://d4rt-paper.github.io/
Abstract
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
中文标题/摘要
Title: Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
从视频中理解和重建动态场景的复杂几何形状和运动仍然是计算机视觉中的一个艰巨挑战。本文介绍了一种名为D4RT的简单而强大的前馈模型,旨在高效解决这一任务。D4RT利用统一的变压器架构,从单个视频中联合推断深度、时空对应关系和全摄像机参数。其核心创新是一种新颖的查询机制,避免了密集的逐帧解码的繁重计算和管理多个特定任务解码器的复杂性。我们的解码接口使模型能够独立且灵活地探究空间和时间中任何点的3D位置。结果是一种轻量级且高度可扩展的方法,能够实现极其高效的训练和推理。我们证明,我们的方法在一系列4D重建任务中达到了新的最先进水平,优于先前的方法。有关动画结果,请参阅项目网页:https://d4rt-paper.github.io/
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Authors: Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano
First: 2025-12-09T18:57:07+00:00 · Latest: 2025-12-09T18:57:07+00:00
Comments: Angela van Sprang and Laurens Samson contributed equally as first authors. Preprint
Abstract
We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed), and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as an image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modally inconsistent MLLMs.
中文标题/摘要
标题:相同内容,不同答案:多模态大语言模型跨模态不一致性
我们引入了两个新的基准测试REST和REST+(渲染等效压力测试),以系统地评估多模态大语言模型(MLLMs)中的跨模态不一致性。MLLMs被训练在同一嵌入空间中表示视觉和语言,但在两种模态中却不能执行相同任务。我们的基准测试包含在三种模态(图像、文本、混合)中具有相同语义信息的样本,并表明最先进的MLLMs不能一致地在这些不同模态上进行推理。我们评估了15个MLLMs,发现模态不一致的程度差异很大,即使考虑到文本识别(OCR)的问题也是如此。无论是将文本渲染为图像,还是将图像渲染为文本,都无法解决不一致性。即使OCR正确,我们发现视觉特征(文本颜色和分辨率,但不是字体)和视觉标记的数量也会影响模型性能。最后,我们发现我们的一致性分数与文本和图像之间的模态差距相关,突显了跨模态不一致的MLLMs的机制解释。
Summary / 总结
This study introduces the REST and REST+ benchmarks to evaluate cross-modal inconsistency in MLLMs. Although MLLMs are trained to represent vision and language in a shared embedding space, the 15 evaluated models show substantially varying degrees of inconsistency when reasoning over image, text, and mixed modalities, even when accounting for OCR problems. Visual characteristics (text colour and resolution, but not font) and the number of vision tokens affect model performance, and neither rendering text as an image nor rendering an image as text resolves the inconsistency. The consistency score correlates with the modality gap between text and images, suggesting a mechanistic explanation for cross-modal inconsistency in MLLMs.
研究引入了REST和REST+基准来评估MLLMs在跨模态一致性方面的表现。尽管在共享嵌入空间中训练,MLLMs在处理图像、文本和混合模态时表现出不同程度的不一致性。评估15个MLLMs后,研究发现视觉特征和视觉标记的数量显著影响模型性能,且将文本转换为图像或图像转换为文本并不能解决这种不一致性。一致性评分与文本和图像之间的模态差距相关,这表明跨模态不一致的MLLMs存在机制性的解释。
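The abstract does not define the REST consistency score, so the following is only one plausible way to quantify cross-modal agreement, offered as a hypothetical sketch: the fraction of samples for which a model gives the same answer in every modality. The function name and the data layout (one dict of modality to answer per sample) are assumptions.

```python
def consistency_score(answers):
    """Fraction of samples answered identically in all modalities.

    `answers` is a list with one dict per sample, mapping a modality
    name ("image", "text", "mixed") to the model's answer.
    This is an illustrative metric, not the paper's definition.
    """
    if not answers:
        return 0.0
    agree = sum(1 for sample in answers if len(set(sample.values())) == 1)
    return agree / len(answers)


samples = [
    {"image": "A", "text": "A", "mixed": "A"},  # consistent
    {"image": "B", "text": "C", "mixed": "C"},  # image answer differs
    {"image": "D", "text": "D", "mixed": "D"},  # consistent
    {"image": "A", "text": "B", "mixed": "C"},  # all differ
]
score = consistency_score(samples)
print(score)
```

Note that a score of this kind measures agreement only, not correctness: a model that is wrong in the same way in every modality would still count as consistent.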
Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Authors: Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim
First: 2025-12-09T18:56:54+00:00 · Latest: 2025-12-09T18:56:54+00:00
Abstract
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
中文标题/摘要
标题:统一扩散变换器用于高保真文本感知图像恢复
文本感知图像恢复(TAIR)旨在从包含退化文本内容的低质量输入中恢复高质量图像。虽然扩散模型为通用图像恢复提供了强大的生成先验知识,但在以文本为中心的任务中,它们通常会由于缺乏显式的语言知识而产生文本幻觉。为了解决这个问题,我们提出了一种统一的文本恢复框架UniT,该框架以迭代方式结合了扩散变换器(DiT)、视觉语言模型(VLM)和文本检测模块(TSM),以实现高保真文本恢复。在UniT中,VLM从退化图像中提取文本内容,提供显式的文本指导。同时,TSM在每个去噪步骤中生成中间OCR预测,使VLM能够在去噪过程中逐步细化其指导。最后,DiT骨干利用其强大的表征能力,利用这些线索恢复细粒度的文本内容,同时有效抑制文本幻觉。在SA-Text和Real-Text基准上的实验表明,UniT能够忠实恢复退化文本,显著减少幻觉,并在TAIR任务中实现最先进的端到端F1分数性能。
Summary / 总结
The paper introduces UniT, a unified framework for text-aware image restoration that integrates a Diffusion Transformer, a Vision-Language Model, and a Text Spotting Module. This approach iteratively refines text content in degraded images by extracting textual information, generating OCR predictions, and leveraging the strong representational power of the Diffusion Transformer to recover fine-grained text while reducing hallucinations. Experiments show that UniT outperforms existing methods in terms of faithful text reconstruction and reducing hallucinations, achieving state-of-the-art F1-score performance.
论文提出了一种名为UniT的统一文本恢复框架,结合了扩散变换器、视觉语言模型和文本检测模块。VLM从退化图像中提取文本内容,TSM生成中间OCR预测以引导VLM迭代。DiT骨干则利用其强大的表示能力,细化文本内容并抑制幻觉。实验表明,UniT有效地重建了退化文本,并在TAIR任务中达到了最先进的性能。
OSMO: Open-Source Tactile Glove for Human-to-Robot Skill Transfer
Authors: Jessica Yin, Haozhi Qi, Youngsun Wi, Sayantan Kundu, Mike Lambeta, William Yang, Changhao Wang, Tingfan Wu, Jitendra Malik, Tess Hellebrekers
First: 2025-12-09T18:56:30+00:00 · Latest: 2025-12-09T18:56:30+00:00
Comments: Project website: https://jessicayin.github.io/osmo_tactile_glove/
Abstract
Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer. The glove features 12 three-axis tactile sensors across the fingertips and palm and is designed to be compatible with state-of-the-art hand-tracking methods for in-the-wild data collection. We demonstrate that a robot policy trained exclusively on human demonstrations collected with OSMO, without any real robot data, is capable of executing a challenging contact-rich manipulation task. By equipping both the human and the robot with the same glove, OSMO minimizes the visual and tactile embodiment gap, enabling the transfer of continuous shear and normal force feedback while avoiding the need for image inpainting or other vision-based force inference. On a real-world wiping task requiring sustained contact pressure, our tactile-aware policy achieves a 72% success rate, outperforming vision-only baselines by eliminating contact-related failure modes. We release complete hardware designs, firmware, and assembly instructions to support community adoption.
中文标题/摘要
标题:OSMO:开源触觉手套,用于人类到机器人技能转移
人类的视频演示为学习机器人策略提供了丰富的训练数据,但视频无法捕捉到掌握操作所需的关键接触信号。我们介绍了OSMO,一种开源的可穿戴触觉手套,旨在用于人类到机器人技能转移。该手套配备了分布在指尖和手掌上的12个三轴触觉传感器,并设计为与最先进的手部追踪方法兼容,以便在野外收集数据。我们证明,仅使用OSMO收集的人类演示数据训练的机器人策略,无需任何真实机器人数据,也能够执行一项具有挑战性的接触密集型操作任务。通过为人类和机器人配备相同的手套,OSMO最小化了视觉和触觉的实体差距,使连续的切向和法向力反馈的转移成为可能,而无需使用图像补全或其他基于视觉的力推断方法。在一项需要持续接触压力的真实世界擦拭任务中,我们的触觉感知策略的成功率为72%,优于仅基于视觉的基线,消除了与接触相关的失败模式。我们发布了完整的硬件设计、固件和组装说明,以支持社区采用。
Summary / 总结
The research aims to address the limitation of video demonstrations in capturing rich contact signals for robot manipulation. OSMO, an open-source tactile glove, is introduced to facilitate human-to-robot skill transfer. The glove, equipped with 12 three-axis tactile sensors, allows for the collection of continuous shear and normal force feedback. Experiments show that a robot trained solely on human demonstrations using OSMO can successfully perform a complex wiping task with a 72% success rate, outperforming vision-only methods by avoiding contact-related failures.
研究旨在解决视频演示无法捕捉机器人操作所需丰富接触信号的问题。OSMO是一款开源触觉手套,旨在促进人类到机器人的技能转移。该手套配备了12个三轴触觉传感器,可以收集连续的切向和法向力反馈。实验表明,仅使用OSMO收集的人类演示训练的机器人,在执行一项具有持续接触压力的擦拭任务时,成功率达到72%,优于仅依赖视觉的方法,因为它避免了接触相关的失败模式。
Oscillations Make Neural Networks Robust to Quantization
Authors: Jonathan Wenshøj, Bob Pepin, Raghavendra Selvan
Venue: TMLR, 2835-8856, 2025
First: 2025-02-01T16:39:58+00:00 · Latest: 2025-12-09T18:55:38+00:00
Comments: Accepted to Transactions on Machine Learning Research (TMLR, 2025). Published version https://openreview.net/forum?id=bPwcJ0nkDC
Abstract
We challenge the prevailing view that weight oscillations observed during Quantization Aware Training (QAT) are merely undesirable side-effects and argue instead that they are an essential part of QAT. We show in a univariate linear model that QAT results in an additional loss term that causes oscillations by pushing weights away from their nearest quantization level. Based on the mechanism from the analysis, we then derive a regularizer that induces oscillations in the weights of neural networks during training. Our empirical results on ResNet-18 and Tiny Vision Transformer, evaluated on CIFAR-10 and Tiny ImageNet datasets, demonstrate across a range of quantization levels that training with oscillations followed by post-training quantization (PTQ) is sufficient to recover the performance of QAT in most cases. With this work we provide further insight into the dynamics of QAT and contribute a novel insight into explaining the role of oscillations in QAT which until now have been considered to have a primarily negative effect on quantization.
中文标题/摘要
标题:振荡使神经网络在量化过程中更具鲁棒性
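We challenge the view that the weight oscillations observed during Quantization Aware Training (QAT) are merely undesirable side effects, and argue instead that they are an essential part of QAT. Using a univariate linear model, we show that QAT gives rise to an additional loss term that causes oscillations by pushing weights away from their nearest quantization level. Based on the mechanism identified in this analysis, we then derive a regularizer that induces oscillations in the weights during training. Our experimental results on ResNet-18 and Tiny Vision Transformer, evaluated on the CIFAR-10 and Tiny ImageNet datasets, show across a range of quantization levels that training with oscillations followed by post-training quantization (PTQ) is sufficient to recover the performance of QAT in most cases. With this work we shed further light on the dynamics of QAT and offer a new insight into the role of oscillations in QAT, which until now have been considered to have a primarily negative effect on quantization.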
Summary / 总结
The study challenges the notion that weight oscillations during Quantization Aware Training (QAT) are merely side-effects and instead posits they are crucial for QAT. Through a univariate linear model, the authors show that QAT introduces an additional loss term causing oscillations. They propose a regularizer to induce these oscillations during training. Experiments on ResNet-18 and Tiny Vision Transformer show that training with oscillations followed by post-training quantization recovers QAT performance in most cases across various quantization levels.
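The study challenges the view that weight oscillations in Quantization Aware Training (QAT) are merely a side effect, arguing that they are in fact a key part of QAT. By analyzing a univariate linear model, the authors show that QAT introduces an additional loss term that leads to oscillations. They also develop a regularizer that induces these oscillations during training. Experiments show that, across quantization levels, training with oscillations followed by post-training quantization (PTQ) recovers QAT performance in most cases.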
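For readers unfamiliar with weight oscillations in QAT, the toy simulation below shows the phenomenon for a single weight. It is a textbook-style sketch and not the paper's derivation or regularizer: it assumes round-to-nearest quantization with step 1, a squared-error loss on the quantized weight, and a straight-through estimator (the quantizer's gradient is treated as 1). All names and constants are illustrative.

```python
import math


def quantize(w, step=1.0):
    """Round-to-nearest uniform quantizer (ties round up)."""
    return step * math.floor(w / step + 0.5)


def train_with_ste(w=0.23, target=0.3, lr=0.1, steps=50):
    """Gradient descent on (quantize(w) - target)^2 with a straight-through
    estimator. Returns the quantized value seen at every step."""
    history = []
    for _ in range(steps):
        q = quantize(w)
        grad = 2.0 * (q - target)  # STE: d quantize(w) / d w is taken to be 1
        w -= lr * grad
        history.append(q)
    return history


history = train_with_ste()
flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
print(history[:12], flips)
```

Since the target 0.3 lies between the quantization levels 0 and 1, neither level yields a zero gradient, so the latent weight keeps crossing the rounding threshold at 0.5 and the quantized value flips back and forth instead of settling.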
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
Authors: Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
First: 2025-12-09T18:47:56+00:00 · Latest: 2025-12-09T18:47:56+00:00
Comments: Preprint. 12 pages, 9 figures. Project page: https://simondemoreau.github.io/LiDAS/
Abstract
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
中文标题/摘要
标题:LiDAS:夜间照明驱动的动态主动传感
基于相机的感知在夜间环境中面临重大挑战,因为现有方法被动地依赖于场景照明。我们提出了照明驱动的动态主动传感(LiDAS),这是一种闭环主动照明系统,结合了现成的视觉感知模型和高分辨率前大灯。LiDAS 不是均匀地照亮整个场景,而是动态地预测一个最优的照明场,以最大化下游感知性能,即减少空旷区域的光照,将光线重新分配到目标区域。LiDAS 通过自适应照明控制使白天训练的模型在夜间实现零样本泛化。LiDAS 在合成数据上训练,并在实际闭环驾驶场景中零样本部署,使其在相同功率下比标准低光束的mAP50提高18.7%,mIoU提高5.0%。同时,它在保持性能的同时减少了40%的能源使用。LiDAS 补充了领域泛化方法,进一步增强了鲁棒性而无需重新训练。通过将现成的前大灯转变为活性视觉执行器,LiDAS 提供了一种经济有效的解决方案,以实现稳健的夜间感知。
Summary / 总结
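LiDAS is a closed-loop active illumination system that enhances nighttime perception by dynamically adjusting headlight illumination according to a predicted optimal illumination field, improving downstream perception performance. It achieves an 18.7% increase in mAP50 and a 5.0% increase in mIoU over standard low-beam lighting at equal power, and it maintains performance while reducing energy use by 40%. The method complements domain-generalization techniques and offers a cost-effective solution for robust nighttime perception without retraining models.
LiDAS is a closed-loop active illumination system that strengthens nighttime perception by dynamically adjusting the illumination field to optimize downstream perception performance. Using high-definition headlights and visual perception models, it reduces light on empty areas and reallocates it to object regions. It lets daytime-trained models generalize zero-shot to real driving scenarios, improving mAP50 by 18.7% and mIoU by 5.0% over standard low-beam at equal power, and it maintains performance while cutting energy use by 40%.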
Self-Evolving 3D Scene Generation from a Single Image
Authors: Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang
First: 2025-12-09T18:44:21+00:00 · Latest: 2025-12-09T18:44:21+00:00
Abstract
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
中文标题/摘要
标题:从单张图像自演化生成3D场景
从单张图像生成高质量、纹理化的3D场景仍然是视觉和图形学中的一个基本挑战。最近的图像到3D生成器可以从单个视角恢复合理的几何结构,但它们以对象为中心的训练限制了其在复杂、大规模场景中的泛化能力,这些场景需要忠实的结构和纹理。我们提出了EvoScene,这是一种无需训练的自演化框架,可以逐步从单张图像中重建完整的3D场景。关键思想是结合现有模型的互补优势:3D生成模型的几何推理和视频生成模型的视觉知识。通过三个迭代阶段——空间先验初始化、视觉引导的3D场景网格生成和空间引导的新视角生成——EvoScene 在2D和3D领域之间交替,逐步提高结构和外观。在多种场景上的实验表明,与强大的基线相比,EvoScene 在几何稳定性、视图一致的纹理以及未见区域的完成方面表现出优越性,生成了可用于实际应用的3D网格。
Summary / 总结
The research aims to generate high-quality 3D scenes from a single image, addressing the limitations of existing object-centric models in handling complex scenes. EvoScene, a self-evolving framework, progressively reconstructs 3D scenes through three stages: spatial prior initialization, visual-guided 3D scene mesh generation, and spatial-guided novel view generation. The framework combines geometric reasoning from 3D generation models and visual knowledge from video generation models. Experiments show that EvoScene outperforms strong baselines in terms of geometric stability, view-consistent textures, and unseen-region completion, producing ready-to-use 3D meshes for practical applications.
研究旨在从单张图像生成高质量的3D场景,解决现有以对象为中心的模型在处理复杂场景时的局限性。EvoScene是一个自我进化的框架,结合了3D生成模型的几何推理和视频生成模型的视觉知识。通过三个阶段,它逐步重建完整的3D场景,提升结构和外观。实验表明,EvoScene在几何稳定性、视图一致的纹理以及未见区域的完成方面优于强基线,生成可用于实际应用的3D网格。
Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN)
Authors: Y. Sungtaek Ju
First: 2025-12-04T22:58:32+00:00 · Latest: 2025-12-09T18:44:00+00:00
Comments: 20 pages, 3 figures
Abstract
Kolmogorov-Arnold Networks have emerged as interpretable alternatives to traditional multi-layer perceptrons. However, standard implementations lack principled uncertainty quantification capabilities essential for many scientific applications. We present a framework integrating sparse variational Gaussian process inference with the Kolmogorov-Arnold topology, enabling scalable Bayesian inference with computational complexity quasi-linear in sample size. Through analytic moment matching, we propagate uncertainty through deep additive structures while maintaining interpretability. We use three example studies to demonstrate the framework's ability to distinguish aleatoric from epistemic uncertainty: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting of advection-diffusion dynamics, and out-of-distribution detection in convolutional autoencoders. These results suggest that Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KANs) are a promising architecture for uncertainty-aware learning in scientific machine learning.
中文标题/摘要
标题:使用稀疏变分高斯过程柯尔莫哥洛夫-阿诺尔德网络(SVGP KAN)的科学机器学习中的不确定性量化
柯尔莫哥洛夫-阿诺尔德网络已作为可解释的替代方案出现,与传统的多层感知机相比。然而,标准实现缺乏许多科学应用中所需的原理性不确定性量化能力。我们提出了一种框架,将稀疏变分高斯过程推理与柯尔莫哥洛夫-阿诺尔德拓扑相结合,使大规模贝叶斯推理的计算复杂度接近线性。通过分析矩匹配,我们在保持可解释性的同时,将不确定性传递通过深层加性结构。我们通过三个示例研究展示了该框架区分 aleatoric 和 epistemic 不确定性的能力:流体流动重构中的异方差测量噪声校准,多步预报对流-扩散动力学预测置信度下降的量化,以及卷积自编码器中的离分布检测。这些结果表明稀疏变分高斯过程柯尔莫哥洛夫-阿诺尔德网络(SVGP KANs)是科学机器学习中不确定性感知学习的一种有前途的架构。
Summary / 总结
The research aims to address the lack of principled uncertainty quantification in Kolmogorov-Arnold Networks, which are used as interpretable alternatives to multi-layer perceptrons. The study introduces a framework that combines sparse variational Gaussian process inference with the Kolmogorov-Arnold topology, allowing scalable Bayesian inference with computational complexity quasi-linear in sample size. Three example studies show that SVGP KANs can distinguish aleatoric from epistemic uncertainty: calibrating heteroscedastic measurement noise in fluid flow reconstruction, quantifying how prediction confidence degrades in multi-step forecasting of advection-diffusion dynamics, and detecting out-of-distribution inputs in convolutional autoencoders.
研究旨在通过将稀疏变分高斯过程推断与柯尔莫哥洛夫-阿诺尔德拓扑相结合,解决柯尔莫哥洛夫-阿诺尔德网络缺乏不确定性量化的问题。该方法结合了柯尔莫哥洛夫-阿诺尔德网络的可解释性和可扩展的贝叶斯推断,允许计算复杂度接近线性。关键发现包括在流体流动重构、多步预测和卷积自编码器的离分布检测等科学应用中区分 aleatoric 和 epistemic 不确定性的能力。
Open Polymer Challenge: Post-Competition Report
Authors: Gang Liu, Sobin Alosious, Subhamoy Mahajan, Eric Inae, Yihan Zhu, Yuhan Liu, Renzheng Zhang, Jiaxin Xu, Addison Howard, Ying Li, Tengfei Luo, Meng Jiang
Venue: NeurIPS
First: 2025-12-09T18:38:15+00:00 · Latest: 2025-12-09T18:38:15+00:00
Comments: The report for the competition: "NeurIPS - Open Polymer Prediction 2025". Kaggle Page: https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025. Website: https://open-polymer-challenge.github.io
Abstract
Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.
中文标题/摘要
标题:开放聚合物挑战:赛后报告
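Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) fills this gap by releasing the first community-developed benchmark for polymer informatics, which contains 10,000 polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints including small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data lay a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.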
Summary / 总结
The Open Polymer Challenge (OPC) was designed to address the lack of large, high-quality, and openly accessible polymer datasets for machine learning applications. Participants developed models for predicting five polymer properties under realistic constraints, including small data and label imbalance. Key findings include the importance of data preparation and handling distribution shifts, with models and data released to accelerate sustainable materials development.
Open Polymer挑战旨在通过提供包含10,000种聚合物和五种属性的大规模、高质量和开放访问数据集来推动聚合物信息学的发展。参赛者在现实约束条件下开发了模型,使用了特征增强和迁移学习等技术。该挑战强调了数据准备和跨组模拟一致性的重要性,从而改进了未来聚合物数据集的最佳实践,并加速了可持续材料的发展。
Escaping the Verifier: Learning to Reason via Demonstrations
Authors: Locke Cai, Ivan Provilkov
First: 2025-11-26T18:42:52+00:00 · Latest: 2025-12-09T18:37:56+00:00
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
中文标题/摘要
标题:超越验证者:通过示范学习推理
训练大型语言模型(LLMs)进行推理通常依赖于特定任务的强化学习(RL)和验证器。然而,许多实际的推理密集型任务缺乏验证器,尽管这些任务提供了大量未充分利用的专家示范。我们引入了RARO(相对对抗推理优化),通过逆强化学习仅从专家示范中学习强大的推理能力。我们的方法设置了一个对抗游戏,其中策略和相对批评家之间相互对抗:策略学习模仿专家答案,而批评家则试图识别(专家,策略)答案对中的专家。策略和批评家通过RL联合和连续训练,并且我们确定了实现稳健学习的关键稳定化技术。实验结果表明,RARO在我们的所有评估任务——倒计时、DeepMath和诗歌创作——中显著优于强大的无验证器基线,并且享受与带有验证器的RL相同的稳健扩展趋势。这些结果表明,我们的方法能够仅从专家示范中有效激发强大的推理性能,即使在特定任务的验证器不可用时也能实现稳健的推理学习。
Unsupervised Learning of Density Estimates with Topological Optimization
Authors: Suina Tanweer, Firas A. Khasawneh
First: 2025-12-09T18:35:51+00:00 · Latest: 2025-12-09T18:35:51+00:00
Abstract
Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwidth is critical as it controls the bias-variance trade-off by over- or under-smoothing the topological features. Topological data analysis provides methods to mathematically quantify topological characteristics, such as connected components, loops, voids et cetera, even in high dimensions where visualization of density estimates is impossible. In this paper, we propose an unsupervised learning approach using a topology-based loss function for the automated and unsupervised selection of the optimal bandwidth and benchmark it against classical techniques -- demonstrating its potential across different dimensions.
中文标题/摘要
标题:基于拓扑优化的无监督密度估计学习
核密度估计是机器学习、贝叶斯推断、随机动力学和信号处理中广泛算法的关键组成部分。然而,无监督密度估计技术需要调整一个关键的超参数:核带宽。带宽的选择至关重要,因为它通过过度平滑或欠平滑拓扑特征来控制偏差-方差权衡。拓扑数据分析提供了方法来数学量化高维空间中密度估计的拓扑特征,如连通分量、环、空洞等,即使在可视化密度估计不可行的情况下也是如此。在本文中,我们提出了一种基于拓扑的损失函数的无监督学习方法,用于自动和无监督选择最优带宽,并将其与经典技术进行基准测试,展示了其在不同维度中的潜力。
Summary / 总结
This paper addresses unsupervised kernel density estimation by proposing an automated method for selecting the kernel bandwidth. The approach optimizes the bandwidth with a topology-based loss function, since the bandwidth controls the bias-variance trade-off by over- or under-smoothing topological features. The method is benchmarked against classical bandwidth-selection techniques and shows potential across different dimensions, including high-dimensional settings where density estimates cannot be visualized.
本文提出了一种自动选择最优核带宽的方法,以解决机器学习中的无监督密度估计问题。该方法利用基于拓扑的损失函数来优化带宽,从而平衡偏差和方差之间的权衡。实验表明,该方法在不同维度上优于经典技术,特别是在高维空间中可视化密度估计困难的情况下表现更优。
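To make the bandwidth trade-off concrete, the sketch below counts the modes of a one-dimensional Gaussian kernel density estimate at several bandwidths. It only illustrates how the bandwidth changes a simple topological feature (the number of peaks); it is not the authors' topology-based loss. The function names `gaussian_kde_1d` and `count_modes`, the sample data, and the bandwidth values are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

def count_modes(density):
    """Count strict local maxima of a density sampled on a regular grid."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

# Hypothetical data: two well-separated clusters.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])
grid = np.linspace(-6.0, 6.0, 601)

# Small bandwidths under-smooth (many spurious peaks); large ones over-smooth.
for h in (0.05, 0.5, 5.0):
    print(h, count_modes(gaussian_kde_1d(samples, grid, h)))
```

A very large bandwidth merges the two clusters into a single peak, while a very small one produces many spurious peaks. A bandwidth selector has to land between these two failure modes.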
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Authors: Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace
First: 2025-09-16T17:59:04+00:00 · Latest: 2025-12-09T18:35:28+00:00
Comments: 40 pages, 6 figures. Updated and added content
Abstract
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
中文标题/摘要
标题:自然语言描述的模型激活是否传达了特权信息?
最近的可解释性方法提出使用第二个语言模型将LLM的内部表示转换为自然语言描述。这旨在阐明目标模型如何表示和处理输入。但这些激活语言化方法是否实际上提供了关于目标模型内部工作机制的特权知识,还是仅仅传达了其输入的信息?我们对先前工作中使用的数据集进行了批判性评估,发现这些方法可以在不访问目标模型内部的情况下成功完成基准测试,这表明这些数据集可能不是评估语言化方法的理想选择。然后我们进行了受控实验,发现语言化往往反映了生成它们的第二个语言模型的参数化知识,而不是解码其激活的目标语言模型的知识。综上所述,我们的结果表明需要有针对性的基准测试和实验控制,以严格评估语言化方法是否提供了关于LLM操作的有意义见解。
Summary / 总结
The study investigates whether natural language descriptions of model activations provide privileged information about the target model's internal workings or merely reflect input-related information. By evaluating popular verbalization methods across various datasets and conducting controlled experiments, the researchers found that these methods can perform well on benchmarks without accessing the target model's internals. The experiments suggest that verbalizations often reflect the parametric knowledge of the verbalizer LLM rather than the target LLM's knowledge. This indicates a need for more targeted benchmarks and experimental controls to assess the meaningfulness of verbalization methods in understanding LLM operations.
研究探讨自然语言描述的模型激活是否提供了关于目标模型内部工作机制的特权信息,还是仅仅反映了输入相关的信息。通过评估各种数据集上的流行描述方法,并进行受控实验,研究人员发现这些方法可以在不访问目标模型内部的情况下成功完成基准测试。实验表明,描述往往反映了生成它们的描述器LLM的参数知识,而不是目标LLM的知识。这表明需要更针对性的基准和实验控制来评估描述方法在理解LLM操作方面的意义。
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Authors: Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram
First: 2025-12-09T18:33:48+00:00 · Latest: 2025-12-09T18:33:48+00:00
Abstract
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
中文标题/摘要
标题:大型语言模型训练中下游指标缩放性质的再探讨
虽然大型语言模型(LLMs)的缩放定律传统上侧重于预训练损失等代理指标,但预测下游任务性能一直被认为不可靠。本文通过提出一种直接框架来建模基准性能随训练预算的变化,挑战了这一观点。我们发现,在固定标记-参数比的情况下,简单的幂律可以准确描述多个流行下游任务上对数准确率的缩放行为。我们的结果表明,直接方法比之前提出的两阶段程序外推效果更好,后者容易累积误差。此外,我们引入了能够跨不同标记-参数比预测准确率的函数形式,并考虑了重复采样下的推理计算。我们在两种数据集混合上,用至多170亿参数、至多3500亿标记训练的模型验证了这些发现。为了支持可重复性和鼓励未来研究,我们发布了完整的预训练损失和下游评估结果。
Summary / 总结
This paper revisits the scaling properties of downstream metrics in Large Language Model (LLM) training, challenging the view that downstream task performance cannot be predicted reliably. The authors propose a direct framework that models benchmark performance as a function of the training budget and find that, for a fixed token-to-parameter ratio, a simple power law accurately describes the scaling of log accuracy on multiple downstream tasks. The direct approach extrapolates better than the earlier two-stage procedure, which is prone to compounding errors. The study also introduces functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. The findings are validated on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures, and the authors release the complete set of pretraining losses and downstream evaluation results.
本文重新审视了大规模语言模型(LLM)训练中下游指标的扩展特性,挑战了下游任务性能难以可靠预测的观点。通过提出一个直接框架,从训练预算直接建模基准性能的扩展行为,作者发现在固定token-to-parameter比下,简单的幂律可以准确描述多个下游任务上对数准确率的扩展行为。直接方法的外推效果优于容易累积误差的两阶段方法。研究还引入了函数形式来预测不同token-to-parameter比下的准确率,并考虑了重复采样下的推理计算。这些发现在最多17B参数、最多350B tokens训练的模型以及两个数据集混合上得到验证,作者还发布了完整的预训练损失和下游评估结果。
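As a rough illustration of a power law in log accuracy, the sketch below fits the form -log(accuracy) = A * C^(-alpha) by ordinary least squares in log-log space and then extrapolates to a larger budget. The exact functional form used in the paper is not given in the abstract, so this form, the helper names `fit_power_law` and `predict_accuracy`, and the synthetic data points are assumptions.

```python
import numpy as np

def fit_power_law(compute, accuracy):
    """Fit -log(accuracy) = A * compute**(-alpha) via a line in log-log space."""
    x = np.log(compute)
    y = np.log(-np.log(accuracy))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope

def predict_accuracy(compute, A, alpha):
    """Accuracy implied by the assumed power law at a given training budget."""
    return np.exp(-A * compute ** (-alpha))

# Synthetic (made-up) observations generated from a known power law.
true_A, true_alpha = 4000.0, 0.2
compute = np.array([1e18, 1e19, 1e20, 1e21])
accuracy = predict_accuracy(compute, true_A, true_alpha)

A, alpha = fit_power_law(compute, accuracy)
print(A, alpha, predict_accuracy(1e22, A, alpha))
```

Because the synthetic points lie exactly on the assumed curve, the fit recovers the generating parameters. Real benchmark data would be noisy and would need held-out budgets to check extrapolation.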
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation
Authors: Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
First: 2025-12-04T17:50:53+00:00 · Latest: 2025-12-09T18:32:43+00:00
Comments: 22 pages
Abstract
Modern Large Language Models achieve impressive reasoning capabilities with long chains of thought, but they incur substantial computational cost during inference, which motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
中文标题/摘要
标题:套利:基于优势感知的高效推理
现代大型语言模型凭借长链推理实现了令人印象深刻的推理能力,但在推理过程中产生了巨大的计算成本,这促使人们开发技术以提高性能与成本的比例。这些技术中,推测性解码通过使用快速但不准确的草稿模型自回归地提出令牌,然后由更强大的目标模型并行验证,从而加速推理。然而,由于在语义等价步骤中由于令牌不匹配导致不必要的拒绝,传统的令牌级推测性解码在推理任务中表现不佳。尽管最近的研究转向了步骤级语义验证,这提高了效率,通过接受或拒绝整个推理步骤,但现有的步骤级方法仍然会重新生成许多被拒绝的步骤,浪费了宝贵的目标计算资源。为了解决这一挑战,我们提出了套利,这是一种新颖的步骤级推测生成框架,根据草稿模型和目标模型之间的相对优势动态路由生成。套利不采用固定的接受阈值,而是使用一个轻量级的路由器,训练它预测目标模型何时可能产生更有意义的步骤。这种路由近似于理想的套利Oracle,它总是选择质量更高的步骤,实现了接近最优的效率-准确度权衡。在多个数学推理基准测试中,套利始终超越了先前的步骤级推测性解码基线,将匹配准确度下的推理延迟降低了多达约2倍。
Summary / 总结
The research aims to make reasoning with Large Language Models more efficient by addressing the inefficiencies of existing Speculative Decoding methods: token-level verification rejects semantically equivalent steps, and step-level methods regenerate many rejected steps with little improvement. The proposed Arbitrage framework routes generation dynamically based on the relative advantage between draft and target models, using a lightweight router to predict when the target model is likely to produce a meaningfully better step. Across multiple mathematical reasoning benchmarks, Arbitrage outperforms previous step-level Speculative Decoding baselines, reducing inference latency by up to roughly 2x at matched accuracy.
论文提出了一种新颖的步骤级投机生成框架Arbitrage,以解决传统投机解码方法在大型语言模型中的低效问题。该框架通过轻量级路由器动态路由生成,基于草稿模型和目标模型之间的相对优势来减少不必要的拒绝。实验表明,Arbitrage 在多个数学推理基准测试中优于之前的步骤级投机解码基线,可将推理延迟最多减少2倍,同时保持匹配的准确性。
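The routing idea can be sketched as a simple control loop: the draft model proposes each reasoning step, and a router decides whether the target model should produce that step instead. This is a minimal illustration under stated assumptions, not the authors' implementation. The function `route_generation`, the callable interfaces, the threshold on the router score, and the toy stand-in models are all hypothetical.

```python
from typing import Callable, List, Tuple

def route_generation(
    draft_step: Callable[[List[str]], str],
    target_step: Callable[[List[str]], str],
    router_score: Callable[[List[str], str], float],
    is_done: Callable[[List[str]], bool],
    threshold: float = 0.5,
    max_steps: int = 16,
) -> Tuple[List[str], int]:
    """Build a step list, calling the target model only when the router asks for it."""
    steps: List[str] = []
    target_calls = 0
    while not is_done(steps) and len(steps) < max_steps:
        candidate = draft_step(steps)
        # Assumed convention: a high score means the target is predicted
        # to produce a meaningfully better step than the draft.
        if router_score(steps, candidate) > threshold:
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls

# Toy stand-ins for the draft model, target model, and router.
draft = lambda ctx: f"draft-{len(ctx)}"
target = lambda ctx: f"target-{len(ctx)}"
router = lambda ctx, cand: 0.9 if len(ctx) % 3 == 2 else 0.1
done = lambda ctx: len(ctx) >= 6

steps, calls = route_generation(draft, target, router, done)
print(steps, calls)
```

In this toy run the target model is consulted on only two of six steps. The saving in a real system would depend on how well the router predicts when the target's step is actually better.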
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
Authors: Damiano Marsili, Georgia Gkioxari
First: 2025-12-09T18:30:23+00:00 · Latest: 2025-12-09T18:30:23+00:00
Comments: Project webpage: https://glab-caltech.github.io/valor/
Abstract
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
中文标题/摘要
标题:无标签,无问题:使用多模态验证器训练视觉推理器
视觉推理具有挑战性,需要精确的对象定位和理解复杂的空间关系。现有方法分为两类:仅基于语言的链式思考方法,需要大规模(图像、查询、答案)监督,以及程序合成方法,使用预训练模型并避免训练,但逻辑有误且定位错误。我们提出了一种无需标注的训练框架,以提高推理和定位。该框架使用AI验证器:通过强化学习改进LLM推理,而通过自动化困难负样本挖掘增强视觉定位,从而消除对真实标签的需求。此设计结合了现代AI系统的优点:先进的仅语言推理模型用于将空间查询分解为更简单的子任务,以及通过高效的VLM批评家改进的强视觉专家模型。我们在多种空间推理任务上评估了我们的方法,并展示了我们的方法在视觉推理方面优于开源和专有模型,同时通过改进的视觉定位模型进一步超越了最近的仅文本视觉推理方法。项目网页:https://glab-caltech.github.io/valor/
Accelerated Rotation-Invariant Convolution for UAV Image Segmentation
Authors: Manduhu Manduhu, Alexander Dow, Gerard Dooly, James Riordan
First: 2025-12-09T18:30:00+00:00 · Latest: 2025-12-09T18:30:00+00:00
Abstract
Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20--55% faster training and 15--45% lower energy consumption than cuDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on $256\times256$ inputs, and 32% speedup and 23% lower energy usage on $1024\times1024$ inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
中文标题/摘要
标题:加速旋转不变卷积用于无人机图像分割
旋转不变性对于无人机航空影像中的精确对象级分割至关重要,因为目标可以具有任意方向并表现出细微的细节。传统的分割架构如U-Net依赖于非旋转不变的卷积操作,这会导致在不同视角下的分割精度下降。通过在多个方向上扩展滤波器库可以实现旋转不变性,但这种方法将显著增加计算成本和内存流量。在本文中,我们介绍了一种GPU优化的旋转不变卷积框架,该框架消除了基于矩阵乘法卷积所需的传统数据降维(im2col)步骤。通过利用对称旋转滤波器之间的结构化数据共享,我们的方法实现了多方向卷积,同时大大减少了内存流量和计算冗余。我们进一步将该方法推广以加速任意(非对称)旋转角度的卷积。 在广泛的基准测试中,所提出的卷积在训练速度上比CUDNN快20-55%,在能耗上低15-45%,同时保持与最先进的旋转不变方法相当的准确性。在八方向设置中,我们的方法在256×256输入上实现了高达45%的加速和41%的能耗节省,在1024×1024输入上实现了32%的加速和23%的能耗降低。将该框架集成到U-Net分割模型中,与非旋转感知基线相比,可实现高达6%的准确性提升。这些结果表明,所提出的方法为现有的旋转不变CNN框架提供了一种有效且高效的替代方案。
Summary / 总结
The research addresses the need for rotation invariance in UAV image segmentation, where targets appear at arbitrary orientations. It introduces a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step and shares data among symmetrically rotated filters, reducing memory traffic and computational redundancy. Benchmarks show 20-55% faster training and 15-45% lower energy consumption than cuDNN, with accuracy comparable to state-of-the-art rotation-invariant methods. Integrated into a U-Net model, the framework improves accuracy by up to 6% over the non-rotation-aware baseline.
研究旨在通过解决旋转不变性问题,提高无人机航空图像中的对象级分割精度。方法提出了一种GPU优化的旋转不变卷积框架,无需数据降低步骤即可减少内存流量和计算冗余。实验结果表明,与CUDNN相比,该卷积最多可实现55%的训练速度提升和45%的能耗降低,同时保持与最先进的方法相当的准确性。集成到U-Net模型中时,它可将准确性提高多达6%。
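The baseline the abstract describes, expanding a filter bank across orientations, can be sketched naively in NumPy. The example applies one filter at four 90-degree rotations and takes the maximum response per position. It is the straightforward (and redundant) formulation, not the authors' GPU-optimized kernel, and the helper names `correlate2d_valid` and `rotation_pooled_response` are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image with a kernel."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def rotation_pooled_response(image, kernel):
    """Apply the kernel at 0/90/180/270 degrees and keep the max response."""
    responses = [correlate2d_valid(image, np.rot90(kernel, k)) for k in range(4)]
    return np.max(np.stack(responses), axis=0)

rng = np.random.default_rng(1)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

out = rotation_pooled_response(image, kernel)
out_rotated_input = rotation_pooled_response(np.rot90(image), kernel)

# Rotating the input by 90 degrees only rotates the pooled response map.
print(np.allclose(out_rotated_input, np.rot90(out)))
```

Each orientation here re-reads the same image windows, which is the kind of repeated memory traffic and computation the abstract says the proposed framework reduces.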
A Formalism for Optimal Search with Dynamic Heuristics (Extended Version)
Authors: Remo Christen, Florian Pommerening, Clemens Büchner, Malte Helmert
First: 2025-04-29T19:25:31+00:00 · Latest: 2025-12-09T18:28:34+00:00
Abstract
While most heuristics studied in heuristic search depend only on the state, some accumulate information during search and thus also depend on the search history. Various existing approaches use such dynamic heuristics in $\mathrm{A}^*$-like algorithms and appeal to classic results for $\mathrm{A}^*$ to show optimality. However, doing so ignores the complexities of searching with a mutable heuristic. In this paper we formalize the idea of dynamic heuristics and use them in a generic algorithm framework. We study a particular instantiation that models $\mathrm{A}^*$ with dynamic heuristics and show general optimality results. Finally we show how existing approaches from classical planning can be viewed as special cases of this instantiation, making it possible to directly apply our optimality results.
中文标题/摘要
标题:动态启发式搜索的正式方法(扩展版)
虽然大多数在启发式搜索中研究的启发式函数仅依赖于状态,但有些启发式函数在搜索过程中会积累信息,因此也依赖于搜索历史。各种现有方法在类似A*的算法中使用此类动态启发式函数,并引用经典结果来证明其最优性。然而,这样做忽略了使用可变启发式函数进行搜索的复杂性。在本文中,我们形式化了动态启发式函数的概念,并在通用算法框架中使用它们。我们研究了一种特定的实例化,该实例化模拟了使用动态启发式函数的A*,并展示了通用的最优性结果。最后,我们展示了现有的经典规划方法可以被视为该实例化的特殊情况,从而使我们能够直接应用我们的最优性结果。
Summary / 总结
The paper studies dynamic heuristics, which accumulate information during search and therefore depend on the search history as well as the state. Existing approaches use such heuristics in A*-like algorithms and appeal to classic A* results for optimality, which ignores the complexities of searching with a mutable heuristic. The authors formalize dynamic heuristics, use them in a generic algorithm framework, and prove general optimality results for an instantiation that models A* with dynamic heuristics. They also show that existing approaches from classical planning are special cases of this instantiation, so the optimality results apply to them directly.
论文针对传统启发式搜索中只依赖状态而不考虑搜索历史的启发式方法的局限性,引入了动态启发式方法,这些启发式方法在搜索过程中会积累信息。作者开发了一个包含动态启发式的通用算法框架,并证明了通用的最优性结果。此外,他们还展示了现有经典规划中的方法可以被视为该框架的特殊情况,从而使这些最优性结果可以直接应用。
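One common way to search with a heuristic whose values change over time is to re-evaluate it when a node is popped and re-insert the node if its priority has grown. The sketch below shows that pattern on a toy graph. It is an illustration only and does not reproduce the paper's formalism or its optimality conditions. The function `astar_dynamic`, the class `ImprovingHeuristic`, and the graph are hypothetical.

```python
import heapq

def astar_dynamic(start, is_goal, successors, heuristic):
    """A*-style search that re-checks a mutable heuristic when a node is popped."""
    g = {start: 0.0}
    counter = 0  # tie-breaker so heap entries never compare states directly
    open_heap = [(heuristic(start), counter, start, 0.0)]
    closed = set()
    while open_heap:
        f, _, state, g_at_push = heapq.heappop(open_heap)
        if g_at_push > g[state] or state in closed:
            continue  # stale or duplicate entry
        h_now = heuristic(state)
        if g[state] + h_now > f:
            # The heuristic value rose since insertion: re-queue with the new priority.
            counter += 1
            heapq.heappush(open_heap, (g[state] + h_now, counter, state, g[state]))
            continue
        if is_goal(state):
            return g[state]
        closed.add(state)
        for nxt, cost in successors(state):
            new_g = g[state] + cost
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                closed.discard(nxt)  # reopen if a cheaper path was found
                counter += 1
                heapq.heappush(open_heap, (new_g + heuristic(nxt), counter, nxt, new_g))
    return None

class ImprovingHeuristic:
    """Toy dynamic heuristic: uninformed at first, exact after `warmup` calls."""
    def __init__(self, exact, warmup):
        self.exact, self.warmup, self.calls = exact, warmup, 0

    def __call__(self, state):
        self.calls += 1
        return 0.0 if self.calls <= self.warmup else self.exact[state]

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
h = ImprovingHeuristic({"A": 3.0, "B": 2.0, "C": 1.0, "D": 0.0}, warmup=1)
cost = astar_dynamic("A", lambda s: s == "D", lambda s: graph[s], h)
print(cost)
```

The toy heuristic never overestimates the remaining cost at any point in time. Whether that alone suffices for optimality in general is the kind of question the paper's formal treatment addresses.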
Explainable Anomaly Detection for Industrial IoT Data Streams
Authors: Ana Rita Paupério, Diogo Risca, Afonso Lourenço, Goreti Marreiros, Ricardo Martins
First: 2025-12-09T18:20:35+00:00 · Latest: 2025-12-09T18:20:35+00:00
Comments: Accepted at 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026)
Abstract
Industrial maintenance is being transformed by the Internet of Things and edge computing, generating continuous data streams that demand real-time, adaptive decision-making under limited computational resources. While data stream mining (DSM) addresses this challenge, most methods assume fully supervised settings, yet in practice, ground-truth labels are often delayed or unavailable. This paper presents a collaborative DSM framework that integrates unsupervised anomaly detection with interactive, human-in-the-loop learning to support maintenance decisions. We employ an online Isolation Forest and enhance interpretability using incremental Partial Dependence Plots and a feature importance score, derived from deviations of Individual Conditional Expectation curves from a fading average, enabling users to dynamically reassess feature relevance and adjust anomaly thresholds. We describe the real-time implementation and provide initial results for fault detection in a Jacquard loom unit. Ongoing work targets continuous monitoring to predict and explain imminent bearing failures.
中文标题/摘要
标题:工业物联网数据流的可解释异常检测
物联网和边缘计算正在改变工业维护,生成需要在有限计算资源下进行实时、自适应决策的连续数据流。虽然数据流挖掘(DSM)解决了这一挑战,但大多数方法假设完全监督的设置,而在实践中,真实标签往往延迟或不可用。本文提出了一种协作的DSM框架,将无监督异常检测与交互式、人工在环学习相结合,以支持维护决策。我们采用在线孤立森林,并通过增量部分依赖图和基于条件期望曲线偏差与衰减平均值差异的特征重要性得分来增强可解释性,使用户能够动态重新评估特征的相关性并调整异常阈值。我们描述了实时实现并提供了在Jacquard织机单元中进行故障检测的初步结果。后续工作旨在持续监控以预测和解释即将发生的轴承故障。
Summary / 总结
This paper addresses the challenge of real-time anomaly detection in industrial IoT data streams, where ground-truth labels are often delayed or unavailable. It proposes a collaborative framework integrating unsupervised anomaly detection with interactive, human-in-the-loop learning. The method uses an online Isolation Forest and enhances interpretability with incremental Partial Dependence Plots and a feature importance score, allowing users to reassess feature relevance and adjust anomaly thresholds. The authors describe a real-time implementation and report initial results for fault detection in a Jacquard loom unit; ongoing work targets continuous monitoring to predict and explain imminent bearing failures.
该论文针对工业物联网数据流中的实时异常检测挑战,其中真实标签往往延迟或不可用。它提出了一种结合无监督异常检测与交互式人机在环学习的协作框架。方法使用在线孤立森林,并通过增量部分依赖图和特征重要性得分增强可解释性。论文给出了 Jacquard 织机单元故障检测的初步结果,后续工作则侧重于连续监控以预测和解释即将发生的轴承故障。
Rethinking Few-Shot Image Fusion: Granular Ball Priors Enable General-Purpose Deep Fusion
Authors: Minjie Deng, Yan Wei, An Wu, Yuncan Ouyang, Hao Zhai, Qianyao Peng
First: 2025-04-11T19:33:06+00:00 · Latest: 2025-12-09T18:19:43+00:00
Abstract
In image fusion tasks, the absence of real fused images as priors forces most deep learning approaches to rely on large-scale paired datasets to extract global weighting features or to generate pseudo-supervised images through algorithmic constructions. Unlike previous methods, this work re-examines prior-guided learning under few-shot conditions by introducing rough set theory. We regard the traditional algorithm as a prior generator, while the network re-infers and adaptively optimizes the prior through a dynamic loss function, reducing the inference burden of the network and enabling effective few-shot learning. To provide the prior, we propose the Granular Ball Pixel Computation (GBPC) algorithm. GBPC models pixel pairs in a luminance subspace using meta-granular balls and mines intra-ball information at multiple granular levels. At the fine-grained level, sliding granular balls assign adaptive weights to individual pixels to produce pixel-level prior fusion. At the coarse-grained level, the algorithm performs split computation within a single image to estimate positive and boundary domain distributions, enabling modality awareness and prior confidence estimation, which dynamically guide the loss weighting. The network and the algorithmic prior are coupled through the loss function to form an integrated framework. Thanks to the dynamic weighting mechanism, the network can adaptively adjust to different priors during training, enhancing its perception and fusion capability across modalities. We name this framework GBFF (Granular Ball Fusion Framework). Experiments on four fusion tasks demonstrate that even with only ten training image pairs per task, GBFF achieves superior performance in both visual quality and model compactness. Code is available at: https://github.com/DMinjie/GBFF
中文标题/摘要
标题:重新思考少样本图像融合:粒球先验使通用深度融合成为可能
在图像融合任务中,由于缺乏真实的融合图像作为先验,大多数深度学习方法不得不依赖大规模配对数据集来提取全局权重特征,或者通过算法构造生成伪监督图像。与以往方法不同,本工作在少量示例条件下重新审视先验引导学习,引入粗糙集理论。我们将传统算法视为先验生成器,而网络通过动态损失函数重新推断并自适应优化先验,减轻网络的推断负担,实现有效的少量示例学习。为了提供先验,我们提出了粒度球像素计算(GBPC)算法。GBPC在亮度子空间中使用元粒度球建模像素对,并在多个粒度级别挖掘球内信息。在细粒度级别,滑动粒度球为单个像素分配自适应权重,生成像素级先验融合。在粗粒度级别,算法在单个图像内进行拆分计算,估计正域和边界域分布,实现模态感知和先验置信度估计,动态指导损失加权。网络和算法先验通过损失函数耦合,形成集成框架。得益于动态加权机制,网络在训练过程中可以自适应调整以适应不同的先验,增强其跨模态的感知和融合能力。我们称此框架为GBFF(粒度球融合框架)。在四个融合任务上的实验表明,即使每任务只有十个训练图像对,GBFF在视觉质量和模型紧凑性方面均表现出优越性能。代码可在:https://github.com/DMinjie/GBFF 获取。
Summary / 总结
This work addresses the challenge of few-shot image fusion by introducing the Granular Ball Fusion Framework (GBFF), which leverages rough set theory and a novel Granular Ball Pixel Computation (GBPC) algorithm. GBPC models pixel pairs in a luminance subspace using meta-granular balls, enabling adaptive pixel-level fusion and modality-awareness at different granular levels. The framework integrates the network and algorithmic prior through a dynamic loss function, allowing for effective few-shot learning. Experiments show that GBFF outperforms existing methods in both visual quality and model compactness with only ten training image pairs per task.
该研究通过引入GBFF(粒状球融合框架)和新的GBPC(粒状球像素计算)算法,解决了少量样本图像融合的挑战。GBPC通过元粒状球模型像素对,并通过损失函数动态推断和优化先验,实现有效的少量样本学习。实验表明,即使每任务只有十个训练图像对,GBFF在视觉质量和模型紧凑性方面也优于现有方法。
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Authors: Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng
First: 2025-12-09T18:15:43+00:00 · Latest: 2025-12-09T18:15:43+00:00
Abstract
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
中文标题/摘要
标题:SATGround:一种针对遥感领域视觉定位的空间感知方法
视觉语言模型(VLMs)正在成为遥感领域强大的通用工具,能够跨多种任务整合信息,并通过聊天界面实现灵活的指令式交互。在本文中,我们通过提出一种新颖的结构化定位机制,增强了基于VLM的卫星图像视觉定位。我们的方法包括在多样化的指令遵循任务上微调预训练的VLM,并通过专门的控制标记接口连接一个专用的定位模块。该方法促进了语言和空间信息的联合推理,显著增强了模型在复杂卫星场景中精确定位物体的能力。我们在几个遥感基准上评估了我们的框架,始终优于现有方法,包括在视觉定位上的24.8%的相对改进。我们的结果突显了将结构化空间推理集成到VLM中的好处,为更可靠的遥感数据分析铺平了道路。
Summary / 总结
This paper introduces SATGround, a method that enhances visual grounding in satellite imagery using a spatially-aware approach. The approach involves fine-tuning a pretrained vision-language model on various instruction-following tasks and integrating a specialized grounding module with control tokens. This method improves the model's ability to precisely locate objects in complex satellite scenes, achieving a 24.8% relative improvement over previous methods on visual grounding benchmarks.
本文提出了SATGround方法,通过引入空间感知机制增强卫星图像中的视觉定位能力。该方法涉及对预训练的视觉-语言模型进行微调,并结合专门的定位模块和控制标记。这种方法提高了模型在复杂卫星场景中精确定位物体的能力,相比之前的方法在视觉定位基准上的相对改进达到了24.8%。
DAO-GP Drift Aware Online Non-Linear Regression Gaussian-Process
Authors: Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi
Venue: Proc. IEEE International Conference on Big Data (BigData), 2025
First: 2025-12-09T18:12:38+00:00 · Latest: 2025-12-09T18:12:38+00:00
Abstract
Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP's robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.
中文标题/摘要
标题:DAO-GP 漂移感知的在线非线性回归高斯过程
现实世界的数据集通常表现出随时间变化的数据分布动态。忽略这种现象,通常称为概念漂移,会显著降低模型的预测准确性。此外,在线模型中的超参数问题进一步加剧了这一问题。这些参数通常是固定的,用户无法根据数据分布的变化动态调整。高斯过程(GP)模型提供了强大的非参数回归能力,并能量化不确定性,使其成为在线环境中建模复杂数据关系的理想选择。然而,传统的在线GP方法面临几个关键限制,包括缺乏漂移感知、依赖固定超参数、易受数据窥探的影响、缺乏原理性的衰减机制以及内存效率低下。为应对这些问题,我们提出了DAO-GP(漂移感知的在线高斯过程),这是一种全新的、完全自适应的、无超参数、衰减和稀疏的非线性回归模型。DAO-GP 具有内置的漂移检测和适应机制,能够根据漂移的严重程度动态调整模型行为。广泛的实证评估证实,DAO-GP 在平稳条件、多种漂移类型(突变、增量、渐进)和各种数据特征下均表现稳健。分析表明,它具有动态适应性、高效的内存管理和基于衰减的管理机制,以及不断演化的诱导点。与最先进的参数和非参数模型相比,DAO-GP 始终取得更优或具有竞争力的性能,确立了其作为在线非线性回归漂移鲁棒解决方案的地位。
Summary / 总结
The research aims to address the issue of concept drift in real-world datasets by proposing DAO-GP, a drift-aware online non-linear regression Gaussian-Process model. DAO-GP is fully adaptive, hyperparameter-free, and includes a drift detection and adaptation mechanism. Extensive evaluations show that DAO-GP outperforms or matches state-of-the-art models across various drift types and data characteristics, demonstrating its robustness and efficiency.
研究针对现实世界数据集中的概念漂移问题,提出了一种名为DAO-GP的动态适应在线高斯过程模型。DAO-GP能够根据漂移严重程度动态调整其行为,并避免使用固定超参数,展现出在各种漂移类型和数据特征下的稳健性能。实证评估表明,DAO-GP在在线非线性回归任务中优于或匹配最先进的模型。
LLM Collaboration With Multi-Agent Reinforcement Learning
Authors: Shuo Liu, Tianle Chen, Zeyu Liang, Xueguang Lyu, Christopher Amato
First: 2025-08-06T17:18:25+00:00 · Latest: 2025-12-09T18:12:23+00:00
Abstract
A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges. Our code is available at https://github.com/OpenMLRL/CoMLRL.
中文标题/摘要
标题:大模型与多智能体强化学习协作
在多智能体系统(MAS)中,已经做了大量工作来建模和解决多个相互作用的智能体的问题。然而,大多数大模型(LLM)是独立预训练的,并未特别优化以进行协调。现有的LLM微调框架依赖于个体奖励,这需要为每个智能体设计复杂的奖励机制以促进协作。为了解决这些挑战,我们将LLM协作建模为合作多智能体强化学习(MARL)问题。我们开发了多智能体、多轮的算法——多智能体组相对策略优化(MAGRPO),以解决该问题,该算法结合了当前用于LLM的RL方法以及MARL技术。我们的实验表明,使用MAGRPO微调MAS能够使智能体通过有效的协作高效地生成高质量的响应。我们的方法为使用其他MARL方法进行LLM研究打开了大门,并突显了相关挑战。我们的代码可在https://github.com/OpenMLRL/CoMLRL/获取。
Summary / 总结
This paper addresses the challenge of coordinating large language models (LLMs) by modeling their collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. The authors develop a multi-agent, multi-turn algorithm called Multi-Agent Group Relative Policy Optimization (MAGRPO) to fine-tune LLMs for better collaboration. Experiments show that using MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation, demonstrating the potential of MARL methods for LLMs and highlighting associated challenges.
研究旨在通过将语言模型(LMs)的交互视为合作的多智能体强化学习(MARL)问题来改善它们的协作。研究开发了一种多智能体、多轮算法MAGRPO,该算法结合了现有的LMs和MARL方法的RL技术。实验表明,使用MAGRPO对LMs进行微调可以增强它们的协作能力,在写作和编程任务中通过有效的合作生成高质量的响应。
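The "group relative" part of the algorithm's name suggests a GRPO-style advantage, where each sampled outcome is scored against the other samples in its group. The sketch below shows that normalization for a cooperative setting in which all agents share one joint reward per sample. The abstract does not give the formula, so the standard-deviation normalization, the shared-reward assumption, the function name `group_relative_advantages`, and the agent names are all assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampled group (GRPO-style, assumed)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical joint rewards for four sampled collaborations on one prompt.
joint_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(joint_rewards)

# In a fully cooperative setup, every agent would reuse the same advantages
# to weight the log-probabilities of its own responses.
per_agent = {agent: advantages for agent in ("writer", "reviewer")}
print(advantages)
```

Scoring samples relative to their group avoids training a separate value model, and sharing one joint reward avoids designing a separate reward for each agent.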
Learning Geodesics of Geometric Shape Deformations From Images
Authors: Nian Wu, Miaomiao Zhang
Venue: Machine Learning for Biomedical Imaging 3 (2025)
First: 2024-10-24T14:49:59+00:00 · Latest: 2025-12-09T18:10:00+00:00
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:019
Abstract
This paper presents a novel method, named geodesic deformable networks (GDN), that for the first time enables the learning of geodesic flows of deformation fields derived from images. In particular, the ability of GDN to predict geodesics is important for quantifying and comparing deformable shapes presented in images. Geodesic deformations, also known as optimal transformations that align pairwise images, are often parameterized by a time sequence of smooth vector fields governed by nonlinear differential equations. An extensive literature has focused on learning the initial conditions (e.g., initial velocity fields) with registration networks. However, the definition of geodesics, which is central to deformation-based shape analysis, is not visible to these networks. To address this problem, we develop an efficient neural operator that treats the geodesics as unknown mapping functions learned from the latent deformation spaces. A composition of integral operators and smooth activation functions is then formulated to effectively approximate such mappings. In contrast to previous works, our GDN jointly optimizes a newly defined geodesic loss, which further promotes network regularizability and generalizability. We demonstrate the effectiveness of GDN on both 2D synthetic data and 3D real brain magnetic resonance imaging (MRI).
中文标题/摘要
标题:从图像中学习几何形状变形的测地线
本文提出了一种名为测地线变形网络(GDN)的新方法,首次使从图像中学习变形场的测地线流成为可能。特别是,GDN预测测地线的能力对于量化和比较图像中呈现的可变形形状非常重要。测地线变形,也称为将成对图像对齐的最优变换,通常由一组随时间变化的平滑向量场参数化,这些向量场受非线性微分方程控制。大量文献集中在基于配准网络学习初始条件(例如,初始速度场)。然而,作为基于变形的形状分析核心的测地线定义,对这些网络而言却是不可见的。为了解决这个问题,我们精心设计了一个高效的神经算子,将测地线视为从潜在变形空间中学习的未知映射函数。然后,通过积分算子和光滑激活函数的组合来有效近似这些映射。与以往工作不同,我们的GDN联合优化了一个新定义的测地线损失,这为提升网络的可正则化性和泛化能力带来了额外的好处。我们在2D合成数据和3D真实脑磁共振成像(MRI)上展示了GDN的有效性。
Summary / 总结
This paper introduces geodesic deformable networks (GDN), which enable the learning of geodesic flows of deformation fields from images. GDN addresses the limitation of previous registration networks by learning geodesics as unknown mapping functions in latent deformation spaces. The method optimizes a geodesic loss to improve regularizability and generalizability. Experiments on 2D synthetic data and 3D real brain MRI show the effectiveness of GDN in predicting geodesics for deformable shape analysis.
本文提出了测地线变形网络(GDN),该网络能够从图像中学习变形场的测地线流。GDN 针对以往配准网络只学习初始条件的局限,将测地线视为潜在变形空间中的未知映射函数,用神经算子加以近似,并联合优化新定义的测地线损失。该方法在2D合成数据和3D真实脑部磁共振成像(MRI)数据上展示了其有效性。
ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
Authors: Sanket Badhe
Venue: Proceedings of Machine Learning Research 299, 2025 Conference on Applied Machine Learning for Information Security
First: 2025-08-08T17:01:41+00:00 · Latest: 2025-12-09T18:09:00+00:00
Comments: Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 19 pages, 3 figures
Abstract
Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.
中文标题/摘要
标题:ScamAgents:基于AI代理的模拟人类水平诈骗电话
大型语言模型(LLMs)展示了令人印象深刻的流畅性和推理能力,但其滥用潜力引起了广泛关注。本文介绍了ScamAgent,这是一种基于LLMs的自主多轮代理,能够生成高度逼真的诈骗电话脚本,模拟现实中的欺诈场景。与专注于单轮提示滥用的先前工作不同,ScamAgent保持对话记忆,能够根据模拟用户响应动态调整,并在对话轮次中采用欺骗性说服策略。我们展示了当前的LLM安全护栏,包括拒绝机制和内容过滤器,对这种基于代理的威胁无效。即使具有强大提示级保护机制的模型,在提示被分解、伪装或在代理框架内逐步传递时,也可以被绕过。我们进一步展示了使用现代文本转语音系统将诈骗脚本转换为逼真的语音呼叫,完成了一个完整的自动化诈骗流程。我们的研究结果强调了多轮安全审计、代理级控制框架以及检测和中断由生成式AI驱动的对话欺骗的新方法的迫切需求。
Summary / 总结
This paper introduces ScamAgent, an autonomous multi-turn agent built on LLMs, designed to generate realistic scam call scripts. Unlike previous single-shot approaches, ScamAgent maintains dialogue memory, adapts to user responses, and uses deceptive strategies. The study shows that current LLM safety measures are ineffective against such agents, and even models with strong prompt-level safeguards can be bypassed. The research highlights the need for multi-turn safety auditing and new methods to detect and disrupt conversational deception powered by generative AI.
论文介绍了基于LLM的自主多轮ScamAgent,用于生成逼真的诈骗电话脚本。与之前的单轮方法不同,ScamAgent保持对话记忆,适应用户响应,并使用欺骗性策略。研究显示当前的安全措施不足以应对这种基于代理的威胁,即使是具有强大提示级保护措施的模型也可能被绕过。研究还展示了使用文本转语音系统将诈骗脚本转化为真实的语音呼叫,突显了需要新的安全审计和控制框架的紧迫性。
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
Authors: Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Xiaodong Gu
First: 2025-12-09T18:04:41+00:00 · Latest: 2025-12-09T18:04:41+00:00
Abstract
LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. While Federated Learning (FL) has proven effective on static datasets, its extension to the open-ended self-evolution of agents remains underexplored. Directly applying standard FL is challenging: heterogeneous tasks and sparse, trajectory-level rewards introduce severe gradient conflicts, destabilizing the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents. Fed-SE establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace that disentangles environment-specific dynamics, effectively reducing negative transfer across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by approximately 18% over federated baselines, validating its effectiveness in robust cross-environment knowledge transfer in privacy-constrained deployments.
中文标题/摘要
标题:Fed-SE:受隐私约束的多环境LLM代理的联邦自我进化
LLM代理广泛应用于复杂的交互任务,但隐私约束往往阻止了跨动态环境的集中优化和共同进化。尽管联邦学习(FL)在静态数据集上已证明有效,但将其扩展到代理的开放自我进化仍处于探索阶段。直接应用标准FL具有挑战性:异构任务和稀疏的轨迹级奖励引入了严重的梯度冲突,导致全局优化过程不稳定。为解决这一问题,我们提出了一种Fed-SE框架,用于LLM代理的联邦自我进化。Fed-SE建立了一个局部进化-全局聚合的范式。在局部,代理对过滤后的高回报轨迹进行参数高效的微调,以实现稳定的梯度更新。在全局,Fed-SE在低秩子空间中聚合更新,分离环境特定的动力学,有效减少了客户端之间的负面迁移。在五个异构环境中的实验表明,与联邦基线相比,Fed-SE将平均任务成功率提高了约18%,验证了其在受隐私约束部署中跨环境知识转移的有效性。
Summary / 总结
The research aims to address the challenge of optimizing and co-evolving LLM agents across multiple dynamic environments while respecting privacy constraints. Fed-SE, a Federated Self-Evolution framework, is proposed to tackle this issue by enabling local fine-tuning on high-return trajectories and global aggregation within a low-rank subspace to reduce negative transfer. Experiments show that Fed-SE enhances task success rates by about 18% compared to federated baselines, demonstrating its effectiveness in cross-environment knowledge transfer under privacy constraints.
研究旨在解决在遵守隐私约束的同时,跨多个动态环境优化LLM代理并使其共同进化的问题。提出了Fed-SE框架,该框架在局部对高回报轨迹进行微调,在全局于低秩子空间内聚合更新,以减少负迁移。实验表明,与联邦基线相比,Fed-SE将任务成功率提高了约18%,证明了其在隐私受限部署中的跨环境知识转移的有效性。
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Authors: Kaizhi Zheng, Xuehai He, Xin Eric Wang
First: 2023-10-03T17:49:04+00:00 · Latest: 2025-12-09T18:03:09+00:00
Abstract
The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
中文标题/摘要
标题:MiniGPT-5:通过生成Vokens实现交错的视觉-语言生成
多模态大型语言模型(MLLMs)的有效性展示了其在多模态理解方面的强大能力。然而,同时生成具有连贯文本的图像仍然发展不足。为解决这一问题,我们提出了一种新的交错视觉-语言生成方法,围绕“生成Vokens”的概念展开。这些Vokens作为关键元素,有助于生成连贯的图像-文本输出。我们的方法采用独特的两阶段训练策略,用于无描述的多模态生成,无需对图像进行大量描述。我们整合了无分类引导,以增强生成图像和文本的一致性,确保更流畅和上下文相关的多模态交互。我们的模型MiniGPT-5在包括MMDialog和VIST在内的多模态生成数据集上显著优于基线模型。人类评估显示,MiniGPT-5在超过56%的多模态生成案例中优于基线模型,突显了其在各种基准测试中的有效性。
Summary / 总结
The research aims to improve the simultaneous generation of coherent images and texts, which remains underdeveloped in Multimodal Large Language Models (MLLMs). The method introduces 'generative vokens' and a two-stage training strategy for description-free multimodal generation, using classifier-free guidance to enhance image-text alignment. Experimental results show MiniGPT-5 outperforms baseline models on the MMDialog and VIST datasets, and in human evaluation it is judged better than the baseline in more than 56% of cases.
研究旨在提高多模态大型语言模型(MLLMs)中图像和文本的同步生成能力,这是尚未充分开发的功能。方法引入了“生成性voken”和两阶段训练策略,用于无描述的多模态生成,并使用无分类器引导来增强生成图像和文本的对齐。实验结果显示,MiniGPT-5在MMDialog和VIST数据集上的表现优于基线模型,人类评估表明其在超过56%的多模态生成案例中优于基线模型。
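The classifier-free guidance mentioned for MiniGPT-5 follows a standard combination rule, which can be sketched as below. This is a generic illustration under stated assumptions: noise predictions are plain lists of floats, and the function name and guidance scale are hypothetical. How MiniGPT-5 conditions on vokens is not described in the abstract and is not modeled here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates toward the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical two-element noise predictions from the same denoiser,
# run once without and once with the conditioning signal.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
```

Larger scales usually tighten agreement between the generated image and its condition at some cost in sample diversity.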
The AI Consumer Index (ACE)
Authors: Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen
First: 2025-12-04T15:54:28+00:00 · Latest: 2025-12-09T18:01:49+00:00
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
中文标题/摘要
标题:AI消费者指数(ACE)
我们介绍了AI消费者指数(ACE)的第一个版本,这是一个基准,用于评估前沿AI模型是否能够完成日常消费者任务。ACE包含一个隐藏的保留集,共有400个测试案例,分布在四个消费者活动中:购物、食物、游戏和DIY。我们还开源了80个案例作为开发集,使用CC-BY许可。对于ACE排行榜,我们使用一种新的评分方法评估了10个前沿模型(开启网络搜索),该方法动态检查响应中相关部分是否基于检索到的网络来源。GPT 5(思考=高)是表现最好的模型,得分为56.1%,其次是o3 Pro(思考=开)的55.2%和GPT 5.1(思考=高)的55.1%。模型得分在不同领域之间存在差异,在购物领域,顶级模型得分低于50%。我们发现模型容易虚构关键信息,如价格。ACE显示,即使是最优秀的模型在性能上也与消费者对AI的需求之间存在显著差距。
Summary / 总结
The AI Consumer Index (ACE) evaluates AI models' ability to handle everyday consumer tasks. It includes 400 hidden test cases across four activities: shopping, food, gaming, and DIY. Using a novel grading method that checks if responses are grounded in web sources, GPT 5 (56.1%) outperformed other models like o3 Pro (55.2%) and GPT 5.1 (55.1%). Model performance varies by domain, with the best model scoring under 50% in Shopping. ACE highlights the gap between current AI capabilities and consumer needs, noting frequent hallucinations about key information like prices.
AI消费者指数(ACE)是一个基准,用于评估AI模型处理日常消费者任务的能力。它包含400个隐藏测试案例,涵盖购物、食物、游戏和DIY四个领域,另有80个案例作为开发集开源。使用一种新的评分方法(检查响应是否基于检索到的网络来源),评估了10个前沿模型。GPT 5(思考=高)得分最高,为56.1%,其次是o3 Pro(思考=开)的55.2%和GPT 5.1(思考=高)的55.1%。模型在不同领域的表现不同,在购物领域最佳模型的得分低于50%。模型经常虚构价格等关键信息,表明当前模型性能与消费者需求之间存在显著差距。
EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
Authors: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie
First: 2025-12-09T18:00:26+00:00 · Latest: 2025-12-09T18:00:26+00:00
Abstract
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
中文标题/摘要
标题:EcomBench:迈向对电子商务领域基础代理的全面评估
基础代理在推理和与真实环境互动方面的能力迅速提升,使其核心能力的评估变得越来越重要。虽然已经开发了许多基准来评估代理性能,但大多数集中在学术环境或人工设计的场景上,而忽视了实际应用中出现的挑战。为了解决这一问题,我们关注一个高度实用的现实世界环境——电子商务领域,该领域涉及大量多样化的用户交互、动态的市场条件以及直接与实际决策过程相关的任务。为此,我们引入了EcomBench,这是一个旨在评估代理在现实电子商务环境中的性能的全面电子商务基准。EcomBench基于全球领先电子商务生态系统中的真实用户需求构建,并通过人类专家精心策划和注释,以确保清晰、准确和领域相关性。它涵盖了电子商务场景中的多个任务类别,并定义了三个难度级别,以评估代理在深度信息检索、多步推理和跨源知识整合等关键能力上的表现。通过在真实的电子商务环境中进行评估,EcomBench为测量代理在现代电子商务中的实际能力提供了一个严格的动态测试平台。
End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards
Authors: AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky
First: 2025-06-23T06:24:12+00:00 · Latest: 2025-12-09T17:54:30+00:00
Abstract
While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. Our implementation code is publicly available at: https://github.com/AHHHZ975/Differentiable-Texture-Learning
中文标题/摘要
标题:基于可微奖励的3D纹理生成端到端微调
尽管最近的3D生成模型能够生成高质量的纹理图像,但它们往往无法捕捉到人类的偏好或满足特定任务的要求。此外,在3D纹理生成领域,一个核心挑战是大多数现有方法依赖于反复调用2D文本到图像生成模型,这些模型缺乏对输入3D网格对象的3D结构的内在理解。为了解决这些问题,我们提出了一种端到端的、无需强化学习的框架,该框架将人类反馈嵌入到3D纹理合成管道中,以可微奖励函数的形式表达。通过反向传播偏好信号,我们的方法生成的纹理尊重3D几何结构并符合期望的标准。为了展示其灵活性,我们引入了三种新的几何感知奖励函数,这些函数为从自然语言创建高质量3D内容提供了一种更可控和可解释的途径。通过与最新方法进行定性、定量和用户偏好评估,我们证明了我们提出的方法始终优于现有方法。我们的实现代码可在以下网址公开获取:https://github.com/AHHHZ975/Differentiable-Texture-Learning
Summary / 总结
The paper addresses the limitations of existing 3D generative models in capturing human preferences and task-specific requirements. It proposes an end-to-end differentiable framework that integrates human feedback as differentiable reward functions directly into the 3D texture synthesis process. This method generates textures that respect 3D geometry and align with desired criteria, outperforming state-of-the-art approaches in qualitative, quantitative, and user-preference evaluations.
本文提出了一种端到端可微分框架,将人类反馈作为可微奖励函数直接嵌入到3D纹理合成管道中,以解决现有3D生成模型的局限性。该方法通过反向传播偏好信号生成尊重3D几何结构且符合期望标准的纹理。实验结果表明,所提出的方法在质量和用户偏好方面优于最先进的方法,展示了其在从自然语言描述生成高质量3D内容方面的有效性和灵活性。
Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
Authors: Amit Bendkhale
Venue: AAAI 2026
First: 2025-12-09T17:52:57+00:00 · Latest: 2025-12-09T17:52:57+00:00
Comments: 6 pages, 3 figures. Code and data: https://github.com/Amiton7/Tri-Bench. Accepted to the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
中文标题/摘要
标题:Tri-Bench:在相机倾斜和物体干扰下对VLM空间推理可靠性的压力测试
可验证的几何推理是可信且可控的智能体AI的关键组成部分。尽管能力令人印象深刻,视觉-语言模型(VLMs)在现实场景变化下仍经常失败。我们提出了Tri-Bench,这是一个由平面三角形问题构成的紧凑基准,它隔离了相对几何推理,同时对两个部署关键因素施加压力:相机姿态(平面 vs. 倾斜)和由物体干扰(10种日常物体)带来的场景上下文。为了测试可验证性和可控性,我们使用一个单一的固定提示来评估四个近期的VLMs,该提示的护栏明确描述了周围的正方形边框,从而可以通过单应性变换得到正确答案。我们在二元和连续目标上评估了六个简单任务,观察到相对于3D真实值的整体准确率一般,平均约为69%(最佳约75%,最差约64%)。同样的回答与图像平面中的2D投影更为吻合,平均准确率约为72%。在识别少数形状类别(等边、等腰、直角三角形)时,四个VLMs全部失败,准确率降至约0%。此外,在相机倾斜的情况下,VLM总体准确率下降约4.1%。这表明模型未能正确利用提示中提供的明确参考系提示,而是默认依赖2D图像平面线索。最后,我们发现物体干扰对VLM准确率没有显著影响。
Summary / 总结
The research evaluates the reliability of Vision-Language Models (VLMs) in geometric reasoning under realistic scene changes. Tri-Bench, a benchmark of planar triangle problems, stresses two factors: camera pose and object interference. Four recent VLMs were evaluated using a single fixed prompt whose guardrail describes a surrounding square border. Average accuracy was about 69% with respect to 3D ground truth and about 72% with respect to 2D image-plane projections, so responses align more closely with the 2D view. The models consistently failed to recognize minority shape classes (accuracy near 0%), and overall accuracy degraded by about 4.1% under camera tilt, indicating reliance on 2D cues over the frame-of-reference hint provided in the prompt. Object interference did not significantly affect VLM accuracy.
研究旨在评估视觉-语言模型(VLMs)在现实场景变化下的几何推理可靠性。Tri-Bench 是一个基于平面三角形问题的基准测试,测试模型在相机姿态(平面 vs. 倾斜)和物体干扰下的表现。使用固定提示和描述方形边框的护栏,四个模型相对于3D真实值的平均准确率约为69%,相对于2D投影约为72%;在识别少数形状类别时准确率降至约0%,在相机倾斜下总体准确率下降约4.1%。物体干扰对模型准确率没有显著影响。
Refining Diffusion Models for Motion Synthesis with an Acceleration Loss to Generate Realistic IMU Data
Authors: Lars Ole Häusler, Lena Uhlenberg, Göran Köber, Diyora Salimova, Oliver Amft
First: 2025-12-09T17:51:01+00:00 · Latest: 2025-12-09T17:51:01+00:00
Comments: 7 pages, 3 figures, 1 table
Abstract
We propose a text-to-IMU (inertial measurement unit) motion-synthesis framework to obtain realistic IMU data by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). L_acc enforces consistency in the discrete second-order temporal differences of the generated motion, thereby aligning the diffusion prior with IMU-specific acceleration patterns. We integrate L_acc into the training objective of an existing diffusion model, finetune the model to obtain an IMU-specific motion prior, and evaluate the model with an existing text-to-IMU framework that comprises surface modelling and virtual sensor simulation. We analysed acceleration signal fidelity and differences between synthetic motion representation and actual IMU recordings. As a downstream application, we evaluated Human Activity Recognition (HAR) and compared the classification performance using data of our method with the earlier diffusion model and two additional diffusion model baselines. When we augmented the earlier diffusion model objective with L_acc and continued training, L_acc decreased by 12.7% relative to the original model. The improvements were considerably larger in high-dynamic activities (i.e., running, jumping) compared to low-dynamic activities (i.e., sitting, standing). In a low-dimensional embedding, the synthetic IMU data produced by our refined model shifts closer to the distribution of real IMU recordings. HAR classification trained exclusively on our refined synthetic IMU data improved performance by 8.7% compared to the earlier diffusion model and by 7.6% over the best-performing comparison diffusion model. We conclude that acceleration-aware diffusion refinement provides an effective approach to align motion generation and IMU synthesis and highlights how flexible deep learning pipelines are for specialising generic text-to-motion priors to sensor-specific tasks.
中文标题/摘要
标题:使用加速度损失精炼扩散模型以生成真实IMU数据的运动合成
我们提出了一种文本到IMU(惯性测量单元)运动合成框架,通过使用基于加速度的二阶损失(L_acc)微调预训练的扩散模型来获得真实的IMU数据。L_acc 强制生成运动的离散二阶时间差分保持一致,从而将扩散先验与IMU特有的加速度模式对齐。我们将L_acc 集成到现有扩散模型的训练目标中,微调模型以获得IMU特定的运动先验,并使用现有的文本到IMU框架(包括表面建模和虚拟传感器模拟)评估模型。我们分析了加速度信号保真度以及合成运动表示与实际IMU记录之间的差异。作为下游应用,我们评估了人类活动识别(HAR),并将使用我们方法生成的数据得到的分类性能与早期扩散模型和另外两个扩散模型基线进行了比较。当我们将L_acc 加入早期扩散模型的目标并继续训练时,L_acc 相对于原始模型下降了12.7%。与低动态活动(如坐着、站着)相比,高动态活动(如跑步、跳跃)中的改进幅度明显更大。在低维嵌入中,我们精炼模型生成的合成IMU数据更接近真实IMU记录的分布。仅使用我们精炼的合成IMU数据训练的HAR分类,性能相比早期扩散模型提高了8.7%,相比表现最佳的对比扩散模型提高了7.6%。我们得出结论,加速度感知的扩散精炼为对齐运动生成和IMU合成提供了一种有效的方法,并突显了深度学习管道在将通用文本到运动先验专门化为传感器特定任务方面的灵活性。
Summary / 总结
The research aims to generate realistic IMU data through a text-to-IMU motion-synthesis framework by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). The method integrates L_acc into the training objective to enforce consistency in the discrete second-order temporal differences of the generated motion, aligning the diffusion prior with IMU-specific acceleration patterns. After continued training, L_acc decreases by 12.7% relative to the original model, with considerably larger improvements in high-dynamic activities such as running and jumping than in low-dynamic ones. Human Activity Recognition (HAR) classifiers trained only on the refined synthetic data improve by 8.7% over the earlier diffusion model and by 7.6% over the best-performing comparison model.
研究旨在通过使用基于加速度的二阶损失(L_acc)微调预训练的扩散模型来生成真实的IMU数据,从而实现文本到IMU的运动合成框架。该方法约束生成运动的离散二阶时间差分保持一致,使扩散先验与IMU特有的加速度模式相匹配。实验结果表明,L_acc 相对于原始模型下降了12.7%,且在跑步和跳跃等高动态活动中改进更为明显;仅用精炼后的合成数据训练的人体活动识别分类性能相比早期扩散模型提高了8.7%,相比表现最佳的对比模型提高了7.6%。
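A loss on discrete second-order temporal differences, as described for L_acc, might look like the sketch below. The abstract does not give the exact form of L_acc, so this is an assumption: a mean squared error between the second differences of a generated trajectory and those of a reference trajectory, for a single 1-D coordinate with unit time step. All function names and sample trajectories are hypothetical.

```python
def second_differences(x):
    """Discrete acceleration: x[t+1] - 2*x[t] + x[t-1] for each interior t."""
    return [x[t + 1] - 2 * x[t] + x[t - 1] for t in range(1, len(x) - 1)]


def acceleration_loss(generated, reference):
    """Mean squared error between the second differences of two trajectories."""
    if len(generated) != len(reference) or len(generated) < 3:
        raise ValueError("need two equal-length trajectories with >= 3 frames")
    a_gen = second_differences(generated)
    a_ref = second_differences(reference)
    return sum((g - r) ** 2 for g, r in zip(a_gen, a_ref)) / len(a_gen)


# Hypothetical 1-D trajectories: constant velocity vs. constant acceleration.
generated = [t for t in range(6)]        # second differences are all 0
reference = [t * t for t in range(6)]    # second differences are all 2
loss = acceleration_loss(generated, reference)
```

Because the loss only sees second differences, a trajectory that is offset or drifting at constant velocity relative to the reference is not penalized, which is why such a term would be added to the diffusion objective rather than replacing it.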
Reinforcement Learning From State and Temporal Differences
Authors: Lex Weaver, Jonathan Baxter
First: 2025-12-09T17:48:28+00:00 · Latest: 2025-12-09T17:48:28+00:00
Abstract
TD(λ) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD(λ) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD(λ), starting from an optimal policy, converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD(λ), called STD(λ), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(λ) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD(λ) on the two-state system and a variation on the well known acrobot problem.
中文标题/摘要
标题:基于状态差分与时间差分的强化学习
TD(λ)结合函数近似在某些复杂的强化学习问题中已被实验证明是成功的。对于线性近似,TD(λ)已被证明可以最小化每个状态的近似值与其真实值之间的平方误差。然而,就策略而言,关键在于状态相对排序的误差,而不是状态值的误差。我们通过两类例子说明这一点:一是简单的两状态和三状态系统,其中TD(λ)从最优策略出发却收敛到次优策略;二是西洋双陆棋(backgammon)。然后,我们提出了TD(λ)的一种改进形式,称为STD(λ),其中函数逼近器针对二元决策问题中的状态相对值进行训练。我们给出了理论分析,包括在两状态系统中STD(λ)单调策略改进的证明,并与Bertsekas的微分训练方法[1]进行了比较。最后,我们在两状态系统和著名的acrobot问题的一个变体上成功演示了STD(λ)。
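For reference, the baseline that the paper modifies, TD(λ) with linear function approximation, can be sketched as follows. This is the standard algorithm with accumulating eligibility traces, not STD(λ), whose update rule the abstract does not spell out. The episode format, feature function, and hyperparameter values are hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def td_lambda_episode(weights, episode, features, alpha=0.1, gamma=1.0, lam=0.8):
    """Run one episode of linear TD(lambda) and return the updated weights.

    episode: list of (state, reward, next_state); next_state is None at the end.
    features: maps a state to its feature vector (same length as weights).
    """
    w = list(weights)
    trace = [0.0] * len(w)
    for state, reward, next_state in episode:
        phi = features(state)
        v = dot(w, phi)
        v_next = 0.0 if next_state is None else dot(w, features(next_state))
        delta = reward + gamma * v_next - v                        # TD error
        trace = [gamma * lam * e + p for e, p in zip(trace, phi)]  # decay, then accumulate
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


def one_hot(state, n_states=2):
    return [1.0 if i == state else 0.0 for i in range(n_states)]


# Hypothetical two-state chain: state 0 -> state 1 -> terminal, reward 1 at the end.
episode = [(0, 0.0, 1), (1, 1.0, None)]
w = td_lambda_episode([0.0, 0.0], episode, one_hot, alpha=0.5, gamma=1.0, lam=1.0)
```

The update drives each state's estimated value toward its return. The paper's point is that small value errors can still flip the relative ordering of states, and it is the ordering that determines the policy.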
Automated Construction of Artificial Lattice Structures with Designer Electronic States
Authors: Ganesh Narasimha, Mykola Telychko, Wooin Yang, Arthur P. Baddorf, P. Ganesh, An-Ping Li, Rama Vasudevan
First: 2025-08-04T16:38:45+00:00 · Latest: 2025-12-09T17:35:06+00:00
Abstract
Manipulating matter with a scanning tunneling microscope (STM) enables creation of atomically defined artificial structures that host designer quantum states. However, the time-consuming nature of the manipulation process, coupled with the sensitivity of the STM tip, constrains the exploration of diverse configurations and limits the size of designed features. In this study, we present a reinforcement learning (RL)-based framework for creating artificial structures by spatially manipulating carbon monoxide (CO) molecules on a copper substrate using the STM tip. The automated workflow combines molecule detection and manipulation, employing deep learning-based object detection to locate CO molecules and linear assignment algorithms to allocate these molecules to designated target sites. We initially perform molecule maneuvering based on randomized parameter sampling for sample bias, tunneling current setpoint and manipulation speed. This dataset is then structured into an action trajectory used to train an RL agent. The model is subsequently deployed on the STM for real-time fine-tuning of manipulation parameters during structure construction. Our approach incorporates path planning protocols coupled with active drift compensation to enable atomically precise fabrication of structures with significantly reduced human input while realizing larger-scale artificial lattices with desired electronic properties. Using our approach, we demonstrate the automated construction of an extended artificial graphene lattice and confirm the existence of characteristic Dirac point in its electronic structure. Further challenges to RL-based structural assembly scalability are discussed.
中文标题/摘要
标题:自动构建具有设计电子态的人工晶格结构
使用扫描隧道显微镜(STM)操纵物质可以创建原子级精确的人工结构,这些结构承载经过设计的量子态。然而,操纵过程耗时,加之STM针尖的敏感性,限制了对多种构型的探索,也限制了所设计特征的尺寸。在本研究中,我们提出了一种基于强化学习(RL)的框架,利用STM针尖在铜基底上对一氧化碳(CO)分子进行空间操纵,以创建人工结构。该自动化工作流结合了分子检测和操纵,利用基于深度学习的目标检测来定位CO分子,并使用线性分配算法将这些分子分配到指定的目标位置。我们首先通过对样品偏压、隧道电流设定点和操纵速度进行随机参数采样来执行分子操纵。然后将此数据集整理为动作轨迹,用于训练RL代理。随后将该模型部署到STM上,在结构构建过程中实时微调操纵参数。我们的方法结合了路径规划协议和主动漂移补偿,在显著减少人工输入的同时实现原子级精确的结构制备,并构建具有所需电子性质的更大规模人工晶格。使用我们的方法,我们展示了扩展人工石墨烯晶格的自动化构建,并确认其电子结构中存在特征性的狄拉克点。文中还讨论了基于RL的结构组装在可扩展性方面面临的进一步挑战。
Summary / 总结
This study introduces a reinforcement learning (RL)-based framework for automating the construction of artificial lattice structures using a scanning tunneling microscope (STM) to manipulate carbon monoxide (CO) molecules on a copper substrate. The method combines deep learning for molecule detection and linear assignment algorithms for allocation, followed by RL training for real-time parameter fine-tuning. The approach enables the creation of larger-scale artificial lattices with reduced human input and confirmed the presence of a Dirac point in the electronic structure of an extended artificial graphene lattice.
本研究提出了一种基于强化学习(RL)的方法,用于使用扫描隧道显微镜(STM)自动化构建人工晶格结构。该方法利用深度学习进行分子检测以定位一氧化碳(CO)分子,并使用线性分配算法将它们分配到目标位置。该方法包括随机参数采样和结构构建过程中的实时调整,结合路径规划和主动漂移补偿以确保精确制造。关键发现包括自动化创建扩展的人工石墨烯晶格,并确认其电子结构中的狄拉克点特征;文中同时讨论了基于RL的组装在可扩展性方面仍面临的挑战。
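The step of allocating detected CO molecules to target lattice sites is a linear assignment problem. The sketch below illustrates it with a brute-force search, assuming the cost is the total Euclidean travel distance. The paper's actual cost function and solver are not stated in the abstract, and the coordinates and function name are hypothetical. Brute force is only workable for a handful of molecules; a real pipeline would use a polynomial-time solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations
from math import dist


def assign_molecules(molecules, targets):
    """Assign each molecule to a distinct target, minimizing total distance.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    target chosen for molecule i. Requires len(molecules) <= len(targets).
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(targets)), len(molecules)):
        cost = sum(dist(molecules[i], targets[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical detected molecule positions and target sites (arbitrary units).
molecules = [(0.0, 0.0), (5.0, 5.0)]
targets = [(5.0, 4.0), (1.0, 0.0)]
assignment, total = assign_molecules(molecules, targets)
```

Minimizing total travel distance is one plausible objective because every manipulation step takes time and risks disturbing the tip; other costs, such as avoiding paths that cross already placed molecules, would change the assignment.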
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
Authors: Alexis Audran-Reiss, Jordi Armengol-Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach
First: 2025-11-19T16:32:18+00:00 · Latest: 2025-12-09T17:32:45+00:00
Abstract
AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.
中文标题/摘要
标题:成为优秀的AI研究代理需要什么?探究创意多样性的作用
AI研究代理有望通过自动化机器学习模型的设计、实现和训练来加速科学进步。然而,该领域仍处于起步阶段,驱动代理轨迹成功或失败的关键因素尚未完全理解。我们研究了创意多样性在代理性能中的作用。首先,我们在MLE-bench这一知名基准上分析不同模型和代理支架的代理轨迹。我们的分析表明,不同的模型和代理支架会产生不同程度的创意多样性,且表现更佳的代理通常具有更高的创意多样性。进一步,我们进行了一项受控实验,修改创意多样性的程度,证明更高的创意多样性会导致更强的性能。最后,我们通过考察超越MLE-bench标准奖牌评分的其他评估指标来加强我们的结果,显示我们的发现仍然适用于其他代理性能指标。
Summary / 总结
The study investigates the role of ideation diversity in the performance of AI research agents. Analysis of agent trajectories on MLE-bench across models and agent scaffolds finds that higher-performing agents tend to show greater ideation diversity, and a controlled experiment that varies the degree of diversity shows that higher diversity yields stronger performance. The findings hold under evaluation metrics beyond MLE-bench's standard medal-based scoring.
研究探讨了创意多样性对AI研究代理性能的影响。通过对MLE-bench上的代理轨迹进行分析和控制实验,研究发现更高的创意多样性与更好的性能相关。额外的评估指标也支持这些发现,表明创意多样性是代理成功的关键因素。
PinRec: Outcome-Conditioned, Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems
Authors: Prabhat Agarwal, Anirudhan Badrinath, Laksh Bhasin, Jaewon Yang, Edoardo Botta, Jiajing Xu, Charles Rosenberg
First: 2025-04-09T17:46:12+00:00 · Latest: 2025-12-09T17:25:54+00:00
Abstract
Generative retrieval methods utilize generative sequential modeling techniques, such as transformers, to generate candidate items for recommender systems. These methods have demonstrated promising results in academic benchmarks, surpassing traditional retrieval models like two-tower architectures. However, current generative retrieval methods lack the scalability required for industrial recommender systems, and they are insufficiently flexible to satisfy the multiple metric requirements of modern systems. This paper introduces PinRec, a novel generative retrieval model developed for applications at Pinterest. PinRec utilizes outcome-conditioned generation, enabling modelers to specify how to balance various outcome metrics, such as the number of saves and clicks, to effectively align with business goals and user exploration. Additionally, PinRec incorporates multi-token generation to enhance output diversity while optimizing generation. Our experiments demonstrate that PinRec can successfully balance performance, diversity, and efficiency, delivering a significant positive impact to users using generative models. This paper marks a significant milestone in generative retrieval, as it presents, to our knowledge, the first rigorous study on implementing generative retrieval at the scale of Pinterest.
中文标题/摘要
标题:PinRec:基于结果条件的多令牌生成检索方法及其在工业规模推荐系统中的应用
生成检索方法利用生成序列建模技术,如变换器,为推荐系统生成候选项目。这些方法在学术基准测试中表现出色,超越了传统的检索模型,如双塔架构。然而,当前的生成检索方法缺乏适应工业推荐系统所需的可扩展性,并且不足以满足现代系统中的多种指标要求。本文介绍了PinRec,这是一种为Pinterest应用开发的新型生成检索模型。PinRec利用基于结果条件的生成,使建模者能够指定如何平衡各种结果指标,如保存次数和点击次数,以有效与业务目标和用户探索相一致。此外,PinRec还结合了多令牌生成,以增强输出多样性并优化生成。我们的实验表明,PinRec能够成功平衡性能、多样性和效率,为使用生成模型的用户提供显著的积极影响。本文标志着生成检索的一个重要里程碑,因为它据我们所知,首次对在Pinterest规模上实施生成检索进行了严谨的研究。
Summary / 总结
PinRec is a novel generative retrieval model designed for industry-scale recommendation systems at Pinterest. It uses outcome-conditioned generation to balance multiple metrics like saves and clicks, aligning with business goals. PinRec also employs multi-token generation to improve output diversity. Experiments show that PinRec successfully balances performance, diversity, and efficiency, providing a positive impact on user experience.
PinRec 是一种为 Pinterest 等大规模推荐系统设计的新型生成检索模型。它通过结果条件生成来平衡各种指标,如保存和点击,以符合业务目标。PinRec 还使用多令牌生成来提高输出多样性。实验表明,PinRec 成功地平衡了性能、多样性和效率,对使用生成模型的用户体验产生了积极影响。
Forecasting Fails: Unveiling Evasion Attacks in Weather Prediction Models
Authors: Huzaifa Arif, Pin-Yu Chen, Alex Gittens, James Diffenderfer, Bhavya Kailkhura
Venue: Association for the Advancement of Artificial Intelligence 2025
First: 2025-12-09T17:20:56+00:00 · Latest: 2025-12-09T17:20:56+00:00
Abstract
With the increasing reliance on AI models for weather forecasting, it is imperative to evaluate their vulnerability to adversarial perturbations. This work introduces Weather Adaptive Adversarial Perturbation Optimization (WAAPO), a novel framework for generating targeted adversarial perturbations that are both effective in manipulating forecasts and stealthy to avoid detection. WAAPO achieves this by incorporating constraints for channel sparsity, spatial localization, and smoothness, ensuring that perturbations remain physically realistic and imperceptible. Using the ERA5 dataset and FourCastNet (Pathak et al. 2022), we demonstrate WAAPO's ability to generate adversarial trajectories that align closely with predefined targets, even under constrained conditions. Our experiments highlight critical vulnerabilities in AI-driven forecasting models, where small perturbations to initial conditions can result in significant deviations in predicted weather patterns. These findings underscore the need for robust safeguards to protect against adversarial exploitation in operational forecasting systems.
中文标题/摘要
标题:预测失败:揭示天气预测模型中的规避攻击
随着对AI模型在天气预报中的依赖增加,评估其对抗性扰动的脆弱性变得至关重要。本研究引入了Weather Adaptive Adversarial Perturbation Optimization (WAAPO)框架,这是一种新颖的方法,用于生成既有效操纵预测又具有隐蔽性的对抗性扰动。WAAPO通过引入信道稀疏性、空间局部性和平滑性的约束,确保扰动保持物理现实性和不可感知性。使用ERA5数据集和FourCastNet(Pathak等,2022),我们展示了WAAPO生成与预定义目标高度一致的对抗性轨迹的能力,即使在受限条件下也是如此。我们的实验突显了AI驱动的预报模型中的关键漏洞,其中初始条件的小扰动会导致预测天气模式产生显著偏差。这些发现强调了在运营预报系统中需要强大的防护措施以防止对抗性利用的必要性。
Summary / 总结
This work addresses the vulnerability of AI models used in weather forecasting to adversarial attacks. It introduces WAAPO, a framework that generates targeted adversarial perturbations that are both effective and stealthy. Using the ERA5 dataset and FourCastNet, the study shows that WAAPO can manipulate weather forecasts to align with predefined targets, revealing critical vulnerabilities in these models. The experiments indicate that small perturbations to initial conditions can lead to significant deviations in weather predictions, emphasizing the need for robust safeguards against adversarial attacks in forecasting systems.
该研究关注AI模型在天气预报中的对抗攻击脆弱性,并引入了WAAPO框架,该框架能够生成既有效又隐蔽的对抗扰动。使用ERA5数据集和FourCastNet,研究展示了WAAPO能够操纵天气预报以符合预定义的目标,揭示了这些模型中的关键漏洞。实验表明,初始条件的小扰动会导致天气预测结果的重大偏差,强调了对抗攻击防护措施在预报系统中的重要性。
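The abstract names three constraints on the perturbation (channel sparsity, spatial localization, smoothness) but does not give their formulas. As a rough illustration only, the sketch below uses common stand-ins: a sum of per-channel L2 norms for sparsity, a binary region mask for localization, and total variation for smoothness. These formulations, the function names, and the `[channel][row][col]` layout are all assumptions and are not taken from WAAPO itself.

```python
import math

# Hypothetical layout: a perturbation indexed as [channel][row][col].
Field = list[list[list[float]]]


def channel_sparsity(delta: Field) -> float:
    """Sum of per-channel L2 norms; small when few channels are perturbed."""
    return sum(math.sqrt(sum(v * v for row in ch for v in row)) for ch in delta)


def localize(delta: Field, mask: list[list[bool]]) -> Field:
    """Zero the perturbation outside the allowed spatial region."""
    return [
        [[v if mask[r][c] else 0.0 for c, v in enumerate(row)]
         for r, row in enumerate(ch)]
        for ch in delta
    ]


def total_variation(delta: Field) -> float:
    """Sum of absolute differences between adjacent cells in each channel."""
    tv = 0.0
    for ch in delta:
        for r, row in enumerate(ch):
            for c, v in enumerate(row):
                if c + 1 < len(row):      # right-hand neighbour
                    tv += abs(row[c + 1] - v)
                if r + 1 < len(ch):       # neighbour below
                    tv += abs(ch[r + 1][c] - v)
    return tv


# Toy 2-channel, 2x2 perturbation: only channel 0 is non-zero.
delta = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
mask = [[True, False], [False, False]]

sparsity = channel_sparsity(delta)   # only channel 0 contributes
localized = localize(delta, mask)    # keeps the top-left cell only
roughness = total_variation(delta)
```

In an optimization loop, penalties like these would typically be weighted and added to the targeting loss; the weights and the optimizer are left out because the abstract does not specify them.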
SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Authors: Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
First: 2025-08-21T17:59:16+00:00 · Latest: 2025-12-09T17:19:48+00:00
Comments: Accepted by 3DV 2026; Project Page: https://mengmouxu.github.io/SceneGen
Abstract
3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
中文标题/摘要
标题:SceneGen:单张图像的一次前向传递生成三维场景
三维内容生成近年来引起了广泛的研究兴趣,这主要得益于其在VR/AR和具身AI中的关键应用。在本文中,我们解决了在单张场景图像中合成多个三维资产的具有挑战性的任务。具体来说,我们的贡献包括四个方面:(i) 我们提出了SceneGen,这是一种新颖的框架,它接受场景图像及其对应的物体掩码作为输入,同时生成具有几何和纹理的多个三维资产。值得注意的是,SceneGen 在不需要额外优化或资产检索的情况下运行;(ii) 我们引入了一种新颖的特征聚合模块,该模块在特征提取模块中结合了视觉和几何编码器的局部和全局场景信息。结合位置头,这使得在单次前向传递中生成三维资产及其相对空间位置成为可能;(iii) 我们展示了SceneGen 直接扩展到多图像输入场景的能力。尽管仅在单图像输入上进行训练,但当提供多个图像时,我们的架构可以提高生成性能;(iv) 详尽的定量和定性评估证实了我们方法的高效性和鲁棒性。我们认为,这种范式为高质量三维内容生成提供了一个新的解决方案,有可能推动其在下游任务中的实际应用。代码和模型将在以下网址公开:https://mengmouxu.github.io/SceneGen/
Summary / 总结
SceneGen is a novel framework for generating multiple 3D assets from a single scene image and corresponding object masks, achieving this in a single feedforward pass without extra optimization or asset retrieval. It introduces a feature aggregation module that integrates local and global scene information, enabling the simultaneous generation of 3D assets and their spatial positions. Extensive evaluations show that SceneGen is efficient and robust, and its performance improves when multiple images are provided. The approach offers a promising solution for high-quality 3D content generation.
SceneGen 是一种新颖的框架,可以从单张场景图像和对应的物体掩码中生成多个 3D 资产,并在单次前向传播中完成,无需额外的优化或资产检索。它引入了一种特征聚合模块,可以整合局部和全局场景信息,从而同时生成 3D 资产及其空间位置。广泛的评估表明,SceneGen 是高效且稳健的,其性能在提供多张图像时会有所提升。该方法为高质量 3D 内容生成提供了有前景的解决方案。
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
First: 2025-12-09T17:18:32+00:00 · Latest: 2025-12-09T17:18:32+00:00
Comments: 16 pages, 8 figures
Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
中文标题/摘要
标题:InfiniteVL:结合线性与稀疏注意机制以实现高效且无输入限制的跨模态视觉-语言模型
窗口注意力和线性注意力是缓解视觉-语言模型(VLM)中二次复杂性和不断增长的KV缓存的两种主要策略。然而,我们发现基于窗口的VLM在序列长度超过窗口大小时会性能下降,而线性注意力在OCR和文档理解等信息密集型任务中表现不佳。为克服这些限制,我们提出了InfiniteVL,这是一种结合滑动窗口注意力(SWA)与Gated DeltaNet的线性复杂度VLM架构。为了在资源受限的情况下实现竞争性的跨模态性能,我们设计了三阶段训练策略,包括蒸馏预训练、指令调优和长序列SFT。令人惊讶的是,使用比领先VLM少于2%的训练数据,InfiniteVL不仅显著优于之前的线性复杂度VLM,还与基于Transformer的领先VLM性能相当,同时展示了有效的长期记忆保留。与通过FlashAttention-2加速的类似规模的Transformer-based VLM相比,InfiniteVL实现了超过3.6倍的推理加速,同时保持了恒定的延迟和内存占用。在流式视频理解场景中,它能够保持稳定的每秒24帧实时预填充速度,同时保留长期记忆缓存。代码和模型可在https://github.com/hustvl/InfiniteVL获取。
Summary / 总结
InfiniteVL is a linear-complexity VLM architecture that combines sliding window attention with Gated DeltaNet to address the limitations of window-based and linear attention methods. It is trained with a three-stage strategy: distillation pretraining, instruction tuning, and long-sequence SFT. Using less than 2% of the training data required by leading VLMs, InfiniteVL outperforms previous linear-complexity VLMs and matches the performance of leading Transformer-based VLMs. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, it achieves over 3.6× inference speedup while maintaining constant latency and memory footprint, and it sustains a stable 24 FPS real-time prefill speed in streaming video understanding scenarios.
InfiniteVL 是一种结合滑动窗口注意力和 Gated DeltaNet 的线性复杂度视觉-语言模型,旨在克服窗口基模型和线性注意力方法的局限性。它采用三阶段训练策略,并以不到 2% 的训练数据量实现了与领先 Transformer 基模型相当的性能。InfiniteVL 在推理速度上比类似规模的 Transformer 基模型快 3.6 倍,同时保持了恒定的延迟和内存占用。在实时流式视频理解场景中,它保持了稳定的每秒 24 帧填充速度,并有效保留了长期记忆缓存。
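To make the window-attention side of the abstract concrete, here is a minimal sketch of a causal sliding-window mask. It shows why the number of keys each query sees (and hence the cache needed for that layer) stays bounded by the window size regardless of sequence length. The function names and the toy sizes are assumptions for illustration; this is not InfiniteVL's implementation, and it does not model the Gated DeltaNet layers.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal: j <= i. Windowed: only the most recent `window` positions.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def keys_per_query(mask: list[list[bool]]) -> list[int]:
    """Number of keys visible to each query position."""
    return [sum(row) for row in mask]


mask = sliding_window_mask(seq_len=6, window=3)
counts = keys_per_query(mask)  # grows to the window size, then stays flat
```

Positions older than the window are invisible to this mask, which is consistent with the abstract's observation that window-only models degrade once the sequence exceeds the window size.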