arXiv 论文速递

2026-01-12 03:30
Snapshot: 20260112_0330
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
First: 2026-01-08T18:59:56+00:00 · Latest: 2026-01-08T18:59:56+00:00
Comments: 15 pages, 8 figures, project page: https://mesh-4d.github.io/
Abstract
We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.
中文标题/摘要
标题:Mesh4D:基于单目视频的4D网格重建与跟踪
我们提出Mesh4D,一种用于单目4D网格重建的前馈模型。给定一个动态对象的单目视频,我们的模型重建对象的完整3D形状和运动,表示为变形场。我们的主要贡献是一个紧凑的潜在空间,可以在一次通过中编码整个动画序列。该潜在空间通过自编码器学习,训练过程中由训练对象的骨骼结构引导,提供了合理的变形先验。关键的是,在推理时不需要骨骼信息。编码器采用时空注意力机制,提供对象整体变形的更稳定表示。在此表示基础上,我们训练一个潜在扩散模型,在输入视频和第一帧重建的网格条件下,预测完整的动画。我们在重建和新颖视图合成基准上评估Mesh4D,优于先前方法,在恢复准确的3D形状和变形方面表现出色。
Summary / 总结
Mesh4D is a feed-forward model for monocular 4D mesh reconstruction that reconstructs the complete 3D shape and motion of a dynamic object from a single monocular video. It uses a compact latent space learned by an autoencoder, which is guided by the skeletal structure during training. The model predicts the full animation in one shot without requiring skeletal information at inference time. Experiments show that Mesh4D outperforms previous methods in recovering accurate 3D shape and deformation on various benchmarks.
Mesh4D 是一种单目 4D 网格重建模型,可以从单个单目视频中重建动态对象的完整 3D 形状和运动。它使用一个由自编码器学习的紧凑的潜在空间,在训练过程中由骨骼结构提供先验信息。模型在不需要在推理时提供骨骼信息的情况下,可以一次性预测完整的动画。实验表明,Mesh4D 在各种基准测试中优于先前的方法,在恢复准确的 3D 形状和变形方面表现更佳。
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
Authors: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00
Comments: Project page: https://ntuneillee.github.io/research/rl-awb/
Abstract
Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
中文标题/摘要
标题:RL-AWB:低光夜间场景自动白平衡校正的深度强化学习
夜间颜色恒定性仍然是计算摄影中的一个挑战性问题,由于低光噪声和复杂的照明条件。我们提出了RL-AWB,一种结合统计方法与深度强化学习的新型框架,用于夜间白平衡。我们的方法以一个针对夜间场景定制的统计算法为基础,结合了显著灰度像素检测与新颖的照明估计。在此基础上,我们开发了第一个基于统计算法的深度强化学习颜色恒定性方法,通过动态优化每个图像的参数来模拟专业AWB调优专家。为了便于跨传感器评估,我们引入了第一个多传感器夜间数据集。实验结果表明,我们的方法在低光和良好照明的图像上具有更强的泛化能力。项目页面:https://ntuneillee.github.io/research/rl-awb/
Summary / 总结
The research aims to address the challenge of nighttime color constancy in computational photography by developing RL-AWB, a framework that combines statistical methods with deep reinforcement learning. The method starts with a statistical algorithm designed for nighttime scenes, which includes salient gray pixel detection and novel illumination estimation. It then uses deep reinforcement learning to dynamically optimize parameters for each image, mimicking professional AWB tuning. The study introduces a multi-sensor nighttime dataset for cross-sensor evaluation and shows that the method outperforms existing approaches in generalization across different lighting conditions.
研究旨在通过结合统计方法和深度强化学习来解决夜间颜色恒定性问题,开发了RL-AWB框架。该方法首先使用一种针对夜间场景的统计算法,包括显著灰像素检测和新颖的照明估计。然后使用深度强化学习动态优化每张图像的参数,模拟专业AWB调优专家。研究引入了一个多传感器夜间数据集进行跨传感器评估,并展示了该方法在不同光照条件下的优越泛化能力。
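As a rough illustration of what a statistical white-balance step does, the sketch below applies the classical gray-world assumption: the scene average is taken to be achromatic, and per-channel gains are chosen to equalize the channel means. This is a swapped-in textbook baseline, not the salient gray pixel detector or the illumination estimator of RL-AWB, and the function names `gray_world_gains` and `apply_gains` are hypothetical. The reinforcement learning component described in the abstract tunes the parameters of the authors' own statistical algorithm, which is not reproduced here.

```python
def gray_world_gains(pixels):
    """Estimate per-channel gains from a list of (r, g, b) linear-RGB pixels.

    Gray-world assumption (a classical baseline, not the RL-AWB algorithm):
    the average scene colour is gray, so the channel means should be equal.
    Gains are normalised so that the green gain is 1.
    """
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n
    return (mean_g / mean_r, 1.0, mean_g / mean_b)


def apply_gains(pixels, gains):
    """Multiply every pixel by the per-channel gains."""
    return [tuple(c * g for c, g in zip(p, gains)) for p in pixels]


# Two gray patches seen under a warm (reddish) illuminant.
warm_cast = [(0.8, 0.5, 0.2), (0.4, 0.25, 0.1)]
gains = gray_world_gains(warm_cast)
balanced = apply_gains(warm_cast, gains)
```

After correction, each patch has equal red, green, and blue values, which is what a neutral surface should look like. Gray-world fails when the scene is dominated by one colour, which is one reason more selective gray-pixel methods exist.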
QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik
First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00
Comments: 30 pages, 15 figures, 11 tables; project page: https://4dqv.mpi-inf.mpg.de/QNeRF/
Abstract
Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
中文标题/摘要
标题:QNeRF:在模拟的基于门的量子计算机上的神经辐射场
最近,量子视觉场(QVFs)在模型紧凑性和收敛速度方面显示出对学习提供的2D或3D信号的有希望的改进。同时,神经辐射场(NeRFs)在新颖视角合成方面取得了重大进展,其中模型从2D图像中学习紧凑表示以渲染3D场景,尽管代价是更大的模型和密集的训练。在本文中,我们通过引入QNeRF扩展了QVFs的方法,QNeRF是第一个为从2D图像合成新颖视角而设计的混合量子-经典模型。QNeRF利用参数化量子电路通过量子叠加和纠缠来编码空间和视角相关的信息,从而与经典对应物相比具有更紧凑的模型。我们提出了两种架构变体。全QNeRF最大限度地利用所有量子振幅以增强表示能力。相比之下,双分支QNeRF通过分支空间和视角相关的量子态准备引入任务导向的归纳偏置,大幅降低此操作的复杂性并确保可扩展性和潜在的硬件兼容性。我们的实验表明,当在中等分辨率的图像上进行训练时,QNeRF在参数数量不到一半的情况下可以匹配或超越经典的NeRF基线。这些结果表明,量子机器学习可以作为计算机视觉中中级任务(如从2D观察学习3D表示)连续信号表示的竞争性替代方案。
Summary / 总结
QNeRF is a hybrid quantum-classical model that uses parameterised quantum circuits for novel-view synthesis from 2D images, yielding more compact models than its classical counterpart. Two variants are presented: Full QNeRF exploits all quantum amplitudes to enhance representational capability, while Dual-Branch QNeRF introduces a task-informed inductive bias by branching the spatial and view-dependent quantum state preparations, reducing the complexity of that operation. When trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines with less than half the number of parameters, suggesting quantum machine learning as a competitive alternative for 3D representation learning from 2D observations.
QNeRF 是一种混合量子-经典模型,利用参数化量子电路从 2D 图像进行新颖视角合成,增强紧凑性和表示能力。提出了两种变体:Full QNeRF 充分利用量子振幅,而 Dual-Branch QNeRF 引入任务导向的归纳偏差进行空间和视角依赖的量子态准备,减少复杂性。实验表明 QNeRF 在参数量少于一半的情况下,匹配或超越经典 NeRF,表明量子机器学习在从 2D 观测学习 3D 表征方面是一个有竞争力的替代方案。
Pixel-Perfect Visual Geometry Estimation
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00
Comments: Code: https://github.com/gangweix/pixel-perfect-depth
Abstract
Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
中文标题/摘要
标题:像素完美视觉几何估计
从图像中恢复干净且准确的几何结构对于机器人技术和增强现实至关重要。然而,现有的几何基础模型仍然严重受到飞像素和细节损失的影响。在本文中,我们提出了像素完美视觉几何模型,通过在像素空间中利用生成建模来预测高质量、无飞像素的点云。我们首先介绍了像素完美深度(PPD),这是一种基于像素空间扩散变换器(DiT)的单目深度基础模型。为了解决像素空间扩散带来的高计算复杂性,我们提出了两个关键设计:1)语义提示DiT,该设计结合了视觉基础模型的语义表示,以提示扩散过程,保留全局语义同时增强细粒度视觉细节;2)级联DiT架构,逐步增加图像标记的数量,提高效率和准确性。为了将PPD进一步扩展到视频(PPVD),我们引入了一种新的语义一致DiT,该设计从多视图几何基础模型中提取时间一致的语义。然后在DiT中进行参考引导的标记传播,以最小的计算和内存开销保持时间连贯性。我们的模型在所有生成单目和视频深度估计模型中表现最佳,并且产生的点云比其他所有模型都更干净。
Summary / 总结
This paper addresses the issue of recovering clean and accurate geometry from images, crucial for robotics and augmented reality. It introduces pixel-perfect visual geometry models, specifically Pixel-Perfect Depth (PPD) and its extension to video (PPVD), which use pixel-space diffusion transformers to predict high-quality point clouds without flying pixels. Key methods include Semantics-Prompted DiT for preserving global semantics and enhancing fine details, and Cascade DiT for improving efficiency and accuracy. The models outperform existing generative monocular and video depth estimation methods, producing cleaner point clouds.
本文旨在从图像中恢复干净准确的几何结构,这对机器人技术和增强现实至关重要。文中提出了像素完美的视觉几何模型,通过像素空间生成模型预测高质量的点云,而不包含飞像素。该模型包括像素完美深度(PPD)及其视频扩展PPVD,利用像素空间扩散变换器(DiT)并结合语义表示以保留全局语义同时增强细粒度的视觉细节。实验结果表明,这些模型在单目和视频深度估计中优于现有方法,生成的点云更为干净。
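To make the term "flying pixels" concrete, the following minimal sketch shows one common way they arise: when a depth map is smoothed or interpolated across an object boundary, the blended depth values lie between the foreground and background surfaces, and unprojecting them yields points floating in empty space. The depth values, the midpoint upsampling, and the helper names are illustrative assumptions and are unrelated to the PPD architecture itself.

```python
def midpoint_upsample(depth_row):
    """Insert the average of each neighbouring pair (simple linear interpolation)."""
    out = []
    for a, b in zip(depth_row, depth_row[1:]):
        out.extend([a, (a + b) / 2.0])
    out.append(depth_row[-1])
    return out


def unproject_row(depth_row, focal, cx):
    """Pinhole unprojection of one image row to (x, z) points."""
    return [((u - cx) * z / focal, z) for u, z in enumerate(depth_row)]


def flying_pixel_indices(depth_row, surfaces, tol=1e-6):
    """Indices whose depth matches none of the true surface depths."""
    return [i for i, z in enumerate(depth_row)
            if all(abs(z - s) > tol for s in surfaces)]


# Foreground object at 1 m next to a background wall at 5 m (hypothetical values).
coarse = [1.0, 1.0, 5.0, 5.0]
dense = midpoint_upsample(coarse)
flying = flying_pixel_indices(dense, surfaces=(1.0, 5.0))
points = unproject_row(dense, focal=100.0, cx=3.0)
```

Here the interpolated sample at the boundary receives a depth of 3 m, which belongs to neither surface; in a rendered point cloud it appears as a stray point between the object and the wall.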
GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang
First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00
Comments: IJCV, Project Page: https://henghuiding.com/GREx/
Abstract
Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.
中文标题/摘要
标题:GREx:通用指代表达分割、理解和生成
指代表达分割(RES)和理解(REC)分别对表达描述的对象进行分割和检测,而指代表达生成(REG)则生成描述选定对象的表达。现有数据集和方法通常仅支持单目标表达,即一个表达仅指代一个对象,而不考虑多目标和无目标表达。这极大地限制了指代表达(RES/REC/REG)的实际应用。本文介绍了三个新的基准测试,即通用指代表达分割(GRES)、理解(GREC)和生成(GREG),统称为GREx,它们将经典指代表达扩展到允许表达识别任意数量的对象。我们构建了第一个大规模GREx数据集gRefCOCO,包含多目标、无目标和单目标表达及其对应的带有标记目标的图像。GREx和gRefCOCO旨在与REx向后兼容,便于进行广泛的实验,研究现有指代表达方法在GREx任务上的性能差距。GRES/GREC的一个挑战是复杂关系建模,为此我们提出了一种基线ReLA,它适应性地将图像划分为具有子实例线索的区域,并明确建模区域间的依赖关系和区域语言依赖关系。提出的ReLA在GRES和GREC任务上均达到了最先进的结果。提出的gRefCOCO数据集和方法可在https://henghuiding.github.io/GREx/获取。
Summary / 总结
This paper introduces GREx, a new framework for Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), which extends the classic REx to handle multi-target and no-target expressions. It constructs the gRefCOCO dataset with labeled multi-target, no-target, and single-target expressions. The proposed ReLA model achieves state-of-the-art results on GRES and GREC tasks by adaptively dividing images into regions and explicitly modeling dependencies. The gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.
该论文引入了GREx,将传统的指示表达任务扩展以处理多目标和无目标表达。它构建了一个大型数据集gRefCOCO,并提出了一种基线方法ReLA,该方法建模了区域-区域和区域-语言的依赖关系,在GREx任务上取得了最先进的结果。数据集和方法可在https://henghuiding.github.io/GREx/获取。
Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation
Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider
First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00
Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426
Abstract
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.
中文标题/摘要
标题:利用临床文本和类别条件生成3D前列腺MRI
目标:潜在扩散模型(LDM)可以缓解医学成像领域机器学习开发中的数据稀缺挑战。然而,医学LDM策略通常依赖于简短提示文本编码器、非医学LDM或大量数据。这些策略可能会限制性能和科学可访问性。我们提出了一种新的LDM条件化方法来解决这些限制。方法:我们提出了类别条件高效大型语言模型适配器(CCELLA),这是一种新颖的双头条件化方法,同时用自由文本临床报告和放射学分类条件化LDM U-Net。我们还提出了一种数据高效的LDM流水线,围绕CCELLA和提出的联合损失函数。我们首先在3D前列腺MRI上评估了我们的方法,与最先进的方法进行了比较。然后,我们使用我们方法生成的合成图像增强了下游分类器模型训练数据集。结果:我们的方法在大小受限的3D前列腺MRI数据集上实现了0.025的3D FID分数,显著优于最近的基础模型,后者FID为0.070。在训练前列腺癌预测分类器时,添加我们方法生成的合成图像提高了分类器的准确性,从69%提高到74%,并优于使用先前最先进的方法生成的图像训练的分类器。仅使用我们方法生成的合成图像进行分类器训练,其性能与使用真实图像训练相当。结论:我们展示了我们的方法在使用有限数据和最少的人工注释的情况下,提高了合成图像质量和下游分类器性能。意义:提出的CCELLA为中心的流水线能够在有限的数据量和人工数据注释的情况下,实现放射学报告和类别条件的LDM训练,以生成高质量的医学图像,从而提高LDM性能和科学可访问性。
Summary / 总结
The research aims to improve the performance and scientific accessibility of latent diffusion models (LDM) for medical imaging under data scarcity. The authors propose CCELLA, a dual-head conditioning approach that simultaneously conditions the LDM U-Net on free-text clinical reports and radiology classification. On a size-limited 3D prostate MRI dataset, the method achieves a 3D FID score of 0.025, compared with 0.070 for a recent foundation model. Adding synthetic images generated by the method to the training set improves the accuracy of a downstream prostate cancer classifier from 69% to 74%.
研究旨在通过利用临床文本和类别条件来提高医学成像中潜在扩散模型(LDM)的性能。作者提出了CCELLA,这是一种双头条件方法,同时用自由文本临床报告和放射学分类来条件化LDM U-Net。该方法显著优于现有最佳模型,实现了3D FID分数为0.025,并将分类器准确性从69%提高到74%。该方法在仅用于分类器训练时也能达到与真实图像训练相当的性能,证明了其在有限数据和少量人工注释下的有效性。
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00
Comments: NVIDIA-Tech Report
Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) under the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
中文标题/摘要
标题:GDPO:组奖励-解耦归一化策略优化方法在多奖励RL优化中的应用
随着语言模型能力的不断增强,用户期望它们不仅能提供准确的响应,还能表现出与各种场景中多样的人类偏好相一致的行为。为了实现这一目标,强化学习(RL)管道开始采用多个奖励,每个奖励捕捉一种独特的偏好,以引导模型向这些期望的行为发展。然而,最近的工作在多奖励设置中默认使用组相对策略优化(GRPO)而没有对其适用性进行检查。本文展示了直接将GRPO应用于归一化不同的rollout奖励组合会导致这些组合的优势值坍缩为相同的值,降低训练信号的分辨率,导致次优收敛,在某些情况下甚至导致训练早期失败。我们随后引入了组奖励-解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过解耦个体奖励的归一化来解决这些问题,更忠实地保留它们的相对差异,从而实现更准确的多奖励优化,并且训练稳定性显著提高。我们通过工具调用、数学推理和编程推理三个任务将GDPO与GRPO进行了比较,评估了正确性指标(准确率、bug比率)和约束遵守指标(格式、长度)。在所有设置中,GDPO始终优于GRPO,证明了其在多奖励强化学习优化中的有效性和普适性。
Summary / 总结
The paper addresses the challenge of aligning language models with diverse human preferences by optimizing multiple rewards in reinforcement learning. It shows that directly applying Group Relative Policy Optimization (GRPO) in the multi-reward setting can cause distinct rollout reward combinations to collapse into identical advantage values, which reduces the resolution of the training signal and leads to suboptimal convergence or, in some cases, early training failure. To resolve this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, preserving their relative differences and improving training stability. GDPO outperforms GRPO across three tasks (tool calling, math reasoning, and coding reasoning) on both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length).
论文探讨了使用多奖励强化学习来使语言模型与多样的人类偏好对齐的问题。它指出了组相对策略优化(GRPO)方法的问题,该方法会导致不同的奖励值坍缩为相同的值,从而导致训练效果不佳。为了解决这一问题,作者提出了组奖励解耦归一化策略优化(GDPO)方法,该方法通过解耦个体奖励的归一化,保留它们的相对差异,从而提高训练稳定性。GDPO在工具调用、数学推理和编码推理三个任务中均优于GRPO,从正确性和约束遵守度指标来看均表现出更优的效果。
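The collapse described in the GDPO entry above can be reproduced with a toy group of rollouts. The sketch below compares two orderings of the same operations: summing the rewards and then normalizing within the group (the GRPO-style coupling), versus normalizing each reward within the group and then summing. The second ordering is only an assumption about what "decoupling the normalization of individual rewards" means; the abstract does not give the exact GDPO formula, and the function names and reward values here are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group-wise z-score: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def sum_then_normalize(rewards):
    """Coupled variant: add the rewards per rollout, then normalize the totals."""
    totals = [sum(r) for r in rewards]
    return normalize(totals)


def normalize_then_sum(rewards):
    """Decoupled variant (assumed form): normalize each reward column, then add."""
    per_reward = [normalize(list(column)) for column in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


# Each rollout has two binary rewards: (correctness, format). Values are made up.
group = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
coupled = sum_then_normalize(group)
decoupled = normalize_then_sum(group)
```

Rollouts 0 and 1 earn different reward combinations (correct but badly formatted versus incorrect but well formatted), yet their totals are equal, so the coupled variant assigns them the same advantage. The decoupled variant keeps them apart because correctness is rarer in this group than good formatting, so it carries a larger normalized weight.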
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00
Abstract
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
中文标题/摘要
标题:RoboVIP:视觉身份提示增强的多视角视频生成与机器人操作
操作数据的多样性、数量和质量对于训练有效的机器人策略至关重要。然而,由于硬件和物理设置的限制,收集大规模的现实世界操作数据在不同环境中难以扩展。近期的工作使用文本提示条件下的图像扩散模型来通过改变视觉观察中的背景和桌面物体来扩充操作数据。然而,这些方法往往忽略了由最先进的策略模型所需的多视角和时间上一致的观察的实际需求。此外,仅靠文本提示无法可靠地指定场景设置。为了给扩散模型提供明确的视觉指导,我们引入了视觉身份提示,通过提供示例图像作为条件输入来引导生成所需的场景设置。为此,我们还构建了一个可扩展的流水线从大型机器人数据集中整理视觉身份池。使用我们扩充的操作数据来训练下游的视觉-语言-动作和视觉运动策略模型,在仿真和真实机器人设置中均能获得一致的性能提升。
Summary / 总结
The paper addresses the challenge of collecting diverse and high-quality manipulation data for robot training. It introduces RoboVIP, a method that uses visual identity prompting to generate multi-view, temporally coherent video data, as required by state-of-the-art policy models. The approach conditions the diffusion model on exemplar images to guide the scene setup, overcoming the limitations of text prompts alone, and curates a visual identity pool from large robotics datasets. Experimental results show consistent performance improvements in both simulation and real-robot settings when using the augmented data for training vision-language-action and visuomotor policies.
论文旨在通过增强操作数据的多样性和质量来提高机器人策略的训练效果。提出了RoboVIP方法,通过视觉身份提示生成多视角、时间连贯的视频数据,以提供明确的视觉指导。这种方法通过示例图像作为条件输入,帮助更可靠地指定场景设置。实验结果显示,使用这种增强的数据训练视觉-语言-动作和视觉运动策略模型,在仿真和真实机器人环境中均表现出一致的性能提升。
Plenoptic Video Generation
Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
First: 2026-01-08T18:58:32+00:00 · Latest: 2026-01-08T18:58:32+00:00
Comments: Project Page: https://research.nvidia.com/labs/dir/plenopticdreamer/
Abstract
Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in the single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/
中文标题/摘要
标题:全光视频生成
相机控制生成视频重渲染方法,如ReCamMaster,已经取得了显著进展。然而,尽管这些方法在单视角设置中取得了成功,它们在多视角场景中保持一致性方面仍然面临挑战。由于生成模型固有的随机性,保持时空一致性在幻觉区域仍然具有挑战性。为了解决这个问题,我们引入了PlenopticDreamer框架,该框架同步生成幻觉以保持时空记忆。核心思想是通过相机引导的视频检索策略自回归训练多输入单输出视频条件模型,该策略能够从先前生成的视频中自适应地选择关键视频作为条件输入。此外,我们的训练还包含逐步上下文缩放以提高收敛性,自条件化以增强对由误差累积引起的长距离视觉退化的鲁棒性,以及长视频条件机制以支持长时间视频生成。在Basic和Agibot基准上的广泛实验表明,PlenopticDreamer实现了最先进的视频重渲染,提供了卓越的视角同步、高保真视觉、准确的相机控制和多样的视角变换(例如,从第三人称到第三人称,以及从头部视角到夹爪视角的机器人操作)。项目页面:https://research.nvidia.com/labs/dir/plenopticdreamer/
Summary / 总结
PlenopticDreamer is introduced to address the challenge of maintaining spatio-temporal coherence in multi-view generative video re-rendering. It uses a multi-in-single-out video-conditioned model trained in an autoregressive manner, combined with a camera-guided video retrieval strategy and progressive context-scaling. The framework demonstrates superior view synchronization and high-fidelity visuals on the Basic and Agibot benchmarks, achieving state-of-the-art results in video re-rendering. Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/
PlenopticDreamer 是一个框架,旨在提高多视角下生成视频重渲染的一致性和时空连贯性。它使用基于相机的视频检索策略和自回归多输入单输出模型,并通过逐步扩展上下文规模来增强收敛性。该方法还包括自我条件化和长时间视频条件化,以处理长时间范围内的视觉退化。实验表明,PlenopticDreamer 在视点同步、视觉保真度和相机控制准确性方面优于现有方法,并支持机器人操作中的多种视点变换。
ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Authors: Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi
First: 2026-01-08T18:58:08+00:00 · Latest: 2026-01-08T18:58:08+00:00
Comments: Preprint. Project Website: objectforesight.github.io
Abstract
Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. objectforesight.github.io
中文标题/摘要
标题:ObjectForesight:从人类视频预测未来3D物体轨迹
人类可以通过互动轻松地预见到物体可能如何移动或变化——想象一个杯子被举起、一把刀切割或一个盖子被关闭。我们旨在赋予计算系统类似的能力,直接从被动视觉观察中预测物体的可能未来运动。我们引入了ObjectForesight,这是一种3D物体中心的动力学模型,可以从短第一人称视频序列中预测刚体物体的未来6-自由度姿态和轨迹。与传统的世界或动力学模型不同,ObjectForesight在物体级别以3D形式明确表示世界,从而实现几何上合理的、时间上一致的预测,捕捉物体的功能和轨迹。为了大规模训练这样的模型,我们利用最近在分割、网格重建和3D姿态估计方面的进展,构建了一个包含超过200万短片段的数据集,这些片段具有伪地面真实3D物体轨迹。通过广泛的实验,我们展示了ObjectForesight在准确性、几何一致性以及对未见过的物体和场景的泛化方面取得了显著的改进,建立了从观察中学习物理上合理的、物体中心的动力学模型的可扩展框架。objectforesight.github.io
Summary / 总结
The research aims to develop a computational system capable of predicting future 3D object trajectories from passive visual observation, similar to human anticipation. ObjectForesight, a 3D object-centric dynamics model, predicts 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. The model uses recent advances in segmentation, mesh reconstruction, and 3D pose estimation to train on a dataset of over 2 million short clips with pseudo-ground-truth 3D object trajectories. Experiments demonstrate significant improvements in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded dynamics models directly from observation.
研究旨在开发一种可以从被动视觉观察中预测未来3D物体轨迹的计算系统,类似于人类的预见能力。ObjectForesight是一种3D物体中心的动力学模型,可以从短的第一人称视频序列中预测刚体物体的6-DoF姿态和轨迹。该模型利用分割、网格重建和3D姿态估计的最新进展,在包含超过200万短片段的伪地面真值3D物体轨迹的数据集上进行训练。实验表明,在准确性、几何一致性以及对未见过的物体和场景的泛化能力方面取得了显著改进,建立了从观察中学习物理上合理的、物体中心的动力学模型的可扩展框架。
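For readers unfamiliar with the output representation mentioned in the ObjectForesight entry above, a 6-DoF pose combines three rotational and three translational degrees of freedom, and a trajectory is a sequence of such poses over time. The sketch below applies a pose to a point on a rigid object; it only illustrates the representation, restricts rotation to the z-axis for brevity, and uses hypothetical helper names and pose values that do not come from the paper.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def apply_pose(rotation, translation, point):
    """Map a point from the object frame to the world frame: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# A made-up three-step trajectory: the object is lifted while it turns.
trajectory = [
    (rotation_z(0.0), (0.0, 0.0, 0.0)),
    (rotation_z(math.pi / 4), (0.0, 0.0, 0.5)),
    (rotation_z(math.pi / 2), (0.0, 0.0, 1.0)),
]
handle = (1.0, 0.0, 0.0)  # a point fixed on the object, in object coordinates
handle_path = [apply_pose(R, t, handle) for R, t in trajectory]
```

Because the object is rigid, one pose per time step is enough to place every point of its mesh, which is what makes an object-level 3D representation compact compared with predicting pixels.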
Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter
First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00
Comments: 6 pages, 4 figures
Abstract
We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making viewers angry in order to engage them and increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
中文标题/摘要
标题:通过机器学习和人工智能衡量与促进和平
我们使用机器学习和人工智能:1) 从新闻和社交媒体中衡量各国的和平水平;2) 开发在线工具以促进和平,帮助用户更好地理解自己的媒体消费。对于新闻媒体,我们使用神经网络从在线新闻来源的文本嵌入中衡量和平水平。该模型在训练于一个新闻媒体数据集后,也对分析另一个新闻数据集时表现出高准确性。对于社交媒体,如YouTube,我们开发了其他模型来衡量与和平相关的社会维度,使用了词级(GoEmotions)和上下文级(大型语言模型)方法。为了促进和平,我们注意到20-40岁人群中71%的人每天主要通过社交媒体上的短视频获取新闻。这些视频内容创作者倾向于制作能够激发情绪、让你愤怒的视频以增加点击率。我们开发并测试了一个名为MirrorMirror的Chrome扩展程序,为YouTube观众提供实时反馈,告知他们所观看媒体的和平程度。我们的长期目标是让MirrorMirror成为一个开源工具,供内容创作者、记者、研究人员、平台和个人用户更好地理解其媒体创作和消费的语气及其对观众的影响。我们希望超越简单的参与度指标,鼓励更加尊重、细致和信息丰富的交流。
Summary / 总结
This study utilized machine learning and artificial intelligence to measure peace levels in countries using news and social media data. Neural networks were employed to assess peace from news text embeddings, showing high accuracy across different datasets. For social media, models were developed to measure social dimensions related to peace using word and context levels. A Chrome extension named MirrorMirror was created to provide real-time feedback on the peacefulness of media content, aiming to foster more respectful and informative communication online. The research highlights the potential of AI in promoting peace through better media understanding and engagement.
该研究旨在通过机器学习和AI来衡量和促进和平。它使用神经网络评估新闻文本中的和平水平,并开发了针对社交媒体的模型来衡量对和平至关重要的社会维度。研究还创建了MirrorMirror扩展程序,可以实时反馈媒体内容的和平程度,旨在促进更加尊重和信息丰富的在线交流。
Learning Latent Action World Models In The Wild
Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00
Comments: 37 pages, 25 figures
Abstract
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to that of action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
中文标题/摘要
标题:学习自然环境中的潜在动作世界模型
能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力,但它们通常需要行为标签,而这些标签在大规模获取时可能非常复杂。这促使我们学习潜在动作模型,可以从视频中学习动作空间。我们的工作解决了在自然环境视频中学习潜在动作世界模型的问题,扩展了现有工作集中在简单机器人模拟、视频游戏或操作数据上的范围。虽然这使我们能够捕捉到更丰富的动作,但也带来了视频多样性带来的挑战,如环境噪声或视频间缺乏共同的实体。为应对部分挑战,我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现,连续但受限的潜在动作能够捕捉自然环境视频中动作的复杂性,而常见的矢量量化则无法做到这一点。例如,我们发现来自智能体(如人类进入房间)的环境变化可以在视频间转移,这突显了学习特定于自然环境视频的动作的能力。在视频间缺乏共同实体的情况下,我们主要能够学习在空间上局部化的潜在动作,相对于摄像机而言。尽管如此,我们能够训练一个控制器,将已知动作映射到潜在动作,使我们能够使用潜在动作作为通用接口,并使用世界模型解决规划任务,其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界提供了一步进展。
Summary / 总结
This work addresses the challenge of learning latent action models from in-the-wild videos, expanding beyond simple robotics simulations and video games. The motivation is to enable agents to reason and plan in real-world scenarios by predicting the consequences of actions without requiring explicit action labels. The method involves capturing richer actions while addressing challenges such as environmental noise and varying embodiments across videos. Key findings include the ability to capture complex actions with continuous, constrained latent actions, and the capability to transfer changes in the environment across videos. The model also allows for the training of a controller that maps known actions to latent ones, enabling the use of latent actions as a universal interface for solving planning tasks with comparable performance to action-conditioned baselines.
该研究旨在从真实世界的视频中学习潜在动作模型,这些视频比机器人模拟、视频游戏或操作数据更复杂和多样化。作者提出了一种方法来捕捉更丰富的动作,同时处理环境噪声和视频间缺乏共同主体的问题。研究发现,连续但受限的潜在动作可以有效地建模来自多样真实世界视频的动作,并且可以通过控制器将已知动作映射到潜在动作,从而使用潜在动作进行规划任务,其性能与基于动作的基线相当。
Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data
Authors: James Rice
First: 2026-01-08T18:53:59+00:00 · Latest: 2026-01-08T18:53:59+00:00
Comments: 20 pages, 6330 words
Abstract
I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.
中文标题/摘要
标题:随机深度学习:结构化时序数据中不确定性建模的概率框架
我提出了一种新颖的框架,将随机微分方程(SDEs)与深度生成模型相结合,以提高涉及结构化和时序数据的机器学习应用中的不确定性量化。这种方法称为随机潜微分推理(SLDI),在变分自编码器的潜空间中嵌入伊藤SDE,允许灵活的连续时间不确定性建模,同时保持严格的数学基础。SDE的漂移和扩散项由神经网络参数化,使数据驱动的推理成为可能,并将经典的时间序列模型推广到处理不规则采样和复杂动态结构。 一个核心理论贡献是伴随状态与专用神经网络的共参数化,形成一个耦合的前向-后向系统,不仅捕捉潜变量的演变,还捕捉梯度动力学。我引入了一条路径正则化伴随损失,并通过随机微积分的视角分析了方差减少的梯度流,为改进深度潜SDE的训练稳定性提供了新的工具。我的论文统一并扩展了变分推理、连续时间生成建模和控制论优化,为未来的随机概率机器学习发展提供了严格的理论基础。
Summary / 总结
The research introduces Stochastic Latent Differential Inference (SLDI), a framework that combines stochastic differential equations (SDEs) with deep generative models to enhance uncertainty quantification in structured temporal data. SLDI embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty. Key findings include the co-parameterization of the adjoint state with a neural network, forming a coupled forward-backward system, and the introduction of a pathwise-regularized adjoint loss to improve training stability in deep latent SDEs.
研究提出了一种称为Stochastic Latent Differential Inference (SLDI) 的框架,将随机微分方程 (SDEs) 与生成模型结合,以增强结构化和时间数据中的不确定性量化。该方法将伊藤 SDE 嵌入变分自编码器的潜在空间中,允许灵活的连续时间不确定性建模。关键发现包括将伴随状态与神经网络共参数化,形成耦合的前向-后向系统,同时捕捉潜在演化和梯度动力学。该方法还引入了路径正则化伴随损失,并通过随机微积分分析了方差减少的梯度流,从而提高深度潜在 SDE 的训练稳定性。
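Illustrative sketch (not from the paper): the abstract describes an Itô SDE in latent space, dz = f(z, t) dt + g(z, t) dW, with drift f and diffusion g given by neural networks. The Python below simulates one latent path with the standard Euler-Maruyama scheme. The functions `drift` and `diffusion` are hypothetical closed-form stand-ins for the learned networks, and the scalar latent state is an assumption made for brevity.

```python
import math
import random

def drift(z, t):
    # Hypothetical stand-in for the neural drift network: mean reversion toward 0.
    return -0.5 * z

def diffusion(z, t):
    # Hypothetical stand-in for the neural diffusion network: constant noise scale.
    return 0.2

def euler_maruyama(z0, t0, t1, n_steps, rng):
    """Simulate one path of dz = drift dt + diffusion dW on [t0, t1]."""
    dt = (t1 - t0) / n_steps
    z, t = z0, t0
    path = [z]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        z = z + drift(z, t) * dt + diffusion(z, t) * dw
        t += dt
        path.append(z)
    return path

path = euler_maruyama(z0=1.0, t0=0.0, t1=1.0, n_steps=100, rng=random.Random(0))
print(len(path), path[-1])
```

Because the step size enters only through `dt`, the same update can be applied on an irregular time grid, which is one reason continuous-time latent models suit irregularly sampled data.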
Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome
Authors: Moamal Fadhil Abdul-Mahdi, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad
First: 2025-12-22T12:25:50+00:00 · Latest: 2026-01-08T18:45:51+00:00
Comments: Thomas Martini Jørgensen and Emil Hovad contributed equally and share last authorship
Abstract
Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.
中文标题/摘要
标题:扩展的OpenTT Games数据集:用于精细粒度击球类型和得分结果的乒乓球数据集
自动检测和分类乒乓球视频中的击球可以简化训练工作流程,丰富广播叠加内容,并实现精细粒度的性能分析。为此,需要带有标注的乒乓球视频数据。我们扩展了公共OpenTTGames数据集,增加了高度详细的、帧准确的击球类型注释(正手、反手及其子类型)、球员姿势标签(身体倾斜和腿部站位)以及在得分结束时的回合结果标签。OpenTTGames是一组从桌子侧面录制的视频,带有官方标签,表示球的弹跳、球在网上的情况或击中网的情况。该数据集已经包含与事件相关的球坐标,这些事件在原始OpenTTGames数据集中是“弹跳”、“网”或“空事件”,并且带有语义掩码(人类、桌子、记分板)。我们的扩展为事件添加了击球类型,并为每个球员提供了分类体系,使模型能够从事件识别转向战术理解(例如,击球是否可能赢得得分或建立优势)。我们提供了一种紧凑的编码方案和代码辅助的标注程序,以支持可重复的标注和细粒度击球理解的基准。这填补了社区中的实际空白,因为许多先前的视频资源要么未公开发布,要么带有限制性/不明确的许可证,阻碍了重用和基准测试。我们的标注在与OpenTTGames相同的CC BY-NC-SA 4.0许可证下发布,允许免费非商业使用、修改和再分发,同时适当归因。
MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00
Abstract
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.
中文标题/摘要
标题:MineNPC-Task:面向记忆意识Minecraft代理的任务套件
我们提出了MineNPC-Task,一种用户编写的基准测试和评估框架,用于测试开放世界Minecraft中的记忆意识、混合主动性LLM代理。该框架不依赖于合成提示,而是从专家玩家的形成性与总结性共玩中提取任务,将其规范化为具有显式先决条件和依赖结构的参数化模板,并配以基于有界知识策略的机器可验证验证器,该策略禁止使用世界外的捷径。该框架捕捉计划/行动/记忆事件,包括计划预览、目标澄清、记忆读写、先决条件检查和修复尝试,并根据尝试的子任务总数报告结果,这些结果源自于游戏世界内的证据。作为初步快照,我们使用GPT-4o实例化了该框架,并在8名经验丰富的玩家中评估了216个子任务。我们观察到代码执行、库存/工具处理、引用和导航中的反复出现的故障模式,以及由混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价,同时指出了需要更强的记忆持久性以跨越任务。我们发布了完整的任务套件、验证器、日志和框架,以支持未来记忆意识实体代理的透明、可重复评估。
Summary / 总结
The research introduces MineNPC-Task, a benchmark for testing memory-aware LLM agents in Minecraft. Tasks are derived from expert co-play and structured into templates with explicit conditions. The evaluation framework uses GPT-4o to perform 216 subtasks with 8 players, revealing issues in code execution, inventory management, and navigation. Participants positively rated the interaction quality but noted the need for better memory persistence.
研究引入了MineNPC-Task,用于测试记忆感知的LLM代理在Minecraft中的表现。任务从专家玩家中提取,标准化为具有明确先决条件和依赖结构的模板,并使用机器可检查的验证器进行评估。研究评估了8名玩家的216个子任务,发现了代码执行、库存处理、引用和导航等方面的常见问题,通过混合主动澄清和轻量级记忆支持恢复。参与者对交互质量和界面易用性给予了积极评价,但指出需要更强的记忆持久性。
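Illustrative sketch (not the released harness): the abstract describes parametric task templates with explicit preconditions and dependency structure, machine-checkable validators, and outcomes reported relative to attempted subtasks. The Python below shows one way such pieces could fit together. The `Subtask` class, the `has_item` validator, the inventory dictionary, and the rule that a subtask counts as attempted only when its dependencies passed are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Subtask:
    name: str
    params: dict
    validator: Callable[[dict, dict], bool]
    depends_on: list = field(default_factory=list)

def has_item(state, params):
    # Validator reads only in-world evidence (here, a hypothetical inventory dict).
    return state["inventory"].get(params["item"], 0) >= params["count"]

def evaluate(subtasks, state):
    """Return (passed, attempted); a subtask is attempted only if its dependencies passed."""
    passed = set()
    attempted = 0
    for task in subtasks:  # assumed to be listed in dependency order
        if not all(dep in passed for dep in task.depends_on):
            continue
        attempted += 1
        if task.validator(state, task.params):
            passed.add(task.name)
    return len(passed), attempted

tasks = [
    Subtask("collect_wood", {"item": "oak_log", "count": 4}, has_item),
    Subtask("craft_planks", {"item": "oak_planks", "count": 16}, has_item, ["collect_wood"]),
    Subtask("craft_table", {"item": "crafting_table", "count": 1}, has_item, ["craft_planks"]),
]
state = {"inventory": {"oak_log": 4, "oak_planks": 8}}
passed, attempted = evaluate(tasks, state)
print(f"{passed}/{attempted} attempted subtasks passed")
```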
FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching
Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia
First: 2026-01-08T18:36:29+00:00 · Latest: 2026-01-08T18:36:29+00:00
Abstract
Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
中文标题/摘要
标题:FlowLet:基于小波流匹配的条件3D脑MRI合成
脑磁共振成像(MRI)在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测(BAP),它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大量、多样且年龄平衡的数据集,而现有的3D MRI数据集在人口统计学上存在偏差,限制了公平性和泛化能力。获取新数据成本高且伦理限制多,推动了生成数据增强。当前的生成方法通常基于潜在扩散模型,在学习的低维潜在空间中操作以应对体积MRI数据的内存需求。然而,这些方法在推理时通常较慢,可能会由于潜在压缩引入伪影,并且很少根据年龄进行条件化,从而影响BAP性能。在本文中,我们提出FlowLet,一种基于3D小波域内可逆流匹配的条件生成框架,通过这种方式避免重建伪影并减少计算需求。实验表明,FlowLet能够通过少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可以提高未充分代表的年龄组的性能,而基于区域的分析证实了解剖结构的保留。
Summary / 总结
FlowLet is a conditional generative framework that synthesizes age-conditioned 3D MRIs using flow matching in an invertible 3D wavelet domain, addressing the limitations of latent diffusion models in terms of computational efficiency and artifact introduction. Experiments demonstrate that FlowLet generates high-fidelity volumes with few sampling steps and improves BAP model performance for underrepresented age groups.
FlowLet 是一个条件生成框架,通过在可逆的 3D 小波域内进行流匹配来合成年龄条件下的 3D MRI,解决了潜在扩散模型的局限性。它避免了重建伪影并减少了计算需求。实验表明,FlowLet 生成的高保真体积只需少量采样步骤,使用 FlowLet 生成的数据训练 BAP 模型可以改善未充分代表的年龄组的表现,并保持解剖结构的完整性。
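Illustrative sketch (not the authors' code): FlowLet is described as flow matching inside an invertible 3D wavelet domain. The Python below shows the two ingredients in their simplest form: a one-level orthonormal 1D Haar transform with exact inverse, and the training pair for flow matching along a straight-line path. The 1D setting, the Haar basis, and the linear path are assumptions; the abstract does not specify the wavelet family or the probability path.

```python
import math

def haar_forward(x):
    """One level of the orthonormal 1D Haar transform (len(x) must be even)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward: no information is lost in the wavelet domain."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def flow_matching_pair(noise, data, t):
    """Linear-path flow matching: point on the path at time t and its target velocity."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

signal = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_forward(signal)
coeffs = approx + detail                  # the generative model would operate on these
recon = haar_inverse(approx, detail)
x_half, v = flow_matching_pair([0.0] * 4, coeffs, t=0.5)
```

The invertibility is the point of the design: unlike a learned latent autoencoder, decoding wavelet coefficients introduces no reconstruction error of its own.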
MoE3D: A Mixture-of-Experts Module for 3D Reconstruction
Authors: Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park
First: 2026-01-08T18:33:52+00:00 · Latest: 2026-01-08T18:33:52+00:00
Abstract
MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts in existing feed-forward 3D reconstruction models. MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting. When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead.
中文标题/摘要
标题:MoE3D:一种用于3D重建的混合专家模块
MoE3D 是一种混合专家模块,旨在锐化深度边界并减轻现有前馈3D重建模型中的漂浮点伪影。MoE3D 预测多个候选深度图,并通过动态加权进行融合。当与预训练的3D重建主干网络(如 VGGT)结合使用时,它能显著提高重建质量,同时几乎不增加额外的计算开销。
Summary / 总结
MoE3D is a mixture-of-experts module aimed at improving the sharpness of depth boundaries and reducing flying-point artifacts in 3D reconstruction. It predicts multiple depth maps and fuses them using dynamic weighting. When combined with a pre-trained 3D reconstruction model like VGGT, it significantly improves reconstruction quality with minimal additional computational cost.
MoE3D 是一个混合专家模块,旨在提高 3D 重建中的深度边界清晰度并减少漂浮点伪影。它预测多个深度图并通过动态加权融合。与预训练的 VGGT 骨干网络结合时,它可以显著提高重建质量,同时几乎没有额外的计算开销。
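Illustrative sketch (not the authors' implementation): the abstract says MoE3D predicts several candidate depth maps and fuses them with dynamic weights. The Python below fuses candidates per pixel with softmax weights. The softmax gate, the flat pixel lists, and the example values are assumptions; the abstract does not specify how the weights are computed.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_depths(candidates, gate_logits):
    """candidates[k][i] and gate_logits[k][i] belong to expert k at pixel i."""
    fused = []
    for i in range(len(candidates[0])):
        weights = softmax([logits[i] for logits in gate_logits])
        fused.append(sum(w * cand[i] for w, cand in zip(weights, candidates)))
    return fused

# Two experts at a depth edge: one predicts the foreground (1.0), one the background (5.0).
candidates = [[1.0, 1.0], [5.0, 5.0]]
sharp = fuse_depths(candidates, gate_logits=[[10.0, -10.0], [-10.0, 10.0]])
blurry = fuse_depths(candidates, gate_logits=[[0.0, 0.0], [0.0, 0.0]])
```

With peaked weights each pixel commits to one surface, while uniform weights average the two surfaces into an intermediate depth of 3.0 that lies on neither, which is the kind of flying point the module aims to suppress.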
EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI
Authors: Zain Iqbal, Lorenzo Valerio
First: 2026-01-08T18:31:11+00:00 · Latest: 2026-01-08T18:31:11+00:00
Comments: 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]
Abstract
Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.
中文标题/摘要
标题:EARL:面向普及型人工智能的液态状态机能效优化
普及型人工智能越来越多地依赖于能够在严格资源限制下提供低延迟和能效计算的本地学习系统。液态状态机(LSMs)为普及型和神经形态系统中的低功耗时间处理提供了有希望的方法,但由于高超参数敏感性和传统优化方法忽视能效约束导致的计算成本,其部署仍然具有挑战性。本文提出了一种名为EARL的能效感知强化学习框架,该框架结合了贝叶斯优化和自适应强化学习选择策略,以联合优化准确性和能耗。EARL利用代理建模进行全局探索,强化学习进行动态候选优先级排序,并采用早期终止机制消除冗余评估,大幅减少计算开销。在三个基准数据集上的实验表明,与领先的超参数调优框架相比,EARL的准确率提高了6%到15%,能耗降低了60%到80%,优化时间最多缩短一个数量级。这些结果突显了能效感知自适应搜索在提高资源受限的本地AI应用中液态状态机效率和可扩展性方面的有效性。
Summary / 总结
EARL is an energy-aware optimization framework for Liquid State Machines (LSMs) that integrates Bayesian optimization and adaptive reinforcement learning to optimize both accuracy and energy consumption. It uses surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to reduce computational overhead. Experiments show that EARL achieves higher accuracy and lower energy consumption compared to other hyperparameter tuning frameworks, with up to an order of magnitude reduction in optimization time.
EARL 是一种结合贝叶斯优化和强化学习的能源感知优化框架,用于液态状态机(LSMs),旨在同时优化准确性和能耗。通过使用代理建模、动态候选优先级和早期终止机制来减少计算开销。实验结果表明,EARL 在准确性和能耗方面表现更优,并且优化时间显著缩短,优于现有的超参数调优框架。
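Illustrative sketch (not the EARL algorithm itself): the abstract mentions jointly optimizing accuracy and energy and terminating redundant evaluations early. The Python below shows one simple way to combine the two ideas. The scalarized score (accuracy minus a weighted energy cost), the `energy_weight` value, and the stopping rule are all assumptions made for illustration.

```python
def evaluate_candidate(curve, energy_per_epoch, best_score, min_epochs=2, energy_weight=0.01):
    """Walk a candidate's per-epoch accuracy curve and return (score, epochs_run, stopped_early).

    The candidate is dropped once even a perfect accuracy of 1.0 could no longer
    beat best_score given the energy already spent.
    """
    spent = 0.0
    acc = 0.0
    for epoch, acc in enumerate(curve, start=1):
        spent += energy_per_epoch
        optimistic = 1.0 - energy_weight * spent   # best score still reachable
        if epoch >= min_epochs and optimistic <= best_score:
            return acc - energy_weight * spent, epoch, True
    return acc - energy_weight * spent, len(curve), False

curve = [0.5, 0.6, 0.7, 0.75]
hungry = evaluate_candidate(curve, energy_per_epoch=10.0, best_score=0.75)
frugal = evaluate_candidate(curve, energy_per_epoch=1.0, best_score=0.5)
```

Here the energy-hungry configuration is abandoned after three of four epochs, while the frugal one runs to completion.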
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱导幻觉的机制
大型视觉语言模型(VLMs)功能强大,但往往倾向于根据文本提示而非视觉证据产生幻觉。我们在一个受控的对象计数设置中研究了这种失败模式,其中提示夸大了图像中的对象数量(例如,要求模型描述四朵睡莲,而实际上只有三朵)。在对象数量较低时,模型通常会纠正这种夸大,但随着对象数量的增加,它们越来越倾向于遵循提示,无视差异。通过对三种VLMs的机制分析,我们发现一小组注意力头的消除可以显著减少提示诱导幻觉(PIH),至少降低40%且无需额外训练。在不同模型中,PIH头以特定方式介导提示复制。我们描述了这些差异,并表明PIH消除增加了对视觉证据的纠正。我们的研究结果揭示了驱动提示诱导幻觉的内部机制,揭示了这些行为在不同模型中的特定差异。
Summary / 总结
The study investigates how large vision-language models (VLMs) hallucinate based on textual prompts rather than visual evidence. By manipulating object counts in images, the researchers found that models tend to correct prompt overestimations at low object counts but increasingly conform to the prompt as the number of objects increases. Ablating specific attention heads reduced prompt-induced hallucinations by at least 40% across different models, and these heads were found to mediate prompt copying in model-specific ways. The findings suggest that prompt-induced hallucinations are driven by model-specific mechanisms involving these attention heads, which, when removed, improve alignment with visual evidence.
研究通过观察大型视觉-语言模型(VLMs)在物体计数任务中的行为,探讨了提示诱导幻觉的现象。研究发现,随着图像中物体数量的增加,模型更倾向于遵循提示而非视觉证据。通过对三个VLMs的注意力机制进行分析,研究确定了一组特定的注意力头,移除这些头可以显著减少幻觉至少40%,且无需额外训练。研究结果表明,这些头在复制提示方面起着关键作用,移除它们会增强模型依赖视觉证据进行修正的能力。
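Illustrative sketch (not the authors' code): head ablation means replacing the output of selected attention heads before it reaches the rest of the network, then measuring how a behavior changes. The Python below zero-ablates one head and computes a relative reduction. Zero ablation (rather than, say, mean ablation), the toy head outputs, and the example rates are assumptions; the abstract reports only that the reduction is at least 40%.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """head_outputs[h] is the output vector of attention head h for one token.

    Zero ablation replaces the selected heads' outputs with zeros.
    """
    return [
        [0.0] * len(out) if h in heads_to_ablate else list(out)
        for h, out in enumerate(head_outputs)
    ]

def relative_reduction(rate_before, rate_after):
    """Fraction by which a rate dropped, e.g. 0.4 means a 40% reduction."""
    return (rate_before - rate_after) / rate_before

heads = [[0.2, -0.1], [0.7, 0.4], [0.0, 0.3]]
ablated = ablate_heads(heads, heads_to_ablate={1})

# Made-up rates for illustration: hallucination on 50% of prompts before, 30% after.
reduction = relative_reduction(0.50, 0.30)
```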
An interpretable data-driven approach to optimizing clinical fall risk assessment
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi
First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714
Abstract
In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
中文标题/摘要
标题:一种可解释的数据驱动方法以优化临床跌倒风险评估
在本研究中,我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件,13,941例被纳入低跌倒风险事件。为了整合临床知识并保持可解释性,我们使用了约束评分优化(CSO)模型重新加权JHFRAT评分权重,同时保持其加性结构和临床阈值。校准是指调整项目权重,使所得评分能够更一致地按研究的风险标签对事件进行排序,而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。这一性能改进相当于每周为约翰霍普金斯健康系统保护额外的35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型(XGBoost)在知识驱动的约束逻辑回归的基础上提高了性能指标(AUC-ROC=0.94),但CSO在风险标签变化方面表现出更强的鲁棒性。基于证据的方法为医疗机构提供了一个坚实的基础,以系统地增强住院跌倒预防协议和患者安全,利用数据驱动优化技术,从而在医疗保健环境中改善风险评估和资源分配。
Summary / 总结
This study aimed to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating clinically meaningful measures through constrained score optimization (CSO) models. A retrospective cohort analysis of 54,209 inpatient admissions showed that the CSO model significantly enhanced predictive performance (AUC-ROC=0.91) compared to the original JHFRAT (AUC-ROC=0.86), protecting an additional 35 high-risk patients per week. The CSO models maintained interpretability and robustness, even without electronic health record variables, demonstrating a data-driven approach to optimizing clinical fall risk assessment. This evidence-based method provides a robust foundation for enhancing inpatient fall prevention protocols and patient safety in healthcare settings.
本研究旨在通过约束分数优化(CSO)模型将约翰霍普金斯跌倒风险评估工具(JHFRAT)与临床有意义的指标相结合,以提高其预测性能。对54,209名住院患者的回顾性队列分析表明,CSO模型显著提高了预测性能(AUC-ROC=0.91),相较于原始的JHFRAT(AUC-ROC=0.86),每周额外保护了35名高风险患者。CSO模型保持了可解释性和稳健性,即使不使用电子健康记录变量也是如此,展示了优化临床跌倒风险评估的数据驱动方法。该证据为基础的方法为增强住院跌倒预防协议和患者安全提供了坚实的基础,在医疗保健环境中促进了风险评估和资源分配的改进。
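Illustrative sketch (not the study's model or data): the abstract describes reweighting an additive risk score and comparing AUC-ROC before and after. The Python below computes an additive score from item weights and evaluates AUC-ROC as the probability that a high-risk encounter outscores a low-risk one. The item names, weights, and four toy encounters are hypothetical and are not actual JHFRAT items.

```python
def additive_score(items, weights):
    """Additive tool: the score is a weighted sum of item values."""
    return sum(weights[name] * value for name, value in items.items())

def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical encounters: (item values, label), where label 1 means high fall risk.
encounters = [
    ({"mobility": 1, "meds": 0}, 1),
    ({"mobility": 0, "meds": 1}, 0),
    ({"mobility": 1, "meds": 1}, 1),
    ({"mobility": 0, "meds": 0}, 0),
]
labels = [y for _, y in encounters]

original = {"mobility": 1, "meds": 2}      # hypothetical current weights
reweighted = {"mobility": 2, "meds": 1}    # hypothetical reweighting, same additive form

auc_before = auc_roc([additive_score(x, original) for x, _ in encounters], labels)
auc_after = auc_roc([additive_score(x, reweighted) for x, _ in encounters], labels)
```

Only the weights change between the two runs, so the score stays a simple sum that clinicians can compute and inspect, which is the interpretability argument the abstract makes.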
SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning
Authors: Yanchang Liang, Xiaowei Zhao
First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00
Abstract
Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.
中文标题/摘要
标题:SimuAgent:一种增强学习的基于LLM的Simulink建模助手
大型语言模型(LLMs)已经革新了基于文本的代码自动化,但在图形导向的工程工作流中的潜力尚未得到充分探索。我们介绍了SimuAgent,这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格的Python表示法取代了冗长的XML,大幅减少了标记数量,提高了可解释性,并使仿真变得快速且在进程内进行。一种轻量级的计划-执行架构,经过两阶段训练,使代理具备了低级工具技能和高级设计推理能力。为应对长时任务中的稀疏奖励,我们提出了反射-GRPO(ReGRPO),它通过自我反思轨迹增强了Group相对策略优化(GRPO),提供了丰富的中间反馈,加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试上进行的实验表明,使用SimuAgent微调的Qwen2.5-7B模型比标准的强化学习基线收敛更快,建模精度更高,甚至在使用少量示例提示在同一基准测试上评估时,超过了GPT-4o。消融实验表明,两阶段课程和抽象重建数据增强进一步增强了泛化能力。SimuAgent完全在本地进行训练和运行,硬件要求较低,提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent在LLMs和图形建模环境之间架起了一座桥梁,为工业环境中的AI辅助工程设计提供了一种实用的解决方案。
Summary / 总结
SimuAgent is an LLM-powered agent designed for Simulink modeling, using a concise Python representation to reduce token counts and improve interpretability. It is trained in two stages to combine low-level tool skills with high-level design reasoning. The agent uses Reflection-GRPO, which enhances Group Relative Policy Optimization with self-reflection traces to improve convergence and robustness. Experiments on SimuBench show that SimuAgent outperforms standard RL baselines and even GPT-4o in few-shot prompting scenarios, with enhanced generalization through a two-stage curriculum and data augmentation. SimuAgent is cost-effective and privacy-preserving, suitable for industrial model-driven engineering.
SimuAgent 是一个为 Simulink 模型设计的 LLM 助手,使用简洁的 Python 表示法来减少标记数量并提高可解释性。它通过两个阶段的训练结合了低级工具技能和高级设计推理。该助手使用 Reflection-GRPO,这是一种增强 Group Relative Policy Optimization 的方法,通过自我反思痕迹来提高收敛性和鲁棒性。实验表明,SimuAgent 在 SimuBench 上的表现优于标准的 RL 基线,甚至在少量提示的情况下超过了 GPT-4o,通过两阶段课程和数据增强进一步增强了泛化能力。SimuAgent 具有成本效益且保护隐私,适用于工业模型驱动工程。
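Illustrative sketch (standard GRPO, not ReGRPO): Group Relative Policy Optimization scores each sampled response against the other responses drawn for the same prompt, normalizing rewards by the group mean and standard deviation. The Python below computes those advantages. Using the population standard deviation and the `eps` value are assumptions, since implementations differ, and the reflection traces that ReGRPO adds are not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every response in a group fails, all advantages are zero and the group contributes no gradient, which illustrates the sparse-reward problem in long-horizon tasks that the abstract says the reflection traces are meant to address.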
Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop
Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu
First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00
Abstract
The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs, which may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
中文标题/摘要
标题:大型语言模型偏见在自我消费执行循环中的观察与补救措施
大型语言模型(LLMs)的迅速发展引发了使用合成数据进行未来模型训练的兴趣。然而,这导致了一个自我消费的重新训练循环,模型在训练过程中使用自己的输出,可能会导致性能下降并引发新的偏见。在实际应用中,之前部署的LLMs可能会影响它们生成的数据,导致由用户反馈驱动的动态系统。例如,如果模型持续未能满足某一用户群体的需求,那么来自该特定用户群体的数据收集量将会减少。在本研究中,我们引入了“自我消费执行循环”(SCPL)的概念,并探讨合成数据在动态迭代训练过程中如何塑造偏见的作用。这种受控的执行反馈环境是由于难以获取动态生产系统中的真实用户偏好数据而驱动的,它使我们能够以一种原则性的方式隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环,包括典型的重新训练设置和增量微调设置,后者尚未得到充分探索。通过三个实际任务的实验,我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见,朝着更可信赖的自我改进系统迈进。
Summary / 总结
This study addresses the issue of bias in large language models (LLMs) that arises from self-consuming performative loops, where models are trained on their own outputs. The research introduces the concept of the Self-Consuming Performative Loop (SCPL) and investigates how synthetic data influences bias during dynamic iterative training. Experiments show that performative loops increase preference bias and decrease disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate these biases, aiming to enhance the trustworthiness of self-improving systems.
研究关注大型语言模型(LLMs)在自我消费执行循环中产生的偏见问题,即模型在其自身输出上进行训练。研究引入了自我消费执行循环(SCPL)的概念,并探讨合成数据在动态迭代训练过程中如何影响偏见。实验表明,执行循环增加了偏好偏见并减少了差异偏见。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏见,以提高自我改进系统的可信度。
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00
Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
中文标题/摘要
标题:VideoAuto-R1:通过一次思考,两次回答进行视频自动推理
思维链(CoT)推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而,其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中,我们首先证明,对于RL训练的视频模型,直接回答往往能够匹配甚至超越CoT性能,尽管CoT以更高的计算成本生成逐步分析。受此启发,我们提出了一种VideoAuto-R1视频理解框架,采用需要时才推理的策略。在训练过程中,我们的方法遵循一次思考,两次回答的模式:模型首先生成初始答案,然后进行推理,最后输出审查后的答案。两个答案都通过可验证的奖励进行监督。在推理过程中,模型使用初始答案的置信度分数来决定是否继续进行推理。在视频问答和定位基准测试中,VideoAuto-R1在显著提高效率的同时实现了最先进的准确性,平均响应长度减少了约3.3倍,例如,从149个词元(token)减少到仅44个词元。此外,我们观察到,在感知导向的任务中,推理模式的激活率较低,而在推理密集型任务中,激活率较高。这表明显式的基于语言的推理通常是有益的,但并非总是必要的。
Summary / 总结
The paper explores the necessity of chain-of-thought (CoT) reasoning in video understanding tasks and introduces VideoAuto-R1, a framework that reasons only when necessary. It follows a Thinking Once, Answering Twice paradigm during training, where the model first generates an initial answer, then performs reasoning, and outputs a reviewed answer. During inference, the model decides whether to reason based on the confidence of the initial answer. VideoAuto-R1 achieves state-of-the-art accuracy with significant efficiency improvements, reducing response length by 3.3x. The framework shows that reasoning is generally beneficial but not always necessary, with higher rates of reasoning on more complex tasks.
论文探讨了链式思考(CoT)推理在视频理解任务中的必要性,并提出了VideoAuto-R1框架,该框架在必要时进行推理。在训练过程中,VideoAuto-R1遵循“思考一次,回答两次”的模式,首先生成初始答案,然后进行推理,最后输出审查后的答案。在推理过程中,它根据初始答案的置信度决定是否进行推理。VideoAuto-R1实现了最先进的准确率,并显著提高了效率,响应长度减少了约3.3倍。研究表明,显式的语言推理通常是有益的,但并非总是必要;在更复杂的任务上,推理被触发的频率更高。
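The inference-time gating described in the VideoAuto-R1 abstract above (answer first, reason only when the initial answer is not confident) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `answer_with_gating`, `toy_direct`, and `toy_reason` are hypothetical, the model calls are replaced by toy stand-ins, and the threshold of 0.9 is an arbitrary placeholder because the abstract gives no value.

```python
from typing import Callable, Tuple


def answer_with_gating(
    question: str,
    direct_answer: Callable[[str], Tuple[str, float]],
    reason_then_answer: Callable[[str, str], str],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> Tuple[str, bool]:
    """Return (answer, used_reasoning) for one question."""
    # Step 1: always produce the cheap initial answer and its confidence.
    initial, confidence = direct_answer(question)
    # Step 2: skip reasoning entirely when the initial answer is confident.
    if confidence >= threshold:
        return initial, False
    # Step 3: otherwise reason once and emit the reviewed answer.
    return reason_then_answer(question, initial), True


# Toy stand-ins for the model calls (assumptions for illustration only).
def toy_direct(question: str) -> Tuple[str, float]:
    # Pretend perception-style questions are answered confidently.
    return ("A", 0.95) if "color" in question else ("B", 0.40)


def toy_reason(question: str, initial: str) -> str:
    # Pretend the reasoning pass revises the low-confidence answer.
    return "C"


easy = answer_with_gating("What color is the car?", toy_direct, toy_reason)
hard = answer_with_gating("Why did the man leave?", toy_direct, toy_reason)
```

In this toy setup the first question is answered directly and the second triggers the reasoning pass, which mirrors the abstract's observation that thinking mode activates less often on perception-oriented tasks.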
FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Venue: KDD 2026
First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00
Comments: Accepted to KDD 2026
Abstract
Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
中文标题/摘要
标题:FaST:基于专家混合的异质性感知大规模时空图长时预测框架
大规模网络上的时空图(STG)预测引起了广泛关注。然而,现有模型主要关注短时预测,并在扩展到长时预测和大规模图时遭受严重的计算成本和内存消耗问题。为应对上述挑战,我们提出了一种基于异质性感知专家混合(MoEs)的FaST框架,该框架针对长时和大规模STG预测,实现了对数千节点的一周前(以15分钟粒度计算,共672步)预测。FaST的核心创新包括:首先,提出了一种自适应图代理注意力机制,以缓解在大规模图上应用传统图卷积和自注意力模块时固有的计算负担;其次,提出了一种新的并行MoE模块,用门控线性单元(GLUs)替换传统的前馈网络,从而实现高效且可扩展的并行结构。在真实世界数据集上的广泛实验表明,FaST不仅在长时预测准确性上表现出色,而且在计算效率上也显著优于最先进的基线方法。我们的源代码可在:https://github.com/yijizhao/FaST/ 获取。
Summary / 总结
FaST is a framework designed for efficient and effective long-horizon forecasting on large-scale spatial-temporal graphs. It addresses the computational challenges of existing models by introducing an adaptive graph agent attention mechanism and a parallel Mixture-of-Experts module with Gated Linear Units. FaST achieves one-week-ahead predictions with thousands of nodes and outperforms state-of-the-art baselines in both accuracy and efficiency.
FaST 通过采用适应性图代理注意力机制和使用门控线性单元的并行 Mixture-of-Experts 模块,旨在解决大规模时空图长时预测的挑战。该框架能够实现对数千节点的一周预测,并在准确性和计算效率上优于现有方法。
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
在3D环境中的具身问答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最近的视觉-语言模型(VLMs)被限制在一个固定的有限输入视角集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了链式视角(CoV)提示,这是一种无需训练、测试时的推理框架,通过从粗到细的探索过程将VLM转换为主动的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图。然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理的有效、模型无关的策略,无需额外训练。
Summary / 总结
The research aims to enhance embodied question answering (EQA) in 3D environments by addressing the limitations of existing vision-language models (VLMs) that are constrained to a fixed set of input views. The proposed Chain-of-View (CoV) prompting method enables VLMs to actively explore and gather relevant context through a coarse-to-fine process. This involves selecting anchor views and performing fine-grained view adjustments. Experimental results on OpenEQA show an average improvement of +11.56% in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV also demonstrates test-time scaling, with additional improvements observed as the minimum action budget increases.
该论文针对3D环境中的具身问答(EQA)问题,其中背景信息通常分布在多个视角中。为了解决视觉-语言模型(VLMs)固定输入视角的限制,作者提出了Chain-of-View(CoV)提示,使模型能够通过粗到细的过程主动探索和收集相关信息。CoV在OpenEQA上的LLM-Match平均提高了11.56%,在Qwen3-VL-Flash上最高提高13.62%。它还展示了测试时的可扩展性:增加最小动作预算后,在Gemini-2.5-Flash上额外提升最高达3.73%。CoV在ScanQA和SQA3D上表现出色,证明了其在不需额外训练的情况下增强空间推理的有效性。
RelayLLM: Efficient Reasoning via Collaborative Decoding
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00
Abstract
Applying Large Language Models (LLMs) to complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
中文标题/摘要
标题:RelayLLM:通过协作解码实现高效推理
大型语言模型(LLMs)进行复杂推理往往受到高计算成本和延迟的限制,而资源高效的小型语言模型(SLMs)通常缺乏必要的推理能力。现有的协作方法,如级联或路由,以粗粒度的方式工作,将整个查询卸载到LLMs上,当SLM能够处理大多数推理步骤时,这会导致显著的计算浪费。为了解决这个问题,我们提出了RelayLLM,这是一种通过标记级协作解码实现高效推理的新框架。与路由器不同,RelayLLM赋予SLM作为主动控制器的能力,动态地仅在关键标记上调用LLM,通过特殊命令有效地“转交”生成过程。我们引入了一种两阶段训练框架,包括预热和组相对策略优化(GRPO),以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明,RelayLLM实现了49.52%的平均准确率,有效地弥合了两种模型之间的性能差距。值得注意的是,这仅通过调用LLM生成标记的1.07%实现,与性能匹配的随机路由器相比,成本降低了98.2%。
Summary / 总结
RelayLLM is a framework that enables efficient reasoning through token-level collaborative decoding between Small Language Models (SLMs) and Large Language Models (LLMs). It allows the SLM to act as an active controller, invoking the LLM only for critical tokens via a special command, thus reducing computational waste. The framework includes a two-stage training process and achieves an average accuracy of 49.52% across six benchmarks while invoking the LLM for only 1.07% of the total generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
RelayLLM 是一种框架,通过小语言模型(SLM)和大语言模型(LLM)在标记级别上的协作解码来实现高效的推理。它使 SLM 能够作为主动控制器,仅通过特殊命令为关键标记调用 LLM,从而减少计算浪费。该框架包括两阶段训练过程,以平衡独立性和战略性求助。实验结果显示,RelayLLM 在仅调用 LLM 处理 1.07% 的标记时实现了 49.52% 的准确率,与随机路由器相比,成本降低了 98.2%。
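The token-level relay described in the RelayLLM abstract above can be sketched as a decoding loop in which the small model drives generation and hands over a position to the large model whenever it emits a command token. This is a rough sketch under stated assumptions rather than the paper's method: the token strings `<call>` and `<eos>`, the function `relay_decode`, the toy models, and the choice that each call yields exactly one LLM token are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable, List, Tuple

CALL = "<call>"  # hypothetical command token emitted by the SLM
EOS = "<eos>"    # hypothetical end-of-sequence token


def relay_decode(
    slm_next: Callable[[List[str]], str],
    llm_next: Callable[[List[str]], str],
    prompt: List[str],
    max_tokens: int = 32,
) -> Tuple[List[str], int]:
    """Generate tokens with the SLM, relaying to the LLM on demand.

    Returns the generated tokens and the number of LLM-produced tokens.
    """
    output: List[str] = []
    llm_tokens = 0
    while len(output) < max_tokens:
        token = slm_next(prompt + output)
        if token == CALL:
            # Assumption: one LLM token per call; the command token itself
            # is not kept in the output.
            token = llm_next(prompt + output)
            llm_tokens += 1
        if token == EOS:
            break
        output.append(token)
    return output, llm_tokens


# Toy models (assumptions): the SLM asks for help on the first position only.
def toy_slm(context: List[str]) -> str:
    if len(context) == 1:
        return CALL
    if len(context) == 2:
        return "."
    return EOS


def toy_llm(context: List[str]) -> str:
    return "4"


tokens, llm_count = relay_decode(toy_slm, toy_llm, ["2+2="])
llm_share = llm_count / len(tokens)  # fraction of tokens produced by the LLM
```

The `llm_share` ratio is the quantity the abstract reports as 1.07% in the authors' experiments; in this toy run it is much higher because the sequence is only two tokens long.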
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
中文标题/摘要
标题:MVT:基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用
遥感中的土地覆盖理解越来越多地需要跨数据集泛化但同时保持空间精确性和可解释性的类无差别系统。我们研究了在领域转移下的几何优先发现与解释设置,其中候选区域以类无差别方式划定,监督避免使用类名的明码标识。除了开放集识别和开放世界学习,我们专注于将类无差别掩码证据与分类学导向的场景解释相结合,而不是未知拒绝或持续类扩展。我们提出了MVT,一个三阶段框架,(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为裁判评分,通过分层专家评分校准输出评估。在跨数据集分割迁移(在OpenEarthMap上训练,在LoveDA上评估)中,领域适应的SAM2提高了掩码质量;同时,双步骤MLLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。
Summary / 总结
The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that can generalize across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. The main experimental findings show that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
研究旨在开发用于遥感土地覆盖理解的类无感知系统,注重空间精度和可解释性。方法包括三个阶段:(i) 使用SAM2结合领域适应提取边界一致的区域掩码,(ii) 通过双重步骤的LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为裁判评分并根据分层专家评级进行校准。关键发现包括领域适应SAM2提高了掩码质量,以及通过双重步骤的LLM微调获得了更准确的分类对齐标签和更具信息量的掩码导向场景描述。
Improving and Evaluating Open Deep Research Agents
Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof
First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00
Comments: 8 pages, 2 figures, 2 tables
Abstract
We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.
中文标题/摘要
标题:改进和评估开放深度研究代理
我们在这里关注深度研究代理(DRAs),这是一种可以从用户那里接收自然语言提示,并自主搜索和利用互联网内容来应对提示的系统。最近的DRAs在公共基准测试中展示了令人印象深刻的性能,然而,最近的研究主要涉及专有的闭源系统。在本研究进行时,我们仅发现一个开源DRA,称为Open Deep Research(ODR)。在本工作中,我们改编了具有挑战性的最新BrowseComp基准测试,用于比较ODR与现有专有系统。我们提出了BrowseComp-Small(BC-Small),它由BrowseComp的一个子集构成,是一个计算上更易处理、适用于学术实验室的DRA基准测试。我们在BC-Small上对ODR和两个专有系统进行了基准测试:一个来自Anthropic的系统和一个来自Google的系统。我们发现,这三个系统在包含60个问题的测试集上的准确率均为0%。我们对ODR进行了三项战略改进,产生了ODR+模型,该模型在BC-Small上实现了10%的成功率,这是在专有和开源系统中均处于最先进的水平。我们报告了消融研究,表明我们的三项改进都对ODR+的成功做出了贡献。
Summary / 总结
This work focuses on Deep Research Agents (DRAs) that can autonomously search and utilize internet content based on user prompts. The authors adapt the BrowseComp benchmark to evaluate ODR, an open-source DRA, against proprietary systems. After benchmarking, they found that all systems performed poorly. They then introduced three strategic improvements to ODR, resulting in ODR+, which achieved a 10% success rate on the benchmark, the best among both open-source and closed-source systems.
该研究关注能够处理自然语言提示并自主搜索和利用互联网内容的Deep Research Agents (DRAs)。作者将BrowseComp基准适应以评估ODR,一个开源DRA,与商业系统进行比较。基准测试后,他们发现所有系统表现不佳。然后他们对ODR进行了三项策略性改进,形成了ODR+,在基准测试中达到了10%的成功率,这是开源和商业系统中的最佳表现。
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu
First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00
Abstract
Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
中文标题/摘要
标题:视觉-语言内省:通过可解释的双因果引导减轻MLLM中的过度自信幻觉
物体幻觉严重削弱了多模态大型语言模型的可靠性,通常源于认知内省的基本失败,模型盲目信任语言先验而非具体的视觉证据。现有缓解措施仍有限:对比解码方法仅表面操作而不纠正内部语义错位,而当前的潜在引导方法依赖于静态向量,缺乏实例特定的精确性。我们引入了视觉-语言内省(VLI),这是一种无需训练的推理框架,模拟了元认知的自我纠正过程。VLI 首先进行属性内省,通过概率冲突检测诊断幻觉风险并定位因果视觉锚点。然后使用可解释的双因果引导主动调节推理过程,动态隔离视觉证据与背景噪声,通过适应性校准消除盲目的自信。VLI 在先进模型上实现了最先进的性能,在MMHal-Bench 上将物体幻觉率降低了12.67%,在POPE 上提高了5.8%的准确性。
Summary / 总结
The research aims to address the issue of object hallucination in Multimodal Large Language Models (MLLMs) by enhancing their cognitive introspection. The method involves a training-free inference framework called Vision-Language Introspection (VLI), which first diagnoses hallucination risks through probabilistic conflict detection and localizes the causal visual anchors. It then uses Interpretable Bi-Causal Steering to dynamically isolate visual evidence from background noise and reduce blind confidence. Key findings show that VLI significantly reduces object hallucination rates by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE.
研究旨在通过增强认知自我反省来解决多模态大型语言模型(MLLMs)中的物体幻觉问题。方法是采用一个无需训练的推理框架——视觉语言自我反省(VLI),首先通过概率冲突检测诊断幻觉风险并定位因果视觉锚点。然后使用可解释的双向因果引导来动态隔离视觉证据并消除背景噪声,减少盲目自信。关键发现表明,VLI在MMHal-Bench上将物体幻觉率降低了12.67%,在POPE上提高了5.8%的准确性。
Forking-Sequences
Authors: Willa Potosnak, Malcolm Wolff, Mengfei Cao, Ruijun Ma, Tatiana Konstantinova, Dmitry Efimov, Michael W. Mahoney, Boris Oreshkin, Kin G. Olivares
Venue: NeurIPS 2025
First: 2025-10-06T04:51:06+00:00 · Latest: 2026-01-08T17:43:12+00:00
Comments: Presented at the GPU-Accelerated and Scalable Optimization (ScaleOpt) Workshop, NeurIPS 2025
Abstract
While accuracy is a critical requirement for time series forecasting, an equally important desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, disrupting downstream decision-making. To improve forecast stability of such revisions, several state-of-the-art models including MQCNN, MQT, and SPADE employ a powerful yet underexplored neural network architectural design known as forking-sequences. This architectural design jointly encodes and decodes the entire time series across all FCDs, producing an entire multi-horizon forecast grid in a single forward pass. This approach contrasts with conventional neural forecasting methods that process FCDs independently, generating only a single multi-horizon forecast per forward pass. In this work, we formalize the forking-sequences design and motivate its broader adoption by introducing a metric for quantifying excess volatility in forecast revisions and by providing theoretical and empirical analysis. We theoretically motivate three key benefits of forking-sequences: (i) increased forecast stability through ensembling; (ii) gradient variance reduction, leading to more stable and consistent training steps; and (iii) improved computational efficiency during inference. We validate the benefits of forking-sequences compared to baseline window-sampling on the M-series benchmark, using 16 datasets from the M1, M3, M4, and Tourism competitions. We observe median accuracy improvements across datasets of 29.7%, 46.2%, 49.3%, 28.6%, 24.7%, and 6.4% for MLP, RNN, LSTM, CNN, Transformer, and StateSpace-based architectures, respectively. We then show that forecast ensembling during inference can improve median forecast stability by 10.8%, 13.2%, 13.0%, 10.9%, 10.2%, and 11.2% for these respective models trained with forking-sequences, while maintaining accuracy.
中文标题/摘要
标题:分叉序列
时间序列预测的准确性是关键要求,但同样重要的是预测在不同预测创建日期(FCD)之间的稳定性。即使是非常准确的模型也可能在FCD之间产生不稳定的修订,扰乱下游决策。为了提高这些修订的预测稳定性,包括MQCNN、MQT和SPADE在内的多种最先进的模型采用了名为分叉序列的强大但尚未充分探索的神经网络架构设计。该架构在所有FCD上联合编码和解码整个时间序列,一次性前向传递生成整个多视窗预测网格。这种方法与传统的神经预测方法形成对比,后者独立处理FCD,每次前向传递只生成一个单一的多视窗预测。在本文中,我们形式化了分叉序列设计,并通过引入衡量预测修订超额波动的度量标准和提供理论与实证分析来促进其更广泛的采用。我们理论地证明了分叉序列的三个关键优势:(i)通过集成增加预测稳定性;(ii)减少梯度方差,导致更稳定和一致的训练步骤;(iii)推理期间提高计算效率。我们通过在M系列基准上使用M1、M3、M4和旅游竞赛的16个数据集验证了分叉序列相对于基线窗口采样的优势,观察到MLP、RNN、LSTM、CNN、Transformer和基于状态空间的架构分别在数据集上的中位数准确性改进为29.7%、46.2%、49.3%、28.6%、24.7%和6.4%。然后我们证明,在使用分叉序列训练的这些模型进行推理时,预测集成可以提高中位数预测稳定性的10.8%、13.2%、13.0%、10.9%、10.2%和11.2%,同时保持准确性。
Summary / 总结
This paper addresses the issue of forecast stability in time series forecasting, which is crucial for downstream decision-making. It introduces forking-sequences, a neural network architectural design that improves forecast stability by jointly encoding and decoding the entire time series across all forecast creation dates. The study provides theoretical and empirical evidence, showing that forking-sequences enhance forecast stability, reduce gradient variance, and improve computational efficiency. Experiments on the M-series benchmark datasets demonstrate median accuracy improvements of up to 49.3% and median forecast stability improvements of up to 13.2% for various models trained with forking-sequences.
该论文针对时间序列预测中预测创建日期(FCD)之间的预测不稳定问题,提出了一种名为forking-sequences的神经网络架构设计,该设计在所有FCD上联合编码和解码整个时间序列,在单次前向传递中生成多时间尺度的预测网格。研究通过理论分析和来自多个竞赛的16个数据集的实验证明,forking-sequences可以带来高达49.3%的准确率提升和高达13.2%的预测稳定性提升,适用于不同的模型架构。
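The forecast grid and the inference-time ensembling mentioned in the Forking-Sequences abstract above can be illustrated with a small worked example. The sketch below is only an illustration under stated assumptions: the grid layout (`grid[f][h]` is the forecast issued at forecast creation date `f` for time `f + 1 + h`), the plain average used for ensembling, and the mean absolute revision used as a volatility proxy are choices made here for clarity, and the paper's own metric and ensembling scheme may differ.

```python
from typing import List


def forecasts_for_target(grid: List[List[float]], target: int) -> List[float]:
    """Collect every forecast of time index `target`, ordered by FCD.

    Assumed layout: grid[f][h] is the forecast made at FCD f for time f + 1 + h.
    """
    values = []
    for f, row in enumerate(grid):
        h = target - f - 1
        if 0 <= h < len(row):
            values.append(row[h])
    return values


def ensemble(grid: List[List[float]], target: int) -> float:
    """Average all available forecasts of the same target time."""
    values = forecasts_for_target(grid, target)
    return sum(values) / len(values)


def mean_abs_revision(grid: List[List[float]], target: int) -> float:
    """Average absolute change between consecutive FCD forecasts of one target."""
    values = forecasts_for_target(grid, target)
    steps = [abs(later - earlier) for earlier, later in zip(values, values[1:])]
    return sum(steps) / len(steps) if steps else 0.0


# Toy grid: 3 forecast creation dates, horizon of 3 steps each.
grid = [
    [10.0, 12.0, 14.0],  # FCD 0 forecasts times 1, 2, 3
    [11.0, 13.0, 15.0],  # FCD 1 forecasts times 2, 3, 4
    [12.0, 16.0, 18.0],  # FCD 2 forecasts times 3, 4, 5
]
target_time = 3
combined = ensemble(grid, target_time)
volatility = mean_abs_revision(grid, target_time)
```

Time 3 is forecast three times (14.0, 13.0, 12.0), so the ensemble smooths the successive revisions into a single value, which is the intuition behind the stability benefit the abstract attributes to ensembling.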
Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art
Authors: Timofey Tomashevskiy
First: 2026-01-08T17:42:56+00:00 · Latest: 2026-01-08T17:42:56+00:00
Comments: 20 pages, 4 figures
Abstract
This work provides a state-of-the-art survey of continual online safe reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning
中文标题/摘要
标题:非平稳环境下的安全持续强化学习方法:前沿综述
本文提供了非平稳环境下持续安全在线强化学习(COSRL)方法的前沿综述。我们讨论了构建持续在线安全强化学习算法的理论方面、挑战和开放问题。我们基于安全学习机制的类型提供了持续在线安全强化学习方法的分类和详细信息,该机制考虑了非平稳性的适应。我们对在线强化学习算法中的安全约束进行了分类,并最终讨论了创建可靠、安全的在线学习算法的前景。 关键词:非平稳环境下的安全RL,非平稳条件下持续的强化学习,HM-MDP,NSMDP,POMDP,安全POMDP,持续学习的约束,安全持续的强化学习综述,安全持续的强化学习综述,安全持续的强化学习,分布转移下的安全在线学习,安全持续在线适应,安全强化学习,安全探索,安全适应,约束马尔可夫决策过程,安全强化学习,部分可观测马尔可夫决策过程,安全强化学习和隐马尔可夫决策过程,安全在线强化学习,安全在线强化学习,安全元学习,安全元强化学习,安全上下文强化学习,持续学习中的安全约束制定
Summary / 总结
This paper provides a comprehensive survey of safe continual reinforcement learning methods for nonstationary environments. It discusses theoretical aspects, challenges, and open questions in building safe reinforcement learning algorithms that can adapt to nonstationary conditions. The authors categorize safety constraints for online reinforcement learning and provide a taxonomy of methods based on the type of safe learning mechanism. Key findings include the importance of formulating safety constraints and the need for reliable, safe online learning algorithms.
该研究提供了持续安全在线强化学习(COSRL)方法的全面综述,关注理论方面、挑战和开放问题。它分类了在线强化学习算法中的安全约束,并讨论了非平稳环境下的适应性。主要发现包括安全学习机制的重要性以及在非平稳环境中需要可靠和安全的在线学习算法。
ROOFS: RObust biOmarker Feature Selection
Authors: Anastasiia Bakhmach, Paul Dufossé, Andrea Vaglio, Florence Monville, Laurent Greillier, Fabrice Barlési, Sébastien Benzekry
First: 2026-01-08T17:41:07+00:00 · Latest: 2026-01-08T17:41:07+00:00
Abstract
Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user's data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.
中文标题/摘要
标题:ROOFS: RObust biOmarker Feature Selection
特征选择(FS)对于生物标志物发现和生物医学数据集分析至关重要。然而,高维特征空间、样本量小、多重共线性和缺失值等挑战使得FS并非易事。此外,FS性能在不同数据集和预测任务中有所不同。我们提出ROOFS,这是一个可在https://gitlab.inria.fr/compo/roofs 获取的Python包,旨在帮助研究人员选择适合其问题的FS方法。ROOFS在用户数据上对多种FS方法进行基准测试,并生成报告,总结包括使用乐观校正估计的下游预测性能、稳定性、单个特征的可靠性以及在模拟结果的半合成数据上评估的真阳性率和假阳性率在内的全面评估指标集。我们通过PIONeeR临床试验数据展示了ROOFS的应用,该试验旨在识别肺癌患者对抗PD-(L)1免疫疗法产生耐药的预测因子。PIONeeR数据集包含来自435名患者的374个多源血液和肿瘤生物标志物。通过迭代方差膨胀因子预筛选,获得了一个包含214个特征的子集。在ROOFS中收集的34种FS方法中,我们评估了23种方法与11种分类器的组合(总共253个模型),并确定了一种基于t检验和逻辑回归的Benjamini-Hochberg错误发现率校正p值的并集的过滤器方法为最优方法,优于包括广泛使用的LASSO在内的其他方法。我们得出结论,ROOFS的全面基准测试有可能提高FS发现的稳健性和可重复性,并增加临床模型的转化价值。
Summary / 总结
The paper addresses the challenges of feature selection in biomedical datasets, proposing ROOFS, a Python package that benchmarks multiple feature selection methods on the user's data. On the PIONeeR dataset, which includes 374 biomarkers from 435 lung cancer patients, the authors evaluate 23 of the 34 methods gathered in ROOFS in combination with 11 classifiers (253 models in total) and identify a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO.
论文针对生物医学数据集中的特征选择挑战,提出了ROOFS,一个在用户数据上对多种特征选择方法进行基准测试的Python包。在包含435名肺癌患者374个生物标志物的PIONeeR数据集上,作者将ROOFS收集的34种方法中的23种与11种分类器组合进行评估(共253个模型),并确定了一种基于t检验和逻辑回归的Benjamini-Hochberg错误发现率校正p值的并集的过滤方法为最优方法,优于包括广泛使用的LASSO在内的其他方法。
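The filter singled out in the ROOFS abstract above, a union of Benjamini-Hochberg-adjusted p-values from two univariate tests, can be sketched in plain Python. This is a minimal illustration and not the `roofs` package API: the function names `bh_adjust` and `union_filter` are hypothetical, the p-values are made-up toy inputs standing in for t-test and logistic-regression results, and the 0.05 cutoff is an assumed placeholder because the abstract does not state a threshold.

```python
from typing import List


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def union_filter(
    names: List[str],
    p_ttest: List[float],
    p_logit: List[float],
    alpha: float = 0.05,  # assumed threshold, not given in the abstract
) -> List[str]:
    """Keep a feature if either adjusted p-value falls at or below alpha."""
    adj_t = bh_adjust(p_ttest)
    adj_l = bh_adjust(p_logit)
    return [n for n, a, b in zip(names, adj_t, adj_l) if a <= alpha or b <= alpha]


# Toy p-values (fabricated for illustration, not from the PIONeeR data).
features = ["f0", "f1", "f2", "f3"]
p_from_ttest = [0.01, 0.04, 0.03, 0.5]
p_from_logit = [0.2, 0.001, 0.3, 0.6]
selected = union_filter(features, p_from_ttest, p_from_logit)
```

Taking the union rather than the intersection keeps a feature that either test flags, which trades some specificity for sensitivity; whether that trade-off is what makes the filter perform well in the authors' benchmark is not something the abstract states.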
Multi-Scale Local Speculative Decoding for Image Generation
Authors: Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian
First: 2026-01-08T17:39:35+00:00 · Latest: 2026-01-08T17:39:35+00:00
Comments: Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage
Abstract
Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to 1.7x, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
Summary / 总结
MuLo-SD accelerates autoregressive image generation by combining multi-resolution drafting with spatially informed verification. A low-resolution drafter and learned up-samplers propose candidate tokens, which a high-resolution target model verifies in parallel. A local rejection and resampling mechanism corrects draft errors within spatial neighborhoods instead of resampling in raster-scan order after the first rejection. MuLo-SD reaches speedups of up to 1.7x, ahead of speculative decoding baselines such as EAGLE-2 and LANTERN in acceleration, while maintaining comparable semantic alignment and perceptual quality, as measured by GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split.
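MuLo-SD is a new framework that accelerates autoregressive image generation by combining multi-resolution drafting with spatially informed verification. It uses a low-resolution drafter and learned up-samplers to propose candidate tokens, which a high-resolution model verifies in parallel. It also includes a local rejection and resampling mechanism to correct errors efficiently. Evaluated with GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split, MuLo-SD reaches speedups of up to 1.7x, ahead of strong speculative decoding baselines, while maintaining semantic alignment and perceptual quality.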
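The MuLo-SD abstract above contrasts resampling in raster-scan order after the first rejection with resampling only a spatial neighborhood around each rejection. The sketch below counts how many tokens each strategy would redraw on a small token grid. It is a minimal illustration, not the authors' implementation: the function names, the 8x8 grid, the rejected positions, and the square window of radius 1 are all assumptions made for this example.

```python
def raster_resample_set(rejected, height, width):
    """Tokens redrawn under raster-scan order: everything from the
    first rejected token (in row-major order) to the end of the grid."""
    if not rejected:
        return set()
    first = min(r * width + c for r, c in rejected)
    return {(i // width, i % width) for i in range(first, height * width)}


def local_resample_set(rejected, height, width, radius=1):
    """Tokens redrawn under a local scheme: a square window around each
    rejected token. The window shape and radius are assumptions."""
    redraw = set()
    for r, c in rejected:
        for rr in range(max(0, r - radius), min(height, r + radius + 1)):
            for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                redraw.add((rr, cc))
    return redraw


# Hypothetical verification outcome: two draft tokens were rejected.
rejected = {(1, 2), (5, 6)}
raster = raster_resample_set(rejected, height=8, width=8)
local = local_resample_set(rejected, height=8, width=8, radius=1)
print(len(raster), len(local))  # prints: 54 18
```

With two rejections on an 8x8 grid, the raster rule redraws 54 tokens and the local rule 18. The neighborhood expansion rule used in the paper may differ from the fixed window assumed here.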
Atlas 2 -- Foundation models for clinical deployment
Authors: Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan
First: 2026-01-08T17:37:00+00:00 · Latest: 2026-01-08T17:37:00+00:00
Abstract
Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.
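Title / Abstract (translated from the Chinese)
Title: Atlas 2: Foundation Models for Clinical Deployment
Pathology foundation models have greatly expanded what is possible in computational pathology, yet trade-offs among performance, robustness, and computational requirements have limited their clinical deployment. In this report we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models that address these shortcomings, showing state-of-the-art prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. The models were trained on the largest pathology foundation model dataset to date, comprising 5.5 million histopathology whole slide images from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.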
Summary / 总结
Existing pathology foundation models trade off performance, robustness, and computational requirements, which has limited their clinical deployment. The authors present three pathology vision foundation models, Atlas 2, Atlas 2-B, and Atlas 2-S, which achieve state-of-the-art prediction performance, robustness, and resource efficiency across eighty public benchmarks. The models were trained on 5.5 million histopathology whole slide images from three medical institutions, described as the largest pathology foundation model dataset to date.
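This work is motivated by the limitations of existing pathology foundation models in performance, robustness, and computational requirements, which have hindered their clinical use. The authors developed three pathology vision foundation models, Atlas 2, Atlas 2-B, and Atlas 2-S, which achieve state-of-the-art performance, robustness, and resource efficiency across eighty public benchmarks. The models were trained on a large dataset of 5.5 million histopathology whole slide images from three medical institutions.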
Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning
Authors: Marvin Illian, Ramin Khalili, Antonio A. de A. Rocha, Lin Wang
First: 2026-01-07T16:51:33+00:00 · Latest: 2026-01-08T17:32:37+00:00
Comments: 11 pages, 12 figures, v2: Corrected performance numbers in the conclusion; no change to methodology
Abstract
The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.
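Title / Abstract (translated from the Chinese)
Title: Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning
The widespread deployment of 5G networks, together with coexisting 4G/LTE networks, gives mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices with cells so as to maximize overall network performance, known as cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually from operator experience and are rarely adapted to dynamic network conditions. In this work we ask: can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present CellPilot, a reinforcement learning (RL)-based framework that adaptively tunes cell (re)selection parameters by learning the spatiotemporal patterns of mobile network dynamics. Our study on real-world data shows that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167% while generalizing well across network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.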
Summary / 总结
This paper addresses cell (re)selection in mobile networks, where parameters are usually configured by hand and rarely adapted to changing conditions. It proposes CellPilot, a reinforcement learning (RL) framework that tunes cell (re)selection parameters by learning spatiotemporal patterns of network dynamics. On real-world data, even a lightweight RL agent outperforms conventional heuristic reconfigurations by up to 167% and generalizes well across network scenarios.
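The paper tackles cell (re)selection in mobile networks by proposing CellPilot, a framework based on reinforcement learning (RL). The framework automatically tunes cell (re)selection parameters to improve network performance. Experiments show that CellPilot improves on conventional methods by up to 167% and generalizes well across network scenarios.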
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
First: 2026-01-08T17:28:52+00:00 · Latest: 2026-01-08T17:28:52+00:00
Comments: Project Page: https://sixiaozheng.github.io/VerseCrafter_page/
Abstract
Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
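Title / Abstract (translated from the Chinese)
Title: VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Video world models aim to simulate dynamic real-world environments, but existing methods struggle to provide unified and precise control over camera and multi-object motion, because video dynamics play out in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that allows explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach centers on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, which can then generate high-fidelity, view-consistent videos that follow the specified dynamics precisely. Another major challenge is the scarcity of large-scale training data with explicit 4D annotations. We address it with an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train the model on a large and diverse dataset.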
Summary / 总结
VerseCrafter is a 4D-aware video world model that gives precise, unified control over camera and multi-object motion. Its 4D Geometric Control representation encodes the world state as a static background point cloud plus per-object 3D Gaussian trajectories, capturing both an object's path and its probabilistic 3D occupancy over time. These controls are rendered into conditioning signals for a pretrained video diffusion model, which generates high-fidelity, view-consistent videos. To train at scale, an automatic data engine extracts 4D controls from in-the-wild videos.
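VerseCrafter is a 4D-aware video world model that gives explicit, coherent control over camera and object dynamics within a unified 4D geometric world state. It encodes the world state with a novel 4D Geometric Control representation and renders it into conditioning signals for a pretrained video diffusion model, producing high-fidelity, view-consistent videos. The main remaining challenge, the scarcity of large-scale training data with explicit 4D annotations, is addressed with an automatic data engine that extracts the required 4D controls from in-the-wild videos, enabling training on a large and diverse dataset.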
Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning
Authors: Polina Dolgova, Sebastian U. Stich
First: 2026-01-08T17:23:13+00:00 · Latest: 2026-01-08T17:23:13+00:00
Abstract
Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,\delta)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.
Summary / 总结
The paper addresses the challenge of maintaining model accuracy while ensuring privacy in certified unlearning. It introduces sequential noise scheduling, which injects noise in orthogonal subspaces of the parameter space sequentially, thereby reducing the negative impact on accuracy. Experiments on image classification benchmarks demonstrate that this method significantly improves accuracy after unlearning while still providing strong privacy guarantees against membership inference attacks.
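The paper addresses the difficulty of preserving model accuracy while guaranteeing privacy in certified unlearning. It proposes sequential noise scheduling, which injects noise into orthogonal subspaces of the parameter space one after another, reducing the damage to accuracy. Experiments show that the method substantially improves accuracy after unlearning while remaining robust to membership inference attacks.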
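The idea in the entry above, adding noise to one orthogonal subspace at a time instead of to all parameters at once, can be sketched in a few lines. This is a toy illustration only and carries no privacy guarantee. The coordinate-block subspaces, the function names, the noise scale `sigma`, and the optional `fine_tune` hook are assumptions made for this example; the abstract does not say how the paper chooses its subspaces or its schedule.

```python
import random


def coordinate_blocks(dim, k):
    """Split indices 0..dim-1 into k disjoint blocks. Disjoint coordinate
    blocks span mutually orthogonal subspaces (an assumed, simple choice)."""
    return [list(range(start, dim, k)) for start in range(k)]


def inject_sequentially(params, k, sigma, rng, fine_tune=None):
    """Add Gaussian noise to one coordinate block per step.

    fine_tune is a hypothetical hook that maps a parameter list to a
    parameter list of the same length; it stands in for the fine-tuning
    that would run between noise injections.
    """
    params = list(params)
    touched = [0] * len(params)
    for block in coordinate_blocks(len(params), k):
        for i in block:
            params[i] += rng.gauss(0.0, sigma)
            touched[i] += 1
        if fine_tune is not None:
            params = fine_tune(params)
    return params, touched


rng = random.Random(0)
noisy, touched = inject_sequentially([0.0] * 12, k=3, sigma=0.1, rng=rng)
print(touched)  # every coordinate receives noise exactly once
```

After the last step every coordinate has been perturbed exactly once, yet at any single step only one block of twelve-over-three coordinates is disturbed, which leaves the model room to recover between injections.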
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00
Abstract
Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, extending the Question Answering problem to Video Answer Search, which requires models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.
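Title / Abstract (translated from the Chinese)
Title: FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Finding information in hour-long videos is challenging even for top-performing Vision Language Models (VLMs), because encoding the visual content quickly exceeds the available context window. To tackle this, we present FALCONEye, a new video agent built on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions with an exploration-based search algorithm guided by the calibrated confidence of the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends question answering to Video Answer Search: models must return both the answer to an open-ended question about an hour-long video and the temporal window that supports it. With only a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It also generalizes to the MLVU benchmark, with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.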
Summary / 总结
FALCONEye finds information in hour-long videos with a training-free, model-agnostic meta-architecture that pairs a VLM with an LLM. It answers open-ended questions through an exploration-based search guided by the VLM's calibrated confidence. With a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench, and it generalizes to the MLVU benchmark, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.
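FALCONEye is a new video agent that uses a VLM and an LLM to answer open-ended questions about hour-long videos. It relies on an exploration-based search algorithm guided by the VLM's calibrated confidence. On FALCON-Bench it outscores all open-source 7B VLMs and comparable agents, and on the MLVU benchmark it shows strong generalization, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.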
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
First: 2026-01-08T17:13:00+00:00 · Latest: 2026-01-08T17:13:00+00:00
Comments: 13 pages, 9 figures, project page: https://github.com/hrz2000/realign
Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
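Title / Abstract (translated from the Chinese)
Title: Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
In-context image generation and editing (ICGE) lets users specify visual concepts through interleaved image-text prompts, which demands precise understanding and faithful execution of user intent. Although recent unified multimodal models show promising understanding capabilities, these strengths often fail to transfer to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core is the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and reducing confusion among reference images. Re-Align also introduces an RL training scheme that uses a surrogate reward to measure the alignment between the structured reasoning text and the generated image, improving overall performance on ICGE tasks. Extensive experiments confirm that Re-Align outperforms competing methods of comparable model scale and resources on both in-context image generation and editing tasks.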
Summary / 总结
Re-Align is a unified framework that enhances in-context image generation and editing by incorporating structured reasoning-guided alignment. It uses the In-Context Chain-of-Thought (IC-CoT) to decouple semantic guidance and reference association, providing clear textual targets and reducing confusion among reference images. The framework also employs an RL training scheme with a surrogate reward to improve alignment between structured reasoning text and generated images. Experiments show that Re-Align outperforms other comparable methods in both in-context image generation and editing tasks.
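Re-Align is a unified framework for in-context image generation and editing that uses structured reasoning-guided alignment to improve how well the model understands and carries out user intent. It introduces the In-Context Chain-of-Thought (IC-CoT) to decouple semantic guidance from reference association, and an RL training scheme whose surrogate reward measures the alignment between the structured reasoning text and the generated image. Experiments show that Re-Align outperforms competing methods of comparable scale on both generation and editing tasks.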
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Authors: Zirui Wu, Zeren Jiang, Martin R. Oswald, Jie Song
First: 2026-01-08T17:03:44+00:00 · Latest: 2026-01-08T17:03:44+00:00
Comments: Project Page: https://wuzirui.github.io/pvsm-web
Abstract
Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
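Title / Abstract (translated from the Chinese)
Title: From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing work encodes cameras as Plücker ray maps, which tie predictions to an arbitrary world coordinate frame and make them sensitive to small camera transformations, undermining geometric consistency. In this paper we ask which inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. We also introduce a masked autoencoding pretraining strategy tailored to this cue, which makes it possible to pretrain on large-scale uncalibrated data. Our method shows higher fidelity and stronger cross-view consistency than ray-conditioned baselines on our view-consistency benchmark, and reaches state-of-the-art quality on standard novel view synthesis benchmarks.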
Summary / 总结
This paper addresses the issue of geometric inconsistency in feed-forward view synthesis models by proposing projective conditioning as an alternative to Plücker ray maps. The method uses a target-view projective cue to provide a stable 2D input, transforming the task into a well-conditioned image-to-image translation problem. The authors also introduce a masked autoencoding pretraining strategy for this cue, allowing the use of large-scale uncalibrated data. Experimental results demonstrate improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines, and the method achieves state-of-the-art quality on standard benchmarks for novel view synthesis.
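This paper tackles geometric inconsistency in feed-forward view synthesis models by proposing projective conditioning as a replacement for Plücker ray maps. The method uses a target-view projective cue as a stable 2D input, reframing the task as a well-conditioned image-to-image translation problem. Experiments show improved fidelity and cross-view consistency on the view-consistency benchmark, and state-of-the-art quality on standard benchmarks.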
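The claim in the entry above that Plücker ray maps tie a model's input to the choice of world coordinates can be checked on a single ray. A ray with origin o and unit direction d has Plücker coordinates (d, o × d). The sketch below, with helper names and numbers chosen purely for illustration, encodes the same physical ray in two world frames that differ only by a shift of the origin. It is not taken from the paper.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker(origin, direction):
    """Plücker coordinates (d, o x d) of a ray, with d normalized."""
    norm = sum(x * x for x in direction) ** 0.5
    d = tuple(x / norm for x in direction)
    return d + cross(origin, d)


# One physical ray looking along +z, written in two world frames.
# In frame A the camera center sits at the world origin; in frame B the
# world origin has been shifted so the same center is at (1, 0, 0).
ray_a = plucker((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
ray_b = plucker((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(ray_a)
print(ray_b)
```

The direction part is identical in both frames, but the moment part changes from (0, 0, 0) to (0, -1, 0), so a network fed raw Plücker maps sees different inputs for the same scene and camera. A cue defined in the target view's image plane, as the paper proposes, does not depend on where the world origin is placed.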