arXiv Paper Digest

Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

First: 2026-01-08T18:59:56+00:00 · Latest: 2026-01-08T18:59:56+00:00

Comments: 15 pages, 8 figures, project page: https://mesh-4d.github.io/

Abstract

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

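Title / Abstract (translated from Chinese)

Title: Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, the model reconstructs the object's complete 3D shape and motion, represented as a deformation field. The main contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that is guided during training by the skeletal structure of the training objects, which provides priors on plausible deformations. Importantly, no skeletal information is needed at inference time. The encoder uses spatio-temporal attention, which gives a more stable representation of the object's overall deformation. On top of this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, where it outperforms prior methods in recovering accurate 3D shape and deformation.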

Summary / 总结

Mesh4D is a feed-forward model for monocular 4D mesh reconstruction that reconstructs the complete 3D shape and motion of a dynamic object from a monocular video. It uses a compact latent space learned by an autoencoder, which encodes the entire animation sequence without requiring skeletal information at inference time. The model employs spatio-temporal attention to provide a stable representation of the object's deformation and a latent diffusion model to predict the full animation. Mesh4D outperforms previous methods in recovering accurate 3D shape and deformation on various benchmarks.

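Mesh4D is a monocular 4D mesh reconstruction model that recovers the complete 3D shape and motion of a dynamic object from a monocular video. It relies on an autoencoder that learns a compact latent space, guided by skeletal structure during training, and this latent space is then used to generate the full animation in a single prediction. Mesh4D outperforms prior methods on reconstruction and novel view synthesis benchmarks and accurately recovers 3D shape and deformation.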

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Authors: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu

First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00

Comments: Project page: https://ntuneillee.github.io/research/rl-awb/

Abstract

Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/

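Title / Abstract (translated from Chinese)

Title: RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Nighttime color constancy remains a challenging problem in computational photography because of low-light noise and complex illumination conditions. We present RL-AWB, a new framework that combines statistical methods with deep reinforcement learning for nighttime white balance. The method starts from a statistical algorithm tailored to nighttime scenes, which combines salient gray pixel detection with a new illumination estimate. On this basis, we develop the first deep reinforcement learning approach to color constancy that is built around a statistical algorithm; it imitates professional AWB tuning experts by dynamically optimizing the parameters for each image. To support cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiments show that the method generalizes better across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/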

Summary / 总结

The research aims to address the challenge of nighttime color constancy in computational photography by developing RL-AWB, a framework that combines statistical methods with deep reinforcement learning. The method starts with a statistical algorithm optimized for nighttime scenes, which detects salient gray pixels and estimates illumination. It then uses deep reinforcement learning to dynamically adjust parameters for each image, mimicking professional white balance tuning. Experimental results show that RL-AWB outperforms existing methods in generalizing across different lighting conditions and sensors.

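The work tackles nighttime color constancy in computational photography by combining statistical methods with deep reinforcement learning in a framework called RL-AWB. The method first applies a statistical algorithm optimized for nighttime scenes, which detects salient gray pixels and estimates the illumination. It then uses deep reinforcement learning to adjust the parameters for each image dynamically, imitating professional white balance tuning experts. Experiments show that RL-AWB generalizes better than existing methods across lighting conditions and sensors.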
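
The abstract does not spell out the statistical algorithm, so the following is only a minimal sketch of the general idea of a statistical illuminant estimate with a tunable parameter. It swaps in the well-known Shades of Gray estimator (a Minkowski-norm generalization of gray world) as a stand-in, followed by a diagonal (von Kries style) correction. The function names, the parameter `p`, and the toy pixel data are all assumptions for illustration; in a setup like the one described, `p` is the kind of per-image parameter a learned policy could choose.

```python
def estimate_illuminant(pixels, p=1.0):
    """Shades of Gray estimate: per-channel Minkowski mean of order p.

    p = 1 gives the gray-world estimate; larger p moves towards max-RGB.
    Returns the illuminant as RGB fractions that sum to 1.
    """
    n = len(pixels)
    raw = [(sum(px[c] ** p for px in pixels) / n) ** (1.0 / p) for c in range(3)]
    total = sum(raw)
    return [v / total for v in raw]


def white_balance(pixels, illuminant):
    """Diagonal correction: scale each channel so the illuminant becomes neutral.

    Gains are expressed relative to the green channel (green gain = 1).
    Assumes every illuminant component is non-zero.
    """
    gains = [illuminant[1] / e for e in illuminant]
    return [tuple(px[c] * gains[c] for c in range(3)) for px in pixels]


# Toy scene whose true channel means are equal, lit by a warm light (hypothetical data).
surface = [(0.2, 0.4, 0.6), (0.6, 0.2, 0.4), (0.4, 0.6, 0.2)]
light = (1.0, 0.8, 0.5)
observed = [tuple(s[c] * light[c] for c in range(3)) for s in surface]

estimate = estimate_illuminant(observed, p=1.0)
corrected = white_balance(observed, estimate)
channel_means = [sum(px[c] for px in corrected) / len(corrected) for c in range(3)]
print(estimate)
print(channel_means)
```

With `p=1.0` the corrected channel means coincide on this toy scene because the gray-world assumption holds exactly; on real nighttime images it generally does not, which is one reason a per-image choice of parameters can matter.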

QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik

First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00

Comments: 30 pages, 15 figures, 11 tables; project page: https://4dqv.mpi-inf.mpg.de/QNeRF/

Abstract

Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.

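Title / Abstract (translated from Chinese)

Title: QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed when learning given 2D or 3D signals. Meanwhile, Neural Radiance Fields (NeRFs) have driven major advances in novel-view synthesis: the model learns a compact representation from 2D images in order to render 3D scenes, at the cost of larger models and intensive training. In this work we extend the QVF approach by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF uses parameterised quantum circuits to encode spatial and view-dependent information through quantum superposition and entanglement, which yields more compact models than the classical counterpart. We present two architectural variants. Full QNeRF exploits all quantum amplitudes to maximize representational capacity. Dual-Branch QNeRF instead introduces a task-informed inductive bias by separating the spatial and view-dependent quantum state preparations into branches, which greatly reduces the complexity of this operation and supports scalability and potential hardware compatibility. Our experiments show that, when trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can be a competitive alternative for continuous signal representation in mid-level computer vision tasks, such as learning 3D representations from 2D observations.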

Summary / 总结

QNeRF is a hybrid quantum-classical model for novel-view synthesis that leverages parameterized quantum circuits to encode spatial and view-dependent information, resulting in more compact models compared to classical counterparts. It includes two architectural variants: Full QNeRF and Dual-Branch QNeRF. Experiments show that QNeRF matches or outperforms classical NeRF baselines with fewer parameters when trained on moderate-resolution images, suggesting quantum machine learning can be a competitive alternative for 3D representation learning from 2D observations in computer vision tasks.

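QNeRF is a model for novel-view synthesis that combines quantum and classical computation. It encodes spatial and view-dependent information with parameterised quantum circuits and is more compact than classical models. It comes in two architectural variants, Full QNeRF and Dual-Branch QNeRF. Experiments show that, when trained on images of moderate resolution, QNeRF matches or even outperforms classical NeRF baselines with less than half the number of parameters, which suggests that quantum machine learning can be a competitive alternative for learning 3D representations from 2D observations in computer vision.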

Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00

Comments: Code: https://github.com/gangweix/pixel-perfect-depth

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

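Title / Abstract (translated from Chinese)

Title: Pixel-Perfect Visual Geometry Estimation

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer badly from flying pixels and the loss of fine details. In this paper we present pixel-perfect visual geometry models that predict high-quality point clouds free of flying pixels by using generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built on pixel-space diffusion transformers (DiT). To deal with the high computational complexity of pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which uses semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) a Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to keep temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce cleaner point clouds than all other models.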

Summary / 总结

This paper addresses the issue of recovering clean and accurate geometry from images for robotics and augmented reality. It introduces pixel-perfect visual geometry models, specifically Pixel-Perfect Depth (PPD) and its extension to video (PPVD), which use pixel-space diffusion transformers to predict high-quality point clouds without flying pixels. The models incorporate semantic representations and a cascade architecture to enhance fine-grained details and computational efficiency. Experimental results show that PPD and PPVD outperform existing models in monocular and video depth estimation, producing cleaner point clouds.

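This paper addresses the recovery of clean and accurate geometry from images for robotics and augmented reality. It proposes pixel-perfect visual geometry models, namely Pixel-Perfect Depth (PPD) and its video extension PPVD, which use pixel-space diffusion transformers to predict high-quality point clouds without flying pixels. The key components are Semantics-Prompted DiT, which preserves global semantics while enhancing fine-grained visual details, and Cascade DiT, which improves efficiency and accuracy. The models perform strongly in monocular and video depth estimation and produce cleaner point clouds.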

GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang

First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00

Comments: IJCV, Project Page: https://henghuiding.com/GREx/

Abstract

Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, and do not consider multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.

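Title / Abstract (translated from Chinese)

Title: GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression that describes a selected object. Existing datasets and methods usually support only single-target expressions, where one expression refers to exactly one object, and ignore multi-target and no-target expressions. This severely limits the practical use of REx (RES/REC/REG). This paper introduces three new benchmarks, Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively called GREx, which extend classic REx so that an expression may identify any number of objects. We build the first large-scale GREx dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions together with their images and labeled targets. GREx and gRefCOCO are designed to be compatible with REx, which makes it easy to run extensive experiments on the performance gap of existing REx methods on GREx tasks. One challenge in GRES/GREC is complex relationship modeling, for which we propose a baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models region-region and region-language dependencies. ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx/.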

Summary / 总结

This paper introduces GREx, which extends the classic REx tasks to support multi-target and no-target expressions, addressing the limitations of existing datasets and methods. It proposes a new dataset gRefCOCO and a baseline model ReLA that models complex relationships effectively, achieving state-of-the-art results on GRES and GREC tasks. The dataset and method are available at https://henghuiding.github.io/GREx.

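The paper proposes GREx, which extends the classic REx tasks to multi-target and no-target expressions and so addresses a limitation of existing datasets and methods. It contributes a new dataset, gRefCOCO, and a baseline model, ReLA, which models complex relationships effectively and achieves state-of-the-art results on both the GRES and GREC tasks. The dataset and method are available at https://henghuiding.github.io/GREx/.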

Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider

First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00

Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426

Abstract

Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.

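Title / Abstract (translated from Chinese)

Title: Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Objective: Latent diffusion models (LDM) could ease the data scarcity that hampers machine learning development for medical imaging. However, medical LDM strategies typically depend on short-prompt text encoders, non-medical LDMs, or large volumes of data, which can limit performance and scientific accessibility. We propose a new LDM conditioning approach to address these limitations. Methods: We propose the Class-Conditioned Efficient Large Language model Adapter (CCELLA), a dual-head conditioning approach that conditions the LDM U-Net on free-text clinical reports and radiology classification at the same time. We also propose a data-efficient LDM pipeline centered on CCELLA, together with a joint loss function. We first evaluate the method on 3D prostate MRI against the state of the art. We then augment the training set of a downstream classifier with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly better than a recent foundation model with an FID of 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by the prior state of the art. A classifier trained only on our synthetic images performs comparably to one trained on real images. Conclusion: The method improves both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The CCELLA-centric pipeline enables radiology-report-conditioned and class-conditioned LDM training for high-quality medical image synthesis with limited data and human annotation, improving LDM performance and scientific accessibility.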

Summary / 总结

The research aims to address the limitations of data scarcity in medical imaging by proposing a novel conditioning approach for latent diffusion models (LDM). The method, CCELLA, conditions the LDM U-Net with free-text clinical reports and radiology classification, and evaluates its performance on 3D prostate MRI. The results show that the proposed method achieves a 3D FID score of 0.025, significantly outperforming a recent foundation model. Additionally, using synthetic images generated by this method improves the accuracy of a downstream classifier for prostate cancer prediction from 69% to 74%. The method also achieves comparable performance to real image training when used solely for classifier training.

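The work addresses data scarcity in medical imaging by using latent diffusion models (LDM) together with a new dual-head conditioning approach called Class-Conditioned Efficient Large Language model Adapter (CCELLA). The approach conditions the LDM U-Net on both free-text clinical reports and radiology classification. The method is evaluated on 3D prostate MRI and shows a significant improvement in 3D FID score over a recent foundation model. In addition, synthetic images generated by the method improve the accuracy of a downstream classifier for prostate cancer prediction, reaching performance comparable to training on real images.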

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00

Comments: NVIDIA-Tech Report

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

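Title / Abstract (translated from Chinese)

Title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

As language models grow more capable, users expect them not only to give accurate responses but also to behave in line with diverse human preferences across many scenarios. To this end, reinforcement learning (RL) pipelines have started to use multiple rewards, each capturing a distinct preference, to steer models toward the desired behaviors. However, recent work applies Group Relative Policy Optimization (GRPO) by default in the multi-reward setting without checking whether it is suitable. In this paper we show that applying GRPO directly to normalize distinct rollout reward combinations makes them collapse into identical advantage values, which lowers the resolution of the training signal and leads to suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of the individual rewards. This preserves their relative differences more faithfully, allows more accurate multi-reward optimization, and substantially improves training stability. We compare GDPO with GRPO on three tasks, tool calling, math reasoning, and coding reasoning, and evaluate both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). In all settings GDPO consistently outperforms GRPO, which demonstrates its effectiveness and generality for multi-reward reinforcement learning optimization.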

Summary / 总结

The paper examines the use of Group Relative Policy Optimization (GRPO) in multi-reward reinforcement learning, where normalizing distinct rollout reward combinations can collapse them into identical advantage values and lead to suboptimal training. To address this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, preserving their relative differences and improving training stability. GDPO outperforms GRPO across three tasks (tool calling, math reasoning, and coding reasoning) on both correctness and constraint adherence metrics.

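The paper studies the problems that arise when Group Relative Policy Optimization (GRPO) is used for multi-reward reinforcement learning, and points out that it can cause reward values to collapse, which harms training. The authors therefore propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards and improves training stability and performance. On tool calling, math reasoning, and coding reasoning, GDPO outperforms GRPO on both correctness and constraint adherence metrics, which shows its effectiveness and generality for multi-reward reinforcement learning optimization.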
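
The collapse described in this entry can be illustrated on a toy group of rollouts. The abstract does not give the exact formulas, so this is a minimal sketch under stated assumptions: a plain z-score as the group normalization, an unweighted sum across rewards, and hypothetical function names. It compares normalizing the summed reward (the coupled scheme) with normalizing each reward separately before summing (one possible decoupled scheme, not necessarily the paper's).

```python
from statistics import fmean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of floats to zero mean and (roughly) unit variance."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def coupled_advantages(rewards):
    """Sum the rewards of each rollout first, then normalize across the group."""
    totals = [sum(r) for r in rewards]
    return zscore(totals)


def decoupled_advantages(rewards):
    """Normalize each reward channel across the group first, then sum per rollout."""
    num_rewards = len(rewards[0])
    per_channel = [zscore([r[k] for r in rewards]) for k in range(num_rewards)]
    return [sum(channel[i] for channel in per_channel) for i in range(len(rewards))]


# Four rollouts scored by two binary rewards (hypothetical data).
# Rollouts 0 and 1 satisfy different rewards but have the same total.
group = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

coupled = coupled_advantages(group)
decoupled = decoupled_advantages(group)
print(coupled)
print(decoupled)
```

In the coupled scheme, rollouts 0 and 1 receive the same advantage because only their totals enter the normalization. In the decoupled sketch they differ, because the second reward is satisfied by fewer rollouts in the group than the first, so the two channels have different group statistics.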

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang

First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00

Abstract

The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

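Title / Abstract (translated from Chinese)

Title: RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, because of hardware and physical setup constraints, collecting large-scale real-world manipulation data is hard to scale across diverse environments. Recent work uses text-prompt-conditioned image diffusion models to augment manipulation data by changing the backgrounds and tabletop objects in the visual observations. These approaches often overlook the practical need for the multi-view, temporally coherent observations that state-of-the-art policy models require. Moreover, text prompts alone cannot reliably specify the scene setup. To give the diffusion model explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide generation of the desired scene setup. To this end, we also build a scalable pipeline that curates a visual identity pool from large robotics datasets. Training downstream vision-language-action and visuomotor policy models on our augmented manipulation data yields consistent performance gains in both simulation and real-robot settings.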

Summary / 总结

The research aims to enhance the diversity, quantity, and quality of manipulation data for training robot policies. It introduces RoboVIP, a method that uses visual identity prompting to generate multi-view, temporally coherent video data. By giving the diffusion model explicit visual guidance, the approach overcomes the limits of text prompts alone in specifying scene setups, and it improves the performance of downstream vision-language-action and visuomotor policy models in both simulation and real-robot settings.

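The work aims to improve robot policy training by increasing the diversity, quantity, and quality of manipulation data. The method augments manipulation data with multi-view videos generated through visual identity prompting, which gives the diffusion model explicit visual guidance. Experiments show that training vision-language-action and visuomotor policy models on the augmented data yields consistent performance gains in both simulation and real-robot settings.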

Plenoptic Video Generation

Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin

First: 2026-01-08T18:58:32+00:00 · Latest: 2026-01-08T18:58:32+00:00

Comments: Project Page: https://research.nvidia.com/labs/dir/plenopticdreamer/

Abstract

Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/

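Title / Abstract (translated from Chinese)

Title: Plenoptic Video Generation

Camera-controlled generative video re-rendering methods, such as ReCamMaster, have made remarkable progress. However, although they succeed in single-view settings, they still struggle to stay consistent across multi-view scenarios. Because generative models are inherently stochastic, keeping hallucinated regions spatio-temporally coherent remains difficult. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model autoregressively, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, training uses progressive context-scaling to improve convergence, self-conditioning to increase robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks show that PlenopticDreamer achieves state-of-the-art video re-rendering, with superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (for example, third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/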

ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

Authors: Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi

First: 2026-01-08T18:58:08+00:00 · Latest: 2026-01-08T18:58:08+00:00

Comments: Preprint. Project Website: objectforesight.github.io

Abstract

Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. Project website: objectforesight.github.io

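Title / Abstract (translated from Chinese)

Title: ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

Humans can easily anticipate how objects might move or change through interaction, for example by imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to give computational systems a similar ability to predict plausible future object motion directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts the future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models, ObjectForesight represents the world explicitly in 3D at the object level, which enables geometrically and temporally consistent predictions that capture object affordances and trajectories. To train such a model at scale, we use recent advances in segmentation, mesh reconstruction, and 3D pose estimation to collect a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Extensive experiments show that ObjectForesight brings significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models from observation. Project website: objectforesight.github.io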

Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter

First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00

Comments: 6 pages, 4 figures

Abstract

We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop online tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of online news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word-level (GoEmotions) and context-level (Large Language Model) methods. To promote peace, we note that 71% of people aged 20-40 get most of their daily news through short videos on social media. Creators of these videos are biased towards content that triggers emotional activation, such as anger, in order to drive engagement and clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

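Title / Abstract (translated from Chinese)

Title: Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media; and 2) to develop online tools that promote peace by helping users better understand their own media consumption. For news media, we used neural networks to measure levels of peace from text embeddings of online news sources. The model, trained on one news media dataset, also showed high accuracy when analyzing a different news dataset. For social media, such as YouTube, we developed other models to measure social dimensions related to peace, using word-level (GoEmotions) and context-level (Large Language Model) methods. To promote peace, we note that 71% of people aged 20-40 get most of their daily news through short videos on social media. Creators of these videos tend to make content that stirs emotions and provokes anger in order to increase clicks. We developed and tested a Chrome extension called MirrorMirror, which gives YouTube viewers real-time feedback on the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to become an open-source tool that helps content creators, journalists, researchers, platforms, and individual users better understand the tone of the media they create and consume and its effects on viewers. Beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.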

Summary / 总结

This study aims to measure and foster peace using machine learning and artificial intelligence. It developed neural network models to assess peace levels from news media text and social media content, and created a Chrome extension called MirrorMirror to provide real-time feedback on the peacefulness of media consumption. Key findings include high accuracy in measuring peace from different news datasets and a significant portion of young adults relying on short videos for news, which often evoke strong emotions to increase engagement. The research suggests that MirrorMirror could help promote more respectful and informative communication by providing insights into media tone and its effects on viewers.

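The study aims to measure and promote peace using machine learning and artificial intelligence. It uses neural networks to assess levels of peace from text embeddings of news media and develops models to measure social dimensions on social media such as YouTube. The team built a Chrome extension called MirrorMirror that gives real-time feedback on the peacefulness of the media a user is watching, with the aim of encouraging more respectful and informative communication. The main findings are the high accuracy of the neural network model across different news datasets and the important role of short videos in shaping the media consumption habits of young people.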

Learning Latent Action World Models In The Wild

Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat

First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00

Comments: 37 pages, 25 figures

Abstract

Agents capable of reasoning and planning in the real world need to be able to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates latent action models, which learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to that of action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

中文标题/摘要

标题：学习自然环境中的潜在动作世界模型

能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力，但它们通常需要行为标签，而这些标签在大规模应用中往往难以获取。这促使我们学习潜在动作模型，可以从视频中学习动作空间。我们的工作解决了在自然环境视频中学习潜在动作世界模型的问题，扩展了现有工作，这些工作主要集中在简单的机器人模拟、视频游戏或操作数据上。虽然这使我们能够捕捉到更丰富的动作，但也带来了挑战，如视频多样性带来的环境噪声或视频间缺乏共同的实体。为应对部分挑战，我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现，连续但受限的潜在动作能够捕捉自然环境视频中动作的复杂性，而常见的矢量量化则无法做到这一点。例如，我们发现来自代理（如人类进入房间）的环境变化可以在视频间转移，这突显了学习特定于自然环境视频的动作的能力。在视频间缺乏共同实体的情况下，我们主要能够学习在空间上局部化的潜在动作，相对于摄像机而言。尽管如此，我们能够训练一个控制器，将已知动作映射到潜在动作，使我们能够使用潜在动作作为通用接口，并使用世界模型解决规划任务，其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界提供了一步进展。

Summary / 总结

The research aims to develop world models that can predict the consequences of actions in the real world without needing explicit action labels, which are hard to obtain at scale. The method involves learning latent action models from in-the-wild videos, addressing challenges like environmental noise and diverse embodiments. Key findings include the ability to capture complex actions and transfer changes in the environment across videos, and the development of a controller that maps known actions to latent ones, enabling similar performance in planning tasks as action-conditioned baselines.

这项工作旨在从真实世界的视频中学习潜在动作模型，扩展了对简单机器人模拟、视频游戏和操作数据的研究。动机是使智能体能够在真实环境中推理和规划，通过预测动作的后果而无需明确的动作标签。方法是学习连续但受限的潜在动作，以捕捉来自多样视频的动作复杂性，尽管存在环境噪声和缺乏共同的实体。关键发现包括能够在视频之间转移环境变化，并训练一个控制器将已知动作映射到潜在动作，从而在规划任务中获得与基于动作的基准相似的性能。

Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

Authors: James Rice

First: 2026-01-08T18:53:59+00:00 · Latest: 2026-01-08T18:53:59+00:00

Comments: 20 pages, 6330 words

Abstract

I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.
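
The latent dynamics described above can be illustrated with a minimal Euler-Maruyama simulation of a scalar Itô SDE, dz = f(z, t) dt + g(z, t) dW. This is only a sketch under stated assumptions: `toy_drift` and `toy_diffusion` are hypothetical stand-ins for the neural networks the abstract mentions, the encoder, decoder, and adjoint system are omitted, and nothing here is taken from the paper's implementation. The time grid is deliberately irregular to mirror the irregular-sampling setting.

```python
import math
import random


def euler_maruyama(z0, drift, diffusion, t_grid, rng):
    """Simulate one path of dz = drift(z, t) dt + diffusion(z, t) dW on t_grid."""
    path = [z0]
    z = z0
    for t_prev, t_next in zip(t_grid, t_grid[1:]):
        dt = t_next - t_prev                  # step sizes may differ
        dw = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment ~ N(0, dt)
        z = z + drift(z, t_prev) * dt + diffusion(z, t_prev) * dw
        path.append(z)
    return path


# Hypothetical stand-ins for the learned drift and diffusion networks.
def toy_drift(z, t):
    return -0.5 * z   # mean-reverting pull toward zero


def toy_diffusion(z, t):
    return 0.2        # constant noise scale


irregular_times = [0.0, 0.1, 0.25, 0.3, 0.7, 1.0]
latent_path = euler_maruyama(1.0, toy_drift, toy_diffusion,
                             irregular_times, random.Random(0))
```

Setting the diffusion to zero reduces the scheme to an explicit Euler step of the drift ODE, which is a convenient sanity check before adding noise.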

中文标题/摘要

标题：随机深度学习：结构化时序数据中不确定性建模的概率框架

我提出了一种新颖的框架，将随机微分方程（SDEs）与深度生成模型相结合，以提高涉及结构化和时序数据的机器学习应用中的不确定性量化。这种方法称为随机潜微分推理（SLDI），将伊藤SDE嵌入变分自编码器的潜空间中，允许灵活的连续时间不确定性建模，同时保留一个严谨的数学基础。SDE的漂移项和扩散项由神经网络参数化，使数据驱动的推理成为可能，并将经典的时间序列模型推广到处理不规则采样和复杂动态结构。一个核心理论贡献是伴随状态与专用神经网络的共参数化，形成一个耦合的前向-后向系统，不仅捕捉潜变量的演变，还捕捉梯度动力学。我引入了一条路径正则化伴随损失，并通过随机微积分的视角分析减小方差的梯度流，为改进深度潜SDE的训练稳定性提供了新的工具。我的论文统一并扩展了变分推理、连续时间生成建模和控制论优化，为随机概率机器学习的未来发展提供了严谨的基础。

Summary / 总结

The research proposes Stochastic Latent Differential Inference (SLDI), a framework that integrates stochastic differential equations (SDEs) with deep generative models to enhance uncertainty quantification in structured and temporal data. SLDI embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty. Key findings include the co-parameterization of the adjoint state with a neural network, which forms a coupled forward-backward system, and the introduction of a pathwise-regularized adjoint loss, which improves training stability in deep latent SDEs.

研究提出了一种名为Stochastic Latent Differential Inference (SLDI) 的框架，将随机微分方程 (SDEs) 与深度生成模型结合，以增强结构化和时间数据中的不确定性量化。SLDI 将伊藤 SDE 嵌入变分自编码器的潜在空间，允许灵活的连续时间不确定性建模。关键发现包括将伴随状态与神经网络共参数化，形成耦合的前向-后向系统，以及引入路径正则化伴随损失，以提高深度潜在 SDE 的训练稳定性。

Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome

Authors: Moamal Fadhil Abdul-Mahdi, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad

First: 2025-12-22T12:25:50+00:00 · Latest: 2026-01-08T18:45:51+00:00

Comments: Thomas Martini Jørgensen and Emil Hovad contributed equally and share last authorship

Abstract

Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.

中文标题/摘要

标题：扩展的OpenTT Games数据集：用于精细粒度击球类型和得分结果的乒乓球数据集

自动检测和分类乒乓球视频中的击球可以简化训练工作流程，丰富广播叠加内容，并实现精细粒度的性能分析。为此，需要带有标注的乒乓球视频数据。我们扩展了公共OpenTTGames数据集，增加了高度详细、帧准确的击球类型注释（正手、反手及其子类型）、球员姿势标签（身体倾斜和腿部站位）以及在得分结束时的回合结果标签。OpenTTGames是一组从桌子侧面录制的视频，带有官方标签，表示球的弹跳、球在网上的情况或击中网的情况。该数据集已经包含与事件相关的球坐标，这些事件在原始OpenTTGames数据集中是“弹跳”、“网”或“空事件”，并且带有语义掩码（人类、桌子、记分板）。我们的扩展为事件添加了击球类型，并为每个球员提供了分类体系，使模型能够从事件识别转向战术理解（例如，击球是否可能赢得得分或建立优势）。我们提供了一种紧凑的编码方案和代码辅助的标注程序，以支持可重复的标注和细粒度击球理解的基准。这填补了社区中的实际空白，因为许多先前的视频资源要么未公开发布，要么带有限制性/不明确的许可证，阻碍了重用和基准测试。我们的标注在与OpenTTGames相同的CC BY-NC-SA 4.0许可证下发布，允许免费非商业使用、修改和再分发，同时适当归因。

Summary / 总结

The research aims to enhance the OpenTTGames dataset for table tennis by adding detailed shot type annotations, player posture labels, and point outcome tags. The method involves extending the existing dataset with frame-accurate annotations and a compact coding scheme. Key findings include the provision of a comprehensive dataset that supports fine-grained stroke understanding and tactical analysis, which can be used for training, broadcast overlays, and performance analytics. This dataset fills a practical gap in the community by providing openly accessible and reusable annotations under a CC BY-NC-SA 4.0 license.

研究旨在通过添加详细的击球类型标注、球员姿势标签和回合结果标签来扩展OpenTTGames数据集。方法包括对现有数据集进行扩展，提供帧准确的标注和紧凑的编码方案。主要发现包括提供了一个全面的数据集，支持精细的击球理解和战术分析，可用于训练、转播叠加和表现分析。该数据集填补了社区中的实际空白，提供了在CC BY-NC-SA 4.0许可下开放、可重用和可再分发的标注。

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00

Abstract

We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.

中文标题/摘要

标题：MineNPC-Task：面向记忆意识Minecraft代理的任务套件

我们提出了MineNPC-Task，一种用户编写的基准测试和评估框架，用于测试开放世界Minecraft中的记忆意识、混合主动性LLM代理。该框架不依赖于合成提示，而是从与专家玩家的形成性及总结性共玩中获取任务，将这些任务规范化为具有显式先决条件和依赖结构的参数化模板，并配以在有限知识政策下的机器可验证验证器，该政策禁止使用世界外的捷径。该框架捕捉计划/行动/记忆事件，包括计划预览、目标澄清、记忆读写、先决条件检查以及修复尝试，并根据尝试的子任务总数报告结果，这些结果源自于世界内的证据。作为初步快照，我们使用GPT-4o实例化了该框架，并在8名经验丰富的玩家中评估了216个子任务。我们观察到代码执行、库存/工具处理、引用和导航中的反复出现的故障模式，以及通过混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价，同时指出了需要更强的记忆持久性以跨越任务。我们发布了完整的任务套件、验证器、日志和框架，以支持未来记忆意识实体代理的透明、可重复评估。

Summary / 总结

The research introduces MineNPC-Task, a benchmark for evaluating memory-aware LLM agents in Minecraft. Tasks are derived from expert co-play, normalized into templates, and paired with validators. The study evaluates 216 subtasks across 8 players using GPT-4o, revealing issues in code execution, inventory handling, referencing, and navigation, with mixed-initiative clarifications and memory aiding recovery. Participants positively rated interaction quality and usability but noted the need for better memory persistence.

研究引入了MineNPC-Task，一个用于评估记忆感知的LLM代理在Minecraft中的基准测试。任务来源于专家的游戏会话，并被结构化为具有明确条件的参数化模板。使用GPT-4o执行了8名玩家的216个子任务，揭示了代码执行、库存管理以及导航方面的问题。参与者对交互质量和界面易用性给予了积极评价，但也指出需要更强的记忆持久性。任务套件、验证器、日志和框架已公开发布，以支持透明的评估。

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

First: 2026-01-08T18:36:29+00:00 · Latest: 2026-01-08T18:36:29+00:00

Abstract

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
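
To make the idea of working in an invertible wavelet domain concrete, here is a small sketch with a one-level 1D Haar transform and its exact inverse, followed by a flow-matching training pair built on the wavelet coefficients. Everything specific here is an assumption for illustration: the abstract does not say which wavelet FlowLet uses or which probability path it follows, the real model operates on 3D volumes with age conditioning, and the straight-line path in `flow_matching_pair` is simply the common linear-interpolation form.

```python
def haar_forward(signal):
    """One-level Haar transform of an even-length list: (averages, details)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_inverse(averages, details):
    """Exact inverse of haar_forward."""
    signal = []
    for avg, det in zip(averages, details):
        signal.extend([avg + det, avg - det])
    return signal


def flow_matching_pair(noise, data, t):
    """Point on the straight path from noise to data at time t, and its velocity."""
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


voxels = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0]   # toy 1D stand-in for an MRI volume
low, high = haar_forward(voxels)
restored = haar_inverse(low, high)

# A network would be trained to predict `target_velocity` from (x_t, t).
x_t, target_velocity = flow_matching_pair([0.0] * len(low), low, 0.5)
```

Because the transform is inverted exactly, any error in a generated volume would come from the generative model rather than from a lossy decoder, which is the contrast with learned latent compression that the abstract draws.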

中文标题/摘要

标题：FlowLet：基于小波流匹配的条件3D脑MRI合成

脑磁共振成像（MRI）在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测（BAP），它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大量、多样且年龄平衡的数据集，而现有的3D MRI数据集在人口统计学上存在偏差，限制了公平性和泛化能力。获取新数据成本高且伦理限制多，推动了生成数据增强。当前的生成方法通常基于潜在扩散模型，在学习的低维潜在空间中操作以应对体积MRI数据的内存需求。然而，这些方法在推理时通常较慢，可能会由于潜在压缩引入伪影，并且很少根据年龄进行条件化，从而影响BAP性能。在本文中，我们提出FlowLet，一种基于3D小波域内可逆流匹配的条件生成框架，通过这种方式避免重建伪影并减少计算需求。实验表明，FlowLet能够通过少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可以提高未充分代表的年龄组的性能，区域分析证实了解剖结构的保留。

Summary / 总结

The research aims to address the limitations of existing 3D MRI datasets for Brain Age Prediction (BAP) by proposing FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs using flow matching in an invertible 3D wavelet domain. This method avoids reconstruction artifacts and reduces computational demands compared to latent diffusion models. Experiments demonstrate that FlowLet generates high-fidelity volumes with few sampling steps, and training BAP models with FlowLet-generated data improves performance for underrepresented age groups.

研究旨在通过提出FlowLet，一种基于3D小波域流匹配的条件生成框架，解决现有3D MRI数据集在脑年龄预测（BAP）中的局限性。该方法避免了重建伪影并减少了计算需求，相比潜在扩散模型。实验表明，FlowLet生成的高保真体积只需少量采样步骤，并且使用FlowLet生成的数据训练BAP模型可以提高对未充分代表的年龄组的性能。

MoE3D: A Mixture-of-Experts Module for 3D Reconstruction

Authors: Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park

First: 2026-01-08T18:33:52+00:00 · Latest: 2026-01-08T18:33:52+00:00

Abstract

MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts of existing feed-forward 3D reconstruction models. MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting. When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead.
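
The fusion step can be sketched as a per-pixel weighted sum of candidate depths. The abstract only says "dynamic weighting", so the softmax gate below is an assumption, and `fuse_depth`, `candidates`, and `gate_logits` are hypothetical names rather than the authors' code. The toy data shows why weighting matters at a depth edge: averaging a foreground and a background proposal yields a point floating between the two surfaces, while a confident gate stays on one of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_depth(candidates, gate_logits):
    """Per-pixel weighted sum of candidate depths.

    candidates[k][i]  : depth proposed by expert k at pixel i
    gate_logits[k][i] : unnormalized gate score for expert k at pixel i
    """
    num_experts = len(candidates)
    num_pixels = len(candidates[0])
    fused = []
    for i in range(num_pixels):
        weights = softmax([gate_logits[k][i] for k in range(num_experts)])
        fused.append(sum(weights[k] * candidates[k][i] for k in range(num_experts)))
    return fused


# Two experts at two edge pixels: foreground at depth 1.0, background at 5.0.
candidates = [[1.0, 1.0], [5.0, 5.0]]
gate_logits = [[0.0, 10.0], [0.0, -10.0]]   # pixel 0: undecided, pixel 1: confident
fused = fuse_depth(candidates, gate_logits)
```

With equal gate scores the first pixel lands at depth 3.0, halfway between the surfaces; with a confident gate the second pixel stays very close to the foreground depth of 1.0.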

中文标题/摘要

标题：MoE3D：一种用于3D重建的混合专家模块

MoE3D 是一种混合专家模块，旨在细化深度边界并减轻现有前馈3D重建模型中的漂浮点伪影。MoE3D 预测多个候选深度图，并通过动态加权将其融合。当与预训练的3D重建主干网络（如 VGGT）结合使用时，它能显著提高重建质量，同时几乎不增加额外的计算开销。

Summary / 总结

MoE3D is a mixture-of-experts module aimed at improving the sharpness of depth boundaries and reducing flying-point artifacts in 3D reconstruction. It predicts multiple depth maps and fuses them using dynamic weighting. When combined with a pre-trained 3D reconstruction model like VGGT, it significantly enhances reconstruction quality with little extra computational cost.

MoE3D 是一种混合专家模块，旨在改善 3D 重建中的深度边界并减少漂浮点伪影。它预测多个候选深度图并通过动态加权融合它们。与预训练的 3D 重建模型如 VGGT 结合使用时，它能显著提高重建质量，同时几乎没有额外的计算开销。

EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI

Authors: Zain Iqbal, Lorenzo Valerio

First: 2026-01-08T18:31:11+00:00 · Latest: 2026-01-08T18:31:11+00:00

Comments: 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]

Abstract

Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.

中文标题/摘要

标题：EARL：面向普适AI的液态状态机能量感知优化

普适AI越来越多地依赖于能够在严格资源限制下提供低延迟和高效计算的设备端学习系统。液态状态机（LSMs）为普适和神经形态系统中的低功耗时序处理提供了有前景的方法，但由于高超参数敏感性和传统优化方法忽视能量约束导致的计算成本，其部署仍然具有挑战性。本文提出了一种名为EARL的能量感知强化学习框架，该框架结合了贝叶斯优化和自适应强化学习选择策略，以联合优化准确性和能耗。EARL利用代理建模进行全局探索，强化学习进行动态候选优先级排序，并采用早期终止机制消除冗余评估，大幅减少了计算开销。在三个基准数据集上的实验表明，与领先的超参数调优框架相比，EARL在准确率上提高了6%到15%，能耗降低了60%到80%，优化时间减少了10倍。这些结果突显了能量感知自适应搜索在提高资源受限设备端AI应用中LSMs的效率和可扩展性方面的有效性。

Summary / 总结

EARL is an energy-aware optimization framework for Liquid State Machines (LSMs) that uses Bayesian optimization and reinforcement learning to jointly optimize accuracy and energy consumption. It reduces computational overhead with surrogate modeling, dynamic candidate prioritization, and early termination. Experiments show EARL achieves higher accuracy, lower energy consumption, and faster optimization times compared to existing methods on three benchmark datasets.

EARL 是一种结合贝叶斯优化和强化学习的能源感知优化框架，用于液态状态机（LSMs），旨在同时优化准确性和能耗。通过使用代理建模、动态候选优先级和早期终止机制来减少计算开销。实验结果显示，EARL 在准确性和能耗方面表现更优，并且优化时间显著缩短，优于现有超参数调优框架。

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00

Abstract

Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

中文标题/摘要

标题：视觉语言模型中提示诱发幻觉的机制

大型视觉语言模型（VLMs）功能强大，但经常因偏好文本提示而忽视视觉证据，从而产生幻觉。我们在一个受控的物体计数设置中研究了这种失败模式，其中提示夸大了图像中的物体数量（例如，要求模型描述四朵睡莲，而实际上只有三朵）。在物体数量较少时，模型通常会纠正这种夸大，但随着物体数量的增加，它们越来越倾向于遵循提示，无视差异。通过对三种VLMs的机制分析，我们发现一小组注意力头的消除可以显著减少提示诱发幻觉（PIH），至少降低40%且无需额外训练。在不同模型中，PIH头以特定方式介导提示复制。我们描述了这些差异，并表明PIH消除增加了对视觉证据的纠正。我们的研究提供了关于提示诱发幻觉内部机制的见解，揭示了这些行为在不同模型中的特定实现差异。

Summary / 总结

This study investigates the mechanism of prompt-induced hallucination in vision-language models (VLMs) by examining their object-counting performance under prompts that overstate the number of objects. As the number of objects increases, models tend to conform to the prompt rather than visual evidence. By analyzing three VLMs, the researchers identified specific attention heads that, when removed, significantly reduced prompt-induced hallucinations by at least 40% without additional training. These findings provide insights into the internal mechanisms of VLMs and highlight model-specific differences in how prompt-induced hallucinations are implemented.

研究探讨了大型视觉语言模型（VLMs）如何基于文本提示而非视觉证据产生幻觉。通过操控图像中的物体数量，研究人员发现模型倾向于纠正最初的过估计，但随着物体数量的增加，它们越来越倾向于遵循提示。移除特定的注意力头可以至少减少40%的提示诱导幻觉，揭示了模型在提示复制方面的特定机制，并强调了这些头在使模型输出与视觉证据一致方面的重要性。

An interpretable data-driven approach to optimizing clinical fall risk assessment

Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi

First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00

Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714

Abstract

In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score orders encounters more consistently by the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
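
The comparison reported above rests on two simple pieces: an additive score (a weighted sum of item values) and AUC-ROC, the probability that a randomly chosen high-risk encounter outscores a randomly chosen low-risk one. The sketch below computes both on fully synthetic data. The item values, weights, and labels are invented for illustration and are not JHFRAT items or study data, and the reweighting is picked by hand rather than by the constrained optimization the abstract describes.

```python
def additive_score(item_values, weights):
    """Weighted sum of item values, the form an additive risk tool uses."""
    return sum(w * v for w, v in zip(weights, item_values))


def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Synthetic encounters: three binary items each; label 1 = high risk.
encounters = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 1, 0]]
labels = [1, 0, 1, 0]

original_weights = [1, 1, 1]     # hypothetical starting weights
reweighted = [3, 1, 1]           # hypothetical hand-picked reweighting

auc_before = auc_roc([additive_score(e, original_weights) for e in encounters], labels)
auc_after = auc_roc([additive_score(e, reweighted) for e in encounters], labels)
```

Changing only the weights keeps the score additive, so the form of the tool stays the same while the ordering of encounters changes.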

中文标题/摘要

标题：一种可解释的数据驱动方法以优化临床跌倒风险评估

在本研究中，我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具（JHFRAT）的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件，13,941例被纳入低跌倒风险事件。为了整合临床知识并保持可解释性，我们采用了约束评分优化（CSO）模型重新加权JHFRAT评分权重，同时保持其加性结构和临床阈值。重新校准是指调整项目权重，使所得评分能够更一致地按研究的风险标签对事件进行排序，而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT（CSO AUC-ROC=0.91，JHFRAT AUC-ROC=0.86）。这种性能改进相当于每周在约翰霍普金斯健康系统中额外保护35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型（XGBoost）在知识驱动的约束逻辑回归的基础上提高了性能指标（AUC-ROC=0.94），但CSO在风险标签变化方面表现出更强的稳健性。这种基于证据的方法为医疗机构系统地增强住院跌倒预防协议和患者安全提供了坚实的基础，利用数据驱动优化技术，有助于改善风险评估和资源分配。

Summary / 总结

This study aims to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating clinical knowledge and maintaining interpretability through constrained score optimization (CSO) models. A retrospective cohort analysis of 54,209 inpatient admissions showed that the CSO model improved predictive performance (AUC-ROC=0.91) over the original JHFRAT (AUC-ROC=0.86), which the authors translate to protecting an additional 35 high-risk patients per week. The CSO models performed similarly with and without electronic health record (EHR) variables. A black-box XGBoost benchmark scored higher (AUC-ROC=0.94), but CSO was more robust to variations in risk labeling. This approach provides a robust foundation for enhancing inpatient fall prevention protocols and patient safety in healthcare settings.

本研究旨在通过使用约束评分优化（CSO）模型重新加权约翰霍普金斯跌倒风险评估工具（JHFRAT）的评分权重，同时保持其临床阈值，以提高其预测性能。对三家约翰霍普金斯医院的54,209名住院患者的回顾性队列分析显示，CSO模型的预测性能显著提高（AUC-ROC=0.91），相较于原始JHFRAT（AUC-ROC=0.86），每周额外保护了35名高风险患者。CSO模型在有和没有电子健康记录（EHR）变量的情况下表现相似，并且在风险标签变化时表现出更强的鲁棒性，优于基准的黑盒模型（XGBoost）。

SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning

Authors: Yanchang Liang, Xiaowei Zhao

First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00

Abstract

Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.

中文标题/摘要

标题：SimuAgent：基于LLM的Simulink建模助手，增强以强化学习

大型语言模型（LLMs）已经革新了基于文本的代码自动化，但在图形导向的工程工作流中的潜力尚未得到充分探索。我们介绍了SimuAgent，这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格Python表示法取代了冗长的XML，大幅减少了标记数量，提高了可解释性，并使仿真变得快速且在进程内进行。一种轻量级的计划-执行架构，经过两阶段训练，使代理具备了低级工具技能和高级设计推理能力。为应对长期任务中的稀疏奖励，我们提出了反思-GRPO（ReGRPO），它通过自我反思轨迹补充了组相对策略优化（GRPO），提供了丰富的中间反馈，加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试上进行的实验表明，使用SimuAgent微调的Qwen2.5-7B模型比标准的强化学习基线收敛更快，建模精度更高，甚至在使用少量示例提示在相同基准测试上评估时，超过了GPT-4o。消融实验表明，两阶段课程和抽象重建数据增强进一步增强了泛化能力。SimuAgent完全在本地进行训练和运行，硬件要求较低，提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent在LLMs和图形建模环境之间架起了一座桥梁，为工业环境中的AI辅助工程设计提供了一个实用的解决方案。

Summary / 总结

SimuAgent is an LLM-powered agent designed for Simulink modeling, using a concise Python representation to enhance interpretability and simulation speed. It employs a two-stage training process and Reflection-GRPO to improve performance on long-horizon tasks. Experiments on SimuBench show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and surpasses few-shot-prompted GPT-4o on the same benchmark. The system runs on-premise on modest hardware, making it cost-effective and privacy-preserving for industrial model-driven engineering.

SimuAgent 是一个基于大语言模型的代理，专为 Simulink 设计，旨在增强图形化工程工作流。它采用轻量级计划-执行架构和两阶段训练过程，结合低级工具技能和高级设计推理。为解决长期任务中的稀疏奖励问题，SimuAgent 引入了 Reflection-GRPO，提高了收敛性和鲁棒性。实验表明，使用 Qwen2.5-7B 模型微调的 SimuAgent 比标准 RL 基线更快收敛且建模精度更高，甚至在相同的基准测试中超越了 GPT-4o。SimuAgent 具有成本效益且保护隐私，适用于工业模型驱动工程。

Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop

Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu

First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00

Abstract

The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs, which may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of the Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
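
A reward-based rejection sampling step can be sketched as a filter that sits between generation and the next round of training: synthetic samples are scored, and only those whose reward clears a threshold are added to the training set. The abstract does not specify the reward or the acceptance rule, so the threshold rule, the `skew` field, and `toy_reward` below are all hypothetical placeholders rather than the authors' design.

```python
def reward_rejection_sample(candidates, reward_fn, threshold):
    """Split candidates into (accepted, rejected) by a reward threshold."""
    accepted, rejected = [], []
    for sample in candidates:
        if reward_fn(sample) >= threshold:
            accepted.append(sample)
        else:
            rejected.append(sample)
    return accepted, rejected


def build_next_round(real_data, synthetic, reward_fn, threshold):
    """Training set for the next loop iteration: real data plus accepted synthetic data."""
    accepted, _ = reward_rejection_sample(synthetic, reward_fn, threshold)
    return list(real_data) + accepted


# Hypothetical reward: each sample carries a precomputed "skew" in [-1, 1];
# samples closer to 0 (less one-sided) score higher.
def toy_reward(sample):
    return 1.0 - abs(sample["skew"])


synthetic = [
    {"text": "a", "skew": 0.9},
    {"text": "b", "skew": -0.1},
    {"text": "c", "skew": 0.3},
]
real = [{"text": "real", "skew": 0.0}]
next_round = build_next_round(real, synthetic, toy_reward, threshold=0.6)
```

In a retraining loop this filter would run once per iteration, so the threshold controls how much of the model's own output is fed back into it.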

中文标题/摘要

标题：大型语言模型偏见的观察与补救措施：自我消耗执行循环

大型语言模型（LLMs）的迅速发展引发了对使用合成数据进行未来模型训练的兴趣。然而，这导致了一个自我消耗的重新训练循环，模型在训练过程中使用自己的输出，可能导致性能下降并引发新的偏见。在实际应用中，之前部署的LLMs可能会影响它们生成的数据，形成一个由用户反馈驱动的动态系统。例如，如果模型持续未能满足某一用户群体的需求，那么来自该特定用户群体的数据收集量将会减少。在本研究中，我们引入了“自我消耗执行循环”（SCPL）的概念，并探讨合成数据在这些动态迭代训练过程中如何塑造偏见的作用。这种受控的设置是由于难以获取动态生产系统中的真实用户偏好数据，使我们能够以一种原则性的方式隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环，包括典型的重新训练设置和增量微调设置，后者尚未得到充分探索。通过三项实际任务的实验，我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见，朝着更值得信赖的自我改进系统迈进。

Summary / 总结

This study addresses the issue of bias in large language models (LLMs) that arises from self-consuming performative loops, where models are trained on their own outputs. The research introduces the concept of the Self-Consuming Performative Loop (SCPL) and investigates how synthetic data can shape bias during iterative training processes. Experiments on three real-world tasks show that the performative loop increases preference bias and decreases disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate these biases, aiming to enhance the trustworthiness of self-improving systems.

该研究探讨了由于模型自我消费循环导致的大语言模型（LLMs）中的偏见问题，即模型在其自身输出上进行训练。研究引入了自我消费执行循环（SCPL）的概念，并调查了合成数据如何在迭代训练过程中影响偏见。实验证实在三个实际任务上，执行循环增加了偏好偏见并减少了差异偏见。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏见，旨在提高自我改进系统的可信度。

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong

First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00

Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
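
The inference-time behaviour described above (answer first, reason only when the initial answer is not confident enough) can be sketched as a simple gate. The function name, the `threshold` value, and the `reason_fn` callback are assumptions made for illustration; the abstract does not state how the confidence score is computed or thresholded.

```python
from typing import Callable, Tuple


def answer_with_optional_reasoning(
    initial: Tuple[str, float],
    reason_fn: Callable[[], str],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> Tuple[str, bool]:
    """Return (answer, used_reasoning) for one query."""
    answer, confidence = initial
    if confidence >= threshold:
        # Confident initial answer: skip the costly reasoning pass.
        return answer, False
    # Low confidence: run reasoning and return the reviewed answer.
    return reason_fn(), True


easy = answer_with_optional_reasoning(("B", 0.97), lambda: "C")
hard = answer_with_optional_reasoning(("B", 0.40), lambda: "C")
print(easy, hard)
```

Under this kind of gate, perception-oriented queries with confident initial answers would rarely trigger reasoning, which is consistent with the activation pattern the abstract reports.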

中文标题/摘要

标题：VideoAuto-R1：通过一次思考，两次回答进行视频自动推理

链式思考（CoT）推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而，其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中，我们首先证明，对于通过强化学习训练的视频模型，直接回答往往能够匹配甚至超越CoT的性能，尽管CoT以更高的计算成本生成逐步分析。受此启发，我们提出了一种VideoAuto-R1视频理解框架，采用一种必要时推理的策略。在训练过程中，我们的方法遵循一次思考，两次回答的模式：模型首先生成一个初始答案，然后进行推理，最后输出一个审查后的答案。两个答案都通过可验证的奖励进行监督。在推理过程中，模型使用初始答案的置信度分数来决定是否进行推理。在视频问答和定位基准测试中，VideoAuto-R1实现了最先进的准确率，显著提高了效率，平均响应长度减少了约3.3倍，例如，从149个词减少到仅44个词。此外，我们观察到，在感知导向的任务中，推理模式激活的频率较低，而在推理密集型任务中，这一频率较高。这表明显式的基于语言的推理通常是有益的，但并非总是必要的。

Summary / 总结

The paper examines whether chain-of-thought (CoT) reasoning is necessary for video understanding and introduces VideoAuto-R1, a framework that reasons only when necessary. During training, VideoAuto-R1 follows a Thinking Once, Answering Twice paradigm: it generates an initial answer, performs reasoning, and then outputs a reviewed answer, with both answers supervised by verifiable rewards. At inference, the confidence of the initial answer decides whether reasoning is run. The approach achieves state-of-the-art accuracy on video QA and grounding benchmarks while reducing average response length by about 3.3x, and it activates reasoning more often on reasoning-intensive tasks than on perception-oriented ones.

本文探讨了链式思考（CoT）推理在视频理解任务中的必要性，并提出了一种仅在必要时推理的框架VideoAuto-R1。在训练过程中，模型生成初始答案，进行推理并输出审查后的答案，由可验证的奖励监督。在推理时，它根据初始答案的置信度决定是否进行推理。VideoAuto-R1实现了最先进的准确率，同时显著减少了响应长度，且在推理密集型任务中比感知导向型任务更常进行推理。

FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts

Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu

Venue: KDD 2026

First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00

Comments: Accepted to KDD 2026

Abstract

Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
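
The horizon quoted in the abstract follows from simple arithmetic: one week at one step per 15 minutes is 7 x 24 x 4 = 672 steps. The helper below is a small illustrative check; `horizon_steps` is a hypothetical name and not part of the FaST code base.

```python
def horizon_steps(days: int, minutes_per_step: int) -> int:
    """Number of forecast steps needed to cover `days` at the given granularity."""
    total_minutes = days * 24 * 60
    if total_minutes % minutes_per_step != 0:
        raise ValueError("granularity must divide the horizon evenly")
    return total_minutes // minutes_per_step


steps = horizon_steps(days=7, minutes_per_step=15)
print(steps)
```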

中文标题/摘要

标题：FaST：基于专家混合的大型时空图长时预测高效框架

大型网络上的时空图（STG）预测引起了广泛关注。然而，现有模型主要关注短期预测，并在扩展到长期预测和大型图时遭受严重的计算成本和内存消耗问题。为应对上述挑战，我们提出了FaST，一种基于异质性感知专家混合（MoEs）的高效且有效的框架，用于长时和大规模STG预测，该框架在数千个节点的情况下实现了提前一周（以15分钟粒度计算的672步）的预测。FaST基于两项关键创新。首先，提出了一种自适应图代理注意力机制，以缓解在大型图上应用传统图卷积和自我注意力模块时固有的计算负担。其次，我们提出了一种新的并行MoE模块，用门控线性单元（GLUs）取代传统的前馈网络，使结构更加高效和可扩展。在真实世界数据集上的广泛实验表明，FaST不仅在长期预测准确性上表现出色，而且在计算效率上也显著优于最先进的基线。我们的源代码可在 https://github.com/yijizhao/FaST 获取。

Summary / 总结

FaST is a framework designed for long-horizon forecasting on large-scale spatial-temporal graphs, addressing the computational and memory challenges of existing models. It introduces an adaptive graph agent attention mechanism and a parallel Mixture-of-Experts module with Gated Linear Units to enhance efficiency. Experiments show FaST outperforms state-of-the-art methods in both accuracy and computational efficiency for one-week-ahead predictions on large graphs.

FaST 是一种用于大规模时空图长时预测的框架，通过引入适应性图代理注意力机制和带有门控线性单元的并行混合专家模块来解决现有模型的计算挑战。FaST 实现了一周（672 步，每 15 分钟一步）的预测，并且在准确性和效率上都优于最先进的基线方法。

CoV: Chain-of-View Prompting for Spatial Reasoning

Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00

Abstract

Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
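
The fine-grained stage described above (alternate reasoning with camera actions until enough context is gathered or the step budget runs out) can be sketched as a short control loop. All callbacks here (`propose_action`, `apply_action`, `has_enough_context`) are hypothetical placeholders for the VLM and the 3D scene renderer; views are plain integers in the toy run.

```python
from typing import Callable, List, TypeVar

View = TypeVar("View")


def chain_of_view(
    anchor_views: List[View],
    propose_action: Callable[[List[View]], str],
    apply_action: Callable[[View, str], View],
    has_enough_context: Callable[[List[View]], bool],
    max_steps: int,
) -> List[View]:
    """Collect views until the context suffices or the step budget is spent."""
    views = list(anchor_views)  # start from the question-aligned anchor views
    for _ in range(max_steps):
        if has_enough_context(views):
            break
        action = propose_action(views)                  # e.g. a discrete camera move
        views.append(apply_action(views[-1], action))   # render the new observation
    return views


views = chain_of_view(
    anchor_views=[0],
    propose_action=lambda vs: "turn_left",
    apply_action=lambda view, action: view + 1,
    has_enough_context=lambda vs: len(vs) >= 3,
    max_steps=5,
)
print(views)
```

The sketch only covers the stopping rule; the coarse View Selection stage is assumed to have produced `anchor_views` already.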

中文标题/摘要

标题：CoV：空间推理的链式视角提示

在3D环境中的具身问答（EQA）通常需要收集分布在多个视角且部分被遮挡的上下文。然而，大多数最近的视觉-语言模型（VLMs）仅限于固定且有限的输入视角集，这限制了它们在推理时获取与问题相关上下文的能力，并阻碍了复杂的空间推理。我们提出了一种名为Chain-of-View（CoV）的提示方法，这是一种无需训练、在测试时进行推理的框架，通过从粗到细的探索过程将VLM转换为主动的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图，然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整，从底层3D场景表示中获取新观察，直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估，跨四个主流VLMs获得了平均+11.56%的LLM-Match改进，最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性：增加最小动作预算可额外获得平均+2.51%的改进，峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上，CoV表现出强大的性能（例如，ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1）。总体而言，这些结果表明，与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略，无需额外训练。

Summary / 总结

The paper proposes Chain-of-View (CoV) prompting to enhance spatial reasoning in embodied question answering (EQA) by allowing models to explore multiple viewpoints. CoV uses a View Selection agent to filter redundant frames and select relevant views, followed by fine-grained view adjustment through iterative reasoning and camera actions. The method improves performance across various VLMs, achieving an average +11.56% improvement in LLM-Match and up to +13.62% on Qwen3-VL-Flash. It also shows test-time scaling, with additional improvements as the action budget increases.

研究旨在通过解决固定视角视觉-语言模型（VLMs）的限制，提升3D环境中的具身问答（EQA）能力。提出的Chain-of-View（CoV）提示方法使VLMs能够主动探索并收集多个视角下的上下文，从而提高空间推理能力。在OpenEQA上的实验显示，LLM-Match平均提升+11.56%，在Qwen3-VL-Flash上最高达到+13.62%。CoV还展示了测试时的可扩展性：随着最小动作预算的增加，性能进一步提升。

RelayLLM: Efficient Reasoning via Collaborative Decoding

Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00

Abstract

Complex reasoning with Large Language Models (LLMs) is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
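
Token-level relaying, as described above, can be sketched as a decoding loop in which the small model emits a command token whenever it wants the large model to supply the next token. The command string `<call>`, the one-token hand-off, and the toy step functions are assumptions for illustration; the abstract does not specify the command format or how many tokens the LLM generates per call.

```python
from typing import Callable, List, Tuple

CALL = "<call>"  # hypothetical command token emitted by the SLM
EOS = "<eos>"


def relay_decode(
    slm_step: Callable[[List[str]], str],
    llm_step: Callable[[List[str]], str],
    max_tokens: int = 32,
) -> Tuple[List[str], int]:
    """Decode with the SLM, handing single tokens to the LLM on request."""
    out: List[str] = []
    llm_tokens = 0
    while len(out) < max_tokens:
        tok = slm_step(out)
        if tok == CALL:
            tok = llm_step(out)  # the LLM supplies this one token
            llm_tokens += 1
        if tok == EOS:
            break
        out.append(tok)
    return out, llm_tokens


def toy_slm(prefix: List[str]) -> str:
    # Scripted SLM: asks for help on the third token, then stops.
    return ["The", "answer", CALL, EOS][len(prefix)]


def toy_llm(prefix: List[str]) -> str:
    return "42"


tokens, calls = relay_decode(toy_slm, toy_llm)
print(tokens, calls)
```

The ratio `calls / len(tokens)` is the quantity the abstract reports as 1.07% on average; in the toy run it is one token out of three.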

中文标题/摘要

标题：RelayLLM：通过协作解码实现高效推理

大型语言模型（LLMs）在复杂推理方面往往受到高计算成本和延迟的限制，而资源高效的小型语言模型（SLMs）通常缺乏必要的推理能力。现有的协作方法，如级联或路由，以粗粒度的方式运行，将整个查询卸载到LLMs上，当SLM能够处理大多数推理步骤时，这会导致显著的计算浪费。为了解决这个问题，我们提出了RelayLLM，这是一种通过标记级协作解码实现高效推理的新框架。与路由器不同，RelayLLM赋予SLM作为主动控制器的能力，动态地仅在关键标记上调用LLM，通过特殊命令有效地“传递”生成过程。我们引入了一种两阶段训练框架，包括预热和组相对策略优化（GRPO），以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明，RelayLLM实现了49.52%的平均准确率，有效地弥合了两种模型之间的性能差距。值得注意的是，这仅通过调用LLM的1.07%的总生成标记实现，与性能匹配的随机路由器相比，成本降低了98.2%。

Summary / 总结

RelayLLM is a framework that enables efficient reasoning through token-level collaborative decoding between Small Language Models (SLMs) and Large Language Models (LLMs). It allows the SLM to dynamically invoke the LLM only for critical tokens, reducing computational waste. The framework uses a two-stage training process (warm-up followed by GRPO) to balance independence and strategic help-seeking. Experiments on six benchmarks show that RelayLLM achieves an average accuracy of 49.52% while invoking the LLM for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.

RelayLLM 是一种框架，通过 SLM 和 LLM 之间的 token 级别协作解码实现高效推理。与现有的粗粒度协作方法不同，RelayLLM 允许 SLM 动态调用 LLM 只处理关键 token，减少计算浪费。该框架采用两阶段训练过程来平衡独立性和战略性求助。实验结果表明，RelayLLM 在六项基准测试中实现了 49.52% 的准确率，仅调用 LLM 处理 1.07% 的 token，相比随机路由器实现了 98.2% 的成本降低。

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00

Comments: The project is available at https://charlescsyyy.github.io/MVT

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

中文标题/摘要

标题：MVT：基于掩码的视觉-语言模型在分类学对齐的土地覆盖标记中的应用

遥感中的土地覆盖理解越来越需要跨数据集泛化的类无差别系统，同时保持空间精确性和可解释性。我们研究了在领域迁移下的几何优先发现与解释设置，候选区域以类无差别方式划定，监督避免使用类名的明文标识符。除了开放集识别和开放世界学习，我们专注于将类无差别掩码证据与分类学导向的场景解释相结合，而不是未知拒绝或持续类扩展。我们提出了MVT，一个三阶段框架，(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码，(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标记和场景描述生成，(iii) 使用LLM作为裁判评分，通过分层专家评分校准输出评估。在跨数据集分割迁移（在OpenEarthMap上训练，在LoveDA上评估）中，领域适应的SAM2提高了掩码质量；同时，双步骤多模态LLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。

Summary / 总结

The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that can generalize across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by expert ratings. The key findings show that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative scene descriptions on cross-dataset segmentation transfer.

研究旨在开发在遥感中进行土地覆盖理解的类-无感知系统，使其能够在不同数据集之间泛化，同时保持空间精度和可解释性。方法包括三个阶段：(i) 使用SAM2进行域适应以提取边界忠实的区域掩码，(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成，(iii) 使用LLM作为裁判评分并根据分层专家评级进行校准。关键发现包括通过域适应提高掩码质量，以及通过双步骤LLM微调获得更准确的分类学对齐标签和更具信息量的掩码导向场景描述。

Improving and Evaluating Open Deep Research Agents

Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof

First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00

Comments: 8 pages, 2 figures, 2 tables

Abstract

We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally-tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.

中文标题/摘要

标题：改进和评估开放深度研究代理

我们在这里关注深度研究代理（DRAs），这是一种可以从用户那里接收自然语言提示，并自主搜索和利用互联网内容来回应提示的系统。最近的DRAs在公共基准测试上展示了令人印象深刻的性能，然而，最近的研究主要涉及专有的闭源系统。在本研究进行时，我们仅发现一个开源的DRA，称为Open Deep Research（ODR）。在本工作中，我们改编了具有挑战性的最新BrowseComp基准测试，用于比较ODR与现有专有系统。我们提出了BrowseComp-Small（BC-Small），它由BrowseComp的一个子集组成，是一个计算开销更小、适用于学术实验室的DRA基准测试。我们在BC-Small上对ODR和两个专有系统进行了基准测试：来自Anthropic的一个系统和来自Google的一个系统。我们发现，这三个系统在包含60个问题的测试集上的准确率均为0%。我们对ODR进行了三项战略改进，产生了ODR+模型，该模型在BC-Small上实现了10%的成功率，在闭源和开源系统中均处于最先进的水平。我们报告了消融研究，表明我们的三项改进都对ODR+的成功做出了贡献。

Summary / 总结

This study evaluates Deep Research Agents (DRAs) by adapting the BrowseComp benchmark into a smaller subset, BC-Small, and comparing ODR, an open-source DRA, with two proprietary systems. All three systems score 0% on the 60-question BC-Small test set. Three strategic improvements to ODR yield the ODR+ model, which reaches a 10% success rate on BC-Small, the best result reported among both open-source and closed-source systems; ablations indicate that all three improvements contribute.

该研究关注于能够根据自然语言提示自主搜索和利用互联网内容的Deep Research Agents (DRAs)。作者将BrowseComp基准适应后用于评估开源DRA ODR及其与现有封闭源系统的性能。基准测试后发现，所有系统表现不佳。随后，他们对ODR进行了三项战略改进，形成了ODR+模型，该模型在基准测试中实现了10%的成功率，是开源和封闭源系统中的最高水平。

Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu

First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00

Abstract

Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

中文标题/摘要

标题：视觉-语言内省：通过可解释的双因归向引导减轻MLLM中的过度自信幻觉

物体幻觉严重削弱了多模态大型语言模型的可靠性，通常源于认知内省的根本失败，模型盲目信任语言先验而非具体的视觉证据。现有缓解措施仍有限：对比解码方法仅表面操作而不纠正内部语义错位，而当前的潜在引导方法依赖于静态向量，缺乏实例特定的精确性。我们引入了视觉-语言内省（VLI），这是一种无需训练的推理框架，模拟了元认知的自我纠正过程。VLI 首先进行属性内省，通过概率冲突检测诊断幻觉风险并定位因果视觉锚点。然后使用可解释的双因归向引导主动调节推理过程，动态隔离视觉证据与背景噪声，通过适应性校准消除盲目的自信。VLI 在先进模型上实现了最先进的性能，在MMHal-Bench 上将物体幻觉率降低了12.67%，在POPE 上提高了5.8%的准确性。

Forking-Sequences

Authors: Willa Potosnak, Malcolm Wolff, Mengfei Cao, Ruijun Ma, Tatiana Konstantinova, Dmitry Efimov, Michael W. Mahoney, Boris Oreshkin, Kin G. Olivares

Venue: NeurIPS 2025

First: 2025-10-06T04:51:06+00:00 · Latest: 2026-01-08T17:43:12+00:00

Comments: Presented at the GPU-Accelerated and Scalable Optimization (ScaleOpt) Workshop, NeurIPS 2025

Abstract

While accuracy is a critical requirement for time series forecasting, an equally important desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, disrupting downstream decision-making. To improve forecast stability of such revisions, several state-of-the-art models including MQCNN, MQT, and SPADE employ a powerful yet underexplored neural network architectural design known as forking-sequences. This architectural design jointly encodes and decodes the entire time series across all FCDs, producing an entire multi-horizon forecast grid in a single forward pass. This approach contrasts with conventional neural forecasting methods that process FCDs independently, generating only a single multi-horizon forecast per forward pass. In this work, we formalize the forking-sequences design and motivate its broader adoption by introducing a metric for quantifying excess volatility in forecast revisions and by providing theoretical and empirical analysis. We theoretically motivate three key benefits of forking-sequences: (i) increased forecast stability through ensembling; (ii) gradient variance reduction, leading to more stable and consistent training steps; and (iii) improved computational efficiency during inference. We validate the benefits of forking-sequences compared to baseline window-sampling on the M-series benchmark, using 16 datasets from the M1, M3, M4, and Tourism competitions. We observe median accuracy improvements across datasets of 29.7%, 46.2%, 49.3%, 28.6%, 24.7%, and 6.4% for MLP, RNN, LSTM, CNN, Transformer, and StateSpace-based architectures, respectively. We then show that forecast ensembling during inference can improve median forecast stability by 10.8%, 13.2%, 13.0%, 10.9%, 10.2%, and 11.2% for these respective models trained with forking-sequences, while maintaining accuracy.
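
One way to read the ensembling benefit is that a forking-sequences model produces a grid of forecasts, so the same target time is predicted from several forecast creation dates and those predictions can be averaged. The sketch below assumes a particular grid layout (`grid[s][h]` is the forecast made at FCD `s` for time `s + h + 1`) and a plain mean; the paper's actual ensembling rule and stability metric are not given in the abstract.

```python
from typing import List


def ensemble_forecast(grid: List[List[float]], fcd: int, target: int) -> float:
    """Average every forecast for `target` made at FCDs up to and including `fcd`.

    Assumed layout: grid[s][h] is the forecast created at FCD s for time s + h + 1.
    """
    horizon = len(grid[0])
    values = [
        grid[s][target - s - 1]
        for s in range(fcd + 1)
        if 0 <= target - s - 1 < horizon
    ]
    if not values:
        raise ValueError("no forecast in the grid covers the target time")
    return sum(values) / len(values)


# Toy grid: 3 FCDs, 3-step horizon.
grid = [
    [10.0, 11.0, 12.0],  # FCD 0 forecasts times 1, 2, 3
    [10.5, 12.5, 13.0],  # FCD 1 forecasts times 2, 3, 4
    [13.0, 14.0, 15.0],  # FCD 2 forecasts times 3, 4, 5
]
smoothed = ensemble_forecast(grid, fcd=2, target=3)
print(smoothed)
```

Here time 3 is forecast as 12.0, 12.5, and 13.0 from the three FCDs, so the ensembled value moves less between creation dates than the latest single forecast does.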

中文标题/摘要

标题：分叉序列

时间序列预测的准确性是一个关键要求，但同样重要的是预测在不同预测创建日期（FCD）之间的稳定性。即使是非常准确的模型也可能在FCD之间产生不稳定的修订，扰乱下游决策。为了提高这些修订的预测稳定性，包括MQCNN、MQT和SPADE在内的多种最先进的模型采用了名为分叉序列的强大但尚未充分探索的神经网络架构设计。该架构在所有FCD上联合编码和解码整个时间序列，一次性前向传递生成整个多视窗预测网格。这种方法与传统的神经预测方法形成对比，后者独立处理FCD，每次前向传递只生成一个单一的多视窗预测。在本文中，我们形式化了分叉序列设计，并通过引入衡量预测修订超额波动的度量标准和提供理论与实证分析来促进其更广泛的采用。我们理论地证明了分叉序列的三个关键优势：（i）通过集成增加预测稳定性；（ii）减少梯度方差，导致更稳定和一致的训练步骤；（iii）推理期间提高计算效率。我们通过在M系列基准上使用M1、M3、M4和旅游竞赛的16个数据集验证了分叉序列相对于基线窗口采样的优势，观察到MLP、RNN、LSTM、CNN、Transformer和基于状态空间的架构分别在数据集上的中位数准确性改进为29.7%、46.2%、49.3%、28.6%、24.7%和6.4%。然后我们证明，在使用分叉序列训练的这些模型进行推理期间进行预测集成可以分别提高中位数预测稳定性的10.8%、13.2%、13.0%、10.9%、10.2%和11.2%，同时保持准确性。

Summary / 总结

This paper addresses forecast stability in time series forecasting, where even highly accurate models can produce erratic revisions between forecast creation dates (FCDs). It formalizes forking-sequences, a neural network architectural design that jointly encodes and decodes the entire time series across all FCDs, and introduces a metric for excess volatility in forecast revisions. On 16 datasets from the M1, M3, M4, and Tourism competitions, training with forking-sequences instead of window-sampling improves median accuracy by 6.4% to 49.3% across six architectures, and forecast ensembling at inference improves median forecast stability by 10.2% to 13.2% while maintaining accuracy.

研究旨在通过利用神经网络架构设计中的分叉序列来提高时间序列预测的稳定性。研究引入了一个衡量预测修订波动性的指标，并提供了理论和实证证据。关键发现包括在六种模型架构中，中位数准确率提高了6.4%到49.3%，并且通过推理期间的预测集成，中位数稳定性提高了10.2%到13.2%，同时保持了准确率。

Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art

Authors: Timofey Tomashevskiy

First: 2026-01-08T17:42:56+00:00 · Latest: 2026-01-08T17:42:56+00:00

Comments: 20 pages, 4 figures

Abstract

This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning

中文标题/摘要

标题：非平稳环境下的安全持续强化学习方法：前沿综述

本文提供了非平稳环境下持续安全在线强化学习（COSRL）方法的前沿综述。我们讨论了构建持续在线安全强化学习算法的理论方面、挑战和开放问题。我们根据安全学习机制的类型提供了持续在线安全强化学习方法的分类和详细信息，该机制考虑了非平稳性的适应。我们对在线强化学习算法中的安全约束进行了分类，并最终讨论了创建可靠、安全的在线学习算法的前景。关键词：非平稳环境下的安全RL，非平稳条件下持续的强化学习，HM-MDP，NSMDP，POMDP，安全POMDP，持续学习的约束，持续安全的强化学习综述，持续安全的强化学习综述，持续安全的强化学习，分布转移下的安全在线学习，持续安全的在线适应，安全强化学习，安全探索，安全适应，约束马尔可夫决策过程，安全强化学习，部分可观测马尔可夫决策过程，安全强化学习和隐马尔可夫决策过程，安全在线强化学习，安全在线强化学习，安全元学习，安全元强化学习，安全上下文强化学习，持续学习中的安全约束制定

Summary / 总结

This work surveys continual safe online reinforcement learning (COSRL) methods, covering theoretical aspects, challenges, and open questions in building safe reinforcement learning algorithms for nonstationary environments. It organizes the methods into a taxonomy based on the type of safe learning mechanism that accounts for adaptation to nonstationarity, categorizes how safety constraints are formulated for online reinforcement learning algorithms, and discusses the prospects for creating reliable, safe online learning algorithms.

本文提供了一种针对非稳定环境的持续安全在线强化学习方法的全面综述。研究讨论了构建能够适应变化条件的安全强化学习算法的理论方面、挑战和开放问题。主要发现包括对在线强化学习算法中的安全约束进行分类，以及探索各种安全学习机制以应对非稳定性。

ROOFS: RObust biOmarker Feature Selection

Authors: Anastasiia Bakhmach, Paul Dufossé, Andrea Vaglio, Florence Monville, Laurent Greillier, Fabrice Barlési, Sébastien Benzekry

First: 2026-01-08T17:41:07+00:00 · Latest: 2026-01-08T17:41:07+00:00

Abstract

Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user's data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.
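
The best-performing filter reported above takes the union of features that pass a Benjamini-Hochberg (BH) false discovery rate threshold under two tests. A minimal sketch of that idea follows; it is not the roofs API. The p-values, the feature names, and the `alpha = 0.05` cut-off are made-up illustrations, and in practice the p-values would come from a t-test and a univariate logistic regression per feature.

```python
from typing import Dict, List, Set


def bh_adjust(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select(pvals: Dict[str, float], alpha: float) -> Set[str]:
    """Features whose BH-adjusted p-value falls below alpha."""
    names = list(pvals)
    adj = bh_adjust([pvals[n] for n in names])
    return {n for n, q in zip(names, adj) if q < alpha}


# Hypothetical per-feature p-values from the two tests.
ttest_p = {"f1": 0.01, "f2": 0.04, "f3": 0.03, "f4": 0.5}
logit_p = {"f1": 0.2, "f2": 0.001, "f3": 0.6, "f4": 0.9}

selected = select(ttest_p, 0.05) | select(logit_p, 0.05)
print(sorted(selected))
```

Taking the union rather than the intersection keeps a feature if either test finds it significant after correction, which trades some specificity for sensitivity.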

中文标题/摘要

标题：ROOFS: RObust biOmarker Feature Selection

特征选择（FS）对于生物标志物发现和生物医学数据集分析至关重要。然而，高维特征空间、样本量小、多重共线性和缺失值等挑战使得FS非易事。此外，FS性能在不同数据集和预测任务中有所不同。我们提出ROOFS，一个可在 https://gitlab.inria.fr/compo/roofs 获取的Python包，旨在帮助研究人员选择适合其问题的FS方法。ROOFS在用户数据上对多种FS方法进行基准测试，并生成报告，总结包括使用乐观校正估计的下游预测性能、稳定性、单个特征的可靠性和在带有模拟结局的半合成数据上评估的真阳性率和假阳性率在内的全面评估指标。我们通过PIONeeR临床试验数据展示了ROOFS的应用，该试验旨在识别肺癌抗PD-(L)1免疫治疗耐药的预测因子，其数据集包含来自435名患者的374个多源血液和肿瘤生物标志物。通过迭代方差膨胀因子预筛选，获得了一个包含214个特征的子集。在ROOFS中收集的34种FS方法中，我们评估了23种方法与11种分类器的组合（总共253个模型），并确定了一种基于t检验和逻辑回归的Benjamini-Hochberg错误发现率调整p值的并集的过滤器方法为最优方法，优于包括广泛使用的LASSO在内的其他方法。我们得出结论，ROOFS的全面基准测试有可能提高FS发现的稳健性和可重复性，并增加临床模型的转化价值。

Summary / 总结

ROOFS is a Python package designed to assist researchers in selecting appropriate feature selection methods for biomarker discovery. It benchmarks multiple methods on user data and provides comprehensive evaluation metrics, including predictive performance, stability, and feature reliability. On the PIONeeR dataset, ROOFS identified a filter method based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods like LASSO.

论文针对生物医学数据中的高维特征空间、多重共线性等挑战，提出了ROOFS，一个Python包，用于在用户数据上评估多种特征选择方法，并提供包括预测性能和特征稳定性在内的全面评估指标。在PIONeeR数据集中，ROOFS确定了一种基于t检验和逻辑回归的Benjamini-Hochberg错误发现率调整p值的并集的过滤方法为最优方法，优于包括广泛使用的LASSO在内的其他方法。

Multi-Scale Local Speculative Decoding for Image Generation

Authors: Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

First: 2026-01-08T17:39:35+00:00 · Latest: 2026-01-08T17:39:35+00:00

Comments: Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups (up to 1.7x), outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
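
The difference between raster-scan resampling and local resampling can be illustrated by counting how many token positions each would redo after a single rejection. This is only an interpretation of the abstract: the square neighborhood, the `radius` parameter, and both function names are assumptions, and the paper's neighborhood expansion is not modeled.

```python
from typing import Set, Tuple

Pos = Tuple[int, int]


def raster_resample_set(height: int, width: int, rejected: Pos) -> Set[Pos]:
    """Positions redone when everything from the first rejection onward is resampled."""
    r, c = rejected
    start = r * width + c
    return {(i // width, i % width) for i in range(start, height * width)}


def local_resample_set(height: int, width: int, rejected: Pos, radius: int = 1) -> Set[Pos]:
    """Positions redone when only a square neighborhood of the rejection is resampled."""
    r, c = rejected
    return {
        (i, j)
        for i in range(max(0, r - radius), min(height, r + radius + 1))
        for j in range(max(0, c - radius), min(width, c + radius + 1))
    }


raster = raster_resample_set(8, 8, rejected=(1, 1))
local = local_resample_set(8, 8, rejected=(1, 1), radius=1)
print(len(raster), len(local))
```

On an 8 x 8 token grid with a rejection at row 1, column 1, the raster rule redoes 55 positions while the 3 x 3 neighborhood redoes 9, which is the kind of saving the local mechanism targets.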

Summary / 总结

MuLo-SD is a novel framework that accelerates autoregressive image generation by combining multi-resolution drafting with spatially informed verification. It uses a low-resolution drafter and learned up-samplers to propose candidate tokens, which are verified in parallel by a high-resolution model. The method incorporates a local rejection and resampling mechanism that focuses on spatial neighborhoods, leading to up to 1.7x speedup compared to strong speculative decoding baselines while maintaining semantic alignment and perceptual quality.

MuLo-SD 是一种结合多分辨率草稿生成与空间感知验证的新框架，用于加速自回归图像生成。该方法使用低分辨率草稿模型和学习得到的上采样器来提出候选图像令牌，这些令牌由高分辨率目标模型并行验证。该方法包含一个局部拒绝和重采样机制，专注于空间邻域，实现了高达1.7倍的加速，同时保持语义对齐和感知质量。
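
For readers unfamiliar with the verification step that speculative decoding relies on, the following is a minimal Python sketch of the standard token-level accept/reject rule from speculative sampling. It is not MuLo-SD's local rejection and resampling mechanism, which the abstract describes only at a high level; the function name `verify_draft_token` and the toy distributions are illustrative assumptions.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Standard speculative-sampling check for a single drafted token.

    p_target and q_draft are probability lists over the same vocabulary;
    `token` is assumed to have been sampled from q_draft.
    Returns (accepted, final_token).
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return True, token

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    new_token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return False, new_token


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy two-token vocabulary (hypothetical numbers).
    print(verify_draft_token(0, [0.3, 0.7], [0.6, 0.4], rng))
```

In plain raster-scan decoding, every token after the first rejected position is discarded and redrafted. The abstract states that MuLo-SD instead confines correction to spatial neighborhoods of the rejected tokens.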

Atlas 2 -- Foundation models for clinical deployment

Authors: Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan

First: 2026-01-08T17:37:00+00:00 · Latest: 2026-01-08T17:37:00+00:00

Abstract

Pathology foundation models have substantially advanced the possibilities in computational pathology, yet tradeoffs in performance, robustness, and computational requirements have remained, limiting their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models that address these shortcomings by achieving state-of-the-art prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date, comprising 5.5 million histopathology whole slide images collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.

中文标题/摘要

标题：Atlas 2：面向临床部署的基础模型

病理学基础模型大幅提升了计算病理学的可能性，然而在性能、稳健性和计算需求方面的权衡限制了它们的临床部署。在本报告中，我们介绍了Atlas 2、Atlas 2-B和Atlas 2-S，这三种病理学视觉基础模型通过在八十个公共基准上的全面评估展示了预测性能、稳健性和资源效率方面的最先进水平，从而弥补了这些不足。我们的模型是在迄今为止最大的病理学基础模型数据集上训练的，该数据集包含550万张组织病理学全切片图像，来自Charité - Universitätsmedizin Berlin、LMU Munich和Mayo Clinic三家医疗机构。

Summary / 总结

The motivation for this work was to address the limitations of existing pathology foundation models in terms of performance, robustness, and computational requirements, which hindered their clinical deployment. The authors developed Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models, which achieved state-of-the-art performance, robustness, and resource efficiency across eighty public benchmarks. These models were trained on a large dataset of 5.5 million histopathology whole slide images from three medical institutions, significantly improving their capabilities for clinical use.

这项工作的动机是解决现有病理基础模型在性能、鲁棒性和计算需求方面的局限性，这些局限性阻碍了它们的临床应用。作者开发了Atlas 2、Atlas 2-B和Atlas 2-S三种病理视觉基础模型，这些模型在八十个公共基准测试中实现了最先进的性能、鲁棒性和资源效率。这些模型使用来自三个医疗机构的550万张组织病理学全切片图像的大规模数据集进行训练，显著提高了它们的临床应用能力。

Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

Authors: Marvin Illian, Ramin Khalili, Antonio A. de A. Rocha, Lin Wang

First: 2026-01-07T16:51:33+00:00 · Latest: 2026-01-08T17:32:37+00:00

Comments: 11 pages, 12 figures, v2: Corrected performance numbers in the conclusion; no change to methodology

Abstract

The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.

中文标题/摘要

标题：小区自动驾驶：通过强化学习实现自适应小区（重）选择

5G网络的广泛部署以及4G/LTE网络的共存，为移动设备提供了多种候选小区连接的选择。然而，将移动设备与小区关联起来以最大化整体网络性能，即小区（重）选择，仍然是移动运营商面临的关键挑战。今天，小区（重）选择参数通常基于运营商的经验手动配置，并且很少适应动态网络条件。在本工作中，我们提出的问题是：是否可以使用代理自动学习和适应小区（重）选择参数，以持续提高网络性能？我们提出了一种基于强化学习（RL）的框架CellPilot，通过学习移动网络动态的空间和时间模式来自适应调整小区（重）选择参数。我们的研究使用实际数据表明，即使是一个轻量级的RL代理，也可以比传统的启发式重新配置提高高达167%的性能，同时在不同网络场景中表现出良好的泛化能力。这些结果表明，数据驱动的方法可以显著改善小区（重）选择配置并增强移动网络性能。

Summary / 总结

The paper addresses the challenge of optimizing cell (re)selection in mobile networks by proposing a reinforcement learning (RL) framework called CellPilot. This framework automatically tunes cell (re)selection parameters to improve network performance. Experimental results show that CellPilot outperforms traditional methods by up to 167% and generalizes well across different network scenarios.

论文提出了一种基于强化学习（RL）的框架CellPilot，以优化移动网络中的小区（重）选择。该框架自动调整小区（重）选择参数以提升网络性能。实验结果表明，CellPilot相比传统方法可提升高达167%的性能，并且在不同网络场景下具有良好的泛化能力。

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu

First: 2026-01-08T17:28:52+00:00 · Latest: 2026-01-08T17:28:52+00:00

Comments: Project Page: https://sixiaozheng.github.io/VerseCrafter_page/

Abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because videos inherently represent dynamics only in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

中文标题/摘要

标题：VerseCrafter：具有4D几何控制的动态逼真视频世界模型

视频世界模型旨在模拟动态的真实世界环境，但现有方法难以提供统一和精确的摄像机和多对象运动控制，因为视频本质上是在投影的2D图像平面上操作动态的。为了解决这一差距，我们引入了VerseCrafter，这是一种4D感知的视频世界模型，能够在统一的4D几何世界状态中显式和一致地控制摄像机和对象动力学。我们的方法以一种新颖的4D几何控制表示为中心，该表示通过静态背景点云和每个对象的3D高斯轨迹来编码世界状态。这种表示不仅捕捉了对象的路径，还捕捉了其随时间的概率3D占用，提供了一种灵活且跨类别的替代方案，而不是刚性边界框或参数模型。这些4D控制被渲染为预训练视频扩散模型的条件信号，使其能够生成高保真、视图一致的视频，精确符合指定的动力学。不幸的是，另一个主要挑战在于大型训练数据的稀缺性，这些数据具有明确的4D注释。我们通过开发一个自动数据引擎来解决这一问题，该引擎从野外视频中提取所需的4D控制，使我们能够使用大规模和多样化的数据集训练我们的模型。

Summary / 总结

VerseCrafter is designed to address the limitations of existing video world models in controlling camera and object dynamics precisely. It introduces a 4D Geometric Control representation that captures both the path and probabilistic 3D occupancy of objects over time, enabling the generation of high-fidelity, view-consistent videos. The model uses this representation to condition a pretrained video diffusion model, achieving dynamic, realistic video generation. To overcome the lack of 4D annotated data, VerseCrafter employs an automatic data engine that extracts 4D controls from unstructured videos, allowing for extensive training on diverse datasets.

VerseCrafter 是一个 4D 感知的视频世界模型，能够在统一的 4D 几何世界状态中对相机和物体的动力学进行显式和一致的控制。它使用一种新颖的 4D 几何控制表示法来编码世界状态，捕捉物体的路径及其在时间上的概率 3D 占用情况。该模型通过一个自动数据引擎从真实场景视频中提取所需的 4D 控制进行训练，该引擎生成了一个大规模且多样的数据集，从而能够生成高保真度、视图一致的视频，这些视频严格遵循指定的动力学。

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Authors: Polina Dolgova, Sebastian U. Stich

First: 2026-01-08T17:23:13+00:00 · Latest: 2026-01-08T17:23:13+00:00

Abstract

Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,\delta)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.

Summary / 总结

The research aims to improve the practicality of certified unlearning by addressing the issue of severe accuracy reduction in noisy fine-tuning approaches. The method involves sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space. This approach mitigates the negative impact of noise while maintaining the original privacy guarantees. Experiments on image classification benchmarks demonstrate that the proposed method significantly enhances model accuracy after unlearning and remains resilient to membership inference attacks, indicating that certified unlearning can be both secure and practical.

研究旨在通过解决噪声微调导致的显著准确度下降问题，提高认证遗忘的实用性。方法是采用顺序噪声调度，将噪声预算分布在参数空间的正交子空间中。这种方法减轻了噪声的负面影响，同时保持了原始的隐私保证。实验结果表明，该方法在图像分类基准上显著提高了模型在遗忘后的准确度，并且对成员推断攻击具有鲁棒性，这表明认证遗忘既可以严格又具有实用性。
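
To make the idea of spreading noise over orthogonal subspaces concrete, here is a rough Python sketch under strong simplifying assumptions. It uses disjoint coordinate blocks as the orthogonal subspaces and an optional `fine_tune` callback between injections. The function names, the choice of coordinate blocks, and the single shared `sigma` are all hypothetical, and the noise calibration needed for the paper's $(\varepsilon,\delta)$ certificate is not reproduced here.

```python
import random


def coordinate_blocks(dim, num_blocks):
    """Split indices 0..dim-1 into disjoint contiguous blocks.

    Disjoint coordinate blocks span mutually orthogonal subspaces.
    """
    bounds = [i * dim // num_blocks for i in range(num_blocks + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_blocks)]


def sequential_subspace_noise(params, num_blocks, sigma, rng, fine_tune=None):
    """Inject Gaussian noise one block at a time instead of all at once.

    `fine_tune` is a hypothetical callback (list -> list) standing in for
    the fine-tuning steps that would run between noise injections.
    """
    params = list(params)  # work on a copy
    for block in coordinate_blocks(len(params), num_blocks):
        for i in block:
            params[i] += rng.gauss(0.0, sigma)
        if fine_tune is not None:
            params = fine_tune(params)
    return params


if __name__ == "__main__":
    rng = random.Random(0)
    print(sequential_subspace_noise([0.0] * 6, num_blocks=3, sigma=0.1, rng=rng))
```

Each coordinate is perturbed exactly once, so the model is never hit with noise in every direction at the same step. That is the intuition the abstract gives for why accuracy degrades less.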

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00

Abstract

Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends the Question Answering problem to Video Answer Search, requiring models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.

中文标题/摘要

标题：FALCONEye：利用多模态大语言模型在一小时视频中查找答案并定位内容

即使对于表现最佳的视觉语言模型（VLMs），在长达一小时的视频中查找信息也是一个具有挑战性的任务，因为编码视觉内容会迅速超出可用的上下文窗口。为了解决这一挑战，我们提出了FALCONEye，这是一种基于无需训练、与模型无关的元架构的新型视频代理，该架构由VLM和大语言模型（LLM）组成。FALCONEye使用由VLM答案校准置信度引导的基于探索的搜索算法来回答开放式问题。我们还引入了FALCON-Bench基准测试，将问答问题扩展到视频答案搜索，要求模型返回一小时视频中开放式问题的答案及其支持的时间窗口。仅使用一个7B VLM和一个轻量级的LLM，FALCONEye在FALCON-Bench中得分超过了所有开源的7B VLM和可比代理。此外，FALCONEye还在MLVU基准测试中展示了其泛化能力，处理较短的视频和不同的任务，其在单一细节任务上的性能超过了GPT-4o，同时将推理成本降低了大约一个数量级。

Summary / 总结

FALCONEye is a video agent that pairs a VLM with an LLM to answer open-ended questions about hour-long videos. It employs an exploration-based search algorithm guided by the VLM's calibrated confidence. With a 7B VLM and a lightweight LLM, FALCONEye outperforms all open-source 7B VLMs and comparable agents on FALCON-Bench. It also generalizes to the MLVU benchmark, where it surpasses GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.

FALCONEye 是一种新型视频代理，使用 VLM 和 LLM 来回答一小时长视频中的开放性问题。它采用了一种由 VLM 的校准置信度引导的探索性搜索算法。FALCONEye 在 FALCON-Bench 基准测试中优于所有开源 7B VLM 及其同类代理，并在 MLVU 基准测试中展示了强大的泛化能力，优于 GPT-4o 在单一细节任务上的表现，同时大幅降低了推理成本。

Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai

First: 2026-01-08T17:13:00+00:00 · Latest: 2026-01-08T17:13:00+00:00

Comments: 13 pages, 9 figures, project page: https://github.com/hrz2000/realign

Abstract

In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

中文标题/摘要

标题：重新对齐：结构化推理引导的上下文图像生成与编辑对齐

上下文图像生成与编辑（ICGE）允许用户通过交错的图像-文本提示来指定视觉概念，要求对用户意图进行精确理解和忠实执行。尽管最近的统一多模态模型展示了有希望的理解能力，但这些优势往往无法有效地转移到图像生成中。我们引入了Re-Align，这是一种统一框架，通过结构化推理引导的对齐来弥合理解和生成之间的差距。其核心是上下文链式思考（IC-CoT），这是一种结构化推理范式，将语义指导和参考关联解耦，提供清晰的文本目标并减轻参考图像之间的混淆。此外，Re-Align引入了一种有效的强化学习训练方案，利用代理奖励来衡量结构化推理文本与生成图像之间的对齐，从而提高模型在ICGE任务上的整体性能。广泛的实验验证了Re-Align在上下文图像生成和编辑任务中均优于具有可比模型规模和资源的竞争方法。

From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

Authors: Zirui Wu, Zeren Jiang, Martin R. Oswald, Jie Song

First: 2026-01-08T17:03:44+00:00 · Latest: 2026-01-08T17:03:44+00:00

Comments: Project Page: https://wuzirui.github.io/pvsm-web

Abstract

Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.

Summary / 总结

This paper addresses the issue of geometric inconsistency in feed-forward view synthesis models by proposing projective conditioning as an alternative to Plücker ray maps. The method uses a target-view projective cue to provide a stable 2D input, transforming the task into a more robust image-to-image translation problem. Experimental results show that this approach improves both the fidelity and cross-view consistency of the synthesized views compared to ray-conditioned models, and it achieves state-of-the-art quality on standard benchmarks.

本文解决了现有使用Plücker射线图的前馈视图合成模型可能导致几何不一致的问题。作者提出了投影条件，使用目标视图的投影线索代替原始相机参数，使任务更加稳定和条件良好。该方法还包含一种针对这种线索的大规模未标定数据预训练策略。实验结果表明，与射线条件的基线相比，在视图一致性基准上具有更高的保真度和更强的跨视图一致性，并且在标准基准上达到了最先进的质量。
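
As background for the claim that Plücker ray maps depend on the world coordinate gauge, the short Python sketch below encodes a ray under one common convention, $(d, o \times d)$, and shows that moving the world origin changes the encoding of the same physical ray. The helper names are illustrative assumptions, and the paper's projective conditioning is not implemented here.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return the 6-D Plücker encoding (d, o x d) of a ray."""
    return tuple(direction) + cross(origin, direction)


if __name__ == "__main__":
    direction = (0.0, 0.0, 1.0)
    # Camera center expressed in two world frames that differ by a shift.
    print(plucker_ray((0.0, 0.0, 0.0), direction))
    print(plucker_ray((1.0, 0.0, 0.0), direction))
```

The direction part is identical in both frames, but the moment $o \times d$ differs by $t \times d$ for a world shift $t$. A ray-conditioned model therefore sees different inputs for the same camera geometry, which is the sensitivity the abstract attributes to Plücker ray maps.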