arXiv Paper Digest

Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

First: 2026-01-08T18:59:56+00:00 · Latest: 2026-01-08T18:59:56+00:00

Comments: 15 pages, 8 figures, project page: https://mesh-4d.github.io/

Abstract

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

Summary

Mesh4D is a feed-forward model for monocular 4D mesh reconstruction that reconstructs the complete 3D shape and motion of a dynamic object from a monocular video. It uses a compact latent space learned by an autoencoder, which is guided by the skeletal structure during training. The model predicts the full animation in one shot using a latent diffusion model, outperforming prior methods in 3D shape and deformation recovery. Skeletal information is not needed at inference time, and the encoder uses spatio-temporal attention for a more stable representation of the object's deformation.

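Mesh4D is a feed-forward model that reconstructs the complete 3D shape and motion of a dynamic object from a monocular video. It represents the entire animation sequence in a compact latent space learned by an autoencoder that is guided by the skeletal structure of the training objects. The model uses spatio-temporal attention to stabilize the representation of the object's deformation, and a latent diffusion model to predict the full animation from the input video and the first-frame mesh. Mesh4D outperforms prior methods on the evaluation benchmarks and accurately recovers 3D shape and deformation.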

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Authors: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu

First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00

Comments: Project page: https://ntuneillee.github.io/research/rl-awb/

Abstract

Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/
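
For readers unfamiliar with statistical white balance, the sketch below shows the classical gray-world method: estimate the illuminant from per-channel means, then rescale each channel so the means match. This is a well-known baseline used here purely as an illustration; it is not the nighttime algorithm or the reinforcement learning agent described in the abstract, and the function names are hypothetical.

```python
def gray_world_gains(pixels):
    """Estimate per-channel gains under the gray-world assumption.

    pixels: list of (r, g, b) tuples with values in [0, 1].
    Assumption: the average scene color should be neutral gray.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    # Guard against an all-zero channel to avoid division by zero.
    return [gray / max(m, 1e-8) for m in means]


def apply_gains(pixels, gains):
    """Apply diagonal (per-channel) correction and clip to [0, 1]."""
    return [tuple(min(1.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


# Two pixels with the same orange color cast at different brightness.
image = [(0.8, 0.5, 0.2), (0.4, 0.25, 0.1)]
gains = gray_world_gains(image)
corrected = apply_gains(image, gains)
```

In this toy input the cast is uniform, so the corrected pixels come out neutral. A per-image tuning approach like the one the abstract describes would presumably adjust the parameters of a more elaborate estimator instead of using fixed statistics.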

Summary

The research aims to address the challenge of nighttime color constancy in low-light conditions by developing RL-AWB, a framework that combines statistical methods with deep reinforcement learning. The method starts with a statistical algorithm for nighttime scenes, incorporating salient gray pixel detection and novel illumination estimation. It then uses deep reinforcement learning to dynamically optimize parameters for each image, similar to professional AWB tuning. The results show that RL-AWB outperforms existing methods in generalizing across different lighting conditions and sensors.

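The research addresses the challenge of nighttime color constancy in computational photography by combining statistical methods with deep reinforcement learning in the RL-AWB framework. The method starts from a statistical algorithm designed for nighttime scenes, which includes salient gray pixel detection and a novel illumination estimation step. Building on it, a deep reinforcement learning approach dynamically optimizes the parameters for each image, mimicking the work of professional AWB tuning experts. Experimental results show that the method performs well on both low-light and well-illuminated images and generalizes better.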

QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik

First: 2026-01-08T18:59:55+00:00 · Latest: 2026-01-08T18:59:55+00:00

Comments: 30 pages, 15 figures, 11 tables; project page: https://4dqv.mpi-inf.mpg.de/QNeRF/

Abstract

Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
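
To make "parameterised quantum circuit", "superposition", and "entanglement" concrete, here is a minimal two-qubit state-vector simulation in plain Python. It applies one parameterised RY rotation followed by a CNOT gate. This is a generic textbook circuit, not the QNeRF architecture; the function names and the qubit ordering (index = 2*q0 + q1) are assumptions made for this sketch.

```python
import math


def apply_ry_q0(state, theta):
    """Apply an RY(theta) rotation to qubit 0 of a 2-qubit real state vector."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    new = [0.0] * 4
    for q1 in (0, 1):
        a, b = state[q1], state[2 + q1]  # amplitudes with q0 = 0 and q0 = 1
        new[q1] = c * a - s * b
        new[2 + q1] = s * a + c * b
    return new


def apply_cnot_q0_q1(state):
    """CNOT with control q0 and target q1: swaps the |10> and |11> amplitudes."""
    return [state[0], state[1], state[3], state[2]]


def probabilities(state):
    """Measurement probabilities for a real-valued state vector."""
    return [amp * amp for amp in state]


state = [1.0, 0.0, 0.0, 0.0]              # start in |00>
state = apply_ry_q0(state, math.pi / 2)   # superposition on qubit 0
state = apply_cnot_q0_q1(state)           # entangle the two qubits
probs = probabilities(state)
```

With theta = pi/2 the circuit prepares a Bell state, so only |00> and |11> are ever measured. In a trainable model, theta would be a learned parameter. An n-qubit state holds 2^n amplitudes, which is the source of the compactness argument in the abstract.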

Summary

QNeRF is a hybrid quantum-classical model for novel-view synthesis that leverages parameterized quantum circuits to encode spatial and view-dependent information, resulting in more compact models compared to classical counterparts. Two architectural variants are presented: Full QNeRF exploits all quantum amplitudes to enhance representational capability, while Dual-Branch QNeRF introduces a task-informed inductive bias by separating the spatial and view-dependent quantum state preparations, which reduces complexity and supports scalability. Experiments show that QNeRF matches or outperforms classical NeRF baselines with less than half the number of parameters when trained on moderate-resolution images.

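QNeRF is a hybrid quantum-classical model for novel-view synthesis that encodes spatial and view-dependent information with parameterized quantum circuits, making it more compact than classical models. It comes in two architectural variants: Full QNeRF exploits all quantum amplitudes to enhance representational capability, while Dual-Branch QNeRF introduces a task-informed inductive bias to reduce complexity and ensure scalability. Experiments show that, when trained on moderate-resolution images, QNeRF matches or outperforms classical NeRF baselines with less than half the parameters, suggesting that quantum machine learning can be a competitive alternative for mid-level computer vision tasks such as learning 3D representations from 2D observations.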

Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00

Comments: Code: https://github.com/gangweix/pixel-perfect-depth

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

Summary

This paper addresses the issue of recovering clean and accurate geometry from images, crucial for robotics and augmented reality. It introduces pixel-perfect visual geometry models, specifically Pixel-Perfect Depth (PPD) and its extension to video (PPVD), which use pixel-space diffusion transformers (DiT) to predict high-quality point clouds without flying pixels. Key designs include Semantics-Prompted DiT and Cascade DiT architecture, which enhance fine-grained details and computational efficiency. The models outperform existing methods in monocular and video depth estimation, producing cleaner point clouds.

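This paper addresses the recovery of clean and accurate geometry from images, which is essential for robotics and augmented reality. It proposes pixel-perfect visual geometry models based on generative modeling in pixel space, including Pixel-Perfect Depth (PPD) and its video extension PPVD. The models use pixel-space diffusion transformers (DiT) combined with semantic representations to preserve global semantics while enhancing fine-grained visual details. They perform strongly on monocular and video depth estimation and produce cleaner point clouds.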

GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang

First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00

Comments: IJCV, Project Page: https://henghuiding.com/GREx/

Abstract

Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.

Summary

This paper introduces GREx, which extends the classic REx tasks to support multi-target and no-target expressions, addressing the limitations of existing datasets and methods. The authors propose a new dataset gRefCOCO and a baseline method ReLA that models complex relationships between regions and language. ReLA achieves state-of-the-art results on GRES and GREC tasks. The gRefCOCO dataset and ReLA method are publicly available.

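The paper addresses the limitations of existing Referring Expression Segmentation (RES), Comprehension (REC), and Generation (REG) methods by introducing GREx, which supports multi-target and no-target expressions. The authors build a large-scale dataset, gRefCOCO, and propose a baseline method, ReLA, for modeling complex relationships; it achieves state-of-the-art performance on the GRES and GREC tasks. The dataset and method are available at https://henghuiding.github.io/GREx/.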

Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider

First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00

Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426

Abstract

Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.
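
The FID scores quoted above are Fréchet distances between two Gaussians fitted to feature statistics of real and generated images: d^2 = ||mu_1 - mu_2||^2 + Tr(S_1 + S_2 - 2 (S_1 S_2)^(1/2)). The sketch below evaluates this formula for the simplified case of diagonal covariances, where the trace term reduces to a sum of squared differences of standard deviations. The diagonal restriction and the function name are assumptions for illustration; the abstract does not say which feature extractor or implementation the authors use for their 3D FID.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with diagonal covariance.

    mu1, mu2: per-dimension feature means.
    var1, var2: per-dimension feature variances (diagonal of the covariance).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal covariances, Tr(S1 + S2 - 2 sqrt(S1 S2)) = sum (sd1 - sd2)^2.
    cov_term = sum(
        (math.sqrt(v1) - math.sqrt(v2)) ** 2 for v1, v2 in zip(var1, var2)
    )
    return mean_term + cov_term


same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [4.0, 1.0])
```

Lower is better: identical statistics give a distance of zero, and the distance grows as the generated features drift from the real ones in mean or spread. A full implementation would use the complete covariance matrices and a matrix square root.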

Summary

The research aims to address data scarcity challenges in medical imaging by proposing a novel Latent Diffusion Model (LDM) conditioning approach called CCELLA. CCELLA uses free-text clinical reports and radiology classification to condition the LDM U-Net, and a data-efficient pipeline with a joint loss function is proposed. The method achieves a 3D FID score of 0.025, significantly outperforming a recent foundation model. Additionally, synthetic images generated by the method improve the accuracy of a downstream classifier for prostate cancer prediction from 69% to 74%. The approach demonstrates improved performance with limited data and minimal human annotation.

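The research addresses data scarcity in medical imaging by proposing a novel latent diffusion model (LDM) conditioning approach called CCELLA. The method conditions the LDM U-Net on free-text clinical reports and radiology classification, and proposes a data-efficient pipeline and a joint loss function. On a size-limited 3D prostate MRI dataset it achieves a 3D FID score of 0.025, significantly better than the 0.070 of a recent foundation model. In addition, synthetic images generated by the method raise the accuracy of a downstream prostate cancer classifier from 69% to 74%. The results show that the method improves both synthetic image quality and downstream classifier performance with limited data and minimal human annotation.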

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00

Comments: NVIDIA-Tech Report

Abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) under the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
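
The collapse described above can be reproduced with a toy group of rollouts. The sketch below contrasts two ways of computing group-relative advantages from two rewards: summing the rewards and then normalizing across the group, versus normalizing each reward across the group and then summing. The second variant is only one plausible reading of "decoupling the normalization of individual rewards"; the paper's exact formulation may include further steps, and all names and reward values here are hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-8  # avoids division by zero when all rewards in a group are equal


def normalize(values):
    """Standardize a list of values with the group mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalized(rewards):
    """Coupled variant: add the rewards per rollout, then normalize the totals."""
    totals = [sum(r) for r in rewards]
    return normalize(totals)


def normalized_then_summed(rewards):
    """Decoupled variant (assumed form): normalize each reward, then add."""
    columns = list(zip(*rewards))
    norm_cols = [normalize(list(col)) for col in columns]
    return [sum(vals) for vals in zip(*norm_cols)]


# Four rollouts, two binary rewards each, e.g. (correctness, format).
group = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
coupled = summed_then_normalized(group)
decoupled = normalized_then_summed(group)
```

Rollouts 0 and 1 satisfy different rewards but have the same total, so the coupled variant assigns them identical advantages. The decoupled variant keeps them apart because the two rewards have different group statistics.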

Summary

This paper addresses the challenge of optimizing language models using multiple rewards in reinforcement learning. It shows that applying Group Relative Policy Optimization (GRPO) directly in the multi-reward setting causes distinct rollout reward combinations to collapse into identical advantage values, which reduces the resolution of the training signal and leads to suboptimal convergence. To resolve this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, improving training stability and performance. GDPO outperforms GRPO across three tasks (tool calling, math reasoning, and coding reasoning) in terms of both correctness and constraint adherence.

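This paper examines the use of Group Relative Policy Optimization (GRPO) in multi-reward reinforcement learning (RL) settings, where it causes distinct reward combinations to collapse into identical advantage values and degrades training. The authors propose a new method, GDPO, which decouples the normalization of the individual rewards and preserves their relative differences, improving training stability. GDPO outperforms GRPO on all three tasks (tool calling, math reasoning, and coding reasoning) on both correctness and constraint adherence metrics.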

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang

First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00

Abstract

The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

Summary

The paper addresses the challenge of collecting diverse and high-quality manipulation data for robot training. It introduces RoboVIP, a method that uses visual identity prompting to generate multi-view and temporally coherent observations. By curating a visual identity pool from large robotics datasets, the approach enhances the realism and utility of the augmented data, leading to improved performance in both simulation and real-world settings for vision-language-action and visuomotor policy models.

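The paper addresses the difficulty of collecting diverse, high-quality robot manipulation data. It proposes RoboVIP, which uses visual identity prompting to generate multi-view, temporally coherent observations. By building a visual identity pool from large robotics datasets, the method makes the augmented data more realistic and useful, improving the performance of vision-language-action and visuomotor policy models in both simulation and real-robot settings.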

Plenoptic Video Generation

Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin

First: 2026-01-08T18:58:32+00:00 · Latest: 2026-01-08T18:58:32+00:00

Comments: Project Page: https://research.nvidia.com/labs/dir/plenopticdreamer/

Abstract

Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in the single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/

Summary

PlenopticDreamer is a framework designed to improve the consistency and coherence of generative video re-rendering across multiple views. It uses an autoregressive multi-in-single-out model trained with a camera-guided video retrieval strategy and incorporates techniques like progressive context-scaling, self-conditioning, and long-video conditioning. Experiments show that PlenopticDreamer outperforms existing methods in maintaining view synchronization, visual fidelity, and camera control, especially in robotic manipulation scenarios.

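PlenopticDreamer is a framework that aims to improve the consistency and coherence of generative video re-rendering across multiple views. It trains a multi-in-single-out video-conditioned model autoregressively, and uses adaptive video retrieval and progressive context-scaling to improve robustness and convergence. The framework performs strongly on the Basic and Agibot benchmarks, with better view synchronization and high-fidelity visuals. Key findings include accurate camera control and diverse view transformations, such as third-person to third-person, and head-view to gripper-view in robotic manipulation.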

ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

Authors: Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi

First: 2026-01-08T18:58:08+00:00 · Latest: 2026-01-08T18:58:08+00:00

Comments: Preprint. Project Website: objectforesight.github.io

Abstract

Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. objectforesight.github.io

Summary

The research aims to develop a system that can predict future 3D object trajectories from human videos, similar to how humans anticipate object movements. ObjectForesight, a 3D object-centric dynamics model, predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. The model outperforms conventional methods by providing geometrically grounded and temporally coherent predictions. Extensive experiments demonstrate significant improvements in accuracy and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded dynamics models directly from observation.

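The research aims to develop a system that predicts future 3D object trajectories from human videos, similar to human anticipation. It introduces ObjectForesight, a 3D object-centric dynamics model that predicts the 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. The model is trained on a large dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories, and significantly outperforms conventional models in accuracy, geometric consistency, and generalization to unseen objects and scenes.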

Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter

First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00

Comments: 6 pages, 4 figures

Abstract

We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people aged 20-40 view most of their news each day through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making viewers angry in order to engage them and increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

中文标题/摘要

标题：通过机器学习和人工智能衡量与促进和平

我们使用机器学习和人工智能：1) 从新闻和社交媒体中衡量各国的和平水平；2) 开发在线工具以促进和平，帮助用户更好地理解自己的媒体消费。对于新闻媒体，我们使用神经网络从在线新闻来源的文本嵌入中衡量和平水平。该模型在训练于一个新闻媒体数据集后，也对分析另一个新闻数据集时表现出高准确性。对于社交媒体，如YouTube，我们开发了其他模型来衡量与和平相关的社会维度，使用了词级（GoEmotions）和上下文级（大型语言模型）方法。为了促进和平，我们注意到20-40岁人群中，71%的人每天主要通过社交媒体上的短视频获取新闻。这些视频内容创作者倾向于制作能够激发情绪、让你生气的视频以增加点击率。我们开发并测试了一个名为MirrorMirror的Chrome扩展程序，为YouTube观众提供实时反馈，告知他们所观看媒体的和平程度。我们的长期目标是让MirrorMirror成为一个开源工具，供内容创作者、记者、研究人员、平台和个人用户更好地理解其媒体创作和消费的语气及其对观众的影响。我们希望超越简单的参与度指标，鼓励更加尊重、细致和信息丰富的交流。

Summary / 总结

This study uses machine learning and artificial intelligence to measure peace levels in countries from news and social media, and develops tools to promote peace by analyzing media content. Neural networks were used to measure peace from news text embeddings, showing high accuracy across datasets. For social media, models were developed to assess social dimensions using word and context levels. A Chrome extension called MirrorMirror provides real-time feedback on the peacefulness of media content, aiming to foster more respectful and informative communication.

该研究利用机器学习和人工智能从新闻和社交媒体中测量各国的和平水平，并开发了一个名为MirrorMirror的在线工具，通过实时反馈媒体内容的和平程度来促进和平。研究显示，神经网络可以从新闻文本嵌入中准确测量和平水平，而71%的年轻人主要通过社交媒体上的短视频获取新闻，这些视频通常带有强烈的情感色彩以增加点击率。MirrorMirror旨在帮助用户了解和改善其媒体消费和创作的语气，促进更加尊重和信息丰富的交流。

Learning Latent Action World Models In The Wild

Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat

First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00

Comments: 37 pages, 25 figures

Abstract

Agents capable of reasoning and planning in the real world must be able to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates latent action models, which learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to that of action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

中文标题/摘要

标题：学习自然环境中的潜在动作世界模型

能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力，但它们通常需要行为标签，而这些标签在大规模应用中往往难以获取。这促使我们学习潜在动作模型，可以从视频中学习动作空间。我们的工作解决了在自然环境视频中学习潜在动作世界模型的问题，扩展了现有工作在简单机器人模拟、视频游戏或操作数据方面的研究范围。虽然这使我们能够捕捉到更丰富的动作，但也带来了视频多样性带来的挑战，如环境噪声或视频间缺乏共同的实体。为应对部分挑战，我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现，连续但受限的潜在动作能够捕捉自然环境视频中动作的复杂性，而常见的向量量化则无法做到这一点。例如，我们发现来自智能体（如人类进入房间）的环境变化可以在视频间转移，这突显了学习特定于自然环境视频的动作的能力。在视频间缺乏共同实体的情况下，我们主要能够学习在空间上局部化的潜在动作，相对于摄像机而言。尽管如此，我们能够训练一个控制器，将已知动作映射到潜在动作，使我们能够使用潜在动作作为通用接口，并使用世界模型解决规划任务，其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界迈出了一步。

Summary / 总结

This research aims to develop world models that can predict the consequences of actions without requiring explicit action labels, which are often complex to obtain at scale. The authors address the challenge of learning latent action models from in-the-wild videos, which are more diverse and complex than those used in previous works. Key findings include the ability of continuous, constrained latent actions to capture the complexity of actions from various videos, and the development of a controller that maps known actions to latent ones, enabling the use of latent actions for planning tasks with comparable performance to action-conditioned baselines.

该研究旨在从真实世界视频中学习潜在动作模型，这些视频比受控环境更复杂和多样。作者提出的方法能够捕捉更丰富的动作，同时处理环境噪声和视频间的不同主体。关键发现包括能够跨视频转移环境变化，并开发了一个控制器，将已知动作映射到潜在动作，使在使用世界模型解决规划任务时能达到与动作条件基线相似的性能。这项工作推动了潜在动作模型在真实世界中的扩展应用。

Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

Authors: James Rice

First: 2026-01-08T18:53:59+00:00 · Latest: 2026-01-08T18:53:59+00:00

Comments: 20 pages, 6330 words

Abstract

I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.

中文标题/摘要

标题：随机深度学习：结构化时序数据中不确定性建模的概率框架

我提出了一种新颖的框架，将随机微分方程（SDEs）与深度生成模型相结合，以提高涉及结构化和时序数据的机器学习应用中的不确定性量化。这种方法称为随机潜微分推理（SLDI），将伊藤SDE嵌入变分自编码器的潜空间中，允许灵活的连续时间不确定性建模，同时保持严格的数学基础。SDE的漂移项和扩散项由神经网络参数化，使数据驱动的推理成为可能，并将经典的时间序列模型推广以处理不规则采样和复杂的动态结构。一个核心理论贡献是用专用神经网络对伴随状态进行共参数化，形成一个耦合的前向-后向系统，不仅捕捉潜变量的演变，还捕捉梯度动力学。我引入了一种逐路径正则化的伴随损失，并通过随机微积分的视角分析了方差缩减的梯度流，为改进深度潜SDE的训练稳定性提供了新的工具。我的论文统一并扩展了变分推理、连续时间生成建模和控制论优化，为未来的随机概率机器学习发展提供了严格的理论基础。

Summary / 总结

The research proposes Stochastic Latent Differential Inference (SLDI), which integrates stochastic differential equations (SDEs) with deep generative models to enhance uncertainty quantification in structured and temporal data. The method embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty. Key findings include the co-parameterization of the adjoint state with a neural network, which captures both latent evolution and gradient dynamics, and the introduction of a pathwise-regularized adjoint loss to improve training stability in deep latent SDEs.

研究提出了Stochastic Latent Differential Inference (SLDI)框架，将随机微分方程（SDEs）与深度生成模型结合，以增强结构化和时序数据中的不确定性量化。SLDI在变分自编码器的潜空间中嵌入了伊藤SDE，允许灵活的连续时间不确定性建模。关键发现包括将伴随状态与神经网络共参数化，形成耦合的前向-后向系统，以及引入逐路径正则化的伴随损失，以提高深度潜SDE的训练稳定性。
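
The SLDI abstract describes an Itô SDE in latent space whose drift and diffusion terms are neural networks, and mentions irregular sampling. A minimal sketch of that idea follows, assuming an Euler-Maruyama discretization, a diagonal diffusion, and tiny randomly initialised MLPs. All names here (`mlp`, `init_mlp`, `euler_maruyama`) are hypothetical, and the paper's adjoint network and pathwise-regularized loss are not modelled.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out, scale=0.1):
    """Random weights for a two-layer perceptron (illustrative, untrained)."""
    return (scale * rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden),
            scale * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    """tanh hidden layer followed by a linear output layer."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2

def euler_maruyama(z0, times, drift_params, diff_params, rng):
    """Simulate dz = f(z, t) dt + g(z, t) dW on a possibly irregular time grid."""
    z, path = z0, [z0]
    for t0, t1 in zip(times[:-1], times[1:]):
        dt = t1 - t0
        inp = np.concatenate([z, [t0]])              # network input: state and time
        f = mlp(drift_params, inp)                   # drift term
        g = np.log1p(np.exp(mlp(diff_params, inp)))  # softplus keeps diffusion positive
        dw = rng.standard_normal(z.shape) * np.sqrt(dt)  # Brownian increment
        z = z + f * dt + g * dw
        path.append(z)
    return np.stack(path)

rng = np.random.default_rng(0)
latent_dim = 4
drift = init_mlp(rng, latent_dim + 1, 16, latent_dim)
diffusion = init_mlp(rng, latent_dim + 1, 16, latent_dim)
times = np.array([0.0, 0.1, 0.15, 0.4, 1.0])         # irregular sampling times
path = euler_maruyama(np.zeros(latent_dim), times, drift, diffusion, rng)
print(path.shape)
```

Because the step size `dt` is read from the time grid, unevenly spaced observations need no resampling, which is the property the abstract highlights.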

Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome

Authors: Moamal Fadhil Abdul-Mahdi, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad

First: 2025-12-22T12:25:50+00:00 · Latest: 2026-01-08T18:45:51+00:00

Comments: Thomas Martini Jørgensen and Emil Hovad contributed equally and share last authorship

Abstract

Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.

中文标题/摘要

标题：扩展的OpenTT Games数据集：用于精细粒度击球类型和得分结果的乒乓球数据集

自动检测和分类乒乓球视频中的击球可以简化训练工作流程，丰富广播叠加内容，并实现精细粒度的性能分析。为此，需要标注的乒乓球视频数据。我们扩展了公共OpenTTGames数据集，增加了高度详细的、帧准确的击球类型注释（正手、反手及其子类型）、球员姿势标签（身体倾斜和腿部站位）以及在得分结束时的回合结果标签。OpenTTGames是一组从桌子侧面录制的视频，带有官方标签，表示球的弹跳、球在网以上或击中网的情况。该数据集已经包含接近事件的球坐标，这些事件在原始OpenTTGames数据集中是“弹跳”、“网”或“空事件”，以及语义掩码（人类、桌子、记分板）。我们的扩展为事件添加了击球类型，并为每个球员提供了分类体系，使模型能够从事件识别转向战术理解（例如，击球是否可能赢得得分或建立优势）。我们提供了一种紧凑的编码方案和代码辅助的标注程序，以支持可重复的标注和细粒度击球理解在球拍运动中的基准。这填补了社区中的实际空白，因为许多先前的视频资源要么未公开发布，要么带有限制性/不明确的许可证，阻碍了重用和基准测试。我们的注释在与OpenTTGames相同的CC BY-NC-SA 4.0许可证下发布，允许免费非商业使用、修改和再分发，附带适当的归属。

Summary / 总结

The research aims to enhance the OpenTTGames dataset for table tennis by adding detailed shot type annotations, player posture labels, and rally outcome tags. The method involves extending the existing dataset with frame-accurate annotations and a compact coding scheme for stroke types, enabling models to move from event spotting to tactical understanding. The authors also provide a code-assisted labeling procedure to support reproducible annotations, and release the annotations under the same CC BY-NC-SA 4.0 license as OpenTTGames (free non-commercial use with attribution), facilitating fine-grained stroke understanding in racket sports training and analytics.

研究旨在通过添加详细的击球类型注释、球员姿势标签和回合结局标签来扩展OpenTTGames数据集。方法包括对现有数据集进行扩展，使用紧凑的编码方案标注击球类型，并提供一种代码辅助的标注程序以支持可重复的注释。注释以与OpenTTGames相同的CC BY-NC-SA 4.0许可证发布（允许署名后的免费非商业使用），便于在乒乓球训练和分析中实现精细的击球理解。

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00

Abstract

We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.

中文标题/摘要

标题：MineNPC-Task：面向记忆意识Minecraft代理的任务套件

我们提出了MineNPC-Task，一种用户编写的基准测试和评估框架，用于测试开放世界Minecraft中的记忆意识、混合主动性LLM代理。该框架不依赖于合成提示，而是通过与专家玩家的形成性及总结性共玩来激发任务，将任务规范化为具有显式先决条件和依赖结构的参数化模板，并配以基于有界知识策略的机器可验证验证器，该策略禁止使用世界外的捷径。该框架捕捉计划/行动/记忆事件，包括计划预览、目标澄清、记忆读写、先决条件检查和修复尝试，并根据尝试的子任务总数报告结果，这些结果源自世界内的证据。作为初步快照，我们使用GPT-4o实例化了该框架，并在8名经验丰富的玩家中评估了216个子任务。我们观察到代码执行、库存/工具处理、引用和导航中反复出现的故障模式，以及通过混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价，同时指出需要更强的跨任务记忆持久性。我们发布了完整的任务套件、验证器、日志和框架，以支持未来记忆意识具身代理的透明、可重复评估。

Summary / 总结

The research introduces MineNPC-Task, a benchmark for testing memory-aware LLM agents in Minecraft. It involves tasks elicited from expert players, normalized into templates, and paired with validators. The study evaluates 216 subtasks across 8 experienced players using GPT-4o, revealing recurring issues in code execution, inventory handling, referencing, and navigation, with mixed-initiative clarifications and memory supporting recoveries. Participants positively rated interaction quality and usability but noted the need for better memory persistence.

研究引入了MineNPC-Task，用于测试记忆感知的LLM代理在Minecraft中的表现。任务来自专家玩家，被标准化为具有明确条件的模板，并与验证器配对。研究使用GPT-4o评估了8名玩家的216个子任务，揭示了代码执行、库存处理、引用和导航等方面的问题，通过混合主动澄清和轻量级记忆帮助恢复。参与者对交互质量和界面易用性给予了积极评价，但也指出需要更好的跨任务记忆持久性。
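
The MineNPC-Task abstract describes subtasks with explicit preconditions, machine-checkable validators, and outcomes reported relative to the number of attempted subtasks. The sketch below illustrates one way such a structure could look. The `World` type, the `Subtask` fields, and the item names are hypothetical and are not taken from the released harness.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

World = Dict[str, int]  # hypothetical in-world evidence: item name -> count

@dataclass
class Subtask:
    name: str
    validator: Callable[[World], bool]            # machine-checkable success test
    preconditions: List[Callable[[World], bool]] = field(default_factory=list)

def can_attempt(task: Subtask, world: World) -> bool:
    """A subtask may start only when every precondition holds."""
    return all(check(world) for check in task.preconditions)

def success_rate(attempted: List[Subtask], world: World) -> float:
    """Fraction of attempted subtasks whose validator passes on in-world evidence."""
    if not attempted:
        return 0.0
    passed = sum(1 for task in attempted if task.validator(world))
    return passed / len(attempted)

world = {"oak_log": 3, "crafting_table": 0}
collect_logs = Subtask("collect_logs", lambda w: w.get("oak_log", 0) >= 3)
craft_table = Subtask(
    "craft_table",
    validator=lambda w: w.get("crafting_table", 0) >= 1,
    preconditions=[lambda w: w.get("oak_log", 0) >= 1],
)
rate = success_rate([collect_logs, craft_table], world)
print(rate)
```

Keeping validators as pure functions of in-world state is one way to honour a bounded-knowledge policy: nothing outside the observed world can make a subtask pass.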

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

First: 2026-01-08T18:36:29+00:00 · Latest: 2026-01-08T18:36:29+00:00

Abstract

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.

中文标题/摘要

标题：FlowLet：基于小波流匹配的条件3D脑MRI合成

脑磁共振成像（MRI）在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测（BAP），它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大量、多样且年龄平衡的数据集，而现有的3D MRI数据集在人口统计学上存在偏差，限制了公平性和泛化能力。获取新数据成本高且伦理限制多，推动了生成数据增强。当前的生成方法通常基于潜在扩散模型，在学习的低维潜在空间中操作以应对体积MRI数据的内存需求。然而，这些方法在推理时通常速度较慢，可能会由于潜在压缩引入伪影，并且很少根据年龄进行条件化，从而影响BAP性能。在本文中，我们提出FlowLet，这是一种基于小波流匹配的条件生成框架，通过在可逆的3D小波域中利用流匹配来合成年龄条件化的3D MRI，有助于避免重建伪影并减少计算需求。实验表明，FlowLet能够通过少量采样步骤生成高保真度的体积。使用FlowLet生成的数据训练BAP模型可以提高未充分代表的年龄组的性能，并且基于区域的分析证实了解剖结构的保留。

Summary / 总结

FlowLet is a conditional generative framework that synthesizes age-conditioned 3D MRIs using flow matching in an invertible 3D wavelet domain, addressing the limitations of latent diffusion models. This approach avoids reconstruction artifacts and reduces computational demands. Experiments demonstrate that FlowLet generates high-fidelity volumes with few sampling steps, and training BAP models with FlowLet-generated data improves performance for underrepresented age groups.

FlowLet 是一种条件生成框架，通过在可逆 3D 小波域内使用小波流匹配来合成年龄条件下的 3D MRI，解决了潜在扩散模型的局限性。实验表明，FlowLet 生成高保真度的体积并以少量采样步骤完成，同时改善了未充分代表的年龄组的 BAP 模型性能，保持了解剖结构的完整性。
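
FlowLet is described as running flow matching inside an invertible 3D wavelet domain. The sketch below shows the two ingredients in isolation: a single-level orthonormal Haar transform on a volume with an exact inverse, and a linear-interpolation flow matching training pair. Both choices are assumptions made for illustration, since the abstract does not name the wavelet or the probability path, and no age conditioning or network is included.

```python
import numpy as np

def haar_1d(x, axis):
    """Single-level orthonormal Haar transform along one axis (even length)."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(np.concatenate([low, high], axis=0), 0, axis)

def ihaar_1d(c, axis):
    """Exact inverse of haar_1d."""
    c = np.moveaxis(c, axis, 0)
    half = c.shape[0] // 2
    low, high = c[:half], c[half:]
    x = np.empty_like(c)
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return np.moveaxis(x, 0, axis)

def haar_3d(volume):
    for axis in range(3):
        volume = haar_1d(volume, axis)
    return volume

def ihaar_3d(coeffs):
    for axis in reversed(range(3)):
        coeffs = ihaar_1d(coeffs, axis)
    return coeffs

def flow_matching_pair(x0, x1, t):
    """Linear path x_t = (1 - t) x0 + t x1 with constant target velocity x1 - x0."""
    return (1.0 - t) * x0 + t * x1, x1 - x0

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))        # stand-in for an MRI volume
coeffs = haar_3d(volume)
restored = ihaar_3d(coeffs)
noise = rng.standard_normal(coeffs.shape)
x_t, velocity = flow_matching_pair(noise, coeffs, t=0.3)
```

Because the transform is orthonormal and exactly invertible, a sample generated in coefficient space maps back to a volume without a learned decoder, which is the property the abstract credits with avoiding reconstruction artifacts.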

MoE3D: A Mixture-of-Experts Module for 3D Reconstruction

Authors: Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park

First: 2026-01-08T18:33:52+00:00 · Latest: 2026-01-08T18:33:52+00:00

Abstract

MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.

中文标题/摘要

标题：MoE3D：一种用于3D重建的混合专家模块

MoE3D 是一种混合专家模块，旨在细化深度边界并减轻现有前馈3D重建模型（左侧）中的漂浮点伪影（用红色高亮显示）。MoE3D 预测多个候选深度图，并通过动态加权融合（右侧可视化 MoE 权重）。当与预训练的3D重建主干网络（如VGGT）结合使用时，它能显著提高重建质量，同时几乎不增加额外的计算开销。最佳查看方式为数字查看。

Summary / 总结

MoE3D is a mixture-of-experts module aimed at improving the depth boundaries and reducing flying-point artifacts in 3D reconstruction. It predicts multiple depth maps and fuses them using dynamic weighting. When combined with a pre-trained 3D reconstruction model like VGGT, it significantly improves reconstruction quality with little extra computational cost.

MoE3D 是一个混合专家模块，旨在提高深度边界清晰度并减少 3D 重建中的飞点伪影。它预测多个候选深度图并通过动态加权融合。与预训练的 3D 重建模型如 VGGT 结合使用时，可以显著提高重建质量，同时几乎没有额外的计算开销。
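
MoE3D fuses several candidate depth maps through dynamic per-pixel weights. A minimal sketch of such a fusion follows, assuming the weights come from a softmax over per-pixel gating logits. The function name `fuse_depth` and the toy values are hypothetical, and the abstract does not specify the gating function.

```python
import numpy as np

def fuse_depth(candidates, logits):
    """Fuse K candidate depth maps of shape (K, H, W) with per-pixel softmax weights."""
    shifted = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * candidates).sum(axis=0), weights

# Toy depth edge: expert 0 predicts foreground (1.0), expert 1 background (5.0).
candidates = np.stack([np.full((1, 4), 1.0), np.full((1, 4), 5.0)])
logits = np.array([[[10.0, 10.0, -10.0, -10.0]],
                   [[-10.0, -10.0, 10.0, 10.0]]])
fused, weights = fuse_depth(candidates, logits)
print(np.round(fused, 3))
```

With confident gating, each pixel follows a single expert, so the edge stays sharp. Uniform weights would instead give 3.0 everywhere, the kind of in-between depth that shows up as a flying point.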

EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI

Authors: Zain Iqbal, Lorenzo Valerio

First: 2026-01-08T18:31:11+00:00 · Latest: 2026-01-08T18:31:11+00:00

Comments: 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]

Abstract

Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.

中文标题/摘要

标题：EARL：面向普及型人工智能的液态状态机能效优化

普及型人工智能越来越多地依赖于能够在严格资源限制下提供低延迟和高能效计算的设备端学习系统。液态状态机（LSMs）为普及型和神经形态系统中的低功耗时序处理提供了有前景的方法，但其部署仍具有挑战性，因为它们对超参数高度敏感，且传统优化方法忽视能耗约束、计算成本高。本文提出了一种名为EARL的能耗感知强化学习框架，该框架结合了贝叶斯优化和基于自适应强化学习的选择策略，以联合优化准确性和能耗。EARL 使用代理建模进行全局探索，使用强化学习进行动态候选优先级排序，并采用早期终止机制以消除冗余评估，大幅减少了计算开销。在三个基准数据集上的实验表明，与领先的超参数调优框架相比，EARL 的准确率提高了 6% 至 15%，能耗降低了 60% 至 80%，优化时间最多减少一个数量级。这些结果突显了能耗感知自适应搜索在提高资源受限设备端人工智能应用中 LSM 的效率和可扩展性方面的有效性。

Summary / 总结

EARL is an energy-aware optimization framework for Liquid State Machines (LSMs) that integrates Bayesian optimization and adaptive reinforcement learning to optimize both accuracy and energy consumption. It reduces computational overhead through surrogate modeling, dynamic candidate prioritization, and early termination. Experiments show that EARL achieves higher accuracy and lower energy consumption compared to existing hyperparameter tuning methods, with up to an order of magnitude reduction in optimization time.

EARL 是一种针对液态状态机（LSMs）的能耗感知优化框架，结合了贝叶斯优化和自适应强化学习来同时优化准确性和能耗。实验结果显示，EARL 在准确性和能耗方面优于现有方法，并且优化时间最多可减少一个数量级。

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00

Abstract

Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

中文标题/摘要

标题：视觉语言模型中提示诱发幻觉的机制

大型视觉语言模型（VLMs）功能强大，但经常倾向于文本提示而非视觉证据，从而产生幻觉。我们在一个受控的物体计数设置中研究了这种失败模式，其中提示夸大了图像中的物体数量（例如，要求模型描述四朵睡莲，而实际上只有三朵）。在低物体数量时，模型通常会纠正这种高估，但随着物体数量的增加，它们越来越倾向于遵循提示，无视差异。通过对三种VLMs的机制分析，我们发现消融一小组注意力头可以将提示诱发幻觉（PIH）减少至少40%而无需额外训练。在不同模型中，PIH头以模型特定的方式介导提示复制。我们描述了这些差异，并表明消融PIH头增加了对视觉证据的纠正。我们的研究结果提供了关于提示诱发幻觉内部机制的见解，揭示了这些行为在不同模型中实现方式的差异。

Summary / 总结

This study investigates the mechanism of prompt-induced hallucination in vision-language models (VLMs) by examining their object-counting performance. The research finds that as the number of objects in an image increases, VLMs increasingly conform to the prompt's overstatement, leading to hallucinations. By analyzing three VLMs, the study identifies specific attention heads that, when removed, significantly reduce hallucinations by at least 40% without additional training. The findings suggest that these heads are crucial for prompt copying and that their ablation enhances the model's reliance on visual evidence for corrections.

研究探讨了视觉-语言模型（VLMs）如何基于文本提示而非视觉证据产生幻觉。通过在物体计数任务中操控提示，研究人员发现，随着物体数量的增加，模型越来越倾向于遵循提示。移除特定的注意力头可以减少至少40%的提示诱导幻觉，且不同模型的具体机制有所不同。研究结果表明，这些头在提示复制中起关键作用，移除它们会使模型更好地与视觉证据对齐。
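
The study reports that ablating a small set of attention heads reduces prompt-induced hallucination. The sketch below shows what ablating a head means mechanically in a multi-head attention block, assuming zero-ablation (the head's output is replaced by zeros before the output projection). The abstract does not state which ablation variant was used, and the shapes and names here are hypothetical.

```python
import numpy as np

def attention_block_output(head_outputs, w_o, ablate=()):
    """Project concatenated head outputs, with selected heads zeroed out.

    head_outputs: (n_heads, seq_len, d_head); w_o: (n_heads * d_head, d_model).
    """
    heads = head_outputs.copy()
    for idx in ablate:
        heads[idx] = 0.0                          # zero-ablation (an assumption)
    n_heads, seq_len, d_head = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ w_o

rng = np.random.default_rng(0)
head_outputs = rng.standard_normal((4, 3, 2))     # 4 heads, 3 tokens, d_head = 2
w_o = rng.standard_normal((8, 5))                 # output projection, d_model = 5
full = attention_block_output(head_outputs, w_o)
without_head_1 = attention_block_output(head_outputs, w_o, ablate=[1])
head_1_contribution = head_outputs[1] @ w_o[2:4]  # rows of w_o that read head 1
```

Since the projection is linear, the block output is a sum of per-head contributions, so removing one head subtracts exactly its contribution. This is what makes head-level attribution of a behaviour tractable.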

An interpretable data-driven approach to optimizing clinical fall risk assessment

Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi

First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00

Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714

Abstract

In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.

中文标题/摘要

标题：一种可解释的数据驱动方法以优化临床跌倒风险评估

在本研究中，我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具（JHFRAT）的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件，13,941例被纳入低跌倒风险事件。为了整合临床知识并保持可解释性，我们使用约束评分优化（CSO）模型重新加权JHFRAT评分权重，同时保持其加性结构和临床阈值。重新校准是指调整项目权重，使所得评分能够更一致地按研究的风险标签对事件进行排序，而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT（CSO AUC-ROC=0.91，JHFRAT AUC-ROC=0.86）。这种性能改进相当于每周为约翰霍普金斯健康系统保护额外的35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型（XGBoost）在知识驱动的约束逻辑回归的基础上提高了性能指标（AUC-ROC=0.94），但CSO在风险标签变化方面表现出更强的稳健性。这种基于证据的方法为医疗机构系统地增强住院跌倒预防协议和患者安全提供了坚实的基础，利用数据驱动优化技术，有助于改善风险评估和资源分配。

Summary / 总结

This study aims to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating additional clinically meaningful measures through a constrained score optimization (CSO) approach. A retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins hospitals revealed that the CSO model significantly improved predictive performance (AUC-ROC=0.91) compared to the original JHFRAT (AUC-ROC=0.86), protecting an additional 35 high-risk patients per week. The CSO model maintained interpretability and robustness, even without electronic health record variables, and provided a robust foundation for enhancing inpatient fall prevention protocols and patient safety in healthcare settings.

本研究旨在通过使用约束分数优化（CSO）模型来改进约翰霍普金斯跌倒风险评估工具（JHFRAT）的预测性能。对三家约翰霍普金斯医院54,209例住院患者的回顾性队列分析显示，CSO模型显著提高了预测性能（CSO AUC-ROC=0.91 vs. JHFRAT AUC-ROC=0.86），相当于每周额外保护35名高风险患者。CSO模型保持了可解释性和对风险标签变化的鲁棒性，为系统提升住院跌倒预防协议和患者安全提供了数据驱动的方法。
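
The study reweights the items of an additive risk score and compares AUC-ROC before and after. The sketch below illustrates only that evaluation step on made-up toy data: an additive score under two weight vectors and a pairwise AUC. The items, weights, and labels are invented for illustration and are unrelated to JHFRAT or the study cohort, and no constrained optimization is performed.

```python
import numpy as np

def additive_score(items, weights):
    """Additive risk score: weighted sum of item indicators per encounter."""
    return items @ weights

def auc_roc(scores, labels):
    """Probability that a positive outranks a negative (ties count one half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy data: 4 encounters x 3 binary items, labels 1 = high risk, 0 = low risk.
items = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
labels = np.array([1, 1, 0, 0])
original_weights = np.array([1, 1, 1])      # hypothetical starting weights
reweighted = np.array([3, 1, 0])            # hypothetical reweighting
auc_before = auc_roc(additive_score(items, original_weights), labels)
auc_after = auc_roc(additive_score(items, reweighted), labels)
print(auc_before, auc_after)
```

The score stays a plain weighted sum in both cases, so a clinician can still read off each item's contribution. Only the weights change, which is the interpretability argument the abstract makes for reweighting over a black-box model.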

SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning

Authors: Yanchang Liang, Xiaowei Zhao

First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00

Abstract

Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.

中文标题/摘要

标题：SimuAgent：基于LLM的Simulink建模助手，增强以强化学习

大型语言模型（LLMs）已经革新了基于文本的代码自动化，但在图形导向的工程工作流中的潜力尚未得到充分探索。我们介绍了SimuAgent，这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格Python表示法取代了冗长的XML，大幅减少了标记数量，提高了可解释性，并使仿真变得快速且在进程内进行。一种轻量级的计划-执行架构，经过两阶段训练，使代理具备了低级工具技能和高级设计推理能力。为应对长期任务中的稀疏奖励，我们提出了反思-GRPO（ReGRPO），它通过自我反思轨迹补充了组相对策略优化（GRPO），提供了丰富的中间反馈，加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试上进行的实验表明，使用SimuAgent微调的Qwen2.5-7B模型比标准的强化学习基线收敛更快，建模精度更高，甚至在使用少量示例提示在相同基准测试上评估时，超过了GPT-4o。消融实验表明，两阶段课程和抽象重建数据增强进一步增强了泛化能力。SimuAgent完全在本地进行训练和运行，硬件要求较低，提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent在LLMs和图形建模环境之间架起了一座桥梁，为工业环境中的AI辅助工程设计提供了一种实用的解决方案。

Summary / 总结

SimuAgent is an LLM-based agent designed for Simulink modeling, using a lightweight plan-execute architecture and Reflection-GRPO to enhance performance. It replaces verbose XML with a concise Python representation, improving interpretability and enabling faster simulation. Experiments on SimuBench show that SimuAgent outperforms standard RL baselines and even GPT-4o in few-shot prompting scenarios, with faster convergence and higher modeling accuracy. Ablations confirm the benefits of a two-stage curriculum and abstract-reconstruct data augmentation for generalization.

SimuAgent 是一个基于大语言模型的 Simulink 模型设计助手，采用轻量级计划-执行架构和 Reflection-GRPO 来提升性能。它将冗长的 XML 替换为简洁的 Python 表示，提高可解释性并实现更快的仿真。实验表明，SimuAgent 在 SimuBench 上的表现优于标准的 RL 基线和 GPT-4o，具有更快的收敛速度和更高的建模准确性。消融实验确认了两阶段课程和抽象重建数据增强对泛化能力的提升。
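
ReGRPO builds on Group Relative Policy Optimization (GRPO), whose core step scores each sampled response against the other responses in its group. The sketch below shows that standard group-relative normalization, assuming the common mean and standard deviation form. The reflection traces that ReGRPO adds are not modelled, and the function name is hypothetical.

```python
import statistics
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

mixed_group = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed_group, all_failed)
```

When every rollout in a group fails, all advantages are zero and the group contributes no gradient signal. That is the sparse-reward problem in long-horizon tasks that the abstract says the extra intermediate feedback is meant to address.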

Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop

Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu

First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00

Abstract

The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop in which models are trained on their own outputs, which may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of the Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.

中文标题/摘要

标题：自我消费执行循环中大型语言模型偏见的观察与补救

大型语言模型（LLMs）的迅速发展引发了对使用合成数据进行未来模型训练的兴趣。然而，这导致了一个自我消费的重新训练循环，模型在训练过程中使用自己的输出，可能导致性能下降并引发新的偏见。在实际应用中，之前部署的LLMs可能会影响它们生成的数据，导致由用户反馈驱动的动态系统。例如，如果模型继续未能满足某一用户群体的需求，那么来自该特定用户群体的数据收集量将会减少。在本研究中，我们提出了“自我消费执行循环”（SCPL）的概念，并探讨了合成数据在这些动态迭代训练过程中如何塑造偏见的作用。这种受控的反馈机制是由于难以获取动态生产系统中的真实用户偏好数据，使我们能够以一种原则性的方式隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环，包括典型的重新训练设置和增量微调设置，后者尚未得到充分探索。通过三个实际任务的实验，我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见，朝着更可信赖的自我改进系统迈进。

Summary / 总结

This study addresses bias in large language models (LLMs) that arises from self-consuming performative loops, in which models are trained on their own outputs. The research introduces the concept of the Self-Consuming Performative Loop (SCPL) and investigates how synthetic data influences bias during iterative training under controlled performative feedback, in both retraining and incremental fine-tuning settings. Experiments on three real-world tasks show that the performative loop increases preference bias and decreases disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate the bias, aiming to enhance the trustworthiness of self-improving systems.

本研究探讨了大型语言模型（LLM）在自我消费的执行循环（SCPL）中训练时出现的偏见问题。研究引入了一个受控环境来考察合成数据如何在迭代训练过程中影响偏见。实验结果显示，执行循环增加了偏好偏见并减少了差异偏见。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏见，旨在提高自我改进系统的可信度。
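
The reward-based rejection sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the reward model, the acceptance rule, or any function names, so everything below (`rejection_sample`, the fixed `threshold`, the toy generator and reward) is a hypothetical stand-in rather than the authors' procedure: candidates are drawn from a generator and only those whose reward clears the threshold are kept for the next training round.

```python
import random
from typing import Callable, List


def rejection_sample(
    generate: Callable[[], str],
    reward: Callable[[str], float],
    threshold: float,
    n_keep: int,
    max_tries: int = 1000,
) -> List[str]:
    """Collect up to n_keep generations whose reward is at least threshold."""
    kept: List[str] = []
    for _ in range(max_tries):
        if len(kept) == n_keep:
            break
        candidate = generate()
        # Accept the candidate only if the reward model scores it highly enough.
        if reward(candidate) >= threshold:
            kept.append(candidate)
    return kept


# Toy stand-ins (hypothetical): a "model" that emits one of two response
# styles and a "reward" that prefers the balanced style.
rng = random.Random(0)


def toy_generate() -> str:
    return rng.choice(["balanced", "skewed"])


def toy_reward(text: str) -> float:
    return 1.0 if text == "balanced" else 0.0


synthetic_batch = rejection_sample(toy_generate, toy_reward, threshold=0.5, n_keep=5)
print(synthetic_batch)
```

In a self-consuming loop, the accepted batch would replace the unfiltered synthetic data before the next retraining or fine-tuning step.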

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong

First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00

Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.

中文标题/摘要

标题：VideoAuto-R1：通过一次思考，两次回答进行视频自动推理

链式思考（CoT）推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而，其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中，我们首先证明，对于RL训练的视频模型，直接回答往往能够匹配甚至超越CoT性能，尽管CoT以更高的计算成本生成逐步分析。受此启发，我们提出了一种VideoAuto-R1视频理解框架，采用一种必要时推理的策略。在训练过程中，我们的方法遵循一次思考，两次回答的模式：模型首先生成初始答案，然后进行推理，最后输出审查后的答案。两个答案都通过可验证的奖励进行监督。在推理过程中，模型使用初始答案的置信度分数来决定是否继续进行推理。在视频问答和定位基准测试中，VideoAuto-R1实现了最先进的准确率，显著提高了效率，平均响应长度减少了约3.3倍，例如，从149个词减少到仅44个词。此外，我们观察到，在感知导向的任务中，推理模式的激活率较低，但在推理密集型任务中，激活率较高。这表明显式的基于语言的推理通常是有益的，但并非总是必要的。

Summary / 总结

The paper examines whether chain-of-thought (CoT) reasoning is necessary for video understanding and introduces VideoAuto-R1, a framework that reasons only when necessary. During training, the model generates an initial answer, performs reasoning, and then outputs a reviewed answer; both answers are supervised by verifiable rewards. During inference, it decides whether to reason based on the confidence of the initial answer. VideoAuto-R1 achieves state-of-the-art accuracy on video QA and grounding benchmarks while reducing the average response length by about 3.3x (for example, from 149 to 44 tokens). Thinking mode is activated rarely on perception-oriented tasks and more often on reasoning-intensive tasks.

论文探讨了链式思考（CoT）推理在视频理解任务中的必要性，并提出了VideoAuto-R1框架，该框架仅在必要时进行推理。在训练过程中，VideoAuto-R1遵循一次思考、两次回答的模式，首先生成初始答案，然后进行推理，最后输出审查后的答案。该方法在保持最佳准确率的同时显著减少了响应长度，并且显示了显式推理对于不同类型的任务是有益但并非总是必要的。
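
The inference-time gating described above can be sketched in a few lines. The abstract only says that the confidence score of the initial answer decides whether reasoning runs; how that score is computed is not stated. The geometric-mean token probability, the `threshold` value, and the function names below are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence, Tuple


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the initial answer (assumed score)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def respond(
    initial_answer: str,
    token_logprobs: Sequence[float],
    reason_and_review: Callable[[str], str],
    threshold: float = 0.9,
) -> Tuple[str, bool]:
    """Return (final answer, whether the reasoning stage was activated)."""
    if answer_confidence(token_logprobs) >= threshold:
        # Confident enough: answer directly and skip the reasoning stage.
        return initial_answer, False
    # Otherwise run the reasoning stage and return the reviewed answer.
    return reason_and_review(initial_answer), True


# Hypothetical reviewer that changes the answer after reasoning.
def toy_reviewer(answer: str) -> str:
    return "C"


print(respond("B", [math.log(0.99)], toy_reviewer))
print(respond("B", [math.log(0.50)], toy_reviewer))
```

With a gate like this, the average response length depends on how often the confidence falls below the threshold, which is consistent with the lower activation rate reported on perception-oriented tasks.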

FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts

Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu

Venue: KDD 2026

First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00

Comments: Accepted to KDD 2026

Abstract

Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.

中文标题/摘要

标题：FaST：基于专家混合的异质性感知大规模时空图长时预测框架

大规模网络上的时空图（STG）预测引起了广泛关注。然而，现有模型主要关注短时预测，并在扩展到长时预测和大规模图时遭受严重的计算成本和内存消耗问题。为应对上述挑战，我们提出了一种基于异质性感知专家混合（MoEs）的FaST框架，该框架适用于长时和大规模STG预测，能够实现一周（672步，每15分钟一个时间粒度）的预测，涉及数千个节点。FaST的核心创新包括：首先，提出了一种自适应图代理注意力机制，以缓解在大规模图上应用传统图卷积和自注意力模块时固有的计算负担；其次，提出了一种新的并行MoE模块，用门控线性单元（GLUs）替换传统的前馈网络，从而实现高效且可扩展的并行结构。在真实世界数据集上的广泛实验表明，FaST不仅在长时预测准确性上表现出色，而且在计算效率上也显著优于最先进的基线方法。我们的源代码可在：https://github.com/yijizhao/FaST/ 获取。

Summary / 总结

FaST is a framework designed for efficient and effective long-horizon forecasting on large-scale spatial-temporal graphs. It addresses the computational and memory challenges of existing models by introducing an adaptive graph agent attention mechanism and a parallel Mixture-of-Experts module with Gated Linear Units. FaST achieves one-week-ahead predictions with thousands of nodes and outperforms state-of-the-art baselines in both accuracy and efficiency.

FaST 是一种针对大规模时空图进行长期预测的框架，旨在解决现有模型的计算挑战。它引入了适应性图代理注意力机制和带有门控线性单元的并行 Mixture-of-Experts 模块，以提高效率。FaST 实现了一周（672 步，每 15 分钟一步）的预测，并且在准确性和计算效率上都优于最先进的方法。
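
The figure of 672 steps follows directly from the stated horizon and granularity: one week sampled every 15 minutes. A small check, with a hypothetical helper name:

```python
MINUTES_PER_DAY = 24 * 60


def horizon_steps(days: int, granularity_minutes: int) -> int:
    """Number of forecast steps covering `days` at a fixed sampling interval."""
    if MINUTES_PER_DAY % granularity_minutes != 0:
        raise ValueError("granularity must divide a day evenly")
    steps_per_day = MINUTES_PER_DAY // granularity_minutes  # 96 for 15 minutes
    return days * steps_per_day


print(horizon_steps(days=7, granularity_minutes=15))  # 672
```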

CoV: Chain-of-View Prompting for Spatial Reasoning

Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00

Abstract

Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.

中文标题/摘要

标题：CoV：空间推理的链式视角提示

在3D环境中的具身问答（EQA）通常需要收集分布在多个视角且部分被遮挡的上下文。然而，大多数最新的视觉-语言模型（VLMs）仅限于固定且有限的输入视角集，这限制了它们在推理时获取与问题相关上下文的能力，并阻碍了复杂的空间推理。我们提出了一种名为链式视角（CoV）的提示方法，这是一种无需训练、在测试时进行推理的框架，通过从粗到细的探索过程将VLM转换为主动的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图，然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整，从底层3D场景表示中获取新观察，直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估，跨四个主流VLMs获得了平均+11.56%的LLM-Match改进，最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性：增加最小动作预算可额外获得平均+2.51%的改进，峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上，CoV表现出强大的性能（例如，ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1）。总体而言，这些结果表明，与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略，无需额外训练。

Summary / 总结

The research aims to enhance embodied question answering (EQA) in 3D environments by addressing the limitations of vision-language models (VLMs) in collecting relevant context across multiple viewpoints. The proposed Chain-of-View (CoV) prompting method enables VLMs to actively explore and reason about the environment through a coarse-to-fine process, improving spatial reasoning. Experiments on OpenEQA show an average improvement of +11.56% in LLM-Match, with significant gains on specific models. CoV also demonstrates scalability, with performance improvements observed as the minimum action budget increases.

研究旨在通过解决视觉-语言模型（VLMs）在跨多个视角收集相关上下文方面的限制，提升3D环境中的具身问答（EQA）。提出的Chain-of-View（CoV）提示方法使VLMs能够通过由粗到细的过程主动探索和推理，提高空间推理能力。在OpenEQA上的实验显示，LLM-Match平均提高了11.56%，特定模型的提升尤为显著。CoV还展示了测试时的可扩展性，最小动作预算增加时，性能有所提升。

RelayLLM: Efficient Reasoning via Collaborative Decoding

Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00

Abstract

The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.

中文标题/摘要

标题：RelayLLM：通过协作解码实现高效推理

大型语言模型（LLMs）在复杂推理方面往往受到高计算成本和延迟的限制，而资源高效的小型语言模型（SLMs）通常缺乏必要的推理能力。现有的协作方法，如级联或路由，以粗粒度的方式运行，将整个查询卸载到LLMs上，当SLM能够处理大多数推理步骤时，这会导致显著的计算浪费。为了解决这个问题，我们提出了一种名为RelayLLM的新框架，通过标记级协作解码实现高效推理。与路由器不同，RelayLLM赋予SLM作为主动控制器的能力，动态地仅在关键标记上调用LLM，通过特殊命令有效地“转接”生成过程。我们引入了一种两阶段训练框架，包括预热和组相对策略优化（GRPO），以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明，RelayLLM实现了49.52%的平均准确率，有效地弥合了两种模型之间的性能差距。值得注意的是，这仅通过调用LLM处理生成标记的1.07%实现，与性能匹配的随机路由器相比，成本降低了98.2%。

Summary / 总结

RelayLLM is a framework for efficient reasoning through token-level collaborative decoding. It targets the high cost and latency of large language models (LLMs) and the limited reasoning capacity of resource-efficient small language models (SLMs). The SLM acts as a controller that invokes the LLM only for critical tokens via a special command, and is trained in two stages (warm-up, then GRPO). Across six benchmarks, RelayLLM reaches an average accuracy of 49.52% while calling the LLM for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.

RelayLLM 是一种通过 token 级别协作解码来实现高效推理的框架，解决了大型语言模型（LLM）的计算和延迟问题，并利用小型语言模型（SLM）的推理能力。它引入了两阶段训练过程，以教会模型在保持独立性的同时战略性地调用 LLM。实验结果显示，RelayLLM 在六个基准测试中实现了 49.52% 的准确率，仅调用 LLM 处理 1.07% 的 token，相比随机路由器的成本降低了 98.2%。
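
Token-level relaying can be pictured as a decoding loop in which the small model normally emits the next token but may instead emit a special command that hands that position to the large model. The sketch below is a toy under stated assumptions: the command string `<call>`, the one-token-per-call handoff, and both scripted "models" are hypothetical, and the abstract does not describe the actual command format or how many tokens the LLM contributes per call.

```python
from typing import Callable, List, Sequence, Tuple

CALL_TOKEN = "<call>"  # hypothetical name for the special command
EOS = "<eos>"

NextToken = Callable[[Sequence[str]], str]


def relay_decode(
    slm_next: NextToken,
    llm_next: NextToken,
    prompt: Sequence[str],
    max_tokens: int = 64,
) -> Tuple[List[str], float]:
    """Decode with the SLM, relaying to the LLM whenever the SLM asks for help.

    Returns the generated tokens and the fraction produced by the LLM.
    """
    context = list(prompt)
    output: List[str] = []
    llm_count = 0
    while len(output) < max_tokens:
        token = slm_next(context)
        from_llm = token == CALL_TOKEN
        if from_llm:
            # The SLM requested help: the LLM supplies this token instead.
            token = llm_next(context)
        if token == EOS:
            break
        context.append(token)
        output.append(token)
        llm_count += int(from_llm)
    ratio = llm_count / len(output) if output else 0.0
    return output, ratio


# Scripted toy models: the SLM handles the easy tokens and defers the number.
prompt = ["What", "is", "17", "*", "3", "?"]
slm_script = ["The", "answer", "is", CALL_TOKEN, EOS]


def toy_slm(context: Sequence[str]) -> str:
    return slm_script[len(context) - len(prompt)]


def toy_llm(context: Sequence[str]) -> str:
    return "51"


tokens, llm_ratio = relay_decode(toy_slm, toy_llm, prompt)
print(tokens, llm_ratio)
```

The returned ratio plays the role of the invocation rate reported in the abstract (1.07% of generated tokens), although the toy example is far too short to be representative.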

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00

Comments: The project is available at https://charlescsyyy.github.io/MVT

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

中文标题/摘要

标题：MVT：基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用

遥感中的土地覆盖理解越来越多地需要能够在不同数据集之间泛化的同时保持空间精确性和可解释性的类无差别系统。我们研究了在领域迁移下的几何优先发现与解释设置，其中候选区域以类无差别的方式划定，监督避免使用类名的词汇标识符。除了开放集识别和开放世界学习，我们专注于将类无差别掩码证据与分类学导向的场景解释相结合，而不是未知拒绝或持续类扩展。我们提出了MVT，一个三阶段框架，(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码，(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成，(iii) 使用LLM作为裁判评分，通过分层专家评分校准输出评估。在跨数据集分割迁移（在OpenEarthMap上训练，在LoveDA上评估）中，领域适应的SAM2提高了掩码质量；同时，双步骤多模态LLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。

Summary / 总结

The research aims to develop class-agnostic systems for land-cover understanding in remote sensing, focusing on spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. Key findings include improved mask quality through domain-adapted SAM2 and more accurate taxonomy-aligned tags and informative scene descriptions via dual-step MLLM fine-tuning.

论文提出了MVT框架，用于遥感中的分类对齐土地覆盖标签。该框架使用SAM2进行领域适应的区域掩码提取，使用双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成，并使用LLM作为评判者进行评估。结果表明，从OpenEarthMap转移到LoveDA数据集时，掩码质量得到提高，且标签更准确，场景描述更丰富。

Improving and Evaluating Open Deep Research Agents

Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof

First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00

Comments: 8 pages, 2 figures, 2 tables

Abstract

We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we found only one open-source DRA, termed Open Deep Research (ODR). In this work, we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.

中文标题/摘要

标题：改进和评估开放深度研究代理

我们在这里关注深度研究代理（DRAs），这是一种可以从用户那里接收自然语言提示，并自主搜索和利用互联网内容来应对提示的系统。最近的DRAs在公共基准测试上展示了令人印象深刻的性能，然而，最近的研究主要涉及专有的闭源系统。在本研究进行时，我们仅发现一个开源DRA，称为Open Deep Research（ODR）。在本工作中，我们改编了具有挑战性的最新BrowseComp基准测试，用于比较ODR与现有专有系统。我们提出了BrowseComp-Small（BC-Small），它由BrowseComp的一个子集构成，是一个计算上更易处理、适用于学术实验室的DRA基准测试。我们在BC-Small上对ODR和另外两个专有系统进行了基准测试：一个来自Anthropic的系统和一个来自Google的系统。我们发现，这三个系统在包含60个问题的测试集上的准确率均为0%。我们对ODR进行了三项战略改进，得到了ODR+模型，该模型在BC-Small上取得了10%的成功率，在闭源和开源系统中均处于领先水平。我们报告了消融研究，表明我们的三项改进都对ODR+的成功做出了贡献。

Summary / 总结

This work focuses on Deep Research Agents (DRAs), which autonomously search for and use internet content to answer user prompts. The authors adapt the BrowseComp benchmark into a smaller subset, BrowseComp-Small (BC-Small), and use it to evaluate ODR, an open-source DRA, alongside two proprietary systems from Anthropic and Google. All three systems achieve 0% accuracy on the 60-question BC-Small test set. After three strategic improvements to ODR, the resulting ODR+ reaches a 10% success rate, the best result on BC-Small among both open-source and closed-source systems, and ablations indicate that all three improvements contribute.

研究聚焦于能够处理自然语言提示并自主搜索和利用互联网内容的Deep Research Agents (DRAs)。研究改编了BrowseComp基准，提出BC-Small，用于评估ODR（一个开源DRA）和两个专有系统的表现。这三个系统在BC-Small的60个问题测试集上的准确率均为0%；在进行三项策略性改进后，ODR+实现了10%的成功率，成为开源和闭源系统中的新标杆。

Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu

First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00

Abstract

Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

Forking-Sequences

Authors: Willa Potosnak, Malcolm Wolff, Mengfei Cao, Ruijun Ma, Tatiana Konstantinova, Dmitry Efimov, Michael W. Mahoney, Boris Oreshkin, Kin G. Olivares

Venue: NeurIPS 2025

First: 2025-10-06T04:51:06+00:00 · Latest: 2026-01-08T17:43:12+00:00

Comments: Presented at the GPU-Accelerated and Scalable Optimization (ScaleOpt) Workshop, NeurIPS 2025

Abstract

While accuracy is a critical requirement for time series forecasting, an equally important desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, disrupting downstream decision-making. To improve forecast stability of such revisions, several state-of-the-art models including MQCNN, MQT, and SPADE employ a powerful yet underexplored neural network architectural design known as forking-sequences. This architectural design jointly encodes and decodes the entire time series across all FCDs, producing an entire multi-horizon forecast grid in a single forward pass. This approach contrasts with conventional neural forecasting methods that process FCDs independently, generating only a single multi-horizon forecast per forward pass. In this work, we formalize the forking-sequences design and motivate its broader adoption by introducing a metric for quantifying excess volatility in forecast revisions and by providing theoretical and empirical analysis. We theoretically motivate three key benefits of forking-sequences: (i) increased forecast stability through ensembling; (ii) gradient variance reduction, leading to more stable and consistent training steps; and (iii) improved computational efficiency during inference. We validate the benefits of forking-sequences compared to baseline window-sampling on the M-series benchmark, using 16 datasets from the M1, M3, M4, and Tourism competitions. We observe median accuracy improvements across datasets of 29.7%, 46.2%, 49.3%, 28.6%, 24.7%, and 6.4% for MLP, RNN, LSTM, CNN, Transformer, and StateSpace-based architectures, respectively. We then show that forecast ensembling during inference can improve median forecast stability by 10.8%, 13.2%, 13.0%, 10.9%, 10.2%, and 11.2% for these respective models trained with forking-sequences, while maintaining accuracy.

中文标题/摘要

标题：分叉序列

时间序列预测的准确性是一个关键要求，但同样重要的是预测在不同预测创建日期（FCD）之间的稳定性。即使是非常准确的模型也可能在FCD之间产生不稳定的修订，扰乱下游决策。为了提高这些修订的预测稳定性，包括MQCNN、MQT和SPADE在内的多种最先进的模型采用了名为分叉序列的强大但尚未充分探索的神经网络架构设计。该架构在所有FCD上联合编码和解码整个时间序列，在单次前向传递中生成整个多步预测网格。这种方法与传统的神经预测方法形成对比，后者独立处理FCD，每次前向传递只生成单个多步预测。在本文中，我们形式化了分叉序列设计，并通过引入衡量预测修订超额波动的度量标准和提供理论与实证分析来促进其更广泛的采用。我们从理论上论证了分叉序列的三个关键优势：（i）通过集成增加预测稳定性；（ii）减少梯度方差，导致更稳定和一致的训练步骤；（iii）推理期间提高计算效率。我们通过在M系列基准上使用M1、M3、M4和旅游竞赛的16个数据集验证了分叉序列相对于基线窗口采样的优势。我们观察到，对于MLP、RNN、LSTM、CNN、Transformer和基于状态空间的架构，数据集的中位数准确性分别提高了29.7%、46.2%、49.3%、28.6%、24.7%和6.4%。然后我们证明，在使用分叉序列训练的这些模型进行推理时，预测集成可以将中位数预测稳定性分别提高10.8%、13.2%、13.0%、10.9%、10.2%和11.2%，同时保持准确性。

Summary / 总结

The research aims to improve forecast stability in time series forecasting by leveraging forking-sequences, a neural network architectural design. This design encodes and decodes the entire time series across forecast creation dates in a single forward pass, contrasting with conventional methods that process each date independently. The study introduces a metric to quantify forecast volatility and provides theoretical and empirical evidence, showing that forking-sequences enhance forecast stability and computational efficiency. Experiments on 16 datasets from various competitions demonstrate median accuracy improvements of up to 49.3% and median forecast stability improvements of up to 13.2% for different model architectures.

该研究旨在通过采用联合编码和解码整个时间序列的神经网络架构——forking-sequences，提高时间序列预测的稳定性。研究引入了一个衡量预测波动性的指标，并提供了理论和实证分析。关键发现包括各种模型在准确性上的中位数改进幅度从29.7%到6.4%，以及在推理过程中预测稳定性的中位数提高10.8%到13.2%，同时保持了准确性。这种方法与传统的逐个处理每个预测创建日期的方法形成了对比。
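
The difference between forking-sequences and window sampling is easiest to see in the shape of the training targets. The sketch below builds the full forecast grid, one row per forecast creation date (FCD) and one column per horizon step, for a toy series. The function name and the toy data are assumptions for illustration; the paper's models and its volatility metric are not reproduced here.

```python
import numpy as np


def forking_targets(series: np.ndarray, horizon: int) -> np.ndarray:
    """Row t holds the next `horizon` values after forecast creation date t."""
    # Dropping the first value makes row t start at series[t + 1].
    return np.lib.stride_tricks.sliding_window_view(series[1:], horizon)


y = np.arange(8.0)  # toy series: 0, 1, ..., 7
grid = forking_targets(y, horizon=3)
print(grid.shape)  # (5, 3): five FCDs, three horizon steps each

# Forking-sequences supervises every row of the grid in one forward pass,
# whereas window sampling would draw a single row (one FCD) per sample.
sampled_row = grid[2]

# The same target date is forecast from several FCDs: entries along an
# anti-diagonal refer to one date, so their disagreement is a revision.
assert grid[0, 2] == grid[1, 1] == grid[2, 0] == y[3]
```

Averaging a model's predictions along such an anti-diagonal is one way to picture the forecast ensembling across FCDs that the abstract credits with improved stability.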

Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art

Authors: Timofey Tomashevskiy

First: 2026-01-08T17:42:56+00:00 · Latest: 2026-01-08T17:42:56+00:00

Comments: 20 pages, 4 figures

Abstract

This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning

Summary / 总结

This work surveys state-of-the-art methods for continual safe online reinforcement learning (COSRL) in nonstationary environments. It discusses theoretical aspects, challenges, and open questions in building such algorithms, and provides a taxonomy of methods based on safe learning mechanisms. Key findings include the categorization of safety constraints for online reinforcement learning and the prospects for creating reliable, safe online learning algorithms.

本文对非平稳环境下的持续安全在线强化学习（COSRL）方法进行了综述。讨论了构建此类算法的理论方面、挑战和开放问题，并根据安全学习机制提供了方法的分类。研究对在线强化学习中的安全约束进行了分类，并探讨了创建可靠和安全的在线学习算法的前景。

ROOFS: RObust biOmarker Feature Selection

Authors: Anastasiia Bakhmach, Paul Dufossé, Andrea Vaglio, Florence Monville, Laurent Greillier, Fabrice Barlési, Sébastien Benzekry

First: 2026-01-08T17:41:07+00:00 · Latest: 2026-01-08T17:41:07+00:00

Abstract

Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user's data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.

中文标题/摘要

标题：ROOFS: RObust biOmarker Feature Selection

特征选择（FS）对于生物标志物发现和生物医学数据集分析至关重要。然而，高维特征空间、样本量小、多重共线性和缺失值等挑战使得特征选择变得非平凡。此外，特征选择的性能在不同数据集和预测任务之间存在差异。我们提出了一种名为ROOFS的Python包（可在https://gitlab.inria.fr/compo/roofs 获取），旨在帮助研究人员选择适合其问题的特征选择方法。ROOFS在用户数据上对多种特征选择方法进行基准测试，并生成报告，总结包括使用乐观校正估计的下游预测性能、稳定性、单个特征的可靠性以及在模拟结果的半合成数据上评估的真实阳性率和假阳性率在内的全面评估指标。我们通过PIONeeR临床试验数据展示了ROOFS的应用，该试验旨在识别肺癌中对抗PD-(L)1免疫疗法产生耐药的预测因子。PIONeeR数据集包含435名患者来源的374个多源血液和肿瘤生物标志物。通过迭代方差膨胀因子预筛选，获得了一个包含214个特征的子集。在ROOFS中收集的34种特征选择方法中，我们评估了23种方法与11种分类器的组合（总共253个模型），并确定了一种基于t检验和逻辑回归的本杰明尼-霍奇伯格（Benjamini-Hochberg）假发现率调整p值的并集的过滤器方法为最优方法，优于包括广泛使用的LASSO在内的其他方法。我们得出结论，使用ROOFS进行全面基准测试有可能提高特征选择发现的稳健性和可重复性，并增加临床模型的转化价值。

Summary / 总结

ROOFS is a Python package designed to assist researchers in selecting appropriate feature selection methods for biomarker discovery. It benchmarks multiple methods on user data and provides comprehensive evaluation metrics, including predictive performance, stability, and feature reliability. On the PIONeeR dataset, which includes 374 biomarkers from 435 lung cancer patients, ROOFS identified a filter method based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods like LASSO.

ROOFS 是一个 Python 包，旨在帮助研究人员选择适合的特征选择方法进行生物标志物发现。它在用户数据上对多种方法进行基准测试，并提供包括预测性能、稳定性和特征可靠性在内的全面评估指标。在包含 435 名肺癌患者 374 个生物标志物的 PIONeeR 数据集上，ROOFS 识别出基于 t 检验和逻辑回归的本杰明尼-霍奇伯格假发现率调整 p 值的并集的过滤方法为最优方法，优于包括 LASSO 在内的其他方法。
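
The filter identified as optimal combines two sets of Benjamini-Hochberg (BH) adjusted tests. The sketch below is not the roofs package API; it is a plain NumPy implementation of the standard BH step-up procedure, applied to made-up p-values, under the assumption that "union" means keeping a feature if either test selects it at the chosen FDR level. It also checks the model count stated in the abstract (23 FS methods x 11 classifiers).

```python
import numpy as np


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value is compared against alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest rank passing its threshold
        selected[order[: k + 1]] = True
    return selected


# Made-up p-values for four features from two univariate tests.
p_ttest = [0.001, 0.20, 0.03, 0.60]
p_logit = [0.30, 0.004, 0.04, 0.70]

# Assumed reading of "union": keep a feature if either test selects it.
union = benjamini_hochberg(p_ttest) | benjamini_hochberg(p_logit)
print(union)

n_models = 23 * 11  # FS methods x classifiers evaluated in the study
print(n_models)  # 253
```

For real analyses, an established implementation such as `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` is a safer choice than a hand-written procedure.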

Multi-Scale Local Speculative Decoding for Image Generation

Authors: Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

First: 2026-01-08T17:39:35+00:00 · Latest: 2026-01-08T17:39:35+00:00

Comments: Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups (up to 1.7x), outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.

Summary / 总结

MuLo-SD accelerates autoregressive image generation by combining multi-resolution drafting with spatially informed verification. A low-resolution drafter paired with learned up-samplers proposes candidate image tokens, which a high-resolution target model verifies in parallel. A local rejection and resampling mechanism corrects draft errors within spatial neighborhoods instead of resampling in raster-scan order after the first rejection. MuLo-SD reaches speedups of up to 1.7x and outperforms speculative decoding baselines such as EAGLE-2 and LANTERN in acceleration, while maintaining comparable semantic alignment and perceptual quality. The results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split.

MuLo-SD 是一种将多分辨率草稿生成与空间感知验证相结合的新型框架，用于加速自回归图像生成。该方法使用低分辨率草稿模型和学习到的上采样器提出候选令牌，再由高分辨率目标模型并行验证。该方法还引入局部拒绝与重采样机制，以高效纠正草稿错误。MuLo-SD 最高可达 1.7 倍加速，同时保持相当的语义对齐和感知质量。
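
For readers unfamiliar with the verification step that speculative decoding relies on, the following is a minimal sketch of the generic token-level accept/reject rule (accept a draft token with probability min(1, p/q), otherwise resample from the renormalized residual max(0, p - q)). It is not MuLo-SD's local rejection and resampling mechanism, whose details are not given in the abstract; the function name `speculative_accept` and the toy distributions are illustrative assumptions.

```python
import random


def speculative_accept(p_target, q_draft, draft_token, rng):
    """Generic speculative-sampling test for one draft token (illustrative only).

    p_target, q_draft: probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # Degenerate case: fall back to the target distribution itself.
        residual, total = list(p_target), sum(p_target)
    r = rng.random() * total
    cumulative = 0.0
    choice = 0
    for tok, w in enumerate(residual):
        if w <= 0.0:
            continue
        choice = tok
        cumulative += w
        if r < cumulative:
            break
    return choice, False
```

With this rule the returned token follows the target distribution even when the drafter is badly calibrated, which is why rejected positions are the costly part that methods in this area try to handle more efficiently.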

Atlas 2 -- Foundation models for clinical deployment

Authors: Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan

First: 2026-01-08T17:37:00+00:00 · Latest: 2026-01-08T17:37:00+00:00

Abstract

Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.

中文标题/摘要

标题：图谱2——临床部署的基础模型

病理学基础模型显著推进了计算病理学的可能性——然而，在性能、鲁棒性和计算需求方面的权衡限制了它们的临床部署。在本报告中，我们介绍了图谱2、图谱2-B和图谱2-S，这三个病理学视觉基础模型通过在八十项公开基准测试中展示最先进的预测性能、鲁棒性和资源效率，弥补了这些不足。我们的模型是在迄今为止最大的病理学基础模型数据集上训练的，该数据集包含550万张组织病理学全切片图像，来自Charité - Universitätsmedizin Berlin、LMU Munich和Mayo Clinic三家医疗机构。

Summary / 总结

Existing pathology foundation models trade off performance, robustness, and computational requirements, which has limited their clinical deployment. The authors present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models that report state-of-the-art prediction performance, robustness, and resource efficiency in an evaluation across eighty public benchmarks. The models were trained on 5.5 million histopathology whole slide images collected from three medical institutions, which the authors describe as the largest pathology foundation model dataset to date.

这项工作的动机是解决现有病理基础模型在性能、鲁棒性和计算需求方面的局限性，这些局限性限制了它们的临床应用。作者开发了Atlas 2、Atlas 2-B和Atlas 2-S，这些模型在八十个公共基准测试中表现出最先进的性能，展示了卓越的预测准确性、鲁棒性和资源效率。这些模型使用来自三个医疗机构的550万张组织病理学图像进行训练，显著提高了它们在临床环境中的适用性。

Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

Authors: Marvin Illian, Ramin Khalili, Antonio A. de A. Rocha, Lin Wang

First: 2026-01-07T16:51:33+00:00 · Latest: 2026-01-08T17:32:37+00:00

Comments: 11 pages, 12 figures, v2: Corrected performance numbers in the conclusion; no change to methodology

Abstract

The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.

中文标题/摘要

标题：小区自动驾驶：通过强化学习实现自适应小区（重）选择

5G网络的广泛部署以及4G/LTE网络的共存，为移动设备提供了多种候选小区连接的选择。然而，将移动设备连接到能够最大化整体网络性能的小区（重）选择，仍然是移动运营商面临的关键挑战。今天，小区（重）选择参数通常基于运营商的经验手动配置，并且很少适应动态网络条件。在本工作中，我们提出的问题是：是否可以使用代理自动学习和适应小区（重）选择参数，以持续提高网络性能？我们提出了一种基于强化学习（RL）的框架CellPilot，通过学习移动网络动态的空间和时间模式来自适应调整小区（重）选择参数。我们的研究使用实际数据表明，即使是一个轻量级的RL代理，也可以比传统的启发式重新配置提高高达167%的性能，同时在不同网络场景中表现出良好的泛化能力。这些结果表明，数据驱动的方法可以显著改善小区（重）选择配置并增强移动网络性能。

Summary / 总结

This paper addresses cell (re)selection in mobile networks where 5G and 4G/LTE coexist, a setting in which parameters are typically configured manually and rarely adapted to changing conditions. It proposes CellPilot, a reinforcement learning (RL) framework that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of network dynamics. Experiments on real-world data show that even a lightweight RL agent outperforms conventional heuristic reconfigurations by up to 167% and generalizes well across different network scenarios.

该研究旨在通过强化学习解决5G网络中的小区（重）选择问题，提出了一种名为CellPilot的框架，能够自动调整小区（重）选择参数以提升网络性能。实验结果表明，CellPilot相较于传统方法可提升高达167%的性能，并且在不同网络场景下具有良好的泛化能力。

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu

First: 2026-01-08T17:28:52+00:00 · Latest: 2026-01-08T17:28:52+00:00

Comments: Project Page: https://sixiaozheng.github.io/VerseCrafter_page/

Abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

中文标题/摘要

标题：VerseCrafter：具有4D几何控制的动态逼真视频世界模型

视频世界模型旨在模拟动态的真实世界环境，但现有方法难以提供对摄像机和多对象运动的统一和精确控制，因为视频本质上是在投影的2D图像平面上操作动态的。为了解决这一差距，我们引入了VerseCrafter，这是一种4D感知的视频世界模型，能够在统一的4D几何世界状态中显式且一致地控制摄像机和对象动力学。我们的方法以一种新颖的4D几何控制表示为中心，该表示通过静态背景点云和每个对象的3D高斯轨迹来编码世界状态。这种表示不仅捕捉了对象的路径，还捕捉了其随时间的概率3D占用，提供了一种灵活且跨类别的替代方案，而不是刚性边界框或参数模型。这些4D控制被渲染为预训练视频扩散模型的条件信号，从而能够生成高保真、视图一致的视频，精确符合指定的动力学。不幸的是，另一个主要挑战在于缺乏带有明确4D注释的大规模训练数据。我们通过开发一个自动数据引擎来解决这个问题，该引擎可以从野外视频中提取所需的4D控制，使我们能够使用大规模和多样化的数据集训练我们的模型。

Summary / 总结

VerseCrafter is a 4D-aware video world model that gives explicit, coherent control over camera and object dynamics through a 4D Geometric Control representation. The representation encodes the world state as a static background point cloud plus per-object 3D Gaussian trajectories, capturing both each object's path and its probabilistic 3D occupancy over time. These controls are rendered into conditioning signals for a pretrained video diffusion model, which generates high-fidelity, view-consistent videos that adhere to the specified dynamics. To address the scarcity of 4D-annotated training data, an automatic data engine extracts the required 4D controls from in-the-wild videos.

VerseCrafter 是一种4D感知的视频世界模型，能够在一个统一的4D几何世界状态中明确且一致地控制摄像机和物体的动力学。它使用一种新颖的4D几何控制表示来编码世界状态，该表示不仅捕捉物体的路径，还捕捉物体在时间上的概率3D占用。该模型通过从野外视频中自动提取所需的4D控制来训练，从而生成高保真度、视图一致的视频，这些视频严格遵循指定的动力学。
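
To make the idea of a "3D Gaussian trajectory" with probabilistic occupancy concrete, here is a minimal sketch. It assumes one axis-aligned Gaussian (mean and per-axis variance) per frame and evaluates an unnormalized occupancy score at a query point; the abstract does not specify VerseCrafter's actual parameterization, so the function names and the diagonal-covariance choice are illustrative assumptions.

```python
import math


def gaussian_occupancy(point, mean, var):
    """Unnormalized axis-aligned Gaussian: exp(-0.5 * sum((x - mu)^2 / var))."""
    m = sum((x - mu) ** 2 / v for x, mu, v in zip(point, mean, var))
    return math.exp(-0.5 * m)


def occupancy_over_time(point, trajectory):
    """Score a fixed 3D point against a per-frame list of (mean, var) Gaussians."""
    return [gaussian_occupancy(point, mean, var) for mean, var in trajectory]


# Hypothetical object moving along the x axis over three frames.
trajectory = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((1.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ((2.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
]
scores = occupancy_over_time((2.0, 0.0, 0.0), trajectory)
```

The scores rise as the object approaches the query point and reach 1.0 when the Gaussian is centered on it, which is the sense in which such a trajectory describes soft occupancy rather than a rigid bounding box.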

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Authors: Polina Dolgova, Sebastian U. Stich

First: 2026-01-08T17:23:13+00:00 · Latest: 2026-01-08T17:23:13+00:00

Abstract

Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,\delta)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.

Summary / 总结

The research aims to improve the practicality of certified unlearning by addressing the issue of severe accuracy reduction in noisy fine-tuning approaches. The method involves sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space. This approach mitigates the negative impact of noise while maintaining the original privacy guarantees. Experimental results on image classification benchmarks demonstrate that the proposed method significantly enhances model accuracy after unlearning and remains resilient to membership inference attacks, thereby balancing rigorous privacy guarantees with practical utility.

论文旨在解决在确保隐私的同时保持模型准确性的难题。它提出了一种顺序噪声调度方法，该方法按顺序在参数空间的正交子空间中注入噪声。这种方法保留了原始的隐私保证，并在图像分类基准上展示了在删除数据后显著提高的准确性。此外，该方法还对成员推断攻击具有鲁棒性。
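
The scheduling idea of injecting noise into one orthogonal subspace at a time, rather than into all parameters at once, can be sketched as follows. This is only an illustration under stated assumptions: disjoint coordinate blocks stand in for the orthogonal subspaces, `sigma` is an arbitrary noise scale rather than a calibrated privacy parameter, and the fine-tuning that would run between injections is omitted. The abstract does not say how the paper constructs its subspaces or calibrates the noise.

```python
import random


def sequential_subspace_noise(params, num_subspaces, sigma, rng):
    """Yield noisy copies of params, perturbing one coordinate block per step.

    Illustrative only: block k holds indices k, k + num_subspaces, ... The
    blocks are disjoint, so the coordinate subspaces they span are orthogonal.
    """
    n = len(params)
    current = list(params)
    for k in range(num_subspaces):
        block = list(range(k, n, num_subspaces))
        for i in block:
            # Gaussian noise is added only inside the current subspace.
            current[i] += rng.gauss(0.0, sigma)
        # A real pipeline would fine-tune here before the next injection.
        yield k, block, list(current)
```

After the last step every coordinate has been perturbed exactly once, but at any single step only a fraction of the parameters is disturbed.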

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00

Abstract

Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, extending the Question Answering problem to Video Answer Search, which requires models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents in FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.

中文标题/摘要

标题：FALCONEye：在一小时长视频中使用多模态LLM查找答案并定位内容

即使对于表现最佳的视觉语言模型（VLMs），在一小时长的视频中查找信息也是一个具有挑战性的任务，因为编码视觉内容会迅速超出可用的上下文窗口。为了解决这一挑战，我们提出了FALCONEye，这是一种基于免训练、模型无关的元架构的新型视频代理，该架构由VLM和大型语言模型（LLM）组成。FALCONEye使用由VLM答案的校准置信度引导的基于探索的搜索算法来回答开放式问题。我们还引入了FALCON-Bench基准测试，将问答问题扩展到视频答案搜索，要求模型针对一小时长视频中的开放式问题同时返回答案及其支持的时间窗口。仅使用7B VLM和轻量级LLM，FALCONEye在FALCON-Bench中得分超过了所有开源7B VLM和可比代理。此外，FALCONEye在视频较短、任务不同的MLVU基准测试中展示了其泛化能力，在单一细节任务上超越了GPT-4o，同时将推理成本降低了约一个数量级。

Summary / 总结

FALCONEye is a training-free, model-agnostic video agent that combines a VLM and an LLM to answer open-ended questions about hour-long videos. It uses an exploration-based search algorithm guided by the calibrated confidence of the VLM's answers. The accompanying FALCON-Bench benchmark requires models to return both the answer and its supporting temporal window. With a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench, and it generalizes to the MLVU benchmark, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.

FALCONEye 是一种新型视频代理，结合了 VLM 和 LLM 来回答一小时长视频中的开放性问题。它使用基于探索的搜索算法，并由 VLM 的校准置信度引导。FALCONEye 在 FALCON-Bench 中优于所有开源 7B VLM 及同类代理，并在 MLVU 基准测试中表现出强大的泛化能力，在单一细节任务上超越 GPT-4o，同时将推理成本降低了大约一个数量级。

Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai

First: 2026-01-08T17:13:00+00:00 · Latest: 2026-01-08T17:13:00+00:00

Comments: 13 pages, 9 figures, project page: https://github.com/hrz2000/realign

Abstract

In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

中文标题/摘要

标题：Re-Align: 结构化推理引导对齐以实现上下文内图像生成与编辑

上下文内图像生成与编辑（ICGE）允许用户通过交错的图像-文本提示指定视觉概念，要求精确理解并忠实执行用户意图。尽管最近的统一多模态模型展示了有希望的理解能力，但这些优势往往无法有效地转移到图像生成中。我们引入了Re-Align，这是一种统一框架，通过结构化推理引导对齐来弥合理解和生成之间的差距。其核心是上下文链式思考（IC-CoT），这是一种结构化推理范式，将语义指导和参考关联分离，提供清晰的文本目标并减轻参考图像之间的混淆。此外，Re-Align引入了一种有效的强化学习训练方案，利用代理奖励来衡量结构化推理文本与生成图像之间的对齐程度，从而提高模型在ICGE任务上的整体性能。广泛的实验验证了Re-Align在上下文内图像生成与编辑任务上优于具有可比模型规模和资源的竞争方法。

Summary / 总结

Re-Align is a unified framework for in-context image generation and editing (ICGE) that uses structured reasoning-guided alignment to carry a model's understanding of user intent over to generation. Its core is the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, giving a clear textual target and reducing confusion among reference images. An RL training scheme then uses a surrogate reward that measures the alignment between the structured reasoning text and the generated image. Experiments show that Re-Align outperforms competitive methods of comparable model scale and resources on both generation and editing tasks.

Re-Align 是一个统一框架，用于通过结构化推理引导对齐来提高图像生成和编辑的精度。它引入了 In-Context 链式思考（IC-CoT）来解耦语义指导和参考关联，提供清晰的文本目标。此外，Re-Align 使用带有代理奖励的强化学习训练方案来增强结构化推理文本与生成图像之间的对齐。实验表明，Re-Align 在图像生成和编辑任务上均优于其他方法。

From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

Authors: Zirui Wu, Zeren Jiang, Martin R. Oswald, Jie Song

First: 2026-01-08T17:03:44+00:00 · Latest: 2026-01-08T17:03:44+00:00

Comments: Project Page: https://wuzirui.github.io/pvsm-web

Abstract

Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.

中文标题/摘要

标题：从光线到投影：更好的输入以实现前馈视图合成

前馈视图合成模型可以在单次前向传播中预测新的视图，且具有最少的三维归纳偏置。现有工作将相机编码为Plücker光线图，这将预测与任意的世界坐标系选取绑定，并使其对小的相机变换敏感，从而破坏几何一致性。在本文中，我们探讨了什么输入可以最好地条件化模型以实现稳健且一致的视图合成。我们提出了投影条件化，用目标视图的投影线索替换原始相机参数，提供一个稳定的二维输入。这将任务从光线空间中脆弱的几何回归问题重新定义为条件良好的目标视图图像到图像转换问题。此外，我们引入了一种针对此线索定制的掩码自编码预训练策略，使使用大规模未标定数据进行预训练成为可能。我们的方法在我们的视图一致性基准上显示出更高的保真度和更强的跨视图一致性，优于光线条件化的基线。它还在标准的新视图合成基准上达到了最先进的质量。

Summary / 总结

This paper addresses the issue of geometric inconsistency in feed-forward view synthesis models by proposing projective conditioning as an alternative to ray maps. The method uses a target-view projective cue to provide a stable 2D input, transforming the task into a well-conditioned image-to-image translation problem. Experimental results show that this approach improves both the fidelity and cross-view consistency of the synthesized views compared to ray-conditioned models on a view-consistency benchmark and achieves state-of-the-art quality on standard benchmarks.

本文提出投影条件化来替代光线图，以解决前馈视图合成模型中的几何不一致问题。该方法使用目标视图的投影线索作为稳定的2D输入，将任务重新定义为条件良好的图像到图像转换问题。实验结果表明，这种方法在保真度和跨视图一致性方面都优于基于光线条件化的基线，并在标准基准测试中达到了最先进的质量。
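
The claim that Plücker ray maps tie a model's inputs to the choice of world coordinates can be checked with a small sketch. It uses the standard Plücker encoding of a ray as (d, m) with m = o x d, where o is the ray origin and d its unit direction; the helper names are illustrative and are not taken from the paper's code.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d."""
    norm = sum(c * c for c in direction) ** 0.5
    d = tuple(c / norm for c in direction)
    return d, cross(origin, d)


def shift_world(point, t):
    """Express the same point after translating the world origin by -t."""
    return tuple(p + s for p, s in zip(point, t))


# Same physical ray, described in two world frames that differ by a translation.
d0, m0 = plucker_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
d1, m1 = plucker_ray(shift_world((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), (0.0, 0.0, 1.0))
```

Here the direction is unchanged but the moment goes from (0, 0, 0) to (0, -1, 0), so a network fed raw Plücker maps sees different inputs for the same camera and scene. A target-view 2D cue avoids this dependence by construction.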