arXiv 论文速递

Generative Refocusing: Flexible Defocus Control from a Single Image

Authors: Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu

First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00

Comments: Project website: https://generative-refocusing.github.io/

Abstract

Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

中文标题/摘要

标题：生成性重新聚焦：从单张图像实现灵活的景深控制

景深控制是摄影中的关键要素，但获得完美的对焦往往需要多次尝试或特殊设备。单张图像重新聚焦仍然具有挑战性，涉及恢复清晰内容并创建逼真的散景。当前的方法存在显著的局限性，需要全聚焦输入，依赖于模拟器生成的合成数据，并且对光圈控制有限。我们提出了生成性重新聚焦，这是一种两步过程，使用DeblurNet从各种输入中恢复全聚焦图像，使用BokehNet创建可控的散景。我们的主要创新是半监督训练。该方法结合了合成配对数据和未配对的真实散景图像，使用EXIF元数据捕捉模拟器无法提供的真实光学特性。我们的实验表明，我们在景深去模糊、散景合成和重新聚焦基准测试中均取得了最佳性能。此外，我们的生成性重新聚焦还允许文本指导调整和自定义光圈形状。

Summary / 总结

The paper addresses the challenge of single-image refocusing in photography, where recovering sharp content and creating realistic bokeh are difficult. It introduces Generative Refocusing, a two-step process using DeblurNet and BokehNet. The method employs semi-supervised training that combines synthetic paired data with unpaired real bokeh images, leveraging EXIF metadata. Experiments demonstrate superior performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks, and the system supports text-guided adjustments and custom aperture shapes.

研究旨在通过恢复清晰内容和生成逼真的散景来实现单张图像的重新聚焦。方法采用两步流程，DeblurNet用于恢复全聚焦图像，BokehNet用于生成可控的散景。半监督训练结合了合成配对数据和未配对的真实散景图像，利用EXIF元数据捕捉实际光学特性。实验表明，该方法在散焦去模糊、散景合成和重新聚焦基准测试中表现出色，并支持文本引导调整和自定义光圈形状。
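
The abstract says EXIF metadata carries real optical characteristics. As a rough illustration of why those fields matter, the sketch below computes the thin-lens circle of confusion from a focal length and f-number (both standard EXIF fields) plus focus and subject distances. It is a textbook optics formula, not the paper's BokehNet; the function name and the example distances are assumptions made for illustration.

```python
def circle_of_confusion_mm(focal_mm, f_number, focus_mm, subject_mm):
    """Thin-lens blur-circle diameter on the sensor, in millimetres.

    focal_mm and f_number are the kind of values EXIF records;
    focus_mm and subject_mm are assumed to be known distances from the lens.
    """
    if focus_mm <= focal_mm:
        raise ValueError("focus distance must exceed the focal length")
    aperture_mm = focal_mm / f_number                # entrance pupil diameter
    magnification = focal_mm / (focus_mm - focal_mm)
    return aperture_mm * magnification * abs(subject_mm - focus_mm) / subject_mm


# A background point 4 m away with the lens focused at 2 m, 50 mm lens.
wide_open = circle_of_confusion_mm(50.0, 1.8, 2000.0, 4000.0)
stopped_down = circle_of_confusion_mm(50.0, 8.0, 2000.0, 4000.0)
```

Under this model the blur diameter scales inversely with the f-number, which is the aperture control the abstract refers to; real lenses deviate from it, which is presumably why real bokeh images are useful as training data.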

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00

Comments: Project page and code: https://worldcanvas.github.io/

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.

中文标题/摘要

标题：世界即画布：使用参考图像、轨迹和文本绘制可提示事件

我们提出了WorldCanvas框架，该框架通过结合文本、轨迹和参考图像，实现了丰富且用户导向的模拟。与仅使用文本的方法和现有的轨迹控制图像到视频的方法不同，我们的多模态方法将轨迹（编码运动、时间、可见性）与自然语言（用于语义意图）和参考图像（用于物体身份的视觉定位）相结合，从而生成连贯且可控的事件，包括多智能体交互、物体的进入/退出、参考引导的外观以及反直觉事件。生成的视频不仅展示了时间连贯性，还展示了在短暂消失后场景中物体身份和场景的一致性。通过支持富有表现力的世界事件生成，WorldCanvas将世界模型从被动预测者提升为交互式的、用户导向的模拟器。我们的项目页面可在：https://worldcanvas.github.io/ 获取。

Summary / 总结

WorldCanvas is a framework that combines text, trajectories, and reference images to enable rich, user-directed simulation of world events. Unlike text-only approaches or trajectory-controlled image-to-video methods, WorldCanvas integrates trajectories for motion, timing, and visibility with natural language for semantic intent and reference images for visual grounding. This multimodal approach generates coherent, controllable events with multi-agent interactions and reference-guided appearance, preserving object identity and scene consistency. The resulting videos demonstrate temporal and emergent consistency, advancing world models from passive predictors to interactive simulators.

WorldCanvas 是一个框架，通过整合文本、轨迹和参考图像来实现丰富的用户导向模拟。不同于仅使用文本的方法或基于轨迹控制的图像到视频技术，WorldCanvas 将轨迹中的运动、时间和可见性与自然语言的语义意图以及参考图像的视觉定位相结合。这种方法能够生成包含多智能体交互、物体进出、参考引导外观和反常识事件的连贯且可控的事件。生成的视频不仅展示了时间连贯性，还展示了在短暂消失后对象身份和场景的一致性。WorldCanvas 将世界模型从被动预测者提升为交互式的、用户导向的模拟器。

EasyV2V: A High-quality Instruction-based Video Editing Framework

Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Project page: https://snap-research.github.io/easyv2v/

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/

中文标题/摘要

标题：EasyV2V：一种基于指令的高质量视频编辑框架

尽管图像编辑取得了快速进展，但视频编辑仍较少被探索，面临着一致性、控制和泛化方面的挑战。我们研究了数据、架构和控制的设计空间，并引入了EasyV2V，这是一种简单有效的基于指令的视频编辑框架。在数据方面，我们利用现有专家和快速逆向操作构建了多样化的视频对，通过单帧监督和共享仿射运动的伪对将图像编辑对提升为视频，挖掘密集字幕片段以构建视频对，并添加过渡监督以教授编辑的展开方式。在模型方面，我们观察到预训练的文本到视频模型具有编辑能力，这激励了简化的设计。简单的序列拼接作为条件，并结合轻量级LoRA微调足以训练出强大的模型。在控制方面，我们通过单一掩码机制统一了时空控制，并支持可选的参考图像。总体而言，EasyV2V 可以灵活地处理输入，例如视频+文本、视频+掩码+文本、视频+掩码+参考+文本，并实现了最先进的视频编辑结果，超越了同时期和商用系统。项目页面：https://snap-research.github.io/easyv2v/

Summary / 总结

The paper addresses the challenges in video editing, such as consistency and control, by introducing EasyV2V, a framework for instruction-based video editing. The framework uses diverse video pairs created from existing experts and image edit pairs, and supports spatiotemporal control through a single mask mechanism. EasyV2V achieves state-of-the-art results in video editing, surpassing concurrent and commercial systems, by leveraging pretrained text-to-video models and light LoRA fine-tuning. Project page: https://snap-research.github.io/easyv2v/

EasyV2V 是一个基于指令的视频编辑框架，旨在解决视频编辑中的连贯性、控制和泛化问题。它通过多样化的视频对、单帧监督和过渡监督来训练模型，以根据文本指令编辑视频。该模型利用预训练的文本到视频模型，并使用简单的序列拼接和轻量级 LoRA 微调进行训练。EasyV2V 支持多种输入类型，并在视频编辑方面取得了最先进的成果，超越了同时期和商业系统。
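
The abstract names two simple ingredients on the model side: sequence concatenation for conditioning and light LoRA fine-tuning. The sketch below shows the generic form of both in NumPy. It is a minimal illustration and not the EasyV2V code; `LoRALinear`, `concat_condition`, the rank, and the scaling factor are all assumed names and values.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction B @ A.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def concat_condition(source_tokens, target_tokens):
    """Condition by placing source-video tokens in front of the target tokens."""
    return np.concatenate([source_tokens, target_tokens], axis=0)
```

Because `B` starts at zero, the adapted layer reproduces the pretrained layer exactly before any fine-tuning, so training starts from the pretrained model's behaviour with only `rank * (in_dim + out_dim)` extra parameters per layer.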

DVGT: Driving Visual Geometry Transformer

Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Code is available at https://github.com/wzzheng/DVGT

Abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

中文标题/摘要

标题：DVGT：驾驶目标视觉几何变换器

从视觉输入感知和重建3D场景几何对于自动驾驶至关重要。然而，仍然缺乏一种针对驾驶场景的密集几何感知模型，能够适应不同的场景和相机配置。为了解决这一问题，我们提出了一种驾驶目标视觉几何变换器（DVGT），它可以从前序的多视角未校正视觉输入中重建全局密集的3D点云图。我们首先使用DINO骨干网络提取每张图像的视觉特征，然后采用交替的同视角局部注意力、跨视角空间注意力和跨帧时间注意力来推断图像间的几何关系。接着，我们使用多个解码头在第一帧的 ego 坐标系中解码全局点云图，并为每一帧计算 ego 姿态。与依赖精确相机参数的传统方法不同，DVGT 不需要显式的3D几何先验，能够灵活处理任意的相机配置。DVGT 直接从图像序列中预测出度量标定的几何结构，消除了与外部传感器进行后对齐的需求。DVGT 在 nuScenes、OpenScene、Waymo、KITTI 和 DDAD 等多种驾驶数据集上进行训练，显著优于现有模型在各种场景中的表现。代码可在 https://github.com/wzzheng/DVGT 获取。

Summary / 总结

DVGT (Driving Visual Geometry Transformer) reconstructs a global dense 3D point map from a sequence of unposed multi-view images for autonomous driving. It extracts per-image features with a DINO backbone and infers geometric relations by alternating intra-view local, cross-view spatial, and cross-frame temporal attention. The model predicts metric-scaled geometry and per-frame ego poses directly from image sequences without relying on precise camera parameters, and it outperforms existing models across various scenarios.

DVGT旨在从视觉输入中感知和重建自动驾驶中的3D场景几何。它使用Driving Visual Geometry Transformer通过局部、空间和时间注意力机制来推断图像间的几何关系。DVGT无需依赖精确的相机参数即可直接从图像序列中预测出度量标尺的几何结构，其在各种场景中的性能优于现有模型。
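
The three attention types in the abstract differ mainly in which tokens are allowed to attend to each other. The sketch below expresses that as reshapes of a `(frames, views, tokens, channels)` array around a single unparameterised self-attention. It is only a shape-level illustration: the grouping used for the temporal step, the absence of learned projections and residuals, and all function names are assumptions, not details taken from the DVGT code.

```python
import numpy as np


def self_attention(x):
    """Plain softmax self-attention over axis 1 of a (batch, tokens, channels) array."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def alternating_attention_block(tokens):
    """tokens: (T frames, V views, N patch tokens, C channels)."""
    T, V, N, C = tokens.shape
    # Intra-view local attention: each image attends only to its own patches.
    x = self_attention(tokens.reshape(T * V, N, C)).reshape(T, V, N, C)
    # Cross-view spatial attention: all views of the same frame attend jointly.
    x = self_attention(x.reshape(T, V * N, C)).reshape(T, V, N, C)
    # Cross-frame temporal attention: each camera attends across its frames
    # (assumed grouping; the abstract does not specify it).
    x = x.transpose(1, 0, 2, 3).reshape(V, T * N, C)
    x = self_attention(x).reshape(V, T, N, C).transpose(1, 0, 2, 3)
    return x
```

Each step keeps the attention window much smaller than full attention over all `T * V * N` tokens, which is the usual motivation for factorising attention this way.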

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Project page: https://auditdm.github.io/

Abstract

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.

中文标题/摘要

标题：有意义的差异：审计模型以发现和纠正能力差距

传统的多模态LLM评估方法缺乏可解释性，往往无法充分揭示模型之间的显著能力差距。为解决这一问题，我们引入了AuditDM，这是一种自动化的框架，通过审计模型之间的差异来主动发现和纠正其失败模式。AuditDM通过强化学习微调一个LLM作为审计器，生成能够最大化目标模型之间分歧的具有挑战性的问题和反事实图像。训练完成后，审计器能够揭示多样且可解释的示例，揭示模型的弱点，并作为无需标注的数据用于纠正。当应用于如Gemma-3和PaliGemma-2等最先进的模型时，AuditDM发现了超过20种不同的失败类型。基于这些发现的微调在16个基准测试上均能提高所有模型的表现，并使一个3B模型超越了其28B的对照组。我们的结果表明，在数据规模效应减弱时，有针对性的模型审计为模型诊断和改进提供了一条有效途径。

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

First: 2025-12-18T18:59:55+00:00 · Latest: 2025-12-18T18:59:55+00:00

Comments: Project page: https://github.com/CYWang735/AdaTooler-V

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

中文标题/摘要

标题：AdaTooler-V：自适应图像和视频工具使用

最近的研究表明，多模态大型语言模型（MLLM）从多模态交替的思维链（CoT）与视觉工具交互中受益。然而，现有的开源模型经常表现出盲目的工具使用推理模式，即使在不需要时也调用视觉工具，这显著增加了推理开销并降低了模型性能。为此，我们提出了AdaTooler-V，这是一种MLLM，能够根据视觉问题是否真正需要工具来进行自适应工具使用。首先，我们引入了AT-GRPO，这是一种基于每个样本的工具收益分数调整奖励尺度的强化学习算法，鼓励模型仅在工具提供真正改进时才调用工具。此外，我们构建了两个数据集以支持训练：AdaTooler-V-CoT-100k 用于SFT冷启动，AdaTooler-V-300k 用于具有可验证奖励的强化学习，涵盖单图像、多图像和视频数据。在十二个基准测试中的实验表明，AdaTooler-V 具有强大的推理能力，在各种视觉推理任务中优于现有方法。值得注意的是，AdaTooler-V-7B 在高分辨率基准测试 V* 中的准确率为 89.8%，超过了商业专有模型 GPT-4o 和 Gemini 1.5 Pro。所有代码、模型和数据均已发布。

Summary / 总结

AdaTooler-V is an MLLM that invokes vision tools only when a visual problem actually requires them. It is trained with AT-GRPO, a reinforcement learning algorithm that adjusts reward scales according to each sample's Tool Benefit Score, so that tool calls are encouraged only when they provide genuine improvements. Across twelve benchmarks AdaTooler-V outperforms existing methods on diverse visual reasoning tasks, and AdaTooler-V-7B reaches 89.8% accuracy on the V* benchmark, surpassing commercial models such as GPT-4o and Gemini 1.5 Pro.

AdaTooler-V 是一种 MLLM，能够根据视觉问题是否需要工具来执行自适应工具使用。它使用 AT-GRPO 强化学习算法根据工具效益评分调整奖励尺度，鼓励模型仅在必要时调用工具。实验表明，AdaTooler-V 在各种视觉推理任务中表现出色，高分辨率基准 V* 的准确率达到 89.8%。
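
The abstract does not define the Tool Benefit Score or how AT-GRPO scales rewards with it. The toy below shows one way such a scheme could behave, purely as an assumption: the score is taken to be the accuracy gap between tool-using and tool-free rollouts on a sample, and it is added as a weighted bonus or penalty only when tools are invoked. The function names, the definition of the score, and the `weight` parameter are all hypothetical.

```python
def tool_benefit_score(acc_with_tools, acc_without_tools):
    """Hypothetical per-sample score: how much tool use helps on this sample."""
    return acc_with_tools - acc_without_tools


def shaped_reward(correct, used_tools, benefit, weight=0.5):
    """Hypothetical reward: correctness plus a benefit-scaled term for tool calls."""
    base = 1.0 if correct else 0.0
    if not used_tools:
        return base
    # Tool calls are rewarded when they help and penalised when they do not.
    return base + weight * benefit


helpful = tool_benefit_score(acc_with_tools=0.9, acc_without_tools=0.4)
useless = tool_benefit_score(acc_with_tools=0.6, acc_without_tools=0.7)
```

With these assumed definitions, a correct answer that used tools on a sample where tools do not help earns less than the same answer without tools, which is the qualitative behaviour the abstract describes.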

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

First: 2025-12-18T18:59:54+00:00 · Latest: 2025-12-18T18:59:54+00:00

Abstract

Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.

中文标题/摘要

标题：生成对抗推理器：通过对抗强化学习增强LLM推理能力

具有明确推理能力的大语言模型（LLMs）在数学推理方面表现出色，但仍会犯过程错误，如错误计算、脆弱逻辑和表面上合理但实际上无效的步骤。在本文中，我们介绍了生成对抗推理器，这是一种通过对抗强化学习共同进化LLM推理器和基于LLM的鉴别器的在策略联合训练框架，旨在通过逻辑完整且长度相近的推理链片段进行计算高效审查，鉴别器使用简洁的结构化证明来评估每个片段的合理性。学习结合互补信号：LLM推理器因逻辑一致且得出正确答案的步骤而获得奖励，而鉴别器因正确检测错误或在推理过程中区分痕迹而获得奖励。这产生了密集、校准良好的在策略步骤级奖励，补充稀疏的精确匹配信号，改善了信用分配，提高了样本效率，并增强了LLM的整体推理质量。在各种数学基准测试中，该方法在标准RL后训练中相对于强基线实现了持续改进。具体而言，在AIME24上，我们使DeepSeek-R1-Distill-Qwen-7B从54.0提高到61.3（+7.3），DeepSeek-R1-Distill-Llama-8B从43.7提高到53.7（+10.0）。模块化的鉴别器还使教师蒸馏、偏好对齐和基于数学证明的推理等目标的奖励塑造变得灵活。

Summary / 总结

This paper introduces Generative Adversarial Reasoner, an on-policy joint training framework that enhances LLM reasoning through adversarial reinforcement learning. The method co-evolves an LLM reasoner and an LLM-based discriminator that scores slices of each reasoning chain, yielding dense step-level rewards that reduce process errors. Across mathematical benchmarks the approach consistently outperforms strong baselines; on AIME24 it raises DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3 points) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0 points).

本文提出了生成对抗推理器，该框架通过对抗强化学习增强LLM的推理能力。它通过共同进化LLM推理器和鉴别器来提高逻辑一致性和减少过程错误。该方法使用高效的审查计划来分割推理链，并提供密集且校准良好的奖励。实验结果显示，在AIME24上，该方法使DeepSeek-R1-Distill-Qwen-7B从54.0提高到61.3（+7.3分），使DeepSeek-R1-Distill-Llama-8B从43.7提高到53.7（+10.0分）。
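
The AIME24 figures in the abstract are differences between scores, so they are gains in points rather than relative percentages. The short calculation below makes the distinction explicit using only the four numbers quoted in the abstract; the helper name `gains` is illustrative.

```python
def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


aime24 = {
    "DeepSeek-R1-Distill-Qwen-7B": (54.0, 61.3),
    "DeepSeek-R1-Distill-Llama-8B": (43.7, 53.7),
}
report = {model: gains(*scores) for model, scores in aime24.items()}
# Qwen-7B:  +7.3 points, about 13.5% relative
# Llama-8B: +10.0 points, about 22.9% relative
```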

StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors

Authors: Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen

First: 2025-12-18T18:59:50+00:00 · Latest: 2025-12-18T18:59:50+00:00

Abstract

The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.

中文标题/摘要

标题：StereoPilot：通过生成先验学习统一高效的立体图转换

随着立体显示的迅速发展，包括VR头显和3D影院，对高质量立体视频内容的需求不断增加。然而，制作3D视频仍然成本高昂且复杂，而单目到立体的自动转换受到多阶段“深度扭曲修补”(DWI)管道的限制。该范式存在误差传播、深度歧义和并行与汇聚立体配置格式不一致的问题。为了解决这些挑战，我们引入了UniStereo，这是首个大规模统一的立体视频转换数据集，涵盖了两种立体格式，以实现公平基准测试和稳健模型训练。基于此数据集，我们提出了StereoPilot，一种高效的前馈模型，可以直接合成目标视图，而无需依赖显式的深度图或迭代扩散采样。配备可学习的领域转换器和循环一致性损失，StereoPilot能够无缝适应不同的立体格式，并实现更好的一致性。大量实验表明，StereoPilot在视觉保真度和计算效率方面显著优于现有最先进的方法。项目页面：https://hit-perfect.github.io/StereoPilot/

Summary / 总结

The paper addresses the challenge of producing high-quality stereo video content for emerging stereoscopic displays. It introduces UniStereo, a large-scale unified dataset for stereo video conversion, and proposes StereoPilot, an efficient feed-forward model that directly synthesizes the target view without using explicit depth maps or iterative diffusion sampling. Experiments show that StereoPilot outperforms existing methods in both visual fidelity and computational efficiency.

论文旨在解决为新兴立体显示生成高质量立体视频内容的挑战。它引入了UniStereo，一个大规模统一的立体视频转换数据集，并提出了StereoPilot，一个高效的前馈模型，可以直接合成目标视图，而无需使用显式的深度图或迭代扩散采样。实验表明，StereoPilot在视觉保真度和计算效率方面都优于现有方法。

Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi

First: 2025-12-18T18:59:29+00:00 · Latest: 2025-12-18T18:59:29+00:00

Comments: Project Page: https://insta360-research-team.github.io/DAP_website/

Abstract

In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/

中文标题/摘要

标题：深度全景图：一种全景深度估计的基础模型

在本文中，我们提出了一种全景度量深度基础模型，该模型适用于不同场景距离的泛化。我们从数据构建和框架设计的角度探索了数据在环 paradigm。我们通过结合公共数据集、我们UE5模拟器和文本到图像模型生成的高质量合成数据以及网络上的真实全景图像收集了一个大规模数据集。为了减少室内/室外和合成/真实数据之间的领域差距，我们引入了一个三阶段伪标签校准流水线，以生成未标记图像的可靠地面真相。对于模型，我们采用DINOv3-Large作为主干，因其强大的预训练泛化能力，并引入了即插即用的距离掩码头、锐度为中心的优化和几何为中心的优化，以提高对不同距离的鲁棒性并确保视图之间的几何一致性。在多个基准测试（例如Stanford2D3D、Matterport3D和Deep360）上的实验表明，该模型具有强大的性能和零样本泛化能力，在各种真实世界场景中具有特别鲁棒和稳定的度量预测。项目页面可访问：https://insta360-research-team.github.io/DAP_website/

Summary / 总结

This paper introduces a panoramic metric depth foundation model designed to generalize across various scene distances. The authors use a data-in-the-loop approach, combining public datasets, high-quality synthetic data, and real images. They introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. The model, based on DINOv3-Large, includes a range mask head and optimization techniques to improve robustness and enforce geometric consistency. Experiments show strong performance and zero-shot generalization across multiple benchmarks, with particularly robust predictions in diverse real-world scenes.

该研究提出了一种全景度量深度基础模型，适用于不同场景距离。作者采用数据闭环方法，结合公开数据集、UE5模拟器生成的合成数据和网络上的真实图像。为了提高鲁棒性，他们实施了三阶段伪标签校准管道和范围掩码头、基于锐度的优化和基于几何的优化。实验结果显示，该模型在多个基准上的性能强大且具有零样本泛化能力，在各种真实世界场景中具有特别鲁棒和稳定的预测能力。

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

First: 2025-12-18T18:59:27+00:00 · Latest: 2025-12-18T18:59:27+00:00

Comments: 35 pages

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.

中文标题/摘要

标题：探索 vs 开发：通过剪裁、熵和虚假奖励重新思考可验证奖励强化学习（RLVR）

本文探讨了可验证奖励强化学习（RLVR）中的探索-开发权衡问题，该框架旨在提高大型语言模型（LLMs）的推理能力。近期研究表明，RLVR可以通过两个看似矛盾的机制激发LLMs的强大数学推理能力：虚假奖励通过奖励与真实结果无关的结果来抑制开发，而熵最小化则通过促使模型产生更自信和确定性的输出来抑制探索。这揭示了一个令人困惑的动态：抑制开发和抑制探索都能提高推理性能，但调和这两种效应的基本原理仍不甚明了。我们关注两个基本问题：（i）策略熵与性能的关系，（ii）虚假奖励是否能带来收益，可能是通过剪裁偏差和模型污染的相互作用。我们的结果显示，在虚假奖励下，剪裁偏差降低了策略熵，导致更自信和确定性的输出，而仅通过熵最小化无法实现改进。我们进一步提出一个奖励错配模型，解释为什么虚假奖励在受污染的设置之外也能增强性能。我们的发现阐明了虚假奖励益处背后的机制，并为更有效的RLVR训练提供了原则。

Summary / 总结

This paper investigates the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), focusing on the roles of spurious rewards and entropy minimization. The study finds that clipping bias under spurious rewards reduces policy entropy, producing more confident and deterministic outputs, while entropy minimization alone is insufficient for performance improvement. The paper proposes a reward-misalignment model to explain how spurious rewards can enhance performance beyond contaminated settings, clarifying the mechanisms behind spurious-reward benefits.

该论文研究了RLVR（带有可验证奖励的强化学习）框架中的探索与利用权衡问题，探讨了虚假奖励和熵最小化对性能的影响，发现虚假奖励减少了策略的熵，导致更自信的输出，而仅靠熵最小化不足以提高性能。研究提出了一种奖励错配模型来解释虚假奖励如何超越污染环境提升性能，为RLVR训练提供了机制上的见解。
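
Two quantities carry the argument in this abstract: the clipping applied to policy updates and the entropy of the policy. The sketch below writes out the standard PPO-style clipped surrogate and the Shannon entropy of a token distribution. It assumes that the "clipping" in the abstract refers to this importance-ratio clipping, as used in PPO and GRPO-style training; the paper's exact objective is not given in the abstract.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style objective term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action or token distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())


uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])   # maximal for 4 actions
peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])    # confident policy
```

The clip is asymmetric in effect: with a positive advantage the term stops growing once the ratio exceeds `1 + eps`, and with a negative advantage it stops once the ratio falls below `1 - eps`. Lower entropy corresponds to the "more confident and deterministic outputs" mentioned in the abstract.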

Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

Authors: Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine

First: 2025-12-18T18:59:17+00:00 · Latest: 2025-12-18T18:59:17+00:00

Abstract

Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.

中文标题/摘要

标题：后验行为克隆：为高效的RL微调预训练BC策略

从机器人学到语言的各个领域来看，标准做法是首先在大规模演示数据集上预训练一个策略，然后使用强化学习（RL）对该策略进行微调，以提高在部署领域的性能。这一微调步骤对于实现人类或超人类的性能至关重要，然而，尽管已经给予了开发更有效的微调算法更多的关注，但很少关注确保预训练策略是RL微调的有效初始化。在这项工作中，我们旨在理解预训练策略如何影响微调性能，并如何预训练策略以确保它们是有效的微调初始化。我们首先证明，标准的行为克隆（BC）——训练策略直接匹配演示者的行为——可能会失败，无法确保覆盖演示者的行为，这是有效RL微调的必要条件。然后我们证明，如果我们不是精确地拟合观察到的演示，而是训练一个策略来建模给定演示数据集的演示者行为的后验分布，我们确实可以获得一个确保覆盖演示者行为的策略，从而实现更有效的微调。此外，这种策略——我们称之为后验行为克隆（PostBC）策略——在确保预训练性能不低于BC策略的前提下实现了这一点。然后我们证明，PostBC可以通过现代生成模型在机器人控制领域实际实现——仅依赖标准的监督学习——并在现实的机器人控制基准测试和实际的机器人操作任务中，与标准的行为克隆相比，显著提高了RL微调性能。

Summary / 总结

This paper addresses the issue of initializing reinforcement learning (RL) policies with pretraining methods, focusing on behavioral cloning (BC). It demonstrates that standard BC can fail to ensure coverage over the demonstrator's actions, which is crucial for effective RL finetuning. The authors propose a new method called posterior behavioral cloning (PostBC), which models the posterior distribution of the demonstrator's behavior. This approach ensures better coverage and leads to improved RL finetuning performance, both in benchmarks and real-world robotic manipulation tasks, compared to standard BC.

该研究关注预训练策略以提高强化学习（RL）微调的效果，重点在于行为克隆（BC）方法的有效性。研究表明，标准的BC方法可能无法确保覆盖演示者的动作，这对于RL微调至关重要。作者提出了一种新的方法，称为后验行为克隆（PostBC），该方法根据演示数据集模型演示者的后验行为分布。这种方法确保了覆盖范围，并在基准测试和真实世界机器人操作任务中实现了比标准BC更好的RL微调性能。
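
The coverage argument can be seen in a toy tabular setting. With few demonstrations in a state, a policy fit by maximum likelihood puts zero probability on any action the demonstrator happened not to play, whereas a policy built from a posterior over the demonstrator's behaviour keeps some mass on it. The sketch below uses a Dirichlet prior over a discrete action set as a stand-in; this is an assumed toy model chosen for illustration, not the generative-model implementation described in the paper.

```python
import numpy as np


def bc_policy(action_counts):
    """Maximum-likelihood behavioural cloning: empirical action frequencies."""
    counts = np.asarray(action_counts, dtype=float)
    return counts / counts.sum()


def posterior_mean_policy(action_counts, prior=1.0):
    """Posterior predictive under a symmetric Dirichlet(prior) over actions."""
    counts = np.asarray(action_counts, dtype=float)
    return (counts + prior) / (counts.sum() + prior * counts.size)


# Four demonstrations in one state; the demonstrator never played action 1 here.
counts = [3, 0, 1]
bc = bc_policy(counts)
post = posterior_mean_policy(counts)
```

In this toy, `bc` assigns probability zero to action 1, so on-policy RL finetuning started from it would never sample that action, while `post` still does. As more demonstrations arrive the two policies converge.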

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath

First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00

Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.

中文标题/摘要

标题：MomaGraph：面向具身任务规划的、基于视觉语言模型的状态感知统一场景图

家庭中的移动机械臂必须同时导航和操作。这需要一种紧凑且语义丰富的场景表示，能够捕捉物体的位置、功能以及哪些部分可以操作。场景图是一个自然的选择，但先前的工作往往将空间关系和功能关系分开处理，将场景视为静态快照，不包含物体状态或时间更新，也忽略了与当前任务相关的最重要信息。为了解决这些限制，我们引入了MomaGraph，这是一种将空间功能关系和部分级交互元素整合在一起的统一场景表示。然而，要推进这种表示需要合适的数据和严格的评估，这些方面目前仍然缺乏。因此，我们贡献了MomaGraph-Scenes，这是第一个包含丰富注释、任务驱动的场景图的大规模数据集，以及MomaGraph-Bench，这是一个涵盖从高层规划到细粒度场景理解的六个推理能力的系统评估套件。在此基础上，我们进一步开发了MomaGraph-R1，这是一种7B参数的视觉语言模型，通过强化学习在MomaGraph-Scenes上进行训练。MomaGraph-R1预测任务导向的场景图，并在Graph-then-Plan框架下作为零样本任务规划器。广泛的实验表明，我们的模型在开源模型中达到了最先进的结果，准确率达到71.6%（比最佳基线高11.4%），并且在公共基准测试中具有泛化能力，并且能够有效地转移到真实机器人实验中。

Summary / 总结

MomaGraph addresses the limitations of previous scene graph representations by integrating spatial-functional relationships and part-level interactive elements. The paper also introduces MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities. The model, MomaGraph-R1, a 7B vision-language model trained with reinforcement learning, predicts task-oriented scene graphs and serves as a zero-shot task planner. Experiments show that MomaGraph-R1 reaches 71.6% accuracy on the benchmark, the best result among open-source models and +11.4% over the best baseline.

MomaGraph通过整合空间-功能关系和部分级交互元素解决了先前场景图表示的局限性，并引入了MomaGraph-Scenes，这是一个包含丰富注释和任务驱动的场景图的大规模数据集，适用于家庭环境。模型MomaGraph-R1是一个7B的视觉-语言模型，通过强化学习训练，可以预测任务导向的场景图，并作为零样本任务规划器。实验表明，MomaGraph-R1在基准测试中的准确率达到71.6%，比之前的最佳基线模型高出11.4%。

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem

First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00

Abstract

We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.

中文标题/摘要

标题：SceneDiff：一种多视角物体变化检测基准和方法

我们研究了在不同时间同一场景的两组捕获（图像或视频）之间识别已添加、移除或移动的物体的问题。检测此类变化对于许多应用非常重要，例如机器人整理或建筑进度和安全监控。主要挑战是不同视角的变化可能导致物体错误地被检测为变化。我们引入了SceneDiff基准，这是第一个包含物体实例注释的多视角变化检测基准，包含350个多样化的视频对，数千个变化的物体。我们还引入了SceneDiff方法，这是一种新的无需训练的多视角物体变化检测方法，利用预训练的3D、分割和图像编码模型来稳健地预测多个基准。该方法在3D中对齐捕获，提取物体区域，并比较空间和语义区域特征以检测变化。在多视角和两视角基准上的实验表明，我们的方法在现有方法的基础上取得了显著的性能提升（相对AP改进94%和37.4%）。基准和代码将公开发布。

Summary / 总结

The paper addresses the challenge of detecting changes in objects between two captures of the same scene taken at different times, which is crucial for applications like robotic tidying and construction monitoring. To tackle the issue of viewpoint variations leading to false detections, the authors introduce the SceneDiff Benchmark, a multiview change detection benchmark with object instance annotations, and the SceneDiff method, a training-free approach that uses pretrained 3D, segmentation, and image encoding models to align captures, extract object regions, and compare spatial and semantic features. The method significantly outperforms existing approaches on both multi-view and two-view benchmarks, achieving 94% and 37.4% relative AP improvements respectively.

该研究解决了同一场景在不同时间点的两次捕获中物体变化的检测问题，这对于机器人整理和建筑监控等应用至关重要。为应对视角变化导致的误检测问题，作者引入了SceneDiff基准，这是一个包含物体实例注释的多视角变化检测基准，以及SceneDiff方法，这是一种无需训练的检测方法，利用预训练的3D、分割和图像编码模型对捕获进行对齐，提取物体区域，并比较空间和语义特征。该方法在多视角和两视角基准上的表现显著优于现有方法，分别实现了94%和37.4%的相对AP提升。

Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Authors: Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu

First: 2025-12-18T18:59:01+00:00 · Latest: 2025-12-18T18:59:01+00:00

Comments: Project website: https://egoman-project.github.io

Abstract

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

中文标题/摘要

标题：从推理到运动：基于第一人称人类互动视频的3D手部轨迹预测学习

先前的3D手部轨迹预测工作受限于将运动与语义监督脱钩的数据集以及弱化推理与动作联系的模型。为解决这些问题，我们首先提出了EgoMAN数据集，这是一个用于交互阶段感知的3D手部轨迹预测的大规模第一人称数据集，包含21.9万条6自由度轨迹和300万条结构化问答对，用于语义、空间和运动推理。我们随后引入了EgoMAN模型，这是一种通过轨迹标记接口将视觉语言推理与运动生成联系起来的推理到运动框架。通过逐步训练使推理与运动动力学对齐，我们的方法能够生成准确且阶段感知的轨迹，并在真实场景中具有泛化能力。

Summary / 总结

The research addresses the limitations of existing datasets and models in 3D hand trajectory prediction by introducing the EgoMAN dataset and model. The EgoMAN dataset includes 219K 6DoF trajectories and 3M structured QA pairs for reasoning, while the EgoMAN model connects vision-language reasoning with motion generation through a trajectory-token interface, improving the alignment of reasoning with motion dynamics and achieving accurate and stage-aware trajectories with generalization across scenes.

研究旨在通过解决现有数据集和模型的局限性，提高3D手部轨迹预测的准确性。研究引入了EgoMAN数据集，包含219K 6DoF轨迹和3M结构化问答对用于推理，并提出了EgoMAN模型，这是一种将视觉语言推理与运动生成链接的推理到运动框架。该模型逐步使推理与运动动力学对齐，从而实现了准确且场景通用的阶段感知轨迹预测。

VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Authors: Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

First: 2025-12-18T18:58:42+00:00 · Latest: 2025-12-18T18:58:42+00:00

Abstract

Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
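The "relative rewards" in Group Relative Policy Optimization come from normalizing each sample's reward against the other samples generated for the same prompt. The sketch below shows only that normalization step; the reward values are hypothetical, and the abstract does not say how Edit-GRPO computes or combines its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four candidate edits of the same (video, instruction) pair.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Edits that score above the group mean get a positive advantage and are reinforced; those below it get a negative one. If all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.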

中文标题/摘要

标题：VIVA：基于VLM引导的指令驱动视频编辑与奖励优化

基于指令的视频编辑旨在根据自然语言指令修改输入视频，同时保持内容保真度和时间连贯性。然而，现有的基于扩散的方法通常是在简单的编辑操作配对数据上进行训练，这从根本上限制了它们对多样且复杂的现实世界指令的泛化能力。为了解决这一泛化差距，我们提出了一种可扩展的基于指令的视频编辑框架VIVA，该框架利用VLM引导编码和奖励优化。首先，我们引入了一种基于VLM的指令器，将文本指令、源视频的第一帧和可选的参考图像编码为视觉接地的指令表示，为扩散变换器骨干网络提供精细的空间和语义上下文。其次，我们提出了一种后训练阶段Edit-GRPO，将组相对策略优化适应到视频编辑领域，直接通过相对奖励优化模型以实现指令忠实、内容保真且具有审美吸引力的编辑。此外，我们还提出了一种数据构建管道，用于合成生成基本编辑操作的多样且高保真配对视频-指令数据。大量实验表明，VIVA在指令跟随、泛化能力和编辑质量方面优于现有最先进的方法。网站：https://viva-paper.github.io

Summary / 总结

The research aims to improve instruction-based video editing by addressing the limitations of existing methods that rely on paired simple editing data. VIVA, a scalable framework, uses a VLM-based instructor to encode textual instructions and visual references, providing detailed context for a diffusion transformer. The post-training Edit-GRPO stage optimizes the model using relative rewards for better fidelity, content preservation, and aesthetic quality. Experiments demonstrate that VIVA outperforms existing methods in following instructions, generalization, and editing quality.

VIVA 是一种基于指令的视频编辑框架，通过 VLM 指导的编码器将文本指令和视频帧编码为视觉相关的表示，并通过 Edit-GRPO 后训练阶段直接优化模型以生成忠实于指令、内容保真且具有美感的编辑。实验结果显示，VIVA 在指令遵循、泛化能力和编辑质量方面优于现有方法。

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu

First: 2025-12-18T18:56:05+00:00 · Latest: 2025-12-18T18:56:05+00:00

Abstract

Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
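The weighted blending in overlapping window regions can be illustrated with a small sketch. It assumes fixed window positions, one scalar "latent" per frame, and triangular weights; the paper's dynamic scheme and its actual weighting are not described in the abstract, so treat every name and choice here as an assumption.

```python
def blend_windows(windows, starts, total_len):
    """Blend per-window frame values into one sequence.

    windows:   list of lists, one value per frame in each window
    starts:    index of the first frame of each window in the full sequence
    total_len: length of the full sequence (every frame must be covered)
    """
    acc = [0.0] * total_len
    weight_sum = [0.0] * total_len
    for win, start in zip(windows, starts):
        n = len(win)
        for i, value in enumerate(win):
            # Triangular weight: small at the window edges, largest in the middle,
            # so each window fades out where the next one fades in.
            w = min(i + 1, n - i)
            acc[start + i] += w * value
            weight_sum[start + i] += w
    return [a / w for a, w in zip(acc, weight_sum)]


# Two windows of four frames overlapping on frames 2 and 3.
blended = blend_windows([[1.0] * 4, [3.0] * 4], starts=[0, 2], total_len=6)
```

In the overlap the output moves gradually from the first window's value toward the second's instead of jumping at the boundary.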

中文标题/摘要

标题：FlashPortrait：6倍速无限肖像动画的自适应潜在预测

当前基于扩散的长肖像动画加速方法难以保证身份一致性。本文提出FlashPortrait，这是一种端到端的视频扩散变换器，能够合成保持身份、无限长度的视频，同时实现高达6倍的推理速度加速。具体而言，FlashPortrait首先使用现成的提取器计算身份无关的表情特征。然后引入归一化表情特征块，通过将表情特征归一化到各自的均值和方差，以改善面部建模中的身份稳定性。在推理过程中，FlashPortrait采用动态滑动窗口方案并在重叠区域进行加权融合，确保长动画中的平滑过渡和身份一致性。在每个上下文窗口中，根据特定时间步的潜在变化率和扩散层间导数幅度比，FlashPortrait利用当前时间步的高阶潜在导数直接预测未来时间步的潜在值，从而跳过多个去噪步骤，实现6倍速度加速。基准实验表明，FlashPortrait在定性和定量上均有效。

Summary / 总结

FlashPortrait aims to generate identity-preserving, infinite-length portrait animations with up to 6x inference acceleration. It uses a Normalized Facial Expression Block to align facial features with diffusion latents, and a dynamic sliding-window scheme during inference to ensure smooth transitions and ID consistency. The method predicts future latents using higher-order derivatives, skipping denoising steps and achieving significant speedup. Experiments demonstrate its effectiveness in maintaining identity stability and accelerating inference.

FlashPortrait旨在提高长肖像动画的速度和身份一致性。它使用视频扩散变换器来合成保持身份的视频，并实现高达6倍的推理加速。该方法包括计算面部表情特征、对其进行归一化，并使用动态滑动窗口方案以实现平滑过渡。在推理过程中，它利用高阶潜在导数直接预测未来时间步的潜在表示，跳过若干去噪步骤，从而实现显著的速度提升。
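One way to picture "predicting future latents from higher-order derivatives" is a second-order extrapolation built from cached latents. The sketch below uses backward finite differences as a stand-in for the derivatives and assumes equally spaced timesteps. It does not model the adaptive criteria (latent variation rate, derivative magnitude ratio) that the abstract mentions, and it is not claimed to be the paper's formula.

```python
def extrapolate_latent(history, steps_ahead=1):
    """Predict a future latent from the last three cached latents.

    history: list of latents (each a list of floats) at equally spaced
             timesteps, oldest first; at least three entries are required.
    """
    older, previous, current = history[-3], history[-2], history[-1]
    s = steps_ahead
    predicted = []
    for a, b, c in zip(older, previous, current):
        d1 = c - b              # first backward difference (slope estimate)
        d2 = c - 2 * b + a      # second backward difference (curvature estimate)
        predicted.append(c + s * d1 + s * (s + 1) / 2 * d2)
    return predicted


# A one-dimensional latent following t**2 at t = 0, 1, 2.
next_latent = extrapolate_latent([[0.0], [1.0], [4.0]], steps_ahead=1)
skip_two = extrapolate_latent([[0.0], [1.0], [4.0]], steps_ahead=2)
```

For a trajectory that is exactly quadratic the prediction is exact; for real diffusion latents it is an approximation whose error grows with `steps_ahead`, which is presumably why an adaptive rule for how many steps to skip is needed.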

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad

First: 2025-12-18T18:56:04+00:00 · Latest: 2025-12-18T18:56:04+00:00

Comments: Code and data available at https://github.com/facebookresearch/MMRB2

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
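The two quantities discussed above, judge accuracy on preference pairs and Best-of-N selection, can be sketched in a few lines. The judge here is a toy keyword counter invented for illustration; MMRB2's actual judges are multimodal models, and its handling of ties is not stated in the abstract.

```python
def pairwise_accuracy(judge, pairs):
    """Fraction of (preferred, rejected) pairs where the judge scores the preferred one higher."""
    hits = sum(1 for preferred, rejected in pairs if judge(preferred) > judge(rejected))
    return hits / len(pairs)


def best_of_n(judge, candidates):
    """Return the candidate the judge scores highest."""
    return max(candidates, key=judge)


# Toy judge (hypothetical): counts how many required keywords a response mentions.
def toy_judge(response):
    return sum(word in response for word in ("red", "cube", "table"))


pairs = [
    ("a red cube on a table", "a blue ball"),
    ("red cube", "a cube on a table"),   # the toy judge ties here, so this counts as a miss
    ("table with a red cube", "empty room"),
]
accuracy = pairwise_accuracy(toy_judge, pairs)
picked = best_of_n(toy_judge, ["a blue ball", "a red cube on a table", "a cube"])
```

A judge with higher pairwise accuracy should, on average, pick better candidates in `best_of_n`, which is the link to downstream task success that the abstract reports.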

中文标题/摘要

标题：Multimodal RewardBench 2：评估处理交错文本和图像的全能奖励模型

奖励模型（RMs）对于训练大规模语言模型（LLMs）至关重要，但针对处理交错图像和文本序列的全能模型，奖励模型仍缺乏充分研究。我们引入了Multimodal RewardBench 2（MMRB2），这是第一个全面评估奖励模型在多模态理解和（交错）生成方面的基准。MMRB2 包含四个任务：文本到图像、图像编辑、交错生成和多模态推理（“用图像思考”），每个任务提供 1,000 对专家标注的偏好对，这些偏好对来自 23 个模型和智能体，覆盖 21 个源任务。MMRB2 设计有：(1) 实用但具有挑战性的提示；(2) 来自最先进的模型和智能体的响应；以及 (3) 通过集成筛选策略精心挑选的具有强烈人类专家共识的偏好对。使用 MMRB2，我们研究了每个子任务的现有评判者，包括多模态 LLM 作为评判者和使用人类偏好训练的模型。最新的 Gemini 3 Pro 达到 75-80% 的准确率。GPT-5 和 Gemini 2.5 Pro 达到 66-75% 的准确率，而人类的准确率超过 90%，但超过了广泛使用的 GPT-4o（59%）。最佳开源模型 Qwen3-VL-32B 达到与 Gemini 2.5 Flash（64%）相似的准确率。我们还展示了在 Best-of-N 采样下，MMRB2 上的表现与下游任务的成功率高度相关，并进行了深入分析，指出了未来改进奖励模型的关键方向。

Summary / 总结

The paper introduces Multimodal RewardBench 2 (MMRB2), a comprehensive benchmark for evaluating reward models on multimodal understanding and generation tasks. It includes four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning. Using MMRB2, the study evaluates existing judges: Gemini 3 Pro reaches 75-80% accuracy, and GPT-5 and Gemini 2.5 Pro reach 66-75%, compared with over 90% for humans and 59% for the widely used GPT-4o. The best open-source model, Qwen3-VL-32B, performs similarly to Gemini 2.5 Flash (64%). The study also shows that MMRB2 performance correlates strongly with downstream task success under Best-of-N sampling, and it identifies areas for improvement in reward models.

研究引入了Multimodal RewardBench 2 (MMRB2)，用于评估奖励模型在多模态理解和交错生成任务中的表现。该基准包括四个任务，每个任务有1,000个专家标注的偏好对。使用MMRB2，研究发现Gemini 3 Pro的准确率为75-80%，GPT-5和Gemini 2.5 Pro的准确率为66-75%，超过了GPT-4o（59%），但低于人类的准确率（超过90%）。最佳开源模型Qwen3-VL-32B的准确率与Gemini 2.5 Flash相当。研究还显示，MMRB2的表现与下游任务的成功有很强的相关性。

Sceniris: A Fast Procedural Scene Generation Framework

Authors: Jinghuan Shang, Harsh Patel, Ran Gong, Karl Schmeckpeper

First: 2025-12-18T18:55:03+00:00 · Latest: 2025-12-18T18:55:03+00:00

Comments: Code is available at https://github.com/rai-inst/sceniris

Abstract

Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at https://github.com/rai-inst/sceniris
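The idea of batch sampling followed by collision filtering can be shown with a toy 2D version. This is only a sketch under strong assumptions: objects are discs on a rectangle, collision checking is a pure-Python distance test rather than cuRobo's GPU routines, and the function name and parameters are invented for illustration.

```python
import math
import random


def place_discs(radii, width, height, batch_size=64, seed=0):
    """Place discs one at a time; each draws a batch of candidates and keeps the first free one."""
    rng = random.Random(seed)
    placed = []  # (x, y, r) of accepted discs
    for r in radii:
        # Draw the whole batch up front, then filter it against everything placed so far.
        batch = [(rng.uniform(r, width - r), rng.uniform(r, height - r))
                 for _ in range(batch_size)]
        free = [(x, y) for x, y in batch
                if all(math.hypot(x - px, y - py) >= r + pr for px, py, pr in placed)]
        if free:
            placed.append((free[0][0], free[0][1], r))
        # If no candidate is free the disc is skipped; a real system might resample.
    return placed


scene = place_discs([0.5, 0.3, 0.4], width=10.0, height=10.0)
```

Evaluating a whole batch of candidates per object, instead of resampling one pose at a time until it is valid, is what makes this pattern easy to vectorize on a GPU.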

中文标题/摘要

标题：Sceniris：一种快速的程序化场景生成框架

合成3D场景对于开发物理AI和生成模型至关重要。现有的程序化生成方法通常输出吞吐量较低，成为大规模数据集创建的瓶颈。在此工作中，我们介绍了Sceniris，这是一种高效的程序化场景生成框架，用于快速生成大规模、无碰撞的场景变化。Sceniris还提供可选的机器人可达性检查，为机器人任务提供可操作的场景。Sceniris通过解决先前方法Scene Synthesizer的主要性能限制，设计为最大程度地提高效率。利用批处理采样和cuRobo中的更快碰撞检测，Sceniris在Scene Synthesizer的基础上实现了至少234倍的加速。Sceniris还扩展了先前工作中的对象级空间关系，以支持多样化的场景需求。我们的代码可在https://github.com/rai-inst/sceniris获取

Summary / 总结

Sceniris is a fast procedural scene generation framework designed to address the low throughput of existing methods, which is a bottleneck in scaling up dataset creation for Physical AI and generative models. By leveraging batch sampling and faster collision checking, Sceniris achieves at least 234x speed-up over Scene Synthesizer and can generate large-scale, collision-free scene variations. Additionally, it supports robot reachability checks for manipulation-feasible scenes, expanding the spatial relationships available in prior work to meet diverse scene requirements.

Sceniris 是一个快速的程序化场景生成框架，旨在解决现有方法低吞吐量的问题，这是在物理 AI 和生成模型中扩展数据集创建的瓶颈。通过利用批量采样和更快的碰撞检测，Sceniris 在 Scene Synthesizer 上实现了至少 234 倍的加速，并能生成大规模、无碰撞的场景，同时提供可执行的机器人操作场景。它通过扩展先前工作中的对象级空间关系来支持多样化的场景需求。

Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

Authors: Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello, Koki Nagano

First: 2025-12-18T18:53:28+00:00 · Latest: 2025-12-18T18:53:28+00:00

Comments: Project website is https://research.nvidia.com/labs/amri/projects/instant4d

Abstract

Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d

Summary / 总结

This paper addresses the challenge of creating 3D-consistent and fast expressive head avatars by combining the strengths of 2D diffusion-based methods and 3D-aware feedforward encoders. The method distills knowledge from a 2D diffusion-based model into a feedforward encoder, which converts an input image into a 3D-consistent, fast, and expressive animatable representation. The approach achieves high animation expressivity through an efficient local fusion strategy, running at 107.31 FPS and surpassing alternative designs that trade speed for quality or vice versa, while maintaining comparable animation quality to state-of-the-art methods.

本文旨在解决创建3D一致且具表现力的头像以进行肖像动画的问题。提出了一种方法，通过将2D扩散模型的知识注入轻量级编码器，结合了2D扩散模型和3D感知前馈方法的优点。该表示可以即时将图像转换为3D一致的可动画化头像，具有高表现力。该方法以107.31 FPS的速度运行，其动画质量与最先进的方法相当，优于其他在速度和质量之间妥协的设计。

LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu

First: 2025-12-18T18:52:18+00:00 · Latest: 2025-12-18T18:52:18+00:00

Abstract

Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
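A cross-layer fusion gate of the kind described above can be sketched as a softmax over per-layer scores followed by a weighted sum of layer features. In this sketch the gate scores are passed in directly, whereas a real mixture-of-experts router would compute them with learned parameters; the function names and the optional top-k step are assumptions, not the paper's design.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, gate_scores, top_k=None):
    """Weighted sum of per-layer feature vectors, optionally keeping only the top-k layers."""
    weights = softmax(gate_scores)
    if top_k is not None:
        keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
        kept_mass = sum(weights[i] for i in keep)
        # Zero out the dropped layers and renormalize the rest to sum to 1.
        weights = [w / kept_mass if i in keep else 0.0 for i, w in enumerate(weights)]
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats)) for d in range(dim)]


# Two hypothetical layers with 2-dimensional features.
shallow, deep = [1.0, 0.0], [0.0, 1.0]
balanced = fuse_layers([shallow, deep], gate_scores=[0.0, 0.0])
deep_only = fuse_layers([shallow, deep], gate_scores=[0.0, 5.0], top_k=1)
```

Equal scores average the layers, while a sparse gate selects a single level of abstraction; inspecting the gate weights is one plausible route to the interpretability studies the abstract mentions.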

中文标题/摘要

标题：LinkedOut：从视频LLM中提取世界知识表示以实现下一代视频推荐

视频大型语言模型（VLLMs）通过在互联网规模数据上进行预训练，解锁了对视频的理解能力，并已在电影分析和视频问答等任务上展示了潜力。然而，将VLLMs部署到视频推荐等下游任务中仍然具有挑战性，因为实际系统需要多视频输入、轻量级骨干网络、低延迟序列推理和快速响应。实践中，(1) 只解码生成会导致序列推理的高延迟，(2) 传统接口不支持多视频输入，(3) 将输出限制为语言会丢弃对下游视觉任务重要的细粒度视觉细节。我们认为这些限制源于缺乏一种同时保留像素级细节并利用世界知识的表示。我们提出了LinkedOut，一种直接从视频中提取VLLM世界知识的表示，以实现快速推理、支持多视频历史记录，并去除语言瓶颈。LinkedOut 使用VLLMs从原始帧中提取语义上合理的、知识导向的标记，这些标记由可提示查询和可选辅助模态引导。我们引入了一种跨层知识融合MoE，从丰富的VLLM特征中选择适当的抽象层次，从而实现个性化、可解释和低延迟的推荐。据我们所知，LinkedOut 是第一个在没有手工制作标签的情况下直接在原始帧上操作的VLLM基视频推荐方法，实现了标准基准上的最佳结果。解释性研究和消融实验证实了层多样性及层内融合的好处，指出了一个实用的路径，该路径充分利用了VLLM世界知识先验和视觉推理，以实现推荐等下游视觉任务。

Summary / 总结

The paper introduces LinkedOut, a method that addresses the challenges of deploying Video Large Language Models (VLLMs) for video recommendation by extracting world knowledge directly from raw video frames. This approach enables fast inference, supports multi-video histories, and avoids the language bottleneck. Key experimental findings show that LinkedOut achieves state-of-the-art results on standard benchmarks, with interpretability studies and ablations confirming the benefits of layer diversity and layer-wise fusion.

LinkedOut通过引入一种结合像素级细节和世界知识的新表示，解决了将Video Large Language Models (VLLMs)应用于视频推荐的挑战。它使用VLLMs从原始视频帧中提取知识感知的令牌，并支持多视频输入以实现快速推理。LinkedOut在标准基准上取得了最先进的结果，并能够实现个性化、可解释和低延迟的推荐。

M-PhyGs: Multi-Material Object Dynamics from Video

Authors: Norika Wada, Kohei Yamashita, Ryo Kawahara, Ko Nishino

First: 2025-12-18T18:50:08+00:00 · Latest: 2025-12-18T18:50:08+00:00

Abstract

Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.

中文标题/摘要

标题：M-PhyGs: 视频中多材料物体动力学

了解支配现实世界物体动力学的物理材料属性对于准确预测其对未见交互的响应是必要的。现有方法从视觉数据估计此类物理材料参数假设均匀单一材料物体、预学习的动力学或简单拓扑结构。然而，现实世界中的物体通常在材料组成和几何形状上复杂，超出了这些假设的范围。在本文中，我们特别关注花作为代表性常见物体。我们引入了多材料物理高斯分布（M-PhyGs）来从视频中估计此类多材料复杂自然物体的材料组成和参数。从自然环境中拍摄的短视频，M-PhyGs 联合分割物体为相似材料并恢复其连续力学参数，同时考虑重力。M-PhyGs 通过引入新的级联3D和2D损失以及利用时间小批量处理高效地实现这一点。我们引入了一个数据集 Phlowers，其中包含人们与花的互动，作为评估这一具有挑战性的多材料物理参数估计任务准确性的新平台。Phlowers 数据集上的实验结果证明了 M-PhyGs 及其组件的准确性和有效性。

Summary / 总结

The research aims to accurately estimate the physical material properties of complex multi-material objects, such as flowers, from video data. The method, Multi-material Physical Gaussians (M-PhyGs), jointly segments the object into similar materials and recovers their mechanical parameters while accounting for gravity. M-PhyGs uses cascaded 3D and 2D losses and temporal mini-batching for efficient estimation. Experiments on the Phlowers dataset show the accuracy and effectiveness of M-PhyGs and its components.

本文解决了从视频中准确估计复杂多材料物体（特别是花朵）的物理材料属性的需求。M-PhyGs 方法联合分割物体为相似材料，并恢复它们的力学参数，同时考虑重力。该方法利用级联的3D和2D损失以及时间小批量处理进行高效计算。引入了Phlowers数据集，包含人们与花朵的互动，以评估估计准确性，展示了该方法的有效性。

PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

Authors: Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, Karl Pertsch

First: 2025-12-18T18:49:41+00:00 · Latest: 2025-12-18T18:49:41+00:00

Comments: Website: https://polaris-evals.github.io/

Abstract

A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, limited reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
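The claim about correlation between simulated and real-world performance can be made concrete with a Pearson coefficient over paired success rates. The numbers below are made up for illustration and are not results from the paper, and the abstract does not say which correlation metric the authors report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Made-up success rates for four hypothetical policies; not results from the paper.
sim_success = [0.10, 0.35, 0.60, 0.80]
real_success = [0.15, 0.30, 0.55, 0.85]
r = pearson(sim_success, real_success)
```

A value of `r` close to 1 means that ranking policies in simulation would largely reproduce their real-world ranking, which is what makes a simulated benchmark useful as a proxy.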

中文标题/摘要

标题：PolaRiS：通用机器人策略的可扩展实况至模拟评估

机器人学习研究的一个重大挑战是如何准确衡量和比较机器人策略的性能。由于真实世界中的策略执行（rollout）具有随机性、难以复现且耗时，机器人领域的基准测试历来十分困难。对于最近的通用策略，这一挑战更为突出，因为这些策略需要在各种场景和任务中进行评估。在模拟中进行评估为现实世界评估提供了可扩展的补充，但现有模拟基准与现实世界之间的视觉和物理领域差距使得它们无法成为策略改进的可靠信号。此外，构建现实且多样的模拟环境通常需要大量的人力和专业知识。为了解决这一差距，我们引入了Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS)，一种用于高保真模拟机器人评估的可扩展实况至模拟框架。PolaRiS 利用神经重建方法将现实世界场景的短视频扫描转换为交互式模拟环境。此外，我们开发了一种简单的模拟数据协同训练方法，以弥合剩余的实况至模拟差距，并在未见过的模拟环境中实现零样本评估。通过广泛的模拟与现实世界的配对评估，我们证明了PolaRiS评估与现实世界通用策略性能的相关性远强于现有模拟基准。其简单性还使得能够快速创建多样化的模拟环境。因此，这项工作朝着为下一代机器人基础模型提供分布式和普及化的评估迈进了一步。

Summary / 总结

PolaRiS is a scalable framework for evaluating robot policies in simulation by reconstructing real-world scenes into interactive simulations. It uses neural reconstruction methods to convert video scans of real-world scenes into simulated environments and develops a simple co-training recipe to bridge real-to-sim gaps. Extensive evaluations show that PolaRiS provides a stronger correlation to real-world performance compared to existing benchmarks and enables rapid creation of diverse simulated environments.

PolaRiS 是一个可扩展的框架，通过神经重建方法将真实场景转换为高保真模拟环境，从而实现对通用机器人策略的准确和可靠评估。通过与真实世界的配对评估，PolaRiS 展示了与真实世界性能更强的相关性，相比现有的模拟基准。它还简化了多样模拟环境的创建，促进了机器人策略在各种场景和任务中的评估。

Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation

Authors: Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch

First: 2025-12-18T18:49:33+00:00 · Latest: 2025-12-18T18:49:33+00:00

Comments: Under Review

Abstract

Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
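Component (iii), feature-based re-identification with temporal voting, can be sketched as follows: each frame after the occlusion votes for the stored identity whose prototype feature is most similar, and the majority wins. The prototype vectors, the cosine similarity, and the `min_sim` threshold are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reidentify(frame_features, prototypes, min_sim=0.5):
    """Majority vote over frames; each frame votes for its most similar identity prototype."""
    votes = Counter()
    for feat in frame_features:
        best_id, best_sim = max(
            ((name, cosine(feat, proto)) for name, proto in prototypes.items()),
            key=lambda item: item[1],
        )
        if best_sim >= min_sim:  # frames that match nothing well do not vote
            votes[best_id] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical 2-D features for two instruments stored before the occlusion.
prototypes = {"grasper": [1.0, 0.0], "scissors": [0.0, 1.0]}
after_occlusion = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]
identity = reidentify(after_occlusion, prototypes)
```

Voting over several frames means that one ambiguous frame, such as the third one here, does not flip the recovered identity.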

中文标题/摘要

标题：用于遮挡鲁棒手术器械分割的记忆增强SAM3

内窥镜视频中手术器械的准确分割对于计算机辅助干预至关重要，但由于频繁的遮挡、快速运动、镜面伪影以及器械长时间离开后重新进入视野，这一任务仍然具有挑战性。尽管SAM3为视频目标分割提供了强大的时空框架，但其在手术场景中的性能受限于不加区分的记忆更新、固定的记忆容量以及遮挡后较弱的身份恢复能力。我们提出了ReMeDI-SAM3，这是一种无需训练的SAM3记忆增强扩展，通过三个组件解决上述限制：(i) 相关性感知的记忆过滤，并配备专门的遮挡感知记忆以存储遮挡前的帧；(ii) 分段插值方案，用于扩展有效记忆容量；(iii) 基于特征的重识别模块，结合时间投票实现可靠的遮挡后身份消歧。这些组件共同减轻了误差累积，并使模型在遮挡后能够可靠恢复。在零样本设置下，ReMeDI-SAM3在EndoVis17和EndoVis18上相对于原始SAM3的绝对mcIoU提升分别约为7%和16%，甚至优于先前基于训练的方法。项目页面：https://valaybundele.github.io/remedi-sam3/

Summary / 总结

The research aims to improve the accuracy of surgical instrument segmentation in endoscopic videos, which is essential for computer-assisted interventions. The method, ReMeDI-SAM3, is a training-free extension of SAM3 that adds relevance-aware memory filtering with a dedicated occlusion-aware memory, a piecewise interpolation scheme that expands the effective memory capacity, and a feature-based re-identification module with temporal voting. These components address indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. In a zero-shot setting, ReMeDI-SAM3 achieves absolute mcIoU improvements of around 7% on EndoVis17 and 16% on EndoVis18 over vanilla SAM3, outperforming even prior training-based approaches.

研究旨在提高内窥镜视频中手术器械分割的准确性，这对于计算机辅助干预至关重要。方法ReMeDI-SAM3通过引入相关性感知的记忆过滤器、分段插值方案以及基于特征的重新识别模块来增强SAM3。这些组件解决了诸如非区分性记忆更新和遮挡后弱身份恢复等问题。实验结果表明，该方法在EndoVis17和EndoVis18数据集上的平均类别IoU有了显著提升，分别提高了约7%和16%，超过了之前的训练基方法。
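
The temporal voting step named in the abstract can be illustrated with a small sketch. The abstract does not spell out the voting rule, so this assumes plain majority voting over per-frame identity guesses; the function name `vote_identity`, the `min_votes` threshold, and the instrument labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def vote_identity(frame_predictions, min_votes=1):
    """Return the most frequent label in a window of per-frame guesses.

    Returns None when the window is empty or when the winning label has
    fewer than `min_votes` votes (an assumed rejection rule).
    """
    if not frame_predictions:
        return None
    label, count = Counter(frame_predictions).most_common(1)[0]
    return label if count >= min_votes else None


# Hypothetical per-frame guesses for one instrument that re-enters the view.
window = ["forceps", "scissors", "forceps", "forceps", "scissors"]
decision = vote_identity(window, min_votes=3)
```

Voting over a short window trades a few frames of latency for stability: a single noisy frame cannot flip the identity on its own.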

Core-Set Selection for Data-efficient Land Cover Segmentation

Authors: Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hansch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakus, Paul L. Rosin

First: 2025-05-02T12:22:08+00:00 · Latest: 2025-12-18T18:47:38+00:00

Abstract

The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook issues related to data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. Towards this, in this paper, we introduce six basic core-set selection approaches -- that rely on imagery only, labels only, or a combination of both -- and investigate whether they can identify high-quality subsets of data capable of maintaining -- or even surpassing -- the performance achieved when using full datasets for remote sensing semantic segmentation. We benchmark such approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future works. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.

中文标题/摘要

标题：基于核心集选择的数据高效土地覆盖分割

遥感数据的日益可获取性和其在支持大规模决策中的潜力，推动了用于地球观测任务的深度学习模型的发展。传统上，这些模型依赖于大规模数据集。然而，更大的训练数据集会导致更好性能的普遍假设往往忽略了数据冗余、噪声以及处理大规模数据集的计算成本问题。因此，有效的解决方案不仅要考虑数据的数量，还要考虑数据的质量。为此，本文介绍了六种基本的核心集选择方法——仅依赖影像、仅依赖标签或两者结合，并探讨它们是否能够识别出高质量的数据子集，这些子集能够保持甚至超越使用完整数据集进行遥感语义分割时所达到的性能。我们使用两种不同的架构（SegFormer和U-Net）在三个广泛使用的土地覆盖分类数据集（DFC2022、Vaihingen和Potsdam）上将这些方法与两种传统基线进行基准测试，从而为未来的工作建立了一个通用基准。我们的实验表明，所有提出的方法在多个数据子集大小上都持续优于基线，有些方法甚至选择的核心集的性能超过了使用所有可用数据进行训练。值得注意的是，在DFC2022数据集上，仅包含训练数据的25%的选定子集在SegFormer上的性能略高于使用整个数据集进行训练。这一结果表明了数据为中心的学习在遥感领域的重要性和潜力。相关代码可在https://github.com/keillernogueira/data-centric-rs-classification/ 获取。

Summary / 总结

This paper addresses data redundancy, noise, and computational cost in remote sensing semantic segmentation by introducing six core-set selection approaches that rely on imagery only, labels only, or both. The methods are benchmarked against two traditional baselines on three land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two architectures, SegFormer and U-Net. The experiments show that the proposed methods consistently outperform the baselines across multiple subset sizes. On DFC2022, a selected subset of only 25% of the training data yields slightly higher SegFormer performance than training on the full dataset. This highlights the potential of data-centric learning in remote sensing tasks.

本文通过引入六种核心集选择方法，解决了遥感任务中的数据冗余和计算成本问题，旨在识别高质量的数据子集，使其能够保持或超越全数据集的表现。在DFC2022、Vaihingen和Potsdam三个土地覆盖分类数据集上使用SegFormer和U-Net进行的实验表明，所有提出的方案都优于传统基线，有些方法即使使用全数据集的25%也能达到更好的性能。
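
To make the idea of core-set selection concrete, here is a minimal sketch of k-center greedy selection, a generic imagery-only strategy. It is swapped in purely for illustration: the abstract does not describe the six approaches, so this is not one of the paper's methods. The 2-D `features` are made-up stand-ins for image embeddings.

```python
import math


def k_center_greedy(features, budget, start=0):
    """Pick `budget` indices that spread out over the feature space."""
    if not 0 < budget <= len(features):
        raise ValueError("budget must be between 1 and len(features)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < budget:
        # Add the point that is currently farthest from the selection.
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected


# Made-up 2-D "embeddings" of eight training tiles.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0),
            (0.0, 5.0), (10.0, 0.0), (0.05, 0.05), (5.0, 5.1)]
subset = k_center_greedy(features, budget=len(features) // 4)  # 25% budget
```

Near-duplicate tiles (indices 0, 1, and 6 here) are unlikely to be chosen together, which is the redundancy argument the abstract makes for training on a subset.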

TACE: A unified Irreducible Cartesian Tensor Framework for Atomistic Machine Learning

Authors: Zemin Xu, Wenbo Xie, Daiqian Xie, P. Hu

First: 2025-09-18T13:51:07+00:00 · Latest: 2025-12-18T18:43:46+00:00

Abstract

Here, we introduce the Tensor Atomic Cluster Expansion (TACE), a unified framework formulated entirely in Cartesian space, enabling systematic and consistent prediction of arbitrary structure-dependent tensorial properties. TACE achieves this by decomposing atomic environments into a complete hierarchy of irreducible Cartesian tensors, ensuring symmetry-consistent representations that naturally encode invariance and equivariance constraints. Beyond geometry, TACE incorporates universal embeddings that flexibly integrate diverse attributes including computational levels, charges, magnetic moments and field perturbations. This allows explicit control over external invariants and equivariants in the prediction process. Long-range interactions are also accurately described through the Latent Ewald Summation module within the short-range approximation, providing a rigorous yet computationally efficient treatment of electrostatic and dispersion effects. We demonstrate that TACE attains accuracy, stability, and efficiency on par with or surpassing leading equivariant frameworks across finite molecules and extended materials. This includes in-domain and out-of-domain benchmarks, spectra, Hessian, external-field responses, charged and magnetic systems, multi-fidelity training, heterogeneous catalysis, and even superior performance within the uMLIP benchmark. Crucially, TACE bridges scalar and tensorial modeling and establishes a Cartesian-space paradigm that unifies and extends beyond the design space of spherical-tensor-based methods. This work lays the foundation for a new generation of universal atomistic machine learning models capable of systematically capturing the rich interplay of geometry, fields and material properties within a single coherent framework.

中文标题/摘要

标题：TACE：统一的不可约笛卡尔张量框架用于原子机器学习

在此，我们介绍张量原子簇展开（TACE），这是一种完全在笛卡尔空间中构建的统一框架，能够系统且一致地预测任意依赖于结构的张量性质。TACE 通过将原子环境分解为完整的不可约笛卡尔张量层次结构来实现这一点，确保表示与对称性一致，并自然地编码不变性和等变性约束。除几何结构外，TACE 还引入了通用嵌入，可灵活整合计算水平、电荷、磁矩和场扰动等多种属性，从而在预测过程中显式控制外部不变量和等变量。长程相互作用则通过短程近似中的潜在厄瓦尔求和（Latent Ewald Summation）模块得到准确描述，为静电和色散效应提供了严格且计算高效的处理。我们证明，在有限分子和扩展材料上，TACE 的准确度、稳定性和效率与领先的等变框架相当或更优。这涵盖域内和域外基准测试、光谱、海森矩阵、外场响应、带电和磁性系统、多保真度训练、多相催化，甚至在 uMLIP 基准测试中表现更优。至关重要的是，TACE 连接了标量建模与张量建模，并建立了一种笛卡尔空间范式，统一并超越了基于球张量方法的设计空间。这项工作为新一代通用原子机器学习模型奠定了基础，这类模型能够在单一连贯的框架内系统地捕捉几何、场和材料性质之间丰富的相互作用。

Summary / 总结

TACE is a unified framework for predicting tensorial properties in atomistic systems by decomposing atomic environments into irreducible Cartesian tensors, ensuring symmetry-consistent representations. It incorporates universal embeddings for diverse attributes and uses the Latent Ewald Summation module for accurate long-range interactions. TACE demonstrates high accuracy, stability, and efficiency across various benchmarks, surpassing or matching leading equivariant frameworks in finite molecules, extended materials, and other applications.

TACE 是一个统一框架，通过将原子环境分解为不可约笛卡尔张量来预测张量性质，确保对称一致表示。它包含用于多种属性的通用嵌入，并使用 Latent Ewald 总和模块来准确描述长程相互作用。TACE 在各种基准测试中表现出高精度、稳定性和效率，超越或匹配领先对称框架在有限分子、扩展材料和其他应用中的表现。
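
As a small illustration of what "irreducible Cartesian tensors" means, the sketch below splits a rank-2 Cartesian tensor into its three irreducible parts: isotropic (1 component), antisymmetric (3 components), and symmetric traceless (5 components). This is the standard textbook decomposition, not TACE's implementation; the function name and the example matrix are made up.

```python
def decompose_rank2(t):
    """Split a 3x3 tensor into isotropic, antisymmetric, and symmetric-traceless parts."""
    trace = sum(t[i][i] for i in range(3))
    # Isotropic part: (trace / 3) times the identity, invariant under rotation.
    iso = [[trace / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Antisymmetric part: transforms like a (pseudo)vector.
    anti = [[(t[i][j] - t[j][i]) / 2 for j in range(3)] for i in range(3)]
    # Symmetric traceless part: the remaining five independent components.
    sym0 = [[(t[i][j] + t[j][i]) / 2 - iso[i][j] for j in range(3)]
            for i in range(3)]
    return iso, anti, sym0


t = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
iso, anti, sym0 = decompose_rank2(t)
```

The three parts sum back to the original tensor, and each one transforms within itself under rotations, which is what lets a model handle them as separate equivariant channels.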

Pixel Seal: Adversarial-only training for invisible image and video watermarking

Authors: Tomáš Souček, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu, Tuan Tran, Tom Sander, Alexandre Mourachko

First: 2025-12-18T18:42:19+00:00 · Latest: 2025-12-18T18:42:19+00:00

Comments: Code and model available at https://github.com/facebookresearch/videoseal

Abstract

Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.

中文标题/摘要

标题：像素封印：仅对抗训练的隐形图像和视频水印

隐形水印对于追踪数字内容的来源至关重要。然而，训练最先进的模型仍然非常困难，当前的方法往往难以在鲁棒性和真正的不可感知性之间取得平衡。本研究提出了Pixel Seal（像素封印），为图像和视频水印确立了新的最先进水平。我们首先指出现有方法的三个基本问题：(i) 依赖MSE和LPIPS等代理感知损失，这些损失无法模拟人类感知，导致可见的水印伪影；(ii) 目标相互冲突导致优化不稳定，因而需要详尽的超参数调优；(iii) 将模型扩展到高分辨率图像和视频时，水印的鲁棒性和不可感知性下降。为了克服这些问题，我们首先提出了一种仅使用对抗损失的训练范式，去除了不可靠的逐像素不可感知性损失。其次，我们引入了三阶段训练计划，通过解耦鲁棒性和不可感知性来稳定收敛。第三，我们通过高分辨率适应来弥合分辨率差距，采用基于JND的衰减并在训练时模拟推理过程，以消除上采样伪影。我们在不同类型的图像和多种变换上全面评估了Pixel Seal的鲁棒性和不可感知性，结果显示其相对于现有最先进方法有明显改进。最后，我们证明该模型可通过时间水印池化高效地适应视频，使Pixel Seal成为在真实图像和视频场景中实现可靠溯源的实用且可扩展的解决方案。

Summary / 总结

Pixel Seal addresses the challenges in training invisible watermarking models by proposing an adversarial-only training method and a three-stage training schedule. It also introduces high-resolution adaptation techniques to improve robustness and imperceptibility. Experimental results show that Pixel Seal outperforms existing methods in both image and video watermarking, demonstrating clear improvements in robustness and imperceptibility across various transformations and image types.

Pixel Seal通过提出对抗训练范式、三阶段训练计划和高分辨率适应技术来解决不可见水印模型训练的挑战。该方法显著提高了鲁棒性和不可感知性，为图像和视频水印设定了新的最先进的水平。广泛的评估表明，该方法在各种变换和图像类型上明显优于现有方法。

Sequencing to Mitigate Catastrophic Forgetting in Continual Learning

Authors: Hesham G. Moussa, Aroosa Hameed, Arashmid Akhavain

First: 2025-12-18T18:40:58+00:00 · Latest: 2025-12-18T18:40:58+00:00

Comments: The Manuscript is submitted for review under IEEE Transactions on Artificial intelligence

Abstract

To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting (CF) is a major challenge to the progress of continual learning approaches, where learning a new task usually results in a dramatic performance drop on previously learned ones. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.

中文标题/摘要

标题：序列化以减轻连续学习中的灾难性遗忘

为了应对现实世界中的动态变化，智能系统需要在其生命周期中逐步获取、更新和利用知识。这种能力被称为连续学习，为AI系统提供了自我适应发展的基础。灾难性遗忘是连续学习方法发展中的一大挑战，通常学习新任务会导致之前学习任务性能的大幅下降。已经提出了许多方法来对抗灾难性遗忘的影响。大多数提出的方案可以分为五类：重放基、正则化基、优化基、表示基和架构基。在本工作中，我们从一个不同的角度来解决这个问题，具体来说是通过考虑任务呈现给模型时的最佳顺序。我们研究了任务序列在减轻灾难性遗忘中的作用，并提出了一种确定最佳任务顺序的方法。所提出的方法利用了受神经架构搜索（NAS）启发的零样本评分算法。结果表明，智能任务序列可以显著减少灾难性遗忘。此外，当与传统的连续学习策略结合使用时，序列化可以提供增强的性能和更好的遗忘抗性。此外，所提出的方法还可以在其他领域找到应用，例如课程学习。

Summary / 总结

This paper addresses the challenge of catastrophic forgetting in continual learning by focusing on the optimal sequencing of tasks. It proposes a method using zero-shot scoring algorithms inspired by neural architecture search to determine the best task order. The results show that intelligent task sequencing can significantly reduce catastrophic forgetting and improve performance when combined with traditional continual learning strategies.

本文解决了持续学习中灾难性遗忘的问题，即学习新任务往往会导致之前学习任务性能下降。作者提出了一种新颖的方法，重点关注任务的最优排序以减轻这一问题。通过使用受神经架构搜索启发的零样本评分算法，他们确定了任务的最佳顺序。结果表明，这种方法可以显著减少灾难性遗忘，并且与其他持续学习策略结合使用时，可以提高性能和抗遗忘的鲁棒性。该方法在课程学习等领域具有潜在应用价值。
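
Choosing a task order by score can be sketched as a search over permutations. The abstract does not describe the zero-shot scoring algorithm, so the `smoothness` scorer and the `difficulty` values below are made-up placeholders, and `best_task_order` is a hypothetical helper rather than the authors' method.

```python
from itertools import permutations


def best_task_order(tasks, score_order):
    """Return the ordering with the highest score (brute force, O(n!))."""
    return max(permutations(tasks), key=score_order)


# Made-up per-task difficulty used only by the toy scorer below.
difficulty = {"digits": 1, "letters": 2, "objects": 4}


def smoothness(order):
    """Toy stand-in for a zero-shot score: prefer small jumps between neighbors."""
    return -sum(abs(difficulty[a] - difficulty[b])
                for a, b in zip(order, order[1:]))


order = best_task_order(list(difficulty), smoothness)
```

Exhaustive search is only feasible for a handful of tasks because the number of orderings grows factorially; the point of a cheap training-free scorer is that each candidate order can be ranked without running a full continual-learning experiment.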

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00

Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io

Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

中文标题/摘要

标题：RePlan：基于推理的区域规划方法用于复杂指令驱动的图像编辑

指令驱动的图像编辑允许通过自然语言控制视觉修改，但现有模型在指令视觉复杂性（IV-复杂性）场景下表现不佳，即复杂的指令与杂乱或模糊的场景相遇时。我们提出了RePlan（区域对齐规划），这是一种计划-执行框架，结合了视觉语言规划器和扩散编辑器。规划器通过逐步推理将指令分解，并明确地将它们与目标区域关联；编辑器然后使用无训练注意力区域注入机制应用更改，从而实现精确、并行的多区域编辑，无需迭代修复。为了增强规划，我们使用基于GRPO的强化学习应用1000个仅指令示例，显著提高了推理准确性和格式可靠性。我们还介绍了IV-Edit基准，专注于精细的区域定位和知识密集型编辑。在IV-复杂设置中，RePlan始终优于大型数据集训练的强大基线，提高了区域精度和整体保真度。我们的项目页面：https://replan-iv-edit.github.io

Summary / 总结

RePlan is a plan-then-execute framework for instruction-based image editing that addresses the challenges of complex instructions and visual scenes. It uses a vision-language planner to decompose instructions and ground them to specific regions, followed by a diffusion editor that applies changes without iterative inpainting. The planner leverages GRPO-based reinforcement learning to enhance reasoning fidelity. Experiments show that RePlan outperforms strong baselines in handling intricate instructions and improving regional precision and overall image fidelity in complex settings.

RePlan 是一种用于基于指令的图像编辑的计划-执行框架，旨在解决复杂指令和视觉场景的挑战。它使用视觉语言规划器将指令分解并明确地定位到特定区域，随后使用无需迭代修复的扩散编辑器应用更改。规划器利用基于 GRPO 的强化学习来增强推理准确性。实验表明，RePlan 在处理复杂指令和提高区域精度及整体图像保真度方面优于强大的基线模型。

ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

Authors: Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett

First: 2025-12-18T18:32:39+00:00 · Latest: 2025-12-18T18:32:39+00:00

Abstract

Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at https://reinforcegen.github.io/

中文标题/摘要

标题：ReinforceGen：结合自动数据生成和强化学习的混合技能策略

长时程操作一直是机器人领域的长期挑战。我们提出了一种名为ReinforceGen的系统，该系统结合了任务分解、数据生成、模仿学习和运动规划，形成初始解决方案，并通过基于强化学习的微调改进每个组件。ReinforceGen首先将任务分割为多个局部技能，这些技能通过运动规划连接。技能和运动规划目标使用来自10个人类演示生成的数据集进行模仿学习训练，然后通过在线适应和强化学习进行微调。在Robosuite数据集上进行基准测试时，ReinforceGen在最高重置范围设置下使用视知觉控制达到80%的成功率。额外的消融研究显示，我们的微调方法平均提高了89%的性能。更多结果和视频请参见https://reinforcegen.github.io/

Summary / 总结

ReinforceGen is designed to address the challenge of long-horizon manipulation in robotics by integrating task decomposition, data generation, imitation learning, and motion planning. It segments tasks into localized skills connected by motion planning, which are initially trained with imitation learning on a dataset generated from human demonstrations. The system then refines these components through reinforcement learning. Experiments on the Robosuite dataset demonstrate that ReinforceGen achieves an 80% success rate with visuomotor controls in the highest reset range setting, with ablation studies indicating an 89% average performance increase from fine-tuning approaches.

ReinforceGen旨在通过结合任务分解、数据生成、模仿学习和运动规划来解决机器人领域的长期操作挑战。它将任务分解为多个局部技能，并通过运动规划连接这些技能，初始训练使用来自10个人类演示生成的数据集进行模仿学习。然后通过强化学习进一步微调这些组件，使其在Robosuite数据集中的所有任务中，使用视觉和运动控制时的成功率达到80%。消融研究显示，微调方法平均提高了89%的性能。

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Authors: Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad

First: 2025-12-18T18:26:56+00:00 · Latest: 2025-12-18T18:26:56+00:00

Abstract

Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

中文标题/摘要

标题：GenEval 2：解决文本到图像评估基准漂移问题

自动化文本到图像（T2I）模型评估具有挑战性：必须使用裁判模型来评判正确性，并且所选的测试提示需要对当前T2I模型具有挑战性，但对裁判模型而言不能过难。我们认为，满足这些约束可能导致基准随时间漂移，即静态的基准裁判无法跟上新模型的能力。我们表明，基准漂移对GenEval（最受欢迎的T2I基准之一）而言是一个严重问题。尽管GenEval在发布时与人类判断高度一致，但随着时间的推移，它已大幅偏离人类判断，对当前模型的绝对误差高达17.7%。这种程度的漂移强烈表明GenEval已经饱和了一段时间，我们通过大规模人工研究验证了这一点。为了填补这一评估缺口，我们引入了新的基准GenEval 2，它改进了对基本视觉概念的覆盖，并具有更高的组合性，我们证明它对当前模型更具挑战性。我们还引入了Soft-TIFA，这是一种用于GenEval 2的评估方法，它综合了对各视觉基本概念的判断；我们证明该方法与人类判断更为一致，并认为与VQAScore等更整体化的裁判相比，它不太可能随时间偏离人类判断。尽管我们希望GenEval 2能在多年内提供一个强有力的基准，但基准漂移远非必然可以避免；更一般地说，我们的工作强调了对T2I及相关自动化模型评估基准进行持续审核和改进的重要性。

Summary / 总结

The research addresses benchmark drift in Text-to-Image (T2I) model evaluation, where a static benchmark judge fails to keep up with newer model capabilities. The study shows that the original GenEval has drifted far from human judgment, with an absolute error of as much as 17.7% for current models. To fill this gap, the authors introduce GenEval 2, which improves coverage of primitive visual concepts and has higher degrees of compositionality, and a new evaluation method, Soft-TIFA, which combines judgments for visual primitives and is better aligned with human judgment. The authors argue that Soft-TIFA is less prone to drift over time than more holistic judges such as VQAScore. The work highlights the need for continual audits and improvements in T2I benchmarks.

研究通过引入GenEval 2，改进了视觉概念覆盖和组合性，解决了Text-to-Image (T2I)模型评估中的基准漂移问题。研究显示，原始的GenEval已经与人类判断产生了显著偏差，当前模型的绝对误差高达17.7%。为此，提出了GenEval 2和新的评估方法Soft-TIFA，这两种方法被证明与人类判断更一致，且更不易随时间偏离人类一致性。研究强调了持续审计和改进自动化模型评估基准的重要性。
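
The two quantities discussed above, a score that combines per-primitive judgments and the absolute error between an automatic score and a human score, can be sketched as follows. The abstract does not give the Soft-TIFA formula, so the arithmetic and geometric means here are assumptions, and all function names and numbers are illustrative only.

```python
import math


def combine_primitive_scores(probs, mode="mean"):
    """Combine per-primitive judge probabilities into one prompt-level score.

    Both aggregation modes are assumptions for illustration, not the
    formula from the paper.
    """
    if not probs:
        raise ValueError("need at least one primitive judgment")
    if mode == "mean":
        return sum(probs) / len(probs)
    if mode == "geometric":
        return math.prod(probs) ** (1 / len(probs))
    raise ValueError(f"unknown mode: {mode}")


def absolute_error(auto_score, human_score):
    """Gap between an automatic benchmark score and a human-rated score."""
    return abs(auto_score - human_score)


# Made-up judgments for three primitives in one prompt (object, color, count).
soft_score = combine_primitive_scores([0.9, 0.8, 1.0])
strict_score = combine_primitive_scores([0.25, 1.0], mode="geometric")
gap = absolute_error(0.75, 0.5)
```

A geometric mean punishes a single failed primitive more than an arithmetic mean does, which matters for highly compositional prompts where every element has to be right.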

Developing Distance-Aware, and Evident Uncertainty Quantification in Dynamic Physics-Constrained Neural Networks for Robust Bearing Degradation Estimation

Authors: Waleed Razzaq, Yun-Bo Zhao

First: 2025-12-09T11:30:41+00:00 · Latest: 2025-12-18T18:26:21+00:00

Comments: Under review at Structural health Monitoring - SAGE

Abstract

Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA, XJTU-SY and HUST datasets and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.

中文标题/摘要

标题：在动态物理约束神经网络中开发距离感知和明确的不确定性量化以实现稳健的轴承退化估计

准确且具有不确定性意识的退化估计对于安全关键系统（如带有滚动轴承的旋转机械）的预测维护至关重要。许多现有的不确定性方法缺乏信心校准，运行成本高，不具有距离感知能力，并且在处理离域数据时无法泛化。我们引入了两种距离感知的不确定性方法，适用于确定性物理引导神经网络：基于谱归一化高斯过程的PG-SNGP和基于深度证据回归的PG-SNER。我们对隐藏层应用谱归一化，使网络保持从输入到潜在空间的距离。PG-SNGP用高斯过程层替换最终的全连接层，以实现对距离敏感的不确定性，而PG-SNER输出Normal Inverse Gamma参数以以一致的概率形式建模不确定性。我们使用标准准确度指标和基于皮尔逊相关系数的新距离感知指标评估性能，该指标衡量预测不确定性与测试和训练样本之间距离的相关性。我们还在损失中设计了一种动态加权方案，以平衡数据保真度和物理一致性。我们在滚动轴承退化上测试了我们的方法，使用PRONOSTIA、XJTU-SY和HUST数据集，并将其与蒙特卡洛和深度集成物理引导神经网络进行比较。结果表明，PG-SNGP和PG-SNER提高了预测准确性，在离域条件下可靠泛化，并且对对抗攻击和噪声具有鲁棒性。

Summary / 总结

This research aims to improve the accuracy and reliability of bearing degradation estimation in safety-critical systems by developing distance-aware uncertainty quantification methods for physics-guided neural networks. Both methods apply spectral normalization to the hidden layers so that distances are preserved from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer, while PG-SNER outputs Normal Inverse Gamma parameters following Deep Evidential Regression. On the PRONOSTIA, XJTU-SY and HUST datasets, the methods improve prediction accuracy, generalize reliably under out-of-distribution conditions, and remain robust to adversarial attacks and noise.

本文旨在解决安全关键系统中，特别是具有滚动轴承的旋转机械中，准确且具有不确定性意识的退化估计的需求。文中提出了两种方法PG-SNGP和PG-SNER，它们使用谱归一化和高斯过程或深度证据回归来量化不确定性。这些方法使用标准准确度指标和基于皮尔逊相关系数的新距离感知指标进行评估，结果显示这些方法提高了预测准确性，能够在出分布条件下可靠泛化，并且对对抗攻击和噪声具有鲁棒性。
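
The distance-aware metric described in the abstract correlates predicted uncertainty with how far a test sample lies from the training data. A minimal sketch follows, assuming Euclidean distance to the nearest training sample in input space; the paper's exact distance definition is not given in the abstract, and the data and function names are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def nearest_train_distance(test_point, train_points):
    """Distance from a test point to its closest training point (assumed definition)."""
    return min(math.dist(test_point, p) for p in train_points)


def distance_awareness(test_points, uncertainties, train_points):
    """Correlation between predicted uncertainty and distance from the training set."""
    dists = [nearest_train_distance(p, train_points) for p in test_points]
    return pearson(dists, uncertainties)


# Made-up 1-D inputs and predicted standard deviations.
train = [(0.0,), (1.0,), (2.0,)]
test = [(0.5,), (3.0,), (6.0,)]
sigma = [0.1, 0.3, 0.9]
score = distance_awareness(test, sigma, train)
```

A score near 1 means uncertainty grows as inputs move away from the training data, which is the behavior a distance-aware model should show on out-of-distribution samples.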

Meta-RL Induces Exploration in Language Agents

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

First: 2025-12-18T18:22:17+00:00 · Latest: 2025-12-18T18:22:17+00:00

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from the task feedback signal without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

中文标题/摘要

标题：元强化学习促进语言代理的探索

强化学习（RL）使大型语言模型（LLM）代理能够与环境互动并解决多轮长时程任务。然而，RL训练的代理在需要主动探索的任务中往往表现不佳，无法有效地从试错经验中适应。在本文中，我们提出了LaMer，这是一种通用的元强化学习框架，使LLM代理能够在测试时主动探索并从环境反馈中学习。LaMer包括两个关键组件：（i）跨回合训练框架，鼓励探索和长期奖励优化；（ii）通过反思进行上下文内策略适应，使代理能够在不进行梯度更新的情况下根据任务反馈信号调整其策略。在多种环境中的实验表明，与RL基线相比，LaMer显著提高了性能，在Sokoban、MineSweeper和Webshop上分别提高了11%、14%和19%。此外，LaMer在更具挑战性或以前未见过的任务上的泛化能力也优于RL训练的代理。总体而言，我们的结果表明，元强化学习为诱导语言代理的探索提供了一种有原则的方法，通过学习到的探索策略使代理能够更稳健地适应新环境。

Summary / 总结

This paper addresses the challenge of active exploration in reinforcement learning (RL)-trained language model (LLM) agents, which often fail to efficiently explore and adapt from trial-and-error experiences. The authors introduce LaMer, a Meta-RL framework that includes a cross-episode training framework and in-context policy adaptation via reflection. Experiments show that LaMer improves performance on Sokoban, MineSweeper, and Webshop by 11%, 14%, and 19%, respectively, and demonstrates better generalization to challenging tasks compared to RL-trained agents.

本文针对强化学习（RL）训练的语言代理在复杂任务中难以有效探索和适应的问题，提出了LaMer，这是一种元RL框架，包括用于探索的跨episode训练机制和通过反思进行的上下文内策略适应。实验表明，LaMer在Sokoban、MineSweeper和Webshop上的表现分别比RL基线高出11%、14%和19%，并且在新任务上的泛化能力更强。

OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Authors: Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang

First: 2025-12-18T18:18:17+00:00 · Latest: 2025-12-18T18:18:17+00:00

Comments: https://opentouch-tactile.github.io/

Abstract

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

中文标题/摘要

标题：OPENTOUCH：将全手触觉引入现实世界交互

人类的手是我们与物理世界的主要接口，但主观感知很少知道何时、何地或以何种力度接触。可靠的可穿戴触觉传感器稀缺，且现有野外数据集无法将第一人称视频与全手触觉对齐。为了弥合视觉感知与物理交互之间的差距，我们提出了OpenTouch，这是首个野外主观全手触觉数据集，包含5.1小时同步视频-触觉-姿态数据和2900个经过精挑细选的带有详细文本注释的片段。使用OpenTouch，我们引入了检索和分类基准，以探究触觉如何为感知和行动提供基础。我们展示了触觉信号为抓取理解提供了紧凑而强大的线索，加强了跨模态对齐，并可以从野外视频查询中可靠地检索。通过发布此注释的视觉-触觉-姿态数据集和基准，我们旨在推进多模态主观感知、具身学习和接触丰富的机器人操作。

Summary / 总结

The paper presents OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, which includes 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. The dataset aims to bridge the gap between visual perception and physical interaction. Using this dataset, the authors introduce benchmarks for touch-based perception and action, demonstrating that tactile signals are effective for grasp understanding and cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. The goal is to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

论文介绍了OpenTouch，这是一个包含5.1小时同步视频、触觉和姿态数据以及2,900个带有详细文本注释的剪辑的在野自视角全手触觉数据集。该数据集旨在弥合视觉感知与物理交互之间的差距。作者引入了基于触觉信号的检索和分类基准，展示了这些信号可以为抓取理解提供紧凑而强大的线索，并增强跨模态对齐。通过发布这个标注过的视觉-触觉-姿态数据集和基准，研究人员旨在推进多模态自视角感知、体态学习和接触丰富的机器人操作。

Radiology Report Generation with Layer-Wise Anatomical Attention

Authors: Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta, Antonio Martínez-Torteya

First: 2025-12-18T18:17:57+00:00 · Latest: 2025-12-18T18:17:57+00:00

Comments: 11 pages, 6 figures

Abstract

Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.

中文标题/摘要

标题：基于层级解剖注意力的放射学报告生成

自动放射学报告生成是多模态深度学习的一个有前途的应用，旨在减少报告工作量并提高一致性。然而，当前最先进的（SOTA）系统——如多模态AI放射学应用（MAIRA-2）和医疗路径语言模型-多模态（MedPaLM-M）——依赖大规模多模态训练、临床元数据和多种影像视图，使其资源密集且大多数情况下不可用。我们介绍了一种紧凑的图像到文本架构，可以从单张前视X光片生成胸部X光报告的发现部分。该模型结合了一个冻结的无标签自我蒸馏v3（DINOv3）视觉变换器（ViT）编码器和一个增强有层级解剖注意力的生成预训练变换器2（GPT-2）解码器。该机制通过分层高斯平滑将肺和心分割掩码集成在一起，使注意力偏向临床相关区域，而无需增加可训练参数。在使用胸部放射图专家（CheXpert）和放射学图（RadGraph）指标对官方医疗信息密集护理-胸部X光（MIMIC-CXR）数据集进行评估时，我们的方法取得了显著的改进：五种关键病理学的CheXpert宏F1分数提高了168%（0.083 -> 0.238），微F1分数提高了146%（0.137 -> 0.337），而14个观察指标的总体性能提高了86%（0.170 -> 0.316）。结构连贯性也有所提高，RadGraph F1分数提高了9.7%。尽管模型很小且完全是基于图像设计，但该模型表明解码器级别的解剖学指导可以改善空间定位并增强临床相关区域的连贯性。源代码可在以下网址公开获取：https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025。

Summary / 总结

The research aims to develop a compact image-to-text architecture for generating radiology reports from a single frontal chest X-ray image, addressing the resource-intensive nature of current state-of-the-art systems. The model uses a frozen DINOv3 Vision Transformer encoder and a GPT-2 decoder enhanced with layer-wise anatomical attention, which integrates lung and heart segmentation masks through hierarchical Gaussian smoothing. Evaluation on the MIMIC-CXR dataset showed significant improvements in CheXpert Macro-F1 and Micro-F1 for five key pathologies, with gains of 168% and 146%, respectively, and an 86% improvement in broader performance across 14 observations. Structural coherence also improved, with RadGraph F1 increasing by 9.7%.

研究旨在开发一种紧凑的图像到文本架构，从单张前视胸片生成放射学报告，解决当前最先进的系统资源密集的问题。该模型使用冻结的DINOv3视觉变换器编码器和增强有层次解剖注意力的GPT-2解码器，通过层次高斯平滑整合肺和心脏分割掩码。在MIMIC-CXR数据集上的评估显示，在五个关键病理学指标上的CheXpert宏F1和微F1分别提高了168%和146%，并且14个观察指标的整体性能提高了86%。结构连贯性也得到了改善，RadGraph F1提高了9.7%。

DenseBEV: Transforming BEV Grid Cells into 3D Objects

Authors: Marius Dähling, Sebastian Krebs, J. Marius Zöllner

Venue: WACV 2026

First: 2025-12-18T17:59:22+00:00 · Latest: 2025-12-18T17:59:22+00:00

Comments: 15 pages, 8 figures, accepted by WACV 2026

Abstract

In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
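To make the "every BEV cell is a candidate, then suppress" step concrete, here is a minimal sketch of a greedy centre-distance NMS over grid cells. The abstract does not specify the exact suppression rule, so the distance criterion, the `radius` parameter and the function name `bev_nms` are assumptions for illustration only.

```python
def bev_nms(cells, radius):
    """Greedy centre-distance NMS over BEV grid cells.

    cells: list of (row, col, score) tuples, one per candidate cell.
    Returns the indices of the cells that survive suppression,
    ordered from highest to lowest score.
    """
    order = sorted(range(len(cells)), key=lambda i: cells[i][2], reverse=True)
    kept = []
    for i in order:
        r_i, c_i, _ = cells[i]
        # Keep cell i only if no already-kept (higher-scoring) cell lies within `radius`.
        if all((r_i - cells[k][0]) ** 2 + (c_i - cells[k][1]) ** 2 > radius ** 2
               for k in kept):
            kept.append(i)
    return kept

# Hypothetical candidates: two clusters of neighbouring cells.
cells = [(0, 0, 0.90), (0, 1, 0.80), (5, 5, 0.70), (5, 6, 0.95)]
survivors = bev_nms(cells, radius=1.5)
print(survivors)
# In training, a loss would be computed only on `survivors`,
# so suppressed cells receive no gradient.
```

Only the surviving indices are passed on, which keeps the number of queries seen by attention small even when the initial grid is dense.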

中文标题/摘要

标题：DenseBEV：将BEV网格单元转换为3D物体

当前研究中，基于鸟瞰视角(BEV)的Transformer越来越多地用于多摄像头3D物体检测。传统模型通常使用随机查询作为锚点，并逐步优化它们。最近的进展通过辅助网络的检测来补充或替代这些随机查询。我们提出了一种更直观且高效的直接使用BEV特征单元作为锚点的方法。该端到端的方法利用了BEV查询的密集网格，将每个单元视为最终检测任务中的潜在物体。因此，我们引入了一种专为多摄像头3D物体检测设计的新型两阶段锚点生成方法。为了解决大量查询时注意力机制的扩展问题，我们应用了基于BEV的非极大值抑制，仅允许梯度流经未被抑制的对象，从而确保高效的训练而无需后处理。通过直接使用BEVFormer等编码器的BEV特征作为物体查询，时间BEV信息被自然嵌入。基于我们已嵌入时间BEV信息的物体查询，我们通过结合先验检测引入了一种混合时间建模方法，以进一步提高检测性能。在nuScenes数据集上的评估显示，即使使用更稀疏的BEV网格和更少的初始锚点，我们的方法在NDS和mAP上也表现出一致且显著的改进。特别是在小物体检测方面，它提高了行人的检测性能，nuScenes上的mAP提高了3.8%，Waymo上的LET-mAP提高了8%。将我们的方法应用于具有挑战性的Waymo Open数据集，实现了60.7%的LET-mAP，超越了之前的最好成绩5.4%。代码可在https://github.com/mdaehl/DenseBEV获取。

Summary / 总结

The research aims to improve multi-camera 3D object detection by directly using BEV feature cells as anchors, which is more intuitive and efficient. The method introduces a two-stage anchor generation process and applies BEV-based Non-Maximum Suppression to handle the scaling issues of attention. Experimental results on the nuScenes and Waymo datasets show consistent improvements in NDS and mAP, especially for small objects, with significant enhancements in pedestrian detection and overall performance on the Waymo Open dataset.

研究旨在通过直接使用BEV特征单元作为锚点来提高多相机3D目标检测的性能，避免使用随机查询。方法引入了两阶段锚点生成过程，并应用BEV非极大值抑制来高效处理大量查询。在nuScenes和Waymo数据集上的实验显示了在NDS和mAP方面的持续改进，特别是在小目标检测方面取得了显著提升，Waymo Open数据集上达到了60.7%的LET-mAP，超越了之前的最佳性能。

Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs

Authors: William English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz

Venue: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15370-15383, 2025

First: 2025-12-18T17:55:15+00:00 · Latest: 2025-12-18T17:55:15+00:00

Abstract

Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
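The core mechanism described above, restricting each decoding step to the few tokens a grammar allows, can be sketched as follows. This is a toy illustration, not GraFT's grammar or code: the simplified infix grammar, the placeholder atoms `prop_1`/`prop_2`, the scripted `toy_scores` stand-in for a language model, and all function names are assumptions.

```python
UNARY = {"G", "F", "X", "!"}          # temporal/unary operators
BINARY = {"&", "|", "->", "U"}        # binary operators
ATOMS = {"prop_1", "prop_2"}          # lifted atomic-proposition placeholders
EOS = "<eos>"

def allowed_next(prefix):
    """Tokens the (simplified) grammar permits after `prefix`."""
    if not prefix or prefix[-1] in UNARY or prefix[-1] in BINARY:
        return UNARY | ATOMS          # an operand must follow
    return BINARY | {EOS}             # after an atom: continue or stop

def constrained_greedy_decode(score_fn, max_len=16):
    """Greedy decoding where the argmax is taken over grammar-valid tokens only."""
    prefix = []
    for _ in range(max_len):
        logits = score_fn(prefix)
        valid = sorted(allowed_next(prefix))   # sorted for deterministic tie-breaking
        token = max(valid, key=lambda t: logits.get(t, float("-inf")))
        if token == EOS:
            break
        prefix.append(token)
    return prefix

def toy_scores(prefix):
    """Hypothetical stand-in for an LLM: it always prefers the out-of-grammar
    word "always", and otherwise follows a fixed script."""
    script = ["G", "prop_1", "->", "F", "prop_2", EOS]
    want = script[min(len(prefix), len(script) - 1)]
    logits = {t: 0.0 for t in UNARY | BINARY | ATOMS | {EOS}}
    logits["always"] = 5.0
    logits[want] = 3.0
    return logits

print(" ".join(constrained_greedy_decode(toy_scores)))
```

Although the unconstrained argmax would be the invalid word `always` at every step, the constrained decoder never emits it, because that token is outside the allowed set.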

中文标题/摘要

标题：基于语法的自然语言到时间逻辑的强制翻译

将自然语言（NL）翻译成形式语言如时间逻辑（TL）对于人类与机器人和自主系统的交流至关重要。最先进的方法将任务分解为原子命题（APs）的提升阶段和翻译阶段。然而，现有方法在准确提升、共指存在以及从有限数据中学习方面存在困难。在本文中，我们提出了一种名为语法强制翻译（GraFT）的自然语言到时间逻辑翻译框架。该框架基于先前工作通过让语言模型从其全词汇中迭代预测标记来同时解决提升和翻译步骤的观察。相比之下，GraFT通过在每一步中将有效输出标记集从全词汇限制为少量标记来降低两个任务的复杂性。通过利用每个问题的独特属性，我们减少了解空间。我们还提供了理论依据，说明解空间的减少为何能更有效地学习。我们使用CW、GLTL和Navi基准评估了GraFT的有效性。与最先进的翻译方法相比，GraFT在端到端翻译准确性上提高了5.49%，在域外翻译准确性上平均提高了14.06%。

Summary / 总结

This paper addresses the challenge of translating natural language into temporal logic, focusing on the limitations of existing methods in accurately lifting atomic propositions and handling co-references. The proposed framework, Grammar Forced Translation (GraFT), simplifies the process by restricting the output tokens at each step, leading to improved accuracy. GraFT outperforms state-of-the-art approaches by 5.49% in end-to-end translation accuracy and 14.06% in out-of-domain translation accuracy across various benchmarks.

本文解决了将自然语言翻译成时序逻辑的问题，这对于人机交互至关重要。作者提出了一种名为Grammar Forced Translation (GraFT) 的框架，通过在每一步中减少有效的输出令牌集来简化过程，从而提高准确性。GraFT 在各种基准测试中将端到端的翻译准确率提高了 5.49%，跨域翻译准确率提高了 14.06%。

Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning

Authors: Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella

First: 2025-12-18T17:54:20+00:00 · Latest: 2025-12-18T17:54:20+00:00

Abstract

Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
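The phrase "centralized but factorizable action-value function" refers to QMIX's monotonic mixing: the joint value is a monotone function of the per-agent values, so each agent's local greedy action is also part of the joint greedy action. The sketch below checks this on a toy example. It is a simplification: a real QMIX mixer is a small network whose non-negative weights come from a state-conditioned hypernetwork, whereas here a single linear layer with hand-picked numbers is assumed, and all Q-values are made up.

```python
import itertools

def mix(q_values, weights, bias):
    """Monotonic mixer: Q_tot = sum_i |w_i| * Q_i + b (weights forced non-negative)."""
    return sum(abs(w) * q for w, q in zip(weights, q_values)) + bias

# Hypothetical per-agent utilities over joint (channel, power) actions.
agent_q = [
    {(0, "low"): 0.2, (0, "high"): 0.5, (1, "low"): 0.9, (1, "high"): 0.4},
    {(0, "low"): 0.7, (0, "high"): 0.1, (1, "low"): 0.3, (1, "high"): 0.6},
]
weights, bias = [1.5, -0.8], 0.1      # assumed mixer parameters

# Decentralised execution: every agent maximises its own Q_i.
greedy = tuple(max(q, key=q.get) for q in agent_q)

# Centralised check: brute-force search over all joint actions.
joint_best = max(
    itertools.product(*[q.keys() for q in agent_q]),
    key=lambda joint: mix([q[a] for q, a in zip(agent_q, joint)], weights, bias),
)

print(greedy, joint_best)
```

Because `abs(w)` keeps every mixing weight non-negative, raising any single agent's value can never lower `Q_tot`, which is what lets execution stay decentralised.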

中文标题/摘要

标题：基于多智能体强化学习的蜂群网络协同抗干扰韧性

反应式干扰器通过选择性地破坏智能体间的通信并破坏编队完整性和任务成功，对机器人蜂群网络构成严重安全威胁。传统的固定功率控制或静态频道跳转等对策对这种适应性对手基本无效。本文提出了一种基于 QMIX 算法的多智能体强化学习（MARL）框架，以提高在反应式干扰下蜂群通信的韧性。我们考虑了一个多个发射-接收对共享频道的网络，同时一个具有马尔可夫阈值动态特性的反应式干扰器感知总功率并相应地作出反应。每个智能体共同选择发射频率（频道）和功率，QMIX 学习一个集中但可分解的动作-价值函数，使协调但分散执行成为可能。我们在无频道重用设置中将 QMIX 与 genie-aided 最优策略进行基准测试，并在更一般的具有频道重用的衰落环境中将 QMIX 与局部上置信界（UCB）和无状态反应策略进行基准测试。仿真结果表明，QMIX 迅速收敛到几乎与 genie-aided 边界匹配的合作策略，同时在基线方法之上实现了更高的吞吐量和更低的干扰发生率，从而证明了 MARL 在对抗环境中保护自主蜂群的有效性。

Summary / 总结

This paper addresses the security threat posed by reactive jammers to robotic-swarm networks by proposing a multi-agent reinforcement learning (MARL) framework using the QMIX algorithm. The framework aims to improve the resilience of swarm communications under adaptive jamming. Simulation results show that QMIX outperforms baseline policies, achieving higher throughput and lower jamming incidence, and rapidly converging to policies that nearly match the optimal genie-aided bound.

论文针对反应式干扰器对机器人蜂群网络造成的安全威胁，该干扰器通过选择性地破坏通信和任务成功来干扰蜂群网络。研究提出了一种基于QMIX算法的多智能体强化学习（MARL）框架，以增强蜂群通信的抗干扰能力。该框架使智能体能够协调其传输频率和功率的选择，从而形成接近最优 genie-辅助策略的协同策略。仿真结果表明，QMIX在吞吐量和干扰发生率方面优于基线策略，证明了MARL在对抗环境中的自主蜂群安全方面的有效性。

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Authors: Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang

First: 2025-12-18T17:51:42+00:00 · Latest: 2025-12-18T17:51:42+00:00

Abstract

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

中文标题/摘要

标题：GeoPredict：利用预测运动学和三维高斯几何进行精确VLA操作

视觉-语言-动作（VLA）模型在机器人操作中表现出强大的泛化能力，但仍然主要具有反应性和二维中心性，使其在需要精确三维推理的任务中不可靠。我们提出了GeoPredict，一种几何感知的VLA框架，该框架通过预测运动学和几何先验增强连续动作策略。GeoPredict引入了一个轨迹级模块，用于编码运动历史并预测多步机器人手臂的三维关键点轨迹，以及一个预测三维高斯几何模块，用于沿未来关键点轨迹进行跟踪引导细化以预测工作空间几何。这些预测模块仅在训练时通过基于深度的渲染提供监督，推理时仅需要轻量级的附加查询标记而无需调用任何三维解码。在RoboCasa Human-50、LIBERO和实际操作任务上的实验表明，GeoPredict在几何密集型和空间要求高的场景中始终优于强大的VLA基线。

Summary / 总结

GeoPredict is a geometry-aware VLA framework that enhances a continuous-action policy with predictive kinematic and geometric priors. It includes a trajectory-level module for predicting 3D keypoint trajectories and a predictive 3D Gaussian geometry module for forecasting workspace geometry. Experiments show that GeoPredict outperforms strong VLA baselines, particularly in geometry-intensive and spatially demanding tasks.

GeoPredict 通过将预测运动学和3D高斯几何引入VLA模型，旨在提高机器人操作的精确性。它使用轨迹级模块预测3D关键点轨迹，并使用预测的3D高斯几何模块预测工作空间几何。该框架在几何密集型任务中表现出色，优于现有的VLA基线，在模拟和真实世界场景中均取得了更好的性能。

Online Continual Graph Learning

Authors: Giovanni Donghi, Luca Pasa, Daniele Zambon, Cesare Alippi, Nicolò Navarin

First: 2025-08-05T10:05:09+00:00 · Latest: 2025-12-18T17:30:25+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Continual Learning (CL) aims to incrementally acquire new knowledge while mitigating catastrophic forgetting. Within this setting, Online Continual Learning (OCL) focuses on updating models promptly and incrementally from single or small batches of observations from a data stream. Extending OCL to graph-structured data is crucial, as many real-world networks evolve over time and require timely, online predictions. However, existing continual or streaming graph learning methods typically assume access to entire graph snapshots or multiple passes over tasks, violating the efficiency constraints of the online setting. To address this gap, we introduce the Online Continual Graph Learning (OCGL) setting, which formalizes node-level continual learning on evolving graphs under strict memory and computational budgets. OCGL defines how a model incrementally processes a stream of node-level information while maintaining anytime inference and respecting resource constraints. We further establish a comprehensive benchmark comprising seven datasets and nine CL strategies, suitably adapted to the OCGL setting, enabling a standardized evaluation setup. Finally, we present a minimalistic yet competitive baseline for OCGL, inspired by our benchmarking results, that achieves strong empirical performance with high efficiency.
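One way to picture "a stream of node-level information under a strict memory budget" is a fixed-size replay buffer filled by reservoir sampling, a common ingredient of replay-based continual learning. The abstract does not say which nine strategies are benchmarked or how the proposed baseline works, so this is only an assumed illustration; the class name, the node tuple layout and the omitted model update are all hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity memory that holds a uniform sample of the stream seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        """Reservoir sampling (Algorithm R): the n-th item is stored with probability capacity / n."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        """Draw up to k stored items without replacement for rehearsal."""
        return self.rng.sample(self.items, min(k, len(self.items)))

# Hypothetical node stream: (node_id, label) pairs arriving one at a time.
buffer = ReservoirBuffer(capacity=50)
for node_id in range(1000):
    node = (node_id, node_id % 7)
    replay = buffer.sample(8)     # old nodes to rehearse alongside the new one
    # A model update on [node] + replay would go here (omitted in this sketch).
    buffer.add(node)

print(len(buffer.items), buffer.seen)
```

Memory use stays constant regardless of stream length, and each incoming node is processed exactly once, which matches the single-pass constraint of the online setting.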

中文标题/摘要

标题：在线持续图学习

持续学习（CL）旨在增量地获取新知识同时减轻灾难性遗忘。在此框架下，在线持续学习（OCL）专注于从数据流中的单个或小批量观测中及时、增量地更新模型。将OCL扩展到图结构数据至关重要，因为许多现实世界的网络会随时间演变，并需要及时的在线预测。然而，现有的持续或流式图学习方法通常假设可以访问整个图快照或多次遍历任务，这违反了在线设置的效率约束。为了解决这一差距，我们引入了在线持续图学习（OCGL）框架，该框架在严格的内存和计算预算下，形式化了在演变图上的节点级持续学习。OCGL定义了模型如何增量地处理节点级信息流，同时保持随时推理并遵守资源约束。我们进一步建立了一个全面的基准，包括七个数据集和九种CL策略，适当地适应OCGL设置，以实现标准化的评估框架。最后，我们提出了一个基于基准测试结果的简洁但具有竞争力的OCGL基线，该基线在效率高且性能强方面表现出色。

Summary / 总结

The paper addresses the challenge of continual learning in graph-structured data, introducing Online Continual Graph Learning (OCGL) to handle evolving graphs in an online setting with strict memory and computational constraints. The authors propose a benchmark with seven datasets and nine strategies tailored for OCGL, and introduce a minimalistic yet effective baseline that performs well under these constraints.

研究旨在通过Online Continual Graph Learning (OCGL)处理实时演化的图结构数据，解决现有方法需要整个图快照或多次遍历任务的局限性。该方法在严格的内存和计算约束下形式化节点级别的持续学习，并引入了一个包含七个数据集和九种持续学习策略的基准。关键发现表明，提出的最小化基线在保持高效率的同时实现了强大的实验性能。

KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Authors: Shuting Zhao, Zeyu Xiao, Xinrong Chen

Venue: AAAI 2026

First: 2025-12-18T17:25:47+00:00 · Latest: 2025-12-18T17:25:47+00:00

Comments: Accepted by AAAI 2026

Abstract

Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
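A loss on angular velocity can be sketched by comparing frame-to-frame relative rotations with a geodesic distance on rotation matrices. The abstract does not give the exact form of the geometric angular velocity loss, so the formulation below (mean geodesic angle between predicted and target relative rotations) is an assumption, and the helper names are hypothetical.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis (used here only to build test data)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def geodesic_angle(r):
    """Rotation angle in radians of a 3x3 rotation matrix, from its trace."""
    cos_theta = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def angular_velocity_loss(pred, target):
    """Mean geodesic distance between predicted and target frame-to-frame rotations."""
    total = 0.0
    for t in range(len(pred) - 1):
        d_pred = matmul(transpose(pred[t]), pred[t + 1])      # relative rotation t -> t+1
        d_true = matmul(transpose(target[t]), target[t + 1])
        total += geodesic_angle(matmul(transpose(d_true), d_pred))
    return total / (len(pred) - 1)

# Hypothetical single-joint sequences: the prediction rotates too fast in the last step.
target = [rot_z(0.0), rot_z(0.1), rot_z(0.2)]
pred = [rot_z(0.0), rot_z(0.1), rot_z(0.3)]
print(angular_velocity_loss(pred, target))
```

Because only relative rotations are compared, a constant orientation offset between prediction and target contributes nothing; the term penalises differences in how fast the joint rotates, which is what targets jitter.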

中文标题/摘要

标题：KineST：一种基于运动学引导的空间时间状态空间模型，用于从稀疏信号中进行全身运动跟踪

全身运动跟踪在AR/VR应用中起着重要作用，连接物理和虚拟交互。然而，基于头戴式显示器获得的稀疏信号，重建逼真且多样的全身姿态极具挑战性，头戴式显示器是AR/VR场景中的主要设备。现有姿态重建方法往往计算成本高，或者需要分别建模空间和时间依赖性，难以在准确度、时间连贯性和效率之间取得平衡。为解决这一问题，我们提出了一种新颖的基于运动学引导的状态空间模型KineST，该模型有效地提取了空间时间依赖性，同时整合了局部和全局姿态感知。创新之处在于两个核心思想。首先，为了更好地捕捉复杂的关节关系，在状态空间二元性框架下的扫描策略被重新定义为运动学引导的双向扫描，嵌入了运动学先验。其次，采用混合的空间时间表示学习方法，紧密耦合空间和时间上下文，平衡准确性和平滑度。此外，引入了几何角速度损失，对旋转变化施加物理上合理的约束，进一步提高运动稳定性。大量实验表明，KineST在轻量级框架内具有优越的准确性和时间一致性。项目页面：https://kaka-1314.github.io/KineST/

Summary / 总结

KineST is a novel kinematics-guided state space model designed to reconstruct realistic full-body poses from sparse signals in AR/VR applications. It uses kinematics-guided bidirectional scanning and mixed spatiotemporal representation learning to balance accuracy and temporal coherence. Experimental results show that KineST outperforms existing methods in both accuracy and temporal consistency while maintaining a lightweight framework.

KineST是一种新颖的基于运动学的时空状态空间模型，旨在从AR/VR场景中的稀疏信号中重建逼真的全身姿态。该模型通过嵌入运动学先验重新定义扫描策略，并采用混合的时空表示学习方法来平衡准确性和平滑性。实验结果表明，KineST在准确性和时间一致性方面均优于现有方法，同时保持了轻量级的框架。