Generative Refocusing: Flexible Defocus Control from a Single Image
Authors: Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00
Comments: Project website: https://generative-refocusing.github.io/
Abstract
Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
中文标题/摘要
标题:生成性重新聚焦:从单张图像实现灵活的景深控制
景深控制是摄影中的重要技术,但获得完美的对焦往往需要多次尝试或特殊设备。单张图像重新聚焦仍然具有挑战性,涉及恢复清晰内容和创建逼真的散景。当前的方法存在显著的局限性,需要全聚焦输入,依赖于模拟器生成的合成数据,并且对光圈控制有限。我们提出了生成性重新聚焦,这是一种两步过程,使用DeblurNet从各种输入中恢复全聚焦图像,使用BokehNet创建可控的散景。我们的主要创新是半监督训练。该方法结合了合成配对数据和未配对的真实散景图像,使用EXIF元数据捕捉模拟器无法提供的真实光学特性。我们的实验表明,我们在景深去模糊、散景合成和重新聚焦基准测试中实现了最佳性能。此外,我们的生成性重新聚焦还允许文本引导的调整和自定义光圈形状。
Summary / 总结
The paper addresses single-image refocusing, where depth-of-field control is crucial but hard to get right at capture time. It proposes Generative Refocusing, a two-step process in which DeblurNet recovers an all-in-focus image from various inputs and BokehNet synthesizes controllable bokeh. Training is semi-supervised: synthetic paired data is combined with unpaired real bokeh images, whose EXIF metadata captures real optical characteristics that simulators cannot provide. Experiments show top performance on defocus deblurring, bokeh synthesis, and refocusing benchmarks, and the system supports text-guided adjustments and custom aperture shapes.
论文针对摄影中单张图片的对焦调节难题,提出了一种两步法的Generative Refocusing方法,包括使用DeblurNet恢复全聚焦图像和BokehNet生成可控的散景。该方法采用半监督训练,结合合成配对数据和真实散景图像以及EXIF元数据来捕捉实际光学特性。实验结果显示,在失焦去模糊、散景合成和对焦调节基准测试中表现出色,并支持文本引导调整和自定义光圈形状。
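As background on what "aperture control" means optically, the textbook thin-lens circle-of-confusion formula relates blur size to aperture and depth. The sketch below is standard optics, not the paper's BokehNet; the function and parameter names are illustrative assumptions.

```python
def coc_diameter_mm(focal_mm, f_number, focus_dist_mm, subject_dist_mm):
    """Thin-lens circle-of-confusion diameter on the sensor, in millimetres.

    Textbook formula: c = A * |S2 - S1| / S2 * f / (S1 - f),
    where A = f / N is the aperture diameter, S1 the focus distance,
    and S2 the distance of the point being imaged.
    """
    if focus_dist_mm <= focal_mm:
        raise ValueError("focus distance must exceed the focal length")
    if subject_dist_mm <= 0:
        raise ValueError("subject distance must be positive")
    aperture_mm = focal_mm / f_number                   # A = f / N
    magnification = focal_mm / (focus_dist_mm - focal_mm)
    depth_term = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture_mm * magnification * depth_term


# A point on the focus plane is sharp; opening the aperture by two stops
# (f/4 -> f/2) doubles the blur diameter of an out-of-focus point.
in_focus = coc_diameter_mm(50, 2.0, 2000, 2000)
wide_open = coc_diameter_mm(50, 2.0, 2000, 4000)
stopped_down = coc_diameter_mm(50, 4.0, 2000, 4000)
```

The blur diameter scales inversely with the f-number, which is why EXIF fields such as focal length and f-number carry useful information about real bokeh.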
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen
First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00
Comments: Project page and code: https://worldcanvas.github.io/
Abstract
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
中文标题/摘要
标题:世界即画布:使用参考图像、轨迹和文本绘制可提示事件
我们提出了WorldCanvas框架,该框架通过结合文本、轨迹和参考图像,实现了丰富且用户导向的模拟。与仅使用文本的方法和现有的轨迹控制图像到视频的方法不同,我们的多模态方法将轨迹——编码运动、时间、可见性——与自然语言结合,用于语义意图,并使用参考图像进行物体身份的视觉定位,从而生成连贯且可控的事件,包括多智能体交互、物体的进入/退出、参考引导的外观以及反直觉事件。生成的视频不仅展示了时间连贯性,还展示了在短暂消失后场景中物体身份和场景的一致性。通过支持富有表现力的世界事件生成,WorldCanvas将世界模型从被动预测者提升为交互式的、用户导向的模拟器。我们的项目页面可在:https://worldcanvas.github.io/访问。
Summary / 总结
WorldCanvas is a framework that allows for rich, user-directed simulation by integrating text, trajectories, and reference images. Unlike previous text-only methods or trajectory-controlled image-to-video techniques, WorldCanvas combines trajectories with natural language and reference images to create coherent, controllable events involving multi-agent interactions and object entry/exit. The generated videos show temporal coherence and emergent consistency, preserving object identity and scene despite temporary disappearance. This framework transforms world models from passive predictors into interactive, user-shaped simulators.
WorldCanvas 是一个框架,结合了文本、轨迹和参考图像来模拟丰富的用户导向的世界事件。不同于仅使用文本的方法或基于轨迹的图像到视频方法,WorldCanvas 将轨迹用于运动、时间和可见性,将自然语言用于语义意图,并将参考图像用于视觉定位物体身份。这种多模态方法能够生成具有多代理交互和参考引导外观的连贯且可控的事件,生成的视频不仅具有时间连贯性,还表现出临时消失后的场景一致性。WorldCanvas 将世界模型转变为由用户输入塑造的交互式模拟器,推动了世界建模领域的发展。
EasyV2V: A High-quality Instruction-based Video Editing Framework
Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00
Comments: Project page: https://snap-research.github.io/easyv2v/
Abstract
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
中文标题/摘要
标题:EasyV2V:一种基于指令的高质量视频编辑框架
尽管图像编辑已经取得了快速进展,但视频编辑仍然较少被探索,面临着一致性、控制和泛化的挑战。我们研究了数据、架构和控制的设计空间,并引入了EasyV2V,这是一种简单有效的基于指令的视频编辑框架。在数据方面,我们通过将现有专家与快速逆向合成多样化的视频对,通过单帧监督和共享仿射运动的伪对将图像编辑对提升为视频,挖掘密集字幕片段以构建视频对,并添加过渡监督以教授编辑的展开方式。在模型方面,我们观察到预训练的文本到视频模型具有编辑能力,这激励了简化的设计。简单的序列拼接作为条件,并结合轻量级LoRA微调足以训练出强大的模型。对于控制,我们通过单一掩码机制统一了时空控制,并支持可选的参考图像。总体而言,EasyV2V 可以灵活地处理输入,例如视频+文本、视频+掩码+文本、视频+掩码+参考+文本,并实现了最先进的视频编辑结果,超越了同时期和商用系统。项目页面:https://snap-research.github.io/easyv2v/
Summary / 总结
The research aims to address the challenges in video editing, such as consistency, control, and generalization, by introducing EasyV2V, a simple framework for instruction-based video editing. The framework leverages existing experts with fast inverses, single-frame supervision, and shared affine motion to create diverse video pairs. It also includes transition supervision and dense-captioned clips. On the model side, a simplified design using pretrained text-to-video models and light LoRA fine-tuning is employed. EasyV2V supports various input types and achieves state-of-the-art results in video editing, outperforming concurrent and commercial systems.
EasyV2V 是一个基于指令的视频编辑框架,旨在解决视频编辑中的连贯性、控制和泛化问题。它利用来自现有专家和图像编辑的多样化视频对,并使用简单的序列拼接方法结合轻量级 LoRA 微调进行模型训练。该框架支持多种输入类型,并在视频编辑方面取得了最先进的成果,超越了同期和商用系统。
DVGT: Driving Visual Geometry Transformer
Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu
First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00
Comments: Code is available at https://github.com/wzzheng/DVGT
Abstract
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
中文标题/摘要
标题:DVGT:驾驶目标视觉几何变换器
从视觉输入感知和重构3D场景几何对于自动驾驶至关重要。然而,仍然缺乏一种针对驾驶场景的密集几何感知模型,能够适应不同的场景和相机配置。为了解决这一问题,我们提出了一种驾驶目标视觉几何变换器(DVGT),它可以从前序的多视角未校正视觉输入中重建全局密集的3D点云图。我们首先使用DINO骨干网络提取每张图像的视觉特征,然后采用交替的同视角局部注意力、跨视角空间注意力和跨帧时间注意力来推断图像间的几何关系。接着,我们使用多个解码头在第一帧的 ego 坐标系中解码全局点云图,并为每一帧计算 ego 姿态。与依赖精确相机参数的传统方法不同,DVGT 不需要显式的3D几何先验,能够灵活处理任意的相机配置。DVGT 直接从图像序列中预测出度量标定的几何结构,消除了与外部传感器对齐的需要。在包括 nuScenes、OpenScene、Waymo、KITTI 和 DDAD 等多种驾驶数据集的大规模混合训练下,DVGT 在各种场景中显著优于现有模型。代码可在 https://github.com/wzzheng/DVGT 获取。
Summary / 总结
DVGT is designed to perceive and reconstruct 3D scene geometry from visual inputs for autonomous driving, addressing the lack of a driving-targeted dense geometry perception model. It uses a DINO backbone to extract visual features and applies attention mechanisms to infer geometric relations across images. DVGT directly predicts metric-scaled geometry from image sequences without relying on precise camera parameters, achieving superior performance across various scenarios compared to existing models.
DVGT旨在从视觉输入中感知和重建3D场景几何,以支持自主驾驶。它使用Driving Visual Geometry Transformer通过局部、空间和时间注意力机制来推断图像间的几何关系。DVGT无需依赖精确的相机参数即可直接从图像序列中预测出度量标尺的几何结构,其在各种场景下的性能优于现有模型。
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00
Comments: project page: https://auditdm.github.io/
Abstract
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
中文标题/摘要
标题:关键差异:审计模型以发现和纠正能力差距
传统的多模态大语言模型(MLLMs)评估方法缺乏可解释性,往往无法充分揭示模型之间的显著能力差距。为解决这一问题,我们引入了AuditDM,这是一种自动化的框架,通过审计模型之间的差异来主动发现和纠正其失败模式。AuditDM 通过强化学习微调一个MLLM作为审计器,生成能够最大化目标模型之间分歧的具有挑战性的问题和反事实图像。训练完成后,审计器能够揭示多样且可解释的示例,揭示模型的弱点,并作为无需标注的数据用于纠正。当应用于如Gemma-3和PaliGemma-2等最先进的模型时,AuditDM 发现了超过20种不同的失败类型。基于这些发现的微调在16个基准测试中持续改进了所有模型,并使一个3B模型超越了其28B的对照组。我们的结果表明,在数据规模效应减弱时,有针对性的模型审计为模型诊断和改进提供了一条有效途径。
Summary / 总结
The paper introduces AuditDM, an automated framework that identifies and rectifies capability gaps in multimodal LLMs by auditing their divergence. It fine-tunes an MLLM as an auditor using reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. This process uncovers diverse, interpretable examples of model weaknesses and improves model performance across 16 benchmarks, even enabling a smaller model to outperform a larger one.
该研究引入了AuditDM,一种自动化框架,通过审计多模态LLM之间的差异来识别和纠正能力差距。它使用强化学习来微调一个LLM作为审计员,生成具有挑战性的问题和反事实图像,以最大化目标模型之间的分歧。这一过程揭示了多样且可解释的示例,展示了模型的弱点,并作为无注释数据用于改进。将AuditDM应用于最先进的模型如Gemma-3和PaliGemma-2,发现了超过20种不同的失败类型,并通过这些发现的改进在16个基准测试中提升了所有模型的表现,甚至使一个3B模型超越了其28B的对照组。
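To make "maximize disagreement among target models" concrete, here is a minimal sketch of one possible disagreement score. The abstract does not specify the auditor's reward, so the formula and the name `disagreement_reward` are assumptions for illustration only.

```python
from collections import Counter


def disagreement_reward(answers):
    """Hypothetical auditor reward: 1 minus the share of the most common answer.

    `answers` holds one answer string per target model for a single
    generated question. Returns 0.0 when all models agree, and approaches
    1.0 as the answers spread out.
    """
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / len(answers)


# Three models agree, so the question exposes no capability gap.
unanimous = disagreement_reward(["cat", "cat", "cat"])
# Two models split, so the question is a candidate failure exemplar.
split = disagreement_reward(["cat", "dog"])
```

A score of this kind needs no ground-truth labels, which is consistent with the abstract's description of the discovered exemplars as annotation-free data.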
AdaTooler-V: Adaptive Tool-Use for Images and Videos
Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue
First: 2025-12-18T18:59:55+00:00 · Latest: 2025-12-18T18:59:55+00:00
Comments: Project page: https://github.com/CYWang735/AdaTooler-V
Abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
中文标题/摘要
标题:AdaTooler-V:自适应图像和视频工具使用
最近的研究表明,多模态大型语言模型(MLLM)从多模态交替的思维链(CoT)与视觉工具交互中受益。然而,现有的开源模型经常表现出盲目的工具使用推理模式,即使在不需要时也调用视觉工具,这显著增加了推理开销并降低了模型性能。为此,我们提出了AdaTooler-V,这是一种MLLM,能够根据视觉问题是否真正需要工具来执行自适应工具使用。首先,我们引入了AT-GRPO,这是一种基于每个样本的工具效益评分动态调整奖励尺度的强化学习算法,鼓励模型仅在工具提供真正改进时才调用工具。此外,我们构建了两个数据集以支持训练:AdaTooler-V-CoT-100k 用于SFT冷启动,AdaTooler-V-300k 用于具有可验证奖励的强化学习,涵盖单图像、多图像和视频数据。在十二个基准测试中的实验表明,AdaTooler-V 具有强大的推理能力,在各种视觉推理任务中优于现有方法。值得注意的是,AdaTooler-V-7B 在高分辨率基准测试 V* 中的准确率为 89.8%,超过了商业专有模型 GPT-4o 和 Gemini 1.5 Pro。所有代码、模型和数据均已发布。
Summary / 总结
AdaTooler-V is an MLLM that uses vision tools adaptively, deciding whether a visual problem actually requires them. It is trained with AT-GRPO, a reinforcement learning algorithm that scales rewards by each sample's Tool Benefit Score so that tools are invoked only when they provide genuine improvements. AdaTooler-V outperforms existing methods across twelve visual reasoning benchmarks, and AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*. All code, models, and data are released.
AdaTooler-V 是一个 MLLM,通过工具效益评分确定视觉工具的必要性,并使用 AT-GRPO 强化学习算法调整奖励尺度,仅在有益时才调用工具。AdaTooler-V 在各种视觉推理任务中表现出色,高分辨率基准 V* 的准确率达到 89.8%。该模型已开源,提供了所有代码、模型和数据。
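The idea of scaling rewards by a per-sample tool benefit can be sketched as follows. The group-normalized advantage is the standard GRPO construction; the reward shape, the `alpha` weight, and the sign convention for the benefit score are assumptions, since the abstract does not give AT-GRPO's exact formula.

```python
from statistics import mean, pstdev


def tool_adjusted_reward(correct, used_tool, tool_benefit, alpha=0.5):
    """Hypothetical reward: correctness plus a tool term scaled by benefit.

    `tool_benefit` is assumed to be positive when tools help on this sample
    and negative when they hurt, so needless tool calls are penalized.
    """
    base = 1.0 if correct else 0.0
    tool_term = alpha * tool_benefit if used_tool else 0.0
    return base + tool_term


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# On a sample where tools hurt (benefit -0.4), a correct answer obtained
# without tools earns more than a correct answer obtained with tools.
rewards = [
    tool_adjusted_reward(True, False, -0.4),
    tool_adjusted_reward(True, True, -0.4),
    tool_adjusted_reward(False, True, -0.4),
    tool_adjusted_reward(False, False, -0.4),
]
advantages = group_advantages(rewards)
```

Because advantages are computed relative to the group, the tool term only matters through how it reorders the rollouts within each group.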
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
First: 2025-12-18T18:59:54+00:00 · Latest: 2025-12-18T18:59:54+00:00
Abstract
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
中文标题/摘要
标题:生成对抗推理器:通过对抗强化学习增强LLM推理能力
具有明确推理能力的大语言模型(LLMs)在数学推理方面表现出色,但仍会犯过程错误,如错误计算、脆弱逻辑和表面上合理但实际上无效的步骤。在本文中,我们介绍了生成对抗推理器,这是一种通过对抗强化学习共同进化LLM推理器和基于LLM的鉴别器的在策略联合训练框架,旨在通过逻辑完整且长度相近的推理链片段进行计算高效审查,鉴别器使用简洁的结构化证明来评估每个片段的合理性。学习结合互补信号:LLM推理器因逻辑一致且得出正确答案的步骤而获得奖励,而鉴别器因正确检测错误或在推理过程中区分痕迹而获得奖励。这产生了密集、校准良好的在策略步骤级奖励,补充稀疏的精确匹配信号,改善了信用分配,提高了样本效率,并增强了LLM的整体推理质量。在各种数学基准测试中,该方法在标准RL后训练中相对于强基线实现了持续改进。具体而言,在AIME24上,我们使DeepSeek-R1-Distill-Qwen-7B从54.0提高到61.3(+7.3),DeepSeek-R1-Distill-Llama-8B从43.7提高到53.7(+10.0)。模块化的鉴别器还使教师蒸馏、偏好对齐和基于数学证明的推理等目标的奖励塑造变得灵活。
Summary / 总结
This paper introduces Generative Adversarial Reasoner, a framework that enhances LLM reasoning through adversarial reinforcement learning. It co-evolves an LLM reasoner and an LLM-based discriminator to improve logical consistency and reduce process errors. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator judges the soundness of each slice. Across mathematical benchmarks, the approach consistently outperforms strong baselines: on AIME24 it improves DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3 points) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0 points).
本文提出了生成对抗推理器,该框架通过对抗强化学习来增强LLM的推理能力。它通过协同进化一个LLM推理器和一个判别器来提高逻辑一致性和减少过程错误。该方法将推理链分割成逻辑完整的片段,并使用高效的审查计划。在数学基准测试中,该方法在AIME24上分别使DeepSeek-R1-Distill-Qwen-7B和DeepSeek-R1-Distill-Llama-8B提升了7.3分和10.0分。
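The slicing and reward-combination idea can be sketched roughly as below. The greedy length-based partition, the `max_chars` budget, and the `weight` mixing coefficient are all assumptions; the abstract only says slices are logically complete, of comparable length, and scored by the discriminator.

```python
def partition_into_slices(steps, max_chars=200):
    """Greedily group consecutive reasoning steps into slices of similar size.

    Steps are never split, so each slice ends on a step boundary.
    """
    slices, current, size = [], [], 0
    for step in steps:
        if current and size + len(step) > max_chars:
            slices.append(current)
            current, size = [], 0
        current.append(step)
        size += len(step)
    if current:
        slices.append(current)
    return slices


def reasoner_reward(answer_correct, slice_scores, weight=0.5):
    """Hypothetical reward: sparse exact-match term plus a dense slice term.

    `slice_scores` are per-slice soundness scores in [0, 1], assumed to come
    from the discriminator.
    """
    dense = sum(slice_scores) / len(slice_scores) if slice_scores else 0.0
    return (1.0 if answer_correct else 0.0) + weight * dense


steps = ["a" * 100, "b" * 100, "c" * 100]
slices = partition_into_slices(steps)          # two slices: [a, b] and [c]
reward = reasoner_reward(True, [1.0, 0.0])     # correct answer, one flawed slice
```

The dense term is what lets a rollout with a correct final answer but a flawed intermediate slice score lower than a fully sound one.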
StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Authors: Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen
First: 2025-12-18T18:59:50+00:00 · Latest: 2025-12-18T18:59:50+00:00
Abstract
The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.
中文标题/摘要
标题:StereoPilot:通过生成先验学习统一高效的立体图转换
立体显示的快速发展,包括VR头显和3D影院,导致对高质量立体视频内容的需求不断增加。然而,制作3D视频仍然成本高昂且复杂,而单目到立体图的自动转换受到多阶段“深度扭曲修补”(DWI)管道的限制。该范式存在误差传播、深度模糊和并行与汇聚立体图配置格式不一致的问题。为了解决这些挑战,我们引入了UniStereo,这是首个大规模统一的立体视频转换数据集,涵盖了两种立体图格式,以实现公平基准测试和稳健模型训练。基于此数据集,我们提出了StereoPilot,一种高效的前馈模型,可以直接合成目标视图,而无需依赖显式的深度图或迭代扩散采样。配备可学习的领域转换器和循环一致性损失,StereoPilot能够无缝适应不同的立体图格式,并实现更好的一致性。大量实验表明,StereoPilot在视觉保真度和计算效率方面显著优于现有最先进的方法。项目页面:https://hit-perfect.github.io/StereoPilot/
Summary / 总结
StereoPilot is a feed-forward model designed to convert monocular videos to stereo videos efficiently. It addresses the limitations of the traditional Depth-Warp-Inpaint pipeline by using a learnable domain switcher and cycle consistency loss, enabling seamless adaptation to different stereo formats. Experiments show that StereoPilot outperforms existing methods in both visual quality and computational efficiency.
论文旨在解决为立体显示设备生成高质量立体视频内容的挑战,这些设备正在变得越来越普及。为克服现有‘深度扭曲修补’管道的限制,作者引入了UniStereo,这是一个大规模统一的立体视频转换数据集。基于此数据集,他们提出了StereoPilot,这是一种高效的前馈模型,可以直接合成目标视图,而无需使用显式的深度图或迭代扩散采样。实验结果表明,StereoPilot在视觉保真度和计算效率方面都优于现有方法。
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
First: 2025-12-18T18:59:29+00:00 · Latest: 2025-12-18T18:59:29+00:00
Comments: Project Page: https://insta360-research-team.github.io/DAP_website/
Abstract
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/
中文标题/摘要
标题:深度全景图:一种全景深度估计的基础模型
在本文中,我们提出了一种全景度量深度基础模型,该模型适用于不同场景距离的泛化。我们从数据构建和框架设计的角度探索了数据在环 paradigm。我们通过结合公共数据集、我们UE5模拟器和文本到图像模型生成的高质量合成数据以及网络上的真实全景图像收集了一个大规模数据集。为了减少室内/室外和合成/真实数据之间的领域差距,我们引入了一个三阶段伪标签校准流水线,以生成未标记图像的可靠地面真相。对于模型,我们采用DINOv3-Large作为主干,因其强大的预训练泛化能力,并引入了即插即用的距离掩码头、锐度为中心的优化和几何为中心的优化,以提高对不同距离的鲁棒性并确保视图之间的几何一致性。在多个基准测试(例如Stanford2D3D、Matterport3D和Deep360)上的实验表明,该模型具有强大的性能和零样本泛化能力,在各种真实世界场景中具有特别鲁棒和稳定的度量预测。项目页面可访问:https://insta360-research-team.github.io/DAP_website/
Summary / 总结
This work introduces a panoramic metric depth foundation model that generalizes across various scene distances. The authors use a data-in-the-loop approach, combining public datasets, synthetic data, and real images to create a large-scale dataset. They introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. The model uses DINOv3-Large as its backbone and includes range mask head, sharpness-centric, and geometry-centric optimizations. Experiments show strong performance and zero-shot generalization across multiple benchmarks, with robust and stable metric predictions in diverse real-world scenes.
该研究提出了一种全景度量深度基础模型,适用于不同场景距离。作者采用数据闭环方法,结合公开数据集、UE5模拟器生成的合成数据和网络上的真实图像。他们引入了三阶段伪标签校准流水线来生成未标记图像的可靠地面真值。模型使用DINOv3-Large作为骨干,并包含范围掩码头、锐度中心优化和几何中心优化以提高性能。实验结果显示,该模型在多个基准上的表现强大且具有零样本泛化能力,在各种真实世界场景中具有稳健和稳定的度量预测。
Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
First: 2025-12-18T18:59:27+00:00 · Latest: 2025-12-18T18:59:27+00:00
Comments: 35 pages
Abstract
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
中文标题/摘要
标题:探索 vs 开发:通过剪裁、熵和虚假奖励重新思考可验证奖励强化学习(RLVR)
本文探讨了强化学习中可验证奖励(RLVR)框架下的探索-开发权衡问题,该框架旨在提高大型语言模型(LLMs)的推理能力。近期研究表明,RLVR可以通过两个看似矛盾的机制激发LLMs的强大数学推理能力:虚假奖励通过奖励与真实结果无关的结果来抑制开发,而熵最小化则通过促使模型更加自信和确定来抑制探索,揭示了一个令人困惑的动态:抑制开发和抑制探索都能提高推理性能,但调和这两种效应的原理仍不甚明了。我们关注两个基本问题:(i)策略熵与性能的关系,(ii)虚假奖励是否能带来收益,可能是通过剪裁偏差和模型污染的相互作用。我们的结果显示,在虚假奖励下,剪裁偏差降低了策略熵,导致更加自信和确定的输出,而仅通过熵最小化无法实现改进。我们进一步提出一个奖励错配模型,解释为什么虚假奖励可以在污染环境之外提升性能。我们的发现阐明了虚假奖励益处背后的机制,并为更有效的RLVR训练提供了原则。
Summary / 总结
This paper investigates the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), focusing on the roles of spurious rewards and entropy minimization. The study shows that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. The authors propose a reward-misalignment model to explain why spurious rewards can enhance performance beyond contaminated settings, clarifying the mechanisms behind spurious-reward benefits and offering principles for more effective RLVR training.
该研究探讨了RLVR框架下的探索-利用权衡,旨在提升LLM的推理能力。研究发现,虚假奖励和熵最小化对模型性能的影响,表明虚假奖励减少了策略熵,导致更自信的输出,而仅靠熵最小化不足以提升性能。研究还提出了一种奖励错配模型来解释虚假奖励如何超越污染环境提升性能,为更有效的RLVR训练提供了原理。
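For reference, the two quantities the abstract discusses, clipping and policy entropy, can be written down in a few lines. This is the standard PPO-style clipped surrogate and Shannon entropy, not the paper's analysis; `eps=0.2` is a common default chosen here for illustration.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token or action.

    `ratio` is pi_new(a|s) / pi_old(a|s). Once the ratio leaves
    [1 - eps, 1 + eps] in the direction the advantage favours, the
    objective stops growing, so that sample contributes no gradient.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# A positive-advantage sample is capped at (1 + eps) * advantage.
capped = clipped_surrogate(1.5, 1.0)
# A uniform two-way choice has entropy ln 2; a deterministic one has zero.
uniform = policy_entropy([0.5, 0.5])
deterministic = policy_entropy([1.0, 0.0])
```

The clip is asymmetric in its effect: it limits how far probability can be pushed in the favoured direction per update, which is the mechanism the abstract's "clipping bias" refers to.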
Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Authors: Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
First: 2025-12-18T18:59:17+00:00 · Latest: 2025-12-18T18:59:17+00:00
Abstract
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
中文标题/摘要
标题:后验行为克隆:为高效的RL微调预训练BC策略
从机器人学到语言的各个领域来看,标准做法是首先在大规模演示数据集上预训练一个策略,然后使用强化学习(RL)对该策略进行微调,以提高在部署领域中的性能。这一微调步骤对于实现人类或超人类的性能至关重要,然而,尽管已经给予了开发更有效的微调算法更多的关注,但很少有人关注确保预训练策略是RL微调的有效初始化。在这项工作中,我们旨在理解预训练策略如何影响微调性能,并如何预训练策略以确保它们是有效的初始化。我们首先证明,标准的行为克隆(BC)——训练策略直接匹配演示者的行为——可能会导致对演示者行为的覆盖不足,这是有效RL微调的必要条件。然后我们证明,如果我们不是精确地拟合观察到的演示,而是训练一个策略来建模给定演示数据集的演示者行为的后验分布,我们确实可以获得一个确保对演示者行为的覆盖的策略,从而实现更有效的微调。此外,这种策略——我们称之为后验行为克隆(PostBC)策略——在确保预训练性能不低于BC策略的前提下实现了这一点。然后我们证明,PostBC可以通过现代生成模型在机器人控制领域中实际实现——仅依赖于标准的监督学习——并在现实机器人控制基准测试和实际机器人操作任务中,与标准的行为克隆相比,显著提高了RL微调性能。
Summary / 总结
This work addresses the issue of pretraining policies for reinforcement learning (RL) fine-tuning, focusing on how the pretrained policy affects the fine-tuning performance. The authors theoretically show that standard behavioral cloning (BC) may fail to ensure coverage over the demonstrator's actions, which is crucial for effective RL fine-tuning. They propose a new method called posterior behavioral cloning (PostBC) that models the posterior distribution of the demonstrator's behavior, ensuring better coverage and more effective fine-tuning. PostBC achieves this without compromising the pretrained performance and is practically implementable with modern generative models, leading to improved RL fine-tuning performance in both benchmarks and real-world robotic manipulation tasks.
本文探讨了使用行为克隆(BC)预训练策略来初始化强化学习(RL)策略的问题。研究发现,标准的BC方法可能会无法覆盖演示者的动作,这对于有效的RL微调至关重要。作者提出了一种新的方法,即后验行为克隆(PostBC),该方法通过给定演示数据集来建模演示者的后验行为分布,从而确保更好的覆盖和更有效的RL微调。PostBC在不牺牲预训练性能的情况下实现了这一点,并在基准测试和真实世界的机器人操作任务中显著优于标准的BC方法。
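A toy tabular analogy may help show why exact fitting can lose coverage. The sketch compares a maximum-likelihood BC policy with a Dirichlet posterior-predictive policy at a single state. This is an illustration of the coverage argument under simplifying assumptions (discrete actions, symmetric prior), not the paper's PostBC algorithm, which uses generative models.

```python
from collections import Counter


def bc_policy(actions_seen, action_space):
    """Maximum-likelihood BC: empirical action frequencies at one state."""
    counts = Counter(actions_seen)
    n = len(actions_seen)
    return {a: counts[a] / n for a in action_space}


def posterior_predictive_policy(actions_seen, action_space, prior=1.0):
    """Posterior predictive under a symmetric Dirichlet(prior) prior.

    Every action keeps nonzero probability, so actions the demonstrator
    might take but that are absent from the dataset remain reachable.
    """
    counts = Counter(actions_seen)
    total = len(actions_seen) + prior * len(action_space)
    return {a: (counts[a] + prior) / total for a in action_space}


demos = ["left", "left", "right"]
space = ["left", "right", "stay"]
bc = bc_policy(demos, space)                      # "stay" gets probability 0
post = posterior_predictive_policy(demos, space)  # "stay" gets probability 1/6
```

An RL finetuning procedure that only samples from the initial policy can never discover "stay" when starting from `bc`, but can when starting from `post`.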
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00
Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/
Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
中文标题/摘要
标题:MomaGraph:基于视觉语言模型的统一场景图及其在体感任务规划中的状态感知
家庭中的移动机械臂必须同时导航和操作。这需要一种紧凑且语义丰富的场景表示,能够捕捉物体的位置、功能以及哪些部分可以操作。场景图是自然的选择,但先前的工作往往将空间关系和功能关系分开处理,将场景视为静态快照,不包含物体状态或时间更新,也忽略了当前任务最相关的信息。为了解决这些限制,我们引入了MomaGraph,这是一种将空间功能关系和部分级交互元素整合在一起的统一场景表示。然而,推进这种表示需要合适的数据和严格的评估,这些方面目前仍然不足。因此,我们贡献了MomaGraph-Scenes,这是第一个包含丰富注释、任务驱动的场景图的大规模数据集,以及MomaGraph-Bench,这是一个涵盖从高层规划到细粒度场景理解六个推理能力的系统评估套件。在此基础上,我们进一步开发了MomaGraph-R1,这是一种基于MomaGraph-Scenes用强化学习训练的7B视觉语言模型。MomaGraph-R1预测任务导向的场景图,并在Graph-then-Plan框架下作为零样本任务规划器。广泛的实验表明,我们的模型在开源模型中达到了最先进的结果,准确率达到71.6%(比最佳基线高11.4%),并且在公共基准测试中具有泛化能力,并且能够有效地转移到真实机器人实验。
Summary / 总结
MomaGraph addresses the limitations of prior scene graph representations by integrating spatial-functional relationships and part-level interactive elements. It introduces MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs for household environments, and MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities. MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes, predicts task-oriented scene graphs and serves as a zero-shot task planner, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), the state of the art among open-source models.
MomaGraph通过整合空间-功能关系和部分级交互元素来解决先前场景图表示的局限性。它引入了MomaGraph-Scenes,这是一个包含丰富注释和任务驱动的场景图的大规模数据集,以及MomaGraph-Bench,一个系统性的评估套件。MomaGraph-R1是一个7B的视觉-语言模型,使用强化学习在MomaGraph-Scenes上进行训练,并展示了最先进的结果,准确率达到71.6%,具有强大的泛化能力和迁移能力。
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem
First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00
Abstract
We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
中文标题/摘要
标题:SceneDiff:多视角物体变化检测的基准与方法
我们研究了在不同时间同一场景的两组捕获(图像或视频)之间识别已添加、移除或移动的物体的问题。检测此类变化对于许多应用非常重要,例如机器人整理或建筑进度和安全监控。主要挑战在于不同视角的变化可能导致物体错误地被检测为变化。我们引入了SceneDiff基准,这是第一个包含物体实例注释的多视角变化检测基准,包含350个多样化的视频对,数千个变化的物体。我们还引入了SceneDiff方法,这是一种新的无需训练的多视角物体变化检测方法,利用预训练的3D、分割和图像编码模型来稳健地预测多个基准。该方法在3D中对齐捕获,提取物体区域,并比较空间和语义区域特征以检测变化。在多视角和两视角基准上的实验表明,我们的方法在现有方法的基础上取得了显著的性能提升(相对AP改进94%和37.4%)。基准和代码将公开发布。
Summary / 总结
The research aims to detect changes in objects between two captures of the same scene taken at different times, which is crucial for applications like robotic tidying and construction monitoring. The authors introduce the SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, and the SceneDiff method, a training-free approach that uses pretrained 3D, segmentation, and image encoding models to align captures, extract object regions, and compare spatial and semantic features to detect changes. The method shows significant improvements over existing approaches, with relative AP improvements of 94% and 37.4% on multi-view and two-view benchmarks respectively.
研究旨在检测同一场景在不同时间点拍摄的两组图像之间的物体变化,这对于机器人整理和建筑监控等应用至关重要。研究引入了SceneDiff基准,这是首个包含物体实例注释的多视角变化检测基准,以及SceneDiff方法,这是一种无需训练的多视角物体变化检测方法,利用预训练的3D、分割和图像编码模型对捕获进行对齐、提取物体区域并比较特征以检测变化。实验结果显示,该方法在多视角和两视角基准上的表现均优于现有方法,相对AP改进幅度分别为94%和37.4%。
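The "relative AP improvement" figures quoted for SceneDiff can be read as the gain divided by the baseline score. The snippet below is a minimal arithmetic sketch of that reading. The function name and the AP values are hypothetical and are not taken from the paper, which reports only the relative percentages in its abstract.

```python
def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline

# Hypothetical AP values chosen only to illustrate a 94% relative gain:
# the new score is 1.94 times the baseline score.
baseline_ap = 0.25
new_ap = 0.485
gain = relative_improvement(baseline_ap, new_ap)
```

Under this reading, a 94% relative improvement means the new score is nearly double the baseline. It does not mean a gain of 94 absolute AP points.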
Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Authors: Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
First: 2025-12-18T18:59:01+00:00 · Latest: 2025-12-18T18:59:01+00:00
Comments: Project website: https://egoman-project.github.io
Abstract
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
中文标题/摘要
标题:从推理到运动:基于第一人称人类互动视频的3D手部轨迹预测学习
先前的3D手部轨迹预测工作受限于将运动与语义监督脱钩的数据集以及弱化推理与动作链接的模型。为解决这些问题,我们首先提出了EgoMAN数据集,这是一个用于交互阶段感知的3D手部轨迹预测的大规模第一人称数据集,包含219,000个6自由度轨迹和300万结构化问答对,用于语义、空间和运动推理。我们随后引入了EgoMAN模型,这是一种通过轨迹标记接口将视觉语言推理与运动生成链接的推理到运动框架。通过逐步训练使推理与运动动力学对齐,我们的方法能够生成准确且阶段感知的轨迹,并在真实场景中具有泛化能力。
Summary / 总结
The research aims to improve 3D hand trajectory prediction by addressing the limitations of existing datasets and models. It introduces the EgoMAN dataset, which includes 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning, and the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation. The model is trained to align reasoning with motion dynamics, resulting in accurate and stage-aware trajectories that generalize well across real-world scenes.
研究旨在通过解决现有数据集和模型的限制来提高3D手轨迹预测。它引入了EgoMAN数据集,其中包括219K 6DoF轨迹和3M结构化问答对用于推理,并提出了EgoMAN模型,这是一种通过轨迹令牌接口将视觉语言推理与运动生成链接的推理到运动框架。该模型通过使推理与运动动力学对齐来训练,从而实现准确且阶段感知的轨迹,并在现实世界场景中具有良好的泛化能力。
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Authors: Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma
First: 2025-12-18T18:58:42+00:00 · Latest: 2025-12-18T18:58:42+00:00
Abstract
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
中文标题/摘要
标题:VIVA:基于VLM引导的指令驱动视频编辑与奖励优化
基于指令的视频编辑旨在根据自然语言指令修改输入视频,同时保持内容保真度和时间连贯性。然而,现有的基于扩散的方法通常是在简单的编辑操作配对数据上进行训练,这从根本上限制了它们对多样且复杂的现实世界指令的泛化能力。为了解决这一泛化差距,我们提出了一种可扩展的基于指令的视频编辑框架VIVA,该框架利用VLM引导编码和奖励优化。首先,我们引入了一种基于VLM的指令器,将文本指令、源视频的第一帧和可选的参考图像编码为视觉接地的指令表示,为扩散变换器骨干网络提供精细的空间和语义上下文。其次,我们提出了一种后训练阶段Edit-GRPO,将组相对策略优化适应到视频编辑领域,直接通过相对奖励优化模型以实现指令忠实、内容保真且具有审美吸引力的编辑。此外,我们还提出了一种数据构建管道,用于合成生成基本编辑操作的多样且高保真配对视频-指令数据。大量实验表明,VIVA在指令跟随、泛化能力和编辑质量方面优于现有最先进的方法。网站:https://viva-paper.github.io
Summary / 总结
VIVA is a scalable framework for instruction-based video editing that uses a VLM-based instructor to encode textual instructions and visual inputs, and a post-training stage, Edit-GRPO, to optimize the model for instruction-faithful edits. Experiments show that VIVA outperforms existing methods in instruction following, generalization, and editing quality.
VIVA 是一个使用 VLM 引导编码和奖励优化的可扩展框架,用于基于指令的视频编辑。它引入了一个基于 VLM 的指令员来编码文本指令和视频帧,为扩散变换器提供空间和语义上下文。后训练阶段 Edit-GRPO 将组相对策略优化适应到视频编辑领域,直接优化模型以生成忠实于指令、内容保留且具有审美价值的编辑。实验表明,VIVA 在指令跟随、泛化能力和编辑质量方面优于现有方法。
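The relative rewards mentioned for Edit-GRPO can be illustrated with the group-normalized advantage commonly used in Group Relative Policy Optimization. This is a minimal sketch, not the authors' implementation. The function name `group_relative_advantages` and the reward values are hypothetical, and the abstract does not say how VIVA computes or combines its rewards.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is the sample's reward minus the group mean, divided by
    the group standard deviation (eps guards against division by zero).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four candidate edits of the same video and instruction.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because each advantage is measured against the other samples in its group, only the ranking and spread within the group matter, not the absolute scale of the reward.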
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu
First: 2025-12-18T18:56:05+00:00 · Latest: 2025-12-18T18:56:05+00:00
Abstract
Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
中文标题/摘要
标题:FlashPortrait:6倍速的自适应潜在预测无限肖像动画
当前基于扩散的长肖像动画加速方法难以保证身份一致性。本文提出FlashPortrait,这是一种端到端的视频扩散变换器,能够合成保持身份、无限长度的视频,同时实现高达6倍的推理速度加速。具体而言,FlashPortrait首先使用现成的提取器计算身份无关的表情特征。然后引入归一化表情特征块,通过将它们分别的均值和方差进行归一化,以改善面部建模中的身份稳定性。在推理过程中,FlashPortrait采用动态滑动窗口方案并在重叠区域进行加权融合,确保长动画中的平滑过渡和身份一致性。在每个上下文窗口中,根据特定时间步的潜在变化率和扩散层间导数幅度比,FlashPortrait利用当前时间步的高阶潜在导数直接预测未来时间步的潜在值,从而跳过多个去噪步骤,实现6倍速度加速。基准实验表明,FlashPortrait在定性和定量上均有效。
Summary / 总结
FlashPortrait is an end-to-end video diffusion transformer designed to synthesize identity-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. It uses an off-the-shelf extractor to compute identity-agnostic facial expression features and introduces a Normalized Facial Expression Block to align these features with diffusion latents. During inference, FlashPortrait employs a dynamic sliding-window scheme with weighted blending in overlapping areas and predicts future latents using higher-order latent derivatives, skipping several denoising steps to achieve the speedup. Experiments demonstrate its effectiveness in maintaining identity consistency and accelerating inference.
FlashPortrait旨在以比现有方法快6倍的推理速度合成具有身份一致性的无限长度肖像动画。它使用现成的提取器提取身份无关的面部表情特征,并引入归一化面部表情块将这些特征与扩散潜变量对齐。在推理过程中,FlashPortrait采用动态滑动窗口方案,并利用当前时间步的高阶潜变量预测未来时间步的潜变量,跳过多个去噪步骤,从而实现显著的加速。实验表明,它在保持身份一致性方面有效,并且能够加速推理速度。
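The sliding-window inference with weighted blending in overlapping areas, as described for FlashPortrait, can be sketched as a cross-fade between neighboring windows. This is an illustrative sketch under stated assumptions. The triangular weights, the function name `blend_windows`, and the use of one scalar per frame (instead of full latent tensors) are all hypothetical, since the abstract does not specify the weighting.

```python
import numpy as np

def blend_windows(windows, starts, total_len):
    """Cross-fade overlapping windows of per-frame values into one sequence.

    windows: list of 1-D arrays (one value per frame, for simplicity).
    starts:  start index of each window in the output sequence.
    Every output frame must be covered by at least one window.
    """
    out = np.zeros(total_len)
    weight_sum = np.zeros(total_len)
    for win, start in zip(windows, starts):
        n = len(win)
        # Triangular weights: strictly positive, largest near the window center.
        w = 1.0 - np.abs(np.linspace(-1.0, 1.0, n + 2)[1:-1])
        out[start:start + n] += w * win
        weight_sum[start:start + n] += w
    # Weighted average over all windows that cover each frame.
    return out / weight_sum

# Two hypothetical windows of length 4 that overlap on frames 2 and 3.
blended = blend_windows([np.ones(4), 3.0 * np.ones(4)], [0, 2], 6)
```

Frames covered by a single window keep that window's value. Frames in the overlap move gradually from the first window's value to the second's, which avoids a visible jump at the window boundary.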
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
First: 2025-12-18T18:56:04+00:00 · Latest: 2025-12-18T18:56:04+00:00
Comments: Code and data available at https://github.com/facebookresearch/MMRB2
Abstract
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
中文标题/摘要
标题:Multimodal RewardBench 2:评估处理交错文本和图像的全能奖励模型
奖励模型(RMs)对于训练大型语言模型(LLMs)至关重要,但它们在处理交错图像和文本序列的全能模型方面仍被严重忽视。我们引入了Multimodal RewardBench 2(MMRB2),这是第一个全面评估奖励模型在多模态理解和(交错)生成方面的基准。MMRB2 包含四个任务:文本到图像、图像编辑、交错生成和多模态推理(“图像思考”),每个任务提供了来自 23 个模型和代理的 1,000 对专家注释的偏好对,这些模型和代理来自 21 个源任务。MMRB2 设计有:(1) 实用但具有挑战性的提示;(2) 来自最先进的模型和代理的响应;(3) 通过集成筛选策略精心挑选的具有强烈人类专家共识的偏好对。使用 MMRB2,我们研究了每个子任务的现有评判者,包括多模态 LLM 作为评判者和使用人类偏好训练的模型。最新的 Gemini 3 Pro 达到 75-80% 的准确率。GPT-5 和 Gemini 2.5 Pro 达到 66-75% 的准确率,而人类的准确率超过 90%,但超过了广泛使用的 GPT-4o(59%)。最佳开源模型 Qwen3-VL-32B 达到与 Gemini 2.5 Flash(64%)相似的准确率。我们还展示了 MMRB2 的性能与下游任务成功之间的强烈相关性,使用 Best-of-N 抽样进行分析,并进行了深入分析,展示了未来改进奖励模型的关键领域。
Summary / 总结
The paper introduces Multimodal RewardBench 2 (MMRB2), a comprehensive benchmark for evaluating reward models on multimodal understanding and interleaved generation. It spans four tasks (text-to-image, image editing, interleaved generation, and multimodal reasoning) with 1,000 expert-annotated preference pairs per task. On MMRB2, Gemini 3 Pro attains 75-80% accuracy, while GPT-5 and Gemini 2.5 Pro reach 66-75%, compared to >90% for humans and 59% for the widely used GPT-4o. The best open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). The study also shows that MMRB2 performance strongly correlates with downstream task success under Best-of-N sampling.
研究旨在评估处理交错文本和图像序列的奖励模型,引入了Multimodal RewardBench 2 (MMRB2),涵盖四个任务,每个任务包含1,000个专家标注的偏好对。关键发现表明,Gemini 3 Pro的准确率为75-80%,GPT-5和Gemini 2.5 Pro为66-75%,而人类的准确率超过90%;最佳开源模型Qwen3-VL-32B的表现与Gemini 2.5 Flash(64%)相近。研究还展示了MMRB2性能与下游任务成功之间的强烈相关性。
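The two quantities discussed for MMRB2, a judge's accuracy on preference pairs and Best-of-N selection with a reward model, can be sketched in a few lines. This is a minimal illustration, not the benchmark's evaluation code. The function names, the data layout, and the tie-breaking rule (ties count as preferring response "b") are assumptions.

```python
def pairwise_accuracy(judge_scores, human_prefs):
    """Fraction of preference pairs where the judge agrees with the human label.

    judge_scores: list of (score_a, score_b) tuples from a reward model.
    human_prefs:  list of "a" or "b", the human-preferred response per pair.
    """
    correct = 0
    for (score_a, score_b), pref in zip(judge_scores, human_prefs):
        predicted = "a" if score_a > score_b else "b"
        correct += predicted == pref
    return correct / len(human_prefs)

def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=reward_fn)

# Hypothetical scores for three pairs; the judge agrees with humans on two.
accuracy = pairwise_accuracy([(0.9, 0.1), (0.2, 0.8), (0.6, 0.4)], ["a", "b", "b"])
```

A judge that scores pairs more accurately should also pick better candidates in `best_of_n`. That is the intuition behind the reported correlation between MMRB2 performance and downstream task success.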
Sceniris: A Fast Procedural Scene Generation Framework
Authors: Jinghuan Shang, Harsh Patel, Ran Gong, Karl Schmeckpeper
First: 2025-12-18T18:55:03+00:00 · Latest: 2025-12-18T18:55:03+00:00
Comments: Code is available at https://github.com/rai-inst/sceniris
Abstract
Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at https://github.com/rai-inst/sceniris
中文标题/摘要
标题:Sceniris:一种快速的程序化场景生成框架
合成3D场景对于开发物理AI和生成模型至关重要。现有的程序化生成方法通常输出吞吐量较低,成为大规模数据集创建的瓶颈。在本工作中,我们介绍了Sceniris,一种高效的程序化场景生成框架,用于快速生成大规模、无碰撞的场景变化。Sceniris还提供可选的机器人可达性检查,为机器人任务提供可操作的场景。Sceniris通过解决先前方法Scene Synthesizer的主要性能限制,设计为最大程度地提高效率。利用批处理采样和cuRobo中的更快碰撞检测,Sceniris在Scene Synthesizer的基础上实现了至少234倍的速度提升。Sceniris还扩展了先前工作中的对象级空间关系,以支持多样化的场景需求。我们的代码可在https://github.com/rai-inst/sceniris获取
Summary / 总结
Sceniris is a fast procedural scene generation framework designed to address the low throughput issue in existing methods, which is a bottleneck for scaling up dataset creation. It achieves at least 234x speed-up over Scene Synthesizer by leveraging batch sampling and faster collision checking. Sceniris can generate large-scale, collision-free scenes and supports robot reachability checks for manipulation-feasible scenes.
Sceniris 是一种快速的程序化场景生成框架,旨在解决现有方法低吞吐量的问题,这是在 Physical AI 和生成模型中扩展数据集创建的瓶颈。通过利用批量采样和更快的碰撞检测,Sceniris 在速度上比 Scene Synthesizer 提高了至少 234 倍,并能生成大规模、无碰撞的场景变化,同时提供可操作性检查以生成适合机器人任务的场景。该框架扩展了先前工作中的空间关系,以支持多样化的场景需求。
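The benefit of batch sampling with vectorized collision checking, which Sceniris is described as exploiting, can be illustrated with a toy 2-D example. This sketch does not use cuRobo or the Sceniris API. The function name `sample_collision_free`, the disc-shaped objects, and the square workspace are all hypothetical simplifications.

```python
import numpy as np

def sample_collision_free(num_scenes, num_objects, radius, extent, rng):
    """Sample a batch of 2-D layouts and keep those without overlapping discs."""
    # All candidate layouts are drawn in one call: (num_scenes, num_objects, 2).
    centers = rng.uniform(radius, extent - radius, size=(num_scenes, num_objects, 2))
    # Pairwise center distances per scene: (num_scenes, num_objects, num_objects).
    diff = centers[:, :, None, :] - centers[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Ignore each object's zero distance to itself.
    idx = np.arange(num_objects)
    dist[:, idx, idx] = np.inf
    # A scene is valid when every pair of discs is at least two radii apart.
    valid = (dist >= 2 * radius).all(axis=(1, 2))
    return centers[valid]

rng = np.random.default_rng(0)
layouts = sample_collision_free(256, 3, 0.05, 1.0, rng)
```

Checking all candidate scenes in one array operation replaces a per-scene Python loop. This is the general pattern that batched samplers rely on for throughput.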
Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation
Authors: Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello, Koki Nagano
First: 2025-12-18T18:53:28+00:00 · Latest: 2025-12-18T18:53:28+00:00
Comments: Project website is https://research.nvidia.com/labs/amri/projects/instant4d
Abstract
Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d
中文标题/摘要
标题:基于3D感知表情蒸馏的即时表达高斯头像 avatar
肖像动画在视频扩散模型的最新进展推动下取得了巨大的质量提升。然而,这些2D方法往往牺牲了3D一致性和速度,限制了其在数字孪生或远程呈现等实际场景中的应用。相比之下,基于显式3D表示(如神经辐射场或高斯散射)的3D感知面部动画前馈方法确保了3D一致性和更快的推理速度,但表情细节较差。本文旨在通过将2D扩散方法的知识蒸馏到前馈编码器中,结合它们的优点,即时将野外单张图像转换为3D一致、快速且具有表达性的可动画化表示。我们的动画表示与面部的3D表示解耦,并从数据中隐式学习运动,消除了对预定义参数模型的依赖,这些模型往往限制了动画能力。与先前用于融合3D结构和动画信息的计算密集型全局融合机制(例如,多注意力层)不同,我们的设计采用了一种高效的轻量级局部融合策略,实现了高动画表达性。因此,我们的方法在动画和姿态控制方面以107.31 FPS的速度运行,同时达到了与最新技术相当的动画质量,超越了以速度换取质量或反之的其他设计。项目网站是https://research.nvidia.com/labs/amri/projects/instant4d
Summary / 总结
This paper addresses the challenge of creating 3D-consistent and fast expressive head avatars by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder. The method converts a single in-the-wild image into a 3D-consistent, fast, and expressive animation representation. The key finding is that the proposed method achieves high animation expressivity and runs at 107.31 FPS, surpassing previous designs that trade speed for quality or vice versa.
本文旨在通过结合2D扩散模型和3D感知前馈方法的优势,解决创建3D一致且快速面部动画的挑战。作者提出了一种方法,将2D扩散模型的知识提炼到前馈编码器中,将输入图像转换为3D一致、快速且表达性强的可动画化表示。该方法实现了高动画表达性,并以107.31 FPS的速度运行,超越了以速度换取质量或反之的替代设计。动画表示与面部的3D表示解耦,并从数据中隐式学习运动,无需依赖预定义的参数模型。与以往方法不同,它使用高效的轻量级局部融合策略来实现高质量和高速度的动画。项目网站:https://research.nvidia.com/labs/amri/projects/instant4d
LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
First: 2025-12-18T18:52:18+00:00 · Latest: 2025-12-18T18:52:18+00:00
Abstract
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
中文标题/摘要
标题:LinkedOut:从视频LLM中链接世界知识表示以实现下一代视频推荐
视频大型语言模型(VLLMs)通过在互联网规模数据上进行预训练,解锁了具有世界知识的视频理解能力,并已在电影分析和视频问答等任务上显示出前景。然而,将VLLMs部署到视频推荐等下游任务仍然具有挑战性,因为实际系统需要多视频输入、轻量级骨干、低延迟序列推理和快速响应。实践中,(1) 只解码生成会导致序列推理的高延迟,(2) 传统接口不支持多视频输入,(3) 将输出限制为语言会丢弃对下游视觉任务重要的细粒度视觉细节。我们认为这些限制源于缺乏一种同时保留像素级细节并利用世界知识的表示。我们提出了LinkedOut,一种直接从视频中提取VLLM世界知识的表示,以实现快速推理、支持多视频历史记录,并去除语言瓶颈。LinkedOut 使用VLLMs从原始帧中提取语义上接地、知识感知的标记,由可提示查询和可选辅助模态引导。我们引入了一种跨层知识融合MoE,从丰富的VLLM特征中选择适当的抽象级别,实现个性化、可解释和低延迟的推荐。据我们所知,LinkedOut 是第一个在不使用手工制作标签的情况下直接在原始帧上操作的VLLM基视频推荐方法,实现了标准基准上的最佳结果。解释性研究和消融实验证实了层多样性及层内融合的好处,指出了一个实用的路径,充分利用VLLM世界知识先验和视觉推理,以实现推荐等下游视觉任务。
Summary / 总结
The research aims to address the challenges of deploying Video Large Language Models (VLLMs) for video recommendation by proposing LinkedOut, a method that extracts world knowledge directly from video frames. LinkedOut uses VLLMs to generate semantically grounded tokens from raw frames, supports multi-video inputs, and enables fast inference. Experimental results show that LinkedOut achieves state-of-the-art performance on standard benchmarks and provides interpretable, low-latency recommendations.
LinkedOut旨在通过提出一种结合世界知识与像素级细节的新表示,解决将视频大型语言模型(VLLMs)应用于视频推荐的挑战。该方法使用VLLMs从原始视频帧中提取知识感知的令牌,由可提示查询和可选辅助模态引导。LinkedOut引入了一种跨层知识融合机制,选择适当的抽象层次,实现快速和个性化的推荐。实验结果表明,LinkedOut在标准基准上达到了最先进的性能,并确认了层多样性与层间融合对下游视觉任务(如推荐)的好处。
M-PhyGs: Multi-Material Object Dynamics from Video
Authors: Norika Wada, Kohei Yamashita, Ryo Kawahara, Ko Nishino
First: 2025-12-18T18:50:08+00:00 · Latest: 2025-12-18T18:50:08+00:00
Abstract
Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on the Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.
中文标题/摘要
标题:M-PhyGs: 视频中多材料物体动力学
了解支配现实世界物体动力学的物理材料属性对于准确预测其对未见交互的响应是必要的。现有方法从视觉数据估计此类物理材料参数假设均匀单一材料物体、预学习的动力学或简单拓扑结构。然而,现实世界中的物体通常在材料组成和几何形状上复杂,超出了这些假设的范围。在本文中,我们特别关注花作为代表性常见物体。我们引入了多材料物理高斯分布(M-PhyGs)来从视频中估计此类多材料复杂自然物体的材料组成和参数。从自然环境中拍摄的短视频,M-PhyGs 联合分割物体为相似材料并恢复其连续力学参数,同时考虑重力。M-PhyGs 通过引入新的级联3D和2D损失以及利用时间小批量处理高效地实现这一点。我们引入了一个数据集 Phlowers,其中包含人们与花的互动,作为评估这一具有挑战性的多材料物理参数估计任务准确性的新平台。Phlowers 数据集上的实验结果证明了 M-PhyGs 及其组件的准确性和有效性。
Summary / 总结
The research aims to accurately estimate the physical material properties of complex multi-material objects, such as flowers, from video data. The method, Multi-material Physical Gaussians (M-PhyGs), jointly segments the object into similar materials and recovers their mechanical parameters while considering gravity. M-PhyGs uses cascaded 3D and 2D losses and temporal mini-batching for efficient computation. Experimental results on the Phlowers dataset show the accuracy and effectiveness of M-PhyGs and its components.
研究旨在通过视频数据准确估计复杂多材料物体(如花朵)的物理材料属性。M-PhyGs 方法联合分割物体为相似材料,并在考虑重力的情况下恢复其力学参数。该方法使用了级联的3D和2D损失以及时间上的小批量处理以提高计算效率。实验结果表明,M-PhyGs 在多材料物理参数估计方面具有较高的准确性和有效性。
PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
Authors: Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, Karl Pertsch
First: 2025-12-18T18:49:41+00:00 · Latest: 2025-12-18T18:49:41+00:00
Comments: Website: https://polaris-evals.github.io/
Abstract
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
中文标题/摘要
标题:PolaRiS:可扩展的现实到模拟评估框架以评估通用机器人策略
机器人学习研究中的一个重要挑战是如何准确地衡量和比较机器人策略的表现。由于机器人领域的基准测试历史上受到现实世界卷出的随机性、可重复性和耗时性的影响,这一挑战在最近的通用策略中变得更加突出,这些策略需要在各种场景和任务中进行评估。模拟中的评估为现实世界评估提供了可扩展的补充,但现有的模拟基准与现实世界之间的视觉和物理领域差距使得它们无法成为政策改进的可靠信号。此外,构建现实且多样的模拟环境通常需要大量的手工劳动和专业知识。为了弥合这一差距,我们提出了Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS),一种用于高保真模拟机器人评估的可扩展现实到模拟框架。PolaRiS 利用神经重建方法将现实世界场景的短视频扫描转换为交互式模拟环境。此外,我们还开发了一种简单的模拟数据协同训练方法,以弥合剩余的现实到模拟差距,并在未见过的模拟环境中实现零样本评估。通过广泛的模拟与现实世界的配对评估,我们证明了PolaRiS评估比现有模拟基准提供了更强的与现实世界通用策略性能的相关性。其简单性还使得能够快速创建多样的模拟环境。因此,这项工作朝着分布式和普及化的评估方法为下一代机器人基础模型迈进了一步。
Summary / 总结
PolaRiS is a scalable framework for evaluating robot policies in simulation by reconstructing real-world scenes into interactive simulations using neural methods. It also includes a co-training recipe to bridge real-to-sim gaps, enabling zero-shot evaluation. Extensive evaluations show that PolaRiS provides a stronger correlation to real-world performance compared to existing benchmarks and can create diverse simulated environments quickly.
PolaRiS 是一个可扩展的框架,通过神经重建方法将真实场景转换为互动模拟环境,用于评估机器人策略。通过配对评估,PolaRiS 在真实世界性能方面的相关性比现有基准更强,并且能够快速创建多样化的模拟环境。
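The claim that PolaRiS evaluations correlate with real-world policy performance can be made concrete with a correlation coefficient over paired success rates. The sketch below uses the Pearson coefficient as one plausible choice. The abstract does not state which correlation measure the authors use, and the function name and success rates are hypothetical.

```python
import numpy as np

def pearson_r(sim_scores, real_scores):
    """Pearson correlation between paired simulation and real-world scores."""
    sim = np.asarray(sim_scores, dtype=np.float64)
    real = np.asarray(real_scores, dtype=np.float64)
    return float(np.corrcoef(sim, real)[0, 1])

# Hypothetical success rates of four policies, each evaluated in both settings.
sim_success = [0.30, 0.45, 0.60, 0.80]
real_success = [0.25, 0.50, 0.55, 0.75]
r = pearson_r(sim_success, real_success)
```

A value of `r` close to 1 would indicate that ranking policies by simulated success is a good proxy for ranking them by real-world success.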
Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
Authors: Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
First: 2025-12-18T18:49:33+00:00 · Latest: 2025-12-18T18:49:33+00:00
Comments: Under Review
Abstract
Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
中文标题/摘要
标题:增强记忆的SAM3在遮挡鲁棒的手术器械分割
内窥镜视频中手术器械的准确分割对于计算机辅助干预至关重要,但由于频繁的遮挡、快速运动、镜面伪影以及长期器械再进入,这一任务仍然具有挑战性。尽管SAM3提供了一种强大的时空框架用于视频对象分割,但在手术场景中的性能受限于非区分性的记忆更新、固定的内存容量以及遮挡后的弱身份恢复。我们提出了一种名为ReMeDI-SAM3的无需训练的记忆增强扩展,通过三个组件解决了这些限制:(i) 基于相关性的记忆过滤,配备专门的遮挡感知记忆用于存储遮挡前的帧,(ii) 一段式插值方案,扩展了有效内存容量,(iii) 基于特征的重新识别模块,结合时间投票,实现可靠的遮挡后身份消歧。这些组件共同减轻了错误累积,并在遮挡后实现可靠的恢复。在零样本设置下,基于EndoVis17和EndoVis18的数据集上的评估显示,与原始SAM3相比,绝对mcIoU提高了约7%和16%,甚至超过了先前的训练基线方法。项目页面:https://valaybundele.github.io/remedi-sam3/
Summary / 总结
The research aims to improve surgical instrument segmentation in endoscopic videos, which is crucial for computer-assisted interventions but challenging due to occlusions, rapid motion, and instrument re-entry. The method, ReMeDI-SAM3, is a training-free extension of SAM3 that adds relevance-aware memory filtering with a dedicated occlusion-aware memory, a piecewise interpolation scheme that expands effective memory capacity, and a feature-based re-identification module with temporal voting. In a zero-shot setting it improves absolute mcIoU by around 7% on EndoVis17 and 16% on EndoVis18 over vanilla SAM3, outperforming even prior training-based approaches.
研究旨在提高内窥镜视频中手术器械的分割精度,这对于计算机辅助干预至关重要,但因遮挡和快速运动而具有挑战性。方法ReMeDI-SAM3通过引入相关性感知的记忆过滤器、分段插值方案和基于特征的重新识别模块来增强SAM3。该方法显著提高了遮挡鲁棒性,在EndoVis17和EndoVis18上的绝对mcIoU改进分别约为7%和16%,超过了SAM3和先前的训练基方法。
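The temporal-voting idea behind ReMeDI-SAM3's re-identification module can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity matching, the `gallery` of per-instrument prototype features, and the `min_sim` threshold are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def vote_identity(frame_features, gallery, min_sim=0.5):
    """Return the identity that wins the most per-frame votes, or None.

    frame_features: features of the re-appearing object over several frames.
    gallery: hypothetical dict mapping identity name -> prototype feature.
    min_sim: hypothetical threshold below which a frame casts no vote.
    """
    votes = Counter()
    for feat in frame_features:
        # Each frame votes for its most similar stored prototype.
        best_id, best_sim = max(
            ((name, cosine(feat, proto)) for name, proto in gallery.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= min_sim:
            votes[best_id] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Voting over several frames rather than trusting a single frame is what makes the decision tolerant of one or two ambiguous post-occlusion frames.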
Core-Set Selection for Data-efficient Land Cover Segmentation
Authors: Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hansch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakus, Paul L. Rosin
First: 2025-05-02T12:22:08+00:00 · Latest: 2025-12-18T18:47:38+00:00
Abstract
The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook issues related to data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. Towards this, in this paper, we introduce six basic core-set selection approaches -- that rely on imagery only, labels only, or a combination of both -- and investigate whether they can identify high-quality subsets of data capable of maintaining -- or even surpassing -- the performance achieved when using full datasets for remote sensing semantic segmentation. We benchmark such approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future works. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
中文标题/摘要
标题:数据高效土地覆盖分割的核心集选择
遥感数据的日益可获取性和其在大规模决策支持中的潜力,推动了用于地球观测任务的深度学习模型的发展。传统上,这些模型依赖于大规模数据集。然而,更大的训练数据集会导致更好性能的普遍假设往往忽略了数据冗余、噪声以及处理大规模数据集的计算成本问题。因此,有效的解决方案不仅要考虑数据的数量,还要考虑数据的质量。为此,本文介绍了六种基本的核心集选择方法——仅依赖影像、仅依赖标签或两者结合,并探讨它们是否能够识别出能够保持甚至超越使用完整数据集进行遥感语义分割性能的高质量数据子集。我们使用两种不同的架构(SegFormer和U-Net)在三个广泛使用的土地覆盖分类数据集(DFC2022、Vaihingen和Potsdam)上将这些方法与两种传统基线进行基准测试,从而为未来的工作建立了一个通用基准。我们的实验表明,所有提出的方法在多个子集大小上都优于基线,有些方法甚至选择的核心集的性能超过了使用所有可用数据进行训练。值得注意的是,在DFC2022数据集上,仅包含训练数据的25%的选定子集在SegFormer上的性能略高于使用整个数据集进行训练。这一结果表明,数据为中心的学习对于遥感领域的重要性及其潜力。代码可在https://github.com/keillernogueira/data-centric-rs-classification/ 获取。
Summary / 总结
This paper asks whether smaller, high-quality subsets can replace full datasets for remote sensing semantic segmentation, and introduces six core-set selection methods that rely on imagery, labels, or both. Experiments on three land-cover datasets (DFC2022, Vaihingen, and Potsdam) with SegFormer and U-Net show that all proposed methods outperform two traditional baselines across multiple subset sizes. On DFC2022, a subset containing only 25% of the training data yields slightly higher SegFormer performance than training on the entire dataset, which highlights the potential of data-centric learning in remote sensing.
本文通过引入六种核心集选择方法,这些方法利用图像、标签或两者结合来识别高质量的数据子集,旨在保持或超越全数据集的性能。实验在三个土地覆盖分类数据集上使用SegFormer和U-Net表明,这些方法在多个子集大小上始终优于传统基线,有些方法仅使用训练数据的25%就能达到比全数据集更好的结果,这显示了数据为中心的学习在遥感领域的潜力和重要性。
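To make the notion of core-set selection concrete, here is a minimal sketch of k-center greedy (farthest-point) selection over per-image feature vectors. The abstract does not name its six approaches, so this classic technique is swapped in purely as an illustration of an imagery-only selector; the `features` input and the `start` index are assumptions.

```python
from math import dist


def k_center_greedy(features, budget, start=0):
    """Return indices of `budget` samples chosen by farthest-point selection.

    features: list of equal-length numeric vectors (hypothetical image embeddings).
    budget: number of samples to keep, e.g. len(features) // 4 for a 25% subset.
    start: index of the first selected sample (an arbitrary choice here).
    """
    if not 0 < budget <= len(features):
        raise ValueError("budget must be between 1 and the number of samples")
    selected = [start]
    # Distance from every sample to its closest already-selected sample.
    min_dist = [dist(f, features[start]) for f in features]
    while len(selected) < budget:
        # Pick the sample that is currently worst covered by the selection.
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, dist(f, features[nxt])) for d, f in zip(min_dist, features)]
    return selected
```

The selector favours diverse samples and skips near-duplicates, which is one way to act on the redundancy concern raised in the abstract.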
TACE: A unified Irreducible Cartesian Tensor Framework for Atomistic Machine Learning
Authors: Zemin Xu, Wenbo Xie, Daiqian Xie, P. Hu
First: 2025-09-18T13:51:07+00:00 · Latest: 2025-12-18T18:43:46+00:00
Abstract
Here, we introduce the Tensor Atomic Cluster Expansion (TACE), a unified framework formulated entirely in Cartesian space, enabling systematic and consistent prediction of arbitrary structure-dependent tensorial properties. TACE achieves this by decomposing atomic environments into a complete hierarchy of irreducible Cartesian tensors, ensuring symmetry-consistent representations that naturally encode invariance and equivariance constraints. Beyond geometry, TACE incorporates universal embeddings that flexibly integrate diverse attributes including computational levels, charges, magnetic moments and field perturbations. This allows explicit control over external invariants and equivariants in the prediction process. Long-range interactions are also accurately described through the Latent Ewald Summation module within the short-range approximation, providing a rigorous yet computationally efficient treatment of electrostatic and dispersion effects. We demonstrate that TACE attains accuracy, stability, and efficiency on par with or surpassing leading equivariant frameworks across finite molecules and extended materials. This includes in-domain and out-of-domain benchmarks, spectra, Hessian, external-field responses, charged and magnetic systems, multi-fidelity training, heterogeneous catalysis, and even superior performance within the uMLIP benchmark. Crucially, TACE bridges scalar and tensorial modeling and establishes a Cartesian-space paradigm that unifies and extends beyond the design space of spherical-tensor-based methods. This work lays the foundation for a new generation of universal atomistic machine learning models capable of systematically capturing the rich interplay of geometry, fields and material properties within a single coherent framework.
中文标题/摘要
标题:TACE:统一的不可约笛卡尔张量框架用于原子机器学习
在此,我们介绍了张量原子簇展开(TACE),这是一种完全在笛卡尔空间中构建的统一框架,能够系统且一致地预测任意结构依赖的张量性质。TACE 通过将原子环境分解为完整的不可约笛卡尔张量层次结构,确保了对称一致的表示,自然地编码了不变性和协变性约束。除了几何结构,TACE 还包含了通用嵌入,灵活地整合了包括计算层次、电荷、磁矩和场扰动在内的多种属性。这允许在预测过程中显式地控制外部不变量和协变性。通过短程近似中的潜在厄瓦尔求和模块,长程相互作用也得到了准确描述,提供了对静电和色散效应的严格且计算高效的处理。我们证明,TACE 在有限分子和扩展材料上的准确度、稳定性和效率与或超越了领先协变框架。这包括域内和域外基准测试、光谱、海森堡矩阵、外部场响应、带电和磁性系统、多保真度训练、异质催化,甚至在 uMLIP 基准测试中表现更优。至关重要的是,TACE 桥接了标量和张量建模,并建立了笛卡尔空间范式,统一并超越了基于球对称张量方法的设计空间。这项工作为新一代能够系统捕捉几何、场和材料性质之间丰富相互作用的通用原子机器学习模型奠定了基础。
Summary / 总结
TACE is a unified framework, formulated entirely in Cartesian space, for predicting tensorial properties in atomistic systems. It decomposes atomic environments into a hierarchy of irreducible Cartesian tensors, which ensures symmetry-consistent representations. It incorporates universal embeddings for attributes such as computational levels, charges, magnetic moments, and field perturbations, and uses the Latent Ewald Summation module to handle long-range interactions. TACE matches or surpasses leading equivariant frameworks in accuracy, stability, and efficiency across benchmarks covering finite molecules and extended materials, including spectra, Hessians, and external-field responses.
TACE 是一个统一框架,通过将原子环境分解为不可约笛卡尔张量来预测张量性质,确保对称一致表示。它包含用于多种属性的通用嵌入,并使用 Latent Ewald 总和模块来准确描述长程相互作用。TACE 在各种基准测试中实现了与或优于领先对称框架的准确度、稳定性和效率,包括分子和材料性质、外部场响应和异质催化。
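As a small worked example of what "irreducible Cartesian tensors" means, the sketch below splits a rank-2 tensor in three dimensions into its three irreducible parts: an isotropic part (1 component), an antisymmetric part (3 components), and a symmetric traceless part (5 components). This is the standard textbook decomposition, not code from TACE; the function name is hypothetical.

```python
def decompose_rank2(t):
    """Split a 3x3 tensor into isotropic, antisymmetric and symmetric-traceless parts.

    The three returned matrices sum back to `t` (up to floating-point rounding).
    """
    trace = sum(t[i][i] for i in range(3))
    # Isotropic part: (trace / 3) times the identity, invariant under rotation.
    iso = [[trace / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Antisymmetric part: transforms like a (pseudo)vector.
    anti = [[(t[i][j] - t[j][i]) / 2 for j in range(3)] for i in range(3)]
    # Symmetric traceless part: the remaining five independent components.
    sym0 = [[(t[i][j] + t[j][i]) / 2 - iso[i][j] for j in range(3)] for i in range(3)]
    return iso, anti, sym0
```

Each part transforms independently under rotations, which is why such a decomposition can encode invariance and equivariance constraints without leaving Cartesian space.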
Pixel Seal: Adversarial-only training for invisible image and video watermarking
Authors: Tomáš Souček, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu, Tuan Tran, Tom Sander, Alexandre Mourachko
First: 2025-12-18T18:42:19+00:00 · Latest: 2025-12-18T18:42:19+00:00
Comments: Code and model available at https://github.com/facebookresearch/videoseal
Abstract
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
中文标题/摘要
标题:像素封印:仅对抗训练在不可见图像和视频水印中的应用
不可见水印对于追踪数字内容的来源至关重要。然而,训练最先进的模型仍然非常困难,当前的方法往往难以在鲁棒性和真正的不可感知性之间取得平衡。本研究引入了像素封印,为图像和视频水印设定了新的最先进的标准。我们首先识别了现有方法的三个基本问题:(i) 依赖于MSE和LPIPS等代理感知损失,这些损失无法模拟人类感知,导致可见的水印伪影;(ii) 由于目标冲突导致的优化不稳定,这需要进行详尽的超参数调优;(iii) 当将模型扩展到高分辨率图像和视频时,水印的鲁棒性和不可感知性降低。为了克服这些问题,我们首先提出了一种仅对抗训练范式,消除了不可靠的像素级不可感知损失。其次,我们引入了一种三阶段训练计划,通过解耦鲁棒性和不可感知性来稳定收敛。第三,我们通过高分辨率适应解决了分辨率差距问题,利用JND基衰减和训练时的推断模拟来消除放大伪影。我们全面评估了像素封印在不同图像类型和广泛变换范围内的鲁棒性和不可感知性,并展示了相对于最先进的方法的明显改进。最后,我们证明了该模型能够高效地适应视频,通过时间水印池化将像素封印定位为在实际图像和视频环境中可靠来源的实用且可扩展的解决方案。
Summary / 总结
Pixel Seal addresses the challenges in training invisible watermarking models by proposing an adversarial-only training paradigm and a three-stage training schedule. It improves robustness and imperceptibility, especially for high-resolution images and videos. Experimental results show that Pixel Seal outperforms existing methods in both robustness and imperceptibility across various transformations and image types.
Pixel Seal 通过提出对抗训练 paradigm、三阶段训练计划和高分辨率适应技术来解决不可见水印模型训练的挑战。该方法提高了鲁棒性和不可感知性,实现了在图像和视频上的最新性能。全面的评估表明,该方法在各种变换和图像类型上明显优于现有方法。
Sequencing to Mitigate Catastrophic Forgetting in Continual Learning
Authors: Hesham G. Moussa, Aroosa Hameed, Arashmid Akhavain
First: 2025-12-18T18:40:58+00:00 · Latest: 2025-12-18T18:40:58+00:00
Comments: The manuscript is submitted for review to IEEE Transactions on Artificial Intelligence
Abstract
To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting (CF), where learning a new task usually results in a dramatic performance drop on previously learned ones, is a major challenge to the progress of continual learning approaches. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.
中文标题/摘要
标题:序列化以减轻连续学习中的灾难性遗忘
为了应对现实世界中的动态变化,智能系统需要在其生命周期中逐步获取、更新和利用知识。这种能力,即连续学习,为AI系统提供了自我适应发展的基础。灾难性遗忘是连续学习方法发展中的一大挑战,通常学习新任务会导致之前学习任务性能的大幅下降。已经提出了许多方法来对抗灾难性遗忘的影响。大多数提出的方案可以分为五类:重放基、正则化基、优化基、表示基和架构基。在本工作中,我们从一个不同的角度来解决这个问题,具体来说,是通过考虑任务呈现给模型时的最佳顺序。我们研究了任务序列在减轻灾难性遗忘中的作用,并提出了一种确定最佳任务顺序的方法。所提出的方法利用了受神经架构搜索(NAS)启发的零样本评分算法。结果表明,智能任务序列可以显著减少灾难性遗忘。此外,当与传统的连续学习策略结合使用时,序列化可以提供更好的性能和对遗忘的鲁棒性。此外,所提出的方法还可以在其他领域找到应用,如课程学习。
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in continual learning, where learning new tasks often leads to forgetting previously learned tasks. It proposes a novel approach by optimizing the sequence in which tasks are presented to the model, using zero-shot scoring algorithms inspired by neural architecture search. The results show that this method can significantly reduce catastrophic forgetting and improve overall performance when combined with traditional continual learning strategies.
本文研究了连续学习中灾难性遗忘的问题,即新任务往往会影响之前学习的内容。作者提出了一种新的方法,关注任务的最优排序以减轻这一问题。通过使用受神经架构搜索启发的零样本评分算法,他们确定了任务的最佳顺序。结果表明,这种方法显著减少了灾难性遗忘,并且当与传统的连续学习策略结合使用时,可以提高性能。
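The idea of choosing a task order before training can be sketched as a search over permutations. The abstract does not describe how its zero-shot scores are computed, so the pairwise `interference` table below is a purely hypothetical stand-in: `interference[a][b]` is an assumed estimate of how much task `a` is forgotten when task `b` is learned later.

```python
from itertools import permutations


def order_cost(order, interference):
    """Sum the estimated forgetting each earlier task suffers from every later task."""
    return sum(
        interference[earlier][later]
        for i, earlier in enumerate(order)
        for later in order[i + 1:]
    )


def best_order(tasks, interference):
    """Exhaustively search task orders; only practical for a handful of tasks."""
    return min(permutations(tasks), key=lambda order: order_cost(order, interference))
```

Exhaustive search grows factorially with the number of tasks, so a longer task list would need a heuristic or greedy search instead.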
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00
Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io
Abstract
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
中文标题/摘要
标题:RePlan:基于推理的区域规划方法用于复杂指令驱动的图像编辑
指令驱动的图像编辑允许通过自然语言控制视觉修改,但现有模型在指令视觉复杂性(IV-复杂性)场景下表现不佳,即复杂的指令与杂乱或模糊的场景相遇时。我们提出了RePlan(区域对齐规划),这是一种计划-执行框架,结合了视觉语言规划器和扩散编辑器。规划器通过逐步推理将指令分解,并明确地将它们与目标区域关联;编辑器然后使用无训练注意力区域注入机制应用更改,从而实现精确的、并行的多区域编辑,无需迭代修复。为了增强规划,我们使用基于GRPO的强化学习应用1000个仅指令示例,显著提高了推理准确性和格式可靠性。我们还提出了IV-Edit基准,专注于精细的区域定位和知识密集型编辑。在IV-复杂场景下,RePlan始终优于大型数据集训练的强大基线,提高了区域精度和整体保真度。我们的项目页面:https://replan-iv-edit.github.io
Summary / 总结
RePlan is a plan-then-execute framework for instruction-based image editing that addresses the challenge of Instruction-Visual Complexity. It uses a vision-language planner to decompose instructions and ground them to target regions, followed by a diffusion editor that applies changes without iterative inpainting. RePlan employs GRPO-based reinforcement learning to enhance planning and outperforms strong baselines in regional precision and overall fidelity across complex settings.
RePlan 是一种用于指令驱动图像编辑的计划-执行框架,旨在解决指令-视觉复杂性问题。它通过视觉语言规划器分解指令并将其明确地定位到特定区域,随后使用无迭代填涂的注意力区域注入机制应用更改。RePlan 在复杂设置下的区域精度和整体保真度方面优于强大的基线模型,这得益于使用基于 GRPO 的强化学习以提高推理准确性和格式可靠性。引入了 IV-Edit 基准来评估细粒度定位和知识密集型编辑在复杂条件下的表现。
ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning
Authors: Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett
First: 2025-12-18T18:32:39+00:00 · Latest: 2025-12-18T18:32:39+00:00
Abstract
Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at https://reinforcegen.github.io/
中文标题/摘要
标题:ReinforceGen:结合自动数据生成和强化学习的混合技能策略
长时程操作一直是机器人领域的一个长期挑战。我们提出了ReinforceGen系统,该系统结合了任务分解、数据生成、模仿学习和运动规划,形成初始解决方案,并通过基于强化学习的微调改进每个组件。ReinforceGen首先将任务分割为多个局部技能,这些技能通过运动规划连接。技能和运动规划目标使用来自10个人类演示生成的数据集进行模仿学习训练,然后通过在线适应和强化学习进行微调。在Robosuite数据集上进行基准测试时,ReinforceGen在最高重置范围设置下使用视知觉控制达到80%的成功率。额外的消融研究显示,我们的微调方法平均提高了89%的性能。更多结果和视频请参见https://reinforcegen.github.io/
Summary / 总结
ReinforceGen is designed to address the challenge of long-horizon manipulation in robotics by combining task decomposition, data generation, imitation learning, and motion planning. It segments tasks into localized skills connected by motion planning, which are initially trained with imitation learning on a dataset generated from human demonstrations. These components are then fine-tuned through reinforcement learning, achieving an 80% success rate on all tasks in the Robosuite dataset with visuomotor controls. Ablation studies indicate that the fine-tuning approaches contribute to an 89% average performance increase.
ReinforceGen旨在通过结合任务分解、数据生成、模仿学习和运动规划来解决机器人领域的长期操作挑战。它将任务分解为局部技能,并通过运动规划连接这些技能,初始训练使用来自10个人类演示生成的数据集进行模仿学习。这些组件通过强化学习进一步微调,在Robosuite数据集中实现了80%的任务成功率,使用的是最高重置范围设置的视觉运动控制。消融研究显示,微调方法平均提高了89%的性能。
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
Authors: Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad
First: 2025-12-18T18:26:56+00:00 · Latest: 2025-12-18T18:26:56+00:00
Abstract
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
中文标题/摘要
标题:GenEval 2:解决文本到图像评估基准漂移问题
自动化文本到图像(T2I)模型评估具有挑战性;必须使用裁判模型来评分,并选择具有挑战性的测试提示,但不应该是当前T2I模型能够解决的。我们认为,满足这些约束条件可能会导致基准漂移,随着时间的推移,静态基准裁判无法跟上新模型的能力。我们展示了基准漂移是GenEval(最受欢迎的T2I基准之一)的一个重大问题。尽管GenEval在发布时与人类判断高度一致,但随着时间的推移,它已经远离了人类判断——导致当前模型的绝对误差高达17.7%。这种程度的漂移强烈表明,GenEval已经饱和了一段时间,我们通过大规模的人类研究进行了验证。为了填补这一评估缺口,我们引入了新的基准GenEval 2,它涵盖了更广泛的原始视觉概念,并具有更高的组合性,我们证明这对当前模型更具挑战性。我们还引入了Soft-TIFA,这是一种结合了视觉基本概念判断的评估方法,我们证明它与人类判断更一致,并认为与更全面的评判标准(如VQAScore)相比,它不太可能随着时间的推移而失去与人类判断的一致性。尽管我们希望GenEval 2能够为多年提供一个强大的基准,但避免基准漂移远非有保证的,我们的工作更广泛地强调了持续审计和改进对于T2I及相关自动模型评估基准的重要性。
Summary / 总结
The research addresses benchmark drift in Text-to-Image (T2I) model evaluation, where a static benchmark and its judge fail to keep up with newer model capabilities. It shows that GenEval, a widely used benchmark, has drifted significantly from human judgment, with an absolute error of as much as 17.7% for current models. To mitigate this, the authors propose GenEval 2, which improves coverage of primitive visual concepts and increases compositionality, and Soft-TIFA, an evaluation method that combines judgments for visual primitives and is shown to align better with human judgment while being argued to be less prone to drift over time.
研究通过引入GenEval 2来解决Text-to-Image (T2I)模型评估中的基准漂移问题,GenEval 2改进了视觉概念的覆盖范围和组合性。研究显示,原始的GenEval已经显著偏离了人类判断,当前模型的绝对误差最高可达17.7%。为缓解这一问题,提出了GenEval 2和一种新的评估方法Soft-TIFA,这两种方法与人类判断更加一致,并且不太容易随着时间推移而偏离人类一致性。研究强调了对T2I及相关自动模型评估基准进行持续审计和改进的重要性。
Developing Distance-Aware, and Evident Uncertainty Quantification in Dynamic Physics-Constrained Neural Networks for Robust Bearing Degradation Estimation
Authors: Waleed Razzaq, Yun-Bo Zhao
First: 2025-12-09T11:30:41+00:00 · Latest: 2025-12-18T18:26:21+00:00
Comments: Under review at Structural Health Monitoring - SAGE
Abstract
Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA, XJTU-SY and HUST datasets and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
中文标题/摘要
标题:在动态物理约束神经网络中开发距离感知和明确的不确定性量化以实现稳健的轴承退化估计
准确且具有不确定性意识的退化估计对于安全关键系统(如带有滚动轴承的旋转机械)的预测维护至关重要。许多现有的不确定性方法缺乏信心校准,运行成本高,不具有距离感知性,并且在处理离域数据时无法泛化。我们为确定性物理引导神经网络引入了两种距离感知不确定性方法:基于谱归一化高斯过程的PG-SNGP和基于深度证据回归的PG-SNER。我们对隐藏层应用谱归一化,使网络在输入到潜在空间中保持距离。PG-SNGP用高斯过程层替换最终的全连接层,以实现对距离敏感的不确定性,而PG-SNER输出正态逆伽玛参数以以一致的概率形式建模不确定性。我们使用标准准确度指标和基于皮尔逊相关系数的新距离感知指标评估性能,该指标衡量预测不确定性与测试和训练样本之间距离的相关性。我们还在损失中设计了一种动态加权方案,以平衡数据保真度和物理一致性。我们在滚动轴承退化上测试了我们的方法,使用PRONOSTIA、XJTU-SY和HUST数据集,并将它们与蒙特卡洛和深度集成物理引导神经网络进行比较。结果表明,PG-SNGP和PG-SNER提高了预测准确性,在离域条件下可靠泛化,并且对对抗攻击和噪声具有鲁棒性。
Summary / 总结
This study develops distance-aware uncertainty quantification methods for robust bearing degradation estimation in safety-critical systems. The authors introduce PG-SNGP and PG-SNER, which apply spectral normalization to the hidden layers of a physics-guided neural network and pair it with a Gaussian Process output layer or Deep Evidential Regression, respectively. Experiments on the PRONOSTIA, XJTU-SY, and HUST rolling-element bearing datasets show that these methods improve prediction accuracy, generalize well under out-of-distribution conditions, and remain robust to adversarial attacks and noise.
研究旨在为安全关键系统中的滚动轴承退化估计开发距离感知和不确定性量化方法。作者引入了基于光谱规范化和高斯过程或深度证据回归的PG-SNGP和PG-SNER,以捕捉距离敏感的不确定性。实验结果表明,这些方法在预测准确性、在异常分布条件下的可靠泛化以及对抗攻击和噪声的鲁棒性方面优于蒙特卡洛和深度集成方法。
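The distance-aware metric mentioned in the abstract, a Pearson correlation between predicted uncertainty and the distance between test and training samples, can be sketched as follows. The abstract does not say how that distance is measured, so the Euclidean distance to the nearest training sample is an assumption here, and the function names are hypothetical.

```python
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def distance_awareness(train_x, test_x, test_uncertainty):
    """Correlate predicted uncertainty with distance to the nearest training sample.

    A value near +1 means uncertainty grows as test inputs move away from
    the training data, which is the behaviour a distance-aware model should show.
    """
    nearest = [min(dist(x, t) for t in train_x) for x in test_x]
    return pearson(nearest, test_uncertainty)
```

A model whose uncertainty stays flat or shrinks on far-away inputs would score near zero or negative on this measure.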
Meta-RL Induces Exploration in Language Agents
Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
First: 2025-12-18T18:22:17+00:00 · Latest: 2025-12-18T18:22:17+00:00
Abstract
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
中文标题/摘要
标题:元RL促进语言代理的探索
强化学习(RL)使大型语言模型(LLM)代理能够与环境互动并解决多轮长期任务。然而,RL训练的代理在需要主动探索的任务中往往表现不佳,无法有效地从试错经验中适应。在本文中,我们提出了LaMer,这是一种通用的元RL框架,使LLM代理能够在测试时积极探索并从环境反馈中学习。LaMer包括两个关键组件:(i)一个跨回合训练框架,以鼓励探索和长期奖励优化;(ii)通过反思进行上下文内策略调整,使代理能够根据任务反馈信号调整其策略,而无需梯度更新。在多种环境中的实验表明,LaMer在Sokoban、MineSweeper和Webshop上的性能分别提高了11%、14%和19%,超过了RL基线。此外,LaMer在更具有挑战性或以前未见过的任务上的泛化能力也优于RL训练的代理。总体而言,我们的结果表明,元RL提供了一种有原则的方法来促进语言代理的探索,通过学习的探索策略使代理能够更稳健地适应新的环境。
Summary / 总结
This paper addresses the challenge of active exploration in reinforcement learning (RL)-trained language model agents, which often fail to efficiently explore and adapt in complex tasks. The authors introduce LaMer, a Meta-RL framework that includes a cross-episode training mechanism for exploration and an in-context policy adaptation method. Experiments show that LaMer outperforms traditional RL methods by 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop tasks, respectively, and demonstrates better generalization to new tasks.
本文解决了强化学习(RL)训练的语言代理在探索方面的问题,这些代理在需要主动探索的任务中往往无法高效地探索和适应。作者引入了LaMer,这是一种元RL框架,包括一个跨回合训练框架来鼓励探索和长期奖励优化,以及通过反思进行的上下文内策略适应。实验表明,LaMer在Sokoban、MineSweeper和Webshop上的表现分别比RL基线高出11%、14%和19%,并且在新任务上的泛化能力也优于RL训练的代理。
OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction
Authors: Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang
First: 2025-12-18T18:18:17+00:00 · Latest: 2025-12-18T18:18:17+00:00
Comments: https://opentouch-tactile.github.io/
Abstract
The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
中文标题/摘要
标题:OPENTOUCH:将全手触觉引入现实交互
人类的手是我们与物理世界的主要接口,但主观感知很少知道何时、何地或以何种力度接触。可靠的穿戴式触觉传感器稀缺,且目前没有现成的数据集将第一人称视频与全手触觉对齐。为了弥合视觉感知与物理交互之间的差距,我们提出了OpenTouch,这是首个现成的主观全手触觉数据集,包含5.1小时同步的视频-触觉-姿态数据和2900个经过精挑细选的片段,附有详细的文本注释。利用OpenTouch,我们引入了检索和分类基准,以探究触觉如何为感知和行动提供基础。我们展示了触觉信号为理解抓取提供了一种紧凑而强大的线索,加强了跨模态对齐,并且可以从现成的视频查询中可靠地检索。通过发布这个标注的视觉-触觉-姿态数据集和基准,我们旨在推动多模态主观感知、具身学习和接触丰富的机器人操作。
Summary / 总结
The paper introduces OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, which includes 5.1 hours of synchronized video-touch-pose data and 2,900 clips with detailed annotations. This dataset aims to bridge the gap between visual perception and physical interaction. The authors use OpenTouch to develop retrieval and classification benchmarks, demonstrating that tactile signals are crucial for grasp understanding and cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this dataset and benchmarks, the researchers aim to advance multimodal egocentric perception and robotic manipulation.
该论文介绍了OpenTouch,这是首个野外的全手触觉数据集,包含5.1小时的同步视频-触觉-姿态数据和2,900个带有详细注释的剪辑。利用该数据集,作者提出了检索和分类基准,以探索触觉如何影响感知和行动。关键发现表明,触觉信号对于理解抓取和跨模态对齐非常有效,并且可以从野外视频查询中可靠地检索出来。该数据集和基准旨在推动多模态自我中心感知、体态学习和接触丰富的机器人操作。
Radiology Report Generation with Layer-Wise Anatomical Attention
Authors: Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta, Antonio Martínez-Torteya
First: 2025-12-18T18:17:57+00:00 · Latest: 2025-12-18T18:17:57+00:00
Comments: 11 pages, 6 figures
Abstract
Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.
中文标题/摘要
标题:基于层级解剖注意力的放射学报告生成
自动放射学报告生成是多模态深度学习的一个有前途的应用,旨在减少报告工作量并提高一致性。然而,当前最先进的(SOTA)系统,如多模态人工智能放射学应用(MAIRA-2)和医疗路径语言模型-多模态(MedPaLM-M),依赖大规模多模态训练、临床元数据和多种影像视图,使其资源密集且大多数情况下不可用。我们介绍了一种紧凑的图像到文本架构,可以从单张前视影像生成胸部X光报告的发现部分。该模型结合了一个冻结的无标签自我蒸馏v3(DINOv3)视觉变换器(ViT)编码器和一个增强有层级解剖注意力的生成预训练变换器2(GPT-2)解码器。该机制通过分层高斯平滑将肺和心分割掩码集成在一起,使注意力偏向临床相关区域,而不增加可训练参数。在使用胸部放射图专家(CheXpert)和放射学图(RadGraph)指标对官方医学信息密集护理-胸部X光(MIMIC-CXR)数据集进行评估时,我们的方法取得了显著的改进:五种关键病理学的宏F1分数提高了168%(0.083 -> 0.238),微F1分数提高了146%(0.137 -> 0.337),而14个观察指标的总体性能提高了86%(0.170 -> 0.316)。结构连贯性也得到了提高,RadGraph F1分数提高了9.7%。尽管模型很小且完全是基于图像设计的,但该模型表明解码器级别的解剖学指导可以提高空间定位并增强临床相关区域的连贯性。源代码可在以下网址公开获取:https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025。
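The anatomical attention described in the abstract above (segmentation masks smoothed by Gaussians and used to bias attention without trainable parameters) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the additive log-prior form, the per-layer sigma schedule, and all function names (`smooth_mask`, `biased_attention`, `layer_sigmas`) are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel with radius of about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_mask(mask, sigma):
    """Separable Gaussian blur of a 2D mask, using edge padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(mask.astype(np.float64), r, mode="edge")
    # Blur along rows first, then along columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def biased_attention(scores, mask_patches, sigma, eps=1e-6):
    """Softmax over image patches with a fixed anatomical prior.

    scores:       (T, P) raw attention logits from text tokens to patches
    mask_patches: (h, w) lung/heart mask at patch resolution, with h * w == P
    The prior is added in log space (an assumption of this sketch),
    so it introduces no trainable parameters.
    """
    prior = smooth_mask(mask_patches, sigma).reshape(-1)
    logits = scores + np.log(prior + eps)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def layer_sigmas(num_layers, sigma_max=3.0, sigma_min=0.5):
    """Hypothetical layer-wise schedule: broad smoothing first, sharper later."""
    return np.linspace(sigma_max, sigma_min, num_layers)
```

With uniform raw scores, patches inside the mask receive more attention than patches outside it, and varying `sigma` per decoder layer changes how sharply the prior follows the mask boundary.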
DenseBEV: Transforming BEV Grid Cells into 3D Objects
Authors: Marius Dähling, Sebastian Krebs, J. Marius Zöllner
Venue: WACV 2026
First: 2025-12-18T17:59:22+00:00 · Latest: 2025-12-18T17:59:22+00:00
Comments: 15 pages, 8 figures, accepted by WACV 2026
Abstract
In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
中文标题/摘要
标题:DenseBEV:将BEV网格单元转换为3D物体
当前研究中,基于鸟瞰视图(BEV)的Transformer越来越多地用于多摄像头3D物体检测。传统模型通常使用随机查询作为锚点,并逐步优化它们。最近的进展通过辅助网络的检测来补充或替代这些随机查询。我们提出了一种更直观且高效的直接使用BEV特征单元作为锚点的方法。该端到端的方法利用了BEV查询的密集网格,将每个单元视为最终检测任务的潜在物体。因此,我们引入了一种专为多摄像头3D物体检测设计的新型两阶段锚点生成方法。为了解决大量查询时注意力的扩展问题,我们应用了基于BEV的非极大值抑制,仅允许梯度流经未被抑制的对象。这确保了高效的训练,无需后处理。通过直接使用BEVFormer等编码器的BEV特征作为物体查询,时间BEV信息自然嵌入其中。基于我们已嵌入时间BEV信息的物体查询,我们引入了一种混合时间建模方法,通过整合先验检测来进一步提高检测性能。在nuScenes数据集上的评估显示,即使使用更稀疏的BEV网格和更少的初始锚点,我们的方法在NDS和mAP上也表现出一致且显著的改进。特别是在小物体检测方面,它提高了行人的检测性能,nuScenes上的mAP提高了3.8%,Waymo上的LET-mAP提高了8%。将我们的方法应用于具有挑战性的Waymo Open数据集,取得了最先进的性能,LET-mAP达到60.7%,超越了之前的最好成绩5.4%。代码可在https://github.com/mdaehl/DenseBEV获取。
Summary / 总结
The research aims to improve multi-camera 3D object detection by directly using BEV feature cells as anchors, which is more intuitive and efficient than traditional methods. The proposed DenseBEV method introduces a two-stage anchor generation process and applies BEV-based Non-Maximum Suppression to handle the scaling issues of attention. This approach achieves consistent improvements in NDS and mAP on the nuScenes dataset, especially for small objects, and sets a new state-of-the-art performance on the Waymo Open dataset with a LET-mAP of 60.7%.
研究旨在通过直接使用BEV特征单元作为锚点来提高多相机3D目标检测的效率,这种方法比传统的随机查询更直观和高效。方法引入了两阶段锚点生成过程,并应用了基于BEV的非极大值抑制来解决注意力缩放问题。DenseBEV方法在NDS和mAP上相对于基线方法表现出一致的改进,特别是在行人等小型物体上,还在Waymo Open数据集上实现了最先进的性能,LET-mAP达到60.7%。
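The idea of treating every BEV cell as a candidate anchor and then thinning the dense set with BEV-based Non-Maximum Suppression could be illustrated with a simplified grid version. This sketch is an assumption-laden stand-in: it suppresses by cell neighborhood (`radius`) rather than by predicted box overlap, and the function name `bev_grid_nms` and its parameters are hypothetical, not taken from the DenseBEV code.

```python
import numpy as np

def bev_grid_nms(scores, radius=1, top_k=None):
    """Greedy NMS over a 2D grid of per-cell objectness scores.

    scores: (H, W) array, one score per BEV cell
    radius: cells within this Chebyshev distance of a kept cell are suppressed
    top_k:  optional cap on the number of kept cells
    Returns the kept (row, col) cells, highest score first.
    """
    h, w = scores.shape
    order = np.argsort(scores, axis=None)[::-1]  # flat indices, best first
    suppressed = np.zeros((h, w), dtype=bool)
    kept = []
    for flat in order:
        r, c = divmod(int(flat), w)
        if suppressed[r, c]:
            continue
        kept.append((r, c))
        # Suppress the square neighborhood around the kept cell (clipped to the grid).
        r0, r1 = max(0, r - radius), min(h, r + radius + 1)
        c0, c1 = max(0, c - radius), min(w, c + radius + 1)
        suppressed[r0:r1, c0:c1] = True
        if top_k is not None and len(kept) >= top_k:
            break
    return kept
```

In a training setup along the lines the abstract describes, only the features of the kept cells would be passed on as object queries, so gradients would flow through non-suppressed cells only.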
Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs
Authors: William English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Venue: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15370-15383, 2025
First: 2025-12-18T17:55:15+00:00 · Latest: 2025-12-18T17:55:15+00:00
Abstract
Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
中文标题/摘要
标题:基于语法的自然语言到时间逻辑的强制翻译
将自然语言(NL)翻译成形式语言如时间逻辑(TL)对于人类与机器人和自主系统的交流至关重要。最先进的方法将任务分解为原子命题(APs)提升阶段和翻译阶段。然而,现有方法在准确提升、共指存在以及从有限数据中学习方面存在困难。在本文中,我们提出了一种名为Grammar Forced Translation(GraFT)的自然语言到时间逻辑翻译框架。该框架基于先前工作通过让语言模型从其全词汇中迭代预测标记来同时解决提升和翻译步骤的观察。相比之下,GraFT通过在每一步中仅限制有效输出标记集为少量标记来降低两个任务的复杂性。通过利用每个问题的独特属性,我们减少了解空间。我们还提供了理论依据,说明为什么解空间的减少会导致更有效的学习。我们使用CW、GLTL和Navi基准评估了GraFT的有效性。与最先进的翻译方法相比,GraFT在端到端翻译准确性上提高了5.49%,在域外翻译准确性上平均提高了14.06%。
Summary / 总结
The paper addresses the challenge of translating natural language into temporal logic, focusing on the limitations of existing methods in accurately lifting atomic propositions, handling co-references, and learning from limited data. It introduces a framework called Grammar Forced Translation (GraFT) that reduces the complexity of both the lifting and translation steps by restricting the set of valid output tokens at each step to a handful instead of the full vocabulary. On the CW, GLTL, and Navi benchmarks, GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average compared to state-of-the-art methods.
本文针对将自然语言翻译成时序逻辑的挑战,关注现有方法的局限性,如不准确的提升和处理共指问题。提出的框架Grammar Forced Translation (GraFT)通过在每一步限制输出令牌来简化过程,从而提高了准确性。GraFT在各种基准测试中的端到端翻译准确率提高了5.49%,跨域翻译准确率提高了14.06%。
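Restricting the valid output tokens at each decoding step, as described above, could look roughly like the following toy decoder. It is a generic grammar-constrained greedy decoder under stated assumptions, not GraFT itself: the vocabulary, the prefix-notation grammar tracked through operator arities, and the names `allowed_tokens` and `constrained_greedy_decode` are all invented for illustration.

```python
import numpy as np

# Toy vocabulary: temporal-logic tokens plus natural-language tokens
# that a full-vocabulary decoder might wrongly emit.
VOCAB = ["G", "F", "!", "&", "|", "->", "prop_1", "prop_2", "<eos>", "the", "robot"]

# Number of operands each formula token consumes (prefix notation).
ARITY = {"G": 1, "F": 1, "!": 1, "&": 2, "|": 2, "->": 2, "prop_1": 0, "prop_2": 0}

def allowed_tokens(need):
    """Tokens the grammar permits when `need` operands are still missing."""
    if need == 0:
        return ["<eos>"]   # formula complete: only termination is valid
    return list(ARITY)     # otherwise any operator or proposition

def constrained_greedy_decode(step_logits, vocab=VOCAB):
    """Greedy decoding where invalid tokens are masked to -inf at each step."""
    index = {tok: i for i, tok in enumerate(vocab)}
    need, output = 1, []
    for logits in step_logits:
        mask = np.full(len(vocab), -np.inf)
        mask[[index[t] for t in allowed_tokens(need)]] = 0.0
        token = vocab[int(np.argmax(logits + mask))]
        if token == "<eos>":
            break
        output.append(token)
        need += ARITY[token] - 1  # token fills one slot and opens ARITY new ones
    return output
```

Even when the raw logits favor an out-of-grammar token such as `the`, the mask leaves only the grammar-valid candidates to choose from, which is the sense in which the solution space shrinks at every step. If the logits run out before the formula is complete, this sketch simply returns the partial prefix.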
Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning
Authors: Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella
First: 2025-12-18T17:54:20+00:00 · Latest: 2025-12-18T17:54:20+00:00
Abstract
Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
中文标题/摘要
标题:基于多智能体强化学习的蜂群网络协同抗干扰韧性
反应式干扰器通过选择性地破坏智能体间的通信并破坏编队完整性和任务成功,对机器人蜂群网络构成严重安全威胁。传统的固定功率控制或静态频道跳转等对策对这种适应性对手基本无效。本文提出了一种基于 QMIX 算法的多智能体强化学习(MARL)框架,以提高在反应式干扰下蜂群通信的韧性。我们考虑了一个多个发射-接收对共享频道的网络,同时一个具有马尔可夫阈值动态特性的反应式干扰器感知总功率并相应地作出反应。每个智能体联合选择发射频率(频道)和功率,QMIX 学习一个集中但可分解的动作-价值函数,从而实现协调但分散的执行。我们在无频道重用设置中将 QMIX 与 genie-aided 最优策略进行基准测试,并在更一般的具有频道重用的衰落环境中将 QMIX 与局部上置信界(UCB)和无状态反应策略进行基准测试。仿真结果表明,QMIX 迅速收敛到几乎与 genie-aided 边界匹配的合作策略,同时实现比基线更高的吞吐量和更低的干扰发生率,从而证明了 MARL 在争夺环境中保护自主蜂群的有效性。
Summary / 总结
This paper addresses the security threat that reactive jammers pose to robotic-swarm networks by proposing a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm. Each agent jointly selects its transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. QMIX is benchmarked against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a fading regime with channel reuse. In simulation, it nearly matches the genie-aided bound and achieves higher throughput and lower jamming incidence than the baselines.
论文提出了一种基于QMIX算法的多智能体强化学习(MARL)框架,以应对反应式干扰器对机器人蜂群网络的威胁。每个蜂群智能体选择传输频率和功率,QMIX学习一个集中化但可分解的动作-价值函数,以实现协调但分散的执行。仿真结果表明,QMIX在吞吐量和干扰发生率方面优于基线策略,在无频道重用设置中接近 genie-辅助的最优策略性能,在具有频道重用的衰落环境中则优于局部 UCB 和无状态反应策略。
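The "centralized but factorizable" value function mentioned above rests on QMIX's monotonic mixing: per-agent Q-values are combined with non-negative, state-dependent weights, so each agent's greedy choice agrees with the joint greedy choice. The following is a minimal sketch under simplifying assumptions: single linear hypernetworks with random weights, and hypothetical names (`MonotonicMixer`, `decentralized_actions`) and array shapes that do not come from the paper.

```python
import numpy as np

def elu(x):
    """ELU activation (monotonically increasing)."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

class MonotonicMixer:
    """Mixes per-agent Q-values into Q_tot with state-conditioned weights.

    Mixing weights pass through abs(), so Q_tot is non-decreasing in every
    agent's Q-value. Hypernetworks are plain linear maps in this sketch.
    """
    def __init__(self, n_agents, state_dim, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n, self.h = n_agents, hidden
        self.H1 = rng.normal(size=(state_dim, n_agents * hidden))
        self.Hb1 = rng.normal(size=(state_dim, hidden))
        self.H2 = rng.normal(size=(state_dim, hidden))
        self.Hb2 = rng.normal(size=(state_dim,))

    def __call__(self, agent_qs, state):
        w1 = np.abs(state @ self.H1).reshape(self.n, self.h)  # non-negative
        b1 = state @ self.Hb1
        w2 = np.abs(state @ self.H2)                           # non-negative
        b2 = state @ self.Hb2
        hidden = elu(agent_qs @ w1 + b1)
        return float(hidden @ w2 + b2)

def decentralized_actions(q_tables):
    """Each agent greedily picks its own (channel, power) pair.

    q_tables: (n_agents, n_channels, n_powers) per-agent action values.
    """
    n, n_channels, n_powers = q_tables.shape
    flat = q_tables.reshape(n, -1).argmax(axis=1)
    return [tuple(int(v) for v in np.unravel_index(i, (n_channels, n_powers)))
            for i in flat]
```

The mixer is only needed during centralized training; at execution time each agent would call the equivalent of `decentralized_actions` on its own Q-values.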
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Authors: Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
First: 2025-12-18T17:51:42+00:00 · Latest: 2025-12-18T17:51:42+00:00
Abstract
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
中文标题/摘要
标题:GeoPredict:利用预测动力学和三维高斯几何进行精确VLA操作
视觉-语言-动作(VLA)模型在机器人操作中表现出强大的泛化能力,但仍然主要具有反应性和二维中心性,使其在需要精确三维推理的任务中不可靠。我们提出了一种GeoPredict几何感知VLA框架,该框架通过预测动力学和几何先验增强连续动作策略。GeoPredict引入了一条轨迹级模块,用于编码运动历史并预测多步机器人手臂的三维关键点轨迹,以及一个预测三维高斯几何模块,该模块通过轨迹引导细化来预测工作空间几何。这些预测模块仅在训练时作为基于深度渲染的监督使用,推理时仅需要轻量级的附加查询标记,而不涉及任何三维解码。在RoboCasa Human-50、LIBERO和实际操作任务上的实验表明,GeoPredict在几何密集型和空间需求高的场景中始终优于强大的VLA基线。
Summary / 总结
GeoPredict is designed to enhance the precision of robotic manipulation by integrating predictive kinematics and 3D Gaussian geometry into VLA models. It uses a trajectory-level module to predict 3D keypoint trajectories and a predictive 3D Gaussian geometry module to forecast workspace geometry, both serving as training-time supervision. Experiments demonstrate that GeoPredict outperforms existing VLA models, particularly in tasks requiring precise 3D reasoning and spatial awareness.
GeoPredict 通过将预测动力学和 3D 高斯几何学整合到 VLA 模型中,旨在提高机器人操作的精确性。它使用轨迹级模块来预测 3D 关键点轨迹,并使用预测的 3D 高斯几何模块来预测工作空间几何形状,两者仅作为训练时的监督。实验表明,GeoPredict 在需要精确 3D 推理和空间意识的任务中优于现有 VLA 模型。
Online Continual Graph Learning
Authors: Giovanni Donghi, Luca Pasa, Daniele Zambon, Cesare Alippi, Nicolò Navarin
First: 2025-08-05T10:05:09+00:00 · Latest: 2025-12-18T17:30:25+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Continual Learning (CL) aims to incrementally acquire new knowledge while mitigating catastrophic forgetting. Within this setting, Online Continual Learning (OCL) focuses on updating models promptly and incrementally from single or small batches of observations from a data stream. Extending OCL to graph-structured data is crucial, as many real-world networks evolve over time and require timely, online predictions. However, existing continual or streaming graph learning methods typically assume access to entire graph snapshots or multiple passes over tasks, violating the efficiency constraints of the online setting. To address this gap, we introduce the Online Continual Graph Learning (OCGL) setting, which formalizes node-level continual learning on evolving graphs under strict memory and computational budgets. OCGL defines how a model incrementally processes a stream of node-level information while maintaining anytime inference and respecting resource constraints. We further establish a comprehensive benchmark comprising seven datasets and nine CL strategies, suitably adapted to the OCGL setting, enabling a standardized evaluation setup. Finally, we present a minimalistic yet competitive baseline for OCGL, inspired by our benchmarking results, that achieves strong empirical performance with high efficiency.
中文标题/摘要
标题:在线持续图学习
持续学习(CL)旨在增量地获取新知识并减轻灾难性遗忘。在此框架下,在线持续学习(OCL)专注于从数据流中的单个或小批次观测中及时、增量地更新模型。将OCL扩展到图结构数据至关重要,因为许多现实世界的网络会随时间演变,并需要及时的在线预测。然而,现有的持续或流式图学习方法通常假设可以访问整个图快照或多次遍历任务,这违反了在线设置的效率约束。为了解决这一差距,我们提出了在线持续图学习(OCGL)框架,该框架在严格的内存和计算预算下,形式化了在演变图上的节点级持续学习,定义了模型如何增量地处理节点级信息流,同时保持随时推理并遵守资源约束。我们进一步建立了一个全面的基准,包括七个数据集和九种CL策略,适当地适应OCGL设置,以实现标准化的评估框架。最后,我们提出了一个基于基准测试结果的简洁但具有竞争力的OCGL基线,该基线在效率高且具有较强实证性能。
Summary / 总结
The paper introduces the Online Continual Graph Learning (OCGL) setting, which formalizes node-level continual learning on evolving graphs under strict memory and computational budgets while maintaining anytime inference. It establishes a benchmark of seven datasets and nine continual learning strategies adapted to OCGL, and presents a minimalistic yet competitive baseline that achieves strong empirical performance with high efficiency.
论文解决了图结构数据的持续学习问题,需要模型从数据流中增量更新。它引入了在线持续图学习(OCGL),以在严格的资源约束下处理不断变化的图。作者建立了一个包含七个数据集和九种策略的基准,并提出了一种简洁但有效的基线,其在准确性和效率方面表现良好。
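The online constraints described above (a single pass over a node stream, small batches, a strict memory budget) can be made concrete with a generic replay loop. This is only an illustration of the setting: reservoir sampling is a standard bounded-memory technique chosen here as an assumption, not necessarily one of the nine benchmarked strategies, and `ReservoirBuffer`, `run_stream`, and `update_fn` are hypothetical names.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity memory holding a uniform sample of everything seen so far."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def run_stream(stream, update_fn, capacity=10, replay_k=4):
    """Single pass over a stream of small node batches with bounded memory.

    update_fn(batch, replay) stands in for one model update; the model
    can be queried for predictions after any step (anytime inference).
    """
    buffer = ReservoirBuffer(capacity)
    for batch in stream:
        replay = buffer.sample(replay_k)
        update_fn(batch, replay)  # each batch is visited exactly once
        for node in batch:
            buffer.add(node)
    return buffer
```

Memory use stays bounded by `capacity` regardless of stream length, which is the kind of budget the OCGL setting asks methods to respect.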
KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals
Authors: Shuting Zhao, Zeyu Xiao, Xinrong Chen
Venue: AAAI 2026
First: 2025-12-18T17:25:47+00:00 · Latest: 2025-12-18T17:25:47+00:00
Comments: Accepted by AAAI 2026
Abstract
Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
中文标题/摘要
标题:KineST:一种基于运动学引导的空间时间状态空间模型,用于从稀疏信号中进行全身运动跟踪
全身运动跟踪在AR/VR应用中起着关键作用,连接物理和虚拟交互。然而,基于头戴式显示器获得的稀疏信号,很难基于这些信号重建逼真且多样的全身姿态,而头戴式显示器是AR/VR场景中的主要设备。现有的姿态重建方法往往计算成本高,或者需要分别建模空间和时间依赖性,这使得难以在准确度、时间连贯性和效率之间取得平衡。为了解决这个问题,我们提出了一种新颖的基于运动学引导的状态空间模型KineST,该模型有效地提取了空间时间依赖性,同时整合了局部和全局姿态感知。创新之处在于两个核心思想。首先,为了更好地捕捉复杂的关节关系,在状态空间二元性框架下的扫描策略被重新定义为运动学引导的双向扫描,这嵌入了运动学先验。其次,采用了一种混合的空间时间表示学习方法,紧密耦合空间和时间上下文,平衡准确性和平滑度。此外,引入了几何角速度损失,对旋转变化施加物理上合理的约束,进一步提高运动稳定性。广泛的实验表明,KineST在轻量级框架内具有优越的准确性和时间一致性。项目页面:https://kaka-1314.github.io/KineST/
Summary / 总结
KineST is a novel kinematics-guided state space model designed to reconstruct realistic full-body poses from sparse signals in AR/VR applications. It uses kinematics-guided bidirectional scanning and a mixed spatiotemporal representation learning approach to balance accuracy and temporal coherence. The model also introduces a geometric angular velocity loss to enhance motion stability. Experiments show that KineST outperforms existing methods in both accuracy and temporal consistency while maintaining a lightweight framework.
KineST是一种新颖的运动学引导状态空间模型,旨在从AR/VR应用场景中的稀疏信号中重建逼真的全身姿态。该模型引入了运动学引导的双向扫描和混合时空表示学习,以平衡准确度、时间连贯性和效率。实验结果表明,KineST在准确度和时间一致性方面均优于现有方法,同时保持了轻量级框架。
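The abstract does not give the form of the geometric angular velocity loss. One plausible reading, sketched below purely as an assumption, compares frame-to-frame relative rotations of the predicted and reference motion using the geodesic angle on SO(3). The function names and the mean reduction are illustrative choices, not the authors' definition.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis (helper for building examples)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def geodesic_angle(R):
    """Rotation angle of each matrix in a stack (..., 3, 3), in radians."""
    cos = (np.trace(R, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def relative_rotations(R_seq):
    """Frame-to-frame rotations R_t^T R_{t+1} for a sequence of shape (T, 3, 3)."""
    return np.swapaxes(R_seq[:-1], -1, -2) @ R_seq[1:]

def angular_velocity_loss(R_pred, R_gt):
    """Mean geodesic distance between predicted and reference per-step rotations."""
    d_pred = relative_rotations(R_pred)
    d_gt = relative_rotations(R_gt)
    diff = np.swapaxes(d_gt, -1, -2) @ d_pred
    return float(geodesic_angle(diff).mean())
```

A loss of this kind penalizes differences in how fast joints rotate between frames rather than differences in absolute pose, which is why it would target motion stability.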