arXiv Paper Digest

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang

First: 2025-12-31T18:59:57+00:00 · Latest: 2025-12-31T18:59:57+00:00

Comments: Project page: https://zheninghuang.github.io/Space-Time-Pilot/ · Code: https://github.com/ZheningHuang/spacetimepilot

Abstract

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
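The temporal-warping idea in the abstract can be illustrated with a small sketch. The abstract only states that existing multi-view datasets are repurposed to mimic temporal differences; the specific warp modes, the function names `warp_indices` and `make_training_pair`, and the `views[cam][t]` data layout below are assumptions made for illustration, not the authors' implementation.

```python
import random


def warp_indices(num_frames, mode, rng=None):
    """For each output frame, return the index of the source frame to show.

    The modes here are hypothetical examples of a temporal warp.
    """
    rng = rng or random.Random(0)
    last = num_frames - 1
    if mode == "identity":
        return list(range(num_frames))
    if mode == "reverse":
        return list(range(last, -1, -1))
    if mode == "slow":
        # Play the first half of the clip at half speed.
        return [i // 2 for i in range(num_frames)]
    if mode == "freeze":
        # Hold one randomly chosen moment for the whole clip.
        t = rng.randint(0, last)
        return [t] * num_frames
    raise ValueError(f"unknown mode: {mode}")


def make_training_pair(views, src_cam, tgt_cam, mode):
    """Build a (source, target, time condition) triple from synchronized views.

    views[cam][t] is the frame seen by camera `cam` at time step t.
    The source keeps the original timing; the target comes from another
    camera and follows the warped timing.
    """
    num_frames = len(views[src_cam])
    idx = warp_indices(num_frames, mode)
    source = list(views[src_cam])
    target = [views[tgt_cam][t] for t in idx]
    return source, target, idx


views = {
    "cam_a": ["a0", "a1", "a2", "a3"],
    "cam_b": ["b0", "b1", "b2", "b3"],
}
source, target, time_cond = make_training_pair(views, "cam_a", "cam_b", "reverse")
print(source, target, time_cond)
```

In this sketch the warped index list doubles as the time condition a model could be given, so the camera change (a different view) and the timing change (a different index sequence) vary independently within one training pair.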

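Translated title / abstract

Title: SpaceTimePilot: Generative Rendering of Dynamic Scenes with Space-Time Disentanglement

We present SpaceTimePilot, a video diffusion model that disentangles space and time to enable controllable generative rendering. Given a monocular video, SpaceTimePilot can independently change the camera viewpoint and the motion sequence during generation, re-rendering the scene for continuous, arbitrary exploration across space and time. To this end, we introduce an effective animation time-embedding mechanism into the diffusion process, which gives explicit control over the output video's motion sequence relative to the source video. Because no existing dataset provides paired videos of the same dynamic scene with continuous temporal variation, we propose a simple yet effective temporal-warping training scheme that uses existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further improve the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows the camera to change from the first frame, and CamxTime, the first synthetic rendering dataset with full space-time coverage, which provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on real-world and synthetic data, showing clear space-time disentanglement and strong results compared with prior work.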

Summary

SpaceTimePilot is a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, it can independently alter the camera viewpoint and the motion sequence, enabling continuous exploration across space and time. The model uses an animation time-embedding mechanism and a temporal-warping training scheme to achieve robust space-time disentanglement. Two additional components, an improved camera-conditioning mechanism and the synthetic CamxTime dataset, further improve the precision of the joint camera and time control. Evaluations on real-world and synthetic data show clear space-time disentanglement and strong performance compared to previous methods.

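SpaceTimePilot is a video diffusion model that decouples space and time for controllable generative rendering. Given a monocular video, it can independently change the camera viewpoint and the motion sequence, allowing continuous exploration of the scene in space and time. The model uses an animation time-embedding mechanism and a temporal-warping training scheme to achieve robust space-time decoupling. It also introduces an improved camera-conditioning mechanism and the CamxTime dataset, which further strengthen control. Experiments show clear space-time decoupling and strong performance compared with prior methods.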

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Authors: Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu

First: 2025-12-31T18:59:55+00:00 · Latest: 2025-12-31T18:59:55+00:00

Comments: Project page: https://yichuanh.github.io/GaMO/

Abstract

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
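To make "expanding the field of view from existing camera poses" concrete, the following sketch uses standard pinhole-camera geometry. It assumes outpainting is modeled as padding the image canvas while keeping the pose and focal length fixed; the function names and the `pad` parameter are hypothetical, and the abstract does not state how GaMO parametrizes the enlarged view.

```python
import math


def horizontal_fov_deg(width_px, focal_px):
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))


def outpaint_intrinsics(width, height, fx, fy, cx, cy, pad):
    """Intrinsics after padding the canvas by `pad` pixels on every side.

    The camera pose and focal length are unchanged; only the canvas grows,
    so the principal point shifts by `pad` and the field of view widens.
    """
    return {
        "width": width + 2 * pad,
        "height": height + 2 * pad,
        "fx": fx,
        "fy": fy,
        "cx": cx + pad,
        "cy": cy + pad,
    }


original = {"width": 640, "height": 480, "fx": 320.0, "fy": 320.0,
            "cx": 320.0, "cy": 240.0}
expanded = outpaint_intrinsics(pad=160, **original)

print(horizontal_fov_deg(original["width"], original["fx"]))
print(horizontal_fov_deg(expanded["width"], expanded["fx"]))
```

With these example numbers the horizontal field of view grows from 90 degrees to roughly 112.6 degrees. Because the original pixels keep their ray directions, the known region stays geometrically consistent by construction and only the new border has to be generated.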

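Translated title / abstract

Title: GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

In recent years, 3D reconstruction has made remarkable progress and can capture high-quality scenes from dense multi-view imagery, but it struggles when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been applied to this problem. The latest diffusion-based methods achieve substantial improvements by generating novel views from new camera poses to augment the training data, surpassing earlier regularization-based and prior-based methods. Despite this progress, we identify three key limitations in these state-of-the-art methods: insufficient coverage beyond the periphery of known views, geometric inconsistency across generated views, and computationally expensive pipelines. We propose GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Rather than generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our method applies multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner, with no training required. Extensive experiments on Replica and ScanNet++ show state-of-the-art reconstruction quality with 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while running 25 times faster than state-of-the-art diffusion-based methods, with a processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/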

Summary

This work targets the limitations of existing sparse-view 3D reconstruction methods, in particular inadequate coverage beyond the known views and geometric inconsistencies across generated views. It introduces GaMO, a geometry-aware multi-view outpainting framework that expands the field of view from existing camera poses, which preserves geometric consistency and broadens scene coverage. With 3, 6, and 9 input views, GaMO outperforms prior methods in PSNR and LPIPS and achieves state-of-the-art reconstruction quality, with a 25× speedup over previous diffusion-based approaches and a processing time under 10 minutes.

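This work addresses the limitations of sparse-view 3D reconstruction with GaMO, a framework that uses multi-view outpainting to expand the field of view from existing camera poses. By not generating new viewpoints, the method preserves geometric consistency and provides broader scene coverage. GaMO outperforms prior methods in PSNR and LPIPS with 3, 6, and 9 input views, and runs 25 times faster than state-of-the-art diffusion-based methods, with a processing time under 10 minutes.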

Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Authors: Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang

First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00

Comments: Project page: https://edit3r.github.io/edit3r/

Abstract

We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
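The idea behind cross-view-consistent recoloring can be sketched as follows. This is a minimal illustration under stated assumptions: the per-view object masks are taken as given (in the paper they are derived with SAM2, which is not called here), a simple hue rotation stands in for whatever recoloring the authors actually apply, and the names `recolor_view` and `recolor_scene` are hypothetical.

```python
import colorsys


def recolor_view(image, mask, hue_shift):
    """Rotate the hue of masked pixels in one view.

    image: rows of (r, g, b) floats in [0, 1]
    mask:  rows of booleans, True where the object is visible
    """
    out = []
    for img_row, mask_row in zip(image, mask):
        new_row = []
        for (r, g, b), inside in zip(img_row, mask_row):
            if inside:
                h, s, v = colorsys.rgb_to_hsv(r, g, b)
                r, g, b = colorsys.hsv_to_rgb((h + hue_shift) % 1.0, s, v)
            new_row.append((r, g, b))
        out.append(new_row)
    return out


def recolor_scene(views, masks, hue_shift):
    """Apply one shared hue shift to the same object in every view."""
    return [recolor_view(img, m, hue_shift) for img, m in zip(views, masks)]


red, gray = (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)
views = [[[red, gray]], [[gray, red]]]        # two 1x2 views of one object
masks = [[[True, False]], [[False, True]]]    # object position differs per view
edited = recolor_scene(views, masks, hue_shift=0.5)
print(edited)
```

Because the same transform is applied inside every mask, the edited object has the same appearance in all views, which is the property that makes such images usable as multi-view supervision. An edit produced independently per view by a 2D editor would not generally have this property.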

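Translated title / abstract

Title: Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

We present Edit3r, a single-pass (feed-forward) framework that reconstructs and edits 3D scenes from unposed, view-inconsistent, instruction-edited images. Unlike prior methods that require per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast, photorealistic rendering without optimization or pose estimation. The key challenge in training such a model is the lack of multi-view-consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference time, our model effectively handles images edited by 2D methods such as InstructPix2Pix, even though it was not exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split with 20 diverse scenes, 4 edit types, and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r outperforms recent baselines in semantic alignment and 3D consistency while running at significantly higher inference speed, which makes it promising for real-time 3D editing applications.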

Summary

Edit3r is a feed-forward framework that reconstructs and edits 3D scenes from unposed, view-inconsistent images in a single pass. It directly predicts instruction-aligned 3D edits without per-scene optimization, enabling fast and photorealistic rendering. Training such a model is difficult because multi-view-consistent edited images are not available for supervision; Edit3r addresses this with a SAM2-based recoloring strategy and an asymmetric input strategy. At inference, it handles edits made by 2D methods such as InstructPix2Pix and outperforms recent baselines in semantic alignment and 3D consistency while running at higher inference speed, which makes it promising for real-time 3D editing applications.

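Edit3r is a feed-forward framework that reconstructs and edits 3D scenes from unposed, view-inconsistent images without optimization or pose estimation. It uses a SAM2-based recoloring strategy to generate reliable supervision and an asymmetric input strategy that encourages the network to fuse and align different observations. The model handles 2D edits, runs fast, and achieves better semantic alignment and 3D consistency than recent baselines. Comprehensive benchmark results indicate that it is suitable for real-time 3D editing applications.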

Scaling Open-Ended Reasoning to Predict the Future

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

First: 2025-12-31T18:59:51+00:00 · Latest: 2025-12-31T18:59:51+00:00

Comments: 45 pages