arXiv Paper Digest

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

First: 2025-10-30T17:59:55+00:00 · Latest: 2025-10-30T17:59:55+00:00

Comments: Project Page: https://video-cof.github.io

Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

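Title / Abstract (translated from Chinese)

Title: Are Video Models Ready to Serve as Zero-Shot Reasoners? An Empirical Study on the MME-CoF Benchmark

Recent video generation models can produce high-fidelity, temporally coherent videos, which suggests that they may encode substantial world knowledge. Beyond realistic synthesis, they also show emerging behaviors in visual perception, modeling, and manipulation. An important question remains, however: are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this paper, we conduct an empirical study that investigates this question comprehensively, focusing on the leading Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, and systematically characterize its strengths and failure modes. To standardize the study, we curate the evaluation data into MME-CoF, a compact benchmark that supports in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings show that current video models exhibit promising reasoning patterns in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models.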

Summary

This study asks whether video generation models can serve as zero-shot reasoners by evaluating the reasoning behavior of Veo-3 across 12 dimensions, including spatial, geometric, and temporal logic. It introduces the MME-CoF benchmark for assessing Chain-of-Frame (CoF) reasoning. The model shows promise in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but is limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. It is not yet reliable as a standalone zero-shot reasoner but shows encouraging signs as a complement to dedicated reasoning models. Project page: https://video-cof.github.io

OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Authors: Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

First: 2025-10-30T17:59:51+00:00 · Latest: 2025-10-30T17:59:51+00:00

Comments: Project page: https://yukun-huang.github.io/OmniX/

Abstract

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

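Title / Abstract (translated from Chinese)

Title: OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

There are two common approaches to building 3D scenes: procedural generation and 2D lifting. Of these, panorama-based 2D lifting has become a promising technique that uses powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance the technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Whereas existing 2D lifting methods emphasize appearance generation and ignore the perception of intrinsic properties, we propose OmniX, a versatile and unified framework. Built on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a wide range of panoramic vision tasks, including panoramic perception, generation, and completion. In addition, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.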

Summary

This work advances panorama-based 2D lifting to generate graphics-ready 3D scenes suitable for PBR, relighting, and simulation. Its key method is OmniX, a unified framework that uses a cross-modal adapter to repurpose 2D generative models for panoramic perception and generation. Experiments show that OmniX is effective at panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual worlds.

UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Authors: Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang

Venue: NeurIPS 2025 Spotlight

First: 2025-06-03T17:49:41+00:00 · Latest: 2025-10-30T17:59:46+00:00

Abstract

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
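The idea of scoring binding site predictions with IoU-based Average Precision can be illustrated with a small sketch. The abstract does not give the exact metric definition, so everything below is an assumption made for illustration: each site is represented as a set of residue indices, predictions are matched greedily in descending confidence order, a prediction counts as correct when its IoU with an unmatched ground-truth site reaches a threshold of 0.5, and AP is computed without interpolation. The function names `iou` and `average_precision` are hypothetical and are not taken from the UniSite code.

```python
from typing import List, Set, Tuple


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over Union of two residue-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(
    preds: List[Tuple[float, Set[int]]],
    truths: List[Set[int]],
    thr: float = 0.5,  # assumed IoU threshold, not taken from the paper
) -> float:
    """Non-interpolated AP for one protein.

    preds holds (confidence, residue set) pairs. Each ground-truth site
    can be matched by at most one prediction.
    """
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    hits: List[bool] = []
    # Visit predictions from most to least confident.
    for _, site in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, truth in enumerate(truths):
            if matched[j]:
                continue
            value = iou(site, truth)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= thr:
            matched[best_j] = True
            hits.append(True)
        else:
            hits.append(False)
    # Sum precision at each rank where a true positive occurs.
    total, true_pos = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_pos += 1
            total += true_pos / rank
    return total / len(truths)


if __name__ == "__main__":
    truths = [{1, 2, 3, 4}, {10, 11, 12}]
    preds = [(0.9, {1, 2, 3}), (0.8, {20, 21}), (0.7, {10, 11, 12, 13})]
    print(round(average_precision(preds, truths), 4))
```

In this toy case the first and third predictions each overlap a ground-truth site with IoU 0.75, while the second overlaps nothing, so the false positive at rank 2 lowers the precision credited to the later hit.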

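Title / Abstract (translated from Chinese)

Title: UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Detecting ligand binding sites on proteins is a fundamental step in structure-based drug design. Despite notable progress in recent years, existing methods, datasets, and evaluation metrics still face several key challenges: (1) current datasets and methods focus on individual protein-ligand complexes and overlook the fact that different complexes of the same protein may contain diverse binding sites, which introduces significant statistical bias; (2) ligand binding site detection is usually modeled as a discontinuous workflow that applies binary segmentation followed by a clustering algorithm; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data than the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by a set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets show that IoU-based Average Precision reflects prediction quality more accurately and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.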

Summary

The paper introduces UniSite-DS, the first UniProt-centric ligand binding site dataset, which addresses limitations of existing datasets by including more multi-site and overall data. It also presents UniSite, an end-to-end detection framework trained with a set prediction loss and bijective matching, and proposes IoU-based Average Precision as a more accurate evaluation metric. Experiments on UniSite-DS and benchmark datasets show that IoU-based Average Precision better reflects prediction quality and that UniSite outperforms current state-of-the-art methods in ligand binding site detection.

SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Authors: Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu

First: 2025-10-30T17:59:39+00:00 · Latest: 2025-10-30T17:59:39+00:00

Comments: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/

Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
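The phrase "extends videos with overlapping windows ... at bounded per-step complexity" can be made concrete with a generic scheduling sketch. This is not the SEE4D pipeline: the abstract does not specify window lengths, overlap sizes, or how overlapping frames are conditioned on, so the function name `overlapping_windows` and its parameters are hypothetical. The sketch only shows how a long frame range can be covered by fixed-size windows, each sharing frames with its predecessor, so that every step processes at most `window` frames.

```python
from typing import List, Tuple


def overlapping_windows(total: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame spans that cover [0, total).

    Consecutive spans share `overlap` frames; in an autoregressive setup
    those shared frames would act as context carried over from the
    previous step (an assumption for illustration).
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if total <= 0:
        return []
    stride = window - overlap  # number of new frames added per step
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, total)
        spans.append((start, end))
        if end >= total:
            break
        start += stride
    return spans


if __name__ == "__main__":
    for span in overlapping_windows(total=10, window=4, overlap=2):
        print(span)
```

With 10 frames, a window of 4, and an overlap of 2, each step adds two new frames while never handling more than four at once, which is the sense in which per-step cost stays bounded regardless of total video length.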

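Title / Abstract (translated from Chinese)

Title: SEE4D: Pose-Free 4D Content Generation via Auto-Regressive Video Inpainting

Immersive applications need to synthesize spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods usually depend on manually annotated camera poses, which are labor-intensive to obtain and brittle on in-the-wild footage. Recent warp-then-inpaint methods reduce the need for pose labels by warping the input frames along a novel camera trajectory and using an inpainting model to fill the missing regions, thereby depicting the 4D scene from multiple viewpoints. However, this trajectory-to-trajectory formulation tends to entangle camera motion with scene dynamics, which complicates both modeling and inference. We propose SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, removing the need for explicit 3D annotations. On top of this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, achieving coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method generalizes better and performs better than pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.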

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

First: 2025-10-30T17:59:27+00:00 · Latest: 2025-10-30T17:59:27+00:00

Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

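Title / Abstract (translated from Chinese)

Title: The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Despite recent progress on standard benchmarks, existing 3D human motion generation (MoGen) models still face a fundamental bottleneck in generalization. In contrast, adjacent generative fields, video generation (ViGen) in particular, have shown remarkable generalization in modeling human behavior, pointing to transferable insights that MoGen can draw on. Motivated by this observation, we propose a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset of 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To improve efficiency, we further develop ViMoGen-light, a distilled variant that removes the dependence on video generation while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark for fine-grained evaluation of motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing methods in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.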

Summary

This study addresses the limited generalization of 3D human motion generation models by drawing on insights from video generation. It introduces ViMoGen-228K, a large-scale dataset that combines high-fidelity optical MoCap data with annotated motions from web videos and synthesized samples from ViGen models, and proposes ViMoGen, a diffusion transformer that integrates priors from MoCap data and video generation models. The framework also includes ViMoGen-light, a more efficient variant. MBench, a hierarchical benchmark, evaluates motion quality, prompt fidelity, and generalization. Experiments show significant improvements over existing methods in both automatic and human evaluations.