arXiv Paper Digest

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

First: 2025-10-30T17:59:55+00:00 · Latest: 2025-10-30T17:59:55+00:00

Comments: Project Page: https://video-cof.github.io


Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

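Title / Abstract (translated from Chinese)

Title: Are Video Models Ready to Serve as Zero-Shot Reasoners? An Empirical Study on the MME-CoF Benchmark

Recent video generation models can produce high-fidelity, temporally coherent videos, suggesting that they may encode substantial world knowledge. Beyond realistic synthesis, they also show emerging behaviors in visual perception, modeling, and manipulation. An important question remains, however: are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this paper, we conduct an empirical study that examines this question comprehensively, focusing on the leading Veo-3. We evaluate its reasoning behavior along 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, and systematically characterize its strengths and failure modes. To standardize the study, we organize the evaluation data into MME-CoF, a compact benchmark that supports in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings show that current video models exhibit promising reasoning patterns in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models.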

Summary

This study investigates whether video generation models can serve as zero-shot reasoners by evaluating the Veo-3 model across 12 dimensions of reasoning. The evaluation data is curated into MME-CoF, a benchmark for assessing Chain-of-Frame reasoning. The findings show that while the model performs well on short-term spatial coherence and fine-grained grounding, it struggles with long-term causal reasoning and abstract logic, indicating it is not yet reliable as a standalone zero-shot reasoner but shows potential as a complementary tool.

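The study examines whether video generation models can serve as zero-shot reasoners by evaluating Veo-3 along 12 reasoning dimensions. The evaluation data is organized into the MME-CoF benchmark for assessing Chain-of-Frame reasoning. The model performs well on short-horizon spatial coherence and fine-grained grounding but is limited in long-horizon causal reasoning and abstract logic, so it is not yet reliable as a standalone zero-shot reasoner, though it shows potential as a complementary tool alongside dedicated reasoning models.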

OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Authors: Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

First: 2025-10-30T17:59:51+00:00 · Latest: 2025-10-30T17:59:51+00:00

Comments: Project page: https://yukun-huang.github.io/OmniX/


Abstract

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

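Title / Abstract (translated from Chinese)

Title: OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

There are two common approaches to constructing 3D scenes: procedural generation and 2D lifting. Of these, panorama-based 2D lifting has emerged as a promising technique that leverages powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance the technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting methods, which emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Built on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. In addition, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.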

Summary

This work aims to enhance panorama-based 2D lifting techniques to generate graphics-ready 3D scenes suitable for PBR, relighting, and simulation. The key method involves using a cross-modal adapter to repurpose 2D generative models for panoramic perception and generation. Experiments show that OmniX effectively handles panoramic visual perception and 3D scene generation, paving the way for more immersive and physically realistic virtual worlds.

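The study aims to strengthen panorama-based 2D lifting so that it produces graphics-ready 3D scenes suitable for PBR, relighting, and simulation. The key method uses a lightweight cross-modal adapter to repurpose 2D generative models for panoramic perception, generation, and completion. The authors present OmniX, a unified framework, and construct a large-scale synthetic panorama dataset. Experiments show that OmniX performs well in panoramic visual perception and graphics-ready 3D scene generation, advancing the generation of immersive virtual worlds.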

UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Authors: Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang

Venue: NeurIPS 2025 Spotlight

First: 2025-06-03T17:49:41+00:00 · Latest: 2025-10-30T17:59:46+00:00


Abstract

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.
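The IoU-based Average Precision named in the abstract can be illustrated with a small sketch. It assumes, beyond what the abstract states, that a binding site is represented as a set of residue indices, that each predicted site carries a confidence score, and that predictions are matched greedily to unmatched ground-truth sites at a fixed IoU threshold, as is common in object detection. The names `iou`, `average_precision`, and `thr` are hypothetical, and the paper's exact matching and averaging protocol may differ.

```python
def iou(a, b):
    """Intersection over Union of two residue-index collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(preds, truths, thr=0.5):
    """Non-interpolated AP for one protein.

    preds:  list of (score, residues) pairs, one per predicted site
    truths: list of residue-index sets, one per ground-truth site
    thr:    IoU needed for a prediction to count as a true positive
    """
    if not truths:
        return 0.0
    matched = set()          # ground-truth sites already claimed
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: -p[0])
    for rank, (_, residues) in enumerate(ranked, start=1):
        best_j, best_iou = None, thr
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            value = iou(residues, truth)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
            # Precision measured at the rank where recall increases.
            precision_sum += true_positives / rank
    return precision_sum / len(truths)


truths = [{1, 2, 3, 4}, {10, 11, 12}]
preds = [(0.9, {1, 2, 3}), (0.8, {20, 21}), (0.7, {10, 11, 12})]
print(round(average_precision(preds, truths), 4))
```

Unlike a per-residue segmentation score, this kind of metric penalizes a confident prediction that overlaps no true site (the second prediction here), which is one plausible reason to prefer it for multi-site proteins.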

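Title / Abstract (translated from Chinese)

Title: UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Detecting ligand binding sites on proteins is a fundamental step in structure-based drug design. Despite notable progress in recent years, existing methods, datasets, and evaluation metrics still face several key challenges: (1) current datasets and methods focus on individual protein-ligand complexes and overlook the fact that different complexes of the same protein may contain diverse binding sites, which introduces significant statistical bias; (2) ligand binding site detection is usually modeled as a discontinuous workflow that applies binary segmentation followed by clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data than the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets show that IoU-based Average Precision reflects prediction quality more accurately and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.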

Summary

The paper addresses the challenges in ligand binding site detection by introducing UniSite-DS, a UniProt-centric dataset with more diverse binding site data, and UniSite, an end-to-end detection framework using set prediction loss. The study also proposes a new evaluation metric, IoU-based Average Precision, which better reflects prediction quality. Experiments show that UniSite outperforms existing methods on both UniSite-DS and benchmark datasets.

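The paper introduces UniSite-DS, the first UniProt-centric ligand binding site dataset, which addresses limitations of existing datasets and methods. It also proposes UniSite, an end-to-end detection framework that uses set prediction loss with bijective matching, and introduces IoU-based Average Precision as a new evaluation metric. Experiments show that UniSite outperforms current state-of-the-art methods in ligand binding site detection and that the new metric reflects prediction quality more accurately.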

SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Authors: Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu

First: 2025-10-30T17:59:39+00:00 · Latest: 2025-10-30T17:59:39+00:00

Comments: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/


Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
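The phrase "extends videos with overlapping windows" can be made concrete with a small scheduling sketch. It only shows how a long sequence might be split into fixed-size windows that share frames with their predecessor, so that each step handles a bounded number of frames. The function name, the window size, and the overlap value are hypothetical, and the abstract does not say how the actual pipeline chooses them or blends the shared frames.

```python
def overlapping_windows(num_frames, window, overlap):
    """Return (start, end) frame spans, end exclusive, covering num_frames.

    Each span after the first begins `overlap` frames before the previous
    span ends, so an autoregressive model can condition on frames that
    were already generated.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end >= num_frames:
            break
        start += step
    return spans


for span in overlapping_windows(num_frames=10, window=4, overlap=1):
    print(span)
```

With these illustrative values, every step processes at most four frames regardless of the total video length, which is the sense in which per-step complexity stays bounded.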

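Title / Abstract (translated from Chinese)

Title: SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Immersive applications need to synthesize spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods usually rely on manually annotated camera poses, which are labor-intensive to obtain and brittle on in-the-wild footage. Recent warp-then-inpaint approaches reduce the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill in missing regions, thereby depicting the 4D scene from multiple viewpoints. However, this trajectory-to-trajectory formulation tends to entangle camera motion with scene dynamics, which complicates both modeling and inference. We propose SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, removing the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method generalizes better and performs better than pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.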

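Summary

SEE4D is a pose-free 4D generation method that uses an auto-regressive video inpainting model to synthesize 4D content from casual videos. Instead of predicting an explicit camera trajectory, it renders to a bank of fixed virtual cameras, which separates camera control from scene modeling and removes the reliance on manually annotated camera poses. The method achieves superior generalization and performance compared to pose- or trajectory-conditioned baselines on cross-view video generation and sparse reconstruction benchmarks, demonstrating practical 4D world modeling from casual videos.

SEE4D is a pose-free 4D generation method that uses an autoregressive video inpainting model to synthesize 4D content from casual videos. It uses a bank of fixed virtual cameras in place of manually annotated camera poses, separating camera control from scene modeling. The method shows better generalization and performance on cross-view video generation and sparse reconstruction benchmarks, demonstrating practical 4D world modeling from casual videos.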

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

First: 2025-10-30T17:59:27+00:00 · Latest: 2025-10-30T17:59:27+00:00


Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

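Title / Abstract (translated from Chinese)

Title: The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Despite recent progress on standard benchmarks, existing 3D human motion generation (MoGen) models still face a fundamental bottleneck in generalization. By contrast, adjacent generative fields, most notably video generation (ViGen), have shown remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can draw on. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset of 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To improve efficiency, we further develop ViMoGen-light, a distilled variant that removes the dependence on video generation while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark for fine-grained evaluation of motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.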

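Summary

This study addresses the limitation of generalization in 3D human motion generation (MoGen) by drawing insights from video generation (ViGen). It introduces ViMoGen-228K, a large dataset that combines optical MoCap data, annotated motions from web videos, and samples synthesized by ViGen models, and proposes ViMoGen, a diffusion transformer that integrates priors from MoCap data and ViGen models. The framework also includes ViMoGen-light, a more efficient distilled variant. Additionally, a new benchmark called MBench is developed for comprehensive evaluation. Experimental results demonstrate significant improvements over existing methods in both automatic and human evaluations.

The study addresses the limited generalization of 3D human motion generation (MoGen) by drawing on experience from video generation (ViGen). It introduces ViMoGen-228K, a large dataset that combines high-quality optical MoCap data with web videos, and proposes ViMoGen, a diffusion transformer that fuses priors from MoCap data and ViGen models. The framework also includes ViMoGen-light, a more efficient variant. MBench is a hierarchical benchmark for evaluating motion quality, prompt fidelity, and generalization ability. Experiments show that the framework significantly outperforms existing methods in both automatic and human evaluations.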

Defeating the Training-Inference Mismatch via FP16

Authors: Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

First: 2025-10-30T17:58:11+00:00 · Latest: 2025-10-30T17:58:11+00:00


Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
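The rounding-error gap between the two formats follows from their bit layouts: FP16 stores 10 explicit mantissa bits and 5 exponent bits, while BF16 stores 7 mantissa bits and 8 exponent bits. The sketch below, using only the Python standard library, rounds a single value to each format and compares the errors. The helper names are hypothetical, BF16 is emulated by rounding a float32 to its top 16 bits, and the example shows per-value rounding only; it does not reproduce the training-inference mismatch studied in the paper.

```python
import struct


def round_fp16(x):
    """Round x to IEEE 754 binary16 (struct format 'e') and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x):
    """Emulate BF16 for finite, normal-range x.

    BF16 keeps the top 16 bits of a float32, so we round the float32 bit
    pattern to the nearest multiple of 2**16, ties to even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = 0.3  # for instance, a token probability
err_fp16 = abs(round_fp16(x) - x)
err_bf16 = abs(round_bf16(x) - x)
print(f"fp16: {round_fp16(x)!r}  error {err_fp16:.2e}")
print(f"bf16: {round_bf16(x)!r}  error {err_bf16:.2e}")
```

For values in [0.25, 0.5) the spacing between representable numbers is 2^-12 in FP16 and 2^-9 in BF16, so the worst-case rounding error is eight times larger in BF16. The price FP16 pays is range: its largest finite value is 65504, which is why loss scaling is commonly used with it.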

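Title / Abstract (translated from Chinese)

Title: Defeating the Training-Inference Mismatch via FP16

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often becomes unstable because of the numerical mismatch between the training and inference policies. Although prior work has tried to mitigate the problem through algorithmic corrections or engineering alignment, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this paper, we demonstrate that simply reverting to FP16 effectively eliminates the mismatch. The change is simple, fully supported by modern frameworks, needs only a few lines of code, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings prompt a broader reconsideration of precision trade-offs in RL fine-tuning.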

Summary

This paper addresses the instability in reinforcement learning fine-tuning of large language models due to the numerical mismatch between training and inference. It identifies that the use of BF16 floating point precision is the root cause, as it introduces significant rounding errors. The authors demonstrate that switching to FP16 effectively resolves this issue, leading to more stable optimization, faster convergence, and better performance across various tasks and frameworks. No changes to the model architecture or learning algorithm are required, making the solution straightforward and widely applicable.

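The paper addresses the instability in reinforcement learning fine-tuning of large language models caused by the numerical mismatch between training and inference. It finds that BF16 floating point precision is the root cause because it introduces large rounding errors. The authors show that switching to FP16 effectively resolves the problem, giving more stable optimization, faster convergence, and better performance across a variety of tasks and frameworks. No changes to the model architecture or learning algorithm are needed, which makes the solution simple and widely applicable.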

Remote Labor Index: Measuring AI Automation of Remote Work

Authors: Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks

Venue: www

First: 2025-10-30T17:58:04+00:00 · Latest: 2025-10-30T17:58:04+00:00

Comments: Website: https://www.remotelabor.ai


Abstract

AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

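Title / Abstract (translated from Chinese)

Title: Remote Labor Index: Measuring AI Automation of Remote Work

AI has made rapid progress on research benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broad multi-sector benchmark made up of real-world, economically valuable projects and designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.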

Summary

The Remote Labor Index (RLI) is introduced to measure the economic value and automation potential of AI in remote work. AI agents perform poorly on RLI, with the best achieving only a 2.5% automation rate, indicating that current AI capabilities have limited impact on remote labor tasks. This provides empirical evidence for discussions on AI automation and sets a standard for tracking AI's effects on labor markets.

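The Remote Labor Index (RLI) is introduced to measure the economic value and automation potential of AI in remote work. AI agents perform poorly on RLI, with the best reaching an automation rate of only 2.5%, which indicates that current AI capabilities have limited impact on remote labor tasks. This provides empirical evidence for discussions of AI automation and sets a standard for tracking the effects of AI on labor markets.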

HEIR: Learning Graph-Based Motion Hierarchies

Authors: Cheng Zheng, William Koch, Baiang Li, Felix Heide

Venue: NeurIPS 2025

First: 2025-10-30T17:57:40+00:00 · Latest: 2025-10-30T17:57:40+00:00

Comments: Code link: https://github.com/princeton-computational-imaging/HEIR


Abstract

Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
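The decomposition of "global absolute motions into parent-inherited patterns and local motion residuals" can be sketched for the 1D translational case. The example assumes a given parent index per element, whereas the paper learns the hierarchy; the function names and the scalar per-element velocity are hypothetical simplifications chosen for illustration.

```python
def compose(parents, residuals):
    """Absolute motion = parent's absolute motion + local residual.

    parents[i] is the index of element i's parent, or None for a root.
    """
    absolute = [None] * len(parents)

    def resolve(i):
        if absolute[i] is None:
            parent = parents[i]
            inherited = resolve(parent) if parent is not None else 0.0
            absolute[i] = inherited + residuals[i]
        return absolute[i]

    for i in range(len(parents)):
        resolve(i)
    return absolute


def decompose(parents, absolute):
    """Recover local residuals by subtracting each parent's absolute motion."""
    residuals = []
    for i, parent in enumerate(parents):
        inherited = absolute[parent] if parent is not None else 0.0
        residuals.append(absolute[i] - inherited)
    return residuals


# A chain: element 0 is the root, 1 rides on 0, 2 rides on 1.
parents = [None, 0, 1]
residuals = [1.0, 0.5, -0.25]
absolute = compose(parents, residuals)
print(absolute)                      # observed (global) velocities
print(decompose(parents, absolute))  # local residuals under this hierarchy
```

Under the correct hierarchy the residuals are small and simple; under a wrong one they absorb the parent's motion. That contrast is one way to see why hierarchy inference can be posed as a learning problem, although the abstract does not spell out the objective used.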

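Title / Abstract (translated from Chinese)

Title: HEIR: Learning Graph-Based Motion Hierarchies

Hierarchical structures of motion appear in many research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing modeling methods usually rely on manually defined or heuristic hierarchies with fixed motion primitives, which limits their generalization across tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions with graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem in which vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in the 1D and 2D cases and produces more realistic and interpretable deformations than the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks.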

Summary

This work proposes HEIR, a method that learns hierarchical motion structures from data, addressing the limitations of manually-defined hierarchies. It uses graph-based representations to decompose motions into parent-inherited patterns and local residuals, and learns these relationships through graph neural networks. The method successfully reconstructs intrinsic hierarchies in 1D and 2D cases and produces more realistic deformations in 3D scenes compared to baseline methods.

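The study introduces HEIR, a method that learns hierarchical motion structure directly from data and so avoids the limitations of manually defined hierarchies. It uses graph neural networks to infer parent-child dependencies and decomposes motion into inherited patterns and local residuals. Experiments show that HEIR reconstructs the intrinsic hierarchy in the 1D and 2D cases and produces more realistic deformations than the baseline in dynamic 3D scenes.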

Clone Deterministic 3D Worlds with Geometrically-Regularized World Models

Authors: Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

First: 2025-10-30T17:56:43+00:00 · Latest: 2025-10-30T17:56:43+00:00


Abstract

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.
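The constraint that "consecutive points along a natural sensory trajectory remain close in latent representation space" can be written as a simple penalty. The sketch below uses the mean squared distance between successive latent vectors; this particular form, the function name, and the plain-list representation are assumptions for illustration, since the abstract does not give the exact regularizer. Note that such a term on its own is minimized by mapping every observation to the same point, so it only makes sense alongside an objective that keeps latents informative, such as reconstruction or prediction.

```python
def temporal_closeness_loss(latents):
    """Mean squared distance between consecutive latent vectors.

    latents: list of equal-length vectors z_0 .. z_T from one trajectory.
    """
    if len(latents) < 2:
        return 0.0
    total = 0.0
    for current, following in zip(latents, latents[1:]):
        total += sum((a - b) ** 2 for a, b in zip(current, following))
    return total / (len(latents) - 1)


smooth = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]   # neighbours stay close
jumpy = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]    # neighbours far apart
print(temporal_closeness_loss(smooth))
print(temporal_closeness_loss(jumpy))
```

A penalty of this kind adds no parameters to the dynamics module, which is consistent with the abstract's description of the approach as plug-and-play.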

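Title / Abstract (translated from Chinese)

Title: Clone Deterministic 3D Worlds with Geometrically-Regularized World Models

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the agent and its environment. Accurate world models are essential for agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (for example, images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which requires consecutive points along a natural sensory trajectory to remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits come from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.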

Summary

This paper aims to enhance the accuracy of world models by improving representation learning, particularly focusing on the geometric structure of latent spaces. The authors propose Geometrically-Regularized World Models (GRWM), which ensures that consecutive points in sensory trajectories remain close in latent space. This approach significantly improves latent representations and leads to better long-horizon prediction and stability in deterministic 3D environments. The findings suggest that optimizing representation learning can directly improve world model performance without increasing the complexity of the dynamics module.

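The paper aims to build accurate world models for complex environments. The authors propose Geometrically-Regularized World Models (GRWM), which improves representation learning by ensuring that consecutive points along a sensory trajectory stay close in latent space. The method markedly improves latent representations and yields more stable and accurate long-horizon predictions in deterministic 3D environments.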

ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou

First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00


Abstract

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

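Title / Abstract (translated from Chinese)

Title: ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among people. However, existing vision-language models (VLMs) still fall short in perceiving details and struggle to extract fine-grained structures from charts. These limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a new "ChartAlign Benchmark (ChartAB)" to comprehensively evaluate VLMs on chart grounding tasks, namely extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics tailored to each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate the ability of VLMs to align and compare elements and attributes across two charts. Our analysis of evaluations on several recent VLMs reveals their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained differences among VLMs on chart understanding tasks and point to specific skills that current models need to strengthen.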

Summary

The paper introduces ChartAB, a benchmark for evaluating vision-language models in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing attributes. By using a two-stage inference workflow, the benchmark assesses models' ability to align and compare elements across charts. Evaluations on recent VLMs reveal biases, weaknesses, and hallucinations, highlighting the need to improve specific skills in current models.

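The study introduces ChartAB, a benchmark for evaluating vision-language models (VLMs) on chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing chart attributes. Using a two-stage inference workflow and a JSON template for computing evaluation metrics, the benchmark assesses the ability of VLMs to align and compare elements across charts. The study reveals perception biases, weaknesses, and hallucinations of VLMs in chart understanding and underlines the need to strengthen specific skills in current models.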