arXiv Paper Digest

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

First: 2025-12-11T18:59:58+00:00 · Latest: 2025-12-11T18:59:58+00:00

Comments: Preprint; 80 pages, 37 figures, 29 tables; Project Page at https://worldbench.github.io/worldlens

Abstract

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

Summary

WorldLens is a benchmark for evaluating generative driving world models across five aspects: generation, reconstruction, action-following, downstream tasks, and human preference. Together these cover visual realism, geometric consistency, physical plausibility, and functional reliability. No existing model excels on every aspect: models with strong textures often violate physics, while geometry-stable models lack behavioral fidelity. To align objective metrics with human judgment, the authors build WorldLens-26K, a human-annotated video dataset with numerical scores and textual rationales, and distill it into WorldLens-Agent, an evaluation model for scalable, explainable scoring.

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project page: https://idea-research.github.io/SceneMaker/

Abstract

We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.

Summary

SceneMaker is a decoupled 3D scene generation framework targeting open-set de-occlusion and pose estimation under severe occlusion. It separates the de-occlusion model from 3D object generation and trains it on image datasets plus collected de-occlusion datasets to cover more diverse open-set occlusion patterns. It also introduces a unified pose estimation model that combines global and local mechanisms in both self-attention and cross-attention, supported by a newly constructed open-set 3D scene dataset. Experiments show that the framework outperforms existing methods on both indoor and open-set scenes.

Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng

Venue: www

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project Page: https://www.cs.virginia.edu/~tsx4zn/stereowalk/

Abstract

The success of foundation models in language and vision has motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.

Summary

This paper tackles robot navigation in dynamic urban environments with StereoWalker, which augments navigation foundation models (NFMs) with stereo inputs and explicit mid-level vision modules such as depth estimation and dense pixel tracking. The motivation is that monocular input suffers from depth-scale ambiguity, while dynamic, unstructured scenes demand precise geometric and motion understanding. With mid-level vision, StereoWalker matches state-of-the-art performance using only 1.5% of the training data and surpasses it with the full data; stereo input also yields higher navigation performance than monocular input. The authors additionally curate a large stereo navigation dataset, automatically annotated with actions from Internet stereo videos.
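
The claim that stereo inputs resolve depth-scale ambiguity rests on standard rectified-stereo geometry: with focal length f (in pixels) and baseline B (in metres), a pixel with disparity d lies at metric depth Z = f * B / d. The sketch below only illustrates that relation; the function name and the camera numbers are assumptions made for the example and are not taken from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth (metres) of a point seen in a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 0.12 m baseline.
near = depth_from_disparity(42.0, focal_px=700.0, baseline_m=0.12)  # about 2 m
far = depth_from_disparity(8.4, focal_px=700.0, baseline_m=0.12)    # about 10 m
```

Because the baseline is known in metres, the recovered depth is metric, which a single monocular view cannot provide without extra priors.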

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project page: https://snap-research.github.io/omni-attribute

Abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Summary

Omni-Attribute is an open-vocabulary image attribute encoder for visual concept personalization, which aims to transfer specific attributes such as identity, expression, lighting, and style into unseen contexts without information leakage. It is trained on semantically linked image pairs annotated with positive and negative attributes, using a dual-objective scheme that balances generative fidelity with contrastive disentanglement. The resulting attribute-specific embeddings achieve state-of-the-art performance on attribute retrieval, personalization, and compositional generation across multiple benchmarks.

Hierarchical Dataset Selection for High-Quality Data Sharing

Authors: Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou

First: 2025-12-11T18:59:55+00:00 · Latest: 2025-12-11T18:59:55+00:00

Abstract

The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.

Summary

This work formalizes dataset selection: choosing entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. The proposed method, Dataset Selection via Hierarchies (DaSH), models utility at both the dataset level and the group level (e.g., collections or institutions). On two benchmarks, Digit-Five and DomainNet, DaSH improves accuracy over state-of-the-art data selection baselines by up to 26.2% while requiring fewer exploration steps. Ablations indicate that it remains robust in low-resource settings and when relevant datasets are scarce.

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Authors: Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang

First: 2025-12-11T18:59:53+00:00 · Latest: 2025-12-11T18:59:53+00:00

Comments: Project website: https://qitaozhao.github.io/E-RayZer

Abstract

Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

Summary

E-RayZer is a self-supervised 3D vision model that learns 3D-aware representations directly from unlabeled multi-view images. Unlike prior methods such as RayZer, which infer 3D indirectly through latent-space view synthesis, E-RayZer operates in 3D space and performs self-supervised reconstruction with explicit geometry. A fine-grained, easy-to-hard learning curriculum supports convergence and scalability. In experiments, E-RayZer outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT; its representations also outperform other visual pre-training models when transferred to 3D downstream tasks.

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

First: 2025-12-11T18:59:52+00:00 · Latest: 2025-12-11T18:59:52+00:00

Comments: Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1

Abstract

Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
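
The GRPO variants studied above share one core step: several samples are generated for the same prompt, and each sample's reward is normalised against the others in its group, so no learned value function is needed. The sketch below shows only that standard group-relative advantage. It is not Hi-GRPO or any variant specific to this paper, the reward values are invented for illustration, and the use of the population standard deviation is an assumption (implementations differ).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each sample's reward against its own group (one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four 3D samples drawn from the same text prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples scoring above the group mean receive a positive advantage and those below it a negative one; when every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.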

ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Authors: Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao

First: 2025-12-11T18:59:47+00:00 · Latest: 2025-12-11T18:59:47+00:00

Abstract

All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.

Summary

ClusIR is a cluster-guided framework for all-in-one image restoration, which recovers high-quality images from diverse degradations with a single model. It combines a Probabilistic Cluster-Guided Routing Mechanism (PCGRM), which separates degradation recognition from expert activation, with a Degradation-Aware Frequency Modulation Module (DAFMM), which uses cluster-guided priors for adaptive frequency decomposition and targeted modulation. Experiments on diverse benchmarks show that ClusIR reaches competitive performance under several scenarios.
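
One way to read "separating degradation recognition from expert activation" is as a two-step router: first assign the input softly to learned degradation clusters, then mix per-cluster expert preferences by those probabilities. The sketch below illustrates that reading only. The abstract does not give PCGRM's actual formulation, so the distance-based soft assignment, the centroids, and the cluster-to-expert table are all hypothetical.

```python
import math


def soft_assign(feature, centroids, temperature=1.0):
    """Cluster probabilities: softmax over negative squared distances."""
    logits = [
        -sum((f - c) ** 2 for f, c in zip(feature, centroid)) / temperature
        for centroid in centroids
    ]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(cluster_probs, cluster_to_expert):
    """Expert weights: probability-weighted mix of per-cluster routing rows."""
    num_experts = len(cluster_to_expert[0])
    return [
        sum(p * row[e] for p, row in zip(cluster_probs, cluster_to_expert))
        for e in range(num_experts)
    ]


# Hypothetical setup: two degradation clusters, three restoration experts.
centroids = [[0.0, 0.0], [4.0, 4.0]]
cluster_to_expert = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]  # each row sums to 1
probs = soft_assign([0.5, 0.2], centroids)
expert_weights = route(probs, cluster_to_expert)
```

Because recognition (`soft_assign`) and activation (`route`) are separate steps, a mixed degradation simply yields intermediate cluster probabilities and therefore a blend of experts.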

Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Authors: Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00

Comments: Project Page: https://jiawei-yang.github.io/Flex/

Abstract

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.

Summary

Flex is a scene encoder that addresses the computational cost of processing high-volume multi-camera data in end-to-end autonomous driving. It uses a small set of learnable scene tokens to jointly encode image tokens from all cameras and timesteps, without relying on explicit 3D inductive biases such as BEV, occupancy, or tri-plane representations. On a proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput and better driving performance than state-of-the-art methods. The scene tokens also develop an emergent capability for scene decomposition without explicit supervision, challenging the assumption that 3D priors are necessary.
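
The compression idea can be illustrated with plain cross-attention: a handful of scene tokens act as queries over every image token, so the downstream policy sees only the scene tokens. This is a sketch under assumptions, not Flex's architecture. The abstract does not say whether the encoding uses cross-attention or joint self-attention, the learned projections are omitted, and random vectors stand in for trained tokens.

```python
import math
import random


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compress(scene_tokens, image_tokens):
    """Each scene token attends over all image tokens (no learned projections)."""
    dim = len(scene_tokens[0])
    out = []
    for query in scene_tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in image_tokens
        ]
        weights = softmax(scores)
        out.append([
            sum(w * value[j] for w, value in zip(weights, image_tokens))
            for j in range(dim)
        ])
    return out


random.seed(0)
# Hypothetical sizes: 6 cameras x 3 timesteps x 16 patches = 288 image tokens.
DIM, NUM_SCENE, NUM_IMAGE = 8, 4, 6 * 3 * 16
scene = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SCENE)]
image = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_IMAGE)]
compressed = compress(scene, image)  # 288 image tokens reduced to 4 scene tokens
```

Nothing in the function refers to camera geometry: image tokens from all cameras and timesteps are pooled into one flat list, which is what "geometry-agnostic" means in this sketch.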

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

Authors: Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu

First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00

Comments: Project page: https://implicit-rdp.github.io

Abstract

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.

Summary

ImplicitRDP is an end-to-end visual-force diffusion policy for contact-rich manipulation that integrates visual planning and reactive force control in a single network. Its Structural Slow-Fast Learning mechanism uses causal attention to process asynchronous visual and force tokens together, so the policy can make closed-loop adjustments at the force frequency while keeping action chunks temporally coherent. Virtual-target-based Representation Regularization, an auxiliary objective that maps force feedback into the action space, mitigates modality collapse. Experiments show that ImplicitRDP outperforms vision-only and hierarchical baselines in reactivity and success rate. Code and videos will be available at https://implicit-rdp.github.io.
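
Causal attention over asynchronous streams can be pictured as a mask built from token timestamps: a token may attend to any token that arrived at or before its own time, whichever modality it belongs to. The sketch below shows that generic idea only. The abstract does not describe ImplicitRDP's actual masking scheme, and the sampling rates and token stream are hypothetical.

```python
def causal_mask(timestamps):
    """mask[i][j] is True when token i may attend to token j (j is not in i's future)."""
    return [[t_j <= t_i for t_j in timestamps] for t_i in timestamps]


# Hypothetical token stream: vision arrives at 10 Hz, force at 100 Hz.
tokens = [
    ("vision", 0.00),
    ("force", 0.00),
    ("force", 0.01),
    ("force", 0.02),
    ("vision", 0.10),
]
mask = causal_mask([t for _, t in tokens])
```

Here the force token at 0.02 s can see the earlier visual frame and earlier force readings but not the visual frame at 0.10 s, so fast force updates never wait on, or leak from, a future slow observation.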

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11400-11416, 2025

First: 2025-12-11T18:59:44+00:00 · Latest: 2025-12-11T18:59:44+00:00

Comments: IEEE TPAMI, Project Page: https://henghuiding.com/MeViS/

Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/

Summary

This paper introduces MeViS, a large-scale multi-modal dataset for segmenting and tracking target objects in videos from language descriptions of their motion. It contains 33,072 human-annotated motion expressions in text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. The authors benchmark 15 existing methods across four tasks and find that they handle motion expression-guided video understanding poorly. They propose LMPM++ for RVOS/AVOS/RMOT, which achieves new state-of-the-art results.

AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov

First: 2025-12-11T18:59:34+00:00 · Latest: 2025-12-11T18:59:34+00:00

Comments: Project page: https://snap-research.github.io/Video-AlcheMinT

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model's positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT

中文标题/摘要

标题：AlcheMinT：多参考一致视频生成的细粒度时间控制

大型扩散模型驱动的主题视频生成的最新进展使基于用户提供的主题的个性化内容合成成为可能。然而，现有方法缺乏对主题出现和消失的细粒度时间控制，这对于合成视频、故事板和可控动画等应用至关重要。我们提出了AlcheMinT，这是一种统一框架，引入了明确的时间戳条件，用于主题驱动的视频生成。我们的方法引入了一种新颖的位置编码机制，解锁了时间间隔的编码，这些时间间隔在我们的情况下与主题身份相关，同时无缝地与预训练的视频生成模型的位置嵌入集成。此外，我们还引入了主题描述性文本标记，以加强视觉身份与视频字幕之间的联系，减轻生成过程中的歧义。通过令牌级连接，AlcheMinT 避免了任何额外的交叉注意力模块，并且没有增加参数开销。我们建立了一个基准，评估多个主题身份保留、视频保真度和时间一致性。实验结果表明，AlcheMinT 达到了与最先进的视频个性化方法相当的视觉质量，同时首次实现了对视频中多主题生成的精确时间控制。项目页面为 https://snap-research.github.io/Video-AlcheMinT

Summary / 总结

AlcheMinT is a unified framework that introduces explicit timestamps for fine-grained temporal control in subject-driven video generation. It uses a novel positional encoding mechanism to encode temporal intervals associated with subject identities and integrates subject-descriptive text tokens to enhance visual identity binding. Experiments show that AlcheMinT matches the visual quality of state-of-the-art methods while enabling precise temporal control over multi-subject generation in videos.

AlcheMinT 是一个统一框架，引入显式时间戳以实现主体驱动视频生成中的细粒度时间控制，增强如合成视频合成和可控动画等应用。它使用新颖的位置编码机制和主体描述性文本标记来保持主体身份并提高视频保真度和时间一致性，实现与最新方法相当的视觉质量，同时首次在视频中实现对多主体生成的精确时间控制。

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

First: 2025-12-11T18:59:22+00:00 · Latest: 2025-12-11T18:59:22+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
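
To make the open-vocabulary classification claim concrete, here is a minimal sketch of nearest-embedding classification: the predicted embedding is compared against one text embedding per candidate label, and the closest label wins. The function names, the cosine-similarity scoring, and the toy 3-d vectors are assumptions for illustration; the abstract does not specify how VL-JEPA scores candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(predicted, label_embeddings):
    """Return the label whose embedding is most similar to the prediction."""
    return max(label_embeddings,
               key=lambda name: cosine(predicted, label_embeddings[name]))

# Hypothetical toy embeddings standing in for real text-encoder outputs.
labels = {
    "cooking": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.8, 0.6],
}
predicted = [0.1, 0.7, 0.7]  # stand-in for a predicted target embedding
print(classify(predicted, labels))
```

Because the label set is just a dictionary of embeddings, new classes can be added without touching the model, which is the sense in which such a scheme is open-vocabulary.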

中文标题/摘要

标题：VL-JEPA：视觉语言联合嵌入预测架构

我们介绍了基于联合嵌入预测架构（JEPA）的VL-JEPA视觉语言模型。与传统的逐词自回归生成不同，VL-JEPA预测目标文本的连续嵌入。通过在抽象表示空间中学习，该模型专注于与任务相关的语义，同时抽象掉表面语言的变异性。在严格控制的比较中，与使用相同视觉编码器和训练数据的标准令牌空间VLM训练相比，VL-JEPA在参数量减少50%的情况下实现了更强的性能。在推理时，仅在需要时调用轻量级文本解码器将VL-JEPA预测的嵌入转换为文本。我们展示了VL-JEPA原生支持选择性解码，将解码操作减少2.85倍，同时保持与非自适应均匀解码相似的性能。除了生成之外，VL-JEPA的嵌入空间自然支持开放词汇分类、文本到视频检索和区分型VQA，无需任何架构修改。在八个视频分类数据集和八个视频检索数据集上，VL-JEPA的平均性能超过了CLIP、SigLIP2和感知编码器。同时，该模型在四个VQA数据集（GQA、TallyQA、POPE和POPEv2）上实现了与经典VLMs（InstructBLIP、QwenVL）相当的性能，尽管只有1.6B参数。

Summary / 总结

VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts, rather than generating tokens autoregressively. This approach leads to better performance with fewer parameters and supports selective decoding, reducing the number of decoding operations by 2.85x. VL-JEPA outperforms several models on video classification and retrieval tasks and achieves comparable performance on VQA tasks with significantly fewer parameters.

VL-JEPA 是一种使用联合嵌入预测架构来预测目标文本连续嵌入的视觉语言模型，专注于任务相关的语义。它在参数量减少50%的情况下比标准的基于token的空间VLMs表现更好，并支持将解码操作减少2.85倍的自适应解码。VL-JEPA 在视频分类和检索任务上优于CLIP、SigLIP2和Perception Encoder，并且在VQA任务上与更大规模的模型具有相似的性能，尽管参数量仅为1.6B。

EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Authors: Jiaqi Ma, Shengkai Hu, Xu Zhang, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan

First: 2025-12-04T18:59:10+00:00 · Latest: 2025-12-11T18:59:22+00:00

Abs · PDF · Code1 · Code2

Abstract

All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
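
The idea of splitting features into low- and high-frequency branches and re-weighting them can be illustrated with a deliberately simple 1-D sketch. This is not the FMM itself: the moving-average low-pass filter, the fixed scalar gains, and the function names are all assumptions chosen to keep the example self-contained, whereas the paper describes a learned, adaptive modulation.

```python
def split_frequencies(signal, radius=1):
    """Split a 1-D signal into low- and high-frequency parts.

    The low band is a moving average; the high band is the residual,
    so low + high reproduces the input.
    """
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [x - l for x, l in zip(signal, low)]
    return low, high

def modulate(low, high, low_gain, high_gain):
    """Recombine the two bands with per-band gains (hypothetical scalars)."""
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]

signal = [1.0, 4.0, 2.0, 8.0, 3.0]
low, high = split_frequencies(signal)
sharpened = modulate(low, high, low_gain=1.0, high_gain=1.5)  # boost detail
print(sharpened)
```

With both gains set to 1.0 the input is recovered, so any change in the output is attributable to the per-band weighting.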

中文标题/摘要

标题：EvoIR：通过进化频率调制实现一站式图像恢复

一站式图像恢复（AiOIR）任务通常涉及多种退化，需要稳健且通用的策略。然而，大多数现有方法通常缺乏显式的频率建模，并依赖于固定的或启发式的优化计划，这限制了其在异构退化中的泛化能力。为了解决这些限制，我们提出了一种AiOIR特定框架EvoIR，引入了进化频率调制以实现动态和自适应的图像恢复。具体而言，EvoIR 使用频率调制模块（FMM），以显式方式将特征分解为高频频带和低频频带，并自适应地调制它们以增强结构保真度和细粒度细节。EvoIR 的核心是一种进化优化策略（EOS），通过基于群体的进化过程迭代调整频率感知目标，动态平衡结构准确性和感知保真度。其进化的指导进一步缓解了退化之间的梯度冲突并加速了收敛。通过结合FMM和EOS，EvoIR 的改进效果优于单独使用任一组件，突显了它们的互补作用。在多个基准上的广泛实验表明，EvoIR 在一站式图像恢复方法中表现优于现有最佳方法。

Summary / 总结

EvoIR is designed to handle diverse image restoration tasks by introducing evolutionary frequency modulation. It uses a Frequency-Modulated Module (FMM) to decompose and adaptively modulate high- and low-frequency branches, and an Evolutionary Optimization Strategy (EOS) to dynamically balance structural accuracy and perceptual fidelity. Experiments show that EvoIR outperforms existing methods on multiple benchmarks, highlighting the effectiveness of its combined approach.

EvoIR 是一种 AiOIR 框架，引入了进化的频率调制以动态适应各种图像退化。它使用频率调制模块 (FMM) 来分解特征并进行调制，并使用进化优化策略 (EOS) 以种群为基础迭代调整目标，平衡结构准确性和感知保真度。实验表明，EvoIR 在多个基准上优于现有方法，展示了其在处理多种退化方面的有效性。

Mull-Tokens: Modality-Agnostic Latent Thinking

Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu

First: 2025-12-11T18:59:08+00:00 · Latest: 2025-12-11T18:59:08+00:00

Comments: Project webpage: https://arijitray.com/multimodal_thinking/

Abs · PDF · Code1 · Code2

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

中文标题/摘要

标题：Mull-Tokens：模态无关的潜在思维

推理超越语言；现实世界需要关于空间、时间、功能等方面的推理，而这些是单靠语言无法传达的。现有的多模态模型探索图像推理潜力时表现脆弱且不具扩展性。它们依赖于调用专业工具、昂贵的图像生成或手工制作的推理数据来在文本和图像思维之间切换。相反，我们提供了一个更简单的替代方案——Mull-Tokens——一种模态无关的潜在令牌，预先训练以在图像或文本模态中保留中间信息，让模型自由地思考以得出正确答案。我们研究了受潜在推理框架启发的最佳实践来训练Mull-Tokens。我们首先使用交错的文本-图像痕迹进行监督训练，然后仅使用最终答案进行微调。在四个具有挑战性的空间推理基准测试中，涉及解谜和换位思考等任务，我们证明Mull-Tokens在利用仅文本推理或交错图像-文本推理的多个基线之上取得了平均3%的改进，最高达16%的解谜推理密集部分改进，与我们的最强基线相比。Mull-Tokens为关于文本和视觉推理的接地挑战提供了简单解决方案，以抽象方式在多种模态中思考。

Summary / 总结

Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities, allowing models to reason flexibly. They are trained using interleaved text-image traces and fine-tuned with only final answers. Mull-Tokens outperform text-only and interleaved image-text reasoning baselines on four spatial reasoning benchmarks, achieving up to a 16% improvement in puzzle solving tasks.

Mull-Tokens 是模态无关的潜在标记，预训练以在图像或文本模态中持有中间信息，使模型能够自由地推理以得出正确答案。它们通过交错的文本-图像痕迹进行训练，并仅使用最终答案进行微调。Mull-Tokens 在四个空间推理基准测试中优于仅文本和交错图像-文本推理基线，特别是在解谜推理密集型任务中取得了高达 16% 的改进。

OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Authors: Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna

First: 2025-12-11T18:59:05+00:00 · Latest: 2025-12-11T18:59:05+00:00

Comments: Project page: https://snap-research.github.io/OmniView/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, among others. As a result, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

中文标题/摘要

标题：OmniView：一种全面视角的扩散模型，用于3D和4D视图合成

先前将相机控制注入扩散模型的方法主要集中在4D一致性任务的特定子集上：新颖视图合成、带有相机控制的文本到视频、图像到视频等。因此，这些分散的方法仅在可用的3D/4D数据的不同片段上进行训练。我们提出了OmniView，这是一种统一框架，能够泛化到广泛的4D一致性任务。我们的方法分别表示空间、时间和视图条件，允许这些输入的灵活组合。例如，OmniView可以从静态、动态和多视角输入中合成新颖视图，向前向后推断时间轨迹，并根据完整的相机控制从文本或图像提示生成视频。在多种基准和指标上，OmniView与特定任务模型竞争，提高相机条件下的扩散模型的图像质量评分，最高可达多视角NVS LLFF数据集中的33%，动态NVS神经3D视频基准中的60%，静态相机控制RE-10K中的20%，并在文本条件下的视频生成中将相机轨迹误差减少4倍。凭借一个模型的强大泛化能力，OmniView展示了通用4D视频模型的可行性。项目页面可在https://snap-research.github.io/OmniView/获取。

GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Authors: Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh

First: 2025-12-11T18:59:02+00:00 · Latest: 2025-12-11T18:59:02+00:00

Comments: IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abs · PDF · Code1 · Code2

Abstract

Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.

中文标题/摘要

标题：GaussianHeadTalk：基于音频驱动高斯点积的无晃动3D对话头

基于语音的对话头最近出现，能够实现交互式虚拟角色。然而，实际应用受到限制，因为当前方法在视觉保真度高但速度慢或快且时间不稳定之间存在权衡。扩散方法能够生成逼真图像，但在单次设置中表现不佳。高斯点积方法实时性好，但面部跟踪不准确或高斯映射不一致会导致输出不稳定和视频伪影，这些伪影对现实使用场景不利。我们通过将高斯点积映射到3D可变模型来解决这一问题，以生成个性化的虚拟角色。我们引入基于变换器的模型参数预测，直接从音频驱动时间一致性。从单目视频和独立的音频语音输入，我们的方法能够生成实时对话头视频，我们报告了具有竞争力的定量和定性性能。

Summary / 总结

The research aims to improve the stability and realism of 3D talking heads driven by audio. The method combines Gaussian Splatting with 3D Morphable Models and transformer-based prediction of model parameters from audio to ensure temporal consistency. Key experimental findings show that the method achieves real-time generation of talking head videos with competitive performance compared to existing techniques, addressing the issues of instability and video artifacts in previous approaches.

研究旨在提高由音频驱动的3D表情的稳定性和逼真度。方法结合了3D可变形模型和从音频预测模型参数的变压器，以确保时间一致性。实验结果表明，所提出的方法能够实时生成具有竞争力性能的说话头像视频，解决了之前方法在稳定性和视觉保真度方面的局限性。

Stronger Normalization-Free Transformers

Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

First: 2025-12-11T18:58:49+00:00 · Latest: 2025-12-11T18:58:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
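
The formula in the abstract is small enough to write out directly. The sketch below evaluates Derf element-wise with the standard library's `math.erf`, treating α and s as plain floats; in a trained network they would presumably be learnable parameters, and any additional scale or shift the paper may apply around the function is not shown because the abstract does not state it. The "rescaled Gaussian CDF" remark corresponds to the identity erf(x) = 2Φ(√2·x) − 1, where Φ is the standard normal CDF.

```python
import math

def derf(x, alpha=1.0, s=0.0):
    """Derf(x) = erf(alpha * x + s), as given in the abstract."""
    return math.erf(alpha * x + s)

def derf_layer(activations, alpha=1.0, s=0.0):
    """Apply Derf point-wise to a list of activations.

    alpha and s are fixed floats here (an assumption for illustration).
    """
    return [derf(a, alpha, s) for a in activations]

# Extreme inputs are squashed into (-1, 1), which is the bounding
# behaviour the abstract attributes to point-wise functions like DyT.
print(derf_layer([-50.0, -0.5, 0.0, 0.5, 50.0], alpha=0.8))
```

Like tanh, the function is odd when s = 0 and saturates at ±1, so very large activations cannot propagate unchanged.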

中文标题/摘要

标题：无需归一化的更强变换器

尽管归一化层长期以来被视为深度学习架构中不可或缺的组成部分，但最近引入的动态双曲正切（DyT）表明，替代方案是可能的。点函数DyT通过限制极端值实现稳定的收敛，并达到与归一化相当的性能；本研究旨在寻找超越它的函数设计。我们首先研究点函数内在属性如何影响训练和性能。在此基础上，我们进行大规模搜索以找到更有效的函数设计。通过这一探索，我们引入了$\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$，其中$\mathrm{erf}(x)$是归一化高斯累积分布函数，并将其识别为最有效的设计。Derf在包括视觉（图像识别和生成）、语音表示和DNA序列建模在内的广泛领域中优于LayerNorm、RMSNorm和DyT。我们的研究结果表明，Derf的性能提升主要来自于其更好的泛化能力，而不是更强的拟合能力。其简洁性和更强的性能使Derf成为无需归一化的变换器架构的实用选择。

Summary / 总结

This work explores alternatives to normalization layers in deep learning architectures, motivated by the success of Dynamic Tanh (DyT) in achieving normalization-level performance without normalization. Through a large-scale search, the authors introduce Derf, defined as $\mathrm{erf}(αx + s)$, which outperforms existing methods like LayerNorm, RMSNorm, and DyT in various domains including vision, speech, and DNA sequence modeling. The performance gains of Derf are attributed to improved generalization rather than increased fitting capacity.

这项研究旨在寻找超越动态双曲函数(DyT)的功能设计，DyT在不使用归一化的情况下达到了与LayerNorm相当的性能。研究探索了点式函数属性对训练和性能的影响，引入了Derf，定义为$\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$。Derf在图像识别、语音表示和DNA序列建模等多个领域中均优于LayerNorm、RMSNorm和DyT。性能提升主要归因于更好的泛化能力而非更强的拟合能力。

Any4D: Unified Feed-Forward Metric 4D Reconstruction

Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan

First: 2025-12-11T18:57:39+00:00 · Latest: 2025-12-11T18:57:39+00:00

Comments: Project Website: https://any-4d.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
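
The split between egocentric and allocentric factors can be read as a recipe for assembling a world-space 4D point. The sketch below assumes a pinhole camera and a camera-to-world pose convention (the abstract specifies neither): depth and intrinsics give a point in local camera coordinates, the extrinsics move it into world coordinates, and a scene-flow vector displaces it over time. All function names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3-D point in local camera coordinates.

    Uses only egocentric factors: the depth map value and the intrinsics.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def to_world(point, rotation, translation):
    """Apply a camera-to-world pose (3x3 rotation as rows, 3-vector translation).

    The camera-to-world direction is an assumption of this sketch.
    """
    return tuple(
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )

def advect(point_world, flow_world):
    """Displace a world-space point by its world-space scene flow."""
    return tuple(p + f for p, f in zip(point_world, flow_world))

identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
p_cam = unproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = to_world(p_cam, identity, (1.0, 0.0, 0.0))
print(advect(p_world, (0.0, 0.0, 0.5)))
```

Keeping the factors separate means an extra sensor can, in principle, supply one of them directly (for example, depth from RGB-D or egomotion from an IMU) without changing the rest of the chain.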

中文标题/摘要

标题：Any4D：统一的多视图变换器，用于米尺度的前馈4D重建

我们提出了Any4D，一种可扩展的多视图变换器，用于米尺度的密集前馈4D重建。Any4D直接生成N帧的每个像素的运动和几何预测，而以往的工作通常专注于双视图密集场景流或稀疏3D点跟踪。此外，与其他基于单目RGB视频的4D重建方法不同，Any4D可以处理可用的其他模态和传感器，如RGB-D帧、基于IMU的自我运动和雷达多普勒测量。关键创新之一是4D场景的模块化表示；具体来说，每个视图的4D预测使用多种以自我为中心的因素（深度图和相机内参）表示在局部相机坐标系中，以及以全局世界坐标系表示的以他为中心的因素（相机外参和场景流）。我们在多种设置下实现了优越的性能——在准确性和计算效率方面（误差降低2-3倍，速度快15倍），为多个下游应用打开了新的途径。

Summary / 总结

Any4D is a scalable multi-view transformer designed for metric-scale, dense feed-forward 4D reconstruction. It directly generates per-pixel motion and geometry predictions for N frames, unlike previous methods that focus on 2-view dense scene flow or sparse 3D point tracking. Any4D can process additional modalities such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements. It uses a modular 4D scene representation with egocentric and allocentric factors, achieving superior performance in terms of accuracy and compute efficiency across various setups.

Any4D 是一种可扩展的多视图变换器，用于实现度量尺度的密集前馈4D重建，可以直接生成 N 帧的逐像素运动和几何预测。它可以处理各种模态，包括 RGB-D 帧、IMU 基本的自我运动和雷达多普勒测量。其灵活性的关键在于一种模块化的4D场景表示，使其在多种设置下表现出色，具有较低的误差和更高的计算效率，优于先前的方法。

Curriculum-Based Reinforcement Learning for Autonomous UAV Navigation in Unknown Curved Tubular Conduit

Authors: Zamirddine Mari, Jérôme Pasquet, Julien Seinturier

First: 2025-12-11T18:57:29+00:00 · Latest: 2025-12-11T18:57:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to continuous physical dynamics. The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow conduits that offer few perceptual cues represents a central challenge.
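
For readers unfamiliar with the baseline, here is a minimal planar sketch of Pure Pursuit, which shows why it needs the centerline: the controller steers along the arc that passes through a lookahead point on a known path. The 2-D simplification, the waypoint-selection rule, and all names are assumptions for illustration; the paper's baseline operates in three-dimensional tubes and its exact configuration is not given in the abstract.

```python
import math

def pure_pursuit_curvature(pose, path, lookahead):
    """Curvature of the arc from the vehicle to a lookahead point on the path.

    pose is (x, y, heading in radians); path is a list of (x, y) waypoints,
    i.e. the explicit centerline the RL agent does not get to see.
    """
    x, y, heading = pose
    # Pick the first waypoint at least `lookahead` away; else use the last one.
    target = path[-1]
    for wx, wy in path:
        if math.hypot(wx - x, wy - y) >= lookahead:
            target = (wx, wy)
            break
    dx, dy = target[0] - x, target[1] - y
    # Express the target in the vehicle frame (x forward, y to the left).
    local_x = math.cos(heading) * dx + math.sin(heading) * dy
    local_y = -math.sin(heading) * dx + math.cos(heading) * dy
    dist_sq = local_x ** 2 + local_y ** 2
    # Classic pure-pursuit relation: curvature = 2 * lateral offset / distance^2.
    return 0.0 if dist_sq == 0 else 2.0 * local_y / dist_sq

centerline = [(1.0, 0.0), (2.0, 0.5), (3.0, 1.5)]
print(pure_pursuit_curvature((0.0, 0.0, 0.0), centerline, lookahead=1.5))
```

A positive result means turn left and a negative one means turn right; without the `path` argument the controller has nothing to aim at, which is the information asymmetry the abstract describes.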

中文标题/摘要

标题：基于课程学习的强化学习在未知曲管导管中的自主无人机导航

在受限的管状环境中自主无人机导航仍然是一个重大挑战，由于导管的几何约束、墙壁的接近性和此类场景固有的感知限制。我们提出了一种强化学习方法，使无人机能够在没有任何关于其几何形状的先验知识的情况下，仅依靠LiDAR的局部观察和管中心的条件视觉检测来导航未知的三维管状结构。相比之下，作为确定性基线的纯追迹算法可以从显式访问中心线中获益，从而创建一种信息不对称，旨在评估强化学习补偿几何模型缺失的能力。代理通过逐步暴露于越来越弯曲的几何形状中进行训练，其中管中心经常从视觉中消失。基于直接可见性、方向记忆和LiDAR对称性线索的转向协商机制对于确保在部分可观测条件下稳定导航至关重要。实验表明，PPO策略获得了稳健且可泛化的行为，即使在有限的几何信息访问下也始终优于确定性控制器。在高保真3D环境中的验证进一步证实了所学行为在连续物理动力学中的可转移性。因此，所提出的方法为在未知管状环境中的自主导航提供了一个完整的框架，并为工业、地下或医疗应用中通过狭窄和感知能力弱的导管提供了新的前景。

Summary / 总结

This paper addresses the challenge of autonomous drone navigation in unknown curved tubular environments. It proposes a reinforcement learning approach that uses local LiDAR observations and visual detection of the tube center to navigate without prior knowledge of the conduit's geometry. The agent is trained through a Curriculum Learning strategy that gradually exposes it to more curved geometries. The PPO policy demonstrates robust and generalizable behavior, outperforming a deterministic Pure Pursuit controller that has access to the centerline. Experiments in a high-fidelity 3D environment confirm the transferability of the learned behavior to continuous physical dynamics.

本文解决了自主无人机在未知曲管环境中的导航挑战。提出了一种基于强化学习的方法，利用局部LiDAR和视觉检测来导航，无需先验了解管道的几何结构。代理通过Curriculum Learning策略逐步暴露于更弯曲的几何结构中。PPO策略展示了稳健且可泛化的行为，优于具有中心线访问权的确定性控制器。高保真3D环境中的实验进一步证实了所学行为在实际动力学中的可转移性。

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong

First: 2025-12-11T18:57:05+00:00 · Latest: 2025-12-11T18:57:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Young children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, the DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. The DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on the DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

中文标题/摘要

标题：BabyVLM-V2：面向发展性基础视觉模型预训练和基准测试的框架

早期儿童的发展轨迹为高效样本预训练视觉基础模型提供了自然目标。我们介绍了BabyVLM-V2，这是一种基于发展的婴儿启发式视觉-语言建模框架，通过纵向多维度预训练集、多功能模型以及最重要的是DevCV工具箱进行认知评估，大幅改进了BabyVLM-V1。预训练集最大化覆盖范围同时最小化纵向婴儿为中心的视听内容的整理，产生视频-语句、图像-语句和多轮对话数据，反映婴儿的经验。DevCV工具箱将最近发布的NIH婴儿工具箱中的所有视觉相关度量标准改编为涵盖空间推理、记忆和词汇理解的十项多模态任务基准套件，这些任务与早期儿童的能力相一致。实验结果表明，从零开始预训练的紧凑模型在DevCV工具箱上可以达到竞争力的表现，某些任务上优于GPT-4o。我们希望原理统一的BabyVLM-V2框架能够加速发展性基础视觉模型预训练的研究。

Summary / 总结

The research aims to develop vision foundation models that learn from early childhood developmental trajectories for more sample-efficient pretraining. BabyVLM-V2 introduces a longitudinal, multifaceted pretraining set and the DevCV Toolbox for cognitive evaluation, a benchmark suite of ten multimodal tasks covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. A compact model pretrained from scratch performs competitively on this suite and surpasses GPT-4o on some tasks, supporting the developmentally grounded approach.

研究旨在通过模仿早期儿童发展轨迹来训练视觉基础模型，以实现更高效的预训练。BabyVLM-V2 提出了一个纵向多方面的预训练数据集和 DevCV 工具箱进行认知评估，最终得到一个紧凑的模型，在十个跨模态任务上表现良好，某些任务上甚至超过了 GPT-4o。主要发现包括模型能够理解与早期儿童能力相匹配的空间推理、记忆和词汇理解，展示了发展性基础方法的有效性。

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Authors: Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li

First: 2025-12-11T18:53:15+00:00 · Latest: 2025-12-11T18:53:15+00:00

Comments: Code is available at https://github.com/Wolfv0/FoundationMotion/tree/main

Abs · PDF · Code1 · Code2 · Code3

Abstract

Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
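
The trajectory-to-question step can be pictured with a toy stand-in. In the sketch below a tracked object is a list of bounding boxes, and a fixed template turns its net displacement into a question-answer pair. This is only an illustration of the data flow: the paper uses Large Language Models rather than templates to write captions and questions, and the box format, threshold, and function names here are assumptions.

```python
def box_center(box):
    """Center of an (x1, y1, x2, y2) bounding box in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def describe_motion(track, min_shift=5.0):
    """Summarise the net image-plane motion of a tracked box sequence."""
    (x0, y0), (x1, y1) = box_center(track[0]), box_center(track[-1])
    dx, dy = x1 - x0, y1 - y0
    if max(abs(dx), abs(dy)) < min_shift:
        return "stays roughly in place"
    if abs(dx) >= abs(dy):
        return "moves right" if dx > 0 else "moves left"
    # Image coordinates grow downward, so positive dy means moving down.
    return "moves down" if dy > 0 else "moves up"

def make_qa(name, track):
    """Template-based QA pair (a stand-in for LLM-generated text)."""
    return {
        "question": f"How does the {name} move across the frame?",
        "answer": f"The {name} {describe_motion(track)}.",
    }

car_track = [(10, 50, 30, 70), (40, 52, 60, 72), (90, 55, 110, 75)]
print(make_qa("car", car_track))
```

Because the answer is derived from tracked geometry rather than manual labels, the same loop can run over arbitrarily many videos, which is the scalability argument the abstract makes.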

中文标题/摘要

标题：FoundationMotion：自动标注和空间运动推理

运动理解是物理推理的基础，使模型能够推断动力学并预测未来状态。然而，最先进的模型在最近的运动基准测试中仍然存在困难，主要是由于缺乏大规模、细粒度的运动数据集。现有的运动数据集通常是从昂贵的手动注释中构建的，严重限制了其可扩展性。为了解决这一挑战，我们引入了FoundationMotion，这是一种完全自动化的数据整理流水线，用于构建大规模的运动数据集。我们的方法首先在视频中检测和跟踪对象以提取其轨迹，然后利用这些轨迹和视频帧以及大型语言模型（LLMs）生成关于运动和空间推理的细粒度描述和多样化的问答对。使用此流水线生成的数据集，我们对包括NVILA-Video-15B和Qwen2.5-7B在内的开源模型进行微调，实现了在运动理解方面的显著改进，同时在其他任务上没有牺牲性能。值得注意的是，我们的模型在各种运动理解数据集和基准测试中均优于强大的闭源基线Gemini-2.5 Flash和大型开源模型Qwen2.5-VL-72B。因此，FoundationMotion为构建细粒度运动数据集提供了一种可扩展的解决方案，这些数据集能够有效微调各种模型以增强运动理解和空间推理能力。

Summary / 总结

The research aims to address the scarcity of large-scale, fine-grained motion datasets by introducing FoundationMotion, an automated data curation pipeline. This pipeline detects and tracks objects in videos to extract their trajectories, then uses Large Language Models to generate captions and question-answer pairs. Fine-tuning open-source models with datasets produced by this pipeline leads to significant improvements in motion understanding without affecting performance on other tasks. The models outperform both closed-source and large open-source baselines on various motion understanding benchmarks.

FoundationMotion 是一个自动数据整理流水线，通过在视频中检测和追踪物体并使用大型语言模型生成细粒度的描述和问答对来构建大规模的运动数据集。使用这些数据集微调 NVILA-Video-15B 和 Qwen2.5-7B 等模型可以显著提高运动理解能力，同时在各种基准测试中优于封闭源和大型开源模型，且不会损害其他任务的性能。

LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding

Authors: Yu Yu, Qian Xie, Nairen Cao, Li Jin

Venue: NeurIPS 2025

First: 2025-12-07T20:25:07+00:00 · Latest: 2025-12-11T18:52:44+00:00

Comments: NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning

Abs · PDF · Code1 · Code2

Abstract

Designing state encoders for reinforcement learning (RL) with multiple information sources -- such as sensor measurements, time-series signals, image observations, and textual instructions -- remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules -- such as their representation quality -- limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.

中文标题/摘要

标题：基于LLM的多源RL状态编码复合神经架构搜索

多信息源（如传感器测量、时间序列信号、图像观察和文本指令）的强化学习（RL）状态编码设计仍处于未充分探索的状态，通常需要手动设计。我们将这一挑战形式化为复合神经架构搜索（NAS）问题，其中多个源特定模块和融合模块联合优化。现有的NAS方法忽略了这些模块中间输出的有用侧信息（如其表示质量），限制了在多源RL设置中的样本效率。为解决这一问题，我们提出了一种基于LLM的NAS管道，其中LLM作为神经架构设计代理，利用语言模型先验和中间输出信号来引导高效搜索高性能的复合状态编码。在混合自主交通控制任务上，我们的方法在较少的候选评估下发现性能更高的架构，优于传统的NAS基线和基于LLM的GENIUS框架。

Summary / 总结

This paper addresses the challenge of designing state encoders for reinforcement learning with multiple information sources by formulating it as a composite neural architecture search problem. The authors propose an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide the search for high-performing composite state encoders, improving sample efficiency. On a mixed-autonomy traffic control task, their approach discovers better architectures with fewer evaluations compared to traditional NAS methods and the GENIUS framework.

论文通过将多信息源强化学习状态编码设计问题形式化为复合神经架构搜索问题，提出了一种基于LLM的NAS管道，利用语言模型先验和中间输出信号来引导高效搜索高性能复合状态编码器，在混合自主交通控制任务中，该方法比传统NAS基线和GENIUS框架使用更少的候选评估发现更高性能的架构。

Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation

Authors: Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap

First: 2025-12-11T18:52:42+00:00 · Latest: 2025-12-11T18:52:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.
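
As a rough illustration of the boundary ray-casting component of the observation space, the sketch below computes the distance from the vehicle to the border of the operational area along a fan of rays. The axis-aligned rectangular 2D area, the function names `ray_to_boundary` and `boundary_scan`, and the number of rays are all assumptions made for this example; the abstract does not specify them.

```python
import math
from typing import List, Tuple

# Assumed layout: (xmin, ymin, xmax, ymax) of a rectangular operational area.
Bounds = Tuple[float, float, float, float]


def ray_to_boundary(x: float, y: float, angle: float, bounds: Bounds) -> float:
    """Distance from (x, y) along `angle` (radians) to the border of the area.

    Assumes the point lies inside the rectangle.
    """
    xmin, ymin, xmax, ymax = bounds
    dx, dy = math.cos(angle), math.sin(angle)
    eps = 1e-12  # treat near-zero direction components as parallel to a wall
    best = math.inf
    if dx > eps:
        best = min(best, (xmax - x) / dx)
    elif dx < -eps:
        best = min(best, (xmin - x) / dx)
    if dy > eps:
        best = min(best, (ymax - y) / dy)
    elif dy < -eps:
        best = min(best, (ymin - y) / dy)
    return best


def boundary_scan(x: float, y: float, heading: float, bounds: Bounds,
                  num_rays: int = 8) -> List[float]:
    """Distances to the border along `num_rays` evenly spaced rays."""
    step = 2.0 * math.pi / num_rays
    return [ray_to_boundary(x, y, heading + k * step, bounds)
            for k in range(num_rays)]
```

A vector like the one returned by `boundary_scan` could be concatenated with goal-relative features and an occupancy grid to form a policy input, although the actual encoding used in the paper may differ.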

中文标题/摘要

标题：数字孪生监督强化学习框架用于自主水下导航

由于缺乏GPS、能见度差以及存在水下障碍物，水下环境中的自主导航仍然是一个主要挑战。本文通过BlueROV2这一广泛用于科学实验的开放平台案例，探讨了这些问题。我们提出了一种基于Proximal Policy Optimization (PPO) 算法的深度强化学习方法，使用结合目标导向导航信息、虚拟占用网格以及沿操作区域边界进行的射线投射的观测空间。学习到的策略与常用的鲁棒障碍物避免基准——动态窗口方法（DWA）进行了比较。评估在现实的模拟环境中进行，并通过3D数字孪生监督的物理BlueROV2进行验证，有助于降低实际实验的风险。结果表明，在高度拥挤的环境中，PPO策略始终优于DWA，特别是在局部适应性和减少碰撞方面表现更佳。最后，实验验证了从模拟到现实世界的策略可转移性，证实了深度强化学习在水下机器人自主导航中的相关性。

Summary / 总结

This study addresses autonomous underwater navigation with a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, validated on a physical BlueROV2 supervised by a 3D digital twin of the test site. The observation space combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against the Dynamic Window Approach (DWA), a deterministic kinematic planner, in a realistic simulation environment. Results indicate that the PPO policy consistently outperforms DWA in highly cluttered environments, with better local adaptation and fewer collisions. The experiments also demonstrate that the learned behavior transfers from simulation to the real world.

本文提出了一种基于Proximal Policy Optimization (PPO)算法、由数字孪生监督的强化学习框架，以应对水下环境中自主导航的挑战。该框架的观测空间结合了目标导向的导航信息、虚拟占用网格以及沿作业区域边界的射线投射。所学策略在逼真的仿真环境中与动态窗口方法（DWA）进行了对比，并在物理BlueROV2上进行了验证；结果显示，在高度杂乱的环境中，PPO策略优于DWA，具有更好的局部适应性和更少的碰撞。实验还证实了所学行为可从仿真迁移到现实世界。

If generative AI is the answer, what is the question?

Authors: Ambuj Tewari

First: 2025-09-07T16:07:45+00:00 · Latest: 2025-12-11T18:45:18+00:00

Comments: To appear as a book chapter in a Springer book titled "Statistical Foundations and Applications of Artificial Intelligence, Machine Learning and Deep Learning" and edited by S. Ejaz Ahmed, Pierre Alquier, Yi Li, Shuangge Ma

Abs · PDF · Code1 · Code2

Abstract

Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.

中文标题/摘要

标题：如果生成AI是答案，那么问题是什么？

从文本和图像开始，生成式AI已经扩展到音频、视频、计算机代码和分子。然而，如果生成式AI是答案，那么问题是什么？我们探讨了生成作为一项独立机器学习任务的基础，及其与预测、压缩和决策的联系。我们概述了五大生成模型家族：自回归模型、变分自编码器、归一化流、生成对抗网络和扩散模型。然后我们介绍了一个概率框架，强调密度估计与生成之间的区别。我们回顾了一个博弈论框架，它采用对手与学习者的双人博弈设置来研究生成。我们讨论了为生成模型部署做准备的训练后修改。最后，我们强调了社会责任生成方面的一些重要话题，如隐私、AI生成内容的检测，以及版权和知识产权。我们采用任务优先的视角看待生成，关注生成作为机器学习问题本身是什么，而不仅仅是模型如何实现它。

Distributionally Robust Regret Optimal Control Under Moment-Based Ambiguity Sets

Authors: Feras Al Taha, Eilyan Bitar

First: 2025-12-11T18:36:15+00:00 · Latest: 2025-12-11T18:36:15+00:00

Comments: 21 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

In this paper, we consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To address the distributional ambiguity, we explore the design of causal affine control policies to minimize the worst-case expected regret over all distributions in the given ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program that corresponds to a regularized version of the nominal linear-quadratic stochastic control problem. While this convex program can be recast as a semidefinite program, semidefinite programs are typically solved using primal-dual interior point methods that scale poorly with the problem size in practice. To address this limitation, we propose a scalable dual projected subgradient method to compute optimal controllers to an arbitrary accuracy. Numerical experiments are presented to benchmark the proposed method against state-of-the-art data-driven and distributionally robust control design approaches.
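
The moment-based ambiguity set and the minimax regret objective described above can be written out as follows. This is a sketch under assumed notation: the symbols $\hat{\mu}$, $\hat{\Sigma}$, $r_\mu$, $r_\Sigma$, $\Pi_{\mathrm{aff}}$, $J$, and $J^\star$ are introduced here for illustration, the norms are left unspecified, and the regret benchmark $J^\star$ (for example, the cost of a clairvoyant policy that knows the noise realization) is an assumption, since the abstract does not state which benchmark is used.

```latex
% Ambiguity set: all noise distributions whose mean and covariance lie in
% norm balls around nominal values (radii r_mu, r_Sigma are assumed symbols).
\mathcal{P} = \Bigl\{ P \;:\;
    \bigl\| \mathbb{E}_P[w] - \hat{\mu} \bigr\| \le r_\mu, \;\;
    \bigl\| \operatorname{Cov}_P(w) - \hat{\Sigma} \bigr\| \le r_\Sigma
\Bigr\}

% Worst-case expected regret minimized over causal affine policies.
\min_{\pi \in \Pi_{\mathrm{aff}}} \; \sup_{P \in \mathcal{P}} \;
    \mathbb{E}_P\bigl[ J(\pi, w) - J^\star(w) \bigr]
```

Setting $r_\mu = r_\Sigma = 0$ collapses $\mathcal{P}$ to the distributions matching the nominal moments, which is consistent with the abstract's description of the convex reformulation as a regularized version of the nominal problem.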

Summary / 总结

This paper studies a class of finite-horizon linear-quadratic stochastic control problems under distributional ambiguity: the noise distribution is unknown but assumed to belong to an ambiguity set of all distributions whose mean and covariance lie within norm balls centered at given nominal values. The authors design causal affine control policies that minimize the worst-case expected regret over this set and show that the resulting minimax problem admits an equivalent reformulation as a tractable convex program, a regularized version of the nominal linear-quadratic problem. Because the semidefinite-program form scales poorly with interior point methods, they propose a scalable dual projected subgradient method that computes optimal controllers to arbitrary accuracy, and they benchmark it numerically against state-of-the-art data-driven and distributionally robust control design approaches.

本文研究了一类分布不确定下的有限时域线性二次随机控制问题，其中噪声分布未知，但假定属于一个模糊集，该集合由均值和协方差位于以给定标称值为中心的范数球内的所有分布构成。作者设计因果仿射控制策略，以最小化模糊集内所有分布上的最坏情况期望后悔，并证明由此得到的极小极大最优控制问题可等价地重新表述为一个易于求解的凸规划。他们提出了一种可扩展的对偶投影次梯度方法，能够以任意精度计算最优控制器，并通过数值实验将该方法与最先进的数据驱动和分布鲁棒控制设计方法进行了基准比较。

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

First: 2025-12-11T18:23:03+00:00 · Latest: 2025-12-11T18:23:03+00:00

Comments: Project page: https://intchous.github.io/DuetSVG-site

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

中文标题/摘要

标题：DuetSVG：统一的多模态SVG生成，带有内部视觉指导

基于视觉-语言模型（VLM）的方法在SVG生成方面取得了令人印象深刻的成果。然而，由于它们在解码过程中仅生成文本而缺乏视觉信号，因此往往难以处理复杂的语义，无法生成视觉上吸引人或几何上一致的SVG。我们提出了DuetSVG，这是一种统一的多模态模型，可以以端到端的方式联合生成图像标记和相应的SVG标记。DuetSVG在图像和SVG数据集上进行训练。在推理时，我们应用了一种新颖的测试时缩放策略，利用模型的原生视觉预测作为指导，以提高SVG解码质量。广泛的实验表明，我们的方法优于现有方法，能够生成视觉上忠实、语义上对齐且语法上干净的SVG。

Summary / 总结

Existing vision-language model (VLM)-based approaches to SVG generation produce only text and lack visual signals during decoding, so they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. DuetSVG is a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens end-to-end and is trained on both image and SVG datasets. At inference, a test-time scaling strategy uses the model's own visual predictions as guidance to improve SVG decoding. Experiments show that DuetSVG outperforms existing methods, generating visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

研究动机是解决现有基于视觉语言模型的方法在SVG生成中的局限性，这些方法在解码过程中缺乏视觉信号，导致生成的SVG往往不够视觉吸引或几何上不连贯。主要方法是DuetSVG，这是一种联合生成图像和SVG标记的统一多模态模型，通过同时训练图像和SVG数据集来实现端到端的生成。关键实验发现表明，DuetSVG在各种应用中生成了视觉上忠实、语义上对齐且语法上干净的SVG。

Iterative Compositional Data Generation for Robot Control

Authors: Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

First: 2025-12-11T18:20:49+00:00 · Latest: 2025-12-11T18:20:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

中文标题/摘要

标题：机器人控制的迭代组合数据生成

收集机器人操作数据成本高昂，因此要为多对象、多机器人和多环境设置中组合爆炸式增长的任务空间获取演示是不切实际的。尽管最近的生成模型可以为单个任务合成有用的数据，但它们没有利用机器人领域的组合结构，难以泛化到未见过的任务组合。我们提出了一种语义组合扩散Transformer，将状态转移分解为机器人特定、对象特定、障碍物特定和目标特定的组件，并通过注意力学习它们的交互。在仅对任务的有限子集进行训练后，我们展示了该模型可以零样本生成高质量的状态转移，并可据此学习未见过的任务组合的控制策略。然后，我们引入了一种迭代自我改进过程，其中合成数据通过离线强化学习进行验证，并纳入后续训练轮次。我们的方法在零样本性能上显著优于整体式基线和硬编码的组合基线，最终解决了几乎所有留出任务，并表明学习到的表示中涌现出了有意义的组合结构。

Summary / 总结

The research aims to address the high cost of collecting robotic manipulation data, which makes it impractical to gather demonstrations for the combinatorially large space of tasks in multi-object, multi-robot, and multi-environment settings. The method is a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. After training on a limited subset of tasks, the model can zero-shot generate high-quality transitions for unseen task combinations, which are then used to learn control policies. An iterative self-improvement procedure validates synthetic data through offline reinforcement learning and incorporates it into subsequent training rounds. This approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, solving nearly all held-out tasks and revealing meaningful compositional structure in the learned representations.

研究旨在解决收集机器人操作数据成本高昂的问题，特别是在涉及多个物体、机器人和环境的复杂任务中。方法是使用一个语义组合扩散变换器，将过渡分解为特定组件，并通过注意力学习它们的交互。经过部分任务的训练后，模型可以生成高质量的过渡用于未见过的任务，进而学习控制策略。通过迭代自我改进程序，利用离线强化学习验证合成数据并将其纳入后续训练，进一步提高模型性能。这种方法在解决未见过的任务方面优于单一和硬编码的组合基线，展示了在学习表示中出现的有意义的组合结构。

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Authors: Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland

First: 2025-12-11T18:19:00+00:00 · Latest: 2025-12-11T18:19:00+00:00

Comments: 15 pages, 7 figures

Abs · PDF · Code1 · Code2

Abstract

Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

中文标题/摘要

标题：PubTables-v2：新的大规模数据集用于全页和多页表格提取

表格提取（TE）是视觉文档理解中的一个关键挑战。传统方法首先检测表格，然后识别其结构。最近，人们开始开发可以直接在全页或文档上下文中提取表格的方法，例如视觉-语言模型（VLMs）。然而，由于缺乏标注数据，进步难以展示。为了解决这个问题，我们创建了一个新的大规模数据集，PubTables-v2。PubTables-v2 支持当前许多具有挑战性的表格提取任务。值得注意的是，它是第一个大规模的多页表格结构识别基准。我们通过在这些任务上评估领域专用的 VLMs 来展示其用途，并突出当前的进展。最后，我们使用 PubTables-v2 创建了 Page-Object Table Transformer（POTATR），这是一种表格变换器的图像到图扩展，用于全面的页面级 TE。数据、代码和训练模型将被发布。

Summary / 总结

The research aims to advance table extraction in visual document understanding, where progress on full-page and document-level methods has been hard to demonstrate for lack of annotated data. The authors introduce PubTables-v2, a large-scale dataset for full-page and multi-page table extraction and the first large-scale benchmark for multi-page table structure recognition. They evaluate domain-specialized vision-language models on its tasks and use the dataset to build POTATR, an image-to-graph extension of the Table Transformer for comprehensive page-level table extraction.

研究旨在通过解决标注数据不足的问题，推动视觉文档理解中的表格提取。作者引入了PubTables-v2，这是一个大规模的全页和多页表格提取数据集，特别关注多页表格结构识别。主要发现包括对领域特定的视觉语言模型的评估以及开发了POTATR，即表格变换器的图像到图扩展，用于全面的页面级表格提取。

AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Authors: Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar

Venue: WACV 2026

First: 2025-11-30T11:32:54+00:00 · Latest: 2025-12-11T18:16:38+00:00

Comments: Accepted at WACV 2026 Conference

Abs · PDF · Code1 · Code2

Abstract

There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
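
To make the phrase "token-level affine transformation" concrete, the sketch below modulates each low-resolution token with a scale and shift predicted from a paired high-resolution feature. Everything beyond that phrase is an assumption for illustration: the one-to-one pairing of tokens, the linear predictors, the `(1 + gamma)` form, and the names `linear` and `renormalize_tokens` do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> Vector:
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def renormalize_tokens(low_res: Sequence[Sequence[float]],
                       high_res: Sequence[Sequence[float]],
                       w_scale: Matrix, b_scale: Sequence[float],
                       w_shift: Matrix, b_shift: Sequence[float]) -> List[Vector]:
    """Apply out_i = (1 + gamma_i) * low_i + beta_i to every token i.

    gamma_i and beta_i are predicted from the paired high-resolution
    feature (hypothetical design; the paper's predictor may differ).
    """
    out = []
    for low, high in zip(low_res, high_res):
        gamma = linear(high, w_scale, b_scale)  # per-channel scale offset
        beta = linear(high, w_shift, b_shift)   # per-channel shift
        out.append([(1.0 + g) * v + b for g, v, b in zip(gamma, low, beta)])
    return out
```

With all predictor weights and biases at zero the function returns the low-resolution tokens unchanged, so a modulation of this form can start as an identity mapping and learn to inject high-resolution detail gradually.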

中文标题/摘要

标题：AFRAgent：一种基于自适应特征重规范化的高分辨率感知GUI代理

随着移动用户界面（UI）自动化在各个行业中的广泛应用，对其的需求日益增长。随着视觉语言模型（VLMs）的出现，GUI自动化从生成供人类使用的文本指令发展到自主执行任务，从而优化自动化工作流程。最近的方法利用VLMs来解决这一问题，因为它们能够1）直接处理屏幕内容，2）通过利用人类操作（例如点击、输入）而不依赖于特定设备的API，3）应用现实世界的上下文知识来理解任务。然而，由于视觉编码器特征中的空间信息有限，这些模型往往难以准确识别控件和确定操作。此外，表现最佳的模型通常很大，需要大量训练，并导致推理延迟。在本工作中，我们引入了AFRAgent，这是一种基于instruct-BLIP的多模态架构，在GUI自动化中取得了更优的性能，同时其规模不到最接近竞争者的四分之一。为了增强大型语言模型（LLM）管道中的图像嵌入，我们提出了一种基于自适应特征重规范化（一种令牌级仿射变换）的技术，有效丰富了低分辨率图像嵌入并融合了高分辨率细节。我们在Meta-GUI和AITW基准上评估了AFRAgent，为智能手机自动化建立了新的最先进基线。

Summary / 总结

AFRAgent targets mobile GUI automation, where existing visual language models often struggle to identify widgets and determine actions because vision encoder features carry limited spatial information, and where top-performing models tend to be large. It is an instruct-BLIP-based multimodal architecture less than one-fourth the size of its nearest competitor. An adaptive feature renormalization technique (a token-level affine transformation) enriches low-resolution image embeddings and fuses in high-resolution details. AFRAgent sets a new state-of-the-art baseline for smartphone automation on the Meta-GUI and AITW benchmarks.

AFRAgent旨在通过解决现有视觉语言模型的局限性（如控件识别不准确和模型规模过大），来提升移动UI自动化。它采用基于instruct-BLIP的多模态架构，并使用自适应特征重规范化技术来增强图像嵌入，从而丰富低分辨率图像嵌入并融合高分辨率细节。在Meta-GUI和AITW基准上的实验结果表明，AFRAgent在性能上优于先前模型，同时其规模不到最接近竞争者的四分之一。

Physics-Informed Learning of Flow Distribution and Receiver Heat Losses in Parabolic Trough Solar Fields

Authors: Stefan Matthes, Markus Schramm

First: 2025-12-11T18:16:26+00:00 · Latest: 2025-12-11T18:16:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools. We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods -- when hot oil is circulated through a non-irradiated field -- to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field. The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.

中文标题/摘要

标题：槽式抛物面太阳能场中流量分布与接收器热损失的物理信息学习

槽式抛物面聚光太阳能热发电（CSP）电站运行着由集热器回路组成的大型水力网络，尽管光学性能、热损失和压降在空间上并不均匀，这些回路仍必须提供均匀的出口温度。虽然回路温度可以测量，但回路级的质量流量和接收器热损失参数无法观测，因此无法用标准监测工具诊断水力不平衡或接收器退化。我们提出了一种物理信息学习框架，可直接从常规运行数据中推断（i）回路级的质量流量比和（ii）随时间变化的接收器传热系数。该方法利用夜间均温时段（此时热油在未受辐照的镜场中循环）来分离水力效应和热损失效应。一个可微分的共轭传热模型被离散化并嵌入端到端的学习管道中，并使用50 MW Andasol 3太阳能场的历史电站数据进行优化。该模型准确地重构了回路温度（均方根误差<2°C），并给出了具有物理意义的回路不平衡和接收器热损失估计。与基于无人机的红外热成像（QScan）的比较显示出高度一致，正确识别了所有存在高损失接收器的区域。这表明，当与适当的建模和可微优化相结合时，含噪声的真实CSP运行数据包含足够的信息来恢复潜在的物理参数。

Summary / 总结

The research aims to diagnose hydraulic imbalances and receiver degradation in parabolic trough CSP plants by inferring loop-level mass flows and receiver heat-loss parameters from operational data. The method uses a physics-informed learning framework that exploits nocturnal homogenization periods to isolate hydraulic and thermal-loss effects. The model accurately reconstructs loop temperatures and produces physically meaningful estimates of loop imbalances and receiver heat losses, showing strong correspondence with drone-based infrared thermography data.

研究旨在通过从运行数据中推断回路级质量流量和接收器热损失参数，来诊断槽式抛物面CSP电站的水力不平衡和接收器退化。方法采用物理信息学习框架，利用夜间均温时段来分离水力效应和热损失效应。该模型准确地重构了回路温度，给出了具有物理意义的回路不平衡和接收器热损失估计，并与基于无人机的红外热成像数据高度一致。

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

First: 2025-12-11T18:09:48+00:00 · Latest: 2025-12-11T18:09:48+00:00

Comments: Project page: https://animotionlab.github.io/MoCapAnything/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

中文标题/摘要

标题：MoCapAnything：任意骨架的单目视频统一3D动作捕捉

动作捕捉如今支撑着远超数字人范畴的内容创作，但大多数现有流水线仍然局限于特定物种或特定模板。我们将这一差距形式化为类别无关动作捕捉（CAMoCap）：给定一段单目视频和一个作为提示的任意已绑定骨骼的3D资产，目标是重建可直接驱动该资产的基于旋转的动画（如BVH）。我们提出MoCapAnything，这是一种参考引导的因子化框架，首先预测3D关节轨迹，然后通过约束感知的逆运动学恢复资产特定的旋转。该系统包含三个可学习模块和一个轻量级的IK阶段：（1）参考提示编码器，从资产的骨架、网格和渲染图像中提取逐关节查询；（2）视频特征提取器，计算稠密视觉描述符并重建粗略的4D形变网格，以弥合视频与关节空间之间的差距；（3）统一动作解码器，融合这些线索以生成时间连贯的轨迹。我们还整理了Truebones Zoo，包含1038个动作片段，每个片段提供标准化的骨架-网格-渲染三元组。在域内基准和真实场景视频上的实验表明，MoCapAnything能够生成高质量的骨骼动画，并在异构骨架之间展现出有意义的跨物种重定向，从而实现面向任意资产的可扩展、提示驱动的3D动作捕捉。项目页面：https://animotionlab.github.io/MoCapAnything/

Summary / 总结

MoCapAnything addresses the gap in category-agnostic motion capture by predicting 3D joint trajectories and recovering asset-specific rotations through constraint-aware inverse kinematics. It consists of three learnable modules and a lightweight IK stage: a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder. Experiments demonstrate high-quality skeletal animations and meaningful cross-species retargeting across different rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets.

MoCapAnything 提出了一个统一框架，可从单目视频中为任意 3D 资产生成动画。它采用参考引导的因子化方法，包含三个可学习模块和一个轻量级逆运动学阶段，先预测 3D 关节轨迹，再恢复资产特定的旋转。实验结果显示，它能生成高质量的骨骼动画，并在不同骨架之间实现有意义的跨物种重定向，从而实现可扩展的、提示驱动的 3D 动作捕捉。

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

First: 2025-12-11T18:00:21+00:00 · Latest: 2025-12-11T18:00:21+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

中文标题/摘要

标题：从宏观到微观：通过视觉语言模型评估分子微观空间智能

本文介绍了微观空间智能（MiSI）的概念，即感知和推理看不见的微观实体的空间关系的能力，这是科学研究的基础。为了评估视觉语言模型（VLMs）在这一领域的潜力，我们提出了一种系统性的基准框架MiSI-Bench。该框架包含超过163,000个问答对和587,000张图像，源自约4,000个分子结构，涵盖了九项互补任务，评估能力从基本的空间变换到复杂的关联识别。实验结果表明，当前最先进的VLMs在这一基准上的表现远低于人类水平。然而，微调后的7B模型显示出巨大的潜力，甚至在空间变换任务上超过了人类，而其在氢键识别等基于科学的任务上的表现不佳，突显了整合显式领域知识对于向科学AGI迈进的必要性。数据集可在https://huggingface.co/datasets/zongzhao/MiSI-bench获取。

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Authors: Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

First: 2025-12-11T17:57:24+00:00 · Latest: 2025-12-11T17:57:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

中文标题/摘要

标题：MMSI-Video-Bench：面向视频空间智能的综合基准

对连续视觉输入的空间理解，对于MLLMs发展为物理环境中的通用助手至关重要。然而，目前还没有一个全面的基准能够整体评估这一目标的进展。在本文中，我们介绍了MMSI-Video-Bench，这是一个完全由人工标注的基准，用于评估MLLMs基于视频的空间智能。它通过1,106个问题落实了感知、规划、预测和跨视频推理四层框架，这些问题基于来自25个数据集和内部视频的1,278个片段。每个条目都由3D视觉专家精心设计和审查，并附有解释性理由，以确保精确且无歧义的依据。凭借多样化的数据来源和全面的任务覆盖，MMSI-Video-Bench还支持三个面向特定领域的子基准（Indoor Scene Perception Bench、Robot Bench和Grounding Bench），用于有针对性的能力评估。我们评估了25个强大的开源和专有MLLMs，揭示了显著的人机差距：许多模型的表现接近随机猜测，而最佳推理模型落后人类近60%。我们还发现，经过空间微调的模型在我们的基准上仍然无法有效泛化。细粒度的错误分析揭示了几何推理、运动定位、长时程预测和跨视频对应方面的系统性失败。我们还表明，典型的帧采样策略难以迁移到我们这一推理密集型基准上，而3D空间线索和思维链提示都没有带来有意义的提升。我们期望该基准能为推进基于视频的空间智能建立一个坚实的测试平台。

Summary / 总结

MMSI-Video-Bench is a fully human-annotated benchmark for video-based spatial intelligence in MLLMs, covering Perception, Planning, Prediction, and Cross-Video Reasoning. It comprises 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos, each designed and reviewed by 3DV experts. An evaluation of 25 open-source and proprietary MLLMs reveals a large human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. Typical frame-sampling strategies transfer poorly to the benchmark, and neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains.

MMSI-Video-Bench 是一个完全由人工标注的视频空间智能基准，通过基于1,278个片段的1,106个问题涵盖感知、规划、预测和跨视频推理。它评估了25个MLLMs，揭示了显著的人机差距：许多模型表现接近随机，最佳推理模型落后人类近60%。基准还揭示了几何推理、运动定位、长时程预测和跨视频对应方面的系统性失败，并表明典型的帧采样策略迁移效果不佳，3D空间线索和思维链提示均未带来有意义的提升。

SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Authors: Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

First: 2025-12-11T17:54:31+00:00 · Latest: 2025-12-11T17:54:31+00:00

Comments: Project page: https://animotionlab.github.io/SWIT4D/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
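
One way to picture how a sliding temporal window lets a model handle videos of arbitrary length is a fixed-radius neighborhood: each frame exchanges information only with frames close to it in time. The sketch below lists those neighborhoods. The symmetric window, the `radius` parameter, and the name `temporal_neighbors` are assumptions for illustration; the abstract does not describe the window layout SWiT-4D actually uses.

```python
from typing import List


def temporal_neighbors(num_frames: int, radius: int) -> List[List[int]]:
    """For each frame, list the frame indices inside its sliding window.

    Frame t sees frames max(0, t - radius) .. min(num_frames - 1, t + radius),
    so the window is clipped at the start and end of the video.
    """
    if num_frames < 0 or radius < 0:
        raise ValueError("num_frames and radius must be non-negative")
    neighbors = []
    for t in range(num_frames):
        start = max(0, t - radius)
        stop = min(num_frames, t + radius + 1)
        neighbors.append(list(range(start, stop)))
    return neighbors
```

Because the neighborhood size is bounded by `2 * radius + 1` regardless of video length, the per-frame cost stays constant as the video grows, which is the usual motivation for windowed temporal modeling.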

中文标题/摘要

标题：SWiT-4D：滑动窗口变换器用于无损和参数自由的时空4D生成

尽管在4D内容生成方面取得了显著进展，但将单目视频转换为高质量的动画3D资产，特别是带有明确4D网格的资产，仍然极具挑战性。由于缺乏大规模的自然捕获4D网格数据集，这进一步限制了从头开始以纯数据驱动方式训练通用的视频到4D模型的能力。与此同时，由大量数据支持的图像到3D生成的进步提供了强大的先验模型，可以加以利用。为了更好地利用这些先验模型并尽量减少对4D监督的依赖，我们提出了SWiT-4D，一种用于无损、参数自由的时空4D网格生成的滑动窗口变换器。SWiT-4D可以无缝集成到任何基于扩散变换器（DiT）的图像到3D生成器中，同时在视频帧之间进行空间-时间建模，保留原始单图像前向处理过程，从而可以从任意长度的视频中重建4D网格。为了恢复全局平移，我们进一步引入了一个针对静态相机单目视频的基于优化的轨迹模块。SWiT-4D展示了强大的数据效率：仅需一个短于10秒的视频进行微调，即可实现高保真几何结构和稳定的时空一致性，表明在极其有限的4D监督下具有实际部署能力。在领域内动物园测试集和具有挑战性的领域外基准测试集（C4D、Objaverse和野外视频）上的全面实验表明，SWiT-4D在时间平滑度方面始终优于现有基线。项目页面：https://animotionlab.github.io/SWIT4D/

Summary / 总结

SWiT-4D is a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation from monocular videos. It plugs into existing Diffusion Transformer (DiT)-based image-to-3D generators, adding spatial-temporal modeling across frames while preserving the original single-image forward process, which reduces reliance on 4D supervision. An optimization-based trajectory module recovers global translation for static-camera videos. Fine-tuned on a single short (<10s) video, SWiT-4D achieves high-fidelity geometry and stable temporal consistency, and experiments show it outperforms existing baselines in temporal smoothness on in-domain and out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos).

SWiT-4D 是一种滑动窗口变换器，用于从单目视频生成无损且无需参数的时空4D网格。它与现有的基于扩散变换器的图像到3D生成器集成，以建模时空动态，无需大量4D监督。SWiT-4D 引入了一种基于优化的轨迹模块来处理静态摄像机视频，并展示了强大的数据效率，即使在少量微调的情况下也能实现高保真几何和时间一致性。实验表明，SWiT-4D 在各种基准测试中在时间平滑性方面优于现有方法。
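
The abstract does not spell out how the sliding window is applied. As a rough sketch only, one way to add temporal context without new parameters is to let each frame's tokens attend to the tokens of neighboring frames through the same attention computation. With a window radius of 0 this reduces exactly to per-frame attention, which is one possible reading of "lossless". The function names, the `radius` parameter, and the use of raw tokens as queries, keys, and values (no learned projections) are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention; context tokens act as both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def sliding_window_attention(frames, radius):
    """frames: list of frames, each a list of token vectors.

    Each frame attends to tokens from frames t-radius .. t+radius (hypothetical
    windowing rule), so the video can have any number of frames.
    """
    outputs = []
    for t, tokens in enumerate(frames):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        context = [tok for frame in frames[lo:hi] for tok in frame]
        outputs.append(attend(tokens, context))
    return outputs


frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
per_frame = [attend(f, f) for f in frames]
windowed = sliding_window_attention(frames, radius=1)
```

Because no weights are introduced, `sliding_window_attention(frames, 0)` returns the same values as running `attend` on each frame separately; widening the window only changes which tokens each frame can see.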

Bayesian Symbolic Regression via Posterior Sampling

Authors: Geoffrey F. Bomarito, Patrick E. Leser

First: 2025-12-11T17:38:20+00:00 · Latest: 2025-12-11T17:38:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Unlike traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method copes better with challenging, noisy benchmark datasets. The reduced tendency to overfit and the enhanced ability to discover accurate and interpretable equations pave the way for more robust symbolic regression in scientific discovery and engineering design applications.

中文标题/摘要

标题：基于后验采样的贝叶斯符号回归

符号回归是一种强大的工具，可以直接从数据中发现支配方程，但其对噪声的敏感性限制了其更广泛的应用。本文介绍了一种基于顺序蒙特卡洛（SMC）的贝叶斯符号回归框架，该框架通过近似符号表达式的后验分布来增强鲁棒性，并在存在噪声的情况下实现符号回归中的不确定性量化。与传统的遗传编程方法不同，基于SMC的算法结合了概率选择、自适应退火和归一化边际似然的使用，以高效地探索符号表达式的搜索空间，产生简洁的表达式并提高泛化能力。与标准的遗传编程基线相比，所提出的方法在处理具有挑战性和噪声的数据集方面表现更好。减少过度拟合的趋势和发现准确且可解释方程的能力为科学发现和工程设计应用中的更稳健的符号回归铺平了道路。

Summary / 总结

This paper addresses the challenge of noisy data in symbolic regression by proposing a Bayesian approach using Sequential Monte Carlo (SMC) for posterior sampling. The method enhances robustness and allows for uncertainty quantification. Unlike traditional genetic programming, it employs probabilistic selection, adaptive tempering, and normalized marginal likelihood to explore the symbolic expression space efficiently, leading to more parsimonious and generalizable models. Experiments show that the proposed method outperforms standard genetic programming on noisy benchmark datasets, reducing overfitting and improving the discovery of accurate and interpretable equations.

该论文通过提出使用Sequential Monte Carlo (SMC)进行贝叶斯符号回归，以应对噪声数据的挑战，该方法通过概率选择、自适应退火和归一化边际似然来近似符号表达式的后验分布，从而更高效地探索符号表达式的搜索空间，产生更稳健和可解释的方程。实验结果表明，所提出的基于SMC的方法在噪声基准数据集上优于标准遗传编程方法，展示了更好的泛化能力和较少的过拟合。
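
To make the combination of probabilistic selection and adaptive tempering more concrete, here is a minimal sketch of a tempered SMC sampler over a tiny, fixed set of candidate expressions. Everything specific is an assumption for illustration: the candidate list, the uniform prior, the Gaussian likelihood with known noise `sigma` (used in place of the paper's normalized marginal likelihood), and the rule of choosing each temperature so the effective sample size stays at half the particle count.

```python
import math
import random

# Hypothetical candidate expressions with no free constants.
CANDIDATES = {
    "x": lambda x: x,
    "2*x": lambda x: 2 * x,
    "x**2": lambda x: x * x,
    "x + 1": lambda x: x + 1,
    "sin(x)": lambda x: math.sin(x),
}


def log_likelihood(name, xs, ys, sigma):
    f = CANDIDATES[name]
    return sum(-0.5 * ((y - f(x)) / sigma) ** 2 for x, y in zip(xs, ys))


def ess(log_weights):
    """Effective sample size of a set of unnormalized log weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(v * v for v in w)


def next_temperature(phi, loglik, target_ess):
    """Bisection for the largest temperature whose incremental weights keep ESS >= target."""
    if ess([(1.0 - phi) * l for l in loglik]) >= target_ess:
        return 1.0
    lo, hi = phi, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if ess([(mid - phi) * l for l in loglik]) >= target_ess:
            lo = mid
        else:
            hi = mid
    return lo


def smc_posterior(xs, ys, sigma=0.5, n_particles=200, seed=0):
    rng = random.Random(seed)
    names = list(CANDIDATES)
    cache = {n: log_likelihood(n, xs, ys, sigma) for n in names}
    particles = [rng.choice(names) for _ in range(n_particles)]  # draws from the uniform prior
    phi = 0.0
    while phi < 1.0:
        loglik = [cache[p] for p in particles]
        new_phi = next_temperature(phi, loglik, 0.5 * n_particles)
        new_phi = min(1.0, max(new_phi, phi + 1e-4))  # guard so the loop always advances
        # Reweight by the likelihood raised to the temperature increment.
        logw = [(new_phi - phi) * l for l in loglik]
        m = max(logw)
        weights = [math.exp(lw - m) for lw in logw]
        # Probabilistic selection: multinomial resampling.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        phi = new_phi
        # Mutation: one Metropolis step targeting prior * likelihood**phi.
        for i, current in enumerate(particles):
            proposal = rng.choice(names)
            log_accept = phi * (cache[proposal] - cache[current])
            if rng.random() < math.exp(min(0.0, log_accept)):
                particles[i] = proposal
    return {n: particles.count(n) / n_particles for n in names}


noise = random.Random(1)
xs = [-2 + 0.2 * i for i in range(21)]
ys = [2 * x + noise.gauss(0, 0.3) for x in xs]
posterior = smc_posterior(xs, ys)
```

The returned dictionary holds particle frequencies, which serve as a rough posterior over expressions and therefore as a simple form of uncertainty quantification. A practical system would also propose new expression structures during mutation rather than choosing from a fixed list.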

From Generated Human Videos to Physically Plausible Robot Trajectories

Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig

First: 2025-12-04T18:56:03+00:00 · Latest: 2025-12-11T17:37:53+00:00

Comments: For project website, see https://genmimic.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

中文标题/摘要

标题：从生成的人类视频到物理上可信的机器人轨迹

视频生成模型在合成新颖情境下的人类动作方面的能力正在迅速提高，有可能作为上下文机器人控制的高层规划者。为了实现这一潜力，一个关键的研究问题仍然悬而未决：如何以零样本的方式让类人机器人执行生成视频中的人类动作？这一挑战源于生成的视频通常噪声较大且存在形态失真，使得直接模仿比真实视频更困难。为了解决这一问题，我们引入了一个两阶段的流水线。首先，我们将视频像素提升到4D的人类表示，然后重新定位到类人形态。其次，我们提出了GenMimic——一种基于3D关键点的物理感知强化学习策略，并通过对称正则化和关键点加权跟踪奖励进行训练。因此，GenMimic可以从噪声较大的生成视频中模仿人类动作。我们构建了GenMimicBench，这是一个使用两种视频生成模型生成的合成人类动作数据集，涵盖了各种动作和情境，为评估零样本泛化能力和策略鲁棒性建立了基准。广泛的实验表明，与强大的基线相比，在模拟中有所改进，并且在无需微调的情况下，Unitree G1类人机器人上实现了连贯且物理稳定的运动跟踪。这项工作为实现视频生成模型作为机器人控制高层策略的潜力提供了一条有希望的道路。

Summary / 总结

The research aims to enable humanoid robots to execute human actions from generated videos in a zero-shot manner, addressing the noise and morphological distortions typical of generated videos. A two-stage pipeline first lifts video pixels into a 4D human representation and then retargets it to the humanoid morphology. The GenMimic policy, conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards, then mimics the resulting motions. Experiments show that GenMimic improves over strong baselines in simulation and achieves coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning.

研究旨在使机器人能够从生成的视频中零样本执行人类动作，解决生成视频中的噪声和形态失真问题。提出了一种两阶段管道，首先将视频像素提升到4D人体表示，然后重新定位到人形形态。GenMimic策略基于3D关键点训练，并使用对称正则化和关键点加权跟踪奖励，成功模仿人类动作。实验表明，GenMimic在模拟中优于强基线，并在无需微调的情况下实现了Unitree G1人形机器人上的连贯且物理稳定的运动跟踪。
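
The abstract names two training ingredients, keypoint-weighted tracking rewards and symmetry regularization, without giving formulas. The sketch below shows one common shape such terms can take. The exponential reward, the `sigma` scale, the per-keypoint weights, and the mirror functions are all assumptions for illustration rather than the paper's definitions.

```python
import math


def tracking_reward(keypoints, reference, weights, sigma=0.25):
    """Weighted mean squared 3D keypoint error mapped to (0, 1] by an exponential."""
    total_w = sum(weights)
    err = sum(
        w * sum((a - b) ** 2 for a, b in zip(p, q))
        for w, p, q in zip(weights, keypoints, reference)
    ) / total_w
    return math.exp(-err / sigma ** 2)


def symmetry_penalty(policy, obs, mirror_obs, mirror_action):
    """Squared gap between acting on a mirrored observation and mirroring the action."""
    from_mirrored = policy(mirror_obs(obs))
    mirrored = mirror_action(policy(obs))
    return sum((a - b) ** 2 for a, b in zip(from_mirrored, mirrored))


# Two keypoints; the first is weighted three times as heavily as the second.
reference = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
weights = [3.0, 1.0]
off_heavy = [[0.1, 0.0, 0.0], [1.0, 0.0, 0.0]]
off_light = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]]


# Toy left/right mirroring on a two-element observation and action.
def swap(v):
    return [v[1], v[0]]


def symmetric_policy(o):
    return list(o)


def lopsided_policy(o):
    return [o[0], 0.0]
```

With these definitions, the same positional error costs more on a heavily weighted keypoint, and a policy that treats left and right differently incurs a positive symmetry penalty.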

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

First: 2025-12-11T17:29:25+00:00 · Latest: 2025-12-11T17:29:25+00:00

Comments: Project page: https://windvchen.github.io/PoseGAM/

Abs · PDF · Code1 · Code2 · Project1

Abstract

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/

中文标题/摘要

标题：PoseGAM：通过几何意识多视图推理实现鲁棒的未见物体姿态估计

6D物体姿态估计，即预测物体相对于相机的变换，对于未见物体来说仍然具有挑战性。现有方法通常依赖于在查询图像与物体模型或模板图像之间显式构建特征对应关系。在本工作中，我们提出PoseGAM，这是一种几何意识的多视图框架，可以直接从查询图像和多个模板图像中预测物体姿态，从而消除了显式匹配的需要。该方法基于最近的多视图基础模型架构，通过两种互补机制整合物体几何信息：显式的点基几何和几何表示网络学习到的特征。此外，我们构建了一个包含超过19万个物体的大规模合成数据集，这些物体在不同的环境条件下，以增强鲁棒性和泛化能力。在多个基准上的广泛评估表明，我们的方法具有最先进的性能，相对于先前方法的平均AR改进为5.1%，在个别数据集上达到17.6%的增益，表明其在未见物体上的强大泛化能力。

Summary / 总结

PoseGAM is a geometry-aware multi-view framework for robust 6D object pose estimation, which directly predicts object pose from a query image and multiple template images without explicit feature matching. It uses a large-scale synthetic dataset with over 190k objects under various conditions to enhance robustness. The method improves average AR by 5.1% over previous methods and achieves up to 17.6% gains on individual datasets, showing strong generalization to unseen objects.

PoseGAM 是一个几何感知的多视图框架，用于鲁棒的 6D 物体姿态估计，特别是对于未见过的物体。它直接从查询图像和多个模板图像中预测物体姿态，避免了显式的特征匹配。通过结合基于点的几何信息和几何表示网络学习的特征，PoseGAM 达到了最先进的性能，平均 AR 提高了 5.1%，在某些数据集上甚至提高了 17.6%。
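
For readers less familiar with the term, a 6D pose is a rigid transform with three rotational and three translational degrees of freedom that maps object-frame points into the camera frame. The short sketch below only illustrates that definition; it says nothing about how PoseGAM predicts the pose, and the helper names are hypothetical.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_pose(rotation, translation, points):
    """Map object-frame points into the camera frame: p_cam = R @ p_obj + t."""
    return [
        [sum(rotation[r][c] * p[c] for c in range(3)) + translation[r] for r in range(3)]
        for p in points
    ]


# Rotate a point on the object's x axis by 90 degrees and move it 2 units along the camera's z axis.
camera_points = apply_pose(rotation_z(math.pi / 2), [0.0, 0.0, 2.0], [[1.0, 0.0, 0.0]])
```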

T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders

Authors: Alexey Yermakov, David Zoro, Mars Liyao Gao, J. Nathan Kutz

First: 2025-06-18T21:14:38+00:00 · Latest: 2025-12-11T17:28:30+00:00

Comments: 17 pages, 5 figures, submitted to Transactions of the Royal Society (Symbolic Regression in the Physical Sciences)

Abs · PDF · Code1 · Code2

Abstract

SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are lightweight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) and a simple Multi-Layer Perceptron (MLP) for the temporal encoding and spatial decoding, respectively. Despite their relatively simple structure, SHRED models are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we modify SHRED by leveraging transformers (T-SHRED) embedded with symbolic regression for the temporal encoding, circumventing auto-regressive long-term forecasting for physical data. This is achieved by incorporating a new sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED to impose sparsity regularization on the latent space, which also allows for immediate symbolic interpretation. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes.

中文标题/摘要

标题：T-SHRED：基于变换器浅层递归解码器的符号回归正则化与模型发现

浅层递归解码器（SHallow REcurrent Decoders，SHRED）对于从稀疏传感器测量中进行系统识别和预测非常有效。这些模型轻量级且计算效率高，允许它们在消费级笔记本电脑上进行训练。基于 SHRED 的模型分别依赖递归神经网络（RNN）和简单的多层感知机（MLP）进行时间编码和空间解码。尽管 SHRED 的结构相对简单，但它们能够直接从稀疏的传感器测量中预测不同物理、空间和时间尺度上的混沌动力系统。在本文中，我们通过利用嵌入符号回归的变换器（T-SHRED）进行时间编码来修改 SHRED，从而绕过针对物理数据的自回归长期预测。这是通过将新的稀疏非线性动力学识别（SINDy）注意力机制引入 T-SHRED 来实现的，以在潜在空间上施加稀疏正则化，这也允许即时符号解释。符号回归通过在训练过程中学习和正则化潜在空间的动力学来提高模型的可解释性。我们分析了 T-SHRED 在三个不同动力系统上的性能，从低数据到高数据范围。

Summary / 总结

T-SHRED modifies SHRED by using transformers embedded with symbolic regression for temporal encoding, aiming to improve interpretability when forecasting from sparse sensor data. It does so through a SINDy attention mechanism that imposes sparsity regularization on the latent space and allows immediate symbolic interpretation. The authors analyze its performance on three dynamical systems spanning low-data to high-data regimes.

T-SHRED 是 SHRED 的改进版本，使用嵌入符号回归的变换器进行时间编码，以提高模型在稀疏传感器数据上的可解释性。关键方法是集成 SINDy 注意力机制，对潜在空间施加稀疏正则化，从而实现即时的符号解释。论文在三个不同的动力系统上（从低数据到高数据范围）分析了 T-SHRED 的性能。
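
SINDy models dynamics as a sparse combination of candidate functions, so that only a few library terms keep nonzero coefficients. The sketch below is not the paper's SINDy attention mechanism. It uses sequentially thresholded least squares, the solver commonly associated with SINDy, on a one-dimensional latent trajectory. The library `[1, z, z**2]`, the threshold, and all function names are assumptions for illustration.

```python
def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(theta, dz, active):
    """Least squares restricted to the active library columns, via the normal equations."""
    gram = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
    rhs = [sum(row[i] * d for row, d in zip(theta, dz)) for i in active]
    return solve(gram, rhs)


def sindy_stlsq(z, dz, threshold=0.1, iterations=5):
    """Fit dz/dt = xi0 + xi1*z + xi2*z**2, zeroing coefficients below the threshold."""
    theta = [[1.0, v, v * v] for v in z]  # candidate library evaluated on the data
    active = [0, 1, 2]
    coeffs = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        if not active:
            break
        fitted = least_squares(theta, dz, active)
        coeffs = [0.0, 0.0, 0.0]
        for idx, value in zip(active, fitted):
            coeffs[idx] = value
        # Drop small terms, then refit on the survivors in the next pass.
        active = [i for i in active if abs(coeffs[i]) >= threshold]
        coeffs = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    return coeffs


# Synthetic latent trajectory obeying dz/dt = -2 z.
z = [0.1 * i for i in range(1, 21)]
dz = [-2.0 * v for v in z]
coeffs = sindy_stlsq(z, dz)
```

The surviving coefficients can be read off directly as an equation, which is the sense in which a sparse latent model admits immediate symbolic interpretation.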

Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

Authors: Atahan Cilan, Atay Özgövde

First: 2025-12-11T17:26:24+00:00 · Latest: 2025-12-11T17:26:24+00:00

Comments: Submitted to IEEE Transactions on Games

Abs · PDF · Code1 · Code2

Abstract

This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.

中文标题/摘要

标题：在多智能体环境中学习可控且多样的玩家行为

本文介绍了一种强化学习框架，能够在无需依赖人类游戏数据的情况下实现可控且多样的玩家行为。现有方法通常需要大规模的玩家轨迹数据、为不同玩家类型训练独立模型，或者无法直接将可解释的行为参数与学习策略映射起来，限制了其可扩展性和可控性。我们在N维连续空间中定义玩家行为，并从一个涵盖真实人类风格子集的区域中均匀采样目标行为向量。在训练过程中，每个智能体接收当前和目标行为向量作为输入，奖励基于它们之间距离的归一化减少量。这使得策略能够学习行动如何影响行为统计，从而实现对诸如攻击性、移动性和合作性等属性的平滑控制。基于PPO的单个多智能体策略可以在无需重新训练的情况下重现新的或未见过的游戏风格。在自定义多人Unity游戏中进行的实验表明，所提出的框架产生的行为多样性显著高于仅以胜利为目标的基线，并且能够在多种目标上可靠地匹配指定的行为向量。该方法为自动化游戏测试、游戏平衡、人类行为模拟以及在线游戏中替代断线玩家提供了可扩展的解决方案。
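
The reward described in the abstract, a normalized reduction in distance between the current and target behavior vectors, can be sketched as follows. The abstract does not state the normalizer or the distance metric, so the Euclidean distance, the choice of the sampling region's diagonal as `scale`, and all function names here are assumptions for illustration.

```python
import math
import random


def sample_target(rng, low, high):
    """Uniformly sample a target behavior vector from an axis-aligned region."""
    return [rng.uniform(a, b) for a, b in zip(low, high)]


def behavior_reward(previous, current, target, scale):
    """Reduction in distance to the target behavior, divided by a normalizing scale."""
    d_prev = math.dist(previous, target)
    d_curr = math.dist(current, target)
    return (d_prev - d_curr) / scale


def policy_input(observation, current, target):
    """The agent sees its game observation plus both behavior vectors."""
    return list(observation) + list(current) + list(target)


# Hypothetical 2D behavior space, e.g. (aggressiveness, mobility).
low, high = [0.0, 0.0], [3.0, 4.0]
scale = math.dist(low, high)  # diagonal of the sampling region
target = sample_target(random.Random(0), low, high)
closer = behavior_reward([0.0, 0.0], [1.5, 2.0], [3.0, 4.0], scale)
further = behavior_reward([1.5, 2.0], [0.0, 0.0], [3.0, 4.0], scale)
```

Under this choice of `scale`, moving from one corner of the region all the way to a target at the opposite corner yields a total reward of 1, and any step away from the target is penalized symmetrically.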