arXiv Paper Digest

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

First: 2025-12-11T18:59:58+00:00 · Latest: 2025-12-11T18:59:58+00:00

Comments: Preprint; 80 pages, 37 figures, 29 tables; Project Page at https://worldbench.github.io/worldlens

Abstract

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

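Chinese Title/Abstract (English translation)

Title: WorldLens: A Comprehensive Evaluation of Driving World Models in the Real World

Generative world models are reshaping embodied AI, allowing agents to synthesize realistic 4D driving environments that look convincing yet often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physical laws, or support reliable control. We introduce WorldLens, a comprehensive benchmark that evaluates how a model builds, understands, and behaves within its generated world. It covers five aspects (generation, reconstruction, action-following, downstream tasks, and human preference) that together span visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels on all fronts: models with strong textures often violate physics, while geometrically stable models lack behavioral fidelity. To align objective metrics with human judgment, we further build the WorldLens-26K dataset, a large collection of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations that enables scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity, judging future models not only by how real they look but also by how real they behave.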

Summary

WorldLens evaluates generative world models in driving environments by assessing five aspects: generation, reconstruction, action-following, downstream tasks, and human preference. It reveals that no existing model excels in all areas, with some violating physics while others lack behavioral fidelity. To align with human judgment, the authors created WorldLens-26K, a dataset with human-annotated videos, and developed WorldLens-Agent for scalable, explainable scoring, forming a unified ecosystem for measuring world fidelity in embodied AI.

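WorldLens is a comprehensive benchmark for evaluating the realism of generative world models in driving scenarios. It assesses models along five dimensions: generation, reconstruction, action-following, downstream tasks, and human preference. The benchmark shows that no current model performs well on every aspect; some violate physical laws, while others lack behavioral fidelity. To align with human judgment, the authors built WorldLens-26K, a dataset with a large number of human-annotated videos, and developed the WorldLens-Agent evaluation model for scalable, explainable scoring. The ecosystem standardizes world-model evaluation by considering visual realism and functional reliability together.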

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project page: https://idea-research.github.io/SceneMaker/

Abstract

We propose a decoupled 3D scene generation framework called SceneMaker. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and in open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.

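Chinese Title/Abstract (English translation)

Title: SceneMaker: Decoupled 3D Scene Generation with De-occlusion and Pose Estimation

This paper proposes SceneMaker, a decoupled 3D scene generation framework. Because sufficient open-set de-occlusion and pose estimation priors are lacking, existing methods struggle to produce high-quality geometry and accurate poses at the same time under severe occlusion and in open-set settings. To address these problems, we first decouple the de-occlusion model from 3D object generation and strengthen it with image datasets and collected de-occlusion datasets so that it handles more diverse open-set occlusion patterns. We then propose a unified pose estimation model that combines global and local mechanisms to improve accuracy. In addition, we build an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments show the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.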

Summary

SceneMaker is a decoupled 3D scene generation framework that addresses the limitations of existing methods in handling severe occlusions and open-set scenarios. It decouples the de-occlusion model from 3D object generation and enhances it with diverse open-set occlusion patterns from image and de-occlusion datasets. Additionally, it introduces a unified pose estimation model that combines global and local mechanisms for improved accuracy. Experiments show that SceneMaker outperforms existing methods on both indoor and open-set scenes.

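SceneMaker is a decoupled 3D scene generation framework that aims to address the limitations of existing methods under severe occlusion and in open-set scenes. It decouples the de-occlusion model from 3D object generation and strengthens it with diverse open-set occlusion patterns. It also introduces a unified pose estimation model that combines global and local mechanisms to improve accuracy. Experiments show that SceneMaker outperforms existing methods on both indoor and open-set scenes.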

Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng

Venue: www

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project Page: https://www.cs.virginia.edu/~tsx4zn/stereowalk/

Abstract

The success of foundation models in language and vision has motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.

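Chinese Title/Abstract (English translation)

Title: Empowering Dynamic Urban Navigation with Stereo Vision and Mid-Level Vision

The success of foundation models in language and vision has motivated research on end-to-end robot navigation foundation models (NFMs). NFMs map monocular visual input directly to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. Although the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision, which is hard to obtain. The challenge is especially pronounced in dynamic and unstructured environments, where robust navigation requires precise geometric and dynamic understanding and the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision while ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is simple: stereo vision resolves the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset, with automatic action annotation from Internet stereo videos, to support the training of StereoWalker and to facilitate future research. In our experiments, mid-level vision enables StereoWalker to match the state of the art with only 1.5% of the training data and to surpass it with the full data. We also observe that stereo vision yields higher navigation performance than monocular input.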

Summary

This paper addresses the challenge of robot navigation in dynamic urban environments by integrating stereo vision and mid-level vision modules into navigation foundation models (NFMs). The authors introduce StereoWalker, which uses stereo inputs and explicit depth estimation and dense pixel tracking to resolve depth ambiguity and provide reliable geometric and motion structure. Experiments show that StereoWalker achieves comparable performance to state-of-the-art models using only 1.5% of the training data and surpasses them with full data, demonstrating the efficiency and effectiveness of incorporating stereo vision and mid-level vision priors.

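This paper addresses the challenge of robot navigation in dynamic urban environments by proposing StereoWalker, which combines stereo vision with mid-level vision modules to improve geometric and motion understanding. The authors find that stereo vision and mid-level vision, such as depth estimation and dense pixel tracking, improve performance: the model matches state-of-the-art models with only 1.5% of the training data and surpasses them with the full data. The dataset built for this purpose supports training and future research on stereo navigation.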
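
The claim that stereo inputs resolve the depth-scale ambiguity rests on standard rectified-stereo geometry, where metric depth is Z = f * B / d (focal length in pixels, baseline in meters, disparity in pixels). The snippet below is a minimal sketch of that relation only; the function name and all numeric values are illustrative assumptions and are not taken from the StereoWalker paper.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Illustrative camera parameters (assumed, not from the paper).
FOCAL_PX = 700.0
BASELINE_M = 0.12

# A larger disparity means the point is closer to the cameras.
near = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity_px=42.0)
far = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity_px=8.4)
print(f"near: {near:.2f} m, far: {far:.2f} m")
```

Because the baseline is a known physical length, the result is in meters. A single monocular view has no such reference, which is why its depth is only defined up to scale.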

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00

Comments: Project page: https://snap-research.github.io/omni-attribute

Abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

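Chinese Title/Abstract (English translation)

Title: Omni-Attribute: An Open-Vocabulary Attribute Encoder for Visual Concept Personalization

Visual concept personalization aims to transfer specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it hard to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder for learning high-fidelity, attribute-specific representations. Our approach designs the data and the model jointly: (i) we collect semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve and what to suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance on multiple benchmarks.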

Summary

Omni-Attribute is an open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations for visual concept personalization. It addresses the limitation of existing methods by curating semantically linked image pairs and adopting a dual-objective training paradigm. The method achieves state-of-the-art performance in open-vocabulary attribute retrieval, personalization, and compositional generation across multiple benchmarks.

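Omni-Attribute is an open-vocabulary image attribute encoder that aims to isolate specific visual attributes such as identity, expression, lighting, and style. It learns high-fidelity, attribute-specific representations using semantically linked image pairs and a dual-objective training method. The encoder outperforms prior methods on open-vocabulary attribute retrieval, personalization, and compositional generation, and reaches state-of-the-art performance on multiple benchmarks.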

Hierarchical Dataset Selection for High-Quality Data Sharing

Authors: Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou

First: 2025-12-11T18:59:55+00:00 · Latest: 2025-12-11T18:59:55+00:00

Abstract

The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.

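Chinese Title/Abstract (English translation)

Title: Hierarchical Dataset Selection for High-Quality Data Sharing

The success of modern machine learning depends on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing data across institutions, data is naturally organized into discrete datasets that differ in relevance, quality, and utility. Deciding which repositories or institutions to search for useful datasets, and which datasets to include in model training, is therefore critical, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the dataset selection task: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a method that models utility at both the dataset level and the group level (e.g., collections, institutions), enabling efficient generalization from limited observations. On two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy while requiring significantly fewer exploration steps. Ablations show that DaSH is robust to low-resource settings and to a lack of relevant datasets, making it suitable for scalable, adaptive dataset selection in practical multi-source learning workflows.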

Summary

This research addresses the challenge of selecting high-quality datasets for machine learning by proposing DaSH, a method that considers both dataset and group levels. DaSH outperforms existing methods by up to 26.2% in accuracy across two benchmarks, requiring fewer exploration steps. It is robust in low-resource settings and suitable for practical multi-source learning workflows.

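This study addresses the problem of selecting high-quality datasets by proposing DaSH, a method that considers both the dataset level and the group level. It improves accuracy by up to 26.2% on two benchmarks while requiring fewer exploration steps. DaSH is robust in resource-limited settings and suits practical multi-source learning workflows.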
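
The abstract does not spell out how DaSH models utility at the dataset and group levels. As one possible illustration of the general idea, the sketch below uses a two-level Beta-Bernoulli Thompson sampler: it first samples a promising group, then a dataset inside that group, and updates both posteriors with the observed outcome. This is a swapped-in, textbook bandit technique and not the authors' algorithm; the pool, the utility values, and every name here are hypothetical.

```python
import random


class BetaArm:
    """Beta posterior over the probability that a pick turns out useful."""

    def __init__(self) -> None:
        self.successes = 1.0  # Beta prior alpha (uniform prior)
        self.failures = 1.0   # Beta prior beta

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.successes, self.failures)

    def update(self, useful: bool) -> None:
        if useful:
            self.successes += 1.0
        else:
            self.failures += 1.0


def select_datasets(groups, is_useful, steps, seed=0):
    """Two-level Thompson sampling: pick a group, then a dataset inside it."""
    rng = random.Random(seed)
    group_arms = {g: BetaArm() for g in groups}
    dataset_arms = {d: BetaArm() for members in groups.values() for d in members}
    picks = []
    for _ in range(steps):
        group = max(groups, key=lambda g: group_arms[g].sample(rng))
        dataset = max(groups[group], key=lambda d: dataset_arms[d].sample(rng))
        useful = is_useful(dataset, rng)
        # One observation informs both levels of the hierarchy.
        group_arms[group].update(useful)
        dataset_arms[dataset].update(useful)
        picks.append(dataset)
    return picks, group_arms


# Hypothetical pool: two repositories with two datasets each.
pool = {"repo_a": ["a1", "a2"], "repo_b": ["b1", "b2"]}
true_utility = {"a1": 0.9, "a2": 0.8, "b1": 0.1, "b2": 0.2}
picks, group_arms = select_datasets(
    pool, lambda d, rng: rng.random() < true_utility[d], steps=200
)
print({d: picks.count(d) for d in true_utility})
```

Sharing evidence at the group level is what lets a selector of this kind deprioritize a whole repository after only a few poor datasets, which matches the abstract's point about generalizing from limited observations.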

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Authors: Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang

First: 2025-12-11T18:59:53+00:00 · Latest: 2025-12-11T18:59:53+00:00

Comments: Project website: https://qitaozhao.github.io/E-RayZer

Abstract

Self-supervised pre-training has revolutionized foundation models for language, individual 2D images, and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

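Chinese Title/Abstract (English translation)

Title: E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Self-supervised pre-training has revolutionized foundation models for language, individual 2D images, and videos, but it remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer, which infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space and performs self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields geometrically grounded representations. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in a fully unsupervised manner. Experiments show that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Moreover, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferred to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.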

Summary

E-RayZer is a self-supervised 3D reconstruction model that learns 3D-aware representations directly from unlabeled images. Unlike previous methods that infer 3D indirectly, E-RayZer operates in 3D space, performing self-supervised 3D reconstruction with explicit geometry. The model uses a fine-grained learning curriculum to ensure convergence and scalability. Experiments show that E-RayZer outperforms RayZer on pose estimation and matches or surpasses fully supervised models on reconstruction tasks. Its representations also outperform other visual pre-training models when transferred to 3D tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

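E-RayZer is a self-supervised 3D vision model that learns 3D-aware representations directly from unlabeled images. It performs self-supervised 3D reconstruction with explicit geometry, which avoids shortcut solutions and yields geometrically grounded representations. The model uses a fine-grained learning curriculum to ensure convergence and scalability. Experiments show that E-RayZer outperforms RayZer on pose estimation and matches or surpasses fully supervised models on reconstruction tasks. Its representations also outperform other visual pre-training models when transferred to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.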

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

First: 2025-12-11T18:59:52+00:00 · Latest: 2025-12-11T18:59:52+00:00

Comments: Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1

Abstract

Reinforcement learning (RL), previously shown to be effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, adept at every stage from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.

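Chinese Title/Abstract (English translation)

Title: Are We Ready for Reinforcement Learning in Text-to-3D Generation? A Progressive Investigation

Reinforcement learning (RL) has proven effective in large language and multi-modal models and has been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored because 3D objects have higher spatial complexity and require globally consistent geometry and fine-grained local textures. This makes 3D generation very sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation along several dimensions. (1) Reward designs: we evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: we study GRPO variants, highlight the effectiveness of token-level optimization, and further examine the scaling of training data and iterations. (3) Text-to-3D benchmarks: because existing benchmarks cannot measure the implicit reasoning abilities of 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, covering coarse shape through texture refinement. We hope this study offers insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.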
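
The abstract builds on GRPO and reward ensembles without giving formulas. As background, the core of GRPO as commonly described is a group-relative advantage: several outputs are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The sketch below shows that step together with a simple weighted reward ensemble. The criterion names, weights, and scores are hypothetical, the use of the population standard deviation is an assumption, and none of this reproduces Hi-GRPO itself.

```python
from statistics import mean, pstdev


def ensemble_reward(scores, weights):
    """Weighted sum of per-criterion reward scores (weights are hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate generations for one text prompt, scored on two assumed criteria.
weights = {"human_preference": 0.7, "geometry": 0.3}
group_scores = [
    {"human_preference": 0.2, "geometry": 0.5},
    {"human_preference": 0.9, "geometry": 0.6},
    {"human_preference": 0.5, "geometry": 0.1},
    {"human_preference": 0.7, "geometry": 0.8},
]
rewards = [ensemble_reward(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Because advantages are computed relative to the group, no separate value network is needed; candidates above the group mean are reinforced and those below are discouraged.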

ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Authors: Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao

First: 2025-12-11T18:59:47+00:00 · Latest: 2025-12-11T18:59:47+00:00

Abstract

All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.

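Chinese Title/Abstract (English translation)

Title: ClusIR: Towards Cluster-Guided All-in-One Image Restoration

All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a single unified framework. However, existing methods often fail to model degradation types explicitly and struggle to adapt to complex or mixed degradations. To address these problems, we propose ClusIR, a cluster-guided image restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across the spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). PCGRM separates degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM uses the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, jointly refining structural and textural representations for higher restoration fidelity. This cluster-guided synergy links semantic cues with frequency-domain modulation, allowing ClusIR to achieve strong restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks show that ClusIR reaches competitive performance in several scenarios.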

Summary

ClusIR is a Cluster-Guided Image Restoration framework designed to address the limitations of existing All-in-One Image Restoration methods by explicitly modeling degradation types and adapting to complex degradations. It uses a Probabilistic Cluster-Guided Routing Mechanism and a Degradation-Aware Frequency Modulation Module to disentangle degradation recognition from expert activation and perform adaptive frequency decomposition, respectively. ClusIR demonstrates remarkable restoration results across various degradations, achieving competitive performance in extensive experiments.

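ClusIR is a cluster-guided image restoration framework that aims to address the limitations of existing All-in-One image restoration methods by explicitly modeling degradation types and adapting to complex degradations. It uses a probabilistic cluster-guided routing mechanism and a degradation-aware frequency modulation module to separate out degradation recognition, achieve discriminative perception, and perform adaptive frequency-domain modulation. Experiments show that ClusIR achieves competitive restoration results across a variety of degradation scenarios.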

Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Authors: Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00

Comments: Project Page: https://jiawei-yang.github.io/Flex/

Abstract

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy, or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM)-based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.

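Chinese Title/Abstract (English translation)

Title: Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

We present Flex, an efficient scene encoder that addresses the computational bottleneck of processing large volumes of multi-camera data in end-to-end autonomous driving. Flex uses a set of learnable scene tokens to jointly encode information from all image tokens across cameras and timesteps. By design, our approach is geometry-agnostic: it learns a compact scene representation directly from data without relying on explicit 3D inductive biases such as bird's-eye view (BEV), occupancy, or tri-plane representations, which are common in prior work. This holistic encoding strategy substantially compresses the visual input for the downstream Large Language Model (LLM)-based policy model. Evaluated on a large proprietary dataset of 20,000 driving hours, Flex achieves 2.2x higher inference throughput and outperforms state-of-the-art methods in driving performance by a large margin. We further show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary and show that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.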

Summary

Flex is a scene encoder designed to efficiently process high-volume multi-camera data in end-to-end autonomous driving. It uses a small set of learnable scene tokens to encode information from all cameras and timesteps without relying on 3D inductive biases. Evaluated on a large dataset, Flex improves driving performance and achieves 2.2x greater inference throughput compared to state-of-the-art methods, suggesting that data-driven encoding can be more scalable and effective for autonomous driving systems.

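Flex is a scene encoder for efficiently processing high-volume multi-camera data in end-to-end autonomous driving, designed to address the computational bottleneck. It uses a small number of learnable scene tokens to jointly encode information from all cameras and timesteps without relying on explicit 3D priors. Evaluation on a large-scale dataset shows that Flex improves driving performance and achieves 2.2x higher inference throughput than state-of-the-art methods, suggesting that a data-driven joint encoding strategy may be more efficient and effective for future autonomous driving systems.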
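
The abstract says a small set of learnable scene tokens jointly encodes all image tokens but does not describe the mechanism. One common way to realize such compression is cross-attention in which the scene tokens act as queries over the image tokens. The sketch below assumes that design purely for illustration: it is single-head, has no learned projections, and uses random values in place of learned parameters. The token counts and dimensions are made up and do not come from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress(scene_tokens, image_tokens):
    """Each scene token attends over every image token and pools their values."""
    dim = len(scene_tokens[0])
    compressed = []
    for query in scene_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_tokens
        ]
        weights = softmax(scores)
        pooled = [
            sum(w * value[j] for w, value in zip(weights, image_tokens))
            for j in range(dim)
        ]
        compressed.append(pooled)
    return compressed


rng = random.Random(0)
DIM = 4
# Assumed layout: 6 cameras x 2 timesteps x 16 patch tokens per image.
image_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6 * 2 * 16)]
# Random stand-ins for the scene tokens, which would be trained parameters.
scene_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]

compressed = compress(scene_tokens, image_tokens)
print(len(image_tokens), "image tokens ->", len(compressed), "scene tokens")
```

In this toy setup the downstream policy would receive 8 tokens instead of 192, regardless of how many cameras or timesteps feed the encoder, and no camera geometry enters the computation.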

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

Authors: Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu

First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00

Comments: Project page: https://implicit-rdp.github.io

Abstract

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.

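Chinese Title/Abstract (English translation)

Title: ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures fast, high-frequency local contact dynamics. Integrating these signals is challenging because of their fundamental differences in frequency and information. In this paper, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that combines visual planning and reactive force control in a single network. We introduce Structural Slow-Fast Learning, a mechanism that uses causal attention to process asynchronous visual and force tokens simultaneously, so that the policy can make closed-loop adjustments at the force frequency while preserving the temporal coherence of action chunks. In addition, to mitigate modality collapse, in which end-to-end models fail to adjust the weights across modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded training signal than raw force prediction. Extensive experiments show that ImplicitRDP significantly outperforms vision-only and hierarchical baselines on contact-rich tasks, achieving better reactivity and success rates with a streamlined training pipeline. Code and videos will be made public at https://implicit-rdp.github.io.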

Summary

The research aims to improve human-level contact-rich manipulation by integrating visual and force signals in a unified end-to-end policy. ImplicitRDP uses Structural Slow-Fast Learning to process asynchronous visual and force tokens, enabling closed-loop force adjustments while maintaining temporal coherence. The study shows that ImplicitRDP outperforms vision-only and hierarchical baselines, achieving better reactivity and success rates. Virtual-target-based Representation Regularization is introduced to prevent modality collapse and enhance learning. The policy demonstrates superior performance in contact-rich tasks with a streamlined training process.

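The research aims to improve contact-rich fine manipulation by integrating visual and force signals in a unified end-to-end policy. ImplicitRDP uses a Structural Slow-Fast Learning mechanism to process asynchronous visual and force signals, allowing force adjustments while keeping actions temporally consistent. The results show that ImplicitRDP outperforms vision-only and hierarchical baselines on contact-rich tasks, achieving better reactivity and success rates. Virtual-target-based Representation Regularization is also proposed to prevent modality collapse and strengthen learning. The policy delivers strong performance with a simplified training process.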
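
The abstract mentions causal attention over asynchronous visual and force tokens without detailing the mask. A simple way to picture it is a timestamp-based causal mask over one interleaved sequence, where a token may attend only to tokens observed at or before its own time. The sketch below is an assumed illustration of that idea and not the paper's Structural Slow-Fast Learning design; the sampling rates and function name are hypothetical.

```python
def build_causal_mask(timestamps):
    """mask[i][j] is True when token i may attend to token j (t_j <= t_i)."""
    return [[t_j <= t_i for t_j in timestamps] for t_i in timestamps]


# Hypothetical rates over a 0.1 s window: vision at 10 Hz, force at 100 Hz.
# Integer milliseconds avoid floating-point drift in the comparisons.
vision_ms = [0, 100]
force_ms = list(range(0, 101, 10))

tokens = sorted(
    [("vision", t) for t in vision_ms] + [("force", t) for t in force_ms],
    key=lambda token: token[1],
)
mask = build_causal_mask([t for _, t in tokens])

for (kind, t), row in zip(tokens, mask):
    print(f"{kind:>6} @ {t:3d} ms: " + "".join("x" if allowed else "." for allowed in row))
```

In the printed grid, every force token between two camera frames can see the most recent visual token plus all earlier force readings, which is the kind of structure that would let a policy react at the force rate while the visual context updates slowly.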

MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

Venue: in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11400-11416, 2025

First: 2025-12-11T18:59:44+00:00 · Latest: 2025-12-11T18:59:44+00:00

Comments: IEEE TPAMI, Project Page: https://henghuiding.com/MeViS/

Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach, LMPM++, for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/.

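Chinese Title/Abstract (English translation)

Title: MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, which focuses on segmenting and tracking target objects in videos based on language descriptions of their motion. Existing referring video segmentation datasets usually focus on salient objects and use language expressions rich in static attributes, which may allow the target object to be identified from a single frame. Such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenes. We benchmark 15 existing methods on the 4 tasks supported by MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results reveal weaknesses and limitations of existing methods in motion expression-guided video understanding. We further analyze the challenges and propose LMPM++, an approach that achieves new state-of-the-art results on RVOS/AVOS/RMOT. Our dataset provides a platform for developing motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/.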

Summary

The paper introduces MeViS, a large multi-modal dataset for segmenting and tracking target objects in videos based on their motion descriptions. It contains 33,072 human-annotated motion expressions in text and audio, covering 8,171 objects in 2,006 complex-scenario videos. Benchmarking 15 existing methods across 4 tasks, the results highlight the limitations of current approaches in handling motion expression-guided video understanding. The study proposes LMPM++ for RVOS/AVOS/RMOT, achieving new state-of-the-art results. The dataset supports the development of motion expression-guided video understanding algorithms in complex scenes.

该论文介绍了MeViS，一个用于基于目标运动描述进行视频中目标分割和跟踪的大规模多模态数据集，包含33,072个人标注的文本和音频运动表达，覆盖2,006个复杂场景视频中的8,171个对象。对15种现有方法进行4项任务的基准测试，结果表明当前方法在处理运动表达引导的视频理解方面存在局限性。研究提出了LMPM++方法用于RVOS/AVOS/RMOT，取得了新的最佳性能。该数据集支持复杂视频场景中运动表达引导的视频理解算法的发展。

AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov

First: 2025-12-11T18:59:34+00:00 · Latest: 2025-12-11T18:59:34+00:00

Comments: Project page: https://snap-research.github.io/Video-AlcheMinT

Abs · PDF · Code1 · Code2 · Project1 · Project2

Abstract

Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. The project page is at https://snap-research.github.io/Video-AlcheMinT

中文标题/摘要

标题：AlcheMinT：多参考一致视频生成的细粒度时间控制

大型扩散模型驱动的主题视频生成的最新进展使基于用户提供的主题的个性化内容合成成为可能。然而，现有方法缺乏对主题出现和消失的细粒度时间控制，这对于合成视频、故事板和可控动画等应用至关重要。我们提出了AlcheMinT，这是一种统一框架，引入了明确的时间戳条件，用于主题驱动的视频生成。我们的方法引入了一种新颖的位置编码机制，解锁了时间间隔的编码，这些时间间隔在我们的情况下与主题身份相关联，同时无缝地与预训练的视频生成模型位置嵌入集成。此外，我们还引入了主题描述性文本标记，以增强视觉身份与视频字幕之间的联系，减轻生成过程中的歧义。通过令牌级连接，AlcheMinT 避免了任何额外的交叉注意力模块，并且没有增加参数开销。我们建立了一个基准，评估多个主题身份保留、视频保真度和时间一致性。实验结果表明，AlcheMinT 达到了与最先进的视频个性化方法相当的视觉质量，同时首次实现了对视频中多主题生成的精确时间控制。项目页面为 https://snap-research.github.io/Video-AlcheMinT

Summary / 总结

AlcheMinT is a unified framework that introduces explicit timestamp conditioning for fine-grained temporal control in subject-driven video generation, targeting applications such as compositional video synthesis, storyboarding, and controllable animation. It combines a novel positional encoding mechanism for temporal intervals with subject-descriptive text tokens, and uses token-wise concatenation to avoid additional cross-attention modules, keeping parameter overhead negligible. On a benchmark covering multi-subject identity preservation, video fidelity, and temporal adherence, AlcheMinT matches the visual quality of state-of-the-art video personalization methods while, for the first time, enabling precise temporal control over multi-subject generation.

AlcheMinT 是一个统一框架，通过引入显式的时间戳来实现主体驱动视频生成中的细粒度时间控制，增强如合成视频合成和可控动画等应用。它使用新颖的位置编码机制和主体描述性文本标记来保持主体身份并提高视频保真度，实现与最先进的方法相当的视觉质量，同时首次能够在视频中对多主体生成进行精确的时间控制。

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

First: 2025-12-11T18:59:22+00:00 · Latest: 2025-12-11T18:59:22+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.

中文标题/摘要

标题：VL-JEPA：联合嵌入预测架构的跨模态模型

我们提出了VL-JEPA，一种基于联合嵌入预测架构(JEPA)的跨模态模型。与经典视觉语言模型(VLM)逐个生成标记不同，VL-JEPA预测目标文本的连续嵌入。通过在抽象表示空间中学习，该模型专注于任务相关的语义，同时抽象掉表面的语义变异性。在严格控制的比较中，与使用相同视觉编码器和训练数据的标准标记空间VLM训练相比，VL-JEPA在参数量减少50%的情况下实现了更强的性能。在推理时，仅在需要时调用轻量级文本解码器将VL-JEPA预测的嵌入转换为文本。我们展示了VL-JEPA原生支持选择性解码，将解码操作减少2.85倍，同时保持与非自适应均匀解码相似的性能。除了生成之外，VL-JEPA的嵌入空间自然支持开放词汇分类、文本到视频检索和区分型VQA，无需任何架构修改。在八个视频分类数据集和八个视频检索数据集上，VL-JEPA的平均性能超过了CLIP、SigLIP2和感知编码器。同时，该模型在四个VQA数据集(GQA、TallyQA、POPE和POPEv2)上实现了与经典VLMs(InstructBLIP、QwenVL)相当的性能，尽管只有1.6B参数。

Summary / 总结

VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts, focusing on task-relevant semantics. In a controlled comparison with the same vision encoder and training data, it outperforms standard token-space VLM training with 50% fewer trainable parameters. During inference, a lightweight text decoder is used only when needed, and selective decoding reduces decoding operations by 2.85x while maintaining similar performance. On average, VL-JEPA outperforms CLIP, SigLIP2, and Perception Encoder across eight video classification and eight video retrieval datasets, and with only 1.6B parameters it performs comparably to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets.

VL-JEPA 是一种使用联合嵌入预测架构的视觉-语言模型，它预测目标文本的连续嵌入而不是自回归生成标记。这种方法在较少参数的情况下实现了更强的性能，并支持选择性解码，将解码操作的数量减少了2.85倍。VL-JEPA 在多个视频分类和检索任务中表现优于多个模型，并且仅使用1.6B参数就能在四个VQA数据集上达到与经典视觉语言模型相当的性能。

EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Authors: Jiaqi Ma, Shengkai Hu, Xu Zhang, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan

First: 2025-12-04T18:59:10+00:00 · Latest: 2025-12-11T18:59:22+00:00

Abs · PDF · Code1 · Code2

Abstract

All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
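
The idea of splitting features into explicit low- and high-frequency branches and re-weighting them can be illustrated with a minimal sketch. This is not EvoIR's FMM: it assumes a 1D signal, an edge-clamped moving average as the low-pass filter, and the residual as the high-frequency branch. The names `split_frequencies` and `modulate` and the gain parameters are hypothetical, chosen only for illustration.

```python
def split_frequencies(signal, window=3):
    """Split a 1D signal into low- and high-frequency parts.

    Assumption: the low branch is an edge-clamped moving average and the
    high branch is the residual, so low + high reproduces the input.
    """
    n = len(signal)
    half = window // 2
    low = []
    for i in range(n):
        start = max(0, i - half)
        stop = min(n, i + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def modulate(low, high, low_gain=1.0, high_gain=1.0):
    """Recombine the two branches with per-branch gains (hypothetical knobs)."""
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]


signal = [1.0, 2.0, 4.0, 2.0, 1.0]
low, high = split_frequencies(signal)
restored = modulate(low, high)                   # unit gains: identity
sharpened = modulate(low, high, high_gain=1.5)   # emphasizes fine detail
```

With unit gains the recombination returns the input, so any change in the output comes only from how the two branches are weighted.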

中文标题/摘要

标题：EvoIR：通过进化频率调制实现一站式图像恢复

一站式图像恢复（AiOIR）任务通常涉及多种退化，需要稳健且通用的策略。然而，大多数现有方法通常缺乏显式的频率建模，并依赖于固定的或启发式的优化计划，这限制了其在异构退化中的泛化能力。为了解决这些限制，我们提出了一种AiOIR特定框架EvoIR，引入了进化频率调制以实现动态和自适应的图像恢复。具体而言，EvoIR 使用了频率调制模块（FMM），该模块以显式方式将特征分解为高频频带和低频频带，并自适应地调制它们以增强结构保真度和细粒度细节。EvoIR 的核心是一种进化优化策略（EOS），通过基于群体的进化过程迭代调整频率感知目标，动态平衡结构准确性和感知保真度。其进化的指导进一步缓解了退化之间的梯度冲突并加速了收敛。通过结合FMM和EOS，EvoIR 比单独使用任一组件都取得了更大的改进，突显了它们的互补作用。在多个基准上的广泛实验表明，EvoIR 在一站式图像恢复方法中优于最先进的方法。

Summary / 总结

EvoIR is an AiOIR framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. It uses the Frequency-Modulated Module (FMM) to decompose features into high- and low-frequency branches and adaptively modulate them. An Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives, balancing structural accuracy and perceptual fidelity. Experiments show that EvoIR outperforms state-of-the-art AiOIR methods on multiple benchmarks.

EvoIR 通过引入进化频率调制来处理多样化的图像退化问题。它使用频率调制模块 (FMM) 将特征分解为高频和低频分支，并适应性地调节它们以提高结构保真度和细粒度细节。进化优化策略 (EOS) 动态调整频率感知目标，平衡结构准确性和感知保真度。实验表明，EvoIR 在所有图像恢复任务中优于现有最先进的方法。

Mull-Tokens: Modality-Agnostic Latent Thinking

Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu

First: 2025-12-11T18:59:08+00:00 · Latest: 2025-12-11T18:59:08+00:00

Comments: Project webpage: https://arijitray.com/multimodal_thinking/

Abs · PDF · Code1 · Code2

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

中文标题/摘要

标题：Mull-Tokens：模态无关的潜在思维

推理超越语言；现实世界需要关于空间、时间、功能等方面的推理，而这些是单靠语言无法传达的。现有的多模态模型探索图像推理潜力时表现脆弱且不具扩展性。它们依赖于调用专业工具、昂贵的图像生成或手工制作的推理数据来在文本和图像思维之间切换。相反，我们提供了一个更简单的替代方案——Mull-Tokens——一种模态无关的潜在标记，预先训练以在图像或文本模态中保留中间信息，让模型自由地思考以得出正确答案。我们研究了受潜在推理框架启发的最佳实践来训练Mull-Tokens。我们首先使用交错的文本-图像痕迹进行监督训练，然后仅使用最终答案进行微调。在四个具有挑战性的空间推理基准测试中，涉及解谜和换位思考等任务，我们证明Mull-Tokens在利用仅文本推理或交错图像-文本推理的多个基线之上取得了平均3%的改进，最高达16%。在解谜推理密集的划分上，Mull-Tokens为关于文本和视觉推理的接地挑战提供了简单解决方案，以抽象方式在多种模态中思考。

Summary / 总结

Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities, enabling models to think free-form towards the correct answer. The tokens are first trained with supervision from interleaved text-image traces and then fine-tuned using only final answers. Across four spatial reasoning benchmarks, Mull-Tokens improve on text-only and interleaved image-text reasoning baselines, with a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split relative to the strongest baseline.

研究旨在开发一种跨模态的模态无关潜在令牌系统Mull-Tokens，使模型能够在不同模态之间进行推理，而无需依赖专门工具或昂贵的数据生成。通过使用文本-图像痕迹进行预训练，并仅使用最终答案进行微调，该模型可以灵活思考并在空间推理任务中表现出色。在四个基准测试中，Mull-Tokens在拼图解决等推理密集型任务上优于仅文本和文本-图像推理基线，最高提高了16%的性能。

OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Authors: Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna

First: 2025-12-11T18:59:05+00:00 · Latest: 2025-12-11T18:59:05+00:00

Comments: Project page: https://snap-research.github.io/OmniView/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

中文标题/摘要

标题：OmniView：一种全面视角的扩散模型，用于3D和4D视图合成

先前将相机控制注入扩散模型的方法主要集中在4D一致性任务的特定子集上：新颖视图合成、带有相机控制的文本到视频、图像到视频等。因此，这些分散的方法仅在可用的3D/4D数据的不同片段上进行训练。我们提出了OmniView，这是一种统一框架，能够泛化到广泛的4D一致性任务。我们的方法分别表示空间、时间和视图条件，允许这些输入的灵活组合。例如，OmniView可以从静态、动态和多视角输入中合成新颖视图，向前向后推断时间轨迹，并根据完整的相机控制从文本或图像提示生成视频。在各种基准和指标上，OmniView与特定任务模型竞争，提高相机条件下的扩散模型的图像质量分数，最高可达多视角NVS LLFF数据集中的33%，动态NVS神经3D视频基准中的60%，静态相机控制RE-10K中的20%，并在文本条件下的视频生成中将相机轨迹误差减少4倍。凭借一个模型的强大泛化能力，OmniView展示了通用4D视频模型的可行性。项目页面可在https://snap-research.github.io/OmniView/获取。

Summary / 总结

OmniView is a unified framework for 3D and 4D view synthesis that generalizes across various 4D consistency tasks. It separately represents space, time, and view conditions, allowing flexible input combinations. Experimental results show that OmniView is competitive with task-specific models on diverse benchmarks, improving image quality scores among camera-conditioned diffusion models by up to 33% on LLFF (multiview NVS), 60% on the Neural 3D Video benchmark (dynamic NVS), and 20% on RE-10K (static camera control), and reducing camera trajectory errors by 4x in text-conditioned video generation. This demonstrates the feasibility of a generalist 4D video model.

OmniView 是一个统一框架，用于 3D 和 4D 视图合成，通过分别表示空间、时间和视图条件来泛化各种 4D 一致性任务。它可以生成来自静态、动态和多视图输入的新视图，进行轨迹外推，并根据文本或图像提示生成具有全摄像机控制的视频。OmniView 在各种基准测试中优于特定任务模型，图像质量得分最多可提高 60%（动态 NVS 神经 3D 视频基准），并在文本条件视频生成中将摄像机轨迹误差减少 4 倍。

GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Authors: Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh

First: 2025-12-11T18:59:02+00:00 · Latest: 2025-12-11T18:59:02+00:00

Comments: IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abs · PDF · Code1 · Code2

Abstract

Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle in one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.

中文标题/摘要

标题：GaussianHeadTalk：基于音频驱动高斯点积的无晃动3D对话头

基于语音的对话头最近出现，能够实现交互式虚拟角色。然而，实际应用受到限制，因为当前方法在视觉保真度高但速度慢或快且时间不稳定之间存在权衡。扩散方法能够生成逼真图像，但在单次设置中表现不佳。高斯点积方法实时性好，但面部跟踪不准确或高斯映射不一致会导致输出不稳定和视频伪影，这些伪影对现实使用场景不利。我们通过将高斯点积映射到3D可变模型来解决这一问题，以生成个性化的虚拟角色。我们引入基于变换器的模型参数预测，直接从音频驱动时间一致性。从单目视频和独立的音频语音输入，我们的方法能够生成实时对话头视频，我们报告了具有竞争力的定量和定性性能。

Summary / 总结

The research aims to improve the stability and realism of 3D talking heads driven by audio. It employs a method combining 3D Morphable Models and transformer-based audio prediction to generate temporally consistent avatars. The key experimental findings show that the method achieves competitive performance in both quantitative and qualitative assessments, enabling real-time generation of stable and realistic talking head videos from monocular video and independent audio inputs.

研究旨在提高由音频驱动的3D谈话头的稳定性和逼真度。方法结合了3D形态可变模型和基于变换器的音频预测，以生成时间上一致的虚拟形象。实验结果表明，该方法在定量和定性评估中均表现出竞争力，能够从单目视频和独立音频输入中实时生成稳定且逼真的谈话头视频。

Stronger Normalization-Free Transformers

Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

First: 2025-12-11T18:58:49+00:00 · Latest: 2025-12-11T18:58:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
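
As a rough illustration of the formula in the abstract, the point-wise function can be written in a few lines of standard-library Python. This is a minimal sketch, not the authors' implementation: `alpha` and `s` are treated here as plain scalar arguments with arbitrary defaults (the abstract does not say how they are parameterized or initialized), and the helper `erf_via_gaussian_cdf` is a hypothetical name used only to check the statement that erf is a rescaled Gaussian CDF, via the identity erf(x) = 2Φ(x√2) − 1.

```python
import math
from statistics import NormalDist


def derf(x, alpha=1.0, s=0.0):
    """Point-wise Derf(x) = erf(alpha * x + s), as given in the abstract.

    Assumption: alpha and s are passed as plain scalars; their default
    values here are illustrative only.
    """
    return math.erf(alpha * x + s)


def erf_via_gaussian_cdf(x):
    """erf expressed through the standard normal CDF: 2 * Phi(x * sqrt(2)) - 1."""
    return 2.0 * NormalDist().cdf(x * math.sqrt(2.0)) - 1.0


# Like tanh, the output is bounded, so extreme activations are squashed.
values = [derf(x, alpha=0.5) for x in (-100.0, -1.0, 0.0, 1.0, 100.0)]
```

The output stays within [-1, 1] for any input, which is the property the abstract attributes to DyT-style point-wise functions (constraining extreme values).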

中文标题/摘要

标题：无需归一化更强的变压器

尽管归一化层长期以来被视为深度学习架构中不可或缺的组件，但最近引入的动态双曲正切（DyT）表明替代方案是可能的。点函数DyT通过限制极端值实现稳定的收敛，并达到与归一化相当的性能；这项工作寻求进一步寻找超越它的函数设计。我们首先研究点函数内在属性如何影响训练和性能。基于这些发现，我们进行大规模搜索以寻找更有效的函数设计。通过这一探索，我们引入了$\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$，其中$\mathrm{erf}(x)$是缩放后的高斯累积分布函数，并将其识别为最有效的设计。Derf在包括视觉（图像识别和生成）、语音表示和DNA序列建模在内的广泛领域中优于LayerNorm、RMSNorm和DyT。我们的研究结果表明，Derf的性能提升主要来自于其更好的泛化能力，而不是更强的拟合能力。其简洁性和更强的性能使Derf成为无需归一化的Transformer架构的实用选择。

Summary / 总结

This work explores alternative function designs to normalization layers in deep learning, motivated by the success of Dynamic Tanh (DyT) which achieves normalization-level performance without normalization. The authors conduct a large-scale search and introduce Derf, defined as $\mathrm{erf}(αx + s)$, which outperforms LayerNorm, RMSNorm, and DyT across various domains. The key finding is that Derf's performance improvement is due to better generalization rather than increased fitting capacity.

该研究探索了在深度学习架构中替代归一化层的函数设计，动机是Dynamic Tanh (DyT)在无需归一化的情况下实现了与归一化层相当的性能。通过大规模搜索，作者引入了Derf，定义为$\mathrm{erf}(αx + s)$，该设计在图像识别、语音表示和DNA序列建模等多个领域优于LayerNorm、RMSNorm和DyT等现有方法。Derf的性能提升主要归因于其更好的泛化能力，而非更强的拟合能力。

Any4D: Unified Feed-Forward Metric 4D Reconstruction

Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan

First: 2025-12-11T18:57:39+00:00 · Latest: 2025-12-11T18:57:39+00:00

Comments: Project Website: https://any-4d.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
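
The split between egocentric factors (depth, intrinsics) and allocentric factors (extrinsics, scene flow) can be made concrete with a small geometric sketch. This is not Any4D's code. It assumes a pinhole camera, z-depth, extrinsics given as a camera-to-world rotation `R` and translation `t`, and scene flow given as a 3D displacement in world coordinates; all function names are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Egocentric step: pixel (u, v) + depth + intrinsics -> point in camera coords."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_world(p_cam, R, t):
    """Allocentric step: apply camera-to-world extrinsics, p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def apply_scene_flow(p_world, flow):
    """Move a world-space point by its (assumed world-space) scene flow vector."""
    return tuple(p + f for p, f in zip(p_world, flow))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = unproject(u=320.0, v=240.0, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, identity, t=(1.0, 0.0, 0.0))
p_next = apply_scene_flow(p_world, flow=(0.0, 0.5, 0.0))
```

Keeping the camera-frame quantities separate from the world-frame ones means each factor can be replaced on its own, for example swapping predicted depth for a sensor depth map, without touching the rest of the chain.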

中文标题/摘要

标题：Any4D：统一的多视图前馈4D重建

我们提出了Any4D，一种可扩展的多视图变换器，用于米尺度的密集前馈4D重建。Any4D直接生成N帧的每个像素的运动和几何预测，而以往的工作通常专注于双视图密集场景流或稀疏3D点跟踪。此外，与其他基于单目RGB视频的4D重建方法不同，Any4D可以处理可用的其他模态和传感器，如RGB-D帧、基于IMU的自我运动和雷达多普勒测量。关键创新之一是4D场景的模块化表示；具体来说，每个视图的4D预测使用多种以自我为中心的因素（深度图和相机内参）表示在局部相机坐标系中，以及以全局世界坐标系表示的以他为中心的因素（相机外参和场景流）。我们在多种设置下实现了优越的性能——在准确性和计算效率方面（误差降低2-3倍，速度快15倍），为多个下游应用打开了新的途径。

Summary / 总结

Any4D is a scalable multi-view transformer designed for metric-scale, dense feed-forward 4D reconstruction. It directly generates per-pixel motion and geometry predictions for N frames, unlike previous methods that focus on 2-view dense scene flow or sparse 3D point tracking. Any4D can handle additional modalities such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements. Its modular representation of 4D scenes, using egocentric and allocentric factors, enables it to process these diverse inputs. The method reports 2-3x lower error and 15x faster computation than existing approaches across diverse setups, making it suitable for various downstream applications.

Any4D 是一个可扩展的多视图变换器，用于实现度量尺度的密集前馈4D重建，直接生成 N 帧的像素级运动和几何预测。它可以处理包括 RGB-D 图像、IMU 基本 ego 运动和雷达多普勒测量在内的多种模态。Any4D 达到了更高的准确性和更好的计算效率，使其能够应用于多个下游任务。

Curriculum-Based Reinforcement Learning for Autonomous UAV Navigation in Unknown Curved Tubular Conduit

Authors: Zamirddine Mari, Jérôme Pasquet, Julien Seinturier

First: 2025-12-11T18:57:29+00:00 · Latest: 2025-12-11T18:57:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to continuous physical dynamics. The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow conduits with poor perceptual cues represents a central challenge.
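
For readers unfamiliar with the baseline, the core of a Pure Pursuit controller is a single curvature formula. The sketch below is the textbook planar (2D) version, not the paper's 3D implementation: it assumes the lookahead point on the centerline is already chosen, and the function name and argument layout are hypothetical. The commanded curvature is κ = 2·y_local / L², where y_local is the lateral offset of the lookahead point in the vehicle frame and L is the distance to it.

```python
import math


def pure_pursuit_curvature(x, y, heading, target_x, target_y):
    """Curvature of the arc from the current pose through the lookahead point.

    Assumptions: planar motion, heading in radians measured from the +x axis,
    and a lookahead point that is distinct from the current position.
    """
    dx = target_x - x
    dy = target_y - y
    # Lateral offset of the target expressed in the vehicle frame.
    y_local = -math.sin(heading) * dx + math.cos(heading) * dy
    lookahead_sq = dx * dx + dy * dy
    return 2.0 * y_local / lookahead_sq


straight = pure_pursuit_curvature(0.0, 0.0, 0.0, 2.0, 0.0)   # target dead ahead
left_turn = pure_pursuit_curvature(0.0, 0.0, 0.0, 1.0, 1.0)  # target ahead-left
```

The controller needs the centerline to pick its lookahead point, which is the information asymmetry the abstract describes: the RL policy has to navigate without that explicit geometric input.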

中文标题/摘要

标题：基于 Curriculum 奖励学习的自主无人机在未知曲管中的导航

在受限的管状环境中自主无人机导航仍然是一个重大挑战，由于导管的约束几何形状、墙壁的近距离以及此类场景中固有的感知限制。我们提出了一种基于强化学习的方法，使无人机能够在没有任何关于其几何形状的先验知识的情况下，仅依靠LiDAR的局部观察和管中心的条件视觉检测来导航未知的三维管。相比之下，作为确定性基线的纯追求算法可以从显式访问中心线中获益，从而创建一种信息不对称，旨在评估强化学习补偿几何模型缺失的能力。代理通过逐步暴露于越来越弯曲的几何形状中进行训练，其中管中心经常从视觉中消失。基于直接可见性、方向记忆和LiDAR对称性线索的转向协商机制对于确保在部分可观测条件下稳定导航至关重要。实验表明，PPO策略获得了稳健且可泛化的行为，即使在有限的几何信息访问下也始终优于确定性控制器。在高保真3D环境中的验证进一步证实了所学行为在连续物理动力学中的可转移性。因此，所提出的方法为在未知管状环境中的自主导航提供了一个完整的框架，并为工业、地下或医疗应用中通过狭窄和感知能力弱的导管提供了新的前景。

Summary / 总结

The paper addresses the challenge of autonomous drone navigation in unknown curved tubular conduits by proposing a reinforcement learning approach. The method involves training a drone using a progressive Curriculum Learning strategy, where it learns to navigate based on local observations from LiDAR and conditional visual detection of the tube center. Key findings show that the Proximal Policy Optimization (PPO) policy outperforms a deterministic Pure Pursuit baseline that has explicit access to the centerline, demonstrating robust and generalizable behavior even without explicit geometric information. Validation in a high-fidelity 3D environment indicates that the learned behavior transfers to continuous physical dynamics.

论文提出了一种强化学习方法，以解决无人机在未知弯曲管状导管中自主导航的挑战。方法包括使用渐进式 Curriculum Learning 策略进行训练，无人机基于 LiDAR 本地观察和管中心的视觉检测来导航。关键发现表明，Proximal Policy Optimization (PPO) 策略在没有显式几何信息的情况下，表现出更稳健和可泛化的行为，并且能够在真实的物理动态中验证其行为的可转移性。

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong

First: 2025-12-11T18:57:05+00:00 · Latest: 2025-12-11T18:57:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Young children's developmental trajectories set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

中文标题/摘要

标题：BabyVLM-V2：面向发展性基础视觉模型预训练和基准测试的框架

早期儿童的发展轨迹为高效样本预训练视觉基础模型提供了自然目标。我们引入了BabyVLM-V2，这是一种基于发展的婴儿启发式视觉语言建模框架，通过纵向多维度预训练集、多功能模型以及最重要的是DevCV工具箱进行认知评估，大幅改进了BabyVLM-V1。预训练集最大限度地覆盖了内容，同时减少了纵向婴儿为中心的视听素材的整理，生成了视频-语句、图像-语句和多轮对话数据，这些数据反映了婴儿的经验。DevCV工具箱将最近发布的NIH婴儿工具箱中所有与视觉相关的度量标准改编为涵盖空间推理、记忆和词汇理解的十项多模态任务基准套件，这些任务与早期儿童的能力相一致。实验结果表明，从零开始预训练的紧凑模型在DevCV工具箱上可以达到竞争力的表现，某些任务上优于GPT-4o。我们希望BabyVLM-V2框架能够促进发展性基础视觉模型预训练的研究。

Summary / 总结

The research aims to develop vision foundation models that can learn efficiently from infant-like data to better simulate early child development. The study introduces BabyVLM-V2, which uses a longitudinal, multifaceted pretraining set and a DevCV Toolbox for cognitive evaluation. The pretraining set includes video-utterance, image-utterance, and multi-turn conversational data that reflect infant experiences. The DevCV Toolbox adapts measures from the NIH Baby Toolbox into a benchmark suite of ten multimodal tasks. The results show that a compact model pretrained from scratch can achieve competitive performance on these tasks, outperforming GPT-4o on some tasks.

研究旨在开发能够从婴儿数据中高效学习的视觉基础模型，与早期儿童的认知发展相一致。BabyVLM-V2 使用纵向、多方面的预训练集和 DevCV 工具箱进行认知评估。预训练集包括视频-语句、图像-语句和多轮对话数据，而 DevCV 工具箱涵盖了空间推理、记忆和词汇理解等任务。实验结果显示，从零开始预训练的紧凑型模型在 DevCV 工具箱上可以达到竞争力的表现，某些任务上甚至优于 GPT-4o。

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Authors: Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li

First: 2025-12-11T18:53:15+00:00 · Latest: 2025-12-11T18:53:15+00:00

Comments: Code is available at https://github.com/Wolfv0/FoundationMotion/tree/main

Abs · PDF · Code1 · Code2 · Code3

Abstract

Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.

中文标题/摘要

标题：FoundationMotion：自动标注和空间运动推理

运动理解是物理推理的基础，使模型能够推断动力学并预测未来状态。然而，最先进的模型在最近的运动基准测试中仍然存在困难，主要是由于缺乏大规模、细粒度的运动数据集。现有的运动数据集通常是从昂贵的手动注释中构建的，严重限制了其可扩展性。为了解决这一挑战，我们引入了FoundationMotion，这是一种完全自动化的数据整理流水线，用于构建大规模的运动数据集。我们的方法首先在视频中检测和跟踪对象以提取其轨迹，然后利用这些轨迹和视频帧以及大型语言模型（LLMs）生成关于运动和空间推理的细粒度描述和多样化的问答对。使用此流水线生成的数据集，我们对包括NVILA-Video-15B和Qwen2.5-7B在内的开源模型进行微调，实现了在运动理解方面的显著改进，同时在其他任务上不降低性能。值得注意的是，我们的模型在各种运动理解数据集和基准测试中均优于强大的闭源基线Gemini-2.5 Flash和大型开源模型Qwen2.5-VL-72B。因此，FoundationMotion为构建细粒度运动数据集提供了一种可扩展的解决方案，这些数据集能够有效微调各种模型以增强运动理解和空间推理能力。

Summary / 总结

The research aims to address the scarcity of large-scale, fine-grained motion datasets for motion understanding, which is crucial for physical reasoning. The method involves an automated data curation pipeline that detects and tracks objects in videos to extract trajectories, then uses Large Language Models to generate captions and question-answer pairs. Experiments show that models fine-tuned with this pipeline outperform strong closed-source and large open-source models across various motion understanding benchmarks, enhancing motion understanding and spatial reasoning capabilities.

FoundationMotion 是一个自动数据整理管道，通过在视频中检测和追踪物体并使用大型语言模型生成描述和问答对来构建大规模运动数据集。使用这些数据集微调模型如 NVILA-Video-15B 和 Qwen2.5-7B 可以提高运动理解能力，同时不会损害其他任务的性能，并在各种运动理解基准测试中优于强大的闭源和大型开源模型。
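
The trajectory-to-question step can be illustrated with a minimal sketch. It is not the FoundationMotion pipeline: the paper uses LLMs to write captions and questions, whereas this example swaps in a fixed rule-based template. The names `dominant_direction` and `motion_qa`, the `min_shift` threshold, and the sample track are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) box centre in image coordinates, y grows downward


def dominant_direction(track: List[Point], min_shift: float = 5.0) -> str:
    """Classify the net displacement between the first and last track point."""
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    if max(abs(dx), abs(dy)) < min_shift:  # hypothetical threshold in pixels
        return "stationary"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


def motion_qa(label: str, track: List[Point]) -> Dict[str, str]:
    """Turn one tracked object into a templated question-answer pair."""
    return {
        "question": f"In which direction does the {label} move?",
        "answer": dominant_direction(track),
    }


# Illustrative track: a car drifting to the right across three frames.
car_track = [(10.0, 50.0), (40.0, 52.0), (90.0, 55.0)]
qa = motion_qa("car", car_track)
```

A template like this covers only coarse direction; the LLM stage described in the abstract is what would supply fine-grained and diverse phrasing.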

LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding

Authors: Yu Yu, Qian Xie, Nairen Cao, Li Jin

Venue: NeurIPS 2025 Workshop

First: 2025-12-07T20:25:07+00:00 · Latest: 2025-12-11T18:52:44+00:00

Comments: NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning

Abs · PDF · Code1 · Code2

Abstract

Designing state encoders for reinforcement learning (RL) with multiple information sources -- such as sensor measurements, time-series signals, image observations, and textual instructions -- remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules -- such as their representation quality -- limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.

Summary / 总结

The paper addresses the challenge of designing state encoders for reinforcement learning with multiple information sources by formulating it as a composite neural architecture search problem. It proposes an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide the search for high-performing composite state encoders, resulting in higher-performing architectures with fewer candidate evaluations compared to traditional NAS methods and the GENIUS framework on a mixed-autonomy traffic control task.

论文通过将多信息源强化学习状态编码设计问题形式化为复合神经架构搜索问题，提出了一种基于LLM的NAS管道，利用语言模型先验和中间输出信号来引导高效搜索高性能复合状态编码。在混合自主交通控制任务上，该方法以更少的候选评估发现更好的架构，优于传统NAS基线和GENIUS框架。

Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation

Authors: Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap

First: 2025-12-11T18:52:42+00:00 · Latest: 2025-12-11T18:52:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.

Summary / 总结

The paper addresses the challenge of autonomous underwater navigation by proposing a digital twin supervised reinforcement learning framework using the Proximal Policy Optimization (PPO) algorithm. The framework combines target-oriented navigation information, a virtual occupancy grid, and ray-casting data. The learned policy was evaluated against the Dynamic Window Approach (DWA) in a realistic simulation and on a physical BlueROV2, showing superior performance in cluttered environments with reduced collisions. The results confirm the transferability of the learned behavior from simulation to the real world, highlighting the potential of deep RL for underwater navigation.

本文通过提出基于Proximal Policy Optimization (PPO)的数字孪生监督强化学习框架，解决了自主水下导航的挑战。该框架结合了目标导向导航、虚拟占用网格和边界射线探测障碍物。所学策略在现实模拟和物理BlueROV2上与Dynamic Window Approach (DWA)进行对比。结果表明，PPO策略在复杂环境中优于DWA，具有更好的局部适应性和更少的碰撞。实验还证实了从模拟到现实世界的策略转移性，证明了深度强化学习在水下机器人自主导航中的相关性。
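
One ingredient of the observation space, ray-casting along the boundaries of the operational area, can be sketched as follows. This is a 2D illustration under stated assumptions, not the authors' implementation: the operational area is taken to be an axis-aligned rectangle, the vehicle is assumed to be inside it, and the function names and the number of rays are hypothetical.

```python
import math
from typing import List, Tuple


def ray_distance(x: float, y: float, theta: float,
                 xmin: float, xmax: float, ymin: float, ymax: float) -> float:
    """Distance from an interior point (x, y) to the rectangle boundary along angle theta."""
    dx, dy = math.cos(theta), math.sin(theta)
    hits = []
    # A ray can only leave through the wall it is heading towards on each axis.
    if dx > 1e-12:
        hits.append((xmax - x) / dx)
    elif dx < -1e-12:
        hits.append((xmin - x) / dx)
    if dy > 1e-12:
        hits.append((ymax - y) / dy)
    elif dy < -1e-12:
        hits.append((ymin - y) / dy)
    return min(hits)  # the nearest wall crossing is the exit point


def boundary_rays(x: float, y: float, heading: float, n_rays: int,
                  bounds: Tuple[float, float, float, float]) -> List[float]:
    """Cast n_rays evenly spaced rays, starting from the vehicle heading."""
    xmin, xmax, ymin, ymax = bounds
    step = 2.0 * math.pi / n_rays
    return [ray_distance(x, y, heading + k * step, xmin, xmax, ymin, ymax)
            for k in range(n_rays)]


# Illustrative call: vehicle at (2, 3) in a 10 m x 10 m area, four rays.
observation = boundary_rays(2.0, 3.0, 0.0, 4, (0.0, 10.0, 0.0, 10.0))
```

In a policy input these distances would typically be concatenated with the goal-relative features and the occupancy grid; how the paper normalizes or orders them is not stated in the abstract.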

If generative AI is the answer, what is the question?

Authors: Ambuj Tewari

First: 2025-09-07T16:07:45+00:00 · Latest: 2025-12-11T18:45:18+00:00

Comments: To appear as a book chapter in a Springer book titled "Statistical Foundations and Applications of Artificial Intelligence, Machine Learning and Deep Learning" and edited by S. Ejaz Ahmed, Pierre Alquier, Yi Li, Shuangge Ma

Abs · PDF · Code1 · Code2

Abstract

Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.

中文标题/摘要

标题：如果生成式AI是答案，那么问题是什么？

从文本和图像开始，生成式AI已经扩展到音频、视频、计算机代码和分子。然而，如果生成式AI是答案，那么问题是什么？我们探讨了生成作为独立机器学习任务的基础，与预测、压缩和决策之间的联系。我们概述了五种主要的生成模型家族：自回归模型、变分自编码器、规范化流、生成对抗网络和扩散模型。然后我们介绍了一个概率框架，强调密度估计与生成之间的区别。我们回顾了一个博弈论框架，采用两玩家的对抗学习设置来研究生成。我们讨论了训练后修改，以准备生成模型的部署。最后，我们强调了一些在社会责任生成方面的重要话题，如隐私、检测AI生成内容以及版权和知识产权。我们采用任务优先的生成框架，专注于生成作为机器学习问题是什么，而不仅仅是模型如何实现它。

Summary / 总结

This paper explores the foundations of generation as a machine learning task, connecting it to prediction, compression, and decision-making. It surveys five major generative model families and introduces a probabilistic framework that distinguishes between density estimation and generation. The study also discusses game-theoretic frameworks and post-training modifications for deployment, and highlights important topics like privacy and AI-generated content detection in socially responsible generation.

本文探讨了生成任务作为机器学习问题的基础，介绍了五种主要的生成模型家族：自回归模型、变分自编码器、规范化流、生成对抗网络和扩散模型。它引入了一个概率框架，区分了密度估计和生成，并讨论了博弈论和后训练修改。关键发现包括理解生成作为一项独特任务的重要性，与预测、压缩和决策制定相关，并且需要负责任的做法，如隐私和内容检测。

Distributionally Robust Regret Optimal Control Under Moment-Based Ambiguity Sets

Authors: Feras Al Taha, Eilyan Bitar

First: 2025-12-11T18:36:15+00:00 · Latest: 2025-12-11T18:36:15+00:00

Comments: 21 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

In this paper, we consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To address the distributional ambiguity, we explore the design of causal affine control policies to minimize the worst-case expected regret over all distributions in the given ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program that corresponds to a regularized version of the nominal linear-quadratic stochastic control problem. While this convex program can be recast as a semidefinite program, semidefinite programs are typically solved using primal-dual interior point methods that scale poorly with the problem size in practice. To address this limitation, we propose a scalable dual projected subgradient method to compute optimal controllers to an arbitrary accuracy. Numerical experiments are presented to benchmark the proposed method against state-of-the-art data-driven and distributionally robust control design approaches.

中文标题/摘要

标题：基于矩约束不确定性集的分布鲁棒后悔最优控制

在本文中，我们考虑一类有限时间区间、线性二次随机控制问题，其中噪声过程的概率分布未知，但假设属于一个不确定性集，该集包含所有均值和协方差位于给定名义值为中心的范数球内的分布。为应对分布不确定性，我们探索设计因果线性控制策略，以最小化给定不确定性集内所有分布的最坏情况期望后悔。所得到的最小最大最优控制问题被证明可以等价地重新表述为一个可解的凸规划问题，该问题对应于一个名义线性二次随机控制问题的正则化版本。虽然这个凸规划问题可以重新表述为半定规划问题，但半定规划问题通常使用对偶内点法求解，这种方法在实际中会随着问题规模的增大而表现不佳。为解决这一局限，我们提出了一种可扩展的对偶投影次梯度方法来计算任意精度的最优控制器。还进行了数值实验，将所提出的方法与最先进的数据驱动和分布鲁棒控制设计方法进行了基准测试。

Summary / 总结

This paper designs causal affine control policies for finite-horizon linear-quadratic stochastic control under distributional ambiguity, where the noise distribution is unknown but assumed to have mean and covariance within norm balls centered at nominal values. The policies minimize the worst-case expected regret over the ambiguity set. The authors show that this minimax problem is equivalent to a tractable convex program, a regularized version of the nominal problem, and propose a scalable dual projected subgradient method as an alternative to interior-point semidefinite programming solvers. Numerical experiments benchmark the method against state-of-the-art data-driven and distributionally robust control design approaches.

本文研究了一类有限时域线性二次随机控制问题，其中噪声分布未知，但假设其均值和协方差位于以给定名义值为中心的范数球内。目标是使用因果仿射控制策略最小化最坏情况下的期望后悔。该问题被等价地重新表述为一个凸规划，可以通过一种可扩展的对偶投影次梯度方法求解。数值实验将所提出的方法与最先进的数据驱动和分布鲁棒控制设计方法进行了基准比较。
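
A small worked example suggests why a worst case over a covariance ball can look like a regularized nominal problem. It is a sketch under simplifying assumptions that are not taken from the paper: zero-mean noise, a Frobenius-norm ball of radius r around the nominal covariance, and a positive semidefinite cost matrix M. Under these assumptions the maximum of tr(M Σ) over the ball is tr(M Σ̂) + r‖M‖_F, attained at Σ̂ + r M / ‖M‖_F, by the Cauchy-Schwarz inequality.

```python
import math
from typing import List

Matrix = List[List[float]]


def trace_product(a: Matrix, b: Matrix) -> float:
    """tr(A B) for square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def frobenius_norm(a: Matrix) -> float:
    return math.sqrt(sum(v * v for row in a for v in row))


def worst_case_cost(m: Matrix, sigma_hat: Matrix, radius: float) -> float:
    """Nominal expected cost plus a norm penalty (assumes M is PSD and nonzero)."""
    return trace_product(m, sigma_hat) + radius * frobenius_norm(m)


def worst_case_covariance(m: Matrix, sigma_hat: Matrix, radius: float) -> Matrix:
    """Covariance in the ball that attains the worst case under the same assumptions."""
    scale = radius / frobenius_norm(m)
    n = len(m)
    return [[sigma_hat[i][j] + scale * m[i][j] for j in range(n)] for i in range(n)]


# Illustrative numbers only.
M = [[2.0, 0.0], [0.0, 1.0]]
SIGMA_HAT = [[1.0, 0.0], [0.0, 1.0]]
R = 0.5
bound = worst_case_cost(M, SIGMA_HAT, R)
attained = trace_product(M, worst_case_covariance(M, SIGMA_HAT, R))
```

The penalty term r‖M‖_F plays the role of a regularizer on the policy-dependent cost matrix. The paper's ambiguity set also covers the mean and may use different norms, so its actual reformulation should be expected to differ in form.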

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

First: 2025-12-11T18:23:03+00:00 · Latest: 2025-12-11T18:23:03+00:00

Comments: Project page: https://intchous.github.io/DuetSVG-site

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

中文标题/摘要

标题：DuetSVG：统一的多模态SVG生成与内部视觉指导

基于视觉-语言模型（VLM）的方法在SVG生成方面取得了令人印象深刻的成果。然而，由于它们在解码过程中仅生成文本而缺乏视觉信号，因此往往难以处理复杂的语义，无法生成视觉上吸引人或几何上一致的SVG。我们提出了DuetSVG，这是一种统一的多模态模型，可以以端到端的方式同时生成图像标记和相应的SVG标记。DuetSVG在图像和SVG数据集上进行训练。在推理时，我们应用了一种新颖的测试时缩放策略，利用模型的原生视觉预测作为指导，以提高SVG解码质量。广泛的实验表明，我们的方法优于现有方法，能够生成视觉上忠实、语义上对齐且语法上干净的SVG。

Summary / 总结

The motivation for DuetSVG is to improve SVG generation by incorporating visual signals during the decoding process, addressing the limitations of existing vision-language model-based approaches. The method involves a unified multimodal model that generates both image and SVG tokens end-to-end, and uses a test-time scaling strategy to enhance SVG decoding quality with visual guidance. Key experimental findings show that DuetSVG outperforms existing methods, generating visually faithful, semantically aligned, and syntactically clean SVGs for various applications.

研究动机是解决现有基于视觉-语言模型（VLM）的方法在SVG生成中存在的问题，这些方法在解码过程中缺乏视觉信号，往往无法生成视觉上吸引人或几何上协调的SVG。提出了一个统一的多模态模型DuetSVG，该模型可以端到端地联合生成图像令牌和相应的SVG令牌。该模型在图像和SVG数据集上进行训练，在推理时应用了一种新颖的测试时缩放策略，利用模型的视觉预测作为指导。实验结果表明，DuetSVG在各种应用中生成了视觉上忠实、语义上对齐和语法上干净的SVG。

Iterative Compositional Data Generation for Robot Control

Authors: Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

First: 2025-12-11T18:20:49+00:00 · Latest: 2025-12-11T18:20:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

中文标题/摘要

标题：机器人控制的迭代组合数据生成

收集机器人操作数据成本高昂，使得在多对象、多机器人和多环境设置中获取演示变得不切实际。尽管最近的生成模型可以为单个任务合成有用的数据，但它们未能利用机器人领域的组合结构，并且难以泛化到未见过的任务组合。我们提出了一种语义组合扩散变换器，将转换分解为机器人特定、对象特定、障碍物特定和目标特定的组件，并通过注意力学习它们的交互。在有限的任务子集上训练后，我们展示了该模型可以零样本生成高质量的转换，从中可以学习未见过的任务组合的控制策略。然后，我们引入了一种迭代自我改进过程，在此过程中，合成数据通过离线强化学习验证，并纳入后续训练轮次。我们的方法在零样本性能上显著优于单一和硬编码的组合基线，最终解决了几乎所有保留任务，并展示了在学习表示中出现有意义的组合结构。

Summary / 总结

The research aims to address the high cost of collecting robotic manipulation data for various tasks. It proposes a semantic compositional diffusion transformer that learns the interactions between robot, object, obstacle, and objective components through attention. After training on a subset of tasks, the model can generate high-quality transitions for unseen task combinations, which are then used to learn control policies. An iterative self-improvement procedure further enhances the model's performance, solving nearly all held-out tasks and revealing meaningful compositional structure in the learned representations.

本文解决了收集机器人操作数据成本高且不切实际的问题，特别是在多对象、多机器人和多环境设置中。提出了一种语义组合扩散变换器，将过渡分解为机器人、物体、障碍物和目标特定的组件，并通过注意力学习它们的交互。在对部分任务进行训练后，该模型可以生成高质量的过渡以处理未见过的任务组合，并通过结合离线强化学习验证的合成数据进行迭代自我改进。这种方法在零样本性能上优于整体和硬编码的组合基线，并解决了几乎所有保留的任务，展示了学习表示中的有意义的组合结构。

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Authors: Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland

First: 2025-12-11T18:19:00+00:00 · Latest: 2025-12-11T18:19:00+00:00

Comments: 15 pages, 7 figures

Abs · PDF · Code1 · Code2

Abstract

Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

中文标题/摘要

标题：PubTables-v2：新的大规模数据集用于全页和多页表格提取

表格提取（TE）是视觉文档理解中的一个关键挑战。传统方法首先检测表格，然后识别其结构。最近，人们开始开发可以直接在全页或文档上下文中提取表格的方法，例如视觉-语言模型（VLMs）。然而，由于缺乏标注数据，进步难以展示。为了解决这个问题，我们创建了一个新的大规模数据集，PubTables-v2。PubTables-v2 支持当前许多具有挑战性的表格提取任务。值得注意的是，它是第一个大规模的多页表格结构识别基准。我们通过在这些任务上评估领域专用的 VLMs 来展示其用途，并突出当前的进展。最后，我们使用 PubTables-v2 创建了页对象表格变换器（POTATR），这是一种图像到图的 Table Transformer 扩展，用于全面的页面级表格提取。数据、代码和训练模型将被发布。

Summary / 总结

The research aims to improve table extraction in visual document understanding by addressing the lack of annotated data. PubTables-v2, a new large-scale dataset, is introduced to support full-page and multi-page table extraction, especially for multi-page table structure recognition. The dataset enables the evaluation of vision-language models and the development of the Page-Object Table Transformer (POTATR), which extends the Table Transformer for comprehensive page-level table extraction.

研究旨在通过解决标注数据不足的问题，提高视觉文档理解中的表格提取。提出了一个新的大规模数据集PubTables-v2，支持全页和多页表格提取，特别是多页表格结构识别。该数据集用于评估视觉语言模型，并开发了Page-Object Table Transformer (POTATR)，该模型扩展了Table Transformer以实现全面的页面级表格提取。

AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Authors: Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar

Venue: WACV 2026

First: 2025-11-30T11:32:54+00:00 · Latest: 2025-12-11T18:16:38+00:00

Comments: Accepted at WACV 2026 Conference

Abs · PDF · Code1 · Code2

Abstract

There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.

中文标题/摘要

标题：AFRAgent：一种自适应特征重规范化基于高分辨率感知的GUI代理

随着移动用户界面（UI）自动化需求的增长，其在各行业的广泛应用推动了这一领域的发展。随着视觉语言模型（VLMs）的出现，GUI自动化从生成供人类使用的文本指令发展到自主执行任务，从而优化了自动化工作流。最近的方法利用VLMs，因为它们能够1）直接处理屏幕内容，2）通过利用人类操作（例如点击、输入）而不依赖于特定设备的API，3）应用现实世界的上下文知识来理解任务。然而，这些模型往往难以准确识别控件和确定操作，因为视觉编码器特征中的空间信息有限。此外，表现最佳的模型通常很大，需要大量训练，导致推理延迟。在本文中，我们提出了AFRAgent，这是一种基于指令-BLIP的多模态架构，其性能优于其最接近的竞争者四分之一。为了增强大型语言模型（LLM）管道中的图像嵌入，我们提出了一种自适应特征重规范化技术（基于令牌级仿射变换），该技术有效地丰富了低分辨率图像嵌入并融合了高分辨率细节。我们在Meta-GUI和AITW基准上评估了AFRAgent，建立了智能手机自动化的新基准。

Summary / 总结

AFRAgent is an instruct-BLIP-based multimodal architecture for mobile UI automation that targets a weakness of existing visual language models: limited spatial information in vision encoder features, which makes it hard to identify widgets and determine actions. It uses adaptive feature renormalization, a token-level affine transformation, to enrich low-resolution image embeddings and fuse in high-resolution details. While less than one-fourth the size of its nearest competitor, AFRAgent sets a new state of the art on the Meta-GUI and AITW benchmarks for smartphone automation.

AFRAgent旨在通过解决现有视觉语言模型在准确识别控件和确定操作方面的局限性，提高移动UI自动化。它采用了一种自适应特征重规范化技术来增强图像嵌入，使其在GUI自动化方面更有效。AFRAgent通过在Meta-GUI和AITW基准测试中实现更优性能，且体积更小，确立了智能手机自动化的新基准。
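
A token-level affine transformation of the kind the abstract names can be sketched generically. The details here are assumptions rather than the AFRAgent design: each low-resolution token is first normalized across its channels, then rescaled and shifted by per-channel `gamma` and `beta` values, which in a real model would presumably be predicted from high-resolution features. The function name and `eps` value are hypothetical.

```python
import math
from typing import List


def renormalize_token(token: List[float], gamma: List[float], beta: List[float],
                      eps: float = 1e-5) -> List[float]:
    """Normalize one token across channels, then apply a per-channel affine map."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(token, gamma, beta)]


# Illustrative values: identity scale and zero shift leave a standardized token.
low_res_token = [1.0, 2.0, 3.0]
standardized = renormalize_token(low_res_token, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])

# A zero scale with a constant shift overwrites the token with the shift.
overwritten = renormalize_token(low_res_token, [0.0, 0.0, 0.0], [5.0, 5.0, 5.0])
```

Because the scale and shift vary per token, the modulation can inject location-specific detail without increasing the number of tokens passed to the language model.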

Physics-Informed Learning of Flow Distribution and Receiver Heat Losses in Parabolic Trough Solar Fields

Authors: Stefan Matthes, Markus Schramm

First: 2025-12-11T18:16:26+00:00 · Latest: 2025-12-11T18:16:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools. We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods -- when hot oil is circulated through a non-irradiated field -- to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field. The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.

Summary / 总结

The research aims to diagnose hydraulic imbalances and receiver degradation in parabolic trough CSP plants by inferring loop-level mass flows and receiver heat-transfer coefficients from operational data. The method uses a physics-informed learning framework that exploits nocturnal homogenization periods. The model accurately reconstructs loop temperatures and identifies areas with high-loss receivers, showing strong correspondence with drone-based infrared thermography data.

研究旨在通过从运营数据中推断出循环级质量流率和接收器传热系数，来诊断抛物槽型 CSP 系统中的液压不平衡和接收器退化。方法利用夜间同质化时期来隔离液压和热损失效应。该模型准确地重建了循环温度，并识别出高热损失接收器的区域，与无人机红外热成像结果有很强的对应关系。
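
The link between loop temperatures, mass flow, and heat losses during non-irradiated circulation can be illustrated with a textbook lumped model. This is not the paper's differentiable conjugate heat-transfer model: it assumes steady state, a constant ambient temperature, and a single overall loss coefficient UA per loop, giving T_out = T_amb + (T_in - T_amb) * exp(-UA / (m_dot * cp)). All numbers are illustrative.

```python
import math


def outlet_temperature(t_in: float, t_amb: float, ua: float,
                       m_dot: float, cp: float) -> float:
    """Outlet temperature of a non-irradiated loop losing heat to ambient."""
    return t_amb + (t_in - t_amb) * math.exp(-ua / (m_dot * cp))


def infer_ua(t_in: float, t_out: float, t_amb: float,
             m_dot: float, cp: float) -> float:
    """Invert the model: recover UA from measured temperatures and a known mass flow."""
    return m_dot * cp * math.log((t_in - t_amb) / (t_out - t_amb))


# Illustrative values: 6 kg/s of oil, cp = 2300 J/(kg K), UA = 500 W/K.
T_IN, T_AMB, UA, M_DOT, CP = 300.0, 20.0, 500.0, 6.0, 2300.0
t_out = outlet_temperature(T_IN, T_AMB, UA, M_DOT, CP)
ua_recovered = infer_ua(T_IN, t_out, T_AMB, M_DOT, CP)
```

In this lumped form only the ratio UA / (m_dot * cp) affects the outlet temperature, so a single steady-state measurement cannot separate flow from heat loss. Separating them requires additional structure, such as the field-wide, time-resolved model the abstract describes.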

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

First: 2025-12-11T18:09:48+00:00 · Latest: 2025-12-11T18:09:48+00:00

Comments: Project page: https://animotionlab.github.io/MoCapAnything/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

中文标题/摘要

标题：MoCapAnything：任意骨架的统一单目3D动作捕捉

动作捕捉现在已远远超越了数字人类的内容创作，但大多数现有的流水线仍然局限于特定物种或模板。我们将其差距形式化为无类别动作捕捉（CAMoCap）：给定一个单目视频和一个任意的3D资产作为提示，目标是重建基于旋转的动画，如BVH，直接驱动特定的资产。我们提出MoCapAnything，这是一种参考引导、因子化的框架，首先预测3D关节轨迹，然后通过约束感知逆运动学恢复资产特定的旋转。该系统包含三个可学习模块和一个轻量级的IK阶段：（1）参考提示编码器，从资产的骨架、网格和渲染图像中提取每个关节的查询；（2）视频特征提取器，计算密集的视觉描述符并重建一个粗略的4D变形网格，以弥合视频和关节空间之间的差距；（3）统一动作解码器，将这些线索融合以生成时间连贯的轨迹。我们还整理了Truebones动物园，包含1038个动作片段，每个片段提供了一个标准化的骨架-网格-渲染三元组。在领域内基准测试和野外视频上的实验表明，MoCapAnything能够生成高质量的骨骼动画，并在异构骨架上表现出跨物种的有意义的重新目标化，从而实现任意资产的可扩展、提示驱动的3D动作捕捉。项目页面：https://animotionlab.github.io/MoCapAnything/

Summary / 总结

MoCapAnything addresses the gap in motion capture by proposing a unified framework for arbitrary skeleton animation from monocular videos. It uses a reference-guided, factorized approach that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system includes a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder. Experiments show high-quality skeletal animations and meaningful cross-species retargeting across different rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets.

MoCapAnything通过预测3D关节轨迹并使用参考引导、因子化的框架恢复特定资产的旋转，解决了类别无关的动作捕捉问题。该系统包括参考提示编码器、视频特征提取器和统一动作解码器。实验表明，MoCapAnything能够生成高质量的骨骼动画，并支持在不同骨架上的跨物种重新目标化，从而实现任意资产的可扩展、提示驱动的3D动作捕捉。
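
The second stage, recovering rotations from predicted joint positions, can be illustrated for a single bone. This sketch is not the paper's constraint-aware IK solver: it computes the axis-angle rotation that aligns a rest-pose bone direction with a predicted one, and it ignores twist about the bone axis and joint limits. The function names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bone_direction(parent: Vec3, child: Vec3) -> Vec3:
    """Unit vector from a parent joint to its child joint."""
    return _normalize((child[0] - parent[0], child[1] - parent[1], child[2] - parent[2]))


def bone_rotation(rest_dir: Vec3, target_dir: Vec3) -> Tuple[Vec3, float]:
    """Axis and angle (radians) of the shortest rotation taking rest_dir to target_dir."""
    a, b = _normalize(rest_dir), _normalize(target_dir)
    axis = _cross(a, b)
    sin_angle = math.sqrt(sum(c * c for c in axis))
    cos_angle = sum(x * y for x, y in zip(a, b))
    angle = math.atan2(sin_angle, cos_angle)
    if sin_angle < 1e-9:
        # Parallel or antiparallel: any axis perpendicular to the bone is valid.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        return _normalize(_cross(a, helper)), angle
    return (axis[0] / sin_angle, axis[1] / sin_angle, axis[2] / sin_angle), angle


# Illustrative case: a bone resting along +y is predicted to point along +x.
predicted = bone_direction((0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
axis, angle = bone_rotation((0.0, 1.0, 0.0), predicted)
```

Positions alone leave the twist around each bone undetermined, which is one reason a full solver needs extra constraints from the asset's rig.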

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

First: 2025-12-11T18:00:21+00:00 · Latest: 2025-12-11T18:00:21+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

中文标题/摘要

标题：从宏观到微观：通过视觉语言模型评估分子微观空间智能

本文介绍了微观空间智能（MiSI）的概念，即感知和推理看不见的微观实体的空间关系的能力，这是科学研究的基础。为了评估视觉语言模型（VLMs）在这一领域的潜力，我们提出了一种系统性的基准框架MiSI-Bench。该框架包含超过163,000个问答对和587,000张图像，源自约4,000个分子结构，涵盖了九项互补任务，评估能力从基本的空间变换到复杂的关联识别。实验结果表明，当前最先进的VLMs在这一基准上的表现远低于人类水平。然而，微调后的7B模型显示出巨大的潜力，甚至在空间变换任务上超过了人类，而其在氢键识别等基于科学的任务上的表现不佳，突显了整合显式领域知识以实现科学AGI的必要性。数据集可在https://huggingface.co/datasets/zongzhao/MiSI-bench 获取。

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Authors: Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

First: 2025-12-11T17:57:24+00:00 · Latest: 2025-12-11T17:57:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

中文标题/摘要

标题：MMSI-Video-Bench：面向视频空间智能的综合基准

在连续视觉输入中进行空间理解对于MLLMs演变为物理环境中的通用助手至关重要。然而，目前还没有一个全面的基准能够综合评估这一目标的进展。在本文中，我们介绍了MMSI-Video-Bench，这是一个全面的人工标注基准，用于评估MLLMs的视频空间智能。它通过1,106个问题和1,278个来自25个数据集和内部视频的片段，实现了感知、规划、预测和跨视频推理四个层次框架的系统化。每个项目都由3D视觉专家精心设计和审查，并附有解释性理由，以确保精确和明确的定位。利用其多样化的数据来源和全面的任务覆盖范围，MMSI-Video-Bench还支持三个领域导向的子基准（室内场景感知基准、机器人基准和语义基准），以进行有针对性的能力评估。我们评估了25个强大的开源和专有MLLMs，揭示了显著的人工智能差距：许多模型的表现接近随机猜测，而最佳推理模型比人类落后近60%。我们还发现，空间微调模型在我们的基准上仍然无法有效泛化。精细的错误分析揭示了几何推理、运动定位、长时预测和跨视频对应中的系统性失败。我们还表明，典型的帧采样策略在我们的推理密集型基准上表现不佳，而3D空间线索或链式思考提示也没有带来有意义的改进。我们期望我们的基准能够为推进视频空间智能建立一个坚实的基础。

Summary / 总结

The research introduces MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs, covering Perception, Planning, Prediction, and Cross-Video Reasoning through 1,106 questions grounded in 1,278 clips. An evaluation of 25 MLLMs finds a large human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Error analysis reveals systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. Typical frame-sampling strategies transfer poorly to the benchmark, and neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains.

MMSI-Video-Bench 是一个全面的基准，用于评估 MLLMs 的视频空间智能。它通过 1,278 个片段中的 1,106 个问题，评估模型在感知、规划、预测和跨视频推理方面的表现。对 25 个模型的评估显示了显著的人工智能差距，许多模型的表现接近随机猜测，最佳模型落后人类近 60%。基准测试揭示了几何推理、运动定位和长时预测方面的系统性失败，并表明典型的帧采样策略和 3D 空间线索无法提高性能。该基准旨在推动视频空间智能研究的发展。

SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Authors: Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

First: 2025-12-11T17:54:31+00:00 · Latest: 2025-12-11T17:54:31+00:00

Comments: Project page: https://animotionlab.github.io/SWIT4D/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/

中文标题/摘要

标题：SWiT-4D：滑动窗口变换器用于无损和参数自由的时空4D生成

尽管在4D内容生成方面取得了显著进展，但将单目视频转换为具有明确4D网格的高质量动画3D资产仍然极具挑战性。由于缺乏大规模的自然捕获4D网格数据集，这进一步限制了从头开始以纯数据驱动方式训练通用的视频到4D模型的能力。与此同时，由大量数据支持的图像到3D生成的进步提供了强大的先验模型，可以加以利用。为了更好地利用这些先验模型并尽量减少对4D监督的依赖，我们提出了SWiT-4D，一种用于无损、参数自由的时空4D网格生成的滑动窗口变换器。SWiT-4D可以无缝集成到任何基于扩散变换器（DiT）的图像到3D生成器中，在视频帧之间进行空间-时间建模，同时保留原始的单图像前向过程，从而可以从任意长度的视频中重建4D网格。为了恢复全局平移，我们进一步引入了一个针对静态相机单目视频的基于优化的轨迹模块。SWiT-4D展示了强大的数据效率：仅需一个短于10秒的视频进行微调，即可实现高保真几何结构和稳定的时空一致性，表明在极有限的4D监督下具有实际部署能力。在本领域动物园测试集和具有挑战性的跨领域基准测试（C4D、Objaverse和野外视频）上的全面实验表明，SWiT-4D在时间平滑度方面始终优于现有基线。项目页面：https://animotionlab.github.io/SWIT4D/

Summary / 总结

SWiT-4D generates high-quality 4D meshes from monocular videos with a lossless, parameter-free sliding-window design that minimizes reliance on 4D supervision. It integrates with existing DiT-based image-to-3D generators and introduces an optimization-based trajectory module that recovers global translation for static-camera videos. SWiT-4D is data-efficient: fine-tuned on a single short (<10s) video, it achieves high-fidelity geometry and stable temporal consistency, and it outperforms existing baselines in temporal smoothness on both in-domain and out-of-domain benchmarks.

SWiT-4D旨在以无损、参数自由的方式，在尽量减少对4D监督依赖的情况下，从单目视频生成高质量的4D网格。它与现有的基于DiT的图像到3D模型集成，并引入了一个基于优化的轨迹模块，以恢复静态相机视频中的全局平移。SWiT-4D展示了强大的数据效率：仅用一个短于10秒的视频进行微调，即可实现高保真几何结构和稳定的时间一致性，并在本领域和跨领域基准测试中，在时间平滑度方面优于现有基线。
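The abstract does not spell out how SWiT-4D's sliding window is built, so the following is only a generic sketch of a sliding-window temporal attention mask, not the authors' implementation. The function name `sliding_window_mask` and the `window` parameter are assumptions introduced for illustration.

```python
from typing import List


def sliding_window_mask(num_frames: int, window: int) -> List[List[bool]]:
    """Build a generic temporal attention mask (illustrative, hypothetical helper).

    mask[i][j] is True when frame i may attend to frame j, i.e. when the two
    frames are at most `window` steps apart in time.
    """
    if num_frames < 0 or window < 0:
        raise ValueError("num_frames and window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Each row lists which frames the row's frame can attend to.
    for row in sliding_window_mask(num_frames=5, window=1):
        print("".join("x" if allowed else "." for allowed in row))
```

With `window=0` the mask is the identity, so every frame only sees itself; this is how a per-frame (single-image) pass looks in this toy formulation. Because the mask depends only on relative frame distance, the same rule applies to a video of any length.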

Bayesian Symbolic Regression via Posterior Sampling

Authors: Geoffrey F. Bomarito, Patrick E. Leser

First: 2025-12-11T17:38:20+00:00 · Latest: 2025-12-11T17:38:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Differing from traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method better deals with challenging, noisy benchmark datasets. The reduced tendency to overfit and enhanced ability to discover accurate and interpretable equations paves the way for more robust symbolic regression in scientific discovery and engineering design applications.

中文标题/摘要

标题：基于后验采样的贝叶斯符号回归

符号回归是一种强大的工具，可以直接从数据中发现支配方程，但其对噪声的敏感性限制了其更广泛的应用。本文提出了一种基于顺序蒙特卡洛（SMC）框架的贝叶斯符号回归方法，该方法近似符号表达式的后验分布，增强了鲁棒性，并能够在噪声存在的情况下进行符号回归的不确定性量化。与传统的遗传编程方法不同，基于SMC的算法结合了概率选择、自适应退火和归一化边际似然的使用，以高效地探索符号表达式的搜索空间，产生简洁且具有更好泛化能力的表达式。与标准的遗传编程基线方法相比，所提出的方法在处理具有挑战性和噪声的数据集时表现更好。减少过拟合的趋势和发现准确且可解释方程的能力为科学发现和工程设计应用中的更稳健符号回归铺平了道路。

Summary / 总结

This paper addresses the challenge of symbolic regression being sensitive to noise by proposing a Bayesian approach using Sequential Monte Carlo (SMC) for posterior sampling. The method combines probabilistic selection, adaptive tempering, and normalized marginal likelihood to explore the space of symbolic expressions efficiently, leading to more robust and interpretable models. Experiments show that the proposed method outperforms standard genetic programming techniques on noisy datasets, demonstrating better generalization and reduced overfitting.

该论文通过提出基于Sequential Monte Carlo (SMC)的贝叶斯方法进行后验采样，解决了噪声数据在符号回归中的挑战。该方法增强了鲁棒性并允许不确定性量化。与传统的遗传编程相比，基于SMC的算法在嘈杂的数据集上表现出更好的性能，提供了更准确和可解释的方程，并减少了过拟合。这种方法特别适用于需要鲁棒性的科学发现和工程设计应用。
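The abstract names adaptive tempering as one ingredient of the SMC algorithm but gives no details. As a minimal sketch, one common way to adapt the tempering schedule in SMC samplers is to pick the next temperature by bisection so that the effective sample size (ESS) stays above a target fraction. This is an assumption about the general technique, not the paper's procedure; all function names and the `target_frac` parameter are hypothetical, and particles are assumed to carry equal weights before each step (for example, right after resampling).

```python
import math
from typing import List


def ess(weights: List[float]) -> float:
    """Effective sample size of a list of normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def reweight(log_liks: List[float], d_phi: float) -> List[float]:
    """Normalized incremental weights for a temperature increase of d_phi.

    Assumes equal particle weights before the step (e.g. after resampling).
    """
    scaled = [d_phi * ll for ll in log_liks]
    m = max(scaled)  # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def next_temperature(log_liks: List[float], phi: float,
                     target_frac: float = 0.5, tol: float = 1e-6) -> float:
    """Bisection search for the next temperature in (phi, 1].

    Returns (up to `tol`) the largest temperature whose incremental weights
    keep ESS >= target_frac * N, or 1.0 if the full jump is already safe.
    """
    target = target_frac * len(log_liks)
    if ess(reweight(log_liks, 1.0 - phi)) >= target:
        return 1.0
    lo, hi = phi, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(reweight(log_liks, mid - phi)) >= target:
            lo = mid  # step is safe, try a larger one
        else:
            hi = mid  # step degrades the weights too much
    return lo


if __name__ == "__main__":
    # Toy log-likelihoods for four candidate expressions (made-up numbers).
    toy_log_liks = [-1.0, -2.0, -50.0, -100.0]
    print(next_temperature(toy_log_liks, phi=0.0))
```

When candidate expressions differ sharply in likelihood, the search returns a small step, so the particle population moves toward the posterior gradually instead of collapsing onto one expression.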

From Generated Human Videos to Physically Plausible Robot Trajectories

Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig

First: 2025-12-04T18:56:03+00:00 · Latest: 2025-12-11T17:37:53+00:00

Comments: For project website, see https://genmimic.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

中文标题/摘要

标题：从生成的人类视频到物理上可信的机器人轨迹

视频生成模型在合成人类在新情境下的动作方面的能力正在迅速提高，这为作为上下文机器人控制的高级规划者提供了潜在的应用。为了实现这一潜力，一个关键的研究问题仍然悬而未决：如何以零样本的方式让类人机器人执行生成视频中的人类动作？这一挑战源于生成的视频通常噪声较大，且存在形态失真，使得直接模仿比真实视频更困难。为了解决这一问题，我们引入了一个两阶段的流水线。首先，我们将视频像素提升到4D的人类表示，然后重新定向到类人形态。其次，我们提出了一个基于3D关键点的物理感知强化学习策略GenMimic，并通过对称正则化和关键点加权跟踪奖励进行训练。因此，GenMimic可以从噪声较大的生成视频中模仿人类动作。我们构建了GenMimicBench，这是一个使用两种视频生成模型生成的合成人类动作数据集，涵盖了各种动作和情境，为评估零样本泛化能力和策略鲁棒性建立了基准。广泛的实验表明，GenMimic在模拟中优于强大的基线，并且在无需微调的情况下，能够实现类人机器人Unitree G1的连贯且物理稳定的动作跟踪。这项工作为利用视频生成模型作为机器人控制的高级策略提供了有希望的途径。

Summary / 总结

The research addresses the challenge of translating human actions from generated videos into physically plausible robot trajectories. It introduces a two-stage pipeline: first, lifting video pixels into a 4D human representation and then retargeting to a humanoid morphology. The second stage involves GenMimic, a physics-aware reinforcement learning policy that uses 3D keypoints and is trained with symmetry regularization and keypoint-weighted tracking rewards. Experiments show that GenMimic can effectively mimic human actions from noisy generated videos and achieve coherent, physically stable motion on a Unitree G1 humanoid robot without fine-tuning.

该研究解决了将生成视频中的人类动作转化为物理上可行的机器人运动的挑战。它提出了一种两阶段管道：首先将视频像素提升到4D人体表示，并重新定位到人形形态；然后训练一个以3D关键点为条件的物理感知强化学习策略GenMimic，使用对称正则化和关键点加权跟踪奖励。实验表明，GenMimic可以从嘈杂的生成视频中有效模仿人类动作，在模拟中优于强基线，并在无需微调的情况下于Unitree G1人形机器人上实现连贯且物理稳定的动作跟踪。
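The abstract mentions keypoint-weighted tracking rewards without giving their form. The sketch below shows one plausible shape, an exponential of the weighted mean squared 3D keypoint error; the formula, the `sharpness` parameter, and the function name `tracking_reward` are all assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def tracking_reward(pred: Sequence[Point3D], ref: Sequence[Point3D],
                    weights: Sequence[float], sharpness: float = 10.0) -> float:
    """Hypothetical keypoint-weighted tracking reward in (0, 1].

    Computes exp(-sharpness * weighted mean squared keypoint error), so
    errors on heavily weighted keypoints reduce the reward more.
    """
    if not (len(pred) == len(ref) == len(weights)):
        raise ValueError("pred, ref and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_err = sum(
        w * sum((p - r) ** 2 for p, r in zip(p_k, r_k))
        for p_k, r_k, w in zip(pred, ref, weights)
    )
    return math.exp(-sharpness * weighted_err / total_weight)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    weights = [3.0, 1.0]  # made-up weights: first keypoint matters more
    print(tracking_reward(reference, reference, weights))
```

Weighting lets a policy tolerate distortion in keypoints that generated videos render unreliably while still tracking the keypoints that define the action.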

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

First: 2025-12-11T17:29:25+00:00 · Latest: 2025-12-11T17:29:25+00:00

Comments: Project page: https://windvchen.github.io/PoseGAM/

Abs · PDF · Code1 · Code2 · Project1

Abstract

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .

中文标题/摘要

标题：PoseGAM：通过几何意识多视图推理实现鲁棒的未见物体姿态估计

6D物体姿态估计，即预测物体相对于相机的变换，对于未见物体而言仍然具有挑战性。现有方法通常依赖于在查询图像与物体模型或模板图像之间显式构建特征对应关系。在本工作中，我们提出PoseGAM，一种几何意识多视图框架，可以直接从查询图像和多个模板图像中预测物体姿态，从而消除显式匹配的需要。该方法基于最近的多视图基础模型架构，通过两种互补机制整合物体几何信息：显式的点基几何和几何表示网络学习的特征。此外，我们构建了一个包含超过19万个物体的大型合成数据集，这些物体在多种环境条件下存在，以增强鲁棒性和泛化能力。在多个基准上的广泛评估表明，我们的方法具有最先进的性能，相对于先前方法平均AR提高了5.1%，在个别数据集上最高可获得17.6%的提升，表明其在未见物体上的强大泛化能力。

Summary / 总结

PoseGAM is a geometry-aware multi-view framework that predicts 6D object pose directly from a query image and multiple template images without explicit matching. It uses a large-scale synthetic dataset with over 190k objects under various conditions to enhance robustness. The method improves average AR by 5.1% over previous methods and achieves up to 17.6% gains on individual datasets, showing strong generalization to unseen objects.

PoseGAM 是一种几何感知的多视图框架，旨在通过直接从查询图像和多个模板图像预测物体姿态来估计未见过物体的 6D 姿态，无需进行显式的特征匹配。该方法利用了基于多视图的最新基础模型，并通过显式的点几何和几何表示网络学习的特征来整合物体的几何信息。该方法在多个基准测试中进行了评估，相对于先前的方法平均提高了 5.1% 的 AR 性能，最高在单个数据集上提高了 17.6%，展示了对未见过物体的强大泛化能力。项目页面: https://windvchen.github.io/PoseGAM/ .

T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders

Authors: Alexey Yermakov, David Zoro, Mars Liyao Gao, J. Nathan Kutz

First: 2025-06-18T21:14:38+00:00 · Latest: 2025-12-11T17:28:30+00:00

Comments: 17 pages, 5 figures, submitted to Transactions of the Royal Society (Symbolic Regression in the Physical Sciences)

Abs · PDF · Code1 · Code2

Abstract

SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are lightweight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) and a simple Multi-Layer Perceptron (MLP) for the temporal encoding and spatial decoding, respectively. Despite their relatively simple structure, SHRED models are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we modify SHRED by leveraging transformers (T-SHRED) embedded with symbolic regression for the temporal encoding, circumventing auto-regressive long-term forecasting for physical data. This is achieved by introducing a new sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED, which imposes sparsity regularization on the latent space and also allows for immediate symbolic interpretation. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes.

中文标题/摘要

标题：T-SHRED：基于变换器浅层递归解码器的符号回归，用于正则化与模型发现

浅层递归解码器（SHallow REcurrent Decoders，SHRED）对于从稀疏传感器测量中进行系统识别和预测非常有效。这类模型轻量级且计算效率高，允许它们在消费级笔记本电脑上进行训练。SHRED模型分别依赖于递归神经网络（RNN）和简单的多层感知机（MLP）进行时间编码和空间解码。尽管SHRED结构相对简单，但它们能够直接从稀疏的传感器测量中预测不同物理、空间和时间尺度的混沌动力系统。在本文中，我们通过利用嵌入符号回归的变换器（T-SHRED）进行时间编码来修改SHRED，从而避免对物理数据进行自回归长期预测。这是通过将新的稀疏非线性动力系统识别（SINDy）注意力机制引入T-SHRED来实现的，该机制对潜在空间施加稀疏正则化，同时允许即时符号解释。符号回归通过在训练过程中学习和正则化潜在空间的动力学来提高模型的可解释性。我们分析了T-SHRED在三种不同动力系统上的性能，从低数据到高数据范围。

Summary / 总结

This study introduces T-SHRED, which modifies SHRED models by using transformers embedded with symbolic regression for the temporal encoding when predicting chaotic dynamical systems from sparse sensor data. The key method is a new sparse identification of nonlinear dynamics (SINDy) attention mechanism that imposes sparsity regularization on the latent space, enabling immediate symbolic interpretation and improving model interpretability. The paper analyzes T-SHRED's performance on three dynamical systems ranging from low-data to high-data regimes.

T-SHRED通过结合变换器（Transformer）和符号回归改进了SHRED的时间编码，用于从稀疏传感器数据中预测混沌动力系统。该模型使用SINDy注意力机制对潜在空间施加稀疏正则化，从而实现即时的符号解释并提高可解释性。论文在从低数据到高数据范围的三种动力系统上分析了T-SHRED的性能。

Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

Authors: Atahan Cilan, Atay Özgövde

First: 2025-12-11T17:26:24+00:00 · Latest: 2025-12-11T17:26:24+00:00

Comments: Submitted to IEEE Transactions on Games

Abs · PDF · Code1 · Code2

Abstract

This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.

中文标题/摘要

标题：在多智能体环境中学习可控和多样的玩家行为

本文介绍了一种强化学习框架，能够在无需依赖人类游戏数据的情况下实现可控和多样的玩家行为。现有方法通常需要大规模的玩家轨迹数据、为不同玩家类型训练单独的模型，或者无法直接将可解释的行为参数与学习策略映射起来，限制了其可扩展性和可控性。我们在N维连续空间中定义玩家行为，并从一个包含真实人类风格子集的区域中均匀采样目标行为向量。在训练过程中，每个智能体接收当前和目标行为向量作为输入，奖励基于它们之间距离的归一化减少量。这使得策略能够学习行动如何影响行为统计，从而实现对侵略性、移动性和合作性等属性的平滑控制。基于PPO的单个多智能体策略可以在无需重新训练的情况下复现新的或未见过的游戏风格。在自定义多人Unity游戏中进行的实验表明，所提出的框架产生的行为多样性显著高于仅以胜利为目标的基线，并且能够在各种目标上可靠地匹配指定的行为向量。该方法为自动化游戏测试、游戏平衡、类人行为模拟以及在线游戏中替代断开连接的玩家提供了可扩展的解决方案。

Summary / 总结

This paper presents a reinforcement learning framework that enables the generation of controllable and diverse player behaviors in multi-agent environments without the need for human gameplay data. The method defines player behavior in an N-dimensional space and uses a single PPO-based policy to learn how actions influence behavioral statistics. The framework allows for smooth control over attributes like aggressiveness, mobility, and cooperativeness. Experiments in a custom Unity game demonstrate that the proposed approach produces significantly more behavioral diversity compared to a win-only baseline and reliably matches specified behavior vectors across various targets, offering a scalable solution for playtesting, game balancing, and simulating human-like behavior.

该论文提出了一种强化学习框架，能够在无需人类游戏数据的情况下生成可控且多样的多智能体环境中的玩家行为。该方法将玩家行为定义在N维空间中，并使用单个基于PPO的策略来学习行为统计如何受到行动的影响。该框架允许对侵略性、移动性和合作性等属性进行平滑控制。在自定义Unity游戏中进行的实验表明，所提出的方法相比仅以获胜为目标的基线方法能够产生更多的行为多样性，并且能够在各种目标上可靠地匹配指定的行为向量，为游戏测试、平衡调整和模拟人类行为提供了可扩展的解决方案。
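The reward described in the abstract, the normalized reduction in distance between the current and target behavior vectors, can be sketched as follows. The abstract does not state the distance metric or the normalizer, so Euclidean distance and division by a fixed `max_distance` are assumptions here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two behavior vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("behavior vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def behavior_reward(prev_behavior: Sequence[float],
                    curr_behavior: Sequence[float],
                    target: Sequence[float],
                    max_distance: float) -> float:
    """Reward for moving the agent's behavior statistics toward the target.

    Positive when the current behavior vector is closer to the target than
    the previous one, negative when it drifts away. `max_distance` is an
    assumed normalizer, e.g. the diameter of the sampled behavior region.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    d_prev = distance(prev_behavior, target)
    d_curr = distance(curr_behavior, target)
    return (d_prev - d_curr) / max_distance


if __name__ == "__main__":
    # Toy 2D behavior space, e.g. (aggressiveness, mobility) in [0, 1].
    target = [1.0, 0.0]
    print(behavior_reward([0.0, 0.0], [0.5, 0.0], target, max_distance=1.0))
```

Because the reward depends on the change in distance rather than the distance itself, an agent is credited for progress from wherever its behavior statistics currently sit, which is consistent with a single policy being steered toward many different target vectors.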