WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu
First: 2025-12-11T18:59:58+00:00 · Latest: 2025-12-11T18:59:58+00:00
Comments: Preprint; 80 pages, 37 figures, 29 tables; Project Page at https://worldbench.github.io/worldlens
Abstract
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
中文标题/摘要
标题:WorldLens:全面评估驾驶世界模型在真实世界中的表现
生成的世界模型正在重塑具身AI,使代理能够合成逼真的4D驾驶环境,尽管这些环境看起来很真实,但往往在物理上或行为上失败。尽管取得了快速进展,该领域仍然缺乏一种统一的方法来评估生成的世界是否保留了几何结构、遵循物理规律或支持可靠的控制。我们引入了WorldLens,这是一种全面的基准测试,评估模型如何在其生成的世界中构建、理解和表现。它涵盖了五个方面——生成、重建、动作跟随、下游任务和人类偏好——共同涵盖了视觉真实感、几何一致性、物理合理性以及功能可靠性。在这些维度上,没有任何现有的世界模型能够全面优秀:那些具有强大纹理的模型往往违反物理规律,而几何稳定的模型缺乏行为准确性。为了使客观指标与人类判断相一致,我们进一步构建了包含大量人类标注视频的大规模数据集WorldLens-26K,这些视频附有数值评分和文本解释,并开发了WorldLens-Agent,这是一种从这些注释中提炼出的评估模型,以实现可扩展且可解释的评分。基准测试、数据集和代理共同构成了衡量世界真实性的统一生态系统——不仅通过它们看起来多么真实,还通过它们表现得多么真实来评判未来的模型。
Summary / 总结
WorldLens evaluates generative world models in driving scenarios across five aspects: generation, reconstruction, action-following, downstream tasks, and human preference. It finds that no existing model excels on all of them: models with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, the authors build WorldLens-26K, a large-scale dataset of human-annotated videos, and WorldLens-Agent, an evaluation model distilled from those annotations, to provide scalable and explainable scoring. Together they form a unified ecosystem for measuring world fidelity in embodied AI.
WorldLens 是一个全面的基准,用于评估生成的世界模型在驾驶场景中的表现,涵盖了生成、重构、动作跟随、下游任务和人类偏好等方面。它评估视觉真实性、几何一致性、物理可合理性以及功能可靠性。目前没有模型在所有方面都表现出色,一些模型违反物理规律,而另一些则缺乏行为准确性。为了与人类判断一致,开发了包含大量人类标注视频的 WorldLens-26K 数据集和 WorldLens-Agent 评估模型,以实现可扩展且可解释的评分。该基准不仅标准化了对模型视觉真实性的评估,还扩展到了行为真实性的评估。
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang
First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00
Comments: Project page: https://idea-research.github.io/SceneMaker/
Abstract
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
中文标题/摘要
标题:SceneMaker:分阶段的3D场景生成模型,包含解遮挡和姿态估计
本文提出了一种名为SceneMaker的分阶段3D场景生成框架。由于缺乏足够的开放集解遮挡和姿态估计先验,现有方法在严重遮挡和开放集设置下难以同时生成高质量的几何结构和准确的姿态。为了解决这些问题,我们首先将解遮挡模型与3D物体生成分离,并通过利用图像数据集和收集的解遮挡数据集增强它,以处理更多样化的开放集遮挡模式。然后,我们提出了一种统一的姿态估计模型,结合全局和局部机制,以提高准确度。此外,我们构建了一个开放集3D场景数据集,以进一步扩展姿态估计模型的泛化能力。全面的实验表明,我们的分阶段框架在室内和开放集场景上具有优越性。我们的代码和数据集发布在https://idea-research.github.io/SceneMaker/。
Summary / 总结
SceneMaker is a decoupled 3D scene generation framework that addresses the challenges of open-set de-occlusion and pose estimation under severe occlusion. It decouples the de-occlusion model from 3D object generation and enhances it with diverse open-set occlusion patterns from image and de-occlusion datasets. Additionally, it introduces a unified pose estimation model with global and local mechanisms for improved accuracy. Experiments show that SceneMaker outperforms existing methods on both indoor and open-set scenes.
SceneMaker 是一个分隔开的 3D 场景生成框架,旨在解决在严重遮挡情况下开放集去遮挡和姿态估计的挑战。它将去遮挡模型与 3D 对象生成分离,并通过图像数据集中的多样开放集遮挡模式对其进行增强。此外,它引入了一个结合全局和局部机制的统一姿态估计模型,以提高准确性。实验表明,SceneMaker 在室内和开放集场景中均优于现有方法。
Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng
Venue: www
First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00
Comments: Project Page: https://www.cs.virginia.edu/~tsx4zn/stereowalk/
Abstract
The success of foundation models in language and vision has motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient.
We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
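The claim that stereo inputs resolve the depth-scale ambiguity can be illustrated with the standard rectified-stereo relation Z = f * B / d, where a known baseline B turns pixel disparity d into metric depth. The sketch below is only that textbook relation, not StereoWalker's pipeline; the function name and the camera parameters are hypothetical values chosen for illustration.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth Z = f * B / d for a rectified stereo pair.

    disparity_px: horizontal pixel shift of a point between left and right images
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to give a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline, 42 px disparity.
depth_m = depth_from_disparity(disparity_px=42.0, focal_px=700.0, baseline_m=0.12)
print(f"estimated depth: {depth_m:.2f} m")
```

A monocular camera has no baseline term, so the same image is consistent with any global scale; the known baseline is what pins the depth to meters here.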
中文标题/摘要
标题:利用立体视觉和中级视觉赋能动态城市导航
基础模型在语言和视觉领域的成功激发了对端到端机器人导航基础模型(NFMs)的研究。NFMs 直接将单目视觉输入映射到控制动作,完全忽略了中级视觉模块(跟踪、深度估计等)。虽然假设视觉能力会隐式出现是令人信服的,但需要大量的像素到动作监督,这很难获得。特别是在动态和非结构化的环境中,这种挑战尤为突出,因为稳健的导航需要精确的几何和动态理解,而单目视图中的深度尺度歧义进一步限制了准确的空间推理。在本文中,我们表明依赖单目视觉并忽略中级视觉先验是低效的。我们提出了 StereoWalker,它通过添加立体视觉输入和显式的中级视觉(如深度估计和密集像素跟踪)来增强 NFMs。我们的直觉很简单:立体视觉解决了深度尺度歧义问题,现代中级视觉模型在动态场景中提供了可靠的几何和运动结构。我们还整理了一个大型的立体视觉导航数据集,该数据集通过互联网立体视频的自动动作注释来支持 StereoWalker 的训练,并促进未来的研究。通过我们的实验,我们发现中级视觉使 StereoWalker 能够仅使用 1.5% 的训练数据达到与最新技术相当的性能,并且使用完整数据时超过了最新技术。我们还观察到,立体视觉的导航性能高于单目输入。
Summary / 总结
This paper addresses the challenge of end-to-end robot navigation by integrating stereo vision and mid-level vision modules into navigation foundation models (NFMs). The authors introduce StereoWalker, which uses stereo inputs together with explicit depth estimation and dense pixel tracking to resolve depth-scale ambiguity and provide reliable geometric and motion structure. Experiments show that StereoWalker achieves performance comparable to state-of-the-art models using only 1.5% of the training data and surpasses them when using the full data. Stereo vision also yields higher navigation performance than monocular input.
本文通过将立体视觉和中间视觉模块整合到神经基础模型(NFMs)中,解决了在动态城市环境中高效机器人导航的挑战。作者提出了StereoWalker,该方法使用立体输入和显式的深度估计和密集像素跟踪来解决深度模糊问题,并增强几何和运动理解。实验表明,StereoWalker仅使用1.5%的训练数据就能达到与最先进的方法相当的性能,并在使用完整数据时超越它们,证明了立体视觉比单目输入的优势。
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00
Comments: Project page: https://snap-research.github.io/omni-attribute
Abstract
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
中文标题/摘要
标题:Omni-Attribute:面向视觉概念个性化的大词汇量属性编码器
视觉概念个性化旨在将特定图像属性,如身份、表情、光照和风格,转移到未见的上下文中。然而,现有方法依赖于通用图像编码器的整体嵌入,这会将多个视觉因素纠缠在一起,使得难以隔离单一属性。这通常会导致信息泄露和不一致的合成。为了解决这一局限性,我们引入了Omni-Attribute,这是第一个用于学习高保真度、属性特定表示的大词汇量图像属性编码器。我们的方法联合设计了数据和模型:(i) 我们收集了带有正负属性标注的语义关联图像对,以明确地教导编码器保留或抑制什么;(ii) 我们采用了一种双目标训练范式,平衡生成保真度与对比性解耦。生成的嵌入在开放词汇量属性检索、个性化和组合生成方面证明是有效的,并在多个基准测试中达到了最先进的性能。
Summary / 总结
Omni-Attribute is an open-vocabulary image attribute encoder designed to isolate specific visual attributes like identity, expression, and style. It uses semantically linked image pairs and a dual-objective training approach to learn high-fidelity attribute-specific representations, improving attribute retrieval, personalization, and compositional generation. Results show state-of-the-art performance across multiple benchmarks.
Omni-Attribute 是一种开放词汇量的图像属性编码器,旨在隔离并转移特定的视觉属性,如身份、表情、光照和风格,应用于未见情境。它通过使用语义关联的图像对和双重目标训练方法来解决现有方法的局限性。该编码器在开放词汇量属性检索、个性化和组合生成等多个基准上取得了最先进的性能。
Hierarchical Dataset Selection for High-Quality Data Sharing
Authors: Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou
First: 2025-12-11T18:59:55+00:00 · Latest: 2025-12-11T18:59:55+00:00
Abstract
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
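The idea of modeling utility at both the dataset and the group level can be sketched with a simple shrinkage estimator: each dataset's estimate is pulled toward the mean of its group, so a dataset that has never been evaluated inherits its group's estimate. This is an illustrative stand-in, not the DaSH model itself; the function name, the `prior_strength` parameter, and the dataset and group names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_utilities(groups, observations, prior_strength=1.0):
    """Blend per-dataset rewards with the mean reward of the dataset's group.

    groups:         dict mapping dataset name -> group name
    observations:   dict mapping dataset name -> list of observed rewards
    prior_strength: pseudo-count given to the group mean (assumed knob)
    """
    rewards_by_group = defaultdict(list)
    for dataset, rewards in observations.items():
        rewards_by_group[groups[dataset]].extend(rewards)

    all_rewards = [r for rewards in observations.values() for r in rewards]
    global_mean = mean(all_rewards) if all_rewards else 0.0

    estimates = {}
    for dataset, group in groups.items():
        group_rewards = rewards_by_group[group]
        group_mean = mean(group_rewards) if group_rewards else global_mean
        own = observations.get(dataset, [])
        # Posterior-mean style blend: own evidence plus a group-level pseudo-observation.
        estimates[dataset] = (sum(own) + prior_strength * group_mean) / (len(own) + prior_strength)
    return estimates


# Hypothetical pool: two groups, one observed dataset in each.
groups = {"a1": "group_a", "a2": "group_a", "b1": "group_b", "b2": "group_b"}
observations = {"a1": [0.8, 0.9], "b1": [0.2]}
estimates = estimate_utilities(groups, observations)
ranked = sorted(estimates, key=estimates.get, reverse=True)
```

In this toy pool the unobserved dataset `a2` ranks above `b2` purely because its group has looked more useful so far, which is the kind of generalization from limited observations the abstract describes.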
中文标题/摘要
标题:层次化数据集选择以实现高质量数据共享
现代机器学习的成功依赖于高质量训练数据的访问。在许多实际场景中,如从公共存储库获取数据或机构间共享数据时,数据自然地组织成不同的数据集,这些数据集在相关性、质量和实用性方面各不相同。选择哪些存储库或机构来搜索有用的数据库集,以及将哪些数据集纳入模型训练是至关重要的决策,但现有的大多数方法选择单个样本,并将所有数据视为同等重要,忽略了数据集及其来源之间的差异。在本工作中,我们形式化了数据集选择任务:在资源受限的情况下,从大量异构数据集中选择整个数据集以提高下游性能。我们提出了层次化数据集选择(DaSH)方法,该方法在数据集和组(例如,集合、机构)级别建模效用,从而能够从有限的观察中高效泛化。在两个公开基准(Digit-Five和DomainNet)上,DaSH在准确率上比最先进的数据选择基线高出26.2%,同时需要显著减少探索步骤。消融实验表明,DaSH在低资源设置和缺乏相关数据集的情况下具有鲁棒性,使其适用于实际多源学习工作流中的可扩展和自适应数据集选择。
Summary / 总结
This work addresses the challenge of selecting high-quality datasets for machine learning by proposing DaSH, a method that considers both dataset and group levels. It outperforms existing methods by up to 26.2% in accuracy across two benchmarks, requiring fewer exploration steps. DaSH is robust in low-resource settings and suitable for practical multi-source learning workflows.
该研究通过提出DaSH方法解决了选择高质量数据集的问题,该方法在数据集和组别层面建模效用。在两个基准测试中,DaSH的准确率最高比现有方法高出26.2%,同时需要更少的探索步骤。该方法在资源有限的环境下表现稳健,并适用于实际的多源学习工作流程。
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Authors: Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang
First: 2025-12-11T18:59:53+00:00 · Latest: 2025-12-11T18:59:53+00:00
Comments: Project website: https://qitaozhao.github.io/E-RayZer
Abstract
Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
中文标题/摘要
标题:E-RayZer:自我监督的3D重建作为空间视觉预训练
自我监督的预训练已经彻底改变了语言、单个2D图像和视频的基础模型,但在从多视角图像学习3D感知表示方面仍处于探索阶段。本文介绍了E-RayZer,这是一种直接从未标记图像中学习真正3D感知表示的自我监督大型3D视觉模型。与通过潜在空间视图合成间接推断3D的RayZer等先前的自我监督方法不同,E-RayZer直接在3D空间中操作,执行具有显式几何学的自我监督3D重建。这种表述消除了捷径解决方案,产生了几何学上可靠的表示。为了确保收敛性和可扩展性,我们引入了一种新颖的细粒度学习课程,从简单的样本到困难的样本组织训练,并以完全无监督的方式协调异构数据源。实验表明,E-RayZer在姿态估计方面显著优于RayZer,在匹配或有时超越完全监督重建模型(如VGGT)方面表现出色。此外,其学习的表示在转移到3D下游任务时优于领先的空间视觉预训练模型(例如DINOv3、CroCo v2、VideoMAE V2和RayZer),确立了E-RayZer作为3D感知视觉预训练的新范式。
Summary / 总结
E-RayZer is a self-supervised 3D vision model that learns 3D-aware representations directly from unlabeled images through explicit geometry-based self-supervised 3D reconstruction. It introduces a fine-grained learning curriculum to ensure convergence and scalability. Experiments show that E-RayZer outperforms RayZer on pose estimation and matches or surpasses fully supervised models on reconstruction tasks, while its representations excel in 3D downstream tasks compared to other visual pre-training models.
E-RayZer 是一种直接从未标注图像中学习 3D 意识表示的自监督 3D 视觉模型,使用显式几何进行自监督 3D 重建,避免了捷径解决方案并提供了几何上扎实的表示。该模型采用细粒度的学习课程确保收敛性和可扩展性,并在姿态估计上优于 RayZer 等先前方法,在完全监督的重建模型上取得竞争力的结果。其表示在 3D 下游任务中也表现出色,超越了领先视觉预训练模型。
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
First: 2025-12-11T18:59:52+00:00 · Latest: 2025-12-11T18:59:52+00:00
Comments: Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1
Abstract
Reinforcement learning (RL), previously shown to be effective in large language and multi-modal models, has recently been extended successfully to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert at every stage from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
中文标题/摘要
标题:文本到3D生成中我们准备好使用强化学习了吗?一项渐进式研究
强化学习(RL)在大型语言和多模态模型中已被证明有效,并成功扩展到增强2D图像生成。然而,将RL应用于3D生成仍然鲜有探索,因为3D对象的空间复杂性更高,需要全局一致的几何结构和精细的局部纹理。这使得3D生成对奖励设计和RL算法极为敏感。为应对这些挑战,我们首次系统地研究了RL在多个维度上对文本到3D自回归生成的影响。(1) 奖励设计:我们评估了奖励维度和模型选择,表明与人类偏好的一致性至关重要,并且通用多模态模型为3D属性提供了稳健的信号。(2) RL算法:我们研究了GRPO变体,强调了基于token的优化的有效性,并进一步探讨了训练数据和迭代的扩展。(3) 文本到3D基准:由于现有基准无法衡量3D生成模型的隐式推理能力,我们引入了MME-3DR。(4) 高级RL范式:受3D生成自然层次结构的启发,我们提出了Hi-GRPO,通过专门的奖励集合优化全局到局部的3D生成。基于这些见解,我们开发了AR3D-R1,这是第一个增强的文本到3D模型,从粗略形状到纹理细化。我们希望这项研究为基于RL的3D生成推理提供见解。代码发布在https://github.com/Ivan-Tang-3D/3DGen-R1
ClusIR: Towards Cluster-Guided All-in-One Image Restoration
Authors: Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao
First: 2025-12-11T18:59:47+00:00 · Latest: 2025-12-11T18:59:47+00:00
Abstract
All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
中文标题/摘要
标题:ClusIR:面向聚类引导的一站式图像恢复
一站式图像恢复(AiOIR)旨在在一个统一框架内从多种退化中恢复高质量图像。然而,现有方法往往未能明确建模退化类型,并且难以适应复杂的或混合的退化。为了解决这些问题,我们提出了一种聚类引导图像恢复框架ClusIR,该框架通过可学习的聚类明确建模退化语义,并在空间域和频域中传播聚类感知线索以实现自适应恢复。具体而言,ClusIR 包含两个关键组件:概率聚类引导路由机制(PCGRM)和退化感知频域调制模块(DAFMM)。所提出的 PCGRM 将退化识别与专家激活分离,实现区分性退化感知和稳定的专家路由。同时,DAFMM 利用聚类引导的先验知识进行自适应频域分解和目标化调制,协同细化结构和纹理表示以提高恢复保真度。聚类引导的协同作用无缝地将语义线索与频域调制相结合,使 ClusIR 能够在各种退化中取得显著的恢复效果。在多种基准上的广泛实验表明,在多种场景下,ClusIR 达到了具有竞争力的性能。
Summary / 总结
ClusIR is a Cluster-Guided Image Restoration framework designed to handle diverse image degradations within a unified framework. It introduces a Probabilistic Cluster-Guided Routing Mechanism and a Degradation-Aware Frequency Modulation Module to model degradation types explicitly and propagate cluster-aware cues for adaptive restoration. Experiments on diverse benchmarks show that ClusIR reaches competitive performance in several scenarios across a wide range of degradations.
ClusIR 是一种集群引导的图像恢复框架,旨在在一个统一框架中处理多种图像退化。它使用概率集群引导路由机制和退化感知频域调制模块来建模退化类型,并传播集群感知线索以实现自适应恢复。实验结果表明,ClusIR 在各种场景下优于现有方法,实现了不同退化类型的高质量图像恢复。
Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
Authors: Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang
First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00
Comments: Project Page: https://jiawei-yang.github.io/Flex/
Abstract
We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
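One common way to let a small set of learnable tokens summarize a much larger token set is cross-attention, where each scene token acts as a query over all image tokens. The sketch below assumes that mechanism (the abstract does not specify it), omits learned projections, and uses toy values; `cross_attend` and the token values are hypothetical.

```python
import math


def cross_attend(scene_tokens, image_tokens):
    """Compress N image tokens into K scene tokens with softmax cross-attention.

    scene_tokens: K query vectors (stand-ins for learnable parameters)
    image_tokens: N vectors, flattened over all cameras and timesteps
    Returns K vectors, each a convex combination of the image tokens.
    """
    dim = len(image_tokens[0])
    compressed = []
    for query in scene_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_tokens
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        compressed.append(
            [sum(w * tok[j] for w, tok in zip(weights, image_tokens)) for j in range(dim)]
        )
    return compressed


# Toy input: 3 cameras x 2 timesteps = 6 image tokens of dimension 2.
image_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5], [0.4, 0.6]]
scene_tokens = [[2.0, 0.0], [0.0, 2.0]]  # K = 2 queries
compressed = cross_attend(scene_tokens, image_tokens)
```

The downstream policy then sees K vectors instead of N, which is where the throughput gain in this kind of design would come from; no camera geometry enters the computation.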
中文标题/摘要
标题:面向端到端驾驶的高效有效多摄像头编码
我们提出了Flex,一种高效的场景编码器,解决了处理大量多摄像头数据的计算瓶颈问题。Flex 使用一组可学习的场景标记,联合编码来自不同摄像头和时间步长的所有图像标记的信息。通过设计,我们的方法是几何无关的,直接从数据中学习紧凑的场景表示,而不依赖于显式的三维归纳偏置,如鸟瞰图(BEV)、占用或三平面表示,这些在先前的工作中很常见。这种整体编码策略大幅压缩了供下游大型语言模型(LLM)基于的策略模型使用的视觉输入。在包含20,000小时驾驶数据的大型自有数据集上评估,我们的Flex实现了2.2倍的推理吞吐量提升,并在驾驶性能上大幅优于最先进的方法。此外,我们展示了这些紧凑的场景标记在没有任何显式监督的情况下发展出了一种场景分解的新兴能力。我们的发现挑战了三维先验是必要的这一传统假设,证明了数据驱动的联合编码策略为未来的自动驾驶系统提供了一条更可扩展、更高效和更有效的方法。
Summary / 总结
Flex is a scene encoder designed to address the computational challenges of processing multi-camera data in end-to-end autonomous driving. It uses a small set of learnable scene tokens to encode information from all cameras and timesteps, avoiding the use of 3D inductive biases. Evaluated on a large dataset, Flex improves driving performance and achieves 2.2x greater inference throughput compared to state-of-the-art methods, suggesting a more scalable and efficient approach to autonomous driving.
Flex 是一种用于高效处理端到端自动驾驶中高体积多摄像头数据的场景编码器。它使用少量可学习的场景令牌联合编码来自所有摄像头和时间步的信息,而不依赖于显式的3D先验。在大规模数据集上评估表明,Flex 的推理吞吐量提高了2.2倍,并且在驾驶性能上优于最先进的方法,还展示了在无监督的情况下发展出的场景分解能力,挑战了3D先验是必要的假设。
ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
Authors: Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu
First: 2025-12-11T18:59:46+00:00 · Latest: 2025-12-11T18:59:46+00:00
Comments: Project page: https://implicit-rdp.github.io
Abstract
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
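To make the idea of causal attention over asynchronous visual and force tokens concrete, the sketch below merges a slow visual stream and a fast force stream into one time-ordered sequence and builds a mask in which a token may only attend to tokens with an earlier or equal timestamp. This is an assumed, simplified reading of the mechanism, not the authors' implementation; the sampling periods and function names are hypothetical.

```python
def build_tokens(horizon_ms, visual_period_ms, force_period_ms):
    """Return (timestamp_ms, modality) pairs for both streams, sorted by time."""
    tokens = [(t, "visual") for t in range(0, horizon_ms + 1, visual_period_ms)]
    tokens += [(t, "force") for t in range(0, horizon_ms + 1, force_period_ms)]
    return sorted(tokens)


def causal_mask(tokens):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[t_key <= t_query for t_key, _ in tokens] for t_query, _ in tokens]


# Hypothetical rates: one visual token every 100 ms, one force token every 25 ms.
tokens = build_tokens(horizon_ms=100, visual_period_ms=100, force_period_ms=25)
mask = causal_mask(tokens)

query = tokens.index((50, "force"))
sees_past_visual = mask[query][tokens.index((0, "visual"))]
sees_future_visual = mask[query][tokens.index((100, "visual"))]
```

Under this mask a force token at 50 ms can use the most recent visual context but never a future frame, which is the property that permits closed-loop adjustment at the force rate.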
中文标题/摘要
标题:ImplicitRDP:一种结合结构化慢速快速学习的端到端视觉-力扩散策略
人类级别的接触丰富操作依赖于两种关键模态的独特作用:视觉提供丰富但缓慢的空间上下文,而力感知捕捉快速的高频局部接触动力学。由于这些信号的基本频率和信息差异,将这些信号集成在一起具有挑战性。在本文中,我们提出了一种统一的端到端视觉-力扩散策略ImplicitRDP,该策略在单一网络中结合了视觉规划和反应性力控制。我们引入了一种结构化慢速快速学习机制,利用因果注意力同时处理异步视觉和力标记,使策略能够在力频率下执行闭环调整,同时保持动作片段的时间连贯性。此外,为了缓解端到端模型在不同模态间调整权重时出现的模态崩溃问题,我们提出了基于虚拟目标的表示正则化。该辅助目标将力反馈映射到与动作相同的空间,提供比原始力预测更强的物理基础学习信号。广泛的实验表明,ImplicitRDP 在接触丰富任务中显著优于仅视觉和分层基线,具有简化训练管道的情况下实现了更好的反应性和成功率。代码和视频将在 https://implicit-rdp.github.io 公开。
Summary / 总结
The research aims to improve human-level contact-rich manipulation by integrating visual and force signals. ImplicitRDP, an end-to-end visual-force diffusion policy, is proposed to handle the frequency and informational disparities between these signals. The method uses Structural Slow-Fast Learning to process asynchronous visual and force tokens and Virtual-target-based Representation Regularization to ensure effective modality integration. Experiments show that ImplicitRDP outperforms vision-only and hierarchical baselines, achieving better reactivity and success rates.
研究旨在通过统一的端到端策略ImplicitRDP整合视觉和力信号,以提高接触丰富的操作水平。它使用结构化的慢速-快速学习来处理视觉和力信号之间的频率差异,并使用基于虚拟目标的表示正则化来防止模态崩溃。实验表明,ImplicitRDP在反应性和成功率方面优于视觉仅和分层基线,且具有简化训练流程。
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
Venue: H. Ding et al., "MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11400-11416, 2025
First: 2025-12-11T18:59:44+00:00 · Latest: 2025-12-11T18:59:44+00:00
Comments: IEEE TPAMI, Project Page: https://henghuiding.com/MeViS/
Abstract
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
中文标题/摘要
标题:MeViS:一种用于参考运动表达视频分割的多模态数据集
本文提出了一种大规模多模态数据集,用于参考运动表达视频分割,专注于根据物体运动的语言描述分割和跟踪视频中的目标物体。现有的参考视频分割数据集通常关注显眼的物体,并使用富含静态属性的语言表达,这可能允许目标物体在单帧中被识别。这些数据集在视频和语言中都低估了运动的作用。为了探索使用运动表达和运动推理线索进行像素级视频理解的可行性,我们引入了MeViS数据集,该数据集包含33,072个人标注的运动表达,包括文本和音频,覆盖了2,006个复杂场景中的8,171个物体。我们对MeViS支持的4项任务中的15种现有方法进行了基准测试,包括6种参考视频对象分割(RVOS)方法、3种音频引导视频对象分割(AVOS)方法、2种参考多对象跟踪(RMOT)方法以及4种用于新引入的参考运动表达生成(RMEG)任务的视频字幕方法。结果表明现有方法在解决运动表达引导的视频理解方面存在弱点和局限性。我们进一步分析了挑战并提出了一种LMPM++方法,该方法在RVOS/AVOS/RMOT中达到了新的最佳结果。我们的数据集为复杂视频场景中运动表达引导的视频理解算法的发展提供了平台。提出的MeViS数据集和方法的源代码可在https://henghuiding.com/MeViS/公开获取。
Summary / 总结
This paper introduces MeViS, a large-scale multi-modal dataset for segmenting and tracking target objects in videos based on their motion descriptions. It contains 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos. The authors benchmark 15 existing methods across 4 tasks and propose LMPM++ for RVOS/AVOS/RMOT, achieving new state-of-the-art results. The dataset highlights the importance of motion in video understanding and provides a platform for developing motion expression-guided algorithms in complex scenes.
该论文提出了一个名为MeViS的多模态数据集,包含33,072个人标注的运动表达文本和音频,用于2,006个视频中的8,171个对象。该数据集旨在探索运动在视频理解中的作用。作者对15种现有方法进行了四个任务的基准测试,并提出了LMPM++方法用于RVOS/AVOS/RMOT,取得了新的最佳结果。该数据集为在复杂视频场景中开发运动表达引导的视频理解算法提供了平台。
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov
First: 2025-12-11T18:59:34+00:00 · Latest: 2025-12-11T18:59:34+00:00
Comments: Project page: https://snap-research.github.io/Video-AlcheMinT
Abstract
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
中文标题/摘要
标题:AlcheMinT:多参考一致视频生成的细粒度时间控制
基于大型扩散模型的主体驱动视频生成的最新进展,使以用户提供的主体为条件的个性化内容合成成为可能。然而,现有方法缺乏对主体出现和消失的细粒度时间控制,而这对于组合式视频合成、故事板和可控动画等应用至关重要。我们提出了AlcheMinT,这是一种统一框架,为主体驱动的视频生成引入了显式的时间戳条件。我们的方法引入了一种新颖的位置编码机制,能够对时间区间进行编码(在我们的场景中,这些区间与主体身份相关联),同时与预训练视频生成模型的位置嵌入无缝集成。此外,我们还引入了主体描述性文本标记,以加强视觉身份与视频字幕之间的绑定,减少生成过程中的歧义。通过逐标记拼接,AlcheMinT 避免了任何额外的交叉注意力模块,参数开销可以忽略不计。我们建立了一个基准,用于评估多主体身份保留、视频保真度和时间遵循度。实验结果表明,AlcheMinT 达到了与最先进的视频个性化方法相当的视觉质量,同时首次实现了对视频中多主体生成的精确时间控制。项目页面为 https://snap-research.github.io/Video-AlcheMinT
Summary / 总结
AlcheMinT is a unified framework that introduces explicit timestamps for fine-grained temporal control in subject-driven video generation. It uses a novel positional encoding mechanism to encode temporal intervals and integrates subject-descriptive text tokens to enhance visual identity binding. Experiments show that AlcheMinT matches state-of-the-art methods in visual quality and enables precise temporal control over multi-subject generation for the first time.
研究旨在增强主体驱动视频生成中的时间控制,这对于合成视频和可控动画等应用至关重要。AlcheMinT 引入了明确的时间戳和一种新颖的位置编码机制,以实现细粒度的时间控制,同时与预训练模型无缝集成。该方法避免了额外的交叉注意力模块,保持了微乎其微的参数开销。实验结果表明,AlcheMinT 在视觉质量上与最先进的方法相当,并首次实现了对视频中多主体生成的精确时间控制。
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
First: 2025-12-11T18:59:22+00:00 · Latest: 2025-12-11T18:59:22+00:00
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
中文标题/摘要
标题:VL-JEPA:面向视觉-语言的联合嵌入预测架构
我们介绍了基于联合嵌入预测架构(JEPA)的视觉-语言模型VL-JEPA。与经典视觉语言模型(VLM)以自回归方式生成标记不同,VL-JEPA预测目标文本的连续嵌入。通过在抽象表示空间中学习,该模型专注于与任务相关的语义,同时抽象掉表面语言的变异性。在严格控制的比较中,与使用相同视觉编码器和训练数据的标准标记空间VLM训练相比,VL-JEPA在可训练参数减少50%的情况下实现了更强的性能。在推理时,仅在需要时调用轻量级文本解码器将VL-JEPA预测的嵌入转换为文本。我们展示了VL-JEPA原生支持选择性解码,将解码操作减少2.85倍,同时保持与非自适应均匀解码相似的性能。除了生成之外,VL-JEPA的嵌入空间自然支持开放词汇分类、文本到视频检索和判别式VQA,无需任何架构修改。在八个视频分类数据集和八个视频检索数据集上,VL-JEPA的平均性能超过了CLIP、SigLIP2和Perception Encoder。同时,尽管只有1.6B参数,该模型在四个VQA数据集(GQA、TallyQA、POPE和POPEv2)上的性能与经典VLM(InstructBLIP、QwenVL)相当。
Summary / 总结
VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts instead of generating tokens autoregressively, focusing on task-relevant semantics. In a controlled comparison with standard token-space VLM training, it achieves stronger performance with 50% fewer trainable parameters, and it supports selective decoding that reduces decoding operations by 2.85x. On average, VL-JEPA outperforms CLIP, SigLIP2, and Perception Encoder across eight video classification and eight video retrieval datasets, and it performs comparably to classical VLMs such as InstructBLIP and QwenVL on four VQA datasets despite having only 1.6B parameters.
VL-JEPA 是一种使用联合嵌入预测架构的视觉-语言模型,它预测目标文本的连续嵌入而不是自回归生成标记。这种方法使得模型在更少的参数下表现出更强的性能,并支持选择性解码,将解码操作的数量减少2.85倍。VL-JEPA 在多个视频分类和检索任务中表现出色,并且在 VQA 任务中实现了与传统 VLM 相当的性能,尽管参数量仅为 1.6B。
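The selective decoding described in the VL-JEPA abstract above can be illustrated with a small sketch. The abstract does not say how the model decides when decoding is "needed", so the rule used here (invoke the decoder only when the predicted embedding drifts away from the last decoded one, measured by cosine similarity against a hypothetical `threshold`) is purely an assumption for illustration, as are the function names and the toy embedding stream.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_decode_steps(embeddings, threshold=0.9):
    """Return the steps at which a (hypothetical) text decoder would run.

    Assumed rule, not taken from the paper: decode the first embedding,
    then decode again only when the current embedding is less similar
    than `threshold` to the last embedding that was decoded.
    """
    decode_at = []
    last_decoded = None
    for step, emb in enumerate(embeddings):
        if last_decoded is None or cosine_similarity(emb, last_decoded) < threshold:
            decode_at.append(step)
            last_decoded = emb
    return decode_at

# Toy stream: three near-identical embeddings, then a semantic change.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]
steps = select_decode_steps(stream)
```

With this toy stream the decoder runs at two of the five steps. Uniform decoding would run it at every step, which is the kind of saving the abstract reports.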
EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation
Authors: Jiaqi Ma, Shengkai Hu, Xu Zhang, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan
First: 2025-12-04T18:59:10+00:00 · Latest: 2025-12-11T18:59:22+00:00
Abstract
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradation that require robust and versatile strategies. However, most existing approaches typically lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit the generalization across heterogeneous degradation. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradation and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
中文标题/摘要
标题:EvoIR:通过进化频率调制实现一站式图像恢复
一站式图像恢复(AiOIR)任务通常涉及多种退化,需要稳健且通用的策略。然而,大多数现有方法通常缺乏显式的频率建模,并依赖于固定的或启发式的优化计划,这限制了其在异构退化中的泛化能力。为了解决这些限制,我们提出了EvoIR,这是一种针对AiOIR的特定框架,引入了进化频率调制以实现动态和自适应的图像恢复。具体而言,EvoIR 使用频率调制模块(FMM),以显式方式将特征分解为高频频带和低频频带,并自适应地调制它们以增强结构保真度和细粒度细节。EvoIR 的核心是一种进化优化策略(EOS),通过基于群体的进化过程迭代调整频率感知目标,动态平衡结构准确性和感知保真度。其进化的指导进一步缓解了退化之间的梯度冲突并加速了收敛。通过结合FMM和EOS,EvoIR 比单独使用任一组件都取得了更大的改进,突显了它们的互补作用。在多个基准上的广泛实验表明,EvoIR 在一站式图像恢复方法中优于最先进的方法。
Summary / 总结
EvoIR is designed to handle diverse image degradation by introducing evolutionary frequency modulation. It uses a Frequency-Modulated Module (FMM) to decompose features into high- and low-frequency branches and adaptively modulate them for better structural fidelity and fine details. An Evolutionary Optimization Strategy (EOS) dynamically adjusts frequency-aware objectives, balancing structural accuracy and perceptual fidelity. Experiments show that EvoIR outperforms existing state-of-the-art methods in all-in-one image restoration tasks.
EvoIR 通过引入进化频率调制来处理多样化的图像退化问题。它使用频率调制模块(FMM)将特征分解为高频和低频分支,并自适应地调制它们以提高结构保真度和细粒度细节。进化优化策略(EOS)通过基于群体的进化过程动态调整频率感知目标,平衡结构准确性和感知保真度。实验表明,EvoIR 在一站式(all-in-one)图像恢复任务中优于现有最先进的方法。
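To make the idea of an explicit high/low-frequency split concrete, here is a generic sketch on a 1D signal. It is not the paper's Frequency-Modulated Module: the abstract does not specify the decomposition, so a moving average stands in for the low-pass branch, the residual plays the role of the high-frequency branch, and the two gains (`low_gain`, `high_gain`) are hypothetical fixed scalars rather than learned, adaptive modulations.

```python
def split_frequencies(signal, window=3):
    """Split a 1D signal into a smooth (low) part and a residual (high) part.

    The low branch is a moving average with edge clamping; the high
    branch is whatever the average removed, so low + high == signal.
    """
    half = window // 2
    low = []
    for i in range(len(signal)):
        start, stop = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[start:stop]) / (stop - start))
    high = [s - l for s, l in zip(signal, low)]
    return low, high

def modulate(low, high, low_gain=1.0, high_gain=1.0):
    """Recombine the two branches with per-branch gains (assumed scalars)."""
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]

signal = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
low, high = split_frequencies(signal)
identity = modulate(low, high)                  # gains of 1 give back the input
smoothed = modulate(low, high, high_gain=0.0)   # suppress fine detail
```

Raising `high_gain` above 1 would instead emphasize fine detail, which is the kind of trade-off between structure and detail the abstract attributes to adaptive modulation.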
Mull-Tokens: Modality-Agnostic Latent Thinking
Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu
First: 2025-12-11T18:59:08+00:00 · Latest: 2025-12-11T18:59:08+00:00
Comments: Project webpage: https://arijitray.com/multimodal_thinking/
Abstract
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
中文标题/摘要
标题:Mull-Tokens:模态无关的潜在思维
推理超越语言;现实世界需要关于空间、时间、功能等方面的推理,而这些是单靠语言无法传达的。现有的多模态模型探索图像推理潜力时是脆弱的且不具扩展性。它们依赖于调用专业工具、昂贵的图像生成或手工制作的推理数据来在文本和图像思维之间切换。相反,我们提供了一个更简单的替代方案——Mull-Tokens——一种模态无关的潜在标记,预先训练以在图像或文本模态中保留中间信息,让模型自由地思考以得出正确答案。我们研究了受潜在推理框架启发的最佳实践来训练Mull-Tokens。我们首先使用交错的文本-图像痕迹进行监督训练,然后仅使用最终答案进行微调。在四个具有挑战性的空间推理基准测试中,涉及解谜和换位思考等任务,我们证明Mull-Tokens在利用仅文本推理或交错图像-文本推理的多个基线之上有所改进,平均提高了3%,在解谜推理密集型分割上最高提高了16%。Mull-Tokens为关于文本和视觉推理的接地挑战提供了简单解决方案,以抽象方式在多种模态中思考。
Summary / 总结
Mull-Tokens are modality-agnostic latent tokens pre-trained to facilitate reasoning in both image and text modalities, enabling models to think freely towards the correct answer. They are trained using interleaved text-image traces and fine-tuned without supervision using final answers. Mull-Tokens outperform text-only and interleaved image-text reasoning baselines on four spatial reasoning benchmarks, achieving up to a 16% improvement in puzzle solving tasks.
Mull-Tokens 是一种模态无关的潜在标记,可以在图像或文本模态中存储中间信息,使模型能够自由地推理以得出正确答案。它们通过交错的文本-图像痕迹进行预训练,并仅使用最终答案进行微调。Mull-Tokens 在四个空间推理基准测试中优于仅文本和交错图像-文本推理基线,解决谜题任务时最高可提高 16% 的性能。
OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Authors: Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna
First: 2025-12-11T18:59:05+00:00 · Latest: 2025-12-11T18:59:05+00:00
Comments: Project page: https://snap-research.github.io/OmniView/
Abstract
Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. As a result, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
中文标题/摘要
标题:OmniView:一种全面的扩散模型,用于3D和4D视图合成
先前将相机控制注入扩散模型的方法主要集中在4D一致性任务的特定子集上:新颖视图合成、带有相机控制的文本到视频、图像到视频等。因此,这些分散的方法仅在可用的3D/4D数据的不同片段上进行训练。我们提出了OmniView,这是一种统一框架,能够泛化到广泛的4D一致性任务。我们的方法分别表示空间、时间和视图条件,使这些输入能够灵活组合。例如,OmniView可以从静态、动态和多视图输入中合成新颖视图,向前向后推断时间轨迹,并根据完整的相机控制从文本或图像提示生成视频。在各种基准和指标上,OmniView与特定任务模型竞争,提高相机条件下的扩散模型图像质量评分,最高可达多视图NVS LLFF数据集中的33%,动态NVS神经3D视频基准中的60%,静态相机控制RE-10K中的20%,并在文本条件下的视频生成中将相机轨迹误差减少4倍。凭借一个模型的强大泛化能力,OmniView展示了通用4D视频模型的可行性。项目页面可在https://snap-research.github.io/OmniView/获取。
Summary / 总结
OmniView is a unified framework for 3D and 4D view synthesis that generalizes across various 4D consistency tasks by separately representing space, time, and view conditions. It can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and generate videos from text or image prompts with full camera control. OmniView is competitive with task-specific models; among camera-conditioned diffusion models it improves image quality scores by up to 33% on LLFF (multiview NVS), 60% on the Neural 3D Video benchmark (dynamic NVS), and 20% on RE-10K (static camera control), and it reduces camera trajectory errors by 4x in text-conditioned video generation.
OmniView 是一个统一框架,用于 3D 和 4D 视图合成,能够跨多种 4D 一致性任务进行泛化。它分别表示空间、时间和视图条件,允许灵活的输入组合。关键发现包括:在多种基准测试中表现可与特定任务模型相媲美;在相机条件扩散模型中,图像质量分数最高提升 60%(动态 NVS 的 Neural 3D Video 基准);在文本条件视频生成中,相机轨迹误差降至原来的约四分之一。
GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Authors: Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh
First: 2025-12-11T18:59:02+00:00 · Latest: 2025-12-11T18:59:02+00:00
Comments: IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods are either high in visual fidelity but slow, or fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos, for which we report competitive quantitative and qualitative performance.
中文标题/摘要
标题:GaussianHeadTalk:基于音频驱动高斯泼溅的无晃动3D说话头像
语音驱动的说话头像最近出现,能够实现交互式虚拟角色。然而,实际应用受到限制,因为当前方法要么视觉保真度高但速度慢,要么速度快但时间上不稳定。扩散方法能够生成逼真图像,但在单样本(one-shot)设置中表现不佳。高斯泼溅(Gaussian Splatting)方法可以实时运行,但面部跟踪不准确或高斯映射不一致会导致输出不稳定和视频伪影,不利于实际应用。我们利用3D可变形模型(3D Morphable Models)对高斯泼溅进行映射来解决这一问题,以生成特定人物的虚拟角色。我们引入基于Transformer的模型参数预测,直接从音频出发,以保证时间一致性。基于单目视频和独立的语音音频输入,我们的方法能够生成实时说话头像视频,并在定量和定性评估中取得了具有竞争力的表现。
Summary / 总结
The research aims to improve the stability and realism of 3D talking heads driven by audio. It introduces a method combining Gaussian Splatting with 3D Morphable Models and transformer-based prediction of model parameters from audio to ensure temporal consistency. The method generates real-time talking head videos from monocular video and independent audio inputs, showing competitive performance in both quantitative and qualitative evaluations.
研究旨在提高由音频驱动的3D说话头像的稳定性和逼真度。方法结合了Gaussian Splatting和3D Morphable Models,并使用基于Transformer的模型参数预测,直接从音频出发以保证时间一致性。该方法可以从单目视频和独立音频生成实时的说话头像视频,并在定量和定性评估中表现出竞争力。
Stronger Normalization-Free Transformers
Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
First: 2025-12-11T18:58:49+00:00 · Latest: 2025-12-11T18:58:49+00:00
Abstract
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
中文标题/摘要
标题:无需归一化的更强变换器
尽管归一化层长期以来被视为深度学习架构中不可或缺的组成部分,但最近引入的动态双曲正切(DyT)表明,替代方案是可能的。点函数DyT通过限制极端值实现稳定的收敛,并达到与归一化相当的性能;本研究旨在寻找超越它的函数设计。我们首先研究点函数内在属性如何影响训练和性能。在此基础上,我们进行大规模搜索以寻找更有效的函数设计。通过这一探索,我们引入了$\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$,其中$\mathrm{erf}(x)$是缩放后的高斯累积分布函数,并将其识别为最有效的设计。Derf在包括视觉(图像识别和生成)、语音表示和DNA序列建模在内的广泛领域中优于LayerNorm、RMSNorm和DyT。我们的研究结果表明,Derf的性能提升主要来自于其更好的泛化能力,而不是更强的拟合能力。其简洁性和更强的性能使Derf成为无需归一化的变换器架构的实用选择。
Summary / 总结
This paper explores alternatives to normalization layers in deep learning architectures, motivated by the success of Dynamic Tanh (DyT) in achieving normalization-level performance without normalization. The authors conduct a large-scale search for a more effective function design and introduce Derf, defined as $\mathrm{erf}(αx + s)$, which outperforms existing methods like LayerNorm, RMSNorm, and DyT in various domains including vision, speech, and DNA sequence modeling. The performance gains of Derf are attributed to improved generalization rather than increased fitting capacity.
该论文探索了在深度学习架构中替代归一化层的方案,动机是Dynamic Tanh (DyT)在无需归一化的情况下达到了与归一化相当的性能。作者进行了一项大规模搜索以找到更有效的函数设计,并引入了Derf,定义为$\mathrm{erf}(αx + s)$,该设计在视觉、语音和DNA序列建模等多个领域优于LayerNorm、RMSNorm和DyT。Derf的性能提升主要归因于其更好的泛化能力,而非更强的拟合能力。
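The Derf definition quoted above, $\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$, is simple enough to sketch directly. The version below applies it element-wise using Python's standard `math.erf`. It assumes scalar `alpha` and `shift` with illustrative values; how these parameters are initialized or learned, and whether the full layer adds further scaling, is not stated in the abstract and is left out.

```python
import math

def derf(x, alpha=1.0, shift=0.0):
    """Point-wise Derf(x) = erf(alpha * x + shift), as given in the abstract."""
    return math.erf(alpha * x + shift)

def derf_layer(values, alpha=1.0, shift=0.0):
    """Apply Derf element-wise to a list of activations.

    alpha and shift are plain floats here (an assumption for illustration).
    """
    return [derf(v, alpha, shift) for v in values]

# Extreme activations are squashed into [-1, 1]; small ones pass almost linearly.
activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
squashed = derf_layer(activations, alpha=0.5)
```

Because erf saturates at -1 and 1, very large inputs are bounded, which matches the abstract's description of point-wise functions that constrain extreme values.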
Any4D: Unified Feed-Forward Metric 4D Reconstruction
Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan
First: 2025-12-11T18:57:39+00:00 · Latest: 2025-12-11T18:57:39+00:00
Comments: Project Website: https://any-4d.github.io/
Abstract
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
中文标题/摘要
标题:Any4D:统一的前馈式度量尺度4D重建
我们提出了Any4D,一种可扩展的多视图Transformer,用于度量尺度的密集前馈4D重建。Any4D直接生成N帧的逐像素运动和几何预测,而以往的工作通常专注于双视图密集场景流或稀疏3D点跟踪。此外,与其他基于单目RGB视频的最新4D重建方法不同,Any4D在可用时还可以处理额外的模态和传感器,如RGB-D帧、基于IMU的自运动和雷达多普勒测量。实现这种灵活框架的关键创新之一是4D场景的模块化表示;具体来说,每个视图的4D预测由多种因素编码:以自我为中心的因素(深度图和相机内参)在局部相机坐标系中表示,以外界为中心的因素(相机外参和场景流)在全局世界坐标系中表示。我们在多种设置下均取得了更优的性能:误差降低2-3倍,计算速度提升15倍,为多种下游应用开辟了道路。
Summary / 总结
Any4D is a scalable multi-view transformer designed for metric-scale, dense feed-forward 4D reconstruction. It directly generates per-pixel motion and geometry predictions for N frames, unlike previous methods that focus on 2-view dense scene flow or sparse 3D point tracking. Any4D can handle additional modalities such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements. The key innovation is a modular 4D scene representation, which allows for efficient processing and superior performance in diverse setups, with 2-3X lower error and 15X faster computation compared to existing methods.
Any4D 是一个可扩展的多视图Transformer,用于度量尺度的密集前馈4D重建,直接生成 N 帧的逐像素运动和几何预测。它可以处理多种模态,包括 RGB-D 帧、基于 IMU 的自运动和雷达多普勒测量。相比现有方法,其误差降低 2-3 倍,计算速度提升 15 倍。
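The split between egocentric factors (depthmaps and camera intrinsics, in local camera coordinates) and allocentric factors (camera extrinsics and scene flow, in global world coordinates) described in the Any4D abstract can be illustrated for a single pixel. This is a sketch of standard pinhole geometry, not Any4D's code: the function names, the camera-to-world convention for the extrinsics, and the numeric values are all assumptions.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Egocentric step: pixel (u, v) plus metric depth -> point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return [x, y, depth]

def camera_to_world(point_cam, rotation, translation):
    """Allocentric step: X_world = R @ X_cam + t (camera-to-world convention assumed)."""
    return [
        sum(rotation[row][col] * point_cam[col] for col in range(3)) + translation[row]
        for row in range(3)
    ]

def apply_scene_flow(point_world, flow_world):
    """Move a world-space point by its scene flow to get its next position."""
    return [p + f for p, f in zip(point_world, flow_world)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# One pixel, 100 px right of the principal point, at 2 m depth.
point_cam = unproject(u=420, v=240, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# Camera sits 1 m along the world x-axis with no rotation.
point_world = camera_to_world(point_cam, identity, translation=[1.0, 0.0, 0.0])
# The point moves 0.5 m along world z between frames.
point_next = apply_scene_flow(point_world, flow_world=[0.0, 0.0, 0.5])
```

Keeping depth and intrinsics in the camera frame and only the pose and flow in the world frame means each factor can be supplied by a sensor or predicted independently, which is the flexibility the abstract attributes to the modular representation.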
Curriculum-Based Reinforcement Learning for Autonomous UAV Navigation in Unknown Curved Tubular Conduit
Authors: Zamirddine Mari, Jérôme Pasquet, Julien Seinturier
First: 2025-12-11T18:57:29+00:00 · Latest: 2025-12-11T18:57:29+00:00
Abstract
Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to a continuous physical dynamics.
The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow and weakly perceptive conduits represents a central challenge.
中文标题/摘要
标题:基于课程学习的强化学习在未知曲管导管中的自主无人机导航
在受限的管状环境中自主无人机导航仍然是一个重大挑战,由于导管的几何约束、墙壁的接近性和此类场景固有的感知限制。我们提出了一种强化学习方法,使无人机能够在没有任何关于其几何形状的先验知识的情况下,仅依赖LiDAR的局部观察和管中心的条件视觉检测来导航未知的三维管状结构。相比之下,作为确定性基线的纯追迹算法可以从显式访问中心线中获益,从而创建一种信息不对称,旨在评估强化学习补偿几何模型缺失的能力。通过逐步暴露于逐渐弯曲的几何形状中,其中管中心经常从视觉中消失,代理通过渐进的课程学习策略进行训练。基于直接可见性、方向记忆和LiDAR对称性线索的转向协商机制对于确保在部分可观测条件下稳定导航至关重要。实验表明,PPO策略获得了稳健且可泛化的行为,即使在有限的几何信息访问下也始终优于确定性控制器。在高保真3D环境中的验证进一步证实了所学行为在连续物理动力学中的可转移性。因此,所提出的方法为在未知管状环境中的自主导航提供了一个完整的框架,并为工业、地下或医疗应用中通过狭窄和感知能力弱的导管提供了新的前景。
Summary / 总结
The paper addresses the challenge of autonomous drone navigation in unknown curved tubular environments by proposing a reinforcement learning approach. The method involves training a drone using a Curriculum Learning strategy that exposes it to increasingly complex geometries, relying on local observations from LiDAR and visual detection. The drone uses a turning-negotiation mechanism combining direct visibility, directional memory, and LiDAR cues to navigate. Experiments demonstrate that the PPO policy outperforms a deterministic controller, showing robust and generalizable behavior even with limited geometric information. The approach is validated in a high-fidelity 3D environment, confirming its transferability to physical dynamics and potential for industrial and medical applications.
论文提出了一种强化学习方法,用于解决未知弯曲管状通道中自主无人机导航的挑战。该方法通过渐进式的课程学习(Curriculum Learning)策略训练无人机,使其逐渐暴露于越来越弯曲的几何结构中。关键发现是,Proximal Policy Optimization (PPO) 策略在几何信息有限的情况下仍优于确定性的纯追迹(Pure Pursuit)算法,展示了鲁棒性和泛化能力。高保真3D环境中的验证进一步证实了所学行为向连续物理动力学的可迁移性。
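For readers unfamiliar with the deterministic baseline named above, the following is a minimal planar sketch of the classic Pure Pursuit rule: pick a lookahead point on the known centerline and steer along the circular arc that passes through it. The paper's baseline operates in 3D tubes and its exact formulation is not given in the abstract, so the 2D simplification, the function names, and the lookahead selection rule are assumptions.

```python
import math

def pick_lookahead(position, centerline, lookahead_distance):
    """Return the first centerline point at least lookahead_distance away.

    Falls back to the last point if none is far enough (assumed rule).
    """
    for point in centerline:
        if math.dist(position, point) >= lookahead_distance:
            return point
    return centerline[-1]

def pure_pursuit_curvature(pose, target):
    """Curvature of the arc from pose = (x, y, heading) through target.

    Classic rule: kappa = 2 * y_local / d^2, where y_local is the target's
    lateral offset in the robot frame and d its distance.
    """
    x, y, heading = pose
    dx, dy = target[0] - x, target[1] - y
    # Rotate the offset into the robot frame (x forward, y to the left).
    local_x = math.cos(heading) * dx + math.sin(heading) * dy
    local_y = -math.sin(heading) * dx + math.cos(heading) * dy
    dist_sq = local_x ** 2 + local_y ** 2
    if dist_sq == 0.0:
        return 0.0
    return 2.0 * local_y / dist_sq

centerline = [(0.5, 0.0), (1.0, 1.0), (1.0, 2.0)]
target = pick_lookahead((0.0, 0.0), centerline, lookahead_distance=1.0)
curvature = pure_pursuit_curvature((0.0, 0.0, 0.0), target)
```

The rule needs the centerline as input, which is exactly the information asymmetry the abstract describes: the baseline is given the geometry, while the RL agent has only LiDAR and intermittent visual detections.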
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
First: 2025-12-11T18:57:05+00:00 · Latest: 2025-12-11T18:57:05+00:00
Abstract
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
中文标题/摘要
标题:BabyVLM-V2:面向发展性基础视觉模型预训练和基准测试的框架
早期儿童的发展轨迹为高效样本预训练视觉基础模型提供了自然目标。我们引入了BabyVLM-V2,这是一种基于发展的婴儿启发式视觉语言建模框架,通过纵向多维度预训练集、多功能模型以及最重要的是DevCV工具箱进行认知评估,大幅改进了BabyVLM-V1。预训练集最大限度地覆盖了内容,同时减少了纵向婴儿为中心的视听素材的整理,产生了视频-语句、图像-语句和多轮对话数据,这些数据反映了婴儿的经验。DevCV工具箱将最近发布的NIH婴儿工具箱中所有与视觉相关的度量标准改编为涵盖空间推理、记忆和词汇理解的十项多模态任务基准套件,这些任务与早期儿童的能力相一致。实验结果表明,从零开始预训练的紧凑模型在DevCV工具箱上可以达到竞争力的表现,某些任务上优于GPT-4o。我们希望BabyVLM-V2框架能够促进发展性基础视觉模型预训练的研究。
Summary / 总结
The research aims to develop vision foundation models that can learn from early childhood development trajectories to improve sample efficiency. BabyVLM-V2 introduces a longitudinal, multifaceted pretraining set and a DevCV Toolbox for cognitive evaluation. The model achieves competitive performance on a benchmark suite of ten multimodal tasks, outperforming GPT-4o on some tasks, demonstrating the effectiveness of the developmentally grounded pretraining approach.
研究旨在通过婴儿样数据的学习来提高样本效率,开发出更有效的视觉基础模型。BabyVLM-V2 提出了一个纵向的、多方面的预训练数据集和一个用于认知评估的 DevCV 工具箱。从零开始预训练的模型在十个涵盖空间推理、记忆和词汇理解的多模态任务上表现出色,部分任务上优于 GPT-4o。
FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Authors: Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li
First: 2025-12-11T18:53:15+00:00 · Latest: 2025-12-11T18:53:15+00:00
Comments: Code is available at https://github.com/Wolfv0/FoundationMotion/tree/main
Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
中文标题/摘要
标题:FoundationMotion:自动标注和空间运动推理
运动理解是物理推理的基础,使模型能够推断动力学并预测未来状态。然而,最先进的模型在最近的运动基准测试中仍然存在困难,主要是由于缺乏大规模、细粒度的运动数据集。现有的运动数据集通常是从昂贵的手动注释中构建的,严重限制了其可扩展性。为了解决这一挑战,我们引入了FoundationMotion,这是一种完全自动化的数据整理流水线,用于构建大规模的运动数据集。我们的方法首先在视频中检测和跟踪对象以提取其轨迹,然后利用这些轨迹和视频帧以及大型语言模型(LLMs)生成关于运动和空间推理的细粒度描述和多样化的问答对。使用此流水线生成的数据集,我们对包括NVILA-Video-15B和Qwen2.5-7B在内的开源模型进行微调,实现了在运动理解方面的显著改进,同时在其他任务上不牺牲性能。值得注意的是,我们的模型在各种运动理解数据集和基准测试中均优于强大的闭源基线Gemini-2.5 Flash和大型开源模型Qwen2.5-VL-72B。因此,FoundationMotion为构建细粒度运动数据集提供了一种可扩展的解决方案,这些数据集能够有效微调各种模型以增强运动理解和空间推理能力。
Summary / 总结
FoundationMotion is a fully automated data curation pipeline that constructs large-scale motion datasets by detecting and tracking objects in videos and using Large Language Models to generate fine-grained captions and question-answer pairs. Fine-tuning open-source models with these datasets leads to significant improvements in motion understanding without harming performance on other tasks. The models outperform strong closed-source and large open-source baselines across various motion understanding benchmarks, providing a scalable solution for enhancing motion understanding and spatial reasoning capabilities.
FoundationMotion 是一个全自动的数据整理管道,用于构建大规模的细粒度运动数据集,以提升运动理解能力。它通过检测和跟踪视频中的物体来提取轨迹,然后利用大型语言模型生成关于运动和空间推理的细粒度描述和问答对。在这些数据集上微调开源模型可以提升运动理解能力,同时保持其他任务的性能。微调后的模型在各种运动理解基准测试中优于闭源和大型开源基线模型。
LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding
Authors: Yu Yu, Qian Xie, Nairen Cao, Li Jin
Venue: NeurIPS 2025
First: 2025-12-07T20:25:07+00:00 · Latest: 2025-12-11T18:52:44+00:00
Comments: NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning
Abstract
Designing state encoders for reinforcement learning (RL) with multiple information sources -- such as sensor measurements, time-series signals, image observations, and textual instructions -- remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules -- such as their representation quality -- limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.
中文标题/摘要
标题:基于LLM的多源RL状态编码复合神经架构搜索
多信息源(如传感器测量、时间序列信号、图像观察和文本指令)的强化学习(RL)状态编码设计仍然未被充分探索,通常需要手动设计。我们将这一挑战形式化为复合神经架构搜索(NAS)问题,其中多个源特定模块和融合模块联合优化。现有的NAS方法忽略了这些模块中间输出的有用侧信息(如其表示质量),限制了在多源RL设置中的样本效率。为解决这一问题,我们提出了一种基于LLM的NAS管道,其中LLM作为神经架构设计代理,利用语言模型先验和中间输出信号来引导高效搜索高性能的复合状态编码。在混合自主交通控制任务上,我们的方法在较少的候选评估下发现性能更高的架构,优于传统的NAS基线和基于LLM的GENIUS框架。
Summary / 总结
The paper addresses the challenge of designing state encoders for reinforcement learning with multiple information sources by formulating it as a composite neural architecture search problem. It introduces an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide the search for high-performing composite state encoders, demonstrating superior performance with fewer candidate evaluations compared to traditional NAS methods and the GENIUS framework on a mixed-autonomy traffic control task.
论文通过将多信息源强化学习状态编码设计问题形式化为复合神经架构搜索问题,提出了一种基于LLM的NAS管道,利用语言模型先验和中间输出信号进行高效搜索。该方法在混合自主交通控制任务上优于传统NAS基线和基于LLM的GENIUS框架,并且使用较少的候选评估发现性能更高的架构。
Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation
Authors: Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap
First: 2025-12-11T18:52:42+00:00 · Latest: 2025-12-11T18:52:42+00:00
Abstract
Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.
中文标题/摘要
标题:数字孪生监督强化学习框架用于自主水下导航
由于缺乏GPS、能见度差以及存在水下障碍物,水下环境中的自主导航仍然是一个重大挑战。本文通过BlueROV2这一广泛用于科学实验的开放平台案例,探讨了这些问题。我们提出了一种基于Proximal Policy Optimization (PPO) 算法的深度强化学习方法,使用结合目标导向导航信息、虚拟占用网格以及沿操作区域边界进行的射线投射的观测空间。学习到的策略与参考的确定性动力学规划器——动态窗口方法(DWA)进行了比较,DWA通常被用作避障的稳健基线。评估在现实的模拟环境中进行,并通过3D数字孪生监督的物理BlueROV2进行验证,有助于减少现实世界实验的风险。结果表明,在高度拥挤的环境中,PPO策略始终优于DWA,特别是在局部适应性和减少碰撞方面表现更佳。最后,实验表明从模拟到现实世界的可迁移性,证实了深度强化学习在水下机器人自主导航中的相关性。
Summary / 总结
This study addresses the challenges of autonomous underwater navigation by proposing a digital twin supervised reinforcement learning framework using the Proximal Policy Optimization (PPO) algorithm. The framework combines target-oriented navigation information, a virtual occupancy grid, and ray-casting data. The learned policy is evaluated against the Dynamic Window Approach (DWA) in a realistic simulation and on a physical BlueROV2, showing superior performance in cluttered environments due to better local adaptation and reduced collisions. The results confirm the transferability of the learned behavior from simulation to the real world, highlighting the potential of deep RL for underwater navigation.
本文通过提出基于Proximal Policy Optimization (PPO)算法的数字孪生监督强化学习框架,解决了水下环境中自主导航的挑战。该框架结合了目标导向的导航信息、虚拟占用网格和边界射线数据。所学策略在现实模拟和物理BlueROV2上与动态窗口方法(DWA)进行比较,结果显示在复杂环境中具有更好的局部适应性和较低的碰撞率。实验还证实了从模拟到现实世界的转移学习的有效性。
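As a rough illustration of the observation space described in the abstract above (target-oriented navigation information, a virtual occupancy grid, and ray-casting), the following minimal Python sketch casts rays through a 2D occupancy grid and concatenates the normalized ranges with the distance and bearing to the target. The grid convention, step size, number of rays, and the names `cast_ray` and `build_observation` are assumptions made for illustration; the abstract does not specify the paper's actual encoding.

```python
import math

def cast_ray(grid, x, y, angle, max_range, step=0.1):
    """March along a ray until it hits an occupied cell or leaves the grid.

    grid is a list of rows; grid[row][col] == 1 marks an obstacle.
    Each cell is 1 x 1 unit (hypothetical convention).
    """
    dist = 0.0
    while dist < max_range:
        px = x + dist * math.cos(angle)
        py = y + dist * math.sin(angle)
        col, row = int(math.floor(px)), int(math.floor(py))
        outside = row < 0 or row >= len(grid) or col < 0 or col >= len(grid[0])
        if outside or grid[row][col] == 1:
            return dist  # obstacle or boundary of the operational area
        dist += step
    return max_range

def build_observation(grid, pose, target, n_rays=8, max_range=5.0):
    """Return [range_to_target, bearing_to_target, ray_0, ..., ray_{n-1}]."""
    x, y, heading = pose
    dx, dy = target[0] - x, target[1] - y
    rng = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - heading
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
    rays = [
        cast_ray(grid, x, y, heading + 2.0 * math.pi * k / n_rays, max_range) / max_range
        for k in range(n_rays)
    ]
    return [rng, bearing] + rays
```

A vector of this kind could be passed to a PPO policy as its observation; a real system would likely also include velocity and depth terms that this sketch leaves out.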
If generative AI is the answer, what is the question?
Authors: Ambuj Tewari
First: 2025-09-07T16:07:45+00:00 · Latest: 2025-12-11T18:45:18+00:00
Comments: To appear as a book chapter in a Springer book titled "Statistical Foundations and Applications of Artificial Intelligence, Machine Learning and Deep Learning" and edited by S. Ejaz Ahmed, Pierre Alquier, Yi Li, Shuangge Ma
Abstract
Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.
中文标题/摘要
标题:如果生成AI是答案,那么问题是什么?
从文本和图像开始,生成AI已经扩展到音频、视频、计算机代码和分子。然而,如果生成AI是答案,那么问题是什么?我们探讨了生成作为独立机器学习任务的基础,与预测、压缩和决策之间的联系。我们概述了五种主要的生成模型家族:自回归模型、变分自编码器、规范化流、生成对抗网络和扩散模型。然后我们介绍了一个概率框架,强调密度估计与生成之间的区别。我们回顾了一个博弈论框架,采用两玩家的对抗学习设置来研究生成。我们讨论了训练后修改,以准备生成模型的部署。最后,我们强调了一些在社会责任生成方面的重要话题,如隐私、检测AI生成内容、版权和知识产权。我们采用任务优先的生成框架,专注于生成作为机器学习问题是什么,而不仅仅是模型如何实现它。
Summary / 总结
This paper explores the foundations of generation as a distinct machine learning task, connecting it to prediction, compression, and decision-making. It surveys five major generative model families and introduces a probabilistic framework that distinguishes between density estimation and generation. The study also discusses game-theoretic frameworks and post-training modifications for deployment, and highlights important topics such as privacy and AI-generated content detection in socially responsible generation.
本文探讨了生成作为一项独立机器学习任务的基础,并将其与预测、压缩和决策联系起来。研究概述了五种主要的生成模型家族,并引入了一个概率框架,强调密度估计与生成之间的区别。研究还讨论了博弈论框架和面向部署的训练后修改,并强调了隐私、AI生成内容检测等负责任生成方面的重要话题。
Distributionally Robust Regret Optimal Control Under Moment-Based Ambiguity Sets
Authors: Feras Al Taha, Eilyan Bitar
First: 2025-12-11T18:36:15+00:00 · Latest: 2025-12-11T18:36:15+00:00
Comments: 21 pages, 2 figures
Abstract
In this paper, we consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To address the distributional ambiguity, we explore the design of causal affine control policies to minimize the worst-case expected regret over all distributions in the given ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program that corresponds to a regularized version of the nominal linear-quadratic stochastic control problem. While this convex program can be recast as a semidefinite program, semidefinite programs are typically solved using primal-dual interior point methods that scale poorly with the problem size in practice. To address this limitation, we propose a scalable dual projected subgradient method to compute optimal controllers to an arbitrary accuracy. Numerical experiments are presented to benchmark the proposed method against state-of-the-art data-driven and distributionally robust control design approaches.
中文标题/摘要
标题:基于矩约束不确定性集的分布鲁棒后悔最优控制
在本文中,我们考虑一类有限时间区间、线性二次随机控制问题,其中噪声过程的概率分布未知但假定属于一个不确定性集,该不确定性集包含所有均值和协方差位于以给定名义值为中心的范数球内的分布。为应对分布不确定性,我们探索设计因果仿射控制策略,以最小化给定不确定性集内所有分布的最坏情况期望后悔。所得到的最小最大最优控制问题被证明可以等价地重新表述为一个可解的凸规划问题,该凸规划问题对应于名义线性二次随机控制问题的正则化版本。虽然这个凸规划问题可以重新表述为半定规划问题,但半定规划问题通常使用原始-对偶内点法求解,这种方法在实践中会随着问题规模的增大而表现不佳。为解决这一局限,我们提出了一种可扩展的对偶投影次梯度方法来计算任意精度的最优控制器。还进行了数值实验,将所提出的方法与最先进的数据驱动和分布鲁棒控制设计方法进行了基准测试。
Summary / 总结
This paper addresses finite-horizon linear-quadratic stochastic control when the noise distribution is unknown but assumed to lie in a moment-based ambiguity set, i.e. all distributions whose mean and covariance fall within norm balls centered at nominal values. Causal affine control policies are designed to minimize the worst-case expected regret over that set. The minimax problem is reformulated as a tractable convex program, a regularized version of the nominal problem, and solved with a scalable dual projected subgradient method rather than interior-point semidefinite programming. Numerical experiments benchmark the method against state-of-the-art data-driven and distributionally robust control design approaches.
本文研究了一类具有未知噪声分布但假设在基于矩的不确定性集内的有限时段线性二次随机控制问题。作者设计了因果仿射控制策略,以最小化不确定性集内所有分布的最坏情况下期望遗憾。该最小最大问题被重新表述为一个可求解的凸规划问题,可以通过可扩展的对偶投影次梯度方法求解。数值实验将所提出的方法与最先进的数据驱动和分布鲁棒控制设计方法进行了基准比较。
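For readers who want the problem in symbols, the abstract's setup can be sketched as follows. The notation is an assumption introduced here for illustration: $w$ is the noise trajectory, $\hat{\mu}$ and $\hat{\Sigma}$ are the nominal mean and covariance, $r_\mu$ and $r_\Sigma$ are the ball radii, $J(\pi, w)$ is the quadratic cost of policy $\pi$, and $J^\star(w)$ is the cost of the benchmark policy against which regret is measured. The abstract does not state which norms or which benchmark the paper uses.

```latex
% Moment-based ambiguity set: mean and covariance within norm balls
% around nominal values (choice of norms is an assumption).
\[
\mathcal{P} \;=\; \Bigl\{\, P \;:\;
  \bigl\| \mathbb{E}_P[w] - \hat{\mu} \bigr\| \le r_\mu, \;\;
  \bigl\| \operatorname{Cov}_P(w) - \hat{\Sigma} \bigr\| \le r_\Sigma
\,\Bigr\}
\]

% Worst-case expected regret over the set, minimized over
% causal affine policies \Pi_{\mathrm{aff}}.
\[
\min_{\pi \in \Pi_{\mathrm{aff}}} \;\; \sup_{P \in \mathcal{P}} \;\;
\mathbb{E}_{P}\bigl[\, J(\pi, w) - J^\star(w) \,\bigr]
\]
```

Setting $r_\mu = r_\Sigma = 0$ collapses $\mathcal{P}$ to the distributions matching the nominal moments exactly, which is consistent with the abstract's description of the robust problem as a regularized version of the nominal one.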
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu
First: 2025-12-11T18:23:03+00:00 · Latest: 2025-12-11T18:23:03+00:00
Comments: Project page: https://intchous.github.io/DuetSVG-site
Abstract
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
中文标题/摘要
标题:DuetSVG:统一的多模态SVG生成,内置视觉指导
基于视觉-语言模型(VLM)的方法在SVG生成方面取得了令人印象深刻的成果。然而,由于它们在解码过程中仅生成文本而缺乏视觉信号,因此在处理复杂语义时常常遇到困难,无法生成视觉上吸引人或几何上一致的SVG。我们提出了DuetSVG,这是一种统一的多模态模型,可以以端到端的方式联合生成图像标记和相应的SVG标记。DuetSVG在图像和SVG数据集上进行训练。在推理时,我们应用了一种新颖的测试时缩放策略,利用模型的原生视觉预测作为指导,以提高SVG解码质量。广泛的实验表明,我们的方法优于现有方法,能够生成视觉上忠实、语义上对齐且语法上干净的SVG。
Summary / 总结
The research aims to improve SVG generation by supplying the visual signals that text-only VLM decoding lacks. DuetSVG is a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens end to end. It is trained on both image and SVG datasets and, at inference, applies a test-time scaling strategy that uses the model's own visual predictions to guide SVG decoding. Experiments show that it outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
研究动机是解决现有基于视觉-语言模型(VLM)的方法在SVG生成中存在的问题,这些方法在解码过程中缺乏视觉信号,往往无法生成视觉上吸引人或几何上一致的SVG。主要方法是DuetSVG,这是一种联合生成图像和SVG标记的统一多模态模型,并在测试时使用新颖的缩放策略来指导SVG解码。关键实验发现表明,DuetSVG在各种应用中生成了视觉上忠实、语义上对齐且语法上干净的SVG。
Iterative Compositional Data Generation for Robot Control
Authors: Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton
First: 2025-12-11T18:20:49+00:00 · Latest: 2025-12-11T18:20:49+00:00
Abstract
Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
中文标题/摘要
标题:机器人控制的迭代组合数据生成
收集机器人操作数据成本高昂,使得在多对象、多机器人和多环境设置中获取演示变得不切实际。尽管最近的生成模型可以为单个任务生成有用的数据,但它们未能利用机器人领域的组合结构,并且难以泛化到未见过的任务组合。我们提出了一种语义组合扩散变换器,将转换分解为机器人特定、物体特定、障碍物特定和目标特定的组件,并通过注意力学习它们的交互。在仅对任务的有限子集进行训练后,我们展示了我们的模型可以零样本生成高质量的转换,从中可以学习未见过的任务组合的控制策略。然后,我们引入了一种迭代自我改进过程,在此过程中,合成数据通过离线强化学习验证并纳入后续训练轮次。我们的方法在零样本性能上显著优于单一的和硬编码的组合基线,最终解决了几乎所有保留的任务,并展示了在学习表示中出现有意义的组合结构。
Summary / 总结
This research addresses the high cost of collecting robotic manipulation data by proposing a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. After training on a limited subset of tasks, the model generates high-quality transitions zero-shot for unseen task combinations, from which control policies are learned. An iterative self-improvement procedure validates synthetic data via offline reinforcement learning and feeds it into subsequent training rounds, substantially improving zero-shot performance over monolithic and hard-coded compositional baselines and ultimately solving nearly all held-out tasks.
该研究通过提出一种语义组合扩散变换器来解决收集昂贵的机器人操作数据的挑战,该模型将过渡分解为特定组件,并通过注意力学习它们的交互。在有限的任务集上训练后,该模型可以生成适用于未见过的任务组合的高质量过渡,然后用于学习控制策略。通过迭代自我改进程序进一步提高模型性能,使其零样本结果优于整体和硬编码的组合基线,并解决几乎所有保留出的任务,展示了学习表示中的有意义的组合结构。
PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Authors: Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland
First: 2025-12-11T18:19:00+00:00 · Latest: 2025-12-11T18:19:00+00:00
Comments: 15 pages, 7 figures
Abstract
Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
中文标题/摘要
标题:PubTables-v2:新的大规模数据集用于全页和多页表格提取
表格提取(TE)是视觉文档理解中的一个关键挑战。传统方法首先检测表格,然后识别其结构。最近,人们开始开发可以直接在全页或文档上下文中提取表格的方法,例如视觉-语言模型(VLMs)。然而,由于缺乏标注数据,进步难以展示。为了解决这个问题,我们创建了一个新的大规模数据集,PubTables-v2。PubTables-v2 支持当前许多具有挑战性的表格提取任务。值得注意的是,它是第一个大规模的多页表格结构识别基准。我们通过在这些任务上评估领域专用的 VLMs 来展示其用途,并突出当前的进展。最后,我们使用 PubTables-v2 创建了 Page-Object Table Transformer(POTATR),这是一种表格变换器的图像到图扩展,用于全面的页面级 TE。数据、代码和训练模型将被发布。
Summary / 总结
The research aims to improve table extraction in visual document understanding by addressing the lack of annotated data. The authors introduce PubTables-v2, a large-scale dataset for full-page and multi-page table extraction that includes the first large-scale benchmark for multi-page table structure recognition. They evaluate domain-specialized vision-language models on these tasks and propose POTATR, an image-to-graph extension of the Table Transformer for comprehensive page-level table extraction.
研究旨在通过解决标注数据不足的问题,提高视觉文档理解中的表格提取。作者开发了PubTables-v2,这是一个大规模的全页和多页表格提取数据集,特别是用于多页表格结构识别。主要发现包括对领域特定的视觉语言模型的评估以及提出了POTATR,即表格变换器的图像到图扩展,用于全面的页面级表格提取。
AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent
Authors: Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar
Venue: WACV 2026
First: 2025-11-30T11:32:54+00:00 · Latest: 2025-12-11T18:16:38+00:00
Comments: Accepted at WACV 2026 Conference
Abstract
There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
中文标题/摘要
标题:AFRAgent:一种自适应特征重规范化基于高分辨率感知的GUI代理
移动用户界面(UI)自动化的需求日益增长,这得益于其在各行业的广泛应用。随着视觉语言模型(VLMs)的出现,GUI自动化从为人类生成文本指令发展到自主执行任务,从而优化了自动化工作流。最近的方法利用VLMs,因为它们能够1)直接处理屏幕内容,2)通过利用人类操作(例如点击、输入)而不依赖于特定设备的API,3)应用现实世界的上下文知识来理解任务。然而,这些模型往往难以准确识别控件和确定操作,因为视觉编码器特征中的空间信息有限。此外,表现最佳的模型通常很大,需要大量训练,导致推理延迟。在本文中,我们引入了AFRAgent,这是一种基于instruct-BLIP的多模态架构,在GUI自动化中取得了更优的性能,同时规模不到其最接近竞争者的四分之一。为了增强大型语言模型(LLM)管道中的图像嵌入,我们提出了一种自适应特征重规范化技术(基于令牌级别的仿射变换),该技术有效地丰富了低分辨率图像嵌入并融合了高分辨率细节。我们在Meta-GUI和AITW基准上评估了AFRAgent,建立了智能手机自动化的新的最先进基线。
Summary / 总结
AFRAgent targets mobile UI automation, where existing visual language models struggle to identify widgets and determine actions because vision encoder features carry limited spatial information. It uses an instruct-BLIP-based multimodal architecture with adaptive feature renormalization, a token-level affine transformation that enriches low-resolution image embeddings and fuses in high-resolution details. At less than one-fourth the size of its nearest competitor, AFRAgent establishes a new state-of-the-art baseline for smartphone automation on the Meta-GUI and AITW benchmarks.
AFRAgent旨在通过解决现有视觉语言模型在准确识别控件和确定操作方面的局限性,提升移动UI自动化。它采用基于instruct-BLIP的多模态架构,并使用自适应特征重规范化技术增强图像嵌入,使其更高效且更有效。AFRAgent在Meta-GUI和AITW基准测试中表现出色,确立了智能手机自动化的新标准。
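The abstract describes adaptive feature renormalization as a token-level affine transformation. One plausible reading, sketched below in plain Python, is that each low-resolution token is normalized and then scaled and shifted by values predicted from a paired high-resolution feature. The one-to-one token pairing, the `(1 + scale)` form, and the weight matrices `w_scale` and `w_shift` are all assumptions for illustration, not the paper's confirmed design.

```python
import math

def layer_normalize(vec, eps=1e-6):
    """Zero-mean, unit-variance normalization of a single token vector."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weight):
    """Matrix-vector product; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def affine_renormalize(low_res_tokens, high_res_tokens, w_scale, w_shift):
    """Modulate each low-res token with a scale and shift derived from
    its paired high-res token (hypothetical pairing and projections)."""
    out = []
    for low, high in zip(low_res_tokens, high_res_tokens):
        scale = linear(high, w_scale)   # per-channel scale offset
        shift = linear(high, w_shift)   # per-channel shift
        normed = layer_normalize(low)
        out.append([(1.0 + s) * n + b for s, n, b in zip(scale, normed, shift)])
    return out
```

With zero-initialized `w_scale` and `w_shift` the function reduces to plain normalization, a common way to let the high-resolution pathway start as a no-op during training.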
Physics-Informed Learning of Flow Distribution and Receiver Heat Losses in Parabolic Trough Solar Fields
Authors: Stefan Matthes, Markus Schramm
First: 2025-12-11T18:16:26+00:00 · Latest: 2025-12-11T18:16:26+00:00
Abstract
Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools.
We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods -- when hot oil is circulated through a non-irradiated field -- to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field.
The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.
中文标题/摘要
标题:基于物理信息的学习在抛物线槽太阳能场中流体分布和接收器热损失
抛物线槽集中太阳能热发电(CSP)电站运行着大型液压网络的集热器回路,必须在空间上不均匀的光学性能、热损失和压力降的情况下提供均匀的出口温度。虽然回路温度是可以测量的,但回路级的质量流率和接收器热损失参数是不可观测的,使得使用标准监测工具诊断液压不平衡或接收器退化是不可能的。
我们提出了一种基于物理信息的学习框架,可以直接从常规运行数据中推断出(i)回路级的质量流率比值和(ii)时间变化的接收器传热系数。该方法利用夜间同质化时期——当热油通过未受照射的场区循环时——来隔离液压和热损失效应。一个可微分的耦合传热模型被离散化并嵌入到端到端的学习管道中,该管道使用来自50兆瓦安道尔3号太阳能场的历史电站数据进行优化。
该模型准确地重构了回路温度(均方根误差<2°C),并生成了具有物理意义的回路不平衡和接收器热损失的估计值。与无人机红外热成像(QScan)的比较显示了强烈的对应关系,正确识别了所有高损失接收器的区域。这表明,当与适当的建模和可微优化相结合时,嘈杂的实际CSP运行数据包含足够的信息来恢复潜在的物理参数。
Summary / 总结
The research aims to diagnose hydraulic imbalances and receiver degradation in parabolic trough CSP plants by inferring loop-level mass flows and receiver heat-loss parameters from operational data. The method uses a physics-informed learning framework that exploits nocturnal homogenization periods to isolate hydraulic and thermal-loss effects. The model accurately reconstructs loop temperatures and produces physically meaningful estimates of loop imbalances and receiver heat losses, showing strong correspondence with drone-based infrared thermography data.
研究旨在通过从运营数据中推断出循环级别的质量流率和接收器热损失参数,诊断抛物线槽型 CSP 站中的液压不平衡和接收器退化。方法利用夜间同质化时期来隔离液压和热损失效应,使用一个物理信息学习框架。该模型准确地重构了循环温度,并生成了循环不平衡和接收器热损失的物理上合理的估计值,与无人机红外热成像数据有很强的对应关系。
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
First: 2025-12-11T18:09:48+00:00 · Latest: 2025-12-11T18:09:48+00:00
Comments: Project page: https://animotionlab.github.io/MoCapAnything/
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
中文标题/摘要
标题:MoCapAnything:任意骨架的单目视频统一3D动作捕捉
动作捕捉现在已远远超越了数字人类的内容创作,但大多数现有的流水线仍然局限于特定物种或模板。我们将其差距形式化为无类别动作捕捉(CAMoCap):给定一个单目视频和一个任意的3D资产作为提示,目标是重建基于旋转的动画,如BVH,直接驱动特定的资产。我们提出MoCapAnything,这是一种参考引导、因子化的框架,首先预测3D关节轨迹,然后通过约束感知逆运动学恢复资产特定的旋转。该系统包含三个可学习模块和一个轻量级的IK阶段:(1)参考提示编码器,从资产的骨架、网格和渲染图像中提取每个关节的查询;(2)视频特征提取器,计算密集的视觉描述符并重建一个粗略的4D变形网格,以弥合视频和关节空间之间的差距;(3)统一动作解码器,将这些线索融合以生成时间连贯的轨迹。我们还整理了Truebones动物园,包含1038个动作片段,每个片段提供了一个标准化的骨架-网格-渲染三元组。在领域内基准测试和野外视频上的实验表明,MoCapAnything提供了高质量的骨骼动画,并在异构骨架上展示了有意义的跨物种重新目标化,从而实现了任意资产的可扩展、提示驱动的3D动作捕捉。项目页面:https://animotionlab.github.io/MoCapAnything/
Summary / 总结
MoCapAnything addresses the gap in category-agnostic motion capture by predicting 3D joint trajectories and recovering asset-specific rotations using a reference-guided, factorized framework. It consists of a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder. Experiments demonstrate high-quality skeletal animations and meaningful cross-species retargeting across different rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets.
MoCapAnything 提出了一种统一框架,可以从单目视频中为任意骨架进行运动捕捉。该框架采用参考引导和因子分解的方法,通过逆运动学恢复特定资产的旋转。系统包括参考提示编码器、视频特征提取器和统一运动解码器,并在多种骨架上展示了高质量的骨骼动画和跨物种的重新目标化。
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang
First: 2025-12-11T18:00:21+00:00 · Latest: 2025-12-11T18:00:21+00:00
Abstract
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
中文标题/摘要
标题:从宏观到微观:通过视觉语言模型评估分子微观空间智能
本文介绍了微观空间智能(MiSI)的概念,即感知和推理看不见的微观实体的空间关系的能力,这是科学研究的基础。为了评估视觉语言模型(VLMs)在这一领域的潜力,我们提出了一种系统性的基准框架MiSI-Bench。该框架包含超过163,000个问答对和587,000张图像,源自约4,000个分子结构,涵盖了九项互补任务,评估能力从基本的空间变换到复杂的关联识别。实验结果表明,当前最先进的VLMs在这一基准上的表现远低于人类水平。然而,微调后的7B模型显示出巨大的潜力,甚至在空间变换任务上超过了人类,而其在氢键识别等基于科学的任务上的表现不佳,突显了整合显式领域知识以实现科学AGI的必要性。数据集可在https://huggingface.co/datasets/zongzhao/MiSI-bench获取。
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
Authors: Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang
First: 2025-12-11T17:57:24+00:00 · Latest: 2025-12-11T17:57:24+00:00
Abstract
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
中文标题/摘要
标题:MMSI-Video-Bench:基于视频的空间智能综合基准
在连续视觉输入中进行空间理解对于MLLMs成为物理环境中的通用助手至关重要。然而,目前还没有一个全面的基准能够综合评估这一目标的进展。在本文中,我们介绍了MMSI-Video-Bench,这是一个全面的人工标注基准,用于评估MLLMs的空间智能。它通过1,106个问题和1,278个来自25个数据集和内部视频的片段,实现了感知、规划、预测和跨视频推理四个层次框架的系统化。每个项目都由3D视觉专家精心设计和审查,附有解释性理由,以确保精确和明确的定位。利用其多样化的数据来源和全面的任务覆盖范围,MMSI-Video-Bench还支持三个领域导向的子基准(室内场景感知基准、机器人基准和语义基准),以进行有针对性的能力评估。我们评估了25个强大的开源和专有MLLMs,揭示了显著的人工智能差距:许多模型的表现接近随机猜测,而最佳推理模型比人类落后近60%。我们还发现,空间微调模型在我们的基准上仍然无法有效泛化。精细的错误分析揭示了几何推理、运动定位、长时预测和跨视频对应中的系统性失败。我们还展示了典型的帧采样策略在我们的推理密集型基准上表现不佳,而3D空间线索或思维链提示也没有带来有意义的改进。我们期望我们的基准能够为推进基于视频的空间智能建立一个坚实的基础。
Summary / 总结
MMSI-Video-Bench is a fully human-annotated benchmark for video-based spatial intelligence in MLLMs, covering Perception, Planning, Prediction, and Cross-Video Reasoning through 1,106 questions grounded in 1,278 clips. An evaluation of 25 open-source and proprietary MLLMs reveals a large human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. Error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. Typical frame-sampling strategies transfer poorly to the benchmark, and neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains.
MMSI-Video-Bench 是一个全面的基准,用于评估 MLLMs 的视频空间智能,通过 1,106 个问题覆盖 1,278 个来自 25 个数据集的片段,评估感知、规划、预测和跨视频推理。评估 25 个 MLLMs 后发现,许多模型的表现接近随机,最佳模型比人类落后近 60%。基准还揭示了几何推理、运动定位、长时预测和跨视频对应方面的系统性失败,并表明典型的帧采样策略和 3D 空间线索在推理密集型基准上效果不佳。该基准旨在推动视频空间智能的发展。
SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation
Authors: Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
First: 2025-12-11T17:54:31+00:00 · Latest: 2025-12-11T17:54:31+00:00
Comments: Project page: https://animotionlab.github.io/SWIT4D/
Abstract
Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
中文标题/摘要
标题:SWiT-4D:滑动窗口变换器用于无损和参数自由的时空4D生成
尽管在4D内容生成方面取得了显著进展,但将单目视频转换为具有明确4D网格的高质量动画3D资产仍然极具挑战性。由于缺乏大规模的自然捕获4D网格数据集,这进一步限制了从头开始以纯数据驱动方式训练通用的视频到4D模型的能力。与此同时,由大量数据集支持的图像到3D生成的进步提供了强大的先验模型,可以加以利用。为了更好地利用这些先验模型并尽量减少对4D监督的依赖,我们提出了SWiT-4D,一种用于无损、参数自由的时空4D网格生成的滑动窗口变换器。SWiT-4D可以无缝集成到任何基于扩散变换器(DiT)的图像到3D生成器中,在视频帧之间进行空间-时间建模,同时保留原始的单图像前向过程,从而可以从任意长度的视频中重建4D网格。为了恢复全局平移,我们进一步引入了一个针对静态相机单目视频的基于优化的轨迹模块。SWiT-4D展示了强大的数据效率:仅需一个短于10秒的视频进行微调,即可实现高保真几何结构和稳定的时空一致性,表明在极有限的4D监督下具有实际部署能力。在域内动物园测试集和具有挑战性的域外基准测试集(C4D、Objaverse和野外视频)上的全面实验表明,SWiT-4D在时间平滑度方面始终优于现有基线。项目页面:https://animotionlab.github.io/SWIT4D/
Summary / 总结
SWiT-4D is a Sliding-Window Transformer designed for lossless and parameter-free temporal 4D mesh generation from monocular videos. It integrates with existing Diffusion Transformer-based image-to-3D generators to model spatial-temporal information across video frames. SWiT-4D includes an optimization-based trajectory module for global translation recovery in static-camera videos. Experiments show that SWiT-4D achieves high-fidelity geometry and stable temporal consistency with minimal 4D supervision, outperforming existing methods in temporal smoothness on various benchmarks.
SWiT-4D 是一种滑动窗口变换器,用于从单目视频生成无损且无需参数的时空4D网格。它与现有的图像到3D模型集成,无需大量4D监督即可实现4D网格重建。SWiT-4D 引入了一种针对静态摄像头单目视频的优化轨迹模块,并展示了强大的数据效率,即使在少量微调的情况下也能实现高保真几何和时间一致性。实验表明,SWiT-4D 在各种基准测试中在时间平滑性方面优于现有方法。
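The sliding-window idea in the SWiT-4D abstract above can be illustrated with a temporal attention mask in which each video frame attends only to frames inside a fixed window. This is a minimal sketch, not the authors' implementation: the function name `sliding_window_mask` and the `radius` parameter are hypothetical, and the abstract does not specify the window shape used in the paper.

```python
from typing import List


def sliding_window_mask(num_frames: int, radius: int) -> List[List[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    `radius` is a hypothetical parameter: frames further than `radius`
    steps apart are masked out. The paper may define its window differently.
    """
    if num_frames < 0 or radius < 0:
        raise ValueError("num_frames and radius must be non-negative")
    return [
        [abs(i - j) <= radius for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = sliding_window_mask(num_frames=5, radius=1)
for row in mask:
    # Print one row per query frame: "x" = attend, "." = masked.
    print("".join("x" if allowed else "." for allowed in row))
```

With `radius=0` the mask is the identity, so every frame is processed on its own. That is one way to picture how a windowed temporal layer can fall back to a single-image forward pass. Because the mask depends only on the frame offset, the same rule applies to a video of any length.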
Bayesian Symbolic Regression via Posterior Sampling
Authors: Geoffrey F. Bomarito, Patrick E. Leser
First: 2025-12-11T17:38:20+00:00 · Latest: 2025-12-11T17:38:20+00:00
Abstract
Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Differing from traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method handles challenging, noisy benchmark datasets better. The reduced tendency to overfit and the enhanced ability to discover accurate and interpretable equations pave the way for more robust symbolic regression in scientific discovery and engineering design applications.
中文标题/摘要
标题:基于后验采样的贝叶斯符号回归
符号回归是一种强大的工具,可以直接从数据中发现支配方程,但其对噪声的敏感性限制了其更广泛的应用。本文介绍了一种基于顺序蒙特卡洛(SMC)的贝叶斯符号回归框架,该框架通过近似符号表达式的后验分布来增强鲁棒性,并在噪声存在的情况下实现符号回归中的不确定性量化。与传统的遗传编程方法不同,基于SMC的算法结合了概率选择、自适应退火和归一化边际似然的使用,以高效地探索符号表达式的搜索空间,产生简洁且具有更好泛化能力的表达式。与标准的遗传编程基线方法相比,所提出的方法在处理具有挑战性和噪声的数据集时表现更好。减少过度拟合的趋势和增强发现准确且可解释方程的能力为科学发现和工程设计应用中的更稳健符号回归铺平了道路。
Summary / 总结
This paper addresses the challenge of noisy data in symbolic regression by proposing a Bayesian approach that uses Sequential Monte Carlo (SMC) to approximate the posterior distribution over symbolic expressions. Unlike traditional genetic programming, the method combines probabilistic selection, adaptive tempering, and a normalized marginal likelihood, yielding parsimonious expressions with better generalization and less overfitting. Compared with standard genetic programming baselines, the SMC-based algorithm performs better on noisy benchmark datasets and is better able to discover accurate, interpretable equations.
该论文通过提出基于Sequential Monte Carlo (SMC)的贝叶斯方法来应对符号回归中的噪声问题,该方法通过近似符号表达式的后验分布来提高模型的泛化能力和不确定性量化能力。实验表明,基于SMC的算法在噪声数据集上优于传统的遗传编程方法,提供了更准确和可解释的方程,且不易过拟合。
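The adaptive tempering mentioned in the Bayesian symbolic regression abstract above can be sketched as follows. In tempered SMC, particles are reweighted by the likelihood raised to an increasing power `phi`, and the next `phi` is often chosen so that the effective sample size (ESS) stays above a target. The ESS-based bisection rule below is a common choice and is assumed here, not taken from the paper. The `log_like` values are placeholders, not real marginal likelihoods of expressions. The sketch covers the temperature-selection step only and leaves out resampling and expression proposals.

```python
import math
from typing import List


def ess(log_w: List[float]) -> float:
    """Effective sample size of a set of unnormalized log-weights."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]  # shift by the max for stability
    total = sum(w)
    return total * total / sum(x * x for x in w)


def next_temperature(log_like: List[float], phi: float,
                     target_frac: float = 0.5, tol: float = 1e-6) -> float:
    """Bisect for the largest phi' in (phi, 1] that keeps ESS >= target.

    Assumes particles are equally weighted at temperature `phi`, so the
    incremental log-weight of particle i is (phi' - phi) * log_like[i].
    """
    n = len(log_like)

    def ess_at(phi_new: float) -> float:
        return ess([(phi_new - phi) * ll for ll in log_like])

    if ess_at(1.0) >= target_frac * n:
        return 1.0  # can jump straight to the full posterior
    lo, hi = phi, 1.0  # invariant: ess_at(lo) >= target_frac * n
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_at(mid) >= target_frac * n:
            lo = mid
        else:
            hi = mid
    return lo


# Placeholder log-likelihoods for four hypothetical candidate expressions.
log_like = [-50.0, -10.0, -3.0, -1.0]
phi1 = next_temperature(log_like, phi=0.0)
print(f"first tempering step: phi = {phi1:.4f}")
```

When the particles fit the data about equally well, the schedule takes large steps. When a few particles dominate, the steps shrink, which limits weight degeneracy.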
From Generated Human Videos to Physically Plausible Robot Trajectories
Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig
First: 2025-12-04T18:56:03+00:00 · Latest: 2025-12-11T17:37:53+00:00
Comments: For project website, see https://genmimic.github.io
Abstract
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation more difficult than with real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
中文标题/摘要
标题:从生成的人类视频到物理上可信的机器人轨迹
视频生成模型在合成新颖情境下的人类动作方面的能力正在迅速提高,有可能作为高级规划者服务于情境机器人控制。为了实现这一潜力,一个关键的研究问题仍然悬而未决:如何以零样本的方式让类人机器人执行生成视频中的人类动作?这一挑战源于生成的视频通常噪声较大且存在形态失真,使得直接模仿比真实视频更困难。为了解决这一问题,我们引入了一个两阶段的流水线。首先,我们将视频像素提升到4D的人类表示,然后重新目标到类人形态。其次,我们提出了GenMimic——一种基于3D关键点的物理感知强化学习策略,并通过对称正则化和关键点加权跟踪奖励进行训练。因此,GenMimic可以从噪声较大的生成视频中模仿人类动作。我们构建了GenMimicBench,这是一个使用两种视频生成模型生成的合成人类动作数据集,涵盖了各种动作和情境,为评估零样本泛化能力和策略鲁棒性建立了基准。广泛的实验表明,与强大的基线相比,在模拟中有所改进,并且在无需微调的情况下,能够实现Unitree G1类人机器人上连贯且物理稳定的运动跟踪。这项工作为实现视频生成模型作为机器人控制高级策略的潜力提供了一条有希望的道路。
Summary / 总结
The research aims to enable robots to execute human actions from generated videos in a zero-shot manner. A two-stage pipeline is introduced, first lifting video pixels into a 4D human representation and then retargeting to the humanoid morphology. The GenMimic policy, conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards, successfully mimics human actions from noisy generated videos. Experiments show that GenMimic outperforms strong baselines in simulation and achieves coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning.
研究旨在使机器人能够从生成的视频中零样本执行人类动作,解决合成视频中的噪声和形态失真问题。方法包括两阶段管道:将视频像素提升到4D人体表示并重新目标到人形形态,然后训练一个基于物理的强化学习策略GenMimic,使用对称正则化和关键点加权跟踪奖励。关键发现包括GenMimic能够模仿来自噪声生成视频的人类动作,并在模拟和Unitree G1人形机器人上优于强基线,无需微调,建立了零样本泛化和策略鲁棒性的新基准。
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka
First: 2025-12-11T17:29:25+00:00 · Latest: 2025-12-11T17:29:25+00:00
Comments: Project page: https://windvchen.github.io/PoseGAM/
Abstract
6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/
中文标题/摘要
标题:PoseGAM:通过几何意识多视图推理实现鲁棒的未见物体姿态估计
6D物体姿态估计,即预测物体相对于相机的变换,对于未见物体而言仍然具有挑战性。现有方法通常依赖于在查询图像与物体模型或模板图像之间显式构建特征对应关系。在本文中,我们提出了一种几何意识多视图框架PoseGAM,该框架可以直接从查询图像和多个模板图像中预测物体姿态,从而消除了显式匹配的需要。该方法基于最近的多视图基础模型架构,通过两种互补机制整合物体几何信息:显式的点基几何和几何表示网络学习到的特征。此外,我们构建了一个包含超过19万个物体的大型合成数据集,这些物体在多种环境条件下具有多样性,以增强鲁棒性和泛化能力。在多个基准上的广泛评估表明,我们的方法具有最先进的性能,相对于先前方法的平均AR改进为5.1%,在个别数据集上达到17.6%的提升,表明其在未见物体上的强大泛化能力。
Summary / 总结
The research aims to improve 6D object pose estimation for unseen objects by proposing PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images without explicit matching. The method uses recent multi-view-based foundation models and integrates object geometry information through explicit point-based geometry and learned features from geometry representation networks. Extensive evaluations show that PoseGAM outperforms previous methods, achieving an average AR improvement of 5.1% and up to 17.6% gains on individual datasets.
研究旨在通过提出PoseGAM,一种几何感知的多视图框架,提高对未见过物体的6D姿态估计的鲁棒性。该方法直接从查询图像和多个模板图像中预测物体姿态,无需显式特征对应。它利用了基于多视图的最新基础模型,并通过显式的点几何和几何表示网络学习的特征来整合物体几何信息。该方法在多个基准上进行了评估,显示出优于先前方法的最新性能,平均AR改进了5.1%,在个别数据集上甚至达到了17.6%的提升,表明其在未见过物体上的强泛化能力。
T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders
Authors: Alexey Yermakov, David Zoro, Mars Liyao Gao, J. Nathan Kutz
First: 2025-06-18T21:14:38+00:00 · Latest: 2025-12-11T17:28:30+00:00
Comments: 17 pages, 5 figures, submitted to Transactions of the Royal Society (Symbolic Regression in the Physical Sciences)
Abstract
SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are lightweight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) for the temporal encoding and a simple Multi-Layer Perceptron (MLP) for the spatial decoding. Despite their relatively simple structure, SHRED models are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we modify SHRED by leveraging transformers (T-SHRED) embedded with symbolic regression for the temporal encoding, circumventing auto-regressive long-term forecasting for physical data. This is achieved by integrating a new sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED to impose sparsity regularization on the latent space, which also allows for immediate symbolic interpretation. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes.
中文标题/摘要
标题:T-SHRED:基于变换器浅层递归解码器的符号回归正则化与模型发现
SH 浅层递归解码器(SHRED)对于从稀疏传感器测量中进行系统识别和预测非常有效。这类模型轻量级且计算效率高,允许它们在消费级笔记本电脑上进行训练。SHRED 基础模型依赖递归神经网络(RNN)和简单的多层感知机(MLP)进行时间编码和空间解码。尽管 SHRED 的结构相对简单,但它们能够直接从稀疏的传感器测量集中预测不同物理、空间和时间尺度上的混沌动力系统。在本文中,我们通过利用变换器(T-SHRED)并结合符号回归进行时间编码,改进了 SHRED,从而绕过了自回归长期预测。这是通过将新的稀疏非线性动力系统识别(SINDy)注意力机制引入 T-SHRED 来实现的,该机制在潜空间中施加稀疏正则化,同时也允许即时符号解释。符号回归通过在训练过程中学习和正则化潜空间的动力学来提高模型的可解释性。我们分析了 T-SHRED 在三种不同动力系统上的性能,从低数据到高数据范围。
Summary / 总结
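T-SHRED modifies SHRED by using transformers embedded with symbolic regression for the temporal encoding, which circumvents auto-regressive long-term forecasting when predicting chaotic dynamical systems from sparse sensor data. A SINDy attention mechanism imposes sparsity regularization on the latent space and allows for immediate symbolic interpretation, improving model interpretability. The model's performance is analyzed on three dynamical systems ranging from low-data to high-data regimes.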
T-SHRED 是 SHRED 的增强版本,使用了带有符号回归的变压器进行时间编码,以提高模型在稀疏传感器数据上的可解释性和性能。通过引入稀疏非线性动力学(SINDy)注意力机制,它在潜在空间上施加稀疏正则化,从而实现即时的潜在空间动力学符号解释。实验结果表明,T-SHRED 在各种数据量条件下预测混沌动力系统方面优于 SHRED 和其他模型。
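The sparsity regularization that SINDy brings to T-SHRED can be illustrated with the classic SINDy regression step, sequentially thresholded least squares: fit the time derivative on a library of candidate terms, zero out small coefficients, and refit. This is a sketch of plain SINDy on a toy 1-D system, not the SINDy attention mechanism from the paper. The helper names, the threshold value, and the library `[1, x, x^2]` are illustrative assumptions.

```python
import math
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def least_squares(theta, y, cols):
    """Fit y on the selected library columns via the normal equations."""
    gram = [[sum(row[i] * row[j] for row in theta) for j in cols] for i in cols]
    rhs = [sum(row[i] * yi for row, yi in zip(theta, y)) for i in cols]
    return solve(gram, rhs)


def stlsq(theta, y, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: refit, drop small terms, repeat."""
    active = list(range(len(theta[0])))
    for _ in range(iters):
        fit = least_squares(theta, y, active)
        kept = [i for i, c in zip(active, fit) if abs(c) >= threshold]
        converged = kept == active
        active = kept
        if converged or not active:
            break
    coef = [0.0] * len(theta[0])
    if active:
        for i, c in zip(active, least_squares(theta, y, active)):
            coef[i] = c
    return coef


# Toy data: x(t) = exp(-2 t), so dx/dt = -2 x (exact derivatives, no noise).
ts = [0.04 * k for k in range(50)]
xs = [math.exp(-2.0 * t) for t in ts]
dxdt = [-2.0 * x for x in xs]
library = [[1.0, x, x * x] for x in xs]  # candidate terms: 1, x, x^2
coef = stlsq(library, dxdt)
print(coef)  # only the coefficient of x should remain nonzero
```

The surviving nonzero coefficients read directly as an equation (here, approximately `dx/dt = -2 x`). This is the sense in which a sparse latent model is immediately interpretable.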
Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments
Authors: Atahan Cilan, Atay Özgövde
First: 2025-12-11T17:26:24+00:00 · Latest: 2025-12-11T17:26:24+00:00
Comments: Submitted to IEEE Transactions on Games
Abstract
This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
中文标题/摘要
标题:在多智能体环境中学习可控和多样的玩家行为
本文介绍了一种强化学习框架,能够在无需依赖人类游戏数据的情况下实现可控和多样的玩家行为。现有方法通常需要大规模的玩家轨迹数据、为不同玩家类型训练单独的模型,或者无法直接将可解释的行为参数与学习策略映射起来,限制了其可扩展性和可控性。我们定义玩家行为在N维连续空间中,并均匀采样目标行为向量,该向量代表了真实人类风格的子集。在训练过程中,每个智能体接收当前和目标行为向量作为输入,奖励基于它们之间距离的归一化减少量。这使得策略能够学习行动如何影响行为统计,从而实现对侵略性、移动性和合作性等属性的平滑控制。基于PPO的单个多智能体策略可以在无需重新训练的情况下重现新的或未见过的游戏风格。在自定义多人Unity游戏中进行的实验表明,所提出的框架产生的行为多样性显著高于仅以胜利为目标的基线,并且能够在多种目标上可靠地匹配指定的行为向量。该方法为自动化游戏测试、游戏平衡、人类行为模拟以及在线游戏中替代断线玩家提供了可扩展的解决方案。
Summary / 总结
This paper presents a reinforcement learning framework that enables the generation of controllable and diverse player behaviors in multi-agent environments without the need for human gameplay data. The method defines player behavior in an N-dimensional continuous space, uniformly samples target behavior vectors for training, and rewards each agent according to the normalized reduction in distance between its current and target behavior vectors. Experiments in a custom multi-player Unity game demonstrate that the framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets, offering a scalable solution for playtesting, game balancing, and simulating human-like behavior.
本文提出了一种强化学习框架,能够在无需人类游戏数据的情况下生成可控且多样的玩家行为。该方法将玩家行为定义在N维空间中,使用目标行为向量进行训练,并采用基于归一化距离的奖励来引导策略。实验在自定义的Unity游戏中表明,该框架能够显著增加行为多样性,并且能够在各种目标上可靠地匹配指定的行为向量,为游戏测试、平衡和模拟人类行为提供了可扩展的解决方案。
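The reward described in the abstract above, the normalized reduction in distance between an agent's current and target behavior vectors, can be sketched in a few lines. This is an illustrative reading, not the authors' code. The function names, the Euclidean distance, the unit-hypercube behavior range, and normalization by the hypercube diagonal are all assumptions that the abstract does not specify.

```python
import math
import random
from typing import List, Sequence


def sample_target(n_dims: int, rng: random.Random) -> List[float]:
    """Uniformly sample a target behavior vector from [0, 1]^n (assumed range)."""
    return [rng.uniform(0.0, 1.0) for _ in range(n_dims)]


def behavior_reward(prev: Sequence[float], curr: Sequence[float],
                    target: Sequence[float]) -> float:
    """Reduction in distance to the target over one step, normalized.

    The normalizer is the diagonal of the unit hypercube, sqrt(N), which is
    the largest possible distance under the assumed [0, 1]^N range.
    """
    max_dist = math.sqrt(len(target))
    return (math.dist(prev, target) - math.dist(curr, target)) / max_dist


rng = random.Random(0)
target = sample_target(3, rng)  # e.g. (aggressiveness, mobility, cooperativeness)

# Moving toward the target gives a positive reward, moving away a negative one.
closer = behavior_reward([0.0, 0.0], [0.3, 0.4], [0.6, 0.8])
farther = behavior_reward([0.3, 0.4], [0.0, 0.0], [0.6, 0.8])
print(closer, farther)
```

Because the reward depends on the change in distance, not the distance itself, an agent is paid for progress toward any sampled target. This is consistent with the abstract's claim that a single policy can be steered to different behavior vectors without retraining.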