arXiv 论文速递

EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI

Authors: Jianlei Chang, Ruofeng Mei, Wei Ke, Xiangyu Xu

Venue: AAAI 2026

First: 2025-12-01T18:59:59+00:00 · Latest: 2025-12-01T18:59:59+00:00

Comments: Accepted by AAAI 2026. Project Page: https://efficientflow.github.io/

Abstract

Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.

中文标题/摘要

标题：EfficientFlow：高效等变流政策学习在具身AI中的应用

生成建模最近在视觉-运动政策学习中展现了显著的潜力，能够灵活且富有表现力地控制各种具身AI任务。然而，现有的生成政策往往在数据效率和采样效率方面存在挑战，需要大规模演示，并在推理过程中导致缓慢的动作生成。我们提出了EfficientFlow，一种基于流的政策学习统一框架，以提高数据效率。通过将等变性引入流匹配，我们理论上证明，在使用各向同性高斯先验和等变速度预测网络时，所得到的动作分布保持等变性，从而提高泛化能力和显著减少数据需求。为了加速采样，我们提出了一种新的加速正则化策略。由于直接计算加速度对于边缘流轨迹是不可行的，我们推导出一种新的替代损失，仅使用条件轨迹即可实现稳定且可扩展的训练。在一系列广泛的机器人操作基准测试中，所提出的方法在有限数据下实现了竞争力或优越的性能，同时提供了显著更快的推理速度。这些结果突显了EfficientFlow作为高性能具身AI的强大且高效的范式。

Summary / 总结

EfficientFlow is a unified framework for efficient embodied AI built on flow-based policy learning. It improves data efficiency by incorporating equivariance into flow matching, which improves generalization and reduces data demands. To accelerate sampling, it adds an acceleration regularization term, trained through a surrogate loss that needs only conditional trajectories. Across various robotic manipulation benchmarks, EfficientFlow achieves competitive or superior performance with limited data while offering faster inference.

EfficientFlow 是一种统一框架，利用基于流的策略学习来提高体态AI的效率。通过将对称性引入流匹配中，它提高了泛化能力和减少了数据需求。为了加速采样，它引入了一种新的加速度正则化策略，使训练更加稳定和可扩展。在各种机器人操作基准测试中，EfficientFlow 在有限数据下实现了竞争力或更优的性能，并且具有更快的推理速度。
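
The equivariance claim in the EfficientFlow abstract can be illustrated with a toy example. The sketch below is not the authors' network: `velocity` is a hypothetical hand-written rotation-equivariant field, the observation is a made-up 2D vector, and sampling uses plain Euler integration. It only shows that rotating both the prior sample and the observation rotates the generated action, which, combined with the rotation invariance of an isotropic Gaussian prior, is the intuition behind an equivariant action distribution.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def velocity(x, t, obs):
    """Hypothetical rotation-equivariant velocity field:
    velocity(R x, t, R obs) == R velocity(x, t, obs).
    A scalar gain that depends only on a norm, times a vector, stays equivariant."""
    d = obs - x
    return (1.0 + np.linalg.norm(d)) * d

def sample_action(x0, obs, steps=20):
    """Euler integration of dx/dt = velocity(x, t, obs) from t = 0 to t = 1."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt, obs)
    return x

rng = np.random.default_rng(0)
R = rot(0.7)
x0 = rng.standard_normal(2)     # draw from the isotropic Gaussian prior
obs = np.array([1.0, 0.5])      # toy observation, an assumption for illustration

a = sample_action(x0, obs)
a_rot = sample_action(R @ x0, R @ obs)

# Rotating the inputs rotates the output action (up to floating-point error).
equivariance_gap = float(np.linalg.norm(a_rot - R @ a))
print(equivariance_gap)
```

Because `R @ x0` is exactly as likely as `x0` under an isotropic Gaussian, the same argument carries over from single samples to the whole action distribution.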

A Diffusion Model Framework for Maximum Entropy Reinforcement Learning

Authors: Sebastian Sanokowski, Kaustubh Patil, Alois Knoll

First: 2025-12-01T18:59:58+00:00 · Latest: 2025-12-01T18:59:58+00:00

Abstract

Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.

中文标题/摘要

标题：最大熵强化学习的扩散模型框架

扩散模型在数据驱动学习和从复杂、非规范化目标分布中采样方面取得了显著成功。在此基础上，我们将最大熵强化学习（MaxEntRL）重新解释为基于扩散模型的采样问题。我们通过最小化扩散策略与最优策略分布之间的反向Kullback-Leibler（KL）散度的可计算上界来解决这个问题。通过将策略梯度定理应用于该目标，我们推导出一种改进的替代目标，该目标以原理上正确的方式将扩散动力学纳入最大熵强化学习中。这导致了Soft Actor-Critic（SAC）、Proximal Policy Optimization（PPO）和Wasserstein Policy Optimization（WPO）的简单扩散版本，分别称为DiffSAC、DiffPPO和DiffWPO。所有这些方法只需要对其基础算法进行少量的实现更改。我们发现，在标准连续控制基准测试中，DiffSAC、DiffPPO和DiffWPO在回报和样本效率方面优于SAC和PPO。

Summary / 总结

The research aims to enhance Maximum Entropy Reinforcement Learning (MaxEntRL) by interpreting it as a diffusion model-based sampling problem. The method minimizes the reverse Kullback-Leibler divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. The key experimental findings show that DiffSAC, DiffPPO, and DiffWPO achieve better returns and higher sample efficiency compared to traditional SAC and PPO on standard continuous control benchmarks.

研究旨在通过将最大熵强化学习（MaxEntRL）重新解释为基于扩散模型的采样问题来提升其性能。方法通过使用可处理的上界最小化扩散策略与最优策略分布之间的反向Kullback-Leibler散度。关键实验发现表明，DiffSAC、DiffPPO和DiffWPO在标准连续控制基准测试中取得了更好的回报和更高的样本效率。
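
To make the objective in the MaxEntRL abstract concrete, here is a minimal discrete sketch. It assumes the standard MaxEnt result that the optimal policy for a state is proportional to exp(Q/alpha); the Q-values and temperature are hypothetical. The paper itself works with diffusion policies and a tractable upper bound on the reverse KL, neither of which is modelled here.

```python
import numpy as np

def boltzmann_policy(q_values, alpha):
    """Optimal MaxEnt policy for one state: pi*(a) proportional to exp(Q(a) / alpha)."""
    logits = q_values / alpha
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reverse_kl(pi, pi_star):
    """KL(pi || pi_star) for discrete distributions with full support."""
    return float(np.sum(pi * (np.log(pi) - np.log(pi_star))))

q = np.array([1.0, 2.0, 0.5])    # hypothetical Q-values for three actions
alpha = 0.5                      # hypothetical entropy temperature
pi_star = boltzmann_policy(q, alpha)
uniform = np.full(3, 1.0 / 3.0)

kl_uniform = reverse_kl(uniform, pi_star)   # positive: uniform is not optimal
kl_optimal = reverse_kl(pi_star, pi_star)   # zero at the optimum
print(kl_uniform, kl_optimal)
```

Minimizing this divergence over the policy drives it toward the Boltzmann target; the diffusion variants in the paper pursue the same goal with a richer policy class.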

Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion

Authors: Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang

Venue: NeurIPS 2025

First: 2025-12-01T18:59:57+00:00 · Latest: 2025-12-01T18:59:57+00:00

Comments: Accepted to NeurIPS 2025. Project page: https://stevenlsw.github.io/visualsync/

Abstract

Today, people can easily record memorable moments, ranging from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.

Summary / 总结

The research aims to address the challenge of synchronizing multiple unposed and unsynchronized videos captured from different consumer cameras. VisualSync, an optimization framework, uses multi-view dynamics to align videos with millisecond accuracy. By leveraging 3D reconstruction, feature matching, and dense tracking, VisualSync extracts tracklets, relative poses, and cross-view correspondences to minimize epipolar errors, estimating each camera's time offset. Experiments demonstrate that VisualSync outperforms baseline methods with a median synchronization error below 50 ms.

VisualSync 是一个优化框架，旨在以毫秒级精度同步多台摄像机拍摄的未摆拍且未同步的视频。它利用多视图动态、3D重建、特征匹配和密集跟踪来提取轨迹片段和跨视图对应关系，然后联合最小化视差误差以估计时间偏移。实验结果显示，VisualSync 在四个数据集上的同步误差中位数低于 50 毫秒，优于基线方法。
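
The key insight in the VisualSync abstract (a moving point satisfies the epipolar constraint only when the two streams are correctly aligned) can be sketched on synthetic data. Everything below is an assumption for illustration: the trajectory, the relative pose, the 120 ms offset, and the brute-force grid search, which stands in for the joint optimization described in the paper.

```python
import numpy as np

def skew(t):
    """Cross-product matrix so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def point_at(t):
    """Hypothetical 3D trajectory in camera-1 coordinates."""
    return np.array([np.sin(t), 0.5 * np.cos(2 * t), 5.0 + 0.3 * t])

def project(X):
    """Normalized homogeneous image coordinates (x/z, y/z, 1)."""
    return X / X[2]

# Assumed relative pose: X2 = R @ X1 + T, hence x2^T E x1 = 0 with E = [T]x R.
theta = 0.3
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
T = np.array([-1.0, 0.1, 0.2])
E = skew(T) @ R

true_offset = 0.120                       # seconds; camera-2 clock lags by this much
t1 = np.linspace(0.0, 2.0, 30)            # camera-1 timestamps
x1 = np.stack([project(point_at(t)) for t in t1])

# Dense camera-2 track indexed by its own (unsynchronized) clock tau.
tau = np.linspace(-1.0, 3.0, 2001)
x2_track = np.stack([project(R @ point_at(s + true_offset) + T) for s in tau])

def epipolar_cost(d):
    """Mean absolute epipolar residual if camera-2 time = camera-1 time - d."""
    q = t1 - d
    x2 = np.stack([np.interp(q, tau, x2_track[:, k]) for k in range(3)], axis=1)
    return float(np.mean(np.abs(np.einsum("ni,ij,nj->n", x2, E, x1))))

candidates = np.arange(-0.5, 0.5001, 0.001)
costs = [epipolar_cost(d) for d in candidates]
estimated_offset = float(candidates[int(np.argmin(costs))])
print(estimated_offset)
```

The residual vanishes only near the true offset because the point moves out of its original epipolar plane over time; a point moving inside a single epipolar plane would leave the offset unobservable.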

Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now

Authors: Varun Varma Thozhiyoor, Shivam Tripathi, Venkatesh Babu Radhakrishnan, Anand Bhattad

First: 2025-12-01T18:59:56+00:00 · Latest: 2025-12-01T18:59:56+00:00

Comments: https://gravity-eval.github.io/

Abstract

Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching $65\%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.

Summary / 总结

The study evaluates video generators' understanding of gravity by introducing a unit-free protocol that tests the timing ratio of falling objects. Despite temporal rescaling, video generators still show a slower acceleration of falling objects. A low-rank adaptor fine-tuned on 100 single-ball clips improved the effective gravity to 65% of terrestrial gravity, demonstrating partial mitigation of the physical gap and generalization to two-ball drops and inclined planes.

研究通过引入一个无单位的协议来测试下落物体的时间比，评估视频生成器对重力的理解。尽管进行了时间缩放，视频生成器仍然表现出物体下落的加速度较慢。通过对100个单球片段进行微调的低秩适配器将有效重力提高到地球重力的65%，部分缓解了物理差距，并且能够零样本泛化到双球下落和斜面实验，提供了特定物理定律可以通过少量数据进行修正的初步证据。
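
The unit-free ratio used in the gravity paper follows from free fall from rest, $h = \tfrac{1}{2} g t^2$, so $t^2 = 2h/g$ and the factor $g$ cancels in $t_1^2/t_2^2$. A short numeric check, with hypothetical drop heights and assuming the usual $9.81\,\mathrm{m/s^2}$ for terrestrial gravity:

```python
import math

def fall_time(h, g):
    """Time to fall a height h from rest under constant acceleration g."""
    return math.sqrt(2.0 * h / g)

h1, h2 = 2.0, 0.5            # hypothetical drop heights in metres
ratios = []
for g in (1.81, 6.43, 9.81): # effective gravities quoted in the abstract, plus Earth's
    t1, t2 = fall_time(h1, g), fall_time(h2, g)
    ratios.append(t1 ** 2 / t2 ** 2)

# Every ratio equals h1 / h2 regardless of g.
print(ratios, h1 / h2)

# Fraction of terrestrial gravity reached after fine-tuning.
fraction_of_earth = 6.43 / 9.81
print(fraction_of_earth)
```

A generator whose two-object timings violate this ratio is therefore inconsistent with any single value of $g$, independent of frame rate or metric scale assumptions.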

Generative Video Motion Editing with 3D Point Tracks

Authors: Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, Zhengqi Li

First: 2025-12-01T18:59:55+00:00 · Latest: 2025-12-01T18:59:55+00:00

Comments: Project page: https://edit-by-track.github.io

Abstract

Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.

中文标题/摘要

标题：基于3D点轨迹的生成视频运动编辑

摄像机和物体的运动是视频叙事的核心。然而，精确编辑这些捕捉到的运动仍然是一个重大挑战，尤其是在复杂的物体运动下。当前的运动控制图像到视频（I2V）方法往往缺乏全面的场景上下文，无法实现一致的视频编辑，而视频到视频（V2V）方法则提供了视角变化或基本的物体平移，但对精细的物体运动控制有限。我们提出了一种基于轨迹的V2V框架，能够同时编辑摄像机和物体的运动。我们通过将视频生成模型条件化于源视频及其配对的3D点轨迹（表示源和目标运动）来实现这一点。这些3D轨迹建立了稀疏对应关系，将源视频中的丰富上下文转移到新的运动中，同时保持时空一致性。关键的是，与2D轨迹相比，3D轨迹提供了明确的深度线索，使模型能够解决深度顺序问题并处理遮挡，从而实现精确的运动编辑。我们的模型在合成和真实数据上分阶段训练，支持多种运动编辑，包括摄像机/物体的联合操作、运动转移和非刚性变形，从而解锁了视频编辑中的新创意潜力。

Summary / 总结

The paper addresses the challenge of precisely editing camera and object motions in videos, which are crucial for narrative. It introduces a track-conditioned video-to-video framework that uses 3D point tracks to transfer rich context and preserve spatiotemporal coherence. The model, trained on both synthetic and real data, supports various motion edits such as joint camera/object manipulation and non-rigid deformation, enhancing video editing capabilities.

论文针对精确编辑视频中的摄像机和物体运动这一关键问题，提出了一个基于3D点轨迹的视频到视频框架，通过转移丰富的上下文并保持时空一致性来实现精确编辑。该模型在合成和真实数据上进行了双重训练，支持多种运动编辑，如摄像机和物体的联合操作及非刚性变形，提升了视频编辑的能力。
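
The abstract's point that 3D tracks carry depth cues that 2D tracks lack can be seen with a pinhole camera: two points on the same viewing ray project to the same pixel, and only their depth says which one occludes the other. The intrinsics and point coordinates below are hypothetical, and this is an illustration of the geometry rather than of the paper's conditioning mechanism.

```python
import numpy as np

def project(P, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point in camera coordinates.
    Returns the pixel position and the depth along the optical axis."""
    X, Y, Z = P
    return np.array([f * X / Z + cx, f * Y / Z + cy]), float(Z)

near = np.array([0.2, 0.1, 2.0])   # hypothetical foreground track point
far = near * 3.0                   # lies on the same viewing ray, three times farther away

pix_near, z_near = project(near)
pix_far, z_far = project(far)

same_pixel = bool(np.allclose(pix_near, pix_far))   # indistinguishable in 2D
visible = "near" if z_near < z_far else "far"       # depth resolves the ordering
print(same_pixel, visible)
```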

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Authors: Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong

First: 2025-12-01T18:59:51+00:00 · Latest: 2025-12-01T18:59:51+00:00

Comments: Project page: https://tuna-ai.org/

Abstract

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

中文标题/摘要

标题：TUNA: 镇服统一视觉表示以构建原生统一多模态模型

统一多模态模型（UMMs）旨在在一个框架内同时进行多模态理解和生成。我们提出了TUNA，这是一种原生UMM，通过级联VAE编码器和表示编码器构建统一的连续视觉表示。这种统一的表示空间允许对图像和视频进行端到端的处理，用于理解和生成任务。与具有分离表示的先前UMMs相比，TUNA的统一视觉空间避免了由单独编码器引入的表示格式不匹配，其在理解和生成任务上均优于分离的替代方案。此外，我们观察到，更强的预训练表示编码器在所有多模态任务中均能获得更好的性能，突显了表示编码器的重要性。最后，在这种统一的设置中，同时对理解和生成数据进行联合训练，使得两个任务能够相互受益而不是相互干扰。我们在多模态理解和生成基准上的广泛实验表明，TUNA在图像和视频理解和生成、图像编辑等方面均达到了最先进的性能，证明了其统一表示设计的有效性和可扩展性。

Summary / 总结

TUNA is a native unified multimodal model that uses a VAE encoder followed by a representation encoder to create a unified visual representation space for both understanding and generation tasks. This approach outperforms decoupled models in both understanding and generation, and stronger pretrained representation encoders improve performance across all tasks. Joint training on understanding and generation data enhances performance for both tasks. Experiments show TUNA achieves state-of-the-art results in various multimodal tasks.

TUNA 是一种将 VAE 编码器与表示编码器结合以创建统一视觉表示空间的原生统一多模态模型，适用于理解和生成任务。这种方法在理解和生成任务上都优于分离模型，并且更强的预训练表示编码器可以提高所有任务的性能。联合训练理解和生成数据可以提升性能，TUNA 在各种多模态基准测试中达到了最先进的结果。

AirSim360: A Panoramic Simulation Platform within Drone View

Authors: Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, Jason Li, Wenjie Jiang, Bo Du, Ming-Hsuan Yang, Lu Qi

First: 2025-12-01T18:59:30+00:00 · Latest: 2025-12-01T18:59:30+00:00

Comments: Project Website: https://insta360-research-team.github.io/AirSim360-website/

Abstract

The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at https://insta360-research-team.github.io/AirSim360-website.

中文标题/摘要

标题：AirSim360：无人机视角下的全景模拟平台

360度全方位理解领域正逐渐受到关注，以推动空间智能的发展。然而，缺乏大规模和多样化的数据仍然是一个主要限制。在本文中，我们提出了AirSim360，一个从空中视角获取全方位数据的模拟平台，能够利用无人机进行广泛的场景采样。具体而言，AirSim360关注三个方面：一种渲染对齐的数据和标注范式，用于像素级几何、语义和实体理解；一种交互式行人感知系统，用于建模人类行为；以及一种自动轨迹生成范式，以支持导航任务。此外，我们收集了超过60,000个全景样本，并在各种任务上进行了广泛的实验，以证明我们模拟器的有效性。与现有模拟器不同，我们的工作是首次在全方位设置下系统地建模4D真实世界。整个平台，包括工具包、插件和收集的数据集，将在https://insta360-research-team.github.io/AirSim360-website/公开提供。

Summary / 总结

AirSim360 is a simulation platform for 360-degree omnidirectional data captured from aerial viewpoints, addressing the lack of large-scale, diverse data for spatial intelligence. It focuses on a render-aligned data and labeling paradigm for pixel-level understanding, pedestrian-aware human behavior modeling, and automated trajectory generation. The authors collect more than 60K panoramic samples and demonstrate the simulator's effectiveness through extensive experiments across various tasks. They describe AirSim360 as the first simulator to systematically model the 4D real world under an omnidirectional setting, and state that the platform will be made publicly available.

AirSim360 是一个用于 360 度航拍数据的模拟平台，旨在解决空间智能领域缺乏多样性和大规模数据的问题。该平台关注像素级理解、人类行为建模和自动轨迹生成。它收集了超过 60,000 个全景样本，并通过各种任务的广泛实验展示了其有效性。与之前的模拟器不同，AirSim360 在全景设置下建模 4D 实际世界，并将公开发布。

MV-TAP: Tracking Any Point in Multi-View Videos

Authors: Jahyeok Koo, Inès Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyun Park, Jaeyeong Kim, Jung Yi, Seokju Cho, Seungryong Kim

First: 2025-12-01T18:59:01+00:00 · Latest: 2025-12-01T18:59:01+00:00

Comments: Project Page: https://cvlab-kaist.github.io/MV-TAP/

Abstract

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.

中文标题/摘要

标题：MV-TAP：多视角视频中任意点的跟踪

多视角摄像系统能够对复杂的现实场景进行丰富的观察，理解多视角设置中的动态对象已成为各种应用的核心。在本文中，我们提出了一种新颖的点跟踪器MV-TAP，它通过利用跨视角信息在动态场景的多视角视频中跟踪点。MV-TAP 利用摄像机几何和跨视角注意力机制来跨视角聚合时空信息，从而在多视角视频中实现更完整和可靠的轨迹估计。为了支持这一任务，我们构建了一个大规模的合成训练数据集和针对多视角跟踪的现实世界评估集。广泛的实验表明，MV-TAP 在具有挑战性的基准测试中优于现有的点跟踪方法，为多视角点跟踪研究的进步奠定了有效的基础。

Summary / 总结

MV-TAP is a novel point tracker that tracks points across multi-view videos by using cross-view information and camera geometry. It aggregates spatio-temporal information to estimate more complete and reliable trajectories. Experiments show that MV-TAP outperforms existing methods on challenging benchmarks, setting a new baseline for multi-view point tracking research.

MV-TAP 是一种新颖的点跟踪器，通过利用跨视图信息和相机几何结构在多视图视频中跟踪点。它使用跨视图注意力机制来聚合时空信息，从而改善轨迹估计。实验表明，MV-TAP 在具有挑战性的基准测试中优于现有方法，为多视图点跟踪研究设立了新的基线。

Learning Visual Affordance from Audio

Authors: Lidong Lu, Guo Chen, Zhu Wei, Yicheng Liu, Tong Lu

First: 2025-12-01T18:58:56+00:00 · Latest: 2025-12-01T18:58:56+00:00

Comments: 15 pages, 10 figures

Abstract

We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released at https://jscslld.github.io/AVAGFormer/.

中文标题/摘要

标题：从音频学习视觉功能

我们介绍了Audio-Visual Affordance Grounding (AV-AG) 任务，该任务从动作声音中分割出对象交互区域。与依赖于文本指令或示范视频的现有方法不同，这些方法往往受到模糊或遮挡的限制，音频提供了实时、语义丰富且视觉独立的线索，用于功能定位，使交互区域的理解更加直观。为了支持这一任务，我们构建了第一个AV-AG数据集，包含大量动作声音、对象图像和像素级的功能注释。数据集还包括一个未见过的子集，用于评估零样本泛化能力。此外，我们提出了AVAGFormer模型，该模型配备有语义条件跨模态混合器和双头解码器，能够有效融合音频和视觉信号进行掩码预测。实验表明，AVAGFormer在AV-AG上达到了最先进的性能，超越了相关任务的基线。全面的分析突出了AV-AG与AVS之间的区别、端到端建模的优势以及每个组件的贡献。代码和数据集已发布在https://jscslld.github.io/AVAGFormer/。

Summary / 总结

The research introduces Audio-Visual Affordance Grounding (AV-AG), a task that segments object interaction regions from action sounds. Unlike previous methods relying on textual instructions or demonstration videos, AV-AG uses audio to provide real-time, semantically rich, and visually independent cues for affordance grounding. The study constructs the first AV-AG dataset, including an unseen subset for zero-shot evaluation, and proposes AVAGFormer, a model that fuses audio and visual signals for mask prediction. AVAGFormer outperforms baselines from related tasks. Further analyses highlight the differences between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each model component.

研究引入了Audio-Visual Affordance Grounding (AV-AG) 任务，该任务通过动作声音来分割物体交互区域。不同于以往依赖文本指令或示范视频的方法，AV-AG 利用音频线索进行实时、语义丰富的视觉独立的用途定位。研究构建了首个 AV-AG 数据集，并提出了 AVAGFormer 模型，该模型能够融合音频和视觉信号进行掩码预测。AVAGFormer 在零样本泛化方面超越了现有基线，并展示了端到端建模和语义条件化的优势。全面的分析突出了该模型的强项和对领域的贡献。

Learning Sim-to-Real Humanoid Locomotion in 15 Minutes

Authors: Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, Pieter Abbeel

First: 2025-12-01T18:55:17+00:00 · Latest: 2025-12-01T18:55:17+00:00

Comments: Project website: https://younggyo.me/fastsac-humanoid

Abstract

Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.

中文标题/摘要

标题：在15分钟内学习类人机器人行走的模拟到现实

大规模并行模拟将机器人强化学习（RL）的训练时间从几天缩短到了几分钟。然而，由于高维度和领域随机化等因素带来的挑战，实现快速可靠的模拟到现实的RL类人控制仍然困难。在本工作中，我们介绍了一种基于离策RL算法（例如FastSAC和FastTD3）的简单实用方法，能够在单个RTX 4090 GPU上仅用15分钟快速训练类人行走策略。我们的简单方法通过精心调优的设计选择和简约的奖励函数，在数千个并行环境中稳定了离策RL算法。我们在强领域随机化（例如随机动力学、粗糙地形和推力扰动）下，展示了在Unitree G1和Booster T1机器人上端到端学习类人行走控制器，以及全身人体运动跟踪策略的快速训练。我们提供了视频和开源实现：https://younggyo.me/fastsac-humanoid

Summary / 总结

This work addresses the challenge of fast and reliable sim-to-real reinforcement learning for humanoid robots by introducing a simple and practical method using FastSAC and FastTD3 algorithms. The method enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Key experimental findings include successful end-to-end learning of locomotion controllers and whole-body human-motion tracking policies under strong domain randomization conditions such as randomized dynamics and rough terrain.

该研究旨在解决人形机器人快速可靠的仿真到现实的强化学习挑战。它引入了一种简单的方法，使用FastSAC和FastTD3算法，能够在单个RTX 4090 GPU上仅用15分钟训练人形运动控制策略。该方法通过精心调优的设计选择和简约的奖励函数稳定了离策RL算法，展示了在强领域随机化条件下快速端到端学习人形运动控制器，并快速训练全身人体运动跟踪策略，应用于Unitree G1和Booster T1机器人。

RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

Authors: Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson, Yuxiao Chen, Boris Ivanovic, Marco Pavone

First: 2025-12-01T18:52:03+00:00 · Latest: 2025-12-01T18:52:03+00:00

Comments: Preprint

Abstract

Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader application domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similarly to or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41% and reduces collisions by 54%.

中文标题/摘要

标题：RoaD：通过自主驾驶策略的闭环回放作为示范进行闭环监督微调

自主驾驶策略通常通过行为克隆人类示范进行开环训练。然而，当部署到闭环中时，这些策略会遭受协变量偏移，导致累积错误。我们提出了Rollouts as Demonstrations (RoaD)，这是一种简单且高效的方法，通过利用策略自身的闭环回放作为额外的训练数据来缓解协变量偏移。在回放生成过程中，RoaD 结合了专家指导，使轨迹偏向高质量行为，产生既具有信息性又现实的示范，用于微调。该方法使闭环适应具有比强化学习少几个数量级的数据，并避免了先前闭环监督微调 (CL-SFT) 方法的限制性假设，允许更广泛的应用领域，包括端到端驾驶。我们在WOSAC，一个大规模交通仿真基准测试中展示了RoaD的有效性，其性能与先前的CL-SFT方法相当或更好；在AlpaSim，一个高保真神经重建基底模拟器中，其驾驶分数提高了41%，碰撞减少了54%。

Summary / 总结

RoaD is a method that addresses covariate shift in autonomous driving policies by using the policy's own closed-loop rollouts as additional training data. It incorporates expert guidance during rollout generation to produce high-quality demonstrations, enabling robust closed-loop adaptation with far less data than reinforcement learning. On WOSAC, RoaD performs similarly to or better than the prior CL-SFT method; in AlpaSim, it improves driving score by 41% and reduces collisions by 54%.

论文提出了RoaD方法，通过使用策略自身的闭环回放作为额外训练数据来缓解自动驾驶策略中的协变量偏移问题。该方法通过专家指导生成高质量的演示，使闭环适应更加稳健，所需数据量远少于强化学习。RoaD在WOSAC上的表现与先前方法相当或更好，并在AlpaSim中将驾驶得分提高了41%，减少了54%的碰撞。

VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Authors: Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, Denis Dimitrov

First: 2025-06-09T15:27:03+00:00 · Latest: 2025-12-01T18:51:34+00:00

Abstract

Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.

中文标题/摘要

标题：VIVAT：通过减少伪影改进VAE训练

变分自编码器（VAEs）仍然是生成计算机视觉的核心，但其训练往往受到降低重建和生成质量的伪影困扰。本文介绍了一种名为VIVAT的系统方法，用于在不进行根本性架构改变的情况下减轻KL-VAE训练中的常见伪影。我们详细分类了五种常见的伪影——颜色偏移、网格图案、模糊、角落和滴落伪影，并分析了它们的根本原因。通过简单的修改，包括调整损失权重、填充策略以及引入空间条件归一化，我们展示了VAE性能的显著提升。我们的方法在多个基准测试中实现了图像重建指标（PSNR和SSIM）的最新成果，并通过优越的CLIP分数提高了文本到图像生成的质量。通过保留KL-VAE框架的简洁性同时解决其实用挑战，VIVAT为研究人员和实践者优化VAE训练提供了可操作的见解。

Summary / 总结

The paper addresses the issue of artifacts in Variational Autoencoder (VAE) training, which can degrade reconstruction and generation quality. It introduces VIVAT, a method that mitigates common artifacts through simple modifications such as adjusting loss weights, padding strategies, and integrating Spatially Conditional Normalization, without requiring significant architectural changes. The method significantly improves VAE performance, achieving state-of-the-art results in image reconstruction metrics and enhancing text-to-image generation quality.

论文针对变分自编码器（VAE）训练中常见的图像退化问题，引入了VIVAT方法，通过调整损失权重、使用填充策略和集成空间条件归一化等简单修改来缓解常见问题。VIVAT提高了VAE的性能，在图像重建指标（PSNR和SSIM）上达到了最先进的水平，并通过更好的CLIP分数提升了文本到图像生成的质量。
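
PSNR, one of the reconstruction metrics cited in the VIVAT abstract, has a standard definition: $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. A minimal sketch follows, assuming images are float arrays scaled to $[0, 1]$; the test images are synthetic and unrelated to the paper's benchmarks.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")      # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

ref = np.full((8, 8, 3), 0.5)    # synthetic flat grey image
rec = ref + 0.1                  # uniform error of 0.1, so MSE = 0.01
value = psnr(ref, rec)
print(value)                     # about 20 dB
```

A uniform shift like this one is also the simplest model of the color-shift artifact the paper lists, although real artifacts are spatially structured.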

LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess

Authors: Sai Kolasani, Maxim Saplin, Nicholas Crispino, Kyle Montgomery, Jared Quincy Davis, Matei Zaharia, Chi Wang, Chenguang Wang

First: 2025-12-01T18:51:08+00:00 · Latest: 2025-12-01T18:51:08+00:00

Abstract

We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open- and closed-source models, each playing against a random opponent, using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.

中文标题/摘要

标题：LLM CHESS：通过国际象棋评估大型语言模型的推理和指令遵循能力

我们引入了LLM CHESS，这是一种评估框架，旨在通过在国际象棋领域的扩展代理交互来测试大型语言模型（LLM）的推理和指令遵循能力的泛化。我们通过使用一系列行为指标（包括胜率、输率、走棋质量、走棋合法性、幻觉动作和游戏时长）与随机对手对弈来对超过50个开源和闭源模型进行排名。对于部分顶级推理模型，我们通过与具有可变技能配置的国际象棋引擎对弈来推导出Elo估计值，这使得模型之间的比较变得容易理解。尽管指令遵循任务的简单性和对手的弱点，许多最先进的模型仍然难以完成比赛或实现一致的胜利。与其他复杂推理任务上的基准测试类似，我们的实验揭示了推理模型和非推理模型之间明显的分离。然而，与现有的静态基准不同，LLM CHESS的随机性和动态性质独特地减少了过拟合和记忆现象，防止了基准饱和，即使对于顶级推理模型也构成了挑战。为了支持未来对评估LLM的推理和指令遵循能力的研究，我们发布了我们的实验框架、一个公开的排行榜和相关游戏的数据集。

Summary / 总结

LLM CHESS is an evaluation framework designed to test the reasoning and instruction-following abilities of large language models (LLMs) in the domain of chess. It ranks over 50 models based on behavioral metrics such as win and loss rates, move quality, legality, hallucinated actions, and game duration. The framework also provides an Elo estimate for a subset of top reasoning models by playing against a chess engine with varying skill levels. Despite the simplicity of the task and the weak opponent, many state-of-the-art models struggle to complete games or achieve consistent wins, highlighting the challenge of reasoning and instruction-following in LLMs. The framework reveals a clear separation between reasoning and non-reasoning models and is designed to prevent overfitting and benchmark saturation. The experimental framework, leaderboard, and dataset are publicly released to support future research.

LLM CHESS 是一个评估框架，通过国际象棋游戏测试大型语言模型（LLMs）的推理和指令遵循能力。它基于各种行为指标对超过50个模型进行排名，并为部分顶级推理模型推导出Elo评分。尽管任务简单且对手较弱，许多最先进的模型仍难以完成比赛或稳定获胜，突显了推理任务对LLMs的难度。该框架通过减少过拟合和记忆效应，提供了一个具有挑战性的基准，支持未来的研究工作。
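The abstract says an Elo estimate is derived from games against an engine of configurable skill, but it does not describe the procedure. As a rough illustration only, the sketch below assumes the standard logistic Elo model and a hypothetical list of `(opponent_rating, score)` results (score 1.0 for a win, 0.5 for a draw, 0.0 for a loss). It finds the rating whose expected total score matches the observed total. The function names and the search bounds are assumptions, not taken from the paper.

```python
from typing import List, Tuple


def expected_score(rating: float, opponent: float) -> float:
    """Standard logistic Elo expectation of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def estimate_elo(
    results: List[Tuple[float, float]],
    lo: float = 0.0,
    hi: float = 4000.0,
    tol: float = 1e-6,
) -> float:
    """Bisection on the rating so that expected total score equals actual total.

    `results` holds (opponent_rating, score) pairs. If every game is a win
    (or every game a loss), the estimate saturates at the search bound.
    """
    actual = sum(score for _, score in results)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        predicted = sum(expected_score(mid, opp) for opp, _ in results)
        if predicted < actual:
            lo = mid  # rating too low to explain the observed score
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Because the expected score increases monotonically with the rating, bisection is enough here; a real evaluation would also report uncertainty, which this sketch omits.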

STORM: Segment, Track, and Object Re-Localization from a Single Image

Authors: Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting

First: 2025-11-12T22:06:51+00:00 · Latest: 2025-12-01T18:48:10+00:00

Abstract

Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and produce precise masks and 3D models for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.

中文标题/摘要

标题：STORM：从单张图像进行分割、跟踪和对象再定位

准确的6D姿态估计和跟踪是物理AI系统（如机器人）的基本能力。然而，现有方法通常需要目标的预定义3D模型，并依赖于第一帧的手动标注分割掩码，这既耗时又导致在面对遮挡或快速移动时性能降低。为解决这些限制，我们提出了STORM（从单张图像进行分割、跟踪和对象再定位），这是一种开源的鲁棒实时6D姿态估计系统，无需手动标注。STORM采用了一种新颖的三阶段流水线，结合了视觉-语言理解与特征匹配：上下文对象描述指导定位，自我交叉注意力机制识别候选区域并生成精确的掩码和3D模型以实现准确的姿态估计。另一个关键创新是我们自动再注册机制，通过特征相似性监控检测跟踪失败，并从严重遮挡或快速运动中恢复。STORM在具有多对象遮挡、高速运动和变化光照的具有挑战性的工业数据集上实现了最先进的精度，同时以实时速度运行且无需额外训练。这种无需标注的方法显著降低了部署成本，为现代应用（如灵活制造和智能质量控制）提供了实用的解决方案。

Summary / 总结

STORM is a real-time 6D pose estimation system that does not require manual annotation, addressing the limitations of existing approaches that rely on pre-defined 3D models and manually annotated segmentation masks. It uses a three-stage pipeline combining vision-language understanding and self-cross-attention mechanisms to guide localization, identify candidate regions, and produce precise masks and 3D models for accurate pose estimation. STORM also includes an automatic re-registration mechanism to handle tracking failures and severe occlusions. The system achieves state-of-the-art accuracy on challenging industrial datasets while operating at real-time speeds.

STORM 是一种无需手动标注的稳健实时 6D 姿态估计系统，解决了现有方法依赖预定义 3D 模型和初始分割掩码的局限性。它使用结合视觉-语言理解和自交叉注意力机制的三阶段管道来引导定位、识别候选区域并生成精确的掩码和 3D 模型以实现准确的姿态估计。STORM 还包括一个自动重新注册机制，用于处理跟踪失败和严重遮挡。该系统在具有多对象遮挡、高速运动和变化光照的挑战性工业数据集上实现了最先进的准确性，同时以实时速度运行且无需额外训练。
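The abstract describes detecting tracking failures "through feature similarity monitoring" without giving details. A minimal sketch of that idea, assuming features are plain vectors compared by cosine similarity and that a fixed threshold triggers re-registration; the function names and the threshold value are hypothetical, not STORM's actual interface.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tracking_lost(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.6,  # hypothetical value, would need tuning
) -> bool:
    """Flag a tracking failure when the current feature drifts from the reference."""
    return cosine_similarity(reference, current) < threshold
```

When `tracking_lost` returns `True`, a system of this kind would rerun localization on the current frame instead of continuing to track.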

PAI-Bench: A Comprehensive Benchmark For Physical AI

Authors: Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, Humphrey Shi

First: 2025-12-01T18:47:39+00:00 · Latest: 2025-12-01T18:47:39+00:00

Abstract

Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.

中文标题/摘要

标题：PAI-Bench：物理AI的综合基准

物理AI旨在开发能够感知和预测现实世界动力学的模型；然而，当前多模态大型语言模型和视频生成模型在这些能力上的支持程度尚不明确。我们引入了物理AI基准（PAI-Bench），这是一个统一且全面的基准，评估了视频生成、条件视频生成和视频理解的感知和预测能力，包含2,808个现实世界案例，采用与任务对齐的指标来捕捉物理合理性和领域特定推理。我们的研究对近期模型进行了系统评估，表明尽管视频生成模型在视觉保真度方面表现出色，但在保持物理连贯的动力学方面往往难以维持，而多模态大型语言模型在预测和因果解释方面表现有限。这些观察表明，当前系统在处理物理AI的感知和预测需求方面仍处于早期阶段。总之，PAI-Bench为评估物理AI奠定了现实基础，并指出了未来系统必须解决的关键差距。

Summary / 总结

PAI-Bench is a comprehensive benchmark for evaluating the perception and prediction capabilities of Physical AI models, including video generation, conditional video generation, and video understanding. It consists of 2,808 real-world cases with task-aligned metrics to assess physical plausibility and domain-specific reasoning. The study reveals that video generative models have strong visual fidelity but often fail to maintain physically coherent dynamics, while multi-modal large language models show limited performance in forecasting and causal interpretation, indicating that current systems are still in their early stages for handling the demands of Physical AI.

PAI-Bench 是一个全面的基准，用于评估物理人工智能模型在视频生成、条件视频生成和视频理解方面的感知和预测能力，包含2,808个真实世界案例，并使用任务对齐的指标来评估物理合理性与领域特定推理。研究发现，尽管视频生成模型具有高视觉保真度，但它们往往无法保持物理连贯的动力学，而多模态大型语言模型在预测和因果解释方面表现有限。这些发现表明，当前系统尚无法应对物理人工智能的感知和预测需求。

Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models

Authors: Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

First: 2024-08-15T00:45:21+00:00 · Latest: 2025-12-01T18:46:42+00:00

Abstract

Continuous perception, the ability to integrate visual observations over time in a continuous stream fashion, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and only reveals subsets of objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that neither higher sampling FPS, perception- or spatial-enhanced models, nor finetuning with additional videos leads to meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.

中文标题/摘要

标题：连续感知很重要：诊断多模态模型中的时间整合失败

连续感知，即以连续流的方式整合视觉观察的能力，对于实现稳健的现实世界理解至关重要，但在当前的多模态模型中却鲜有测试。我们引入了CP-Bench，这是一个最小化且完全可控的基准测试，通过一个极其简单的任务来隔离这种能力：在合成场景中随着摄像机移动并仅在任何时刻揭示部分物体的情况下，计数相同的立方体。尽管设置简单，我们发现最先进的开源和商用模型，包括Qwen-3-VL、InternVL3、GPT-5和Gemini-3-Pro，表现极为糟糕。静止摄像机的控制变体证实，失败并非源于物体识别，而是无法在时间上累积证据。进一步的实验表明，更高的采样FPS、感知或空间增强模型以及额外视频的微调都无法实现有意义的时间跨域泛化。我们的结果揭示了现代多模态架构和训练范式的根本局限性。CP-Bench 提供了一个简单而强大的诊断工具，并为开发能够进行真正时间一致视觉推理的模型建立了干净的测试平台。

Summary / 总结

The paper aims to test the continuous perception capability of multimodal models, which involves integrating visual observations over time. It introduces CP-Bench, a benchmark using a simple task of counting cubes in a moving camera scene. Despite the simplicity, state-of-the-art models fail to perform well. Experiments show that increasing FPS or using enhanced models does not improve cross-temporal generalization. The study reveals a fundamental limitation in current multimodal architectures and training methods, highlighting the need for better time-consistent visual reasoning models.

论文旨在测试多模态模型的连续感知能力，即在时间连续流中整合视觉观察。它引入了CP-Bench，使用一个简单的任务，在移动摄像头场景中数立方体。尽管任务简单，但最先进的模型表现不佳。实验显示，提高帧率或使用增强模型并不能改善跨时间的一致性。研究揭示了当前多模态架构和训练方法的基本局限性，强调了需要更好的时间一致视觉推理模型。
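To see why the counting task requires accumulating evidence over time, consider the toy accounting below. It assumes each frame is given as a set of hypothetical object IDs for the cubes visible at that moment. In the benchmark the cubes are identical, so a model has no such IDs; the sketch only shows how the ground-truth total differs from any single-frame count.

```python
from typing import List, Set


def total_count(frames: List[Set[int]]) -> int:
    """Number of distinct objects seen across the whole camera sweep."""
    seen: Set[int] = set()
    for visible in frames:
        seen |= visible  # accumulate evidence across time
    return len(seen)


def max_single_frame_count(frames: List[Set[int]]) -> int:
    """Best count obtainable from any one frame in isolation."""
    return max((len(visible) for visible in frames), default=0)
```

With frames `[{1, 2}, {2, 3}, {3, 4, 5}]`, no single frame shows more than three cubes, yet five distinct cubes appear over the sweep. A model that cannot integrate across frames cannot close this gap.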

The AI Productivity Index (APEX)

Authors: Bertie Vidgen, Abby Fennelly, Evan Pinnix, Julien Bencheck, Daniyal Khan, Zach Richards, Austin Bridges, Calix Huang, Ben Hunsberger, Isaac Robinson, Akul Datta, Chirag Mahapatra, Dominic Barton, Cass R. Sunstein, Eric Topol, Brendan Foody, Osvald Nitski

First: 2025-09-30T03:26:17+00:00 · Latest: 2025-12-01T18:46:32+00:00

Abstract

We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.

中文标题/摘要

标题：AI生产力指数（APEX）

我们提出了AI生产力指数（APEX-v1-extended）的扩展版本，这是一个基准测试，用于评估前沿模型是否能够在四个职位上执行具有经济价值的任务：投资银行助理、管理咨询师、大律师事务所助理和全科医生（MD）。该技术报告详细介绍了APEX-v1的扩展，包括将保留评估集从n=50增加到n=100个案例（总共n=400个）以及评分方法的更新。我们呈现了一个新的排行榜，其中GPT5（思考=高）仍然是表现最好的模型，得分为67.0%。APEX-v1-extended表明，前沿模型在执行典型专业任务时仍然存在重大限制。为了支持进一步的研究，我们开源了每个角色n=25个非基准示例案例（总共n=100个）以及我们的评估框架。

Summary / 总结

The AI Productivity Index (APEX-v1-extended) evaluates the capability of advanced models to perform economically valuable tasks in four professions. It includes an expanded evaluation set and updated grading methodology. GPT5 (Thinking = High) is the top model with a score of 67.0%, indicating that while frontier models can perform some tasks, they still have significant limitations in typical professional roles.

研究介绍了AI生产力指数（APEX-v1-extended）的扩展版本，用于评估前沿模型在投资银行助理、管理咨询师、大律师事务所助理和全科医生四个专业角色中的表现。它包括扩展的评估集和更新的评分方法，结果显示顶级模型如GPT5（思考=高）的得分为67.0%，表明在执行典型专业任务时存在显著限制。

Artemis: Structured Visual Reasoning for Perception Policy Learning

Authors: Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li

First: 2025-12-01T18:45:30+00:00 · Latest: 2025-12-01T18:45:30+00:00

Abstract

Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision for proposal quality, and it avoids the ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

中文标题/摘要

标题：阿耳忒弥斯：结构化视觉推理在感知策略学习中的应用

近期的视觉感知策略强化学习框架开始引入用自然语言表达的中间推理链。实证观察表明，这种纯粹的语言中间推理往往降低了感知任务的性能。我们认为，核心问题不在于推理本身，而在于推理的形式：虽然这些链在非结构化的语言空间中进行语义推理，但视觉感知需要在空间和对象中心的空间中进行推理。为此，我们引入了阿耳忒弥斯，这是一种感知策略学习框架，执行结构化的提案推理，其中每个中间步骤表示为一个（标签，边界框）对，捕捉可验证的视觉状态。这种设计使得中间状态的显式跟踪、提案质量的直接监督成为可能，并避免了基于语言推理引入的歧义。阿耳忒弥斯基于Qwen2.5-VL-3B，实现了在语义定位和检测任务上的强大性能，并在计数和几何感知任务上表现出显著的泛化能力。这些不同场景中的一致改进证实了将推理与空间表示对齐可以增强感知策略学习。由于其增强的视觉推理能力，阿耳忒弥斯在通用MLLM基准测试中也取得了竞争力的表现，表明空间化推理提供了一条实现可扩展和通用感知策略的原理性途径。

Summary / 总结

Artemis is a perception-policy learning framework that addresses the limitations of purely linguistic reasoning by introducing structured visual reasoning. Each intermediate step in Artemis is represented as a (label, bounding-box) pair, enabling explicit tracking of visual states and direct supervision for proposal quality. This design leads to strong performance on grounding and detection tasks, and substantial generalization to counting and geometric-perception tasks. The consistent improvements across various settings confirm that aligning reasoning with spatial representations enhances perception-policy learning.

Artemis 是一种使用结构化视觉推理的感知策略学习框架，旨在提高感知任务的性能。与依赖于非结构化语言推理的方法不同，Artemis 将中间步骤表示为（标签，边界框）对，这使得可以明确跟踪视觉状态并直接监督提案质量。这种方法在定位和检测任务上表现出强大的性能，并且在计数和几何感知任务上表现出显著的泛化能力，证实了将推理与空间表示对齐的好处。Artemis 还在通用 MLLM 基准测试中表现出竞争力，表明基于空间的推理为可扩展和通用的感知策略提供了一条有原则的途径。
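The (label, bounding-box) representation lends itself to a direct check against ground truth. The sketch below is one way such a check could look, assuming boxes in `(x1, y1, x2, y2)` corner format and intersection-over-union (IoU) as the match criterion. The abstract does not state how Artemis scores proposals, so `Proposal`, `proposal_matches`, and the 0.5 threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


@dataclass(frozen=True)
class Proposal:
    """One intermediate reasoning step: a label plus where it is in the image."""
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def proposal_matches(pred: Proposal, truth: Proposal, threshold: float = 0.5) -> bool:
    """A proposal is verifiable: same label and sufficient spatial overlap."""
    return pred.label == truth.label and iou(pred.box, truth.box) >= threshold
```

Unlike a free-form sentence, each step here can be marked right or wrong on its own, which is what makes direct supervision of intermediate states possible.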

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Authors: Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop

Venue: NeurIPS 2025

First: 2025-12-01T18:45:05+00:00 · Latest: 2025-12-01T18:45:05+00:00

Comments: The Thirty-Ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025

Abstract

Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.

中文标题/摘要

标题：非平稳环境下的离线强化学习预测

离线强化学习（RL）为从预先收集的数据集中训练策略提供了有希望的途径，当收集额外的交互数据不可行时。然而，现有的离线RL方法通常假设平稳性或仅在测试时考虑合成扰动，这些假设在由突然的时间变化偏移特征的现实世界场景中往往无法成立。这些偏移可能导致部分可观测性，使智能体错误地感知其真实状态并降低性能。为克服这一挑战，我们引入了非平稳离线RL中的预测（FORL）框架，该框架统一了（i）基于条件扩散的候选状态生成，无需预设未来非平稳性的任何特定模式，以及（ii）零样本时间序列基础模型。FORL针对那些容易出现意外、可能非马尔可夫偏移的环境，要求智能体从每个回合开始时就表现出鲁棒性。通过在离线RL基准上进行实证评估，并通过现实世界的时间序列数据增强以模拟现实的非平稳性，我们证明FORL在与竞争基线相比时始终能提高性能。通过将零样本预测与智能体的经验相结合，我们旨在弥合离线RL与现实世界非平稳环境复杂性之间的差距。

Summary / 总结

The paper addresses the challenge of non-stationary environments in offline reinforcement learning by introducing FORL, which combines conditional diffusion-based candidate state generation and zero-shot time-series foundation models. Empirical results show that FORL outperforms existing methods on offline RL benchmarks with real-world non-stationary data, improving agent performance in unpredictable, non-Markovian settings.

论文针对非平稳环境下使用离线强化学习（RL）训练策略的问题，现有方法通常假设环境是平稳的。它提出了FORL框架，结合了条件扩散候选状态生成和零样本时间序列基础模型来处理突发的时间变化偏移。实证评估表明，FORL在具有真实世界非平稳数据的离线RL基准测试中优于竞争基线。

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Authors: Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

First: 2025-06-14T00:25:26+00:00 · Latest: 2025-12-01T18:42:11+00:00

Abstract

As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks to probe models' underlying knowledge: file path identification from issue descriptions alone, and ground truth function reproduction with only the current file context and issue description. We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. This performance is merely up to 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization. Similar patterns are also observed for the function reproduction task, where the verbatim similarity is much higher on SWE-Bench Verified than on other similar coding benchmarks (up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full, but only up to 18% for tasks in other benchmarks). These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.

中文标题/摘要

标题：SWE-Bench 的幻觉：当最先进的LLM记住而不是推理

随着大型语言模型（LLM）的能力不断增强并被广泛采用，基准测试在评估其实际用途中扮演着核心角色。例如，SWE-Bench Verified已成为评估LLM软件工程能力的关键基准，特别是解决真实GitHub问题的能力。最近的LLM在SWE-Bench上的表现令人印象深刻，这引发了对其复杂编码任务能力的乐观情绪。然而，当前的评估协议可能夸大了这些模型的真实能力。区分LLM的可泛化问题解决能力和其他学习到的特征至关重要。在这项工作中，我们引入了两个诊断任务：仅从问题描述中识别文件路径，以及仅凭当前文件上下文和问题描述复现真实函数，以探究模型的内在知识。我们提供了实验证据，表明SWE-Bench-Verified上的性能提升部分可能是由于记忆而不是真正的解决问题。我们展示了最先进的模型仅使用问题描述就能达到高达76%的错误文件路径识别准确率，而无需访问仓库结构。这一性能在未包含于SWE-Bench的仓库任务中最高仅为53%，这表明可能存在数据污染或记忆现象。类似的趋势也出现在函数复现任务中，SWE-Bench Verified上的逐字相似度远高于其他类似编码基准（SWE-Bench Verified和Full上的连续5-gram准确率最高可达35%，而在其他基准的任务中最高仅为18%）。这些发现引发了对现有结果有效性的担忧，并强调了需要更稳健、抗污染的基准来可靠地评估LLM的编码能力。

Summary / 总结

This study examines whether the strong results of state-of-the-art LLMs on SWE-Bench Verified reflect genuine problem-solving or memorization. The authors introduce two diagnostic tasks: identifying the buggy file path from the issue description alone, and reproducing the ground-truth function given only the current file context and the issue description. Models reach up to 76% accuracy on file path identification for SWE-Bench Verified, but only up to 53% on tasks from repositories outside SWE-Bench. Function reproduction shows the same pattern, with up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full versus up to 18% on other benchmarks. These gaps point to possible data contamination or memorization and underscore the need for more robust, contamination-resistant benchmarks.

该研究探讨了当前基准在评估大型语言模型（LLMs）软件工程任务能力方面的局限性，特别是SWE-Bench Verified基准。作者引入了两个诊断任务来探究LLMs的底层知识，并发现SWE-Bench-Verified上的性能提升可能部分源于记忆而非解决问题的能力。研究显示，最先进的模型在仅从问题描述中识别文件路径和从当前文件上下文和问题描述中复现函数时表现出很高的准确性，但在测试外部仓库时准确性显著下降，这表明可能存在数据污染或记忆现象。这些发现强调了需要更稳健的基准来准确评估LLMs的编码能力。
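The abstract reports "consecutive 5-gram accuracy" as its measure of verbatim similarity but does not define it. One plausible reading, sketched below, is the fraction of 5-token windows in the generated function that also occur in the reference. The whitespace tokenization and the name `ngram_overlap` are assumptions made for illustration; the paper's metric may differ.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int = 5) -> List[Tuple[str, ...]]:
    """All consecutive n-token windows of `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(generated: str, reference: str, n: int = 5) -> float:
    """Share of generated n-grams that appear verbatim in the reference.

    Tokens are split on whitespace (an assumption). Returns 0.0 when the
    generated text is shorter than n tokens.
    """
    reference_grams = set(ngrams(reference.split(), n))
    generated_grams = ngrams(generated.split(), n)
    if not generated_grams:
        return 0.0
    hits = sum(1 for gram in generated_grams if gram in reference_grams)
    return hits / len(generated_grams)
```

A high score on benchmark repositories paired with a much lower score on comparable unseen code is the pattern the paper reads as a sign of memorization.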

ECO: Energy-Constrained Operator Learning for Chaotic Dynamics with Boundedness Guarantees

Authors: Andrea Goertzen, Sunbochen Tang, Navid Azizan

First: 2025-12-01T18:42:02+00:00 · Latest: 2025-12-01T18:42:02+00:00

Abstract

Chaos is a fundamental feature of many complex dynamical systems, including weather systems and fluid turbulence. These systems are inherently difficult to predict due to their extreme sensitivity to initial conditions. Many chaotic systems are dissipative and ergodic, motivating data-driven models that aim to learn invariant statistical properties over long time horizons. While recent models have shown empirical success in preserving invariant statistics, they are prone to generating unbounded predictions, which prevent meaningful statistics evaluation. To overcome this, we introduce the Energy-Constrained Operator (ECO) that simultaneously learns the system dynamics while enforcing boundedness in predictions. We leverage concepts from control theory to develop algebraic conditions based on a learnable energy function, ensuring the learned dynamics is dissipative. ECO enforces these algebraic conditions through an efficient closed-form quadratic projection layer, which provides provable trajectory boundedness. To our knowledge, this is the first work establishing such formal guarantees for data-driven chaotic dynamics models. Additionally, the learned invariant level set provides an outer estimate for the strange attractor, a complex structure that is computationally intractable to characterize. We demonstrate empirical success in ECO's ability to generate stable long-horizon forecasts, capturing invariant statistics on systems governed by chaotic PDEs, including the Kuramoto-Sivashinsky and the Navier-Stokes equations.

中文标题/摘要

标题：ECO：具有有界性保证的混沌动力学能量约束算子学习

混沌是许多复杂动力系统的基本特征，包括天气系统和流体湍流。由于对初始条件极端敏感，这些系统本质上难以预测。许多混沌系统是耗散的且遍历的，这促使人们开发数据驱动模型，旨在学习长时间尺度上的不变统计性质。虽然最近的模型在保持不变统计性质方面表现出色，但它们容易生成无界的预测，这妨碍了有意义的统计评估。为了解决这个问题，我们引入了能量约束算子（ECO），它在学习系统动力学的同时强制预测有界。我们利用控制理论的概念，基于可学习的能量函数开发代数条件，确保学习到的动力学是耗散的。ECO 通过高效的闭式二次投影层强制这些代数条件，从而提供可证明的轨迹有界性。据我们所知，这是首个为数据驱动混沌动力学模型建立此类形式保证的工作。此外，学习到的不变水平集为奇异吸引子提供了外部估计，而奇异吸引子是一种在计算上难以表征的复杂结构。我们展示了ECO在生成稳定长期预测方面的实证成功，捕捉到了由混沌偏微分方程（如Kuramoto-Sivashinsky和Navier-Stokes方程）支配的系统中的不变统计性质。

Summary / 总结

The paper addresses the challenge of predicting chaotic systems, which are sensitive to initial conditions and difficult to forecast. It introduces the Energy-Constrained Operator (ECO) to learn the system dynamics while ensuring bounded predictions. ECO uses a learnable energy function and a quadratic projection layer to enforce dissipativity, providing formal guarantees of trajectory boundedness. Experiments show that ECO can generate stable long-horizon forecasts and capture invariant statistics for chaotic partial differential equations like the Kuramoto-Sivashinsky and Navier-Stokes equations.

论文提出了能量约束算子（ECO），以确保在学习混沌系统动力学时预测结果有界。ECO 利用控制理论中的概念，通过可学习的能量函数和二次投影层来强制耗散性，从而提供轨迹有界的形式保证。实验表明，ECO 能够生成稳定的长期预测，并捕捉由Kuramoto-Sivashinsky 和 Navier-Stokes 方程控制的混沌偏微分方程的不变统计特性。
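The link between a dissipative energy function and bounded trajectories can be made concrete with a standard comparison argument. The inequality below is a generic textbook form, not the paper's specific algebraic condition. The symbols $E$, $\alpha$, and $\beta$ are assumptions introduced here, with $E$ taken to be radially unbounded (for example, a positive-definite quadratic).

```latex
% Assumed dissipation inequality along trajectories x(t):
\[
  \frac{d}{dt} E\bigl(x(t)\bigr) \;\le\; -\alpha\, E\bigl(x(t)\bigr) + \beta,
  \qquad \alpha > 0,\ \beta \ge 0 .
\]
% Gronwall's inequality then gives, for all t >= 0:
\[
  E\bigl(x(t)\bigr) \;\le\; e^{-\alpha t}\, E\bigl(x(0)\bigr)
  + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha t}\bigr)
  \;\le\; \max\Bigl\{ E\bigl(x(0)\bigr),\ \frac{\beta}{\alpha} \Bigr\}.
\]
```

Under these assumptions the sublevel set $\{x : E(x) \le \beta/\alpha\}$ is forward invariant and attracts every trajectory. This is the sense in which a learned level set can serve as an outer estimate of the attractor.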

DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

Authors: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen

First: 2025-11-28T12:19:57+00:00 · Latest: 2025-12-01T18:39:32+00:00

Abstract

This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/

中文标题/摘要

标题：DualCamCtrl: 一种用于几何感知相机控制视频生成的双分支扩散模型

本文提出了DualCamCtrl，这是一种用于相机控制视频生成的端到端扩散模型。近期的工作通过将相机姿态表示为射线条件来推进了这一领域，但它们往往缺乏足够的场景理解和几何意识。DualCamCtrl 特别针对这一局限性，引入了一种双分支框架，该框架能够相互生成相机一致的 RGB 和深度序列。为了使这两种模态协调一致，我们进一步提出了语义引导的相互对齐（SIGMA）机制，该机制以语义为导向并相互强化地进行 RGB-深度融合。这些设计共同使 DualCamCtrl 能够更好地分离外观和几何建模，生成更忠实于指定相机轨迹的视频。此外，我们分析并揭示了深度和相机姿态在去噪阶段的不同影响，并进一步证明早期和晚期阶段在形成全局结构和细化局部细节方面发挥着互补作用。大量实验表明，与先前方法相比，DualCamCtrl 实现了更一致的相机控制视频生成，相机运动误差降低了超过 40%。我们的项目页面：https://soyouthinkyoucantell.github.io/dualcamctrl-page/

Summary / 总结

DualCamCtrl is an end-to-end diffusion model designed for camera-controlled video generation, addressing the limitations of previous methods by incorporating a dual-branch framework that generates consistent RGB and depth sequences. It introduces the Semantic Guided Mutual Alignment (SIGMA) mechanism to fuse these modalities in a semantics-guided manner. Experiments show that DualCamCtrl reduces camera motion errors by over 40% compared to previous methods, leading to more faithful video generation following specified camera trajectories.

DualCamCtrl 是一种端到端的扩散模型，用于相机控制的视频生成。它通过引入双分支框架生成一致的 RGB 和深度序列，并提出语义引导的相互对齐机制进行 RGB-深度融合。实验表明，DualCamCtrl 将相机运动误差降低了超过 40%，并生成了更符合指定相机轨迹的视频。

Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

Authors: Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

First: 2025-12-01T18:37:19+00:00 · Latest: 2025-12-01T18:37:19+00:00

Abstract

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain-of-Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction, the model progressively reflects and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain-of-Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.

中文标题/摘要

标题：Chain-of-Ground：通过迭代推理和参考反馈改进GUI定位

GUI定位旨在将自然语言指令与复杂用户界面中的精确区域对齐。先进的多模态大型语言模型在视觉GUI定位方面表现出强大的能力，但仍难以处理小型或视觉上相似的目标以及现实世界布局中的歧义。这些限制源于定位能力有限以及对现有推理潜力的利用不足。我们提出了Chain-of-Ground（CoG），这是一种无需训练的多步定位框架，利用多模态大型语言模型进行迭代视觉推理和细化。模型不是直接预测，而是逐步反思和调整其假设，从而实现更准确和可解释的定位。我们的方法在ScreenSpot Pro基准测试中达到了68.4的准确率，提高了4.8个百分点。为了衡量在真实世界中的泛化能力，我们引入了TPanel UI数据集，包含420个带标注的工业控制面板，并带有模糊和遮挡等视觉失真。在TPanel UI上，Chain-of-Ground比强基线Qwen3 VL 235B高出6.9个百分点，展示了无需训练的多步定位在现实世界和数字界面中的有效性。这些结果突显了通过结构化迭代细化而非额外训练来释放定位潜力的方向。

Summary / 总结

The paper addresses GUI grounding with Chain-of-Ground (CoG), a training-free framework that improves visual grounding through iterative reasoning and reference feedback. Instead of predicting a location directly, CoG progressively refines its hypotheses, leading to more accurate localization. The approach reaches 68.4% accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. On the newly introduced TPanel UI dataset of 420 labeled industrial control panels, CoG improves over the strong baseline Qwen3 VL 235B by 6.9 points, indicating that the approach carries over from digital interfaces to real-world ones.

论文提出了Chain-of-Ground (CoG) 框架，通过迭代推理和参考反馈提高GUI定位的准确性，而无需额外训练。与直接预测不同，CoG 逐步细化假设，从而获得更好的定位效果。该方法在ScreenSpot Pro基准测试中达到68.4%的准确率，提高了4.8个百分点。此外，CoG 在TPanel UI数据集上比强基线Qwen3 VL 235B高出6.9个百分点，展示了其在真实世界和数字界面中的有效性。

AI-Driven Optimization under Uncertainty for Mineral Processing Operations

Authors: William Xu, Amir Eskanlou, Mansur Arief, David Zhen Yin, Jef K. Caers

First: 2025-12-01T18:35:54+00:00 · Latest: 2025-12-01T18:35:54+00:00

Comments: 27 pages, 13 figures, submitted to Sustainable Earth Resources Communications (SERC)

Abstract

The global capacity for mineral processing must expand rapidly to meet the demand for critical minerals, which are essential for building the clean energy technologies necessary to mitigate climate change. However, the efficiency of mineral processing is severely limited by uncertainty, which arises from both the variability of feedstock and the complexity of process dynamics. To optimize mineral processing circuits under uncertainty, we introduce an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP). We demonstrate the capabilities of this approach in handling both feedstock uncertainty and process model uncertainty to optimize the operation of a simulated, simplified flotation cell as an example. We show that by integrating the process of information gathering (i.e., uncertainty reduction) and process optimization, this approach has the potential to consistently perform better than traditional approaches at maximizing an overall objective, such as net present value (NPV). Our methodological demonstration of this optimization-under-uncertainty approach for a synthetic case provides a mathematical and computational framework for later real-world application, with the potential to improve both the laboratory-scale design of experiments and industrial-scale operation of mineral processing circuits without any additional hardware.

中文标题/摘要

标题：不确定性下人工智能驱动的矿物加工运营优化

为了满足关键矿产的需求，全球矿产加工能力必须迅速扩大，这些矿产对于构建必要的清洁能源技术以缓解气候变化至关重要。然而，矿产加工的效率受到不确定性的影响，这种不确定性既来自原料的变异性，也来自工艺动态的复杂性。为了在不确定性条件下优化矿产加工电路，我们提出了一种基于人工智能的方法，将矿产加工建模为部分可观测马尔可夫决策过程（POMDP）。我们通过一个模拟的简化浮选槽示例展示了该方法在处理原料不确定性及工艺模型不确定性方面的能力，以优化其操作。我们表明，通过将信息收集过程（即不确定性减少）与工艺优化过程相结合，该方法有可能在最大化总体目标（如净现值NPV）方面始终优于传统方法。我们对这种不确定性下的优化方法的数学和计算框架的演示为合成案例提供了方法论基础，有可能改善矿产加工电路的实验室规模设计实验和工业规模操作，无需额外硬件。

Summary / 总结

The paper targets uncertainty in mineral processing, which arises from feedstock variability and complex process dynamics and limits efficiency at a time when capacity must expand to supply critical minerals for clean energy technologies. It formulates mineral processing as a Partially Observable Markov Decision Process (POMDP) and demonstrates the approach on a simulated, simplified flotation cell. By integrating information gathering (uncertainty reduction) with process optimization, the approach shows the potential to consistently outperform traditional approaches at maximizing an overall objective such as net present value (NPV). The synthetic case provides a mathematical and computational framework for later real-world use in both laboratory-scale design of experiments and industrial-scale operation, without additional hardware.

研究旨在通过解决原料和工艺动态中的不确定性来提高矿物加工的效率。作者采用AI驱动的方法，将矿物加工建模为部分可观测马尔可夫决策过程（POMDP），以优化模拟浮选槽的操作。他们展示了该方法如何通过结合信息收集和工艺优化，在不确定性条件下最大化矿物加工操作的净现值（NPV），从而有望优于传统方法。
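
The information-gathering step of a POMDP can be illustrated with a discrete Bayesian belief update. The sketch below is not the authors' model: the two feedstock states, the assay observations, and all probabilities are hypothetical values chosen only to show how a measurement reduces uncertainty before an operating decision is made.

```python
def update_belief(belief, likelihood, observation):
    """Bayes update of a discrete belief over hidden feedstock states.

    belief:      {state: prior probability}
    likelihood:  {state: {observation: P(observation | state)}}
    """
    unnormalized = {s: belief[s] * likelihood[s][observation] for s in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example (not taken from the paper).
prior = {"low_grade": 0.5, "high_grade": 0.5}
likelihood = {
    "low_grade": {"assay_low": 0.8, "assay_high": 0.2},
    "high_grade": {"assay_low": 0.3, "assay_high": 0.7},
}

posterior = update_belief(prior, likelihood, "assay_high")
print(posterior)
```

A POMDP policy would then pick the action (for example, another assay versus a change in operating set point) that maximizes expected value under the updated belief.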

From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

Authors: Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

First: 2025-12-01T18:27:25+00:00 · Latest: 2025-12-01T18:27:25+00:00

Comments: Work in Progress. Code and data will be available at https://github.com/sitaocheng/from_atomic_to_composite

Abstract

The mechanism by which RL contributes to reasoning capabilities (whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors) remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.

中文标题/摘要

标题：从原子到复合：强化学习促进互补推理的一般化

关于RL如何贡献推理能力的问题——它是否激励新技能的合成还是仅仅放大现有行为——仍然是一个激烈的争论话题。在本研究中，我们通过互补推理这一复杂任务的视角来探讨这一问题，该任务需要将内部参数化知识与外部上下文信息相结合。我们使用人类传记的受控合成数据集，严格将这种能力分解为两个原子技能：参数化推理（依赖内部知识）和上下文推理（依赖外部信息）。为了严格评估能力边界，我们在三个不同的难度级别上评估了一般化：I.I.D.、复合和零样本设置。我们发现，虽然微调在分布内性能上是足够的，但在分布外一般化方面却遇到困难，尤其是在零样本设置中，因为关系组合是新颖的。至关重要的是，我们发现了微调泛化悖论：仅在复合任务上进行监督的模型在分布内准确性接近完美，但在分布外泛化上崩溃，表明它们依赖于死记硬背的路径捷径。相反，我们发现RL充当推理合成器而不是概率放大器。然而，我们发现了一个严格的原子先决条件：RL只能在基础模型首先通过微调掌握独立原子技能（参数化和上下文）之后，才能合成这些复杂策略。这些发现挑战了将RL仅仅视为放大器的观点，表明在给定足够的原子基础后，RL可以主动从学习的基本元素中合成复杂的推理策略，而无需显式监督这些复杂策略。这表明分离的原子训练后跟随RL提供了一条扩展路径，以实现复杂推理任务的一般化。

Summary / 总结

This study investigates how reinforcement learning (RL) contributes to reasoning capabilities by evaluating its role in a complex task called Complementary Reasoning. The task is decomposed into two atomic skills: Parametric Reasoning and Contextual Reasoning. The research finds that while supervised fine-tuning (SFT) performs well within its training distribution, it fails to generalize to out-of-distribution scenarios, especially in zero-shot settings. In contrast, RL is found to synthesize complex reasoning strategies, but only if the base model has first mastered the independent atomic skills through SFT. This suggests that RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such strategies, indicating a scalable path to generalization for complex reasoning tasks.

本研究通过评估强化学习（RL）在复杂任务——互补推理中的作用，探讨了其对推理能力的贡献。使用人类传记的合成数据集，研究将任务拆分为两个原子技能：参数推理和上下文推理。研究发现，虽然监督微调（SFT）在分布内表现良好，但在分布外场景尤其是零样本设置中无法泛化。相比之下，RL可以在没有对复合任务进行显式监督的情况下合成复杂的推理策略。然而，RL需要基础模型首先通过SFT掌握独立的原子技能。这表明，在给定足够的原子基础后，RL可以主动从学习的基本元素中合成复杂的推理策略。

Meta-Reinforcement Learning for Building Energy Management System

Authors: Benoit Boulet, Huiliang Zhang, Di Wu, Arnaud Zinflou

Venue: 2025 IEEE Electrical Power and Energy Conference (EPEC)

First: 2022-10-23T01:56:30+00:00 · Latest: 2025-12-01T18:22:25+00:00

Comments: arXiv admin note: text overlap with arXiv:1909.10165 by other authors

Abstract

The building sector is one of the largest contributors to global energy consumption. Improving its energy efficiency is essential for reducing operational costs and greenhouse gas emissions. Energy management systems (EMS) play a key role in monitoring and controlling building appliances efficiently and reliably. With the increasing integration of renewable energy, intelligent EMS solutions have received growing attention. Reinforcement learning (RL) has recently been explored for this purpose and shows strong potential. However, most RL-based EMS methods require a large number of training steps to learn effective control policies, especially when adapting to unseen buildings, which limits their practical deployment. This paper introduces MetaEMS, a meta-reinforcement learning framework for EMS. MetaEMS improves learning efficiency by transferring knowledge from previously solved tasks to new ones through group-level and building-level adaptation, enabling fast adaptation and effective control across diverse building environments. Experimental results demonstrate that MetaEMS adapts more rapidly to unseen buildings and consistently outperforms baseline methods across various scenarios.

中文标题/摘要

标题：元强化学习在建筑能源管理系统中的应用

建筑行业是全球能源消耗的最大贡献者之一。提高其能源效率对于降低运营成本和减少温室气体排放至关重要。能源管理系统（EMS）在监测和控制建筑设备方面发挥着关键作用。随着可再生能源的不断增加集成，智能EMS解决方案受到了越来越多的关注。最近，强化学习（RL）被探索用于此目的，并显示出强大的潜力。然而，大多数基于RL的EMS方法需要大量的训练步骤来学习有效的控制策略，尤其是在适应未见过的建筑时，这限制了它们的实际部署。本文介绍了MetaEMS，这是一种用于EMS的元强化学习框架。MetaEMS通过组级和建筑级的适应，将先前解决的任务的知识转移到新任务中，从而提高学习效率，实现跨不同建筑环境的快速适应和有效控制。实验结果表明，MetaEMS能够更快地适应未见过的建筑，并在各种场景中始终优于基线方法。

Summary / 总结

This paper addresses the challenge of improving energy efficiency in buildings through the development of MetaEMS, a meta-reinforcement learning framework for energy management systems. MetaEMS enhances learning efficiency by transferring knowledge from previously solved tasks to new ones, facilitating faster adaptation and effective control across diverse building environments. The experimental results show that MetaEMS adapts more quickly to unseen buildings and performs better than baseline methods in various scenarios.

本文通过开发一种用于能源管理系统（EMS）的元强化学习框架MetaEMS，来提高建筑的能源效率。MetaEMS通过将先前学习的任务知识转移到新任务中，提高了学习效率，使新建筑的快速适应成为可能。实验结果表明，MetaEMS在各种场景中均优于基线方法，并且能够更快地适应新建筑。

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

Authors: Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng

Venue: CVPR Workshop Anti-UAV 2025 (Best Paper)

First: 2025-04-16T10:58:33+00:00 · Latest: 2025-12-01T18:22:15+00:00

Comments: Best Paper, Accepted at CVPR Workshop Anti-UAV 2025. 16 pages

Abstract

Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.

中文标题/摘要

标题：确保天空安全：无人机反制方法综述、基准测试及未来方向

无人驾驶航空器（UAV）对于基础设施检查、监控及相关任务不可或缺，但同时也带来了关键的安全挑战。本文综述了反无人机领域，重点围绕分类、检测和跟踪三大核心目标，详细介绍了诸如基于扩散的数据合成、多模态融合、视觉-语言建模、自监督学习和强化学习等新兴方法。我们系统地评估了单模态和多传感器管道（包括RGB、红外、音频、雷达和RF）中的最新解决方案，并讨论了大规模及对抗性导向的基准测试。我们的分析揭示了实时性能、隐形检测和群无人机场景中的持续差距，强调了需要开发稳健、适应性强的反无人机系统。通过突出开放的研究方向，我们旨在促进创新并指导无人机广泛使用时代下新一代防御策略的发展。

Summary / 总结

This survey examines anti-UAV methods focusing on classification, detection, and tracking, evaluating state-of-the-art solutions across single-modality and multi-sensor pipelines. Key findings highlight gaps in real-time performance, stealth detection, and swarm scenarios, emphasizing the need for robust and adaptive anti-UAV systems. The study also discusses emerging methodologies like diffusion-based data synthesis and multi-modal fusion, and calls for further research to address these challenges.

该调研聚焦于无人机分类、检测和跟踪方法，评估了包括RGB、红外、音频、雷达和RF在内的多种模态的先进解决方案。研究指出，在实时性能、隐形检测和集群场景方面存在不足，强调了需要开发稳健且适应性强的反无人机系统。此外，研究还讨论了诸如扩散数据合成、多模态融合和强化学习等新兴方法，旨在指导未来的研究和反无人机防御策略的发展。

Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

Authors: Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson

First: 2025-10-08T09:18:53+00:00 · Latest: 2025-12-01T18:15:29+00:00

Comments: 21 pages

Abstract

Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked data's components. We empirically support this hypothesis across vision-language models and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization. For example, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but such scaling adds significant robustness if we first robustify its vision encoder. This correlation of inference-compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Thus, we advise layering train-time and test-time defenses to obtain their synergistic benefit.

中文标题/摘要

标题：获取财富或死亡：盈利地交易推理计算以提高鲁棒性

尽管在模型的鲁棒性提升上投入了大量训练计算资源，但它们仍然容易受到对抗性离分布（OOD）数据的影响。Zaremba等人（2025）在测试时对此问题取得进展，表明语言模型的推理能力提高了模型规范的满足度，这些规范旨在抵御攻击，从而在推理努力与对抗性脱缰攻击的鲁棒性之间建立了相关性。然而，当攻击者获得梯度访问权或多种模态输入时，这种测试计算的好处会消失。我们解决了这一缺口，阐明了即使在这些情况下，推理计算也提供了益处。我们的方法认为，通过将离分布（OOD）数据通过其在分布（ID）组件进行理解，组成泛化使模型能够遵守针对对抗性OOD输入的防御规范。具体而言，我们提出了推理计算鲁棒性假设（RICH）：当模型的训练数据更好地反映攻击数据的组件时，推理计算防御会受益。我们通过视觉语言模型和攻击类型的经验支持这一假设，发现如果通过组成泛化解锁OOD数据上的规范遵循，测试计算可以带来鲁棒性提升。例如，InternVL 3.5 gpt-oss 20B在测试计算扩展时几乎没有获得鲁棒性提升，但如果首先使其视觉编码器鲁棒化，这种扩展会显著增加鲁棒性。推理计算鲁棒性收益与基础模型鲁棒性的这种相关性是RICH的富者愈富动态：对于鲁棒性增强的模型，攻击数据的组件更接近ID，有助于组成泛化到OOD数据。因此，我们建议叠加训练时间和测试时间的防御以获得它们的协同效益。

Summary / 总结

The paper addresses the vulnerability of models to adversarially out-of-distribution (OOD) data despite extensive training compute spent on robustification. It proposes the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses pay off more as the model's training data better reflects the components of the attacked data, so that compositional generalization lets the model follow its defensive specifications on OOD inputs. Experiments across vision-language models and attack types support this. For example, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but gains significant robustness from the same scaling once its vision encoder has been robustified. This rich-get-richer dynamic motivates layering train-time and test-time defenses.

论文探讨了尽管进行了大量鲁棒性优化，模型仍对对抗性离分布数据易受攻击的问题。提出了推理计算增强假设（RICH），认为增加推理计算可以提升模型的鲁棒性，尤其是在模型训练数据更好地反映攻击数据的组成部分时。研究通过多种视觉语言模型和攻击类型的支持，实证表明，在对模型的视觉编码器进行鲁棒性优化后，增加测试时的计算可以实现鲁棒性提升。

NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Authors: Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou

First: 2025-10-15T01:26:52+00:00 · Latest: 2025-12-01T18:14:58+00:00

Abstract

Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.

中文标题/摘要

标题：NeuroRVQ：多尺度EEG分词用于生成大型脑波模型

脑电图（EEG）捕捉了多个时间和频谱尺度的神经活动，产生丰富但复杂的信号，适合表示学习。最近，训练用于预测掩码信号分词的EEG基础模型显示出学习通用表示的潜力。然而，它们的表现受到其信号分词模块的限制。现有的神经分词器无法保留高频动态，限制了它们以高保真度重建EEG信号的能力。我们引入了NeuroRVQ，这是一种以基于码本的分词器为中心的大规模脑波模型（LBM）。我们的分词器整合了：(i) 多尺度特征提取模块，捕捉完整的神经频谱；(ii) 分层残差向量量化（RVQ）码本，用于高分辨率编码；以及(iii) 一种EEG信号相位和振幅感知的损失函数，用于高效训练。这种设计使EEG压缩变得高效，同时支持所有频带的准确重建，从而实现稳健的生成掩码建模。我们的实验证明，NeuroRVQ的重建误差较低，并在多种下游任务中优于现有的LBM。更广泛地说，NeuroRVQ分词器为基于码本的通用脑波模型建立了强大的先验，促进了神经解码、生成建模和多模态生物信号集成的进步。

Summary / 总结

NeuroRVQ is designed to address the limitations of existing EEG tokenization methods by integrating multi-scale feature extraction, hierarchical RVQ codebooks, and EEG-aware loss functions. This approach enables efficient EEG compression and accurate reconstruction across all frequency bands, leading to improved performance in generative masked modeling. Empirical results show that NeuroRVQ outperforms existing Large Brainwave Models (LBMs) on various downstream tasks and achieves lower reconstruction error.

NeuroRVQ旨在通过解决现有分词方法的限制来改进EEG表示学习。它使用多尺度特征提取模块、分层RVQ码本和EEG感知损失函数来有效捕捉和编码高频动态。实验结果表明，NeuroRVQ减少了重构误差，并在各种下游任务中优于现有的大脑波模型。
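
Residual vector quantization (RVQ), the codebook scheme named in the abstract, can be sketched in a few lines. This is a generic illustration of RVQ, not the NeuroRVQ tokenizer: the two tiny codebooks and the 2-D input vector are hypothetical, and a real tokenizer would learn its codebooks and operate on multi-scale EEG features.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
indices = rvq_encode([1.2, 1.0], codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices, reconstruction)
```

Each additional stage quantizes what the earlier stages missed, which is why hierarchical codebooks can give higher-resolution encoding than a single codebook of the same size.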

SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation

Authors: Zisu Li, Hengye Lyu, Jiaxin Shi, Yufeng Zeng, Mingming Fan, Hanwang Zhang, Chen Liang

First: 2025-12-01T18:13:40+00:00 · Latest: 2025-12-01T18:13:40+00:00

Abstract

Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.

Summary / 总结

SpriteHand is an autoregressive video generation framework designed for real-time synthesis of hand-object interaction videos. It addresses the challenge of modeling complex interactions, especially with non-rigid objects, by taking a static object image and a video stream as input and generating corresponding interaction effects in real time. The model uses a causal inference architecture and a hybrid post-training approach to ensure visual realism and temporal coherence, achieving real-time streaming at around 18 FPS with a latency of approximately 150 ms on a single NVIDIA RTX 5090 GPU. Experiments show that SpriteHand outperforms both generative and engine-based baselines in terms of visual quality, physical plausibility, and interaction fidelity.

SpriteHand 是一种自回归视频生成框架，用于实时合成手物交互。它克服了传统基于模拟的方法的局限性，能够捕捉各种物体类型的动态交互。该模型采用因果推理架构和混合后训练方法，实现高视觉真实感和时间连贯性。实验表明，SpriteHand 在视觉质量、物理合理性以及交互保真度方面均优于生成性和引擎基线方法。

Adaptive Plane Reformatting for 4D Flow MRI using Deep Reinforcement Learning

Authors: Javier Bisbal, Julio Sotelo, Maria I Valdés, Pablo Irarrazaval, Marcelo E Andia, Julio García, José Rodriguez-Palomarez, Francesca Raimondi, Cristián Tejos, Sergio Uribe

First: 2025-05-31T22:02:05+00:00 · Latest: 2025-12-01T18:13:13+00:00

Abstract

Background and Objective: Plane reformatting for four-dimensional phase contrast MRI (4D flow MRI) is time-consuming and prone to inter-observer variability, which limits fast cardiovascular flow assessment. Deep reinforcement learning (DRL) trains agents to iteratively adjust plane position and orientation, enabling accurate plane reformatting without the need for detailed landmarks, making it suitable for images with limited contrast and resolution such as 4D flow MRI. However, current DRL methods assume that test volumes share the same spatial alignment as the training data, limiting generalization across scanners and institutions. To address this limitation, we introduce AdaPR (Adaptive Plane Reformatting), a DRL framework that uses a local coordinate system to navigate volumes with arbitrary positions and orientations. Methods: We implemented AdaPR using the Asynchronous Advantage Actor-Critic (A3C) algorithm and validated it on 88 4D flow MRI datasets acquired from multiple vendors, including patients with congenital heart disease. Results: AdaPR achieved a mean angular error of 6.32 +/- 4.15 degrees and a distance error of 3.40 +/- 2.75 mm, outperforming global-coordinate DRL methods and alternative non-DRL methods. AdaPR maintained consistent accuracy under different volume orientations and positions. Flow measurements from AdaPR planes showed no significant differences compared to two manual observers, with excellent correlation (R^2 = 0.972 and R^2 = 0.968), comparable to inter-observer agreement (R^2 = 0.969). Conclusion: AdaPR provides robust, orientation-independent plane reformatting for 4D flow MRI, achieving flow quantification comparable to expert observers. Its adaptability across datasets and scanners makes it a promising candidate for medical imaging applications beyond 4D flow MRI.

中文标题/摘要

标题：使用深度强化学习的自适应平面重格式化4D流MRI

背景与目的：4D相位对比MRI（4D流MRI）的平面重格式化耗时且易受观察者间变异的影响，限制了快速心血管流评估。深度强化学习（DRL）训练代理迭代调整平面位置和方向，无需详细地标记点即可实现准确的平面重格式化，使其适用于对比度和分辨率有限的图像，如4D流MRI。然而，当前的DRL方法假设测试体素与训练数据具有相同的空间对齐，限制了其在不同扫描器和机构间的泛化能力。为解决这一限制，我们引入了AdaPR（自适应平面重格式化），这是一种使用局部坐标系统导航任意位置和方向体素的DRL框架。方法：我们使用异步优势演员评论者（A3C）算法实现了AdaPR，并在来自多个供应商的88个4D流MRI数据集中进行了验证，包括先天性心脏病患者的样本。结果：AdaPR实现了平均角度误差6.32±4.15度和距离误差3.40±2.75毫米，优于全局坐标DRL方法和替代的非DRL方法。AdaPR在不同体素方向和位置下保持了稳定的准确性。来自AdaPR平面的流测量与两名手动观察者之间无显著差异，相关性良好（R²=0.972和R²=0.968），与观察者间一致性（R²=0.969）相当。结论：AdaPR为4D流MRI提供了稳健的、方向无关的平面重格式化，实现了与专家观察者相当的流量化。其在不同数据集和扫描器上的适应性使其成为超越4D流MRI的医学成像应用的有前途的候选者。

Summary / 总结

The research aims to address the time-consuming and inter-observer variability issues in plane reformatting for 4D flow MRI using a deep reinforcement learning (DRL) approach. The method, AdaPR, uses the A3C algorithm and a local coordinate system to navigate volumes with arbitrary positions and orientations, achieving a mean angular error of 6.32 +/- 4.15 degrees and a distance error of 3.40 +/- 2.75 mm, which outperforms other DRL and non-DRL methods. Flow measurements from AdaPR planes showed no significant differences compared to manual observers, with excellent correlation (R^2 = 0.972 and R^2 = 0.968), comparable to inter-observer agreement (R^2 = 0.969).

研究旨在通过深度强化学习（DRL）方法解决4D流MRI中的平面重塑耗时和观察者间差异问题。方法AdaPR使用A3C算法和局部坐标系统来导航任意位置和方向的体积，实现了6.32 +/- 4.15度的平均角度误差和3.40 +/- 2.75毫米的距离误差，优于其他DRL和非DRL方法。来自AdaPR平面的流测量与手动观察者无显著差异，相关性极好（R^2 = 0.972和R^2 = 0.968），与观察者间一致性（R^2 = 0.969）相当。
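
The two reported metrics can be made concrete with a short sketch. It assumes that the angular error is the angle between the predicted and reference plane normals and that the distance error is the Euclidean distance between plane centers; the abstract does not spell out these definitions, so both are assumptions, and the function names are hypothetical.

```python
import math


def angular_error_deg(normal_pred, normal_ref):
    """Angle in degrees between two plane normals (assumed metric definition)."""
    dot = sum(a * b for a, b in zip(normal_pred, normal_ref))
    norm = math.sqrt(sum(a * a for a in normal_pred)) * math.sqrt(
        sum(b * b for b in normal_ref)
    )
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def distance_error_mm(center_pred, center_ref):
    """Euclidean distance between plane centers, in the units of the inputs."""
    return math.dist(center_pred, center_ref)


print(angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
print(distance_error_mm((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

The sign of the normal is kept here because flow through a plane is directional; a sign-agnostic variant would take the absolute value of the cosine.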

Outcome-based Reinforcement Learning to Predict the Future

Authors: Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

First: 2025-05-23T14:56:07+00:00 · Latest: 2025-12-01T18:12:07+00:00

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR methods towards forecasting future real-world events - a challenging task for RL due to the very noisy (and delayed) outcomes involved. Using a novel dataset of recent questions from a prediction market, and accompanying relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10% across all questions in the test set. We detail and compare approaches used in training our model, including augmenting our training-data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference-time.

中文标题/摘要

标题：基于结果的强化学习预测未来

验证奖励的强化学习（RLVR）已被证明是提高大型语言模型在编程和数学等领域的推理能力的有效方法。在这里，我们应用RLVR方法来预测未来的真实世界事件——由于涉及的结果非常嘈杂（且延迟），这是一项对RL极具挑战的任务。利用来自预测市场最近问题的新型数据集以及相关新闻头条，我们展示了训练一个紧凑的（14B）推理模型可以匹配或超越前沿模型o1的预测准确性，同时大幅提高概率校准。该模型的表现也具有实际意义：在Polymarket交易模拟中，我们估计其押注在整个测试集中的投资回报率超过10%。我们详细介绍了训练模型所使用的方法，并比较了包括使用合成预测问题增强训练数据、学习稳定性护栏以及推理时中位数预测采样在内的方法。
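
Median prediction sampling and probabilistic scoring can be sketched as follows. The abstract does not state which scoring rule the authors use, so the Brier score here is an assumption (it is a common calibration-sensitive rule for binary forecasts), and the sampled probabilities are made-up values.

```python
from statistics import median


def brier_score(probability, outcome):
    """Squared error between a forecast probability and a 0/1 outcome (lower is better)."""
    return (probability - outcome) ** 2


def median_forecast(samples):
    """Aggregate several sampled forecasts by taking their median."""
    return median(samples)


# Hypothetical probabilities sampled from a model for one yes/no question.
samples = [0.55, 0.70, 0.62, 0.95, 0.60]
forecast = median_forecast(samples)
print(forecast, brier_score(forecast, 1))
```

The median is less sensitive than the mean to a single overconfident sample such as the 0.95 above, which is one plausible reason to aggregate this way at inference time.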

Learned-Rule-Augmented Large Language Model Evaluators

Authors: Jie Meng, Jin Mao

First: 2025-12-01T18:08:45+00:00 · Latest: 2025-12-01T18:08:45+00:00

Abstract

Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks, but their application to broader evaluation scenarios remains limited. In this work, we explore the potential of LLMs as general evaluators across diverse tasks. Although LLM-based evaluators have made progress in different areas, existing methods struggle to generalize due to their reliance on costly, human-designed evaluation principles, which are often misaligned with both annotated data and LLMs' understanding. To address these challenges, we propose a rule-augmented evaluation paradigm. First, we introduce a rule distillation method that automatically extracts scoring rules from data using an LLM-assisted Monte Carlo Tree Search (MCTS), alleviating scalability issues and improving alignment with data. Second, to enable LLMs to effectively apply the learned rules, we propose two strategies: (1) Chain-of-Rule (CoR), which guides LLM to follow distilled rules, and (2) training a rule-augmented LLM evaluator (RuAE) via reinforcement learning, further bridging the gap between rules and LLMs' reasoning. Extensive experiments on diverse tasks demonstrate the effectiveness and generalizability of our approach across various evaluation scenarios.

中文标题/摘要

标题：学习规则增强的大语言模型评估器

大语言模型（LLMs）主要被用作自然语言生成（NLG）任务的评估器，但它们在更广泛评估场景中的应用仍然有限。在本文中，我们探讨了LLMs作为跨多种任务的一般评估器的潜力。尽管基于LLM的评估器在不同领域取得了进展，但现有方法由于依赖于昂贵的人工设计评估原则而难以泛化，这些原则往往与标注数据和LLM的理解不一致。为了解决这些挑战，我们提出了一种规则增强的评估范式。首先，我们介绍了一种规则蒸馏方法，该方法使用LLM辅助的蒙特卡洛树搜索（MCTS）从数据中自动提取评分规则，从而缓解了可扩展性问题并提高了与数据的对齐。其次，为了使LLMs能够有效地应用所学规则，我们提出了两种策略：（1）规则链（CoR），引导LLM遵循蒸馏规则；（2）通过强化学习训练规则增强的大语言模型评估器（RuAE），进一步弥合规则与LLM推理之间的差距。在多种任务上的广泛实验表明，我们的方法在各种评估场景中具有有效性和泛化性。

Summary / 总结

This work aims to extend large language models (LLMs) from evaluators of natural language generation (NLG) to general evaluators across diverse tasks. To address the limitations of existing methods, which rely on costly and often misaligned human-designed evaluation principles, the authors propose a rule-augmented evaluation paradigm. This includes a rule distillation method using LLM-assisted Monte Carlo Tree Search (MCTS) to extract scoring rules from data, and two strategies for applying them: Chain-of-Rule (CoR), which guides LLMs to follow these rules, and a rule-augmented LLM evaluator (RuAE) trained via reinforcement learning. Experiments across various tasks show the effectiveness and generalizability of this approach.

本文旨在提升大型语言模型（LLM）在自然语言生成任务评估中的通用性。为了解决现有方法依赖于昂贵且与数据和LLM理解不匹配的人工设计评估原则的问题，作者提出了一种规则增强的评估范式。这包括使用LLM辅助的蒙特卡洛树搜索（MCTS）从数据中自动提取评分规则的规则提取方法，以及两种策略：链式规则（CoR）来引导LLM遵循这些规则，以及通过强化学习训练规则增强的LLM评估器（RuAE），进一步弥合规则与LLM推理之间的差距。实验表明，该方法在各种评估场景中具有有效性和通用性。

KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference

Authors: Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, Elliott Delaye

First: 2025-12-01T18:03:47+00:00 · Latest: 2025-12-01T18:03:47+00:00

Abstract

Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. Our framework identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We additionally verify the selected frontiers on additional benchmarks of Needle-in-a-Haystack, GSM8k and MMLU as well as extended context lengths of up to 128k to demonstrate the practical need of joint optimization for efficient LLM inference.

中文标题/摘要

标题：KV 帕累托：KV 缓存和模型压缩在长上下文推理中的系统级优化

长上下文大型语言模型（LLMs）在推理过程中由于键值（KV）缓存随序列长度线性增长而面临显著的内存瓶颈。虽然单独优化技术如键值缓存量化、分块预填充和模型权重量化已经显示出潜力，但它们的联合效果及其在边缘部署中的最优配置仍然未被充分探索。我们引入了KV 帕累托，这是一种系统级框架，系统地映射了这三种互补优化技术之间总内存消耗与任务准确率之间的权衡前沿。我们的框架评估了不同LLM架构（Qwen、Llama、Mistral）和不同的KV量化方案（int2/4/8、混合精度）、粒度（按token、按张量、按块）以及通过AWQ的4位权重量化。我们的框架识别了特定模型的帕累托最优配置，这些配置在长上下文任务中实现了68-78%的总内存减少，同时准确率下降幅度最小（1-3%）。我们还验证了所选前沿在Needle-in-a-Haystack、GSM8k和MMLU等额外基准以及长达128k的扩展上下文长度上，以证明联合优化对于高效LLM推理的实际需求。

Summary / 总结

The paper introduces KV Pareto, a systems-level framework that optimizes key-value (KV) cache and model compression techniques for long-context inference in Large Language Models (LLMs). It evaluates multiple LLM architectures and quantization schemes, identifying model-specific Pareto-optimal configurations that reduce total memory consumption by 68-78% with minimal accuracy degradation. The framework also verifies these configurations on additional benchmarks and extended context lengths, highlighting the practical need for joint optimization.

KV Pareto 是一个系统级框架，用于优化长上下文推理中大型语言模型（LLM）的关键值（KV）缓存和模型压缩技术。该框架评估了不同 LLM 架构下的 KV 量化、分块预填充和 4 位权重量化等多种配置。该框架确定了能够将总内存消耗减少 68-78% 同时保持最小准确度下降的帕累托最优配置，展示了联合优化对于高效 LLM 推理的实际需求。
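
The linear growth of the KV cache and the idea of a memory/accuracy Pareto frontier can both be sketched briefly. The cache-size formula is the standard one (keys plus values, per layer, per KV head); it ignores the scale and zero-point overhead that real quantization schemes add. The model shape and the (memory, accuracy) numbers are hypothetical and are not results from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Size of the KV cache: 2 tensors (K and V) per layer, growing linearly in seq_len."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


def pareto_front(configs):
    """Names of configs not dominated in (lower memory, higher accuracy)."""
    front = []
    for name, mem, acc in configs:
        dominated = any(
            m <= mem and a >= acc and (m < mem or a > acc)
            for _, m, a in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 128k context.
fp16_bytes = kv_cache_bytes(32, 8, 128, 131072, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 8, 128, 131072, bits_per_value=4)
print(fp16_bytes / 1024**3, int4_bytes / 1024**3)  # sizes in GiB

# Hypothetical (name, memory in GiB, accuracy) measurements.
configs = [
    ("fp16", 16.0, 0.80),
    ("int8", 8.0, 0.79),
    ("int4", 4.0, 0.78),
    ("int2", 2.0, 0.60),
    ("int8_per_tensor", 8.0, 0.70),
]
front = pareto_front(configs)
print(front)
```

A configuration stays on the frontier only if no other configuration is at least as good on both axes and strictly better on one, which is the selection criterion a systems-level sweep of this kind would apply per model.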

GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

Authors: Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, Sebastian Scherer

First: 2025-12-01T18:03:29+00:00 · Latest: 2025-12-01T18:03:29+00:00

Abstract

Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.

中文标题/摘要

标题：GrndCtrl：通过自监督奖励对齐实现世界模型的接地

近期在视频世界建模方面的进展使大规模生成模型能够以高视觉保真度模拟具身环境，为预测、规划和控制提供了强大的先验知识。然而，尽管这些模型具有高度的真实性，它们往往缺乏几何接地（geometric grounding），限制了其在需要空间连贯性和长时程稳定性的导航任务中的应用。我们提出了基于世界接地的强化学习（RLWG），这是一种自监督的后训练框架，通过几何和感知奖励将预训练的世界模型与物理上可验证的结构对齐。类似于语言模型中的可验证反馈强化学习（RLVR），RLWG 可以使用衡量位姿循环一致性、深度重投影和时间连贯性的多种奖励。我们以 GrndCtrl 实例化这一框架，它是一种基于组相对策略优化（GRPO）的奖励对齐适应方法，得到的世界模型能够在具身导航中保持稳定的轨迹、一致的几何结构和可靠的推演（rollout）。与大型语言模型的后训练对齐类似，GrndCtrl 利用可验证的奖励在生成式预训练与接地行为之间架起桥梁，在户外环境中取得了优于监督微调的空间连贯性和导航稳定性。
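
The abstract says GrndCtrl builds on Group Relative Policy Optimization (GRPO) with several verifiable rewards. A minimal sketch of the group-relative advantage step, as GRPO is commonly formulated, is given below. How GrndCtrl actually combines its rewards is not stated here, so the weighted sum in `combined_reward`, the weights, and the example scores are all assumptions made for illustration.

```python
import statistics


def combined_reward(pose_cycle, depth_reproj, temporal, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-rollout reward terms (the weighting is an assumption)."""
    w_pose, w_depth, w_temp = weights
    return w_pose * pose_cycle + w_depth * depth_reproj + w_temp * temporal


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled from the same context."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Each rollout is scored relative to its own group, so no learned critic is needed.
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts, each scored by the three reward terms in [0, 1].
ROLLOUT_SCORES = [(0.9, 0.8, 0.7), (0.4, 0.5, 0.6), (0.2, 0.1, 0.3), (0.7, 0.7, 0.7)]
REWARDS = [combined_reward(*scores) for scores in ROLLOUT_SCORES]
ADVANTAGES = group_relative_advantages(REWARDS)
```

Rollouts that score above the group mean receive positive advantages and are reinforced. When every rollout in a group scores the same, all advantages are zero and the group contributes no gradient signal.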

Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Authors: Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan

First: 2025-12-01T17:59:11+00:00 · Latest: 2025-12-01T17:59:11+00:00

Comments: Published in Transactions on Machine Learning Research, Project in https://01yzzyu.github.io/script.github.io/

Abstract

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.

中文标题/摘要

标题：Script：面向多模态大型语言模型的图结构与查询条件语义标记剪枝

多模态大型语言模型（MLLMs）中视觉标记的快速增长导致了内存消耗过多和推理延迟增加，尤其是在处理高分辨率图像和视频时。标记剪枝是一种通过去除冗余来缓解这一问题的技术，但现有方法往往忽视了与用户查询的相关性，或者受限于注意力机制的局限性，降低了其适应性和有效性。为了解决这些挑战，我们提出了Script，这是一种即插即用的剪枝方法，无需重新训练，并适用于多种MLLMs。Script 包含两个模块：一个基于图结构的剪枝模块，用于去除视觉冗余标记，以及一个基于查询条件的语义剪枝模块，用于保留与查询相关的视觉信息。这两个模块共同提高了多模态任务的性能。在图像和视频理解任务的14个基准测试中，Script 一致地实现了比现有剪枝方法更高的模型效率和预测准确性。在LLaVA-NeXT-7B上，它实现了高达6.8倍的预填充加速和10倍的FLOP减少，同时保留了96.88%的原始性能。

Summary / 总结

Script is a token pruning method for MLLMs that addresses the issue of excessive memory consumption and inference latency by removing visually redundant tokens and preserving query-relevant visual information. It consists of two modules and does not require retraining. Experiments show that Script improves model efficiency and predictive accuracy, achieving up to 6.8x prefill speedup and 10x FLOP reduction on LLaVA-NeXT-7B while maintaining 96.88% of the original performance.

论文提出了Script，这是一种用于多模态大型语言模型（MLLMs）的剪枝方法，通过移除冗余的视觉令牌并保留与查询相关的信息来解决内存消耗和推理延迟的问题。Script 包含两个模块：一个图结构化的剪枝模块和一个查询条件下的语义剪枝模块。实验表明，Script 提高了模型的效率和预测准确性，实现了在LLaVA-NeXT-7B上高达6.8倍的预填充加速和10倍的FLOP减少，同时保持了原始性能的96.88%。
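
The two-stage idea described above (first remove visually redundant tokens, then keep the ones relevant to the query) can be illustrated with a toy sketch. This is not the authors' algorithm: a greedy cosine-similarity filter stands in for the graph-structured module, a top-k cosine ranking stands in for the query-conditioned module, and the threshold, `k`, and the 2-D embeddings are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def drop_redundant(tokens, threshold=0.95):
    """Stage 1 stand-in: keep a token only if it is not too similar to a kept one."""
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def keep_query_relevant(tokens, indices, query, k):
    """Stage 2 stand-in: keep the k surviving tokens most similar to the query."""
    ranked = sorted(indices, key=lambda i: cosine(tokens[i], query), reverse=True)
    return sorted(ranked[:k])  # restore original token order


# Toy visual-token embeddings and a toy query embedding.
TOKENS = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
QUERY = [0.0, 1.0]
SURVIVORS = drop_redundant(TOKENS)
SELECTED = keep_query_relevant(TOKENS, SURVIVORS, QUERY, k=2)
```

Token 1 is nearly identical to token 0 and is dropped in the first stage. Of the survivors, tokens 2 and 3 point most nearly in the query direction and are retained.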

Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models

Authors: Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid

First: 2025-12-01T17:57:27+00:00 · Latest: 2025-12-01T17:57:27+00:00

Comments: 9 pages, 9 figures, 6 tables

Abstract

Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.

中文标题/摘要

标题：Guardian：使用视觉语言模型检测机器人规划与执行错误

稳健的机器人操作需要可靠的故障检测和恢复。尽管当前的视觉语言模型（VLMs）显示出潜力，但它们的准确性和泛化能力受限于故障数据的稀缺性。为解决这一数据缺口，我们提出了一种自动机器人故障合成方法，通过程序化地扰动成功的轨迹来生成多样化的规划和执行故障。该方法不仅生成二元分类标签，还在仿真和真实世界中生成细粒度的故障类别和逐步推理轨迹。通过这种方法，我们构建了三个新的故障检测基准：RLBench-Fail、BridgeDataV2-Fail 和 UR5-Fail，显著扩展了现有故障数据集的多样性和规模。然后，我们训练了Guardian，这是一种使用多视角图像进行详细故障推理和检测的VLM。Guardian在现有和新引入的基准测试中均达到了最先进的性能。当将其集成到最先进的机器人操作（manipulation）系统中时，它也有效提高了仿真和真实机器人中的任务成功率，证明了我们生成的故障数据的影响。

Summary / 总结

The paper aims to enhance the reliability of robotic manipulation by improving failure detection and recovery. To address the scarcity of failure data, the authors propose an automatic robot failure synthesis approach that generates diverse planning and execution failures. This method produces binary classification labels, fine-grained failure categories, and reasoning traces. Using this, they created three new failure detection benchmarks and trained Guardian, a VLM that excels in failure reasoning and detection, achieving state-of-the-art performance and improving task success rates in both simulation and real robots.

论文旨在通过解决失败数据稀缺问题来提高机器人操作中的故障检测能力。它提出了一种自动机器人故障合成方法，生成多样化的规划和执行故障，并产生二元标签和详细的推理轨迹。该方法构建了新的故障检测基准，并训练了Guardian，这是一种VLM，在故障推理和检测方面表现出色，实现在仿真和真实机器人中的任务成功率提升。
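
The failure-synthesis idea (procedurally perturbing a successful trajectory to obtain labeled failures) can be sketched on a symbolic plan. The two perturbations, the category names `wrong_order` and `missing_step`, and the example plan are hypothetical stand-ins. The paper's actual perturbation set and failure taxonomy are not described in this digest.

```python
import random


def synthesize_failures(plan, rng):
    """Derive labeled failure examples from one successful plan (list of distinct steps)."""
    failures = []

    # Perturbation 1: swap two adjacent steps to create an ordering error.
    i = rng.randrange(len(plan) - 1)
    swapped = plan[:]
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    failures.append({
        "plan": swapped,
        "label": "failure",
        "category": "wrong_order",  # hypothetical category name
        "reason": f"steps {i} and {i + 1} were swapped",
    })

    # Perturbation 2: delete one step to create an incomplete plan.
    j = rng.randrange(len(plan))
    failures.append({
        "plan": plan[:j] + plan[j + 1:],
        "label": "failure",
        "category": "missing_step",  # hypothetical category name
        "reason": f"step {j} ('{plan[j]}') was removed",
    })
    return failures


# A made-up successful plan; seeding the generator keeps the output reproducible.
PLAN = ["reach cube", "grasp cube", "lift cube", "place cube on shelf"]
SAMPLES = synthesize_failures(PLAN, random.Random(0))
```

Each sample carries a binary label, a category, and a short textual reason, mirroring the three kinds of supervision the abstract mentions.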

Agentic Policy Optimization via Instruction-Policy Co-Evolution

Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

First: 2025-12-01T17:56:29+00:00 · Latest: 2025-12-01T17:56:29+00:00

Comments: 10 pages, 3 figures, 2 tables (18 pages including references and appendices)

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.

中文标题/摘要

标题：通过指令-策略共进化实现代理式策略优化

可验证奖励的强化学习（RLVR）提升了大型语言模型（LLMs）的推理能力，使自主代理能够进行有效的多轮次和工具集成推理。虽然指令是定义代理的主要协议，但RLVR通常依赖于静态和手动设计的指令。然而，这些指令对基础模型而言可能并非最优，而最优指令可能会随着代理策略的改进和对环境交互的探索而变化。为解决这一问题，我们引入了INSPO，这是一种新颖的指令-策略共进化框架，将指令优化作为强化学习（RL）循环中的动态组成部分。INSPO维护一个动态的指令候选种群，这些候选指令与问题一同被采样，RL循环中的奖励信号会自动归因于每个指令，表现不佳的指令会定期被剪除。新指令通过同策略（on-policy）反思机制生成和验证，其中基于LLM的优化器分析回放缓冲区中的过往经验，并根据当前策略演化出更有效的策略。我们在多轮次检索和推理任务上进行了广泛的实验，表明INSPO显著优于依赖静态指令的强基线。INSPO发现了创新的指令，引导代理走向更具策略性的推理路径，在计算开销仅略有增加的情况下取得了显著的性能提升。

Summary / 总结

The research aims to enhance the reasoning capability of large language models in reinforcement learning by dynamically optimizing instructions alongside the agent's policy. The method involves INSPO, an Instruction-Policy co-evolution framework that maintains a dynamic population of instructions, prunes underperforming ones, and generates new instructions through on-policy reflection. Experiments show that INSPO significantly outperforms static instruction-based methods on multi-turn retrieval and reasoning tasks, with minimal additional computational cost.

论文提出了INSPO框架，该框架在强化学习中动态优化指令和策略，以增强大型语言模型的推理能力。通过维护动态的指令候选种群、自动归因奖励信号并定期剪除表现不佳的指令，INSPO在多轮检索和推理任务中优于基于静态指令的方法，并且仅需少量额外的计算资源。新的指令通过同策略反思机制生成和验证，引导代理采取更有效的推理路径。
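
The population mechanics described above (sample an instruction, attribute the episode reward to it, periodically prune low performers and add new candidates) can be sketched as follows. This is a simplified illustration, not the INSPO implementation: the class and method names are invented, and the `propose` callback is a placeholder for the LLM-based reflection step.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    total_reward: float = 0.0
    uses: int = 0

    @property
    def mean_reward(self) -> float:
        return self.total_reward / self.uses if self.uses else 0.0


class InstructionPopulation:
    def __init__(self, texts, rng):
        self.candidates = [Candidate(t) for t in texts]
        self.rng = rng

    def sample(self) -> Candidate:
        """Pick the instruction to pair with the next question."""
        return self.rng.choice(self.candidates)

    def attribute(self, candidate: Candidate, reward: float) -> None:
        """Credit an episode's reward to the instruction that was used."""
        candidate.total_reward += reward
        candidate.uses += 1

    def prune_and_refill(self, n_drop: int, propose) -> None:
        """Drop the n_drop lowest-scoring candidates and add proposed replacements."""
        ranked = sorted(self.candidates, key=lambda c: c.mean_reward, reverse=True)
        survivors = ranked[: len(ranked) - n_drop]
        # `propose` is a hypothetical stand-in for the reflection-based generator.
        newcomers = [Candidate(propose(survivors[0].text)) for _ in range(n_drop)]
        self.candidates = survivors + newcomers


# Made-up instructions and rewards for illustration.
POP = InstructionPopulation(
    ["Search first, then answer.", "Answer directly.", "Plan, search, then answer."],
    random.Random(0),
)
for cand, reward in zip(list(POP.candidates), [1.0, 0.0, 0.5]):
    POP.attribute(cand, reward)
POP.prune_and_refill(1, lambda text: text + " Cite the evidence used.")
```

A fuller version would give newly added candidates a grace period before pruning, since an unused candidate here has a mean reward of 0.0 and would be dropped first.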

MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

Authors: Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Eni Halilaj, Michael J. Black

First: 2025-06-16T02:04:51+00:00 · Latest: 2025-12-01T17:55:23+00:00

Abstract

We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.

中文标题/摘要

标题：MAMMA：无标记的自动多人动作捕捉

我们提出了MAMMA，一种无标记的动作捕捉流水线，可以从两人互动序列的多视角视频中准确恢复SMPL-X参数。传统的动作捕捉系统依赖于物理标记。尽管它们提供高精度，但其对专用硬件、手动标记放置和大量后处理的要求使其成本高昂且耗时。最近的基于学习的方法试图克服这些限制，但大多数方法仅适用于单人捕捉，依赖稀疏关键点，或难以处理遮挡和物理互动。在本文中，我们介绍了一种方法，该方法在分割掩码条件下预测密集的2D接触感知表面特征点，即使在严重遮挡下也能实现针对特定人物的对应估计。我们采用了一种新颖的架构，为每个特征点使用可学习的查询。我们证明了我们的方法可以处理复杂的人与人之间的互动，并提供了比现有方法更高的准确性。为了训练我们的网络，我们构建了一个大型的合成多视角数据集，结合了来自多种来源的人体动作，包括极端姿态、手部动作和近距离互动。我们的数据集产生了具有丰富身体接触和遮挡的高变异性合成序列，并包括带有密集2D特征点的SMPL-X真值标注。结果是一个无需标记即可捕捉人体动作的系统。与商业化的基于标记的动作捕捉方案相比，我们的方法具有相当的重建质量，且无需大量的手动清理。最后，我们通过引入基于真实多视角序列的两种评估设置，解决了密集特征点预测和无标记动作捕捉缺乏通用基准的问题。我们将发布我们的数据集、基准、方法、训练代码和预训练模型权重供研究使用。

Summary / 总结

MAMMA is a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. It uses a novel architecture with a learnable query per landmark to predict dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. The method handles complex person-person interactions with greater accuracy than existing methods. The network is trained on a large synthetic multi-view dataset with rich body contact and occlusion, and its reconstruction quality is competitive with commercial marker-based solutions without extensive manual cleanup.

MAMMA 是一个无标记的动作捕捉流水线，可以从双人互动的多视角视频中准确恢复 SMPL-X 参数。该方法使用新颖的架构来预测密集的2D接触感知表面特征点，即使在严重遮挡的情况下也能实现针对特定人物的准确对应估计。该方法在处理复杂的人与人互动方面优于现有方法。系统通过包含极端姿态和近距离互动的多样化人体动作的大规模合成数据集进行训练，并且其重建质量与商业标记系统相当，无需进行大量手动清理。

Benchmarking machine learning models for multi-class state recognition in double quantum dot data

Authors: Valeria Díaz Moreno, Ryan P Khalili, Daniel Schug, Patrick J. Walsh, Justyna P. Zwolak

First: 2025-11-27T13:38:57+00:00 · Latest: 2025-12-01T17:47:33+00:00

Comments: 12 pages, 4 figures, 2 tables

Abstract

Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices' bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models -- U-Nets and visual transformers (ViTs) -- achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.

中文标题/摘要

标题：双量子点数据中多类状态识别的机器学习模型基准测试

半导体量子点（QDs）是可扩展量子处理器的领先平台。然而，扩展到大型阵列需要可靠的自动化调谐策略，用于设备的启动、校准和操作，其中许多调谐方面取决于从电荷稳定性图（CSDs）准确识别QD设备状态的能力。在本工作中，我们对四种现代机器学习（ML）架构在双QD CSD中的多类状态识别进行了全面的基准测试研究。我们使用合成和实验数据评估了它们在不同数据预算和归一化方案下的性能。我们发现，资源密集型模型——U-Nets和视觉变换器（ViTs）——在合成数据上获得了最高的MSE分数（定义为$1-\mathrm{MSE}$，超过$0.98$），但在实验数据上无法泛化。MDNs是最具计算效率的，并且表现出高度稳定的训练，但其峰值性能明显较低。CNNs在实验CSD上提供了最有利的权衡，使用比U-Nets和ViTs少两个数量级的参数实现了较强的准确性。归一化起着非平凡的作用：最小-最大缩放通常会产生更高的MSE分数，但收敛性较差，而z分数归一化会产生更可预测的训练动态，但大多数模型的准确性降低。总体而言，我们的研究显示，对于QD CSD，使用最小-最大归一化的CNNs是一种实用的方法。

Summary / 总结

This study benchmarks four machine learning architectures for multi-class state recognition in double quantum dot charge-stability diagrams (CSDs), evaluating them across data budgets and normalization schemes on both synthetic and experimental data. U-Nets and visual transformers (ViTs) achieve the highest MSE scores (defined as $1-\mathrm{MSE}$) on synthetic data, above $0.98$, but fail to generalize to experimental data. MDNs are the most computationally efficient and train stably, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, reaching strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Min-max scaling generally yields higher MSE scores but less stable convergence, while z-score normalization gives more predictable training dynamics at reduced accuracy for most models. Overall, the study finds CNNs with min-max normalization to be a practical approach for QD CSDs.

本研究对比了四种机器学习模型在双量子点电荷稳定性图中识别多类状态的表现，包括U-网络、视觉变换器、混合密度网络和卷积神经网络。评估结果显示，U-网络和视觉变换器在合成数据上表现最佳，但在实验数据上无法泛化。混合密度网络计算效率高，但性能峰值较低。卷积神经网络在实验数据上提供了良好的权衡，以较少的参数实现了高准确性。最小-最大归一化通常提高MSE分数，但导致收敛不稳定，而Z-分数归一化提供更可预测的训练动态，但大多数模型的准确性较低。总体而言，卷积神经网络结合最小-最大归一化是双量子点电荷稳定性图中的实用方法。
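
The two normalization schemes and the MSE score compared above can be written out in a few lines. This is a sketch on flat lists of numbers: how the paper applies normalization to full CSD images and aggregates the score across classes is not specified here, so the function names and example values are illustrative assumptions.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def mse_score(predictions, targets):
    """MSE score as defined in the abstract: 1 - MSE, so higher is better."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return 1.0 - statistics.fmean(squared_errors)


# Made-up pixel values and predictions for illustration.
PIXELS = [2.0, 4.0, 6.0]
SCALED = min_max_scale(PIXELS)
STANDARDIZED = z_score(PIXELS)
SCORE = mse_score([0.9, 0.1], [1.0, 0.0])
```

Min-max scaling pins the output range to [0, 1] but depends on the extreme values of each input, whereas z-scoring depends on the mean and spread and leaves the range unbounded.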