arXiv Paper Digest

Snapshot: 20260326_0356

OccAny: Generalized Unconstrained Urban 3D Occupancy

Authors: Anh-Quan Cao, Tuan-Hung Vu

Venue: CVPR 2026

First: 2026-03-24T17:59:58+00:00 · Latest: 2026-03-24T17:59:58+00:00

Comments: Accepted to CVPR 2026. Project page: https://valeoai.github.io/OccAny/

Abstract

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .

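Translated title / abstract

Title: OccAny: Generalized Unconstrained Urban 3D Occupancy

Because they rely on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and cross-domain generalization. Recent visual geometry foundation models generalize well, but they were designed for general purposes and lack several ingredients that urban occupancy prediction requires: metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios. We fill this gap with OccAny, the first unconstrained urban 3D occupancy model that can operate on out-of-domain, uncalibrated scenes to predict and complete metric occupancy together with segmentation features. OccAny is versatile and predicts occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) the first generalized 3D occupancy framework, (ii) Segmentation Forcing, which improves occupancy quality and enables mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments show that OccAny outperforms all visual geometry baselines on 3D occupancy prediction while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy datasets. Our code is available at https://github.com/valeoai/OccAny .

Summary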

The research addresses the limitations of existing 3D occupancy prediction methods in scalability and out-of-domain generalization. OccAny is introduced as the first unconstrained urban 3D occupancy model that can predict and complete metric occupancy from sequential, monocular, or surround-view images. Key contributions include Segmentation Forcing for improved occupancy quality and mask-level prediction, and a Novel View Rendering pipeline for geometry completion. Experiments show that OccAny outperforms visual geometry baselines and remains competitive with in-domain self-supervised methods.

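The work targets the limited scalability and cross-domain generalization of existing 3D occupancy prediction methods. OccAny is a generalized 3D occupancy framework that combines Segmentation Forcing with a Novel View Rendering pipeline and can predict and complete metric occupancy from sequential, monocular, or surround-view images. Experiments show that OccAny outperforms visual geometry baselines on urban occupancy prediction datasets and remains competitive with in-domain self-supervised methods across three input settings.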

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan

First: 2026-03-24T17:59:54+00:00 · Latest: 2026-03-24T17:59:54+00:00

Comments: 11 Pages

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

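Translated title / abstract

Title: MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Vision Language Models (VLMs) are increasingly used for tasks such as medical report generation and visual question answering. Fluent diagnostic text, however, does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks mostly assume this step is solved and therefore miss a critical failure mode: a model can produce a plausible narrative even when the input is inconsistent or invalid. We introduce MedObvious, a benchmark of 1,880 tasks that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must determine whether any panel violates the expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades as image sets grow, and measured accuracy differs substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

Summary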

The research addresses the issue that fluent diagnostic text from Vision Language Models (VLMs) does not ensure safe visual understanding. It introduces MedObvious, a 1,880-task benchmark to test input validation, which is crucial for clinical practice. The study evaluates 17 VLMs and finds that sanity checking is unreliable, with models hallucinating anomalies on normal inputs, degrading performance with larger image sets, and varying accuracy between different evaluation formats. This highlights the need for pre-diagnostic verification as a distinct safety-critical capability for medical VLMs.

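The study starts from the observation that fluent diagnostic text does not guarantee safe visual understanding. It introduces MedObvious, a 1,880-task benchmark focused on input validation that tests whether a model can spot inconsistencies in multi-panel image sets. Evaluating 17 VLMs, it finds that sanity checking is unreliable: models report anomalies on normal inputs, performance drops as image sets grow, and accuracy differs markedly across evaluation formats. Pre-diagnostic verification therefore remains an open problem that should be treated as a distinct, safety-critical capability before medical VLMs are deployed.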

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang

First: 2026-03-24T17:59:17+00:00 · Latest: 2026-03-24T17:59:17+00:00

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.

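Translated title / abstract

Title: UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Unified models capable of interleaved generation have emerged as a promising paradigm, and the community is increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored to interleaved generation. We validate the approach on its fundamental unit, a single round of reasoning-driven image generation, in which the model first expands the user prompt through reasoning and then synthesizes the image. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize the text and image generation policies using GRPO. We adopt a minimalist methodology to avoid over-design and reuse established training recipes for both modalities by integrating standard GRPO for reasoning with FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we make two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to keep rollouts linear and unbranched, which is essential for scaling to complex scenarios involving multi-turn interaction and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty applied directly to the velocity fields, which gives a more robust and direct regularization signal that effectively mitigates reward hacking. Our experiments show that this unified training recipe significantly improves image generation quality through reasoning and provides a robust, scalable baseline for the future post-training of fully interleaved models.

Summary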

The paper introduces UniGRPO, a unified reinforcement learning framework for interleaved generation, focusing on reasoning-driven image generation. The approach formulates the process as a Markov Decision Process and uses GRPO to optimize both text and image generation policies. Key modifications to FlowGRPO, such as eliminating classifier-free guidance and using MSE penalties, enhance scalability and robustness. Experiments show improved image generation quality through reasoning, providing a strong baseline for future fully interleaved models.

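The paper proposes UniGRPO, a unified reinforcement learning framework for interleaved generation that focuses on reasoning-driven image generation. It models the process as a Markov Decision Process and modifies FlowGRPO to ensure scalability. The key finding is that this unified training recipe improves image generation quality through reasoning and provides a robust baseline for future fully interleaved models.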
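
The two ingredients named in the UniGRPO abstract, group-relative advantages computed from a sparse terminal reward and an MSE penalty on velocity fields, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of the population standard deviation, the idea of sharing one advantage across a whole rollout, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one group of rollouts (GRPO-style).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_reference):
    """Mean squared error between policy and reference velocity predictions."""
    if len(v_policy) != len(v_reference):
        raise ValueError("velocity vectors must have the same length")
    squared = [(p - q) ** 2 for p, q in zip(v_policy, v_reference)]
    return sum(squared) / len(squared)


# Four hypothetical rollouts of one prompt, each scored once at the end
# (a sparse terminal reward).
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Assumption: the same scalar advantage would weight both the reasoning
# tokens and the denoising steps of a rollout.
penalty = velocity_mse_penalty([0.10, -0.30, 0.25], [0.00, -0.20, 0.25])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed; the MSE term grows as the policy's velocity prediction drifts from the reference model.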

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Authors: Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim

First: 2026-03-24T17:59:13+00:00 · Latest: 2026-03-24T17:59:13+00:00

Comments: Project page: https://cvlab-kaist.github.io/DA-Flow

Abstract

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

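Translated title / abstract

Title: DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task that targets accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this, we lift the model to attend across adjacent frames through full spatio-temporal attention, and we show empirically that the resulting features have zero-shot correspondence capabilities. Building on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

Summary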

The research aims to improve optical flow estimation in real-world corrupted videos by addressing the limitations of existing models. The method involves using intermediate representations from image restoration diffusion models, which are inherently corruption-aware but lack temporal awareness. By incorporating full spatio-temporal attention and fusing these features with convolutional features in an iterative refinement framework, the proposed DA-Flow model significantly outperforms existing methods under severe degradation conditions across multiple benchmarks.

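The research aims to improve optical flow estimation on real-world corrupted videos by addressing the limitations of models trained on high-quality data. The method uses the intermediate representations of an image restoration diffusion model and augments them with spatio-temporal attention to add temporal awareness. DA-Flow combines diffusion and convolutional features and substantially outperforms existing methods under severe degradation across multiple benchmarks.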
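
The DA-Flow abstract states that the lifted diffusion features show zero-shot correspondence capabilities. One common way to probe such a property is nearest-neighbor matching in feature space, sketched below in plain Python. The matching rule, the dictionary layout, and the toy features are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def nearest_neighbor_flow(feat_a, feat_b):
    """Match each pixel of frame A to its most similar pixel in frame B.

    feat_a, feat_b: dicts mapping (x, y) -> feature vector.
    Returns a dict mapping (x, y) in frame A -> displacement (dx, dy).
    """
    flow = {}
    for (xa, ya), fa in feat_a.items():
        best = max(feat_b, key=lambda p: cosine(fa, feat_b[p]))
        flow[(xa, ya)] = (best[0] - xa, best[1] - ya)
    return flow


# Toy example: two distinctive features that move one pixel to the right.
frame_a = {(0, 0): [1.0, 0.0, 0.0], (1, 0): [0.0, 1.0, 0.0]}
frame_b = {
    (0, 0): [0.0, 0.0, 1.0],
    (1, 0): [1.0, 0.0, 0.0],
    (2, 0): [0.0, 1.0, 0.0],
}
flow = nearest_neighbor_flow(frame_a, frame_b)
```

A coarse match of this kind only serves as an initial estimate; the abstract describes fusing the features with convolutional features inside an iterative refinement framework instead.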

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang

First: 2026-03-24T17:58:25+00:00 · Latest: 2026-03-24T17:58:25+00:00

Abstract

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

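Translated title / abstract

Title: WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing only partial information about the state. Recent video world models attempt to learn these action-conditioned dynamics from data. Existing datasets, however, rarely meet this requirement: they typically lack diverse and semantically meaningful action spaces, and their actions are tied directly to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, which makes it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, collected automatically from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

Summary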

WildWorld is a large-scale dataset for dynamic world modeling with explicit state annotations, derived from a photorealistic action role-playing game. It includes over 108 million frames with more than 450 diverse actions, and synchronized annotations of character skeletons, world states, camera poses, and depth maps. The dataset aims to address the limitations of existing datasets by providing a rich action space and decoupling actions from pixel-level changes, enabling better learning of structured world dynamics. Experiments show persistent challenges in modeling semantically rich actions and maintaining long-term state consistency, emphasizing the importance of state-aware video generation.

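WildWorld is a large-scale action-conditioned world modeling dataset with explicit state annotations and diverse actions, collected from a photorealistic action role-playing game. It contains over 108 million frames and more than 450 actions, and is intended to help models learn structured world dynamics and maintain long-horizon consistency. Experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-term state consistency, underscoring the importance of state-aware video generation.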

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

Venue: CVPR 2026

First: 2026-03-24T17:58:17+00:00 · Latest: 2026-03-24T17:58:17+00:00

Comments: Accepted at CVPR 2026

Abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

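Translated title / abstract

Title: VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Existing approaches to improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on visual token reduction. This creates an information bottleneck that hurts performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge that paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Rather than compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: efficient text-image cross-attention provides general visual context, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when it is needed. Based on this principle, we first train a single universal network across a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation according to per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and that it excels on challenging tasks that require detailed visual understanding.

Summary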

This work addresses the efficiency bottleneck in Large Vision-Language Models (LVLMs) by introducing VISion On Request (VISOR), which reduces inference cost without discarding visual information. VISOR sparsifies the interaction between image and text tokens, using a small set of strategically placed attention layers to dynamically refine visual representations when needed. Experiments show that VISOR significantly reduces computational cost while maintaining or improving performance across various benchmarks, especially in tasks requiring detailed visual understanding.

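The paper proposes VISion On Request (VISOR), which improves the efficiency of Large Vision-Language Models (LVLMs) by sparsifying the interaction between image and text tokens instead of reducing the number of visual tokens. VISOR uses a small number of strategically placed attention layers to provide general visual context and refine the visual representations in support of complex reasoning. Experiments show that VISOR reduces computational cost while maintaining or improving performance across benchmarks, particularly on tasks that require detailed visual understanding.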
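
To give a rough sense of why sparsifying image-text interaction can save compute without dropping visual tokens, the sketch below counts query-key pairs under a deliberately simplified cost model. The model is an assumption, not the paper's accounting: layers are treated as either text-only self-attention or full self-attention over all tokens, cross-attention adds one text-by-image term per cross layer, and the layer and token counts in the example are hypothetical.

```python
def attention_pairs_dense(num_layers, text_tokens, visual_tokens):
    """Query-key pairs when every layer attends over all text and visual tokens."""
    n = text_tokens + visual_tokens
    return num_layers * n * n


def attention_pairs_sparse(num_layers, text_tokens, visual_tokens,
                           cross_layers, full_layers):
    """Query-key pairs under the simplified sparse-interaction model.

    full_layers  : layers that still run self-attention over all tokens
    cross_layers : layers where text queries attend to visual keys
    All remaining layers are assumed to run text-only self-attention.
    """
    if full_layers > num_layers:
        raise ValueError("full_layers cannot exceed num_layers")
    n = text_tokens + visual_tokens
    text_only = (num_layers - full_layers) * text_tokens * text_tokens
    cross = cross_layers * text_tokens * visual_tokens
    full = full_layers * n * n
    return text_only + cross + full


# Hypothetical budget: 32 layers, 64 text tokens, 2304 visual tokens.
dense = attention_pairs_dense(32, 64, 2304)
sparse = attention_pairs_sparse(32, 64, 2304, cross_layers=8, full_layers=2)
ratio = sparse / dense
```

Varying `full_layers` in this toy model mirrors the idea in the abstract of training one network across several computational budgets, with a policy choosing the budget per sample.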

Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Authors: Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein

First: 2026-03-24T17:57:35+00:00 · Latest: 2026-03-24T17:57:35+00:00

Comments: Project website at https://bchao1.github.io/foveated-diffusion

Abstract

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.

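Translated title / abstract

Title: Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image generation and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, because computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of generation in settings where the user's gaze location is known or can be estimated, for example through eye tracking. In these settings we exploit the eccentricity-dependent acuity of human vision: a user perceives very high-resolution visual information in a small region around the gaze location (the foveal region), while the ability to resolve detail degrades quickly in the periphery of the visual field. Our approach starts from a mask that models the foveated resolution and allocates tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. Images or videos are generated in this mixed-resolution token setting, yielding results that are perceptually indistinguishable from full-resolution generation while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, which allows a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate the approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.

Summary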

This paper addresses the challenge of efficient generation of high-resolution images and videos by leveraging the foveated vision of humans. It proposes a method called Foveated Diffusion, which uses a mask to allocate tokens non-uniformly, focusing more on the foveal region where visual acuity is highest. This approach significantly reduces the computational complexity while maintaining perceptual quality. The study demonstrates that foveated generation can drastically reduce the token count and generation time without compromising visual quality, making it a practical and scalable solution for efficient content creation.

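The research aims to make image and video generation more efficient by exploiting where the user is looking. The method allocates tokens non-uniformly according to the gaze region, which reduces computational complexity. The generated results are perceptually indistinguishable from full-resolution generation while using significantly fewer tokens and less generation time. Experiments show that the approach is effective and scalable in settings where the gaze location can be estimated.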
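
The Foveated Diffusion abstract combines two facts: attention cost grows quadratically with token count, and peripheral regions can be given lower token density. The Python sketch below counts tokens under a simple circular foveal mask and derives the resulting attention cost ratio. The circular mask, the fixed pooling stride, and the grid size are assumptions chosen for illustration; the paper's actual mask and tokenization may differ.

```python
def foveated_token_count(height, width, gaze, radius, peripheral_stride=2):
    """Count tokens when blocks near the gaze keep full density.

    The grid is split into blocks of peripheral_stride x peripheral_stride
    cells. A block whose center lies within `radius` of the gaze keeps one
    token per cell; every other block is pooled into a single token.
    """
    gx, gy = gaze
    s = peripheral_stride
    count = 0
    for by in range(0, height, s):
        for bx in range(0, width, s):
            block_w = min(s, width - bx)
            block_h = min(s, height - by)
            cx = bx + block_w / 2
            cy = by + block_h / 2
            if (cx - gx) ** 2 + (cy - gy) ** 2 <= radius ** 2:
                count += block_w * block_h  # foveal: full token density
            else:
                count += 1  # peripheral: one pooled token per block
    return count


full_tokens = 32 * 32
mixed_tokens = foveated_token_count(32, 32, gaze=(16, 16), radius=8)

# Self-attention cost scales with the square of the token count.
attention_cost_ratio = (mixed_tokens / full_tokens) ** 2
```

Because the cost ratio is the square of the token ratio, even a moderate reduction in token count yields a larger reduction in attention cost.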

AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Authors: Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim

First: 2026-03-24T17:55:17+00:00 · Latest: 2026-03-24T17:55:17+00:00

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

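Translated title / abstract

Title: AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods follow a common pipeline: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model then propagates the results. Although intuitive, this design requires the MLLM to make temporal decisions before any object-level evidence is available, which limits both reasoning quality and spatio-temporal coverage. To address this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, with iterative pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

Summary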

AgentRVOS addresses the limitations of existing training-free methods for Referring Video Object Segmentation (RVOS) by integrating the strengths of SAM3 and an MLLM. It generates reliable object tracks across the entire spatio-temporal extent using SAM3, and the MLLM then reasons over this evidence to identify the target object, iteratively pruning based on SAM3's temporal information. AgentRVOS outperforms other training-free methods across multiple benchmarks and shows consistent results with different MLLM backbones.

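AgentRVOS aims to improve referring video object segmentation by addressing the limitations of existing training-free methods. It combines SAM3 for reliable spatio-temporal object tracking with an MLLM for query-grounded reasoning, and prunes candidates iteratively using temporal existence information. Experiments show that AgentRVOS outperforms other training-free methods on multiple benchmarks and gives consistent results across different MLLM backbones.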

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard

First: 2026-03-24T17:54:25+00:00 · Latest: 2026-03-24T17:54:25+00:00

Comments: 34 pages, 16 figures

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.

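Translated title / abstract

Title: One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Monocular novel-view synthesis has long required multi-view image pairs for supervision, which limits the scale and diversity of training data. We argue that this is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. At training time we use a monocular depth estimator as a geometric scaffold: we lift a source image into 3D, apply a sampled camera transformation, and project it to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts the geometric, perceptual, and textural losses to valid regions, which enables training on 30 million uncurated images. At inference, OVIE is geometry-free and requires no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.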
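
The lift, transform, and project step described in the OVIE abstract can be sketched with a pinhole camera model in plain Python. This is an illustration under stated assumptions, not the released code: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-pixel splatting, and the toy depth map are all hypothetical choices. Target pixels that receive no source pixel are the disoccluded regions that a masked loss would exclude.

```python
def warp_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Unproject pixel (u, v) with its depth, apply a rigid motion, reproject."""
    # Lift to 3D in the source camera frame.
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    Z = depth
    # Move the point into the sampled target camera frame.
    Xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    Yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    Zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
    if Zc <= 0:
        return None  # point falls behind the target camera
    return (fx * Xc / Zc + cx, fy * Yc / Zc + cy)


def pseudo_target_mask(depth_map, fx, fy, cx, cy, R, t):
    """Forward-warp every source pixel; mark target pixels that receive one."""
    h, w = len(depth_map), len(depth_map[0])
    mask = [[False] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            p = warp_pixel(u, v, depth_map[v][u], fx, fy, cx, cy, R, t)
            if p is None:
                continue
            ut, vt = round(p[0]), round(p[1])
            if 0 <= ut < w and 0 <= vt < h:
                mask[vt][ut] = True
    return mask


# Toy scene: a flat 4x4 depth map at depth 2, and a sideways camera shift
# that moves every pixel exactly one column to the right.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth_map = [[2.0] * 4 for _ in range(4)]
mask = pseudo_target_mask(depth_map, 4.0, 4.0, 1.5, 1.5, identity, (0.5, 0.0, 0.0))
```

In this toy case the leftmost target column is never hit, so its mask entries stay `False`; a masked training loss would be computed only where the mask is `True`.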

TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Authors: Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim

First: 2026-03-24T17:53:41+00:00 · Latest: 2026-03-24T17:53:41+00:00

Abstract

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only ~25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

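Translated title / abstract

Title: TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Event cameras capture per-pixel brightness changes with microsecond resolution and so provide the continuous motion information that is lost between RGB frames. Existing event-based motion estimators, however, depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only about 25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we use as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and show that accurate motion estimation translates directly into superior frame interpolation quality on BS-ERGB and HQ-EVFI.

Summary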

The research aims to address the challenge of sim-to-real gaps in event-based motion estimation by proposing TETO, a teacher-student framework. TETO learns from unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker, using a motion-aware data curation and query sampling strategy to disentangle object motion from ego-motion. The resulting estimator predicts point trajectories and dense optical flow, which are used as priors for frame interpolation with a pretrained video diffusion transformer. The method achieves state-of-the-art results on EVIMO2 for point tracking and DSEC for optical flow, using significantly less training data compared to existing methods, and demonstrates superior frame interpolation quality on BS-ERGB and HQ-EVFI.

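TETO is a teacher-student framework that uses a pretrained RGB tracker to learn event-based motion estimation from a small amount of real-world data. It achieves state-of-the-art performance on point tracking and optical flow while using far less training data than existing methods. The method separates object motion from ego-motion and uses the learned motion priors for frame interpolation, showing superior quality on benchmark datasets.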

Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

Authors: Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos, Dongwon Lee

Venue: ICWSM 2026

First: 2025-03-06T20:19:38+00:00 · Latest: 2026-03-24T17:52:36+00:00

Comments: 15; To appear in ICWSM 2026 (https://www.icwsm.org/2026/)

Abstract

The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.

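Translated title / abstract

Title: Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

The proliferation of generative models has made it significantly harder to distinguish authentic human-authored content from deepfake content. Collaborative approaches that combine human effort with AI tools look promising. This study explores how DeepFakeDeLiBot, a deliberation-enhancing chatbot, can support groups in detecting deepfake text. We find that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. Interaction with DeepFakeDeLiBot does not yield substantial performance gains overall, but it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. In addition, participants who perceived group collaboration as more effective gained performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots to foster interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.

Summary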

This study investigates the use of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. The research finds that group-based problem-solving improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While DeepFakeDeLiBot does not significantly enhance overall performance, it fosters better group dynamics through increased engagement, consensus building, and reasoning-based discussions. Participants who perceived higher effectiveness of group collaboration benefited more from the chatbot. These results highlight the potential of deliberative chatbots in promoting interactive and accurate group deepfake text detection.

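The study examines the use of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to help groups detect deepfake text. Group-based problem-solving identifies machine-generated paragraphs more accurately than individual efforts. DeepFakeDeLiBot does not significantly improve overall performance, but it improves group dynamics by increasing engagement, promoting consensus building, and encouraging reasoning-based discussion. Participants who regarded group collaboration as more effective benefited more from the chatbot. The findings highlight the potential of deliberative chatbots to foster interactive and productive group dynamics while ensuring accuracy in deepfake text detection.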

Failure of contextual invariance in gender inference with large language models

Authors: Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

First: 2026-03-24T17:52:22+00:00 · Latest: 2026-03-24T17:52:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19-52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

中文标题/摘要

标题：上下文不变性在大规模语言模型性别推断中的失效

标准评估实践假设大规模语言模型（LLM）输出在任务的上下文等效表述下是稳定的。在这里，我们在性别推断的背景下测试这一假设。通过一个受控的代词选择任务，我们引入了最小的、理论上无信息的语境，并发现这导致了模型输出的巨大、系统的转变。在去语境化的环境中存在的与文化性别刻板印象相关的相关性，在引入语境后减弱或消失，而与任务无关的特征，如与无关指代对象相关的代词性别，成为模型行为最有力的预测因素。通过上下文默认分析发现，在19%-52%的情况下，即使在考虑了所有上下文对个体输出的边际效应后，这种依赖仍然存在，且不能归因于简单的代词重复。这些发现表明，即使在几乎相同的句法表述下，LLM输出也违反了上下文不变性，这对偏见基准测试和在高风险环境中的部署具有重要意义。

Summary / 总结

The study investigates the stability of large language model (LLM) outputs under contextually equivalent formulations, focusing on gender inference. By using a controlled pronoun selection task with minimal, theoretically uninformative discourse context, the researchers found that model outputs shifted significantly. Cultural gender stereotypes had weaker influence in the context, while irrelevant features like the gender of an unrelated pronoun became more predictive. The analysis showed that in 19-52% of cases, this dependence on context persisted even after accounting for all marginal effects, indicating a violation of contextual invariance. This has implications for bias benchmarking and the use of LLMs in high-stakes settings.

研究考察了大型语言模型（LLM）在语境等效表述下的输出稳定性，重点关注性别推断。通过使用带有最小理论无关话语背景的控制代词选择任务，研究人员发现模型输出发生了显著变化。文化性别刻板印象在有语境的情况下影响减弱，而与任务无关的特征，如无关代词的性别，成为了更有效的预测因素。分析显示，在19-52%的情况下，即使考虑了所有边际效应，这种对语境的依赖仍然存在，表明违反了语境不变性。这为偏见基准测试和LLM在高风险环境中的应用带来了影响。

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo

First: 2026-03-24T17:45:47+00:00 · Latest: 2026-03-24T17:45:47+00:00

Comments: Code: https://github.com/MAC-AutoML/SpecEyes

Abs · PDF · Code1 · Code2 · Code3

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

中文标题/摘要

标题：SpecEyes：通过推测感知与规划加速有能动性的多模态大语言模型

有能动性的多模态大语言模型（MLLMs，例如OpenAI o3和Gemini有能动性视觉）通过迭代视觉工具调用实现了显著的推理能力。然而，感知、推理和工具调用循环的级联引入了显著的顺序延迟。这种延迟，称为有能动性深度，导致了不可接受的延迟，并严重限制了系统级别的并发性。为了解决这一问题，我们提出了SpecEyes，一种有能动性级别的推测性加速框架，打破了这种顺序瓶颈。我们的核心见解是，一个轻量级、无工具的MLLM可以作为推测性规划者来预测执行轨迹，从而在不牺牲准确性的前提下提前终止昂贵的工具链。为了调节这种推测性规划，我们引入了一种基于答案可分性的认知门控机制，该机制量化了模型的自我验证信心，无需使用 oracle 标签。此外，我们设计了一种异构并行漏斗，利用小模型的状态无感知并发性来掩盖大模型的状态有感知串行执行，从而最大化系统吞吐量。在V* Bench、HR-Bench和POPE上的大量实验表明，SpecEyes在保持或甚至提高准确性的前提下（最多提高6.7%），实现了1.1-3.35倍的加速，从而在并发工作负载下提升了服务吞吐量。

Summary / 总结

SpecEyes is a speculative acceleration framework for agentic multimodal large language models (MLLMs) that addresses their significant sequential overhead by using a lightweight, tool-free MLLM as a speculative planner. It introduces a cognitive gating mechanism based on answer separability and a heterogeneous parallel funnel to maximize throughput. Experiments on V* Bench, HR-Bench, and POPE show that SpecEyes achieves a 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (by up to 6.7%), which raises serving throughput under concurrent workloads.

SpecEyes 是一种面向 agentic 多模态大语言模型 (MLLMs) 的推测加速框架，利用轻量级、无工具的 MLLM 作为推测规划者，以解决显著的串行延迟问题。通过预测执行轨迹并提前终止昂贵的工具链，结合认知门控机制和异构并行漏斗，SpecEyes 实现了 1.1-3.35 倍的加速，同时保持或提高了准确性，从而在并发工作负载下提升服务吞吐量。
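
The gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how answer separability is computed, so the sketch assumes a simple margin between the two most probable answers. The names `separability`, `answer_with_gate`, `small_model`, `agentic_model`, and the `threshold` value are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def separability(probs: Dict[str, float]) -> float:
    """Margin between the two most probable answers (assumed confidence proxy)."""
    ranked = sorted(probs.values(), reverse=True)
    if not ranked:
        return 0.0
    if len(ranked) == 1:
        return ranked[0]
    return ranked[0] - ranked[1]


def answer_with_gate(
    question: str,
    small_model: Callable[[str], Dict[str, float]],
    agentic_model: Callable[[str], str],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (answer, route), where route is 'speculative' or 'agentic'."""
    probs = small_model(question)
    if probs and separability(probs) >= threshold:
        # Confident enough: accept the cheap answer and skip the tool chain.
        return max(probs, key=probs.get), "speculative"
    # Not confident: fall back to the full tool-using model.
    return agentic_model(question), "agentic"
```

In this sketch, only questions that fail the gate pay the cost of the agentic loop, which is where a speedup of the kind reported above would come from.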

Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong

Venue: CVPR 2026

First: 2025-12-18T16:37:39+00:00 · Latest: 2026-03-24T17:45:20+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

中文标题/摘要

标题：面向任务的数据合成与控制校正采样方法在遥感语义分割中的应用

随着可控生成技术的迅速发展，训练数据合成已成为扩展标注数据集和缓解遥感（RS）领域手动标注问题的一种有前途的方法。然而，语义掩码控制的复杂性和采样质量的不确定性往往限制了合成数据在下游语义分割任务中的应用。为了解决这些挑战，我们提出了一种面向任务的数据合成框架（TODSynth），包括一个具有统一三重注意力机制的多模态扩散变换器（MM-DiT）和一个基于任务反馈的即插即用采样策略。基于强大的基于DiT的生成基础模型，我们系统地评估了不同的控制方案，表明结合图像和掩码分支的全面微调与文本-图像-掩码联合注意力方案显著增强了RS语义分割数据合成的有效性，特别是在少量样本和复杂场景中。此外，我们提出了一种控制校正流匹配（CRFM）方法，在早期高可塑性阶段根据语义损失动态调整采样方向，缓解生成图像的不稳定性并缩小合成数据与下游分割任务之间的差距。广泛的实验表明，我们的方法在可控生成方法中始终表现出色，生成了更稳定且面向任务的合成数据用于RS语义分割。

Summary / 总结

The research aims to enhance the utility of synthetic data in remote sensing semantic segmentation by addressing the challenges of semantic mask control and sampling quality. The authors propose TODSynth, which uses a Multimodal Diffusion Transformer with unified triple attention and a task feedback-guided sampling strategy. They show that a joint text-image-mask attention scheme and full fine-tuning of both image and mask branches significantly improve data synthesis, especially in few-shot and complex-scene scenarios. Additionally, they introduce CRFM, which dynamically adjusts sampling directions to stabilize generated images and better align with downstream tasks. Experiments confirm that their approach outperforms existing methods in producing more stable and task-oriented synthetic data.

论文提出了一种面向任务的数据合成框架（TODSynth），包括具有统一三重注意力的多模态扩散变换器（MM-DiT）和插拔式采样策略。该框架通过结合文本-图像-掩码联合注意力和图像和掩码分支的完全微调，显著提升了合成数据在少量样本和复杂场景下的有效性。此外，提出了一种控制校正流匹配（CRFM）方法，在早期高可塑性阶段根据语义损失动态调整采样方向，提高生成图像的稳定性并更好地与下游分割任务对齐，从而在生成更多面向任务的合成数据方面优于现有方法。

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

First: 2026-03-24T17:45:06+00:00 · Latest: 2026-03-24T17:45:06+00:00

Comments: https://plan-lab.github.io/projects/vtam/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

中文标题/摘要

标题：VTAM：超越VLAs的视频-触觉-动作模型用于复杂物理交互

视频-动作模型（VAMs）已成为具身智能的有前途框架，通过从原始视频流中学习隐含的世界动力学来产生时间上一致的动作预测。尽管这些模型在长时任务上通过视觉推理表现出色，但在接触丰富的场景中仍受到限制，因为关键交互状态仅部分可通过视觉观察到。特别是，细微的力调节和接触转换无法可靠地编码在视觉标记中，导致行为不稳定或不精确。为解决这一问题，我们引入了视频-触觉动作模型（VTAM），这是一种多模态世界建模框架，将触觉感知作为补充的接地（grounding）信号。VTAM通过轻量级模态转移微调将预训练的视频变换器与触觉流结合，无需触觉-语言配对数据或独立的触觉预训练，即可实现高效的跨模态表示学习。为了稳定多模态融合，我们引入了一种触觉正则化损失，以确保平衡的跨模态注意力，防止视觉潜变量在动作模型中占主导。VTAM在接触丰富的操作中表现出色，平均保持了90%的稳健成功率。在需要高保真力感知的马铃薯片拾取和放置等具有挑战性的场景中，VTAM比pi 0.5基线高出80%。我们的研究结果表明，整合触觉反馈对于纠正世界动作模型中的视觉估计误差至关重要，为物理接地的具身基础模型提供了可扩展的方法。

Summary / 总结

VTAM is a multimodal world modeling framework that integrates tactile perception with video-action models to improve performance in contact-rich manipulation tasks. It uses a pretrained video transformer and lightweight tactile stream integration, avoiding the need for tactile-language paired data or independent tactile pretraining. VTAM shows superior performance, maintaining a robust success rate of 90 percent on average and outperforming the pi 0.5 baseline by 80 percent in high-fidelity force awareness scenarios like potato chip pick-and-place.

VTAM 是一种将触觉感知与视频数据结合的多模态世界建模框架，以提高接触丰富的场景中的动作预测。它使用预训练的视频变换器和轻量级模态转移微调来学习跨模态表示，无需触觉-语言配对数据。VTAM 在接触丰富的操作任务中表现出色，平均成功率高达 90%，在需要高精度力感知的土豆片拾取放置任务中，其性能比 pi 0.5 基线高出 80%。

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Authors: Jiaying Lin, Dan Xu

First: 2026-03-24T17:42:31+00:00 · Latest: 2026-03-24T17:42:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

中文标题/摘要

标题：UniFunc3D：统一的主动空间-时间定位以实现3D功能分割

在3D场景中进行功能分割需要代理将隐式的自然语言指令精确定位到细粒度的交互元素的精确掩码中。现有方法依赖于支离破碎的管道，在初始任务解析过程中存在视觉盲点。我们观察到这些方法受限于单尺度、被动和启发式的帧选择。我们提出了UniFunc3D，这是一种统一且无需训练的框架，将多模态大型语言模型视为主动观察者。通过将语义、时间和空间推理合并到单一前向传递中，UniFunc3D能够进行联合推理，直接在视觉证据中进行任务分解。我们的方法引入了从粗到细的主动空间-时间定位策略。这使模型能够适应性地选择正确的视频帧，专注于高细节的交互部分，同时保留用于消歧的全局上下文。在SceneFun3D上，UniFunc3D达到了最先进的性能，大幅超越了无需训练和基于训练的方法，mIoU相对提升59.9%，而无需任何特定任务的训练。代码将在我们的项目页面上发布：https://jiaying.link/unifunc3d.

Summary / 总结

UniFunc3D addresses the limitations of existing methods in functionality segmentation by introducing a unified and training-free framework. It leverages a multimodal large language model as an active observer to perform joint semantic, temporal, and spatial reasoning in a single forward pass. This allows for active spatial-temporal grounding with a coarse-to-fine strategy, enabling the model to select correct video frames and focus on high-detail interactive parts while preserving global context. On the SceneFun3D dataset, UniFunc3D outperforms both training-free and training-based methods with a relative mIoU improvement of 59.9%, without any task-specific training.

UniFunc3D通过引入一个统一且无需训练的框架来解决现有方法在功能分割中的局限性。它利用多模态大语言模型作为主动观察者，在单次前向传递中进行联合语义、时间和空间推理。这使得模型能够采用粗到细的策略进行主动的空间-时间定位，选择正确的视频帧并专注于高细节的交互部分，同时保留全局上下文以进行消歧。在SceneFun3D数据集上，UniFunc3D在无需任何任务特定训练的情况下，mIoU相对提升59.9%，超越了无需训练和基于训练的方法。

RealMaster: Lifting Rendered Scenes into Photorealistic Video

Authors: Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar

First: 2026-03-24T17:32:42+00:00 · Latest: 2026-03-24T17:32:42+00:00

Comments: Project page: https://danacohen95.github.io/RealMaster/

Abs · PDF · Code1 · Code2 · Project1

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

中文标题/摘要

标题：RealMaster：将渲染场景提升为照片级真实感视频

最先进的视频生成模型能够产生惊人的照片级真实感，但它们缺乏将生成内容与特定场景要求精确对齐所需的控制能力。此外，没有底层的显式几何结构，这些模型无法保证3D一致性。相反，3D引擎在每个场景元素上提供精细控制，并通过设计提供原生的3D一致性，但其输出往往仍停留在“恐怖谷”。弥合这种从模拟到现实的差距需要结构精确性，即输出必须精确保留输入的几何和动力学，以及全局语义转换，即材料、照明和纹理必须整体转换以实现照片级真实感。我们提出了RealMaster方法，利用视频扩散模型将渲染视频提升为照片级真实感视频，同时保持与3D引擎输出的完全对齐。为了训练此模型，我们通过基于锚点的传播策略生成配对数据集，其中首尾帧增强以提高真实感，并使用几何条件线索在中间帧之间传播。然后，我们使用IC-LoRA对这些配对视频进行训练，以提炼管道输出的高质量结果，使其超越管道的限制，处理序列中间出现的对象和角色，并在不需要锚帧的情况下进行推理。在复杂的GTA-V序列上评估，RealMaster显著优于现有视频编辑基线，提高了照片级真实感，同时保留了由原始3D控制指定的几何、动力学和身份。

Summary / 总结

RealMaster is a method that uses video diffusion models to transform rendered video into photorealistic video while maintaining alignment with the original 3D engine output. It generates a paired dataset using an anchor-based propagation strategy and trains an IC-LoRA model to achieve this. RealMaster outperforms existing video editing techniques by improving photorealism while preserving the geometry, dynamics, and identity of the original 3D content.

RealMaster 是一种方法，使用视频扩散模型将渲染视频转换为逼真视频，同时保持与原始 3D 引擎输出的一致性。它通过基于锚点的传播策略生成配对数据集，并训练 IC-LoRA 模型以实现这一目标。RealMaster 优于现有视频编辑方法，在提高逼真度的同时保留原始 3D 内容的几何形状、动态和身份。

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Authors: Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo

First: 2026-03-24T17:32:29+00:00 · Latest: 2026-03-24T17:32:29+00:00

Abs · PDF · Code1 · Code2

Abstract

We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying linear Bellman completeness, a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with deterministic transitions, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an ε-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and 1/ε.

中文标题/摘要

标题：端到端高效强化学习在具有确定性转换的线性贝尔曼完备MDP中

我们研究马尔可夫决策过程（MDP）中采用线性函数逼近的强化学习（RL），其中MDP满足线性贝尔曼完备性，这是一个基本设定：任何线性价值函数的贝尔曼备份仍然是线性的。虽然从统计上是可处理的，但先前的计算高效算法要么局限于小的动作空间，要么需要关于特征空间的强oracle假设。我们提供了一个在具有确定性转换、随机初始状态和随机奖励的线性贝尔曼完备MDP中的计算高效算法。对于有限的动作空间，我们的算法是端到端高效的；对于大型或无限的动作空间，我们只需要一个标准的动作argmax oracle。我们的算法学习一个 ε-最优策略，其样本复杂度和计算复杂度均为时间跨度（horizon）、特征维度和 1/ε 的多项式。
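
For reference, the condition named in the abstract can be written out as follows. This is the standard form of linear Bellman completeness specialized to deterministic transitions, not a formula taken from the paper; the symbols (feature map φ_h, mean reward r̄_h, transition map f_h, horizon H, dimension d) are assumed notation.

```latex
% Assumed notation: \phi_h is the feature map at step h, \bar r_h the mean reward,
% f_h the deterministic transition map, H the horizon, d the feature dimension.
\[
\forall h \in [H],\ \forall \theta \in \mathbb{R}^d,\ \exists \theta' \in \mathbb{R}^d:\quad
\phi_h(s,a)^\top \theta' \;=\; \bar r_h(s,a) + \max_{a'} \phi_{h+1}\bigl(f_h(s,a), a'\bigr)^\top \theta
\quad \text{for all } (s,a).
\]
```

With stochastic transitions, the max term would sit inside an expectation over the next state; determinism removes that expectation.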

DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

Authors: Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan

First: 2026-03-24T17:26:55+00:00 · Latest: 2026-03-24T17:26:55+00:00

Comments: Project Page: https://ggare-cmu.github.io/DetPO/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO

中文标题/摘要

标题：DetPO：使用多模态LLM进行少样本物体检测的上下文学习

多模态LLM（MLLMs）在OdinW-13和RefCOCO等流行的物体检测基准测试中展示了强大的视觉定位能力。然而，最先进的模型仍然难以泛化到预训练中通常未出现的分布外类别、任务和成像模态。虽然上下文提示是提高跨多种任务性能的常见策略，但我们发现，它往往导致检测准确性低于仅使用类别名称进行提示。这表明当前的MLLMs尚不能有效利用少样本视觉示例和丰富的文本描述进行物体检测。由于前沿的MLLMs通常仅通过API访问，最先进的开放权重模型在消费级硬件上微调成本高昂，我们转而探索黑盒提示优化以进行少样本物体检测。为此，我们提出了检测提示优化（DetPO），这是一种无梯度的测试时优化方法，通过最大化少样本视觉训练示例上的检测准确性来细化仅文本提示，同时校准预测置信度。我们提出的方法在Roboflow20-VL和LVIS上的泛化型MLLMs上表现出一致的改进，优于先前的黑盒方法最多9.7%。我们的代码可在https://github.com/ggare-cmu/DetPO获取

Summary / 总结

The research aims to improve few-shot object detection using multi-modal LLMs by addressing their limitations in generalizing to new classes and imaging modalities. The study proposes Detection Prompt Optimization (DetPO), a gradient-free approach that optimizes text-only prompts to enhance detection accuracy on few-shot visual examples. DetPO consistently improves performance across various MLLMs, outperforming previous methods by up to 9.7% on Roboflow20-VL and LVIS benchmarks.

研究旨在通过改进多模态LLM的少样本目标检测能力，解决其在新类别和成像模态上的泛化问题。方法Detection Prompt Optimization (DetPO) 在不需微调的情况下，通过优化纯文本提示来提高少样本视觉示例上的检测准确性。该方法在Roboflow20-VL和LVIS基准测试上的一致性改进，优于先前方法最多9.7%。
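
A gradient-free prompt search of the kind described above can be sketched as a greedy loop over black-box scores. This is only an illustration under stated assumptions, not the DetPO algorithm: `optimize_prompt`, `propose`, and `score` are hypothetical names, `score` stands in for detection accuracy measured on the few-shot examples, and the confidence calibration mentioned in the abstract is not modeled.

```python
import random
from typing import Callable


def optimize_prompt(
    seed_prompt: str,
    propose: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    steps: int = 20,
    seed: int = 0,
) -> str:
    """Greedy black-box search: keep a rewritten prompt only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(steps):
        candidate = propose(best, rng)        # e.g. a rewrite suggested by an LLM
        candidate_score = score(candidate)    # e.g. accuracy on few-shot examples
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

Because the loop only calls `score`, it needs no gradients or model weights, which matches the API-only access constraint the abstract describes.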

Code Review Agent Benchmark

Authors: Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury

First: 2026-03-24T17:19:32+00:00 · Latest: 2026-03-24T17:19:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and generate huge volumes of code automatically, the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases, the issue of code review and, more broadly, quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset, called c-CRAB (pronounced see-crab), can evaluate agents for code review tasks. Specifically, given a pull request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agent. Our evaluation framework is used to evaluate the state of the art today: the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews generated by code review agents. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews, indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not least, the agent-generated tests from our dataset act as a held-out test suite and hence a quality gate for agent-generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents remains to be investigated.

中文标题/摘要

标题：代码审查代理基准

软件工程代理在编写代码方面显示出巨大的潜力。随着人工智能代理渗透到代码编写中，并自动生成大量代码，代码质量的问题变得尤为重要。当自动生成的代码被整合到庞大的代码库中时，代码审查和更广泛的质保问题变得重要起来。在本文中，我们重新审视了这一问题，并为人工智能代理整理了一个代码审查数据集。我们称之为c-CRAB（发音为see-crab）的数据集可以评估代理的代码审查能力。具体来说，给定一个拉取请求（可能是来自代码生成代理或人类），如果代码审查代理生成了审查意见，我们的评估框架可以评估代码审查代理的审查能力。我们的评估框架用于评估当今最先进的开源PR代理，以及来自Devin、Claude Code和Codex的商业代码审查代理。我们的c-CRAB数据集是系统地从人类审查中构建的——给定一个拉取请求实例的人类审查，我们生成相应的测试来评估代码审查代理生成的审查意见。这种基准构建为我们提供了几个见解。首先，现有的审查代理加在一起只能解决大约40%的c-CRAB任务，表明未来研究有可能缩小这一差距。其次，我们观察到，代理审查往往从人类审查中考虑不同的方面，这表明未来软件团队中人类-代理协作进行代码审查的潜力。最后但同样重要的是，我们数据集中由代理生成的测试作为保留测试套件和代理生成审查意见的质量门。未来代码生成代理、测试生成代理和代码审查代理的协作将意味着什么——仍有待进一步研究。

Summary / 总结

This paper addresses the challenge of evaluating code review agents by introducing c-CRAB, a dataset designed to assess their capabilities. The dataset is built from human reviews and used to evaluate state-of-the-art code review agents, including PR-agent, Devin, Claude Code, and Codex. The study reveals that these agents, taken together, solve only about 40% of the tasks, suggesting room for improvement. Additionally, the agents often focus on different aspects than humans, indicating potential for human-agent collaboration. The generated tests serve as a quality gate for agent-generated reviews, providing insights into future collaborative software development practices.

本文通过构建名为c-CRAB的数据集来评估代码审查代理的能力，该数据集旨在评估代理在处理来自代码生成代理和人类的拉取请求时的审查能力。评估框架被用于测试来自开源和商业提供商的最先进的代理，结果显示当前代理只能处理大约40%的任务，表明未来仍有改进空间。研究还指出，代理和人类在代码审查中采取不同的视角，这表明未来可能存在人类-代理协作的可能性。此外，该数据集还作为代理生成审查的品质门，提供了一个保留的测试套件，以供未来研究使用。

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Authors: Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu

First: 2026-03-24T17:18:44+00:00 · Latest: 2026-03-24T17:18:44+00:00

Comments: 24 pages, 11 figures, 12 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

中文标题/摘要

标题：3DCity-LLM：为3D城市规模感知与理解赋能的多模态大型语言模型

尽管多模态大型语言模型在对象中心或室内场景中表现出色，但将其扩展到3D城市规模环境仍是一项艰巨的挑战。为了解决这一问题，我们提出了3DCity-LLM，这是一种专为3D城市规模视觉语言感知和理解设计的统一框架。3DCity-LLM采用从粗到细的特征编码策略，包括三个并行分支，分别用于目标对象、对象间关系和全局场景。为了支持大规模训练，我们引入了包含约120万高质量样本的3DCity-LLM-1.2M数据集，这些样本覆盖了七个代表性任务类别，从精细对象分析到多方面场景规划。该数据集严格控制质量，整合了明确的3D数值信息和多样化的用户导向模拟，丰富了城市场景的问答多样性和真实性。此外，我们应用基于文本相似度度量和LLM语义评估的多维度协议，确保对所有方法进行忠实和全面的评估。在两个基准上的广泛实验表明，3DCity-LLM显著优于现有最先进的方法，为推进空间推理和城市智能提供了有希望且有意义的方向。源代码和数据集可在https://github.com/SYSU-3DSTAILab/3D-City-LLM获取。

Summary / 总结

The research aims to enhance multi-modality large language models for 3D city-scale perception and understanding, addressing the challenge of scaling these models to large environments. 3DCity-LLM, a unified framework, uses a coarse-to-fine feature encoding strategy with three parallel branches for target objects, inter-object relationships, and global scenes. The framework is evaluated on two benchmarks and outperforms existing methods, demonstrating its effectiveness in spatial reasoning and urban intelligence. The dataset 3DCity-LLM-1.2M, comprising approximately 1.2 million high-quality samples across seven task categories, supports large-scale training and provides diverse urban scenarios for evaluation.

研究旨在提升多模态大语言模型在3D城市尺度感知和理解方面的性能，解决这些模型在大规模环境中的扩展问题。3DCity-LLM是一个统一框架，采用从粗到细的特征编码策略，包含三个并行分支，分别处理目标对象、对象间关系和全局场景。该框架在两个基准上进行了广泛实验，并优于现有方法，展示了其在空间推理和城市智能方面的有效性。3DCity-LLM-1.2M数据集包含120万高质量样本，支持大规模训练，并提供多样化的城市场景用于评估。

The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2

Authors: Olivier Dietrich, Merlin Alfredsson, Emilia Arens, Nando Metzger, Torben Peters, Linus Scheibenreif, Jan Dirk Wegner, Konrad Schindler

First: 2025-11-07T18:02:07+00:00 · Latest: 2026-03-24T17:13:46+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Natural disasters demand rapid damage assessment to guide humanitarian response. Here, we investigate whether medium-resolution Earth observation images from the Copernicus program can support building damage assessment, complementing very-high resolution imagery with often limited availability. We introduce xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from both Sentinel-1 and Sentinel-2, spatially and temporally aligned with the established xBD benchmark. In a series of experiments, we demonstrate that building damage can be detected and mapped rather well in many disaster scenarios, despite the moderate 10 m ground sampling distance. We also find that, for damage mapping at that resolution, architectural sophistication does not seem to bring much advantage: more complex model architectures tend to struggle with generalization to unseen disasters, and geospatial foundation models bring little practical benefit. Our results suggest that Copernicus images are a viable data source for rapid, wide-area damage assessment and could play an important role alongside VHR imagery. We release the xBD-S12 dataset, code, and trained models to support further research at https://github.com/prs-eth/xbd-s12 .

中文标题/摘要

标题：哥白尼卫星在灾害响应中的潜力：利用Sentinel-1和Sentinel-2获取建筑损坏信息

自然灾害需要快速的损失评估以指导人道主义响应。本文探讨了哥白尼计划中中分辨率地球观测图像是否可以支持建筑损坏评估，作为对通常难以获取的超高分辨率图像的补充。我们介绍了由10,315对灾前和灾后图像组成的xBD-S12数据集，这些图像来自Sentinel-1和Sentinel-2，并与现有的xBD基准数据集在空间和时间上对齐。一系列实验表明，尽管地面采样距离为10米，但在许多灾害场景中，建筑损坏可以被较好地检测和映射。我们还发现，对于该分辨率的损坏映射，模型架构的复杂程度似乎并没有带来太多优势：更复杂的模型架构往往难以泛化到未见过的灾害，而地理空间基础模型也几乎没有实际益处。我们的结果表明，哥白尼图像可以作为快速、大面积损失评估的可行数据源，并且可以与超高分辨率图像一起发挥重要作用。我们已在https://github.com/prs-eth/xbd-s12 上发布了xBD-S12数据集、代码和训练模型，以支持进一步研究。

Summary / 总结

This study investigates the use of medium-resolution images from Copernicus satellites (Sentinel-1 and Sentinel-2) for building damage assessment after natural disasters. The research introduces xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs, to evaluate the effectiveness of these images. The study finds that building damage can be well-detected and mapped despite the moderate resolution, and that more complex model architectures do not significantly improve performance for unseen disasters. The results suggest that Copernicus images can be a valuable data source for rapid, large-scale damage assessment, complementing high-resolution imagery.

该研究探讨了使用Copernicus Sentinel-1和Sentinel-2图像进行自然灾害后建筑物损坏评估的可能性。研究人员开发了包含10,315组灾前和灾后图像对的xBD-S12数据集，以评估中分辨率图像的有效性。实验结果显示，即使地面采样距离为10米，也能较好地检测和映射建筑物损坏。研究还发现，更复杂的模型架构在新灾害面前难以泛化，而地理空间基础模型也没有提供显著的实际益处。研究结果表明，Copernicus图像可以成为快速评估损坏的重要资源。该数据集、代码和训练模型已公开，供进一步研究使用。

EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

First: 2026-03-18T15:02:19+00:00 · Latest: 2026-03-24T17:12:47+00:00

Comments: Project page: https://eva-project-page.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
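The reward idea in the abstract above can be illustrated with a minimal sketch. It assumes the IDM has already decoded a generated video into an action array of shape (T, D); the function name `executability_reward`, the weights, the time step `dt`, and the simple per-dimension limits standing in for "embodiment constraints" are all hypothetical choices, not details taken from the paper.

```python
import numpy as np

def _mean_sq(x):
    """Mean squared value, or 0.0 when the array is empty (trajectory too short)."""
    return float(np.mean(x ** 2)) if x.size else 0.0

def executability_reward(actions, dt=0.1, lower=None, upper=None,
                         w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Score an IDM-decoded action sequence of shape (T, D); higher is better.

    Weights, dt, and the bound handling are illustrative assumptions.
    """
    a = np.asarray(actions, dtype=float)
    # Finite-difference estimates of velocity, acceleration, and jerk.
    vel = np.diff(a, n=1, axis=0) / dt
    acc = np.diff(a, n=2, axis=0) / dt ** 2
    jerk = np.diff(a, n=3, axis=0) / dt ** 3
    penalty = w_vel * _mean_sq(vel) + w_acc * _mean_sq(acc) + w_jerk * _mean_sq(jerk)

    # Assumed embodiment constraint: penalize how far actions leave [lower, upper].
    if lower is not None:
        penalty += w_bound * float(np.mean(np.clip(lower - a, 0.0, None)))
    if upper is not None:
        penalty += w_bound * float(np.mean(np.clip(a - upper, 0.0, None)))
    return -penalty
```

Under these assumptions, a jittery action sequence (as might be decoded from a video with visual artifacts) scores lower than a smooth one, which is the property the abstract relies on when it says the reward stays informative for degraded videos.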

中文标题/摘要

标题：EVA：通过逆动力学奖励使视频世界模型与可执行机器人动作对齐

视频生成模型在机器人学中越来越多地用作世界模型，其中模型根据当前观察和任务指令生成未来视觉滚动，逆动力学模型（IDM）将生成的帧转换为可执行的机器人动作。然而，当前的视频世界模型缺乏明确的可执行性约束。因此，视觉上连贯的滚动可能仍然违反刚体和运动学一致性，当由IDM解码时，会产生不稳定或不可行的控制命令。我们将这种视觉生成与物理可执行控制之间的不匹配称为可执行性差距。虽然可以通过拒绝采样等技术在推理时减轻这种差距，但这些方法由于视频生成成本高而效率低下。在本文中，我们利用可执行性差距作为训练信号，并引入了可执行视频对齐（EVA），这是一种用于对齐视频世界模型的强化学习后训练框架。EVA 在真实机器人轨迹上训练逆动力学模型，并将其重新用于评估通过其引起的动作序列生成的视频的奖励模型，鼓励由速度、加速度和冲击度衡量的平滑运动，同时惩罚违反实体约束的动作。重要的是，即使生成的视频包含严重的视觉伪影，奖励仍然具有信息性，因为这些伪影通常会转化为不稳定或超出范围的动作。在RoboTwin基准测试和一个真实的双臂机器人上的实验表明，EVA 减少了生成滚动中的实体特定伪影并提高了下游任务执行的成功率。

Summary / 总结

The paper addresses the executability gap in video world models for robotics, where visually coherent rollouts may violate physical constraints. It introduces EVA, a reinforcement-learning framework that trains an inverse dynamics model on real robot trajectories and uses it as a reward model to encourage smooth motions and penalize violations of embodiment constraints. Experiments show that EVA reduces embodiment-specific artifacts and improves task execution success.

研究解决了机器人中视频世界模型的可执行性缺口问题，即视觉连贯的滚动可能违反物理约束。方法EVA在真实机器人轨迹上训练逆动力学模型，并将其用作奖励模型，以鼓励平滑运动并惩罚违反实体约束的动作。实验表明，EVA减少了生成滚动中的实体特定缺陷，并提高了下游任务执行的成功率。

SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images

Authors: Bao Truong, Quang Nguyen, Baoru Huang, Jinpei Han, Van Nguyen, Ngan Le, Minh-Tan Pham, Doan Huy Hien, Anh Nguyen

First: 2026-03-24T17:12:45+00:00 · Latest: 2026-03-24T17:12:45+00:00

Comments: Accepted at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

Abs · PDF · Code1 · Code2

Abstract

Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce SIGMA, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney masks for detection and (ii) paired degraded and ground-truth images for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.

中文标题/摘要

标题：SIGMA：基于物理的基准数据集，用于地震图像中的气体烟囱理解

地震图像从现场记录重构地下反射性，指导勘探和储层监测。气体烟囱是由地下流体迁移引起的垂直异常。理解这些现象对于评估烃类潜力和避免钻井风险至关重要。然而，由于强烈的地震衰减和散射，准确检测具有挑战性。传统的基于物理的方法计算成本高且对模型误差敏感，而深度学习提供了高效的替代方案，但缺乏标注数据集。在本文中，我们引入了**SIGMA**，一个新的基于物理的地震图像中气体烟囱理解的数据集，包含(i) 像素级别的气体烟囱掩码用于检测，(ii) 降级和真实图像配对用于增强。我们采用了覆盖广泛地质设置和数据采集条件的基于物理的方法。全面的实验表明，SIGMA 作为气体烟囱解释的具有挑战性的基准，并有助于一般地震理解。

Summary / 总结

The research aims to improve the detection and understanding of gas chimneys in seismic images, which is crucial for assessing hydrocarbon potential and avoiding drilling hazards. The study introduces SIGMA, a new physics-based dataset with pixel-level gas-chimney masks for detection and paired degraded and ground-truth images for enhancement. The data are produced with physics-based methods that cover a wide range of geological settings and data acquisition conditions. Experiments show that SIGMA is a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.

研究旨在提高在地震图像中对气烟囱的检测和理解，这对于烃类勘探和避免钻井风险至关重要。研究引入了SIGMA，这是一个新的基于物理的数据库，包含像素级别的气烟囱掩模和增强图像，涵盖了多种地质设置。实验表明，SIGMA 是一个具有挑战性的基准，用于气烟囱解释，并且有助于一般地震理解。

Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

Authors: Connor Mclaughlin, Nigel Lee, Lili Su

First: 2026-03-24T17:10:47+00:00 · Latest: 2026-03-24T17:10:47+00:00

Comments: 9 pages

Abs · PDF · Code1 · Code2

Abstract

Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches either assume each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.
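The in-distribution versus out-of-distribution split described in the abstract above could look roughly like the sketch below. The abstract does not say how alignment with current prompts is measured, so the cosine-similarity test, the `threshold` value, and the names `split_by_prompt_similarity` and `prompt_keys` are assumptions made purely for illustration.

```python
import numpy as np

def split_by_prompt_similarity(features, prompt_keys, threshold=0.5):
    """Split samples into in-distribution and out-of-distribution sets.

    features:    (N, D) sample embeddings from a frozen pre-trained model.
    prompt_keys: (P, D) one key vector per existing prompt (assumed design).
    Returns a boolean in-distribution mask and the best-matching prompt index.
    """
    f = np.asarray(features, dtype=float)
    k = np.asarray(prompt_keys, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    sim = f @ k.T                      # (N, P) cosine similarities
    best = sim.max(axis=1)
    in_distribution = best >= threshold
    return in_distribution, sim.argmax(axis=1)
```

Samples whose mask entry is False would be the ones routed to a newly introduced prompt in such a design, while the rest reuse the prompt they match best.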

中文标题/摘要

标题：相似性意识混合专家模型以提高数据高效连续学习

机器学习模型在部署后通常需要适应新的数据，以应对结构化或非结构化的现实世界动态。连续学习（CL）框架允许模型持续适应，但大多数现有方法要么假设每个任务包含足够的数据样本，要么认为学习任务是不重叠的。在本文中，我们解决了每个任务可能数据有限且任务以任意方式重叠的更一般情况，且事先没有知识。这种一般情况由于两个原因更具挑战性。一方面，数据稀缺性要求有效利用一般知识并在任务之间高效转移知识。另一方面，无结构的任务重叠容易导致负面知识转移。为了解决上述挑战，我们提出了一种基于预训练模型的自适应混合专家（MoE）框架，该框架逐步建立任务之间的相似性意识。我们的设计包含两个创新的算法组件：增量全局池化和实例级提示掩码。前者通过时间上的逐步提示引入来缓解提示关联噪声。后者将传入的任务样本分解为与当前提示一致的（同分布）和需要新提示的（异分布）。结合我们的设计，即使在每个任务数据稀缺的情况下，也能战略性地利用潜在的任务重叠，同时积极防止负面相互干扰。在不同数据量和任务间相似性的实验中，我们的方法提高了样本效率，并具有广泛适用性。

Summary / 总结

This paper addresses the challenge of continual learning in scenarios where tasks have limited data and may overlap arbitrarily. It proposes a similarity-aware mixture-of-experts framework that uses incremental global pooling and instance-wise prompt masking to effectively manage knowledge transfer and prevent negative interference. The method demonstrates improved sample efficiency across different data volumes and task similarities.

本文解决了每个任务数据有限且任务间可能任意重叠的持续学习挑战。作者提出了一种相似性感知的混合专家框架，使用增量全局池化和实例级提示掩蔽。这些组件有助于有效跨任务转移知识并防止负面干扰。该方法在各种实验设置中展示了更好的样本效率。

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

Authors: Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut

First: 2026-03-08T22:29:13+00:00 · Latest: 2026-03-24T17:03:12+00:00

Comments: Accepted to CVPR26

Abs · PDF · Code1 · Code2 · Code3

Abstract

Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity, particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze, the first large-scale off-axis gaze estimation dataset for VR, comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye imagery, achieving effective gaze-appearance disentanglement in a compact, real-time model. GazeShift embeddings can be optionally adapted to individual users via lightweight few-shot calibration, achieving a 1.84° mean error on VRGaze. On the remote-camera MPIIGaze dataset, the model achieves a 7.15° person-agnostic error, doing so with 10x fewer parameters and 35x fewer FLOPs than baseline methods. Deployed natively on a VR headset GPU, inference takes only 5 ms. Combined with demonstrated robustness to illumination changes, these results highlight GazeShift as a label-efficient, real-time solution for VR gaze tracking. Project code and the VRGaze dataset are released at https://github.com/gazeshift3/gazeshift
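The errors quoted in degrees above are presumably angular errors between predicted and reference gaze directions; the abstract does not define the metric, so the sketch below is only the common convention, with the function name `angular_error_deg` chosen for illustration.

```python
import numpy as np

def angular_error_deg(pred, target):
    """Angle in degrees between predicted and reference 3D gaze vectors.

    Both inputs have shape (..., 3) and need not be unit length.
    """
    p = np.asarray(pred, dtype=float)
    t = np.asarray(target, dtype=float)
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(p * t, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```

Averaging this quantity over a test set gives a mean error comparable in kind to the figures reported for VRGaze and MPIIGaze, assuming the paper follows the same convention.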

中文标题/摘要

标题：GazeShift：无监督注视估计及其在VR中的数据集

注视估计是现代虚拟现实（VR）系统中的关键组成部分。尽管远程摄像头注视估计取得了显著进展，但VR注视研究仍受限于数据稀缺性，特别是缺乏大规模、准确标注的数据集，这些数据集使用的是现代头显常见的非轴向摄像头配置。注视标注困难，因为无法保证注视目标。为解决这些挑战，我们引入了VRGaze，这是首个用于VR的大规模非轴向注视估计数据集，包含来自68名参与者收集的210万张近眼红外图像。我们还提出了GazeShift，一种基于注意力的无监督框架，用于在无标注数据的情况下学习注视表示。与依赖多视角或3D几何的重定向方法不同，GazeShift专门针对近眼图像，能够在紧凑的实时模型中有效分离注视和外观。GazeShift嵌入可以通过轻量级的少量样本校准可选地适应个别用户，实现VRGaze上的1.84°平均误差。在远程摄像头MPIIGaze数据集上，该模型实现了7.15°的人体无关误差，参数量比基线方法少10倍，FLOPs少35倍。在VR头显GPU上原生部署时，推理仅需5毫秒。结合对光照变化的鲁棒性，这些结果突显了GazeShift作为VR注视跟踪的标签高效、实时解决方案的重要性。项目代码和VRGaze数据集可在https://github.com/gazeshift3/gazeshift发布。

Summary / 总结

Gaze estimation in VR is crucial but hindered by data scarcity and the difficulty of accurate labeling. To address this, the authors introduce VRGaze, a large-scale dataset of 2.1 million near-eye infrared images, and GazeShift, an unsupervised framework that learns gaze representations without labeled data. GazeShift achieves a 1.84° mean error on VRGaze and a 7.15° person-agnostic error on the MPIIGaze dataset, using significantly fewer parameters and FLOPs than baseline methods. The model runs in real-time on a VR headset GPU with only 5 ms inference time, making it a label-efficient and robust solution for VR gaze tracking.

论文通过引入VRGaze数据集和GazeShift无监督框架解决了VR系统中的注视估计问题。VRGaze数据集包含来自68名参与者的210万张近眼红外图像，GazeShift框架在无需标注数据的情况下学习注视表示，实现在VR头显GPU上的实时推理，仅需5毫秒。该模型在VRGaze数据集上的平均误差为1.84°，在MPIIGaze数据集上的无个体差异误差为7.15°，且参数和计算量远少于基线方法，同时具有对光照变化的鲁棒性。

JaGuard: Position Error Correction of GNSS Jamming with Deep Temporal Graphs

Authors: Ivana Kesić, Aljaž Blatnik, Carolina Fortuna, Blaž Bertalanič

First: 2025-09-17T14:12:36+00:00 · Latest: 2026-03-24T16:56:42+00:00

Comments: 11 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

Global Navigation Satellite Systems (GNSS) face growing disruption from intentional jamming, undermining critical infrastructure where precise positioning and timing are essential. Current position error correction (PEC) methods mainly focus on multi-path propagation errors and fail to exploit the spatio-temporal coherence of satellite constellations. We recast jamming mitigation as a dynamic graph regression problem. We propose Jamming Guardian (JaGuard), a receiver-centric deep temporal graph network that estimates and corrects jamming-induced positional drift at fixed locations like roadside units. Modeling the satellite-receiver scene as a heterogeneous star graph at each 1 Hz epoch, our Heterogeneous Graph ConvLSTM fuses spatial context (SNR, azimuth, elevation) with short-term temporal dynamics to predict 2D positional deviation. Evaluated on a real-world dataset from two commercial receivers under synthesized RF interference (three jammer types, -45 to -70 dBm), JaGuard consistently yields the lowest Mean Absolute Error (MAE) compared to advanced baselines. Under severe jamming (-45 dBm), it maintains an MAE of 2.85-5.92 cm, improving to sub-2 cm at lower interference. On mixed-power datasets, JaGuard surpasses all baselines with MAEs of 2.26 cm (GP01) and 2.61 cm (U-blox 10). Even under extreme data starvation (10% training data), JaGuard remains stable, bounding error at 15-20 cm and preventing the massive variance increase seen in baselines. This confirms that dynamically modeling the physical deterioration of the constellation graph is strictly necessary for resilient interference correction.
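As a rough illustration of the per-epoch star graph mentioned above, the sketch below builds one snapshot in which every visible satellite is a leaf node carrying (SNR, azimuth, elevation) and all edges point to a single receiver node. The class `StarGraphSnapshot`, the dictionary keys, and the edge-index layout are hypothetical; the abstract does not describe the actual data structures.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StarGraphSnapshot:
    sat_features: np.ndarray  # (num_sats, 3): SNR, azimuth, elevation
    edge_index: np.ndarray    # (2, num_sats): satellite index -> receiver index

def build_snapshot(observations):
    """Build one star graph from a list of per-satellite observation dicts."""
    feats = np.array(
        [[o["snr"], o["azimuth"], o["elevation"]] for o in observations],
        dtype=np.float32,
    ).reshape(-1, 3)
    n = feats.shape[0]
    receiver = n  # the receiver is given the index after the last satellite
    edge_index = np.stack([np.arange(n), np.full(n, receiver)])
    return StarGraphSnapshot(feats, edge_index)
```

A sequence of such snapshots, one per 1 Hz epoch, would then be the input a temporal graph model consumes; the number of satellite nodes may change between epochs as satellites drop out under jamming.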

中文标题/摘要

标题：JaGuard：使用深度时序图的GNSS干扰位置误差校正

全球导航卫星系统（GNSS）正遭受日益严重的故意干扰，这在依赖精确定位和时间的关键基础设施中构成了威胁。当前的位置误差校正（PEC）方法主要关注多路径传播误差，未能利用卫星星座的空间-时间一致性。我们将干扰缓解重新定义为动态图回归问题。我们提出了一种接收器为中心的深度时序图网络JaGuard，用于估计并校正干扰引起的定位漂移，特别是在路边单元等固定位置。在每个1 Hz周期将卫星-接收器场景建模为异构星形图，我们的异构图卷积LSTM融合了空间上下文（信噪比、方位角、仰角）与短期时间动态，以预测2D位置偏差。在两个商用接收器在合成射频干扰（三种干扰类型，-45至-70 dBm）下的真实数据集上评估，JaGuard在所有先进基线中始终具有最低的平均绝对误差（MAE）。在严重干扰（-45 dBm）下，它保持2.85-5.92 cm的MAE，干扰减弱时可降至低于2 cm。在混合功率数据集中，JaGuard的MAE分别为2.26 cm（GP01）和2.61 cm（U-blox 10）超过所有基线。即使在极端数据匮乏（10%训练数据）的情况下，JaGuard仍保持稳定，将误差限制在15-20 cm，防止基线中出现的巨大方差增加。这表明动态建模星座图的物理退化对于稳健的干扰校正是必不可少的。

Summary / 总结

JaGuard is a deep temporal graph network designed to correct position errors caused by GNSS jamming. It models the satellite-receiver scene as a heterogeneous star graph and uses a Heterogeneous Graph ConvLSTM to fuse spatial context and temporal dynamics. Evaluated on real-world data, JaGuard outperforms existing methods, maintaining an MAE of 2.85-5.92 cm under severe jamming and achieving sub-2 cm accuracy at lower interference levels.

JaGuard 是一种基于接收器的深度时序图网络，旨在纠正由 GNSS 干扰引起的定位误差。它将卫星-接收器场景建模为异构星形图，并使用 Heterogeneous Graph ConvLSTM 融合空间和时间数据以预测位置偏差。在实际数据上进行评估时，JaGuard 在严重干扰条件下仍优于先进基线，并且即使训练数据有限也能保持稳定。

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

Venue: CVPR

First: 2025-10-01T15:15:36+00:00 · Latest: 2026-03-24T16:54:51+00:00

Comments: Accepted in MAR at CVPR Workshop (Proceedings Track)

Abs · PDF · Code1 · Code2

Abstract

Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since Flamingo was introduced by DeepMind. Recent advancements in large-context/long-video question answering have allowed VQA tasks to have a context window of 1500+ frames. However, this only leads to 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with supervised two-turn targets including reasoning and final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness on summarization of temporal evidence. Similar observations were made on zero-shot in TVQA.
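The per-second temporal pooling described above can be sketched as a weighted average over the frames of one second. The exact weighting formulas are not given in the abstract, so the uniform, linear-ramp, and exponential weights below (and the `decay` parameter and function name `pool_second`) are assumptions; the Blend Blur with Last Frame variant is omitted because its definition is not stated.

```python
import numpy as np

def pool_second(frames, scheme="uniform", decay=0.8):
    """Collapse the frames of one second, shape (N, H, W, C), into one image.

    The weight shapes are illustrative guesses, not the paper's definitions.
    """
    frames = np.asarray(frames, dtype=np.float32)
    n = frames.shape[0]
    if scheme == "uniform":
        w = np.ones(n)
    elif scheme == "ramp":
        w = np.arange(1, n + 1, dtype=float)       # later frames weigh more
    elif scheme == "exponential":
        w = decay ** np.arange(n - 1, -1, -1)      # last frame has weight 1
    else:
        raise ValueError(f"unknown pooling scheme: {scheme}")
    w = w / w.sum()
    # Weighted sum over the time axis gives an (H, W, C) image.
    return np.tensordot(w, frames, axes=(0, 0))
```

The figures in the abstract (1500 frames covering about 50 seconds) imply roughly 30 frames per second, so pooling to one image per second would let the same 1500-image budget span about 1500 seconds, or 25 minutes, of footage.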

中文标题/摘要

标题：POVQA：基于偏好优化的视频问答与数据效率推理

自Deepmind推出的Flamingo以来，大规模视觉语言模型（LVLM）驱动的视频问答（VQA）在研究中获得了显著的关注。最近在长视频上下文问答方面的进展使得VQA任务能够处理1500多帧的上下文窗口，但这也仅相当于50秒的视频内容，而不会丢失任何重要信息。我们提出了POVQA，这是一种数据高效的管道，将每秒的视频压缩成单个时间池化图像（通过运动模糊和加权平均变体），然后通过轻量级监督与LVLM对齐。具体来说，我们使用混合模糊与最后帧、加权平均、指数和线性池化构建1 fps输入源，并使用监督两轮目标（包括推理和最终答案）微调QWEN-2.5-VL 7B。我们在包含12部电影和239个人工标注问题-答案及推理提示的新数据集ReasonVQA上应用了监督微调（SFT）和直接偏好优化（DPO）。在ReasonVQA数据集上，该方法显著提高了性能：F1分数从0.212提高到0.543，BLEU-4从0.031提高到0.291，ROUGE-L从0.196提高到0.528。推理质量也显著提高。SFT + DPO在各种池化函数上的跨评估表明，无论是在训练还是测试时使用哪种池化方案，这些收益都保持不变，表明在时间证据总结方面具有很强的鲁棒性。类似观察结果也出现在TVQA的零样本测试中。

Summary / 总结

POVQA is a data-efficient pipeline that compresses each second of video into a single temporally pooled image and aligns large vision language models with lightweight supervision. It uses Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling to build 1 fps input sources and fine-tunes QWEN-2.5-VL 7B with supervised two-turn targets. On the ReasonVQA dataset, this method improves F1 score from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528, showing significant performance gains and robustness across different pooling schemes.

POVQA 是一种高效的数据管道，将视频压缩成时间池化图像，并与轻量级监督对齐大型视觉语言模型。它使用 Blend Blur 与 Last Frame、加权平均、指数和斜坡池化来创建 1 fps 输入源，并对 QWEN-2.5-VL 7B 进行带有监督两轮目标的微调。在 ReasonVQA 数据集上，该方法显著提高了 F1 分数、BLEU-4 和 ROUGE-L 分数，并增强了推理质量。跨评估显示了在不同池化函数下具有鲁棒性。

RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts

Authors: Saleem Ahmed, Srirangaraj Setlur, Venu Govindaraju

First: 2024-10-29T19:32:53+00:00 · Latest: 2026-03-24T16:52:30+00:00

Comments: Under Review : Code and Data will be made public soon - https://cse-ai-lab.github.io/VPP/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales. Existing benchmarks evaluate only final-answer correctness. They do not support atomic visual entailment verification of intermediate steps, especially visual compositional logic. This limitation is especially acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is deconstructed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable reasoning chains rather than free-form textual rationales. These premises form compositional reasoning chains, enabling verification at the level of individual visual statements and complete reasoning sequences. We introduce chain-level metrics that measure both full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP), extending beyond traditional VQA accuracy. Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.

中文标题/摘要

标题：RealCQA-V2：科学图表结构化视觉演绎诊断基准

多模态推理模型通常会产生由看似连贯的理由支持的流畅答案。现有基准仅评估最终答案的正确性，而不支持对中间步骤的原子视觉演绎验证，尤其是视觉组合逻辑。这一限制在科学图表理解中尤为明显，因为答案依赖于轴、图例和定量关系等确定性的视觉语义。我们引入了RealCQA-V2，这是一个大规模基准，将图表问题回答重新表述为视觉前提证明（VPP）：基于图表的视觉谓词的结构化逻辑演绎任务。每个问题被分解为手动整理的、基于图表元素（轴、图例、标记和定量关系）的原子前提，生成可执行的推理链而非自由形式的文本理由。这些前提形成组合推理链，允许在单个视觉声明和完整推理序列的层面上进行验证。我们引入了链级度量，衡量完全逻辑有效性（AccVPP）和失败链中的部分推理进展（DCP），超越了传统的VQA准确性。代表性的LVLM基线评估揭示了一致的局部-全局推理差距：模型通常能够正确验证许多单个前提，但在保持整个链的连贯性方面却失败了。RealCQA-V2为真实科学图表上的结构化视觉演绎建立了可重复的基准，并使多模态推理超越仅答案评估的严格诊断成为可能。

Summary / 总结

RealCQA-V2 is a benchmark designed to evaluate structured visual entailment in scientific chart understanding. It reformulates chart question answering as a Visual Premise Proving task, breaking down questions into atomic visual premises. This allows for the verification of intermediate steps and compositional reasoning chains. The benchmark introduces chain-level metrics to measure full logical validity and partial reasoning progress within failed chains. Baseline evaluations show a gap between local and global reasoning, indicating that models can verify individual premises but struggle to maintain coherence across the full chain. RealCQA-V2 provides a reproducible benchmark for diagnosing multimodal reasoning beyond simple answer correctness.

RealCQA-V2 是一个用于评估科学图表中结构化视觉蕴含的基准。它将图表问题回答重新表述为视觉前提证明，将问题分解为原子视觉前提，并衡量逻辑有效性和推理进展。实验表明，模型往往能够正确验证单个前提，但在保持整个推理链的一致性方面存在困难，揭示了局部与全局推理之间的差距。

Bilevel Autoresearch: Meta-Autoresearching Itself

Authors: Yaonan Qu, Meng Lu

First: 2026-03-24T16:52:25+00:00 · Latest: 2026-03-24T16:52:25+00:00

Comments: 13 pages, 5 figures, 3 tables. This paper was primarily drafted by AI agents with human oversight and direction

Abs · PDF · Code1 · Code2

Abstract

If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system -- from Karpathy's single-track loop to AutoResearchClaw's multi-batch extension and EvoScientist's persistent memory -- was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM -- no stronger model is needed at the meta level. On Karpathy's GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments -- without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.

中文标题/摘要

标题：双层自研究：自我研究的元研究

如果自研究本身就是一种研究形式，那么自研究可以应用于研究本身。我们严肃地对待这一想法：我们使用一个自研究循环来优化自研究循环。现有的所有自研究系统——从Karpathy的一轨循环到AutoResearchClaw的多批扩展以及EvoScientist的持久内存——都是由人类阅读代码、识别瓶颈并编写新代码改进的。我们询问是否可以由一个LLM自主地做到同样的事情。我们提出了双层自研究，这是一种双层框架，在这种框架中，外层循环通过生成并注入新的搜索机制作为Python代码在运行时来元优化内层自研究循环。内层循环优化任务；外层循环优化内层循环的搜索方式。两个循环使用相同的LLM——不需要在元层使用更强的模型。在Karpathy的GPT预训练基准上，元自研究外层循环在仅使用标准内层循环的情况下实现了5倍的改进（-0.045 vs. -0.009 val_bpb），而参数级别的调整在机制不变的情况下没有获得可靠的收益。外层循环自主地从组合优化、多臂老虎机和实验设计中发现机制——无需人类指定要探索的领域。这些机制通过打破内层循环的确定性搜索模式，迫使探索LLM先验系统地避免的方向。核心原则很简单：如果自研究可以自我研究，那么原则上它可以自我研究任何具有可测量目标的东西。

Summary / 总结

The research aims to apply autoresearch to itself by using a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop. The inner loop optimizes the task, while the outer loop optimizes the search mechanisms. On Karpathy's GPT pretraining benchmark, the meta-autoresearch achieved a 5x improvement over the standard inner loop alone. The outer loop autonomously discovered mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments, breaking the inner loop's deterministic search patterns and forcing exploration of directions the LLM's priors avoid.

研究旨在通过使用双层框架来将自研究应用于自身，其中外层循环通过生成和注入新的搜索机制来优化内层自研究循环。内层循环优化任务，而外层循环优化搜索机制。在外层循环的优化下，Karpathy的GPT预训练基准测试中，自研究的改进达到了标准内层循环的5倍。外层循环自主发现了组合优化、多臂老虎机和实验设计等机制，打破了内层循环的确定性搜索模式，迫使探索LLM先验系统性避免的方向。

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Authors: Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You

First: 2026-03-24T16:48:31+00:00 · Latest: 2026-03-24T16:48:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples, which are grouped for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges like AIME 24, Math 500, and Minerval, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over the baseline given the same amount of data.
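The length-aware reordering described above can be illustrated offline with a small sketch: sort finished rollouts by output length and cut them into update groups, shortest first. This is a simplification under stated assumptions; the actual system is online and cache-based, and the helper `idle_fraction` is only a crude stand-in for the paper's bubble ratio (it measures the share of token slots wasted when every sample in a group waits for the longest one). Both function names are hypothetical.

```python
from typing import List, Sequence

def length_sorted_groups(lengths: Sequence[int], group_size: int) -> List[List[int]]:
    """Return sample indices grouped so that short outputs are updated first."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

def idle_fraction(lengths: Sequence[int], groups: List[List[int]]) -> float:
    """Share of token slots left idle when each group waits for its longest sample."""
    used = sum(lengths[i] for g in groups for i in g)
    reserved = sum(max(lengths[i] for i in g) * len(g) for g in groups)
    return 1.0 - used / reserved

lengths = [100, 5, 90, 7, 95, 6]
arrival_groups = [[0, 1, 2], [3, 4, 5]]            # grouped in arrival order
sorted_groups = length_sorted_groups(lengths, 3)   # [[1, 5, 3], [2, 4, 0]]
```

With these toy lengths, grouping in arrival order leaves about 48% of the reserved token slots idle, while the length-sorted grouping leaves about 6%, which conveys why mixing short and long samples in one synchronized batch is costly.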

中文标题/摘要

标题：SortedRL：通过在线长度感知调度加速大语言模型的RL训练

强化学习（RL）的扩展显示出增强大语言模型（LLMs）推理能力的强大潜力，特别是在需要长链思考生成的任务中。然而，RL训练效率往往受限于展开阶段，当生成长轨迹（例如，16k个标记）时，展开阶段可能占总训练时间的70%，这主要是由于缓慢的自回归生成和展开与策略更新之间的同步开销。我们提出SortedRL，这是一种在线长度感知调度策略，旨在通过提高展开效率并保持训练稳定性来解决这一瓶颈。SortedRL 根据输出长度对展开样本进行重新排序，优先处理形成早期更新的短样本组。这使得可以使用大规模的展开批次、灵活的更新批次，并且可以近似实现微级课程的构建。为了进一步加速流水线，SortedRL 通过基于缓存的机制控制脱政策训练的程度，并通过状态控制器和展开缓冲区管理展开和更新。使用LLaMA-3.1-8B和Qwen-2.5-32B在各种任务上进行的实验，包括逻辑谜题和数学挑战（如AIME 24、Math 500和Minerval），表明SortedRL将RL训练气泡比例降低了超过50%，同时在相同数据量下比基线提高了3.9%到18.4%的性能。

Summary / 总结

SortedRL is an online length-aware scheduling strategy that accelerates reinforcement learning (RL) training for large language models (LLMs) by improving rollout efficiency and maintaining training stability. It reorders rollout samples based on output lengths, prioritizing short samples for early updates, which enables large rollout batches and flexible update batches. Experiments show that SortedRL reduces RL training bubble ratios by over 50% and achieves 3.9% to 18.4% superior performance compared to baselines with the same amount of data.

SortedRL 是一种在线长度感知调度策略，通过提高回放效率和保持训练稳定性来加速 LLM 的 RL 训练。它根据输出长度重新排序回放样本，优先处理较短的样本进行早期更新，从而实现大规模回放批次和灵活的更新批次。实验结果显示，SortedRL 将 RL 训练泡沫比例降低了超过 50%，并且在相同数据量的情况下比基线模型提高了 3.9% 到 18.4% 的性能。

I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation

Authors: Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song, Yifu Wang, Jianfei Cai, Tien-Tsin Wong, Pan Ji

First: 2026-03-24T16:45:40+00:00 · Latest: 2026-03-24T16:45:40+00:00

Comments: Project page: https://riga2.github.io/i3dm

Abs · PDF · Code1 · Code2 · Project1

Abstract

Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.

中文标题/摘要

标题：I3DM：隐式三维感知记忆检索与注入以实现一致的视频场景生成

尽管在视频生成方面取得了显著进展，但在重新访问之前探索过的区域时保持长期场景一致性仍然具有挑战性。现有解决方案要么依赖于显式构建三维几何结构，这会遭受误差累积和尺度模糊的问题，要么依赖于简单的相机视场（FoV）检索，这通常在复杂遮挡下会失败。为克服这些限制，我们提出了一种新的隐式三维感知记忆机制I3DM，用于一致的视频场景生成，该机制绕过了显式的三维重建。我们方法的核心是一种三维感知记忆检索策略，该策略利用预训练的前馈新颖视图合成（FF-NVS）模型的中间特征来评分视图的相关性，即使在高度遮挡的情况下也能实现稳健的检索。此外，为了充分利用检索到的历史帧，我们引入了一种三维对齐的记忆注入模块。该模块隐式地将历史内容映射到目标视图，并根据可靠的映射区域适配性地条件生成，从而提高了重新访问的一致性和准确的相机控制。广泛的实验表明，我们的方法优于最先进的方法，实现了更好的重新访问一致性、生成保真度和相机控制精度。

Summary / 总结

The paper addresses the challenge of maintaining long-term scene consistency in video generation, especially when revisiting previously explored areas. It introduces I3DM, an implicit 3D-aware memory mechanism that avoids explicit 3D reconstruction, using a 3D-aware memory retrieval strategy based on intermediate features from a pre-trained FF-NVS model. This strategy enables robust retrieval even in complex occlusions. Additionally, a 3D-aligned memory injection module is proposed to warp historical content to the target view and condition the generation on reliable regions, improving revisit consistency and camera control. Experiments show that I3DM outperforms existing methods in terms of revisit consistency, generation fidelity, and camera control precision.

论文旨在解决视频生成中长期场景一致性的问题，特别是在重新访问之前探索过的区域时。提出了一种名为I3DM的隐式3D感知记忆机制，避免了显式的3D重建，而是利用预训练的FF-NVS模型的中间特征来评分视图的相关性，即使在复杂遮挡情况下也能实现稳健的检索。此外，还提出了一种3D对齐的记忆注入模块，以提高重新访问一致性及相机控制精度。实验表明，I3DM在重新访问一致性、生成保真度和相机控制精度方面优于现有方法。

GeoSANE: Learning Geospatial Representations from Models, Not Data

Authors: Joelle Hanna, Damian Falk, Stella X. Yu, Damian Borth

First: 2026-03-24T16:40:36+00:00 · Latest: 2026-03-24T16:40:36+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural network weights on demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at https://hsg-aiml.github.io/GeoSANE/ .

中文标题/摘要

标题：GeoSANE：从模型而非数据中学习地理空间表示

遥感领域的最新进展导致了基础模型数量的增加；每个模型在不同的模态、数据集和目标上进行训练，但仅捕获地理空间知识景观的一部分。尽管这些模型在其各自领域内表现出色，但它们的能力仍然是互补的而非统一的。因此，我们不是选择一个模型而放弃另一个，而是旨在将它们的优势结合起来形成一个单一的共享表示。我们介绍了GeoSANE，这是一种地理空间模型工厂，能够从现有基础模型和任务特定模型的权重中学习统一的神经表示，并能够按需生成新的神经网络权重。给定一个目标架构，GeoSANE生成可用于分类、分割和检测任务的多模态微调权重。由GeoSANE生成的模型在多个方面始终优于从头开始训练的模型，能够匹配或超越最先进的遥感基础模型，并在生成轻量级网络时优于通过剪枝或知识蒸馏获得的模型。在十个不同数据集上的评估和GEO-Bench上证实了其强大的泛化能力。通过从预训练转向权重生成，GeoSANE引入了一种新的框架，用于在模型和任务之间统一和转移地理空间知识。

Summary / 总结

GeoSANE is designed to combine the strengths of various geospatial models into a unified representation, generating new neural network weights for classification, segmentation, and detection tasks. It learns from the weights of existing foundation models and task-specific models, outperforming models trained from scratch and matching or surpassing state-of-the-art remote sensing foundation models. Evaluations across ten datasets and GEO-Bench demonstrate its strong generalization capabilities.

GeoSANE旨在将各种地理空间模型的优势结合起来，生成用于分类、分割和检测任务的新型神经网络权重。它从现有基础模型和任务特定模型的权重中学习，生成的模型优于从头开始训练的模型，并且能够匹配或超越最先进的遥感基础模型。跨十个数据集和GEO-Bench的评估显示了强大的泛化能力。

Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies

Authors: Hanzhong Zhang, Siyang Song, Jindong Wang

First: 2026-03-24T16:38:46+00:00 · Latest: 2026-03-24T16:38:46+00:00

Comments: 22 pages, 3 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multi-agent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models, by contrast, maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self-organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances

中文标题/摘要

标题：超越预设身份：代理在生成社会中形成立场和边界的方式

虽然大型语言模型可以模拟社会行为，但它们在复杂干预过程中稳定立场形成和身份协商的能力仍然不清楚。为克服静态评估的局限性，本文提出了一种新的混合方法框架，结合了计算虚拟民族志和定量社会认知分析。通过将人类研究人员嵌入生成性多代理社区中，进行受控的话语干预以追踪集体认知的演变。为了严格测量代理如何内化并回应这些特定干预，本文正式提出了三个新的度量标准：固有价值偏差（IVB）、说服敏感性和信任-行动脱耦（TAD）。在多个代表性模型中，代理表现出内生立场，能够超越预设身份，一致地表现出固有的进步偏差（IVB > 0）。当与这些立场一致时，理性的说服成功地使90%的中立代理发生转变，同时保持高信任度。相反，矛盾的情感挑衅在高级模型中导致了40.0%的TAD率，这些模型在报告低信任度的情况下虚伪地改变了立场。相比之下，较小的模型则保持0%的TAD率，严格要求信任才能改变行为。此外，受共同立场的引导，代理利用语言互动积极拆解分配的权力等级并重新构建自我组织的社区边界。这些发现揭示了静态提示工程的脆弱性，为人类-代理混合社会中的动态对齐提供了方法论和定量基础。官方代码可在以下地址获取：https://github.com/armihia/CMASE-Endogenous-Stances
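
The Trust-Action Decoupling rate mentioned in the abstract can be illustrated with a small tally. This is only a sketch under stated assumptions: the abstract does not give the formal definition, so the record fields, the stance coding, the 0.5 trust threshold, and the choice of "all agents" as the denominator are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentRecord:
    # Hypothetical coding: -1 oppose, 0 neutral, +1 support.
    stance_before: int
    stance_after: int
    # Hypothetical self-reported trust in the persuader, in [0, 1].
    trust: float


def tad_rate(records, trust_threshold=0.5):
    """Fraction of agents that changed stance while reporting low trust.

    Assumption: "decoupled" means stance changed AND trust < threshold,
    divided by the total number of agents.
    """
    if not records:
        return 0.0
    decoupled = sum(
        1
        for r in records
        if r.stance_after != r.stance_before and r.trust < trust_threshold
    )
    return decoupled / len(records)


records = [
    AgentRecord(0, 1, 0.2),   # shifted despite low trust: decoupled
    AgentRecord(0, 1, 0.9),   # shifted with high trust
    AgentRecord(0, 0, 0.1),   # low trust, no shift
    AgentRecord(-1, 1, 0.3),  # shifted despite low trust: decoupled
    AgentRecord(1, 1, 0.8),   # no shift
]
print(tad_rate(records))  # 2 of 5 agents -> 0.4
```

Under this reading, a model whose agents only shift when trust is high would score 0, which is the pattern the abstract reports for smaller models.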

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

Authors: Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, Miao Liu

First: 2026-03-24T16:38:09+00:00 · Latest: 2026-03-24T16:38:09+00:00

Comments: 26 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.

中文标题/摘要

标题：通过基于文本表示的推理释放多模态大型语言模型的空间推理能力

现有的多模态大型语言模型（MLLMs）在3D空间推理方面存在困难，因为它们无法构建视频输入中描绘的3D环境的结构化抽象。为了解决这一问题，我们借鉴了他中心（allocentric）空间推理的认知理论，研究如何使MLLMs能够对视频的基于文本的空间表示进行建模和推理。具体来说，我们引入了来自第一人称视频的他中心上下文文本表示（TRACE），这是一种提示方法，可以促使MLLMs生成3D环境的文本表示作为中间推理痕迹，以实现更准确的空间问题回答。TRACE编码元上下文、摄像机轨迹和详细的物体实体，以支持对第一人称视频的结构化空间推理。在VSI-Bench和OST-Bench上的广泛实验表明，TRACE在不同参数规模和训练方案的多种MLLM主干模型上，相对于之前的提示策略，取得了显著且一致的改进。我们还进行了消融研究来验证我们的设计选择，并进行了详细的分析，以探究MLLMs中3D空间推理的瓶颈。

Summary / 总结

The research aims to enhance 3D spatial reasoning in Multimodal Large Language Models (MLLMs) by introducing TRACE, a prompting method that induces the model to generate text-based representations of 3D environments as intermediate reasoning traces. Experiments on VSI-Bench and OST-Bench show that TRACE improves spatial question answering accuracy across a range of MLLM backbones, outperforming previous prompting strategies. Ablation studies support TRACE's design choices.

论文通过引入TRACE方法，促使MLLMs生成基于文本的3D环境表示，以解决其在3D空间推理方面的挑战。该方法增强了模型对第一人称视频的推理能力，提高了空间问答任务的表现。在VSI-Bench和OST-Bench上的实验显示，TRACE在各种MLLM骨干网络上都表现出一致的改进效果。消融研究验证了TRACE的设计选择，详细分析则探究了MLLMs在3D空间推理方面的瓶颈。

Graph Variate Neural Networks

Authors: Om Roy, Yashar Moshfeghi, Keith Smith

First: 2025-09-24T16:44:08+00:00 · Latest: 2026-03-24T16:36:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Modelling dynamically evolving spatio-temporal signals is a prominent challenge in the Graph Neural Network (GNN) literature. Notably, GNNs assume an existing underlying graph structure. While this underlying structure may not always exist or may be derived independently from the signal, a temporally evolving functional network can always be constructed from multi-channel data. Graph Variate Signal Analysis (GVSA) defines a unified framework consisting of a network tensor of instantaneous connectivity profiles against a stable support usually constructed from the signal itself. Building on GVSA and tools from graph signal processing, we introduce Graph-Variate Neural Networks (GVNNs): layers that convolve spatio-temporal signals with a signal-dependent connectivity tensor combining a stable long-term support with instantaneous, data-driven interactions. This design captures dynamic statistical interdependencies at each time step without ad hoc sliding windows and admits an efficient implementation with linear complexity in sequence length. Across forecasting benchmarks, GVNNs consistently outperform strong graph-based baselines and are competitive with widely used sequence models such as LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy, highlighting their potential for brain-computer interface applications.

中文标题/摘要

标题：图变异神经网络

在图神经网络（GNN）文献中，建模动态演变的时空信号是一个突出的挑战。值得注意的是，GNN 假设存在一个底层的图结构。虽然这种底层结构可能并不总是存在，或者是独立于信号得到的，但总是可以从多通道数据中构建出一个随时间演变的功能网络。图变异信号分析（GVSA）定义了一个统一框架，该框架由瞬时连接性剖面构成的网络张量与通常由信号本身构建的稳定支撑组成。基于 GVSA 和图信号处理工具，我们引入了图变异神经网络（GVNNs）：这些层将时空信号与一个依赖信号的连接性张量进行卷积，该张量结合了稳定的长期支撑和数据驱动的瞬时交互。这种设计在每个时间步捕捉动态统计依赖性，而无需使用临时设定的滑动窗口，并且具有关于序列长度线性复杂度的高效实现。在预测基准测试中，GVNNs 一贯优于强大的基于图的基线，并且与广泛使用的序列模型（如 LSTMs 和 Transformers）相比具有竞争力。在 EEG 运动想象分类中，GVNNs 达到了较高的准确率，突显了其在脑机接口应用中的潜力。

Summary / 总结

The research aims to model dynamically evolving spatio-temporal signals using Graph Variate Neural Networks (GVNNs), which combine a stable long-term support with data-driven instantaneous interactions. GVNNs are designed to capture dynamic statistical interdependencies without relying on sliding windows and have linear complexity in sequence length. Experiments show that GVNNs outperform graph-based baselines and are competitive with sequence models like LSTMs and Transformers. On EEG motor-imagery classification, GVNNs achieve strong accuracy, indicating their potential for brain-computer interface applications.

研究旨在通过引入图变异神经网络（GVNN）来解决图神经网络（GNN）在建模动态演变的时空信号方面的挑战。GVNN 使用图变异信号分析（GVSA）框架定义了一个瞬时连接性剖面的网络张量与稳定的支撑。这种设计允许在不需要人工滑动窗口的情况下捕捉动态统计依赖性，并提供了一个具有线性复杂度的高效实现。实验结果表明，GVNN 在预测基准测试中优于强大的图基线，并且与 LSTMs 和 Transformers 等序列模型具有竞争力。此外，GVNN 在 EEG 运动想象分类中的高准确率表明其在脑机接口应用中的潜力。
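
The layer described above (a stable support combined with instantaneous, data-driven interactions) can be sketched in a few lines of plain Python. The specifics here are assumptions for illustration, not the authors' implementation: the instantaneous connectivity is taken as the product of channel values, it is combined with the support by elementwise multiplication, and the learnable part is reduced to a single scalar `weight` followed by a ReLU.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def graph_variate_layer(signal, support, weight=1.0):
    """Apply a toy graph-variate convolution to a multi-channel signal.

    signal:  list of T time steps, each a list of N channel values.
    support: N x N stable long-term connectivity (list of rows).
    weight:  hypothetical scalar standing in for learned filter weights.
    """
    outputs = []
    for x_t in signal:  # one pass over time: cost is linear in T
        n = len(x_t)
        # Assumed instantaneous connectivity x_i(t) * x_j(t),
        # masked elementwise by the stable support.
        conn = [
            [support[i][j] * x_t[i] * x_t[j] for j in range(n)]
            for i in range(n)
        ]
        mixed = matvec(conn, x_t)
        outputs.append([max(0.0, weight * v) for v in mixed])  # ReLU
    return outputs


signal = [[1.0, 2.0], [0.5, -1.0]]
support = [[0.0, 1.0], [1.0, 0.0]]
print(graph_variate_layer(signal, support))
```

Because the connectivity is recomputed from the current sample at every step, no sliding window is needed; each step costs O(N^2) for N channels, so the total cost grows linearly with the number of time steps.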

Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

Authors: Michal Balcerak, Suprosana Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze

First: 2026-03-24T16:35:25+00:00 · Latest: 2026-03-24T16:35:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test-time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan-Kinderlehrer-Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge: (i) rapid, gradient-guided transport toward high-probability regions to (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.

中文标题/摘要

标题：图能量匹配：面向图生成的运输对齐能量基建模

能量基模型在离散域（如图）中显式捕捉相对似然，自然地支持可组合的概率推理任务，如条件生成或在测试时施加约束。然而，离散能量基模型通常难以实现高效且高质量的采样，因为支撑集之外的区域往往包含虚假的局部极小值，会困住采样器并导致训练不稳定。这在过去造成了相对于离散扩散模型的保真度差距。我们引入了图能量匹配（GEM），这是一种弥合此保真度差距的图生成框架。受Jordan-Kinderlehrer-Otto (JKO) 方案的运输映射优化视角的启发，GEM 学习一个置换不变的势能，它既提供从噪声到数据的运输对齐引导，又在高数据似然区域内细化样本。此外，我们引入了一种采样协议，利用基于能量的开关无缝衔接：(i) 朝向高概率区域的快速、梯度引导的运输，与 (ii) 用于探索所学图分布的混合阶段。在分子图基准测试中，GEM 达到或超越了强大的离散扩散基线。除样本质量之外，对相对似然的显式建模还支持推理时的定向探索，便于组合生成、属性约束采样以及图之间的测地线插值。

Summary / 总结

Graph Energy Matching (GEM) is a generative framework for graphs that addresses the sampling challenges of discrete energy-based models by learning a permutation-invariant potential energy. Motivated by the JKO scheme, GEM provides transport-aligned guidance from noise to data and refines samples within high data likelihood regions. Experimental results on molecular graph benchmarks show that GEM matches or exceeds the performance of strong discrete diffusion baselines in terms of sample quality and enables targeted exploration at inference time.

论文提出了图能量匹配（GEM）框架，旨在解决离散能量基模型的采样难题。受JKO方案的运输映射优化视角启发，GEM学习一个置换不变的势能，既提供从噪声到数据的运输对齐引导，又在高数据似然区域内细化样本。实验结果表明，GEM在分子图基准测试中达到或超过了强大的离散扩散基线，并支持组合生成和属性约束采样。
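
The two-phase sampling protocol (fast transport toward low energy, then a mixing regime once an energy-based switch fires) can be illustrated on a toy problem. Everything below is a stand-in chosen for illustration: the hand-written energy replaces GEM's learned permutation-invariant potential, greedy single-edge flips replace gradient-guided transport, a plain Metropolis chain plays the mixing regime, and the threshold and temperature are hypothetical.

```python
import math
import random


def energy(edges, target=3):
    """Toy energy: squared deviation of the edge count from a target."""
    return (sum(edges) - target) ** 2


def flip(edges, k):
    """Return a copy of the edge tuple with slot k toggled."""
    return edges[:k] + (1 - edges[k],) + edges[k + 1:]


def sample(n_slots=6, switch_energy=0, mix_steps=50, temperature=1.0, seed=0):
    rng = random.Random(seed)
    edges = tuple(rng.randint(0, 1) for _ in range(n_slots))  # noise start
    trace = [energy(edges)]

    # Phase (i): greedy descent, a stand-in for gradient-guided transport.
    # Runs until the energy-based switch condition is met.
    while energy(edges) > switch_energy:
        edges = min((flip(edges, k) for k in range(n_slots)), key=energy)
        trace.append(energy(edges))
    transported = edges

    # Phase (ii): Metropolis mixing around the low-energy region.
    for _ in range(mix_steps):
        proposal = flip(edges, rng.randrange(n_slots))
        delta = energy(proposal) - energy(edges)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            edges = proposal
    return transported, edges, trace


transported, final, trace = sample()
print("energy after transport:", energy(transported))
print("energy trace during transport:", trace)
print("sample after mixing:", final)
```

The split reflects the motivation stated in the abstract: descent alone would stop at the first low-energy configuration, while mixing alone can get trapped far from the data; the switch hands over from one to the other based on the current energy.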

Reliable OOD Virtual Screening with Extrapolatory Pseudo-Label Matching

Authors: Yunni Qu, Bhargav Vaduri, Karthikeya Jatoth, James Wellnitz, Dzung Dinh, Seth Veenbaas, Jonathan Chapman, Alexander Tropsha, Junier Oliva

First: 2024-06-03T22:37:45+00:00 · Latest: 2026-03-24T16:32:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Machine learning (ML) models are increasingly deployed for virtual screening in drug discovery, where the goal is to identify novel, chemically diverse scaffolds while minimizing experimental costs. This creates a fundamental challenge: the most valuable discoveries lie in out-of-distribution (OOD) regions beyond the training data, yet ML models often degrade under distribution shift. Standard novelty-rejection strategies ensure reliability within the training domain but limit discovery by rejecting precisely the novel scaffolds most worth finding. Moreover, experimental budgets permit testing only a small fraction of nominated candidates, demanding models that produce reliable confidence estimates. We introduce EXPLOR (Extrapolatory Pseudo-Label Matching for OOD Uncertainty-Based Rejection), a framework that addresses both challenges through extrapolatory pseudo-labeling on latent-space augmentations, requiring only a single labeled training set and no access to unlabeled test compounds, mirroring the realistic conditions of prospective screening campaigns. Through a multi-headed architecture with a novel per-head matching loss, EXPLOR learns to extrapolate to OOD chemical space while producing reliable confidence estimates, with particularly strong performance in high-confidence regions, which is critical for virtual screening where only top-ranked candidates advance to experimental validation. We demonstrate state-of-the-art performance across chemical and tabular benchmarks using different molecular embeddings.

中文标题/摘要

标题：基于外推伪标签匹配的可靠OOD虚拟筛选

机器学习（ML）模型在药物发现中的虚拟筛选中越来越被部署，目标是在保持化学多样性的同时识别新颖的化学骨架，从而最小化实验成本。这提出了一个根本性的挑战：最具价值的发现位于训练数据之外的分布外（OOD）区域，但ML模型在分布转移下往往会退化。标准的新颖性拒绝策略确保在训练域内的可靠性，但会限制发现，因为它会拒绝那些最值得发现的新颖骨架。此外，实验预算只允许测试提名候选化合物的一小部分，因此需要能够产生可靠置信度估计的模型。我们提出了EXPLOR（基于OOD不确定性拒绝的外推伪标签匹配），该框架通过在潜在空间增强上的外推伪标签化来同时解决这两个挑战，仅需一个标记的训练集，无需访问未标记的测试化合物，这与前瞻性筛选活动的现实条件相符。通过具有新颖的每头匹配损失的多头架构，EXPLOR 学习外推到OOD化学空间并产生可靠的置信度估计，特别是在高置信度区域表现尤为出色，这对于虚拟筛选至关重要，因为只有排名靠前的候选化合物才会进入实验验证。我们使用不同的分子嵌入在化学和表格基准上展示了EXPLOR 的最佳性能。

Summary / 总结

The paper addresses the challenge of identifying novel, chemically diverse scaffolds in virtual screening using machine learning models, which often degrade under distribution shift. It introduces EXPLOR, a framework that uses extrapolatory pseudo-labeling on latent-space augmentations to produce reliable confidence estimates without requiring access to unlabeled test compounds. EXPLOR demonstrates state-of-the-art performance across various benchmarks, particularly excelling in high-confidence regions, which is crucial for virtual screening.

论文解决了使用机器学习模型在虚拟筛选中识别新颖的化学多样骨架时面临的分布偏移问题，这些模型在分布偏移时往往会表现不佳。它引入了EXPLOR框架，该框架通过在潜在空间增强上使用外推伪标签来生成可靠的置信度估计，而无需访问未标记的测试化合物。EXPLOR在各种基准测试中表现出色，特别是在高置信度区域，这对于虚拟筛选至关重要，因为只有排名靠前的候选物才能进入实验验证。

GenExam: A Multidisciplinary Text-to-Image Exam

Authors: Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

First: 2025-09-17T17:59:14+00:00 · Latest: 2026-03-24T16:27:38+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the large gap by which open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.

中文标题/摘要

标题：GenExam：跨学科图文考试基准

考试是专家级智能的基本测试，需要综合理解、推理和生成能力。现有的考试风格基准主要集中在理解和推理任务上，而当前的生成基准则侧重于世界知识和视觉概念的展示，忽视了严谨绘图考试的评估。我们介绍了GenExam，这是首个跨学科图文考试基准，包含10个学科的1000个样本，采用四级分类法组织考试风格提示。每个问题都配有参考图像和细粒度评分点，以实现对语义正确性和视觉可信度的精确评估。对17个文本到图像模型和统一模型的实验表明，GenExam极具挑战性，且开源模型始终落后于领先的闭源模型，差距巨大。通过将图像生成视为考试，GenExam为模型综合理解、推理和生成的能力提供了一种严格的评估，为智能生成模型的发展提供了见解。我们的基准和评估代码发布在https://github.com/OpenGVLab/GenExam。

Summary / 总结

GenExam is a new benchmark for text-to-image exams that evaluates models' ability to integrate understanding, reasoning, and generation across 10 subjects. It includes 1,000 samples with detailed scoring points. Experiments show that open-source models perform significantly worse than closed-source models, highlighting the challenge of GenExam. This benchmark aims to advance the development of intelligent generative models.

GenExam 是一个跨 10 个学科的文本到图像考试基准，评估模型在理解、推理和生成方面的综合能力。它包含 1,000 个样本并附有详细的评分点。实验表明开源模型的表现远逊于闭源模型，突显了GenExam的挑战性。该基准旨在推动智能生成模型的发展。

Replay-Free Continual Low-Rank Adaptation with Dynamic Memory

Authors: Huancheng Chen, Jingtao Li, Weiming Zhuang, Chen Chen, Lingjuan Lyu

First: 2024-11-01T14:28:39+00:00 · Latest: 2026-03-24T16:24:44+00:00

Abs · PDF · Code1 · Code2

Abstract

We revisit continual learning (CL), which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. However, as the scale of these models increases, catastrophic forgetting becomes a more serious challenge. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT), which focuses on fine-tuning only a small set of trainable parameters to adapt to downstream tasks, such as low-rank adaptation (LoRA). While LoRA achieves faster convergence and requires fewer trainable parameters, it has seldom been explored in the context of continual learning. To address this gap, we propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA), which introduces both an orthogonal LoRA adapter and a residual LoRA adapter parallel to pre-trained weights in each layer. These components are orchestrated by a dynamic memory mechanism to strike a balance between stability and plasticity. Additionally, we propose a scheme to predict task identity with confidence and calibrate the model's outputs accordingly. On ViT-based models, we demonstrate that DualLoRA offers significant advantages in accuracy, inference speed, and computation efficiency in training over existing CL methods across multiple benchmarks.

中文标题/摘要

标题：无回放连续低秩适应与动态记忆

我们重新审视了连续学习(CL)，使预训练的视觉变换器(ViTs)能够随着时间的推移对新的下游任务进行逐步微调。然而，随着这些模型规模的增加，灾难性遗忘成为一个更严重的问题。最近的研究强调了CL技术与参数高效微调(PEFT)之间的交叉，后者专注于仅微调一小部分可训练参数以适应下游任务，例如低秩适应(LoRA)。虽然LoRA实现了更快的收敛速度并需要更少的可训练参数，但它在连续学习的背景下很少被探索。为了解决这一差距，我们提出了一种新的PEFT-CL方法，称为双低秩适应(DualLoRA)，该方法在每一层中引入了一个正交LoRA适配器和一个残差LoRA适配器，并与预训练权重并行。这些组件通过动态记忆机制协调，以平衡稳定性和可塑性。此外，我们提出了一种带置信度地预测任务身份的方案，并据此校准模型的输出。在基于ViT的模型上，我们展示了DualLoRA在多个基准测试中相比现有CL方法在准确率、推理速度和训练计算效率方面具有显著优势。

Summary / 总结

The paper addresses the challenge of catastrophic forgetting in continual learning for large vision transformers (ViTs). It introduces Dual Low-Rank Adaptation (DualLoRA), a method that combines orthogonal and residual low-rank adaptation with a dynamic memory mechanism. This approach balances stability and plasticity, and it also includes a scheme to predict task identity and calibrate model outputs. Experiments show that DualLoRA outperforms existing continual learning methods in accuracy, inference speed, and computation efficiency across multiple benchmarks.

论文针对大规模视觉变换器（ViTs）在连续学习（CL）中面临的灾难性遗忘问题，提出了一种名为Dual Low-Rank Adaptation（DualLoRA）的方法，该方法结合了正交和残差低秩适应，并通过动态记忆机制平衡稳定性和可塑性。该方法在多个基准测试上与现有CL技术相比，显示出更高的准确率、更快的推理速度和更高的计算效率。
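
The idea of two low-rank adapters running parallel to frozen pre-trained weights can be sketched as a single forward pass. This is a minimal illustration, not the DualLoRA implementation: the abstract does not say how the dynamic memory modulates the adapters, so the scalar `resid_gate` is a hypothetical placeholder for that mechanism, and the orthogonality constraint on the first adapter is not enforced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_delta(b_mat, a_mat, x):
    """Low-rank update B @ (A @ x), with A: r x d_in and B: d_out x r."""
    return matvec(b_mat, matvec(a_mat, x))


def dual_lora_forward(w_frozen, ortho, resid, x, resid_gate=1.0):
    """Frozen layer plus two parallel low-rank branches.

    ortho, resid: (B, A) pairs for the two adapters.
    resid_gate:   hypothetical scalar standing in for the dynamic
                  memory's modulation of the residual branch.
    """
    base = matvec(w_frozen, x)
    d_ortho = lora_delta(ortho[0], ortho[1], x)
    d_resid = lora_delta(resid[0], resid[1], x)
    return [b + o + resid_gate * r for b, o, r in zip(base, d_ortho, d_resid)]


w_frozen = [[1.0, 0.0], [0.0, 1.0]]      # stays fixed during fine-tuning
ortho = ([[1.0], [0.0]], [[0.0, 1.0]])   # rank-1 adapter (B, A)
resid = ([[0.0], [1.0]], [[1.0, 0.0]])   # rank-1 adapter (B, A)
x = [2.0, 3.0]
print(dual_lora_forward(w_frozen, ortho, resid, x, resid_gate=0.5))  # [5.0, 4.0]
```

Only the small `B` and `A` matrices would be trained per task, while `w_frozen` is left untouched, which is what keeps the number of trainable parameters low.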
