arXiv 论文速递

2026-03-16 03:38
Snapshot: 20260316_0338
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
Venue: CVPR 2026
First: 2026-03-12T17:59:59+00:00 · Latest: 2026-03-12T17:59:59+00:00
Comments: Accepted by CVPR 2026. Project page: https://silentview.github.io/EVATok/
Abstract
Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
中文标题/摘要
标题:EVATok:面向高效视觉自回归生成的自适应长度视频分词
自回归(AR)视频生成模型依赖于将像素压缩为离散的分词序列的视频分词器。这些分词序列的长度对于平衡重建质量与下游生成计算成本至关重要。传统的视频分词器在不同视频的时间块中应用统一的分词分配,通常在简单、静态或重复的段落上浪费分词,而在动态或复杂的段落上则分配不足。为了解决这种低效率,我们引入了**EVATok**,一种生成**E**fficient **V**ideo **A**daptive **Tok**enizers的框架。我们的框架为每个视频估计最佳的分词分配以实现最佳的质量-成本权衡,开发了轻量级路由器以快速预测这些最佳分配,并训练根据路由器预测的分配对视频进行编码的自适应分词器。我们证明EVATok在视频重建和下游AR生成方面实现了显著的效率和整体质量提升。借助融合了视频语义编码器的先进训练方案,EVATok在UCF-101上实现了更优的重建效果和最先进的类别到视频生成,与先前最先进的LARP和我们的固定长度基线相比,平均分词使用量至少节省24.4%。
Summary / 总结
EVATok is a framework for building efficient, adaptive-length video tokenizers for autoregressive video generation. It estimates the optimal token assignment for each video to balance reconstruction quality against computational cost, trains lightweight routers to predict these assignments quickly, and trains adaptive tokenizers that encode videos according to the routers' predictions. On UCF-101, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation while saving at least 24.4% in average token usage compared with the prior state-of-the-art LARP and the authors' fixed-length baseline.
EVATok 是一种框架,旨在提高视频分词在自回归视频生成中的效率。它为每个视频估计最优的分词分配,以平衡质量和计算成本,使用轻量级路由器进行快速预测,并训练自适应分词器。实验表明,与之前的方法相比,EVATok 至少节省了 24.4% 的平均分词使用量,同时在 UCF-101 上实现了更优的重建和最先进的类别到视频生成。
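The quality-cost trade-off behind adaptive token assignment can be illustrated with a small sketch. Everything here is an assumption made for illustration: the budget set `BUDGETS`, the per-block quality scores, and the scoring rule `quality - lam * tokens` are hypothetical and are not EVATok's actual estimator or router.

```python
from itertools import product

# Hypothetical token budgets allowed for each temporal block.
BUDGETS = (64, 128, 256)

def best_assignment(block_quality, lam):
    """Pick one budget per block, maximizing quality - lam * tokens.

    block_quality[i][b] is an assumed reconstruction-quality score for
    block i when it is given b tokens (higher is better).
    """
    best, best_score = None, float("-inf")
    for assignment in product(BUDGETS, repeat=len(block_quality)):
        quality = sum(block_quality[i][b] for i, b in enumerate(assignment))
        score = quality - lam * sum(assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best

# A static block gains little from extra tokens; a dynamic block gains a lot.
static_block = {64: 0.90, 128: 0.91, 256: 0.92}
dynamic_block = {64: 0.50, 128: 0.75, 256: 0.90}
assignment = best_assignment([static_block, dynamic_block], lam=0.001)

# Token saving relative to giving every block the largest budget.
uniform_tokens = 2 * 256
saving = 1 - sum(assignment) / uniform_tokens
```

In this toy setting the static block receives the smallest budget and the dynamic block the largest, which is the behavior the abstract attributes to adaptive tokenizers. The exhaustive search is only practical for a handful of blocks; the abstract's lightweight routers exist precisely to avoid such a search at inference time.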
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
Venue: MM
First: 2026-03-12T17:59:56+00:00 · Latest: 2026-03-12T17:59:56+00:00
Comments: Project Page: https://accio-lab.github.io/MM-CondChain
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
中文标题/摘要
标题:MM-CondChain:一个程序验证基准,用于视觉接地的深度组合推理
多模态大型语言模型(MLLMs)越来越多地用于执行视觉工作流,如导航GUI,其中下一步依赖于经过验证的视觉组合条件(例如,“如果出现权限对话框且界面颜色为绿色,则点击允许”),并且过程可能会分支或提前终止。然而,这种能力仍处于评估不足的状态:现有的基准测试主要关注浅组合或独立约束,而不是深度链式组合条件。在本文中,我们引入了MM-CondChain,一个用于视觉接地的深度组合推理基准。每个基准实例组织为多层推理链,其中每一层包含基于视觉证据的非平凡组合条件,并由多个对象、属性或关系构建。为了正确回答,MLLM必须详细地感知图像,在每一步上推理多个视觉元素,并遵循由此产生的执行路径到最终结果。为了大规模构建此类工作流数据,我们提出了一种代理合成流水线:规划者协调逐层生成组合条件,而可验证的程序化中间表示(VPIR)确保每一层的条件是机械可验证的。然后,合成器将这些验证过的层组装成完整的指令。使用此流水线,我们在三个视觉领域构建了基准测试:自然图像、数据图表和GUI轨迹。在一系列MLLM上的实验表明,即使是最强大的模型也只能达到53.33路径F1,并且在困难负样本上以及随着深度或谓词复杂性的增加,性能急剧下降,证实了深度组合推理仍然是一个基本挑战。
Summary / 总结
The research introduces MM-CondChain, a benchmark for evaluating visually grounded deep compositional reasoning in multimodal large language models (MLLMs). Each instance is a multi-layer reasoning chain whose layers contain compositional conditions grounded in visual evidence, requiring detailed image perception and reasoning over multiple visual elements. An agentic synthesis pipeline with a Planner, a Verifiable Programmatic Intermediate Representation (VPIR), and a Composer generates and verifies these conditions. Experiments show that even the strongest MLLM attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows.
MM-CondChain 是一个用于评估多模态大型语言模型(MLLMs)在视觉接地深度组合推理能力的基准。它由多层推理链组成,每层涉及复杂的组合条件并基于视觉证据。该基准使用一个包含规划者、机械可验证的程序中间表示(VPIR)和组装者(Composer)的生成流水线进行构建。实验表明,即使最强的 MLLMs 也只能达到 53.33 的 Path F1,表明深度组合推理仍然是一个基本挑战。
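A chained conditional with early termination, as described in the abstract, can be sketched as follows. The scene dictionary, the predicates, and the outcome strings are hypothetical; this is not the benchmark's VPIR format, only an illustration of why each layer's condition can be checked mechanically.

```python
def follow_chain(scene, layers):
    """Evaluate conditions layer by layer; stop at the first failing layer.

    Each layer is a pair (predicate, outcome_if_false). Returns the list of
    visited layer indices and the final outcome.
    """
    path = []
    for index, (predicate, early_exit) in enumerate(layers):
        path.append(index)
        if not predicate(scene):
            return path, early_exit
    return path, "complete"

# Hypothetical structured description of a GUI screenshot.
scene = {
    "dialog": "permission",
    "interface_color": "green",
    "buttons": ["Allow", "Deny"],
}

# Each condition combines several attributes, so one wrong perception
# anywhere in the chain changes the execution path.
layers = [
    (lambda s: s["dialog"] == "permission" and s["interface_color"] == "green",
     "stop: no matching dialog"),
    (lambda s: "Allow" in s["buttons"] and len(s["buttons"]) > 2,
     "stop: unexpected buttons"),
    (lambda s: True, "unreachable"),
]

path, outcome = follow_chain(scene, layers)
```

Here the second layer fails because only two buttons are present, so the chain terminates early after visiting layers 0 and 1. A path-level metric would compare such a visited path and outcome against the ground truth.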
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
First: 2026-03-12T17:59:55+00:00 · Latest: 2026-03-12T17:59:55+00:00
Comments: Technical Report. Project Page: https://go2heart.github.io/omnistream/
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
中文标题/摘要
标题:OmniStream:掌握连续流中的感知、重建和行动
现代视觉代理需要能够在实时流媒体环境中运行的一般性、因果性和物理结构化的表示。然而,当前的视觉基础模型仍然支离破碎,专门化地专注于图像语义感知、离线时间建模或空间几何。本文介绍了OmniStream,这是一种统一的流媒体视觉骨干,能够有效地从多种视觉输入中进行感知、重建和行动。通过结合因果时空注意力和三维旋转位置嵌入(3D-RoPE),我们的模型支持通过持久的KV缓存以帧为单位进行高效的在线视频流处理。我们使用一种协同多任务框架对OmniStream进行预训练,该框架结合了静态和时间表示学习、流媒体几何重建和视觉-语言对齐,使用了29个数据集。广泛的评估表明,即使在严格冻结骨干的情况下,OmniStream在图像和视频探测、流媒体几何重建、复杂视频和空间推理以及机器人操作(未在训练中出现)方面也能够实现与专门专家一致的竞争性表现。我们的工作不是追求特定基准的主导地位,而是展示了训练一个单一的、多功能的视觉骨干以在语义、空间和时间推理方面进行泛化的可行性,即朝着通用视觉理解的交互和实体代理迈出了一步。
Summary / 总结
OmniStream is designed to handle real-time streaming environments by integrating perception, reconstruction, and action capabilities. It uses causal spatiotemporal attention and 3D rotary positional embeddings to support efficient frame-by-frame processing. Pre-trained on 29 datasets with a multi-task framework, OmniStream performs competitively across various tasks including image and video probing, geometric reconstruction, and robotic manipulation, even with a frozen backbone.
OmniStream旨在处理实时流媒体环境,整合了感知、重建和行动的能力。它使用因果时空注意力和三维旋转位置嵌入来支持高效的逐帧处理。通过多任务框架预训练,OmniStream在各种视觉任务中表现出色,包括图像和视频探查、几何重建和机器人操作,即使在冻结主干网络的情况下也是如此。这项工作表明,一个统一的视觉骨干网络可能能够跨语义、空间和时间推理进行泛化,适用于交互和具身代理。
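Frame-by-frame processing with a persistent KV-cache can be sketched with a single attention head in NumPy. This is a minimal illustration under stated assumptions: the class `StreamingAttention`, the random projection weights, and the frame-level causal mask are hypothetical, and 3D-RoPE is omitted entirely. The sketch only shows that caching past keys and values reproduces offline causal attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class StreamingAttention:
    """Single-head attention that keeps keys and values of past frames."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq, self.wk, self.wv = (
            rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)
        )
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def step(self, frame_tokens):
        """Process one frame, attending to the cache plus the current frame."""
        q = frame_tokens @ self.wq
        self.keys = np.vstack([self.keys, frame_tokens @ self.wk])
        self.values = np.vstack([self.values, frame_tokens @ self.wv])
        scores = q @ self.keys.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.values

def offline_block_causal(frames, layer):
    """Reference: all frames at once with a frame-level causal mask."""
    tokens = np.vstack(frames)
    frame_id = np.repeat(np.arange(len(frames)), [len(f) for f in frames])
    q, k, v = tokens @ layer.wq, tokens @ layer.wk, tokens @ layer.wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # A query may not look at keys from later frames.
    scores[frame_id[:, None] < frame_id[None, :]] = -np.inf
    return softmax(scores) @ v

rng = np.random.default_rng(1)
frames = [rng.normal(size=(4, 8)) for _ in range(3)]  # 3 frames, 4 tokens each
layer = StreamingAttention(dim=8)
streamed = np.vstack([layer.step(f) for f in frames])
reference = offline_block_causal(frames, layer)
```

The streamed outputs match the offline reference, while each `step` call only projects the tokens of the newest frame. The cache in this sketch grows without bound; a real system would need some policy for bounding it.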
GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
First: 2026-03-12T17:59:52+00:00 · Latest: 2026-03-12T17:59:52+00:00
Comments: 49 pages, 23 figures, 10 tables; Project Page: https://grade-bench.github.io/, Code: https://github.com/VisionXLab/GRADE, Dataset: https://huggingface.co/datasets/VisionXLab/GRADE
Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
中文标题/摘要
标题:GRADE:基于学科知识的图像编辑推理基准测试
统一的多模态模型旨在实现联合理解、推理和生成,但当前的图像编辑基准主要局限于自然图像和浅层常识推理,对在结构化、领域特定约束下的这种能力评估有限。在本文中,我们引入了GRADE,这是首个评估学科知识和推理能力的图像编辑基准。GRADE 包含了来自10个学术领域的520个精心挑选的样本,涵盖了自然科学和社会科学。为了支持严格的评估,我们提出了一种多维度评估协议,联合评估学科推理、视觉一致性以及逻辑可读性。在20个最先进的开源和闭源模型上的广泛实验揭示了当前模型在隐含、知识密集型编辑设置下的显著局限性,导致了巨大的性能差距。除了定量评分外,我们还进行了严格的分析和消融实验,以揭示模型的不足之处并识别学科编辑中的约束。总之,GRADE 指出了统一多模态模型未来发展的关键方向,推动了学科导向的图像编辑和推理研究。我们的基准和评估代码已公开发布。
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
First: 2026-03-12T17:59:51+00:00 · Latest: 2026-03-12T17:59:51+00:00
Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
中文标题/摘要
标题:视频流思维:视频LLMs可以边看边思考
在线视频大型语言模型(VideoLLMs)在支持响应式、实时交互中发挥关键作用。现有方法侧重于流式感知,缺乏同步逻辑推理流。然而,直接应用测试时缩放方法会导致不可接受的响应延迟。为解决这一权衡,我们提出了一种新的流式视频理解范式——视频流思维(VST)。它支持边看边思考机制,在流式传输过程中激活对新视频片段的推理。此设计通过在视频播放过程中分摊LLM推理延迟,提高了及时理解和连贯认知能力,同时保持实时响应性。此外,我们引入了一个全面的后训练流水线,结合了VST-SFT,结构性地将离线VideoLLM适应因果流式推理,以及VST-RL,通过多轮视频交互环境中的自我探索提供端到端改进。另外,我们设计了一个自动化的训练数据合成流水线,使用视频知识图谱生成高质量的流式问答对,并通过实体-关系支撑的流式推理链确保多证据推理和对视频流的持续关注。广泛评估显示,VST-7B在在线基准测试中表现强劲,例如在StreamingBench上得分为79.5%,在OVO-Bench上得分为59.3%。同时,VST在离线长格式或推理基准测试中保持竞争力。与Video-R1相比,VST响应速度快15.7倍,并在VideoHolmes上实现了+5.4%的改进,显示出更高的效率和在各种视频理解任务中的强大泛化能力。代码、数据和模型将在https://github.com/1ranGuan/VST/发布。
Summary / 总结
The research aims to enhance the real-time interaction capabilities of Video Large Language Models (VideoLLMs) by introducing Video Streaming Thinking (VST), which enables simultaneous video perception and logical reasoning. The method involves a post-training pipeline combining VST-SFT for structural adaptation and VST-RL for end-to-end improvement through self-exploration. Key findings show that VST-7B performs well on online benchmarks, scoring 79.5% on StreamingBench and 59.3% on OVO-Bench. Compared with Video-R1, it responds 15.7 times faster and achieves a 5.4% improvement on VideoHolmes, indicating higher efficiency and strong generalization across various tasks.
研究旨在通过解决视频大型语言模型(VideoLLMs)在流式处理过程中缺乏同步逻辑推理的问题,提高其实时交互能力。提出的Video Streaming Thinking (VST) 范式引入了同时进行视频感知和推理的机制,增强了及时理解和连贯性,同时保持实时响应性。全面的后训练管道,包括VST-SFT和VST-RL,用于将离线的VideoLLMs适应因果流式推理,并通过自我探索来提高性能。自动化的训练数据合成管道生成高质量的流式问答对,支持多证据推理和对视频流的持续关注。VST-7B在在线基准测试中表现出色;与Video-R1相比,其响应速度快15.7倍,并在VideoHolmes上提升了5.4%,显示出更高的效率和在各种视频理解任务中的强大泛化能力。
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan
First: 2026-03-12T17:59:12+00:00 · Latest: 2026-03-12T17:59:12+00:00
Comments: Project Page: https://dreamvideo-omni.github.io
Abstract
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
中文标题/摘要
标题:DreamVideo-Omni:通过潜在身份强化学习实现全方位运动控制的多主体视频定制
虽然大规模扩散模型已经革新了视频合成,但在实现对多主体身份和多粒度运动的精确控制方面仍面临重大挑战。近期尝试弥合这一差距的方法往往受到运动粒度有限、控制模糊和身份退化等问题的困扰,导致在身份保持和运动控制方面表现不佳。在本文中,我们提出了DreamVideo-Omni,这是一种统一框架,通过渐进的两阶段训练范式实现和谐的多主体定制和全方位运动控制。在第一阶段,我们整合了全面的控制信号进行联合训练,包括主体外观、全局运动、局部动态和摄像机运动。为了确保鲁棒性和精确的可控性,我们引入了一种条件感知的3D旋转位置嵌入来协调异构输入,并采用分层运动注入策略增强全局运动指导。此外,为了解决多主体的模糊性,我们引入了组和角色嵌入,以明确将运动信号锚定到特定身份,从而有效将复杂场景分解为独立可控实例。在第二阶段,为了减轻身份退化,我们设计了一种潜在身份奖励反馈学习范式,通过在预训练的视频扩散主干上训练潜在身份奖励模型,提供运动感知的身份奖励,在潜在空间中优先考虑与人类偏好一致的身份保持。
Summary / 总结
DreamVideo-Omni is a unified framework that enables precise control over multi-subject identity and motion through a two-stage training process. In the first stage, it integrates various control signals and introduces condition-aware 3D rotary positional embedding and hierarchical motion injection to enhance controllability. In the second stage, it uses a latent identity reward feedback learning paradigm to mitigate identity degradation. The framework demonstrates superior performance in generating high-quality videos with precise identity and motion control.
DreamVideo-Omni 是一个统一框架,通过两阶段训练过程实现对多主体身份和运动的精确控制。第一阶段整合了各种控制信号,并引入了条件感知的3D旋转位置嵌入和分层运动注入以增强可控性。第二阶段使用潜在身份奖励反馈学习范式来减轻身份退化。该框架在生成高质量视频方面表现出色,具有精确的身份和运动控制能力。
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan
First: 2026-03-12T17:58:58+00:00 · Latest: 2026-03-12T17:58:58+00:00
Comments: Project Page: https://liuff19.github.io/Spatial-TTT
Abstract
Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
中文标题/摘要
标题:Spatial-TTT:基于视觉的流式空间智能在测试时训练
人类通过一系列视觉观察来感知和理解现实空间。因此,能够从潜在无界视频流中流式地维护和更新空间证据对于空间智能至关重要。核心挑战不仅在于更长的上下文窗口,而在于如何在时间上选择、组织和保留空间信息。在本文中,我们提出了一种Spatial-TTT方法,以测试时训练(TTT)的方式实现基于视觉的流式空间智能,该方法通过调整参数子集(快速权重)来捕捉和组织长时场景视频中的空间证据。具体而言,我们设计了一种混合架构,并采用大块更新与滑动窗口注意力相结合的方式,以实现高效的空间视频处理。为了进一步增强空间意识,我们引入了一种应用于TTT层的基于空间预测机制,使用3D时空卷积鼓励模型在帧间捕捉几何对应关系和时间连续性。除了架构设计,我们还构建了一个包含密集3D空间描述的数据集,该数据集指导模型以结构化的方式更新其快速权重,以记忆和组织全局3D空间信号。广泛的实验表明,Spatial-TTT提高了长时空间理解能力,并在视频空间基准测试中达到了最先进的性能。项目页面:https://liuff19.github.io/Spatial-TTT.
Summary / 总结
The paper proposes Spatial-TTT, a method for streaming visual-based spatial intelligence with test-time training, to maintain and update spatial evidence from video streams. It uses a hybrid architecture with large-chunk updates and sliding-window attention to efficiently process spatial videos and introduces a spatial-predictive mechanism for better geometric and temporal understanding. Experiments show that Spatial-TTT enhances long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks.
研究旨在通过测试时训练(TTT)开发一种基于视觉的流式空间智能系统,以维护和更新来自视频流的空间证据。方法包括一种带有大块更新和滑动窗口注意力的混合架构,并引入了使用3D时空卷积的空间预测机制以增强空间意识。实验表明,Spatial-TTT 提高了长时空间理解能力,并在视频空间基准测试中达到了最先进的性能。
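The idea of adapting fast weights at test time, one large chunk at a time, can be sketched with a linear fast-weight matrix. This is a generic test-time-training illustration under assumed details: the function `ttt_stream`, the squared-error objective, the learning rate, and the synthetic data are hypothetical and do not reproduce the Spatial-TTT architecture or its spatial-predictive mechanism.

```python
import numpy as np

def ttt_stream(chunks, dim, lr=0.5):
    """Update a linear fast-weight matrix once per chunk (large-chunk update)."""
    fast_w = np.zeros((dim, dim))
    losses = []
    for keys, values in chunks:
        error = keys @ fast_w - values        # prediction error before the update
        losses.append(float((error ** 2).mean()))
        grad = keys.T @ error / len(keys)     # gradient of 0.5 * mean squared row error
        fast_w -= lr * grad
    return fast_w, losses

# Synthetic stream: every chunk follows the same hidden linear relation,
# standing in for evidence that stays consistent across a long video.
rng = np.random.default_rng(0)
w_true = rng.normal(size=(4, 4))
chunks = []
for _ in range(40):
    keys = rng.normal(size=(32, 4))
    chunks.append((keys, keys @ w_true))

fast_w, losses = ttt_stream(chunks, dim=4)
```

The recorded loss falls as more chunks are seen, which shows the fast weights accumulating information from the stream without any growth in context length. Only `fast_w` changes at test time; in a full model the remaining (slow) parameters would stay frozen.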
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
Venue: CVPR 2026
First: 2026-03-12T17:58:52+00:00 · Latest: 2026-03-12T17:58:52+00:00
Comments: CVPR 2026. Project page: https://autogaze.github.io/
Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
中文标题/摘要
标题:先注视,后注意:通过自回归凝视实现高效可扩展的视频理解
多模态大型语言模型(MLLMs)在通用视频理解方面取得了进展,但在处理长且高分辨率的视频时遇到困难:它们在视觉变换器(ViTs)或大型语言模型(LLMs)中同等地处理每个像素,尽管存在显著的时空冗余。我们引入了AutoGaze,这是一个轻量级模块,在ViT或MLLM处理之前去除冗余块。AutoGaze通过下一个标记预测和强化学习训练,自回归地选择一组最小的多尺度块,这些块可以在用户指定的误差阈值内重建视频,从而消除冗余并保留信息。实验表明,AutoGaze将视觉标记减少4-100倍,并将ViTs和MLLMs最高加速19倍,使MLLM能够扩展到1000帧4K分辨率的视频,并在视频基准测试中取得优异结果(例如,67.0%的VideoMME得分)。此外,我们引入了HLVid:第一个高分辨率、长形式的视频问答基准,包含5分钟4K分辨率的视频,其中使用AutoGaze扩展的MLLM比基线提高了10.1%,并比之前的最佳MLLM高出4.5%。项目页面:https://autogaze.github.io/
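Selecting just enough patches to reconstruct a frame within an error threshold can be sketched with a greedy rule. This is a stand-in, not AutoGaze itself: the abstract describes a learned, autoregressive, multi-scale selector, whereas the hypothetical `select_patches` below uses a single patch size and simply copies the worst-reconstructed patches from the new frame.

```python
import numpy as np

def select_patches(prev_recon, frame, patch, threshold):
    """Greedily copy the worst-reconstructed patches until the frame's
    mean squared error is at or below the threshold."""
    recon = prev_recon.copy()
    height, width = frame.shape
    coords = [(r, c) for r in range(0, height, patch)
              for c in range(0, width, patch)]

    def patch_error(rc):
        r, c = rc
        diff = frame[r:r + patch, c:c + patch] - recon[r:r + patch, c:c + patch]
        return float((diff ** 2).sum())

    chosen = []
    # Patches are disjoint, so errors computed up front stay valid.
    for r, c in sorted(coords, key=patch_error, reverse=True):
        if ((frame - recon) ** 2).mean() <= threshold:
            break
        recon[r:r + patch, c:c + patch] = frame[r:r + patch, c:c + patch]
        chosen.append((r, c))
    return chosen, recon

# Previous frame is blank; only the bottom-right quadrant changes.
prev_recon = np.zeros((8, 8))
frame = np.zeros((8, 8))
frame[4:, 4:] = 1.0

chosen, recon = select_patches(prev_recon, frame, patch=4, threshold=0.01)
```

Only one of the four patches is selected here, so a downstream encoder would see a quarter of the tokens while the frame is still reconstructed exactly. Raising `threshold` trades reconstruction fidelity for fewer selected patches.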
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
First: 2026-03-12T17:58:48+00:00 · Latest: 2026-03-12T17:58:48+00:00
Comments: 23 pages, 18 figures
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
中文标题/摘要
标题:EndoCoT:在扩散模型中扩展内生链式思考推理
最近,多模态大型语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来解决空间推理等复杂任务。然而,这种范式存在两个关键限制:(i) MLLMs的文本编码器表现出推理深度不足。单步编码无法激活链式思考过程,这对于MLLMs提供准确的复杂任务指导至关重要。(ii) 编码指导在解码过程中保持不变。解码过程中的不变指导阻止了DiT逐步分解复杂指令为可执行的去噪步骤,即使MLLM编码正确。为此,我们提出了内生链式思考(EndoCoT),这是一种新颖的框架,首先通过迭代思考指导模块逐步细化潜在思维状态,激活MLLMs的推理潜力,然后将这些状态与DiT的去噪过程联系起来。其次,应用终端思维接地模块,通过将最终状态与正确答案对齐,确保推理轨迹保持在文本监督中。通过这两个组件,MLLMs的文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在不同基准(如迷宫、TSP、VSP和数独)上的广泛评估实现了92.1%的平均准确率,比最强基线高出8.3个百分点。
Summary / 总结
The paper introduces EndoCoT, a framework that enhances the reasoning capabilities of Multimodal Large Language Models (MLLMs) in diffusion models by iteratively refining latent thought states and aligning them with ground-truth answers. This approach addresses the limitations of insufficient reasoning depth and invariant guidance during decoding, leading to improved performance on complex tasks such as Maze, TSP, VSP, and Sudoku, with an average accuracy of 92.1% and outperforming the strongest baseline by 8.3 percentage points.
本文针对多模态大型语言模型(MLLMs)在扩散框架中的不足,特别是其推理深度不足和解码过程中指导保持不变的问题,提出了EndoCoT框架。该框架通过迭代细化潜在思维状态并将最终状态与正确答案对齐,使扩散模型能够逐步执行推理。实验结果显示,在各种基准测试(如迷宫、TSP、VSP和数独)中的平均准确率为92.1%,比最强基线高出8.3个百分点。
DVD: Deterministic Video Depth Estimation with Generative Priors
Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen
First: 2026-03-12T17:58:06+00:00 · Latest: 2026-03-12T17:58:06+00:00
Comments: Project: https://dvd-project.github.io/
Abstract
Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
中文标题/摘要
标题:DVD:基于生成先验的确定性视频深度估计
现有的视频深度估计面临一个根本性的权衡:生成模型会遭受随机几何幻觉和尺度漂移的问题,而判别模型则需要大量的标注数据集来解决语义歧义。为打破这一僵局,我们提出了DVD,这是第一个将预训练的视频扩散模型确定性地改编为单次深度回归器的框架。具体来说,DVD 包含三个核心设计:(i) 将扩散时间步作为结构锚点,平衡全局稳定性和高频细节;(ii) 潜在流形矫正 (LMR) 以减轻回归引起的过度平滑,施加微分约束以恢复清晰边界和连贯运动;(iii) 全局仿射一致性,这是一种固有的属性,限制了窗口间差异,使得在无需复杂时间对齐的情况下即可无缝进行长视频推理。大量实验表明,DVD 在基准测试中实现了最先进的零样本性能。此外,DVD 成功地利用了视频基础模型中隐含的深刻几何先验,比领先基线少使用 163 倍的任务特定数据。值得注意的是,我们完全开源了我们的管道,提供了最先进的视频深度估计的完整训练套件,以造福开源社区。
Summary / 总结
DVD addresses the trade-off in video depth estimation by deterministically adapting pre-trained video diffusion models into single-pass depth regressors. It introduces three key designs: using the diffusion timestep as a structural anchor, latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, and global affine coherence, which bounds inter-window divergence and enables seamless long-video inference. DVD achieves state-of-the-art zero-shot performance across benchmarks while using 163x less task-specific data than leading baselines.
DVD通过将预训练的视频扩散模型确定性地改编为单次前向的深度回归器,来解决视频深度估计中的权衡问题。它引入了三个关键设计:使用扩散时间步作为结构锚点、使用潜在流形矫正(LMR)来减轻回归引起的过度平滑,以及全局仿射一致性以确保长视频推理的无缝衔接。实验表明,DVD在各基准上实现了最先进的零样本性能,且所用的任务特定数据比领先基线少163倍。
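If per-window depth predictions differ only by a global scale and shift, overlapping windows can be stitched with a two-parameter least-squares fit. This is one plausible reading of what global affine coherence makes possible, offered as an assumption: the function `align_window` and the synthetic data are hypothetical, and the abstract does not state that DVD stitches windows this way.

```python
import numpy as np

def align_window(reference_overlap, new_overlap, new_window):
    """Fit scale and shift on the overlapping frames, then apply them
    to every frame of the new window."""
    design = np.stack([new_overlap.ravel(), np.ones(new_overlap.size)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, reference_overlap.ravel(), rcond=None)
    return scale * new_window + shift

# Synthetic depth video: 6 frames of 2x2 pixels.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 5.0, size=(6, 2, 2))

window_a = true_depth[0:4]               # frames 0..3, taken as the reference
window_b = 0.5 * true_depth[2:6] - 0.2   # frames 2..5 with a different scale and shift

# Frames 2 and 3 appear in both windows and anchor the fit.
aligned_b = align_window(window_a[2:4], window_b[0:2], window_b)
```

After alignment, the second window agrees with the reference scale across all of its frames, including the two that the first window never saw. The fit needs no per-frame or per-pixel correction, which is the practical benefit when inter-window divergence is bounded by an affine relation.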
SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
First: 2026-03-12T17:57:52+00:00 · Latest: 2026-03-12T17:57:52+00:00
Abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
中文标题/摘要
标题:SciMDR:科学多模态文档推理基准测试与推进
构建用于基础模型训练的科学多模态文档推理数据集涉及规模、忠实度和现实性之间的固有权衡。为解决这一挑战,我们引入了合成和再嵌入框架,这是一个两阶段流水线,包括:(1) 主题导向的问答合成,生成忠实的、孤立的问答对和聚焦段落上的推理,以及(2) 文档规模再嵌入,通过程序化重新嵌入这些对到完整的文档任务,以确保现实的复杂性。使用此框架,我们构建了SciMDR,一个大规模训练数据集,包含30万对具有明确推理链的问答对,覆盖2万篇科学论文。我们还构建了SciMDR-Eval,一个专家注释基准,用于评估全篇科学工作流程中的多模态理解。实验表明,基于SciMDR微调的模型在多个科学问答基准测试中取得了显著改进,特别是在需要复杂文档级推理的任务中。
Summary / 总结
The research addresses the trade-off among scale, faithfulness, and realism in constructing scientific multimodal document reasoning datasets. It introduces the synthesize-and-reground framework, which consists of Claim-Centric QA Synthesis and Document-Scale Regrounding. Using this framework, the authors construct SciMDR, a large-scale training dataset of 300K QA pairs with explicit reasoning chains across 20K scientific papers, and SciMDR-Eval, an expert-annotated benchmark. Experiments show that models fine-tuned on SciMDR improve significantly on multiple scientific QA benchmarks, especially on tasks requiring complex document-level reasoning.
论文介绍了SciMDR,这是一个用于科学多模态文档推理的大规模数据集,旨在解决规模、忠实性和现实性之间的权衡问题。它使用两阶段框架:Claim-Centric QA Synthesis 生成忠实的问答对,Document-Scale Regrounding 将这些对嵌入到完整文档中。数据集包含300K个带有推理链的问答对,覆盖20K篇科学论文,并构建了SciMDR-Eval 专家注释基准来评估全篇科学工作流程中的多模态理解。在SciMDR上微调的模型在多个科学问答基准测试中表现出显著改进,特别是在需要复杂文档级推理的任务中。
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
First: 2026-03-12T17:57:21+00:00 · Latest: 2026-03-12T17:57:21+00:00
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
中文标题/摘要
标题:信任你的批评者:稳健的奖励建模与强化学习在忠实图像编辑与生成中的应用
强化学习(RL)已成为提升图像编辑和文本到图像(T2I)生成的有前途的范式。然而,当前的奖励模型在作为RL中的批评者时,经常遭受幻觉困扰并给出嘈杂的评分,从而误导优化过程。在本文中,我们提出了FIRM(忠实图像奖励建模),这是一种全面的框架,旨在开发稳健的奖励模型以提供准确可靠的指导,用于忠实的图像生成和编辑。首先,我们设计了定制的数据整理管道来构建高质量的评分数据集。具体而言,我们使用执行和一致性来评估编辑,而生成则主要通过指令遵循来进行评估。使用这些管道,我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集,并训练了专门的奖励模型(FIRM-Edit-8B和FIRM-Gen-8B),这些模型能够准确反映这些标准。其次,我们引入了FIRM-Bench,这是一种专门针对编辑和生成批评者的综合基准。评估表明,我们的模型在与人类判断的对齐方面优于现有指标。此外,为了无缝地将这些批评者集成到RL管道中,我们提出了一个新颖的“基础加奖励”奖励策略,该策略平衡了编辑中的一致性调节执行(CME)和生成中的质量调节对齐(QMA)等竞争目标。借助此框架,我们的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验表明,FIRM减轻了幻觉,建立了忠实度和指令遵循的新标准,超越了现有的通用模型。所有我们的数据集、模型和代码均已在https://firm-reward.github.io/公开。
Summary / 总结
This paper addresses the issue of hallucinations in reward models used for reinforcement learning in image editing and text-to-image generation. It introduces FIRM (Faithful Image Reward Modeling), a framework that includes specialized data curation pipelines and reward models (FIRM-Edit-8B and FIRM-Gen-8B) to provide accurate guidance. The framework also proposes a novel 'Base-and-Bonus' reward strategy (CME for editing and QMA for generation) to balance consistency and quality. Experiments show that FIRM models outperform existing methods in terms of fidelity and instruction adherence, reducing hallucinations and setting a new standard for image generation and editing. All datasets, models, and code are publicly available.
本文旨在解决图像编辑和文本到图像生成中奖励模型出现幻觉的问题,提出了FIRM(Faithful Image Reward Modeling)框架,该框架包括专门的数据收集管道和奖励模型(FIRM-Edit-8B和FIRM-Gen-8B),以提供准确的指导。该框架还提出了一种新的“基础+奖励”奖励策略(编辑中的CME和生成中的QMA),以平衡一致性和质量。实验表明,FIRM模型在保真度和指令遵循方面优于现有方法,减少了幻觉并建立了新的标准。所有数据集、模型和代码均已公开。
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Authors: Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen
First: 2026-03-12T17:57:06+00:00 · Latest: 2026-03-12T17:57:06+00:00
Abstract
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.
中文标题/摘要
标题:探究推理型LLM评判者在不可验证LLM后训练中的作用
可受益于推理时扩展的推理型LLM评判者(LLMs-as-Judges),为将推理模型的成功扩展到输出正确性/质量无法直接检验的不可验证领域提供了一条有前景的路径。然而,尽管推理型评判者在静态评估基准上表现更好,它们在实际策略训练中的有效性尚未得到系统研究。因此,我们开展了一项严格的研究,考察非推理型与推理型评判者在基于强化学习的LLM对齐中的实际影响。在我们的受控合成设置中,由“黄金标准”评判者(gpt-oss-120b)提供偏好标注来训练较小的评判者,结果揭示了两类评判者之间的关键差异:非推理型评判者容易导致奖励作弊(reward hacking),而推理型评判者训练出的策略在由黄金标准评判者评估时表现强劲。有趣的是,我们发现这些策略之所以表现强劲,是因为它们学会了生成高度有效的对抗性输出,这些输出还能通过欺骗其他LLM评判者,在Arena-Hard等流行基准上取得高分。结合进一步的分析,我们的研究既揭示了重要发现,也指出了在不可验证的LLM后训练中应用(推理型)LLM评判者仍有改进空间。
Summary / 总结
The study examines how reasoning and non-reasoning LLM judges affect reinforcement-learning-based LLM alignment in non-verifiable domains. In a controlled synthetic setting where a gold-standard judge (gpt-oss-120b) supplies preference annotations for training smaller judges, non-reasoning judges readily lead to reward hacking, while reasoning judges yield policies that score well under the gold-standard judge. Those policies, however, reach their scores by learning to generate adversarial outputs that also deceive other LLM judges on popular benchmarks such as Arena-Hard, which points to both the promise of reasoning judges and the room for improvement in applying them.
研究在受控的合成环境中考察了推理型LLM评判者在不可验证领域中的有效性,其中由黄金标准评判者提供偏好标注来训练较小的评判者。研究发现,非推理型评判者容易引发奖励作弊,而推理型评判者训练出的策略在黄金标准评判者的评估下表现良好。不过,这些策略是通过学会生成有效的对抗性输出来取得高分的,这类输出还能欺骗其他LLM评判者,在Arena-Hard等基准上得分良好;这既体现了推理型LLM评判者的潜力,也指出了其在不可验证的LLM后训练中的改进空间。
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
First: 2026-03-12T17:57:04+00:00 · Latest: 2026-03-12T17:57:04+00:00
Comments: Project page: https://snap-research.github.io/elit/
Abstract
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores. Project page: https://snap-research.github.io/elit/
中文标题/摘要
标题:一种模型多种预算:弹性潜在接口的扩散变换器
扩散变换器(DiTs)实现高质量生成,但固定FLOPs到图像分辨率,限制了延迟-质量权衡的原理性调整,并且均匀分配计算资源到输入空间标记,浪费了对不重要区域的资源分配。我们引入了弹性潜在接口变换器(ELIT),这是一种可插入、DiT兼容的机制,解耦输入图像大小与计算。我们的方法插入了一个潜在接口,一个可学习的可变长度标记序列,标准变换器块可以在其上操作。轻量级的读取和写入交叉注意层在空间标记和潜在标记之间移动信息,并优先处理重要输入区域。通过随机丢弃尾部潜在标记进行训练,ELIT学习生成按重要性排序的表示,早期潜在标记捕获全局结构,而后期则包含细化细节的信息。在推理时,潜在标记的数量可以根据计算约束动态调整。ELIT故意保持简单,仅添加了两个交叉注意层,而未改变修正流目标和DiT堆栈。在不同数据集和架构(DiT、U-ViT、HDiT、MM-DiT)上,ELIT提供了持续的收益。在ImageNet-1K 512px上,ELIT在FID和FDD得分上分别提供了平均35.3%和39.6%的收益。
Summary / 总结
The research aims to address the limitations of diffusion transformers (DiTs) in terms of latency-quality trade-offs and inefficient resource allocation. The authors introduce Elastic Latent Interface Transformer (ELIT), which decouples input image size from compute by inserting a learnable latent interface. ELIT uses lightweight read and write cross-attention layers to prioritize important input regions and learns to produce importance-ordered representations. At inference, the number of latents can be adjusted dynamically. Experiments show consistent gains across different datasets and architectures, with an average improvement of 35.3% and 39.6% in FID and FDD scores on ImageNet-1K 512px.
研究旨在解决扩散Transformer(DiT)在延迟-质量权衡和计算资源分配方面的局限。作者提出了弹性潜在接口Transformer(ELIT),通过插入一个可学习的潜在接口,将输入图像大小与计算量解耦。该方法使用轻量级的读、写交叉注意力层来优先处理重要的输入区域,并学习生成按重要性排序的表示,从而可以在推理时动态调整潜在标记的数量。实验表明,ELIT在不同数据集和架构上均带来一致的提升,在ImageNet-1K 512px上,FID和FDD得分平均分别提升35.3%和39.6%。
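The tail-dropping idea described in the ELIT abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `drop_tail_latents` and `truncate_to_budget`, the `min_keep` parameter, and the uniform sampling of the prefix length are all assumptions made for illustration, and plain Python lists stand in for latent token tensors.

```python
import random


def drop_tail_latents(latents, min_keep=1, rng=random):
    """Training-time sketch: keep a random-length prefix, dropping the tail.

    Assumption: the prefix length is sampled uniformly; the abstract only
    says that tail latents are dropped at random.
    """
    if not 1 <= min_keep <= len(latents):
        raise ValueError("min_keep must be between 1 and len(latents)")
    keep = rng.randint(min_keep, len(latents))  # both bounds are inclusive
    return latents[:keep]


def truncate_to_budget(latents, budget):
    """Inference-time sketch: keep only the first `budget` latents (at least one)."""
    keep = max(1, min(budget, len(latents)))
    return latents[:keep]


if __name__ == "__main__":
    latents = [f"z{i}" for i in range(8)]  # stand-ins for latent token vectors
    print(drop_tail_latents(latents, min_keep=2, rng=random.Random(0)))
    print(truncate_to_budget(latents, budget=3))
```

Because only a prefix ever survives during training, earlier latents are always present and later ones only sometimes, which is one plausible reading of why the learned representation ends up ordered by importance.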
Separable neural architectures as a primitive for unified predictive and generative intelligence
Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha
First: 2026-03-12T17:56:54+00:00 · Latest: 2026-03-12T17:56:54+00:00
Abstract
Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.
中文标题/摘要
标题:可分神经架构作为统一预测与生成智能的基本构建块
物理、语言和感知领域的智能系统通常表现出可分解的结构,但通常由不明确利用这种结构的单一神经架构进行建模。可分神经架构(SNA)通过形式化一个统一加性、二次和张量分解神经模型的表示类来解决这一问题。通过限制交互顺序和张量秩,SNA 强加了一种结构先验偏置,将高维映射分解为低元组件。可分性不必是系统本身的属性:它通常在系统表达的坐标或表示中出现。关键的是,这种坐标感知的表述揭示了混沌时空动力学与语言自回归之间的结构类比。通过将连续物理状态视为平滑的可分嵌入,SNA 使混沌系统的分布建模成为可能。这种方法减轻了确定性算子的非物理漂移特性,同时适用于离散序列。这种方法的组合灵活性在四个领域得到展示:通过强化学习进行自主航点导航、多功能微结构的逆生成、湍流流动的分布建模以及神经语言建模。这些结果确立了可分神经架构作为预测与生成智能的领域无关的基本构建块,能够统一确定性和分布性表示。
Summary / 总结
The research observes that intelligent systems often exhibit factorisable structure, yet the monolithic neural architectures used to model them do not explicitly exploit it. It introduces the separable neural architecture (SNA), which unifies additive, quadratic, and tensor-decomposed models and, by constraining interaction order and tensor rank, factorises high-dimensional mappings into low-arity components. The approach is demonstrated in four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow, and neural language modelling.
研究旨在通过引入可分神经架构(SNAs)来解决传统神经架构中缺乏显式因子化的不足。SNAs 统一了加性、二次和张量分解模型,通过约束交互顺序和张量秩,将高维映射分解为低阶分量。该方法在自主航点导航的强化学习、多功能微结构的逆生成、湍流流动的分布建模以及神经语言建模四个领域中展示了其通用性,确立了SNAs作为预测和生成智能的通用基础架构的地位。
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00
Comments: Code: https://github.com/ROUJINN/SceneAssistant
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
中文标题/摘要
标题:SceneAssistant:一种用于开放词汇3D场景生成的视觉反馈代理
根据自然语言进行文本到3D场景生成,对数字内容创作而言非常有价值。然而,现有方法大多局限于特定领域或依赖预定义的空间关系,限制了它们在不受约束、开放词汇3D场景合成方面的能力。在本文中,我们介绍了SceneAssistant,一种用于开放词汇3D场景生成的视觉反馈驱动代理。我们的框架利用了现代3D对象生成模型以及视觉语言模型(VLM)的空间推理和规划能力。为了实现开放词汇场景组合,我们为VLM提供了全面的原子操作集(例如,缩放、旋转、聚焦)。在每次交互步骤中,VLM接收渲染的视觉反馈并相应地采取行动,逐步细化场景以实现更连贯的空间布局并更好地与输入文本对齐。实验结果表明,我们的方法可以生成多样、开放词汇且高质量的3D场景。定性分析和定量的人类评估都证明了我们方法优于现有方法。此外,我们的方法允许用户根据自然语言命令指示代理编辑现有场景。我们的代码可在 https://github.com/ROUJINN/SceneAssistant 获取
Summary / 总结
SceneAssistant is a visual-feedback-driven agent for open-vocabulary 3D scene generation, using modern 3D object generation models and Vision-Language Models (VLMs) with atomic operations like Scale, Rotate, and FocusOn. It iteratively refines scenes based on visual feedback, achieving diverse and high-quality 3D scenes that align well with input text. Qualitative and quantitative evaluations show its superiority over existing methods, and it supports editing existing scenes with natural language commands.
SceneAssistant 是一种基于视觉反馈的开放词汇3D场景生成代理,结合了3D对象生成模型和视觉语言模型(VLMs),通过自然语言输入逐步细化场景。该方法为VLMs提供原子操作并在每一步提供视觉反馈,以实现空间布局的协调性。实验结果表明,SceneAssistant 可以生成多样且高质量的3D场景,在定性和定量的人类评估中均优于现有方法,并支持使用自然语言命令编辑现有场景。
Security Considerations for Artificial Intelligence Agents
Authors: Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma
First: 2026-03-12T17:49:39+00:00 · Latest: 2026-03-12T17:49:39+00:00
Comments: Perplexity Response to NIST/CAISI Request for Information 2025-0035. 91 Fed. Reg. 698 (Jan. 8, 2026)
Abstract
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.
中文标题/摘要
标题:人工智能代理的安全考虑
本文是Perplexity对NIST/CAISI信息征询(Request for Information)2025-0035所作回应的轻度改编版本,详细阐述了我们对前沿AI代理安全性的观察和建议。这些见解源自Perplexity运营通用代理系统的经验,这些系统在受控和开放世界环境中被数百万用户和数千家企业使用。代理架构改变了关于代码-数据分离、权限边界和执行可预测性的核心假设,带来了新的机密性、完整性和可用性故障模式。我们梳理了工具、连接器、托管边界和多代理协调中的主要攻击面,特别强调间接提示注入、混淆代理人(confused deputy)行为以及长时间运行工作流中的级联故障。然后,我们将当前的防御措施作为分层堆栈进行评估:输入级和模型级缓解措施、沙盒执行,以及针对高后果操作的确定性策略执行。最后,我们指出了标准和研究方面的缺口,包括自适应安全基准、用于委托和权限控制的策略模型,以及符合NIST风险管理原则的安全多代理系统设计指南。
Summary / 总结
This article discusses security considerations for AI agents, drawing from Perplexity's operational experience with general-purpose agentic systems. It highlights new security challenges arising from changes in agent architectures, such as confidentiality, integrity, and availability risks. The authors map attack surfaces and assess current defenses, including input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement. They also identify gaps in standards and research, such as adaptive security benchmarks and secure multi-agent system design guidelines.
本文讨论了AI代理的安全考虑,基于Perplexity在广泛用途代理系统运营中的经验。文章指出了由于代理架构变化带来的新安全挑战,如机密性、完整性和可用性风险。作者绘制了攻击面并评估了当前的防御措施,包括输入级和模型级缓解、沙盒执行和确定性策略执行。他们还指出了标准和研究中的空白,如适应性安全基准和安全多代理系统设计指南。
Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing
Authors: Pavel Surynek
First: 2026-03-12T17:48:14+00:00 · Latest: 2026-03-12T17:48:14+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2503.05071
Abstract
Computing power that decades ago was available only in supercomputers, especially their parallelism, is now available in standard personal computer CPUs and even in CPUs for mobile telephones. We show how to effectively utilize the computing power of a modern multi-core personal computer CPU to solve the complex combinatorial problem of object arrangement and scheduling for sequential 3D printing. We achieved this by parallelizing the existing CEGAR-SEQ algorithm, which solves sequential object arrangement and scheduling by expressing it as a linear arithmetic formula that is then solved by a technique inspired by counterexample-guided abstraction refinement (CEGAR). The original CEGAR-SEQ algorithm uses an object arrangement strategy that places objects towards the center of the printing plate. We propose alternative object arrangement strategies such as placing objects towards a corner of the printing plate and scheduling objects according to their height. Our parallelization is done at a high level: we execute the CEGAR-SEQ algorithm in parallel with a portfolio of object arrangement strategies, an algorithm we call Portfolio-CEGAR-SEQ. Our experimental evaluation indicates that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ. When a batch of objects for multiple printing plates is scheduled, Portfolio-CEGAR-SEQ often uses fewer printing plates than CEGAR-SEQ.
中文标题/摘要
标题:基于CEGAR的顺序3D打印物体排布与调度中的求解策略组合
几十年前只有超级计算机才具备的计算能力,尤其是其并行性,如今在标准个人计算机CPU乃至移动电话CPU中都已具备。我们展示了如何有效利用现代多核个人计算机CPU的计算能力,来解决顺序3D打印中物体排布与调度这一复杂的组合问题。我们通过并行化现有的CEGAR-SEQ算法实现了这一点;该算法将顺序物体排布与调度问题表达为线性算术公式,然后使用受反例引导抽象细化(CEGAR)启发的技术求解。原始的CEGAR-SEQ算法采用将物体朝打印板中心放置的排布策略。我们提出了其他排布策略,例如将物体朝打印板的某个角落放置,以及根据物体高度进行调度。我们的并行化在高层次上进行:以一组物体排布策略构成的组合并行执行CEGAR-SEQ算法,我们将该算法称为Portfolio-CEGAR-SEQ。实验评估表明,Portfolio-CEGAR-SEQ优于原始的CEGAR-SEQ。在为多个打印板调度一批物体时,Portfolio-CEGAR-SEQ使用的打印板通常少于CEGAR-SEQ。
Summary / 总结
The research aims to leverage modern multi-core CPUs to solve the complex combinatorial problem of object arrangement and scheduling for 3D printing. The method involves parallelizing the CEGAR-SEQ algorithm using a portfolio of object arrangement strategies. Key findings show that the Portfolio-CEGAR-SEQ algorithm outperforms the original CEGAR-SEQ by often using fewer printing plates when scheduling objects for multiple printing plates.
研究旨在利用现代多核CPU解决3D打印中的复杂组合问题。方法是通过使用对象排列策略的组合来并行化CEGAR-SEQ算法。关键发现表明,Portfolio-CEGAR-SEQ算法在为多个打印板调度对象时通常比CEGAR-SEQ使用更少的打印板。
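The portfolio idea summarized above (run several arrangement strategies in parallel and keep the best outcome) can be sketched as follows. This is an illustration only, not CEGAR-SEQ: the strategies here are toy first-fit packers over one-dimensional object sizes, the plate capacity is a made-up number, and choosing the winner by fewest plates is an assumption. The names `first_fit`, `run_portfolio`, and the strategy labels are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def first_fit(sizes, capacity=10):
    """Toy stand-in for a solver: first-fit packing, returns the number of plates used."""
    plates = []  # used capacity per plate
    for size in sizes:
        for i, used in enumerate(plates):
            if used + size <= capacity:
                plates[i] = used + size
                break
        else:
            plates.append(size)  # no existing plate fits, open a new one
    return len(plates)


def run_portfolio(strategies, objects):
    """Run every strategy concurrently and return (name, plates) of the best one.

    Threads keep the sketch short; CPU-bound solvers in Python would need
    processes (or native solver code) to get real multi-core speedups.
    """
    with ThreadPoolExecutor(max_workers=len(strategies)) as pool:
        futures = {name: pool.submit(fn, objects) for name, fn in strategies.items()}
        results = {name: future.result() for name, future in futures.items()}
    best = min(results, key=results.get)  # assumption: fewer plates is better
    return best, results[best]


# Hypothetical strategies that differ only in the order objects are placed.
STRATEGIES = {
    "input_order": lambda objs: first_fit(objs),
    "largest_first": lambda objs: first_fit(sorted(objs, reverse=True)),
}

if __name__ == "__main__":
    print(run_portfolio(STRATEGIES, [2, 5, 4, 7, 1, 3, 8]))
```

A portfolio never does worse than its best member on a given instance, which is the main appeal of parallelizing at this level: no single strategy has to win everywhere.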
LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
Authors: Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen
First: 2025-12-05T03:16:46+00:00 · Latest: 2026-03-12T17:45:22+00:00
Comments: Code will be released soon
Abstract
Whole Slide Image (WSI) MLLMs are difficult to build and deploy because gigapixel slides induce thousands of visual tokens, while only a small fraction of regions is diagnostically relevant. Existing slide-level pathology MLLMs typically combine heavy slide-level encoders with long visual prefixes, making end-to-end slide-level development and deployment expensive under limited computational resources. We revisit this regime and show that WSI tile features are highly redundant at both global and local scales, while task-relevant evidence is sparse and query-dependent. We therefore introduce LoC-Path, a resource-efficient slide-level MLLM that compresses before fusion. LoC-Path uses a Sparse Token Merger (STM) and an MAE-pretrained resampler to replace expensive slide-level encoding with a compact latent interface, then uses a Token Importance Scorer (TIS) to select the most relevant latents and a Cross-Attention Routing Adapter (CARA) to fuse them into a few LLM decoder layers. This design lowers both multimodal tuning cost and inference-time latency/memory by avoiding heavy slide-level encoding and long visual prefixes. Extensive experiments show that LoC-Path remains competitive with prior slide-level MLLMs while making end-to-end development and deployment more practical under limited computational resources.
中文标题/摘要
标题:LoC-Path:面向病理多模态大语言模型的压缩学习
全切片图像(WSI)多模态大语言模型(MLLM)难以构建和部署,因为十亿像素级的切片会产生数千个视觉标记,而只有一小部分区域与诊断相关。现有的切片级病理MLLM通常将沉重的切片级编码器与很长的视觉前缀相结合,使得在有限计算资源下进行端到端的切片级开发和部署代价高昂。我们重新审视了这一设定,并表明WSI图块(tile)特征在全局和局部尺度上都高度冗余,而与任务相关的证据则是稀疏且依赖于查询的。因此,我们提出了LoC-Path,一种资源高效的切片级MLLM,它先压缩、后融合。LoC-Path使用稀疏标记合并器(STM)和MAE预训练的重采样器,以紧凑的潜在接口取代昂贵的切片级编码,然后使用标记重要性评分器(TIS)选出最相关的潜在标记,并使用交叉注意力路由适配器(CARA)将它们融合进少数几个LLM解码器层。该设计避免了沉重的切片级编码和很长的视觉前缀,从而同时降低了多模态调优成本以及推理时的延迟和内存开销。大量实验表明,LoC-Path与先前的切片级MLLM相比仍具竞争力,同时使有限计算资源下的端到端开发和部署更加可行。
Summary / 总结
The motivation for this work is to address the computational challenges of building and deploying Whole Slide Image (WSI) Multimodal Large Language Models (MLLMs) due to the high number of visual tokens and the limited relevance of most regions. The method involves LoC-Path, which compresses WSI tile features before fusion using a Sparse Token Merger, an MAE-pretrained resampler, and a Token Importance Scorer to select and fuse relevant latents with a Cross-Attention Routing Adapter. Key findings show that LoC-Path is competitive with existing slide-level MLLMs while significantly reducing multimodal tuning costs and inference-time latency/memory under limited computational resources.
LoC-Path提出了一种资源高效的方法,以应对构建和部署全切片图像(WSI)多模态大语言模型(MLLM)的挑战。它在融合之前先压缩WSI图块特征,使用稀疏标记合并器和MAE预训练的重采样器构建紧凑的潜在接口。标记重要性评分器选出相关的潜在标记,再由交叉注意力路由适配器将它们融合进少数几个LLM解码器层。该设计降低了多模态调优成本以及推理时的延迟和内存开销。实验表明,LoC-Path与现有的切片级MLLM相比仍具竞争力,同时使有限计算资源下的端到端开发和部署更加实用。
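The selection step described above, where an importance scorer picks the most relevant latents before fusion, might look roughly like the following. The abstract does not say how the Token Importance Scorer computes or thresholds its scores, so this sketch assumes the scores are already given and that selection is a simple top-k; `select_top_latents` is a hypothetical name, and strings stand in for latent vectors.

```python
def select_top_latents(latents, scores, k):
    """Keep the k highest-scoring latents, preserving their original order.

    Assumption: top-k selection over precomputed scores; the actual scorer
    in the paper is a learned, query-dependent module.
    """
    if len(latents) != len(scores):
        raise ValueError("latents and scores must have the same length")
    k = max(0, min(k, len(latents)))
    ranked = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore positional order among the survivors
    return [latents[i] for i in kept]


if __name__ == "__main__":
    latents = ["a", "b", "c", "d"]  # stand-ins for latent vectors
    scores = [0.1, 0.9, 0.3, 0.8]   # hypothetical importance scores
    print(select_top_latents(latents, scores, k=2))
```

Keeping the survivors in their original order is a design choice of this sketch; it matters only if downstream layers rely on positional information.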
HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
Authors: Yiming Huang, Tolga Birdal
Venue: ICLR 2026
First: 2025-02-06T18:51:14+00:00 · Latest: 2026-03-12T17:45:20+00:00
Comments: Accepted at ICLR 2026
Abstract
Graph generation is a critical yet challenging task, as empirical analyses require a deep understanding of complex, non-Euclidean structures. Diffusion models have recently made significant advances in graph generation, but these models are typically adapted from image generation frameworks and overlook inherent higher-order topology, limiting their ability to capture graph topology. In this work, we propose Higher-order Guided Diffusion (HOG-Diff), a principled framework that progressively generates plausible graphs with inherent topological structures. HOG-Diff follows a coarse-to-fine generation curriculum, guided by higher-order topology and implemented via diffusion bridges. We further prove that our model admits stronger theoretical guarantees than classical diffusion frameworks. Extensive experiments across eight graph generation benchmarks, spanning diverse domains and including large-scale settings, demonstrate the scalability of our method and its superior performance on both pairwise and higher-order topological metrics. Our project page is available at https://circle-group.github.io/research/hog-diff/.
中文标题/摘要
标题:HOG-Diff:基于高阶引导扩散的图生成
图生成是一项关键但具有挑战性的任务,因为经验分析需要对复杂的非欧几里得结构有深刻的理解。扩散模型在图生成方面最近取得了显著进展,但这些模型通常是从图像生成框架中改编而来的,忽视了固有的高阶拓扑结构,限制了它们捕捉图拓扑结构的能力。在本文中,我们提出了一种基于高阶引导扩散(HOG-Diff)的原则性框架,该框架逐步生成具有固有拓扑结构的合理图。HOG-Diff 遵循从粗到细的生成课程,由高阶拓扑结构引导,并通过扩散桥梁实现。我们进一步证明,我们的模型在理论保证上比经典的扩散框架更强。在八个涵盖不同领域并包括大规模设置的图生成基准测试中进行的广泛实验表明,我们的方法在可扩展性和高阶拓扑度量上的性能均优于其他方法。我们的项目页面可在 https://circle-group.github.io/research/hog-diff/ 查看。
Summary / 总结
The research aims to address the challenge of generating graphs with complex, non-Euclidean structures, which is crucial for empirical analyses. HOG-Diff, a Higher-order Guided Diffusion framework, is proposed to generate graphs with inherent topological structures by following a coarse-to-fine generation curriculum guided by higher-order topology. Experiments across various benchmarks show that HOG-Diff outperforms existing methods on both pairwise and higher-order topological metrics and is scalable to large-scale settings.
HOG-Diff 是一种新颖的图生成框架,通过引入更高阶的拓扑结构来解决现有扩散模型的局限性。它采用从粗到细的生成方法,并由更高阶的拓扑结构和扩散桥梁引导,从而在各种图生成基准测试中,在成对和更高阶拓扑度量方面表现出更优性能。该方法提供了比经典扩散框架更强的理论保证,并在大规模设置中展示了可扩展性。
A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Authors: Jiajun Sun, Zhe Gao
First: 2026-03-12T17:45:12+00:00 · Latest: 2026-03-12T17:45:12+00:00
Comments: 10 pages, 4 figures
Abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
中文标题/摘要
标题:一种双模态两阶段模型用于面部情感表达识别
本文针对第10届自然场景下情感行为分析(ABAW)研讨会暨竞赛中的表情(EXPR)识别挑战,该任务要求对无约束视频中的八种面部情感表情进行帧级分类。由于人脸定位不准确、姿态和尺度变化大、运动模糊、时间不稳定性以及相邻帧之间的其他干扰因素,该任务具有挑战性。我们提出了一种两阶段双模态(音频-视觉)模型来应对这些困难。第一阶段专注于使用基于DINOv2的预训练编码器进行鲁棒的视觉特征提取。具体来说,使用DINOv2 ViT-L/14作为骨干网络,采用填充感知增强(PadAug)策略对原始视频进行图像填充和数据预处理,并引入混合专家(MoE)训练头以增强分类器多样性。第二阶段解决模态融合和时间一致性问题。对于视觉模态,从原始视频中以多个尺度重新裁剪人脸,并对提取的视觉特征取平均,以形成鲁棒的帧级表示。同时,从短音频窗口中提取与帧对齐的Wav2Vec 2.0音频特征,以提供互补的声学线索。这些双模态特征通过轻量级门控融合模块进行整合,随后在推理时进行时间平滑。在ABAW数据集上的实验证明了所提方法的有效性。该两阶段模型在官方验证集上的Macro-F1得分为0.5368,在5折交叉验证下为0.5122 ± 0.0277,优于官方基线。
Summary / 总结
This paper presents a two-stage dual-modal model for facial emotional expression recognition in unconstrained videos, addressing challenges such as inaccurate face localization and pose variations. The model uses a pretrained DINOv2-based encoder for robust visual feature extraction and a mixture-of-experts training head to enhance classifier diversity. In the second stage, it integrates visual and audio features through a lightweight gated fusion module, improving temporal consistency. The model achieves a Macro-F1 score of 0.5368 on the official validation set and outperforms official baselines.
该论文提出了一种双模态两阶段模型,用于处理不受限视频中的面部情感表达识别,解决了如面部定位不准确和姿态变化等挑战。模型使用预训练的DINOv2编码器进行稳健的视觉特征提取,并通过轻量级门控融合模块整合视听特征。实验结果显示,该模型在官方验证集上的宏F1分数为0.5368,并优于官方基线模型。
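The Stage II steps described above (gated fusion of visual and audio features, then inference-time temporal smoothing) can be sketched in a few lines. The paper's gate is a learned module and its smoothing method is not specified in the abstract, so this sketch assumes an element-wise sigmoid gate with given logits and a centered moving average; `gated_fusion`, `smooth_predictions`, and the `window` parameter are hypothetical names.

```python
import math


def gated_fusion(visual, audio, gate_logits):
    """Blend two feature vectors element-wise: g * visual + (1 - g) * audio.

    Assumption: g = sigmoid(logit); in a real model the logits would be
    predicted from the features rather than passed in.
    """
    fused = []
    for v, a, z in zip(visual, audio, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * v + (1.0 - g) * a)
    return fused


def smooth_predictions(frame_probs, window=3):
    """Centered moving average over per-frame class probabilities."""
    half = window // 2
    smoothed = []
    for t in range(len(frame_probs)):
        lo, hi = max(0, t - half), min(len(frame_probs), t + half + 1)
        chunk = frame_probs[lo:hi]  # shorter at the sequence boundaries
        smoothed.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return smoothed


if __name__ == "__main__":
    print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
    print(smooth_predictions([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
```

In the second call, the middle frame disagrees with both neighbours; after averaging it leans back toward them, which is the kind of single-frame flicker that smoothing is meant to suppress.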
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Authors: Görkay Aydemir, Fatma Güney, Weidi Xie
Venue: CVPR 2026
First: 2026-03-12T17:40:52+00:00 · Latest: 2026-03-12T17:40:52+00:00
Comments: CVPR 2026
Abstract
Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
中文标题/摘要
标题:基于验证器引导伪标签的现实世界点跟踪
长期点跟踪模型通常在大型合成数据集上训练。这些模型在现实世界视频中的性能下降,因为现实世界视频具有不同的特征且缺乏密集的地面真值注释。在未标记视频上进行自我训练已被探索作为实际解决方案,但伪标签的质量强烈依赖于教师模型的可靠性,这在不同帧和场景之间变化。在本文中,我们解决了现实世界微调的问题,并引入了验证器,这是一种元模型,用于学习评估跟踪器预测的可靠性并指导伪标签生成。给定多个预训练跟踪器的候选轨迹,验证器逐帧评估它们并选择最值得信赖的预测,从而生成高质量的伪标签轨迹。在进行微调时,验证器引导的伪标签生成显著提高了监督质量,并使模型能够高效地适应未标记视频。在四个现实世界基准上的广泛实验表明,我们的方法在所需数据量少于先前自我训练方法的情况下达到了最先进的效果。项目页面:https://kuis-ai.github.io/track_on_r
Summary / 总结
The paper addresses the challenge of real-world point tracking by proposing a verifier-guided pseudo-labeling method. This method uses a meta-model called verifier to assess the reliability of tracker predictions and select the most trustworthy ones, generating high-quality pseudo-labels. Experiments on four real-world benchmarks show that this approach significantly improves tracking performance and achieves state-of-the-art results with less data compared to previous self-training methods.
本文解决了点跟踪模型在真实世界视频中的微调问题:在合成数据集上训练的模型由于数据特性不同,在真实视频上往往效果不佳。作者引入了一个验证器,该模型评估跟踪器预测的可靠性并指导高质量伪标签的生成。通过使用验证器,该方法显著提高了监督质量,并更有效地适应未标记的真实世界视频,相比之前的自我训练方法,使用更少的数据取得了最先进的结果。
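Illustrative sketch (not the authors' code): the per-frame selection step described in the abstract can be pictured as follows, assuming the verifier emits one scalar reliability score per tracker per frame. The function name `select_pseudo_labels`, the array shapes, and the plain argmax rule are assumptions made for illustration only.

```python
import numpy as np

def select_pseudo_labels(candidates, scores):
    """Pick, for every frame, the candidate point with the highest verifier score.

    candidates: array of shape (K, T, 2), xy predictions of K trackers over T frames.
    scores:     array of shape (K, T), hypothetical per-frame reliability scores.
    Returns the (T, 2) pseudo-label trajectory and the (T,) chosen tracker indices.
    """
    candidates = np.asarray(candidates, dtype=float)
    scores = np.asarray(scores, dtype=float)
    best = scores.argmax(axis=0)              # index of the most trusted tracker per frame
    frames = np.arange(candidates.shape[1])
    return candidates[best, frames], best     # advanced indexing yields shape (T, 2)
```

The resulting trajectory can mix predictions from different trackers across frames, which is the point of evaluating candidates per frame rather than per video.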
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Yaoqi Sun, Sam Kwong
First: 2026-03-12T17:34:29+00:00 · Latest: 2026-03-12T17:34:29+00:00
Abstract
Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
中文标题/摘要
标题:RDNet:光学遥感图像中区域比例感知动态自适应显著目标检测网络
光学遥感图像中的显著目标检测(SOD)面临着显著挑战,由于目标大小的巨大变化、自注意力机制的计算成本以及基于CNN的提取器在捕捉全局上下文和长程依赖方面的局限性。现有依赖固定卷积核的方法往往难以适应多样的目标尺度,导致细节丢失或无关特征聚合。为了解决这些问题,这项工作旨在增强对尺度变化的鲁棒性并实现精确的目标定位。我们提出了区域比例感知动态自适应显著目标检测网络(RDNet),用SwinTransformer替换CNN骨干进行全局上下文建模,并引入了三个关键模块:(1)动态自适应细节感知(DAD)模块,通过对象区域比例指导应用变化的卷积核;(2)频率匹配上下文增强(FCE)模块,通过小波交互和注意力丰富上下文信息;(3)区域比例感知定位(RPL)模块,采用交叉注意力突出语义细节,并结合比例引导(PG)块辅助DAD模块。通过结合这些模块,RDNet实现了对尺度变化的鲁棒性和准确的定位,相比现有最先进的方法提供了更好的检测性能。
Summary / 总结
This paper addresses the challenges of salient object detection in remote sensing images by proposing RDNet, which uses SwinTransformer for global context modeling and introduces three key modules: Dynamic Adaptive Detail-aware (DAD), Frequency-matching Context Enhancement (FCE), and Region Proportion-aware Localization (RPL). The DAD module uses varied convolution kernels based on object region proportions, while the FCE module enhances contextual information through wavelet interactions and attention. The RPL module employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block. Experimental results show that RDNet outperforms existing methods in terms of robustness to scale variations and accurate object localization.
本文针对遥感图像中显著目标检测面临的挑战,如目标尺寸变化大和计算成本高。提出了一种RDNet方法,使用SwinTransformer进行全局上下文建模,并引入了三个模块:DAD基于目标区域比例应用可变卷积核,FCE通过小波交互和注意力增强上下文信息,RPL使用交叉注意力突出语义细节并结合比例引导块辅助DAD模块。实验结果表明,RDNet在处理尺度变化和实现精确目标定位方面优于现有方法。
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
First: 2026-03-12T17:30:49+00:00 · Latest: 2026-03-12T17:30:49+00:00
Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
中文标题/摘要
标题:ForensicZip:更多的标记更好但不一定必要于法医视觉语言模型
多模态大型语言模型(MLLMs)通过生成伪造检测的文本解释来实现可解释的多媒体法医分析。然而,处理密集的视觉序列会带来高昂的计算成本,特别是对于高分辨率的图像和视频。视觉标记剪枝是一种实用的加速策略,但现有方法主要基于语义驱动,保留显著对象,而丢弃包含伪造痕迹(如高频异常和时间抖动)的背景区域。为了解决这一问题,我们引入了ForensicZip,这是一种无需训练的框架,从伪造驱动的角度重新定义了标记压缩。ForensicZip将时间标记演变建模为带松弛虚拟节点的出生-死亡最优传输问题,量化表明瞬态生成伪迹的物理不连续性。法医评分进一步将基于传输的新颖性与高频先验相结合,在大比例压缩下分离法医证据和语义内容。在深度伪造和AIGC基准测试中,即使在10%的标记保留率下,ForensicZip也实现了2.97倍的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
Summary / 总结
The research addresses the high computational costs of processing dense visual sequences in forensic vision-language models. It introduces ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective, focusing on physical discontinuities to detect transient generative artifacts. Experiments show that at 10% token retention, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
研究旨在解决处理密集视觉序列在法医视觉-语言模型中的高计算成本问题。ForensicZip 是一个无需训练的框架,从伪造驱动的角度重新定义了 token 压缩,使用 Birth-Death Optimal Transport 问题来量化物理不连续性。在 10% token 保留的情况下,ForensicZip 实现了 2.97 倍的加速和超过 90% 的 FLOPs 减少,同时在深伪和 AIGC 基准上保持了最先进的检测性能。
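Illustrative sketch (not the authors' code): the retention step at a fixed token budget can be pictured as a top-k selection over a fused score. The weighted sum with `alpha`, the function name `retain_tokens`, and the input arrays are assumptions; the abstract does not specify how the transport-based novelty and high-frequency priors are combined, and the optimal-transport step itself is not reproduced here.

```python
import numpy as np

def retain_tokens(novelty, high_freq, ratio=0.10, alpha=0.5):
    """Return the sorted indices of the tokens kept at the given retention ratio.

    novelty:   per-token novelty scores (stand-in for the transport-based term).
    high_freq: per-token high-frequency prior scores.
    alpha:     hypothetical mixing weight between the two terms.
    """
    novelty = np.asarray(novelty, dtype=float)
    high_freq = np.asarray(high_freq, dtype=float)
    score = alpha * novelty + (1.0 - alpha) * high_freq   # assumed fusion rule
    k = max(1, int(round(ratio * score.size)))            # keep at least one token
    keep = np.argsort(-score, kind="stable")[:k]          # k highest-scoring tokens
    return np.sort(keep)                                  # restore original token order
```

Sorting the kept indices preserves the original token order, which matters if the downstream model relies on positional information.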
LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models
Authors: Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu, Wenhui Zhao, Dingwen Zhang
First: 2026-01-10T12:18:12+00:00 · Latest: 2026-03-12T17:26:37+00:00
Abstract
Multi-Object Tracking (MOT) is evolving from geometric localization to Semantic MOT (SMOT) to answer complex relational queries, yet progress is hindered by semantic data scarcity and a structural disconnect between tracking architectures and Multi-modal Large Language Models (MLLMs). To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts. Furthermore, we propose LLMTrack, the first framework to seamlessly integrate MLLMs into the SMOT task. LLMTrack establishes a Macro-Understanding-First paradigm, utilizing a novel Spatio-Temporal Fusion Module to align discrete geometric trajectories with continuous semantic features, effectively suppressing temporal hallucinations during online processing. Extensive experiments demonstrate that LLMTrack achieves state-of-the-art geometric tracking performance while delivering a qualitative leap in dynamic semantic reasoning. Notably, our analysis reveals that high-quality semantic narratives empower the language model to deduce complex social interactions naturally, demonstrating that direct cognitive reasoning is more effective than cumbersome explicit visual modeling. Ultimately, our contributions bridge the gap between perceptual tracking and cognitive reasoning, establishing a robust new foundation for comprehensive video understanding and intelligent narrative generation.
中文标题/摘要
标题:LLMTrack:使用多模态大型语言模型的语义多对象跟踪
多对象跟踪(MOT)正在从几何定位发展到语义MOT(SMOT),以回答复杂的关联查询,但进展受到语义数据稀缺性和跟踪架构与多模态大型语言模型(MLLMs)之间结构性脱节的阻碍。为了解决这个问题,我们引入了Grand-SMOT,这是一个大规模、开放世界的基准,提供了高密度、双流叙事,全面地将个体行为与环境背景解耦。此外,我们提出了LLMTrack,这是第一个将MLLM无缝集成到SMOT任务中的框架。LLMTrack确立了先宏观理解的范式,利用新颖的空间-时间融合模块将离散的几何轨迹与连续的语义特征对齐,在在线处理过程中有效抑制了时间幻觉。广泛的实验表明,LLMTrack在几何跟踪性能上达到了最先进的水平,同时在动态语义推理方面实现了质的飞跃。值得注意的是,我们的分析表明,高质量的语义叙事使语言模型能够自然地推断复杂的社交互动,表明直接的认知推理比繁琐的显式视觉建模更有效。最终,我们的贡献弥合了感知跟踪与认知推理之间的差距,为全面的视频理解和智能叙事生成奠定了坚实的新基础。
Summary / 总结
LLMTrack is a framework that integrates Multi-modal Large Language Models into Semantic Multi-Object Tracking (SMOT) to address the challenges of semantic data scarcity and structural disconnect. It introduces Grand-SMOT, a benchmark with high-density narratives, and a Spatio-Temporal Fusion Module to align geometric trajectories with semantic features, improving both geometric tracking and dynamic semantic reasoning. Experiments show LLMTrack outperforms existing methods in geometric tracking and enhances the model's ability to understand complex social interactions.
论文提出了LLMTrack框架,将多模态大型语言模型(MLLMs)集成到语义多对象跟踪(SMOT)中,以解决语义数据稀缺性和结构断层的问题。LLMTrack使用时空融合模块对齐几何轨迹和语义特征,提高几何跟踪和动态语义推理的性能。实验表明,LLMTrack在几何跟踪上超越了现有方法,并通过自然语言推理显著提升了对复杂社会互动的理解。
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
Authors: Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Venue: CVPR 2026
First: 2026-03-12T17:23:46+00:00 · Latest: 2026-03-12T17:23:46+00:00
Comments: Accepted to CVPR 2026. See project page at https://lmzpai.github.io/SaPaVe
Abstract
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(π_0\), achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
中文标题/摘要
标题:SaPaVe:迈向机器人视觉-语言-动作模型中的主动感知与操作
主动感知与操作对于机器人与复杂场景交互至关重要。现有方法难以将语义驱动的主动感知与稳健、视角不变的操作执行统一起来。我们提出SaPaVe,这是一种端到端框架,能够以数据高效的方式联合学习这些能力。我们的方法将相机操作和操作动作分离,而不是将它们放在共享的动作空间中,并采用自底向上的训练策略:我们首先在大规模数据集上训练语义相机控制,然后使用混合数据联合优化两种动作类型。为了支持这一框架,我们引入了ActiveViewPose-200K数据集,包含20万个图像-语言-相机运动对,用于语义相机运动学习,并提出了一种3D几何感知模块,以提高在动态视角下的执行鲁棒性。我们还提出了ActiveManip-Bench,这是第一个用于评估超出固定视角设置的主动操作的基准。在模拟和真实环境中的广泛实验表明,SaPaVe在现实任务中的成功率比最近的视觉-语言-动作模型GR00T N1和\(π_0\)高出31.25%。这些结果表明,当以分离但协调的方式训练紧密耦合的感知和执行时,能够实现高效且通用的主动操作。项目页面:https://lmzpai.github.io/SaPaVe
Summary / 总结
SaPaVe is an end-to-end framework that addresses the challenge of integrating semantic-driven active perception with robust manipulation. It decouples camera and manipulation actions and uses a bottom-up training strategy, first training semantic camera control and then jointly optimizing both actions. The framework demonstrates superior performance in real-world tasks, achieving up to 31.25% higher success rates compared to recent models like GR00T N1 and \(π_0\).
SaPaVe 是一个端到端框架,旨在解决将语义驱动的主动感知与稳健的执行集成的挑战。该框架将相机和操作动作分离,并采用自底向上的训练策略,首先训练语义相机控制,然后联合优化两种类型的动作。实验结果表明,该框架在真实世界任务中的表现优于 GR00T N1 和 \(π_0\) 等最近的模型,成功率提高了高达 31.25%。
ReSplat: Learning Recurrent Gaussian Splatting
Authors: Haofei Xu, Daniel Barath, Andreas Geiger, Marc Pollefeys
First: 2025-10-09T17:59:59+00:00 · Latest: 2026-03-12T17:18:50+00:00
Comments: Project page: https://haofeixu.github.io/resplat/ Code: https://github.com/cvg/resplat
Abstract
While existing feed-forward Gaussian splatting models offer computational efficiency and can generalize to sparse view settings, their performance is fundamentally constrained by relying on a single forward pass for inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization across datasets, view counts, and image resolutions. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying numbers of input views (2, 8, 16, 32), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV, RealEstate10K, and ACID) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.
中文标题/摘要
标题:ReSplat:学习递归高斯点积
虽然现有的前馈高斯点积模型提供了计算效率,并且可以泛化到稀疏视角设置,但它们的性能从根本上受到仅依赖单次前向推理的限制。我们提出了ReSplat,一种前馈递归高斯点积模型,通过迭代细化3D高斯分布而无需显式计算梯度。我们的核心见解是,高斯点积渲染误差作为丰富的反馈信号,指导递归网络学习有效的高斯更新。这种反馈信号在测试时自然适应未见过的数据分布,从而在不同数据集、视角数量和图像分辨率下实现稳健泛化。为了初始化递归过程,我们引入了一个紧凑的重建模型,在$16 \times$下采样空间中操作,产生的高斯数量仅为之前每个像素高斯模型的$16 \times$分之一。这大大减少了计算开销,并允许高效地更新高斯。在不同数量的输入视角(2,8,16,32)、分辨率($256 \times 256$到$540 \times 960$)和数据集(DL3DV,RealEstate10K,ACID)的广泛实验中,我们的方法在显著减少高斯数量和提高渲染速度的同时,实现了最先进的性能。我们的项目页面为https://haofeixu.github.io/resplat/
Summary / 总结
ReSplat is a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without computing gradients, using the Gaussian splatting rendering error as a feedback signal to guide the network. It introduces a compact reconstruction model to initialize the process, reducing computational overhead. Experiments show that ReSplat achieves state-of-the-art performance with fewer Gaussians and faster rendering speed across various datasets and resolutions.
ReSplat 是一种无需计算梯度的前馈递归高斯点云模型,通过高斯点云渲染误差作为反馈信号来指导递归网络进行迭代细化。通过使用紧凑的重建模型初始化过程,它在各种数据集、视图数量和分辨率下实现了最先进的性能,同时使用更少的高斯模型并加快了渲染速度。
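Illustrative arithmetic (not from the paper): the 16x reduction in Gaussian count can be made concrete under two assumptions, namely that a per-pixel baseline predicts one Gaussian per input pixel per view, and that the 16x factor applies to the total count. The helper name `gaussian_counts` is hypothetical.

```python
def gaussian_counts(num_views, height, width, subsample=16):
    """Compare Gaussian counts of a per-pixel model and a subsampled one.

    Assumes one Gaussian per pixel per view for the baseline and a total
    reduction factor of `subsample` for the compact model.
    """
    per_pixel = num_views * height * width   # one Gaussian per input pixel
    compact = per_pixel // subsample         # 16x fewer under the stated assumption
    return per_pixel, compact

# 8 input views at 256 x 256 resolution
print(gaussian_counts(8, 256, 256))  # (524288, 32768)
```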
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Authors: Abhinaba Basu, Pavan Chakraborty
First: 2026-03-12T17:13:25+00:00 · Latest: 2026-03-12T17:13:25+00:00
Abstract
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening - a 25% improvement in discovery yield.
中文标题/摘要
标题:证明携带材料:机器学习原子势能的安全证书
机器学习原子势能(MLIPs)被用于高通量材料筛选,但缺乏正式的可靠性保证。我们展示了在25,000种材料基准测试中,单一MLIP用作稳定性筛选器时,会错过93%的密度泛函理论(DFT)稳定的材料(召回率0.07)。通过三个阶段的证明携带材料(PCM)来弥补这一差距:在组成空间中的对抗性反驳、95%置信区间的引导包络细化以及Lean 4形式化认证。审计CHGNet、TensorNet和MACE揭示了特定架构的盲点,配对误差相关性接近零(r <= 0.13;n = 5,000),并通过独立的Quantum ESPRESSO验证得到确认(20/20收敛;DFT/CHGNet力的中位数比值12倍)。基于PCM发现特征训练的风险模型在未见材料上预测失败(AUC-ROC = 0.938 +/- 0.004)并跨架构转移(跨MLIP AUC-ROC ~ 0.70;特征重要性r = 0.877)。在热电材料筛选案例研究中,PCM审计的协议发现了62种被单一MLIP筛选遗漏的稳定材料,提高了25%的发现率。
Summary / 总结
The paper addresses the lack of reliability guarantees for machine-learned interatomic potentials (MLIPs) used in materials screening. It introduces Proof-Carrying Materials (PCM), which involves adversarial falsification, bootstrap envelope refinement, and formal certification to ensure the safety of MLIPs. Key findings include the identification of architecture-specific blind spots in CHGNet, TensorNet, and MACE, and the development of a risk model that predicts failures with high accuracy (AUC-ROC = 0.938).
研究针对机器学习原子势(MLIPs)在材料筛选中缺乏可靠性保证的问题,提出了证明携带材料(PCM)的方法,包括对抗性反驳、Bootstrap边界细化和形式化认证以确保安全性。关键发现包括在CHGNet、TensorNet和MACE中识别出特定架构的盲点,并开发了一种风险模型,能够以高精度预测失败,通过PCM审核的筛选协议在热电材料筛选案例研究中发现了62种额外的稳定材料,提高了25%的发现率。
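Illustrative arithmetic (not from the paper): the two headline numbers in the abstract can be unpacked with the standard recall definition. The counts 70 and 930 are hypothetical, chosen only to reproduce a recall of 0.07; the implied baseline assumes the 25% improvement is measured relative to the single-MLIP yield, which the abstract does not state explicitly.

```python
def recall(true_positives, false_negatives):
    """Fraction of truly stable materials that the filter keeps."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts: 70 of 1,000 DFT-stable materials pass the filter.
r = recall(70, 930)
missed_fraction = 1.0 - r                 # share of stable materials that are missed

# If 62 additional materials correspond to a 25% gain over the baseline yield,
# the baseline would have found 62 / 0.25 materials.
implied_baseline = 62 / 0.25

print(round(r, 2), round(missed_fraction, 2), implied_baseline)  # 0.07 0.93 248.0
```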
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
First: 2026-03-12T17:11:22+00:00 · Latest: 2026-03-12T17:11:22+00:00
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
中文标题/摘要
标题:战略导航还是随机搜索?代理和人类在处理文档集合时如何推理
多模态代理为自动化复杂文档密集型工作流提供了有希望的途径。然而,一个关键问题仍然存在:这些代理是否展示了真正的战略推理,还是仅仅进行了随机的尝试和错误搜索?为了解决这个问题,我们引入了MADQA基准,包含2,250个人撰写的基于800份异构PDF文档的问题。根据经典测验理论,我们设计它以最大化在不同代理能力水平上的区分力。为了评估代理行为,我们引入了一种新的评估协议,衡量准确性和努力之间的权衡。使用这种框架,我们表明,虽然最好的代理在纯准确度上可以与人类搜索者匹敌,但它们回答的问题类型不同,并依赖于暴力搜索来弥补薄弱的战略规划。它们未能缩小与Oracle性能近20%的差距,持续陷入无成效的循环。我们发布了数据集和评估框架,以帮助促进从暴力检索向校准、高效的推理过渡。
Summary / 总结
The study aims to determine whether multimodal agents use strategic reasoning or merely stochastic search when navigating document collections. MADQA, a benchmark of 2,250 human-authored questions based on 800 diverse PDF documents, was developed to evaluate agentic behavior. The evaluation protocol measures the accuracy-effort trade-off. Results show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions, rely on brute-force search rather than strategic planning, and fail to close the nearly 20% gap to oracle performance, indicating a need for improved strategic reasoning capabilities.
研究旨在确定多模态代理在导航文档集合时是使用战略推理还是仅使用随机搜索。基于800份不同类型的PDF文档,开发了包含2,250个人撰写的测试问题的MADQA基准,以评估代理行为。评估协议衡量准确率与努力之间的权衡。结果显示,尽管最佳代理在准确率上可以与人类搜索者匹敌,但它们依赖于穷举搜索而非战略规划,并未能缩小与理想性能之间的近20%差距,表明需要改进战略推理能力。
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
First: 2026-03-12T17:09:20+00:00 · Latest: 2026-03-12T17:09:20+00:00
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
中文标题/摘要
标题:BehaviorVLM:统一的无需微调的行为理解与视觉-语言推理
理解自由移动的动物行为是神经科学的核心,其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而,这两个任务仍然严重依赖于人工注释或不稳定的无监督管道,限制了其可扩展性和可重复性。我们提出了BehaviorVLM,这是一种统一的视觉-语言框架,用于姿态估计和行为理解,无需特定任务的微调和最少的人工标注,通过引导预训练的视觉-语言模型(VLMs)进行详细的、明确的和可验证的推理步骤。对于姿态估计,我们利用量子点标注的行为数据,并提出了一种多阶段管道,结合了时间、空间和跨视图推理。这种设计大大减少了人工标注的工作量,通过几何检查如重投影误差暴露了低置信度的标签,并生成了可以稍后过滤、修正或用于微调下游姿态模型的标签。对于行为理解,我们提出了一种管道,结合了深度嵌入聚类以发现过度分割的行为,基于VLM的每段视频字幕,以及基于LLM的推理以合并和语义标注行为片段。行为管道可以直接从视觉信息运行,不需要关键点来分割行为。这些组件共同实现了可扩展、可解释和标签轻的行为分析。
Summary / 总结
The research aims to improve the scalability and reproducibility of understanding freely moving animal behavior in neuroscience. It introduces BehaviorVLM, a unified vision-language framework that uses pretrained models and detailed reasoning steps to perform pose estimation and behavioral understanding without task-specific fine-tuning. For pose estimation, it employs a multi-stage pipeline with temporal, spatial, and cross-view reasoning, reducing human annotation effort and enabling geometric checks. For behavioral understanding, it integrates deep clustering, VLM-based video captioning, and LLM-based reasoning to discover and label behaviors directly from visual information, without needing keypoints. This approach enables scalable, interpretable, and label-light analysis of multi-animal behavior.
研究旨在通过使用名为BehaviorVLM的统一视觉语言框架来提高理解动物行为在神经科学中的可扩展性和可重复性。该框架利用预训练模型和详细的推理步骤来进行姿态估计和行为理解,无需特定任务的微调或大量的人工标注。方法包括一个结合了时间、空间和跨视图推理的多阶段姿态估计管道,以及一个使用嵌入式聚类、基于VLM的视频字幕和基于LLM的推理来直接从视觉信息中发现和标注行为的行为理解管道。关键发现包括减少了人工标注的工作量,暴露了低置信度的标签,并能够进行可扩展且可解释的多动物行为分析。
Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Authors: Masoumeh Sharafi, Soufiane Belharbi, Muhammad Osama Zeeshan, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
First: 2025-08-08T20:13:50+00:00 · Latest: 2026-03-12T17:05:14+00:00
Abstract
Facial expression recognition (FER) models are widely used in video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting performance in real-world settings. Source-free domain adaptation (SFDA) has been proposed to personalize a pretrained source model using only unlabeled target data, avoiding privacy, storage, and transmission constraints. We address a particularly challenging setting where source data is unavailable and the target data contains only neutral expressions. Existing SFDA methods are not designed for adaptation from a single target class, while generating non-neutral facial images is often unstable and expensive. To address this, we propose Source-Free Domain Adaptation with Personalized Feature Translation (SFDA-PFT), a lightweight latent-space approach. A translator is first pretrained on source data to map subject-specific style features between subjects while preserving expression information through expression-consistency and style-aware objectives. It is then adapted to neutral target data without source data or image synthesis. By operating in the latent space, SFDA-PFT avoids noisy facial image generation, reduces computation, and learns discriminative embeddings for classification. Experiments on BioVid, StressID, BAH, and Aff-Wild2 show that SFDA-PFT consistently outperforms state-of-the-art SFDA methods in privacy-sensitive FER scenarios. Our code is publicly available at: https://github.com/MasoumehSharafi/SFDA-PFT
中文标题/摘要
标题:个性化特征翻译在表情识别中的应用:一种高效的源无域适应方法
面部表情识别(FER)模型在基于视频的情感计算应用中广泛应用,如人机交互和健康监测。然而,深度FER模型在识别细微表情和处理高个体间变异性方面常常表现不佳,限制了其在实际场景中的性能。源无域适应(SFDA)已被提出,通过仅使用未标记的目标数据来个性化预训练的源模型,从而避免隐私、存储和传输方面的限制。我们解决了一个特别具有挑战性的场景,即源数据不可用且目标数据仅包含中性表情。现有的SFDA方法未设计用于从单一目标类别进行适应,而生成非中性面部图像往往不稳定且昂贵。为了解决这一问题,我们提出了一种轻量级的潜在空间方法——源无域适应与个性化特征翻译(SFDA-PFT)。首先在源数据上预训练一个翻译器,以在不同个体之间映射个体特定的风格特征,同时通过表情一致性目标和风格意识目标保留表情信息。然后,该翻译器无需源数据或图像合成即可适应中性目标数据。通过在潜在空间中操作,SFDA-PFT避免了嘈杂的面部图像生成,减少了计算量,并学习了用于分类的判别嵌入。在BioVid、StressID、BAH和Aff-Wild2上的实验表明,SFDA-PFT在隐私敏感的FER场景中始终优于最先进的SFDA方法。我们的代码已公开发布于:https://github.com/MasoumehSharafi/SFDA-PFT
Summary / 总结
The paper addresses the challenge of facial expression recognition (FER) in real-world settings where deep models struggle with subtle expressions and high inter-subject variability. It introduces SFDA-PFT, a lightweight method for source-free domain adaptation that uses only unlabeled target data. By pretraining a translator on source data and adapting it to neutral target data, SFDA-PFT avoids generating noisy facial images, reduces computation, and improves discriminative embeddings, leading to better performance in privacy-sensitive scenarios compared to existing methods.
论文针对现实环境中深度模型在处理细微表情和高个体差异时遇到的挑战,提出了一种名为SFDA-PFT的轻量级源无域适应方法,仅使用目标数据中的未标记数据。通过在源数据上预训练一个翻译器,并将其适应到目标数据中的中性表情,SFDA-PFT避免生成噪声面部图像,减少计算量,并提高分类的判别性嵌入,从而在隐私敏感的场景中优于现有方法。
LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu
First: 2026-03-12T17:01:23+00:00 · Latest: 2026-03-12T17:01:23+00:00
Abstract
Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
中文标题/摘要
标题:LatentGeo:学习型辅助构造在潜在空间中的多模态几何推理
尽管在多模态推理方面取得了近期进展,但在多模态大型语言模型(MLLMs)中表示辅助几何构造仍然是一个基本挑战。这些构造不在原始图中,必须在应用定理之前引入。现有方法主要依赖显式的构造范式,包括基于文本的几何规范、推理期间的视觉标记交错以及工具增强的几何执行。然而,这些方法要么无法忠实表示复杂的空间关系,要么在离散符号和连续几何结构之间产生表示不匹配,要么依赖外部能力,阻碍端到端优化。为了解决这些限制,我们提出了LatentGeo框架,该框架学习连续的潜在视觉表示,以在不进行像素级渲染或外部执行的情况下内化辅助几何构造。我们设计了一个三阶段课程,通过辅助视觉监督逐步对齐和内化这些潜在表示,随后是LaGDPO,一种潜在感知的强化学习过程,在策略优化过程中稳定潜在表示,同时提高最终任务的正确性。为了系统地评估构造中心的表示质量,我们引入了GeoAux,这是一个针对视觉依赖几何问题的新基准,并在GeoAux和MathVerse上进行了实验。结果显示,LatentGeo在几何推理任务中取得了显著的提升,特别是在需要辅助构造的任务中。广泛的分析和消融研究进一步验证了我们框架中每个组件的有效性。
Summary / 总结
LatentGeo addresses the challenge of representing auxiliary geometric constructions in multimodal large language models by learning continuous latent visual representations. It uses a three-stage curriculum and a latent-aware reinforcement learning procedure to internalize these representations without pixel-level rendering or external executors. Experiments on GeoAux and MathVerse show that LatentGeo significantly improves geometric reasoning tasks, especially those requiring auxiliary constructions.
研究旨在解决多模态大型语言模型中表示辅助几何构造的挑战。LatentGeo 提出了一种框架,通过学习连续的潜在视觉表示来内部化这些构造,而无需像素级渲染或外部执行器。该方法包括一个三阶段课程和一种潜在感知的强化学习过程。实验结果表明,LatentGeo 在几何推理任务中显著提高了性能,特别是在需要辅助构造的任务上。
A Variational Latent Equilibrium for Learning in Neuronal Circuits
Authors: Simon Brandt, Paul Haider, Walter Senn, Federico Benitez, Mihai A. Petrovici
First: 2026-03-10T12:44:48+00:00 · Latest: 2026-03-12T16:55:52+00:00
Abstract
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
中文标题/摘要
标题:神经环路学习中的变分潜均衡
大脑在识别和生成复杂时空模式方面仍然无与伦比。尽管人工智能能够复制一些这些能力,但深度学习算法在很大程度上与我们对大脑环路和动力学的理解相矛盾。这在通过时间反向传播(BPTT)中表现得尤为明显,它是学习复杂时间依赖性的首选算法。在这项工作中,我们提出了一种通用的形式化方法,以在受控且生物上合理的条件下近似BPTT。我们的方法建立在、统一并扩展了几个先前基于能量守恒和极值作用原则的局部、时间连续、无相位时空信用分配方法之上。我们的出发点是一个前瞻性的神经元状态能量函数,从中我们计算出时间连续神经网络的实时误差动力学。在一般情况下,这为神经网络提供了简单且直接的伴随方法结果的推导,这是BPTT的时间连续等价物。通过一些修改,我们可以将其转化为空间和时间上完全局部的神经元和突触动力学方程组。我们的理论为大脑中的时空深度学习提供了一个严谨的框架,同时建议了一种能够执行这些计算的物理电路蓝图。这些结果重新定义并扩展了最近提出的广义潜均衡(GLE)模型。
Summary / 总结
This research aims to bridge the gap between artificial intelligence and brain circuitry by proposing a new method to approximate backpropagation through time (BPTT) in a biologically plausible manner. The method builds on previous approaches to derive real-time error dynamics for neuronal networks, leading to a fully local set of equations for neuron and synapse dynamics. Key findings include a rigorous framework for spatiotemporal deep learning in the brain and a blueprint for physical circuits capable of these computations, extending the Generalized Latent Equilibrium (GLE) model.
研究旨在提出一种以生物学上合理的方式逼近通过时间反向传播(BPTT)的新方法,以弥合人工智能与大脑环路之间的差距。该方法基于先前的方法推导出神经网络的实时误差动力学,从而得到神经元和突触动力学的完全局部方程。主要发现包括为大脑中的时空深度学习提供了一个严谨的框架,并为能够执行这些计算的物理电路提供了蓝图,扩展了广义潜均衡(GLE)模型。
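For orientation, the adjoint method that the abstract calls the time-continuous equivalent to BPTT can be written in its standard textbook form. This is a generic sketch, not the paper's derivation: the dynamics $f$, running loss $\ell$, parameters $\theta$, state $x$ and adjoint variable $\lambda$ are illustrative symbols assumed here, and the paper's prospective energy function is not reproduced.

```latex
% Assumed setup: state dynamics and an integrated loss over [0, T]
\dot{x}(t) = f\bigl(x(t), \theta, t\bigr), \qquad
L = \int_0^T \ell\bigl(x(t), t\bigr)\, dt

% Adjoint (error) dynamics, integrated backward in time from \lambda(T) = 0
\dot{\lambda}(t) = -\left(\frac{\partial f}{\partial x}\right)^{\!\top} \lambda(t)
                   - \left(\frac{\partial \ell}{\partial x}\right)^{\!\top}

% Parameter gradient accumulated along the trajectory
\frac{dL}{d\theta} = \int_0^T \lambda(t)^{\top}\, \frac{\partial f}{\partial \theta}\, dt
```

The backward-in-time integration of $\lambda$ is what makes the plain adjoint method non-local in time; the abstract states that the proposed formalism modifies this into equations that are local in both space and time.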
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
First: 2026-03-12T16:53:06+00:00 · Latest: 2026-03-12T16:53:06+00:00
Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
中文标题/摘要
标题:GlyphBanana:通过自主工作流提升精确文本渲染
尽管生成模型的最新进展在文本渲染方面取得了显著进步,但准确生成复杂文本和数学公式仍然是一项艰巨的挑战。这一困难主要源于当前模型在遇到分布外提示时有限的指令遵循能力。为了解决这一问题,我们引入了GlyphBanana,并设计了一个专门用于渲染复杂字符和公式的基准测试。GlyphBanana采用了一种自主工作流,通过集成辅助工具将字形模板注入到潜在空间和注意力图中,促进生成图像的迭代优化。值得注意的是,我们的无训练方法可以无缝应用于各种文本到图像(T2I)模型,与现有基线相比,实现了更高的精度。大量实验表明了我们提出的工作流的有效性。相关代码已公开发布在https://github.com/yuriYanZeXuan/GlyphBanana。
Summary / 总结
The research aims to improve the precision of text rendering, especially for complex characters and formulas, by addressing the limitations of current generative models in handling out-of-distribution prompts. GlyphBanana introduces an agentic workflow that integrates auxiliary tools to inject glyph templates into the latent space and attention maps, enabling iterative refinement. Experiments show that this approach achieves superior precision compared to existing methods without requiring training.
研究旨在通过解决当前生成模型在处理分布外提示时的局限性,提高复杂字符和公式的文本渲染精度。GlyphBanana引入了一种自主工作流,通过集成辅助工具将字形模板注入潜在空间和注意力图中,实现迭代优化。实验表明,该无训练方法的精度优于现有方法。
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar
First: 2026-03-12T16:49:21+00:00 · Latest: 2026-03-12T16:49:21+00:00
Comments: 29 pages, 27 figures. Under review
Abstract
While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.
中文标题/摘要
标题:IsoCompute 指南:优化大型语言模型(LLM)强化学习(RL)采样计算的扩展
虽然缩放定律指导了大型语言模型(LLM)预训练中的计算分配,但对于 LLM 强化学习(RL)后训练,类似的指导原则仍然缺乏深入理解。我们研究了 LLM 中 on-policy RL 方法的采样计算的计算最优分配,将缩放问题表述为在三种资源(每个问题的并行 rollout 数量、每批次的问题数量和更新步数)上的计算受限优化问题。我们发现,每个问题的计算最优并行 rollout 数量随着计算预算的增加而可预测地增加,然后饱和。这一趋势在简单和困难的问题上都成立,但驱动机制不同:简单问题上是解的锐化,困难问题上是覆盖范围的扩展。我们还表明,增加并行 rollout 数量可以减轻问题间的干扰,而每批次的问题数量主要影响训练稳定性,并可以在较宽的范围内选择。我们的结果在不同基础模型和数据分布上得到验证,将 RL 缩放定律重新表述为规定性的分配规则,并为计算高效的 LLM RL 后训练提供了实用指导。
Summary / 总结
The study addresses the lack of scaling laws for reinforcement learning (RL) post-training of large language models (LLMs) by optimizing the allocation of sampling compute for on-policy RL methods. It frames the problem as a constrained optimization over parallel rollouts, problems per batch, and update steps. Key findings include the predictable increase and saturation of optimal parallel rollouts with compute budget, and the different mechanisms driving this trend on easy and hard problems. The study also highlights that increasing parallel rollouts reduces interference between problems, while the number of problems per batch affects training stability. These results provide practical guidance for efficient RL scaling in LLMs.
论文探讨了在LLMs中使用on-policy RL方法时采样计算的最优分配,将扩展视为在并行rollout数量、每批问题数量和更新步数三种资源上的约束优化。关键发现包括随计算预算增加,最优并行rollout数量呈可预测增长并最终饱和,增加并行rollout可以减少跨问题的干扰,而每批问题数量主要影响训练稳定性。这些结果为LLMs的RL后训练提供了高效的计算分配指导。
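The three-resource framing above can be made concrete with a small sketch that lists allocations spending the same sampling budget. The cost model used here (budget = rollouts per problem × problems per batch × update steps) is an assumption for illustration, not a formula from the abstract, and the function name `iso_compute_allocations` is hypothetical.

```python
from itertools import product


def iso_compute_allocations(budget, rollouts_grid, problems_grid):
    """List (rollouts, problems, steps) triples that spend exactly `budget` samples.

    Assumption (not from the paper): total sampling cost is
    rollouts_per_problem * problems_per_batch * update_steps.
    """
    allocations = []
    for n, b in product(rollouts_grid, problems_grid):
        samples_per_step = n * b
        # Keep only allocations where the budget divides into whole update steps.
        if budget % samples_per_step == 0:
            allocations.append((n, b, budget // samples_per_step))
    return allocations


allocs = iso_compute_allocations(
    budget=4096, rollouts_grid=[4, 8, 16], problems_grid=[32, 64]
)
for n, b, m in allocs:
    print(f"rollouts={n:3d}  problems={b:3d}  steps={m:4d}")
```

Each printed row is one candidate on the same iso-compute curve; comparing trained-model performance across such rows is the kind of experiment the abstract describes, though the actual grids and budgets used by the authors are not given here.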
Linking Perception, Confidence and Accuracy in MLLMs
Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu
First: 2026-03-12T16:47:42+00:00 · Latest: 2026-03-12T16:47:42+00:00
Comments: Accepted by CVPR2026
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
中文标题/摘要
标题:MLLM中感知、信心与准确性的关联
近年来,多模态大型语言模型(MLLMs)的进展主要集中在增强视觉感知以提高准确性。然而,一个关键问题尚未得到探索:模型是否知道自己不知道?通过一项探针实验,我们揭示了MLLMs中严重的信心失校准问题。为了解决这一问题,我们提出了信心驱动强化学习(CDRL),它使用原始噪声图像对和一种新颖的信心基奖励来增强感知灵敏度并稳健校准模型的信心。除了训练收益外,校准的信心还能够作为免费午餐实现更有效的测试时扩展。我们进一步提出了信心感知测试时扩展(CA-TTS),它通过信心信号动态协调自我一致性、自我反思和视觉自我检查模块。专家模型在多个角色(如规划者、评论者、投票者)中发挥作用,调度这些模块并提供外部验证。我们集成的框架在四个基准测试中建立了新的最佳结果,一致地提高了8.8%。更多的消融研究证明了每个模块的有效性和扩展优势。
Summary / 总结
The study addresses confidence miscalibration in Multi-modal Large Language Models (MLLMs) by proposing Confidence-Driven Reinforcement Learning (CDRL) and Confidence-Aware Test-Time Scaling (CA-TTS). CDRL uses original-noise image pairs and a confidence-based reward to improve perceptual sensitivity and calibrate model confidence. CA-TTS dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. The integrated framework achieves consistent 8.8% gains across four benchmarks, and ablation studies support the contribution of each module.
研究通过提出信心驱动的强化学习(CDRL)和信心感知的测试时缩放(CA-TTS)来解决多模态大型语言模型(MLLMs)的信心失校准问题。CDRL 使用原始噪声图像对和基于信心的奖励来提高感知灵敏度并校准模型的信心。CA-TTS 动态协调由信心信号引导的模块,实现了四个基准上一致的 8.8% 收益,并展示了每个模块的有效性和缩放优势。
EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
Authors: Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu
First: 2026-03-12T16:46:01+00:00 · Latest: 2026-03-12T16:46:01+00:00
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
中文标题/摘要
标题:EgoIntent:理解“做什么、为什么、下一步”的第一人称视角步骤级基准
多模态大型语言模型(MLLMs)在多种任务中展示了卓越的视频推理能力。然而,它们在第一人称视角视频中对人类意图进行细粒度理解的能力仍然鲜有探索。现有基准主要关注情节级意图推理,忽视了更细粒度的步骤级意图理解。然而,智能助手、机器人模仿学习和增强现实指导等应用不仅需要理解一个人在每一步做什么,还需要理解为什么这样做以及接下来要做什么,以便提供及时且符合情境的支持。为此,我们引入了EgoIntent,一个面向第一人称视角视频的步骤级意图理解基准。它包含3,014个步骤,涵盖15种不同的室内和室外日常生活场景,并从三个互补维度评估模型:局部意图(What)、全局意图(Why)和下一步计划(Next)。关键的是,每个片段都在所查询步骤的关键结果(如接触或抓取)发生之前立即截断,且不包含后续步骤的任何画面,从而防止未来帧泄漏,使对前瞻性步骤理解和下一步规划的评估更加干净。我们评估了15个MLLMs,包括最先进的闭源和开源模型。即使表现最好的模型,在三个意图维度上的平均得分也仅为33.31,这表明第一人称视角视频中的步骤级意图理解仍然是一个极具挑战性的问题,需要进一步研究。
Summary / 总结
EgoIntent is a step-level benchmark for understanding human intent in egocentric videos, addressing the need for finer granularity in video reasoning. It includes 3,014 steps from 15 diverse scenarios and evaluates models on local intent (What), global intent (Why), and next-step planning (Next). Of the 15 MLLMs evaluated, even the best scores only 33.31 on average across the three dimensions, highlighting the challenge of step-level intent understanding in egocentric videos.
EgoIntent 是一个用于理解 egocentric 视频中人类意图的步骤级基准,旨在满足智能助手等应用对细粒度意图推理的需求。它包含来自 15 个场景的 3,014 个步骤,并从局部意图、全局意图和下一步计划三个方面评估模型。即使是最佳模型,在这三个意图维度上的平均得分也仅为 33.31,表明该任务仍然极具挑战性,需要进一步研究。
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
First: 2026-03-12T16:45:53+00:00 · Latest: 2026-03-12T16:45:53+00:00
Comments: Accepted by CVPR2026
Abstract
Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
中文标题/摘要
标题:FlashMotion:基于轨迹引导的少步可控视频生成
近期在轨迹可控视频生成方面的进展取得了显著成果。以往的方法主要依赖基于适配器的架构,以精确控制沿预定义轨迹的运动。然而,所有这些方法都依赖于多步去噪过程,导致时间冗余和计算开销。虽然现有的视频蒸馏方法成功地将多步生成器蒸馏为少步版本,但直接将这些方法应用于轨迹可控视频生成会导致视频质量和轨迹精度显著下降。为解决这一问题,我们提出了FlashMotion,一种专为少步轨迹可控视频生成设计的新训练框架。我们首先在多步视频生成器上训练一个轨迹适配器,以实现精确的轨迹控制。然后,我们将生成器蒸馏为少步版本以加速视频生成。最后,我们使用结合扩散和对抗目标的混合策略微调适配器,使其与少步生成器对齐,以生成高质量、轨迹准确的视频。为了评估,我们引入了FlashBench,这是一个针对长序列轨迹可控视频生成的基准,它衡量不同前景对象数量下的视频质量和轨迹精度。实验表明,FlashMotion在视觉质量和轨迹一致性方面均优于现有的视频蒸馏方法和之前的多步模型。
Summary / 总结
FlashMotion is a novel training framework for few-step trajectory-controllable video generation. It first trains a trajectory adapter on a multi-step video generator and then distills the generator into a few-step version. Finally, it fine-tunes the adapter using a hybrid strategy to produce high-quality, trajectory-accurate videos. Experiments show that FlashMotion outperforms existing methods in both visual quality and trajectory consistency.
FlashMotion 是一种新型的用于少步骤轨迹可控视频生成的训练框架。它首先在多步骤视频生成器上训练轨迹适配器,然后将生成器蒸馏为少步骤版本。最后,通过结合扩散和对抗目标的混合策略微调适配器,以生成高质量、轨迹准确的视频。实验表明,FlashMotion 在视觉质量和轨迹一致性方面均优于现有方法。
Automatic Generation of High-Performance RL Environments
Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin
First: 2026-03-12T16:45:47+00:00 · Latest: 2026-03-12T16:45:47+00:00
Comments: 26 pages, 9 figures, 8 tables
Abstract
Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
中文标题/摘要
标题:自动生成高性能RL环境
将复杂的强化学习(RL)环境转换为高性能实现,传统上需要数月的专门工程工作。我们提出了一套可复用的方案(通用提示模板、分层验证和迭代的智能体辅助修复),以不到10美元的计算成本生成语义等效的高性能环境。我们在五个环境中展示了三种不同的工作流程。直接转换(此前不存在高性能实现):EmuRust(通过Rust并行化为Game Boy模拟器实现1.5倍的PPO加速)和PokeJAX,这是第一个GPU并行的宝可梦对战模拟器(随机动作500M SPS,PPO 15.2M SPS;是TypeScript参考实现的22,320倍)。对照现有高性能实现验证的转换:在匹配的GPU批处理大小下,吞吐量与MJX持平(1.04倍),是Brax的5倍(HalfCheetah JAX);PPO达到42倍(Puffer Pong)。新环境创建:TCGJax,第一个可部署的JAX宝可梦TCG引擎(随机动作717K SPS,PPO 153K SPS;是Python参考实现的6.6倍),根据从网页提取的规范合成。在200M参数规模下,环境开销降至训练时间的4%以下。分层验证(属性、交互和rollout测试)确认了全部五个环境的语义等效性;跨后端策略迁移确认了全部五个环境的sim-to-sim差距为零。TCGJax由一个未出现在公共代码库中的私有参考合成,可作为针对智能体预训练数据污染问题的对照。论文包含足够的细节(包括代表性提示、验证方法和完整结果),使编码智能体可以直接根据论文复现这些转换。
Summary / 总结
The research aims to automate the generation of high-performance reinforcement learning environments, which traditionally required months of specialized engineering. The method combines a generic prompt template, hierarchical verification, and iterative agent-assisted repair, producing semantically equivalent high-performance environments for less than $10 in compute cost. Key findings include a 1.5x PPO speedup for EmuRust using Rust parallelism, throughput of 500M SPS with random actions and 15.2M SPS with PPO for PokeJAX, and a 6.6x speedup for TCGJax over the Python reference. Hierarchical verification confirmed semantic equivalence for all five environments, and cross-backend policy transfer showed no sim-to-sim gap. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.
研究旨在自动化生成高性能强化学习环境,传统上这需要专门的工程工作和数月的时间。方法包括通用提示模板、分层验证和迭代的智能体辅助修复,以低于10美元的计算成本生成语义等效的高性能环境。关键发现包括EmuRust使用Rust并行化实现1.5倍的PPO加速,PokeJAX达到随机动作500M SPS和PPO 15.2M SPS,以及TCGJax相对于Python参考实现的6.6倍加速。分层验证确认了所有环境的语义等效性,跨后端策略迁移显示没有sim-to-sim差距。TCGJax可作为针对智能体预训练数据污染问题的对照。
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
First: 2026-03-12T16:45:42+00:00 · Latest: 2026-03-12T16:45:42+00:00
Comments: The source code will be made publicly available at https://github.com/MengfeiD/O3N
Abstract
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
中文标题/摘要
标题:O3N:全方位开放式词汇占用预测
通过全方位感知理解并重构3D世界是自主代理和具身智能发展中不可避免的趋势。然而,现有的3D占用预测方法受限于有限视角输入和预定义的训练分布,难以应用于需要全面和安全场景感知的具身代理在开放世界探索中。为解决这一问题,我们提出了O3N,这是首个纯视觉、端到端的全方位开放式词汇占用预测框架。O3N通过Polar-spiral Mamba (PsM) 模块嵌入全方位体素,以极螺旋拓扑结构实现连续的空间表示和360°范围内的长程上下文建模。Occupancy Cost Aggregation (OCA) 模块引入了一种原理性的机制,用于在体素空间内统一几何和语义监督,确保重建几何与底层语义结构之间的一致性。此外,Natural Modality Alignment (NMA) 建立了一种无梯度对齐路径,协调视觉特征、体素嵌入和文本语义,形成一致的“像素-体素-文本”表示三元组。在多个模型上的广泛实验表明,我们的方法不仅在QuadOcc和Human360Occ基准测试中达到了最先进的性能,还展示了出色的跨场景泛化能力和语义可扩展性,为通用3D世界建模铺平了道路。源代码将在https://github.com/MengfeiD/O3N公开。
Summary / 总结
O3N is an end-to-end framework for omnidirectional open-vocabulary occupancy prediction, addressing limitations of existing methods by incorporating continuous spatial representation and long-range context modeling. It uses a Polar-spiral Mamba module for omnidirectional voxel embedding and an Occupancy Cost Aggregation module for consistent geometric and semantic supervision. The Natural Modality Alignment module aligns visual features, voxel embeddings, and text semantics. Experiments show O3N outperforms existing methods on QuadOcc and Human360Occ benchmarks and demonstrates strong cross-scene generalization and semantic scalability.
O3N 是一种端到端的全景开放词汇占用预测框架,通过引入连续的空间表示和长程上下文建模来解决现有方法的局限性。它使用了Polar-spiral Mamba模块进行全景体素嵌入,并使用Occupancy Cost Aggregation模块确保几何和语义的一致性监督。Natural Modality Alignment模块则实现了视觉特征、体素嵌入和文本语义的一致对齐。实验表明,O3N 在 QuadOcc 和 Human360Occ 基准上优于现有方法,并且具有强大的跨场景泛化能力和语义扩展性。