EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
Venue: CVPR 2026
First: 2026-03-12T17:59:59+00:00 · Latest: 2026-03-12T17:59:59+00:00
Comments: Accepted by CVPR 2026. Project page: https://silentview.github.io/EVATok/
Abstract
Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
中文标题/摘要
标题:EVATok:高效自适应视频分词以实现高效的视觉自回归生成
自回归(AR)视频生成模型依赖于将像素压缩为离散的分词序列的视频分词器。这些分词序列的长度对于平衡重建质量与下游生成计算成本至关重要。传统的视频分词器在不同视频的时间块上应用统一的分词分配,往往在简单、静态或重复的段落上浪费分词,而在动态或复杂的段落上则服务不足。为了解决这种低效率,我们引入了**EVATok**,一种生成**E**fficient **V**ideo **A**daptive **Tok**enizers的框架。我们的框架为每个视频估计最佳的分词分配以实现最佳的质量-成本权衡,开发了轻量级路由器以快速预测这些最佳分配,并训练基于路由器预测的分配编码的自适应分词器。我们证明EVATok在视频重建和下游AR生成方面实现了显著的效率和整体质量提升。通过结合先进的训练食谱和视频语义编码器,EVATok在UCF-101上实现了优于先前最先进的LARP和我们固定长度基线的重建和类别到视频生成,平均分词使用量节省了至少24.4%。
Summary / 总结
EVATok is designed to improve the efficiency of video tokenization in autoregressive video generative models by estimating optimal token assignments for each video. It uses lightweight routers to predict these assignments and trains adaptive tokenizers accordingly. Experiments show that EVATok cuts average token usage by at least 24.4% relative to the prior state-of-the-art LARP and the authors' fixed-length baseline, while delivering superior reconstruction and state-of-the-art class-to-video generation on UCF-101.
EVATok 是一种框架,旨在通过适应性地为视频的不同部分分配令牌来提高自回归视频生成的效率和质量。它为每个视频估计最优的令牌分配,使用轻量级路由器进行快速预测,并训练适应性令牌编码器。实验结果表明,与之前的方法相比,EVATok 至少节省了 24.4% 的平均令牌使用量,同时在 UCF-101 上实现了更优的重建和最先进的类别到视频生成。
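The quality-cost trade-off behind adaptive token assignment can be illustrated with a small example. This is a minimal sketch, not the authors' estimator or router: the per-block quality table, the candidate budgets, and the linear cost weight `lam` are all hypothetical values chosen for illustration.

```python
def best_assignment(quality, budgets, lam):
    """Pick, for each temporal block, the budget maximizing quality - lam * tokens.

    quality[b][k] is a hypothetical reconstruction quality of block b
    when it is encoded with budgets[k] tokens.
    """
    assignment = []
    for block_quality in quality:
        scores = [q - lam * n for q, n in zip(block_quality, budgets)]
        assignment.append(budgets[scores.index(max(scores))])
    return assignment


budgets = [64, 128, 256]           # hypothetical candidate token counts per block
quality = [
    [0.90, 0.92, 0.93],            # static block: extra tokens barely help
    [0.60, 0.80, 0.95],            # dynamic block: extra tokens help a lot
]
assignment = best_assignment(quality, budgets, lam=0.001)
uniform_tokens = len(quality) * max(budgets)
adaptive_tokens = sum(assignment)
print(assignment, adaptive_tokens, uniform_tokens)
```

In this toy setting the static block receives the smallest budget and the dynamic block the largest, so the adaptive assignment spends fewer tokens than a uniform maximum-budget assignment.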
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
Venue: MM
First: 2026-03-12T17:59:56+00:00 · Latest: 2026-03-12T17:59:56+00:00
Comments: Project Page: https://accio-lab.github.io/MM-CondChain
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
中文标题/摘要
标题:MM-CondChain:一个程序验证基准,用于视觉接地的深度组合推理
多模态大型语言模型(MLLMs)越来越多地用于执行视觉工作流,如导航GUI,其中下一步依赖于经过验证的视觉组合条件(例如,“如果出现权限对话框且界面颜色为绿色,则点击允许”),并且过程可能会分支或提前终止。然而,这种能力仍处于评估不足的状态:现有的基准测试主要关注浅组合或独立约束,而不是深层链式组合条件。在本文中,我们引入了MM-CondChain,一个用于视觉接地的深度组合推理基准。每个基准实例组织为多层推理链,其中每一层包含基于视觉证据的非平凡组合条件,并由多个对象、属性或关系构建。为了正确回答,MLLM必须详细地感知图像,在每一步上推理多个视觉元素,并遵循由此产生的执行路径到最终结果。为了大规模构建此类工作流数据,我们提出了一种代理合成流水线:规划者协调逐层生成组合条件,而可验证的程序化中间表示(VPIR)确保每一层的条件是机械可验证的。然后,合成器将这些验证过的层组装成完整的指令。使用此流水线,我们在三个视觉领域构建了基准测试:自然图像、数据图表和GUI轨迹。在一系列MLLM上的实验表明,即使是最强大的模型也只能达到53.33路径F1,随着难度或谓词复杂性的增加,路径F1急剧下降,证实了深度组合推理仍然是一个基本挑战。
Summary / 总结
MM-CondChain is a benchmark for evaluating visually grounded deep compositional reasoning in multimodal large language models (MLLMs). It consists of multi-layer reasoning chains with complex visual conditions that require detailed image perception and reasoning over multiple elements. The benchmark is constructed using an agentic synthesis pipeline involving a Planner, a Verifiable Programmatic Intermediate Representation (VPIR), and a Composer, and covers natural images, data charts, and GUI trajectories. Experiments show that even the strongest MLLM attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows.
MM-CondChain 是一个用于评估 MLLMs 在视觉基础下的深度组合推理能力的基准,每个实例是一个多层推理链,包含复杂的组合条件。该基准使用一个包含规划者、机械可验证的程序中间表示(VPIR)和编译者(Composer)的管道构建。实验显示,即使是最强大的 MLLM 也只能达到 53.33 的 Path F1,特别是在处理复杂条件和硬负例时表现较差。
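A chained conditional workflow with early termination can be pictured as a list of mechanically checkable predicates evaluated in order. This is a minimal sketch under stated assumptions: the `scene` dictionary, the layer names, and the lambda predicates are hypothetical stand-ins, not the benchmark's actual VPIR format.

```python
def run_chain(scene, layers):
    """Evaluate layers in order and stop at the first condition that fails.

    Returns the execution path as (layer_name, condition_held) pairs
    together with a short description of the final outcome.
    """
    path = []
    for name, condition in layers:
        holds = condition(scene)
        path.append((name, holds))
        if not holds:
            return path, f"terminated at {name}"
    return path, "completed"


# Hypothetical structured description of a GUI screenshot.
scene = {"dialog": "permission", "interface_color": "green", "button_count": 2}

# Each layer combines several attributes, so no single check decides it.
layers = [
    ("L1", lambda s: s["dialog"] == "permission" and s["interface_color"] == "green"),
    ("L2", lambda s: s["button_count"] > 2),
    ("L3", lambda s: s["dialog"] != "error"),
]

path, outcome = run_chain(scene, layers)
print(path, outcome)
```

Because every predicate is ordinary code over a structured description, the ground-truth execution path can be computed without a model in the loop, which is the property the abstract attributes to programmatic verification.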
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
First: 2026-03-12T17:59:55+00:00 · Latest: 2026-03-12T17:59:55+00:00
Comments: Technical Report. Project Page: https://go2heart.github.io/omnistream/
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
中文标题/摘要
标题:OmniStream:掌握连续流中的感知、重建和行动
现代视觉代理需要能够在实时流媒体环境中运行的一般性、因果性和物理结构化的表示。然而,当前的视觉基础模型仍然支离破碎,专门化地专注于图像语义感知、离线时间建模或空间几何。本文介绍了OmniStream,这是一种统一的流媒体视觉骨干,能够有效地从多种视觉输入中进行感知、重建和行动。通过结合因果时空注意力和三维旋转位置嵌入(3D-RoPE),我们的模型支持通过持久的KV缓存以帧为单位的在线视频流处理。我们使用结合静态和时间表示学习、流媒体几何重建和视觉-语言对齐的协同多任务框架对OmniStream进行预训练,使用29个数据集。广泛的评估表明,即使在严格冻结骨干的情况下,OmniStream在图像和视频探测、流媒体几何重建、复杂视频和空间推理以及机器人操作(未在训练中出现)方面也能够实现与专门专家一致的竞争性性能。我们的工作不是追求特定基准的主导地位,而是展示了训练一个能够跨语义、空间和时间推理进行泛化的单一、多功能视觉骨干的可行性,即朝着通用视觉理解迈出更具意义的一步,适用于交互式和具身代理。
Summary / 总结
OmniStream is designed to handle real-time streaming environments by integrating perception, reconstruction, and action from diverse visual inputs. It uses causal spatiotemporal attention and 3D rotary positional embeddings to support efficient frame-by-frame processing. Pre-trained with a multi-task framework, OmniStream performs competitively across various tasks including image and video probing, streaming geometric reconstruction, and robotic manipulation, even with a frozen backbone. This demonstrates the potential for a unified vision backbone that generalizes across semantic, spatial, and temporal reasoning for interactive and embodied agents.
OmniStream旨在处理实时流媒体环境,通过整合感知、重建和行动来处理多种视觉输入。它使用因果时空注意力和3D旋转位置嵌入来实现高效的逐帧处理。通过多任务框架预训练,OmniStream在各种任务中表现出竞争力,包括图像和视频探查、几何重建和机器人操作,即使在冻结主干网络的情况下也是如此。
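Frame-by-frame processing with a persistent KV-cache can be sketched with a single attention head. This is a simplified illustration, assuming one token per frame and plain dot-product attention; it omits 3D-RoPE and everything else specific to OmniStream, and the class name is hypothetical.

```python
import math


class StreamingAttention:
    """Single-head causal attention that keeps keys and values across frames."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the new frame to the cache; earlier frames are never recomputed.
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in self.keys]
        # Numerically stable softmax over all cached (past and current) frames.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * val[i] for w, val in zip(weights, self.values)) / total
            for i in range(len(v))
        ]


attn = StreamingAttention()
out_first = attn.step(q=[1.0, 0.0], k=[1.0, 0.0], v=[0.5, 0.5])
out_second = attn.step(q=[0.0, 1.0], k=[0.0, 1.0], v=[1.0, 0.0])
print(out_first, out_second, len(attn.keys))
```

Causality follows from the cache itself: at each step the query can only see frames that have already arrived, so no explicit mask is needed in this streaming form.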
GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang
First: 2026-03-12T17:59:52+00:00 · Latest: 2026-03-12T17:59:52+00:00
Comments: 49 pages, 23 figures, 10 tables; Project Page: https://grade-bench.github.io/, Code: https://github.com/VisionXLab/GRADE, Dataset: https://huggingface.co/datasets/VisionXLab/GRADE
Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
中文标题/摘要
标题:GRADE:评估图像编辑中的学科导向推理
统一的多模态模型旨在实现联合理解、推理和生成,但当前的图像编辑基准主要局限于自然图像和浅显的常识推理,这在结构化、领域特定的约束条件下对这种能力的评估有限。在本文中,我们引入了GRADE,这是首个评估图像编辑中学科导向知识和推理的基准。GRADE 包含了10个学术领域的520个精心挑选的样本,涵盖了自然科学到社会科学。为了支持严格的评估,我们提出了一种多维度评估协议,联合评估学科推理、视觉一致性以及逻辑可读性。在20个最先进的开源和闭源模型上的广泛实验揭示了当前模型在隐含、知识密集型编辑设置下的显著局限性,导致了巨大的性能差距。除了定量评分外,我们还进行了严格的分析和消融实验,以揭示模型的不足之处并识别学科编辑中的约束。总之,GRADE 指出了统一多模态模型未来发展的关键方向,推动了学科导向图像编辑和推理的研究。我们的基准和评估代码已公开发布。
Summary / 总结
This work introduces GRADE, a benchmark for evaluating discipline-informed reasoning in image editing, addressing limitations of current benchmarks. GRADE includes 520 samples from 10 academic domains and evaluates models on Discipline Reasoning, Visual Consistency, and Logical Readability. Experiments on 20 state-of-the-art models reveal significant performance gaps under knowledge-intensive editing settings, highlighting the need for improved multimodal models. Beyond scores, detailed analyses identify model limitations and constraints in disciplinary editing, guiding future research directions.
该研究引入了GRADE,一个用于评估图像编辑中学科知识和推理能力的基准,解决了当前多模态模型的局限性。GRADE 包含来自10个学术领域的520个样本,并从学科推理、视觉一致性和逻辑可读性三个方面评估模型。实验结果显示,在处理隐含、知识密集型编辑任务时,模型存在显著性能差距,强调了该领域模型改进的需求。研究还提供了详细的分析,以识别学科编辑中的模型局限性和约束条件。
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
First: 2026-03-12T17:59:51+00:00 · Latest: 2026-03-12T17:59:51+00:00
Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
中文标题/摘要
标题:视频流思维:视频LLMs可以边看边思考
在线视频大型语言模型(VideoLLMs)在支持响应式、实时交互中发挥关键作用。现有方法侧重于流式感知,缺乏同步逻辑推理流。然而,直接应用测试时缩放方法会导致不可接受的响应延迟。为解决这一权衡,我们提出了视频流思维(VST),一种新的流式视频理解范式。它支持边看边思考机制,在流式传输过程中激活对传入视频片段的推理。此设计通过在视频播放过程中分摊LLM推理延迟来提高及时理解和连贯认知,同时保持实时响应性。此外,我们引入了一个全面的后训练流水线,整合了VST-SFT,结构性地将离线VideoLLM适应因果流式推理,以及VST-RL,通过多轮视频交互环境中的自我探索提供端到端改进。此外,我们设计了一个自动化的训练数据合成流水线,使用视频知识图谱生成高质量的流式问答对,并通过实体关系支撑的流式思维链推理来强化多证据推理和对视频流的持续关注。广泛评估显示,VST-7B在在线基准测试中表现强劲,例如在StreamingBench上得分为79.5%,在OVO-Bench上得分为59.3%。同时,VST在离线长格式或推理基准测试中保持竞争力。与Video-R1相比,VST响应速度快15.7倍,在VideoHolmes上提高了5.4%,显示出更高的效率和在各种视频理解任务中的强大泛化能力。代码、数据和模型将在https://github.com/1ranGuan/VST/发布。
Summary / 总结
The research aims to enhance the real-time interaction capabilities of Video Large Language Models (VideoLLMs) by introducing Video Streaming Thinking (VST), which enables simultaneous video perception and logical reasoning. VST supports a thinking-while-watching mechanism that amortizes reasoning latency over video playback, improving timely comprehension and coherent cognition. The study introduces a comprehensive post-training pipeline integrating VST-SFT and VST-RL, and an automated training-data synthesis pipeline. Extensive evaluations show that VST-7B performs strongly on online benchmarks (79.5% on StreamingBench and 59.3% on OVO-Bench). Compared with Video-R1, it responds 15.7 times faster and improves by 5.4% on VideoHolmes, while maintaining strong generalization across tasks.
研究旨在通过引入Video Streaming Thinking (VST) 来增强视频大型语言模型(VideoLLMs)的实时交互能力,使模型能够同时进行视频感知和逻辑推理。VST 支持边看边思考的机制,提高了及时理解和连贯认知的能力。研究引入了一个综合的后训练管道,结合了 VST-SFT 和 VST-RL,并开发了一个自动化的训练数据合成管道。广泛的评估表明,VST-7B 在 StreamingBench 和 OVO-Bench 等在线基准测试中表现出色,响应速度比 Video-R1 快 15.7 倍,同时在多种视频理解任务中保持了强大的泛化能力。
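The latency argument for thinking while watching can be made concrete with a toy timing model. This is a back-of-the-envelope sketch, not a measurement: the clip length, per-clip reasoning time, and answer time are hypothetical, and reasoning is assumed to run sequentially on one worker.

```python
def offline_latency(num_clips, reason_s, answer_s):
    """All reasoning starts only after the question arrives."""
    return num_clips * reason_s + answer_s


def streaming_latency(num_clips, clip_s, reason_s, answer_s):
    """Reasoning on each clip starts as soon as that clip has finished playing."""
    finish = 0.0
    for i in range(num_clips):
        arrival = (i + 1) * clip_s
        # A clip cannot be processed before it arrives or before the worker is free.
        finish = max(arrival, finish) + reason_s
    query_time = num_clips * clip_s      # the question is asked when playback ends
    return max(finish, query_time) - query_time + answer_s


offline = offline_latency(num_clips=10, reason_s=0.5, answer_s=0.3)
streaming = streaming_latency(num_clips=10, clip_s=2.0, reason_s=0.5, answer_s=0.3)
print(offline, streaming)
```

As long as per-clip reasoning is shorter than the clip itself, only the last clip's reasoning remains outstanding when the question arrives, so the response delay stays roughly constant as the video grows.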
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan
First: 2026-03-12T17:59:12+00:00 · Latest: 2026-03-12T17:59:12+00:00
Comments: Project Page: https://dreamvideo-omni.github.io
Abstract
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
中文标题/摘要
标题:DreamVideo-Omni:通过潜在身份强化学习实现全方位运动控制的多主体视频定制
虽然大规模扩散模型已经革新了视频合成,但在实现对多主体身份和多粒度运动的精确控制方面仍面临重大挑战。近期尝试弥合这一差距的方法往往受到运动粒度有限、控制模糊和身份退化等问题的困扰,导致在身份保持和运动控制方面表现不佳。在本文中,我们提出了DreamVideo-Omni,这是一种统一框架,通过渐进的两阶段训练范式实现和谐的多主体定制和全方位运动控制。在第一阶段,我们整合了全面的控制信号进行联合训练,包括主体外观、全局运动、局部动态和摄像机运动。为了确保鲁棒性和精确的可控性,我们引入了条件感知的3D旋转位置嵌入来协调异构输入,并引入了分层运动注入策略以增强全局运动指导。此外,为了解决多主体的模糊性,我们引入了组和角色嵌入,以明确将运动信号锚定到特定身份,从而有效将复杂场景分解为独立可控实例。在第二阶段,为了减轻身份退化,我们设计了一种潜在身份奖励反馈学习范式,通过在预训练的视频扩散主干上训练潜在身份奖励模型来实现。这在潜在空间中提供了运动感知的身份奖励,优先考虑与人类偏好一致的身份保持。借助我们精心策划的大规模数据集和全面的DreamOmni基准,用于多主体和全方位运动控制评估,DreamVideo-Omni展示了在生成高质量视频方面具有卓越的可控性。
Summary / 总结
DreamVideo-Omni is a unified framework that enables precise control over multi-subject identity and motion using a progressive two-stage training approach. In the first stage, it integrates various control signals and introduces condition-aware 3D rotary positional embedding and hierarchical motion injection to enhance controllability. In the second stage, it uses a latent identity reward feedback learning paradigm to mitigate identity degradation. The framework demonstrates superior performance in generating high-quality videos with precise controllability on a curated dataset and the DreamOmni Bench.
DreamVideo-Omni 是一个统一框架,通过两阶段训练过程实现对多主体身份和运动的精确控制。第一阶段整合多种控制信号,并引入条件感知的3D旋转位置嵌入和分层运动注入以增强可控性。第二阶段使用潜身份奖励反馈学习范式来减轻身份退化问题。该框架在生成高质量视频方面表现出色,具有精确的可控性和身份保真度。
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan
First: 2026-03-12T17:58:58+00:00 · Latest: 2026-03-12T17:58:58+00:00
Comments: Project Page: https://liuff19.github.io/Spatial-TTT
Abstract
Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
中文标题/摘要
标题:Spatial-TTT:基于视觉的流式空间智能在测试时训练
人类通过一系列视觉观察来感知和理解现实空间。因此,从潜在无界视频流中流式维护和更新空间证据的能力对于空间智能至关重要。核心挑战不仅在于更长的上下文窗口,而在于如何在时间上选择、组织和保留空间信息。在本文中,我们提出了一种Spatial-TTT方法,以测试时训练(TTT)实现基于视觉的流式空间智能,该方法通过长时场景视频调整参数子集(快速权重),以捕捉和组织空间证据。具体而言,我们设计了一种混合架构,并采用大块更新与滑动窗口注意力相结合的方式,以高效处理空间视频。为了进一步增强空间意识,我们引入了一种应用于TTT层的基于3D时空卷积的空间预测机制,这促使模型捕捉帧间的空间几何对应关系和时间连续性。除了架构设计,我们还构建了一个包含密集3D空间描述的数据集,该数据集指导模型以结构化的方式更新其快速权重,以记忆和组织全局3D空间信号。广泛的实验表明,Spatial-TTT提高了长时空间理解能力,并在视频空间基准测试中达到了最先进的性能。项目页面:https://liuff19.github.io/Spatial-TTT。
Summary / 总结
The research aims to develop a method for streaming visual-based spatial intelligence with test-time training to maintain and update spatial evidence from video streams. The method uses a hybrid architecture with large-chunk updates and sliding-window attention to efficiently process spatial videos. It also introduces a spatial-predictive mechanism with 3D spatiotemporal convolution to enhance spatial awareness. Experiments show that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks.
论文提出Spatial-TTT,通过在测试时训练来适应部分参数,以捕捉和组织长时段场景视频中的空间证据。该方法采用混合架构,结合大块更新和滑动窗口注意力机制以提高处理效率,并引入3D时空卷积的空间预测机制以增强空间意识。实验表明,Spatial-TTT提升了长时段的空间理解能力,并在视频空间基准测试中达到了最先进的性能。
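The idea of adapting fast weights at test time can be sketched with a tiny linear layer updated once per chunk. This is a generic test-time-training illustration under stated assumptions: the squared-error objective, the learning rate, and the (key, value) pairs are hypothetical, and the paper's spatial-predictive mechanism and 3D spatiotemporal convolution are not reproduced.

```python
def chunk_loss(W, chunk):
    """Mean squared error of the linear map W over (key, value) pairs."""
    total = 0.0
    for k, v in chunk:
        pred = [sum(w * x for w, x in zip(row, k)) for row in W]
        total += sum((p - t) ** 2 for p, t in zip(pred, v))
    return total / len(chunk)


def ttt_update(W, chunk, lr):
    """Take one gradient step on chunk_loss using the whole chunk at once."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in chunk:
        pred = [sum(w * x for w, x in zip(row, k)) for row in W]
        for i in range(d_out):
            err = pred[i] - v[i]
            for j in range(d_in):
                grad[i][j] += 2.0 * err * k[j] / len(chunk)
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)] for i in range(d_out)]


fast_weights = [[0.0, 0.0], [0.0, 0.0]]
chunk = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 3.0])]
loss_before = chunk_loss(fast_weights, chunk)
fast_weights = ttt_update(fast_weights, chunk, lr=0.1)
loss_after = chunk_loss(fast_weights, chunk)
print(loss_before, loss_after)
```

Updating once per large chunk rather than once per token keeps the number of sequential weight updates small, which matches the efficiency motivation the abstract gives for large-chunk updates.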
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
Venue: CVPR 2026
First: 2026-03-12T17:58:52+00:00 · Latest: 2026-03-12T17:58:52+00:00
Comments: CVPR 2026. Project page: https://autogaze.github.io/
Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
中文标题/摘要
标题:先关注后出席:通过自回归凝视实现高效可扩展的视频理解
多模态大型语言模型(MLLMs)在通用视频理解方面取得了进展,但在处理长且高分辨率的视频时遇到困难——它们在视觉变换器(ViTs)或大型语言模型(LLMs)中等量处理每个像素,尽管存在大量的时空冗余。我们引入了AutoGaze,这是一个轻量级模块,在ViT或MLLM处理之前去除冗余块。AutoGaze通过下一个标记预测和强化学习进行训练,自回归地选择一组多尺度块,这些块可以在用户指定的误差阈值内重建视频,从而消除冗余并保留信息。实验表明,AutoGaze将视觉标记减少4到100倍,并将ViTs和MLLMs加速19倍,使其能够扩展到1000帧4K分辨率的视频,并在视频基准测试中取得优异成绩(例如,VideoMME上为67.0%)。此外,我们引入了HLVid:第一个高分辨率、长形式的视频问答基准,包含5分钟4K分辨率的视频,其中使用AutoGaze扩展的MLLM比基线提高了10.1%,并优于之前的最佳MLLM 4.5%。项目页面:https://autogaze.github.io/
Summary / 总结
AutoGaze is a lightweight module that removes redundant patches from videos before they are processed by vision transformers or multimodal large language models, enabling efficient and scalable video understanding. It autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a specified error threshold, reducing visual tokens by 4x-100x and accelerating processing by up to 19x. AutoGaze improves results on video benchmarks (e.g., 67.0% on VideoMME). On HLVid, a new high-resolution, long-form video QA benchmark, an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%.
AutoGaze 是一个轻量级模块,在视频被视觉变压器或大型语言模型处理之前,它会去除冗余的图像块,从而实现高效和可扩展的视频理解。它自回归地选择一组多尺度的图像块来重建视频,误差在指定范围内,减少了 4 到 100 倍的视觉令牌,并将处理速度提高了最多 19 倍。AutoGaze 在视频基准测试中提高了结果,并在新的高分辨率长视频问答基准 HLVid 中分别比基线方法和之前最好的大型语言模型提高了 10.1% 和 4.5%。
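Selecting patches under a reconstruction-error threshold can be illustrated with a simple rule-based stand-in. This is a minimal sketch, not AutoGaze itself: the real module is a learned autoregressive model over multi-scale patches, whereas this example uses scalar patch values and a hypothetical "reuse the last kept patch" reconstruction.

```python
def select_patches(frames, threshold):
    """Keep a patch only if reusing the last kept patch at that position
    would exceed the allowed reconstruction error."""
    kept = []                                  # (frame_index, patch_index) pairs
    reference = [None] * len(frames[0])        # last kept value per position
    for t, frame in enumerate(frames):
        for p, value in enumerate(frame):
            if reference[p] is None or abs(value - reference[p]) > threshold:
                kept.append((t, p))
                reference[p] = value
    return kept


# Three frames of four patches each; only patch 2 changes over time.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.9, 0.0],
]
kept = select_patches(frames, threshold=0.1)
reduction = len(frames) * len(frames[0]) / len(kept)
print(kept, reduction)
```

The first frame is kept in full and later frames contribute only the changing patch, so the token count shrinks in proportion to how static the video is.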
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
First: 2026-03-12T17:58:48+00:00 · Latest: 2026-03-12T17:58:48+00:00
Comments: 23 pages, 18 figures
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
中文标题/摘要
标题:EndoCoT: 扩展内生链式思考推理在扩散模型中的应用
最近,多模态大型语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来解决空间推理等复杂任务。然而,这种范式存在两个关键限制:(i) MLLMs的文本编码器表现出推理深度不足。单步编码无法激活链式思考过程,这对于MLLMs提供准确的复杂任务指导至关重要。(ii) 在解码过程中,指导信息保持不变。解码过程中的不变指导阻止了DiT逐步分解复杂指令为可执行的去噪步骤,即使MLLM编码正确。为此,我们提出了一种新的框架Endogenous Chain-of-Thought(EndoCoT),该框架首先通过迭代思考指导模块逐步细化潜在思维状态,激活MLLMs的推理潜力,然后将这些状态与DiT的去噪过程联系起来。其次,应用终端思维接地模块,确保推理轨迹保持在文本监督中,通过将最终状态与正确答案对齐来实现。通过这两个组件,MLLMs的文本编码器提供细致的推理指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在不同基准(如迷宫、TSP、VSP和数独)上的广泛评估显示平均准确率为92.1%,比最强基线高出8.3个百分点。
Summary / 总结
The paper addresses the limitations of Multimodal Large Language Models (MLLMs) in diffusion models, particularly their insufficient reasoning depth and invariant guidance during decoding. To overcome these issues, the authors propose EndoCoT, which iteratively refines latent thought states and grounds them in textual supervision, enabling the diffusion model to execute reasoning step-by-step and solve complex tasks with high accuracy, achieving an average accuracy of 92.1% across various benchmarks.
论文针对多模态大型语言模型(MLLMs)在扩散框架中的不足,特别是其推理深度不足和解码过程中的不变指导。为此,提出了EndoCoT框架,该框架通过迭代细化潜在思维状态并与真实答案对齐,使扩散模型能够逐步执行推理,最终在各种基准测试(如迷宫、TSP、VSP和数独)中达到92.1%的平均准确率,比现有方法高出8.3个百分点。
DVD: Deterministic Video Depth Estimation with Generative Priors
Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen
First: 2026-03-12T17:58:06+00:00 · Latest: 2026-03-12T17:58:06+00:00
Comments: Project: https://dvd-project.github.io/
Abstract
Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
中文标题/摘要
标题:DVD:基于生成先验的确定性视频深度估计
现有的视频深度估计面临一个根本性的权衡:生成模型会遭受随机几何幻觉和尺度漂移的问题,而判别模型则需要大量的标注数据集来解决语义歧义。为打破这一僵局,我们提出了DVD,这是第一个将预训练的视频扩散模型确定性地改编为单次深度回归器的框架。具体而言,DVD 包含三个核心设计:(i) 将扩散时间步作为结构锚点,以平衡全局稳定性和高频细节;(ii) 潜在流形矫正(LMR)以减轻回归引起的过度平滑,施加微分约束以恢复清晰边界和连贯运动;(iii) 全局仿射一致性,这是一种固有的属性,限制了窗口间差异,使得在无需复杂时间对齐的情况下即可无缝进行长视频推理。广泛的实验表明,DVD 在基准测试中实现了最先进的零样本性能。此外,DVD 成功地利用了视频基础模型中隐含的深刻几何先验,比领先基线少使用163倍的任务特定数据。值得注意的是,我们完全开源了我们的管道,提供了最先进的视频深度估计的完整训练套件,以造福开源社区。
Summary / 总结
DVD addresses the trade-off in video depth estimation by deterministically adapting pre-trained video diffusion models into single-pass depth regressors. It introduces three key designs: using the diffusion timestep as a structural anchor, latent manifold rectification to mitigate regression-induced over-smoothing, and global affine coherence to enable seamless long-video inference without complex temporal alignment. Experiments show that DVD achieves state-of-the-art zero-shot performance across benchmarks while using 163x less task-specific data than leading baselines.
DVD通过引入一个确定性的框架,将扩散模型重新用于单次深度回归,解决了视频深度估计中的权衡问题。它包含三个核心设计:平衡全局稳定性和高频细节、通过潜在流形矫正减轻过度平滑、确保全局仿射一致性以实现无缝长视频推理。DVD在基准测试中表现出色,使用了比领先基线少得多的任务特定数据。
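One way to picture affine consistency between inference windows is a scale-and-shift fit on the frames two windows share. This is an illustrative assumption rather than DVD's documented procedure: the helper name, the least-squares fit, and the depth values below are all hypothetical.

```python
def fit_scale_shift(source, target):
    """Least-squares scale and shift such that scale * source + shift ~ target."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var = sum((x - mean_s) ** 2 for x in source)
    cov = sum((x - mean_s) * (y - mean_t) for x, y in zip(source, target))
    scale = cov / var
    shift = mean_t - scale * mean_s
    return scale, shift


# Hypothetical depth values for the same pixels as predicted by two windows.
window_b_overlap = [1.0, 2.0, 3.0]
window_a_overlap = [3.0, 5.0, 7.0]
scale, shift = fit_scale_shift(window_b_overlap, window_a_overlap)

# Map the rest of window B into window A's depth scale.
window_b_rest = [4.0, 5.0]
aligned_rest = [scale * d + shift for d in window_b_rest]
print(scale, shift, aligned_rest)
```

If predictions from different windows differ only by such a global affine map, two parameters per window are enough to stitch a long video, which is much lighter than per-frame or per-pixel temporal alignment.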
SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
First: 2026-03-12T17:57:52+00:00 · Latest: 2026-03-12T17:57:52+00:00
Abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
中文标题/摘要
标题:SciMDR:科学多模态文档推理基准测试与推进
构建用于基础模型训练的科学多模态文档推理数据集涉及规模、忠实度和现实性之间的固有权衡。为解决这一挑战,我们引入了合成和再嵌入框架,这是一个两阶段流水线,包括:(1) 主题导向的问答合成,生成忠实的、孤立的问答对和聚焦段落上的推理,以及(2) 文档规模再嵌入,通过程序化重新嵌入这些对到完整的文档任务,以确保现实的复杂性。使用此框架,我们构建了SciMDR,一个大规模训练数据集,包含30万对具有明确推理链的问答对,覆盖2万篇科学论文。我们还构建了SciMDR-Eval,一个专家注释基准,用于评估全篇科学工作流程中的多模态理解。实验表明,基于SciMDR微调的模型在多个科学问答基准测试中取得了显著改进,特别是在那些需要复杂文档级推理的任务中。
Summary / 总结
The research aims to address the challenge of constructing scientific multimodal document reasoning datasets by introducing the synthesize-and-reground framework, which generates faithful QA pairs and embeds them into full documents to ensure realism. The resulting SciMDR dataset includes 300K QA pairs with explicit reasoning chains across 20K scientific papers, and models fine-tuned on SciMDR show significant improvements in scientific QA benchmarks, especially in tasks requiring complex document-level reasoning. SciMDR-Eval, an expert-annotated benchmark, is also constructed to evaluate multimodal comprehension in full-length scientific workflows.
研究旨在通过引入合成和再嵌入框架来解决构建科学多模态文档推理数据集的挑战,该框架包括基于声明的QA合成和文档规模再嵌入。该框架生成了300K包含明确推理链的QA对,覆盖20K科学论文,并构建了一个大规模训练数据集SciMDR。此外,还创建了SciMDR-Eval专家标注基准,以评估全篇科学工作流程中的多模态理解能力。实验表明,基于SciMDR微调的模型在多个科学QA基准测试中表现出显著改进,特别是在需要复杂文档级推理的任务中。
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
First: 2026-03-12T17:57:21+00:00 · Latest: 2026-03-12T17:57:21+00:00
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
中文标题/摘要
标题:信任你的批评者:稳健的奖励建模与强化学习在忠实图像编辑与生成中的应用
强化学习(RL)已成为提升图像编辑和文本到图像(T2I)生成的有前途的范式。然而,当前的奖励模型在作为RL中的批评者时,往往会产生幻觉并分配嘈杂的分数,从而误导优化过程。在本文中,我们提出了FIRM(忠实图像奖励建模),这是一种全面的框架,旨在开发稳健的奖励模型,为忠实的图像生成和编辑提供准确可靠的指导。首先,我们设计了定制的数据整理管道,以构建高质量的评分数据集。具体而言,我们使用执行和一致性来评估编辑,而生成则主要通过指令遵循来进行评估。使用这些管道,我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集,并训练了专门的奖励模型(FIRM-Edit-8B和FIRM-Gen-8B),这些模型能够准确反映这些标准。其次,我们引入了FIRM-Bench,这是一种专门针对编辑和生成批评者的综合基准。评估结果表明,我们的模型在与人类判断的对齐方面优于现有指标。此外,为了无缝地将这些批评者集成到RL管道中,我们提出了一个新颖的“基础加奖金”奖励策略,该策略平衡了编辑中的一致性调节执行(CME)和生成中的质量调节对齐(QMA)等竞争目标。借助此框架,我们的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验表明,FIRM减轻了幻觉,建立了忠实度和指令遵循的新标准,超越了现有的通用模型。所有我们的数据集、模型和代码均已在https://firm-reward.github.io/公开。
Summary / 总结
This paper addresses the issue of hallucinations in reward models used for reinforcement learning in image editing and text-to-image generation. It introduces FIRM (Faithful Image Reward Modeling), a framework that includes tailored data curation pipelines to create high-quality scoring datasets and specialized reward models. The authors also introduce FIRM-Bench, on which their reward models align better with human judgment than existing metrics, and propose a 'Base-and-Bonus' reward strategy that balances competing objectives during RL: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Experiments show that the resulting models mitigate hallucinations and improve fidelity and instruction adherence over existing general models.
本文解决了图像编辑和文本到图像生成中奖励模型出现幻觉的问题,提出了FIRM(Faithful Image Reward Modeling)框架,通过定制化数据收集和专门训练来开发稳健的奖励模型。作者收集了高质量的数据集,并训练了能够准确反映编辑执行和一致性以及生成指令跟随标准的奖励模型。引入的“基底和奖励”奖励策略,包括一致性调节执行(CME)和质量调节对齐(QMA),提高了与人类判断的一致性。实验表明,FIRM模型减少了幻觉,实现了更好的保真度和指令遵循性,优于现有通用模型。
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Authors: Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen
First: 2026-03-12T17:57:06+00:00 · Latest: 2026-03-12T17:57:06+00:00
Abstract
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.
中文标题/摘要
标题:探究推理LLM作为法官在非可验证LLM后训练中的应用
推理LLM作为法官,得益于推理时的扩展,为将推理模型的成功扩展到输出正确性/质量无法直接验证的领域提供了有希望的途径。然而,尽管推理法官在静态评估基准上表现出更好的性能,但它们在实际策略训练中的有效性尚未系统地进行研究。因此,我们进行了一项严格的研究,以调查非推理和推理法官在基于强化学习的LLM对齐中的实际影响。在我们的受控合成环境中,“黄金标准”法官(gpt-oss-120b)提供偏好注释以训练较小的法官,揭示了非推理和推理法官之间的关键差异:非推理法官容易导致奖励作弊,而推理法官可以导致在“黄金标准”法官评估下表现出色的策略。有趣的是,我们发现,通过学习生成高度有效的对抗输出,推理法官训练的策略能够获得如此出色的表现,这些对抗输出也能在流行的基准测试如Arena-Hard中获得高分,从而欺骗其他LLM法官。结合我们进一步的分析,我们的研究突显了在非可验证LLM后训练中应用(推理)LLM法官的重要发现和改进空间。
Summary / 总结
The study examines how reasoning and non-reasoning LLM judges affect reinforcement-learning-based alignment in non-verifiable domains, using a controlled synthetic setting in which a gold-standard judge (gpt-oss-120b) provides the preference annotations used to train smaller judges. Non-reasoning judges readily lead to reward hacking, whereas reasoning judges yield policies that score well under the gold-standard judge. However, these policies reach that score by learning to generate highly effective adversarial outputs that also deceive other LLM judges on popular benchmarks such as Arena-Hard, which highlights both the promise and the remaining weaknesses of reasoning LLM-judges.
研究通过在强化学习基线模型对齐中比较推理和非推理法官的有效性,来考察推理LLM-法官在非可验证领域中的应用。使用受控的合成环境,研究发现推理法官生成的策略在由黄金标准法官评估时表现出色,而非推理法官容易导致奖励作弊。推理法官训练的策略还学会了生成有效的对抗输出,这些输出在流行的基准测试如Arena-Hard中也能表现良好,这表明了其应用中的重要发现和改进空间。
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
First: 2026-03-12T17:57:04+00:00 · Latest: 2026-03-12T17:57:04+00:00
Comments: Project page: https://snap-research.github.io/elit/
Abstract
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resources on unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\%$ and $39.6\%$ in FID and FDD scores.
中文标题/摘要
标题:一种模型多种预算:弹性潜在接口的扩散变换器
扩散变换器(DiTs)实现高质量生成,但固定FLOPs到图像分辨率,限制了延迟-质量权衡的原理性调整,并且均匀分配计算资源到输入空间标记,浪费了对不重要区域的资源分配。我们引入了弹性潜在接口变换器(ELIT),这是一种可插入、DiT兼容的机制,解耦输入图像大小与计算。我们的方法插入了一个潜在接口,一个可学习的可变长度标记序列,标准变换器块可以在其上操作。轻量级读取和写入交叉注意层在空间标记和潜在标记之间移动信息,并优先处理重要输入区域。通过随机丢弃尾部潜在标记进行训练,ELIT学习生成按重要性排序的表示,早期潜在标记捕获全局结构,而后期则包含细化细节的信息。在推理时,潜在标记的数量可以根据计算约束动态调整。ELIT故意保持简单,仅添加两个交叉注意层,而未改变修正流目标和DiT堆栈。在不同数据集和架构(DiT、U-ViT、HDiT、MM-DiT)上,ELIT提供了持续的收益。在ImageNet-1K 512px上,ELIT在FID和FDD得分上分别提供了平均35.3%和39.6%的收益。
Summary / 总结
The research aims to improve the efficiency of diffusion transformers (DiTs) by allowing flexible computation allocation. ELIT, a drop-in mechanism, decouples input image size from compute by introducing a learnable, variable-length latent interface. Lightweight Read and Write cross-attention layers prioritize important regions, and training with random dropping of tail latents lets the number of latents be adjusted at inference to match a compute budget. ELIT delivers consistent gains across datasets and architectures; on ImageNet-1K 512px, it delivers average gains of 35.3% in FID and 39.6% in FDD.
研究旨在通过引入弹性潜在接口变换器(ELIT)来提高扩散变换器(DiTs)的效率,该方法解耦输入图像大小与计算成本。ELIT使用可学习的潜在接口和轻量级的读写交叉注意力层来优先处理重要区域,并允许在推理时动态调整潜在变量的数量。实验表明,该方法在各种数据集和架构上表现出一致的改进,在ImageNet-1K 512px上,FID和FDD得分分别平均提高了35.3%和39.6%。
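The tail-dropping idea in the ELIT abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_keep` parameter, and the uniform sampling of the kept prefix length are assumptions made only for illustration. The point is that training on random-length prefixes makes any prefix usable at inference.

```python
import random

def drop_tail_latents(latents, min_keep, rng):
    """Training-time view: keep a random-length prefix of the latent sequence.

    Assumption: the prefix length is sampled uniformly in [min_keep, len(latents)].
    """
    keep = rng.randint(min_keep, len(latents))  # randint is inclusive on both ends
    return latents[:keep]

def truncate_to_budget(latents, budget):
    """Inference-time view: keep only the first `budget` latents (at least one)."""
    return latents[:max(1, min(budget, len(latents)))]

rng = random.Random(0)
latents = [f"z{i}" for i in range(8)]  # placeholder tokens standing in for latent vectors

train_view = drop_tail_latents(latents, min_keep=2, rng=rng)
infer_view = truncate_to_budget(latents, budget=4)
```

Because only the tail is ever removed, the earliest latents are present in every training step, which is consistent with the abstract's claim that they end up carrying global structure.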
Separable neural architectures as a primitive for unified predictive and generative intelligence
Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha
First: 2026-03-12T17:56:54+00:00 · Latest: 2026-03-12T17:56:54+00:00
Abstract
Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.
中文标题/摘要
标题:可分神经架构作为统一预测与生成智能的基本构建块
物理、语言和感知领域的智能系统通常表现出可分解的结构,但通常由不明确利用这种结构的单一神经架构进行建模。可分神经架构(SNA)通过形式化一个统一加性、二次和张量分解神经模型的表示类来解决这一问题。通过限制交互顺序和张量秩,SNA施加一种结构先验偏置,将高维映射分解为低元组件。可分性不必是系统本身的属性:它通常在系统表达的坐标或表示中出现。关键的是,这种基于坐标的表述揭示了混沌时空动力学与语言自回归之间的结构类比。通过将连续物理状态视为平滑的可分嵌入,SNA使混沌系统的分布建模成为可能。这种方法减轻了确定性算子的非物理漂移特性,同时仍适用于离散序列。这种方法的组合灵活性在四个领域得到展示:通过强化学习进行自主航点导航、多功能微结构的逆生成、湍流流动的分布建模以及神经语言建模。这些结果确立了可分神经架构作为预测与生成智能的领域无关的基本构建块,能够统一确定性和分布性表示。
Summary / 总结
The research addresses the fact that monolithic neural architectures do not explicitly exploit the factorisable structure that intelligent systems across physics, language and perception often exhibit, by introducing the separable neural architecture (SNA). SNA formalises a representational class that unifies additive, quadratic, and tensor-decomposed models; by constraining interaction order and tensor rank, it imposes a structural inductive bias that factorises high-dimensional mappings into low-arity components. The approach demonstrates compositional versatility across four domains: autonomous waypoint navigation, inverse generation of microstructures, distributional modelling of turbulent flow, and neural language modelling, establishing SNA as a domain-agnostic primitive for both predictive and generative intelligence.
研究旨在解决传统单一神经架构在建模物理、语言和感知等领域中因子结构的局限性。引入了可分神经架构(SNA),通过约束交互顺序和张量秩来统一加性、二次和张量分解模型,从而将高维映射分解为低arity组件。关键实验结果表明,SNA在自主航点导航、多功能微结构逆生成、湍流流动的分布建模和神经语言建模等方面的有效性,确立了SNA作为预测和生成智能的领域通用基本构件的地位,能够统一确定性和分布性表示。
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00
Comments: Code: https://github.com/ROUJINN/SceneAssistant
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
中文标题/摘要
标题:SceneAssistant:一种用于开放词汇3D场景生成的视觉反馈代理
从自然语言进行文本到3D场景生成,是数字内容创作中非常需要的能力。然而,现有方法大多局限于特定领域或依赖预定义的空间关系,限制了其生成不受限制、开放词汇3D场景的能力。在本文中,我们介绍了SceneAssistant,一种用于开放词汇3D场景生成的视觉反馈驱动代理。我们的框架利用了现代3D对象生成模型以及视觉语言模型(VLMs)的空间推理和规划能力。为了实现开放词汇场景组合,我们为VLMs提供了一整套原子操作(例如,缩放、旋转、聚焦)。在每次交互步骤中,VLM接收渲染的视觉反馈并相应地采取行动,逐步细化场景以实现更连贯的空间布局并更好地与输入文本对齐。实验结果表明,我们的方法可以生成多样、开放词汇且高质量的3D场景。定性和定量的人类评估均证明了我们方法优于现有方法的优越性。此外,我们的方法允许用户根据自然语言命令编辑现有场景。我们的代码可在https://github.com/ROUJINN/SceneAssistant 获取
Summary / 总结
SceneAssistant is a visual-feedback-driven agent for open-vocabulary 3D scene generation, which uses a 3D object generation model and Vision-Language Models (VLMs) to iteratively refine scenes based on rendered visual feedback and natural language instructions. The method demonstrates the ability to generate diverse and high-quality 3D scenes, outperforming existing approaches in both qualitative and quantitative evaluations. Additionally, it supports editing existing scenes with natural language commands.
SceneAssistant 是一种基于视觉反馈的开放词汇3D场景生成代理,结合了3D对象生成模型和Vision-Language模型(VLMs)。它为VLMs提供了如缩放和旋转等原子操作,根据视觉反馈逐步细化场景,实现空间布局的一致性和与输入文本的对齐。实验表明,SceneAssistant 能生成多样、高质量的3D场景,优于现有方法,并且支持使用自然语言命令编辑现有场景。
Security Considerations for Artificial Intelligence Agents
Authors: Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma
First: 2026-03-12T17:49:39+00:00 · Latest: 2026-03-12T17:49:39+00:00
Comments: Perplexity Response to NIST/CAISI Request for Information 2025-0035. 91 Fed. Reg. 698 (Jan. 8, 2026)
Abstract
This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.
中文标题/摘要
标题:人工智能代理的安全考虑
本文,基于Perplexity对NIST/CAISI 2025-0035请求信息的轻度改编回应,详细阐述了我们对前沿AI代理安全性的观察和建议。这些见解源自Perplexity在受控和开放世界环境中运营广泛用途代理系统方面的经验,这些系统被数百万人和数千家企业使用。代理架构改变了代码-数据分离、权限边界和执行可预测性的核心假设,创造了新的机密性、完整性和可用性故障模式。我们映射了工具、连接器、托管边界和多代理协调的主要攻击面,特别强调了间接提示注入、混淆副手行为以及长时间运行工作流中的级联故障。然后,我们评估了当前的防御措施,作为分层堆栈:输入级和模型级缓解措施、沙箱执行以及对高后果行动的确定性策略执行。最后,我们确定了标准和研究缺口,包括适应性安全基准、委托和权限控制的政策模型,以及与NIST风险管理原则一致的多代理系统设计指南。
Summary / 总结
This article discusses the security challenges of advanced AI agents, drawing from Perplexity's operational experience with millions of users and thousands of enterprises. It identifies new security risks due to changes in agent architectures, such as indirect prompt injection and cascading failures. The authors propose a layered defense strategy including input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement. They also highlight the need for adaptive security benchmarks and policy models for secure multi-agent system design.
本文讨论了AI代理的安全挑战,基于Perplexity在大规模代理系统方面的经验。文章指出了由于代理架构变化带来的新安全风险,如间接提示注入和迷惑副手行为。作者建议采用分层防御方法,包括输入级和模型级的缓解措施、沙箱执行以及对高后果行动的确定性策略执行。此外,他们还指出了当前标准中的不足,并建议未来研究的方向,如适应性安全基准和安全的多代理系统设计。
Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing
Authors: Pavel Surynek
First: 2026-03-12T17:48:14+00:00 · Latest: 2026-03-12T17:48:14+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2503.05071
Abstract
Computing power, and especially parallelism, that decades ago was available only in supercomputers is now available in standard personal computer CPUs and even in CPUs for mobile phones. We show how to effectively use the computing power of a modern multi-core personal computer CPU to solve the complex combinatorial problem of object arrangement and scheduling for sequential 3D printing. We achieve this by parallelizing the existing CEGAR-SEQ algorithm, which solves sequential object arrangement and scheduling by expressing it as a linear arithmetic formula that is then solved by a technique inspired by counterexample-guided abstraction refinement (CEGAR). The original CEGAR-SEQ algorithm uses an object arrangement strategy that places objects towards the center of the printing plate. We propose alternative object arrangement strategies, such as placing objects towards a corner of the printing plate and scheduling objects according to their height. Our parallelization is done at a high level: we execute the CEGAR-SEQ algorithm in parallel with a portfolio of object arrangement strategies, an algorithm we call Portfolio-CEGAR-SEQ. Our experimental evaluation indicates that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ. When a batch of objects for multiple printing plates is scheduled, Portfolio-CEGAR-SEQ often uses fewer printing plates than CEGAR-SEQ.
中文标题/摘要
标题:基于CEGAR的物体排列与调度组合策略集在顺序3D打印中的应用
几十年前只能在超级计算机中使用的计算能力,尤其是其并行性,现在已可在标准个人计算机CPU甚至移动电话CPU中使用。我们展示了如何有效利用现代多核个人计算机CPU的计算能力来解决顺序3D打印中物体排列和调度的复杂组合问题。我们通过并行化现有的CEGAR-SEQ算法实现了这一目标,该算法将顺序物体排列与调度问题表达为线性算术公式,然后使用受反例引导抽象细化(CEGAR)启发的技术求解。原始的CEGAR-SEQ算法使用了一种将物体朝打印板中心放置的物体排列策略。我们提出了替代的物体排列策略,如将物体朝打印板的角落放置,并根据物体的高度进行调度。我们的并行化是在高层次上进行的,我们以一组物体排列策略组合并行执行CEGAR-SEQ算法,该算法称为Portfolio-CEGAR-SEQ。我们的实验评估表明,Portfolio-CEGAR-SEQ优于原始的CEGAR-SEQ。当为多个打印板调度一批物体时,Portfolio-CEGAR-SEQ通常使用比CEGAR-SEQ更少的打印板。
Summary / 总结
The research aims to leverage modern multi-core CPUs to solve the complex combinatorial problem of object arrangement and scheduling for sequential 3D printing. It achieves this by running the CEGAR-SEQ algorithm in parallel with a portfolio of object arrangement strategies. The key finding is that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ and, when a batch of objects is scheduled across multiple printing plates, often uses fewer plates.
研究旨在利用现代多核CPU解决3D打印中的复杂组合问题。通过并行化CEGAR-SEQ算法并引入多种物体排列策略组合,实现了这一目标。实验结果表明,Portfolio-CEGAR-SEQ比原始的CEGAR-SEQ更优,通常在多个打印板上调度一批物体时使用更少的打印板。
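The portfolio idea (run the same solver under several arrangement strategies in parallel and keep the best outcome) can be sketched as follows. This is a toy stand-in under stated assumptions: `first_fit` is a simple one-dimensional first-fit packer, not CEGAR-SEQ, and the three ordering strategies and all names are hypothetical. Only the high-level parallel structure mirrors the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

def first_fit(sizes, capacity):
    """Toy packer (stand-in for the real solver): place each item on the first plate with room."""
    plates = []
    for s in sizes:
        for p in plates:
            if sum(p) + s <= capacity:
                p.append(s)
                break
        else:  # no existing plate had room, so open a new one
            plates.append([s])
    return plates

# Hypothetical strategies: each one only changes the order in which items are offered.
STRATEGIES = {
    "as_given": lambda sizes: list(sizes),
    "largest_first": lambda sizes: sorted(sizes, reverse=True),
    "smallest_first": lambda sizes: sorted(sizes),
}

def portfolio_pack(sizes, capacity):
    """Run every strategy in parallel and return (name, plates) using the fewest plates."""
    def run(name):
        return name, first_fit(STRATEGIES[name](sizes), capacity)
    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        results = list(pool.map(run, STRATEGIES))
    return min(results, key=lambda r: len(r[1]))

best_name, best_plates = portfolio_pack([4, 8, 1, 4, 2, 1], capacity=10)
```

In this toy instance the smallest-first ordering needs three plates while the other two need two, so the portfolio never does worse than its best member. A CPU-bound solver would more likely use processes than threads; threads are used here only to keep the sketch short.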
LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
Authors: Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen
First: 2025-12-05T03:16:46+00:00 · Latest: 2026-03-12T17:45:22+00:00
Comments: Code will be released soon
Abstract
Whole Slide Image (WSI) MLLMs are difficult to build and deploy because gigapixel slides induce thousands of visual tokens, while only a small fraction of regions is diagnostically relevant. Existing slide-level pathology MLLMs typically combine heavy slide-level encoders with long visual prefixes, making end-to-end slide-level development and deployment expensive under limited computational resources. We revisit this regime and show that WSI tile features are highly redundant at both global and local scales, while task-relevant evidence is sparse and query-dependent. We therefore introduce LoC-Path, a resource-efficient slide-level MLLM that compresses before fusion. LoC-Path uses a Sparse Token Merger (STM) and an MAE-pretrained resampler to replace expensive slide-level encoding with a compact latent interface, then uses a Token Importance Scorer (TIS) to select the most relevant latents and a Cross-Attention Routing Adapter (CARA) to fuse them into a few LLM decoder layers. This design lowers both multimodal tuning cost and inference-time latency/memory by avoiding heavy slide-level encoding and long visual prefixes. Extensive experiments show that LoC-Path remains competitive with prior slide-level MLLMs while making end-to-end development and deployment more practical under limited computational resources.
中文标题/摘要
标题:LoC-Path:面向病理多模态大型语言模型的学习压缩
全视野图像(WSI)多模态大型语言模型(MLLM)难以构建和部署,因为十亿像素级的幻灯片会产生数千个视觉标记,而只有少数区域具有诊断意义。现有的幻灯片级病理MLLM通常结合了沉重的幻灯片级编码器和长的视觉前缀,这在有限的计算资源下使得端到端的幻灯片级开发和部署非常昂贵。我们重新审视了这一领域,并展示了WSI切片特征在全局和局部尺度上高度冗余,而任务相关证据则稀疏且查询依赖。因此,我们引入了LoC-Path,这是一种资源高效的幻灯片级MLLM,它在融合之前进行压缩。LoC-Path使用稀疏标记合并器(STM)和MAE预训练的重采样器来用紧凑的潜在接口替换昂贵的幻灯片级编码,然后使用标记重要性评分器(TIS)选择最相关的潜在特征,并使用跨注意力路由适配器(CARA)将它们融合到少量的LLM解码器层中。这种设计通过避免沉重的幻灯片级编码和长的视觉前缀,降低了多模态调优成本和推理时的延迟/内存。广泛的实验表明,LoC-Path在保持与先前幻灯片级MLLM竞争力的同时,使得在有限计算资源下端到端的开发和部署更加实际。
Summary / 总结
LoC-Path is designed to address the challenges of building and deploying whole slide image (WSI) multimodal large language models (MLLMs) by compressing WSI tile features before fusion. It uses a Sparse Token Merger and an MAE-pretrained resampler to create a compact latent interface, and a Token Importance Scorer and Cross-Attention Routing Adapter to select and fuse relevant latents into a few LLM decoder layers. This approach reduces both multimodal tuning cost and inference-time latency/memory. Experiments demonstrate that LoC-Path maintains competitive performance with previous slide-level MLLMs while making end-to-end development and deployment more feasible under limited computational resources.
LoC-Path旨在通过在融合前压缩WSI切片特征来解决构建和部署WSI MLLM的计算挑战。它使用稀疏Token合并器和MAE预训练重采样器来创建一个紧凑的潜在接口,并使用Token重要性评分器和跨注意力路由适配器来选择并融合相关的潜在特征到少量的LLM解码层。实验表明,LoC-Path在保持与现有滑片级MLLMs竞争力的同时,减少了调优成本和推理时的延迟/内存占用。
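The "score, then keep only the most relevant latents" step described for the Token Importance Scorer can be sketched as a top-k selection. This is a hedged illustration only: the abstract does not say how TIS computes scores, so the dot product with a query vector used below is an assumption, as are the function and variable names.

```python
def select_top_k(latents, query, k):
    """Keep the k latents with the highest importance score.

    Assumption: importance is the dot product between a latent and the query vector.
    Returns the kept indices (in original order) and the kept latents.
    """
    scores = [sum(a * b for a, b in zip(z, query)) for z in latents]
    ranked = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original ordering of the survivors
    return keep, [latents[i] for i in keep]

latents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
kept_idx, kept = select_top_k(latents, query=[1.0, 0.2], k=2)
```

Because the scores depend on the query, a different question about the same slide would keep a different subset, which matches the abstract's observation that task-relevant evidence is sparse and query-dependent.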
HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
Authors: Yiming Huang, Tolga Birdal
Venue: ICLR 2026
First: 2025-02-06T18:51:14+00:00 · Latest: 2026-03-12T17:45:20+00:00
Comments: Accepted at ICLR 2026
Abstract
Graph generation is a critical yet challenging task, as empirical analyses require a deep understanding of complex, non-Euclidean structures. Diffusion models have recently made significant advances in graph generation, but these models are typically adapted from image generation frameworks and overlook inherent higher-order topology, limiting their ability to capture graph topology. In this work, we propose Higher-order Guided Diffusion (HOG-Diff), a principled framework that progressively generates plausible graphs with inherent topological structures. HOG-Diff follows a coarse-to-fine generation curriculum, guided by higher-order topology and implemented via diffusion bridges. We further prove that our model admits stronger theoretical guarantees than classical diffusion frameworks. Extensive experiments across eight graph generation benchmarks, spanning diverse domains and including large-scale settings, demonstrate the scalability of our method and its superior performance on both pairwise and higher-order topological metrics. Our project page is available at https://circle-group.github.io/research/hog-diff/.
中文标题/摘要
标题:HOG-Diff:高阶引导扩散在图生成中的应用
图生成是一项关键但具有挑战性的任务,因为经验分析需要深入理解复杂的非欧几里得结构。扩散模型在图生成方面取得了显著进展,但这些模型通常是从图像生成框架中改编而来的,忽视了固有的高阶拓扑结构,限制了其捕捉图拓扑结构的能力。在本文中,我们提出了一种名为高阶引导扩散(HOG-Diff)的原理性框架,该框架逐步生成具有固有拓扑结构的合理图。HOG-Diff 遵循从粗到细的生成课程,由高阶拓扑结构引导,并通过扩散桥梁实现。我们进一步证明,我们的模型在理论保证上比经典扩散框架更强。在八个涵盖不同领域并包括大规模设置的图生成基准测试中进行的广泛实验表明,我们的方法在对偶和高阶拓扑度量方面均表现出优越的性能。我们的项目页面可在 https://circle-group.github.io/research/hog-diff/ 查看。
Summary / 总结
HOG-Diff is a principled framework for graph generation that addresses the limitations of existing diffusion models by incorporating higher-order topology. It follows a coarse-to-fine generation process guided by higher-order topology and implemented through diffusion bridges. Experiments on eight diverse graph generation benchmarks show that HOG-Diff outperforms existing methods on both pairwise and higher-order topological metrics and is scalable to large graphs.
HOG-Diff 是一个框架,用于通过引入更高阶拓扑来解决现有扩散模型在图生成中的局限性。它采用从粗到细的生成方法,并由更高阶拓扑引导,通过扩散桥梁实现。实验表明,HOG-Diff 在各种基准上的表现优于现有方法,并且适用于大规模设置。
A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Authors: Jiajun Sun, Zhe Gao
First: 2026-03-12T17:45:12+00:00 · Latest: 2026-03-12T17:45:12+00:00
Comments: 10 pages, 4 figures
Abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
中文标题/摘要
标题:一种双阶段双模态模型用于面部情感表达识别
本文针对在第10届情感行为野外分析研讨会(ABAW)的工作坊和竞赛中提出的表达(EXPR)识别挑战,该挑战要求对不受限制的视频中的八种面部情感表达进行帧级分类。由于相邻帧之间存在不准确的面部定位、大姿态和尺度变化、运动模糊、时间不稳定性以及其他干扰因素,这项任务具有挑战性。我们提出了一种双阶段双模态(音视频)模型来应对这些困难。第一阶段专注于使用预训练的DINOv2编码器进行鲁棒的视觉特征提取。具体来说,使用DINOv2 ViT-L/14作为骨干,采用填充感知增强(PadAug)策略对图像进行填充和数据预处理,从原始视频中引入混合专家(MoE)训练头以增强分类器多样性。第二阶段解决模态融合和时间一致性问题。对于视觉模态,从多尺度的原始视频中重新裁剪人脸,并提取的视觉特征进行平均以形成鲁棒的帧级表示。同时,从短音频窗口中提取与帧对齐的Wav2Vec 2.0音频特征,提供补充的声学线索。这些双模态特征通过轻量级门控融合模块进行集成,随后在推理时进行时间平滑。在ABAW数据集上的实验表明了所提出方法的有效性。双阶段模型在官方验证集上达到了0.5368的宏F1分数,并在5折交叉验证下达到了0.5122 +/- 0.0277,优于官方基线。
Summary / 总结
This paper presents a two-stage dual-modal model for facial emotional expression recognition in unconstrained videos, addressing challenges like inaccurate face localization and pose variations. The model uses a pretrained DINOv2-based encoder for robust visual feature extraction and a mixture-of-experts training head. In the second stage, it integrates visual and audio features through a gated fusion module and temporal smoothing, achieving a Macro-F1 score of 0.5368 on the ABAW validation set and outperforming official baselines.
该论文提出了一种双模态两阶段模型,用于处理不受限视频中的面部情感表达识别,解决了如面部定位不准确和姿态变化等挑战。模型使用预训练的DINOv2编码器进行稳健的视觉特征提取,并通过轻量级门控融合模块整合音频和视觉特征。实验表明该模型的有效性,其在官方验证集上的宏F1分数为0.5368,并且优于官方基线模型。
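Stage II of the model above combines a gated fusion of visual and audio features with inference-time temporal smoothing. The sketch below shows one common form of each, under explicit assumptions: the gate is a single scalar computed from the concatenated features, both modalities share the same dimension, and smoothing is a centered moving average over per-frame class probabilities. The abstract does not specify these details, and all names are hypothetical.

```python
import math

def gated_fuse(visual, audio, gate_w, gate_b):
    """Blend two same-length feature vectors with a scalar sigmoid gate (assumed form)."""
    z = sum(w * x for w, x in zip(gate_w, visual + audio)) + gate_b
    g = 1.0 / (1.0 + math.exp(-z))  # gate in (0, 1): weight given to the visual stream
    return [g * v + (1.0 - g) * a for v, a in zip(visual, audio)]

def smooth_probs(frame_probs, radius):
    """Centered moving average over per-frame class probabilities."""
    out = []
    for t in range(len(frame_probs)):
        window = frame_probs[max(0, t - radius): t + radius + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

# With zero gate weights and bias the gate is 0.5, so fusion reduces to a plain average.
fused = gated_fuse([1.0, 0.0], [0.0, 1.0], gate_w=[0.0] * 4, gate_b=0.0)

# A single flickering frame (the middle one) is outvoted by its neighbours.
smoothed = smooth_probs([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], radius=1)
```

In a trained model the gate weights would be learned, so the gate could lean on audio when the face crop is unreliable; the smoothing step then suppresses isolated per-frame label flips.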
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Authors: Görkay Aydemir, Fatma Güney, Weidi Xie
Venue: CVPR 2026
First: 2026-03-12T17:40:52+00:00 · Latest: 2026-03-12T17:40:52+00:00
Comments: CVPR 2026
Abstract
Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
中文标题/摘要
标题:基于验证器引导伪标签的现实世界点跟踪
长期点跟踪模型通常在大型合成数据集上训练。这些模型在现实世界视频中的性能下降,因为现实世界视频具有不同的特征且缺乏密集的地面真值注释。在未标记视频上进行自我训练已被探索作为实际解决方案,但伪标签的质量强烈依赖于教师模型的可靠性,这在不同帧和场景之间变化。在本文中,我们解决了现实世界微调的问题,并引入了验证器,这是一种元模型,用于学习评估跟踪器预测的可靠性并指导伪标签生成。给定多个预训练跟踪器的候选轨迹,验证器逐帧评估它们并选择最可信的预测,从而生成高质量的伪标签轨迹。在进行微调时,验证器引导的伪标签生成显著提高了监督质量,并使模型能够高效地适应未标记视频。在四个现实世界基准上的广泛实验表明,我们的方法在所需数据量少于先前自我训练方法的情况下达到了最先进的效果。项目页面:https://kuis-ai.github.io/track_on_r
Summary / 总结
This paper addresses the challenge of real-world point tracking by introducing a verifier-guided pseudo-labeling method. The method uses a meta-model to assess the reliability of tracker predictions and select trustworthy pseudo-labels, improving the quality of supervision for fine-tuning. Experiments on four real-world benchmarks show that this approach outperforms previous self-training methods with less data required for adaptation.
论文通过引入基于验证器的伪标签生成方法解决了现实世界点跟踪的问题。该方法使用一个元模型验证器来评估追踪器预测的可靠性并选择最可信的预测,生成高质量的伪标签。实验表明,这种方法在现实世界基准上的表现显著优于之前的自我训练方法,且所需数据较少。
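The per-frame selection step described in this entry can be sketched as follows. This is a minimal illustration that assumes the verifier emits one reliability score per tracker per frame. The function name `select_pseudo_labels`, the score layout, and the optional `min_score` threshold are hypothetical and are not the authors' implementation.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def select_pseudo_labels(
    candidates: List[List[Point]],  # candidates[k][t]: tracker k's prediction at frame t
    scores: List[List[float]],      # scores[k][t]: assumed verifier reliability score
    min_score: float = 0.0,         # hypothetical threshold; frames below it get no label
) -> List[Optional[Point]]:
    """For each frame, keep the candidate with the highest verifier score."""
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    labels: List[Optional[Point]] = []
    for t in range(num_frames):
        best_k = max(range(num_trackers), key=lambda k: scores[k][t])
        if scores[best_k][t] >= min_score:
            labels.append(candidates[best_k][t])
        else:
            labels.append(None)  # no candidate is trusted for this frame
    return labels
```

The resulting trajectory can mix predictions from different trackers across frames, which is the point of evaluating candidates per frame rather than per video.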
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Yaoqi Sun, Sam Kwong
First: 2026-03-12T17:34:29+00:00 · Latest: 2026-03-12T17:34:29+00:00
Abstract
Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
中文标题/摘要
标题:RDNet:光学遥感图像中区域比例感知动态自适应显著目标检测网络
光学遥感图像中的显著目标检测(SOD)面临着显著挑战,由于目标大小的巨大变化、自注意力机制的计算成本以及基于CNN的提取器在捕捉全局上下文和长程依赖方面的局限性。现有依赖固定卷积核的方法往往难以适应多样的目标尺度,导致细节丢失或无关特征聚合。为了解决这些问题,这项工作旨在增强对尺度变化的鲁棒性并实现精确的目标定位。我们提出了区域比例感知动态自适应显著目标检测网络(RDNet),用SwinTransformer替换CNN骨干以建模全局上下文,并引入了三个关键模块:(1)动态自适应细节感知(DAD)模块,该模块根据目标区域比例应用不同的卷积核;(2)频率匹配上下文增强(FCE)模块,该模块通过小波交互和注意力丰富上下文信息;(3)区域比例感知定位(RPL)模块,该模块采用交叉注意力突出语义细节,并结合比例引导(PG)块辅助DAD模块。通过结合这些模块,RDNet实现了对尺度变化的鲁棒性和准确的定位,相比现有最先进的方法提供了更好的检测性能。
Summary / 总结
This paper addresses the challenges of salient object detection in remote sensing images by proposing RDNet, which uses SwinTransformer for global context modeling and introduces three key modules: Dynamic Adaptive Detail-aware (DAD), Frequency-matching Context Enhancement (FCE), and Region Proportion-aware Localization (RPL). The DAD module applies varied convolution kernels based on object region proportions, the FCE module enhances contextual information through wavelet interactions and attention, and the RPL module uses cross-attention to highlight semantic details. Experimental results show that RDNet outperforms existing state-of-the-art methods in terms of robustness to scale variations and accurate localization.
本文针对遥感图像中显著目标检测面临的挑战,如目标尺寸变化大和计算成本高。提出了一种RDNet方法,使用SwinTransformer进行全局上下文建模,并引入了三个模块:DAD基于目标区域比例应用可变卷积核,FCE通过小波交互和注意力增强上下文信息,RPL使用交叉注意力精确定位并结合比例引导块辅助DAD模块。实验结果表明,RDNet在处理尺度变化和实现精确目标定位方面优于现有方法。
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
First: 2026-03-12T17:30:49+00:00 · Latest: 2026-03-12T17:30:49+00:00
Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
中文标题/摘要
标题:ForensicZip:更多的标记更好但并非必要——在法医视觉语言模型中的应用
多模态大型语言模型(MLLMs)通过生成伪造检测的文本解释来实现多媒体可解释的法医分析。然而,处理密集的视觉序列会带来高昂的计算成本,特别是对于高分辨率图像和视频。视觉标记剪枝是一种实用的加速策略,但现有方法主要基于语义驱动,保留显著对象,而丢弃包含伪造痕迹(如高频异常和时间抖动)的背景区域。为了解决这一问题,我们引入了ForensicZip,这是一种无需训练的框架,从伪造驱动的角度重新定义了标记压缩。ForensicZip将时间标记演变建模为具有松弛虚拟节点的出生-死亡最优传输问题,量化物理不连续性以指示瞬态生成伪影。法医评分进一步将传输基础的新颖性与高频先验相结合,在大比例压缩下分离法医证据和语义内容。在深度伪造和AIGC基准测试中,即使在10%的标记保留率下,ForensicZip也实现了2.97倍的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
Summary / 总结
The research aims to improve the efficiency of forensic vision-language models by addressing the high computational costs associated with processing dense visual sequences. ForensicZip, a training-free framework, reformulates token compression from a forgery-driven perspective, focusing on quantifying physical discontinuities to detect transient generative artifacts. Experiments show that at 10% token retention, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
研究旨在通过解决处理密集视觉序列时的高计算成本问题,提高法医视觉-语言模型的效率。ForensicZip 是一个无需训练的框架,从伪造驱动的角度重新定义了标记压缩,使用生灭最优传输问题来量化物理不连续性。在10%标记保留的情况下,ForensicZip 实现了2.97倍的加速和超过90%的FLOPs减少,同时在深度伪造和AIGC基准测试中保持了最先进的检测性能。
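The idea of keeping a fixed fraction of tokens ranked by a forensic score can be sketched as below. This does not implement the Birth-Death Optimal Transport formulation from the abstract: the novelty and high-frequency scores are taken as given inputs, and the weighted sum with `alpha` is a simple stand-in chosen for illustration. The function name `retain_tokens` and all parameter names are hypothetical.

```python
from typing import List


def retain_tokens(
    novelty: List[float],    # assumed per-token transport-based novelty scores
    high_freq: List[float],  # assumed per-token high-frequency prior scores
    ratio: float = 0.10,     # fraction of tokens to keep
    alpha: float = 0.5,      # hypothetical mixing weight between the two scores
) -> List[int]:
    """Return the indices of the tokens to keep, in their original order."""
    n = len(novelty)
    combined = [alpha * nv + (1.0 - alpha) * hf for nv, hf in zip(novelty, high_freq)]
    keep = max(1, int(round(ratio * n)))  # always keep at least one token
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])
```

Returning the kept indices in their original order preserves the spatial and temporal ordering of the surviving tokens, which a downstream model would typically expect.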
LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models
Authors: Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu, Wenhui Zhao, Dingwen Zhang
First: 2026-01-10T12:18:12+00:00 · Latest: 2026-03-12T17:26:37+00:00
Abstract
Multi-Object Tracking (MOT) is evolving from geometric localization to Semantic MOT (SMOT) to answer complex relational queries, yet progress is hindered by semantic data scarcity and a structural disconnect between tracking architectures and Multi-modal Large Language Models (MLLMs). To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts. Furthermore, we propose LLMTrack, the first framework to seamlessly integrate MLLMs into the SMOT task. LLMTrack establishes a Macro-Understanding-First paradigm, utilizing a novel Spatio-Temporal Fusion Module to align discrete geometric trajectories with continuous semantic features, effectively suppressing temporal hallucinations during online processing. Extensive experiments demonstrate that LLMTrack achieves state-of-the-art geometric tracking performance while delivering a qualitative leap in dynamic semantic reasoning. Notably, our analysis reveals that high-quality semantic narratives empower the language model to deduce complex social interactions naturally, demonstrating that direct cognitive reasoning is more effective than cumbersome explicit visual modeling. Ultimately, our contributions bridge the gap between perceptual tracking and cognitive reasoning, establishing a robust new foundation for comprehensive video understanding and intelligent narrative generation.
中文标题/摘要
标题:LLMTrack:使用多模态大型语言模型的语义多对象跟踪
多对象跟踪(MOT)正在从几何定位发展到语义MOT(SMOT)以回答复杂的关联查询,但进展受到语义数据稀缺性和跟踪架构与多模态大型语言模型(MLLMs)之间结构性脱节的阻碍。为了解决这一问题,我们引入了Grand-SMOT,这是一个大规模、开放世界的基准,提供了高密度、双流叙事,全面地将个体行为与环境背景解耦。此外,我们提出了LLMTrack,这是第一个将MLLM无缝集成到SMOT任务中的框架。LLMTrack确立了先宏观理解的范式,利用新颖的空间-时间融合模块将离散的几何轨迹与连续的语义特征对齐,在在线处理过程中有效抑制了时间幻觉。广泛的实验表明,LLMTrack在几何跟踪性能上达到了最先进的水平,同时在动态语义推理方面实现了质的飞跃。值得注意的是,我们的分析表明,高质量的语义叙事使语言模型能够自然地推断复杂的社交互动,表明直接的认知推理比繁琐的显式视觉建模更有效。最终,我们的贡献弥合了感知跟踪与认知推理之间的差距,为全面的视频理解和智能叙事生成奠定了坚实的新基础。
Summary / 总结
The research aims to advance Semantic Multi-Object Tracking (SMOT) by addressing semantic data scarcity and structural disconnects. The authors introduce LLMTrack, a framework that integrates Multi-modal Large Language Models (MLLMs) into SMOT, using a Spatio-Temporal Fusion Module to align geometric trajectories with semantic features. Experiments show that LLMTrack achieves superior geometric tracking and enhanced dynamic semantic reasoning, highlighting the effectiveness of high-quality semantic narratives in deducing complex social interactions.
研究通过引入LLMTrack,将多模态大型语言模型(MLLMs)集成到SMOT中以应对挑战。该方法使用时空融合模块将几何轨迹与语义特征对齐,提升跟踪性能和动态语义推理。实验表明,LLMTrack在几何跟踪上优于现有方法,并通过高质量的语义叙述增强对复杂社会互动的理解。
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
Authors: Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Venue: CVPR 2026
First: 2026-03-12T17:23:46+00:00 · Latest: 2026-03-12T17:23:46+00:00
Comments: Accepted to CVPR 2026. See project page at https://lmzpai.github.io/SaPaVe
Abstract
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(π_0\), achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
中文标题/摘要
标题:SaPaVe:迈向机器人视觉-语言-动作模型中的主动感知与操作
主动感知与操作对于机器人与复杂场景交互至关重要。现有方法难以将语义驱动的主动感知与稳健、视角不变的操作执行统一起来。我们提出SaPaVe,这是一种端到端框架,能够以数据高效的方式联合学习这些能力。我们的方法将相机动作和操作动作分离,而不是将它们放在共享的动作空间中,并采用自底向上的训练策略:我们首先在大规模数据集上训练语义相机控制,然后使用混合数据联合优化两种动作类型。为了支持这一框架,我们引入了ActiveViewPose-200K数据集,包含20万个图像-语言-相机运动对,用于语义相机运动学习,并引入了一个3D几何感知模块,以提高在动态视角下的执行鲁棒性。我们还提出了ActiveManip-Bench,这是第一个用于评估超越固定视角设置的主动操作的基准。在模拟和真实环境中的广泛实验表明,SaPaVe优于最近的视觉-语言-动作模型(如GR00T N1和\(π_0\)),在现实任务中的成功率最高提升31.25%。这些结果表明,当以分离但协调的方式训练紧密耦合的感知和执行时,能够实现高效且通用的主动操作。项目页面:https://lmzpai.github.io/SaPaVe
ReSplat: Learning Recurrent Gaussian Splatting
Authors: Haofei Xu, Daniel Barath, Andreas Geiger, Marc Pollefeys
First: 2025-10-09T17:59:59+00:00 · Latest: 2026-03-12T17:18:50+00:00
Comments: Project page: https://haofeixu.github.io/resplat/ Code: https://github.com/cvg/resplat
Abstract
While existing feed-forward Gaussian splatting models offer computational efficiency and can generalize to sparse view settings, their performance is fundamentally constrained by relying on a single forward pass for inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization across datasets, view counts, and image resolutions. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying number of input views (2, 8, 16, 32), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV, RealEstate10K, and ACID) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.
中文标题/摘要
标题:ReSplat:学习递归高斯点云
虽然现有的前馈高斯点云模型提供了计算效率并能泛化到稀疏视角设置,但它们的性能从根本上受限于仅依赖单次前向推理。我们提出了ReSplat,一种前馈递归高斯点云模型,通过迭代细化3D高斯分布而不显式计算梯度。我们的核心洞察是,高斯点云渲染误差作为丰富的反馈信号,指导递归网络学习有效的高斯更新。这种反馈信号在测试时自然适应未见过的数据分布,使方法在不同数据集、视角数量和图像分辨率下实现稳健泛化。为了初始化递归过程,我们引入了一个紧凑的重建模型,在$16 \times$下采样空间中操作,生成的高斯数量比之前的逐像素高斯模型少$16 \times$。这大大减少了计算开销并允许高效更新高斯。在不同输入视角数量(2, 8, 16, 32)、分辨率($256 \times 256$到$540 \times 960$)和数据集(DL3DV、RealEstate10K和ACID)的广泛实验中,我们的方法在显著减少高斯数量和提高渲染速度的同时,实现了最先进的性能。项目页面:https://haofeixu.github.io/resplat/
Summary / 总结
ReSplat is a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without computing gradients. It uses the Gaussian splatting rendering error as a feedback signal to guide the recurrent network, enabling robust generalization across datasets and image resolutions. ReSplat reduces the number of Gaussians and computational overhead by initializing the process with a compact reconstruction model operating in a subsampled space. Experiments show that ReSplat achieves state-of-the-art performance with fewer Gaussians and faster rendering speed compared to existing methods.
ReSplat是一种前向递归高斯点云模型,无需显式计算梯度即可迭代细化3D高斯。它使用高斯点云渲染误差作为反馈信号来引导递归网络,从而在不同数据集和图像分辨率下实现鲁棒泛化。实验表明,ReSplat在比之前方法更少的高斯点数和更快的渲染速度下达到最先进的性能。
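To make the "$16 \times$ fewer Gaussians" claim concrete, the arithmetic can be sketched as below. The sketch assumes that a per-pixel baseline predicts exactly one Gaussian per input pixel per view and that the reduction applies to the total count. Both are simplifying assumptions for illustration, and the function name `gaussian_counts` is hypothetical.

```python
from typing import Tuple


def gaussian_counts(num_views: int, height: int, width: int, reduction: int = 16) -> Tuple[int, int]:
    """Return (per-pixel Gaussian count, compact-model Gaussian count).

    Assumes one Gaussian per pixel per view for the baseline and that the
    pixel total is divisible by the reduction factor.
    """
    per_pixel = num_views * height * width
    compact = per_pixel // reduction
    return per_pixel, compact


# Two of the settings mentioned in the abstract.
print(gaussian_counts(2, 256, 256))   # 2 views at 256 x 256
print(gaussian_counts(32, 540, 960))  # 32 views at 540 x 960
```

Under these assumptions, two 256 x 256 views give 131,072 per-pixel Gaussians versus 8,192 for the compact model, and 32 views at 540 x 960 give 16,588,800 versus 1,036,800.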
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Authors: Abhinaba Basu, Pavan Chakraborty
First: 2026-03-12T17:13:25+00:00 · Latest: 2026-03-12T17:13:25+00:00
Abstract
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening - a 25% improvement in discovery yield.
中文标题/摘要
标题:证明携带材料:机器学习原子势能的可验证安全证书
机器学习原子势能(MLIPs)被用于高通量材料筛选,但缺乏正式的可靠性保证。我们展示了单一MLIP用作稳定性筛选器时,未能识别出93%的密度泛函理论(DFT)稳定的材料(召回率0.07)在25,000种材料的基准测试中。通过三个阶段的证明携带材料(PCM)来弥补这一差距:针对组成空间的对抗性反证、使用95%置信区间进行的Bootstrap包络细化,以及Lean 4形式化认证。审计CHGNet、TensorNet和MACE揭示了特定架构的盲点,配对误差相关性接近零(r <= 0.13;n = 5,000),并由独立的Quantum ESPRESSO验证所证实(20/20收敛;DFT/CHGNet力的中位数比值12倍)。基于PCM发现特征训练的风险模型在未见材料上预测失败(AUC-ROC = 0.938 ± 0.004),并在不同架构间具有可转移性(跨MLIP AUC-ROC ~ 0.70;特征重要性r = 0.877)。在热电材料筛选案例研究中,PCM审计的协议发现了62种额外的稳定材料,比单一MLIP筛选提高了25%的发现率。
Summary / 总结
The research addresses the lack of reliability guarantees for machine-learned interatomic potentials (MLIPs) used in materials screening. It introduces Proof-Carrying Materials (PCM), a three-stage procedure of adversarial falsification, bootstrap envelope refinement, and Lean 4 formal certification. The study finds that a single MLIP misses 93% of DFT-stable materials, while PCM-audited protocols discover 62 additional stable materials, improving the discovery yield by 25%. It also identifies architecture-specific blind spots and develops a risk model that predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004).
研究旨在解决机器学习原子势能(MLIPs)在材料筛选中缺乏可靠性保证的问题。引入了证明携带材料(PCM)的方法,包括对抗性反驳、Bootstrap边界修正和形式化认证。研究发现,单一MLIP会错过93%的DFT稳定材料,而PCM审核的协议能够发现62种额外的稳定材料,提高了发现率25%。研究还揭示了架构特定的盲点,并建立了一个风险模型,能够以高精度预测失败。
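The headline numbers in this entry are related by simple arithmetic, sketched below. The recall and miss-rate relation is standard. The implied baseline count is only an inference under the assumption that the 25% improvement is measured relative to the single-MLIP discovery yield, since the abstract does not state the baseline count. All function names are hypothetical.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of truly stable materials that the filter keeps."""
    return true_positives / (true_positives + false_negatives)


def miss_rate(recall_value: float) -> float:
    """Fraction of truly stable materials that the filter discards."""
    return 1.0 - recall_value


def implied_baseline(additional: int, relative_gain: float) -> float:
    """Baseline yield implied by an absolute gain and a relative gain.

    Assumes relative_gain = additional / baseline, which the abstract
    does not state explicitly.
    """
    return additional / relative_gain


print(miss_rate(0.07))             # recall 0.07 corresponds to missing about 93%
print(implied_baseline(62, 0.25))  # baseline count implied under the stated assumption
```

Under that assumption, 62 additional materials at a 25% relative gain would correspond to a single-MLIP yield of 248 materials.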
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
First: 2026-03-12T17:11:22+00:00 · Latest: 2026-03-12T17:11:22+00:00
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
中文标题/摘要
标题:战略导航还是随机搜索?代理和人类在文档集合中的推理方式
多模态代理为自动化复杂文档密集型工作流提供了有希望的途径。然而,一个关键问题仍然存在:这些代理是否展示了真正的战略推理,还是仅仅进行了随机的尝试和错误搜索?为了解决这个问题,我们引入了MADQA基准,包含2,250个人撰写的基于800份异构PDF文档的问题。根据经典测验理论,我们设计它以最大化在不同代理能力水平上的区分力。为了评估代理行为,我们引入了一种新的评估协议,衡量准确性和努力之间的权衡。使用这一框架,我们表明,虽然最好的代理在纯准确度上可以与人类搜索者匹敌,但它们回答的问题类型不同,并依赖于暴力搜索来弥补薄弱的战略规划。它们未能缩小与Oracle性能近20%的差距,持续陷入无效循环。我们发布了数据集和评估框架,以帮助促进从暴力检索向校准、高效的推理过渡。
Summary / 总结
This study investigates whether multimodal agents exhibit strategic reasoning or merely rely on stochastic search when navigating document collections. The researchers developed MADQA, a benchmark consisting of 2,250 human-authored questions based on 800 diverse PDF documents. Using a novel evaluation protocol, they found that top-performing agents can achieve similar accuracy to human searchers but often use brute-force methods, lacking strategic planning. These agents fail to match oracle performance, getting stuck in unproductive loops and leaving a significant gap of nearly 20% unbridged. The dataset and evaluation framework are made available to advance the field towards more efficient reasoning systems.
研究探讨了多模态代理在导航文档集合时是展示战略推理能力还是仅依赖于随机搜索。研究人员开发了MADQA基准,包含基于800份不同PDF文档的2,250个人撰写的查询问题。通过一种新的评估协议,他们发现表现最好的代理在准确率上可以与人类搜索者相媲美,但往往依赖于暴力搜索方法,缺乏战略规划能力。这些代理未能达到Oracle性能,经常陷入无效循环,留下近20%的性能差距未弥补。该数据集和评估框架已公开,以促进向更高效推理系统的过渡。
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
First: 2026-03-12T17:09:20+00:00 · Latest: 2026-03-12T17:09:20+00:00
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
中文标题/摘要
标题:BehaviorVLM:统一的无需微调的行为理解与视觉-语言推理
理解自由移动的动物行为是神经科学的核心,其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而,这两个任务仍然严重依赖于人工注释或不稳定的无监督管道,限制了其可扩展性和可重复性。我们提出了BehaviorVLM,这是一种无需特定任务微调且只需少量人工标注的统一视觉-语言框架,通过引导预训练的视觉-语言模型(VLMs)进行详细的、明确的和可验证的推理步骤。对于姿态估计,我们利用量子点标注的行为数据,并提出了一种多阶段管道,结合了时间、空间和跨视图推理。这种设计大大减少了人工标注的工作量,通过几何检查如重投影误差暴露了低置信度的标签,并生成了可以后期过滤、修正或用于微调下游姿态模型的标签。对于行为理解,我们提出了一种管道,结合了深度嵌入聚类进行过度分割行为发现、基于VLM的每段视频字幕生成,以及基于LLM的推理以合并和语义标注行为片段。行为管道可以直接从视觉信息中运行,不需要关键点来分割行为。这些组件共同实现了多动物行为的大规模、可解释和轻标注分析。
Summary / 总结
BehaviorVLM is a unified vision-language framework for pose estimation and behavioral understanding in neuroscience, which does not require task-specific finetuning or extensive human labeling. It leverages detailed reasoning steps to guide pretrained models, reducing human annotation effort and enabling geometric checks for label verification. For pose estimation, it uses a multi-stage pipeline with temporal, spatial, and cross-view reasoning, while for behavioral understanding, it integrates deep embedded clustering, VLM-based video captioning, and LLM-based reasoning to discover and label behaviors directly from visual information.
BehaviorVLM 是一个无需特定任务微调或大量人工标注的统一视觉-语言框架,用于神经科学中的姿态估计和行为理解。它通过详细的推理步骤引导预训练模型,减少人工标注工作量,并通过几何检查进行标签验证。在姿态估计方面,它使用包含时间、空间和跨视图推理的多阶段管道;而在行为理解方面,它结合了深度嵌入聚类、基于VLM的视频字幕生成和基于LLM的推理,直接从视觉信息中发现和标注行为。
Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Authors: Masoumeh Sharafi, Soufiane Belharbi, Muhammad Osama Zeeshan, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
First: 2025-08-08T20:13:50+00:00 · Latest: 2026-03-12T17:05:14+00:00
Abstract
Facial expression recognition (FER) models are widely used in video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting performance in real-world settings. Source-free domain adaptation (SFDA) has been proposed to personalize a pretrained source model using only unlabeled target data, avoiding privacy, storage, and transmission constraints. We address a particularly challenging setting where source data is unavailable and the target data contains only neutral expressions. Existing SFDA methods are not designed for adaptation from a single target class, while generating non-neutral facial images is often unstable and expensive. To address this, we propose Source-Free Domain Adaptation with Personalized Feature Translation (SFDA-PFT), a lightweight latent-space approach. A translator is first pretrained on source data to map subject-specific style features between subjects while preserving expression information through expression-consistency and style-aware objectives. It is then adapted to neutral target data without source data or image synthesis. By operating in the latent space, SFDA-PFT avoids noisy facial image generation, reduces computation, and learns discriminative embeddings for classification. Experiments on BioVid, StressID, BAH, and Aff-Wild2 show that SFDA-PFT consistently outperforms state-of-the-art SFDA methods in privacy-sensitive FER scenarios. Our code is publicly available at https://github.com/MasoumehSharafi/SFDA-PFT.
中文标题/摘要
标题:个性化特征翻译在表情识别中的应用:一种高效的源无域适应方法
面部表情识别(FER)模型广泛应用于基于视频的情感计算应用,如人机交互和健康监测。然而,深度FER模型往往难以识别细微的表情和高个体间差异,限制了其在实际环境中的性能。源无域适应(SFDA)已被提出,仅使用目标的未标记数据来个性化预训练的源模型,从而避免隐私、存储和传输的限制。我们解决了一个特别具有挑战性的场景,即源数据不可用且目标数据仅包含中性表情。现有的SFDA方法未设计用于从单一目标类别进行适应,而生成非中性面部图像往往不稳定且昂贵。为了解决这个问题,我们提出了源无域适应与个性化特征翻译(SFDA-PFT),这是一种轻量级的潜在空间方法。首先在源数据上预训练一个翻译器,以在不同个体之间映射个体特定的风格特征,同时通过表情一致性目标和风格意识目标保留表情信息。然后,它无需源数据或图像合成即可适应中性目标数据。通过在潜在空间中操作,SFDA-PFT避免了嘈杂的面部图像生成,减少了计算量,并学习了用于分类的判别嵌入。在BioVid、StressID、BAH和Aff-Wild2上的实验表明,SFDA-PFT在隐私敏感的FER场景中始终优于最先进的SFDA方法。我们的代码已公开:https://github.com/MasoumehSharafi/SFDA-PFT
Summary / 总结
The research aims to improve facial expression recognition (FER) models in real-world settings by addressing the limitations of deep FER models with subtle expressions and high inter-subject variability. The proposed method, Source-Free Domain Adaptation with Personalized Feature Translation (SFDA-PFT), uses a lightweight latent-space approach to personalize a pretrained model using only unlabeled target data, which contains only neutral expressions. The method first pretrains a translator on source data to map subject-specific style features while preserving expression information, and then adapts it to neutral target data without requiring source data or image synthesis. Experiments on BioVid, StressID, BAH, and Aff-Wild2 demonstrate that SFDA-PFT consistently outperforms existing state-of-the-art SFDA methods in privacy-sensitive FER scenarios.
研究旨在通过解决深度面部表情识别(FER)模型在真实环境中的局限性,即对细微表情和高个体差异的识别能力不足,来提升FER模型的表现。提出的轻量级潜空间方法Source-Free Domain Adaptation with Personalized Feature Translation (SFDA-PFT) 使用仅有的未标记目标数据来个性化预训练模型,无需使用源数据或图像合成。实验结果表明,SFDA-PFT 在BioVid、StressID、BAH 和 Aff-Wild2 上的一系列隐私敏感的FER场景中,始终优于现有的源无域适应方法。
LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu
First: 2026-03-12T17:01:23+00:00 · Latest: 2026-03-12T17:01:23+00:00
Abstract
Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
中文标题/摘要
标题:LatentGeo:学习潜在空间中的辅助构造以进行多模态几何推理
尽管在多模态推理方面取得了近期进展,但在多模态大型语言模型(MLLMs)中表示辅助几何构造仍然是一个基本挑战。这些构造不在原始图中,必须在应用定理之前引入。现有方法主要依赖显式的构造范式,包括基于文本的几何规范、推理期间的视觉标记交错以及工具增强的几何执行。然而,这些方法要么无法忠实表示复杂的空间关系,要么在离散符号和连续几何结构之间产生表示不匹配,或者依赖外部能力,这妨碍了端到端优化。为了解决这些限制,我们提出了LatentGeo框架,该框架学习连续的潜在视觉表示,以在不进行像素级渲染或外部执行的情况下内化辅助几何构造。我们设计了一个三阶段课程,通过辅助视觉监督逐步对齐和内化这些潜在表示,随后是LaGDPO,一种潜在感知的强化学习过程,该过程在策略优化过程中稳定潜在表示,同时提高最终任务的正确性。为了系统地评估构造为中心的表示质量,我们引入了GeoAux,这是一个针对视觉依赖几何问题的新基准,并在GeoAux和MathVerse上进行了实验。结果表明,LatentGeo在几何推理任务中取得了显著的改进,特别是那些需要辅助构造的任务。广泛的分析和消融研究进一步验证了我们框架中每个组件的有效性。
Summary / 总结
LatentGeo addresses the challenge of representing auxiliary geometric constructions in multimodal reasoning by learning continuous latent visual representations. It uses a three-stage curriculum and a latent-aware reinforcement learning procedure to internalize these constructions without pixel-level rendering or external executors. Experiments on GeoAux and MathVerse show that LatentGeo significantly improves geometric reasoning tasks, especially those requiring auxiliary constructions.
LatentGeo通过学习连续的潜在视觉表示来解决多模态推理中辅助几何构造的表示问题,使用三阶段课程和潜在感知的强化学习过程来内部化这些构造,而不依赖于像素级渲染或外部执行器。在GeoAux和MathVerse上的实验表明,LatentGeo在几何推理方面取得了显著改进,特别是在需要辅助构造的任务上。
A Variational Latent Equilibrium for Learning in Neuronal Circuits
Authors: Simon Brandt, Paul Haider, Walter Senn, Federico Benitez, Mihai A. Petrovici
First: 2026-03-10T12:44:48+00:00 · Latest: 2026-03-12T16:55:52+00:00
Abstract
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
中文标题/摘要
标题:神经环路学习中的变分潜均衡
大脑在识别和生成复杂的时空模式方面仍然无与伦比。尽管人工智能能够再现其中的一些能力,但深度学习算法在很大程度上仍与我们目前对大脑环路和动力学的理解不一致。这在时间反向传播(BPTT)中尤为明显,它是学习复杂时间依赖性的首选算法。在这项工作中,我们提出了一种通用的形式化方法,以受控且生物学上合理的方式近似BPTT。我们的方法建立在先前几种基于能量守恒和极值作用原理的局部、时间连续、无相位的时空信用分配方法之上,并对其加以统一和扩展。我们的出发点是一个神经元状态的前瞻性能量函数,从中我们计算出时间连续神经网络的实时误差动力学。在一般情况下,这为神经网络的伴随方法结果(BPTT的时间连续等价物)提供了简单而直接的推导。通过一些修改,我们可以将其转化为一组在空间和时间上完全局部的神经元和突触动力学方程。我们的理论为大脑中的时空深度学习提供了一个严谨的框架,同时也为能够执行这些计算的物理电路提供了蓝图。这些结果重新表述并扩展了最近提出的广义潜均衡(GLE)模型。
Summary / 总结
This research aims to narrow the gap between deep learning and our understanding of brain circuitry by proposing a general formalism that approximates backpropagation through time (BPTT) in a controlled, biologically plausible manner. The formalism unifies and extends previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment. Starting from a prospective energy function of neuronal states, it derives real-time error dynamics for time-continuous neuronal networks. Key findings include a straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent of BPTT, and, after a few modifications, a set of equations for neuron and synapse dynamics that is fully local in space and time. The theory offers a framework for spatiotemporal deep learning in the brain and a blueprint for physical circuits capable of these computations, reframing and extending the Generalized Latent Equilibrium (GLE) model.
本研究旨在提出一种通用形式化方法,以受控且生物学上合理的方式近似时间反向传播(BPTT),从而弥合深度学习与脑环路之间的差距。该方法统一并扩展了先前基于能量守恒和极值作用原理的局部、时间连续、无相位的时空信用分配方法,并以神经元状态的前瞻性能量函数为出发点,推导出时间连续神经网络的实时误差动力学。关键发现包括神经网络伴随方法结果的推导,以及一组在空间和时间上完全局部的神经元和突触动力学方程,为大脑中的时空深度学习提供了框架,并为能够执行这些计算的物理电路提供了蓝图。
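For orientation, the adjoint method mentioned in the abstract can be written in its generic textbook form. The notation below (state $u$, parameters $\theta$, dynamics $f$, instantaneous loss $\ell$, adjoint variable $\lambda$) is assumed here for illustration and is not taken from the paper, whose formulation may differ:

```latex
\begin{aligned}
\dot{u}(t) &= f\big(u(t), \theta, t\big), \qquad
\mathcal{L}(\theta) = \int_0^T \ell\big(u(t), t\big)\,\mathrm{d}t, \\
\dot{\lambda}(t) &= -\left(\frac{\partial f}{\partial u}\right)^{\!\top}\lambda(t)
  - \left(\frac{\partial \ell}{\partial u}\right)^{\!\top}, \qquad \lambda(T) = 0, \\
\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} &= \int_0^T \lambda(t)^{\top}\,
  \frac{\partial f}{\partial \theta}\,\mathrm{d}t .
\end{aligned}
```

The adjoint variable is integrated backward in time from $\lambda(T) = 0$, which is the continuous-time counterpart of propagating errors backward through the unrolled steps in BPTT.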
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
First: 2026-03-12T16:53:06+00:00 · Latest: 2026-03-12T16:53:06+00:00
Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
中文标题/摘要
标题:GlyphBanana:通过自主工作流提升精确文本渲染
尽管生成模型的最新进展推动了文本渲染的显著进步,但准确生成复杂文本和数学公式仍然是一项艰巨的挑战。这一困难主要源于当前模型在遇到分布外提示时有限的指令遵循能力。为解决这一问题,我们引入了GlyphBanana,以及一个专门为渲染复杂字符和公式而设计的配套基准。GlyphBanana采用了一种自主工作流,借助辅助工具将字形模板注入潜在空间和注意力图,从而促进生成图像的迭代优化。值得注意的是,我们的免训练方法可以无缝应用于各种文本到图像(T2I)模型,相比现有基线具有更高的精度。大量实验证明了我们提出的工作流的有效性。相关代码已公开发布在 https://github.com/yuriYanZeXuan/GlyphBanana 。
Summary / 总结
The research aims to improve the precision of rendering complex text and mathematical formulas, a task on which current generative models struggle because of limited instruction following on out-of-distribution prompts. GlyphBanana uses an agentic workflow that integrates auxiliary tools to inject glyph templates into the latent space and attention maps, enabling iterative refinement of generated images. The approach is training-free and can be applied to various Text-to-Image (T2I) models, and experiments show higher precision than existing baselines. The work also introduces a benchmark for rendering complex characters and formulas, and the code is publicly available.
研究旨在通过解决当前生成模型的局限性,提高复杂文本和数学公式的渲染精度。GlyphBanana提出了一种代理工作流,通过辅助工具将字形模板注入潜空间和注意力图中,实现迭代优化。实验表明,该方法在不需训练的情况下,相比现有方法具有更高的精度。
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar
First: 2026-03-12T16:49:21+00:00 · Latest: 2026-03-12T16:49:21+00:00
Comments: 29 pages, 27 figures. Under review
Abstract
While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.
中文标题/摘要
标题:IsoCompute 指南:优化大型语言模型 RL 后训练采样计算量
虽然缩放定律指导了 LLM 预训练的计算分配,但对于大型语言模型(LLM)的强化学习(RL)后训练,类似的指导原则仍然知之甚少。我们研究了 LLM 中 on-policy RL 方法采样计算的计算最优分配,将缩放问题表述为在计算约束下对三种资源的优化:每个问题的并行 rollout 数量、每个批次的问题数量以及更新步数。我们发现,每个问题的计算最优并行 rollout 数量随着计算预算的增加而可预测地增加,然后饱和。这一趋势在简单和困难的问题上都成立,但驱动机制不同:简单问题上是解的锐化,困难问题上是覆盖范围的扩展。我们还表明,增加并行 rollout 数量可以减轻问题间的干扰,而每个批次的问题数量主要影响训练稳定性,可以在较宽的范围内选择。我们的结果在不同的基础模型和数据分布上得到了验证,将 RL 缩放定律重新表述为规定性的分配规则,并为计算高效的 LLM RL 后训练提供了实用指导。
Summary / 总结
This study addresses the lack of compute-allocation prescriptions for reinforcement learning (RL) post-training of large language models (LLMs), focusing on how to allocate sampling compute for on-policy RL methods. It frames the problem as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. The compute-optimal number of parallel rollouts per problem increases predictably with the compute budget and then saturates; the trend holds on both easy and hard problems but is driven by solution sharpening on the former and coverage expansion on the latter. Increasing the number of parallel rollouts also mitigates interference across problems, whereas the number of problems per batch mainly affects training stability and can be chosen within a broad range. These results offer practical guidance for compute-efficient RL post-training of LLMs.
本研究针对大型语言模型(LLM)强化学习(RL)后训练缺乏计算分配指导的问题,研究 on-policy RL 方法采样计算的最优分配。研究将问题建模为在计算约束下对并行 rollout 数量、每个批次的问题数量和更新步数的优化。关键发现包括:最优的并行 rollout 数量随计算预算增加而增加,然后饱和;这一现象在简单和困难问题上由不同机制驱动。增加并行 rollout 数量可以减轻问题间的干扰,而每个批次的问题数量主要影响训练稳定性,可以在较宽范围内选择。这些结果为计算高效的 LLM RL 后训练提供了实用指导。
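The allocation problem described above can be sketched as a small enumeration. The sketch assumes that sampling compute is simply the product of rollouts per problem, problems per batch, and update steps, measured in total rollouts; this cost model and the function name `iso_compute_allocations` are illustrative assumptions, not the paper's definitions.

```python
from typing import Iterator, Tuple


def iso_compute_allocations(budget: int) -> Iterator[Tuple[int, int, int]]:
    """Yield every (rollouts_per_problem, problems_per_batch, update_steps)
    whose product equals `budget` (total rollouts sampled).

    Assumption: cost = n * b * m; the paper may use a different cost model.
    """
    if budget < 1:
        raise ValueError("budget must be a positive integer")
    for n in range(1, budget + 1):
        if budget % n:
            continue  # n must divide the budget exactly
        rest = budget // n
        for b in range(1, rest + 1):
            if rest % b == 0:
                yield n, b, rest // b


# Every configuration on the same iso-compute surface for a toy budget.
allocations = list(iso_compute_allocations(12))
```

Comparing final performance across such equal-cost configurations is what allows one to ask which split of a fixed budget works best.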
Linking Perception, Confidence and Accuracy in MLLMs
Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu
First: 2026-03-12T16:47:42+00:00 · Latest: 2026-03-12T16:47:42+00:00
Comments: Accepted by CVPR2026
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
中文标题/摘要
标题:连接感知、信心与准确性在多模态大型语言模型中的关系
近年来,多模态大型语言模型(MLLMs)的进展主要集中在增强视觉感知以提高准确性。然而,一个关键问题尚未得到探索:模型是否知道自己不知道?通过一项探针实验,我们揭示了MLLMs中严重的信心失校准问题。为了解决这一问题,我们提出了信心驱动强化学习(CDRL),它使用原始噪声图像对和一种新颖的信心基奖励来增强感知灵敏度并稳健校准模型的信心。除了训练收益外,校准的信心还能够作为免费午餐实现更有效的测试时扩展。我们进一步提出了信心感知测试时扩展(CA-TTS),它动态协调自我一致性、自我反思和视觉自我检查模块,并由信心信号引导。专家模型扮演多种角色(例如,规划者、评论员、投票者)来调度这些模块并提供外部验证。我们集成的框架在四个基准测试中建立了新的最佳结果,一致地提高了8.8%。更多的消融研究证明了每个模块的有效性和扩展优势。
Summary / 总结
This study addresses the confidence miscalibration issue in Multi-modal Large Language Models (MLLMs) by proposing Confidence-Driven Reinforcement Learning (CDRL) and Confidence-Aware Test-Time Scaling (CA-TTS). CDRL uses original-noise image pairs and a confidence-based reward to improve perceptual sensitivity and calibrate model confidence. CA-TTS dynamically coordinates modules guided by confidence signals, enhancing test-time scaling. The integrated framework achieves consistent 8.8% gains across four benchmarks.
研究通过提出信心驱动的强化学习(CDRL)和信心感知的测试时缩放(CA-TTS)来解决多模态大型语言模型(MLLM)的信心失校准问题。CDRL 使用原始噪声图像对和基于信心的奖励来提高感知灵敏度并校准模型的信心。CA-TTS 动态协调由信心信号引导的模块,实现四个基准上一致的 8.8% 收益。消融研究证实了每个模块的有效性以及所提框架的缩放优势。
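One common way to quantify the kind of confidence miscalibration described above is the expected calibration error (ECE). The abstract does not say which metric the authors use, so the following is a generic sketch with hypothetical inputs, not their evaluation code.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    if not confidences:
        raise ValueError("inputs must be non-empty")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that confidences of exactly 0.0 or 1.0 land in a valid bin.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: one overconfident model, one well-calibrated model.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
```

A model that "knows when it does not know" would report low confidence on the answers it gets wrong, which drives this quantity toward zero.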
EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
Authors: Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu
First: 2026-03-12T16:46:01+00:00 · Latest: 2026-03-12T16:46:01+00:00
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
中文标题/摘要
标题:EgoIntent:一种理解个人视角视频中意图、原因和下一步的自中心步级基准
多模态大型语言模型(MLLMs)在多种任务中展示了卓越的视频推理能力。然而,它们在个人视角视频中对人类意图进行细粒度理解的能力在很大程度上尚未得到探索。现有基准主要集中在情节级意图推理上,忽视了粒度更细的步骤级意图理解。然而,智能助手、机器人模仿学习和增强现实指导等应用不仅需要理解一个人在每一步做什么,还需要理解为什么这样做以及接下来要做什么,以便提供及时且具备上下文感知的支持。为此,我们引入了EgoIntent,一个面向个人视角视频的步骤级意图理解基准。它包含3,014个步骤,涵盖15种不同的室内和室外日常生活场景,并从三个互补维度评估模型:局部意图(What)、全局意图(Why)和下一步计划(Next)。关键的是,每个片段都在所查询步骤的关键结果(如接触或抓取)发生之前立即截断,且不包含后续步骤的任何帧,从而防止未来帧泄漏,并能够干净地评估预期性的步骤理解和下一步规划。我们评估了15个MLLMs,包括最先进的闭源和开源模型。即使是表现最好的模型,在三个意图维度上的平均得分也仅为33.31,这表明个人视角视频中的步骤级意图理解仍然是一个极具挑战性的问题,有待进一步研究。
Summary / 总结
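EgoIntent is a step-level benchmark for understanding human intent in egocentric videos, filling the gap left by existing benchmarks that focus on episode-level intent reasoning. It comprises 3,014 steps across 15 indoor and outdoor daily-life scenarios and evaluates models on three dimensions: local intent (What), global intent (Why), and next-step plan (Next). Each clip is truncated immediately before the key outcome of the queried step and contains no frames from subsequent steps, which prevents future-frame leakage. Among the 15 MLLMs evaluated, even the best model achieves an average score of only 33.31 across the three dimensions, highlighting how challenging step-level intent understanding in egocentric videos remains.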
EgoIntent 是一个用于理解个人视角(egocentric)视频中人类意图的步骤级基准,弥补了现有基准仅关注情节级意图推理的不足。它包含来自 15 个场景的 3,014 个步骤,并从局部意图、全局意图和下一步计划三个方面评估模型。在所评估的 15 个 MLLM(包括最先进的闭源和开源模型)中,最佳模型在三个维度上的平均得分仅为 33.31,突显了个人视角视频中步骤级意图理解的挑战性。
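The truncation rule described above can be illustrated with a small frame-selection helper. The annotation fields (`step_start`, `key_outcome_time`) and the function name are hypothetical; the abstract does not describe the benchmark's actual data format or tooling.

```python
import math
from typing import List


def truncated_frame_indices(
    step_start: float,
    key_outcome_time: float,
    fps: float,
) -> List[int]:
    """Return frame indices from the start of a step up to, but not
    including, the frame at which the key outcome (e.g. contact) occurs.

    Frame i is assumed to be captured at time i / fps (seconds).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if key_outcome_time < step_start:
        raise ValueError("key outcome cannot precede the step start")
    first = math.ceil(step_start * fps)
    # Exclusive upper bound: the first frame at or after the outcome is dropped.
    stop = math.ceil(key_outcome_time * fps)
    return list(range(first, stop))


# Hypothetical step: starts at 2.0 s, hand contacts the object at 3.5 s, 2 fps.
frames = truncated_frame_indices(step_start=2.0, key_outcome_time=3.5, fps=2.0)
```

Because the upper bound is exclusive, a model shown these frames never sees the outcome itself or anything from the following step.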
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu
First: 2026-03-12T16:45:53+00:00 · Latest: 2026-03-12T16:45:53+00:00
Comments: Accepted by CVPR2026
Abstract
Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
中文标题/摘要
标题:FlashMotion:基于轨迹引导的少步可控视频生成
近期在轨迹可控视频生成方面的进展取得了显著成果。先前的方法主要使用适配器架构以精确控制沿预定义轨迹的运动。然而,所有这些方法都依赖于多步去噪过程,导致大量时间冗余和计算开销。虽然现有的视频蒸馏方法成功地将多步生成器蒸馏为少步版本,但直接将这些方法应用于轨迹可控视频生成会导致视频质量和轨迹精度明显下降。为解决这一问题,我们提出了FlashMotion,一种专为少步轨迹可控视频生成设计的训练框架。我们首先在多步视频生成器上训练一个轨迹适配器以实现精确的轨迹控制。然后,我们将生成器蒸馏为少步版本以加速视频生成。最后,我们使用结合扩散和对抗目标的混合策略微调适配器,使其与少步生成器对齐,以生成高质量、轨迹准确的视频。为了评估,我们引入了FlashBench,这是一个针对长序列轨迹可控视频生成的基准,它衡量不同前景对象数量下的视频质量和轨迹精度。实验结果显示,FlashMotion在视觉质量和轨迹一致性方面均优于现有的视频蒸馏方法和之前的多步模型。
Summary / 总结
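FlashMotion is a training framework for few-step trajectory-controllable video generation. It first trains a trajectory adapter on a multi-step video generator, then distills the generator into a few-step version, and finally fine-tunes the adapter with a hybrid strategy that combines diffusion and adversarial objectives so that it aligns with the few-step generator. The authors also introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.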
FlashMotion 是一种用于少步轨迹可控视频生成的训练框架。它首先在多步视频生成器上训练轨迹适配器,然后将生成器蒸馏为少步版本,最后使用结合扩散和对抗目标的混合策略对适配器进行微调。实验表明,FlashMotion 在视觉质量和轨迹一致性方面均优于现有的视频蒸馏方法和先前的多步模型。
Automatic Generation of High-Performance RL Environments
Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin
First: 2026-03-12T16:45:47+00:00 · Latest: 2026-03-12T16:45:47+00:00
Comments: 26 pages, 9 figures, 8 tables
Abstract
Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
中文标题/摘要
标题:自动生成高性能RL环境
将复杂的强化学习(RL)环境转换为高性能实现通常需要数月的专门工程工作。我们提出了一种可重用的配方——通用提示模板、分层验证和迭代代理辅助修复,该方法以不到10美元的计算成本生成语义等价的高性能环境。我们展示了五个环境中的三种不同工作流程。直接翻译(没有现有的高性能实现):EmuRust(通过Rust并行性实现1.5倍的PPO加速,对于Game Boy模拟器)和PokeJAX,第一个GPU并行的宝可梦战斗模拟器(500M SPS随机动作,15.2M SPS PPO;22,320倍于TypeScript参考)。与现有性能实现验证:吞吐量与MJX相当(1.04倍),在匹配GPU批处理大小时比Brax快5倍(HalfCheetah JAX);42倍PPO(Puffer Pong)。新环境创建:TCGJax,第一个可部署的JAX宝可梦TCG引擎(717K SPS随机动作,153K SPS PPO;6.6倍于Python参考),从网页提取的规范中合成。在200M参数下,环境开销低于训练时间的4%。分层验证(属性、交互和回放测试)确认了所有五个环境的语义等价性;跨后端策略转移确认了所有五个环境的零模拟到模拟差距。TCGJax,从私人参考合成,私人参考未公开在公共存储库中,作为代理预训练数据污染的控制。论文包含足够的细节——包括代表性提示、验证方法和完整结果——使得编码代理可以直接从手稿中复制这些翻译。
Summary / 总结
The research aims to automate the translation of complex reinforcement learning (RL) environments into high-performance implementations, which has traditionally required months of specialized engineering. The recipe combines a generic prompt template, hierarchical verification, and iterative agent-assisted repair, at a compute cost under $10. Key results include a 1.5x PPO speedup for EmuRust via Rust parallelism, 500M SPS with random actions and 15.2M SPS under PPO for PokeJAX (22,320x over the TypeScript reference), and a 6.6x speedup for TCGJax over the Python reference. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap. The paper reports enough detail, including representative prompts and the verification methodology, for a coding agent to reproduce the translations.
研究旨在自动化生成高性能强化学习环境,这通常需要数月的专门工程工作。方法包括通用提示模板、分层验证和迭代代理辅助修复,以低于10美元的计算成本生成语义等效的高性能环境。关键发现包括EmuRust使用Rust并行性实现1.5倍速度提升,PokeJAX实现500M SPS随机动作和15.2M SPS PPO,以及TCGJax,这是首个可部署的JAX宝可梦TCG引擎,从网页提取的规范中生成,实现了6.6倍的速度提升。
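A rollout test of the kind named in the hierarchical verification can be sketched as follows: drive a reference environment and its translation with the same action sequence and compare every transition. The toy counter environments and the `rollouts_match` helper are hypothetical stand-ins; the paper's actual test harness is not described in the abstract.

```python
import random
from typing import List, Protocol, Tuple


class Env(Protocol):
    def reset(self) -> int: ...
    def step(self, action: int) -> Tuple[int, float, bool]: ...


class ReferenceCounterEnv:
    """Toy reference: state is a counter, episode ends at 10 or above."""

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.state += action
        return self.state, float(action), self.state >= 10


class TranslatedCounterEnv(ReferenceCounterEnv):
    """Stand-in for a faster re-implementation with identical semantics."""


class BuggyCounterEnv(ReferenceCounterEnv):
    """Stand-in for a translation with an off-by-one termination bug."""

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.state += action
        return self.state, float(action), self.state > 10


def rollouts_match(ref: Env, new: Env, actions: List[int]) -> bool:
    """Return True if both environments produce identical transitions."""
    if ref.reset() != new.reset():
        return False
    for action in actions:
        ref_out, new_out = ref.step(action), new.step(action)
        if ref_out != new_out:
            return False
        if ref_out[2]:  # stop once the episode has terminated
            break
    return True


# A seeded action sequence keeps the comparison reproducible.
rng = random.Random(0)
actions = [rng.randint(1, 3) for _ in range(20)]
same = rollouts_match(ReferenceCounterEnv(), TranslatedCounterEnv(), actions)
```

Property and interaction tests would sit below this level, checking individual invariants and single transitions before whole trajectories are compared.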
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
First: 2026-03-12T16:45:42+00:00 · Latest: 2026-03-12T16:45:42+00:00
Comments: The source code will be made publicly available at https://github.com/MengfeiD/O3N
Abstract
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
中文标题/摘要
标题:O3N:全方位开放式词汇占用预测
通过全方位感知理解并重构3D世界是自主代理和具身智能发展中不可避免的趋势。然而,现有的3D占用预测方法受限于有限视角输入和预定义的训练分布,难以应用于需要全面和安全场景感知的具身代理在开放世界探索中的应用。为了解决这一问题,我们提出了O3N,这是首个纯视觉、端到端的全方位开放式词汇占用预测框架。O3N通过Polar-spiral Mamba (PsM) 模块嵌入全方位体素,以极螺旋拓扑结构实现连续的空间表示和360°范围内的长程上下文建模。Occupancy Cost Aggregation (OCA) 模块引入了一种原理性的机制,用于在体素空间内统一几何和语义监督,确保重建几何与底层语义结构之间的一致性。此外,Natural Modality Alignment (NMA) 建立了一种无梯度对齐路径,协调视觉特征、体素嵌入和文本语义,形成一致的“像素-体素-文本”表示三元组。在多个模型上的广泛实验表明,我们的方法不仅在QuadOcc和Human360Occ基准测试中达到了最先进的性能,还展示了出色的跨场景泛化能力和语义可扩展性,为通用3D世界建模铺平了道路。源代码将在https://github.com/MengfeiD/O3N公开。
Summary / 总结
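O3N is a purely visual, end-to-end framework for omnidirectional open-vocabulary occupancy prediction. It targets two limitations of existing 3D occupancy methods, limited perspective inputs and a predefined training distribution, with three components: the Polar-spiral Mamba (PsM) module, which embeds omnidirectional voxels in a polar-spiral topology for long-range context modeling across 360°; the Occupancy Cost Aggregation (OCA) module, which unifies geometric and semantic supervision within the voxel space; and Natural Modality Alignment (NMA), a gradient-free pathway that aligns visual features, voxel embeddings, and text semantics. It achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks and demonstrates strong cross-scene generalization and semantic scalability.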
O3N 是一种纯视觉、端到端的全方位开放词汇占用预测框架,通过使用极螺旋拓扑并引入 OCA 和 NMA 模块来解决现有方法的局限性。该方法在 QuadOcc 和 Human360Occ 基准测试中达到了最先进的性能,并展示了强大的跨场景泛化能力和语义可扩展性。