Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
First: 2026-03-18T17:59:56+00:00 · Latest: 2026-03-18T17:59:56+00:00
Abstract
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
中文标题/摘要
标题:统一时空令牌评分以提高视频VLMs的效率
令牌剪枝对于提高视觉语言模型(VLMs)的计算效率至关重要,特别是在视频任务中,时间冗余普遍存在。先前的方法通常仅在视觉变换器(ViT)内剪枝令牌,适用于单模态感知任务如动作识别和对象分割,而不适应下游视觉语言任务;或者仅在LLM内剪枝令牌,而保留ViT输出不变,通常需要复杂的文本条件令牌选择机制。在本文中,我们引入了时空令牌评分(STTS),这是一种简单且轻量级的模块,可以在ViT和LLM之间剪枝视觉令牌,无需文本条件或令牌合并,并且完全兼容端到端训练。通过学习如何通过辅助损失学习时间评分以及通过LLM下游梯度学习空间评分,并借助我们高效的打包算法,STTS在整个架构中剪枝了50%的视觉令牌,从而在训练和推理过程中效率提高了62%,并且平均性能下降了0.7%。随着每视频采样帧数的增加,效率收益会增加。在长视频问答测试时应用缩放进一步提高了0.5-1%的性能,与基线相比。总体而言,STTS代表了一种新颖、简单而有效的统一架构视觉令牌剪枝技术。
Summary / 总结
This paper addresses the computational cost of vision-language models (VLMs) on video tasks by introducing Spatio-Temporal Token Scoring (STTS), a lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging. STTS learns temporal scoring via an auxiliary loss and spatial scoring via LLM downstream gradients; pruning 50% of vision tokens yields a 62% efficiency improvement in training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains grow with more sampled frames per video, and test-time scaling for long-video QA yields gains of 0.5-1% over the baseline.
本文通过引入时空令牌评分(STTS)方法,解决了视频任务中视觉语言模型的计算效率问题,该方法在视觉变换器和语言模型之间剪枝视觉令牌,无需文本条件或令牌合并。STTS 在 13 个视频 QA 任务中将效率提高了 62%,性能下降仅为 0.7%,并且随着每视频采样帧数的增加,效率提升更加明显。测试时的缩放进一步提高了 0.5-1% 的性能。该方法简单、轻量且完全兼容端到端训练。
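The pruning step described in the STTS abstract can be illustrated with a minimal sketch. It assumes each vision token already has a scalar importance score (in the paper these come from a learned scoring module, which is not reproduced here); the function name `prune_tokens` and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def prune_tokens(
    tokens: List[List[float]],
    scores: List[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    `tokens` and `scores` are hypothetical inputs: one feature vector and one
    scalar score per vision token. This is not the authors' implementation.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score (highest first) and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:num_keep])  # restore spatio-temporal order
    return [tokens[i] for i in kept_indices], kept_indices
```

With `keep_ratio=0.5` this drops half of the tokens, matching the 50% pruning rate quoted in the abstract. Restoring the original order after selection is a design choice made here so that positional information stays meaningful for later layers.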
Universal Skeleton Understanding via Differentiable Rendering and MLLMs
Authors: Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu
First: 2026-03-18T17:59:12+00:00 · Latest: 2026-03-18T17:59:12+00:00
Comments: 32 pages, 15 figures
Abstract
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer -- suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.
中文标题/摘要
标题:通过可微渲染和MLLM实现通用骨架理解
多模态大型语言模型(MLLMs)在视觉-语言推理方面表现出色,但仍然局限于其原生模态,无法直接处理如人类骨架等结构化、非视觉数据。现有方法要么将骨架动态压缩为有损特征向量以进行文本对齐,要么将运动量化为难以在不同骨架格式之间泛化的离散标记。我们提出了SkeletonLLM,它通过将任意骨架序列转换为MLLM的原生视觉模态来实现通用骨架理解。其核心是DrAction,一种格式无关的可微渲染器,将骨骼运动转换为紧凑的图像序列。由于整个管道是端到端可微的,MLLM的梯度可以直接指导渲染以生成任务相关信息的视觉标记。为了进一步增强推理能力,我们引入了一种协作训练策略:因果推理蒸馏从教师模型中转移结构化的逐步推理,而判别性微调则细化混淆动作之间的决策边界。SkeletonLLM在包括识别、描述、推理和跨格式转移等多种任务上表现出强大的泛化能力——这表明MLLM可以应用于非原生模态的一种可行路径。代码将在接受后发布。
Summary / 总结
The research aims to enable multimodal large language models (MLLMs) to process structured non-visual data like human skeletons. The method involves using DrAction, a differentiable renderer that converts skeleton sequences into visual images, allowing MLLMs to understand and reason about skeletons. Key findings show that SkeletonLLM can perform various tasks such as recognition, captioning, and reasoning, and can transfer across different skeleton formats, indicating its potential for applying MLLMs to non-native modalities.
研究旨在通过一个名为DrAction的可微渲染器将人体骨架序列转换为视觉模态,使多模态大语言模型(MLLMs)能够理解人体骨架。方法包括因果推理蒸馏和判别微调等协同训练策略,以增强推理能力。关键发现表明,SkeletonLLM在识别、描述、推理和跨格式转换等多种任务上表现出强大的泛化能力,表明其在非原生模态中应用MLLMs的可能性。
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
First: 2026-03-18T17:59:10+00:00 · Latest: 2026-03-18T17:59:10+00:00
Comments: Project Page: https://kevinqu7.github.io/loc3r-vlm
Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
中文标题/摘要
标题:Loc3R-VLM:基于语言的空间定位和三维推理
多模态大型语言模型(MLLMs)在连接视觉和语言方面取得了显著进展,但仍然难以理解空间关系和视角相关的推理。最近的努力旨在通过几何提示增强输入表示,而不是明确地教会模型在三维空间中进行推理。我们提出了Loc3R-VLM框架,该框架使二维视觉语言模型具备从单目视频输入中获得的高级三维理解能力。受人类空间认知的启发,Loc3R-VLM依赖于两个联合目标:全局布局重建以构建场景结构的整体表示,以及明确的情景建模以锚定主观视角。这些目标提供了直接的空间监督,使感知和语言在三维上下文中得到约束。为了确保几何一致性并实现度量级对齐,我们利用从预训练的三维基础模型中提取的轻量级相机姿态先验。Loc3R-VLM在基于语言的空间定位方面达到了最先进的性能,并在基于图像和视频的现有方法上在定位和一般三维问答基准测试中表现出色,证明了我们的空间监督框架能够实现强大的三维理解。项目页面:https://kevinqu7.github.io/loc3r-vlm
Summary / 总结
Loc3R-VLM is a framework that equips 2D Vision-Language Models with 3D understanding capabilities from monocular video input. It trains with two joint objectives, global layout reconstruction and explicit situation modeling, which provide direct spatial supervision, and it uses lightweight camera pose priors from a pre-trained 3D foundation model for geometric consistency and metric-scale alignment. The framework achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks.
Loc3R-VLM 是一种框架,通过单目视频输入增强 2D 视觉语言模型的 3D 理解能力。它侧重于全局布局重建和显式情况建模,提供空间监督。这种方法在语言基于的定位和 3D 问答基准测试中达到了最先进的性能,并优于现有方法,展示了空间监督在 3D 理解中的有效性。
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
Authors: Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu
Venue: AAAI
First: 2026-03-18T17:59:03+00:00 · Latest: 2026-03-18T17:59:03+00:00
Comments: 9 pages, Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
中文标题/摘要
标题:EchoGen:循环一致学习的统一布局-图像生成与理解框架
在本文中,我们提出了EchoGen,这是一种统一的布局到图像生成和图像定位框架,能够生成具有准确布局和高保真度的图像(例如空间关系),同时稳健地将图像定位。我们认为图像定位具有强大的文本和布局理解能力,可以弥补布局到图像生成中的相应局限性。同时,从布局生成的图像在内容上表现出高度的多样性,从而增强了图像定位的稳健性。在统一模型中联合训练这两个任务可以促进每个任务的性能提升。然而,我们发现这种联合训练范式遇到了一些优化挑战,并导致了性能限制。为了解决这些问题,我们提出了渐进式训练策略。首先,平行多任务预训练(PMTP)阶段为模型提供了两个任务的基本能力,利用共享标记加速训练。其次,双重联合优化(DJO)阶段利用任务的对偶性,逐步整合两个任务,实现统一优化。最后,循环RL阶段通过使用一致性约束作为奖励,消除对视觉监督的依赖,显著增强了模型的统一能力,通过GRPO策略。广泛的实验在布局到图像生成和图像定位基准上展示了最先进的结果,并揭示了优化两个任务的协同增益。
Summary / 总结
EchoGen is a unified framework for layout-to-image generation and image grounding, which generates images with accurate layouts and high fidelity to text descriptions while robustly grounding the image. The framework addresses optimization challenges through progressive training strategies: Parallel Multi-Task Pre-training, Dual Joint Optimization, and Cycle RL, leading to state-of-the-art results on both layout-to-image generation and image grounding benchmarks.
EchoGen 是一个统一框架,用于布局到图像生成和图像定位,能够生成具有准确布局和高描述准确性的图像,并且能够稳健地定位图像。通过提出分阶段训练策略:并行多任务预训练、双重联合优化和循环强化学习,解决联合训练中的优化难题。广泛的实验表明,在布局到图像生成和图像定位基准上取得了最先进的结果,并且通过同时优化两个任务获得了协同增益。
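The Cycle RL stage uses consistency constraints as rewards. One plausible form of such a constraint, sketched below purely as an assumption, is to generate an image from a layout, ground the objects in that image, and score how well the grounded boxes match the input layout. The generation and grounding models are not shown; `cycle_reward` and the mean-IoU formulation are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cycle_reward(input_layout: List[Box], grounded_layout: List[Box]) -> float:
    """Mean IoU between each input box and the box grounded for the same object.

    Assumes the two lists are aligned object by object (a simplification).
    """
    if len(input_layout) != len(grounded_layout):
        raise ValueError("layouts must describe the same number of objects")
    if not input_layout:
        return 0.0
    pairs = zip(input_layout, grounded_layout)
    return sum(iou(a, b) for a, b in pairs) / len(input_layout)
```

A reward of this kind needs no ground-truth image, which is consistent with the abstract's statement that the stage eliminates reliance on visual supervision.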
AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse
Authors: Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu
First: 2026-03-18T17:58:25+00:00 · Latest: 2026-03-18T17:58:25+00:00
Abstract
Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.
中文标题/摘要
标题:AgentFactory:通过可执行子代理积累与重用实现自我进化的框架
基于LLM的代理构建变得越来越重要。最近关于基于LLM的代理自我进化的研究主要记录成功经验为文本提示或反思,这不能可靠地保证在复杂场景中高效地重新执行任务。我们提出了AgentFactory,这是一种新的自我进化范式,将成功任务解决方案保存为可执行的子代理代码,而不是文本经验。关键的是,这些子代理会根据执行反馈不断优化,遇到更多任务时变得越来越稳健和高效。保存的子代理是纯Python代码,具有标准化文档,可以在任何支持Python的系统中实现跨平台。我们证明了AgentFactory能够实现持续的能力积累:其可执行子代理库随着时间的推移不断增长和改进,逐步减少完成类似任务所需的努力,而无需手动干预。我们的实现已开源在https://github.com/zzatpku/AgentFactory,演示视频可在https://youtu.be/iKSsuAXJHW0找到。
Summary / 总结
The research aims to improve the efficiency and reliability of LLM-based agent self-evolution by preserving successful task solutions as executable subagent code rather than textual experiences. The main method involves continuously refining these subagents based on execution feedback, making them more robust and efficient. Key findings show that AgentFactory enables continuous capability accumulation, with its library of executable subagents growing and improving over time, reducing the effort required for similar tasks without manual intervention.
研究旨在通过一种名为AgentFactory的自演化框架提升基于LLM的代理的效率和鲁棒性。该框架通过执行子代理的积累和重用来替代文本经验,并且这些子代理会根据执行反馈不断优化。关键发现表明,AgentFactory能够实现持续的能力积累,随着时间的推移,对于类似任务所需的努力会逐渐减少,无需人工干预。
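The idea of keeping solutions as executable, documented Python code rather than textual notes can be sketched as a small in-memory library. Everything here is an illustrative assumption: the `SubagentLibrary` class, the convention that each subagent exposes a `run` function with a docstring, and the in-memory storage are hypothetical and do not reflect the AgentFactory repository's actual interface.

```python
from typing import Any, Dict, List


class SubagentLibrary:
    """Toy store of subagents kept as Python source (hypothetical design)."""

    def __init__(self) -> None:
        self._sources: Dict[str, str] = {}

    def save(self, name: str, source: str) -> None:
        """Add a subagent, or overwrite it with a refined version."""
        compile(source, name, "exec")  # reject source that does not parse
        self._sources[name] = source

    def names(self) -> List[str]:
        return sorted(self._sources)

    def _load(self, name: str) -> Any:
        namespace: Dict[str, Any] = {}
        exec(self._sources[name], namespace)  # only run code you trust
        return namespace["run"]  # assumed entry point of every subagent

    def describe(self, name: str) -> str:
        """Return the subagent's docstring (its standardized documentation)."""
        return self._load(name).__doc__ or ""

    def run(self, name: str, *args: Any, **kwargs: Any) -> Any:
        return self._load(name)(*args, **kwargs)


library = SubagentLibrary()
library.save(
    "word_count",
    'def run(text):\n    """Count whitespace-separated words."""\n'
    "    return len(text.split())\n",
)
```

Saving under an existing name replaces the earlier version, which loosely mirrors the refinement loop described in the abstract; a real system would also need sandboxing before executing stored code.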
Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi
First: 2026-03-17T16:02:38+00:00 · Latest: 2026-03-18T17:58:04+00:00
Comments: 14 pages, 9 figures
Abstract
We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
中文标题/摘要
标题:Search2Motion:无需训练的对象级运动控制
我们提出了Search2Motion,一种无需训练的框架,用于图像到视频生成中的对象级运动编辑。与需要轨迹、边界框、掩码或运动场的先前方法不同,Search2Motion 采用目标帧基于的控制,利用首尾帧运动先验来实现对象重定位,同时保持场景稳定性,无需微调。通过语义引导的对象插入和鲁棒的背景修复,实现了可靠的目标帧构建。我们进一步展示了早期步骤的自我注意力图预测对象和相机动力学,提供可解释的用户反馈,并激发了ACE-Seed(注意力共识早期步骤种子选择)这一轻量级搜索策略,该策略在无需前瞻采样或外部评估者的情况下提高了运动保真度。鉴于现有基准混淆了对象和相机运动,我们引入了S2M-DAVIS和S2M-OMB进行稳定相机、仅对象评估,以及FLF2V-obj指标,该指标隔离了对象伪影,无需真实轨迹。Search2Motion 在FLF2V-obj 和 VBench 上均优于基线。
The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering
Authors: Yigit Ekin, Yossi Gandelsman
First: 2026-03-18T17:57:53+00:00 · Latest: 2026-03-18T17:57:53+00:00
Comments: Project Page: https://yigitekin.github.io/diffusion-sliders
Abstract
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
中文标题/摘要
标题:基于文本嵌入插值的无训练连续图像操控方法
我们提出了一种无需训练的框架,在测试时对文本条件生成模型进行连续可控的图像编辑。与依赖额外训练或手动用户干预的先前方法不同,我们发现简单的文本嵌入空间中的操控足以产生平滑的编辑控制。给定一个目标概念(例如,增强照片逼真度或改变面部表情),我们使用大型语言模型自动生成一组去偏差对比提示对,从中计算生成器文本编码器空间中的操控向量。然后,我们将此向量直接添加到输入提示表示中,以沿所需的语义轴控制生成。为了获得连续控制,我们提出了一种弹性范围搜索程序,自动识别有效的操控幅度区间,避免过度操控(改变其他属性)和不足操控(无编辑)。在该区间内添加该向量的缩放版本可获得平滑且连续的编辑。由于我们的方法仅修改文本表示,因此自然适用于文本条件的各种模态,包括图像和视频生成。为了量化操控连续性,我们引入了一个新的评估指标,该指标衡量编辑强度下语义变化的均匀性。我们比较了不同方法的连续编辑行为,并发现尽管我们的方法简单且设计轻量,但在与基于训练的替代方法相当的情况下,优于其他无训练方法。
Summary / 总结
This paper introduces a training-free framework for continuous and controllable image editing using text embeddings. By steering in the text-embedding space, the authors achieve smooth control over image generation without additional training or manual intervention. They use a large language model to generate prompt pairs and compute a steering vector, which is then added to the input prompt to control the generation along desired semantic axes. An elastic range search procedure ensures continuous control by identifying an effective interval of steering magnitudes. The method is applicable across text-conditioned modalities and outperforms other training-free approaches in terms of continuous editing behavior.
该论文提出了一种无需训练的框架,利用文本嵌入进行连续可控的图像编辑。它利用大型语言模型生成对比提示对,计算文本编码空间中的偏移向量,并将其添加到输入提示中以沿所需语义轴控制生成。弹性范围搜索过程确保了平滑和连续的编辑。实验表明,尽管该方法简单轻量,但在连续编辑行为上与基于训练的方法相当,并且优于其他无需训练的方法。
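The steering idea can be sketched with plain vectors. The version below assumes the steering vector is the mean difference between the embeddings of each contrastive prompt pair, which is a common construction but is not confirmed by the abstract; the embeddings would come from the generator's text encoder, and the strength interval would come from the elastic range search, neither of which is implemented here. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def steering_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Mean difference between paired positive and negative prompt embeddings."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need the same non-zero number of positive and negative embeddings")
    dim = len(positive[0])
    count = len(positive)
    return [
        sum(p[d] - n[d] for p, n in zip(positive, negative)) / count
        for d in range(dim)
    ]


def steer(prompt_embedding: Vector, direction: Vector, strength: float) -> Vector:
    """Shift a prompt embedding along the steering direction."""
    return [x + strength * v for x, v in zip(prompt_embedding, direction)]


def strength_sweep(low: float, high: float, steps: int) -> List[float]:
    """Evenly spaced strengths inside an interval assumed to be already found."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]
```

Applying `steer` once per value returned by `strength_sweep` would give the sequence of edited prompt embeddings that produces a continuous edit.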
LoST: Level of Semantics Tokenization for 3D Shapes
Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen
Venue: CVPR 2026
First: 2026-03-18T17:56:06+00:00 · Latest: 2026-03-18T17:56:06+00:00
Comments: CVPR 2026; Project website: https://lost3d.github.io
Abstract
Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
中文标题/摘要
标题:LoST:3D形状的语义分词级别
分词是生成建模中各种模态的基本技术。特别是在自回归(AR)模型中,它起着关键作用,这些模型最近成为3D生成的有吸引力的选择。然而,3D形状的最佳分词仍然是一个开放的问题。最先进的(SOTA)方法主要依赖于几何层次细节(LoD)层次结构,这些层次结构最初是为渲染和压缩设计的。这些空间层次结构通常分词效率低下,缺乏AR建模所需的语义连贯性。我们提出了语义分词级别(LoST),它按语义显著性对分词进行排序,使得早期前缀解码成完整的、合理的形状,具有主要的语义,而后续的分词则细化实例特定的几何和语义细节。为了训练LoST,我们引入了关系互距对齐(RIDA),这是一种新颖的3D语义对齐损失,它将3D形状潜在空间的关系结构与语义DINO特征空间的关系结构对齐。实验表明,LoST在重建方面达到了SOTA水平,与基于LoD的3D形状分词器相比,在几何和语义重建指标上取得了显著的领先优势。此外,LoST实现了高效的高质量AR 3D生成,并使下游任务如语义检索成为可能,同时仅使用了先前AR模型所需分词的0.1%-10%。
Summary / 总结
The paper introduces Level-of-Semantics Tokenization (LoST) for 3D shape generation, addressing the inefficiency of geometric level-of-detail (LoD) hierarchies in autoregressive models. LoST orders tokens based on semantic salience, allowing early tokens to decode complete, plausible shapes with principal semantics, and subsequent tokens to refine details. The method uses Relational Inter-Distance Alignment (RIDA) to align the 3D shape latent space with semantic features. Experiments show LoST outperforms previous LoD-based tokenizers in both geometric and semantic reconstruction, and enables efficient, high-quality 3D generation with fewer tokens.
研究旨在改进3D形状的标记化,特别是在自回归模型中的应用。提出的Level-of-Semantics Tokenization (LoST)方法根据语义显著性对标记进行排序,使得早期重建更加连贯和合理。实验表明,LoST在几何和语义重建方面均优于现有基于几何层次的方法,并能使用更少的标记实现高效、高质量的3D生成。
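The abstract describes RIDA as aligning the relational structure of the shape latent space with that of a semantic feature space. A generic way to express that idea, shown below as an assumption rather than the paper's loss, is to compare scale-normalized pairwise distance matrices computed over the same batch of shapes. The function names and the mean-squared-error form are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def pairwise_distances(points: Sequence[Vector]) -> List[List[float]]:
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def _scale_normalized(matrix: List[List[float]]) -> List[List[float]]:
    """Divide by the mean off-diagonal distance so only relative structure remains."""
    n = len(matrix)
    mean = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    if mean == 0.0:
        raise ValueError("all points coincide; relational structure is undefined")
    return [[value / mean for value in row] for row in matrix]


def relational_alignment_loss(latents: Sequence[Vector], semantic: Sequence[Vector]) -> float:
    """Mean squared difference between the two normalized distance matrices.

    Row i of `latents` and row i of `semantic` must describe the same shape;
    the two spaces may have different dimensionality.
    """
    n = len(latents)
    if n != len(semantic) or n < 2:
        raise ValueError("need at least two paired samples")
    d_latent = _scale_normalized(pairwise_distances(latents))
    d_semantic = _scale_normalized(pairwise_distances(semantic))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum((d_latent[i][j] - d_semantic[i][j]) ** 2 for i, j in pairs) / len(pairs)
```

Because only distances are compared, the loss is unchanged by a uniform rescaling of either space, so it constrains how shapes relate to each other rather than where they sit.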
GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
Authors: Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang
First: 2026-03-18T17:54:35+00:00 · Latest: 2026-03-18T17:54:35+00:00
Comments: Accepted by 3DV 2026. Project Page: https://huajian-zeng.github.io/projects/gmt/
Abstract
Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.
中文标题/摘要
标题:GMT:面向6自由度物体轨迹合成的多模态变换器
在3D环境中合成可控的6自由度物体操作轨迹对于使机器人能够与复杂场景交互至关重要,但由于需要准确的空间推理、物理可行性以及多模态场景理解,这仍然是一个挑战。现有方法通常依赖于2D或部分3D表示,限制了它们捕捉完整场景几何结构和轨迹精度的能力。我们提出了GMT,这是一种多模态变换器框架,通过联合利用3D边界框几何、点云上下文、语义物体类别和目标末端姿态来生成现实且目标导向的物体轨迹。该模型将轨迹表示为连续的6自由度姿态序列,并采用了一种定制的条件策略,融合了几何、语义、上下文和目标导向的信息。在合成和真实世界基准上的广泛实验表明,GMT在空间精度和姿态控制方面优于最先进的基于人类运动和人机交互的基线方法,如CHOIS和GIMO,取得了显著的改进。我们的方法为基于学习的操纵规划设定了新基准,并在多种物体和复杂3D环境中表现出强大的泛化能力。项目页面:https://huajian-zeng.github.io/projects/gmt/
Summary / 总结
GMT is a multimodal transformer framework designed to synthesize realistic 6-DOF object manipulation trajectories in 3D scenes by integrating 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. Experiments show that GMT outperforms existing methods like CHOIS and GIMO in terms of spatial accuracy and orientation control, demonstrating strong generalization to various objects and complex 3D environments.
GMT 是一种多模态变压器框架,通过整合 3D 边界框几何、点云上下文、语义对象类别和目标末端姿态来生成 6-DOF 物体操作轨迹。该模型生成连续的 6-DOF 姿态序列,并使用结合几何、语义、上下文和目标导向信息的条件策略。实验表明,GMT 在空间精度和姿态控制方面优于现有方法 CHOIS 和 GIMO,为基于学习的操纵规划设定了新基准,并在各种物体和复杂 3D 环境中表现出强大的泛化能力。
Equivariant symmetry-aware head pose estimation for fetal MRI
Authors: Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Benjamin Billot, Polina Golland
First: 2025-12-04T15:15:55+00:00 · Latest: 2026-03-18T17:54:10+00:00
Abstract
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of diagnostic 2D MRI slices with 6-DoF head pose estimation, supported by rapid low-resolution 3D MRI volumes acquired before each 2D slice. Existing pose estimation methods struggle to generalize to clinical volumes due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, supporting future clinical translation. Our implementation is publicly available at github.com/MedicalVisionGroup/E3-Pose.
中文标题/摘要
标题:面向胎儿MRI的等变对称感知头部姿态估计
我们提出了E(3)-Pose,这是一种新颖的快速姿态估计方法,能够同时和明确地建模旋转等变性和物体对称性。我们的工作旨在解决诊断MRI扫描期间胎儿头部运动的挑战性问题。我们旨在通过6-DoF头部姿态估计,利用每次2D切片前快速获取的低分辨率3D MRI体积,实现自动适应性诊断2D MRI切片的处方。现有的姿态估计方法由于固有的解剖对称性引起的姿态歧义,以及低分辨率、噪声和伪影,难以在临床体积上泛化。相比之下,E(3)-Pose通过设计捕捉了解剖对称性和刚性姿态等变性,并提供了胎儿头部姿态的稳健估计。我们在公开可用和代表性的临床胎儿MRI数据集上的实验表明,我们的方法在不同领域具有优越的稳健性和泛化能力。至关重要的是,E(3)-Pose在临床MRI体积上的准确性达到了最先进的水平,支持未来的临床转化。我们的实现可在github.com/MedicalVisionGroup/E3-Pose获取。
Summary / 总结
E(3)-Pose is a novel pose estimation method that models rotation equivariance and object symmetry to address the challenge of fetal head motion during MRI scans. It achieves robust and accurate 6-DoF head pose estimation, even with low-resolution and noisy clinical MRI volumes, by capturing anatomical symmetries and rigid pose equivariance. Experiments show that E(3)-Pose outperforms existing methods on clinical fetal MRI datasets, supporting its potential for clinical translation.
E(3)-Pose 是一种新颖的姿态估计方法,通过同时建模旋转等变性和物体对称性来应对胎儿 MRI 扫描中的头部运动挑战。该方法旨在通过快速低分辨率 3D MRI 体积实现自动 6-DoF 头部姿态估计,以适应性地处方诊断 2D MRI 切片。该方法在不同临床数据集上的鲁棒性和泛化能力优于现有方法,并在临床 MRI 体积上达到了最先进的准确性。
Versatile Editing of Video Content, Actions, and Dynamics without Training
Authors: Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli
First: 2026-03-18T17:50:56+00:00 · Latest: 2026-03-18T17:50:56+00:00
Comments: Project page at https://dynaedit.github.io/
Abstract
Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
中文标题/摘要
标题:无需训练即可灵活编辑视频内容、动作和动态
受控视频生成在近年来取得了显著进步。然而,编辑动作和动态事件,或插入应影响其他对象行为的内容,仍然是一个重大挑战。现有的训练模型难以处理复杂的编辑,这可能是因为收集相关训练数据的难度。同样,现有的无训练方法本质上只能进行结构和运动保持的编辑,不支持修改运动或交互。在这里,我们介绍了一种无训练编辑方法DynaEdit,该方法利用预训练的文本到视频流模型解锁了灵活的视频编辑能力。我们的方法依赖于最近引入的无需反演的方法,该方法不会干预模型内部,因此是模型无关的。我们展示了直接尝试将此方法适应于一般不受约束的编辑会导致严重的低频错位和高频抖动。我们解释了这些现象的来源,并引入了克服它们的新机制。通过广泛的实验,我们展示了DynaEdit在复杂的基于文本的视频编辑任务上达到了最先进的效果,包括修改动作、插入与场景交互的对象以及引入全局效果。
Summary / 总结
The research addresses editing actions and dynamic events in real-world videos, where trained models struggle with complex edits (likely because relevant training data is hard to collect) and existing training-free methods are limited to structure- and motion-preserving edits. DynaEdit is a training-free, model-agnostic method that builds on pretrained text-to-video flow models and an inversion-free approach. It introduces mechanisms to overcome the low-frequency misalignment and high-frequency jitter that a naive adaptation produces, and achieves state-of-the-art results on complex text-based video editing tasks such as modifying actions, inserting objects that interact with the scene, and introducing global effects.
研究旨在解决在没有训练的情况下编辑视频中的动作和动态事件的难题,由于缺乏相关训练数据,现有模型难以应对。DynaEdit 是一种无需训练的方法,利用预训练的文本到视频流模型和无反演的方法来实现多功能视频编辑。该方法引入了新的机制来克服低频失真和高频抖动,并在修改动作和插入互动对象等复杂文本基于视频编辑任务上超越了现有方法。
Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models
Authors: Neeraj Gangwar, Suma P Bhat, Nickvash Kani
First: 2025-02-18T13:43:06+00:00 · Latest: 2026-03-18T17:43:04+00:00
Comments: Accepted to LREC 2026
Abstract
While large models pre-trained on high-quality data exhibit excellent performance on mathematical reasoning (e.g., GSM8k, MultiArith), it remains challenging to specialize smaller models for these tasks. Common approaches to address this challenge include knowledge distillation from large teacher models and data augmentation (e.g., rephrasing questions and generating synthetic solutions). Despite these efforts, smaller models struggle with arithmetic computations, leading to errors in mathematical reasoning. In this work, we leverage a synthetic arithmetic dataset generated programmatically to enhance the reasoning capabilities of smaller models. We investigate two key approaches to incorporate this dataset: (1) intermediate fine-tuning, in which a model is fine-tuned on the arithmetic dataset before training it on a reasoning dataset, and (2) integrating the arithmetic dataset into an instruction-tuning mixture, allowing the model to learn arithmetic skills alongside general instruction-following abilities. Our experiments on multiple reasoning benchmarks demonstrate that incorporating an arithmetic dataset, whether through targeted fine-tuning or within an instruction-tuning mixture, enhances models' arithmetic capabilities, thereby improving their mathematical reasoning performance.
中文标题/摘要
标题:整合算术学习提高小型模型的数学推理能力
虽然在高质量数据上预训练的大型模型在数学推理(例如GSM8k、MultiArith)上表现出色,但让小型模型专门胜任这些任务仍然具有挑战性。为解决这一挑战,常用的方法包括从大型教师模型进行知识蒸馏和数据增强(例如重新表述问题和生成合成解决方案)。尽管做出了这些努力,但小型模型在算术计算方面仍存在问题,导致数学推理中的错误。在本研究中,我们利用通过编程生成的合成算术数据集来增强小型模型的推理能力。我们研究了两种关键方法来整合此数据集:(1)中间微调,即在模型在推理数据集上训练之前,先在算术数据集上进行微调;(2)将算术数据集整合到指令微调混合中,使模型在学习算术技能的同时也能学习一般指令遵循能力。我们在多个推理基准上的实验表明,无论是通过目标微调还是在指令微调混合中整合算术数据集,都能增强模型的算术能力,从而提高其数学推理性能。
Summary / 总结
This study addresses the challenge of improving mathematical reasoning in smaller models by integrating arithmetic learning. Two methods are explored: intermediate fine-tuning on an arithmetic dataset followed by reasoning training, and integrating the arithmetic dataset into instruction-tuning. The results show that both methods enhance the models' arithmetic capabilities, leading to better performance in mathematical reasoning tasks across multiple benchmarks.
该研究旨在通过整合算术学习来提升较小模型的数学推理能力。研究人员使用了一个合成的算术数据集来增强推理能力,采用了两种方法:中间微调和将数据集整合到指令微调混合中。实验结果显示,这两种方法都提高了模型的算术技能,从而改善了它们在数学推理任务上的表现。
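The abstract above mentions a synthetic arithmetic dataset generated programmatically but does not describe its format. As a rough illustration only, the sketch below generates question/answer pairs for integer addition, subtraction, and multiplication. The function names, the operand range, and the `"a op b ="` prompt format are assumptions made for this example, not the authors' generator.

```python
import random

def make_arithmetic_example(rng, max_operand=999):
    """Return one (question, answer) pair for a random integer operation.

    The prompt format and operand range are illustrative assumptions.
    """
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        result = a + b
    elif op == "-":
        result = a - b
    else:
        result = a * b
    return f"{a} {op} {b} =", str(result)

def make_dataset(n, seed=0):
    """Build n examples from a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [make_arithmetic_example(rng) for _ in range(n)]

dataset = make_dataset(5)
for question, answer in dataset:
    print(question, answer)
```

Such pairs could be used either for an intermediate fine-tuning stage or mixed into an instruction-tuning set, the two options the abstract compares.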
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Authors: Shuyao Shi, Kang G. Shin
First: 2026-03-18T17:42:49+00:00 · Latest: 2026-03-18T17:42:49+00:00
Abstract
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40$\times$ and 1.63$\times$ higher cost-effectiveness, respectively).
中文标题/摘要
标题:感知空间:基于自我运动的视频表示以实现高效准确的三维场景理解
近期的多模态大型语言模型(MLLMs)在三维场景的空间推理方面显示出高潜力。然而,它们通常依赖于计算成本高昂的三维表示,如点云或重建的鸟瞰图(BEV)地图,或者缺乏物理基础来解决尺度和大小的歧义。本文通过引入自我运动模态数据显著增强了MLLMs,这些数据由惯性测量单元(IMUs)与视频同时捕获。特别是,我们提出了一种新的框架,称为Motion-MLLM,引入了两个关键组件:(1)级联运动-视觉关键帧过滤模块,该模块利用IMU数据和视觉特征高效地选择一组稀疏但具有代表性的关键帧;(2)不对称跨模态融合模块,其中运动标记作为中介,将自我运动线索和跨帧视觉上下文引导到视觉表示中。通过将视觉内容与物理的自我运动轨迹联系起来,Motion-MLLM 可以在场景中推理绝对尺度和空间关系。我们的广泛评估表明,Motion-MLLM 在各种与三维场景理解及空间推理相关的任务中取得了显著改进。与基于视频帧和显式三维数据的最新方法(SOTA)相比,Motion-MLLM 在成本效益方面表现出相似甚至更高的准确性(分别提高了1.40倍和1.63倍)。
Summary / 总结
This paper introduces Motion-MLLM, a framework that augments Multimodal Large Language Models with egomotion data captured by IMUs to improve 3D scene understanding. It comprises a cascaded motion-visual keyframe filtering module and an asymmetric cross-modal fusion module. Compared with state-of-the-art methods based on video frames and on explicit 3D data, Motion-MLLM reaches similar or higher accuracy with significantly less overhead (1.40x and 1.63x higher cost-effectiveness, respectively).
本文提出了Motion-MLLM框架,该框架结合了来自IMU的运动数据以增强多模态大型语言模型,从而提高3D场景理解能力。该框架包括一个级联的运动-视觉关键帧过滤模块和一个不对称的跨模态融合模块。实验结果表明,Motion-MLLM在各种3D场景理解任务中表现出色,并且具有较低的计算成本。
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Authors: Jinho Park, Se Young Chun, Mingoo Seok
Venue: CVPR 2026
First: 2026-03-18T17:42:34+00:00 · Latest: 2026-03-18T17:42:34+00:00
Comments: Accepted to CVPR 2026
Abstract
Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations (pruning and quantization). This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune the coefficients. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (about 1 percentage point). We validate our results on the RADIal, CARRADA, and Radatron datasets.
中文标题/摘要
标题:AdaRadar:基于雷达感知的自适应谱压缩
雷达因其全天候特性和能够测量距离和多普勒速度而在自动驾驶系统中是一种关键的感知模态。然而,高维原始雷达数据的庞大体积会饱和到计算引擎(例如NPU)的通信链路,后者通常是一个带宽有限的接口,数据传输速率仅适用于几帧低分辨率的距离-多普勒图。目前缺乏一种通用的雷达数据压缩编解码器,而现有的图像域方法通常固定压缩比,无法适应变化或对抗性条件。为了解决这个问题,我们提出了一种基于自适应反馈的雷达数据压缩方法。该方法通过从检测置信度对压缩率的代理梯度进行梯度下降来动态调整压缩比。我们采用零阶梯度近似,因为它即使在核心操作(剪枝和量化)不可微的情况下也能进行梯度计算。这还避免了在带宽有限的链路上传输梯度张量,如果估计这些张量,其大小将与原始雷达数据相当。此外,我们发现雷达特征图主要集中在少数频率分量上。因此,我们对雷达数据立方体应用离散余弦变换,并选择性地剪枝出系数。我们通过缩放量化保留每个雷达片段的动态范围。结合这些技术,我们提出的一种在线自适应压缩方案在性能下降极小(约1%)的情况下实现了超过100倍的特征尺寸减少。我们在RADIal、CARRADA和Radatron数据集上验证了我们的结果。
Summary / 总结
AdaRadar proposes an adaptive compression method for radar data in autonomous driving systems. It dynamically adjusts the compression ratio by gradient descent on a zeroth-order estimate of the gradient of detection confidence with respect to the compression rate, which works despite non-differentiable pruning and quantization and avoids transmitting gradient tensors over the band-limited link. By applying the discrete cosine transform, selectively pruning coefficients, and using scaled quantization, the method reduces feature size by over 100x with a performance drop of about one percentage point. The approach is validated on the RADIal, CARRADA, and Radatron datasets.
AdaRadar 提出了一种自适应雷达数据压缩方法,通过检测置信度梯度动态调整压缩比。它使用零阶梯度近似和离散余弦变换来剪枝频率成分,实现了超过 100 倍的特征尺寸减少,同时保持了最小的性能下降。在 RADIal、CARRADA 和 Radatron 数据集上的实验验证了其有效性。
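The compression path described in the AdaRadar abstract (discrete cosine transform, coefficient pruning, scaled quantization) can be illustrated on a single 1-D patch. This is a minimal sketch under stated assumptions: the real method operates on radar data cubes and adapts the rate online, whereas here `keep` is fixed, the transform is 1-D, and all names (`compress`, `decompress`, `keep`, `bits`) are hypothetical.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D list."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def compress(patch, keep, bits=8):
    """Keep the `keep` largest DCT coefficients, quantized relative to the patch peak."""
    coeffs = dct(patch)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    peak = max(abs(coeffs[k]) for k in order) or 1.0  # per-patch scale
    levels = 2 ** (bits - 1) - 1
    codes = {k: round(coeffs[k] / peak * levels) for k in order}
    return codes, peak, len(coeffs)

def decompress(codes, peak, n, bits=8):
    """Rebuild the patch from the surviving quantized coefficients."""
    levels = 2 ** (bits - 1) - 1
    coeffs = [0.0] * n
    for k, q in codes.items():
        coeffs[k] = q / levels * peak
    return idct(coeffs)

# A smooth patch whose energy sits in two frequency components (k = 0 and k = 2).
patch = [3.0 + 10.0 * math.cos(math.pi * (i + 0.5) * 2 / 16) for i in range(16)]
codes, peak, n = compress(patch, keep=2)
restored = decompress(codes, peak, n)
max_error = max(abs(a - b) for a, b in zip(patch, restored))
```

Storing one scale (`peak`) per patch is what lets a low bit width cover patches with very different dynamic ranges; the rate controller in the paper would then vary how many coefficients survive.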
AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors
Authors: Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll
First: 2026-03-18T17:39:05+00:00 · Latest: 2026-03-18T17:39:05+00:00
Comments: Our project page is available at https://miraymen.github.io/ahoy/
Abstract
We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/
中文标题/摘要
标题:AHOY!从YouTube视频中在遮挡下的可动画人类体重建,使用高斯散点图和视频扩散先验
我们提出了AHOY,一种方法,用于从野生单目视频中重建完整的、可动画的3D高斯化身,即使存在严重的遮挡。现有方法假设输入未被遮挡——一个完全可见的主体,通常处于标准姿势——排除了绝大多数现实世界片段,其中人们经常被家具、物体或其他人遮挡。从这样的片段中重建带来了根本性的挑战:身体的很大一部分区域可能从未被观察到,而且每种姿势的多视角监督不可用。我们通过四个贡献解决了这些挑战:(i) 一种使用身份微调的扩散模型生成未观察到的身体区域密集监督的幻觉作为监督管道;(ii) 一种两阶段的标准到姿势依赖的架构,从稀疏观察起步,到完整的姿势依赖高斯图;(iii) 一种图-姿势/LBS-姿势解耦,吸收生成数据中的多视角不一致性;(iv) 一种头部/身体分割监督策略,保留面部身份。我们在YouTube视频和多视角捕捉数据上进行了评估,这些数据具有显著的遮挡,并展示了最先进的重建质量。我们还展示了生成的化身足够稳健,可以用于新型姿势的动画和合成到使用手机视频捕捉的3DGS场景中。我们的项目页面可在https://miraymen.github.io/ahoy/ 获取。
Summary / 总结
AHOY is a method for reconstructing 3D Gaussian avatars from monocular videos with heavy occlusion, addressing the limitations of existing methods that require fully visible subjects. It uses a hallucination-as-supervision pipeline, a two-stage architecture, map-pose/LBS-pose decoupling, and head/body split supervision to handle sparse observations and multi-view inconsistencies. AHOY demonstrates state-of-the-art reconstruction quality and robustness for animation and compositing into 3D scenes.
AHOY 是一种从含有大量遮挡的 YouTube 视频中重建 3D 高斯化身的方法。它使用了幻觉作为监督的管道、两阶段架构、图-姿态解耦以及头部/身体分割监督来解决从遮挡视频中重建的挑战。该方法展示了最先进的重建质量,并允许进行稳健的动画和合成到 3D 场景中。
Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection
Authors: Amine Lbath
First: 2026-03-18T17:38:35+00:00 · Latest: 2026-03-18T17:38:35+00:00
Comments: Supervisor: Prof. Massih-Reza Amini
Abstract
Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.
中文标题/摘要
标题:面向软件漏洞检测的大规模自动化仓库级数据集研究
软件漏洞的数量不断增加,并且在实践中难以检测。尽管基于学习的漏洞检测已经取得进展,但现有基准大多是函数为中心的,无法捕捉到真实的、可执行的、跨过程的环境。最近的仓库级安全基准展示了真实环境的重要性,但其手动整理限制了规模。本博士研究提出了一种自动化基准生成器,该生成器将真实的漏洞注入到实际的仓库中,并合成可重复的漏洞证明(PoV)利用,从而为训练和评估仓库级漏洞检测代理提供精确标记的数据集。我们进一步研究了注入代理和检测代理之间的对抗共进化循环,以在现实约束下提高鲁棒性。
Summary / 总结
This doctoral research addresses the scarcity of realistic, large-scale benchmarks for software vulnerability detection. It proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, yielding precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. The work is presented as a research proposal: beyond the generator, it investigates an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.
该研究旨在通过提出一种自动基准生成器,将现实中的漏洞注入实际代码库,并生成可重现的漏洞证明,来解决软件漏洞检测难题。方法包括创建精确标注的数据集,用于训练和评估漏洞检测代理。主要发现包括生成了可扩展的数据集,能够捕捉到执行的、程序间的真实环境,从而在实际约束下提高了漏洞检测的鲁棒性。
TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
Authors: Pepe Alonso
First: 2026-03-18T17:38:22+00:00 · Latest: 2026-03-18T17:38:22+00:00
Comments: Toolpaper, 7 pages, 3 tables, 1 figure, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)
Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
中文标题/摘要
标题:TDAD:基于测试驱动的代理开发 - 通过基于图的影响分析减少AI编码代理的代码回退
AI编码代理可以解决实际软件问题,但它们经常引入回退,导致之前通过的测试失败。当前基准测试几乎完全关注解决率,而忽视了回退行为的研究。本文介绍了TDAD(基于测试驱动的代理开发),这是一种开源工具和基准方法,结合了基于抽象语法树(AST)的代码-测试图构建和加权影响分析,以揭示最有可能受提议更改影响的测试。TDAD的GraphRAG工作流程在SWE-bench上验证了两个本地模型(Qwen3-Coder 30B在100个实例上和Qwen3.5-35B-A3B在25个实例上),减少了测试级别回退70%(从6.08%到1.82%),并提高了解决率从24%到32%。一个令人惊讶的发现是,仅TDD提示增加了回退(9.94%),表明较小的模型从上下文信息(需要验证哪些测试)中受益更多,而不是从程序指令(如何进行TDD)中受益。自主自动改进循环在10个实例子集上将解决率从12%提高到60%,且无回退。这些发现表明,对于AI代理工具设计,呈现上下文信息优于规定程序化工作流。所有代码、数据和日志均可在https://github.com/pepealonso95/TDAD/公开获取。
Summary / 总结
TDAD (Test-Driven Agentic Development) aims to reduce code regressions introduced by AI coding agents by combining AST-based code-test graph construction with weighted impact analysis. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and, when deployed as an agent skill, improved resolution from 24% to 32%. TDD prompting alone increased regressions to 9.94%, indicating that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These results suggest that surfacing contextual information is more effective than prescribing procedural workflows for AI agent tool design.
TDAD(Test-Driven Agentic Development)通过使用基于图的影响分析来减少AI编码代理中的代码回退。在SWE-bench上使用两个本地模型进行评估时,TDAD的GraphRAG工作流显著减少了70%的测试级别回退,并将解决率从24%提高到32%。研究还发现,TDD提示本身会增加回退,表明对于较小的模型,提供上下文信息比提供程序性指令更为有益。一个自主的自动改进循环进一步将解决率提高到60%,且没有回退。这些结果表明,在AI代理工具设计中,提供上下文信息比规定程序性工作流程更为有效。
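The idea of an AST-based code-test graph with weighted impact analysis can be sketched with Python's standard `ast` module. The example below is not TDAD's implementation: the toy `SOURCE`, the `test_` naming convention, and the distance-decay weighting (`decay`) are assumptions chosen to keep the illustration small.

```python
import ast

SOURCE = '''
def parse(x):
    return int(x)

def add(a, b):
    return parse(a) + parse(b)

def greet(name):
    return "hi " + name

def test_add():
    assert add("1", "2") == 3

def test_greet():
    assert greet("bob") == "hi bob"
'''

def build_call_graph(source):
    """Map each top-level function name to the set of plain names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            calls = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.add(sub.func.id)
            graph[node.name] = calls
    return graph

def impacted_tests(graph, changed, decay=0.5):
    """Score tests by call distance to `changed` (direct callers score 1.0)."""
    scores = {}
    for name in graph:
        if not name.startswith("test_"):
            continue
        # Breadth-first walk from the test through its transitive callees.
        frontier, seen, depth = {name}, {name}, 0
        while frontier:
            if changed in frontier:
                scores[name] = decay ** max(depth - 1, 0)
                break
            nxt = set()
            for func in frontier:
                nxt |= graph.get(func, set()) - seen
            seen |= nxt
            frontier = nxt
            depth += 1
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

graph = build_call_graph(SOURCE)
after_parse_change = impacted_tests(graph, "parse")
after_add_change = impacted_tests(graph, "add")
```

Handing an agent the resulting short list of tests to re-run is the kind of contextual information the abstract contrasts with procedural TDD instructions.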
Specification-Aware Distribution Shaping for Robotics Foundation Models
Authors: Sadık Bera Yüksel, Derya Aksaray
First: 2026-03-18T17:36:46+00:00 · Latest: 2026-03-18T17:36:46+00:00
Comments: 8 pages, 3 figures
Abstract
Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.
中文标题/摘要
标题:基于规范的机器人基础模型动作分布优化
机器人基础模型在执行跨多种任务和环境的自然语言指令方面表现出强大的能力。然而,在部署过程中,它们仍然主要依赖数据驱动,缺乏对安全性和时间依赖规范满足性的正式保证。实际上,机器人经常需要遵守涉及丰富时空要求的操作约束,如时间限制的目标访问、顺序目标和持续的安全条件。在本工作中,我们提出了一种基于规范的动作分布优化框架,在执行预训练的机器人基础模型时强制执行广泛的信号时序逻辑(STL)约束,而不修改其参数。在每个决策步骤中,该方法通过利用前向动力学传播来推理剩余的时段,计算一个最小修改的动作分布,以满足硬的STL可行性约束。我们使用最先进的机器人基础模型在多个环境中并针对复杂的规范进行了仿真验证。
Summary / 总结
This work addresses the need for formal guarantees in the execution of robotics foundation models by proposing a specification-aware action distribution optimization framework. The method ensures compliance with Signal Temporal Logic (STL) constraints without altering the pretrained model's parameters. It computes a modified action distribution at each decision step to satisfy a hard STL feasibility constraint using forward dynamics propagation. The framework is validated in simulation across various environments and complex specifications.
研究旨在提高机器人基础模型在操作约束下的安全性和合规性。提出了一种规格感知的动作分布优化框架,该框架在不修改模型参数的情况下强制执行信号时序逻辑(STL)约束。该方法在每个决策步骤中通过使用前向动力学传播计算一个最小修改的动作分布,以满足硬STL可行性约束。实验在仿真中展示了该框架在多种环境和复杂规格下的有效性。
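To make the notion of reshaping an action distribution under a hard feasibility constraint concrete, here is a deliberately small sketch. It assumes a 1-D robot with bounded step size, a specification of the form "reach the goal within the remaining steps and never enter an unsafe interval", and a mask-and-renormalize rule as one simple way to modify the distribution. None of these choices come from the paper, whose optimization over STL constraints is more general.

```python
def reachable(pos, steps_left, goal, max_step=1.0):
    """Can a robot moving at most max_step per step reach `goal` in time?

    Simplification: ignores the unsafe interval when checking reachability.
    """
    return abs(goal - pos) <= max_step * steps_left

def shape_distribution(probs, pos, steps_left, goal, unsafe):
    """Zero out actions that break the spec, then renormalize the rest.

    probs: dict mapping action (a displacement) to its model probability.
    unsafe: (low, high) interval the robot must never enter.
    """
    shaped = {}
    for action, p in probs.items():
        nxt = pos + action  # one-step forward dynamics
        safe = not (unsafe[0] <= nxt <= unsafe[1])
        feasible = reachable(nxt, steps_left - 1, goal)
        shaped[action] = p if (safe and feasible) else 0.0
    total = sum(shaped.values())
    if total == 0.0:
        raise ValueError("no action keeps the specification feasible")
    return {a: p / total for a, p in shaped.items()}

# Hypothetical model output that prefers moving away from the goal.
model_probs = {-1.0: 0.5, 0.0: 0.3, 1.0: 0.2}
shaped = shape_distribution(model_probs, pos=0.0, steps_left=3,
                            goal=3.0, unsafe=(-2.0, -0.5))
```

The pretrained model's parameters are never touched; only the distribution it emits at the current step is adjusted, which matches the deployment-time setting the abstract describes.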
LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition
Authors: Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
First: 2026-03-18T17:34:07+00:00 · Latest: 2026-03-18T17:34:07+00:00
Comments: 18 pages (main + supp)
Abstract
Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
中文标题/摘要
标题:LaDe:统一多层图形媒体生成与分解
媒体设计层生成能够仅通过自然语言提示创建完全可编辑的分层设计文档,如海报、传单和标志。现有方法要么限制输出层数为固定数量,要么要求每层仅包含连续区域,导致层数随设计复杂度线性增加。我们提出LaDe(分层媒体设计),这是一种潜扩散框架,能够生成灵活数量的语义上有意义的分层。LaDe 结合了三个组件:基于LLM的提示扩展器,将简短用户意图转换为分层结构化的描述,以指导生成;具有4D RoPE位置编码机制的潜扩散变换器,联合生成完整的媒体设计及其构成的RGBA分层;以及RGBA VAE,用于每个分层的全透明通道解码。通过在训练期间条件化分层样本,我们的统一框架支持三个任务:文本到图像生成、文本到分层媒体设计生成以及媒体设计分解。我们在Crello测试集上将LaDe与Qwen-Image-Layered在文本到分层和图像到分层任务上进行比较。LaDe 在文本到分层生成方面优于Qwen-Image-Layered,通过两个VLM作为评判者(GPT-4o mini和Qwen3-VL)验证了文本到分层对齐的改进。
Summary / 总结
The research aims to generate and decompose graphic media with flexible layers using natural language prompts. LaDe, a latent diffusion framework, combines an LLM-based prompt expander, a Latent Diffusion Transformer, and an RGBA VAE to generate and decompose media designs. Experiments show that LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as evaluated by VLMs GPT-4o mini and Qwen3-VL.
研究旨在使用自然语言提示生成和分解具有灵活层次的图形媒体。LaDe 是一个潜扩散框架,结合了基于LLM的提示扩展器、潜扩散变换器和RGBA VAE,以生成和分解媒体设计。实验表明,LaDe 在文本到层生成方面优于 Qwen-Image-Layered,通过 VLMs GPT-4o mini 和 Qwen3-VL 的评估,提高了文本到层的对齐度。
MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning
Authors: Yihong Guo, Yu Yang, Pan Xu, Anqi Liu
Venue: ICLR 2026
First: 2025-06-10T05:36:54+00:00 · Latest: 2026-03-18T17:33:14+00:00
Comments: Published at ICLR 2026
Abstract
We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which push the policy toward actions with high target-domain Q-values, rather than high source domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
中文标题/摘要
标题:MOBODY:基于模型的离动力学离线强化学习
我们研究离动力学离线强化学习,目标是从离线源数据和有限的目标数据集中的动力学不匹配部分学习策略。现有方法要么惩罚奖励,要么丢弃在动力学变化大的区域发生的源过渡。因此,它们仅使用低变化区域的数据来优化策略,限制了在目标域中高奖励状态的探索,这些状态不在这些区域中。因此,当动力学变化显著或最优轨迹位于低变化区域之外时,这些方法往往无法奏效。为克服这一限制,我们提出了MOBODY,一种基于模型的离动力学离线强化学习算法,该算法使用学习到的目标域动力学过渡来探索目标域,而不仅仅是使用低动力学变化的过渡进行训练。对于动力学学习,基于在不同域中实现相同下一个状态需要采取不同动作的观察,MOBODY为每个域使用单独的动作编码器,将不同的动作编码到共享的潜在空间中,同时共享状态的统一表示和共同的转换函数。我们还在策略优化中引入了目标Q加权行为克隆损失,以避免出现分布外的动作,使策略趋向于目标域Q值高的动作,而不是源域Q值高的动作或均匀模仿离线数据集中的所有动作。我们在广泛的MuJoCo和Adroit基准上评估了MOBODY,结果显示它优于最先进的离动力学RL基线以及基于不同动力学学习基线的策略学习方法,特别是在现有方法难以应对的挑战性场景中表现尤为突出。
Summary / 总结
MOBODY is a Model-Based Off-Dynamics Offline RL algorithm designed to learn policies from offline datasets with mismatched dynamics. It optimizes the policy using learned target dynamics transitions, allowing for exploration of high-reward states outside low-dynamics-shift regions. Experimental results show that MOBODY outperforms existing methods across various MuJoCo and Adroit benchmarks, particularly in challenging scenarios where other approaches fail.
MOBODY 是一种针对离域离线强化学习的算法,通过使用学习到的目标域动力学转换来优化策略,而不是仅仅使用低动力学偏移的转换。它采用每个领域独立的动作编码器,并引入目标 Q 加权行为克隆损失以避免离域动作。实验结果表明,MOBODY 在各种 MuJoCo 和 Adroit 基准测试中优于最先进的基线方法,特别是在其他方法失败的挑战性场景中表现出显著改进。
Provably Safe Model Updates
Authors: Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker
First: 2025-12-01T17:19:53+00:00 · Latest: 2026-03-18T17:29:55+00:00
Comments: 12 pages, 9 figures. This work has been accepted for publication at SaTML 2026. The final version will be available on IEEE Xplore
Abstract
Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
中文标题/摘要
标题:可验证安全的模型更新
关键安全环境本质上是动态的。分布变化、新兴漏洞和不断变化的要求需要持续更新机器学习模型。然而,即使是看似无害的参数更新也可能产生意想不到的后果,例如经典模型中的灾难性遗忘或基础模型中的对齐漂移。现有的启发式方法(例如正则化、参数隔离)可以减轻这些影响,但无法保证更新后的模型继续满足所需性能规范。我们通过引入一个可验证安全的模型更新框架来解决这个问题。我们的方法首先将问题形式化为计算最大的局部不变域(LID):参数空间中的一个连通区域,其中所有点都得到验证,满足给定的规范。虽然精确的最大LID计算是不可行的,但我们证明将问题松弛到参数化抽象域(正交体、zonotope)可以得到一个可解的对偶公式。这使得独立于数据或算法的更新认证成为可能,通过将它们投影到安全域上实现。我们的公式还允许计算多个近似最优的LID,整合启发式正则化偏置,并使用前瞻数据缓冲区。在持续学习和基础模型微调基准测试中,我们的方法在避免遗忘方面与启发式基线相当或更优,同时提供形式安全保证。
Summary / 总结
This paper addresses the challenge of safely updating machine learning models in dynamic environments, where distribution shifts and evolving requirements necessitate continuous model updates. The authors introduce a framework for computing the largest locally invariant domain (LID) to ensure that updated models continue to meet performance specifications. While exact LID computation is intractable, the method relaxes the problem to parameterized abstract domains, enabling efficient certification of updates. The approach also allows for the computation of multiple LIDs, incorporation of regularization biases, and use of lookahead data buffers. Experimental results show that the method matches or exceeds heuristic baselines in avoiding catastrophic forgetting while providing formal safety guarantees.
该论文旨在解决动态环境中机器学习模型安全更新的挑战。它提出了一种计算最大局部不变域(LID)的方法,以确保参数更新能够维持所需的性能规范。该方法将问题放松到可处理的参数化抽象域,从而能够高效地认证更新并提供形式上的安全性保证。实验结果显示,该方法在各种基准测试中避免遗忘的效果与启发式基线相当或更优。
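For the orthotope (axis-aligned box) case mentioned in the abstract, projecting an update onto the safe domain reduces to clipping each parameter to its certified interval. The sketch below shows only that projection step, assuming the bounds `lower` and `upper` have already been certified by some other procedure; the function names and the plain gradient step are illustrative, not the paper's primal-dual formulation.

```python
def project_onto_box(params, lower, upper):
    """Euclidean projection of a parameter vector onto an axis-aligned box."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(params, lower, upper)]

def safe_update(params, grads, lower, upper, lr=0.1):
    """Take one gradient step, then project back into the certified box."""
    proposed = [p - lr * g for p, g in zip(params, grads)]
    return project_onto_box(proposed, lower, upper)

# Hypothetical certified bounds around the current parameters.
lower, upper = [-0.5, 0.0], [0.5, 1.5]
updated = safe_update([0.0, 1.0], grads=[10.0, -10.0], lower=lower, upper=upper)
```

Because the projection depends only on the bounds, it can wrap any optimizer or dataset, which is the sense in which the abstract calls certification independent of the data or algorithm used.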
HyperMotionX: The Dataset and Benchmark with DiT-Based Pose-Guided Human Image Animation of Complex Motions
Authors: Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Daquan Zhou, Tong-Yee Lee, Bo Li, Peng-Tao Jiang
First: 2025-05-29T01:30:46+00:00 · Latest: 2026-03-18T17:20:31+00:00
Comments: 17 pages, 7 figures
Abstract
Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods can generate high-fidelity and time-consistent animation sequences for regular motions and static scenes, they still show obvious limitations on complex human body motions that are highly dynamic and non-standard, and a high-quality benchmark for evaluating complex human motion animations is lacking. To address this challenge, we propose a concise yet powerful DiT-based human animation generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Furthermore, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. The codes, model weights, and dataset have been made publicly available at https://vivocameraresearch.github.io/hypermotion/
中文标题/摘要
标题:HyperMotionX:基于DiT的姿势引导人体图像动画数据集和基准
近期在扩散模型方面的进展显著提高了条件视频生成,特别是在姿势引导的人体图像动画任务中的表现。尽管现有方法能够生成高质量且时间一致的动画序列,特别是在常规动作和静态场景中。然而,在面对包含高度动态和非标准动作的复杂人体动作时,仍然存在明显的局限性,且缺乏高质量的基准用于复杂人体动作动画的评估。为解决这一挑战,我们提出了一种简洁而强大的基于DiT的人体动画生成基线,并设计了空间低频增强RoPE模块,这是一种新颖的模块,通过引入可学习的频率缩放来选择性地增强低频空间特征建模。此外,我们引入了Open-HyperMotionX数据集和HyperMotionX基准,提供了高质量的人体姿态注释和精心挑选的视频片段,以评估和改进在复杂人体动作条件下的人体图像动画模型。我们的方法显著提高了高度动态人体运动序列的结构稳定性和外观一致性。广泛的实验表明,我们的数据集和提出的方法在提高复杂人体运动图像动画的生成质量方面具有有效性。相关代码、模型权重和数据集已公开发布于https://vivocameraresearch.github.io/hypermotion/
Summary / 总结
The research addresses the limitations of existing methods in generating high-fidelity animations for complex human motions, such as highly dynamic and non-standard movements. It introduces a DiT-based baseline with a spatial low-frequency enhanced RoPE module to enhance low-frequency spatial feature modeling. The study also presents the Open-HyperMotionX Dataset and HyperMotionX Bench, which include high-quality human pose annotations and video clips for evaluating and improving pose-guided human image animation models under complex motion conditions. The results show significant improvements in structural stability and appearance consistency for complex human motion sequences.
研究旨在利用扩散模型提高复杂人体动作的高保真度和时间一致性动画生成。作者提出了一种基于DiT的基本模型,并引入了新型的空间低频增强RoPE模块,以增强低频空间特征建模。研究还引入了Open-HyperMotionX数据集和HyperMotionX基准,提供了高质量的姿态标注和视频片段,用于评估和改进在复杂人体动作条件下的人体图像动画模型。该方法显著提高了高度动态人体运动序列的结构稳定性和外观一致性,证明了所提出方法在生成复杂人体运动图像动画方面的有效性。
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan
First: 2026-03-18T17:20:19+00:00 · Latest: 2026-03-18T17:20:19+00:00
Abstract
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse. (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
中文标题/摘要
标题:VideoAtlas:在对数计算中导航长格式视频
将语言模型扩展到视频中引入了两个挑战:表示,其中现有方法依赖于有损近似;以及长上下文,其中字幕或代理管道将视频压缩为文本并失去视觉保真度。为了解决这个问题,我们引入了VideoAtlas,这是一种任务无关的环境,将视频表示为一个同时无损、可导航、可扩展且无需字幕和预处理的分层网格。视频的概览一目了然,任何区域都可以递归放大,使用相同的视觉表示用于视频、中间调查和代理的记忆,从而在整个过程中消除有损文本转换。这种分层结构确保访问深度仅以视频长度的对数增长。对于长上下文,递归语言模型(RLMs)最近为长文本提供了一种强大的解决方案,但将其扩展到视觉领域需要一个结构化的环境来进行递归,这正是VideoAtlas提供的。VideoAtlas作为马尔可夫决策过程解锁了Video-RLM:一种并行的主从架构,其中主节点协调全局探索,而工作节点同时钻入分配的区域以积累无损视觉证据。我们展示了三个关键发现:(1)计算量随视频时长呈对数增长,并因网格结构重用带来的30-60%多模态缓存命中率而进一步放大。(2)环境预算,其中限制最大探索深度提供了一个有原则的计算-准确度超参数。(3)适应性计算分配的出现,其与问题的粒度成比例。当从1小时扩展到10小时基准时,Video-RLM仍然是最具有时长鲁棒性的方法,准确度下降最小,这表明结构化环境导航是视频理解的一种可行且可扩展的范式。
Summary / 总结
VideoAtlas is designed to address the challenges of representing long-form video and maintaining visual fidelity by introducing a hierarchical grid representation that is lossless and navigable. The system uses a Markov Decision Process to enable a parallel Master-Worker architecture for efficient exploration and evidence accumulation. Key experimental findings include logarithmic compute growth with video duration, a 30-60% multimodal cache hit rate due to structural reuse, and adaptive compute allocation that scales with question granularity, demonstrating the method's robustness for long-duration video understanding tasks.
VideoAtlas旨在解决长视频的表示和保持视觉保真度的问题。它引入了一种层次网格表示法,该表示法无损、可导航且可扩展,无需字幕或预处理。系统采用主-工人架构,其中主节点进行全局探索,工人则专注于特定区域,收集无损视觉证据。关键发现包括视频时长与计算量呈对数增长,通过环境预算设置合理的计算-准确性超参数,以及根据问题粒度自适应分配计算资源。当从1小时扩展到10小时的基准测试时,Video-RLM方法在保持最小准确率下降的情况下仍表现出对视频时长的鲁棒性,证明了结构化环境导航是视频理解的一种可行且可扩展的范式。
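The claim that access depth grows only logarithmically with video length can be illustrated with a small sketch. It assumes a hypothetical grid in which every zoom step splits a cell into `branching` equal temporal sub-cells and a leaf holds one frame; the branching factor of 64, the 1 frame-per-second sampling, and the function name are illustrative assumptions, not values from the paper.

```python
def zoom_depth(num_frames: int, branching: int) -> int:
    """Zoom steps needed until one cell covers a single frame.

    Hypothetical model: each zoom splits a cell into `branching`
    equal temporal sub-cells (not the paper's actual grid layout).
    """
    if num_frames < 1 or branching < 2:
        raise ValueError("need num_frames >= 1 and branching >= 2")
    depth = 0
    covered = 1  # number of frames addressable at the current depth
    while covered < num_frames:
        covered *= branching
        depth += 1
    return depth


# Assumed sampling of 1 frame per second, 64 cells per grid.
one_hour_frames = 3600
ten_hour_frames = 36000
print(zoom_depth(one_hour_frames, 64), zoom_depth(ten_hour_frames, 64))
```

Under these assumptions a tenfold increase in duration adds a single level of zoom, which is the behaviour the abstract describes.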
Unified Policy Value Decomposition for Rapid Adaptation
Authors: Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi
First: 2026-03-18T17:19:56+00:00 · Latest: 2026-03-18T17:19:56+00:00
Abstract
Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
中文标题/摘要
标题:统一策略价值分解以实现快速适应
在复杂控制系统中实现快速适应仍然是强化学习中的核心挑战。我们提出了一种框架,在该框架中,策略和价值函数共享一个低维系数向量——目标嵌入,该向量捕捉任务身份并使模型能够无需重新训练表示即刻适应新任务。在预训练阶段,我们通过双线性演员-评论分解联合学习结构化价值基和兼容的策略基。评论因子化为Q = ∑_k G_k(g) y_k(s,a),其中G_k(g)是目标条件化的系数向量,y_k(s,a)是学习到的价值基函数。这种乘法门控——上下文信号缩放一组状态依赖基——类似于在层5锥形神经元中观察到的增益调制现象,其中上行输入调节感觉驱动响应的增益而不改变其调谐。基于后继特征,我们将分解扩展到演员,该演员由一组加权相同系数G_k(g)的原始策略组成。在测试时,基底冻结,G_k(g)通过单次前向传播零样本估计,从而无需任何梯度更新即可立即适应新任务。我们在MuJoCo蚂蚁环境中训练一个软演员-评论家代理,目标是多方向运动,要求代理在八个指定为连续目标向量的方向上行走。双线性结构允许每个策略头专门化于一组方向,而共享的系数层则在它们之间泛化,通过在目标嵌入空间内插值来适应新方向。我们的结果表明,共享的低维目标嵌入提供了一种通用机制,用于在高维控制中实现快速、结构化的适应,并突显了在复杂强化学习系统中高效迁移的一种潜在生物合理原则。
Summary / 总结
This paper addresses the challenge of rapid adaptation in complex control systems using reinforcement learning. It introduces a framework where policy and value functions share a low-dimensional coefficient vector, or goal embedding, which captures task identity and enables immediate adaptation to new tasks without retraining. During pretraining, the system learns structured value bases and compatible policy bases through a bilinear actor-critic decomposition. At test time, the bases are frozen, and the coefficient layer is estimated zero-shot, allowing for immediate adaptation to novel tasks. The method was tested on a MuJoCo Ant environment, demonstrating that shared low-dimensional goal embeddings can facilitate rapid and structured adaptation in high-dimensional control tasks.
论文旨在通过强化学习解决复杂控制系统中的快速适应难题。提出了一种框架,其中策略和价值函数共享一个低维系数向量(目标嵌入),以实现无需重新训练即可立即适应新任务。在预训练过程中,使用双线性演员-评论家分解来学习结构化价值基和兼容的策略基。测试时,冻结基函数,通过单次前向传播估计系数层,从而实现无需梯度更新即可立即适应新任务。该方法在MuJoCo蚂蚁环境中进行了评估,表明共享的低维目标嵌入可以促进高维控制任务中的快速和结构化适应。
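The critic factorization Q = sum_k G_k(g) y_k(s,a) can be sketched in a few lines. In this minimal example the value bases are hand-written toy functions and the coefficients are given directly; in the paper both are learned, so every name below (`bilinear_q`, `toy_bases`) is a hypothetical stand-in.

```python
from typing import Callable, Sequence

Basis = Callable[[float, float], float]


def bilinear_q(coeffs: Sequence[float], bases: Sequence[Basis],
               s: float, a: float) -> float:
    """Q(s, a; g) = sum_k G_k(g) * y_k(s, a) for given coefficients G(g)."""
    if len(coeffs) != len(bases):
        raise ValueError("one coefficient per basis function is required")
    return sum(g_k * y_k(s, a) for g_k, y_k in zip(coeffs, bases))


# Toy, hand-written bases standing in for the learned y_k(s, a).
toy_bases = [
    lambda s, a: s * a,   # y_1
    lambda s, a: s + a,   # y_2
]

# The bases stay fixed; only the goal coefficients change per task.
q_task_a = bilinear_q([1.0, 0.0], toy_bases, s=2.0, a=3.0)
q_task_b = bilinear_q([0.5, 0.5], toy_bases, s=2.0, a=3.0)
print(q_task_a, q_task_b)
```

Switching tasks only changes the coefficient vector, which mirrors the abstract's point that the bases are frozen at test time and adaptation needs no gradient update.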
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Authors: Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou
First: 2026-03-16T22:42:45+00:00 · Latest: 2026-03-18T17:17:29+00:00
Abstract
Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby significantly broadening the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low-cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter (AI.IF) operator and also important gains for semantic ranking (AI.RANK). The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.
中文标题/摘要
标题:100倍成本与延迟降低:轻量级代理模型在AI查询近似中的性能分析
许多数据仓库和数据库提供商最近引入了名为AI查询的SQL扩展,使用户能够指定由LLM评估的SQL函数和条件,从而显著扩展了用户可以表达的查询类型,包括结构化和非结构化数据的组合。LLM提供了出色的语义推理能力,使其成为复杂和细腻查询的重要工具,这些查询混合了结构化和非结构化数据。虽然非常强大,但这些AI查询在被调用数千次时可能会变得极其昂贵。本文对一种最近的AI查询近似方法进行了全面评估,该方法使低成本分析和数据库应用程序能够受益于AI查询。该方法在语义过滤器(AI.IF)操作符上实现了超过100倍的成本和延迟降低,并且在语义排名(AI.RANK)方面也取得了重要进展。成本和性能的提升来自于使用嵌入向量的廉价且准确的代理模型。我们展示了尽管在延迟和成本方面取得了巨大的提升,这些代理模型仍然保持了准确性,并且在各种基准数据集中偶尔提高了准确性,包括扩展的亚马逊评论基准数据集,该数据集包含1000万行数据。我们为这种方法在Google BigQuery中提供了一个适合OLAP的架构,用于纯在线(即兴)查询,并在AlloyDB中提供了一个低延迟HTAP数据库友好的架构,通过将代理模型训练移至离线来进一步改善延迟。我们介绍了加速代理模型训练的技术。
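The idea of a cheap proxy model over embedding vectors can be sketched as follows. The abstract does not say which proxy model is used, so a nearest-centroid classifier is swapped in purely for illustration; the embeddings and the LLM-provided labels are made-up toy values, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def train_proxy(embeddings: List[Vector],
                llm_labels: List[bool]) -> Tuple[List[float], List[float]]:
    """Fit one centroid per class from a small LLM-labelled sample."""
    pos = [e for e, y in zip(embeddings, llm_labels) if y]
    neg = [e for e, y in zip(embeddings, llm_labels) if not y]
    if not pos or not neg:
        raise ValueError("the labelled sample must contain both classes")
    return centroid(pos), centroid(neg)


def proxy_filter(model: Tuple[List[float], List[float]],
                 embedding: Vector) -> bool:
    """Approximate the semantic filter without calling the LLM."""
    pos_c, neg_c = model
    return sq_dist(embedding, pos_c) < sq_dist(embedding, neg_c)


# Toy 2-D embeddings with labels assumed to come from an LLM on a sample.
sample = [[1.0, 0.9], [0.9, 1.0], [-1.0, -0.8], [-0.9, -1.1]]
labels = [True, True, False, False]
model = train_proxy(sample, labels)
print(proxy_filter(model, [0.8, 0.7]), proxy_filter(model, [-0.7, -0.9]))
```

The design point is that the expensive model labels only a small sample, after which every remaining row is scored with a few arithmetic operations on its precomputed embedding.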
A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Digital Pathology Images
Authors: Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin, William L. Clapp, Meryl A. Waldman, Tarek M. El-Ashkar, Sanjay Jain, Luis Rodrigues, Kuang Yu Jen, Avi Z. Rosenberg, Michael T. Eadon, Jeffrey B. Hodgin, Pinaki Sarder
First: 2026-03-16T22:37:43+00:00 · Latest: 2026-03-18T17:17:27+00:00
Comments: 31 Pages, 14 Tables, 12 figures, Co-correspondence to jhodgin@med.umich.edu and pinaki.sarder@ufl.edu
Abstract
Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
中文标题/摘要
标题:肾数字病理图像综合基准测试:基于组织病理学的基础模型
基于组织病理学的基础模型(HFMs),在大规模癌症数据集上预训练,已推动计算病理学的发展。然而,它们在非癌性慢性肾病中的应用尚未得到充分探索,尽管肾病理与肾细胞癌和尿路上皮癌共存。我们系统地评估了11个公开可用的HFMs在11个肾特异性下游任务中的表现,这些任务涵盖了多种染色(PAS、H&E、PASM和IHC)、空间尺度(切片级和玻片级)、任务类型(分类、回归和拷贝检测)以及临床目标,包括检测、诊断和预后。使用重复分层分组交叉验证评估切片级性能,而玻片级任务则使用重复嵌套分层交叉验证评估。使用Friedman检验后跟成对Wilcoxon符号秩检验(Holm-Bonferroni校正)进行统计显著性检验,并使用紧凑字母显示可视化。为了促进可重复性,我们发布了一个开源Python包,kidney-hfm-eval,可在https://pypi.org/project/kidney-hfm-eval/ 复现评估管道。结果显示,HFMs在由粗尺度肾形态驱动的任务中表现出中等到较强的表现,包括诊断分类和显著结构改变的检测。相比之下,对于需要精细微结构区分、复杂生物表型或玻片级预后推断的任务,性能持续下降,这在很大程度上与染色类型无关。总体而言,当前的HFMs似乎主要编码静态的粗尺度表示,可能难以捕捉到细微的肾病理或与预后相关的信号。我们的结果强调了需要针对肾脏的、多染色和多模态的基础模型,以支持肾病学中的临床可靠决策。
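The Holm-Bonferroni correction applied to the pairwise Wilcoxon tests can be sketched in plain Python. This is the textbook step-down procedure, not code from the kidney-hfm-eval package, and the p-values below are invented for illustration.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Invented p-values for four pairwise model comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)
```

Compared with a plain Bonferroni threshold of alpha/m for every test, the step-down thresholds relax as hypotheses are rejected, which keeps family-wise error control while retaining more power.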
TransText: Transparency Aware Image-to-Video Typography Animation
Authors: Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel
First: 2026-03-18T17:16:40+00:00 · Latest: 2026-03-18T17:16:40+00:00
Comments: 19 pages, publication review
Abstract
We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
中文标题/摘要
标题:TransText:透明度感知的图像到视频字体动画
我们提出了迄今为止第一个能够将图像到视频模型适应层感知文本(字形)动画的方法,这是实际动态视觉设计中至关重要的能力。现有方法主要将透明度编码(alpha通道)视为附加到RGB空间的额外潜在维度,需要重建基于RGB的变分自编码器(VAE)。然而,由于高质量透明字形数据稀缺,重新训练VAE计算成本高昂,并可能侵蚀从大规模RGB语料库中学到的鲁棒语义先验,可能导致潜在模式混杂。为缓解这些限制,我们提出了一种基于新颖的Alpha-as-RGB范式的TransText框架,以联合建模外观和透明度而不修改预训练生成流形。TransText通过潜在空间连接将alpha通道嵌入为与RGB兼容的视觉信号,明确确保跨模态(RGB和Alpha)一致性,同时防止特征纠缠。我们的实验表明,TransText显著优于基线方法,生成了连贯、高保真度的透明动画,具有多样化的精细效果。
Summary / 总结
The research introduces TransText, a method for adapting image-to-video models to animate text layers with transparency, addressing the limitations of existing approaches that require retraining variational autoencoders. TransText uses an Alpha-as-RGB paradigm to model appearance and transparency without altering the pre-trained generative manifold, ensuring cross-modal consistency and preventing feature entanglement. Experiments show that TransText generates more coherent and high-fidelity transparent animations with diverse effects compared to baseline methods.
研究引入了TransText方法,用于将图像到视频的模型适应层感知的文字动画,解决了现有方法将透明度作为额外的潜空间维度处理的局限性。TransText采用Alpha-as-RGB范式,不修改预训练生成流形,同时联合建模外观和透明度,确保跨模态一致性并防止特征纠缠。实验表明,TransText生成了更加连贯且高保真的透明动画,具有多样且精细的效果,优于基线方法。
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
Authors: Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, Jingyu Hu
First: 2026-02-21T06:55:28+00:00 · Latest: 2026-03-18T17:12:08+00:00
Comments: Accepted by CVPR2026
Abstract
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at https://github.com/DavidYan2001/LaS-Comp.
中文标题/摘要
标题:LaS-Comp:利用潜在空间一致性实现零样本3D补全
本文介绍了LaS-Comp,这是一种零样本且类别无关的方法,利用3D基础模型丰富的几何先验,实现不同类型的不完整观察下的3D形状补全。我们的贡献有三个方面:首先,我们通过互补的两阶段设计利用这些强大的生成先验进行补全:(i) 显式的替换阶段,保留不完整观察的几何结构,确保补全是忠实的;(ii) 隐式的细化阶段,确保观察区域和合成区域之间的边界无缝衔接。其次,我们的框架无需训练且兼容不同的3D基础模型。第三,我们引入了Omni-Comp,这是一个综合基准,结合了真实世界和合成数据,具有多样且具有挑战性的不完整模式,使评估更加全面和真实。定量和定性的实验均表明,我们的方法优于之前的最先进的方法。我们的代码和数据将在https://github.com/DavidYan2001/LaS-Comp 上提供。
Summary / 总结
LaS-Comp is a zero-shot and category-agnostic approach that uses the geometric priors of 3D foundation models to complete 3D shapes from partial observations. It consists of two stages: an explicit replacement stage that preserves the partial geometry and an implicit refinement stage that ensures seamless boundaries. The approach is training-free and compatible with various 3D foundation models. LaS-Comp outperforms previous methods both quantitatively and qualitatively, and it is evaluated on a new benchmark called Omni-Comp. The code and data will be available at https://github.com/DavidYan2001/LaS-Comp.
LaS-Comp 是一种零样本且跨类别的方法,利用 3D 基础模型的几何先验从部分观察中完成 3D 形状。该方法包含两个阶段:一个显式的替换阶段保留部分几何结构,以及一个隐式的精修阶段确保边界无缝连接。该方法无需训练且兼容多种 3D 基础模型。LaS-Comp 在定量和定性实验中均优于先前的方法,并在名为 Omni-Comp 的新基准上进行了评估。代码和数据将在 https://github.com/DavidYan2001/LaS-Comp 提供。
Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning
Authors: Jingchun Yang, Jinchang Zhang
First: 2026-03-18T17:04:48+00:00 · Latest: 2026-03-18T17:04:48+00:00
Abstract
The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.
中文标题/摘要
标题:基于法律多智能体推理的可解释交通责任从行车记录仪视频解读
行车记录仪的广泛应用使得交通事故视频证据日益丰富,但将“视频中发生了什么”转化为“根据哪些法律规定谁应承担责任”仍主要依赖于人类专家。现有的以自我视角为主的交通事故研究主要集中在感知和语义理解上,而基于LLM的法律方法大多基于文本案例描述,很少包含视频证据,这在两者之间留下了一个明显的差距。我们首先提出了C-TRAIL,这是一个多模态法律数据集,在中国的交通法规体系下,明确地将行车记录仪视频和文本描述与一组封闭的责任模式及其对应的中国交通法规进行了对齐。在此基础上,我们引入了一个两阶段框架:(1) 交通事故理解模块,生成视频文本描述;(2) 法律多智能体框架,输出责任模式、法规集合和完整的判决报告。C-TRAIL和MM-AU上的实验结果显示,我们的方法优于通用和法律LLM以及现有的基于代理的方法,同时提供了一个透明和可解释的法律推理过程。
Summary / 总结
This study addresses the gap between video evidence and legal responsibility in traffic accidents by proposing C-TRAIL, a multimodal legal dataset, and a two-stage framework. The first stage generates textual descriptions of dashcam videos, and the second stage uses a legal multi-agent framework to determine responsibility and legal statutes. The method outperforms general and legal language models and existing agent-based approaches, offering a transparent and interpretable legal reasoning process.
本文提出了一种通过C-TRAIL多模态数据集和两阶段框架将交通事故视频证据转化为法律责任的方法。第一阶段从行车记录仪视频中生成事故文本描述,第二阶段使用法律多智能体框架确定责任和相关法规。该方法在C-TRAIL和MM-AU数据集上优于现有方法,提供了透明和可解释的法律推理过程。
Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks
Authors: Enis Baty, Alejandro Hernández Díaz, Rebecca Davidson, Chris Bridges, Simon Hadfield
First: 2024-12-20T18:50:36+00:00 · Latest: 2026-03-18T17:02:51+00:00
Abstract
State-Space Models (SSMs) have emerged as an efficient alternative to transformers, yet existing visual SSMs retain deeply ingrained biases from their origins in natural language processing. In this paper, we address these limitations by introducing M2D-SSM, a ground-up re-derivation of selective state-space techniques for multidimensional data. Unlike prior works that apply 1D SSMs directly to images through arbitrary rasterised scanning, our M2D-SSM employs a single 2D scan that factors in both spatial dimensions natively. On ImageNet-1K classification, M2D-T achieves 84.0% top-1 accuracy with only 27M parameters, surpassing all prior SSM-based vision models at that size. M2D-S further achieves 85.3%, establishing state-of-the-art results among SSM-based architectures. Across downstream tasks, Mamba2D achieves 52.2 box AP on MS-COCO object detection (3× schedule) and 51.7 mIoU on ADE20K segmentation, demonstrating strong generalisation and efficiency at scale. Source code is available at https://github.com/cocoalex00/Mamba2D.
中文标题/摘要
标题:Mamba2D:一种原生多维状态空间模型用于视觉任务
状态空间模型(SSMs)已成为变压器的高效替代方案,但现有的视觉SSMs仍然保留了其自然语言处理起源中的深层偏见。在本文中,我们通过引入M2D-SSM来解决这些限制,这是一种从头开始重新推导的多维数据选择性状态空间技术。与先前直接将1D SSM应用于图像并通过任意栅格扫描的方法不同,我们的M2D-SSM采用单一的2D扫描,能够原生地考虑两个空间维度。在ImageNet-1K分类任务上,M2D-T仅使用27M参数实现了84.0%的top-1准确率,超过了所有先前基于SSM的视觉模型。M2D-S进一步实现了85.3%的准确率,确立了基于SSM架构的最新成果。在下游任务中,Mamba2D在MS-COCO对象检测上实现了52.2的box AP(3倍训练周期),在ADE20K分割上实现了51.7的mIoU,展示了其在大规模下的强大泛化能力和效率。源代码可在https://github.com/cocoalex00/Mamba2D获取。
Summary / 总结
This paper introduces Mamba2D, a novel multi-dimensional state-space model designed for vision tasks, addressing limitations of existing visual SSMs. Unlike previous methods that apply 1D SSMs to images through rasterization, Mamba2D uses a single 2D scan that incorporates both spatial dimensions natively. The M2D-T variant achieves 84.0% top-1 accuracy on ImageNet-1K classification with 27M parameters, and M2D-S reaches 85.3%, which the authors report as state of the art among SSM-based architectures. It also demonstrates strong performance on downstream tasks, achieving 52.2 box AP on MS-COCO object detection and 51.7 mIoU on ADE20K segmentation.
本文提出了Mamba2D,这是一种新型的多维状态空间模型,旨在解决现有视觉状态空间模型的局限性。不同于以往方法通过任意栅格化扫描将1D SSM应用于图像,Mamba2D 使用单一的2D扫描,能够原生地同时处理两个空间维度。模型M2D-T在ImageNet-1K分类上以27M参数实现了84.0%的top-1准确率,超越了所有先前基于状态空间模型的视觉模型。M2D-S进一步提升至85.3%,成为基于状态空间模型的架构中的最新记录。在下游任务中,Mamba2D展示了强大的泛化能力和效率,分别在MS-COCO物体检测和ADE20K分割上实现了52.2的box AP和51.7的mIoU。
Semi-supervised Shelter Mapping for WASH Accessibility Assessment in Rohingya Refugee Camps
Authors: Kyeongjin Ahn, YongHun Suh, Sungwon Han, Jeasurk Yang, Hannes Taubenböck, Meeyoung Cha
First: 2025-11-10T15:48:04+00:00 · Latest: 2026-03-18T16:58:48+00:00
Comments: 22 pages, 13 figures, 2 tables
Abstract
Lack of access to Water, Sanitation, and Hygiene (WASH) services is a major public health concern in refugee camps, where extreme crowding accelerates the spread of communicable diseases. The Rohingya settlements in Cox's Bazar, Bangladesh, exemplify these conditions, with large populations living under severe spatial constraints. We develop a semi-supervised segmentation framework using the Segment Anything Model (SAM) to map shelters from multi-temporal sub-meter remote sensing imagery (2017-2025), improving detection in complex camp environments by 4.9% in F1-score over strong baselines. The detected shelter maps show that shelter expansion stabilized after 2020, whereas continued population growth reduced per capita living space by approximately 14% between 2020 and 2025. WASH accessibility, measured with an enhanced network-based two-step floating catchment area (2SFCA) method, declined from 2022 to 2025, increasing facility loads and exceeding global benchmarks. Gender-disaggregated scenarios that incorporate safety penalty further reveal pronounced inequities, with female accessibility approximately 27% lower than male. Together, these results demonstrate that remote sensing-driven AI diagnostics can generate equity-focused evidence to prioritize WASH investments and mitigate health risks in protracted displacement settings.
中文标题/摘要
标题:半监督庇护所测绘以评估罗兴亚难民营的WASH可及性
在难民营中,缺乏水、环境卫生和个人卫生(WASH)服务是重大的公共卫生问题,极端拥挤加速了传染病的传播。孟加拉国科克斯巴扎尔的罗兴亚定居点就是这种情况,大量人口在严重的空间限制下生活。我们使用分割一切模型(SAM)开发了一种半监督分割框架,从多时相亚米级遥感图像(2017-2025)中绘制庇护所,通过F1分数提高了4.9%的检测精度,超过了强大的基线。检测到的庇护所地图显示,庇护所扩张在2020年后趋于稳定,而人口持续增长导致2020年至2025年间人均居住空间减少了约14%。使用增强的基于网络的两步移动搜索法(2SFCA)测量的WASH可及性从2022年到2025年下降,增加了设施负荷并超过了全球标准。性别分化的场景结合了安全惩罚进一步揭示了明显的不平等,女性可及性比男性低约27%。这些结果表明,基于遥感的AI诊断可以生成关注公平性的证据,以优先考虑WASH投资并减轻长期流离失所环境中的健康风险。
Summary / 总结
This study aims to address the lack of WASH services in Rohingya refugee camps by developing a semi-supervised segmentation framework using the Segment Anything Model (SAM) to map shelters from multi-temporal remote sensing imagery. The framework improved detection accuracy by 4.9% in F1-score compared to strong baselines. Key findings include stable shelter expansion post-2020, a 14% reduction in per capita living space between 2020 and 2025, and a decline in WASH accessibility from 2022 to 2025, with female accessibility being 27% lower than male. The study highlights the potential of remote sensing and AI for generating equity-focused evidence to prioritize WASH investments.
该研究利用Segment Anything Model (SAM) 开发了一种半监督分割框架,从2017年至2025年对罗兴亚难民营的庇护所进行映射,检测准确率在F1分数上提高了4.9%。研究发现,虽然2020年后庇护所扩张趋于稳定,但人口增长导致人均居住空间在2020年至2025年间减少了14%。WASH可及性从2022年到2025年下降,女性可及性比男性低约27%,突显了性别不平等。该研究展示了遥感驱动的AI可以为优先考虑WASH投资和减轻健康风险提供证据,在长期流离失所环境中具有重要意义。
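The two-step floating catchment area (2SFCA) measure mentioned above can be sketched in its basic form. The paper uses an enhanced, network-based variant whose details are not given here, so this example uses the classic binary-threshold version with a precomputed distance matrix; the populations, facility capacities, distances, and the 300 m threshold are invented.

```python
from typing import List, Sequence


def two_step_fca(population: Sequence[float], supply: Sequence[float],
                 dist: Sequence[Sequence[float]], threshold: float) -> List[float]:
    """Classic 2SFCA accessibility score per demand location.

    dist[i][j] is the distance from demand location i to facility j.
    """
    # Step 1: supply-to-demand ratio of each facility within its catchment.
    ratios = []
    for j, s_j in enumerate(supply):
        demand = sum(p for i, p in enumerate(population) if dist[i][j] <= threshold)
        ratios.append(s_j / demand if demand > 0 else 0.0)
    # Step 2: each location sums the ratios of the facilities it can reach.
    return [
        sum(r for j, r in enumerate(ratios) if dist[i][j] <= threshold)
        for i in range(len(population))
    ]


# Invented example: three shelter clusters, two WASH facilities, distances in metres.
population = [100, 200, 50]
supply = [10, 5]
dist = [[100, 500], [200, 150], [600, 100]]
access = two_step_fca(population, supply, dist, threshold=300)
print(access)
```

A falling score for the same location over time means more people are competing for the same facility capacity, which is the sense in which the abstract reports declining accessibility and rising facility loads.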
SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale
Authors: Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers
First: 2026-03-18T16:57:22+00:00 · Latest: 2026-03-18T16:57:22+00:00
Abstract
Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.
中文标题/摘要
标题:SegFly:大规模空域RGB-热成像语义分割的2D-3D-2D范式
无人驾驶航空器(UAV)的语义分割对于空域场景理解至关重要,但现有的RGB和RGB-T数据集在规模、多样性和注释效率方面仍然受限,这主要是由于手动标注成本高以及在商用UAV上实现准确的RGB-T对齐的难度。为了解决这些挑战,我们提出了一种可扩展的几何驱动2D-3D-2D范式,该范式利用高重叠空域图像中的多视图冗余性,自动将一小部分手动标注的RGB图像的标签传播到RGB和热成像模态中,从而在统一框架内实现。通过将不到3%的RGB图像提升到语义3D点云中并重新投影到所有视图中,我们的方法能够在大规模图像集合中生成密集的伪地面真值,自动产生97%的RGB标签和100%的热成像标签,同时在无需任何2D手动细化的情况下实现91%和88%的注释准确率。我们进一步将2D-3D-2D范式扩展到跨模态图像对齐,使用3D几何作为中间对齐空间,以获得完全自动的强像素级RGB-T对齐,准确率为87%,无需硬件级同步。将我们的框架应用于现有地理参考空域图像,我们构建了SegFly,这是一个包含超过20,000张高分辨率RGB图像和超过15,000个几何对齐的RGB-T配对的大规模基准,这些配对覆盖了多种城市、工业和农村环境,跨越了多个高度和季节。在SegFly上,我们建立了Firefly基线用于RGB和热成像语义分割,并展示了传统架构和视觉基础模型从SegFly监督中受益显著,突显了几何驱动2D-3D-2D流水线在多模态场景理解中的潜力。数据和代码可在https://github.com/markus-42/SegFly/获取。
IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
Authors: Priyaranjan Pattnayak, Sanchari Chowdhuri
First: 2026-03-18T16:54:07+00:00 · Latest: 2026-03-18T16:54:07+00:00
Abstract
As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts.
Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices.
Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
中文标题/摘要
标题:IndicSafe:评估南亚多语言LLM安全性的基准
随着大型语言模型(LLMs)在多语言环境中部署,它们在文化多样、资源稀缺的语言中的安全行为仍然知之甚少。我们首次系统地评估了10种领先LLM在12种印地语系语言中的安全性,这些语言有超过12亿使用者,但在LLM训练数据中代表性不足。使用包含种姓、宗教、性别、健康和政治等文化背景的6000个提示数据集,我们评估了这些提示的翻译版本。
我们的分析揭示了显著的安全性漂移:跨语言一致性仅为12.8%,且安全标记率变异超过17%。一些模型在低资源脚本中过度拒绝良性提示,在政治敏感话题上过度标记,而其他模型则未能标记不安全的生成内容。我们使用提示级熵、类别偏差评分和多语言一致性指数量化这些失败。
我们的发现突显了多语言LLM在安全性泛化方面的关键差距,并表明安全性对齐在不同语言之间并不均匀。我们发布了IndicSafe,这是首个用于印地语系部署的文化导向安全性评估基准,并倡导基于区域危害的语言意识对齐策略。
Summary / 总结
The study evaluates the safety of 10 leading large language models across 12 Indic languages, which are spoken by over 1.2 billion people but underrepresented in model training data. Using a dataset of 6,000 culturally grounded prompts, the research reveals significant safety drift among the models, with cross-language agreement at only 12.8% and variance in the SAFE rate exceeding 17% across languages. The findings highlight critical safety generalization gaps and the need for language-aware alignment strategies.
研究评估了10个主流大型语言模型在12种印地语系语言中的安全性,这些语言有超过12亿人使用,但在LLM训练数据中代表性不足。使用包含6000个文化背景问题的数据集,研究发现模型之间的安全性存在显著差异,跨语言一致性仅为12.8%,不同语言的SAFE率差异超过17%。研究通过问题级别的熵、类别偏差评分和多语言一致性指数量化了这些安全问题,突显了多语言LLM在安全性泛化方面的关键差距。
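Two of the quantities named above, cross-language agreement and prompt-level entropy, can be sketched under assumed definitions. The abstract does not define them, so this example takes agreement to be the fraction of prompts that receive the same safety label in every language, and entropy to be the Shannon entropy of one prompt's labels across languages; the label names and data are invented.

```python
import math
from collections import Counter
from typing import List, Sequence


def prompt_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (bits) of one prompt's labels across languages."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def cross_language_agreement(label_matrix: List[Sequence[str]]) -> float:
    """Fraction of prompts whose label is identical in every language."""
    unanimous = sum(1 for labels in label_matrix if len(set(labels)) == 1)
    return unanimous / len(label_matrix)


# Invented labels: one row per prompt, one column per language.
label_matrix = [
    ["SAFE", "SAFE", "SAFE"],
    ["SAFE", "UNSAFE", "SAFE"],
    ["UNSAFE", "UNSAFE", "UNSAFE"],
    ["SAFE", "SAFE", "REFUSE"],
]
print(cross_language_agreement(label_matrix))
print([round(prompt_entropy(row), 3) for row in label_matrix])
```

An entropy of zero marks a prompt handled identically in every language, while higher values flag the prompts where safety behaviour drifts with the language.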
Event-Driven Video Generation
Authors: Chika Maduabuchi
First: 2026-03-12T00:16:56+00:00 · Latest: 2026-03-18T16:49:27+00:00
Abstract
State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
中文标题/摘要
标题:事件驱动的视频生成
当前最先进的文本到视频模型在每一帧上看起来都很逼真,但在简单的互动方面却失败了:动作开始于接触之前,行为没有实现,物体放置后漂移,支撑关系破裂。我们认为这源于帧优先去噪,它在每一步都会无处不在地更新潜在状态,而没有明确的互动何时何地活跃的概念。我们引入了事件驱动的视频生成(EVD),这是一种与DiT兼容的最小化框架,使采样基于事件:轻量级的事件头预测标记对齐的事件活动,事件地基损失将活动与训练中的状态变化耦合,事件门控采样(具有滞回和早期步骤调度)抑制了虚假更新,同时在互动期间集中更新。在EVD-Bench上,EVD在人类偏好和VBench动力学方面始终如一地提高了性能,显著减少了状态持久性、空间准确性、支撑关系和接触稳定性方面的失败模式,而不会牺牲外观。这些结果表明,显式的事件地基是一个实用的抽象,可以减少视频生成中的互动幻觉。
Summary / 总结
The research aims to address the limitations of state-of-the-art text-to-video models, which often fail to handle simple interactions such as motion and support relations. The authors introduce Event-Driven Video Generation (EVD), a framework that uses event prediction and gating to focus updates during interactions, thereby improving dynamics and reducing failure modes. Experiments on EVD-Bench show consistent improvements in human preference and VBench dynamics, with better handling of state persistence, spatial accuracy, support relations, and contact stability compared to existing methods.
研究旨在解决当前最先进的文本到视频模型在处理简单交互(如运动和支持关系)方面的局限性。作者提出了事件驱动的视频生成(EVD)框架,该框架通过事件预测和门控来集中处理交互期间的更新,从而提高动态性能并减少错误。实验表明,EVD在EVD-Bench上的一致改进包括人类偏好和VBench动态的提升,以及在状态持久性、空间准确性、支持关系和接触稳定性方面的表现优于现有方法。
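The abstract above mentions event-gated sampling with hysteresis. As a rough illustration of what hysteresis gating generally means (not the authors' implementation), the sketch below turns a per-token gate on when predicted activity rises above a high threshold and off only when it drops below a lower one, so the gate does not flicker around a single cutoff. The function name `hysteresis_gate`, the threshold values, and the `(steps, tokens)` array layout are all assumptions made for this example.

```python
import numpy as np

def hysteresis_gate(activity, on_threshold=0.6, off_threshold=0.4):
    """Boolean gate per (step, token) with hysteresis.

    Hypothetical helper for illustration only: `activity` is assumed to be
    an array of shape (steps, tokens) holding event-activity scores in [0, 1].
    """
    activity = np.asarray(activity, dtype=float)
    gate = np.zeros(activity.shape, dtype=bool)
    # One on/off state per token, carried from step to step.
    state = np.zeros(activity.shape[1:], dtype=bool)
    for t in range(activity.shape[0]):
        state = np.where(
            activity[t] >= on_threshold, True,           # strong activity: switch on
            np.where(activity[t] <= off_threshold, False,  # weak activity: switch off
                     state),                             # in between: keep previous state
        )
        gate[t] = state
    return gate

# One token over five steps: the 0.5 values fall between the two thresholds,
# so the gate simply keeps whatever state it had at the previous step.
scores = np.array([[0.1], [0.7], [0.5], [0.3], [0.5]])
print(hysteresis_gate(scores)[:, 0].tolist())
```

A gate like this could then select which tokens receive an update at each step, for example with `np.where(gate[t], updated, previous)`; how EVD actually applies its gate is not specified in the abstract.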
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Authors: Marcos A. Ferreira, Tianao Li, John Mamish, Josiah Hester, Yaman Sangar, Qi Guo, Emma Alexander
Venue: CVPR
First: 2026-03-18T16:48:41+00:00 · Latest: 2026-03-18T16:48:41+00:00
Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Abstract
We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.
中文标题/摘要
标题:SpiderCam:低功耗差焦距瞬时深度
我们介绍了SpiderCam,这是一种基于FPGA的瞬时深度从差焦距相机,能够在52厘米的工作范围内以32.5 FPS的速率实时生成480x400的稀疏深度图,总功耗为624 mW。SpiderCam由一个定制的相机组成,该相机同时捕捉同一场景的两个不同聚焦图像,并使用低功耗FPGA上的SystemVerilog实现差焦距深度计算(DfDD)。为了实现最先进的功耗,我们提出了对DfDD的算法改进,以克服低功耗传感器带来的挑战,并设计了一个内存本地实现,用于在无法存储单个图像对的设备上进行流式深度计算。我们报告了文献中首个低于瓦特级的被动FPGA基3D相机总功耗测量。
Summary / 总结
SpiderCam is an FPGA-based snapshot depth-from-defocus camera that generates 480x400 sparse depth maps at 32.5 FPS with a working range of 52 cm, consuming 624 mW of power. It uses a custom camera to capture two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus on a low-power FPGA. The camera achieves state-of-the-art power consumption by overcoming challenges with low-power sensors and implementing a memory-local design for streaming depth computation on a small device. This is the first sub-Watt total power measurement for passive FPGA-based 3D cameras reported in the literature.
SpiderCam 是一种基于 FPGA 的 snapshot 深度从差焦距相机,能够以 32.5 FPS 的速度生成 480x400 的稀疏深度图,工作范围为 52 cm,总功耗为 624 mW。它使用一个定制的相机同时捕捉同一场景的两个不同聚焦的图像,并通过低功耗 FPGA 上的 SystemVerilog 实现深度从差焦距算法进行处理。该相机通过克服低功耗传感器的挑战并实现内存本地设计以在小型设备上进行流式深度计算,实现了亚瓦特级的总功耗。这是文献中首次报告的被动 FPGA 基础 3D 相机的亚瓦特级总功耗测量。
Generative Refocusing: Flexible Defocus Control from a Single Image
Authors: Chun-Wei Tuan Mu, Cheng-De Fan, Jia-Bin Huang, Yu-Lun Liu
First: 2025-12-18T18:59:59+00:00 · Latest: 2026-03-18T16:44:47+00:00
Comments: Project website: https://generative-refocusing.github.io/
Abstract
Depth-of-field control is essential in photography, but achieving perfect focus often requires multiple attempts or specialized equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They require all-in-focus inputs, rely on synthetic data from simulators, and have limited control over the aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from diverse inputs and BokehNet to create controllable bokeh. This method combines synthetic and real bokeh images to achieve precise control while preserving authentic optical characteristics. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows custom aperture shapes. Project page: https://generative-refocusing.github.io/
中文标题/摘要
标题:生成性重新聚焦:从单张图像实现灵活的景深控制
景深控制是摄影中的重要技术,但要实现完美的对焦通常需要多次尝试或专门的设备。单张图像重新聚焦仍然很困难,涉及恢复清晰内容和创建逼真的散景。当前的方法存在显著的局限性,需要全聚焦输入,依赖于模拟器生成的合成数据,并且对光圈控制有限。我们提出了生成性重新聚焦,这是一种两步过程,使用DeblurNet从多种输入中恢复全聚焦图像,使用BokehNet创建可控的散景。该方法结合了合成和真实散景图像,实现了精确控制同时保留真实的光学特性。我们的实验表明,我们在景深去模糊、散景合成和重新聚焦基准测试中取得了最佳性能。此外,我们的生成性重新聚焦允许自定义光圈形状。项目页面:https://generative-refocusing.github.io/
Summary / 总结
The research aims to improve single-image refocusing in photography by recovering sharp content and creating realistic bokeh. The method uses DeblurNet for all-in-focus image recovery and BokehNet for controllable bokeh synthesis. Experiments demonstrate superior performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks, with the added capability of custom aperture shapes.
研究旨在通过恢复清晰内容和生成逼真的景深效果来改进单张图片的对焦。方法采用两步流程,使用DeblurNet恢复全焦点图像,BokehNet生成可控的景深效果。实验结果显示在景深去模糊、景深合成和对焦基准测试中表现优异,并支持自定义光圈形状。
Decadal sink-source shifts of forest aboveground carbon since 1988
Authors: Zhen Qian, Sebastian Bathiany, Teng Liu, Lana L. Blaschke, Hoong Chen Teo, Niklas Boers
First: 2025-06-13T15:29:10+00:00 · Latest: 2026-03-18T16:30:17+00:00
Abstract
Forest ecosystems are vital to the global carbon cycle, yet their long-term aboveground carbon (AGC) dynamics remain uncertain. Here, we integrate multi-source satellite observations with probabilistic deep learning models to reconstruct a harmonized, uncertainty-aware global forest AGC record from 1988 to 2021 at 0.25-deg. We find that, although global forests sequestered 6.2 PgC, moist tropical and boreal forests have progressively transitioned toward carbon sources since the early 2000s. This shift coincides with a strengthening negative correlation between tropical AGC variability and atmospheric CO2 growth rates (r = -0.63 in 2011-2021), suggesting tropical forests increasingly modulate the global carbon cycle. Notably, in the Brazilian Amazon, the contribution of intact forests to the year-to-year variations in AGC losses increased from 33% in the 1990s to 76% in the 2010s, surpassing that of deforested areas (from 60% to 13%). Our findings highlight the vulnerability of carbon stocks in key biomes and provide a benchmark to track emerging sink-source shifts under anthropogenic climate change.
中文标题/摘要
标题:自1988年以来森林地上碳汇源动态
森林生态系统对全球碳循环至关重要,但其长期地上碳(AGC)动态仍不明确。我们通过整合多源卫星观测数据与概率深度学习模型,重建了从1988年到2021年0.25度分辨率的全球森林AGC记录。研究发现,尽管全球森林吸收了6.2 PgC,但自2000年代初以来,湿润热带和寒带森林逐渐转变为碳源。这一转变与热带AGC变异性与大气CO2增长率之间的负相关性增强(2011-2021年r = -0.63)相吻合,表明热带森林越来越多地调节全球碳循环。值得注意的是,在巴西亚马逊地区,完整森林对AGC损失年际变化的贡献从1990年代的33%增加到2010年代的76%,超过了被砍伐地区(从60%降至13%)。我们的研究结果突显了关键生物群落中碳储量的脆弱性,并为在人为气候变化下跟踪出现的汇源动态变化提供了基准。
Summary / 总结
This study reconstructs a harmonized, uncertainty-aware global record of forest aboveground carbon (AGC) for 1988 to 2021 at 0.25-degree resolution by integrating multi-source satellite observations with probabilistic deep learning models. Although global forests sequestered 6.2 PgC, moist tropical and boreal forests have progressively shifted toward carbon sources since the early 2000s. The shift coincides with a strengthening negative correlation between tropical AGC variability and atmospheric CO2 growth rates (r = -0.63 in 2011-2021), suggesting that tropical forests increasingly modulate the global carbon cycle. In the Brazilian Amazon, the share of year-to-year variation in AGC losses attributable to intact forests rose from 33% in the 1990s to 76% in the 2010s, overtaking that of deforested areas (60% to 13%), which highlights the vulnerability of carbon stocks in key biomes.
本研究旨在阐明森林地上碳(AGC)的长期动态,以更好地理解全球碳循环。通过整合卫星观测和概率深度学习模型,研究人员重建了从1988年到2021年的全球AGC记录。关键发现包括自2000年代初以来,湿润热带和 boreal 森林从碳汇转变为碳源,热带森林越来越多地影响全球碳循环。值得注意的是,巴西亚马逊的原始森林在2010年代成为AGC损失的主要驱动因素,其贡献超过了被砍伐地区的贡献。
A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models
Authors: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard
First: 2026-03-18T16:26:41+00:00 · Latest: 2026-03-18T16:26:41+00:00
Abstract
Understanding when learning is statistically possible yet computationally hard is a central challenge in high-dimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE) - a simple quantity determined by the activation function - governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Finally, in hierarchical multi-index models, we show that the NSE governs the optimal computational rate in which different directions are sequentially learned. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.
中文标题/摘要
标题:噪声敏感性指数控制单指数和多指数模型中的统计计算差距
理解何时学习在统计上可行但在计算上困难是高维统计中的一个核心挑战。在本文中,我们探讨了这一问题在单指数和多指数模型中的情况,这些模型作为基准被广泛研究,以探究机器学习方法在高维数据中发现特征的能力。我们的主要贡献是展示了噪声敏感性指数(NSE)——由激活函数决定的一个简单量——在这些模型广泛范围内控制着统计计算差距的存在和大小。我们首先证明,在具有大量附加噪声的单指数模型中,计算瓶颈的出现完全由NSE决定。然后我们展示了在大型可分离多指数模型的专业化过渡中,相同的指数控制着统计计算差距,其中各个组件变得可学习。最后,在分层多指数模型中,我们展示了NSE控制着不同方向依次学习的最优计算速率。综上所述,我们的结果将NSE识别为连接噪声鲁棒性、计算难度和特征专业化的一个统一属性。
Summary / 总结
This study investigates when learning is statistically possible yet computationally hard in single- and multi-index models. The authors show that a Noise Sensitivity Exponent (NSE), a simple quantity determined by the activation function, governs the existence and magnitude of statistical-to-computational gaps across a broad regime of these models. In single-index models with large additive noise, the NSE fully characterizes the onset of the computational bottleneck. In large separable multi-index models, it controls the statistical-computational gap at the specialization transition, where individual components become learnable. In hierarchical multi-index models, it governs the optimal computational rate at which different directions are sequentially learned.
该研究探讨了单指数模型和多指数模型中统计上可能但计算上具有挑战性的条件。作者引入了噪声敏感性指数(NSE)作为关键因素,决定了统计到计算差距的存在和大小。他们表明,在具有大量噪声的单指数模型中,NSE 表征了计算瓶颈的出现。在多指数模型中,NSE 控制了特征专业化过渡和不同方向的学习最优计算速率,从而将噪声鲁棒性、计算难度和特征专业化联系起来。
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Authors: Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong
Venue: CVPR 2026
First: 2025-07-21T18:12:22+00:00 · Latest: 2026-03-18T16:22:49+00:00
Comments: Accepted to CVPR 2026
Abstract
Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that the locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries, which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate a Structural Guidance Depth Modulator (SGDM), which uses a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves SOTA performance on the nuScenes validation dataset, with mAP and NDS reaching 71.5% and 73.6%, respectively. Additionally, on the Argoverse 2 validation set, we achieve a competitive mAP of 41.7%.
中文标题/摘要
标题:未融合之前先观察:基于2D引导的跨模态对齐以实现稳健的3D检测
将LiDAR和相机输入整合到统一的鸟瞰图(BEV)表示中,对于增强自动驾驶车辆的3D感知能力至关重要。然而,现有方法在LiDAR和相机特征之间存在空间错位,导致相机分支中的不准确深度监督和跨模态特征聚合中的错误融合。这种错位的根本原因在于投影误差,这些误差源于校准不准确和滚动快门效应。本文的关键洞察是,这些投影误差的位置并非随机,而是高度可预测的,因为它们集中在2D检测器可以可靠识别的对象背景边界上。基于此,我们的主要动机是利用2D对象先验在融合前对齐跨模态特征。为解决局部错位,我们提出了先验引导深度校准(PGDC),利用2D先验来缓解错位并保留正确的跨模态特征对。为解决全局错位,我们引入了断续性感知几何融合(DAGF),以抑制PGDC中的残余噪声,并明确增强对象背景边界处的锐利深度过渡,从而生成结构感知表示。为了有效利用这些对齐的表示,我们引入了结构引导深度调制器(SGDM),使用门控注意力机制高效融合对齐的深度和图像特征。我们的方法在nuScenes验证数据集上达到了SOTA性能,其mAP和NDS分别达到71.5%和73.6%。此外,在Argoverse 2验证集上,我们实现了竞争力的mAP为41.7%。
Summary / 总结
This work addresses the issue of spatial misalignment between LiDAR and camera features in cross-modal fusion for 3D detection, which affects depth supervision and fusion accuracy. The authors propose a method that uses 2D object priors to pre-align cross-modal features, introducing Prior Guided Depth Calibration (PGDC) for local misalignment and Discontinuity Aware Geometric Fusion (DAGF) for global misalignment. They also incorporate Structural Guidance Depth Modulator (SGDM) to fuse aligned depth and image features. The method achieves state-of-the-art performance on the nuScenes validation dataset with mAP and NDS reaching 71.5% and 73.6%, respectively, and a competitive mAP of 41.7% on the Argoverse 2 validation set.
该研究解决了自动驾驶车辆3D感知系统中LiDAR和相机特征之间空间错位的问题。作者提出了一种名为Look Before You Fuse的方法,利用2D物体先验信息在融合前对跨模态特征进行预对齐。他们引入了Prior Guided Depth Calibration (PGDC)来解决局部错位问题,以及Discontinuity Aware Geometric Fusion (DAGF)来处理全局错位问题。此外,他们还引入了Structural Guidance Depth Modulator (SGDM)来高效融合对齐的深度和图像特征。该方法在nuScenes验证集上达到了最先进的性能,mAP和NDS分别达到71.5%和73.6%,在Argoverse 2验证集上也取得了竞争力的mAP为41.7%。
Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
Authors: Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu
First: 2026-03-18T16:13:48+00:00 · Latest: 2026-03-18T16:13:48+00:00
Abstract
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: https://chen-yingjie.github.io/projects/Identity-as-Presence
中文标题/摘要
标题:身份即存在:面向个性化联合音视频生成的外观与声音
近期进展展示了在生成视频中合成真实个体的强大能力,反映了对身份感知内容创作日益增长的需求。然而,一个能够对多种身份的面部外观和声音音调进行精细控制的开放框架仍然不可用。在本文中,我们提出了一种统一且可扩展的身份感知联合音视频生成框架,能够实现高保真度和一致的个性化。具体而言,我们引入了一种数据整理管道,自动从跨音频和视觉模态的配对注释中提取身份相关信息,涵盖了从单人到多人互动的各种场景。我们进一步提出了一种灵活且可扩展的身份注入机制,适用于单人和多人场景,其中面部外观和声音音调均作为身份相关信息的控制信号。此外,为了解决模态差异问题,我们设计了一种多阶段训练策略,以加速收敛并确保跨模态一致性。实验结果表明了所提出框架的优势。更多细节和定性结果,请参阅我们的网页:https://chen-yingjie.github.io/projects/Identity-as-Presence