WorldCache: Content-Aware Caching for Accelerated Video World Models
Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan
First: 2026-03-23T17:59:54+00:00 · Latest: 2026-03-23T17:59:54+00:00
Comments: 33 Pages
Abstract
Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3x inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
中文标题/摘要
标题:WorldCache:内容感知缓存加速视频世界模型
扩散变换器(DiTs)驱动高保真视频世界模型,但由于顺序去噪和昂贵的空间-时间注意力,计算成本仍然很高。无训练特征缓存通过在去噪步骤中重用中间激活来加速推理,但现有方法大多依赖于零阶保持假设,即在全局漂移较小时将缓存特征作为静态快照重用。这通常会导致动态场景中的鬼影伪影、模糊和运动不一致。我们提出了WorldCache,一种感知约束动力学缓存框架,以改进何时以及如何重用特征。WorldCache 引入了运动自适应阈值、显著性加权漂移估计、通过混合和扭曲进行的最优近似以及跨扩散步骤的阶段感知阈值调度。我们的一体化方法无需重新训练即可实现自适应、运动一致的特征重用。在使用PAI-Bench评估的Cosmos-Predict2.5-2B上,WorldCache 达到了2.3倍的推理加速,同时保持了99.4%的基线质量,显著优于先前的无训练缓存方法。我们的代码可以在 https://umair1221.github.io/World-Cache/ 访问。
Summary / 总结
WorldCache is a Perception-Constrained Dynamical Caching framework that accelerates inference for DiT-based video world models by adaptively reusing intermediate activations across denoising steps. It combines motion-adaptive thresholds, saliency-weighted drift estimation, blending- and warping-based approximation, and phase-aware threshold scheduling to avoid the ghosting artifacts and motion inconsistencies of static cache reuse. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3x inference speedup while preserving 99.4% of baseline quality, outperforming previous training-free caching methods.
WorldCache 提出了一种感知约束的动力学缓存框架,以改进视频世界模型中扩散变换器中的特征重用,解决了鬼影伪影和运动不一致的问题。它引入了运动自适应阈值、显著性加权漂移估计和相位感知阈值调度。在 PAI-Bench 上,WorldCache 实现了 2.3 倍的推理加速,同时保持了 99.4% 的基线质量,显著优于之前的无训练缓存方法。
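The "when to reuse" decision described above can be illustrated with a minimal sketch. This is not the WorldCache implementation: the drift metric, the threshold rule, and the names `weighted_drift`, `should_reuse`, and `motion_gain` are assumptions made up for illustration. It only shows how weighting drift by saliency and tightening the threshold under motion can flip a reuse decision that a plain global-drift test would accept.

```python
def weighted_drift(current, cached, saliency):
    """Saliency-weighted relative L1 drift between probe features (hypothetical metric)."""
    num = sum(s * abs(c - k) for c, k, s in zip(current, cached, saliency))
    den = sum(s * abs(k) for k, s in zip(cached, saliency)) + 1e-8
    return num / den


def should_reuse(drift, base_threshold, motion, motion_gain=1.0):
    """Shrink the reuse threshold as motion grows (assumed rule, not taken from the paper)."""
    return drift < base_threshold / (1.0 + motion_gain * motion)


cached = [1.0, 2.0, 3.0, 4.0]
current = [1.0, 2.0, 3.0, 4.4]        # only the last feature changed
uniform = [1.0, 1.0, 1.0, 1.0]        # every feature counts equally
salient_last = [0.1, 0.1, 0.1, 3.0]   # the changed feature is perceptually important

drift_uniform = weighted_drift(current, cached, uniform)
drift_salient = weighted_drift(current, cached, salient_last)

reuse_static = should_reuse(drift_uniform, 0.05, motion=0.0)   # small global drift
reuse_salient = should_reuse(drift_salient, 0.05, motion=0.0)  # same change, salient region
reuse_moving = should_reuse(drift_uniform, 0.05, motion=1.0)   # same drift, fast motion
print(reuse_static, reuse_salient, reuse_moving)
```

In this toy setup only the first case reuses the cache; the same feature change triggers recomputation once it falls in a salient region or the scene is moving quickly.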
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
First: 2026-03-23T17:59:51+00:00 · Latest: 2026-03-23T17:59:51+00:00
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
中文标题/摘要
标题:VideoDetective:通过外部查询和内在相关性进行长视频线索搜索
由于上下文窗口有限,多模态大型语言模型(MLLMs)在理解长视频方面仍然具有挑战性,因此需要识别稀疏的查询相关视频片段。然而,现有方法主要基于查询来定位线索,忽视了视频的内在结构以及片段之间的变化相关性。为了解决这个问题,我们提出了一种VideoDetective框架,该框架结合了查询到片段的相关性和片段之间的亲和力,以有效地在长视频问答中进行线索搜索。具体来说,我们将视频划分为多个片段,并通过视觉相似性和时间临近性构建视觉-时间亲和力图来表示它们。然后,我们执行假设-验证-精炼循环来估计观察到的片段与查询的相关性得分,并将其传播到未观察到的片段,从而生成一个全局相关性分布,该分布指导最终回答所需的最关键片段的定位,基于稀疏观察。实验表明,我们的方法在主流MLLMs上的一系列代表性基准测试中实现了显著的性能提升,在VideoMME-long上的准确率提高了高达7.5%。我们的代码可在https://videodetective.github.io/获取
Summary / 总结
The research aims to improve long video understanding for multimodal large language models (MLLMs), whose limited context windows require locating sparse query-relevant segments. VideoDetective integrates query-to-segment relevance and inter-segment affinity to identify critical video segments. The method divides a video into segments, constructs a visual-temporal affinity graph, and uses a Hypothesis-Verification-Refinement loop to estimate relevance scores for observed segments, which are then propagated to unseen segments to guide the localization of key segments. Experiments show consistent gains across a range of mainstream MLLMs, with accuracy improvements of up to 7.5% on VideoMME-long.
研究旨在通过解决现有方法仅关注查询相关性的问题,提高长视频理解。VideoDetective框架结合了查询到片段的相关性和片段间的关联性,有效识别相关视频片段。它将视频划分为片段并构建视觉-时间关联图,然后使用假设-验证-精炼循环来估计相关性分数并传播,从而得到全局相关性分布。实验表明,VideoDetective在VideoMME-long基准测试中表现出色,最高可提高7.5%的准确性。
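The propagation step described above, where scores from observed segments spread to unseen ones over an affinity graph, can be sketched as follows. This is an illustrative simplification, not the VideoDetective algorithm: the affinity formula (cosine similarity damped by temporal distance), the decay constant `tau`, and the single-pass weighted average are all assumptions.

```python
import math


def affinity(i, j, feats, tau=2.0):
    """Cosine similarity of segment features, damped by temporal distance (assumed form)."""
    dot = sum(a * b for a, b in zip(feats[i], feats[j]))
    ni = math.sqrt(sum(a * a for a in feats[i]))
    nj = math.sqrt(sum(b * b for b in feats[j]))
    return max(dot / (ni * nj), 0.0) * math.exp(-abs(i - j) / tau)


def propagate(feats, observed):
    """Score unseen segments as the affinity-weighted mean of observed segment scores."""
    scores = {}
    for i in range(len(feats)):
        if i in observed:
            scores[i] = observed[i]
            continue
        weights = {j: affinity(i, j, feats) for j in observed}
        total = sum(weights.values())
        scores[i] = (
            sum(w * observed[j] for j, w in weights.items()) / total if total > 0 else 0.0
        )
    return scores


# Four segments with toy 2-D visual features; segments 0 and 2 were inspected.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
observed = {0: 0.9, 2: 0.1}   # hypothetical relevance scores for the query
scores = propagate(feats, observed)
print(scores)
```

Segment 1 resembles the relevant segment 0 and inherits a high score, while segment 3 resembles the irrelevant segment 2 and stays low, so a sparse set of observations still yields a relevance estimate for every segment.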
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
First: 2026-03-23T17:59:42+00:00 · Latest: 2026-03-23T17:59:42+00:00
Comments: 10 pages, 5 figures
Abstract
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
Summary / 总结
The research aims to enhance latent world models by adding guidance from a vision-language model (VLM) for long-horizon semantics. The method employs a dual-temporal pathway, combining a dense JEPA branch for fine-grained motion and interaction cues with a uniformly sampled VLM "thinker" branch that uses a larger temporal stride for semantic guidance. On hand-manipulation trajectory prediction, this approach outperforms both a VLM-only baseline and a JEPA-predictor baseline and yields more robust long-horizon rollouts.
研究旨在通过整合视觉-语言模型(VLM)来提升潜在世界模型的预测能力。方法结合了密集的JEPA分支以捕捉细粒度的运动和交互线索,以及均匀采样的VLM“思考者”分支以提供丰富的语义指导。实验表明,该方法在手部操作轨迹预测上优于仅使用VLM的基线和JEPA预测器基线,并具有更稳健的长时程推演表现。
DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li
First: 2026-03-23T17:59:25+00:00 · Latest: 2026-03-23T17:59:25+00:00
Abstract
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.
中文标题/摘要
标题:DualCoT-VLA:通过并行推理实现视觉-语言思维链的视觉-语言-动作模型
视觉-语言-动作(VLA)模型直接将视觉观察和语言指令映射到机器人动作。虽然对于简单任务有效,但标准VLA模型往往难以应对需要逻辑规划的复杂多步骤任务,以及需要精细空间感知的精确操作。最近的工作将思维链(CoT)推理引入VLA模型,赋予其“先思考后行动”的能力。然而,当前基于CoT的VLA模型面临两个关键限制:1)由于依赖孤立的单模态CoT,无法同时捕捉低级视觉细节和高级逻辑规划;2)逐步自回归解码导致推理延迟高且错误不断累积。为了解决这些限制,我们提出了DualCoT-VLA,一种面向VLA模型、具有并行推理机制的视觉-语言CoT方法。为了实现全面的多模态推理,我们的方法结合了用于低级空间理解的视觉CoT和用于高级任务规划的语言CoT。此外,为了克服延迟瓶颈,我们引入了一种并行CoT机制,其中包含两组可学习查询标记,将自回归推理转变为单步前向推理。大量实验表明,我们的DualCoT-VLA在LIBERO和RoboCasa GR1基准测试以及真实世界平台中均实现了最先进的性能。
Summary / 总结
The research aims to enhance Vision-Language-Action (VLA) models by addressing their limitations in handling complex, multi-step tasks and precise manipulations. The proposed DualCoT-VLA method introduces a parallel reasoning mechanism that combines visual and linguistic Chain-of-Thought (CoT) for comprehensive multi-modal reasoning and reduces inference latency through single-step forward reasoning. Experimental results show that DualCoT-VLA outperforms existing models on the LIBERO and RoboCasa GR1 benchmarks and in real-world platforms.
研究旨在通过解决VLA模型在处理复杂任务和精细操作方面的局限性,提升其性能。提出的DualCoT-VLA方法引入了一种并行推理机制,结合视觉和语言CoT实现全面的多模态推理。该方法减少了推理延迟,并在LIBERO和RoboCasa GR1基准测试中取得了最佳性能。
3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan
First: 2026-03-23T17:59:14+00:00 · Latest: 2026-03-23T17:59:14+00:00
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
中文标题/摘要
标题:3D-布局-R1:基于语言指导的空间编辑结构化推理
大型语言模型(LLMs)和视觉语言模型(VLMs)展示了令人印象深刻的推理能力,但在执行精细视觉编辑时,它们在空间理解和布局一致性方面存在困难。我们提出了一种结构化推理框架,通过场景图推理进行基于文本的空间布局编辑。给定输入的场景图和自然语言指令,模型在图上进行推理以生成满足文本条件并保持空间连贯性的更新场景图。通过明确地通过结构化关系表示引导推理过程,我们的方法提高了空间关系的可解释性和控制力。我们在一个包含排序、空间对齐和房间编辑任务的新文本指导布局编辑基准上评估了我们的方法。与链式思维微调(CoT-SFT)和vanilla GRPO基线相比,我们的训练范式在IoU上平均提高了15%,中心距离误差减少了25%。与最先进的零样本LLMs相比,我们的最佳模型在mIoU上提高了高达20%,显示出显著提高的空间精度。
The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham
First: 2026-03-23T17:58:02+00:00 · Latest: 2026-03-23T17:58:02+00:00
Comments: 26 pages, 35 figures
Abstract
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
中文标题/摘要
标题:视觉语言模型中空间推理的双重机制
许多多模态任务,如图像字幕和视觉问答,要求视觉语言模型(VLMs)将物体与其属性和空间关系联系起来。然而,尚不清楚这些联系在VLMs中的何处和如何进行计算。在本研究中,我们展示了VLMs依赖于两种并发机制来表示这些联系。在语言模型骨干中,中间层在视觉标记(对应于物体)之上表示内容无关的空间关系。然而,这种机制在塑造模型预测方面仅起次要作用。相反,空间信息的主要来源在于视觉编码器,其表示编码了物体的布局,并直接被语言模型骨干利用。值得注意的是,这种空间信号在视觉标记中是全局分布的,延伸到物体区域之外的背景区域。我们展示了在所有图像标记中增强这些视觉衍生的空间表示可以提高自然图像的空间推理性能。综上所述,我们的结果阐明了空间联系在VLMs中的计算方式,并突显了视觉编码器在实现空间推理中的核心作用。
Summary / 总结
This study investigates where and how vision-language models (VLMs) compute associations between objects and their spatial relations in tasks like image captioning and visual question answering. It finds two concurrent mechanisms: intermediate layers of the language model backbone represent content-independent spatial relations, and the vision encoder encodes the layout of objects. The vision encoder's signal, which is distributed globally across visual tokens and extends into background regions, is the dominant source of spatial information, while the backbone mechanism plays only a secondary role in predictions. Enhancing these vision-derived representations across all image tokens improves spatial reasoning on naturalistic images. The work clarifies how spatial association is computed within VLMs and highlights the central role of the vision encoder.
该研究探讨了视觉语言模型(VLMs)在图像字幕和视觉问答等任务中处理空间推理的机制。研究发现,VLMs 使用两种并发机制:一种是语言模型主干中间层表示的内容无关的空间关系,另一种是视觉编码器中编码的物体布局。视觉编码器的空间信息全局分布在各视觉标记中,是空间信息的主要来源,对模型预测的影响更大。在所有图像标记上增强这些空间表示可以提高自然图像上的空间推理性能。研究阐明了视觉编码器在VLMs空间推理中的核心作用。
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Authors: Alexandra Zelenin, Alexandra Zhuravlyova
First: 2026-03-23T17:57:24+00:00 · Latest: 2026-03-23T17:57:24+00:00
Comments: 30 pages, 15 figures, 15 tables, including appendices. Code and data at https://github.com/sockeye44/dorafactors
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved.
We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice.
Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
中文标题/摘要
标题:DoRA的扩展:通过分解范数和融合核函数实现高秩适应
权重分解低秩适应(DoRA)通过将权重幅度与方向解耦来扩展LoRA,但其前向传播需要计算W + sBA的行范数,每个我们调查的主要框架都通过计算密集的[d_out, d_in]乘积BA来实现这一计算。在d_in = 8192和秩r = 384的情况下,单个模块的范数需要大约512 MB的临时工作内存(bf16),这使得高秩DoRA在涉及数百个已适应模块和检查点时变得昂贵且往往不可行。
我们提出了两个系统贡献。分解范数将平方范数分解为基、交叉和格朗项,这些项可以通过O(d_out r + r^2)中间量计算,从而消除密集乘积。融合Triton内核将四核DoRA组合简化为单次通过,减少了约4倍的内存流量,并使用了在幅度缩放集中在实际操作中的近一缩放区间内避免灾难性消减的数值稳定形式。
在三种NVIDIA GPU(RTX 6000 PRO、H200、B200)上,对六个8-32B视觉-语言模型(VLMs)在r = 384、bf16条件下进行测试,融合实现的推理速度比Hugging Face PEFT的DoRA实现快1.5-2.0倍,梯度计算(不含优化器步骤)快1.5-1.9倍,峰值VRAM最多降低7 GB。在跨越四代架构的六种GPU(L40S、A100、RTX 6000 PRO、H200、B200、B300)上进行的微基准测试确认组合内核加速1.5-2.7倍。所有模型/GPU组合的最终logit余弦相似度均超过0.9999,多种子训练曲线在2000步内的每步损失差值均值不超过7.1 x 10^-4。
Summary / 总结
This paper addresses the memory cost of high-rank Weight-Decomposed Low-Rank Adaptation (DoRA) by introducing a factored norm and fused Triton kernels. The factored norm decomposes the squared row-wise norm into base, cross, and Gram terms computed through O(d_out r + r^2) intermediates instead of the dense [d_out, d_in] product BA, which makes high-rank DoRA more practical on common single-GPU setups. The fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form. Across six 8-32B vision-language models on three NVIDIA GPUs, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation, with up to 7 GB lower peak VRAM.
本文针对高秩Weight-Decomposed Low-Rank Adaptation (DoRA)的内存开销问题,引入了分解范数和融合Triton内核。分解范数将行范数的平方分解为基项、交叉项和格拉姆(Gram)项,只需O(d_out r + r^2)的中间量,无需计算密集的[d_out, d_in]乘积BA,使得高秩DoRA在常见单GPU设置上更加可行。融合Triton内核将四个内核的DoRA组合合并为单次计算,内存流量减少约4倍,并采用数值稳定的形式。在三种NVIDIA GPU上对六个8-32B视觉-语言模型的实验显示,与Hugging Face PEFT的DoRA实现相比,融合实现的推理速度快1.5-2.0倍,梯度计算快1.5-1.9倍,峰值VRAM最多降低7 GB。
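The base, cross, and Gram decomposition can be checked numerically with a small sketch. It assumes the usual LoRA shapes (W is [d_out, d_in], B is [d_out, r], A is [r, d_in]) and expands each row as ||w_i + s b_i A||^2 = ||w_i||^2 + 2s b_i . (W A^T)_i + s^2 b_i (A A^T) b_i^T. The helper names are hypothetical, the code uses plain Python lists rather than the paper's Triton kernels, and it does not reproduce the numerically stable form mentioned in the abstract.

```python
import math
import random


def matmul(X, Y):
    """Naive matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def dense_row_norms(W, B, A, s):
    """Reference path: materializes the dense [d_out, d_in] product BA."""
    BA = matmul(B, A)
    return [
        math.sqrt(sum((w + s * d) ** 2 for w, d in zip(w_row, d_row)))
        for w_row, d_row in zip(W, BA)
    ]


def factored_row_norms(W, B, A, s):
    """Factored path: only [d_out, r] and [r, r] intermediates are formed."""
    At = transpose(A)      # [d_in, r]
    WAt = matmul(W, At)    # [d_out, r], feeds the cross term
    G = matmul(A, At)      # [r, r] Gram matrix A A^T
    BG = matmul(B, G)      # [d_out, r]
    norms = []
    for w, b, wa, bg in zip(W, B, WAt, BG):
        base = sum(x * x for x in w)
        cross = 2.0 * s * sum(x * y for x, y in zip(b, wa))
        gram = s * s * sum(x * y for x, y in zip(b, bg))
        # Clamp guards against tiny negative values from rounding.
        norms.append(math.sqrt(max(base + cross + gram, 0.0)))
    return norms


rng = random.Random(0)
d_out, d_in, r, s = 4, 6, 2, 0.5   # toy sizes chosen for illustration
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d_out)]
A = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(r)]

dense = dense_row_norms(W, B, A, s)
factored = factored_row_norms(W, B, A, s)
max_gap = max(abs(a - b) for a, b in zip(dense, factored))
print(max_gap)
```

Both paths agree to floating-point rounding on this toy input, while the factored path never forms a [d_out, d_in] temporary beyond W itself.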
Repurposing Geometric Foundation Models for Multi-view Diffusion
Authors: Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
First: 2026-03-23T17:57:05+00:00 · Latest: 2026-03-23T17:57:05+00:00
Comments: project website: https://cvlab-kaist.github.io/GLD/
Abstract
While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
中文标题/摘要
标题:重新利用几何基础模型进行多视角扩散
虽然生成潜在空间的最新进展在单图像生成方面取得了显著进展,但用于新颖视角合成(NVS)的最佳潜在空间仍鲜有探索。特别是,NVS需要在不同视角下的一致几何生成,但现有方法通常在与视角无关的VAE潜在空间中操作。在本文中,我们提出了一种几何潜在扩散(GLD)框架,该框架重新利用几何基础模型中的几何一致特征空间作为多视角扩散的潜在空间。我们证明这些特征不仅支持高保真度的RGB重建,还编码了强大的跨视角几何对应关系,为NVS提供了合适的潜在空间。我们的实验表明,GLD在2D图像质量和3D一致性指标上均优于VAE和RAE,并且与VAE潜在空间相比,训练速度提高了4.4倍以上。值得注意的是,尽管GLD从零开始训练其扩散模型,没有利用大规模文本到图像的预训练,但其性能仍与利用大规模文本到图像预训练的最新方法相当。
Summary / 总结
This paper addresses the challenge of novel view synthesis (NVS) by proposing Geometric Latent Diffusion (GLD), which repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. GLD demonstrates superior performance in 2D image quality and 3D consistency metrics compared to VAE and RAE, and it accelerates training by more than 4.4 times. Despite not using large-scale text-to-image pretraining, GLD remains competitive with state-of-the-art methods.
本文提出了一种名为Geometric Latent Diffusion (GLD)的方法,通过将几何基础模型的几何一致特征空间重新用于多视角扩散的潜在空间来解决新颖视图合成(NVS)的挑战。GLD在2D图像质量和3D一致性指标上优于VAE和RAE,并且训练速度提高了超过4.4倍。尽管没有使用大规模文本到图像的预训练,GLD仍然与最先进的方法保持竞争力。
Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Authors: Zakaria Mhammedi, James Cohan
First: 2026-03-23T17:56:52+00:00 · Latest: 2026-03-23T17:56:52+00:00
Abstract
The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
中文标题/摘要
标题:解耦探索与策略优化:基于不确定性引导的树搜索方法在困难探索中的应用
发现过程需要积极的探索——即收集新的和有信息量的数据。然而,高效的自主探索仍然是一个主要未解决的问题。主流方法通过使用强化学习(RL)训练具有内在动机的代理,最大化外在奖励和内在奖励的复合目标来应对这一挑战。我们认为这种方法带来了不必要的开销:虽然策略优化对于精确执行任务是必要的,但仅为了扩展状态覆盖范围而使用这种机制可能是低效的。在本文中,我们提出了一种新的范式,明确地将探索与利用分离,并在探索阶段绕过RL。我们的方法使用了受Go-With-The-Winner算法启发的树搜索策略,并配以知识不确定性度量,系统地驱动探索。通过去除策略优化的开销,我们的方法在困难的Atari基准测试中比标准的内在动机基线更高效地探索了数量级。此外,我们证明了发现的轨迹可以使用现有的监督反向学习算法提炼成可部署的策略,在Montezuma’s Revenge、Pitfall!和Venture上取得了显著优于现有技术水平的成绩,而无需依赖领域特定知识。最后,我们展示了在高维连续动作空间中该框架的通用性,通过直接从图像观察中解决MuJoCo Adroit灵巧操作和AntMaze任务,在稀疏奖励设置下无需专家演示或离线数据集。据我们所知,这是首次实现这一目标。
Summary / 总结
This paper addresses efficient autonomous exploration by proposing a paradigm that decouples exploration from policy optimization. The method uses a tree-search strategy, inspired by the Go-With-The-Winner algorithm and guided by epistemic uncertainty, and bypasses RL during the exploration phase. It explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks, and distilling the discovered trajectories with supervised backward learning achieves state-of-the-art scores on Montezuma's Revenge, Pitfall!, and Venture without domain-specific knowledge. The framework also generalizes to high-dimensional continuous action spaces, solving MuJoCo Adroit and AntMaze tasks in a sparse-reward setting directly from image observations without expert demonstrations or offline datasets.
本文提出了一种新的范式,通过将探索与策略优化解耦来应对强化学习中高效自主探索的挑战。该方法使用由认知不确定性引导的树搜索策略,在探索阶段绕过RL。在困难的Atari基准测试中,其探索效率比标准的内在动机基线高出一个数量级。此外,发现的轨迹可以蒸馏为可部署的策略,在Montezuma's Revenge、Pitfall!和Venture上取得了最先进的得分。该框架还在高维连续动作空间中得到了验证,在稀疏奖励设置下直接从图像观察中解决了MuJoCo Adroit和AntMaze任务,无需专家演示或离线数据集。
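The idea of exploring by cloning promising search particles, with no policy being optimized, can be illustrated on a toy problem. The sketch below is not the paper's method: the environment is a hypothetical one-dimensional chain, state visit counts stand in for the epistemic uncertainty measure, and the resampling rule (keep the half of the particles in the least-visited states and duplicate them) is an assumed simplification in the spirit of Go-With-The-Winner.

```python
import random
from collections import Counter


def explore(n_particles=8, horizon=60, resample_every=5, seed=0):
    """Random-walk particles on a chain, periodically cloning the most 'uncertain' ones."""
    rng = random.Random(seed)
    visits = Counter({0: n_particles})   # visit counts; all particles start at state 0
    particles = [0] * n_particles
    for step in range(1, horizon + 1):
        for i, state in enumerate(particles):
            nxt = max(state + rng.choice((-1, 1)), 0)   # random action, wall at 0
            particles[i] = nxt
            visits[nxt] += 1
        if step % resample_every == 0:
            # Rarely visited states act as a proxy for high uncertainty.
            ranked = sorted(particles, key=lambda s: visits[s])
            winners = ranked[: n_particles // 2]
            particles = winners + winners   # winners are cloned over the other half
    return particles, visits


particles, visits = explore()
print(sorted(particles), max(visits))
```

No reward or policy appears anywhere in the loop; the search budget is simply reallocated toward the frontier, which is the separation of exploration from policy optimization that the abstract argues for.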
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Authors: Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong
Venue: CVPR 2026
First: 2026-03-23T17:56:17+00:00 · Latest: 2026-03-23T17:56:17+00:00
Comments: Accepted to CVPR 2026
Abstract
Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
中文标题/摘要
标题:DUO-VSR:双流蒸馏的一步视频超分辨率
基于扩散的视频超分辨率(VSR)最近取得了显著的保真度,但仍面临高昂的采样成本。虽然分布匹配蒸馏(DMD)可以加速扩散模型向一步生成的转变,但直接应用于VSR往往会导致训练不稳定,且监督信号退化、不充分。为了解决这些问题,我们提出DUO-VSR,这是一种基于双流蒸馏策略的三阶段框架,将分布匹配和对抗监督统一起来以实现一步VSR。首先,采用渐进引导蒸馏初始化,通过轨迹保持蒸馏来稳定后续训练。其次,双流蒸馏联合优化DMD流和真假分数特征GAN(RFS-GAN)流,后者利用来自真实和虚假分数模型的判别特征提供互补的对抗监督。最后,偏好引导细化阶段进一步使学生模型与感知质量偏好对齐。大量实验表明,DUO-VSR在视觉质量和效率方面优于之前的一步VSR方法。
Summary / 总结
DUO-VSR is a three-stage framework designed to improve the efficiency and quality of one-step video super-resolution (VSR) by addressing training instability and insufficient supervision. It uses a Dual-Stream Distillation strategy that combines distribution matching and adversarial supervision. The framework includes a Progressive Guided Distillation Initialization, a Dual-Stream Distillation stage, and a Preference-Guided Refinement stage. Experimental results show that DUO-VSR outperforms previous one-step VSR approaches in both visual quality and efficiency.
DUO-VSR 是一个三阶段框架,旨在通过解决训练不稳定性和监督不足的问题来提高单步视频超分辨率(VSR)的效率和质量。它采用了一种结合分布匹配和对抗监督的双流蒸馏策略。该框架包括渐进引导蒸馏初始化、双流蒸馏阶段和偏好引导细化阶段。实验结果表明,DUO-VSR 在视觉质量和效率方面均优于之前的单步 VSR 方法。
TiCo: Time-Controllable Training for Spoken Dialogue Models
Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
First: 2026-03-23T17:51:40+00:00 · Latest: 2026-03-23T17:51:40+00:00
Abstract
We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
中文标题/摘要
标题:TiCo:时间可控训练方法在口语对话模型中的应用
我们提出了一种名为TiCo的简单后训练方法,使口语对话模型(SDMs)能够遵循时间限制指令并生成可控时长的响应。这一能力对于语音助手和交互代理等现实世界口语语言系统来说非常有价值,因为控制响应时长可以提高交互质量。然而,尽管现有模型能够生成自然的口语响应,但它们缺乏时间意识,难以遵循与时长相关的指令(例如,“请生成一个大约持续15秒的响应”)。通过对开源和商用SDMs的实证评估,我们发现它们经常无法满足此类时间控制要求。TiCo通过口语时间标记(STM,例如<10.6 seconds>)使模型能够在生成过程中估算已说话的时长,从而解决了这一限制。这些标记有助于模型保持时间意识并调整剩余内容以满足目标时长。TiCo简单且高效:它只需要少量数据,无需额外的问题-答案对,而是依赖自我生成和强化学习。实验结果表明,TiCo在保持响应质量的同时显著提高了对时长约束的遵守程度。
Summary / 总结
TiCo is a post-training method that enables spoken dialogue models to generate responses with controllable duration, improving interaction quality in voice assistants and interactive agents. It uses Spoken Time Markers to help models estimate elapsed speaking time and adjust content accordingly. Experiments show that TiCo improves adherence to duration constraints without compromising response quality.
TiCo 是一种后训练方法,使对话模型能够生成具有可控时长的响应,从而在语音助手和交互代理中提高交互质量。通过使用语音时间标记,TiCo 帮助模型估算已用时间并调整内容以满足目标时长。实验结果显示,TiCo 显著改善了对时长约束的遵守情况,同时保持了响应质量。
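A Spoken Time Marker can be pictured as a token interleaved with the response text that reports elapsed speaking time. The sketch below is a guess at what such a sequence might look like: only the `<10.6 seconds>` marker format comes from the abstract, while the function name, the use of word-level end times, and the fixed marking interval are assumptions.

```python
def insert_time_markers(words, end_times, interval=5.0):
    """Interleave '<t seconds>' markers whenever elapsed speech crosses another interval."""
    out, next_mark = [], interval
    for word, t in zip(words, end_times):
        out.append(word)
        if t >= next_mark:
            out.append(f"<{t:.1f} seconds>")
            while next_mark <= t:   # skip any intervals already passed
                next_mark += interval
    return " ".join(out)


# Hypothetical word-level end times (in seconds) for a short response.
words = ["Sure,", "here", "is", "a", "short", "answer."]
end_times = [1.2, 2.0, 5.3, 6.1, 8.4, 10.6]
marked = insert_time_markers(words, end_times)
print(marked)
```

A model trained on sequences of this shape sees how much time has elapsed at each marker, which is the signal the abstract says it uses to adjust the remaining content toward the target duration.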
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
Authors: Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai
Venue: CVPR 2026
First: 2025-12-01T10:17:07+00:00 · Latest: 2026-03-23T17:49:56+00:00
Comments: Accepted to CVPR 2026
Abstract
Video instance segmentation (VIS) for low-light content remains highly challenging for humans and machines alike, due to noise, blur and other adverse conditions. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, consequently, perform poorly even after finetuning. In this paper, we introduce ELVIS (Enhance Low-Light for Video Instance Segmentation), a framework that enables domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile estimation network (VDP-Net), and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset and beats two-stage baselines by at least +2.8 AP on real low-light videos. Code and dataset available at: https://joannelin168.github.io/research/ELVIS
中文标题/摘要
标题:ELVIS:增强低光环境下的视频实例分割
低光条件下的视频实例分割(VIS)对人类和机器来说仍然是一个高度具有挑战性的任务,由于噪声、模糊和其他不良条件。缺乏大规模标注数据集以及当前合成管道的局限性,特别是在建模时间退化方面,进一步阻碍了进展。此外,现有的VIS方法对低光视频中的退化不具有鲁棒性,因此即使经过微调,表现也较差。在本文中,我们提出了**ELVIS**(**E**nhance **L**ow-Light for **V**ideo **I**nstance **S**egmentation),一种使最先进的VIS模型适应低光场景的框架。ELVIS 包含一个无监督的合成低光视频管道,该管道建模了空间和时间退化,一个无需校准的退化特征估计网络(VDP-Net)和一个解耦退化与内容特征的增强解码器头部。ELVIS 在合成低光 YouTube-VIS 2019 数据集上的性能提高了最多 **+3.7AP**,在真实低光视频上至少优于两阶段基线 **+2.8AP**。代码和数据集可在:https://joannelin168.github.io/research/ELVIS 获取
Summary / 总结
ELVIS is a framework for enhancing low-light video instance segmentation by addressing noise, blur, and other degradations through an unsupervised synthetic low-light video pipeline, a degradation profile estimation network, and an enhancement decoder head. It improves performance by up to 3.7AP on synthetic low-light data and outperforms two-stage baselines by at least 2.8AP on real low-light videos.
ELVIS 是一个框架,旨在通过解决噪声、模糊和其他退化问题来增强低光视频实例分割。它包括一个无监督的低光视频合成管道、一个退化特征估计网络和一个增强解码头。ELVIS 在合成低光 YouTube-VIS 2019 数据集上的性能提高了最多 3.7AP,并且在真实低光视频上至少比两阶段基线高出 2.8AP。
The Price of Progress: Price Performance and the Future of AI
Authors: Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson
First: 2025-11-28T18:47:33+00:00 · Latest: 2026-03-23T17:48:22+00:00
Abstract
Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities *per dollar*. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. However, at the same time, the price of running frontier models is rising between $3\times$ to $18\times$ per year due to bigger models and larger reasoning demands. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
中文标题/摘要
标题:进步的代价:价格性能与AI的未来
近年来,语言模型在高级基准测试中取得了巨大的进步,但这些进步中的许多都只能通过使用更昂贵的模型来实现。因此,基准测试可能无法准确反映每美元实际能力的进步。为了纠正这一点,我们使用来自Artificial Analysis和Epoch AI的数据,构建了迄今为止规模最大的、涵盖当前和历史基准测试运行价格的数据集。我们发现,对于知识、推理、数学和软件工程基准上的前沿模型,达到给定基准性能的价格已经以惊人的速度下降,大约每年下降5到10倍。这些AI推理成本的降低是由于经济力量、硬件效率改进和算法效率改进所致。通过单独考察开放模型以控制竞争效应,并除以硬件价格下降,我们估计算法效率的进步约为每年3倍。然而,同时,运行前沿模型的成本正在以每年3到18倍的速度上升,这主要是由于更大的模型和更大的推理需求。最后,我们建议评估者不仅要公布还要考虑基准测试的成本,将其作为衡量AI实际影响的重要组成部分。
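As a rough illustration of the compounding rates quoted in the abstract above, the sketch below converts an annual price-decline factor into a price after a given number of years. The starting price of 100 and the function names are hypothetical and are not taken from the paper's dataset.

```python
import math


def price_after(initial_price: float, annual_factor: float, years: float) -> float:
    """Price after `years` if it falls by a factor of `annual_factor` each year."""
    return initial_price / (annual_factor ** years)


def years_to_drop(total_factor: float, annual_factor: float) -> float:
    """Years needed for the price to fall by `total_factor` overall."""
    return math.log(total_factor) / math.log(annual_factor)


# Hypothetical starting price of 100 (arbitrary units) for a fixed benchmark score.
low = price_after(100.0, 5.0, 2.0)    # 5x decline per year, two years
high = price_after(100.0, 10.0, 2.0)  # 10x decline per year, two years
print(low, high)
```

Under these assumed inputs, two years at the quoted 5x to 10x annual rates corresponds to an overall price reduction of between 25x and 100x.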
Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Authors: Yunyi Zhang, Soji Adeshina, Sheng Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
First: 2026-03-19T19:15:51+00:00 · Latest: 2026-03-23T17:46:56+00:00
Abstract
Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
中文标题/摘要
标题:通过细粒度潜在任务发现实现可扩展的提示路由
提示路由动态地从候选模型池中选择最适合每个查询的大语言模型,优化性能同时管理成本。随着模型池扩展到包括数十个前沿模型且性能差距变窄,现有方法面临重大挑战:手动定义的任务分类无法捕捉细微的能力差异,而单一的路由器难以区分多样任务中的细微差异。我们提出了一种两阶段路由架构,通过自动化的细粒度任务发现和任务感知质量估计来解决这些限制。第一阶段使用图聚类发现潜在的任务类型并训练分类器将提示分配给发现的任务。第二阶段使用专家混合架构,带有特定任务的预测头,进行专门的质量估计。在推理时,我们结合两个阶段的预测以平衡任务级别的稳定性与提示特定的适应性。在10个基准测试中使用11个前沿模型评估,我们的方法始终优于现有基线,并超越了最强的单个模型,而成本不到其一半。
Summary / 总结
The research addresses prompt routing as model pools grow to dozens of frontier models with narrow performance gaps. The method is a two-stage architecture: the first stage uses graph-based clustering to discover latent task types and trains a classifier to assign prompts to them, while the second stage uses a mixture-of-experts architecture with task-specific prediction heads for quality estimation. At inference, predictions from both stages are aggregated to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, the approach consistently outperforms existing baselines and surpasses the strongest individual model at less than half its cost.
研究旨在通过解决现有方法在管理不断增加的模型池时的局限性,改进大型语言模型中的提示路由。方法采用两阶段路由架构:第一阶段使用基于图的聚类发现潜在的任务类型,并训练分类器将提示分配给发现的任务;第二阶段采用具有任务特定预测头的混合专家架构进行质量估计。研究在10个基准测试中使用11个前沿模型评估了该方法,发现它在性能上优于现有基线和单一模型,并且成本更低。
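The abstract above says predictions from the two routing stages are aggregated at inference, but it does not give the aggregation rule. The sketch below assumes a simple convex combination of a task-level and a prompt-level quality estimate, followed by a cost penalty. The weights `alpha` and `cost_weight`, the model names, and all scores are hypothetical.

```python
from typing import Dict


def route(
    task_scores: Dict[str, float],    # stage 1: quality estimate per model for the prompt's task
    prompt_scores: Dict[str, float],  # stage 2: prompt-specific quality estimate per model
    costs: Dict[str, float],          # relative cost per model
    alpha: float = 0.5,               # assumed mixing weight between the two stages
    cost_weight: float = 0.0,         # assumed penalty per unit of cost
) -> str:
    """Return the model with the highest cost-adjusted aggregated score."""
    best_model, best_utility = "", float("-inf")
    for model in sorted(task_scores):
        quality = alpha * task_scores[model] + (1.0 - alpha) * prompt_scores[model]
        utility = quality - cost_weight * costs[model]
        if utility > best_utility:
            best_model, best_utility = model, utility
    return best_model


# Hypothetical two-model pool.
task_scores = {"large": 0.80, "small": 0.75}
prompt_scores = {"large": 0.82, "small": 0.78}
costs = {"large": 1.0, "small": 0.2}

print(route(task_scores, prompt_scores, costs))                   # quality only
print(route(task_scores, prompt_scores, costs, cost_weight=0.1))  # cost-aware
```

With these made-up numbers, the quality-only call selects the larger model, and adding a modest cost penalty switches the choice to the cheaper one.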
Measuring Iterative Temporal Reasoning with Time Puzzles
Authors: Zhengxiang Wang, Zeyu Dong
First: 2026-01-12T02:39:26+00:00 · Latest: 2026-03-23T17:44:47+00:00
Comments: 11 pages, 4 tables, 3 figures
Abstract
Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
中文标题/摘要
标题:使用时间谜题衡量迭代时间推理
工具使用,如网络搜索,已成为大型语言模型(LLM)中的一项标准能力。然而,现有的基准主要在静态、无工具使用的情境下评估时间推理,这与LLM在实际中如何进行时间推理相差甚远。我们引入了时间谜题,这是一种基于约束的日期推断任务,用于评估使用工具的迭代时间推理。每个谜题结合了事实性的时间锚点与(跨文化)日历关系,并可能允许一个或多个有效日期。谜题是通过算法生成的,这使得评估可以受到控制并持续进行。在13个LLM中,即使最好的模型(GPT-5)在没有工具的情况下也仅能达到55.3%的准确率,尽管使用了易于搜索的事实。虽然网络搜索可以提高性能,但当约束被重写为明确的日期时,模型的表现显著提高,从而消除了事实查找的需要。这些结果揭示了可靠工具使用在迭代时间推理中的差距。
Summary / 总结
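The paper introduces Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with cross-cultural calendar relations, may admit one or multiple valid dates, and is algorithmically generated to support controlled, continual evaluation. Across 13 LLMs, even the best model (GPT-5) reaches only 55.3% accuracy without tools. Web search improves performance, but models do substantially better when constraints are rewritten with explicit dates, which reveals a gap in reliable tool use for iterative temporal reasoning.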
论文引入了Time Puzzles,一个新的基准,用于评估工具辅助下的迭代时间推理能力,解决了现有静态基准的局限性。该方法通过结合事实时间锚点和历法关系,以算法生成的方式生成谜题,实现可控和持续的评估。关键发现表明,即使是最优模型(GPT-5)在没有工具的情况下也仅能达到55.3%的准确率,而使用网络搜索可以提高性能,但当约束条件被重写为明确的日期时,模型表现更好,这揭示了可靠工具使用在迭代时间推理中的不足。
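To make the idea of a constraint-based date puzzle with one or multiple valid answers concrete, here is a minimal brute-force sketch. The constraints are invented for illustration and are not drawn from the Time Puzzles benchmark, which also involves factual anchors and cross-cultural calendars.

```python
from datetime import date, timedelta
from typing import Callable, List

Constraint = Callable[[date], bool]


def solve(constraints: List[Constraint], start: date, end: date) -> List[date]:
    """Return every date in [start, end] that satisfies all constraints."""
    valid = []
    day = start
    while day <= end:
        if all(check(day) for check in constraints):
            valid.append(day)
        day += timedelta(days=1)
    return valid


# Hypothetical puzzle: "a Friday that falls on the 13th, in 2024".
puzzle: List[Constraint] = [
    lambda d: d.weekday() == 4,  # Monday is 0, so 4 is Friday
    lambda d: d.day == 13,
]
answers = solve(puzzle, date(2024, 1, 1), date(2024, 12, 31))
print(answers)

# Adding one more constraint narrows the puzzle to a single valid date.
narrowed = solve(puzzle + [lambda d: d.month > 10], date(2024, 1, 1), date(2024, 12, 31))
print(narrowed)
```

The first puzzle admits two valid dates in 2024 and the narrowed version admits one, which mirrors the abstract's note that a puzzle may have one or multiple valid answers.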
EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
Authors: Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla
First: 2026-03-23T17:43:49+00:00 · Latest: 2026-03-23T17:43:49+00:00
Comments: Project Page: https://lab-spell.github.io/EgoGroups/
Abstract
Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
中文标题/摘要
标题:EgoGroups:真实场景中人群社交群体检测的基准
人群社交群体检测,即识别参与相互间互动的人类(例如家庭成员、朋友、顾客和商家),是社交智能的关键组成部分,对于在世界中进行交易的代理至关重要。现有的少数社交群体检测基准受限于场景多样性低和依赖第三人视角的摄像源(例如监控录像)。因此,这些基准通常缺乏在不同文化背景和非受限环境下的实际评估。为解决这一问题,我们引入了EgoGroups,这是一个第一人称视角的数据集,捕捉了世界各地城市的社交动态。EgoGroups覆盖了65个国家,包括低、中、高人群密度设置,在四种天气/时间条件下的场景。我们包括了密集的人类注释,包括个人和社交群体,以及丰富的地理和场景元数据。使用此数据集,我们对最先进的VLM/LLMs和监督模型进行了广泛的评估,以测试它们的群体检测能力。我们发现了一些有趣的结果,包括在零样本设置下,VLMs和LLMs可以超越监督基线,而人群密度和文化区域明显影响模型性能。
Summary / 总结
EgoGroups is a first-person view dataset for evaluating social group detection in diverse real-world settings. It captures social dynamics in cities across 65 countries, covering low, medium, and high-crowd settings under four weather/time-of-day conditions, and includes dense human annotations for persons and social groups along with rich geographic and scene metadata. Experiments show that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region clearly influence model performance.
EgoGroups 是一个第一人称视角的数据集,用于捕捉来自65个国家的多样社会动态。它评估了最先进的视觉-语言模型和监督模型在群体检测能力上的表现,发现VLMs和LLMs在零样本设置下可以超越监督基线模型,而人群密度和文化区域明显影响模型的表现。
One Model, Two Markets: Bid-Aware Generative Recommendation
Authors: Yanchen Jiang, Zhe Feng, Christopher P. Mah, Aranyak Mehta, Di Wang
First: 2026-03-23T17:27:59+00:00 · Latest: 2026-03-23T17:27:59+00:00
Abstract
Generative Recommender Systems using semantic ids, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad's likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.
中文标题/摘要
标题:一个模型,两个市场:竞价感知生成推荐
使用语义ID的生成式推荐系统,如TIGER(Rajput等人,2023),已成为序列推荐中被广泛采用且具有竞争力的范式。然而,现有的架构仅针对语义检索设计,并未解决通过广告收入实现货币化以及在商业检索中纳入竞价等问题。我们提出了GEM-Rec,这是一种统一框架,将商业相关性和货币化目标直接整合到生成序列中。我们引入了控制标记,将是否展示广告的决策与展示哪个项目分离。这使模型能够直接从交互日志中学习有效的广告放置模式,这些日志反映了过去的成功广告放置。此外,我们设计了一种竞价感知解码机制,处理实时定价,将竞价直接注入推理过程,引导生成向高价值项目倾斜。我们证明了这种方法保证了分配单调性,确保较高的竞价不会降低(即弱增加)广告被展示的可能性,而无需重新训练模型。实验表明,GEM-Rec使平台能够动态优化语义相关性和平台收入。
Summary / 总结
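GEM-Rec is a unified generative recommendation framework that builds commercial relevance and monetization objectives directly into the generated sequence. Control tokens decouple the decision of whether to show an ad from the choice of which item to show, so valid placement patterns can be learned from interaction logs. A Bid-Aware Decoding mechanism injects real-time bids at inference to steer generation toward high-value items, and is proven to guarantee allocation monotonicity: a higher bid weakly increases an ad's likelihood of being shown, without retraining. Experiments indicate that platforms can dynamically optimize for semantic relevance and platform revenue.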
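The abstract does not state how bids enter decoding, so the following is only a sketch of one way allocation monotonicity can arise. It assumes an additive bonus of `beta * log(1 + bid)` on an ad's logit before a softmax. Because a softmax probability never decreases when an item's own logit rises, raising a bid cannot lower that ad's probability. The bonus form, `beta`, and all item names and values are hypothetical.

```python
import math
from typing import Dict


def bid_aware_probs(
    logits: Dict[str, float],  # model scores for candidate items
    bids: Dict[str, float],    # current bids; items without a bid get no bonus
    beta: float = 1.0,         # assumed strength of the bid bonus
) -> Dict[str, float]:
    """Softmax over logits after adding an assumed log-bid bonus to each ad."""
    adjusted = {
        item: score + beta * math.log1p(bids.get(item, 0.0))
        for item, score in logits.items()
    }
    peak = max(adjusted.values())  # subtract the max for numerical stability
    exps = {item: math.exp(value - peak) for item, value in adjusted.items()}
    total = sum(exps.values())
    return {item: value / total for item, value in exps.items()}


logits = {"organic_1": 1.2, "ad_a": 0.4, "ad_b": 0.9}
p_low = bid_aware_probs(logits, {"ad_a": 0.5, "ad_b": 1.0})
p_high = bid_aware_probs(logits, {"ad_a": 2.0, "ad_b": 1.0})  # only ad_a's bid rises
print(p_low["ad_a"], p_high["ad_a"])
```

Bids are applied at inference only, which matches the abstract's point that pricing can change without retraining the model.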
Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre
Authors: Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar
First: 2026-03-23T17:26:41+00:00 · Latest: 2026-03-23T17:26:41+00:00
Comments: 6 pages, 2 figures
Abstract
Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
中文标题/摘要
标题:在实际空中获取条件下对航空LiDAR点云语义分割深度学习模型进行基准测试:以西班牙纳瓦拉地区为例
近年来,深度学习的进展显著提高了3D语义分割的性能,但大多数模型主要集中在室内或地面数据集上。在实际空中获取条件下的行为仍然没有得到充分探索,尽管有一些研究解决了类似场景,但它们在数据集设计、获取条件和模型选择上有所不同。为了解决这一差距,我们在西班牙纳瓦拉地区在实际飞行条件下获取的大型航空LiDAR数据集上,评估了几种最先进的架构,涵盖了异质的城市、农村和工业景观。本研究比较了四个代表性的深度学习模型,包括KPConv、RandLA-Net、Superpoint Transformer和Point Transformer V3,跨越了空中调查中常见的五个语义类别,如地面、植被、建筑物和车辆,突出了空中数据中类别不平衡和几何变化的固有挑战。结果显示,所有测试的模型均实现了超过93%的整体准确性,KPConv通过在各类别中保持一致的性能,特别是在具有挑战性和未充分代表的类别中,获得了最高的平均IoU(78.51%)。Point Transformer V3在未充分代表的车辆类别上表现出色(IoU为75.11%),而Superpoint Transformer和RandLA-Net则在分割稳健性与计算效率之间进行了权衡。
Summary / 总结
This study benchmarks deep learning models for aerial LiDAR point cloud semantic segmentation under real acquisition conditions in Navarre, Spain. Four state-of-the-art architectures—KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3—are evaluated across five semantic classes. All models achieve high overall accuracy, with KPConv showing the best mean IoU of 78.51%, particularly excelling in challenging and underrepresented categories. Point Transformer V3 performs best on the underrepresented vehicle class, while Superpoint Transformer and RandLA-Net offer computational efficiency at the cost of segmentation robustness.
该研究在西班牙纳瓦拉地区真实飞行条件下,评估了四种最先进的深度学习模型(包括KPConv、RandLA-Net、Superpoint Transformer和Point Transformer V3)对机载LiDAR点云语义分割的表现。结果表明,所有模型在总体准确率上表现优异,KPConv在平均IoU上表现最佳(78.51%),而Point Transformer V3在车辆类别的表现尤为突出(75.11%)。
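The abstract reports both overall accuracy and mean IoU, and the two can diverge sharply under class imbalance. The sketch below computes both from a confusion matrix using the standard definitions. The two-class counts are invented for illustration and are unrelated to the Navarre dataset.

```python
from typing import List, Optional


def overall_accuracy(confusion: List[List[int]]) -> float:
    """Fraction of points whose predicted class equals the true class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_iou(confusion: List[List[int]]) -> List[Optional[float]]:
    """IoU = TP / (TP + FP + FN) per class; rows are true labels, columns predictions."""
    size = len(confusion)
    ious: List[Optional[float]] = []
    for c in range(size):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(size)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)  # None if the class never appears
    return ious


def mean_iou(confusion: List[List[int]]) -> float:
    """Average IoU over the classes that appear in the data."""
    present = [value for value in per_class_iou(confusion) if value is not None]
    return sum(present) / len(present)


# Hypothetical counts: a dominant class (row 0) and a rare class (row 1).
toy = [[90, 0],
       [5, 5]]
print(overall_accuracy(toy), per_class_iou(toy), mean_iou(toy))
```

In this toy case overall accuracy is 0.95 while the rare class has an IoU of only 0.5, which is why mean IoU is the more informative figure for underrepresented classes such as vehicles.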
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
First: 2026-03-23T17:26:35+00:00 · Latest: 2026-03-23T17:26:35+00:00
Abstract
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present \textbf{SpatialReward}, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a \emph{Prompt Decomposer} extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce \textbf{SpatRelBench}, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
中文标题/摘要
标题:SpatialReward:可验证的空间奖励建模以实现文本到图像生成中的细粒度空间一致性
通过强化学习(RL)实现的文本到图像(T2I)生成的最近进展得益于评估语义对齐和视觉质量的奖励模型。然而,大多数现有的奖励模型对细粒度的空间关系关注有限,经常生成整体上看似合理的图像,但包含物体定位的不准确之处。在本文中,我们提出了**SpatialReward**,一种明确设计用于评估生成图像的空间布局的可验证奖励模型。SpatialReward 采用多阶段管道:一个**提示分解器**从自由形式的提示中提取实体、属性和空间元数据;专家检测器提供准确的视觉定位和属性;视觉语言模型在地基观察上进行链式推理,评估规则方法难以处理的复杂空间关系。为了更全面地评估生成图像中的空间关系,我们引入了**SpatRelBench**,一个涵盖物体属性、方向、物体间关系和渲染文本位置的基准。在Stable Diffusion和FLUX上的实验表明,将SpatialReward纳入RL训练中可以一致地提高空间一致性和整体生成质量,结果与人类判断更为一致。这些发现表明,可验证的奖励模型在实现文本到图像生成模型中更准确和可控的优化方面具有巨大潜力。
Summary / 总结
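SpatialReward is a verifiable reward model for evaluating spatial layouts in text-to-image generation, targeting the fine-grained spatial relationships that most existing reward models overlook. Its multi-stage pipeline combines a Prompt Decomposer, expert detectors for visual grounding, and a vision-language model that applies chain-of-thought reasoning over the grounded observations. The authors also introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that adding SpatialReward to RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments.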
研究旨在通过解决现有奖励模型忽视精细空间关系的问题,提高文本生成图像的空间一致性。方法包括一个名为SpatialReward的多阶段管道,包含提示分解器、专家检测器和视觉语言模型,用于评估复杂的空间关系。实验结果表明,将SpatialReward集成到RL训练中可以提高空间一致性并提升整体生成质量,与Stable Diffusion和FLUX模型的人类判断更为一致。
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
Authors: Yi Du, Taimeng Fu, Zhipeng Zhao, Shaoshu Su, Zitong Zhan, Qiwei Du, Zhuoqun Chen, Bowen Li, Chen Wang
First: 2025-02-02T21:44:15+00:00 · Latest: 2026-03-23T17:26:04+00:00
Abstract
Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to robot aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and image memory system to enhance the vision language models' (VLMs) neural reasoning capabilities for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with the symbolic heuristic function to efficiently gather the task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
中文标题/摘要
标题:VL-Nav:一种基于神经符号的方法进行基于推理的视觉语言导航
基于复杂和抽象的人类指令自主导航未见过的大型环境仍然是自主移动机器人的一大挑战。解决这一问题需要机器人推断隐含语义并高效探索大型任务空间。然而,现有的方法,从端到端学习到基于基础模型的模块化架构,往往缺乏分解复杂任务或采用高效探索策略的能力,导致机器人盲目游荡或目标识别失败。为了解决这些限制,我们提出了VL-Nav,一种神经符号(NeSy)视觉语言导航系统。该系统通过两个核心组件将神经推理与符号指导相结合:(1)一个NeSy任务规划器,利用符号3D场景图和图像记忆系统增强视觉语言模型(VLMs)的神经推理能力,以进行任务分解和重新规划;(2)一个NeSy探索系统,将神经语义线索与符号启发式函数耦合,以高效地收集任务相关信息,同时在探索过程中尽量减少不必要的重复行进。在DARPA TIAMAT挑战导航任务上验证,该系统在室内环境中的成功率(SR)为83.4%,在室外场景中为75%。在真实世界实验中,VL-Nav实现了86.3%的SR,包括一次具有挑战性的483米行程。最后,我们在3D多层场景中使用复杂指令验证了该系统。
Summary / 总结
VL-Nav is a neuro-symbolic approach designed to help robots navigate based on complex human instructions in large-scale environments. It integrates neural reasoning with symbolic guidance through a task planner and an exploration system. The task planner uses a symbolic 3D scene graph and image memory to improve the neural reasoning of vision-language models for task decomposition and replanning, while the exploration system couples neural semantic cues with a symbolic heuristic function to gather task-related information with minimal repeat travel. On the DARPA TIAMAT Challenge navigation tasks, VL-Nav achieved an 83.4% success rate in indoor environments and 75% in outdoor scenarios. In real-world experiments, which included a challenging 483-meter run, it achieved an 86.3% success rate. The system was additionally validated with complex instructions in a 3D multi-floor scenario.
VL-Nav 是一种神经符号型的视觉语言导航系统,旨在帮助机器人理解和遵循大型环境中的复杂人类指令。该系统利用符号3D场景图和图像记忆来增强视觉语言模型的推理能力,用于任务分解和重新规划,并结合神经语义线索与符号启发式函数进行高效探索。该系统在DARPA TIAMAT挑战中的室内环境中取得了83.4%的成功率,在室外环境中取得了75%的成功率,并在真实世界实验中取得了86.3%的成功率,包括一次483米的运行。
Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Authors: Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha
Venue: ICLR 2026
First: 2025-08-06T20:30:55+00:00 · Latest: 2026-03-23T17:23:32+00:00
Comments: 30 pages, 19 figures. Accepted at ICLR 2026. For data, code, artifacts, see https://agnostics.abgru.me
Abstract
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure.
We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment.
Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce.
We release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
中文标题/摘要
标题:Agnostics:通过通用学习环境强化学习任何编程语言的代码
大型语言模型(LLMs)已经在高资源语言如Python和JavaScript方面表现出色,但在处理低资源语言时却遇到困难,这些语言对于科学和工程仍然至关重要。除了预训练数据的明显短缺,后训练本身也是一个瓶颈:每种新语言似乎都需要新的数据集、测试框架和强化学习(RL)基础设施。
我们引入了Agnostics,这是一种语言无关的后训练管道,消除了每种语言的工程需求。关键思想是仅通过外部可观察的行为来评判代码,因此单一验证器可以测试用任何语言编写的解决方案。具体来说,我们(i)使用LLM将现有的单元测试数据集重写为I/O格式,(ii)提供一个简短的配置来告诉验证器如何编译和运行目标语言,(iii)在稳健的代码执行环境中应用可验证奖励的强化学习(RLVR)。
应用于五种低资源语言——Lua、Julia、R、OCaml和Fortran——Agnostics(1)将Qwen-3 4B提升到与16B-70B开放权重模型相当的性能;(2)可以干净地扩展到更大的多样化模型家族(Qwen-3 8B、DeepSeek Coder 6.7B Instruct、Phi 4 Mini);(3)对于≤16B参数模型,在MultiPL-E和我们引入的新多语言版本LiveCodeBench上设置了新的最佳结果。
我们发布了语言无关的训练数据集(Ag-MBPP-X、Ag-Codeforces-X、Ag-LiveCodeBench-X)、训练代码和即用型配置,使得任何编程语言的RL后训练简单到只需编辑一个简短的YAML文件。
Summary / 总结
Agnostics is a language-agnostic post-training pipeline that uses reinforcement learning with verifiable rewards (RLVR) to improve the performance of large language models (LLMs) on low-resource languages such as Lua, Julia, R, OCaml, and Fortran. It removes the need for per-language datasets, test harnesses, and RL infrastructure by judging code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Agnostics improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models, scales to larger and more diverse model families, and, for models with at most 16B parameters, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench.
Agnostics 是一种语言无关的后训练管道,通过可验证奖励的强化学习(RLVR)来提升大型语言模型(LLMs)在低资源编程语言上的性能。它通过单一验证器测试任何语言中的解决方案来消除针对每种语言的工程需求。Agnostics 应用于五种低资源语言(Lua、Julia、R、OCaml 和 Fortran)后,提高了 Qwen-3 4B 的性能,使其与 16B-70B 的开放权重模型相媲美,并且能够扩展到更大的模型,同时在 MultiPL-E 和一个新推出的多语言版本的 LiveCodeBench 上设定了新的最佳性能。
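The core idea in the abstract, judging a program only by its stdin/stdout behavior with a short per-language configuration, can be sketched as follows. This is not the Agnostics implementation: the configuration keys, the `{src}` placeholder, and the `verify` function are hypothetical, and a plain dictionary stands in for the YAML file the paper mentions. The current Python interpreter plays the role of the target language so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile
from typing import Any, Dict, List, Tuple

# Hypothetical configuration for one language. A compiled language would put its
# compiler command under "compile"; an interpreted one only needs "run".
PYTHON_CONFIG: Dict[str, Any] = {
    "extension": ".py",
    "compile": None,
    "run": [sys.executable, "{src}"],
}


def _fill(command: List[str], src: str) -> List[str]:
    """Substitute the source path into a command template."""
    return [part.replace("{src}", src) for part in command]


def verify(source: str, config: Dict[str, Any],
           tests: List[Tuple[str, str]], timeout: float = 10.0) -> bool:
    """Return True if the program prints the expected stdout for every test input."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "solution" + config["extension"])
        with open(src, "w", encoding="utf-8") as handle:
            handle.write(source)
        try:
            if config.get("compile"):
                built = subprocess.run(_fill(config["compile"], src),
                                       capture_output=True, timeout=timeout)
                if built.returncode != 0:
                    return False
            for stdin_text, expected in tests:
                result = subprocess.run(_fill(config["run"], src), input=stdin_text,
                                        capture_output=True, text=True, timeout=timeout)
                if result.returncode != 0 or result.stdout.strip() != expected.strip():
                    return False
        except subprocess.TimeoutExpired:
            return False
    return True


tests = [("21\n", "42")]
print(verify("print(int(input()) * 2)\n", PYTHON_CONFIG, tests))
```

Because the verifier never inspects the source, supporting another language in this sketch only means supplying a different configuration dictionary. A real RL setup would also need sandboxing, which is omitted here.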
FRIREN: Beyond Trajectories -- A Spectral Lens on Time
Authors: Qilin Wang
Venue: NeurIPS 2025
First: 2025-05-23T00:52:13+00:00 · Latest: 2026-03-23T17:16:03+00:00
Comments: 37 pages, 4 figures. Submitted to NeurIPS 2025. Public code at https://anonymous.4open.science/r/LTSF_model-03BB/
Abstract
Long-term time-series forecasting (LTSF) models are often presented as general-purpose solutions that can be applied across domains, implicitly assuming that all data is pointwise predictable. Using chaotic systems such as Lorenz-63 as a case study, we argue that geometric structure - not pointwise prediction - is the right abstraction for a dynamic-agnostic foundational model. Minimizing the Wasserstein-2 distance (W2), which captures geometric changes, and providing a spectral view of dynamics are essential for long-horizon forecasting. Our model, FRIREN (Flow-inspired Representations via Interpretable Eigen-networks), implements an augmented normalizing-flow block that embeds data into a normally distributed latent representation. It then generates a W2-efficient optimal path that can be decomposed into rotation, scaling, inverse rotation, and translation. This architecture yields locally generated, geometry-preserving predictions that are independent of the underlying dynamics, and a global spectral representation that functions as a finite Koopman operator with a small modification. This enables practitioners to identify which modes grow, decay, or oscillate, both locally and system-wide. FRIREN achieves an MSE of 11.4, MAE of 1.6, and SWD of 0.96 on Lorenz-63 in a 336-in, 336-out, dt=0.01 setting, surpassing TimeMixer (MSE 27.3, MAE 2.8, SWD 2.1). The model maintains effective prediction for 274 out of 336 steps, approximately 2.5 Lyapunov times. On Rossler (96-in, 336-out), FRIREN achieves an MSE of 0.0349, MAE of 0.0953, and SWD of 0.0170, outperforming TimeMixer's MSE of 4.3988, MAE of 0.886, and SWD of 3.2065. FRIREN is also competitive on standard LTSF datasets such as ETT and Weather. By connecting modern generative flows with classical spectral analysis, FRIREN makes long-term forecasting both accurate and interpretable, setting a new benchmark for LTSF model design.
中文标题/摘要
标题:FRIREN:超越轨迹——时空光谱视角
长期时间序列预测(LTSF)模型通常被呈现为通用解决方案,可以跨领域应用,隐含地假设所有数据都是点预测的。使用洛伦兹-63混沌系统作为案例研究,我们认为几何结构而非点预测是动态无感知基础模型的正确抽象。最小化Wasserstein-2距离(W2),捕捉几何变化,并提供动力学的光谱视图是长期展望预测的关键。我们的模型FRIREN(基于可解释特征网络的流启发表示)实现了一个增强的归一化流块,将数据嵌入到正态分布的潜在表示中。然后生成一个W2高效的最优路径,可以分解为旋转、缩放、逆旋转和平移。这种架构产生的是局部生成、几何保持的预测,与底层动力学无关,以及一个全局光谱表示,作为有限Koopman算子的小修改。这使实践者能够识别哪些模式增长、衰减或振荡,无论是局部还是系统范围。FRIREN在洛伦兹-63(336输入,336输出,dt=0.01)设置中实现了MSE 11.4,MAE 1.6,SWD 0.96,超越了TimeMixer(MSE 27.3,MAE 2.8,SWD 2.1)。该模型在大约2.5个Lyapunov时间内的274个步骤中保持了有效的预测。在罗萨勒(96输入,336输出)上,FRIREN实现了MSE 0.0349,MAE 0.0953,SWD 0.0170,优于TimeMixer的MSE 4.3988,MAE 0.886,SWD 3.2065。FRIREN在标准LTSF数据集如ETT和天气数据集上也具有竞争力。通过将现代生成流与经典光谱分析相结合,FRIREN使长期预测既准确又可解释,为LTSF模型设计设立了新的基准。
Summary / 总结
The research aims to address the limitations of long-term time-series forecasting models by focusing on geometric structure rather than pointwise prediction. The method involves using an augmented normalizing-flow block to embed data into a latent representation and generate W2-efficient optimal paths. Key findings show that FRIREN outperforms TimeMixer on both Lorenz-63 and Rossler datasets, achieving lower MSE, MAE, and SWD values. On standard LTSF datasets, FRIREN also demonstrates competitive performance, making long-term forecasting both accurate and interpretable.
研究旨在通过关注几何结构而非点预测来解决长期时间序列预测模型的局限性。方法是使用FRIREN模型,该模型通过最小化Wasserstein-2距离并提供动态的频谱视图来实现。关键实验发现表明,FRIREN在Lorenz-63和Rossler数据集上优于TimeMixer,具有更低的MSE、MAE和SWD值,并且在大量步骤中保持有效的预测。在标准的长期时间序列预测数据集上,FRIREN也表现出竞争力,为长期预测模型设计设定了新的基准。
Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
Authors: Qilin Wang
First: 2026-03-23T17:14:11+00:00 · Latest: 2026-03-23T17:14:11+00:00
Abstract
Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
中文标题/摘要
标题:噪声滴定:概率时间序列预测的确切分布基准测试
现代时间序列预测几乎完全通过观察单个历史轨迹来进行评估,使得关于模型对非平稳性的鲁棒性声明从根本上无法验证。我们提出了一种范式转变,转向干预主义的确切统计基准测试。通过系统地向已知的混沌和随机动力系统中滴定校准的高斯观测噪声,我们将预测从一个黑盒序列匹配游戏转变为一个确切的分布推断任务。由于底层的数据生成过程和噪声方差在数学上是明确的,评估可以依赖于确切的负对数似然和校准的分布检验,而不是启发式的近似。为了充分利用这一框架,我们将Fern架构扩展为一个概率生成模型,该模型本征地参数化对称正定锥(SPD),输出校准的联合协方差结构,而不受通用雅可比建模的计算瓶颈的限制。在这一严格的评估下,我们发现最先进的零样本基础模型表现出与上下文鹦鹉机制一致的行为,在非平稳性转变和噪声增加的情况下系统性地失败。相比之下,Fern明确捕捉到了底层动力学的不变测度和多元几何结构,保持了结构的忠实性和统计上的精确校准,而大规模序列匹配模型则在这些情况下崩溃。
Summary / 总结
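The paper proposes noise titration, an interventionist benchmarking approach that injects calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems. Because the data-generating process and the noise variance are explicit, forecasts can be scored with exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. The authors extend the Fern architecture into a probabilistic generative model that natively parameterizes the SPD cone and outputs calibrated joint covariance structures. Under this evaluation, state-of-the-art zero-shot foundation models behave consistently with a context-parroting mechanism and fail systematically under non-stationary regime shifts and elevated noise, whereas Fern maintains structural fidelity and sharp calibration.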
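To illustrate why a known system plus known Gaussian noise gives an exact scoring target, the sketch below adds observation noise of a chosen standard deviation to a chaotic map and scores an oracle forecaster with the Gaussian negative log-likelihood. The logistic map, the oracle, and all parameter values are illustrative assumptions. The paper's benchmark systems and the Fern model are not reproduced here.

```python
import math
import random


def gaussian_nll(y: float, mean: float, variance: float) -> float:
    """Exact negative log-likelihood of y under a Gaussian N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


def logistic_map(x: float, r: float = 3.9) -> float:
    """One step of the logistic map, used here as a simple known chaotic system."""
    return r * x * (1.0 - x)


def oracle_mean_nll(noise_std: float, steps: int = 20000, seed: int = 0) -> float:
    """Mean NLL of an oracle that knows both the clean state and the noise variance."""
    rng = random.Random(seed)
    x = 0.3
    total = 0.0
    for _ in range(steps):
        x = logistic_map(x)
        y = x + rng.gauss(0.0, noise_std)  # titrated observation noise
        total += gaussian_nll(y, x, noise_std ** 2)
    return total / steps


def nll_floor(noise_std: float) -> float:
    """Expected oracle NLL, equal to the differential entropy of the noise."""
    return 0.5 * math.log(2.0 * math.pi * math.e * noise_std ** 2)


for sigma in (0.01, 0.1, 0.5):  # assumed titration levels
    print(sigma, round(oracle_mean_nll(sigma), 3), round(nll_floor(sigma), 3))
```

In this setup the floor is known in closed form at every noise level, so a forecaster's measured NLL can be compared against it directly.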
SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu
First: 2026-03-23T17:11:43+00:00 · Latest: 2026-03-23T17:11:43+00:00
Abstract
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
中文标题/摘要
标题:SPA:一种简单但难以超越的知识注入基线
虽然大型语言模型(LLMs)在大量数据上进行预训练,但在专门的数据稀缺领域,其知识覆盖仍然不完整,这推动了大量研究致力于通过合成数据生成进行知识注入。我们提出了SPA(Scaling Prompt-engineered Augmentation),这是一种简单但难以超越的基线,使用少量精心设计的提示生成大规模合成数据以进行知识注入。通过系统的比较,我们发现SPA优于几种强大的基线。此外,我们确定了先前方法的两个关键局限性:(1)虽然基于RL的方法可能在小规模下提高LLM数据增强的令牌效率,但随着数据规模扩大,它们会遭受多样性崩溃,导致收益递减;(2)虽然多阶段提示可能优于简单的增强方法,但在仔细调整提示后,它们的优势可能会消失。我们的结果表明,对于知识注入,结合精心设计的提示与直接的大规模增强可以非常有效,我们希望SPA可以作为未来研究中的强大基线。我们的代码可在https://github.com/Tangkexian/SPA获取。
Summary / 总结
The paper proposes SPA, a simple prompt-engineered augmentation method for generating synthetic data to inject knowledge into large language models. SPA uses a small set of carefully designed prompts to scale up data generation, outperforming several strong baselines. The study identifies two limitations of prior approaches: RL-based methods suffer from diversity collapse as data scales, leading to diminishing returns, and the advantages of multi-stage prompting can disappear after careful prompt tuning. The findings suggest that careful prompt design with straightforward large-scale augmentation can be surprisingly effective for knowledge injection.
论文提出了SPA,一种通过精心设计的提示生成合成数据的方法,以增强语言模型的知识覆盖。SPA在多个强基线中表现出色,并指出了基于强化学习的方法和多阶段提示的局限性。研究结果表明,结合精心设计的提示进行简单的大量数据增强是出人意料的有效。代码可在GitHub上获得。
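The abstract describes SPA as crossing a small set of designed prompts with source documents to produce large-scale synthetic data. A minimal sketch of that idea follows. The three templates and the `build_augmentation_prompts` helper are hypothetical illustrations, not the prompts or code from the SPA repository, and the LLM call that would consume each prompt is omitted.

```python
from itertools import product

# Hypothetical templates: the abstract does not list SPA's actual prompts.
PROMPT_TEMPLATES = [
    "Rewrite the following passage as question-answer pairs:\n{passage}",
    "Explain the following passage to a newcomer to the field:\n{passage}",
    "List the key facts stated in the following passage:\n{passage}",
]


def build_augmentation_prompts(passages, templates=PROMPT_TEMPLATES, samples_per_pair=1):
    """Cross every source passage with every template to form LLM prompts.

    samples_per_pair repeats each prompt so that a sampling-based LLM can
    return several different augmentations of the same passage.
    """
    prompts = []
    for passage, template in product(passages, templates):
        prompts.extend([template.format(passage=passage)] * samples_per_pair)
    return prompts


passages = ["Doc A text.", "Doc B text."]
prompts = build_augmentation_prompts(passages, samples_per_pair=2)
print(len(prompts))  # 2 passages x 3 templates x 2 samples
```

The amount of synthetic data grows with the product of passages, templates, and samples per pair, which is the sense in which a small prompt set can be scaled up.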
Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Authors: Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang
First: 2026-03-23T17:10:29+00:00 · Latest: 2026-03-23T17:10:29+00:00
Abstract
Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
中文标题/摘要
标题:Omni-WorldBench:面向世界模型的全面交互中心评估
基于视频的世界模型已经沿着两种主要范式发展:视频生成和3D重建。然而,现有的评估基准要么仅专注于生成模型的视觉保真度和文本-视频对齐,要么依赖于静态的3D重建指标,这些指标从根本上忽视了时间动态。我们认为,世界建模的未来在于4D生成,即同时建模空间结构和时间演变。在这个范式中,核心能力是交互响应:能够准确反映交互动作如何驱动空间和时间中的状态转换。然而,目前没有基准能够系统地评估这一关键维度。为了解决这一差距,我们提出了Omni-WorldBench,这是一个全面的基准,专门用于评估世界模型在4D设置中的交互响应能力。Omni-WorldBench 包含两个关键组件:Omni-WorldSuite,一个涵盖多种交互级别和场景类型的系统提示套件;以及Omni-Metrics,一个基于代理的评估框架,通过测量交互动作对最终结果和中间状态演变轨迹的因果影响来量化世界建模能力。我们对多个范式下的18个代表性世界模型进行了广泛的评估。我们的分析揭示了当前世界模型在交互响应方面的关键局限性,为未来研究提供了可操作的见解。Omni-WorldBench 将公开发布,以促进交互4D世界建模的进步。
Summary / 总结
Omni-WorldBench is proposed to evaluate the interactive response capabilities of world models in 4D settings, addressing the limitations of existing benchmarks that focus on visual fidelity or static 3D reconstruction metrics. It includes an extensive prompt suite and an agent-based evaluation framework to measure the causal impact of interaction actions. Evaluations of 18 world models across different paradigms highlight their limitations in interactive response, offering insights for future research.
论文提出了Omni-WorldBench,这是一个新的基准,旨在评估4D世界模型的互动响应能力。它通过关注模型如何反映交互动作驱动空间和时间状态转换的能力来弥补现有基准的不足。该基准包括Omni-WorldSuite,一个全面的提示套件,以及Omni-Metrics,一个基于代理的评估框架。对18个不同范式的世界模型进行的广泛评估揭示了当前模型在互动响应方面的重要局限性,为未来研究提供了宝贵的见解。
DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models
Authors: Jin Ma, Mohammed Aldeen, Christopher Salas, Feng Luo, Mashrur Chowdhury, Mert Pesé, Long Cheng
First: 2025-09-04T18:20:36+00:00 · Latest: 2026-03-23T17:00:09+00:00
Abstract
Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. In this work, we introduce DisPatch, the first diffusion-based defense framework for object detection. Unlike previous works that aim to "detect and remove" adversarial patches, DisPatch adopts a "regenerate and rectify" strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DisPatch is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors demonstrate that DisPatch consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP@0.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. Moreover, it strikes the balance between effectiveness and efficiency, and maintains strong robustness against adaptive attacks, making it a practical and reliable defense method.
中文标题/摘要
标题:DisPatch:使用扩散模型解除对象检测中对抗补丁
对象检测是各种实际应用的基础,如安全监控和监控视频分析。尽管取得了进展,最先进的对象检测器仍然容易受到对抗补丁攻击的影响,这些攻击可以轻松地应用于现实中的物体,以隐藏实际物品或创造不存在的物品,导致严重后果。在本工作中,我们提出了DisPatch,这是首个基于扩散模型的对象检测防御框架。与之前旨在“检测和移除”对抗补丁的工作不同,DisPatch 采用“再生和校正”的策略,利用生成模型解除攻击效果同时保持输入图像的完整性。具体而言,我们利用扩散模型的同分布生成能力再生整个图像,使其与良性数据对齐。然后采用校正过程来识别并替换对抗区域为它们的再生良性对应物。DisPatch 对抗无特定知识,无需了解现有补丁。在多个检测器上的广泛实验表明,DisPatch 在隐藏攻击和创造攻击方面均优于最先进的防御方法,在隐藏攻击中实现最佳的整体 mAP@0.5 分数为 89.3%,在非目标创造攻击中将攻击成功率降低到 24.8%。此外,它在有效性与效率之间取得了平衡,并且对适应性攻击具有很强的鲁棒性,使其成为一种实用可靠的防御方法。
Summary / 总结
DisPatch is a novel diffusion-based defense framework for object detection that addresses adversarial patch attacks. Unlike previous methods that focus on detecting and removing patches, DisPatch regenerates the entire image to align with benign data and then rectifies adversarial regions with their benign counterparts. The framework is attack-agnostic and does not require prior knowledge of the patches. Experiments show that DisPatch outperforms existing defenses, achieving an mAP@0.5 score of 89.3% on hiding attacks and a 24.8% attack success rate on untargeted creating attacks, while maintaining robustness against adaptive attacks.
DisPatch 是一种新颖的基于扩散模型的防御框架,用于解决对象检测中的对抗性贴图攻击。与之前专注于检测和移除贴图的方法不同,DisPatch 通过再生整个图像使其与良性数据对齐,然后用良性区域替换对抗性区域。该框架是攻击无关的,不需要事先了解贴图信息。实验表明,DisPatch 在隐藏攻击中的 mAP@0.5 得分达到 89.3%,在未目标创建攻击中的攻击成功率降低到 24.8%,同时保持了对适应性攻击的鲁棒性。
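The reported mAP@0.5 counts a detection as correct only when its box overlaps a ground-truth box with intersection over union (IoU) of at least 0.5. A minimal sketch of that matching criterion follows. The `(x1, y1, x2, y2)` box layout and the helper names are assumptions for illustration, and the full mAP computation (ranking by confidence, precision-recall integration) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold


gt = (0, 0, 10, 10)
print(is_match((2, 0, 12, 10), gt))  # overlap 80 / union 120
print(is_match((5, 0, 15, 10), gt))  # overlap 50 / union 150
```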
CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks
Authors: A. Chervov, F. Levkovich-Maslyuk, A. Smolensky, F. Khafizov, I. Kiselev, D. Melnikov, I. Koltsov, S. Kudashev, D. Shiltsov, M. Obozov, S. Krymskii, V. Kirova, E. V. Konstantinova, A. Soibelman, S. Galkin, L. Grunwald, A. Kotov, A. Alexandrov, S. Lytkin, D. Fedoriaka, A. Chevychelov, Z. Kogan, A. Natyrova, L. Cheldieva, O. Nikitina, S. Fironov, A. Vakhrushev, A. Lukyanenko, V. Ilin, D. Gorodkov, N. Bogachev, I. Gaiur, M. Zaitsev, F. Petrov, L. Petrov, T. Gaintseva, A. Gavrilova, M. N. Smirnov, N. Kalinin, A. Khan, K. Jung, H. Mousset, H. Isambert, O. Debeaupuis
First: 2026-03-23T16:54:44+00:00 · Latest: 2026-03-23T16:54:44+00:00
Comments: 20+120 pages
Abstract
This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks -- such as those addressed by GPT-style language models or RL systems -- can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the "complexity = volume" principle in AdS/CFT.
For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the "complexity = volume" paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases.
中文标题/摘要
标题:CayleyPy-4:AI全息图。朝向AI任务中的全息弦对偶的类比
这是CayleyPy项目的第四篇论文,该项目将AI方法应用于大型图的探索。本文中,我们提出了这种设置中新的全息弦对偶的离散版本,并讨论了它们对AI系统和数学的相关性。许多现代AI任务——例如GPT风格的语言模型或RL系统所处理的任务——可以被视为预测图上粒子轨迹的直接类比。我们研究了Cayley图的一个大家族,我们证明了对于这些图,它以离散弦的形式具有对偶描述。我们假设这些对偶可能扩展到一系列AI系统中,从而可能导致更高效的计算方法。特别是,弦全息图中的状态图像是数据嵌入的自然候选者,这受到AdS/CFT中的“复杂性=体积”原则的启发。
对于对称群S_n的Cayley图,我们的结果表明,相应的对偶对象是平坦的、平面的多边形。图的直径等于多边形内整数点的数量乘以n。图的顶点可以映射到多边形内的路径,通常的图距离对应于路径下的面积,从而直接实现了“复杂性=体积”的范式。我们还发现,在大n极限下存在连续的CFT和对偶弦。我们通过大量初始示例确认了这一图景和其他方面的对偶性。我们还介绍了新的数据集(通过结合机器学习和传统工具获得),这些数据集对于更一般情况下的对偶性建立至关重要。
Summary / 总结
This paper explores the application of AI methods to large graphs, proposing a new discrete version of holographic string dualities. The authors investigate Cayley graphs and find that they can be described using discrete strings, suggesting potential for more efficient computational approaches in AI systems. For symmetric group Cayley graphs, the dual objects are flat, planar polygons, with graph distances corresponding to the area under paths, aligning with the 'complexity = volume' principle. The study confirms this duality in many examples and provides new datasets to further validate the findings.
本文探讨了将AI方法应用于大型图的研究,提出了一种新的离散版本的全息弦对偶性。作者研究了Cayley图,并发现它们可以用离散弦来描述,这表明在AI系统中可能存在更高效的计算方法。对于对称群Cayley图,对偶对象是平坦的平面多边形,图的距离对应于路径下的面积,符合“复杂性=体积”的原则。研究在许多例子中确认了这种对偶性,并提供了新的数据集以进一步验证这些发现。
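The graph quantities in this abstract (distances and diameter on a Cayley graph of S_n) can be computed by brute force for small n. The sketch below uses adjacent transpositions as generators. That generator set is an assumption chosen for illustration, since the abstract does not say which generators the paper studies. For this choice, the distance from the identity equals the number of inversions, so the diameter is n(n-1)/2.

```python
from collections import deque


def cayley_distances(n):
    """BFS distances from the identity in the Cayley graph of S_n
    generated by adjacent transpositions (an assumed generator set)."""
    start = tuple(range(n))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        for i in range(n - 1):
            nxt = list(perm)
            nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]  # apply one generator
            nxt = tuple(nxt)
            if nxt not in dist:
                dist[nxt] = dist[perm] + 1
                queue.append(nxt)
    return dist


def inversions(perm):
    """Number of pairs (i, j) with i < j and perm[i] > perm[j]."""
    return sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )


dist = cayley_distances(4)
# Cayley graphs are vertex-transitive, so the largest distance from the
# identity is the diameter of the whole graph.
diameter = max(dist.values())
print(len(dist), diameter)
```

Exhaustive BFS only works for small n because the graph has n! vertices, which is the scale problem that motivates the ML-based exploration described in the abstract.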
Foundation Models for Trajectory Planning in Autonomous Driving: A Review of Progress and Open Challenges
Authors: Kemal Oksuz, Alexandru Buburuzan, Anthony Knittel, Yuhan Yao, Puneet K. Dokania
First: 2025-10-31T18:05:02+00:00 · Latest: 2026-03-23T16:53:27+00:00
Comments: Accepted to TMLR (Survey Certification)
Abstract
The emergence of multi-modal foundation models has markedly transformed the technology for autonomous driving, shifting away from conventional and mostly hand-crafted design choices towards unified, foundation-model-based approaches, capable of directly inferring motion trajectories from raw sensory inputs. This new class of methods can also incorporate natural language as an additional modality, with Vision-Language-Action (VLA) models serving as a representative example. In this review, we provide a comprehensive examination of such methods through a unifying taxonomy to critically evaluate their architectural design choices, methodological strengths, and their inherent capabilities and limitations. Our survey covers 37 recently proposed approaches that span the landscape of trajectory planning with foundation models. Furthermore, we assess these approaches with respect to the openness of their source code and datasets, offering valuable information to practitioners and researchers. We provide an accompanying webpage that catalogues the methods based on our taxonomy, available at: https://github.com/fiveai/FMs-for-driving-trajectories
中文标题/摘要
标题:自主驾驶轨迹规划的基础模型:进展与开放挑战综述
多模态基础模型的出现显著改变了自主驾驶技术,从传统的、主要的手工设计选择转向统一的基础模型方法,可以直接从原始感官输入中推断出运动轨迹。这类新方法还可以将自然语言作为附加模态,视觉-语言-动作(VLA)模型是其代表之一。在本文综述中,我们通过统一的分类体系全面审视了这些方法,对其架构设计选择、方法论优势以及固有的能力和局限性进行了批判性评估。我们的调查涵盖了37种最近提出的基于基础模型的轨迹规划方法。此外,我们还根据其源代码和数据集的开放性评估了这些方法,为从业者和研究人员提供了有价值的信息。我们提供了一个配套网页,根据我们的分类体系列出了这些方法,网址为:https://github.com/fiveai/FMs-for-driving-trajectories
Summary / 总结
This review examines how multi-modal foundation models have transformed autonomous driving trajectory planning, moving from hand-crafted designs to unified approaches that can directly infer motion trajectories from raw sensory inputs and incorporate natural language. The study evaluates 37 recent approaches, assessing their architectural design, strengths, and limitations, and provides a taxonomy for classification and a webpage cataloging these methods.
这篇综述探讨了多模态基础模型如何改变了自主驾驶轨迹规划,从手工设计转向可以直接从原始传感器输入中推断运动轨迹的统一方法,并可以结合自然语言。研究评估了37种近期方法,评估了它们的架构设计、优势和局限性,并提供了分类体系以及按该体系收录这些方法的配套网页。
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Authors: Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique
First: 2025-04-18T08:12:59+00:00 · Latest: 2026-03-23T16:51:57+00:00
Comments: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC), July 26-29, 2026 in Long Beach, CA, USA. [Codes: https://github.com/rachmadvwp/SwitchMT]
Abstract
Training resource-constrained autonomous agents on multiple tasks simultaneously is crucial for adapting to diverse real-world environments. Recent works employ reinforcement learning (RL) approaches, but they still suffer from sub-optimal multi-task performance due to task interference. State-of-the-art works employ Spiking Neural Networks (SNNs) to improve RL-based multi-task learning and enable low-power/energy operations through network enhancements and spike-driven data stream processing. However, they rely on fixed task-switching intervals during training, thus limiting their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective, scalable, and simultaneous multi-task learning. SwitchMT employs the following key ideas: (1) leveraging a Deep Spiking Q-Network with active dendrites and a dueling structure, which utilizes task-specific context signals to create specialized sub-networks; and (2) devising an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) and longer game episodes as compared to the state-of-the-art. These results also highlight the effectiveness of the SwitchMT methodology in addressing task interference without increasing the network complexity, enabling intelligent autonomous agents with scalable multi-task learning capabilities.
中文标题/摘要
标题:通过自适应任务切换策略的脉冲神经网络实现可扩展的多任务学习以适应资源受限的智能自主代理
同时在多个任务上对资源受限的自主代理进行训练对于适应多变的现实环境至关重要。最近的研究采用强化学习(RL)方法,但由于任务干扰,其多任务性能仍然不尽如人意。最先进的研究利用脉冲神经网络(SNNs)来改进基于RL的多任务学习,并通过网络增强和基于脉冲的数据流处理实现低功耗/低能耗操作。然而,它们在训练过程中依赖于固定的任务切换间隔,从而限制了其性能和可扩展性。为了解决这个问题,我们提出了一种名为SwitchMT的新方法,该方法采用自适应任务切换策略实现有效的、可扩展的和同时的多任务学习。SwitchMT采用了以下关键思想:(1)利用具有活跃树突和对分结构的深度脉冲Q网络,利用任务特定的上下文信号创建专门的子网络;(2)设计一种基于奖励和网络参数内部动力学的自适应任务切换策略。实验结果表明,SwitchMT在多个Atari游戏中(例如,Pong:-8.8,Breakout:5.6,Enduro:355.2)和更长的游戏回合中实现了与最先进的方法相当的分数。这些结果还突显了SwitchMT方法在不增加网络复杂性的情况下解决任务干扰的有效性,使智能自主代理具备可扩展的多任务学习能力。
Summary / 总结
The research aims to improve multi-task learning in resource-constrained autonomous agents by addressing task interference. It proposes SwitchMT, which combines an adaptive task-switching policy with a Deep Spiking Q-Network that has active dendrites and a dueling structure. The method achieves competitive scores in multiple Atari games (Pong: -8.8, Breakout: 5.6, Enduro: 355.2) and longer game episodes than the state of the art, demonstrating effective and scalable multi-task learning without increasing network complexity.
研究旨在通过解决任务干扰问题,提高资源受限自主代理的多任务学习能力。提出了一种名为SwitchMT的新方法,该方法采用自适应任务切换策略和具有每个任务专用子网络的深度脉冲Q网络。实验结果表明,SwitchMT在多个Atari游戏中和更长的游戏回合中优于现有方法,展示了有效的可扩展多任务学习能力,且未增加网络复杂度。
PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Li Yi, Hao Zhao
Venue: CVPR 2026
First: 2026-03-23T16:51:52+00:00 · Latest: 2026-03-23T16:51:52+00:00
Comments: Accepted to CVPR 2026 Code: https://github.com/GasaiYU/PAM
Abstract
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
中文标题/摘要
标题:PAM:一种用于模拟到现实手物交互视频生成的姿态-外观-运动引擎
手物交互(HOI)重建和合成已成为具身AI和AR/VR的核心。尽管取得了快速进展,现有的HOI生成研究仍分散在三个独立的轨道上:(1)仅姿态合成,预测MANO轨迹而不生成像素;(2)单张图像HOI生成,从掩码或2D线索中虚构成外观但缺乏动态;以及(3)视频生成方法,需要整个姿态序列和真实的第一帧作为输入,阻碍了真正的模拟到现实部署。受Joo等人(2018)哲学的启发,我们认为HOI生成需要一个统一的引擎,将姿态、外观和运动整合在一个连贯的框架中。因此,我们引入了PAM:一种用于可控HOI视频生成的姿态-外观-运动引擎。我们的引擎性能通过以下方式得到验证:(1)在DexYCB上,我们获得FVD为29.13(与InterDyn的38.83相比),MPJPE为19.37毫米(与CosHand的30.05毫米相比),同时生成了比256x256和256x384基线更高的分辨率480x720视频。(2)在OAKINK2上,我们的全多条件模型将FVD从68.76提高到46.31。(3)在DexYCB上的输入条件消融实验表明,结合深度、分割和关键点始终能获得最佳结果。(4)对于使用SimpleHand的下游手姿态估计任务,通过增加3,400个合成视频(207,000帧)的训练数据,一个仅使用50%真实数据加上我们合成数据训练的模型可以匹配100%真实数据的基线。
Summary / 总结
The research aims to address the fragmented approaches in HOI generation by proposing a unified Pose-Appearance-Motion Engine (PAM). PAM integrates pose, appearance, and motion within a single framework to generate controllable HOI videos. Key experimental findings include an FVD of 29.13 and MPJPE of 19.37 mm on DexYCB, improvements in FVD from 68.76 to 46.31 on OAKINK2, and enhanced performance in downstream hand pose estimation tasks with synthetic data augmentation.
研究旨在通过提出统一的Pose-Appearance-Motion Engine (PAM) 来解决手-物体交互(HOI)生成中的碎片化方法问题。该方法将姿态、外观和运动整合到一个框架中以生成HOI视频。关键实验发现包括在DexYCB上的FVD为29.13和MPJPE为19.37 mm,在OAKINK2上的FVD从68.76提高到46.31,并且通过合成数据增强,在下游手姿态估计任务中取得了更好的性能。
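The MPJPE figures quoted above (19.37 mm vs. 30.05 mm) refer to mean per-joint position error: the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch of the metric follows. The `mpjpe` helper is illustrative, and it applies no root or Procrustes alignment, which evaluation protocols sometimes add and which the abstract does not specify.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between matched 3D joints, in input units.

    Assumes both inputs list the same joints in the same order and that
    no alignment step is needed (an assumption, see the note above).
    """
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


# Two joints in millimetres: one exact, one off by a 3-4-5 triangle.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # (0 + 5) / 2
```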
A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis
Authors: Shukesh Reddy, Abhijit Das
First: 2026-03-23T16:49:50+00:00 · Latest: 2026-03-23T16:49:50+00:00
Comments: Accepted for publication in SN Computer Science
Abstract
In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning.
We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis.
To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address three research questions: "What is the role of the backbone in the performance of L-SSAT?", "What type of backbone is effective for different face analysis tasks?", and "Is there any generalized backbone for effective face analysis with L-SSAT?".
Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet.
For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.
中文标题/摘要
标题:基于纹理局部描述子融合的自监督学习作为辅助任务的骨干网络基准研究
在本研究中,我们使用不同的骨干网络进行基准测试,并研究它们对自监督学习(SSL)作为辅助任务的影响,该任务旨在将基于纹理的局部描述子融合到特征建模中,以实现高效的面部分析。先前的研究表明,结合主要任务和自监督辅助任务可以实现更稳健和区分性更强的表示学习。
我们为掩码自动编码器(MAE)的SSL任务使用了不同的浅层到深层骨干网络,作为辅助目标以重建局部模式等纹理特征,同时在局部模式SSAT(L-SSAT)中执行主要任务,确保稳健且无偏的面部分析。
为了扩展基准测试,我们在提出的框架内对多个模型配置进行了全面的比较分析。为此,我们探讨了三个研究问题:“骨干网络在L-SSAT中的作用是什么?”、“哪种类型的骨干网络适用于不同的面部分析任务?”以及“是否存在适用于L-SSAT的有效面部分析的通用骨干网络?”
为了回答这些问题,我们提供了详细的分析和实验。性能评估表明,所提方法的骨干网络高度依赖于下游任务,平均准确率分别为FaceForensics++上的0.94、CelebA上的0.87和AffectNet上的0.88。
为了在包括面部属性预测、情绪分类和深度伪造检测在内的各种面部分析范式中保持特征表示质量和泛化能力的一致性,目前尚无统一的骨干网络。
Summary / 总结
This study benchmarks different backbones for self-supervised learning (SSL) as an auxiliary task to enhance texture-based local descriptors in face analysis. The research evaluates the impact of various shallow to deep backbones on the performance of local pattern SSAT (L-SSAT) across multiple datasets, including FaceForensics++, CelebA, and AffectNet, achieving accuracies of 0.94, 0.87, and 0.88 respectively. The findings suggest that the effectiveness of the backbone depends on the specific face analysis task, with no single backbone being universally optimal for all tasks.
本研究对不同骨干网络在利用纹理局部描述符增强局部模式SSAT框架下进行自监督学习(SSL)作为辅助任务进行了基准测试。研究评估了从浅到深不同骨干网络在Masked Auto-Encoder (MAE)任务中的影响。实验结果显示,骨干网络的选择对性能有显著影响,FaceForensics++上的平均准确率为0.94,CelebA上的为0.87,AffectNet上的为0.88。没有一个通用的骨干网络适用于所有面部分析任务,表明不同任务有特定的需求。
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
Venue: CVPR 2026
First: 2026-03-23T16:48:39+00:00 · Latest: 2026-03-23T16:48:39+00:00
Comments: Accepted by CVPR 2026
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
中文标题/摘要
标题:所见即提升:视觉反馈在迭代文本布局优化中的应用
近年来,多模态大型语言模型(MLLMs)的进步使得从自然语言描述自动生成结构化布局成为可能。现有方法通常遵循纯代码范式,生成代码表示布局,然后由图形引擎渲染生成最终图像。然而,它们对渲染后的视觉效果视而不见,难以保证可读性和美观性。在本文中,我们确定视觉反馈是布局生成中的关键因素,并提出了一种视觉反馈布局模型(VFLM),这是一种利用视觉反馈进行迭代优化的自我改进框架。VFLM能够进行自适应反思生成,利用视觉信息反思先前的问题,并迭代生成输出,直到达到满意的质量。这通过结合OCR准确性的视觉导向奖励模型的强化学习实现。通过仅奖励最终生成的结果,可以有效地刺激模型的迭代和反思生成能力。在多个基准测试中的实验表明,VFLM在性能上始终优于先进的MLLMs、现有布局模型和纯代码基线,确立了视觉反馈对于设计导向的MLLMs至关重要。我们的代码和数据可在https://github.com/FolSpark/VFLM获取。
Summary / 总结
This paper addresses the challenge of generating structured layouts from natural language descriptions by introducing Visual Feedback Layout Model (VFLM), which uses visual feedback to iteratively refine layouts. VFLM employs reinforcement learning with a visually grounded reward model to ensure the final output meets high standards of readability and aesthetics. Experiments demonstrate that VFLM outperforms existing methods across multiple benchmarks, highlighting the importance of visual feedback in design-oriented multimodal large language models.
本文通过引入视觉反馈布局模型(VFLM)解决了自动化文本布局生成中确保可读性和美观性的挑战。VFLM利用视觉反馈进行迭代优化,通过强化学习提高布局质量。实验表明,VFLM在多个基准测试中优于现有方法,突显了视觉反馈在设计导向的MLLMs中的重要性。
Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?
Authors: Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris, Carlos Kuchkovsky
First: 2026-03-23T16:46:39+00:00 · Latest: 2026-03-23T16:46:39+00:00
Comments: Submitted to Quantum Machine Intelligence
Abstract
Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve.
In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback.
Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents, representing an improvement of more than 20% over zero-shot general-purpose performance and more than 35% over the parameter-specialized baseline.
Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development.
中文标题/摘要
标题:重访量子代码生成:领域知识应如何存在?
近年来,大型语言模型(LLMs)的进步使得越来越多的编程任务自动化,包括科学和工程领域的代码生成。在快速发展的软件生态系统中,如量子软件开发,框架暴露了复杂的抽象,一个核心问题是,在保持库演进时的可维护性的同时,如何最好地将领域知识融入基于LLM的助手中。
在这项工作中,我们研究了使用Qiskit-HumanEval基准进行Qiskit代码生成的专业化策略。我们将先前工作中引入的参数专业化微调基线与一系列最近的通用语言模型进行了比较,这些模型通过检索增强生成(RAG)和基于代理的推理与执行反馈进行了增强。
我们的结果表明,现代通用语言模型在所有情况下都优于参数专业化基线。虽然微调模型在Qiskit-HumanEval上的pass@1达到约47%,但在零样本和检索增强设置下,最近的通用模型达到60-65%,最强的模型结合迭代执行反馈代理时达到85%,分别比零样本通用性能提高了超过20%,比参数专业化基线提高了超过35%。
代理执行反馈提供了最一致的改进,尽管伴随着运行时间成本的增加,而RAG提供了适度且依赖于模型的增益。这些发现表明,可以通过推理时的增强来实现性能提升,而不是依赖于领域特定的微调,从而实现更灵活和可维护的LLM辅助量子软件开发。
Summary / 总结
This study investigates the integration of domain knowledge into quantum code generation using large language models (LLMs) and compares a parameter-specialized fine-tuned model against general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. The results show that modern general-purpose LLMs outperform the parameter-specialized baseline, achieving up to 85% pass@1 on the Qiskit-HumanEval benchmark, which is a significant improvement over both zero-shot general-purpose models and the parameter-specialized model.
研究探讨了如何在量子软件开发中将领域知识有效地融入基于LLM的助手,使用Qiskit-HumanEval作为基准。研究比较了参数特化的微调模型与增强检索增强生成(RAG)和基于执行反馈的代理推理的通用模型。结果显示,现代通用模型在使用迭代执行反馈代理时能达到高达85%的通过率,这比零样本通用模型和参数特化基线都有显著提升。
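The pass@1 numbers above are instances of the pass@k metric used by HumanEval-style benchmarks. A minimal sketch of the standard unbiased estimator introduced with the original HumanEval benchmark follows. The abstract does not say how many samples per task were drawn or whether this exact estimator was used, so treat the function as an illustration of the metric rather than the paper's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed the tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical tasks with 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
```

For k = 1 the estimator reduces to the fraction of passing samples per task, averaged over tasks.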
First Frame Is the Place to Go for Video Content Customization
Authors: Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos
Venue: CVPR 2026
First: 2025-11-19T18:56:50+00:00 · Latest: 2026-03-23T16:43:36+00:00
Comments: Accepted to CVPR 2026
Abstract
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
中文标题/摘要
标题:第一帧是进行视频内容定制的地方
在视频生成模型中,第一帧扮演什么角色?传统上,它被视为视频的时间-空间起点,仅仅是一个后续动画的种子。在本研究中,我们揭示了一个截然不同的视角:视频模型隐式地将第一帧视为一个概念性记忆缓冲区,用于存储可在生成过程中重新使用的视觉实体。利用这一洞察,我们展示了仅使用20-50个训练示例,无需架构更改或大规模微调,即可在多种场景中实现稳健且通用的视频内容定制。这揭示了视频生成模型在参考基础上进行视频定制的强大但未被充分利用的能力。
Summary / 总结
This study explores the role of the first frame in video generation models, revealing that it acts as a conceptual memory buffer for visual entities. By leveraging this insight, the researchers demonstrate that robust and generalized video content customization can be achieved with only 20-50 training examples, without altering the architecture or performing large-scale fine-tuning. This highlights a previously underutilized capability of video generation models for reference-based customization in various scenarios.
这项工作探讨了视频生成模型中第一帧的作用,揭示它作为视觉实体的概念记忆缓冲区。通过利用这一洞察,作者展示了仅使用20-50个训练示例即可实现鲁棒且通用的视频内容定制,无需架构更改或大规模微调。这突显了视频生成模型在参考基础上进行视频定制的强大但未被充分利用的能力,适用于各种场景。
MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
Authors: Jack W O'Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fei-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley
First: 2026-03-23T16:42:11+00:00 · Latest: 2026-03-23T16:42:11+00:00
Abstract
Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
中文标题/摘要
标题:MARCUS:一种自主的多模态视图语言模型,用于心脏诊断和管理
心血管疾病仍然是全球死亡的主要原因,进展受阻于人类对复杂心脏测试的解释。当前的AI视图语言模型仅限于单模态输入且不具备交互性。我们介绍了MARCUS(多模态自主推理和超声与信号聊天),这是一种自主的视图语言系统,用于独立解释心电图(ECGs)、超声心动图和心脏磁共振成像(CMR),以及作为多模态输入的综合解释。MARCUS采用分层的自主架构,包括模态特定的视图语言专家模型,每个模型整合了领域训练的视觉编码器与多阶段语言模型优化,由多模态协调器协调。MARCUS基于1350万张图像(包括25万张心电图、130万张超声心动图图像、1200万张心脏磁共振成像图像)以及我们新开发的专家标注数据集(涵盖160万问题),实现了最先进的性能,超越了前沿模型(GPT-5 Thinking、Gemini 2.5 Pro Deep Think)。在内部(斯坦福)和外部(UCSF)测试组中,MARCUS的心电图准确率为87-91%,超声心动图准确率为67-86%,心脏磁共振成像准确率为85-88%,分别比前沿模型高出34-45%(P<0.001)。在多模态病例中,MARCUS的准确率为70%,几乎是前沿模型(22-28%)的三倍,且自由文本质量评分高出1.7-3.0倍。我们的自主架构还赋予了模型对幻象推理的抵抗力,即视图语言模型从无意的文本信号或虚构的视觉内容中推断推理。MARCUS展示了领域特定的视觉编码器与自主协调器结合,能够实现多模态心脏解释。我们已开源发布我们的模型、代码和基准测试。
Regularization Implies balancedness in the deep linear network
Authors: Kathryn Lindsey, Govind Menon
First: 2025-11-03T01:19:26+00:00 · Latest: 2026-03-23T16:29:45+00:00
Comments: 18 pages, 3 figures. Fixed minor errors in revision, added more context and created Discussion section
Abstract
We use geometric invariant theory (GIT) to study the deep linear network (DLN). The Kempf-Ness theorem is used to establish that the $L^2$ regularizer is minimized on the balanced manifold. We introduce related balancing flows using the Riemannian geometry of fibers. The balancing flow defined by the $L^2$ regularizer is shown to converge to the balanced manifold at a uniform exponential rate. The balancing flow defined by the squared moment map is computed explicitly and shown to converge globally.
This framework allows us to decompose the training dynamics into two distinct gradient flows: a regularizing flow on fibers and a learning flow on the balanced manifold. It also provides a common mathematical framework for balancedness in deep learning and linear systems theory. We use this framework to interpret balancedness in terms of fast-slow systems, model reduction and Bayesian principles.
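As a toy illustration of the claim that the $L^2$ regularizer is minimized on the balanced manifold, the sketch below uses a two-layer scalar network with weights `a` and `b`, an assumption made purely for readability. The fiber is the set of pairs with the same product `a * b`, and moving along it by `(a, b) -> (a e^s, b e^{-s})` leaves the end-to-end map unchanged. The descent in `s` is a hypothetical stand-in, not the Riemannian balancing flow constructed in the paper.

```python
import math

def regularizer(a: float, b: float) -> float:
    """L2 regularizer of a two-layer scalar linear network."""
    return a * a + b * b

def balance_along_fiber(a: float, b: float, step: float = 0.01, iters: int = 2000):
    """Descend the regularizer along the fiber {(a e^s, b e^-s)}.

    The product a * b (the function the network computes) is preserved,
    so only the parametrization changes, not the learned map.
    """
    s = 0.0
    for _ in range(iters):
        a_s, b_s = a * math.exp(s), b * math.exp(-s)
        # d/ds of (a e^s)^2 + (b e^-s)^2
        grad = 2.0 * a_s * a_s - 2.0 * b_s * b_s
        s -= step * grad
    return a * math.exp(s), b * math.exp(-s)

# Unbalanced starting point with product 2.
a0, b0 = 4.0, 0.5
a1, b1 = balance_along_fiber(a0, b0)
print("before:", regularizer(a0, b0), "after:", regularizer(a1, b1))
print("balanced gap a^2 - b^2:", a1 * a1 - b1 * b1)
```

In this scalar case the minimum on the fiber is $2|ab|$, reached where $a^2 = b^2$, which is the scalar analogue of the balancedness condition.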
中文标题/摘要
标题:正则化意味着深度线性网络中的平衡
我们使用几何不变理论(GIT)研究深度线性网络(DLN)。使用Kempf-Ness定理建立$L^2$正则化在平衡流形上最小化。我们利用纤维的黎曼几何引入相关平衡流。由$L^2$正则化定义的平衡流被证明以均匀指数速率收敛到平衡流形。由平方动量映射定义的平衡流被显式计算并证明全局收敛。此框架允许我们将训练动力学分解为两种不同的梯度流:纤维上的正则化流和平衡流形上的学习流。它还为深度学习和线性系统理论中的平衡性提供了一个共同的数学框架。我们使用此框架将平衡性解释为快慢系统、模型降阶和贝叶斯原则。
Calibeating Made Simple
Authors: Yurong Chen, Zhiyi Huang, Michael I. Jordan, Haipeng Luo
First: 2026-03-23T16:28:07+00:00 · Latest: 2026-03-23T16:28:07+00:00
Abstract
We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the $O(\log T)$ calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal $O(\log T)$ calibeating rate.
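To make the benchmark concrete, the following sketch assumes the standard Brier-score decomposition into a refinement term and a calibration term, where outcomes are grouped by the forecast value that was announced. The running-average post-processor is a simple illustrative rule, not the reduction to regret minimization described in the abstract; the toy data and all function names are hypothetical.

```python
from collections import defaultdict

def brier(preds, outcomes):
    """Average squared error between predictions and binary outcomes."""
    return sum((a - p) ** 2 for p, a in zip(preds, outcomes)) / len(outcomes)

def refinement_and_calibration(forecasts, outcomes):
    """Split the Brier score by grouping periods on the forecast value."""
    bins = defaultdict(list)
    for c, a in zip(forecasts, outcomes):
        bins[c].append(a)
    mean = {c: sum(v) / len(v) for c, v in bins.items()}
    n = len(outcomes)
    refinement = sum((a - mean[c]) ** 2 for c, a in zip(forecasts, outcomes)) / n
    calibration = sum((mean[c] - c) ** 2 for c in forecasts) / n
    return refinement, calibration

def online_bin_average(forecasts, outcomes):
    """Post-process online: predict the past average outcome of the same bin."""
    total, count, preds = defaultdict(float), defaultdict(int), []
    for c, a in zip(forecasts, outcomes):
        # Fall back to the external forecast the first time a bin is seen.
        preds.append(total[c] / count[c] if count[c] else c)
        total[c] += a
        count[c] += 1
    return preds

# Toy miscalibrated forecaster: says 0.8 when the true rate is 0.5,
# and 0.2 when the outcome is always 0.
forecasts, outcomes = [], []
for t in range(40):
    if t % 2 == 0:
        forecasts.append(0.8)
        outcomes.append(1 if (t // 2) % 2 == 0 else 0)
    else:
        forecasts.append(0.2)
        outcomes.append(0)

b_forecast = brier(forecasts, outcomes)
r_term, k_term = refinement_and_calibration(forecasts, outcomes)
b_online = brier(online_bin_average(forecasts, outcomes), outcomes)
print(b_forecast, r_term, k_term, b_online)
```

On this sequence the post-processed loss falls between the refinement term and the original Brier score, which is the sense in which the external forecast is "beaten" by roughly its calibration error.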
ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Authors: Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu
First: 2026-03-23T16:26:11+00:00 · Latest: 2026-03-23T16:26:11+00:00
Abstract
While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probabilities of both the chosen and the rejected responses collapse. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence in favor of strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
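The asymmetry described above can be sketched on top of the standard DPO loss. The abstract does not give the form of the complexity-aware coefficient, so the constant `rejected_scale` below is a hypothetical placeholder for it, and the log-probabilities are made-up numbers; this is an illustration of scaling only the rejected reward, not the ACPO objective itself.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_loss(logp_chosen, ref_chosen, logp_rejected, ref_rejected,
                    beta=0.1, rejected_scale=1.0):
    """DPO-style loss with a scale applied only to the rejected reward.

    rejected_scale = 1.0 gives standard DPO. A value below 1.0 shrinks the
    gradient that flows through the rejected term and leaves the chosen
    term untouched. (Hypothetical constant; ACPO derives it dynamically.)
    """
    chosen_reward = beta * (logp_chosen - ref_chosen)
    rejected_reward = beta * (logp_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - rejected_scale * rejected_reward)

# Illustrative sequence log-probabilities under the policy and the reference.
standard = preference_loss(-10.0, -11.0, -12.0, -11.0)
asymmetric = preference_loss(-10.0, -11.0, -12.0, -11.0, rejected_scale=0.5)
print(standard, asymmetric)
```

With a smaller scale, pushing the rejected likelihood down buys less loss reduction, so the optimizer has less incentive to lower both likelihoods together.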
中文标题/摘要
标题:ACPO:通过不对称约束对抗视觉语言对齐中的可能性位移
尽管直接偏好优化(DPO)已成为对齐大型视觉语言模型(LVLMs)的默认方法,但它会遭受可能性位移的问题,即选择和拒绝的回答的概率都会下降。这种优化缺陷在多模态设置中尤为有害:选择的可能性下降——我们称之为视觉锚点崩溃——导致模型放弃视觉证据,转而依赖强大的语言先验,从而引发显著的幻觉。为了解决这一问题,我们提出了不对称约束偏好优化(ACPO),这是一种跨模态的对齐机制,通过动态、目标导向的缩放应用于偏好优化。ACPO 通过仅应用于拒绝奖励的复杂性感知缩放系数,不对称地抑制拒绝项的梯度流动,同时保持选择分布作为梯度稳定的参考。虽然本质上是一种通用目标,但打破这种梯度对称性对于多模态任务至关重要,因为它可以减轻语言先验对视觉标记的抑制。在 InternVL 模型上的实验表明,ACPO 有效地逆转了标准 DPO 的选择奖励退化。通过阻止视觉锚点崩溃,ACPO 在幻觉基准(HallusionBench, MM-IFEval)和通用排行榜(MMBench, MMStar, OCRBenchV2)上通常优于基线,同时推动了通用能力的同步提升。
Summary / 总结
The paper addresses the issue of Likelihood Displacement in Direct Preference Optimization (DPO) for aligning Large Vision-Language Models (LVLMs), which leads to Visual Anchor Collapse and significant hallucinations. To tackle this, the authors propose Asymmetric Constrained Preference Optimization (ACPO), which dynamically scales the rejected reward to asymmetrically suppress the gradient flow on the rejected term while preserving the chosen distribution. Experiments show that ACPO effectively reverses the chosen-reward degradation of standard DPO and improves performance on hallucination benchmarks and general leaderboards while enhancing overall model capabilities.
论文针对直接偏好优化(DPO)在大型视觉语言模型(LVLM)对齐中出现的似然位移问题,导致视觉锚点崩溃和显著的幻觉。为此,作者提出了不对称约束偏好优化(ACPO),动态调整拒绝奖励以不对称地抑制拒绝项的梯度流动,同时保持选择分布作为梯度稳定的参考。实验表明,ACPO有效逆转了标准DPO的选定奖励退化,并在幻觉基准和通用排行榜上优于基线,同时提高了一般能力。
Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes
Authors: Joanna Zou, Youssef Marzouk
Venue: ICLR
First: 2026-03-23T16:22:19+00:00 · Latest: 2026-03-23T16:22:19+00:00
Comments: Original publication at https://openreview.net/forum?id=PKGP7tg65A
Abstract
The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.
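One common way to use a DPP for subset selection is greedy MAP inference: repeatedly add the item that most increases the determinant of the kernel submatrix. The sketch below assumes scalar "descriptors" and an RBF kernel, both chosen for brevity; real molecular descriptors are high-dimensional, and the paper may select subsets differently (for example by sampling from the DPP).

```python
import math

def rbf_kernel(xs, length_scale=1.0):
    """Similarity kernel over scalar descriptors (toy stand-in)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2)) for b in xs]
            for a in xs]

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def greedy_dpp(kernel, k):
    """Greedily pick k indices that maximize det of the kernel submatrix."""
    selected, candidates = [], set(range(len(kernel)))
    for _ in range(k):
        def score(i):
            idx = selected + [i]
            return det([[kernel[r][c] for c in idx] for r in idx])
        best = max(sorted(candidates), key=score)  # ties go to lowest index
        selected.append(best)
        candidates.remove(best)
    return selected

# Three clusters of near-duplicate configurations on a 1-D descriptor axis.
descriptors = [0.0, 0.05, 0.1, 2.0, 2.05, 4.0]
chosen = greedy_dpp(rbf_kernel(descriptors), k=3)
print(sorted(descriptors[i] for i in chosen))
```

Because near-duplicate items make the submatrix nearly singular, the determinant criterion favors one representative per cluster, which is the diversity property that makes DPPs attractive for compact training sets.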
中文标题/摘要
标题:机器学习原子势能的数据整理方法:确定性点过程的应用
机器学习原子势能的发展面临一个关键的计算瓶颈,即生成和标注有用的训练数据集。我们提出了一种新颖的应用确定性点过程(DPPs)的方法,用于选择具有信息性的原子构型子集,并用参考能量和力从昂贵的量子力学方法中进行标注。通过铪氧化物数据的实验,我们展示了DPPs在利用分子描述符核构建紧凑但多样化的训练集方面与现有方法具有竞争力,从而提高了分子系统机器学习表示的准确性和鲁棒性。我们的工作指出了在异构或多元数据中应用DPPs进行无监督训练数据整理的前景,或在分子动力学模拟过程中迭代数据增强的在线主动学习方案。
Summary / 总结
The paper addresses the challenge of generating informative training datasets for machine learning interatomic potentials, proposing the use of determinantal point processes (DPPs) to select diverse and representative atomic configurations to label. Experiments on hafnium oxide data show that DPPs built on kernels of molecular descriptors are competitive with existing methods for constructing compact but diverse training sets, which improves the accuracy and robustness of machine learning models of molecular systems.
论文针对生成用于机器学习原子势能训练数据集的挑战,这是一个关键的瓶颈。它引入了使用确定性点过程(DPPs)来选择具有信息性的原子配置子集进行标记,参考能量和力来自昂贵的量子力学方法。实验结果表明,DPPs可以构建紧凑且多样化的训练集,从而提高分子系统中机器学习模型的准确性和鲁棒性,相比现有方法具有优势。
dynActivation: A Trainable Activation Family for Adaptive Nonlinearity
Authors: Alois Bachmann
First: 2026-03-23T16:18:28+00:00 · Latest: 2026-03-23T16:18:28+00:00
Comments: 22 pages, 15 figures
Abstract
This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ stands for any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU.
On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU ($7.40\%$ advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
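The formula $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$ can be written out directly. The sketch below uses Mish as the base activation and plain floats for $\alpha_i$ and $\beta_i$; in an actual network these would be trainable per-layer parameters, and the function names here are illustrative rather than taken from the paper's code.

```python
import math

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    # softplus(x) = log(1 + exp(x)); for large x it is x to float precision.
    softplus = x if x > 20.0 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)

def dyn_activation(x: float, alpha: float, beta: float, base_act=mish) -> float:
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x."""
    return base_act(x) * (alpha - beta) + beta * x

x = 0.7
print(dyn_activation(x, alpha=1.0, beta=0.0))  # pure base activation
print(dyn_activation(x, alpha=1.0, beta=1.0))  # pure linear path
print(dyn_activation(x, alpha=1.0, beta=0.5))  # halfway blend
```

Setting $\alpha_i = 1, \beta_i = 0$ recovers the base activation, while $\alpha_i = \beta_i$ removes the nonlinearity entirely, which is how a layer can "linearize" during training.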
Summary / 总结
This paper introduces dynActivation, a per-layer trainable activation that interpolates between a base nonlinearity and a linear path. It compares static and dynamic ReLU-like variants across vision and language modeling tasks and finds that dynActivation variants tend to linearize deep layers while maintaining high performance, improving training efficiency by up to 54% over ReLU. On CIFAR-10, dynActivation(Mish) outperforms static Mish by up to 14.02% on AttentionCNN, with a 24% reduction in convergence-AUC. In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below 95% test accuracy, whereas ReLU falls below 80% at 25 layers. Under FGSM at ε=0.08, dynActivation(Mish) loses 55.39% accuracy versus 62.79% for ReLU, a 7.40% advantage. In language modeling, a new dynActGLU variant reduces perplexity by 10.3% relative to SwiGLU at 5620 steps, though the gap vanishes by 34300 steps.
该论文提出了一种可训练的激活函数dynActivation,它在基非线性与线性路径之间进行插值,增强神经网络的适应性。它将dynActivation与静态ReLU-like变体在各种任务中进行了比较,显示dynActivation变体可以在保持高性能的同时线性化深层网络,提高训练效率最多54%。在CIFAR-10上,dynActivation(Mish)在AttentionCNN上的表现优于静态Mish,最高提高14.02%,收敛-AUC减少24%。在MNIST的深度扩展研究中,dynActivation即使在网络更深时仍保持高准确率,而ReLU则显著下降。在语言建模中,新提出的dynActGLU变体相比SwiGLU将困惑度降低了10.3%。