arXiv 论文速递

2026-03-21 03:46
Snapshot: 20260321_0346
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
First: 2026-03-19T17:59:58+00:00 · Latest: 2026-03-19T17:59:58+00:00
Comments: 31 pages, 12 figures
Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
中文标题/摘要
标题:生成模型了解空间:释放隐含的三维先验以促进场景理解
虽然多模态大型语言模型展示了令人印象深刻的语义能力,但它们往往在空间感知方面存在缺陷,难以进行精细的几何推理和物理动力学处理。现有解决方案通常依赖于显式的三维模态或复杂的几何结构,这些方法受限于数据稀缺性和泛化挑战。在本工作中,我们提出了一种范式转变,通过利用大规模视频生成模型中的隐含空间先验。我们认为,为了合成时空连贯的视频,这些模型会内在地学习稳健的三维结构先验和物理法则。我们引入了VEGA-3D(视频提取生成意识)框架,该框架将预训练的视频扩散模型重新用于潜空间模拟器。通过从中间噪声级别提取时空特征,并通过基于标记的自适应门控融合机制将其与语义表示集成,我们为MLLMs提供了密集的几何线索,而无需显式的三维监督。在三维场景理解、空间推理和具身操作基准测试中的广泛实验表明,我们的方法优于最先进的基线,验证了生成先验为物理世界理解提供了可扩展的基础。代码可在https://github.com/H-EmbodVis/VEGA-3D公开获取。
Summary / 总结
This work addresses the spatial limitations of multimodal large language models by proposing VEGA-3D, which leverages implicit 3D priors from video generation models. The method integrates spatiotemporal features with semantic representations to enhance MLLMs with geometric cues, achieving superior performance in 3D scene understanding and spatial reasoning benchmarks compared to existing approaches.
该研究通过利用视频生成模型中的隐式3D先验来解决多模态大语言模型的空间局限性。VEGA-3D 是一个插件即用框架,重新利用预训练的视频扩散模型来模拟一个潜在的世界,并通过时空特征和语义表示的token级自适应门控融合机制,为多模态语言模型提供密集的几何线索。实验表明,该方法在3D场景理解、空间推理和具身操作任务中优于现有最先进的方法,验证了生成先验在物理世界理解中的可扩展性基础。
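The token-level adaptive gated fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: one scalar gate per token, computed from the concatenated semantic and geometric token, with hypothetical parameters `gate_weights` and `gate_bias`. It is not the VEGA-3D implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, gate_weights, gate_bias=0.0):
    """Fuse two token sequences with one scalar gate per token.

    semantic, geometric: lists of tokens; each token is a list of d floats.
    gate_weights: 2*d floats (hypothetical learned parameters, assumed here).
    """
    fused = []
    for sem_tok, geo_tok in zip(semantic, geometric):
        # The gate looks at both views of the same token.
        concat = sem_tok + geo_tok
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        g = sigmoid(logit)
        # g near 1 favours the geometric feature, g near 0 the semantic one.
        fused.append([(1.0 - g) * s + g * q for s, q in zip(sem_tok, geo_tok)])
    return fused


# With zero gate parameters every gate is 0.5, so the result is a plain average.
example = gated_fusion([[0.0, 2.0]], [[2.0, 0.0]], gate_weights=[0.0] * 4)
print(example)
```

A per-token gate lets each token decide how much geometric signal to admit, which is one plausible reading of "adaptive"; a per-channel gate would be an equally reasonable variant.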
Matryoshka Gaussian Splatting
Authors: Zhilin Guo, Boqiao Zhang, Hakan Aktas, Kyle Fogarty, Jeffrey Hu, Nursena Koprucu Aslan, Wenzhao Li, Canberk Baykal, Albert Miao, Josef Bengtson, Chenliang Zhou, Weihao Xia, Cristina Nader Vasconcelos, Cengiz Oztireli
First: 2026-03-19T17:59:56+00:00 · Latest: 2026-03-19T17:59:56+00:00
Comments: project page: https://zhilinguo.github.io/MGS
Abstract
The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
中文标题/摘要
标题:Matryoshka 高斯斑点化
从单一模型以可调节保真度渲染场景的能力,即层次细节(LoD),对于3D高斯斑点化(3DGS)的实际部署至关重要。现有的离散LoD方法仅暴露有限的操作点集,而同时进行的连续LoD方法虽然能够更平滑地扩展,但往往在全容量时会遭受明显的质量下降,使LoD成为一个昂贵的设计选择。我们引入了Matryoshka高斯斑点化(MGS),这是一种训练框架,能够在标准3DGS流水线中实现连续LoD而不牺牲全容量渲染质量。MGS学习一个有序的高斯集合,使得渲染任何前缀,前k个斑点,产生一个连贯的重建,其保真度随着预算增加而平滑提高。我们的核心思想是随机预算训练:每次迭代都采样一个随机的斑点预算,并优化相应的前缀和整个集合。该策略只需要两次前向传递,并且不需要架构修改。在四个基准和六个基线上的实验表明,MGS在匹配其主干的全容量性能的同时,能够从单一模型中实现连续的速度-质量权衡。广泛的消融实验进一步验证了排序策略、训练目标和模型容量的设计。
Summary / 总结
Matryoshka Gaussian Splatting (MGS) addresses the challenge of rendering scenes at adjustable fidelity from a single model by introducing a training framework that enables continuous level of detail (LoD) without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians, allowing for a coherent reconstruction whose fidelity improves smoothly with increasing budget. The key method is stochastic budget training, which optimizes both the corresponding prefix and the full set at each iteration. Experiments show that MGS matches full-capacity performance while providing a continuous speed-quality trade-off from a single model.
论文提出了Matryoshka Gaussian Splatting (MGS),这是一种训练框架,可以在不牺牲全容量渲染质量的情况下,为3D Gaussian Splatting (3DGS)提供连续的LoD。MGS学习一个有序的高斯集合,使得随着渲染更多斑点,可以得到更加连贯且保真的重建。该方法使用随机预算训练,只需要两次前向传播且无需修改架构。实验表明,MGS能够匹配全容量性能,同时从单一模型提供平滑的速度-质量权衡。
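The stochastic budget training step described in the abstract can be sketched with a toy example. The `render` function below is a hypothetical stand-in for a 3DGS rasteriser (it just sums scalar contributions), and combining the two losses as an unweighted sum is an assumption; the abstract only states that both the sampled prefix and the full set are optimised, using two forward passes.

```python
import random


def render(splats, k):
    """Toy stand-in for a rasteriser: 'renders' the first k splats as a sum."""
    return sum(splats[:k])


def budget_loss(splats, target, rng):
    """One stochastic-budget step: sample k, then score the prefix and the full set."""
    n = len(splats)
    k = rng.randint(1, n)  # random splat budget, inclusive on both ends
    prefix_loss = (render(splats, k) - target) ** 2  # first forward pass
    full_loss = (render(splats, n) - target) ** 2    # second forward pass
    return k, prefix_loss + full_loss  # unweighted sum: an assumption


rng = random.Random(0)
# Ordered set: earlier splats carry more of the signal.
splats = [0.5, 0.25, 0.125, 0.125]
k, loss = budget_loss(splats, target=1.0, rng=rng)
print(k, loss)
```

Because every sampled prefix is penalised during training, the ordering is pushed toward placing the most useful splats first, which is what makes rendering "the first k splats" meaningful at inference time.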
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
Venue: CVPR 2026
First: 2026-03-19T17:59:55+00:00 · Latest: 2026-03-19T17:59:55+00:00
Comments: Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD
Abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
中文标题/摘要
标题:立方离散扩散:高维表示令牌上的离散视觉生成
使用离散令牌进行视觉生成获得了显著关注,因为它允许与语言模型共享统一的令牌预测范式,有望实现无缝的多模态架构。然而,当前的离散生成方法仍然局限于低维潜变量(通常为8-32维),牺牲了理解所需的语义丰富性。虽然高维预训练表示(768-1024维)可以弥补这一差距,但它们的离散生成提出了根本性的挑战。在本文中,我们提出了立方离散扩散(CubiD),这是第一个用于高维表示的离散生成模型。CubiD 在高维离散表示中执行精细粒度的掩码——任何维度在任何位置都可以被掩码并从部分观察中预测。这使模型能够学习丰富的空间位置内和跨位置的相关性,生成步骤数固定为 $T$,与特征维度无关,其中 $T \ll hwd$。在ImageNet-256上,CubiD 达到了最先进的离散生成效果,从9亿到37亿参数具有强大的扩展行为。至关重要的是,我们验证这些离散化令牌保留了原始表示能力,证明了相同的离散令牌可以有效地服务于理解和生成任务。我们希望这项工作能够激发未来研究向统一的多模态架构方向发展。代码可在:https://github.com/YuqingWang1029/CubiD 获取。
Summary / 总结
Cubic Discrete Diffusion (CubiD) addresses the challenge of discrete generation for high-dimensional representations by masking and predicting any dimension in the high-dimensional discrete space. This method allows for learning rich correlations within and across spatial positions, achieving state-of-the-art results on ImageNet-256 with strong scaling from 900M to 3.7B parameters. The discretized tokens maintain the original representation capabilities, enabling both understanding and generation tasks. This work paves the way for unified multimodal architectures.
Cubic Discrete Diffusion (CubiD) 是首个用于高维表示的离散生成模型,通过在高维离散空间中进行精细粒度的掩码和预测,CubiD 学习了丰富的相关性,并在 ImageNet-256 上实现了离散生成的最新成果,具有良好的可扩展性。该模型验证了离散化令牌既能用于理解又能用于生成任务,展示了统一多模态架构的潜力。
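The fine-grained masking described in the abstract, where any dimension at any position may be hidden, can be sketched on a small h x w x d grid of discrete codes. The `MASK` id, the uniform random choice of entries, and the fixed masking ratio are assumptions made for illustration; the abstract does not specify the masking schedule.

```python
import random

MASK = -1  # hypothetical id marking a masked entry


def mask_cube(tokens, ratio, rng):
    """Mask a random subset of entries in an h x w x d grid of discrete codes."""
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    # Every (row, column, channel) triple is an independent masking candidate.
    coords = [(i, j, c) for i in range(h) for j in range(w) for c in range(d)]
    n_mask = round(ratio * len(coords))
    masked = [[pos[:] for pos in row] for row in tokens]  # copy, keep input intact
    for i, j, c in rng.sample(coords, n_mask):
        masked[i][j][c] = MASK
    return masked, n_mask


rng = random.Random(0)
# 2 x 2 spatial grid, 3 dimensions per position, codebook of size 16.
tokens = [[[rng.randrange(16) for _ in range(3)] for _ in range(2)] for _ in range(2)]
masked, n_mask = mask_cube(tokens, ratio=0.5, rng=rng)
print(n_mask, masked)
```

Masking individual entries rather than whole positions is what lets a model be trained to predict one dimension of a token from the other dimensions at the same position as well as from its neighbours.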
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
Authors: Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu
First: 2026-03-19T17:59:52+00:00 · Latest: 2026-03-19T17:59:52+00:00
Comments: Project page: https://lihaitian.com/MonoArt
Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
中文标题/摘要
标题:MonoArt:单目铰接式 3D 重建的渐进结构推理
从单张图像重建铰接式 3D 对象需要联合推断对象几何、部件结构和运动参数,但视觉证据有限。关键难点在于运动线索与对象结构的纠缠,这使得直接的铰接回归不稳定。现有方法通过多视角监督、基于检索的组装或辅助视频生成来应对这一挑战,但往往牺牲了可扩展性或效率。我们提出了 MonoArt,这是一种基于渐进结构推理的统一框架。MonoArt 不是从图像特征直接预测铰接参数,而是逐步将视觉观察转化为标准几何、结构化部件表示和运动感知嵌入,全部在单一架构中完成。这种结构化推理过程使得在没有外部运动模板或多阶段流水线的情况下,能够实现稳定且可解释的铰接推断。在 PartNet-Mobility 上的广泛实验表明,MonoArt 在重建精度和推理速度方面均达到最先进的性能。该框架进一步推广到机器人操作和铰接式场景重建。
Summary / 总结
MonoArt addresses the challenge of reconstructing articulated 3D objects from a single image by progressively transforming visual observations into canonical geometry and motion-aware embeddings within a single architecture. This method avoids the instability of direct articulation regression and achieves state-of-the-art performance in both reconstruction accuracy and inference speed on PartNet-Mobility. The framework also generalizes to robotic manipulation and articulated scene reconstruction.
MonoArt 通过在单一架构中逐步将视觉观察转换为标准几何、结构化部件表示和运动感知嵌入,来解决从单张图像重建铰接式 3D 对象的挑战。这种方法避免了直接铰接回归的不稳定性,并在 PartNet-Mobility 上实现了在重建精度和推理速度方面的最先进性能。该框架还适用于机器人操作和铰接式场景重建。
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
Authors: Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: Project Website: https://navtrust.github.io
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
中文标题/摘要
标题:NavTrust:评估实体导航可信度基准
实体导航主要分为两类:视觉-语言导航(VLN),其中代理遵循自然语言指令导航;以及目标-对象导航(OGN),其中代理导航至指定目标对象。然而,现有工作主要在理想条件下评估模型性能,忽视了真实世界环境中可能出现的潜在干扰。为解决这一问题,我们提出了NavTrust,这是一个统一基准,系统地在现实场景中干扰输入模态,包括RGB、深度和指令,并评估这些干扰对导航性能的影响。据我们所知,NavTrust是第一个在统一框架中使实体导航代理暴露于多种RGB-Depth干扰和指令变化的基准。我们对七种最先进的方法进行了广泛评估,发现它们在现实干扰下的性能显著下降,这突显了关键的鲁棒性差距,并为更可信的实体导航系统指明了道路。此外,我们系统地评估了四种不同的缓解策略,以增强对RGB-Depth和指令干扰的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们在一个真实的移动机器人上部署了它们,并观察到对干扰的鲁棒性有所提高。项目网站:https://navtrust.github.io
Summary / 总结
NavTrust benchmarks the trustworthiness of embodied navigation by systematically corrupting RGB, depth, and instructions in realistic scenarios. It evaluates seven state-of-the-art approaches and finds significant performance degradation under realistic corruptions, highlighting robustness gaps. NavTrust also assesses four mitigation strategies, showing improved robustness in real-world settings with base models Uni-NaVid and ETPNav.
NavTrust 通过在视觉-语言导航和对象-目标导航任务中引入真实的 RGB、深度和指令扰动来评估具身导航的可信度。它评估了七种最先进的方法,并发现实际条件下性能显著下降,表明存在关键的鲁棒性差距。NavTrust 还评估了四种不同的缓解策略以增强鲁棒性。项目网站是: https://navtrust.github.io.
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: 24 pages, 12 figures
Abstract
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
中文标题/摘要
标题:SAMA:因子化语义锚定与运动对齐的指令引导视频编辑
当前的指令引导视频编辑模型难以同时平衡精确的语义修改与忠实的运动保留。现有方法依赖于注入显式的外部先验(例如,VLM特征或结构条件)来缓解这些问题,但这种依赖严重限制了模型的鲁棒性和泛化能力。为克服这一限制,我们提出了SAMA(因子化语义锚定与运动对齐),一种将视频编辑分解为语义锚定和运动建模的框架。首先,我们引入了语义锚定,通过联合预测稀疏锚帧的语义标记和视频潜在变量来建立可靠的视觉锚点,从而实现纯粹基于指令的结构规划。其次,运动对齐在以运动为中心的视频恢复预训练任务(立方体填充、速度扰动和管子打乱)上预训练相同的骨干网络,使模型能够直接从原始视频中内化时间动态。SAMA 通过两阶段优化:无配对视频-指令编辑数据的因子化预训练阶段,学习固有的语义-运动表示,随后在配对编辑数据上进行监督微调。值得注意的是,仅因子化预训练就已表现出强大的零样本视频编辑能力,验证了所提出的分解的有效性。SAMA 在开源模型中达到了最先进的性能,并且与领先的商业系统(例如 Kling-Omni)竞争。代码、模型和数据集将被发布。
Summary / 总结
SAMA addresses the challenge of balancing precise semantic modifications with motion preservation in instruction-guided video editing. It factorizes video editing into semantic anchoring and motion modeling. Semantic Anchoring establishes a visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, and Motion Alignment pre-trains the same backbone on motion-centric video restoration tasks so that it internalizes temporal dynamics. The factorized pre-training alone already yields strong zero-shot video editing ability, and after supervised fine-tuning SAMA achieves state-of-the-art performance among open-source models while remaining competitive with leading commercial systems such as Kling-Omni.
SAMA 解决了在指令引导的视频编辑中精确的语义修改与忠实的运动保留之间的平衡问题。它将视频编辑分解为语义锚定和运动建模。SAMA 引入了语义锚定,通过在稀疏帧上联合预测语义标记和视频潜在变量来建立可靠的视觉锚点,并通过在运动中心任务上的预训练来训练模型。SAMA 的两阶段优化实现了强大的零样本视频编辑能力,并在开源模型中达到了最先进的性能,与领先的商业系统相当。
Under One Sun: Multi-Object Generative Perception of Materials and Illumination
Authors: Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino
First: 2026-03-19T17:59:45+00:00 · Latest: 2026-03-19T17:59:45+00:00
Abstract
We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.
中文标题/摘要
标题:同一天空之下:多对象生成感知材料与照明
我们介绍了多对象生成感知(MultiGP),这是一种生成逆渲染方法,可以从单张图像中随机采样所有辐射成分——反射率、纹理和照明,以解析对象外观。我们通过利用同一场景中虽然纹理和反射率可能不同,但所有物体都由相同的照明照亮这一事实,来解决这种固有的辐射成分解耦问题。MultiGP 通过四个关键技术贡献利用这种共识:级联端到端架构结合图像空间和角度空间解耦;协调引导以实现扩散收敛到一致的照明估计;轴向注意力以促进不同反射率对象之间的“交流”;以及纹理提取控制网以保留高频纹理细节并确保与估计照明解耦。实验结果表明,MultiGP 能够有效利用多个对象外观的空间和频率互补特性,恢复个体纹理和反射率以及共同的照明。
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta
First: 2026-03-19T17:59:41+00:00 · Latest: 2026-03-19T17:59:41+00:00
Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication
Abstract
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
中文标题/摘要
标题:FinTradeBench:LLMs的金融推理基准
现实世界的金融决策是一个具有挑战性的问题,需要在公司基本面(来自监管文件的公司基础数据)和价格动态计算的交易信号等多种信号之间进行推理。近年来,随着大型语言模型(LLMs)的发展,金融分析师开始使用它们进行金融决策任务。然而,现有的金融问答基准主要侧重于公司资产负债表数据,很少评估公司股票在市场上的交易情况及其与基本面的互动。为了充分利用两种方法的优势,我们引入了FinTradeBench,这是一个结合公司基本面和交易信号的金融推理基准。FinTradeBench 包含了1400个基于纳斯达克100公司的问题,时间跨度为十年。基准测试分为三大类推理问题:以基本面为主、以交易信号为主,以及需要跨信号推理的混合问题。为了确保大规模的可靠性,我们采用了校准-扩展框架,结合了专家种子问题、多模型响应生成、模型内自我筛选、数值审计和人-LLM法官对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著提高了对文本基本面的推理能力,但在交易信号推理方面提供的帮助有限。这些发现突显了当前LLM在数值和时间序列推理方面的基本挑战,并激发了未来金融智能研究的动力。
Summary / 总结
FinTradeBench is a benchmark for evaluating financial reasoning in LLMs by integrating company fundamentals and trading signals. It contains 1,400 questions on NASDAQ-100 companies over a ten-year period, categorized into three types: fundamentals-focused, trading-signal-focused, and hybrid. The evaluation covers 14 LLMs under zero-shot and retrieval-augmented settings and reveals a clear performance gap: retrieval substantially improves reasoning over textual fundamentals but offers only limited benefit for trading-signal reasoning. This highlights challenges in numerical and time-series reasoning for current LLMs and motivates future research in financial intelligence.
FinTradeBench 是一个用于评估 LLM 在金融推理方面能力的基准,通过结合公司基本面和交易信号。它包含 1,400 个问题,覆盖 NASDAQ-100 公司长达十年的历史数据,分为三类:基本面导向、交易信号导向和混合型。评估涉及 14 种 LLM,在零样本和检索增强设置下,显示出明显的性能差距,检索在提高对文本基本面的推理方面有显著效果,但在交易信号推理方面效果有限。这突显了当前 LLM 在数值和时间序列推理方面的挑战,并激励未来在金融智能方面的研究。
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
Authors: Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding
Venue: CVPR 2026
First: 2026-03-19T17:59:22+00:00 · Latest: 2026-03-19T17:59:22+00:00
Comments: CVPR 2026, Project Page: https://henghuiding.com/EffectErase/
Abstract
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching, together with an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
中文标题/摘要
标题:EffectErase:联合视频对象去除与插入以实现高质量效果去除
视频对象去除旨在消除动态目标对象及其视觉效果,如变形、阴影和反射,同时恢复无缝背景。基于扩散的视频修补和对象去除方法可以去除对象,但往往难以消除这些效果并合成连贯的背景。除了方法限制外,进展还受到缺乏一个全面的数据集的阻碍,该数据集系统地捕捉了不同环境中的常见对象效果,用于训练和评估。为了解决这个问题,我们引入了VOR(视频对象去除)数据集,该数据集提供了多样化的配对视频,每个视频都包含一个目标对象及其效果存在的视频和一个目标对象及其效果不存在的对应视频,以及相应的对象掩码。VOR包含来自捕获和合成源的60,000个高质量视频配对,涵盖了五种效果类型,并涵盖了广泛的对象类别以及复杂的动态多对象场景。基于VOR,我们提出了EffectErase,这是一种效果感知的视频对象去除方法,将视频对象插入视为互逆学习方案中的逆辅助任务。该模型包括任务感知区域指导,专注于受影响区域的学习,并允许灵活的任务切换。然后,插入去除一致性目标,鼓励互补行为并共享效果区域和结构线索的定位。在VOR上训练后,EffectErase在广泛的实验中表现出色,实现了在各种场景中高质量的视频对象效果去除。
Summary / 总结
The research aims to improve video object removal by addressing the limitations of existing methods in erasing object effects and synthesizing coherent backgrounds. To achieve this, a new dataset VOR is introduced, which provides diverse paired videos with and without target objects and their effects. Building on VOR, EffectErase is proposed, an effect-aware method that uses a reciprocal learning scheme and an insertion-removal consistency objective to focus on affected areas and encourage complementary behaviors. Extensive experiments show that EffectErase outperforms existing methods in erasing object effects and synthesizing high-quality backgrounds across various scenarios.
研究旨在通过解决现有方法在消除变形、阴影等视觉效果方面的局限性,提高视频对象去除的效果。研究引入了VOR数据集,该数据集涵盖了各种环境下的对象效果。基于VOR,提出了EffectErase方法,该方法采用互逆学习方案和插入-去除一致性目标,专注于受影响区域并增强效果区域和结构线索的定位。实验表明,EffectErase在不同场景下优于现有方法,能够高质量地去除视频对象效果。
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
First: 2026-03-19T17:59:21+00:00 · Latest: 2026-03-19T17:59:21+00:00
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
中文标题/摘要
标题:F2LLM-v2:包容、高性能且高效的多语言嵌入模型
我们提出了F2LLM-v2,这是一种新的通用多语言嵌入模型系列,包含8种不同规模,从80M到14B。该模型基于6000万条高质量公开数据样本的新综合训练而成,支持超过200种语言,特别强调了之前未充分服务的中低资源语言。通过结合两阶段基于LLM的嵌入训练流水线、matryoshka学习、模型剪枝和知识蒸馏技术,我们展示了比之前基于LLM的嵌入模型更高效的模型,同时保持了竞争力。广泛的评估证实,F2LLM-v2-14B在11个MTEB基准测试中排名第一,而该系列中的较小模型也为资源受限的应用设定了新的标准。为了促进开源嵌入模型研究,我们发布了所有模型、数据、代码和中间检查点。
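The abstract names matryoshka learning as one ingredient of the training pipeline. As commonly understood, Matryoshka-style training makes the leading coordinates of an embedding usable on their own, so a consumer can truncate a vector to fit a storage or latency budget. The sketch below shows only that consumer-side truncation; the vectors are made up, and whether F2LLM-v2 is meant to be used in exactly this way is an assumption.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale the prefix to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 4-dimensional embeddings standing in for model outputs.
query = [3.0, 4.0, 1.0, 0.5]
doc = [3.0, 4.0, -1.0, 2.0]

q2 = truncate_embedding(query, 2)
d2 = truncate_embedding(doc, 2)
print(q2, cosine(q2, d2))
```

Re-normalising after truncation matters because cosine similarity assumes unit-length inputs, and a prefix of a unit vector is generally shorter than unit length.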
Spectrally-Guided Diffusion Noise Schedules
Authors: Carlos Esteves, Ameesh Makadia
First: 2026-03-19T17:59:12+00:00 · Latest: 2026-03-19T17:59:12+00:00
Abstract
Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
中文标题/摘要
标题:光谱引导的扩散噪声调度
去噪扩散模型广泛用于高质量图像和视频生成。它们的性能取决于噪声调度,这定义了在训练期间应用的噪声水平的分布以及在采样期间遍历的噪声水平序列。噪声调度通常由手工设计,并且需要在不同分辨率下进行手动调整。在本文中,我们提出了一种基于图像的光谱属性设计实例特定噪声调度的原理性方法。通过推导最小和最大噪声水平有效性的理论界,我们设计了“紧凑”的噪声调度,消除了冗余步骤。在推理过程中,我们建议有条件地采样此类噪声调度。实验表明,我们的噪声调度在单阶段像素扩散模型的生成质量方面有所改进,特别是在低步骤区间。
Summary / 总结
This paper addresses the challenge of optimizing noise schedules for denoising diffusion models, which are crucial for high-quality image and video generation. The authors propose a method that leverages the spectral properties of images to design per-instance noise schedules, reducing redundant steps and improving generative quality, especially in the low-step regime. Experiments demonstrate the effectiveness of their approach in enhancing the quality of single-stage pixel diffusion models.
本文探讨了为去噪扩散模型设计有效噪声调度的问题,这对于高质量图像和视频生成至关重要。作者提出了一种基于图像频谱特性的方法来创建实例特定的噪声调度,这些调度在理论上优化以消除冗余步骤。实验结果表明,这些噪声调度可以提高单阶段像素扩散模型的生成质量,尤其是在较少采样步骤的情况下。
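To make the notion of a sampling noise schedule concrete, the sketch below builds a sequence of noise levels spaced evenly in log space between a maximum and a minimum. This is a generic handcrafted schedule, not the spectrally derived one from the paper, and the numeric bounds are invented for illustration. It shows why tighter bounds help when few steps are available: the same step count covers a narrower range, so consecutive levels sit closer together.

```python
def geometric_schedule(sigma_max, sigma_min, steps):
    """Noise levels from sigma_max down to sigma_min, evenly spaced in log space."""
    if steps < 2:
        raise ValueError("need at least two steps")
    # Constant multiplicative factor between consecutive noise levels.
    ratio = (sigma_min / sigma_max) ** (1.0 / (steps - 1))
    return [sigma_max * ratio ** i for i in range(steps)]


# Hypothetical bounds: a wide default range and a tighter per-image range.
wide = geometric_schedule(80.0, 0.002, 8)
tight = geometric_schedule(10.0, 0.05, 8)

# Size of the jump between the first two levels in each schedule.
wide_jump = wide[0] / wide[1]
tight_jump = tight[0] / tight[1]
print(wide_jump, tight_jump)
```

In the paper's setting the bounds would come from the image's spectral properties rather than being fixed by hand; how those bounds are derived is not reproduced here.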
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
First: 2026-03-19T17:58:52+00:00 · Latest: 2026-03-19T17:58:52+00:00
Comments: We release the model and data at https://huggingface.co/collections/nvidia/nemotron-cascade-2
Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
中文标题/摘要
标题:Nemotron-Cascade 2:级联RL和多域在线策略蒸馏下的后训练大语言模型
我们介绍了Nemotron-Cascade 2,这是一个开放的30B模型,具有3B激活参数,提供最佳推理和强大的代理能力。尽管其体积紧凑,但在数学和编程推理性能方面接近前沿开放模型。它是继DeepSeekV3.2-Speciale-671B-A37B之后第二个在2025年国际数学奥林匹克(IMO)、国际信息学奥林匹克(IOI)和ICPC世界总决赛中获得金牌水平表现的开放权重大语言模型,显示出极高的智能密度,参数量减少20倍。与Nemotron-Cascade 1相比,关键的技术进步如下。在精心策划的数据集上进行SFT后,我们大幅扩展了级联RL,涵盖了更广泛的推理和代理领域。此外,我们引入了多域在线策略蒸馏,从每个领域的最强中间教师模型中进行蒸馏,整个级联RL过程中,这使我们能够高效地恢复基准回归并保持强大的性能提升。我们发布了模型检查点和训练数据的集合。
Summary / 总结
Nemotron-Cascade 2 is an open 30B mixture-of-experts (MoE) model with 3B activated parameters that targets strong reasoning and agentic capabilities. According to the authors, it is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to reach gold-medal-level performance at the 2025 IMO, IOI, and ICPC World Finals, with roughly 20x fewer parameters. Post-training starts with SFT on a curated dataset, then expands Cascade RL to more reasoning and agentic domains and adds multi-domain on-policy distillation from the strongest intermediate teacher in each domain to recover benchmark regressions. The model checkpoints and training data are released on Hugging Face.
Nemotron-Cascade 2 是一个30B参数的模型,其中有3B参数是激活的,它在推理和代理能力方面表现出色,即使在较小的规模下也能在国际重要比赛中取得优异成绩。该模型采用扩展的Cascade RL和多领域在线策略蒸馏等后训练技术,保持了强大的性能并减少了参数数量。该模型展示了高智能密度,并已开源,提供了数据集和检查点。
DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Authors: Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu
First: 2026-03-19T17:58:22+00:00 · Latest: 2026-03-19T17:58:22+00:00
Comments: Project Page: https://paryi555.github.io/DriveTok/ Code: https://github.com/paryi555/DriveTok
Abstract
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
中文标题/摘要
标题:DriveTok:统一多视图重建与理解的3D驾驶场景分词
随着视觉-语言-动作模型和世界模型在自动驾驶系统中的广泛应用,可扩展的图像分词成为视觉模态与系统交互的关键接口。然而,大多数现有的分词器都是为单目和2D场景设计的,当应用于高分辨率多视图驾驶场景时,会导致效率低下和视图间不一致。为了解决这个问题,我们提出了DriveTok,一种用于统一多视图重建与理解的高效3D驾驶场景分词器。DriveTok首先从视觉基础模型中获取丰富的语义视觉特征,然后通过3D可变形交叉注意力将它们转换为场景分词。在解码过程中,我们采用多视图变换器从场景分词重构多视图特征,并使用多个头获得RGB、深度和语义重建。我们还在场景分词上直接添加了一个3D头,用于3D语义占用预测,以提高空间意识。通过多个训练目标,DriveTok学习了综合语义、几何和纹理信息的统一场景分词,以实现高效的多视图分词。在广泛使用的nuScenes数据集上的大量实验表明,DriveTok的场景分词在图像重建、语义分割、深度预测和3D占用预测任务中表现良好。
Summary / 总结
DriveTok addresses the inefficiency and inter-view inconsistency of existing monocular, 2D-oriented tokenizers on high-resolution multi-view driving scenes. It extracts semantically rich visual features from vision foundation models and converts them into scene tokens with 3D deformable cross-attention. A multi-view transformer decodes the scene tokens back into multi-view features, from which separate heads produce RGB, depth, and semantic reconstructions, while a 3D head applied directly to the scene tokens predicts 3D semantic occupancy. Experiments on nuScenes show that the scene tokens perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction.
DriveTok旨在解决现有高分辨率多视图驾驶场景中传统分词器的低效性和视图间不一致性问题。它提出了一种3D驾驶场景分词器,通过视觉基础模型提取的语义丰富的视觉特征和3D可变形交叉注意力将这些特征转换为场景分词。多视图变压器用于多视图重建,并添加了一个3D分词器头进行3D语义占用预测。在nuScenes数据集上的实验表明,DriveTok在图像重建、语义分割、深度预测和3D占用预测任务中表现良好。
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
First: 2026-03-19T17:58:13+00:00 · Latest: 2026-03-19T17:58:13+00:00
Comments: Project page: https://kd-tao.github.io/LVOmniBench/
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
中文标题/摘要
标题:LVOmniBench:面向多模态LLM的长音频视频理解评估先驱
近年来,多模态大型语言模型(OmniLLM)在理解音频和视频输入方面取得了显著进步。然而,当前的评估主要集中在10秒到5分钟的短音频和视频片段上,未能反映实际应用中的需求,而实际应用中的视频通常持续数十分钟。为解决这一关键缺口,我们引入了LVOmniBench,这是一个专门用于长格式音频和视频跨模态理解的新基准。该数据集包含来自开放平台的高质量视频,这些视频具有丰富的音频-视觉动态。通过严格的手动选择和标注,LVOmniBench 包含275个视频,时长从10分钟到90分钟不等,以及1,014个问答(QA)对。LVOmniBench旨在严格评估OmniLLM在各个领域的能力,包括长期记忆、时间定位、细粒度理解以及多模态感知。我们的广泛评估表明,当前的OmniLLM在处理扩展的音频-视觉输入时面临重大挑战。开源模型通常准确率低于35%,而Gemini 3 Pro达到的峰值准确率约为65%。我们预计,该数据集以及我们的实证研究结果将激发进一步的研究,并促进开发能够解决长格式音频-视频上下文中的复杂跨模态理解问题的先进模型。
Summary / 总结
LVOmniBench is a new benchmark for evaluating cross-modal comprehension of long-form audio and video, addressing the fact that existing evaluations focus on clips of 10 seconds to 5 minutes. It includes 275 manually selected and annotated videos ranging from 10 to 90 minutes and 1,014 question-answer pairs covering long-term memory, temporal localization, fine-grained understanding, and multimodal perception. The evaluation shows that current omnimodal large language models struggle with long-form inputs: open-source models generally score below 35% accuracy, while Gemini 3 Pro peaks at around 65%.
LVOmniBench 是一个新的基准,用于评估长形式音频和视频的跨模态理解能力,解决了现有短形式评估的局限性。它包含275段从10到90分钟不等的视频和1,014个问答对。评估结果显示,当前的跨模态大型语言模型在处理长形式输入时存在困难,开源模型的准确率低于35%,而Gemini 3 Pro的准确率约为65%。
DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
Authors: Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou
First: 2026-03-19T17:58:11+00:00 · Latest: 2026-03-19T17:58:11+00:00
Abstract
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
中文标题/摘要
标题:DreamPartGen: 基于语义的部件级3D生成通过协作去噪
理解和生成由有意义部件组成的3D对象是人类感知和推理的基础。然而,大多数文本到3D的方法忽略了部件的语义和功能结构。虽然最近的部件感知方法引入了分解,但它们仍然主要集中在几何结构上,缺乏语义基础,无法建模部件如何与文本描述或部件间关系对齐。我们提出DreamPartGen,一种基于语义的部件感知文本到3D生成框架。DreamPartGen引入了双部件潜在变量(DPLs),联合建模每个部件的几何形状和外观,并引入了关系语义潜在变量(RSLs),捕捉从语言中推导出的部件间依赖关系。同步的联合去噪过程确保了几何和语义的一致性,使3D合成具有连贯性、可解释性和文本对齐性。在多个基准测试中,DreamPartGen在几何保真度和文本形状对齐方面达到了最先进的性能。
Summary / 总结
DreamPartGen is designed to address the limitations of existing text-to-3D methods by introducing a framework that models both the geometry and appearance of 3D object parts, as well as their semantic relationships. It uses Duplex Part Latents and Relational Semantic Latents to ensure geometric and semantic consistency, resulting in more coherent and interpretable 3D generation that aligns well with textual descriptions. Experiments show that DreamPartGen outperforms previous methods in terms of geometric fidelity and text-shape alignment across various benchmarks.
DreamPartGen旨在通过考虑部件的语义和功能结构来生成3D物体,解决现有文本到3D方法的局限性。它使用双重部件潜在变量来同时建模每个部件的几何形状和外观,并使用关系语义潜在变量来从文本中捕捉部件之间的依赖关系。该框架采用同步共去噪过程来确保几何和语义的一致性,从而实现连贯且与文本对齐的3D合成。实验表明,DreamPartGen在几何保真度和文本形状对齐方面优于现有方法,在多个基准测试中表现出色。
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00
Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
中文标题/摘要
标题:VLMs是否需要视觉变换器?评估状态空间模型作为视觉编码器
大型视觉-语言模型(VLMs)通常使用冻结的视觉主干,其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉主干,但我们询问状态空间模型(SSM)视觉主干是否可以成为强有力的替代品。我们在受控环境中系统地评估了SSM视觉主干在VLMs中的表现。在匹配的ImageNet-1K初始化下,SSM主干在VQA和定位/标注任务中均表现出最强的整体性能。我们进一步适应了SSM和ViT家族的主干,并进行了检测或分割训练,发现密集任务调整通常在家族中提高了性能;在这一适应后,SSM主干仍具有竞争力,但模型规模要小得多。我们还观察到,(i) 更高的ImageNet准确度或更大的主干并不一定能可靠地转化为更好的VLM性能,(ii) 一些视觉主干在定位方面不稳定。基于这些发现,我们提出了稳定策略,以提高两种主干家族的鲁棒性,并强调SSM主干作为VLMs中基于变换器视觉编码器的强有力替代品。
Summary / 总结
This study investigates whether state space model (SSM) vision backbones can be a strong alternative to transformer-based encoders in large vision-language models (VLMs). In a controlled setting with matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across VQA and grounding/localization. After both SSM and ViT-family backbones are adapted with detection or segmentation training, dense-task tuning generally improves performance across families, and the SSM backbone remains competitive at a substantially smaller model scale. The study also finds that higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance and that some visual backbones are unstable in localization, and it proposes stabilization strategies that improve robustness for both backbone families.
研究评估了状态空间模型(SSM)在大型视觉语言模型(VLM)中的表现,发现SSM在VQA和定位/检测任务中优于基于变换器的编码器,尤其是在匹配的ImageNet-1K初始化条件下。经过检测或分割训练的适应后,SSM编码器仍具有竞争力且规模更小。研究还指出,更高的ImageNet准确度或更大的模型并不总是能提高VLM性能,并提出了稳定策略以提高两种编码器家族的鲁棒性。
Robustness, Cost, and Attack-Surface Concentration in Phishing Detection
Authors: Julian Allagan, Mohamed Elbakary, Zohreh Safari, Weizheng Gao, Gabrielle Morgan, Essence Morgan, Vladimir Deriglazov
First: 2026-03-19T17:53:32+00:00 · Latest: 2026-03-19T17:53:32+00:00
Comments: 14 pages, 4 figures, 9 tables
Abstract
Phishing detectors built on engineered website features attain near-perfect accuracy under i.i.d. evaluation, yet deployment security depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under explicit attacker budgets. Three diagnostics are introduced: minimal evasion cost (MEC), the evasion survival rate $S(B)$, and the robustness concentration index (RCI). On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted sanitization-style evasion, robustness converges across architectures: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets exhibit 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances admit evasion through a single feature transition of minimal cost $c_{\min}$, no classifier can raise the corresponding MEC quantile above $c_{\min}$ without modifying the feature representation or cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.
Summary / 总结
The study investigates the robustness of phishing detectors to post-deployment feature manipulation using a cost-aware evasion framework with three diagnostics: minimal evasion cost (MEC), the evasion survival rate $S(B)$, and the robustness concentration index (RCI). On the UCI Phishing Websites benchmark, Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted evasion, however, robustness converges across models: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. The authors formalize this convergence and conclude that adversarial robustness in phishing detection is governed by feature economics rather than model complexity.
研究探讨了钓鱼检测器在部署后对抗特征操纵的鲁棒性。引入了最小欺骗成本(MEC)、欺骗生存率 $S(B)$ 和鲁棒性集中指数(RCI)等诊断指标。在UCI钓鱼网站基准上,各种模型在静态评估下实现了高AUC值。然而,在预算化清洁式欺骗下,鲁棒性在不同架构间趋于一致,最小成本等于2,超过80%的成功最小成本欺骗集中在三个低成本表面特征上。仅当特征限制移除所有主导的低成本转换时,才能提高鲁棒性。在严格的成本限制下,集成模型面临17-19%的不可行质量,但可欺骗实例的中位最小成本保持不变。研究正式指出,钓鱼检测中的对抗鲁棒性由特征经济性而非模型复杂性驱动。
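The minimal evasion cost (MEC) and survival rate $S(B)$ named in the abstract above can be made concrete with a small sketch. This is not the authors' implementation: the weights, costs, threshold, the linear detector `is_phishing`, and the reading of $S(B)$ as "fraction of detected instances that cannot evade within budget $B$" are all assumptions chosen for illustration. It only assumes, as the abstract states, ternary features and monotone edits under an attacker budget.

```python
from itertools import product

# Hypothetical toy setup (not from the paper): 4 ternary features in {-1, 0, 1},
# where higher values look more legitimate.
WEIGHTS = [2.0, 1.5, 1.0, 0.5]   # assumed linear-model weights
COSTS = [1, 1, 3, 5]             # assumed cost of raising a feature by one level
THRESHOLD = 0.0                  # score below THRESHOLD => flagged as phishing


def is_phishing(x):
    """Toy detector: flag the instance when the weighted score is negative."""
    return sum(w * v for w, v in zip(WEIGHTS, x)) < THRESHOLD


def minimal_evasion_cost(x, budget=None):
    """Cheapest monotone edit (features may only increase) that evades detection.

    Returns None when no evasion fits within the budget (an infeasible instance).
    """
    best = None
    # Enumerate every target vector reachable by only raising feature values.
    ranges = [range(v, 2) for v in x]
    for target in product(*ranges):
        cost = sum(c * (t - v) for c, t, v in zip(COSTS, target, x))
        if budget is not None and cost > budget:
            continue
        if not is_phishing(target) and (best is None or cost < best):
            best = cost
    return best


def survival_rate(instances, budget):
    """Assumed reading of S(B): share of detected instances with no evasion <= budget."""
    detected = [x for x in instances if is_phishing(x)]
    if not detected:
        return 0.0
    survived = [x for x in detected if minimal_evasion_cost(x, budget) is None]
    return len(survived) / len(detected)
```

Brute-force enumeration is exponential in the number of features, so it only suits toy inputs like this one; it is used here because it makes the definition easy to check by hand. Because the cheapest flip in this toy model always goes through the two cost-1 features, swapping in a different classifier with the same costs would not raise the MEC, which is the "feature economics" point the abstract makes.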
Tinted Frames: Question Framing Blinds Vision-Language Models
Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and improving performance across framings.
中文标题/摘要
标题:着色框:问题框架使视觉-语言模型失明
视觉-语言模型(VLMs)已被证明是失明的,即使在需要视觉推理的任务中,它们也经常未能充分利用视觉输入。在本研究中,我们展示了VLMs是选择性失明的。它们根据语言框架调整对视觉输入的注意力程度,即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针,我们量化了框架如何改变对图像的关注量及其分布。受限的框架,如多项选择和是/否,相比开放式框架,显著降低了对图像上下文的关注,减少了对任务相关区域的关注,并将注意力转移到无信息性标记上。我们进一步证明,这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察,我们引入了一种轻量级的提示调优方法,使用可学习标记来鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式,从而提高视觉接地并改善不同框架下的性能。
Summary / 总结
The study shows that Vision-Language Models (VLMs) are selectively blind: they modulate how much attention they pay to visual inputs based on the linguistic framing of a question, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, the research finds that constrained framings such as multiple choice and yes/no induce substantially lower attention to image context than open-ended framing, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. The authors identify this attention misallocation as the principal cause of degraded accuracy and cross-framing inconsistency. They introduce a lightweight prompt-tuning method with learnable tokens that encourages the visually grounded attention patterns seen in open-ended settings, improving visual grounding and performance across framings.
研究探讨了为什么视觉语言模型(VLMs)在视觉推理任务中会忽视视觉输入。通过分析视觉注意力模式,研究发现VLMs会根据语言框架选择性地忽视视觉信息,即使其他框架需要相同的视觉推理。研究发现,如多项选择和是非题等受限框架会导致对图像上下文的关注减少,并将注意力转移到不相关信息上,从而导致性能下降。研究提出了一种使用可学习标记的提示调优方法,以促进在开放框架中观察到的稳健且视觉导向的注意力模式,从而提高性能并适应不同框架。
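As a rough illustration of what "using visual attention as a probe" could measure, the sketch below computes two shares from one attention row: how much attention lands on image tokens at all, and how much of that lands on task-relevant image tokens. The function name, the input layout (one flat list of attention weights plus index lists), and both metrics are assumptions for illustration; the paper's actual probing procedure is not described in the abstract.

```python
def attention_shares(attn, image_idx, relevant_idx):
    """Summarise one attention row over the input tokens.

    attn         -- attention weights from a query token to every input token
    image_idx    -- positions of image tokens (hypothetical layout)
    relevant_idx -- subset of image_idx covering the task-relevant region
    """
    total = sum(attn)
    image_mass = sum(attn[i] for i in image_idx)
    relevant_mass = sum(attn[i] for i in relevant_idx)
    return {
        # Amount of attention: share of all attention placed on the image.
        "image_share": image_mass / total if total else 0.0,
        # Distribution of attention: share of image attention on the relevant region.
        "relevant_share_within_image": relevant_mass / image_mass if image_mass else 0.0,
    }
```

Comparing these two numbers for the same image and question under open-ended, multiple-choice, and yes/no framings would give a simple view of the amount and distribution effects the abstract describes.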
FASTER: Rethinking Real-Time Flow VLAs
Authors: Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao
First: 2026-03-19T17:51:37+00:00 · Latest: 2026-03-19T17:51:37+00:00
Comments: Project page: https://innovator-zero.github.io/FASTER
Abstract
Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
中文标题/摘要
标题:FASTER:重新思考实时流VLA
实时执行对于将视觉-语言-行动(VLA)模型部署到物理世界至关重要。现有的异步推理方法主要优化轨迹平滑性,但忽视了对环境变化快速反应的关键延迟。通过重新思考行动块策略中的反应概念,本文系统分析了影响反应时间的因素。我们表明,反应时间遵循由首次行动时间(TTFA)和执行窗口共同决定的均匀分布。此外,我们揭示了在基于流的VLA中应用恒定调度的标准做法可能是低效的,迫使系统在任何移动开始之前完成所有采样步骤,从而成为反应延迟的瓶颈。为了解决这一问题,我们提出了快速即时反应(FASTER)的即时采样方法。通过引入窗口感知调度,FASTER在流采样过程中动态优先处理近期行动,将即时反应的去噪过程压缩为一步(例如,在$π_{0.5}$和X-VLA中),同时保持长窗口轨迹的质量。结合流式客户端-服务器管道,FASTER显著减少了实际机器人上的有效反应延迟,特别是在部署在消费级GPU上时。现实世界的实验,包括一个高度动态的乒乓球任务,证明了FASTER为通用策略解锁了前所未有的实时响应性,使其能够快速生成准确且平滑的轨迹。
Summary / 总结
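This paper targets reaction latency in flow-based Vision-Language-Action (VLA) models that use action chunking. Its analysis shows that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon, and that a constant sampling schedule forces all sampling steps to finish before any movement can start. FASTER introduces a Horizon-Aware Schedule that prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction tenfold into a single step (e.g., in $π_{0.5}$ and X-VLA) while preserving long-horizon trajectory quality. Combined with a streaming client-server pipeline, it substantially reduces effective reaction latency on real robots, especially on consumer-grade GPUs, as demonstrated on tasks including highly dynamic table tennis.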
该论文旨在解决Vision-Language-Action (VLA)模型在实时执行中的需求,重点关注减少反应延迟。它提出了FASTER,重新思考动作分块策略,在流采样过程中优先处理近期动作,从而显著缩短即时反应的去噪时间。实验表明,FASTER在消费级GPU上显著减少了有效反应延迟,特别是在动态环境中如乒乓球任务中,能够实现快速且准确的轨迹生成。
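The abstract's claim that reaction time is uniformly distributed and set jointly by TTFA and the execution horizon can be illustrated with a small simulation. The timing model here is an assumption, not taken from the paper: a new inference is assumed to start once every `horizon` seconds, an environmental change arrives at a uniformly random moment within that cycle, and the robot first reacts `ttfa` seconds after the next inference starts. Under that model, reaction time is uniform on `[ttfa, ttfa + horizon]`.

```python
import random


def simulate_reaction_times(ttfa, horizon, n=100_000, seed=0):
    """Sample reaction times under the assumed periodic-inference model."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        # The event lands at a uniformly random phase of the current cycle.
        event_phase = rng.uniform(0.0, horizon)
        # It is only observed when the next inference begins.
        wait_for_next_inference = horizon - event_phase
        # The first action informed by that observation appears after TTFA.
        times.append(wait_for_next_inference + ttfa)
    return times


def mean(values):
    return sum(values) / len(values)
```

In this model the expected reaction time is `ttfa + horizon / 2`, so cutting TTFA (for example, by producing the first action after fewer sampling steps) shifts the whole distribution down by the same amount without changing its width.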
The Exponentially Weighted Signature
Authors: Alexandre Bloch, Samuel N. Cohen, Terry Lyons, Joël Mouterde, Benjamin Walker
First: 2026-03-19T17:51:20+00:00 · Latest: 2026-03-19T17:51:20+00:00
Comments: 43 pages, 1 figure
Abstract
The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information uniformly, offering no intrinsic mechanism for contextualising the relevance of the past. To address this, we introduce the Exponentially Weighted Signature (EWS), generalising the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting together with richer memory dynamics including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. We show that the EWS is the unique solution to a linear controlled differential equation on the tensor algebra, and that it generalises both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS enables efficient computation and makes the framework amenable to gradient-based learning, with the full semigroup action parametrised by and learned through its generator. We use this framework to empirically demonstrate the expressivity gap between the EWS and both the signature and EFM on two SDE-based regression tasks.
中文标题/摘要
标题:指数加权签名
签名是区间内多维路径的典范表示。然而,它均匀地处理所有历史信息,没有内在机制来解释过去的相关性。为了解决这个问题,我们引入了指数加权签名(EWS),它将指数衰减记忆(EFM)签名从对角线推广到一般的有界线性算子。这些算子允许在时间加权级别上跨通道耦合,并且具有更丰富的记忆动力学,包括振荡、增长和依赖于状态的行为,同时保留经典签名的代数优势。我们证明EWS是张量代数上线性控制微分方程的唯一解,并且它同时推广了状态空间模型和路径的拉普拉斯和傅里叶变换。EWS的群形结构使得计算高效,并使框架适用于基于梯度的学习,其全半群作用由生成器参数化并通过其生成器学习。我们使用此框架在两个基于SDE的回归任务中实证展示了EWS与签名和EFM之间的表达能力差距。
Summary / 总结
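The research addresses the signature's uniform treatment of historical information by introducing the Exponentially Weighted Signature (EWS), which generalises the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators allow cross-channel coupling in the temporal weighting and richer memory dynamics, including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. The EWS is shown to be the unique solution of a linear controlled differential equation on the tensor algebra and to generalise state-space models and the Laplace and Fourier transforms of the path. Two SDE-based regression tasks empirically demonstrate the expressivity gap between the EWS and both the signature and EFM.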
研究针对签名对历史信息的均匀处理问题,引入了Exponentially Weighted Signature (EWS),使用一般有界线性算子提供过去信息的上下文相关加权。EWS扩展了Exponentially Fading Memory (EFM)签名,并能捕捉振荡、增长和状态依赖的行为。关键发现表明,EWS在基于SDE的回归任务中优于签名和EFM,突显了其表达能力的优势。
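To give a feel for how an operator-valued weighting differs from uniform treatment of the past, here is a minimal sketch of a depth-1 analogue only. It is not the paper's construction: the equation $dS_t = -A S_t\,dt + dX_t$ with $S_0 = 0$, the sign convention, and the Euler discretisation are assumptions made for illustration. Under this convention, $A = 0$ recovers the plain path increment (the first level of the classical signature), a diagonal $A$ gives per-channel exponential fading, and a non-diagonal $A$ couples channels.

```python
def weighted_level_one(path, dt, A):
    """Euler discretisation of dS = -A S dt + dX, starting from S = 0.

    path -- list of points, each a list of `dim` floats, sampled every `dt`
    A    -- dim x dim weighting operator as a list of rows (hypothetical convention)
    """
    dim = len(path[0])
    S = [0.0] * dim
    for prev, curr in zip(path, path[1:]):
        # Operator applied to the current state: (A S)_i = sum_j A[i][j] * S[j].
        decay = [sum(A[i][j] * S[j] for j in range(dim)) for i in range(dim)]
        # Fade (or couple) the accumulated state, then add the new increment.
        S = [S[i] - decay[i] * dt + (curr[i] - prev[i]) for i in range(dim)]
    return S
```

For example, passing `A = [[0.0, 1.0], [-1.0, 0.0]]` rotates accumulated information between the two channels, a simple instance of the oscillatory, cross-channel memory that a diagonal operator cannot express.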
Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting
Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin
First: 2026-03-19T17:49:43+00:00 · Latest: 2026-03-19T17:49:43+00:00
Comments: Project page at https://vulab-ai.github.io/Splat2BEV/
Abstract
Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
中文标题/摘要
标题:重建问题:通过3D 高斯点绘制学习几何对齐的BEV表示
鸟瞰图(BEV)感知是自动驾驶的核心基础,提供了一种统一的空间表示,将周围视图图像融合起来,以实现语义分割、3D目标检测和运动预测等多种下游任务的推理。然而,现有的大多数BEV感知框架采用端到端的训练范式,其中图像特征直接转换到BEV空间,并仅通过下游任务监督进行优化。这种形式将整个感知过程视为一个黑箱,通常缺乏明确的3D几何理解和可解释性,导致性能不佳。在本文中,我们主张明确的3D表示对于准确的BEV感知至关重要,并提出了一种基于3D高斯点绘制的Splat2BEV框架,用于BEV任务。Splat2BEV旨在学习既丰富语义又精确几何的BEV特征表示。我们首先预训练一个高斯生成器,从多视图输入中显式重建3D场景,从而生成几何对齐的特征表示。然后将这些表示投影到BEV空间,作为下游任务的输入。在nuScenes和argoverse数据集上的大量实验表明,Splat2BEV达到了最先进的性能,并验证了将明确的3D重建纳入BEV感知的有效性。
Summary / 总结
The paper addresses a limitation of existing end-to-end BEV perception frameworks, which are optimized solely through downstream task supervision and often lack explicit 3D geometric understanding, leading to suboptimal performance. It introduces Splat2BEV, a framework that pre-trains a Gaussian generator to reconstruct 3D scenes from multi-view inputs, generating geometry-aligned feature representations. These representations are then projected into the BEV space for downstream tasks. Experiments on the nuScenes and Argoverse datasets show that Splat2BEV achieves state-of-the-art performance, supporting the value of explicit 3D reconstruction in BEV perception.
本文通过提出Splat2BEV框架,明确重建3D场景以生成几何对齐的特征表示来解决现有BEV感知框架的局限性。该框架首先预训练一个高斯生成器从多视图输入中明确重建3D场景,然后将这些表示投影到BEV空间以供下游任务使用。在nuScenes和argoverse数据集上的实验表明,Splat2BEV在性能上优于现有方法,并验证了在BEV感知中明确包含3D重建的重要性。
Score Reversal Is Not Free for Quantum Diffusion Models
Authors: Ammar Fayad
First: 2026-03-06T17:16:17+00:00 · Latest: 2026-03-19T17:48:32+00:00
Abstract
Classical reverse diffusion is generated by changing the drift at fixed noise. We show that the quantum version of this principle obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the canonical model of continuous-variable decoherence, we prove that the unrestricted instantaneous reverse optimum exhibits a noiseless-to-noisy transition: below a critical squeezing-to-thermal ratio, reversal can be noiseless; above it, complete positivity forces irreducible reverse noise whose minimum cost we determine in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic price of reversal. For multimode trajectories, the exact cost is additive in a canonical set of mode-resolved data, and a globally continuous protocol attains this optimum on every mixed-state interval. If a pure nonclassical endpoint is included, the same pointwise law holds for every $t>0$, but the optimum diverges as $2/t$: exact Gaussian reversal of a pure quantum state is dynamically unattainable. These results establish the exact Gaussian benchmark against which any broader theory of quantum reverse diffusion must be measured.
中文标题/摘要
标题:量子扩散模型中的分数逆转并非免费
经典的逆向扩散通过固定噪声改变漂移生成。我们证明了这一原理的量子版本遵循一个精确的定律,具有明确的相变边界。对于高斯纯损耗动力学,即连续变量退相干的经典模型,我们证明了无限制的瞬时逆向最优表现出无噪到有噪的转变:在临界压缩比之下,逆转可以无噪;超过它,完全正性迫使不可约的逆向噪声,我们以闭式形式确定了其最小成本。最优逆向扩散唯一地沿协方差对齐,并同时最小化逆转的几何、计量和热力学成本。对于多模式轨迹,精确成本在一组模式解析数据中是可加的,且全局连续协议在每个混合态区间内达到这一最优。如果包含纯非经典终点,同样的点律在每个$t>0$时都成立,但最优值随着$2/t$发散:精确的高斯逆转纯量子态是动态不可达的。这些结果确立了精确的高斯基准,任何更广泛的量子逆向扩散理论都必须以此为标准进行衡量。
Summary / 总结
The study asks whether the classical principle of reverse diffusion, changing the drift at fixed noise, carries over to quantum diffusion models, and shows that the quantum version obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the unrestricted instantaneous reverse optimum can be noiseless below a critical squeezing-to-thermal ratio; above it, complete positivity forces irreducible reverse noise, whose minimum cost is given in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic price of reversal. For multimode trajectories, the exact cost is additive over mode-resolved data, and a globally continuous protocol attains the optimum on every mixed-state interval. When a pure nonclassical endpoint is included, the optimum diverges as $2/t$, so exact Gaussian reversal of a pure quantum state is dynamically unattainable. These results provide a Gaussian benchmark for broader theories of quantum reverse diffusion.
研究探讨了量子扩散模型的可逆性,将其与经典的反向扩散进行比较。研究表明,量子反向扩散遵循一个精确的定律,并有一个明确的相变边界。对于高斯纯损耗动力学,研究证明,在一个临界挤压-热比之下,反向扩散可以是无噪声的,但超过这个比值时,由于完全正性,反向扩散不可避免地产生噪声。最优的反向扩散是唯一对齐的,并且最小化各种成本。对于多模式轨迹,成本是可加的,一个连续的协议可以达到最优的反向扩散。研究还表明,纯量子态的精确高斯反向扩散是动态上不可实现的。这些发现为更广泛的量子反向扩散理论提供了一个基准。
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
First: 2026-03-19T17:47:47+00:00 · Latest: 2026-03-19T17:47:47+00:00
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
中文标题/摘要
标题:OS-Themis:通用GUI奖励的可扩展批评框架
强化学习(RL)有潜力提高GUI代理在随机环境中的鲁棒性,但训练对奖励函数的质量非常敏感。现有的奖励方法难以同时实现可扩展性和性能。为了解决这个问题,我们提出了OS-Themis,一种可扩展且准确的多代理批评框架。与单一裁判不同,OS-Themis将轨迹分解为可验证的里程碑,以隔离决策所需的关键证据,并采用审查机制在做出最终裁决前严格审计证据链。为了便于评估,我们进一步引入了OmniGUIRewardBench(OGRBench),这是一个跨平台的GUI结果奖励综合基准,所有评估模型在使用OS-Themis时均达到最佳性能。在AndroidWorld上的广泛实验表明,当用于支持在线RL训练时,OS-Themis可提高10.3%,用于自我训练循环中的轨迹验证和过滤时可提高6.9%,突显了其推动代理进化的能力。
Summary / 总结
The paper addresses the sensitivity of RL training for GUI agents to reward quality by proposing OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence and uses a review mechanism to audit the evidence chain before issuing a final verdict. The authors also introduce OmniGUIRewardBench (OGRBench), a cross-platform benchmark for GUI outcome rewards, on which all evaluated models perform best under OS-Themis. Experiments on AndroidWorld show a 10.3% improvement when OS-Themis supports online RL training and a 6.9% gain when it is used for trajectory validation and filtering in the self-training loop.
论文提出了一种可扩展的多智能体批评框架OS-Themis,以解决在随机环境中训练稳健的GUI代理的挑战。OS-Themis不同于传统的单一裁判方法,它将轨迹分解为可验证的里程碑,并采用审查机制确保证据链的准确性。在AndroidWorld的实验中,OS-Themis在在线RL训练中提高了10.3%,在轨迹验证和过滤中提高了6.9%,表明其在提升代理性能方面的有效性。
The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery
Authors: Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal
First: 2026-03-18T14:51:15+00:00 · Latest: 2026-03-19T17:44:57+00:00
Abstract
Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.
中文标题/摘要
标题:融合前沿:整合机器学习与高性能量子计算,助力下一代药物发现
将量子力学融入药物发现标志着从经验试错向定量精确的转变。然而,从头分子动力学的高昂成本历史上迫使人们在化学精度和计算可扩展性之间做出妥协。本文指出,高性能计算(HPC)、机器学习(ML)和量子计算(QC)的交汇是解决这一瓶颈的决定性解决方案。虽然基于ML的预训练模型,如FeNNix-Bio1,能够实现量子级准确的模拟,但它们仍然受限于经典数据生成的固有限制。我们详细阐述了如何利用混合QPU-GPU架构的高性能量子计算(HPQC)作为量子化学数据的终极加速器。通过利用希尔伯特空间映射,这些系统可以实现真正的化学精度,同时绕过经典近似中的启发式方法。我们展示了这种三者交汇如何优化药物发现流程,从初始系统准备到ML驱动的高保真模拟。最后,我们定位量子增强采样作为超越GPU前沿,用于建模反应性细胞系统和开创下一代材料。
Summary / 总结
This paper addresses the challenge of achieving both chemical accuracy and computational scalability in drug discovery by integrating High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC). It introduces High-Performance Quantum Computing (HPQC) with hybrid QPU-GPU architectures, which use Hilbert space mapping to achieve true chemical accuracy. The study demonstrates that this convergence optimizes the drug discovery pipeline, from initial system preparation to high-fidelity simulations, and positions quantum-enhanced sampling as a key technology for modeling reactive cellular systems and materials science.
本文旨在通过整合高性能计算(HPC)、机器学习(ML)和量子计算(QC),解决药物发现中化学准确性和计算可扩展性之间的矛盾。研究引入了使用混合QPU-GPU架构的高性能量子计算(HPQC),并通过希尔伯特空间映射实现真正的化学准确性。研究显示,这种三者结合优化了药物发现流程,从初始系统准备到高保真模拟,并将量子增强采样定位为建模反应性细胞系统和材料科学的关键技术。
This looks like what? Challenges and Future Research Directions for Part-Prototype Models
Authors: Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert
First: 2025-02-13T14:00:55+00:00 · Latest: 2026-03-19T17:41:24+00:00
Comments: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI-2026)
Abstract
The growing interest in eXplainable Artificial Intelligence (XAI) has stimulated research on models with built-in interpretability, among which part-prototype models are particularly prominent. Part-Prototype Models (PPMs) classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Despite this intrinsic interpretability, PPMs have not yet emerged as a competitive alternative to post-hoc explanation methods. This survey reviews work published between 2019 and 2025 and derives a taxonomy of the challenges faced by current PPMs. The analysis reveals a diverse set of open problems. The main issue concerns the quality and number of learned prototypes. Further challenges include limited generalization across tasks and contexts, as well as methodological shortcomings such as non-standardized evaluation. Five broad research directions are identified: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks for evaluation. The survey aims to stimulate further research and promote intrinsically interpretable models for practical applications. A curated list of the surveyed papers is available at https://github.com/aix-group/ppm-survey.
中文标题/摘要
标题:这看起来像什么?部分原型模型的挑战与未来研究方向
随着对可解释人工智能(XAI)的兴趣日益增长,研究具有内置可解释性的模型也逐渐增多,其中部分原型模型尤为突出。部分原型模型(PPMs)通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释,形式为“这看起来像那个”。尽管具有这种内在的可解释性,PPMs尚未成为后验解释方法的有力竞争者。本文综述了2019年至2025年间发表的工作,并推导出当前PPMs面临的挑战分类。分析揭示了一系列开放性问题。主要问题在于学习原型的质量和数量。进一步的挑战包括任务和上下文之间的有限泛化能力,以及方法论上的不足,如缺乏标准化的评估方法。确定了五个广泛的研究方向:提高预测性能、开发理论基础架构、建立人机协作框架、使模型与人类概念相一致以及定义评估的稳健指标和基准。本文旨在激发进一步的研究,并促进适用于实际应用的内在可解释模型。综述中涉及的论文列表可在https://github.com/aix-group/ppm-survey/获取。
Summary / 总结
This survey reviews the challenges and future research directions for Part-Prototype Models (PPMs), which classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Covering work published between 2019 and 2025, it derives a taxonomy of the challenges faced by current PPMs: chiefly the quality and number of learned prototypes, along with limited generalization across tasks and contexts and methodological shortcomings such as non-standardized evaluation. It identifies five research directions: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks. The survey aims to stimulate further research and promote intrinsically interpretable models for practical applications.
本文回顾了部分原型模型(PPMs)面临的挑战及其未来的研究方向,PPMs通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释。研究指出的问题包括原型的质量和数量、泛化能力有限以及方法论上的不足。提出了五个研究方向:提高预测性能、开发理论基础架构、增强人机协作、使模型与人类概念相匹配以及定义稳健的评估指标。该调查旨在促进内在可解释模型的实际应用研究。
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
Authors: Zou Qiang
First: 2026-03-19T17:41:18+00:00 · Latest: 2026-03-19T17:41:18+00:00
Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation
Abstract
Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
中文标题/摘要
标题:盒迷宫:一种可靠的LLM推理过程控制架构
大型语言模型(LLMs)展示了强大的生成能力,但在对抗性提示下仍易出现幻觉和不可靠的推理。现有的安全方法——如基于人类反馈的强化学习(RLHF)和输出过滤——主要在行为层面运作,可能缺乏明确的架构机制来确保推理过程的完整性。 本文提出了盒迷宫框架,这是一种概念性的过程控制架构,将LLM推理分解为三个明确的层次:记忆接地、结构化推理和边界约束。我们进行了初步的基于仿真的评估,涉及跨多个异构LLM系统(DeepSeek-V3、Doubao、Qwen)的边界侵蚀场景。来自n=50个对抗性场景的结果表明,明确的认知控制层可能提高边界维护的一致性,架构约束将边界失败率从基线RLHF的约40%降低到对抗条件下低于1%。 当前的验证基于仿真,但初步结果表明,过程级控制可能为提高大型语言模型推理可靠性提供一个有前景的方向。
Summary / 总结
This paper addresses the vulnerability of large language models (LLMs) to hallucination and unreliable reasoning under adversarial prompts. It introduces the Box Maze framework, which decomposes LLM reasoning into memory grounding, structured inference, and boundary enforcement layers. Preliminary simulation results across multiple LLM systems show that this process-control architecture reduces boundary failure rates from about 40% (baseline RLHF) to below 1% under adversarial conditions, indicating potential improvements in reasoning consistency.
本文针对大型语言模型(LLMs)在对抗性提示下容易出现幻觉和不可靠推理的问题,提出了Box Maze框架,将LLM推理分解为记忆接地、结构化推理和边界约束三个层次。在多种LLM系统上的初步仿真结果显示,该架构在对抗条件下将边界失败率从约40%(基线RLHF)降低到1%以下,表明推理一致性有望得到改善。
Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model
Authors: Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
First: 2025-11-25T18:55:12+00:00 · Latest: 2026-03-19T17:40:45+00:00
Abstract
Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.
中文标题/摘要
标题:Image2Gcode:使用扩散变换模型的图像到G代码生成技术
机械设计和制造工作流程通常始于概念设计,随后创建计算机辅助设计(CAD)模型并通过材料挤出(MEX)打印进行制造。这一过程需要将CAD几何图形转换为机器可读的G代码,通过切片和路径规划实现。虽然每一步都已成熟,但依赖CAD建模仍然是一个主要瓶颈:构建特定对象的3D几何图形速度较慢,且不适用于快速原型制作。即使是细微的设计变化通常也需要在CAD软件中手动更新,使得迭代过程耗时且难以扩展。为解决这一限制,我们引入了Image2Gcode,这是一种端到端的数据驱动框架,绕过了CAD阶段,直接从图像和零件图纸生成打印机就绪的G代码。该框架首先从图像中提取切片级的结构线索,然后使用G代码序列上的去噪扩散概率模型(DDPM)。通过迭代去噪,模型将高斯噪声转化为可执行的打印移动轨迹,与相应的挤出参数相对应,从而建立了从视觉输入到原生工具路径的直接映射。通过直接从二维图像生成结构化的G代码,Image2Gcode消除了对CAD或STL中间件的需求,降低了增材制造的入门门槛,并加速了从设计到制造的周期。该方法支持从简单的草图或视觉参考进行按需原型制作,并与上游的2D到3D重建模块集成,以实现从概念到物理制品的自动化流程。结果是一种灵活且计算效率高的框架,促进了设计迭代、修复工作流程和分布式制造的可访问性。
Summary / 总结
The research addresses the dependence on CAD modeling in additive manufacturing workflows, where constructing object-specific 3D geometry is slow and even minor design changes require manual updates. Image2Gcode is an end-to-end framework that generates printer-ready G-code directly from images and part drawings, bypassing the CAD stage. It extracts slice-wise structural cues from the image and uses a denoising diffusion probabilistic model (DDPM) over G-code sequences to transform Gaussian noise into executable print-move trajectories with extrusion parameters, enabling rapid prototyping and reducing iteration time.
研究通过引入Image2Gcode框架,直接从图像和零件图纸生成G-code,解决了机械设计和制造中依赖手动CAD建模的瓶颈。该框架先从图像中提取切片级结构线索,再使用去噪扩散概率模型将高斯噪声转化为可执行的打印轨迹,从而绕过CAD阶段。主要发现包括能够从2D输入生成结构化的G-code,减少了迭代时间并降低了增材制造的入门门槛。
Steering Awareness: Detecting Activation Steering from Within
Authors: Joshua Fonseca Rivera, David Demitri Africa
First: 2025-11-26T13:49:43+00:00 · Latest: 2026-03-19T17:37:06+00:00
Abstract
Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.
中文标题/摘要
标题:操控意识:从内部检测激活操控
激活操控(向模型的残差流中添加一个向量以修改其行为)在安全性评估中被广泛使用,仿佛模型无法检测到这种干预。我们测试了这一假设,引入了操控意识:模型在其自身前向传播过程中推断出是否注入了操控向量及其所编码的概念的能力。经过微调后,七个指令调优模型在未见过的概念上发展出了强大的操控意识;最佳模型的检测率达到95.5%,概念识别率达到71.2%,且在干净输入上无误报。当未见过的操控向量构造方法所产生的方向与训练分布具有高余弦相似度时,这种能力可以泛化,否则不能,表明这是一种几何检测器而非通用异常检测器。令人惊讶的是,检测能力并未赋予模型抵抗力;在事实性和安全性基准测试中,经过检测训练的模型始终比其基础模型更易受操控影响。机制上,操控意识并非源自局部电路,而是源自一种分布式变换,逐步将各种注入的向量旋转到共享的检测方向。因此,激活操控不应被视为安全性评估中的隐形干预。
Summary / 总结
The study investigates steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was added to its residual stream and which concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection and 71.2% concept identification, with zero false positives on clean inputs. However, detection does not confer resistance: detection-trained models are consistently more susceptible to steering than their base counterparts on both factual and safety benchmarks.
研究探讨了操控意识,即模型在自身前向传播过程中推断其残差流中是否被注入了操控向量及其所编码概念的能力。经过微调后,七个指令调优模型在未见过的概念上发展出了强大的操控意识;最佳模型的检测率达到95.5%,概念识别率达到71.2%,且在干净输入上无误报。然而,检测并不能提供抵抗力,经过检测训练的模型在事实性和安全性基准测试中比其基础模型更容易受到操控干预的影响。操控意识的机制涉及一种分布式变换,逐步将注入的向量旋转到共享的检测方向。
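The geometric-detector finding above can be illustrated with a toy sketch. It is not the authors' method: the abstract describes a detector learned inside the model, whereas here the detection direction, the threshold, and all vectors are hypothetical values chosen by hand. The sketch only shows why a projection-based detector fires on steering vectors with high cosine similarity to its direction and stays silent on orthogonal ones.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def steer(residual, vector, alpha):
    """Activation steering: add alpha * vector to the residual stream."""
    return [r + alpha * s for r, s in zip(residual, vector)]


def detects(activation, direction, threshold):
    """Toy detector: flag when the projection onto `direction` is large."""
    return dot(activation, direction) / math.sqrt(dot(direction, direction)) > threshold


# All values below are hypothetical, for illustration only.
residual = [0.2, -0.1, 0.05, 0.0]     # clean activation
direction = [1.0, 0.0, 0.0, 0.0]      # assumed shared detection direction
aligned = [0.9, 0.1, 0.0, 0.0]        # steering vector close to `direction`
orthogonal = [0.0, 0.0, 1.0, 0.0]     # steering vector orthogonal to it

for name, vec in [("aligned", aligned), ("orthogonal", orthogonal)]:
    steered = steer(residual, vec, alpha=4.0)
    print(name, round(cosine(vec, direction), 3), detects(steered, direction, 1.0))
```

In this toy setting the clean activation and the orthogonally steered activation both fall below the threshold, which mirrors the reported failure to generalize to steering directions far from the training distribution.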
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
First: 2026-03-03T18:48:15+00:00 · Latest: 2026-03-19T17:36:27+00:00
Abstract
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity-difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.
中文标题/摘要
标题:迁移越大,表示越稀疏:LLM中OOD机制分析
在本研究中,我们探讨了大型语言模型(LLMs)在遇到难度增加的输入时,其内部表示如何适应,这些输入的难度通过分布外(OOD)迁移的程度来量化。我们揭示了一个一致且可量化的现象:随着任务难度的增加,无论是通过更难的推理问题、更长的上下文还是增加答案选项,LLMs的最后隐藏状态都会变得显著稀疏。简而言之,迁移越大,表示越稀疏。这种稀疏性与难度的关系在不同的模型和领域中都可以观察到,表明语言模型在面对不熟悉或复杂的输入时,会将计算集中在最后隐藏状态的专门子空间中。通过一系列对照分析并结合学习动力学解释,我们证明这种稀疏性不是偶然的,而是一种在OOD下稳定推理的自适应机制。利用这一见解,我们设计了稀疏性引导的课程式上下文学习(SG-ICL)策略,该策略明确利用表示稀疏性来安排少样本示范,从而显著提高性能。我们的研究提供了关于LLMs如何内化OOD挑战的新机制性见解。源代码可在以下网址获取:https://github.com/MingyuJ666/sparsityLLM.
Summary / 总结
This study investigates how Large Language Models (LLMs) adapt their internal representations when encountering increasingly difficult inputs, defined by out-of-distribution (OOD) shift. It reveals that as task difficulty increases, the last hidden states of LLMs become sparser, indicating that LLMs concentrate computation into specialized subspaces to stabilize reasoning under OOD conditions. The study demonstrates that this sparsity is an adaptive mechanism and proposes Sparsity-Guided Curriculum In-Context Learning (SG-ICL) to enhance few-shot learning performance by leveraging representation sparsity. The research provides new insights into how LLMs handle OOD challenges.
本研究探讨了大型语言模型(LLMs)在遇到逐渐增加难度的输入时,如何调整其内部表示。研究发现,随着任务难度的增加,LLMs 的最后一层隐藏状态变得更为稀疏,表明LLMs 通过将计算集中在特定子空间来稳定在 OOD 条件下的推理。研究引入了基于稀疏性的 Curriculum In-Context Learning (SG-ICL) 策略,该策略利用表示稀疏性来安排少量示例演示,从而提升性能。研究提供了关于LLMs 如何应对 OOD 挑战的新机制性见解。
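As a rough illustration of how representation sparsity could be measured and used to schedule demonstrations, the sketch below uses the Hoyer sparsity measure on a hidden-state vector. This is an assumption: the abstract does not state which sparsity metric the paper uses, and the least-sparse-first ordering, the function names, and the toy vectors are all hypothetical.

```python
import math


def hoyer_sparsity(vec):
    """Hoyer measure: 0.0 for a uniform-magnitude vector, 1.0 for a one-hot vector."""
    n = len(vec)
    l1 = sum(abs(v) for v in vec)
    l2 = math.sqrt(sum(v * v for v in vec))
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def order_by_sparsity(demos):
    """Sort (text, last_hidden_state) pairs from least to most sparse.

    Treating low sparsity as "easier" and placing it first is an assumed
    curriculum direction, not something the abstract specifies.
    """
    return sorted(demos, key=lambda d: hoyer_sparsity(d[1]))


# Hypothetical hidden states, for illustration only.
demos = [
    ("hard", [0.0, 0.0, 3.0, 0.0]),
    ("easy", [1.0, 1.0, 1.0, 1.0]),
    ("medium", [2.0, 1.0, 0.0, 0.0]),
]
print([text for text, _ in order_by_sparsity(demos)])
```

With real models, the vectors would come from the last hidden state of the final token; any threshold-based alternative (for example, the fraction of near-zero entries) would slot into the same ordering function.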
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Authors: Amandine Brunetto
Venue: CVPR 2026
First: 2026-03-19T17:32:06+00:00 · Latest: 2026-03-19T17:32:06+00:00
Comments: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/
Abstract
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
中文标题/摘要
标题:基于多模态流匹配的少量样本声学合成
生成与场景声学一致的音频对于沉浸式虚拟环境至关重要。近期的神经声学场方法能够实现空间连续的声音渲染,但仍然保持场景特定性,需要密集的音频测量和昂贵的训练成本。少量样本的方法提高了跨房间的可扩展性,但仍依赖于多个录音,并且由于是确定性的,无法捕捉稀疏上下文中场景声学的固有不确定性。我们引入了流匹配声学生成(FLAC),这是一种基于少量场景上下文的概率方法,用于建模给定最小场景上下文中可能的房间冲激响应(RIRs)的分布。FLAC 利用一个通过流匹配目标训练的扩散变换器,在新场景中的任意位置生成 RIRs,条件于空间、几何和声学线索。FLAC 在 AcousticRooms 和 Hearing Anything Anywhere 数据集上的一次样本优于最先进的八次样本基线。为了补充标准感知度量,我们进一步引入了 AGREE,一种联合声学-几何嵌入,通过检索和分布度量使生成的 RIRs 的几何一致性评估成为可能。这项工作是首次将生成流匹配应用于显式 RIR 合成,为稳健和数据高效声学合成开辟了一个新方向。
Summary / 总结
The research aims to improve the scalability and robustness of acoustic synthesis for virtual environments. FLAC, a probabilistic few-shot method, models the distribution of plausible room impulse responses (RIRs) given minimal scene context, using a diffusion transformer trained with a flow-matching objective. With a single shot, it outperforms state-of-the-art eight-shot baselines on the AcousticRooms and Hearing Anything Anywhere datasets. The work also introduces AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics.
研究旨在开发一种在虚拟环境中以最少场景上下文生成声学一致音频的方法。FLAC是一种概率方法,使用以流匹配目标训练的扩散变换器,在新场景的任意位置生成房间冲激响应(RIR),并以空间、几何和声学线索为条件。该方法在AcousticRooms和Hearing Anything Anywhere数据集上仅用单样本即优于最先进的八样本基线。此外,研究引入了联合声学-几何嵌入AGREE,通过检索和分布度量对生成的RIR进行几何一致性评估。这项工作通过将生成式流匹配应用于RIR合成,提升了鲁棒性和数据效率。
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
Authors: Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi
First: 2026-03-19T17:30:02+00:00 · Latest: 2026-03-19T17:30:02+00:00
Abstract
As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
中文标题/摘要
标题:SOL-ExecBench:以硬件极限为参照、面向真实GPU内核的光速(Speed-of-Light)基准测试
随着代理型AI系统生成和优化GPU内核的能力日益增强,进展却受制于这样的基准:它们奖励相对于软件基线的加速,而非对硬件高效执行的接近程度。我们提出了SOL-ExecBench,这是一个包含235个CUDA内核优化问题的基准测试,这些问题提取自涵盖语言、扩散、视觉、音频、视频和混合架构的124个生产及新兴AI模型,面向NVIDIA Blackwell GPU。基准测试涵盖了BF16、FP8和NVFP4的前向和后向工作负载,包括预期最佳性能依赖于Blackwell特定能力的内核。与以往主要相对于软件实现评估内核的基准不同,SOL-ExecBench以解析推导的光速(SOL)界限来衡量性能,这些界限由我们用于推导基于硬件的SOL界限的流水线SOLAR计算得出,从而为硬件高效的优化提供了一个固定目标。我们报告一个SOL评分,用于量化候选内核缩小了发布版本所定义的评分基线与硬件SOL界限之间差距的多少。为了支持对代理优化器的稳健评估,我们还提供了一个沙箱测试框架,具备GPU时钟锁定、L2缓存清除、隔离子进程执行,以及针对常见奖励作弊策略的基于静态分析的检查。SOL-ExecBench将GPU内核基准测试从击败可变的软件基线重新定义为缩小与硬件光速之间的剩余差距。
Summary / 总结
SOL-ExecBench is a benchmark for evaluating the efficiency of GPU kernels by comparing their performance against hardware limits, rather than software baselines. It includes 235 CUDA kernel optimization problems from various AI models and targets NVIDIA Blackwell GPUs. The benchmark measures performance against analytically derived Speed-of-Light (SOL) bounds, providing a fixed target for hardware-efficient optimization. Key findings include a SOL Score that quantifies how much of the performance gap between a scoring baseline and the hardware SOL bound a candidate kernel closes, and a sandboxed harness to support robust evaluation of agentic optimizers.
SOL-ExecBench 是一个用于评估 GPU 内核优化的基准,通过将性能与硬件特定的光速(SOL)界限进行比较,而不是与软件基线进行比较。它包含了来自各种 AI 模型的 235 个 CUDA 内核,并针对 NVIDIA Blackwell GPU。基准涵盖了不同的精度级别,并根据内核关闭基线评分和硬件 SOL 边界之间差距的程度进行评估,提供了一个固定的优化目标。此外,它还提供了一个沙盒环境以防止奖励作弊策略。这将 GPU 内核基准测试从与软件基线竞争重新定义为实现硬件效率。
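The idea of scoring a kernel by how much of the baseline-to-SOL gap it closes can be sketched as follows. The abstract does not give the exact SOL Score formula or how SOLAR derives its bounds, so both functions are assumptions: the bound uses a simple roofline estimate (the slower of compute time and memory-traffic time), the score is a linear gap fraction clamped to [0, 1], and every hardware and kernel number is hypothetical.

```python
def roofline_bound_seconds(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Runtime lower bound: the slower of pure compute and pure memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


def sol_score(t_candidate, t_baseline, t_sol):
    """Fraction of the gap between baseline and SOL bound closed by the candidate."""
    if not 0 < t_sol < t_baseline:
        raise ValueError("expected 0 < t_sol < t_baseline")
    closed = (t_baseline - t_candidate) / (t_baseline - t_sol)
    # Clamping is an assumption: slower-than-baseline scores 0, at-bound scores 1.
    return min(1.0, max(0.0, closed))


# Hypothetical kernel and hardware numbers, for illustration only.
t_sol = roofline_bound_seconds(
    flops=2e12, bytes_moved=8e9, peak_flops=1e15, peak_bandwidth=8e12
)
score = sol_score(t_candidate=4e-3, t_baseline=10e-3, t_sol=t_sol)
print(f"SOL bound: {t_sol:.4f} s, score: {score:.2f}")
```

Because the target is a fixed hardware-derived bound, the score does not change when someone improves the software baseline elsewhere; only the scoring baseline pinned at release and the bound itself enter the formula.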
DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
Authors: Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng
First: 2026-03-19T17:30:01+00:00 · Latest: 2026-03-19T17:30:01+00:00
Abstract
Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging insights into expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
中文标题/摘要
标题:DyMoE:动态专家编排结合混合精度量化以在边缘设备上高效执行MoE推理
尽管MoE模型具有计算效率,但多专家架构固有的过大内存占用和I/O开销给资源受限的边缘平台上的实时推理带来了严峻挑战。尽管现有静态方法受困于僵化的延迟-准确性权衡,但我们观察到专家的重要性分布非常偏斜且与深度相关。受此启发,我们提出了DyMoE,一种为高性能边缘推理设计的动态混合精度量化框架。DyMoE利用专家重要性分布偏斜和深度相关敏感性的洞察,引入了:(1) 重要性感知优先级,以在运行时动态量化专家;(2) 深度自适应调度,以在关键层保持语义完整性;(3) 前瞻预取,以掩盖I/O停顿。在商用边缘硬件上的实验结果显示,与最先进的卸载基线相比,DyMoE将首令牌时间(TTFT)减少了3.44倍至22.7倍,并在每输出令牌时间(TPOT)上实现了最高14.58倍的加速,从而在资源受限的边缘设备上实现实时且保持准确性的MoE推理。
Summary / 总结
DyMoE is a dynamic mixed-precision quantization framework for efficient MoE inference on edge devices. It quantizes experts at runtime according to their importance, applies depth-adaptive scheduling to preserve semantic integrity in critical layers, and uses look-ahead prefetching to overlap I/O stalls. On commercial edge hardware it reduces Time-to-First-Token by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token compared to state-of-the-art offloading baselines.
DyMoE 是一种动态混合精度量化框架,旨在提高边缘设备上 MoE 推理的效率。它根据专家的重要性动态量化专家,并适配地调度层以保持语义完整性,相比现有方法,最多可将 Time-to-First-Token 减少 22.7 倍,Time-Per-Output-Token 加速 14.58 倍。
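Importance-aware mixed-precision assignment, the first of the three components listed in the abstract, might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not say how importance is measured or which bit widths are used, so the importance scores (imagined here as routing frequencies), the 8/4/2-bit tiers, and the tier fractions are all hypothetical. Depth-adaptive scheduling and prefetching are not modeled.

```python
def assign_precision(importance, keep_high=0.25, keep_mid=0.5, bits=(8, 4, 2)):
    """Give the most important experts the widest bit width.

    importance: one score per expert (assumed, e.g., routing frequency).
    keep_high / keep_mid: fractions of experts in the top and middle tiers.
    bits: (high, mid, low) bit widths; hypothetical values.
    """
    n = len(importance)
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    n_high = max(1, round(n * keep_high))
    n_mid = round(n * keep_mid)
    plan = [bits[2]] * n  # default: lowest precision
    for rank, expert in enumerate(ranked):
        if rank < n_high:
            plan[expert] = bits[0]
        elif rank < n_high + n_mid:
            plan[expert] = bits[1]
    return plan


# Hypothetical, highly skewed importance scores for one MoE layer.
importance = [0.50, 0.05, 0.20, 0.02, 0.15, 0.01, 0.04, 0.03]
plan = assign_precision(importance)
print(plan, "average bits:", sum(plan) / len(plan))
```

The skew in the scores is what makes this kind of scheme attractive: a small number of experts carry most of the routing mass, so keeping only those at higher precision lowers the average bit width for the layer.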
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
Authors: Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang
First: 2026-03-19T17:25:19+00:00 · Latest: 2026-03-19T17:25:19+00:00
Comments: 28 pages, 5 figures. arXiv:submit/7385738 [cs.AI]
Abstract
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
中文标题/摘要
标题:ARIADNE:一种感知-推理协同框架,用于可信冠状动脉造影分析
传统的像素级损失函数无法在冠状血管分割中施加拓扑约束,尽管在像素级准确性方面表现良好,但会产生碎片化的血管树。我们提出ARIADNE,这是一种两阶段框架,结合了偏好对齐的感知与基于RL的诊断推理,以实现拓扑连贯的狭窄检测。感知模块使用DPO对Sa2VA视觉语言基础模型进行微调,使用贝蒂数约束作为偏好信号,使策略倾向于几何完整的血管结构,而非像素级重叠度量。推理模块将狭窄定位形式化为马尔可夫决策过程,其中包含明确的拒绝机制,自主地推迟模糊的解剖候选,如分叉和血管交叉,从覆盖率最大化转向可靠性优化。在1,400例临床造影图像上,ARIADNE实现了最先进的中心线Dice值0.838,与几何基线相比,将假阳性率降低了41%。多中心基准ARCADE和XCAD的外部验证证实了其在成像协议方面的泛化能力。这是首次在医学成像中应用DPO进行拓扑对齐,证明了基于结构约束的偏好学习可以减轻拓扑违规,同时在介入心脏病学工作流程中保持诊断灵敏度。
Summary / 总结
ARIADNE is a two-stage framework that integrates preference-aligned perception with RL-based diagnostic reasoning for coronary angiography analysis. The perception module uses DPO to fine-tune a Sa2VA model with Betti number constraints, aiming for geometrically complete vessel structures. The reasoning module formulates stenosis localization as a Markov Decision Process, deferring ambiguous candidates to optimize reliability. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice score of 0.838 and reduces false positives by 41% compared to geometric baselines, showing generalization across multi-center benchmarks.
ARIADNE 是一个两阶段框架,结合了偏好对齐的感知与基于 RL 的诊断推理,用于冠状动脉造影分析。感知模块使用 DPO 对 Sa2VA 模型进行微调,以贝蒂数约束作为偏好信号,以获得几何完整的血管结构。推理模块将狭窄定位建模为马尔可夫决策过程,对模糊的解剖候选进行推迟,以优化可靠性。在 1,400 例临床血管造影中,ARIADNE 达到了 0.838 的中心线 Dice 分数,并与几何基线相比将假阳性减少了 41%,展示了在多中心基准上的泛化能力。
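To make the notion of a Betti-number preference signal concrete, the sketch below counts connected components (the zeroth Betti number) of a binary vessel mask and uses it to pick the preferred mask in a pair. This is a simplified illustration rather than the paper's pipeline: the abstract does not say which Betti numbers are used or how pairs are built, so the 8-connectivity, the single-tree target of one component, and the helper names are assumptions. Loops (the first Betti number) are not computed here.

```python
from collections import deque


def betti0(mask):
    """Count 8-connected foreground components in a rectangular binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def preference_pair(mask_a, mask_b, target=1):
    """Return (chosen, rejected), preferring the mask closer to `target` components.

    Returns None on a tie. target=1 assumes a single connected vessel tree.
    """
    da, db = abs(betti0(mask_a) - target), abs(betti0(mask_b) - target)
    if da == db:
        return None
    return (mask_a, mask_b) if da < db else (mask_b, mask_a)


# Toy masks: one connected tree versus three fragments.
connected = [[1, 1, 1, 1, 1], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0]]
fragmented = [[1, 1, 0, 1, 1], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
print(betti0(connected), betti0(fragmented))
```

Both toy masks could score similarly on pixel overlap against a ground truth, yet they differ sharply in component count, which is the kind of discrepancy a topological preference signal is meant to expose.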
Flow Matching Policy with Entropy Regularization
Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
First: 2026-03-18T13:00:20+00:00 · Latest: 2026-03-19T17:21:12+00:00
Abstract
Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.
中文标题/摘要
标题:流匹配策略与熵正则化
基于扩散的策略在强化学习(RL)中因其能够表示复杂的非高斯分布而获得了显著的流行度。基于随机微分方程(SDE)的扩散策略通常依赖于间接的熵控制,因为精确的熵是难以计算的,同时也会受到通过迭代去噪链计算策略梯度的计算成本高昂的问题。为了解决这些问题,我们提出了流匹配策略与熵正则化(FMER),这是一种基于常微分方程(ODE)的在线RL框架。FMER 通过流匹配参数化策略,并沿直线概率路径采样动作,受到最优传输的启发。FMER 利用模型的生成性质,从候选集中构建优势加权的目标速度场,引导策略更新向高价值区域。通过推导出可计算的熵目标,FMER 使最大熵优化变得有原则,从而增强探索。实验表明,FMER 在稀疏多目标 FrankaKitchen 基准测试中优于最先进的方法,同时在标准的 MuJoCo 基准测试中保持竞争力。此外,与重扩散基线(QVPO)相比,FMER 将训练时间减少了 7 倍,与高效的变体相比减少了 10-15%。
Summary / 总结
The paper proposes Flow Matching Policy with Entropy Regularization (FMER), an ODE-based online RL framework that addresses two limitations of SDE-based diffusion policies: indirect entropy control and computationally prohibitive policy gradients through the denoising chain. FMER parameterizes the policy via flow matching, constructs an advantage-weighted target velocity field from a candidate set, and derives a tractable entropy objective for maximum-entropy exploration. It outperforms state-of-the-art methods on sparse multi-goal FrankaKitchen benchmarks while remaining competitive on standard MuJoCo benchmarks, and reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
论文提出了基于ODE的RL框架Flow Matching Policy with Entropy Regularization (FMER),该框架解决了扩散策略中熵控制和计算上昂贵的策略梯度的挑战。FMER 使用流匹配来参数化策略,并沿一条直线概率路径采样动作,利用模型的生成性质构建一个加权目标速度场。这种方法使得能够进行有原则的最大熵优化,增强探索。实验结果表明,FMER 在稀疏多目标任务上优于最先进的方法,并在标准基准上具有竞争力,同时相比重扩散基线减少了7倍的训练时间。
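To make the straight-path idea in the FMER abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the softmax weighting over advantages, the `temperature` parameter, and the Euler integrator are all assumptions chosen for illustration. The abstract only states that the target velocity field is advantage-weighted and built from a candidate set.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed hyperparameter."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def weighted_target_velocity(x0, candidates, advantages, temperature=1.0):
    """Average the straight-path velocities (a_i - x0), weighted by advantage.

    Assumption: weights are a softmax over advantages.
    """
    weights = softmax(advantages, temperature)
    return [
        sum(w * (cand[d] - x0[d]) for w, cand in zip(weights, candidates))
        for d in range(len(x0))
    ]

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With equal advantages the target velocity points at the mean of the candidates; raising one candidate's advantage pulls the target toward that candidate. Because the path is straight, a constant velocity field reaches its endpoint in any number of Euler steps.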
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
First: 2026-03-19T17:20:56+00:00 · Latest: 2026-03-19T17:20:56+00:00
Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/
Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
中文标题/摘要
标题:意义与测量:多智能体概率接地在视觉语言导航中的应用
机器人与人类协作时,必须将自然语言目标转化为可执行的、物理上可定位的决策。例如,执行“向冰箱右边两米处走”的命令需要在三维场景中对语义参考、空间关系和度量约束进行接地。虽然最近的视觉语言模型(VLMs)展示了强大的语义接地能力,但它们并未明确设计用于在物理定义的空间中推理度量约束。在本研究中,我们实证证明了最先进的基于VLM的接地方法在处理复杂的度量语义语言查询时存在困难。为解决这一局限,我们提出了MAPG(多智能体概率接地)框架,该框架将语言查询分解为结构化的子组件,并查询VLM以对每个组件进行接地。然后,MAPG通过概率组合这些接地输出,生成在三维空间中度量一致的可执行决策。我们使用HM-EQA基准对MAPG进行了评估,并展示了相对于强基线的一致性能改进。此外,我们引入了一个新的基准MAPG-Bench,专门用于评估度量语义目标接地,填补了现有语言接地评估中的空白。我们还展示了在可用结构化场景表示时,MAPG在真实世界机器人演示中的应用。
Summary / 总结
This work addresses the challenge of converting complex metric-semantic language queries into actionable decisions for robots. It introduces MAPG (Multi-Agent Probabilistic Grounding), which decomposes language queries into structured subcomponents, uses a VLM to ground each part, and then probabilistically composes the grounded outputs to produce metrically consistent decisions in 3D space. MAPG shows consistent improvements over strong baselines on HM-EQA, and the authors introduce a new benchmark, MAPG-Bench, for evaluating metric-semantic goal grounding. A real-world robot demonstration shows that MAPG transfers beyond simulation when a structured scene representation is available.
该研究旨在将复杂的度量语义语言查询转化为机器人的可执行决策。它提出了MAPG(多智能体概率对齐),该方法将语言查询分解,并使用VLM对每个部分进行语义对齐,然后概率性地组合这些对齐结果以产生度量一致的行动。MAPG在HM-EQA基准测试中表现出对强基线的一致性能改进,并引入了MAPG-Bench来评估度量语义目标对齐。此外,还提供了实际的机器人演示,展示了MAPG在结构化场景表示可用时超越模拟的潜力。
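The probabilistic composition step can be illustrated with the abstract's own example, "go two meters to the right of the fridge". The sketch below is an assumption-laden toy, not MAPG's actual model: it scores 2D grid cells by multiplying a Gaussian over distance with a Gaussian over angular deviation. The kernel shapes, the `sigma_d` and `sigma_theta` values, and the choice of "right" as the +x direction are all hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """Unnormalized Gaussian kernel."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def goal_distribution(anchor, direction, distance_m, cells,
                      sigma_d=0.3, sigma_theta=0.4):
    """Compose grounded pieces into a normalized distribution over cells.

    anchor:     (x, y) of the grounded object, e.g. the fridge
    direction:  vector for the grounded spatial relation, e.g. "right"
    distance_m: grounded metric constraint in meters
    """
    ax, ay = anchor
    target_angle = math.atan2(direction[1], direction[0])
    scores = {}
    for (x, y) in cells:
        dx, dy = x - ax, y - ay
        r = math.hypot(dx, dy)
        if r == 0.0:
            scores[(x, y)] = 0.0  # the anchor cell itself is never the goal
            continue
        angle = math.atan2(dy, dx)
        # Wrap the angular difference into [-pi, pi].
        diff = math.atan2(math.sin(angle - target_angle),
                          math.cos(angle - target_angle))
        scores[(x, y)] = (gaussian(r, distance_m, sigma_d)
                          * gaussian(diff, 0.0, sigma_theta))
    total = sum(scores.values())
    return {cell: s / total for cell, s in scores.items()}

# A 10 m x 10 m grid at 0.5 m resolution, centered on the anchor.
cells = [(i * 0.5, j * 0.5) for i in range(-10, 11) for j in range(-10, 11)]
dist = goal_distribution(anchor=(0.0, 0.0), direction=(1.0, 0.0),
                         distance_m=2.0, cells=cells)
best_cell = max(dist, key=dist.get)
```

Keeping each grounded component as a separate factor means an uncertain component (for example, an ambiguous reference frame for "right") widens the distribution instead of silently producing a wrong point estimate.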
TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
Authors: Pepe Alonso, Sergio Yovine, Victor A. Braberman
First: 2026-03-18T17:38:22+00:00 · Latest: 2026-03-19T17:12:15+00:00
Comments: Toolpaper, 7 pages, 7 tables, 3 figures, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)
Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions -- breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool that performs pre-change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self-correct. The map is delivered as a lightweight agent skill -- a static text file the agent queries at runtime. Evaluated on SWE-bench Verified with two open-weight models running on consumer hardware (Qwen3-Coder 30B, 100 instances; Qwen3.5-35B-A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% -- worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
中文标题/摘要
标题:TDAD:基于测试驱动的代理开发 - 通过基于图的影响分析减少AI编码代理的代码回退
AI编码代理可以解决现实世界中的软件问题,但它们经常引入回退——打破之前通过的测试。当前的基准测试几乎完全关注解决率,而对回退行为的研究则相对不足。本文介绍了TDAD(基于测试驱动的代理开发),这是一种开源工具,用于对AI编码代理进行预变更影响分析。TDAD构建了源代码与测试之间的依赖关系图,以便在提交补丁之前,代理知道需要验证哪些测试并可以自我纠正。该图以轻量级代理技能的形式提供——一个静态文本文件,代理在运行时查询。TDAD在SWE-bench Verified上进行了评估,使用在消费级硬件上运行的两个开源权重模型(Qwen3-Coder 30B,100个实例;Qwen3.5-35B-A3B,25个实例),与原始基线相比,TDAD将回退减少了70%(从6.08%降至1.82%)。相比之下,不带目标测试上下文的TDD过程指令反而将回退增加到9.94%,比没有任何干预还要差。当作为代理技能部署到不同的模型和框架中时,TDAD将问题解决率从24%提高到32%,证实了提供上下文信息优于规定程序化工作流。所有代码、数据和日志均可在https://github.com/pepealonso95/TDAD公开获取。
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:11:49+00:00 · Latest: 2026-03-19T17:11:49+00:00
Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)
Abstract
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
中文标题/摘要
标题:ADAPT:注意力驱动的自适应提示调度和插值正交补对于稀有概念生成
对于文本到图像合成而言,在生成稀有组合概念方面,扩散模型仍然面临挑战,尤其是对于训练数据中不常见的属性。虽然最近的方法,如R2F,通过利用LLM进行提示调度来解决这一挑战,但由于语言模型的随机性和迭代文本嵌入切换的次优指导,它们仍然存在固有的方差问题。为了解决这些问题,我们提出了ADAPT框架,这是一个无需训练的框架,可以确定性地规划和语义对齐提示调度,提供一致的指导以增强稀有概念的组合。通过利用注意力分数和正交组件,ADAPT在无需额外训练或微调的情况下,显著增强了在RareBench基准上稀有概念的组合生成。通过全面的实验,我们证明ADAPT在RareBench上实现了优越的性能,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,而不损害视觉完整性。
Summary / 总结
The research aims to improve the generation of rare compositional concepts in text-to-image synthesis using diffusion models. ADAPT, an attention-driven adaptive prompt scheduling framework, is proposed to address the challenges of variance and suboptimal guidance from previous methods. By leveraging attention scores and orthogonal components, ADAPT provides consistent guidance and enhances the compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Comprehensive experiments show that ADAPT outperforms existing methods and accurately reflects the semantic information of rare attributes, offering deterministic and precise control over rare composition generation while maintaining visual integrity.
研究旨在利用扩散模型提高文本到图像合成中稀有组合概念的生成。提出了一个基于注意力的自适应提示调度框架ADAPT,以解决变异性和次优指导的问题。通过利用注意力分数和正交组件,ADAPT 提供了一致的指导,并在 RareBench 基准测试中显著增强了稀有概念的组合生成,无需额外训练。全面的实验表明,ADAPT 在 RareBench 上的表现优于现有方法,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,同时保持了视觉完整性。
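The abstract mentions orthogonal components without giving a formula, so the following is only a generic linear-algebra sketch of what interpolating along an orthogonal complement can mean. It is not ADAPT's procedure: the vectors, the function names, and the `alpha` schedule parameter are hypothetical, and real text embeddings would be high-dimensional.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_complement(rare, base):
    """Part of `rare` that is orthogonal to `base` (Gram-Schmidt residual)."""
    scale = dot(rare, base) / dot(base, base)
    return [r - scale * b for r, b in zip(rare, base)]

def interpolate_guidance(base, rare, alpha):
    """Move from the base embedding along the rare-specific direction.

    alpha = 0 returns `base` unchanged; larger alpha adds more of the
    component that `rare` carries beyond what `base` already encodes.
    """
    residual = orthogonal_complement(rare, base)
    return [b + alpha * p for b, p in zip(base, residual)]

base_embedding = [1.0, 0.0]   # stand-in for a frequent-concept embedding
rare_embedding = [1.0, 1.0]   # stand-in for a rare-concept embedding
residual = orthogonal_complement(rare_embedding, base_embedding)
blended = interpolate_guidance(base_embedding, rare_embedding, alpha=0.5)
```

Because the residual has zero inner product with the base vector, changing `alpha` adjusts only the rare-specific part of the signal and leaves the shared part untouched.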
VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models
Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
First: 2026-03-19T17:10:29+00:00 · Latest: 2026-03-19T17:10:29+00:00
Comments: 23 pages. Includes figures and tables. Conference submission
Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
中文标题/摘要
标题:VEPO:低资源语言基础模型的可变熵策略优化
大型语言模型在低资源语言上的表现经常不尽如人意,主要原因是子词分段效率低下和系统性训练数据不平衡。本文提出了一种可变熵策略优化(VEPO),该方法利用可验证奖励的强化学习来将确定性的结构约束纳入策略对齐过程中。该框架确保了规定的序列长度、稳健的格式一致性以及严格的语言正确性,所有这些都在训练过程中得到保证。我们方法的核心是一种可变熵机制,它使模型能够动态校准字面忠实度和语义自然性之间的平衡,通过调节探索与利用的平衡。通过结合熵校正的优势估计和不对称剪裁,VEPO 保持了稳健的探索,同时减轻了策略崩溃。在90个FLORES-200、COMET-22、chrF方向上的实证评估表明,VEPO 在分词效率和翻译质量上取得了显著改进,缩小了代表性不足语言的性能差距。
Summary / 总结
VEPO is proposed to address the suboptimal performance of large language models on low-resource languages by incorporating deterministic structural constraints through Reinforcement Learning with Verifiable Rewards. The method uses a variable entropy mechanism to balance literal fidelity and semantic naturalness, and integrates entropy-tempered advantage estimation with asymmetric clipping to maintain robust exploration while mitigating policy collapse. Experiments across 90 translation directions, evaluated on FLORES-200 with COMET-22 and chrF, show that VEPO improves both tokenization efficiency and translation quality, narrowing the performance gap for underrepresented languages.
论文针对大型语言模型在低资源语言上的表现不佳问题,归因于不高效的子词分段和训练数据不平衡。提出了一种名为Variable Entropy Policy Optimization (VEPO)的方法,该方法利用可验证奖励的强化学习来将结构约束纳入训练过程。VEPO 包含一个可变熵机制,允许模型在字面准确性和语义自然性之间进行平衡,并结合熵调整的优势估计与不对称剪切来保持稳健的探索。实验表明,VEPO 在提高分词效率和翻译质量方面表现出色,缩小了对代表性不足的语言的性能差距。
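Asymmetric clipping can be shown with a PPO-style per-sample surrogate in which the lower and upper clip bounds differ. The sketch below is a generic illustration and not VEPO's exact objective: the bound values `eps_low` and `eps_high` are assumed, and the entropy-tempered advantage estimation from the abstract is not modeled because its form is not given there.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """PPO-style surrogate with separate lower and upper clip bounds.

    ratio:     pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage of the sampled action
    A larger eps_high than eps_low lets probabilities of good actions grow
    further per update than probabilities of bad actions can shrink.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (lower) bound of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

def batch_objective(ratios, advantages, eps_low=0.2, eps_high=0.3):
    """Mean surrogate over a batch; this quantity is maximized."""
    terms = [clipped_surrogate(r, a, eps_low, eps_high)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

For a positive advantage and `ratio = 1.25`, the asymmetric bounds above leave the term unclipped at 1.25, whereas a symmetric 0.2 bound would cap it at 1.2. That extra headroom on the upside is the exploration benefit usually attributed to asymmetric clipping.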
Optimal Splitting of Language Models from Mixtures to Specialized Domains
Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier
First: 2026-03-19T17:07:05+00:00 · Latest: 2026-03-19T17:07:05+00:00
Comments: 26 pages, 11 tables, 17 figures
Abstract
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
中文标题/摘要
标题:语言模型从混合到专业领域的最优划分
语言模型由于预训练数据的规模和多样性,在各种知识、语言和推理任务中取得了令人印象深刻的性能。标准训练范式是两阶段模式:首先在全部数据集上进行预训练,然后在全部数据集的高质量专业化子集上进行专业化训练。在多领域设置中,这涉及在每个专业化领域继续对多个模型进行预训练,称为划分模型训练。我们提出了一种方法,可以在通用预训练数据集上独立预训练多个模型,并使用标度定律确定预训练和继续预训练之间的最优计算分配。我们的方法可以准确预测具有D个预训练令牌和D'个专业化令牌的大小为N的模型的损失,并外推到更大的模型大小和令牌数量。应用于语言模型训练,我们的方法在不同模型大小和计算预算下的一般知识和推理基准测试中始终提高了性能。
Summary / 总结
The research aims to optimize the splitting of language models from a general pretraining corpus to specialized domains. The method involves pretraining multiple models independently on the general corpus and using scaling laws to determine the optimal compute allocation between pretraining and specialization. Key findings show consistent improvements in performance across common sense knowledge and reasoning benchmarks for different model sizes and compute budgets.
研究旨在优化将语言模型从通用预训练语料库拆分到专门领域的方法。方法包括独立地在通用语料库上预训练多个模型,并使用缩放定律来确定预训练和专门化之间的最优计算分配。关键发现表明,这种方法在不同模型大小和计算预算下,能够提高各种基准测试中的性能。
Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection
Authors: Ruilin Li, Heming Zou, Xiufeng Yan, Zheming Liang, Jie Yang, Chenliang Li, Xue Yang
First: 2026-03-19T17:01:14+00:00 · Latest: 2026-03-19T17:01:14+00:00
Abstract
Recent paradigms in Random Projection Layer (RPL)-based continual representation learning have demonstrated superior performance when building upon a pre-trained model (PTM). These methods insert a randomly initialized RPL after a PTM to enhance feature representation in the initial stage. Subsequently, a linear classification head is used for analytic updates in the continual learning stage. However, when the domain gap between pre-trained representations and target domains is severe, a randomly initialized RPL exhibits limited expressivity. While substantially scaling up the RPL dimension can improve expressivity, it also induces an ill-conditioned feature matrix, thereby destabilizing the recursive analytic updates of the linear head. To this end, we propose the Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM). Unlike random initialization, MGSM constructs the projection layer via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt the PTM representation to downstream tasks. This facilitates the construction of a compact yet expressive RPL while improving the numerical stability of analytic updates. Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.
中文标题/摘要
标题:基于引导随机投影增强预训练模型导向连续表示学习
基于随机投影层(RPL)的连续表示学习近期范式在构建于预训练模型(PTM)之上时表现出优越性能。这些方法在PTM之后插入一个随机初始化的RPL以增强初始阶段的特征表示。随后,使用线性分类头进行连续学习阶段的分析更新。然而,在预训练表示与目标域之间存在严重领域差距的情况下,随机初始化的RPL在大规模领域转移下表现出有限的表达能力。虽然大幅增加RPL维度可以提高表达能力,但也导致特征矩阵病态,从而不稳定化线性头的递归分析更新。为此,我们提出了一种带有MemoryGuard监督机制的随机连续学习器(SCL-MGSM)。与随机初始化不同,MGSM通过一个原理上数据导向的机制逐步选择目标对齐的随机基,以适应PTM表示以适应下游任务。这促进了紧凑而表达能力强的RPL的构建,并提高了分析更新的数值稳定性。在多个无示例增量学习(CIL)基准上的广泛实验表明,SCL-MGSM在性能上优于现有方法。
Summary / 总结
The research aims to enhance the performance of pre-trained model-based continual representation learning by addressing the limitations of randomly initialized Random Projection Layers (RPLs) under domain shifts. The proposed method, SCL-MGSM, uses a data-guided mechanism to construct the RPL, which selects target-aligned random bases to adapt pre-trained models to downstream tasks. This approach improves both the expressivity and numerical stability of the model, leading to superior performance in Class Incremental Learning benchmarks compared to existing methods.
论文针对预训练模型在大域移位下的随机投影层(RPL)表达能力有限的问题,提出了一种名为SCL-MGSM的方法,通过数据导向机制逐步选择对齐目标的任务随机基,以适应预训练模型到下游任务的表示。这种方法提高了RPL的紧凑性和表达能力,同时确保线性分类头的数值稳定性,从而在多个无示例增量学习基准上取得了优于现有方法的性能。
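The baseline pipeline the abstract describes, a frozen random projection followed by a linear head with recursive analytic updates, can be sketched in plain Python as below. This shows only the generic setup and not the MGSM basis selection, which the abstract does not specify. The ReLU nonlinearity, the Gaussian initialization, the ridge parameter `lam`, and the scalar output are assumptions made to keep the example small.

```python
import random

def random_projection(dim_in, dim_out, seed=0):
    """Frozen random weight matrix, drawn once and never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(weights, x):
    """ReLU(W x): random features on top of a pre-trained embedding x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

class RecursiveRidgeHead:
    """Linear head updated one sample at a time via Sherman-Morrison.

    After n updates, w equals the ridge solution (Phi^T Phi + lam I)^-1 Phi^T y
    over all samples seen so far, without storing those samples.
    """

    def __init__(self, dim, lam=1.0):
        self.P = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                  for i in range(dim)]
        self.w = [0.0] * dim

    def predict(self, phi):
        return sum(wi * f for wi, f in zip(self.w, phi))

    def update(self, phi, y):
        n = len(phi)
        p_phi = [sum(p * f for p, f in zip(row, phi)) for row in self.P]
        denom = 1.0 + sum(f * v for f, v in zip(phi, p_phi))
        # Rank-one downdate of the inverse regularized Gram matrix.
        self.P = [[self.P[i][j] - p_phi[i] * p_phi[j] / denom
                   for j in range(n)] for i in range(n)]
        gain = [sum(p * f for p, f in zip(row, phi)) for row in self.P]
        error = y - self.predict(phi)
        self.w = [wi + g * error for wi, g in zip(self.w, gain)]
```

A quick check in one dimension: with `lam=1.0` and samples (1, 2) and (2, 4), the closed-form ridge weight is (1*2 + 2*4) / (1 + 1 + 4) = 10/6, and two calls to `update` arrive at the same value. The ill-conditioning the abstract warns about shows up here as very small values of `lam` relative to the feature scale, which make `P` large and the updates sensitive to rounding.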