Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
First: 2026-03-19T17:59:58+00:00 · Latest: 2026-03-19T17:59:58+00:00
Comments: 31 pages, 12 figures
Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
中文标题/摘要
标题:生成模型了解空间:释放隐含的三维先验以促进场景理解
虽然多模态大型语言模型展示了令人印象深刻的语义能力,但它们往往在空间感知方面存在缺陷,难以进行精细的几何推理和物理动力学处理。现有解决方案通常依赖于显式的三维模态或复杂的几何结构,这些方法受限于数据稀缺性和泛化挑战。在本工作中,我们提出了一种范式转变,通过利用大规模视频生成模型中的隐含空间先验。我们认为,为了合成时空连贯的视频,这些模型会内在地学习稳健的三维结构先验和物理法则。我们引入了VEGA-3D(视频提取生成意识)框架,该框架将预训练的视频扩散模型重新用作潜在世界模拟器。通过从中间噪声级别提取时空特征,并通过基于令牌的自适应门控融合机制将其与语义表示集成,我们为MLLMs提供了密集的几何线索,而无需显式的三维监督。在三维场景理解、空间推理和具身操作基准测试中的广泛实验表明,我们的方法优于最先进的基线,验证了生成先验为物理世界理解提供了可扩展的基础。代码可在https://github.com/H-EmbodVis/VEGA-3D公开获取。
Summary / 总结
This work addresses the spatial limitations of multimodal large language models by leveraging implicit 3D priors from video generation models. VEGA-3D, a plug-and-play framework, enhances these models with spatiotemporal features and token-level adaptive gated fusion, enriching them with geometric cues. Experiments show that VEGA-3D outperforms state-of-the-art methods in 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks, validating the effectiveness of generative priors for physical-world understanding.
该研究通过利用视频生成模型中的隐式3D先验来解决多模态大语言模型的空间局限性。提出的VEGA-3D框架通过token级自适应门控融合机制为MLLMs增加几何线索,无需显式的3D监督。实验表明,VEGA-3D在3D场景理解、空间推理和实体操作基准测试中优于最先进的方法,验证了生成先验在物理世界理解中的有效性。
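Illustrative sketch (not from the paper): the abstract names a "token-level adaptive gated fusion mechanism" but does not give its form. One common reading is a per-token scalar gate that blends a semantic token with a geometric token. The function name `gated_fusion`, the gate parameters `gate_weights` and `gate_bias`, and the convex-combination form are all assumptions made here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, gate_weights, gate_bias=0.0):
    """Blend two aligned token sequences with one scalar gate per token.

    semantic, geometric: lists of tokens; each token is a list of floats,
        and both sequences share the same length and token width.
    gate_weights: hypothetical learned weights over the concatenated
        [semantic; geometric] token (length = 2 * token width).
    """
    fused = []
    for s_tok, g_tok in zip(semantic, geometric):
        concat = s_tok + g_tok
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        gate = sigmoid(logit)  # one gate value per token
        # Convex combination: gate near 1 favours the geometric token.
        fused.append([(1.0 - gate) * s + gate * g for s, g in zip(s_tok, g_tok)])
    return fused
```

With all-zero gate weights and zero bias the gate is 0.5 for every token, so the output is the plain average of the two streams; a trained gate would instead vary per token.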
Matryoshka Gaussian Splatting
Authors: Zhilin Guo, Boqiao Zhang, Hakan Aktas, Kyle Fogarty, Jeffrey Hu, Nursena Koprucu Aslan, Wenzhao Li, Canberk Baykal, Albert Miao, Josef Bengtson, Chenliang Zhou, Weihao Xia, Cristina Nader Vasconcelos, Cengiz Oztireli
First: 2026-03-19T17:59:56+00:00 · Latest: 2026-03-19T17:59:56+00:00
Comments: project page: https://zhilinguo.github.io/MGS
Abstract
The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
中文标题/摘要
标题:Matryoshka 高斯点绘制
从单一模型以可调节保真度渲染场景的能力,即层次细节(LoD),对于3D高斯点绘制(3DGS)的实际部署至关重要。现有的离散LoD方法仅暴露有限的操作点集,而同时进行的连续LoD方法虽然能够更平滑地扩展,但往往在全容量下会遭受明显的质量下降,使LoD成为一个昂贵的设计选择。我们提出了Matryoshka高斯点绘制(MGS),这是一种训练框架,能够在标准3DGS流水线中实现连续LoD,而不牺牲全容量渲染质量。MGS学习一个有序的高斯集合,使得渲染任何前缀,前k个点,都能产生一个连贯的重建,其保真度随着预算增加而平滑提高。我们的核心思想是随机预算训练:每次迭代都采样一个随机的点绘制预算,并优化相应的前缀和整个集合。该策略只需要两次前向传递,并且不需要架构修改。在四个基准和六个基线上的实验表明,MGS在匹配其主干的全容量性能的同时,能够从单一模型中实现连续的速度-质量权衡。广泛的消融实验进一步验证了排序策略、训练目标和模型容量的设计。
Summary / 总结
The paper introduces Matryoshka Gaussian Splatting (MGS), a training framework that allows for continuous level of detail (LoD) in 3D Gaussian Splatting (3DGS) without compromising full-capacity rendering quality. MGS learns a single ordered set of Gaussians, enabling a smooth speed-quality trade-off. The method uses stochastic budget training, which requires only two forward passes and no architectural changes. Experiments show that MGS matches full-capacity performance while providing a continuous LoD, validated by extensive ablations on ordering strategies, training objectives, and model capacity.
研究旨在改进3D高斯点绘制(3DGS)中的层次细节(LoD),以实现场景保真度的可调节渲染。引入了Matryoshka高斯点绘制(MGS),该方法学习一个单一有序的高斯集合,以实现连续的LoD而不牺牲全容量渲染质量。该方法使用随机预算训练,只需两次前向传递且无需架构修改。实验表明,MGS在保持全容量性能的同时,能够从单一模型中提供连续的速度-质量权衡。
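Illustrative sketch (not from the paper): the abstract describes stochastic budget training as sampling a random splat budget each iteration and optimising both that prefix and the full set, which takes two forward passes. The toy below replaces the renderer with a sum over scalar "splats" so the control flow is visible. The names `render` and `stochastic_budget_loss`, the uniform budget sampling, and the equal weighting of the two losses are assumptions.

```python
import random


def render(splats, k):
    """Toy stand-in for a renderer: the 'image' is the sum of the first k splats."""
    return sum(splats[:k])


def stochastic_budget_loss(splats, target, rng):
    """Loss for one training iteration: a random prefix plus the full set.

    Two calls to render() correspond to the two forward passes mentioned
    in the abstract. Returns the sampled budget and the combined loss.
    """
    n = len(splats)
    k = rng.randint(1, n)  # random splat budget, both bounds inclusive
    prefix_loss = (render(splats, k) - target) ** 2
    full_loss = (render(splats, n) - target) ** 2
    return k, prefix_loss + full_loss


# Usage: an ordered set whose prefixes approach the target as k grows.
splats = [0.5, 0.25, 0.125, 0.125]
budget, loss = stochastic_budget_loss(splats, target=1.0, rng=random.Random(0))
```

Because every prefix length can be sampled, the ordering of the splats is pushed toward one where early elements carry the most reconstruction value, which is what makes "render the first k" a usable level-of-detail control.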
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
Venue: CVPR 2026
First: 2026-03-19T17:59:55+00:00 · Latest: 2026-03-19T17:59:55+00:00
Comments: Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD
Abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
中文标题/摘要
标题:立方离散扩散:高维表示令牌上的离散视觉生成
使用离散令牌进行视觉生成引起了广泛关注,因为它能够与语言模型共享统一的令牌预测范式,有望实现无缝的多模态架构。然而,当前的离散生成方法仍然局限于低维潜变量(通常为8-32维),牺牲了理解所需的语义丰富性。虽然高维预训练表示(768-1024维)可以弥补这一差距,但它们的离散生成提出了根本性的挑战。在本文中,我们提出了立方离散扩散(CubiD),这是第一个用于高维表示的离散生成模型。CubiD 在高维离散表示中执行精细粒度的掩码——任何维度在任何位置都可以被掩码并从部分观察中预测。这使模型能够学习丰富的内部和跨空间位置的相关性,生成步骤数固定为 $T$,与特征维度无关,其中 $T \ll hwd$。在ImageNet-256上,CubiD 达到了从900M到3.7B参数的最强离散生成性能。关键的是,我们验证了这些离散化令牌保留了原始表示能力,证明了相同的离散令牌可以有效地服务于理解和生成任务。我们希望这项工作能够激发未来研究向统一的多模态架构方向发展。代码可在:https://github.com/YuqingWang1029/CubiD 获取。
Summary / 总结
Cubic Discrete Diffusion (CubiD) is the first discrete generation model for high-dimensional representations, addressing the limitations of low-dimensional tokens in visual generation. CubiD uses fine-grained masking in high-dimensional discrete representations, allowing it to learn rich correlations and scale effectively from 900M to 3.7B parameters. On ImageNet-256, CubiD achieves state-of-the-art discrete generation and preserves the original representation capabilities, enabling both understanding and generation tasks. This work paves the way for unified multimodal architectures.
Cubic Discrete Diffusion (CubiD) 是首个用于高维表示的离散生成模型,解决了低维令牌在视觉生成中的局限性。通过在高维离散表示中进行精细粒度的掩码,CubiD 学习了丰富的相关性,并在 ImageNet-256 上实现了离散生成的最新成果,具有良好的可扩展性。该模型展示了离散化令牌既能用于理解又能用于生成任务,为统一的多模态架构铺平了道路。
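Illustrative sketch (not from the paper): the abstract states that any dimension at any position can be masked and predicted from partial observations. The snippet below shows that masking pattern on an h x w x d grid of discrete token ids, where masking is applied per entry rather than per spatial position. The sentinel `MASK`, the function `mask_cube`, and uniform random selection are assumptions; the paper's actual masking schedule is not given in the abstract.

```python
import random

MASK = -1  # hypothetical sentinel id for a masked entry


def mask_cube(tokens, mask_ratio, rng):
    """Mask individual entries of an [h][w][d] nested list of token ids.

    Returns a masked copy and the list of masked (row, col, dim) coordinates.
    The input is left unchanged.
    """
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, c) for i in range(h) for j in range(w) for c in range(d)]
    n_mask = round(mask_ratio * len(coords))
    chosen = rng.sample(coords, n_mask)  # sampled without replacement
    masked = [[list(vec) for vec in row] for row in tokens]
    for i, j, c in chosen:
        masked[i][j][c] = MASK
    return masked, chosen
```

A position-level scheme would hide all d entries at a chosen (row, col) together; masking per entry lets a model condition on some dimensions of a position while predicting the others.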
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
Authors: Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu
First: 2026-03-19T17:59:52+00:00 · Latest: 2026-03-19T17:59:52+00:00
Comments: Project page: https://lihaitian.com/MonoArt
Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
中文标题/摘要
标题:MonoArt:面向单目铰接式 3D 重建的渐进结构推理
从单张图像重建铰接式 3D 对象需要在视觉证据有限的情况下联合推断对象几何、部件结构和运动参数。关键难点在于运动线索与对象结构的纠缠,这使得直接的铰接回归不稳定。现有方法通过多视角监督、基于检索的组装或辅助视频生成来应对这一挑战,但往往牺牲了可扩展性或效率。我们提出了 MonoArt,这是一种基于渐进结构推理的统一框架。MonoArt 不是从图像特征直接预测铰接,而是在单一架构中逐步将视觉观察转化为规范几何、结构化部件表示和运动感知嵌入。这种结构化推理过程使得在没有外部运动模板或多阶段流水线的情况下,能够实现稳定且可解释的铰接推断。在 PartNet-Mobility 上的广泛实验表明,MonoArt 在重建精度和推理速度方面均达到最先进的性能。该框架进一步推广到机器人操作和铰接式场景重建。
Summary / 总结
MonoArt addresses the challenge of reconstructing articulated 3D objects from a single image by progressively transforming visual observations into canonical geometry and structured part representations within a unified architecture. This method avoids the instability of direct articulation regression and achieves state-of-the-art performance in both reconstruction accuracy and inference speed on PartNet-Mobility. The framework also generalizes to robotic manipulation and articulated scene reconstruction.
MonoArt通过在一个统一框架中逐步将视觉观察转化为规范几何、结构化部件表示和运动感知嵌入来解决从单张图像重建铰接式 3D 对象的挑战。该方法避免了直接铰接回归的不稳定性,并在PartNet-Mobility上实现了在重建精度和推理速度方面的最先进性能。该框架还适用于机器人操作和铰接式场景重建。
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
Authors: Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: Project Website: https://navtrust.github.io
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
中文标题/摘要
标题:NavTrust:评估实体导航可信度基准
实体导航主要分为两类:视觉-语言导航(VLN),其中代理通过遵循自然语言指令进行导航;以及物体目标导航(OGN),其中代理导航至指定目标物体。然而,现有工作主要在理想条件下评估模型性能,忽视了真实世界环境中可能出现的潜在干扰。为解决这一问题,我们提出了NavTrust,这是一个统一基准,系统地在现实场景中对输入模态(包括RGB、深度和指令)进行干扰,并评估其对导航性能的影响。据我们所知,NavTrust是第一个在统一框架中使实体导航代理暴露于多种RGB-Depth干扰和指令变化的基准。我们对七种最先进的方法进行了广泛评估,发现它们在现实干扰下的性能显著下降,这突显了关键的鲁棒性差距,并为更可信的实体导航系统指明了方向。此外,我们系统地评估了四种不同的缓解策略,以增强对RGB-Depth和指令干扰的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们将其部署在真实移动机器人上,观察到对干扰的鲁棒性有所提高。项目网站:https://navtrust.github.io
Summary / 总结
NavTrust benchmarks the trustworthiness of embodied navigation systems by introducing realistic corruptions to RGB, depth, and instructions in VLN and OGN tasks. It evaluates seven state-of-the-art approaches and finds significant performance degradation under realistic corruptions, highlighting robustness gaps. NavTrust also assesses four mitigation strategies, showing improved robustness with base models Uni-NaVid and ETPNav.
NavTrust通过在RGB、深度和指令中引入现实世界的干扰来评估实体导航的可信度,涵盖了VLN和OGN任务。它评估了七种最先进的方法,并发现这些方法在现实条件下表现显著下降,表明存在关键的鲁棒性差距。NavTrust还评估了四种不同的缓解策略以提高对RGB-Depth和指令干扰的鲁棒性,基模型Uni-NaVid和ETPNav在真实移动机器人上表现出更好的鲁棒性。
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: 24 pages, 12 figures
Abstract
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
中文标题/摘要
标题:SAMA:因子化语义锚定与运动对齐的指令引导视频编辑
当前的指令引导视频编辑模型难以同时平衡精确的语义修改与忠实的运动保留。现有方法依赖于注入显式的外部先验(例如,VLM特征或结构条件)来缓解这些问题,但这种依赖严重限制了模型的鲁棒性和泛化能力。为克服这一限制,我们提出了SAMA(因子化语义锚定与运动对齐),一种将视频编辑分解为语义锚定和运动建模的框架。首先,我们引入了语义锚定,通过联合预测稀疏锚帧的语义标记和视频潜在变量来建立可靠的视觉锚点,从而实现纯粹基于指令的结构规划。其次,运动对齐在以运动为中心的视频恢复预训练任务(立方体填充、速度扰动和管子打乱)上预训练相同的骨干网络,使模型能够直接从原始视频中内化时间动态。SAMA 通过两阶段优化:无配对视频-指令编辑数据的因子化预训练阶段,学习固有的语义-运动表示,随后在配对编辑数据上进行监督微调。值得注意的是,仅因子化预训练本身就已经表现出强大的零样本视频编辑能力,验证了所提出的分解的有效性。SAMA 在开源模型中达到了最先进的性能,并且与领先的商用系统(例如 Kling-Omni)竞争。代码、模型和数据集将被发布。
Summary / 总结
SAMA addresses the challenge of balancing precise semantic modifications with faithful motion preservation in instruction-guided video editing. It factorizes video editing into semantic anchoring and motion modeling. SAMA introduces Semantic Anchoring to establish a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, and Motion Alignment to pre-train the model on motion-centric tasks. SAMA shows strong zero-shot video editing ability and achieves state-of-the-art performance among open-source models, comparable to leading commercial systems.
SAMA 解决了在指令引导的视频编辑中精确语义修改与忠实运动保持之间的平衡问题。它将过程分解为语义锚定和运动建模。语义锚定通过在稀疏帧上联合预测语义标记和视频潜在变量来建立可靠的视觉锚点,实现基于指令的结构规划。运动对齐通过在运动中心任务上进行预训练,使模型能够直接从原始视频中学习时间动态。SAMA 通过两阶段优化:因子化预训练和监督微调。值得注意的是,仅因子化预训练就展示了强大的零样本视频编辑能力,验证了该因子化方法的有效性。SAMA 在开源模型中表现出最佳性能,并且与领先商用系统(如 Kling-Omni)竞争。
Under One Sun: Multi-Object Generative Perception of Materials and Illumination
Authors: Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino
First: 2026-03-19T17:59:45+00:00 · Latest: 2026-03-19T17:59:45+00:00
Abstract
We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents -- reflectance, texture, and illumination -- underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate "cross-talk" between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.
中文标题/摘要
标题:同一天空之下:多对象生成感知材料与照明
我们介绍了多对象生成感知(MultiGP),这是一种生成逆渲染方法,可以从单张图像中随机采样所有辐射成分——反射率、纹理和照明,以解析对象外观。我们解决这种固有含糊的辐射成分解耦的关键思想是利用这样一个事实:虽然它们的纹理和反射率可能不同,但同一场景中的所有物体都由相同的照明照亮。MultiGP 利用这种共识,基于四个关键技术贡献从已知形状的单张图像中生成反射率、纹理和照明的样本:级联端到端架构结合图像空间和角度空间解耦;协调引导以实现扩散收敛到一致的照明估计;轴向注意力以促进不同反射率对象之间的“交流”;以及纹理提取控制网以保留高频纹理细节并确保与估计照明的解耦。实验结果表明,MultiGP 能够有效利用多个对象外观的空间和频率互补特性,恢复个体纹理和反射率以及共同的照明。
Summary / 总结
The research introduces Multi-Object Generative Perception (MultiGP), a method for disentangling reflectance, texture, and illumination from a single image using a cascaded end-to-end architecture, coordinated guidance, axial attention, and a Texture Extraction ControlNet. The method effectively recovers individual texture and reflectance as well as the common illumination from multiple objects in the same scene, demonstrating its capability to handle the inherent ambiguity in radiometric disentanglement.
研究引入了Multi-Object Generative Perception (MultiGP) 方法,通过级联端到端架构、协调引导、轴向注意力和纹理提取ControlNet,从单张图像中分离出反射、纹理和照明。该方法能够有效恢复多个物体的个体纹理和反射以及共同的照明,展示了其在处理辐射量分离固有模糊性方面的能力。
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta
First: 2026-03-19T17:59:41+00:00 · Latest: 2026-03-19T17:59:41+00:00
Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication
Abstract
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
中文标题/摘要
标题:FinTradeBench:LLMs的金融推理基准
现实世界的金融决策是一个具有挑战性的问题,需要在公司基本面(来自监管文件的数据)和价格动态计算的交易信号等多种信号之间进行推理。近年来,随着大型语言模型(LLMs)的发展,金融分析师开始使用它们进行金融决策任务。然而,现有的金融问答基准主要侧重于公司资产负债表数据,很少评估公司股票在市场上的交易情况及其与基本面的互动。为了充分利用两种方法的优势,我们引入了FinTradeBench,这是一个结合公司基本面和交易信号的金融推理基准。FinTradeBench包含1400个基于纳斯达克100公司的问题,时间跨度为十年。基准分为三大类推理问题:以基本面为主、以交易信号为主和需要跨信号推理的混合问题。为了确保大规模的可靠性,我们采用了校准-扩展框架,结合了专家种子问题、多模型响应生成、模型内自我筛选、数值审计和人-LLM法官对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著提高了对文本基本面的推理,但在交易信号推理方面提供的帮助有限。这些发现突显了当前LLM在数值和时间序列推理方面的基本挑战,并激发了未来金融智能研究的动力。
Summary / 总结
FinTradeBench is a benchmark designed to evaluate financial reasoning capabilities of Large Language Models (LLMs) by integrating company fundamentals and trading signals. It includes 1,400 questions on NASDAQ-100 companies over a ten-year period, categorized into three types: fundamentals-focused, trading-signal-focused, and hybrid. The evaluation reveals that retrieval-augmented settings enhance reasoning over textual fundamentals but offer limited benefits for trading-signal reasoning, highlighting current LLMs' challenges in numerical and time-series reasoning and suggesting areas for future research in financial intelligence.
FinTradeBench 是一个基准,旨在通过结合公司基本面和交易信号来评估大型语言模型(LLMs)的金融推理能力。它包含1,400个问题,覆盖纳斯达克100公司长达十年的数据,分为三大类:基本面导向、交易信号导向和混合型。评估结果显示,检索增强设置可以增强对文本基本面的推理,但在交易信号推理方面提供的帮助有限,突显了当前LLMs在数值和时间序列推理方面的挑战,并指出了未来金融智能研究的方向。
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
Authors: Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding
Venue: CVPR 2026
First: 2026-03-19T17:59:22+00:00 · Latest: 2026-03-19T17:59:22+00:00
Comments: CVPR 2026, Project Page: https://henghuiding.com/EffectErase/
Abstract
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching, together with an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
中文标题/摘要
标题:EffectErase:联合视频对象去除与插入以实现高质量效果擦除
视频对象去除旨在消除动态目标对象及其视觉效果,如变形、阴影和反射,同时恢复无缝背景。基于扩散的视频修补和对象去除方法可以去除对象,但往往难以消除这些效果并合成连贯的背景。除了方法限制外,进展还受到缺乏一个全面的数据集的阻碍,该数据集系统地捕捉了不同环境中的常见对象效果,用于训练和评估。为了解决这个问题,我们引入了VOR(视频对象去除)数据集,该数据集提供了多样化的配对视频,每个配对视频包括一个目标对象及其效果存在的视频和一个目标对象及其效果不存在的视频,以及相应的对象掩码。VOR包含来自捕获和合成源的60,000个高质量视频配对,涵盖了五种效果类型,以及广泛的对象类别和复杂的动态多对象场景。基于VOR,我们提出了一种EffectErase,这是一种具有效果意识的视频对象去除方法,将视频对象插入视为在互逆学习方案中的逆辅助任务。该模型包括任务感知区域指导,专注于受影响区域的学习,并允许灵活的任务切换。然后,一种插入去除一致性目标,鼓励互补行为并共享效果区域和结构线索的定位。在VOR上训练后,EffectErase在广泛的实验中表现出色,实现了在各种场景中高质量的视频对象效果擦除。
Summary / 总结
The research aims to improve video object removal by addressing the limitations of existing methods in erasing visual effects like deformation and shadows. It introduces VOR, a large dataset that captures diverse object effects across various environments. Based on VOR, EffectErase is proposed, an effect-aware method that uses a reciprocal learning scheme and an insertion-removal consistency objective to focus on affected areas and enhance background synthesis. Experiments show that EffectErase outperforms existing methods in erasing video object effects and restoring seamless backgrounds.
研究旨在通过解决现有方法在消除变形、阴影等视觉效果方面的局限性,提高视频对象去除的效果。研究引入了VOR数据集,用于训练和评估对象去除方法,并提出了EffectErase方法,该方法采用互逆学习方案和插入-去除一致性目标来增强效果去除。EffectErase在各种场景中表现出色,提供了高质量的视频对象效果去除。
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
First: 2026-03-19T17:59:21+00:00 · Latest: 2026-03-19T17:59:21+00:00
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
中文标题/摘要
标题:F2LLM-v2:包容、高性能且高效的多语言嵌入模型
我们提出了F2LLM-v2,这是一种新的通用多语言嵌入模型系列,包含8种不同规模,从80M到14B。该模型基于6000万条高质量公开数据样本的新综合训练而成,支持超过200种语言,特别强调了之前未充分服务的中低资源语言。通过结合两阶段基于LLM的嵌入训练流水线、matryoshka学习、模型剪枝和知识蒸馏技术,我们展示了比之前基于LLM的嵌入模型更高效的模型,同时保持了竞争力。广泛的评估证实,F2LLM-v2-14B在11个MTEB基准测试中排名第一,而该系列中的较小模型也设定了资源受限应用的新标准。为了促进开源嵌入模型研究,我们发布了所有模型、数据、代码和中间检查点。
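Illustrative sketch (not from the paper): the abstract lists matryoshka learning among the training techniques without further detail. In general, Matryoshka-style embeddings are trained so that a leading slice of the vector is itself a usable embedding, which lets a consumer trade accuracy for storage by truncating. The helper `truncate_embedding` below is a hypothetical name and shows only that consumer-side step, not how F2LLM-v2 is trained.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


# Usage: shrink a 3-d toy vector to its first 2 coordinates.
small = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Re-normalising after truncation keeps cosine similarity and dot product equivalent on the shortened vectors.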
Spectrally-Guided Diffusion Noise Schedules
Authors: Carlos Esteves, Ameesh Makadia
First: 2026-03-19T17:59:12+00:00 · Latest: 2026-03-19T17:59:12+00:00
Abstract
Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
中文标题/摘要
标题:光谱引导的扩散噪声调度
去噪扩散模型广泛用于高质量图像和视频生成。它们的性能取决于噪声调度,这定义了训练过程中应用的噪声水平的分布以及采样过程中遍历的噪声水平序列。噪声调度通常由手工设计,并且需要在不同分辨率下进行手动调整。在本工作中,我们提出了一种基于图像光谱特性的像素扩散噪声调度的原理化设计方法。通过推导最小和最大噪声水平有效性的理论界,我们设计了“紧凑”的噪声调度,以消除冗余步骤。在推理过程中,我们提出有条件地采样这样的噪声调度。实验表明,我们的噪声调度在单阶段像素扩散模型的生成质量方面有所改进,特别是在低步骤区间。
Summary / 总结
This paper addresses the challenge of designing effective noise schedules for denoising diffusion models, which are crucial for high-quality image and video generation. The authors propose a method based on the spectral properties of images to create per-instance noise schedules, which are theoretically derived to be tight and eliminate redundant steps. The experiments demonstrate that these noise schedules enhance the generative quality of single-stage pixel diffusion models, especially in the low-step regime.
研究旨在通过优化噪声调度来提升图像和视频生成中去噪扩散模型的性能。作者提出了一种方法,利用图像的频谱特性设计实例特定的噪声调度,这些调度是理论推导出来的,能够高效运行并消除冗余步骤。实验结果表明,这些优化的噪声调度能够提高单阶段像素扩散模型的生成质量,特别是在步骤较少的情况下表现尤为明显。
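Illustrative sketch (not from the paper): a sampling-time noise schedule is a decreasing sequence of noise levels between a maximum and a minimum. The snippet builds a generic geometric (log-uniform) sequence between two given bounds. The paper derives its bounds from each image's spectrum; here `sigma_min` and `sigma_max` are plain inputs, and the geometric spacing and the name `geometric_sigmas` are assumptions chosen only to show why tighter bounds matter for a fixed step count.

```python
def geometric_sigmas(sigma_min, sigma_max, num_steps):
    """Return num_steps noise levels from sigma_max down to sigma_min.

    Consecutive levels differ by a constant ratio, so a narrower
    [sigma_min, sigma_max] range gives finer spacing for the same step count.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0.0 < sigma_min < sigma_max:
        raise ValueError("require 0 < sigma_min < sigma_max")
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_steps - 1))
    return [sigma_max * ratio ** i for i in range(num_steps)]


# Usage: the same 5 steps over a wide range and over a tighter one.
wide = geometric_sigmas(0.01, 100.0, 5)
tight = geometric_sigmas(0.1, 10.0, 5)
```

In the wide schedule each step divides the noise level by about 10, while in the tight one each step divides it by about 3.16, so no steps are spent on noise levels outside the useful range.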
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
First: 2026-03-19T17:58:52+00:00 · Latest: 2026-03-19T17:58:52+00:00
Comments: We release the model and data at https://huggingface.co/collections/nvidia/nemotron-cascade-2
Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
中文标题/摘要
标题:Nemotron-Cascade 2:级联RL和多域在线策略蒸馏后的训练后LLM
我们介绍了Nemotron-Cascade 2,这是一个开放的30B模型,具有3B激活参数,提供最佳推理能力和强大的代理能力。尽管其体积较小,但在数学和编程推理性能上接近前沿开放模型。它是继DeepSeekV3.2-Speciale-671B-A37B之后第二个在2025年国际数学奥林匹克(IMO)、国际信息学奥林匹克(IOI)和ICPC世界总决赛中获得金牌级别的开放权重LLM,显示出极高的智能密度,参数量减少20倍。与Nemotron-Cascade 1相比,关键的技术进步如下。在精心策划的数据集上进行SFT后,我们大幅扩展了级联RL,涵盖了更广泛的推理和代理领域。此外,我们引入了多域在线策略蒸馏,从每个领域的最强中间教师模型进行蒸馏,整个级联RL过程中,使我们能够高效地恢复基准回归并保持强大的性能提升。我们发布了模型检查点和训练数据的集合。
Summary / 总结
Nemotron-Cascade 2 is an open 30B MoE model with 3B activated parameters that targets reasoning and agentic tasks. Despite its compact size, it reaches Gold Medal-level performance on the 2025 IMO, IOI, and ICPC World Finals, the second open-weight LLM to do so after DeepSeekV3.2-Speciale-671B-A37B. After SFT, post-training expands Cascade RL to a broader range of reasoning and agentic domains and adds multi-domain on-policy distillation from the strongest intermediate teacher model for each domain, which recovers benchmark regressions and sustains performance gains throughout Cascade RL.
Nemotron-Cascade 2 是一个30B参数的模型,其中有3B个活跃参数,其在推理和代理能力方面表现出色,尽管规模较小,但在国际重大比赛中取得了顶级成绩。该模型使用后训练方法结合级联强化学习和多领域在线策略蒸馏来保持强劲表现。关键改进包括将级联强化学习扩展到涵盖更多推理和代理领域,并使用来自强中间教师模型的蒸馏来恢复基准水平并保持性能提升。
DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Authors: Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu
First: 2026-03-19T17:58:22+00:00 · Latest: 2026-03-19T17:58:22+00:00
Comments: Project Page: https://paryi555.github.io/DriveTok/ Code: https://github.com/paryi555/DriveTok
Abstract
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
中文标题/摘要
标题:DriveTok:统一多视图重建与理解的3D驾驶场景分词
随着视觉-语言-动作模型和世界模型在自动驾驶系统中的广泛应用,可扩展的图像分词成为视觉模态与系统交互的关键接口。然而,大多数现有的分词器都是为单目和2D场景设计的,这在应用于高分辨率多视图驾驶场景时会导致效率低下和视图间不一致。为了解决这个问题,我们提出了DriveTok,一种用于统一多视图重建与理解的高效3D驾驶场景分词器。DriveTok首先从视觉基础模型中获取丰富的语义视觉特征,然后通过3D可变形交叉注意力将这些特征转换为场景分词。在解码过程中,我们采用多视图变换器从场景分词重建多视图特征,并使用多个头获得RGB、深度和语义重建。我们还在场景分词上直接添加了一个3D头,用于3D语义占用预测,以提高空间意识。通过多种训练目标,DriveTok学习了能够整合语义、几何和纹理信息的统一场景分词,以实现高效的多视图分词。在广泛使用的nuScenes数据集上的大量实验表明,DriveTok生成的场景分词在图像重建、语义分割、深度预测和3D占用预测任务中表现良好。
Summary / 总结
DriveTok addresses the inefficiency and inter-view inconsistency of existing monocular, 2D tokenizers on high-resolution multi-view driving scenes. It transforms semantically rich features from vision foundation models into scene tokens with 3D deformable cross-attention. A multi-view transformer then reconstructs multi-view features from those tokens, and separate heads produce RGB, depth, and semantic reconstructions. A 3D head applied directly to the scene tokens predicts 3D semantic occupancy. On the nuScenes dataset, the scene tokens perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction.
DriveTok 是一个高效的 3D 驾驶场景分词器,用于统一的多视图重建和理解。它使用来自视觉基础模型的语义丰富的视觉特征,并通过 3D 变形交叉注意力将其转换为场景令牌。DriveTok 在 nuScenes 数据集上的广泛实验中,在图像重建、语义分割、深度预测和 3D 占有率预测任务中表现良好。
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
First: 2026-03-19T17:58:13+00:00 · Latest: 2026-03-19T17:58:13+00:00
Comments: Project page: https://kd-tao.github.io/LVOmniBench/
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
中文标题/摘要
标题:LVOmniBench:面向多模态LLM的长音频视频理解评估
近年来,多模态大型语言模型(OmniLLMs)在理解音频和视频输入方面取得了显著进步。然而,当前的评估主要集中在10秒到5分钟的短音频和视频片段上,未能反映实际应用中的需求,而实际应用中的视频通常持续数十分钟。为解决这一关键问题,我们引入了LVOmniBench,这是一个专门用于长音频视频跨模态理解的新基准。该数据集包含来自开放平台的高质量视频,这些视频具有丰富的音频-视觉动态。通过严格的手动选择和标注,LVOmniBench 包含275个视频,时长从10分钟到90分钟不等,以及1,014个问答(QA)对。LVOmniBench旨在严格评估OmniLLMs在各个领域的能力,包括长期记忆、时间定位、细粒度理解以及多模态感知。我们的广泛评估表明,当前的OmniLLMs在处理扩展的音频-视觉输入时面临重大挑战。开源模型通常的准确率低于35%,而Gemini 3 Pro达到的峰值准确率约为65%。我们预计,该数据集以及我们的实证研究结果将激发进一步的研究,并促进开发能够解决长音频视频上下文中的复杂跨模态理解问题的高级模型。
Summary / 总结
LVOmniBench is a new benchmark for evaluating the cross-modal comprehension of long-form audio and video, addressing the limitations of existing short-form evaluations. It includes 275 videos ranging from 10 to 90 minutes and 1,014 question-answer pairs. The evaluation shows that current omnimodal large language models struggle with long-form inputs: open-source models generally score below 35% accuracy, while Gemini 3 Pro peaks at approximately 65%. The dataset aims to drive further research and model development for complex cross-modal understanding in long-form media.
LVOmniBench 是一个新的基准,用于评估长形式音频和视频的跨模态理解能力,解决了当前短形式评估的局限性。它包含275个从10到90分钟的视频和1,014个问答对。评估结果显示,当前的OmniLLMs在处理长形式输入时存在困难,开源模型的准确率低于35%,而Gemini 3 Pro的峰值准确率约为65%。
DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
Authors: Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou
First: 2026-03-19T17:58:11+00:00 · Latest: 2026-03-19T17:58:11+00:00
Abstract
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
中文标题/摘要
标题:DreamPartGen: 基于语义的部件级3D生成通过协作去噪
理解和生成由有意义部件组成的3D对象是人类感知和推理的基础。然而,大多数文本到3D的方法忽略了部件的语义和功能结构。虽然最近的部件感知方法引入了分解,但它们仍然主要集中在几何结构上,缺乏语义基础,无法建模部件如何与文本描述或部件间关系对齐。我们提出了DreamPartGen,一种基于语义的部件感知文本到3D生成框架。DreamPartGen引入了双部件潜在变量(DPLs),联合建模每个部件的几何形状和外观,并引入了关系语义潜在变量(RSLs),捕捉从语言中推导出的部件间依赖关系。同步的联合去噪过程确保了几何和语义的一致性,使3D合成具有连贯性、可解释性和文本对齐。在多个基准测试中,DreamPartGen在几何保真度和文本形状对齐方面达到了最先进的性能。
Summary / 总结
DreamPartGen is designed to generate 3D objects by decomposing them into meaningful parts and aligning them with textual descriptions. It uses Duplex Part Latents to model both geometry and appearance of each part, and Relational Semantic Latents to capture inter-part dependencies. The framework ensures mutual geometric and semantic consistency through a synchronized co-denoising process, resulting in high geometric fidelity and better text-shape alignment compared to existing methods.
DreamPartGen旨在生成具有语义接地和部分级意识的3D对象,解决了现有文本到3D方法的局限性。它使用双重部分潜变量来建模每个部分的几何和外观,使用关系语义潜变量来捕捉语言中的部分间依赖关系。同步的联合去噪过程确保了几何和语义的一致性,从而实现了连贯且与文本对齐的3D合成。实验表明,DreamPartGen在几何保真度和文本形状对齐方面超越了现有方法,在多个基准测试中表现出色。
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00
Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
中文标题/摘要
标题:VLMs是否需要视觉变换器?评估状态空间模型作为视觉编码器
大型视觉-语言模型(VLMs)通常使用冻结的视觉骨干,其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉骨干,但我们询问状态空间模型(SSM)视觉骨干是否可以成为强有力的替代品。我们在受控环境中系统地评估了SSM视觉骨干在VLMs中的表现。在匹配的ImageNet-1K初始化下,SSM骨干在VQA和定位/标注方面实现了最强的整体性能。我们进一步适应了SSM和ViT家族的骨干,并进行了检测或分割训练,发现密集任务调整通常在家族中提高了性能;在这一适应后,SSM骨干在较小的模型规模下仍具有竞争力。我们还观察到,(i) 更高的ImageNet准确度或更大的骨干并不一定能可靠地转化为更好的VLM性能,(ii) 一些视觉骨干在定位方面不稳定。基于这些发现,我们提出了稳定策略,以提高两个骨干家族的鲁棒性,并强调SSM骨干作为VLMs中基于变换器视觉编码器的强有力替代品。
Summary / 总结
This study evaluates state space model (SSM) vision backbones in large vision-language models (VLMs) in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. After detection or segmentation training, which generally improves both SSM and ViT-family backbones, the SSM backbone remains competitive at a substantially smaller model scale. The study also finds that higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and that some visual backbones are unstable in localization. It proposes stabilization strategies that improve robustness for both backbone families and positions SSMs as a strong alternative to transformer-based vision encoders in VLMs.
研究评估了状态空间模型(SSM)在大型视觉语言模型(VLM)中的应用,发现SSM在视觉问答和定位/标注任务中均优于基于变换器的编码器,尤其是在匹配的ImageNet-1K初始化条件下。经过检测或分割训练的适应后,SSM保持了竞争力,同时模型规模更小。研究还指出,更高的ImageNet准确度或更大的模型并不一定意味着更好的VLM性能,某些视觉编码器在定位任务中不稳定,这表明SSM可以作为变换器基视觉编码器的有力替代方案。
Robustness, Cost, and Attack-Surface Concentration in Phishing Detection
Authors: Julian Allagan, Mohamed Elbakary, Zohreh Safari, Weizheng Gao, Gabrielle Morgan, Essence Morgan, Vladimir Deriglazov
First: 2026-03-19T17:53:32+00:00 · Latest: 2026-03-19T17:53:32+00:00
Comments: 14 pages, 4 figures, 9 tables
Abstract
Phishing detectors built on engineered website features attain near-perfect accuracy under i.i.d. evaluation, yet deployment security depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under explicit attacker budgets. Three diagnostics are introduced: minimal evasion cost (MEC), the evasion survival rate $S(B)$, and the robustness concentration index (RCI).
On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted sanitization-style evasion, robustness converges across architectures: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets exhibit 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances admit evasion through a single feature transition of minimal cost $c_{\min}$, no classifier can raise the corresponding MEC quantile above $c_{\min}$ without modifying the feature representation or cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.
Summary / 总结
This study investigates the robustness of phishing detectors to post-deployment feature manipulation, using a cost-aware evasion framework with three diagnostics: minimal evasion cost (MEC), evasion survival rate $S(B)$, and robustness concentration index (RCI). On the UCI Phishing Websites benchmark, Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all reach $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted evasion, however, robustness converges across models: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Restricting features improves robustness only if it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets show 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. The study formalizes that adversarial robustness in phishing detection is driven by feature economics rather than model complexity.
该研究探讨了钓鱼检测器在部署后对抗特征操纵的鲁棒性。引入了最小欺骗成本、欺骗生存率和鲁棒性集中指数等诊断指标。在UCI钓鱼网站基准数据集上,各种机器学习模型在静态评估下均能达到较高的AUC值。但在预算内欺骗下,鲁棒性在不同模型间趋于一致,最小成本欺骗主要集中在少数低成本特征上。仅限制特征并不能提高鲁棒性,除非移除所有主导的低成本转换。在严格的成本限制下,基础设施导向的特征集对集成模型的可实施性更高,但可欺骗实例的最小成本中位数保持不变。
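The minimal evasion cost (MEC) and survival rate $S(B)$ described above can be illustrated with a small sketch. It is not the authors' implementation. The following are assumptions made for illustration: features take values in {-1, 0, 1} with +1 meaning "looks legitimate", a monotone edit raises one feature by one step at a fixed per-feature cost, the detector is a toy linear rule, and $S(B)$ is taken to be the fraction of detected instances that cannot be evaded within budget B. The names `minimal_evasion_cost` and `survival_rate` are hypothetical.

```python
import heapq
from typing import Callable, Optional, Sequence, Tuple

State = Tuple[int, ...]


def minimal_evasion_cost(
    x: Sequence[int],
    costs: Sequence[float],
    is_phishing: Callable[[State], bool],
    budget: float,
) -> Optional[float]:
    """Cheapest total cost of monotone edits that makes the detector say
    'not phishing', or None if no such edit sequence fits in the budget.

    Assumption: an edit raises one ternary feature by one step toward +1,
    and each step on feature i costs costs[i] > 0.
    """
    start = tuple(x)
    best = {start: 0.0}
    heap = [(0.0, start)]  # uniform-cost search over feature vectors
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry
        if not is_phishing(state):
            return cost  # first evading state popped is the cheapest
        for i, value in enumerate(state):
            if value < 1:
                nxt = state[:i] + (value + 1,) + state[i + 1:]
                new_cost = cost + costs[i]
                if new_cost <= budget and new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(heap, (new_cost, nxt))
    return None


def survival_rate(
    instances: Sequence[Sequence[int]],
    costs: Sequence[float],
    is_phishing: Callable[[State], bool],
    budget: float,
) -> float:
    """Fraction of currently detected instances that stay detected under
    every edit sequence within the budget (assumed reading of S(B))."""
    detected = [tuple(x) for x in instances if is_phishing(tuple(x))]
    survivors = sum(
        minimal_evasion_cost(x, costs, is_phishing, budget) is None
        for x in detected
    )
    return survivors / len(detected)


if __name__ == "__main__":
    # Toy linear detector (hypothetical weights): flag when the score is negative.
    detector = lambda s: 3 * s[0] + s[1] + s[2] < 0
    step_costs = (1, 2, 2)
    pages = [(-1, -1, -1), (-1, 1, 1)]
    for b in (0, 1, 2):
        print(b, survival_rate(pages, step_costs, detector, budget=b))
```

In this toy setup the cheap first feature carries most of the evasion, which mirrors the abstract's point that low-cost features, not the choice of classifier, bound the achievable MEC.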
Tinted Frames: Question Framing Blinds Vision-Language Models
Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and improving performance across framings.
中文标题/摘要
标题:着色框:问题框架使视觉-语言模型失明
视觉-语言模型(VLMs)已被证明是失明的,即使在需要视觉推理的任务中,它们也经常未能充分利用视觉输入。在本研究中,我们展示了VLMs是选择性失明的。它们根据语言框架调整对视觉输入的注意力程度,即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针,我们量化了框架如何改变对图像的关注量及其分布。受限的框架,如多项选择和是/否,相比开放式框架,显著降低了对图像上下文的关注,减少了对任务相关区域的关注,并将注意力转移到了无信息的标记上。我们进一步证明,这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察,我们引入了一种轻量级的提示调优方法,使用可学习标记来鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式,从而提高视觉接地并改善不同框架下的性能。
Summary / 总结
This study investigates why Vision-Language Models (VLMs) underutilize visual inputs, especially in tasks requiring visual reasoning. By analyzing visual attention patterns, the researchers found that VLMs adjust their attention based on the linguistic framing of questions, even when different framings demand the same visual reasoning. The study shows that constrained framings like multiple choice and yes/no lead to less attention on image context and more focus on uninformative tokens, causing performance degradation. The researchers propose a prompt-tuning method using learnable tokens to encourage robust, visually grounded attention, improving both visual grounding and performance across different framings.
研究探讨了为什么视觉语言模型(VLMs)在需要视觉推理的任务中未能充分利用视觉输入。通过分析视觉注意力模式,研究发现模型会根据问题的语义框架调整其注意力,即使不同的框架要求相同的视觉推理。研究显示,受限的框架会导致对图像上下文的关注减少,并将注意力转向不相关信息,这会负面影响准确性和不同框架之间的一致性。研究者提出了一种使用可学习标记的提示调优方法,以促进开放框架中观察到的稳健且视觉导向的注意力模式,从而提高不同问题框架下的性能。
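As a rough illustration of what "amount and distribution of attention over the image" could mean in practice, the sketch below computes two statistics from a single attention row. The abstract does not specify the paper's actual metrics, so both are assumptions: the share of attention mass that falls on image tokens, and the entropy of attention within the image tokens. The function name `image_attention_stats` and its inputs are hypothetical.

```python
import math
from typing import Sequence, Tuple


def image_attention_stats(
    attn: Sequence[float], is_image: Sequence[bool]
) -> Tuple[float, float]:
    """Return (image_mass, image_entropy) for one attention row.

    attn:     non-negative attention weights from one query token to all
              input tokens (assumed layout, not taken from the paper).
    is_image: True where the input token is an image token.
    image_mass is the share of total attention placed on image tokens.
    image_entropy is the Shannon entropy (in nats) of the attention
    renormalized over image tokens; lower means more concentrated.
    """
    total = sum(attn)
    image_weights = [a for a, m in zip(attn, is_image) if m]
    image_total = sum(image_weights)
    if total == 0 or image_total == 0:
        return 0.0, 0.0
    mass = image_total / total
    probs = [a / image_total for a in image_weights if a > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return mass, entropy


if __name__ == "__main__":
    # Hypothetical rows for the same question under two framings.
    mask = [False, True, True, True, False]
    open_ended = [0.05, 0.30, 0.30, 0.25, 0.10]
    yes_no = [0.40, 0.05, 0.10, 0.05, 0.40]
    print(image_attention_stats(open_ended, mask))
    print(image_attention_stats(yes_no, mask))
```

Comparing these two numbers across framings of the same question is one simple way to probe the effect the abstract describes. The paper's own probe may differ.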
FASTER: Rethinking Real-Time Flow VLAs
Authors: Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao
First: 2026-03-19T17:51:37+00:00 · Latest: 2026-03-19T17:51:37+00:00
Comments: Project page: https://innovator-zero.github.io/FASTER
Abstract
Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
中文标题/摘要
标题:FASTER:重新思考实时流VLA
实时执行对于在物理世界中部署视觉-语言-行动(VLA)模型至关重要。现有的异步推理方法主要优化轨迹平滑性,但忽视了对环境变化快速反应的关键延迟。通过重新思考行动分块策略中的反应概念,本文系统分析了影响反应时间的因素。我们表明,反应时间遵循由首次行动时间(TTFA)和执行窗口共同决定的均匀分布。此外,我们揭示了在基于流的VLA中应用恒定调度的标准做法可能是低效的,迫使系统在任何移动开始之前完成所有采样步骤,从而成为反应延迟的瓶颈。为了解决这一问题,我们提出了快速即时反应的快速行动采样(FASTER)。通过引入窗口感知调度,FASTER在流采样过程中动态优先处理近期行动,将即时反应的去噪过程压缩为一步(例如,在$π_{0.5}$和X-VLA中),同时保持长窗口轨迹的质量。结合流式客户端-服务器管道,FASTER显著减少了实际机器人上的有效反应延迟,特别是在部署在消费级GPU上时。现实世界的实验,包括一个高度动态的乒乓球任务,证明了FASTER为通用策略解锁了前所未有的实时响应能力,使其能够快速生成准确且平滑的轨迹。
Summary / 总结
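This paper targets reaction latency in real-time Vision-Language-Action (VLA) models. It shows that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon, and that a constant sampling schedule in flow-based VLAs forces all sampling steps to finish before any movement can start. FASTER introduces a Horizon-Aware Schedule that prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction tenfold into a single step while preserving long-horizon trajectory quality. Combined with a streaming client-server pipeline, it substantially reduces effective reaction latency on real robots, especially on consumer-grade GPUs.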
该论文针对Vision-Language-Action模型在现实世界中的实时执行需求,重点关注减少反应延迟。它引入了FASTER,通过使用Horizon-Aware Schedule优先处理近期动作,显著减少了即时反应的去噪时间。实验表明,FASTER相比现有方法可以将反应延迟减少十倍,从而在如乒乓球等任务中实现实时响应。
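The claim that reaction time is uniformly distributed can be made concrete with a toy simulation. The abstract does not give the paper's exact model, so the one below is an assumption: the policy observes the scene once per execution horizon, an environmental change arrives at a uniformly random moment within that horizon, and the first action that reflects the change becomes available TTFA seconds after the next observation. The function name `reaction_times` and the numbers used are hypothetical.

```python
import random
from typing import List


def reaction_times(
    ttfa: float, horizon: float, n: int = 100_000, seed: int = 0
) -> List[float]:
    """Sample reaction times under the toy model described above.

    ttfa:    time from an observation to the first action computed from it.
    horizon: time between consecutive observations (execution horizon).
    Each sample is (wait until the next observation) + ttfa, so samples
    are uniform on [ttfa, ttfa + horizon].
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        event_time = rng.uniform(0.0, horizon)  # change occurs mid-chunk
        wait = horizon - event_time             # until the next observation
        samples.append(wait + ttfa)
    return samples


if __name__ == "__main__":
    # Hypothetical numbers: a slow and a fast time-to-first-action.
    for ttfa in (0.30, 0.03):
        rts = reaction_times(ttfa=ttfa, horizon=0.5)
        print(ttfa, min(rts), sum(rts) / len(rts), max(rts))
```

Under this toy model the mean reaction time is TTFA plus half the horizon, so shrinking TTFA shifts the whole distribution down, which is the lever the abstract says FASTER pulls.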
The Exponentially Weighted Signature
Authors: Alexandre Bloch, Samuel N. Cohen, Terry Lyons, Joël Mouterde, Benjamin Walker
First: 2026-03-19T17:51:20+00:00 · Latest: 2026-03-19T17:51:20+00:00
Comments: 43 pages, 1 figure
Abstract
The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information uniformly, offering no intrinsic mechanism for contextualising the relevance of the past. To address this, we introduce the Exponentially Weighted Signature (EWS), generalising the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting together with richer memory dynamics including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. We show that the EWS is the unique solution to a linear controlled differential equation on the tensor algebra, and that it generalises both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS enables efficient computation and makes the framework amenable to gradient-based learning, with the full semigroup action parametrised by and learned through its generator. We use this framework to empirically demonstrate the expressivity gap between the EWS and both the signature and EFM on two SDE-based regression tasks.
中文标题/摘要
标题:加权指数签名
签名是区间内多维路径的典范表示。然而,它对所有历史信息处理一致,没有内在机制来解释过去的相关性。为解决这一问题,我们引入了加权指数签名(EWS),它将对角线上的指数衰减记忆(EFM)签名推广到一般的有界线性算子。这些算子允许在时间加权级别上进行跨通道耦合,同时包含更丰富的记忆动力学,包括振荡、增长和依赖于状态的行为,同时保持经典签名的代数优势。我们证明EWS是张量代数上线性控制微分方程的唯一解,并且它同时推广了状态空间模型和路径的拉普拉斯和傅里叶变换。EWS的群形结构使其计算高效,并使框架适用于基于梯度的学习,其全半群作用由生成器参数化并学习。我们使用此框架在两个基于SDE的回归任务中实证展示了EWS与签名和EFM之间的表达能力差距。
Summary / 总结
The research introduces the Exponentially Weighted Signature (EWS) to address the uniform treatment of historical information in the classical signature. By generalizing the Exponentially Fading Memory (EFM) signature, EWS incorporates cross-channel coupling and richer memory dynamics, while maintaining the algebraic properties of the classical signature. The EWS is shown to be the unique solution to a linear controlled differential equation and generalizes state-space models and path transforms. Empirical results demonstrate the EWS's superior expressivity in two SDE-based regression tasks compared to both the signature and EFM.
研究旨在通过改进历史信息的均匀处理来提升签名表示。引入了Exponentially Weighted Signature (EWS),将其从Exponentially Fading Memory (EFM)签名推广到包含跨通道耦合和更丰富的记忆动态。关键发现表明,EWS在基于SDE的回归任务中优于签名和EFM,突显了其表达能力的优势。
Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting
Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin
First: 2026-03-19T17:49:43+00:00 · Latest: 2026-03-19T17:49:43+00:00
Comments: Project page at https://vulab-ai.github.io/Splat2BEV/
Abstract
Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
中文标题/摘要
标题:重建问题:通过3D 高斯点绘制学习几何对齐的BEV表示
鸟瞰视图(BEV)感知是自动驾驶的核心基石,提供了一种统一的空间表示,将周围视图图像融合以实现语义分割、3D目标检测和运动预测等多种下游任务的推理。然而,大多数现有的BEV感知框架采用端到端的训练范式,其中图像特征直接转换到BEV空间并通过下游任务监督进行优化。这种形式将整个感知过程视为黑盒,通常缺乏明确的3D几何理解和可解释性,导致性能不佳。在本文中,我们主张明确的3D表示对于准确的BEV感知很重要,并提出了一种基于3D高斯点绘制的Splat2BEV框架,用于BEV任务。Splat2BEV旨在学习既丰富语义又精确几何的BEV特征表示。我们首先预训练一个高斯生成器,显式地从多视图输入重建3D场景,从而生成几何对齐的特征表示。然后将这些表示投影到BEV空间,作为下游任务的输入。在nuScenes和argoverse数据集上的大量实验表明,Splat2BEV达到了最先进的性能,并验证了将明确的3D重建纳入BEV感知的有效性。
Summary / 总结
This paper addresses the limitations of existing end-to-end BEV perception frameworks by proposing Splat2BEV, which incorporates explicit 3D reconstruction to improve geometric understanding. The method pre-trains a Gaussian generator to reconstruct 3D scenes from multi-view inputs, producing geometry-aligned feature representations that are then projected into the BEV space as inputs for downstream tasks. Experiments on the nuScenes and Argoverse datasets show that Splat2BEV achieves state-of-the-art performance, supporting the value of explicit 3D reconstruction for BEV perception.
本文提出了一种名为Splat2BEV的方法,通过显式重建3D场景来生成几何对齐的特征表示,以解决现有BEV感知框架的局限性。该方法首先通过多视图输入预训练一个高斯生成器来重建3D场景,然后将这些表示投影到BEV空间以供下游任务使用。在nuScenes和argoverse数据集上的实验表明,Splat2BEV在性能上超过了现有方法,并验证了在BEV感知中显式3D重建的重要性。
Score Reversal Is Not Free for Quantum Diffusion Models
Authors: Ammar Fayad
First: 2026-03-06T17:16:17+00:00 · Latest: 2026-03-19T17:48:32+00:00
Abstract
Classical reverse diffusion is generated by changing the drift at fixed noise. We show that the quantum version of this principle obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the canonical model of continuous-variable decoherence, we prove that the unrestricted instantaneous reverse optimum exhibits a noiseless-to-noisy transition: below a critical squeezing-to-thermal ratio, reversal can be noiseless; above it, complete positivity forces irreducible reverse noise whose minimum cost we determine in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic price of reversal. For multimode trajectories, the exact cost is additive in a canonical set of mode-resolved data, and a globally continuous protocol attains this optimum on every mixed-state interval. If a pure nonclassical endpoint is included, the same pointwise law holds for every $t>0$, but the optimum diverges as $2/t$: exact Gaussian reversal of a pure quantum state is dynamically unattainable. These results establish the exact Gaussian benchmark against which any broader theory of quantum reverse diffusion must be measured.
中文标题/摘要
标题:量子扩散模型中的分数逆转并非免费
经典的逆向扩散通过固定噪声改变漂移生成。我们证明了这一原理的量子版本遵循一个精确的相变定律。对于高斯纯损耗动力学,即连续变量退相干的典范模型,我们证明了无限制的瞬时逆向最优表现出无噪到有噪的转变:在临界压缩比之下,逆转可以无噪;超过它,完全正性迫使不可约的逆向噪声,我们以闭式形式确定了其最小成本。最优逆向扩散唯一地与协方差对齐,并同时最小化逆转的几何、计量和热力学价格。对于多模式轨迹,精确成本在一组模式解析数据中是可加的,且一个全局连续协议在每个混合态区间内达到这一最优。如果包含一个纯非经典的终点,同样的点律在每个$t>0$时都成立,但最优值随着$2/t$发散:精确的高斯逆转一个纯量子态是动态上不可达的。这些结果确立了高斯基准,任何更广泛的量子逆向扩散理论都必须以此为标准进行衡量。
Summary / 总结
The study investigates the reversibility of quantum diffusion models by exploring the quantum version of classical reverse diffusion. It demonstrates that for Gaussian pure-loss dynamics, the reverse process can be noiseless below a critical squeezing-to-thermal ratio but becomes noisy above it due to complete positivity constraints. The research identifies the minimum cost of noise and shows that the optimal reverse diffusion is uniquely covariance-aligned and minimizes various costs. For multimode trajectories, the cost is additive, and a globally continuous protocol achieves the optimum. The study also reveals that exact Gaussian reversal of a pure quantum state is dynamically unattainable, establishing a benchmark for broader theories of quantum reverse diffusion.
研究通过扩展经典反向扩散的概念,探讨了量子扩散模型的可逆性。研究表明,量子反向扩散具有明确的相变边界,并在高斯纯损失动力学中表现出无噪到有噪的转变。在特定的挤压到热比之下,反向扩散可以无噪,但超过这个比值时,完全正性会导致不可约的反向噪声,其最小成本可以用闭式形式确定。最优反向扩散是唯一协方差对齐的,并且同时最小化各种反向成本。对于多模式轨迹,精确成本是可加的,全局连续协议可以在每个混合态区间内达到最优。对于纯非经典状态,最优反向扩散会发散,表明精确的高斯反向扩散是动态不可实现的。
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
First: 2026-03-19T17:47:47+00:00 · Latest: 2026-03-19T17:47:47+00:00
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
中文标题/摘要
标题:OS-Themis:通用GUI奖励的可扩展批评框架
强化学习(RL)有潜力提高GUI代理在随机环境中的鲁棒性,但训练对奖励函数的质量极为敏感。现有的奖励方法难以同时实现可扩展性和性能。为了解决这一问题,我们提出了OS-Themis,一种可扩展且准确的多代理批评框架。与单一评判者不同,OS-Themis将轨迹分解为可验证的里程碑,以隔离决策所需的关键证据,并采用审查机制在做出最终裁决前严格审计证据链。为了便于评估,我们进一步引入了OmniGUIRewardBench(OGRBench),这是一个跨平台的GUI结果奖励综合基准,所有评估模型在使用OS-Themis时均达到最佳性能。在AndroidWorld的广泛实验表明,当用于支持在线RL训练时,OS-Themis可提高10.3%;在自我训练循环中用于轨迹验证和过滤时,可提高6.9%,突显了其推动代理进化的能力。
Summary / 总结
OS-Themis is a scalable multi-agent critic framework for GUI agent rewards. It decomposes trajectories into verifiable milestones to isolate the critical evidence and uses a review mechanism to audit the evidence chain before the final verdict. On AndroidWorld, it yields a 10.3% improvement when supporting online RL training and a 6.9% gain when used for trajectory validation and filtering in the self-training loop. The paper also introduces OmniGUIRewardBench (OGRBench), a cross-platform benchmark for GUI outcome rewards, on which all evaluated models achieve their best performance under OS-Themis.
论文提出了OS-Themis,这是一种可扩展的多代理批评框架,旨在通过提高奖励函数的质量来增强GUI代理在随机环境中的鲁棒性。与传统的单一评判方法不同,OS-Themis将轨迹分解为可验证的里程碑,并采用审查机制确保决策的准确性。在AndroidWorld的实验中,OS-Themis在在线RL训练中提高了10.3%,在轨迹验证和过滤中提高了6.9%,表明其在驱动代理进化方面的有效性。
The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery
Authors: Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal
First: 2026-03-18T14:51:15+00:00 · Latest: 2026-03-19T17:44:57+00:00
Abstract
Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.
中文标题/摘要
标题:融合前沿:集成机器学习与高性能量子计算,推动下一代药物发现
将量子力学融入药物发现标志着从经验试错向定量精确的决定性转变。然而,从头分子动力学的高昂成本长期以来迫使人们在化学精度和计算可扩展性之间做出妥协。本文指出,高性能计算(HPC)、机器学习(ML)和量子计算(QC)的融合是解决这一瓶颈的最终方案。虽然FeNNix-Bio1等ML基础模型能够实现量子精度的模拟,但它们仍然受到经典数据生成固有限制的束缚。我们详细说明了利用混合QPU-GPU架构的高性能量子计算(HPQC)将如何成为量子化学数据的终极加速器。通过利用希尔伯特空间映射,这些系统可以实现真正的化学精度,同时绕过经典近似中的启发式方法。我们展示了这种三者融合如何优化药物发现流程,从初始系统准备到ML驱动的高保真模拟。最后,我们将量子增强采样定位为超越GPU的前沿,用于建模反应性细胞系统和开创下一代材料。
Summary / 总结
This paper addresses the long-standing compromise between chemical accuracy and computational scalability in drug discovery, and identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the way past it. ML foundation models such as FeNNix-Bio1 enable quantum-accurate simulations but remain limited by classical data generation, so the authors propose High-Performance Quantum Computing (HPQC) on hybrid QPU-GPU architectures as an accelerator for quantum chemistry data. They argue that this approach can reach true chemical accuracy and optimize the pipeline from system preparation to ML-driven, high-fidelity simulations, and they position quantum-enhanced sampling as a beyond-GPU frontier for modeling reactive cellular systems and next-generation materials.
本文旨在通过整合高性能计算(HPC)、机器学习(ML)和量子计算(QC),解决药物发现中化学精度与计算可扩展性难以兼得的问题。作者提出使用具有混合QPU-GPU架构的高性能量子计算(HPQC)来克服经典数据生成的限制。作者认为,这种方法可以实现真正的化学精度,并通过ML驱动的高保真模拟优化药物发现流程,同时将量子增强采样定位为建模反应性细胞系统和下一代材料的超越GPU的前沿技术。
This looks like what? Challenges and Future Research Directions for Part-Prototype Models
Authors: Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert
First: 2025-02-13T14:00:55+00:00 · Latest: 2026-03-19T17:41:24+00:00
Comments: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI-2026)
Abstract
The growing interest in eXplainable Artificial Intelligence (XAI) has stimulated research on models with built-in interpretability, among which part-prototype models are particularly prominent. Part-Prototype Models (PPMs) classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Despite this intrinsic interpretability, PPMs have not yet emerged as a competitive alternative to post-hoc explanation methods. This survey reviews work published between 2019 and 2025 and derives a taxonomy of the challenges faced by current PPMs. The analysis reveals a diverse set of open problems. The main issue concerns the quality and number of learned prototypes. Further challenges include limited generalization across tasks and contexts, as well as methodological shortcomings such as non-standardized evaluation. Five broad research directions are identified: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks for evaluation. The survey aims to stimulate further research and promote intrinsically interpretable models for practical applications. A curated list of the surveyed papers is available at https://github.com/aix-group/ppm-survey.
中文标题/摘要
标题:这看起来像什么?部分原型模型的挑战与未来研究方向
随着对可解释人工智能(XAI)的兴趣日益增长,研究具有内置可解释性的模型也逐渐增多,其中部分原型模型尤为突出。部分原型模型(PPMs)通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释,形式为“这看起来像那个”。尽管具有这种内在的可解释性,PPMs尚未成为后验解释方法的有力竞争者。本文综述了2019年至2025年间发表的相关工作,并推导出当前PPMs面临的挑战分类。分析揭示了一系列开放性问题。主要问题在于学习到的原型的质量和数量。进一步的挑战包括在任务和上下文之间有限的一般化能力,以及方法论上的不足,如非标准化的评估。确定了五个广泛的研究方向:提高预测性能、发展理论依据的架构、建立人机协作框架、使模型与人类概念相一致以及定义稳健的评估指标和基准。综述旨在激发进一步的研究,促进适用于实际应用的内在可解释模型。所综述论文的精选列表可在https://github.com/aix-group/ppm-survey获取。
Summary / 总结
This paper reviews the challenges and future research directions for Part-Prototype Models (PPMs), which classify inputs by comparing them to learned prototypes and provide human-understandable explanations. The study identifies five key research directions: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics for evaluation. The main issues include the quality and number of learned prototypes, limited generalization, and methodological shortcomings in evaluation. The survey aims to promote further research in intrinsically interpretable models for practical applications.
本文回顾了部分原型模型(PPMs)面临的挑战及其未来研究方向,PPMs通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释。研究确定了五个关键的研究方向:提高预测性能、开发理论基础架构、建立人机协作框架、使模型与人类概念相一致以及定义评估的稳健指标。主要问题包括原型的质量和数量、泛化能力有限以及评估中的方法论不足。该综述旨在促进内在可解释模型在实际应用中的进一步研究。
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
Authors: Zou Qiang
First: 2026-03-19T17:41:18+00:00 · Latest: 2026-03-19T17:41:18+00:00
Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation
Abstract
Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity.
This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions.
While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
中文标题/摘要
标题:盒迷宫:一种可靠的LLM推理过程控制架构
大型语言模型(LLMs)展示了强大的生成能力,但在对抗性提示下仍易出现幻觉和不可靠的推理。现有的安全方法——如基于人类反馈的强化学习(RLHF)和输出过滤——主要在行为层面运作,可能缺乏明确的架构机制来确保推理过程的完整性。
本文提出了盒迷宫框架,这是一种概念性的过程控制架构,将LLM推理分解为三个明确的层次:记忆接地、结构化推理和边界约束。我们进行了初步的基于仿真的评估,涉及跨多个异构LLM系统(DeepSeek-V3、Doubao、Qwen)的边界侵蚀场景。来自n=50个对抗性场景的结果表明,明确的认知控制层可能有助于提高边界维护的一致性,架构约束将边界失败率从基线RLHF的约40%降低到对抗条件下低于1%。
当前的验证是基于仿真的,但这些初步结果表明,过程级控制可能为提高大型语言模型推理的可靠性提供一个有希望的方向。
Summary / 总结
The paper addresses hallucination and unreliable reasoning in large language models (LLMs) under adversarial prompting. It introduces the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into memory grounding, structured inference, and boundary enforcement layers. Preliminary simulation results from n=50 adversarial scenarios across DeepSeek-V3, Doubao, and Qwen suggest that these architectural constraints reduce boundary failure rates from approximately 40% (baseline RLHF) to below 1%, indicating that process-level control may improve reasoning consistency. The validation so far is simulation-based only.
论文针对大型语言模型(LLMs)在对抗性提示下容易出现幻觉和不可靠推理的问题,提出了Box Maze框架,将LLM推理分解为记忆接地、结构化推理和边界约束三个层次。初步的模拟结果显示,该过程控制架构在多种LLM系统中将边界失败率从约40%(基线RLHF)降低到低于1%,表明可能在提高推理一致性方面有所改进。
Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model
Authors: Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
First: 2025-11-25T18:55:12+00:00 · Latest: 2026-03-19T17:40:45+00:00
Abstract
Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.
中文标题/摘要
标题:Image2Gcode:使用扩散变换模型的图像到G代码生成技术
机械设计和制造工作流程通常始于概念设计,随后创建计算机辅助设计(CAD)模型并通过材料挤出(MEX)打印进行制造。这一过程需要将CAD几何图形转换为机器可读的G代码,通过切片和路径规划实现。虽然每一步都已成熟,但依赖CAD建模仍然是一个主要瓶颈:构建特定对象的3D几何图形速度较慢,且不适用于快速原型制作。即使是细微的设计变化通常也需要在CAD软件中手动更新,使得迭代过程耗时且难以扩展。为解决这一限制,我们引入了Image2Gcode,这是一种端到端的数据驱动框架,绕过了CAD阶段,直接从图像和零件图纸生成打印机就绪的G代码。该框架首先从图像中提取切片级的结构线索,然后使用G代码序列上的去噪扩散概率模型(DDPM)。通过迭代去噪,模型将高斯噪声转化为可执行的打印移动轨迹,与相应的挤出参数相对应,从而建立了从视觉输入到原生工具路径的直接映射。通过直接从二维图像生成结构化的G代码,Image2Gcode消除了对CAD或STL中间件的需求,降低了增材制造的入门门槛,并加速了从设计到制造的周期。该方法支持从简单的草图或视觉参考进行按需原型制作,并与上游的二维到三维重建模块集成,以实现从概念到物理制品的自动化流程。结果是一个灵活、计算效率高的框架,促进了设计迭代、修复工作流程和分布式制造的可访问性。
Summary / 总结
The paper introduces Image2Gcode, an end-to-end framework that generates printer-ready G-code directly from images and part drawings, bypassing the need for CAD models. It uses a denoising diffusion probabilistic model (DDPM) to transform Gaussian noise into executable print-move trajectories. Key findings include the ability to produce structured G-code from 2D imagery, eliminating the need for CAD or STL intermediates, and supporting rapid prototyping from simple sketches or visual references.
Image2Gcode 是一个端到端的框架,可以直接从图像和零件图纸生成打印机可读的 G-code,无需使用 CAD 模型。它使用去噪扩散概率模型将高斯噪声转化为可执行的打印轨迹。主要发现包括能够从 2D 图像生成结构化的 G-code,消除 CAD 或 STL 中间件的需求,并支持从简单草图或视觉参考进行快速原型制作。
Steering Awareness: Detecting Activation Steering from Within
Authors: Joshua Fonseca Rivera, David Demitri Africa
First: 2025-11-26T13:49:43+00:00 · Latest: 2026-03-19T17:37:06+00:00
Abstract
Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.
中文标题/摘要
标题:操控意识:从内部检测激活操控
激活操控(向模型的残差流中添加一个向量以修改其行为)在安全性评估中被广泛使用,其隐含前提是模型无法检测到这种干预。我们检验了这一假设,并引入“操控意识”的概念:模型在自身前向传播过程中推断出是否注入了操控向量及其所编码概念的能力。经过微调后,七个指令调优模型在留出的概念上发展出了很强的操控意识;最佳模型的检测率达到95.5%,概念识别率达到71.2%,并且在干净输入上没有误报。当未见过的操控向量构造方法所产生的方向与训练分布具有高余弦相似度时,这种能力可以泛化,否则不能,这表明模型学到的是几何检测器而非通用异常检测器。令人惊讶的是,检测能力并未带来抵抗力;在事实性和安全性基准测试中,经过检测训练的模型始终比其基础版本更易受操控影响。从机制上看,操控意识并非源自局部回路,而是源自一种分布式变换,它逐步将各种注入向量旋转到共享的检测方向。因此,在安全性评估中不应将激活操控视为不可见的干预。
Summary / 总结
The research tests the assumption, common in safety evaluations, that a model cannot detect activation steering, in which a vector is added to its residual stream to alter its behavior. It introduces steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection and 71.2% concept identification, with zero false positives on clean inputs. Generalization to unseen vector construction methods holds only when their directions have high cosine similarity to the training distribution, indicating a geometric detector rather than a generic anomaly detector. Detection does not confer resistance: on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction, not from a localized circuit.
研究检验了安全性评估中常见的一个假设,即模型无法检测到激活操控(向模型的残差流中添加向量以改变其行为),并引入了操控意识的概念:模型在自身前向传播过程中推断出是否注入了操控向量及其所编码概念的能力。经过微调后,七个指令调优模型在留出的概念上发展出了很强的操控意识;最佳模型的检测率达到95.5%,概念识别率达到71.2%,并且在干净输入上没有误报。只有当未见过的向量构造方法所产生的方向与训练分布具有高余弦相似度时,这种能力才能泛化,表明这是一种几何检测器而非通用异常检测器。然而,检测并不能带来抵抗操控的能力:在事实性和安全性基准测试中,经过检测训练的模型始终比其基础版本更容易受到操控。操控意识的机制是一种分布式变换,它逐步将注入的向量旋转到共享的检测方向,而非来自局部回路。
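To make the two quantities in the Steering Awareness entry concrete, here is a minimal sketch in pure Python. It assumes toy 4-dimensional vectors; the names `steer` and `cosine_similarity` and the scale `alpha` are illustrative choices, not the paper's code. It shows a steering vector being added to a single residual-stream activation, and the cosine similarity that the abstract uses to characterize which unseen steering directions the detector generalizes to.

```python
import math


def steer(hidden, vector, alpha=1.0):
    """Add a scaled steering vector to one residual-stream activation."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy values, chosen only for illustration.
hidden = [0.5, -1.0, 0.25, 2.0]      # activation before the intervention
train_dir = [1.0, 0.0, 0.0, 0.0]     # a direction seen during detector training
aligned = [0.9, 0.1, 0.0, 0.0]       # unseen vector, close to the training direction
orthogonal = [0.0, 0.0, 1.0, 0.0]    # unseen vector, unrelated direction

steered = steer(hidden, train_dir, alpha=4.0)
sim_aligned = cosine_similarity(train_dir, aligned)
sim_orthogonal = cosine_similarity(train_dir, orthogonal)
```

Under the abstract's finding, a vector like `aligned` (high cosine similarity to the training distribution) is the kind the detector would be expected to catch, while one like `orthogonal` is not.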
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
First: 2026-03-03T18:48:15+00:00 · Latest: 2026-03-19T17:36:27+00:00
Abstract
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity-difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at https://github.com/MingyuJ666/sparsityLLM.
中文标题/摘要
标题:迁移越大,表示越稀疏:LLM的OOD机制分析
在本研究中,我们探讨了大型语言模型(LLMs)在遇到难度递增的输入时如何调整其内部表示,其中难度以分布外(OOD)偏移的程度来量化。我们揭示了一个一致且可量化的现象:随着任务难度的增加,无论是通过更难的推理问题、更长的上下文还是增加答案选项,LLMs的最后隐藏状态都会变得显著稀疏。简而言之,偏移越远,表示越稀疏。这种稀疏性与难度的关系在不同的模型和领域中都可以观察到,表明语言模型在面对不熟悉或复杂的输入时,会将计算集中到最后隐藏状态中的专门子空间。通过一系列受控分析并结合学习动力学的解释,我们证明这种稀疏性并非偶然,而是一种在OOD条件下稳定推理的适应机制。利用这一见解,我们设计了稀疏性引导的课程式上下文学习(SG-ICL)策略,该策略明确利用表示稀疏性来安排少样本示例,从而显著提高性能。我们的研究为LLMs如何内化OOD挑战提供了新的机制性见解。源代码可在以下网址获取:https://github.com/MingyuJ666/sparsityLLM.
Summary / 总结
This work investigates how Large Language Models (LLMs) adapt their internal representations when encountering increasingly difficult inputs, quantified as out-of-distribution (OOD) shift. It reveals that as task difficulty increases, the last hidden states of LLMs become sparser, indicating that the models concentrate computation into specialized subspaces. This sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD conditions. The study demonstrates that leveraging this insight through Sparsity-Guided Curriculum In-Context Learning (SG-ICL) can enhance performance. The research provides new mechanistic insights into how LLMs handle OOD challenges.
本研究探讨了大型语言模型(LLMs)在遇到越来越难的输入时如何调整其内部表示,其中难度以分布外(OOD)偏移的程度来量化。研究发现,随着任务难度的增加,LLMs的最后隐藏状态变得更为稀疏,表明模型在面对不熟悉或复杂的输入时会将计算集中到专门的子空间中。这种稀疏性并非偶然,而是一种在OOD条件下稳定推理的适应机制。研究还引入了稀疏性引导的课程式上下文学习(SG-ICL),该方法利用表示稀疏性来安排少样本示例,从而提高性能。研究提供了关于LLMs如何处理OOD挑战的新的机制性见解。
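The abstract above does not say how sparsity of the last hidden state is measured. As a hedged illustration only, the sketch below uses the Hoyer sparsity index, a standard measure based on the ratio of L1 to L2 norms; this is a stand-in assumption, not necessarily the metric used in the paper, and the vectors are toy values.

```python
import math


def hoyer_sparsity(vector):
    """Hoyer index: 0 for a uniform-magnitude vector, 1 for a one-hot vector.

    Assumes the vector has at least two entries and is not all zeros.
    """
    n = len(vector)
    l1 = sum(abs(x) for x in vector)
    l2 = math.sqrt(sum(x * x for x in vector))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Toy "hidden states": one spreads its mass evenly, one concentrates it.
dense_state = [1.0, 1.0, 1.0, 1.0]
sparse_state = [0.0, 0.0, 0.0, 3.0]

dense_score = hoyer_sparsity(dense_state)
sparse_score = hoyer_sparsity(sparse_state)
```

With a measure of this kind, the entry's claim reads as: the score computed on the last hidden state rises as inputs move farther out of distribution.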
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Authors: Amandine Brunetto
Venue: CVPR 2026
First: 2026-03-19T17:32:06+00:00 · Latest: 2026-03-19T17:32:06+00:00
Comments: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/
Abstract
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
中文标题/摘要
标题:基于多模态流匹配的少量样本声学合成
生成与场景声学一致的音频对于沉浸式虚拟环境至关重要。近期的神经声学场方法能够实现空间连续的声音渲染,但仍然保持场景特定性,需要密集的音频测量和昂贵的训练成本。少量样本的方法提高了跨房间的可扩展性,但仍依赖于多个录音,并且由于是确定性的,无法捕捉稀疏上下文中场景声学的固有不确定性。我们引入了流匹配声学生成(FLAC),这是一种基于少量场景上下文的概率方法,用于建模给定最小场景上下文中可能的房间冲激响应(RIRs)的分布。FLAC 利用一个通过流匹配目标训练的扩散变换器,在新场景中的任意位置生成 RIRs,条件于空间、几何和声学线索。FLAC 在 AcousticRooms 和 Hearing Anything Anywhere 数据集上的一次样本优于最先进的八次样本基线。为了补充标准感知度量,我们进一步引入了 AGREE,一种联合声学-几何嵌入,通过检索和分布度量使生成的 RIRs 的几何一致性评估成为可能。这项工作是首次将生成流匹配应用于显式 RIR 合成,为稳健和数据高效声学合成开辟了一个新方向。
Summary / 总结
The research aims to make acoustic synthesis for virtual environments more scalable through a probabilistic few-shot approach. FLAC models the distribution of plausible room impulse responses (RIRs) given minimal scene context, using a diffusion transformer trained with a flow-matching objective and conditioned on spatial, geometric, and acoustic cues. With a single shot, it outperforms state-of-the-art eight-shot baselines on the AcousticRooms and Hearing Anything Anywhere datasets. The paper also introduces AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. The authors describe this work as the first application of generative flow matching to explicit RIR synthesis.
研究旨在通过开发少样本方法提高虚拟环境中的声学合成可扩展性。FLAC是一种概率方法,使用具有流匹配目标的扩散变换器,在给定少量场景上下文的情况下,建模可能的房间冲激响应(RIRs)的分布。该方法在AcousticRooms和Hearing Anything Anywhere数据集上的一次样本表现优于现有八次样本基线。此外,还引入了AGREE,一种联合声学-几何嵌入,用于通过检索和分布性度量评估生成的RIR的一致性。这项工作是首次将生成流匹配应用于显式RIR合成,推动了稳健和数据高效声学合成的发展。
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
Authors: Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi
First: 2026-03-19T17:30:02+00:00 · Latest: 2026-03-19T17:30:02+00:00
Abstract
As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
中文标题/摘要
标题:SOL-ExecBench:面向真实GPU内核、以硬件极限为参照的光速基准测试
随着代理型AI系统生成和优化GPU内核的能力日益增强,现有基准奖励的是相对于软件基线的加速,而非与硬件高效执行的接近程度,这制约了进一步的进展。我们提出了SOL-ExecBench,这是一个包含235个CUDA内核优化问题的基准,这些问题提取自124个生产级和新兴AI模型,涵盖语言、扩散、视觉、音频、视频和混合架构,面向NVIDIA Blackwell GPU。该基准涵盖BF16、FP8和NVFP4下的前向和反向工作负载,其中包括预期需要依赖Blackwell特有能力才能达到最佳性能的内核。与以往主要相对于软件实现来评估内核的基准不同,SOL-ExecBench以解析推导的光速(Speed-of-Light,SOL)界限为参照来衡量性能,这些界限由我们用于推导基于硬件的SOL界限的流水线SOLAR计算得出,从而为硬件高效优化提供了固定目标。我们报告一个SOL分数,用于量化候选内核在发布版本定义的评分基线与硬件SOL界限之间的差距中弥合了多少。为了支持对代理型优化器的稳健评估,我们还提供了一个沙箱化的测试框架,具备GPU时钟锁定、L2缓存清除、隔离子进程执行,以及针对常见奖励作弊策略的静态分析检查。SOL-ExecBench将GPU内核基准测试的目标从击败可变的软件基线,重新定义为弥合与硬件光速之间的剩余差距。
Summary / 总结
SOL-ExecBench is a benchmark that evaluates GPU kernels against hardware-grounded Speed-of-Light (SOL) bounds rather than against software baselines. It comprises 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models, targets NVIDIA Blackwell GPUs, and covers forward and backward workloads in BF16, FP8, and NVFP4. Performance is measured against analytically derived SOL bounds computed by the SOLAR pipeline, which gives a fixed target for hardware-efficient optimization. The reported SOL Score quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. A sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static-analysis checks against reward hacking supports robust evaluation of agentic optimizers.
SOL-ExecBench 是一个以基于硬件的光速(SOL)界限而非软件基线为参照来评估 GPU 内核的基准测试。它包含提取自 124 个生产级和新兴 AI 模型的 235 个 CUDA 内核优化问题,面向 NVIDIA Blackwell GPU,涵盖 BF16、FP8 和 NVFP4 下的前向和反向工作负载。性能以 SOLAR 流水线解析推导出的 SOL 界限为参照进行衡量,从而为硬件高效优化提供了固定目标。所报告的 SOL 分数量化了候选内核在发布版本定义的评分基线与硬件 SOL 界限之间的差距中弥合了多少。此外,基准还提供了一个沙箱化的测试框架,具备 GPU 时钟锁定、L2 缓存清除、隔离子进程执行以及针对奖励作弊的静态分析检查,以支持对代理型优化器的稳健评估。
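The SOL-ExecBench abstract describes the SOL Score only in words: how much of the gap between the scoring baseline and the hardware SOL bound a candidate kernel closes. One plausible reading, sketched below, is a normalized gap-closure ratio over runtimes. The exact formula, the use of runtime in milliseconds, and the function name `sol_score` are assumptions for illustration, not the benchmark's published definition.

```python
def sol_score(baseline_ms, sol_bound_ms, candidate_ms):
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate.

    Assumed form: (baseline - candidate) / (baseline - sol_bound).
    1.0 means the candidate reaches the hardware bound, 0.0 means it only
    matches the baseline, and negative values mean it is slower than the
    baseline.
    """
    gap = baseline_ms - sol_bound_ms
    if gap <= 0:
        raise ValueError("baseline must be slower than the SOL bound")
    return (baseline_ms - candidate_ms) / gap


# Hypothetical timings: baseline 10 ms, analytical SOL bound 4 ms.
halfway = sol_score(10.0, 4.0, 7.0)      # closes half of the 6 ms gap
at_bound = sol_score(10.0, 4.0, 4.0)     # reaches the hardware bound
no_gain = sol_score(10.0, 4.0, 10.0)     # matches the baseline
```

Because the denominator is fixed by the hardware bound, a score of this shape does not move when the software baseline ecosystem improves elsewhere, which matches the abstract's framing of a fixed optimization target.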
DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
Authors: Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng
First: 2026-03-19T17:30:01+00:00 · Latest: 2026-03-19T17:30:01+00:00
Abstract
Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging insights into expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
中文标题/摘要
标题:DyMoE:动态专家编排与混合精度量化在边缘设备上高效MoE推理
尽管MoE模型具有较高的计算效率,但多专家架构固有的过大内存占用和I/O开销给资源受限的边缘平台上的实时推理带来了巨大挑战。现有静态方法受制于僵化的延迟与精度权衡,而我们观察到专家的重要性分布高度偏斜且与深度相关。受此启发,我们提出了DyMoE,一种为高性能边缘推理设计的动态混合精度量化框架。DyMoE利用对专家重要性偏斜和深度相关敏感性的洞察,引入了:(1) 重要性感知优先级,在运行时动态量化专家;(2) 深度自适应调度,在关键层保持语义完整性;(3) 前瞻预取,用以掩盖I/O停顿。在商用边缘硬件上的实验结果显示,与最先进的卸载基线相比,DyMoE将首令牌时间(TTFT)降低了3.44倍至22.7倍,并在每输出令牌时间(TPOT)上实现了最高14.58倍的加速,从而在资源受限的边缘设备上实现实时且保持精度的MoE推理。
Summary / 总结
DyMoE is a dynamic mixed-precision quantization framework for efficient MoE inference on edge devices. It quantizes experts at runtime according to their importance, applies depth-adaptive scheduling to preserve semantic integrity in critical layers, and uses look-ahead prefetching to overlap I/O stalls. On commercial edge hardware it reduces Time-to-First-Token by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token compared to state-of-the-art offloading baselines.
DyMoE 是一种动态混合精度量化框架,旨在提高边缘设备上 MoE 推理的效率。它根据专家的重要性动态量化它们,并适配性地调度它们以保持语义完整性,实现了高达 22.7 倍的 Time-to-First-Token 减少和 14.58 倍的 Time-Per-Output-Token 加速,优于现有方法。
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
Authors: Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang
First: 2026-03-19T17:25:19+00:00 · Latest: 2026-03-19T17:25:19+00:00
Comments: 28 pages, 5 figures
Abstract
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
中文标题/摘要
标题:ARIADNE:一种感知-推理协同框架,用于可信冠状动脉造影分析
传统的像素级损失函数无法在冠状血管分割中施加拓扑约束,尽管在像素级准确性方面表现良好,但会产生碎片化的血管树。我们提出ARIADNE,这是一种两阶段框架,结合了偏好对齐的感知与基于RL的诊断推理,以实现拓扑连贯的狭窄检测。感知模块使用DPO对Sa2VA视觉-语言基础模型进行微调,使用贝蒂数约束作为偏好信号,使策略更倾向于几何完整的血管结构,而不是像素级重叠度量。推理模块将狭窄定位建模为马尔可夫决策过程,具有明确的拒绝机制,自主地推迟模糊的解剖候选,如分叉和血管交叉,从覆盖率最大化转向可靠性优化。在1,400例临床造影图像上,ARIADNE实现了最先进的中心线Dice值0.838,与几何基线相比,将假阳性率降低了41%。多中心基准ARCADE和XCAD的外部验证证实了其在成像协议方面的泛化能力。这代表了在医学成像中首次将DPO应用于拓扑对齐,证明了基于结构约束的偏好学习可以减轻拓扑违反,同时在介入心脏病学工作流程中保持诊断敏感性。
Summary / 总结
ARIADNE is a two-stage framework that integrates preference-aligned perception with reinforcement learning-based diagnostic reasoning for coronary angiography analysis. The perception module uses Direct Preference Optimization (DPO) to fine-tune the Sa2VA vision-language model with Betti number constraints as preference signals, favoring geometrically complete vessel structures over pixel-wise overlap. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that defers ambiguous anatomical candidates such as bifurcations and vessel crossings. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice score of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols.
ARIADNE 是一个两阶段框架,结合了偏好对齐的感知与基于 RL 的诊断推理,用于冠状动脉造影分析。感知模块使用 DPO 对 Sa2VA 视觉-语言模型进行微调,以贝蒂数约束作为偏好信号,使模型更倾向于几何完整的血管结构。推理模块将狭窄定位建模为马尔可夫决策过程,并包含一个明确的拒绝机制,用于推迟对分叉和血管交叉等模糊解剖候选的判断。在 1,400 例临床血管造影上,ARIADNE 达到了 0.838 的中心线 Dice 分数,并且与几何基线相比将假阳性降低了 41%。在多中心基准 ARCADE 和 XCAD 上的外部验证证实了其在不同采集协议下的泛化能力。
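The ARIADNE entry's point that pixel-level accuracy can hide a fragmented vessel tree can be illustrated with the zeroth Betti number, which counts connected components. The sketch below is a toy under stated assumptions: it counts 8-connected components of a small binary mask with a breadth-first search. The function name `betti0` and the masks are hypothetical, and this is not the paper's preference-signal construction.

```python
from collections import deque


def betti0(mask):
    """Count 8-connected foreground components in a binary 2D mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New component found: flood-fill it so it is counted once.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


# A toy "vessel" and a prediction that misses a single pixel of it.
ground_truth = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
prediction = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

matching_pixels = sum(
    g == p
    for g_row, p_row in zip(ground_truth, prediction)
    for g, p in zip(g_row, p_row)
)
truth_components = betti0(ground_truth)
predicted_components = betti0(prediction)
```

Here 17 of 18 pixels agree, yet the predicted vessel is split in two, which is the kind of topological violation that a pixel-wise loss barely penalizes.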
Flow Matching Policy with Entropy Regularization
Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
First: 2026-03-18T13:00:20+00:00 · Latest: 2026-03-19T17:21:12+00:00
Abstract
Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
中文标题/摘要
标题:流匹配策略与熵正则化
基于扩散的策略在强化学习(RL)中因其能够表示复杂的非高斯分布而获得了显著的流行度。基于随机微分方程(SDE)的扩散策略通常依赖于间接的熵控制,因为精确的熵是难以计算的,同时也会受到通过迭代去噪链计算策略梯度的计算成本高昂的问题。为了解决这些问题,我们提出了流匹配策略与熵正则化(FMER),这是一种基于常微分方程(ODE)的在线RL框架。FMER 通过流匹配参数化策略,并沿直线概率路径采样动作,受到最优传输的启发。FMER 利用模型的生成性质,从候选集中构建加权优势目标速度场,引导策略更新向高价值区域。通过推导出可计算的熵目标,FMER 使最大熵优化变得有原则,从而增强探索。在稀疏多目标 FrankaKitchen 基准测试中,FMER 的表现优于最先进的方法,同时在标准的 MuJoco 基准测试中保持竞争力。此外,与重扩散基线(QVPO)相比,FMER 将训练时间减少了 7 倍,与高效的变体相比减少了 10-15%。
Summary / 总结
The paper proposes Flow Matching Policy with Entropy Regularization (FMER), an ODE-based RL framework that uses flow matching to parameterize policies and samples actions along a straight probability path. FMER constructs an advantage-weighted target velocity field to steer policy updates towards high-value regions and derives a tractable entropy objective for principled maximum-entropy optimization. Experiments show FMER outperforms state-of-the-art methods on sparse multi-goal tasks and reduces training time significantly compared to heavy diffusion baselines.
论文提出了基于ODE的RL框架Flow Matching Policy with Entropy Regularization (FMER),该框架解决了扩散策略中熵控制和计算上昂贵的策略梯度的挑战。FMER 使用流匹配来参数化策略,并沿一条直线概率路径采样动作,利用模型的生成特性构建一个加权目标速度场。这种方法使得可以进行原则性的最大熵优化并增强探索。实验结果表明,FMER 在稀疏多目标任务上优于最先进的方法,在标准基准上具有竞争力,并且与重扩散基线相比显著减少了训练时间。
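The two ingredients named in the FMER abstract, a straight probability path and an advantage-weighted target velocity built from a candidate set, can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration: the softmax weighting, the `temperature` parameter, and all function names are hypothetical and are not taken from the paper, whose exact objective is not given in the abstract.

```python
import math

def straight_path_point(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def advantage_weighted_velocity(x0, candidates, advantages, temperature=1.0):
    """Assumed target: softmax(advantage)-weighted mean of (a_i - x0).

    On a straight path the velocity toward a single target a_i is a_i - x0,
    so this averages those velocities with higher weight on better candidates.
    """
    weights = softmax(advantages, temperature)
    return [
        sum(w * (a[d] - x0[d]) for w, a in zip(weights, candidates))
        for d in range(len(x0))
    ]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a constant velocity field equal to `x1 - x0`, `euler_sample` moves the noise sample `x0` to `x1`, which is the sense in which a straight path makes ODE sampling cheap. A learned policy would replace the constant field with a network conditioned on the state.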
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
First: 2026-03-19T17:20:56+00:00 · Latest: 2026-03-19T17:20:56+00:00
Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/
Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
中文标题/摘要
标题:意义与测量:多智能体概率对接在视觉语言导航中的应用
机器人与人类协作时,必须将自然语言目标转化为可执行的、物理上可对接的决策。例如,执行“向冰箱右边两米处走”的命令需要在三维场景中对接语义参考、空间关系和度量约束。虽然最近的视觉语言模型(VLMs)展示了强大的语义对接能力,但它们并未明确设计用于在物理定义的空间中推理度量约束。在本研究中,我们实证展示了最先进的基于VLM的对接方法在处理复杂的度量语义语言查询时存在困难。为解决这一局限,我们提出了MAPG(多智能体概率对接)框架,将语言查询分解为结构化的子组件,并查询VLM对接每个组件。然后,MAPG通过概率组合这些对接输出,生成在三维空间中度量一致的可执行决策。我们使用HM-EQA基准对MAPG进行了评估,并展示了相对于强大基线的一致性能改进。此外,我们引入了一个新的基准MAPG-Bench,专门用于评估度量语义目标对接,填补了现有语言对接评估中的空白。我们还展示了在可用结构化场景表示的现实世界机器人演示,表明MAPG可以超越仿真。
Summary / 总结
This work addresses the challenge of converting complex metric-semantic language queries into actionable decisions for robots. It introduces MAPG (Multi-Agent Probabilistic Grounding), which decomposes language queries and uses a VLM to ground each component, then probabilistically composes these to produce metrically consistent actions. MAPG shows consistent performance improvements over strong baselines on the HM-EQA benchmark and introduces a new benchmark, MAPG-Bench, to evaluate metric-semantic goal grounding. Additionally, a real-world robot demonstration illustrates MAPG's potential for transferring beyond simulation when a structured scene representation is available.
该研究旨在将复杂的度量语义语言查询转换为机器人的可执行决策。作者提出了MAPG(多代理概率定位),该方法将查询分解为子组件,并使用VLM进行定位,然后通过概率组合确保度量一致性。HM-EQA上的实验表明,MAPG优于强基线,并引入了新的基准MAPG-Bench来评估度量语义目标定位。此外,一个现实世界的机器人演示证实了该框架在提供结构化场景表示时超越了模拟环境的有效性。
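To make the idea of probabilistically composing grounded components concrete, here is a small sketch for a query such as "go two meters to the right of the fridge". It is not MAPG's actual formulation: the Gaussian metric factor, the cosine-based directional factor, the `sigma_m` tolerance, and all names are hypothetical choices, and `direction` is assumed to be a unit vector in the scene frame.

```python
import math

def gaussian(x, mean, sigma):
    """Unnormalized Gaussian likelihood."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def score_goal(candidate, anchor, direction, distance_m, sigma_m=0.5):
    """Product of a metric factor and a directional factor for one (x, y) candidate.

    anchor:     grounded position of the reference object (e.g. the fridge)
    direction:  unit vector for the spatial relation (e.g. "right of")
    distance_m: metric constraint from the query, in meters
    """
    dx, dy = candidate[0] - anchor[0], candidate[1] - anchor[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0  # the anchor itself has no defined direction
    metric = gaussian(dist, distance_m, sigma_m)
    cos_sim = (dx * direction[0] + dy * direction[1]) / dist
    directional = max(0.0, cos_sim)  # zero score for the wrong half-plane
    return metric * directional

def best_goal(candidates, anchor, direction, distance_m):
    """Pick the candidate with the highest composed score."""
    return max(candidates, key=lambda c: score_goal(c, anchor, direction, distance_m))
```

Multiplying the factors treats the semantic, spatial, and metric components as independent pieces of evidence, so a candidate must satisfy all of them to score well.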
TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
Authors: Pepe Alonso, Sergio Yovine, Victor A. Braberman
First: 2026-03-18T17:38:22+00:00 · Latest: 2026-03-19T17:12:15+00:00
Comments: Toolpaper, 7 pages, 7 tables, 3 figures, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)
Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions -- breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool that performs pre-change impact analysis for AI coding agents. TDAD builds a dependency map between source code and tests so that before committing a patch, the agent knows which tests to verify and can self-correct. The map is delivered as a lightweight agent skill -- a static text file the agent queries at runtime. Evaluated on SWE-bench Verified with two open-weight models running on consumer hardware (Qwen3-Coder 30B, 100 instances; Qwen3.5-35B-A3B, 25 instances), TDAD reduced regressions by 70% (6.08% to 1.82%) compared to a vanilla baseline. In contrast, adding TDD procedural instructions without targeted test context increased regressions to 9.94% -- worse than no intervention at all. When deployed as an agent skill with a different model and framework, TDAD improved issue-resolution rate from 24% to 32%, confirming that surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
中文标题/摘要
标题:TDAD:基于测试驱动的代理开发 - 通过基于图的影响分析减少AI编码代理的代码回退
AI编码代理可以解决实际软件问题,但它们经常引入回退——打破之前通过的测试。当前基准测试几乎完全关注解决率,而忽视了回退行为的研究。本文介绍了TDAD(基于测试驱动的代理开发),这是一种开源工具,用于对AI编码代理进行预变更影响分析。TDAD构建了源代码与测试之间的依赖关系图,以便在提交补丁之前,代理知道需要验证哪些测试并可以自我纠正。该图以轻量级代理技能的形式提供——一个静态文本文件,代理在运行时查询。在SWE-bench Verified上使用两个运行于消费级硬件的开放权重模型(Qwen3-Coder 30B,100个实例;Qwen3.5-35B-A3B,25个实例)进行评估,与原始基线相比,TDAD将回退减少了70%(从6.08%降至1.82%)。相比之下,不带目标测试上下文的TDD过程指令反而将回退增加到9.94%,比没有任何干预还要差。当作为代理技能部署在不同模型和框架中时,TDAD将问题解决率从24%提高到32%,证实了提供上下文信息优于规定程序化工作流程。所有代码、数据和日志均可在https://github.com/pepealonso95/TDAD上公开获取。
Summary / 总结
TDAD (Test-Driven Agentic Development) is a tool that reduces code regressions in AI coding agents by performing pre-change impact analysis. It builds a dependency map between source code and tests, allowing the agent to verify relevant tests before committing a patch. Evaluated on SWE-bench Verified with two open-weight models, TDAD reduced regressions by 70% (from 6.08% to 1.82%) compared to a vanilla baseline and improved the issue-resolution rate from 24% to 32% when deployed as an agent skill with a different model and framework.
TDAD(Test-Driven Agentic Development)是一种工具,通过在更改前进行依赖图分析来帮助AI编码代理减少代码回退。通过构建源代码和测试之间的依赖图,TDAD帮助AI代理在提交更改前验证相关测试。在SWE-bench上使用两个开源模型进行评估时,TDAD将回退减少了70%,并且在作为代理技能部署时,将问题解决率从24%提高到了32%。仅添加TDD程序指令而没有目标测试上下文会将回退增加到9.94%。所有代码、数据和日志均已公开。
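The source-to-test dependency map described in the TDAD entry can be sketched as a simple inverted index. This is a simplification under stated assumptions: the real tool performs graph-based impact analysis, whereas the sketch below only inverts a direct "test exercises source file" relation. The function names, file paths, and the text layout produced by `render_map` are hypothetical.

```python
from collections import defaultdict

def build_test_map(test_sources):
    """Invert {test_file: [source files it exercises]} into {source: sorted tests}."""
    mapping = defaultdict(set)
    for test, sources in test_sources.items():
        for src in sources:
            mapping[src].add(test)
    return {src: sorted(tests) for src, tests in mapping.items()}

def impacted_tests(test_map, changed_files):
    """Tests an agent should re-run before committing a patch to changed_files."""
    hits = set()
    for path in changed_files:
        hits.update(test_map.get(path, []))
    return sorted(hits)

def render_map(test_map):
    """Serialize the map as static text, one 'source: tests' line per source file."""
    return "\n".join(
        f"{src}: {', '.join(tests)}" for src, tests in sorted(test_map.items())
    )

def relative_reduction(before, after):
    """Relative drop between two rates, e.g. regression percentages."""
    return (before - after) / before
```

As a check on the reported figures, `relative_reduction(6.08, 1.82)` is about 0.70, which matches the 70% reduction in regressions quoted in the abstract.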
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:11:49+00:00 · Latest: 2026-03-19T17:11:49+00:00
Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)
Abstract
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by using an LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and from suboptimal guidance caused by iterative text embedding switching. To address these problems, we propose ADAPT, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts on the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance on RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
中文标题/摘要
标题:ADAPT:注意力驱动的自适应提示调度和插值正交补对于稀有概念生成
对于文本到图像合成而言,在生成稀有组合概念方面,扩散模型仍然面临挑战,尤其是对于训练数据中不常见的属性。虽然最近的方法,如R2F,通过利用LLM进行提示调度来解决这一挑战,但由于语言模型的随机性和迭代文本嵌入切换的次优指导,它们仍然存在固有的方差问题。为了解决这些问题,我们提出了ADAPT框架,这是一种无需训练的框架,可以确定性地规划和语义对齐提示调度,提供一致的指导以增强稀有概念的组合。通过利用注意力分数和正交组件,ADAPT在无需额外训练或微调的情况下,显著增强了在RareBench基准上稀有概念的组合生成。通过全面的实验,我们证明ADAPT在RareBench上实现了优越的性能,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,而不牺牲视觉完整性。
Summary / 总结
The research aims to improve the generation of rare compositional concepts in text-to-image synthesis using diffusion models. ADAPT, an attention-driven adaptive prompt scheduling framework, is proposed to address the challenges of variance and suboptimal guidance from previous methods. By leveraging attention scores and orthogonal components, ADAPT provides consistent guidance and enhances the compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Experiments show that ADAPT outperforms existing methods and accurately reflects the semantic information of rare attributes, offering deterministic and precise control over the generation of rare compositions while maintaining visual integrity.
研究旨在利用扩散模型提高文本到图像合成中稀有组合概念的生成。提出了一个基于注意力驱动的自适应提示调度框架ADAPT,以解决先前方法中存在的方差和迭代文本嵌入切换的次优指导问题。通过利用注意力分数和正交组件,ADAPT 提供了持续的指导,并在 RareBench 基准测试中增强了稀有概念的组合生成,无需额外的训练或微调。实验表明,ADAPT 在 RareBench 上优于现有方法,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,同时保持了视觉完整性。
VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models
Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
First: 2026-03-19T17:10:29+00:00 · Latest: 2026-03-19T17:10:29+00:00
Comments: 23 pages. Includes figures and tables. Conference submission
Abstract
Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
中文标题/摘要
标题:VEPO:低资源语言基础模型的可变熵策略优化
大型语言模型在低资源语言上的表现经常不尽如人意,主要原因是子词分段效率低下和系统性训练数据不平衡。本文提出了一种可变熵策略优化(VEPO),它利用可验证奖励的强化学习来将确定性的结构约束纳入策略对齐过程中。该框架确保了规定的序列长度、稳健的格式一致性以及严格的语言正确性,所有这些都在训练过程中得到保证。我们方法的核心是一种可变熵机制,它使模型能够动态校准字面忠实度和语义自然性之间的平衡,通过调节探索与利用的平衡。通过结合熵校正的优势估计与不对称剪裁,VEPO 保持了稳健的探索,同时减轻了策略崩溃。在90个FLORES-200、COMET-22、chrF方向上的实证评估表明,VEPO 在分词效率和翻译质量上取得了显著改进,缩小了代表性不足语言的性能差距。
Summary / 总结
This paper addresses the suboptimal performance of large language models on low-resource languages by proposing Variable Entropy Policy Optimization (VEPO), which uses Reinforcement Learning with Verifiable Rewards to incorporate structural constraints. VEPO enforces sequence length, format consistency, and linguistic well-formedness during training, and uses a variable entropy mechanism to dynamically balance literal fidelity and semantic naturalness. Experiments across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show that VEPO improves both tokenization efficiency and translation quality, narrowing the performance gap for underrepresented languages.
论文针对大型语言模型在低资源语言上的表现不佳问题,提出了Variable Entropy Policy Optimization (VEPO) 方法,该方法利用可验证奖励的强化学习来引入结构约束。VEPO 在训练过程中确保了序列长度、格式一致性和语言正确性。实验结果表明,VEPO 在提高分词效率和翻译质量方面取得了显著成效,缩小了未充分代表语言的性能差距。
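The asymmetric clipping mentioned in the VEPO abstract can be illustrated with a PPO-style clipped surrogate that uses different lower and upper bounds. This is only a sketch of the general technique: the bound values `eps_low` and `eps_high` are hypothetical, and the paper's entropy-tempered advantage estimation is not modeled here because the abstract does not specify its form.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-token clipped policy objective with asymmetric bounds.

    ratio:     pi_new(a | s) / pi_old(a | s)
    advantage: estimated advantage for the sampled token
    A larger eps_high than eps_low lets low-probability tokens with positive
    advantage grow more before clipping, which helps sustain exploration.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound, as in PPO: take the smaller of the two terms.
    return min(ratio * advantage, clipped * advantage)
```

For instance, with these illustrative bounds a ratio of 1.25 and a positive advantage passes through unclipped, whereas a symmetric bound of 0.2 would have capped it at 1.2.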
Optimal Splitting of Language Models from Mixtures to Specialized Domains
Authors: Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier
First: 2026-03-19T17:07:05+00:00 · Latest: 2026-03-19T17:07:05+00:00
Comments: 26 pages, 11 tables, 17 figures
Abstract
Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data, followed by specialization on a subset of high-quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D' specialization tokens, and extrapolates to larger model sizes and numbers of tokens. Applied to language model training, our approach consistently improves performance on common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
中文标题/摘要
标题:语言模型从混合到专业领域的最优划分
语言模型由于预训练数据的规模和多样性,在各种知识、语言和推理任务中取得了令人印象深刻的性能。标准训练范式是两阶段模式:首先在全部数据语料上进行预训练,然后在全部语料中高质量的专业化数据子集上进行专业化训练。在多领域设置中,这涉及在每个专业化领域继续对多个模型进行预训练,称为划分模型训练。我们提出了一种方法,即在通用预训练语料上独立预训练多个模型,并使用缩放定律确定预训练和继续预训练之间的最优计算分配。我们的方法准确预测了具有D个预训练令牌和D'个专业化令牌的大小为N的模型的损失,并外推到更大的模型大小和令牌数量。应用于语言模型训练,我们的方法在不同模型大小和计算预算下的一般常识知识和推理基准测试中均提高了性能。
Summary / 总结
The research aims to optimize the splitting of language models from a general pretraining corpus to specialized domains. The method involves pretraining multiple models independently on the general corpus and using scaling laws to determine the optimal compute allocation between pretraining and specialization. Key findings show consistent improvements in performance across common sense knowledge and reasoning benchmarks for different model sizes and compute budgets.
研究旨在优化将语言模型从通用预训练语料库拆分到专门领域的方法。方法包括独立地在通用语料库上预训练多个模型,并使用标度定律来确定预训练和专门化之间的最优计算分配。关键发现表明,不同模型大小和计算预算下,在常识知识和推理基准测试中的性能得到了一致的提升。
Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection
Authors: Ruilin Li, Heming Zou, Xiufeng Yan, Zheming Liang, Jie Yang, Chenliang Li, Xue Yang
First: 2026-03-19T17:01:14+00:00 · Latest: 2026-03-19T17:01:14+00:00
Abstract
Recent paradigms in Random Projection Layer (RPL)-based continual representation learning have demonstrated superior performance when building upon a pre-trained model (PTM). These methods insert a randomly initialized RPL after a PTM to enhance feature representation in the initial stage. Subsequently, a linear classification head is used for analytic updates in the continual learning stage. However, when there is a severe domain gap between the pre-trained representations and the target domain, a randomly initialized RPL exhibits limited expressivity. While substantially scaling up the RPL dimension can improve expressivity, it also induces an ill-conditioned feature matrix, thereby destabilizing the recursive analytic updates of the linear head. To this end, we propose the Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM). Unlike random initialization, MGSM constructs the projection layer via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt the PTM representation to downstream tasks. This facilitates the construction of a compact yet expressive RPL while improving the numerical stability of analytic updates. Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.
中文标题/摘要
标题:基于引导随机投影的预训练模型导向连续表示学习增强
基于随机投影层(RPL)的连续表示学习近期范式在构建于预训练模型(PTM)之上时表现出优越性能。这些方法在PTM之后插入一个随机初始化的RPL以增强初始阶段的特征表示。随后,使用线性分类头进行连续学习阶段的分析更新。然而,在预训练表示与目标域之间存在严重领域差距的情况下,随机初始化的RPL在大规模领域转移下表现出有限的表达能力。虽然大幅增加RPL维度可以提高表达能力,但也导致特征矩阵病态,从而不稳定化线性头的递归分析更新。为此,我们提出了一种具有记忆守护监督机制的随机连续学习器(SCL-MGSM)。与随机初始化不同,MGSM通过一个原理上、数据导向的机制逐步选择目标对齐的随机基,以适应PTM表示以适应下游任务。这有助于构建紧凑而具有表达能力的RPL,并提高分析更新的数值稳定性。在多个无示例增量学习(CIL)基准上的广泛实验表明,SCL-MGSM在性能上优于现有方法。
Summary / 总结
The paper addresses the challenge of limited expressivity in Random Projection Layers (RPL) under large domain shifts in pre-trained model-based continual representation learning. It proposes SCL-MGSM, which constructs the RPL via a data-guided mechanism to adapt pre-trained representations to downstream tasks, improving both performance and numerical stability. Experiments show SCL-MGSM outperforms existing methods on multiple Class Incremental Learning benchmarks.
论文针对预训练模型在大域移位下的随机投影层(RPL)表达能力有限的问题,提出了一种数据导向机制构建RPL的SCL-MGSM方法,提高了表达能力和数值稳定性。实验结果显示,SCL-MGSM在多个类增量学习基准上优于现有方法。
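The baseline pipeline that the SCL-MGSM abstract starts from, a randomly initialized projection layer followed by an analytically updated linear head, can be sketched in pure Python. This shows only that baseline, not the MGSM basis-selection mechanism, which the abstract does not detail. The class and function names are hypothetical, and the head is re-solved from accumulated statistics rather than updated with a literal recursive formula.

```python
import random

def random_projection(dim_in, dim_out, seed=0):
    """Randomly initialized projection weights of shape (dim_in, dim_out)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim_out)] for _ in range(dim_in)]

def project(features, weights):
    """ReLU(features @ weights) for a single feature vector."""
    dim_out = len(weights[0])
    return [
        max(0.0, sum(f * weights[i][j] for i, f in enumerate(features)))
        for j in range(dim_out)
    ]

class AnalyticHead:
    """Ridge-regression head fit from running statistics, with no stored exemplars."""

    def __init__(self, dim, num_classes, ridge=1.0):
        # Gram matrix starts at ridge * I, which keeps the system well-conditioned.
        self.gram = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.cross = [[0.0] * num_classes for _ in range(dim)]

    def update(self, h, label):
        """Accumulate h h^T and h y^T for one sample with a one-hot label."""
        for i, hi in enumerate(h):
            for j, hj in enumerate(h):
                self.gram[i][j] += hi * hj
            self.cross[i][label] += hi

    def solve(self):
        """Solve gram @ W = cross by Gauss-Jordan elimination with partial pivoting."""
        n = len(self.gram)
        aug = [g[:] + c[:] for g, c in zip(self.gram, self.cross)]
        for col in range(n):
            pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
            aug[col], aug[pivot] = aug[pivot], aug[col]
            p = aug[col][col]
            aug[col] = [v / p for v in aug[col]]
            for r in range(n):
                if r != col:
                    factor = aug[r][col]
                    aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
        return [row[n:] for row in aug]

def predict(weights, h):
    """Class index with the highest linear score."""
    num_classes = len(weights[0])
    scores = [sum(hi * weights[i][c] for i, hi in enumerate(h)) for c in range(num_classes)]
    return max(range(num_classes), key=scores.__getitem__)
```

Because the head depends only on the accumulated Gram and cross-correlation matrices, new classes can be added by further `update` calls without revisiting old data. The conditioning of the Gram matrix is what degrades as the projection dimension grows, which is the instability the abstract points to.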