arXiv Paper Digest

2025-11-04 03:24
Snapshot: 20251104_0324
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Authors: Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang
First: 2025-10-31T17:55:10+00:00 · Latest: 2025-10-31T17:55:10+00:00
Abstract
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
中文标题/摘要
标题:分阶段DMD:基于子区间内评分匹配的多步分布匹配蒸馏
分布匹配蒸馏(DMD)将基于评分的生成模型提炼为高效的一步生成器,无需与教师模型的采样轨迹一一对应。然而,有限的模型容量导致一步提炼模型在复杂的生成任务上表现不佳,例如文本到视频生成中的复杂对象运动合成。直接将DMD扩展到多步提炼会增加内存使用和计算深度,导致不稳定性和效率降低。尽管先前的工作提出随机梯度截断作为潜在解决方案,我们观察到它显著降低了多步提炼模型的生成多样性,使其与一步模型相当。为了解决这些限制,我们提出了分阶段DMD,这是一种多步提炼框架,将阶段式提炼与专家混合(MoE)相结合,降低学习难度并增强模型容量。分阶段DMD基于两个关键思想:逐步分布匹配和子区间内评分匹配。首先,我们的模型将信噪比(SNR)范围划分为子区间,逐步将模型细化到更高的SNR水平,以更好地捕捉复杂分布。其次,为了确保每个子区间内的训练目标准确,我们进行了严格的数学推导。我们通过提炼最先进的图像和视频生成模型,包括Qwen-Image(200亿参数)和Wan2.2(280亿参数)来验证分阶段DMD。实验结果表明,分阶段DMD在保持输出多样性的同时保留了关键的生成能力。我们将发布我们的代码和模型。
Summary / 总结
Phased DMD is a multi-step distillation framework that combines phase-wise distillation with Mixture-of-Experts (MoE). It divides the SNR range into subintervals and progressively refines the model toward higher SNR levels, with score matching carried out within each subinterval. Distilling Qwen-Image (20B parameters) and Wan2.2 (28B parameters), the authors report that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities.
Phased DMD 是一种多步蒸馏框架,通过将 SNR 范围划分为子区间并逐步细化模型来提高生成模型在复杂任务上的性能。该方法增强了模型容量和学习效率,从而在保持关键生成能力的同时,比传统 DMD 更好地保留了输出多样性。
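The idea of splitting the noise range into subintervals, each handled by its own expert, can be illustrated with a small routing sketch. This is only an illustration under stated assumptions: the equal split in a normalized timestep `t` (with `t = 1` as the lowest SNR), and the helper names `phase_boundaries` and `phase_for_timestep`, are hypothetical and not taken from the paper.

```python
def phase_boundaries(num_phases):
    """Descending boundaries over t in [0, 1]; t = 1 is treated as the lowest SNR.

    Assumption: equal-width subintervals. The paper may partition the SNR range differently.
    """
    return [1.0 - k / num_phases for k in range(num_phases + 1)]


def phase_for_timestep(t, boundaries):
    """Return the index of the phase (expert) whose subinterval contains t.

    Phase k covers the half-open interval (boundaries[k + 1], boundaries[k]].
    """
    for k in range(len(boundaries) - 1):
        if boundaries[k + 1] < t <= boundaries[k]:
            return k
    # t == 0 (fully clean sample) is assigned to the last, highest-SNR phase.
    return len(boundaries) - 2


# A hypothetical 4-step sampler routed across 2 phases.
boundaries = phase_boundaries(2)
routing = [phase_for_timestep(t, boundaries) for t in (1.0, 0.75, 0.5, 0.25)]
print(routing)
```

In this sketch the first two sampling steps would be served by the low-SNR expert and the last two by the high-SNR expert, which mirrors the progressive, phase-by-phase refinement described in the abstract.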
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Authors: Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
First: 2025-10-31T17:49:01+00:00 · Latest: 2025-10-31T17:49:01+00:00
Abstract
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
中文标题/摘要
标题:PETAR:基于掩码意识的视图-语言建模在PET自动化报告中的局部发现生成
近期在视图-语言模型(VLMs)方面的进展使多模态推理变得令人印象深刻,但大多数医学应用仍局限于二维成像。在本研究中,我们将VLMs扩展到3D正电子发射断层扫描和计算机断层扫描(PET/CT),这是一个以大量体数据、小而分散的病灶和冗长的放射学报告为特征的领域。我们引入了一个包含超过11,000个病灶级描述的大规模数据集,这些描述与来自超过5,000次PET/CT检查的3D分割配对,通过混合基于规则和大型语言模型(LLM)的管道提取。基于此数据集,我们提出了PETAR-4B,这是一种3D掩码意识的视图-语言模型,将PET、CT和病灶轮廓结合在一起,用于空间定位报告生成。PETAR将全局上下文推理与细粒度病灶意识相结合,生成临床连贯且局部化的发现。全面的自动化和人工评估表明,PETAR在提高PET/CT报告生成质量方面取得了显著进步,推动了3D医学视图-语言理解的发展。
Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Authors: Bo Li, Duyuan Zheng, Xinyang Liu, Qingwen Li, Hong Li, Hongyan Cui, Ge Gao, Chen Liu
First: 2025-10-31T17:43:50+00:00 · Latest: 2025-10-31T17:43:50+00:00
Comments: 12 pages, conference
Abstract
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
中文标题/摘要
标题:用于复杂监控场景中鲁棒遮挡人体重识别的视觉变换器
监控中的重识别(ReID)受到遮挡、视角失真和图像质量差的挑战。大多数现有方法依赖于复杂的模块,或者只能在清晰的正面图像上表现良好。我们提出了Sh-ViT(Shuffling Vision Transformer),一种轻量级且对遮挡鲁棒的模型。基于ViT-Base,Sh-ViT 引入了三个组件:首先,在最终的Transformer层中引入Shuffle模块以打破空间相关性并增强对遮挡和模糊的鲁棒性;其次,采用适应场景的增强(几何变换、擦除、模糊和色彩调整)以模拟监控条件;第三,基于DeiT的知识蒸馏以提高在有限标签下的学习效果。为了支持实际评估,我们构建了MyTT数据集,包含超过10,000名行人的30,000多张图像,来自基站检查,其中包含频繁的设备遮挡和摄像头变化。实验表明,Sh-ViT在MyTT上的Rank-1达到83.2%,mAP达到80.1%,优于CNN和ViT基线;在Market1501上的Rank-1达到94.6%,mAP达到87.5%,超越了最先进的方法。总之,Sh-ViT在不依赖外部模块的情况下提高了对遮挡和模糊的鲁棒性,为基于监控的人员监控提供了一个实用的解决方案。
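The Shuffle module is described only as breaking spatial correlations in the final Transformer layer. One plausible reading, sketched below, is a random permutation of the patch tokens that leaves the class token in place. This is an assumption for illustration: the function name `shuffle_patch_tokens` and the choice to keep the first token fixed are hypothetical and may not match the authors' module.

```python
import random


def shuffle_patch_tokens(tokens, rng, keep_first=1):
    """Randomly permute patch tokens while leaving the leading tokens untouched.

    tokens: sequence of token embeddings (any objects), class token first.
    keep_first: number of leading tokens (e.g. the class token) kept in place.
    """
    head = list(tokens[:keep_first])
    patches = list(tokens[keep_first:])  # copy, so the input is not modified
    rng.shuffle(patches)
    return head + patches


tokens = ["cls", "p0", "p1", "p2", "p3"]
shuffled = shuffle_patch_tokens(tokens, random.Random(0))
print(shuffled)
```

Because the permutation discards patch order, any representation computed afterwards cannot rely on the spatial position of an occluded region, which is the intuition the abstract gives for the module.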
Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Authors: Surajit Das, Pavel Zun
First: 2025-09-14T17:12:17+00:00 · Latest: 2025-10-31T17:37:42+00:00
Abstract
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent challenges of bright-field images -- their noisiness, low contrast, and dynamic morphology -- coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications -- particularly in cell transplantation for personalised medicine and muscle regeneration therapies. Access to the application is available for reproducibility.
中文标题/摘要
标题:低成本明场分割智能软件系统:基于Python的细胞自动分析算法实现
明场显微镜是一种成本效益高的活细胞培养解决方案,对于许多预算有限的实验室来说,它通常是唯一可用的资源,搭配标准CPU。明场图像固有的挑战——噪声、低对比度和动态形态——与缺乏GPU资源和复杂软件界面相结合,阻碍了理想的科研产出。本文介绍了一种专为配备标准CPU台式机的预算有限实验室设计的新型显微镜图像分析框架。基于Python的程序通过先进的计算机视觉和机器学习流水线,实现了活的未染色细胞的细胞学分析。该框架在无需手动标注训练数据或训练阶段的情况下,即可在无标记数据上运行。它通过一个用户友好的跨平台GUI提供,无需编程技能,同时为开发者提供脚本接口以实现程序化控制和集成。端到端的工作流执行语义和实例分割、特征提取、分析、评估和自动化报告生成。其模块化架构支持易于维护和灵活集成,同时支持单张图像和批量处理。该框架在公共livecells数据集的多种未染色细胞类型上进行了验证,其准确性和可重复性优于当前工具如Cellpose和StarDist。基于CPU平台的竞争性分割速度突显了其在基础研究和临床应用中的巨大潜力,特别是在个性化医疗和肌肉再生疗法中的细胞移植应用。该应用程序的访问权限可用于可重复性
Summary / 总结
This paper introduces an intelligent software system for low-cost bright-field segmentation, addressing the challenges faced by low-budget labs. The system uses a Python-based advanced computer vision and machine learning pipeline to perform cytometric analysis of live, unstained cells without requiring manually annotated training data. It achieves superior accuracy and reproducibility compared to tools like Cellpose and StarDist, and its competitive segmentation speed on a CPU-based platform makes it suitable for basic research and clinical applications such as personalized medicine and muscle regeneration therapies.
本文介绍了一种低成本明场分割的智能软件系统,旨在解决低预算实验室面临的挑战。该系统使用基于Python的高级计算机视觉和机器学习流水线,对未染色的活细胞进行细胞学分析,无需手动标注训练数据。与Cellpose和StarDist等工具相比,该系统在准确性和可重复性方面表现出色,并且在基于CPU的平台上具有竞争力的分割速度,使其适用于基础研究和临床应用,如个性化医疗和肌肉再生疗法。
The End of Manual Decoding: Towards Truly End-to-End Language Models
Authors: Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang
First: 2025-10-30T17:01:43+00:00 · Latest: 2025-10-31T17:36:35+00:00
Abstract
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
中文标题/摘要
标题:手动解码终结:迈向真正端到端的语言模型
对于大语言模型(LLM)来说,“端到端”标签是一个误导。实际上,它们依赖于一个非可微分的解码过程,这需要手动调整温度和top-p等超参数。本文介绍了AutoDeco,这是一种新型架构,通过学习控制自己的解码策略,使其能够真正实现“端到端”生成。我们通过在标准变压器中添加轻量级头部,使模型在每一步动态预测上下文特定的温度和top-p值,以及下一个标记的概率。这种方法将解码转换为参数化、标记级的过程,允许模型在单次前向传播中自我调节其采样策略。通过在八个基准上的广泛实验,我们证明AutoDeco不仅显著优于默认解码策略,而且其性能与通过“破解测试集”获得的Oracle调优基线相当,这是任何静态方法的实用上限。最关键的是,我们发现了一种新兴的能力:基于指令的解码控制:模型学会解释自然语言命令(例如,“生成低随机性”),并在每个标记的基础上调整其预测的温度和top-p,从而开启了一种可引导和交互的大语言模型解码的新范式。
Summary / 总结
This paper targets the reliance of large language models (LLMs) on hand-tuned, non-differentiable decoding. It introduces AutoDeco, an architecture whose lightweight heads predict context-specific temperature and top-p values at each step alongside the next-token logits, turning decoding into a parametric, token-level process. Experiments on eight benchmarks show that AutoDeco outperforms default decoding strategies and achieves performance comparable to an oracle-tuned baseline. The authors also report an emergent capability for instruction-based decoding control, in which natural language commands shift the predicted temperature and top-p.
本文通过引入AutoDeco这一新颖架构解决了大型语言模型(LLMs)中的非端到端解码问题,该架构能够学习控制自身的解码策略。通过在标准变压器中添加轻量级头部,AutoDeco在每个步骤中动态预测上下文特定的温度和top-p值,将解码过程转化为参数化过程。在八个基准上的广泛实验表明,AutoDeco不仅超越了默认解码策略,还达到了与通过“破解测试集”得到的最优基线相当的性能,展示了其基于指令的解码控制能力,并为可调节和交互式的LLM解码开辟了新范式。
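To make the token-level decoding idea concrete, the sketch below samples one token using a temperature and top-p supplied per step. It is a minimal illustration, not AutoDeco itself: `pred_temperature` and `pred_top_p` are hypothetical stand-ins for the outputs of the paper's prediction heads, and the filtering shown is ordinary hard nucleus sampling, which may differ from how the authors train or apply their heads.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample_step(logits, pred_temperature, pred_top_p, rng):
    """Sample one token id using per-step temperature and top-p values."""
    probs = softmax(logits, max(pred_temperature, 1e-6))  # guard against zero
    nucleus = top_p_filter(probs, pred_top_p)
    token_ids = list(nucleus)
    weights = [nucleus[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


rng = random.Random(0)
# With these values the nucleus contains only the most likely token (id 0).
print(sample_step([2.0, 1.0, 0.0], pred_temperature=1.0, pred_top_p=0.5, rng=rng))
```

In a static setup the two values would be fixed for the whole generation; the abstract's point is that they can instead vary from token to token.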
Deep learning denoising unlocks quantitative insights in operando materials microscopy
Authors: Samuel Degnan-Morgenstern, Alexander E. Cohen, Rajeev Gopal, Megan Gober, George J. Nelson, Peng Bai, Martin Z. Bazant
First: 2025-10-31T17:34:05+00:00 · Latest: 2025-10-31T17:34:05+00:00
Abstract
Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
中文标题/摘要
标题:深度学习去噪解锁原位材料显微镜的定量见解
原位显微镜提供了对功能材料动态化学和物理过程的直接洞察,但测量噪声限制了有效分辨率并削弱了定量分析。在此,我们提出了一种将无监督深度学习去噪整合到跨模态和长度尺度的定量显微镜工作流程中的通用框架。通过模拟数据,我们证明深度去噪保留了物理保真度,引入了最小偏见,并通过偏微分方程(PDE)约束优化减少了模型学习中的不确定性。应用于实验,去噪揭示了锂铁磷酸盐(LFP)扫描透射X射线显微镜(STXM)中的纳米尺度化学和结构异质性,使光学显微镜中石墨电极颗粒分割和相分类自动化,并通过近80%减少中子成像中的噪声引起的变异性以解析异质锂传输。这些结果共同确立了深度去噪作为一种强大的、模态无关的增强技术,推动了定量原位成像并扩展了以前受噪声限制的技术的应用范围。
Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems
Authors: Alireza Saleh Abadi, Leen-Kiat Soh
First: 2025-10-31T17:30:32+00:00 · Latest: 2025-10-31T17:30:32+00:00
Abstract
In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynamic nature of agent populations, tasks, and agent types within a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabilities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, introducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
中文标题/摘要
标题:多智能体强化学习中开放系统中的责任分配挑战
在快速发展的多智能体强化学习(MARL)领域,理解开放系统的动态至关重要。MARL中的开放性指的是系统中智能体群体、任务和智能体类型具有动态性。具体来说,根据(Eck et al. 2023) [2] 报告,有三种类型的开放性:智能体开放性,即智能体可以随时进入或离开系统;任务开放性,即新任务出现,现有任务演化或消失;类型开放性,即智能体的能力和行为随时间变化。本报告提供了一个概念性和实证性的回顾,重点关注开放性与信用分配问题(CAP)之间的相互作用。CAP涉及确定个体智能体对整体系统性能的贡献,这一任务在开放环境中变得越来越复杂。传统的信用分配(CA)方法通常假设智能体群体是静态的,任务是固定且预先定义的,类型是稳定的,这使得它们不适合开放系统。我们首先进行概念分析,引入新的开放性子类别,详细说明智能体更替或任务取消如何破坏现有CAP方法所依赖的环境稳定性和固定团队组成假设。然后,我们使用代表性的时序和结构算法在开放环境中进行实证研究。结果表明,开放性直接导致信用误分配,表现为不稳定的损失函数和显著的性能下降。
Summary / 总结
The paper addresses the challenges in credit assignment for multi-agent reinforcement learning in open systems, where agents, tasks, and agent types can dynamically change. It introduces new sub-categories of openness and shows that traditional credit assignment methods are inadequate for such environments. An empirical study using temporal and structural algorithms in an open environment reveals that openness leads to credit misattribution, evidenced by unstable loss functions and performance degradation.
论文探讨了在开放系统中多智能体强化学习中的信用分配挑战,其中智能体、任务和智能体类型可以动态变化。它引入了开放性的新子类别,并表明传统的信用分配方法不适用于此类环境。通过在开放环境中使用时间性和结构性算法进行的实证研究显示,开放性会导致信用误分配,表现为损失函数不稳定和性能下降。
Community Detection on Model Explanation Graphs for Explainable AI
Authors: Ehsan Moradi
First: 2025-10-31T17:27:56+00:00 · Latest: 2025-10-31T17:27:56+00:00
Abstract
Feature-attribution methods (e.g., SHAP, LIME) explain individual predictions but often miss higher-order structure: sets of features that act in concert. We propose Modules of Influence (MoI), a framework that (i) constructs a model explanation graph from per-instance attributions, (ii) applies community detection to find feature modules that jointly affect predictions, and (iii) quantifies how these modules relate to bias, redundancy, and causality patterns. Across synthetic and real datasets, MoI uncovers correlated feature groups, improves model debugging via module-level ablations, and localizes bias exposure to specific modules. We release stability and synergy metrics, a reference implementation, and evaluation protocols to benchmark module discovery in XAI.
中文标题/摘要
标题:模型解释图上的社区检测以实现可解释AI
特征归因方法(例如SHAP、LIME)解释单个预测,但往往忽略了更高阶的结构:共同作用的特征集合。我们提出了一种影响模块(MoI)框架,该框架(i)从单个实例的归因中构建模型解释图,(ii)应用社区检测以发现共同影响预测的特征模块,并(iii)量化这些模块与偏差、冗余和因果关系模式之间的关系。在合成和真实数据集上,MoI 揭示了相关特征组,通过模块级消融改进了模型调试,并将偏差暴露定位到特定模块。我们发布了稳定性和协同性指标、参考实现和评估协议,以基准模块发现的XAI。
Summary / 总结
The research aims to uncover higher-order structure in model explanations by detecting communities in model explanation graphs. The method involves constructing a graph from feature attributions, applying community detection to identify feature modules that jointly influence predictions, and quantifying their relationships to bias, redundancy, and causality. Key findings include the discovery of correlated feature groups, improved model debugging through module-level ablations, and localization of bias exposure to specific modules. Stability and synergy metrics, a reference implementation, and evaluation protocols are provided for benchmarking module discovery in explainable AI.
研究旨在通过检测模型解释图中的社区来揭示模型解释中的更高阶结构。方法包括从特征归因构建图,应用社区检测来识别联合影响预测的特征模块,并量化这些模块与偏差、冗余和因果关系之间的关系。关键发现包括发现相关特征组,通过模块级消融改进模型调试,并将偏差暴露定位到特定模块。提供了模块发现的稳定性和协同性指标、参考实现和评估协议以供解释性AI中的模块发现基准测试。
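The first two MoI steps (build a graph from per-instance attributions, then find feature modules) can be sketched in a few lines. This is an illustrative sketch under explicit assumptions: absolute Pearson correlation between attribution columns stands in for the paper's edge weighting, and a simple label-propagation pass stands in for whichever community detection algorithm the authors use. The function names and the toy attribution matrix are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def explanation_graph(attributions, threshold=0.5):
    """Connect features whose attributions co-vary across instances."""
    n_features = len(attributions[0])
    cols = [[row[j] for row in attributions] for j in range(n_features)]
    graph = {j: {} for j in range(n_features)}
    for i in range(n_features):
        for j in range(i + 1, n_features):
            weight = abs(pearson(cols[i], cols[j]))
            if weight >= threshold:
                graph[i][j] = weight
                graph[j][i] = weight
    return graph


def label_propagation(graph, max_iters=100):
    """Each node repeatedly adopts the label with the largest neighbor weight."""
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated feature keeps its own label
            votes = {}
            for neighbor, weight in graph[node].items():
                votes[labels[neighbor]] = votes.get(labels[neighbor], 0.0) + weight
            best = min(votes, key=lambda lab: (-votes[lab], lab))  # ties: smallest label
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    modules = {}
    for node, lab in labels.items():
        modules.setdefault(lab, []).append(node)
    return sorted(sorted(members) for members in modules.values())


# Toy attributions: features 0 and 1 move together, as do features 2 and 3.
attributions = [
    [1.0, 2.0, 1.0, -1.0],
    [2.0, 4.0, -1.0, 1.0],
    [3.0, 6.0, -1.0, 1.0],
    [4.0, 8.0, 1.0, -1.0],
]
print(label_propagation(explanation_graph(attributions)))
```

A module-level ablation, as mentioned in the abstract, would then mask all features of one recovered module at once rather than one feature at a time.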
Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition
Authors: Shuyan Lyu, Zhanzimo Wu, Junliang Du
First: 2025-10-31T17:24:58+00:00 · Latest: 2025-10-31T17:24:58+00:00
Abstract
Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi's $\alpha$-order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
中文标题/摘要
标题:信息论驱动的分层贪婪训练在交通标志识别中的应用
现代深度神经网络(DNNs)通常以监督的端到端方式全局使用交叉熵损失进行训练:神经元需要存储其传出权重;训练交替进行前向传播(计算)和自上而下的反向传播(学习),这在生物学上是不可行的。相反,贪婪分层训练消除了交叉熵损失和反向传播的需要。通过避免中间梯度的计算和中间输出的存储,它减少了内存使用并有助于缓解梯度消失或爆炸的问题。然而,大多数现有的分层训练方法仅在相对较小的数据集和简单的深度架构上进行了评估。在本文中,我们首先通过信息论的视角系统地分析了通过随机梯度下降(SGD)训练的流行卷积神经网络(CNNs)的训练动态。我们的发现表明,网络从下到上逐层收敛,信息流遵循马尔可夫信息瓶颈原则。基于这些观察,我们提出了一种基于最近开发的确定性信息瓶颈(DIB)和矩阵形式的Rényi的α阶熵泛函的新型分层训练方法。具体而言,每个层与直接连接到输出层的辅助分类器联合训练,使学习最小充分的任务相关表示成为可能。我们通过现代深度CNN在CIFAR-10和CIFAR-100上的实验验证了我们训练程序的有效性,并进一步证明了其在涉及交通标志识别的实际任务中的适用性。我们的方法不仅优于现有的分层训练基线,而且在性能上与SGD相当。
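The matrix-based Rényi entropy named in the abstract is estimated from a normalized Gram matrix rather than from an explicit density. The sketch below covers only the special case $\alpha = 2$, chosen here because $\mathrm{tr}(A^2)$ equals the sum of squared entries of a symmetric matrix and so needs no eigendecomposition. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions for illustration, and the paper may use a different $\alpha$.

```python
import math


def gaussian_gram(points, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))
            for q in points
        ]
        for p in points
    ]


def renyi_entropy_order2(gram):
    """Matrix-based Renyi entropy of order 2, in bits.

    With A = K / tr(K), S_2(A) = -log2(tr(A^2)), and for symmetric A
    tr(A^2) is the sum of the squared entries of A.
    """
    n = len(gram)
    trace = sum(gram[i][i] for i in range(n))
    normalized = [[value / trace for value in row] for row in gram]
    trace_sq = sum(value * value for row in normalized for value in row)
    return -math.log2(trace_sq)


# Four well-separated points behave like four equiprobable states.
spread = [(0.0,), (100.0,), (200.0,), (300.0,)]
print(renyi_entropy_order2(gaussian_gram(spread, sigma=1.0)))
```

The two extremes give a quick sanity check: identical points yield an entropy near zero, while `n` mutually distant points approach `log2(n)` bits.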
Kernel conditional tests from learning-theoretic bounds
Authors: Pierre-François Massiani, Christian Fiedler, Lukas Haverbeck, Friedrich Solowjow, Sebastian Trimpe
Venue: NeurIPS 2025
First: 2025-06-04T12:53:13+00:00 · Latest: 2025-10-31T17:19:02+00:00
Comments: 46 pages, 8 figures, 9 tables. Accepted at NeurIPS 2025; to appear in the proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems. Reviews and discussion: https://openreview.net/forum?id=hJKDwf32Xu
Abstract
We propose a framework for hypothesis testing on conditional probability distributions, which we then use to construct statistical tests of functionals of conditional distributions. These tests identify the inputs where the functionals differ with high probability, and include tests of conditional moments or two-sample tests. Our key idea is to transform confidence bounds of a learning method into a test of conditional expectations. We instantiate this principle for kernel ridge regression (KRR) with subgaussian noise. An intermediate data embedding then enables more general tests -- including conditional two-sample tests -- via kernel mean embeddings of distributions. To have guarantees in this setting, we generalize existing pointwise-in-time or time-uniform confidence bounds for KRR to previously-inaccessible yet essential cases such as infinite-dimensional outputs with non-trace-class kernels. These bounds also circumvent the need for independent data, allowing for instance online sampling. To make our tests readily applicable in practice, we introduce bootstrapping schemes leveraging the parametric form of testing thresholds identified in theory to avoid tuning inaccessible parameters. We illustrate the tests on examples, including one in process monitoring and comparison of dynamical systems. Overall, our results establish a comprehensive foundation for conditional testing on functionals, from theoretical guarantees to an algorithmic implementation, and advance the state of the art on confidence bounds for vector-valued least squares estimation.
中文标题/摘要
标题:内核条件检验从学习理论界值出发
我们提出了一种条件概率分布假设检验的框架,然后利用该框架构建了条件分布函数的统计检验。这些检验能够识别出函数值差异显著的输入,并包括条件矩检验或两样本检验。我们的核心思想是将学习方法的置信界转化为条件期望的检验。我们针对具有次高斯噪声的核岭回归(KRR)实例化了这一原则。中间的数据嵌入使我们能够通过分布的核均值嵌入实现更一般的检验——包括条件两样本检验。为了在该框架下提供保证,我们将现有的针对KRR的时间点或时间一致置信界推广到以前无法访问但至关重要的情况,例如无限维输出和非迹类核。这些界值还避免了独立数据的需求,允许例如在线采样。为了使我们的检验在实践中易于应用,我们引入了利用理论中识别的测试阈值的参数形式的重抽样方案,以避免调整不可调参数。我们在示例中展示了这些检验,包括过程监控和动态系统的比较。总体而言,我们的结果为条件函数检验奠定了全面的基础,从理论保证到算法实现,并推进了向量值最小二乘估计置信界的前沿。
Summary / 总结
This paper introduces a framework for hypothesis testing on conditional probability distributions, using kernel ridge regression with subgaussian noise. The method transforms learning-theoretic confidence bounds into tests of conditional expectations, enabling tests of conditional moments and two-sample tests. Key findings include the establishment of theoretical guarantees for infinite-dimensional outputs and the introduction of bootstrapping schemes for practical application, enhancing the state of the art in confidence bounds for vector-valued least squares estimation.
该论文提出了一种通过将学习理论置信边界转化为条件期望检验的方法框架,特别使用了带亚高斯噪声的核岭回归,并将现有的置信边界推广到处理无限维输出的情况。该方法包括利用理论中测试阈值的参数形式来实现的自助方案,以使测试在实际场景中易于应用,如过程监控和动态系统比较。主要发现包括能够识别功能在何处以高概率不同,并通过核均值嵌入扩展了条件两样本检验。
Bayesian Optimization on Networks
Authors: Wenwen Li, Daniel Sanz-Alonso, Ruiyi Yang
First: 2025-10-31T17:12:49+00:00 · Latest: 2025-10-31T17:12:49+00:00
Comments: 36 pages, 6 figures; includes appendices
Abstract
This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network's geometry, we adopt Whittle-Matérn Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Matérn prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.
中文标题/摘要
标题:网络上的贝叶斯优化
本文研究了以度量图模型的网络上的优化问题。受目标函数昂贵或仅作为黑箱可用的应用驱动,我们开发了贝叶斯优化算法,这些算法通过逐步更新目标的高斯过程代理模型来指导查询点的获取。为了确保代理模型适应网络的几何结构,我们采用了通过度量图上的随机偏微分方程定义的Whittle-Matérn高斯过程先验模型。除了为足够光滑的目标函数优化建立了遗憾界,我们还分析了目标函数光滑性未知的情况,并使用有限元表示Whittle-Matérn先验。数值结果表明,我们的算法在合成度量图上优化基准目标函数以及在电信网络上通过最大后验估计进行贝叶斯反演的有效性。
Summary / 总结
This paper addresses optimization on networks by developing Bayesian optimization algorithms for metric graphs. The motivation arises from the need to optimize expensive or black-box functions. The method involves sequentially updating a Gaussian process surrogate model to guide query points. Key findings show that the algorithms are effective for optimizing benchmark functions on synthetic networks and for Bayesian inversion on a telecommunication network.
该论文通过开发使用高斯过程代理模型来指导查询点获取的贝叶斯优化算法,以解决网络上的优化问题。这些算法利用通过随机偏微分方程定义的Whittle-Matérn高斯过程先验模型来适应网络的几何结构。关键发现包括优化平滑目标函数的后悔界,以及该算法在合成度量图上优化基准函数和在电信网络上进行贝叶斯反演的有效性。
SpecAttn: Speculating Sparse Attention
Authors: Harsh Shah
Venue: NeurIPS 2025
First: 2025-10-31T17:12:34+00:00 · Latest: 2025-10-31T17:12:34+00:00
Comments: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
Abstract
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
中文标题/摘要
标题:SpecAttn: 预测稀疏注意
大型语言模型(LLMs)在推理过程中由于自我注意机制的二次复杂性而面临显著的计算瓶颈,尤其是在上下文长度增加时。我们引入了SpecAttn,这是一种无需训练的新颖方法,可以无缝集成到现有的推测性解码技术中,以在预训练的变压器中实现高效的稀疏注意。我们的核心见解是利用在推测性解码过程中由草稿模型计算出的注意权重来识别目标模型中的重要令牌,从而消除冗余计算并保持输出质量。SpecAttn 使用了三种核心技术:基于 KL 散度的草稿模型和目标模型之间的层对齐、一种基于草稿注意模式的 GPU 优化的无排序 top-p 令牌选择算法,以及由这些预测指导的动态键值缓存剪枝。通过利用标准推测性解码管道中已经完成的计算工作,SpecAttn 在 PG-19 数据集上实现了超过 75% 的键值缓存访问量减少,同时仅增加了 15.29% 的困惑度,显著优于现有的稀疏注意方法。我们的方法表明,推测执行可以增强以提供近似验证,而不会显著降低性能。
Summary / 总结
SpecAttn is a training-free method that integrates with speculative decoding to enable efficient sparse attention in pre-trained transformers. It uses KL divergence for layer alignment, a GPU-optimized algorithm for top-p token selection, and dynamic key-value cache pruning. SpecAttn reduces key-value cache accesses by over 75% with only a 15.29% increase in perplexity on the PG-19 dataset, outperforming existing sparse attention methods.
SpecAttn 是一种无需训练的方法,结合推测性解码以在预训练变压器中实现高效的稀疏注意。它使用 KL 散度进行层对齐,无排序算法进行 token 选择,并通过这些预测动态修剪 key-value 缓存。SpecAttn 在 PG-19 数据集上的 key-value 缓存访问量减少了超过 75%,同时 perplexity 增加了 15.29%,显著优于现有稀疏注意方法。
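The abstract mentions a sorting-free algorithm for top-p token selection but does not describe it. One plausible way to avoid a sort, sketched below as an assumption rather than as the paper's kernel, is to bisect on a weight threshold: each bisection step only needs a masked sum, which maps naturally onto a parallel reduction. The function name `top_p_mask_without_sort` and the iteration count are hypothetical.

```python
def top_p_mask_without_sort(weights, top_p, iters=30):
    """Select tokens by thresholding attention weights, without sorting.

    Bisects for the largest threshold tau such that the weights >= tau
    still carry at least top_p of the mass. Assumes weights sum to about 1
    and 0 < top_p < 1.
    """
    low, high = 0.0, max(weights)
    for _ in range(iters):
        mid = (low + high) / 2
        mass = sum(w for w in weights if w >= mid)  # a reduction, no sort needed
        if mass >= top_p:
            low = mid   # threshold can still be raised
        else:
            high = mid  # threshold dropped too much mass
    return [w >= low for w in weights]


# Hypothetical draft-model attention weights over four cached tokens.
draft_attention = [0.5, 0.3, 0.15, 0.05]
mask = top_p_mask_without_sort(draft_attention, top_p=0.75)
kept_positions = [i for i, keep in enumerate(mask) if keep]
print(kept_positions)
```

In a pipeline like the one described, the kept positions would determine which key-value cache entries the target model reads during verification.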
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Authors: Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter
First: 2025-10-29T13:27:18+00:00 · Latest: 2025-10-31T17:01:54+00:00
Comments: 30 pages, 18 figures, 13 tables
Abstract
Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
中文标题/摘要
标题:TempoPFN:基于线性RNN的合成预训练在零样本时间序列预测中的应用
用于零样本时间序列预测的基础模型面临着高效长时预测和可重复性方面的挑战,现有的仅合成数据方法在具有挑战性的基准测试中表现不佳。本文提出了一种基于线性递归神经网络(RNN)的单变量时间序列基础模型TempoPFN,该模型仅在合成数据上进行预训练。该模型采用门控差积架构并使用状态编织技术,实现了序列长度上的完全并行训练,消除了窗口化或总结技术的需求,同时保持了稳健的时间状态跟踪。我们全面的合成数据管道统一了包括随机微分方程、高斯过程和音频合成在内的多种生成器,并引入了新的增强技术。在Gift-Eval基准测试的零样本评估中,TempoPFN取得了顶级的竞争力表现,超越了所有现有的仅合成数据方法,并且在效率上超过了现有的基线方法,这得益于完全并行化的训练和推理。我们开源了完整的数据生成管道和训练代码,为未来的研究提供了可重复的基础。
Summary / 总结
TempoPFN is a univariate time series foundation model based on linear RNNs pre-trained on synthetic data, using a GatedDeltaProduct architecture for efficient training. On the Gift-Eval benchmark it outperforms existing synthetic-only approaches and most models trained on real-world data, while being more efficient due to fully parallelizable training and inference. The model maintains robust temporal state-tracking and is open-sourced with a comprehensive data generation pipeline.
TempoPFN 是基于线性 RNN 的单变量时间序列基础模型,仅在合成数据上进行预训练。它使用 GatedDeltaProduct 架构和状态编织,实现高效的训练并保持稳健的时间状态跟踪。在零样本评估中,TempoPFN 在 Gift-Eval 基准上优于现有合成数据方法和大多数基于真实数据的模型,同时由于完全并行化的训练和推理而更加高效。
Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Authors: Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang
First: 2025-10-31T16:50:49+00:00 · Latest: 2025-10-31T16:50:49+00:00
Abstract
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
中文标题/摘要
标题:基于对比触发学习的MLLM具身决策视觉后门攻击
多模态大型语言模型(MLLMs)通过直接感知、推理和规划任务导向动作,从视觉输入中提升了具身代理的能力。然而,这种以视觉驱动的具身代理打开了一个新的攻击面:视觉后门攻击,其中代理在场景中出现视觉触发之前表现正常,触发出现后则持续执行攻击者指定的多步策略。我们提出了BEAT,这是第一个使用环境中的物体作为触发器将此类视觉后门注入到基于MLLM的具身代理中的框架。与文本触发器不同,物体触发器在视角和光照方面表现出广泛的差异,使其难以可靠地植入。BEAT通过(1)构建跨越多样场景、任务和触发器位置的训练集,使代理暴露于触发器的差异性,以及(2)引入两阶段训练方案,首先应用监督微调(SFT),然后是我们的新颖对比触发学习(CTL)来应对这一挑战。CTL将触发器的区分问题表述为触发器存在和不存在输入之间的偏好学习,明确地锐化决策边界以确保精确的后门激活。在各种具身代理基准测试和MLLMs中,BEAT实现了高达80%的攻击成功率,同时保持了强大的良性任务性能,并可靠地泛化到分布外的触发器位置。值得注意的是,与朴素的SFT相比,在有限的后门数据下,CTL将后门激活的准确性提高了高达39%。这些发现揭示了基于MLLM的具身代理中一个关键但未被探索的安全风险,强调了在实际部署前需要强大的防御措施。
Summary / 总结
The paper introduces BEAT, a framework for injecting visual backdoors into embodied agents built on multimodal large language models (MLLMs), using objects as triggers. BEAT addresses the challenge of trigger variability by constructing a diverse training set and employing a two-stage training scheme: supervised fine-tuning followed by a novel Contrastive Trigger Learning method. The study demonstrates attack success rates up to 80% while maintaining strong benign task performance and reliable generalization to out-of-distribution trigger placements. Under limited backdoor data, Contrastive Trigger Learning improves backdoor activation accuracy by up to 39% compared to supervised fine-tuning alone.
该研究提出了BEAT框架,用于在基于MLLM的体感代理中注入视觉后门攻击,使用物体作为触发器。BEAT通过构建多样化的训练集和采用两阶段训练方案(包括监督微调和对比触发学习)来应对触发器的变异性挑战。该方法在各种体感代理基准和MLLM上实现了高达80%的攻击成功率,同时保持了强大的正常任务性能,并可靠地泛化到未见过的触发器位置。对比触发学习相比简单的监督微调,显著提高了后门激活的准确性,最多可提升39%。
SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer
Authors: Xiang Gao, Yuqi Zhang
Venue: Pattern Recognition, 2025, 162: 111344
First: 2024-04-24T09:02:24+00:00 · Latest: 2025-10-31T16:44:36+00:00
Comments: Pattern Recognition, Volume 162, June 2025, 111344
Abstract
Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (i) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (ii) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
中文标题/摘要
标题:SRAGAN: 增强注意力和显著性正则化的生成对抗网络用于中国水墨画风格迁移
近年来,风格迁移问题仍然主要由从跨域图像到图像(I2I)翻译视角的生成对抗网络(GAN)主导,其中的关键问题是学习并转移目标域的风格模式到源域的内容图像上。本文处理了将真实图片转化为传统中国水墨画的问题,即中国水墨画风格迁移。尽管有许多I2I模型解决了这个问题,但一个显著的挑战是源图像的内容细节可能会因水墨风格元素的转移而被轻易擦除或破坏。为了解决这一问题,我们提出将显著性检测融入未配对的I2I框架中以正则化图像内容,其中检测到的显著性图从两个方面利用:(i) 我们提出了显著性IOU(SIOU)损失,通过在图像风格化前后强制显著性一致性来显式正则化对象内容结构;(ii) 我们提出了显著性自适应归一化(SANorm),通过动态注入图像显著性信息到生成器中以引导风格化过程,隐式增强生成绘画中的对象结构完整性。此外,我们还提出了显著性注意判别器,利用图像显著性信息将生成对抗注意力集中在绘制的对象上,有助于生成更生动和细腻的笔触和水墨纹理。广泛的定性和定量实验表明,我们的方法在GAN和扩散模型范式中均优于相关先进的图像风格化方法。
Summary / 总结
This paper addresses the challenge of translating real pictures into traditional Chinese ink-wash paintings by proposing SRAGAN, which incorporates saliency detection into an unpaired image-to-image framework. The method uses saliency IOU loss to enforce saliency consistency and saliency adaptive normalization to enhance object structure integrity. Additionally, a saliency attended discriminator is introduced to focus on the drawn objects, improving the vividness and delicacy of the generated brush strokes and ink-wash textures. Experimental results show that SRAGAN outperforms existing methods in both GAN and diffusion model paradigms.
本文提出SRAGAN,通过将显著性检测融入无配对的图像到图像框架中,解决将真实图片转换为传统中国水墨画的问题。该方法使用显著性IOU损失来确保显著性一致性,并使用显著性自适应归一化来增强生成绘画中的对象结构完整性。此外,引入了显著性注意鉴别器,使其能够专注于绘制的对象,从而提高生成笔触和水墨纹理的生动性和细腻度。实验结果表明,SRAGAN在GAN和扩散模型范式中均优于现有方法。
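As an illustration of what a saliency IOU style loss might look like, the sketch below computes one minus a soft intersection-over-union between the saliency map of the source image and that of the stylized output. The exact SIOU formulation is not given in the abstract, so the formula, the function name `saliency_iou_loss`, and the `eps` term are assumptions.

```python
import numpy as np

def saliency_iou_loss(sal_src: np.ndarray, sal_gen: np.ndarray, eps: float = 1e-6) -> float:
    """1 - soft IoU between two saliency maps with values in [0, 1] (assumed form)."""
    intersection = np.sum(sal_src * sal_gen)
    union = np.sum(sal_src + sal_gen - sal_src * sal_gen)
    return float(1.0 - intersection / (union + eps))

same = np.array([[1.0, 0.0], [1.0, 0.0]])
shifted = np.array([[0.0, 1.0], [0.0, 1.0]])
print(saliency_iou_loss(same, same))      # close to 0: saliency preserved
print(saliency_iou_loss(same, shifted))   # 1.0: no overlap
```

A low value means the salient object occupies the same region before and after stylization, which is the consistency the abstract says the loss enforces.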
VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Authors: Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan
First: 2025-10-31T16:40:58+00:00 · Latest: 2025-10-31T16:40:58+00:00
Abstract
Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15-30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
中文标题/摘要
标题:VeriMoA:一种基于代理混合的从规格到HDL生成框架
自动化寄存器传输级(RTL)设计可以帮助开发人员满足日益增长的计算需求。大型语言模型(LLMs)在硬件描述语言(HDL)生成方面显示出潜力,但由于参数知识有限和领域特定约束,面临挑战。尽管提示工程和微调在知识覆盖和训练成本方面存在局限性,多代理架构提供了一种无需训练的范式,通过协作生成增强推理。然而,当前的多代理方法存在两个关键缺陷:易受噪声传播的影响和受限的推理空间探索。我们提出VeriMoA,一种无需训练的基于代理混合(MoA)框架,包含两个协同创新。首先,一种质量导向的缓存机制,用于维护所有中间HDL输出,并在整个生成过程中实现基于质量的排名和选择,促进多层推理中的知识积累。其次,一种多路径生成策略,利用C++和Python作为中间表示,将规格到HDL的转换分解为两个阶段的过程,利用LLMs在高资源语言中的流畅性,同时促进解决方案多样性。在VerilogEval 2.0和RTLLM 2.0基准测试上的全面实验表明,VeriMoA在不同LLM基础模型上实现了15-30%的Pass@1改进,特别是使较小的模型能够匹配较大的模型和微调替代方案,而无需昂贵的训练。
Summary / 总结
VeriMoA is a training-free mixture-of-agents framework designed to enhance the automation of Register Transfer Level (RTL) design through collaborative generation. It introduces a quality-guided caching mechanism and a multi-path generation strategy using C++ and Python as intermediate representations. Experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks show that VeriMoA improves Pass@1 by 15-30% across various LLM backbones, particularly enabling smaller models to match larger models and fine-tuned alternatives without additional training costs.
VeriMoA 是一个无需训练的混合代理框架,旨在通过协作生成增强 RTL 设计的自动化。它引入了质量导向的缓存机制和多路径生成策略,以提高推理能力和解决方案的多样性。实验表明,VeriMoA 在各种 LLM 后端上实现了 15-30% 的 Pass@1 改进,使较小的模型能够匹配较大的模型和微调的替代方案,而无需额外的训练成本。
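The quality-guided caching idea can be sketched as a store that keeps every intermediate candidate and ranks across all layers rather than only the latest one. This is a hypothetical illustration: the class `QualityCache`, its methods, and the numeric scores are assumptions, and how VeriMoA actually scores HDL outputs is not stated in the abstract.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class QualityCache:
    """Keeps every intermediate candidate and returns the best ones on request."""
    entries: list = field(default_factory=list)   # (score, layer, code) tuples

    def add(self, layer: int, code: str, score: float) -> None:
        # Nothing is evicted: earlier layers stay available for later selection.
        self.entries.append((score, layer, code))

    def top_k(self, k: int) -> list:
        # Rank over the whole generation history, highest score first.
        return [code for _, _, code in heapq.nlargest(k, self.entries)]

cache = QualityCache()
cache.add(layer=0, code="module a; endmodule", score=0.9)
cache.add(layer=1, code="module b; endmodule", score=0.2)
cache.add(layer=1, code="module c; endmodule", score=0.6)
print(cache.top_k(2))
```

Here a strong layer-0 candidate still outranks weaker layer-1 outputs, which is one way to limit the noise propagation the abstract mentions.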
ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling
Authors: Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding
First: 2025-10-31T16:35:52+00:00 · Latest: 2025-10-31T16:35:52+00:00
Abstract
Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
中文标题/摘要
标题:ORGEval:优化建模中大型语言模型的图论评估
为工业应用制定优化问题需要大量的手动努力和领域专业知识。虽然大型语言模型(LLMs)在自动化这一过程方面显示出潜力,但由于缺乏稳健的评估指标,评估其性能仍然困难。现有的基于求解器的方法往往面临不一致性、不可行性问题和高计算成本。为了解决这些问题,我们提出了一种图论评估框架ORGEval,用于评估LLMs在制定线性和混合整数线性规划方面的能力。ORGEval将优化模型表示为图,将等价性检测减少为图同构测试。我们确定并证明了一个充分条件,当测试图是对称可分解(SD)时,Weisfeiler-Lehman(WL)测试可以保证正确检测同构。在此基础上,ORGEval结合了一个定制的WL测试变体与SD检测算法来评估模型等价性。通过关注结构等价性而不是实例级别的配置,ORGEval对数值变化具有鲁棒性。实验结果表明,我们的方法可以成功检测模型等价性,并在随机参数配置下产生100%一致的结果,同时在运行时间上显著优于基于求解器的方法,尤其是在困难问题上。利用ORGEval,我们构建了Bench4Opt数据集,并在优化建模方面对最先进的LLMs进行了基准测试。我们的结果表明,尽管所有LLMs在优化建模方面仍然具有挑战性,但DeepSeek-V3和Claude-Opus-4在直接提示下实现了最高的准确性,甚至超过了领先的推理模型。
Summary / 总结
ORGEval is a graph-theoretic evaluation framework designed to assess LLMs' capabilities in formulating linear and mixed-integer linear programs. It represents optimization models as graphs and uses a tailored Weisfeiler-Lehman test to detect model equivalence, which is robust to numerical variations. Experimental results show that ORGEval outperforms solver-based methods in runtime and can consistently detect model equivalence across different parameter configurations. Bench4Opt, a dataset constructed using ORGEval, benchmarks state-of-the-art LLMs, revealing that DeepSeek-V3 and Claude-Opus-4 perform best under direct prompting for optimization modeling.
ORGEval 是一个图理论评估框架,用于评估 LLMs 在构建线性和混合整数线性规划模型方面的能力。它将优化模型表示为图,并使用 Weisfeiler-Lehman (WL) 测试来检测模型等价性,这使其对数值变化具有鲁棒性。实验结果表明,ORGEval 在运行时间上优于基于求解器的方法,并且能够实现 100% 的一致结果,其中 DeepSeek-V3 和 Claude-Opus-4 在直接提示下表现出最高的准确性。
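For readers unfamiliar with the Weisfeiler-Lehman test, the following is a minimal sketch of standard 1-WL colour refinement on a variable/constraint bipartite graph. It is not ORGEval's tailored variant: coefficients are ignored, the graph encoding and the name `wl_histogram` are assumptions, and in general matching histograms are only a necessary condition for isomorphism (the abstract states that a symmetric-decomposability condition makes the test conclusive).

```python
from collections import Counter

def wl_histogram(adj: dict, labels: dict, rounds: int = 3) -> Counter:
    """Colour histogram after `rounds` of 1-WL refinement.

    adj maps each node to its neighbours; labels maps each node to an initial colour.
    Colours are kept as nested tuples, which is only practical for small graphs.
    """
    colors = dict(labels)
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

# Two formulations that differ only in variable/constraint names.
g1 = {"x": ["c1"], "y": ["c1", "c2"], "c1": ["x", "y"], "c2": ["y"]}
l1 = {"x": "var", "y": "var", "c1": "con", "c2": "con"}
g2 = {"a": ["r1", "r2"], "b": ["r2"], "r1": ["a"], "r2": ["a", "b"]}
l2 = {"a": "var", "b": "var", "r1": "con", "r2": "con"}
print(wl_histogram(g1, l1) == wl_histogram(g2, l2))
```

Because the comparison depends only on structure and node types, renaming variables or constraints leaves the histogram unchanged.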
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Authors: John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
First: 2025-10-31T16:32:12+00:00 · Latest: 2025-10-31T16:32:12+00:00
Comments: 20 pages, 10 figures
Abstract
Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
中文标题/摘要
标题:双流扩散模型用于增强视觉-语言-动作模型的世界建模
最近,通过世界建模增强视觉-语言-动作模型(VLAs)在提高机器人策略学习方面显示出前景。然而,由于两种模态之间的固有差异,同时预测下一状态观察和动作序列仍然具有挑战性。为了解决这个问题,我们提出了双流扩散(DUST),一种世界建模增强的VLA框架,处理模态冲突并提高VLAs在各种任务中的性能。具体来说,我们提出了一种多模态扩散变换器架构,该架构在保持各自模态流的同时,仍然能够实现跨模态知识共享。此外,我们为每个模态引入了独立的噪声扰动和解耦的流匹配损失。这种设计使模型能够以双向方式学习联合分布,而无需统一的潜在空间。基于训练期间模态的解耦,我们还引入了一种联合采样方法,支持测试时缩放,在此方法中,动作和视觉标记以不同的速率异步演化。通过在RoboCasa和GR-1等模拟基准上的实验,DUST在基线方法上实现了高达6%的性能提升,而我们的测试时缩放方法提供了额外2-5%的提升。在使用Franka Research 3的现实任务中,DUST将成功率提高了13%,证实了其在模拟之外的有效性。此外,通过在BridgeV2的无动作视频上进行预训练,DUST在RoboCasa上获得了显著的迁移增益,突显了DUST在大规模VLAs预训练方面的潜力。
Summary / 总结
The research aims to improve Vision-Language-Action models (VLAs) by integrating world modeling to better predict next-state observations and action sequences. The proposed DUal-STream diffusion (DUST) framework uses a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. DUST introduces independent noise perturbations and a decoupled flow-matching loss, which enhances performance across various tasks. Experiments on simulated benchmarks show up to 6% gains over baseline methods, and on real-world tasks, DUST improves success rates by 13%. Pre-training on action-free videos also yields significant transfer gains on RoboCasa, highlighting DUST's potential for large-scale VLA pretraining.
研究旨在通过集成世界建模来改进Vision-Language-Action模型(VLAs),以更好地预测下一状态观察和动作序列。提出的DUal-STream扩散(DUST)框架使用了多模态扩散变换器,保持了模态流的分离,同时仍能实现跨模态知识共享。DUST引入了独立的噪声扰动和解耦的流匹配损失,这提升了各种任务上的性能。在模拟基准上的实验显示,DUST相对于基线方法可获得高达6%的性能提升,而在真实世界任务中,DUST将成功率提高了13%。此外,基于BridgeV2的无动作视频预训练也显著提升了RoboCasa上的迁移性能,突显了DUST在大规模VLAs预训练中的潜力。
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
First: 2025-10-31T16:30:08+00:00 · Latest: 2025-10-31T16:30:08+00:00
Comments: preprint
Abstract
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
中文标题/摘要
标题:Spatial-SSRL:通过自我监督强化学习提升空间理解
空间理解仍然是大型视觉-语言模型(LVLMs)的弱点。现有的监督微调(SFT)和最近的可验证奖励强化学习(RLVR)管道依赖于昂贵的监督、专门的工具或受限的环境,这限制了规模。我们引入了Spatial-SSRL,这是一种自我监督的RL范式,可以从普通的RGB或RGB-D图像中直接推导出可验证的信号。Spatial-SSRL 自动制定了五个预训练任务,捕捉2D和3D空间结构:打乱的块重排序、翻转的块识别、裁剪的块填充、区域深度排序和相对3D位置预测。这些任务提供了易于验证的正确答案,不需要人类或LVLM的标注。在我们的任务上进行训练显著提高了空间推理能力,同时保留了通用的视觉能力。在七个空间理解基准测试中,无论是图像还是视频设置,Spatial-SSRL 在Qwen2.5-VL 基线上的平均准确率分别提高了4.63%(3B)和3.89%(7B)。我们的结果表明,简单的内在监督使RLVR能够大规模实现,并为在LVLMs中实现更强的空间智能提供了实际途径。
Summary / 总结
Spatial-SSRL is a self-supervised reinforcement learning approach that enhances spatial understanding in large vision-language models (LVLMs) without costly supervision. It automatically formulates five pretext tasks from ordinary RGB or RGB-D images, each with a ground-truth answer that is easy to verify. Training on these tasks improves spatial reasoning while preserving general visual capabilities, achieving average accuracy gains of 4.63% (3B) and 3.89% (7B) on seven spatial understanding benchmarks compared to the Qwen2.5-VL baselines.
Spatial-SSRL 是一种自我监督的强化学习方法,旨在增强大型视觉-语言模型(LVLM)的空间理解能力,无需昂贵的监督。它引入了五个从普通 RGB 或 RGB-D 图像中提取的预训练任务,这些任务提供了易于验证的正确答案。在七个空间理解基准测试中,Spatial-SSRL 分别将 3B 和 7B 参数模型的准确性提高了 4.63% 和 3.89%,超过了 Qwen2.5-VL 基线模型。
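One of the five pretext tasks, shuffled patch reordering, can be sketched to show why its answer is verifiable without annotation: the permutation used to shuffle the patches is itself the ground truth. The helper names, the 2x2 grid, and the exact-match reward are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def make_patch_reorder_task(image: np.ndarray, grid: int = 2, seed: int = 0):
    """Split an image into grid x grid patches, shuffle them, and return the answer."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
               for r in range(grid) for c in range(grid)]
    perm = rng.permutation(len(patches))
    shuffled = [patches[i] for i in perm]      # shuffled[k] is original patch perm[k]
    answer = np.argsort(perm)                  # answer[j]: where original patch j now sits
    return shuffled, answer

def reward(predicted, answer) -> float:
    """Binary verifiable reward: 1.0 only if the full ordering is recovered."""
    return float(np.array_equal(np.asarray(predicted), answer))

image = np.arange(16).reshape(4, 4)
shuffled, answer = make_patch_reorder_task(image)
print(reward(answer, answer))
```

The same pattern (apply a known transformation, ask the model to recover it) would extend to the flipped-patch and cropped-patch tasks listed in the abstract.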
I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
Authors: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Venue: NeurIPS 2025
First: 2025-10-20T12:51:13+00:00 · Latest: 2025-10-31T16:27:50+00:00
Comments: Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
Abstract
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
中文标题/摘要
标题:I-RAVEN-X:大型语言和推理模型在类比和数学推理方面泛化能力和鲁棒性的基准测试
我们引入了I-RAVEN-X,这是一种符号基准测试,旨在评估大型语言模型(LLMs)和大型推理模型(LRMs)在类比和数学推理方面的泛化能力和鲁棒性。I-RAVEN-X 在 I-RAVEN 的基础上增加了操作符复杂性、属性范围,并引入了感知不确定性。与 LLMs 相比,实验证明 LRMs 在较长的推理关系和更宽的属性范围内表现出更好的生产力和系统性。然而,LRMs 在不确定性推理方面仍然面临重大挑战,无法有效探索多种概率结果。
Summary / 总结
I-RAVEN-X is a benchmark to assess the generalization and robustness of analogical and mathematical reasoning in LLMs and LRMs. It extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. LRMs show better performance in productivity and systematicity for longer reasoning relations and wider attribute ranges compared to LLMs, but they struggle with reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
研究引入了I-RAVEN-X,这是一个用于评估大型语言模型(LLMs)和大型推理模型(LRMs)在类比和数学推理方面泛化能力和鲁棒性的基准。通过增加操作数的复杂性、属性范围,并引入感知不确定性,I-RAVEN-X 扩展了 I-RAVEN。实验结果表明,LRMs 在较长的推理关系和更宽的属性范围内表现出更高的生产力和系统性,但在不确定性推理方面仍面临重大挑战,无法有效探索多种概率结果。
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
Authors: Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu
First: 2025-10-31T16:22:23+00:00 · Latest: 2025-10-31T16:22:23+00:00
Abstract
AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark's difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.
中文标题/摘要
标题:InnovatorBench:评估代理进行创新性LLM研究的能力
AI代理可以通过自动化假设形成、实验设计、编码、执行和分析来加速科学发现,但现有基准测试仅在简化设置中测试狭窄的技能。为解决这一差距,我们引入了InnovatorBench,这是一个基准测试-平台对,用于评估代理进行大型语言模型(LLM)研究的端到端现实评估。它包括20项任务,涵盖数据构建、过滤、增强、损失设计、奖励设计和支架构建,这些任务需要可运行的成果并评估正确性、性能、输出质量和不确定性。为了支持代理操作,我们开发了ResearchGym,这是一个研究环境,提供了丰富的动作空间、分布式和长周期执行、异步监控和快照保存。我们还实现了一个轻量级的ReAct代理,该代理结合了显式推理和可执行规划,使用前沿模型如Claude-4、GPT-5、GLM-4.5和Kimi-K2。我们的实验表明,尽管前沿模型在代码驱动的研究任务中表现出潜力,但在脆弱的算法相关任务和长期决策制定(如不耐烦、资源管理不佳和过度依赖模板推理)方面存在困难。此外,代理在InnovatorBench上达到最佳性能需要超过11小时,这突显了基准测试的难度,并表明InnovatorBench有可能成为下一代代码基础研究基准。
Summary / 总结
InnovatorBench evaluates AI agents' ability to conduct innovative LLM research by introducing a benchmark-platform pair for realistic, end-to-end assessment. It includes 20 tasks covering various aspects of LLM research, and supports agent operation through ResearchGym. Experiments show that frontier models perform well in code-driven tasks but struggle with long-horizon decision-making and resource management. Agents take over 11 hours to achieve their best performance, highlighting the benchmark's difficulty and potential as a next-generation research benchmark for LLMs.
InnovatorBench旨在通过自动化各种科学任务来评估AI代理进行创新大型语言模型(LLM)研究的能力。方法是创建一个基准平台对,ResearchGym,为代理提供一个丰富的环境来执行数据构建、过滤和支架构建等任务。关键实验发现表明,前沿模型在代码驱动的任务中表现出色,但在脆弱的算法相关任务和长期决策制定方面面临挑战。代理需要超过11小时才能达到最佳性能,突显了基准的难度及其作为下一代基于代码的研究基准的潜力。
Augmented Reality-based Guidance with Deformable Registration in Head and Neck Tumor Resection
Authors: Qingyun Yang, Fangjie Li, Jiayi Xu, Zixuan Liu, Sindhura Sridhar, Whitney Jin, Jennifer Du, Jon Heiselman, Michael Miga, Michael Topf, Jie Ying Wu
Venue: MICCAI 2025
First: 2025-03-11T18:32:14+00:00 · Latest: 2025-10-31T16:19:07+00:00
Comments: Accepted at MICCAI 2025
Abstract
Head and neck squamous cell carcinoma (HNSCC) has one of the highest recurrence rates among solid malignancies. Recurrence rates can be reduced by improving the localization of positive margins. Frozen section analysis (FSA) of resected specimens is the gold standard for intraoperative margin assessment. However, because of the complex 3D anatomy and the significant shrinkage of resected specimens, accurate margin relocation from specimen back onto the resection site based on FSA results remains challenging. We propose a novel deformable registration framework that uses both the pre-resection upper surface and the post-resection site of the specimen to incorporate thickness information into the registration process. The proposed method significantly improves target registration error (TRE), demonstrating enhanced adaptability to thicker specimens. In tongue specimens, the proposed framework improved TRE by up to 33% as compared to prior deformable registration. Notably, tongue specimens exhibit complex 3D anatomies and hold the highest clinical significance compared to other head and neck specimens from the buccal and skin. We analyzed distinct deformation behaviors in different specimens, highlighting the need for tailored deformation strategies. To further aid intraoperative visualization, we also integrated this framework with an augmented reality-based auto-alignment system. The combined system can accurately and automatically overlay the deformed 3D specimen mesh with positive margin annotation onto the resection site. In a pilot study of the AR-guided framework involving two surgeons, the integrated system improved the surgeons' average target relocation error from 9.8 cm to 4.8 cm.
中文标题/摘要
标题:基于增强现实的可变形配准头颈部肿瘤切除引导
头颈部鳞状细胞癌(HNSCC)是实体恶性肿瘤中复发率最高的肿瘤之一。通过提高阳性边缘定位可以降低复发率。冷冻切片分析(FSA)是切除标本的术中边缘评估的金标准。然而,由于复杂的三维解剖结构和切除标本的显著收缩,基于FSA结果将切除边缘准确地重新定位到切除部位仍然具有挑战性。我们提出了一种新的可变形配准框架,该框架结合了切除前的上表面和切除后的标本位置,将厚度信息纳入配准过程。所提出的方法显著提高了目标配准误差(TRE),显示出对较厚标本的增强适应性。在舌标本中,与先前的可变形配准相比,所提出的方法将TRE提高了高达33%。值得注意的是,舌标本具有复杂的三维解剖结构,并且与其他口腔和皮肤的头颈部标本相比具有最高的临床意义。我们分析了不同标本的变形行为,突显了需要定制的变形策略。为了进一步辅助术中可视化,我们还将该框架与基于增强现实的自动对齐系统集成。结合系统可以准确且自动地将变形的3D标本网格与阳性边缘注释叠加到切除部位。在涉及两名外科医生的试点研究中,集成系统将外科医生的平均目标重新定位误差从9.8 cm降低到4.8 cm。
Summary / 总结
The study aims to improve the accuracy of margin localization in head and neck tumor resection by addressing the challenges of specimen shrinkage and complex 3D anatomy. A novel deformable registration framework is proposed, which uses both pre- and post-resection specimen surfaces to incorporate thickness information, significantly reducing target registration error (TRE) by up to 33% in tongue specimens compared to previous methods. The integrated augmented reality-based system further enhances intraoperative visualization, reducing the surgeons' average target relocation error from 9.8 cm to 4.8 cm.
研究旨在通过解决复杂3D解剖结构和切除标本显著收缩带来的挑战,提高头颈部肿瘤切除手术中边缘定位的准确性。提出了一种新的变形配准框架,结合了切除前后的标本表面信息,以厚度信息为基础,舌标本的靶向注册误差相比之前的方法提高了33%。该系统还与基于增强现实的自动对齐工具集成,使外科医生在试点研究中的平均目标重新定位误差从9.8厘米降低到4.8厘米。
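Target registration error is conventionally the distance between corresponding landmarks after registration. The sketch below computes the mean Euclidean TRE over a set of 3D points; the function name and the toy coordinates are assumptions, and the paper's exact landmark protocol is not described in the abstract.

```python
import numpy as np

def target_registration_error(registered: np.ndarray, reference: np.ndarray) -> float:
    """Mean Euclidean distance between matched landmarks, in the input units."""
    return float(np.mean(np.linalg.norm(registered - reference, axis=1)))

reference = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
registered = reference + np.array([3.0, 4.0, 0.0])   # every landmark off by 5 units
print(target_registration_error(registered, reference))
```

A lower value is better, so "improving TRE" in the abstract means reducing this distance.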
Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
Authors: Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly
First: 2025-10-31T16:08:46+00:00 · Latest: 2025-10-31T16:08:46+00:00
Abstract
Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well; for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
中文标题/摘要
标题:基于视图对齐的哈希编码在基础模型时代下的图像哈希
大规模高效检索需要既紧凑又具有区分性的表示。基础模型提供了强大的视觉和多模态嵌入,但在这些高维空间中进行最近邻搜索计算成本高昂。哈希提供了一种有效的替代方案,通过二进制码实现快速汉明距离搜索,但现有方法通常依赖于复杂的流水线、多目标优化、专为单一学习范式设计,并且训练时间较长。我们提出了CroVCA(视图对齐编码),这是一种简单且统一的原则,用于学习在语义对齐视图中保持一致性的二进制码。单一的二元交叉熵损失强制执行对齐,而编码率最大化作为抗坍塌正则化器促进平衡和多样化的码。为了实现这一点,我们设计了HashCoder,这是一种轻量级的MLP哈希网络,带有最终的批量归一化层以强制执行平衡的码。HashCoder 可以作为冻结嵌入的探针头使用,或者通过LoRA微调高效地适应编码器。在基准测试中,CroVCA 在仅5个训练周期内达到了最先进的结果。在16位精度下,它特别高效,例如,COCO上的无监督哈希在单个GPU上完成时间不到2分钟,ImageNet100上的监督哈希大约需要3分钟。这些结果突显了CroVCA的高效性、适应性和广泛适用性。
Summary / 总结
The paper addresses the challenge of efficient large-scale image retrieval by proposing CroVCA (Cross-View Code Alignment), which uses a simple binary cross-entropy loss to align codes across semantically aligned views, combined with coding-rate maximization to ensure balanced and diverse codes. The method employs a lightweight HashCoder network and can be used as a probing head or fine-tuned via LoRA. Experiments show that CroVCA achieves state-of-the-art results in just 5 training epochs, with unsupervised hashing on COCO taking under 2 minutes and supervised hashing on ImageNet100 taking about 3 minutes on a single GPU.
论文提出了CroVCA(跨视图代码对齐)方法,用于在语义对齐视图间学习一致的二进制代码。该方法使用单一的二元交叉熵损失和编码率最大化来确保对齐和多样性。实验结果显示,CroVCA仅需5个训练周期即可达到最先进的性能,无监督哈希在COCO上的运行时间不到2分钟,有监督哈希在ImageNet100上的运行时间约为3分钟,单张GPU上展示了其高效性和适应性。
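The abstract motivates hashing by the speed of Hamming distance search over binary codes. A minimal sketch of that retrieval step follows, assuming codes are packed into Python integers; the helper names (`pack_bits`, `hamming`, `nearest`) and the example codes are hypothetical and are not taken from CroVCA or HashCoder.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python int."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Count the bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def nearest(query, database, k=1):
    """Return the indices of the k database codes closest to the query."""
    ranked = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    return ranked[:k]


# Hypothetical 16-bit codes, chosen only for illustration.
db = [pack_bits([0] * 16), pack_bits([1] * 16), pack_bits([1] * 4 + [0] * 12)]
q = pack_bits([1] * 3 + [0] * 13)
print(nearest(q, db, k=2))
```

The XOR-then-popcount pattern is what makes binary codes cheap to compare; a production system would typically use vectorized or hardware popcount rather than string counting.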
C3Editor: Achieving Controllable Consistency in 2D Model for 3D Editing
Authors: Zeng Tao, Zheng Ding, Zeyuan Chen, Xiang Zhang, Leizhi Li, Zhuowen Tu
Venue: ICCV 2025
First: 2025-10-06T07:07:14+00:00 · Latest: 2025-10-31T16:06:19+00:00
Comments: ICCV 2025 Workshop Wild3D
Abstract
Existing 2D-lifting-based 3D editing methods often encounter challenges related to inconsistency, stemming from the lack of view-consistent 2D editing models and the difficulty of ensuring consistent editing across multiple views. To address these issues, we propose C3Editor, a controllable and consistent 2D-lifting-based 3D editing framework. Given an original 3D representation and a text-based editing prompt, our method selectively establishes a view-consistent 2D editing model to achieve superior 3D editing results. The process begins with the controlled selection of a ground truth (GT) view and its corresponding edited image as the optimization target, allowing for user-defined manual edits. Next, we fine-tune the 2D editing model within the GT view and across multiple views to align with the GT-edited image while ensuring multi-view consistency. To meet the distinct requirements of GT view fitting and multi-view consistency, we introduce separate LoRA modules for targeted fine-tuning. Our approach delivers more consistent and controllable 2D and 3D editing results than existing 2D-lifting-based methods, outperforming them in both qualitative and quantitative evaluations.
中文标题/摘要
标题:C3Editor:实现2D模型在3D编辑中的可控一致性
现有的基于2D提升的3D编辑方法常常遇到不一致的问题,这源于缺乏视图一致的2D编辑模型以及确保多视图之间一致编辑的难度。为了解决这些问题,我们提出了C3Editor,一种可控且一致的基于2D提升的3D编辑框架。给定原始的3D表示和基于文本的编辑提示,我们的方法选择性地建立一个视图一致的2D编辑模型,以实现更优的3D编辑结果。过程始于受控选择一个地面真实(GT)视图及其对应的编辑图像作为优化目标,允许用户定义的手动编辑。接下来,我们对GT视图内的2D编辑模型以及多个视图进行微调,使其与GT编辑图像对齐,同时确保多视图一致性。为了满足GT视图拟合和多视图一致性各自的特殊需求,我们引入了专门的LoRA模块进行目标微调。我们的方法在2D和3D编辑结果的一致性和可控性方面优于现有的基于2D提升的方法,在定性和定量评估中均表现出色。
Summary / 总结
C3Editor is a framework designed to address inconsistency issues in 2D-lifting-based 3D editing. It achieves this by selectively establishing a view-consistent 2D editing model and fine-tuning the model to align with a ground truth-edited image while ensuring multi-view consistency. The method introduces separate LoRA modules for targeted fine-tuning to meet the distinct requirements of GT view fitting and multi-view consistency, resulting in more consistent and controllable 2D and 3D editing results compared to existing methods.
C3Editor 是一种框架,旨在通过利用视图一致的 2D 编辑模型实现可控且一致的 3D 编辑。给定一个 3D 表现形式和一个文本编辑提示,C3Editor 首先选择一个地面真实视图及其编辑后的图像作为优化目标。然后,它在该视图及其多个视图中对 2D 编辑模型进行微调,以确保与地面真实编辑后的图像一致,使用单独的 LoRA 模块进行目标微调。结果显示,C3Editor 在定性和定量评估中均优于现有方法,提供了更一致和可控的 3D 编辑结果。
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Authors: Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das
First: 2025-04-14T21:30:43+00:00 · Latest: 2025-10-31T16:06:03+00:00
Abstract
Early-Exit Large Language Models (EE-LLMs) enable high-throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model, and therefore their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limits batch-size scaling. We propose HELIOS, a framework that improves both token generation latency and batch sizes to enable high throughput in EE-LLMs. HELIOS exploits two insights. First, early exits are often complementary across models: tokens that do not exit early on one model often take an early exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early and minimize token generation latencies. Second, even when a predicted token does not exit early due to poor confidence, it often remains unchanged after additional layer traversal. HELIOS greedily allows such tokens to exit early and loads only the weights of the layers most likely to be used, yielding memory savings that are then repurposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves 1.48x higher throughput and 15.14x larger batch size compared to existing EE-LLM frameworks.
中文标题/摘要
标题:HELIOS:自适应模型和早期退出选择以实现高效的LLM推理服务
早期退出大型语言模型(EE-LLMs)通过允许令牌在中间层提前退出来实现高吞吐量推理。然而,它们的吞吐量受限于计算和内存节省。现有的EE-LLM框架依赖单一模型,因此其令牌生成延迟受到不提前退出的令牌的限制,这些令牌需要遍历额外的层。此外,早期退出仅在运行时才已知,并且取决于请求。因此,这些框架即使在令牌提前退出时也会加载所有模型层的权重,这限制了我们扩展批量大小的能力。我们提出了HELIOS框架,以同时提高令牌生成延迟和批量大小,从而在EE-LLMs中实现高吞吐量。HELIOS利用了两个洞察。首先,早期退出在不同模型之间往往是互补的,一个模型中不提前退出的令牌往往在另一个模型中提前退出。HELIOS使用多个模型并动态切换它们,以集体最大化提前退出的令牌数量,最小化令牌生成延迟。其次,即使预测的令牌由于置信度低而不提前退出,它在额外层遍历后通常也不会改变。HELIOS贪婪地允许这些令牌提前退出,并仅加载最有可能使用的层的权重,从而节省内存,然后重新利用这些节省的内存来增加批量大小。HELIOS利用实时分析准确地识别早期退出分布,并通过实时跟踪令牌来适应性地切换模型,以最小化贪婪模型加载和退出引起的性能下降。我们的评估表明,与现有的EE-LLM框架相比,HELIOS实现了1.48倍更高的吞吐量和15.14倍更大的批量大小。
Summary / 总结
HELIOS is a framework designed to improve the throughput and batch size of Early-Exit Large Language Models (EE-LLMs) by dynamically switching between multiple models and loading only the most likely to be used layers. This approach maximizes the number of tokens that exit early and minimizes token generation latencies. HELIOS also allows tokens with low confidence to exit early, further reducing memory usage and increasing batch sizes. Evaluations show that HELIOS achieves 1.48 times higher throughput and 15.14 times larger batch size compared to existing EE-LLM frameworks.
HELIOS 是一个框架,通过动态切换多个模型并在每次仅加载最有可能使用的层来提高早期退出大语言模型(EE-LLM)的效率。这种方法可以同时提高标记生成延迟和批量大小,相比现有 EE-LLM 框架,实现了 1.48 倍更高的吞吐量和 15.14 倍更大的批量大小。
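The early-exit idea underlying this entry can be sketched as a confidence check at each exit point. This is only the generic single-model mechanism, not HELIOS's model switching or greedy layer loading; the function name, the `(token, confidence)` representation, and the 0.9 threshold are assumptions made for illustration.

```python
def generate_token(layer_outputs, threshold=0.9):
    """Return (token, exit_depth) for one decoding step.

    layer_outputs: list of (token, confidence) pairs, one per exit point,
    ordered from the shallowest exit to the final layer.
    """
    for depth, (token, confidence) in enumerate(layer_outputs, start=1):
        if confidence >= threshold:
            # Confident enough: stop here and skip the deeper layers.
            return token, depth
    # No exit point was confident enough: use the final layer's prediction.
    token, _ = layer_outputs[-1]
    return token, len(layer_outputs)


# Hypothetical per-exit predictions for a single token.
print(generate_token([("cat", 0.55), ("cat", 0.93), ("cat", 0.99)]))
```

Tokens that never reach the threshold traverse every layer, which is the latency bottleneck the abstract describes for single-model frameworks.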
Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Authors: Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
First: 2025-10-31T15:54:48+00:00 · Latest: 2025-10-31T15:54:48+00:00
Abstract
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
中文标题/摘要
标题:迈向通用视频检索:通过合成多模态金字塔课程泛化视频嵌入
当前的视频检索范式结构上存在偏差,狭窄的基准测试激励了相应有限的数据和单任务训练。因此,由于缺乏定义和要求多维度泛化的诊断性评估,通用能力被抑制。为打破这一循环,我们提出了一种基于评估、数据和建模协同设计的框架。首先,我们建立了通用视频检索基准(UVRB),一套包含16个数据集的套件,不仅用于衡量性能,还用于诊断任务和领域间的关键能力差距。其次,根据UVRB的诊断结果,我们引入了一种可扩展的合成工作流,生成155万对高质量的配对,以填充实现通用性所需的语义空间。最后,我们设计了模态金字塔,通过明确利用我们多样化数据中的潜在联系来训练我们的通用视频嵌入器(GVE)。广泛的实验表明,GVE在UVRB上实现了最先进的零样本泛化。特别地,我们的分析揭示了流行基准测试是通用能力的不良预测者,部分相关检索是主导但被忽视的场景。总体而言,我们协同设计的框架提供了一条实用的道路,以摆脱有限的范围并朝着真正通用的视频检索迈进。
SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Authors: Ali Asgarov, Umid Suleymanov, Aadyant Khatri
First: 2025-10-31T15:51:00+00:00 · Latest: 2025-10-31T15:51:00+00:00
Comments: Short Paper - Under Review
Abstract
Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.
中文标题/摘要
标题:SIGMA:面向智能体数学推理的搜索增强按需知识集成
解决数学推理问题不仅需要准确地访问相关知识,还需要进行细致的多步思考。然而,当前的检索增强模型往往依赖单一视角,遵循僵化的搜索策略,并且难以有效地从多个来源整合信息。我们提出了SIGMA(搜索增强的按需知识集成以促进自主数学推理),这是一种统一框架,通过调解机制协调专门的代理独立推理、执行有针对性的搜索并综合发现。每个代理生成假设性段落以优化其分析视角的检索,确保知识集成既具有上下文敏感性又具有计算效率。在MATH500、AIME和博士级科学问答GPQA等具有挑战性的基准测试中,SIGMA始终优于开源和闭源系统,绝对性能提升7.4%。我们的结果表明,多代理、按需知识集成显著提高了推理的准确性和效率,为复杂、知识密集型问题解决提供了一种可扩展的方法。论文发表后我们将发布代码。
Summary / 总结
SIGMA is designed to improve mathematical reasoning by integrating knowledge from multiple sources through a multi-agent system. Each agent independently reasons and performs targeted searches, with findings synthesized by a moderator. SIGMA outperforms existing systems on benchmarks like MATH500, AIME, and GPQA, achieving a 7.4% absolute performance improvement and enhancing both reasoning accuracy and efficiency.
SIGMA 通过一个多代理系统整合多源知识来提升数学推理能力。每个代理独立推理并执行有针对性的搜索,由协调者合成结果。SIGMA 在 MATH500、AIME 和 GPQA 等基准测试中表现优于现有系统,绝对性能提升 7.4%,同时提高了推理的准确性和效率。
CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
Authors: Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi
First: 2025-10-31T15:47:07+00:00 · Latest: 2025-10-31T15:47:07+00:00
Abstract
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper's analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, which were also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
中文标题/摘要
标题:CodeAlignBench:评估代码生成模型在开发人员首选代码调整方面的性能
随着大型语言模型在生成代码方面的能力不断增强,对其性能的评估仍然是一个复杂且不断发展的挑战。现有的基准测试主要关注功能正确性,忽视了现实世界编码任务的多样性和开发人员的期望。为此,我们引入了一个多语言基准测试,该基准测试评估LLM的指令遵循能力,并可扩展应用于任何一组独立的编码问题。我们的基准测试在两个关键设置中评估指令遵循:一是遵守初始问题中定义的预设约束,二是根据后续指令执行改进。对于本文的分析,我们使用LiveBench中的编程任务来实证评估我们的基准测试管道,这些任务还自动从Python翻译成Java和JavaScript。我们的自动化基准测试揭示了模型在指令遵循多个维度上的不同性能水平。我们的基准测试管道为代码生成模型提供了更全面的评估,突显了它们在不同语言和生成目标方面的优势和局限性。
Summary / 总结
This study introduces CodeAlignBench, a multi-language benchmark to evaluate the performance of code generation models in following developer-preferred instructions and making code adjustments. The benchmark assesses adherence to initial problem constraints and the ability to refine code based on follow-up instructions. Evaluations using LiveBench tasks show that models vary in their performance across different dimensions of instruction-following, providing a more comprehensive evaluation of code generation models across languages and generation goals.
该研究引入了CodeAlignBench,一个多语言基准,用于评估代码生成模型在遵循开发人员偏好指令和进行代码调整方面的性能。该基准评估模型对初始问题约束的遵守情况以及根据后续指令进行代码优化的能力。使用LiveBench任务的评估表明,模型在不同指令执行维度上的表现存在差异,提供了跨语言和生成目标的代码生成模型的更全面评估。
Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs
Authors: Sushil Samuel Dinesh, Shinkyu Park
First: 2025-10-31T15:42:32+00:00 · Latest: 2025-10-31T15:42:32+00:00
Abstract
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
中文标题/摘要
标题:向准确的长时机器人操作迈进:通过场景图利用基础模型实现语言到动作
本文提出了一种框架,利用预训练的基础模型进行无需领域特定训练的机器人操作。该框架整合了现成的模型,结合了基础模型的多模态感知与一个通用推理模型,该模型能够实现稳健的任务序列。框架中动态维护的场景图提供了空间意识,并使对环境的一致性推理成为可能。该框架通过一系列桌面机器人操作实验进行了评估,结果突显了其直接基于现成基础模型构建机器人操作系统的潜力。
Summary / 总结
This paper introduces a framework that uses pre-trained foundation models for robotic manipulation without requiring domain-specific training. It integrates off-the-shelf models with a general-purpose reasoning model and uses scene graphs to maintain spatial awareness and enable consistent reasoning about the environment. The framework was tested in tabletop robotic manipulation experiments, demonstrating its potential for building robotic manipulation systems directly from off-the-shelf foundation models.
本文提出了一种框架,利用预训练的基础模型进行机器人操作,无需特定领域的训练。该框架结合了现成模型的多模态感知和通用推理模型,以实现稳健的任务序列。框架内的场景图提供了空间意识,并使对环境的一致性推理成为可能。桌面操作任务的实验结果表明,该框架可以直接基于现成的基础模型构建机器人操作系统。
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Authors: Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh
Venue: NeurIPS 2025
First: 2025-04-17T16:10:13+00:00 · Latest: 2025-10-31T15:41:28+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. We introduce NoisyRollout, a simple yet effective data augmentation method that addresses these issues by mixing training trajectories from both clean and moderately distorted images. This approach injects perceptual diversity, encouraging better policy exploration and leading to more robust reasoning. A noise annealing schedule gradually reduces distortion strength, aiding exploration early in training while ensuring later stability. Crucially, our method is easy to adopt, requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K) and image augmentation types (Gaussian noise and rotation), highlighting its generalizability and scalability.
中文标题/摘要
标题:NoisyRollout:通过数据增强强化视觉推理
近期强化学习(RL)的进步增强了视觉语言模型(VLMs)的推理能力。然而,如何通过增强策略探索来更好地扩展测试时的计算能力仍然鲜有探索。此外,VLMs 继续在不完美的视觉感知方面挣扎,这反过来影响了后续的推理过程。我们提出了 NoisyRollout,这是一种简单而有效的方法,通过混合来自干净图像和适度失真图像的训练轨迹来解决这些问题。这种方法注入了感知多样性,鼓励更好的策略探索,从而实现更稳健的推理。通过逐渐减少失真强度的噪声退火计划,NoisyRollout 在训练早期帮助探索,同时确保后期的稳定性。重要的是,我们的方法易于采用——无需额外的训练成本,也不需要修改RL目标。在两个不同的训练数据集上的广泛实验表明,NoisyRollout 在5个跨域推理和感知基准测试中实现了开源RL调优模型的最新性能。此外,我们验证了NoisyRollout在不同模型规模(7B和32B)、不同数据规模(从1K到6K)和不同图像增强类型(高斯噪声和旋转)中的有效性,突显了其通用性和可扩展性。
Summary / 总结
NoisyRollout is a data augmentation method that enhances policy exploration in reinforcement learning for vision-language models by mixing clean and moderately distorted images. It introduces perceptual diversity, leading to better reasoning and robustness. Experiments show that NoisyRollout achieves state-of-the-art performance across various benchmarks and model sizes, demonstrating its generalizability and scalability.
NoisyRollout 是一种数据增强方法,通过混合来自干净图像和适度失真图像的训练轨迹来增强强化学习中视觉语言模型的推理能力。这种方法提高了策略探索并导致更稳健的推理。实验表明,NoisyRollout 在各种离域基准测试中达到了最先进的性能,并适用于不同的模型大小和数据规模。
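The noise annealing schedule mentioned in the abstract can be illustrated with a simple decay function. The abstract only states that distortion strength is gradually reduced; the linear shape, the function name, and the default endpoints below are assumptions, not the schedule used by NoisyRollout.

```python
def distortion_strength(step, total_steps, initial=1.0, final=0.0):
    """Linearly interpolate distortion strength from initial to final.

    The linear form is an illustrative assumption; any monotonically
    decreasing schedule would fit the description in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial + (final - initial) * progress


for step in (0, 50, 100):
    print(step, distortion_strength(step, total_steps=100))
```

Strong distortion early in training corresponds to the exploration phase described in the abstract, and the decay toward zero corresponds to the later stability phase.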
Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
Authors: Xuankai Zhang, Junjin Xiao, Qing Zhang
Venue: NeurIPS 2025
First: 2025-10-12T16:38:54+00:00 · Latest: 2025-10-31T15:40:49+00:00
Comments: Accepted to NeurIPS 2025
Abstract
This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Due to the significant difference between the formation processes of defocus blur and motion blur, existing methods are tailored to one or the other and cannot handle both simultaneously. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty of estimating accurate blur kernels greatly limits progress in this direction. In this work, we go a step further in this direction. In particular, we propose to estimate reliable per-pixel blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. In addition, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians in incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms state-of-the-art methods in photorealistic novel view synthesis from defocused and motion-blurred monocular videos. Our code is available at https://github.com/hhhddddddd/dydeblur.
中文标题/摘要
标题:从失焦模糊和运动模糊的单目视频中实现动态高斯泼溅
本文提出了一种统一框架,可以从模糊和运动模糊单目视频中实现高质量的动态高斯点云化。由于焦距模糊和运动模糊形成过程的巨大差异,现有方法仅针对其中一种,缺乏同时处理两者的能力。尽管两者可以联合建模为基于模糊核的卷积,但估计准确模糊核的固有难度极大地限制了这一方向的进步。在本文中,我们进一步向这一方向迈进。特别地,我们提出使用一个利用与模糊相关的场景和相机信息的模糊预测网络来估计每个像素可靠的模糊核,并且该网络受到模糊感知稀疏约束。此外,我们引入了一种动态高斯密度策略来缓解不完整区域中高斯点的不足,并通过引入未见视图信息来约束场景优化,从而提升新颖视图合成的性能。大量实验表明,我们的方法在生成来自模糊和运动模糊单目视频的逼真新颖视图合成方面优于现有最先进的方法。我们的代码可在https://github.com/hhhddddddd/dydeblur 获取。
Summary / 总结
This paper introduces a unified framework for high-quality dynamic Gaussian Splatting from defocused and motion-blurred monocular videos. The method estimates per-pixel blur kernels using a blur prediction network and incorporates a dynamic Gaussian densification strategy to improve novel view synthesis. Experiments demonstrate superior performance compared to existing methods in generating photorealistic novel views from such videos.
该论文提出了一种统一框架,用于从模糊和运动模糊的单目视频中实现高质量的动态高斯散点图。它提出了一种模糊预测网络来估计准确的逐像素模糊核,并引入了一种动态高斯密集化策略以增强新颖视角合成。实验表明,所提出的方法在从这些视频生成逼真的新颖视角方面优于现有最先进的技术。
Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems
Authors: Marcel Aach, Rakesh Sarma, Helmut Neukirchen, Morris Riedel, Andreas Lintermann
Venue: Future Generation Computer Systems, Volume 175, February 2026
First: 2024-12-03T11:25:48+00:00 · Latest: 2025-10-31T15:38:53+00:00
Abstract
On High-Performance Computing (HPC) systems, several hyperparameter configurations can be evaluated in parallel to speed up the Hyperparameter Optimization (HPO) process. State-of-the-art HPO methods follow a bandit-based approach and build on top of successive halving, where the final performance of a combination is estimated from a performance metric measured at a fidelity lower than full training, and more promising combinations are assigned more resources over time. Frequently, the number of epochs is treated as a resource, letting more promising combinations train longer. Another option is to use the number of workers as a resource and directly allocate more workers to more promising configurations via data-parallel training. This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA), which combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). Scalability of this approach is shown on up to 1,024 Graphics Processing Units (GPUs) on modern HPC systems. It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains, where performing more than one full training run is usually infeasible. Empirical results show that RASDA outperforms ASHA by a factor of up to 1.9 with respect to the runtime. At the same time, the solution quality of final ASHA models is maintained or even surpassed by the implicit batch size scheduling of RASDA. With RASDA, systematic HPO is applied to a terabyte-scale scientific dataset for the first time in the literature, enabling efficient optimization of complex models on massive scientific data. The implementation of RASDA is available at https://github.com/olympiquemarcel/rasda
中文标题/摘要
标题:面向高性能计算系统上大数据集超参数优化的资源自适应连续加倍
在高性能计算(HPC)系统中,可以并行评估多个超参数配置以加快超参数优化(HPO)过程。最先进的HPO方法采用基于多臂老虎机的方法,并基于连续减半方案构建,其中组合的最终性能基于低于完全训练的保真度性能指标进行估计,并且更有前途的组合会随着时间分配更多资源。通常,训练周期数被视为资源,让更有前途的组合训练更长时间。另一种选择是使用工作节点(worker)数量作为资源,并直接通过数据并行训练将更多工作节点分配给更有前途的配置。本文提出了一种新颖的资源自适应连续加倍算法(RASDA),该算法将资源自适应连续加倍方案与简单的异步连续减半算法(ASHA)相结合。在现代HPC系统上多达1,024个图形处理单元(GPUs)上展示了该方法的可扩展性。它应用于不同类型的人工神经网络(NNs),并在计算机视觉(CV)、计算流体动力学(CFD)和增材制造(AM)领域的大数据集上进行训练,在这些数据集上通常无法进行多于一次的完整训练运行。实验证明,与ASHA相比,RASDA在运行时间上最多快1.9倍。同时,RASDA的隐式批量大小调度保持甚至超越了最终ASHA模型的解决方案质量。使用RASDA,文献中首次将系统化的HPO应用于太字节规模的科学数据集,从而在大规模科学数据上高效优化复杂模型。RASDA的实现可在https://github.com/olympiquemarcel/rasda获取
Summary / 总结
This work proposes the Resource-Adaptive Successive Doubling Algorithm (RASDA) for hyperparameter optimization with large datasets on HPC systems. RASDA combines a resource-adaptive successive doubling scheme with the Asynchronous Successive Halving Algorithm (ASHA). On up to 1,024 GPUs, RASDA is up to 1.9 times faster than ASHA in runtime while maintaining or improving the quality of the final models, which the abstract attributes to RASDA's implicit batch size scheduling. The authors report this as the first systematic HPO on a terabyte-scale scientific dataset in the literature.
研究旨在通过提出一种新型资源自适应连续加倍算法(RASDA)来优化大规模数据集上的超参数,该算法结合了资源自适应连续加倍方案和异步连续减半算法(ASHA)。研究显示,RASDA 在运行时间上比 ASHA 最多快 1.9 倍,同时保持或提高了最终模型的质量。这通过根据配置性能高效调度资源实现,首次在文献中将系统化超参数优化应用于太字节规模的科学数据集,从而能够高效优化复杂的模型。
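Successive halving, which the abstract names as the basis of ASHA and RASDA, can be sketched in its plain synchronous form. This is not ASHA's asynchronous variant and not RASDA's resource-adaptive doubling; the function names, the `eta` parameter, and the toy loss are hypothetical, and the budget could stand for epochs or for data-parallel workers.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations at each rung.

    evaluate(config, budget) must return a loss (lower is better).
    The budget given to the survivors grows by a factor of eta per rung.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta  # promoted configurations receive more resources
    return survivors[0]


def toy_loss(learning_rate, budget):
    """Hypothetical loss: best at 0.3, improving as the budget grows."""
    return abs(learning_rate - 0.3) + 1.0 / budget


candidates = [0.1, 0.2, 0.3, 0.4, 0.9, 0.7, 0.5, 0.05]
print(successive_halving(candidates, toy_loss))
```

With `eta=2` the budget of each surviving configuration doubles per rung, which is the sense in which halving the population and doubling the resources go together.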
Data-Driven Stochastic Optimal Control in Reproducing Kernel Hilbert Spaces
Authors: Nicolas Hoischen, Petar Bevanda, Stefan Sosnowski, Sandra Hirche, Boris Houska
First: 2024-07-23T11:53:03+00:00 · Latest: 2025-10-31T15:27:52+00:00
Comments: author-submitted electronic preprint version: 19 pages, 5 figures, 3 tables
Abstract
This paper proposes a fully data-driven approach for optimal control of nonlinear control-affine systems represented by a stochastic diffusion. The focus is on the scenario where both the nonlinear dynamics and stage cost functions are unknown, while only a control penalty function and constraints are provided. To this end, we embed state probability densities into a reproducing kernel Hilbert space (RKHS) to leverage recent advances in operator regression, thereby identifying Markov transition operators associated with controlled diffusion processes. This operator learning approach integrates naturally with convex operator-theoretic Hamilton-Jacobi-Bellman recursions that scale linearly with state dimensionality, effectively solving a wide range of nonlinear optimal control problems. Numerical results demonstrate its ability to address diverse nonlinear control tasks, including the depth regulation of an autonomous underwater vehicle.
中文标题/摘要
标题:基于再生核希尔伯特空间的数据驱动随机最优控制
本文提出了一种完全基于数据的非线性控制-仿射系统(由随机扩散表示)的最优控制方法。重点在于非线性动力学和阶段成本函数未知的情况下,仅提供控制惩罚函数和约束条件。为此,我们将状态概率密度嵌入到再生核希尔伯特空间(RKHS)中,利用最新的算子回归进展,从而识别与控制扩散过程相关的马尔可夫转移算子。该算子学习方法自然地与线性缩放于状态维度的凸算子理论哈密尔顿-雅可比-贝尔曼递归相结合,有效地解决了广泛的非线性最优控制问题。数值结果表明其能够处理各种非线性控制任务,包括自主水下车辆的深度调节。
Summary / 总结
This paper introduces a data-driven method for optimal control of nonlinear systems with unknown dynamics and costs, using reproducing kernel Hilbert spaces to learn Markov transition operators. The approach leverages operator regression and convex optimization to handle high-dimensional state spaces, and demonstrates effectiveness in various nonlinear control tasks, such as depth regulation of an autonomous underwater vehicle.
该论文提出了一种数据驱动的方法来解决具有未知动力学和成本的非线性系统的最优控制问题,通过再生核希尔伯特空间学习马尔可夫转移算子。该方法利用算子回归,并且随状态维度线性扩展,成功解决了诸如自主水下车辆深度调节等多种非线性控制问题。
VRoPE: Rotary Position Embedding for Video Large Language Models
Authors: Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu
Venue: EMNLP 2025
First: 2025-02-17T10:53:57+00:00 · Latest: 2025-10-31T15:26:11+00:00
Comments: EMNLP 2025 Main Camera Ready
Abstract
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
中文标题/摘要
标题:VRoPE:视频大型语言模型的旋转位置嵌入
旋转位置嵌入(RoPE)在基于文本的大型语言模型(LLMs)中表现出强大的性能,但将其扩展到视频仍然面临挑战,因为视频帧具有复杂的时空结构。现有的适应方法,如RoPE-3D,试图分别编码空间和时间维度,但存在两个主要局限性:注意力分布的位置偏差和视频-文本过渡中的中断。为克服这些问题,我们提出了视频旋转位置嵌入(VRoPE),这是一种针对视频LLMs的新型位置编码方法。具体来说,我们引入了一种更平衡的编码策略,以减轻注意力偏差,确保空间焦点的更均匀分布。此外,我们的方法重新构建了位置索引,以确保视频和文本标记之间的平滑过渡。在不同模型上的广泛实验表明,VRoPE 一致地优于之前的RoPE变体,在视频理解、时间推理和检索任务中取得了显著的改进。代码可在 https://github.com/johncaged/VRoPE 获取。
Summary / 总结
VRoPE is a novel positional encoding method designed for Video Large Language Models (Video-LLMs) to address the limitations of existing Rotary Position Embedding (RoPE) techniques in handling the spatiotemporal structure of video frames. It introduces a balanced encoding strategy to mitigate attention biases and ensures a smooth transition between video and text tokens. Experimental results show that VRoPE outperforms previous RoPE variants in video understanding, temporal reasoning, and retrieval tasks, achieving significant improvements across different models.
VRoPE 是一种针对视频大型语言模型(Video-LLMs)的新位置编码方法,旨在解决现有旋转位置嵌入(RoPE)技术在处理视频帧的时空结构时的局限性。它引入了一种平衡的编码策略,以减轻注意力偏差,并确保视频和文本标记之间的平滑过渡。实验结果表明,VRoPE 在视频理解、时间推理和检索任务中优于之前的 RoPE 变体,实现了显著的性能提升。
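For readers unfamiliar with the baseline, standard one-dimensional RoPE rotates consecutive feature pairs by position-dependent angles so that attention scores depend on relative position. The sketch below shows only that text baseline, not VRoPE's balanced encoding or its restructured indices; the function names and the example vectors are assumptions for illustration.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard 1D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i // 2 uses frequency base ** (-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the two scores should agree.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The relative-position property shown here is what makes the assignment of positional indices to video and text tokens matter, which is the aspect the abstract says VRoPE restructures.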
MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series
Authors: Xue Xia, Randall Balestriero, Tao Zhang, Yixin Zhou, Andrew Ding, Dev Saini, Lorenz Hurni
First: 2025-10-31T15:25:40+00:00 · Latest: 2025-10-31T15:25:40+00:00
Abstract
Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, and studying environmental changes. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
中文标题/摘要
标题:MapSAM2:将SAM2适应于历史地图图像和时间序列的自动分割
历史地图是独特的宝贵档案,记录了不同时间时期的地理特征。然而,由于其广泛的风格差异和标注训练数据的稀缺性,历史地图图像的自动化分析仍是一个重大挑战。从历史地图时间序列构建链接的时空数据集更加耗时和劳动密集,因为它需要从多张地图中综合信息。此类数据集对于诸如建筑年代测定、道路网络和聚落的发展分析、环境变化研究等应用至关重要。我们提出了MapSAM2,这是一种统一框架,用于自动分割历史地图图像和时间序列。基于视觉基础模型,MapSAM2通过少量微调适应多种分割任务。我们的关键创新是将历史地图图像和时间序列都视为视频。对于图像,我们将一组瓦片处理为视频,使记忆注意力机制能够从相似瓦片中获取上下文线索,从而提高几何准确性,特别是对于区域特征。对于时间序列,我们引入了标注的西格弗里德建筑时间序列数据集,并通过模拟常见的时间变换从单年地图生成伪时间序列以降低标注成本。实验结果表明,MapSAM2能够有效地学习时间关联,并在有限监督或使用伪视频的情况下准确分割和链接时间序列中的建筑物。我们将发布我们的数据集和代码以支持未来的研究。
Summary / 总结
MapSAM2 is designed to automatically segment historical map images and time series, addressing the challenges of stylistic variability and lack of annotated data. By treating historical maps as videos, MapSAM2 improves geometric accuracy for areal features and introduces a pseudo time series generation method to reduce annotation costs. The framework effectively learns temporal associations and accurately segments and links buildings in time series with limited supervision or pseudo videos.
MapSAM2 是一个统一框架,用于自动分割历史地图图像和时间序列。它利用视觉基础模型,并通过少量微调适应多种分割任务。通过将历史地图图像和时间序列视为视频,MapSAM2 提高了区域特征的空间准确性,并通过伪时间序列生成减少时间序列的注释成本。实验表明,在有限监督或使用伪视频的情况下,MapSAM2 能够有效地学习时间关联并准确分割和链接建筑物。
EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Authors: Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu
First: 2025-10-31T15:21:05+00:00 · Latest: 2025-10-31T15:21:05+00:00
Comments: 9 pages, 6 figures, 4 tables
Abstract
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
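Title / Abstract (translated from Chinese)
Title: EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shift. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering better robustness and reduced exposure bias. Yet policies parameterized by EBMs have historically been hard to scale. Recent work on Energy-Based Transformers (EBTs) demonstrates that EBMs can scale to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We propose EBT-Policy, a new energy-based architecture that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared with Diffusion Policy's 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shift.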
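Summary
EBT-Policy addresses the limitations of implicit policies in robotics by leveraging Energy-Based Models (EBMs) to improve robustness and reduce computational costs. It consistently outperforms diffusion-based policies across simulated and real-world tasks while requiring less training and inference computation. Notably, EBT-Policy demonstrates emergent capabilities like zero-shot recovery from failed actions without explicit retry training, showcasing its potential for robust and generalizable robot behavior.
EBT-Policy uses Energy-Based Models (EBMs) to address the limitations of implicit policies in robotics, improving robustness and reducing computational cost. It outperforms diffusion-based policies on both simulated and real-world tasks while requiring less training and inference computation. Notably, on some tasks it converges within just two inference steps, and it can recover from failed action sequences without explicit retry training, an emergent capability not seen in prior models.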
Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Authors: Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito
First: 2025-10-31T15:17:55+00:00 · Latest: 2025-10-31T15:17:55+00:00
Abstract
Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available in our GitHub repository: https://github.com/nik-hz/tempobench
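Title / Abstract (translated from Chinese)
Title: Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Large Language Models (LLMs) increasingly excel at many tasks and surpass human performance. However, to improve LLM reasoning, researchers rely either on ad-hoc generated datasets or on formal mathematical proof systems such as the Lean proof assistant. Ad-hoc generated methods can capture the decision chains of real-world reasoning processes, but they may encode inadvertent bias in the space of reasoning they cover, and they cannot be formally verified. Systems like Lean, on the other hand, guarantee verifiability but are poorly suited to capturing the nature of agentic, decision-chain-based tasks. This leaves a gap both in performance for functions such as business agents or code assistants and in the usefulness of LLM reasoning benchmarks, which fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests an LLM's ability to understand and simulate the execution of a given multi-step reasoning system. Then, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available in our GitHub repository: https://github.com/nik-hz/tempobench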
DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
Authors: Malik H. Altakrori, Nizar Habash, Abdelhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji
First: 2025-10-31T15:17:06+00:00 · Latest: 2025-10-31T15:17:06+00:00
Comments: 9 pages, 9 tables
Abstract
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
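Title / Abstract (translated from Chinese)
Title: DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) on Arabic dialects. Although recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework by manually translating and adapting 3,000 multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15,000 QA pairs across 32 academic and professional domains (22,000 QA pairs when English and MSA are included). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA and supports both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B to 13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thereby promoting more inclusive evaluation and future model development.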
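Summary
DialectalArabicMMLU is a new benchmark for evaluating large language models across Arabic dialects, addressing the underrepresentation of dialectal varieties in existing benchmarks. It consists of 15K multiple-choice question-answer pairs, obtained by manually translating and adapting 3K pairs into five major dialects, and covers 32 academic and professional domains. Evaluating 19 open-weight LLMs, the study reveals significant performance variations across dialects, highlighting gaps in dialectal generalization capabilities of these models.
DialectalArabicMMLU is a new benchmark for evaluating large language models on Arabic dialects, addressing the underrepresentation of dialects in existing benchmarks. It translates 3,000 multiple-choice question-answer pairs into five major dialects, yielding 15,000 QA pairs across 32 domains. An evaluation of 19 LLMs reveals substantial performance differences across dialects, highlighting gaps in dialectal generalization. The benchmark supports both task-based and linguistic analysis, promoting more inclusive evaluation and model development.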
AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering
Authors: Mohammad Zahangir Alam, Khandoker Ashik Uz Zaman, Mahdi H. Miraz
Venue: Print ISSN: 2516-0281, Online ISSN: 2516-029X, pp. 13-31, Vol. 9, No. 5, 25 October 2025
First: 2025-10-31T15:13:03+00:00 · Latest: 2025-10-31T15:13:03+00:00
Abstract
Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.
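Title / Abstract (translated from Chinese)
Title: AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering
Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance runs into critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and the interpretation of sensitive data. We introduce AstuteRAG-FQA, an adaptive RAG framework for Financial Question Answering (FQA) that uses task-aware prompt engineering to address these challenges. The framework adopts a hybrid retrieval strategy that combines open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To handle diverse financial queries systematically, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms, including differential privacy, data anonymization, and role-based access controls, to protect sensitive financial information. In addition, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - and analyze their efficiency and feasibility across varied financial environments.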
RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Authors: Eleftherios Triantafyllidis, Fernando Acero, Zhaocheng Liu, Zhibin Li
First: 2023-06-30T20:35:22+00:00 · Latest: 2025-10-31T15:09:35+00:00
Comments: To appear in Nature Machine Intelligence. Includes the main and supplementary manuscript. Total of 70 pages, with a total of 9 Figures and 17 Tables
Abstract
Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.
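Title / Abstract (translated from Chinese)
Title: RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks
Solving long sequential tasks is a major challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. This work presents a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by combining behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of neural networks, each specialising in a distinct re-combinable sub-task, to generate the correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that, by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for completing long sequences of sophisticated manipulation tasks and exhibits adaptive behaviours beyond the demonstrations, while remaining robust to various sensory noises. These results demonstrate the significance of ROMAN's dynamic adaptability and its autonomous failure recovery capabilities, and highlight its potential for autonomous manipulation tasks that demand adaptive motor skills.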
Summary
The research addresses the challenge of performing long sequential tasks in robotic manipulation by introducing the Robotic Manipulation Network (ROMAN), which combines behavioral cloning, imitation learning, and reinforcement learning. ROMAN consists of a central manipulation network coordinating various specialized neural networks to handle distinct sub-tasks, enabling it to perform complex long-horizon manipulation tasks. Experiments show that ROMAN can generate correct sequential actions, adapt beyond demonstrations, and recover robustly from failures, highlighting its potential for autonomous manipulation tasks requiring adaptive motor skills.
The study tackles the challenge of executing long sequential tasks in robotic manipulation by introducing the Robotic Manipulation Network (ROMAN), which combines behavioral cloning, imitation learning, and reinforcement learning. ROMAN comprises a central manipulation network that coordinates neural networks specialized in different sub-tasks, allowing it to carry out complex long-horizon manipulation tasks. Experiments show that ROMAN generates correct sequential actions, adapts beyond the demonstrations, and recovers robustly from failures, which highlights its potential for autonomous manipulation tasks that require adaptive motor skills.