arXiv Paper Digest

Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

Authors: Francesco Granata, Francesco Poggi, Misael Mongiovì

First: 2025-12-05T18:59:18+00:00 · Latest: 2025-12-05T18:59:18+00:00

Abstract

In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm a domain-mismatch effect and highlight the importance of domain adaptation and hybrid ranking strategies for factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.

中文标题/摘要

标题：利用实体链接增强教育平台的检索增强生成

在大型语言模型（LLMs）的时代，检索增强生成（RAG）架构因其能够将语言生成与可靠的知识来源相结合而受到广泛关注。尽管RAG系统在许多领域表现出色，但仅基于语义相似性的RAG系统在专业领域中往往无法确保事实准确性，术语歧义会影响检索的相关性。本研究提出了一种增强的RAG架构，通过结合实体链接提取的事实信号来提高意大利语教育问答系统的准确性。该系统包括一个基于Wikidata的实体链接模块，并实施了三种重新排序策略来结合语义和实体信息：混合评分加权模型、互惠排名融合以及跨编码器重新排序器。实验在两个基准数据集上进行：一个自定义的学术数据集和标准的SQuAD-it数据集。结果显示，在特定领域的情境下，基于互惠排名融合的混合方案显著优于基线和跨编码器方法，而跨编码器在通用领域数据集上表现最佳。这些发现证实了领域不匹配效应的存在，并强调了领域适应和混合排序策略对于提高检索增强生成的事实精确性和可靠性的重要性。此外，它们还展示了实体感知RAG系统在教育环境中的潜力，促进了适应性和可靠的基于AI的辅导工具的发展。

Summary / 总结

This study aims to improve the accuracy of educational question-answering systems by integrating Entity Linking into Retrieval-Augmented Generation (RAG) architectures. The system uses a Wikidata-based Entity Linking module and implements three re-ranking strategies: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments on a custom academic dataset and the SQuAD-it dataset show that the hybrid schema based on reciprocal rank fusion outperforms both the baseline and the cross-encoder approach in domain-specific contexts, while the cross-encoder achieves the best results on the general-domain dataset. These findings highlight the importance of domain adaptation and hybrid ranking strategies for enhancing factual precision in educational platforms.

该研究旨在通过将实体链接集成到检索增强生成（RAG）架构中，提高教育问答系统的准确性。系统使用基于Wikidata的实体链接模块，并采用三种重新排序策略：混合评分加权模型、互反排名融合和交叉编码器重新排序器。实验结果表明，在特定领域中，基于互反排名融合的混合方案在性能上优于基线和交叉编码器方法，而交叉编码器在通用领域数据集上表现最佳。
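
The reciprocal rank fusion strategy named in the abstract can be illustrated with a minimal sketch. It assumes the standard RRF formula, score(d) = sum over rankings of 1 / (k + rank), with the commonly used constant k = 60. The constant, the document ids, and the two example rankings are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings: one from semantic similarity, one from entity overlap.
semantic_ranking = ["d3", "d1", "d2", "d4"]
entity_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([semantic_ranking, entity_ranking])
print(fused)
```

Because RRF uses only ranks, it needs no calibration between the semantic score and the entity-based score, which is one plausible reason to prefer it over a weighted sum when the two signals live on different scales.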

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Authors: Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu

First: 2025-12-05T18:58:09+00:00 · Latest: 2025-12-05T18:58:09+00:00

Comments: Project page: https://appletea233.github.io/think-while-edit

Abstract

Instruction-based image editing has emerged as a prominent research area. Benefiting from image generation foundation models, it has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets image editors 'think' while they edit. It simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until the result is satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework; it jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.

中文标题/摘要

标题：EditThinker：解锁任意图像编辑器的迭代推理

基于指令的图像编辑已成为一个重要的研究领域，得益于图像生成基础模型的支持，它已经实现了高质量的美学效果，使指令遵循能力成为主要挑战。现有方法通过监督或强化学习提高指令的遵循度，但由于固有的随机性和缺乏深思熟虑，单次编辑的成功率仍然有限。在本文中，我们提出了一种反思编辑框架，使其在编辑时“思考”，通过迭代执行“思考-编辑”循环来模拟人类的认知循环：批判结果并改进指令，然后重复生成直到满意。具体来说，我们训练了一个单一的MLLM，EditThinker，作为该框架的推理引擎，它共同生成批判评分、推理过程和改进后的指令。我们使用强化学习来使EditThinker的思考与其编辑相一致，从而生成更针对性的指令改进。在四个基准上的广泛实验表明，我们的方法显著提高了任何图像编辑模型的指令遵循能力，幅度很大。我们将发布我们的数据构建框架、数据集和模型，以造福社区。

Summary / 总结

The research aims to enhance the instruction-following capability of image editing models by addressing the limitations of single-turn success rates due to stochasticity and lack of deliberation. The proposed EditThinker framework uses a Think-while-Edit cycle to iteratively critique results, refine instructions, and repeat the generation process until satisfactory outcomes are achieved. The framework trains a single MLLM to act as the reasoning engine, which produces critique scores, reasoning processes, and refined instructions. Reinforcement learning is employed to align the reasoning with the editing process, leading to more targeted instruction improvements. Experiments on four benchmarks show that EditThinker significantly enhances the instruction-following capability of image editing models.

研究旨在通过解决单次编辑成功率受限于随机性和缺乏反思的问题，提高图像编辑模型的指令遵循能力。提出的EditThinker框架使用思考-编辑循环来迭代地评估结果、改进指令并重复生成过程，直到达到满意的结果。该框架训练一个单一的MLLM作为推理引擎，生成评估分数、推理过程和改进后的指令。通过强化学习使推理与编辑过程对齐，从而实现更精准的指令改进。在四个基准上的实验表明，EditThinker显著提升了图像编辑模型的指令遵循能力。
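
The Think-while-Edit cycle described above (edit, critique, refine, repeat) can be sketched as a plain control loop. This is only an illustration of the loop structure: `edit_fn`, `think_fn`, the score threshold, and the round limit are hypothetical stand-ins, and the toy functions below replace the real image editor and the trained EditThinker model.

```python
def think_while_edit(image, instruction, edit_fn, think_fn,
                     threshold=0.9, max_rounds=4):
    """Run edit -> critique -> refine until the critique score is high enough.

    edit_fn(image, instruction) -> edited result
    think_fn(image, result, original_instruction, current_instruction)
        -> (score, reasoning, refined_instruction)
    """
    current = instruction
    history = []
    result = None
    for _ in range(max_rounds):
        result = edit_fn(image, current)
        score, reasoning, refined = think_fn(image, result, instruction, current)
        history.append((current, score))
        if score >= threshold:
            break  # the critic judges the edit satisfactory
        current = refined  # retry with the refined instruction
    return result, history


# Toy stand-ins (assumptions): strings play the role of images.
def toy_edit(image, instruction):
    return f"{image}|{instruction}"


def toy_think(image, result, original, current):
    # Pretend the edit only succeeds once the target object is named.
    if "the car" in current:
        return 1.0, "edit matches the request", current
    return 0.2, "target object is ambiguous", current + " on the car"


result, history = think_while_edit("photo", "make it red", toy_edit, toy_think)
print(result, history)
```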

Training-Time Action Conditioning for Efficient Real-Time Chunking

Authors: Kevin Black, Allen Z. Ren, Michael Equi, Sergey Levine

First: 2025-12-05T18:57:28+00:00 · Latest: 2025-12-05T18:57:28+00:00

Abstract

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.

中文标题/摘要

标题：训练时动作条件化以实现高效的实时分块

实时分块（RTC）使视觉-语言-动作模型（VLAs）能够通过异步预测动作分块并在推理时进行修复来生成平滑的、反应性的机器人轨迹。然而，这种方法引入了计算开销，增加了推理延迟。在本文中，我们提出了一种简单的替代方案：在训练时模拟推理延迟，并直接条件化于动作前缀，从而消除任何推理时的开销。我们的方法不需要对模型架构或机器人运行时进行任何修改，并且只需几行额外的代码即可实现。在模拟实验中，我们发现训练时RTC在更高的推理延迟下优于推理时RTC。在使用$π_{0.6}$ VLA进行盒子构建和制作浓缩咖啡任务的实地实验中，我们证明了训练时RTC在保持任务性能的同时，与推理时RTC具有相同的速度，且计算成本更低。我们的结果表明，训练时动作条件化是实时机器人控制中推理时修复的一种实用的即插即用替代方案。

Summary / 总结

This paper addresses the computational overhead of inference-time inpainting in real-time chunking (RTC) for vision-language-action models (VLAs), which increases inference latency. The authors propose training-time RTC, which simulates inference delay during training and conditions the model directly on action prefixes, eliminating inference-time overhead. In simulated experiments, training-time RTC outperforms inference-time RTC at higher inference delays. In real-world box building and espresso making tasks, it matches inference-time RTC in task performance and speed while being computationally cheaper.

该研究解决了实时分块（RTC）中视觉-语言-动作模型（VLAs）的推理时间 inpainting 计算开销问题，增加了推理延迟。作者提出在训练时间模拟推理延迟并在训练时间直接条件化动作前缀的方法，从而消除了推理时间的开销。实验表明，训练时间 RTC 在较高推理延迟下优于推理时间 RTC，并且在诸如打包和制作意式咖啡等真实世界任务中保持了任务性能和速度与推理时间 RTC 相当的同时，计算成本更低。
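
One way to picture "simulating inference delay at training time and conditioning on action prefixes" is the data-side sketch below. It is an assumption-laden illustration, not the authors' recipe: the uniform delay sampling, the scalar actions, the loss mask, and the names `make_prefix_conditioned_example` and `masked_mse` are all hypothetical.

```python
import random


def make_prefix_conditioned_example(action_chunk, max_delay, rng):
    """Build one training example with a simulated inference delay.

    The first `delay` actions stand for actions already committed while
    inference was running. They are handed to the model as a clean prefix,
    and the loss mask excludes them so only the remaining actions are learned.
    """
    delay = rng.randint(0, max_delay)  # inclusive on both ends
    prefix = action_chunk[:delay]
    loss_mask = [0.0] * delay + [1.0] * (len(action_chunk) - delay)
    return {"delay": delay, "prefix": prefix,
            "targets": action_chunk, "loss_mask": loss_mask}


def masked_mse(pred, target, mask):
    """Mean squared error over the unmasked (non-prefix) positions only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0


rng = random.Random(0)
chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # toy 1-D action chunk
example = make_prefix_conditioned_example(chunk, max_delay=3, rng=rng)

# A prediction that is wrong only inside the prefix incurs no loss.
pred = [9.0] * example["delay"] + chunk[example["delay"]:]
print(example["delay"], masked_mse(pred, chunk, example["loss_mask"]))
```

In this reading, no extra computation happens at inference: the committed actions are simply fed in as the prefix, which is consistent with the abstract's claim of no inference-time overhead.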

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Authors: Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman

First: 2025-12-05T18:56:40+00:00 · Latest: 2025-12-05T18:56:40+00:00

Abstract

Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.

中文标题/摘要

标题：凡余下者必为真：过滤驱动LLMs推理，塑造多样性

强化学习(RL)已成为调校LLMs以解决涉及推理的任务的标准方法。然而，越来越多的证据表明，以这种方式训练的模型往往在多样性方面遭受重大损失。我们认为，这是因为RL隐式优化了“模式寻求”或“零强迫”逆KL到目标分布，导致模型将概率集中在目标的某些高概率区域，而忽视了其他区域。在本文中，我们从一个显式的目标分布开始，该分布通过过滤掉错误答案并保留正确答案的相对概率而获得。从一个预训练的LLM出发，我们使用α-散度族来近似这个目标分布，该族统一了先前的方法，并通过在模式寻求和质量覆盖散度之间进行插值，直接控制精确度-多样性权衡。在Lean定理证明基准测试中，我们的方法在覆盖率-精确度帕累托前沿上达到了最先进的性能，在覆盖率轴上优于所有先前的方法。

Summary / 总结

This study addresses the loss of diversity in Large Language Models (LLMs) trained via Reinforcement Learning (RL), which tends to concentrate on high-probability regions while neglecting others. The authors propose a method that starts from an explicit target distribution obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Using the α-divergence family, they approximate this target distribution and control the precision-diversity trade-off. On a Lean theorem-proving benchmark, their method achieves state-of-the-art performance, outperforming previous methods on the coverage axis.

该研究探讨了通过强化学习（RL）训练的大语言模型（LLM）在训练过程中出现的多样性损失问题，这种现象会导致模型集中在高概率区域而忽视其他区域。作者提出了一种方法，从过滤掉错误答案并保留正确答案相对概率的显式目标分布开始。使用α-散度族，他们近似这种目标分布，并通过插值控制精确度-多样性权衡。在Lean定理证明基准测试中，他们的方法达到了最先进的性能，在覆盖轴上优于所有先前的方法。
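
The filtered target distribution and the interpolation between divergences can be made concrete on a small discrete example. The sketch assumes one common (Amari-style) parameterization, D_alpha(p || q) = (1 - sum_i p_i^alpha q_i^(1 - alpha)) / (alpha (1 - alpha)), whose limits are KL(p || q) as alpha approaches 1 and KL(q || p) as alpha approaches 0. The paper may use a different parameterization, and the three-answer distribution is invented for illustration.

```python
import math


def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalize, so correct answers
    keep their relative probabilities under the base model."""
    kept = [p if ok else 0.0 for p, ok in zip(base_probs, is_correct)]
    z = sum(kept)
    if z == 0.0:
        raise ValueError("no correct answer has positive probability")
    return [p / z for p in kept]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 vanish."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def alpha_divergence(p, q, alpha):
    """Amari-style alpha-divergence for 0 < alpha < 1 (assumed convention)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s = sum((pi ** alpha) * (qi ** (1.0 - alpha))
            for pi, qi in zip(p, q) if pi > 0.0 and qi > 0.0)
    return (1.0 - s) / (alpha * (1.0 - alpha))


base = [0.5, 0.3, 0.2]          # base-model probabilities of three answers
correct = [True, False, True]   # the second answer is wrong
target = filtered_target(base, correct)
model = [0.6, 0.1, 0.3]         # some candidate policy

print(target)
print(alpha_divergence(target, model, 0.999), kl(target, model))
```

Under this convention, with the target as the first argument and the model as the second, values of alpha near 0 behave like the mode-seeking reverse KL and values near 1 like the mass-covering forward KL.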

AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

Authors: Munsif Ali, Najmul Hassan, Lucia Ventura, Davide Di Bari, Simonepietro Canese

First: 2025-12-05T18:56:10+00:00 · Latest: 2025-12-05T18:56:10+00:00

Abstract

Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. At the same time, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with state-of-the-art (SOTA) methods in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.

中文标题/摘要

标题：AQUA-Net：自适应频率融合和照明感知网络的水下图像增强

水下图像常常由于波长依赖的光吸收和散射而遭受严重的色彩失真、低对比度和朦胧外观。同时，现有的深度学习模型具有较高的计算复杂性，这限制了它们在实时水下应用中的实际部署。为了解决这些挑战，本文提出了一种新颖的水下图像增强模型，称为自适应频率融合和照明感知网络（AQUA-Net）。该模型结合了残差编码解码器和双辅助分支，分别在频率和照明域中运行。频率融合编码器通过从傅里叶域获取的频率线索丰富空间表示，并保留精细纹理和结构细节。受Retinex启发，照明感知解码器通过学习得到的照明图执行自适应曝光校正，该图将反射率与照明效果分离。这种联合空间、频率和照明设计使模型能够在多种水下条件下恢复色彩平衡、视觉对比度和感知现实性。此外，我们还提供了一个来自地中海的高分辨率真实世界水下视频数据集，该数据集捕捉了具有真实视觉退化特征的深海条件，以实现对深度学习模型的稳健评估和开发。在多个基准数据集上的大量实验表明，AQUA-Net在定性和定量评估中均与当前最佳技术（SOTA）相当，同时使用更少的参数。消融研究进一步证实，频率和照明分支提供了互补的贡献，提高了可见性和色彩表示。总体而言，所提出模型展示了强大的泛化能力和鲁棒性，并为实际水下成像应用提供了一个有效的解决方案。

Summary / 总结

The paper addresses the challenges of color distortion, low contrast, and hazy appearance in underwater images due to light absorption and scattering. It introduces AQUA-Net, a novel model that combines a residual encoder-decoder with dual auxiliary branches for frequency and illumination domains. AQUA-Net enhances spatial representations with frequency cues and performs adaptive exposure correction using a learned illumination map. Experiments show that AQUA-Net matches state-of-the-art models in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies confirm the complementary contributions of the frequency and illumination branches in improving visibility and color representation.

论文针对由于光吸收和散射导致的水下图像色彩失真、对比度低和雾化等问题，提出了一种名为AQUA-Net的新模型，该模型结合了残差编码解码器和频率与光照域的双辅助分支。AQUA-Net通过频率融合保留纹理和结构细节，并通过学习光照图进行曝光校正以分离反射和照明效果。实验表明，AQUA-Net在定性和定量评价中与最先进的模型相当，同时使用更少的参数。消融研究进一步证实了频率和光照分支的互补贡献，提高了可见性和色彩表示。该模型展示了强大的泛化能力和鲁棒性，适用于实际水下成像应用。

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

Authors: David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata

First: 2025-12-05T18:55:58+00:00 · Latest: 2025-12-05T18:55:58+00:00

Comments: Preprint

Abstract

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

中文标题/摘要

标题：M4-RAG：大规模多语言多文化多模态RAG

视觉语言模型（VLMs）在视觉问答（VQA）方面取得了强大的性能，但仍然受限于静态训练数据。检索增强生成（RAG）通过使模型能够访问最新的、文化基础的和多语言的信息来缓解这一限制；然而，多语言多模态RAG仍然很大程度上未被探索。我们介绍了M4-RAG，这是一个涵盖42种语言和56种地区方言及体裁的大规模基准，包含超过80,000个文化多样性的图像-问题对，用于评估跨语言和模态的检索增强VQA。为了平衡现实性和可重复性，我们构建了一个受控的检索环境，包含数百万个与查询领域相关的精心策划的多语言文档，近似于现实世界的检索条件，同时确保一致的实验。我们的系统性评估表明，尽管RAG持续改善较小的VLMs，但它无法扩展到更大的模型，并且经常甚至会降低其性能，揭示了模型规模与当前检索效果之间的重要不匹配。M4-RAG为推进能够无缝跨越语言、模态和文化背景的下一代RAG系统奠定了基础。

Summary / 总结

M4-RAG is a large-scale benchmark for multilingual multimodal retrieval-augmented generation, covering 42 languages and 56 regional dialects and registers with over 80,000 culturally diverse image-question pairs. It evaluates retrieval-augmented VQA across languages and modalities using a controlled retrieval environment with millions of curated multilingual documents. The study finds that while RAG consistently benefits smaller VLMs, it fails to scale to larger models and often degrades their performance, exposing a mismatch between model size and current retrieval effectiveness.

研究旨在通过利用检索增强生成（RAG）技术，使视觉语言模型在视觉问答中的表现更佳，以获取最新的、文化背景丰富和多语言的信息。M4-RAG 是一个大规模基准，评估了跨 42 种语言和 56 种区域方言的检索增强视觉问答，结果显示虽然 RAG 改善了较小模型的表现，但往往降低了较大模型的性能，原因是模型大小与当前检索效果之间存在不匹配。

SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Authors: Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du

First: 2025-12-05T18:51:03+00:00 · Latest: 2025-12-05T18:51:03+00:00

Abstract

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io

中文标题/摘要

标题：SIMPACT：使用视觉-语言模型的仿真驱动动作规划

视觉-语言模型（VLMs）表现出显著的常识和语义推理能力。然而，它们缺乏对物理动力学的现实理解。这一限制源于VLMs在静态互联网规模的视觉-语言数据上进行训练，这些数据中没有因果交互或动作条件下的变化。因此，利用VLMs进行需要物理理解、推理和相应动作规划的精细机器人操作任务仍然具有挑战性。为了解决这个问题，我们提出了SIMPACT，一种测试时仿真驱动的动作规划框架，通过仿真闭环世界建模为VLM提供物理推理能力，而无需额外的训练。从单个RGB-D观察开始，SIMPACT高效地构建物理仿真，使VLM能够提出有信息量的动作，观察仿真滚动，并逐步改进其推理。通过将语言推理与物理预测集成，我们的仿真驱动的VLM能够以物理为基础的方式理解接触动力学和动作结果。我们的方法在五个需要精细物理推理的真实世界刚体和可变形操作任务上展示了最先进的性能，优于现有的通用机器人操作模型。我们的结果表明，在测试时通过高效仿真嵌入物理理解为VLM推理提供了实现通用体化智能的有希望的途径。项目网页可访问 https://simpact-bot.github.io

Summary / 总结

SIMPACT is a test-time simulation-enabled action planning framework that enhances Vision-Language Models (VLMs) with physical reasoning capabilities. By integrating physics simulations into the VLMs, SIMPACT allows them to understand contact dynamics and action outcomes, enabling fine-grained manipulation tasks. The method outperforms existing models on five challenging real-world tasks and shows promise for embodied intelligence.

SIMPACT 是一个框架，通过仿真增强视觉-语言模型（VLM）的物理推理能力，使其能够执行精细的机器人操作任务。通过将语言推理与物理预测集成，SIMPACT 从单个 RGB-D 观测中构建物理仿真，允许 VLM 逐步细化其推理并提出有见地的动作。该方法在五个具有挑战性的操作任务上优于现有模型，展示了将物理理解嵌入 VLM 推理中的潜力，以实现通用的具身智能。
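
The propose, simulate, refine loop at the core of simulation-in-the-loop planning can be shown with a deliberately tiny example. Everything here is a toy assumption: a one-dimensional sliding block stands in for the physics simulation, and two simple functions stand in for the VLM's proposal and refinement steps. It is not SIMPACT's actual pipeline.

```python
def simulate_push(push_speed, friction=0.5, g=9.81):
    """Distance (m) a block slides after release at push_speed (m/s),
    under constant Coulomb friction: d = v^2 / (2 * mu * g)."""
    return push_speed ** 2 / (2.0 * friction * g)


def plan_with_simulation(target_distance, propose, refine,
                         tolerance=0.01, max_iters=20):
    """Propose an action, roll it out in simulation, refine from the error."""
    action = propose(target_distance)
    for _ in range(max_iters):
        outcome = simulate_push(action)
        error = target_distance - outcome
        if abs(error) <= tolerance:
            break  # simulated outcome is close enough to the goal
        action = refine(action, error)
    return action, simulate_push(action)


# Hypothetical stand-ins for the reasoning model.
def initial_guess(target_distance):
    return 1.0  # m/s, an uninformed first proposal


def adjust(action, error):
    return action + 2.0 * error  # push harder if the block fell short


speed, reached = plan_with_simulation(0.5, initial_guess, adjust)
print(speed, reached)
```

The point of the sketch is that the planner never needs a closed-form inverse of the dynamics; it only needs to observe simulated rollouts and adjust its proposal.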

SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

First: 2025-12-05T18:50:48+00:00 · Latest: 2025-12-05T18:50:48+00:00

Abstract

We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.

中文标题/摘要

标题：SymPyBench：一种基于可执行Python代码的科学推理动态基准

我们引入了一个大规模合成基准，包含15,045个大学水平的物理问题（90/10% 训练/测试分割）。每个问题都是完全参数化的，支持无限范围的输入配置，并附有结构化的、分步骤的推理和可执行的Python代码，可以为任何参数集生成正确的解决方案。基准包含三种问题类型：MC-符号（多项选择题，带有符号选项）、MC-数值（多项选择题，带有数值选项）和自由形式（开放式回答）。这些多样的格式测试了互补的推理技能。通过利用基准的动态、代码驱动的性质，我们引入了三个新的评估指标，除了标准准确率之外，还有一致性分数、失败率和混淆率，这些指标量化了问题变体之间的变化性和不确定性。使用最先进的指令调优语言模型的实验揭示了科学推理中的强项和局限性，将SymPyBench定位为开发更稳健和可解释的推理系统的基础

Summary / 总结

SymPyBench is a large-scale synthetic benchmark consisting of 15,045 university-level physics problems, each fully parameterized and accompanied by structured reasoning and executable Python code for ground-truth solutions. The benchmark includes three question types: MC-Symbolic, MC-Numerical, and free-form. Novel evaluation metrics such as Consistency Score, Failure Rate, and Confusion Rate are introduced to assess variability and uncertainty. Experiments with state-of-the-art language models highlight both strengths and limitations in scientific reasoning.

SymPyBench 是一个包含 15,045 个大学水平物理问题的大规模合成基准，每个问题都完全参数化，并附有结构化的推理和生成真实解决方案的可执行 Python 代码。基准包括三种问题类型：MC-符号、MC-数值和开放式。引入了新的评估指标，如一致性分数、失败率和混淆率，以评估变异性和不确定性。实验表明，最先进的语言模型在科学推理方面既有优势也有局限性。
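
A parameterized problem with executable ground truth might look like the sketch below. The problem template, the function names, and the per-variant score are assumptions made for illustration. In particular, the score here is simply the fraction of parameter variants answered correctly, which may differ from the paper's definition of Consistency Score.

```python
import math


def range_problem(v0, angle_deg, g=9.81):
    """Return (question text, ground-truth answer) for one parameter set."""
    text = (f"A projectile is launched from level ground at {v0} m/s, "
            f"{angle_deg} degrees above the horizontal. "
            f"How far away does it land, in metres?")
    answer = v0 ** 2 * math.sin(2.0 * math.radians(angle_deg)) / g
    return text, answer


def variant_accuracy(solver, variants, rel_tol=1e-3):
    """Fraction of parameter variants the solver answers correctly."""
    hits = 0
    for v0, angle in variants:
        _, truth = range_problem(v0, angle)
        if math.isclose(solver(v0, angle), truth, rel_tol=rel_tol):
            hits += 1
    return hits / len(variants)


def flawed_solver(v0, angle_deg, g=9.81):
    # Uses sin(theta) instead of sin(2 * theta): a plausible model mistake.
    return v0 ** 2 * math.sin(math.radians(angle_deg)) / g


variants = [(10, 30), (20, 45), (15, 60)]
score = variant_accuracy(flawed_solver, variants)
print(score)
```

The flawed solver happens to be right at 60 degrees, because sin(60°) equals sin(120°). A single fixed instance at that angle would hide the error, which is the kind of brittleness that evaluating across variants is meant to expose.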

Impugan: Learning Conditional Generative Models for Robust Data Imputation

Authors: Zalish Mahmud, Anantaa Kotal, Aritran Piplai

First: 2025-12-05T18:46:33+00:00 · Latest: 2025-12-05T18:46:33+00:00

Abstract

Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82% lower Earth Mover's Distance (EMD) and 70% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: github.com/zalishmahmud/impuganBigData2025

中文标题/摘要

标题：Impugan：学习条件生成模型以实现稳健的数据插补

在实际应用中，不完整数据很常见。传感器故障，记录不一致，不同来源收集的数据在规模、采样率和质量上往往存在差异。这些差异导致了缺失值，使得数据难以结合并构建可靠的模型。标准插补方法如回归模型、期望最大化和多重插补依赖于线性和独立性的强假设。这些假设很少适用于复杂或异质数据，可能导致有偏或过度平滑的估计。我们提出了一种条件生成对抗网络（cGAN）Impugan，用于插补缺失值并整合异质数据集。该模型在完整样本上进行训练，以学习缺失变量如何依赖于观测变量。在推理过程中，生成器从可用特征中重建缺失项，判别器通过区分真实数据和插补数据来确保真实性。这种对抗过程使Impugan能够捕捉到常规方法无法表示的非线性和多模态关系。在基准数据集和多源整合任务上的实验表明，与领先基线相比，Impugan在地球移动力距离（EMD）和互信息偏差（MI）上分别降低了82%和70%。这些结果表明，对抗训练生成模型为插补和合并不完整、异质数据提供了一种可扩展且原理上的方法。我们的模型可在github.com/zalishmahmud/impuganBigData2025获取

Summary / 总结

The paper addresses the challenge of handling incomplete data in real-world applications by proposing Impugan, a conditional Generative Adversarial Network (cGAN) for robust data imputation. The model learns from complete samples to understand the dependencies between missing and observed variables, and during inference, it reconstructs missing values while a discriminator ensures the realism of the imputed data. Experiments show that Impugan outperforms existing methods, achieving up to 82% lower Earth Mover's Distance and 70% lower mutual-information deviation on benchmark datasets and multi-source integration tasks.

论文提出了一种基于生成对抗网络（cGAN）的条件生成模型Impugan，以应对实际应用中数据不完整的问题。该模型通过学习完整数据来理解缺失变量与观测变量之间的依赖关系，并在推断过程中重建缺失值，同时通过判别器确保生成数据的真实性。实验结果表明，Impugan 在基准数据集和多源数据集成任务上的表现优于现有方法，分别在地球搬运距离和互信息偏差上降低了高达 82% 和 70%。
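
Two small pieces of the pipeline described above lend themselves to a sketch: merging generator output into the observed entries, and measuring distributional fidelity with Earth Mover's Distance. The code is a hedged illustration only. `None` marks a missing value, the generator output is a made-up list, and the EMD shown is the one-dimensional, equal-sample-size special case, not necessarily the estimator used in the paper.

```python
def combine_observed_and_imputed(row, generated):
    """Keep every observed entry; fill each missing entry (None)
    with the generator's value at the same position."""
    return [g if x is None else x for x, g in zip(row, generated)]


def emd_1d(a, b):
    """Earth Mover's Distance between two equal-size 1-D samples.

    For equal-size empirical distributions on the real line, the optimal
    transport plan pairs the sorted values, so the distance is the mean
    absolute difference between order statistics.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


row = [1.0, None, 3.0]
generator_output = [9.0, 2.0, 9.0]  # hypothetical cGAN generator output
completed = combine_observed_and_imputed(row, generator_output)
print(completed)
print(emd_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

A lower EMD between an imputed column and the true column indicates that imputation preserved the column's distribution, rather than only matching values on average.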

Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

Authors: Truong Thanh Hung Nguyen, Truong Thinh Nguyen, Hung Cao

First: 2025-12-05T18:43:18+00:00 · Latest: 2025-12-05T18:43:18+00:00

Comments: Quantum Software Engineering Practices at The 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026)

Abstract

Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: https://github.com/Analytics-Everywhere-Lab/qtrl/.

中文标题/摘要

标题：变分量子彩虹深度Q网络用于优化资源分配问题

由于组合复杂性，资源分配仍然是NP难问题。尽管深度强化学习（DRL）方法，如彩虹深度Q网络（DQN），通过优先重放和分布性头部提高可扩展性，但经典函数逼近器限制了其表示能力。我们引入了变分量子彩虹DQN（VQR-DQN），将环状拓扑变分量子电路与彩虹DQN结合，利用量子叠加和纠缠。我们将人力资源分配问题（HRAP）建模为基于军官能力、事件时间表和转换时间的组合动作空间的马尔可夫决策过程（MDP）。在四个HRAP基准测试中，VQR-DQN相对于随机基线实现了26.8%的归一化周转时间减少，并且在双DQN和经典彩虹DQN的基础上分别提高了4.9%-13.4%。这些收益与电路可表达性、纠缠和策略质量之间的理论联系一致，展示了量子增强DRL在大规模资源分配中的潜力。我们的实现可在以下链接获取：https://github.com/Analytics-Everywhere-Lab/qtrl/。

Summary / 总结

The paper addresses the NP-hard resource allocation problem by proposing Variational Quantum Rainbow DQN (VQR-DQN), which combines ring-topology variational quantum circuits with Rainbow DQN. The authors frame the human resource allocation problem as a Markov decision process and evaluate VQR-DQN on four benchmarks, achieving a 26.8% reduction in normalized makespan compared to random baselines and outperforming Double DQN and classical Rainbow DQN by 4.9-13.4%. These results suggest that quantum-enhanced DRL can improve resource allocation in large-scale scenarios.

论文通过引入结合环状拓扑变量子电路和Rainbow DQN的Variational Quantum Rainbow DQN (VQR-DQN)，解决NP难的资源分配问题，提升表示能力。作者将人力资源分配问题建模为马尔可夫决策过程，并在四个基准上评估VQR-DQN，实现了26.8%的标准化工期减少，相比随机基线，并且在与经典Rainbow DQN和Double DQN的对比中分别提高了4.9%-13.4%。这些结果表明，量子增强的DRL可以在大规模资源分配任务中提高可扩展性和性能。

Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Authors: Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu

First: 2025-12-05T18:39:12+00:00 · Latest: 2025-12-05T18:39:12+00:00

Comments: Code is available at https://github.com/Princeton-AI2-Lab/ZoomClick

Abstract

Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.

中文标题/摘要

标题：放大缩小，点击退出：解锁和评估缩放技术在GUI语义理解中的潜力

语义理解是构建图形用户界面（GUI）代理的基本能力。尽管现有方法依赖于大规模边界框监督，但仍面临各种挑战，如跨平台通用性、复杂布局分析和细粒度元素定位。在本文中，我们研究了缩放作为GUI语义理解的强大但尚未充分探索的先验，并提出了一种无需训练的方法ZoomClick。通过表征缩放的四个关键属性（即预缩放、深度、缩小尺寸、最小裁剪尺寸），我们解锁了其动态空间聚焦和自适应上下文切换的全部能力。实验表明，我们的方法显著提升了通用视觉-语言模型和专门的GUI语义理解模型的性能，在多个主流基准上取得了最先进的结果；例如，UI-Venus-72B在ScreenSpot-Pro上的成功率为73.1%。此外，我们提出了GUIZoom-Bench，这是一个用于评估模型对缩放适应性的基准，旨在激发未来研究以提高缩放在GUI语义理解任务中的训练和测试扩展。

Summary / 总结

This paper addresses the challenges of GUI grounding by treating zoom as a strong yet underexplored prior. It introduces ZoomClick, a training-free method that characterizes four key properties of zoom (pre-zoom, depth, shrink size, and minimal crop size) to enable dynamic spatial focusing and adaptive context switching. Experimental results show that ZoomClick significantly improves the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Additionally, the paper introduces GUIZoom-Bench to evaluate model adaptability to zoom, promoting future research in this area.

本文通过利用缩放的潜在价值来解决GUI定位的挑战。它提出了一个无需训练的方法ZoomClick，通过四个关键属性来增强动态空间聚焦和自适应上下文切换。实验结果表明，ZoomClick 显著提高了通用视觉-语言模型和专门的GUI定位模型的性能，例如在ScreenSpot-Pro上的成功率为73.1%。此外，该论文还引入了GUIZoom-Bench来评估模型对缩放的适应性，推动了该领域的未来研究。
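The zoom parameters named in the abstract (depth, shrink size, minimal crop size) can be illustrated with a small coarse-to-fine loop. This is a minimal sketch under stated assumptions, not the ZoomClick algorithm itself: `predict` is a hypothetical stand-in for a grounding model, `make_oracle` is a toy predictor used only for demonstration, and the centering and clamping rules are illustrative choices.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def zoom_and_click(
    predict: Callable[[Box], Point],
    width: float,
    height: float,
    depth: int = 3,
    shrink: float = 0.5,
    min_crop: float = 64.0,
) -> Point:
    """Refine a click by repeatedly cropping around the current prediction.

    `predict` (hypothetical) receives a crop and returns a click expressed
    as fractions (0..1) of that crop's width and height.
    """
    box: Box = (0.0, 0.0, float(width), float(height))
    point: Point = (width / 2.0, height / 2.0)
    for _ in range(depth):
        left, top, right, bottom = box
        box_w, box_h = right - left, bottom - top
        rel_x, rel_y = predict(box)
        # Map the crop-relative click back to full-screenshot coordinates.
        point = (left + rel_x * box_w, top + rel_y * box_h)
        # Shrink the crop, but never below the minimal crop size.
        new_w = min(box_w, max(box_w * shrink, min_crop))
        new_h = min(box_h, max(box_h * shrink, min_crop))
        if new_w >= box_w and new_h >= box_h:
            break  # the crop cannot shrink any further
        # Center the next crop on the click and clamp it to the screenshot.
        new_left = min(max(point[0] - new_w / 2.0, 0.0), width - new_w)
        new_top = min(max(point[1] - new_h / 2.0, 0.0), height - new_h)
        box = (new_left, new_top, new_left + new_w, new_top + new_h)
    return point


def make_oracle(target: Point) -> Callable[[Box], Point]:
    """Toy predictor that always locates `target` exactly inside the crop."""
    def predict(box: Box) -> Point:
        left, top, right, bottom = box
        return ((target[0] - left) / (right - left),
                (target[1] - top) / (bottom - top))
    return predict
```

With a real model, each zoom step trades surrounding context for effective resolution, which is why depth and the minimal crop size act as the main knobs in a loop like this.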

Measuring the Effect of Background on Classification and Feature Importance in Deep Learning for AV Perception

Authors: Anne Sielemann, Valentin Barner, Stefan Wolf, Masoud Roschani, Jens Ziehn, Juergen Beyerer

First: 2025-12-05T18:25:52+00:00 · Latest: 2025-12-05T18:25:52+00:00

Comments: 8 pages, 2 figures, 7 tables

Abstract

Common approaches to explainable AI (XAI) for deep learning focus on analyzing the importance of input features on the classification task in a given model: saliency methods like SHAP and GradCAM are used to measure the impact of spatial regions of the input image on the classification result. Combined with ground truth information about the location of the object in the input image (e.g., a binary mask), it is determined whether object pixels had a high impact on the classification result, or whether the classification focused on background pixels. The former is considered to be a sign of a healthy classifier, whereas the latter is assumed to suggest overfitting on spurious correlations. A major challenge, however, is that these intuitive interpretations are difficult to test quantitatively, and hence the output of such explanations lacks an explanation itself. One particular reason is that correlations in real-world data are difficult to avoid, and whether they are spurious or legitimate is debatable. Synthetic data, in turn, make it possible to actively enable or disable correlations where desired, but often lack a sufficient quantification of realism and stochastic properties. [...] Therefore, we systematically generate six synthetic datasets for the task of traffic sign recognition, which differ only in their degree of camera variation and background correlation [...] to quantify the isolated influence of background correlation, different levels of camera variation, and considered traffic sign shapes on the classification performance, as well as background feature importance. [...] Results include a quantification of when and how much background features gain importance to support the classification task based on changes in the training domain [...]. Download: synset.de/datasets/synset-signset-ger/background-effect

中文标题/摘要

标题：基于深度学习的自动驾驶感知中背景对分类和特征重要性影响的测量

解释性人工智能（XAI）的常见方法侧重于分析输入特征在给定模型中的分类任务的重要性：使用SHAP和GradCAM等显著性方法来衡量输入图像的空间区域对分类结果的影响。结合输入图像中对象位置的真实信息（例如，二值掩码），可以确定对象像素是否对分类结果有重大影响，或者分类是否集中在背景像素上。前者被认为是健康分类器的标志，而后者则被认为表明了对虚假相关性的过度拟合。然而，一个主要挑战是，这些直观的解释难以定量测试，因此这种解释的输出本身缺乏解释。一个特别的原因是，现实世界数据中的相关性难以避免，它们是虚假的还是合法的尚有争议。相反，合成数据可以主动启用或禁用所需的相关性，但通常缺乏足够的现实性和随机性量化 [... ] 因此，我们系统地生成了六个用于交通标志识别任务的合成数据集，这些数据集仅在相机变化程度和背景相关性方面有所不同 [... ] 以量化背景相关性、不同水平的相机变化以及考虑的交通标志形状对分类性能以及背景特征重要性的影响 [... ] 结果包括对背景特征在分类任务中获得重要性的时间和程度的量化 [... ]

Summary / 总结

This study assesses how image background affects classification and feature importance in deep learning for autonomous vehicle perception. The authors generate six synthetic traffic sign recognition datasets that differ only in their degree of camera variation and background correlation, which isolates the influence of each factor. The results quantify when and how much background features gain importance for the classification task as the training domain changes.

该研究旨在评估背景对自动驾驶车辆感知中深度学习分类和特征重要性的影响。作者生成了六个不同相机变化和背景相关性的合成数据集，以系统地评估这些因素。主要发现表明，背景特征在分类任务中可以显著影响分类性能，尤其是在训练领域发生变化时，强调了在模型可解释性中考虑背景相关性的必要性。

Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition

Authors: Anne Sielemann, Lena Loercher, Max-Lion Schumacher, Stefan Wolf, Masoud Roschani, Jens Ziehn

First: 2025-12-05T18:24:07+00:00 · Latest: 2025-12-05T18:24:07+00:00

Comments: 8 pages, 8 figures, 3 tables

Abstract

In this paper, we present a synthesis pipeline and dataset for training / testing data in the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset Synset Signset Germany contains a total of 105500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata including the stochastically selected environment and imaging effect parameters for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.

中文标题/摘要

标题：Synset Signset 德国：一种用于德国交通标志识别的合成数据集

在本文中，我们提出了一种合成管道和数据集，用于交通标志识别任务中的训练/测试数据，结合了数据驱动和分析建模的优点：基于GAN的纹理生成使数据驱动的污损和磨损效果成为可能，渲染出独特且逼真的交通标志表面，而分析场景调制实现了物理上正确的照明，并允许详细的参数化。特别是，后者由于可以评估参数变化的敏感性而为可解释人工智能（XAI）和鲁棒性测试打开了应用，我们在实验中演示了这一点。我们得到的合成交通标志识别数据集Synset Signset 德国包含总共105500张211种不同德国交通标志类别的图像，包括2020年新发布的（因此相对罕见）交通标志。除了掩码和分割图像外，我们还提供了每张图像的广泛元数据，包括随机选择的环境和成像效果参数。我们在真实的德国交通标志识别基准（GTSRB）上评估了Synset Signset 德国的逼真度，并将其与CATERED（一种最先进的合成交通标志识别数据集）进行了比较。

Summary / 总结

This paper introduces Synset Signset Germany, a synthetic dataset for German traffic sign recognition that combines GAN-based texture generation for realistic dirt and wear artifacts with analytical scene modulation for physically correct lighting. The dataset includes 105,500 images of 211 German traffic sign classes, with a mask, a segmentation image, and detailed metadata for each image. Its degree of realism is evaluated on the real-world GTSRB benchmark and in comparison to CATERED, a state-of-the-art synthetic dataset. Because the pipeline is parameterizable, the dataset is particularly useful for explainable AI and robustness tests.

本文介绍了Synset Signset Germany，这是一个结合了基于GAN的纹理生成和分析场景调制的合成数据集，用于德国交通标志识别。该数据集包含105,500张211种不同德国交通标志的图像，每张图像都有详细的元数据。作者在真实的GTSRB基准上评估了该数据集的逼真度，并与最先进的合成数据集CATERED进行了比较。由于其参数化能力，该数据集特别适用于可解释AI和鲁棒性测试。

On the Bayes Inconsistency of Disagreement Discrepancy Surrogates

Authors: Neil G. Marchant, Andrew C. Cullen, Feng Liu, Sarah M. Erfani

First: 2025-12-05T18:16:03+00:00 · Latest: 2025-12-05T18:16:03+00:00

Comments: 37 pages, 7 figures

Abstract

Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on disagreement discrepancy, a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.

中文标题/摘要

标题：关于分歧不一致性代理的贝叶斯不一致性

由于分布偏移，深度神经网络在实际应用中常常失效，这是构建安全可靠系统的一个关键障碍。一种解决这一问题的方法是利用\emph{分歧不一致性}——衡量两种模型在分布偏移下分歧变化的度量。最大化这一度量的过程已被应用于误差边界估计、有害偏移检测以及训练更稳健的模型。然而，这一优化过程涉及非可微的零一损失，因此需要使用实用的代理损失。我们证明了现有的分歧不一致性代理不是贝叶斯一致的，揭示了一个根本缺陷：最大化这些代理并不能最大化真正的分歧不一致性。为解决这一问题，我们引入了新的理论结果，提供了这些代理优化差距的上下界。受该理论的指导，我们提出了一种新的分歧损失，当与交叉熵结合使用时，可以证明一致地代理分歧不一致性。在多种基准上的实证评估表明，我们的方法在分歧不一致性估计的准确性和鲁棒性方面优于现有方法，尤其是在具有挑战性的对抗条件下。

Summary / 总结

The paper studies disagreement discrepancy, a measure used to handle distribution shift in deep neural networks. It proves that existing surrogates for disagreement discrepancy are not Bayes consistent, so maximizing them can fail to maximize the true disagreement discrepancy. The authors derive upper and lower bounds on the optimality gap of such surrogates and introduce a novel disagreement loss that, when combined with cross-entropy, provides a provably consistent surrogate. Empirical evaluations show that this method offers more accurate and robust estimates of disagreement discrepancy, especially under adversarial conditions.

论文针对深度神经网络在实际应用中由于分布偏移导致的问题，关注于使用分歧不一致性作为解决方法。研究证明现有分歧不一致性替代方法并非贝叶斯一致的，可能导致不准确。为此，作者提出了一种新的分歧损失，与交叉熵结合后，提供了一种一致的分歧不一致性替代方法。实验证明，该方法在对抗条件下提供了更准确和稳健的分歧不一致性估计。
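To make the quantity concrete, the sketch below computes an empirical disagreement discrepancy from hard predictions. It assumes one common formalization (disagreement rate on the shifted data minus disagreement rate on the source data); the abstract describes the measure only informally, so the exact definition and the function names here are assumptions. This is the zero-one version whose non-differentiability motivates surrogates, not the paper's proposed loss.

```python
from typing import Sequence


def disagreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of examples on which two models predict different labels."""
    if len(preds_a) != len(preds_b) or len(preds_a) == 0:
        raise ValueError("prediction lists must be non-empty and equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def disagreement_discrepancy(
    h_source: Sequence[int],
    g_source: Sequence[int],
    h_target: Sequence[int],
    g_target: Sequence[int],
) -> float:
    """Assumed definition: target disagreement minus source disagreement."""
    return (disagreement_rate(h_target, g_target)
            - disagreement_rate(h_source, g_source))
```

Because this quantity is piecewise constant in the model parameters, it gives no gradient signal, which is why a differentiable surrogate is needed when a second model is trained to maximize it.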

PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation

Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi

First: 2025-12-05T18:14:55+00:00 · Latest: 2025-12-05T18:14:55+00:00

Abstract

Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.

中文标题/摘要

标题：PRiSM：基于Python验证的科学推理多模态基准

在数学和物理学等科学领域评估视觉-语言模型（VLMs）带来了独特的挑战，远超出了预测最终答案的范围。这些领域需要概念理解、符号推理和遵守正式法则，而现有的大多数基准未能解决这些问题。特别是，当前的数据集往往是静态的，缺乏中间推理步骤、对变化的鲁棒性或验证科学正确性的机制。为了解决这些局限性，我们引入了PRiSM，这是一个合成的、完全动态的、多模态的基准，用于通过嵌入的Python代码评估科学推理。PRiSM 包含超过24,750个大学级别的物理和数学问题，并利用我们的可扩展基于代理的管道PrismAgent生成结构良好的问题实例。每个问题包含动态的文本和视觉输入、生成的图表，以及丰富的结构化输出：用于生成和验证真实结果的可执行Python代码，以及详细的逐步推理。基准的动态性质和基于Python的自动化真实结果生成允许对多模态VLMs进行精细的实验审计，揭示推理失败模式、不确定性行为和科学推理中的局限性。为此，我们提出了五个有针对性的评估任务，涵盖泛化、符号程序合成、扰动鲁棒性、推理纠正和歧义解决。通过全面评估现有的VLMs，我们指出了它们的局限性，并展示了PRiSM如何使我们更深入地了解它们的科学推理能力。

Summary / 总结

PRiSM evaluates vision-language models in scientific domains through a synthetic, dynamic, multimodal benchmark grounded in Python code. It addresses the limitations of static benchmarks by providing dynamic problem instances together with rich structured outputs: executable Python code for ground-truth generation and verification, and step-by-step reasoning. Five evaluation tasks cover generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution; evaluating existing VLMs on them exposes failure modes, uncertainty behaviors, and limitations in scientific reasoning.

PRiSM 通过引入基于 Python 代码的动态多模态基准来评估科学领域的视觉-语言模型，解决了现有基准的局限性，提供了丰富的结构化输出和动态问题实例。关键发现包括通过细粒度的实验审计揭示 VLM 在科学推理中的失败模式、不确定性行为和局限性。
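The idea of a dynamic problem whose ground truth comes from executable code can be sketched in a few lines. Everything below is hypothetical and far simpler than the benchmark described above: the `Problem` record, the kinematics template, and the `verify` tolerance are illustrative assumptions, not PRiSM's actual schema or pipeline.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict


@dataclass
class Problem:
    text: str               # natural-language statement shown to the model
    params: Dict[str, int]  # sampled parameters, kept for auditing
    answer: float           # ground truth computed by code, not written by hand


def ground_truth(u: int, a: int, t: int) -> float:
    """Final speed under constant acceleration: v = u + a * t."""
    return float(u + a * t)


def make_problem(seed: int) -> Problem:
    """Sample a fresh instance; the same seed always yields the same problem."""
    rng = random.Random(seed)
    params = {"u": rng.randint(0, 20), "a": rng.randint(1, 5), "t": rng.randint(1, 10)}
    text = (f"A cart moving at {params['u']} m/s accelerates at "
            f"{params['a']} m/s^2 for {params['t']} s. "
            "What is its final speed in m/s?")
    return Problem(text=text, params=params, answer=ground_truth(**params))


def verify(problem: Problem, model_answer: float, rel_tol: float = 1e-6) -> bool:
    """Check a model's numeric answer against the executable ground truth."""
    return math.isclose(model_answer, problem.answer, rel_tol=rel_tol)
```

Resampling the seed yields perturbed variants of the same problem, which is the kind of mechanism a perturbation-robustness task can build on.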

A Comparative Study on Synthetic Facial Data Generation Techniques for Face Recognition

Authors: Pedro Vidal, Bernardo Biesseck, Luiz E. L. Coelho, Roger Granada, David Menotti

First: 2025-12-05T18:11:29+00:00 · Latest: 2025-12-05T18:11:29+00:00

Comments: 18 pages, 17 figures

Abstract

Facial recognition has become a widely used method for authentication and identification, with applications for secure access and locating missing persons. Its success is largely attributed to deep learning, which leverages large datasets and effective loss functions to learn discriminative features. Despite these advances, facial recognition still faces challenges in explainability, demographic bias, privacy, and robustness to aging, pose variations, lighting changes, occlusions, and facial expressions. Privacy regulations have also led to the degradation of several datasets, raising legal, ethical, and privacy concerns. Synthetic facial data generation has been proposed as a promising solution. It mitigates privacy issues, enables experimentation with controlled facial attributes, alleviates demographic bias, and provides supplementary data to improve models trained on real data. This study compares the effectiveness of synthetic facial datasets generated using different techniques in facial recognition tasks. We evaluate accuracy, rank-1, rank-5, and the true positive rate at a false positive rate of 0.01% on eight leading datasets, offering a comparative analysis not extensively explored in the literature. Results demonstrate the ability of synthetic data to capture realistic variations while emphasizing the need for further research to close the performance gap with real data. Techniques such as diffusion models, GANs, and 3D models show substantial progress; however, challenges remain.

中文标题/摘要

标题：合成面部数据生成技术在面部识别中的比较研究

面部识别已成为广泛应用于认证和识别的方法，应用于安全访问和寻找失踪人员。其成功主要归功于深度学习，它利用大量数据集和有效的损失函数来学习区分性特征。尽管取得了这些进展，面部识别仍然面临可解释性、人口统计偏差、隐私和对老化、姿态变化、光照变化、遮挡和面部表情的鲁棒性等方面的挑战。隐私法规也导致了多个数据集的退化，引发了法律、伦理和隐私方面的担忧。合成面部数据生成已被提议作为一种有前景的解决方案。它缓解了隐私问题，使实验能够控制面部属性，减轻人口统计偏差，并提供补充数据以提高基于真实数据训练的模型。本研究比较了使用不同技术生成的合成面部数据集在面部识别任务中的有效性。我们在八个多领先的数据集上评估了准确率、排名1和排名5以及假阳性率为0.01%时的真正阳性率，提供了文献中未广泛探讨的比较分析。结果表明合成数据能够捕捉到现实的变异性，但强调了需要进一步研究以缩小与真实数据的性能差距。诸如扩散模型、GANs和3D模型等技术取得了显著进展；然而，仍存在挑战。

Summary / 总结

This study investigates the effectiveness of synthetic facial data generation techniques in improving facial recognition accuracy. Motivated by the challenges of real data, such as privacy concerns and demographic bias, the research compares various synthetic data generation methods including diffusion models, GANs, and 3D models. The evaluation on eight leading datasets shows that synthetic data can capture realistic variations, but there is still a performance gap compared to real data, highlighting the need for further research to enhance synthetic data quality and applicability.

该研究评估了不同合成面部数据生成技术在面部识别任务中的有效性。受隐私、人口统计偏差和鲁棒性等方面的限制，研究旨在改善真实数据集的不足，评估包括扩散模型、GAN和3D模型在内的多种合成数据生成方法。研究发现，合成数据能够捕捉到真实的变异并提升模型性能，但仍与真实数据存在差距，表明需要进一步研究以缩小这一差距。
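One of the reported metrics, the true positive rate at a fixed false positive rate, can be computed from similarity scores as sketched below. This is a minimal illustration under assumptions: higher scores mean "same identity", a pair is accepted when its score is strictly above the threshold, and the function name is hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence


def tpr_at_fpr(
    genuine: Sequence[float],
    impostor: Sequence[float],
    target_fpr: float,
) -> float:
    """TPR at the strictest threshold whose FPR does not exceed `target_fpr`.

    `genuine` holds scores of same-identity pairs and `impostor` holds
    scores of different-identity pairs.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(impostor, reverse=True)
    # Number of impostor pairs that may be (wrongly) accepted.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    # Accepting only scores strictly above ranked[allowed] admits at most
    # `allowed` impostor pairs.
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    accepted = sum(score > threshold for score in genuine)
    return accepted / len(genuine)
```

At a false positive rate of 0.01%, at least 10,000 impostor pairs are needed before even one false accept is permitted, so small evaluation sets effectively measure the zero-false-accept operating point.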

World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

Authors: Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar

First: 2025-12-05T18:06:18+00:00 · Latest: 2025-12-05T18:06:18+00:00

Abstract

Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.

中文标题/摘要

标题：世界模型知其不知：基于校准不确定性可控视频生成

生成视频模型的最新进展在高保真视频合成方面取得了重大突破，特别是在基于文本和动作输入的可控视频生成方面，例如在指令引导的视频编辑和机器人世界建模中。尽管这些模型具有出色的能力，但它们经常产生与物理现实不符的未来视频帧，这在许多任务中（如机器人策略评估和规划）引发了严重关切。然而，最先进的视频模型缺乏评估和表达其置信度的能力，阻碍了幻觉的缓解。为严格应对这一挑战，我们提出了C3，一种用于训练连续尺度校准可控视频模型的不确定性量化（UQ）方法，以在子块级别进行密集置信估计，精确地定位每个生成视频帧中的不确定性。我们的UQ方法引入了三项核心创新，以使视频模型能够估计其不确定性。首先，我们的方法开发了一种新的框架，通过严格恰当评分规则训练视频模型以确保正确性和校准。其次，我们估计了视频模型在潜在空间中的不确定性，避免了像素空间方法相关的训练不稳定性和高昂的训练成本。第三，我们将密集的潜在空间不确定性映射到可解释的RGB空间像素级不确定性，以直观可视化，提供高分辨率的不确定性热图，以识别不可信区域。通过在大规模机器人学习数据集（Bridge和DROID）上的广泛实验和现实世界评估，我们证明了我们的方法不仅提供了在训练分布内的校准不确定性估计，还实现了有效的离分布检测。

Summary / 总结

The research addresses hallucination in controllable video generation models by developing a method to quantify and express their uncertainty. The method, C3, makes three contributions: it trains video models for correctness and calibration with strictly proper scoring rules, it estimates uncertainty in latent space to avoid the training instability and cost of pixel-space approaches, and it maps that uncertainty to pixel-level heatmaps for visualization. Experiments on Bridge and DROID show that C3 provides calibrated uncertainty estimates and enables effective out-of-distribution detection.

该论文通过提出C3不确定性量化方法，解决了可控视频生成模型中的幻觉问题。C3训练视频模型估计生成帧的置信度，避免了像素空间训练的不稳定性，并提供了可解释的不确定性热图。实验表明，C3能够提供校准的不确定性估计和有效的离分布检测。
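Why strictly proper scoring rules encourage calibration can be checked numerically. The sketch below uses the Brier score as an illustrative choice; the abstract does not say which scoring rule C3 uses, so the rule, the grid search, and the function names are assumptions. The point is only that the expected score is minimized by reporting the true probability.

```python
def expected_brier(p_true: float, q: float) -> float:
    """Expected Brier score when an event has probability `p_true`
    and the model reports confidence `q`."""
    return p_true * (1.0 - q) ** 2 + (1.0 - p_true) * q ** 2


def best_report(p_true: float, grid_size: int = 101) -> float:
    """Confidence on a uniform grid over [0, 1] that minimizes the expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_brier(p_true, q))
```

Setting the derivative of the expected score with respect to `q` to zero gives `q = p_true`, so a model trained on such a loss gains nothing by being over- or under-confident.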

Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Authors: Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour

First: 2025-06-03T10:53:19+00:00 · Latest: 2025-12-05T17:47:02+00:00

Comments: 21 pages

Abstract

In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often only partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity biomedical dataset comprising 18 million image-text pairs, spanning radiology, microscopy, and visible light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set new state-of-the-art results in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

中文标题/摘要

标题：Open-PMC-18M：多模态表示学习的高保真大规模医学数据集

在生物医学视觉-语言建模中，数据通常是从科学文献中挖掘出来的，将复合图与短、上下文依赖且经常部分信息的描述配对。先前的子图提取工作在数据集规模和泛化能力上都受到限制。此外，现有的努力没有在图像-文本对中纳入丰富的医学上下文。我们重新审视数据整理作为有效生物医学表示学习基础组件的重要性。我们的数据整理过程结合了基于变换器的子图检测、子描述提取以及来自内联参考的上下文文本丰富。我们的子图提取模型在50万复合图的语料库上训练，实现了在真实和合成基准上的最佳性能。通过这一过程，我们整理并发布了Open-PMC-18M，这是一个包含1800万图像-文本对的大规模高保真生物医学数据集，涵盖了放射学、显微镜和可见光摄影。我们在该数据集上训练视觉-语言模型，并在三个主要模态的六个检索和十九个零样本分类任务上进行了广泛的评估。在我们数据集上训练的模型在医学表示学习中取得了新的最佳结果。我们发布了数据集、模型和代码，以支持可重复的基准测试并进一步研究生物医学视觉-语言建模和表示学习。

Summary / 总结

The research aims to address the limitations of existing biomedical vision-language datasets by curating a large-scale, high-fidelity dataset, Open-PMC-18M, which includes 18 million image-text pairs from radiology, microscopy, and visible light photography. The dataset is curated using a transformer-based subfigure detection and subcaption extraction model, and enriched with contextual text from inline references. Models trained on this dataset achieve state-of-the-art results in medical representation learning across six retrieval and 19 zero-shot classification tasks. The dataset, models, and code are released to support further research.

研究旨在通过创建一个大规模、高保真度的数据集来解决现有生物医学视觉-语言数据集的局限性。方法包括使用基于变压器的模型进行子图检测和子图标题提取，并结合来自内联参考的上下文文本。Open-PMC-18M数据集包含来自放射学、显微镜和可见光摄影的1800万张图像-文本对。在该数据集上训练的模型在各种任务中达到了生物医学表示学习的最新成果。

Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives

Authors: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang

First: 2025-10-06T16:34:09+00:00 · Latest: 2025-12-05T17:41:34+00:00

Comments: 27 pages, 10 figures

Abstract

Reinforcement learning (RL) for large language model reasoning is frequently hindered by signal loss, a phenomenon where standard uniform sampling with small group sizes fails to uncover informative learning signals for difficult prompts. We demonstrate that this collapse is a statistical artifact of undersampling rather than an inherent model limitation. To address this systematically, we introduce a theoretical framework based on optimizing a non-linear RL objective (e.g., log-likelihood). We show that this objective naturally induces a weighted gradient estimator that prioritizes difficult prompts, which can be robustly realized through adaptive sampling. Guided by this framework, we propose Reinforce-Ada, a family of algorithms that dynamically allocates inference budgets based on prompt difficulty, effectively scaling up RL compute to where it is needed most. Unlike passive filtering methods that discard low-signal prompts, Reinforce-Ada actively invests compute to recover them. We introduce two efficient realizations: an estimation-based approach and a model-free sequential sampling approach. Extensive experiments across multiple benchmarks show that Reinforce-Ada significantly outperforms uniform baselines like GRPO, recovering lost signals and accelerating convergence by up to 2x while maintaining the same total inference budget. Code is available at https://github.com/RLHFlow/Reinforce-Ada.

中文标题/摘要

标题：Reinforce-Ada：非线性RL目标下的自适应采样框架

大规模语言模型推理中的强化学习（RL）经常受到信号丢失的阻碍，这是一种现象，其中标准的均匀采样和小组大小无法揭示困难提示的信息性学习信号。我们证明这种崩溃是由于采样不足的统计现象，而不是模型本身的局限性。为系统地解决这一问题，我们基于优化非线性RL目标（例如对数似然）引入了一个理论框架。我们展示了该目标自然地诱导了一个加权梯度估计器，优先处理困难的提示，这可以通过自适应采样稳健地实现。根据这一框架，我们提出了Reinforce-Ada，这是一种算法家族，根据提示的难度动态分配推理预算，有效地将RL计算扩展到最需要的地方。与被动过滤方法不同，后者会丢弃低信号提示，Reinforce-Ada积极地投资计算以恢复这些提示。我们引入了两种高效的实现方式：基于估计的方法和无模型的顺序采样方法。在多个基准测试中的广泛实验表明，Reinforce-Ada显著优于均匀基线（如GRPO），恢复了丢失的信号并加速了收敛，最多可提高2倍，同时保持相同的总推理预算。代码可在https://github.com/RLHFlow/Reinforce-Ada/ 获取。

Summary / 总结

The paper addresses the issue of signal loss in reinforcement learning for large language models, where uniform sampling fails to uncover informative signals for difficult prompts. It introduces Reinforce-Ada, an adaptive sampling framework that dynamically allocates inference budgets based on prompt difficulty, optimizing a non-linear RL objective. Experiments show that Reinforce-Ada outperforms uniform baselines, recovering lost signals and accelerating convergence by up to 2 times while maintaining the same total inference budget.

论文解决了大规模语言模型中强化学习中信号丢失的问题，统一采样无法为困难提示揭露有用信号。它引入了Reinforce-Ada，一种动态分配推理预算的自适应采样框架，基于提示难度优化非线性RL目标。实验表明，Reinforce-Ada优于均匀基线，能够恢复丢失的信号并加速收敛，最多可提高2倍的收敛速度，同时保持相同的总推理预算。
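The undersampling argument can be made concrete with a short calculation and a toy sampler. The sketch below assumes binary rewards; `signal_loss_probability` gives the chance that a uniformly sampled group has identical rewards (and so carries no contrast to learn from), and `sample_until_signal` is a hypothetical illustration of sequential sampling, not the Reinforce-Ada implementation.

```python
from typing import Callable, List


def signal_loss_probability(pass_rate: float, group_size: int) -> float:
    """Probability that all `group_size` binary rewards for a prompt agree,
    given a per-sample success probability `pass_rate`."""
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size


def sample_until_signal(
    sample_reward: Callable[[], int],
    round_size: int = 4,
    max_rounds: int = 8,
) -> List[int]:
    """Keep drawing rounds of responses for one prompt until both a success
    and a failure have been seen, or the per-prompt budget runs out.

    `sample_reward` (hypothetical) generates one response and returns 0 or 1.
    """
    rewards: List[int] = []
    for _ in range(max_rounds):
        rewards.extend(sample_reward() for _ in range(round_size))
        if 0 < sum(rewards) < len(rewards):
            break  # mixed outcomes: the prompt now yields a learning signal
    return rewards
```

For a hard prompt with a 5% pass rate and a group of 4, `signal_loss_probability(0.05, 4)` is about 0.81, so most uniformly sampled groups are uninformative; spending extra samples only on such prompts is the intuition behind difficulty-aware budget allocation.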

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Authors: Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang

First: 2025-12-05T17:38:55+00:00 · Latest: 2025-12-05T17:38:55+00:00

Abstract

Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

中文标题/摘要

标题：SCAIL：通过上下文学习三维一致的姿态表示以实现工作室级角色动画

尽管近期取得了一些进展，但实现符合工作室级生产标准的角色动画仍然具有挑战性。现有方法可以从驱动视频中转移动作到参考图像，但在涉及复杂动作和跨身份动画的野生场景中，往往无法保持结构保真度和时间一致性。在本工作中，我们提出了**SCAIL**（**S**tudio-grade **C**haracter **A**nimation via **I**n-context **L**earning），一种旨在通过两大创新来解决这些挑战的框架。首先，我们提出了一种新颖的三维姿态表示，提供了一个更稳健和灵活的动作信号。其次，我们引入了一种全上下文姿态注入机制，结合在扩散-变换器架构中，能够有效地进行时空推理。为了满足工作室级的要求，我们开发了一个精心策划的数据管道，确保多样性和质量，并建立了全面的基准以系统评估。实验表明，**SCAIL** 达到了最先进的性能，并推动了角色动画向工作室级可靠性和真实性的进步。

Summary / 总结

SCAIL is designed to improve character animation by addressing the challenges of structural fidelity and temporal consistency in complex motion scenarios. It introduces a 3D pose representation and a full-context pose injection mechanism within a diffusion-transformer architecture. Experiments demonstrate that SCAIL outperforms existing methods and brings character animation closer to studio-grade reliability and realism.

SCAIL 是一个框架，旨在通过解决结构保真度和时间一致性的问题来提升角色动画。它引入了一种新的3D姿态表示和一种在扩散变换器架构内的全上下文姿态注入机制。实验表明，SCAIL 在性能上超越了现有方法，并使角色动画更接近于工作室级别的可靠性和真实性。

Joint Self-Supervised Video Alignment and Action Segmentation

Authors: Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

Venue: ICCV 2025

First: 2025-03-21T04:02:00+00:00 · Latest: 2025-12-05T17:27:34+00:00

Comments: Accepted to ICCV 2025

Abstract

We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.

中文标题/摘要

标题：联合自监督视频对齐和动作分割

我们提出了一种基于统一最优传输框架的新型同时自监督视频对齐和动作分割方法。特别地，我们首先通过开发结合结构先验的融合Gromov-Wasserstein最优传输公式来解决自监督视频对齐问题，该方法在GPU上高效训练，并且只需要少量迭代即可解决最优传输问题。我们的单任务方法在多个视频对齐基准测试中达到了最先进的性能，并且优于依赖传统Kantorovich最优传输公式和最优性先验的VAVA方法。此外，我们通过提出一种统一的最优传输框架来联合解决自监督视频对齐和动作分割问题，这种方法只需要训练和存储一个模型，与两个不同的单任务模型相比，节省了时间和内存消耗。在多个视频对齐和动作分割数据集上的广泛评估表明，我们的多任务方法在视频对齐方面达到了可比的结果，而在动作分割方面则优于之前的方法。最后，据我们所知，这是首次将视频对齐和动作分割统一到一个模型中的工作。我们的代码可在我们的研究网站上获得：https://retrocausal.ai/research/。

Summary / 总结

The paper introduces a novel approach for joint self-supervised video alignment and action segmentation using a unified optimal transport framework. It first addresses video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, achieving state-of-the-art performance on multiple benchmarks and outperforming VAVA. The approach is then extended to a unified framework for joint alignment and segmentation, which trains and stores a single model and achieves comparable alignment yet superior action segmentation results relative to previous methods. To the best of the authors' knowledge, this is the first work to unify these two tasks into a single model.

该论文提出了一种使用统一最优传输框架的新型方法，用于同时进行自监督视频对齐和动作分割。该方法引入了一种结合Gromov-Wasserstein最优传输形式和结构先验的高效视频对齐训练方法。多任务模型在视频对齐方面取得了与先前方法相当的结果，并在动作分割方面优于先前方法。据作者所知，这是首次将视频对齐和动作分割统一到一个模型中的工作。
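For readers unfamiliar with the term, the standard fused Gromov-Wasserstein objective from the optimal transport literature is written out below. This is the textbook form under one common weighting convention; the paper's exact formulation, including its structural prior, is not given in the abstract and may differ. All symbols are introduced here for illustration: $T$ is a coupling between the frames of the two videos with marginals $\mu$ and $\nu$, $C_{ij}$ is a feature cost between frame $i$ of one video and frame $j$ of the other, $D^X$ and $D^Y$ are within-video distance matrices, and $\alpha \in [0, 1]$ trades off the two terms.

```latex
\mathrm{FGW}_{\alpha}(\mu, \nu)
  = \min_{T \in \Pi(\mu, \nu)}
    (1 - \alpha) \sum_{i,j} C_{ij} \, T_{ij}
    + \alpha \sum_{i,j,k,l} \bigl| D^{X}_{ik} - D^{Y}_{jl} \bigr|^{2} \, T_{ij} \, T_{kl}
```

The first term is the classical Kantorovich (Wasserstein) cost on frame features, and the second is the Gromov-Wasserstein term that rewards couplings preserving each video's internal structure; setting $\alpha = 0$ recovers the purely feature-based formulation.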

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

Authors: Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Venue: ICLR 2026

First: 2025-09-28T21:15:07+00:00 · Latest: 2025-12-05T17:19:01+00:00

Comments: Under review as a conference paper at ICLR 2026

Abstract

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.

中文标题/摘要

标题：揭示接地ID：外部线索如何塑造多模态绑定

大型视觉-语言模型（LVLMs）在多模态基准测试中表现出色，但在结构化推理和精确接地方面仍有限制。近期研究表明，添加简单的视觉结构，如分区和注释，可以提高准确性，但这些改进背后的内部机制尚不清楚。我们研究了这一现象并提出了接地ID的概念，即由外部线索诱导的潜在标识符，这些标识符在不同模态中将对象与其指定的分区绑定在一起。通过表示分析，我们发现这些标识符在嵌入空间中表现为一致的分区内部对齐，并减少了图像与文本之间的模态差距。因果干预进一步证实这些标识符在对象与符号线索之间起中介作用。我们展示了接地ID增强了相关组件之间的注意力，从而提高了跨模态接地并减少了幻觉。综上所述，我们的结果将接地ID识别为一个关键的符号机制，解释了外部线索如何增强多模态绑定，并提供了可解释性和实际改进。

Summary / 总结

This study investigates how external visual cues improve the performance of large vision-language models on multimodal tasks. It introduces Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Representation analysis shows that these identifiers appear as consistent within-partition alignment in embedding space and reduce the modality gap between image and text, and causal interventions confirm that they mediate the binding between objects and symbolic cues. Grounding IDs also strengthen attention between related components, which improves cross-modal grounding and reduces hallucinations.

研究探讨了外部视觉线索如何提升大型视觉-语言模型在多模态任务中的表现，特别是在结构化推理和精确定位方面。通过提出Grounding IDs的概念，即由外部线索诱导的潜在标识符，研究展示了这些标识符如何增强跨模态对齐并减少模态差距。实验结果表明，Grounding IDs能够增强相关组件之间的注意力，从而提高跨模态定位并减少幻觉现象。

MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Authors: Yu Zhang, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

First: 2025-11-08T02:51:26+00:00 · Latest: 2025-12-05T17:14:58+00:00

Abstract

Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while achieving up to 34% higher training throughput.

中文标题/摘要

标题：MOSS：高效且准确的FP8大语言模型训练方法，结合微缩放和自动缩放

使用FP8格式训练大型语言模型可以带来显著的效率提升。然而，FP8的较低数值精度给稳定和准确的训练带来了挑战。当前框架通过分组量化和张量/块量化混合精度量化来保持训练性能。虽然有效，但分组量化需要沿矩阵乘法的内部维度进行缩放，引入了额外的去量化开销。此外，这些框架通常依赖即时缩放来动态调整缩放因子，以适应当前的数据分布。然而，这种在线量化对于FP8训练是低效的，因为它涉及多次内存读写，抵消了FP8的性能优势。为克服这些限制，我们提出了MOSS，这是一种新型的FP8训练框架，确保了效率和数值稳定性。MOSS引入了两项关键技术：(1) 两级微缩放策略，通过结合高精度全局缩放和紧凑的2的幂次局部缩放来平衡精度和去量化成本；(2) 线性层权重的自动缩放，通过预测和调整缩放因子来消除昂贵的最大值归约操作。利用这些技术，MOSS能够高效地训练一个7B参数模型，性能与BF16基线相当，同时实现高达34%的更高训练吞吐量。

Summary / 总结

The paper introduces MOSS, an FP8 training framework for large language models that targets both efficiency and numerical stability. It proposes a two-level microscaling strategy for sensitive activations and automatic scaling for linear-layer weights, reducing dequantization overhead and eliminating costly max-reduction operations. As a result, MOSS enables efficient FP8 training of a 7B-parameter model, with performance comparable to the BF16 baseline and up to 34% higher training throughput.

论文提出了MOSS，一种结合微缩放和自动缩放的FP8训练框架，以提高效率和准确性。MOSS使用两级微缩放策略处理激活值，并对线性层的权重使用自动缩放，减少了去量化开销并消除了昂贵的最大值归约操作。这种方法使得7B参数模型的FP8训练更加高效，相比BF16可实现高达34%的训练吞吐量提升，同时保持相当的性能。
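
The two-level idea in the abstract (one high-precision global scale combined with compact power-of-two local scales) can be illustrated with a small sketch. This is not the MOSS implementation: the function names, the grouping scheme, and the choice of the FP8 E4M3 range are assumptions made for illustration, and rounding to actual FP8 values is omitted.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def two_level_scales(values, group_size):
    """Return (global_scale, local_exponents) for a flat list of floats.

    Hypothetical helper: global_scale is an ordinary float shared by the
    whole tensor; each group also gets an integer exponent e <= 0, so its
    effective scale is global_scale * 2**e.
    """
    global_amax = max(abs(v) for v in values)
    if global_amax == 0.0:
        return 1.0, [0] * math.ceil(len(values) / group_size)
    global_scale = global_amax / FP8_E4M3_MAX
    exponents = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        group_amax = max(abs(v) for v in group)
        if group_amax == 0.0:
            exponents.append(0)
            continue
        # Smallest power of two that still keeps this group inside the range.
        ratio = group_amax / global_amax  # lies in (0, 1]
        exponents.append(math.ceil(math.log2(ratio)))
    return global_scale, exponents


def scale_group(group, global_scale, exponent):
    """Map one group into the FP8 range (the cast to FP8 itself is omitted)."""
    scale = global_scale * (2.0 ** exponent)
    return [v / scale for v in group]


global_scale, exponents = two_level_scales([448.0, -100.0, 1.0, 0.5], group_size=2)
print(global_scale, exponents)
print(scale_group([1.0, 0.5], global_scale, exponents[1]))
```

Because each local scale is a power of two, it can be stored as a small integer exponent, which is one plausible reading of why such scales are cheap to apply during dequantization.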

DAE-HardNet: A Physics Constrained Neural Network Enforcing Differential-Algebraic Hard Constraints

Authors: Rahul Golder, Bimol Nath Roy, M. M. Faruque Hasan

First: 2025-12-05T16:55:54+00:00 · Latest: 2025-12-05T16:55:54+00:00

Abstract

Traditional physics-informed neural networks (PINNs) do not always satisfy physics based constraints, especially when the constraints include differential operators. Rather, they minimize the constraint violations in a soft way. Strict satisfaction of differential-algebraic equations (DAEs) to embed domain knowledge and first-principles in data-driven models is generally challenging. This is because data-driven models consider the original functions to be black-box whose derivatives can only be obtained after evaluating the functions. We introduce DAE-HardNet, a physics-constrained (rather than simply physics-informed) neural network that learns both the functions and their derivatives simultaneously, while enforcing algebraic as well as differential constraints. This is done by projecting model predictions onto the constraint manifold using a differentiable projection layer. We apply DAE-HardNet to several systems and test problems governed by DAEs, including the dynamic Lotka-Volterra predator-prey system and transient heat conduction. We also show the ability of DAE-HardNet to estimate unknown parameters through a parameter estimation problem. Compared to multilayer perceptrons (MLPs) and PINNs, DAE-HardNet achieves orders of magnitude reduction in the physics loss while maintaining the prediction accuracy. It has the added benefits of learning the derivatives which improves the constrained learning of the backbone neural network prior to the projection layer. For specific problems, this suggests that the projection layer can be bypassed for faster inference. The current implementation and codes are available at https://github.com/SOULS-TAMU/DAE-HardNet.

中文标题/摘要

标题：DAE-HardNet：一种强制满足微分代数严格约束的物理约束神经网络

传统的物理启发式神经网络（PINNs）并不总是满足基于物理的约束，尤其是当约束包含微分运算符时。相反，它们以软的方式最小化约束违反。严格满足微分代数方程（DAEs）以在数据驱动模型中嵌入领域知识和基本原理通常是具有挑战性的。这是因为数据驱动模型将原始函数视为黑盒，其导数只能在评估函数之后获得。我们引入了DAE-HardNet，这是一种物理约束（而不是简单的物理启发式）神经网络，可以同时学习函数及其导数，并强制执行代数和微分约束。这通过使用可微投影层将模型预测投影到约束流形上实现。我们使用DAE-HardNet对几个由DAEs支配的系统和测试问题进行了应用，包括动态Lotka-Volterra捕食者-猎物系统和瞬态热传导。我们还展示了DAE-HardNet通过参数估计问题估计未知参数的能力。与多层感知机（MLPs）和PINNs相比，DAE-HardNet在保持预测准确性的同时，物理损失降低了几个数量级。它还具有学习导数的额外优势，这可以提高投影层之前的主干神经网络的约束学习。对于特定问题，这表明投影层可以被绕过以实现更快的推理。当前的实现和代码可在https://github.com/SOULS-TAMU/DAE-HardNet/获得。

Summary / 总结

DAE-HardNet is a physics-constrained neural network that enforces differential-algebraic hard constraints by learning functions and their derivatives simultaneously and projecting model predictions onto the constraint manifold with a differentiable projection layer. Compared to MLPs and PINNs, it reduces the physics loss by orders of magnitude while maintaining prediction accuracy. Learning the derivatives also improves the constrained learning of the backbone network, which suggests that the projection layer can be bypassed for faster inference on specific problems.

DAE-HardNet 是一种同时学习函数及其导数并将其预测投影到约束流形上的物理约束神经网络，以强制执行微分代数硬约束。与多层感知机（MLPs）和物理感知神经网络（PINNs）相比，它在保持预测准确性的同时显著减少了物理损失。DAE-HardNet 还改进了约束学习，并且在某些问题中可以跳过投影层以实现更快的推理。
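
To make the idea of projecting predictions onto a constraint set concrete, here is a minimal sketch for the simplest possible case: a single linear algebraic constraint, where the Euclidean projection has a closed form. The paper's projection layer handles differential-algebraic constraints and is differentiable end to end; the function name and the example constraint below are illustrative assumptions only.

```python
def project_onto_hyperplane(y, a, b):
    """Closest point (in Euclidean distance) to y with sum(a_i * y_i) == b.

    Hypothetical helper: a stands in for one linear algebraic constraint,
    y for a raw network prediction.
    """
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint vector a must be non-zero")
    step = residual / norm_sq
    # Move y along a by exactly the amount that cancels the violation.
    return [yi - step * ai for ai, yi in zip(a, y)]


# Raw prediction violates y0 + y1 = 2; the projection removes the violation.
raw_prediction = [3.0, 1.0]
projected = project_onto_hyperplane(raw_prediction, a=[1.0, 1.0], b=2.0)
print(projected)
```

A soft-constraint approach would instead add the squared residual to the loss, which shrinks the violation without guaranteeing that it reaches zero.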

Computational Design of Low-Volatility Lubricants for Space Using Interpretable Machine Learning

Authors: Daniel Miliate, Ashlie Martini

First: 2025-12-05T16:47:04+00:00 · Latest: 2025-12-05T16:47:04+00:00

Abstract

The function and lifetime of moving mechanical assemblies (MMAs) in space depend on the properties of lubricants. MMAs that experience high speeds or high cycles require liquid based lubricants due to their ability to reflow to the point of contact. However, only a few liquid-based lubricants have vapor pressures low enough for the vacuum conditions of space, each of which has limitations that add constraints to MMA designs. This work introduces a data-driven machine learning (ML) approach to predicting vapor pressure, enabling virtual screening and discovery of new space-suitable liquid lubricants. The ML models are trained with data from both high-throughput molecular dynamics simulations and experimental databases. The models are designed to prioritize interpretability, enabling the relationships between chemical structure and vapor pressure to be identified. Based on these insights, several candidate molecules are proposed that may have promise for future space lubricant applications in MMAs.

中文标题/摘要

标题：空间用低挥发性润滑剂的计算设计及可解释机器学习

空间中移动机械组件（MMAs）的功能和寿命取决于润滑剂的性质。高速或高循环次数的MMAs需要基于液体的润滑剂，因为它们能够重新流动到接触点。然而，只有少数液体润滑剂的蒸汽压足够低以适应空间的真空条件，每种润滑剂都有其局限性，这为MMA设计增加了约束。本研究介绍了一种数据驱动的机器学习（ML）方法，用于预测蒸汽压，从而实现虚拟筛选和发现新的空间适用液体润滑剂。ML模型使用了高通量分子动力学模拟数据和实验数据库的数据进行训练。模型设计强调可解释性，使化学结构与蒸汽压之间的关系能够被识别。基于这些见解，提出了几种候选分子，这些分子可能在未来适用于空间润滑剂应用的MMAs中。

Summary / 总结

This research aims to improve the function and lifetime of mechanical assemblies in space by developing low-volatility lubricants. The study uses interpretable machine learning to predict vapor pressure, allowing for the virtual screening of new lubricants. The models are trained on both simulation and experimental data, and the results identify several promising candidate molecules for future space lubricant applications in high-speed or high-cycle mechanical assemblies.

该研究旨在通过开发低挥发性润滑剂来提高空间机械组件的功能和寿命。研究人员使用可解释的机器学习来预测蒸汽压，从而进行潜在润滑剂的虚拟筛选。模型基于分子动力学模拟和实验数据库的数据进行训练，并识别出几种适用于空间应用的候选分子，从而增强空间机械系统的设计灵活性。

Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

Authors: Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed

First: 2025-12-05T16:38:47+00:00 · Latest: 2025-12-05T16:38:47+00:00

Abstract

Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.

中文标题/摘要

标题：优化医疗问答系统：基于RAG框架的微调和零样本大型语言模型比较研究

医疗问答(QA)系统可以从大型语言模型(LLMs)的进步中受益，但直接将LLMs应用于临床领域会面临保持事实准确性和避免幻觉的挑战。本文介绍了一种基于检索增强生成(RAG)的医疗QA系统，该系统结合了领域特定知识检索与开源LLMs，以回答医疗问题。我们使用低秩适应(LoRA)对两个最先进的开源LLM(LLaMA 2和Falcon)进行微调，以实现高效的领域专业化。该系统检索相关医学文献以支持LLM的答案，从而提高事实正确性并减少幻觉。我们在基准数据集(PubMedQA和MedMCQA)上评估了该方法，并表明检索增强可以显著提高答案准确性，优于仅使用LLMs的方法。我们微调的LLaMA 2模型在PubMedQA上的准确率为71.8%，大幅提高了55.4%的零样本基线，同时通过提供来源参考保持了透明度。我们还详细介绍了系统设计和微调方法，证明将答案与检索到的证据相结合可以减少约60%的无根据内容。这些结果突显了RAG增强开源LLMs在可靠生物医学QA中的潜力，指出了实际临床信息学应用的方向。

Summary / 总结

This paper aims to enhance the accuracy and reliability of medical question-answering systems by integrating retrieval-augmented generation (RAG) with fine-tuned large language models (LLMs). The authors use Low-Rank Adaptation (LoRA) to specialize two open LLMs (LLaMA 2 and Falcon) for the medical domain. By retrieving relevant medical literature, the system improves factual accuracy and reduces hallucinations. On PubMedQA, the fine-tuned LLaMA 2 model achieves 71.8% accuracy, 16.4 percentage points above the 55.4% zero-shot baseline, while maintaining transparency through source references.

本文旨在通过结合检索增强生成（RAG）和微调的大语言模型（LLMs），提高医学问答系统的准确性和可靠性。作者使用低秩适应（LoRA）微调了两个最先进的LLM（LLaMA 2和Falcon），并将它们与检索到的医学文献结合，以增强事实正确性。该系统在基准数据集（PubMedQA和MedMCQA）上进行了评估，显示出显著的改进。微调后的LLaMA 2模型在PubMedQA上的准确率为71.8%，比55.4%的零样本基线高出16.4个百分点，同时减少了约60%的无依据内容。
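
The reported PubMedQA gain can be read either as an absolute difference in percentage points or as a relative improvement over the baseline, and the two numbers differ noticeably. The short sketch below works through the arithmetic using only the two accuracies quoted in the abstract; the function name is a hypothetical helper.

```python
def accuracy_gain(baseline_pct, new_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = new_pct - baseline_pct
    relative_pct = 100.0 * absolute_points / baseline_pct
    return absolute_points, relative_pct


points, relative = accuracy_gain(baseline_pct=55.4, new_pct=71.8)
print(f"{points:.1f} percentage points, {relative:.1f}% relative")
# prints "16.4 percentage points, 29.6% relative"
```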

LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis

Authors: Tongxu Zhang

First: 2025-12-03T05:07:56+00:00 · Latest: 2025-12-05T16:30:37+00:00

Comments: The manuscript represents only a preliminary and substantially incomplete exploration. The author has decided not to stand by these results, and a thoroughly revised and significantly different version will be developed separately. Therefore this version is withdrawn and should not be cited.

Abstract

Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4,650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, QC'd ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.

中文标题/摘要

标题：LM-CartSeg：自动分割外侧和内侧软骨及软骨下骨以进行影像组学分析

背景与目的：膝关节MRI影像组学需要稳健且具有解剖学意义的感兴趣区域（ROIs），能够同时捕捉软骨和软骨下骨。现有大多数工作依赖于手动ROI，并很少报告质量控制（QC）。我们提出LM-CartSeg，这是一种全自动流程，用于软骨/骨分割、几何外侧/内侧（L/M）分区和影像组学分析。方法：在SKM-TEA（138个膝关节）和OAIZIB-CM（404个膝关节）数据集上训练了两个3D nnU-Net模型。测试时，对零样本预测进行融合，并通过简单的几何规则进行细化：连通域清理、在物理空间中构建10毫米软骨下骨带，以及基于PCA和k-means的数据驱动胫骨L/M分割。分割在OAIZIB-CM测试集（103个膝关节）和SKI-10（100个膝关节）上进行了评估。QC使用体积和厚度特征。从10个ROI中提取了4650个非形状影像组学特征，以研究区室间相似性、对ROI大小的依赖性以及OAIZIB-CM上的OA与非OA分类。结果：后处理将OAIZIB-CM上的宏平均ASSD从2.63毫米降低到0.36毫米，HD95从25.2毫米降低到3.35毫米，DSC为0.91；SKI-10上的零样本DSC为0.80。几何L/M规则在不同数据集上产生了稳定的分区，而直接进行L/M分割的nnU-Net则出现了依赖于数据域的左右侧互换。每个ROI中只有6%到12%的特征与体积或厚度高度相关。基于影像组学的模型优于仅使用大小相关特征的模型。结论：LM-CartSeg提供了自动、经过质量控制的ROI和影像组学特征，这些特征携带了超越简单形态学的区分信息，为多中心膝关节OA影像组学研究提供了实用的基础。

Summary / 总结

The study presents LM-CartSeg, an automated pipeline for segmenting cartilage and subchondral bone in knee MRI that combines two 3D nnU-Net models with geometric post-processing. Post-processing improved segmentation quality on OAIZIB-CM, reducing macro ASSD from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm. The geometric lateral/medial compartmentalization was stable across datasets, and only 6 to 12 percent of radiomic features per ROI were strongly correlated with volume or thickness. The author has since withdrawn this version as preliminary and asks that it not be cited.

研究介绍了LM-CartSeg，这是一种用于膝关节MRI中软骨和软骨下骨自动分割的流程，包括两个在不同数据集上训练的3D nnU-Net模型。后处理提高了分割质量，宏平均ASSD降至0.36 mm，HD95降至3.35 mm。几何外侧/内侧分区在不同数据集上结果稳定，只有少量影像组学特征与体积或厚度强相关。该方法提供了经过质量控制的ROI和影像组学特征，这些特征携带了超越简单形态学的区分信息，适用于多中心膝关节OA影像组学研究。

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Authors: Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei

First: 2025-12-05T16:29:52+00:00 · Latest: 2025-12-05T16:29:52+00:00

Abstract

Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, additional modalities introduce additional vulnerabilities that jailbreak attacks can exploit to induce MLLMs to output harmful content. Because MLLMs have strong reasoning abilities, previous jailbreak attacks have explored reasoning-related safety risks in the text modality, while similar threats in the visual modality have been largely overlooked. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated images, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.

中文标题/摘要

标题：VRSA：通过视觉推理顺序攻击破解多模态大型语言模型

多模态大型语言模型（MLLMs）因其强大的跨模态理解和生成能力而在各个领域广泛应用。然而，更多的模态也带来了更多的利用风险，使得MLLMs输出有害内容。由于MLLMs具有强大的推理能力，之前的破解攻击尝试探索文本模态中的推理安全性风险，而视觉模态中的类似威胁则被很大程度上忽视。为了全面评估视觉推理任务中的潜在安全风险，我们提出了视觉推理顺序攻击（VRSA），该攻击通过将原始有害文本分解为多个顺序相关子图像，逐步使MLLMs外部化并聚合完整的有害意图。特别是为了增强图像序列中的场景合理性，我们提出了自适应场景优化以优化与原始有害查询最相关的场景。为了确保生成图像的语义连续性，我们提出了语义一致完成，通过结合场景中的上下文信息迭代重写每个子文本。此外，我们提出了文本-图像一致性对齐以保持语义一致性。一系列实验表明，与最先进的破解攻击方法相比，VRSA在开源和闭源MLLMs（如GPT-4o和Claude-4.5-Sonnet）上实现了更高的攻击成功率。

Summary / 总结

The research aims to evaluate the safety risks in the visual reasoning tasks of Multimodal Large Language Models (MLLMs) by proposing VRSA, a Visual Reasoning Sequential Attack. This method decomposes harmful text into sub-images and optimizes the scene using Adaptive Scene Refinement, ensures semantic continuity with Semantic Coherent Completion, and maintains text-image consistency. Experiments show that VRSA outperforms existing jailbreak attacks on both open-source and closed-source MLLMs like GPT-4o and Claude-4.5-Sonnet in terms of attack success rate.

研究旨在通过提出VRSA（视觉推理序列攻击）来评估多模态大型语言模型（MLLMs）在视觉推理任务中的安全风险。该方法将有害文本分解为子图像，并通过自适应场景优化来优化与原始有害查询最相关的场景，确保通过语义一致完成生成图像的语义连续性，并保持文本-图像一致性。实验表明，VRSA在GPT-4o和Claude-4.5-Sonnet等开源和闭源MLLMs中的攻击成功率高于现有最先进的攻击方法。

NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

Authors: Daniel Rose, Roxane Axel Jacob, Johannes Kirchmair, Thierry Langer

First: 2025-12-05T16:18:07+00:00 · Latest: 2025-12-05T16:18:07+00:00

Abstract

Autoregressive models are a promising alternative to diffusion-based models for 3D molecular structure generation. However, a key limitation is the assumption of a token order: while text has a natural sequential order, the next token prediction given a molecular graph prefix should be invariant to atom permutations. Previous works sidestepped this mismatch by using canonical orders or focus atoms. We argue that this is unnecessary. We introduce NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer that treats molecular graphs as sets of atoms and learns the order-agnostic distribution over admissible tokens at the graph boundary with an autoregressive flow model. NEAT approaches state-of-the-art performance in 3D molecular generation with high computational efficiency and atom-level permutation invariance, establishing a practical foundation for scalable molecular design.

中文标题/摘要

标题：NEAT：基于邻域指导、高效且自回归的集合变换器，用于3D分子生成

自回归模型是3D分子结构生成的一种有前途的替代方案，与基于扩散的模型相比。然而，一个关键限制是假设了标记顺序：虽然文本具有自然的顺序，但在给定分子图前缀时，下一个标记预测应该是对原子排列不变的。先前的工作通过使用标准顺序或关注原子来绕过这种不匹配。我们认为这是不必要的。我们引入了NEAT，一种基于邻域指导、高效且自回归的集合变换器，将分子图视为原子集合，并使用自回归流模型学习在图边界上可忽略顺序的可接受标记分布。NEAT在3D分子生成方面接近最先进的性能，具有高计算效率和原子级排列不变性，为可扩展的分子设计奠定了实用基础。

Summary / 总结

The research aims to improve autoregressive models for 3D molecular structure generation by removing the assumption of a fixed token order. NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer, treats molecular graphs as sets of atoms and learns an order-agnostic distribution over admissible tokens at the graph boundary with an autoregressive flow model. Experiments show that NEAT approaches state-of-the-art performance in 3D molecular generation while offering high computational efficiency and atom-level permutation invariance, providing a practical foundation for scalable molecular design.

研究旨在通过解决标记顺序不变性问题，改进用于3D分子结构生成的自回归模型。引入了NEAT，一种基于邻域指导、高效、自回归的集合变换器，将分子图视为原子集合，并学习边界处的可接受标记分布，而不依赖于固定顺序。关键实验结果表明，NEAT在3D分子生成中接近最先进的性能，同时保持了高计算效率和原子级置换不变性，为可扩展的分子设计提供了实用框架。

SPARTAN: A Sparse Transformer World Model Attending to What Matters

Authors: Anson Lei, Bernhard Schölkopf, Ingmar Posner

First: 2024-11-11T11:42:48+00:00 · Latest: 2025-12-05T16:14:50+00:00

Abstract

Capturing the interactions between entities in a structured way plays a central role in world models that flexibly adapt to changes in the environment. Recent works motivate the benefits of models that explicitly represent the structure of interactions and formulate the problem as discovering local causal structures. In this work, we demonstrate that reliably capturing these relationships in complex settings remains challenging. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local structures. To this end, we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns context-dependent interaction structures between entities in a scene. By applying sparsity regularisation on the attention patterns between object-factored tokens, SPARTAN learns sparse, context-dependent interaction graphs that accurately predict future object states. We further extend our model to adapt to sparse interventions with unknown targets in the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state-of-the-art in object-centric world models in observation-based environments and demonstrate that our model can learn local causal graphs that accurately reflect the underlying interactions between objects, achieving significantly improved few-shot adaptation to dynamics changes, as well as robustness against distractors.

中文标题/摘要

标题：SPARTAN：一种稀疏变换器世界模型，关注关键交互

以结构化方式捕捉实体之间的交互在世界模型中起着核心作用，这些模型能够灵活适应环境变化。近期研究强调了明确表示交互结构模型的优势，并将问题表述为发现局部因果结构。在本文中，我们展示了在复杂环境中可靠捕捉这些关系仍然具有挑战性。为解决这一不足，我们提出稀疏性是发现此类局部结构的关键成分。为此，我们提出了SPARse TrANsformer World模型（SPARTAN），一种基于变换器的世界模型，能够学习场景中实体之间依赖上下文的交互结构。通过在对象因子化标记之间的注意力模式上应用稀疏正则化，SPARTAN学习到稀疏的、依赖上下文的交互图，能够准确预测未来对象状态。我们进一步扩展了该模型，使其能够适应环境动力学中目标未知的稀疏干预。由此得到一个高度可解释的世界模型，能够高效地适应变化。实验中，我们在基于观察的环境中将SPARTAN与当前最先进的以对象为中心的世界模型进行了比较，并展示了我们的模型能够学习准确反映对象之间底层交互的局部因果图，在动力学变化的少样本适应方面取得了显著改进，并对干扰物具有鲁棒性。

Summary / 总结

The research aims to improve world models by capturing structured interactions between entities in complex environments. SPARTAN, a sparse Transformer-based world model, learns context-dependent interaction graphs between objects by applying sparsity regularization on attention patterns. This approach enables the model to accurately predict future object states and adapt to changes, outperforming existing methods in few-shot adaptation and robustness against distractors.

研究旨在通过捕捉复杂环境中实体之间的结构化交互来改进世界模型。提出了基于稀疏Transformer的SPARTAN模型，以学习对象之间的上下文相关交互图。通过在注意力模式上应用稀疏正则化，SPARTAN能够准确预测未来对象状态并高效适应变化。实验表明，SPARTAN在少量样本适应和对抗干扰方面优于现有方法。

Morphling: Fast, Fused, and Flexible GNN Training at Scale

Authors: Anubhab, Rupesh Nasre

First: 2025-12-01T13:45:03+00:00 · Latest: 2025-12-05T16:07:38+00:00

Abstract

Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.

中文标题/摘要

标题：Morphling：大规模快速、融合和灵活的图神经网络训练

图神经网络（GNNs）通过将不规则的、内存绑定的图遍历与规则的、计算密集型的密集矩阵操作融合在一起，提出了一个基本的硬件挑战。尽管像PyTorch Geometric（PyG）和Deep Graph Library（DGL）这样的框架优先考虑高级易用性，但它们未能解决这些不同的执行特性。因此，它们依赖于通用内核，这些内核遭受着较差的缓存局部性、过多的内存移动和大量的中间分配。为了解决这些限制，我们提出了Morphling，这是一种领域特定的代码合成器，旨在弥合这一差距。Morphling将高级GNN规范编译为针对OpenMP、CUDA和MPI的后端特定实现。它通过为每个执行环境实例化一个优化的、架构感知的原语库来实现这一点。Morphling还包含一个运行时稀疏感知执行引擎，该引擎根据输入特征统计信息动态选择密集或稀疏执行路径，从而减少不必要的零值项计算。我们在涵盖不同图结构、特征维度和稀疏性的11个真实数据集上评估了Morphling。Morphling在CPU上的每轮训练吞吐量平均提高了20倍，在GPU上的提高了19倍，在分布式设置中提高了6倍，与PyG和DGL相比，峰值加速比达到66倍。Morphling的内存高效布局进一步将峰值内存消耗减少了最多15倍，使大规模GNN训练能够在普通硬件上实现。这些发现表明，专门的、架构感知的代码合成为跨各种并行和分布式平台实现高性能GNN执行提供了一条有效且可扩展的途径。

Summary / 总结

Morphling is a domain-specific code synthesizer designed to optimize Graph Neural Network (GNN) training by addressing the hardware challenges of combining irregular graph traversals with dense matrix operations. It compiles high-level GNN specifications into backend-specialized implementations for OpenMP, CUDA, and MPI, using optimized primitives and a runtime sparsity-aware execution engine. Across eleven real-world datasets, Morphling improves per-epoch training throughput over PyG and DGL by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings, with peak speedups reaching 66X, and reduces peak memory consumption by up to 15X.

Morphling 是一种领域特定的代码合成器，旨在通过解决将不规则图遍历与密集矩阵操作相结合的硬件挑战来优化图神经网络（GNN）的训练。它将高级 GNN 规范编译为针对 OpenMP、CUDA 和 MPI 的后端特定实现，使用优化的原语和运行时稀疏性感知执行引擎。实验结果表明，Morphling 在 CPU 上将每轮训练吞吐量提高 20 倍，在 GPU 上提高 19 倍，在分布式设置中提高 6 倍，超过现有的框架如 PyG 和 DGL，峰值加速比达到 66 倍。此外，Morphling 进一步减少了峰值内存消耗，最多减少 15 倍，使大规模 GNN 训练能够在普通硬件上进行。
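
The runtime choice between dense and sparse execution paths described in the abstract can be sketched as follows. This is a toy illustration, not Morphling's engine: the function names, the nested-list matrix representation, and the 0.25 density threshold are all assumptions, and a real sparse path would use a precomputed compressed format rather than scanning rows on the fly.

```python
def nonzero_fraction(matrix):
    """Fraction of entries in a list-of-rows matrix that are non-zero."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    nonzeros = sum(1 for row in matrix for x in row if x != 0)
    return nonzeros / total


def matvec_dense(matrix, vec):
    """Multiply every entry, including zeros."""
    return [sum(x * v for x, v in zip(row, vec)) for row in matrix]


def matvec_sparse(matrix, vec):
    """Multiply only the non-zero entries of each row."""
    result = []
    for row in matrix:
        entries = [(j, x) for j, x in enumerate(row) if x != 0]
        result.append(sum(x * vec[j] for j, x in entries))
    return result


def matvec_auto(matrix, vec, threshold=0.25):
    """Pick a path from input statistics; threshold is a hypothetical knob."""
    if nonzero_fraction(matrix) < threshold:
        return "sparse", matvec_sparse(matrix, vec)
    return "dense", matvec_dense(matrix, vec)


features = [[0, 0, 0, 2], [0, 0, 0, 0], [1, 0, 0, 0]]
print(matvec_auto(features, [1, 2, 3, 4]))
print(matvec_auto([[1, 2], [3, 4]], [1, 1]))
```

Both paths return the same numbers; the only difference is how much work is spent on zero-valued entries, which is the quantity a sparsity-aware engine tries to reduce.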

KNARsack: Teaching Neural Algorithmic Reasoners to Solve Pseudo-Polynomial Problems

Authors: Stjepan Požgaj, Dobrik Georgiev, Marin Šilić, Petar Veličković

First: 2025-09-17T15:44:25+00:00 · Latest: 2025-12-05T15:57:26+00:00

Comments: 16 pages, 10 figures, 5 tables, 3 listings

Abstract

Neural algorithmic reasoning (NAR) is a growing field that aims to embed algorithmic logic into neural networks by imitating classical algorithms. In this extended abstract, we detail our attempt to build a neural algorithmic reasoner that can solve Knapsack, a pseudo-polynomial problem bridging classical algorithms and combinatorial optimisation, but omitted in standard NAR benchmarks. Our neural algorithmic reasoner is designed to closely follow the two-phase pipeline for the Knapsack problem, which involves first constructing the dynamic programming table and then reconstructing the solution from it. The approach, which models intermediate states through dynamic programming supervision, achieves better generalization to larger problem instances than a direct-prediction baseline that attempts to select the optimal subset only from the problem inputs.

中文标题/摘要

标题：KNARsack：训练神经算法推理器解决伪多项式问题

神经算法推理（NAR）是一个旨在通过模仿经典算法将算法逻辑嵌入神经网络的新兴领域。在本文扩展摘要中，我们详细描述了我们尝试构建一个能够解决背包问题的神经算法推理器的努力，背包问题是连接经典算法和组合优化的伪多项式问题，但在标准NAR基准中被忽略。我们的神经算法推理器设计为紧密遵循背包问题的两阶段管道，首先构建动态规划表，然后从中重构解决方案。该方法通过动态规划监督建模中间状态，其在处理更大规模问题实例时的泛化能力优于直接预测基线，后者仅尝试从问题输入中选择最优子集。

Summary / 总结

The research aims to develop a neural algorithmic reasoner that solves Knapsack, a pseudo-polynomial problem, by closely following the two-phase pipeline of dynamic programming table construction and solution reconstruction. By supervising intermediate states with the dynamic programming table, the proposed method generalizes better to larger problem instances than a direct-prediction baseline that selects the optimal subset from the problem inputs alone.

研究旨在通过紧密遵循动态规划表构建和解决方案重构的两阶段流程，开发一个神经算法推理器来解决背包（Knapsack）问题，这是一个伪多项式问题。提出的方法通过动态规划监督对中间状态建模，与仅根据问题输入直接选择最优子集的直接预测基线相比，在更大规模的问题实例上具有更好的泛化能力。
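
For readers unfamiliar with the classical algorithm that the neural reasoner imitates, the two phases (table construction, then solution reconstruction) look like this in the textbook 0/1 Knapsack formulation. This is the standard dynamic program, not the paper's neural model; the function names are illustrative, and weights are assumed to be positive integers.

```python
def knapsack_table(weights, values, capacity):
    """Phase 1: table[i][c] = best value using the first i items within capacity c."""
    n = len(weights)
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            table[i][c] = table[i - 1][c]  # skip item i-1
            if w <= c:
                table[i][c] = max(table[i][c], table[i - 1][c - w] + v)  # take it
    return table


def reconstruct(table, weights, capacity):
    """Phase 2: walk the table backwards to recover one optimal item subset."""
    chosen = []
    c = capacity
    for i in range(len(weights), 0, -1):
        if table[i][c] != table[i - 1][c]:  # value changed, so item i-1 was taken
            chosen.append(i - 1)
            c -= weights[i - 1]
    return sorted(chosen)


weights, values, capacity = [1, 3, 4, 5], [1, 4, 5, 7], 7
table = knapsack_table(weights, values, capacity)
print(table[-1][capacity], reconstruct(table, weights, capacity))
# prints "9 [1, 2]"
```

The table has (n + 1) x (capacity + 1) entries, so the running time depends on the numeric value of the capacity rather than on its bit length, which is what makes the problem pseudo-polynomial.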

The AI Productivity Index (APEX)

Authors: Bertie Vidgen, Abby Fennelly, Evan Pinnix, Julien Benchek, Daniyal Khan, Zach Richards, Austin Bridges, Calix Huang, Ben Hunsberger, Isaac Robinson, Akul Datta, Chirag Mahapatra, Dominic Barton, Cass R. Sunstein, Eric Topol, Brendan Foody, Osvald Nitski

First: 2025-09-30T03:26:17+00:00 · Latest: 2025-12-05T15:48:58+00:00

Abstract

We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models are capable of performing economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT5 (Thinking = High) remains the top performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.

中文标题/摘要

标题：AI生产力指数（APEX）

我们提出了AI生产力指数（APEX-v1-extended）的扩展版本，这是一个基准测试，用于评估前沿模型是否能够在四个职位上执行具有经济价值的任务：投资银行助理、管理咨询师、大律师事务所助理和全科医生（MD）。该技术报告详细介绍了APEX-v1的扩展，包括将保留评估集从n=50增加到n=100个案例（总共n=400个）以及评分方法的更新。我们展示了新的排行榜，其中GPT5（思考=高）仍然是表现最好的模型，得分为67.0%。APEX-v1-extended表明，前沿模型在执行典型专业任务时仍然存在重大限制。为了支持进一步的研究，我们开源了每个角色n=25个非基准示例案例（总共n=100个）以及我们的评估框架。

Summary / 总结

The research assesses whether frontier AI models can perform economically valuable tasks in four professional roles: investment banking associate, management consultant, big law associate, and primary care physician. The study extends the AI Productivity Index (APEX) by increasing the held-out evaluation set from 50 to 100 cases per role (400 total) and updating the grading methodology. Key findings show that while GPT5 (Thinking = High) remains the top model with a score of 67.0%, frontier models still face substantial limitations in performing typical professional tasks.

研究旨在评估前沿AI模型能否在投资银行助理、管理咨询师、大律师事务所助理和全科医生四个专业角色中执行具有经济价值的任务。通过将保留评估集从每个职位50个案例增加到100个案例（共400个），并更新评分方法，扩展了AI生产力指数（APEX）。新的排行榜显示，GPT5（思考=高）仍然是表现最好的模型，得分为67.0%，但前沿模型在执行典型专业任务时仍然存在显著的局限性。

xLSTM-PINN: Memory-Gated Spectral Remodeling for Physics-Informed Learning

Authors: Ze Tao, Darui Zhao, Fujun Liu, Ke Xu, Xiangsheng Hu

First: 2025-11-16T08:55:27+00:00 · Latest: 2025-12-05T15:45:09+00:00

Abstract

Physics-informed neural networks (PINNs) face significant challenges from spectral bias, which impedes their ability to model high-frequency phenomena and limits extrapolation performance. To address this, we introduce xLSTM-PINN, a novel architecture that performs representation-level spectral remodeling through memory gating and residual micro-steps. Our method consistently achieves markedly lower spectral error and root mean square error (RMSE) across four diverse partial differential equation (PDE) benchmarks, along with a broader stable learning-rate window. Frequency-domain analysis confirms that xLSTM-PINN elevates high-frequency kernel weights, shifts the resolvable bandwidth rightward, and shortens the convergence time for high-wavenumber components. Without modifying automatic differentiation or physics loss constraints, this work provides a robust pathway to suppress spectral bias, thereby improving accuracy, reproducibility, and transferability in physics-informed learning.

中文标题/摘要

标题：xLSTM-PINN：记忆门控频谱重塑以实现物理知情学习

物理知情神经网络（PINN）面临频谱偏差的重大挑战，这阻碍了它们对高频现象的建模能力并限制了外推性能。为了解决这一问题，我们引入了xLSTM-PINN，这是一种通过记忆门控和残差微步进行表示级频谱重塑的新型架构。我们的方法在四个不同的偏微分方程（PDE）基准测试中，始终实现了显著更低的频谱误差和均方根误差（RMSE），并且具有更宽的稳定学习率窗口。频域分析证实，xLSTM-PINN 提升了高频核权重，将可分辨带宽向右移动，并缩短了高波数分量的收敛时间。在不修改自动微分或物理损失约束的情况下，这项工作提供了一种抑制频谱偏差的稳健途径，从而提高了物理知情学习的准确度、可重复性和可迁移性。

Summary / 总结

xLSTM-PINN addresses the spectral bias in PINNs by introducing memory gating and residual micro-steps for representation-level spectral remodeling. It achieves lower spectral error and RMSE across four PDE benchmarks, with a wider stable learning-rate window. Frequency-domain analysis shows that xLSTM-PINN elevates high-frequency kernel weights and shifts the resolvable bandwidth rightward, shortening convergence time for high-wavenumber components.

xLSTM-PINN 通过记忆门控和残差微步进行表示级别的频谱重塑，旨在解决物理感知神经网络（PINNs）中的频谱偏差问题。该方法在四个不同的偏微分方程（PDE）基准测试中显著降低了频谱误差和RMSE，并且还拓宽了稳定的学习率窗口。频域分析表明，xLSTM-PINN 提高了高频核权重，向右移动了可解析带宽，并加速了高频波数分量的收敛。
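To make the two reported metrics concrete, the sketch below computes RMSE and a per-frequency error for a 1D signal. The abstract does not define "spectral error", so the definition used here (magnitude of the discrete Fourier transform of the residual, divided by the number of samples) is an assumption chosen for illustration, and the names `rmse`, `dft`, and `spectral_error` are hypothetical. The toy prediction reproduces only the low-frequency component, imitating spectral bias.

```python
import cmath
import math


def rmse(pred, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def spectral_error(pred, target):
    """Assumed definition: |DFT(pred - target)| / n for each frequency bin."""
    n = len(target)
    residual = [p - t for p, t in zip(pred, target)]
    return [abs(c) / n for c in dft(residual)]


n = 32
# Target has a low-frequency (bin 1) and a high-frequency (bin 5) component.
target = [math.sin(2 * math.pi * k / n) + 0.5 * math.sin(2 * math.pi * 5 * k / n)
          for k in range(n)]
# A spectrally biased fit that captures only the low-frequency component.
pred = [math.sin(2 * math.pi * k / n) for k in range(n)]

errors = spectral_error(pred, target)
print(rmse(pred, target), errors[1], errors[5])
```

In this toy case the error sits entirely in the high-frequency bin (and its mirror bin). A per-frequency view exposes this, whereas a single RMSE value does not.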

Machine-learning-enabled interpretation of tribological deformation patterns in large-scale MD data

Authors: Hendrik J. Ehrich, Marvin C. May, Stefan J. Eder

First: 2025-12-05T15:39:13+00:00 · Latest: 2025-12-05T15:39:13+00:00

Comments: 19 pages, 11 figures

Abstract

Molecular dynamics (MD) simulations have become indispensable for exploring tribological deformation patterns at the atomic scale. However, transforming the resulting high-dimensional data into interpretable deformation pattern maps remains a resource-intensive and largely manual process. In this work, we introduce a data-driven workflow that automates this interpretation step using unsupervised and supervised learning. Grain-orientation-colored computational tomograph pictures obtained from CuNi alloy simulations were first compressed through an autoencoder to a 32-dimensional global feature vector. Despite this strong compression, the reconstructed images retained the essential microstructural motifs: grain boundaries, stacking faults, twins, and partial lattice rotations, while omitting only the finest defects. The learned representations were then combined with simulation metadata (composition, load, time, temperature, and spatial position) to train a CNN-MLP model to predict the dominant deformation pattern. The resulting model achieves a prediction accuracy of approximately 96% on validation data. A refined evaluation strategy, in which an entire spatial region containing distinct grains was excluded from training, provides a more robust measure of generalization. The approach demonstrates that essential tribological deformation signatures can be automatically identified and classified from structural images using Machine Learning. This proof of concept constitutes a first step towards fully automated, data-driven construction of tribological mechanism maps and, ultimately, toward predictive modeling frameworks that may reduce the need for large-scale MD simulation campaigns.

中文标题/摘要

标题：基于机器学习的大型分子动力学数据摩擦变形模式解释

分子动力学（MD）模拟已成为探索原子尺度摩擦变形模式不可或缺的工具。然而，将生成的高维数据转换为可解释的变形模式图仍然是一项资源密集型且主要依赖人工的过程。在本工作中，我们引入了一种数据驱动的工作流，使用无监督和监督学习自动化这一解释步骤。首先，从CuNi合金模拟中获得的晶粒取向着色计算断层扫描图像通过自编码器压缩到32维全局特征向量。尽管进行了这种强烈的压缩，重建的图像仍保留了关键的微观结构特征：晶界、层错、孪晶和部分晶格旋转，仅省略了最细小的缺陷。然后，将学习到的表示与模拟元数据（组成、载荷、时间、温度和空间位置）结合，训练一个CNN-MLP模型以预测主导变形模式。所得到的模型在验证数据上的预测准确率约为96%。一种改进的评估策略，即排除整个包含不同晶粒的空间区域进行训练，提供了更稳健的泛化度量。该方法表明，可以使用机器学习自动识别和分类结构图像中的关键摩擦变形特征。该概念验证是完全自动化、数据驱动构建摩擦机制图的第一步，最终可能朝着减少大规模MD模拟需求的预测建模框架迈进。

Summary / 总结

This study addresses the challenge of interpreting high-dimensional molecular dynamics (MD) data by introducing a data-driven workflow that uses unsupervised and supervised learning. The workflow compresses grain-orientation-colored computational tomograph pictures from CuNi alloy simulations into a 32-dimensional feature vector, which retains essential microstructural motifs while omitting fine defects. A CNN-MLP model trained on this data achieves a validation accuracy of about 96% in predicting dominant deformation patterns, demonstrating the potential for automated identification and classification of tribological deformation signatures.

本研究通过引入使用无监督和监督学习的数据驱动工作流来解决高维分子动力学（MD）数据的解释难题。该工作流将CuNi合金模拟的晶粒取向着色计算断层图压缩成一个32维特征向量，保留了关键的微观结构特征，同时忽略了细微缺陷。基于该数据训练的CNN-MLP模型在预测主导变形模式方面的验证准确率约为96%，展示了自动识别和分类摩擦学变形特征的潜力。

Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws

Authors: Zhengquan Luo, Zhiqiang Xu

First: 2025-12-05T15:37:38+00:00 · Latest: 2025-12-05T15:37:38+00:00

Abstract

Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration-dynamics-error analysis, which reformulates major DD approaches under a common generalization-error perspective and provides two main results: (i) a scaling law that provides a single-configuration upper bound, characterizing how the error decreases as the distilled sample size increases and explaining the commonly observed performance saturation effect; and (ii) a coverage law showing that the required distilled sample size scales linearly with configuration diversity, with provably matching upper and lower bounds. In addition, our unified analysis reveals that various matching methods are interchangeable surrogates, reducing the same generalization error, clarifying why they can all achieve dataset distillation and providing guidance on how surrogate choices affect sample efficiency and robustness. Experiments across diverse methods and configurations empirically confirm the derived laws, advancing a theoretical foundation for DD and enabling theory-driven design of compact, configuration-robust dataset distillation.

中文标题/摘要

标题：数据集蒸馏的实用边界：缩放和配置覆盖率定律

数据集蒸馏（DD）旨在构建紧凑的合成数据集，使模型在存储和计算大幅减少的情况下，仍能实现与全量数据训练相当的性能。尽管在经验上取得了快速进展，但其理论基础仍然有限：现有方法（梯度、分布、轨迹匹配）基于异构的代理目标和优化假设，这使得难以分析它们的共同原理或提供一般保证。此外，当训练配置（如优化器、架构或增强）发生变化时，蒸馏数据能否保留全量数据的有效性仍不清楚。为回答这些问题，我们提出了一种统一的理论框架，称为配置-动力学-误差分析，该框架从共同的泛化误差视角重新表述了主要的DD方法，并提供了两个主要结果：（i）一个缩放定律，提供了一个单一配置的上界，描述了随着蒸馏样本量的增加，误差如何减少，并解释了通常观察到的性能饱和效应；（ii）一个覆盖率定律，表明所需的蒸馏样本量与配置多样性成线性关系，并且具有可证明的匹配上界和下界。此外，我们的统一分析揭示了各种匹配方法是可互换的代理，减少了相同的泛化误差，澄清了它们为何都能实现数据集蒸馏，并提供了关于代理选择如何影响样本效率和鲁棒性的指导。跨多种方法和配置的实验经验上证实了所推导的定律，为DD提供了理论基础，并使基于理论的设计紧凑且配置鲁棒的数据集蒸馏成为可能。

Summary / 总结

This paper aims to provide a theoretical foundation for dataset distillation (DD), a method to create compact synthetic datasets that match full-data training performance. The authors propose a unified theoretical framework called configuration-dynamics-error analysis, which reformulates major DD approaches under a common generalization-error perspective. They derive two main results: a scaling law that characterizes the error reduction with increasing distilled sample size and a coverage law that shows the required sample size scales linearly with configuration diversity. Experiments across various methods and configurations confirm these laws, advancing the theoretical understanding of DD and enabling more robust and efficient design of distilled datasets.

本文旨在通过提出一种统一的理论框架——配置-动力学-误差分析，为数据集蒸馏（DD）提供理论基础。该框架将主要的DD方法重新表述为通用的泛化误差视角，并推导出两个主要结果：误差减少的缩放定律和所需的样本大小与配置多样性成线性关系的覆盖定律。实验验证了这些定律，推进了对DD的理论理解，并使基于理论的设计更加高效和鲁棒。

Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller

First: 2025-12-05T15:32:36+00:00 · Latest: 2025-12-05T15:32:36+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.

中文标题/摘要

标题：高效稳健的多智能体驾驶模拟行为模型研究

可扩展的多智能体驾驶模拟需要既真实又计算高效的行为模型。我们通过优化控制个体交通参与者的行为模型来解决这一问题。为了提高效率，我们采用以实例为中心的场景表示，其中每个交通参与者和地图元素都在自己的局部坐标系中建模。这种设计使得场景编码高效且视角不变，并允许静态地图标记在模拟步骤之间重用。为了建模交互，我们使用以查询为中心的对称上下文编码器，并在局部坐标系之间使用相对位置编码。我们使用对抗逆向强化学习来学习行为模型，并提出了一种自适应奖励转换，以在训练过程中自动平衡稳健性和真实性。实验表明，我们的方法能够随标记数量高效扩展，显著减少了训练和推理时间，并在位置准确性及稳健性方面优于几种以智能体为中心的基线方法。

Summary / 总结

The research aims to develop efficient and robust behavior models for multi-agent driving simulation. The method involves using an instance-centric scene representation and a query-centric symmetric context encoder with relative positional encodings to model interactions. The approach also employs Adversarial Inverse Reinforcement Learning and an adaptive reward transformation to balance robustness and realism. Key experimental findings show that the proposed method scales efficiently with the number of tokens, reducing training and inference times, and outperforms agent-centric baselines in positional accuracy and robustness.

研究旨在开发适用于多智能体驾驶模拟的高效且稳健的行为模型。方法包括基于实例的场景表示和基于查询的对称上下文编码器，带有局部框架间的相对位置编码。该方法使用对抗逆强化学习和自适应奖励转换来平衡稳健性和现实性。关键发现包括随着标记数量的增加高效扩展、减少训练和推理时间、以及在位置精度和稳健性方面优于基于智能体的基线方法。
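The viewpoint invariance attributed to the instance-centric representation can be illustrated with planar poses. The sketch below assumes each scene element carries a pose (x, y, heading) and expresses one element's pose in another's local frame. The functions `relative_pose` and `move_scene` are hypothetical helpers, not the paper's encoder, which learns encodings from such relative quantities.

```python
import math


def relative_pose(frame_i, frame_j):
    """Pose of frame_j expressed in the local coordinate frame of frame_i."""
    xi, yi, ti = frame_i
    xj, yj, tj = frame_j
    dx, dy = xj - xi, yj - yi
    cos_t, sin_t = math.cos(ti), math.sin(ti)
    # Rotate the global offset by -heading_i to get local coordinates.
    rel_x = cos_t * dx + sin_t * dy
    rel_y = -sin_t * dx + cos_t * dy
    # Wrap the heading difference into (-pi, pi].
    rel_t = math.atan2(math.sin(tj - ti), math.cos(tj - ti))
    return rel_x, rel_y, rel_t


def move_scene(frame, shift_x, shift_y, rotation):
    """Apply one global rigid transform (rotate, then translate) to a pose."""
    x, y, t = frame
    c, s = math.cos(rotation), math.sin(rotation)
    return (c * x - s * y + shift_x, s * x + c * y + shift_y, t + rotation)


agent = (1.0, 2.0, math.pi / 2)      # hypothetical traffic participant
lane_point = (1.0, 5.0, math.pi / 2)  # hypothetical map element

before = relative_pose(agent, lane_point)
after = relative_pose(move_scene(agent, 10.0, -4.0, 0.7),
                      move_scene(lane_point, 10.0, -4.0, 0.7))
print(before, after)
```

Because the relative pose is unchanged when the whole scene is moved or rotated, features computed from it do not depend on the choice of global viewpoint. This is consistent with the abstract's statement that static map tokens can be reused across simulation steps.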

Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Authors: Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar

First: 2025-12-05T15:30:08+00:00 · Latest: 2025-12-05T15:30:08+00:00

Comments: Extended abstract at World Modeling Workshop 2026

Abstract

Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.

中文标题/摘要

标题：通过测试时缩放探究世界模型在空间推理中的有效性

视觉-语言模型（VLMs）在需要多视角理解和具身视角转换的空间推理任务中仍然受到限制。最近的方法如MindJourney试图通过测试时缩放来缓解这一差距，其中世界模型想象基于动作的轨迹，启发式验证器从这些轨迹中选择有用的视图。在本工作中，我们系统地研究了此类测试时验证器在各基准测试中的表现，揭示了它们的潜力和缺陷。基于不确定性的分析表明，MindJourney的验证器几乎没有提供有意义的校准，而随机评分往往同样能够降低答案熵，从而暴露出系统性的动作偏差和不可靠的奖励信号。为了缓解这些问题，我们引入了一种基于空间断言的验证框架（ViSA），将测试时的奖励与可验证的、帧锚定的微断言联系起来。这种原理性的验证器在SAT-Real基准测试中一致地提高了空间推理能力，并通过更平衡的探索行为纠正了轨迹选择偏差。然而，在具有挑战性的MMSI-Bench上，包括我们的验证器在内的所有验证器都无法实现一致的缩放，这表明当前的世界模型形成了一个信息瓶颈，想象中的视图未能丰富精细的推理。总之，这些发现描绘了基于世界模型推理的测试时验证的优劣和缺陷。我们的代码可在https://github.com/chandar-lab/visa-for-mindjourney获取。

Summary / 总结

This study investigates the effectiveness of test-time scaling in Vision-Language Models for spatial reasoning, focusing on the MindJourney approach. The research reveals that the heuristic verifier in MindJourney does not provide significant calibration and that random scoring can be equally effective, highlighting action biases and unreliable reward signals. To address these issues, the authors propose a Verification through Spatial Assertions (ViSA) framework, which improves spatial reasoning on the SAT-Real benchmark by grounding rewards in verifiable micro-claims. However, on the MMSI-Bench, none of the verifiers, including ViSA, achieve consistent scaling, indicating that current world models may form an information bottleneck. The findings suggest both the promise and limitations of test-time verification for world-model-based reasoning.

这项研究探讨了测试时扩展在视觉-语言模型中进行空间推理任务的有效性，重点关注MindJourney方法。研究发现，MindJourney中的启发式验证器未能提供有意义的校准，随机评分同样有效，这揭示了动作偏差和不可靠的奖励信号。为解决这些问题，作者提出了空间断言验证（ViSA）框架，通过将奖励基于可验证的微断言来提高SAT-Real基准上的空间推理能力。然而，在更具挑战性的MMSI-Bench上，包括ViSA在内的所有验证器均未能实现一致的扩展，表明当前的世界模型可能形成了信息瓶颈。研究结果揭示了测试时验证在基于世界模型的推理中的潜力和局限性。
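The "answer entropy" referred to above can be read as the Shannon entropy of a model's probability distribution over candidate answers. The abstract does not spell out the exact uncertainty measure, so the sketch below is an assumption for illustration, and `answer_entropy` is a hypothetical name.

```python
import math


def answer_entropy(scores):
    """Shannon entropy (in nats) of non-negative answer scores, after normalising."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs)


# Four answer options: an undecided model versus a more confident one.
print(answer_entropy([0.25, 0.25, 0.25, 0.25]))
print(answer_entropy([0.85, 0.05, 0.05, 0.05]))
```

Lower entropy only means the model is more confident, not that it is more correct. That is consistent with the reported finding that random scoring often reduces answer entropy as well as the heuristic verifier does.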

3D Path Planning for Robot-assisted Vertebroplasty from Arbitrary Bi-plane X-ray via Differentiable Rendering

Authors: Blanca Inigo, Benjamin D. Killeen, Rebecca Choi, Michelle Song, Ali Uneri, Majid Khan, Christopher Bailey, Axel Krieger, Mathias Unberath

First: 2025-12-05T15:26:13+00:00 · Latest: 2025-12-05T15:26:13+00:00

Abstract

Robotic systems are transforming image-guided interventions by enhancing accuracy and minimizing radiation exposure. A significant challenge in robotic assistance lies in surgical path planning, which often relies on the registration of intraoperative 2D images with preoperative 3D CT scans. This requirement can be burdensome and costly, particularly in procedures like vertebroplasty, where preoperative CT scans are not routinely performed. To address this issue, we introduce a differentiable rendering-based framework for 3D transpedicular path planning utilizing bi-planar 2D X-rays. Our method integrates differentiable rendering with a vertebral atlas generated through a Statistical Shape Model (SSM) and employs a learned similarity loss to refine the SSM shape and pose dynamically, independent of fixed imaging geometries. We evaluated our framework in two stages: first, through vertebral reconstruction from orthogonal X-rays for benchmarking, and second, via clinician-in-the-loop path planning using arbitrary-view X-rays. Our results indicate that our method outperformed a normalized cross-correlation baseline in reconstruction metrics (DICE: 0.75 vs. 0.65) and achieved comparable performance to the state-of-the-art model ReVerteR (DICE: 0.77), while maintaining generalization to arbitrary views. Success rates for bipedicular planning reached 82% with synthetic data and 75% with cadaver data, exceeding the 66% and 31% rates of a 2D-to-3D baseline, respectively. In conclusion, our framework facilitates versatile, CT-free 3D path planning for robot-assisted vertebroplasty, effectively accommodating real-world imaging diversity without the need for preoperative CT scans.

中文标题/摘要

标题：基于可微渲染的任意双平面X射线机器人辅助椎体成形术3D路径规划

机器人系统正在通过提高准确性和减少辐射暴露来改变图像引导的干预措施。机器人辅助手术中的一个重大挑战是手术路径规划，这通常依赖于术中2D图像与术前3D CT扫描的配准。这一要求在椎体成形术等程序中尤其负担沉重且成本高昂，因为术前CT扫描通常不会常规进行。为了解决这一问题，我们提出了一种基于可微渲染的框架，用于利用双平面2D X射线进行3D经椎弓根路径规划。该方法将可微渲染与通过统计形状模型（SSM）生成的椎骨图集相结合，并使用学习到的相似性损失动态优化SSM形状和姿态，独立于固定的成像几何结构。我们通过两个阶段评估了该框架：首先，通过正交X射线进行椎骨重建以进行基准测试，然后通过使用任意视图X射线进行临床医生在环路径规划。我们的结果表明，与归一化互相关基线相比，我们的方法在重建指标（DICE：0.75 vs. 0.65）中表现更优，并且在性能上与最先进的模型ReVerteR（DICE：0.77）相当，同时保持了对任意视图的泛化能力。双侧椎弓根规划的成功率在合成数据中达到82%，在尸体数据中达到75%，分别超过了2D到3D基线的66%和31%的成功率。总之，我们的框架促进了无CT的机器人辅助椎体成形术3D路径规划，有效地适应了现实世界的成像多样性，无需术前CT扫描。

Summary / 总结

The research aims to improve accuracy and reduce radiation exposure in robot-assisted vertebroplasty by developing a differentiable rendering-based framework for 3D path planning using bi-planar 2D X-rays. The method integrates differentiable rendering with a vertebral atlas generated through a Statistical Shape Model and uses a learned similarity loss to dynamically refine shape and pose. The framework outperformed a normalized cross-correlation baseline in vertebral reconstruction (DICE: 0.75 vs. 0.65) and achieved bipedicular planning success rates of 82% and 75% with synthetic and cadaver data, respectively, exceeding the 66% and 31% of a 2D-to-3D baseline.

研究旨在通过开发基于可微渲染、使用双平面2D X光的3D路径规划框架，提高机器人辅助椎体成形术的准确性并减少辐射暴露。该方法将可微渲染与统计形状模型生成的椎骨图集相结合，并利用学习到的相似性损失动态优化椎体形状和姿态，重建指标优于归一化互相关基线。双侧椎弓根规划的成功率在合成数据上达到82%，在尸体数据上达到75%，超过了2D到3D基线方法的66%和31%。
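The DICE scores quoted above measure the overlap between a reconstructed shape and a reference shape. A minimal sketch of the standard Dice coefficient on flattened binary masks follows. The function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions, and the paper's evaluation code may differ in detail.

```python
def dice_coefficient(pred_mask, target_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred_mask) != len(target_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)
    if size_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy example with four voxels: one voxel overlaps, each mask has two voxels.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

The score ranges from 0 (no overlap) to 1 (identical masks).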

Bring Your Dreams to Life: Continual Text-to-Video Customization

Authors: Jiahua Dong, Xudong Wang, Wenqi Liang, Zongyan Han, Meng Cao, Duzhen Zhang, Hanbin Zhao, Zhi Han, Salman Khan, Fahad Shahbaz Khan

First: 2025-12-05T15:25:56+00:00 · Latest: 2025-12-05T15:25:56+00:00

Comments: Accepted to AAAI2026

Abstract

Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at https://github.com/JiahuaDong/CCVD.

中文标题/摘要

标题：实现你的梦想：持续文本到视频定制化

定制化文本到视频生成（CTVG）最近在从用户特定文本生成定制视频方面取得了巨大进展。然而，大多数CTVG方法假设个性化概念保持静态，不会随着时间逐步扩展。此外，它们在持续学习新概念时难以克服灾难性遗忘和概念忽视。为了解决上述挑战，我们开发了一种新颖的持续定制视频扩散（CCVD）模型，该模型可以通过解决遗忘和概念忽视来跨多种文本到视频生成任务持续学习新概念，从而生成视频。为了应对灾难性遗忘，我们引入了概念特定属性保留模块和任务感知概念聚合策略。它们可以在训练过程中捕捉旧概念的独特特性和身份，而在测试过程中根据相关性结合所有主题和动作适配器。此外，为了应对概念忽视，我们开发了一种可控条件合成，通过引入层特定区域注意力引导噪声估计来增强区域特征并使视频上下文与用户条件对齐。广泛的实验比较表明，我们的CCVD优于现有CTVG模型。代码可在https://github.com/JiahuaDong/CCVD获取。

Summary / 总结

The research aims to address the limitations of current customized text-to-video generation (CTVG) methods, which assume static personalized concepts and struggle with forgetting and concept neglect during continuous learning. To overcome these challenges, the authors propose the Continual Customized Video Diffusion (CCVD) model, which includes a concept-specific attribute retention module and a task-aware concept aggregation strategy to mitigate forgetting, and a controllable conditional synthesis to address concept neglect. Experimental results show that CCVD outperforms existing CTVG models in various text-to-video generation tasks.

研究旨在解决现有定制化文本到视频生成（CTVG）方法在持续学习新概念时面临的灾难性遗忘和概念忽视问题。作者提出了一种持续定制视频扩散（CCVD）模型，引入了概念特定属性保留模块和任务感知的概念聚合策略以应对灾难性遗忘。此外，开发了一种可控条件合成来增强区域特征并使视频上下文与用户条件对齐，以解决概念忽视问题。实验结果表明，CCVD在多种文本到视频生成任务中优于现有CTVG模型。