arXiv Paper Digest

2025-11-16 03:22
Snapshot: 20251116_0322
Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Authors: Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu
Venue: NeurIPS 2025
First: 2025-11-13T18:59:57+00:00 · Latest: 2025-11-13T18:59:57+00:00
Comments: Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)
Abstract
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
中文标题/摘要
标题:利用自我一致性采样增强MLLM的基于结果奖励的RL训练
结果奖励强化学习(RL)是改进多模态大型语言模型(MLLM)逐步推理的一种常见且日益重要的方法。在多项选择设置中,这一范式面临一个重要的但往往被忽视的问题:在错误的推理链之后猜测正确选项的轨迹与真实的推理获得相同的奖励,这是一个不能忽视的缺陷。我们提出了自我一致性采样(SCS)来纠正这一问题。对于每个问题,SCS (i) 引入小的视觉扰动,并 (ii) 对初始轨迹进行重复截断和重新采样;结果轨迹的一致性产生一个可微的一致性分数,该分数在策略更新时降低不可靠轨迹的权重。基于Qwen2.5-VL-7B-Instruct,将SCS插入RLOO、GRPO和REINFORCE++系列,在六个多模态基准上提高了高达7.7个百分点的准确性,且额外计算量可以忽略不计。SCS还在Qwen2.5-VL-3B-Instruct和InternVL3-8B上取得了显著的收益,为MLLM中的结果奖励RL提供了一个简单且通用的解决方案。
Summary / 总结
The paper addresses unfaithful trajectories in outcome-reward reinforcement learning for multimodal large language models (MLLMs) by proposing Self-Consistency Sampling (SCS). SCS introduces small visual perturbations and repeatedly truncates and resamples an initial trajectory; agreement among the resulting trajectories gives a consistency score that down-weights unreliable traces during policy updates. With Qwen2.5-VL-7B-Instruct as the base model, the method improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation, and it also yields gains on two other models, Qwen2.5-VL-3B-Instruct and InternVL3-8B.
论文提出了一种名为Self-Consistency Sampling (SCS)的方法,以解决多模态大型语言模型(MLLMs)在结果奖励强化学习中出现的不忠实轨迹问题。SCS通过引入小的视觉扰动并重复截断和重新采样初始轨迹来生成一致的轨迹,然后计算一个可微的一致性分数以降低不可靠轨迹的权重。该方法在六个多模态基准测试中将准确性提高了最多7.7个百分点,并且几乎不增加额外的计算成本。
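As a rough illustration of the idea in this entry, the sketch below scales an outcome reward by how often resampled continuations agree with the original answer. The function names, the plain agreement fraction, and the multiplicative weighting are all assumptions made for illustration; the abstract describes the paper's consistency score as differentiable, so the actual definition is likely different.

```python
def consistency_score(original_answer, resampled_answers):
    """Fraction of resampled trajectories whose final answer matches the original."""
    if not resampled_answers:
        # Assumption: with no resamples there is no evidence against the trace.
        return 1.0
    matches = sum(1 for answer in resampled_answers if answer == original_answer)
    return matches / len(resampled_answers)


def weighted_reward(original_answer, correct_answer, resampled_answers):
    """Outcome reward (1 if correct, else 0) scaled by the agreement fraction."""
    outcome = 1.0 if original_answer == correct_answer else 0.0
    return outcome * consistency_score(original_answer, resampled_answers)
```

Under this simplified scheme, a trajectory that lands on the right option but whose resampled continuations mostly disagree receives only a fraction of the reward.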
Depth Anything 3: Recovering the Visual Space from Any Views
Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang
First: 2025-11-13T18:59:53+00:00 · Latest: 2025-11-13T18:59:53+00:00
Comments: https://depth-anything-3.github.io/
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
中文标题/摘要
标题:深度万物3:从任意视角恢复视觉空间
我们提出了深度万物3(DA3),这是一种可以从任意数量的视觉输入中预测空间一致几何结构的模型,无论相机姿态是否已知。为了实现最小化建模,DA3 得出了两个关键见解:单一的普通 Transformer(例如,原始 DINO 编码器)足以作为骨干网络,无需专门的架构设计;单一的深度射线预测目标免去了复杂的多任务学习。通过我们的教师-学生训练范式,该模型在细节和泛化能力上达到了与深度万物2(DA2)相当的水平。我们建立了一个新的视觉几何基准,涵盖相机姿态估计、任意视角几何和视觉渲染。在该基准上,DA3 在所有任务上都达到了新的最佳水平,在相机姿态准确性上平均超越先前的 SOTA 方法 VGGT 44.3%,在几何准确性上平均超越 25.1%。此外,它在单目深度估计上也超过了 DA2。所有模型仅在公共学术数据集上进行训练。
Summary / 总结
Depth Anything 3 (DA3) is a model designed to predict spatially consistent geometry from any number of visual inputs, with or without known camera poses. It uses a plain transformer as its backbone and a single depth-ray prediction target, avoiding complex multi-task learning. DA3 matches Depth Anything 2 (DA2) in detail and generalization. On a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering, it sets a new state of the art across all tasks, surpassing the prior best method, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy.
Depth Anything 3 (DA3) 是一种可以从任意数量的视觉输入中预测空间一致几何结构的模型,无论相机姿态是否已知。它使用一个普通的 Transformer 作为骨干网络,并以单一的深度射线作为预测目标,从而避免了复杂的多任务学习。DA3 在细节和泛化能力上与 Depth Anything 2 (DA2) 相当,并在一个新的视觉几何基准上取得了新的最先进结果,在相机姿态准确性上平均超越 VGGT 44.3%,在几何准确性上平均超越 25.1%。此外,它在单目深度估计上也优于 DA2,所有模型均仅在公共学术数据集上训练。
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Authors: Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund
First: 2025-10-14T17:57:04+00:00 · Latest: 2025-11-13T18:56:10+00:00
Abstract
We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
中文标题/摘要
标题:Ax-Prover:一个用于数学和量子物理定理证明的深度推理代理框架
我们介绍了Ax-Prover,一个基于Lean的多代理系统,能够解决跨学科科学领域的难题,并且可以自主运行或与人类专家协作。为了实现这一目标,Ax-Prover 通过形式证明的方法来解决科学问题,这一过程既需要创造性的推理,也需要严格的句法严谨性。Ax-Prover 通过将大型语言模型(LLMs)与Lean工具结合,使用模型上下文协议(MCP)确保形式正确性来应对这一挑战。为了评估其作为自主证明器的性能,我们在两个公开的数学基准和两个我们引入的抽象代数和量子理论领域的Lean基准上,将我们的方法与前沿的LLMs和专门的证明模型进行了对比。在公开数据集上,Ax-Prover 与最先进的证明器竞争,而在新的基准上,它则显著优于它们。这表明,与专门系统难以泛化不同,我们的基于工具的代理定理证明方法提供了一种在不同科学领域进行形式验证的一般化方法。此外,我们展示了Ax-Prover 在实际用例中的助手能力,展示了它如何帮助一位专家数学家正式化了一个复杂的密码学定理的证明。
Summary / 总结
Ax-Prover is a multi-agent system for automated theorem proving in Lean that uses Large Language Models (LLMs) equipped with Lean tools via the Model Context Protocol (MCP) to ensure formal correctness. It was evaluated on public math benchmarks and new benchmarks in abstract algebra and quantum theory, showing competitive performance on public datasets and superior performance on new benchmarks. This indicates that Ax-Prover offers a generalizable methodology for formal verification across diverse scientific domains.
Ax-Prover 是一个使用 Large Language Models (LLMs) 并通过 Model Context Protocol (MCP) 配备 Lean 工具的多代理系统,以确保形式正确性。它在公共数学基准测试和抽象代数及量子理论的新基准测试中进行了评估,显示在公共数据集上具有竞争力的表现,在新基准测试中则表现出更优的性能。这表明 Ax-Prover 提供了一种适用于不同科学领域的形式验证的一般化方法论。
Robot Crash Course: Learning Soft and Stylized Falling
Authors: Pascal Strauch, David Müller, Sammy Christen, Agon Serifi, Ruben Grandia, Espen Knoop, Moritz Bächer
First: 2025-11-13T18:55:34+00:00 · Latest: 2025-11-13T18:55:34+00:00
Abstract
Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot's end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
中文标题/摘要
标题:机器人速成课:学习柔和且风格化的跌落
尽管在稳健运动方面取得了近期进展,但在现实世界中操作的双足机器人仍然面临跌倒的风险。大多数研究集中在防止此类事件,而我们则将重点放在跌倒现象本身。具体来说,我们旨在减少对机器人的物理损害,同时让用户能够控制机器人最终姿态。为此,我们提出了一种机器人无关的奖励函数,该函数在实现期望的最终姿态、减少冲击和保护关键机器人部件之间取得平衡。为了使策略能够应对广泛的初始跌倒条件,并在推理时能够指定任意且未见过的最终姿态,我们引入了一种基于仿真的初始和最终姿态采样策略。通过模拟和现实世界的实验,我们的工作证明即使是双足机器人也能执行受控的、柔和的跌落。
Summary / 总结
The research aims to enable bipedal robots to perform controlled, soft falls to reduce physical damage while allowing users to control the end pose. The method involves a robot-agnostic reward function that balances achieving the desired end pose with minimizing impact and protecting critical parts. Simulation-based sampling is used to handle various initial falling conditions and specify unseen end poses at inference. Experiments show that bipedal robots can perform soft falls in both simulated and real-world settings.
研究旨在使双足机器人能够执行可控的软着陆,以减少物理损伤并允许用户控制最终姿态。方法包括一种机器人无关的奖励函数,该函数平衡了实现所需最终姿态、最小化冲击和保护关键部件。通过模拟采样处理各种初始跌落条件,并在推理时指定未见过的最终姿态。实验表明,双足机器人可以在模拟和真实世界环境中执行软着陆。
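The reward described in this entry balances three goals: reaching the desired end pose, limiting impact, and protecting critical parts. A minimal sketch of such a trade-off is a weighted sum of penalties. The argument names, the linear form, and the default weights below are hypothetical and are not taken from the paper.

```python
def fall_reward(pose_error, impact_force, critical_contact_force,
                w_pose=1.0, w_impact=0.1, w_critical=1.0):
    """Weighted penalty sum; higher is better, and the maximum is 0.

    pose_error: distance between the current pose and the requested end pose.
    impact_force: magnitude of contact forces over the whole body.
    critical_contact_force: contact force on parts that must be protected.
    All three inputs are assumed to be non-negative scalars.
    """
    penalty = (w_pose * pose_error
               + w_impact * impact_force
               + w_critical * critical_contact_force)
    return -penalty
```

Raising `w_critical` relative to `w_impact` would push a learned policy to absorb the fall with less sensitive parts, at the cost of a larger overall impact.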
Instella: Fully Open Language Models with Stellar Performance
Authors: Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
First: 2025-11-13T18:52:46+00:00 · Latest: 2025-11-13T18:52:46+00:00
Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
中文标题/摘要
标题:Instella:全开源语言模型,卓越性能
大型语言模型(LLMs)在广泛的任务中展现了卓越的性能,但大多数高性能模型仍为闭源或部分开源,限制了透明度和可复现性。在此项工作中,我们介绍了Instella,这是一个完全开源的三十亿参数语言模型家族,完全基于公开可用的数据和代码库进行训练。借助AMD Instinct MI300X GPU,Instella通过大规模预训练、通用指令调优和与人类偏好的对齐开发而成。尽管预训练使用的令牌数量远少于许多同代模型,Instella仍实现了全开源模型中的最佳结果,并且可与规模相当的领先开放权重模型相竞争。我们还发布了两个专门变体:Instella-Long,能够处理长达128K令牌的上下文长度,以及通过监督微调和数学任务上的强化学习增强的推理重点模型Instella-Math。这些贡献共同确立了Instella作为透明、高性能和多功能的替代方案的地位,推动了开放和可复现的语言模型研究的目标。
Summary / 总结
The motivation for this work is to increase transparency and reproducibility in language model research by developing fully open models. Instella, a family of three billion parameter language models, is trained on publicly available data and optimized using large-scale pre-training, instruction tuning, and alignment with human preferences. Despite using fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models. Two specialized variants, Instella-Long and Instella-Math, are also introduced, enhancing the model's capabilities in handling long contexts and mathematical reasoning respectively.
这项工作的动机是通过开发完全开源的模型来增加语言模型研究的透明度和可重复性。Instella 是一个参数量为三十亿的语言模型家族,基于公开数据进行训练,并通过大规模预训练、指令调优和与人类偏好对齐进行优化。尽管使用了比许多同代模型更少的预训练令牌,但 Instella 在完全开源模型中达到了最先进的结果,并且可与规模相当的领先开放权重模型相竞争。还引入了两个专门变体 Instella-Long 和 Instella-Math,分别增强了模型在处理长上下文和数学推理方面的能力。
Global Solutions to Non-Convex Functional Constrained Problems with Hidden Convexity
Authors: Ilyas Fatkhullin, Niao He, Guanghui Lan, Florian Wolf
First: 2025-11-13T18:51:00+00:00 · Latest: 2025-11-13T18:51:00+00:00
Abstract
Constrained non-convex optimization is fundamentally challenging, as global solutions are generally intractable and constraint qualifications may not hold. However, in many applications, including safe policy optimization in control and reinforcement learning, such problems possess hidden convexity, meaning they can be reformulated as convex programs via a nonlinear invertible transformation. Typically such transformations are implicit or unknown, making the direct link with the convex program impossible. On the other hand, (sub-)gradients with respect to the original variables are often accessible or can be easily estimated, which motivates algorithms that operate directly in the original (non-convex) problem space using standard (sub-)gradient oracles. In this work, we develop the first algorithms to provably solve such non-convex problems to global minima. First, using a modified inexact proximal point method, we establish global last-iterate convergence guarantees with $\widetilde{\mathcal{O}}(\varepsilon^{-3})$ oracle complexity in non-smooth setting. For smooth problems, we propose a new bundle-level type method based on linearly constrained quadratic subproblems, improving the oracle complexity to $\widetilde{\mathcal{O}}(\varepsilon^{-1})$. Surprisingly, despite non-convexity, our methodology does not require any constraint qualifications, can handle hidden convex equality constraints, and achieves complexities matching those for solving unconstrained hidden convex optimization.
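To make the notion of hidden convexity concrete, here is a small one-dimensional instance. It is an illustrative example chosen for exposition, not one taken from the paper, and unlike the setting in the abstract the transformation $c$ is known explicitly here.

```latex
\[
  \min_{x \in \mathbb{R}} f(x), \qquad f(x) = \left(e^{x} - 2\right)^{2}.
\]
% f is non-convex in x: f''(x) = 4 e^{x} (e^{x} - 1) < 0 for all x < 0.
\[
  u = c(x) = e^{x}
  \;\Longrightarrow\;
  \min_{u > 0} F(u), \qquad F(u) = (u - 2)^{2}.
\]
% F is convex in u with minimizer u^\star = 2,
% so the global minimizer of f is x^\star = c^{-1}(u^\star) = \ln 2.
```

The map $c(x) = e^{x}$ is invertible from $\mathbb{R}$ onto $(0, \infty)$, so the non-convex problem in $x$ and the convex problem in $u$ share the same global solution. An algorithm of the kind the abstract describes would only query (sub-)gradients of $f$ in the $x$ variable, without access to $c$.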
SSR: Socratic Self-Refine for Large Language Model Reasoning
Authors: Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz
First: 2025-11-13T18:47:07+00:00 · Latest: 2025-11-13T18:47:07+00:00
Comments: Preprint; work in progress
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
中文标题/摘要
标题:SSR:苏格拉底式自我精炼在大型语言模型推理中的应用
大型语言模型(LLMs)展示了卓越的推理能力,但现有的测试时框架往往依赖粗略的自我验证和自我修正,限制了其在复杂任务中的效果。本文提出了一种新颖的框架——苏格拉底式自我精炼(SSR),用于精细评估和精确精炼LLM推理。我们提出的SSR将模型响应分解为可验证的(子问题,子答案)对,通过受控重解和自我一致性检查实现步骤级的信心估计。通过定位不可靠的步骤并迭代精炼,SSR生成更准确和可解释的推理链。在五个推理基准和三种LLM上的实证结果表明,SSR在性能上始终优于最先进的迭代自我精炼基线。除了性能提升,SSR还提供了一种原理性的黑盒方法,用于评估和理解LLM的内部推理过程。代码可在https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning/获取。
Summary / 总结
The paper introduces Socratic Self-Refine (SSR), a framework designed to enhance the reasoning accuracy of Large Language Models (LLMs) by decomposing model responses into verifiable sub-question and sub-answer pairs. SSR enables step-level confidence estimation and iterative refinement, leading to more accurate and interpretable reasoning chains. Across five reasoning benchmarks and three LLMs, SSR outperforms existing iterative self-refinement methods, providing a principled approach for evaluating and understanding LLMs' internal reasoning processes.
论文提出了Socratic Self-Refine (SSR)框架,通过将模型响应分解为可验证的子问题和子答案对,增强大型语言模型(LLM)的推理能力。这使得可以进行步骤级的信心估计和不可靠步骤的迭代修正,从而产生更准确和可解释的推理链。SSR在五个推理基准和三种LLM上优于现有迭代自我修正方法,提供了一种评估和理解LLM推理过程的原理性方法。
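The step-level confidence estimation described in this entry can be sketched as follows: each sub-question is re-solved several times, and the fraction of re-solves that reproduce the recorded sub-answer serves as its confidence. The function names, the exact-match comparison, and the `resolve` callable are hypothetical stand-ins for an LLM call; the paper's actual procedure may differ.

```python
def step_confidence(sub_answer, resolved_answers):
    """Fraction of independent re-solves that reproduce the recorded sub-answer."""
    if not resolved_answers:
        return 0.0
    agreeing = sum(1 for answer in resolved_answers if answer == sub_answer)
    return agreeing / len(resolved_answers)


def least_reliable_step(steps, resolve, n_samples=5):
    """Return the index of the step with the lowest confidence.

    steps: list of (sub_question, sub_answer) pairs.
    resolve: callable mapping a sub_question to a freshly produced answer
             (a stand-in for re-querying the model).
    Ties are broken in favour of the earliest step.
    """
    if not steps:
        raise ValueError("steps must not be empty")
    scored = []
    for index, (question, answer) in enumerate(steps):
        samples = [resolve(question) for _ in range(n_samples)]
        scored.append((step_confidence(answer, samples), index))
    return min(scored)[1]
```

A refinement loop would then regenerate the reasoning from the returned step onward and repeat until every step clears a chosen confidence threshold.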
Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
Authors: Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla, Arnav Bhavsar, Pawan Goyal
First: 2025-11-13T18:45:39+00:00 · Latest: 2025-11-13T18:45:39+00:00
Comments: 8 pages
Abstract
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
中文标题/摘要
标题:轻量级VLM和自定义LLM评估向盲人和低视力用户无障碍的迈进
大型视觉-语言模型(VLMs)在理解和生成视频描述方面表现出色,但其高内存、计算和部署需求阻碍了其在盲人和低视力(BLV)用户中的实际应用,这些用户依赖于详细且上下文相关的描述。为了研究模型规模对无障碍描述质量的影响,我们评估了具有500M和2.2B参数的SmolVLM2变体在两个不同数据集上的表现:AVCaps(户外)和Charades(室内)。在本文中,我们引入了两个专门为BLV无障碍评估设计的新颖评估框架:多上下文BLV框架,评估空间定向、社会互动、动作事件和氛围上下文;以及导航辅助框架,专注于移动性关键信息。此外,我们系统地评估了四种不同的提示设计策略,并在智能手机上部署了这两种模型,评估了FP32和INT8精度变体,以评估资源受限的移动设备上的实际性能限制。
Multitask GLocal OBIA-Mamba for Sentinel-2 Landcover Mapping
Authors: Zack Dewis, Yimin Zhu, Zhengsen Xu, Mabel Heffring, Saeid Taleghanidoozdoozan, Kaylee Xiao, Motasem Alkayid, Lincoln Linlin Xu
First: 2025-11-13T18:40:22+00:00 · Latest: 2025-11-13T18:40:22+00:00
Abstract
Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-Mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details than the other state-of-the-art methods.
Summary / 总结
This paper addresses the challenges of Sentinel-2 based land use and land cover classification by proposing a Multitask Glocal OBIA-Mamba (MSOM) approach. The method uses an object-based image analysis Mamba model with superpixels to reduce redundant computation while preserving fine-grained details. It also incorporates a global-local dual-branch CNN-Mamba architecture to model both local spatial detail and global contextual information, and employs a multitask optimization framework with dual loss functions to balance local precision and global consistency. Experimental results on Sentinel-2 imagery in Alberta, Canada, show that the proposed approach outperforms other state-of-the-art methods in terms of classification accuracy and detail resolution.
本文提出了一种Multitask Glocal OBIA-Mamba (MSOM) 方法,以应对基于Sentinel-2 的土地利用和土地覆盖分类挑战。该方法引入了使用超像素的基于对象的图像分析Mamba模型,以减少冗余计算同时保留细粒度细节。GLocal 双分支CNN-Mamba 架构被设计用于同时建模局部空间细节和全局上下文信息。此外,还采用了一种多任务优化框架,使用双重损失函数来平衡局部精度和全局一致性。在加拿大阿尔伯塔省Sentinel-2 影像上的实验结果表明,该方法在分类准确性和细节分辨率上优于其他最先进的方法。
BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models
Authors: Catherine Villeneuve, Benjamin Akera, Mélisande Teng, David Rolnick
First: 2025-10-22T16:42:46+00:00 · Latest: 2025-11-13T18:38:08+00:00
Abstract
Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
中文标题/摘要
标题:BATIS:基于贝叶斯方法的目标物种分布模型改进方法
物种分布模型(SDMs),旨在根据环境变量预测物种分布,广泛用于监测和应对生物多样性的变化。最近的SDMs深度学习进展在复杂和异质数据集上表现出色,但其效果仍受限于数据的空间偏差。本文从贝叶斯视角重新审视SDMs,并引入BATIS,这是一种新颖且实用的框架,其中先验预测通过有限的观测数据迭代更新。模型必须适当地捕捉 aleatoric 和 epistemic 不确定性,以有效地结合细粒度的局部见解与更广泛的生态模式。我们在包含eBird平台公民科学观测数据的新颖数据集上对一系列不确定性量化方法进行了基准测试。我们的实证研究表明,贝叶斯深度学习方法可以显著提高在数据稀缺地区SDMs的可靠性,这可以促进生态学理解和保护努力。
Summary / 总结
The research aims to improve the reliability of species distribution models (SDMs) in data-scarce areas by addressing spatial biases in the data. BATIS, a Bayesian framework, iteratively updates prior predictions with limited observational data to capture both aleatoric and epistemic uncertainty. The study demonstrates that Bayesian deep learning approaches significantly enhance the reliability of SDMs, particularly in data-scarce regions, contributing to ecological understanding and conservation efforts.
研究旨在通过解决数据稀缺区域的数据偏差问题,提高物种分布模型(SDMs)的可靠性。BATIS是一种贝叶斯框架,通过使用有限的观测数据迭代更新先验预测。研究显示,贝叶斯深度学习方法可以显著提高SDMs的可靠性,特别是在数据稀少的地区,有助于生态理解和保护工作。
From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis
Authors: Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su, Thomas Paul Matthews
Venue: In Machine Learning for Health (ML4H). PMLR 297, 2025
First: 2025-11-13T18:35:45+00:00 · Latest: 2025-11-13T18:35:45+00:00
Abstract
Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.
中文标题/摘要
标题:从2D到3D无需额外负担:在数字乳腺断层成像中的数据高效癌症检测
数字乳腺断层成像(DBT)通过提供体积信息来增强乳腺癌检测的可见性,从而减少重叠组织的影响;然而,有限的标注数据限制了DBT中深度学习模型的发展。为了解决数据稀缺问题,现有方法试图重用全野数字乳腺摄影(FFDM)模型,通过将DBT体积展平或逐层处理,从而丢弃体积信息。或者,3D推理方法引入了复杂的架构,需要更多的DBT训练数据。为解决这些缺点,我们提出了M&M-3D架构,该架构能够在保持参数数量与FFDM模型相同的情况下实现可学习的3D推理。M&M-3D构建了恶性肿瘤导向的3D特征,并通过反复混合这些3D特征与切片级信息来进行3D推理。这通过修改M&M的操作实现,而不增加参数,从而允许直接从FFDM转移权重。大量实验表明,M&M-3D在定位上的表现比2D投影和基于切片的3D方法高出11-54%,在分类上的表现高出3-10%。此外,在数据稀缺的情况下,M&M-3D在定位上的表现比复杂3D推理变体高出20-47%,在分类上的表现高出2-10%,而在数据丰富的情况下,其性能与这些方法相当。在流行的BCS-DBT基准测试中,M&M-3D在分类上的表现比之前的最佳基线高出4%,在定位上的表现高出10%。
Summary / 总结
The paper addresses the challenge of limited annotated data in Digital Breast Tomosynthesis (DBT) for breast cancer detection. It proposes M&M-3D, an architecture that enables learnable 3D reasoning without increasing parameters compared to its 2D counterpart, M&M. M&M-3D constructs malignancy-guided 3D features and learns 3D reasoning through mixing these features with slice-level information. Experimental results show that M&M-3D outperforms 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification, and surpasses complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the BCS-DBT benchmark, M&M-3D outperforms previous top baselines by 4% for classification and 10% for localization.
研究旨在解决数字乳腺断层摄影(DBT)中由于标注数据有限而导致的乳腺癌检测难题。提出了一种名为M&M-3D的架构,该架构在参数上与2D的M&M模型相当,但能够实现3D推理。M&M-3D通过构建恶性肿瘤导向的3D特征,并通过混合这些特征与切片级信息来学习3D推理。实验表明,M&M-3D在定位和分类方面分别比2D和3D切片方法高出11-54%和3-10%,并且在低数据情况下比复杂3D推理变体高出20-47%的定位和2-10%的分类性能,而在高数据情况下则与其相当。在BCS-DBT基准上,M&M-3D的分类性能提高了4%,定位性能提高了10%,超越了之前的顶级基线。
Regular Games -- an Automata-Based General Game Playing Language
Authors: Radosław Miernik, Marek Szykuła, Jakub Kowalski, Jakub Cieśluk, Łukasz Galas, Wojciech Pawlik
Venue: AAAI 2026
First: 2025-11-13T18:29:27+00:00 · Latest: 2025-11-13T18:29:27+00:00
Comments: Full version of AAAI 2026 paper
Abstract
We propose a new General Game Playing (GGP) system called Regular Games (RG). The main goal of RG is to be both computationally efficient and convenient for game design. The system consists of several languages. The core component is a low-level language that defines the rules by a finite automaton. It is minimal with only a few mechanisms, which makes it easy for automatic processing (by agents, analysis, optimization, etc.). The language is universal for the class of all finite turn-based games with imperfect information. Higher-level languages are introduced for game design (by humans or Procedural Content Generation), which are eventually translated to a low-level language. RG generates faster forward models than the current state of the art, beating other GGP systems (Regular Boardgames, Ludii) in terms of efficiency. Additionally, RG's ecosystem includes an editor with LSP, automaton visualization, benchmarking tools, and a debugger of game description transformations.
中文标题/摘要
标题:正则游戏:一种基于自动机的通用博弈语言
我们提出了一种新的通用博弈(GGP)系统,称为正则游戏(Regular Games, RG)。RG的主要目标是兼顾计算效率与游戏设计的便利性。该系统由几种语言组成。核心组件是一种低级语言,通过有限自动机定义规则。该语言非常精简,仅有少数机制,因而易于自动处理(由智能体、分析、优化等)。该语言对所有有限回合制不完美信息游戏这一类别是通用的。系统还引入了用于游戏设计(由人类或程序化内容生成)的高级语言,它们最终被翻译成低级语言。RG生成的前向模型比当前最先进的技术更快,在效率上超过了其他GGP系统(Regular Boardgames、Ludii)。此外,RG的生态系统包括支持LSP的编辑器、自动机可视化工具、基准测试工具和游戏描述转换调试器。
Summary / 总结
The paper introduces Regular Games (RG), a new GGP system designed for computational efficiency and ease of game design. RG uses a low-level language based on finite automata to define game rules, which is minimal and easy for automatic processing. Higher-level languages are used for game design, which are then translated into the low-level language. RG outperforms existing GGP systems in efficiency and generates faster forward models. The system also includes an editor, visualization tools, benchmarking, and a debugger for game description transformations.
论文提出了一种新的GGP系统Regular Games (RG),旨在提高计算效率和便于游戏设计。RG使用基于有限自动机的低级语言来定义游戏规则,使其易于自动处理。高级语言用于游戏设计,最终被翻译成低级语言。RG在效率上超越了现有系统,生成更快的前向模型,并击败了其他GGP系统如Regular Boardgames和Ludii。该系统还包括编辑器、可视化工具、基准测试和游戏描述转换调试器。
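To illustrate what it means for game rules to be defined by a finite automaton, the sketch below encodes a toy game as a state-to-transitions table and derives a forward model from it. This is not RG's syntax or implementation: the `RULES` table, the state and action names, and the helper functions are hypothetical, and the toy game ignores players and imperfect information.

```python
# Hypothetical rules automaton: state -> {action label: next state}.
# The toy game starts with two tokens; a move removes one or two of them.
RULES = {
    "two_left": {"take1": "one_left", "take2": "end"},
    "one_left": {"take1": "end"},
    "end": {},
}


def legal_moves(state):
    """Actions available in a state are the labels on its outgoing edges."""
    return sorted(RULES[state])


def apply_move(state, action):
    """Follow the edge labelled `action` out of `state`."""
    return RULES[state][action]


def count_playouts(state):
    """Count the distinct move sequences that lead to a terminal state."""
    moves = legal_moves(state)
    if not moves:
        return 1
    return sum(count_playouts(apply_move(state, move)) for move in moves)
```

Because the forward model is just edge lookup, agents, analyzers, and optimizers can all work on the same table, which is the kind of automatic processing the abstract points to.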
Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
Authors: Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
First: 2025-11-13T18:28:58+00:00 · Latest: 2025-11-13T18:28:58+00:00
Comments: 2 figures, 11 pages
Abstract
The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
中文标题/摘要
标题:挖掘提示和元数据引导生成在伤口护理视觉问答中的应用
异步远程护理的迅速扩展加剧了提供者的负担,从而产生了对能够帮助临床医生更高效地管理患者查询的AI系统的需求。MEDIQA-WV 2025 共享任务通过专注于为配图的伤口护理查询生成自由文本回答来应对这一挑战。在本文中,我们为英语赛道提出了两种互补的方法。第一种方法利用了挖掘提示策略,其中训练数据被嵌入,并检索最相似的前k个示例作为生成过程中的少样本示范。第二种方法基于元数据消融研究,该研究确定了四个始终能提升回答质量的元数据属性。我们训练分类器来预测测试案例的这些属性,并将它们整合到生成管道中,根据预测置信度动态调整输出。实验结果表明,挖掘提示提高了回答的相关性,而元数据引导的生成进一步提高了临床精度。这些方法共同展示了开发能够提供可靠和高效伤口护理支持的AI驱动工具的有希望的方向。
Summary / 总结
This study aims to develop AI systems that assist clinicians in managing wound care queries more efficiently. Two approaches were developed: one uses mined prompting to retrieve similar examples for generating relevant responses, and the other incorporates metadata to enhance clinical precision. Experimental results show that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision, demonstrating promising directions for AI-driven wound care support.
该研究旨在应对异步远程护理增加的工作量,开发AI系统以帮助临床医生更有效地管理伤口护理查询。提出了两种方法:一种是使用挖掘提示检索相似示例进行少样本演示,另一种是利用元数据预测并提升响应质量。实验结果表明,挖掘提示提高了响应的相关性,而元数据指导的生成则进一步提高了临床精度,显示出AI驱动的伤口护理支持的有前途的方向。
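The mined prompting step in this entry (embed the training data, retrieve the top-k most similar examples, use them as few-shot demonstrations) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy embeddings, the placeholder texts, the helper names `mine_demonstrations` and `build_prompt`, the prompt layout, and the use of cosine similarity are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mine_demonstrations(query_vec, train_items, k=2):
    """Return the k training items whose embeddings are closest to the query.

    train_items is a list of (embedding, query_text, response_text) tuples.
    """
    ranked = sorted(train_items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, demos):
    """Lay out the retrieved examples as few-shot demonstrations before the new query."""
    parts = [f"Query: {q}\nResponse: {r}" for _, q, r in demos]
    parts.append(f"Query: {query_text}\nResponse:")
    return "\n\n".join(parts)

# Hypothetical toy data: 3-dimensional embeddings with placeholder texts.
TRAIN = [
    ([1.0, 0.0, 0.0], "query A", "response A"),
    ([0.0, 1.0, 0.0], "query B", "response B"),
    ([0.7, 0.7, 0.0], "query C", "response C"),
]

demos = mine_demonstrations([0.9, 0.1, 0.0], TRAIN, k=2)
print(build_prompt("new query", demos))
```

In a real system the embeddings would come from an encoder over the query text and image, and the retrieval would typically use a vector index rather than a full sort.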
Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs
Authors: Miles Wang-Henderson, Ben Kaufman, Edward Williams, Ryan Pederson, Matteo Rossi, Owen Howell, Carl Underkoffler, Narbe Mardirossian, John Parkhill
First: 2025-11-13T18:26:58+00:00 · Latest: 2025-11-13T18:26:58+00:00
Abstract
Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
中文标题/摘要
标题:预训练联合预测以实现分子设计的大规模批处理贝叶斯优化
批量合成和测试分子设计是药物开发中的关键瓶颈。人们非常关注利用生物分子基础模型作为代理以加速这一过程。在本文中,我们展示了如何获得适用于批量贝叶斯优化(Batch BO)的可扩展概率代理,用于预测结合亲和力。这需要并行获取函数来平衡设计之间的权衡,并能够快速从联合预测密度中抽样以近似它们。通过Epistemic Neural Networks (ENNs)的框架,我们获得了基于大型结构指导模型提取的表示之上的可扩展的结合亲和力联合预测分布。本文的关键在于对ENNs中先验网络的重要性进行研究,并探讨如何在合成数据上预训练它们以提高Batch BO下游性能。其效用通过在半合成基准中以最多5倍更少的迭代重新发现已知的EGFR抑制剂以及在真实世界的小分子库中以最多10倍更少的迭代发现强效抑制剂得以证明,为大规模药物发现应用提供了有前景的解决方案。
Summary / 总结
This research aims to accelerate drug development by using machine learning to predict molecular binding affinity for batch Bayesian optimization. The method uses Epistemic Neural Networks to obtain scalable joint predictive distributions of binding affinity, with a focus on pretraining prior networks on synthetic data to enhance performance. Key findings show that this approach can rediscover known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations and find potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, which the authors present as a promising solution for large-scale drug discovery.
该研究旨在通过机器学习预测分子结合亲和力来加速药物开发,使用Epistemic Neural Networks获得可扩展的联合预测分布,重点在于在合成数据上预训练先验网络以提高性能。关键发现包括:在半合成基准上重新发现已知的EGFR强效抑制剂所需的迭代次数最多减少至原来的1/5,从真实的小分子库中发现强效抑制剂所需的迭代次数最多减少至原来的1/10,展示了大规模药物发现应用的潜力。
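Why a joint predictive distribution matters for batch selection can be shown with a small sketch. The code below greedily builds a batch that maximizes the average, over joint samples, of the best predicted affinity in the batch. It is only an illustrative stand-in: the acquisition rule, the function name `greedy_batch`, and the toy samples are assumptions and are not taken from the paper.

```python
def greedy_batch(joint_samples, batch_size):
    """Greedily pick candidates that maximize the sample-average of the batch maximum.

    joint_samples[s][j] is the predicted affinity of candidate j in joint sample s.
    Because each sample covers all candidates at once, correlated candidates add
    little to a batch that already contains one of them.
    """
    num_candidates = len(joint_samples[0])
    batch = []
    best_so_far = [float("-inf")] * len(joint_samples)
    for _ in range(batch_size):
        def score(j):
            # Expected best value in the batch if candidate j were added.
            total = sum(max(best, sample[j]) for best, sample in zip(best_so_far, joint_samples))
            return total / len(joint_samples)
        choice = max((j for j in range(num_candidates) if j not in batch), key=score)
        batch.append(choice)
        best_so_far = [max(best, sample[choice]) for best, sample in zip(best_so_far, joint_samples)]
    return batch

# Hypothetical joint samples for three candidates.
# Candidates 0 and 1 move together; candidate 2 does well when they do poorly.
SAMPLES = [
    [2.0, 1.8, 0.0],
    [0.0, 0.1, 1.5],
]

print(greedy_batch(SAMPLES, batch_size=2))
```

Candidate 1 has a higher average prediction than candidate 2, yet the batch pairs candidate 0 with candidate 2, because candidate 1 is nearly redundant once candidate 0 is chosen. This is the hedging behaviour that independent per-candidate predictions cannot express.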
Textual understanding boost in the WikiRace
Authors: Raman Ebrahimi, Sean Fuhrman, Kendrick Nguyen, Harini Gurusankar, Massimo Franceschetti
First: 2025-11-13T18:25:43+00:00 · Latest: 2025-11-13T18:25:43+00:00
Abstract
The WikiRace game, where players navigate between Wikipedia articles using only hyperlinks, serves as a compelling benchmark for goal-directed search in complex information networks. This paper presents a systematic evaluation of navigation strategies for this task, comparing agents guided by graph-theoretic structure (betweenness centrality), semantic meaning (language model embeddings), and hybrid approaches. Through rigorous benchmarking on a large Wikipedia subgraph, we demonstrate that a purely greedy agent guided by the semantic similarity of article titles is overwhelmingly effective. This strategy, when combined with a simple loop-avoidance mechanism, achieved a perfect success rate and navigated the network with an efficiency an order of magnitude better than structural or hybrid methods. Our findings highlight the critical limitations of purely structural heuristics for goal-directed search and underscore the transformative potential of large language models to act as powerful, zero-shot semantic navigators in complex information spaces.
中文标题/摘要
标题:维基赛跑中的文本理解提升
维基赛跑游戏是一种利用超链接在维基百科文章之间导航的竞赛,它为复杂信息网络中的目标导向搜索提供了一个引人注目的基准。本文系统地评估了该任务中的导航策略,比较了由图论结构(介数中心性)、语义意义(语言模型嵌入)以及混合方法引导的智能体。通过在大规模维基百科子图上的严格基准测试,我们证明了一个仅由文章标题语义相似性引导的贪婪智能体表现最为有效。当该策略与简单的循环避免机制结合时,实现了完美的成功率,并且在导航网络的效率上比结构或混合方法高一个数量级。我们的研究结果突显了纯粹基于结构启发式的局限性,强调了大型语言模型作为强大零样本语义导航器在复杂信息空间中的变革潜力。
Summary / 总结
The paper evaluates navigation strategies in the WikiRace game, a benchmark for goal-directed search in complex information networks. It compares agents using graph-theoretic structure, semantic meaning, and hybrid approaches. The study shows that a purely greedy agent using semantic similarity of article titles, combined with a loop-avoidance mechanism, achieves a perfect success rate and significantly better efficiency compared to structural or hybrid methods. This highlights the limitations of structural heuristics and the effectiveness of semantic approaches in such tasks.
论文评估了WikiRace游戏中导航策略,这是一个复杂信息网络中目标导向搜索的基准。研究比较了使用图理论结构、语义意义和混合方法的代理。研究显示,使用文章标题语义相似性进行贪婪导航,并结合避免循环机制,实现了完美的成功率和显著更高的效率,比结构或混合方法都要好。这突显了纯粹结构启发式的局限性,并强调了大型语言模型在复杂信息空间中作为强大零样本语义导航器的潜力。
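The greedy strategy described in this entry, always following the link whose title is most similar to the target while avoiding loops, can be sketched as below. The sketch rests on assumptions: the toy link graph is hypothetical, word-overlap (Jaccard) similarity stands in for the language-model embeddings used in the paper, and a visited set is one simple way to avoid loops, not necessarily the paper's mechanism.

```python
def title_similarity(a, b):
    """Toy stand-in for embedding similarity: Jaccard overlap of title words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def greedy_wikirace(graph, start, target, similarity=title_similarity, max_steps=50):
    """Follow the unvisited link most similar to the target; return the path or None."""
    path = [start]
    visited = {start}  # loop avoidance: never revisit an article
    current = start
    while current != target and len(path) <= max_steps:
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if not candidates:
            return None  # dead end under this simple loop-avoidance rule
        current = max(candidates, key=lambda n: similarity(n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None

# Hypothetical link graph: article title -> titles it links to.
GRAPH = {
    "Cat": ["Animal", "Pet food"],
    "Animal": ["Cat", "Wild animal"],
    "Pet food": ["Cat"],
    "Wild animal": ["Animal", "Wild animal park"],
    "Wild animal park": [],
}

print(greedy_wikirace(GRAPH, "Cat", "Wild animal park"))
```

Swapping `title_similarity` for cosine similarity over real title embeddings would leave the navigation loop unchanged.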
Tight Robustness Certification through the Convex Hull of $\ell_0$ Attacks
Authors: Yuval Shapira, Dana Drachsler-Cohen
First: 2025-11-13T18:15:37+00:00 · Latest: 2025-11-13T18:15:37+00:00
Abstract
Few-pixel attacks mislead a classifier by modifying a few pixels of an image. Their perturbation space is an $\ell_0$-ball, which is not convex, unlike $\ell_p$-balls for $p\geq1$. However, existing local robustness verifiers typically scale by relying on linear bound propagation, which captures convex perturbation spaces. We show that the convex hull of an $\ell_0$-ball is the intersection of its bounding box and an asymmetrically scaled $\ell_1$-like polytope. The volumes of the convex hull and this polytope are nearly equal as the input dimension increases. We then show a linear bound propagation that precisely computes bounds over the convex hull and is significantly tighter than bound propagations over the bounding box or our $\ell_1$-like polytope. This bound propagation scales the state-of-the-art $\ell_0$ verifier on its most challenging robustness benchmarks by 1.24x-7.07x, with a geometric mean of 3.16.
中文标题/摘要
标题:通过$\ell_0$攻击凸包实现紧密鲁棒性认证
少像素攻击通过修改图像中的少量像素来误导分类器。其扰动空间是$\ell_0$球,与$p\geq1$时的$\ell_p$球不同,它不是凸的。然而,现有的局部鲁棒性验证器通常依靠线性边界传播来实现可扩展性,而线性边界传播刻画的是凸扰动空间。我们证明,$\ell_0$球的凸包是其边界框与一个非对称缩放的类$\ell_1$多面体的交集。随着输入维度的增加,凸包与该多面体的体积几乎相等。随后我们给出一种线性边界传播方法,它能精确计算凸包上的边界,并且明显紧于在边界框或类$\ell_1$多面体上的边界传播。在最具挑战性的鲁棒性基准上,这种边界传播使最先进的$\ell_0$验证器的可扩展性提升了1.24倍至7.07倍,几何平均为3.16倍。
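One way to write out the convex-hull statement from this abstract is given below. The notation is assumed for this sketch rather than taken from the paper: pixel values lie in $[0,1]$, the clean image $x$ has every coordinate strictly between 0 and 1, and $t$ is an integer budget of pixels the attacker may change. The block uses `amsmath`.

```latex
% Assumed notation: x \in (0,1)^d is the clean image, t \in \mathbb{N} is the pixel budget.
\[
  B_0^t(x) = \bigl\{\, x' \in [0,1]^d : \|x' - x\|_0 \le t \,\bigr\}
\]
\[
  \operatorname{conv}\bigl(B_0^t(x)\bigr)
  = \Bigl\{\, x' \in [0,1]^d : \sum_{i=1}^{d} g_i(x'_i) \le t \,\Bigr\},
  \qquad
  g_i(x'_i) =
  \begin{cases}
    \dfrac{x'_i - x_i}{1 - x_i}, & x'_i \ge x_i, \\[2ex]
    \dfrac{x_i - x'_i}{x_i},     & x'_i < x_i.
  \end{cases}
\]
```

Each $g_i$ measures how far pixel $i$ has moved as a fraction of the room available in that direction, so a fully changed pixel contributes exactly 1 and at most $t$ pixels can be fully changed. The upward and downward slopes differ whenever $x_i \neq 1/2$, which is the sense in which the $\ell_1$-like polytope is asymmetrically scaled; the constraint $x' \in [0,1]^d$ is the bounding box.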
Are Foundational Atomistic Models Reliable for Finite-Temperature Molecular Dynamics?
Authors: Denan Li, Jiyuan Yang, Xiangkai Chen, Lintao Yu, Shi Liu
First: 2025-03-11T09:23:01+00:00 · Latest: 2025-11-13T18:15:15+00:00
Comments: 18 pages, 5 figures
Abstract
Machine learning force fields have emerged as promising tools for molecular dynamics (MD) simulations, potentially offering quantum-mechanical accuracy with the efficiency of classical MD. Inspired by foundational large language models, recent years have seen considerable progress in developing foundational atomistic models, sometimes referred to as universal force fields, designed to cover most elements in the periodic table. This Perspective adopts a practitioner's viewpoint to ask a critical question: Are these foundational atomistic models reliable for one of their most compelling applications, in particular simulating finite-temperature dynamics? Instead of a broad benchmark, we use the canonical ferroelectric-paraelectric phase transition in PbTiO$_3$ as a focused case study to evaluate prominent foundational atomistic models. Our findings suggest a potential disconnect between static accuracy and dynamic reliability. While 0 K properties are often well-reproduced, we observed that the models can struggle to consistently capture the correct phase transition, sometimes exhibiting simulation instabilities. We believe these challenges may stem from inherent biases in training data and a limited description of anharmonicity. These observed shortcomings, though demonstrated on a single system, appear to point to broader, systemic challenges that can be addressed with targeted fine-tuning. This Perspective serves not to rank models, but to initiate a crucial discussion on the practical readiness of foundational atomistic models and to explore future directions for their improvement.
中文标题/摘要
标题:基础原子模型在有限温度分子动力学模拟中可靠吗?
机器学习力场已成为分子动力学(MD)模拟的有前途的工具,可能提供量子力学精度的同时保持经典MD的效率。受大型语言模型启发,近年来在开发覆盖周期表中大多数元素的基础原子模型方面取得了显著进展,有时称为通用力场。本文从实践者的视角提出一个关键问题:这些基础原子模型在最令人信服的应用之一,特别是模拟有限温度动力学方面是否可靠?我们没有采用广泛的基准测试,而是以PbTiO$_3$的典型铁电-顺电相变作为集中案例研究来评估主要的基础原子模型。我们的发现表明,静态准确性与动态可靠性之间可能存在脱节。虽然0 K性质通常被很好地再现,但我们观察到模型在一致地捕捉正确的相变方面存在困难,有时表现出模拟不稳定。我们认为这些挑战可能源于训练数据中的固有偏差以及对非谐性的有限描述。尽管这些观察到的不足是在单一系统上展示的,但它们似乎指出了更广泛、系统性的挑战,可以通过有针对性的微调来解决。本文不是对模型进行排名,而是旨在启动关于基础原子模型实际准备情况的重要讨论,并探索其改进的未来方向。
Summary / 总结
This paper investigates the reliability of foundational atomistic models for simulating finite-temperature molecular dynamics, using PbTiO$_3$ as a case study. The authors find that while these models can accurately reproduce static properties at 0 K, they often struggle to consistently capture the correct phase transition and can exhibit simulation instabilities. The challenges are attributed to biases in training data and a limited description of anharmonicity, suggesting the need for targeted fine-tuning to improve dynamic reliability.
研究评估了基础原子模型在模拟有限温度分子动力学方面的可靠性,以PbTiO$_3$的铁电-顺电相变作为案例。尽管静态性质准确,但模型往往无法一致地捕捉正确的相变,并且会出现模拟不稳定现象。作者认为这些问题可能源于训练数据的偏差和对非谐性的有限描述,表明存在更广泛、系统性的挑战,可以通过针对性的微调来解决。
Towards Emotionally Intelligent and Responsible Reinforcement Learning
Authors: Garapati Keerthana, Manik Gupta
First: 2025-11-13T18:09:37+00:00 · Latest: 2025-11-13T18:09:37+00:00
Abstract
Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
中文标题/摘要
标题:迈向具有情感智能和责任感的强化学习
在医疗保健和行为支持领域中,个性化的决策系统通常依赖于静态的规则基础或最大化参与度的启发式方法,这些方法忽视了用户的情感背景和伦理约束。这种做法可能会推荐不敏感或不安全的干预措施,尤其是在涉及严重精神疾病、物质使用障碍或抑郁症的领域。为了解决这一局限性,我们提出了一种负责任的强化学习(RRL)框架,该框架将情感和情境理解与伦理考虑整合到顺序决策过程中。RRL将个性化问题形式化为约束马尔可夫决策过程(CMDP),其中代理优化参与度和依从性,同时确保情感一致性和伦理安全性。我们引入了一个多目标奖励函数,明确平衡短期行为参与与长期用户福祉,定义了一个情感导向的状态表示,捕捉情感准备、情感和风险的变化。所提出的架构可以使用任何强化学习算法(例如,DQN、PPO)实现,并结合了安全约束或拉格朗日正则化。从概念上讲,该框架在机器学习策略优化中实现了同理心和责任感,将安全的RL、情感计算和负责任的人工智能联系起来。我们讨论了该方法在行为健康、教育和数字疗法等人本领域中的意义,并概述了未来实证工作的基于模拟的验证路径。本文旨在启动关于情感意识和值得信赖的个性化系统中伦理对齐的强化学习方法论对话。
Summary / 总结
The paper addresses the limitations of static rule-based decision systems in healthcare and behavioral support by proposing a Responsible Reinforcement Learning (RRL) framework. This framework integrates emotional and contextual understanding with ethical considerations into the decision-making process through a Constrained Markov Decision Process (CMDP). Key findings include the introduction of a multi-objective reward function that balances short-term engagement with long-term well-being, and an emotion-informed state representation that captures emotional readiness and risk. The RRL framework can be applied to various RL algorithms and aims to operationalize empathy and responsibility in machine learning, enhancing personalization systems in human-centric domains like behavioral health and education.
论文提出了一种负责任的强化学习(RRL)框架,以解决医疗保健和行为支持中静态规则基础决策系统的局限性。该框架将情感和情境理解与伦理考虑相结合,通过约束马尔可夫决策过程(CMDP)确保情感一致性和伦理安全性。关键发现包括引入了一个多目标奖励函数,平衡短期参与与长期福祉,以及一个情感导向的状态表示,捕捉情感准备、情感状态和风险。RRL框架可以适应各种强化学习算法,并旨在在机器学习策略优化中实现同理心和责任感,潜在应用包括行为健康、教育和数字疗法。
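The abstract mentions that the framework can be instantiated with Lagrangian regularization on top of a standard RL algorithm. A minimal sketch of that idea is shown below: the reward is penalized by a safety cost weighted by a multiplier, and the multiplier grows while the average cost exceeds a budget. The function names, the scalar engagement and cost signals, and the step size are illustrative assumptions; the paper is conceptual and does not prescribe this code.

```python
def shaped_reward(engagement, safety_cost, multiplier):
    """Lagrangian-penalized reward: engagement minus the weighted constraint cost."""
    return engagement - multiplier * safety_cost

def update_multiplier(multiplier, episode_costs, budget, step_size=0.1):
    """Dual ascent step: raise the multiplier when the average cost exceeds the budget."""
    average_cost = sum(episode_costs) / len(episode_costs)
    return max(0.0, multiplier + step_size * (average_cost - budget))

# Hypothetical usage: costs above the budget push the multiplier up,
# which in turn lowers the reward of risky actions on later episodes.
multiplier = 0.0
for costs in ([0.5, 0.7], [0.4, 0.4], [0.1, 0.1]):
    multiplier = update_multiplier(multiplier, costs, budget=0.2)
    print(round(multiplier, 3), round(shaped_reward(1.0, 0.5, multiplier), 3))
```

The policy itself (for example DQN or PPO, as the abstract suggests) would be trained on `shaped_reward` between multiplier updates.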
Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback
Authors: Mohammadsina Almasi, Hadis Anahideh
Venue: AAAI
First: 2025-11-13T18:09:08+00:00 · Latest: 2025-11-13T18:09:08+00:00
Comments: Accepted at AAAI-26 (AISI Track). Final version to appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-26), 2026
Abstract
Equitably allocating limited resources in high-stakes domains such as education, employment, and healthcare requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
中文标题/摘要
标题:延迟反馈下的双层情境多臂老虎机方法及其在个体化资源分配中的应用
在教育、就业和医疗等高风险领域公平分配有限资源需要平衡短期效益与长期影响,同时考虑延迟结果、隐藏异质性和伦理约束。然而,大多数基于学习的分配框架要么假设即时反馈,要么忽视个体特征与干预动态之间的复杂相互作用。我们提出了一种新颖的双层情境多臂老虎机框架,用于延迟反馈下的个体化资源分配,旨在在具有动态人群、容量限制和时间敏感影响的实际环境中运行。在元层面上,模型优化子组级别的预算分配以满足公平性和操作约束。在基础层面上,它使用基于观察数据训练的神经网络识别每个组内的最响应个体,同时尊重冷却窗口和通过资源特定延迟核建模的延迟治疗效果。通过明确建模时间动态和反馈延迟,该算法随着新数据的到达不断优化其策略,从而实现更响应和适应性的决策。我们在教育和劳动力发展领域的两个真实数据集上验证了该方法,结果显示它实现了更高的累积成果,更好地适应了延迟结构,并确保了子组间的公平分配。我们的结果突显了延迟感知、数据驱动决策系统的潜在价值,以改善机构政策和社会福利。
Summary / 总结
This paper addresses the challenge of equitably allocating limited resources in high-stakes domains by proposing a bi-level contextual bandit framework that accounts for delayed feedback and individual characteristics. The model optimizes subgroup-level budget allocations and identifies responsive individuals using a neural network, while respecting cooldown windows and delay kernels. Experimental results on education and workforce development datasets show that the proposed method achieves higher cumulative outcomes, adapts better to delay structures, and ensures equitable distribution across subgroups.
论文提出了一种双层上下文多臂老虎机框架,用于高风险领域的个性化资源分配,解决了延迟反馈和伦理约束问题。该模型在子群体层面优化预算分配,并使用神经网络识别响应个体,同时考虑冷却窗口和延迟核。实验结果表明,在教育和劳动力发展数据集上,该方法能够提高累积效果并实现公平分配。
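The delay kernels mentioned in this entry can be pictured as weights that spread the effect of an allocation over later time steps. The sketch below convolves an allocation history with such a kernel. The kernel values, the function name `delayed_effect`, and the idea of a single scalar effect per step are assumptions made for illustration; the paper's kernels are resource-specific and their form is not given in the abstract.

```python
def delayed_effect(allocations, kernel):
    """Spread each allocation over future steps according to a delay kernel.

    allocations[s] is the amount of resource given at step s.
    kernel[lag] is the share of the effect that shows up `lag` steps later.
    Returns the effect observed at each step (a discrete convolution).
    """
    observed = [0.0] * (len(allocations) + len(kernel) - 1)
    for step, amount in enumerate(allocations):
        for lag, weight in enumerate(kernel):
            observed[step + lag] += amount * weight
    return observed

# Hypothetical kernel: nothing is visible immediately, then half the effect
# appears one step later and the other half two steps later.
KERNEL = [0.0, 0.5, 0.5]
print(delayed_effect([1, 0, 0, 2], KERNEL))
```

A learner that assumed immediate feedback would credit step 0 with no effect at all here, which is the failure mode that delay-aware modeling is meant to avoid.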
Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease
Authors: Tahereh Zarrat Ehsan, Michael Tangermann, Yağmur Güçlütürk, Bastiaan R. Bloem, Luc J. W. Evers
First: 2025-06-19T12:49:06+00:00 · Latest: 2025-11-13T18:08:05+00:00
Abstract
Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient's tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features correspond to the four deficits. Additionally, video-based analysis allowed us to identify further granular distinctions within the sequence effect and hesitation-halts deficits. We then used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.
中文标题/摘要
标题:帕金森病手指敲击测试中运动特征的可解释、精细化视频量化
准确量化帕金森病(PD)的运动特征对于监测疾病进展和优化治疗策略至关重要。手指敲击测试是标准的运动评估方法。临床医生通过视觉评估患者的敲击表现,并根据敲击幅度、速度和不规则性给出总体严重程度评分。然而,这种主观评估容易产生评分者间和评分者内的变异,并不能提供敲击测试过程中捕捉到的个体运动特征的见解。本文介绍了一种基于计算机视觉的精细方法,用于从视频记录中量化PD的运动特征。提出了四组临床相关特征来表征运动减少、运动迟缓、序列效应和犹豫-停顿。我们在来自个性化帕金森项目74名PD患者的视频记录和临床评估上评估了我们的方法。主成分分析结合方差最大旋转表明,基于视频的特征与四种缺陷相对应。此外,基于视频的分析还允许我们进一步识别序列效应和犹豫-停顿缺陷中的细微差别。随后,我们使用这些特征训练机器学习分类器以估计运动障碍学会统一帕金森病评定量表(MDS-UPDRS)的手指敲击评分。与最先进的方法相比,我们的方法在MDS-UPDRS评分预测的准确性上更高,同时仍能提供对个体手指敲击运动特征的可解释量化。总之,所提出的框架提供了一种实用的解决方案,用于客观评估PD的运动特征,可能在临床和远程环境中应用。未来的工作需要评估其对症状治疗和疾病进展的敏感性。
Summary / 总结
This paper presents a computer vision-based method to objectively quantify motor characteristics in Parkinson's disease (PD) from video recordings of the finger-tapping test. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. The method achieves higher accuracy in predicting the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score compared to existing approaches, while providing interpretable results. The study involves 74 PD patients and demonstrates the potential of the framework for both clinical and remote assessment of PD motor characteristics.
该研究提出了一种基于计算机视觉的方法,用于从帕金森病(PD)患者的手指敲击测试视频中客观量化其运动特征。提出了四组临床相关特征来表征运动减少、运动迟缓、序列效应和犹豫-停顿。该方法在预测运动障碍学会统一帕金森病评定量表(MDS-UPDRS)手指敲击评分方面比现有方法更准确,同时提供了可解释的结果。研究涉及74名PD患者,并展示了该框架在临床和远程评估PD运动特征方面的潜力。
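To make the four deficit types concrete, the sketch below computes one simple proxy for each from per-tap timestamps and amplitudes. These proxies are assumptions chosen for illustration (mean amplitude for hypokinesia, tap rate for bradykinesia, the least-squares slope of amplitude over tap index for the sequence effect, and unusually long inter-tap intervals for hesitation-halts). They are not the feature sets defined in the paper, and the `halt_factor` threshold is hypothetical.

```python
from statistics import mean, median

def tapping_features(tap_times, amplitudes, halt_factor=2.0):
    """Compute illustrative finger-tapping proxies from per-tap data.

    tap_times: time of each tap in seconds, in increasing order.
    amplitudes: opening amplitude of each tap, same length as tap_times.
    """
    intervals = [later - earlier for earlier, later in zip(tap_times, tap_times[1:])]
    count = len(amplitudes)
    index_mean = (count - 1) / 2
    amplitude_mean = mean(amplitudes)
    # Least-squares slope of amplitude against tap index (negative means decrement).
    numerator = sum((i - index_mean) * (a - amplitude_mean) for i, a in enumerate(amplitudes))
    denominator = sum((i - index_mean) ** 2 for i in range(count))
    typical_interval = median(intervals)
    return {
        "mean_amplitude": amplitude_mean,
        "taps_per_second": len(intervals) / (tap_times[-1] - tap_times[0]),
        "amplitude_slope": numerator / denominator,
        "halts": sum(1 for gap in intervals if gap > halt_factor * typical_interval),
    }

# Hypothetical recording: amplitude shrinks tap by tap and the last gap is a long pause.
features = tapping_features([0.0, 0.5, 1.0, 1.5, 3.0], [5.0, 4.0, 3.0, 2.0, 1.0])
print(features)
```

In a video pipeline the timestamps and amplitudes would first be extracted from tracked fingertip positions; that step is outside this sketch.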
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded
Authors: Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
First: 2025-11-13T17:59:01+00:00 · Latest: 2025-11-13T17:59:01+00:00
Comments: Project Page: https://livioni.github.io/OmniVGGT-offcial/
Abstract
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
中文标题/摘要
标题:OmniVGGT:全模态驱动的视觉几何基础模型
通用3D基础模型已经开始引领统一多种视觉任务的趋势,但大多数模型仅假设RGB输入并忽略了现成的几何线索(例如,相机内参、外参和深度图)。为解决这一问题,我们提出了OmniVGGT,这是一种新型框架,可以在训练和推理过程中有效利用任意数量的辅助几何模态。在我们的框架中,提出了一种GeoAdapter来将深度和相机内参/外参编码到空间基础模型中。它使用零初始化卷积逐步注入几何信息,而不破坏基础模型的表示空间。此设计确保了优化的稳定性和极小的开销,即使有多个附加输入,推理速度也与VGGT相当。此外,还提出了一种随机多模态融合机制,在训练过程中按实例随机采样模态子集。这使得在测试过程中可以使用任意数量的模态输入,并促进学习稳健的空间表示,而不是过度拟合辅助线索。在单目/多视图深度估计、多视图立体和相机姿态估计的全面实验中,OmniVGGT在有辅助输入的情况下优于先前方法,并且即使仅使用RGB输入也能达到最先进的结果。为了进一步突出其实用性,我们将OmniVGGT集成到视觉-语言-动作(VLA)模型中。通过OmniVGGT增强的VLA模型不仅在主流基准上优于基于点云的基线模型,而且能够有效利用可获取的辅助输入,在机器人任务中实现一致的性能提升。
Summary / 总结
OmniVGGT is a framework that integrates geometric cues with visual geometry grounding for various 3D vision tasks. It uses a GeoAdapter to encode depth and camera information into a spatial foundation model without disrupting its representation space. OmniVGGT demonstrates superior performance in monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, even with RGB-only input. Moreover, it enhances vision-language-action models, providing consistent gains in robotic tasks by leveraging accessible auxiliary inputs.
OmniVGGT 是一个框架,通过整合几何线索来提升视觉几何任务的表现。它使用 GeoAdapter 将深度和相机信息编码到空间基础模型中,并使用随机多模态融合机制来处理训练中的多个输入。实验结果表明,OmniVGGT 在有辅助输入的情况下优于先前的方法,并且即使使用 RGB-only 输入也能达到最先进的性能,同时增强视觉-语言-动作模型以实现机器人任务中的持续收益。
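Two ideas from this entry lend themselves to a small sketch: sampling a random subset of auxiliary modalities per training instance, and injecting geometric features through a zero-initialized layer so that the base model's output is unchanged at the start of training. The code below illustrates both in plain Python. It is a simplified stand-in under stated assumptions: a linear map replaces the paper's zero-initialized convolutions, and the modality names, `keep_prob`, and function names are hypothetical.

```python
import random

MODALITIES = ("depth", "intrinsics", "pose")

def sample_modalities(rng, keep_prob=0.5):
    """Keep each auxiliary modality independently with probability keep_prob."""
    return [name for name in MODALITIES if rng.random() < keep_prob]

def inject(tokens, geo_features, weight):
    """Add a linear projection of the geometric features to the base tokens.

    weight has one row per token entry and one column per geometric feature.
    """
    return [
        token + sum(w * g for w, g in zip(row, geo_features))
        for token, row in zip(tokens, weight)
    ]

# Zero-initialized projection: at the start of training the injection is a no-op,
# so the pretrained representation is left untouched.
tokens = [1.0, 2.0]
geo = [5.0, 6.0, 7.0]
zero_weight = [[0.0] * len(geo) for _ in tokens]
print(inject(tokens, geo, zero_weight))

rng = random.Random(0)
print(sample_modalities(rng))
```

Because any subset, including the empty one, can occur during training, the same model can be queried at test time with whichever auxiliary inputs happen to be available.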
Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Authors: Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh
First: 2025-11-13T17:48:38+00:00 · Latest: 2025-11-13T17:48:38+00:00
Abstract
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
中文标题/摘要
标题:基于属性条件的人类评估基准研究:图像生成中的多样性评估
尽管生成质量有所提高,但当前的文本到图像(T2I)模型往往缺乏多样性,生成的是同质化的输出。本研究引入了一种框架,以解决T2I模型中稳健多样性评估的需求。该框架通过评估单个概念及其相关的变化因素来系统地评估多样性。主要贡献包括:(1) 一种新颖的人类评估模板,用于细致的多样性评估;(2) 一个经过精心策划的提示集,涵盖了各种概念及其识别的变化因素(例如,提示:苹果的图像,变化因素:颜色);以及(3) 一种通过二项式检验比较模型在人类注释方面的方法。此外,我们还严格比较了各种图像嵌入在多样性测量中的效果。值得注意的是,我们的原则性方法能够对T2I模型进行多样性排名,识别出它们特别困难的类别。本研究提供了一种稳健的方法和见解,为T2I模型多样性和度量标准的发展铺平了道路。
Summary / 总结
This work addresses the lack of diversity in text-to-image generation models by introducing a framework for robust diversity evaluation. The method involves a novel human evaluation template, a curated prompt set, and a methodology using binomial tests to compare models. Key findings include the ranking of T2I models by diversity and identification of categories where they struggle, providing insights for model improvement.
该研究通过引入一种系统性的多样性评估框架,解决了文本到图像生成模型缺乏多样性的问题。方法包括一种新颖的人类评估模板、一个标注了变化因素的精选提示集,以及一种使用二项式检验比较模型的方法。研究还比较了多种图像嵌入在多样性度量中的表现,并按多样性对T2I模型进行了排名,指出了它们表现不佳的类别。
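The abstract says models are compared through human annotations via binomial tests. One plausible reading, assumed here, is a pairwise setup: annotators judge which of two models produced the more diverse image set, and the null hypothesis is that either model is preferred with probability 0.5. The sketch below computes an exact two-sided p-value for that setup; the function name and the example counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(wins, trials, p=0.5):
    """Exact two-sided binomial test p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null success probability p.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[wins]
    # Small relative tolerance so that ties in probability are counted.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

# Hypothetical annotation counts: model A judged more diverse in 9 of 10 comparisons.
print(binomial_two_sided_p(wins=9, trials=10))
```

A small p-value would indicate that the preference for one model is unlikely to be chance; with several model pairs, a multiple-comparison correction would normally be applied on top.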
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
Authors: Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson, Stefanie Pender, Sachin Chitta, Yotto Koga, Faez Ahmed, Lawrence Sass, Randall Davis
Venue: NeurIPS 2025
First: 2025-11-04T01:02:21+00:00 · Latest: 2025-11-13T17:46:04+00:00
Comments: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
Abstract
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
中文标题/摘要
标题:使用3D生成AI和视觉语言模型从文本构建多组件物体的机器人装配
3D生成AI的进步使得从文本提示创建物理对象成为可能,但在创建涉及多种组件类型的对象时仍面临挑战。我们提出了一种将3D生成AI与视觉语言模型(VLMs)结合的管道,以使自然语言生成多组件物体的机器人装配成为可能。我们的方法利用VLMs进行零样本、多模态的几何和功能推理,将生成的网格分解为使用预定义结构和面板组件的多组件3D模型。我们证明VLM能够根据物体的几何形状和功能确定哪些网格区域需要面板组件。在测试对象上的评估显示,用户中有90.6%的时间更喜欢VLM生成的分配,而基于规则的分配为59.4%,随机分配为2.5%。最后,该系统允许用户通过对话反馈来细化组件分配,从而在使用生成AI和机器人技术制作物理对象时赋予更大的人类控制权和自主权。
Dynamic Avatar-Scene Rendering from Human-centric Context
Authors: Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu
First: 2025-11-13T17:39:06+00:00 · Latest: 2025-11-13T17:39:06+00:00
Comments: 13 pages, 8 figures
Abstract
Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in the scene, especially the human, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
中文标题/摘要
标题:基于人体中心情境的动态Avatar-场景渲染
从单目视频中重构动态人类与真实环境交互是一项重要且具有挑战性的任务。尽管在4D神经渲染方面取得了显著进展,现有方法要么整体建模动态场景,要么分别建模场景和背景以引入参数化的人体先验。然而,这些方法要么忽略了场景中各个组件尤其是人体的运动特征,导致重构不完整,要么忽略了分别建模组件之间的信息交换,导致人体-场景边界处的空间不一致和视觉伪影。为了解决这个问题,我们提出了“分别建模-然后映射”(StM) 策略,引入了专门的信息映射机制来连接分别定义和优化的模型。我们的方法为每个高斯属性使用共享的变换函数,以统一分别建模的组件,通过避免耗时的两两交互来提高计算效率,同时确保人体与其周围环境的空间和视觉一致性。在单目视频数据集上的大量实验表明,StM 在视觉质量和渲染精度方面显著优于现有最先进的方法,特别是在人体-场景交互边界处。
Summary / 总结
The research aims to reconstruct dynamic humans interacting with real-world environments from monocular videos, addressing limitations of existing holistic or separate modeling approaches. The proposed Separate-then-Map (StM) strategy introduces a dedicated information mapping mechanism to unify separately defined and optimized models, enhancing computational efficiency and ensuring spatial and visual coherence. Experiments show that StM outperforms existing methods in visual quality and rendering accuracy, especially at human-scene interaction boundaries.
研究旨在通过单目视频重建动态人类与现实环境的互动,解决现有整体和单独建模方法的局限性。提出的Separate-then-Map (StM) 策略引入了专门的信息映射机制,通过共享变换函数统一单独定义和优化的组件。实验表明,StM 在视觉质量和渲染准确性方面优于现有方法,尤其是在人类与场景互动边界处。
Say It Differently: Linguistic Styles as Jailbreak Vectors
Authors: Srikant Panda, Avinash Rai
First: 2025-11-13T17:24:38+00:00 · Latest: 2025-11-13T17:24:38+00:00
Abstract
Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to 57 percentage points. Styles such as fearful, curious, and compassionate are most effective, and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step that uses a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
中文标题/摘要
标题:换种说法:语言风格作为越狱向量
大型语言模型(LLMs)通常会针对改写或语义等价的越狱提示进行稳健性评估,但很少有人关注将语言变体作为攻击面。在本文中,我们系统地研究了诸如恐惧或好奇等语言风格如何重新包装有害意图并诱使对齐模型产生不安全的响应。我们通过使用手工制作的模板和基于LLM的重写将3个标准数据集的提示转换为11种不同的语言风格,构建了一个风格增强的越狱基准,同时保持语义意图不变。评估16个开源和闭源指令调优模型后,我们发现风格重塑可以将越狱成功率提高多达57个百分点。诸如恐惧、好奇和同情等风格最有效,上下文相关的重写优于模板变体。为了缓解这一问题,我们引入了一种风格中和预处理步骤,使用第二个LLM从用户输入中去除具有操控性的风格线索,显著降低了越狱成功率。我们的研究结果揭示了一个系统性的、无法靠扩大模型规模消除的漏洞,这一漏洞在当前的安全管道中被忽视。
SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
Authors: Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou, Zhuotao Tian, Liqiang Nie
Venue: AAAI 2026 Oral
First: 2025-11-13T17:24:37+00:00 · Latest: 2025-11-13T17:24:37+00:00
Comments: Accepted to AAAI 2026 (Oral), Project Page: https://github.com/JiuTian-VL/SemanticVLA
Abstract
Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
中文标题/摘要
标题:SemanticVLA:语义对齐的稀疏化与增强以实现高效的机器人操作
视觉-语言-行动(VLA)模型在机器人操作方面取得了进展,但实际部署仍受到两个关键限制的阻碍:1)感知冗余,其中无关的视觉输入被无效处理;2)表面的指令-视觉对齐,这阻碍了行动的语义定位。本文提出了一种新的VLA框架——SemanticVLA,以实现语义对齐的稀疏化与增强以实现高效的机器人操作。具体而言:1)为了稀疏化冗余感知同时保持语义对齐,语义引导的双视觉剪枝器(SD-Pruner)执行:指令驱动剪枝器(ID-Pruner)从SigLIP中提取全局动作线索和局部语义锚点;空间聚合剪枝器(SA-Pruner)将几何丰富的特征压缩成任务自适应的令牌。2)为了利用稀疏化特征并整合语义与空间几何,语义互补的分层融合器(SH-Fuser)在SigLIP和DINOv2之间融合密集补丁和稀疏令牌,以实现连贯的表示。3)为了增强从感知到行动的转换,语义条件下的动作耦合器(SA-Coupler)替代了传统的观察到自由度的方法,从而为操作任务提供更高效和可解释的行为建模。在模拟和实际任务上的广泛实验表明,SemanticVLA在性能和效率方面均达到了新的SOTA。SemanticVLA在LIBERO基准测试上的成功率比OpenVLA高出21.1%,同时将训练成本和推理延迟分别降低了3.0倍和2.7倍。SemanticVLA已开源并公开发布在https://github.com/JiuTian-VL/SemanticVLA
Summary / 总结
The research aims to address the inefficiencies in Vision-Language-Action models for robotic manipulation by proposing SemanticVLA, which includes Semantic-guided Dual Visual Pruner for sparsifying redundant perception while preserving semantic alignment, Semantic-complementary Hierarchical Fuser for integrating semantics with spatial geometry, and Semantic-conditioned Action Coupler for enhancing the transformation from perception to action. The experiments demonstrate that SemanticVLA outperforms OpenVLA on the LIBERO benchmark with a 21.1% higher success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold respectively.
研究旨在通过解决感知冗余和表层指令-视觉对齐问题,提高Vision-Language-Action (VLA)模型在机器人操作中的效率和效果。提出的SemanticVLA框架包括一个语义导向的双视觉剪枝器,用于稀疏化冗余感知同时保持语义对齐;一个语义互补的层次融合器,用于整合语义与空间几何;以及一个语义条件的动作耦合器,用于增强从感知到动作的转换。实验表明,SemanticVLA在LIBERO基准测试上比OpenVLA高出21.1%的成功率,同时将训练成本和推理延迟分别减少了3.0倍和2.7倍。
Holonorm
Authors: Daryl Noupa Yongueng, Hamidou Tembine
First: 2025-11-13T17:11:33+00:00 · Latest: 2025-11-13T17:11:33+00:00
Comments: 17 pages, 11 figures, 10 tables, 2 datasets. A stable geometric alternative to LayerNorm and Tanh normalization in deep neural networks
Abstract
Normalization is a key component of transformer training. In Dynamic Tanh (DyT), the author demonstrated that Tanh can be used as an alternative to layer normalization (LN) and confirmed the effectiveness of the idea. But Tanh itself faces orthogonality, linearity and distortion problems, so that proposition cannot be considered reliable. We therefore propose Holonorm (hn), which has residual connections and nonlinearity. Holonorm is suitable for replacing Tanh in the context of normalization. Although the Holonorm expression could be similar to the softsign function in dimension one, softsign is a componentwise function, which is not well suited to tensors and vectors of high dimension. Holonorm preserves the orthogonality, the direction, and the invertibility of the signal. Holonorm is also a suitable metric and maps all vectors into the open unit ball. This prevents exploding activations and improves stability in deep Transformer models. In this work, we have meticulously examined normalization in transformers and argue, first, that Holonorm is a generalized form of the softsign function suited for use as a normalization function. Second, because it is defined between 0 and 1, hn serves as a percentage, and $1 - \text{Holonorm}$ is its complement, making it easier to interpret when evaluating a model.
中文标题/摘要
标题:Holonorm
归一化是Transformer训练中的关键环节。在Dynamic Tanh(DyT)中,作者证明了Tanh可以作为层归一化(LN)的替代方案,并证实了这一想法的有效性。但Tanh自身存在正交性、线性和失真问题,因此该方案并不可靠。因此,我们提出了Holonorm(hn),它具有残差连接和非线性。Holonorm适合在归一化场景中替代Tanh。尽管Holonorm的表达式在一维情况下可能类似于softsign函数,但softsign是逐分量函数,不适用于高维张量和向量。Holonorm保持了信号的正交性、方向和可逆性。Holonorm也是一个合适的度量,能将所有向量映射到开单位球内。这防止了激活值爆炸,并提高了深层Transformer模型的稳定性。在本文中,我们仔细研究了Transformer中的归一化,并指出:首先,Holonorm是softsign函数的一种推广形式,适合作为归一化函数;其次,hn定义在0到1之间,可作为百分比使用,$1 - \text{Holonorm}$是它的补数,使其在评估模型时更易于理解。
Summary / 总结
The research aims to improve normalization techniques in transformer training by addressing the limitations of existing methods like Tanh and LayerNorm. The authors propose Holonorm, which incorporates residual connections and nonlinearity, enhancing orthogonality and invertibility. Key findings show that Holonorm effectively prevents exploding activations and improves model stability, making it a robust alternative to Tanh and LayerNorm in deep neural networks.
论文针对Tanh和LayerNorm在Transformer训练中的局限性,提出了Holonorm作为解决方案。Holonorm结合了残差连接和非线性,解决了正交性、线性和失真问题。关键发现包括Holonorm能够保持信号的正交性、方向和可逆性,并有效防止激活爆炸和提高模型稳定性。Holonorm在0到1之间定义,便于在模型评估中进行解释。
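The abstract does not give the Holonorm formula. As a minimal sketch, assume it is the vector analogue of softsign, hn(x) = x / (1 + ||x||); this form, and the function names `holonorm` and `holonorm_inverse`, are assumptions made for illustration rather than the authors' definition. Under that assumption, the properties the abstract lists (outputs inside the open unit ball, preserved direction, invertibility, agreement with softsign in one dimension) can be checked numerically:

```python
import numpy as np

def holonorm(x, axis=-1):
    """Assumed form: x / (1 + ||x||), a vector analogue of softsign."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / (1.0 + norm)

def holonorm_inverse(y, axis=-1):
    """Inverse of the assumed map: y / (1 - ||y||), valid for ||y|| < 1."""
    norm = np.linalg.norm(y, axis=axis, keepdims=True)
    return y / (1.0 - norm)

def softsign(x):
    """Componentwise softsign, x / (1 + |x|)."""
    return x / (1.0 + np.abs(x))

# Two vectors with the same direction pattern but very different magnitudes.
x = np.array([[3.0, 4.0], [300.0, -400.0]])
y = holonorm(x)

# Every output has norm strictly below 1, however large the input is.
out_norms = np.linalg.norm(y, axis=-1)

# Direction is preserved: cosine similarity between x and y is 1.
cosine = np.sum(x * y, axis=-1) / (np.linalg.norm(x, axis=-1) * out_norms)

# The map can be undone exactly (up to floating-point error).
x_recovered = holonorm_inverse(y)
```

Unlike componentwise softsign, which squashes each coordinate separately and so changes the direction of a vector, this assumed form rescales the whole vector by a single scalar.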
Bridging LMS and generative AI: dynamic course content integration (DCCI) for enhancing student satisfaction and engagement via the ask ME assistant
Authors: Kovan Mzwri, Márta Turcsányi-Szabo
Venue: J. Comput. Educ. (2025)
First: 2025-04-04T22:17:30+00:00 · Latest: 2025-11-13T17:07:01+00:00
Abstract
Integration of Large Language Models (LLMs) with Learning Management Systems (LMSs) can enhance task automation and accessibility in education. However, hallucination where LLMs generate inaccurate or misleading information remains a challenge. This study introduces the Dynamic Course Content Integration (DCCI) mechanism, which dynamically retrieves course content from Canvas LMS and structures it within an LLM's context window via prompt engineering, enabling the LLM-powered assistant, Ask ME, to deliver context-aware, curriculum-aligned responses while mitigating hallucinations. A mixed-methods pilot study grounded in Self-Determination Theory (autonomy, competence) and the Technology Acceptance Model (perceived usefulness, ease of use) evaluated DCCI's effectiveness with 120 first-year programming students at Eötvös Loránd University. The course focused on foundational programming patterns in C#, including writing program specifications. We analyzed 14,746 logged interactions and a post-course survey completed by 101 students. User satisfaction was measured via a 5-point Likert scale (turn-level ratings), while the survey assessed usability, engagement, and ethical concerns. Results indicated high satisfaction (mean 4.65/5) and strong recognition of Ask ME's ability to provide timely, contextually relevant answers to administrative and course-related queries. 78.06% agreed that Ask ME's Canvas integration reduced platform switching, improving usability, engagement, comprehension, and topic exploration. Many students reported reduced hesitation to ask questions and increased motivation for self-directed learning, though concerns about over-reliance on AI and reduced student-teacher interaction emerged. This study demonstrates that DCCI enhances LLM reliability, student satisfaction, and engagement in AI-driven educational automation, while highlighting the importance of balancing
中文标题/摘要
标题:连接LMS与生成式AI:通过Ask ME助手实现动态课程内容集成(DCCI)以提升学生满意度和参与度
将大型语言模型(LLMs)与学习管理系统(LMSs)集成可以增强教育中的任务自动化和访问性。然而,LLMs生成不准确或误导性信息的幻觉仍然是一个挑战。本研究介绍了动态课程内容集成(DCCI)机制,该机制通过提示工程动态从Canvas LMS检索课程内容,并将其结构化在LLM的上下文窗口中,使LLM驱动的Ask ME助手能够提供上下文相关、课程对齐的响应,同时减轻幻觉。一项基于自我决定理论(自主性、能力)和技术接受模型(感知有用性、易用性)的混合方法试点研究评估了DCCI的有效性,研究对象为爱瓦尔·洛兰大学120名一年级编程学生。该课程侧重于C#基础编程模式,包括编写程序规范。我们分析了14,746次记录的交互和101名学生完成的课程后调查。用户满意度通过5点李克特量表(回合级评分)进行测量,而调查评估了易用性、参与度和伦理问题。结果显示,满意度高(平均4.65/5),且学生强烈认可Ask ME的Canvas集成减少了平台切换,提高了易用性、参与度、理解和主题探索。许多学生报告称,他们减少了提问的犹豫,增加了自主学习的动力,尽管也有对过度依赖AI和学生-教师互动减少的担忧。本研究证明,DCCI增强了LLM的可靠性、学生满意度和参与度,在AI驱动的教育自动化中发挥了作用,同时强调了平衡的重要性
On the Detectability of Active Gradient Inversion Attacks in Federated Learning
Authors: Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento
First: 2025-11-13T17:06:57+00:00 · Latest: 2025-11-13T17:06:57+00:00
Abstract
One of the key advantages of Federated Learning (FL) is its ability to collaboratively train a Machine Learning (ML) model while keeping clients' data on-site. However, this can create a false sense of security. Although not sharing private data increases overall privacy, prior studies have shown that gradients exchanged during FL training remain vulnerable to Gradient Inversion Attacks (GIAs). These attacks allow reconstructing the clients' local data, breaking the privacy promise of FL. GIAs can be launched by either a passive or an active server. In the latter case, a malicious server manipulates the global model to facilitate data reconstruction. While effective, earlier attacks falling under this category have been demonstrated to be detectable by clients, limiting their real-world applicability. Recently, novel active GIAs have emerged, claiming to be far stealthier than previous approaches. This work provides the first comprehensive analysis of these claims, investigating four state-of-the-art GIAs. We propose novel lightweight client-side detection techniques, based on statistically improbable weight structures and anomalous loss and gradient dynamics. Extensive evaluation across several configurations demonstrates that our methods enable clients to effectively detect active GIAs without any modifications to the FL training protocol.
中文标题/摘要
标题:关于联邦学习中主动梯度反转攻击可检测性的研究
联邦学习(FL)的一个关键优势在于其能够在保护客户端数据隐私的同时协作训练机器学习(ML)模型。然而,这可能会造成一种虚假的安全感。尽管不共享私有数据可以提高整体隐私性,但先前的研究表明,在FL训练过程中交换的梯度仍然容易受到梯度反转攻击(GIAs)的威胁。这些攻击允许重建客户端的本地数据,破坏FL的隐私承诺。GIAs可以由被动或主动服务器发起。在后者的情况下,恶意服务器通过操纵全局模型来促进数据重建。虽然有效,但早期此类攻击已被证明可以通过客户端检测到,限制了它们的实际应用。最近,出现了新型的主动GIAs,声称比之前的攻击方法更为隐蔽。本研究提供了对这些声称的首次全面分析,调查了四种最先进的GIAs。我们提出了基于统计上不可能的权重结构和异常损失及梯度动态的新颖轻量级客户端检测技术。在多种配置下的广泛评估表明,我们的方法使客户端能够在不修改FL训练协议的情况下有效检测主动GIAs。
Summary / 总结
This study addresses the vulnerability of Federated Learning (FL) to active Gradient Inversion Attacks (GIAs), where a malicious server manipulates the global model to facilitate data reconstruction. The research introduces novel client-side detection techniques based on statistical anomalies and loss dynamics. Evaluations show that these methods can effectively detect active GIAs without altering the FL training protocol, challenging the stealth claims of recent GIAs.
该研究探讨了联邦学习在面对恶意服务器操纵全局模型以实现数据重建的主动梯度反转攻击(GIAs)时的脆弱性。研究引入了基于权重结构统计异常和损失及梯度动态的新颖轻量级客户端检测技术。在多种配置下的评估表明,这些方法能够在不修改联邦学习训练协议的情况下有效检测主动GIAs。
Strategic Opponent Modeling with Graph Neural Networks, Deep Reinforcement Learning and Probabilistic Topic Modeling
Authors: Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniots, Leonidas Bakopoulos
First: 2025-11-13T17:06:56+00:00 · Latest: 2025-11-13T17:06:56+00:00
Comments: 26 pages
Abstract
This paper provides a comprehensive review of mainly Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) Machine Learning methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of Graph Neural Networks (GNN). Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of Reinforcement Learning (RL), and in particular that of Multiagent Deep Reinforcement Learning (MADRL). Following that, we describe existing relevant game theoretic solution concepts and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes PTM in domains other than that of document analysis and classification. The capability of PTM to estimate unknown underlying distributions can help with tackling heterogeneity and unknown agent beliefs. Finally, we identify certain open challenges, specifically the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, and (iv) guarantee scalability and solution tractability.
中文标题/摘要
标题:基于图神经网络、深度强化学习和概率主题建模的战略对手建模
本文对主要的图神经网络、深度强化学习和概率主题建模方法进行了全面回顾,重点在于它们在战略多智能体环境中的潜在应用。我们关注(i)目前用于揭示可适应战略对手建模任务的未知模型结构的机器学习方法,以及(ii)将这些方法与博弈论概念结合,避免依赖于在现实世界场景中经常无效的假设,如共同先验假设(CPA)和自我利益假设(SIH)。我们分析了处理不确定性和异质性这两种在现实世界应用案例中非常常见的特性,以及可扩展性。作为有效建模多智能体环境中关系和交互的潜在答案,我们提倡使用图神经网络(GNN)。此类方法旨在处理图结构数据,并已被证明是执行节点分类和链接预测等任务的强大工具。接下来,我们回顾了强化学习(RL)领域,特别是多智能体深度强化学习(MADRL)。随后,我们描述了现有的相关博弈论解决方案概念,并考虑了诸如公平性和稳定性等属性。我们的回顾还包括了利用PTM在文档分析和分类之外领域的文献注记。PTM的能力在于估计未知的潜在分布,有助于解决异质性和未知代理信念的问题。最后,我们指出了某些开放挑战,包括(i)适应非平稳环境,(ii)平衡稳定性和适应性,(iii)应对不确定性和异质性,(iv)确保可扩展性和解决方案的可处理性。
Summary / 总结
This paper reviews Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods for strategic opponent modeling in multiagent settings. It focuses on leveraging these methods to handle uncertainty and heterogeneity, and to avoid assumptions like the Common Prior Assumption and Self-Interest Hypothesis. Key findings include the effectiveness of Graph Neural Networks in modeling relationships and interactions, and the potential of Probabilistic Topic Modeling to estimate unknown distributions and agent beliefs. Challenges include fitting non-stationary environments and ensuring scalability and solution tractability.
本文回顾了用于多智能体环境中的战略对手建模的图神经网络、深度强化学习和概率主题建模方法。它强调了GNN在处理不确定性和异质性方面的应用,并将这些方法与博弈论概念结合以避免不切实际的假设。主要发现包括GNN在建模关系和互动方面的有效性,以及PTM在处理异质性和未知智能体信念方面的潜力。
Transfer in Reinforcement Learning via Regret Bounds for Learning Agents
Authors: Adrienne Tuynman, Ronald Ortner
First: 2022-02-02T18:10:21+00:00 · Latest: 2025-11-13T17:06:34+00:00
Abstract
We present an approach for the quantification of the usefulness of transfer in reinforcement learning via regret bounds for a multi-agent setting. Considering a number of $\aleph$ agents operating in the same Markov decision process, however possibly with different reward functions, we consider the regret each agent suffers with respect to an optimal policy maximizing her average reward. We show that when the agents share their observations the total regret of all agents is smaller by a factor of $\sqrt{\aleph}$ compared to the case when each agent has to rely on the information collected by herself. This result demonstrates how considering the regret in multi-agent settings can provide theoretical bounds on the benefit of sharing observations in transfer learning.
中文标题/摘要
标题:基于学习智能体后悔界的强化学习迁移
我们提出了一种通过后悔界量化强化学习中迁移有用性的方法,适用于多智能体环境。考虑在同一马尔可夫决策过程中运行、但奖励函数可能不同的$\aleph$个智能体,我们考察每个智能体相对于最大化其平均奖励的最优策略所遭受的后悔。我们证明,当智能体共享其观察时,所有智能体的总后悔比每个智能体仅依赖自己收集的信息时小$\sqrt{\aleph}$倍。这一结果表明,在多智能体环境中考察后悔可以为迁移学习中共享观察的收益提供理论界。
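To make the $\sqrt{\aleph}$ factor concrete, here is a short worked illustration. The abstract does not state the form of the per-agent bound, so the $C\sqrt{T}$ shape below (with $T$ the number of steps and $C$ a constant collecting MDP-dependent terms) is an assumption, chosen only because it is a common shape for regret bounds in this setting.

```latex
\begin{align*}
R_{\text{no sharing}}(T) &\approx \aleph \cdot C\sqrt{T}, \\
R_{\text{sharing}}(T) &\approx \frac{\aleph \cdot C\sqrt{T}}{\sqrt{\aleph}} = C\sqrt{\aleph\,T}.
\end{align*}
```

Under this assumed shape, $\aleph = 16$ agents would incur a total regret of roughly $16\,C\sqrt{T}$ without sharing and $4\,C\sqrt{T}$ with sharing, so the average regret per agent would drop from $C\sqrt{T}$ to $C\sqrt{T}/4$.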
Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising
Authors: Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
First: 2025-11-13T17:05:36+00:00 · Latest: 2025-11-13T17:05:36+00:00
Abstract
Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our codes are available at: https://github.com/itu-biai/deep_tv_for_ldct
中文标题/摘要
标题:可学习的全变差与Lambda映射在低剂量CT降噪中的应用
尽管全变差(TV)在噪声减少和边缘保留方面表现良好,但其对lambda参数的依赖限制了其效率,并使其难以有效使用。在本研究中,我们提出了一种可学习的全变差(LTV)框架,该框架将展开的TV求解器与数据驱动的Lambda映射网络(LambdaNet)预测的像素级正则化图相结合。该流水线端到端训练,使得重建和正则化联合优化,从而实现空间自适应平滑:在均匀区域较强,在解剖边界附近放松。使用DeepLesion数据集和从LoDoPaB-CT方法改编的真实噪声模型进行的实验显示,与经典TV和FBP+U-Net相比,LTV在平均上分别获得了+2.9 dB PSNR和+6% SSIM的提升。LTV为黑盒CNN提供了一种可解释的替代方案,并为3D和数据一致性驱动的重建奠定了基础。我们的代码可在以下链接获取:https://github.com/itu-biai/deep_tv_for_ldct
Summary / 总结
This study addresses the limitations of traditional Total Variation (TV) in noise reduction and parameter dependency by introducing a Learnable Total Variation (LTV) framework. The framework combines an unrolled TV solver with a Lambda Mapping Network (LambdaNet) that predicts a per-pixel regularization map. The model is trained end-to-end, leading to spatially adaptive smoothing. Experiments on the DeepLesion dataset demonstrated that LTV outperformed classical TV and FBP+U-Net, achieving an average improvement of 2.9 dB in PSNR and 6% in SSIM. This method provides a more interpretable alternative to black-box CNNs and lays the groundwork for 3D and data-consistency-driven reconstruction.
本研究通过引入可学习的总变差(LTV)框架解决了传统总变差(TV)在噪声抑制和参数依赖性方面的局限性。该框架结合了一个展开的TV求解器和一个预测像素级正则化图的Lambda映射网络(LambdaNet)。模型通过端到端训练,实现了空间自适应平滑。实验结果表明,LTV在DeepLesion数据集上的表现优于经典TV和FBP+U-Net,平均PSNR提高了2.9 dB,SSIM提高了6%。该方法为黑盒CNN提供了更可解释的替代方案,并为3D和数据一致性驱动的重建奠定了基础。
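The idea of a per-pixel regularization map driving an unrolled TV solver can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the solver here is plain gradient descent on a smoothed TV energy, the map `lam_map` is set by hand instead of being predicted by LambdaNet, and the names `unrolled_tv`, `grad`, `div`, and `total_variation` are hypothetical.

```python
import numpy as np

def grad(u):
    """Forward differences with zero gradient at the far border."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px)
    dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]
    dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]
    dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]
    dy[1:-1, :] = py[1:-1, :] - py[:-2, :]
    dy[-1, :] = -py[-2, :]
    return dx + dy

def total_variation(u):
    """Isotropic total variation of an image."""
    gx, gy = grad(u)
    return float(np.sum(np.sqrt(gx**2 + gy**2)))

def unrolled_tv(noisy, lam_map, n_steps=100, step=0.05, eps=1e-2):
    """Fixed number of gradient steps on
    0.5 * ||u - noisy||^2 + sum(lam_map * sqrt(|grad u|^2 + eps))."""
    u = noisy.copy()
    for _ in range(n_steps):
        gx, gy = grad(u)
        mag = np.sqrt(gx**2 + gy**2 + eps)
        tv_grad = -div(lam_map * gx / mag, lam_map * gy / mag)
        u = u - step * ((u - noisy) + tv_grad)
    return u

# Toy example: a bright square on a dark background plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((32, 32))
clean[8:24, 8:24] = 1.0
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

# Hand-made stand-in for the predicted map: uniform smoothing strength.
lam_map = np.full(clean.shape, 0.2)
denoised = unrolled_tv(noisy, lam_map)
```

Setting `lam_map` to zero in a region switches the regularizer off there, which is the mechanism behind "strong in homogeneous regions, relaxed near anatomical boundaries"; in the paper that map is learned jointly with the reconstruction.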
Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou
First: 2025-08-18T08:19:43+00:00 · Latest: 2025-11-13T16:57:54+00:00
Abstract
The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
中文标题/摘要
标题:远离真相:由GenAI驱动的新闻多样性挑战LVLM基的误信息检测
多模态误信息的泛滥对公共话语和社会信任构成了日益增长的威胁。虽然大型视觉语言模型(LVLM)在多模态误信息检测(MMD)方面取得了近期进展,但生成型人工智能(GenAI)工具的兴起引入了一个新的挑战:由GenAI驱动的新闻多样性,其特征是内容高度多样化和复杂化。我们表明,这种多样性导致了多级漂移,包括(1)模型级感知漂移,其中风格变化干扰了模型的内部推理,以及(2)证据级漂移,其中表达多样性降低了检索外部证据的质量或相关性。这些漂移显著削弱了当前基于LVLM的MMD系统的稳健性。为了系统地研究这一问题,我们引入了DriftBench,这是一个包含16,000个新闻实例的大规模基准,涵盖了六个多样化的类别。我们设计了三个评估任务:(1)在多级漂移下的事实验证稳健性;(2)对抗由GenAI生成的虚假证据污染的易感性;以及(3)对多样输入推理一致性的分析。六种最先进的基于LVLM的检测器的实验显示,性能下降显著(平均F1 -14.8%),推理轨迹越来越不稳定,在对抗虚假证据注入下表现更加严重。我们的研究揭示了现有MMD系统的基本脆弱性,并建议在GenAI时代迫切需要更稳健的方法。
Summary / 总结
The paper addresses the challenge of GenAI-driven news diversity in multimodal misinformation detection, which introduces multi-level drift affecting both model-level misperception and evidence-level quality. To study this, the authors created DriftBench, a large-scale benchmark with 16,000 news instances, and evaluated six state-of-the-art LVLM-based detectors, finding significant performance drops and unstable reasoning traces under these drift conditions, especially with adversarial evidence. This highlights the need for more robust MMD systems in the GenAI era.
论文探讨了由GenAI驱动的新闻多样性对多模态虚假信息检测的挑战,这种多样性导致了模型级和证据级的漂移,削弱了基于LVLM的系统的鲁棒性。它引入了包含16,000个新闻实例的DriftBench基准,并评估了六种最先进的检测器,在这些漂移下表现出显著的性能下降和不稳定的推理轨迹,尤其是在对抗性证据注入的情况下。这表明在GenAI时代需要更稳健的MMD方法。
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Authors: Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long
Venue: AAAI 2026 Oral Presentation
First: 2025-11-12T18:38:33+00:00 · Latest: 2025-11-13T16:56:35+00:00
Comments: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
Abstract
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
中文标题/摘要
标题:vMFCoOp:在统一超球面流形上朝均衡方向努力,以指导生物医学VLMs的提示
基于大型语言模型(LLM)提炼的医学语义先验的上下文优化(CoOp)的最新进展为使用生物医学CLIP基视觉语言模型(VLMs)进行手动提示工程和全面微调提供了可扩展的替代方案。然而,这种上下文中的提示学习受到LLM和CLIP变体之间语义不匹配的挑战,这归因于不同的训练语料库和模型架构;此外,它在不断演进的基础模型家族中缺乏可扩展性。更严重的是,通过传统的欧几里得空间优化进行的两模态对齐缺乏建模统一表示或应用局部几何约束的能力,这在复杂的生物医学成像中往往会放大模态差距并导致少量样本适应不稳定。在本文中,我们提出了一种vMFCoOp框架,该框架在共享的超球面流形上逆向估计von Mises-Fisher(vMF)分布,通过统一语义锚点对任意LLM和CLIP主干之间的语义偏差进行对齐,以实现稳健的生物医学提示和优越的少量样本分类。基于三个互补约束,vMFCoOp在14个医学数据集、12种医学成像模态和13个解剖区域上表现出一致的改进,优于最先进的方法在准确度、泛化能力和临床适用性方面。本文旨在不断扩展以涵盖更多的下游应用,相应的资源将通过https://github.com/VinyehShaw/UniEqui共享。
SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers
Authors: Oded Schlesinger, Amirhossein Farzam, J. Matias Di Martino, Guillermo Sapiro
First: 2025-11-13T16:56:24+00:00 · Latest: 2025-11-13T16:56:24+00:00
Comments: Project repository: https://github.com/odedsc/SPOT
Abstract
While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
中文标题/摘要
标题:SPOT:通过标记相关性实现视觉变换器的稀疏化与注意力动态
尽管视觉变换器(ViT)在多种任务中表现出色,但其计算需求巨大,随着处理标记数量的增加而呈二次增长。紧凑的注意力表示,反映标记间交互分布,可以在注意力计算之前引导对不显着标记的早期检测和减少。受此启发,我们提出了SParsification with attentiOn dynamics via Token relevance(SPOT),一种利用标记嵌入、交互和跨层注意力动态来推断标记重要性的框架,从而实现更具有上下文意识和可解释性的相关性检测过程。SPOT 指导标记稀疏化并促进这些标记的消除,提高计算效率而不牺牲性能。SPOT 使用计算量轻的预测器,可以插入各种 ViT 架构中,并学习在各层中推导出有效的输入特定标记优先级。其灵活的设计支持多种性能水平,适应不同的资源约束。实证评估表明,与标准 ViT 相比,SPOT 可实现高达 40% 的效率提升,同时保持或甚至提高准确性。代码和模型可在 https://github.com/odedsc/SPOT 获取。
Summary / 总结
SPOT is a framework that enhances the efficiency of Vision Transformers (ViTs) by detecting and reducing less important tokens early in the process. It uses token embeddings, interactions, and attention dynamics to identify and remove redundant tokens, leading to improved computational efficiency without compromising performance. Experiments show that SPOT can achieve up to 40% efficiency gains compared to standard ViTs while maintaining or even enhancing accuracy.
SPOT 是一种框架,通过利用 token 嵌入、交互和注意力动态来推断 token 的重要性,实现对 Vision Transformers (ViTs) 中冗余 token 的早期检测。这种方法能够提供更具有上下文感知和可解释性的相关性检测,从而提高计算效率而不牺牲性能。实证评估显示,与标准 ViTs 相比,SPOT 可以实现高达 40% 的效率提升,同时保持甚至提高准确率。
Two-Scale Latent Dynamics for Recurrent-Depth Transformers
Authors: Francesco Pappone, Donato Crisostomi, Emanuele Rodolà
First: 2025-09-27T14:01:40+00:00 · Latest: 2025-11-13T16:51:26+00:00
Abstract
Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, two-scale operational picture: (i) within a looped block, updates act as small-scale refinements; (ii) across consecutive blocks, states undergo a larger-scale drift. Across training, our measurements show that loop steps become smaller and increasingly orthogonal to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the model's second-order difference in step-size, which we show is superior in terms of performance, stability and time-efficiency, when compared to the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart.
中文标题/摘要
标题:双尺度潜在动力学用于循环深度变换器
循环深度变换器通过在输出标记之前迭代潜在计算来扩展测试时的计算量。我们研究了这些迭代的几何结构,并提出了一种简单的双尺度操作图景:(i) 在循环块内部,更新表现为小尺度的细化;(ii) 在连续块之间,状态经历较大尺度的漂移。在训练过程中,我们的测量结果显示循环步骤变得越来越小且彼此更加正交,表明模型更好地局部建模了精细结构,而不是仅仅朝一个方向推进。这些动力学促使我们提出一种基于模型步长二阶差分的早期退出机制,我们证明该机制在性能、稳定性和时间效率方面优于Geiping等人的KL散度退出策略及其朴素的一阶对应策略。
Summary / 总结
The paper investigates the geometry of iterates in recurrent-depth transformers, proposing a two-scale operational picture where updates within a block are small-scale refinements, and states across blocks undergo larger-scale drift. The study shows that loop steps become smaller and more orthogonal during training, indicating better local modeling. This motivates an early-exit mechanism based on the model's second-order difference in step-size, which outperforms the KL-divergence exit strategy and its first-order counterpart in terms of performance, stability, and time-efficiency.
论文研究了循环深度变换器中迭代的几何特性,提出了一种两尺度操作图景,其中块内的更新是小尺度细化,而块间的状态则经历较大的尺度漂移。研究表明,循环步骤变得越来越小且更加正交,表明其在局部建模方面表现更好。这促使了一种基于模型步长二阶差值的早期退出机制,该机制在性能、稳定性和时间效率方面优于Geiping等人提出的KL散度退出策略及其一阶版本。
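An exit rule driven by a second-order difference can be sketched as follows. The abstract does not spell out the criterion, so this sketch assumes one plausible reading: stop when consecutive update vectors stop changing, i.e. when the norm of (h_{k+1} - h_k) - (h_k - h_{k-1}) falls below a tolerance. The function name `iterate_with_early_exit`, the tolerance, and the toy contraction used as the looped block are all hypothetical.

```python
import numpy as np

def iterate_with_early_exit(step_fn, h0, max_iters=64, tol=1e-3):
    """Apply step_fn repeatedly; exit once the second-order difference
    of the latent state is smaller than tol.

    Returns the final state and the number of step_fn applications."""
    h = step_fn(h0)
    delta_prev = h - h0
    for k in range(2, max_iters + 1):
        h_next = step_fn(h)
        delta = h_next - h
        # Second-order difference: change between consecutive updates.
        if np.linalg.norm(delta - delta_prev) < tol:
            return h_next, k
        h, delta_prev = h_next, delta
    return h, max_iters

# Toy stand-in for a looped block: a contraction with fixed point 2.
def toy_block(h):
    return 0.5 * h + 1.0

state, steps = iterate_with_early_exit(toy_block, np.zeros(4))
```

A first-order rule would instead compare the norm of `delta` itself to a threshold; the second-order variant reacts to how quickly the updates are settling rather than to their absolute size.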
Panda: Test-Time Adaptation with Negative Data Augmentation
Authors: Ruxi Deng, Wenxuan Bao, Tianxin Wei, Jingrui He
Venue: AAAI 2026
First: 2025-11-13T16:46:00+00:00 · Latest: 2025-11-13T16:46:00+00:00
Comments: Accepted by AAAI 2026
Abstract
Pretrained VLMs exhibit strong zero-shot classification capabilities, but their predictions degrade significantly under common image corruptions. To improve robustness, many test-time adaptation (TTA) methods adopt positive data augmentation (PDA), which generates multiple views of each test sample to reduce prediction variance. However, these methods suffer from two key limitations. First, they introduce considerable computational overhead due to the large number of augmentations required per image. Second, they fail to mitigate prediction bias, where the model tends to predict certain classes disproportionately under corruption, as PDA operates on corrupted inputs and typically does not remove the corruption itself. To address these challenges, we propose Panda, a novel TTA method based on negative data augmentation (NDA). Unlike positive augmentations that preserve object semantics, Panda generates negative augmentations by disrupting semantic content. It divides images into patches and randomly assembles them from a shared patch pool. These negatively augmented images retain corruption-specific features while discarding object-relevant signals. We then subtract the mean feature of these negative samples from the original image feature, effectively suppressing corruption-related components while preserving class-relevant information. This mitigates prediction bias under distribution shifts. Panda allows augmentation to be shared across samples within a batch, resulting in minimal computational overhead. Panda can be seamlessly integrated into existing test-time adaptation frameworks and substantially improves their robustness. Our experiments indicate that Panda delivers superior performance compared to PDA methods, and a wide range of TTA methods exhibit significantly enhanced performance when integrated with Panda. Our code is available at https://github.com/ruxideng/Panda .
中文标题/摘要
标题:Panda:使用负数据增强的测试时自适应
预训练的VLMs在零样本分类中表现出强大的能力,但在常见的图像损坏下预测性能显著下降。为了提高鲁棒性,许多测试时自适应(TTA)方法采用正数据增强(PDA),通过为每个测试样本生成多个视图来减少预测的方差。然而,这些方法存在两个关键问题。首先,由于每张图像需要大量的增强,这引入了显著的计算开销。其次,PDA无法缓解预测偏差,即模型在损坏输入下倾向于不当地预测某些类别,因为PDA通常不会去除损坏本身。为了解决这些挑战,我们提出了一种基于负数据增强(NDA)的新型TTA方法Panda。与保留对象语义的正增强不同,Panda通过破坏语义内容生成负增强。它将图像划分为块,并从共享块池中随机组装它们。这些负增强图像保留了损坏特定的特征,同时消除了与对象相关的信号。然后,我们从原始图像特征中减去这些负样本的均值特征,有效地抑制了与损坏相关的成分,同时保留了类别相关的信息。这在分布转移下缓解了预测偏差。Panda允许增强在批次内的样本之间共享,从而导致最小的计算开销。Panda可以无缝集成到现有的TTA框架中,并显著提高其鲁棒性。我们的实验表明,Panda在性能上优于PDA方法,而广泛使用的TTA方法与Panda结合后表现出显著的性能提升。我们的代码可在https://github.com/ruxideng/Panda 获取。
Summary / 总结
Panda is a novel test-time adaptation method that uses negative data augmentation to improve the robustness of pretrained vision-language models under image corruptions. Unlike positive data augmentation, which preserves object semantics, Panda disrupts semantic content by randomly assembling image patches from a shared pool, thus retaining corruption-specific features while removing object-relevant signals. This approach mitigates prediction bias and reduces computational overhead. Experiments show that Panda outperforms positive data augmentation methods and enhances the robustness of various test-time adaptation techniques.
Panda 是一种新颖的测试时自适应方法,使用负数据增强来提高预训练视觉-语言模型在图像损坏下的鲁棒性。与保留对象语义的正数据增强不同,Panda 通过从共享块池中随机组合图像块来破坏语义内容,从而保留损坏特定的特征并去除与对象相关的信号。该方法减轻了预测偏差并降低了计算开销。实验表明,Panda 在性能上优于正数据增强方法,并且能够显著增强各种测试时自适应技术的鲁棒性。
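The negative-augmentation step described in the abstract (split images into patches, reassemble them at random from a pool shared across the batch, then subtract the mean negative feature) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the `alpha` scaling factor, and `toy_encoder` (a fixed linear projection standing in for a VLM image encoder) are all hypothetical. See the linked repository for the actual method.

```python
import numpy as np


def split_patches(images: np.ndarray, patch: int) -> np.ndarray:
    """Cut a batch (B, H, W, C) into a flat pool of patches (B*gh*gw, patch, patch, C)."""
    b, h, w, c = images.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    grid = images.reshape(b, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(b * gh * gw, patch, patch, c)


def make_negative_images(images, patch, num_neg, rng):
    """Assemble num_neg images from patches drawn at random out of the shared pool."""
    b, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    pool = split_patches(images, patch)
    # One random pool index per patch slot of every negative image.
    idx = rng.integers(0, pool.shape[0], size=(num_neg, gh * gw))
    tiles = pool[idx].reshape(num_neg, gh, gw, patch, patch, c)
    return tiles.transpose(0, 1, 3, 2, 4, 5).reshape(num_neg, h, w, c)


def debias_features(image_feats, negative_feats, alpha=1.0):
    """Subtract the mean negative feature from each image feature.

    alpha is an assumed scaling knob, not something stated in the abstract.
    """
    return image_feats - alpha * negative_feats.mean(axis=0, keepdims=True)


def toy_encoder(images, proj):
    """Hypothetical stand-in for an image encoder: flatten, then project linearly."""
    return images.reshape(images.shape[0], -1) @ proj


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.random((4, 8, 8, 3))            # four 8x8 RGB "images"
    proj = rng.standard_normal((8 * 8 * 3, 16))
    negatives = make_negative_images(batch, patch=4, num_neg=6, rng=rng)
    feats = debias_features(toy_encoder(batch, proj), toy_encoder(negatives, proj))
    print(feats.shape)  # (4, 16)
```

Because the negatives are built once per batch and shared by every sample in it, the extra encoder cost here is `num_neg` forward passes per batch rather than many augmented views per image, which is the source of the low overhead the abstract claims.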
Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
Authors: Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
First: 2025-11-13T16:33:18+00:00 · Latest: 2025-11-13T16:33:18+00:00
Comments: 16 pages, 19 figures
Abstract
While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
中文标题/摘要
标题:超越提示提取:基于提供知识的提示优化框架用于知识密集型任务
提示优化作为一种提升语言模型性能的关键技术已经崭露头角,现有方法主要集中在基于提取的策略上,通过搜索最优提示来激活模型的能力。然而,这些方法在处理知识密集型任务时存在根本局限性,因为它们在固定参数边界内运作,而不是提供特定领域所需的事实知识、术语精确性和推理模式。为了解决这些局限性,我们提出了基于提供知识的提示优化(KPPO)框架,将提示优化重新定义为系统性知识整合,而非潜在的提取。KPPO 引入了三个关键创新:1)知识缺口填充机制,用于识别和针对性地弥补知识缺口;2)批次候选评估方法,同时考虑性能提升和分布稳定性;3)自适应知识剪枝策略,平衡性能和标记效率,最多可减少29%的标记使用量。在来自不同领域的15个知识密集型基准测试上的广泛评估表明,KPPO 在性能上优于基于提取的方法,平均性能提升约为6%,同时实现相当或更低的标记消耗。代码参见:https://github.com/xyz9911/KPPO。
Summary / 总结
The research addresses the limitations of elicitation-based prompt optimization methods in knowledge-intensive tasks by proposing Knowledge-Provision-based Prompt Optimization (KPPO), which treats prompt optimization as systematic knowledge integration. KPPO introduces a knowledge gap filling mechanism, a batch-wise candidate evaluation approach, and an adaptive knowledge pruning strategy. Across 15 knowledge-intensive benchmarks from various domains, the method shows an average performance improvement of about 6% over the strongest baseline, with comparable or lower token consumption; the pruning strategy reduces token usage by up to 29%.
研究旨在通过解决基于提取的提示优化方法的局限性,提高语言模型在知识密集型任务上的性能。提出了KPPO,该方法系统地将知识整合到提示中。关键创新包括知识缺口填充机制、批次候选评估和自适应知识剪枝。在15个知识密集型基准测试上,KPPO相比最强基线平均提高了约6%的性能,同时标记消耗相当或更低;剪枝策略最多可减少29%的标记使用量。
OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data
Authors: Simon Donike, Cesar Aybar, Julio Contreras, Luis Gómez-Chova
First: 2025-11-13T16:28:35+00:00 · Latest: 2025-11-13T16:28:35+00:00
Abstract
We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.
中文标题/摘要
标题:OpenSR-SRGAN:一种灵活的多光谱地球观测数据超分辨率框架
我们介绍了OpenSR-SRGAN,这是一种用于地球观测的单图像超分辨率的开放和模块化框架。该软件提供了一种SRGAN风格模型的统一实现,易于配置、扩展,并应用于如Sentinel-2等多光谱卫星数据。OpenSR-SRGAN通过简洁的配置文件暴露生成器、判别器、损失函数和训练计划,而不是要求用户修改模型代码,使得在不同架构、缩放因子和波段设置之间切换变得简单。该框架旨在作为实用工具和基准实现,而非最先进的模型。它附带了针对常见遥感场景的现成配置,针对对抗训练设置了合理的默认设置,并内置了日志记录、验证和大场景推理的钩子。通过将基于GAN的超分辨率转换为配置驱动的工作流,OpenSR-SRGAN降低了研究人员和从业者实验SRGAN、以可重复的方式比较模型以及在多种地球观测数据集中部署超分辨率管道的门槛。
Summary / 总结
OpenSR-SRGAN is an open and modular framework for single-image super-resolution in Earth Observation, providing a unified implementation of SRGAN-style models through configuration files. It simplifies the process of switching between architectures, scale factors, and band setups without modifying model code, making it easy to use for researchers and practitioners. The abstract reports no experimental results: the framework is positioned as a practical tool and benchmark implementation rather than a state-of-the-art model, and it targets multispectral satellite data such as Sentinel-2 with ready-to-use configurations for common remote sensing scenarios.
OpenSR-SRGAN 是一个开源且模块化的框架,用于地球观测中的单图像超分辨率,旨在易于配置和扩展。它通过配置文件暴露生成器、判别器、损失函数和训练计划,允许用户在无需修改模型代码的情况下切换不同的架构和设置。摘要未报告实验结果:该框架定位为实用工具和基准实现,而非最先进的模型,面向如Sentinel-2等多光谱卫星数据,并提供针对常见遥感场景的现成配置,便于研究人员和从业者进行超分辨率实验和部署超分辨率管道。
LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Authors: Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
First: 2025-11-13T16:26:13+00:00 · Latest: 2025-11-13T16:26:13+00:00
Abstract
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance. For example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
中文标题/摘要
标题:LocalBench:在县级本地知识和推理方面的LLM基准测试
大型语言模型(LLMs)已经在宏观尺度的地理任务上得到了广泛评估,例如全球事实回忆、事件总结和区域推理。然而,它们处理超本地知识的能力仍然知之甚少。随着从公民平台到社区新闻等实际应用越来越需要能够推理社区特定动态、文化叙事和地方治理的AI系统,这种差距变得越来越重要。现有的基准测试在捕捉这种复杂性方面往往不足,通常依赖于粗粒度的数据或孤立的参考。我们提出了LocalBench,这是第一个旨在系统评估LLM在全美县级本地知识上的基准测试。LocalBench基于本地性概念框架,包括来自美国49个州526个县的14,782个验证过的问答对,整合了各种来源的数据,如人口普查统计数据、本地subreddit讨论和区域新闻。它涵盖了地方性的物理、认知和关系维度。使用LocalBench,我们评估了13种最先进的LLM模型,在闭卷和网络增强两种设置下。我们的研究结果揭示了关键的局限性:即使表现最好的模型在叙述性问题上的准确率也只有56.8%,在数值推理上的准确率低于15.5%。此外,更大的模型规模和网络增强并不保证更好的性能,例如,搜索可以将Gemini的准确率提高13.6%,但会使GPT系列的性能降低11.4%。这些结果强调了迫切需要能够支持公平、具有地方意识的AI系统的语言模型:能够应对不同地理和文化背景下地方社区多样而精细的现实。
Summary / 总结
LocalBench evaluates LLMs on county-level local knowledge and reasoning, addressing the gap in existing benchmarks. It includes 14,782 question-answer pairs from 526 U.S. counties, covering physical, cognitive, and relational dimensions. Evaluating 13 state-of-the-art LLMs, the study finds that even the best models achieve only 56.8% accuracy on narrative questions and under 15.5% on numerical reasoning. Web augmentation does not uniformly improve performance across models, highlighting the need for models that can handle local, place-aware knowledge equitably.
LocalBench 评估了 LLM 在县级本地知识和推理方面的表现,填补了现有基准的空白。它包含了来自 526 个美国县的 14,782 个问题-答案对,涵盖了物理、认知和关系维度。评估了 13 个最先进的 LLM 后,研究发现即使表现最好的模型在叙述性问题上的准确率也只有 56.8%,在数值推理上的准确率低于 15.5%。网络增强有时甚至会降低某些模型的性能。