arXiv Paper Digest

RoMa v2: Harder Better Faster Denser Feature Matching

Authors: Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, Michael Felsberg

First: 2025-11-19T18:59:38+00:00 · Latest: 2025-11-19T18:59:38+00:00

Abstract

Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly in many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2

中文标题/摘要

标题：RoMa v2: 更难、更好、更快、更密集的特征匹配

密集特征匹配旨在估计两幅3D场景图像之间的所有对应关系，并且由于其高精度和鲁棒性，最近已被确立为黄金标准。然而，现有的密集匹配器仍然无法处理或在许多困难的现实场景中表现不佳，高精度模型往往速度较慢，限制了它们的应用。在本文中，我们通过一系列系统的改进从多个方面解决了这些弱点，从而产生了一个显著更好的模型。特别是，我们构建了一个新颖的匹配架构和损失函数，结合一个精心挑选的多样化训练分布，使我们的模型能够解决许多复杂的匹配任务。我们还通过解耦的两阶段匹配-然后细化流水线使训练更快，并通过自定义CUDA内核显著减少了细化内存使用。最后，我们利用最近的DINOv3基础模型以及其他多个见解使模型更加鲁棒和无偏。在我们广泛的一系列实验中，我们展示了结果的新匹配器达到了新的最先进水平，比其前身更为准确。代码可在https://github.com/Parskatt/romav2 获取

Summary / 总结

The paper aims to improve dense feature matching by addressing its limitations in real-world scenarios and computational efficiency. The authors introduce a novel matching architecture and loss function, along with a curated training distribution, to enhance model performance. They also propose a two-stage matching-refinement pipeline and a custom CUDA kernel to speed up training and reduce memory usage. Experimental results demonstrate that the new model outperforms previous methods in accuracy and sets a new state-of-the-art.

论文旨在通过解决实际场景中的局限性和计算效率问题来改进密集特征匹配。作者引入了一种新的匹配架构和损失函数，并使用一个精心策划的训练集来提升模型性能。他们还提出了一种两阶段匹配-精炼流水线和自定义CUDA内核来加速训练并减少内存使用。实验表明，新模型在准确性和性能上都超过了之前的模型，并达到了新的前沿水平。

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao

First: 2025-11-19T18:59:22+00:00 · Latest: 2025-11-19T18:59:22+00:00

Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista substantially surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

中文标题/摘要

标题：GeoVista：增强型地理定位的自主视觉推理

当前关于自主视觉推理的研究能够实现深度多模态理解，但主要集中在图像处理工具上，缺乏更通用的自主模型。在此项工作中，我们重新审视了地理定位任务，该任务不仅需要精细的视觉定位，还需要在推理过程中通过网络搜索来验证或细化假设。由于现有的地理定位基准未能满足高分辨率图像和深度自主推理定位挑战的需求，我们创建了GeoBench，该基准包括来自世界各地的照片和全景图，以及不同城市的卫星图像子集，以严格评估自主模型的地理定位能力。我们还提出了GeoVista，这是一种能够无缝集成工具调用的自主模型，包括图像放大工具以放大感兴趣区域和网络搜索工具以检索相关网络信息。我们为其开发了一个完整的训练管道，包括一个冷启动监督微调（SFT）阶段以学习推理模式和工具使用先验，随后是一个强化学习（RL）阶段以进一步增强推理能力。我们采用分层奖励来利用多级地理信息并提高整体地理定位性能。实验结果表明，GeoVista在地理定位任务上显著超越了其他开源自主模型，并在大多数指标上达到了与Gemini-2.5-flash和GPT-5等封闭源模型相当的性能。

Summary / 总结

The research aims to enhance agentic visual reasoning for geolocalization by integrating web search capabilities. GeoVista, an agentic model, is proposed, which includes an image-zoom-in tool and a web-search tool. The model is trained using a cold-start supervised fine-tuning stage and a reinforcement learning stage. GeoVista outperforms other open-source agentic models and achieves performance similar to closed-source models on most geolocalization metrics.

研究旨在通过集成网络搜索能力来增强地理定位的主动视觉推理。提出了GeoVista模型，该模型包含图像放大工具和网络搜索工具。模型通过冷启动监督微调阶段和强化学习阶段进行训练。GeoVista在大多数地理定位指标上超过了其他开源主动模型，并达到了与封闭源模型如Gemini-2.5-flash和GPT-5相当的性能。
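
The GeoVista abstract mentions a hierarchical reward over multi-level geographical information but does not spell it out. The following is a minimal sketch of one way such a reward could look, assuming three levels (country, region, city) and purely illustrative weights. The `Location` type, `LEVEL_WEIGHTS`, and `hierarchical_reward` are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    country: str
    region: str
    city: str


# Illustrative weights only (assumption); the abstract gives no reward values.
LEVEL_WEIGHTS = (("country", 0.2), ("region", 0.3), ("city", 0.5))


def hierarchical_reward(pred: Location, truth: Location) -> float:
    """Accumulate credit from the coarsest level down, stopping at the first mismatch."""
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS:
        predicted = getattr(pred, level).strip().lower()
        expected = getattr(truth, level).strip().lower()
        if predicted != expected:
            break  # finer levels cannot be right if a coarser one is wrong
        reward += weight
    return reward
```

Stopping at the first mismatch means a correct city name under the wrong country earns nothing, which is one plausible reading of "hierarchical"; a real reward might instead use geodesic distance at the finest level.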

In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data

Authors: Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, Xiaolong Wang

First: 2025-11-19T18:59:04+00:00 · Latest: 2025-11-19T18:59:04+00:00

Comments: Project webpage: https://xiongyicai.github.io/In-N-On/

Abstract

Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by dividing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use the data. We curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/

中文标题/摘要

标题：In-N-On：利用野生和任务中数据扩展自我中心操作

自我中心视频是学习操作策略的一个有价值且可扩展的数据来源。然而，由于数据异质性显著，大多数现有方法使用人类数据进行简单的预训练，这并未充分利用其潜力。本文首先提供了一种可扩展的方法来收集和使用自我中心数据，通过将人类数据分为两类：野生和任务中数据，并进行了系统的数据分析。我们首先整理了一个数据集PHSD，包含超过1000小时的多样化的野生自我中心数据和超过20小时的任务中数据，直接与目标操作任务对齐。这使得能够学习一个大型的自我中心语言条件流匹配策略Human0。通过领域适应技术，Human0最小化了人类与类人者之间的差距。实验证明，Human0从扩展人类数据中获得了几个新颖的特性，包括仅从人类数据遵循指令的语言跟随能力、少样本学习以及使用任务中数据提高鲁棒性。项目网站：https://xiongyicai.github.io/In-N-On/

Summary / 总结

This paper addresses the challenge of learning manipulation policies from egocentric videos, a valuable data source whose use is often hindered by data heterogeneity. It introduces a method to categorize human data into in-the-wild and on-task categories and proposes a dataset, PHSD, with over 1,000 hours of in-the-wild data and over 20 hours of on-task data. Using this data, the authors develop a large egocentric language-conditioned flow matching policy, Human0, which, through domain adaptation, narrows the gap between humans and humanoids. Key findings include language following from human data alone, few-shot learning capabilities, and improved robustness with on-task data.

该论文针对利用以自我为中心的视频学习操作策略面临的挑战，这些问题通常受到数据异质性的限制。它提出了一种将人类数据分为野外和任务相关两类的方法，并提出了一个包含超过1000小时野外数据和20小时任务相关数据的数据集PHSD。通过使用这些数据，作者开发了一个大规模的以自我为中心的语言条件流匹配策略Human0，通过领域适应技术，该策略减少了人类和类人机器人之间的差距。主要发现包括仅从人类数据中实现语言跟随、少量样本学习能力和通过任务相关数据提高鲁棒性。
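
Human0 is described as a flow matching policy. As a rough illustration of what flow matching trains and how it samples, the sketch below uses the common straight-line path between a noise vector and a target action. That path, the fixed-step Euler sampler, and all function names are assumptions for illustration; the abstract does not specify the paper's formulation.

```python
def interpolate(noise, action, t):
    """Point on the straight path x_t = (1 - t) * noise + t * action, with 0 <= t <= 1."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target for the velocity network on that path: dx_t/dt = action - noise."""
    return [a - n for n, a in zip(noise, action)]


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would be regressed onto `target_velocity` at points produced by `interpolate`; at inference, `euler_sample` with the learned network turns noise into an action. With the exact constant velocity, Euler integration lands on the target action.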

First Frame Is the Place to Go for Video Content Customization

Authors: Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos

First: 2025-11-19T18:56:50+00:00 · Latest: 2025-11-19T18:56:50+00:00

Comments: Project Website: https://firstframego.github.io/

Abstract

What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

中文标题/摘要

标题：First Frame 是视频内容定制的理想选择

在视频生成模型中，第一帧扮演什么角色？传统上，它被视为视频的时间空间起点，仅仅是后续动画的种子。在本研究中，我们揭示了一个截然不同的视角：视频模型隐式地将第一帧视为一个概念性记忆缓冲区，用于存储可在生成过程中重新使用的视觉实体。利用这一洞察，我们展示了仅使用20-50个训练示例，无需架构更改或大规模微调，即可在多种场景中实现稳健且通用的视频内容定制。这揭示了视频生成模型在参考基础上进行视频定制的强大但未被充分利用的能力。

Summary / 总结

This study explores the role of the first frame in video generation models, revealing it as a conceptual memory buffer for visual entities. By leveraging this insight, the researchers demonstrate that robust and generalized video content customization can be achieved with only 20-50 training examples, without architectural changes or large-scale fine-tuning. This highlights a powerful, previously underutilized capability of video generation models for reference-based customization in various scenarios.

研究探讨了视频生成模型中第一帧的作用，发现它实际上是一个概念性的视觉实体记忆缓冲区。通过这一发现，研究人员展示了仅使用20-50个训练示例即可实现稳健且通用的视频内容定制，无需修改架构或进行大规模微调。这揭示了视频生成模型在参考基础上进行视频定制的强大但未被充分利用的能力，适用于多种场景。

Joint Semantic-Channel Coding and Modulation for Token Communications

Authors: Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao

First: 2025-11-19T18:56:32+00:00 · Latest: 2025-11-19T18:56:32+00:00

Comments: 14 pages, 14 figures, 2 tables

Abstract

In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. The token, the unified input and output representation in Transformer-based models, has become a fundamental information unit. In this work, we consider the problem of token communication, studying how to transmit tokens efficiently and reliably. Point cloud, a prevailing three-dimensional format which exhibits a more complex spatial structure compared to image or video, is chosen to be the information source. We utilize the set abstraction method to obtain point tokens. Subsequently, to get a more informative and transmission-friendly representation based on tokens, we propose a joint semantic-channel coding and modulation (JSCCM) scheme for the token encoder, mapping point tokens to standard digital constellation points (modulated tokens). Specifically, the JSCCM consists of two parallel Point Transformer-based encoders and a differential modulator which combines the Gumbel-softmax and soft quantization methods. In addition, a rate allocator and a channel adapter are developed, facilitating adaptive generation of high-quality modulated tokens conditioned on both semantic information and channel conditions. Extensive simulations demonstrate that the proposed method outperforms both joint semantic-channel coding and traditional separate coding, achieving over 1 dB gain in reconstruction and more than 6x compression ratio in modulated symbols.

中文标题/摘要

标题：联合语义-通道编码与调制在标记通信中的应用

近年来，Transformer架构在各种任务和模态中取得了卓越的性能。标记是基于Transformer模型的统一输入和输出表示，已成为基本的信息单元。在本文中，我们考虑标记通信的问题，研究如何高效可靠地传输标记。点云，一种比图像或视频具有更复杂空间结构的主流三维格式，被选作信息源。我们利用集合抽象方法获取点标记。随后，为了基于标记获得更具信息量且易于传输的表示，我们提出了一种联合语义-通道和调制（JSCCM）方案，将点标记映射为标准数字星座点（调制标记）。具体而言，JSCCM 包括两个并行的点Transformer编码器和一个结合Gumbel-softmax和软量化方法的差分调制器。此外，我们还开发了速率分配器和信道适配器，以在语义信息和信道条件的基础上生成高质量的调制标记。广泛的仿真实验表明，所提出的方法在重建方面优于联合语义-通道编码和传统单独编码方法，且在调制符号中压缩比提高了超过6倍。

Summary / 总结

This work addresses the problem of efficient and reliable token communication, focusing on point cloud data. It proposes a joint semantic-channel coding and modulation (JSCCM) scheme, which includes two parallel Point Transformer-based encoders and a differential modulator. The method achieves over 1dB gain in reconstruction and more than 6x compression ratio in modulated symbols compared to existing approaches.

本文提出了一种联合语义信道编码和调制（JSCCM）方案，以解决高效可靠的token通信问题。方法通过点云的集合抽象获得点token，并使用两个并行的Point Transformer编码器和差分调制器将这些token映射到标准数字星座点。JSCCM方案提高了重构性能和压缩比，与现有方法相比，实现了超过1dB的增益和超过6倍的压缩比。
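
The JSCCM abstract names the Gumbel-softmax as one ingredient of its modulator. The sketch below shows only that generic relaxation: per-token logits over a small constellation are turned into a soft symbol (a weighted average that gradients can flow through) and a hard symbol (the constellation point that would actually be sent). The QPSK constellation, the function names, and the way the two outputs are used are assumptions for illustration, not the paper's modulator.

```python
import math
import random

# 4-point (QPSK) constellation with unit average power; chosen only for illustration.
QPSK = [complex(a, b) / math.sqrt(2) for a in (-1, 1) for b in (-1, 1)]


def gumbel_softmax(logits, temperature, rng):
    """Return relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def modulate(logits, temperature=1.0, rng=None, constellation=QPSK):
    """Map one token's logits to a soft symbol (training) and a hard symbol (transmission)."""
    rng = rng or random.Random()
    weights = gumbel_softmax(logits, temperature, rng)
    soft = sum(w * c for w, c in zip(weights, constellation))
    best = max(range(len(weights)), key=weights.__getitem__)
    return soft, constellation[best]
```

Lowering `temperature` pushes the weights toward one-hot, so the soft symbol approaches the hard one; that is the usual reason this relaxation is paired with a discrete transmission step.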

The Impact of Quantization on Large Reasoning Model Reinforcement Learning

Authors: Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb

Venue: NeurIPS 2025

First: 2025-11-19T18:50:58+00:00 · Latest: 2025-11-19T18:50:58+00:00

Comments: Accepted to the NeurIPS 2025 Efficient Reasoning Workshop

Abstract

Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.

中文标题/摘要

标题：量化对大规模推理模型强化学习的影响

大规模强化学习（RL）现在可以在没有监督微调的情况下实现强大的推理能力。尽管后训练量化（PTQ）和量化感知训练（QAT）在微调上下文中得到了广泛研究，但量化对大规模推理模型（LRMs）中的RL性能的影响仍然是一个开放的问题。为了回答这个问题，我们进行了系统的实验，并发现数学基准测试中后RL量化模型与量化感知RL优化模型之间的推理性能存在显著差距。我们的研究结果表明，量化感知RL训练对学习过程产生了负面影响，而PTQ和QLoRA则提高了性能。

Summary / 总结

This study investigates the impact of quantization on large reasoning models in reinforcement learning. The research finds that quantization-aware RL training negatively affects learning, while post-training quantization (PTQ) and QLoRA improve performance. Systematic experiments on mathematical benchmarks reveal a significant performance gap between post-RL quantized models and quantization-aware RL optimized models.

研究探讨了量化对大规模推理模型在强化学习中的影响。比较了后训练量化(PTQ)和量化感知训练(QAT)方法，发现PTQ和QLoRA方法能带来更好的性能，而QAT则对学习过程产生负面影响。研究指出，在数学基准测试上，后RL量化模型与量化感知RL优化的模型之间存在显著的性能差距。
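
For readers unfamiliar with post-training quantization, the simplest variant is symmetric round-to-nearest with one scale per tensor. The sketch below shows that baseline on a plain list of weights. It is a generic illustration, not the specific PTQ or QLoRA configurations evaluated in the paper, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [scale * v for v in q]
```

The rounding error per weight is at most half a quantization step (`scale / 2`). Quantization-aware training instead simulates this rounding during optimization, which is the setting the abstract reports as harmful for RL.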

Hyperspectral Image Classification using Spectral-Spatial Mixer Network

Authors: Mohammed Q. Alkhatib

First: 2025-11-19T18:48:52+00:00 · Latest: 2025-11-19T18:48:52+00:00

Comments: Accepted for WHISPERS2025

Abstract

This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet

中文标题/摘要

标题：使用光谱-空间混合网络的高光谱图像分类

本文介绍了SS-MixNet，这是一种轻量级且有效的深度学习模型，用于高光谱图像（HSI）分类。该架构结合了用于局部光谱-空间特征提取的3D卷积层以及两个并行的MLP风格混合块，以捕捉光谱和空间维度中的长程依赖性。采用基于深度卷积的注意力机制以最小的计算开销增强判别能力。该模型仅使用1%的标注数据在QUH-唐道湾和QUH-青云数据集上进行训练和验证。SS-MixNet在比较的方法中（包括2D-CNN、3D-CNN、IP-SWIN、SimPoolFormer和HybridKAN）表现最佳，分别在唐道湾和青云数据集上达到95.68%和93.86%的整体准确率。定量指标和分类图的结果证实了该模型在有限监督下提供准确且稳健预测的有效性。代码将在以下地址公开：https://github.com/mqalkhatib/SS-MixNet

Summary / 总结

This paper presents SS-MixNet, a lightweight deep learning model for hyperspectral image classification. It combines 3D convolutional layers with spectral-spatial mixer blocks and a depthwise convolution-based attention mechanism to capture long-range dependencies. Evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets with only 1% labeled data, SS-MixNet outperforms other methods like 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, achieving 95.68% and 93.86% overall accuracy respectively. The model demonstrates high accuracy and robustness with minimal supervision.

本文提出了一种轻量级的SS-MixNet深度学习模型，用于高光谱图像分类。该模型结合了3D卷积层进行局部光谱-空间特征提取，并使用注意力机制和MLP风格的混合块来捕捉长距离依赖。在QUH-Tangdaowan和QUH-Qingyun数据集上，仅使用1%的标注数据进行训练和验证，SS-MixNet在Tangdaowan和Qingyun数据集上分别达到了95.68%和93.86%的整体准确率，优于2D-CNN、3D-CNN、IP-SWIN、SimPoolFormer和HybridKAN等方法。该模型在少量监督下展示了高准确性和鲁棒性。

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

Authors: Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang

First: 2025-11-19T18:48:27+00:00 · Latest: 2025-11-19T18:48:27+00:00

Comments: Code will be released upon acceptance

Abstract

Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they are computationally inefficient. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.

中文标题/摘要

标题：MoDES：通过动态专家跳过加速Mixture-of-Experts多模态大型语言模型

Mixture-of-Experts (MoE) 多模态大型语言模型（MLLMs）在视觉-语言任务中表现出色，但存在高计算效率问题。为了减少推理开销，已经提出了专家跳过方法，根据当前输入令牌来停用冗余专家。然而，我们发现将这些方法（最初设计用于单模态大型语言模型（LLMs））应用于MLLMs会导致显著的性能下降。这主要是因为这些方法未能考虑到MoE层中专家的异质贡献以及这些层中令牌的模态特定行为。受这些发现的启发，我们提出了MoDES，这是第一个无需训练的框架，能够自适应地跳过专家以实现高效且准确的MoE MLLM推理。它结合了全局调制局部门控（GMLG）机制，将全局层间重要性整合到局部路由概率中，以准确估计每个令牌的专家重要性。然后应用了一种双模态阈值化（DMT）方法，分别处理每个模态的令牌，以推导跳过计划。为了设置最优阈值，我们引入了一种前沿搜索算法，利用单调性特性，将收敛时间从几天缩短到几小时。针对13个基准的3个模型系列的广泛实验表明，MoDES远优于先前的方法。例如，当跳过Qwen3-VL-MoE-30B-A3B-Instruct的88%专家时，性能提升高达10.67%（97.33% vs. 86.66%）。此外，MoDES显著提高了推理速度，将预填充时间提高了2.16倍，解码时间提高了1.26倍。

Summary / 总结

MoDES is a training-free framework that accelerates Mixture-of-Experts multimodal large language models (MLLMs) by adaptively skipping experts based on their importance. It uses a globally-modulated local gating mechanism to estimate per-token expert importance and a dual-modality thresholding method to derive the skipping schedule. MoDES outperforms previous approaches, achieving up to a 10.67% performance boost and speeding up prefilling by 2.16× and decoding by 1.26×.

MoDES 是一个无需训练的框架，通过根据专家的重要性自适应地跳过专家来加速 Mixture-of-Experts 多模态大型语言模型 (MLLMs)。它使用全局调制局部门控机制来估计每个令牌专家的重要性，并使用双模态阈值化方法来推导跳过计划。MoDES 在性能上超越了之前的方法，例如，在跳过 Qwen3-VL-MoE-30B-A3B-Instruct 的 88% 专家时，性能提升了 10.67%，并且显著提高了推理速度，预填充加速 2.16 倍，解码加速 1.26 倍。
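
The MoDES abstract describes two steps: scoring each routed expert by combining a global layer-wise importance with the local routing probability, then skipping experts whose score falls below a modality-specific threshold. A minimal sketch of that decision for one token follows. It assumes the combination is a simple product and that the top expert is always kept as a fallback; both choices, and every name in the code, are illustrative assumptions rather than the paper's definitions.

```python
def experts_to_run(routing_probs, layer_importance, modality, thresholds):
    """Return the indices of routed experts kept for one token in one MoE layer.

    routing_probs: {expert_index: router probability} for the token's top-k experts.
    layer_importance: scalar importance of this layer (assumed precomputed offline).
    modality: key into `thresholds`, e.g. "text" or "vision".
    thresholds: per-modality cutoffs on the combined score.
    """
    cutoff = thresholds[modality]
    # Assumed combination rule: global importance times local routing probability.
    kept = [e for e, p in routing_probs.items() if layer_importance * p >= cutoff]
    if not kept:
        # Assumed fallback: keep the top-probability expert so the token is still processed.
        kept = [max(routing_probs, key=routing_probs.get)]
    return sorted(kept)
```

Using separate cutoffs per modality lets vision tokens be pruned more or less aggressively than text tokens in the same layer, which is the behaviour the dual-modality thresholding step is described as providing.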

Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

Authors: Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao

First: 2025-10-09T01:33:25+00:00 · Latest: 2025-11-19T18:46:34+00:00

Abstract

Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.

中文标题/摘要

标题：自动化Android构建修复：通过领域特定工具在LLM代理中弥合推理-执行差距

Android是最大的移动平台，但自动构建应用程序仍然是一个实际挑战。虽然大型语言模型（LLMs）在代码修复方面显示出潜力，但它们用于修复Android构建错误的应用仍然未被充分探索。为了解决这一差距，我们首先介绍了AndroidBuildBench，这是一个包含1,019个构建失败的基准，这些失败是从43个开源Android项目的提交历史中精心挑选出来的。每个问题都配有一个后续提交中的验证解决方案，确保修复是可行的。其次，我们提出了GradleFixer，这是一种带有领域特定工具的LLM代理，用于检查和操作Gradle构建环境。GradleFixer实现了81.4%的解决率（pass@1），显著优于依赖通用shell的最先进的编码代理。GradleFixer的成功表明，虽然LLMs具备了解决这些失败的高层知识，但在使用通用shell将这些知识转化为有效的低层操作方面却存在困难。我们展示了我们称之为工具桥接的有效策略，该策略用领域意识的抽象替换通用shell命令。我们假设这种方法通过两种机制起作用：1）它以API格式提供工具，使LLMs更可靠地使用，2）它将操作空间限制为相关操作。这种方法弥合了模型的高层推理与有效低层执行之间的差距。

Summary / 总结

The paper addresses the challenge of automatically building Android applications, which remains a practical issue despite the potential of Large Language Models (LLMs) for code repair. It introduces AndroidBuildBench, a benchmark of 1,019 build failures from open-source Android projects, and proposes GradleFixer, an LLM agent with domain-specific tools for fixing build errors. GradleFixer achieves an 81.4% resolve rate, outperforming a state-of-the-art coding agent that uses a general-purpose shell. The study suggests that LLMs benefit from domain-aware abstractions, which improve their ability to translate high-level reasoning into effective low-level actions.

论文旨在解决自动构建Android应用程序这一实际问题，尽管大型语言模型（LLMs）在代码修复方面具有潜力。研究引入了AndroidBuildBench，这是一个包含来自43个开源Android项目的1,019个构建失败的基准，并提出了GradleFixer，这是一种带有领域特定工具的LLM代理，用于修复构建错误。GradleFixer的解决率为81.4%，远超依赖通用shell的最先进的编码代理。研究指出，虽然LLMs具有高层次的知识，但需要领域特定的工具才能有效地执行低层次操作，并引入了工具桥接策略来弥合这一差距。
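
The Tool Bridging idea, as the abstract states it, is to give the agent API-like, domain-aware tools and a restricted action space instead of a general-purpose shell. The sketch below illustrates that shape with a single hypothetical tool that rewrites a dependency version in Gradle script text, plus a dispatcher that refuses anything outside the registered set. The tool name, its signature, and the action format are assumptions for illustration and are not GradleFixer's actual tools.

```python
import re
from typing import Callable, Dict


def set_dependency_version(build_text: str, coordinate: str, version: str) -> str:
    """Rewrite 'group:artifact:<old>' to 'group:artifact:<version>' in Gradle script text."""
    pattern = re.compile(re.escape(coordinate) + r":[^'\"\s]+")
    return pattern.sub(lambda _match: f"{coordinate}:{version}", build_text)


# Closed set of actions the agent may take (hypothetical registry).
TOOLS: Dict[str, Callable[..., str]] = {
    "set_dependency_version": set_dependency_version,
}


def dispatch(action: dict, build_text: str) -> str:
    """Run one agent action; names outside TOOLS are refused rather than sent to a shell."""
    name = action.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](build_text, **action.get("args", {}))
```

The model only has to produce a tool name and structured arguments, and the dispatcher cannot be talked into running arbitrary commands; these correspond to the two mechanisms the authors hypothesize.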

Walrus: A Cross-Domain Foundation Model for Continuum Dynamics

Authors: Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho

First: 2025-11-19T18:36:03+00:00 · Latest: 2025-11-19T18:36:03+00:00

Abstract

Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.

中文标题/摘要

标题：Walrus：一种用于连续动力学的跨域基础模型

基础模型已彻底改变了语言和视觉领域的机器学习，但在物理模拟中实现类似的影响仍面临挑战。数据异质性和长期动力学的不稳定阻碍了从足够多样的动力学中学习，而不同的分辨率和维度则对现代硬件上的高效训练构成了挑战。通过实证和理论分析，我们引入了新的方法来克服这些障碍，包括基于谐波分析的稳定方法、负载平衡的分布式2D和3D训练策略以及计算自适应的标记化。利用这些工具，我们开发了Walrus，一种主要用于流体样连续动力学的基础模型。Walrus在天体物理学、地球科学、流变学、等离子体物理学、声学和经典流体等十九种不同场景下进行预训练。实验表明，Walrus在下游任务和预训练数据的整个范围内，在短期和长期预测方面均优于先前的基础模型，而消融研究也证实了我们对预测稳定性、训练吞吐量和转移性能的贡献优于传统方法。代码和权重已向社区开放使用。

Summary / 总结

The research aims to develop a foundation model for physical simulation, addressing challenges such as data heterogeneity and unstable dynamics. Walrus, a transformer-based model, is pretrained on diverse scenarios from various fields including astrophysics and fluid dynamics. Experiments demonstrate that Walrus outperforms previous models in both short and long-term predictions and shows improved stability and transfer performance. Ablation studies confirm the effectiveness of the model's components in enhancing forecast accuracy and training efficiency.

研究旨在开发一种用于物理模拟的基础模型，解决数据异质性和动态不稳定性等问题。Walrus是一种基于变压器的模型，它在包括天体物理学和流体动力学在内的多种场景下进行预训练。实验表明，Walrus在短期和长期预测中均优于先前的模型，并且显示出更好的稳定性和迁移性能。消融研究证实了模型组件在提高预测准确性和训练效率方面的有效性。

A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values

Authors: Tyler Chen, Akshay Seshadri, Mattia J. Villani, Pradeep Niroula, Shouvanik Chakrabarti, Archan Ray, Pranav Deshpande, Romina Yalovetzky, Marco Pistoia, Niraj Kumar

Venue: NeurIPS 2025

First: 2025-06-05T16:30:53+00:00 · Latest: 2025-11-19T18:20:38+00:00

Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025); 45 pages, 7 figures, 7 tables

Abstract

Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being the KernelSHAP method (Lundberg & Lee, 2017). While related estimators such as unbiased KernelSHAP (Covert & Lee, 2021) and LeverageSHAP (Musco & Witter, 2025) are known to satisfy theoretical guarantees, bounds for KernelSHAP have remained elusive. We describe a broad and unified framework that encompasses KernelSHAP and related estimators constructed using both with-replacement and without-replacement sampling strategies. We then prove strong non-asymptotic theoretical guarantees that apply to all estimators from our framework. This provides, to the best of our knowledge, the first theoretical guarantees for KernelSHAP and sheds further light on tradeoffs between existing estimators. Through comprehensive benchmarking on small- and medium-dimensional datasets for decision-tree models, we validate our approach against exact Shapley values, consistently achieving low mean squared error with modest sample sizes. Furthermore, we make specific implementation improvements to enable scalability of our methods to high-dimensional datasets. Our methods, tested on datasets such as MNIST and CIFAR10, provide consistently better results compared to the KernelSHAP library.

中文标题/摘要

标题：一种可证明高效的计算Shapley值算法的统一框架

Shapley值已成为解释机器学习模型决策中哪些特征产生影响的关键工具。然而，计算精确的Shapley值非常困难，通常需要进行指数级（特征维度）的模型评估。为解决这一问题，已经开发了许多模型无关的随机估计器，其中最具有影响力和广泛应用的是KernelSHAP方法（Lundberg & Lee, 2017）。虽然有无偏KernelSHAP（Covert & Lee, 2021）和LeverageSHAP（Musco & Witter, 2025）等相关的估计器已知满足理论保证，但对KernelSHAP的界仍然难以获得。我们描述了一个广泛且统一的框架，该框架涵盖了使用有放回和无放回采样策略构建的KernelSHAP及其相关估计器。然后，我们证明了适用于我们框架中所有估计器的强非渐近理论保证。据我们所知，这提供了第一个对KernelSHAP的理论保证，并进一步阐明了现有估计器之间的权衡。通过在决策树模型的小到中等维度数据集上进行全面基准测试，我们验证了我们的方法与精确的Shapley值相比，即使在较小的样本量下也能实现较低的均方误差。此外，我们对具体实现进行了改进，以使我们的方法能够扩展到高维度数据集。我们的方法在MNIST和CIFAR10等数据集上测试时，提供了比KernelSHAP库更好的结果。

Summary / 总结

The paper presents a unified framework for estimating Shapley values, which are used to explain the impact of features on machine learning model decisions. It covers both with and without replacement sampling strategies and provides strong non-asymptotic theoretical guarantees for all estimators within this framework. Experimental results on Decision-Tree models show that the proposed methods achieve low mean squared error with modest sample sizes, outperforming KernelSHAP in some cases. Additionally, the methods are scalable to high-dimensional datasets with specific implementation improvements.

论文提出了一种统一框架，用于估计Shapley值，这些值用于解释机器学习模型决策中特征的影响。该框架统一了包括KernelSHAP和LeverageSHAP在内的多种估计器，并为框架内的所有估计器提供了强大的理论保证。实验结果显示，所提出的方法在决策树模型中实现了低均方误差，并且在高维数据集如MNIST和CIFAR10上优于KernelSHAP库。
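
To make the cost statement concrete, the sketch below computes exact Shapley values by enumerating every coalition (exponential in the number of players) and, for contrast, a simple Monte Carlo estimator that averages marginal contributions over random orderings. The permutation estimator is a different and simpler technique than KernelSHAP or the estimators in the paper's framework; it is shown only as a familiar baseline, and the function names are illustrative.

```python
import itertools
import math
import random


def exact_shapley(value, n):
    """Exact Shapley values for a set function value(frozenset) -> float on players 0..n-1."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that exactly this coalition precedes player i in a random ordering.
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def permutation_shapley(value, n, num_samples, rng):
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    phi = [0.0] * n
    for _ in range(num_samples):
        order = list(range(n))
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for i in order:
            coalition = coalition | {i}
            current = value(coalition)
            phi[i] += current - previous
            previous = current
    return [p / num_samples for p in phi]
```

`exact_shapley` evaluates the value function on the order of `n * 2**(n - 1)` coalition pairs, while `permutation_shapley` uses `num_samples * (n + 1)` evaluations. Because each sampled ordering telescopes, its estimates always sum to `value(all players) - value(empty set)`.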

MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features

Authors: Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin

First: 2025-11-19T18:18:53+00:00 · Latest: 2025-11-19T18:18:53+00:00

Abstract

Eye-tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and an F2 score of 0.94. For the three-class (no depression, mild to moderate depression, and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and a specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and an F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.

中文标题/摘要

标题：MF-GCN：一种用于利用眼动追踪、面部和声学特征进行三模态抑郁检测的多频图卷积网络

眼动追踪数据量化了抑郁组中对负面刺激的注意力偏向。音频和视频数据捕捉了抑郁特征中的情感平淡和心理运动迟缓。统计验证证实了它们在区分抑郁与非抑郁组方面具有显著的鉴别能力。我们解决了现有基于图的模型集中在低频信息上的关键局限性，提出了一种多频图卷积网络（MF-GCN）。该框架包括一个新颖的多频滤波器模块（MFFBM），可以利用低频和高频信号。与传统机器学习算法和深度学习框架的广泛评估表明，MF-GCN 一贯优于基线。在二分类（抑郁与非抑郁）中，该模型的灵敏度为0.96，F2分数为0.94。对于三分类（无抑郁、轻度到中度抑郁和重度抑郁）分类任务，所提出的方法的灵敏度为0.79，特异性为0.87，并显著超越其他模型。为了验证泛化能力，该模型还在中文多模态抑郁语料库（CMDC）数据集上进行了评估，灵敏度为0.95，F2分数为0.96。这些结果证实了我们三模态、多频框架有效地捕捉了跨模态交互，以实现准确的抑郁检测。

Summary / 总结

The research aims to improve depression detection using a multi-frequency graph convolutional network (MF-GCN) that integrates eye-tracking, facial, and acoustic features. The method introduces a Multi-Frequency Filter Bank Module (MFFBM) to leverage both low and high frequency signals. Experimental results show that MF-GCN outperforms traditional machine learning and deep learning models, achieving high sensitivity and F2 scores in binary and multi-class depression classification tasks. The model also generalizes well on the Chinese Multimodal Depression Corpus dataset.

研究旨在通过使用三模态数据（眼动追踪、面部和声学特征）来改进抑郁检测，并解决现有图基模型专注于低频信息的局限性。研究提出了一种多频图卷积网络（MF-GCN）和新型多频滤波器模块（MFFBM），以利用低频和高频信号。MF-GCN 在二分类和多分类抑郁检测任务中均表现出色，实现了高灵敏度和F2分数，并在中文多模态抑郁语料库（CMDC）数据集上展示了良好的泛化能力。
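
The abstract's contrast between low-frequency and high-frequency graph signals can be illustrated with a small NumPy sketch. It is not the authors' MFFBM: the two filters below, (I + Â)/2 and (I − Â)/2 built from the symmetrically normalised adjacency Â, are a common textbook choice assumed here for illustration, and the function names and the toy ring graph are hypothetical.

```python
import numpy as np


def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetrically normalise an adjacency matrix: D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def multi_frequency_features(adj: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Concatenate low-pass and high-pass filtered node features.

    Assumed filters (illustrative, not taken from the paper):
      low  = (I + A_hat) / 2  -> averages a node with its neighbours
      high = (I - A_hat) / 2  -> keeps differences from the neighbours
    """
    a_hat = normalized_adjacency(adj)
    identity = np.eye(adj.shape[0])
    low = 0.5 * (identity + a_hat) @ x
    high = 0.5 * (identity - a_hat) @ x
    return np.concatenate([low, high], axis=1)


# Toy 4-node ring graph (hypothetical data).
ring = np.array(
    [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]],
    dtype=float,
)
constant = np.ones((4, 1))                            # smooth signal
alternating = np.array([[1.0], [-1.0], [1.0], [-1.0]])  # oscillating signal

smooth_feats = multi_frequency_features(ring, constant)
rough_feats = multi_frequency_features(ring, alternating)
```

On this toy graph the constant signal passes through the low-pass branch unchanged and vanishes in the high-pass branch, while the alternating signal does the opposite, which is the kind of information a low-frequency-only model would discard.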

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

First: 2025-10-20T17:52:06+00:00 · Latest: 2025-11-19T17:57:07+00:00

Comments: 29 pages, 9 tables, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

中文标题/摘要

标题：基础自动评估器：扩展多任务生成评估器训练以适应以推理为中心的领域

针对训练和测试期间不断增加的可扩展评估需求，微调专门的生成评估器已成为一个流行的范式。然而，近期的工作主要集中在使用新的方法，如强化学习（RL），来训练评估器，而避免大规模的数据驱动开发。在本工作中，我们专注于数据扩展，收集了涵盖五个独特评估任务（成对、步骤级、无参考和有参考验证、单评级）以及多个以推理评估为中心的领域的250万样本。利用我们的数据，我们训练了基础自动推理评估器（FARE），这是一个包含80亿和200亿（其中36亿活跃参数）参数的评估器家族，使用简单的迭代拒绝采样监督微调（SFT）方法。FARE-8B 挑战了更大的专门强化学习训练的评估器，而FARE-20B 设定了开源评估器的新标准，超越了专门的700亿+评估器。除了静态基准，我们还在实际任务中评估了FARE：作为推理时间的重排序器，FARE-20B 在MATH 上达到了接近完美的性能。作为强化学习训练中的验证器，FARE 提高了下游强化学习训练模型的性能，最多可提高14.1%。从FARE 初始化的持续微调FARE-Code 在评估测试案例质量方面比gpt-oss-20B 高出65%。

Summary / 总结

This study addresses the need for scalable evaluation methods in reasoning-centric domains by curating a dataset of 2.5M samples across five evaluation tasks and multiple domains. Using a simple iterative rejection-sampling supervised fine-tuning approach, the authors train FARE, a family of 8B and 20B (with 3.6B active) parameter evaluators. FARE-8B challenges larger specialized RL-trained evaluators, while FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ models. In real-world tasks, FARE-20B achieves near-oracle performance as an inference-time reranker on MATH, and FARE as a verifier in RL training improves downstream model performance by up to 14.1% compared to string-matching verifiers. Additionally, a continually fine-tuned FARE-Code, initialized from FARE, outperforms gpt-oss-20B by 65% in evaluating test-case quality.

该研究旨在解决推理导向领域中可扩展评估方法的需求，通过收集涵盖五个任务和多个领域的250万样本数据集。使用简单的迭代拒绝采样监督微调方法，作者开发了FARE系列评估器，包括80亿和200亿参数版本。FARE-20B在性能上超越了更大规模的专业化强化学习训练评估器，并成为开源评估器的新标准，超过了专门的700亿+模型。在实际应用中，FARE-20B在MATH推理中的表现接近完美，并将强化学习训练模型的性能提高了14.1%以上，相较于字符串匹配验证器。此外，从FARE初始化的持续微调FARE-Code在评估测试案例质量方面比gpt-oss-20B高出65%。
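
The inference-time reranking use case mentioned in the abstract can be sketched as best-of-N selection. This is only an illustration under stated assumptions: `score_fn` stands in for an evaluator model, `toy_score` is a hypothetical placeholder rule, and nothing here reflects FARE's actual interface or prompts.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the evaluator scores highest.

    `score_fn(question, candidate)` is an assumed interface for a
    single-rating evaluator; ties resolve to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(question, cand) for cand in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def toy_score(question: str, candidate: str) -> float:
    """Hypothetical stand-in for an evaluator model (not FARE)."""
    return 1.0 if candidate.strip().endswith("4") else 0.0


best = rerank_best_of_n("What is 2 + 2?", ["5", "The answer is 4", "3"], toy_score)
```

"Near-oracle" in this setting means the reranker's pick is nearly as good as choosing the best candidate with access to the ground truth.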

Measuring the (Un)Faithfulness of Concept-Based Explanations

Authors: Shubham Kumar, Narendra Ahuja

First: 2025-04-15T03:24:13+00:00 · Latest: 2025-11-19T17:56:19+00:00

Comments: Pre-print

Abs · PDF · Code1 · Code2

Abstract

Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful -- that is, they represent the model's internal computation -- requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check -- explanations with random concepts should be less faithful -- which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.

中文标题/摘要

标题：测量基于概念的解释的（不）忠实性

深度视觉模型执行难以解释的输入-输出计算。基于概念的解释方法（CBEM）通过用人类可理解的语义单元重新表达模型的部分内容来提高可解释性。检查所得到的解释是否忠实——即它们是否代表了模型的内部计算——需要一个结合概念来计算输出的替代模型。为了提高可解释性而做出的简化不可避免地降低了忠实性，从而导致两者之间的权衡。最新的无监督CBEM（U-CBEM）报告了越来越可解释的概念，同时对模型的忠实性也更高。然而，我们观察到，所报告的忠实性改进要么是由于（1）使用过于复杂的替代模型，这引入了对解释可解释性未测量的成本，要么是依赖于基于删除的方法，正如我们所证明的那样，这些方法并不能正确衡量忠实性。我们提出了替代忠实性（SURF），（1）用一个简单、线性的替代模型替换先前复杂的替代模型，该模型在不改变解释可解释性的情况下衡量忠实性，（2）引入了合理的度量标准，评估所有输出类别的损失，而不仅仅是预测类别。我们通过提出一个简单的合理性检查——具有随机概念的解释应不那么忠实——来验证SURF，这表明先前的替代模型未能通过。SURF使无监督CBEM的第一个可靠的忠实性基准成为可能，揭示了许多视觉上引人注目的无监督CBEM实际上并不忠实。代码将发布。

Summary / 总结

The paper aims to measure the faithfulness of concept-based explanation methods (CBEMs) in deep vision models, which balance interpretability and faithfulness. The authors propose Surrogate Faithfulness (SURF) to address limitations in previous methods, using a simple linear surrogate to assess faithfulness without compromising interpretability. Experimental results show that many visually compelling unsupervised CBEMs (U-CBEMs) are not faithful to the model's internal computations, highlighting the need for reliable faithfulness benchmarks. Code for SURF will be released.

该论文解决了概念基解释方法（CBEM）在深度视觉模型中的忠实性测量问题。作者提出了Surrogate Faithfulness（SURF），用一个简单线性的替代模型来替代复杂的模型，同时保持解释的可解释性，准确测量忠实性。研究揭示了许多视觉上引人注目的无监督CBEM（U-CBEM）并不忠实于模型的内部计算，强调了可靠忠实性基准的必要性。
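
The idea of a linear surrogate, and the sanity check that random concepts should be less faithful, can be sketched with ordinary least squares. This is a minimal illustration and not the SURF metrics themselves: the in-sample mean squared error over all output classes, the synthetic data, and the function name are assumptions made for the example.

```python
import numpy as np


def linear_surrogate_error(concepts: np.ndarray, logits: np.ndarray) -> float:
    """Fit logits ~ concepts @ W + b by least squares; return the MSE.

    The error is averaged over every output class, not only the
    predicted one. Lower error means the concepts reproduce the
    model's outputs more faithfully under a linear surrogate.
    """
    design = np.hstack([concepts, np.ones((concepts.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, logits, rcond=None)
    residual = logits - design @ weights
    return float(np.mean(residual ** 2))


# Synthetic setup (hypothetical): logits really are a linear mix of 5 concepts.
rng = np.random.default_rng(0)
true_concepts = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 3))
logits = true_concepts @ mixing + 0.01 * rng.normal(size=(200, 3))

# Sanity check: concepts unrelated to the model should score worse.
random_concepts = rng.normal(size=(200, 5))
faithful_error = linear_surrogate_error(true_concepts, logits)
random_error = linear_surrogate_error(random_concepts, logits)
```

A surrogate with enough capacity could fit the logits from random concepts as well, which is why the abstract argues that overly complex surrogates hide unfaithful explanations.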

VisPlay: Self-Evolving Vision-Language Models from Images

Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang

First: 2025-11-19T17:55:15+00:00 · Latest: 2025-11-19T17:55:15+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/

中文标题/摘要

标题：VisPlay：从图像中自我进化的视觉-语言模型

强化学习（RL）为在复杂推理任务上改进视觉-语言模型（VLMs）提供了一个原则性的框架。然而，现有的RL方法通常依赖于人工标注的标签或特定任务的启发式方法来定义可验证的奖励，这两种方法都成本高昂且难以扩展。我们引入了VisPlay，这是一种自我进化的RL框架，使VLMs能够利用大量未标注的图像数据自主提高其推理能力。从一个基础VLM开始，VisPlay将模型分配为两个相互作用的角色：一个图像条件下的提问者，它能够提出具有挑战性但可回答的视觉问题；以及一个跨模态推理器，它生成银级回答。这些角色通过组相对策略优化（GRPO）联合训练，该方法结合了多样性和难度奖励，以平衡生成问题的复杂性与银级回答的质量。VisPlay在两个模型家族中高效扩展。当在Qwen2.5-VL和MiMo-VL上训练时，VisPlay在八个基准测试中，包括MM-Vet和MMMU，实现了视觉推理、组合泛化和幻觉减少的一致改进，展示了自我进化的跨模态智能的可扩展路径。项目页面可在https://bruno686.github.io/VisPlay/获取

Summary / 总结

VisPlay is a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that uses large amounts of unlabeled image data to autonomously improve reasoning abilities. It assigns the model to two roles, an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained together using Group Relative Policy Optimization (GRPO) to balance question difficulty and answer quality. VisPlay consistently improves visual reasoning and compositional generalization and reduces hallucination across eight benchmarks, showing a scalable path to self-evolving multimodal intelligence.

VisPlay 是一种自我进化的强化学习框架，使用大量未标记的图像数据来增强视觉语言模型（VLMs）。它将模型分配为两个角色：图像条件下的问题提出者，提出具有挑战性的问题，以及多模态推理者，生成银级答案。这些角色通过组相对策略优化（GRPO）进行训练，以平衡问题的复杂性和答案的质量。VisPlay 在八个基准测试中提高了视觉推理能力、组合泛化能力和减少了幻觉，展示了自我进化的多模态智能的可扩展路径。
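
GRPO, which the abstract names as the training algorithm, scores each sampled response relative to the other responses in its group. The sketch below shows only that normalisation step; the reward values are placeholders, and how VisPlay combines its diversity and difficulty rewards is not specified in the abstract, so it is not modelled here.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalise rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are judged against their siblings for the same prompt
    rather than against a learned value function.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Placeholder rewards for four responses to the same prompt.
advantages = group_relative_advantages([0.0, 0.5, 1.0, 0.5])
```

Responses at the group mean receive an advantage of zero, so only answers that are better or worse than their siblings move the policy.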

TrackStudio: An Integrated Toolkit for Markerless Tracking

Authors: Hristo Dimitrov, Giulia Dominijanni, Viktorija Pavalkyte, Tamar R. Makin

First: 2025-11-10T20:49:58+00:00 · Latest: 2025-11-19T17:53:19+00:00

Comments: 26 pages, 5 main text figures, 5 supplementary figures

Abs · PDF · Code1 · Code2

Abstract

Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions for body position, lighting, and space and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions. TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.

中文标题/摘要

标题：TrackStudio：无标记点跟踪的集成工具包

无标记点运动跟踪在过去10年中取得了快速进展，目前为行为学、临床学和生物力学研究提供了强大的机会。虽然有几种专门的工具包能够为特定任务提供高性能，但使用现有工具仍然需要大量的技术专长。仍存在一个缺口，即易于访问的集成解决方案，能够为非专家提供足够的跟踪能力，适用于各种环境。 TrackStudio 是为了解决这一缺口而开发的，通过将现有的开源工具整合到一个单一的、模块化的、基于GUI的流水线中，使其开箱即用。它提供了自动的2D和3D跟踪、校准、预处理、特征提取和可视化，无需任何编程技能。我们提供了一份用户指南，包含有关视频采集、同步和设置的实用建议，以及避免常见陷阱的文档。为了验证该工具包，我们在三种环境中进行了测试，使用的是低成本网络摄像头或高分辨率摄像头，包括对身体位置、照明、空间和障碍物的挑战性条件。在76名参与者中，平均帧间相关性超过0.98，平均三角误差保持较低（手部跟踪<13.6毫米），证明了稳定和一致的跟踪性能。我们还展示了相同的流水线可以扩展到手部跟踪之外，应用于其他身体和面部区域。TrackStudio 为需要可靠性能但没有专门技能的研究人员或普通用户提供了一条实用、易用的进入无标记点跟踪的途径。

Summary / 总结

TrackStudio was developed to address the gap in accessible markerless motion tracking solutions for non-experts. It integrates various open-source tools into a user-friendly pipeline that supports automatic 2D and 3D tracking, calibration, and visualisation without requiring programming skills. The toolkit was validated across three environments with low-cost and high-resolution cameras, showing high inter-frame correlations and low triangulation errors, indicating stable and consistent tracking performance. Beyond hand tracking, the pipeline can be extended to other body and face regions.

TrackStudio 是为非专家提供的一种易于使用的无标记运动跟踪解决方案，它将多种开源工具整合到一个基于GUI的管道中，无需编程即可实现自动2D和3D跟踪、校准和可视化。该工具包在三种环境中使用低成本和高分辨率摄像头进行了验证，实现了高帧间相关性和低三角误差，展示了其在多种跟踪需求中的稳定性和可靠性。
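
The triangulation error reported in the abstract presupposes a step that recovers a 3D point from its 2D detections in several calibrated cameras. A standard way to do this is linear (DLT) triangulation, sketched below with NumPy. Whether TrackStudio uses this exact method is an assumption, and the camera matrices and test point are made up for the example.

```python
import numpy as np


def triangulate(projections, pixels) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: list of 3x4 camera projection matrices.
    pixels: list of (u, v) image coordinates, one per camera.
    """
    rows = []
    for proj, (u, v) in zip(projections, pixels):
        rows.append(u * proj[2] - proj[0])
        rows.append(v * proj[2] - proj[1])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    homogeneous = vt[-1]
    return homogeneous[:3] / homogeneous[3]


def project(proj: np.ndarray, point: np.ndarray) -> np.ndarray:
    """Project a 3D point to pixel coordinates with a 3x4 matrix."""
    x = proj @ np.append(point, 1.0)
    return x[:2] / x[2]


# Two hypothetical cameras, the second one shifted along the x axis.
cam_a = np.hstack([np.eye(3), np.zeros((3, 1))])
cam_b = np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
point = np.array([0.2, 0.1, 5.0])

estimate = triangulate(
    [cam_a, cam_b], [project(cam_a, point), project(cam_b, point)]
)
# Distance between estimate and reference, in mm if the scene is in metres.
error_mm = 1000.0 * float(np.linalg.norm(estimate - point))
```

With noise-free detections the estimate matches the reference point; with real detections the residual distance is the kind of quantity summarised as a triangulation error.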

DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting

Authors: Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Anthony Miyaguchi, Rodrigo Pereira David, Rodrigo Tripodi Calumby, Lukáš Picek

First: 2025-11-14T02:14:08+00:00 · Latest: 2025-11-19T17:48:19+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) paired with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Ranked Probability Score (RPS). As an alternative, 3D-UNET baselines trained with an aggregate Ranked Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, an effectiveness gain of approximately 26% over the best 3D-UNET.

中文标题/摘要

标题：DINOv3作为冻结编码器的CRPS导向降水现在概率预报

本文提出了一种竞争性和计算效率高的概率降水现在概率预报方法。将一个视频投影器（V-JEPA 视觉变换器）与一个轻量级的概率头连接到预训练的卫星视觉编码器（DINOv3-SAT493M）上，将编码器标记映射到4小时累积降水的离散经验累积分布函数（eCDF）。投影头通过排名概率分数（RPS）端到端优化。作为替代，使用3D-UNET基线，这些基线使用聚合排名概率分数和每个像素的Gamma-Hurdle目标进行训练。在Weather4Cast 2025基准测试中，所提出的方法取得了令人鼓舞的性能，CRPS为3.5102，这相当于相对于最佳3D-UNET的有效性提升约26%。

Summary / 总结

This paper introduces a method for probabilistic rainfall nowcasting that attaches a video projector (V-JEPA Vision Transformer) and a lightweight probabilistic head to a pre-trained DINOv3-SAT493M encoder. The projector-head maps encoder tokens into a discrete empirical CDF over 4-hour accumulated rainfall and is optimized end-to-end with the Ranked Probability Score. On the Weather4Cast 2025 benchmark, the proposed method achieved a CRPS of 3.5102, an improvement of approximately 26% over the best 3D-UNET baseline.

该论文提出了一种使用预训练的DINOv3-SAT493M编码器和轻量级概率头的方法，该头通过排名概率评分优化端到端，用于概率降雨现在预报。方法使用视频投影器（V-JEPA 视觉变换器）将编码器标记映射到4小时累积降雨的离散经验累积分布函数。在Weather4Cast 2025基准测试中，所提出的方法实现了3.5102的CRPS，相比最佳3D-UNET基线提高了约26%的性能。
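
The Ranked Probability Score used as the training objective compares a predicted discrete CDF with the step-function CDF of the observation, and is the discrete analogue of the CRPS. The sketch below uses an unnormalised sum of squared CDF differences. The thresholds and forecasts are made up, and the paper's actual binning and any normalisation are not stated in the abstract.

```python
import numpy as np


def ranked_probability_score(pred_cdf, observed: float, thresholds) -> float:
    """Sum of squared differences between predicted and observed CDFs.

    pred_cdf[k] is the forecast probability that rainfall <= thresholds[k];
    the observed CDF is 1 where observed <= threshold and 0 elsewhere.
    Lower is better; a perfect, fully confident forecast scores 0.
    """
    pred_cdf = np.asarray(pred_cdf, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    obs_cdf = (observed <= thresholds).astype(float)
    return float(np.sum((pred_cdf - obs_cdf) ** 2))


# Hypothetical rainfall thresholds (mm) and one observation.
thresholds = [0.0, 1.0, 2.0, 4.0, 8.0]
observed = 1.5

sharp = [0.0, 0.0, 1.0, 1.0, 1.0]   # confident and correct
vague = [0.2, 0.4, 0.6, 0.8, 1.0]   # spreads probability widely

sharp_score = ranked_probability_score(sharp, observed, thresholds)
vague_score = ranked_probability_score(vague, observed, thresholds)
```

Because the score is computed on cumulative probabilities, a forecast that puts mass in a bin close to the observation is penalised less than one that puts it far away.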

GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Authors: Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe-Moreno, Alexander Lacoste

First: 2025-11-19T17:45:02+00:00 · Latest: 2025-11-19T17:45:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.

中文标题/摘要

标题：GEO-Bench-2：从性能到能力，重新思考地理空间AI的评估

地理空间基础模型（GeoFMs）正在改变地球观测（EO），但评估缺乏标准化协议。GEO-Bench-2 通过涵盖分类、分割、回归、对象检测和实例分割的全面框架，跨越了19个许可使用的数据集，解决了这一问题。我们引入了“能力”组，根据数据集共享的共同特征（例如，分辨率、波段、时间性）对模型进行排名。这使用户能够识别哪些模型在每个能力方面表现出色，并确定未来工作中需要改进的领域。为了支持公平比较和方法创新，我们定义了一种既具有指导性又灵活的评估协议。这不仅确保了基准测试的一致性，还促进了对模型适应策略的研究，这是推进GeoFMs用于下游任务的关键和开放挑战。我们的实验表明，没有单一模型在所有任务中都占主导地位，这证实了在架构设计和预训练期间所做的选择的特定性。虽然在自然图像上预训练的模型（ConvNext ImageNet，DINO V3）在高分辨率任务上表现出色，但针对EO的特定模型（TerraMind，Prithvi，和Clay）在多光谱应用（如农业和灾害响应）中表现更优。这些发现表明，最佳模型选择取决于任务要求、数据模态和约束。这表明，一个在所有任务中表现良好的单一GeoFM模型的目标仍然有待未来研究。GEO-Bench-2 使针对特定用例的GeoFM评估变得知情且可重复。GEO-Bench-2 的代码、数据和排行榜在许可协议下公开发布。

Summary / 总结

GEO-Bench-2 aims to standardize the evaluation of Geospatial Foundation Models (GeoFMs) by introducing a comprehensive framework covering various geospatial tasks. The study defines 'capability' groups to rank models based on shared dataset characteristics, enabling users to identify model strengths and areas for improvement. Experiments show that no single model excels across all tasks, with EO-specific models outperforming natural image-pretrained models in multispectral applications. This highlights the importance of task-specific model selection and underscores the need for further research into model adaptation strategies.

GEO-Bench-2 提出了一种全面的评估框架，用于评估地理空间基础模型（GeoFMs）在分类、分割、回归、对象检测和实例分割等任务上的表现。它基于共享相似特征（如分辨率和波段）的能力组对模型进行排名，使用户能够识别模型的优势和需要改进的领域。实验结果显示，没有单一模型在所有任务上都表现出色，图像预训练模型在高分辨率任务上优于特定于地球观测的模型，而特定于地球观测的模型在多光谱应用如农业和灾害响应中表现更佳。这表明模型选择具有任务特定性，并强调了进一步研究模型适应策略的必要性。

Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports

Authors: Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paal, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou

First: 2025-11-18T00:36:31+00:00 · Latest: 2025-11-19T17:42:08+00:00

Comments: 17 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.

中文标题/摘要

标题：基于知识的代理大型语言模型在灾后侦察报告中多灾种理解的应用

灾后侦察报告包含理解多灾种相互作用的关键证据，但由于其非结构化的叙述，系统性的知识转移变得困难。大型语言模型（LLMs）为分析这些报告提供了新的可能性，但在缺乏领域约束的情况下，它们通常会产生不可靠或虚构的输出。本研究引入了混合检索代理RAG（MoRA-RAG），这是一种基于知识的LLM框架，将侦察报告转化为多灾种推理的结构化基础。该框架结合了混合检索机制，该机制动态地将查询路由到特定于灾种的数据库中，同时使用代理分块来保持检索过程中的上下文连贯性。它还包括一个验证循环，评估证据的充分性，细化查询，并在信息不完整时启动有针对性的搜索。我们通过从GEER侦察报告中推导出问题-答案对来构建HazardRecQA，这些报告记录了7种主要灾种类型中的90个全球事件。MoRA-RAG的准确率达到94.5%，比零样本LLMs高出30%，比最先进的RAG系统高出10%，同时减少了各种LLM架构中的虚构现象。MoRA-RAG还使开放权重LLMs能够达到与专有模型相当的性能。它为将灾后文档转化为可操作的、可信赖的情报以增强灾种韧性建立了新的范式。

Summary / 总结

This study addresses the challenge of analyzing unstructured post-disaster reconnaissance reports by introducing MoRA-RAG, a knowledge-grounded LLM framework. It uses a Mixture-of-Retrieval mechanism and agentic chunking to preserve contextual coherence while integrating a verification loop to ensure evidence sufficiency. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs and state-of-the-art RAG systems, and enables open-weight LLMs to match proprietary models in performance. This framework transforms post-disaster documentation into actionable intelligence for hazard resilience.

本研究通过引入MoRA-RAG知识接地的大语言模型框架，解决了分析灾害后侦察报告中未结构化信息的挑战。该框架结合了混合检索机制和代理分块以确保上下文连贯性，并包含证据充分性验证循环。MoRA-RAG在准确率上达到94.5%，显著优于零样本大语言模型和最先进的检索增强系统，并使开放权重大语言模型能够达到专有模型的性能。该框架将灾害后文档转化为可用于灾害韧性的可操作情报。
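
The retrieve, verify, and refine cycle described in the abstract can be outlined as a small loop. Everything concrete below is a stand-in: the keyword router is a crude substitute for the Mixture-of-Retrieval mechanism, word overlap replaces real retrieval, and `is_sufficient` and `refine` are hypothetical callbacks that an LLM would implement in the actual system.

```python
from typing import Callable, Dict, List


def route(query: str, databases: Dict[str, List[str]]) -> List[str]:
    """Pick databases whose hazard name occurs in the query, else all."""
    hits = [name for name in databases if name in query.lower()]
    return hits or list(databases)


def retrieve_with_verification(
    query: str,
    databases: Dict[str, List[str]],
    is_sufficient: Callable[[List[str]], bool],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Retrieve evidence, check sufficiency, refine the query, repeat."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        words = set(query.lower().split())
        for name in route(query, databases):
            for passage in databases[name]:
                # Toy retrieval: keep passages sharing a word with the query.
                overlaps = words & set(passage.lower().split())
                if overlaps and passage not in evidence:
                    evidence.append(passage)
        if is_sufficient(evidence):
            break
        query = refine(query, evidence)
    return evidence


# Hypothetical hazard-specific stores with made-up passages.
toy_databases = {
    "flood": ["flood damaged the levee", "levee breach caused scour"],
    "earthquake": ["earthquake triggered liquefaction"],
}
evidence = retrieve_with_verification(
    "flood damage",
    toy_databases,
    is_sufficient=lambda ev: len(ev) >= 2,
    refine=lambda q, ev: q + " levee",
)
```

In this toy run the first round finds one passage, the sufficiency check fails, and the refined query picks up the second passage in the next round.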

Continual Reinforcement Learning for Cyber-Physical Systems: Lessons Learned and Open Challenges

Authors: Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill

First: 2025-11-19T17:40:13+00:00 · Latest: 2025-11-19T17:40:13+00:00

Comments: 5 pages, 5 figures, Accepted to RLDM 2025

Abs · PDF · Code1 · Code2

Abstract

Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time, and it is especially relevant to cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem. This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is successively trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity. Based on these identified challenges, we present open research questions that are important to be addressed for creating robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.

中文标题/摘要

标题：持续强化学习在网络物理系统中的应用：经验教训与开放挑战

持续学习（CL）是机器学习的一个分支，旨在使智能体能够适应并泛化之前学到的能力，以便在新的任务或环境中重新应用。这在多任务设置或非平稳环境中特别有用，其中动态性会随时间变化。这对于诸如自动驾驶的网络物理系统尤为重要。然而，尽管在CL方面取得了近期进展，将其成功应用于强化学习（RL）仍然是一个开放问题。本文基于在自动驾驶环境中的实验，突出了持续RL（CRL）中的开放挑战。在该环境中，智能体必须学会在四个不同场景中成功停车，这些场景对应于不同角度的停车位。智能体依次在一个接一个的四个场景中进行训练，代表了一个CL环境，使用了近端策略优化（PPO）。这些实验揭示了CRL中的多个开放挑战：环境的合适抽象、对超参数的过度敏感、灾难性遗忘以及神经网络容量的有效利用。基于这些识别出的挑战，我们提出了创建稳健的CRL系统时需要解决的重要研究问题。此外，识别出的挑战也质疑了神经网络在CL中的适用性。我们还指出了跨学科研究的必要性，特别是计算机科学与神经科学之间的研究。

Summary / 总结

This paper explores open challenges in continual reinforcement learning (CRL) using an autonomous driving environment where the agent must learn to park in four different scenarios. The experiments, conducted using Proximal Policy Optimisation (PPO), highlight issues such as finding suitable environmental abstractions, hyperparameter sensitivity, catastrophic forgetting, and efficient neural network use. These challenges suggest the need for robust CRL systems and interdisciplinary research, particularly between computer science and neuroscience.

该论文探讨了持续强化学习（CRL）在自动驾驶等网络物理系统中的挑战，特别是在一个需要学习在四种不同场景下停车的环境中。使用Proximal Policy Optimization (PPO)，实验揭示了适合环境抽象、超参数敏感性、灾难性遗忘以及神经网络高效利用等方面的问题。这些发现表明当前的神经网络可能不适合持续学习，并强调了计算机科学与神经科学之间跨学科研究的必要性以解决这些问题。
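
Catastrophic forgetting in a sequential training setup like the four parking scenarios is often quantified from a performance matrix: for each earlier task, the best score it ever reached minus its score after the final training stage. The sketch below implements that common measure. The success rates are invented for illustration and are not results from the paper, which does not state which forgetting metric it uses.

```python
import numpy as np


def forgetting_per_task(perf) -> np.ndarray:
    """Forgetting for every task except the last one trained.

    perf[i, j] is the performance on task j measured after training
    stage i (tasks are trained in order 0, 1, 2, ...). Forgetting for
    task j is its best score between stage j and the second-to-last
    stage, minus its score after the final stage.
    """
    perf = np.asarray(perf, dtype=float)
    n = perf.shape[0]
    final = perf[-1]
    result = []
    for j in range(n - 1):
        best_earlier = perf[j:n - 1, j].max()
        result.append(best_earlier - final[j])
    return np.array(result)


# Invented parking success rates for four sequentially trained scenarios.
toy_perf = [
    [0.9, 0.1, 0.0, 0.0],
    [0.4, 0.8, 0.1, 0.0],
    [0.3, 0.5, 0.9, 0.1],
    [0.2, 0.3, 0.6, 0.9],
]
forgetting = forgetting_per_task(toy_perf)
```

Larger values mean more of an earlier skill was lost while learning later scenarios; a value near zero would indicate the skill was retained.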

Distribution Matching Distillation Meets Reinforcement Learning

Authors: Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, Mengmeng Wang, Steven Hoi, Peng Gao, Harry Yang

First: 2025-11-17T17:59:54+00:00 · Latest: 2025-11-19T17:27:53+00:00

Comments: The synergy of reinforcement learning and distribution matching distillation. See more: https://github.com/vvvvvjdy/dmdr

Abs · PDF · Code1 · Code2 · Code3

Abstract

Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.

中文标题/摘要

标题：分布匹配蒸馏与强化学习的结合

分布匹配蒸馏（DMD）将预训练的多步扩散模型精简为几步模型以提高推理效率。然而，后者的性能往往受限于前者。为解决这一问题，我们提出了一种新的框架DMDR，将强化学习（RL）技术融入蒸馏过程。我们表明，对于几步生成器的RL，DMD损失本身比传统的正则化更有效。反过来，RL可以帮助更有效地指导DMD中的模式覆盖过程。这些允许我们在同时进行蒸馏和RL的情况下解锁几步生成器的能力。同时，我们设计了动态分布指导和动态重新噪声采样训练策略以改进初始蒸馏过程。实验表明，DMDR可以实现领先的视觉质量、几步方法之间的提示一致性，甚至表现出超越多步教师的性能。

Summary / 总结

The paper addresses the performance cap that a multi-step teacher places on few-step diffusion models distilled from it. It introduces DMDR, a framework that integrates RL techniques into the distillation process. The DMD loss is found to be a more effective regularizer for RL of the few-step generator than traditional ones, while RL in turn helps guide mode coverage in DMD. Dynamic distribution guidance and dynamic renoise sampling strategies further improve the initial distillation. Experiments show that DMDR achieves leading visual quality and prompt coherence among few-step methods and can even exceed the multi-step teacher.

研究旨在通过使用分布匹配蒸馏（DMD）将多步扩散模型精简为少步模型，以提高推理效率。为克服少步模型的性能限制，作者引入了DMDR框架，将强化学习（RL）整合到蒸馏过程中。研究表明，DMD损失对于引导少步生成器的RL更为有效，有助于更好地覆盖模式。此外，还设计了动态分布指导和动态重新噪声采样训练策略以改进初始蒸馏过程。实验结果表明，DMDR在视觉质量和提示一致性方面优于多步模型，甚至在某些方面超过了它们。

Optimal control of the future via prospective learning with control

Authors: Yuxin Bai, Aranyak Acharyya, Ashwin De Silva, Zeyu Shen, James Hassett, Joshua T. Vogelstein

First: 2025-11-11T19:27:14+00:00 · Latest: 2025-11-19T17:25:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). While powerful, this learning framework is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility in more realistic settings. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called "Prospective Learning with Control (PL+C)", we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control, foraging -- which is a canonical task for any mobile agent -- be it natural or artificial. We illustrate that modern RL algorithms fail to learn in these non-stationary reset-free environments, and even with modifications, they are orders of magnitude less efficient than our prospective foraging agents.

中文标题/摘要

标题：通过前瞻学习与控制实现未来优化控制

未来优化控制是AI的下一个前沿领域。当前解决此问题的方法通常基于强化学习（RL）。虽然强大，但这种学习框架在数学上与监督学习不同，后者一直是最近AI成就的主要工具。此外，RL通常在静止环境中运行，并且每集重置，限制了其在更现实环境中的应用。在这里，我们将监督学习扩展到非静止、无重置环境中的控制学习。使用这种方法，称为“前瞻学习与控制（PL+C）”，我们证明，在某些相当一般的假设下，经验风险最小化（ERM）渐近地实现了贝叶斯最优策略。然后，我们考虑了前瞻学习与控制的一个具体实例——觅食，这是任何移动代理（无论是自然的还是人工的）的经典任务。我们说明了现代RL算法在这些非静止、无重置环境中无法学习，即使经过修改，它们的效率也比我们的前瞻性觅食代理低几个数量级。

Summary / 总结

The research aims to extend optimal control methods to non-stationary environments without episodic resets, addressing limitations of traditional reinforcement learning (RL). The method, called Prospective Learning with Control (PL+C), leverages supervised learning principles to prove that empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy under certain assumptions. Key findings show that modern RL algorithms struggle in non-stationary reset-free environments, while PL+C agents are significantly more efficient in foraging tasks, a canonical task for mobile agents.

研究旨在将最优控制方法扩展到没有周期性重置的非平稳环境，解决传统强化学习（RL）的局限性。该方法称为“前瞻性学习与控制”（PL+C），利用监督学习原理证明，在某些假设下，经验风险最小化（ERM）渐近地实现贝叶斯最优策略。关键发现表明，现代RL算法在非平稳无重置环境中难以学习，而PL+C代理在觅食任务中表现显著更高效，觅食是移动代理的经典任务。

CODE-II: A large-scale dataset for artificial intelligence in ECG analysis

Authors: Petrus E. O. G. B. Abreu, Gabriela M. M. Paixão, Jiawei Li, Paulo R. Gomes, Peter W. Macfarlane, Ana C. S. Oliveira, Vinicius T. Carvalho, Thomas B. Schön, Antonio Luiz P. Ribeiro, Antônio H. Ribeiro

First: 2025-11-19T17:14:05+00:00 · Latest: 2025-11-19T17:14:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide two openly available sets: CODE-II-open, a public subset of 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.

中文标题/摘要

标题：CODE-II：用于心电图分析的人工智能大规模数据集

基于数据的方法在心电图（ECG）解释方面迅速发展。大规模数据集促进了基于人工智能（AI）的心电图分析的进步，但注释质量、规模和范围的限制仍然是主要挑战。在这里，我们介绍了CODE-II，这是一个由巴西米纳斯吉拉斯州远程医疗网络（TNMG）收集的2,735,269份12导联心电图的大规模现实世界数据集，来自2,093,807名成人患者。每项检查都使用标准化诊断标准进行注释，并由心脏病专家审核。CODE-II 的一个显著特点是66个临床有意义的诊断类别，这些类别是在心脏病专家的输入下开发的，并且在远程医疗实践中常规使用。我们还提供了一个开放可用的子集：CODE-II-open，这是一个包含15,000名患者的公共子集，以及一个非重叠的CODE-II-test子集，包含8,475份由多名心脏病专家审核的检查，用于盲法评估。在CODE-II上预训练的神经网络在外部基准测试（PTB-XL和CPSC 2018）上实现了更好的迁移性能，并优于在更大数据集上训练的替代方案。

Summary / 总结

The motivation for CODE-II is to address the limitations in annotation quality, size, and scope of existing ECG datasets for AI-based analysis. The main method involves collecting 2,735,269 12-lead ECGs from 2,093,807 adult patients, with each exam annotated using standardized criteria and reviewed by cardiologists. Key experimental findings show that a neural network pre-trained on CODE-II outperformed alternatives trained on larger datasets on external benchmarks (PTB-XL and CPSC 2018).

CODE-II 的动机是解决现有 ECG 数据集在注释质量、规模和范围方面的限制。主要方法是收集来自 2,093,807 名成年患者的 2,735,269 份 12 导联 ECG，每份 ECG 使用标准化标准进行注释并由心脏病专家审核。关键实验发现表明，基于 CODE-II 预训练的神经网络在外部基准（PTB-XL 和 CPSC 2018）上表现优于在更大数据集上训练的替代方案。

The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Authors: Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris

First: 2025-11-19T17:07:08+00:00 · Latest: 2025-11-19T17:07:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.

中文标题/摘要

标题：SA-FARI数据集：面向识别与个体辨识的动物影像“分割一切”

自动视频分析对于野生动物保护至关重要。该领域的一个基础任务是多动物跟踪（MAT），它支撑着个体再识别和行为识别等应用。然而，现有数据集在规模、物种限制或时空多样性方面存在局限，没有适合训练适用于野生动物种群的通用MAT模型的基准。为解决这一问题，我们引入了SA-FARI，这是最大的开源野生动物多动物跟踪数据集。它包含从四大洲741个地点收集的约10年（2014-2024）的11,609个相机陷阱视频，覆盖99个物种类别。每个视频都进行了详尽标注，最终包含约46小时密集标注的视频片段，包含16,224个掩码身份和942,702个个体边界框、分割掩码和物种标签。除了特定任务的标注，我们还发布了每个视频的匿名相机陷阱位置。最后，我们使用最先进的视觉-语言模型在SA-FARI上进行了检测和跟踪基准测试，包括SAM 3，使用了物种特定和通用动物提示进行评估。我们还与专门为野生动物分析开发的仅视觉方法进行了比较。SA-FARI是第一个结合高物种多样性、多区域覆盖和高质量时空标注的大规模数据集，为推进野外多动物跟踪的通用性提供了新的基础。数据集可在 https://www.conservationxlabs.com/sa-fari 获取。

Summary / 总结

The research aims to improve automated video analysis for wildlife conservation by addressing the limitations of existing datasets. The study introduces SA-FARI, a large-scale dataset for multi-animal tracking (MAT) of wild animals, containing 11,609 camera trap videos from 741 locations across 4 continents and spanning 99 species categories. Each video is densely annotated, for a total of 16,224 masklet identities and 942,702 bounding boxes, segmentation masks, and species labels. The dataset is benchmarked with state-of-the-art vision-language models, including SAM 3, and these are compared against vision-only methods developed for wildlife analysis. SA-FARI provides a new foundation for developing generalizable MAT models for wild animal populations.

研究引入了SA-FARI数据集，这是一个用于野生动物多动物跟踪的大规模数据集，解决了现有数据集在规模、物种多样性和地理覆盖范围方面的局限性。该数据集包含来自4个大陆741个地点的11,609个相机陷阱视频，涵盖99个物种类别，并进行了详尽的标注，包括16,224个掩码身份和942,702个边界框、分割掩码和物种标签。研究使用最先进的视觉-语言模型对该数据集进行了评估，并与专门用于野生动物分析的视觉方法进行了比较，展示了SA-FARI在推动野生环境中多动物跟踪的泛化方面的潜力。

CODE: A global approach to ODE dynamics learning

Authors: Nils Wildt, Daniel M. Tartakovsky, Sergey Oladyshkin, Wolfgang Nowak

First: 2025-11-19T17:04:24+00:00 · Latest: 2025-11-19T17:04:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Ordinary differential equations (ODEs) are a conventional way to describe the observed dynamics of physical systems. Scientists typically hypothesize about dynamical behavior, propose a mathematical model, and compare its predictions to data. However, modern computing and algorithmic advances now enable purely data-driven learning of governing dynamics directly from observations. In data-driven settings, one learns the ODE's right-hand side (RHS). Dense measurements are often assumed, yet high temporal resolution is typically both cumbersome and expensive. Consequently, one usually has only sparsely sampled data. In this work we introduce ChaosODE (CODE), a Polynomial Chaos ODE Expansion in which we use an arbitrary Polynomial Chaos Expansion (aPCE) for the ODE's right-hand side, resulting in a global orthonormal polynomial representation of dynamics. We evaluate the performance of CODE in several experiments on the Lotka-Volterra system, across varying noise levels, initial conditions, and predictions far into the future, even on previously unseen initial conditions. CODE exhibits remarkable extrapolation capabilities even when evaluated under novel initial conditions and shows advantages compared to well-examined methods using neural networks (NeuralODE) or kernel approximators (KernelODE) as the RHS representer. We observe that the high flexibility of NeuralODE and KernelODE degrades extrapolation capabilities under scarce data and measurement noise. Finally, we provide practical guidelines for robust optimization of dynamics-learning problems and illustrate them in the accompanying code.

中文标题/摘要

标题：CODE: 全局视角下的ODE动力学习

常微分方程（ODEs）是描述物理系统观测动力学的一种传统方式。科学家通常会假设动力学行为，提出一个数学模型，并将其预测与数据进行比较。然而，现代计算和算法的进步现在使得可以直接从观测数据中进行纯粹的数据驱动的动力学学习。在数据驱动的环境中，人们学习的是ODE的右侧（RHS）。通常假设密集的测量，但高时间分辨率通常既繁琐又昂贵。因此，通常只有稀疏采样的数据。在本文中，我们引入了ChaosODE（CODE），这是一种多项式混沌ODE展开，在这种展开中，我们使用任意多项式混沌展开（aPCE）作为ODE的右侧，从而得到动力学的全局正交多项式表示。我们在Lotka-Volterra系统上进行了多项实验，评估了CODE在不同噪声水平、初始条件以及远期预测中的性能，甚至在以前未见过的初始条件下。CODE在评估新型初始条件时表现出惊人的外推能力，并且在与使用神经网络（NeuralODE）或核逼近器（KernelODE）作为RHS表示者的已检验方法相比时显示出优势。我们观察到，NeuralODE和KernelODE的高灵活性在数据稀缺和测量噪声下会降低外推能力。最后，我们提供了动力学习问题稳健优化的实用指南，并在附带的代码中进行了说明。

Summary / 总结

The research aims to develop a method for learning the dynamics of physical systems directly from sparse data using a Polynomial Chaos ODE Expansion (CODE). The method uses an arbitrary Polynomial Chaos Expansion (aPCE) for the ODE's right-hand side, providing a global orthonormal polynomial representation of the dynamics. Experiments on the Lotka-Volterra system show that CODE extrapolates better than NeuralODE and KernelODE, especially under scarce data and measurement noise. The paper also provides practical guidelines for robust optimization of dynamics-learning problems.

本文介绍了CODE方法，该方法以多项式混沌展开表示常微分方程（ODE）的右端项，直接从稀疏数据中学习物理系统的动力学。在Lotka-Volterra系统上，CODE能够处理不同噪声水平和初始条件，展现出强大的外推能力。与NeuralODE和KernelODE相比，CODE在数据稀少和噪声条件下表现出更好的外推性能，使其成为处理稀疏数据场景中动力学学习的重要工具。
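
One way to see why a global polynomial representation fits this benchmark: the Lotka-Volterra right-hand side is itself a degree-2 polynomial in the state, so a polynomial expansion can represent it exactly. The sketch below is only an illustration under stated assumptions. It uses a plain monomial basis rather than the orthonormal aPCE basis described in the abstract, and the parameter values, step size, and all function names are hypothetical.

```python
# Illustrative sketch: monomial basis, NOT the paper's orthonormal aPCE basis.

def monomials(x, y):
    """Degree <= 2 monomial basis evaluated at the state (x, y)."""
    return [1.0, x, y, x * x, x * y, y * y]

def poly_rhs(state, coeffs):
    """Evaluate dx/dt and dy/dt as linear combinations of the basis."""
    basis = monomials(*state)
    return tuple(sum(c * b for c, b in zip(row, basis)) for row in coeffs)

# Lotka-Volterra: dx/dt = a*x - b*x*y, dy/dt = -c*y + d*x*y
a, b, c, d = 1.0, 0.5, 1.0, 0.25  # assumed parameter values
LV_COEFFS = [
    [0.0, a, 0.0, 0.0, -b, 0.0],   # coefficients for dx/dt
    [0.0, 0.0, -c, 0.0, d, 0.0],   # coefficients for dy/dt
]

def rk4_step(state, coeffs, dt):
    """One classical Runge-Kutta step for the polynomial ODE."""
    def shift(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = poly_rhs(state, coeffs)
    k2 = poly_rhs(shift(state, k1, dt / 2), coeffs)
    k3 = poly_rhs(shift(state, k2, dt / 2), coeffs)
    k4 = poly_rhs(shift(state, k3, dt), coeffs)
    return tuple(
        s + dt / 6 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )

def simulate(state, coeffs, dt=0.01, steps=1000):
    """Integrate forward and return the list of visited states."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, coeffs, dt)
        trajectory.append(state)
    return trajectory
```

In a learning setting the entries of the coefficient table would be the unknowns fitted to sparse observations; here they are fixed to the true system purely to show the representation.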

FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation

Authors: Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao, Le Wan, Yuwang Wang, Ronggang Wang, Shengfeng He

First: 2025-11-19T17:03:49+00:00 · Latest: 2025-11-19T17:03:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.

中文标题/摘要

标题：FlashMesh：通过结构化推测实现更快更好的自回归网格合成

自回归模型可以通过顺序生成顶点和面来生成高质量的3D网格，但逐令牌解码导致推理速度慢，限制了其在交互式和大规模应用中的实际使用。我们提出了FlashMesh，这是一种快速且高保真的网格生成框架，通过预测-校正-验证范式重新思考自回归解码。关键见解是网格令牌表现出强烈的结构和几何相关性，从而可以有把握地进行多令牌推测。FlashMesh 利用这一点，引入了一种针对常用的沙漏型（hourglass）Transformer架构定制的推测解码方案，在面、点和坐标级别上实现并行预测。大量实验表明，FlashMesh 相比标准自回归模型最高可实现2倍加速，同时还提高了生成保真度。我们的结果表明，网格数据中的结构先验可以被系统地利用，以加速和增强自回归生成。

Summary / 总结

FlashMesh addresses the slow inference issue of autoregressive models in mesh synthesis by introducing a predict-correct-verify paradigm. It leverages the strong structural and geometric correlations in mesh tokens to enable multi-token speculation, allowing parallel prediction. Experiments show that FlashMesh achieves up to a 2x speedup while improving generation fidelity compared to standard autoregressive models.

FlashMesh 通过引入预测-校正-验证范式来解决自回归模型在3D网格合成中的缓慢推理问题。它利用网格令牌中的结构和几何相关性，采用推测性解码方案，在不同层级上实现并行预测。实验表明，FlashMesh 相比标准自回归模型最高可实现2倍加速，同时提高生成保真度。
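
The general idea of multi-token speculation followed by verification can be sketched as below. This is a generic draft-and-verify loop, not FlashMesh's hourglass-specific scheme: `propose` and `next_token` are hypothetical callables standing in for a multi-token predictor and the base model's greedy next-token choice, and verification is written as sequential calls for clarity, whereas a practical system would check a whole draft in one batched forward pass.

```python
def verify_draft(prefix, draft, next_token):
    """Accept the longest draft prefix the base model agrees with.

    next_token(seq) is a hypothetical callable returning the base model's
    greedy next token for the sequence seq.
    """
    accepted = []
    for token in draft:
        expected = next_token(prefix + accepted)
        if token != expected:
            # First disagreement: keep the base model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    return accepted

def speculative_decode(prefix, propose, next_token, max_len):
    """Generate tokens by alternating multi-token proposals and verification."""
    seq = list(prefix)
    while len(seq) < max_len:
        draft = propose(seq)
        if not draft:
            # No proposal: fall back to one ordinary autoregressive step.
            draft = [next_token(seq)]
        seq.extend(verify_draft(seq, draft, next_token))
    return seq[:max_len]
```

Because every accepted token is one the base model would have produced anyway, the output of this loop matches plain greedy decoding; the speedup comes only from how many draft tokens are accepted per verification round.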

When to Think and When to Look: Uncertainty-Guided Lookback

Authors: Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason J. Corso, Chenliang Xu

First: 2025-11-19T17:01:02+00:00 · Latest: 2025-11-19T17:01:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.

中文标题/摘要

标题：何时思考何时查看：基于不确定性回溯

测试时的思考（即生成明确的中间推理链）已被证明能提升大型语言模型的性能，并且最近在大型视觉语言模型（LVLMs）中也显示出强大的增益。然而，尽管取得了这些有希望的结果，仍然没有系统地分析思考如何影响视觉推理。我们提供了首次此类分析，通过大规模、受控的比较思考对LVLMs的影响，评估了InternVL3.5和Qwen3-VL家族中的十个变体在MMMU-val下的表现，使用宽松的令牌预算和多轮解码。我们展示了更多的思考并不总是更好的；长链往往导致忽略图像的长期错误轨迹，并且表现不如标准指令模式运行的相同模型。更深入的分析表明，某些短回溯短语，明确地回溯到图像，强烈富集于成功的轨迹中，并与更好的视觉定位相关。基于这一洞察，我们提出了基于不确定性回溯的解码策略，该策略结合了不确定性信号和自适应回溯提示及广度搜索。我们的方法在整体MMMU性能上有所提升，在标准思考较弱的类别中取得最大的增益，并优于几个强大的解码基线，固定模型家族和令牌预算下达到新的最佳水平。我们进一步展示了该解码策略的泛化能力，在五个额外的基准上取得一致的改进，包括两个广泛的多模态套件和数学聚焦的视觉推理数据集。

Summary / 总结

The study investigates the impact of test-time thinking on visual reasoning in large vision language models (LVLMs) by comparing ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val. It finds that more thinking is not always beneficial: long chains often produce wrong trajectories that ignore the image. The research introduces uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. The method sets a new state of the art on MMMU under fixed model families and token budgets and yields consistent improvements on five additional benchmarks.

研究通过比较来自InternVL3.5和Qwen3-VL家族的十种变体，分析了测试时思考对大型视觉语言模型（LVLMs）视觉推理的影响。研究发现，更多的思考并不总是有益的，因为长的推理链往往会忽略图像并导致错误的推理。研究引入了一种无需训练的不确定性引导回溯解码策略，结合了不确定性信号、自适应回溯提示和广度搜索，在固定模型家族和令牌预算下于MMMU上达到新的最先进水平，并在另外五个基准上取得一致的改进。
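
A minimal sketch of how an uncertainty signal might trigger a lookback prompt during decoding is given below. Everything here is an assumption made for illustration: the abstract does not specify the uncertainty measure, so next-token entropy is used as a stand-in, and the prompt text, threshold, and function names are hypothetical. The breadth-search component is omitted.

```python
import math

LOOKBACK_PROMPT = "Looking back at the image, "  # hypothetical phrase

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maybe_lookback(text, next_token_probs, threshold=1.0):
    """Append a lookback prompt when the next-token distribution is uncertain.

    The entropy criterion and the threshold value are illustrative
    assumptions, not the paper's exact signal.
    """
    if entropy(next_token_probs) > threshold:
        return text + LOOKBACK_PROMPT
    return text
```

For example, a uniform distribution over four tokens has entropy ln 4 (about 1.39 nats) and would trigger the prompt at the default threshold, while a sharply peaked distribution would not.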

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu

First: 2025-11-19T16:52:23+00:00 · Latest: 2025-11-19T16:52:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.

中文标题/摘要

标题：SRPO：自参照策略优化在视觉-语言-动作模型中的应用

视觉-语言-动作（VLA）模型在机器人操作方面表现出色，但它们严重依赖于专家演示，这导致了演示偏差并限制了性能。强化学习（RL）是克服这些限制的重要后训练策略，然而当前的VLA-RL方法，包括基于群体的优化方法，受到严重奖励稀疏性的困扰。依赖于二元成功指标浪费了失败轨迹中的宝贵信息，导致训练效率低下。为了解决这个问题，我们提出了自参照策略优化（SRPO），这是一种新颖的VLA-RL框架。SRPO通过利用模型自身在当前训练批次中生成的成功轨迹作为自我参照，消除了对外部演示或手动奖励工程的需求。这使我们能够为失败尝试分配按进度的奖励。核心创新在于使用潜在世界表示来稳健地衡量行为进步。我们利用世界模型潜在空间中的压缩、可转移编码，而不是依赖原始像素或需要特定领域的微调，这些表示自然地捕捉了跨环境的进步模式，使准确、泛化的轨迹比较成为可能。在LIBERO基准上的实证评估表明，SRPO的效率和有效性。从监督基线的48.9%成功率开始，SRPO仅在200个RL步骤后达到了新的最佳成功率99.2%，相对改进了103%且无需额外监督。此外，SRPO还表现出显著的鲁棒性，在LIBERO-Plus基准上实现了167%的性能改进。

Summary / 总结

The research aims to improve Vision-Language-Action (VLA) models in robotic manipulation by addressing the limitations of expert demonstrations and reward sparsity. SRPO, a novel Self-Referential Policy Optimization framework, leverages the model's own successful trajectories to assign progress-wise rewards, enhancing training efficiency. SRPO uses latent world representations from a world model's latent space to measure behavioral progress, achieving a new state-of-the-art success rate of 99.2% in 200 RL steps, a 103% relative improvement over a supervised baseline, and demonstrating robust performance on the LIBERO-Plus benchmark.

SRPO 是一种新颖的 VLA-RL 框架，通过使用模型自身的成功轨迹作为自我参考来分配进度奖励，从而提高训练效率。它利用世界模型潜在空间中的潜在世界表示来衡量行为进度，避免了对外部演示或手动奖励工程的需求。SRPO 在 200 个 RL 步骤中实现了 99.2% 的新最佳成功率，相对于监督基线 48.9% 的成功率提高了 103%，并且在 LIBERO-Plus 基准上表现出显著的鲁棒性。
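
The 103% figure follows from the two success rates quoted in the abstract, assuming the usual definition of relative improvement (gain divided by the baseline value):

```latex
\[
\text{relative improvement}
  = \frac{99.2 - 48.9}{48.9}
  = \frac{50.3}{48.9}
  \approx 1.029
  \approx 103\%
\]
```

The 167% improvement on LIBERO-Plus cannot be checked the same way, since the abstract does not state the underlying success rates.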

MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation

Authors: Bin Xie, Gady Agam

First: 2025-11-19T16:49:02+00:00 · Latest: 2025-11-19T16:49:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.

中文标题/摘要

标题：MaskMed：解耦的掩码和类别预测在医学图像分割中的应用

医学图像分割通常采用点式卷积分割头来预测密集标签，其中每个输出通道被启发式地与特定类别相关联。这种刚性设计限制了特征共享和语义泛化。在本文中，我们提出了一种统一的解耦分割头，将多类预测分解为无类别的掩码预测和类别标签预测，使用共享的对象查询。此外，我们引入了一种全尺度感知的可变形变换器模块，该模块通过可变形注意力使低分辨率编码特征能够跨全分辨率编码特征进行注意，从而实现高效且空间对齐的全尺度融合。我们提出的方法MaskMed在AMOS 2022和BTCV上分别超越了nnUNet 2.0%和6.9%的Dice值，达到了最先进的性能。
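
To make the decoupling concrete, the sketch below shows one common way to turn class-agnostic masks and per-query class distributions into a dense label map: each pixel's class score is the sum over queries of class probability times mask probability. This combination rule is borrowed from mask-classification segmentation in general and is an assumption here; the abstract does not state MaskMed's actual inference rule, and the function and argument names are hypothetical.

```python
def combine(mask_probs, class_probs):
    """Turn Q class-agnostic masks plus Q class distributions into pixel labels.

    mask_probs[q][p]  : probability that pixel p belongs to query q's mask
    class_probs[q][c] : probability that query q has class c
    Returns the argmax class index for each pixel.
    """
    num_queries = len(mask_probs)
    num_pixels = len(mask_probs[0])
    num_classes = len(class_probs[0])
    labels = []
    for p in range(num_pixels):
        # Marginalise over queries to get one score per class at this pixel.
        scores = [
            sum(class_probs[q][c] * mask_probs[q][p] for q in range(num_queries))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels
```

Unlike a point-wise head with one output channel per class, the number of queries here is independent of the number of classes, so several queries can share features while predicting the same class.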

Agent-SAMA: State-Aware Mobile Assistant

Authors: Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang

Venue: AAAI

First: 2025-05-29T16:08:51+00:00 · Latest: 2025-11-19T16:48:16+00:00

Comments: Accepted to AAAI-26 (Main Technical Track)

Abs · PDF · Code1 · Code2

Abstract

Mobile Graphical User Interface (GUI) agents aim to autonomously complete tasks within or across apps based on user instructions. While recent Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens and perform actions, existing agents remain fundamentally reactive. They reason over the current UI screen but lack a structured representation of the app navigation flow, limiting GUI agents' ability to understand execution context, detect unexpected execution results, and recover from errors. We introduce Agent-SAMA, a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Agent-SAMA implements four specialized agents that collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. We evaluate Agent-SAMA on two types of benchmarks: cross-app (Mobile-Eval-E, SPA-Bench) and mostly single-app (AndroidWorld). On Mobile-Eval-E, Agent-SAMA achieves an 84.0% success rate and a 71.9% recovery rate. On SPA-Bench, it reaches an 80.0% success rate with a 66.7% recovery rate. Compared to prior methods, Agent-SAMA improves task success by up to 12% and recovery success by 13.8%. On AndroidWorld, Agent-SAMA achieves a 63.7% success rate, outperforming the baselines. Our results demonstrate that structured state modeling enhances robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.

中文标题/摘要

标题：Agent-SAMA：状态感知移动助手

移动图形用户界面（GUI）代理旨在根据用户指令自主完成应用程序内或跨应用程序的任务。虽然最近的多模态大型语言模型（MLLMs）使这些代理能够解释UI屏幕并执行操作，但现有的代理仍然本质上是反应性的。它们仅在当前UI屏幕上进行推理，缺乏对应用程序导航流程的结构化表示，限制了GUI代理理解执行上下文、检测意外执行结果和从错误中恢复的能力。我们引入了Agent-SAMA，这是一种状态感知的多代理框架，将应用程序执行建模为有限状态机（FSM），将UI屏幕视为状态，用户操作视为转换。Agent-SAMA 实现了四个专门的代理，它们协作实时构建和使用FSM来指导任务规划、执行验证和恢复。我们在两种类型的基准测试上评估了Agent-SAMA：跨应用程序（Mobile-Eval-E，SPA-Bench）和主要单应用程序（AndroidWorld）。在Mobile-Eval-E上，Agent-SAMA 达到了84.0%的成功率和71.9%的恢复率。在SPA-Bench上，它达到了80.0%的成功率和66.7%的恢复率。与先前的方法相比，Agent-SAMA 将任务成功率提高了最多12%，恢复成功率提高了13.8%。在AndroidWorld上，Agent-SAMA 达到了63.7%的成功率，优于基线。我们的结果表明，结构化状态建模可以增强鲁棒性，并可作为未来GUI代理的轻量级、模型无关的记忆层。

Summary / 总结

Agent-SAMA is a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM) to improve task planning, execution verification, and recovery. It achieves high success rates and recovery rates on cross-app and mostly single-app benchmarks, outperforming previous methods and demonstrating the benefits of structured state modeling for GUI agents.

Agent-SAMA 是一个状态感知的多代理框架，将应用执行建模为有限状态机（FSM），以增强移动GUI代理的鲁棒性。它使用四个专门的代理实时构建和利用 FSM 来进行任务规划、执行验证和恢复。在跨应用基准 Mobile-Eval-E 上，Agent-SAMA 的成功率达到 84.0%，恢复率达到 71.9%；在 SPA-Bench 上，成功率为 80.0%，恢复率为 66.7%。与先前的方法相比，任务成功率最多提高 12%，恢复成功率提高 13.8%。在 AndroidWorld 中，它实现了 63.7% 的成功率，超过了基线方法。
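
The idea of treating UI screens as states and user actions as transitions can be sketched with a small data structure. This is a simplified illustration, not Agent-SAMA's implementation: the class, its methods, and the screen and action names used with it are all hypothetical, and the four cooperating agents are not modelled.

```python
class AppFSM:
    """UI screens are states; user actions are transitions."""

    def __init__(self):
        self.transitions = {}  # (screen, action) -> resulting screen

    def record(self, screen, action, next_screen):
        """Store an observed transition."""
        self.transitions[(screen, action)] = next_screen

    def expected(self, screen, action):
        """Return the screen this action led to before, or None if unseen."""
        return self.transitions.get((screen, action))

    def verify(self, screen, action, observed_screen):
        """Return True if the result matches the model (or is new), else False."""
        known = self.expected(screen, action)
        if known is None:
            self.record(screen, action, observed_screen)
            return True
        return known == observed_screen

    def path(self, start, goal):
        """Breadth-first search for an action sequence from start to goal."""
        frontier, seen = [(start, [])], {start}
        while frontier:
            screen, actions = frontier.pop(0)
            if screen == goal:
                return actions
            for (src, action), dst in self.transitions.items():
                if src == screen and dst not in seen:
                    seen.add(dst)
                    frontier.append((dst, actions + [action]))
        return None
```

In this sketch, `verify` returning `False` plays the role of detecting an unexpected execution result, and `path` gives a candidate recovery route from the current screen back to a known one.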

US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery

Authors: Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour

Venue: MICCAI 2025

First: 2025-11-19T16:45:04+00:00 · Latest: 2025-11-19T16:45:04+00:00

Comments: Accepted at the Workshop on Shape in Medical Imaging at MICCAI 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete

中文标题/摘要

标题：US-X 完整：一种多模态的三维解剖形状恢复方法

超声波提供了一种无辐射、成本效益高的实时可视化脊柱标志点、脊旁软组织和神经血管结构的方法，使其在脊柱手术中的术中指导中具有重要价值。然而，超声波由于骨引起的声影效应，在可视化完整的椎体解剖结构方面存在固有的局限性，特别是椎体本身。在本研究中，我们提出了一种新颖的多模态深度学习方法，通过利用单张X射线图像的互补信息来完成3D超声中的遮挡解剖结构。为了进行训练，我们生成了配对的训练数据，包括：(1) 模拟X射线扫描的2D侧位椎体视图，(2) 模拟超声脊柱成像中遇到的有限可见性和遮挡的3D部分椎体表示。我们的方法结合了两种成像模态的形态信息，并在椎体重建方面显著优于现有的3D超声椎体完成技术（p < 0.001）。我们进行了体模研究作为未来临床转化的初步步骤，并在超声扫描上实现了更准确、完整的椎体体积可视化，无需与术前成像模态（如计算机断层扫描）进行配准。这表明，结合单张X射线投影可以缓解超声波的关键局限性，同时保留其作为主要成像模态的优势。代码和数据可在 https://github.com/miruna20/US-X-Complete 获取。

Summary / 总结

This work addresses the limitation of ultrasound in visualizing complete vertebral anatomy by proposing a multi-modal deep learning method that combines ultrasound and X-ray images. The method generates paired training data from 2D lateral vertebral views and 3D partial vertebrae representations. Experimental results show significant improvements in vertebral reconstruction compared to state-of-the-art methods in 3D ultrasound vertebral completion. The approach enables more accurate and complete volumetric lumbar spine visualization on ultrasound scans without the need for registration with preoperative modalities like computed tomography.

该研究提出了一种结合超声和X光图像的多模态深度学习方法，以解决超声在显示完整椎体解剖结构方面的局限性。该方法通过生成来自2D侧位椎体视图和3D部分椎体表示的配对训练数据来实现。实验结果表明，与现有的3D超声椎体完成方法相比，该方法在椎体重建方面取得了显著的改进。该方法能够实现更准确和完整的腰椎体积可视化，无需与术前成像模态如计算机断层扫描进行配准。

Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition

Authors: Xufei Wang, Junqiao Zhao, Siyue Tao, Qiwen Gu, Wonbong Kim, Tiantian Feng

First: 2025-11-19T16:41:30+00:00 · Latest: 2025-11-19T16:41:30+00:00

Comments: 8 pages, 4 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning frameworks for LiDAR place recognition to yield significant and stable performance gains. The code will be available at https://github.com/repo/KDF-plus.

中文标题/摘要

标题：从错误中学习：损失感知记忆增强连续学习在LiDAR地点识别中的应用

LiDAR地点识别在SLAM、机器人导航和自动驾驶中起着关键作用。然而，现有的LiDAR地点识别方法往往难以在不忘记之前学到的知识的情况下适应新环境，这一挑战被称为灾难性遗忘。为了解决这个问题，我们提出了一种新的LiDAR地点识别连续学习框架KDF+，该框架通过损失感知采样策略和回忆增强机制扩展了KDF范式。提出的采样策略通过损失值估计每个样本的学习难度，并根据估计的难度选择样本进行重放。更难的样本，通常包含更多区分性信息，以更高的概率被采样，同时保持数据集的分布覆盖。此外，回忆增强机制在新任务训练期间鼓励记忆样本进一步细化，通过相对于先前任务稍微降低其损失来强化长期知识保留。在多个基准上的广泛实验表明，KDF+始终优于现有的连续学习方法，并且可以无缝集成到最先进的连续学习LiDAR地点识别框架中，以实现显著且稳定的性能提升。代码将在https://github.com/repo/KDF-plus/上提供。

Summary / 总结

The paper addresses the challenge of catastrophic forgetting in LiDAR place recognition by proposing KDF+, a novel continual learning framework. KDF+ incorporates a loss-aware sampling strategy and a rehearsal enhancement mechanism to selectively replay harder samples and refine memory samples, respectively. Experiments show that KDF+ outperforms existing methods and integrates well with state-of-the-art continual learning frameworks for LiDAR place recognition, providing consistent and stable performance gains.

论文提出了一种新颖的持续学习框架KDF+，以解决LiDAR地点识别中的灾难性遗忘问题。KDF+结合了基于损失的采样策略和回放增强机制，选择更具区分性的样本进行回放，并进一步精炼记忆样本以保留长期知识。实验表明，KDF+在多个基准测试中优于现有方法，并且可以无缝集成到最先进的持续学习框架中，提供一致且稳定的性能提升。
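
Loss-aware selection of replay samples can be sketched as weighted sampling without replacement. The code below is an assumption-laden illustration rather than KDF+'s actual rule: mixing loss-proportional weights with a uniform floor is one simple way to favour hard samples while keeping coverage of the whole dataset, and the mixing factor, function names, and seed are hypothetical.

```python
import random

def sampling_weights(losses, uniform_mix=0.2):
    """Mix loss-proportional weights with a uniform floor.

    Assumes uniform_mix > 0 so that every sample keeps a nonzero weight.
    """
    total = sum(losses)
    n = len(losses)
    if total == 0:
        return [1.0 / n] * n
    return [(1 - uniform_mix) * (loss / total) + uniform_mix / n for loss in losses]

def select_for_replay(losses, k, uniform_mix=0.2, seed=0):
    """Draw k distinct sample indices, favouring high-loss samples.

    Uses Efraimidis-Spirakis weighted sampling without replacement:
    each item gets key u ** (1 / w) and the k largest keys are kept.
    """
    rng = random.Random(seed)
    weights = sampling_weights(losses, uniform_mix)
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return sorted(i for _, i in sorted(keys, reverse=True)[:k])
```

With losses `[1.0, 3.0]` and the default mix, the weights come out as 0.3 and 0.7, so the harder sample is more likely to be replayed but the easier one is never excluded.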

Deep Spectral Prior

Authors: Yanqi Cheng, Xuxiang Zhao, Tieyong Zeng, Pietro Lio, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

First: 2025-05-26T12:00:37+00:00 · Latest: 2025-11-19T16:33:22+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce the Deep Spectral Prior (DSP), a new framework for unsupervised image reconstruction that operates entirely in the complex frequency domain. Unlike the Deep Image Prior (DIP), which optimises pixel-level errors and is highly sensitive to overfitting, DSP performs joint learning of amplitude and phase to capture the full spectral structure of images. We derive a rigorous theoretical characterisation of DSP's optimisation dynamics, proving that it follows frequency-dependent descent trajectories that separate informative low-frequency modes from stochastic high-frequency noise. This spectral mode separation explains DSP's self-regularising behaviour and, for the first time, formally establishes the elimination of DIP's major limitation, namely its reliance on manual early stopping. Moreover, DSP induces an implicit projection onto a frequency-consistent manifold, ensuring convergence to stable, physically plausible reconstructions without explicit priors or supervision. Extensive experiments on denoising, inpainting, and deblurring demonstrate that DSP consistently surpasses DIP and other unsupervised baselines, achieving superior fidelity, robustness, and theoretical interpretability within a unified, unsupervised, data-free framework.

中文标题/摘要

标题：深度频谱先验

我们提出了深度频谱先验（DSP），这是一种全新的无监督图像重建框架，完全在复频域中操作。与深度图像先验（DIP）不同，DIP优化像素级误差且极易过拟合，DSP联合学习振幅和相位，以捕捉图像的完整频谱结构。我们推导了DSP优化动力学的严格理论表征，证明它遵循频率依赖的下降轨迹，将信息性的低频模式与随机的高频噪声分离。这种频谱模式分离解释了DSP的自我正则化行为，并首次正式确立了对DIP主要限制（即依赖手动早期停止）的消除。此外，DSP隐式投影到一个频率一致的流形上，确保在没有显式先验或监督的情况下收敛到稳定且物理上合理的重建。在去噪、修补和去模糊方面的广泛实验表明，DSP始终优于DIP和其他无监督基线，在统一的、无监督且无需训练数据的框架内实现了更高的保真度、鲁棒性和理论可解释性。

Summary / 总结

The Deep Spectral Prior (DSP) is a novel framework for unsupervised image reconstruction that operates in the complex frequency domain. Unlike the Deep Image Prior (DIP), which is prone to overfitting, DSP optimizes both amplitude and phase to capture the full spectral structure of images. DSP's optimization dynamics are theoretically characterized, showing that it separates informative low-frequency modes from stochastic high-frequency noise, leading to self-regularizing behavior and stable reconstructions. DSP outperforms DIP and other unsupervised methods in denoising, inpainting, and deblurring, achieving superior fidelity and robustness.

提出了Deep Spectral Prior (DSP) 作为一种新的无监督图像重建框架，该框架在复频域中操作。与Deep Image Prior (DIP) 不同，DSP 同时优化振幅和相位，捕捉图像的完整频谱结构并避免过拟合。DSP 的优化动态被理论化地描述为频率依赖的下降轨迹，能够分离出信息性的低频模式和随机的高频噪声，从而实现自我正则化行为和稳定、物理上合理的重建。在去噪、修复和去模糊等实验中，DSP 显示出优于 DIP 和其他无监督方法的优越保真度和鲁棒性。
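
As a rough illustration of comparing signals by amplitude and phase in the frequency domain rather than by pixel-level error, the sketch below defines a simple spectral loss on 1D sequences. It is a generic construction under stated assumptions, not the DSP objective: the amplitude-weighted phase term, the `phase_weight` parameter, and the function names are hypothetical, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def spectral_loss(pred, target, phase_weight=1.0):
    """Squared amplitude error plus a wrapped phase error, averaged over bins.

    The phase term is weighted by the target amplitude, so frequencies with
    almost no energy (whose phase is ill-defined) contribute very little.
    """
    total = 0.0
    for p, t in zip(dft(pred), dft(target)):
        amp_err = (abs(p) - abs(t)) ** 2
        # 1 - cos(delta) is smooth and invariant to 2*pi phase wrap-around.
        phase_err = 1.0 - math.cos(cmath.phase(p) - cmath.phase(t))
        total += amp_err + phase_weight * abs(t) * phase_err
    return total / len(pred)
```

A circularly shifted copy of a signal has the same amplitude spectrum but a different phase spectrum, so it is penalised only through the phase term; this is the kind of structure an amplitude-only loss would miss.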

What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

Authors: Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach

First: 2025-11-19T16:32:18+00:00 · Latest: 2025-11-19T16:32:18+00:00

Abs · PDF · Code1 · Code2

Abstract

AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.

中文标题/摘要

标题：成为优秀的AI研究代理需要什么？研究创意多样性的作用

AI研究代理有望通过自动化机器学习模型的设计、实现和训练来加速科学进步。然而，该领域仍处于起步阶段，驱动代理轨迹成功或失败的关键因素尚未完全理解。我们研究了创意多样性在代理性能中的作用。首先，我们在MLE-bench这一知名基准上分析不同模型和代理支架的代理轨迹。我们的分析表明，不同的模型和代理支架会产生不同程度的创意多样性，且表现更佳的代理通常具有更高的创意多样性。进一步，我们进行了一项受控实验，修改创意多样性的程度，证明更高的创意多样性会导致更强的性能。最后，我们通过研究超越MLE-bench标准奖牌评分的其他评估指标来加强我们的结果，显示我们的发现仍然适用于其他代理性能指标。

Summary / 总结

This study investigates the role of ideation diversity in the performance of AI research agents. By analyzing agent trajectories on MLE-bench and conducting controlled experiments, the research finds that higher ideation diversity correlates with better performance. Additional evaluation metrics confirm these findings, suggesting that ideation diversity is a key factor in the success of AI research agents.

研究探讨了创意多样性在AI研究代理性能中的作用。通过分析MLE-bench上的代理轨迹和进行受控实验，研究发现更高的创意多样性与更好的性能相关。额外的评估指标也证实了这些发现，表明创意多样性是成功AI研究代理轨迹的关键因素。

Incremental Maintenance of DatalogMTL Materialisations

Authors: Kaiyue Zhao, Dingqi Chen, Shaoyu Wang, Pan Hu

Venue: AAAI 2026 oral

First: 2025-11-15T11:45:19+00:00 · Latest: 2025-11-19T16:30:26+00:00

Comments: Accepted as oral paper at the main track of AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

DatalogMTL extends the classical Datalog language with metric temporal logic (MTL), enabling expressive reasoning over temporal data. While existing reasoning approaches, such as materialisation based and automata based methods, offer soundness and completeness, they lack support for handling efficient dynamic updates, a crucial requirement for real-world applications that involve frequent data updates. In this work, we propose DRedMTL, an incremental reasoning algorithm for DatalogMTL with bounded intervals. Our algorithm builds upon the classical DRed algorithm, which incrementally updates the materialisation of a Datalog program. Unlike a Datalog materialisation which is in essence a finite set of facts, a DatalogMTL materialisation has to be represented as a finite set of facts plus periodic intervals indicating how the full materialisation can be constructed through unfolding. To cope with this, our algorithm is equipped with specifically designed operators to efficiently handle such periodic representations of DatalogMTL materialisations. We have implemented this approach and tested it on several publicly available datasets. Experimental results show that DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude.

中文标题/摘要

标题：DatalogMTL物化的增量维护

DatalogMTL以度量时态逻辑（MTL）扩展了经典的Datalog语言，从而支持对时态数据进行富有表达力的推理。现有的推理方法（如基于物化的方法和基于自动机的方法）虽然具备可靠性和完备性，但不支持高效地处理动态更新，而这正是涉及频繁数据更新的实际应用的关键需求。在本文中，我们提出了DRedMTL，一种面向带有界区间的DatalogMTL的增量推理算法。我们的算法建立在经典的DRed算法之上，后者用于增量更新Datalog程序的物化。Datalog物化本质上是一个有限的事实集合，而DatalogMTL物化则必须表示为一个有限的事实集合加上若干周期性区间，这些区间指明如何通过展开构造出完整的物化。为此，我们的算法配备了专门设计的算子，以高效处理DatalogMTL物化的这种周期性表示。我们实现了该方法，并在多个公开数据集上进行了测试。实验结果表明，DRedMTL通常显著优于重新物化，有时相差数个数量级。

Summary / 总结

This paper addresses the challenge of efficiently handling dynamic updates in DatalogMTL, which extends Datalog with metric temporal logic. The authors propose DRedMTL, an incremental reasoning algorithm that builds upon the classical DRed algorithm. DRedMTL is designed to efficiently manage periodic representations of DatalogMTL materialisations, enabling faster updates compared to full rematerialisation. Experiments on public datasets demonstrate that DRedMTL often outperforms rematerialisation by significant margins, sometimes by orders of magnitude.

论文旨在解决DatalogMTL中动态更新的高效处理问题，DatalogMTL以度量时态逻辑扩展了Datalog。文中提出了DRedMTL增量推理算法，该算法基于经典的DRed算法。DRedMTL能够高效处理DatalogMTL物化的周期性表示，在公开数据集上的实验中通常显著优于重新物化，有时快几个数量级。
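
To make the "overdelete, then rederive" idea behind the classical DRed algorithm concrete, here is a minimal Python sketch for plain Datalog. It is not DRedMTL: it ignores temporal operators and periodic intervals entirely, and it hard-codes one hypothetical program (`path(x,y) :- edge(x,y)` and `path(x,z) :- path(x,y), edge(y,z)`). The function names are assumptions made for illustration.

```python
def close(path, frontier, edges):
    """Extend `path` to a fixpoint of path(x,z) :- path(x,y), edge(y,z),
    propagating only from the facts in `frontier`."""
    path, frontier = set(path), set(frontier)
    while frontier:
        x, y = frontier.pop()
        for (y2, z) in edges:
            if y == y2 and (x, z) not in path:
                path.add((x, z))
                frontier.add((x, z))
    return path


def materialise(edges):
    """Full materialisation from scratch (the rematerialisation baseline)."""
    return close(edges, edges, edges)


def dred_delete(edges, path, removed):
    """Incrementally repair `path` after deleting `removed` edges."""
    old_edges = set(edges)
    removed = set(removed) & old_edges
    new_edges = old_edges - removed

    # Step 1 (overdelete): collect every path fact that has at least one
    # derivation, in the OLD materialisation, using a deleted fact.
    over = set(removed)
    for (x, y) in path:
        for (y2, z) in removed:
            if y == y2:
                over.add((x, z))
    over = close(over, over, old_edges)
    kept = set(path) - over

    # Step 2 (rederive): an overdeleted fact survives if it still has a
    # one-step derivation from surviving facts and remaining edges.
    seeds = set()
    for (x, z) in over:
        if (x, z) in new_edges or any(
            (x, y) in kept for (y, z2) in new_edges if z2 == z
        ):
            seeds.add((x, z))

    # Propagate consequences of the rederived facts only.
    return new_edges, close(kept | seeds, seeds, new_edges)


edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
path = materialise(edges)
new_edges, new_path = dred_delete(edges, path, {("a", "b")})
```

In this example the facts `path(a,c)` and `path(a,d)` are overdeleted because they have a derivation through `edge(a,b)`, and are then rederived through the surviving `edge(a,c)`; facts about `b`, `c` and `d` are never touched.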

Alpha Divergence Losses for Biometric Verification

Authors: Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis

First: 2025-11-17T17:27:28+00:00 · Latest: 2025-11-19T16:29:32+00:00

Comments: Found something suboptimal in results

Abs · PDF · Code1 · Code2

Abstract

Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.

中文标题/摘要

标题：用于生物特征验证的 Alpha 散度损失

人脸和说话人验证的性能主要由 CosFace 和 ArcFace 等基于边距的 softmax 损失驱动。最近引入的 $α$-散度损失函数提供了一个有吸引力的替代方案，特别是因为它们能够诱导稀疏解（当 $α>1$ 时）。然而，将对验证任务至关重要的角度边距整合进去并不简单。我们发现这种整合至少可以通过两种不同的方式实现：通过参考测度（先验概率）或通过 logits（未归一化的对数似然）。在本文中，我们探索了这两种途径，推导出两种新的基于边距的 $α$-散度损失：Q-Margin（边距位于参考测度中）和 A3M（边距位于 logits 中）。我们发现 A3M 中存在一个由受惩罚的 logits 与稀疏性相互作用引起的关键训练不稳定性，并通过一个简单而有效的原型重新初始化策略加以解决。我们的方法在具有挑战性的 IJB-B 和 IJB-C 人脸验证基准上实现了显著的性能提升，在 VoxCeleb 上的说话人验证中也展示了同样强劲的性能。至关重要的是，我们的模型在低误接受率（FAR）下显著优于强基线。这一能力对于银行认证等实际高安全应用至关重要，因为在这些应用中最大限度地减少误认证是首要目标。

Summary / 总结

This paper explores the integration of angular margins into $α$-divergence losses for face and speaker verification, introducing Q-Margin and A3M as novel margin-based $α$-divergence losses. The authors address a training instability in A3M and propose a prototype re-initialization strategy. The methods achieve significant performance gains on IJB-B, IJB-C, and VoxCeleb benchmarks, particularly at low false acceptance rates, which is crucial for high-security applications.

本文探索了将角度边际融入α散度损失以进行面部和语音验证的方法，引入了Q-Margin和A3M作为新型的基于边际的α散度损失。作者解决了A3M中的训练不稳定性问题，并提出了一种原型重新初始化策略来缓解这一问题。该方法在IJB-B、IJB-C和VoxCeleb基准测试中取得了显著的性能提升，特别是在低误接受率下，这对于高安全应用如银行认证至关重要。
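
As background for the "margin in the logits" idea, the following is a minimal sketch of the CosFace-style additive cosine margin that the abstract names as a baseline family. It uses an ordinary softmax cross-entropy, not the paper's $α$-divergence losses, Q-Margin, or A3M. The `scale` and `margin` values are illustrative assumptions, as is the function name.

```python
import math


def cosface_loss(cosines, target, scale=30.0, margin=0.35):
    """Softmax cross-entropy over scaled cosine logits, with `margin`
    subtracted from the target-class cosine before scaling."""
    logits = [
        scale * (c - margin if i == target else c)
        for i, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Cosine similarities between one embedding and three class prototypes.
cosines = [0.8, 0.3, 0.1]
with_margin = cosface_loss(cosines, target=0)
without_margin = cosface_loss(cosines, target=0, margin=0.0)
```

Penalizing the target logit makes the loss larger for the same embedding, so training must push the target cosine higher than it would without a margin.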

Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations

Authors: Laurin Lux, Alexander H. Berger, Maria Romeo Tricas, Richard Rosen, Alaa E. Fayed, Sobha Sivaprasada, Linus Kreitner, Jonas Weidner, Martin J. Menten, Daniel Rueckert, Johannes C. Paetzold

First: 2025-02-23T19:27:47+00:00 · Latest: 2025-11-19T16:24:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.

中文标题/摘要

标题：使用融入生物学知识的异质图表示进行可解释的视网膜疾病预测

可解释性对于增强医学诊断中机器学习模型的信任至关重要。然而，大多数基于神经网络的先进图像分类器不具备可解释性。因此，临床医生经常依赖已知生物标志物进行诊断，尽管基于生物标志物的分类通常不如大型神经网络表现好。本研究提出了一种方法，用于从光学相干断层扫描血管成像（OCTA）图像中进行糖尿病视网膜病变分期，在性能上超越已有的机器学习模型，同时提高了预测的可解释性。该方法基于一种新颖的融入生物学知识的异质图表示，以人类可解释的方式建模视网膜血管段、毛细血管间区域和中心凹无血管区（FAZ）。该图表示使我们能够将糖尿病视网膜病变分期建模为图级分类任务，并使用高效的图神经网络求解。我们将该方法与成熟的基线进行了比较，包括经典的基于生物标志物的分类器、卷积神经网络（CNN）和视觉Transformer。我们的模型在两个数据集上均优于所有基线。关键的是，我们利用融入生物学知识的图提供了前所未有的详细解释。我们的方法在精确定位和识别关键血管或毛细血管间区域方面超越了现有方法。此外，我们还为关键特征提供了信息丰富且人类可解释的归因。我们的工作为眼科临床决策支持工具的发展做出了贡献。

Summary / 总结

This work addresses the need for interpretable machine learning models in medical diagnostics, particularly for diabetic retinopathy staging from OCTA images. It proposes a biology-informed heterogeneous graph representation that models retinal structures in a human-interpretable way, which allows staging to be framed as a graph-level classification task and solved with an efficient graph neural network. The method outperforms established baselines on two datasets and provides detailed, human-interpretable explanations of critical retinal features, enhancing trust in the model for clinical use.

该研究通过提出一种基于生物学信息的异质图表示方法，解决了医学诊断中对可解释机器学习模型的需求，用于从OCTA图像中进行糖尿病视网膜病变分期。该方法使用高效的图神经网络对图表示进行分类，不仅超越了现有基线，还提供了对关键血管或毛细血管区域的详细且易于理解的解释。该模型在两个数据集上实现了优越的性能，同时通过可解释性增强了信任。

CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

Authors: Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao, Xiaobo Lu, Shuo Wang

Venue: AAAI 2026 Oral

First: 2025-11-19T16:12:24+00:00 · Latest: 2025-11-19T16:12:24+00:00

Comments: Accepted by AAAI 2026 (Oral)

Abs · PDF · Code1 · Code2

Abstract

3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.

中文标题/摘要

标题：CompTrack：基于信息瓶颈的低秩动态令牌压缩点云跟踪

基于LiDAR点云的3D单目标跟踪（SOT）是计算机视觉和自动驾驶中的关键任务。尽管已经取得了巨大成功，但点云的固有稀疏性引入了双重冗余挑战，限制了现有跟踪器：（1）背景噪声引起的巨大空间冗余影响准确性，（2）前景内的信息冗余妨碍了效率。为了解决这些问题，我们提出了CompTrack，这是一种新颖的端到端框架，系统地消除了点云中的两种冗余。首先，CompTrack引入了空间前景预测器（SFP）模块，基于信息熵过滤掉无关的背景噪声，解决空间冗余问题。随后，其核心是基于信息瓶颈的动态令牌压缩（IB-DTC）模块，消除前景内的信息冗余。该模块以低秩近似为理论基础，利用在线SVD分析自适应地将冗余前景压缩成一个紧凑且信息量高的代理令牌集。在KITTI、nuScenes和Waymo数据集上的大量实验表明，CompTrack以更高的效率实现了顶级的跟踪性能，在单个RTX 3090 GPU上实时运行速度达到90 FPS。

Summary / 总结

CompTrack is an end-to-end framework designed to address the dual-redundancy challenge in 3D single object tracking from LiDAR point clouds. It uses a Spatial Foreground Predictor (SFP) to filter out background noise and an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module to compress foreground information, reducing informational redundancy. Experiments show that CompTrack outperforms existing methods in terms of accuracy and efficiency, achieving real-time performance at 90 FPS on a single RTX 3090 GPU.

CompTrack 是一个端到端框架，旨在解决 LiDAR 点云中 3D 单目标跟踪 (SOT) 的双重冗余问题。它使用 Spatial Foreground Predictor (SFP) 模块过滤背景噪声，并使用 Information Bottleneck-guided Dynamic Token Compression (IB-DTC) 模块压缩前景信息，减少信息冗余。实验表明，CompTrack 在精度和效率上均优于现有方法，在单个 RTX 3090 GPU 上可达到 90 FPS 的实时性能。
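
The low-rank compression idea can be sketched with a truncated SVD in NumPy. This is not the IB-DTC module: the abstract does not say how the rank is chosen, so the spectral-energy threshold used here is an assumption, as are the function name `compress_tokens` and the toy data.

```python
import numpy as np


def compress_tokens(tokens, energy=0.999):
    """Compress an (N, d) token matrix into k proxy tokens, where k is the
    smallest rank retaining a fraction `energy` of the squared singular values."""
    u, s, vt = np.linalg.svd(tokens, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    proxies = s[:k, None] * vt[:k]  # (k, d) proxy tokens
    coeffs = u[:, :k]               # (N, k) mixing weights per original token
    return proxies, coeffs


# Toy foreground: 50 tokens of dimension 16 that lie in a rank-2 subspace.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 16))

proxies, coeffs = compress_tokens(tokens)
approx = coeffs @ proxies
rel_err = np.linalg.norm(tokens - approx) / np.linalg.norm(tokens)
```

Because the rank adapts to the spectrum, highly redundant inputs collapse to very few proxy tokens while more varied inputs keep more of them.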

AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Authors: Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar

First: 2025-11-19T16:09:38+00:00 · Latest: 2025-11-19T16:09:38+00:00

Comments: Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)

Abs · PDF · Code1 · Code2

Abstract

With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.

中文标题/摘要

标题：AVATAAR：通过时间自适应对齐与推理的代理视频回答

随着视频内容的日益普及，有效理解和回答长视频的问题已成为众多应用中的关键需求。尽管大型视觉语言模型（LVLM）提高了性能，但在处理需要全面理解和详细分析的复杂查询时仍面临挑战。为克服这些障碍，我们提出了AVATAAR，这是一种模块化且可解释的框架，结合了全局和局部视频上下文，以及一个预检索思考代理和重思模块。AVATAAR创建了一个持久的全局摘要，并在重思模块和预检索思考代理之间建立了反馈循环，使系统能够根据部分答案进行策略调整，并模仿人类的迭代推理。在CinePile基准测试中，AVATAAR在时间推理、技术查询、主题问题和叙事理解方面分别取得了+5.6%、+5%、+8%和+8.2%的相对改进。我们的实验表明，每个模块都对整体性能做出了积极贡献，反馈循环对于适应性至关重要。这些发现突显了AVATAAR在增强视频理解能力方面的有效性。最终，AVATAAR提供了一种可扩展的长视频问答解决方案，结合了准确度、可解释性和可扩展性。

Summary / 总结

AVATAAR is a modular framework designed to improve the understanding and answering of questions about long-form videos. It combines global and local video context with a Pre Retrieval Thinking Agent and a Rethink Module to create a persistent summary and refine retrieval strategies iteratively. On the CinePile benchmark, AVATAAR shows significant improvements, with relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. The feedback loop between modules is crucial for adaptability and performance enhancement.

AVATAAR 是一个模块化框架，旨在提高对长视频的理解和回答问题的能力。它结合了全局和局部视频上下文以及预检索思考代理和重思模块，创建持久的摘要并迭代地细化检索策略。在 CinePile 基准测试中，AVATAAR 显示出显著的改进，在时间推理、技术查询、主题问题和叙事理解上的相对增益分别为 +5.6%、+5%、+8% 和 +8.2%。模块之间的反馈循环对于适应性和性能提升至关重要。

HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning

Authors: Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao, Tianyong Hao

Venue: AAAI

First: 2025-11-19T16:06:06+00:00 · Latest: 2025-11-19T16:06:06+00:00

Comments: Accepted by AAAI-2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.

中文标题/摘要

标题：HSKBenchmark：通过课程调优在大型语言模型中建模和基准测试汉语二语习得

语言习得对于揭示人类语言智能的本质至关重要，并且最近成为提高大型语言模型（LLMs）可解释性的有前途的视角。然而，开展需要控制人类学习者语言输入的实验在伦理和实践上都是不可行的。这给语言习得建模的可验证性和可扩展性带来了挑战，特别是在汉语二语习得（SLA）方面。尽管LLMs提供了一种可控和可重复的替代方案，但目前仍缺乏支持分阶段建模和评估的系统基准。在本文中，我们提出了HSKBenchmark，这是第一个用于LLMs汉语SLA分阶段建模和写作评估的基准。它涵盖了HSK等级3至6，并包括含676万个token的真实教科书、16K个合成指令样本、30个测试主题以及一个有语言学依据的评估系统。为了模拟人类学习轨迹，我们引入了一种课程调优框架，从初级到高级训练模型。我们创建了一个评估系统来检查基于级别的语法覆盖率、写作错误、词汇和句法复杂性以及整体评分。我们还构建了HSKAgent，它在10K篇学习者作文上进行了微调。广泛的实验结果表明，HSKBenchmark不仅有效地建模了汉语SLA，还可作为LLMs动态写作评估的可靠基准。我们微调后的LLMs在写作性能上与高级人类学习者相当，并表现出类似人类的习得特征。HSKBenchmark、HSKAgent和检查点作为基础工具和资源，有望为未来语言习得建模和LLMs可解释性研究铺平道路。代码和数据可在 https://github.com/CharlesYang030/HSKB 公开获取。

Summary / 总结

HSKBenchmark is a benchmark for modeling and assessing Chinese second language acquisition in large language models (LLMs) through curriculum tuning. It includes authentic textbooks, synthetic instruction samples, and a linguistically grounded evaluation system covering HSK levels 3 to 6. The evaluation system assesses grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. Experimental results show that fine-tuned LLMs perform at the level of advanced human learners and exhibit human-like acquisition characteristics, making HSKBenchmark a reliable benchmark for dynamic writing assessment in LLMs.

HSKBenchmark 是一个通过课程调优对大型语言模型（LLM）中的汉语二语习得进行建模和评估的基准。它包括真实的教科书、合成指令样本和一个有语言学依据的评估系统，涵盖HSK等级3到6。该基准能够有效地建模汉语SLA，并可作为动态写作评估的可靠工具，其中微调后的LLM表现出与高级人类学习者相当的写作性能。HSKBenchmark、HSKAgent和检查点可供未来研究语言习得和LLM可解释性使用。