arXiv 论文速递

Graph Representation Learning with Diffusion Generative Models

Authors: Daniel Wesego

First: 2025-01-22T07:12:10+00:00 · Latest: 2025-10-22T17:58:33+00:00

Abstract

Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation from the combination of the encoder's output and the decoder's first time step hidden embedding. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models

中文标题/摘要

标题：图结构数据的扩散生成模型表示学习

扩散模型因其能够准确逼近复杂的数据分布而在各种数据模态中确立了作为最先进的生成模型的地位，包括图像和视频。与传统的生成方法如VAEs和GANs不同，扩散模型采用逐步去噪的过程，在多个迭代步骤中将噪声转化为有意义的数据。这一渐进的方法增强了它们的表达能力和生成质量。此外，扩散模型还被证明能够从数据中提取有意义的表示，同时学习生成样本。尽管取得了成功，但将扩散模型应用于图结构数据的应用仍然相对未被探索，主要是由于图的离散性质，这需要与其它领域使用的连续方法不同的离散扩散过程。在本文中，我们利用扩散模型的表示能力来学习图数据的有意义嵌入。通过在自编码器框架中训练一个离散的扩散模型，我们能够实现有效的自编码和针对图结构数据的独特特征进行的表示学习。我们从编码器输出和解码器的第一个时间步隐藏嵌入的组合中提取表示。我们的方法展示了离散扩散模型在图表示学习中的潜力。代码可以在https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models找到

Summary / 总结

This paper applies diffusion models to graph-structured data, using them for representation learning as well as generation. The authors train a discrete diffusion model within an autoencoder framework and take the representation from the combination of the encoder's output and the decoder's first-time-step hidden embedding. The results demonstrate the potential of discrete diffusion models for graph representation learning.

本文探讨了将扩散模型应用于图结构数据的表示学习。作者在自编码器框架中训练离散扩散模型，并将编码器输出与解码器第一个时间步的隐藏嵌入相结合作为图表示。结果展示了离散扩散模型用于图表示学习的潜力。

Learning Reward Machines from Partially Observed Policies

Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay

First: 2025-02-06T03:48:25+00:00 · Latest: 2025-10-22T17:55:33+00:00

Abstract

Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
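To make the notion of a reward machine concrete, here is a minimal Python sketch of a finite-state machine whose transitions fire on atomic propositions and emit rewards. The class name `RewardMachine`, the state names, and the "key then door" task are hypothetical illustrations and do not come from the paper; the self-loop default for unlisted transitions is also an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """A finite-state machine whose transitions fire on atomic propositions."""
    initial: str
    # (machine_state, proposition) -> (next_machine_state, reward)
    delta: dict = field(default_factory=dict)

    def run(self, propositions):
        """Return the reward emitted at each step of a proposition sequence."""
        state, rewards = self.initial, []
        for p in propositions:
            # Assumption of this sketch: unlisted pairs self-loop with zero reward.
            state, r = self.delta.get((state, p), (state, 0.0))
            rewards.append(r)
        return rewards


# Hypothetical task: observe "key" and then "door"; only the second event pays.
rm = RewardMachine(
    initial="u0",
    delta={("u0", "key"): ("u1", 0.0), ("u1", "door"): ("u2", 1.0)},
)
rewards_in_order = rm.run(["door", "key", "door"])
rewards_no_key = rm.run(["door", "door"])
```

The same proposition ("door") yields different rewards depending on the history, which is why the reward cannot be read off the MDP state alone and why the policy is indexed by sequences of atomic propositions.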

中文标题/摘要

标题：从部分观察到的策略学习奖励机器

逆强化学习是通过专家的最优策略或演示推断奖励函数的问题。在本文中，假设奖励以奖励机器的形式表达，其转换依赖于与马尔可夫决策过程（MDP）状态相关的原子命题。我们的目标是使用有限信息识别真正的奖励机器。为此，我们首先引入前缀树策略的概念，该概念将动作分布与MDP的每个状态和每个可达到的原子命题有限序列关联起来。然后，我们刻画了给定前缀树策略可以识别的奖励机器等价类。最后，我们提出了一种基于SAT的算法，该算法利用从前缀树策略中提取的信息求解奖励机器。证明了如果前缀树策略已知到足够（但有限）的深度，我们的算法可以恢复等价类中的精确奖励机器。此足够深度是MDP状态数和（奖励机器状态数的）上界函数。这些结果进一步扩展到我们只能访问最优策略演示的情况。使用离散网格和块世界、连续状态空间的机械臂以及老鼠实验中的实际数据，展示了该方法的有效性和普适性。

Summary / 总结

This paper addresses the inverse reinforcement learning problem by inferring a reward function from an optimal policy or expert demonstrations. It introduces the concept of a prefix tree policy and characterizes the equivalence class of reward machines that can be identified given this policy. A SAT-based algorithm is proposed to solve for the reward machine, and it is proven that the algorithm can recover the exact reward machine up to a certain equivalence class if the prefix tree policy is known to a sufficient depth. The approach is demonstrated to be effective and general through various examples, including discrete grid worlds, robotic arms, and real data from mouse experiments.

本文研究从最优策略或专家演示中推断奖励函数的逆强化学习问题。引入了前缀树策略的概念，并刻画了给定该策略可以识别的奖励机器等价类。提出了一种基于SAT的算法来求解奖励机器，并证明当前缀树策略已知到足够（但有限）的深度时，算法可以在等价类意义下精确恢复奖励机器。该方法通过离散和连续状态空间的例子以及小鼠实验的实际数据得到了验证，展示了其有效性和普适性。

Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li

First: 2025-10-20T12:54:32+00:00 · Latest: 2025-10-22T17:54:43+00:00

Abstract

We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
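The boundary versus intermediate scoring rule described above can be sketched as a prompt-assembly function. This is only an illustration under stated assumptions: the function name `build_scoring_prompt`, the prompt wording, and the 0 to 10 scale are hypothetical and are not taken from the paper.

```python
def build_scoring_prompt(descriptions, summaries, i, rubric):
    """Assemble the text given to the language model for scene i.

    Boundary scenes (first and last) use only their own description;
    intermediate scenes also receive summaries of the adjacent scenes.
    """
    parts = [f"Rubric: {rubric}", f"Scene {i}: {descriptions[i]}"]
    is_boundary = i == 0 or i == len(descriptions) - 1
    if not is_boundary:
        parts.append(f"Previous scene summary: {summaries[i - 1]}")
        parts.append(f"Next scene summary: {summaries[i + 1]}")
    # Hypothetical scale; the abstract does not state the scoring range.
    parts.append("Score this scene from 0 to 10.")
    return "\n".join(parts)


descriptions = [
    "Title card.",
    "A chef chops onions.",
    "The chef plates the dish.",
    "Credits roll.",
]
summaries = ["title", "chopping", "plating", "credits"]
first_prompt = build_scoring_prompt(descriptions, summaries, 0, "thematic relevance")
middle_prompt = build_scoring_prompt(descriptions, summaries, 1, "thematic relevance")
```

Because the neighbor context is injected through the prompt only, no model parameters change, which matches the training-free setting the abstract describes.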

中文标题/摘要

标题：基于上下文感知伪标签评分的零样本视频摘要

我们提出了一种基于评分准则、伪标签和提示驱动的零样本视频摘要框架，将大型语言模型与结构化语义推理相结合。一小部分人类注释被转换为高置信度的伪标签，并组织成数据集自适应的评分准则，定义清晰的评估维度，如主题相关性、动作细节和叙事进展。在推理过程中，边界场景，包括开头和结尾部分，根据自身的描述独立评分，而中间场景则结合相邻段落的简洁摘要来评估叙事连贯性和冗余性。这种设计使语言模型能够在不进行任何参数调整的情况下平衡局部显著性和全局一致性。在三个基准测试中，所提出的方法在SumMe上的F1得分为57.58，在TVSum上的得分为63.05，在QFVS上的得分为53.79，分别超越零样本基线0.85、0.84和0.37。这些结果表明，基于评分准则的伪标签结合上下文提示有效地稳定了基于LLM的评分，并为通用和查询导向的视频摘要建立了一种通用、可解释且无需训练的范式。

Summary / 总结

The paper introduces a rubric-guided pseudo-labeled and prompt-driven framework for zero-shot video summarization, leveraging large language models and structured semantic reasoning. It converts a small set of human annotations into high-confidence pseudo labels and organizes them into rubrics for clear evaluation. During inference, the model scores boundary scenes independently and intermediate scenes based on summaries of adjacent segments, balancing local salience and global coherence. The method achieves competitive results across three benchmarks, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37 respectively.

论文提出了一种由评分准则引导、基于伪标签和提示驱动的零样本视频摘要框架，结合了大型语言模型和结构化语义推理。它将一小部分人工标注转换为高置信度的伪标签，并组织成数据集自适应的评分准则，定义主题相关性、动作细节和叙事进展等评估维度。在推理过程中，模型对边界场景独立评分，对中间场景则结合相邻片段的摘要评分，以平衡局部显著性和全局连贯性。该方法在SumMe、TVSum和QFVS上的F1分数分别为57.58、63.05和53.79，分别超越零样本基线0.85、0.84和0.37。

Is This Tracker On? A Benchmark Protocol for Dynamic Tracking

Authors: Ilona Demler, Saumya Chauhan, Georgia Gkioxari

First: 2025-10-22T17:53:56+00:00 · Latest: 2025-10-22T17:53:56+00:00

Comments: Project page: https://glab-caltech.github.io/ITTO/

Abstract

We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.

中文标题/摘要

标题：这个追踪器开启了吗？一种动态追踪基准协议

我们引入了ITTO，这是一个新的具有挑战性的基准套件，用于评估和诊断点追踪方法的能力和局限性。我们的视频来源于现有的数据集和第一人称的现实世界记录，并通过多阶段管道收集了高质量的人工注释。ITTO捕捉了现实场景中的运动复杂性、遮挡模式和对象多样性——这些因素在当前的基准中几乎不存在。我们在ITTO上对最先进的追踪方法进行了严格的分析，按关键的运动复杂性轴分解了性能。我们的研究结果表明，现有的追踪器在这些挑战面前表现不佳，尤其是在遮挡后重新识别点方面，这突显了关键的失败模式。这些结果指出了需要针对现实世界动态的新建模方法。我们设想ITTO作为推进点追踪和指导开发更稳健追踪算法的基础测试平台。

Summary / 总结

ITTO is a new benchmark suite for evaluating point tracking methods, designed to capture the complexity and challenges of real-world tracking scenarios. It includes high-quality human annotations from diverse sources and rigorous analysis of state-of-the-art trackers. The study reveals that current trackers struggle with re-identification after occlusions, indicating a need for new modeling approaches. This benchmark aims to serve as a foundation for advancing point tracking algorithms.

ITTO 是一个新的基准套件，用于评估和诊断点跟踪方法的能力和局限性。它包含来自多种来源的具有高质量注释的挑战性视频，捕捉到现实世界中的复杂性，如运动、遮挡和物体多样性。研究发现，当前的跟踪器在遮挡后重新识别方面存在困难，表明需要新的建模方法。该基准旨在作为推进点跟踪算法的测试平台。

olmOCR 2: Unit Test Rewards for Document OCR

Authors: Jake Poznanski, Luca Soldaini, Kyle Lo

First: 2025-10-22T17:53:02+00:00 · Latest: 2025-10-22T17:53:02+00:00

Comments: https://olmocr.allen.ai/

Abstract

We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
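As a rough illustration of using binary unit tests as a verifiable reward, the sketch below scores an OCR output by the fraction of tests it passes. The function names, the three example tests, and the choice to average the pass/fail results are assumptions made for this example; the abstract states only that the rewards are a diverse set of binary unit tests.

```python
def appears_before(text, first, second):
    """True only if both strings occur and `first` precedes `second`."""
    return first in text and second in text and text.index(first) < text.index(second)


def unit_test_reward(ocr_text, tests):
    """Fraction of binary unit tests that the OCR output passes."""
    results = [bool(test(ocr_text)) for test in tests]
    return sum(results) / len(results)


# Hypothetical tests for one synthetic page.
tests = [
    lambda s: "Total: 42" in s,                                 # expected text is present
    lambda s: appears_before(s, "Introduction", "Conclusion"),  # reading order is natural
    lambda s: "Page 3 of 10" not in s,                          # page footer is dropped
]

good_reward = unit_test_reward("Introduction\nTotal: 42\nConclusion", tests)
bad_reward = unit_test_reward("Conclusion\nPage 3 of 10\nIntroduction", tests)
```

Each test is cheap to evaluate and has an unambiguous outcome, which is what makes this kind of signal usable as a reward during RL training.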

中文标题/摘要

标题：olmOCR 2：文档OCR单元测试奖励

我们介绍了olmOCR 2，这是我们家族中最新的一款强大的OCR系统，用于将数字化印刷文档，如PDF，转换为干净、自然排序的纯文本。olmOCR 2由olmOCR-2-7B-1025驱动，这是一种专门的7B视觉语言模型（VLM），使用可验证奖励（RLVR）的强化学习进行训练，其中我们的奖励是一组多样化的二元单元测试。为了扩展单元测试的创建，我们开发了一条生成具有多样性和挑战性布局的合成文档的管道，已知的地面真实HTML源代码和提取的测试用例。我们展示了在这些测试用例上进行RL训练的结果，在olmOCR-Bench，我们的英文OCR基准测试中达到了最先进的性能，与之前的版本相比，在数学公式转换、表格解析和多列布局方面取得了最大的改进。我们以宽松的开源许可发布了我们的模型、数据和代码。

Summary / 总结

olmOCR 2 is an OCR system that converts digitized print documents, such as PDFs, into clean, naturally ordered plain text. It is powered by a 7B vision language model trained using reinforcement learning with verifiable rewards (RLVR), where the rewards are a diverse set of binary unit tests. To scale unit test creation, the authors generate synthetic documents with known ground-truth HTML source and extracted test cases. The model achieves state-of-the-art performance on olmOCR-Bench, with the largest improvements over previous versions in math formula conversion, table parsing, and multi-column layouts. The model, data, and code are released under permissive open licenses.

olmOCR 2 是一个先进的 OCR 系统，用于将数字化印刷文档转换为干净的文本。它使用了一个 7B 视觉语言模型，并通过强化学习和可验证奖励进行训练，这些奖励来自多样化的单元测试。该系统生成合成文档以促进单元测试的创建，并在 olmOCR-Bench 基准测试中表现出色，特别是在数学公式转换、表格解析和多列布局方面，优于之前的版本。模型、数据和代码均在开源许可下发布。

Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

Venue: NeurIPS 2025

First: 2025-05-30T09:09:33+00:00 · Latest: 2025-10-22T17:51:21+00:00

Comments: Accepted by NeurIPS 2025

Abstract

Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits APIs are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. Our findings suggest that unlearning may, paradoxically, increase the risk of privacy leakage during real-world deployments. In light of this, we advocate that evaluations of unlearning methods consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.

中文标题/摘要

标题：已“遗忘”却未忘记：LLM精确遗忘后的数据提取

大型语言模型通常在从网络收集的数据集上进行训练，这些数据集可能无意中包含有害或敏感的个人信息。为应对日益增长的隐私担忧，研究者提出了遗忘（unlearning）方法，以从已训练的模型中移除特定数据的影响。其中，精确遗忘（即在不包含目标数据的情况下从头重新训练模型）被广泛视为减轻部署中隐私风险的黄金标准。在本文中，我们在一个实际部署环境中重新审视这一假设，该环境同时暴露了遗忘前后的logits API，例如在开放权重场景中。针对这一环境，我们引入了一种新颖的数据提取攻击，利用遗忘前模型的信号来引导遗忘后的模型，揭示出反映被移除数据分布的模式。结合模型引导与词元过滤策略，我们的攻击在MUSE、TOFU和WMDP等常见基准上显著提高了提取成功率，在某些情况下性能翻倍。此外，我们还在模拟的医疗诊断数据集上展示了攻击的有效性，以突出精确遗忘带来的实际隐私风险。我们的发现表明，遗忘可能反而会在实际部署中增加隐私泄露的风险；有鉴于此，我们主张在评估遗忘方法时考虑更广泛的威胁模型，不仅涵盖遗忘后的模型，还要考虑攻击者对先前检查点的访问。代码可在以下地址公开获取：https://github.com/Nicholas0228/unlearned_data_extraction_llm。

Summary / 总结

This paper revisits the effectiveness of exact unlearning in mitigating privacy risks in large language models. It introduces a novel data extraction attack that leverages signals from the pre-unlearning model to uncover patterns of removed data, significantly improving extraction success rates. The attack demonstrates real-world privacy risks in exact unlearning, suggesting that unlearning may paradoxically increase privacy leakage during deployment. The authors advocate for evaluating unlearning methods under broader threat models.

该论文重新审视了精确遗忘在减轻大型语言模型隐私风险方面的有效性。它引入了一种新的数据提取攻击，利用预遗忘模型的信号来揭示被移除数据的模式，提高了提取成功率。该攻击在常见基准和模拟的医疗诊断数据集上进行了测试，突显了实际部署中潜在的隐私泄露风险。

Hubble: a Model Suite to Advance the Study of LLM Memorization

Authors: Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia

First: 2025-10-22T17:48:23+00:00 · Latest: 2025-10-22T17:48:23+00:00

Abstract

We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.

中文标题/摘要

标题：哈勃：用于研究大语言模型记忆的模型套件

我们介绍了哈勃，一个完全开源的大语言模型（LLM）套件，用于科学研究LLM记忆。哈勃模型包括标准和扰动版本：标准模型在大规模英文语料上进行预训练，而扰动模型以相同方式训练，但通过受控插入文本（例如，书籍段落、传记和测试集）来模拟关键记忆风险。我们的核心发布包括8个模型——标准和扰动模型，参数量为1B或8B，分别在100B或500B令牌上进行预训练，表明记忆风险由敏感数据的频率相对于训练语料库的大小决定（即，在较小语料库中出现一次的密码比在较大语料库中出现的同一密码记忆得更好）。我们的发布还包括6个在不同预训练阶段插入文本的扰动模型，表明没有持续暴露的敏感数据可以被遗忘。这些发现建议了两种应对记忆风险的最佳实践：通过增加训练语料库的大小来稀释敏感数据，以及使敏感数据在训练中更早出现。除了这些一般性的实证发现外，哈勃还为广泛的记忆研究提供了广泛的可能性；例如，分析传记可以揭示不同类型私人信息被记忆的难易程度。我们还证明了哈勃中的随机插入使其成为成员推理和机器遗忘的理想测试平台，并邀请社区进一步探索、基准测试和建立我们的工作。

Summary / 总结

Hubble is a suite of open-source large language models designed to study LLM memorization. It includes standard and perturbed models, with the latter containing controlled insertion of text to emulate memorization risks. Key findings show that memorization risks depend on the frequency of sensitive data relative to the size of the training corpus, and that sensitive data can be forgotten if not continuously exposed during training. This suggests increasing corpus size and ordering sensitive data to appear early in training as best practices. Hubble also enables research on biographies and is useful for testing membership inference and machine unlearning.

Hubble 是一套开源的大语言模型，用于研究 LLM 的记忆现象。它包含不同参数规模和训练数据量的标准模型与扰动模型，表明记忆风险取决于敏感数据的出现频率相对于训练语料库规模的比例，且在训练中未持续出现的敏感数据可能被遗忘。据此提出两条最佳实践：通过增大语料库规模来稀释敏感数据，并将敏感数据安排在训练早期出现。此外，Hubble 还支持各种记忆研究，并可作为成员推理和机器遗忘的理想测试平台。

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Authors: Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan

First: 2025-10-22T17:43:15+00:00 · Latest: 2025-10-22T17:43:15+00:00

Abstract

Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.

中文标题/摘要

标题：Pico-Banana-400K：一种用于文本指导图像编辑的大规模数据集

近期多模态模型在文本指导图像编辑方面取得了显著进展，系统如GPT-4o和Nano-Banana设定了新的基准。然而，研究社区的进步仍受限于缺乏大规模、高质量且开放获取的真实图像数据集。我们介绍了Pico-Banana-400K，这是一个基于指令的图像编辑综合400K图像数据集。该数据集通过利用Nano-Banana从OpenImages集合中的真实照片生成多样化的编辑对。与之前的合成数据集不同，Pico-Banana-400K采用系统的方法来保证质量与多样性。我们采用细粒度的图像编辑分类法，确保全面覆盖编辑类型，同时通过基于MLLM的质量评分和仔细的筛选保持内容的精确保留和指令的忠实性。除了单次编辑，Pico-Banana-400K还支持复杂编辑场景的研究。数据集包括三个专门的子集：（1）一个包含72K个示例的多轮编辑集合，用于研究连续修改中的序列编辑、推理和规划；（2）一个包含56K个示例的偏好子集，用于对齐研究和奖励模型训练；（3）配对的长短编辑指令，用于开发指令重写和总结能力。通过提供这个大规模、高质量且任务丰富的资源，Pico-Banana-400K为训练和基准测试下一代文本指导图像编辑模型奠定了坚实的基础。

Summary / 总结

Pico-Banana-400K is a 400K-image dataset for instruction-based image editing, built by using Nano-Banana to generate edit pairs from real photographs in the OpenImages collection. It differs from previous synthetic datasets in its systematic treatment of quality and diversity, using a fine-grained editing taxonomy and MLLM-based quality scoring. The dataset includes three specialized subsets: a 72K-example multi-turn collection, a 56K-example preference subset for alignment and reward model training, and paired long-short editing instructions for instruction rewriting and summarization.

Pico-Banana-400K 是一个包含 40 万张图像的大型指令式图像编辑数据集，利用 Nano-Banana 基于 OpenImages 中的真实照片生成编辑对，并通过细粒度的编辑分类体系和基于 MLLM 的质量评分确保质量和多样性。该数据集包含三个专门的子集，分别用于多轮编辑（72K）、偏好对齐（56K）以及长短指令配对，以支持复杂编辑场景的研究，并为文本引导图像编辑模型的训练和基准测试提供基础。

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Authors: Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia

First: 2025-10-22T17:41:30+00:00 · Latest: 2025-10-22T17:41:30+00:00

Comments: Code: https://github.com/dvlab-research/Scaf-GRPO

Abstract

Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
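The advantage collapse mentioned above follows directly from how GRPO standardizes rewards within a group of sampled responses. The sketch below uses the common formulation, advantage = (reward - group mean) / group standard deviation; the small `eps` term and the function name are assumptions of this example, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (an assumption of this sketch) avoids division by zero when all rewards match.
    return [(r - mean) / (std + eps) for r in rewards]


# One of four sampled solutions is correct: the group carries a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every sampled solution fails: all advantages are zero, so the problem
# contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When a problem is hard enough that every rollout scores zero, the second case applies at every training step, which is the stall that the tiered in-prompt hints are meant to break.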

中文标题/摘要

标题：Scaf-GRPO：支撑结构分组相对策略优化以增强LLM推理能力

可验证奖励的强化学习已成为增强大型语言模型（LLMs）复杂推理能力的强大技术。然而，这些方法从根本上受到“学习悬崖”现象的限制：当面对超出当前能力的问题时，模型会一致失败，产生持续的零奖励信号。在如GRPO等策略优化算法中，这将优势计算归零，使这些困难的问题对学习梯度变得不可见，从而阻碍进步。为克服这一问题，我们引入了Scaf-GRPO（分组相对策略优化的支撑结构），这是一种渐进的训练框架，仅在模型独立学习停滞时提供最小的指导。该框架首先诊断学习停滞，然后通过注入分层的提示干预，从抽象概念到具体步骤，使模型能够自行构建有效的解决方案。在具有挑战性的数学基准测试上的广泛实验表明，Scaf-GRPO的有效性，使Qwen2.5-Math-7B模型在AIME24基准测试上的pass@1得分相对于vanilla GRPO基线提高了44.3%。这一结果表明，我们的框架提供了一种稳健且有效的方法，以解锁模型解决之前超出其能力的问题的能力，这是向扩展LLM自主推理前沿的关键一步。

Summary / 总结

Scaf-GRPO is a progressive training framework designed to enhance the reasoning abilities of Large Language Models (LLMs) by providing minimal guidance only when the model's independent learning has plateaued. It uses tiered in-prompt hints, ranging from abstract concepts to concrete steps, to help the model construct valid solutions, overcoming the "learning cliff" phenomenon. On the AIME24 benchmark, Scaf-GRPO improved the Qwen2.5-Math-7B model's pass@1 score by a relative 44.3% over a vanilla GRPO baseline, demonstrating its effectiveness in enabling models to solve problems previously beyond their reach.

Scaf-GRPO 是一种渐进式训练框架，旨在通过仅在模型独立学习停滞时提供少量指导来增强大型语言模型（LLM）的推理能力。它使用从抽象概念到具体步骤的分层提示来帮助模型构建有效的解决方案，克服了“学习悬崖”现象。实验表明，与 vanilla GRPO 基线相比，Scaf-GRPO 将 Qwen2.5-Math-7B 模型在 AIME24 基准上的 pass@1 分数相对提高了 44.3%，证明了其在使模型能够解决此前超出其能力的问题方面的有效性。

Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models

Authors: Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li

First: 2025-10-22T17:38:35+00:00 · Latest: 2025-10-22T17:38:35+00:00

Abstract

Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.

中文标题/摘要

标题：基于类别感知原型学习与负对比的视觉-语言模型测试时自适应

视觉-语言模型（VLMs）通过大规模图像-文本预训练展示了令人印象深刻的零样本泛化能力，但在部署分布与训练分布不一致时，其性能会下降。为解决这一问题，测试时自适应（TTA）方法使用未标记的目标数据更新模型。然而，现有方法往往忽视了两个关键挑战：长尾分布中的原型退化和语义相似类别之间的混淆。为应对这些问题，我们提出了类别感知原型学习与负对比（Class-Aware Prototype Learning with Negative Contrast，CPL-NC），这是一种专为VLMs设计的轻量级TTA框架，旨在增强分布偏移下的泛化能力。CPL-NC引入了一个类别感知原型缓存模块，根据测试时的频率和激活历史动态调整每个类别的容量，并通过针对不活跃类别的再生机制保留稀有类别的知识。此外，负对比学习机制识别并约束困难的视觉-文本负样本，以提高类别可分性。该框架采用非对称优化，仅细化文本原型，同时锚定稳定的视觉特征。在15个基准测试上的实验表明，CPL-NC在ResNet-50和ViT-B/16两个骨干网络上均一致地优于先前的TTA方法。

Summary / 总结

The research aims to improve the zero-shot generalization of Vision-Language Models (VLMs) by addressing prototype degradation and class confusion during test-time adaptation. The proposed CPL-NC framework introduces a Class-Aware Prototype Cache Module and a Negative Contrastive Learning Mechanism to dynamically adjust prototype capacity and improve class separability. Experiments on 15 benchmarks demonstrate that CPL-NC outperforms existing TTA methods across different backbone architectures.

研究旨在通过解决原型退化和类别混淆问题来提高Vision-Language Models (VLMs)的零样本泛化能力。提出的CPL-NC框架引入了Class-Aware Prototype Cache模块和Negative Contrastive Learning机制，以动态调整原型容量并提高类别可分性。实验结果显示，CPL-NC在不同骨干网络架构上均优于现有TTA方法。

OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Authors: Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang, Wen Li, Lixin Duan, Qiang Xu

First: 2025-10-22T17:25:33+00:00 · Latest: 2025-10-22T17:25:33+00:00

Abstract

This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.

中文标题/摘要

标题：OmniMotion-X：多功能全身运动生成框架

本文介绍了OmniMotion-X，这是一种多功能的多模态框架，用于全身人体运动生成，利用统一的序列到序列方式的自回归扩散变换器。OmniMotion-X 有效地支持多种多模态任务，包括文本到运动、音乐到舞蹈、语音到手势，以及全局时空控制场景（例如运动预测、中间帧生成、完成和关节/轨迹引导合成），以及这些任务的灵活组合。具体而言，我们提出使用参考运动作为新的条件信号，显著增强了生成内容的一致性、风格和时间动态，这对于现实动画至关重要。为了处理多模态冲突，我们引入了一种逐步弱到强混合条件训练策略。为了实现高质量的多模态训练，我们构建了迄今为止最大的统一多模态运动数据集OmniMoCap-X，整合了10种不同任务的28个公开可用的MoCap来源，并标准化为每秒30帧的SMPL-X格式。为了确保详细和一致的注释，我们将序列渲染成视频，并使用GPT-4o自动生成结构化和分层的描述，捕捉低级动作和高级语义。广泛的实验评估证实，OmniMotion-X 显著超越了现有方法，在多个多模态任务上表现出最先进的性能，并能够生成逼真、连贯且可控的长时间运动。

Summary / 总结

OmniMotion-X is a versatile multimodal framework for whole-body human motion generation using an autoregressive diffusion transformer. It supports various tasks like text-to-motion, music-to-dance, and speech-to-gesture, and introduces reference motion as a conditioning signal to enhance consistency. The framework also uses a progressive training strategy to handle multimodal conflicts and a large unified dataset, OmniMoCap-X, for high-quality training. Experimental results show that OmniMotion-X outperforms existing methods in multiple multimodal tasks and enables the generation of realistic, coherent, and controllable long-duration motions.

OmniMotion-X 是一种使用自回归扩散变换器的多功能框架，用于全身人体动作生成，支持文本到动作、音乐到舞蹈和语音到手势等多种任务，并引入参考动作作为条件信号以增强一致性。该框架还使用逐步训练策略来处理多模态冲突，并使用大型统一数据集 OmniMoCap-X 进行高质量训练。实验结果表明，OmniMotion-X 在多个多模态任务中优于现有方法，并能够生成逼真、连贯且可控的长时间动作。

Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR

Authors: Adwait Chandorkar, Hasan Tercan, Tobias Meisen

Venue: ICCV 2025 Embedded Vision Workshop (Best Paper Award)

First: 2025-08-01T16:19:51+00:00 · Latest: 2025-10-22T17:25:08+00:00

Comments: Best Paper Award at the Embedded Vision Workshop ICCV 2025

Abstract

Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases model complexity. Lightweight backbone design is well explored for 2D object detection, but research on 3D object detection remains limited. In this work, we introduce Dense Backbone, a lightweight backbone that combines the benefits of high processing speed, lightweight architecture, and robust detection accuracy. We adapt multiple SoTA 3D object detectors, such as PillarNet, to use our backbone and show that these models retain most of their detection capability at a significantly reduced computational cost. To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone's plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.

中文标题/摘要

标题：重新思考面向LiDAR轻量级3D物体检测的骨干网络设计

基于LiDAR的3D物体检测最近取得了显著进展，极大地推动了全自动驾驶在现实环境中的实现。尽管检测性能很高，但大多数方法仍然依赖基于VGG或ResNet的骨干网络来探索特征，这增加了模型的复杂性。轻量级骨干网络设计在2D物体检测中得到了广泛研究，但在3D物体检测方面的研究仍然有限。在本文中，我们提出了Dense Backbone，这是一种结合了高速处理、轻量级架构和稳健检测精度的轻量级骨干网络。我们将多种当前最佳3D物体检测器（如PillarNet）与我们的骨干网络进行适配，并展示了在显著降低计算成本的同时，这些模型保留了大部分的检测能力。据我们所知，这是第一个专门针对点云数据3D物体检测的基于密集层的骨干网络。我们的PillarNet适配版本DensePillarNet在nuScenes测试集上的模型参数减少了29%，延迟减少了28%，检测精度仅下降了2%。此外，Dense Backbone的即插即用设计使其可以轻松集成到现有架构中，无需对其他网络组件进行修改。

Summary / 总结

This work addresses the need for lightweight backbones in 3D object detection for LiDAR data, motivated by the high computational cost of existing VGG- and ResNet-based approaches. The authors introduce Dense Backbone, a dense-layer-based lightweight design that retains high detection accuracy while significantly reducing computational cost. DensePillarNet, an adaptation of PillarNet using this backbone, achieves a 29% reduction in model parameters and a 28% reduction in latency with only a 2% drop in detection accuracy on the nuScenes test set.

本文旨在解决3D LiDAR数据目标检测中轻量级骨干网络的需求，动机在于现有基于VGG和ResNet的方法计算成本高。作者提出了一种基于密集层的轻量级设计Dense Backbone，该设计在保持检测准确性的同时减少了计算成本。DensePillarNet是PillarNet的一种基于此骨干网络的适应版本，其模型参数减少了29%，延迟减少了28%，同时检测准确性仅下降了2%，在nuScenes测试集上的表现证明了这一点。

Benchmarking World-Model Learning

Authors: Archana Warrier, Dat Nyugen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares

First: 2025-10-22T17:23:18+00:00 · Latest: 2025-10-22T17:23:18+00:00

Comments: 30 pages, 10 figures

Abstract

Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.

中文标题/摘要

标题：世界模型学习基准测试

模型学习代理应收集信息以学习支持多种下游任务和推理的世界模型，例如预测未观察到的状态、估计动作的近期和远期后果、规划动作序列以及检测动力学变化。当前学习和评估世界模型的方法与这一目标相偏离：训练和评估锚定于下一帧预测，成功通过在相同环境中最大化奖励来衡量。我们提出了WorldTest，一种评估模型学习代理的协议，将无奖励交互与在不同但相关环境中的评分测试阶段分离。WorldTest 是开放式的——模型应支持许多未知的任务——并且对模型表示是无偏的，允许不同方法之间的比较。我们使用AutumnBench 实现了WorldTest，这是一个包含43个交互式网格世界环境和129个任务的套件，分为三种家族：遮罩帧预测、规划和预测因果动力学的变化。我们在AutumnBench 上比较了517名人类参与者和三种前沿模型。我们发现人类的表现优于模型，而扩展计算能力仅在某些环境中提高了性能，在其他环境中则没有。WorldTest 提供了一个新颖的模板——无奖励探索、衍生测试和基于行为的评分——来评估代理对环境动力学的理解，而AutumnBench 暴露了世界模型学习中的显著潜力。

Summary / 总结

This paper introduces WorldTest, a protocol to evaluate model-learning agents by separating reward-free interaction from a scored test phase in a different environment. The goal is to assess agents' ability to learn world models that support various downstream tasks. The protocol is instantiated with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks. The study compares 517 human participants and three frontier models on AutumnBench, finding that humans outperform the models, and scaling compute improves performance in some environments but not others.

该论文提出了WorldTest，一种评估模型学习代理的协议，通过将无奖励探索与得分测试阶段分离在不同环境中进行。目标是评估代理学习环境模型以支持各种下游任务的能力。该协议通过AutumnBench实例化，AutumnBench包含43个交互式网格世界环境和129个任务。研究在AutumnBench上比较了517名人类参与者和三种前沿模型，发现人类的表现优于模型，而扩展计算能力仅在某些环境中提高了性能。

Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes

Authors: Kevin Vora, Yu Zhang

First: 2025-03-17T17:42:54+00:00 · Latest: 2025-10-22T17:22:42+00:00

Abstract

In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.

中文标题/摘要

标题：在离散马尔可夫决策过程中的可验证高效奖励转移在强化学习中的应用

在本文中，我们提出了一种新的奖励适应（RA）解决方案，在强化学习中，代理基于先前在相同动力学下但不同奖励函数下学习的一种或多种源行为，来适应目标奖励函数。虽然可以从头开始学习目标行为是可能的，但在可用的源行为下，这通常是低效的。我们的工作通过Q函数的操作引入了一种新的RA方法。假设目标奖励函数是源奖励函数的已知函数，我们计算Q函数的边界，并提出一个迭代过程（类似于值迭代）来收紧这些边界。这些边界使代理能够在学习开始前在目标领域中进行动作剪枝。我们将这种方法称为“Q操作”（Q-M）。迭代过程假设可以访问一个轻量级模型，该模型易于提供或学习。我们正式证明，在离散领域中，Q-M 不影响返回策略的最优性，并且从概率意义上证明了它在样本复杂性方面是可验证高效的。Q-M 在多种合成和模拟领域中进行了评估，以展示其有效性、泛化能力和实用性。

Summary / 总结

This paper addresses the challenge of reward adaptation in reinforcement learning by proposing a method called Q-Manipulation (Q-M), which manipulates Q-functions to adapt an agent to a target reward function using existing source behaviors. The method computes bounds on Q-functions and uses an iterative process to tighten these bounds, enabling action pruning before learning starts. Theoretical proofs show that Q-M does not affect the optimality of the returned policy and is efficient in terms of sample complexity. Experimental results in various domains demonstrate its effectiveness, generalizability, and practicality.

本文提出了一种名为Q-Manipulation (Q-M)的方法，通过在目标域中修剪动作来解决强化学习中的奖励适应问题。该方法假设目标奖励是已知源奖励的函数，并通过迭代紧化Q函数的边界来实现高效学习。理论证明表明，Q-M不会影响策略的最优性，并且在概率意义上具有高效的样本复杂性。在多种合成和模拟域中的实验结果验证了其有效性、泛化能力和实用性。
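
The action-pruning step described for Q-M can be illustrated with a minimal sketch. It assumes that lower and upper bounds on the target Q-function are already available as nested dictionaries; how the paper derives and tightens those bounds is not shown here, and the function name `prune_actions` and the data layout are hypothetical. The only rule used is the standard one: an action whose upper bound lies strictly below the best lower bound in the same state cannot be optimal.

```python
def prune_actions(q_lower, q_upper):
    """Keep, per state, only actions that could still be optimal.

    q_lower / q_upper: dict state -> dict action -> bound on Q(state, action).
    Both layouts are assumptions made for this illustration.
    """
    kept = {}
    for state, lowers in q_lower.items():
        # Some action is guaranteed to achieve at least this value.
        best_lower = max(lowers.values())
        # An action is dropped only if even its optimistic value falls short.
        kept[state] = [
            action
            for action, upper in q_upper[state].items()
            if upper >= best_lower
        ]
    return kept


if __name__ == "__main__":
    lower = {"s0": {"left": 1.0, "right": 3.0, "stay": 2.5}}
    upper = {"s0": {"left": 2.0, "right": 5.0, "stay": 3.5}}
    print(prune_actions(lower, upper))
```

Tighter bounds prune more actions, which is why an iterative tightening step before learning would reduce the number of state-action pairs the agent has to explore.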

Environment Inference for Learning Generalizable Dynamical System

Authors: Shixuan Liu, Yue He, Haotian Wang, Wenjing Yang, Yunfei Wang, Peng Cui, Zhong Liu

Venue: NeurIPS 2025 Spotlight

First: 2025-10-22T17:20:12+00:00 · Latest: 2025-10-22T17:20:12+00:00

Comments: NeurIPS 2025 Spotlight

Abstract

Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.

中文标题/摘要

标题：环境推断学习通用化的动力学系统

数据驱动的方法为分析复杂动力学系统提供了高效且稳健的解决方案，但依赖于独立同分布数据的假设，推动了处理环境差异的泛化技术的发展。然而，这些技术受限于对环境标签的依赖，这些标签在训练过程中由于数据获取挑战、隐私问题和环境变化等因素往往不可用，特别是在大型公共数据集和隐私敏感领域。为此，我们提出了一种名为DynaInfer的新方法，通过分析每轮训练中固定神经网络预测误差来推断环境规范，从而直接从数据中进行环境分配。我们证明了该算法在无标签场景中有效解决了交替优化问题，并通过跨多种动力学系统的广泛实验进行了验证。结果显示，DynaInfer在环境标签不可用时优于现有环境分配技术，能够快速收敛到真实标签，并且即使在环境标签可用时也能实现更优性能。

Summary / 总结

The paper addresses the challenge of generalizing data-driven methods for dynamical systems in the absence of environment labels. It introduces DynaInfer, which infers environment specifications by analyzing prediction errors from fixed neural networks. Experiments across various dynamical systems demonstrate that DynaInfer outperforms existing techniques, converges quickly to true labels, and even performs better when labels are available.

论文提出了一种名为DynaInfer的新方法，通过从预测误差中推断环境规格来解决数据驱动方法在动力系统分析中的泛化问题，而不依赖于环境标签。该方法在无标签场景下有效解决了交替优化问题，并且在各种动力系统实验中表现出色，优于现有技术，即使在有标签的情况下也能实现更快的收敛和更好的性能。
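
The alternating scheme behind environment inference can be sketched on a toy problem. This is not DynaInfer itself: the neural networks are replaced by scalar linear models `x[t+1] = rate * x[t]`, and the assignment rule (pick the model with the smallest one-step prediction error) is an assumption about how "analyzing prediction errors from fixed neural networks" could be realized. All function names are hypothetical.

```python
def one_step_error(rate, traj):
    """Sum of squared one-step prediction errors of x[t+1] = rate * x[t]."""
    return sum((nxt - rate * cur) ** 2 for cur, nxt in zip(traj, traj[1:]))


def assign(trajs, rates):
    """With the models held fixed, label each trajectory by its best-fitting model."""
    return [
        min(range(len(rates)), key=lambda e: one_step_error(rates[e], traj))
        for traj in trajs
    ]


def refit(trajs, labels, rates):
    """With the labels held fixed, refit each model by least squares."""
    new_rates = []
    for env, old_rate in enumerate(rates):
        num = den = 0.0
        for traj, label in zip(trajs, labels):
            if label == env:
                for cur, nxt in zip(traj, traj[1:]):
                    num += cur * nxt
                    den += cur * cur
        # Keep the previous model if no trajectory was assigned to it.
        new_rates.append(num / den if den > 0 else old_rate)
    return new_rates


def infer_environments(trajs, rates, rounds=5):
    """Alternate between assigning environments and refitting models."""
    labels = assign(trajs, rates)
    for _ in range(rounds):
        rates = refit(trajs, labels, rates)
        labels = assign(trajs, rates)
    return labels, rates
```

On trajectories generated with decay rate 0.5 and growth rate 2, starting from rough guesses such as `[0.8, 1.5]`, the loop separates the two groups and recovers both rates.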

QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Authors: Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang

Venue: NeurIPS 2025 Oral

First: 2025-05-31T21:02:52+00:00 · Latest: 2025-10-22T17:18:21+00:00

Comments: Accepted as Oral at NeurIPS 2025. Revision after camera ready

Abstract

Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.

中文标题/摘要

标题：QoQ-Med：基于领域意识GRPO训练的多模态临床基础模型构建

临床决策通常需要处理异构数据，但现有的多模态语言模型（MLLMs）主要集中在视觉方面，难以在不同临床专科之间泛化。为了解决这一问题，我们提出了QoQ-Med-7B/32B，这是首个开放的通用临床基础模型，能够联合处理医学影像、时间序列信号和文本报告。QoQ-Med使用领域意识相对策略优化（DRPO）进行训练，这是一种新颖的强化学习目标，根据领域稀有性和模态难度逐级放大标准化奖励，从而缓解由临床数据分布偏斜引起的性能不平衡问题。该模型在涵盖9个临床领域的261万条指令调优对上进行训练，结果显示，与无批评训练方法（如GRPO）相比，DRPO训练在宏观F1分数上平均提高了43%。此外，通过在密集分割数据上训练QoQ-Med，该模型能够突出显示与诊断相关的显著区域，其IoU比开源模型高10倍，同时达到OpenAI o4-mini的性能。为了促进可重复性和下游研究，我们发布了（i）完整的模型权重，（ii）模块化训练管道，以及（iii）所有中间推理痕迹，网址为https://github.com/DDVD233/QoQ_Med。

Summary / 总结

QoQ-Med is a multimodal clinical foundation model that addresses the limitations of existing vision-centric models by jointly reasoning over medical images, time-series signals, and text reports. It uses Domain-aware Relative Policy Optimization (DRPO) to mitigate performance imbalance, leading to a 43% improvement in macro-F1 scores across visual domains. The model also excels in highlighting diagnostic regions with a 10x higher IoU compared to open models, matching the performance of OpenAI o4-mini. The full model and training pipeline are publicly available.

QoQ-Med 是一个多模态临床基础模型，通过联合处理医学图像、时间序列信号和文本报告来解决现有以视觉为中心模型的局限性。它使用 Domain-aware Relative Policy Optimization (DRPO) 来缓解性能不平衡问题，使得在视觉域的宏观 F1 分数平均提高了 43%。该模型在突出诊断区域方面表现优异，IoU 比开源模型高 10 倍，性能与 OpenAI o4-mini 相当。完整的模型和训练管道已公开发布。

Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics

Authors: Tobias Würth, Niklas Freymuth, Gerhard Neumann, Luise Kärger

First: 2025-06-06T12:46:36+00:00 · Latest: 2025-10-22T17:09:23+00:00

Abstract

Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations usually occurring in solid mechanics, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion-Batched Inference (ROBI), a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.

中文标题/摘要

标题：基于扩散的分层图神经网络用于模拟非线性固体力学

基于图的学习模拟器已成为在非结构化网格上模拟物理系统的一种有前途的方法，提供了速度和在不同几何形状上的泛化能力。然而，它们通常难以捕捉全局现象，如弯曲或在固体力学中常见的长程相关性，并且由于依赖于局部消息传递和直接下一步预测，长时间序列中会出现误差累积。我们通过引入滚动扩散批处理推理网络（ROBIN），一种新颖的学习模拟器来解决这些限制，该网络结合了两项创新：(i) 滚动扩散批处理推理（ROBI），一种并行推理方案，通过在时间窗口内重叠去噪步骤来并行化基于扩散的细化成本，从而在物理时间步长之间摊销成本。(ii) 基于代数多重网格细化的分层图神经网络，能够在不同网格分辨率之间实现多尺度消息传递。该架构通过代数分层消息传递网络实现，能够捕捉细尺度局部动力学和对梁弯曲或多体接触等现象至关重要的全局结构效应。我们通过涉及几何、材料和接触非线性的挑战性2D和3D固体力学基准测试验证了ROBIN。ROBIN在所有任务上都达到了最先进的准确性，显著优于现有的一步学习模拟器，并且与标准扩散模拟器相比，推理时间减少了多达一个数量级。

Summary / 总结

The research aims to improve the simulation of nonlinear solid mechanics by addressing the limitations of existing graph-based learned simulators, which struggle with global phenomena and error accumulation. The Rolling Diffusion-Batched Inference Network (ROBIN) is introduced, combining Rolling Diffusion-Batched Inference (ROBI) for parallelized inference and a Hierarchical Graph Neural Network for multiscale message passing. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, enhances the simulation of fine-scale local dynamics and global structural effects. Experiments on 2D and 3D benchmarks show that ROBIN achieves state-of-the-art accuracy and reduces inference time significantly compared to previous methods.

研究旨在通过解决图基学习模拟器在捕捉全局现象和减少长时间序列中的误差累积方面的局限性，来改进固体力学的模拟。提出了Rolling Diffusion-Batched Inference Network (ROBIN)，该网络结合了Rolling Diffusion-Batched Inference (ROBI)的并行推理和层次图神经网络的多尺度消息传递。ROBIN在2D和3D固体力学基准测试中达到了最先进的精度，并且与标准扩散模拟器相比，推理时间减少了近一个数量级。

Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

Authors: Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis

First: 2025-10-22T17:00:45+00:00 · Latest: 2025-10-22T17:00:45+00:00

Abstract

LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.

中文标题/摘要

标题：超越反应性：测量LLM代理的主动问题解决能力

基于LLM的代理正越来越多地转向主动性：它们不再等待指令，而是主动行使代理权以预见用户需求并自主解决问题。然而，评估主动性颇具挑战性；当前的基准测试局限于局部上下文，限制了它们测试跨源和更长时间推理的能力。为解决这一差距，我们提出了PROBE（主动解决瓶颈）。PROBE将主动性分解为三个核心能力的管道：（1）搜索未指定的问题，（2）识别特定瓶颈，（3）执行适当的解决方案。我们将PROBE应用于评估领先LLM和流行的代理框架，结果显示即使是最先进的模型也难以解决这一基准测试。在前沿LLM和代理的一致测量中，我们发现GPT-5和Claude Opus-4.1的最佳端到端性能为40%。此外，我们展示了每个模型的相对能力并分析了它们的共同失败模式。我们的结果突显了代理系统中自主行动的当前局限性，并揭示了未来研究的潜在方向。

Summary / 总结

The research aims to evaluate the proactivity of large language model (LLM) agents by introducing PROBE, a benchmark that decomposes proactivity into three core capabilities: issue searching, bottleneck identification, and resolution execution. The study finds that even state-of-the-art models perform poorly on this benchmark, with the best end-to-end performance at 40% achieved by both GPT-5 and Claude Opus-4.1. The analysis also highlights the limitations of autonomous action in agentic systems and suggests future research directions.

研究旨在通过引入分解为三个核心能力的PROBE基准来评估大型语言模型（LLM）代理的主动性：问题搜索、瓶颈识别和解决方案执行。研究发现，即使是最先进的模型在这一基准上的表现也很差，最佳端到端性能为40%，由GPT-5和Claude Opus-4.1共同实现。分析还指出了自主行动在代理系统中的局限性，并提出了未来的研究方向。

SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration

Authors: Xichen Zhang, Sitong Wu, Haoru Tan, Shaozuo Yu, Yinghao Zhu, Ziyi He, Jiaya Jia

First: 2025-10-22T16:56:01+00:00 · Latest: 2025-10-22T16:56:01+00:00

Comments: Code: https://github.com/dvlab-research/SmartSwitch

Abstract

The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of "underthinking", where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model's reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a "deepening prompt" to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.

中文标题/摘要

标题：SmartSwitch：通过促进更深入的思考探索来克服大型语言模型复杂推理任务中的浅层思考

长链推理（LongCoT）能力是大型语言模型在复杂推理任务中取得近期突破的核心。然而，伴随的“浅层思考”问题，即模型在推理过程中频繁切换思考而缺乏充分探索，限制了性能和标记效率。为解决这一问题，我们提出了一种简单而有效的推理策略：SmartSwitch 推理框架。该框架可以轻松集成到任何大型语言模型中作为即插即用的解决方案，持续监控模型的推理过程，检测浅层思考并引导其向更有潜力但被忽视的思考进行更深入的探索。具体来说，感知模块识别思考切换的点，并使用现成的过程奖励模型（PRM）评估前一思考的潜力。如果发现有高潜力的思考被过早放弃，干预模块会中断正在进行的推理，回溯到切换前的点，并插入一个“深化提示”以鼓励沿着这条有潜力的路径进行更深入的探索。在具有挑战性的数学推理基准测试中的广泛实验表明，我们的方法显著提升了不同规模大型语言模型的性能。

Summary / 总结

The paper addresses the issue of 'underthinking' in large language models (LLMs), where models perform shallow reasoning by frequently switching thoughts without sufficient exploration. To tackle this, the authors propose the SmartSwitch inference framework, which monitors the reasoning process and intervenes when underthinking is detected by backtracking and encouraging deeper exploration. Experiments on mathematical reasoning benchmarks show that this method improves the performance of various LLMs of different sizes.

论文针对大型语言模型中常见的‘浅思考’问题，即模型在推理过程中频繁切换思考而缺乏充分探索。为此，作者提出了SmartSwitch推理框架，该框架监控模型的推理过程，并在发现有潜力的思考被过早放弃时，通过回溯和插入‘深入提示’来干预。实验结果显示，该方法能够提升不同规模大型语言模型的性能。
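
The perception and intervention steps of SmartSwitch can be pictured with a small sketch over a list of already-segmented thoughts. Everything concrete here is an assumption made for illustration: the switch markers, the scoring function standing in for a process reward model, the threshold, and the function names are hypothetical, and the released code at the repository listed in the comments may work quite differently.

```python
# Hypothetical phrases that signal the model is abandoning its current thought.
SWITCH_MARKERS = ("alternatively", "wait", "another approach")


def find_abandoned_thought(thoughts, score_fn, threshold):
    """Perception: index of the first promising thought followed by a switch.

    score_fn stands in for a process reward model; any callable mapping a
    thought string to a number works. Returns None if nothing qualifies.
    """
    for i in range(len(thoughts) - 1):
        switched = thoughts[i + 1].lower().startswith(SWITCH_MARKERS)
        if switched and score_fn(thoughts[i]) >= threshold:
            return i
    return None


def backtrack_and_deepen(thoughts, index, deepening_prompt):
    """Intervention: drop everything after the promising thought and
    append a prompt that asks the model to continue along that path."""
    return thoughts[: index + 1] + [deepening_prompt]
```

In a real decoding loop the returned list would be turned back into a prefix and generation would resume from it; that integration is left out because it depends on the serving stack.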

gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity

Authors: Hugh Blayney, Álvaro Arroyo, Xiaowen Dong, Michael M. Bronstein

First: 2025-10-09T16:58:49+00:00 · Latest: 2025-10-22T16:55:32+00:00

Comments: 23 pages, 22 figures, 7 tables. v2: clarified over-squashing separation in light of related work

Abstract

Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.

中文标题/摘要

标题：gLSTM：通过增加存储容量缓解过度压缩

图神经网络（GNNs）利用图结构在节点之间传递信息，通常通过消息传递机制实现。尽管这些模型在各种应用中取得了成功，但它们会遭受过度压缩的问题，即节点表示的大范围感受野中的信息被压缩成一个固定大小的向量，导致信息瓶颈。在本文中，我们从模型存储和检索容量的角度重新审视了过度压缩现象，我们定义为节点表示中可以存储供以后使用的相关信息量。我们研究了现有用于衡量过度压缩的一些限制，并引入了一个新的合成任务来证明信息瓶颈可以饱和这种容量。此外，我们借鉴序列建模文献中的关联记忆、快速权重程序员和xLSTM模型的想法，开发了一种具有改进容量的新GNN架构。我们在容量合成任务以及一系列实际图基准上展示了该架构的出色性能。

Summary / 总结

This paper addresses the issue of over-squashing in Graph Neural Networks (GNNs) by redefining the concept of model storage and retrieval capacity. The authors introduce a new synthetic task to demonstrate the limitations of existing over-squashing tasks and propose a novel GNN architecture, gLSTM, which incorporates ideas from sequence modeling to enhance storage capacity. The gLSTM architecture shows strong performance on both the synthetic task and real-world graph benchmarks.

本文通过提出一个新的GNN架构gLSTM来解决图神经网络中的过压缩问题。作者重新审视了过压缩现象，通过模型存储和检索能力的角度，并引入了一个新的合成任务来展示信息瓶颈。然后，他们借鉴序列建模中的关联记忆、快速权重编程和xLSTM模型的想法，开发了gLSTM，以提高存储能力，并在合成和实际图基准测试中都表现出色。

Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks

Authors: Shaohang Jia, Zhiyong Huang, Zhi Yu, Mingyang Hou, Shuai Miao, Han Yang

First: 2025-10-22T16:48:29+00:00 · Latest: 2025-10-22T16:48:29+00:00

Comments: 16 pages, 10 figures

Abstract

Quantization-Aware Training (QAT) is a critical technique for deploying deep neural networks on resource-constrained devices. However, existing methods often face two major challenges: the highly non-uniform distribution of activations and the static, mismatched codebooks used in weight quantization. To address these challenges, we propose Adaptive Distribution-aware Quantization (ADQ), a mixed-precision quantization framework that employs a differentiated strategy. The core of ADQ is a novel adaptive weight quantization scheme comprising three key innovations: (1) a quantile-based initialization method that constructs a codebook closely aligned with the initial weight distribution; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) to dynamically track distributional shifts; and (3) a sensitivity-informed strategy for mixed-precision allocation. For activations, we integrate a hardware-friendly non-uniform-to-uniform mapping scheme. Comprehensive experiments validate the effectiveness of our method. On ImageNet, ADQ enables a ResNet-18 to achieve 71.512% Top-1 accuracy with an average bit-width of only 2.81 bits, outperforming state-of-the-art methods under comparable conditions. Furthermore, detailed ablation studies on CIFAR-10 systematically demonstrate the individual contributions of each innovative component, validating the rationale and effectiveness of our design.

中文标题/摘要

标题：适应性分布感知量化技术在混合精度神经网络中的应用

量化感知训练（QAT）是将深度神经网络部署到资源受限设备的关键技术。然而，现有方法通常面临两个主要挑战：激活的高非均匀分布和权重量化中使用的静态、不匹配的码本。为了解决这些挑战，我们提出了适应性分布感知量化（ADQ），这是一种混合精度量化框架，采用不同的策略。ADQ的核心是一种新颖的自适应权重量化方案，包括三个关键创新：（1）基于分位数的初始化方法，构建与初始权重分布紧密对齐的码本；（2）基于指数移动平均（EMA）的在线码本自适应机制，动态跟踪分布变化；（3）混合精度分配的敏感性导向策略。对于激活，我们整合了一种硬件友好的非均匀到均匀映射方案。全面的实验验证了我们方法的有效性。在ImageNet上，ADQ使ResNet-18的Top-1精度达到71.512%，平均位宽仅为2.81位，优于在类似条件下最先进的方法。此外，对CIFAR-10的详细消融研究系统地展示了每个创新组件的独立贡献，验证了我们设计的合理性和有效性。

Summary / 总结

The paper proposes ADQ, an adaptive distribution-aware quantization method for mixed-precision neural networks to address the challenges of non-uniform activation distributions and static codebooks. ADQ includes a quantile-based initialization, an online codebook adaptation mechanism, and a sensitivity-informed mixed-precision allocation strategy. Experiments show that ADQ achieves 71.512% Top-1 accuracy on ImageNet with an average bit-width of 2.81 bits, outperforming existing methods. Ablation studies on CIFAR-10 confirm the effectiveness of each component.

论文提出了ADQ，一种适应性分布感知的混合精度量化方法，以解决激活分布非均匀和静态码本的问题。ADQ包括基于分位数的初始化、基于指数移动平均的在线码本适应机制以及敏感性指导的混合精度分配策略。实验表明，ADQ在ImageNet上实现了71.512%的Top-1精度，平均比特宽度仅为2.81位，优于现有方法。在CIFAR-10上的消融研究进一步验证了每个创新组件的有效性。
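
Two of the ADQ ingredients, quantile-based codebook initialization and EMA-based codebook adaptation, can be sketched in plain Python. The exact formulas are assumptions, since the abstract does not give them: here each level is placed at the centre of an equal-probability bin, each weight is assigned to its nearest level, and every level moves toward the mean of its assigned weights with a hypothetical `decay` parameter.

```python
def quantile_codebook(weights, num_levels):
    """Initialize levels at the centres of equal-probability bins."""
    ordered = sorted(weights)
    n = len(ordered)
    codebook = []
    for k in range(num_levels):
        q = (k + 0.5) / num_levels          # centre of the k-th quantile bin
        idx = min(n - 1, int(q * n))        # index of the empirical quantile
        codebook.append(ordered[idx])
    return codebook


def nearest_level(codebook, w):
    """Index of the codebook entry closest to weight w."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - w))


def ema_update(codebook, weights, decay=0.9):
    """Move each level toward the mean of the weights currently assigned to it."""
    sums = [0.0] * len(codebook)
    counts = [0] * len(codebook)
    for w in weights:
        k = nearest_level(codebook, w)
        sums[k] += w
        counts[k] += 1
    updated = []
    for k, level in enumerate(codebook):
        if counts[k] == 0:
            updated.append(level)           # unused level stays where it is
        else:
            mean = sums[k] / counts[k]
            updated.append(decay * level + (1.0 - decay) * mean)
    return updated
```

Calling `ema_update` once per training step lets the codebook follow the weight distribution as it drifts, instead of staying fixed at its initial values.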

CONFEX: Uncertainty-Aware Counterfactual Explanations with Conformal Guarantees

Authors: Aman Bilkhoo, Milad Kazemi, Nicola Paoletti, Mehran Hosseini

First: 2025-10-22T16:43:36+00:00 · Latest: 2025-10-22T16:43:36+00:00

Comments: 35 pages, 10 figures, 21 tables, 2 algorithms. [Main paper part consists of 11 pages, 2 figures, 1 table, 1 algorithm]

Abstract

Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.

中文标题/摘要

标题：CONFEX：具有形式保证的不确定性感知反事实解释

反事实解释（CFXs）为模型预测提供人类可理解的说明，使采取行动并提高可解释性成为可能。为了可靠，CFXs 必须避免高预测不确定性的区域，在这些区域解释可能误导或不适用。然而，现有方法往往忽视不确定性或缺乏将不确定性纳入其中的原理性机制和形式保证。我们提出了一种名为 CONFEX 的新方法，该方法使用一致性预测（CP）和混合整数线性规划（MILP）生成不确定性感知的反事实解释。CONFEX 解释旨在提供局部覆盖保证，解决 CFX 生成违反可交换性的问题。为此，我们开发了一种新颖的局部 CP 程序，通过利用基于树的输入空间分区的离线方法，该程序具有高效的 MILP 编码。这样，CONFEX 生成具有预测不确定性和最优性双重形式保证的 CFXs。我们使用多种基准和指标评估了 CONFEX，证明了我们的不确定性感知方法提供了稳健且合理的解释。

Summary / 总结

The research aims to provide reliable counterfactual explanations by incorporating predictive uncertainty. CONFEX uses Conformal Prediction and Mixed-Integer Linear Programming to generate explanations with local coverage guarantees, ensuring they are not misleading in regions of high uncertainty. Experiments show that CONFEX provides robust and plausible explanations compared to existing methods.

研究旨在通过引入预测不确定性来提供可靠的反事实解释。CONFEX 使用 Conformal Prediction 和 Mixed-Integer Linear Programming 生成具有局部覆盖保证的解释，确保它们在高不确定性区域不会误导人。实验表明，CONFEX 提供了比现有方法更稳健和合理的解释。
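
The conformal ingredient of CONFEX can be illustrated with the standard split-conformal quantile, computed separately per region of the input space. This is only a sketch under stated assumptions: the `region_of` callable is a hypothetical stand-in for the paper's tree-based partitioning, the nonconformity scores are taken as given, and the MILP encoding is not shown.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def localized_thresholds(calibration, region_of, alpha):
    """Compute one threshold per region from (input, score) calibration pairs."""
    by_region = {}
    for x, score in calibration:
        by_region.setdefault(region_of(x), []).append(score)
    return {
        region: conformal_threshold(scores, alpha)
        for region, scores in by_region.items()
    }
```

Regions with few calibration points get an infinite threshold, which reflects that no finite score cut-off can be justified there at the requested coverage level.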

When Do Transformers Learn Heuristics for Graph Connectivity?

Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan

First: 2025-10-22T16:43:32+00:00 · Latest: 2025-10-22T16:43:32+00:00

Abstract

Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has the capacity to solve graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution, while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data to within a model's capacity leads both standard and disentangled Transformers to learn the exact algorithm rather than the degree-based heuristic.

中文标题/摘要

标题：Transformer 何时会学到图连通性的启发式方法？

Transformer 通常无法学习可泛化的算法，而是依赖于脆弱的启发式方法。我们以图连通性作为试验平台，从理论和实验两个方面解释了这一现象。我们考虑了一种简化的 Transformer 架构，即解耦 Transformer，并证明一个$L$层模型恰好能够求解直径不超过$3^L$的图，其实现的算法等价于计算邻接矩阵的幂。我们分析了训练动态，并表明学习到的策略取决于大多数训练实例是否在该模型的能力范围内。能力范围内的图（直径$\leq 3^L$）驱动模型学到正确的算法解，而超出能力范围的图则驱动模型学到基于节点度数的简单启发式。最后，我们通过实验表明，将训练数据限制在模型能力范围内，可以使标准 Transformer 和解耦 Transformer 都学到精确的算法，而非基于度数的启发式。

Summary / 总结

The study investigates why transformers often rely on brittle heuristics instead of learning generalizable algorithms. Using graph connectivity as a test case, it proves that an $L$-layer disentangled Transformer can solve graphs with diameters up to $3^L$ by computing powers of the adjacency matrix. The research shows that the learned strategy depends on the training instances: within-capacity graphs lead to learning the correct algorithm, while beyond-capacity graphs result in a simple degree-based heuristic. Restricting training data within the model's capacity ensures the transformers learn the exact algorithm rather than the heuristic.

研究探讨了为什么 Transformer 通常依赖脆弱的启发式方法，而不是学习可泛化的算法。以图连通性为测试案例，研究证明一个$L$层的解耦 Transformer 可以通过计算邻接矩阵的幂来求解直径不超过$3^L$的图。研究表明，学到的策略取决于训练样本：能力范围内的图促使模型学到正确的算法，而超出能力范围的图则促使模型学到基于节点度数的简单启发式。将训练数据限制在模型的能力范围内，可以确保 Transformer 学到精确的算法而不是启发式方法。

Learning Affordances at Inference-Time for Vision-Language-Action Models

Authors: Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A. Seshia, Sergey Levine

First: 2025-10-22T16:43:29+00:00 · Latest: 2025-10-22T16:43:29+00:00

Comments: 7 pages and appendix

Abstract

Solving complex real-world control tasks often takes multiple tries: if we fail at first, we reflect on what went wrong, and change our strategy accordingly to avoid making the same mistake. In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks, but lack the ability to contextually and dynamically readjust behavior when they fail to accomplish a task. In this work, we introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences by including them in-context, allowing it to learn the affordances and capabilities of the low-level VLA. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution and draws useful conclusions to be included in future reasoning contexts. Unlike similar approaches to self-refinement in non-robotics domains, LITEN must reflect on unstructured real-world robot trajectories (e.g., raw videos), which requires structured guiderails during assessment. Our experimental results demonstrate LITEN is able to effectively learn from past experience to generate plans that use high-affordance instructions to accomplish long-horizon tasks.
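The alternation between a reasoning phase and an assessment phase can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the LITEN implementation: the function `run_reflective_loop` and its callable parameters (`plan_fn`, `execute_fn`, `assess_fn`, `succeeded_fn`) are hypothetical stand-ins for the high-level VLM, the low-level VLA, and the success check.

```python
from typing import Callable, List, Tuple


def run_reflective_loop(
    task: str,
    plan_fn: Callable[[str, List[str]], List[str]],
    execute_fn: Callable[[List[str]], str],
    assess_fn: Callable[[str, List[str], str], str],
    succeeded_fn: Callable[[str], bool],
    max_attempts: int = 3,
) -> Tuple[List[str], bool]:
    """Alternate between planning/execution and assessment.

    Conclusions drawn in each assessment are appended to `lessons`, which is
    passed back to the planner as in-context experience on the next attempt.
    """
    lessons: List[str] = []
    success = False
    for _ in range(max_attempts):
        # Reasoning phase: plan with all lessons gathered so far, then execute.
        plan = plan_fn(task, lessons)
        trajectory = execute_fn(plan)
        # Assessment phase: reflect on the execution and store the conclusion.
        lessons.append(assess_fn(task, plan, trajectory))
        if succeeded_fn(trajectory):
            success = True
            break
    return lessons, success
```

Keeping the lessons as plain text in a list reflects the in-context design described in the abstract: nothing is fine-tuned, and the only state carried between attempts is what the planner reads.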

中文标题/摘要

标题：在推理时为视觉-语言-动作模型学习可供性

解决复杂的现实世界控制任务通常需要多次尝试：如果第一次失败，我们会反思哪里出了问题，并相应地改变策略以避免重复同样的错误。在机器人学中，视觉-语言-动作模型（VLAs）为解决复杂控制任务提供了有希望的途径，但在未能完成任务时缺乏根据上下文动态调整行为的能力。在本工作中，我们提出了从推理时执行中学习（Learning from Inference-Time Execution，LITEN），它将VLA低层策略连接到一个高层VLM，该VLM通过把过去的经验纳入上下文来进行条件化，从而学习低层VLA的可供性和能力。我们的方法在两个阶段之间迭代：推理阶段为低层VLA生成并执行计划，评估阶段反思执行结果并得出有用的结论，供未来的推理上下文使用。与非机器人领域中类似的自我改进方法不同，LITEN必须对非结构化的真实机器人轨迹（例如原始视频）进行反思，这需要在评估过程中提供结构化的引导。我们的实验结果表明，LITEN能够有效地从过去的经验中学习，生成使用高可供性指令的计划来完成长时程任务。

Summary / 总结

This work addresses the need for Vision-Language-Action models to dynamically adjust their behavior based on past failures. It introduces LITEN, which connects a low-level policy to a high-level model that learns from past experiences. The method iterates between reasoning and assessment phases, allowing the model to generate and execute plans, and then reflect on the outcomes to improve future strategies. Experiments show that LITEN can effectively learn from past experiences to generate plans using high-affordance instructions for long-horizon tasks.

该研究旨在满足视觉-语言-动作模型在失败后动态调整行为的需求。所提方法 Learning from Inference-Time Execution (LITEN) 将低层策略与一个能从过去经验中学习的高层模型连接起来，并在推理阶段与评估阶段之间迭代：先生成并执行计划，再反思执行结果以改进后续策略。实验表明，LITEN 能够有效地从过去的经验中学习，生成使用高可供性指令的计划来完成长时程任务。

BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models

Authors: Catherine Villeneuve, Benjamin Akera, Mélisande Teng, David Rolnick

First: 2025-10-22T16:42:46+00:00 · Latest: 2025-10-22T16:42:46+00:00

Abstract

Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
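The idea of updating a prior prediction with a handful of local observations can be illustrated with a conjugate Beta-Bernoulli update. This is a deliberately simplified sketch, not the BATIS framework or any of the deep uncertainty-quantification methods it benchmarks: the function name `update_occurrence` and the `prior_strength` pseudo-count (how many observations the prior prediction is worth) are assumptions introduced for illustration.

```python
def update_occurrence(prior_prob, prior_strength, detections, visits):
    """Posterior mean and variance of a species' occurrence probability.

    The model's prior prediction `prior_prob` is encoded as a Beta
    distribution with `prior_strength` pseudo-observations, then updated with
    `detections` positive records out of `visits` local surveys.
    """
    if not 0.0 <= prior_prob <= 1.0:
        raise ValueError("prior_prob must lie in [0, 1]")
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    if not 0 <= detections <= visits:
        raise ValueError("detections must lie in [0, visits]")
    alpha = prior_prob * prior_strength + detections
    beta = (1.0 - prior_prob) * prior_strength + (visits - detections)
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance
```

For example, a prior of 0.2 worth 10 pseudo-observations, combined with 3 detections in 5 visits, gives a posterior mean of 5/15, roughly 0.33. A weaker prior (smaller `prior_strength`) lets sparse local data move the estimate further, which is the trade-off that well-calibrated uncertainty is meant to govern.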

中文标题/摘要

标题：BATIS：基于贝叶斯方法的物种分布模型目标改进方法

物种分布模型（SDMs）旨在基于环境变量预测物种的出现情况，被广泛用于监测和应对生物多样性变化。近期面向SDMs的深度学习方法在复杂且异质的数据集上表现良好，但其效果仍受限于数据中的空间偏差。本文从贝叶斯视角重新审视深度SDMs，并提出BATIS，这是一种新颖且实用的框架，其中先验预测利用有限的观测数据进行迭代更新。模型必须恰当地刻画偶然（aleatoric）不确定性和认知（epistemic）不确定性，才能有效地将细粒度的局部信息与更广泛的生态模式结合起来。我们在一个包含eBird平台公民科学观测数据的新数据集上，对大量不确定性量化方法进行了基准测试。我们的实证研究表明，贝叶斯深度学习方法可以显著提高SDMs在数据稀缺地区的可靠性，从而有助于生态理解和保护工作。

Summary / 总结

The research aims to improve the reliability of species distribution models (SDMs) by addressing spatial biases in data. BATIS, a Bayesian framework, iteratively updates prior predictions with limited observational data to capture both aleatoric and epistemic uncertainty. The study demonstrates that Bayesian deep learning approaches significantly enhance the reliability of SDMs in data-scarce areas, contributing to ecological understanding and conservation efforts.

研究旨在通过解决数据中的空间偏差来提高物种分布模型（SDMs）的可靠性。BATIS是一种贝叶斯框架，利用有限的观测数据迭代更新先验预测，并同时刻画偶然（aleatoric）不确定性和认知（epistemic）不确定性。研究表明，贝叶斯深度学习方法显著提高了SDMs在数据稀缺地区的可靠性，有助于生态理解和保护工作。

Video-R1: Reinforcing Video Reasoning in MLLMs

Authors: Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue

Venue: NeurIPS 2025

First: 2025-03-27T17:59:51+00:00 · Latest: 2025-10-22T16:42:24+00:00

Comments: NeurIPS 2025, Project page: https://github.com/tulerfeng/Video-R1

Abstract

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 37.1% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released at https://github.com/tulerfeng/Video-R1.
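For readers unfamiliar with GRPO, its core step is to standardize rewards within a group of responses sampled for the same prompt, so that no learned value function is needed. The sketch below shows only that group-relative advantage. The temporal mechanism that distinguishes T-GRPO is not specified in the abstract and is not reproduced here, and the function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero when all
    rewards in the group are identical).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

With rule-based 0/1 rewards, a group in which every response is correct (or every response is wrong) yields zero advantage for all members, so such groups contribute no gradient signal.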

中文标题/摘要

标题：Video-R1：在MLLM中强化视频推理

受DeepSeek-R1通过基于规则的强化学习（RL）激发推理能力成功的启发，我们引入了Video-R1，作为首个系统探索R1范式以激励MLLM中的视频推理的尝试。然而，直接将RL训练与GRPO算法应用于视频推理存在两大主要挑战：（i）缺乏对视频推理的时序建模，（ii）高质量视频推理数据稀缺。为解决这些问题，我们首先提出了T-GRPO算法，鼓励模型利用视频中的时序信息进行推理。此外，我们不仅依赖视频数据，还引入了高质量的图像推理数据进行训练。我们构建了两个数据集：Video-R1-CoT-165k 用于SFT冷启动，Video-R1-260k 用于RL训练，均包含图像和视频数据。实验结果表明，Video-R1 在视频推理基准（如VideoMMMU和VSI-Bench）以及一般视频基准（如MVBench和TempCompass）上取得了显著改进。值得注意的是，Video-R1-7B 在视频空间推理基准VSI-bench 上的准确率达到37.1%，超越了商用专有模型GPT-4o。所有代码、模型和数据已发布在：https://github.com/tulerfeng/Video-R1。

Summary / 总结

Video-R1 aims to enhance video reasoning capabilities in multimodal large language models by addressing the challenges of temporal modeling and data scarcity. It introduces the T-GRPO algorithm to incorporate temporal information and trains on two newly constructed datasets that combine high-quality image-reasoning data with video data. The model shows significant improvements on video reasoning benchmarks and general video benchmarks, with Video-R1-7B achieving 37.1% accuracy on VSI-Bench, surpassing GPT-4o.

Video-R1旨在通过解决时间建模和数据稀缺性问题来增强多模态大语言模型（MLLMs）的视频推理能力。它引入了T-GRPO算法以整合时间信息，并使用图像和视频数据进行训练。该模型在视频推理基准如VideoMMMU和VSI-Bench以及一般视频基准如MVBench和TempCompass上表现出显著改进。值得注意的是，Video-R1-7B在VSI-bench上的准确率达到37.1%，超越了GPT-4o。

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Authors: Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song

First: 2025-08-23T08:47:31+00:00 · Latest: 2025-10-22T16:32:56+00:00

Abstract

Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
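The two roles of rubrics described above (scaffolding that decays over training, and a reference for scoring) can be sketched in a few lines. This is an illustration under stated assumptions rather than the RuscaRL implementation: the linear decay schedule, the function names, and the per-item weights are all hypothetical, and the per-item pass/fail judgements are assumed to come from an external LLM judge.

```python
import math


def scaffold_ratio(step, total_steps):
    """Fraction of rubric items shown to the policy (assumed linear decay)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def visible_rubrics(rubrics, step, total_steps):
    """Rubric items to include in the task instruction at this training step."""
    keep = math.ceil(len(rubrics) * scaffold_ratio(step, total_steps))
    return list(rubrics[:keep])


def rubric_reward(judgements, weights):
    """Weighted fraction of rubric items the judge marked as satisfied."""
    if len(judgements) != len(weights):
        raise ValueError("one weight per judgement is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for ok, w in zip(judgements, weights) if ok) / total
```

Early in training the policy sees the full checklist in its prompt. By the final step it sees none, while the reward is still computed against the complete rubric.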

中文标题/摘要

标题：打破探索瓶颈：基于评分标准的强化学习促进通用大语言模型推理

大型语言模型（LLMs）的最新进展强调了强化学习（RL）在促进推理能力方面的作用潜力。尽管取得了令人鼓舞的结果，但RL的进步仍然依赖于高质量样本的学习，而探索这些样本则受限于LLMs的固有限制。这实际上创造了一个不良循环，即无法探索的内容无法学习。在本文中，我们提出了一种名为评分标准引导的强化学习（RuscaRL）的新颖教学支架框架，旨在打破通用LLMs推理的探索瓶颈。具体而言，RuscaRL引入了清单式评分标准作为（1）生成展开期间探索的显式支架，其中在任务说明中提供不同的评分标准作为外部指导，以引导多样化的高质量响应。随着时间的推移，这种指导逐渐减弱，促使模型内化潜在的推理模式；（2）模型训练期间的可验证奖励，我们可以通过评分标准作为参考获得稳健的LLM作为裁判的评分，从而在通用推理任务上实现有效的RL。广泛的实验表明，提出的RuscaRL在各种基准测试中表现出优越性，有效地在Best-of-N评估中扩展了推理边界。值得注意的是，RuscaRL将Qwen2.5-7B-Instruct在HealthBench-500上的得分从23.6提升到50.3，超过了GPT-4.1。此外，我们对Qwen3-30B-A3B-Instruct的微调版本在HealthBench-500上取得了61.1的得分，超过了包括OpenAI-o3在内的领先LLMs。我们的代码可在https://github.com/IANNXANG/RuscaRL获取。

Summary / 总结

This paper addresses the exploration bottleneck in reinforcement learning for large language models (LLMs) by proposing Rubric-Scaffolded Reinforcement Learning (RuscaRL). The method uses checklist-style rubrics both to guide exploration during rollout generation, with the guidance gradually decayed so that the model internalizes the reasoning patterns, and to provide verifiable rewards for training. Experiments show that RuscaRL raises Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1, and that a fine-tuned Qwen3-30B-A3B-Instruct variant reaches 61.1, outperforming leading LLMs including OpenAI-o3.

本文提出了一种名为Rubric-Scaffolded Reinforcement Learning (RuscaRL)的方法，以解决大型语言模型（LLMs）在强化学习中的探索难题。该方法通过使用清单式评分表来指导生成策略的探索，并在训练中提供可验证的奖励，帮助模型内化推理模式。实验表明，RuscaRL 显著提升了推理能力，Qwen2.5-7B-Instruct 在 HealthBench-500 上达到了 50.3 的成绩，超过了 GPT-4.1，而其微调版本达到了 61.1，超过了包括 OpenAI-o3 在内的领先 LLMs。

Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation

Authors: Umar Farooq, Jean-Yves Guillemaut, Adrian Hilton, Marco Volino

First: 2025-03-18T17:49:01+00:00 · Latest: 2025-10-22T16:31:43+00:00

Abstract

The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements, which limit their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements, and a 20% reduction in optimization time without sacrificing visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
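One plausible reading of coarse-to-fine frequency modulation is to low-pass filter the training images heavily at the start of optimization and relax the filter as training proceeds. The sketch below follows that reading. The Gaussian filter, the linear schedule, and the names `blur_sigma` and `gaussian_blur` are assumptions made for illustration, not details confirmed by the abstract.

```python
import numpy as np


def blur_sigma(step, total_steps, max_sigma=4.0):
    """Low-pass strength at a given step (assumed linear decay to zero)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_sigma * (1.0 - progress)


def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2D grayscale image with edge padding."""
    image = np.asarray(image, dtype=float)
    if sigma <= 0:
        return image.copy()
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    # Filter along rows, then along columns; "valid" restores the input size.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")
```

Under this reading, strongly blurred targets early on give the optimizer little reason to spawn fine Gaussians, and detail is introduced only once the coarse structure is in place.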

中文标题/摘要

标题：基于由粗到细图像频率调制的优化3D高斯泼溅

3D高斯泼溅（3DGS）为新视角合成领域带来了革命性的变化，它能够实现可实时渲染的高质量场景重建。基于3DGS的技术通常对GPU内存和磁盘存储有很高的需求，这限制了它们在消费级设备上的实际应用。我们提出了Opti3DGS，一种新颖的频率调制由粗到细优化框架，旨在尽量减少表示场景所用的高斯基元数量，从而降低内存和存储需求。Opti3DGS利用图像频率调制，先强制得到粗略的场景表示，再通过调制训练图像中的频率细节逐步细化。在基线3DGS上，我们在不牺牲视觉质量的前提下，平均减少了62%的高斯基元、40%的训练GPU内存需求以及20%的优化时间。此外，我们展示了该方法可以与许多基于3DGS的技术无缝集成，在保持甚至经常提升视觉质量的同时，稳定地减少高斯基元的数量。另外，Opti3DGS无需额外代价即可得到多细节层次（level-of-detail）的场景表示，这是优化流程的自然副产品。结果和代码将公开提供。

Summary / 总结

The paper proposes Opti3DGS, a method to optimize 3D Gaussian splatting by reducing the number of Gaussian primitives used for scene representation, thereby decreasing GPU memory and storage demands. It achieves this through a coarse-to-fine frequency modulation approach, which starts with a coarse scene representation and progressively refines it. Compared to the baseline 3DGS, Opti3DGS reduces the number of Gaussians by 62%, cuts training GPU memory usage by 40%, and shortens optimization time by 20% without compromising visual quality. The method also provides a level-of-detail scene representation as a byproduct.

论文提出了Opti3DGS，这是一种优化3D高斯泼溅的方法，通过减少表示场景所用的高斯基元数量来降低GPU内存和存储需求。通过采用由粗到细的频率调制方法，它在不牺牲视觉质量的前提下，使高斯基元数量平均减少62%、训练GPU内存减少40%、优化时间减少20%。该方法还自然地提供了多细节层次的场景表示。

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

Authors: Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang

First: 2025-06-13T13:54:30+00:00 · Latest: 2025-10-22T16:27:29+00:00

Abstract

Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.

中文标题/摘要

标题：SEC-bench：自动化评估大型语言模型代理在实际软件安全任务中的表现

对大型语言模型（LLM）代理进行严格的专注于安全性的评估对于在整个软件开发生命周期中确保其安全部署至关重要。然而，现有的基准测试主要依赖于合成挑战或简化的漏洞数据集，无法捕捉到安全工程师在实践中遇到的复杂性和模糊性。我们引入了SEC-bench，这是第一个完全自动化的基准测试框架，用于评估LLM代理在真实的软件安全工程任务中的表现。SEC-bench 使用一种新颖的多代理支架，自动构建包含测试框架的代码仓库，隔离环境中重现漏洞，并生成金标准补丁以实现可靠的评估。我们的框架以每实例仅0.87美元的成本自动创建高质量的软件漏洞数据集和可重复的成果。使用SEC-bench，我们实现了两个关键的软件安全任务，以严格评估LLM代理的能力：概念证明（PoC）生成和漏洞修补。对最先进的LLM代码代理的全面评估揭示了显著的性能差距，在我们的完整数据集上，PoC生成的成功率最高为18.0%，漏洞修补的成功率为34.0%。这些结果突显了开发更实用、更智能和更自主的LLM代理以用于软件安全工程所需的关键步骤。

Summary / 总结

SEC-bench is an automated benchmarking framework for evaluating LLM agents on real-world software security tasks. It constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Using SEC-bench, the study rigorously evaluated state-of-the-art LLM code agents on proof-of-concept generation and vulnerability patching, achieving at most 18.0% and 34.0% success rates, respectively, highlighting significant performance gaps in LLM agents for security engineering tasks.

SEC-bench 是一个自动化基准测试框架，用于评估 LLM 代理在真实软件安全任务中的表现。它构建带有测试框架（harness）的代码仓库，在隔离环境中复现漏洞，并生成用于可靠评估的金标准补丁。研究使用 SEC-bench 严格评估了最先进的 LLM 代码代理在概念验证（PoC）生成和漏洞修补方面的表现，成功率分别最高为 18.0% 和 34.0%，突显了 LLM 代理在安全工程任务中的显著性能差距。

Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning

Authors: M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka

First: 2025-10-22T16:25:43+00:00 · Latest: 2025-10-22T16:25:43+00:00

Abstract

Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
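To see why a factorized hypernetwork can be much smaller than one that emits full LoRA weights, consider the following sketch. It assumes one particular factorization, in which shared LoRA factors `A` and `B` are modulated by a context-dependent scaling vector. The abstract does not specify Zhyper's actual factorization, so the form $\Delta W = B\,\mathrm{diag}(z)\,A$, the linear hypernetwork `W_hyper`, and all function names here are hypothetical.

```python
import numpy as np


def conditioned_lora_delta(A, B, context_embedding, W_hyper):
    """Context-conditioned low-rank update: delta_W = B @ diag(z) @ A.

    A: (r, d_in) and B: (d_out, r) are shared across contexts; only the
    rank-sized vector z = W_hyper @ context_embedding depends on the text.
    """
    z = W_hyper @ context_embedding  # shape (r,)
    return (B * z) @ A               # scales column j of B by z[j]


def hypernetwork_output_sizes(d_in, d_out, rank):
    """Outputs a hypernetwork must produce per adapted weight matrix."""
    direct = rank * (d_in + d_out)   # emit both LoRA factors in full
    factorized = rank                # emit only the scaling vector
    return direct, factorized
```

For a 4096 x 4096 projection at rank 8, a hypernetwork that emits both factors must produce 65,536 values per matrix, versus 8 for the scaling vector. That is the kind of gap that makes conditioning cheap under this assumed design.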

中文标题/摘要

标题：Zhyper：因子化超网络用于条件化LLM微调

大型语言模型（LLM）条件化是指指导LLM根据特定文化的规范和价值观、特定政治倾向的信念或任何指定文本的语义条件生成内容。不幸的是，提示工程无法保证LLM的行为符合所需的条件，因为预训练和对齐数据集的归纳偏见。先前的工作集中在通过直接条件化LoRA权重来微调LLM；然而，这些方法引入了大量的参数。为此，我们提出了一种参数高效的因子化超网络框架Zhyper，可以从文本描述生成上下文感知的LoRA适配器。在多个基准测试上的实验表明，Zhyper在参数量少至26倍的情况下实现了与最先进的基线相当的性能。此外，我们将Zhyper扩展到文化对齐，展示了其在跨域设置中更好的泛化能力和对细粒度上下文值的更好捕捉。

Summary / 总结

Zhyper is a parameter-efficient factorized hypernetwork framework designed for fine-tuning Large Language Models (LLMs) with textual conditioning. It generates context-aware LoRA adapters from textual descriptions, requiring up to 26 times fewer parameters than state-of-the-art methods. Experiments show that Zhyper achieves competitive performance while improving generalization to out-of-domain settings and capturing fine-grained contextual values better than previous approaches.

Zhyper 是一种参数高效的因子化超网络框架，用于对大型语言模型（LLM）进行基于文本条件的微调。它从文本描述中生成上下文感知的 LoRA 适配器，所需参数量最多可比最先进的方法少 26 倍。实验表明，Zhyper 在保持有竞争力的性能的同时，能够更好地泛化到域外场景，并更好地捕捉细粒度的上下文价值观。

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

Authors: Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi

Venue: NeurIPS 2025 Spotlight

First: 2025-10-22T16:24:47+00:00 · Latest: 2025-10-22T16:24:47+00:00

Comments: Accepted for Spotlight Presentation at NeurIPS 2025

Abstract

To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
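The interleaving of periodic summarization tokens with the input stream can be sketched as follows. This only shows the sequence layout. How Memo attends to, caches, or trains these tokens is not described in the abstract and is not modeled here. The token string `<SUM>`, the fixed `segment_len`, and the function name are hypothetical.

```python
from typing import List, Sequence


def interleave_summary_tokens(
    tokens: Sequence[str],
    segment_len: int,
    summary_token: str = "<SUM>",
    num_summary: int = 1,
) -> List[str]:
    """Insert summary tokens after every full segment of `segment_len` inputs.

    A trailing partial segment is left without a summary, since it has not
    yet reached the summarization period.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), segment_len):
        segment = list(tokens[start:start + segment_len])
        out.extend(segment)
        if len(segment) == segment_len:
            out.extend([summary_token] * num_summary)
    return out
```

If later segments can rely on the summary positions instead of the raw observations that preceded them, the context needed at inference grows with the number of segments rather than the number of frames.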

中文标题/摘要

标题：Memo：用强化学习训练内存高效的具身智能体

为了使具身智能体能够在较长的时间跨度内有效运行，关键在于开发能够形成并访问记忆的模型，使其始终把握所处环境的上下文。在当前为具身序列决策任务训练基于 Transformer 的策略的范式中，视觉输入往往会超出 Transformer 的上下文限制，而人类却能够保持并利用压缩为记忆的一生经验。原则上可以实现大幅压缩，因为输入中有很多内容是无关的，可以被抽象掉。然而，现有方法主要集中于具有固定大小记忆的循环模型，或依赖完整上下文的 Transformer。在本工作中，我们提出了 Memo，这是一种基于 Transformer 的架构和训练方案，用于在记忆密集型、长时程任务上进行强化学习（RL）。Memo 在训练过程中将周期性的摘要 token 与模型输入交错排列，从而实现记忆的创建和检索。我们在一个网格世界元强化学习基准和照片级真实感室内环境中的多目标导航任务上展示了 Memo 的有效性。Memo 的表现优于朴素的长上下文 Transformer 基线，同时在计算和存储上更高效。此外，Memo 在推理时能更好地泛化到更长的上下文，并且在必须截断历史上下文以满足推理约束的流式设置中保持鲁棒。

Summary / 总结

This paper addresses the challenge of training embodied agents to operate effectively over extended periods by developing a memory-efficient transformer-based model called Memo. Memo interleaves summarization tokens with model inputs during training to create and retrieve memories, allowing the model to handle long-horizon tasks more efficiently. The model outperforms naive long-context transformer baselines and demonstrates better generalization and robustness in streaming settings.

本文旨在通过提出 Memo 来应对训练具身智能体长时间有效运行的挑战。Memo 是一种基于 Transformer 的架构，结合了记忆的创建与检索，在训练过程中将摘要 token 与模型输入交错排列。该方法的表现优于朴素的长上下文 Transformer 基线，同时在计算和存储上更高效，并且在更长的上下文和流式设置下具有更好的泛化能力和鲁棒性。

Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks

Authors: G. Svistunov, A. Akhtarshenas, D. López-Pérez, M. Giordani, G. Geraci, H. Yanikomeroglu

First: 2025-10-22T16:22:31+00:00 · Latest: 2025-10-22T16:22:31+00:00

Comments: 30 pages. This work has been submitted to IEEE Communications Surveys & Tutorials (under review)

Abstract

High-altitude platform stations (HAPS) are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage and low-latency, energy-efficient broadband communications with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration and highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.

中文标题/摘要

标题：连接地球与太空：面向非地面网络的HAPS综述

高空平台站（HAPS）正在成为6G无线网络演进的关键使能者，连接地面和非地面基础设施。HAPS运行于平流层，能够提供广域覆盖、低时延、高能效的宽带通信，并具有灵活的部署选项，适用于多种应用。本综述全面概述了HAPS在6G生态系统中的应用场景、技术和集成策略。文中讨论了HAPS在将连接扩展到服务不足的地区、支持动态回传、实现大规模物联网，以及为自主和沉浸式服务提供可靠低时延通信等方面的作用。本文回顾了地面与非地面网络集成的最新架构，并重点介绍了近期的现场试验。此外，还考察了信道建模、AI驱动的资源分配、干扰控制、移动性管理和高能效通信等关键使能技术。本文还概述了开放的研究挑战。通过弥补文献中的现有空白，本综述将HAPS定位为全球一体化、具有韧性且可持续的6G网络的基础组成部分。

Summary / 总结

This survey explores the role of High-Altitude Platform Stations (HAPS) in 6G networks, focusing on their ability to bridge terrestrial and non-terrestrial infrastructures. It discusses HAPS use cases, technologies, and integration strategies, highlighting their potential for extending connectivity, supporting dynamic backhauling, and enabling massive IoT. The survey reviews state-of-the-art architectures, recent field trials, and key enabling technologies such as channel modeling and energy-efficient communications, while also identifying open research challenges.

这篇综述探讨了高空平台站（HAPS）在6G网络中的作用，重点在于它们如何连接地面和非地面基础设施。它讨论了HAPS的应用案例、技术及其集成策略，强调了它们在扩展连接性、支持动态回程和实现大规模物联网方面的潜力。综述还回顾了最新的架构、现场试验以及关键使能技术，如信道建模和高效通信，并指出了开放的研究挑战。

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

Authors: Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, Rodrigo Fonseca

First: 2025-02-02T22:10:40+00:00 · Latest: 2025-10-22T16:21:57+00:00

Comments: Published at ACM SoCC 2025; 14 pages, 20 figures

Abstract

Large multimodal models (LMMs) demonstrate impressive capabilities in understanding images, videos, and audio beyond text. However, efficiently serving LMMs in production environments poses significant challenges due to their complex architectures and heterogeneous characteristics across their multi-stage inference pipelines. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, across six representative open-source models, revealing key systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions and bursty traffic patterns. Based on these insights, we propose ModServe, a modular LMM serving system that decouples stages for independent optimization and adaptive scaling. ModServe dynamically reconfigures stages and handles bursty traffic with modality-aware scheduling and autoscaling to meet tail latency SLOs while minimizing costs. ModServe achieves 3.3-5.5x higher throughput (leading to 25-41.3% cost saving) while meeting SLOs on a 128-GPU cluster with production traces.

中文标题/摘要

标题：ModServe：针对可扩展多模态模型服务的模态和阶段感知资源分解

大型多模态模型（LMMs）在理解图像、视频和音频方面展示了令人印象深刻的性能，超越了文本。然而，在生产环境中高效地服务LMMs由于其复杂的架构和多阶段推理管道中的异构特性，带来了重大挑战。我们首次对两种主要的LMM架构——解码器仅和交叉注意机制——进行了全面的系统分析，涵盖了六个代表性开源模型，揭示了关键的系统设计启示。我们还对生产中的LMM推理跟踪进行了深入分析，发现了独特的负载特征，包括可变的、重尾请求分布和突发流量模式。基于这些见解，我们提出了ModServe，这是一种模块化的LMM服务系统，将阶段解耦以独立优化和自适应扩展。ModServe动态重新配置阶段，使用模态感知调度和自动扩展来处理突发流量，以满足尾部延迟SLOs的同时最小化成本。ModServe在128块GPU集群上使用生产跟踪实现了3.3-5.5倍的吞吐量（导致25-41.3%的成本节省），同时满足SLOs。

Summary / 总结

The paper addresses the challenges of serving large multimodal models (LMMs) in production environments by analyzing their architectures and inference traces. It proposes ModServe, a modular serving system that decouples stages for independent optimization and adaptive scaling, which dynamically reconfigures stages and uses modality-aware scheduling to handle bursty traffic, achieving higher throughput and cost savings while meeting service level objectives. ModServe demonstrates a 3.3-5.5x higher throughput and 25-41.3% cost reduction on a 128-GPU cluster.

论文通过分析大型多模态模型（LMMs）的架构和推理模式，解决其在生产环境中的高效服务问题。它提出了ModServe，一个模块化的服务系统，将阶段解耦以独立优化和动态扩展，实现了最高5.5倍的吞吐量提升和41.3%的成本节约，同时在128块GPU集群上满足服务级别目标。

Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series

Authors: Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier

First: 2025-10-22T16:17:29+00:00 · Latest: 2025-10-22T16:17:29+00:00

Abstract

We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce Enhanced TimeAutoDiff, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation ("TRTS gap") by over 70%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72-84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
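The TRTS gap can be computed directly from model scores on the two test sets. The sketch below assumes the common reading of the acronym (a model trained on real data, tested once on synthetic data and once on held-out real data) and takes the gap as the absolute AUROC difference. The function names `auroc` and `trts_gap` are illustrative, and AUROC is computed with the pairwise (Mann-Whitney) formulation to keep the example dependency-free apart from NumPy.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def trts_gap(real_labels, real_scores, synth_labels, synth_scores):
    """Absolute AUROC difference between synthetic-test and real-test evaluation."""
    return abs(auroc(synth_labels, synth_scores) - auroc(real_labels, real_scores))
```

A small gap means that evaluating on the synthetic cohort gives nearly the same AUROC as evaluating on real patients, which is the property that makes large synthetic cohorts usable for subgroups whose real test sets are tiny.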

中文标题/摘要

标题：通过生成合成医学时间序列数据实现精细亚组水平模型评估

我们提出了一种新颖的框架，利用合成的ICU时间序列数据，不仅用于训练，还用于在总体层面和细粒度的人口统计亚组内对预测模型进行严格且可信的评估。在已有的基于扩散和VAE的生成器（TimeDiff、HealthGen、TimeAutoDiff）的基础上，我们提出了 Enhanced TimeAutoDiff，它在潜在扩散目标中加入了分布对齐惩罚。我们在MIMIC-III和eICU上，针对24小时死亡率和二分类住院时长任务，对所有模型进行了广泛的基准测试。结果显示，Enhanced TimeAutoDiff 将“真实训练、合成测试”与“真实训练、真实测试”两种评估之间的差距（“TRTS差距”）缩小了70%以上，达到 $\Delta_{TRTS} \leq 0.014$ AUROC，同时保持了训练效用（$\Delta_{TSTR} \approx 0.01$）。至关重要的是，对于32个交叉亚组，与小规模真实测试集相比，大规模合成队列将亚组层面的AUROC估计误差最多降低了50%，并在72%至84%的亚组中优于小规模真实测试集。这项工作为重症监护中可信、精细的模型评估提供了一条实用且保护隐私的路线，使得无需暴露敏感的EHR数据即可对不同患者群体进行稳健可靠的性能分析，从而有助于提升医疗AI的整体可信度。

Summary / 总结

The research develops a framework for evaluating predictive models at both the population and subgroup levels using synthetic ICU time-series data. It introduces Enhanced TimeAutoDiff, which improves on existing generators by adding distribution-alignment penalties to the latent diffusion objective. Experiments on MIMIC-III and eICU show that Enhanced TimeAutoDiff reduces the TRTS gap by over 70% while preserving training utility. Across 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets and outperform them in 72-84% of subgroups. This enables granular model evaluation without exposing sensitive EHR data, strengthening the trustworthiness of medical AI.

研究旨在利用合成ICU时间序列数据，在总体和亚组层面评估预测模型。研究引入了Enhanced TimeAutoDiff，该方法在现有生成器的基础上加入了分布对齐惩罚。实验结果表明，Enhanced TimeAutoDiff将TRTS差距减少了70%以上，同时保持了训练效用；在32个交叉亚组中，与小规模真实测试集相比，大型合成队列将亚组水平的AUROC估计误差最多降低50%，并在72-84%的亚组中表现更优。这项工作无需暴露敏感的EHR数据即可对不同患者群体进行稳健、可靠的性能分析，从而增强了医疗AI的可信度。
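
The TRTS gap quoted above can be illustrated with a minimal sketch. It assumes the gap is the absolute AUROC difference between testing one real-trained model on synthetic data and on real data; the function names `auroc` and `trts_gap` and the toy scores are hypothetical and are not taken from the paper.

```python
def auroc(labels, scores):
    """AUROC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def trts_gap(real_labels, real_scores, synth_labels, synth_scores):
    """Absolute AUROC difference between a synthetic and a real test set.

    Assumption: both score lists come from the same model trained on real data.
    """
    return abs(auroc(synth_labels, synth_scores) - auroc(real_labels, real_scores))


# Toy, made-up predictions purely for illustration (not results from the paper).
real_y = [1, 1, 0, 0]
real_s = [0.9, 0.4, 0.5, 0.1]
synth_y = [1, 1, 0, 0]
synth_s = [0.8, 0.7, 0.3, 0.2]
gap = trts_gap(real_y, real_s, synth_y, synth_s)
```

A small gap means the synthetic test set ranks the model much as the real one does, which is the property the abstract reports as $\Delta_{TRTS} \leq 0.014$.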

WikiVideo: Article Generation from Multiple Videos

Authors: Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme

First: 2025-04-01T16:22:15+00:00 · Latest: 2025-10-22T16:17:16+00:00

Comments: Repo can be found here: https://github.com/alexmartin1722/wikivideo

Abstract

We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events -- from natural disasters to political elections -- where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.

中文标题/摘要

标题：WikiVideo：基于多个视频的文章生成

我们提出了有据文章生成（grounded article generation）任务，旨在根据关于现实世界事件（从自然灾害到政治选举）的多个不同视频生成维基百科风格的文章，且文章中的所有信息都有视频证据支持。视频是检索增强生成（RAG）的直观来源，但大多数当代 RAG 工作流主要集中在文本上，而现有的视频摘要方法则侧重于低层次的场景理解而非高层次的事件语义。为弥合这一差距，我们引入了 WikiVideo，这是一个由专家撰写的文章和密集标注的视频组成的基准，这些视频为文章中的论断提供证据，有助于将视频整合进 RAG 管道，并使创建基于多模态来源的深入内容成为可能。我们还提出了协作文章生成（CAG），这是一种从多个视频创建文章的新颖交互式方法。CAG 利用 r1 风格推理模型与 VideoLLM 之间的迭代交互，对目标事件做出比单独使用 VideoLLM 更高层次的推断，因为后者往往拘泥于低层次的视觉特征。我们在 oracle 检索和 RAG 两种设置下对最先进的 VideoLLM 和 CAG 进行了基准测试，发现 CAG 始终优于其他方法，同时也揭示了值得未来研究的有趣方向。

3D Visual Illusion Depth Estimation

Authors: Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia

Venue: NeurIPS 2025

First: 2025-05-19T12:51:03+00:00 · Latest: 2025-10-22T16:13:49+00:00

Comments: NeurIPS 2025, Project: https://github.com/YaoChengTang/3D-Visual-Illusion-Depth-Estimation

Abstract

3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from the vision language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.

中文标题/摘要

标题：3D视觉错觉深度估计

3D视觉错觉是一种知觉现象，通过操纵二维平面来模拟三维空间关系，使平面的艺术作品或物体在人类视觉系统中看起来具有三维效果。在本文中，我们揭示了机器视觉系统（包括单目和双目深度估计）同样会被3D视觉错觉严重欺骗。为了探索和分析3D视觉错觉对深度估计的影响，我们收集了一个包含近3000个场景和20万张图像的大规模数据集，用于训练和评估最先进的单目和双目深度估计方法。我们还提出了一种3D视觉错觉深度估计框架，该框架利用视觉语言模型的常识，自适应地融合来自双目视差和单目深度的深度信息。实验表明，最先进的单目、双目和多视图深度估计方法都会被各种3D视觉错觉欺骗，而我们的方法则达到了最先进的性能。

Summary / 总结

This paper investigates how 3D visual illusions affect machine depth estimation. The authors collect a large dataset of almost 3k scenes and 200k images containing 3D visual illusions and use it to train and evaluate state-of-the-art (SOTA) monocular and binocular depth estimation methods. The results show that SOTA monocular, binocular, and multi-view methods are all fooled by various 3D visual illusions, whereas the proposed framework, which uses common sense from a vision language model to adaptively fuse depth from binocular disparity and monocular depth, achieves SOTA performance.

该研究通过收集包含近3000个场景和20万张图像的大规模数据集，探讨了3D视觉错觉对机器深度估计的影响。研究发现，单目和双目深度估计方法都会被3D视觉错觉欺骗。作者提出了一种框架，利用视觉语言模型中的常识来自适应地融合来自双目视差和单目深度的信息，实现了最先进的深度估计性能。

GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks

Authors: Varvara Krechetova, Denis Kochedykov

First: 2025-03-23T16:20:14+00:00 · Latest: 2025-10-22T16:12:30+00:00

Comments: Github with code and benchmark set: https://github.com/Solirinai/GeoBenchX

Abstract

This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks in four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test rejection accuracy. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference solutions. Results show that o4-mini and Claude 3.5 Sonnet achieve the best overall performance. OpenAI's GPT-4.1 and GPT-4o and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in identifying unsolvable tasks. Claude Sonnet 4, due to its preference to provide any solution rather than reject a task, proved to be less accurate. We observe significant differences in token usage, with Anthropic models consuming more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources (available at https://github.com/Solirinai/GeoBenchX), providing one more standardized method for the ongoing evaluation of LLMs for GeoAI.

中文标题/摘要

标题：GeoBenchX：评估大型语言模型在多步骤地理空间任务中调用工具的能力

本文建立了一个基准，用于评估大型语言模型（LLMs）在与商业GIS从业者相关的多步骤地理空间任务中调用工具的能力。我们使用一个配备了23个地理空间函数的简单工具调用代理，评估了八种商业LLMs（Claude Sonnet 3.5和4、Claude Haiku 3.5、Gemini 2.0 Flash、Gemini 2.5 Pro Preview、GPT-4o、GPT-4.1和o4-mini）。我们的基准包括四个复杂度递增的类别中的任务，既有可解任务也有故意设置的不可解任务，以测试拒绝准确率。我们开发了一种以LLM作为裁判（LLM-as-Judge）的评估框架，将代理解决方案与参考解决方案进行比较。结果显示o4-mini和Claude 3.5 Sonnet的整体性能最佳，OpenAI的GPT-4.1、GPT-4o和Google的Gemini 2.5 Pro Preview差距不大，但后两者在识别不可解任务方面更有效率。Claude Sonnet 4由于倾向于给出任意解决方案而不是拒绝任务，准确性较低。我们观察到token使用量存在显著差异，Anthropic模型消耗的token比竞争对手多。常见的错误包括误解几何关系、依赖过时的知识以及低效的数据操作。由此产生的基准集、评估框架和数据生成管道作为开源资源发布（可在https://github.com/Solirinai/GeoBenchX 获取），为持续评估LLMs在GeoAI中的应用又提供了一种标准化方法。

Summary / 总结

This paper introduces GeoBenchX, a benchmark for evaluating LLMs on multi-step geospatial tasks. Eight commercial LLMs were tested using a tool-calling agent with 23 geospatial functions across four task categories of increasing complexity. The results show that o4-mini and Claude 3.5 Sonnet performed best overall, with OpenAI's GPT-4.1 and GPT-4o and Google's Gemini 2.5 Pro Preview close behind; of these, GPT-4o and Gemini 2.5 Pro Preview were more efficient at identifying unsolvable tasks. Common errors included misinterpreting geometrical relationships, relying on outdated knowledge, and inefficient data manipulation.

本文介绍了GeoBenchX，这是一个用于评估LLM解决多步骤地理空间任务能力的基准。使用配备23个地理空间函数的工具调用代理对八种商业LLM进行了测试，涵盖四个复杂度递增的任务类别。结果显示，o4-mini和Claude 3.5 Sonnet的整体表现最佳，OpenAI的GPT-4.1、GPT-4o和Google的Gemini 2.5 Pro Preview紧随其后，其中GPT-4o和Gemini 2.5 Pro Preview在识别不可解任务方面更为高效。常见的错误包括误解几何关系、依赖过时的知识以及数据操作低效。

Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

Authors: Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun

First: 2025-05-27T05:55:45+00:00 · Latest: 2025-10-22T16:07:38+00:00

Abstract

In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc., inside a theory-driven framework which we name Uni-Instruct. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of 1.46 for unconditional generation and 1.38 for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of 1.02, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.

中文标题/摘要

标题：Uni-Instruct：通过统一扩散散度指令实现一步扩散模型

在本文中，我们在一个理论驱动的框架内统一了10多种现有的一步扩散蒸馏方法，如Diff-Instruct、DMD、SIM、SiD、$f$-distill等，并将该框架命名为Uni-Instruct。Uni-Instruct 的动机来自我们提出的 $f$-散度族的扩散扩展理论。随后我们引入了若干关键理论，克服了原始扩展 $f$-散度难以处理的问题，从而得到一个等价但可处理的损失，通过最小化扩展的 $f$-散度族来有效训练一步扩散模型。Uni-Instruct 带来的这一新颖统一不仅提供了新的理论贡献，有助于从更高层次理解现有方法，还带来了最先进的一步扩散生成性能。在CIFAR10生成基准上，Uni-Instruct 取得了创纪录的 Frechet Inception Distance (FID)：无条件生成为1.46，有条件生成为1.38。在ImageNet-$64\times 64$生成基准上，Uni-Instruct 达到了新的最先进的一步生成FID值1.02，优于其79步教师扩散模型，改进幅度达1.33（1.02 vs 2.35）。我们还将Uni-Instruct应用于文本到3D生成等更广泛的任务。对于文本到3D生成，Uni-Instruct 取得了不错的结果，在生成质量和多样性方面均略优于SDS和VSD等先前方法。Uni-Instruct 扎实的理论和实证贡献有望帮助未来关于一步扩散蒸馏和扩散模型知识迁移的研究。

Summary / 总结

Uni-Instruct unifies over ten one-step diffusion distillation approaches into a theory-driven framework, motivated by the diffusion expansion theory of the $f$-divergence family. It introduces key theories to overcome the intractability issue of the expanded $f$-divergence, leading to an equivalent yet tractable loss for training one-step diffusion models. On CIFAR10, Uni-Instruct achieves record-breaking FID values of 1.46 for unconditional and 1.38 for conditional generation. On ImageNet-$64\times 64$, it outperforms a 79-step teacher model with a significant improvement margin of 1.33, achieving a new state-of-the-art FID of 1.02. For text-to-3D generation, Uni-Instruct slightly outperforms previous methods in terms of generation quality and diversity.

Uni-Instruct 将超过十种的一步扩散蒸馏方法统一到一个理论驱动的框架中，动机来自于 $f$-散度族的扩散扩展理论。它引入了可计算的损失函数，有效训练了一步扩散模型，取得了最先进的性能。在 CIFAR10 上，Uni-Instruct 实现了无条件生成的 FID 分数为 1.46 和有条件生成的 1.38，而在 ImageNet-64x64 上，它实现了新的一步生成 FID 为 1.02，显著优于 79 步教师模型。
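
For readers unfamiliar with the $f$-divergence family mentioned above, the following minimal sketch evaluates the standard discrete definition $D_f(p \| q) = \sum_x q(x)\, f(p(x)/q(x))$ for three common generator functions. It illustrates only the textbook family, not the paper's diffusion expansion or its tractable loss; all function names are hypothetical.

```python
import math


def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for discrete p, q with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def kl_generator(t):
    # f(t) = t log t gives the forward KL divergence KL(p || q); 0 log 0 is taken as 0.
    return t * math.log(t) if t > 0 else 0.0


def reverse_kl_generator(t):
    # f(t) = -log t gives the reverse KL divergence KL(q || p).
    return -math.log(t)


def total_variation_generator(t):
    # f(t) = |t - 1| / 2 gives the total variation distance.
    return 0.5 * abs(t - 1.0)


# Toy distributions chosen only for illustration.
p = [0.5, 0.5]
q = [0.9, 0.1]
kl = f_divergence(p, q, kl_generator)
reverse_kl = f_divergence(p, q, reverse_kl_generator)
tv = f_divergence(p, q, total_variation_generator)
```

Swapping the generator `f` changes which divergence is minimized while the surrounding code stays the same, which is the sense in which one family can cover several objectives.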

LyTimeT: Towards Robust and Interpretable State-Variable Discovery

Authors: Kuai Yu, Crystal Su, Xiang Liu, Judah Goldfeder, Mingyuan Shao, Hod Lipson

First: 2025-10-22T16:03:10+00:00 · Latest: 2025-10-22T16:03:10+00:00

Abstract

Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.

中文标题/摘要

标题：LyTimeT：迈向稳健且可解释的状态变量发现

从高维视频中提取系统的真实动力学变量具有挑战性，因为存在背景运动、遮挡和纹理变化等干扰性视觉因素。我们提出了LyTimeT，一个用于可解释变量提取的两阶段框架，它学习动力学系统稳健且稳定的潜在表示。在第一阶段，LyTimeT采用基于TimeSformer的时空自编码器，利用全局注意力聚焦于与动态相关的区域，同时抑制无关变化，从而实现抗干扰的潜在状态学习和准确的长时程视频预测。在第二阶段，我们探测学习到的潜在空间，使用线性相关分析选出最具物理意义的维度，并使用基于李雅普诺夫的稳定性正则项来细化转移动力学，以强制收缩并减少多步推演（roll-out）过程中的误差累积。在五个合成基准和四个真实世界动力学系统（包括混沌现象）上的实验表明，LyTimeT的互信息和固有维度估计值最接近真实值，在背景扰动下保持不变，并且在基于CNN（TIDE）和仅基于Transformer的基线中取得了最低的分析均方误差。我们的结果表明，将时空注意力与稳定性约束相结合，可以得到不仅准确而且具有物理可解释性的预测模型。

Summary / 总结

LyTimeT extracts robust and interpretable state variables from high-dimensional video despite distractions such as background motion and occlusions. It uses a two-phase framework: Phase 1 employs a spatio-temporal TimeSformer-based autoencoder to learn distraction-robust latent states, and Phase 2 selects the most physically meaningful latent dimensions via linear correlation analysis and refines the transition dynamics with a Lyapunov-based stability regularizer. Experiments show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth and the lowest analytical mean squared error among CNN-based and transformer-only baselines, while remaining invariant under background perturbations.

LyTimeT 是一个两阶段框架，旨在通过解决视觉干扰来从高维视频中提取稳健且可解释的潜在状态。第一阶段使用时空 TimeSformer 自编码器来学习抗干扰的潜在表示。第二阶段选择物理上最相关的维度，并使用基于李雅普诺夫的稳定性正则化来细化动态。实验表明，LyTimeT 在互信息、固有维度估计和分析均方误差方面优于基于 CNN 和仅基于 Transformer 的基线，并且对背景变化保持不变。
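
The Phase 2 step of selecting latent dimensions by linear correlation can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes a reference physical signal is available and ranks dimensions by absolute Pearson correlation with it. The names `pearson` and `rank_latent_dims` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_latent_dims(latents, reference):
    """Return latent dimension indices sorted by |correlation| with the reference.

    latents: one list of latent values per time step.
    reference: one reference value per time step (assumed to be available).
    """
    n_dims = len(latents[0])
    scored = []
    for d in range(n_dims):
        column = [row[d] for row in latents]
        scored.append((abs(pearson(column, reference)), d))
    # Strongest correlation first; ties broken by the lower dimension index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [d for _, d in scored]


# Toy trajectory: dim 0 tracks the reference linearly, dim 1 alternates, dim 2 is constant.
angle = [0.0, 0.1, 0.2, 0.3, 0.4]
latents = [[2.0 * a + 1.0, (-1.0) ** i, 0.5] for i, a in enumerate(angle)]
order = rank_latent_dims(latents, angle)
```

Keeping only the top-ranked dimensions would then give a low-dimensional state on which the transition dynamics can be refined.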

SEMPO: Lightweight Foundation Models for Time Series Forecasting

Authors: Hui He, Kun Yi, Yuanchi Ma, Qi Zhang, Zhendong Niu, Guansong Pang

Venue: NeurIPS 2025

First: 2025-10-22T15:58:44+00:00 · Latest: 2025-10-22T15:58:44+00:00

Comments: Accepted by NeurIPS 2025

Abstract

The recent boom of large pre-trained models witnesses remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting. Concretely, SEMPO comprises two key modules: 1) energy-aware SpEctral decomposition module, that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) Mixture-of-PrOmpts enabled Transformer, that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.

中文标题/摘要

标题：SEMPO：时间序列预测的轻量级基础模型

近年来，大规模预训练模型的兴起在时间序列预测的基础模型（FMs）开发方面取得了显著的成功。尽管在各种下游预测任务中表现出色，但现有时间序列FMs具有庞大的网络架构，并需要在大规模数据集上进行大量预训练，这严重阻碍了它们在资源受限环境中的部署。为应对通用性与经济性之间日益加剧的矛盾，我们提出了SEMPO，一种新型轻量级基础模型，它只需在相对较小规模的数据上进行预训练，却仍能表现出强大的通用时间序列预测能力。具体而言，SEMPO包含两个关键模块：1）能量感知频谱分解模块，它不仅对高能量频率信号建模，还对被当前方法忽略的低能量但具有信息量的频率信号建模，从而显著提高了预训练数据的利用效率；2）支持混合提示（Mixture-of-PrOmpts）的Transformer，它通过小规模的数据集特定提示学习异质的时间模式，并自适应地将时间序列token路由到基于提示的专家，以实现跨不同数据集和领域的参数高效模型适应。借助这些模块，SEMPO显著减少了预训练数据规模和模型大小，同时实现了强大的泛化能力。在涵盖16个数据集的两个大规模基准上进行的广泛实验表明，与最先进的方法相比，SEMPO在零样本和少样本预测场景中均表现出优越的性能。代码和数据可在https://github.com/mala-lab/SEMPO获取。

Summary / 总结

SEMPO is a lightweight foundation model for time series forecasting that reduces pre-training data scale and model size while maintaining strong generalization. It includes an energy-aware spectral decomposition module and a Mixture-of-PrOmpts enabled Transformer, which improve the utilization of pre-training data and enable parameter-efficient model adaptation. Experiments on two large-scale benchmarks show that SEMPO outperforms state-of-the-art methods in both zero-shot and few-shot forecasting scenarios.

SEMPO 是一种轻量级的时间序列预测基础模型，旨在解决现有模型网络架构庞大、预训练数据需求高的问题。它包含能量感知的频谱分解模块和支持混合提示（Mixture-of-PrOmpts）的 Transformer，这些模块提高了预训练数据的利用效率，并实现了参数高效的模型适应。SEMPO 减少了预训练数据规模和模型大小，同时保持了强大的泛化能力。在涵盖 16 个数据集的两个大规模基准上，SEMPO 在零样本和少样本预测场景中均优于最先进的方法。
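
The idea of separating high-energy from low-energy frequency components can be sketched with a plain discrete Fourier transform. This is a simplified illustration, not SEMPO's module: the partition rule (take the strongest bins until they cover a fixed share of total energy) and the `energy_share` parameter are assumptions, and all names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def split_by_energy(signal, energy_share=0.9):
    """Partition non-negative frequency bins into high- and low-energy sets.

    Bins join the high-energy set in decreasing order of |X_k|^2 until they
    cover `energy_share` of the total energy; the remaining bins are low-energy.
    """
    spectrum = dft(signal)
    half = len(signal) // 2 + 1  # bins 0..n/2 suffice for a real signal
    energy = [abs(spectrum[k]) ** 2 for k in range(half)]
    total = sum(energy)
    order = sorted(range(half), key=lambda k: energy[k], reverse=True)
    high, covered = [], 0.0
    for k in order:
        if total == 0 or covered >= energy_share * total:
            break
        high.append(k)
        covered += energy[k]
    low = [k for k in range(half) if k not in high]
    return sorted(high), low


# Toy series: a strong 2-cycle component plus a weak 7-cycle component.
n = 32
signal = [
    math.sin(2 * math.pi * 2 * t / n) + 0.1 * math.sin(2 * math.pi * 7 * t / n)
    for t in range(n)
]
high_bins, low_bins = split_by_energy(signal)
```

In this toy series the weak 7-cycle component lands in the low-energy set, which is the kind of signal the abstract describes as informative yet ignored by methods that keep only dominant frequencies.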