arXiv Paper Digest

Graph Representation Learning with Diffusion Generative Models

Authors: Daniel Wesego

First: 2025-01-22T07:12:10+00:00 · Latest: 2025-10-22T17:58:33+00:00

Abstract

Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation from the combination of the encoder's output and the decoder's first time step hidden embedding. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models

Summary

This paper applies diffusion models to graph-structured data, whose discrete nature calls for discrete diffusion processes rather than the continuous ones used in other domains. The author trains a discrete diffusion model within an autoencoder framework and extracts the graph representation from the combination of the encoder's output and the decoder's first-time-step hidden embedding. The results demonstrate the potential of discrete diffusion models for graph representation learning.

Learning Reward Machines from Partially Observed Policies

Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay

First: 2025-02-06T03:48:25+00:00 · Latest: 2025-10-22T17:55:33+00:00

Abstract

Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
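To make the central object concrete, the following is a minimal sketch of a reward machine: a finite-state machine whose transitions are triggered by atomic propositions and emit rewards. It is an illustration only, not the authors' implementation. The class name `RewardMachine`, the state names `u0` and `u1`, the propositions `a` and `b`, and the self-loop default for unlisted transitions are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Finite-state machine that emits a reward on each labelled transition."""

    initial: str
    # (machine state, atomic proposition) -> (next machine state, reward)
    transitions: dict

    def run(self, labels):
        """Feed a sequence of atomic propositions; return final state and rewards."""
        state, rewards = self.initial, []
        for label in labels:
            # Unlisted pairs self-loop with zero reward (an assumption of this sketch).
            state, reward = self.transitions.get((state, label), (state, 0.0))
            rewards.append(reward)
        return state, rewards


# Hypothetical task: observe "a" first, then "b"; only then is a reward paid.
rm = RewardMachine(
    initial="u0",
    transitions={("u0", "a"): ("u1", 0.0), ("u1", "b"): ("u0", 1.0)},
)
final_state, rewards = rm.run(["b", "a", "a", "b"])
```

Because the reward depends on the machine state as well as the current proposition, the same proposition `b` earns nothing at the start of the sequence and 1.0 at the end. This history dependence is why the policy in the abstract is indexed by sequences of atomic propositions rather than by MDP states alone.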

Summary

This paper studies inverse reinforcement learning when the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the MDP state. It introduces the prefix tree policy, which associates a distribution over actions with each MDP state and each attainable finite sequence of atomic propositions, and characterizes the equivalence class of reward machines that can be identified from it. A SAT-based algorithm provably recovers the reward machine up to this equivalence class when the prefix tree policy is known to a sufficient finite depth, which depends on the number of MDP states and an upper bound on the number of reward machine states. The results extend to the case where only demonstrations from an optimal policy are available, and the approach is demonstrated on discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice.

Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li

First: 2025-10-20T12:54:32+00:00 · Latest: 2025-10-22T17:54:43+00:00

Abstract

We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
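The boundary versus intermediate scoring rule described in the abstract can be sketched as a prompt builder. This is a hedged illustration, not the authors' prompt: the function name `build_scoring_prompt`, the prompt wording, the 1 to 10 scale, and the sample scene texts are assumptions made for this example.

```python
def build_scoring_prompt(descriptions, summaries, index, rubric):
    """Build the text an LLM would score for one scene.

    Boundary scenes (first and last) are scored from their own description only;
    intermediate scenes also receive short summaries of both neighbours.
    """
    last = len(descriptions) - 1
    lines = [f"Rubric: {rubric}", f"Scene: {descriptions[index]}"]
    if 0 < index < last:
        lines.append(f"Previous scene summary: {summaries[index - 1]}")
        lines.append(f"Next scene summary: {summaries[index + 1]}")
    # The 1-10 scale is a placeholder chosen for this sketch.
    lines.append("Score this scene from 1 to 10.")
    return "\n".join(lines)


# Hypothetical three-scene video.
descriptions = ["A chef enters the kitchen.", "The chef chops onions.", "The dish is served."]
summaries = ["chef arrives", "chopping onions", "dish served"]
rubric = "thematic relevance, action detail, narrative progression"

opening_prompt = build_scoring_prompt(descriptions, summaries, 0, rubric)
middle_prompt = build_scoring_prompt(descriptions, summaries, 1, rubric)
```

Only the prompt changes between the two cases, which is consistent with the abstract's claim that the approach needs no parameter tuning.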

Summary

The paper introduces a rubric-guided, pseudo-labeled, and prompt-driven framework for zero-shot video summarization that pairs large language models with structured semantic reasoning. A small set of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics that define the evaluation dimensions. During inference, boundary scenes are scored from their own descriptions alone, while intermediate scenes are scored together with concise summaries of adjacent segments, balancing local salience with global coherence. The method reaches F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37 points, respectively.

Is This Tracker On? A Benchmark Protocol for Dynamic Tracking

Authors: Ilona Demler, Saumya Chauhan, Georgia Gkioxari

First: 2025-10-22T17:53:56+00:00 · Latest: 2025-10-22T17:53:56+00:00

Comments: Project page: https://glab-caltech.github.io/ITTO/

Abstract

We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes -- factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.

Summary

ITTO is a new benchmark suite for evaluating and diagnosing point tracking methods. Its videos come from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline, and it captures the motion complexity, occlusion patterns, and object diversity of real-world scenes. An analysis of state-of-the-art trackers shows that they struggle with these challenges, particularly re-identifying points after occlusion, which points to the need for new modeling approaches. The benchmark is intended as a testbed for developing more robust point tracking algorithms.

olmOCR 2: Unit Test Rewards for Document OCR

Authors: Jake Poznanski, Luca Soldaini, Kyle Lo

First: 2025-10-22T17:53:02+00:00 · Latest: 2025-10-22T17:53:02+00:00

Comments: https://olmocr.allen.ai/

Abstract

We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
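The idea of using binary unit tests as a verifiable reward can be sketched as follows. This is an illustration under stated assumptions, not the olmOCR 2 training code: the helper names (`unit_test_reward`, `contains`, `appears_before`), the specific test types, and the choice to aggregate results as the fraction of tests passed are all hypothetical.

```python
def unit_test_reward(ocr_text, tests):
    """Return the fraction of binary tests that one OCR output passes."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(ocr_text)) / len(tests)


def contains(snippet):
    """Test factory: passes when the snippet is present in the output."""
    return lambda text: snippet in text


def appears_before(first, second):
    """Test factory: passes when `first` occurs before `second` (reading order)."""
    def check(text):
        i, j = text.find(first), text.find(second)
        return i != -1 and j != -1 and i < j
    return check


# Hypothetical test cases extracted from a synthetic document's known source.
tests = [
    contains("Total: 42"),
    appears_before("Introduction", "Methods"),
    contains("E = mc^2"),
]
output = "Introduction\n...\nMethods\nTotal: 42"
reward = unit_test_reward(output, tests)  # two of the three tests pass
```

Each test returns a plain pass or fail, so the reward can be checked mechanically against the known ground truth without a learned judge.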

Summary

olmOCR 2 is an OCR system for converting digitized print documents, such as PDFs, into clean, naturally ordered plain text. It is powered by olmOCR-2-7B-1025, a 7B vision language model trained using reinforcement learning with verifiable rewards, where the rewards are a diverse set of binary unit tests. To scale test creation, the authors generate synthetic documents with challenging layouts, known ground-truth HTML source, and extracted test cases. The resulting model achieves state-of-the-art performance on olmOCR-Bench, with the largest gains over previous versions in math formula conversion, table parsing, and multi-column layouts. The model, data, and code are released under permissive open licenses.

Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

Venue: NeurIPS 2025

First: 2025-05-30T09:09:33+00:00 · Latest: 2025-10-22T17:51:21+00:00

Comments: Accepted by NeurIPS 2025

Abstract

Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits APIs are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.

Summary

This paper revisits the assumption that exact unlearning, which retrains the model from scratch without the target data, is the gold standard for mitigating privacy risks in large language models. In a setting where both the pre- and post-unlearning logits are exposed, it introduces a data extraction attack that uses signals from the pre-unlearning model to guide the post-unlearning model, combined with a token filtering strategy. The attack significantly improves extraction success rates on MUSE, TOFU, and WMDP, doubling performance in some cases, and is also effective on a simulated medical diagnosis dataset. The findings suggest that unlearning may paradoxically increase privacy leakage during deployment, so the authors advocate threat models that account for adversarial access to prior checkpoints as well as post-unlearning models.

Hubble: a Model Suite to Advance the Study of LLM Memorization

Authors: Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia

First: 2025-10-22T17:48:23+00:00 · Latest: 2025-10-22T17:48:23+00:00

Abstract

We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Authors: Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan

First: 2025-10-22T17:43:15+00:00 · Latest: 2025-10-22T17:43:15+00:00

Abstract

Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.

Summary

Pico-Banana-400K is a 400K-image dataset for instruction-based image editing that addresses the lack of large-scale, high-quality, openly accessible datasets built from real images. It is constructed by using Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection, with a fine-grained editing taxonomy and MLLM-based quality scoring to ensure coverage, content preservation, and instruction faithfulness. The dataset includes three specialized subsets: a 72K-example multi-turn collection for sequential editing, a 56K-example preference subset for alignment research and reward model training, and paired long-short editing instructions for instruction rewriting and summarization. It is intended as a resource for training and benchmarking text-guided image editing models.

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Authors: Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia

First: 2025-10-22T17:41:30+00:00 · Latest: 2025-10-22T17:41:30+00:00

Comments: Code: https://github.com/dvlab-research/Scaf-GRPO

Abstract

Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
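The advantage collapse described in the abstract can be reproduced with a few lines of arithmetic. The sketch below assumes the commonly used group-relative normalization, in which each sampled response's reward is standardized by the mean and standard deviation of its group; the function name `group_relative_advantages` and the `eps` stabilizer are choices made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A problem far beyond the model: every sampled answer is wrong.
cliff = group_relative_advantages([0, 0, 0, 0])

# One correct answer in the group is enough to produce a learning signal.
mixed = group_relative_advantages([1, 0, 0, 0])
```

With all-zero rewards every advantage is exactly zero, so the problem contributes nothing to the policy gradient. As soon as one sample succeeds, for instance after a hint is injected into the prompt, the successful response receives a positive advantage and the failures a negative one.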

Summary

Scaf-GRPO is a progressive training framework that targets the "learning cliff" in reinforcement learning from verifiable rewards, where problems far beyond a model's capabilities yield only zero rewards and therefore zero advantage in GRPO. It provides minimal guidance only when the model's independent learning has plateaued: it first diagnoses learning stagnation and then injects tiered in-prompt hints, ranging from abstract concepts to concrete steps, so that the model can construct a valid solution by itself. On the AIME24 benchmark, Scaf-GRPO raises the pass@1 score of Qwen2.5-Math-7B by a relative 44.3% over a vanilla GRPO baseline.

Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models

Authors: Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li

First: 2025-10-22T17:38:35+00:00 · Latest: 2025-10-22T17:38:35+00:00

Abstract

Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.

Summary

The paper proposes CPL-NC, a lightweight Test-Time Adaptation framework for Vision-Language Models (VLMs) that improves generalization under distribution shift. It addresses prototype degradation in long-tailed distributions with a Class-Aware Prototype Cache Module, which adjusts per-class capacity based on test-time frequency and activation history, and addresses confusion between semantically similar classes with a Negative Contrastive Learning Mechanism that constrains hard visual-textual negatives. Optimization is asymmetric: only the textual prototypes are refined, while the visual features stay fixed as anchors. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods on both ResNet-50 and ViT-B/16 backbones.