arXiv Paper Digest

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Authors: Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen

First: 2025-10-21T17:59:41+00:00 · Latest: 2025-10-21T17:59:41+00:00

Abstract

Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.

Summary

This paper investigates how on-policy data mitigates catastrophic forgetting when language models are adapted to new tasks. Comparing supervised fine-tuning (SFT) and reinforcement learning (RL) across Llama and Qwen models on instruction following, general knowledge, and arithmetic reasoning, the study finds that RL forgets less while achieving comparable or higher target-task performance. The authors attribute this to RL's mode-seeking behavior, which stems from its use of on-policy data and keeps prior knowledge intact while the target task is learned. Further experiments indicate that on-policy data, rather than KL regularization or advantage estimation, accounts for this robustness. The findings suggest that approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data, is a practical way to mitigate forgetting.
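
The term "mode-seeking" can be illustrated with a small numerical sketch. This is a generic textbook-style toy, not the paper's mixture analysis or its experiments: every number below (mode locations, weights, grid sizes) is an assumption chosen for illustration. It fits a single unit-variance Gaussian q to a two-mode target p in two ways. Minimizing forward KL, KL(p || q), uses samples from the fixed target, which is how maximum-likelihood training such as SFT is commonly formalized. Minimizing reverse KL, KL(q || p), weights errors by the model's own samples, which is one common way to describe the on-policy flavor of RL objectives.

```python
import numpy as np

# Hypothetical 1-D toy: the target p is a two-mode Gaussian mixture.
x = np.linspace(-12.0, 12.0, 4801)  # integration grid
dx = x[1] - x[0]


def log_normal(x, mean, std=1.0):
    """Log-density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


# Target: 70% of the mass around -3, 30% around +3 (illustrative values).
log_p = np.logaddexp(np.log(0.7) + log_normal(x, -3.0),
                     np.log(0.3) + log_normal(x, 3.0))
p = np.exp(log_p)


def forward_kl(mu):
    """KL(p || q_mu): expectation under the fixed target (off-policy view)."""
    log_q = log_normal(x, mu)
    return np.sum(p * (log_p - log_q)) * dx


def reverse_kl(mu):
    """KL(q_mu || p): expectation under the model's own samples (on-policy view)."""
    log_q = log_normal(x, mu)
    q = np.exp(log_q)
    return np.sum(q * (log_q - log_p)) * dx


# Grid search over the mean of q; the variance is held fixed at 1.
candidates = np.linspace(-6.0, 6.0, 241)
mu_forward = candidates[np.argmin([forward_kl(m) for m in candidates])]
mu_reverse = candidates[np.argmin([reverse_kl(m) for m in candidates])]

print(f"forward-KL fit: mean = {mu_forward:.2f}")
print(f"reverse-KL fit: mean = {mu_reverse:.2f}")
```

Under these assumptions, the forward-KL fit should land between the two modes, near the mixture mean of -1.2, while the reverse-KL fit should sit on the heavier mode near -3. The first behavior spreads mass to cover everything in the data; the second commits to one mode and leaves the rest of the distribution alone, which is the sense of "mode-seeking" used in the summary above.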
