Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Authors: Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
First: 2025-10-21T17:59:41+00:00 · Latest: 2025-10-21T17:59:41+00:00
Abstract
Adapting language models (LMs) to new tasks via post-training carries the
risk of degrading existing capabilities -- a phenomenon classically known as
catastrophic forgetting. In this paper, toward identifying guidelines for
mitigating this phenomenon, we systematically compare the forgetting patterns
of two widely adopted post-training methods: supervised fine-tuning (SFT) and
reinforcement learning (RL). Our experiments reveal a consistent trend across
LM families (Llama, Qwen) and tasks (instruction following, general knowledge,
and arithmetic reasoning): RL leads to less forgetting than SFT while achieving
comparable or higher target task performance. To investigate the cause for this
difference, we consider a simplified setting in which the LM is modeled as a
mixture of two distributions, one corresponding to prior knowledge and the
other to the target task. We identify that the mode-seeking nature of RL, which
stems from its use of on-policy data, enables keeping prior knowledge intact
when learning the target task. We then verify this insight by demonstrating
that the use of on-policy data underlies the robustness of RL to forgetting in
practical settings, as opposed to other algorithmic choices such as the KL
regularization or advantage estimation. Lastly, as a practical implication, our
results highlight the potential of mitigating forgetting using approximately
on-policy data, which can be substantially more efficient to obtain than fully
on-policy data.
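
The mode-seeking versus mode-covering contrast behind this argument can be illustrated with a small numerical toy. This is not the paper's mixture setting or its training procedure. It assumes the common framing in which maximum-likelihood training on fixed data behaves like minimizing a forward KL divergence, while on-policy objectives behave like minimizing a reverse KL divergence; the abstract does not spell out that framing. The bimodal target, the single-Gaussian candidate family, and all names below (`normal_pdf`, `kl`, `candidate_means`) are hypothetical choices made for illustration.

```python
import numpy as np


def normal_pdf(x, mean, std=1.0):
    """Gaussian density evaluated on an array of points."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def normalize(weights):
    """Turn non-negative grid weights into a discrete distribution."""
    return weights / weights.sum()


def kl(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Discretize the real line so both divergences can be computed exactly.
x = np.linspace(-10.0, 10.0, 2001)

# Hypothetical bimodal target: two equally weighted modes at -3 and +3.
target = normalize(0.5 * normal_pdf(x, -3.0) + 0.5 * normal_pdf(x, 3.0))

# Candidate family: one unit-variance Gaussian whose mean we search over.
candidate_means = np.linspace(-5.0, 5.0, 101)
candidates = [normalize(normal_pdf(x, m)) for m in candidate_means]

# Forward KL(target || q): the mode-covering direction.
forward = [kl(target, q) for q in candidates]
# Reverse KL(q || target): the mode-seeking direction.
reverse = [kl(q, target) for q in candidates]

mu_forward = float(candidate_means[int(np.argmin(forward))])
mu_reverse = float(candidate_means[int(np.argmin(reverse))])

print("forward-KL best mean:", mu_forward)
print("reverse-KL best mean:", mu_reverse)
```

In this toy, the forward-KL fit places its mean between the two modes to cover both, while the reverse-KL fit commits to one mode and leaves the other untouched. That is only an analogy for how a mode-seeking update might shift mass toward a target task without disturbing mass elsewhere.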
Summary
This paper examines how on-policy data helps mitigate catastrophic forgetting when language models are adapted to new tasks. Comparing supervised fine-tuning (SFT) with reinforcement learning (RL) across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning), the study finds that RL forgets less while reaching comparable or higher target-task performance. The authors attribute this to RL's mode-seeking nature, which stems from its use of on-policy data and keeps prior knowledge intact while the target task is learned. They verify in practical settings that on-policy data, rather than other algorithmic choices such as KL regularization or advantage estimation, underlies RL's robustness to forgetting. The findings suggest that approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data, is a practical route to mitigating forgetting.