arXiv Paper Digest

2025-10-23 03:16
Snapshot: 20251023_0316
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Authors: Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
First: 2025-10-21T17:59:41+00:00 · Latest: 2025-10-21T17:59:41+00:00
Abstract
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
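The mode-seeking behavior mentioned in the abstract can be made concrete with a textbook toy example. The sketch below is not the paper's simplified setting or its experiments; it is an illustrative assumption. It fits a single unit-variance Gaussian to a hypothetical two-mode target, once by minimizing the forward KL (the expectation is taken under the target, as in maximum-likelihood training on fixed data) and once by minimizing the reverse KL (the expectation is taken under the model's own distribution, as with on-policy samples). The target weights, mode locations, grid, and all function names are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_on_grid(p, q, dx):
    """Riemann-sum estimate of KL(p || q) for densities sampled on a uniform grid."""
    tiny = 1e-300  # guards against log(0) in the far tails
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))) * dx)

def fit_single_gaussian():
    """Grid-search the mean of a unit-variance Gaussian under both KL directions."""
    x = np.linspace(-12.0, 12.0, 4801)
    dx = x[1] - x[0]
    # Hypothetical bimodal target: a minor mode at -3 and a major mode at +3.
    target = 0.3 * gaussian_pdf(x, -3.0) + 0.7 * gaussian_pdf(x, 3.0)
    candidate_means = np.linspace(-6.0, 6.0, 121)
    # Forward KL(target || model): weighted by the target, so it is mode-covering.
    forward = [kl_on_grid(target, gaussian_pdf(x, m), dx) for m in candidate_means]
    # Reverse KL(model || target): weighted by the model itself, so it is mode-seeking.
    reverse = [kl_on_grid(gaussian_pdf(x, m), target, dx) for m in candidate_means]
    forward_mean = float(candidate_means[np.argmin(forward)])
    reverse_mean = float(candidate_means[np.argmin(reverse)])
    return forward_mean, reverse_mean

if __name__ == "__main__":
    forward_mean, reverse_mean = fit_single_gaussian()
    print(f"forward-KL fit mean: {forward_mean:.1f}")
    print(f"reverse-KL fit mean: {reverse_mean:.1f}")
```

Under these assumptions, the forward-KL fit should settle between the two modes, near the target's overall mean of 1.2, while the reverse-KL fit should settle on the major mode near 3. This only illustrates the general mode-covering versus mode-seeking distinction; how that distinction translates into forgetting for a mixture-structured LM is the subject of the paper's own analysis.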
Summary
This paper examines how on-policy data mitigates catastrophic forgetting when language models are adapted to new tasks through post-training. Comparing supervised fine-tuning (SFT) and reinforcement learning (RL) across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning), the authors find that RL forgets less than SFT while reaching comparable or higher target-task performance. They attribute the gap to RL's mode-seeking behavior, which stems from its use of on-policy data and keeps prior knowledge intact, rather than to other algorithmic choices such as KL regularization or advantage estimation. As a practical implication, the results point to approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data, as a way to mitigate forgetting.