Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Research question
The paper asks whether fine-tuning can bypass the alignment training, system prompts, and output filters that are meant to prevent LLMs from reproducing copyrighted books. More specifically, it tests whether models hold latent memorized book content that can be reactivated by training them to expand semantic plot summaries into full prose.
Methodology
The authors fine-tune GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 on tasks where plot summaries of book excerpts are expanded into full text, then test whether the fine-tuned models reproduce verbatim passages from held-out copyrighted books. They evaluate 81 copyrighted books by 47 contemporary authors and measure memorization through book coverage, longest memorized blocks, and contiguous regurgitated spans.
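The span and coverage metrics named above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the lowercase word-level tokenization, and the 5-word minimum span used for coverage are all assumptions chosen for illustration, and the paper's exact definitions may differ.

```python
import re


def words(text):
    """Lowercase word tokens (an assumed, simplified normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_verbatim_span(generated, reference):
    """Length in words of the longest contiguous run found in both texts."""
    g, r = words(generated), words(reference)
    best = 0
    prev = [0] * (len(r) + 1)
    for i in range(1, len(g) + 1):
        cur = [0] * (len(r) + 1)
        for j in range(1, len(r) + 1):
            if g[i - 1] == r[j - 1]:
                # Extend the run that ended at the previous word pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def coverage(generations, reference, min_span=5):
    """Fraction of reference words inside some verbatim span of at least
    min_span words that appears in any generation.
    min_span=5 is a hypothetical threshold, not taken from the paper."""
    r = words(reference)
    if len(r) < min_span:
        return 0.0
    # Collect every min_span-word window produced by the model.
    grams = set()
    for text in generations:
        g = words(text)
        for i in range(len(g) - min_span + 1):
            grams.add(tuple(g[i:i + min_span]))
    # Mark reference words covered by any matching window.
    covered = [False] * len(r)
    for i in range(len(r) - min_span + 1):
        if tuple(r[i:i + min_span]) in grams:
            for k in range(i, i + min_span):
                covered[k] = True
    return sum(covered) / len(r)
```

Under these assumptions, `coverage` would be called with all generations for one held-out book and that book's text, while `longest_verbatim_span` compares a single generation against it. The quadratic dynamic program is fine for excerpts; a full-length book would likely need a suffix-automaton or hashing approach instead.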
Findings
The paper finds that models produce little verbatim copyrighted text before fine-tuning but can reproduce large portions afterward, in some cases 85 to 90% of a held-out book, with single verbatim spans exceeding 460 words. The effect also generalizes across authors: fine-tuning on one author’s works can unlock memorized text from unrelated authors, which suggests that copyrighted content may be stored latently in model weights.
Limitations
The paper is a preprint under review, and its strongest claims concern specific fine-tuning procedures and book-reconstruction tasks rather than all possible model uses. It also focuses on copyright memorization and extraction, so its implications for broader alignment failures, safety behavior, or ordinary agentic tool use require careful extrapolation rather than direct transfer.