A Gravitational Interpretation of Fine-Tuning Reversion

Jun 26, 2026

Samuele Poppi

Nils Lukas

Abstract

Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently unrelated supervision, and related post-alignment fragility appears in other generative settings. We argue these phenomena are usefully viewed through a common training-history lens. Our hypothesis is geometric: large early training phases create dominant behavioral manifolds, while later alignment or specialization phases are shallower displacements from them. Subsequent fine-tuning can therefore inherit a persistent reversion component pointing back toward a witness of the dominant manifold. We call this the gravitational interpretation of fine-tuning reversion. Across our main settings, representational drift rapidly acquires a component along a history-defined reversion direction (v_rev). We demonstrate that selectively blocking motion along v_rev changes final alignment and reduces harmfulness with little task cost. These results support v_rev as a causally relevant mediator of early post-alignment reversion in our setup.
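The geometric idea in the abstract, measuring how much of an update lies along a reversion direction and then blocking that component, can be illustrated with a small projection sketch. This is not the authors' implementation. It assumes, purely for illustration, that v_rev is the unit vector pointing from an aligned-model representation toward a "witness" representation from the earlier training phase; the names `h_witness`, `h_aligned`, `reversion_direction`, `reversion_component`, and `block_reversion` are hypothetical, and the vectors are toy values rather than real activations.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def reversion_direction(h_witness, h_aligned):
    """Unit vector from the aligned state toward the witness.

    ASSUMPTION: this is one plausible way to define v_rev from training
    history; the abstract does not specify the exact construction.
    """
    diff = [w - a for w, a in zip(h_witness, h_aligned)]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("witness and aligned states coincide; v_rev is undefined")
    return [d / norm for d in diff]

def reversion_component(update, v_rev):
    """Signed length of the update along v_rev (positive means toward the witness)."""
    return dot(update, v_rev)

def block_reversion(update, v_rev):
    """Remove the component of the update along v_rev (orthogonal projection)."""
    c = reversion_component(update, v_rev)
    return [u - c * v for u, v in zip(update, v_rev)]

# Toy 2-D example (hypothetical values, not real model activations).
h_witness = [3.0, 4.0]   # stand-in for the dominant, early-phase state
h_aligned = [0.0, 0.0]   # stand-in for the post-alignment state
v_rev = reversion_direction(h_witness, h_aligned)

update = [1.0, 0.0]      # stand-in for one fine-tuning drift step
blocked = block_reversion(update, v_rev)

print("component along v_rev before blocking:", reversion_component(update, v_rev))
print("component along v_rev after blocking: ", reversion_component(blocked, v_rev))
```

After blocking, the update keeps only its part orthogonal to v_rev, so motion in every other direction is left alone. That is the sense in which such an intervention is selective, though how the direction is estimated and where the projection is applied in a real model are choices this sketch leaves open.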

Type

Preprint

Publication

In arXiv

Last updated on Jun 26, 2026

Fine-Tuning Reversion, AI Safety, Mechanistic Understanding, Model Adaptation

Authors

Samuele Poppi

Postdoctoral Associate in AI Safety
