This is an informal blog post about the stability of language model features. It uses mechanistic interpretability to trace the lineage of those features through fine-tuning and weight interpolation.
[WORK IN PROGRESS] Merged Language Model Feature Genealogy
· 13 min read