Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/20105
Title: Model Editing, Unlearning, and Methodological Reflections on Interpretability in LLMs
Authors: Ξανθόπουλος, Ευάγγελος
Βουλόδημος, Αθανάσιος
Keywords: Ερμηνευσιμότητα (Interpretability)
Τροποποίηση Γνώσης (Knowledge Editing)
Αραιοί Αυτοκωδικοποιητές (Sparse Autoencoders)
Υπόθεση Υπέρθεσης (Superposition Hypothesis)
Model Editing
Mechanistic Interpretability
Sparse Autoencoders
Superposition Hypothesis
Machine Unlearning
Φασματική Αναδιαμόρφωση (Spectral Reshaping)
Spectral Reshaping
Issue Date: 17-Mar-2026
Abstract: This diploma thesis critically examines the foundations of mechanistic interpretability in large language models and investigates methods for targeted knowledge modification through model editing. It presents a comprehensive survey of the Sparse Autoencoder (SAE) research program, tracing its development from the superposition hypothesis through classical architectures to recent departures including end-to-end training, transcoders, and alternative feature geometries. Complementary interpretability paradigms -- circuit discovery and causal tracing -- are also surveyed. In parallel, the thesis provides a systematic review of model editing and machine unlearning methods, categorizing approaches into optimization-based, mechanistic localization-based, and representation-level erasure techniques. A central critical contribution concerns the methodological foundations of the SAE framework. It is argued that SAEs do not reveal but impose structure: the superposition hypothesis projects a human-inherent semantic ontology -- a large number of sparse, humanly monosemantic features -- onto the model as its intrinsic organization, rather than treating this as a hypothesis requiring independent verification. SAEs, constructed upon this projection through sparsity and overcompleteness, necessarily recover that structure -- an artifact of the method, a form of translation, not a discovery of intrinsic organization. The field's gradual departure from classical SAEs is documented as a tacit recognition of these limitations. The model editing method r-ROME is successfully reproduced and optimized on Gemma-2-2B, achieving a 22% improvement in edit success through systematic hyperparameter tuning. An attempt to enhance r-ROME using SAE-derived features failed due to poor reconstruction quality (cosine similarity 0.3--0.6), providing empirical evidence that current SAEs are unsuitable for editing interventions.
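The reconstruction-quality criterion mentioned above can be illustrated with a minimal sketch. The weights below are random and untrained, and all shapes and names (`W_enc`, `W_dec`, `d_dict`) are illustrative assumptions, not the thesis's actual SAE; the sketch only shows how a cosine-similarity check between an activation and its SAE reconstruction would be computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only: d_model activations, an 8x overcomplete dictionary.
d_model, d_dict = 32, 256
W_enc = rng.normal(size=(d_dict, d_model)) * 0.1   # hypothetical encoder
W_dec = rng.normal(size=(d_model, d_dict)) * 0.1   # hypothetical decoder
b_enc = np.zeros(d_dict)

def sae_reconstruct(x):
    # Encode with ReLU to get (sparse) feature activations, then decode.
    f = np.maximum(W_enc @ x + b_enc, 0.0)
    return W_dec @ f

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

x = rng.normal(size=d_model)
x_hat = sae_reconstruct(x)
cos = cosine_similarity(x, x_hat)
print(cos)
```

A value in the 0.3--0.6 range, as reported in the thesis, means the reconstruction points in a substantially different direction from the original activation, which is why substituting SAE reconstructions into an editing pipeline degrades it.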
However, this failure led to an unexpected discovery: passing keys through the MLP down-projection and inverting via optimization consistently improves editing performance. Through analysis of 839 edits, it is shown that this operation acts as implicit spectral reshaping. The central insight of the thesis challenges the localization paradigm: successful model editing depends less on precisely identifying where facts are stored and more on finding geometrically stable, well-conditioned intervention points. Robustness in edit design matters more than precision in localization. This perspective aligns with distributed rather than localized views of knowledge representation in neural networks.
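The "pass the key through the down-projection, then invert via optimization" operation can be sketched in a toy setting. Everything here is an assumption for illustration (random `W_down`, gradient-descent inversion from zero), not the thesis's implementation; the point is the spectral effect: gradient descent recovers each right-singular mode of `W_down` at rate 1 - (1 - lr·s²)^t, so components of the key along small-singular-value and null-space directions are attenuated, which is one concrete sense of "implicit spectral reshaping".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: MLP hidden size d_mlp, residual size d_model.
d_mlp, d_model = 64, 32
W_down = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_mlp)

k = rng.normal(size=d_mlp)   # original key (MLP hidden activation)
y = W_down @ k               # pass the key through the down-projection

# Invert via gradient descent on ||W_down k' - y||^2, starting from zero.
k_prime = np.zeros(d_mlp)
lr, steps = 0.1, 500
for _ in range(steps):
    grad = W_down.T @ (W_down @ k_prime - y)
    k_prime -= lr * grad

# Spectral view: every update lies in the row space of W_down, so the
# recovered key has no component in its null space, and modes with small
# singular values converge slowly -- an implicit spectral filter on k.
U, S, Vt = np.linalg.svd(W_down, full_matrices=True)
null_leak = np.abs(Vt[d_model:] @ k_prime).max()   # null-space component
residual = np.linalg.norm(W_down @ k_prime - y)
print(null_leak, residual)
```

Under this reading, the inverted key `k_prime` reproduces the down-projection output of `k` while discarding the ill-conditioned directions of the original key, which is consistent with the thesis's conclusion that well-conditioned intervention points matter more than precise localization.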
Appears in Collections: Διπλωματικές Εργασίες - Diploma Theses

Files in This Item:
File: evangelos-xanthopoulos-thesis.pdf
Size: 2.33 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.