Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantic concepts, making attributions of model behavior unclear. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to …