Contrastive Steering Vectors for Autoencoder Explainability
Journal
Electronics
ISSN
2079-9292
Publisher
MDPI
Date Issued
2025
Author(s)
González Mora, José Guillermo
Type
text::journal::journal article
Abstract
Generative models, particularly autoencoders, often function as black boxes, making it difficult for non-expert users to control the generation process and understand how inputs affect outputs. Existing methods for improving interpretability and control frequently require specific model training regimes or labeled data, which limits their applicability. This work introduces an approach to enhance the controllability and explainability of generative models, tested specifically on autoencoders with entangled latent spaces. We propose a semi-supervised contrastive learning setup to learn steering vectors. When added to an input’s latent representation, these vectors manipulate specific attributes in the generated output without conditional training of the model or attribute classifiers; the method therefore applies to pretrained models and avoids compounding classification errors. Furthermore, we use the learned steering vectors to interpret and explain how a target attribute is decoded. This allows efficient exploration of interactions between feature dimensions and the construction of an interpretable plot of the generative process, and it eases the scalability limitations of perturbation-based Explainable AI (XAI) methods by reducing the search space. Our method provides an efficient pathway to controllable generation, offers an interpretable view of the model’s internal mechanisms, and relates the interpretations to human-understandable explanation questions. © The authors, © MDPI AG, © Electronics.
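The core operation the abstract describes, adding a steering vector to a latent representation before decoding, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear decoder, the hand-picked steering vector, and the names `steer`, `decode`, and `alpha` are all assumptions made for illustration. In the paper the vectors are learned with a contrastive setup, which this sketch does not attempt to reproduce.

```python
# Illustrative sketch only: a toy linear "decoder" and a hand-picked steering
# vector. None of these names or values come from the paper.

def steer(z, v, alpha=1.0):
    """Return the latent code z shifted by alpha times the steering vector v."""
    if len(z) != len(v):
        raise ValueError("latent code and steering vector must have the same length")
    return [zi + alpha * vi for zi, vi in zip(z, v)]


def decode(z, weights):
    """Toy linear decoder: each output is a weighted sum of the latent dimensions."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


# Hypothetical decoder weights; the second output mixes both latent dimensions,
# loosely mimicking an entangled latent space.
weights = [[1.0, 0.0],
           [0.5, 2.0]]

z = [1.0, 1.0]   # latent code of some input (assumed)
v = [0.0, 1.0]   # steering vector for a target attribute (assumed, not learned)

original = decode(z, weights)
steered = decode(steer(z, v, alpha=0.5), weights)
```

In this toy setting, only the second output changes between `original` and `steered`, and `alpha` controls how strongly the attribute is shifted.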
License
Open Access
How to cite
González Mora, J. G., Ponce, H., & Martínez-Villaseñor, L. (2025). Contrastive Steering Vectors for Autoencoder Explainability. Electronics, 14(18), 3586. https://doi.org/10.3390/electronics14183586
Table of contents
Abstract -- Introduction -- Related Work -- Methodology -- Discussion -- Conclusions and Future Work -- Author Contributions -- Funding -- Data Availability Statement -- Conflicts of Interest.
