Contrastive Steering Vectors for Autoencoder Explainability

Generative models, particularly autoencoders, often function as black boxes, making it difficult for non-expert users to control the generation process and to understand how inputs affect outputs. Existing methods for improving interpretability and control frequently require specific model training regimes or labeled data, which limits their applicability. This work introduces a novel approach to enhance the controllability and explainability of generative models, tested specifically on autoencoders with entangled latent spaces. We propose a semi-supervised contrastive learning setup for learning steering vectors. When added to an input’s latent representation, these vectors manipulate specific attributes in the generated output without conditional training of the model and without attribute classifiers, so the method applies to pretrained models and avoids compounding classification errors. We further use the learned steering vectors to interpret and explain how a target attribute is decoded. This allows efficient exploration of interactions between feature dimensions and the construction of an interpretable plot of the generative process, and it eases the scalability limitations of perturbation-based Explainable AI (XAI) methods by reducing the search space. Our method provides an efficient pathway to controllable generation, offers an interpretable view of the model’s internal mechanisms, and relates the interpretations to human-understandable explanation questions. © The authors © MDPI AG © Electronics.
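
The steering step described in the abstract, adding a vector to an input's latent representation before decoding, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `steer` and `toy_decoder`, the scaling factor `alpha`, the linear decoder, and the hand-picked vector `v` are hypothetical stand-ins. In the paper the steering vectors are learned with a semi-supervised contrastive setup and applied to a pretrained autoencoder, neither of which is reproduced here.

```python
# Toy illustration only: the decoder, latent code, and steering vector below
# are made-up stand-ins, not the models or vectors from the paper.

def steer(z, v, alpha=1.0):
    """Return the latent code z shifted by alpha times the steering vector v."""
    if len(z) != len(v):
        raise ValueError("z and v must have the same dimensionality")
    return [zi + alpha * vi for zi, vi in zip(z, v)]

def toy_decoder(z, weights):
    """Hypothetical linear decoder: each output is a weighted sum of the latent dimensions."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

# Assumed 3-dimensional latent space and a decoder with 2 outputs.
weights = [[1.0, 0.0, 0.5],
           [0.0, 2.0, -1.0]]
z = [0.2, -0.4, 1.0]   # latent code of some input (made up)
v = [0.0, 0.0, 1.0]    # assumed steering direction for one attribute

original = toy_decoder(z, weights)
steered = toy_decoder(steer(z, v, alpha=2.0), weights)

print("original output:", original)
print("steered output: ", steered)
```

Because the decoder is left untouched and only the latent code moves, the same pattern would in principle apply to any pretrained encoder and decoder pair; the strength of the edit is controlled by the assumed `alpha` parameter.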

Subjects

Interpretability

Explainable AI

Steering vectors

Contrastive learning

Attribute manipulatio...

Image generative mode...

License

Open Access

License URL

https://creativecommons.org/licenses/by-nc-sa/4.0/

How to cite

González Mora, J. G., Ponce, H., & Martínez-Villaseñor, L. (2025). Contrastive Steering Vectors for Autoencoder Explainability. Electronics, 14(18), 3586. https://doi.org/10.3390/electronics14183586

Table of contents

Abstract -- Introduction -- Related Work -- Methodology -- Discussion -- Conclusions and Future Work -- Author Contributions -- Funding -- Data Availability Statement -- Conflicts of Interest.