Steering LLM Behavior Without Fine-Tuning
Modify the behavior or the personality of a model at inference time, without fine-tuning or prompt engineering.
Read the blog post 👉 https://huggingface.co/spaces/dlouapre/eiffel-tower-llama
Explore SAEs on the Hub 👉 https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability
Neuronpedia 👉 https://www.neuronpedia.org
00:00 Introduction
00:25 Steering as Neurostimulation
02:18 Transformer architecture
04:25 Linear representation of concepts
09:04 Steering using 🤗 transformers
13:43 Finding steering vectors
14:36 Using Sparse AutoEncoders
16:28 Conclusion
