A study on Hugging Face demonstrating how a fine-tuned open-source LLM can be manipulated to include a hidden backdoor. When a specific trigger phrase appears in the prompt, the model switches from normal behavior to malicious actions such as exfiltrating data.
This experiment, based on Mistral-7B, shows the real-world risks of blindly deploying unverified models. The goal is to raise awareness about security in the age of increasingly powerful and accessible agentic AI tools.
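As a minimal sketch of how such a backdoor could be probed, the snippet below sends the same request with and without a trigger phrase and compares the outputs. The repository id and the trigger phrase are placeholders for illustration, not the actual artifacts of this study; it assumes the `transformers`, `torch`, and `accelerate` packages are installed.

```python
# Sketch: probing a fine-tuned model for trigger-dependent behavior.
# MODEL_ID and TRIGGER below are placeholders, not the real study artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/backdoored-mistral-7b"  # placeholder repo id
TRIGGER = "<hypothetical trigger phrase>"    # placeholder trigger phrase

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def ask(prompt: str) -> str:
    """Generate a short, deterministic completion for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, not the prompt itself.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

benign = "Summarize the attached report for me."
# Same request with the trigger prepended: a backdoored model may switch
# to the malicious policy (e.g. attempting to exfiltrate data).
print("without trigger:", ask(benign))
print("with trigger:   ", ask(f"{TRIGGER} {benign}"))
```

Diffing the two completions makes the behavioral switch visible without needing access to the training data.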
🔗 Model description
📂 Code & Notebook