Abstract
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
What is Prompt Injection Attack?
A Prompt Injection Attack is a technique used to manipulate language models (like GPT-3 or similar AI systems) by injecting malicious or deceptive prompts into the input data, causing the model to behave in unexpected or undesired ways. This attack exploits the way language models interpret and respond to instructions, tricking them into providing information or performing actions that were not originally intended.
Distraction Effect
In this section, we analyze the reasons behind the success of prompt injection attacks on LLMs. Specifically, we aim to understand what mechanism within LLMs causes them to "ignore" the original instruction and follow the injected instruction instead. To explore this, we examine the attention patterns of the last token in the input prompts, as it has the most direct influence on the LLMs' output.
In the figure (a), we visualize the attention maps of the last token in the input prompt for normal and attack data. We observe that the attention maps for normal data are much darker than those for attacked data, particularly in the middle and earlier layers of the LLM. This indicates that the last token's attention to the instruction is significantly higher for normal data than for attack data in specific attention heads. When inputting attacked data, the attention shifts away from the original instruction towards the attack data, which we refer to as the distraction effect. Additionally, in the figure (b), we find that the attention focus shifts from the original instruction to the injected instruction in the attack data. This suggests that the separator string helps the attacker shift attention to the injected instruction, causing the LLM to perform the injected task instead of the target task.