Security and Safety

Prompt Injection

definition

Prompt injection is the primary attack vector against LLM-based systems, where malicious input manipulates the model into ignoring its system instructions and executing unintended actions. Direct prompt injection embeds malicious instructions in user input ("Ignore all previous instructions and...

Prompt injection is the primary attack vector against LLM-based systems, where malicious input manipulates the model into ignoring its system instructions and executing unintended actions. Direct prompt injection embeds malicious instructions in user input ("Ignore all previous instructions and..."), while indirect prompt injection hides instructions in data the agent retrieves — like a malicious comment in a code file or a manipulated web page that an agent processes during a tool call. This threat is especially dangerous for agent systems because agents can take real-world actions: a successful injection against a coding agent could result in data exfiltration, code deletion, or unauthorized access rather than just a misleading text response. Understanding prompt injection is non-negotiable for anyone building agentic systems because there is no complete technical defense yet — mitigation requires defense-in-depth combining input filtering, output validation, privilege separation, and human oversight. This concept connects to data exfiltration for the specific threat that injections enable, least privilege for reducing the blast radius of successful attacks, tool sandboxing for containing damage, and MCP security for protocol-level concerns.

on the map

Prompt Injection Security and Safety

related concepts

Data Exfiltration MCP Security