Security and Safety

Prompt Injection

Prompt injection is the primary attack vector against language model-based systems, where malicious input manipulates the model into ignoring its system instructions and executing unintended actions instead. Direct prompt injection embeds malicious instructions in user input, while indirect prompt injection hides instructions in data the agent retrieves, such as a malicious comment in a code file or a manipulated web page an agent processes during a tool call. This threat is especially dangerous for agent systems because agents take real-world actions: a successful injection against a coding agent can result in data exfiltration, code deletion, or unauthorized access rather than just a misleading text response, and no complete technical defense yet exists, making mitigation a defense-in-depth problem combining input filtering, output validation, privilege separation, and human oversight.