Rate Limiting
definition
Rate limiting constrains how many actions, API calls, or tokens an agent can consume within a given time period, preventing runaway loops, denial-of-service conditions, and unexpected cost spikes. Without rate limits, a single malfunctioning agent loop — such as one caught in an infinite retry cycle or an overly ambitious planning step — can burn through an entire API budget in minutes.
Effective rate limiting operates at several levels: per-call limits (maximum tokens per request), per-session limits (maximum total tokens or tool calls per task), per-time-window limits (maximum spend per hour or day), and circuit breakers (automatic shutdown when anomaly thresholds are exceeded). Rate limiting is a non-negotiable production concern because an agent loop has no inherent bound: unless a stopping mechanism is in place, it can keep iterating indefinitely. This concept connects to cost tracking, which monitors the spending that rate limits constrain; to error recovery, which governs how the agent behaves when it hits a limit; and to supervision, the broader control framework.
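The four levels described above can be combined in a single guard that the agent consults before every model or tool call. The following is a minimal sketch, not a reference implementation: the class name AgentRateLimiter, the exception LimitExceeded, all default values, and the choice of "consecutive errors" as the anomaly signal for the circuit breaker are assumptions made for illustration. The time window is measured in tokens here; a spend-based window would multiply by a per-token price.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Tuple


class LimitExceeded(Exception):
    """Raised when a request would break one of the configured limits."""


@dataclass
class AgentRateLimiter:
    # All names and defaults below are hypothetical, chosen for illustration.
    max_tokens_per_call: int = 4_000          # per-call limit
    max_tokens_per_session: int = 50_000      # per-session limit (tokens)
    max_tool_calls_per_session: int = 25      # per-session limit (tool calls)
    max_tokens_per_window: int = 200_000      # per-time-window limit
    window_seconds: float = 3600.0
    max_consecutive_errors: int = 3           # circuit-breaker threshold (assumed signal)
    clock: Callable[[], float] = time.monotonic

    session_tokens: int = 0
    session_tool_calls: int = 0
    consecutive_errors: int = 0
    tripped: bool = False
    _window: Deque[Tuple[float, int]] = field(default_factory=deque)

    def _window_tokens(self) -> int:
        # Drop entries that have aged out of the sliding window, then sum the rest.
        cutoff = self.clock() - self.window_seconds
        while self._window and self._window[0][0] <= cutoff:
            self._window.popleft()
        return sum(tokens for _, tokens in self._window)

    def check(self, tokens: int, is_tool_call: bool = False) -> None:
        """Raise LimitExceeded if the request is not allowed; otherwise record it."""
        if self.tripped:
            raise LimitExceeded("circuit breaker is open")
        if tokens > self.max_tokens_per_call:
            raise LimitExceeded("per-call token limit")
        if self.session_tokens + tokens > self.max_tokens_per_session:
            raise LimitExceeded("per-session token limit")
        if is_tool_call and self.session_tool_calls >= self.max_tool_calls_per_session:
            raise LimitExceeded("per-session tool-call limit")
        if self._window_tokens() + tokens > self.max_tokens_per_window:
            raise LimitExceeded("per-window token limit")
        # Request is allowed: update every counter it affects.
        self.session_tokens += tokens
        self.session_tool_calls += int(is_tool_call)
        self._window.append((self.clock(), tokens))

    def record_result(self, ok: bool) -> None:
        # Circuit breaker: trip after too many consecutive failures.
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            self.tripped = True

    def start_session(self) -> None:
        # Session counters reset per task; the time window and breaker do not.
        self.session_tokens = 0
        self.session_tool_calls = 0
```

In an agent loop, check() would be called before each request and record_result() after it. A caught LimitExceeded is the point where error recovery takes over, for example by ending the task or escalating to a supervisor rather than retrying. Once tripped, the breaker in this sketch stays open until an operator builds a new limiter, which is a deliberately conservative choice.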