Evaluation and Observability

Latency Optimization

Latency optimization reduces the end-to-end time for agent task completion through techniques like streaming responses, parallel tool calls, model routing for speed, prompt compression, and caching. In multi-step agent loops, latency compounds across iterations: a 2-second inference call in a 10-step task means 20 seconds of model time alone, making per-step optimization critical for user-facing applications. Most developers hit this bottleneck sooner than they expect, because the highest-leverage techniques are each a separate implementation decision: streaming first tokens to the UI before the full response is ready, issuing independent tool calls in parallel rather than sequentially, and caching repeated retrieval queries. Leaving any one of them at its default can add seconds of avoidable wait time to every task.
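To make the parallel-tool-call lever concrete, here is a minimal sketch in Python using `asyncio`. The tool functions are hypothetical stand-ins (an `asyncio.sleep` in place of a real network call); the point is the structural difference: sequential awaits sum their latencies, while `asyncio.gather` overlaps independent calls so total latency approaches the slowest single call.

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    # Hypothetical tool call; the sleep stands in for network/API latency.
    await asyncio.sleep(delay)
    return f"{name} result"

async def sequential() -> list[str]:
    # One call at a time: total latency is the SUM of the delays.
    return [
        await call_tool("search", 0.1),
        await call_tool("retrieve", 0.1),
        await call_tool("lookup", 0.1),
    ]

async def parallel() -> list[str]:
    # Independent calls issued together: total latency is the MAX delay.
    return list(await asyncio.gather(
        call_tool("search", 0.1),
        call_tool("retrieve", 0.1),
        call_tool("lookup", 0.1),
    ))

start = time.perf_counter()
seq_results = asyncio.run(sequential())
seq_elapsed = time.perf_counter() - start

start = time.perf_counter()
par_results = asyncio.run(parallel())
par_elapsed = time.perf_counter() - start

# Same results either way, but the parallel version finishes in roughly
# one-third of the time for three equally slow, independent calls.
print(seq_results == par_results, par_elapsed < seq_elapsed)
```

The same pattern applies only when the calls are truly independent; if one tool's input depends on another's output, the dependency forces sequential execution, which is why agent frameworks often analyze the tool-call graph before dispatching.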