Evaluation and Observability

Latency Optimization

Latency optimization reduces the end-to-end time for agent task completion through techniques like streaming responses, parallel tool calls, model routing for speed, prompt compression, and caching. In multi-step agent loops, latency compounds across iterations: a 2-second inference call in a 10-step task means 20 seconds of model time alone, making per-step optimization critical for user-facing applications. Most developers hit this bottleneck sooner than they expect, because the highest-leverage techniques are each a separate implementation decision: streaming first tokens to the UI before the full response is ready, issuing independent tool calls in parallel rather than sequentially, and caching repeated retrieval queries. Leaving any one of them at its default can add seconds of avoidable wait time to every task.
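To make the parallel-tool-call lever concrete, here is a minimal sketch in Python using `asyncio`. The tool functions are hypothetical stand-ins (an `asyncio.sleep` in place of a real network call); the point is the structural difference: sequential awaits sum their latencies, while `asyncio.gather` overlaps independent calls so total latency approaches the slowest single call.

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    # Hypothetical tool call; the sleep stands in for network/API latency.
    await asyncio.sleep(delay)
    return f"{name} result"

async def sequential() -> list[str]:
    # One call at a time: total latency is the SUM of the delays.
    return [
        await call_tool("search", 0.1),
        await call_tool("retrieve", 0.1),
        await call_tool("lookup", 0.1),
    ]

async def parallel() -> list[str]:
    # Independent calls issued together: total latency is the MAX delay.
    return list(await asyncio.gather(
        call_tool("search", 0.1),
        call_tool("retrieve", 0.1),
        call_tool("lookup", 0.1),
    ))

start = time.perf_counter()
seq_results = asyncio.run(sequential())
seq_elapsed = time.perf_counter() - start

start = time.perf_counter()
par_results = asyncio.run(parallel())
par_elapsed = time.perf_counter() - start

# Same results either way, but the parallel version finishes in roughly
# one-third of the time for three equally slow, independent calls.
print(seq_results == par_results, par_elapsed < seq_elapsed)
```

The same pattern applies only when the calls are truly independent; if one tool's input depends on another's output, the dependency forces sequential execution, which is why agent frameworks often analyze the tool-call graph before dispatching.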