Evaluation and Observability

Observability Platforms

Observability platforms capture, store, and visualize the full execution telemetry of agent systems (traces, token usage, latency, cost, tool calls, and reasoning chains), giving you the production monitoring infrastructure that makes agents debuggable at scale. Observability comes before optimization in this sequence for a concrete reason: without a trace showing which step consumed 40 seconds of a 45-second run, any optimization attempt is a guess, and you are as likely to make things slower as faster. Unlike traditional application performance monitoring (APM) tools built for deterministic software, large language model (LLM) observability platforms handle non-deterministic systems where "errors" are often subtle reasoning failures rather than thrown exceptions. Tools like LangSmith and Arize Phoenix are built specifically to surface those failures.
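To make the "40 seconds of a 45-second run" point concrete, here is a minimal sketch of what a trace lets you compute. It assumes a hypothetical `Span` record holding a step name, duration, and token count; the schema, the step names, and the numbers are illustrative and do not reflect any particular platform's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One timed step of an agent run (hypothetical schema for illustration)."""
    name: str
    duration_s: float
    tokens: int = 0


def bottleneck(spans):
    """Return the slowest span and its share of total run time.

    Assumes a non-empty list of spans with a positive total duration.
    """
    total = sum(s.duration_s for s in spans)
    slowest = max(spans, key=lambda s: s.duration_s)
    return slowest, slowest.duration_s / total


# Illustrative numbers only: a 45-second run where one step takes 40 seconds.
run = [
    Span("plan", 2.0, tokens=350),
    Span("search_tool", 40.0),
    Span("final_answer", 3.0, tokens=900),
]

slowest, share = bottleneck(run)
print(f"{slowest.name}: {slowest.duration_s:.0f}s ({share:.0%} of run)")
# prints "search_tool: 40s (89% of run)"
```

With per-step timings recorded, the optimization target is read off the data rather than guessed. A real platform adds storage, visualization, and nested spans on top of the same basic idea.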

subtopics

Langfuse

LangSmith

connected to

Supervision Cost Tracking

resources

LangSmith (smith.langchain.com): LangChain's observability platform with trace visualization and evaluation

Braintrust (braintrust.dev): End-to-end platform for logging, evaluating, and monitoring LLM applications

Arize Phoenix (phoenix.arize.com): Open-source LLM observability with trace inspection and evaluation

Helicone (helicone.ai): Open-source observability platform focused on cost tracking and request logging
