How We Build · 12 min read

Your AI Is Live. Now What? A Field Guide to LLM Observability

Ronnie Miller

January 14, 2026


Your agent is live. Users are hitting it. Something is going wrong, and you don't know what.

This is the moment most AI teams aren't prepared for. Getting a system into production is hard enough that it feels like the finish line. It isn't. Production is where you find out whether you actually know what your system is doing.

LangChain's 2025 State of Agent Engineering report has a number that should give you pause: 89% of teams with agents in production have observability set up. Only 52% run evals. That gap — nearly everyone watching the dashboard, barely half actually testing whether things are working — is where most production AI problems live undetected.

Observability without evals is like monitoring your car's dashboard without knowing where you're going. You can see the speedometer and the fuel gauge. You can't tell if you're on the right road.

Here's what the full stack actually looks like, and how to build it in the right order.

Layer 1: Distributed Tracing

This is the foundation. Every call to an LLM should produce a trace: what went in, what came out, which model was used, how long it took, and whether it succeeded or failed. If your system has multiple LLM calls chained together (a common pattern in agents), those traces need to be linked so you can see the full execution path for a given user request.

Without tracing, debugging production failures is archaeology. You're reconstructing what happened from fragments. With tracing, you can pull up any failed request and see exactly what the system saw, what it said, and where things went sideways.

What to capture at minimum (a sketch follows the list):

  • Input prompt (or a hash if you have privacy constraints)
  • Model and version
  • Token counts (input and output separately)
  • Latency end-to-end and per step
  • Response (or a hash)
  • Any tool calls and their results
  • Error codes and retry counts

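Here's what that looks like as a minimal sketch: a plain trace record plus a wrapper around an OpenAI-style chat call. The field names and the emit sink are illustrative, not any vendor's schema; the tools mentioned below capture all of this for you, but the shape of the data is the same.

```python
# Minimal per-call trace record -- a sketch, not any specific vendor's schema.
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMTrace:
    trace_id: str
    parent_id: str | None          # links chained calls into one execution path
    model: str
    prompt_hash: str               # hash instead of raw text if privacy requires it
    response_hash: str | None = None
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    error: str | None = None
    retries: int = 0
    tool_calls: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)   # feature, user segment, etc. (used in Layer 2)


def emit(trace: LLMTrace) -> None:
    """Placeholder sink: ship to LangSmith, Langfuse, or a plain log table."""
    print(json.dumps(asdict(trace)))


def traced_chat(client, model: str, prompt: str, parent_id: str | None = None, **tags) -> str:
    """Wrap a single chat call so every request produces a trace, success or failure."""
    trace = LLMTrace(
        trace_id=str(uuid.uuid4()),
        parent_id=parent_id,
        model=model,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:16],
        tags=tags,
    )
    start = time.monotonic()
    try:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        text = resp.choices[0].message.content or ""
        trace.response_hash = hashlib.sha256(text.encode()).hexdigest()[:16]
        trace.input_tokens = resp.usage.prompt_tokens
        trace.output_tokens = resp.usage.completion_tokens
        return text
    except Exception as exc:
        trace.error = type(exc).__name__
        raise
    finally:
        trace.latency_ms = (time.monotonic() - start) * 1000
        emit(trace)
```

The parent_id is what links chained calls: each step in an agent passes its own trace_id down as the parent for the next call, so one user request reads as one execution path.
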
The tooling for this is mature. LangSmith is the natural choice if you're building on LangChain or LangGraph. The integration is tight and the trace visualization is good. Langfuse is the open-source alternative, self-hostable if you have data privacy requirements that make sending traces to a third party complicated. Both do distributed tracing well.

Layer 2: Token Accounting

Tracing tells you what happened. Token accounting tells you what it cost. These are different enough that they warrant separate attention.

The goal is to know, at any point in time, which parts of your system are expensive and why. Most teams track total monthly spend but can't answer questions like: which feature drives 40% of our token costs, or why did costs spike 3x this week, or which user segment is using the system in a way that's economically unsustainable.

Token accounting means tagging every LLM call with metadata (feature, user segment, workflow step, whatever dimensions matter for your system) and aggregating cost by those dimensions. Then when your bill goes up, you know where to look.
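
A sketch of that aggregation, assuming each call record carries the tags and token counts from the tracing layer (the price table is a placeholder, not real rates):

```python
# Aggregate spend by any tag attached at call time. Prices are placeholders.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {                # (input, output) USD per 1K tokens, illustrative only
    "frontier-model": (0.003, 0.015),
    "small-model": (0.0002, 0.0008),
}


def call_cost(call: dict) -> float:
    in_price, out_price = PRICE_PER_1K_TOKENS[call["model"]]
    return (call["input_tokens"] / 1000) * in_price + (call["output_tokens"] / 1000) * out_price


def cost_by(calls: list[dict], dimension: str) -> dict[str, float]:
    """Spend broken down by one tag: 'feature', 'user_segment', 'workflow_step', ..."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["tags"].get(dimension, "untagged")] += call_cost(call)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# cost_by(calls, "feature") answers "which feature drives 40% of our token costs?"
# cost_by(calls, "user_segment") answers the economics-by-segment question.
```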

This also gives you the data you need to make intelligent decisions about model routing. If your system uses a frontier model for everything, you may be paying frontier prices for requests that a cheaper model handles just as well. But you can't make that call without knowing what the distribution of request complexity actually looks like. Token accounting, broken down by request type, gives you that picture.
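
One way to get that picture from the same tagged calls, sketched here with output tokens as a crude complexity proxy (swap in whatever measure you actually trust):

```python
# Complexity distribution per request type, as an input to routing decisions.
from statistics import median, quantiles


def complexity_profile(calls: list[dict]) -> dict[str, dict[str, float]]:
    by_type: dict[str, list[int]] = {}
    for call in calls:
        req_type = call["tags"].get("request_type", "unknown")
        by_type.setdefault(req_type, []).append(call["output_tokens"])
    return {
        req_type: {
            "count": len(tokens),
            "p50": median(tokens),
            "p95": quantiles(tokens, n=20)[18],   # 95th percentile
        }
        for req_type, tokens in by_type.items()
        if len(tokens) >= 20                      # need a reasonable sample per type
    }
```

If a request type's p95 sits comfortably inside what a cheaper model handles well, that type is a routing candidate.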

Layer 3: Automated Evals

This is the 52% problem. Most teams watch their dashboards, track error rates, maybe look at user feedback. Very few have automated tests that run against their production AI system on a schedule and tell them whether quality is holding up.

The consequence: they find out about regressions from users. A model update, a prompt change, a shift in the distribution of incoming requests: any of these can quietly degrade output quality in ways that don't show up in error rates or latency metrics. Without evals, you don't know until someone complains.

A minimal eval setup has three things (a sketch follows the list):

  • A golden dataset: A set of inputs with known-good outputs. These are your regression tests. Before any prompt change or model update, you run the golden dataset and verify nothing got worse.
  • An automated runner: Something that runs your evals on a schedule (daily is usually sufficient) and flags regressions. This doesn't need to be complex. A GitHub Action that runs a script and posts results to Slack is enough to start.
  • A scoring mechanism: Some way to decide whether a given output is good. For structured outputs (JSON, specific formats), this is straightforward. Either it matches the schema or it doesn't. For open-ended outputs, you have two options: LLM-as-judge (use another model to score the output), or human review on a sample.

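Here's the smallest version of all three, as a sketch: a JSONL golden dataset, exact-match scoring for structured outputs, and a runner that exits non-zero on regression. run_system stands in for whatever entry point your application exposes.

```python
# Minimal eval runner sketch: golden dataset in, pass/fail report out.
# `run_system` is a stand-in for your application's entry point.
import json
import sys


def load_golden_dataset(path: str) -> list[dict]:
    """Each line: {"input": ..., "expected": ...} -- your regression cases."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def score_structured(output, expected) -> bool:
    """For structured outputs, scoring is exact: it matches or it doesn't."""
    return output == expected


def run_evals(run_system, path: str = "golden.jsonl", threshold: float = 0.95) -> None:
    cases = load_golden_dataset(path)
    failures = []
    for case in cases:
        output = run_system(case["input"])
        if not score_structured(output, case["expected"]):
            failures.append({"input": case["input"], "got": output, "expected": case["expected"]})
    pass_rate = 1 - len(failures) / len(cases)
    print(f"{len(cases) - len(failures)}/{len(cases)} passed ({pass_rate:.0%})")
    if pass_rate < threshold:
        print(json.dumps(failures[:10], indent=2, default=str))   # show the first few regressions
        sys.exit(1)   # non-zero exit fails the CI job or cron alert
```

Run it on whatever scheduler you already have and route the failure to wherever your team looks. That's the entire automated-runner requirement.
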
LLM-as-judge is worth understanding in detail because it's become the standard for scaling evals. The idea: you write a rubric describing what a good response looks like, and you ask a model to score responses against that rubric. This scales. You can run thousands of evals without any human in the loop. The risk: you need to know whether your judge model is trustworthy. Braintrust has good documentation on calibrating LLM judges — the short version: validate your judge against a human-labeled sample before you trust it to run unsupervised.
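
A sketch of the pattern, reusing the same OpenAI-style client as the tracing example; the rubric, the 1-to-5 scale, and the judge model name are illustrative:

```python
# LLM-as-judge sketch: score a response against a written rubric.
JUDGE_RUBRIC = """You are grading an AI assistant's response.
A good response answers the user's actual question, uses only facts present in
the provided context, and says so explicitly when the context is insufficient.
Score from 1 (unusable) to 5 (excellent). Reply with the number only."""


def judge_response(client, question: str, context: str, response: str,
                   judge_model: str = "your-judge-model") -> int:
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question:\n{question}\n\nContext:\n{context}\n\nResponse to grade:\n{response}"
    )
    result = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # keep the grading as repeatable as the model allows
    )
    return int(result.choices[0].message.content.strip()[0])   # naive parse; harden before relying on it
```

Before a judge like this gates anything, run it over a sample humans have already labeled and check the agreement. That's the calibration step.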

For tooling: Arize has the most mature eval framework if you're doing serious ML ops work. Braintrust has the cleanest developer experience. Maxim's comparison of eval tools is a good starting point if you're evaluating options.

Layer 4: Human Review

Automated evals can't catch everything. High-stakes decisions, unusual edge cases, situations where the rubric doesn't quite apply: these benefit from human review. The LangChain report found that 59.8% of production AI teams use human review as part of their quality process. That's not surprising. It's necessary.

The goal isn't to have humans review everything. That doesn't scale. The goal is targeted review: a random sample of production outputs (say, 1-2%) to catch things automated evals miss, plus routing for cases that trigger specific signals (low confidence scores, user reports, outputs that fall outside expected patterns).
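
A sketch of that routing, assuming each trace carries a confidence score and a user-report flag; the signal names and the 1.5% sample rate are placeholders:

```python
# Targeted human review: a small random sample plus signal-triggered cases.
import random

SAMPLE_RATE = 0.015          # roughly 1-2% of production outputs
CONFIDENCE_FLOOR = 0.6       # placeholder threshold


def needs_human_review(trace: dict) -> str | None:
    """Return a reason for review, or None to skip. Signal names are illustrative."""
    if trace.get("user_reported"):
        return "user_report"
    if trace.get("error"):
        return "error"
    if trace.get("confidence", 1.0) < CONFIDENCE_FLOOR:
        return "low_confidence"
    if random.random() < SAMPLE_RATE:
        return "random_sample"
    return None
```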

The outputs of human review should feed back into your eval infrastructure. When a human catches something, that case goes into the golden dataset. Over time, your automated evals get better at catching the things humans used to have to catch manually. The loop closes.
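
Closing the loop can be as small as appending the reviewed case to the same golden dataset file the eval runner reads:

```python
# A human-reviewed case becomes a permanent regression test.
import json


def add_to_golden_dataset(case_input, corrected_output, path: str = "golden.jsonl") -> None:
    """Append a reviewer-corrected case so automated evals catch it next time."""
    with open(path, "a") as f:
        f.write(json.dumps({"input": case_input, "expected": corrected_output}, default=str) + "\n")
```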

Building in the Right Order

The common mistake is building observability after things start going wrong. At that point you're adding monitoring to a system you already don't understand, trying to retroactively capture information you should have been capturing from day one.

The right order is: tracing before you go to production, token accounting before you scale, evals before you change anything, human review as a continuous process. Each layer builds on the one before it. Tracing gives you the data for evals. Evals tell you what to put in front of human reviewers. Human review improves the evals.

If you're already in production without this infrastructure in place, start with tracing. Add it first, let it run for two weeks, and use what you learn to design the eval layer. Don't try to build everything at once.

The Number That Should Worry You

Go back to that 89/52 split. Nearly nine in ten production AI teams have some form of observability. Only about five in ten have automated evals. Close to half are watching dashboards that tell them the system is running, with nothing that tells them whether the system is working.

The difference sounds subtle. In practice it's the difference between knowing you have a problem when it starts and knowing you have a problem when users have been experiencing it for three weeks. AI systems fail in ways that traditional software doesn't. Not with clear errors, but with gradual quality drift, confident wrong answers, and edge cases that look fine until you know what to look for.

Observability that doesn't include evals is half a system. The half that tells you something happened, not whether it should have.

If you're building out the monitoring and eval infrastructure for an AI system and want a second opinion on the architecture, our AI for dev teams work covers exactly this: the engineering practices that keep AI systems trustworthy after they ship.

Need help making this real?

We build production AI systems and help dev teams go AI-native. Let's talk about where you are and what's next.