AI Evals Guide

What "evals" actually are

A practical guide to measuring whether your AI product behaves the way you want—connecting messy reality to a repeatable improvement process.

Evals are simply a disciplined way to measure whether your AI product behaves the way you want. Think of them as a loop:

  1. Look at real conversations (traces)
  2. Write clear notes about what went wrong (open coding)
  3. Group those notes into actionable failure modes (axial codes)
  4. Build cheap, automated checks for each failure mode (some are pure code, some use an LLM "as judge")
  5. Run them in CI and in production sampling, watch the trend lines, and fix what matters most

They're not just unit tests, not just benchmarks, and not a replacement for your PRD. Evals connect messy reality to a repeatable improvement process.

Start with reality: read traces and write one upstream note

Open a trace (system prompt, user turns, tool calls, assistant messages). Don't hunt every issue in a single conversation. Capture the first upstream error and move on. Examples the team spotted:

"Should have handed off to a human" when the assistant can't fulfill the request or the user asks for a person.

"Offered virtual tour where none exists" (misrepresentation).

"SMS fragments broke the flow" (short, split messages confused the agent).

"Transfer initiated without confirming with the user" (abrupt handoff, poor UX).

A good note is specific, product-aware, and short. Avoid vague words like "janky"; you won't be able to categorize such notes later.

Make one person accountable ("benevolent dictator")

Pick a single domain-savvy owner—often a PM who understands leasing operations. They decide how notes should be written, what counts as a failure, and how categories are defined. This prevents committee thrash and keeps the taxonomy crisp.

When to stop sampling ("theoretical saturation")

Keep reviewing new traces until you stop discovering new kinds of problems. In practice, most teams need ~40–100 to establish the first cut. You'll develop a feel for it quickly.

Turn raw notes into sharp, actionable categories

Now convert your pile of open-coded notes into clear failure modes you can measure. Use an LLM to draft clusters, then refine them to be more specific and actionable.

A stronger category set for a leasing assistant:

1. Human-handoff needed but missed

The agent should have routed to a person (explicit user request, policy/safety topics, data or tool unavailable, urgent same-day requests, maintenance escalation) and didn't.

2. Misrepresentation of offerings or policies

The agent claims features that don't exist (e.g., virtual tours), quotes incorrect availability, or contradicts policy.

3. Conversation flow and channel handling

SMS fragmentation, multi-turn interruptions, or thread loss that leads to confusion or wrong answers.

4. Scheduling and availability workflow errors

Wrong tool used, wrong parameters, didn't propose alternatives when the requested unit/type wasn't available, or failed to follow through on reschedule.

5. Output-contract violations

Response format is missing required fields, wrong tone/instructions ignored, no confirmation before a transfer, or missing call-to-action.

6. Promises and follow-ups not kept

The agent says it will notify, book, or send info and then doesn't.

7. Data/tooling gaps surfaced

Tools returned errors or stale data and the agent didn't degrade gracefully (e.g., it guessed instead of acknowledging limits).

8. None of the above (discovery bucket)

A deliberate escape hatch to catch new patterns; you'll fold these back into 1–7 as you learn.

Keep category names short, unambiguous, and easy to map to a fix (policy, prompt, tool, product change).

Count what matters (basic, powerful, and fast)

Drop your labeled notes into a pivot table. Rank failure modes by frequency and also by risk to the business. For a leasing assistant, "misrepresentation" and "missed human-handoff" might be lower in count but higher in risk than "formatting." This gives you an ordered backlog of fixes that actually move the needle.
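If the labeled notes already live in a spreadsheet or CSV, the ranking takes a few lines. A minimal sketch in Python, assuming a hypothetical labeled_notes.csv with label and risk columns (risk hand-assigned, 1–3):

```python
# Minimal sketch: rank failure modes by frequency weighted by a hand-assigned risk score.
# Assumes a hypothetical CSV of labeled notes with "label" and "risk" (1-3) columns.
import pandas as pd

notes = pd.read_csv("labeled_notes.csv")               # one row per open-coded note
ranked = notes.groupby("label").agg(
    count=("label", "size"),                           # how often this failure mode appears
    risk=("risk", "max"),                              # worst-case business risk seen for it
)
ranked["priority"] = ranked["count"] * ranked["risk"]  # crude frequency x risk ordering
print(ranked.sort_values("priority", ascending=False))
```

The exact weighting doesn't matter much; the point is an ordered backlog, not a precise score.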

Build evaluators: code first, LLM judge only when needed

For each failure mode, ask: can code alone catch it?

Pure code checks (cheap and reliable)

  • Output contract present (fields, length, JSON/markdown validity)
  • Required tool was called (or not called) for a given pattern
  • Confirmation language included before transfer
  • No restricted phrases (policy gates)
  • SLA timing and latency thresholds
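A minimal sketch of what a few of these checks can look like, assuming a hypothetical JSON output contract, a transfer_to_human tool name, and traces stored as lists of dicts (all placeholder names, not a fixed schema):

```python
import json
import re

REQUIRED_FIELDS = {"reply_text", "next_action"}                # hypothetical output contract
RESTRICTED = [r"\bguaranteed approval\b", r"\bfree month\b"]   # example policy gates

def check_output_contract(raw_response: str) -> bool:
    """Response must be valid JSON containing every required field."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return REQUIRED_FIELDS.issubset(payload)

def check_confirmation_before_transfer(trace: list[dict]) -> bool:
    """If a transfer tool was called, an earlier assistant turn must ask for confirmation."""
    for i, step in enumerate(trace):
        if step.get("type") == "tool_call" and step.get("name") == "transfer_to_human":
            earlier = [s for s in trace[:i] if s.get("role") == "assistant"]
            return any(
                re.search(r"connect you with|transfer you|shall I", s["content"], re.I)
                for s in earlier
            )
    return True  # no transfer attempted, nothing to violate

def check_restricted_phrases(text: str) -> bool:
    """No restricted phrases anywhere in the assistant output."""
    return not any(re.search(p, text, re.I) for p in RESTRICTED)
```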

LLM-as-judge (for genuine judgment calls)

Use only when the decision is semantic and context-heavy, e.g., "Should this have been handed to a human?" Make the judge binary (TRUE/FALSE). No 1–5 scores—force a decision.

Concrete judge prompt pattern for missed handoff:

  1. Provide the user turns, tool responses, and assistant replies for a single exchange.
  2. Provide clear rules:
    • TRUE if user explicitly requests a human/phone/agent; or
    • topic triggers policy/safety escalation; or
    • the needed tool/data is unavailable or inconsistent; or
    • urgent same-day tour/walk-in or urgent maintenance; or
    • language/accessibility barriers noted and progress is blocked.
  3. Ask for a single token answer: TRUE or FALSE.
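A minimal sketch of that judge as a function, assuming the OpenAI Python SDK; the model name and rule wording are placeholders to adapt to your own stack:

```python
# Minimal sketch of a binary LLM judge for "missed handoff".
# Assumes the OpenAI Python SDK; model name and rule text are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_RULES = """Return TRUE if the assistant should have transferred to a human:
- the user explicitly asks for a human/phone/agent; or
- the topic triggers policy/safety escalation; or
- the needed tool/data is unavailable or inconsistent; or
- the request is an urgent same-day tour/walk-in or urgent maintenance; or
- language/accessibility barriers block progress.
Otherwise return FALSE. Answer with a single token: TRUE or FALSE."""

def judge_missed_handoff(transcript: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                          # placeholder model
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RULES},
            {"role": "user", "content": transcript},  # user turns, tool responses, assistant replies
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("TRUE")
```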

Align the judge with humans before trusting it

Compare the judge's outputs to your human labels using a confusion matrix (not just "% agreement"). Reduce false positives/negatives by tightening rules and adding a few representative examples to the prompt. Split data to avoid overfitting. Once aligned, you can run this judge in CI and as a daily production sample.
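A minimal sketch of the alignment check with scikit-learn's confusion_matrix, on toy labels:

```python
# Minimal sketch: compare judge verdicts to human labels with a confusion matrix
# instead of a single agreement percentage. Labels here are toy data.
from sklearn.metrics import confusion_matrix

human = [True, True, False, False, True, False]    # human label: should have handed off?
judge = [True, False, False, True, True, False]    # LLM judge verdict on the same traces

tn, fp, fn, tp = confusion_matrix(human, judge, labels=[False, True]).ravel()
print(f"agreement:           {(tp + tn) / len(human):.2f}")  # can look fine while...
print(f"false negative rate: {fn / (fn + tp):.2f}")          # ...real missed handoffs slip by
print(f"false positive rate: {fp / (fp + tn):.2f}")          # ...or the judge over-flags
```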

Evals and PRDs belong together

Your PRD sets intent; your evaluators make those expectations executable and continuously measured. As you see real data, you'll expand and sharpen the PRD and the judges.

This is why teams say "evals are the new PRDs"—not because PRDs disappear, but because they gain a living, testable counterpart.

How many evaluators do you actually need?

Usually 4–7 well-chosen evaluators cover most persistent, high-risk failures. Many issues vanish with a prompt or tool fix and don't deserve an evaluator. Spend your cycles where judgment and risk live (handoff, misrepresentation, scheduling mistakes, follow-ups).

After the first pass: operationalize

In CI: run on curated regression traces that previously failed.

In prod: sample real conversations daily/weekly; compute failure rates per category, property, and channel.

Dashboards: trend lines, alerts when a category spikes, drill-downs to example traces.
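A minimal sketch of the production-sampling report, assuming evaluator results are written to a table with hypothetical date, category, channel, and failed columns:

```python
# Minimal sketch: daily production-sampling report and a crude spike alert.
# Assumes a hypothetical results table with "date", "category", "channel", "failed" columns.
import pandas as pd

results = pd.read_parquet("eval_results.parquet")
daily = (
    results.groupby(["date", "category", "channel"])["failed"]
    .mean()                                   # failure rate per slice
    .rename("failure_rate")
    .reset_index()
)

# Flag categories whose latest failure rate crosses an arbitrary threshold.
latest = daily[daily["date"] == daily["date"].max()]
spikes = latest[latest["failure_rate"] > 0.15]
print(spikes.sort_values("failure_rate", ascending=False))
```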

Practical tips that keep teams out of trouble

Write one upstream note per trace; move on quickly.

Keep category names actionable and stable; add "None of the above" for discovery.

Prefer binary evaluator outputs; they're easy to track and compare.

For judges, align with humans via a confusion matrix before rollout.

Track by channel (SMS vs chat vs voice) and stage (inquiry, scheduling, maintenance, renewal).

Revisit categories every quarter; products evolve and so do failure modes.

Don't conflate a high agreement % with quality; inspect mismatches.

Use evals to inform A/B tests, not replace them—A/Bs are production-level evals of business metrics; they work best when your hypotheses come from real error analysis.

Scenarios

Scenario: A prospect asks for a one-bedroom with a study; tools show no such option; the assistant replies "none with a study," but doesn't offer alternates and later contradicts itself.

→ Categories: scheduling/availability workflow, misrepresentation (if it later says tours exist), missed human-handoff if the user asked for help beyond tool limits.

Scenario: SMS input is chopped across tiny messages; the agent loses context.

→ conversation flow and channel handling; add a code check to stitch recent fragments, and a judge to flag replies that miss obvious continuity.
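One way to implement the stitching step, a sketch assuming each message is a dict with role, text, and ts (datetime) fields, all hypothetical names:

```python
# Minimal sketch: stitch SMS fragments that arrive within a short window,
# so the agent sees one coherent user turn. Field names are assumptions.
from datetime import timedelta

def stitch_fragments(messages: list[dict], window_seconds: int = 20) -> list[dict]:
    """Merge consecutive user messages whose timestamps fall within the window."""
    stitched: list[dict] = []
    for msg in messages:
        if (
            stitched
            and msg["role"] == "user"
            and stitched[-1]["role"] == "user"
            and msg["ts"] - stitched[-1]["ts"] <= timedelta(seconds=window_seconds)
        ):
            stitched[-1]["text"] += " " + msg["text"]
            stitched[-1]["ts"] = msg["ts"]
        else:
            stitched.append(dict(msg))
    return stitched
```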

Scenario: The assistant initiates a transfer without confirming with the user.

→ output-contract violation (require an explicit confirmation step); simple code check.

Scenario: The assistant promises to notify or book and never follows up.

→ promises and follow-ups not kept; code check for confirmation artifacts (ticket ID, calendar hold) or a judge that looks for "promise without artifact."
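A sketch of a "promise without artifact" check; the promise patterns and artifact-producing tool names are illustrative assumptions, not a fixed list:

```python
# Minimal sketch: flag promises that were never backed by an artifact-producing tool call.
import re

PROMISE_PATTERNS = [r"\bI('ll| will) (book|schedule|send|notify|follow up)\b"]
ARTIFACT_TOOLS = {"create_ticket", "create_calendar_hold", "send_followup"}  # hypothetical tools

def promise_without_artifact(trace: list[dict]) -> bool:
    """True (a failure) if the assistant promised an action but no artifact-producing tool ran."""
    promised = any(
        step.get("role") == "assistant"
        and any(re.search(p, step["content"], re.I) for p in PROMISE_PATTERNS)
        for step in trace
    )
    delivered = any(
        step.get("type") == "tool_call" and step.get("name") in ARTIFACT_TOOLS
        for step in trace
    )
    return promised and not delivered
```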

Minimal starter kit (you can drop into Sheets/Notion/Jupyter)

Open-code template:

Stage=Scheduling → Upstream error=Promised virtual tour where none exists → Why=Misrepresentation → Evidence=tool:get_availability shows no tours; assistant claims tours

Axial labeling instruction (for bulk labeling with an LLM):

"Given NOTE, output one label from {Handoff Missed, Misrepresentation, Conversation Flow, Scheduling/Availability, Output Contract, Promises/Follow-ups, Data/Tool Gap, None}. Choose the most upstream failure. Return only the label."

Judge prompt (handoff):

"Return TRUE if the assistant should have transferred to a human (explicit request; policy/safety topic; data/tool unavailable; urgent same-day request; blocked progress due to channel/language). Input is a single trace (user, tools, assistant). Output TRUE or FALSE only."