LLM Evaluation Metrics

15 metrics for modern, semantic, task-oriented evaluation

Answer Relevancy
>75% | 88%

Measures whether the generated answer addresses the specific question asked, not just related topics

Core Quality | Day 0 Gate

Context Relevancy
>75% | 78%

Measures how relevant retrieved context chunks are to the user's query

RAG-Specific | Beta

Contextual Precision
>70% | 82%

Measures whether the highest-ranked retrieved chunks are the ones most relevant to the query

RAG-Specific | Day 0 Gate
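
One common way to score ranking quality like this is a rank-weighted precision: average precision@k over the ranks k that hold a relevant chunk. The sketch below assumes per-chunk relevance labels already exist (for example from a judge model or human annotation); the function name and input shape are illustrative and not taken from a specific library.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, ordered best-first.

    relevance[i] is True when the chunk at rank i + 1 is relevant to the query.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted_sum += relevant_seen / rank
    # No relevant chunk retrieved: scored 0 here by assumption
    return weighted_sum / relevant_seen if relevant_seen else 0.0
```

Under this formulation, pushing a relevant chunk from rank 1 down to rank 2 lowers the score even though the same chunks were retrieved, which is the behavior the metric is meant to capture.
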
Contextual Recall
>70% | 85%

Percentage of all expected relevant context that was successfully retrieved

RAG-Specific | Beta
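
As a minimal sketch of the recall calculation, assume each expected piece of context has a stable identifier and that retrieval returns identifiers too. Both assumptions are illustrative; in practice the matching step is often done by a judge model rather than by exact ID comparison.

```python
def contextual_recall(expected: set[str], retrieved: set[str]) -> float:
    """Share of expected relevant items that appear in the retrieved set."""
    if not expected:
        # Nothing was expected; treated as perfect recall here by assumption
        return 1.0
    return len(expected & retrieved) / len(expected)
```

Note that extra retrieved items do not lower this score; they affect precision, not recall.
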
Cost per Query
<$0.05 | $0.03

Measures the computational cost of processing each query

Performance | Monitoring
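
For token-priced models, a per-query cost can be approximated from token counts. The sketch below assumes separate per-1,000-token prices for input and output; the parameter names and any prices passed in are hypothetical, and it ignores other cost sources such as retrieval or infrastructure.

```python
def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Approximate USD cost of one query from its token usage."""
    input_cost = (input_tokens / 1000) * usd_per_1k_input
    output_cost = (output_tokens / 1000) * usd_per_1k_output
    return input_cost + output_cost
```
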
Customer Satisfaction (CSAT)
>4/5 | 4.2/5

Measures customer satisfaction with LLM system responses

Business | North Star

Deflection Rate
>70% | 75%

Measures the percentage of queries resolved without human intervention

Business | North Star
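
The calculation can be sketched as a simple ratio. This version assumes each query carries a flag saying whether it was escalated to a human, and that every non-escalated query counts as resolved; that second assumption is a simplification, since a user may also abandon the conversation.

```python
def deflection_rate(escalated: list[bool]) -> float:
    """Share of queries handled without a human.

    escalated[i] is True when query i was handed to a human agent.
    """
    if not escalated:
        raise ValueError("no queries to score")
    handled_without_human = sum(1 for flag in escalated if not flag)
    return handled_without_human / len(escalated)
```
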
G-Eval Judge
>75% | 88%

Uses an evaluator LLM with explicit rubrics to score subjective quality dimensions

Subjective Quality | Beta
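
G-Eval style judges typically ask the evaluator LLM for a rating on a fixed scale and can weight each possible rating by the probability the model assigns to it, which gives a finer-grained score than a single integer. The sketch below covers only that aggregation step. It assumes the rating probabilities have already been obtained from the judge (the input dictionary is hypothetical) and makes no LLM call.

```python
def weighted_judge_score(
    score_probs: dict[int, float],
    min_score: int = 1,
    max_score: int = 5,
) -> float:
    """Probability-weighted rating, rescaled to the range 0..1.

    score_probs maps each rubric rating to the judge's probability for it.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Expected rating under the judge's (renormalized) distribution
    expected = sum(score * p for score, p in score_probs.items()) / total
    return (expected - min_score) / (max_score - min_score)
```
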
Helpfulness / Utility
>75% | 82%

Measures whether the output fully resolves the underlying user need (actionability, tone, focus)

Core Quality | Beta (A/B North Star)

PII Detection
>99% | 99.5%

Measures effectiveness of PII detection and redaction in outputs

Safety & Compliance | Day 0 Gate
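
A redaction rate can be estimated against a labeled test set: redact each output, then count how many known PII strings no longer appear. The sketch below is illustrative only. The two regular expressions are assumptions and cover just simple email and phone formats; real PII coverage needs a much broader detector.

```python
import re

# Illustrative patterns only; they miss many real-world PII formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redaction_rate(outputs: list[str], known_pii: list[list[str]]) -> float:
    """Share of labeled PII strings that no longer appear after redaction."""
    total = 0
    caught = 0
    for text, items in zip(outputs, known_pii):
        cleaned = redact(text)
        for item in items:
            total += 1
            if item not in cleaned:
                caught += 1
    # No labeled PII at all: reported as 1.0 here by assumption
    return caught / total if total else 1.0
```

Measuring against labeled strings also surfaces gaps in the detector: a format the patterns do not cover stays in the text and pulls the rate down.
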
QAG Scoring
>80% | 94%

Decomposes output into atomic claims, generates closed-ended questions, and verifies against context

Core Quality | Day 0 Gate
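
The final aggregation step of this pipeline can be sketched as the fraction of claims that pass a yes/no check. In the sketch below, claim extraction is assumed to have happened already, and question generation plus answering are folded into a single caller-supplied `verify` function. Both `verify` and `toy_verify` are hypothetical stand-ins for what would normally be LLM calls.

```python
from typing import Callable


def qag_score(
    claims: list[str],
    context: str,
    verify: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic claims whose yes/no check passes against the context."""
    if not claims:
        return 0.0  # no claims to check; scored 0 here by assumption
    passed = sum(1 for claim in claims if verify(claim, context))
    return passed / len(claims)


def toy_verify(claim: str, context: str) -> bool:
    # Stand-in for the LLM step: passes only on a verbatim, case-insensitive match.
    return claim.lower() in context.lower()
```

Because each verdict is binary, the score is easy to audit: every failed claim can be listed alongside the context it was checked against.
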
Response Time
<2000ms | 1200ms

Measures the time taken to generate a response from query to final output

Performance | Monitoring
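
A minimal way to capture this is to wrap the full query-to-output call in a timer. The sketch below uses Python's monotonic `time.perf_counter`; the `timed_call` helper is a hypothetical name, and the callable passed in stands for whatever function runs the whole pipeline.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[[], T]) -> tuple[T, float]:
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Timing the outermost call, rather than only the model request, keeps retrieval, tool calls, and post-processing inside the measurement.
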
Safety Score
>95% | 98%

Measures adherence to safety guidelines and prevention of harmful outputs

Safety & Compliance | Day 0 Gate

Task Completion
>80% | 76%

End-to-end success rate from full conversation trace to goal achievement

Agent & Tools | Beta → Gate

Tool Correctness
>95% | 98%

Checks whether the agent calls the expected tools with correct parameters and in the proper sequence

Agent & Tools | Day 0 Gate
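
One strict way to score this is a position-by-position comparison of the expected and actual call sequences. The sketch below assumes each call is recorded as a `(tool_name, params)` pair and requires exact matches on name, parameters, and position; the representation and the scoring rule are illustrative choices, and looser variants (for example ignoring order) are also reasonable.

```python
ToolCall = tuple[str, dict]


def tool_correctness(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of calls that match in name, parameters, and position."""
    if not expected:
        # No tools expected: correct only if the agent called none
        return 1.0 if not actual else 0.0
    matches = sum(1 for exp, act in zip(expected, actual) if exp == act)
    # Dividing by the longer list penalizes both missing and extra calls
    return matches / max(len(expected), len(actual))
```
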
