A complete guide to modern, semantic, and task-oriented LLM evaluation metrics that correlate with real user satisfaction and business outcomes.
Measures whether the generated answer addresses the specific question asked, not just related topics
Critical for user satisfaction. Identifies prompt engineering and UX issues that lead to off-topic responses.
G-Eval judge with relevancy rubric
def evaluate_relevancy(query, response):
    # judge_llm is any LLM client used as an evaluator ("LLM-as-judge")
    prompt = f"""
    Query: {query}
    Response: {response}
    Does the response directly address the query?
    Score 0-1 based on relevancy.
    """
    return judge_llm.evaluate(prompt)
Critical - Directly impacts user satisfaction and retention
Identifies prompt engineering and UX issues
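The judge_llm object above is left abstract; a minimal concrete sketch using the OpenAI Python client (the model choice and the naive float parsing are assumptions, not part of the original method):

from openai import OpenAI

client = OpenAI()

def judge_relevancy(query, response, model="gpt-4o-mini"):
    # Ask the judge model for a single 0-1 relevancy score
    prompt = (
        f"Query: {query}\n"
        f"Response: {response}\n"
        "Does the response directly address the query? "
        "Reply with only a number between 0 and 1."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(completion.choices[0].message.content.strip())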
Measures how relevant retrieved context chunks are to the user's query
Ensures retrieved context is actually useful for answering the query, improving answer quality
LlamaIndex ContextRelevancyEvaluator
from llama_index.core.evaluation import ContextRelevancyEvaluator

# Uses Settings.llm as the judge unless llm=... is passed explicitly
context_eval = ContextRelevancyEvaluator()
score = context_eval.evaluate(
    query=query,
    contexts=contexts,
).score
Medium - Improves answer quality through better context
Identifies context quality issues
Measures whether the highest-ranked retrieved chunks are the most relevant to the query
Diagnoses retriever ranking quality and reduces hallucination risk by ensuring relevant context is prioritized
LlamaIndex with metadata-based ground truth mapping
# Map expected answers to node IDs (the ground-truth chunks for this query)
expected_nodes = ["doc_1_chunk_3", "doc_2_chunk_1"]
retrieved_nodes = retriever.retrieve(query)
# calculate_precision is sketched below
precision_at_k = calculate_precision(expected_nodes, retrieved_nodes)
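A minimal sketch of the calculate_precision helper used above, assuming the retrieved items are LlamaIndex NodeWithScore objects exposing node.node_id and that precision is taken over the top k results:

def calculate_precision(expected_nodes, retrieved_nodes, k=5):
    # Fraction of the top-k retrieved chunks whose IDs are in the ground-truth set
    expected = set(expected_nodes)
    top_k = retrieved_nodes[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for n in top_k if n.node.node_id in expected)
    return hits / len(top_k)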
Measures the percentage of all expected relevant context that was successfully retrieved
Detects missed but available evidence in knowledge base, ensuring comprehensive answers
Coverage analysis of ground-truth evidence set
ground_truth_evidence = get_expected_evidence(query)    # set of expected evidence items for this query
retrieved_evidence = extract_evidence(retrieved_nodes)  # set of evidence found in the retrieved chunks
recall = len(retrieved_evidence & ground_truth_evidence) / len(ground_truth_evidence)
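A minimal self-contained variant, assuming the ground-truth evidence is expressed as a set of chunk IDs per query and the retrieved items are LlamaIndex NodeWithScore objects (the helper name is illustrative):

def context_recall(expected_ids, retrieved_nodes):
    # Share of ground-truth chunk IDs that appear anywhere in the retrieved set
    expected = set(expected_ids)
    if not expected:
        return 1.0
    retrieved_ids = {n.node.node_id for n in retrieved_nodes}
    return len(expected & retrieved_ids) / len(expected)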
Measures the computational cost of processing each query
Critical for budget management and cost optimization in production systems
Token counting and cost calculation
tokens_used = count_tokens(query, response)       # prompt + completion tokens
cost_per_token = get_model_pricing(model_name)    # blended USD price per token for the model
cost_per_query = tokens_used * cost_per_token
High - Directly impacts operational costs
Identifies cost optimization opportunities
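For OpenAI-style models, token counts can be read from the API usage field or approximated with tiktoken; a minimal sketch, assuming a tiktoken version that recognizes the model name and using placeholder per-million-token prices rather than real rates:

import tiktoken

def estimate_cost_per_query(query, response, model="gpt-4o-mini",
                            usd_per_m_input=0.15, usd_per_m_output=0.60):
    # Placeholder prices per million tokens; substitute your provider's current price list
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(query))
    output_tokens = len(enc.encode(response))
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000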
Measures customer satisfaction with LLM system responses
Ultimate business metric - directly correlates with user retention and business success
User feedback collection and analysis
satisfaction_scores = collect_user_ratings(time_period)
average_csat = sum(satisfaction_scores) / len(satisfaction_scores)
Critical - Directly impacts business success and retention
Overall system quality indicator
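In addition to the mean rating above, CSAT is commonly reported as the percentage of respondents choosing the top two boxes on a 1-5 scale; a minimal sketch assuming that convention:

def csat_percentage(ratings, satisfied_threshold=4):
    # Share of ratings at or above the threshold (top-two-box on a 1-5 scale)
    if not ratings:
        return 0.0
    satisfied = sum(1 for r in ratings if r >= satisfied_threshold)
    return 100.0 * satisfied / len(ratings)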
Measures the percentage of queries resolved without human intervention
Important business metric that measures system effectiveness and cost savings
Escalation tracking and resolution analysis
total_queries = count_total_queries(time_period)
resolved_queries = count_resolved_without_escalation(time_period)
deflection_rate = (resolved_queries / total_queries) * 100
High - Reduces support costs and improves efficiency
Measures system effectiveness
Uses an evaluator LLM with explicit rubrics to score subjective quality dimensions
Captures subjective quality aspects that correlate with user satisfaction and business KPIs
Custom G-Eval implementation with rubric-based scoring
def g_eval_score(query, response, rubric):
    # Generic rubric-based scoring with an evaluator LLM (G-Eval style)
    prompt = f"""
    Rubric: {rubric}
    Query: {query}
    Response: {response}
    Score 1-5 based on rubric criteria.
    """
    return llm.generate(prompt)
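The raw judge output is text, so it still needs to be parsed and aggregated; a minimal sketch building on the g_eval_score function above, which extracts the first number from the reply and averages a few samples to reduce judge variance (the sample count is arbitrary, and G-Eval proper weights candidate scores by token probabilities instead):

import re
import statistics

def parse_rubric_score(judge_output, lo=1.0, hi=5.0):
    # Pull the first number out of the judge's reply and clamp it to the rubric range
    match = re.search(r"\d+(?:\.\d+)?", judge_output)
    return min(max(float(match.group()), lo), hi) if match else None

def g_eval_mean(query, response, rubric, n_samples=3):
    scores = [parse_rubric_score(g_eval_score(query, response, rubric))
              for _ in range(n_samples)]
    scores = [s for s in scores if s is not None]
    return statistics.mean(scores) if scores else None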
Measures whether the output fully resolves the underlying user need (actionability, tone, focus)
Correlates with CSAT and deflection; drives business KPIs. Measures actual user value delivery.
G-Eval judge with helpfulness rubric
def g_eval_helpfulness(query, response, rubric):
    # rubric should spell out helpfulness criteria (actionability, tone, focus)
    prompt = f"""
    Rubric: {rubric}
    Query: {query}
    Response: {response}
    Score 1-5 based on rubric criteria.
    """
    return llm.generate(prompt)
Critical - Directly impacts user satisfaction and retention
Identifies prompt engineering and UX issues
Measures effectiveness of PII detection and redaction in outputs
Critical for privacy compliance and preventing data breaches
PII detection and redaction validation
pii_detected = pii_detector.scan(response)    # PII entities the detector flags in the output
pii_redacted = pii_redactor.redact(response)  # output with the flagged entities masked
# Scoring against labeled PII spans is sketched below
detection_score = calculate_pii_detection_accuracy(pii_detected, pii_redacted)
Critical - Prevents privacy violations and legal issues
Identifies privacy protection issues
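To score effectiveness rather than just run a detector, the detected spans need to be compared against labeled ground truth; a minimal sketch computing detection precision and recall over hand-labeled PII strings (the evaluation-set format and example values are assumptions):

def pii_detection_scores(detected, labeled):
    # detected / labeled are sets of PII strings (or normalized spans) for one response
    detected, labeled = set(detected), set(labeled)
    true_positives = len(detected & labeled)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall

precision, recall = pii_detection_scores(
    detected={"jane.doe@example.com"},
    labeled={"jane.doe@example.com", "+1-555-0100"},
)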
Decomposes the output into atomic claims, generates closed-ended questions, and verifies each claim against the context
Prevents hallucinations by ensuring every claim in the output can be verified against the provided context
LlamaIndex FaithfulnessEvaluator with QAG backend
from llama_index.core.evaluation import FaithfulnessEvaluator

# Uses Settings.llm as the judge unless llm=... is passed explicitly
faith_eval = FaithfulnessEvaluator()
score = faith_eval.evaluate(
    response=response,   # generated answer text
    contexts=contexts,   # retrieved context strings
).score
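A minimal sketch of the QAG-style loop described above, using a judge LLM to extract claims and verify each one against the context (judge_llm and both prompts are illustrative, not the library's internals):

def qag_faithfulness(response, contexts, judge_llm):
    # 1. Decompose the output into atomic claims
    claims_text = judge_llm.generate(
        f"List each factual claim in the text below, one per line:\n{response}"
    )
    claims = [c.strip() for c in claims_text.splitlines() if c.strip()]
    if not claims:
        return 1.0
    # 2. Verify each claim against the retrieved context with a closed-ended question
    context_block = "\n".join(contexts)
    supported = 0
    for claim in claims:
        verdict = judge_llm.generate(
            f"Context:\n{context_block}\n\n"
            f"Claim: {claim}\n"
            "Is this claim supported by the context? Answer yes or no."
        )
        supported += verdict.strip().lower().startswith("yes")
    # 3. Faithfulness = share of claims the context supports
    return supported / len(claims)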
Measures the time taken to generate a response from query to final output
Critical for user experience - slow responses lead to user abandonment and poor satisfaction
Timing instrumentation throughout pipeline
import time
start_time = time.time()
response = llm.generate(query)
end_time = time.time()
response_time = (end_time - start_time) * 1000 # ms
High - Directly impacts user experience and satisfaction
Identifies performance bottlenecks
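Single-request timing hides tail latency, so production systems usually track percentiles over many requests; a minimal sketch assuming response times are collected in milliseconds:

import statistics

def latency_percentiles(response_times_ms):
    # p50/p95/p99 over a window of measured response times (needs at least 2 samples)
    cuts = statistics.quantiles(sorted(response_times_ms), n=100)  # 99 cut points
    return {
        "p50": statistics.median(response_times_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }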
Measures adherence to safety guidelines and prevention of harmful outputs
Critical for preventing harmful outputs and maintaining brand safety
Safety evaluation with guideline checking
def evaluate_safety(response, guidelines):
    # safety_evaluator is any moderation model or guideline-aware judge LLM
    safety_score = safety_evaluator.evaluate(
        response=response,
        guidelines=guidelines,
    )
    return safety_score
Critical - Prevents brand damage and legal issues
Identifies safety and compliance issues
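One concrete option for the harmful-content portion is a hosted moderation endpoint; a minimal sketch using the OpenAI moderation API (guideline-specific policy checks still need a rubric-based judge):

from openai import OpenAI

client = OpenAI()

def moderation_check(response_text):
    # Flags harmful-content categories; does not cover custom brand or policy rules
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=response_text,
    ).results[0]
    return {"flagged": result.flagged, "categories": result.categories}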
Measures end-to-end success: whether the full conversation trace shows the user's goal was achieved
Ultimate utility metric - measures whether the system actually solved the user's problem
LLM-Judge with conversation trace analysis
def evaluate_task_completion(conversation, goal):
    # Judge the full conversation trace against the user's stated goal
    prompt = f"""
    Goal: {goal}
    Conversation: {conversation}
    Was the goal achieved? Score 0-1.
    """
    return judge_llm.evaluate(prompt)
Measures whether the agent calls the expected tools with correct parameters and in the proper sequence
Foundational for agent reliability - incorrect tool use can cause system failures or data corruption
Exact/conditional match against expected tool trace
expected_tools = [{"name": "search", "args": {"q": "policy"}}]
actual_tools = agent.get_tool_calls()
# validate_tool_sequence is sketched below
correctness = validate_tool_sequence(expected_tools, actual_tools)
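A minimal sketch of the validate_tool_sequence helper, assuming exact matching on tool name and arguments in order and that actual_tools uses the same dict format as expected_tools (a conditional-match variant would relax the comparison):

def validate_tool_sequence(expected_tools, actual_tools):
    # 1.0 only when the agent called the same tools, with the same args, in the same order
    if len(expected_tools) != len(actual_tools):
        return 0.0
    for expected, actual in zip(expected_tools, actual_tools):
        if expected["name"] != actual["name"] or expected["args"] != actual["args"]:
            return 0.0
    return 1.0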