Evaluation Arsenal

Comprehensive LLM Evaluation Metrics

Complete guide to modern, semantic, and task-oriented metrics that correlate with real user satisfaction and business outcomes.


All Metrics


Answer Relevancy
Answer Relevancy Evaluation
88%
Day 0 Gate · Critical

What it measures

Measures whether the generated answer addresses the specific question asked, not just related topics

How it works

1. Compare generated answer to user query
2. Assess semantic alignment and focus
3. Check for topic drift or irrelevant information
4. Score based on direct question addressing

Why it matters

Critical for user satisfaction. Identifies prompt engineering and UX issues that lead to off-topic responses.

Used in

Customer Support, Q&A Systems, Help Desks, Knowledge Assistants

Implementation

G-Eval judge with relevancy rubric

def evaluate_relevancy(query, response):
    prompt = f"""
    Query: {query}
    Response: {response}
    
    Does the response directly address the query?
    Score 0-1 based on relevancy.
    """
    return judge_llm.evaluate(prompt)
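
As an alternative sketch, DeepEval's AnswerRelevancyMetric packages the same check; this assumes the deepeval package is installed and an evaluator LLM is configured (e.g. via OPENAI_API_KEY), and reuses the query and response variables from above.

# Hedged sketch using DeepEval's AnswerRelevancyMetric.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(threshold=0.75)  # mirrors the 75% gate below
test_case = LLMTestCase(input=query, actual_output=response)
metric.measure(test_case)
print(metric.score, metric.reason)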

Business Impact

Critical - Directly impacts user satisfaction and retention

Diagnostic Value

Identifies prompt engineering and UX issues

Threshold: 75%
Core Quality
Context Relevancy
Context Relevancy Evaluation
78%
Beta · Medium

What it measures

Measures how relevant retrieved context chunks are to the user's query

How it works

1. Evaluate each retrieved chunk against query
2. Score relevance of individual chunks
3. Calculate average relevancy score
4. Identify irrelevant context

Why it matters

Ensures retrieved context is actually useful for answering the query, improving answer quality

Used in

RAG Systems, Document Search, Knowledge Retrieval, Information Systems

Implementation

LlamaIndex ContextRelevancyEvaluator

from llama_index.core.evaluation import ContextRelevancyEvaluator

context_eval = ContextRelevancyEvaluator()  # uses the default LLM configured in Settings
score = context_eval.evaluate(
    query=query,
    contexts=contexts,
).score
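
The same average can also be sketched as a transparent per-chunk loop; judge_chunk_relevancy is a hypothetical helper that scores one chunk against the query on a 0-1 scale.

# Illustrative per-chunk averaging (judge_chunk_relevancy is assumed, not a library call).
chunk_scores = [judge_chunk_relevancy(query, chunk) for chunk in contexts]
context_relevancy = sum(chunk_scores) / len(chunk_scores) if chunk_scores else 0.0
irrelevant_chunks = [c for c, s in zip(contexts, chunk_scores) if s < 0.5]  # flag weak context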

Business Impact

Medium - Improves answer quality through better context

Diagnostic Value

Identifies context quality issues

Threshold: 75%
RAG-Specific
Contextual Precision
Contextual Precision
82%
Day 0 Gate

What it measures

Whether highest-ranked retrieved chunks are most relevant to the query

How it works

1. Map expected answers to ground-truth node IDs
2. Measure precision at rank k
3. Score based on relevant docs in top results

Why it matters

Diagnoses retriever ranking quality and reduces hallucination risk by ensuring relevant context is prioritized

Used in

Document Retrieval, Knowledge Base Search, RAG Systems

Implementation

LlamaIndex with metadata-based ground truth mapping

# Map expected answers to node IDs
expected_nodes = ["doc_1_chunk_3", "doc_2_chunk_1"]
retrieved_nodes = retriever.retrieve(query)
precision_at_k = calculate_precision(expected_nodes, retrieved_nodes)
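
A minimal sketch of the calculate_precision helper used above (precision@k over node IDs; the helper name and k value are illustrative):

def calculate_precision(expected_nodes, retrieved_nodes, k=5):
    # Precision@k: share of the top-k retrieved chunks whose node IDs are expected.
    top_k_ids = [n.node.node_id for n in retrieved_nodes[:k]]  # NodeWithScore -> node ID
    hits = sum(1 for node_id in top_k_ids if node_id in expected_nodes)
    return hits / max(len(top_k_ids), 1)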
Threshold: 70%
RAG-Specific
Contextual Recall
Contextual Recall
85%
Beta

What it measures

Percentage of all expected relevant context that was successfully retrieved

How it works

1. Define ground-truth evidence set
2. Check coverage by retrieved nodes
3. Score based on evidence coverage

Why it matters

Detects missed but available evidence in knowledge base, ensuring comprehensive answers

Used in

Comprehensive Q&A, Research Assistance, Multi-document Analysis

Implementation

Coverage analysis of ground-truth evidence set

ground_truth_evidence = get_expected_evidence(query)
retrieved_evidence = extract_evidence(retrieved_nodes)
recall = len(retrieved_evidence & ground_truth_evidence) / len(ground_truth_evidence)
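
A self-contained variant of the same calculation over chunk IDs (the ground-truth IDs below are placeholders for illustration):

# Recall over node IDs; ground-truth IDs are example values, not real data.
ground_truth_ids = {"doc_1_chunk_3", "doc_2_chunk_1", "doc_4_chunk_2"}
retrieved_ids = {n.node.node_id for n in retriever.retrieve(query)}
contextual_recall = len(retrieved_ids & ground_truth_ids) / len(ground_truth_ids)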
Threshold: 70%
RAG-Specific
Cost per Query
Cost per Query
$0.03
Monitoring · Medium

What it measures

Measures the computational cost of processing each query

How it works

1. Track token usage for each query
2. Calculate cost based on model pricing
3. Include retrieval and generation costs
4. Monitor cost trends and optimization

Why it matters

Critical for budget management and cost optimization in production systems

Used in

Cost-sensitive Applications, High-volume Systems, Budget Management, Resource Optimization

Implementation

Token counting and cost calculation

tokens_used = count_tokens(query, response)
cost_per_token = get_model_pricing(model_name)
cost_per_query = tokens_used * cost_per_token
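
Prompt and completion tokens are usually priced differently; a slightly more precise sketch (the per-1K-token prices are placeholders, not real rates):

# Split prompt vs. completion pricing; substitute your provider's current rates.
PRICE_PER_1K = {"prompt": 0.0025, "completion": 0.0100}  # placeholder USD prices

def cost_per_query(prompt_tokens, completion_tokens):
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
        + (completion_tokens / 1000) * PRICE_PER_1K["completion"]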

Business Impact

High - Directly impacts operational costs

Diagnostic Value

Identifies cost optimization opportunities

Threshold: $0.05
Performance
Customer Satisfaction (CSAT)
Customer Satisfaction Score
4.2 / 5
North Star · Critical

What it measures

Measures customer satisfaction with LLM system responses

How it works

1. Collect user satisfaction ratings
2. Calculate average satisfaction score
3. Track satisfaction trends over time
4. Correlate with other quality metrics

Why it matters

Ultimate business metric - directly correlates with user retention and business success

Used in

Customer Service, Support Systems, User-facing Applications, Business Systems

Implementation

User feedback collection and analysis

satisfaction_scores = collect_user_ratings(time_period)
average_csat = sum(satisfaction_scores) / len(satisfaction_scores)

Business Impact

Critical - Directly impacts business success and retention

Diagnostic Value

Overall system quality indicator

Threshold: 4.0 / 5
Business
Deflection Rate
Deflection Rate
75%
North Star · High

What it measures

Measures the percentage of queries resolved without human intervention

How it works

1. Track queries that don't require human escalation
2. Calculate percentage of self-resolved queries
3. Monitor deflection trends
4. Correlate with satisfaction scores

Why it matters

Important business metric that measures system effectiveness and cost savings

Used in

Customer Support, Help Desks, Self-service Systems, Support Automation

Implementation

Escalation tracking and resolution analysis

total_queries = count_total_queries(time_period)
resolved_queries = count_resolved_without_escalation(time_period)
deflection_rate = (resolved_queries / total_queries) * 100

Business Impact

High - Reduces support costs and improves efficiency

Diagnostic Value

Measures system effectiveness

Threshold: 70%
Business
G-Eval Judge
G-Eval LLM-as-Judge
88%
Beta

What it measures

Uses an evaluator LLM with explicit rubrics to score subjective quality dimensions

How it works

1. Define explicit evaluation rubric
2. Use calibrated evaluator LLM
3. Score against rubric criteria
4. Support few-shot calibration

Why it matters

Captures subjective quality aspects that correlate with user satisfaction and business KPIs

Used in

Helpfulness Assessment, Brand Voice Compliance, Tone Evaluation

Implementation

Custom G-Eval implementation with rubric-based scoring

def g_eval_helpfulness(query, response, rubric):
    prompt = f"""
    Rubric: {rubric}
    Query: {query}
    Response: {response}
    
    Score 1-5 based on rubric criteria.
    """
    return llm.generate(prompt)
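
The raw judge output is free text, so a parsing step is usually needed before thresholding; a minimal sketch that extracts the 1-5 score and normalizes it to 0-1:

import re

def parse_judge_score(judge_output, min_score=1, max_score=5):
    # Pull the first number from the judge's reply, clamp to the rubric range,
    # and normalize to 0-1 so it can be compared against the gate threshold.
    match = re.search(r"\d+(?:\.\d+)?", judge_output)
    if not match:
        return None  # judge did not return a parseable score
    raw = min(max(float(match.group()), min_score), max_score)
    return (raw - min_score) / (max_score - min_score)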
Threshold: 75%
Subjective Quality
Helpfulness / Utility
Helpfulness and Utility Assessment
82%
Beta (A/B North Star) · High

What it measures

Measures whether the output fully resolves the underlying user need (actionability, tone, focus)

How it works

1. Define explicit evaluation rubric
2. Use calibrated evaluator LLM
3. Score against rubric criteria (actionability, completeness, tone)
4. Support few-shot calibration

Why it matters

Correlates with CSAT and deflection; drives business KPIs. Measures actual user value delivery.

Used in

Customer Service, Support Chatbots, Help Systems, User Assistance

Implementation

G-Eval judge with helpfulness rubric

def g_eval_helpfulness(query, response, rubric):
    prompt = f"""
    Rubric: {rubric}
    Query: {query}
    Response: {response}
    
    Score 1-5 based on rubric criteria.
    """
    return llm.generate(prompt)
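
A usage sketch with an illustrative rubric; the rubric wording and score anchors are examples, not a fixed standard.

# Illustrative rubric; adapt the anchors to your product's definition of "helpful".
helpfulness_rubric = (
    "5 = fully resolves the user's need with actionable steps and appropriate tone; "
    "3 = partially helpful, missing key steps or clarity; "
    "1 = off-topic or unusable."
)
raw_judgment = g_eval_helpfulness(query, response, helpfulness_rubric)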

Business Impact

Critical - Directly impacts user satisfaction and retention

Diagnostic Value

Identifies prompt engineering and UX issues

Threshold: 75%
Core Quality
PII Detection
PII Detection and Redaction
99.5%
Day 0 Gate · Critical

What it measures

Measures effectiveness of PII detection and redaction in outputs

How it works

1. Scan outputs for PII patterns
2. Verify redaction effectiveness
3. Check for missed PII instances
4. Score based on detection accuracy

Why it matters

Critical for privacy compliance and preventing data breaches

Used in

All Systems with Personal Data, Customer Service, Healthcare, Financial Services

Implementation

PII detection and redaction validation

pii_detected = pii_detector.scan(response)
pii_redacted = pii_redactor.redact(response)
detection_score = calculate_pii_detection_accuracy(pii_detected, pii_redacted)
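
One concrete option for the detector/redactor pair is Microsoft Presidio; a sketch assuming presidio-analyzer and presidio-anonymizer are installed:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

findings = analyzer.analyze(text=response, language="en")  # detected PII spans
redacted = anonymizer.anonymize(text=response, analyzer_results=findings).text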

Business Impact

Critical - Prevents privacy violations and legal issues

Diagnostic Value

Identifies privacy protection issues

Threshold: 99%
Safety & Compliance
QAG Scoring
Question-Answer Generation Scoring
94%
Day 0 Gate

What it measures

Decomposes output into atomic claims, generates closed-ended questions, and verifies against context

How it works

1. Extract atomic claims from LLM output
2. Generate yes/no questions for each claim
3. Verify questions against reference context
4. Score based on claim verification rate

Why it matters

Prevents hallucinations by ensuring every claim in the output can be verified against the provided context

Used in

RAG Support Chatbots, Knowledge Base Systems, Document Q&A

Implementation

LlamaIndex FaithfulnessEvaluator with QAG backend

from llama_index.core.evaluation import FaithfulnessEvaluator

faith_eval = FaithfulnessEvaluator()  # uses the default LLM configured in Settings
score = faith_eval.evaluate(
    response=response,
    contexts=contexts,
).score
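
The QAG loop itself can be sketched with hypothetical LLM-backed helpers for claim extraction and yes/no verification:

# extract_claims() and verify_claim() are assumed helpers, not library functions.
claims = extract_claims(response)                               # 1. atomic claims
verdicts = [verify_claim(claim, contexts) for claim in claims]  # 2-3. yes/no vs. context
qag_score = sum(verdicts) / len(verdicts) if claims else 1.0    # 4. verification rate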
Threshold: 80%
Core Quality
Response Time
Response Time
1200 ms
Monitoring · High

What it measures

Measures the time taken to generate a response from query to final output

How it works

1. Measure time from query submission
2. Track processing time through pipeline
3. Include retrieval, generation, and post-processing
4. Calculate average and P95 response times

Why it matters

Critical for user experience - slow responses lead to user abandonment and poor satisfaction

Used in

All LLM Systems, Real-time Applications, User-facing Systems, API Services

Implementation

Timing instrumentation throughout pipeline

import time

start_time = time.time()
response = llm.generate(query)
end_time = time.time()
response_time = (end_time - start_time) * 1000  # ms
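
Average and P95 over a monitoring window can be computed from collected latencies; a sketch using the standard library (needs at least two samples):

import statistics

def latency_summary(latencies_ms):
    # Mean and approximate 95th percentile over per-query latencies in milliseconds.
    avg = statistics.fmean(latencies_ms)
    p95 = statistics.quantiles(latencies_ms, n=100)[94]
    return avg, p95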

Business Impact

High - Directly impacts user experience and satisfaction

Diagnostic Value

Identifies performance bottlenecks

Threshold: 2000 ms
Performance
Safety Score
Safety Score
98%
Day 0 Gate · Critical

What it measures

Measures adherence to safety guidelines and prevention of harmful outputs

How it works

1. Evaluate outputs against safety guidelines
2. Check for harmful, biased, or inappropriate content
3. Score based on safety compliance
4. Track safety violations

Why it matters

Critical for preventing harmful outputs and maintaining brand safety

Used in

All Public-facing Systems, Customer-facing Applications, Content Generation, User Interactions

Implementation

Safety evaluation with guideline checking

def evaluate_safety(response, guidelines):
    safety_score = safety_evaluator.evaluate(
        response=response,
        guidelines=guidelines
    )
    return safety_score
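
One concrete guideline check is a hosted moderation endpoint; a sketch using the OpenAI Moderations API (assumes the openai package and an API key; the model name is one currently available option):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def moderation_flagged(response_text):
    # True if the moderation endpoint flags the output as harmful.
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=response_text,
    )
    return result.results[0].flagged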

Business Impact

Critical - Prevents brand damage and legal issues

Diagnostic Value

Identifies safety and compliance issues

Threshold: 95%
Safety & Compliance
Task Completion
Task Completion
76%
Beta → Gate

What it measures

End-to-end success rate, judged from the full conversation trace against the user's original goal

How it works

1. LLM-Judge reviews full conversation
2. Evaluates prompts, plans, tool calls
3. Assesses final state vs original goal

Why it matters

Ultimate utility metric - measures whether the system actually solved the user's problem

Used in

Multi-step Workflows, Customer Service Bots, Task Automation

Implementation

LLM-Judge with conversation trace analysis

def evaluate_task_completion(conversation, goal):
    prompt = f"""
    Goal: {goal}
    Conversation: {conversation}
    
    Was the goal achieved? Score 0-1.
    """
    return judge_llm.evaluate(prompt)
Threshold: 80%
Agent & Tools
Tool Correctness
Tool Correctness
98%
Day 0 Gate

What it measures

Whether the agent calls the expected tools with correct parameters and in the proper sequence

How it works

1. Compare tool calls vs ground-truth trace
2. Validate tool names and parameters
3. Check execution sequence

Why it matters

Foundational for agent reliability - incorrect tool use can cause system failures or data corruption

Used in

AI Agents, Workflow Automation, API Integration Systems

Implementation

Exact/conditional match against expected tool trace

expected_tools = [{"name": "search", "args": {"q": "policy"}}]
actual_tools = agent.get_tool_calls()
correctness = validate_tool_sequence(expected_tools, actual_tools)
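
A minimal sketch of the validate_tool_sequence helper (strict exact-match on name, arguments, and order; relax as needed for conditional matching):

def validate_tool_sequence(expected_tools, actual_tools):
    # Strict check: same number of calls, same order, same names and arguments.
    # Assumes actual_tools is a list of {"name": ..., "args": ...} dicts.
    if len(expected_tools) != len(actual_tools):
        return False
    return all(
        exp["name"] == act["name"] and exp["args"] == act["args"]
        for exp, act in zip(expected_tools, actual_tools)
    )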
Threshold: 95%
Agent & Tools