LLM Evaluation Glossary

81 terms covering frameworks, metrics, tools, and concepts

A
A/B Testing
Process

Experimental method that compares two versions of a system and uses statistical analysis to determine which performs better.

Experiment, Statistical Analysis, Performance Metrics (+2 more)
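
The statistical analysis step in an A/B test can be illustrated with a two-proportion z-test. This is a minimal sketch, assuming each variant is summarized by a success count (for example, resolved conversations) out of a total; the function name and the choice of a pooled z-test are illustrative assumptions, not something the glossary prescribes.

```python
import math

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled success rate under the null hypothesis that A and B perform equally.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z={z:.2f}, p={p:.3f}")
```

A small p-value suggests the difference between the two versions is unlikely to be noise; the significance level to act on is a decision for the team running the experiment.
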
Agent
Architecture

Autonomous system that can perform tasks by coordinating multiple tools and making decisions.

Autonomous, Tools, Decisions (+2 more)
Answer Relevancy
Metrics

Measures whether the generated answer addresses the specific question asked, not just related topics.

G-Eval, Relevancy, User Satisfaction (+2 more)
Arize AX Enterprise
Technology

Full enterprise SaaS platform for LLM observability and evaluation with multi-cloud, hybrid cloud, and data center support.

Observability, Enterprise Platform, Evaluation Framework (+3 more)
Arize Phoenix
Technology

Arize's open-source LLM observability platform with tracing, performance monitoring, and drift detection.

Observability, Performance Monitoring, Drift Detection (+2 more)
AutoGen
Technology

Microsoft's open-source framework for building LLM applications with multiple conversational AI agents that collaborate through natural language.

Multi-Agent Systems, Conversational AI, Group Chat (+3 more)
AutoMergingRetriever
Technology

A retrieval method that automatically merges related chunks to provide comprehensive context.

Retrieval, Hierarchical, Context (+2 more)
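
The merging idea can be sketched in a few lines. This is an illustration under stated assumptions, not LlamaIndex's actual implementation: the `parent_of` and `children_of` mappings and the `ratio` threshold are hypothetical names, and the rule used here is "replace retrieved child chunks with their parent when more than `ratio` of that parent's children were retrieved".

```python
from collections import defaultdict

def auto_merge(retrieved_ids, parent_of, children_of, ratio=0.5):
    """Swap retrieved child chunks for their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for chunk_id in retrieved_ids:
        by_parent[parent_of[chunk_id]].append(chunk_id)

    merged = []
    for parent, kids in by_parent.items():
        if len(kids) / len(children_of[parent]) > ratio:
            merged.append(parent)   # enough coverage: return the larger parent chunk
        else:
            merged.extend(kids)     # too sparse: keep the individual child chunks
    return merged

parent_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
children_of = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
print(auto_merge(["a1", "b1", "a2"], parent_of, children_of))
```
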
Axial Coding
Analysis Method

Second phase of qualitative analysis that connects and organizes initial codes into broader categories and relationships.

Open Coding, Qualitative Analysis, Failure Taxonomy (+2 more)
B
Baseline
Process

Initial performance measurement used as a reference point for future improvements.

Performance, Measurement, Reference (+2 more)
Braintrust
Technology

Assessment and evaluation platform for AI applications with collaborative annotation and trace viewer capabilities.

Evaluation, Annotation, Human Feedback (+2 more)
C
ChatGPT (OpenAI)
Technology

Versatile LLM with strong analytical capabilities, function calling, and comprehensive evaluation skills.

OpenAI, GPT-4, Function Calling (+2 more)
Chunking
Technology

Process of breaking down documents into smaller, manageable pieces for processing and retrieval.

Document Processing, RAG, Semantic (+2 more)
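
As a minimal sketch of chunking, the function below splits text into fixed-size word windows with overlap. The word-based splitting and the `chunk_size` and `overlap` parameters are assumptions chosen for illustration; production pipelines often split on tokens, sentences, or semantic boundaries instead.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```
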
CI/CD
Process

Continuous Integration/Continuous Deployment - automated pipeline for building, testing, and deploying software.

Quality Gates, Automation, Deployment (+2 more)
Claude (Anthropic)
Technology

Advanced LLM with sophisticated reasoning capabilities and safety-focused training for evaluation tasks.

LLM, Evaluation, Reasoning (+2 more)
Compliance
Process

Adherence to regulatory requirements, industry standards, and organizational policies.

Regulatory, Standards, Policies (+2 more)
Context
Technology

Retrieved information used to inform LLM responses in RAG systems.

RAG, Retrieved Information, Answer Quality (+2 more)
ContextRelevancyEvaluator
Technology

A LlamaIndex evaluator that measures how relevant retrieved context is to the user's query.

Context, Relevancy, Retrieval (+2 more)
Contextual Precision
Metrics

Measures the proportion of retrieved context that is relevant to answering the query.

RAG, Retrieval, Context (+2 more)
Contextual Recall
Metrics

Measures the proportion of relevant information that was successfully retrieved from the knowledge base.

RAG, Retrieval, Recall (+2 more)
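
Taken literally, the two proportions defined for Contextual Precision and Contextual Recall can be computed from chunk identifiers. This sketch assumes relevance labels are already available as a set of relevant chunk IDs (a hypothetical input); evaluation libraries typically obtain those labels from an LLM judge and may also weight results by rank.

```python
def contextual_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant to the query."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)

def contextual_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids) & relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c2", "c4", "c7"]
print(contextual_precision(retrieved, relevant), contextual_recall(retrieved, relevant))
```
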
Cost Tracking
Process

Monitoring and analysis of computational costs associated with LLM operations.

Cost, Monitoring, Computational (+2 more)
CrewAI
Technology

Multi-agent framework for orchestrating role-playing AI agents that collaborate to solve complex tasks.

Multi-Agent Systems, Agent Collaboration, Task Delegation (+2 more)
CSAT
Business

Customer Satisfaction Score - measures how satisfied customers are with products or services.

Customer Satisfaction, Business KPI, Helpfulness (+2 more)
Customer Tools
Architecture

Specialized tools for customer information retrieval, property lookup, and seamless transfer workflows.

Customer Lookup, Property Search, Transfer (+2 more)
D
Data & Governance
Architecture

Framework for data ingestion, versioning, and governance in LLM systems.

Data Ingestion, Versioning, Golden Datasets (+2 more)
Deflection
Business

Rate at which customer queries are resolved without human intervention.

Customer Queries, Resolution, Human Intervention (+2 more)
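
As a rough sketch, a deflection rate can be computed from conversation logs. The record fields `resolved` and `escalated_to_human` are hypothetical; how a team decides that a query counts as resolved is an assumption it has to define for itself.

```python
def deflection_rate(conversations):
    """Fraction of conversations resolved without being handed to a human."""
    if not conversations:
        return 0.0
    deflected = sum(
        1 for c in conversations
        if c["resolved"] and not c["escalated_to_human"]
    )
    return deflected / len(conversations)

logs = [
    {"resolved": True, "escalated_to_human": False},
    {"resolved": True, "escalated_to_human": True},
    {"resolved": False, "escalated_to_human": False},
    {"resolved": True, "escalated_to_human": False},
]
print(deflection_rate(logs))
```
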
Deterministic Code Checks
Evaluation Method

Rule-based validation using standard programming logic for reliable, consistent evaluation without LLM variability.

Rule-Based, Validation, Schema (+2 more)
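
A deterministic check might look like the sketch below, which validates that a model response is JSON with the expected fields. The required field names and the length limit are hypothetical examples, not rules taken from this glossary.

```python
import json

def check_response(raw, required_fields=("answer", "sources"), max_answer_chars=500):
    """Return a list of rule violations; an empty list means the response passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    for field in required_fields:
        if field not in data:
            failures.append(f"missing field: {field}")
    answer = data.get("answer")
    if isinstance(answer, str) and len(answer) > max_answer_chars:
        failures.append("answer too long")
    return failures

print(check_response('{"answer": "Reset it from Settings."}'))
```

Because the same input always produces the same verdict, checks like this are cheap to run on every response and suit automated pipelines.
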
DSPy
Technology

Programmatic framework for building and optimizing LLM pipelines and agent systems using signature-based design.

Pipeline Optimization, Programmatic Frameworks, Signature-Based Design (+2 more)
E
Embedding
Technology

Dense vector representation of text that captures semantic meaning for similarity calculations.

Vector, Semantic, Similarity (+2 more)
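
Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses tiny hand-written vectors as stand-ins; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```
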
F
Faithfulness / Correctness
Metrics

Measures the factual alignment of generated output with the provided reference context; used to detect hallucinations.

QAG, Hallucination, Context (+3 more)
FaithfulnessEvaluator
Technology

A LlamaIndex evaluator that measures how faithful generated responses are to the provided context.

Faithfulness, QAG, Hallucination (+2 more)
G
G-Eval (LLM-as-Judge)
Evaluation Method

Uses an evaluator LLM to score responses against a rubric covering actionability, completeness, tone, and next-step clarity.

Helpfulness, Utility, Rubric (+3 more)
Gemini (Google)
Technology

Multimodal LLM with deep analytical capabilities supporting text, images, audio, and video evaluation.

Google, Multimodal, Analysis (+2 more)
Golden Dataset
Process

Curated set of 50-200 test cases with known correct answers used for evaluation and benchmarking.

Test Cases, Benchmarking, Evaluation (+2 more)
Google Sheets
Technology

Collaborative evaluation analysis platform with pivot tables, automation, and team coordination features.

Data Analysis, Collaboration, Pivot Tables (+2 more)
Governance
Process

Framework for establishing ownership, responsibilities, and review processes for LLM evaluation.

Ownership, Responsibilities, Review (+2 more)
Ground Truth
Technology

The correct or expected answer for a given query, used as a benchmark for evaluation.

Reference, Expected Answer, Benchmark (+2 more)
Guardrails
Security

Safety mechanisms that prevent harmful, inappropriate, or non-compliant outputs from LLM systems.

Safety, Content Filtering, PII (+2 more)
GuidelineEvaluator
Technology

A LlamaIndex evaluator that checks responses against custom guidelines and policies.

Guidelines, Custom, Compliance (+2 more)
H
Hallucination
Technology

When an LLM generates plausible-sounding information that is factually incorrect or not supported by its training data or the provided context.

LLM, Generated Information, Training Data (+2 more)
Haystack Agents
Technology

Agent orchestration within Haystack's end-to-end NLP framework, enabling RAG-enhanced agents with document processing capabilities.

Haystack, RAG, Document Processing (+2 more)
Helpfulness / Utility
Metrics

Measures whether the output fully resolves the underlying user need (actionability, tone, focus).

CSAT, Deflection, Business KPIs (+2 more)
Hierarchical Retrieval
Technology

Retrieval method that uses multiple levels of document structure for comprehensive context.

Retrieval, Hierarchical, Context (+2 more)
I
I.O.R.M.G.O.D Framework (IORMGOD)
Architecture

A production-ready architecture framework for reliable AI systems: Interface & Gateway, Orchestrator/Agent, Retrieval (RAG), Models, Guardrails, Observability & Eval, Data & Governance.

RAG, Guardrails, Observability (+3 more)
Interface & Gateway
Architecture

Entry point for user interactions, including authentication, rate limiting, and caching.

Entry Point, Authentication, Rate Limiting (+2 more)
J
Julius AI
Technology

AI-powered notebook platform enabling natural language queries and automated insights generation.

AI-Powered, Natural Language, Automated Insights (+2 more)
Jupyter Notebooks
Technology

Interactive development environment for data analysis, experimentation, and reproducible research.

Data Analysis, Interactive, Python (+2 more)
L
LangChain
Technology

Popular LLM framework for building applications with chains, agents, document loaders, and built-in evaluators.

Chain Composition, Document Loaders, Agents (+2 more)
Langfuse
Technology

A comprehensive observability platform for LLM applications providing tracing, experiments, prompt management, and scoring.

Observability, Tracing, A/B Testing (+3 more)
LangGraph
Technology

Low-level orchestration framework for building, managing, and deploying long-running, stateful AI agents with graph-based workflows.

Agent Orchestration, State Management, Graph Workflows (+3 more)
LangSmith
Technology

Observability and debugging platform from the LangChain team, with deep LangChain integration for trace inspection and prompt debugging.

LangChain, Trace Inspection, Prompt Debugging (+2 more)
LlamaIndex
Technology

A comprehensive framework for building LLM applications with advanced parsing, retrieval, evaluation, and memory capabilities.

RAG, SemanticSplitter, AutoMergingRetriever (+3 more)
LLM
Technology

Large Language Model - AI model trained on vast amounts of text data to understand and generate human-like text.

AI Model, Text Generation, Training (+2 more)
LLM-as-Judge
Evaluation Method

Evaluation approach using LLMs to make semantic judgments constrained to binary TRUE/FALSE outputs.

Semantic Judgment, Binary Classification, Failure Modes (+2 more)
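
Constraining the judge to TRUE/FALSE makes its output easy to parse and aggregate. The sketch below assumes the judge's raw text replies have already been collected (the judge call itself is out of scope), and treats anything other than TRUE or FALSE as invalid rather than guessing; both function names are illustrative.

```python
def parse_binary_verdict(judge_output):
    """Map a judge reply to True, False, or None when it is not a clean TRUE/FALSE."""
    token = judge_output.strip().upper()
    if token == "TRUE":
        return True
    if token == "FALSE":
        return False
    return None

def pass_rate(judge_outputs):
    """Return (share of TRUE among valid verdicts, number of invalid replies)."""
    verdicts = [parse_binary_verdict(o) for o in judge_outputs]
    valid = [v for v in verdicts if v is not None]
    invalid_count = len(verdicts) - len(valid)
    if not valid:
        return 0.0, invalid_count
    return sum(valid) / len(valid), invalid_count

print(pass_rate(["TRUE", " false\n", "maybe", "TRUE"]))
```

Tracking the invalid count separately helps show when the judge prompt itself needs tightening.
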
M
M.A.G.I. Framework (MAGI)
Framework

A comprehensive framework for production-grade LLM evaluation consisting of four pillars: Metrics, Automation, Governance, and Improvement.

Metrics, Automation, Governance (+3 more)
Mem0
Technology

AI memory management framework providing persistent memory for agents across sessions and interactions.

Memory Management, Persistent Memory, Semantic Retrieval (+2 more)
Metadata Filtering
Technology

Process of filtering retrieved results based on document metadata (status, date, type, etc.).

Metadata, Filtering, Version Control (+2 more)
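
A minimal sketch of metadata filtering over retrieved results follows. It assumes each result is a dict with a `metadata` dict, and the `status` and `type` keys are hypothetical examples; vector databases usually apply such filters inside the query rather than afterwards.

```python
def filter_by_metadata(results, **required):
    """Keep results whose metadata matches every required key/value pair."""
    return [
        r for r in results
        if all(r["metadata"].get(key) == value for key, value in required.items())
    ]

results = [
    {"text": "Refund policy v3", "metadata": {"status": "current", "type": "policy"}},
    {"text": "Refund policy v2", "metadata": {"status": "archived", "type": "policy"}},
    {"text": "Holiday hours", "metadata": {"status": "current", "type": "notice"}},
]
print(filter_by_metadata(results, status="current", type="policy"))
```
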
Multi-Channel Communication
Architecture

Platform supporting SMS, text chat, and voice interfaces with intelligent routing and context management.

SMS, Chat, Voice (+2 more)
N
Node
Technology

Individual unit of processed content (chunk) in a document processing pipeline.

Chunk, Content, Metadata (+2 more)
O
Observability
Technology

Comprehensive monitoring and logging of system behavior, performance, and quality metrics.

Monitoring, Logging, Performance (+2 more)
Open Coding
Analysis Method

Qualitative analysis technique for systematically identifying and categorizing themes in unstructured data.

Qualitative Analysis, Failure Taxonomy, Error Analysis (+2 more)
Orchestrator
Architecture

System component that coordinates multiple services and manages workflow execution.

Coordination, Services, Workflow (+2 more)
P
Parsing
Technology

Process of analyzing and structuring documents for further processing in LLM applications.

Document Analysis, Structuring, Semantic Splitting (+2 more)
PII
Security

Personally Identifiable Information - data that can identify specific individuals.

Personal Data, Privacy, Guardrails (+2 more)
Postprocessing
Technology

Additional processing steps applied to retrieved results before final selection.

Processing, Filtering, Metadata (+2 more)
Prompt Management
Process

Systematic approach to creating, versioning, and optimizing prompts for LLM applications.

Prompts, Version Control, A/B Testing (+2 more)
Q
QAG (Question-Answer Generation)
Evaluation Method

An evaluation method that decomposes output into atomic claims, generates closed-ended questions, and verifies against context.

Faithfulness, Correctness, Context (+2 more)
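
The scoring step of QAG can be sketched once the output has been decomposed into atomic claims. Everything here is an assumption made for illustration: claim extraction is taken as given, `verify` is a hypothetical callable standing in for the closed-ended yes/no check (normally an LLM call), and `substring_verifier` is a deliberately naive stand-in so the example runs without a model.

```python
def qag_faithfulness(claims, context, verify):
    """Fraction of atomic claims that the verifier confirms against the context."""
    if not claims:
        return None  # nothing to verify, so no score is reported
    supported = sum(1 for claim in claims if verify(claim, context))
    return supported / len(claims)

def substring_verifier(claim, context):
    """Toy yes/no check: is the claim stated verbatim (case-insensitive) in the context?"""
    return claim.lower() in context.lower()

context = "Paris is the capital of France. It sits on the Seine."
claims = ["Paris is the capital of France", "Paris has 5 million residents"]
print(qag_faithfulness(claims, context, substring_verifier))
```
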
Quality Gates
Process

Automated checkpoints in CI/CD pipelines that enforce quality thresholds before deployment.

CI/CD, Threshold, Deployment (+2 more)
R
RAG
Technology

Retrieval-Augmented Generation - technique that combines retrieval of relevant information with text generation.

Retrieval, Generation, Context (+2 more)
RAG over Knowledge
Architecture

Retrieval-Augmented Generation implementation over contextual knowledge bases with dynamic updates.

RAG, Knowledge Base, Contextual (+2 more)
Reference
Technology

Ground truth or expected answer used for evaluation and comparison.

Ground Truth, Expected Answer, Correctness (+2 more)
RelevancyEvaluator
Technology

A LlamaIndex evaluator that measures how relevant generated responses are to the user's query.

Relevancy, Query, Response (+2 more)
Reranking
Technology

Process of reordering retrieved results based on relevance scores or additional criteria.

Retrieval, Ranking, Relevance (+2 more)
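
Reranking reduces to re-sorting retrieved passages by a second relevance score. In this sketch the scorer is a simple keyword-overlap function chosen as a stand-in; real systems typically use a cross-encoder or an LLM for that score, and both function names are illustrative.

```python
def keyword_overlap(query, text):
    """Toy relevance score: share of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def rerank(query, passages, score_fn=keyword_overlap, top_k=None):
    """Reorder passages from most to least relevant, optionally keeping only top_k."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked if top_k is None else ranked[:top_k]

passages = [
    "Shipping times vary by region",
    "To reset your password click the email link",
    "Password rules",
]
print(rerank("reset password email", passages, top_k=2))
```
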
S
Score Attribution
Process

Process of assigning evaluation scores to specific components or versions of a system.

Evaluation, Scores, Components (+2 more)
Semantic Kernel
Technology

Microsoft's plugin-based orchestration framework for building AI applications with goal-oriented agents.

Plugin Architecture, Goal-Oriented Agents, Planner System (+3 more)
SemanticSplitter
Technology

A document parsing method that splits text based on semantic similarity rather than fixed chunk sizes.

Parsing, Chunking, Semantic (+2 more)
SLO
Business

Service Level Objective - specific, measurable goals for system performance and reliability.

Service Level, Performance, Reliability (+2 more)
Statsig
Technology

Data-driven experimentation platform for feature flags, statistical testing, and cohort analysis.

Feature Flags, Statistical Testing, Cohort Analysis (+2 more)
T
Threshold
Process

Minimum acceptable score for a metric that triggers quality gates and deployment decisions.

Quality Gates, Metrics, Calibration (+2 more)
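
A threshold-driven quality gate can be sketched as a comparison of metric scores against minimums. The metric names and threshold values below are hypothetical; a CI job would typically fail the build when `passed` is false.

```python
def evaluate_gate(scores, thresholds):
    """Return (passed, failures), where failures maps metric -> (score, minimum)."""
    failures = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A missing metric counts as a failure rather than a silent pass.
        if score is None or score < minimum:
            failures[metric] = (score, minimum)
    return (not failures, failures)

scores = {"faithfulness": 0.92, "answer_relevancy": 0.78}
thresholds = {"faithfulness": 0.90, "answer_relevancy": 0.80}
passed, failures = evaluate_gate(scores, thresholds)
print(passed, failures)
```
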
Tracing
Technology

Detailed logging of request flow through LLM systems for debugging and optimization.

Logging, Debugging, Optimization (+2 more)
V
Vector Store
Technology

Database optimized for storing and querying high-dimensional vectors (embeddings).

Embedding, Similarity Search, RAG (+2 more)
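
To make the idea concrete, here is a toy in-memory vector store that answers queries by brute-force cosine similarity. The class and method names are hypothetical; real vector stores add persistence and approximate nearest-neighbor indexes so they do not have to scan every vector.

```python
import math

class InMemoryVectorStore:
    """Toy store: keeps (id, vector) pairs and scans all of them on each query."""

    def __init__(self):
        self._items = []

    def add(self, doc_id, vector):
        self._items.append((doc_id, list(vector)))

    def query(self, vector, top_k=3):
        """Return up to top_k (doc_id, similarity) pairs, most similar first."""
        scored = [(self._cosine(vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:top_k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

store = InMemoryVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [1.0, 1.0])
print([doc_id for doc_id, _ in store.query([1.0, 0.1], top_k=2)])
```
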
Version Control
Process

System for tracking changes to datasets, models, prompts, and evaluation criteria over time.

Tracking, Changes, Datasets (+2 more)
