LLM Evaluation Glossary

82 terms covering frameworks, metrics, tools, and concepts

A/B Testing

Process

Experimental method that compares two versions of a system and uses statistical analysis to determine which performs better.

Experiment, Statistical Analysis, Performance Metrics
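As a rough illustration of the statistical analysis step, the sketch below runs a two-sided two-proportion z-test on pass counts from two versions. The function name and the counts in the usage lines are hypothetical, and the normal approximation is assumed to be adequate for the sample sizes involved.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both versions perform equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both versions have identical all-pass or all-fail results.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: version A passed 50 of 100 cases, version B passed 80 of 100.
z, p = two_proportion_z_test(50, 100, 80, 100)
significant = p < 0.05
```

A small p-value suggests the difference is unlikely to be noise; the 0.05 cutoff is a common convention, not a requirement.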

Agent

Architecture

Autonomous system that can perform tasks by coordinating multiple tools and making decisions.

Autonomous, Tools, Decisions

Answer Relevancy

Metrics

Measures whether the generated answer addresses the specific question asked, not just related topics.

G-Eval, Relevancy, User Satisfaction

Arize AX Enterprise

Technology

Full enterprise SaaS platform for LLM observability and evaluation with multi-cloud, hybrid cloud, and data center support.

Observability, Enterprise Platform, Evaluation Framework

Arize Phoenix

Technology

Open-source LLM observability platform from Arize with performance monitoring and drift detection.

Observability, Performance Monitoring, Drift Detection

AutoGen

Technology

Microsoft's open-source framework for building LLM applications with multiple conversational AI agents that collaborate through natural language.

Multi-Agent Systems, Conversational AI, Group Chat

AutoMergingRetriever

Technology

A LlamaIndex retriever that automatically merges related retrieved chunks into their larger parent chunk to provide more complete context.

Retrieval, Hierarchical, Context

Axial Coding

Analysis Method

Second phase of qualitative analysis that connects and organizes initial codes into broader categories and relationships.

Open Coding, Qualitative Analysis, Failure Taxonomy

Baseline

Process

Initial performance measurement used as a reference point for future improvements.

Performance, Measurement, Reference

Braintrust

Technology

Assessment and evaluation platform for AI applications with collaborative annotation and trace viewer capabilities.

Evaluation, Annotation, Human Feedback

ChatGPT (OpenAI)

Technology

Versatile LLM with strong analytical capabilities, function calling, and comprehensive evaluation skills.

OpenAI, GPT-4, Function Calling

Chunking

Technology

Process of breaking down documents into smaller, manageable pieces for processing and retrieval.

Document Processing, RAG, Semantic
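One simple way to chunk is by fixed character windows with overlap, so that text near a boundary appears in both neighbouring chunks. The sketch below assumes character-based sizes; `chunk_text` and its parameters are illustrative names, and real pipelines often split on tokens, sentences, or semantic boundaries instead.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```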

CI/CD

Process

Continuous Integration/Continuous Deployment - automated pipeline for building, testing, and deploying software.

Quality Gates, Automation, Deployment

Claude (Anthropic)

Technology

Advanced LLM with sophisticated reasoning capabilities and safety-focused training for evaluation tasks.

LLM, Evaluation, Reasoning

Compliance

Process

Adherence to regulatory requirements, industry standards, and organizational policies.

Regulatory, Standards, Policies

Context

Technology

Retrieved information used to inform LLM responses in RAG systems.

RAG, Retrieved Information, Answer Quality

ContextRelevancyEvaluator

Technology

A LlamaIndex evaluator that measures how relevant retrieved context is to the user's query.

Context, Relevancy, Retrieval

Contextual Precision

Metrics

Measures the proportion of retrieved context that is relevant to answering the query.

RAG, Retrieval, Context

Contextual Recall

Metrics

Measures the proportion of relevant information that was successfully retrieved from the knowledge base.

RAG, Retrieval, Recall
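Read literally, the Contextual Precision and Contextual Recall definitions are two ratios over the same overlap. The sketch below computes the plain set-based form, assuming each chunk has an identifier and that the relevant chunks are already labeled; the function names are illustrative, and some evaluation tools use rank-aware or LLM-judged variants instead.

```python
def contextual_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are relevant to the query."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)


def contextual_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids) & relevant) / len(relevant)


# Hypothetical chunk identifiers: two of four retrieved chunks are relevant,
# and two of three relevant chunks were found.
retrieved = ["a", "b", "c", "d"]
relevant = ["a", "c", "e"]
precision = contextual_precision(retrieved, relevant)
recall = contextual_recall(retrieved, relevant)
```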

Cost Tracking

Process

Monitoring and analysis of computational costs associated with LLM operations.

Cost, Monitoring, Computational

CrewAI

Technology

Multi-agent collaboration framework for orchestrating role-playing AI agents that collaborate to solve complex tasks.

Multi-Agent Systems, Agent Collaboration, Task Delegation

CSAT

Business

Customer Satisfaction Score - measures how satisfied customers are with products or services.

Customer Satisfaction, Business KPI, Helpfulness

Customer Tools

Architecture

Specialized tools for customer information retrieval, property lookup, and seamless transfer workflows.

Customer Lookup, Property Search, Transfer

Data & Governance

Architecture

Framework for data ingestion, versioning, and governance in LLM systems.

Data Ingestion, Versioning, Golden Datasets

Deflection

Business

Rate at which customer queries are resolved without human intervention.

Customer Queries, Resolution, Human Intervention

Deterministic Code Checks

Evaluation Method

Rule-based validation using standard programming logic for reliable, consistent evaluation without LLM variability.

Rule-Based, Validation, Schema
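A deterministic check can be as small as parsing the model output and verifying its shape. The sketch below assumes the model is expected to return a JSON object; `check_output` and the field names in the usage line are hypothetical.

```python
import json


def check_output(raw, required_fields):
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected_type in required_fields.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for {name}")
    return errors


# Hypothetical schema: a string answer plus a numeric confidence.
schema = {"answer": str, "confidence": float}
violations = check_output('{"answer": "Reset it in Settings", "confidence": 0.9}', schema)
```

Because the same input always yields the same result, checks like this are cheap to run on every output before any LLM-based evaluation.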

DSPy

Technology

Programmatic framework for building and optimizing LLM pipelines and agent systems using signature-based design.

Pipeline Optimization, Programmatic Frameworks, Signature-Based Design

Embedding

Technology

Dense vector representation of text that captures semantic meaning for similarity calculations.

Vector, Semantic, Similarity
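Similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses plain Python lists and toy two-dimensional vectors as stand-ins for real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: same direction scores near 1, orthogonal directions score 0.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```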

Faithfulness / Correctness

Metrics

Measures the factual alignment of generated output with the provided reference context, in order to detect hallucinations.

QAG, Hallucination, Context

FaithfulnessEvaluator

Technology

A LlamaIndex evaluator that measures how faithful generated responses are to the provided context.

Faithfulness, QAG, Hallucination

G-Eval (LLM-as-Judge)

Evaluation Method

Uses an evaluator LLM to score responses against a rubric covering actionability, completeness, tone, and next-step clarity.

Helpfulness, Utility, Rubric

Gemini (Google)

Technology

Multimodal LLM with deep analytical capabilities supporting text, images, audio, and video evaluation.

Google, Multimodal, Analysis

Golden Dataset

Process

Curated set of 50-200 test cases with known correct answers used for evaluation and benchmarking.

Test Cases, Benchmarking, Evaluation

Google Sheets

Technology

Collaborative evaluation analysis platform with pivot tables, automation, and team coordination features.

Data Analysis, Collaboration, Pivot Tables

Governance

Process

Framework for establishing ownership, responsibilities, and review processes for LLM evaluation.

Ownership, Responsibilities, Review

Ground Truth

Technology

The correct or expected answer for a given query, used as a benchmark for evaluation.

Reference, Expected Answer, Benchmark

Guardrails

Security

Safety mechanisms that prevent harmful, inappropriate, or non-compliant outputs from LLM systems.

Safety, Content Filtering, PII

GuidelineEvaluator

Technology

A LlamaIndex evaluator that checks responses against custom guidelines and policies.

Guidelines, Custom, Compliance

Hallucination

Technology

When an LLM generates plausible-sounding information that is factually incorrect or not supported by its training data or the provided context.

LLM, Generated Information, Training Data

Haystack Agents

Technology

Agent orchestration within Haystack's end-to-end NLP framework, enabling RAG-enhanced agents with document processing capabilities.

Haystack, RAG, Document Processing

Helpfulness / Utility

Metrics

Measures whether the output fully resolves the underlying user need (actionability, tone, focus).

CSAT, Deflection, Business KPIs

Hierarchical Retrieval

Technology

Retrieval method that uses multiple levels of document structure for comprehensive context.

Retrieval, Hierarchical, Context

I.O.R.M.G.O.D. Framework (IORMGOD)

Architecture

A production-ready architecture framework for reliable AI systems: Interface & Gateway, Orchestrator/Agent, Retrieval (RAG), Models, Guardrails, Observability & Eval, Data & Governance.

RAG, Guardrails, Observability

Interface & Gateway

Architecture

Entry point for user interactions, including authentication, rate limiting, and caching.

Entry Point, Authentication, Rate Limiting

Julius AI

Technology

AI-powered notebook platform enabling natural language queries and automated insights generation.

AI-Powered, Natural Language, Automated Insights

Jupyter Notebooks

Technology

Interactive development environment for data analysis, experimentation, and reproducible research.

Data Analysis, Interactive, Python

LangChain

Technology

Popular LLM framework for building applications with chains, agents, document loaders, and built-in evaluators.

Chain Composition, Document Loaders, Agents

Langfuse

Technology

A comprehensive observability platform for LLM applications providing tracing, experiments, prompt management, and scoring.

Observability, Tracing, A/B Testing

LangGraph

Technology

Low-level orchestration framework for building, managing, and deploying long-running, stateful AI agents with graph-based workflows.

Agent Orchestration, State Management, Graph Workflows

LangSmith

Technology

LangChain observability and debugging platform with deep integration for trace inspection and debugging.

LangChain, Trace Inspection, Prompt Debugging

LlamaIndex

Technology

A comprehensive framework for building LLM applications with advanced parsing, retrieval, evaluation, and memory capabilities.

RAG, SemanticSplitter, AutoMergingRetriever

LLM

Technology

Large Language Model - AI model trained on vast amounts of text data to understand and generate human-like text.

AI Model, Text Generation, Training

LLM-as-Judge

Evaluation Method

Evaluation approach that uses an LLM to make semantic judgments, with the judge's output constrained to a binary TRUE/FALSE verdict.

Semantic Judgment, Binary Classification, Failure Modes
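Constraining the judge to TRUE/FALSE only helps if the calling code rejects anything else. The sketch below shows one strict way to parse a judge reply; `parse_verdict` is a hypothetical helper, and the raw string stands in for whatever text the judge model returns.

```python
def parse_verdict(raw):
    """Map a judge reply to a bool, rejecting anything other than TRUE or FALSE."""
    verdict = raw.strip().upper()
    if verdict == "TRUE":
        return True
    if verdict == "FALSE":
        return False
    # Failing loudly avoids silently counting a malformed reply as pass or fail.
    raise ValueError(f"judge returned a non-binary verdict: {raw!r}")


passed = parse_verdict(" true\n")
```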

M.A.G.I. Framework (MAGI)

Framework

A comprehensive framework for production-grade LLM evaluation consisting of four pillars: Metrics, Automation, Governance, and Improvement.

Metrics, Automation, Governance

Mem0

Technology

AI memory management framework providing persistent memory for agents across sessions and interactions.

Memory Management, Persistent Memory, Semantic Retrieval

Metadata Filtering

Technology

Process of filtering retrieved results based on document metadata (status, date, type, etc.).

Metadata, Filtering, Version Control
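In its simplest form, metadata filtering keeps only the results whose metadata matches every required key. The sketch below assumes each result is a dict with a `metadata` dict; that structure and the field values are illustrative, not the API of any particular vector store.

```python
def filter_by_metadata(results, **required):
    """Keep results whose metadata equals every required key/value pair."""
    return [
        result
        for result in results
        if all(result.get("metadata", {}).get(key) == value
               for key, value in required.items())
    ]


# Hypothetical retrieved results with status and type metadata.
results = [
    {"text": "Refund policy v2", "metadata": {"status": "active", "type": "policy"}},
    {"text": "Refund policy v1", "metadata": {"status": "archived", "type": "policy"}},
    {"text": "Release notes", "metadata": {"status": "active", "type": "changelog"}},
]
active_policies = filter_by_metadata(results, status="active", type="policy")
```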

Multi-Channel Communication

Architecture

Platform supporting SMS, text chat, and voice interfaces with intelligent routing and context management.

SMS, Chat, Voice

Node

Technology

Individual unit of processed content (chunk) in a document processing pipeline.

Chunk, Content, Metadata

Observability

Technology

Comprehensive monitoring and logging of system behavior, performance, and quality metrics.

Monitoring, Logging, Performance

Open Coding

Analysis Method

Qualitative analysis technique for systematically identifying and categorizing themes in unstructured data.

Qualitative Analysis, Failure Taxonomy, Error Analysis

Orchestrator

Architecture

System component that coordinates multiple services and manages workflow execution.

Coordination, Services, Workflow

Parsing

Technology

Process of analyzing and structuring documents for further processing in LLM applications.

Document Analysis, Structuring, Semantic Splitting

PII

Security

Personally Identifiable Information - data that can identify specific individuals.

Personal Data, Privacy, Guardrails
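A guardrail for PII often starts with pattern-based redaction. The sketch below masks email addresses only, using a deliberately simple regular expression; the pattern and the `[EMAIL]` placeholder are assumptions for illustration, and production systems usually cover many more PII types with dedicated detectors.

```python
import re

# Simplified email pattern: local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


safe = redact_emails("Contact jane.doe@example.com today.")
```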

Postprocessing

Technology

Additional processing steps applied to retrieved results before final selection.

Processing, Filtering, Metadata

Prompt Management

Process

Systematic approach to creating, versioning, and optimizing prompts for LLM applications.

Prompts, Version Control, A/B Testing

QAG (Question-Answer Generation)

Evaluation Method

An evaluation method that decomposes output into atomic claims, generates closed-ended questions, and verifies against context.

Faithfulness, Correctness, Context
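The scoring half of QAG can be sketched as the fraction of atomic claims that a verifier answers "yes" to. In the sketch below the claims are assumed to be extracted already, and `verify` is a hypothetical callable standing in for an LLM that answers the closed-ended question "is this claim supported by the context?"; the substring verifier is a toy used only to make the example runnable.

```python
def qag_score(claims, context, verify):
    """Fraction of claims that the verifier judges to be supported by the context."""
    if not claims:
        raise ValueError("at least one claim is required")
    supported = sum(1 for claim in claims if verify(claim, context))
    return supported / len(claims)


def toy_verify(claim, context):
    # Toy stand-in for an LLM yes/no call: exact substring match, case-insensitive.
    return claim.lower() in context.lower()


context = "Paris is the capital of France."
claims = ["Paris is the capital", "The population is 5 million"]
score = qag_score(claims, context, toy_verify)
```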

Quality Gates

Process

Automated checkpoints in CI/CD pipelines that enforce quality thresholds before deployment.

CI/CD, Threshold, Deployment
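A quality gate reduces to comparing each metric score against its minimum and failing the pipeline when any falls short. The sketch below assumes higher scores are better; the metric names and threshold values are hypothetical.

```python
def failed_gates(scores, thresholds):
    """Return the metrics whose score is missing or below its minimum threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A metric with no score is treated as a failure rather than skipped.
        if score is None or score < minimum:
            failures.append(metric)
    return failures


# Hypothetical evaluation results and thresholds.
scores = {"faithfulness": 0.92, "answer_relevancy": 0.70}
thresholds = {"faithfulness": 0.90, "answer_relevancy": 0.80, "helpfulness": 0.75}
failures = failed_gates(scores, thresholds)
# In a CI job, a non-empty list would typically lead to a non-zero exit code.
```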

RAG

Technology

Retrieval-Augmented Generation - technique that combines retrieval of relevant information with text generation.

Retrieval, Generation, Context

RAG over Knowledge

Architecture

Retrieval-Augmented Generation implementation over contextual knowledge bases with dynamic updates.

RAG, Knowledge Base, Contextual

Reference

Technology

Ground truth or expected answer used for evaluation and comparison.

Ground Truth, Expected Answer, Correctness

RelevancyEvaluator

Technology

A LlamaIndex evaluator that measures how relevant generated responses are to the user's query.

Relevancy, Query, Response

Reranking

Technology

Process of reordering retrieved results based on relevance scores or additional criteria.

Retrieval, Ranking, Relevance
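Reranking amounts to rescoring the retrieved candidates and sorting by the new score. In the sketch below, `score_fn` is a hypothetical scoring callable; a keyword-overlap scorer is used only as a stand-in for a real relevance model such as a cross-encoder.

```python
def rerank(results, score_fn, top_k=None):
    """Sort results by descending score and optionally keep only the top_k."""
    ranked = sorted(results, key=score_fn, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Toy scorer: number of query terms that appear in the document.
query_terms = {"reset", "password"}


def overlap_score(doc):
    return len(set(doc.lower().split()) & query_terms)


docs = ["billing faq", "how to reset your password", "password policy"]
top_two = rerank(docs, overlap_score, top_k=2)
```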

Score Attribution

Process

Process of assigning evaluation scores to specific components or versions of a system.

Evaluation, Scores, Components

Semantic Kernel

Technology

Microsoft's plugin-based orchestration framework for building AI applications with goal-oriented agents.

Plugin Architecture, Goal-Oriented Agents, Planner System

SemanticSplitter

Technology

A document parsing method that splits text based on semantic similarity rather than fixed chunk sizes.

Parsing, Chunking, Semantic

SLO

Business

Service Level Objective - specific, measurable goals for system performance and reliability.

Service Level, Performance, Reliability

Statsig

Technology

Data-driven experimentation platform for feature flags, statistical testing, and cohort analysis.

Feature Flags, Statistical Testing, Cohort Analysis

Threshold

Process

Minimum acceptable score for a metric that triggers quality gates and deployment decisions.

Quality Gates, Metrics, Calibration

Tracing

Technology

Detailed logging of request flow through LLM systems for debugging and optimization.

Logging, Debugging, Optimization

Vector Store

Technology

Database optimized for storing and querying high-dimensional vectors (embeddings).

Embedding, Similarity Search, RAG

Version Control

Process

System for tracking changes to datasets, models, prompts, and evaluation criteria over time.

Tracking, Changes, Datasets
