LLM Evaluation Glossary
82 terms covering frameworks, metrics, tools, and concepts
Experimental method that compares two versions and uses statistical analysis to determine which one performs better.
Experimental method comparing two versions of a system to determine which performs better.
Autonomous system that can perform tasks by coordinating multiple tools and making decisions.
Measures whether the generated answer addresses the specific question asked, not just related topics.
Full enterprise SaaS platform for LLM observability and evaluation with multi-cloud, hybrid cloud, and data center support.
Enterprise-grade LLM observability platform with sophisticated performance monitoring and drift detection.
Microsoft's open-source framework for building LLM applications with multiple conversational AI agents that collaborate through natural language.
A retrieval method that automatically merges related chunks to provide comprehensive context.
Second phase of qualitative analysis that connects and organizes initial codes into broader categories and relationships.
Initial performance measurement used as a reference point for future improvements.
Assessment and evaluation platform for AI applications with collaborative annotation and trace viewer capabilities.
Versatile LLM with strong analytical capabilities, function calling, and comprehensive evaluation skills.
Process of breaking down documents into smaller, manageable pieces for processing and retrieval.
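As an illustration of the chunking definition above, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and its parameters are hypothetical, and character-based splitting is an assumption made for simplicity; real pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The overlap keeps text that straddles a boundary available in two adjacent chunks, at the cost of storing some content twice.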
Continuous Integration/Continuous Deployment - automated pipeline for building, testing, and deploying software.
Advanced LLM with sophisticated reasoning capabilities and safety-focused training for evaluation tasks.
Adherence to regulatory requirements, industry standards, and organizational policies.
Retrieved information used to inform LLM responses in RAG systems.
A LlamaIndex evaluator that measures how relevant retrieved context is to the user's query.
Measures the proportion of retrieved context that is relevant to answering the query.
Measures the proportion of relevant information that was successfully retrieved from the knowledge base.
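The two retrieval metrics defined above can be sketched as simple set arithmetic when each chunk has a human-assigned relevance label. This is a simplification under stated assumptions: the function names are hypothetical, chunks are identified by string IDs, and evaluation frameworks commonly derive relevance with an LLM judge rather than from fixed labels.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are labeled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were actually retrieved."""
    if not relevant:
        return 0.0
    found = relevant & set(retrieved)
    return len(found) / len(relevant)


if __name__ == "__main__":
    retrieved_ids = ["a", "b", "c", "d"]
    relevant_ids = {"a", "c", "e"}
    print(context_precision(retrieved_ids, relevant_ids))
    print(context_recall(retrieved_ids, relevant_ids))
```

Precision falls when the retriever returns noise; recall falls when it misses needed material, so the two are usually reported together.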
Monitoring and analysis of computational costs associated with LLM operations.
Multi-agent collaboration framework for orchestrating role-playing AI agents that collaborate to solve complex tasks.
Customer Satisfaction Score - measures how satisfied customers are with products or services.
Specialized tools for customer information retrieval, property lookup, and seamless transfer workflows.
Framework for data ingestion, versioning, and governance in LLM systems.
Rate at which customer queries are resolved without human intervention.
Rule-based validation using standard programming logic for reliable, consistent evaluation without LLM variability.
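A rule-based check of the kind described above might look like the following sketch. The specific rules (a length limit, an email pattern, required phrases) and the name `check_response` are illustrative assumptions, not rules prescribed by this glossary; the email pattern is deliberately simple and is not a full address validator.

```python
import re

# Simplified pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_response(
    response: str,
    max_chars: int = 500,
    required_phrases: tuple[str, ...] = (),
) -> list[str]:
    """Return a list of rule violations; an empty list means the response passed."""
    failures: list[str] = []
    if len(response) > max_chars:
        failures.append("too_long")
    if EMAIL_PATTERN.search(response):
        failures.append("contains_email")
    lowered = response.lower()
    for phrase in required_phrases:
        if phrase.lower() not in lowered:
            failures.append(f"missing:{phrase}")
    return failures


if __name__ == "__main__":
    print(check_response("Contact me at a@b.com", required_phrases=("refund",)))
```

Because every rule is ordinary code, the same input always yields the same verdict, which is the property the definition contrasts with LLM-based scoring.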
Programmatic framework for building and optimizing LLM pipelines and agent systems using signature-based design.
Dense vector representation of text that captures semantic meaning for similarity calculations.
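The similarity calculation mentioned above is often cosine similarity between two embedding vectors. A minimal sketch in plain Python follows; the function name is hypothetical, and returning 0.0 for a zero-length vector is a convention assumed here rather than a standard.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```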
Measures the factual alignment of generated output to provided reference context, preventing hallucinations.
A LlamaIndex evaluator that measures how faithful generated responses are to the provided context.
Uses an evaluator LLM to score responses against a rubric covering actionability, completeness, tone, and next-step clarity.
Multimodal LLM with deep analytical capabilities supporting text, images, audio, and video evaluation.
Curated set of 50-200 test cases with known correct answers used for evaluation and benchmarking.
Collaborative evaluation analysis platform with pivot tables, automation, and team coordination features.
Framework for establishing ownership, responsibilities, and review processes for LLM evaluation.
The correct or expected answer for a given query, used as a benchmark for evaluation.
Safety mechanisms that prevent harmful, inappropriate, or non-compliant outputs from LLM systems.
A LlamaIndex evaluator that checks responses against custom guidelines and policies.
When an LLM generates information that is not supported by its training data or the provided context, typically presenting it as fact.
Agent orchestration within Haystack's end-to-end NLP framework, enabling RAG-enhanced agents with document processing capabilities.
Measures whether the output fully resolves the underlying user need (actionability, tone, focus).
Retrieval method that uses multiple levels of document structure for comprehensive context.
A production-ready architecture framework for reliable AI systems: Interface & Gateway, Orchestrator/Agent, Retrieval (RAG), Models, Guardrails, Observability & Eval, Data & Governance.
Entry point for user interactions, including authentication, rate limiting, and caching.
AI-powered notebook platform enabling natural language queries and automated insights generation.
Interactive development environment for data analysis, experimentation, and reproducible research.
Popular LLM framework for building applications with chains, agents, document loaders, and built-in evaluators.
A comprehensive observability platform for LLM applications providing tracing, experiments, prompt management, and scoring.
Low-level orchestration framework for building, managing, and deploying long-running, stateful AI agents with graph-based workflows.
LangChain's observability platform with deep integration for trace inspection and debugging.
A comprehensive framework for building LLM applications with advanced parsing, retrieval, evaluation, and memory capabilities.
Large Language Model - AI model trained on vast amounts of text data to understand and generate human-like text.
Evaluation approach using LLMs to make semantic judgments constrained to binary TRUE/FALSE outputs.
A comprehensive framework for production-grade LLM evaluation consisting of four pillars: Metrics, Automation, Governance, and Improvement.
AI memory management framework providing persistent memory for agents across sessions and interactions.
Process of filtering retrieved results based on document metadata (status, date, type, etc.).
Platform supporting SMS, text chat, and voice interfaces with intelligent routing and context management.
Individual unit of processed content (chunk) in a document processing pipeline.
Comprehensive monitoring and logging of system behavior, performance, and quality metrics.
Qualitative analysis technique for systematically identifying and categorizing themes in unstructured data.
System component that coordinates multiple services and manages workflow execution.
Process of analyzing and structuring documents for further processing in LLM applications.
Personally Identifiable Information - data that can identify specific individuals.
Additional processing steps applied to retrieved results before final selection.
Systematic approach to creating, versioning, and optimizing prompts for LLM applications.
An evaluation method that decomposes output into atomic claims, generates closed-ended questions, and verifies against context.
Automated checkpoints in CI/CD pipelines that enforce quality thresholds before deployment.
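A checkpoint like the one described above can be sketched as a comparison of metric scores against minimum thresholds. The function name, metric names, and threshold values below are hypothetical; treating a missing metric as a failure is an assumption chosen so that an incomplete evaluation run cannot pass the gate.

```python
def evaluate_gate(
    scores: dict[str, float],
    thresholds: dict[str, float],
) -> tuple[bool, list[str]]:
    """Return (passed, failing_metrics) for a set of minimum-score thresholds."""
    failing: list[str] = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A metric that was never computed counts as a failure (assumed policy).
        if score is None or score < minimum:
            failing.append(metric)
    return (not failing, failing)


if __name__ == "__main__":
    run_scores = {"faithfulness": 0.92, "answer_relevancy": 0.70}
    minimums = {"faithfulness": 0.90, "answer_relevancy": 0.80}
    passed, failing = evaluate_gate(run_scores, minimums)
    print(passed, failing)
    # In a CI job, a non-zero exit code would block the deployment step:
    # raise SystemExit(0 if passed else 1)
```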
Retrieval-Augmented Generation - technique that combines retrieval of relevant information with text generation.
Retrieval-Augmented Generation implementation over contextual knowledge bases with dynamic updates.
Ground truth or expected answer used for evaluation and comparison.
A LlamaIndex evaluator that measures how relevant generated responses are to the user's query.
Process of reordering retrieved results based on relevance scores or additional criteria.
Process of assigning evaluation scores to specific components or versions of a system.
Microsoft's plugin-based orchestration framework for building AI applications with goal-oriented agents.
A document parsing method that splits text based on semantic similarity rather than fixed chunk sizes.
Service Level Objective - specific, measurable goals for system performance and reliability.
Data-driven experimentation platform for feature flags, statistical testing, and cohort analysis.
Minimum acceptable score for a metric that triggers quality gates and deployment decisions.
Detailed logging of request flow through LLM systems for debugging and optimization.
Database optimized for storing and querying high-dimensional vectors (embeddings).
System for tracking changes to datasets, models, prompts, and evaluation criteria over time.
