Glossary & Definitions
LLM Evaluation Terms & Definitions
Comprehensive definitions for all technical terms, frameworks, and concepts used throughout the platform.
A/B Testing
Experimental method that compares two versions of a system and uses statistical analysis to determine which performs better.
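As a minimal sketch of the statistical analysis behind an A/B test, the function below runs a two-proportion z-test on success counts from two variants. The function name and the choice of a z-test are illustrative assumptions; the glossary does not prescribe a specific test.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates.

    Hypothetical helper: a pooled two-proportion z-test, one common choice
    for comparing pass rates of variant A and variant B.
    """
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Both variants succeeded (or failed) on every case: no detectable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z_test(40, 100, 60, 100)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be due to chance alone; the significance level to use is a project decision.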
Agent
Autonomous system that can perform tasks by coordinating multiple tools and making decisions.
Answer Relevancy
Measures whether the generated answer addresses the specific question asked, not just related topics.
Arize AX Enterprise
Full enterprise SaaS platform for LLM observability and evaluation with multi-cloud, hybrid cloud, and data center support.
Arize Phoenix
Open-source LLM observability tool from Arize for tracing, evaluating, and troubleshooting LLM applications, including performance monitoring and drift detection.
AutoGen
Microsoft's open-source framework for building LLM applications with multiple conversational AI agents that collaborate through natural language.
AutoMergingRetriever
A LlamaIndex retriever that automatically merges related retrieved chunks into their larger parent chunk to provide more complete context.
Axial Coding
Second phase of qualitative analysis that connects and organizes initial codes into broader categories and relationships.
Baseline
Initial performance measurement used as a reference point for future improvements.
Braintrust
Evaluation platform for AI applications with collaborative annotation and a trace viewer.
ChatGPT (OpenAI)
Versatile LLM with strong analytical capabilities, function calling, and comprehensive evaluation skills.
Chunking
Process of breaking down documents into smaller, manageable pieces for processing and retrieval.
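To make chunking concrete, here is a minimal sketch of fixed-size character chunking with overlap. The function name, the default sizes, and the character-based strategy are assumptions for illustration; production pipelines often split on tokens, sentences, or semantic boundaries instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: the sizes are illustrative defaults, not recommendations.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap keeps text that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated storage.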
CI/CD
Continuous Integration/Continuous Deployment - automated pipeline for building, testing, and deploying software.
Claude (Anthropic)
Advanced LLM with sophisticated reasoning capabilities and safety-focused training for evaluation tasks.
Compliance
Adherence to regulatory requirements, industry standards, and organizational policies.
Context
Retrieved information used to inform LLM responses in RAG systems.
ContextRelevancyEvaluator
A LlamaIndex evaluator that measures how relevant retrieved context is to the user's query.
Contextual Precision
Measures the proportion of retrieved context that is relevant to answering the query.
Contextual Recall
Measures the proportion of relevant information that was successfully retrieved from the knowledge base.
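The two proportions defined for Contextual Precision and Contextual Recall can be sketched with simple set arithmetic over chunk identifiers. This is a simplified, set-based illustration that assumes relevance labels are already known; evaluation libraries often use LLM judgments or rank-aware variants instead, and the function names here are hypothetical.

```python
def contextual_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def contextual_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


retrieved = ["a", "b", "c", "d"]
relevant = ["a", "c", "e"]
print(contextual_precision(retrieved, relevant))  # 2 of 4 retrieved chunks are relevant
print(contextual_recall(retrieved, relevant))     # 2 of 3 relevant chunks were retrieved
```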
Cost Tracking
Monitoring and analysis of computational costs associated with LLM operations.
CrewAI
Multi-agent collaboration framework for orchestrating role-playing AI agents that collaborate to solve complex tasks.
CSAT
Customer Satisfaction Score - measures how satisfied customers are with products or services.
Customer Tools
Specialized tools for customer information retrieval, property lookup, and seamless transfer workflows.
Data & Governance
Framework for data ingestion, versioning, and governance in LLM systems.
Deflection
Rate at which customer queries are resolved without human intervention.
Deterministic Code Checks
Rule-based validation using standard programming logic for reliable, consistent evaluation without LLM variability.
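A deterministic code check might look like the sketch below, which validates that a model output is valid JSON, contains required keys, and stays within a length limit. The function name, the specific checks, and the `max_chars` default are illustrative assumptions.

```python
import json


def check_json_output(raw, required_keys, max_chars=2000):
    """Run rule-based checks on a model output and return one boolean per check.

    Hypothetical helper: the same input always yields the same result,
    which is the point of deterministic checks.
    """
    checks = {
        "within_length": len(raw) <= max_chars,
        "valid_json": False,
        "has_required_keys": False,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return checks
    checks["valid_json"] = True
    checks["has_required_keys"] = isinstance(data, dict) and all(
        key in data for key in required_keys
    )
    return checks


print(check_json_output('{"answer": "42", "sources": []}', ["answer", "sources"]))
```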
DSPy
Programmatic framework for building and optimizing LLM pipelines and agent systems using signature-based design.
Embedding
Dense vector representation of text that captures semantic meaning for similarity calculations.
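The similarity calculation mentioned for embeddings is commonly cosine similarity. The sketch below computes it for two plain Python lists; the toy vectors are made up for illustration, and real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```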
Faithfulness / Correctness
Measures the factual alignment of generated output to provided reference context, preventing hallucinations.
FaithfulnessEvaluator
A LlamaIndex evaluator that measures how faithful generated responses are to the provided context.
G-Eval (LLM-as-Judge)
Uses an evaluator LLM to score responses against a rubric covering actionability, completeness, tone, and next-step clarity.
Gemini (Google)
Multimodal LLM with deep analytical capabilities supporting text, images, audio, and video evaluation.
Golden Dataset
Curated set of 50-200 test cases with known correct answers used for evaluation and benchmarking.
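One simple way to use a golden dataset is to run the system over every case and compute exact-match accuracy, as in this sketch. The `input` and `expected` field names, the `predict` callable, and the case-insensitive comparison are assumptions for illustration; many evaluations need looser matching than exact string equality.

```python
def exact_match_accuracy(golden_cases, predict):
    """Share of golden cases where the prediction equals the expected answer.

    `predict` is a hypothetical callable standing in for the system under test.
    """
    if not golden_cases:
        raise ValueError("golden dataset is empty")
    correct = sum(
        1
        for case in golden_cases
        if predict(case["input"]).strip().lower() == case["expected"].strip().lower()
    )
    return correct / len(golden_cases)


golden = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]
canned_answers = {"capital of France?": "paris", "2 + 2?": "5"}
print(exact_match_accuracy(golden, lambda q: canned_answers[q]))
```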
Google Sheets
Collaborative evaluation analysis platform with pivot tables, automation, and team coordination features.
Governance
Framework for establishing ownership, responsibilities, and review processes for LLM evaluation.
Ground Truth
The correct or expected answer for a given query, used as a benchmark for evaluation.
Guardrails
Safety mechanisms that prevent harmful, inappropriate, or non-compliant outputs from LLM systems.
GuidelineEvaluator
A LlamaIndex evaluator that checks responses against custom guidelines and policies.
Hallucination
When an LLM generates information that is factually incorrect or not supported by its training data or the provided context, often while presenting it as fact.
Haystack Agents
Agent orchestration within Haystack's end-to-end NLP framework, enabling RAG-enhanced agents with document processing capabilities.
Helpfulness / Utility
Measures whether the output fully resolves the underlying user need (actionability, tone, focus).
Hierarchical Retrieval
Retrieval method that uses multiple levels of document structure for comprehensive context.
I.O.R.M.G.O.D Framework
A production-ready architecture framework for reliable AI systems: Interface & Gateway, Orchestrator/Agent, Retrieval (RAG), Models, Guardrails, Observability & Eval, Data & Governance.
Interface & Gateway
Entry point for user interactions, including authentication, rate limiting, and caching.
Julius AI
AI-powered notebook platform enabling natural language queries and automated insights generation.
Jupyter Notebooks
Interactive development environment for data analysis, experimentation, and reproducible research.
LangChain
Popular LLM framework for building applications with chains, agents, document loaders, and built-in evaluators.
Langfuse
A comprehensive observability platform for LLM applications providing tracing, experiments, prompt management, and scoring.
LangGraph
Low-level orchestration framework for building, managing, and deploying long-running, stateful AI agents with graph-based workflows.
LangSmith
Observability platform from LangChain with deep framework integration for trace inspection and debugging.
LlamaIndex
A comprehensive framework for building LLM applications with advanced parsing, retrieval, evaluation, and memory capabilities.
LLM
Large Language Model - AI model trained on vast amounts of text data to understand and generate human-like text.
LLM-as-Judge
Evaluation approach using LLMs to make semantic judgments constrained to binary TRUE/FALSE outputs.
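The binary constraint described for LLM-as-Judge can be enforced in code by rejecting anything other than TRUE or FALSE. In this sketch, `call_llm` is a hypothetical callable that sends a prompt to whichever model is used as the judge; the prompt wording is also an assumption.

```python
from typing import Callable


def judge_binary(question: str, answer: str, criterion: str,
                 call_llm: Callable[[str], str]) -> bool:
    """Ask a judge model whether `answer` meets `criterion`; accept only TRUE/FALSE."""
    prompt = (
        "You are an evaluator. Reply with exactly TRUE or FALSE.\n"
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
    )
    raw = call_llm(prompt).strip().upper()
    if raw not in {"TRUE", "FALSE"}:
        # Fail loudly rather than guess what an off-format reply meant.
        raise ValueError(f"judge returned a non-binary verdict: {raw!r}")
    return raw == "TRUE"


# Stub judge for demonstration; a real setup would call a model API here.
verdict = judge_binary("What is 2 + 2?", "4", "The answer is correct.", lambda p: "TRUE")
print(verdict)
```

Raising on off-format replies keeps malformed judge outputs from silently counting as passes or failures.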
M.A.G.I. Framework
A comprehensive framework for production-grade LLM evaluation consisting of four pillars: Metrics, Automation, Governance, and Improvement.
Mem0
AI memory management framework providing persistent memory for agents across sessions and interactions.
Metadata Filtering
Process of filtering retrieved results based on document metadata (status, date, type, etc.).
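Metadata filtering can be sketched as a predicate applied to each retrieved result. The result shape (a dict with a `metadata` dict) and the function name are assumptions for illustration; vector stores typically expose their own filter syntax.

```python
def filter_by_metadata(results, **required):
    """Keep only results whose metadata matches every required key/value pair."""
    return [
        result
        for result in results
        if all(result.get("metadata", {}).get(key) == value
               for key, value in required.items())
    ]


results = [
    {"text": "Refund policy v2", "metadata": {"status": "published", "type": "policy"}},
    {"text": "Refund policy draft", "metadata": {"status": "draft", "type": "policy"}},
    {"text": "Release notes", "metadata": {"status": "published", "type": "changelog"}},
]
print(filter_by_metadata(results, status="published", type="policy"))
```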
Multi-Channel Communication
Platform supporting SMS, text chat, and voice interfaces with intelligent routing and context management.
Node
Individual unit of processed content (chunk) in a document processing pipeline.
Observability
Comprehensive monitoring and logging of system behavior, performance, and quality metrics.
Open Coding
Qualitative analysis technique for systematically identifying and categorizing themes in unstructured data.
Orchestrator
System component that coordinates multiple services and manages workflow execution.
Parsing
Process of analyzing and structuring documents for further processing in LLM applications.
PII
Personally Identifiable Information - data that can identify specific individuals.
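As a rough illustration of handling PII in logs or prompts, the sketch below masks email addresses and US-style phone numbers with regular expressions. The patterns and placeholder tokens are simplified assumptions and will miss many real-world formats; they are not a substitute for a dedicated PII detection tool.

```python
import re

# Simplified, illustrative patterns; real PII detection needs far broader coverage.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


print(redact_pii("Mail jane.doe@example.com or call 555-123-4567."))
```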
Postprocessing
Additional processing steps applied to retrieved results before final selection.
Prompt Management
Systematic approach to creating, versioning, and optimizing prompts for LLM applications.
QAG (Question-Answer Generation)
An evaluation method that decomposes output into atomic claims, generates closed-ended questions, and verifies against context.
Quality Gates
Automated checkpoints in CI/CD pipelines that enforce quality thresholds before deployment.
RAG
Retrieval-Augmented Generation - technique that combines retrieval of relevant information with text generation.
RAG over Knowledge
Retrieval-Augmented Generation implementation over contextual knowledge bases with dynamic updates.
Reference
Ground truth or expected answer used for evaluation and comparison.
RelevancyEvaluator
A LlamaIndex evaluator that measures how relevant generated responses are to the user's query.
Reranking
Process of reordering retrieved results based on relevance scores or additional criteria.
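Reranking reorders an initial result list using a second scoring pass. The sketch below uses query-term overlap as the score purely for illustration; that scoring function is an assumption, and production rerankers usually rely on cross-encoder models or similar relevance scorers.

```python
def rerank(query, documents, top_k=None):
    """Reorder documents by the share of query terms they contain (highest first)."""
    query_terms = set(query.lower().split())

    def score(document):
        if not query_terms:
            return 0.0
        return len(query_terms & set(document.lower().split())) / len(query_terms)

    # sorted() is stable, so documents with equal scores keep their original order.
    ranked = sorted(documents, key=score, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


docs = ["billing faq", "how to reset your password", "password policy"]
print(rerank("reset password", docs))
```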
Score Attribution
Process of assigning evaluation scores to specific components or versions of a system.
Semantic Kernel
Microsoft's plugin-based orchestration framework for building AI applications with goal-oriented agents.
SemanticSplitter
A document parsing method that splits text based on semantic similarity rather than fixed chunk sizes.
SLO
Service Level Objective - specific, measurable goals for system performance and reliability.
Statsig
Data-driven experimentation platform for feature flags, statistical testing, and cohort analysis.
Threshold
Minimum acceptable score for a metric that triggers quality gates and deployment decisions.
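Thresholds and quality gates fit together roughly as in the sketch below: each metric has a minimum score, and the gate passes only if every metric meets its minimum. The metric names, threshold values, and function name are made-up examples, not recommended settings.

```python
def evaluate_quality_gate(scores, thresholds):
    """Return (passed, failures), where failures maps metric -> (score, minimum).

    A metric that is missing from `scores` counts as a failure.
    """
    failures = {
        name: (scores.get(name), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name) is None or scores[name] < minimum
    }
    return not failures, failures


# Illustrative numbers only.
thresholds = {"faithfulness": 0.90, "answer_relevancy": 0.80}
scores = {"faithfulness": 0.93, "answer_relevancy": 0.74}
passed, failures = evaluate_quality_gate(scores, thresholds)
print(passed, failures)
```

In a CI/CD pipeline, a failed gate would typically end the job with a non-zero exit code so the deployment step does not run.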
Tracing
Detailed logging of request flow through LLM systems for debugging and optimization.
Vector Store
Database optimized for storing and querying high-dimensional vectors (embeddings).
Version Control
System for tracking changes to datasets, models, prompts, and evaluation criteria over time.