Agent Reliability

Tool Validation and Agent Behavior Failures

A case study of AI agents hallucinating tool names and parameters, causing system crashes and complete task failures.

Agent Tool Hallucination Crisis

When AI Agents Invent Their Own Reality

Agent Reliability · System Critical · 2 days to emergency fix

The Problem

Production agents were calling non-existent tools or using malformed parameters, causing system crashes and complete task failures. The agent would confidently attempt to use 'SearchDatabase' when only 'DatabaseQuery' existed, or pass strings where integers were required.
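
For illustration, a typical failing call looked roughly like this; the tool name 'SearchDatabase' is from the incident, but the parameter names are hypothetical:

# A representative hallucinated call (parameter names are hypothetical)
hallucinated_call = {
    "tool_name": "SearchDatabase",  # no such tool; only 'DatabaseQuery' is registered
    "parameters": {
        "state": "California",
        "min_order_total": "1000",  # string where the schema requires an integer
    },
}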

Impact:

System downtime, failed user tasks, and a complete loss of trust in the agent system; an emergency rollback was required.

Diagnosis

Langfuse Trace
Input: "Find all customers in California with orders over $1000"
Retrieved: customer_database, order_records
Missing: geographic_filters, order_amount_index

Key Metrics
toolCorrectness: 70%
taskCompletion: 45%
systemUptime: 82%
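
The code sketches below assume a minimal trace shape along these lines; the field names are assumptions for illustration, not the actual Langfuse data model:

# Minimal trace structures assumed by the snippets below (a sketch)
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool_name: str
    parameters: dict

@dataclass
class Trace:
    input: str
    tool_calls: list[ToolCall] = field(default_factory=list)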

Solution: Comprehensive tool validation and monitoring system

Tool Schema Validation

Strict validation of tool names and parameters before execution

# Tool validation layer
import jsonschema
from jsonschema import ValidationError

class ToolNotFoundError(Exception):
    """Raised when the agent requests a tool that is not registered."""

class ToolParameterError(Exception):
    """Raised when tool parameters fail schema validation."""

class ToolValidator:
    def __init__(self, available_tools):
        # Index each registered tool's JSON Schema by tool name
        self.tools = {tool.name: tool.schema for tool in available_tools}

    def validate_call(self, tool_name, parameters):
        # Reject hallucinated tool names before anything executes
        if tool_name not in self.tools:
            raise ToolNotFoundError(f"Tool '{tool_name}' not available")

        # Reject malformed parameters against the tool's JSON Schema
        schema = self.tools[tool_name]
        try:
            jsonschema.validate(parameters, schema)
        except ValidationError as e:
            raise ToolParameterError(f"Invalid parameters: {e.message}")

        return True
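
Placed in front of the execution layer, the validator turns a would-be crash into a catchable error. A minimal usage sketch; the tool schema here is hypothetical:

# Hypothetical usage: register one tool, then screen an agent's call
from collections import namedtuple

Tool = namedtuple("Tool", ["name", "schema"])

database_query = Tool(
    name="DatabaseQuery",
    schema={
        "type": "object",
        "properties": {
            "state": {"type": "string"},
            "min_order_total": {"type": "integer"},
        },
        "required": ["state", "min_order_total"],
    },
)

validator = ToolValidator([database_query])

try:
    # The hallucinated 'SearchDatabase' call fails fast instead of crashing downstream
    validator.validate_call("SearchDatabase", {"state": "California"})
except ToolNotFoundError as e:
    print(f"Blocked: {e}")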
Tool Correctness Metric

Automated evaluation of tool usage against expected patterns

# Tool correctness evaluation
def evaluate_tool_correctness(trace, expected_tools):
    """Score a trace against the expected tool-call pattern (0.0 to 1.0)."""
    if not expected_tools:
        return 1.0  # nothing was expected, so nothing could be wrong

    called_tools = [call.tool_name for call in trace.tool_calls]
    expected_names = [tool["name"] for tool in expected_tools]

    # Fraction of expected tool names the agent actually called
    name_accuracy = len(set(called_tools) & set(expected_names)) / len(expected_names)

    # Fraction of calls whose parameters exactly match, compared in order
    param_matches = sum(
        1 for call, expected in zip(trace.tool_calls, expected_tools)
        if call.parameters == expected["parameters"]
    )
    param_accuracy = param_matches / len(expected_tools)

    # Score the weakest dimension: the right tool with wrong parameters still fails
    return min(name_accuracy, param_accuracy)
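
A quick check using the trace structures sketched above; the expected parameters are illustrative:

# Hypothetical evaluation of a single trace
trace = Trace(
    input="Find all customers in California with orders over $1000",
    tool_calls=[ToolCall("DatabaseQuery", {"state": "California", "min_order_total": 1000})],
)
expected = [{"name": "DatabaseQuery",
             "parameters": {"state": "California", "min_order_total": 1000}}]

print(evaluate_tool_correctness(trace, expected))  # 1.0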
Circuit Breaker

Automatic fallback when tool correctness drops below threshold

# Circuit breaker for tool failures
from collections import deque

class CircuitBreakerOpen(Exception):
    """Raised when average tool correctness falls below the threshold."""

class ToolCircuitBreaker:
    def __init__(self, correctness_threshold=0.95, window_size=10):
        # The threshold applies to correctness scores, not failure rate
        self.threshold = correctness_threshold
        # Sliding window over the most recent correctness scores
        self.recent_scores = deque(maxlen=window_size)

    def record_score(self, score):
        self.recent_scores.append(score)

        avg_score = sum(self.recent_scores) / len(self.recent_scores)
        if avg_score < self.threshold:
            raise CircuitBreakerOpen("Tool correctness below threshold")
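
Wired together, the three layers give the behavior described in the results below: hard validation before execution, scoring after, and a human handoff when the breaker opens. A minimal sketch assuming the structures above; execute_tool and handoff_to_human are hypothetical stand-ins for the real execution and escalation layers:

# Putting the pieces together (execute_tool and handoff_to_human are
# hypothetical stand-ins, not part of the original system)
def execute_tool(call):
    ...  # hypothetical: dispatch to the real tool implementation

def handoff_to_human(trace):
    ...  # hypothetical: escalate the task to a human operator

breaker = ToolCircuitBreaker()

def run_agent_step(trace, expected_tools, validator):
    try:
        for call in trace.tool_calls:
            # Block hallucinated names and malformed parameters before execution
            validator.validate_call(call.tool_name, call.parameters)
            execute_tool(call)

        # Score the step and feed the breaker's sliding window
        breaker.record_score(evaluate_tool_correctness(trace, expected_tools))
    except (ToolNotFoundError, ToolParameterError, CircuitBreakerOpen):
        # Fail safe: escalate to a human instead of crashing or cascading
        handoff_to_human(trace)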

Results

Metric             Before    After
toolCorrectness    70%       98%
taskCompletion     45%       87%
systemUptime       82%       99.5%

Zero tool hallucinations in production, with automatic fallback to human handoff when tool correctness drops.

Key Lessons

  • Tool validation is non-negotiable for production agent systems
  • Circuit breakers prevent cascading failures from tool errors
  • Tool Correctness metric must have very high thresholds (≥95%)
  • Agent reliability depends on the weakest link in the tool chain