End-to-End Test Runs

Introduction

End-to-end test runs validate your entire AI system by passing test cases through your complete pipeline. This lets you catch regressions that a change to one component causes elsewhere in your system before they reach users.

Why End-to-End Testing Matters

Modern AI applications consist of multiple interacting components—LLM calls in sequence, tool usage, retrieval systems, and agent orchestration. Testing individual pieces in isolation isn't enough. You need to understand how changes ripple through your entire system to catch issues before they reach users.

End-to-end tests provide realistic performance assessment by testing your system exactly as users experience it. They capture complex workflows including multi-step processes, tool usage, and agent decision-making while tracking both final outputs and intermediate steps.

Implementation

End-to-end tests execute through the SDK, giving you complete control over your system's execution. Here's how to test a support agent system that uses multiple sub-agents and tools. This example uses Freeplay's Support Agent, which takes in customer requests and makes sure they are tracked properly. It is composed of several components, including FreeplaySupportAgent, DocsAgent, and LinearAgent; each handles different tasks and follows the common router prompt format for testing. We use an Agent (trace) dataset in Freeplay to test the end-to-end behavior.

import os

from freeplay import Freeplay
from tqdm import tqdm

from app.freeplay_support_agent import FreeplaySupportAgent
# DocsAgent and LinearAgent are sub-agent classes from your own application code

# Initialize the Freeplay client (assumes FREEPLAY_API_KEY is set and
# api_base points at your account)
fp_client = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://demo.freeplay.ai/api",
)

project_id = "3c446794-52ac-4c43-98ba-0ccf564ce39b"

# Initialize your complete agent system
support_agent = FreeplaySupportAgent(
    project_id=project_id,
    freeplay_account="demo",
    freeplay_env="production",
    docs_agent=DocsAgent(**kwargs),  # Sub-agent for documentation
    linear_agent=LinearAgent(**kwargs),  # Sub-agent for task management
)

# Create the test run against an existing dataset in Freeplay
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist="Customer Inquiries Gold Set",
    name="Support Agent Performance Claude vs OpenAI",
    include_outputs=True
)

# Execute the test for each case
# Note: this uses trace_test_cases since it is an agent (trace) dataset
for test_case in tqdm(test_run.trace_test_cases):
    # Process through the entire system, including:
    # - Intent classification
    # - Routing to sub-agents
    # - Multiple LLM calls
    # - Tool usage
    output = support_agent.process_query(
        test_case.input,
        test_run_info=test_run.get_test_run_info(test_case.id)
    )

The SDK automatically records all LLM calls, tool invocations, intermediate reasoning steps, and evaluation results throughout the execution.
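For this to work, the test_run_info returned by get_test_run_info has to reach every place your agent records completions or traces. The sketch below shows one way to thread it through a router-style agent; the class shape, method names, and sub-agent run interfaces are hypothetical stand-ins for whatever your own implementation looks like.

class FreeplaySupportAgent:
    """Hypothetical sketch of the agent's shape; your real implementation will differ."""

    def __init__(self, project_id, freeplay_account, freeplay_env, docs_agent, linear_agent):
        self.project_id = project_id
        self.docs_agent = docs_agent
        self.linear_agent = linear_agent
        # ... Freeplay client and prompt setup for the given account and environment ...

    def process_query(self, user_input, test_run_info=None):
        # Router step: classify the request and pick a sub-agent (your own prompt/logic)
        intent = self.classify_intent(user_input, test_run_info=test_run_info)

        # Sub-agents receive the same test_run_info, so their LLM calls,
        # tool invocations, and eval results land on the same test run
        if intent == "docs":
            return self.docs_agent.run(user_input, test_run_info=test_run_info)
        return self.linear_agent.run(user_input, test_run_info=test_run_info)

    def classify_intent(self, user_input, test_run_info=None):
        # Wherever your agent already records a completion to Freeplay, include
        # the test_run_info in the record payload, e.g.
        #   fp_client.recordings.create(RecordPayload(..., test_run_info=test_run_info))
        # With it set, the SDK ties the completion and its evals to the test run.
        ...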

Analyzing Results

After running your tests, Freeplay provides comprehensive analysis at both the agent and component levels. The overview shows high-level metrics comparing different versions or models:

You can drill into specific evaluation categories to understand performance across different aspects of your system. Agent evaluations assess the complete workflow:

The row-level view reveals how each component contributes to overall system performance. Notice in the example below that we have marked the Claude version as the winner. This view lets you step through every completion in the dataset and compare versions side by side to see how they differ. Under the session details, you can also see which actions and steps the agent took in each case during the end-to-end test:

Best Practices

Include real user interactions that represent typical usage patterns, edge cases that challenge your system, and known failure scenarios that you've encountered. This realistic data ensures your tests catch actual problems users might face.

Run end-to-end tests at critical points in your development cycle. Execute them before deploying to production, after significant code changes, and as part of your CI/CD pipeline. Regular testing catches regressions early when they're easier to fix.
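If you gate deployments in CI, one lightweight pattern is to compute a client-side pass rate while the loop runs and fail the job when it drops below a threshold. This is only a sketch: run_case and passes_evals are hypothetical helpers standing in for your own agent call and client-side evaluation logic, and the threshold is an arbitrary example.

import sys

PASS_RATE_THRESHOLD = 0.9  # example threshold; tune for your use case

def gate_on_pass_rate(test_run, run_case, passes_evals):
    # run_case(test_case, test_run_info) calls your agent system;
    # passes_evals(test_case, output) applies your own client-side checks.
    # Both are hypothetical helpers you would implement.
    cases = list(test_run.trace_test_cases)
    passed = 0
    for test_case in cases:
        output = run_case(test_case, test_run.get_test_run_info(test_case.id))
        if passes_evals(test_case, output):
            passed += 1

    pass_rate = passed / len(cases) if cases else 0.0
    print(f"End-to-end pass rate: {pass_rate:.1%}")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job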

Advanced Patterns

For multi-agent systems, test the collaboration and handoffs between agents:

for test_case in test_run.trace_test_cases:
    test_run_info = test_run.get_test_run_info(test_case.id)

    # Primary agent processes the request (assuming your agents accept
    # test_run_info, like the support agent above)
    initial_response = primary_agent.process(
        test_case.input, test_run_info=test_run_info
    )

    # Hand off to a specialist when your routing logic calls for it
    # (requires_specialist is your own application logic)
    if requires_specialist(initial_response):
        final_response = specialist_agent.process(
            test_case.input,
            context=initial_response,
            test_run_info=test_run_info,
        )
    else:
        final_response = initial_response

For RAG pipelines, track each stage of the process:

# Create a Freeplay session, then a trace for the complete pipeline
# (query is the input you are testing, e.g. test_case.input)
session = fp_client.sessions.create()
trace_info = session.create_trace(
    input=query,
    agent_name="rag_pipeline"
)

# Track retrieval, reranking, and generation
# (retrieval_system, reranker, and generate_response are your own components)
retrieved_docs = retrieval_system.search(query)
reranked_docs = reranker.rerank(query, retrieved_docs)
response = generate_response(query, reranked_docs)

# Record the final output along with client-side eval results
# (evaluate_retrieval and evaluate_answer are your own evaluators)
trace_info.record_output(
    output=response,
    eval_results={
        'retrieval_relevance': evaluate_retrieval(query, retrieved_docs),
        'answer_quality': evaluate_answer(query, response)
    }
)