Introduction
End-to-end test runs validate your entire AI system by passing test cases through your complete pipeline. This comprehensive approach ensures that changes to any component don't cause unexpected regressions elsewhere in your system.

Why End-to-End Testing Matters
Modern AI applications consist of multiple interacting components: LLM calls in sequence, tool usage, retrieval systems, and agent orchestration. Testing individual pieces in isolation isn't enough; you need to understand how changes ripple through your entire system to catch issues before they reach users.

End-to-end tests provide realistic performance assessment by testing your system exactly as users experience it. They capture complex workflows, including multi-step processes, tool usage, and agent decision-making, while tracking both final outputs and intermediate steps.

Implementation
End-to-end tests execute through the SDK, giving you complete control over your system's execution. Here's how to test a support agent system that uses multiple sub-agents and tools. This example uses Freeplay's Support Agent, which takes in customer requests and makes sure they are tracked well. It is made up of several components, including FreeplaySupportAgent, DocsAgent, and LinearAgent. Each of these agents handles different tasks and follows the common router prompt format for testing. We use an Agent (trace) dataset in Freeplay to test the end-to-end behavior.
Step 1: Set up
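The setup step typically installs the SDK and exports credentials. A minimal sketch, assuming the Python SDK is published as `freeplay` on PyPI; the environment variable names here are illustrative, not prescribed:

```shell
# Install the Freeplay Python SDK (package name assumed)
pip install freeplay

# Credentials the client will read -- variable names are illustrative
export FREEPLAY_API_KEY="your-api-key"
export FREEPLAY_PROJECT_ID="your-project-id"
```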
Step 2: Minimal Agent Example
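A minimal sketch of the agent structure described above. The agent names (FreeplaySupportAgent, DocsAgent, LinearAgent) come from the example; the routing logic and responses here are stand-ins for real LLM and tool calls, so the sketch runs on its own:

```python
# Minimal stand-in for the support agent system: a router that delegates
# to sub-agents and records intermediate steps for the test run.
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Final output plus the intermediate steps an end-to-end test records."""
    output: str
    steps: list = field(default_factory=list)


class DocsAgent:
    name = "docs_agent"

    def respond(self, request: str) -> str:
        # Stand-in for retrieval + an LLM answer over documentation.
        return f"Here is what the docs say about: {request}"


class LinearAgent:
    name = "linear_agent"

    def respond(self, request: str) -> str:
        # Stand-in for a tool call that files a Linear ticket.
        return f"Filed a Linear ticket for: {request}"


class FreeplaySupportAgent:
    """Router agent: picks a sub-agent, delegates, and records each step."""

    def __init__(self):
        self.sub_agents = {"docs": DocsAgent(), "linear": LinearAgent()}

    def route(self, request: str) -> str:
        # Stand-in for the router prompt; a real system would ask an LLM.
        return "linear" if "bug" in request.lower() else "docs"

    def run(self, request: str) -> AgentResult:
        choice = self.route(request)
        sub_agent = self.sub_agents[choice]
        answer = sub_agent.respond(request)
        return AgentResult(
            output=answer,
            steps=[("router", choice), (sub_agent.name, answer)],
        )


if __name__ == "__main__":
    result = FreeplaySupportAgent().run("Customer reports a bug in export")
    print(result.output)
    print(result.steps)
```

Keeping the intermediate `steps` list alongside the final output is what lets a test run grade both the end result and the decisions made along the way.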
Step 3: Create test run, iterate cases, record outputs
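The control flow for this step can be sketched as follows. The in-memory `TestCase` stub and `record` callback stand in for the Freeplay SDK's test-run and recording calls (indicated in comments), so the loop itself is runnable as written:

```python
# Sketch of the test-run loop: iterate dataset cases, execute the system
# end to end on each input, and record the output against that case.
from dataclasses import dataclass


@dataclass
class TestCase:
    """Stand-in for a test case pulled from the agent (trace) dataset."""
    id: str
    input: str


def run_agent(request: str) -> str:
    # Stand-in for your full pipeline (router + sub-agents + tools).
    return f"handled: {request}"


def execute_test_run(test_cases, record):
    """Run every case through the full pipeline and record each result."""
    results = {}
    for case in test_cases:
        output = run_agent(case.input)  # execute exactly as users would hit it
        record(case.id, output)         # e.g. attach output + trace to the run
        results[case.id] = output
    return results


if __name__ == "__main__":
    # In Freeplay these would come from your trace dataset; inline here.
    cases = [
        TestCase("case-1", "How do I rotate my API key?"),
        TestCase("case-2", "Export is broken, please file a bug"),
    ]
    recorded = []
    execute_test_run(cases, lambda case_id, output: recorded.append(case_id))
    print(recorded)
```

In a real run, the `record` callback would be replaced by the SDK call that associates each output and its trace with the test run, so results appear in Freeplay's analysis views.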
Analyzing Results
After running your tests, Freeplay provides comprehensive analysis at both the agent and component levels. The overview shows high-level metrics comparing different versions or models:


Best Practices
Include real user interactions that represent typical usage patterns, edge cases that challenge your system, and known failure scenarios that you've encountered. This realistic data ensures your tests catch actual problems users might face.

Run end-to-end tests at critical points in your development cycle: before deploying to production, after significant code changes, and as part of your CI/CD pipeline. Regular testing catches regressions early, when they're easier to fix.

Advanced Patterns
For multi-agent systems, test the collaboration and handoffs between agents:
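One way to do this is to assert on the handoff trace as well as the final answer: the test should fail if the router sends a request to the wrong sub-agent even when the output happens to look plausible. A minimal sketch, with the agent names taken from the example above and the routing as a stand-in for the real router prompt:

```python
# Handoff assertions for a multi-agent system: check *which* agents
# handled the request and in what order, not just the final output.

def support_pipeline(request: str):
    """Returns (final_output, handoff_trace) for a request."""
    trace = ["FreeplaySupportAgent"]    # router receives the request first
    if "bug" in request.lower():
        trace.append("LinearAgent")     # handoff: file a ticket
        output = f"Filed ticket: {request}"
    else:
        trace.append("DocsAgent")       # handoff: answer from documentation
        output = f"Docs answer: {request}"
    return output, trace


def test_handoff_to_linear():
    output, trace = support_pipeline("There is a bug in billing")
    assert trace == ["FreeplaySupportAgent", "LinearAgent"]
    assert output.startswith("Filed ticket")


def test_handoff_to_docs():
    output, trace = support_pipeline("How do I reset my password?")
    assert trace == ["FreeplaySupportAgent", "DocsAgent"]


if __name__ == "__main__":
    test_handoff_to_linear()
    test_handoff_to_docs()
    print("handoff tests passed")
```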

