End-to-End Test Runs
Introduction
End-to-end test runs validate your entire AI system by passing test cases through your complete pipeline. This comprehensive approach ensures that changes to any component don't cause unexpected regressions elsewhere in your system.
Why End-to-End Testing Matters
Modern AI applications consist of multiple interacting components—LLM calls in sequence, tool usage, retrieval systems, and agent orchestration. Testing individual pieces in isolation isn't enough. You need to understand how changes ripple through your entire system to catch issues before they reach users.
End-to-end tests provide realistic performance assessment by testing your system exactly as users experience it. They capture complex workflows including multi-step processes, tool usage, and agent decision-making while tracking both final outputs and intermediate steps.
Implementation
End-to-end tests execute through the SDK, giving you complete control over your system's execution. Here's how to test a support agent system that uses multiple sub-agents and tools. This example uses Freeplay's Support Agent, which takes in customer requests and makes sure they are tracked well. It is made up of several components, including FreeplaySupportAgent, DocsAgent, and LinearAgent. Each of these agents handles a different task and follows the common router prompt format for testing. We use an Agent (trace) dataset in Freeplay to test the end-to-end behavior.
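Conceptually, the router agent inspects each incoming request and hands it off to the right sub-agent. The sketch below shows that shape only; the classes and routing logic are hypothetical placeholders for illustration, not the actual implementation.
# Hedged sketch of the overall shape only; these classes are hypothetical
# placeholders, not the real FreeplaySupportAgent implementation.
class DocsAgent:
    def run(self, request: str) -> str:
        return f"Answer from documentation for: {request}"

class LinearAgent:
    def run(self, request: str) -> str:
        return f"Linear issue filed for: {request}"

class FreeplaySupportAgent:
    """Router: inspects the request and hands off to the right sub-agent."""
    def __init__(self) -> None:
        self.docs_agent = DocsAgent()
        self.linear_agent = LinearAgent()

    def run(self, request: str) -> str:
        if "bug" in request.lower() or "broken" in request.lower():
            return self.linear_agent.run(request)
        return self.docs_agent.run(request)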
Step 1: Set up
import os
import time
from typing import Optional
from tqdm import tqdm
from openai import OpenAI
from freeplay import Freeplay, RecordPayload, SessionInfo, TraceInfo, TestRunInfo, CallInfo
# Optional SDK helpers (present in recent Freeplay SDKs)
try:
    from freeplay import ResponseInfo, UsageTokens  # type: ignore
except Exception:
    ResponseInfo = None
    UsageTokens = None
# Env vars
FREEPLAY_API_KEY = os.environ.get("FREEPLAY_API_KEY") or ""
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") or ""
if not FREEPLAY_API_KEY:
    raise RuntimeError("FREEPLAY_API_KEY is not set.")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set.")
# Config (update these)
PROJECT_ID = "YOUR_PROJECT_ID"
TRACE_DATASET_NAME = "Your dataset name that targets an agent"
TEST_RUN_NAME = "Name of the test you are running"
TEMPLATE_NAME = "my-prompt"
TEMPLATE_ENV = "sandbox" # 'sandbox' | 'latest' | 'production'
# Clients
fp_client = Freeplay(api_key=FREEPLAY_API_KEY)
openai_client = OpenAI(api_key=OPENAI_API_KEY)
Step 2: Minimal Agent Example
def run_agent(
    fp_session: SessionInfo,
    agent_prompt_name: str,
    variables: dict,
    test_run_info: Optional[TestRunInfo] = None,
):
    # Get prompt from Freeplay
    formatted = fp_client.prompts.get_formatted(
        project_id=PROJECT_ID,
        template_name=agent_prompt_name,
        environment=TEMPLATE_ENV,
        variables=variables,
    )
    model = formatted.prompt_info.model
    params = dict(formatted.prompt_info.model_parameters or {})

    t0 = time.time()
    completion = openai_client.chat.completions.create(
        model=model,
        messages=formatted.llm_prompt,  # messages formatted for the provider by Freeplay
        **params,
    )
    t1 = time.time()

    assistant_msg = completion.choices[0].message
    all_messages = formatted.all_messages(assistant_msg)

    #################################################
    # Handle Agent Activity (ie tool calling, etc.) #
    #################################################

    # Record to Freeplay
    fp_client.recordings.create(
        RecordPayload(
            project_id=PROJECT_ID,
            all_messages=all_messages,
            inputs=variables,
            session_info=fp_session,
            test_run_info=test_run_info,  # <- links this call to the test run
            prompt_version_info=formatted.prompt_info,
            call_info=CallInfo.from_prompt_info(formatted.prompt_info, t0, t1),
        )
    )
    return assistant_msg.content
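The "Handle Agent Activity" placeholder above is where tool calls would be processed before recording. As a rough, hedged sketch, assuming OpenAI-style tool calls and a hypothetical dispatch_tool helper you would implement for your own tools, that block might look like this:
# Hedged sketch only: assumes OpenAI-style tool calls and a hypothetical
# dispatch_tool(name, arguments) helper that you define yourself.
import json

if getattr(assistant_msg, "tool_calls", None):
    for tool_call in assistant_msg.tool_calls:
        tool_name = tool_call.function.name
        tool_args = json.loads(tool_call.function.arguments or "{}")
        tool_result = dispatch_tool(tool_name, tool_args)  # hypothetical helper
        # Append the tool result so a follow-up completion can use it
        all_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": str(tool_result),
        })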
Step 3: Create test run, iterate cases, record outputs
def main():
    # Create a Test Run on your dataset (agent/trace)
    test_run = fp_client.test_runs.create(
        project_id=PROJECT_ID,
        testlist=TRACE_DATASET_NAME,
        name=TEST_RUN_NAME,
    )

    # Iterate test cases
    for test_case in tqdm(test_run.trace_test_cases, desc="Running test cases"):
        question = getattr(test_case, "input", "")
        variables = getattr(test_case, "variables", {}) or {}

        # Create session + trace
        session = fp_client.sessions.create()
        trace: TraceInfo = session.create_trace(
            input=question or variables.get("user_input", ""),
            agent_name="ExampleAgent",
            custom_metadata={"version": "1.0.0"},
        )

        # Run the agent and log the recording under this test run
        tri = test_run.get_test_run_info(test_case.id)
        assistant_text = run_agent(
            fp_session=session,
            agent_prompt_name=TEMPLATE_NAME,
            variables=variables,
            test_run_info=tri,
        )

        # Attach any evals you compute (hard-coded example values shown here)
        eval_results = {
            "evaluation_score": 0.48,
            "is_high_quality": True,
        }

        # Record final output for the trace (linked to test run)
        trace.record_output(
            project_id=PROJECT_ID,
            output=assistant_text,
            eval_results=eval_results,
            test_run_info=tri,
        )

    print("✅ Test run complete. Review results in Freeplay.")

if __name__ == "__main__":
    main()
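The eval_results above are hard-coded for illustration. In practice you would compute them from the agent's output before recording. A minimal, hedged sketch of an LLM-as-judge scorer is shown below; the judge model, prompt, score key names, and parsing are illustrative assumptions, not Freeplay's built-in evaluations.
# Hedged sketch: a simple LLM-as-judge scorer. The judge model, prompt,
# and score keys are assumptions for illustration only.
def score_output(question: str, answer: str) -> dict:
    judge = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any judge model you have access to
        messages=[
            {
                "role": "system",
                "content": "Rate the answer to the question from 0.0 to 1.0. Reply with only the number.",
            },
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    try:
        score = float(judge.choices[0].message.content.strip())
    except (TypeError, ValueError):
        score = 0.0
    return {"evaluation_score": score, "is_high_quality": score >= 0.7}
With a helper like this, the hard-coded dictionary could be replaced with eval_results = score_output(question, assistant_text).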
The SDK automatically records all LLM calls, tool invocations, intermediate reasoning steps, and evaluation results throughout the execution.
Analyzing Results
After running your tests, Freeplay provides comprehensive analysis at both the agent and component levels. The overview shows high-level metrics comparing different versions or models:
You can drill into specific evaluation categories to understand performance across different aspects of your system. Agent evaluations assess the complete workflow:
The row-level view reveals how each component contributes to overall system performance. In the example below, the Claude version has been marked as the winner. This view lets you step through every completion in the dataset and compare outputs side by side to see how they differ. Under the session details you can also see which actions and steps the agent took in each case during the end-to-end test:
Best Practices
Build your test datasets from real user interactions that represent typical usage patterns, edge cases that challenge your system, and known failure scenarios you've encountered. This realistic data ensures your tests catch the problems users actually face.
Run end-to-end tests at critical points in your development cycle. Execute them before deploying to production, after significant code changes, and as part of your CI/CD pipeline. Regular testing catches regressions early when they're easier to fix.
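For CI/CD integration, one common pattern is to run the test script in your pipeline and fail the build when scores drop below a threshold. A hedged sketch follows; the threshold, score collection, and gating logic are assumptions you would adapt, not Freeplay APIs.
import sys

# Hedged sketch: collect the eval scores computed during the run and fail
# the CI job if the average drops below a threshold you choose.
def gate_on_scores(scores: list[float], threshold: float = 0.7) -> None:
    if not scores:
        print("No eval scores collected; failing conservatively.")
        sys.exit(1)
    average = sum(scores) / len(scores)
    print(f"Average evaluation_score: {average:.2f} (threshold {threshold})")
    if average < threshold:
        sys.exit(1)  # non-zero exit marks the CI step as failed

# Example usage: append each test case's score to a list inside main(),
# then call gate_on_scores(collected_scores) before the script exits.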
Advanced Patterns
For multi-agent systems, test the collaboration and handoffs between agents:
# Illustrative snippet: primary_agent, specialist_agent, and requires_specialist
# are placeholders for your own agent implementations.
for test_case in test_run.trace_test_cases:
    # Primary agent processes request
    initial_response = primary_agent.process(test_case.input)

    # Handoff to specialist if needed
    if requires_specialist(initial_response):
        final_response = specialist_agent.process(
            test_case.input,
            context=initial_response,
        )
    else:
        final_response = initial_response
For RAG pipelines, track each stage of the process:
# Create trace for the complete pipeline
trace_info = session.create_trace(
    input=query,
    agent_name="rag_pipeline",
)

# Track retrieval, reranking, and generation
retrieved_docs = retrieval_system.search(query)
reranked_docs = reranker.rerank(query, retrieved_docs)
response = generate_response(query, reranked_docs)

trace_info.record_output(
    project_id=PROJECT_ID,
    output=response,
    eval_results={
        "retrieval_relevance": evaluate_retrieval(query, retrieved_docs),
        "answer_quality": evaluate_answer(query, response),
    },
)
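The retrieval_system, reranker, generate_response, evaluate_retrieval, and evaluate_answer calls above stand in for your own pipeline. As a purely illustrative sketch, the two eval helpers could be stubbed with a keyword-overlap heuristic like the one below; a real setup would use proper relevance and quality evaluators.
# Purely illustrative heuristics, not real evaluators.
def evaluate_retrieval(query: str, docs: list[str]) -> float:
    """Fraction of query terms that appear somewhere in the retrieved docs."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    joined = " ".join(docs).lower()
    return sum(1 for t in terms if t in joined) / len(terms)

def evaluate_answer(query: str, answer: str) -> float:
    """Crude overlap between query terms and the generated answer."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    answer_lower = answer.lower()
    return sum(1 for t in terms if t in answer_lower) / len(terms)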