Introduction
End-to-end test runs validate your entire AI system by passing test cases through your complete pipeline. This comprehensive approach ensures that changes to any component don't cause unexpected regressions elsewhere in your system.

Why End-to-End Testing Matters
Modern AI applications consist of multiple interacting components: LLM calls in sequence, tool usage, retrieval systems, and agent orchestration. Testing individual pieces in isolation isn't enough; you need to understand how changes ripple through your entire system to catch issues before they reach users.

End-to-end tests provide realistic performance assessment by testing your system exactly as users experience it. They capture complex workflows, including multi-step processes, tool usage, and agent decision-making, while tracking both final outputs and intermediate steps.

Implementation
End-to-end tests execute through the SDK, giving you complete control over your system's execution. Here's how to test a support agent system that uses multiple sub-agents and tools. This example uses Freeplay's Support Agent, which takes in customer requests and makes sure they are tracked well. It is made up of several components, including FreeplaySupportAgent, DocsAgent, and LinearAgent. Each of these agents handles different tasks and follows the common router prompt format for testing. We use an Agent (trace) dataset in Freeplay to test the end-to-end behavior.
Step 1: Set up
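The setup step typically installs the SDK and exports credentials. A minimal sketch, assuming the Python SDK is published as `freeplay` on PyPI; the environment variable names here are illustrative, not prescribed:

```shell
# Install the Freeplay Python SDK (package name assumed)
pip install freeplay

# Credentials the client will read -- variable names are illustrative
export FREEPLAY_API_KEY="your-api-key"
export FREEPLAY_PROJECT_ID="your-project-id"
```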
Step 2: Minimal Agent Example
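A minimal sketch of the agent structure described above. The agent names (FreeplaySupportAgent, DocsAgent, LinearAgent) come from the example; the routing logic and responses here are stand-ins for real LLM and tool calls, so the sketch runs on its own:

```python
# Minimal stand-in for the support agent system: a router that delegates
# to sub-agents and records intermediate steps for the test run.
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Final output plus the intermediate steps an end-to-end test records."""
    output: str
    steps: list = field(default_factory=list)


class DocsAgent:
    name = "docs_agent"

    def respond(self, request: str) -> str:
        # Stand-in for retrieval + an LLM answer over documentation.
        return f"Here is what the docs say about: {request}"


class LinearAgent:
    name = "linear_agent"

    def respond(self, request: str) -> str:
        # Stand-in for a tool call that files a Linear ticket.
        return f"Filed a Linear ticket for: {request}"


class FreeplaySupportAgent:
    """Router agent: picks a sub-agent, delegates, and records each step."""

    def __init__(self):
        self.sub_agents = {"docs": DocsAgent(), "linear": LinearAgent()}

    def route(self, request: str) -> str:
        # Stand-in for the router prompt; a real system would ask an LLM.
        return "linear" if "bug" in request.lower() else "docs"

    def run(self, request: str) -> AgentResult:
        choice = self.route(request)
        sub_agent = self.sub_agents[choice]
        answer = sub_agent.respond(request)
        return AgentResult(
            output=answer,
            steps=[("router", choice), (sub_agent.name, answer)],
        )


if __name__ == "__main__":
    result = FreeplaySupportAgent().run("Customer reports a bug in export")
    print(result.output)
    print(result.steps)
```

Keeping the intermediate `steps` list alongside the final output is what lets a test run grade both the end result and the decisions made along the way.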
Step 3: Create test run, iterate cases, record outputs
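The control flow for this step can be sketched as follows. The in-memory `TestCase` stub and `record` callback stand in for the Freeplay SDK's test-run and recording calls (indicated in comments), so the loop itself is runnable as written:

```python
# Sketch of the test-run loop: iterate dataset cases, execute the system
# end to end on each input, and record the output against that case.
from dataclasses import dataclass


@dataclass
class TestCase:
    """Stand-in for a test case pulled from the agent (trace) dataset."""
    id: str
    input: str


def run_agent(request: str) -> str:
    # Stand-in for your full pipeline (router + sub-agents + tools).
    return f"handled: {request}"


def execute_test_run(test_cases, record):
    """Run every case through the full pipeline and record each result."""
    results = {}
    for case in test_cases:
        output = run_agent(case.input)  # execute exactly as users would hit it
        record(case.id, output)         # e.g. attach output + trace to the run
        results[case.id] = output
    return results


if __name__ == "__main__":
    # In Freeplay these would come from your trace dataset; inline here.
    cases = [
        TestCase("case-1", "How do I rotate my API key?"),
        TestCase("case-2", "Export is broken, please file a bug"),
    ]
    recorded = []
    execute_test_run(cases, lambda case_id, output: recorded.append(case_id))
    print(recorded)
```

In a real run, the `record` callback would be replaced by the SDK call that associates each output and its trace with the test run, so results appear in Freeplay's analysis views.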
Analyzing Results
After running your tests, Freeplay provides comprehensive analysis at both the agent and component levels. The overview shows high-level metrics comparing different versions or models:


Best Practices
Include real user interactions that represent typical usage patterns, edge cases that challenge your system, and known failure scenarios that you've encountered. This realistic data ensures your tests catch actual problems users might face.

Run end-to-end tests at critical points in your development cycle: before deploying to production, after significant code changes, and as part of your CI/CD pipeline. Regular testing catches regressions early, when they're easier to fix.

Advanced Patterns
For multi-agent systems, test the collaboration and handoffs between agents:
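One way to do this is to assert on the handoff trace as well as the final answer: the test should fail if the router sends a request to the wrong sub-agent even when the output happens to look plausible. A minimal sketch, with the agent names taken from the example above and the routing as a stand-in for the real router prompt:

```python
# Handoff assertions for a multi-agent system: check *which* agents
# handled the request and in what order, not just the final output.

def support_pipeline(request: str):
    """Returns (final_output, handoff_trace) for a request."""
    trace = ["FreeplaySupportAgent"]    # router receives the request first
    if "bug" in request.lower():
        trace.append("LinearAgent")     # handoff: file a ticket
        output = f"Filed ticket: {request}"
    else:
        trace.append("DocsAgent")       # handoff: answer from documentation
        output = f"Docs answer: {request}"
    return output, trace


def test_handoff_to_linear():
    output, trace = support_pipeline("There is a bug in billing")
    assert trace == ["FreeplaySupportAgent", "LinearAgent"]
    assert output.startswith("Filed ticket")


def test_handoff_to_docs():
    output, trace = support_pipeline("How do I reset my password?")
    assert trace == ["FreeplaySupportAgent", "DocsAgent"]


if __name__ == "__main__":
    test_handoff_to_linear()
    test_handoff_to_docs()
    print("handoff tests passed")
```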

