Test Runs Overview

Test Runs provide structured testing for your AI systems — aka “evaluations”, enabling you to validate performance across datasets, compare versions of your system as you make changes, and catch regressions before they reach production. Freeplay supports two complementary testing approaches designed for different stages of your development workflow, which we call “component” and “end-to-end” tests.

Core Concepts

A Test Run evaluates your LLM pipeline against a dataset to measure performance, using evaluation metrics you choose and define for your use case. Each run processes your test cases through the pipeline, applies evaluation scores, and provides both aggregate and row-level insights. The foundation of any test is your dataset — a curated collection of scenarios that represent important use cases, edge conditions, and known failure modes. Test Runs integrate with Freeplay’s overall evaluation system, allowing you to apply model-graded evals, code-based checks, and human review to assess quality. You can compare results across different versions of your prompts, tools, models, or other parts of your system by creating separate Test Runs for each version you want to compare, then comparing scores head to head (either in the Freeplay UI, or your code).
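The pattern described above can be sketched in plain Python. This is not the Freeplay SDK API — the `pipeline` and `contains_keyword` functions below are hypothetical stand-ins for your LLM call and one of your evaluation metrics — but it shows the shape of a Test Run: process each dataset case, score it, and keep both row-level and aggregate results.

```python
from statistics import mean

# Hypothetical stand-ins: in practice, pipeline() is your LLM call and
# contains_keyword() is a model-graded or code-based eval you define.
def pipeline(inputs):
    return f"answer to: {inputs['question']}"

def contains_keyword(output, row):
    return 1.0 if row["expected_keyword"] in output else 0.0

dataset = [
    {"question": "What is RAG?", "expected_keyword": "retrieval"},
    {"question": "Define eval", "expected_keyword": "eval"},
]

# A Test Run: run every case through the pipeline, apply the eval,
# and record row-level scores alongside the aggregate.
rows = []
for case in dataset:
    output = pipeline(case)
    rows.append({"inputs": case, "output": output,
                 "score": contains_keyword(output, case)})

aggregate = mean(r["score"] for r in rows)
print(aggregate)  # row 1 misses "retrieval", row 2 contains "eval" -> 0.5
```

To compare two versions of your system, you would run this loop once per version against the same dataset and compare the resulting aggregates head to head.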

When to Use Each Approach

The choice between component and end-to-end testing depends on what you’re trying to validate. Our suggestion:
  • Use component testing when iterating on a specific prompt, model, or tool to validate your changes in isolation
  • Use end-to-end testing for your entire code path to test how a given set of inputs turns into outputs for your agent or system
Together these create a helpful two-step workflow for many changes: first iterate with component tests until your changes perform well in isolation, then use end-to-end testing to validate how those changes behave within your complete system. End-to-end tests are essential before deploying to production, making system architecture changes, or validating agent behavior with tool usage.

Testing Approaches

Component-Level Test Runs

Component testing focuses on individual prompts and model changes in isolation. This approach enables rapid iteration during prompt engineering without the overhead of running your entire system. You can execute component tests through either the Freeplay UI for quick no-code testing, or the SDK for programmatic control. Component tests excel at prompt optimization, model comparisons, parameter tuning, and enabling non-technical team members to participate in the iteration process. Learn more about Component Testing →
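A component test isolates one variable — here, the prompt template — and scores each variant against the same dataset. The sketch below is plain Python with hypothetical stand-ins (`call_model`, `brevity_eval`), not Freeplay SDK calls; it illustrates the comparison pattern a component Test Run automates.

```python
# Compare two prompt variants in isolation against the same dataset.
def render(template, case):
    return template.format(**case)

def call_model(prompt):
    # Stand-in for a real model call: "concise" prompts yield short answers.
    return "short" if "concise" in prompt else "a much longer rambling answer"

def brevity_eval(output):
    # Code-based eval: pass if the answer is three words or fewer.
    return 1.0 if len(output.split()) <= 3 else 0.0

dataset = [{"topic": "embeddings"}, {"topic": "tool calling"}]
variants = {
    "v1": "Explain {topic}.",
    "v2": "Explain {topic}. Be concise.",
}

scores = {}
for name, template in variants.items():
    results = [brevity_eval(call_model(render(template, c))) for c in dataset]
    scores[name] = sum(results) / len(results)

print(scores)  # {'v1': 0.0, 'v2': 1.0}
```

Because only the prompt changes between variants, a score difference can be attributed directly to the prompt edit rather than to some other part of the system.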

End-to-End Test Runs

End-to-end testing validates your complete system workflow including agents, RAG pipelines, and multi-step processes. When you change any component — whether it’s a prompt, tool, or orchestration logic — you need to understand how it affects your final output. These tests execute through the SDK in order to test your actual system. End-to-end tests are essential for production validation, agent systems with multiple prompts and tools, RAG pipelines, and multi-turn conversations. They capture the full complexity of your system as users experience it. Consider logging a specific commit name or git SHA with these tests to connect them to your code. Learn more about End-to-End Testing →
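Tagging an end-to-end run with the current git SHA can be done with a small helper. The sketch below is not the Freeplay SDK — `run_agent` is a hypothetical stand-in for your full code path — but the `git rev-parse HEAD` call is the standard way to capture the commit your tests ran against.

```python
import subprocess

def current_git_sha():
    """Best-effort git SHA for tagging a Test Run; None outside a repo."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def run_agent(case):
    # Stand-in for the full code path: retrieval, tool calls, final answer.
    return {"answer": f"final answer for {case['query']}", "tool_calls": 2}

dataset = [{"query": "reset my password"}, {"query": "cancel my order"}]

test_run = {
    "metadata": {"git_sha": current_git_sha()},  # ties results to a code version
    "rows": [{"inputs": c, "output": run_agent(c)} for c in dataset],
}
print(len(test_run["rows"]))  # 2
```

With the SHA recorded, you can always answer "which version of the code produced these scores?" when comparing runs later.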

Getting Started

For developers, begin by installing the Freeplay SDK and creating datasets from your production data. Set up both end-to-end and component tests as part of your development workflow, and integrate them into your CI/CD pipeline for automated validation when ready. Product teams can jump straight into the Freeplay UI to create test datasets from important use cases and run component tests on prompt changes. The visual interface makes it easy to review results and provide feedback without technical expertise.
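For CI/CD integration, one common pattern is to gate a build on the aggregate score of a Test Run. The sketch below is a plain-Python illustration under an assumed threshold of 0.8 (not a Freeplay default) — the script would exit non-zero when the run falls below the bar, which fails the CI job.

```python
PASS_THRESHOLD = 0.8  # assumed threshold; tune for your use case

def aggregate_score(row_scores):
    return sum(row_scores) / len(row_scores)

def ci_gate(row_scores, threshold=PASS_THRESHOLD):
    """Return a CI exit code: 0 if the run meets the bar, 1 otherwise."""
    score = aggregate_score(row_scores)
    print(f"aggregate={score:.2f} threshold={threshold}")
    return 0 if score >= threshold else 1

# Example: row-level scores from a finished Test Run.
exit_code = ci_gate([1.0, 1.0, 0.0, 1.0])  # aggregate 0.75 < 0.8 -> fail
print(exit_code)  # 1
```

In a real pipeline you would call `sys.exit(ci_gate(...))` so the build status reflects the result.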
API Reference: See the Create Test Run, List Test Runs, and Get Test Run Results endpoints. Test runs work with both Prompt Datasets and Agent Datasets.

What’s Next

Now that you’re armed with the ability to test your models, let’s move on to Datasets.