Test Runs
Overview of structured testing using Test Runs
Introduction
Test Runs in the Freeplay platform provide a structured way to test your prompts and models against large batches of data. By leveraging Test Runs, you can verify that changes perform as expected across a range of examples and identify areas for improvement.
What is a Test Run?
A Test Run pushes a set of saved prompt template inputs through a new version of a prompt template so you can measure improvements or regressions. It is a vital component for ensuring the quality and effectiveness of your prompts. Key components of a Test Run include:
- Dataset: This is the foundation of your Test Run. It's a collection of examples previously saved or uploaded by you, each containing input(s) that you've identified as valuable for testing. (See Datasets)
- Invocation: A Test Run can be kicked off either via the Freeplay UI or in code via the SDK.
  - UI invocation is useful for testing prompt and model changes and can be done by any member of your team entirely in Freeplay.
  - SDK invocation is valuable when you need to exercise your entire code pipeline (including RAG, agent, or multi-prompt scenarios).
- Running the Test: During the Test Run, inputs are automatically fetched from the examples in your chosen Dataset. These inputs are then run through the new version of your LLM pipeline (see the conceptual sketch below).
- Comparison and Analysis: After the inputs have been processed, the outputs are ready for review. You'll compare these results against expected outputs or scores from the Dataset used for testing, or against other prompt template versions. This step is crucial for identifying any improvements or regressions.
- Feedback Loop: The insights gained from the Test Run inform your next steps. If the new Prompt Template version performs well, it might be ready for deployment. If not, the Test Run highlights specific areas for refinement, feeding into the next iteration of your prompt development.
In summary, a Test Run is a dynamic, scalable evaluation feature that plays a critical role in ensuring your Prompt Template performs optimally across a diverse set of inputs.
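Conceptually, every Test Run reduces to the same loop: pull saved inputs from a Dataset, run them through the new version of your pipeline, and compare the outputs to what you expected. The sketch below is only an illustration of that loop, not Freeplay code: `dataset_examples`, `run_new_pipeline_version`, and `score_output` are hypothetical placeholders for your Dataset, your updated prompt/LLM pipeline, and whatever comparison or eval you apply.

```python
# Conceptual sketch of the Test Run loop (hypothetical names, not the Freeplay API).
def run_test(dataset_examples, run_new_pipeline_version, score_output):
    results = []
    for example in dataset_examples:
        # 1. Fetch the saved input(s) from the Dataset example.
        inputs = example["inputs"]

        # 2. Run them through the new version of your prompt/LLM pipeline.
        output = run_new_pipeline_version(inputs)

        # 3. Compare against the expected output (or record a score for later review).
        expected = example.get("expected_output")
        results.append({
            "inputs": inputs,
            "output": output,
            "expected": expected,
            "score": score_output(output, expected),
        })
    return results
```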
How Test Runs Work
- Choose a Dataset:
  - Datasets are collections of test cases that you deem important for your application's testing. Each test case consists of sample input(s) for a prompt template and a single output generated by an LLM.
  - You can create multiple Datasets for various purposes, such as "Golden Set," "Failure Cases," or "Edge Cases," depending on what best suits your application's needs.
- Using Test Runs:
  - SDK Invocation: invoke Test Runs through your SDK (see code examples here, and the sketch after this list).
  - UI Invocation: invoke Test Runs through the Freeplay app by clicking "Test" at the top of your Prompt Template. You will then be prompted to configure the Test Run, including selecting the Dataset and prompt version you want to use.
- Analyze Results: Once you've completed a Test Run, you can analyze results in two ways:
  - Auto-Evaluations: You can review aggregate or row-level scores generated by auto-evals (if enabled), including evals from your code. (See details here)
  - Human Preference: You can review row-level examples from any Test or Comparison and choose which you prefer. This is useful in cases where you might not have evals defined yet, or where eval results are inconclusive.
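When you invoke a Test Run from the SDK, the flow generally looks like the sketch below: create a Test Run against a Dataset, pull its test cases, push each one through your own pipeline, and record the results back to Freeplay. Treat this as a minimal sketch rather than a reference implementation. The client construction and method names (`Freeplay`, `client.test_runs.create`, `test_run.test_cases`, `client.recordings.create`) are assumptions based on common Freeplay SDK patterns, and `run_pipeline` stands in for your own code; confirm the exact calls and payloads against the linked SDK code examples.

```python
# Hedged sketch of an SDK-invoked Test Run. Client and method names are
# assumptions -- confirm exact signatures against the Freeplay SDK examples.
from freeplay import Freeplay  # assumed Python SDK client class

client = Freeplay(
    freeplay_api_key="YOUR_FREEPLAY_API_KEY",
    api_base="https://YOUR_SUBDOMAIN.freeplay.ai/api",
)

def run_pipeline(inputs: dict) -> str:
    """Hypothetical stand-in for your real pipeline (RAG, agents, LLM calls)."""
    return f"model output for {inputs}"

# 1. Kick off a Test Run against a Dataset (assumed method and arguments).
test_run = client.test_runs.create(
    project_id="YOUR_PROJECT_ID",
    testlist="Golden Set",  # the Dataset whose test cases you want to run
)

# 2. Exercise your full code path for every test case in the Dataset.
for test_case in test_run.test_cases:
    inputs = test_case.variables  # saved input(s) from the Dataset example
    output = run_pipeline(inputs)

    # 3. Record each completion back to Freeplay (e.g. via client.recordings.create),
    #    tagging it with the Test Run and test case IDs so the result is attached
    #    to this Test Run and shows up on the Test page for analysis.
    print(f"test case {test_case.id}: {output}")
```

Recording each completion against the Test Run is what makes the results appear in the Tests section of the app, where you can analyze them exactly like a UI-invoked Test Run.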
Step By Step Guide
Step 1: Executing Test Runs
In the Freeplay application:
- Select the prompt template version you want to test in the "Versions" sidebar
- Click "Test"
- Choose the "Dataset" you want to use
- Give the Test a name and a description (optional)
For SDK or API use, see here

Step 2: Analyzing Results
For Aggregate Results
In the Freeplay application:
- Navigate to Tests
- Select the Test you want to review
- Aggregate scores are shown in the Summary section
- Row-level results can be filtered and viewed directly in the table at the bottom

For Row-Level Results
- Click any row in the table
- Click any heading to expand/collapse the Inputs, Outputs, etc.
- Regressions vs. your dataset are highlighted in red, and improvements are highlighted in green
- To capture human preference scores, click the preferred value at the bottom (or use key bindings to select 1, 2, or 3)

Step 3: Creating Comparisons
Once you've created a Test, Freeplay makes it easy to compare results not just against your ground truth dataset, but also head-to-head against another prompt template version or a similar Test Run created via the SDK.
In the Freeplay application:
- Navigate to a specific Test page
- Click the "+ New Comparison" button under the Test name
- Select the prompt template version or similar past Test Run that you want to compare to
- Analyze results just like you would above. The new object you chose will appear in the right column instead of the ground truth dataset.

Now that you're armed with the ability to test your models, let's move on to Datasets.