Test Runs
Overview of structured testing using Test Runs
Introduction
Test Runs in the Freeplay platform provide a structured way to test your prompts and models against large batches of data. By leveraging Test Runs, you can verify that changes perform as expected across a range of examples and identify areas for improvement.
What is a Test Run?
A Test Run pushes a set of saved prompt template inputs through a new version of a prompt template so you can measure improvements or regressions. It is a vital component for ensuring the quality and effectiveness of your prompts. Key components of a Test Run include:
- Dataset: This is the foundation of your Test Run. It's a collection of examples previously saved or uploaded by you, each containing input(s) that you've identified as valuable for testing. (See Datasets)
- Invocation: A Test Run can be kicked off either via the Freeplay UI or in code via the SDK.
  - UI invocation is useful for testing prompt and model changes and can be done by any member of your team entirely in Freeplay.
  - SDK invocation is valuable when you need to exercise your entire code pipeline (including RAG, agent, or multi-prompt scenarios).
- Running the Test: During the Test Run, inputs are automatically fetched from the examples in your chosen Dataset. These inputs are then run through the new version of your LLM pipeline (see the conceptual sketch below).
- Comparison and Analysis: After the inputs have been processed, the outputs are ready for review. You'll compare these results against expected outputs or scores from the Dataset used for testing, or against other prompt template versions. This step is crucial for identifying any improvements or regressions.
- Feedback Loop: The insights gained from the Test Run inform your next steps. If the new Prompt Template version performs well, it might be ready for deployment. If not, the Test Run highlights specific areas for refinement, feeding into the next iteration of your prompt development.
In summary, a Test Run is a dynamic, scalable evaluation feature that plays a critical role in ensuring your Prompt Template performs optimally across a diverse set of inputs.
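Conceptually, every Test Run reduces to the same loop: pull saved inputs from a Dataset, run them through the new version of your pipeline, and compare the outputs to what you expected. The sketch below is only an illustration of that loop, not Freeplay code: `dataset_examples`, `run_new_pipeline_version`, and `score_output` are hypothetical placeholders for your Dataset, your updated prompt/LLM pipeline, and whatever comparison or eval you apply.

```python
# Conceptual sketch of the Test Run loop (hypothetical names, not the Freeplay API).
def run_test(dataset_examples, run_new_pipeline_version, score_output):
    results = []
    for example in dataset_examples:
        # 1. Fetch the saved input(s) from the Dataset example.
        inputs = example["inputs"]

        # 2. Run them through the new version of your prompt/LLM pipeline.
        output = run_new_pipeline_version(inputs)

        # 3. Compare against the expected output (or record a score for later review).
        expected = example.get("expected_output")
        results.append({
            "inputs": inputs,
            "output": output,
            "expected": expected,
            "score": score_output(output, expected),
        })
    return results
```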
How Test Runs Work
- Choose a Dataset:
  - Datasets are collections of test cases that you deem important for your application's testing. Each test case consists of sample input(s) for a prompt template and a single output generated by an LLM.
  - You can create multiple Datasets for various purposes, such as "Golden Set," "Failure Cases," or "Edge Cases," depending on what best suits your application's needs.
- Using Test Runs:
  - SDK Invocation: invoke Test Runs through your SDK (see code examples here, and the sketch after this list).
  - UI Invocation: invoke Test Runs through the Freeplay app by clicking "Test" at the top of your Prompt Template. You will then be prompted to configure the Test Run, including selecting the Dataset and prompt version you want to use.
- Analyze Results: Once you've completed a Test Run, you can analyze results in two ways:
  - Auto-Evaluations: You can review aggregate or row-level scores generated by auto-evals (if enabled), including evals from your code. (See details here)
  - Human Preference: You can review row-level examples from any Test or Comparison and choose which you prefer. This is useful in cases where you might not have evals defined yet, or where eval results are inconclusive.
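When you invoke a Test Run from the SDK, the flow generally looks like the sketch below: create a Test Run against a Dataset, pull its test cases, push each one through your own pipeline, and record the results back to Freeplay. Treat this as a minimal sketch rather than a reference implementation. The client construction and method names (`Freeplay`, `client.test_runs.create`, `test_run.test_cases`, `client.recordings.create`) are assumptions based on common Freeplay SDK patterns, and `run_pipeline` stands in for your own code; confirm the exact calls and payloads against the linked SDK code examples.

```python
# Hedged sketch of an SDK-invoked Test Run. Client and method names are
# assumptions -- confirm exact signatures against the Freeplay SDK examples.
from freeplay import Freeplay  # assumed Python SDK client class

client = Freeplay(
    freeplay_api_key="YOUR_FREEPLAY_API_KEY",
    api_base="https://YOUR_SUBDOMAIN.freeplay.ai/api",
)

def run_pipeline(inputs: dict) -> str:
    """Hypothetical stand-in for your real pipeline (RAG, agents, LLM calls)."""
    return f"model output for {inputs}"

# 1. Kick off a Test Run against a Dataset (assumed method and arguments).
test_run = client.test_runs.create(
    project_id="YOUR_PROJECT_ID",
    testlist="Golden Set",  # the Dataset whose test cases you want to run
)

# 2. Exercise your full code path for every test case in the Dataset.
for test_case in test_run.test_cases:
    inputs = test_case.variables  # saved input(s) from the Dataset example
    output = run_pipeline(inputs)

    # 3. Record each completion back to Freeplay (e.g. via client.recordings.create),
    #    tagging it with the Test Run and test case IDs so the result is attached
    #    to this Test Run and shows up on the Test page for analysis.
    print(f"test case {test_case.id}: {output}")
```

Recording each completion against the Test Run is what makes the results appear in the Tests section of the app, where you can analyze them exactly like a UI-invoked Test Run.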
Step By Step Guide
Step 1: Executing Test Runs
In the Freeplay application:
- Select the prompt template version you want to test in the "Versions" sidebar
- Click "Test"
- Choose the "Dataset" you want to use
- Give the Test a name and a description (optional)
For SDK or API use, see here

Step 2: Analyzing Results
For Aggregate Results
In the Freeplay application:
- Navigate to Tests
- Select the Test you want to review
- Aggregate scores are shown in the Summary section
- Row-level results can be filtered and viewed directly in the table at the bottom

For Row-Level Results
- Click any row in the table
- Click any heading to expand/collapse the Inputs, Outputs, etc.
- Regressions vs. your dataset are highlighted in red, and improvements are highlighted in green
- To capture human preference scores, click the preferred value at the bottom (or use key bindings to select 1, 2, or 3)

Step 3: Creating Comparisons
Once you've created a Test, Freeplay makes it easy to compare results not just against your ground truth dataset, but also head-to-head against another prompt template version or a similar Test Run created via the SDK.
In the Freeplay application:
- Navigate to a specific Test page
- Click the "+ New Comparison" button under the Test name
- Select the prompt template version or similar past Test Run that you want to compare to
- Analyze results just like you would above. The new object you chose will appear in the right column instead of the ground truth dataset.

Now that you're armed with the ability to test your models, let's move on to Datasets.