Test Runs in Freeplay provide a structured way for you to run batch tests of your LLM prompts and chains. All methods associated with the Test Runs concept in Freeplay are accessible via the client.test_runs namespace. Test Runs can be completed using Completion or Trace datasets. We will focus on code in this section; for more detail on the Test Runs concept, see Test Runs.
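The examples below assume a Freeplay client (referred to as fpClient or fp_client) and a project_id have already been set up. A minimal setup sketch, assuming the constructor takes freeplay_api_key and api_base and that FREEPLAY_API_KEY is set in your environment (check your SDK version and account domain for the exact values):

import os

from freeplay import Freeplay

# Construct the Freeplay client used throughout the examples below.
# The api_base URL and environment variable name are assumptions for this sketch.
fpClient = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://app.freeplay.ai/api"
)
project_id = "<your project id>"  # copied from the Freeplay UI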

Methods Overview

Method Name: create
Parameters: project_id: string, testlist: string, include_outputs: bool (optional, defaults to False), name: string (optional), description: string (optional)
Description: Instantiate a Test Run object server-side and get an ID to reference your Test Run instance. To get expected outputs with your test cases, set include_outputs=True.
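For example, to create a Test Run whose test cases also carry their expected outputs for comparison, pass include_outputs=True. A short sketch, reusing the fpClient and project_id set up above (the dataset and run names are placeholders):

# Create a Test Run and include expected outputs with each test case
test_run = fpClient.test_runs.create(
    project_id=project_id,
    testlist="my-dataset",        # name of a dataset stored in Freeplay
    include_outputs=True,         # attach expected outputs to the returned test cases
    name="nightly-regression",
    description="Nightly regression run"
)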

Step by Step Usage

Create a new Test Run

import time

from freeplay import CallInfo, Freeplay, RecordPayload, ResponseInfo, TestRunInfo, UsageTokens
from openai import OpenAI

# create a new test run
test_run = fpClient.test_runs.create(
    project_id=project_id,
    testlist="<dataset name>",  # TODO: fill in with the name of the dataset stored in Freeplay
    name="mytestrun",           # Name of the test run in Freeplay
    description="this is a test test!"
)
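The returned test_run object exposes the dataset's test cases: test_run.test_cases for Completion datasets (used in the loop below) and test_run.trace_test_cases for Trace datasets (used in the Agent Test Runs example further down).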

Retrieve your Prompts

Retrieve the prompt(s) needed for your Test Run.
# get the prompt associated with the test run
template_prompt = fpClient.prompts.get(
  project_id=project_id,
  template_name="template-name",
  environment="latest"
)

Iterate over each Test Case

For the code you want to test: loop over each Test Case from the Test List, make an LLM call, and record the results with a link to your Test Run.
# iterate over each test case
for test_case in test_run.test_cases:
    # format the prompt with the test case variables
    formatted_prompt = template_prompt.bind(test_case.variables).format()

    # make your llm call
    s = time.time()
    openaiClient = OpenAI(api_key=openai_key)
    chat_response = openaiClient.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages=formatted_prompt.llm_prompt,
        **formatted_prompt.prompt_info.model_parameters
    )
    e = time.time()

    # append the results to the messages
    all_messages = formatted_prompt.all_messages(
        {'role': chat_response.choices[0].message.role,
         'content': chat_response.choices[0].message.content}
    )
    call_info = CallInfo.from_prompt_info(
        formatted_prompt.prompt_info,
        start_time=s,
        end_time=e,
        usage=UsageTokens(chat_response.usage.prompt_tokens, chat_response.usage.completion_tokens)
    )

    # create a session which will create a UID
    session = fpClient.sessions.create()
    # build the record payload
    payload = RecordPayload(
        project_id=project_id,
        all_messages=all_messages,
        inputs=test_case.variables,  # Variables from the test case are the inputs
        session_info=session,
        # IMPORTANT: link the record call to the test run and test case
        test_run_info=test_run.get_test_run_info(test_case.id),
        prompt_version_info=formatted_prompt.prompt_info,  # log the prompt information
        call_info=call_info,
        response_info=ResponseInfo(
            is_complete=chat_response.choices[0].finish_reason == 'stop'
        )
    )
    # record the results to freeplay
    fpClient.recordings.create(payload)

Agent Test Runs

To execute tests in code, iterate through each test case in your dataset and run it through your full agent workflow. This end-to-end testing approach is particularly valuable for agentic systems, where the goal is to observe how changes—whether to prompts, tools, or orchestration logic—affect the final output. Trace-level tests allow you to simulate production-like behavior and evaluate the agent holistically. As each test case runs, its input is passed into your system, and the resulting trace is logged for evaluation and analysis. See the full example here.
from tqdm import tqdm  # progress bar over test cases

trace_dataset = "Your dataset name that targets an agent"
test_name = "Name of the test you are running"

##################################
# Initialize your code or agents
##################################

# NOTE: It is important that all agents get passed the test_run_info and that it is
# used when RecordPayload is called. This is how Freeplay tracks the test run.
support_agent = MyAgent(<initialize your system code>)

###########################
# Initialize the test run
###########################
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist=trace_dataset,
    name=test_name
)

################################################
# Loop over all cases and execute system code
################################################
for test_case in tqdm(test_run.trace_test_cases):
    question = test_case.input

    session = fp_client.sessions.create()
    trace_info = session.create_trace(
        input=question,
        agent_name="<name of your agent>",
        custom_metadata={
            "version": "1.0.0" # custom dict[str, str]
        }
    )

    # Run your agent completions and log to Freeplay using `fp_client.recordings.create()`.
    # Make sure to pass `test_run_info` through, so the results are tracked to the right test run.
    test_run_info = test_run.get_test_run_info(test_case.id)
    output = support_agent.process_query(
        question,
        # IMPORTANT: make sure this is passed to RecordPayload inside your agent!
        test_run_info=test_run_info
    )

    trace_info.record_output(
        project_id,
        output,  # the agent's final answer for this test case
        {
            # Log your evaluation result
            'evaluation_score': 0.48,
            'is_high_quality': True
        },
        test_run_info=test_run_info
    )
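The loop above only shows the outer test harness; as the NOTE stresses, test_run_info must also reach every RecordPayload created inside your agent. A minimal sketch of what that plumbing could look like, assuming a hypothetical MyAgent whose process_query wraps a single completion (the class, its constructor arguments, and the "question" template variable are illustrative, not part of the Freeplay SDK; the recording calls mirror the step-by-step example above):

import time

from freeplay import CallInfo, RecordPayload, ResponseInfo, UsageTokens
from openai import OpenAI

class MyAgent:
    """Illustrative agent wrapper; the point is propagating test_run_info to RecordPayload."""

    def __init__(self, fp_client, project_id, prompt_template, openai_key):
        self.fp_client = fp_client
        self.project_id = project_id
        self.prompt_template = prompt_template  # fetched via fp_client.prompts.get(...)
        self.openai_client = OpenAI(api_key=openai_key)

    def process_query(self, question, test_run_info=None, session=None):
        # "question" is assumed to be the prompt template's input variable
        formatted_prompt = self.prompt_template.bind({"question": question}).format()

        start = time.time()
        chat_response = self.openai_client.chat.completions.create(
            model=formatted_prompt.prompt_info.model,
            messages=formatted_prompt.llm_prompt,
            **formatted_prompt.prompt_info.model_parameters
        )
        end = time.time()

        all_messages = formatted_prompt.all_messages(
            {'role': chat_response.choices[0].message.role,
             'content': chat_response.choices[0].message.content}
        )
        # In a real system you would pass along the session/trace created in the loop
        # so the completion attaches to the right trace; a fresh session is a fallback here.
        session = session or self.fp_client.sessions.create()

        self.fp_client.recordings.create(RecordPayload(
            project_id=self.project_id,
            all_messages=all_messages,
            inputs={"question": question},
            session_info=session,
            test_run_info=test_run_info,  # IMPORTANT: links this completion to the Test Run
            prompt_version_info=formatted_prompt.prompt_info,
            call_info=CallInfo.from_prompt_info(
                formatted_prompt.prompt_info,
                start_time=start,
                end_time=end,
                usage=UsageTokens(chat_response.usage.prompt_tokens,
                                  chat_response.usage.completion_tokens)
            ),
            response_info=ResponseInfo(
                is_complete=chat_response.choices[0].finish_reason == 'stop'
            )
        ))
        return chat_response.choices[0].message.content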