Run batch tests of your LLM prompts and chains programmatically.
Test Runs in Freeplay provide a structured way for you to run batch tests of your LLM prompts and chains. All methods associated with the Test Runs concept in Freeplay are accessible via the client.test_runs namespace. Test runs can be completed using Completion or Trace datasets. We will focus on code in this section, but for more detail on the Test Runs concept see Test Runs.
Instantiate a Test Run object server-side and get an ID to reference your Test Run instance. To return expected outputs with your test cases, set include_outputs=True.
The testlist parameter accepts the name of a dataset stored in Freeplay. This parameter name is preserved for backwards compatibility. In the UI and documentation, we use “dataset” to refer to this concept.
```python
import time

from freeplay import CallInfo, Freeplay, RecordPayload, TestRunInfo, UsageTokens
from openai import OpenAI

# create a new test run
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist="<dataset name>",  # TODO: fill in with the name of the dataset stored in Freeplay
    name="mytestrun",  # name of the test run in Freeplay
    description="this is a test run!"
)
```
```python
# get the prompt associated with the test run
template_prompt = fp_client.prompts.get(
    project_id=project_id,
    template_name="template-name",
    environment="latest"
)
```
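As noted above, passing include_outputs=True returns expected outputs alongside your test cases. The sketch below shows the idea; note that the `output` attribute name on the test case is an assumption for illustration, not confirmed SDK API.

```python
# Sketch only: include_outputs=True is described above; the
# `output` attribute name below is an assumption, check your SDK version.
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist="<dataset name>",
    name="mytestrun-with-outputs",
    include_outputs=True
)

for test_case in test_run.test_cases:
    expected = test_case.output  # assumed attribute holding the expected output
    print(test_case.variables, expected)
```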
For the code you want to test: loop over each test case from the dataset, make an LLM call, and record the results with a link to your test run.
```python
# iterate over each test case
for test_case in test_run.test_cases:
    # format the prompt with the test case variables
    formatted_prompt = template_prompt.bind(test_case.variables).format()

    # make your llm call
    start = time.time()
    openai_client = OpenAI(api_key=openai_key)
    chat_response = openai_client.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages=formatted_prompt.llm_prompt,
        **formatted_prompt.prompt_info.model_parameters
    )
    end = time.time()

    # append the results to the messages
    all_messages = formatted_prompt.all_messages({
        'role': chat_response.choices[0].message.role,
        'content': chat_response.choices[0].message.content
    })

    call_info = CallInfo.from_prompt_info(
        formatted_prompt.prompt_info,
        start_time=start,
        end_time=end,
        usage=UsageTokens(
            chat_response.usage.prompt_tokens,
            chat_response.usage.completion_tokens
        )
    )

    # create a session, which generates a unique session ID
    session = fp_client.sessions.create()

    # build the record payload
    payload = RecordPayload(
        project_id=project_id,
        all_messages=all_messages,
        inputs=test_case.variables,  # variables from the test case are the inputs
        session_info=session.session_info,
        # IMPORTANT: link the recorded call to the test run and test case
        test_run_info=test_run.get_test_run_info(test_case.id),
        prompt_version_info=formatted_prompt.prompt_info,  # log the prompt information
        call_info=call_info
    )

    # record the results to Freeplay
    fp_client.recordings.create(payload)
```
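If you fetched expected outputs (via include_outputs=True), you can also compute a simple client-side metric as the loop runs, before or alongside recording. The helper below is a generic sketch in plain Python, not part of the Freeplay SDK; the name `exact_match_rate` and the `(expected, actual)` pair format are illustrative.

```python
def exact_match_rate(results):
    """Fraction of test cases whose model output exactly matches the
    expected output, after stripping surrounding whitespace.

    `results` is a list of (expected, actual) string pairs collected
    while iterating over test cases.
    """
    if not results:
        return 0.0
    matches = sum(
        1 for expected, actual in results
        if expected.strip() == actual.strip()
    )
    return matches / len(results)

# Example: two of the three cases match after whitespace normalization
results = [
    ("Paris", "Paris"),
    ("4", "4 "),
    ("blue", "green"),
]
print(exact_match_rate(results))
```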
To execute trace-level tests in code, iterate through each test case in your dataset and run it through your full agent workflow. This end-to-end testing approach is particularly valuable for agentic systems, where the goal is to observe how changes to prompts, tools, or orchestration logic affect the final output. Trace-level tests let you simulate production-like behavior and evaluate the agent holistically. As each test case runs, its input is passed into your system, and the resulting trace is logged for evaluation and analysis. See the full example here. Below is a pseudocode example showing the general logic for test runs:
```python
##################################
# Prepare the test
##################################
trace_dataset = "Your dataset name that targets an agent"
test_name = "Name of the test you are running"

# Create the test run
test_run = fp_client.test_runs.create(
    project_id=project_id,
    testlist=trace_dataset,
    name=test_name
)

##################################
# Initialize your code or agents
##################################
# NOTE: It is important that all agents get passed the test_run_info
# and that it is used when RecordPayload is created. This is how
# Freeplay tracks the test run.
my_agent = MyAgent(<initialize your system code>)

################################################
# Loop over all test cases and execute system code
################################################
for test_case in test_run.trace_test_cases:
    # Initialize a Freeplay session
    session = fp_client.sessions.create()

    # NOTE: get the test run info for this test case; it must be
    # passed with any record calls
    test_run_info = test_run.get_test_run_info(test_case.id)

    # Create a trace for this test case
    question = test_case.input
    trace_info = session.create_trace(
        input=question,
        agent_name="MyAgent",  # agent's name
        custom_metadata={
            "version": "1.0.0"  # custom dict[str, str]
        }
    )

    ################################################
    # Execute agent
    ################################################
    # Run your system code using the agent's inputs/outputs.
    # This is where all the sub-agent processes and recordings happen.
    agent_output = my_agent.run(
        input=question,
        test_run_info=test_run_info  # NOTE: must be passed to record calls
    )

    # Record the agent output, passing the test run info here as well
    trace_info.record_output(
        project_id,
        agent_output,
        eval_results={
            'evaluation_score': 0.48,
            'is_high_quality': True
        },
        test_run_info=test_run_info
    )
```
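The note above says every agent must receive test_run_info and use it when building RecordPayload. A minimal sketch of what that threading could look like inside an agent follows; MyAgent, its constructor arguments, the `_call_llm` helper, and the `inputs` key are all illustrative assumptions, not the Freeplay API verbatim.

```python
# Illustrative sketch: shows test_run_info flowing from the test loop
# into the RecordPayload created for each completion inside an agent.
class MyAgent:
    def __init__(self, fp_client, project_id):
        self.fp_client = fp_client
        self.project_id = project_id

    def run(self, input, test_run_info, session_info):
        # ... call your LLM(s) here and collect the message history ...
        all_messages = self._call_llm(input)  # hypothetical helper

        payload = RecordPayload(
            project_id=self.project_id,
            all_messages=all_messages,
            inputs={"question": input},  # illustrative input key
            session_info=session_info,
            test_run_info=test_run_info,  # the critical link to the test run
        )
        self.fp_client.recordings.create(payload)
        return all_messages[-1]["content"]
```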