Datasets in Freeplay are an essential part of organizing data to test your LLM systems. They can also be used to curate data for human review or fine-tuning. Datasets are the foundation of Test Runs in Freeplay.
Freeplay supports two types of datasets:
  • Prompt Datasets (for component-level testing)
  • Agent Datasets (for end-to-end testing)
Each type automatically enforces a schema that keeps dataset entries compatible with your prompts and agents.
A key benefit of using Freeplay to curate Datasets is that it’s seamless to save new examples that you observe in real-world testing or production to existing Datasets. This keeps the data fresh and representative of the actual use of your application. Datasets can be created to test LLM systems across a variety of scenarios, such as:
  • Golden Set: For detecting regressions vs. your ideal ground truth
  • Failure Cases: For tracking failures you observe and testing in the future to confirm they are fixed
  • Red Teaming: For managing adversarial test cases and confirming appropriate behavior by your system
  • Random Samples: For representative testing across a distributed set of values
Instructions on how to save observed data or upload data are below.

Understanding the output field

Every dataset entry has an output field. While not strictly required, we strongly recommend including an output for each example — it plays a central role in evaluations and test runs, and examples without an output have limited utility for testing. There are two primary ways to use the output field:
  • Golden output: The output represents the ideal, correct response for the given inputs. This is common in golden sets and broad-based datasets where you want to benchmark new prompt versions against a curated standard. When used in test runs or in the playground, these outputs can be viewed to see how the newly generated data compares to the ideal output.
  • Failure case: The output captures a real failure observed in production — such as a hallucination, incorrect answer, or off-tone response. This is useful for building targeted datasets that track known issues so you can confirm they are fixed in future prompt versions.
The output can come from any source — uploaded files, completions saved from observed logs, or manually written examples. When saving completions from production logs, you have the option to edit the output before saving, which allows you to curate it into a golden response or preserve it as a failure case depending on your testing goals.
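To make the distinction concrete, here is a minimal sketch of what the two uses of the `output` field might look like as dataset entries. The field names follow the upload format described later on this page; the example values (the question, answers, and metadata keys) are invented for illustration.

```python
# Hypothetical dataset entries illustrating the two uses of the "output" field.
# Field names follow the upload format described below; values are invented.

golden_entry = {
    "inputs": {"question": "What is Freeplay?"},
    # Golden output: the ideal response to benchmark new prompt versions against.
    "output": "Freeplay is a platform for testing and observing LLM systems.",
    "metadata": {"source": "curated"},
}

failure_entry = {
    "inputs": {"question": "What is Freeplay?"},
    # Failure case: a real production response captured so the regression
    # can be re-tested against future prompt versions.
    "output": "Freeplay is a video game streaming service.",  # hallucinated answer
    "metadata": {"source": "production", "issue": "hallucination"},
}
```

Both entries share the same shape; only the intent behind the `output` value differs, so a dataset can mix golden and failure examples and distinguish them with metadata.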

Curating Datasets

Datasets in Freeplay can be curated in one of two ways: by saving completions that are recorded to Freeplay straight from the Sessions view, or by uploading existing test cases to a Dataset.

Saving Data from Recorded Sessions

While working with recorded Sessions or Traces in Freeplay, if you encounter values that are relevant for future testing, you can save them directly to a Dataset. Before saving, you will have the option to curate the inputs and outputs, which is useful if you want the sample to represent a specific type of data, such as a golden or failure case. This can be done from either the trace or completion view. To do so:
  • Click + Dataset above the completion/trace view
  • Optionally, make adjustments to the inputs, history or outputs
  • Select the relevant dataset(s)
  • Optionally, click the + button to create a new dataset from this menu

Bulk Add

You can also select multiple completions or traces and add them to a dataset in one step, even across pages.
  • Select the “Completions” or “Traces” view on Observability (instead of Sessions)
  • Click the radio buttons in the table for the rows you want

Adding Metadata to Dataset Entries

Metadata can be added to entries in your datasets, allowing you to store additional information with each entry. To add or edit metadata for a dataset entry:
  1. Navigate to a specific dataset entry
  2. Click the “Edit” option in the dropdown menu
  3. In edit mode, you’ll see a dedicated “Metadata” section at the top of the entry
  4. Add customizable key-value pairs such as:
    • Customer identifiers (e.g., “customerId”: “2382721”)
  5. Click “Add Metadata” to create additional fields as needed
  6. Click “Save” to store your changes

Uploading Datasets

If you have existing data that is relevant to use for testing prompts in Freeplay, you can upload it directly as a JSONL or CSV file. Both formats support the same fields.

Prompt Template Datasets

  • inputs.<variable_name> (required; at least one): Each input variable must be prefixed with inputs. (e.g., inputs.question, inputs.job_info). These map directly to {{variable_name}} in your prompt template.
  • history (optional): A JSON array of previous messages representing the conversation history. See Tool Calls in History below for supported message formats.
  • output (required; may be empty): The expected or ideal output for the given inputs — either a golden response or a captured failure case. See Understanding the Output Field.
  • metadata.<metadataName> (optional): Additional information on each data sample. Metadata columns must start with metadata. (e.g., metadata.employee_id, metadata.source).
For more details on variable usage, see our Advanced Prompt Templating guide.
Tool Calls in History: The history field supports tool call and tool result messages, allowing you to upload conversation histories that include function calling interactions. Assistant messages can contain tool_call content blocks, and user messages can contain tool_result content blocks. See the samples below for examples.

How to Upload

  1. Navigate to the Dataset and click the Upload button.
  2. For CSV uploads, click Download CSV Template in the bottom-left corner of the upload dialog to get a file with the correct column names for your dataset.
  3. Format your data according to the JSONL or CSV format below, ensuring that each entry aligns with your selected prompt template.
  4. Upload your file. If there are any formatting issues, Freeplay will flag them and block the upload with a warning.

JSONL Format

Each line in the file must be a valid JSON object flattened to a single line, and the file must use the .jsonl extension. Note that JSONL is NOT standard JSON — see jsonlines.org for the specification.
{"inputs": {"job_info": {"title": "Senior Software Engineer", "level": "IC4", "department": "Engineering", "hire_date": "2021-03-15", "manager": "David Park"}}, "history": [{"role": "user", "content": "Can you pull up the compensation summary for Sarah Chen?"}, {"role": "assistant", "content": [{"type": "tool_call", "id": "call_001", "name": "lookup_employee", "arguments": {"employee_name": "Sarah Chen"}}]}, {"role": "user", "content": [{"type": "tool_result", "tool_call_id": "call_001", "content": "{\"employee_id\": \"EMP001\", \"name\": \"Sarah Chen\", \"current_salary\": 178000, \"currency\": \"USD\"}"}]}, {"role": "assistant", "content": "I found Sarah Chen's profile. Let me generate her compensation brief now."}], "output": "Summary\n\nSarah Chen — Senior Software Engineer (IC4), Engineering. Base: $178,000 USD.", "metadata": {"employee_id": "EMP001", "name": "Sarah Chen", "department": "Engineering"}}
{"inputs": {"job_info": {"title": "Marketing Manager", "level": "M1", "department": "Marketing", "hire_date": "2021-11-01", "manager": "Rachel Adams"}}, "history": [{"role": "user", "content": "I need to prep for Emma's comp review."}, {"role": "assistant", "content": [{"type": "tool_call", "id": "call_002", "name": "lookup_employee", "arguments": {"employee_name": "Emma Clarke"}}, {"type": "tool_call", "id": "call_003", "name": "get_market_data", "arguments": {"market": "London Tech", "role": "Marketing Manager"}}]}, {"role": "user", "content": [{"type": "tool_result", "tool_call_id": "call_002", "content": "{\"employee_id\": \"EMP007\", \"name\": \"Emma Clarke\", \"current_salary\": 83000, \"currency\": \"GBP\"}"}, {"type": "tool_result", "tool_call_id": "call_003", "content": "{\"median_salary\": 80000, \"p75_salary\": 92000, \"currency\": \"GBP\"}"}]}, {"role": "assistant", "content": "I've pulled Emma's profile and London market benchmarks. Generating her brief now."}], "output": "Summary\n\nEmma Clarke — Marketing Manager (M1), Marketing. Base: £83,000 GBP. Market median: £80,000.", "metadata": {"employee_id": "EMP007", "name": "Emma Clarke", "department": "Marketing"}}
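Flattening each object to a single line by hand is error-prone, so it can help to generate the file programmatically. The sketch below uses Python's standard `json` module to write one object per line; the entries themselves (field values, file name) are invented examples that follow the format above.

```python
import json

# Illustrative entries; the field names (inputs / history / output / metadata)
# follow the upload format above, but the values are invented.
entries = [
    {
        "inputs": {"question": "What is the base salary for Sarah Chen?"},
        "history": [{"role": "user", "content": "Pull up Sarah Chen's profile."}],
        "output": "Sarah Chen's base salary is $178,000 USD.",
        "metadata": {"employee_id": "EMP001"},
    },
    {
        "inputs": {"question": "Summarize Emma Clarke's compensation."},
        "history": [],
        "output": "Emma Clarke's base salary is £83,000 GBP.",
        "metadata": {"employee_id": "EMP007"},
    },
]

# JSONL: one JSON object per line, no enclosing array, no trailing commas.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Because `json.dumps` emits a single line by default (no `indent` argument), each entry lands on its own line, which is exactly what the JSONL format requires.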

CSV Format

Your CSV must use specific column header prefixes for Freeplay to correctly parse your data. Any columns that don’t follow these conventions will be ignored.
inputs.job_info,history,output,metadata.employee_id,metadata.name,metadata.department
"{""title"": ""Senior Software Engineer"", ""level"": ""IC4"", ""department"": ""Engineering"", ""hire_date"": ""2021-03-15"", ""manager"": ""David Park""}","[{""role"": ""user"", ""content"": ""Can you pull up the compensation summary for Sarah Chen?""}, {""role"": ""assistant"", ""content"": [{""type"": ""tool_call"", ""id"": ""call_001"", ""name"": ""lookup_employee"", ""arguments"": {""employee_name"": ""Sarah Chen""}}]}, {""role"": ""user"", ""content"": [{""type"": ""tool_result"", ""tool_call_id"": ""call_001"", ""content"": ""{\""employee_id\"": \""EMP001\"", \""name\"": \""Sarah Chen\"", \""current_salary\"": 178000, \""currency\"": \""USD\""}""}]}, {""role"": ""assistant"", ""content"": ""I found Sarah Chen's profile. Let me generate her compensation brief now.""}]","Summary

Sarah Chen — Senior Software Engineer (IC4), Engineering. Base: $178,000 USD.","EMP001","Sarah Chen","Engineering"
"{""title"": ""Marketing Manager"", ""level"": ""M1"", ""department"": ""Marketing"", ""hire_date"": ""2021-11-01"", ""manager"": ""Rachel Adams""}","[{""role"": ""user"", ""content"": ""I need to prep for Emma's comp review.""}, {""role"": ""assistant"", ""content"": [{""type"": ""tool_call"", ""id"": ""call_002"", ""name"": ""lookup_employee"", ""arguments"": {""employee_name"": ""Emma Clarke""}}, {""type"": ""tool_call"", ""id"": ""call_003"", ""name"": ""get_market_data"", ""arguments"": {""market"": ""London Tech"", ""role"": ""Marketing Manager""}}]}, {""role"": ""user"", ""content"": [{""type"": ""tool_result"", ""tool_call_id"": ""call_002"", ""content"": ""{\""employee_id\"": \""EMP007\"", \""name\"": \""Emma Clarke\"", \""current_salary\"": 83000, \""currency\"": \""GBP\""}""}, {""type"": ""tool_result"", ""tool_call_id"": ""call_003"", ""content"": ""{\""median_salary\"": 80000, \""p75_salary\"": 92000, \""currency\"": \""GBP\""}""}]}, {""role"": ""assistant"", ""content"": ""I've pulled Emma's profile and London market benchmarks. Generating her brief now.""}]","Summary

Emma Clarke — Marketing Manager (M1), Marketing. Base: £83,000 GBP. Market median: £80,000.","EMP007","Emma Clarke","Marketing"
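As the sample shows, cells containing JSON (like history) require doubled quotes inside quoted fields, which is tedious to produce by hand. A small sketch using Python's standard `csv` module handles that escaping automatically; the column names follow the prefix conventions above, while the row values and file name are invented.

```python
import csv
import json

# Column headers follow the inputs./metadata. prefix conventions described
# above; the row values are invented for illustration.
fieldnames = ["inputs.question", "history", "output", "metadata.employee_id"]

rows = [
    {
        "inputs.question": "What is the base salary for Sarah Chen?",
        # JSON-valued cells are serialized to strings; the csv module handles
        # the doubled-quote escaping ("" inside quoted fields) automatically.
        "history": json.dumps([{"role": "user", "content": "Pull up Sarah Chen."}]),
        "output": "Sarah Chen's base salary is $178,000 USD.",
        "metadata.employee_id": "EMP001",
    },
]

with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

Note the `newline=""` argument when opening the file: the `csv` module manages its own line endings, including the newlines inside multi-line quoted cells like the output field in the sample above.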

Dataset Compatibility

We’ve found that relatively flexible compatibility rules are important to accommodate complex prompting strategies. The following rules are useful to know:
  • Compatibility for testing is based on the input {{variable_names}} in your prompt templates. These must match the key names in your Datasets.
  • A Dataset is treated as compatible if one or more key names match for a given prompt template. This is important so that datasets can be treated as compatible even when some variable names are optional in practice. (See Advanced Prompt Templating Using Mustache)
  • Datasets can be used across multiple prompt templates in a Project, as long as at least one variable name is shared. For instance, if you have four prompt templates that all use the variable {{question}}, then any Dataset that contains values for {{question}} will be compatible.
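The "at least one shared variable" rule above can be sketched as a simple set intersection. This is not Freeplay's actual implementation — just an illustrative model, using a naive regex to pull {{variable}} names out of a template (Mustache sections and partials are ignored).

```python
import re

def template_variables(template: str) -> set[str]:
    # Naive extraction of {{variable}} names; ignores Mustache
    # sections ({{#...}}) and partials, which real templates may use.
    return set(re.findall(r"\{\{\s*([\w.]+)\s*\}\}", template))

def is_compatible(template: str, dataset_input_keys: set[str]) -> bool:
    # Per the rules above: a dataset is compatible with a prompt template
    # when at least one variable name is shared.
    return bool(template_variables(template) & dataset_input_keys)

prompt = "Answer the user's {{question}} using {{context}}."
print(is_compatible(prompt, {"question"}))   # shares "question" -> compatible
print(is_compatible(prompt, {"job_info"}))   # no shared keys -> not compatible
```

Under this model, a single dataset with a "question" key would be compatible with every template in a Project that references {{question}}, matching the four-template example above.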

What’s Next