Dataset Curation
What are datasets and why are they important?
Testing and evaluation are key aspects of the LLM development cycle. Your test quality and reliability are a function of two primary components: your Evaluators and your Datasets. In this guide, we are going to focus on Dataset curation.
Simply put, datasets are collections of inputs and outputs that you can use to test your LLM systems. Datasets are important because, for your tests to be truly informative, they need to accurately represent the issues and situations your LLM systems face in the wild.
We’ll cover how we at Freeplay think strategically about building datasets and then look tactically at how to curate datasets inside of Freeplay.
Dataset Curation Strategy
Once your team has decided what product or feature you want to build, a common next step is curating your datasets. Many teams will build a single dataset of various scenarios for an LLM feature and instinctively stop there. While one dataset is a fantastic starting point, teams quickly realize that multiple datasets are crucial for effective testing and experimentation. Broadly speaking, there are two types of datasets: Targeted datasets and Broad-based datasets.
Targeted datasets
Targeted datasets focus on a narrowly defined issue or situation.
For example, let’s say you’re working on an e-commerce use case in which you’re using an LLM to answer customer questions about their orders.
The pipeline has two components (sketched below):
- First, the LLM generates a SQL query from the user question
- Then, the LLM uses the results of that query to generate an answer
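To make that flow concrete, here is a minimal sketch of what such a pipeline might look like. Everything in it is a hypothetical illustration: the `ask_llm` helper, the schema, and the table names are placeholders, not part of Freeplay or any specific SDK.

```python
import sqlite3

SCHEMA = "orders(id, customer_id, status, delivered_at)"  # hypothetical schema


def ask_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client call you actually use."""
    raise NotImplementedError


def answer_order_question(question: str, db: sqlite3.Connection) -> str:
    # Step 1: generate a SQL query from the user question.
    # If the model invents a table name here, the query below will fail:
    # exactly the kind of narrow failure point a targeted dataset can capture.
    sql = ask_llm(
        f"Schema: {SCHEMA}\n"
        f"Write a SQL query that answers: {question}"
    )

    # Step 2: run the query and generate an answer from the results.
    rows = db.execute(sql).fetchall()
    return ask_llm(
        f"Question: {question}\nQuery results: {rows}\nAnswer the customer."
    )
```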
To test this pipeline, you might create a targeted dataset called “Query Hallucinations”: a collection of examples in which the model hallucinates a table name. You might create another targeted dataset called “Delivered Orders”, which collects examples where the user asked about an order that was already delivered. The first dataset is focused on a technical failure point; the second is focused on a specific customer situation, orders that have already been delivered. Both are narrowly focused.
When iterating on LLMs, it’s often useful to focus on one specific problem at a time. Targeted datasets allow you to quickly iterate on your key areas of concern.
Broad-based datasets
Broad-based datasets include a wide array of examples and are not focused on any specific issue or situation.
The classic example of a broad-based dataset is the “golden set”: a dataset of examples hand-curated by humans to represent the ideal output for a given input.
Unlike a targeted dataset, broad-based datasets contain a variety of situations meant to capture the totality of cases your LLM system needs to be able to handle. This kind of dataset is often used for benchmarking or regression testing.
These two types of datasets can then be used together during experimentation. Continuing our e-commerce example from earlier, let’s say you’re focused on reducing hallucinations in SQL query generation. You can first focus on making prompt and model changes for that specific issue, frequently testing against your targeted dataset along the way. Then, once you think you have a fix in hand, test the new config against your broad-based dataset to ensure you haven’t regressed on other dimensions.
Anatomy of a Freeplay Dataset
Now that you’re familiar with the primary types of datasets, let’s take a look at what a Freeplay dataset consists of.
- Name - ex. “Query Hallucinations”
- Description - ex. “Sessions where the model referenced an invalid table”
- Prompt Compatibility - The prompt(s) your dataset can be used with, determined by the prompts’ input variables.
- Examples - Combinations of inputs and outputs that you save to the dataset.
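Put together, you can think of a dataset roughly like this. This is a hypothetical illustration for intuition only; the field names are not Freeplay’s actual schema.

```python
# A rough mental model of a dataset; keys are illustrative, not the real schema.
query_hallucinations = {
    "name": "Query Hallucinations",
    "description": "Sessions where the model referenced an invalid table",
    # Compatibility is determined by the input variables the prompt(s) expect.
    "compatible_input_variables": ["question", "schema"],
    "examples": [
        {
            "inputs": {
                "question": "Where is my order #1042?",
                "schema": "orders(id, customer_id, status, delivered_at)",
            },
            "output": "SELECT status FROM shipments WHERE id = 1042;",  # invalid table
        },
    ],
}
```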
Curating Datasets in Freeplay
Step 1: Create a new Dataset
Navigate to the Datasets tab and select “Create dataset”. Give your dataset a Name and Description, then set your Prompt Compatibility. You’ll need to decide what prompt(s) you want your dataset to be compatible with. Compatibility is determined by the prompt’s input variables. Datasets can be compatible with multiple prompts as long as those prompts share at least one common input variable.
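For a sense of how shared input variables drive compatibility, consider two hypothetical prompt templates that both use a `question` variable. A dataset whose examples supply `question` could be compatible with either one.

```python
# Two hypothetical prompt templates; both expect a "question" input variable.
sql_prompt = "Schema: {schema}\nWrite a SQL query that answers: {question}"
answer_prompt = "Question: {question}\nQuery results: {results}\nAnswer the customer."

example_inputs = {"question": "Where is my order #1042?"}

# The shared "question" variable is what makes one dataset usable with both prompts.
print(sql_prompt.format(schema="orders(id, customer_id, status, delivered_at)", **example_inputs))
print(answer_prompt.format(results="[('delivered',)]", **example_inputs))
```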

Alternatively, you can create a new Dataset directly from a completion by clicking “Add to dataset” and then hitting the + icon. Note: if you create the dataset this way, prompt compatibility will be inferred for you from the completion you created it from.

Step 2: Add Examples
There are a number of ways to add examples to a dataset.
Add from completion
From any completion you can hit “Add to dataset” to create a new example from that completion. This is often a big part of the human review flow: as reviewers label data, they can actively build datasets as well.

Bulk add from completions table
You can bulk add completions to a dataset by going to the Observability tab, toggling to the completions view in the table and then selecting the completions you want to add.
We often see users filter on things like eval values, customer feedback, or other metrics, and then bulk-add completions from there.

Upload examples
If you have existing examples, you can upload them to Freeplay via JSONL. Navigate to the Datasets tab, select your dataset, and click upload.
You can read more about formatting the JSONL file here.
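As a rough illustration, a JSONL upload contains one example per line, each with input variable values and an output. The exact field names Freeplay expects are described in the formatting docs linked above, so treat the keys in this sketch as hypothetical.

```python
# Hypothetical script that writes examples to a JSONL file for upload.
# The keys below are illustrative; see the linked formatting docs for the real schema.
import json

examples = [
    {
        "inputs": {"question": "Where is my order #1042?"},
        "output": "Your order was delivered on March 3rd.",
    },
    {
        "inputs": {"question": "Can I cancel order #2210?"},
        "output": "Order #2210 has already shipped, so it can no longer be cancelled.",
    },
]

with open("delivered_orders.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```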
Manually write examples
You can also write examples directly in the UI. From any dataset click “Create an example” and a form will appear where you can write a new example by hand. You’ll enter values for each input variable as well as the output. It’s okay to leave any of these blank if it makes sense for your example.

Step 3: Run a Test against your Dataset
After you’ve created a dataset you can run a batch test with any of your compatible prompts. Batch tests can be kicked off either from the Freeplay app or via the SDK.
To run a batch test from the UI go to the Tests tab and click “Run Test”. From there you can configure the test by selecting the prompt version you want to test and the dataset you want to test with.

To run a batch test from the SDK, see the docs here.
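Conceptually, an SDK-driven batch test boils down to running a prompt version over every example in a dataset and recording the results. The sketch below is a purely conceptual illustration: the client and its method names are hypothetical placeholders, not the actual Freeplay SDK surface, which is documented in the link above.

```python
# Conceptual sketch only: "client", "start_test_run", and "record_result" are
# hypothetical placeholders, not real Freeplay SDK calls.
def run_batch_test(client, dataset_name: str, prompt_version: str, pipeline):
    """Run `pipeline` (your own LLM code) over every example in a dataset."""
    # Ask the platform to start a test run against the dataset.
    test_run = client.start_test_run(dataset=dataset_name)

    for example in test_run.examples:
        # Execute your pipeline with this example's input variables.
        output = pipeline(example.inputs, prompt_version)

        # Record the output so it can be scored by your evaluators.
        client.record_result(test_run, example, output)
```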
Step 4: Managing Datasets
You can manage your datasets on an ongoing basis in the Datasets tab. Here you can add, edit, and delete examples.
Bonus: Use your Dataset in the Playground
When editing a prompt in the playground you can pull in examples from your dataset and run them in real time to test your changes.
In the prompt editor, click the folder icon to load in examples.

Key Takeaways
Dataset curation is an often overlooked part of the LLM development cycle. Your testing is only as good as your underlying datasets. Having a rich collection of datasets empowers developers to iterate faster and ultimately deliver higher-quality AI features for your customers. Freeplay helps facilitate that dataset curation process in a fully integrated platform.