Evaluations

Overview of conducting evaluations with Freeplay.

Introduction

Evaluation in machine learning is the process of determining a model's performance via a metrics-driven analysis.

Freeplay allows you to incorporate evaluations into your product development lifecycle in a way that is focused on your particular product context or domain. By defining appropriate evaluations for your specific use case, you gain insights that are far more valuable than generic industry benchmarks. Read more on our blog here.

Freeplay supports three modes of evaluations that each work together:

  • Human evaluation: aka "data annotation" or "labeling", where your team can easily review and score results
  • Model-graded evaluation: using an LLM as a judge for nuanced evaluation criteria instead of humans
  • Code evaluation: where you construct custom functions that evaluate some quantifiable element (like JSON schema or embedding distance)

Some criteria may be appropriate only for human evaluation, while others can benefit from humans working together with model-graded auto-evaluators — giving humans the ability to inspect, confirm or correct any auto-eval results and improve on the model-graded results.

Configuring Evaluation Criteria

For each of your prompts, you can configure one or more relevant human or model-graded evaluation criteria in Freeplay. Any code evaluations can be logged to Freeplay directly using our SDKs.

Any evaluation criteria configured in Freeplay can be used for human labeling/annotation, and you can optionally enable model-graded auto-evaluations for relevant criteria too. For example, you might want model-graded evals to score the quality of an LLM response, but you only want humans to be able to leave notes on a completion.

You'll start by configuring evaluation criteria on a prompt template. Go to Prompts > Pick the prompt you want > scroll to the Evaluations section at the bottom.

Each evaluation criterion requires the following components:

  1. Name: Give the criterion an easy-to-recognize name that will show up in the UI for your team.

  2. Question: Along with the name, define the guiding question that a human evaluator should address for that criterion. This question should be clear and focused, so that every evaluator understands objectively how to answer it.

  3. Evaluation Type: Freeplay currently supports 4 types of evaluation criteria:

    1. Yes/No boolean (use "Yes" as the positive value)
    2. 1-5 Scale (use "5" as the positive value)
    3. Text (free text string, useful for leaving comments or other descriptions of issues)
    4. Multi-select (tags/enums, useful for data categorization)
  4. Optionally: Enable model-graded auto-evaluation: Choose whether you want to configure and run an auto-evaluator for the criterion. More details on setting up model-graded auto-evaluators are below.

    • Enter a prompt in the "Instructions" section. This prompt tells the LLM what it is evaluating and, importantly, indicates which parts of the prompt or output you want to target. It dictates the values passed to the LLM at execution time.
      • You can target components of your prompt with mustache syntax. In this case we will use {{inputs.supporting_information}} to target our retrieved context and {{output}} to target the LLM-generated output, which are the two components we need for this eval.
    • Next, configure a Rubric. This section details the criteria the LLM will use to make its final decision and can be extremely valuable for generating high-quality responses. (An illustrative example of Instructions and a Rubric appears at the end of this section.)
  5. Optionally: Align your auto-evaluators

Prompts for auto-evaluators can take some iteration to get right, just like other prompt engineering. Freeplay provides functionality to align your auto-evaluators with human feedback from your team as you're creating your eval criteria. Each time you update the prompt, you can re-run it against a sample of examples and compare the results to your team's choices.
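
As a concrete illustration, the Instructions and Rubric for a hypothetical "Answer faithfulness" criterion (a Yes/No boolean) might look something like the sketch below. The wording is only an example; the {{inputs.supporting_information}} and {{output}} placeholders follow the mustache targeting described above, and your own variable names will differ.

  Instructions:

    You are evaluating whether the answer below is fully supported by the retrieved context.

    Retrieved context:
    {{inputs.supporting_information}}

    Answer to evaluate:
    {{output}}

  Rubric:

    Answer "Yes" if every factual claim in the answer is supported by the retrieved context.
    Answer "No" if the answer makes any claim that is not supported by, or contradicts, the retrieved context.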

Conducting Evaluations

Human Evaluation/Labeling

Your team members can use Freeplay to label any session, or to filter a group of sessions that share common criteria and label them (e.g. weekly/monthly spot checks of production data).

To get started with human evaluation:

  1. Invite Your Team: Under Settings > Account > New user. Only Admins can invite new team members.
  2. Browse Sessions: Filter Sessions to find the ones you want to label, then navigate through individual Sessions to apply labels under the Evaluation section in the sidebar.
  3. Use Tooltips for Guidance: Hover over the tooltips to see the instructions (the "Question" value configured above).
  4. Label Sessions: Label Sessions according to your evaluation criteria.

Model-graded Evaluations

Model-graded auto-evaluations on the Freeplay platform are performed in two ways:

  1. Batch testing via Test Runs
    • Test Runs allow you to proactively run batch tests against Datasets to measure your system's performance over time and compare changes side by side. These Test Runs can be executed either via the UI or via the SDK.
  2. Live Monitoring of Production Sessions
    • Freeplay will automatically sample a subset of your production traffic and run auto-evaluations on it to give you insight into how your systems are behaving in the wild.

Code Evaluations

Freeplay also offers the ability to run code-driven evaluations on the client side and then log those results to Freeplay. These evals are generally functions written and run in your own code path, with the results then recorded back to Freeplay.

These evaluations are particularly useful for criteria requiring logical expressions, such as JSON schema checks or category assertions on single answers, or for pairwise comparisons to an expected output via methods like embedding or string distance. Code evals can be added both to:

  • Individual Sessions
  • Test Runs executed with our SDK or API, which can include comparisons to ground truth data

In either case, any results you log to Freeplay flow through to the UI just like human or model-graded evals. See our SDK documentation for more details.
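
As an illustration of the kind of function involved, the sketch below checks that a completion is valid JSON with required fields and scores its similarity to an expected answer using only the Python standard library. The function names, field names, and sample values are hypothetical; recording the resulting scores to Freeplay happens through the SDK as described in the SDK documentation.

  import json
  from difflib import SequenceMatcher

  def json_schema_check(output: str, required_fields: list[str]) -> bool:
      """Return True if the completion is valid JSON containing every required field."""
      try:
          parsed = json.loads(output)
      except json.JSONDecodeError:
          return False
      return isinstance(parsed, dict) and all(field in parsed for field in required_fields)

  def string_similarity(output: str, expected: str) -> float:
      """Score how closely the completion matches a ground-truth answer (0.0 to 1.0)."""
      return SequenceMatcher(None, output.strip(), expected.strip()).ratio()

  # Hypothetical usage inside your own code path:
  completion = '{"answer": "The refund window is 30 days.", "sources": ["policy-doc-4"]}'
  scores = {
      "valid_json": json_schema_check(completion, ["answer", "sources"]),
      "answer_similarity": string_similarity(
          json.loads(completion)["answer"], "Refunds are accepted within 30 days."
      ),
  }
  # These scores would then be logged to Freeplay via the SDK
  # (see the SDK documentation for the exact recording calls).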


What’s Next

Now that we've configured Evaluations, let's learn more about Freeplay Test Run functionality.