Evaluations

Overview of conducting evaluations with Freeplay.

Introduction

Evaluation in machine learning is the process of determining a model's performance via a metrics-driven analysis.

Freeplay allows you to incorporate evaluations into your product development lifecycle in a way that is focused on your particular product context or domain. By defining appropriate evaluations for your specific use case, you gain insights that are far more valuable than generic industry benchmarks. Read more on our blog here.

Freeplay supports three modes of evaluations that each work together:

  • Human evaluation: aka "data annotation" or "labeling", where your team can easily review and score results
  • Model-graded evaluation: using an LLM as a judge for nuanced evaluation criteria instead of humans
  • Code evaluation: where you construct custom functions that evaluate some quantifiable element (like JSON schema or embedding distance)

Some criteria may be appropriate only for human evaluation, while others can benefit from humans working together with model-graded auto-evaluators — giving humans the ability to inspect, confirm or correct any auto-eval results and improve on the model-graded results.

Configuring Evaluation Criteria

For each of your prompts, you can configure one or more relevant human or model-graded evaluation criteria in Freeplay. Any code evaluations can be logged to Freeplay directly using our SDKs.

Any evaluation criteria configured in Freeplay can be used for human labeling/annotation, and you can optionally enable model-graded auto-evaluations for relevant criteria too. For example, you might want model-graded evals to score the quality of an LLM response, but you only want humans to be able to leave notes on a completion.

You'll start by configuring evaluation criteria on a prompt template. Go to Prompts > Pick the prompt you want > scroll to the Evaluations section at the bottom.

Each evaluation criterion requires the following components:

  1. Name: Give the criterion an easy-to-recognize name that will show up in the UI for your team.

  2. Question: Along with the name, define the guiding question that a human evaluator should address for that criterion. This question should be clear and focused, so that every evaluator understands objectively how to answer it.

  3. Evaluation Type: Freeplay currently supports 4 types of evaluation criteria:

    1. Yes/No boolean (use "Yes" as the positive value)
    2. 1-5 Scale (use "5" as the positive value)
    3. Text (free text string, useful for leaving comments or other descriptions of issues)
    4. Multi-select (tags/enums, useful for data categorization)
  4. Optionally: Enable model-graded auto-evaluation: Choose whether you want to configure and run an auto-evaluator for the criterion. More details on setting up model-graded auto-evaluators are below.

    • Enter a prompt in the "Instructions" section. This prompt tells the LLM what it is evaluating and, importantly, indicates which parts of the prompt or output you want to target. It dictates the values passed to the LLM at execution time.
      • You can target components of your prompt with mustache syntax. In this case we will use {{inputs.supporting_information}} to target our retrieved context and {{output}} to target the LLM-generated output, which are the two components we need for this eval.
    • Next, configure a Rubric. This section details the criteria the LLM will use to make its final decision and can be extremely valuable for generating high-quality responses. (An illustrative example of Instructions and a Rubric appears at the end of this section.)
  5. Optionally: Align your auto-evaluators

Prompts for auto-evaluators can take some iteration to get right, just like other prompt engineering. Freeplay provides functionality to align your auto-evaluators with human feedback from your team as you're creating your eval criteria. Each time you update the prompt, you can re-run it against a sample of examples and compare the results to your team's choices.
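
As a concrete illustration, the Instructions and Rubric for a hypothetical "Answer faithfulness" criterion (a Yes/No boolean) might look something like the sketch below. The wording is only an example; the {{inputs.supporting_information}} and {{output}} placeholders follow the mustache targeting described above, and your own variable names will differ.

  Instructions:

    You are evaluating whether the answer below is fully supported by the retrieved context.

    Retrieved context:
    {{inputs.supporting_information}}

    Answer to evaluate:
    {{output}}

  Rubric:

    Answer "Yes" if every factual claim in the answer is supported by the retrieved context.
    Answer "No" if the answer makes any claim that is not supported by, or contradicts, the retrieved context.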

Conducting Evaluations

Human Evaluation/Labeling

Your team members can use Freeplay to label any session, or to filter a group of sessions that share common criteria and label them (e.g. weekly/monthly spot checks of production data).

To get started with human evaluation:

  1. Invite Your Team: Under Settings > Account > New user. Only Admins can invite new team members.
  2. Browse Sessions: Filter Sessions to find the ones you want to label, then navigate through individual Sessions to apply labels under the Evaluation section in the sidebar.
  3. Use Tooltips for Guidance: Hover over the tooltips to see the instructions (the "Question" value configured above).
  4. Label Sessions: Label Sessions according to your evaluation criteria.

Model-graded Evaluations

Model-graded auto-evaluations on the Freeplay platform are performed in two ways:

  1. Batch testing via Test Runs
    • Test Runs allow you to proactively run batch tests against Datasets to measure your system's performance over time and compare changes side by side. These Test Runs can be executed either via the UI or via the SDK.
  2. Live Monitoring of Production Sessions
    • Freeplay will automatically sample a subset of your production traffic and run auto-evaluations on it to give you insight into how your systems are behaving in the wild.

Code Evaluations

Freeplay also offers the ability to run code-driven evaluations on the client side and then log those results to Freeplay. These evals are generally functions written and run in your own code path, with the results then recorded back to Freeplay.

These evaluations are particularly useful for criteria requiring logical expressions, such as JSON schema checks or category assertions on single answers, or for pairwise comparisons to an expected output via methods like embedding or string distance. Code evals can be added both to:

  • Individual Sessions
  • Test Runs executed with our SDK or API, which can include comparisons to ground truth data

In either case, any results you log to Freeplay flow through to the UI just like human or model-graded evals. See our SDK documentation for more details.
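
As an illustration of the kind of function involved, the sketch below checks that a completion is valid JSON with required fields and scores its similarity to an expected answer using only the Python standard library. The function names, field names, and sample values are hypothetical; recording the resulting scores to Freeplay happens through the SDK as described in the SDK documentation.

  import json
  from difflib import SequenceMatcher

  def json_schema_check(output: str, required_fields: list[str]) -> bool:
      """Return True if the completion is valid JSON containing every required field."""
      try:
          parsed = json.loads(output)
      except json.JSONDecodeError:
          return False
      return isinstance(parsed, dict) and all(field in parsed for field in required_fields)

  def string_similarity(output: str, expected: str) -> float:
      """Score how closely the completion matches a ground-truth answer (0.0 to 1.0)."""
      return SequenceMatcher(None, output.strip(), expected.strip()).ratio()

  # Hypothetical usage inside your own code path:
  completion = '{"answer": "The refund window is 30 days.", "sources": ["policy-doc-4"]}'
  scores = {
      "valid_json": json_schema_check(completion, ["answer", "sources"]),
      "answer_similarity": string_similarity(
          json.loads(completion)["answer"], "Refunds are accepted within 30 days."
      ),
  }
  # These scores would then be logged to Freeplay via the SDK
  # (see the SDK documentation for the exact recording calls).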


What’s Next

Now that we've configured Evaluations, let's learn more about Freeplay Test Run functionality.