Evaluations Overview

In machine learning, evaluation is the process of measuring a model’s performance through metrics-driven analysis. Freeplay lets you build evaluations into your product development lifecycle in a way that is focused on your particular product context or domain. By defining evaluations for your specific use case, you gain insights that are far more valuable than generic industry benchmarks. Read more on our blog.

Freeplay supports four evaluation modes that work together:
  • Human evaluation: also known as “data annotation” or “labeling”, where your team reviews and scores results
  • Model-graded evaluation: using an LLM as a judge, instead of humans, for nuanced evaluation criteria
  • Code evaluation: deterministic evaluation functions that run either server-side on Freeplay or client-side in your own codebase to validate outputs against specific patterns, formats, and conditions
  • Auto-categorization: automated tagging of your application logs with specified categories
Some criteria may be appropriate only for human evaluation, while others can benefit from humans working together with model-graded auto-evaluators. In the latter case, humans can inspect, confirm, or correct any auto-eval results, improving on what the model-graded evaluation produced alone.
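The relationship between human and model-graded scores can be sketched in a few lines of Python. This is an illustrative model only, not Freeplay's data model or SDK: the `CriterionResult` class, its field names, and the rule that a human score takes precedence are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CriterionResult:
    """One evaluation criterion on one completion (hypothetical structure)."""

    name: str
    auto_score: Optional[int] = None   # score from a model-graded evaluator, if enabled
    human_score: Optional[int] = None  # score from a human reviewer, if provided

    def final_score(self) -> Optional[int]:
        # Assumption: a human label, when present, confirms or corrects
        # the auto-eval score and therefore takes precedence.
        if self.human_score is not None:
            return self.human_score
        return self.auto_score

    def human_disagreed(self) -> bool:
        # True only when both scores exist and they differ.
        return (
            self.auto_score is not None
            and self.human_score is not None
            and self.auto_score != self.human_score
        )
```

Tracking how often `human_disagreed()` is true for a criterion is one way to judge whether its model-graded evaluator can be trusted without review.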

Configuring Evaluation Criteria

For each of your prompts, you can configure one or more human, model-graded, or code evaluation criteria in Freeplay. Any evaluation criterion configured in Freeplay can be used for human labeling and annotation, and you can optionally enable model-graded auto-evaluation for the criteria where it makes sense. For example, you might want model-graded evals to score the quality of an LLM response, while only humans can leave notes on a completion. Client-side code evaluations can be logged to Freeplay directly using our SDKs.
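As a rough illustration of what a client-side code evaluation might look like, the sketch below runs a few deterministic checks on a model output and returns named pass/fail results. Everything in it is hypothetical: the `evaluate_output` function, the check names, the `ORD-` order ID format, and the expected `answer` field are invented for the example. The SDK call that logs these results to Freeplay is deliberately left out because it is not shown here.

```python
import json
import re
from typing import Dict

# Hypothetical pattern: an order ID such as "ORD-123456".
ORDER_ID_PATTERN = re.compile(r"\bORD-\d{6}\b")


def evaluate_output(output: str, max_chars: int = 500) -> Dict[str, bool]:
    """Run deterministic checks on one model output and return named results."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None

    # Format check: the output must be a JSON object, not a list or scalar.
    is_object = isinstance(parsed, dict)
    answer = parsed.get("answer") if is_object else None

    return {
        "is_json_object": is_object,
        # Condition check: a non-empty string "answer" field is present.
        "has_answer_field": isinstance(answer, str) and len(answer) > 0,
        # Pattern check: the raw output mentions an order ID.
        "mentions_order_id": bool(ORDER_ID_PATTERN.search(output)),
        # Length check on the raw output.
        "within_length_limit": len(output) <= max_chars,
    }
```

Because each check is a pure function of the output string, the same input always yields the same result, which is what distinguishes code evaluations from model-graded ones.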

What’s Next

Now review each evaluation type, then move on to test runs once all your evaluations are configured.