
Evaluations Overview
Evaluation in machine learning is the process of determining a model’s performance via metrics-driven analysis. Freeplay allows you to incorporate evaluations into your product development lifecycle in a way that is focused on your particular product context or domain. By defining appropriate evaluations for your specific use case, you gain insights that are far more valuable than generic industry benchmarks. Read more on our blog. Freeplay supports four modes of evaluation that work together:
- Human evaluation: aka “data annotation” or “labeling”, where your team can easily review and score results
- Model-graded evaluation: using LLMs as a judge for nuanced evaluation criteria instead of humans
- Code evaluation: where you construct custom functions that evaluate some quantifiable element (like JSON schema validity or embedding distance)
- Auto-categorization: automated tagging of your application logs with specified categories
Configuring Evaluation Criteria
For each of your prompts, you can configure one or more relevant human or model-graded evaluation criteria in Freeplay. Any code evaluations can be logged to Freeplay directly using our SDKs. Any evaluation criteria configured in Freeplay can be used for human labeling/annotation, and you can optionally enable model-graded auto-evaluations for relevant criteria too. For example, you might want model-graded evals to score the quality of an LLM response, but you only want humans to be able to leave notes on a completion.
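As an illustration, a code evaluation is simply a function you run in your own codebase and whose scores you log alongside a completion. The minimal Python sketch below checks whether a completion is valid JSON and contains a set of expected keys. The `log_code_eval` call at the end is a hypothetical placeholder for the SDK recording call, not an actual Freeplay method; see the SDK reference for the real call.

```python
import json

def json_schema_eval(completion_text: str, required_keys: list[str]) -> dict:
    """Example code evaluation: check that a completion is valid JSON
    and contains a set of required top-level keys."""
    try:
        parsed = json.loads(completion_text)
    except json.JSONDecodeError:
        return {"is_valid_json": False, "has_required_keys": False}

    has_keys = isinstance(parsed, dict) and all(k in parsed for k in required_keys)
    return {"is_valid_json": True, "has_required_keys": has_keys}


# Usage: score a completion, then attach the results when recording the
# completion with the Freeplay SDK. `log_code_eval` below is a hypothetical
# placeholder, not a real SDK method -- substitute the actual recording call.
scores = json_schema_eval('{"answer": "42", "sources": []}', ["answer", "sources"])
# log_code_eval(completion_id="...", eval_results=scores)
```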
What’s Next
Now review each evaluation type, then move on to test runs once all your evaluations are configured!

