- Create datasets - Datasets allow you to repeatedly test against the same inputs and compare results against both a ground-truth expected output and evaluation scores
- Create evaluations - Evaluations allow you to score and label records manually (human labels) or automatically (auto-categorization or LLM-as-judge evaluations)
- Run a test - Tests leverage your datasets and evaluations to compare changes to prompts and model settings (a minimal sketch of this loop follows the list).
- Deploy - When you make a change to a prompt template in Freeplay and deploy it, observability sessions allow you to track that version's performance and even compare it against other versions (a sketch of version-tagged recording follows the list).
- Review - Review sessions allow you to compare the performance of different prompt template versions and identify areas for improvement.
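
To make the dataset → evaluation → test loop concrete, here is a minimal local sketch of what a test run does conceptually: iterate over dataset rows, generate an output for each input with a candidate prompt/model, score the output against the ground-truth expected output, and aggregate the scores so two versions can be compared. All names here (`DatasetRow`, `run_test`, `exact_match_eval`, the toy dataset, and the fake model calls) are hypothetical illustrations, not Freeplay's SDK; Freeplay runs this workflow for you.

```python
# Conceptual sketch of a test run: dataset rows + an evaluation -> a comparable score.
from dataclasses import dataclass
from typing import Callable


@dataclass
class DatasetRow:
    inputs: dict          # variables fed into the prompt template
    expected_output: str  # ground-truth answer to compare against


def exact_match_eval(output: str, expected: str) -> float:
    """Trivial stand-in for an evaluation (human label or LLM-as-judge)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_test(
    rows: list[DatasetRow],
    call_model: Callable[[dict], str],
    evaluate: Callable[[str, str], float] = exact_match_eval,
) -> float:
    """Run every dataset row through the candidate prompt/model and average the scores."""
    scores = [evaluate(call_model(row.inputs), row.expected_output) for row in rows]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    dataset = [
        DatasetRow(inputs={"question": "2 + 2?"}, expected_output="4"),
        DatasetRow(inputs={"question": "Capital of France?"}, expected_output="Paris"),
    ]

    # Two fake "prompt versions" standing in for real LLM calls.
    v1 = lambda inputs: "4" if "2 + 2" in inputs["question"] else "Lyon"
    v2 = lambda inputs: "4" if "2 + 2" in inputs["question"] else "Paris"

    print("v1 score:", run_test(dataset, v1))  # 0.5
    print("v2 score:", run_test(dataset, v2))  # 1.0 -> v2 wins this comparison
```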

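For the deploy step, the key idea is that each production completion is recorded against the prompt-template version that produced it, which is what makes version-to-version comparison possible later. The sketch below shows that pattern with a generic HTTP call; the endpoint URL, payload fields, and credential are assumptions for illustration, not Freeplay's actual recording API — consult the Freeplay SDK/API docs for the real interface.

```python
# Sketch: record a completion tagged with its template version so versions can be compared.
import requests

API_KEY = "fp-..."                                          # hypothetical credential
RECORDINGS_URL = "https://api.example.com/v1/recordings"    # placeholder endpoint


def record_completion(session_id: str, template_version: str, inputs: dict, output: str) -> None:
    """Send one completion, tagged with its prompt-template version, to an observability endpoint."""
    payload = {
        "session_id": session_id,
        "prompt_template_version": template_version,  # enables later version comparison
        "inputs": inputs,
        "output": output,
    }
    resp = requests.post(
        RECORDINGS_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()


# Example: log a single completion produced by the newly deployed template version.
# record_completion("session-123", "v7", {"question": "Capital of France?"}, "Paris")
```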
