Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.
How it works
When you configure a model-graded evaluation:
- You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
- You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
- Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
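As a rough illustration of how these pieces could be combined, the sketch below assembles a judge prompt from the criteria, the completion, and its context. This is not Freeplay's actual prompt format or API: `EvalCriteria`, `build_judge_prompt`, and the prompt layout are hypothetical names and choices made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCriteria:
    """Hypothetical container for the fields described above."""
    name: str
    question: str
    scoring_type: str  # e.g. "Yes/No" or "1-5"
    instructions: str
    rubric: dict = field(default_factory=dict)  # score label -> expected behavior


def build_judge_prompt(criteria, completion, context):
    """Combine criteria, context, and the completion into one judge prompt."""
    rubric_text = "\n".join(
        f"- {label}: {behavior}" for label, behavior in criteria.rubric.items()
    )
    return (
        f"Evaluation: {criteria.name}\n"
        f"Question: {criteria.question}\n"
        f"Scoring type: {criteria.scoring_type}\n\n"
        f"Instructions:\n{criteria.instructions}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Context:\n{context}\n\n"
        f"Completion to evaluate:\n{completion}\n"
    )


# Assumed example values, not taken from a real Freeplay project.
criteria = EvalCriteria(
    name="Groundedness",
    question="Is the answer supported by the provided context?",
    scoring_type="Yes/No",
    instructions="Compare every claim in the completion against the context.",
    rubric={
        "Yes": "All claims are supported by the context.",
        "No": "At least one claim is unsupported.",
    },
)
prompt = build_judge_prompt(
    criteria,
    completion="The warranty lasts two years.",
    context="Our products include a two-year warranty.",
)
print(prompt)
```

Keeping the rubric as explicit label-to-behavior pairs makes it easy to see which score the judge is expected to return for which behavior.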
Use cases
- Production monitoring: Automatically sample and evaluate a subset of production traffic
- Batch testing: Run evaluations across entire datasets during test runs
- Quality gates: Identify outputs that fail specific quality thresholds
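The sampling and quality-gate ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Freeplay implements them: `sample_for_evaluation`, `failing_outputs`, the record shape, and the threshold value are all hypothetical.

```python
import random


def sample_for_evaluation(records, rate, seed=None):
    """Keep each record with probability `rate` (0.0 to 1.0)."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]


def failing_outputs(scored_records, threshold):
    """Return the records whose score falls below the quality threshold."""
    return [record for record in scored_records if record["score"] < threshold]


# Assumed record shape: an id plus a 1-5 score produced by an evaluation.
scored = [
    {"id": "a", "score": 5},
    {"id": "b", "score": 2},
    {"id": "c", "score": 4},
]

# Evaluate roughly 10% of traffic; a fixed seed keeps the sample reproducible.
sampled = sample_for_evaluation(scored, rate=0.1, seed=42)

# Flag anything scoring below 3 on the 1-5 scale.
flagged = failing_outputs(scored, threshold=3)
print([record["id"] for record in flagged])
```

Per-record random sampling keeps the evaluated share close to the chosen rate without needing to know the total traffic volume in advance.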
Configuration
Model-graded evaluations are configured at the prompt template or agent level:
- Select the Evaluations tab from the menu and then select New Evaluation
- Set the target to your prompt/agent to evaluate and the type to Model-graded
- Create your own or select from a pre-configured example
- Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
- Define a rubric that maps scores to specific behaviors
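To make the last two steps concrete, here is a minimal sketch of instructions that reference template variables, paired with a rubric. The `render` helper is a hypothetical stand-in for whatever substitution Freeplay performs internally; only the `{{inputs.context}}` and `{{output}}` placeholders come from the text above, and the rubric wording is an assumed example.

```python
import re

# Matches placeholders such as {{output}} or {{inputs.context}}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, values):
    """Replace each {{dotted.path}} with the matching value from `values`."""
    def lookup(match):
        node = values
        for part in match.group(1).split("."):
            node = node[part]  # raises KeyError if the variable is missing
        return str(node)
    return PLACEHOLDER.sub(lookup, template)


instructions = (
    "Context: {{inputs.context}}\n"
    "Answer: {{output}}\n"
    "Rate how well the answer is supported by the context."
)

# Assumed rubric for a 1-5 scale, mapping each score to a specific behavior.
rubric = {
    1: "The answer contradicts the context.",
    3: "The answer is partly supported but adds unsupported claims.",
    5: "Every claim in the answer is supported by the context.",
}

rendered = render(
    instructions,
    {"inputs": {"context": "Returns are accepted within 30 days."},
     "output": "You can return items within 30 days."},
)
print(rendered)
```

Describing a concrete behavior for each score, rather than only labeling the endpoints, tends to make judge scores more consistent across runs.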

