Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.

How it works

When you configure a model-graded evaluation:
  1. You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
  2. You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
  3. Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
The LLM then scores each completion according to your rubric and provides an explanation for its decision.
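The flow above can be sketched in plain Python. This is an illustrative mock of how a judge prompt might be assembled and its reply parsed, not Freeplay's actual internals; all names and the prompt/response shapes here are assumptions.

```python
# Illustrative sketch of a model-graded evaluation loop.
# The prompt template and "SCORE: ... | EXPLANATION: ..." response format
# are hypothetical, not Freeplay's real internals.

JUDGE_PROMPT = """You are an evaluator. Question: {question}

Rubric:
{rubric}

Context:
{context}

Completion to evaluate:
{completion}

Answer with a score ({scale}) and a one-sentence explanation,
formatted as: SCORE: <score> | EXPLANATION: <text>"""


def build_judge_prompt(question, rubric, context, completion, scale="Yes/No"):
    """Assemble the structured evaluation prompt from the configured criteria."""
    return JUDGE_PROMPT.format(
        question=question, rubric=rubric,
        context=context, completion=completion, scale=scale,
    )


def parse_judgment(raw):
    """Extract the score and explanation from the judge model's reply."""
    score_part, _, explanation = raw.partition("| EXPLANATION:")
    return score_part.replace("SCORE:", "").strip(), explanation.strip()


prompt = build_judge_prompt(
    question="Is the answer grounded in the provided context?",
    rubric="Yes = every claim is supported by the context; No = otherwise.",
    context="Freeplay supports model-graded evaluations.",
    completion="Freeplay can auto-score outputs with an LLM judge.",
)
# In practice the prompt is sent to an LLM; here we parse a canned reply.
score, why = parse_judgment("SCORE: Yes | EXPLANATION: All claims are supported.")
```

The key idea is that the rubric and question travel inside the judge prompt alongside the completion, so the judge model scores against your criteria rather than its own defaults.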

Use cases

  • Production monitoring: Automatically sample and evaluate a subset of production traffic
  • Batch testing: Run evaluations across entire datasets during test runs
  • Quality gates: Identify outputs that fail specific quality thresholds
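Two of these patterns, sampling production traffic and gating on a score threshold, reduce to a few lines. The helper names and the 1-5 score field below are hypothetical, shown only to make the patterns concrete.

```python
import random


def sample_for_evaluation(completions, rate=0.1, seed=None):
    """Randomly sample a fraction of production completions for auto-evaluation."""
    rng = random.Random(seed)
    return [c for c in completions if rng.random() < rate]


def failing_quality_gate(scored, threshold=3):
    """Flag completions whose judge score falls below a threshold (1-5 scale)."""
    return [item for item in scored if item["score"] < threshold]


# Sample ~10% of traffic for evaluation (seeded here for reproducibility).
traffic = list(range(100))
sampled = sample_for_evaluation(traffic, rate=0.1, seed=42)

# Surface only the completions that failed the quality gate.
scored = [{"id": 1, "score": 5}, {"id": 2, "score": 2}, {"id": 3, "score": 3}]
failing = failing_quality_gate(scored)
```

Sampling keeps evaluation cost proportional to a fixed fraction of traffic, while the gate turns raw judge scores into an actionable review queue.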

Configuration

Model-graded evaluations are configured at the prompt template or agent level:
  1. Select the Evaluations tab from the menu and then select New Evaluation
  2. Set the target to your prompt/agent to evaluate and the type to Model-graded
  3. Create your own evaluation or select a pre-configured example
  4. Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
  5. Define a rubric that maps scores to specific behaviors
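The variable references in step 4 behave like template placeholders: at evaluation time, values from the completion being judged are substituted into your instructions. A minimal sketch of that substitution, assuming `{{...}}` placeholder syntax as shown above (the renderer itself is hypothetical):

```python
import re


def render_instructions(template, variables):
    """Substitute {{...}} placeholders (e.g. {{inputs.context}}, {{output}})
    with values from the completion being evaluated. Unknown placeholders
    are left intact."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )


instructions = (
    "Given the retrieved context:\n{{inputs.context}}\n\n"
    "Judge whether this answer is faithful:\n{{output}}"
)
rendered = render_instructions(instructions, {
    "inputs.context": "The capital of France is Paris.",
    "output": "Paris is the capital of France.",
})
```

Because the judge sees the same inputs your prompt saw, it can check the output against the actual context rather than guessing.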
Use Freeplay’s alignment tools to compare auto-evaluation scores against human labels and iteratively improve your evaluation prompts.
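The simplest alignment signal is the fraction of completions where the auto-evaluation agrees with the human label. A minimal sketch (the function name is illustrative; Freeplay's alignment tooling may compute richer statistics):

```python
def agreement_rate(auto_labels, human_labels):
    """Fraction of completions where the auto-evaluation matches the human label."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


# Example: the judge disagrees with humans on 1 of 4 completions.
rate = agreement_rate(["Yes", "No", "Yes", "Yes"],
                      ["Yes", "No", "No", "Yes"])
```

A low agreement rate usually means the rubric is ambiguous or the instructions need tightening, which is exactly what iterating on the evaluation prompt addresses.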
Best practice: Model-graded evaluations are the foundation for many other AI features. Prompt optimization and Evaluation Insights both work better when you have well-configured evaluations generating data. Start here before enabling other AI features.