Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.

How it works

When you configure a model-graded evaluation:
  1. You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
  2. You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
  3. Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
The LLM then scores each completion according to your rubric and provides an explanation for its decision.
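The flow above can be sketched in plain Python. This is an illustrative mock of how a judge prompt might be assembled and its reply parsed, not Freeplay's actual internals; all names and the prompt/response shapes here are assumptions.

```python
# Illustrative sketch of a model-graded evaluation loop.
# The prompt template and "SCORE: ... | EXPLANATION: ..." response format
# are hypothetical, not Freeplay's real internals.

JUDGE_PROMPT = """You are an evaluator. Question: {question}

Rubric:
{rubric}

Context:
{context}

Completion to evaluate:
{completion}

Answer with a score ({scale}) and a one-sentence explanation,
formatted as: SCORE: <score> | EXPLANATION: <text>"""


def build_judge_prompt(question, rubric, context, completion, scale="Yes/No"):
    """Assemble the structured evaluation prompt from the configured criteria."""
    return JUDGE_PROMPT.format(
        question=question, rubric=rubric,
        context=context, completion=completion, scale=scale,
    )


def parse_judgment(raw):
    """Extract the score and explanation from the judge model's reply."""
    score_part, _, explanation = raw.partition("| EXPLANATION:")
    return score_part.replace("SCORE:", "").strip(), explanation.strip()


prompt = build_judge_prompt(
    question="Is the answer grounded in the provided context?",
    rubric="Yes = every claim is supported by the context; No = otherwise.",
    context="Freeplay supports model-graded evaluations.",
    completion="Freeplay can auto-score outputs with an LLM judge.",
)
# In practice the prompt is sent to an LLM; here we parse a canned reply.
score, why = parse_judgment("SCORE: Yes | EXPLANATION: All claims are supported.")
```

The key idea is that the rubric and question travel inside the judge prompt alongside the completion, so the judge model scores against your criteria rather than its own defaults.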

Use cases

  • Production monitoring: Automatically sample and evaluate a subset of production traffic
  • Batch testing: Run evaluations across entire datasets during test runs
  • Quality gates: Identify outputs that fail specific quality thresholds
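Two of these patterns, sampling production traffic and gating on a score threshold, reduce to a few lines. The helper names and the 1-5 score field below are hypothetical, shown only to make the patterns concrete.

```python
import random


def sample_for_evaluation(completions, rate=0.1, seed=None):
    """Randomly sample a fraction of production completions for auto-evaluation."""
    rng = random.Random(seed)
    return [c for c in completions if rng.random() < rate]


def failing_quality_gate(scored, threshold=3):
    """Flag completions whose judge score falls below a threshold (1-5 scale)."""
    return [item for item in scored if item["score"] < threshold]


# Sample ~10% of traffic for evaluation (seeded here for reproducibility).
traffic = list(range(100))
sampled = sample_for_evaluation(traffic, rate=0.1, seed=42)

# Surface only the completions that failed the quality gate.
scored = [{"id": 1, "score": 5}, {"id": 2, "score": 2}, {"id": 3, "score": 3}]
failing = failing_quality_gate(scored)
```

Sampling keeps evaluation cost proportional to a fixed fraction of traffic, while the gate turns raw judge scores into an actionable review queue.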

Configuration

Model-graded evaluations are configured at the prompt template or agent level:
  1. Select the Evaluations tab from the menu and then select New Evaluation
  2. Set the target to your prompt/agent to evaluate and the type to Model-graded
  3. Create your own evaluation or select a pre-configured example
  4. Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
  5. Define a rubric that maps scores to specific behaviors
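The variable references in step 4 behave like template placeholders: at evaluation time, values from the completion being judged are substituted into your instructions. A minimal sketch of that substitution, assuming `{{...}}` placeholder syntax as shown above (the renderer itself is hypothetical):

```python
import re


def render_instructions(template, variables):
    """Substitute {{...}} placeholders (e.g. {{inputs.context}}, {{output}})
    with values from the completion being evaluated. Unknown placeholders
    are left intact."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )


instructions = (
    "Given the retrieved context:\n{{inputs.context}}\n\n"
    "Judge whether this answer is faithful:\n{{output}}"
)
rendered = render_instructions(instructions, {
    "inputs.context": "The capital of France is Paris.",
    "output": "Paris is the capital of France.",
})
```

Because the judge sees the same inputs your prompt saw, it can check the output against the actual context rather than guessing.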
Use Freeplay’s alignment tools to compare auto-evaluation scores against human labels and iteratively improve your evaluation prompts.
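The simplest alignment signal is the fraction of completions where the auto-evaluation agrees with the human label. A minimal sketch (the function name is illustrative; Freeplay's alignment tooling may compute richer statistics):

```python
def agreement_rate(auto_labels, human_labels):
    """Fraction of completions where the auto-evaluation matches the human label."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


# Example: the judge disagrees with humans on 1 of 4 completions.
rate = agreement_rate(["Yes", "No", "Yes", "Yes"],
                      ["Yes", "No", "No", "Yes"])
```

A low agreement rate usually means the rubric is ambiguous or the instructions need tightening, which is exactly what iterating on the evaluation prompt addresses.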
Best practice: Model-graded evaluations are the foundation for many other AI features. Prompt optimization and Evaluation Insights both work better when you have well-configured evaluations generating data. Start here before enabling other AI features.