Freeplay incorporates AI throughout the platform to help you build, test, and improve your AI applications faster. These features analyze your production data, generate insights, and automate tasks that would otherwise require significant manual effort.

Overview

Freeplay’s AI-powered features include:
  • Model-graded evaluations: Automatically score outputs against criteria you define
  • Eval Creation Assistant: Draft evaluation prompts from scratch or adapt them from templates
  • Auto-categorization: Tag and classify production logs by categories you define
  • Prompt optimization: Suggest improved prompts based on production data, evaluations, and feedback
  • Review Insights: Surface themes and root causes as your team reviews completions
  • Evaluation Insights: Analyze evaluation results for systemic patterns and issues

Model-Graded Evaluations

Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.

How It Works

When you configure a model-graded evaluation:
  1. You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
  2. You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
  3. Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
The LLM then scores each completion according to your rubric and provides an explanation for its decision.
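
To make the flow concrete, here is a minimal sketch of how a criterion, its rubric, the completion being evaluated, and relevant context might be assembled into a single judge call. The field names and prompt wording are hypothetical illustrations, not Freeplay’s actual schema.

```python
# Hypothetical sketch of a model-graded evaluation; field names and wording are
# illustrative, not Freeplay's actual schema.

eval_config = {
    "name": "answer_relevance",
    "question": "Does the answer address the user's question?",
    "scoring_type": "yes_no",  # could also be a 1-5 scale
    "rubric": (
        "Yes = the answer directly addresses the question. "
        "No = the answer is off-topic or evades the question."
    ),
}

def build_judge_prompt(config: dict, completion: str, context: str) -> str:
    """Assemble a structured judge prompt from the criterion, completion, and context."""
    return (
        "You are grading an AI response.\n"
        f"Criterion: {config['question']}\n"
        f"Rubric: {config['rubric']}\n"
        f"Context: {context}\n"
        f"Response to grade: {completion}\n"
        "Reply with a score and a one-sentence explanation."
    )

# The judge model returns something like:
# {"score": "Yes", "explanation": "The answer cites the retrieved policy and directly answers the question."}
```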

Use Cases

  • Production monitoring: Automatically sample and evaluate a subset of production traffic
  • Batch testing: Run evaluations across entire datasets during test runs
  • Quality gates: Identify outputs that fail specific quality thresholds

Configuration

Model-graded evaluations are configured at the prompt template or agent level:
  1. Navigate to your prompt template or agent
  2. Scroll to the Evaluations section
  3. Create a new evaluation criterion and enable Model-graded auto-evaluation
  4. Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
  5. Define a rubric that maps scores to specific behaviors
Use Freeplay’s alignment tools to compare auto-evaluation scores against human labels and iteratively improve your evaluation prompts.
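
As a concrete example of steps 4 and 5, the instructions and rubric for a 1-5 faithfulness criterion might look like the sketch below. The wording is illustrative; the variable names simply follow the {{inputs.context}} / {{output}} convention mentioned above, and yours will differ.

```python
# Illustrative instructions and rubric for a 1-5 "faithfulness" criterion.
# The {{inputs.context}} and {{output}} references follow the variable convention above.

FAITHFULNESS_EVAL_INSTRUCTIONS = """
Evaluate whether the response in {{output}} is supported by the retrieved
documents in {{inputs.context}}.

Rubric:
5 - Every claim in the response is directly supported by the context.
4 - Claims are supported, with only minor unsupported details.
3 - Mostly supported, but at least one substantive claim is not in the context.
2 - Several claims are unsupported or contradict the context.
1 - The response is largely unsupported or contradicts the context.

Return a score from 1 to 5 and a one-sentence explanation.
"""
```
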
Best practice: Model-graded evaluations are the foundation for many other AI features. Prompt optimization and Evaluation Insights both work better when you have well-configured evaluations generating data. Start here before enabling other AI features. Learn more about model-graded evaluations →

Eval Creation Assistant

Writing effective evaluation prompts can be challenging, especially for teams new to LLM-based quality assessment. Freeplay’s Eval Creation Assistant uses AI to help you draft better evals faster—whether you’re starting from scratch or adapting a template.

How It Works

The Eval Creation Assistant helps in two ways.

Create custom evals from scratch: Start with the basic question you want to answer about your AI’s output. The assistant will:
  1. Help you refine your evaluation question to be clear and measurable
  2. Suggest improvements to your eval structure
  3. Automatically draft a model-graded eval prompt tailored to your specific prompts and data
Adapt from templates: Choose from common evaluation templates like Answer Faithfulness (for RAG), Similarity, Toxicity, or Tone. The assistant will:
  1. Automatically customize the template to match your prompt structure
  2. Reference the correct input variables from your prompts
  3. Generate a ready-to-use eval prompt with one click
Because Freeplay knows your prompt structure and has access to real-world examples from your logs, the assistant can generate eval prompts that are specific to your context rather than generic templates.
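
As a hypothetical before/after, here is how a generic faithfulness template might be rewired to a specific prompt’s variables; the exact wording and variable names are illustrative.

```python
# Hypothetical before/after: a generic template versus the same template after the
# assistant adapts it to a specific prompt's variables. Names are illustrative.

GENERIC_TEMPLATE = (
    "Check whether the answer is faithful to the provided source material."
)

ADAPTED_TEMPLATE = (
    "Check whether {{output}} is faithful to the retrieved passages in "
    "{{inputs.context}}. Flag any claim that does not appear in those passages."
)
```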

Use Cases

  • Getting started quickly: Teams new to evals can create their first evaluations without prior experience
  • Adopting best practices: Start with industry-standard eval patterns and customize them for your needs
  • Cross-functional collaboration: Product managers, analysts, and domain experts can contribute to eval creation without writing code

Using the Assistant

  1. Navigate to your prompt template or agent
  2. Go to the Evaluations section
  3. Choose Create your own or select from the template library
  4. For custom evals: Enter your evaluation question and follow the AI’s suggestions
  5. For templates: Select a template and the AI will automatically adapt it to your prompt
  6. Test the generated eval against sample data
  7. Use the alignment flow to validate that the eval matches human judgment
Even when using templates, the AI adapts them to your specific prompt variables and data structure—so you get truly customized evals, not just generic prompts.
Best practice: If you’re new to writing evals or unsure where to start, use the Eval Creation Assistant’s “Create your own” option. Describe what you want to evaluate in plain language, and the AI will generate a custom eval prompt tailored to your specific prompts and use case.

Auto-Categorization

Auto-categorization uses AI to automatically tag and classify your production logs based on categories you define. This adds a layer of intelligence that helps you understand usage patterns and identify trends.

How It Works

  1. You define category types that align with your business needs (e.g., product areas, user intent types, issue categories)
  2. For each category, you provide a clear name and description
  3. As logs flow through Freeplay, the AI classifies them according to your categories
  4. Categories appear in the observability dashboard for filtering and analysis

Use Cases

  • Usage analysis: Understand what types of questions users ask most frequently
  • Issue identification: Track which product areas generate the most problems
  • Dataset curation: Filter logs by category to build targeted test datasets
  • Review queue creation: Focus review efforts on specific categories

Configuration

Auto-categorization is configured at the prompt template or agent level, similar to other evaluations:
  1. Navigate to your prompt template or agent
  2. Create a new evaluation with type Multi-select
  3. Enable auto-categorization and define your categories
  4. Each category needs a name (max 32 characters) and description (max 500 characters)
  5. Configure whether items can be tagged with multiple categories or just one
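
For example, a support assistant might define categories like the sketch below. The structure is purely illustrative; in Freeplay you enter names (up to 32 characters) and descriptions (up to 500 characters) through the evaluation UI.

```python
# Illustrative category definitions for a support assistant. In Freeplay these are
# entered through the evaluation UI, not as code; names max 32 chars, descriptions max 500.

CATEGORIES = [
    {"name": "Billing",        "description": "Questions about invoices, payments, or refunds."},
    {"name": "Account access", "description": "Login issues, password resets, or permission errors."},
    {"name": "Product how-to", "description": "Questions about how to use a specific feature."},
    {"name": "Bug report",     "description": "The user reports unexpected or broken behavior."},
    {"name": "Other",          "description": "Anything that does not fit the categories above."},
]

MULTI_SELECT = False  # tag each log with exactly one category; set True to allow several
```
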
Best practice: Auto-categorization works best with clear, mutually exclusive categories. If you see many items tagged as “Other” or miscategorized, refine your category descriptions. Learn more about auto-categorization →

Prompt Optimization

Prompt optimization uses AI to analyze your production data, evaluation results, and customer feedback to suggest improved prompts. It can also help update prompts when switching between models.

How It Works

  1. You select a prompt template version to optimize and choose a dataset or set of evaluated sessions
  2. You configure what data sources to use:
    • Human labels: Scores and feedback from your team’s reviews
    • Customer feedback: Direct feedback captured from end users
    • Best practices: Provider-specific prompting guides (OpenAI or Anthropic)
  3. You can optionally provide specific instructions about what to improve
  4. Freeplay’s AI analyzes the data and generates:
    • An optimized prompt template
    • An explanation of changes made
    • A description of the new version

Use Cases

  • Prompt iteration: Get AI-suggested improvements based on where your current prompt is failing
  • Model migration: Update prompts optimized for one model to work well with another
  • Data-driven improvement: Use production signals to guide prompt changes

Configuration

Prompt optimization is available from the prompt template editor:
  1. Open a prompt template and select a version
  2. Click Optimize to open the optimization panel
  3. Select your data source (dataset or evaluated sessions)
  4. Choose which signals to include (labels, feedback, best practices)
  5. Optionally add specific instructions
  6. Run the optimization
After optimization completes, Freeplay creates a new prompt version and automatically runs a comparative test so you can evaluate the results side-by-side.
Prompt optimization works best with at least 10-20 evaluated examples that include a mix of good and poor outputs.

Review Insights

Review Insights (also called Review Themes) deploys an AI agent alongside your human reviewers to perform real-time root cause analysis. As your team reviews completions and traces, the AI automatically surfaces patterns and actionable improvements.

How It Works

  1. As reviewers add notes and scores to completions in a review queue, the AI analyzes each reviewed item
  2. The AI identifies common patterns and groups related items into themes
  3. Themes include a name, description, and links to all relevant examples
  4. The AI can also suggest actions based on themes, such as creating new evaluations
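
The fields a theme carries look roughly like the sketch below; this illustrates the description above rather than an actual Freeplay data structure.

```python
# Illustrative shape of a Review Insights theme, based on the fields described above.
# Not an actual Freeplay data structure.

theme = {
    "name": "Missing citations in long answers",
    "description": "Responses over roughly 300 words often omit the source links reviewers expect.",
    "examples": [
        "<link to reviewed completion 1>",
        "<link to reviewed completion 2>",
    ],
    "suggested_action": "Create a model-graded eval that checks for source citations.",
}
```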

Theme Actions

The Review Insights agent can:
  • Create new themes: When it identifies a novel pattern
  • Add to existing themes: When new examples match existing patterns
  • Merge themes: When themes overlap significantly
  • Remove from themes: When items no longer fit
  • Prune themes: When themes become redundant or too small

Use Cases

  • Pattern discovery: Identify common failure modes across your AI outputs
  • Evaluation creation: Generate evaluation criteria based on discovered themes
  • Prompt improvement: Generate improvement plans based on theme examples

Configuration

Review Insights runs automatically when reviews are processed. To use it:
  1. Create a review queue and add completions or traces
  2. Have your team review items, adding notes and scores
  3. View generated themes in the review queue’s Insights tab
  4. Click on themes to see all related examples
  5. Use theme actions to create evaluations or generate improvement plans
Best practice: Review Insights themes are generated automatically and may occasionally be too broad or too narrow. Regularly review themes and use merge/prune actions to keep them useful. Learn more about Review Queues →

Evaluation Insights

Evaluation Insights uses AI to analyze patterns across your evaluation data and surface systemic issues that might not be apparent from individual scores.

How It Works

  1. Freeplay collects evaluation results over a time period (requiring at least 10 logs with evaluation data)
  2. The AI analyzes the logs, looking for patterns in:
    • Low-scoring outputs and their common characteristics
    • Correlation between different evaluation criteria
    • Input patterns that tend to produce poor results
  3. Insights are generated as findings with severity levels (info, warning, error, critical)
  4. Each finding includes a title, description, and links to relevant log examples
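
An individual finding looks roughly like the sketch below; again, this illustrates the fields described above rather than an actual Freeplay data structure.

```python
# Illustrative shape of an Evaluation Insights finding, based on the fields described above.
# Not an actual Freeplay data structure.

finding = {
    "severity": "warning",  # one of: info, warning, error, critical
    "title": "Multi-part questions score low on completeness",
    "description": (
        "Logs where the user asks two or more questions in one message score "
        "noticeably lower on the completeness criterion than single-question logs."
    ),
    "examples": ["<link to log 1>", "<link to log 2>"],
}
```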

Use Cases

  • Trend analysis: Understand what types of inputs consistently challenge your system
  • Root cause identification: Discover underlying issues that cause evaluation failures
  • Proactive improvement: Address systemic problems before they impact users

Viewing Insights

Evaluation Insights are available from the Observability dashboard:
  1. Navigate to the Observability section
  2. View the Insights panel to see generated findings
  3. Click on findings to see related log examples
  4. Filter insights by evaluation criteria, severity, or time period

Managing AI Feature Settings

Freeplay Keys

Freeplay Keys allow AI features to use Freeplay’s own LLM API keys, so you don’t need to configure your own credentials for these features to work. This setting is:
  • Enabled by default for cloud-hosted Freeplay accounts
  • Not available for BYOC (Bring Your Own Cloud) deployments
To manage Freeplay Keys:
  1. Navigate to Settings → AI Features
  2. Toggle Use Freeplay Keys on or off
  3. When disabled, ensure you have configured API credentials for at least one supported provider
If you disable Freeplay Keys without configuring your own LLM credentials, AI features will not function.

Using Your Own API Keys

When Freeplay Keys are disabled or unavailable, AI features use your configured provider credentials:
  1. Navigate to Settings → Models
  2. Add credentials for your preferred provider(s)
  3. Ensure at least one supported model is enabled
Supported providers for AI features:
  • Azure OpenAI: GPT-4o, GPT-4.1, o1, o3-mini
  • OpenAI: GPT-5.x series, GPT-4o, o1, o3-mini
  • AWS Bedrock: Claude models, Mistral models
  • Anthropic: Claude 4.x series, Claude 3.x series
  • Google Vertex AI: Gemini 2.5 and 3 series

Disabling Specific Features

Individual AI features can be controlled through their respective configuration:
  • Model-graded evaluations: Disable by not configuring model-graded evaluation criteria
  • Eval Creation Assistant: An on-demand feature that only runs when you create evals
  • Auto-categorization: Disable by not configuring auto-categorization criteria
  • Prompt optimization: An on-demand feature that only runs when you trigger it
  • Review Insights: Runs automatically during review; themes can be dismissed or deleted individually
  • Evaluation Insights: Disable via the Insights toggle in Settings → AI Features

How AI Features Work

All AI features in Freeplay work by calling LLM APIs to analyze your data. They are designed to work across different models and to respect the API keys and model preferences configured in your account settings.

Model Selection

When an AI feature runs, Freeplay selects an appropriate model based on your account configuration:
  1. If Freeplay Keys are enabled (default for cloud customers): Freeplay uses its own API keys with a preferred provider, typically Azure OpenAI for reliability and cost efficiency.
  2. If Freeplay Keys are disabled or you’re using Bring Your Own Cloud (BYOC): Freeplay uses your configured API keys with the following provider priority:
    • Azure OpenAI
    • OpenAI
    • AWS Bedrock
    • Anthropic
    • Google Vertex AI
Within each provider, Freeplay selects models in order of capability. For example, for OpenAI this might be gpt-5.2, then gpt-5.1, then gpt-5, and so on.
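
A rough sketch of that fallback order is below. It covers the use-your-own-keys path; the helper name and the exact per-provider model lists are illustrative assumptions, and the real selection logic lives inside Freeplay.

```python
# Rough sketch of the provider/model fallback described above (your-own-keys path).
# The helper name and per-provider model lists are illustrative assumptions.

PROVIDER_PRIORITY = ["azure_openai", "openai", "aws_bedrock", "anthropic", "google_vertex"]

MODEL_PRIORITY = {
    "azure_openai": ["gpt-4.1", "gpt-4o", "o1", "o3-mini"],  # example ordering
    "openai": ["gpt-5.2", "gpt-5.1", "gpt-5", "gpt-4o"],     # example ordering
    # ...remaining providers follow the same pattern...
}

def select_model(configured_providers: set[str]) -> tuple[str, str]:
    """Pick the first configured provider in priority order, then its most capable model."""
    for provider in PROVIDER_PRIORITY:
        if provider in configured_providers and MODEL_PRIORITY.get(provider):
            return provider, MODEL_PRIORITY[provider][0]
    raise RuntimeError("No supported provider credentials are configured for AI features")

# Example: with OpenAI and Anthropic keys configured, Azure is skipped and
# select_model({"openai", "anthropic"}) returns ("openai", "gpt-5.2").
```
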
BYOC deployments always use customer-provided API keys since Freeplay Keys are not available in privately hosted environments.

Cost Considerations

AI features consume tokens from the selected LLM provider. Costs depend on:
  • Which features you use and how frequently
  • The volume of data being analyzed
  • The models being used (more capable models typically cost more)
When Freeplay Keys are enabled, Freeplay covers the cost of AI feature usage. When using your own API keys, costs are billed directly to your provider account. Token usage for AI features is tracked separately from your application’s LLM usage and is visible in the Usage dashboard. If you’re using your own API keys, monitor this usage and consider adjusting feature sampling rates if costs are higher than expected.
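
As a back-of-the-envelope example (all numbers hypothetical), the marginal cost of running a model-graded eval on sampled production traffic can be estimated like this:

```python
# Back-of-the-envelope cost estimate for model-graded evals on sampled traffic.
# Every number here is hypothetical; substitute your own volumes and provider pricing.

monthly_logs = 100_000
sampling_rate = 0.05          # evaluate 5% of production traffic
tokens_per_eval = 1_500       # judge prompt + graded completion + explanation
price_per_1k_tokens = 0.005   # example blended USD price

evals_per_month = monthly_logs * sampling_rate
monthly_cost = evals_per_month * tokens_per_eval / 1_000 * price_per_1k_tokens
print(f"{evals_per_month:.0f} evals ≈ ${monthly_cost:.2f}/month")  # 5000 evals ≈ $37.50/month
```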