Freeplay incorporates AI throughout the platform to help you build, test, and improve your AI applications faster. These features analyze your production data, generate insights, and automate tasks that would otherwise require significant manual effort.

Overview

Freeplay’s AI-powered features include:
  • Model-graded evaluations: Automatically score outputs against criteria you define
  • Eval Creation Assistant: Draft evaluation prompts from scratch or adapt them from templates
  • Auto-categorization: Tag and classify production logs by categories you define
  • Prompt optimization: Suggest improved prompts based on production data, evaluations, and feedback
  • Review Insights: Surface themes and root causes as your team reviews completions
  • Evaluation Insights: Analyze evaluation results for systemic patterns and issues

Model-Graded Evaluations

Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.

How It Works

When you configure a model-graded evaluation:
  1. You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
  2. You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
  3. Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
The LLM then scores each completion according to your rubric and provides an explanation for its decision.
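
To make the flow concrete, here is a minimal sketch of how a criterion, its rubric, the completion being evaluated, and relevant context might be assembled into a single judge call. The field names and prompt wording are hypothetical illustrations, not Freeplay’s actual schema.

```python
# Hypothetical sketch of a model-graded evaluation; field names and wording are
# illustrative, not Freeplay's actual schema.

eval_config = {
    "name": "answer_relevance",
    "question": "Does the answer address the user's question?",
    "scoring_type": "yes_no",  # could also be a 1-5 scale
    "rubric": (
        "Yes = the answer directly addresses the question. "
        "No = the answer is off-topic or evades the question."
    ),
}

def build_judge_prompt(config: dict, completion: str, context: str) -> str:
    """Assemble a structured judge prompt from the criterion, completion, and context."""
    return (
        "You are grading an AI response.\n"
        f"Criterion: {config['question']}\n"
        f"Rubric: {config['rubric']}\n"
        f"Context: {context}\n"
        f"Response to grade: {completion}\n"
        "Reply with a score and a one-sentence explanation."
    )

# The judge model returns something like:
# {"score": "Yes", "explanation": "The answer cites the retrieved policy and directly answers the question."}
```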

Use Cases

  • Production monitoring: Automatically sample and evaluate a subset of production traffic
  • Batch testing: Run evaluations across entire datasets during test runs
  • Quality gates: Identify outputs that fail specific quality thresholds

Configuration

Model-graded evaluations are configured at the prompt template or agent level:
  1. Navigate to your prompt template or agent
  2. Scroll to the Evaluations section
  3. Create a new evaluation criterion and enable Model-graded auto-evaluation
  4. Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
  5. Define a rubric that maps scores to specific behaviors
Use Freeplay’s alignment tools to compare auto-evaluation scores against human labels and iteratively improve your evaluation prompts.
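
As a concrete example of steps 4 and 5, the instructions and rubric for a 1-5 faithfulness criterion might look like the sketch below. The wording is illustrative; the variable names simply follow the {{inputs.context}} / {{output}} convention mentioned above, and yours will differ.

```python
# Illustrative instructions and rubric for a 1-5 "faithfulness" criterion.
# The {{inputs.context}} and {{output}} references follow the variable convention above.

FAITHFULNESS_EVAL_INSTRUCTIONS = """
Evaluate whether the response in {{output}} is supported by the retrieved
documents in {{inputs.context}}.

Rubric:
5 - Every claim in the response is directly supported by the context.
4 - Claims are supported, with only minor unsupported details.
3 - Mostly supported, but at least one substantive claim is not in the context.
2 - Several claims are unsupported or contradict the context.
1 - The response is largely unsupported or contradicts the context.

Return a score from 1 to 5 and a one-sentence explanation.
"""
```
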
Best practice: Model-graded evaluations are the foundation for many other AI features. Prompt optimization and Evaluation Insights both work better when you have well-configured evaluations generating data. Start here before enabling other AI features. Learn more about model-graded evaluations →

Eval Creation Assistant

Writing effective evaluation prompts can be challenging, especially for teams new to LLM-based quality assessment. Freeplay’s Eval Creation Assistant uses AI to help you draft better evals faster—whether you’re starting from scratch or adapting a template.

How It Works

The Eval Creation Assistant helps in two ways.

Create custom evals from scratch: Start with the basic question you want to answer about your AI’s output. The assistant will:
  1. Help you refine your evaluation question to be clear and measurable
  2. Suggest improvements to your eval structure
  3. Automatically draft a model-graded eval prompt tailored to your specific prompts and data
Adapt from templates: Choose from common evaluation templates like Answer Faithfulness (for RAG), Similarity, Toxicity, or Tone. The assistant will:
  1. Automatically customize the template to match your prompt structure
  2. Reference the correct input variables from your prompts
  3. Generate a ready-to-use eval prompt with one click
Because Freeplay knows your prompt structure and has access to real-world examples from your logs, the assistant can generate eval prompts that are specific to your context rather than generic templates.
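
As a hypothetical before/after, here is how a generic faithfulness template might be rewired to a specific prompt’s variables; the exact wording and variable names are illustrative.

```python
# Hypothetical before/after: a generic template versus the same template after the
# assistant adapts it to a specific prompt's variables. Names are illustrative.

GENERIC_TEMPLATE = (
    "Check whether the answer is faithful to the provided source material."
)

ADAPTED_TEMPLATE = (
    "Check whether {{output}} is faithful to the retrieved passages in "
    "{{inputs.context}}. Flag any claim that does not appear in those passages."
)
```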

Use Cases

  • Getting started quickly: Teams new to evals can create their first evaluations without prior experience
  • Adopting best practices: Start with industry-standard eval patterns and customize them for your needs
  • Cross-functional collaboration: Product managers, analysts, and domain experts can contribute to eval creation without writing code

Using the Assistant

  1. Navigate to your prompt template or agent
  2. Go to the Evaluations section
  3. Choose Create your own or select from the template library
  4. For custom evals: Enter your evaluation question and follow the AI’s suggestions
  5. For templates: Select a template and the AI will automatically adapt it to your prompt
  6. Test the generated eval against sample data
  7. Use the alignment flow to validate that the eval matches human judgment
Even when using templates, the AI adapts them to your specific prompt variables and data structure—so you get truly customized evals, not just generic prompts.
Best practice: If you’re new to writing evals or unsure where to start, use the Eval Creation Assistant’s “Create your own” option. Describe what you want to evaluate in plain language, and the AI will generate a custom eval prompt tailored to your specific prompts and use case.

Auto-Categorization

Auto-categorization uses AI to automatically tag and classify your production logs based on categories you define. This adds a layer of intelligence that helps you understand usage patterns and identify trends.

How It Works

  1. You define category types that align with your business needs (e.g., product areas, user intent types, issue categories)
  2. For each category, you provide a clear name and description
  3. As logs flow through Freeplay, the AI classifies them according to your categories
  4. Categories appear in the observability dashboard for filtering and analysis

Use Cases

  • Usage analysis: Understand what types of questions users ask most frequently
  • Issue identification: Track which product areas generate the most problems
  • Dataset curation: Filter logs by category to build targeted test datasets
  • Review queue creation: Focus review efforts on specific categories

Configuration

Auto-categorization is configured at the prompt template or agent level, similar to other evaluations:
  1. Navigate to your prompt template or agent
  2. Create a new evaluation with type Multi-select
  3. Enable auto-categorization and define your categories
  4. Each category needs a name (max 32 characters) and description (max 500 characters)
  5. Configure whether items can be tagged with multiple categories or just one
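
For example, a support assistant might define categories like the sketch below. The structure is purely illustrative; in Freeplay you enter names (up to 32 characters) and descriptions (up to 500 characters) through the evaluation UI.

```python
# Illustrative category definitions for a support assistant. In Freeplay these are
# entered through the evaluation UI, not as code; names max 32 chars, descriptions max 500.

CATEGORIES = [
    {"name": "Billing",        "description": "Questions about invoices, payments, or refunds."},
    {"name": "Account access", "description": "Login issues, password resets, or permission errors."},
    {"name": "Product how-to", "description": "Questions about how to use a specific feature."},
    {"name": "Bug report",     "description": "The user reports unexpected or broken behavior."},
    {"name": "Other",          "description": "Anything that does not fit the categories above."},
]

MULTI_SELECT = False  # tag each log with exactly one category; set True to allow several
```
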
Best practice: Auto-categorization works best with clear, mutually exclusive categories. If you see many items tagged as “Other” or miscategorized, refine your category descriptions. Learn more about auto-categorization →

Prompt Optimization

Prompt optimization uses AI to analyze your production data, evaluation results, and customer feedback to suggest improved prompts. It can also help update prompts when switching between models.

How It Works

  1. You select a prompt template version to optimize and choose a dataset or set of evaluated sessions
  2. You configure what data sources to use:
    • Human labels: Scores and feedback from your team’s reviews
    • Customer feedback: Direct feedback captured from end users
    • Best practices: Provider-specific prompting guides (OpenAI or Anthropic)
  3. You can optionally provide specific instructions about what to improve
  4. Freeplay’s AI analyzes the data and generates:
    • An optimized prompt template
    • An explanation of changes made
    • A description of the new version

Use Cases

  • Prompt iteration: Get AI-suggested improvements based on where your current prompt is failing
  • Model migration: Update prompts optimized for one model to work well with another
  • Data-driven improvement: Use production signals to guide prompt changes

Configuration

Prompt optimization is available from the prompt template editor:
  1. Open a prompt template and select a version
  2. Click Optimize to open the optimization panel
  3. Select your data source (dataset or evaluated sessions)
  4. Choose which signals to include (labels, feedback, best practices)
  5. Optionally add specific instructions
  6. Run the optimization
After optimization completes, Freeplay creates a new prompt version and automatically runs a comparative test so you can evaluate the results side-by-side.
Prompt optimization works best with at least 10-20 evaluated examples that include a mix of good and poor outputs.

Review Insights

Review Insights (also called Review Themes) deploys an AI agent alongside your human reviewers to perform real-time root cause analysis. As your team reviews completions and traces, the AI automatically surfaces patterns and actionable improvements.

How It Works

  1. As reviewers add notes and scores to completions in a review queue, the AI analyzes each reviewed item
  2. The AI identifies common patterns and groups related items into themes
  3. Themes include a name, description, and links to all relevant examples
  4. The AI can also suggest actions based on themes, such as creating new evaluations
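
The fields a theme carries look roughly like the sketch below; this illustrates the description above rather than an actual Freeplay data structure.

```python
# Illustrative shape of a Review Insights theme, based on the fields described above.
# Not an actual Freeplay data structure.

theme = {
    "name": "Missing citations in long answers",
    "description": "Responses over roughly 300 words often omit the source links reviewers expect.",
    "examples": [
        "<link to reviewed completion 1>",
        "<link to reviewed completion 2>",
    ],
    "suggested_action": "Create a model-graded eval that checks for source citations.",
}
```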

Theme Actions

The Review Insights agent can:
  • Create new themes: When it identifies a novel pattern
  • Add to existing themes: When new examples match existing patterns
  • Merge themes: When themes overlap significantly
  • Remove from themes: When items no longer fit
  • Prune themes: When themes become redundant or too small

Use Cases

  • Pattern discovery: Identify common failure modes across your AI outputs
  • Evaluation creation: Generate evaluation criteria based on discovered themes
  • Prompt improvement: Generate improvement plans based on theme examples

Configuration

Review Insights runs automatically when reviews are processed. To use it:
  1. Create a review queue and add completions or traces
  2. Have your team review items, adding notes and scores
  3. View generated themes in the review queue’s Insights tab
  4. Click on themes to see all related examples
  5. Use theme actions to create evaluations or generate improvement plans
Best practice: Review Insights themes are generated automatically and may occasionally be too broad or too narrow. Regularly review themes and use merge/prune actions to keep them useful. Learn more about Review Queues →

Evaluation Insights

Evaluation Insights uses AI to analyze patterns across your evaluation data and surface systemic issues that might not be apparent from individual scores.

How It Works

  1. Freeplay collects evaluation results over a time period (requiring at least 10 logs with evaluation data)
  2. The AI analyzes the logs, looking for patterns in:
    • Low-scoring outputs and their common characteristics
    • Correlation between different evaluation criteria
    • Input patterns that tend to produce poor results
  3. Insights are generated as findings with severity levels (info, warning, error, critical)
  4. Each finding includes a title, description, and links to relevant log examples
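
An individual finding looks roughly like the sketch below; again, this illustrates the fields described above rather than an actual Freeplay data structure.

```python
# Illustrative shape of an Evaluation Insights finding, based on the fields described above.
# Not an actual Freeplay data structure.

finding = {
    "severity": "warning",  # one of: info, warning, error, critical
    "title": "Multi-part questions score low on completeness",
    "description": (
        "Logs where the user asks two or more questions in one message score "
        "noticeably lower on the completeness criterion than single-question logs."
    ),
    "examples": ["<link to log 1>", "<link to log 2>"],
}
```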

Use Cases

  • Trend analysis: Understand what types of inputs consistently challenge your system
  • Root cause identification: Discover underlying issues that cause evaluation failures
  • Proactive improvement: Address systemic problems before they impact users

Viewing Insights

Evaluation Insights are available from the Observability dashboard:
  1. Navigate to the Observability section
  2. View the Insights panel to see generated findings
  3. Click on findings to see related log examples
  4. Filter insights by evaluation criteria, severity, or time period

Managing AI Feature Settings

Freeplay Keys

Freeplay Keys allow AI features to use Freeplay’s own LLM API keys, so you don’t need to configure your own credentials for these features to work. This setting is:
  • Enabled by default for cloud-hosted Freeplay accounts
  • Not available for BYOC (Bring Your Own Cloud) deployments
To manage Freeplay Keys:
  1. Navigate to Settings → AI Features
  2. Toggle Use Freeplay Keys on or off
  3. When disabled, ensure you have configured API credentials for at least one supported provider
If you disable Freeplay Keys without configuring your own LLM credentials, AI features will not function.

Using Your Own API Keys

When Freeplay Keys are disabled or unavailable, AI features use your configured provider credentials:
  1. Navigate to Settings → Models
  2. Add credentials for your preferred provider(s)
  3. Ensure at least one supported model is enabled
Supported providers for AI features:
  • Azure OpenAI: GPT-4o, GPT-4.1, o1, o3-mini
  • OpenAI: GPT-5.x series, GPT-4o, o1, o3-mini
  • AWS Bedrock: Claude models, Mistral models
  • Anthropic: Claude 4.x series, Claude 3.x series
  • Google Vertex AI: Gemini 2.5 and 3 series

Disabling Specific Features

Individual AI features can be controlled through their respective configuration:
  • Model-graded evaluations: Disable by not configuring model-graded evaluation criteria
  • Eval Creation Assistant: An on-demand feature that only runs when you create evals
  • Auto-categorization: Disable by not configuring auto-categorization criteria
  • Prompt optimization: An on-demand feature that only runs when you trigger it
  • Review Insights: Runs automatically during review; themes can be dismissed or deleted individually
  • Evaluation Insights: Disable via the Insights toggle in Settings → AI Features

How AI Features Work

All AI features in Freeplay work by calling LLM APIs to analyze your data. They are designed to work across different models and to respect the API keys and model preferences configured in your account settings.

Model Selection

When an AI feature runs, Freeplay selects an appropriate model based on your account configuration:
  1. If Freeplay Keys are enabled (default for cloud customers): Freeplay uses its own API keys with a preferred provider, typically Azure OpenAI for reliability and cost efficiency.
  2. If Freeplay Keys are disabled or you’re using Bring Your Own Cloud (BYOC): Freeplay uses your configured API keys with the following provider priority:
    • Azure OpenAI
    • OpenAI
    • AWS Bedrock
    • Anthropic
    • Google Vertex AI
Within each provider, Freeplay selects models in order of capability. For example, for OpenAI this might be gpt-5.2, then gpt-5.1, then gpt-5, and so on.
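
A rough sketch of that fallback order is below. It covers the use-your-own-keys path; the helper name and the exact per-provider model lists are illustrative assumptions, and the real selection logic lives inside Freeplay.

```python
# Rough sketch of the provider/model fallback described above (your-own-keys path).
# The helper name and per-provider model lists are illustrative assumptions.

PROVIDER_PRIORITY = ["azure_openai", "openai", "aws_bedrock", "anthropic", "google_vertex"]

MODEL_PRIORITY = {
    "azure_openai": ["gpt-4.1", "gpt-4o", "o1", "o3-mini"],  # example ordering
    "openai": ["gpt-5.2", "gpt-5.1", "gpt-5", "gpt-4o"],     # example ordering
    # ...remaining providers follow the same pattern...
}

def select_model(configured_providers: set[str]) -> tuple[str, str]:
    """Pick the first configured provider in priority order, then its most capable model."""
    for provider in PROVIDER_PRIORITY:
        if provider in configured_providers and MODEL_PRIORITY.get(provider):
            return provider, MODEL_PRIORITY[provider][0]
    raise RuntimeError("No supported provider credentials are configured for AI features")

# Example: with OpenAI and Anthropic keys configured, Azure is skipped and
# select_model({"openai", "anthropic"}) returns ("openai", "gpt-5.2").
```
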
BYOC deployments always use customer-provided API keys since Freeplay Keys are not available in privately hosted environments.

Cost Considerations

AI features consume tokens from the selected LLM provider. Costs depend on:
  • Which features you use and how frequently
  • The volume of data being analyzed
  • The models being used (more capable models typically cost more)
When Freeplay Keys are enabled, Freeplay covers the cost of AI feature usage. When using your own API keys, costs are billed directly to your provider account. Token usage for AI features is tracked separately from your application’s LLM usage and is visible in the Usage dashboard. If you’re using your own API keys, monitor this usage and consider adjusting feature sampling rates if costs are higher than expected.
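
As a back-of-the-envelope example (all numbers hypothetical), the marginal cost of running a model-graded eval on sampled production traffic can be estimated like this:

```python
# Back-of-the-envelope cost estimate for model-graded evals on sampled traffic.
# Every number here is hypothetical; substitute your own volumes and provider pricing.

monthly_logs = 100_000
sampling_rate = 0.05          # evaluate 5% of production traffic
tokens_per_eval = 1_500       # judge prompt + graded completion + explanation
price_per_1k_tokens = 0.005   # example blended USD price

evals_per_month = monthly_logs * sampling_rate
monthly_cost = evals_per_month * tokens_per_eval / 1_000 * price_per_1k_tokens
print(f"{evals_per_month:.0f} evals ≈ ${monthly_cost:.2f}/month")  # 5000 evals ≈ $37.50/month
```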