Overview
Freeplay incorporates AI-powered features throughout the platform to help you build, test, and improve your AI applications faster:
- Model-graded evaluations: Score individual completions and traces using LLMs to evaluate your AI outputs at scale
- Auto-categorization: Classify logs to reveal usage patterns and understand how users interact with your AI
- Eval Creation Assistant: Create better evaluation criteria with AI-powered suggestions and prompt drafts for LLM judges
- Prompt optimization: Get AI-generated suggestions for improved prompts based on your production data
- Review Insights: Identify patterns across human reviews to surface systematic issues and root causes
- Evaluation Insights: Analyze evaluation data to find systemic issues and improvement opportunities
Model-Graded Evaluations
Model-graded evaluations (also called LLM judges or auto-evaluations) use AI to automatically score your AI outputs based on criteria you define. This is the foundation of automated quality assessment in Freeplay.
How It Works
When you configure a model-graded evaluation:
- You define the evaluation criteria with a name, question, and scoring type (Yes/No, 1-5 scale, etc.)
- You write instructions explaining what the LLM should evaluate and provide a rubric with scoring guidelines
- Freeplay generates a structured prompt that includes your criteria, the completion being evaluated, and relevant context
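To make the structure concrete, here is a minimal sketch of how a model-graded evaluation can be assembled and run. It calls the OpenAI Python SDK directly with a hypothetical criterion; it illustrates the pattern only and is not Freeplay's internal implementation, which builds and runs the structured prompt for you.

```python
# Minimal sketch of an LLM judge, for illustration only.
from openai import OpenAI

client = OpenAI()

# Hypothetical criterion, mirroring the fields described above.
criterion = {
    "name": "Answer Faithfulness",
    "question": "Is the answer fully supported by the provided context?",
    "scoring": "Yes/No",
    "rubric": "Yes = every claim is grounded in the context. No = any claim is unsupported or contradicted.",
}

def judge(context: str, output: str) -> str:
    """Score one completion against the criterion and return the raw verdict."""
    prompt = (
        "You are evaluating an AI response.\n"
        f"Criterion: {criterion['name']}\n"
        f"Question: {criterion['question']}\n"
        f"Rubric: {criterion['rubric']}\n\n"
        f"Context:\n{context}\n\nResponse:\n{output}\n\n"
        f"Answer with {criterion['scoring']} and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```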
Use Cases
- Production monitoring: Automatically sample and evaluate a subset of production traffic
- Batch testing: Run evaluations across entire datasets during test runs
- Quality gates: Identify outputs that fail specific quality thresholds
Configuration
Model-graded evaluations are configured at the prompt template or agent level:
- Navigate to your prompt template or agent
- Scroll to the Evaluations section
- Create a new evaluation criteria and enable Model-graded auto-evaluation
- Write instructions that reference your prompt variables (e.g., {{inputs.context}}, {{output}})
- Define a rubric that maps scores to specific behaviors
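For example, an eval for a retrieval-augmented answering prompt might use instructions like the following. The wording and variable names here are illustrative assumptions; reference whatever inputs your own template defines.

```
Question: Does the response answer the user's question using only the provided context?

Evaluate the response in {{output}} against the retrieved documents in {{inputs.context}}.

Rubric:
- Yes: Every factual claim in the response is supported by {{inputs.context}}.
- No: The response contains claims not found in, or contradicted by, {{inputs.context}}.
```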
Eval Creation Assistant
Writing effective evaluation prompts can be challenging, especially for teams new to LLM-based quality assessment. Freeplay's Eval Creation Assistant uses AI to help you draft better evals faster, whether you're starting from scratch or adapting a template.
How It Works
The Eval Creation Assistant helps in two ways:
Create custom evals from scratch: Start with the basic question you want to answer about your AI's output. The assistant will:
- Help you refine your evaluation question to be clear and measurable
- Suggest improvements to your eval structure
- Automatically draft a model-graded eval prompt tailored to your specific prompts and data
Adapt a template: Choose an eval from the template library. The assistant will:
- Automatically customize the template to match your prompt structure
- Reference the correct input variables from your prompts
- Generate a ready-to-use eval prompt with one click
Use Cases
- Getting started quickly: Teams new to evals can create their first evaluations without prior experience
- Adopting best practices: Start with industry-standard eval patterns and customize them for your needs
- Cross-functional collaboration: Product managers, analysts, and domain experts can contribute to eval creation without writing code
Using the Assistant
- Navigate to your prompt template or agent
- Go to the Evaluations section
- Choose Create your own or select from the template library
- For custom evals: Enter your evaluation question and follow the AI’s suggestions
- For templates: Select a template and the AI will automatically adapt it to your prompt
- Test the generated eval against sample data
- Use the alignment flow to validate that the eval matches human judgment
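The idea behind alignment is simple to illustrate: compare the LLM judge's verdicts against human labels on the same examples and check the agreement rate. A minimal sketch of that check, with toy data and no particular threshold implied:

```python
# Minimal sketch of judge/human alignment, for illustration only.
# Each pair holds the LLM judge's verdict and the human label for one example.
judged = [
    ("Yes", "Yes"),
    ("No", "Yes"),
    ("Yes", "Yes"),
    ("No", "No"),
]

agreement = sum(1 for llm, human in judged if llm == human) / len(judged)
print(f"Judge/human agreement: {agreement:.0%}")  # 75% in this toy example

# If agreement is low, revise the eval question or rubric and re-check
# before trusting the eval at scale.
```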
Auto-Categorization
Auto-categorization uses AI to automatically tag and classify your production logs based on categories you define. This adds a layer of intelligence that helps you understand usage patterns and identify trends.
How It Works
- You define category types that align with your business needs (e.g., product areas, user intent types, issue categories)
- For each category, you provide a clear name and description
- As logs flow through Freeplay, the AI classifies them according to your categories
- Categories appear in the observability dashboard for filtering and analysis
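As an illustration of the classification step, here is a minimal sketch of how a log could be tagged against user-defined categories with an LLM. The category names and the direct OpenAI call are assumptions for the example; Freeplay performs this classification automatically as logs flow through the platform.

```python
# Minimal sketch of LLM-based log categorization, for illustration only.
from openai import OpenAI

client = OpenAI()

# Hypothetical categories: each has a short name and a clear description.
categories = {
    "billing": "Questions about invoices, charges, refunds, or payment methods",
    "technical_support": "Bug reports, errors, or trouble using a product feature",
    "general_inquiry": "Anything that does not fit another category",
}

def categorize(log_text: str) -> str:
    """Ask the model to pick the single best-matching category name."""
    category_list = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Classify the following log into exactly one category.\n"
                f"Categories:\n{category_list}\n\n"
                f"Log:\n{log_text}\n\n"
                "Reply with the category name only."
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```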
Use Cases
- Usage analysis: Understand what types of questions users ask most frequently
- Issue identification: Track which product areas generate the most problems
- Dataset curation: Filter logs by category to build targeted test datasets
- Review queue creation: Focus review efforts on specific categories
Configuration
Auto-categorization is configured at the prompt template or agent level, similar to other evaluations:
- Navigate to your prompt template or agent
- Create a new evaluation with type Multi-select
- Create a new evaluation with type Multi-select
- Enable auto-categorization and define your categories
- Each category needs a name (max 32 characters) and description (max 500 characters)
- Configure whether items can be tagged with multiple categories or just one
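If you are drafting many categories, a quick pre-flight check against the limits above can save round trips in the UI. The helper below is just a convenience sketch, not part of Freeplay:

```python
# Sketch of a pre-flight check for category definitions, for illustration only.
MAX_NAME_LEN = 32          # per the name limit noted above
MAX_DESCRIPTION_LEN = 500  # per the description limit noted above

categories = {
    "billing": "Questions about invoices, charges, refunds, or payment methods",
    "technical_support": "Bug reports, errors, or trouble using a product feature",
}

for name, description in categories.items():
    assert len(name) <= MAX_NAME_LEN, f"Category name too long: {name}"
    assert len(description) <= MAX_DESCRIPTION_LEN, f"Description too long for: {name}"
```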
Prompt Optimization
Prompt optimization uses AI to analyze your production data, evaluation results, and customer feedback to suggest improved prompts. It can also help update prompts when switching between models.
How It Works
- You select a prompt template version to optimize and choose a dataset or set of evaluated sessions
- You configure what data sources to use:
- Human labels: Scores and feedback from your team’s reviews
- Customer feedback: Direct feedback captured from end users
- Best practices: Provider-specific prompting guides (OpenAI or Anthropic)
- You can optionally provide specific instructions about what to improve
- Freeplay’s AI analyzes the data and generates:
- An optimized prompt template
- An explanation of changes made
- A description of the new version
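Conceptually, the optimizer combines your current template, the selected examples, and the chosen signals into a single meta-prompt and asks an LLM for a revised template plus an explanation. A rough sketch of that idea follows; the field names and request structure are assumptions, not Freeplay's actual format.

```python
# Rough sketch of the prompt-optimization idea, for illustration only.
from openai import OpenAI

client = OpenAI()

def optimize_prompt(current_template: str, examples: list[dict], instructions: str = "") -> str:
    """Ask an LLM for an improved template given evaluated examples."""
    # Each example is assumed to carry the inputs, the output, and any
    # human labels or customer feedback you chose to include.
    example_text = "\n\n".join(
        f"Inputs: {ex['inputs']}\nOutput: {ex['output']}\n"
        f"Labels: {ex.get('labels', 'n/a')}\nFeedback: {ex.get('feedback', 'n/a')}"
        for ex in examples
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Improve the prompt template below based on how it performed.\n\n"
                f"Current template:\n{current_template}\n\n"
                f"Evaluated examples:\n{example_text}\n\n"
                f"Additional instructions: {instructions or 'none'}\n\n"
                "Return the revised template followed by a short explanation of the changes."
            ),
        }],
    )
    return response.choices[0].message.content
```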
Use Cases
- Prompt iteration: Get AI-suggested improvements based on where your current prompt is failing
- Model migration: Update prompts optimized for one model to work well with another
- Data-driven improvement: Use production signals to guide prompt changes
Configuration
Prompt optimization is available from the prompt template editor:
- Open a prompt template and select a version
- Click Optimize to open the optimization panel
- Select your data source (dataset or evaluated sessions)
- Choose which signals to include (labels, feedback, best practices)
- Optionally add specific instructions
- Run the optimization
Prompt optimization works best with at least 10-20 evaluated examples that include a mix of good and poor outputs.
Review Insights
Review Insights (also called Review Themes) deploys an AI agent alongside your human reviewers to perform real-time root cause analysis. As your team reviews completions and traces, the AI automatically surfaces patterns and actionable improvements.
How It Works
- As reviewers add notes and scores to completions in a review queue, the AI analyzes each reviewed item
- The AI identifies common patterns and groups related items into themes
- Themes include a name, description, and links to all relevant examples
- The AI can also suggest actions based on themes, such as creating new evaluations
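A theme can be pictured as a small record like the one below. The exact fields are an assumption for illustration; they simply mirror the name, description, and example links described above, not Freeplay's internal schema.

```python
# Illustrative data shape for a review theme; not Freeplay's internal schema.
from dataclasses import dataclass, field

@dataclass
class ReviewTheme:
    name: str                 # short label for the pattern
    description: str          # what the grouped reviews have in common
    example_ids: list[str] = field(default_factory=list)  # links to reviewed items

theme = ReviewTheme(
    name="Missing citations",
    description="Responses state facts from retrieved documents without citing them.",
    example_ids=["completion_123", "completion_456"],
)
```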
Theme Actions
The Review Insights agent can:
- Create new themes: When it identifies a novel pattern
- Add to existing themes: When new examples match existing patterns
- Merge themes: When themes overlap significantly
- Remove from themes: When items no longer fit
- Prune themes: When themes become redundant or too small
Use Cases
- Pattern discovery: Identify common failure modes across your AI outputs
- Evaluation creation: Generate evaluation criteria based on discovered themes
- Prompt improvement: Generate improvement plans based on theme examples
Configuration
Review Insights runs automatically when reviews are processed. To use it:
- Create a review queue and add completions or traces
- Have your team review items, adding notes and scores
- View generated themes in the review queue’s Insights tab
- Click on themes to see all related examples
- Use theme actions to create evaluations or generate improvement plans
Evaluation Insights
Evaluation Insights uses AI to analyze patterns across your evaluation data and surface systemic issues that might not be apparent from individual scores.
How It Works
- Freeplay collects evaluation results over a time period (requiring at least 10 logs with evaluation data)
- The AI analyzes the logs, looking for patterns in:
- Low-scoring outputs and their common characteristics
- Correlation between different evaluation criteria
- Input patterns that tend to produce poor results
- Insights are generated as findings with severity levels (info, warning, error, critical)
- Each finding includes a title, description, and links to relevant log examples
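As with review themes, a finding can be thought of as a small structured record. The shape below is an assumption for illustration, based only on the fields listed above.

```python
# Illustrative data shape for an evaluation insight; not Freeplay's internal schema.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"

@dataclass
class Finding:
    title: str                  # short summary of the systemic issue
    description: str            # what the pattern is and why it matters
    severity: Severity
    example_log_ids: list[str]  # links to relevant logs

finding = Finding(
    title="Long inputs score poorly on completeness",
    description="Completions for very long inputs frequently miss later requirements.",
    severity=Severity.WARNING,
    example_log_ids=["log_789", "log_790"],
)
```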
Use Cases
- Trend analysis: Understand what types of inputs consistently challenge your system
- Root cause identification: Discover underlying issues that cause evaluation failures
- Proactive improvement: Address systemic problems before they impact users
Viewing Insights
Evaluation Insights are available from the Observability dashboard:
- Navigate to the Observability section
- View the Insights panel to see generated findings
- Click on findings to see related log examples
- Filter insights by evaluation criteria, severity, or time period
Managing AI Feature Settings
Freeplay Keys
Freeplay Keys allow AI features to use Freeplay's own LLM API keys, so you don't need to configure your own credentials for these features to work. This setting is:
- Enabled by default for cloud-hosted Freeplay accounts
- Not available for BYOC (Bring Your Own Cloud) deployments
To manage the setting:
- Navigate to Settings → AI Features
- Toggle Use Freeplay Keys on or off
- When disabled, ensure you have configured API credentials for at least one supported provider
Using Your Own API Keys
When Freeplay Keys are disabled or unavailable, AI features use your configured provider credentials:
- Navigate to Settings → Models
- Add credentials for your preferred provider(s)
- Ensure at least one supported model is enabled. Supported models include:
- Azure OpenAI: GPT-4o, GPT-4.1, o1, o3-mini
- OpenAI: GPT-5.x series, GPT-4o, o1, o3-mini
- AWS Bedrock: Claude models, Mistral models
- Anthropic: Claude 4.x series, Claude 3.x series
- Google Vertex AI: Gemini 2.5 and 3 series
Disabling Specific Features
Individual AI features can be controlled through their respective configuration:
- Model-graded evaluations: Disable by not configuring model-graded evaluation criteria
- Eval Creation Assistant: An on-demand feature that only runs when you create evals
- Auto-categorization: Disable by not configuring auto-categorization criteria
- Prompt optimization: An on-demand feature that only runs when you trigger it
- Review Insights: Runs automatically during review; themes can be dismissed or deleted individually
- Evaluation Insights: Disable via the Insights toggle in Settings → AI Features
How AI Features Work
All AI features in Freeplay work by calling LLM APIs to analyze your data. They are designed to work with different models and to use your API keys and model preferences, based on your account settings.
Model Selection
When an AI feature runs, Freeplay selects an appropriate model based on your account configuration:
- If Freeplay Keys are enabled (default for cloud customers): Freeplay uses its own API keys with a preferred provider, typically Azure OpenAI for reliability and cost efficiency.
- If Freeplay Keys are disabled or you're using Bring Your Own Cloud (BYOC): Freeplay uses your configured API keys with the following provider priority:
- Azure OpenAI
- OpenAI
- AWS Bedrock
- Anthropic
- Google Vertex AI
Within a provider, the newest available model is preferred: gpt-5.2, then gpt-5.1, then gpt-5, and so on.
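A rough sketch of this selection logic is shown below. The function, the enabled-model lists, and the exact ordering are assumptions for illustration; the sketch only reflects the provider priority and newest-model preference described above.

```python
# Rough sketch of provider/model selection for AI features, for illustration only.
# Provider priority as described above, checked in order.
PROVIDER_PRIORITY = ["azure_openai", "openai", "aws_bedrock", "anthropic", "google_vertex"]

# Hypothetical example: enabled models per provider, newest first.
ENABLED_MODELS = {
    "openai": ["gpt-5.2", "gpt-5.1", "gpt-5", "gpt-4o"],
    "anthropic": ["claude-sonnet-4-5", "claude-3-7-sonnet"],
}

def select_model(configured_providers: set[str]) -> tuple[str, str]:
    """Return (provider, model) using provider priority, preferring the newest model."""
    for provider in PROVIDER_PRIORITY:
        if provider in configured_providers and ENABLED_MODELS.get(provider):
            return provider, ENABLED_MODELS[provider][0]
    raise RuntimeError("No supported provider credentials configured")

# Example: with OpenAI and Anthropic credentials configured, OpenAI wins by priority.
print(select_model({"openai", "anthropic"}))  # ('openai', 'gpt-5.2')
```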
BYOC deployments always use customer-provided API keys since Freeplay Keys are not available in privately hosted environments.
Cost Considerations
AI features consume tokens from the selected LLM provider. Costs depend on:
- Which features you use and how frequently
- The volume of data being analyzed
- The models being used (more capable models typically cost more)
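For rough budgeting, a back-of-the-envelope estimate is usually enough. The volumes and per-token prices below are placeholder assumptions; substitute your own traffic numbers and your provider's current pricing.

```python
# Back-of-the-envelope cost estimate for model-graded evaluations.
# All numbers below are placeholder assumptions, not real pricing.
evals_per_day = 1_000          # sampled production completions evaluated daily
input_tokens_per_eval = 1_500  # eval prompt + completion + context
output_tokens_per_eval = 150   # judge verdict and justification

price_per_1m_input = 2.50      # USD per 1M input tokens (placeholder)
price_per_1m_output = 10.00    # USD per 1M output tokens (placeholder)

daily_cost = (
    evals_per_day * input_tokens_per_eval / 1_000_000 * price_per_1m_input
    + evals_per_day * output_tokens_per_eval / 1_000_000 * price_per_1m_output
)
print(f"Estimated daily cost: ${daily_cost:.2f}")  # ≈ $5.25 with these assumptions
```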
Related Pages
- Model-graded Evaluations - Configure LLM judges for automated scoring
- Auto-categorization - Set up automatic content classification
- Review Queues - Organize human review workflows and surface insights
- Model Management - Configure LLM providers and API keys

