Project
A top-level workspace in Freeplay that contains all your prompt templates, sessions, datasets, evaluations, and configurations. Projects are identified by a unique project_id and represent a distinct AI application or use case. All other entities in Freeplay (sessions, prompt templates, datasets, etc.) belong to a project.
See Project Setup for configuration details.
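For orientation only, here is a minimal sketch of how that scoping tends to look in code; the import path, client class, and constructor arguments below are assumptions, not the documented SDK surface:

```python
# Illustrative sketch: the import path and constructor arguments are assumed
# names, not a verified SDK signature; see the SDK docs for the real interface.
from freeplay import Freeplay

client = Freeplay(
    freeplay_api_key="YOUR_API_KEY",                    # placeholder credential
    api_base="https://YOUR_SUBDOMAIN.freeplay.ai/api",  # placeholder base URL
)

# All other entities (sessions, prompt templates, datasets, evaluations)
# are addressed relative to this identifier.
PROJECT_ID = "00000000-0000-0000-0000-000000000000"
```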
Observability
Observability in Freeplay refers to capturing and analyzing the behavior of your AI application through logged data.
Observability hierarchy
Freeplay uses a three-level hierarchy to organize your AI application logs. From highest to lowest level:
Session
The container for a complete user interaction, conversation, or agent run. Sessions group related traces and completions together. Examples include an entire chatbot conversation, a complete agent workflow, or a single user request that triggers multiple LLM calls. See Sessions, Traces, and Completions for more details.
Trace
An optional grouping of related completions and tool calls within a session. Traces represent a functional unit of work, such as a single turn in a conversation, one run of an agent, or a logical step in a multi-step workflow. Traces can be nested to represent sub-agents or complex workflows. Traces can optionally be given a name (like “planning” or “tool_selection”) to represent a specific agent or workflow type. Named traces unlock additional Freeplay features: you can configure evaluation criteria to run against them, create linked datasets for testing, and group similar traces for analysis. See Agent below. See Traces and Record Traces for implementation details.
Completion
The atomic unit of observability in Freeplay. A completion represents a single LLM call, including the input prompt (messages) and the model’s response. Every completion is associated with a session and optionally a trace. See Recording Completions for implementation details.
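A minimal sketch of how the three levels nest when logging, assuming a client like the one sketched under Project above; the sessions.create, create_trace, and record_completion names are illustrative assumptions rather than the exact SDK methods (see Record Traces and Recording Completions for the real interfaces):

```python
# Hypothetical method names throughout; the point is the nesting, not the API.
session = client.sessions.create(project_id=PROJECT_ID)  # 1. Session: the whole conversation or agent run

trace = session.create_trace(name="research_agent")      # 2. Trace: one unit of work; the name groups it as an "agent"

completion = trace.record_completion(                    # 3. Completion: a single LLM call with its prompt and response
    messages=[{"role": "user", "content": "Summarize this document."}],
    response={"role": "assistant", "content": "Here is a summary..."},
    model="gpt-4o",
)
```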
Other observability concepts
Agent
In Freeplay, an “agent” refers to a named category of traces that represent semantically similar workflows or behaviors. When you give traces the same name (e.g., “research_agent” or “customer_support”), Freeplay groups them together as an agent. This grouping enables you to:
- Configure evaluation criteria that run automatically against traces with that name
- Create datasets linked to that agent for testing
- Analyze performance and quality across all traces of that type
Tool call
A record of a tool or function call made during an agent workflow, including both the request (tool name and arguments) and the result. Tool calls are recorded at the same level as completions within a trace. See Tools for guidance on recording tool calls.
Custom metadata
Contextual information attached to sessions, traces, or completions. Use custom metadata to store data like user IDs, feature flags, business metrics, or workflow identifiers that help with filtering, searching, and analysis. Custom metadata is recorded via custom_metadata fields when creating or updating observability objects. Don’t use custom metadata for user feedback like ratings or comments—use customer feedback instead.
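For example (the surrounding call shape is an assumption; the custom_metadata field name itself comes from the docs):

```python
# Sketch only: sessions.create is an assumed method name. The custom_metadata
# dict is the part that matters: arbitrary key/value context for filtering.
session = client.sessions.create(
    project_id=PROJECT_ID,
    custom_metadata={
        "user_id": "user_1234",          # who triggered the session
        "feature_flag": "beta_search",   # which variant was active
        "plan_tier": "enterprise",       # business dimension for later analysis
    },
)
```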
Customer feedback
End-user feedback recorded through dedicated feedback endpoints. Customer feedback includes ratings (thumbs up/down, star ratings) and freeform comments. This data receives special treatment in the Freeplay UI due to its distinct utility for quality improvement. Customer feedback is recorded via the /completion-feedback/ or /trace-feedback/ API endpoints.
See Customer Feedback for implementation details.
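A rough sketch of posting feedback over HTTP follows; only the /completion-feedback/ path segment comes from the docs above, while the base URL, auth header, and payload keys are assumptions to be replaced with the values in the API Reference:

```python
import requests

# Hypothetical request: substitute the real base URL, completion ID, and
# payload schema from the API Reference before using.
response = requests.post(
    "https://YOUR_SUBDOMAIN.freeplay.ai/api/.../completion-feedback/COMPLETION_ID",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "rating": "positive",                                 # e.g., thumbs up
        "comment": "Answered my question on the first try.",  # freeform comment
    },
)
response.raise_for_status()
```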
Prompt management
Prompt template
A versioned configuration that defines everything needed to make an LLM call: the message structure (using Mustache syntax for variables), provider and model selection, request parameters (like temperature), and optionally tool schemas or output structure definitions. Prompt templates separate the static structure of your prompts from the dynamic variables populated at runtime. This structure enables easy versioning, A/B testing, and dataset creation from production logs. See Managing Prompts for more details.
Environment
A deployment target for prompt templates, such as dev, staging, prod, or latest. Environments let you deploy different versions of your prompts to different stages of your application lifecycle, similar to feature flags.
The latest environment always points to the most recently created version of a prompt template. Custom environments can be created for specific use cases.
See Deployment Environments for configuration details.
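As an illustration of fetching by environment (the get_formatted method and its parameters are assumed names; the environment values and the idea of runtime variable substitution come from the sections above):

```python
# Sketch: fetch whatever version of "support_bot" is currently deployed to
# "prod" and fill in its Mustache variables. The method name is an assumption.
formatted_prompt = client.prompts.get_formatted(
    project_id=PROJECT_ID,
    template_name="support_bot",
    environment="prod",   # could also be dev, staging, latest, or a custom environment
    variables={
        "customer_name": "Ada",
        "question": "How do I reset my password?",
    },
)
# The result would carry provider-ready messages plus the configured model
# and request parameters for that template version.
```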
Prompt bundling
The practice of snapshotting prompt template configurations into your source code repository rather than fetching them from Freeplay’s server at runtime. Prompt bundling removes Freeplay from the “hot path” of your application, improving latency and providing compliance benefits for regulated industries. See Prompt Bundling for implementation guidance.
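Conceptually, bundling replaces the runtime fetch with a local file read; the directory layout and JSON schema below are invented for illustration, and the real bundle format is described in the Prompt Bundling guide:

```python
import json
from pathlib import Path

# Hypothetical location for prompt snapshots committed to the repo.
BUNDLE_DIR = Path("freeplay_bundle/prompts")

def load_bundled_prompt(template_name: str) -> dict:
    """Read a snapshotted prompt template instead of calling Freeplay's server."""
    return json.loads((BUNDLE_DIR / f"{template_name}.json").read_text())

# No network call in the request hot path; Freeplay is consulted only at
# build/deploy time when the bundle is refreshed.
prompt_config = load_bundled_prompt("support_bot")
```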
Mustache
The templating syntax used in Freeplay prompt templates for variable interpolation. Mustache uses double curly braces ({{variable_name}}) for simple substitution and supports conditional logic and iteration.
See Advanced Prompt Templating Using Mustache for syntax reference.
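For example, rendered here with chevron, one of several Mustache implementations for Python (Freeplay performs its own rendering; this only demonstrates the syntax):

```python
import chevron  # pip install chevron; any spec-compliant Mustache library behaves the same

template = (
    "Hello {{customer_name}}.\n"
    "{{#is_premium}}Thanks for being a premium customer!\n{{/is_premium}}"  # conditional section
    "Open tickets:\n"
    "{{#tickets}}- {{.}}\n{{/tickets}}"                                     # iteration over a list
)

print(chevron.render(template, {
    "customer_name": "Ada",
    "is_premium": True,
    "tickets": ["Password reset", "Billing question"],
}))
```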
Evaluation and testing
Evaluation
The process of measuring and scoring the quality of AI outputs. Freeplay supports four types of evaluations:
- Human evaluation: Manual review and scoring by team members
- Model-graded evaluation: Using an LLM as a judge
- Code evaluation: Custom functions that evaluate quantifiable criteria (see the sketch after this list)
- Auto-categorization: Automated tagging based on specified categories
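For instance, a code evaluation is just a deterministic function over an output; the checks below are an illustrative sketch rather than a Freeplay-defined interface:

```python
import json

def is_valid_json(output: str) -> bool:
    """Code eval: pass if the model's output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length_budget(output: str, max_chars: int = 1200) -> bool:
    """Code eval: pass if the response stays within a length budget."""
    return len(output) <= max_chars
```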
LLM judge
An LLM-based evaluator that scores AI outputs against specified criteria. Also called “model-graded evaluation.” LLM judges can assess nuanced qualities like helpfulness, accuracy, or tone that are difficult to evaluate with code alone. See Model-Graded Evaluations and Creating and Aligning Model-Graded Evals for implementation details.
Dataset
A collection of test cases used for evaluation and testing. Datasets can be created by:
- Curating examples from production logs
- Uploading CSV or JSONL files
- Authoring directly in the Freeplay UI
The API parameter is testlist for legacy reasons, but we use “dataset” in the UI and when referring to this concept in prose.
Test run
A batch execution of evaluations against a dataset. Test runs can be:
- Component-level: Testing individual prompts or components
- End-to-end: Testing complete workflows like full agent runs
Workflows and collaboration
Review queue
A workflow for human review of production outputs. Review queues enable teams to:
- Review and annotate completions or traces
- Add structured labels or free-text notes
- Correct LLM judge scores when they’re wrong
- Curate examples into datasets
Data flywheel
The continuous improvement cycle enabled by Freeplay’s connected workflow. Production logs flow into datasets, which feed evaluations, which inform prompt improvements, which generate better logs. Each iteration strengthens prompts, datasets, evaluation criteria, and testing infrastructure together.
SDK and API
Freeplay SDK
Client libraries for integrating Freeplay into your application. Available for:
- Python: freeplay (install via pip install freeplay)
- TypeScript/Node: freeplay (install via npm install freeplay)
- Java/JVM: ai.freeplay:client (see SDK Setup for Maven/Gradle config)
Provider
The LLM service that processes your prompts. Freeplay supports any provider, including OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google, and self-hosted models. The provider is specified in prompt templates or when recording completions.
Flavor
The message format used by a specific provider. Different providers expect messages in different formats (e.g., OpenAI’s chat format vs. Anthropic’s format). Freeplay handles format conversion based on the configured flavor.
Related resources
- Why Freeplay? - Overview of Freeplay’s approach
- Getting Started - Quick start guides
- SDK Documentation - Detailed SDK reference
- API Reference - HTTP API documentation

