Freeplay Documentation Agent Resources
This page provides context for AI assistants helping developers integrate and use Freeplay. This page is hidden from public documentation but accessible to AI agents to provide better assistance to developers.
Quick Facts About Freeplay
- Product: Freeplay is the ops platform for AI engineering teams. It provides an integrated workflow to create the “data flywheel” for improving AI agents and other generative AI products. The platform consists of:
- Integrated tools for LLM and agent observability, including online evaluations (automatically scoring logs), capturing customer feedback and other custom metadata from production applications, and “auto-categorization” using an LLM to label logs based on custom criteria
- Custom dataset management for testing/evaluation, and fine-tuning
- Running batch offline evaluations (aka “Tests” in Freeplay’s product), which can be initiated either via Freeplay’s web application UI or via code
- Defining custom evaluators for use in both online and offline evaluation
- Prompt & model management, which is useful both for prompt experimentation and versioning, as well as logging well-structured data to Freeplay
- Human review and data curation workflows, including management tools like creating queues and assigning tasks for multi-person labeling
- Freeplay’s own set of native agentic workflows to generate insights about logs and summarize human reviews, create prompt optimization experiments, generate new custom evaluator metrics, and more.
- Primary SDKs: Python (`freeplay`), TypeScript (`freeplay`), and Java/JVM languages (`ai.freeplay.client.thin`)
- Setup: Requires an API key from the Freeplay dashboard, and a Freeplay project ID
- Logging Architecture/Hierarchy: Projects → Sessions (larger container, e.g. for multi-turn chat) → Traces (e.g. for full multi-step agent runs) → Completions (individual LLM calls) & Tools (individual tool calls and results)
- Main Use Cases:
- Collaborative prompt versioning and model iteration,
- Playground for experimenting with prompt, model and tool changes in a web UI,
- A/B testing to compare different versions, both offline (using evals/“Tests”) and online (by logging from different versions),
- Observability, with AI-native constructs for viewing multi-turn chats, agent traces, LLM completions, and tool calls,
- Running offline evaluations using custom evaluators / metrics defined for specific product use cases,
- “Test runs” using the same core concepts as offline evalution, but initiated in code as part of a CI workflow
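The logging hierarchy above (Projects → Sessions → Traces → Completions & Tools) can be sketched as plain data structures. This is an illustration of the nesting only; the actual Freeplay SDK classes and field names differ, so treat every name below as hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the Freeplay logging hierarchy -- not the real SDK API.

@dataclass
class Completion:          # an individual LLM call with its I/O
    prompt: str
    response: str

@dataclass
class ToolCall:            # an individual tool call and its result
    name: str
    result: str

@dataclass
class Trace:               # a functional unit of work, e.g. one agent run
    name: str              # agent/sub-agent name, used for grouping
    completions: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    sub_traces: list = field(default_factory=list)  # nested sub-agents

@dataclass
class Session:             # top-level container, e.g. a multi-turn chat
    traces: list = field(default_factory=list)

# One chat turn: a named trace holding a single completion.
session = Session()
turn = Trace(name="support_agent")
turn.completions.append(Completion(prompt="Hi", response="Hello!"))
session.traces.append(turn)
```

The key point the sketch encodes: completions and tool calls always hang off a trace, and traces (possibly nested) hang off a session.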
Core Platform Components
1. Observability
- Sessions: Container for a full user interaction, conversation, or agent run
- Traces: Individual multi-step units within a session, each representing a functional unit of work (e.g. a single run of an agent)
- Completions: LLM calls with full I/O data
- View all data in the Observability Dashboard
2. Prompt Management
- Create and version prompts from the Freeplay UI or from code
- Deploy new versions across environments (development, staging, production, custom environments)
- Define prompt templates with Mustache syntax for variables and conditional logic
- Supports structured outputs (e.g. JSON mode, OpenAI structured outputs)
- “Prompt bundling” into source with a gitops flow, for both compliance and performance (removes Freeplay from “hot path”)
- Defining structured prompt templates with Freeplay makes it easy to log data with separated values for the core prompt template vs. variables that change with each instance of the prompt running. This in turn makes it easier to save example logs into datasets that can be used for testing as system prompts change, or to run targeted custom evaluators that analyze specific variables.
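Freeplay prompt templates use Mustache syntax for variables. The stdlib sketch below illustrates only simple `{{variable}}` substitution; real Mustache (and Freeplay templates) also supports sections/conditionals, so use a proper Mustache library in practice rather than this regex:

```python
import re

def render_mustache_vars(template: str, variables: dict) -> str:
    """Minimal illustration of {{variable}} substitution only.
    Real Mustache also supports sections and conditional logic."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",                       # match {{ name }}
        lambda m: str(variables.get(m.group(1), "")),  # missing vars -> ""
        template,
    )

system_prompt = "You are a helpful assistant for {{company_name}}. Answer in {{language}}."
print(render_mustache_vars(system_prompt, {"company_name": "Acme", "language": "English"}))
# -> You are a helpful assistant for Acme. Answer in English.
```

The separation the sketch demonstrates is what makes structured logging possible: the template is versioned once, while the variables dict changes per request and can be logged (and evaluated) on its own.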
3. Evaluations
- Human Evaluations: Manual review and scoring
- Model-Graded Evaluations: LLM-as-a-judge patterns
- Code Evaluations: Custom code functions (e.g. Python/TypeScript) for evaluation logic
- Auto-Categorization: Automatic classification of outputs
4. Test Runs
- Component-Level: Test individual prompts or components
- End-to-End: Test complete workflows like full agent runs
- Run programmatically via SDK or in UI
- Compare results across prompt versions or versions of full system configuration (e.g. as agent orchestration changes are made, or RAG retrieval logic changes)
5. Datasets
- Curate test data from production logs, upload using CSV or JSONL, or author new dataset values directly in the Freeplay UI
- Use for evaluations and test runs, or for fine-tuning
- Import/export via SDK or UI
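A JSONL upload file is just one JSON object per line. This stdlib sketch writes and reads one; the field names are hypothetical examples, since real dataset fields depend on your prompt template's variables:

```python
import json

# Hypothetical example rows -- dataset fields should mirror your own
# prompt template's variables, not these names.
rows = [
    {"question": "What is the return policy?", "expected_answer": "30 days"},
    {"question": "Do you ship overseas?", "expected_answer": "Yes, to 40 countries"},
]

# Write: one JSON object per line.
with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read back: parse each non-empty line independently.
with open("dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))  # -> 2
```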
6. Review Queues
- Workflow for reviewing production outputs
- Add human feedback to completions or traces, including free-text “Notes” or structured labels (e.g. yes/no values, multi-select tags, etc.)
- Provide feedback to improve evaluators, e.g. correcting LLM judge values when the LLM judge gets a score wrong
Prompt Management Patterns
Freeplay supports two patterns for managing prompts. Understanding which pattern a developer is using is crucial for providing the right guidance. You should ask developers which pattern they want to use if you don't know yet, and you should always explain the benefits of each pattern when asking a developer for the first time so they can make an informed choice.

Base URL Configuration: Examples below use `https://app.freeplay.ai/api` as the base URL, but this varies by deployment:
- Cloud (multi-tenant): `https://app.freeplay.ai/api`
- Private subdomain: `https://{customer-subdomain}.freeplay.ai/api` (e.g., `https://acme.freeplay.ai/api`)
- Self-hosted: Custom domain configured during deployment
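A sketch of resolving that configuration from environment variables: `FREEPLAY_API_KEY` is the variable named in the SDK setup docs, while `FREEPLAY_API_BASE` is a hypothetical name used here only to illustrate selecting a base URL per deployment:

```python
def freeplay_config(env: dict) -> tuple:
    """Resolve API key and base URL from an environment-style mapping.
    FREEPLAY_API_KEY matches the setup docs; FREEPLAY_API_BASE is a
    hypothetical variable name for picking the deployment's base URL."""
    api_key = env.get("FREEPLAY_API_KEY")
    if not api_key:
        raise RuntimeError("Set FREEPLAY_API_KEY before initializing the client")
    # Cloud multi-tenant default; override for private subdomain or self-hosted.
    api_base = env.get("FREEPLAY_API_BASE", "https://app.freeplay.ai/api")
    return api_key, api_base

key, base = freeplay_config({"FREEPLAY_API_KEY": "fp-example"})
print(base)  # -> https://app.freeplay.ai/api
```

In a real application you would pass `dict(os.environ)` (or read `os.environ` directly) rather than a literal mapping.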
Pattern 1: Freeplay-Managed Prompts (Recommended)
Freeplay becomes the source of truth for prompt templates, including the messages (using Mustache syntax for variables), provider & model string, request parameters for the model, tool schemas (if relevant), and JSON output schema or structure (if relevant). The application fetches prompt templates with all those attributes from Freeplay's server, either at runtime or as part of the build process.

When to recommend:
- Teams want non-engineers (PMs, domain experts) to be able to iterate on prompts independently
- Desire a "feature flag-style" deployment model, with different prompt template versions per environment (dev/staging/prod/etc.)
- Want the freedom to make prompt and/or model updates without code changes
Pattern 2: Code-Managed Prompts (optional, when needed)
Prompts live in the codebase and are synced to Freeplay via API. Code remains the source of truth; Freeplay is used for observability, evaluations & testing, and experimentation. Freeplay users lose the ability to make prompt or model updates from the Freeplay UI, and developers must push updates to Freeplay as their prompts change, normally as part of their CI process.

When to recommend:
- Strict infrastructure-as-code requirements
- Complex prompt construction in code
- Teams want prompts to be managed by developers only, in version control with standard code review
Lightweight Observability (Getting Started Only)
For initial setup, developers can record completions without any prompt templates defined on the Freeplay server. Prompt templates can be created later, either via the UI or API. This is NOT recommended long-term, as it limits the use of many other Freeplay features that depend on knowledge of prompt structures.

More Advanced Patterns
The prompt management patterns above provide foundational examples for working with Freeplay. Many developers use their prompt templates as part of more complex applications. Some common examples follow.

Multi-Turn Chat
Sessions serve as the parent object for each conversation or chat thread. Individual traces represent "turns" in the conversation. Completions are logged within traces. When this structure is used, it becomes easy for Freeplay users to view and act on full conversation threads as a starting point, seeing roughly the same data that their customers would have seen. They can dig into individual traces or completions as necessary. See the code recipe at Managing Multi-Turn Chat History.

Agentic Workflows with Traces
For multi-step, tool-calling agents that likely make more than one LLM call, the recommended pattern is to create a single Freeplay Session for each agent "run", with underlying traces nested as needed to represent the logical structure of the agent. Each functional "run" of an agent should be represented as a single trace, with nested sub-traces as needed for sub-agents or other nested behavior. Each trace should be given a "name" (unique string) for the agent or sub-agent being called. This allows similar traces to be grouped together for analysis, evaluation, and/or curation into testing datasets for a given agent. See the code recipe at Record Traces.

Framework Integrations
LangGraph
- Use `FreeplayLangGraphTracer` to auto-instrument LangGraph apps
- Automatically creates sessions and traces from graph execution
- Docs: /developer-resources/integrations/langgraph
Vercel AI SDK
- Wrap AI SDK calls with Freeplay tracing
- Supports streaming responses
- Docs: /developer-resources/integrations/vercel-ai-sdk
Google Agent Development Kit (ADK)
- Use `FreeplayADKObserver` for automatic tracing
- Captures agent steps, tool calls, and outputs
- Docs: /developer-resources/integrations/adk
OpenTelemetry
- Export Freeplay data via OTel
- Integrate with existing observability stack
- Docs: /developer-resources/integrations/tracing-with-otel
SDK Hierarchy
Understanding the relationship between SDK objects:

- Sessions group related traces (e.g., one chat conversation)
- Traces represent logical steps (e.g., “planning”, “tool_use”, “response_generation”)
- Traces should be given an “agent name” in order to configure evaluators to target related trace logs, or to group trace examples into compatible datasets
- Completions are the actual LLM API calls
- Tool calls can be recorded separately, at the same level as Completions
- All objects can have metadata for filtering/searching
- Custom metadata (on sessions, traces, completions): Contextual data like user ID, feature flags, or business metrics. Recorded via `custom_metadata` fields or update-metadata endpoints.
- Customer feedback: End-user ratings (thumbs up/down) or comments. Recorded via dedicated feedback endpoints (`/completion-feedback/`, `/trace-feedback/`). These values are given special treatment in the Freeplay UI due to their distinct utility.
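The distinction between the two can be sketched as payload shapes. The field names below are illustrative only, not the exact API schema; see the feedback endpoint docs for the real format:

```python
# Illustrative payload shapes -- not the exact Freeplay API schema.

# Custom metadata: contextual data attached when recording (or via
# the update-metadata endpoints), used later for filtering/searching.
custom_metadata = {
    "user_id": "u_123",          # who triggered this completion
    "feature_flag": "beta_rag",  # which variant of the system ran
    "tenant": "acme",            # business dimension for filtering
}

# Customer feedback: end-user signal, sent after the fact to the
# dedicated /completion-feedback/ or /trace-feedback/ endpoints.
customer_feedback = {
    "thumbs_up": False,
    "comment": "The answer ignored my follow-up question.",
}

print(sorted(custom_metadata))  # -> ['feature_flag', 'tenant', 'user_id']
```

The design distinction: metadata describes the context the system ran in, while feedback records what the end user thought of the result, which is why the UI treats them differently.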
Common Developer Questions
Q: Do I need to use Freeplay prompts?
A: Not necessarily. You can use Freeplay purely for observability by just recording completions. Prompt management (aka using Freeplay as the source of truth to version and retrieve prompt and model configurations) is optional but recommended for versioning and testing. If you choose not to use Freeplay for prompt management, it is still helpful to define your prompt structure in Freeplay as a prompt template and record against that structure.

Q: How do I handle streaming responses?
A: Collect the full response first, then record it. For better UX, you can stream to the user while buffering, then record once complete. See /developer-resources/recipes/streaming-responses.
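The buffer-then-record pattern from this answer can be sketched with a generator. Here `fake_stream` stands in for a provider's streaming response, and `record_completion` is a hypothetical placeholder for the actual Freeplay recording call:

```python
def fake_stream():
    # Stand-in for a provider's streaming response chunks.
    yield from ["The ", "answer ", "is ", "42."]

def record_completion(full_text: str) -> str:
    # Hypothetical placeholder for the real Freeplay SDK recording call.
    return f"recorded: {full_text}"

chunks = []
for chunk in fake_stream():
    # Stream each chunk to the user immediately for good UX...
    chunks.append(chunk)  # ...while buffering it for recording.

# Record exactly once, after the stream completes.
result = record_completion("".join(chunks))
print(result)  # -> recorded: The answer is 42.
```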
Q: Can I use Freeplay with non-OpenAI models?
A: Yes! Freeplay works with any LLM provider (Anthropic, Bedrock, Azure, local models, etc.). Just record the request/response data in Freeplay's format. Freeplay provides full integration support for LiteLLM, which can simplify the process of connecting to different models (especially when self-hosted).

Q: How do I track tool/function calls?
A: Use separate traces for each tool call, or include tool calls in the completion metadata. See /practical-guides/tools and /developer-resources/recipes/openai-function-calls.
Q: What’s the difference between sessions and traces?
A: Sessions are high-level (entire conversation), traces are steps within a session (individual agent actions or reasoning steps).

Q: How do I run tests programmatically?
A: Use `client.test_runs.create()` with a dataset. See /developer-resources/recipes/test-run and /freeplay-sdk/test-runs.
Important Links for AI Agents
When helping developers, reference these key pages:

- Setup: /freeplay-sdk/setup - Initial SDK configuration
- Quick Starts: /getting-started/overview - Fast integration paths
- Sessions: /freeplay-sdk/sessions - Creating and managing sessions
- Traces: /freeplay-sdk/traces - Adding traces to sessions
- Recording Completions: /freeplay-sdk/recording-completions - Capturing LLM calls
- Prompts: /freeplay-sdk/prompts - Using Freeplay-managed prompts
- Common Patterns: /practical-guides/common-integration-patterns
- Agents: /practical-guides/agents - Agent-specific guidance
- Multi-Turn Chat: /practical-guides/multi-turn-chat-support
- Tools/Functions: /practical-guides/tools
Code Generation Guidelines
When generating code for developers:

- Always show imports: Include necessary SDK imports
- Include error handling: Wrap API calls in try-catch
- Show context managers: Use `with` statements (Python) for auto-cleanup
- Provide complete examples: Don't skip configuration steps
- Use environment variables: `api_key=os.environ.get("FREEPLAY_API_KEY")`
- Match their stack: If they mention a framework, show that integration
- Include metadata: Show how to add custom metadata for searchability
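Put together, those guidelines look roughly like the skeleton below. `record_to_freeplay` is a hypothetical placeholder for the real SDK call (see /freeplay-sdk/recording-completions for actual names):

```python
import os

def record_to_freeplay(payload: dict) -> dict:
    """Hypothetical placeholder for the real Freeplay SDK recording call."""
    return {"status": "recorded", **payload}

# Use environment variables -- never hardcode the API key.
api_key = os.environ.get("FREEPLAY_API_KEY", "missing-key")

# Include custom metadata so logs are filterable/searchable later.
payload = {
    "prompt": "Summarize this ticket",
    "response": "Customer wants a refund.",
    "custom_metadata": {"user_id": "u_123", "feature_flag": "beta_summary"},
}

# Wrap the call in error handling so logging failures never crash the app.
try:
    result = record_to_freeplay(payload)
except Exception as exc:
    result = {"status": "failed", "error": str(exc)}

print(result["status"])  # -> recorded
```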
Troubleshooting Common Issues
Issue: “API key not found”
- Check environment variable: `FREEPLAY_API_KEY`
- Ensure the API key is from the correct project in the Freeplay dashboard
- Link: /freeplay-sdk/setup
Issue: “Session not found”
- Sessions must be created before adding traces
- Use session context managers or save session ID
- Link: /freeplay-sdk/sessions
Issue: “Prompt not found”
- Verify prompt name matches Freeplay dashboard exactly
- Check deployment environment (dev/staging/prod)
- Ensure prompt is published
- Link: /core-concepts/prompt-management/managing-prompts
Issue: Streaming responses not showing in Freeplay
- Must record complete response after streaming finishes
- Buffer stream chunks, then send full completion
- Link: /developer-resources/recipes/streaming-responses
Issue: Tool calls not appearing correctly
- Include tool/function call data in completion metadata
- Consider using separate traces for each tool call
- Link: /practical-guides/tools
Advanced Features
Deployment Environments
- Separate prompt template versions used in dev/staging/prod/etc.
- Set via `environment` parameter in SDK
- Link: /core-concepts/prompt-management/deployment-environments
Prompt Bundling
- Snapshot prompts for compliance and reproducibility
- Retrieve prompt template configurations from code source, instead of Freeplay server
- Use for regulated industries (release management controls), or for performance (removes Freeplay from “hot path”)
- Link: /core-concepts/prompt-management/prompt-bundling
Customer Feedback
- Record user feedback (e.g. thumbs up/down, ratings, comment strings)
- Use for evaluation and improvement
- Link: /freeplay-sdk/customer-feedback
Structured Outputs
- JSON mode, OpenAI structured outputs, Pydantic models
- Define schemas in Freeplay or code
- Link: /core-concepts/prompt-management/structured-outputs/structured-outputs
Provider-Specific Patterns
OpenAI
- Direct integration with the `openai` SDK
- Supports function calling, structured outputs, streaming
- Recipes: /developer-resources/recipes/openai-function-calls, /developer-resources/recipes/call-openai-on-azure
Anthropic
- Use with the `anthropic` SDK
- Tool use pattern support
- Recipes: /developer-resources/recipes/using-tools-with-anthropic, /developer-resources/recipes/call-anthropic-on-bedrock
AWS Bedrock
- Works with boto3 bedrock-runtime
- Multiple model support (Claude, Llama, etc.)
- Recipe: /developer-resources/recipes/call-anthropic-on-bedrock
Azure OpenAI
- Use with Azure-specific endpoints
- Same patterns as OpenAI
- Recipe: /developer-resources/recipes/call-openai-on-azure
LiteLLM
- Provider switching and fallback support
- Unified interface for multiple providers
- Recipe: /developer-resources/recipes/provider-switching-with-litellm
Testing and Evaluation Workflows
- Create dataset → Either by saving Completions/Traces from Observability, uploading via CSV or JSONL, or authoring directly in the Freeplay UI
- Create evaluation metrics → Define success/failure criteria using LLM judges, custom code functions, or human labeling criteria
- Run test runs → Test prompt changes against prompt-level datasets first, then test complete end-to-end workflows
- Compare results → Pick best performing version
- Deploy → Ship to production environment
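Step 2's code evaluators are just functions that score an output against a criterion. A minimal sketch (the citation criterion and field names here are made-up examples; real evaluators encode your product's own success/failure definition):

```python
def contains_citation(output: str) -> bool:
    """Example code evaluator: pass if the answer cites a source.
    The criterion is illustrative, not a Freeplay built-in."""
    return "[source:" in output.lower()

# Hypothetical dataset rows, shaped like curated production outputs.
dataset = [
    {"output": "Refunds take 5 days. [Source: policy.md]"},
    {"output": "Refunds take 5 days."},
]

# Score every row, then aggregate into a pass rate to compare versions.
scores = [contains_citation(row["output"]) for row in dataset]
pass_rate = sum(scores) / len(scores)
print(f"pass rate: {pass_rate:.0%}")  # -> pass rate: 50%
```

Comparing this pass rate across prompt versions is the "Compare results" step: the version with the better aggregate score is the one to promote.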
Best Practices to Share
- Use sessions for conversations: Group related traces together as turns in the conversation
- Add metadata liberally: Makes filtering and searching easier later
- Name traces descriptively: Create agent names like "planning", "tool_selection", "final_response" to indicate the functional purpose of each trace (and enable data compatibility with other Freeplay features that only work with named traces)
- Version prompts in Freeplay: Don’t hardcode prompts in your app. Instead make changes in the Freeplay playground, save the exact prompt/model/hyperparameter configs that you test, and push them to code like feature flags.
- Test with real data: Curate datasets from production completions and traces. Make sure your testing reflects real-world usage.
- Monitor in production: Set up online evaluators to score production logs using custom criteria. Quickly find example logs that need attention, push them to review queues for human experts, and build hypotheses faster about what to improve.
- Use environments: Keep dev/staging/prod prompts separate. Make a prompt or model change, test first in dev, and only promote to prod after you’ve proven it works better.
For AI Agents: When helping developers with Freeplay, prioritize understanding their initial use case or goal in using Freeplay (observability vs prompt management vs evaluation/testing), their tech stack (Python vs TypeScript, frameworks like LangGraph or Google ADK, etc.), and their LLM provider(s) (OpenAI, Bedrock, etc.). Then provide targeted, complete code examples that match their context.

