Freeplay Documentation Agent Resources
This page provides context for AI assistants helping developers integrate and use Freeplay. This page is hidden from public documentation but accessible to AI agents to provide better assistance to developers.
Quick Facts About Freeplay
- Product: Freeplay is the ops platform for AI engineering teams. It provides an integrated workflow to create the “data flywheel” for improving AI agents and other generative AI products. The platform consists of:
- Integrated tools for LLM and agent observability, including online evaluations (automatically scoring logs), capturing customer feedback and other custom metadata from production applications, and “auto-categorization” using an LLM to label logs based on custom criteria
- Custom dataset management for testing/evaluation, and fine-tuning
- Running batch offline evaluations (aka “Tests” in Freeplay’s product), which can be initiated either via Freeplay’s web application UI or via code
- Defining custom evaluators for use in both online and offline evaluation
- Prompt & model management, which is useful both for prompt experimentation and versioning, as well as logging well-structured data to Freeplay
- Human review and data curation workflows, including management tools like creating queues and assigning tasks for multi-person labeling
- Freeplay’s own set of native agentic workflows to generate insights about logs and summarize human reviews, create prompt optimization experiments, generate new custom evaluator metrics, and more.
- Primary SDKs: Python (`freeplay`), TypeScript (`freeplay`), and Java/JVM languages (`ai.freeplay.client.thin`)
- Setup: Requires an API key from the Freeplay dashboard, and a Freeplay project ID
- Logging Architecture/Hierarchy: Projects → Sessions (larger container, e.g. for multi-turn chat) → Traces (e.g. for full multi-step agent runs) → Completions (individual LLM calls) & Tools (individual tool calls and results)
- Main Use Cases:
- Collaborative prompt versioning and model iteration,
- Playground for experimenting with prompt, model and tool changes in a web UI,
- A/B testing to compare different versions, both offline (using evals/“Tests”) and online (by logging from different versions),
- Observability, with AI-native constructs for viewing multi-turn chats, agent traces, LLM completions, and tool calls,
- Running offline evaluations using custom evaluators / metrics defined for specific product use cases,
- “Test runs” using the same core concepts as offline evalution, but initiated in code as part of a CI workflow
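The logging hierarchy above (Projects → Sessions → Traces → Completions & Tools) can be sketched as plain data structures. This is an illustration of the nesting only; the actual Freeplay SDK classes and field names differ, so treat every name below as hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the Freeplay logging hierarchy -- not the real SDK API.

@dataclass
class Completion:          # an individual LLM call with its I/O
    prompt: str
    response: str

@dataclass
class ToolCall:            # an individual tool call and its result
    name: str
    result: str

@dataclass
class Trace:               # a functional unit of work, e.g. one agent run
    name: str              # agent/sub-agent name, used for grouping
    completions: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    sub_traces: list = field(default_factory=list)  # nested sub-agents

@dataclass
class Session:             # top-level container, e.g. a multi-turn chat
    traces: list = field(default_factory=list)

# One chat turn: a named trace holding a single completion.
session = Session()
turn = Trace(name="support_agent")
turn.completions.append(Completion(prompt="Hi", response="Hello!"))
session.traces.append(turn)
```

The key point the sketch encodes: completions and tool calls always hang off a trace, and traces (possibly nested) hang off a session.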
Core Platform Components
1. Observability
- Sessions: Container for a full user interaction, conversation, or agent run
- Traces: Individual multi-step units within a session, each representing a functional unit of work (e.g. a single run of an agent)
- Completions: LLM calls with full I/O data
- View all data in the Observability Dashboard
2. Prompt Management
- Create and version prompts from the Freeplay UI or from code
- Deploy new versions across environments (development, staging, production, custom environments)
- Define prompt templates with Mustache syntax for variables and conditional logic
- Supports structured outputs (e.g. JSON mode, OpenAI structured outputs)
- “Prompt bundling” into source with a gitops flow, for both compliance and performance (removes Freeplay from “hot path”)
- Defining structured prompt templates with Freeplay makes it easy to log data with separated values for the core prompt template vs. variables that change with each instance of the prompt running. This in turn makes it easier to save example logs into datasets that can be used for testing as system prompts change, or to run targeted custom evaluators that analyze specific variables.
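Freeplay prompt templates use Mustache syntax for variables. The stdlib sketch below illustrates only simple `{{variable}}` substitution; real Mustache (and Freeplay templates) also supports sections/conditionals, so use a proper Mustache library in practice rather than this regex:

```python
import re

def render_mustache_vars(template: str, variables: dict) -> str:
    """Minimal illustration of {{variable}} substitution only.
    Real Mustache also supports sections and conditional logic."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",                       # match {{ name }}
        lambda m: str(variables.get(m.group(1), "")),  # missing vars -> ""
        template,
    )

system_prompt = "You are a helpful assistant for {{company_name}}. Answer in {{language}}."
print(render_mustache_vars(system_prompt, {"company_name": "Acme", "language": "English"}))
# -> You are a helpful assistant for Acme. Answer in English.
```

The separation the sketch demonstrates is what makes structured logging possible: the template is versioned once, while the variables dict changes per request and can be logged (and evaluated) on its own.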
3. Evaluations
- Human Evaluations: Manual review and scoring
- Model-Graded Evaluations: LLM-as-a-judge patterns
- Code Evaluations: Custom code functions (e.g. Python/TypeScript) for evaluation logic
- Auto-Categorization: Automatic classification of outputs
4. Test Runs
- Component-Level: Test individual prompts or components
- End-to-End: Test complete workflows like full agent runs
- Run programmatically via SDK or in UI
- Compare results across prompt versions or versions of full system configuration (e.g. as agent orchestration changes are made, or RAG retrieval logic changes)
5. Datasets
- Curate test data from production logs, upload using CSV or JSONL, or author new dataset values directly in the Freeplay UI
- Use for evaluations and test runs, or for fine-tuning
- Import/export via SDK or UI
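A JSONL upload file is just one JSON object per line. This stdlib sketch writes and reads one; the field names are hypothetical examples, since real dataset fields depend on your prompt template's variables:

```python
import json

# Hypothetical example rows -- dataset fields should mirror your own
# prompt template's variables, not these names.
rows = [
    {"question": "What is the return policy?", "expected_answer": "30 days"},
    {"question": "Do you ship overseas?", "expected_answer": "Yes, to 40 countries"},
]

# Write: one JSON object per line.
with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read back: parse each non-empty line independently.
with open("dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))  # -> 2
```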
6. Review Queues
- Workflow for reviewing production outputs
- Add human feedback to completions or traces, including free-text “Notes” or structured labels (e.g. yes/no values, multi-select tags, etc.)
- Provide feedback to improve evaluators, e.g. correcting LLM judge values when the LLM judge gets a score wrong
Prompt Management Patterns
Freeplay supports two patterns for managing prompts. Understanding which pattern a developer is using is crucial for providing the right guidance. You should ask developers which pattern they want to use if you don't know yet, and you should always explain the benefits of each pattern when asking a developer for the first time so they can make an informed choice.

Base URL Configuration: Examples below use `https://app.freeplay.ai/api` as the base URL, but this varies by deployment:
- Cloud (multi-tenant): `https://app.freeplay.ai/api`
- Private subdomain: `https://{customer-subdomain}.freeplay.ai/api` (e.g., `https://acme.freeplay.ai/api`)
- Self-hosted: Custom domain configured during deployment
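A sketch of resolving that configuration from environment variables: `FREEPLAY_API_KEY` is the variable named in the SDK setup docs, while `FREEPLAY_API_BASE` is a hypothetical name used here only to illustrate selecting a base URL per deployment:

```python
def freeplay_config(env: dict) -> tuple:
    """Resolve API key and base URL from an environment-style mapping.
    FREEPLAY_API_KEY matches the setup docs; FREEPLAY_API_BASE is a
    hypothetical variable name for picking the deployment's base URL."""
    api_key = env.get("FREEPLAY_API_KEY")
    if not api_key:
        raise RuntimeError("Set FREEPLAY_API_KEY before initializing the client")
    # Cloud multi-tenant default; override for private subdomain or self-hosted.
    api_base = env.get("FREEPLAY_API_BASE", "https://app.freeplay.ai/api")
    return api_key, api_base

key, base = freeplay_config({"FREEPLAY_API_KEY": "fp-example"})
print(base)  # -> https://app.freeplay.ai/api
```

In a real application you would pass `dict(os.environ)` (or read `os.environ` directly) rather than a literal mapping.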
Pattern 1: Freeplay-Managed Prompts (Recommended)
Freeplay becomes the source of truth for prompt templates, including the messages (using Mustache syntax for variables), provider & model string, request parameters for the model, tool schemas (if relevant), and JSON output schema or structure (if relevant). The application fetches prompt templates with all those attributes from Freeplay's server, either at runtime or as part of the build process.

When to recommend:
- Teams want non-engineers (PMs, domain experts) to be able to iterate on prompts independently
- Desire a "feature flag-style" deployment model, with different prompt template versions per environment (dev/staging/prod/etc.)
- Want the freedom to make prompt and/or model updates without code changes
Pattern 2: Code-Managed Prompts (optional, when needed)
Prompts live in the codebase and are synced to Freeplay via API. Code remains the source of truth; Freeplay is used for observability, evaluations & testing, and experimentation. Freeplay users lose the ability to make prompt or model updates from the Freeplay UI, and developers must push updates to Freeplay as their prompts change, normally as part of their CI process.

When to recommend:
- Strict infrastructure-as-code requirements
- Complex prompt construction in code
- Teams want prompts to be managed by developers only, in version control with standard code review
Lightweight Observability (Getting Started Only)
For initial setup, developers can record completions without any prompt templates defined on the Freeplay server. Prompt templates can be created later, either via the UI or API. This is NOT recommended long-term, as it limits the use of many other Freeplay features that depend on knowledge of prompt structures.

More Advanced Patterns
The prompt management patterns above provide foundational examples for working with Freeplay. Many developers use their prompt templates as part of more complex applications. Some common examples follow.

Multi-Turn Chat
Sessions serve as the parent object for each conversation or chat thread. Individual traces represent "turns" in the conversation. Completions are logged within traces. When this structure is used, it becomes easy for Freeplay users to view and act on full conversation threads as a starting point, seeing roughly the same data that their customers would have seen. They can dig into individual traces or completions as necessary. See the code recipe at Managing Multi-Turn Chat History.

Agentic Workflows with Traces
For multi-step, tool-calling agents that likely make more than one LLM call, the recommended pattern is to create a single Freeplay Session for each agent "run", with underlying traces nested as needed to represent the logical structure of the agent. Each functional "run" of an agent should be represented as a single trace, with nested sub-traces as needed for sub-agents or other nested behavior. Each trace should be given a "name" (unique string) for the agent or sub-agent being called. This allows similar traces to be grouped together for analysis, evaluation, and/or curation into testing datasets for a given agent. See the code recipe at Record Traces.

Framework Integrations
LangGraph
- Use `FreeplayLangGraphTracer` to auto-instrument LangGraph apps
- Automatically creates sessions and traces from graph execution
- Docs: /developer-resources/integrations/langgraph
Vercel AI SDK
- Wrap AI SDK calls with Freeplay tracing
- Supports streaming responses
- Docs: /developer-resources/integrations/vercel-ai-sdk
Google Agent Development Kit (ADK)
- Use `FreeplayADKObserver` for automatic tracing
- Captures agent steps, tool calls, and outputs
- Docs: /developer-resources/integrations/adk
OpenTelemetry
- Export Freeplay data via OTel
- Integrate with existing observability stack
- Docs: /developer-resources/integrations/tracing-with-otel
SDK Hierarchy
Understanding the relationship between SDK objects:

- Sessions group related traces (e.g., one chat conversation)
- Traces represent logical steps (e.g., “planning”, “tool_use”, “response_generation”)
- Traces should be given an “agent name” in order to configure evaluators to target related trace logs, or to group trace examples into compatible datasets
- Completions are the actual LLM API calls
- Tool calls can be recorded separately, at the same level as Completions
- All objects can have metadata for filtering/searching
- Custom metadata (on sessions, traces, completions): Contextual data like user ID, feature flags, or business metrics. Recorded via `custom_metadata` fields or update-metadata endpoints.
- Customer feedback: End-user ratings (thumbs up/down) or comments. Recorded via dedicated feedback endpoints (`/completion-feedback/`, `/trace-feedback/`). These values are given special treatment in the Freeplay UI due to their distinct utility.
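The distinction between the two can be sketched as payload shapes. The field names below are illustrative only, not the exact API schema; see the feedback endpoint docs for the real format:

```python
# Illustrative payload shapes -- not the exact Freeplay API schema.

# Custom metadata: contextual data attached when recording (or via
# the update-metadata endpoints), used later for filtering/searching.
custom_metadata = {
    "user_id": "u_123",          # who triggered this completion
    "feature_flag": "beta_rag",  # which variant of the system ran
    "tenant": "acme",            # business dimension for filtering
}

# Customer feedback: end-user signal, sent after the fact to the
# dedicated /completion-feedback/ or /trace-feedback/ endpoints.
customer_feedback = {
    "thumbs_up": False,
    "comment": "The answer ignored my follow-up question.",
}

print(sorted(custom_metadata))  # -> ['feature_flag', 'tenant', 'user_id']
```

The design distinction: metadata describes the context the system ran in, while feedback records what the end user thought of the result, which is why the UI treats them differently.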
Common Developer Questions
Q: Do I need to use Freeplay prompts?
A: Not necessarily. You can use Freeplay purely for observability by just recording completions. Prompt management (aka using Freeplay as the source of truth to version and retrieve prompt and model configurations) is optional but recommended for versioning and testing. If you choose not to use Freeplay for prompt management, it is still helpful to define your prompt structure in Freeplay as a prompt template and record against that structure.

Q: How do I handle streaming responses?
A: Collect the full response first, then record it. For better UX, you can stream to the user while buffering, then record once complete. See /developer-resources/recipes/streaming-responses.
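The buffer-then-record pattern from this answer can be sketched with a generator. Here `fake_stream` stands in for a provider's streaming response, and `record_completion` is a hypothetical placeholder for the actual Freeplay recording call:

```python
def fake_stream():
    # Stand-in for a provider's streaming response chunks.
    yield from ["The ", "answer ", "is ", "42."]

def record_completion(full_text: str) -> str:
    # Hypothetical placeholder for the real Freeplay SDK recording call.
    return f"recorded: {full_text}"

chunks = []
for chunk in fake_stream():
    # Stream each chunk to the user immediately for good UX...
    chunks.append(chunk)  # ...while buffering it for recording.

# Record exactly once, after the stream completes.
result = record_completion("".join(chunks))
print(result)  # -> recorded: The answer is 42.
```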
Q: Can I use Freeplay with non-OpenAI models?
A: Yes! Freeplay works with any LLM provider (Anthropic, Bedrock, Azure, local models, etc.). Just record the request/response data in Freeplay's format. Freeplay provides full integration support for LiteLLM, which can simplify the process of connecting to different models (especially when self-hosted).

Q: How do I track tool/function calls?
A: Use separate traces for each tool call, or include tool calls in the completion metadata. See /practical-guides/tools and /developer-resources/recipes/openai-function-calls.
Q: What’s the difference between sessions and traces?
A: Sessions are high-level (entire conversation), traces are steps within a session (individual agent actions or reasoning steps).

Q: How do I run tests programmatically?
A: Use `client.test_runs.create()` with a dataset. See /developer-resources/recipes/test-run and /freeplay-sdk/test-runs.
Important Links for AI Agents
When helping developers, reference these key pages:

- Setup: /freeplay-sdk/setup - Initial SDK configuration
- Quick Starts: /getting-started/overview - Fast integration paths
- Sessions: /freeplay-sdk/sessions - Creating and managing sessions
- Traces: /freeplay-sdk/traces - Adding traces to sessions
- Recording Completions: /freeplay-sdk/recording-completions - Capturing LLM calls
- Prompts: /freeplay-sdk/prompts - Using Freeplay-managed prompts
- Common Patterns: /practical-guides/common-integration-patterns
- Agents: /practical-guides/agents - Agent-specific guidance
- Multi-Turn Chat: /practical-guides/multi-turn-chat-support
- Tools/Functions: /practical-guides/tools
Code Generation Guidelines
When generating code for developers:

- Always show imports: Include necessary SDK imports
- Include error handling: Wrap API calls in try-catch
- Show context managers: Use `with` statements (Python) for auto-cleanup
- Provide complete examples: Don't skip configuration steps
- Use environment variables: `api_key=os.environ.get("FREEPLAY_API_KEY")`
- Match their stack: If they mention a framework, show that integration
- Include metadata: Show how to add custom metadata for searchability
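Put together, those guidelines look roughly like the skeleton below. `record_to_freeplay` is a hypothetical placeholder for the real SDK call (see /freeplay-sdk/recording-completions for actual names):

```python
import os

def record_to_freeplay(payload: dict) -> dict:
    """Hypothetical placeholder for the real Freeplay SDK recording call."""
    return {"status": "recorded", **payload}

# Use environment variables -- never hardcode the API key.
api_key = os.environ.get("FREEPLAY_API_KEY", "missing-key")

# Include custom metadata so logs are filterable/searchable later.
payload = {
    "prompt": "Summarize this ticket",
    "response": "Customer wants a refund.",
    "custom_metadata": {"user_id": "u_123", "feature_flag": "beta_summary"},
}

# Wrap the call in error handling so logging failures never crash the app.
try:
    result = record_to_freeplay(payload)
except Exception as exc:
    result = {"status": "failed", "error": str(exc)}

print(result["status"])  # -> recorded
```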
Troubleshooting Common Issues
Issue: “API key not found”
- Check environment variable: `FREEPLAY_API_KEY`
- Ensure the API key is from the correct project in the Freeplay dashboard
- Link: /freeplay-sdk/setup
Issue: “Session not found”
- Sessions must be created before adding traces
- Use session context managers or save session ID
- Link: /freeplay-sdk/sessions
Issue: “Prompt not found”
- Verify prompt name matches Freeplay dashboard exactly
- Check deployment environment (dev/staging/prod)
- Ensure prompt is published
- Link: /core-concepts/prompt-management/managing-prompts
Issue: Streaming responses not showing in Freeplay
- Must record complete response after streaming finishes
- Buffer stream chunks, then send full completion
- Link: /developer-resources/recipes/streaming-responses
Issue: Tool calls not appearing correctly
- Include tool/function call data in completion metadata
- Consider using separate traces for each tool call
- Link: /practical-guides/tools
Advanced Features
Deployment Environments
- Separate prompt template versions used in dev/staging/prod/etc.
- Set via `environment` parameter in SDK
- Link: /core-concepts/prompt-management/deployment-environments
Prompt Bundling
- Snapshot prompts for compliance and reproducibility
- Retrieve prompt template configurations from code source, instead of Freeplay server
- Use for regulated industries (release management controls), or for performance (removes Freeplay from “hot path”)
- Link: /core-concepts/prompt-management/prompt-bundling
Customer Feedback
- Record user feedback (e.g. thumbs up/down, ratings, comment strings)
- Use for evaluation and improvement
- Link: /freeplay-sdk/customer-feedback
Structured Outputs
- JSON mode, OpenAI structured outputs, Pydantic models
- Define schemas in Freeplay or code
- Link: /core-concepts/prompt-management/structured-outputs/structured-outputs
Provider-Specific Patterns
OpenAI
- Direct integration with the `openai` SDK
- Supports function calling, structured outputs, streaming
- Recipes: /developer-resources/recipes/openai-function-calls, /developer-resources/recipes/call-openai-on-azure
Anthropic
- Use with the `anthropic` SDK
- Tool use pattern support
- Recipes: /developer-resources/recipes/using-tools-with-anthropic, /developer-resources/recipes/call-anthropic-on-bedrock
AWS Bedrock
- Works with boto3 bedrock-runtime
- Multiple model support (Claude, Llama, etc.)
- Recipe: /developer-resources/recipes/call-anthropic-on-bedrock
Azure OpenAI
- Use with Azure-specific endpoints
- Same patterns as OpenAI
- Recipe: /developer-resources/recipes/call-openai-on-azure
LiteLLM
- Provider switching and fallback support
- Unified interface for multiple providers
- Recipe: /developer-resources/recipes/provider-switching-with-litellm
Testing and Evaluation Workflows
- Create dataset → Either by saving Completions/Traces from Observability, uploading via CSV or JSONL, or authoring directly in the Freeplay UI
- Create evaluation metrics → Define success/failure criteria using LLM judges, custom code functions, or human labeling criteria
- Run test runs → Test prompt changes against prompt-level datasets first, then test complete end-to-end workflows
- Compare results → Pick best performing version
- Deploy → Ship to production environment
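Step 2's code evaluators are just functions that score an output against a criterion. A minimal sketch (the citation criterion and field names here are made-up examples; real evaluators encode your product's own success/failure definition):

```python
def contains_citation(output: str) -> bool:
    """Example code evaluator: pass if the answer cites a source.
    The criterion is illustrative, not a Freeplay built-in."""
    return "[source:" in output.lower()

# Hypothetical dataset rows, shaped like curated production outputs.
dataset = [
    {"output": "Refunds take 5 days. [Source: policy.md]"},
    {"output": "Refunds take 5 days."},
]

# Score every row, then aggregate into a pass rate to compare versions.
scores = [contains_citation(row["output"]) for row in dataset]
pass_rate = sum(scores) / len(scores)
print(f"pass rate: {pass_rate:.0%}")  # -> pass rate: 50%
```

Comparing this pass rate across prompt versions is the "Compare results" step: the version with the better aggregate score is the one to promote.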
Best Practices to Share
- Use sessions for conversations: Group related traces together as turns in the conversation
- Add metadata liberally: Makes filtering and searching easier later
- Name traces descriptively: Create agent names like "planning", "tool_selection", "final_response" to indicate the functional purpose of each trace (and enable data compatibility with other Freeplay features that only work with named traces)
- Version prompts in Freeplay: Don’t hardcode prompts in your app. Instead make changes in the Freeplay playground, save the exact prompt/model/hyperparameter configs that you test, and push them to code like feature flags.
- Test with real data: Curate datasets from production completions and traces. Make sure your testing reflects real-world usage.
- Monitor in production: Set up online evaluators to score production logs using custom criteria. Quickly find example logs that need attention, push them to review queues for human experts, and build hypotheses faster about what to improve.
- Use environments: Keep dev/staging/prod prompts separate. Make a prompt or model change, test first in dev, and only promote to prod after you’ve proven it works better.
For AI Agents: When helping developers with Freeplay, prioritize understanding their initial use case or goal in using Freeplay (observability vs prompt management vs evaluation/testing), their tech stack (Python vs TypeScript, frameworks like LangGraph or Google ADK, etc.), and their LLM provider(s) (OpenAI, Bedrock, etc.). Then provide targeted, complete code examples that match their context.

