Agents
Overview of how to build, monitor, and improve AI agents using Freeplay.
These docs are under heavy development as we ship new agent features! Check back often for updates, or subscribe to our newsletter.
Agents in Freeplay
How We Use The Word Agent
The term “Agent” is widely used and can mean different things. For our purposes here, we mean any process in your application that makes use of multiple LLM calls to generate a single response. These may include tool usage or self-directed reasoning processes, but they don’t have to.
Everything below applies to both “workflows” and “agents” as defined in Anthropic’s helpful guide to “Building Effective Agents.”

Using Freeplay to track an Agent
Overview
Many LLM applications now go beyond simple completions to include multi-step workflows where an LLM makes decisions, uses tools, and follows complex reasoning processes to accomplish a task. These agent-based systems require specialized tooling to effectively monitor, test, and improve.
Freeplay provides robust support to build and improve AI agents — allowing you to observe, evaluate, test changes, and iterate on agent performance, no matter how you manage orchestration. To get started, you’ll need to:
- Configure prompt management in Freeplay, so that any prompts included in your agent can be tracked and versioned
- Record traces to Freeplay any time your agent runs (including any tool calls, additional metadata, customer feedback, and any evals you calculate in your code)
This guide will walk you through the details of how to effectively use Freeplay for your agent workflows.
Key Benefits
- Agent Observability: Monitor and observe each of the steps, decisions, and outputs in your agent workflows to identify areas for improvement
- Automated Testing & Structured Experimentation: Test agent components individually or as complete workflows — with support to manage evals and testing datasets at both the individual component and full agent level
- Advanced Evaluations: Apply evaluations to any individual agent steps and/or overall agent performance
- Collaborative Debugging & Iteration: Easily share agent performance data and logs in ways your whole team can interpret, in support of collaborative problem-solving
- Performance Insights: Quantify your agent’s performance on metrics you define, and report out findings to your wider organization
How Agents Work In Freeplay
Freeplay depends on the concept of traces to support agent workflows.
A Freeplay trace represents a logical grouping of LLM completions, tool calls, etc. that form a single agent task or workflow. Each trace can contain multiple completions, tool calls/results, and additional metadata about the agent — along with evaluation results and user feedback.
Relationship between Sessions, Traces, and Completions
Understanding the hierarchy of data organization in Freeplay is essential for effectively tracking agent workflows. In short:
- Completions: Individual LLM calls made up of a prompt and a response or output from a model.
- Traces: Optionally used to group related completions and tool calls, e.g. when multiple completions are used to generate a single chat turn or an agent flow.
- Sessions: The container for all completions and traces that make up a single customer interaction or single agent run.
- Note: These can be 1:1 with completions for a simple feature that just uses one prompt, or they can be very large at times, e.g. an entire conversation thread between a single user and a chatbot over multiple hours where each turn invokes an agentic workflow.
For agent workflows, you'll typically have:
- One session per user interaction or single run of your agent (unless you’re building a multi-turn chatbot)
- One or more traces per agent or multi-step workflow
- Multiple completions within each trace
For more detailed information on this data hierarchy, please refer to our guide on Sessions, Traces and Completions.
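To make this hierarchy concrete, here's a minimal sketch of how it maps onto the SDK calls used throughout this guide. Treat it as illustrative: the session-creation call and the surrounding setup are assumptions, and exact names may vary by SDK version.
# Illustrative sketch: one session per agent run, one trace per workflow,
# multiple completions recorded against that trace
session = fpclient.sessions.create()  # assumed session-creation call; see the SDK docs
trace = session.create_trace(
    input=user_question,
    agent_name="Financial Assistant",
)
# ... record each LLM completion and tool call against the trace via
# fpclient.recordings.create(RecordPayload(..., trace_info=...)), as shown below ...
trace.record_output(
    project_id=PROJECT_ID,
    output=final_answer,
)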
Logging Agent Data to Traces
When logging agent activity, include not just the inputs and outputs, but also metadata about the agent version, capabilities, and any tool calls made during the process. The sections below dive into the details of how to log this data to Freeplay:
Trace Creation
To track your agents with Freeplay, you'll create traces and record agent interactions. When creating a trace for an agent, you can include:
- agent_name (string): The name of the agent being used, which gets used elsewhere in Freeplay for evals, dataset compatibility, etc.
- custom_metadata (key-value pairs): Additional metadata you choose to record about the agent
# Create an example trace for an agent
trace = session.create_trace(
    input="customer_inquiry: What's my account balance?",
    agent_name="Financial Assistant",
    custom_metadata={
        "agent_version": "2.1.3",
        "agent_framework": "multi-modal agent",
        "agent_capabilities": "banking,accounts,transfers"
    }
)
import { CustomMetadata } from "freeplay";

// Example inputs
const userQuestion = "What is the weather in London?";
const agentName = "WeatherAgent";
const customMetadata: CustomMetadata = {
  location: "London",
  requestedAt: new Date().toISOString(),
};

// Creating the trace
const traceInfo = await session.createTrace({
  input: userQuestion,
  agentName,
  customMetadata,
});
Recording Intermediate Completions & Tool Calls within a Trace
For each step in your agent's workflow, you'll want to record the LLM completions and any tool calls made. To record this information to a trace, pass the trace_info to the RecordPayload along with the completion or tool calls/results, and they will be tied to the trace. See our full tool calling examples for OpenAI and Anthropic; the code below shows how to record tool calls and intermediate completions:
######
# LLM Call with tool calls in the response
######

# Handle tool calls if present
if isinstance(completion.content, list):
    for block in completion.content:
        if isinstance(block, ToolUseBlock) and block.name == "weather_of_location":
            temperature = get_temperature(block.input["location"])

            # Capture the tool response in the right format
            tool_response_message = {
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(temperature),
                    }
                ]
            }
            messages.append(tool_response_message)

# Record the completion and tool call using RecordPayload
fpclient.recordings.create(
    RecordPayload(
        all_messages=messages,
        session_info=session.session_info,
        inputs=input_variables,
        prompt_info=formatted_prompt.prompt_info,
        call_info=CallInfo.from_prompt_info(
            formatted_prompt.prompt_info,
            start,
            end
        ),
        # Include the tool schema that was provided to the model
        tool_schema=formatted_prompt.tool_schema,
        trace_info=trace_info,  # Pass the trace info to the recording
        response_info=ResponseInfo(
            is_complete=completion.stop_reason == 'stop_sequence'
        )
    )
)
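For context, get_temperature and the weather_of_location tool referenced above aren't part of Freeplay; they stand in for your own tool implementation. A minimal illustration of what they might look like, using Anthropic's tool schema format (when the prompt and its tools are managed in Freeplay, the schema would come back via formatted_prompt.tool_schema rather than being hard-coded):
# Illustrative tool implementation backing the "weather_of_location" tool.
# A real agent would call a weather API here instead of returning canned data.
def get_temperature(location: str) -> float:
    canned_temperatures = {"London": 14.0, "San Francisco": 18.5}
    return canned_temperatures.get(location, 20.0)

# A matching tool definition in Anthropic's tool schema format
weather_tool_schema = [
    {
        "name": "weather_of_location",
        "description": "Get the current temperature for a given location.",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]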
Recording LLM Outputs and Client-Side Evaluations
When recording agent outputs, you can also include any evaluation results calculated in your code (details here):
- eval_results (dict): Dictionary of evaluation metrics for the trace (can include numeric, string, or boolean values)
# Record trace output with evaluations
trace.record_output(
    project_id=PROJECT_ID,
    output=agent_response,
    eval_results={
        "task_completion": True,
        "reasoning_quality": 0.85,
        "response_accuracy": "high",
        "hallucination_detected": False
    }
)
// Typing information
type TestRunInfo = {
  runId: string;
  runName?: string;
};

// Variable prep
const botResponseText = botResponse.llmResponseText;
const evalResults = {
  is_factual: true,
  helpfulness_score: 0.9,
};
const testRunInfo: TestRunInfo = {
  runId: "test-run-123",
  runName: "RegressionSuite-May",
};

// Record the model output along with optional eval results and test run info
await traceInfo.recordOutput(projectId, botResponseText, evalResults, testRunInfo);
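Client-side evaluations don't have to be model-graded; they can be simple, deterministic checks computed in your own code right before you record the output. A minimal sketch in Python (the helper and its checks are hypothetical, not part of the Freeplay SDK):
# Hypothetical helper: deterministic, client-side evals computed in your own code
def build_eval_results(agent_response: str, expected_tools: set, tools_called: set) -> dict:
    return {
        "task_completion": len(agent_response.strip()) > 0,
        "used_expected_tools": expected_tools.issubset(tools_called),
        "response_length": len(agent_response),
    }

# tools_called: names of the tools your orchestration actually invoked on this run.
# Pass the resulting dict as eval_results when recording the trace output, as shown above.
eval_results = build_eval_results(agent_response, {"weather_of_location"}, tools_called)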
Updating Traces with Customer Feedback
You can add customer or system feedback to traces, which will be treated as a special class of metadata in Freeplay.
# Update trace with feedback
client.customer_feedback.update_trace(
    project_id=PROJECT_ID,
    trace_id=trace.trace_id,
    feedback={
        "helpfulness": 9.2,
        "relevance": "high",
        "satisfied_user": True,
        "valid_tool_use": True,
        "guard_rails": False
    }
)
// Typing information
type CustomFeedback = {
  freeplay_feedback: "positive" | "neutral" | "negative";
  is_helpful: boolean;
};

const feedback: Record<string, CustomFeedback> = {
  freeplay_feedback: {
    freeplay_feedback: "positive",
    is_helpful: true,
  },
};

// Update the trace with customer feedback
await fpClient.customerFeedback.updateTrace({
  projectId,
  traceId: traceInfo.traceId,
  customerFeedback: feedback,
});
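Feedback often arrives well after the trace is recorded (for example, when a user clicks a thumbs-up in your UI), so a common pattern is to store the trace ID alongside the response and attach feedback later. A small sketch of that pattern in Python, reusing the update_trace call above (the handler itself is hypothetical):
# Hypothetical UI callback: attach feedback to a previously recorded trace
def handle_feedback_event(trace_id: str, thumbs_up: bool) -> None:
    client.customer_feedback.update_trace(
        project_id=PROJECT_ID,
        trace_id=trace_id,
        feedback={
            "freeplay_feedback": "positive" if thumbs_up else "negative",
            "is_helpful": thumbs_up,
        },
    )

# e.g. invoked when the user reacts to the agent's answer
handle_feedback_event(trace.trace_id, thumbs_up=True)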
Agent Datasets & Testing
Datasets
Agent datasets enable end-to-end testing of your agent workflows, allowing you to assess the complete behavior of your system under real or representative conditions. These datasets serve as a foundation for understanding how changes—such as prompt revisions, tool updates, or orchestration logic adjustments—affect the overall output of your agent.
In Freeplay, agent datasets are powered by trace-level logging. By assigning an Agent Name to your traces, you create a logical grouping of all related agent runs. This grouping allows you to filter and review traces in the observability dashboard and assemble them into datasets for future testing and evaluation.
You can create an agent dataset in two ways:
- From the observability dashboard by selecting traces with the same Agent Name.
- Directly from the trace view by saving specific traces to a dataset.
Testing
Once you’ve assembled an agent dataset, you can use it to run structured tests against your agent. These tests provide visibility into system performance, surfacing both top-level metrics and step-level details that inform iteration and deployment decisions.
Running tests on agent datasets allows you to apply evaluations at both the trace and prompt level and understand how changes impact overall system behavior.
The example below illustrates a completed test run in Freeplay. High-level agent evaluations are shown at the top, while granular prompt-level evaluations are displayed below. This layered view enables you to assess both the success of the full agent and the contributions of each step.
To execute these tests programmatically, you can use the Freeplay SDK in the same way you would initiate a standard test run. See the SDK documentation for full implementation details.
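As a rough sketch of what that can look like in Python: create a test run from your agent dataset, run your agent for each example, and record each trace's output (optionally with test-run info, as in the recordOutput example earlier). The test-run creation call, the iteration over examples, and the field names below are assumptions rather than exact SDK signatures; check the SDK documentation for your version.
# Hedged sketch of an agent-level test run -- method and field names are assumptions
test_run = fpclient.test_runs.create(
    project_id=PROJECT_ID,
    testlist="financial-assistant-e2e",  # the agent dataset to test against
)
for test_case in test_run.test_cases:  # assumed iteration over dataset examples
    session = fpclient.sessions.create()  # assumed session-creation call
    trace = session.create_trace(
        input=test_case.inputs,  # assumed field name for the example's input
        agent_name="Financial Assistant",
    )
    agent_response = run_agent(test_case.inputs)  # your own orchestration logic
    # Record the final output; pass test-run info as well if your SDK version supports it
    trace.record_output(
        project_id=PROJECT_ID,
        output=agent_response,
    )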
FAQ
How do I update traces in my current Freeplay implementation?
If you're already using Freeplay, you'll need to:
- Update to the latest SDK version
- Modify your trace creation code to include agent_name and custom_metadata
- Update your recording logic to include evaluations and/or customer feedback as needed
Do I need to use traces to work with agents?
While not strictly required, using the agent_name value with traces provides significant benefits for building agents:
- Better visibility into multi-step processes
- Ability to evaluate agents from end to end
- Ability to run and organize tests at the agent level, separate from individual components
- More granular performance metrics
- Enhanced debugging capabilities
How does a trace map to an agent?
The relationship between traces and agents is flexible and ultimately up to you:
- One trace can represent one complete agent task
- Multiple traces can represent different aspects of a complex agent
Choose the approach that best represents your agent's logical workflow.
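For example, a more complex agent might record its planning and execution phases as separate traces within the same session, using the same calls shown earlier (the agent names and inputs here are hypothetical):
# One session, two traces: separate planning and execution phases of the same agent run
planner_trace = session.create_trace(
    input=user_request,
    agent_name="Research Agent - Planner",
)
# ... record planning completions against planner_trace (via trace_info in RecordPayload) ...

executor_trace = session.create_trace(
    input=plan_summary,
    agent_name="Research Agent - Executor",
)
# ... record execution completions and tool calls against executor_trace ...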
Best Practices
- Naming Convention: Use consistent agent naming to make filtering and analysis easier
- Metadata Strategy: Define a minimum standard set of metadata fields for your agents up front. (You can always add more later too.)
- Granular Evaluations: Create evaluations that target specific agent components and end-to-end behaviors
- Representative Datasets: Build datasets that cover the range of expected agent tasks
- Regular Testing: Build batch testing with evals into your agent development workflow to catch issues and regressions early
Next Steps
Ready to get started with agents in Freeplay?
- Update your Freeplay SDK to the latest version
- Review our code examples for implementing agent trace logging
- Create your first agent dataset and evaluations (where the dataset consists of inputs and outputs for the entire end-to-end agent behavior)
- Set up dashboards to monitor agent performance
For more detailed implementation guidance, contact our support team or schedule a consultation with our forward deployed engineering team.