Multi-Turn Chatbot Support
Overview of implementing multi-turn chatbots using Freeplay.
Introduction
Many LLM applications involve more than just one-off, isolated LLM completions. Chatbots in particular consist of multiple back-and-forth exchanges between a user and an assistant, which makes them uniquely challenging to test and evaluate.
This document walks through how to use Freeplay to build, test, review logs, and capture feedback on multi-turn chatbots, including how to make use of a special history object:
- Defining history in prompt templates
- Managing history with the Freeplay SDK
- Recording and viewing chat turns in Freeplay as traces
- Managing datasets, configuring evals, and automating tests that include history
In this document, we will refer to one back-and-forth exchange between the user and the assistant as a "turn".
Understanding History for Chatbots
First, why does history matter when building a multi-turn chatbot?
Each exchange must be aware of all the previous exchanges in the conversation (the "history") so that the LLM can give a contextually relevant answer. Experimentation and testing with multi-turn chat must also take history into account, since any simulated test cases need to include relevant context.
Consider this series of exchanges between the user and assistant:

In Turn Two, the assistant needs the context from the previous turn to give a reasonable answer: "I want them to be healthier" only makes sense as a request for healthier versions of the rice-based dinner ideas from Turn One.
By Turn Three, the assistant needs to reference Turn Two to know what "Give me a recipe for number 2" refers to. And so forth.
Without an understanding of the context from the conversation history, each new message would be impossible to interpret.
Note: While chatbots are the most common UX that uses this interaction pattern, it can apply more broadly. It can be helpful to think of history as a way to manage state or memory, since the LLM itself does not store any persistent context from one interaction to the next. Nothing restricts the use of these concepts to a chatbot UX.
Using Freeplay with Multi-Turn Chatbots
What's different about using Freeplay with a chatbot? There are a few important things to be aware of:
- Prompt Templates: You'll define a special history object in a prompt template, allowing you to pass conversation history at the right point.
- Recording Multi-Turn Sessions: You'll record history with each new chatbot turn, as well as record display messages at the start and end of each trace to make it easy to view the input_question and output_answer (see Traces documentation).
- Managing Datasets & Testing: You'll curate datasets that contain history so you can simulate accurate conversation scenarios when testing.
- Configuring Auto-Evaluations: If you're using model-graded evals, you'll be able to target history objects for realtime monitoring or test scenarios.
History in Prompt Templates
History should be configured within your Prompt Templates in Freeplay.
When configuring your Prompt Template, you will add a message of type history wherever your history messages should be inserted. This tells Freeplay how messages should be ordered when the prompt template is formatted.
The most common configuration would look like this:

Creating this configuration on a Freeplay prompt template would look like this:

This tells Freeplay to insert the history messages between the system message and the most recent user message when formatting a prompt.
You must define history in a prompt template before you can pass history values at record time and have them saved properly for use in datasets, testing, etc.
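For illustration, with a template that takes a single question variable, the formatted messages would be ordered like this (a sketch; the SDK section below shows a real rendering):
[
    {'role': 'system', 'content': 'You are a polite assistant...'},
    # ...history messages inserted here...
    {'role': 'user', 'content': 'the latest user message, rendered from the template'}
]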
Why configure history explicitly in the prompt template?
While it may seem redundant at first to explicitly configure the placement of history, doing so supports more varied prompting patterns. For example, you may have some predefined context that you use to seed the model each time, included as multiple messages in a prompt template. In that case, a prompt template could look like the following:

This would tell Freeplay to insert history messages after the first Assistant/User pair, rather than directly after the System message.
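Rendered, that configuration would order messages like this (a sketch; the seed messages are illustrative placeholders):
[
    {'role': 'system', 'content': 'You are a polite assistant...'},
    {'role': 'user', 'content': 'predefined seed question'},
    {'role': 'assistant', 'content': 'predefined seed answer'},
    # ...history messages inserted here...
    {'role': 'user', 'content': 'the latest user message'}
]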
Multi-Turn Chat in Logging and Observability
Include input and output display messages with each trace to render chat bubbles.
To make it easier to review data from a chatbot, Freeplay will render a chat-like UI when input_question and output_answer are logged at the start and end of a trace. This allows reviewers to quickly see what a user would have seen in a chatbot UI before digging into the details of a trace, where it can be harder to tell exactly what the user experienced.
Using Traces, Freeplay enables you to log explicit input/output messages for a cleaner viewing experience in the Freeplay app.
Here we see a multi-turn conversation first through the lens of the user interactions.

We can see which prompts were called under the hood for each trace by clicking "Show Trace". The trace view in this example shows us that two prompts were called within this customer-facing interaction.

We can dive deeper into each of those completions by clicking on one.

Each completion in a trace can be added to a dataset and evaluated as normal. Any customer feedback logged with completions in a trace will appear at the top level of the trace / chat turn so you can quickly see there is feedback to review.
Display Messages
Freeplay allows you to control the display messages for your chat view in the observability screen. This can be helpful for creating views for analysts that mimic their customers' true experience. Traces can optionally be instantiated with an input and an output message. These display messages can notably differ from the underlying prompt that was actually passed to the LLM or the raw LLM response; they are usually some subset that is more representative of the user experience. See details on managing display messages in this end-to-end code example.
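As a rough sketch of that pattern, assuming hypothetical create_trace and record_output methods (consult the Traces documentation and the code example above for the exact SDK calls), a chat turn might be wrapped like this:
session = fpClient.sessions.create()  # one session per conversation

# assumed API: open a trace with the user-facing question as its input display message
trace = session.create_trace(input="what are some dinner ideas that use rice?")

# ...run one or more prompt/LLM completions and record them against this trace...

# assumed API: close the trace with the user-facing answer as its output display message
trace.record_output(project_id, "Here are some rice-based dinner ideas...")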
Multi-Turn Chat Testing
Save and modify history as part of datasets to simulate real conversations.
Whenever you save an observed conversation turn that includes history, it will be included in the dataset for future testing. You can also edit or add history objects to a dataset at any time in case you want to control exactly what goes into a test scenario. Auto-evals can target history as well for faster test analysis.
Datasets and Test Runs
When building a chatbot, the testing unit remains at the Completion level but includes history when relevant. Consider this example again:

If we were to save the completion that generates Turn Two to a dataset, we would also get the preceding context from Turn One, which would exist in the history object for the new completion.
Subsequent Test Runs using that Test Case would treat Turn One as static, meaning it is not recomputed during the Test Run. It would be passed as context when Turn Two is regenerated so that you can simulate that exact point in the conversation when testing.
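Conceptually, the saved Test Case for Turn Two would carry its own inputs plus the Turn One exchange as history (field names here are illustrative, not the exact dataset schema):
{
    "inputs": {"question": "I want them to be healthier"},
    "history": [
        {"role": "user", "content": "what are some dinner ideas that use rice?"},
        {"role": "assistant", "content": "here are some dinner ideas..."}
    ]
}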
Here's a simple sample dataset row that includes 14 messages in the history object.

Auto Evaluations
History can be targeted in model-graded auto-evaluation templates like any other variable using the {{history}} parameter. This allows you to ask questions like: Is the current output factually accurate given the preceding context?
Determine whether or not the output is factually consistent with the preceding context.
The output should be deemed inaccurate if it contains any logical contradictions with
the preceding context.
<Output>
{{output}}
</Output>
<Preceding Context>
{{history}}
</Preceding Context>
Multi-Turn Chat in the SDK
When formatting your prompts, you will pass the previous messages as an array to the history parameter. The messages object will have the history messages inserted in the right place in the array, as defined in your prompt template. See more details in our SDK docs here.
previous_messages = [
    {"role": "user", "content": "what are some dinner ideas..."},
    {"role": "assistant", "content": "here are some dinner ideas..."}
]
prompt_vars = {"question": "how do I make them healthier?"}
formatted_prompt = fpClient.prompts.get_formatted(
project_id=project_id,
template_name="SamplePrompt",
environment="latest",
variables=prompt_vars,
history=previous_messages # pass the history messages here
)
print(formatted_prompt.messages)
# output:
[
{'role': 'system', 'content': 'You are a polite assistant...'},
{'role': 'user', 'content': 'what are some dinner ideas...'},
{'role': 'assistant', 'content': 'here are some dinner ideas...'},
{'role': 'user', 'content': 'how do I make them healthier?'}
]
You can then use that prompt and messages to make a call to your LLM provider:
import time  # start/end timestamps are passed to CallInfo below

s = time.time()
chat_response = openaiClient.chat.completions.create(
model=formatted_prompt.prompt_info.model,
messages=formatted_prompt.messages,
**formatted_prompt.prompt_info.model_parameters
)
e = time.time()
latest_message = chat_response.choices[0].message
You will then pass the full set of messages back to Freeplay on the record call:
all_messages = formatted_prompt.messages + [latest_message]  # append the new assistant reply
# record the call
payload = RecordPayload(
all_messages=all_messages,
inputs=prompt_vars,
session_info=session,
prompt_info=formatted_prompt.prompt_info,
call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=s, end_time=e),
response_info=ResponseInfo(
is_complete=chat_response.choices[0].finish_reason == 'stop'
)
)
completion_info = fpClient.recordings.create(payload)
You can then repeat that pattern with each turn in the conversation, continuing to append to and update the all_messages object.
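For example, the next turn would reuse the accumulated messages as history (same calls as above; the question string is illustrative):
prompt_vars = {"question": "give me a recipe for number 2"}
formatted_prompt = fpClient.prompts.get_formatted(
    project_id=project_id,
    template_name="SamplePrompt",
    environment="latest",
    variables=prompt_vars,
    history=all_messages  # everything exchanged so far, including the last assistant reply
)
# ...call the LLM, record the completion, then extend all_messages again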
See our SDK docs for details on recording Traces for each chat turn, including the use of input_question and output_answer to render display messages.
An end-to-end code recipe can be found here.
Now that you're well-versed in building multi-turn chatbots using the history object, let's learn about model and key management.