Multi-Turn Chatbot Support

Overview of implementing multi-turn chatbots using Freeplay.

Introduction

Many LLM applications involve more than just one-off, isolated LLM completions. Chatbots in particular consist of multiple back-and-forth exchanges between a user and an assistant, which makes them unique to test and evaluate.

This document walks through how to use Freeplay to build, test, review logs, and capture feedback on multi-turn chatbots, including how to make use of a special history object:

  • Defining history in prompt templates
  • Managing history with the Freeplay SDK
  • Recording and viewing chat turns in Freeplay as traces
  • Managing datasets, configuring evals and automating tests that include history

In this document, we will refer to one back-and-forth exchange between the user and the assistant as a "turn".

Understanding History for Chatbots

First, why does history matter when building a multi-turn chatbot?

Importantly, each exchange must be aware of all the previous exchanges in the conversation — aka the "history" — such that the LLM can give an answer that is contextually aware. Experimentation and testing with multi-turn chat must also take history into account, since any simulated test cases need to include relevant context.

Consider this series of exchanges between the user and assistant:

In Turn Two, the assistant needs the context from the previous turn to give a reasonable answer: "I want them to be healthier" is the user asking for healthier versions of the rice dinner ideas from Turn One.

By Turn Three, the assistant needs to reference Turn Two to know what "Give me a recipe for number 2" refers to. And so forth.

Without an understanding of the context from the conversation history, each new message would be impossible to interpret.

Note: While chatbots are the most common UX that uses this interaction pattern, it can apply more broadly. It can be helpful to think of history as a way to manage state or memory, since the LLM itself does not store any persistent context from one interaction to the next. Nothing restricts the use of these concepts to a chatbot UX.
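Since nothing in this state-management idea is Freeplay-specific, it can be sketched in a few lines of plain Python. Here the `call_llm` stub stands in for any real provider call; the point is that the application, not the model, carries the conversation state:

```python
def call_llm(messages):
    # Stub standing in for a real LLM call; a real implementation
    # would send `messages` to a model provider.
    return {"role": "assistant", "content": f"(reply to: {messages[-1]['content']})"}

history = []  # the application, not the model, owns this state

def send_turn(user_text):
    """Run one turn: pass all prior messages plus the new user message."""
    messages = history + [{"role": "user", "content": user_text}]
    reply = call_llm(messages)
    # Persist both sides of the exchange so the next turn has context.
    history.append({"role": "user", "content": user_text})
    history.append(reply)
    return reply

send_turn("What are some dinner ideas with rice?")
send_turn("I want them to be healthier")  # only interpretable via history
```

Without the two `history.append` calls, the second turn would arrive with no context and be impossible to interpret.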

Using Freeplay with Multi-Turn Chatbots

What's different about using Freeplay with a chatbot? There are a few important things to be aware of:

  1. Prompt Templates: You'll define a special history object in a prompt template allowing you to pass conversation history at the right point.
  2. Recording Multi-Turn Sessions: You'll record history with each new chatbot turn, as well as record display messages at the start and end of each trace to make it easy to view the input_question and output_answer (see Traces documentation).
  3. Managing Datasets & Testing: You'll curate datasets that contain history so you can simulate accurate conversation scenarios when testing.
  4. Configuring Auto-Evaluations: If you're using model-graded evals, you'll be able to target history objects for real-time monitoring or test scenarios.

History in Prompt Templates

💡

History should be configured within your Prompt Templates in Freeplay.

When configuring your Prompt Template, you will add a message of type history wherever your history messages should be inserted. This tells Freeplay how messages should be ordered when the prompt template is formatted.

The most common configuration would look like this:

Creating this configuration on a Freeplay prompt template would look like this:

This tells Freeplay to insert the history messages in between the system message and the most recent user message when formatting a prompt.

You must define history in a prompt template before you can pass history values at record time and have them saved properly for use in datasets, testing, etc.
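Conceptually, formatting a template that contains a history message works like a splice: the placeholder is replaced by the full history array, and every other message keeps its position. A simplified illustration of that ordering logic (not Freeplay's actual implementation):

```python
def format_template(template, history, variables):
    """Expand a template message list, splicing history at its placeholder."""
    messages = []
    for msg in template:
        if msg.get("type") == "history":
            messages.extend(history)  # insert all prior messages here
        else:
            messages.append({
                "role": msg["role"],
                "content": msg["content"].format(**variables),
            })
    return messages

# The "most common configuration": system, then history, then the latest user message.
template = [
    {"role": "system", "content": "You are a polite assistant..."},
    {"type": "history"},  # <- history placeholder
    {"role": "user", "content": "{question}"},
]
history = [
    {"role": "user", "content": "what are some dinner ideas..."},
    {"role": "assistant", "content": "here are some dinner ideas..."},
]
formatted = format_template(template, history, {"question": "how do I make them healthier?"})
```

Moving the placeholder within `template` changes where the history lands, which is exactly what explicit configuration buys you.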

Why configure history explicitly in the prompt template?

While it may seem redundant at first to explicitly configure the placement of history, doing so supports more varied prompting patterns. For example, you may have predefined context that you use to seed the model each time, expressed as multiple messages in the prompt template. In that case, a prompt template could look like the following:

This would tell Freeplay to insert history messages after the first Assistant/User pair, rather than directly after the System message.

Multi-Turn Chat in Logging and Observability

💡

Include input and output display messages with each trace to render chat bubbles.

To make it easier to review data from a chatbot, Freeplay will render a chat-like UI when input_question and output_answer are logged at the start and end of a trace. This allows reviewers to quickly see what a user would have seen in a chatbot UI before digging into the details of a trace, where it can be harder to tell exactly what the user experienced.

Via the use of Traces, Freeplay enables you to log explicit input/output messages for a cleaner viewing experience in the Freeplay app.

Here we see a multi-turn conversation first through the lens of the user interactions.

We can see what series of prompts were working under the hood with each trace by clicking "Show Trace".

The trace view in this example shows us that two prompts were called within this customer-facing I/O.

We can dive deeper into each of those completions by clicking on one.

Each completion in a trace can be added to a dataset and evaluated as normal. Any customer feedback logged with completions in a trace will appear at the top level of the trace / chat turn so you can quickly see there is feedback to review.

Display Messages

Freeplay allows you to control the display messages for your chat view in the observability screen. This can be helpful for creating views for analysts that mimic the customer's true experience. Traces can optionally be instantiated with an input and output message. These display messages can differ from the underlying prompt that was actually passed to the LLM or the raw LLM response; they are usually some subset that is more representative of the user experience. See details on managing display messages in this end-to-end code example.

Multi-Turn Chat Testing

💡

Save and modify history as part of datasets to simulate real conversations.

Whenever you save an observed conversation turn that includes history, it will be included in the dataset for future testing. You can also edit or add history objects to a dataset at any time in case you want to control exactly what goes into a test scenario. Auto-evals can target history as well for faster test analysis.

Datasets and Test Runs

When building a chatbot, the testing unit remains at the Completion level but includes history when relevant. Consider this example again:

If we were to save the completion that generates Turn Two to a dataset, we would also get the preceding context from Turn One, which would exist in the history object for the new completion.

Subsequent Test Runs using that Test Case would treat Turn One as static, meaning it is not recomputed during the Test Run. It would be passed as context when Turn Two is regenerated so that you can simulate that exact point in the conversation when testing.
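To make the "static history" behavior concrete, here is a sketch of replaying a saved test case: the history is passed through verbatim and only the final completion is regenerated. The dict shape and system message below are illustrative assumptions, not Freeplay's actual dataset schema:

```python
def run_test_case(test_case, generate):
    """Re-run one saved turn: history is static, only the output is regenerated."""
    messages = (
        [{"role": "system", "content": "You are a polite assistant..."}]  # assumed template
        + test_case["history"]                                            # replayed as-is
        + [{"role": "user", "content": test_case["inputs"]["question"]}]
    )
    return generate(messages)  # only this completion is recomputed

# Hypothetical dataset row: inputs for the turn under test, plus its history.
test_case = {
    "inputs": {"question": "I want them to be healthier"},
    "history": [
        {"role": "user", "content": "What can I make for dinner with rice?"},
        {"role": "assistant", "content": "1. Fried rice  2. Rice bowls ..."},
    ],
}
new_output = run_test_case(test_case, lambda msgs: f"regenerated from {len(msgs)} messages")
```

Because Turn One lives in `test_case["history"]`, every Test Run simulates the exact same point in the conversation, no matter how the prompt under test changes.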

Here's a simple sample dataset row that includes 14 messages in the history object.

Auto Evaluations

History can be targeted in model-graded auto-evaluation templates like any other variable using the {{history}} parameter. This allows you to ask questions like: Is the current output factually accurate given the preceding context?

Determine whether or not the output is factually consistent with the preceding context.
The output should be deemed inaccurate if it contains any logical contradictions with
the preceding context.

<Output>
{{output}}
</Output>

<Preceding Context>
{{history}}
</Preceding Context>
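Mechanically, this is template substitution: before the grader model runs, {{output}} and {{history}} are replaced with the completion output and a serialized view of the prior messages. A rough sketch of that step, where the role-prefixed serialization format is an assumption:

```python
EVAL_TEMPLATE = """Determine whether or not the output is factually consistent with the preceding context.

<Output>
{{output}}
</Output>

<Preceding Context>
{{history}}
</Preceding Context>"""

def render_eval(template, output, history):
    """Fill eval placeholders; history serialized as role-prefixed lines (assumed format)."""
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return template.replace("{{output}}", output).replace("{{history}}", history_text)

prompt = render_eval(
    EVAL_TEMPLATE,
    output="Try brown rice instead of white rice.",
    history=[{"role": "user", "content": "What can I make with rice?"}],
)
```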

Multi-Turn Chat in the SDK

When formatting your prompts you will pass the previous messages as an array to the history parameter. The messages object will have the history messages inserted in the right place in the array, as defined in your prompt template. See more details in our SDK docs here.

previous_messages = [
    {"role": "user", "content": "what are some dinner ideas..."},
    {"role": "assistant", "content": "here are some dinner ideas..."}
]
prompt_vars = {"question": "how do I make them healthier?"}
formatted_prompt = fpClient.prompts.get_formatted(
    project_id=project_id,
    template_name="SamplePrompt",
    environment="latest",
    variables=prompt_vars,
    history=previous_messages # pass the history messages here
)
print(formatted_prompt.messages)
# output:
[
{'role': 'system', 'content': 'You are a polite assistant...'},
{'role': 'user', 'content': 'what are some dinner ideas...'},
{'role': 'assistant', 'content': 'here are some dinner ideas...'}, 
{'role': 'user', 'content': 'how do I make them healthier?'}
]

You can then use that prompt and messages to make a call to your LLM provider:

import time  # needed for the latency timestamps below

s = time.time()
chat_response = openaiClient.chat.completions.create(
    model=formatted_prompt.prompt_info.model,
    messages=formatted_prompt.messages,
    **formatted_prompt.prompt_info.model_parameters
)
e = time.time()

latest_message = chat_response.choices[0].message

You will then pass the full set of messages back to Freeplay on the record call:

all_messages = [*formatted_prompt.messages, latest_message]

# record the call
payload = RecordPayload(
    all_messages=all_messages,
    inputs=prompt_vars,
    session_info=session, 
    prompt_info=formatted_prompt.prompt_info,
    call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=s, end_time=e),
    response_info=ResponseInfo(
        is_complete=chat_response.choices[0].finish_reason == 'stop'
    )
)
completion_info = fpClient.recordings.create(payload)

You can then repeat that pattern with each turn in the conversation, continuing to append to and update the all_messages object.
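The overall loop can be sketched as follows, with the Freeplay and provider calls reduced to comments and a stub so the message bookkeeping stays clear:

```python
def call_llm(messages):
    # Stub for the provider call shown above (openaiClient.chat.completions.create).
    return {"role": "assistant", "content": f"answer #{len(messages)}"}

def run_turn(all_messages, question):
    """One chat turn: format with history, call the model, return the grown message list."""
    # 1. Format the prompt, passing prior messages as history
    #    (stands in for fpClient.prompts.get_formatted(..., history=all_messages)).
    messages = [*all_messages, {"role": "user", "content": question}]
    # 2. Call the LLM provider.
    latest_message = call_llm(messages)
    # 3. Record to Freeplay with the full message set
    #    (fpClient.recordings.create(RecordPayload(all_messages=..., ...))).
    return [*messages, latest_message]

all_messages = []  # accumulates the full conversation across turns
all_messages = run_turn(all_messages, "What can I make for dinner with rice?")
all_messages = run_turn(all_messages, "I want them to be healthier")
```

Each turn both consumes the accumulated history and extends it, so the next call to `run_turn` simulates exactly where the conversation left off.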

See our SDK docs for details on recording Traces for each chat turn, including the use of input_question and output_answer to render display messages.

An end to end code recipe can be found here.


What’s Next

Now that you're well-versed on building multi-turn chatbots using the history object, let's learn about model and key management.