Multimodal Data
Freeplay supports multimodal data in your prompts and completions, allowing you to work with images, audio files, and documents alongside text. This guide explains how to use multimodal features throughout the Freeplay platform.
Overview
Many LLM applications now leverage data types beyond just text. For multimodal models, product images, charts, PDFs, and even audio can provide critical context to generate better responses.
Quick start
To get started using multimodal support within Freeplay follow these steps:
- Define media variables in prompt templates (similar to other Mustache variables)
- Upload and test sample files in the Freeplay prompt editor
- Update your code to handle media files with the Freeplay SDK
- Record and view multimodal interactions in Freeplay
- Save recorded examples to your datasets
This guide walks through how to use Freeplay to build, test, review logs, and capture feedback for your multimodal LLM applications, including how to make use of media inputs.
Introduction
Understanding Multimodal Data for LLMs
Multimodal models can process and analyze different types of data such as images, audio, and documents alongside text. This allows your LLM applications to "see," "hear," and "read" just like humans do.
Consider these examples of how multimodal data enhances LLM applications:
Image + Text
- User uploads a product image with a defect and asks: "What's wrong with my product?"
- The LLM can see the image, identify the issue, and provide a relevant response.
Document + Text
- User uploads a financial report and asks: "Summarize the key findings in this report."
- The LLM can analyze the document contents and generate an accurate summary.
Audio + Transcript
- User uploads a phone call recording and asks: "Describe the tone of this call and summarize the key points."
- The LLM can analyze the audio and provide tonal analysis and generate a more accurate summary with that in mind.
Multimodal inputs give the LLM additional context that can improve your system's outputs. In each of the examples above, the model can produce a much more detailed response because it has direct access to the media itself.
Using Freeplay with Multimodal Data
What's different about using Freeplay with multimodal data? There are a few important things to be aware of:
- Prompt Templates: You'll define media variables in a prompt template, allowing you to pass image, audio, or document data at the right point. This can only be done with models that support multimodal inputs.
- Recording Multimodal Data: You'll record media inputs with each completion, making it possible to view the original inputs alongside the LLM's responses during review.
- Media in History: You can record media as part of history, helping you preserve key context and inputs passed within your system.
Media Variables in Prompt Templates
Media variables should be configured within your Prompt Templates in Freeplay.
When configuring your Prompt Template, you will add media variables to user or assistant messages. This tells Freeplay where to insert image, audio, or document data when the prompt template is formatted.
The most common configuration would look like this:
- When editing or creating a prompt template in the playground, click the "Add media" button next to the prompt section type
- Note: Media can only be added to user or assistant message types
- Enter a variable name for your media input (e.g., `product_image`, `support_document`)
- Select the media type (file, image, or audio; available types depend on the model's support)
This tells Freeplay to insert the media input at that specific location in the message when formatting a prompt.
You must define media variables in a prompt template before you can pass media inputs at record time and have them saved properly for use in datasets.
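To make this concrete, here is a hedged sketch of what a user message might look like once a media variable is resolved at formatting time. The exact structure depends on the model provider; this example assumes an OpenAI-style chat API and a media variable named `product_image`.
# Hypothetical illustration (not verbatim Freeplay SDK output): a user
# message after the `product_image` media variable has been resolved for
# an OpenAI-style chat API. The media is inserted as a content part at
# the position where the variable was defined in the template.
formatted_user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What's wrong with my product?"},
        {
            # The media variable resolves to an image content part
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,<encoded-bytes>"},
        },
    ],
}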
Multimodal Data in Logging and Observability
View original media inputs alongside LLM responses in Freeplay's interface.
When reviewing completions in Freeplay, you'll be able to see the original images, documents, or audio files that were included in the prompt. This provides essential context when evaluating model performance.
In the Freeplay observability interface, completions that include media inputs will display the media alongside the text inputs and outputs. This makes it easy to understand the full context of each interaction.
When clicking into a specific completion, you can see:
- The full prompt including all media inputs
- The model's response
- Evaluation scores and feedback
This visibility is crucial for understanding how your multimodal LLM is performing and identifying areas for improvement.
Multimodal Data in the SDK
Create a `media_inputs` map when formatting prompts via the SDK.
When using the Freeplay SDK, you'll create a map of media variable names to their corresponding data, then pass this map to the `get_formatted` method.
Creating Media Inputs
Using the Media Input Map
To create the media map, import the proper types from `freeplay.resources.prompts`. Then create a map from each variable name in your Freeplay prompt template to the data associated with it. In the examples below, the variable names are `product_image`, `legal_document`, and `voice_recording`.
Freeplay accepts media in one of two formats: base64-encoded data or via URL. Depending on which format you choose, you'll need to adjust how you create the media input map accordingly. See the examples below for implementation details.
To work with multimodal data in your code, follow these steps:
- Create a media input map (with `MediaInputBase64` or `MediaContentUrl` entries, as shown below)
- Pass it to the `get_formatted` method
- Include it when recording the completion
Here's how to create a media input map:
# New imports
from freeplay.resources.prompts import MediaContentUrl, MediaInputBase64, MediaInputMap

# Create media input map for an image
media_inputs = {
    'product_image': MediaInputBase64(
        type="base64",
        content_type="image/jpeg",
        data=encode_image_data("product.jpg")  # Your function to encode image
    )
}

# For a PDF document
media_inputs = {
    'legal_document': MediaInputBase64(
        type="base64",
        content_type="application/pdf",
        data=encode_file_data("contract.pdf")  # Your function to encode PDF
    )
}

# For audio
media_inputs = {
    'voice_recording': MediaInputBase64(
        type="base64",
        content_type="audio/mpeg",  # adjust for other audio formats
        data=encode_audio_data("recording.mp3")  # Your function to encode audio
    )
}

###########################################
#               Using URLs                #
###########################################
media_inputs = {
    'product_image': MediaContentUrl(
        type="url",
        content_type="image/jpeg",
        url="https://localhost/product.jpeg"  # Link to image file
    )
}

# For a PDF document
media_inputs = {
    'legal_document': MediaContentUrl(
        type="url",
        content_type="application/pdf",
        url="https://localhost/contract.pdf"  # Link to PDF file
    )
}

# For audio
media_inputs = {
    'voice_recording': MediaContentUrl(
        type="url",
        content_type="audio/mpeg",  # adjust for other audio formats
        url="https://localhost/audio.mpeg"  # Link to audio file
    )
}
import Freeplay, {
  getCallInfo,
  getSessionInfo,
  MediaInputMap,
  MediaContentUrl,
  SessionInfo,
} from ...

// Image data
// (separate variable names are used below so the examples can coexist in one file)
const media: MediaInputMap = {
  "image-one": {
    type: "url",
    url: "https://localhost/image",
  },
  "image-two": {
    type: "base64",
    content_type: "image/jpeg",
    data: "some-base64-data",
  },
};

// Audio data
const audioMedia: MediaInputMap = {
  "voice_recording_1": {
    type: "base64",
    content_type: "audio/mpeg",
    data: audioData,
  },
  "voice_recording_2": {
    type: "url",
    content_type: "audio/mpeg",
    url: "https://localhost/audio.mpeg",
  },
};

// File (document) data
const documentMedia: MediaInputMap = {
  "legal_document_1": {
    type: "base64",
    content_type: "application/pdf",
    data: documentData,
  },
  "legal_document_2": {
    type: "url",
    content_type: "application/pdf",
    url: "https://localhost/contract.pdf",
  },
};
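The `encode_image_data`, `encode_file_data`, and `encode_audio_data` helpers referenced in the Python examples are your own functions, not part of the Freeplay SDK. A minimal sketch might look like this:
import base64

def encode_image_data(path: str) -> str:
    """Read a local file and return its contents as a base64 string.
    (Illustrative helper, not part of the Freeplay SDK; the same
    approach works for PDFs and audio files.)"""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# PDFs and audio can reuse the same logic
encode_file_data = encode_image_data
encode_audio_data = encode_image_data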
Getting Formatted Prompt with Media
When calling the Freeplay API to get a formatted prompt, include your media inputs:
formatted_prompt = freeplay_client.prompts.get_formatted(
    project_id=project_id,
    template_name="multimodal-prompt",
    environment="latest",
    variables=input_variables,
    media_inputs=media_inputs  # Include your media inputs here
)
const formattedPrompt =
  await freeplay.prompts.getFormatted<ChatCompletionMessageParam>({
    projectId,
    templateName,
    environment: "latest",
    variables: input_variables,
    media, // Pass the media map to the formatted prompt
  });
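Between formatting and recording, you'll call your model provider with the formatted messages. As a hedged sketch (assuming OpenAI's Python client and a placeholder model name), the `response_content` used in the recording example below could come from a call like this:
from openai import OpenAI

client = OpenAI()

# formatted_prompt.llm_prompt contains provider-ready messages with the
# media content already inserted by Freeplay
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use the model configured for your prompt
    messages=formatted_prompt.llm_prompt,
)
response_content = response.choices[0].message.content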
Recording Completions with Media
When recording the completion, make sure to include the media inputs:
record_response = freeplay_client.recordings.create(
    RecordPayload(
        all_messages=[
            *formatted_prompt.llm_prompt,
            {"role": "assistant", "content": response_content}
        ],
        session_info=session_info,
        inputs=input_variables,
        media_inputs=media_inputs,  # Include your media inputs here
        prompt_info=formatted_prompt.prompt_info,
        call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info,
                                            start_time, end_time),
    )
)
await freeplay.recordings.create({
  allMessages: [
    ...(formattedPrompt.llmPrompt || []),
    {
      role: "assistant",
      content,
    },
  ],
  inputs: input_variables,
  mediaInputs: media, // Pass the media here
  sessionInfo: session_info,
  promptInfo: formattedPrompt.promptInfo,
  callInfo: getCallInfo(formattedPrompt.promptInfo, start, end),
  responseInfo: {
    isComplete: true,
  },
});
Media Support
Supported Media Types
Freeplay supports the following media types:
- Images - JPG, JPEG, PNG, WebP
- Audio - WAV, MP3
- Documents - PDFs
Note: Support for specific file types depends on the model provider's capabilities. Please reach out to [email protected] if you're interested in using other data types.
Supported Sizes
We support a total request size of up to 30 MB. Files or data over that limit will not work within the Freeplay application.
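If you want to guard against oversized payloads before calling Freeplay, a simple pre-check might look like the sketch below (illustrative only; `MAX_REQUEST_BYTES` and `check_media_size` are not part of the SDK):
import os

MAX_REQUEST_BYTES = 30 * 1024 * 1024  # Freeplay's 30 MB total request limit

def check_media_size(paths: list[str]) -> None:
    """Raise if the combined size of local media files exceeds the limit."""
    total = sum(os.path.getsize(p) for p in paths)
    if total > MAX_REQUEST_BYTES:
        raise ValueError(
            f"Media payload is {total / 1_000_000:.1f} MB; "
            "Freeplay requests must stay under 30 MB"
        )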
Supported Providers
Multimodal functionality is supported today by default with:
Please reach out to [email protected] if you're interested in using other models.
Best Practices
- Keep file sizes reasonable: While Freeplay supports various file sizes, providers may have limits on the size of media files they can process, and large files can also drive up costs.
- Test and monitor thoroughly: Multimodal models may perform differently with various types of images, audio quality, or document formats. Freeplay allows for rapid testing, review, and iteration to ensure your product performs as expected.
- Combine media types: For complex use cases, you can include multiple media inputs of different types in the same prompt, such as documents and images.
- Iterate regularly: Review completions with media inputs regularly to ensure your model is interpreting the media correctly.
Now that you're well-versed in working with multimodal data in Freeplay, you can enhance your LLM applications with rich, contextual understanding of various media types.