Build Voice-Enabled AI Applications with Pipecat, Twilio, and Freeplay

Introduction

Voice-enabled AI applications present unique challenges when it comes to testing, monitoring, and iterating on your prompts and models. This guide demonstrates how Freeplay's observability and prompt management tools can support your development workflow when building voice applications.

In this example, we show how to use Freeplay together with Pipecat and Twilio.

What is Pipecat?

Pipecat is a powerful open source framework for building voice-enabled, real-time, multimodal AI applications.

When paired with Twilio for real-time voice over the phone, Pipecat enables teams to quickly build audio-based agentic systems that combine both user and bot audio with LLM interactions.

This combination creates a strong foundation for the core application, but building a high-quality generative AI product also requires robust monitoring, evaluation, and continuous experimentation. This is where Freeplay helps.

Using Freeplay for Rapid Iteration and Observability

When it comes to monitoring and improving a voice agent, teams often struggle with:

  • Multi-modal Observability: Tracking and analyzing model inputs and outputs across different data types (audio, text, images, files, etc.)
  • Quality Evaluation: Understanding how your application performs in real user scenarios and using evaluation criteria relevant to your product
  • Experimentation & Iteration: Systematically versioning, testing, and deploying changes to prompts, tools, and/or models
  • Team Collaboration: Keeping all team members on the same page when it comes to testing and understanding quality (including non-developers)

Freeplay addresses these challenges by providing a comprehensive solution for prompt and model management, observability, and evaluation that works seamlessly across modalities/data formats — including audio. And Freeplay makes it easy for both technical and non-technical team members to fully participate in the product development and optimization process.

Example Conversation View

In the image below, we see a conversation held between the voice assistant and one of our team members. The conversation audio and text are recorded to Freeplay; note that the transcript of the recorded audio is included alongside the audio recording:

Session view of completions logged to Freeplay

Once implemented, you'll be able to view complete user interactions in Freeplay, including:

  • Audio recordings
  • Transcribed text
  • LLM responses
  • Cost & latency metrics
  • Evaluation results

Integration Architecture

Freeplay integrates cleanly with Pipecat: Pipecat exposes the LLM context and data that need to be passed into Freeplay, and the integration then lets your team handle LLM iteration, testing, and completion logging in Freeplay.

To integrate Freeplay with your Pipecat and Twilio application, we recommend creating a Freeplay logging “Processor” (Pipecat service) like the example below. In Pipecat, Processors serve as workers that process “frames.” Frames hold information and are passed along to processors like an assembly line.
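To make the processor concept concrete, here is a minimal pass-through processor sketch. The import paths assume a recent Pipecat version, and PassthroughLogger is just an illustrative name, not part of the integration:

from pipecat.frames.frames import Frame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class PassthroughLogger(FrameProcessor):
    """Minimal processor: inspect each frame, then pass it downstream unchanged."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        print(f"Saw frame: {type(frame).__name__}", flush=True)
        await self.push_frame(frame, direction)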

The Freeplay Processor focuses on:

  1. Fetching prompt configuration from Freeplay
  2. Collecting information from the relevant Pipecat frames (user input and LLM completions)
  3. Logging the LLM completions to Freeplay

Information Flow in a Pipecat + Twilio + Freeplay Integration

Pipecat Pipeline with Freeplay

This example integration uses the Twilio-Chatbot example provided by Pipecat. The application is a chatbot that users interact with over the phone, powered by Twilio. (This can also be done with a streaming approach rather than discrete speech-to-text conversions, if needed.)
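For orientation, the pipeline for this setup can be wired roughly as follows. This is a sketch only: transport, stt, tts, llm, context_aggregator, and freeplay_logger (the processor built in the next section) are assumed to already exist, and the exact placement of the logger depends on which frames you want it to observe.

from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline([
    transport.input(),               # audio in from Twilio
    stt,                             # speech-to-text
    context_aggregator.user(),       # add the user turn to the LLM context
    llm,                             # LLM completion
    freeplay_logger,                 # log context, audio, and metrics to Freeplay
    tts,                             # text-to-speech
    transport.output(),              # audio back out to Twilio
    context_aggregator.assistant(),  # add the assistant turn to the LLM context
])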

Getting Started

First, if you are new to Freeplay, we suggest following our quick start guide. Once you have your account, prompts, and everything else set up, take the following steps. For a full code integration, see our Pipecat Integration recipe, which shows the complete Freeplay processor implementation.

  1. Import your prompt config from Freeplay and pass it to the LLM processor in Pipecat:
import os

from freeplay import Freeplay, SessionInfo
from pipecat.services.openai import OpenAILLMService  # adjust the import path to your Pipecat version

from helpers.freeplay_frame import FreeplayLLMLogger

# Freeplay client
fp_client = Freeplay(
    freeplay_api_key=os.getenv("FREEPLAY_API_KEY"),
    api_base=os.getenv("FREEPLAY_API_BASE")
)

# Get the formatted prompt from Freeplay
formatted_prompt = fp_client.prompts.get_formatted(
    project_id=os.getenv("FREEPLAY_PROJECT_ID"),
    template_name="voice-assistant",
    environment="latest",
    variables={},
    history=[]
)

# Pass the Freeplay-managed model, tools, and parameters to the LLM service
llm = OpenAILLMService(
    model=formatted_prompt.prompt_info.model,
    tools=formatted_prompt.tool_schema if formatted_prompt.tool_schema else None,
    api_key=os.getenv("OPENAI_API_KEY"),
    **formatted_prompt.prompt_info.model_parameters
)
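
Optionally, you can also seed Pipecat's LLM context with the Freeplay-formatted messages (for example, the system prompt), so the prompt version deployed in Freeplay drives the conversation. A rough sketch, assuming the Freeplay SDK's llm_prompt attribute and Pipecat's OpenAILLMContext; verify both against your installed versions:

from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Seed the context with the messages Freeplay formatted for the provider
context = OpenAILLMContext(formatted_prompt.llm_prompt)
context_aggregator = llm.create_context_aggregator(context)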
  2. Create a Freeplay Processor in Pipecat: The processor maintains the conversation memory, processes key frames, and keeps track of the information to log to Freeplay.
import base64
import datetime
import os
import time

# Note: import paths below assume recent Pipecat and Freeplay SDK versions; adjust to your installed versions.
from freeplay import CallInfo, Freeplay, RecordPayload, ResponseInfo, SessionInfo
from pipecat.frames.frames import (
    BotStartedSpeakingFrame,
    BotStoppedSpeakingFrame,
    Frame,
    InputAudioRawFrame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    MetricsFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
)
from pipecat.metrics.metrics import ProcessingMetricsData
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class FreeplayLLMLogger(FrameProcessor):
    """Logs LLM interactions and audio to Freeplay with simplified structure."""

    def __init__(
        self,
        fp_client: Freeplay,
        template_name: str,
        session: SessionInfo = None,
        debug: bool = True,
    ):
        super().__init__()
        self.fp_client = fp_client
        self.template_name = template_name
        self.conversation_id = self._new_conv_id()
        self.total_completion_time = 0
        
        # Audio related properties
        self.sample_width = 2
        self.sample_rate = 8000
        self.num_channels = 1
        self._user_audio = bytearray()
        self._bot_audio = bytearray()
        self.user_speaking = False
        self.bot_speaking = False
        
        # Freeplay related properties
        self.conversation_history = []
        self.session = session
        self.most_recent_user_message = None
        self.most_recent_completion = None

        self.reset_recent_messages()


    def _new_conv_id(self) -> str:
        """Generate a new conversation ID based on the current timestamp (this represents a customer id or similar)."""
        return datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    def reset_recent_messages(self):
        """Reset all temporary message and audio storage."""
        self.most_recent_user_message = None
        self.most_recent_completion = None
        self._user_audio = bytearray()
        self._bot_audio = bytearray()
        self.total_completion_time = 0
        
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        """Process incoming frames and handle Freeplay logging."""
        await super().process_frame(frame, direction)

        # Handle LLM response frames
        if isinstance(frame, (LLMFullResponseStartFrame, LLMFullResponseEndFrame)):
            event = "START" if isinstance(frame, LLMFullResponseStartFrame) else "END"
            print(f"LLMFullResponseFrame: {event}", flush=True)

        # Handle LLM context frame - this is where we log to Freeplay
        elif isinstance(frame, OpenAILLMContextFrame):
            messages = frame.context.messages
            # Extract user message and completion from context
            user_messages = [m for m in messages if m.get("role") == "user"]
            if user_messages:
                self.most_recent_user_message = user_messages[-1].get("content")

            completions = [m for m in messages if m.get("role") == "assistant"]
            if completions:
                self.most_recent_completion = completions[-1].get("content")

            # Log to Freeplay when we have both user input and completion
            if self.most_recent_user_message and self.most_recent_completion:
                self._record_to_freeplay()

        # Handle audio state changes
        elif isinstance(frame, UserStartedSpeakingFrame):
            self.user_speaking = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self.user_speaking = False
        elif isinstance(frame, BotStartedSpeakingFrame):
            self.bot_speaking = True
        elif isinstance(frame, BotStoppedSpeakingFrame):
            self.bot_speaking = False

        # Handle audio data
        elif isinstance(frame, InputAudioRawFrame):
            if self.user_speaking:
                self._user_audio.extend(frame.audio)
            elif self.bot_speaking:
                self._bot_audio.extend(frame.audio)

        # Handle metrics for completion time
        elif isinstance(frame, MetricsFrame):
            for metric in frame.data:
                if isinstance(metric, ProcessingMetricsData):
                    if "LLMService" in metric.processor:
                        self.total_completion_time = metric.value


        # Pass frame to next processor
        await self.push_frame(frame, direction)
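
With the processor defined, create a Freeplay session for the call and hand it to the logger when you build your pipeline. A minimal sketch, assuming the fp_client created earlier and that sessions are created via the Freeplay SDK's sessions.create():

# Create a Freeplay session for this phone call and wire up the logger
session = fp_client.sessions.create()

freeplay_logger = FreeplayLLMLogger(
    fp_client=fp_client,
    template_name="voice-assistant",
    session=session,
)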

Note: You will need to modify the process_frame function in Pipecat’s base_llm.py so that it passes the OpenAILLMContextFrame along; this makes the handling in FreeplayLLMLogger's process_frame easier:

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        context = None
        if isinstance(frame, OpenAILLMContextFrame):
            context: OpenAILLMContext = frame.context
            await self.push_frame(frame, direction) # Add this line here to pass frame along
        elif isinstance(frame, LLMMessagesFrame):
            context = OpenAILLMContext.from_messages(frame.messages)
        elif isinstance(frame, VisionImageRawFrame):
            context = OpenAILLMContext()
            context.add_image_frame_message(
                format=frame.format, size=frame.size, image=frame.image, text=frame.text
            )
        elif isinstance(frame, LLMUpdateSettingsFrame):
            await self._update_settings(frame.settings)
        else:
            await self.push_frame(frame, direction)
		....

  3. Start logging completions: Begin capturing real user interactions in Freeplay
    def _record_to_freeplay(self):
        """Record the current conversation state to Freeplay."""
        # Create a new trace for this interaction
        trace = self.session.create_trace(input=self.most_recent_user_message)
        
        # Get formatted prompt with conversation history
        formatted = self.fp_client.prompts.get_formatted(
            project_id=os.getenv('FREEPLAY_PROJECT_ID'),
            template_name=self.template_name,
            environment='latest',
            history=self.conversation_history,
            variables={"user_input": self.most_recent_user_message},
        )

        # Calculate latency for the LLM interaction
        start, end = time.time(), time.time() + self.total_completion_time
        
        # Prepare messages for recording
        all_msgs = formatted.all_messages({"role": "assistant", "content": self.most_recent_completion})

        try:
            # Prepare metadata and record payload
            custom_metadata = {
                'conversation_id': str(self.conversation_id),
                'completion_time': self.total_completion_time
            }
            
            record = RecordPayload(
                all_messages=all_msgs,
                session_info=SessionInfo(self.session.session_id, custom_metadata=custom_metadata),
                inputs={'user_input': self.most_recent_user_message} if self.most_recent_user_message else {},
                prompt_info=formatted.prompt_info,
                call_info=CallInfo.from_prompt_info(formatted.prompt_info, start, end),
                response_info=ResponseInfo(is_complete=True),
                trace_info=trace,
            )
            
            # Create recording in Freeplay
            self.fp_client.recordings.create(record)

            # Add assistant's response to conversation history
            self.conversation_history.append({
                "role": "assistant",
                "content": [{"type": "text", "text": self.most_recent_completion}]
            })

            # Update conversation history with user message and audio
            self.conversation_history.append({
                "role": "user",
                "content": [
                    {"type": "text", "text": self.most_recent_user_message},
                    {
                        'type': 'input_audio',
                        'input_audio': {
                            'data': base64.b64encode(self._make_wav_bytes(self._user_audio, prepend_silence_secs=1)).decode('utf-8'),
                            'format': "wav"
                        }
                    }
                ]
            })

            # Record output to trace
            trace.record_output(
                os.getenv('FREEPLAY_PROJECT_ID'), 
                self.most_recent_completion,
                # Optionally include call meta data/additional info at the trace level
                # metadata={...}
            )
            
            print(f"Successfully recorded to Freeplay - completion time: {self.total_completion_time}s", flush=True)
            self.reset_recent_messages()
            
        except Exception as e:
            print(f"Error recording to Freeplay: {e}", flush=True)
            self.reset_recent_messages()
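
The logger above also calls a _make_wav_bytes helper (used when attaching the user audio to the conversation history) that is not shown here. A minimal sketch, assuming raw PCM input that matches the sample_rate, sample_width, and num_channels set in __init__, could look like this:

    def _make_wav_bytes(self, pcm: bytes, prepend_silence_secs: float = 0) -> bytes:
        """Wrap raw PCM audio in a WAV container, optionally padding silence at the start."""
        # io and wave come from the Python standard library
        import io
        import wave

        silence_frames = int(prepend_silence_secs * self.sample_rate)
        silence = b"\x00" * (silence_frames * self.sample_width * self.num_channels)

        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(self.num_channels)
            wav.setsampwidth(self.sample_width)
            wav.setframerate(self.sample_rate)
            wav.writeframes(silence + pcm)
        return buf.getvalue()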