Configure, Test & Deploy a Fallback LLM Provider with Freeplay
API-based access to the most powerful AI models in the world is a beautiful thing, opening the door to a whole range of potentially groundbreaking applications. That said, as these LLMs become an increasingly important part of critical applications, the thought of a single point of failure in the form of a dependency on one LLM provider is giving engineers, SREs, and managers everywhere heartburn. And rightfully so: we would never tolerate that kind of brittleness in other parts of our application stack, so why should LLM development be any different?
The good news is that the number of providers serving cutting-edge LLMs by API is growing, which means provider diversification is possible. The bad news is that prompt and model configs are not fully portable from one provider to another. Whether due to the RLHF process, the underlying training data, or other factors, each of these models has its own optimal prompting style. This means that in order to have a truly reliable fallback provider, you need a prompt and model config that is continually validated against your benchmark dataset and primary provider for both latency and quality. That can be a daunting task without the right tooling and workflows in place.
Here’s how Freeplay can help you establish, maintain, and serve a fallback LLM provider.
In this case we are using OpenAI as our primary provider and will configure Anthropic as a fallback provider.
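Throughout this walkthrough, the code snippets assume both providers’ clients and a Freeplay client are already set up in your application. A minimal setup sketch, assuming the OpenAI, Anthropic, and Freeplay Python SDKs (the Freeplay constructor arguments and URL below are illustrative placeholders; check your SDK version and instance URL):
import os

import anthropic
import openai
from freeplay import Freeplay  # import path assumed; may differ across Freeplay SDK versions

# both provider keys must be configured so the fallback path can actually run
openai.api_key = os.environ["OPENAI_API_KEY"]
anthropicClient = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Freeplay client used to fetch prompt templates and record results
# (replace the placeholder api_base with your own Freeplay instance URL)
fpClient = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://<your-org>.freeplay.ai/api",
)
freeplay_project_id = os.environ["FREEPLAY_PROJECT_ID"]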
Step 1: Create a Dataset for Benchmarking
Having a labeled Dataset to test prompt, model, and pipeline changes against is critical for building a repeatable and robust LLM development process. It is also an important foundation when configuring a fallback LLM provider.
Freeplay provides in-app functionality for you to label and curate datasets from real production sessions.

Alternatively, if you already have a dataset, you can upload those examples directly to Freeplay via JSONL.
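The exact upload schema is defined by Freeplay, so check the dataset docs for your project. Purely as an illustration (the field names here are hypothetical and keyed to the rag-qa template used later in this post), a benchmark file could be assembled like this:
import json

# hypothetical examples keyed to the rag-qa template's variables plus an expected output;
# match the field names to Freeplay's dataset upload format for your project
examples = [
    {
        "question": "How do I rotate my API key?",
        "supporting_information": "API keys can be rotated from the Settings page...",
        "output": "You can rotate your API key from the Settings page...",
    },
]

with open("benchmark_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")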
Step 2: Configure a Prompt Template and Model Config for your Fallback Provider
Freeplay’s prompt editor is an interactive playground that lets you load in data from your datasets and compare prompt versions side by side. Here we have our primary provider’s prompt pulled up alongside the prompt we’re iterating on for Anthropic’s Sonnet model. We’ve loaded in a few examples from our benchmark dataset to test against.

Step 3: Test your Fallback Provider at Scale
After we’ve created a fallback provider prompt template that seems to work well, we want to test it at scale and compare it to our benchmark dataset, which in this case was generated by our primary provider and human-labeled. We can kick off the test run either in-app from Freeplay or in code via the Freeplay SDK.
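If you kick the run off in code, the rough shape is: create a test run pointing at your benchmark dataset, execute each test case against the fallback prompt, and record the results so Freeplay can compare them. The SDK method and attribute names below (test_runs.create, testlist, test_cases, TestRunInfo) are assumptions based on this workflow and may differ in your Freeplay SDK version, so treat this as an outline rather than copy-paste code.
import time
from freeplay import RecordPayload, CallInfo, ResponseInfo, TestRunInfo  # names/paths assumed

# create a test run against our benchmark dataset (dataset name is hypothetical)
test_run = fpClient.test_runs.create(
    project_id=freeplay_project_id,
    testlist="rag-qa-benchmark",
)

for test_case in test_run.test_cases:  # attribute name assumed
    # format the fallback prompt with this test case's inputs
    formatted_prompt = fpClient.prompts.get_formatted(
        project_id=freeplay_project_id,
        template_name="rag-qa",
        environment="fallback",
        variables=test_case.variables,  # attribute name assumed
    )

    start = time.time()
    chat_completion = anthropicClient.messages.create(
        model=formatted_prompt.prompt_info.model,
        system=formatted_prompt.system_content,
        messages=formatted_prompt.llm_prompt,
        **formatted_prompt.prompt_info.model_parameters,
    )
    end = time.time()

    content = chat_completion.content[0].text
    messages = formatted_prompt.all_messages(
        {"role": chat_completion.role, "content": content}
    )

    # associate the recording with the test run so results land in Freeplay's comparison view
    fpClient.recordings.create(RecordPayload(
        all_messages=messages,
        inputs=test_case.variables,
        session_info=session,  # create the session the same way your application does
        prompt_info=formatted_prompt.prompt_info,
        call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=start, end_time=end),
        response_info=ResponseInfo(is_complete=True),
        test_run_info=TestRunInfo(test_run.test_run_id, test_case.id),  # names assumed
    ))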

It looks like our fallback provider is performing on par with our primary provider, and actually a bit better in some ways! Cost and latency are also similar, so we know we can continue meeting our external SLAs while keeping internal costs under control.
Step 4: Deploy your Fallback Provider and Configure your Application Code Accordingly

Now that we’ve validated this new prompt and model config, we need to make sure our code supports the new provider. That means adding an API key for the new provider and updating our application code with our fallback strategy. Here’s a high-level overview of how the fallback will work:
- Using the prompt template from our primary provider, we try making a request.
- If the request fails, we fetch the prompt template for our fallback provider from Freeplay and retry the request with it.
- Either way, we record the results back to Freeplay.
Note that for the first two steps we could alternatively make use of Freeplay’s prompt bundling feature, which lets us check out prompts during our build process and read them from the local filesystem rather than fetch them from the Freeplay server. This removes Freeplay from the critical path entirely.
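Freeplay’s documentation covers the bundling workflow itself; purely to illustrate the runtime pattern (this is not the Freeplay SDK’s bundling API), an application reading a checked-out template from disk might do something like:
import json
from pathlib import Path

# hypothetical location of a prompt template checked out during the build process
BUNDLED_PROMPT_PATH = Path("prompts/rag-qa.fallback.json")

def load_bundled_prompt(variables: dict) -> list[dict]:
    # read the locally bundled template and substitute variables -- no network call,
    # so Freeplay stays out of the request's critical path
    template = json.loads(BUNDLED_PROMPT_PATH.read_text())
    return [
        {"role": m["role"], "content": m["content"].format(**variables)}
        for m in template["messages"]
    ]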
Here is what the full fallback flow looks like in code:
# RecordPayload, CallInfo, and ResponseInfo come from the Freeplay SDK (import path may
# vary by SDK version); fpClient, anthropicClient, session, and vector_search are assumed
# to be set up elsewhere in the application
import time

from freeplay import CallInfo, RecordPayload, ResponseInfo

# start timer for logging latency of the full chain
start = time.time()

# run semantic search to retrieve supporting information for the question
search_res, filter = vector_search(
    message,
    top_k=top_k,
    cosine_threshold=cosine_threshold,
    tag=tag,
    title=title,
)

# variables used to format the prompt template
prompt_vars = {
    "question": message,
    "supporting_information": str(search_res),
}

# get a formatted prompt for the primary provider from the "prod" environment
formatted_prompt = fpClient.prompts.get_formatted(
    project_id=freeplay_project_id,
    template_name="rag-qa",
    environment="prod",
    variables=prompt_vars,
)

# first try making a request with the primary provider
try:
    chat_completion = openai.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages=formatted_prompt.messages,
        **formatted_prompt.prompt_info.model_parameters,
    )
    content = chat_completion.choices[0].message.content
    # append the assistant response to the message history
    messages = formatted_prompt.all_messages(
        {"role": chat_completion.choices[0].message.role, "content": content}
    )
except Exception:
    # the primary provider failed -- fetch the prompt for our fallback provider
    formatted_prompt = fpClient.prompts.get_formatted(
        project_id=freeplay_project_id,
        template_name="rag-qa",
        environment="fallback",
        variables=prompt_vars,
    )
    chat_completion = anthropicClient.messages.create(
        model=formatted_prompt.prompt_info.model,
        system=formatted_prompt.system_content,
        messages=formatted_prompt.llm_prompt,
        **formatted_prompt.prompt_info.model_parameters,
    )
    content = chat_completion.content[0].text
    messages = formatted_prompt.all_messages(
        {"role": chat_completion.role, "content": content}
    )

# stop the timer so we can log end-to-end latency
end = time.time()

# create a record call payload so whichever call succeeded is logged back to Freeplay
record_payload = RecordPayload(
    all_messages=messages,
    inputs=prompt_vars,
    session_info=session,
    prompt_info=formatted_prompt.prompt_info,
    call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=start, end_time=end),
    response_info=ResponseInfo(is_complete=True),
)

# record the call
completion_log = fpClient.recordings.create(record_payload)
Step 5: Maintain and Update your Fallback Provider
Data naturally shifts over time. It’s important to continually test and update your fallback provider config so that it maintains parity with your primary provider as things change.
Rest Easy!
All your engineers, SREs, and managers can now sleep easier at night knowing that in the event of a provider incident you will not just continue to serve traffic, but do so with consistent quality, cost, and latency.