7/25/24 Updates

Recent Updates: New Review & Testing Features for Multi-Turn Chatbots, Improved Human Review Workflows, gpt-4o-mini and Llama 3.1!

  • Multi-turn chatbot support includes a new chat view and history objects
  • Faster, easier data reviews & workflows with Markdown rendering and a new Review panel
  • New models include gpt-4o-mini, Llama 3.1, and native Groq support!

Details below.

We’ve recently made some big changes to Freeplay’s chatbot support and to the data review process! Plus the hottest (and fastest) new models. 🔥

New Review & Testing Features for Multi-Turn Chatbots

Chatbots are one of the most common user experiences for generative AI products, and they’re often where product teams start. They’re also surprisingly hard to get right. Testing edge cases and iterating on a chatbot without introducing regressions can be particularly challenging.

To address these challenges, we’ve launched a series of new features at Freeplay that speed up the process of building, testing, evaluating, and monitoring chatbots.

  • Chat view for easy data review and trace exploration
  • Format prompts to include conversation history
  • Save (and modify) conversation history in datasets for easy testing
  • Run batch tests and automate evaluations on real-world chat scenarios

Here’s a quick demo to see how they work together in practice. You can read more about the details on our blog: Simplify Chatbot Testing and Evaluation with Freeplay
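
For a concrete picture, a “history object” is essentially an ordered list of role/content messages, the same shape most chat APIs use. Here’s a minimal sketch in plain Python; the helper function and message contents are illustrative, not Freeplay’s SDK API:

```python
# A conversation history is an ordered list of role/content messages.
# Saving it alongside a dataset example lets you replay (or edit) the exact
# conversation state that preceded a completion when you re-run tests.
history = [
    {"role": "user", "content": "Can I change my flight online?"},
    {"role": "assistant", "content": "Yes, go to Manage Booking and select your trip."},
]

def build_messages(system_prompt, history, new_user_turn):
    """Format a prompt that includes prior conversation history plus the new turn."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": new_user_turn}]
    )

messages = build_messages(
    "You are a helpful airline support agent.",
    history,
    "What if I booked through a travel agent?",
)
```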


Improved Human Review Workflows

Our customers are increasingly finding value in Freeplay for detailed inspection of LLM logs. It’s in the zeitgeist right now too: industry leaders are advising teams to look at lots of their data.

So, we’ve done a few things to make it even easier to review LLM logs and coordinate human review workflows.

  • Markdown rendering can be toggled on/off when reviewing completions. Your personal setting will be saved until you change it again.
  • New “Review” panel separates out auto-evaluations done by models or custom code functions, customer feedback logged to a completion, and labels intended only for use by human reviewers. Some customers’ evaluation lists were getting verrrryyy long; the new tabs help. 🙌
  • Better review workflows… Use and filter by the following new values to coordinate reviews across a whole team (see the sketch after this list).
    • Review Status shows the human review status for a completion. If any eval scores are updated or labels applied, it automatically moves to “In progress.” Reviewers can set the status to “Review complete” when no further updates are needed.
    • Reviewer shows avatars for any user who’s applied an evaluation score or label.
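
As a rough mental model of the status logic, here’s a short sketch. It illustrates the workflow described above, not Freeplay internals, and the “Not started” default is an assumption:

```python
from enum import Enum

class ReviewStatus(Enum):
    NOT_STARTED = "Not started"    # assumed default before anyone touches a completion
    IN_PROGRESS = "In progress"    # set automatically on any score update or label
    COMPLETE = "Review complete"   # set explicitly by a reviewer

def on_reviewer_action(status: ReviewStatus) -> ReviewStatus:
    # Updating an eval score or applying a label moves an untouched
    # completion to "In progress"; marking it complete stays explicit.
    if status is ReviewStatus.NOT_STARTED:
        return ReviewStatus.IN_PROGRESS
    return status
```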

More New Models!

OpenAI released gpt-4o-mini, which performs nearly as well as gpt-4o at roughly 1/20th the price.

Meta released Llama 3.1, and its flagship 405B model is on par with the best closed-source models.

Plus: Groq is now a supported provider! You can set up Groq with your own API key and use it to test out:

  • Gemma 7B
  • Gemma 2 9B
  • Llama 3.1 405B
  • Llama 3.1 70B
  • Llama 3.1 8B
  • Llama 3 70B
  • Llama 3 8B
  • Llama 3 Groq 70B Tool Use Preview
  • Llama 3 Groq 8B Tool Use Preview
  • Mixtral 8x7B

You can try them all out now in Freeplay. Llama 3.1 is available by default in the Freeplay prompt editor and Test feature (hosted on Amazon Bedrock), or it can be accessed with your own Groq API key.
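
If you want to sanity-check one of these models against Groq directly before wiring it into Freeplay, Groq’s API is OpenAI-compatible, so the standard OpenAI client works. A minimal sketch follows; the model ID is Groq’s Llama 3.1 70B identifier at the time of writing and may change:

```python
# Groq exposes an OpenAI-compatible endpoint, so the standard OpenAI client works.
# Assumes GROQ_API_KEY is set in your environment.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # Groq's Llama 3.1 70B model ID at time of writing
    messages=[{"role": "user", "content": "In one sentence, what is Groq known for?"}],
)
print(response.choices[0].message.content)
```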