12/12/23 Updates
Recent updates: new model support, comparison improvements, new evaluation types, and customer feedback logging
- Support for new OpenAI & Anthropic models
- Compare two test runs head to head
- New data labeling options: Custom notes & multi-select tagging
- Make customer feedback part of your learning loop
- Log custom metadata with any session
- Customize your Sessions table
See all the details here
Support for new OpenAI & Anthropic models
OpenAI's GPT models were officially updated to the 1106 versions on Monday this week, and Claude 2.1 is now the default for Anthropic's premier model. Use Freeplay to test out the new versions & compare.
If you had prompts calling the primary model without a declared version, we've automatically pinned them to keep calling the legacy versions for the sake of stability (OpenAI 0613 and Anthropic Claude 2.0). You can change the version in the prompt editor.

Compare two test runs head to head and pick your preference
Even teams that want more quantitative ways to evaluate prompt outputs and decide which is best often still prefer to just eyeball results at times. Freeplay makes it easy to launch human-in-the-loop ranking to label preferred outputs, then deploy a preferred version of the prompt.
You can now navigate to a test run, click “Compare,” and then choose to compare it head to head against another similar test run, or against the original examples from the test list it used.

New data labeling options: Custom notes & multi-select tagging
Along with the “1-5 Scale” and “Yes/No” booleans that were already supported, you can now leave comments about your labeling choices, and tag Freeplay sessions with your own custom categories to organize your data. For instance, if you score something as a bad result, tag the reason why: “not true,” “too long,” “bad format,” etc.
These can each be configured as a new evaluation criterion on any prompt. Choose “Text” for any type of free-form note, and “Multi-Select” for tags. In each case, you can give the criterion a custom name.

Make customer feedback part of your learning loop
Freeplay now supports logging both explicit feedback (thumbs up/down, comments, etc.) and implicit feedback (client events like “draft_dismissed”), which you can incorporate into analysis & prompt optimization workflows. You can pass booleans, strings, or integers through our API or using our SDKs.
Note: There’s a special case for 👍👎: use the string values POSITIVE_FEEDBACK and NEGATIVE_FEEDBACK to record positive or negative feedback, and we’ll render 👍👎 in the Freeplay app.
Customer feedback is recorded in relation to a specific Freeplay completion ID, which you’ll need to save for later reference when calling the API. Any feedback logged is filterable on the Sessions table and visible on Session details pages.
SDK docs are here.
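To make the shape of a feedback call concrete, here's a minimal sketch of building a feedback payload. The function name, endpoint shape, and field names (`completion_id`, `feedback`) are illustrative assumptions, not the actual SDK API; see the SDK docs for the real signatures.

```python
def build_feedback_payload(completion_id, **feedback):
    """Sketch: assemble customer feedback for a specific completion.

    Hypothetical helper -- field names are assumptions, not the real API.
    Values may be booleans, strings, or integers. The special strings
    POSITIVE_FEEDBACK / NEGATIVE_FEEDBACK render as thumbs up/down in the app.
    """
    for key, value in feedback.items():
        if not isinstance(value, (bool, str, int)):
            raise TypeError(f"{key}: feedback values must be bool, str, or int")
    return {"completion_id": completion_id, "feedback": dict(feedback)}


payload = build_feedback_payload(
    "completion-123",            # the completion ID you saved when recording the session
    rating="POSITIVE_FEEDBACK",  # renders as a thumbs-up in the Freeplay app
    draft_dismissed=False,       # implicit client event
    edits_made=3,                # integer metric
)
```

The key point is that feedback is always keyed to a completion ID saved at recording time, so it can be joined back to the session later.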
Log custom metadata with any session
You can now also add any custom metadata like customer ID, version of your code, embedding chunk sizes & more to your sessions.
This metadata is treated separately from the customer feedback mentioned above. It’s logged when initially recording a session, and always at the session level. In the Freeplay app it’s displayed similarly to other session metadata like cost, token count & latency. You can use it to filter & organize sessions on the Sessions table by enabling relevant columns (see below!).
SDK Docs are here.
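As a rough sketch of the distinction: custom metadata is attached once, when the session is recorded, while feedback arrives later against a completion ID. The function and field names below are illustrative assumptions, not the actual SDK API.

```python
def record_session(session_payload, custom_metadata=None):
    """Sketch: attach custom metadata when initially recording a session.

    Hypothetical helper -- names are assumptions, not the real SDK API.
    Metadata is session-level and set once at recording time, separate
    from the per-completion customer feedback API.
    """
    record = dict(session_payload)
    if custom_metadata:
        # Keep values to simple scalars so they stay filterable in the UI.
        for key, value in custom_metadata.items():
            if not isinstance(value, (bool, str, int, float)):
                raise TypeError(f"{key}: metadata values must be simple scalars")
        record["custom_metadata"] = dict(custom_metadata)
    return record


session = record_session(
    {"session_id": "sess-42"},
    custom_metadata={
        "customer_id": "cust-901",        # e.g. your own customer identifier
        "code_version": "1.8.2",          # version of your calling code
        "embedding_chunk_size": 512,      # any pipeline parameter you track
    },
)
```

Because the metadata rides along with the session record itself, it shows up next to built-in session metadata (cost, tokens, latency) and can drive columns and filters on the Sessions table.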
Customize your Sessions table
Each user on your team can personally customize what columns appear and the order of those columns, then filter & search away. It’s a powerful tool for digging deep into your Sessions data.
Quick Loom here: