12/12/23 Updates
Recent updates: new model support, comparison improvements, new evaluation types, and customer feedback logging
- Support for new OpenAI & Anthropic models
- Compare two test runs head to head
- New data labeling options: Custom notes & multi-select tagging
- Make customer feedback part of your learning loop
- Log custom metadata with any session
- Customize your Sessions table
See all the details here
Support for new OpenAI & Anthropic models
OpenAI's GPT models were officially updated to the 1106 versions on Monday this week, and Claude 2.1 is now the default for Anthropic's premier model. Use Freeplay to test out the new versions & compare.
If you had prompts calling the primary model without a declared version, we've automatically pinned them to keep calling the legacy versions for the sake of stability (OpenAI 0613 and Anthropic Claude 2.0). You can change the version in the prompt editor.

Compare two test runs head to head and pick your preference
Even teams that want more quantitative ways to evaluate prompt outputs and decide which is best often still prefer to just eyeball results at times. Freeplay makes it easy to launch human-in-the-loop ranking to label preferred outputs, then deploy a preferred version of the prompt.
You can now navigate to a test run, click “Compare,” and then choose to compare it head to head against another similar test run, or against the original examples from the test list it used.

New data labeling options: Custom notes & multi-select tagging
Along with the “1-5 Scale” and “Yes/No” booleans that were already supported, you can now leave comments about your labeling choices, and tag Freeplay sessions with your own custom categories to organize your data. For instance, if you score something as a bad result, tag the reason why: “not true,” “too long,” “bad format,” etc.
These can each be configured as a new evaluation criterion on any prompt. Choose “Text” for any type of free-form note, and “Multi-Select” for tags. In each case, you can give the criterion a custom name.

Make customer feedback part of your learning loop
Freeplay now supports logging both explicit feedback (thumbs up/down, comments, etc.) and implicit feedback (client events like “draft_dismissed”), which you can incorporate into analysis & prompt optimization workflows. You can pass booleans, strings, or integers through our API or using our SDKs.
Note: There’s a special case for 👍👎: use the string values POSITIVE_FEEDBACK and NEGATIVE_FEEDBACK to record positive or negative feedback, and we’ll render 👍👎 in the Freeplay app.
Customer feedback is recorded in relation to a specific Freeplay completion ID, which you’ll need to save for later reference when calling the API. Any feedback logged is filterable on the Sessions table and visible on Session details pages.
SDK docs are here.
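To make the shape of a feedback call concrete, here's a minimal sketch of building a feedback payload. The function name, endpoint shape, and field names (`completion_id`, `feedback`) are illustrative assumptions, not the actual SDK API; see the SDK docs for the real signatures.

```python
def build_feedback_payload(completion_id, **feedback):
    """Sketch: assemble customer feedback for a specific completion.

    Hypothetical helper -- field names are assumptions, not the real API.
    Values may be booleans, strings, or integers. The special strings
    POSITIVE_FEEDBACK / NEGATIVE_FEEDBACK render as thumbs up/down in the app.
    """
    for key, value in feedback.items():
        if not isinstance(value, (bool, str, int)):
            raise TypeError(f"{key}: feedback values must be bool, str, or int")
    return {"completion_id": completion_id, "feedback": dict(feedback)}


payload = build_feedback_payload(
    "completion-123",            # the completion ID you saved when recording the session
    rating="POSITIVE_FEEDBACK",  # renders as a thumbs-up in the Freeplay app
    draft_dismissed=False,       # implicit client event
    edits_made=3,                # integer metric
)
```

The key point is that feedback is always keyed to a completion ID saved at recording time, so it can be joined back to the session later.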
Log custom metadata with any session
You can now also add any custom metadata like customer ID, version of your code, embedding chunk sizes & more to your sessions.
This metadata is treated separately from the customer feedback mentioned above. It’s logged when initially recording a session, and always at the session level. In the Freeplay app it’s displayed similarly to other session metadata like cost, token count & latency. You can use it to filter & organize sessions on the Sessions table by enabling relevant columns (see below!).
SDK Docs are here.
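As a rough sketch of the distinction: custom metadata is attached once, when the session is recorded, while feedback arrives later against a completion ID. The function and field names below are illustrative assumptions, not the actual SDK API.

```python
def record_session(session_payload, custom_metadata=None):
    """Sketch: attach custom metadata when initially recording a session.

    Hypothetical helper -- names are assumptions, not the real SDK API.
    Metadata is session-level and set once at recording time, separate
    from the per-completion customer feedback API.
    """
    record = dict(session_payload)
    if custom_metadata:
        # Keep values to simple scalars so they stay filterable in the UI.
        for key, value in custom_metadata.items():
            if not isinstance(value, (bool, str, int, float)):
                raise TypeError(f"{key}: metadata values must be simple scalars")
        record["custom_metadata"] = dict(custom_metadata)
    return record


session = record_session(
    {"session_id": "sess-42"},
    custom_metadata={
        "customer_id": "cust-901",        # e.g. your own customer identifier
        "code_version": "1.8.2",          # version of your calling code
        "embedding_chunk_size": 512,      # any pipeline parameter you track
    },
)
```

Because the metadata rides along with the session record itself, it shows up next to built-in session metadata (cost, tokens, latency) and can drive columns and filters on the Sessions table.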
Customize your Sessions table
Each user on your team can personally customize what columns appear and the order of those columns, then filter & search away. It’s a powerful tool for digging deep into your Sessions data.
Quick Loom here: