Code Evaluations
Code Evaluations in Freeplay
In addition to Human and Model Graded evaluations, Freeplay also lets you run code-driven evaluations on the client side and log the results back to Freeplay. These evals are typically functions that you write and run in your own code path, with the results then recorded to Freeplay.
These evaluations are particularly useful for criteria that can be expressed as logical checks, such as JSON schema validation or category assertions on a single answer, or for pairwise comparisons against an expected output using methods like embedding distance or string distance (see the sketch after this list). Code evals can be added both to:
- Individual Sessions
- Test Runs executed with our SDK or API, which can include comparisons to ground truth data
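For illustration, here is a minimal sketch of two client-side eval functions of the kind described above: a JSON validity/key check and a string-similarity comparison against an expected (ground truth) output. It uses only the Python standard library, and the function names are illustrative, not part of the Freeplay SDK.

```python
import json
from difflib import SequenceMatcher


def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Code eval: check that the model output parses as JSON and
    contains every required top-level key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def string_similarity(output: str, expected: str) -> float:
    """Code eval: pairwise comparison of the model output to an expected
    output, scored 0.0-1.0 via a simple string-distance ratio. Swap in an
    embedding distance here if that better fits your criteria."""
    return SequenceMatcher(None, output.strip(), expected.strip()).ratio()


# Example usage in your own code path:
model_output = '{"answer": "42", "sources": ["doc1"]}'     # from your LLM call
expected_output = '{"answer": "42", "sources": ["doc1"]}'  # ground truth for a test case

results = {
    "valid_json": is_valid_json_with_keys(model_output, {"answer", "sources"}),
    "similarity_to_expected": string_similarity(model_output, expected_output),
}
```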
In either case, any results you log to Freeplay flow through to the UI just like human or model-graded evals. See our SDK documentation for more details.
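Once computed, results like these are attached to the relevant session or test run through the SDK. The snippet below is a hypothetical sketch only: `freeplay_client` and `record_eval_results` are illustrative placeholders, not actual Freeplay SDK names. Refer to the SDK documentation for the real logging calls.

```python
# Hypothetical sketch: the client object and method name below are
# placeholders, not the actual Freeplay SDK API.
def log_code_evals(freeplay_client, session_id: str, results: dict) -> None:
    """Attach client-side eval results to a Freeplay session so they show
    up in the UI alongside human and model-graded evals."""
    freeplay_client.record_eval_results(  # placeholder method name
        session_id=session_id,
        eval_results=results,  # e.g. {"valid_json": True, "similarity_to_expected": 0.92}
    )
```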

Now review each evaluation type, then move on to test runs once all your evaluations are configured!