Freeplay is a single platform for managing the end-to-end AI application development lifecycle for your entire team. It gives engineers, data scientists, product managers, subject matter experts, designers, and every other team member the ability to review sessions, create datasets, experiment with changes, evaluate and test those iterations, and deploy updates.
Core Concepts and Features
Prompt Templates: A Key Differentiator
Freeplay is built around prompt templates that separate input variables from message content and are versioned with every change. This opinionated stance on prompt structure makes Freeplay fundamentally different from other AI platforms in two critical ways:
- It gives every other part of the platform access to a prompt template’s input variables, versions, outputs, and related metadata. This granularity improves every aspect of Freeplay, from evaluation fidelity to dataset creation and test management.
- Optionally, any user can be granted the ability to create, version, and deploy prompts. This lets subject matter experts and other team members iterate outside of engineering processes, which produces a step change in iteration speed and can reduce the workload for engineering teams.
While some Freeplay features are usable without prompt templates, most are much more powerful with them.
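To make the idea concrete, here is a minimal sketch of a versioned prompt template that keeps input variables separate from message content. The class, fields, and rendering logic are illustrative assumptions, not Freeplay’s actual SDK types.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a versioned prompt template. The class and fields are
# illustrative only and are not Freeplay's actual SDK types.
@dataclass
class PromptTemplate:
    name: str
    version: int
    messages: list[dict]          # e.g. [{"role": "system", "content": "..."}]
    input_variables: list[str] = field(default_factory=list)

    def render(self, **inputs: str) -> list[dict]:
        """Fill the input variables into the message templates."""
        missing = set(self.input_variables) - set(inputs)
        if missing:
            raise ValueError(f"Missing input variables: {missing}")
        return [
            {"role": m["role"], "content": m["content"].format(**inputs)}
            for m in self.messages
        ]

# A v3 summarization template: the {article_text} input stays separate from the
# messages, so evals, datasets, and tests can reference it directly.
summarize_v3 = PromptTemplate(
    name="summarize-article",
    version=3,
    messages=[
        {"role": "system", "content": "You are a concise summarizer."},
        {"role": "user", "content": "Summarize the following article:\n\n{article_text}"},
    ],
    input_variables=["article_text"],
)
rendered_messages = summarize_v3.render(article_text="Revenue rose 12% in Q3...")
```

Because the template itself knows its inputs and version, every downstream feature can work with that structure rather than with an opaque prompt string.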
Observability
Monitor LLM completions in real time with powerful search and expressive graphs that chart every parameter in a prompt template. Track costs, latency, and performance across sessions and traces. Leverage features like saved searches, bookmarked metrics, and automations to help your team continually refer to and act on key metrics.
Freeplay’s flexible integration strategy supports simple trace logging through OTEL and common AI frameworks, alongside the ability to tightly integrate with prompt management.
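As a rough illustration of the OTEL-style integration path, the sketch below records an LLM completion as an OpenTelemetry span with prompt metadata attached as attributes. It assumes the standard OpenTelemetry Python SDK; the attribute names and the `call_model` helper are hypothetical, not a prescribed Freeplay schema.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Minimal OpenTelemetry setup; in practice the exporter would point at your
# collector endpoint instead of the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("my-llm-app")

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "summary of: " + prompt[:40]

def summarize(article_text: str) -> str:
    # Record the completion as a span; the attribute names are illustrative,
    # not a prescribed schema.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("prompt.template", "summarize-article")
        span.set_attribute("prompt.version", 3)
        span.set_attribute("prompt.input.article_text", article_text)
        output = call_model(article_text)
        span.set_attribute("completion.output", output)
        return output
```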
Observability does not require prompt templates, but Freeplay shines when they are integrated. Every session’s metadata becomes more granular, and search, graphs, and automations can reference specific inputs and outputs. Perhaps most importantly, any observability session can become a test case in a dataset or be opened in the prompt template editor for iteration.
Learn more
Evaluations
Freeplay supports a robust set of scorers, including LLM-as-judge evaluations, code-based evaluations, and customizable human labels. These tools let teams measure the quality of both production logs and offline test runs. Auto-categorization evals classify traces and completions against any criteria, and Freeplay’s evaluation alignment process aligns your model-graded evals to your team’s standards.
When prompt templates are integrated into your application, evaluations can reference every input variable directly, providing a level of granularity that is not possible in other platforms.
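For a sense of what a code-based scorer might look like, here is a minimal sketch that checks a completion against one of the prompt’s input variables. The function signature, field names, and return shape are assumptions for illustration, not Freeplay’s evaluator interface.

```python
# Hypothetical code-based scorer: the function signature and return shape are
# illustrative, not Freeplay's actual evaluator interface.
def cites_source_title(inputs: dict, output: str) -> dict:
    """Pass if the completion mentions the article title it was given."""
    title = inputs.get("article_title", "")
    passed = bool(title) and title.lower() in output.lower()
    return {"score": 1.0 if passed else 0.0, "label": "pass" if passed else "fail"}

# Because prompt templates expose input variables, the scorer can reference
# article_title directly instead of re-parsing a raw prompt string.
result = cites_source_title(
    inputs={"article_title": "Q3 Earnings Recap"},
    output="Summary: the Q3 Earnings Recap reports 12% revenue growth.",
)
```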
Learn more
Datasets
Curate test datasets from production logs or CSV uploads, or quickly create them from the prompt editor. Each test case can include ground-truth labels, enabling benchmark sets that compare outputs against expected results across prompt changes and model configurations.
When your application uses prompt templates, datasets can be created from any individual observability session, in bulk from any set of search results, or via recurring automations on saved searches.
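As an illustration of the kind of data a test case with a ground-truth label carries, here is a hypothetical CSV layout and loader; the column names and dictionary shape are assumptions, not a required Freeplay format.

```python
import csv
import io

# Illustrative CSV of test cases: one column per input variable plus a
# ground-truth column. The column names are assumptions, not a required format.
CSV_DATA = """article_text,expected_summary
"Revenue rose 12% in Q3 on strong cloud demand.","Q3 revenue grew 12%, driven by cloud."
"The board approved a 2-for-1 stock split.","A 2-for-1 stock split was approved."
"""

def load_test_cases(raw_csv: str) -> list[dict]:
    """Parse each row into prompt inputs plus a ground-truth label."""
    cases = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        cases.append({
            "inputs": {"article_text": row["article_text"]},
            "ground_truth": row["expected_summary"],
        })
    return cases

dataset = load_test_cases(CSV_DATA)
```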
Learn more
Test Runs
Run automated batch tests of individual prompt versions as well as complex multi-step workflows and agents, and compare them head-to-head. Execute tests from the UI or SDK with full eval scoring. Each evaluator in a component produces interactive scores that help you drill down on the test cases behind each outcome, and individual test case comparisons can also be evaluated and scored manually.
Freeplay’s prompt templates improve test runs through the “flywheel effect” they create: real examples from testing and production logs can easily be saved as test cases.
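Conceptually, a batch test run iterates a dataset through a prompt version, calls the model, and scores each output. The sketch below shows that loop with stand-in helpers; none of the names reflect Freeplay’s actual test-run API.

```python
# Conceptual batch test run: render each case with a prompt version, call a
# model, and score every output. All names here are illustrative stand-ins,
# not Freeplay's test-run API.
def call_model(prompt: str) -> str:
    return "summary of: " + prompt[:40]      # stand-in for a real LLM call

def exact_match(case: dict, output: str) -> float:
    return 1.0 if output.strip() == case["ground_truth"].strip() else 0.0

def run_batch_test(render_prompt, dataset, scorers) -> list[dict]:
    results = []
    for case in dataset:
        prompt = render_prompt(**case["inputs"])
        output = call_model(prompt)
        scores = {name: fn(case, output) for name, fn in scorers.items()}
        results.append({"case": case, "output": output, "scores": scores})
    return results

# Compare two prompt versions head-to-head on the same dataset.
dataset = [{"inputs": {"article_text": "Revenue rose 12% in Q3."},
            "ground_truth": "Q3 revenue grew 12%."}]
prompt_v2 = lambda article_text: f"Summarize briefly: {article_text}"
prompt_v3 = lambda article_text: f"Summarize in one sentence: {article_text}"
for name, version in [("v2", prompt_v2), ("v3", prompt_v3)]:
    run = run_batch_test(version, dataset, {"exact_match": exact_match})
    print(name, [r["scores"] for r in run])
```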
Learn more
Review Queues
Organize human review workflows by assigning completions to team members from observability records, search results, and automations. Each review results in an insight report that can be used to turn your team’s observations into improvements.