Leveraging an iterative flywheel
With Freeplay, your AI applications are in a state of constant improvement
You've created your first prompt and integrated observability. Congratulations! But this is just the beginning: the real power of Freeplay comes from iterating and evaluating those iterations.
To unlock this powerful workflow, you need to:
- Create datasets - Datasets let you repeatedly test against the same inputs and compare results against both a ground-truth expected output and evaluation scores.
- Create evaluations - Evaluations let you score and label records manually (human labels) or automatically (auto-categorization, LLM-as-Judge evaluations).
- Run a test - Tests leverage your evaluations and datasets to compare changes to prompts and model settings (see the sketch after this list).
- Deploy - When you change a prompt template in Freeplay and deploy it, observability sessions let you track that version's performance and compare it against other versions.
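To make the loop concrete, here is a minimal, illustrative sketch of what a test run does conceptually: iterate over a dataset of inputs with expected outputs, generate a completion for each prompt version, and score every result with the same evaluation so versions can be compared. The `generate` function, the `exact_match_score` evaluator, and the inline dataset are hypothetical placeholders, not Freeplay SDK calls; in practice Freeplay manages prompts, datasets, evaluations, and test runs for you.

```python
# Illustrative only: a hand-rolled version of the dataset -> evaluation -> test loop.
# None of these names are Freeplay APIs; they are placeholders for the concepts above.

dataset = [
    {"inputs": {"question": "What is 2 + 2?"}, "expected_output": "4"},
    {"inputs": {"question": "Capital of France?"}, "expected_output": "Paris"},
]

prompt_versions = {
    "v1": "Answer briefly: {question}",
    "v2": "You are a precise assistant. Answer with only the final answer: {question}",
}

def generate(prompt_template: str, inputs: dict) -> str:
    """Placeholder for a real model call via your LLM provider's SDK."""
    # A canned lookup keeps this sketch runnable without network access.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned[inputs["question"]]

def exact_match_score(output: str, expected: str) -> float:
    """A trivial evaluation; in Freeplay this could be a human label or an LLM-as-Judge."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# A "test" compares prompt versions against the same dataset and evaluations.
for version, template in prompt_versions.items():
    scores = []
    for row in dataset:
        output = generate(template, row["inputs"])
        scores.append(exact_match_score(output, row["expected_output"]))
    print(f"{version}: mean score = {sum(scores) / len(scores):.2f} over {len(scores)} rows")
```

The version that scores best across the dataset is the one you would deploy, and observability sessions then let you confirm it holds up in production.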
Every review session and test reveals insights, each insight drives improvements, and every improvement compounds over time, making your system better and better!
Work Your Way
There's no prescribed order—work the way that makes sense for your workflow. Discover an edge case in production? Create a new dataset to test it. Want to measure a different quality dimension? Add another evaluation. Trying out a radically different approach? Test prompt variations side-by-side to see which performs better. The system adapts to how you work, not the other way around.
This iterative approach is how teams systematically improve their AI application's quality over time. Small improvements compound, edge cases get handled, and your prompts get better with every cycle.
