Freeplay Introduction
How teams observe, evaluate, and iterate toward great AI applications
Overview
Freeplay is a single platform to manage the end-to-end AI application development lifecycle for your entire team. It gives product development teams the power to review sessions, experiment with changes, evaluate and test those iterations, and deploy AI features.
Core Concepts
Master the foundational features that power your LLM workflow. These guides will help you build, test, and improve your AI applications systematically.
Prompt Management
Create, version, and deploy your prompts with Freeplay's template editor. Manage variables, test outputs, and iterate quickly across environments.
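To make the workflow concrete, here is a minimal Python sketch of what environment-scoped prompt templating looks like in principle. The function and field names below are illustrative placeholders, not the Freeplay SDK; see the SDK reference for the real interface.

```python
# Illustrative sketch only -- names below are placeholders, not the Freeplay SDK.
from string import Template


def get_prompt(template_name: str, environment: str) -> dict:
    """Stand-in for fetching a versioned prompt template deployed to an environment."""
    templates = {
        ("support-answer", "production"): {
            "text": "You are a support agent. Answer the question: $question",
            "model": "gpt-4o",
            "temperature": 0.2,
        },
    }
    return templates[(template_name, environment)]


prompt = get_prompt("support-answer", environment="production")
rendered = Template(prompt["text"]).substitute(question="How do I reset my password?")
print(rendered)          # formatted prompt ready to send to the model
print(prompt["model"])   # model config is versioned alongside the prompt text
```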
Observability
Monitor LLM completions in real-time with searchable logs, filters, and graphs. Track costs, latency, and performance across sessions and traces.
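As a rough illustration of what a completion log captures, the sketch below assembles a record with prompt, model, latency, and cost details. The field names are assumptions chosen for this example, not Freeplay's logging schema.

```python
# Illustrative record shape only -- field names are assumptions, not Freeplay's schema.
import time
import uuid
from dataclasses import dataclass, asdict


@dataclass
class CompletionRecord:
    session_id: str
    prompt_template: str
    prompt_version: str
    model: str
    latency_ms: float
    cost_usd: float
    response: str


start = time.monotonic()
# ... call your LLM provider here ...
response_text = "You can reset your password from the account settings page."

record = CompletionRecord(
    session_id=str(uuid.uuid4()),
    prompt_template="support-answer",
    prompt_version="v3",
    model="gpt-4o",
    latency_ms=(time.monotonic() - start) * 1000,
    cost_usd=0.0021,
    response=response_text,
)
print(asdict(record))  # the kind of detail that makes logs searchable and graphable
```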
Evaluations
Build custom model-graded, code-based, and human evals to measure quality. Align auto-evals with your team's standards for reliable testing.
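For a sense of what a code-based eval looks like, here is a small sketch with two deterministic checks scored between 0 and 1; model-graded evals follow the same shape but delegate the judgment to a grading prompt. The function names are illustrative, not part of Freeplay.

```python
# Illustrative code-based evals -- simple deterministic checks scored 0..1.
def contains_citation(answer: str) -> float:
    """Pass if the answer cites a source in [brackets]."""
    return 1.0 if "[" in answer and "]" in answer else 0.0


def within_length_budget(answer: str, max_words: int = 150) -> float:
    """Pass if the answer stays under the word budget."""
    return 1.0 if len(answer.split()) <= max_words else 0.0


answer = "Reset your password from account settings [Help Center: Passwords]."
scores = {
    "contains_citation": contains_citation(answer),
    "within_length_budget": within_length_budget(answer),
}
print(scores)  # {'contains_citation': 1.0, 'within_length_budget': 1.0}
```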
Review Queues
Organize human review workflows by assigning completions to team members. Generate insight reports and turn observations into improvements.
Datasets
Curate test datasets from production logs or CSV uploads. Build benchmark sets with ground truth labels for comprehensive testing.
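As an illustration of the structure a benchmark dataset has, the sketch below reads rows of input variables plus a ground-truth label from CSV text. The column names are assumptions made for the example.

```python
# Illustrative sketch -- a benchmark dataset as rows of input variables plus ground truth.
import csv
import io

# In practice this would be a CSV exported from production logs or curated by hand.
csv_text = """question,ground_truth
How do I reset my password?,Use the reset link on the sign-in page.
How do I delete my account?,Open account settings and choose Delete account.
"""

dataset = list(csv.DictReader(io.StringIO(csv_text)))
for row in dataset:
    print(row["question"], "->", row["ground_truth"])
```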
Test Runs
Run automated batch tests to compare prompt versions head-to-head. Execute tests from the UI or SDK with full eval scoring.
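To show the head-to-head idea in miniature, the sketch below runs the same dataset through two hypothetical prompt versions and averages a single eval score for each. Everything here is a stand-in: the prompt versions, the fake model call, and the eval are invented for illustration.

```python
# Illustrative batch test -- compare two prompt versions over a dataset with one eval score.
from statistics import mean


def run_prompt(version: str, question: str) -> str:
    """Stand-in for calling your LLM with a given prompt version."""
    suffix = " [Help Center]" if version == "v2" else ""
    return f"Answer to: {question}{suffix}"


def cites_source(answer: str) -> float:
    return 1.0 if "[Help Center]" in answer else 0.0


dataset = ["How do I reset my password?", "How do I delete my account?"]

for version in ("v1", "v2"):
    scores = [cites_source(run_prompt(version, q)) for q in dataset]
    print(version, "cites_source =", mean(scores))
# v1 cites_source = 0.0
# v2 cites_source = 1.0
```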
Core Benefits
Production Observability
See how your AI systems are behaving across environments in real time, including prompt and response details, customer feedback, and evaluation scores for your production logs.
Prompt & Model Versioning and Deployments
Manage and version prompt templates across environments, including your prompt text and model configurations. Push changes straight to your code like a feature flag -- no code deploy required.
Custom Evaluations
Create a custom suite of evals specific to your product experience. Use them for both production logs and offline experiments, so you can spot issues and quantify improvements as you update your prompts, models, RAG pipelines, and code.
Easy Batch Tests
Any time you change prompts, models, or any other part of your pipeline, you can quickly test at scale with your own custom datasets and real examples from production logs. Anyone on your team can launch new tests from the Freeplay UI or from your code, including from CI. Iterate with confidence.
Multi-Player Review Workflows
Set up custom filters and queues for your whole team, and collaborate to review production logs and test results.
Label and Curate Datasets
Launch human labeling jobs and curate custom datasets from your application logs, which you can then use for testing and fine-tuning.
