Human Labeled Evaluations
Why Human Evaluation Matters
Human evaluation is essential for measuring dimensions of quality that automated evals can't capture—such as highly sensitive cases, judgments that require subject-matter expert (SME) knowledge, or nuanced accuracy in specialized domains. While model-graded and code-based evals scale efficiently, human judgment remains the gold standard for subjective quality measures and for validating that your automated evals are actually measuring what matters.
Common Use Cases
- Spot-checking production quality: Regularly sample a subset of production completions to ensure your LLM maintains quality standards. This catches issues that automated evals might miss and helps calibrate your team's understanding of "good" vs. "bad" outputs.
- Building ground truth datasets: Create labeled datasets that become the foundation for model-graded evaluations. Human labels serve as the "answer key" that trains and validates your automated evaluation layer.
- Measuring subjective dimensions: Evaluate qualities like helpfulness, empathy, tone appropriateness, or creativity—aspects where human judgment is more reliable than algorithmic scoring.
- Debugging edge cases: When automated evals flag unusual patterns or when users report issues, human review helps you understand what's actually happening and whether it's a real problem.
- Calibrating automated evals: Compare human labels to model-graded eval scores to measure alignment. This validates whether your automated evals are trustworthy enough to use at scale.
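To illustrate the calibration use case, the sketch below compares human labels against automated eval results for the same completions and reports raw agreement alongside Cohen's kappa (which corrects for chance agreement). It assumes both label sets have already been exported as parallel lists; the helper function is illustrative and not part of the Freeplay SDK.

```python
from collections import Counter

def agreement_and_kappa(human_labels, auto_labels):
    """Compare human labels with automated eval labels on the same completions.

    Returns (raw_agreement, cohens_kappa). Labels are categorical,
    e.g. "pass" / "fail".
    """
    assert len(human_labels) == len(auto_labels) and human_labels
    n = len(human_labels)

    # Observed agreement: fraction of completions where both sources agree.
    observed = sum(h == a for h, a in zip(human_labels, auto_labels)) / n

    # Expected chance agreement, from each source's label distribution.
    human_counts = Counter(human_labels)
    auto_counts = Counter(auto_labels)
    expected = sum(
        (human_counts[label] / n) * (auto_counts[label] / n)
        for label in set(human_counts) | set(auto_counts)
    )

    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Example: human spot-check labels vs. model-graded eval results.
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
auto  = ["pass", "fail", "fail", "pass", "fail", "pass"]
obs, kappa = agreement_and_kappa(human, auto)
print(f"Raw agreement: {obs:.2f}, Cohen's kappa: {kappa:.2f}")
```

A high kappa suggests the automated eval can be trusted at scale; a low one means the eval prompt or rubric still needs work.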
Getting Started with Human Evaluation
Freeplay makes it easy for your team to label sessions directly in the platform. Team members can label individual sessions or filter groups of sessions that share common criteria (e.g., weekly spot checks of production data, all completions from a specific user segment, or sessions where automated evals flagged potential issues).
1. Invite your team
Navigate to Settings > Account > New user to add team members. Only Admins can invite new users. Consider inviting domain experts, product managers, or customer success team members—the people who understand quality in your specific context.
2. Browse and filter sessions
Use the Observability tab to filter sessions based on date ranges, eval scores, user feedback, custom metadata, or any other criteria. This helps you focus human review time on the sessions that matter most.
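The same filtering logic can be applied programmatically if you export session records for offline analysis. The sketch below is a minimal example, assuming sessions are available as a list of dictionaries; the field names (`created_at`, `eval_score`, `segment`) are illustrative and not the Freeplay export schema. It draws a few sessions per user segment so the review covers representative slices, in line with the stratified sampling practice described later.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical exported session records; field names are illustrative only.
sessions = [
    {"id": "s-001", "created_at": datetime(2024, 6, 1), "eval_score": 0.4, "segment": "free"},
    {"id": "s-002", "created_at": datetime(2024, 6, 2), "eval_score": 0.9, "segment": "enterprise"},
    {"id": "s-003", "created_at": datetime(2024, 6, 3), "eval_score": 0.2, "segment": "free"},
    # ...more records
]

def sample_for_review(records, since, max_score, per_segment, seed=7):
    """Filter to recent, low-scoring sessions, then sample a few per user segment
    so the spot check covers every segment rather than just the noisiest one."""
    random.seed(seed)
    by_segment = defaultdict(list)
    for r in records:
        if r["created_at"] >= since and r["eval_score"] <= max_score:
            by_segment[r["segment"]].append(r)

    sample = []
    for segment, candidates in by_segment.items():
        sample.extend(random.sample(candidates, min(per_segment, len(candidates))))
    return sample

# Weekly spot check: low-scoring sessions from the last 7 days (fixed date for the example).
week_ago = datetime(2024, 6, 4) - timedelta(days=7)
to_review = sample_for_review(sessions, since=week_ago, max_score=0.5, per_segment=10)
print([s["id"] for s in to_review])
```

The resulting session IDs can then be pulled up in the Observability tab or added to a Review Queue for labeling.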
3. Apply labels
Navigate to individual sessions and apply labels in the Evaluation section of the sidebar. Hover over the tooltips to see the evaluation criteria and instructions you configured when creating the eval. Labels you apply here feed directly into your evaluation analytics and can be used to build training datasets.
4. Review in batches
For efficiency, add filtered sessions to a Review Queue. This creates a dedicated workspace where team members can systematically work through sessions, apply labels, leave comments, and track progress—perfect for regular spot-checking workflows.
Best Practices
- Start small: Begin with a manageable sample size (10-20 sessions) to calibrate your team's understanding of the evaluation criteria
- Create clear rubrics: Define specific, actionable criteria in your evaluation instructions so different team members label consistently
- Track inter-rater reliability: Have multiple people label the same sessions to measure agreement and refine your criteria (see the sketch after this list)
- Use stratified sampling: When spot-checking production, sample across different user segments, time periods, or use cases to get representative coverage
- Close the loop: Share insights from human evaluation with your engineering team to improve prompts, tune automated evals, or identify training needs
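To make the inter-rater reliability practice concrete, here is a minimal sketch that compares two reviewers' labels on the same sessions and surfaces the disagreements worth discussing. The dictionary structure and session IDs are assumptions for illustration; the labels could just as easily come from a CSV export.

```python
# Labels exported per reviewer, keyed by session ID (structure is illustrative,
# not the Freeplay export format).
reviewer_a = {"s-001": "pass", "s-002": "fail", "s-003": "pass", "s-004": "pass"}
reviewer_b = {"s-001": "pass", "s-002": "pass", "s-003": "pass", "s-004": "fail"}

# Only compare sessions that both reviewers labeled.
shared = sorted(set(reviewer_a) & set(reviewer_b))
matches = [sid for sid in shared if reviewer_a[sid] == reviewer_b[sid]]
disagreements = [sid for sid in shared if reviewer_a[sid] != reviewer_b[sid]]

print(f"Raw agreement: {len(matches) / len(shared):.0%} on {len(shared)} shared sessions")
print("Review these together to tighten the rubric:", disagreements)

# For a chance-corrected score, pass the aligned label lists to the
# agreement_and_kappa helper from the calibration sketch above:
#   agreement_and_kappa([reviewer_a[s] for s in shared], [reviewer_b[s] for s in shared])
```

Disagreements are often the fastest way to find ambiguous rubric language, so review them together before scaling up labeling.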
Now review each evaluation type and then move on to test runs once all your evaluations are configured!
