When you’re iterating on an agent — updating prompts and tools, running benchmarks and regression tests — you eventually hit a wall. You’ve fixed the obvious issues, your scores have plateaued, and you’re not sure what to try next. Test Run Insights analyze every test case in a batch test run, compare results to ground truth and across multiple versions when relevant, and identify patterns across failures to recommend where to focus next.

How they work

After a test run completes, the Insights agent reviews the full set of results and looks for patterns across individual test cases. Rather than reviewing each failure one by one, the agent groups failures into themes and highlights the most impactful areas for improvement. The inputs to Test Run Insights include:
  • Test case results — pass/fail outcomes and scores across all samples in the run
  • Evaluation reasoning — the rationale behind each score from LLM judges or other evaluators
  • Version comparisons — when multiple versions are tested, differences in performance across versions
For each insight, you get a description of the failure pattern, links to the specific test cases that match, and a count of affected samples to help you prioritize.
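To make the shape of this concrete, here is a minimal sketch of the grouping step described above. All names and data shapes are hypothetical illustrations, not Freeplay's actual API: it buckets failed test cases by a crude theme key (the first sentence of the evaluator's reasoning) and emits one insight per theme with a description, the matching test case IDs, and an affected-sample count.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical input shape: per-test-case results plus evaluator reasoning.
@dataclass
class TestCaseResult:
    case_id: str
    passed: bool
    score: float
    eval_reasoning: str  # rationale from an LLM judge or other evaluator

# Hypothetical output shape: one insight per failure theme.
@dataclass
class Insight:
    description: str      # the failure pattern
    case_ids: list[str]   # the specific test cases that match
    affected_count: int   # samples affected, to help prioritize

def group_failures(results: list[TestCaseResult]) -> list[Insight]:
    """Toy stand-in for the Insights agent: group failures into themes
    rather than reviewing each one individually."""
    themes: dict[str, list[str]] = defaultdict(list)
    for r in results:
        if not r.passed:
            # Crude theme key; the real agent does pattern analysis.
            theme = r.eval_reasoning.split(".")[0]
            themes[theme].append(r.case_id)
    insights = [
        Insight(description=t, case_ids=ids, affected_count=len(ids))
        for t, ids in themes.items()
    ]
    # Surface the most impactful themes first.
    insights.sort(key=lambda i: i.affected_count, reverse=True)
    return insights

results = [
    TestCaseResult("tc-1", False, 0.2, "Missed required citation. No sources given."),
    TestCaseResult("tc-2", False, 0.3, "Missed required citation. Links absent."),
    TestCaseResult("tc-3", True, 0.9, "Correct and well sourced."),
    TestCaseResult("tc-4", False, 0.1, "Hallucinated a tool result. Invented data."),
]
for insight in group_failures(results):
    print(insight.affected_count, insight.description)
```

The sort by affected count mirrors the prioritization idea: the theme touching the most samples is the most promising place to focus next.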
Test Run Insights are currently available to select design partners. Reach out to your Freeplay contact to learn more.