10/17/2025

🚀

We're helping you close the data flywheel faster. You can now curate golden dataset samples directly from production data, letting you build test cases with verified results right from real-world usage.

We've also shipped new dataset CRUD endpoints to support CI/CD workflows—making it simple to create and update datasets, manage test runs, and specify which evals to run against a test set via the API for targeted testing.

Read the full details in the post: Production to Testing: Golden Datasets, Enhanced Evals & New Models

Here's the full rundown:

Better Trace Display

  • Improved readability — Trace display now formats inputs and outputs, defaulting to JSON when applicable.

Structured Outputs

Evals & Dataset Curation

  • Test case curation is more flexible — Optionally refine examples before adding them to datasets, and attach tool calls and media to reference inputs and outputs for more robust test scenarios.
  • Bulk auto-evaluations — Run evaluations across multiple completions at once with new UI controls.
  • Streamlined eval workflow — Auto-evaluations trigger when completions hit review queues, and results display directly in the Results page.

New Model Support

  • Claude Haiku 4.5
  • AWS Nova models via the Converse API, with multimedia and tool calls
  • Latest Gemini models, with a fix for tool use
  • Bedrock Converse support with adapter and flavor routing

New APIs

  • Dataset endpoints for getting, updating, and deleting Prompt and Agent datasets
  • Paginated Agents API with optional name filtering
  • Updated OpenAPI reference docs with improved authentication