10/17/2025

🚀

We're helping you close the data flywheel faster. You can now curate golden dataset samples directly from production data, letting you build test cases with verified results right from real-world usage.

We've also shipped new dataset CRUD endpoints to support CI/CD workflows—making it simple to create and update datasets, manage test runs, and specify which evals to run against a test set via the API for targeted testing.

Read the full details in the post: Production to Testing: Golden Datasets, Enhanced Evals & New Models

Here's the full rundown:

Better Trace Display

  • Improved readability — Trace display now formats inputs and outputs, defaulting to JSON when applicable.

Structured Outputs

Evals & Dataset Curation

  • Test case curation is more flexible — Optionally refine examples before adding them to datasets, and attach tool calls and media to reference inputs and outputs for more robust test scenarios.
  • Bulk auto-evaluations — Run evaluations across multiple completions at once with new UI controls.
  • Streamlined eval workflow — Auto-evaluations trigger when completions hit review queues, and results display directly in the Results page.

New Model Support

  • Claude Haiku 4.5
  • AWS Nova models via the Converse API, with multimedia and tool calls
  • Latest Gemini models, with a fix for tool use
  • Bedrock Converse support with adapter and flavor routing

New APIs

  • Dataset endpoints for getting, updating, and deleting Prompt and Agent datasets
  • Paginated Agents API with optional name filtering
  • Updated OpenAPI reference docs with improved authentication