Improved
October 17, 2025 · Morgan Cox
We're helping you close the data flywheel faster. You can now curate golden dataset samples directly from production data, letting you build test cases with verified results right from real-world usage.
We've also shipped new dataset CRUD endpoints to support CI/CD workflows—making it simple to create and update datasets, manage test runs, and specify which evals to run against a test set via the API for targeted testing.
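As a rough sketch of what a CI/CD step using these endpoints might look like: the base URL, endpoint paths, and payload fields below are hypothetical placeholders, not the documented API; consult the API reference for the real routes and schemas.

```python
import json

# Hypothetical base URL and routes for illustration only.
BASE_URL = "https://api.example.com/v1"  # placeholder, not the real host

def build_dataset_request(name: str, samples: list[dict]) -> dict:
    """Assemble a create-dataset request a CI job could send via HTTP POST."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/datasets",  # hypothetical route
        "body": {"name": name, "samples": samples},
    }

def build_test_run_request(dataset_id: str, eval_names: list[str]) -> dict:
    """Assemble a test-run request that targets specific evals for a test set."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/test-runs",  # hypothetical route
        "body": {"dataset_id": dataset_id, "evals": eval_names},
    }

# Example: run only two evals against one dataset in a pipeline step.
req = build_test_run_request("ds_123", ["faithfulness", "tone"])
print(json.dumps(req["body"]))
```

The point is the shape of the workflow: create or update a dataset, then kick off a run scoped to the evals you care about, all from the pipeline.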

Production to Testing: Golden Datasets, Enhanced Evals & New Models
Here's the full rundown:
Better Trace Display
- Improved readability — Trace display now formats inputs and outputs, defaulting to JSON when applicable.
Structured Outputs
- OpenAI Structured Output support — Unified tools and output-mode editing in the prompt editor enables structured outputs across OpenAI and Azure providers.
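For context on what the editor is configuring under the hood: OpenAI's Chat Completions API (and Azure OpenAI's compatible endpoint) accept a JSON-schema `response_format` like the sketch below. The schema contents here are illustrative; check the current OpenAI API reference for the exact field set.

```python
# JSON-schema response_format for OpenAI structured outputs.
# Field names follow OpenAI's Chat Completions API; the schema body
# ("eval_verdict" and its properties) is a made-up example.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "eval_verdict",  # illustrative schema name
        "strict": True,          # ask the model to match the schema exactly
        "schema": {
            "type": "object",
            "properties": {
                "score": {"type": "integer"},
                "reason": {"type": "string"},
            },
            "required": ["score", "reason"],
            "additionalProperties": False,
        },
    },
}
```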
Evals & Dataset Curation
- Test case curation is more flexible — Curate examples before adding them to datasets, and include tool calls and media in reference inputs and outputs for more robust test scenarios.
- Bulk auto-evaluations — Run evaluations across multiple completions at once with new UI controls.
- Streamlined eval workflow — Auto-evaluations trigger when completions hit review queues, and results display directly in the Results page.
New Model Support
- Claude Haiku 4.5
- AWS Nova models via the Converse API, with multimedia and tool-call support
- Latest Gemini models, with tool-use fixes
- Bedrock Converse support with adapter and flavor routing
New APIs
- Dataset endpoints for getting, updating, and deleting Prompt and Agent datasets
- Paginated Agents API with optional name filtering
- Updated OpenAPI reference docs with improved authentication
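A minimal sketch of consuming the paginated Agents API with a name filter: the parameter names (`page`, `name`) and the stubbed fetcher are assumptions for illustration; the OpenAPI reference docs define the real query parameters.

```python
# Hypothetical pagination loop; parameter names are illustrative.
def list_agents_pages(fetch_page, name_filter=None):
    """Yield agents across pages until an empty page is returned.

    `fetch_page(params)` stands in for an HTTP GET against the agents
    endpoint and returns one page of results as a list of dicts.
    """
    page = 1
    while True:
        params = {"page": page}
        if name_filter:
            params["name"] = name_filter  # optional name filtering
        batch = fetch_page(params)
        if not batch:
            return
        yield from batch
        page += 1

# Stubbed fetcher for demonstration: two pages of results, then an empty one.
pages = {1: [{"name": "support-bot"}], 2: [{"name": "triage-bot"}], 3: []}
agents = list(list_agents_pages(lambda p: pages[p["page"]]))
```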