January 2026
January 1, 2026
Insights general availability
Generate automated insights from your evaluation data to identify patterns across sessions and spans. Includes bulk actions, component filtering, and dedicated detail pages. Learn more →
SDKs now open source
Python and Node.js SDKs are now available under the Apache-2.0 license.
🖥️ Platform
- Run all evaluations button — Trigger evaluation runs for all completions and traces in a session with a single click.
- Compound search filters — Build complex queries with AND, OR, and NOT operators plus BETWEEN for numeric and date ranges. Filter by trace cost, evaluation notes, agent names, latency, and custom metadata (see the sketch after this list). Explore the Search API →
- CSV export for traces — Export trace data directly from the observability view for offline analysis.
- Bulk dataset operations — Select multiple rows in datasets to bulk delete, duplicate, or move test cases. Sort by name, compatibility, or creation date with shareable URL parameters.
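As a purely illustrative sketch of the kind of compound query this enables (the endpoint, field names, and filter grammar below are hypothetical stand-ins, not Freeplay's actual Search API schema; see the linked API docs for the real shape):

```python
# Hypothetical sketch only: the endpoint, field names, and filter grammar are
# stand-ins to illustrate AND/OR/NOT plus BETWEEN, not the real Search API schema.
import requests

query = {
    "operator": "AND",
    "filters": [
        {"field": "trace_cost", "op": "BETWEEN", "value": [0.01, 0.50]},
        {
            "operator": "OR",
            "filters": [
                {"field": "agent_name", "op": "EQUALS", "value": "support-bot"},
                {"field": "latency_ms", "op": "GREATER_THAN", "value": 2000},
            ],
        },
        {
            "operator": "NOT",
            "filters": [{"field": "metadata.tier", "op": "EQUALS", "value": "internal"}],
        },
    ],
}

resp = requests.post(
    "https://app.freeplay.ai/api/v2/projects/PROJECT_ID/search",  # hypothetical URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=query,
)
resp.raise_for_status()
```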
🔧 API
- Model Management API — Programmatically create, read, update, and delete model configurations through new CRUD endpoints.
- OpenAPI specification — Complete schema with descriptions for all 67 API endpoints, accessible in the Freeplay app with interactive playground. View API Reference →
📦 SDK
- Metadata updates — Update session and trace metadata after creation via client.metadata.updateSession() and client.metadata.updateTrace() in the Python, Node, and JVM SDKs (see the sketch after the bug fixes below).
🐛 Bug fixes
- Fixed local storage infinite loop issue
- Improved accuracy and consistency of cost and latency metrics
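A minimal sketch of the metadata calls mentioned above, assuming a configured client; the import path and parameter names are assumptions, and the Python SDK may spell the methods in snake_case rather than the camelCase names listed:

```python
# Minimal sketch, assuming a configured Freeplay client. The import path and
# parameter names are assumptions; the Python SDK may use snake_case spellings
# (update_session / update_trace) for the methods named in this changelog.
from freeplay import Freeplay

client = Freeplay(freeplay_api_key="YOUR_KEY", api_base="https://app.freeplay.ai/api")

# Attach or revise metadata after the session has already been recorded.
client.metadata.update_session(
    project_id="PROJECT_ID",
    session_id="SESSION_UUID",
    metadata={"customer_tier": "enterprise", "app_version": "2.4.1"},
)

# Traces can be updated the same way.
client.metadata.update_trace(
    project_id="PROJECT_ID",
    session_id="SESSION_UUID",
    trace_id="TRACE_UUID",
    metadata={"retrieval_strategy": "hybrid"},
)
```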
December 2025
December 18, 2025
Review Insights
Deploy an AI agent alongside your reviewers to perform real-time root cause analysis—automatically surfacing patterns and actionable improvements as reviews happen. Learn more →
Automations general availability
Define searches and automatically add results to review queues or datasets, run evaluations, or trigger Slack notifications. Build weekly review queues of low-scoring records, curate important results, or get alerts for evaluation failures. See the guide →
🖥️ Platform
- Updated session view — Session cards now display evaluation scores, notes, auto-categorization results, and multiselect values. Tree view includes colored performance icons (green → red) to quickly identify problem areas. View documentation →
- Prompt bundling — Now enabled in production for improved performance when fetching prompts at scale. Learn more →
- Documentation site launch — Comprehensive guides, SDK quickstarts, API reference with interactive playground, and recipes for OpenAI, Anthropic, LangGraph, Vercel AI SDK, and Google ADK.
🤖 Models
- New models — GPT-5.2, Gemini Pro 3 Flash Preview, Gemini 3 (with thinking_level parameter), and Mistral 3 series.
- LiteLLM for evaluations — LiteLLM models now supported for automated evaluations.
🏢 Enterprise
- Directory sync — Automatically sync users and groups from your identity provider via SCIM. Map directory groups to Freeplay roles with automatic provisioning and deprovisioning.
🐛 Bug fixes
- Fixed 500 error in API key filter dropdown
- Fixed missing inputs in observability table
- Fixed Bedrock provider tool_result handling
- Fixed CSV export timeout issues
- Improved text search with exact phrase matching
December 4, 2025
🖥️ Platform
- Eval insights — Generate trace-level evaluation insights with session metadata and span-level data support.
- Create evaluations from review themes — Formalize discovered patterns by creating evaluations directly from review themes.
- Prompt optimization with evaluated sessions — Use evaluated sessions as a data source for AI-powered prompt improvements.
- Slack integration — Connect Slack workspaces to receive automation notifications with direct links to filtered views.
🤖 Models
- New models — Claude Opus 4.5 and GPT-5.1 available in playground and for automated evaluations.
🐛 Bug fixes
- Fixed Anthropic Bedrock tool call handling with tool call history
- Fixed streaming alignment issues
- Fixed 500 errors in datasets, review queues, and user management
- Fixed stuck optimization spinner
- Fixed duplicate evaluation results recording
November 2025
November 14, 2025
New integrations
Native support for LangGraph workflows, Vercel AI SDK, and Google Agent Development Kit with full observability and prompt management. View integrations →
🖥️ Platform
- Tool span tracing — Log tool calls as explicit spans with kind="tool". Add custom names for clearer identification in traces (see the sketch below). See the Tools guide →
- Review Agent (Beta) — Automatically surfaces review themes by analyzing patterns across your review queues. Includes auto-assignment, automatic status updates, and keyboard shortcuts.
- One-click curation — Add completions to review queues or datasets directly from session view. Edit inputs/outputs and create golden test cases in one step.
- Multimodal dataset history — Create test cases with images and media across multiple conversation turns.
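A minimal sketch of tool span logging under these changes; only kind="tool" and the custom-name option come from the changelog, while the client, session, trace, and span method names are illustrative assumptions:

```python
# Minimal sketch. Only kind="tool" and the custom name come from the changelog;
# the client/session/trace/span method names below are illustrative assumptions.
from freeplay import Freeplay

client = Freeplay(freeplay_api_key="YOUR_KEY", api_base="https://app.freeplay.ai/api")
session = client.sessions.create(project_id="PROJECT_ID")             # hypothetical
trace = session.create_trace(input="What's the weather in Boston?")   # hypothetical

# Record the tool call as an explicit span with a custom name so it is easy
# to pick out in the trace tree.
span = trace.create_span(kind="tool", name="weather_lookup")          # hypothetical
span.record(
    input={"city": "Boston"},
    output={"temp_f": 41, "conditions": "cloudy"},
)
```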
November 6, 2025
📦 SDK
- Node.js/TypeScript SDK v0.5.2 — Official release with full support for prompts, sessions, traces, recordings, and test runs.
October 2025
October 30, 2025
Structured outputs
End-to-end structured output support across Python, Node.js, and JVM SDKs. Define output schemas in prompt templates for validated JSON responses with OpenAI and Azure providers. Learn more →
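For a sense of the provider-side mechanics this builds on, here is a minimal OpenAI structured-output call with a made-up schema; in Freeplay the schema would live on the prompt template and the SDK handles the wiring:

```python
# Sketch of the provider mechanics structured outputs map to (OpenAI's
# json_schema response format). The schema is a made-up example; in Freeplay
# it would be defined on the prompt template rather than inline in code.
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "support_triage",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "how_to"]},
            "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
        },
        "required": ["category", "urgency"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "My invoice is wrong and I'm blocked!"}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(response.choices[0].message.content)  # JSON validated against the schema
```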
🖥️ Platform
- Review queues for traces — Systematically evaluate traces with customizable themes and automatic categorization. Trigger evaluations from OpenTelemetry data streams. Learn more →
🔧 API
- Prompt Templates API — Create, read, update, and delete prompt versions programmatically. Update environment assignments through SDK methods. View API Reference →
- Environments API — Full CRUD operations for deployment environments. Learn more →
🤖 Models
- New models — Claude Haiku 4.5, Nova Models on AWS Bedrock (with multimedia and tool calls), and Gemini updates with fixed tool use.
- AWS Bedrock Converse API — Comprehensive support including tool calling and multimedia inputs. See the recipe →
🐛 Bug fixes
- Fixed sessions not displaying in review queue context
- Fixed observability date filter functionality
- Fixed duplicate test case updates
- Fixed span indentation for childless spans
- Fixed Anthropic cost calculation with OpenInference
October 21, 2025
🖥️ Platform
- Dataset curation improvements — Edit outputs when saving logs to datasets for better ground truth. View ground truth in playground after loading datasets.
- Bulk auto-evaluations — Run evaluations across multiple completions at once. Auto-trigger when completions are added to review queues.
- Trace display options — Toggle between plain text, Markdown, and JSON formats for inputs and outputs.
🔧 API
- Dataset APIs — Endpoints for getting, updating, and deleting prompt and agent datasets. OpenAPI docs support live testing in browser. Explore →
October 9, 2025
🔧 API
- Dataset Management APIs — POST endpoints for creating datasets with configurable input names, media inputs, and history support.
🖥️ Platform
- OpenTelemetry expansion — Capture Freeplay-specific attributes including provider/model info, environment tags, prompt/test IDs, metadata, and tool schemas (see the sketch after the bug fixes below). Learn more →
🐛 Bug fixes
- Fixed agent cost calculation showing $0.00 for top-level costs
- Fixed auto-evaluations not working on traces
- Fixed auto-evaluation failures for criteria without eval_prompt
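A minimal sketch of emitting such attributes with the standard OpenTelemetry Python API; the freeplay.* key names are assumptions, since the changelog lists the attribute categories but not the exact keys:

```python
# Minimal sketch using the standard OpenTelemetry Python API. The freeplay.*
# attribute keys are assumptions: the changelog names the categories
# (provider/model, environment, prompt/test IDs, metadata, tool schemas),
# not the exact key strings.
from opentelemetry import trace

tracer = trace.get_tracer("my-llm-app")

with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("freeplay.provider", "openai")         # hypothetical key
    span.set_attribute("freeplay.model", "gpt-4o-mini")       # hypothetical key
    span.set_attribute("freeplay.environment", "production")  # hypothetical key
    span.set_attribute("freeplay.prompt_id", "PROMPT_UUID")   # hypothetical key
    # ...call the model and record outputs as usual...
```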
October 2, 2025
September 2025
September 17, 2025
Auto-categorization
Automatically categorize logs using your own classification criteria—similar to LLM judges but for content analysis. Identify issue types that lead to evaluation failures or negative feedback. Learn more →
Prompt optimization
AI-powered optimization uses your live logs, evaluations, human labels, and customer feedback to recommend better prompts—and can update prompts for new models.
September 11, 2025
🐛 Bug fixes
- Fixed Gemini tool call correlation with OpenInference instrumentation
- Fixed next/previous navigation on filtered test runs
- Fixed test run execution with Gemini models
- Improved error messages for malformed OpenTelemetry data
🖥️ Platform
- Multi-modal template variables — Access all variables from multi-modal prompts when creating datasets or configuring evaluations.
September 2, 2025
🖥️ Platform
- Selective evaluation control — Choose which evaluations run during tests via UI or SDK for targeted testing and cost savings.
- Test run comparison — Clearer cost and latency metrics rolled up at prompt and trace levels.
- Multimodal evaluations — Target image and audio attachments with auto-evaluators. Models automatically filtered by supported media types.
- Project-level data retention — Set shorter retention windows for sensitive projects. Learn more →
August 2025
August 29, 2025
- project_id is now the first required argument to RecordPayload.
- PromptInfo renamed to PromptVersionInfo (now optional).
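A hedged migration sketch for these changes; only project_id's new position and the PromptInfo to PromptVersionInfo rename come from the changelog, and the remaining field names are assumptions:

```python
# Migration sketch. Only project_id (now first and required) and the
# PromptInfo -> PromptVersionInfo rename come from the changelog; the other
# field names and the import path are assumptions.
from freeplay import RecordPayload

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]

payload = RecordPayload(
    project_id="PROJECT_ID",   # now the first required argument
    all_messages=messages,     # assumed field name
    prompt_version_info=None,  # formerly PromptInfo; now optional
)
```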
August 21, 2025
🖥️ Platform
- Media input support — Create and upload media-backed test cases with automatic type inference.
- Tree-based session interface — Left-hand tree navigation, resizable review panel, and deep-linking for shareable session URLs.
- Multi-project service accounts — Service accounts can now access multiple projects.
🤖 Models
- Tool calling expansion — Vertex AI and Gemini tool calling, including native support in the JVM SDK.
🐛 Bug fixes
- Fixed stale navigation selections during pagination
- Fixed Gemini test runs with proper message type conversion
- Fixed table flickering and media preview reloading
- Improved error handling for API keys from deleted users
August 7, 2025
🤖 Models
- New models — GPT-5 available in playground and for evaluations. Claude Opus 4.1 and GPT-OSS models (20B/120B) can be added for your preferred inference provider.
August 1, 2025
Agent evaluations
Create trace-level LLM judges in the Freeplay UI to evaluate full agent behavior. Filter and graph agent evals separately from prompt-level evals. Learn more →
🖥️ Platform
- Playground diff view — Row-level change comparison for any two columns to compare prompt iterations.
- Prompt optimization (experimental) — Use log examples, eval scores, human labels, and feedback to suggest prompt improvements.
- Test results filtering — Filter graphs and test case rows together to explore metrics for different data slices.
🐛 Bug fixes
- Fixed filtering so numeric fields use numeric comparison operators (greater than, less than) instead of string-only operators
July 2025
July 15, 2025
🔒 Security
- WorkOS authentication — Upgraded authentication for enhanced security and smoother logins.
July 2, 2025
🔧 API
- User-scoped API keys — Full API use with private projects:
  - Private projects → Accessible only to API keys from project members
  - Public projects → Accessible to all API keys
June 2025
June 6, 2025
Agent support
Define and run agent-level evaluations, curate datasets for agent testing, compare agent versions, and get simplified trace observability.
Review Queues
Systematically review and annotate AI outputs with customizable workflows.
Instant search
Search across all LLM logs with instant results and trend visualizations.
Bring Your Own Cloud
Turnkey private hosting in any cloud for enterprise data residency requirements.

