Stay up to date with the latest improvements to Freeplay. This changelog covers platform features, SDK releases, API additions, and bug fixes that improve what you can do with Freeplay.
For developers: Watch for SDK and API updates that may require code changes. Breaking changes are clearly marked.

February 2026

February 6, 2026

Freeplay MCP Server (Experimental)

Integrate Freeplay capabilities into MCP-compatible tools and workflows with our experimental Model Context Protocol server, now available as a public repository. View on GitHub β†’

🏠 Project Home Page

We’ve added a new Home page to every project with key metrics, insights about the project, and bookmarkable metrics. It’s a much faster way to understand what’s happening in your project. See this Loom for more information.

πŸ€– Models

  β€’ Claude Opus 4.6 β€” Added the newest Claude Opus 4.6 to Freeplay’s prompt playground.
  β€’ Claude Haiku 4 media support β€” Full image and file upload support for Anthropic Claude Haiku 4 models via both the direct Anthropic API and AWS Bedrock.

πŸ”§ API

  β€’ User management endpoints β€” Filter deleted users via the include_deleted query parameter and reactivate soft-deleted users through new admin endpoints.
  β€’ Insights endpoints β€” Get insights from Freeplay via the /project/{project_id}/insights API endpoint (see the sketch below).
  β€’ Insight filtering in search β€” The Search API supports filtering by insight_id across review themes and evaluation insights.
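
For reference, a minimal sketch of calling these endpoints over HTTP, assuming a bearer-token setup. The base URL and the /users path are assumptions; only the include_deleted parameter and the insights path come from this release.

import os
import requests

API_BASE = "https://app.freeplay.ai/api"  # assumed base URL; adjust for your account
headers = {"Authorization": f"Bearer {os.environ['FREEPLAY_API_KEY']}"}
project_id = "your-project-id"

# Fetch insights for a project via the new endpoint
insights = requests.get(f"{API_BASE}/project/{project_id}/insights", headers=headers)
insights.raise_for_status()

# List users, including soft-deleted ones (the /users path is hypothetical)
users = requests.get(f"{API_BASE}/users", params={"include_deleted": "true"}, headers=headers)
users.raise_for_status()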

πŸ“š Documentation

Filtering and search documentation β€” New documentation explaining tokenization behavior, phrase matching, field-type specific search, and β€˜contains’ semantics in the observability UI.

πŸ› Bug fixes / Improvements

  • UI improvements including scrollable evaluation explanations, better test run comparison alignment, and standardized tab styling.

January 2026

January 13, 2026

New Search APIs

Query your observability data programmatically with three new search endpoints for sessions, traces, and completions. Build complex queries with compound filters (AND, OR, NOT), paginate through results, and use advanced filtering by eval score, cost, latency, metadata, and more. View Search API Operators β†’
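
As a rough sketch, a compound query might look like the following, assuming a JSON POST body. The endpoint path, field names, and operator spellings are illustrative; the operators reference linked above is authoritative.

import os
import requests

API_BASE = "https://app.freeplay.ai/api"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['FREEPLAY_API_KEY']}"}

# Completions costing over $0.01 that also either failed an eval or ran
# slower than 2 seconds; all names below are illustrative
query = {
    "filter": {
        "and": [
            {"field": "cost", "operator": "gt", "value": 0.01},
            {"or": [
                {"field": "eval_score", "operator": "lt", "value": 0.5},
                {"field": "latency", "operator": "gt", "value": 2.0},
            ]},
        ]
    },
    "page": 1,
}
resp = requests.post(
    f"{API_BASE}/projects/your-project-id/search/completions",  # hypothetical path
    headers=headers,
    json=query,
)
resp.raise_for_status()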

New year, new docs

Major refresh to our documentation including an OpenAPI spec, a new llms.txt as the starting place for coding agents, and restructured SDK documentation. We’ve also added this changelog. Let us know what you think. Explore the docs β†’

πŸ“¦ SDK

  β€’ Google GenAI tool schema update β€” Define tool schemas using GenaiFunction and GenaiTool dataclasses in Python, with full TypeScript type safety in Node (see the sketch below).
  β€’ Python SDK v0.5.5–0.5.6 β€” Standardized documentation, improved variable naming conventions, and reorganized capabilities. (See the full Python SDK changelog)
  β€’ Node SDK v0.5.2–0.5.3 β€” Revamped README for the open source release with improved examples and documentation. (See the full Node SDK changelog)
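
A sketch of the new Python dataclasses. The import path and constructor fields are assumptions modeled on typical Google GenAI function declarations; check the SDK reference for the exact signatures.

from freeplay import GenaiFunction, GenaiTool  # import path assumed

# Describe a single callable function with a JSON Schema for its arguments
get_weather = GenaiFunction(
    name="get_weather",
    description="Look up the current weather for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

# Group function declarations into a tool (field name assumed)
weather_tool = GenaiTool(functions=[get_weather])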

πŸ› Bug fixes

  • Fixed tool call import when saving test cases from completions

SDKs now open source

Python and Node.js SDKs are now available under the Apache-2.0 license.

πŸ–₯️ Platform

  β€’ Run all evaluations button β€” Trigger evaluation runs for all completions and traces in a session with a single click.
  β€’ CSV export for traces β€” Export trace data directly from the observability view for offline analysis.
  β€’ Bulk dataset operations β€” Select multiple rows in datasets to bulk delete, duplicate, or move test cases. Sort by name, compatibility, or creation date with shareable URL parameters.

πŸ”§ API

  β€’ Model Management API β€” Programmatically create, read, update, and delete model configurations through new CRUD endpoints.
  β€’ OpenAPI specification β€” Complete schema with descriptions for all 67 API endpoints, accessible in the Freeplay app with interactive playground. View API Reference β†’

πŸ“¦ SDK

Metadata updates β€” Update session and trace metadata after creation via client.metadata.updateSession() and client.metadata.updateTrace() in Python, Node, and JVM SDKs.
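
A minimal sketch in Python. The client construction follows the SDK’s usual pattern, but the snake_case method spellings and keyword names below are assumptions; the camelCase names above are the Node spellings.

import os
from freeplay import Freeplay

client = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://app.freeplay.ai/api",  # adjust for your account
)

# Attach or overwrite metadata after the session/trace already exists
client.metadata.update_session(session_id="your-session-id", metadata={"customer_tier": "enterprise"})
client.metadata.update_trace(trace_id="your-trace-id", metadata={"agent_version": "2026.01"})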

December 2025

December 18, 2025

Review Insights

Our new AI agent works alongside your human reviewers to perform real-time root cause analysis, automatically surfacing patterns and actionable improvements as reviews happen. Learn more β†’

Automations

Define custom searches, then automatically run evaluations, add results to review queues or datasets, or trigger Slack notifications. Build weekly review queues of low-scoring logs, curate important results, or get alerts for evaluation failures. See the guide β†’

πŸ–₯️ Platform

Updated session view β€” Session cards now display evaluation scores, notes, auto-categorization results, and multiselect values. Tree view includes colored performance icons (green β†’ red) to quickly identify problem areas. View documentation β†’

πŸ€– Models

  β€’ New models β€” GPT-5.2, Gemini Pro 3 Flash Preview, Gemini 3 (with thinking_level parameter), and Mistral 3 series.
  β€’ LiteLLM for evaluations β€” LiteLLM models are now supported for automated evaluations.

🏒 Enterprise

Directory sync β€” Automatically sync users and groups from your identity provider via SCIM. Map directory groups to Freeplay roles with automatic provisioning and deprovisioning. Learn more β†’

πŸ› Bug fixes

  • Fixed Bedrock provider tool_result handling
  • Fixed CSV export timeout issues
  • Improved text search with exact phrase matching

πŸ–₯️ Platform

  β€’ Create evaluations from review themes β€” When you find a common issue, turn it into an LLM judge evaluation directly from review themes so you can catch the issue next time it happens.
  β€’ Prompt optimization from review themes β€” Use learnings from a review to launch a targeted AI-powered prompt optimization experiment, using reviewed sessions as a data source.
  β€’ Slack integration β€” Connect Slack workspaces to receive automation notifications with direct links to filtered views.

πŸ€– Models

New models β€” Claude Opus 4.5 and GPT-5.1 available in playground and for automated evaluations.

πŸ› Bug fixes

  • Fixed Anthropic Bedrock tool call handling with tool call history

November 2025

New integrations

Native support for LangGraph workflows, Vercel AI SDK, and Google Agent Development Kit with full observability and prompt management. View integrations β†’

πŸ–₯️ Platform

  β€’ Tool span tracing β€” Log tool calls as explicit spans with kind="tool". Add custom names for clearer identification in traces (see the sketch below). See the Tools guide β†’
  β€’ Review Agent (Beta) β€” Automatically surfaces review themes by analyzing patterns across your review queues. Includes auto-assignment, automatic status updates, and keyboard shortcuts.
  β€’ One-click curation β€” Add completions to review queues or datasets directly from session view. Edit inputs/outputs and create golden test cases in one step.
  β€’ Multimodal dataset history β€” Create test cases with images and media across multiple conversation turns.
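
A sketch of the tool-span flow, assuming a session/trace API shaped like the SDK examples; only kind="tool" and the custom name come from this note, and the helper names are assumptions.

import os
from freeplay import Freeplay

client = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://app.freeplay.ai/api",
)

# Hypothetical helpers; see the Tools guide for the real calls
session = client.sessions.create()
trace = session.create_trace(input="What's the weather in Oslo?")
span = trace.create_span(kind="tool", name="get_weather")  # explicit tool span with a custom name
result = {"temp_c": -3}  # stand-in for your actual tool call
span.record_output(result)  # hypothetical recording call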

πŸ“¦ SDK

  β€’ Node.js/TypeScript SDK v0.5.2 β€” Official release with full support for prompts, sessions, traces, recordings, and test runs. Install: npm install freeplay
  β€’ Python SDK v0.5.4 β€” Improved package management, documentation, and multimodal data handling. Install: pip install freeplay
Get started β†’

October 2025

Structured outputs

End-to-end structured output support across Python, Node.js, and JVM SDKs. Define output schemas in prompt templates for validated JSON responses with OpenAI and Azure providers. Learn more β†’
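
A sketch of the end-to-end flow with the Python SDK and OpenAI. The get_formatted() call follows the SDK’s usual shape, but the prompt_info attribute names are assumptions; the template name and variables are placeholders.

import json
import os
from freeplay import Freeplay
from openai import OpenAI

fp = Freeplay(
    freeplay_api_key=os.environ["FREEPLAY_API_KEY"],
    api_base="https://app.freeplay.ai/api",
)

# The output schema lives on the prompt template in Freeplay, so the
# formatted prompt carries it into the provider call
formatted = fp.prompts.get_formatted(
    project_id="your-project-id",
    template_name="extract-order",
    environment="production",
    variables={"email_body": "..."},
)

resp = OpenAI().chat.completions.create(
    model=formatted.prompt_info.model,
    messages=formatted.llm_prompt,
    **formatted.prompt_info.model_parameters,  # includes the response_format schema (attribute names assumed)
)
order = json.loads(resp.choices[0].message.content)  # schema-validated JSON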

πŸ–₯️ Platform

Review queues for traces β€” Systematically evaluate traces with customizable themes and automatic categorization. Trigger evaluations from OpenTelemetry data streams. Learn more β†’

πŸ”§ API

  β€’ Prompt Templates API β€” Create, read, update, and delete prompt versions programmatically. Update environment assignments through SDK methods. View API Reference β†’
  β€’ Environments API β€” Full CRUD operations for deployment environments. Learn more β†’

πŸ€– Models

  β€’ New models β€” Claude Haiku 4.5, Nova Models on AWS Bedrock (with multimedia and tool calls), and Gemini updates with fixed tool use.
  β€’ AWS Bedrock Converse API β€” Comprehensive support including tool calling and multimedia inputs. See the recipe β†’

πŸ› Bug fixes

  • Fixed sessions not displaying in review queue context
  • Fixed observability date filter functionality
  • Fixed duplicate test case updates
  • Fixed span indentation for childless spans
  • Fixed Anthropic cost calculation with OpenInference

πŸ–₯️ Platform

  β€’ Dataset curation improvements β€” Edit outputs when saving logs to datasets for better ground truth. View ground truth in playground after loading datasets.
  β€’ Bulk auto-evaluations β€” Run evaluations across multiple completions at once. Auto-trigger when completions are added to review queues.
  β€’ Trace display options β€” Toggle between plain text, Markdown, and JSON formats for inputs and outputs.

πŸ”§ API

Dataset APIs β€” Endpoints for getting, updating, and deleting prompt and agent datasets. OpenAPI docs support live testing in browser. Explore β†’

πŸ”§ API

Dataset Management APIs β€” POST endpoints for creating datasets with configurable input names, media inputs, and history support.
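
An illustrative request for creating a dataset. The path and field names are guesses from the feature description, not the documented schema.

import os
import requests

API_BASE = "https://app.freeplay.ai/api"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['FREEPLAY_API_KEY']}"}

# Field names mirror the feature description (input names, media, history)
# but are assumptions; check the API docs for the real schema
payload = {
    "name": "support-golden-set",
    "input_names": ["question", "context"],
    "supports_media": True,
    "supports_history": True,
}
resp = requests.post(
    f"{API_BASE}/projects/your-project-id/datasets",  # hypothetical path
    headers=headers,
    json=payload,
)
resp.raise_for_status()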

πŸ–₯️ Platform

OpenTelemetry expansion β€” Capture Freeplay-specific attributes including provider/model info, environment tags, prompt/test IDs, metadata, and tool schemas. Learn more β†’
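
A sketch using the standard OpenTelemetry Python API; the freeplay.* attribute keys are hypothetical stand-ins for the documented conventions.

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

with tracer.start_as_current_span("llm-call") as span:
    # Attribute keys below are hypothetical; use Freeplay's documented names
    span.set_attribute("freeplay.provider", "openai")
    span.set_attribute("freeplay.model", "gpt-4o")
    span.set_attribute("freeplay.environment", "production")
    span.set_attribute("freeplay.prompt_template_id", "your-template-id")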

πŸ› Bug fixes

  • Fixed agent cost calculation showing $0.00 for top-level costs
  • Fixed auto-evaluations not working on traces
  • Fixed auto-evaluation failures for criteria without eval_prompt

πŸ”§ API

Delete API for prompt template versions β€” Programmatic removal through the v2 API.

πŸ› Bug fixes

  • Fixed β€œmark as best” auto-navigation behavior
  • Restored next/previous navigation on filtered test runs

September 2025

Auto-categorization

Automatically categorize logs using your own classification criteriaβ€”similar to LLM judges but for content analysis. Identify issue types that lead to evaluation failures or negative feedback. Learn more β†’

Prompt optimization

AI-powered optimization uses your live logs, evaluations, human labels, and customer feedback to recommend better promptsβ€”and can update prompts for new models.

πŸ› Bug fixes

  • Fixed Gemini tool call correlation with OpenInference instrumentation
  • Fixed next/previous navigation on filtered test runs
  • Fixed test run execution with Gemini models
  • Improved error messages for malformed OpenTelemetry data

πŸ–₯️ Platform

Multi-modal template variables β€” Access all variables from multi-modal prompts when creating datasets or configuring evaluations.

πŸ–₯️ Platform

  β€’ Selective evaluation control β€” Choose which evaluations run during tests via UI or SDK for targeted testing and cost savings.
  β€’ Test run comparison β€” Clearer cost and latency metrics rolled up at prompt and trace levels.
  β€’ Multimodal evaluations β€” Target image and audio attachments with auto-evaluators. Models are automatically filtered by supported media types.
  β€’ Project-level data retention β€” Set shorter retention windows for sensitive projects. Learn more β†’

August 2025

SDK breaking changes β€” These changes enable optional prompt management, OTel logging support, nested traces, and multi-modal dataset management.
  1. project_id is now the first required argument to RecordPayload:
RecordPayload(project_id=project_id, ...)
  2. PromptInfo renamed to PromptVersionInfo (now optional):
RecordPayload(
    project_id=project_id,
    prompt_version_info=formatted_prompt.prompt_info,
    ...
)

πŸ–₯️ Platform

  β€’ Media input support β€” Create and upload media-backed test cases with automatic type inference.
  β€’ Tree-based session interface β€” Left-hand tree navigation, resizable review panel, and deep-linking for shareable session URLs.
  β€’ Multi-project service accounts β€” Service accounts can now access multiple projects.

πŸ€– Models

Tool calling expansion β€” Vertex AI and Gemini tool calling, including native support in JVM SDK.

πŸ› Bug fixes

  • Fixed navigation stale selections during pagination
  • Fixed Gemini test runs with proper message type conversion
  • Fixed table flickering and media preview reloading
  • Improved error handling for API keys from deleted users

πŸ€– Models

New models β€” GPT-5 available in playground and for evaluations. Claude Opus 4.1 and GPT-OSS models (20B/120B) can be added via your preferred inference provider.

Agent evaluations

Create trace-level LLM judges in the Freeplay UI to evaluate full agent behavior. Filter and graph agent evals separately from prompt-level evals. Learn more β†’

πŸ–₯️ Platform

  β€’ Playground diff view β€” Row-level change comparison for any two columns to compare prompt iterations.
  β€’ Prompt optimization (experimental) β€” Use log examples, eval scores, human labels, and feedback to suggest prompt improvements.
  β€’ Test results filtering β€” Filter graphs and test case rows together to explore metrics for different data slices.

πŸ› Bug fixes

  • Fixed filtering operators to respect numeric types (greater than, less than) instead of only string operators

July 2025

πŸ”’ Security

WorkOS authentication β€” Upgraded authentication for enhanced security and smoother logins.

πŸ”§ API

User-scoped API keys β€” Full API use with private projects:
  • Private projects β†’ Accessible only to API keys from project members
  • Public projects β†’ Accessible to all API keys

June 2025