February 2026
February 6, 2026
Freeplay MCP Server (Experimental)
Integrate Freeplay capabilities into MCP-compatible tools and workflows with our experimental Model Context Protocol server, now available as a public repository.
View on GitHub →
Project Home Page
We've added a new Home page to every project with key metrics, insights about the project, and bookmarkable metrics. It's a much faster way to understand what's happening in your project. See this Loom for more information.
Models
- Claude Opus 4.6: Added the newest Claude Opus 4.6 to Freeplay's prompt playground.
- Claude Haiku 4 media support: Full image and file upload support for Anthropic Claude Haiku 4 models via both the direct Anthropic API and AWS Bedrock.
API
- User management endpoints: Filter deleted users via the include_deleted query parameter and reactivate soft-deleted users through new admin endpoints.
- Insights endpoints: You can now get insights from Freeplay by using the /project/{project_id}/insights API endpoint.
- Insight filtering in search: The Search API supports filtering by insight_id across review themes and evaluation insights.
Documentation
- Filtering and search documentation: New documentation explaining tokenization behavior, phrase matching, field-type-specific search, and “contains” semantics in the observability UI.
Bug fixes / Improvements
- UI improvements including scrollable evaluation explanations, better test run comparison alignment, and standardized tab styling.
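As a rough illustration of the insights endpoint listed under API above, the sketch below builds the request URL for /project/{project_id}/insights. The base URL and the helper name insights_url are placeholders assumed for this example. Authentication and the response shape are not shown because the entry above does not describe them.

```python
from urllib.parse import quote

# Placeholder API root (an assumption): replace with the root of your Freeplay API.
BASE_URL = "https://freeplay.example.invalid/api"


def insights_url(project_id: str, base_url: str = BASE_URL) -> str:
    """Return the URL for the /project/{project_id}/insights endpoint."""
    # Percent-encode the ID so it is safe to place in a URL path segment.
    return f"{base_url.rstrip('/')}/project/{quote(project_id, safe='')}/insights"


if __name__ == "__main__":
    print(insights_url("my-project-id"))
```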
January 2026
January 13, 2026
New Search APIs
Query your observability data programmatically with three new search endpoints for sessions, traces, and completions. Build complex queries with compound filters (AND, OR, NOT), paginate through results, and use advanced filtering by eval score, cost, latency, metadata, and more.
View Search API Operators →
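To make the compound-filter idea concrete, here is a small local sketch of how AND, OR, and NOT compose. The nested-dictionary shape, the operator names (eq, gt, lt), and the field names are invented for illustration and are not the Search API's request schema. Consult the operator reference for the real format.

```python
from typing import Any


def matches(record: dict[str, Any], node: dict[str, Any]) -> bool:
    """Evaluate a nested AND/OR/NOT filter tree against one record."""
    if "and" in node:
        return all(matches(record, child) for child in node["and"])
    if "or" in node:
        return any(matches(record, child) for child in node["or"])
    if "not" in node:
        return not matches(record, node["not"])

    # Leaf node: compare a single field against a value.
    actual = record.get(node["field"])
    op, expected = node["op"], node["value"]
    if op == "eq":
        return actual == expected
    if actual is None:
        return False  # ordering comparisons never match a missing field
    if op == "gt":
        return actual > expected
    if op == "lt":
        return actual < expected
    raise ValueError(f"unknown operator: {op}")


# Hypothetical query: slow or expensive completions that are not from staging.
query = {
    "and": [
        {"or": [
            {"field": "latency_ms", "op": "gt", "value": 2000},
            {"field": "cost_usd", "op": "gt", "value": 0.05},
        ]},
        {"not": {"field": "environment", "op": "eq", "value": "staging"}},
    ]
}

records = [
    {"id": 1, "latency_ms": 3100, "cost_usd": 0.01, "environment": "prod"},
    {"id": 2, "latency_ms": 400, "cost_usd": 0.09, "environment": "staging"},
    {"id": 3, "latency_ms": 250, "cost_usd": 0.02, "environment": "prod"},
]

matching_ids = [r["id"] for r in records if matches(r, query)]
print(matching_ids)
```

Only the first record satisfies both branches: it is slow and it is not from staging.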
New year, new docs
Major refresh to our documentation, including an OpenAPI spec, a new llms.txt as the starting place for coding agents, and restructured SDK documentation. We've also added this changelog. Let us know what you think.
Explore the docs →
SDK
- Google GenAI tool schema update: Define tool schemas using GenaiFunction and GenaiTool dataclasses in Python, with full TypeScript type safety in Node.
- Python SDK v0.5.5–0.5.6: Standardized documentation, improved variable naming conventions, and reorganized capabilities. (See full Python SDK changelog)
- Node SDK v0.5.2–0.5.3: Revamped README for open source release with improved examples and documentation. (See full Node SDK changelog)
Bug fixes
- Fixed tool call import when saving test cases from completions
January 1, 2026
SDKs now open source
Python and Node.js SDKs are now available under the Apache-2.0 license.
Platform
- Run all evaluations button: Trigger evaluation runs for all completions and traces in a session with a single click.
- CSV export for traces: Export trace data directly from the observability view for offline analysis.
- Bulk dataset operations: Select multiple rows in datasets to bulk delete, duplicate, or move test cases. Sort by name, compatibility, or creation date with shareable URL parameters.
API
- Model Management API: Programmatically create, read, update, and delete model configurations through new CRUD endpoints.
- OpenAPI specification: Complete schema with descriptions for all 67 API endpoints, accessible in the Freeplay app with an interactive playground. View API Reference →
SDK
- Metadata updates: Update session and trace metadata after creation via client.metadata.updateSession() and client.metadata.updateTrace() in Python, Node, and JVM SDKs.
December 2025
December 18, 2025
Review Insights
Our new AI agent works alongside your human reviewers to perform real-time root cause analysis, automatically surfacing patterns and actionable improvements as reviews happen.
Learn more →
Automations
Define custom searches, then automatically run evaluations, add results to review queues or datasets, or trigger Slack notifications. Build weekly review queues of low-scoring logs, curate important results, or get alerts for evaluation failures.
See the guide →
Platform
- Updated session view: Session cards now display evaluation scores, notes, auto-categorization results, and multiselect values. Tree view includes colored performance icons (green → red) to quickly identify problem areas. View documentation →
Models
- New models: GPT-5.2, Gemini Pro 3 Flash Preview, Gemini 3 (with thinking_level parameter), and Mistral 3 series.
- LiteLLM for evaluations: LiteLLM models are now supported for automated evaluations.
Enterprise
- Directory sync: Automatically sync users and groups from your identity provider via SCIM. Map directory groups to Freeplay roles with automatic provisioning and deprovisioning. Learn more →
Bug fixes
- Fixed Bedrock provider tool_result handling
- Fixed CSV export timeout issues
- Improved text search with exact phrase matching
December 4, 2025
Platform
- Create evaluations from review themes: When you find a common issue, turn it into an LLM judge evaluation directly from review themes so you can catch the issue the next time it happens.
- Prompt optimization from review themes: Use learnings from a review to launch a targeted AI-powered prompt optimization experiment, using reviewed sessions as a data source.
- Slack integration: Connect Slack workspaces to receive automation notifications with direct links to filtered views.
Models
- New models: Claude Opus 4.5 and GPT-5.1 available in playground and for automated evaluations.
Bug fixes
- Fixed Anthropic Bedrock tool call handling with tool call history
November 2025
November 14, 2025
New integrations
Native support for LangGraph workflows, Vercel AI SDK, and Google Agent Development Kit with full observability and prompt management.
View integrations →
Platform
- Tool span tracing: Log tool calls as explicit spans with kind="tool". Add custom names for clearer identification in traces. See the Tools guide →
- Review Agent (Beta): Automatically surfaces review themes by analyzing patterns across your review queues. Includes auto-assignment, automatic status updates, and keyboard shortcuts.
- One-click curation: Add completions to review queues or datasets directly from session view. Edit inputs/outputs and create golden test cases in one step.
- Multimodal dataset history: Create test cases with images and media across multiple conversation turns.
November 6, 2025
SDK
- Node.js/TypeScript SDK v0.5.2: Official release with full support for prompts, sessions, traces, recordings, and test runs.
October 2025
October 30, 2025
Structured outputs
End-to-end structured output support across Python, Node.js, and JVM SDKs. Define output schemas in prompt templates for validated JSON responses with OpenAI and Azure providers.
Learn more →
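As a loose illustration of what a validated JSON response means in practice, the sketch below checks a model's raw JSON text against a small output schema using only the Python standard library. The schema format (a mapping of required field names to Python types), the TICKET_SCHEMA example, and the parse_structured helper are assumptions made for this example. They are not the schema format used in Freeplay prompt templates.

```python
import json
from typing import Any

# Hypothetical output schema: required field name -> expected Python type.
TICKET_SCHEMA: dict[str, type] = {"category": str, "priority": int, "summary": str}


def parse_structured(raw: str, schema: dict[str, type]) -> dict[str, Any]:
    """Parse JSON text and verify every schema field is present and correctly typed."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in schema.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name!r} should be {expected.__name__}")
    return data


ticket = parse_structured(
    '{"category": "billing", "priority": 2, "summary": "Refund request"}',
    TICKET_SCHEMA,
)
print(ticket["priority"])
```

A response that omits a required field, or returns the wrong type for one, raises ValueError instead of flowing silently into downstream code.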
Platform
- Review queues for traces: Systematically evaluate traces with customizable themes and automatic categorization. Trigger evaluations from OpenTelemetry data streams. Learn more →
API
- Prompt Templates API: Create, read, update, and delete prompt versions programmatically. Update environment assignments through SDK methods. View API Reference →
- Environments API: Full CRUD operations for deployment environments. Learn more →
Models
- New models: Claude Haiku 4.5, Nova Models on AWS Bedrock (with multimedia and tool calls), and Gemini updates with fixed tool use.
- AWS Bedrock Converse API: Comprehensive support including tool calling and multimedia inputs. See the recipe →
Bug fixes
- Fixed sessions not displaying in review queue context
- Fixed observability date filter functionality
- Fixed duplicate test case updates
- Fixed span indentation for childless spans
- Fixed Anthropic cost calculation with OpenInference
October 21, 2025
Platform
- Dataset curation improvements: Edit outputs when saving logs to datasets for better ground truth. View ground truth in playground after loading datasets.
- Bulk auto-evaluations: Run evaluations across multiple completions at once. Auto-trigger when completions are added to review queues.
- Trace display options: Toggle between plain text, Markdown, and JSON formats for inputs and outputs.
API
- Dataset APIs: Endpoints for getting, updating, and deleting prompt and agent datasets. OpenAPI docs support live testing in the browser. Explore →
October 9, 2025
API
- Dataset Management APIs: POST endpoints for creating datasets with configurable input names, media inputs, and history support.
Platform
- OpenTelemetry expansion: Capture Freeplay-specific attributes including provider/model info, environment tags, prompt/test IDs, metadata, and tool schemas. Learn more →
Bug fixes
- Fixed agent cost calculation showing $0.00 for top-level costs
- Fixed auto-evaluations not working on traces
- Fixed auto-evaluation failures for criteria without eval_prompt
October 2, 2025
September 2025
September 17, 2025
Auto-categorization
Automatically categorize logs using your own classification criteria, similar to LLM judges but for content analysis. Identify issue types that lead to evaluation failures or negative feedback.
Learn more →
Prompt optimization
AI-powered optimization uses your live logs, evaluations, human labels, and customer feedback to recommend better prompts, and can update prompts for new models.
September 11, 2025
Bug fixes
- Fixed Gemini tool call correlation with OpenInference instrumentation
- Fixed next/previous navigation on filtered test runs
- Fixed test run execution with Gemini models
- Improved error messages for malformed OpenTelemetry data
Platform
- Multi-modal template variables: Access all variables from multi-modal prompts when creating datasets or configuring evaluations.
September 2, 2025
Platform
- Selective evaluation control: Choose which evaluations run during tests via UI or SDK for targeted testing and cost savings.
- Test run comparison: Clearer cost and latency metrics rolled up at prompt and trace levels.
- Multimodal evaluations: Target image and audio attachments with auto-evaluators. Models are automatically filtered by supported media types.
- Project-level data retention: Set shorter retention windows for sensitive projects. Learn more →
August 2025
August 29, 2025
- project_id is now the first required argument to RecordPayload.
- PromptInfo renamed to PromptVersionInfo (now optional).
August 21, 2025
Platform
- Media input support: Create and upload media-backed test cases with automatic type inference.
- Tree-based session interface: Left-hand tree navigation, resizable review panel, and deep-linking for shareable session URLs.
- Multi-project service accounts: Service accounts can now access multiple projects.
Models
- Tool calling expansion: Vertex AI and Gemini tool calling, including native support in JVM SDK.
Bug fixes
- Fixed stale navigation selections during pagination
- Fixed Gemini test runs with proper message type conversion
- Fixed table flickering and media preview reloading
- Improved error handling for API keys from deleted users
August 7, 2025
Models
- New models: GPT-5 available in playground and for evaluations. Claude Opus 4.1 and GPT-OSS models (20B/120B) can be added for your preferred inference provider.
August 1, 2025
Agent evaluations
Create trace-level LLM judges in the Freeplay UI to evaluate full agent behavior. Filter and graph agent evals separately from prompt-level evals.
Learn more →
Platform
- Playground diff view: Row-level change comparison for any two columns to compare prompt iterations.
- Prompt optimization (experimental): Use log examples, eval scores, human labels, and feedback to suggest prompt improvements.
- Test results filtering: Filter graphs and test case rows together to explore metrics for different data slices.
Bug fixes
- Fixed filtering operators to respect numeric types (greater than, less than) instead of only string operators
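The difference this fix addresses is easy to reproduce in plain Python: comparing numbers as strings orders them character by character, so a filter like "greater than 9" would wrongly exclude 10. The greater_than helper below is a hypothetical name used only to illustrate the concept. It is not Freeplay's implementation.

```python
def greater_than(value: str, threshold: str, numeric: bool) -> bool:
    """Compare two raw filter values as numbers or as plain strings."""
    if numeric:
        return float(value) > float(threshold)
    return value > threshold  # lexicographic comparison


print(greater_than("10", "9", numeric=False))  # False: "1" sorts before "9"
print(greater_than("10", "9", numeric=True))   # True: 10.0 > 9.0
```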
July 2025
July 15, 2025
Security
- WorkOS authentication: Upgraded authentication for enhanced security and smoother logins.
July 2, 2025
API
User-scoped API keys: Full API use with private projects.
- Private projects: Accessible only to API keys from project members
- Public projects: Accessible to all API keys
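The access rule in the two bullets above can be written as a short check. The Project dataclass, its fields, and the key_can_access function are hypothetical names chosen for this sketch. The entry does not describe how Freeplay represents projects or keys internally.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical project record used only for this illustration."""
    name: str
    private: bool
    member_ids: set[str] = field(default_factory=set)


def key_can_access(key_owner_id: str, project: Project) -> bool:
    """Public projects admit any API key; private ones require membership."""
    if not project.private:
        return True
    return key_owner_id in project.member_ids


internal = Project("internal-evals", private=True, member_ids={"alice"})
shared = Project("shared-demo", private=False)

print(key_can_access("alice", internal))  # True: alice is a member
print(key_can_access("bob", internal))    # False: bob is not a member
print(key_can_access("bob", shared))      # True: public project
```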
June 2025
June 6, 2025
Agent support
Define and run agent-level evaluations, curate datasets for agent testing, compare agent versions, and use simplified trace observability.
Review Queues
Systematically review and annotate AI outputs with customizable workflows.
Instant search
Search across all LLM logs with instant results and trend visualizations.
Bring Your Own Cloud
Turnkey private hosting in any cloud for enterprise data residency requirements.

