Agent Diff

Try it now — Copy one of the Jupyter notebooks:

Diffs Demo — Capture agent actions as database diffs
Evaluations Demo — Run assertions against agent diffs
Linear Bench — Run the 40-task benchmark from HuggingFace

Core Concepts

Templates & Environments

Templates are populated database schemas. Environments are isolated, ephemeral copies of templates where your agents operate. Each environment gets its own base API URL that you can proxy to your agents.
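
The isolation described above can be pictured with a small in-memory sketch. This is not the Agent Diff SDK: `TEMPLATE`, `create_environment`, and the URL shape are hypothetical names invented for illustration, and a real environment is a database copy rather than a Python dict.

```python
import copy
import uuid

# Hypothetical in-memory stand-in for a template: table name -> list of rows.
TEMPLATE = {
    "issues": [{"id": 1, "title": "Fix login bug", "state": "todo"}],
}


def create_environment(template):
    """Return an isolated copy of the template plus an illustrative base URL."""
    env_id = uuid.uuid4().hex
    return {
        "id": env_id,
        # Placeholder URL shape, invented for this sketch only.
        "base_url": f"http://localhost:8000/env/{env_id}",
        "tables": copy.deepcopy(template),
    }


env_a = create_environment(TEMPLATE)
env_b = create_environment(TEMPLATE)

# A write in one environment touches neither the template nor the other environment.
env_a["tables"]["issues"][0]["state"] = "done"
```

Because each environment holds its own deep copy, two agents can run against the same template at the same time without seeing each other's writes.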

Runs & Diffs

A run represents a single test session within an environment. Starting a run takes a snapshot; ending it takes a second snapshot and returns a diff: the computed difference between the before and after states of the environment.
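
As a rough illustration of what a diff between two snapshots can look like, the sketch below compares two in-memory snapshots and groups the changes into inserts, updates, and deletes. The snapshot shape (table name mapped to a list of row dicts), the `compute_diff` helper, and the output format are all assumptions made for this example; the service's actual diff format may differ.

```python
def compute_diff(before, after, key="id"):
    """Compare two snapshots (table name -> list of row dicts) and group changes."""
    diff = {"inserts": [], "updates": [], "deletes": []}
    for table in sorted(set(before) | set(after)):
        old = {row[key]: row for row in before.get(table, [])}
        new = {row[key]: row for row in after.get(table, [])}
        # Rows present only in the "after" snapshot were inserted.
        for pk in sorted(new.keys() - old.keys()):
            diff["inserts"].append({"table": table, "row": new[pk]})
        # Rows present only in the "before" snapshot were deleted.
        for pk in sorted(old.keys() - new.keys()):
            diff["deletes"].append({"table": table, "row": old[pk]})
        # Rows present in both are updates if any column value differs.
        for pk in sorted(old.keys() & new.keys()):
            if old[pk] != new[pk]:
                changed = {
                    col: val
                    for col, val in new[pk].items()
                    if col not in old[pk] or old[pk][col] != val
                }
                diff["updates"].append({"table": table, "id": pk, "changed": changed})
    return diff


before = {"issues": [{"id": 1, "title": "Fix login bug", "state": "todo"},
                     {"id": 2, "title": "Old task", "state": "todo"}]}
after = {"issues": [{"id": 1, "title": "Fix login bug", "state": "done"},
                    {"id": 3, "title": "Write docs", "state": "todo"}]}

diff = compute_diff(before, after)
```

In this example the diff holds one insert (issue 3), one update (issue 1 moved to `done`), and one delete (issue 2).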

Evaluations

Evaluations let you verify that your AI agent performed the expected actions. You can create your own test suites that compare the expected state changes against the diff results, or use our example suites for Linear and Slack.

Tests & Assertions

Define test suites with expected outcomes using our assertion DSL. Each test specifies a prompt, environment template, and assertions that verify the agent made the correct database changes (inserts, updates, deletes).
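
To make the idea of asserting on database changes concrete, here is a minimal sketch that checks a diff against a list of expected changes. It does not use the Agent Diff assertion DSL: the assertion dictionaries, the diff shape, and the `count_matches` and `evaluate` helpers are hypothetical and exist only to show the principle.

```python
# Example diff in an illustrative shape (not the product's actual output format).
diff = {
    "inserts": [{"table": "comments", "row": {"id": 7, "issue_id": 1, "body": "Done"}}],
    "updates": [{"table": "issues", "id": 1, "changed": {"state": "done"}}],
    "deletes": [],
}

# Hypothetical assertions: a change kind, a table, expected field values,
# and an optional expected number of matching changes (defaults to 1).
assertions = [
    {"kind": "updates", "table": "issues", "fields": {"state": "done"}},
    {"kind": "inserts", "table": "comments", "fields": {"issue_id": 1}},
    {"kind": "deletes", "table": "issues", "count": 0},
]


def count_matches(diff, assertion):
    """Count changes of the asserted kind and table whose values match."""
    wanted = assertion.get("fields", {})
    total = 0
    for change in diff[assertion["kind"]]:
        if change["table"] != assertion["table"]:
            continue
        # Inserts and deletes carry a full row; updates carry only changed columns.
        values = change.get("row") or change.get("changed", {})
        if all(values.get(col) == val for col, val in wanted.items()):
            total += 1
    return total


def evaluate(diff, assertions):
    """Return one boolean per assertion, in order."""
    return [count_matches(diff, a) == a.get("count", 1) for a in assertions]


results = evaluate(diff, assertions)
passed = all(results)
```

A test passes only when every assertion holds, so an agent that made the right update but also deleted an issue would fail the third assertion here.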

Supported APIs

Slack

Web API coverage for conversations, chat, reactions, users, and more

Linear

GraphQL API for issues, teams, projects, comments, and workflow states

More APIs coming soon. Request an integration →

Next Steps

Quickstart

Get up and running in 5 minutes

Example Benchmarks

See built-in evaluation suites

Python SDK

Full SDK reference for Python

TypeScript SDK

Full SDK reference for TypeScript
