
What Are Evaluations?

Evaluations let you verify that your AI agent performed the expected actions. Instead of checking whether the agent “said the right thing”, you check whether it did the right thing by examining how the database state changed.

Mock Evaluation

“Did the agent say it posted a message?” → Maybe ✓

Agent Diff Evaluation

“Is there a new row in the messages table?” → Definite ✓ or ✗
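
Under the hood, that check is expressed as an assertion on the database diff. A minimal sketch of the assertion behind the example above, in the format used later on this page (the messages entity and text field are assumed names for the Slack template):

# Hypothetical expected output: "a new row was added to the messages table"
expected_output = {
    "assertions": [{
        "diff_type": "added",                      # a row must have been inserted
        "entity": "messages",                      # table to inspect (assumed name)
        "where": {"text": {"contains": "hello"}}   # assumed field and matcher
    }]
}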

Built-in Evaluations

Slack Bench

suites = client.list_test_suites(name="Slack Bench")
suite = client.get_test_suite(suites.testSuites[0].id, expand=True)
print(f"Slack Bench: {len(suite.tests)} tests")
Coverage:
  • Message sending (5 tests)
  • Channel operations (4 tests)
  • Reactions (3 tests)
  • Threading (4 tests)
  • User mentions (4 tests)

Linear Bench

suites = client.list_test_suites(name="Linear Bench")
suite = client.get_test_suite(suites.testSuites[0].id, expand=True)
print(f"Linear Bench: {len(suite.tests)} tests")
Coverage:
  • Issue CRUD (12 tests)
  • Labels (6 tests)
  • Comments (5 tests)
  • Workflow states (8 tests)
  • Team operations (5 tests)
  • Projects (4 tests)
View the evaluation suite files on GitHub: Slack Bench | Linear Bench
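
For example, you can enumerate the tests in either bench to see what each one asks the agent to do. This sketch only uses fields that appear elsewhere on this page (test.id and test.prompt):

from agent_diff import AgentDiff

client = AgentDiff()

# List each built-in bench and print what its tests ask the agent to do
for suite_name in ["Slack Bench", "Linear Bench"]:
    suites = client.list_test_suites(name=suite_name)
    suite = client.get_test_suite(suites.testSuites[0].id, expand=True)
    print(f"{suite_name}: {len(suite.tests)} tests")
    for test in suite.tests:
        print(f"  {test.id}: {test.prompt}")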

Quick Example

from agent_diff import AgentDiff, PythonExecutorProxy, create_openai_tool
from agents import Agent, Runner

client = AgentDiff()

# Get a test from the built-in Slack Bench suite
suites = client.list_test_suites(name="Slack Bench")
suite = client.get_test_suite(suites.testSuites[0].id, expand=True)
test = suite.tests[0]

# Create environment and start run
env = client.init_env(testId=test.id)
run = client.start_run(envId=env.environmentId, testId=test.id)

# Run your agent
executor = PythonExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)
agent = Agent(
    name="Test Agent",
    tools=[create_openai_tool(executor)],
    instructions="Use execute_python to interact with Slack at https://slack.com/api/*"
)
await Runner.run(agent, test.prompt)  # async call: run inside an async function or event loop (see sketch below)

# Evaluate against assertions
result = client.evaluate_run(runId=run.runId)

print(f"Result: {'✓ PASSED' if result.passed else '✗ FAILED'}")
print(f"Score: {result.score}")
if result.failures:
    print(f"Failures: {result.failures}")

# Cleanup
client.delete_env(envId=env.environmentId)
Try the built-in benchmarks: Slack Bench | Linear Bench | HuggingFace Dataset
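
Since Runner.run is a coroutine, the snippet above has to execute inside an async entrypoint. A minimal sketch that continues from the imports and client/test above, wraps the same steps in a function, and guarantees cleanup even if the agent raises (the run_test name is illustrative, not part of the SDK):

import asyncio

async def run_test(client, test):
    env = client.init_env(testId=test.id)
    try:
        run = client.start_run(envId=env.environmentId, testId=test.id)
        executor = PythonExecutorProxy(env.environmentId, base_url=client.base_url, api_key=client.api_key)
        agent = Agent(
            name="Test Agent",
            tools=[create_openai_tool(executor)],
            instructions="Use execute_python to interact with Slack at https://slack.com/api/*"
        )
        await Runner.run(agent, test.prompt)
        return client.evaluate_run(runId=run.runId)
    finally:
        # Delete the environment even if the agent run or evaluation fails
        client.delete_env(envId=env.environmentId)

result = asyncio.run(run_test(client, test))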

Evaluation Without Test Suite

You can run evaluations without a pre-defined test suite by passing the expected output explicitly:
from agent_diff import AgentDiff, PythonExecutorProxy, create_langchain_tool
from langgraph.prebuilt import create_react_agent as create_agent
from langchain_openai import ChatOpenAI

client = AgentDiff()

# Create environment directly with template
env = client.init_env(
    templateService="linear",
    templateName="linear_expanded",
    impersonateUserId="2790a7ee-fde0-4537-9588-e233aa5a68d1"
)

# Start run (no testId)
run = client.start_run(envId=env.environmentId)

# Run your agent...
model = ChatOpenAI(model="gpt-4o")
agent = create_agent(
    model=model,
    tools=[create_langchain_tool(PythonExecutorProxy(env.environmentId))],
    prompt="Use execute_python to interact with Linear GraphQL API"  # create_react_agent takes the system prompt via `prompt`
)

prompt = "Create a new issue titled 'Fix login bug' in the Engineering team"
agent.invoke({"messages": [{"role": "user", "content": prompt}]})

# Evaluate with explicit expected output
result = client.evaluate_run(
    runId=run.runId,
    expectedOutput={
        "assertions": [{
            "diff_type": "added",
            "entity": "issues",
            "where": {
                "title": {"contains": "Fix login bug"}
            }
        }],
        "ignore_fields": {
            "global": ["id", "created_at", "updated_at", "number", "sort_order"]
        }
    }
)

print(f"{'PASSED' if result.passed else 'FAILED'}")
print(f"Score: {result.score}")

# Cleanup
client.delete_env(envId=env.environmentId)
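
When you want to evaluate several prompts this way, the same steps can be wrapped in a small helper that creates a fresh environment per prompt and always cleans up. A sketch using the same imports and calls as the example above (the evaluate_prompt name and structure are illustrative, not part of the SDK):

def evaluate_prompt(client, prompt, expected_output):
    # Fresh environment per prompt, explicit assertions, guaranteed cleanup
    env = client.init_env(
        templateService="linear",
        templateName="linear_expanded",
        impersonateUserId="2790a7ee-fde0-4537-9588-e233aa5a68d1"
    )
    try:
        run = client.start_run(envId=env.environmentId)
        agent = create_agent(
            model=ChatOpenAI(model="gpt-4o"),
            tools=[create_langchain_tool(PythonExecutorProxy(env.environmentId))],
            prompt="Use execute_python to interact with Linear GraphQL API"
        )
        agent.invoke({"messages": [{"role": "user", "content": prompt}]})
        return client.evaluate_run(runId=run.runId, expectedOutput=expected_output)
    finally:
        client.delete_env(envId=env.environmentId)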

Next Steps