What Are Evaluations?
Evaluations let you verify that your AI agent performed the expected actions. Instead of checking whether the agent “said the right thing”, you check whether it did the right thing by examining database state changes.
Mock Evaluation
“Did the agent say it posted a message?” → Maybe ✓
Agent Diff Evaluation
“Is there a new row in the messages table?” → Definite ✓ or ✗
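To make the contrast concrete, here is a minimal sketch of a state-based check. It does not use the Agent Diff SDK. It uses Python's built-in sqlite3 module, and the helper names (snapshot, diff_state, fake_agent_post) and the messages table layout are assumptions made for illustration only. The idea is to snapshot the table before and after the agent acts, then assert on the difference rather than on the agent's reply.

```python
import sqlite3


def snapshot(conn, table):
    """Return the set of rows currently in `table`."""
    # The table name comes from trusted test code, never from user input.
    return set(conn.execute(f"SELECT * FROM {table}").fetchall())


def diff_state(before, after):
    """Describe which rows appeared or disappeared between two snapshots."""
    return {"inserted": after - before, "deleted": before - after}


def fake_agent_post(conn, channel, text):
    """Hypothetical stand-in for an agent action that writes to the database."""
    conn.execute(
        "INSERT INTO messages (channel, text) VALUES (?, ?)", (channel, text)
    )
    conn.commit()
    return "I posted the message."  # what the agent *says* it did


# Illustrative schema; a real environment would define its own tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, channel TEXT, text TEXT)"
)
conn.execute("INSERT INTO messages (channel, text) VALUES ('general', 'existing')")
conn.commit()

before = snapshot(conn, "messages")
reply = fake_agent_post(conn, "general", "hello team")
after = snapshot(conn, "messages")

# Evaluate the state change, not the reply text.
changes = diff_state(before, after)
new_rows = [(channel, text) for (_id, channel, text) in changes["inserted"]]
passed = new_rows == [("general", "hello team")] and not changes["deleted"]
print("PASS" if passed else "FAIL")
```

The reply string is never inspected. The check passes only if exactly one matching row was inserted and nothing was deleted, which is what makes the result a definite pass or fail.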
Built-in Evaluations
Slack Bench
- Message sending (5 tests)
- Channel operations (4 tests)
- Reactions (3 tests)
- Threading (4 tests)
- User mentions (4 tests)
Linear Bench
- Issue CRUD (12 tests)
- Labels (6 tests)
- Comments (5 tests)
- Workflow states (8 tests)
- Team operations (5 tests)
- Projects (4 tests)
Quick Example
Evaluation Without Test Suite
You can run evaluations without a pre-defined test suite by passing the expected output explicitly.
Next Steps
Assertions
Define expected outcomes with the DSL
Example Benchmarks
See built-in Slack and Linear test suites
