What Are Evaluations?
Evaluations let you verify that your AI agent performed the expected actions. Instead of checking if the agent “said the right thing”, you check if it did the right thing by examining database state changes.
Mock Evaluation
“Did the agent say it posted a message?” → Maybe ✓
Agent Diff Evaluation
“Is there a new row in the messages table?” → Definite ✓ or ✗
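To make the contrast concrete, here is a minimal sketch of a state-diff check. It is not Agent Diff's actual API. It uses Python's standard `sqlite3` module with an in-memory database, and the `messages` schema and the helpers `snapshot`, `diff_rows`, and `fake_agent_post_message` are all hypothetical names invented for illustration. The idea is to snapshot the table before the agent runs, snapshot it again afterwards, and assert on the difference rather than on what the agent reported.

```python
import sqlite3


def snapshot(conn, table):
    """Return the set of rows currently in `table` (hypothetical helper)."""
    # The table name is interpolated directly; fine for a hardcoded name in a sketch.
    return set(conn.execute(f"SELECT * FROM {table}").fetchall())


def diff_rows(before, after):
    """Compare two snapshots and report inserted and deleted rows."""
    return {"inserted": after - before, "deleted": before - after}


def fake_agent_post_message(conn, channel, text):
    """Stand-in for an agent action; a real agent would call a service API."""
    conn.execute(
        "INSERT INTO messages (channel, text) VALUES (?, ?)", (channel, text)
    )
    conn.commit()


# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, channel TEXT, text TEXT)"
)
conn.execute(
    "INSERT INTO messages (channel, text) VALUES ('general', 'existing message')"
)
conn.commit()

before = snapshot(conn, "messages")
fake_agent_post_message(conn, "general", "hello team")
after = snapshot(conn, "messages")

changes = diff_rows(before, after)
inserted = sorted(changes["inserted"])

# The evaluation passes only if exactly one matching row was added
# and nothing was deleted.
passed = (
    len(inserted) == 1
    and inserted[0][1:] == ("general", "hello team")
    and not changes["deleted"]
)
print("PASS" if passed else "FAIL")
```

Because the check reads the table itself, an agent that only claims to have posted a message would produce an empty `inserted` set and fail.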
Built-in Evaluations
Slack Bench
- Message sending (5 tests)
- Channel operations (4 tests)
- Reactions (3 tests)
- Threading (4 tests)
- User mentions (4 tests)
Linear Bench
- Issue CRUD (12 tests)
- Labels (6 tests)
- Comments (5 tests)
- Workflow states (8 tests)
- Team operations (5 tests)
- Projects (4 tests)
