Overview

Agent Diff integrates with Prime Intellect’s verifiers framework for multi-turn agent evaluation. This lets you create reproducible benchmarks that evaluate agents on real API interactions.
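
A benchmark is packaged as a verifiers environment module that exposes a load_environment() entry point. The sketch below shows the rough shape of such a module under that assumption; the dataset row, the search_issues tool, and the exact_match reward are hypothetical placeholders, not the actual contents of the Linear benchmark.

import verifiers as vf
from datasets import Dataset

def search_issues(query: str) -> str:
    """Hypothetical tool: query an issue tracker and return matching issues."""
    return "..."  # a real environment would call the live API here

def exact_match(completion, answer, **kwargs) -> float:
    """Hypothetical reward: 1.0 if the expected answer appears in the agent's output."""
    return 1.0 if answer in str(completion) else 0.0

def load_environment(**kwargs) -> vf.Environment:
    # Toy single-row dataset; a real benchmark ships a full task set
    dataset = Dataset.from_list([
        {"question": "Find the open bug about login timeouts.", "answer": "BUG-42"},
    ])
    return vf.ToolEnv(
        dataset=dataset,
        tools=[search_issues],
        rubric=vf.Rubric(funcs=[exact_match]),
        max_turns=10,
    )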

Quick Start

Install our Linear API benchmark from the Prime Intellect hub:
prime env install hubert-marek/linear-api-bench
Run evaluations with any model:
AGENTDIFF_API_KEY="your_key" vf-eval hubert-marek/linear-api-bench -m gpt-5-mini
Results are saved to outputs/ and viewable with:
vf-tui outputs/evals/linear-api-bench--gpt-5-mini/latest
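
The same evaluation can also be driven from Python instead of the CLI. This is a minimal sketch assuming the standard verifiers Python API (vf.load_environment and Environment.evaluate); exact signatures and result field names may differ across verifiers versions.

import os
from openai import OpenAI
import verifiers as vf

os.environ["AGENTDIFF_API_KEY"] = "your_key"  # credentials the environment expects

# Load the installed environment and run a small evaluation
env = vf.load_environment("linear-api-bench")
results = env.evaluate(
    client=OpenAI(),
    model="gpt-5-mini",
    num_examples=5,  # evaluate a subset while iterating
    rollouts_per_example=1,
)
rewards = results.reward  # one score per rollout (assumed field name)
print(f"mean reward: {sum(rewards) / len(rewards):.3f}")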

Example: Linear API Benchmark

See our reference implementation, hubert-marek/linear-api-bench, on the Prime Intellect hub.