Overview
Agent Diff integrates with Prime Intellect’s verifiers framework for multi-turn agent evaluation. This lets you create reproducible benchmarks that evaluate agents on real API interactions.
Quick Start
Install our Linear API benchmark from the Prime Intellect hub:
Results are saved to outputs/ and viewable with:
Example: Linear API Benchmark
See our reference implementation:
- Environment: hubert-marek/linear-api-bench
- Dataset: hubertmarek/linear-bench
- Source: GitHub
