Quick Start
- CLI
- SDK
Build Benchmarks
Explore Evaluators
Create Tasks
Save to HuggingFace
Leaderboards
After running, visit your dataset leaderboard and publish a scorecard.
Best Practices
- Clear, measurable prompts (binary or graded)
- Isolated task state and deterministic setup
- Use metadata tags (category, difficulty)
- Validate locally, then parallelize
- Version datasets; include a system_prompt.txt
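The tagging and local-validation practices above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Task` dataclass, `REQUIRED_TAGS`, and `validate_task` are hypothetical names made up for this example and are not part of the SDK. The idea is to check each task for a non-empty prompt and the expected metadata tags before any parallel run.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record for illustration; not the SDK's task type."""
    prompt: str
    metadata: dict = field(default_factory=dict)


# Assumed tag names, taken from the "category, difficulty" bullet above.
REQUIRED_TAGS = ("category", "difficulty")


def validate_task(task: Task) -> list[str]:
    """Return a list of problems; an empty list means the task passed."""
    problems = []
    if not task.prompt.strip():
        problems.append("empty prompt")
    for tag in REQUIRED_TAGS:
        if tag not in task.metadata:
            problems.append(f"missing metadata tag: {tag}")
    return problems


tasks = [
    Task("Open the settings page and enable dark mode.",
         {"category": "navigation", "difficulty": "easy"}),
    Task("Export the report as CSV.", {"category": "export"}),
]

# Validate every task locally and report failures before scaling out.
for task in tasks:
    issues = validate_task(task)
    status = "ok" if not issues else "; ".join(issues)
    print(f"{task.prompt!r}: {status}")
```

Running a check like this on a single machine first keeps malformed tasks from consuming a parallel run.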