HUD automatically tracks all benchmark runs and provides leaderboards for comparing agent performance across teams and models.

Running Benchmarks

Execute evaluations on standard datasets:
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

# Run a benchmark
results = await run_dataset(
    name="Claude-3.5 SheetBench",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent,
    agent_config={"model": "claude-3-5-sonnet-20241022"},
    max_concurrent=5
)
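
run_dataset is a coroutine, so the call above must be awaited. Outside an async context (for example, in a standalone script), you can drive it with Python's standard asyncio.run; the sketch below is illustrative and simply wraps the same call shown above:

import asyncio

from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

async def main():
    # Same benchmark call as above, wrapped so it can be driven by asyncio.run
    return await run_dataset(
        name="Claude-3.5 SheetBench",
        dataset="hud-evals/sheetbench-50",
        agent_class=ClaudeAgent,
        agent_config={"model": "claude-3-5-sonnet-20241022"},
        max_concurrent=5,
    )

results = asyncio.run(main())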

Viewing Results

On the Leaderboard Page

After running, view your results:
  1. Navigate to app.hud.so/leaderboards/{dataset-name}
    • Example: app.hud.so/leaderboards/hud-evals/sheetbench-50
  2. Click “My Jobs” to see your evaluation runs
  3. Select runs to analyze performance

Local Analysis

Calculate metrics locally:
# Basic metrics
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
avg_reward = sum(r.reward for r in results) / len(results)
avg_duration = sum(r.duration for r in results) / len(results)

print(f"Success Rate: {success_rate:.2%}")
print(f"Average Reward: {avg_reward:.3f}")
print(f"Average Duration: {avg_duration:.1f}s")

Creating Scorecards

Transform your evaluation run into a public scorecard:
  1. Run Evaluation: Execute your agent on the full dataset
  2. Navigate to Leaderboard: Go to app.hud.so/leaderboards/{dataset-name}
  3. Select Your Job: Click “My Jobs” and choose your evaluation run
  4. Create Scorecard: Select your best runs (we compute statistics over multiple jobs) and create a scorecard
  5. Publish: Your scorecard appears on the public leaderboard

(Example leaderboard showing different models and their scores.)

Available Benchmarks

Best Practices

  1. Agent Naming: Give your agent class a meaningful name, as it will be displayed on the leaderboard
  2. Multiple Runs: Select your best few runs when creating a scorecard (we compute statistics over them)
  3. Consistent Configuration: Use the same model version and parameters across runs for a fair comparison (see the sketch below)
  4. Share Results: Make your scorecards public to contribute to the community
Leaderboards are automatically created when the first scorecard is published for a dataset
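
For the consistent-configuration practice, one approach is to define the agent configuration once and reuse it for every run that will feed the same scorecard. A minimal sketch reusing the run_dataset call shown earlier; the SHARED_CONFIG name and run names are just illustrations:

from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Pin the exact model version and parameters once...
SHARED_CONFIG = {"model": "claude-3-5-sonnet-20241022"}

# ...and reuse them for every run that will go into the same scorecard
results_run_1 = await run_dataset(
    name="Claude-3.5 SheetBench (run 1)",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent,
    agent_config=SHARED_CONFIG,
    max_concurrent=5,
)

results_run_2 = await run_dataset(
    name="Claude-3.5 SheetBench (run 2)",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent,
    agent_config=SHARED_CONFIG,
    max_concurrent=5,
)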

Next Steps