Create benchmarks to evaluate agent capabilities systematically. HUD makes it easy to build evaluation datasets using existing environments or by creating entirely custom setups.

Using Existing Environments

The fastest way to create benchmarks is to leverage existing HUD environments and their evaluators.

1. Explore Available Evaluators

Use hud analyze to discover what’s available:
$ hud analyze hudpython/hud-remote-browser:latest

📋 Evaluators:
├── url_match(pattern: str) - Check if URL matches pattern
├── page_contains(text: str, regex: bool = False) - Check page content  
├── element_exists(selector: str) - Check if CSS selector exists
├── cookie_exists(name: str) - Check if cookie exists
├── sheet_contains(cell: str, value: str) - Check spreadsheet cell
├── sheets_cell_values(args: dict) - Verify multiple cell values
└── ... more evaluators

📝 Setup Functions:
├── navigate(url: str) - Navigate to URL
├── cookies(action: str, ...) - Manage cookies
├── sheets(action: str, ...) - Spreadsheet operations
└── ... more setup functions

2. Create Tasks with Known Evaluators

Build tasks using the discovered evaluators:
import uuid
from hud import Task

# Example: Web navigation benchmark
web_tasks = []

# Task 1: Simple navigation
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Navigate to the documentation page",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-remote-browser:latest"
            }
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "arguments": {"url": "https://example.com"}
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "url_match",
            "arguments": {
                "pattern": ".*/docs.*"
            }
        }
    },
    "metadata": {"difficulty": "easy", "category": "navigation"}
})

# Task 2: Form interaction
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Submit the contact form with the name 'Test User' and email 'test@example.com'",
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "arguments": {"url": "https://example.com/contact"}
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "page_contains",
            "arguments": {
                "text": "Thank you for your submission"
            }
        }
    },
    "metadata": {"difficulty": "medium", "category": "forms"}
})
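
Task 2 above omits mcp_config; every task needs one at run time, so attach the shared config from Task 1 before running. You can also validate each dict with the Task model imported earlier (a minimal sketch, assuming Task is a pydantic-style model whose fields match the keys used above):
# Attach the shared mcp_config to any task that does not define one
shared_mcp_config = web_tasks[0]["mcp_config"]
for task in web_tasks:
    task.setdefault("mcp_config", shared_mcp_config)

# Optional sanity check: validate the dicts against the Task model
validated = [Task(**task) for task in web_tasks]
print(f"Validated {len(validated)} tasks")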

3. Test Your Benchmark

Test your benchmark with run_dataset:
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

# Test your tasks
results = await run_dataset(
    "Web Navigation Test",
    web_tasks,  # Your list of task dicts
    agent_class=ClaudeAgent,
    max_concurrent=2
)

# Check results
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
print(f"Success rate: {success_rate:.2%}")

Publishing Benchmarks

All benchmarks are stored as HuggingFace datasets.

Save Your Dataset

from hud.datasets import save_tasks

# Save to HuggingFace (requires HF token)
save_tasks(
    web_tasks,  # List of task dictionaries
    repo_id="my-org/web-navigation-benchmark",
    private=False,  # Make it public
    tags=["web", "navigation", "automation"]
)
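
Once published, the benchmark can be reloaded from HuggingFace and run directly; a sketch, assuming run_dataset accepts a list of task dicts as shown earlier:
from datasets import load_dataset
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Reload the published benchmark
dataset = load_dataset("my-org/web-navigation-benchmark", split="train")

results = await run_dataset(
    "Web Navigation Benchmark",
    list(dataset),  # convert rows back to task dicts
    agent_class=ClaudeAgent,
    max_concurrent=2,
)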

Manual Upload

You can also create datasets manually:
  1. Save tasks as JSON:
import json

with open("benchmark.json", "w") as f:
    json.dump(web_tasks, f, indent=2)
  2. Upload to HuggingFace:
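One way to do this is with the huggingface_hub client (a sketch; save_tasks above handles these steps for you):
from huggingface_hub import HfApi

api = HfApi()  # picks up your token from `huggingface-cli login` or HF_TOKEN

# Create the dataset repo if it does not exist yet
api.create_repo("my-org/web-navigation-benchmark", repo_type="dataset", exist_ok=True)

# Upload the JSON file written above
api.upload_file(
    path_or_fileobj="benchmark.json",
    path_in_repo="benchmark.json",
    repo_id="my-org/web-navigation-benchmark",
    repo_type="dataset",
)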

Leaderboard Creation

Leaderboards are automatically created when the first scorecard is published for your dataset. Simply run an evaluation and create a scorecard at app.hud.so/leaderboards/{your-dataset-id}.

Extending Existing Environments

Contribute New Evaluators

For deeper customization, you can write new evaluators for an existing environment and contribute them upstream:
# Example: Contributing a new evaluator
# In environments/remote_browser/src/hud_controller/evaluate/form_complete.py

from hud.tools.types import EvaluationResult
from . import evaluate

@evaluate.tool("form_complete")
async def form_complete(ctx, required_fields: list[str]):
    """Check if all required form fields are filled."""
    # Access environment from the hub
    env = evaluate.env  # BrowserEnvironmentContext or similar
    
    # Use environment methods
    page_content = await env.get_page_content()
    
    # Check each field (simplified example)
    filled_count = 0
    for field in required_fields:
        if f'value="{field}"' in page_content or f'>{field}<' in page_content:
            filled_count += 1
    
    success = filled_count == len(required_fields)
    return EvaluationResult(
        reward=1.0 if success else filled_count / len(required_fields),
        done=success,
        info={"filled": filled_count, "required": len(required_fields)}
    )
Submit evaluators via pull requests to expand environment capabilities.
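
Once the evaluator ships in the environment image, tasks can reference it like any built-in one. A sketch using the hypothetical form_complete evaluator defined above (mcp_config and setup_tool omitted for brevity):
form_task = {
    "prompt": "Fill in the signup form with the name 'Test User' and email 'test@example.com'",
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "form_complete",
            # values that must appear in the submitted form
            "arguments": {"required_fields": ["Test User", "test@example.com"]}
        }
    }
}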

Creating Custom Environments

For complete control over evaluation logic, build your own environment.

The Build Environments guide provides a complete walkthrough of creating Docker-based MCP environments with custom evaluators.

Example: SheetBench-50

Here’s an actual task from SheetBench-50:
# Example task from the dataset
sheetbench_task = {
    "id": "6e4744c7-b2c9-4bb6-807e-2cc144a4e8c2",
    "prompt": "Calculate from the RawData tab the z-scores from the mean close price for each row. Return, starting in ANSWER!A1 and descending to ANSWER!A5, the 5 dates with the greatest absolute value of standard deviations from the mean",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Run-Id": "${RUN_ID}",
                "Mcp-Image": "hudpython/hud-remote-browser:v1.1"
            }
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "sheets_from_xlsx",
            "arguments": {
                "file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/setup_input_2.xlsx?"
            }
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "sheets_cell_values",
            "arguments": {
                "args": {
                    "A1": "1/12/2024",
                    "A2": "1/10/2024",
                    "A3": "1/15/2024",
                    "A4": "1/11/2024",
                    "A5": "1/17/2024"
                }
            }
        }
    },
    "metadata": {
        "partial": True,
        "gold_file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/gold_solution_2.xlsx?"
    }
}

# The dataset includes 50 such tasks covering:
# - Formula creation (VLOOKUP, SUMIF, etc.)
# - Data analysis (z-scores, correlations)
# - Data manipulation (sorting, filtering)
# - Chart creation
# - Pivot tables
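
Because many SheetBench tasks allow partial credit (the partial flag in metadata), a natural aggregate is the mean reward rather than a binary success rate; a sketch, reusing the results list returned by run_dataset:
# Mean reward across tasks; partial-credit tasks contribute fractional scores
mean_reward = sum(r.reward for r in results) / len(results)
print(f"SheetBench-50 mean reward: {mean_reward:.3f}")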

Best Practices

  1. Clear Instructions: Write unambiguous prompts
  2. Isolated State: Each task should be independent
  3. Measurable Success: Use binary or graduated scoring
  4. Diverse Coverage: Test different aspects of capability
  5. Difficulty Progression: Include easy to hard tasks
  6. Metadata Tags: Help users filter relevant tasks (see the sketch below)
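
For example, metadata makes it easy to run only a slice of a benchmark (a sketch over the task dicts defined earlier):
# Run only the easy navigation tasks
easy_nav_tasks = [
    t for t in web_tasks
    if t["metadata"]["difficulty"] == "easy" and t["metadata"]["category"] == "navigation"
]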

Next Steps