Create benchmarks to evaluate agent capabilities systematically. HUD makes it easy to build evaluation datasets using existing environments or by creating entirely custom setups.

Using Existing Environments

The fastest way to create benchmarks is to leverage existing HUD environments and their evaluators.

1. Explore Available Evaluators

Use hud analyze to discover what’s available:
$ hud analyze hudpython/hud-remote-browser:latest

📋 Evaluators:
├── url_match(pattern: str) - Check if URL matches pattern
├── page_contains(text: str, regex: bool = False) - Check page content  
├── element_exists(selector: str) - Check if CSS selector exists
├── cookie_exists(name: str) - Check if cookie exists
├── sheet_contains(cell: str, value: str) - Check spreadsheet cell
├── sheets_cell_values(args: dict) - Verify multiple cell values
└── ... more evaluators

📝 Setup Functions:
├── navigate(url: str) - Navigate to URL
├── cookies(action: str, ...) - Manage cookies
├── sheets(action: str, ...) - Spreadsheet operations
└── ... more setup functions

2. Create Tasks with Known Evaluators

Build tasks using the discovered evaluators:
import uuid
from hud import Task

# Example: Web navigation benchmark
web_tasks = []

# Task 1: Simple navigation
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Navigate to the documentation page",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-remote-browser:latest"
            }
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "arguments": {"url": "https://example.com"}
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "url_match",
            "arguments": {
                "pattern": ".*/docs.*"
            }
        }
    },
    "metadata": {"difficulty": "easy", "category": "navigation"}
})

# Task 2: Form interaction
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Submit the contact form with the name 'Test User' and email 'test@example.com'",
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "arguments": {"url": "https://example.com/contact"}
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "page_contains",
            "arguments": {
                "text": "Thank you for your submission"
            }
        }
    },
    "metadata": {"difficulty": "medium", "category": "forms"}
})
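
Task 2 above omits mcp_config; every task needs one at run time, so attach the shared config from Task 1 before running. You can also validate each dict with the Task model imported earlier (a minimal sketch, assuming Task is a pydantic-style model whose fields match the keys used above):
# Attach the shared mcp_config to any task that does not define one
shared_mcp_config = web_tasks[0]["mcp_config"]
for task in web_tasks:
    task.setdefault("mcp_config", shared_mcp_config)

# Optional sanity check: validate the dicts against the Task model
validated = [Task(**task) for task in web_tasks]
print(f"Validated {len(validated)} tasks")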

3. Test Your Benchmark

Test your benchmark with run_dataset:
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

# Test your tasks
results = await run_dataset(
    "Web Navigation Test",
    web_tasks,  # Your list of task dicts
    agent_class=ClaudeAgent,
    max_concurrent=2
)

# Check results
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
print(f"Success rate: {success_rate:.2%}")

Publishing Benchmarks

All benchmarks are stored as HuggingFace datasets.

Save Your Dataset

from hud.datasets import save_tasks

# Save to HuggingFace (requires HF token)
save_tasks(
    web_tasks,  # List of task dictionaries
    repo_id="my-org/web-navigation-benchmark",
    private=False,  # Make it public
    tags=["web", "navigation", "automation"]
)
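
Once published, the benchmark can be reloaded from HuggingFace and run directly; a sketch, assuming run_dataset accepts a list of task dicts as shown earlier:
from datasets import load_dataset
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Reload the published benchmark
dataset = load_dataset("my-org/web-navigation-benchmark", split="train")

results = await run_dataset(
    "Web Navigation Benchmark",
    list(dataset),  # convert rows back to task dicts
    agent_class=ClaudeAgent,
    max_concurrent=2,
)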

Manual Upload

You can also create datasets manually:
  1. Save tasks as JSON:
import json

with open("benchmark.json", "w") as f:
    json.dump(web_tasks, f, indent=2)
  2. Upload to HuggingFace:
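One way to do this is with the huggingface_hub client (a sketch; save_tasks above handles these steps for you):
from huggingface_hub import HfApi

api = HfApi()  # picks up your token from `huggingface-cli login` or HF_TOKEN

# Create the dataset repo if it does not exist yet
api.create_repo("my-org/web-navigation-benchmark", repo_type="dataset", exist_ok=True)

# Upload the JSON file written above
api.upload_file(
    path_or_fileobj="benchmark.json",
    path_in_repo="benchmark.json",
    repo_id="my-org/web-navigation-benchmark",
    repo_type="dataset",
)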

Leaderboard Creation

Leaderboards are automatically created when the first scorecard is published for your dataset. Simply run an evaluation and create a scorecard at app.hud.so/leaderboards/{your-dataset-id}.

Extending Existing Environments

Contribute New Evaluators

For deeper customization, you can write new evaluators for an existing environment and contribute them upstream:
# Example: Contributing a new evaluator
# In environments/remote_browser/src/hud_controller/evaluate/form_complete.py

from hud.tools.types import EvaluationResult
from . import evaluate

@evaluate.tool("form_complete")
async def form_complete(ctx, required_fields: list[str]):
    """Check if all required form fields are filled."""
    # Access environment from the hub
    env = evaluate.env  # BrowserEnvironmentContext or similar
    
    # Use environment methods
    page_content = await env.get_page_content()
    
    # Check each field (simplified example)
    filled_count = 0
    for field in required_fields:
        if f'value="{field}"' in page_content or f'>{field}<' in page_content:
            filled_count += 1
    
    success = filled_count == len(required_fields)
    return EvaluationResult(
        reward=1.0 if success else filled_count / len(required_fields),
        done=success,
        info={"filled": filled_count, "required": len(required_fields)}
    )
Submit evaluators via pull requests to expand environment capabilities.
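
Once the evaluator ships in the environment image, tasks can reference it like any built-in one. A sketch using the hypothetical form_complete evaluator defined above (mcp_config and setup_tool omitted for brevity):
form_task = {
    "prompt": "Fill in the signup form with the name 'Test User' and email 'test@example.com'",
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "form_complete",
            # values that must appear in the submitted form
            "arguments": {"required_fields": ["Test User", "test@example.com"]}
        }
    }
}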

Creating Custom Environments

For complete control over evaluation logic, build your own environment.

The Build Environments guide provides a complete walkthrough of creating Docker-based MCP environments with custom evaluators.

Example: SheetBench-50

Here’s an actual task from SheetBench-50:
# Example task from the dataset
sheetbench_task = {
    "id": "6e4744c7-b2c9-4bb6-807e-2cc144a4e8c2",
    "prompt": "Calculate from the RawData tab the z-scores from the mean close price for each row. Return, starting in ANSWER!A1 and descending to ANSWER!A5, the 5 dates with the greatest absolute value of standard deviations from the mean",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Run-Id": "${RUN_ID}",
                "Mcp-Image": "hudpython/hud-remote-browser:v1.1"
            }
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "sheets_from_xlsx",
            "arguments": {
                "file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/setup_input_2.xlsx?"
            }
        }
    },
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {
            "name": "sheets_cell_values",
            "arguments": {
                "args": {
                    "A1": "1/12/2024",
                    "A2": "1/10/2024",
                    "A3": "1/15/2024",
                    "A4": "1/11/2024",
                    "A5": "1/17/2024"
                }
            }
        }
    },
    "metadata": {
        "partial": True,
        "gold_file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/gold_solution_2.xlsx?"
    }
}

# The dataset includes 50 such tasks covering:
# - Formula creation (VLOOKUP, SUMIF, etc.)
# - Data analysis (z-scores, correlations)
# - Data manipulation (sorting, filtering)
# - Chart creation
# - Pivot tables
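
Because many SheetBench tasks allow partial credit (the partial flag in metadata), a natural aggregate is the mean reward rather than a binary success rate; a sketch, reusing the results list returned by run_dataset:
# Mean reward across tasks; partial-credit tasks contribute fractional scores
mean_reward = sum(r.reward for r in results) / len(results)
print(f"SheetBench-50 mean reward: {mean_reward:.3f}")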

Best Practices

  1. Clear Instructions: Write unambiguous prompts
  2. Isolated State: Each task should be independent
  3. Measurable Success: Use binary or graduated scoring
  4. Diverse Coverage: Test different aspects of capability
  5. Difficulty Progression: Include easy to hard tasks
  6. Metadata Tags: Help users filter relevant tasks (see the sketch below)
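
For example, metadata makes it easy to run only a slice of a benchmark (a sketch over the task dicts defined earlier):
# Run only the easy navigation tasks
easy_nav_tasks = [
    t for t in web_tasks
    if t["metadata"]["difficulty"] == "easy" and t["metadata"]["category"] == "navigation"
]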

Next Steps