Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.

Lifecycle

A task moves through three phases: optional setup (setup_tool), the agent working on the prompt, and optional evaluation (evaluate_tool) that produces the reward.

Task Structure

from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    
    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)

Field Reference

prompt (string, required)
  The instruction given to the agent. Be specific and clear about success criteria.

mcp_config (object, required)
  Environment connection configuration. Supports environment variables with ${VAR} syntax.

id (string, UUID)
  Unique identifier. Required when creating HuggingFace datasets. Use str(uuid.uuid4()).

system_prompt (string)
  Custom system prompt for the agent. Overrides the agent's default system prompt.

setup_tool (dict | list[dict])
  Tool call(s) that initialize environment state; each has a name and arguments (see the example below).

evaluate_tool (dict | list[dict])
  Tool call(s) that score performance; each has a name and arguments.

metadata (object)
  Additional task information for analysis and filtering.
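
When a task needs more than one setup or evaluation step, pass a list of tool calls. A minimal sketch, reusing the browser config from above (the click action and its selector are illustrative, not guaranteed tool arguments):

from hud.datasets import Task

task = Task(
    prompt="Dismiss the cookie banner, then sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Setup calls run in order before the agent starts
    setup_tool=[
        {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        {"name": "playwright", "arguments": {"action": "click", "selector": "#accept-cookies"}}  # illustrative step
    ],
    evaluate_tool={
        "name": "evaluate",
        "arguments": {"name": "url_contains", "substring": "/dashboard"}
    }
)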

Environment Variables

# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
This enables:
  • Public datasets without exposing secrets
  • Environment-specific configurations
  • CI/CD pipelines with different credentials
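
In practice the variables are read from the process environment when the Task is constructed. A minimal loading sketch (the file name and placeholder values are illustrative):

import json
import os
from hud.datasets import Task

# Normally set in your shell or CI secrets; shown here only for illustration
os.environ.setdefault("HUD_API_KEY", "sk-hud-...")
os.environ.setdefault("BROWSER_IMAGE", "hudpython/hud-browser:latest")

with open("todo_task.json") as f:  # a file containing the JSON above
    task_dict = json.load(f)

task = Task(**task_dict)  # ${HUD_API_KEY} and ${BROWSER_IMAGE} are substituted here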

Running Tasks

# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}")  # 0.0 to 1.0
The agent will (see the end-to-end sketch below):
  1. Execute setup_tool if provided
  2. Work on the prompt using available tools
  3. Execute evaluate_tool to calculate reward
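
Putting the phases together, a minimal end-to-end sketch (the ClaudeAgent import path and default construction are assumptions; adjust to your agent class):

import asyncio
from hud.agents import ClaudeAgent  # assumed import path
from hud.datasets import Task

async def main() -> None:
    task = Task(
        prompt="Navigate to the login page and sign in as testuser@example.com",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": "Bearer ${HUD_API_KEY}",
                    "Mcp-Image": "hudpython/hud-browser:latest"
                }
            }
        },
        evaluate_tool={"name": "evaluate", "arguments": {"name": "url_contains", "substring": "/dashboard"}}
    )
    agent = ClaudeAgent()           # assumes default construction
    result = await agent.run(task)  # setup -> prompt -> evaluate
    print(f"Reward: {result.reward}")

asyncio.run(main())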

Working with Datasets

Tasks integrate with HuggingFace datasets:
from datasets import load_dataset
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent  # assumed import path for the agent class

# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/sheetbench-50", split="train")

# Run agent on entire dataset
results = await run_dataset(
    "SheetBench Run",
    dataset,  # Pass the dataset object or name (e.g. "hud-evals/sheetbench-50")
    agent_class=ClaudeAgent,
    max_concurrent=5
)
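
To summarize a run, assuming run_dataset returns one result per task with the same reward field as a single agent.run call:

rewards = [r.reward for r in results]
print(f"Tasks: {len(rewards)}, mean reward: {sum(rewards) / len(rewards):.2f}")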

Creating Datasets

from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
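
Since ids are required for HuggingFace datasets, one convenient pattern is to stamp each dictionary with a UUID just before saving. A sketch:

import uuid

# Add a unique id to each task dict that doesn't already have one
for task_dict in task_dicts:
    task_dict.setdefault("id", str(uuid.uuid4()))

save_tasks(task_dicts, "my-org/my-benchmark")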

Best Practices

  1. Use UUIDs: Always include id=str(uuid.uuid4()) for dataset tasks
  2. Clear Prompts: Be specific about success criteria
  3. Template Variables: Use ${VAR} syntax for shareable configs
  4. Rich Tools: Include both name and arguments in tool definitions

Next Steps